Ruby 1.9 UTF-8 mostly works
07. Jul 2008
I was curious to see how far UTF-8 support in Ruby 1.9 has come. First we assign a pure ASCII string and check its encoding.
>> a = "Restaurant"
=> "Restaurant"
>> a.encoding
=> #<Encoding:US-ASCII>
Okay, nothing spectacular so far. Next we take a string that contains a non-ASCII character.
>> b = "Café"
=> "Café"
>> b.encoding
=> #<Encoding:UTF-8>
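As far as I can tell, the encoding irb reports depends on the terminal's locale (and in a script on the magic comment at the top of the file), so the same check can be spelled out a bit more explicitly. A minimal sketch; the helper name encoding_report is my own invention, and the magic comment is an assumption about how the file is saved:

```ruby
# encoding: utf-8

# Collect what Ruby knows about a string's encoding.
# `encoding_report` is a made-up helper name, not part of Ruby.
def encoding_report(str)
  {
    :encoding   => str.encoding.name,    # name of the encoding tag on the string
    :ascii_only => str.ascii_only?,      # true if every byte is 7-bit ASCII
    :valid      => str.valid_encoding?   # true if the bytes are well-formed for that encoding
  }
end

p encoding_report("Restaurant")
p encoding_report("Café")
```

The ascii_only? flag is the interesting part: it is true for "Restaurant" and false for "Café", no matter which encoding name ends up on the string.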
Looking good. Now let’s work with that string.
>> b.each_byte {|byte| puts byte}
67
97
102
195
169
=> "Café"
Here we see the UTF-8 encoded é as two bytes, 195 and 169.
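To make the relation between the character and its two bytes visible, here is a small sketch that prints the code point of é next to its UTF-8 bytes in hex. The helper name utf8_hex is hypothetical:

```ruby
# encoding: utf-8

# Return the bytes of a string as space-separated hex pairs.
# `utf8_hex` is a made-up helper name for this sketch.
def utf8_hex(str)
  str.bytes.to_a.map { |byte| "%02X" % byte }.join(" ")
end

puts "é".unpack("U*").first  # 233, the code point U+00E9
puts utf8_hex("é")           # C3 A9, which is 195 and 169 in decimal
puts utf8_hex("Café")        # 43 61 66 C3 A9
```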
>> b.each_char {|char| puts char}
C
a
f
é
=> "Café"
Works as expected.
>> b.size
=> 4
In Ruby 1.8 this would have returned 5, the number of bytes.
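The byte count has not gone away in 1.9; it just moved to bytesize. A short sketch (the helper name char_and_byte_counts is my own):

```ruby
# encoding: utf-8

# Return [number of characters, number of bytes] for a string.
# `char_and_byte_counts` is a made-up helper name.
def char_and_byte_counts(str)
  [str.size, str.bytesize]
end

chars, bytes = char_and_byte_counts("Café")
puts chars  # 4
puts bytes  # 5
```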
>> b.reverse
=> "éfaC"
In Ruby 1.8 this would have generated a broken character at the start.
>> b.chop
=> "Caf"
In Ruby 1.8 this would have generated a broken character at the end.
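The broken characters of Ruby 1.8 can be reproduced in 1.9 by treating the string as raw bytes. The following sketch assumes that 1.8 simply worked byte by byte; the helper name bytewise is made up for the illustration:

```ruby
# encoding: utf-8

# Apply a String method byte by byte, roughly as Ruby 1.8 did,
# then label the result as UTF-8 again.
# `bytewise` is a made-up helper name for this sketch.
def bytewise(str, operation)
  binary = str.dup.force_encoding("BINARY")        # same bytes, no character boundaries
  binary.send(operation).force_encoding("UTF-8")   # same bytes, read as UTF-8 again
end

b = "Café"
p bytewise(b, :reverse).valid_encoding?  # false, starts with a stray continuation byte
p bytewise(b, :chop).valid_encoding?     # false, ends with a lone lead byte
p b.reverse.valid_encoding?              # true
p b.chop.valid_encoding?                 # true
```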
>> b.upcase
=> "CAFé"
This is where work is still needed: upcase only converts ASCII letters and leaves the é untouched. For Ruby 1.8 there is Nikolai Weibull’s Ruby Character Encodings Library, which does the job.
>> require 'encoding/character/utf-8'
=> true
>> b = "Café"
=> "Caf\303\251"
>> b.length
=> 5
>> b = u"Café"
=> u"Caf\303\251"
>> b.length
=> 4
>> b = +"Café"
=> u"Caf\303\251"
>> b.length
=> 4
>> puts b.upcase
CAFÉ
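Without such a library, a stopgap is to patch up the ASCII-only result of upcase with a small hand-written table. This is only a sketch: the method name upcase_utf8 and the two constants are my own, and the table covers just a handful of letters, nowhere near full Unicode case mapping.

```ruby
# encoding: utf-8

# Hand-written case table; an assumption for this sketch, far from complete.
LOWER_ACCENTED = "äöüéèàç"
UPPER_ACCENTED = "ÄÖÜÉÈÀÇ"

# Upcase the ASCII letters with String#upcase, then map the accented
# letters from the table above with String#tr.
# `upcase_utf8` is a made-up helper name.
def upcase_utf8(str)
  str.upcase.tr(LOWER_ACCENTED, UPPER_ACCENTED)
end

puts upcase_utf8("Café")  # CAFÉ
```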
The library, however, is not compatible with Ruby 1.9. Another outstanding issue is sorting arrays that contain Unicode strings.
>> a = %w[ä a b c]
=> ["ä", "a", "b", "c"]
>> a.sort
=> ["a", "b", "c", "ä"]
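The strings are simply compared byte by byte, so ä ends up after all ASCII letters. The result would be okay if I wanted Swedish sorting order, where ä comes after z, but what if I wanted German sorting order, where ä is sorted like a? This needs to be addressed. There are libraries for this in Java, so Ruby shouldn’t lag behind!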
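Until there is proper support, German order can be approximated by sorting on a folded key. The sketch below treats umlauts like their base letters and ß like ss, which is my simplified reading of the German dictionary order; the method names german_sort_key and german_sort are invented here, and this is no substitute for a real collation library.

```ruby
# encoding: utf-8

# Build a sort key: umlauts become their base letter, ß becomes ss,
# and case is ignored. `german_sort_key` is a made-up helper name.
def german_sort_key(str)
  str.tr("äöüÄÖÜ", "aouAOU").gsub("ß", "ss").downcase
end

# Sort by the folded key; the original word serves as tie-breaker
# so that "a" reliably comes before "ä".
def german_sort(words)
  words.sort_by { |word| [german_sort_key(word), word] }
end

p %w[ä a b c].sort          # byte order, ä comes last
p german_sort(%w[ä a b c])  # ä comes right after a
```

With this, german_sort(%w[Zug Ärger Apfel]) gives Apfel, Ärger, Zug.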