Ruby 1.9 UTF-8 mostly works

07. Jul 2008

I was curious to see how far the UTF-8 implementation in Ruby 1.9 has come. First we assign a pure ASCII string and check its encoding.

>> a = "Restaurant"
=> "Restaurant"
>> a.encoding
=> #<Encoding:US-ASCII>

Okay, nothing spectacular so far. Next we take a string that contains a non-ASCII character.

>> b = "Café"
=> "Café"
>> b.encoding
=> #<Encoding:UTF-8>
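Which encoding a literal gets depends on the source encoding that irb or the script runs under, so the two results above can differ between setups. Here is a minimal sketch, assuming Ruby 1.9 or later, that sidesteps this by writing the é as a \u escape (such literals are always UTF-8) and asking each string whether it is ASCII-only. The variable names are mine.

```ruby
plain  = "Restaurant"
accent = "Caf\u00e9"          # "Café", written with an escape so the source stays ASCII

puts plain.ascii_only?        # true: every byte is below 128
puts accent.ascii_only?       # false: the é does not fit into 7 bits
puts accent.encoding          # UTF-8, because of the \u escape
puts accent.valid_encoding?   # true: the bytes are well-formed UTF-8
```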

Looking great, now let’s work with that string.

>> b.each_byte {|byte| puts byte}
67
97
102
195
169
=> "Café"

Here we see the UTF-8 encoded é as two bytes (195 and 169).
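Where do 195 and 169 come from? The é is code point 233 (0xE9), and UTF-8 splits every code point between 128 and 2047 into a lead byte and a trail byte. The following sketch redoes that arithmetic by hand and compares it with what Ruby reports. The variable names are only for illustration.

```ruby
b = "Caf\u00e9"               # "Café"

p b.bytes.to_a                # [67, 97, 102, 195, 169]
p b.unpack("U*")              # code points: [67, 97, 102, 233]

cp    = 0xE9                  # code point of é (233)
lead  = 0xC0 | (cp >> 6)      # 110xxxxx: the upper bits of the code point
trail = 0x80 | (cp & 0x3F)    # 10xxxxxx: the lower six bits
p [lead, trail]               # [195, 169]
```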

>> b.each_char {|char| puts char}
C
a
f
é
=> "Café"

Works as expected.

>> b.size
=> 4

In Ruby 1.8 this would have returned 5, the number of bytes.

>> b.reverse
=> "éfaC"

In Ruby 1.8 this would have generated a broken character at the start.

>> b.chop
=> "Caf"

In Ruby 1.8 this would have generated a broken character at the end.
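The Ruby 1.8 behaviour can be approximated in 1.9 without installing the old version: relabel the string as binary, so that every byte counts as one character. This is only a sketch under that assumption, not an exact emulation of 1.8, but it shows why reverse and chop used to break the é.

```ruby
b   = "Caf\u00e9"                        # "Café"
raw = b.dup.force_encoding("BINARY")     # same bytes, one "character" per byte

p raw.size                                              # 5
p raw.reverse.force_encoding("UTF-8").valid_encoding?   # false: the two bytes of é got swapped
p raw.chop.force_encoding("UTF-8").valid_encoding?      # false: only half of é was removed

p b.reverse.valid_encoding?                             # true: the character-aware version is fine
p b.chop                                                # "Caf"
```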

>> b.upcase
=> "CAFé"

This is where work is still needed: upcase only changes the ASCII letters and leaves the é in lower case. For Ruby 1.8, Nikolai Weibull’s Ruby Character Encodings Library does the job.

>> require 'encoding/character/utf-8'
=> true
>> b = "Café"
=> "Caf\303\251"
>> b.length
=> 5
>> b = u"Café"
=> u"Caf\303\251"
>> b.length
=> 4
>> b = +"Café" 
=> u"Caf\303\251"
>> b.length 
=> 4
>> puts b.upcase
CAFÉ
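Without such a library, a rough stopgap for upcase is to map the accented letters by hand. The sketch below uses a hypothetical helper of my own, upcase_latin1, which is not part of Ruby or of the library above. It covers only the Latin-1 letters à to þ and, for example, not ß or ÿ. It relies on String#tr accepting multibyte ranges in UTF-8 strings.

```ruby
# Hypothetical helper, for illustration only.
# Lower-case Latin-1 letters (à..ö, ø..þ) and their upper-case partners (À..Ö, Ø..Þ).
LATIN1_LOWER = "\u00e0-\u00f6\u00f8-\u00fe"
LATIN1_UPPER = "\u00c0-\u00d6\u00d8-\u00de"

def upcase_latin1(str)
  # upcase handles a-z; tr then maps whatever accented lower-case letters are left.
  str.upcase.tr(LATIN1_LOWER, LATIN1_UPPER)
end

puts upcase_latin1("Caf\u00e9")    # "Café" in, all four letters upper-cased out
```

The two ranges have to line up letter for letter, which is why the division sign at U+00F7 is left out of the mapping.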

The library, however, is not compatible with Ruby 1.9. Another outstanding issue is the sorting of arrays that contain Unicode strings.

>> a = %w[ä a b c]
=> ["ä", "a", "b", "c"]
>> a.sort
=> ["a", "b", "c", "ä"]

The result would be okay if I wanted Swedish sorting order, where ä comes near the end of the alphabet, but what if I wanted German sorting order? This needs to be addressed. There are libraries for this in Java, so Ruby shouldn’t lag behind!
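Until there is proper collation support, a German-style order can be faked with a sort key. The sketch below uses a hypothetical helper of my own, german_sort, and assumes the dictionary convention of treating ä, ö, ü like a, o, u and ß like ss. It is nowhere near a full collation algorithm, just enough for the example above.

```ruby
# Hypothetical helper, for illustration only.
def german_sort(words)
  words.sort_by do |w|
    # Fold umlauts onto their base letters, for comparison only.
    folded = w.tr("\u00e4\u00f6\u00fc\u00c4\u00d6\u00dc", "aouAOU").gsub("\u00df", "ss")
    [folded, w]   # the original word breaks ties, so "a" comes before "ä"
  end
end

a = ["\u00e4", "a", "b", "c"]   # %w[ä a b c]
p a.sort                        # plain sort: ä ends up last
p german_sort(a)                # folded sort: ä ends up right after a
```

The strings themselves are not changed; only the key used for comparison is folded.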

 
