<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/css" href="/stylesheets/rss.css"?>
<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:trackback="http://madskills.com/public/xml/rss/module/trackback/">
  <channel>
    <title>loopkid: ruby 1.9 utf-8 mostly works</title>
    <link>http://loopkid.net/articles/2008/07/07/ruby-1-9-utf-8-mostly-works</link>
    <language>en-us</language>
    <ttl>40</ttl>
    <description>sad songs make me happy</description>
    <item>
      <title>ruby 1.9 utf-8 mostly works</title>
      <description>&lt;p&gt;I was curious to see how far the implementation of UTF-8 in Ruby 1.9 has come. First we assign a pure ASCII string and check its encoding.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;&amp;gt;&amp;gt; a = "Restaurant"
=&amp;gt; "Restaurant"
&amp;gt;&amp;gt; a.encoding
=&amp;gt; #&amp;lt;Encoding:US-ASCII&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Okay, nothing spectacular so far. Next we take a Unicode string.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;&amp;gt;&amp;gt; b = "Café"
=&amp;gt; "Café"
&amp;gt;&amp;gt; b.encoding
=&amp;gt; #&amp;lt;Encoding:UTF-8&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Looking great, now let&amp;#8217;s work with that string.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;&amp;gt;&amp;gt; b.each_byte {|byte| puts byte}
67
97
102
195
169
=&amp;gt; "Café"
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Here we see the UTF-8 encoded é as two bytes, 195 and 169.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;&amp;gt;&amp;gt; b.each_char {|char| puts char}
C
a
f
é
=&amp;gt; "Café"
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Works as expected.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;&amp;gt;&amp;gt; b.size
=&amp;gt; 4
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;In Ruby 1.8 this would have returned 5.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;&amp;gt;&amp;gt; b.reverse
=&amp;gt; "éfaC"
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;In Ruby 1.8 this would have generated a broken character at the start.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;&amp;gt;&amp;gt; b.chop
=&amp;gt; "Caf"
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;In Ruby 1.8 this would have generated a broken character at the end.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;&amp;gt;&amp;gt; b.upcase
=&amp;gt; "CAFé"
&lt;/code&gt;&lt;/pre&gt;
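
&lt;p&gt;The é comes back unchanged because &lt;code&gt;upcase&lt;/code&gt; in 1.9 only maps the ASCII letters a to z. As a crude stopgap, and only a sketch, one could follow it with &lt;code&gt;String#tr&lt;/code&gt;, which works on whole characters in 1.9. The helper name &lt;code&gt;upcase_accented&lt;/code&gt; and the short letter table are made up for illustration and are nowhere near full Unicode case mapping.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# encoding: utf-8

# Hypothetical helper: upcase the ASCII letters first, then map a few
# accented lowercase letters by hand. The table is illustrative only.
LOWER_ACCENTED = "àáâäçèéêëíñóôöúûü"
UPPER_ACCENTED = "ÀÁÂÄÇÈÉÊËÍÑÓÔÖÚÛÜ"

def upcase_accented(str)
  str.upcase.tr(LOWER_ACCENTED, UPPER_ACCENTED)
end

puts upcase_accented("Café")
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;This prints CAFÉ, but any letter missing from the table is silently left in lowercase.&lt;/p&gt;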

&lt;p&gt;This is where work is still needed! In Ruby 1.8 there is Nikolai Weibull&amp;#8217;s &lt;a href="http://bitwi.se/software/ruby/character-encodings/"&gt;Ruby Character Encodings Library&lt;/a&gt; that does the job.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;&amp;gt;&amp;gt; require 'encoding/character/utf-8'
=&amp;gt; true
&amp;gt;&amp;gt; b = "Café"
=&amp;gt; "Caf\303\251"
&amp;gt;&amp;gt; b.length
=&amp;gt; 5
&amp;gt;&amp;gt; b = u"Café"
=&amp;gt; u"Caf\303\251"
&amp;gt;&amp;gt; b.length
=&amp;gt; 4
&amp;gt;&amp;gt; b = +"Café" 
=&amp;gt; u"Caf\303\251"
&amp;gt;&amp;gt; b.length 
=&amp;gt; 4
&amp;gt;&amp;gt; puts b.upcase
CAFÉ
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;The library, however, is &lt;a href="http://rubyforge.org/tracker/index.php?func=detail&amp;amp;aid=21150&amp;amp;group_id=1982&amp;amp;atid=7741"&gt;not compatible&lt;/a&gt; with Ruby 1.9. Another outstanding issue is the sorting of arrays containing Unicode strings.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;&amp;gt;&amp;gt; a = %w[ä a b c]
=&amp;gt; ["ä", "a", "b", "c"]
&amp;gt;&amp;gt; a.sort
=&amp;gt; ["a", "b", "c", "ä"]
&lt;/code&gt;&lt;/pre&gt;
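
&lt;p&gt;Plain &lt;code&gt;sort&lt;/code&gt; compares strings byte by byte, which is why ä ends up after every ASCII letter. For comparison, here is a rough sketch that places ä next to a by folding the umlauts in a sort key. The helper name &lt;code&gt;german_sort_key&lt;/code&gt; and its folding table are assumptions for illustration, not a real collation algorithm.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# encoding: utf-8

# Hypothetical sort key: fold German umlauts onto their base letters
# and expand ß to ss, roughly as dictionary ordering does.
def german_sort_key(str)
  str.tr("äöüÄÖÜ", "aouAOU").gsub("ß", "ss")
end

words = %w[ä a b c]
# The original string is the tie-breaker, so "a" sorts before "ä".
p words.sort_by { |w| [german_sort_key(w), w] }
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;With this key the array comes back in the order a, ä, b, c.&lt;/p&gt;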

&lt;p&gt;The order returned by &lt;code&gt;a.sort&lt;/code&gt; would be okay if I wanted Swedish sorting order, but what if I wanted German sorting order? This needs to be addressed. There are libraries for this in Java, so Ruby shouldn&amp;#8217;t lag behind!&lt;/p&gt;</description>
      <pubDate>Mon, 07 Jul 2008 21:40:00 +0200</pubDate>
      <guid isPermaLink="false">urn:uuid:b09c8f6a-3bc4-4199-93ad-86d2097786ce</guid>
      <author>Stefan</author>
      <link>http://loopkid.net/articles/2008/07/07/ruby-1-9-utf-8-mostly-works</link>
      <category>English</category>
      <category>Ruby</category>
      <trackback:ping>http://loopkid.net/articles/trackback/8576</trackback:ping>
    </item>
  </channel>
</rss>
