ruby 1.9 utf-8 mostly works 07. Jul 2008

I was curious to see how far the implementation of UTF-8 in Ruby 1.9 has developed. First we assign a pure ASCII string and check its encoding.

>> a = "Restaurant"
=> "Restaurant"
>> a.encoding
=> #<Encoding:US-ASCII>

Okay, nothing spectacular so far. Next we take a Unicode string.

>> b = "Café"
=> "Café"
>> b.encoding
=> #<Encoding:UTF-8>

Looking great, now let’s work with that string.

>> b.each_byte {|byte| puts byte}
67
97
102
195
169
=> "Café"

Here we see the UTF-8 encoded é as two bytes (195 and 169).
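To make the relation between bytes and characters a bit more concrete, here is a small sketch that unpacks the same string once as raw bytes and once as Unicode code points. The variable names are just for illustration, and the magic comment assumes the source file is saved as UTF-8.

```ruby
# encoding: utf-8
b = "Café"

# "C*" unpacks every byte as an unsigned 8-bit integer.
bytes = b.unpack("C*")

# "U*" decodes the string as UTF-8 and returns one code point per character.
codepoints = b.unpack("U*")

puts bytes.inspect       # prints [67, 97, 102, 195, 169]
puts codepoints.inspect  # prints [67, 97, 102, 233]
```

The single code point 233 (U+00E9) for é is what UTF-8 stores as the byte pair 195, 169.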

>> b.each_char {|char| puts char}
C
a
f
é
=> "Café"

Works as expected.

>> b.size
=> 4

In Ruby 1.8 this would have returned 5.

>> b.reverse
=> "éfaC"

In Ruby 1.8 this would have generated a broken character at the start.

>> b.chop
=> "Caf"

In Ruby 1.8 this would have generated a broken character at the end.
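The Ruby 1.8 behaviour can be approximated in Ruby 1.9 by relabelling the string as binary, so that every method works on bytes again. This is only a sketch of the idea and not a faithful emulation of 1.8; the variable names are made up for illustration.

```ruby
# encoding: utf-8
b = "Café"

# Same bytes, but without any character awareness (roughly how 1.8 saw it).
raw = b.dup.force_encoding("BINARY")

puts raw.size  # prints 5, the byte count

# Reverse and chop byte by byte, then relabel the results as UTF-8.
reversed = raw.reverse.force_encoding("UTF-8")
chopped  = raw.chop.force_encoding("UTF-8")

# Both results now contain a torn multibyte sequence.
puts reversed.valid_encoding?  # prints false
puts chopped.valid_encoding?   # prints false
```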

>> b.upcase
=> "CAFé"

This is where work is still needed: upcase leaves the non-ASCII é untouched. For Ruby 1.8 there is Nikolai Weibull’s Ruby Character Encodings Library, which does the job.

>> require 'encoding/character/utf-8'
=> true
>> b = "Café"
=> "Caf\303\251"
>> b.length
=> 5
>> b = u"Café"
=> u"Caf\303\251"
>> b.length
=> 4
>> b = +"Café" 
=> u"Caf\303\251"
>> b.length 
=> 4
>> puts b.upcase
CAFÉ

The library, however, is not compatible with Ruby 1.9. Another outstanding issue is the sorting of arrays containing Unicode strings.

>> a = %w[ä a b c]
=> ["ä", "a", "b", "c"]
>> a.sort
=> ["a", "b", "c", "ä"]

The result would be okay if I wanted Swedish sorting order, where ä comes after z, but what if I wanted German sorting order, where ä is sorted next to a? This needs to be addressed. There are libraries for this in Java, so Ruby shouldn’t lag behind!
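Until there is proper collation support, one workaround is to sort by a folded key. The sketch below assumes a simplified German ordering in which umlauts are treated like their base vowels and ß like ss; GERMAN_FOLD and german_sort_key are hypothetical names invented here, not part of Ruby or any library.

```ruby
# encoding: utf-8

# Hypothetical folding table: umlauts sort like their base vowels.
GERMAN_FOLD = {
  "ä" => "a", "ö" => "o", "ü" => "u",
  "Ä" => "a", "Ö" => "o", "Ü" => "u",
  "ß" => "ss"
}

# Hypothetical helper: builds the key that is actually compared.
def german_sort_key(word)
  word.gsub(/[äöüÄÖÜß]/, GERMAN_FOLD).downcase
end

words = %w[ä a b c]

# The original word is the tie-breaker, so "a" stays in front of "ä".
sorted = words.sort_by { |word| [german_sort_key(word), word] }

puts sorted.inspect  # prints ["a", "ä", "b", "c"]
```

This is nowhere near a full collation algorithm, but it shows that the order depends on the key you choose and not on the bytes of the string.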

 

predefined character classes in grep 06. Jul 2008

Many regex implementations have “macros” for various character classes. In Perl, for example, \d matches any digit ([0-9]) and \w matches any “word character” ([a-zA-Z0-9_]). Grep uses a slightly different notation for the same thing: [:digit:] for digits and [:alnum:] for alphanumeric characters. (BSD)

Finally, certain named classes of characters are predefined within bracket expressions, as follows. Their names are self explanatory, and they are [:alnum:], [:alpha:], [:cntrl:], [:digit:], [:graph:], [:lower:], [:print:], [:punct:], [:space:], [:upper:], and [:xdigit:]. For example, [[:alnum:]] means [0-9A-Za-z]. (grep man page)

What in Ruby is written as

line = "length 1450"
puts line if line =~ /\d{4}/

therefore becomes, in bash with grep,

export line="length 1450"
echo "$line" | egrep '[[:digit:]]{4}'

which bloats the regular expressions unnecessarily, but at least provides the same functionality.
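For comparison, Ruby's regex engine also understands the POSIX bracket notation, so the grep pattern can be carried over unchanged. A small sketch, using the same sample line as above (the variable names are only for illustration):

```ruby
line = "length 1450"

# Perl-style shorthand and POSIX bracket expression, side by side.
perl_style  = line[/\d{4}/]
posix_style = line[/[[:digit:]]{4}/]

puts perl_style   # prints 1450
puts posix_style  # prints 1450
```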

 

conciseness vs. readability 01. Jul 2008

To print the contents of all files passed as arguments, or the contents of stdin if no arguments are given, you could write

puts *ARGF

but isn’t

if ARGV.empty?
  puts $stdin.readlines
else
  ARGV.each do |filename|
    puts File.readlines(filename)
  end
end

much more readable?
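A possible middle ground, offered only as a sketch: ARGF.each_line reads the files named in ARGV (or stdin when ARGV is empty) one line at a time, which stays short but says a little more about what is going on. The method name cat_inputs and its out parameter are made up for this example.

```ruby
# Hypothetical helper: copies every line ARGF provides (the files named in
# ARGV, or stdin when ARGV is empty) to the given output stream.
def cat_inputs(out = $stdout)
  ARGF.each_line { |line| out.print line }
end

# In a script you would simply call it at the end:
# cat_inputs
```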

 

read file with one line of code 01. Jul 2008

There are probably at least half a dozen ways to print the contents of a file in Ruby. Here I will present four of them.

The first method is rather verbose: open the file, iterate over the file object with the each_line method, and then close the file.

f = open('file.txt')
f.each_line do |line|
  puts line
end
f.close

The second method still assigns a file object, but then uses the more compact readlines instance method and closes the file.

f = open('file.txt')
puts f.readlines
f.close

The third method uses block form to access the file and automatically close it.

open('file.txt') { |f| puts f.readlines }

The fourth method calls the class method readlines to wrap it all up in one statement.

puts File.readlines('file.txt')
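Two more candidates from the "half a dozen", sketched here with a temporary file so the example runs on its own: File.foreach yields one line at a time without loading the whole file, and File.read returns the entire content as a single string. The sample file and the variable names are assumptions for the example.

```ruby
require 'tempfile'

# Create a small sample file so the sketch is self-contained.
sample = Tempfile.new('sample')
sample.write("first line\nsecond line\n")
sample.close

# File.foreach opens the file, yields each line, and closes it afterwards.
lines_seen = []
File.foreach(sample.path) { |line| lines_seen << line }
puts lines_seen

# File.read slurps the whole file into one string.
whole = File.read(sample.path)
print whole
```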
 

the beauty of ruby 1.9 28. Jun 2008

Let’s say we have a string and three arrays

string = "672"
java = []
ruby18 = []
ruby19 = []

Now you want to convert the string to an array of integers. If you were writing Ruby in Java style, you would probably write something like

i=0
digits = string.split(//)
while i < digits.size do
  java << digits[i].to_i
  i+=1
end

but since we’re doing Ruby in Ruby style it looks more like

string.split(//).each { |digit| ruby18 << digit.to_i }

which looks much nicer but is still not quite right. With Ruby 1.9’s chars iterator it gets even better.

string.chars { |digit| ruby19 << digit.to_i }

Goes down like butter, doesn’t it? Isn’t that how it always should have been? And the good thing is, there is lots of this good stuff in Ruby 1.9. Maybe in Ruby 2.0 there will even be a to_i method in the array class, but now I’m getting carried away.
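If the goal is just the array of integers, the separate accumulator array can be dropped as well. A minimal sketch, relying on each_char returning an enumerator when called without a block (the variable name digits is only for illustration):

```ruby
string = "672"

# each_char without a block gives an enumerator, so map can build the array.
digits = string.each_char.map { |digit| digit.to_i }

puts digits.inspect  # prints [6, 7, 2]
```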

 
