ruby 1.9 utf-8 mostly works 07. Jul 2008

I was curious to see how far the implementation of UTF-8 in Ruby 1.9 has developed. First we assign a pure ASCII string and check its encoding.

>> a = "Restaurant"
=> "Restaurant"
>> a.encoding
=> #<Encoding:US-ASCII>

Okay, nothing spectacular so far. Next we take a Unicode string.

>> b = "Café"
=> "Café"
>> b.encoding
=> #<Encoding:UTF-8>

Looking great, now let’s work with that string.

>> b.each_byte {|byte| puts byte}
67
97
102
195
169
=> "Café"

Here we see the UTF-8 encoded é as two bytes (195 and 169).

>> b.each_char {|char| puts char}
C
a
f
é
=> "Café"

Works as expected.

>> b.size
=> 4

In Ruby 1.8 this would have returned 5.
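The byte count that 1.8 reported has not disappeared. A minimal sketch, assuming the source file is read as UTF-8 (hence the magic comment), that contrasts the character count with the byte count:

```ruby
# encoding: utf-8

b = "Café"

puts b.size      # number of characters: 4
puts b.bytesize  # number of bytes: 5, because é occupies two bytes in UTF-8
p b.bytes.to_a   # the individual byte values, same as the each_byte output above
```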

>> b.reverse
=> "éfaC"

In Ruby 1.8 this would have generated a broken character at the start.

>> b.chop
=> "Caf"

In Ruby 1.8 this would have generated a broken character at the end.

>> b.upcase
=> "CAFé"

This is where work is still needed! In Ruby 1.8 there is Nikolai Weibull’s Ruby Character Encodings Library that does the job.
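Until upcase learns about non-ASCII letters, a crude workaround is to map the accented letters by hand after the ASCII pass. The sketch below is only an illustration: upcase_latin and the two letter tables are hypothetical names, the tables cover a handful of Western European letters, and ß is not handled at all.

```ruby
# encoding: utf-8

# Hypothetical lookup tables; both strings must list the letters in the same order.
LOWER = "àáâäçèéêëñöü"
UPPER = "ÀÁÂÄÇÈÉÊËÑÖÜ"

# Hypothetical helper: ASCII upcase first, then translate the accented letters.
def upcase_latin(str)
  str.upcase.tr(LOWER, UPPER)
end

puts upcase_latin("Café")
```

On Ruby 2.4 and later, String#upcase performs Unicode case mapping itself, so the tr step simply finds nothing left to translate.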

>> require 'encoding/character/utf-8'
=> true
>> b = "Café"
=> "Caf\303\251"
>> b.length
=> 5
>> b = u"Café"
=> u"Caf\303\251"
>> b.length
=> 4
>> b = +"Café" 
=> u"Caf\303\251"
>> b.length 
=> 4
>> puts b.upcase
CAFÉ

The library, however, is not compatible with Ruby 1.9. Another outstanding issue is the sorting of arrays containing Unicode strings.

>> a = %w[ä a b c]
=> ["ä", "a", "b", "c"]
>> a.sort
=> ["a", "b", "c", "ä"]

The result would be okay if I wanted Swedish sorting order, but what if I wanted German sorting order? This needs to be addressed. There are libraries for this in Java, so Ruby shouldn’t lag behind!
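As a stopgap, a German dictionary-style order can be approximated with sort_by and a hand-built sort key. This is a rough sketch, not a real collation library: german_key is a hypothetical helper, and it assumes the convention that ä, ö, ü sort like a, o, u and ß like ss. The original string is kept as a tie-breaker so that "a" reliably comes before "ä".

```ruby
# encoding: utf-8

# Hypothetical helper: fold umlauts and ß into their base letters for comparison.
def german_key(str)
  str.tr("äöüÄÖÜ", "aouAOU").gsub("ß", "ss").downcase
end

words = %w[ä a b c]

# Sort by the folded key first, then by the original string to break ties.
sorted = words.sort_by { |w| [german_key(w), w] }
p sorted  # => ["a", "ä", "b", "c"]
```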

 

predefined character classes in grep 06. Jul 2008

Many regex implementations have “macros” for various character classes. In Perl, for example, \d matches any digit ([0-9]) and \w matches any “word character” ([a-zA-Z0-9_]). Grep uses a slightly different notation for the same thing: [:digit:] for digits and [:alnum:] for alphanumeric characters. (BSD)

Finally, certain named classes of characters are predefined within bracket expressions, as follows. Their names are self explanatory, and they are [:alnum:], [:alpha:], [:cntrl:], [:digit:], [:graph:], [:lower:], [:print:], [:punct:], [:space:], [:upper:], and [:xdigit:]. For example, [[:alnum:]] means [0-9A-Za-z]. (grep man page)

So what is written in Ruby as

line = "length 1450"
puts line if line =~ /\d{4}/

becomes, in bash with grep,

export line="length 1450"
echo "$line" | egrep '[[:digit:]]{4}'

which bloats the regular expression unnecessarily, but at least provides the same functionality.
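Ruby's regex engine also understands the POSIX bracket notation, so the grep pattern can be carried over unchanged. A small sketch using the same sample line:

```ruby
line = "length 1450"

# Same match as /\d{4}/, written with the POSIX character class.
puts line if line =~ /[[:digit:]]{4}/

# String#[] with a regex returns just the matched part.
p line[/[[:digit:]]{4}/]  # => "1450"
```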

 

the dust bunnies are my only friends 03. Jul 2008

Isobel Campbell saved him from the loneliness in his empty apartment.

 

conciseness vs. readability 01. Jul 2008

To print the contents of all files passed as arguments, or the contents of stdin if there are no arguments, you could write

puts *ARGF

but isn’t

if ARGV.empty?
  puts $stdin.readlines
else
  ARGV.each do |filename|
    puts File.readlines(filename)
  end
end

much more readable?
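One possible middle ground is to wrap the explicit version in a small method. The sketch below assumes a hypothetical helper named cat that takes the file names plus the input and output streams as parameters; the StringIO objects only stand in for stdin and stdout so the example runs without any files.

```ruby
require 'stringio'

# Hypothetical helper: print every named file, or `input` when no names are given.
def cat(filenames, input = $stdin, output = $stdout)
  if filenames.empty?
    output.puts input.readlines
  else
    filenames.each { |name| output.puts File.readlines(name) }
  end
end

# Demo with in-memory streams instead of the real stdin and stdout.
out = StringIO.new
cat([], StringIO.new("one\ntwo\n"), out)
print out.string
```

In a script the call would simply be cat(ARGV).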

 

read file with one line of code 01. Jul 2008

There are probably at least half a dozen ways to print the contents of a file in Ruby. Here I will present four of them.

The first method is rather verbose. Open the file, iterate over the file object with the each_line method, and then close the file.

f = open('file.txt')
f.each_line do |line|
  puts line
end
f.close

The second method still assigns a file object, but then uses the more compact readlines instance method and closes the file.

f = open('file.txt')
puts f.readlines
f.close

The third method uses the block form of open, which closes the file automatically when the block ends.

open('file.txt') { |f| puts f.readlines }

The fourth method calls the class method readlines to wrap it all up in one statement.

puts File.readlines('file.txt')
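Two more one-liners from the same family are File.foreach, which yields one line at a time without loading the whole file, and File.read, which returns the file as a single string. The sketch below writes its own throwaway sample file into the temp directory (the path and contents are made up for the demo) so that it runs on its own.

```ruby
require 'tmpdir'

# Throwaway sample file for the demo.
path = File.join(Dir.tmpdir, 'file.txt')
File.open(path, 'w') { |f| f.puts "first", "second" }

# Line by line, without reading the whole file into memory.
lines = []
File.foreach(path) { |line| lines << line }
print lines.join

# The whole file as one string.
whole = File.read(path)
print whole

File.delete(path)
```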
 
