ruby 1.9 utf-8 mostly works 07. Jul 2008

I was curious to see how far UTF-8 support in Ruby 1.9 has come. First we assign a pure ASCII string and check its encoding.

>> a = "Restaurant"
=> "Restaurant"
>> a.encoding
=> #<Encoding:US-ASCII>

Okay, nothing spectacular so far. Next we take a Unicode string.

>> b = "Café"
=> "Café"
>> b.encoding
=> #<Encoding:UTF-8>

Looking great, now let’s work with that string.

>> b.each_byte {|byte| puts byte}
67
97
102
195
169
=> "Café"

Here we see the UTF-8 encoded é as two bytes (195 and 169).

>> b.each_char {|char| puts char}
C
a
f
é
=> "Café"

Works as expected.

>> b.size
=> 4

In Ruby 1.8 this would have returned 5.
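To make the difference between bytes and characters explicit, here is a minimal sketch for a Ruby with 1.9-style string encodings. The helper name byte_and_char_counts is made up for illustration and is not part of Ruby.

```ruby
# Hypothetical helper (not from the post): report both counts for a string.
def byte_and_char_counts(str)
  { bytes: str.bytesize, chars: str.size }
end

counts = byte_and_char_counts("Café")
puts counts[:bytes]  # number of bytes in the UTF-8 encoding
puts counts[:chars]  # number of characters
```

bytesize always counts raw bytes, so it reports what size used to return in Ruby 1.8.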

>> b.reverse
=> "éfaC"

In Ruby 1.8 this would have generated a broken character at the start.

>> b.chop
=> "Caf"

In Ruby 1.8 this would have generated a broken character at the end.

>> b.upcase
=> "CAFé"

This is where work is still needed: upcase leaves the é in lowercase. In Ruby 1.8, Nikolai Weibull’s Ruby Character Encodings Library does the job.

>> require 'encoding/character/utf-8'
=> true
>> b = "Café"
=> "Caf\303\251"
>> b.length
=> 5
>> b = u"Café"
=> u"Caf\303\251"
>> b.length
=> 4
>> b = +"Café" 
=> u"Caf\303\251"
>> b.length 
=> 4
>> puts b.upcase
CAFÉ

The library, however, is not compatible with Ruby 1.9. Another outstanding issue is the sorting of arrays containing Unicode strings.

>> a = %w[ä a b c]
=> ["ä", "a", "b", "c"]
>> a.sort
=> ["a", "b", "c", "ä"]

The result would be okay if I wanted Swedish sorting order, but what if I wanted German sorting order? This needs to be addressed. There are libraries for this in Java, so Ruby shouldn’t lag behind!
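As a rough workaround, one could sort on an accent-stripped key. The sketch below assumes a later Ruby (String#unicode_normalize appeared in 2.2, as far as I know), and the helper name sort_ignoring_accents is made up for illustration. It only approximates a dictionary-style ordering where ä sorts next to a. It is not real locale-aware collation, and it does nothing about upper versus lower case.

```ruby
# Hypothetical helper: sort by the accent-stripped form first, then by the
# original string so that "a" still comes before "ä".
def sort_ignoring_accents(words)
  words.sort_by do |word|
    # Decompose "ä" into "a" plus a combining mark, then drop the marks.
    base = word.unicode_normalize(:nfd).gsub(/\p{Mn}/, "")
    [base, word]
  end
end

p sort_ignoring_accents(%w[ä a b c])  # prints ["a", "ä", "b", "c"]
```

The second element of the sort key is only a tie-breaker, so strings that differ just by an accent keep a stable, predictable order.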

 

conciseness vs. readability 01. Jul 2008

To print the contents of all files passed as arguments, or the contents of stdin if there are no arguments, you could write

puts *ARGF

but isn’t

if ARGV.empty?
  puts $stdin.readlines
else
  ARGV.each do |filename|
    puts File.readlines(filename)
  end
end

much more readable?

 

read file with one line of code 01. Jul 2008

There are probably at least half a dozen ways to print the contents of a file in Ruby. Here I will present four of them.

The first method is rather verbose: open the file, iterate over its lines with the each_line method, and then close the file.

f = open('file.txt')
f.each_line do |line|
  puts line
end
f.close

The second method still assigns a file object, but then uses the more compact readlines instance method and closes the file.

f = open('file.txt')
puts f.readlines
f.close

The third method uses block form to access the file and automatically close it.

open('file.txt') { |f| puts f.readlines }

The fourth method calls the class method readlines to wrap it all up in one statement.

puts File.readlines('file.txt')
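All four variants can be complemented by a streaming one: File.foreach yields one line at a time and closes the file when it is done, so the whole file never has to sit in memory. The sketch below is self-contained, so it writes its own sample file. The helper name print_file and the sample contents are made up for illustration.

```ruby
require "tempfile"

# Hypothetical helper: print a file line by line without reading it all at once.
def print_file(path, out = $stdout)
  File.foreach(path) { |line| out.puts line }
end

# Sample file created only for this sketch.
Tempfile.create("example") do |file|
  file.write("first line\nsecond line\n")
  file.flush
  print_file(file.path)
end
```

For small files the one-liner with File.readlines is just as good. The streaming form mainly pays off with large inputs.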
 

the beauty of ruby 1.9 28. Jun 2008

Let’s say we have a string and three arrays

string = "672"
java = []
ruby18 = []
ruby19 = []

Now you want to convert the string to an array of integers. If you were writing Ruby in Java style, you would probably write something like

i=0
digits = string.split(//)
while i < digits.size do
  java << digits[i].to_i
  i+=1
end

but since we’re doing Ruby in Ruby style, it looks more like

string.split(//).each { |digit| ruby18 << digit.to_i }

which looks much nicer, but is still not quite right. With Ruby 1.9’s chars iterator it gets even better.

string.chars { |digit| ruby19 << digit.to_i }

Goes down like butter, doesn’t it? Isn’t that how it always should have been? And the good thing is, there is lots of this good stuff in Ruby 1.9. Maybe in Ruby 2.0 there will even be a to_i method in the array class, but now I’m getting carried away.
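All three loops above push into an array that was created beforehand. A variant that builds the array directly might look like the sketch below. It relies on each_char returning an enumerator when called without a block and on Symbol#to_proc, both of which Ruby 1.9 provides. The helper name digits_of is made up for illustration.

```ruby
# Hypothetical helper: turn a string of digits into an array of integers.
def digits_of(string)
  string.each_char.map(&:to_i)
end

p digits_of("672")  # prints [6, 7, 2]
```

Note that to_i returns 0 for non-digit characters, so this sketch assumes the string contains digits only.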

 

beautiful network monitoring 27. Jun 2008

Monitoring networks with tcpdump works fine, but even in quiet mode tcpdump outputs too much information if you’re interested in application layer protocols like HTTP or IMAP. A nice alternative on the command line is ngrep, which has much more readable output. ngrep filters out all TCP packets with an empty data part and strips the headers from the non-empty ones. The only cosmetic flaw in my eyes is the dot ngrep inserts for every tab and every carriage return. In my opinion it should offer an option to specify the tab size and simply ignore carriage returns.

But see for yourself. Here is the tcpdump output of an IMAP session,

$ sudo tcpdump -i en1 -A -s 0 -qtn port imap
tcpdump: verbose output suppressed,
use -v or -vv for full protocol decode
listening on en1, link-type EN10MB (Ethernet),
capture size 65535 bytes
IP 192.168.2.22.50556 > 80.237.145.78.143: tcp 0
E..@..@.@.......P..N.|...7.n........Go.............
/!..........
IP 80.237.145.78.143 > 192.168.2.22.50556: tcp 0
E..<..@.9...P..N.......|.....7.o...............
w.A./!......
IP 192.168.2.22.50556 > 80.237.145.78.143: tcp 0
E..4..@.@.......P..N.|...7.o...............
/!..w.A.
IP 80.237.145.78.143 > 192.168.2.22.50556: tcp 21
E..I..@.9...P..N.......|.....7.o...........
w.A./!..* OK Dovecot ready.

IP 192.168.2.22.50556 > 80.237.145.78.143: tcp 0
E..4.P@.@..j....P..N.|...7.o...............
/!..w.A.
IP 192.168.2.22.50556 > 80.237.145.78.143: tcp 33
E..U.8@.@..`....P..N.|...7.o...............
/!.2w.A.1 LOGIN foo@bar.net password

IP 80.237.145.78.143 > 192.168.2.22.50556: tcp 0
E..4..@.9...P..N.......|.....7.......<.....
w.F'/!.2
IP 80.237.145.78.143 > 192.168.2.22.50556: tcp 17
E..E..@.9...P..N.......|.....7......c......
w.F./!.21 OK Logged in.

IP 192.168.2.22.50556 > 80.237.145.78.143: tcp 0
E..4.0@.@.......P..N.|...7.....?.....D.....
/!.>w.F.
IP 192.168.2.22.50556 > 80.237.145.78.143: tcp 14
E..B.#@.@.......P..N.|...7.....?...........
/!.cw.F.1 LIST """%"

IP 80.237.145.78.143 > 192.168.2.22.50556: tcp 0
E..4..@.9...P..N.......|...?.7.............
w.H./!.c
IP 80.237.145.78.143 > 192.168.2.22.50556: tcp 513
E..5..@.9...P..N.......|...?.7.............
w.H./!.c* LIST (\HasChildren) "." "Trash"
* LIST (\HasNoChildren) "." "Sent"
* LIST (\HasNoChildren) "." "Spam"
* LIST (\HasNoChildren) "." "Sent Messages"
* LIST (\HasNoChildren) "." "Drafts"
* LIST (\HasNoChildren) "." "Spamtraining"
* LIST (\HasNoChildren) "." "Hamtraining"
* LIST (\HasNoChildren) "." "Spamtesting"
* LIST (\HasNoChildren) "." "Hamtesting"
* LIST (\HasNoChildren) "." "Deleted Messages"
* LIST (\HasNoChildren) "." "Spamverdacht"
* LIST (\HasNoChildren) "." "INBOX"
1 OK List completed.

IP 192.168.2.22.50556 > 80.237.145.78.143: tcp 0
E..4]'@.@.8.....P..N.|...7.....@...........
/!.cw.H.
IP 192.168.2.22.50556 > 80.237.145.78.143: tcp 10
E..>.L@.@..c....P..N.|...7.....@.....N.....
/!..w.H.1 LOGOUT

IP 80.237.145.78.143 > 192.168.2.22.50556: tcp 19
E..G..@.9...P..N.......|...@.7.............
w.IP/!..* BYE Logging out

IP 192.168.2.22.50556 > 80.237.145.78.143: tcp 0
E..4.I@.@..p....P..N.|...7.....S.....%.....
/!..w.IP
IP 80.237.145.78.143 > 192.168.2.22.50556: tcp 24
E..L..@.9...P..N.......|...S.7.............
w.IP/!..1 OK Logout completed.

IP 192.168.2.22.50556 > 80.237.145.78.143: tcp 0
E..4C#@.@.R.....P..N.|...7.....l...........
/!..w.IP
IP 192.168.2.22.50556 > 80.237.145.78.143: tcp 0
E..4.;@.@.......P..N.|...7.....l...........
/!..w.IP
IP 80.237.145.78.143 > 192.168.2.22.50556: tcp 0
E..4..@.9...P..N.......|...l.7.......j.....
w.IQ/!..

20 packets captured
28 packets received by filter
0 packets dropped by kernel

compared with the corresponding ngrep output.

$ sudo ngrep -d en1 -W byline port imap
interface: en1 (192.168.2.0/255.255.255.0)
filter: (ip) and ( port imap )
####
T 80.237.145.78:143 -> 192.168.2.22:50556 [AP]
* OK Dovecot ready..

##
T 192.168.2.22:50556 -> 80.237.145.78:143 [AP]
1 LOGIN foo@bar.net password.

##
T 80.237.145.78:143 -> 192.168.2.22:50556 [AP]
1 OK Logged in..

##
T 192.168.2.22:50556 -> 80.237.145.78:143 [AP]
1 LIST """%".

##
T 80.237.145.78:143 -> 192.168.2.22:50556 [AP]
* LIST (\HasChildren) "." "Trash".
* LIST (\HasNoChildren) "." "Sent".
* LIST (\HasNoChildren) "." "Spam".
* LIST (\HasNoChildren) "." "Sent Messages".
* LIST (\HasNoChildren) "." "Drafts".
* LIST (\HasNoChildren) "." "Spamtraining".
* LIST (\HasNoChildren) "." "Hamtraining".
* LIST (\HasNoChildren) "." "Spamtesting".
* LIST (\HasNoChildren) "." "Hamtesting".
* LIST (\HasNoChildren) "." "Deleted Messages".
* LIST (\HasNoChildren) "." "Spamverdacht".
* LIST (\HasNoChildren) "." "INBOX".
1 OK List completed..

##
T 192.168.2.22:50556 -> 80.237.145.78:143 [AP]
1 LOGOUT.

#
T 80.237.145.78:143 -> 192.168.2.22:50556 [AP]
* BYE Logging out.

##
T 80.237.145.78:143 -> 192.168.2.22:50556 [AFP]
1 OK Logout completed..

28 received, 0 dropped
 
