exploring unicode sorting on mac os x 21. Mar 2011
I was wondering how well Unicode sorting works on Mac OS X and did some experiments. Let’s collect some system information first:
$ sw_vers
ProductName: Mac OS X
ProductVersion: 10.6.6
BuildVersion: 10J567
$ locale | grep LC_COLLATE
LC_COLLATE="de_DE.UTF-8"
So my default collation is German. I created a directory with some dummy files:
$ touch ä ö ß å ø a o s
Let’s see how they list:
$ ls -1 | tr -d "\n" | echo `cat` | tee -a ../de_list
aäåoösßø
Looks halfway reasonable, although I would have expected the ø to be in line with the other o’s. Now let’s pipe the results through the sort utility:
$ ls -1 | iconv -f utf-8-mac -t utf-8 | sort | tr -d "\n" |\
echo `cat` | tee -a ../de_list
aosßäåöø
Hm, it’s different and obviously completely wrong. Let’s try the Swedish sorting order for a change:
$ LC_COLLATE=sv_SE.UTF-8 ls -1 | tr -d "\n" |\
echo `cat` | tee -a ../se_list
aäåoösßø
Again, way off. With the sort utility:
$ ls -1 | iconv -f utf-8-mac -t utf-8 | LC_COLLATE=sv_SE.UTF-8 sort |\
tr -d "\n" | echo `cat` | tee -a ../se_list
aosßäåöø
Much better, but still wrong. The sorting order is stored in the LC_COLLATE files. Oddly, if you look at the German and Swedish LC_COLLATE files, they’re both symbolic links to ../la_LN.US-ASCII/LC_COLLATE, which seems rather meaningless. Let’s try to “fix” that with a dirty hack.
$ sudo ln -fs ../de_DE.ISO8859-1/LC_COLLATE \
/usr/share/locale/de_DE.UTF-8/LC_COLLATE
$ sudo ln -fs ../sv_SE.ISO8859-1/LC_COLLATE \
/usr/share/locale/sv_SE.UTF-8/LC_COLLATE
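As an aside, both orders seen before the hack are what you get from a plain byte-wise comparison, which would fit a collation table meant for US-ASCII. The following is only a sketch to illustrate that, not a statement about how ls or sort are implemented. It feeds the eight characters that appear in the listings through sort in the C locale, once in decomposed form (as returned by ls on HFS+) and once in composed form (as produced by iconv). The function names are made up for this example, and the names are written as octal escapes so the result doesn’t depend on how the terminal normalizes input.

```shell
# The eight names from the listings as decomposed (NFD) UTF-8 bytes:
# ä = a + U+0308, ö = o + U+0308, å = a + U+030A; ß and ø have no decomposition.
names_nfd() {
    printf 'a\314\210\no\314\210\n\303\237\na\314\212\n\303\270\na\no\ns\n'
}

# The same names as precomposed (NFC) UTF-8 bytes.
names_nfc() {
    printf '\303\244\n\303\266\n\303\237\n\303\245\n\303\270\na\no\ns\n'
}

# LC_ALL=C makes sort compare raw bytes instead of using a collation table.
nfd_order=$(names_nfd | LC_ALL=C sort | tr -d '\n')
nfc_order=$(names_nfc | LC_ALL=C sort | tr -d '\n')

printf '%s\n' "$nfd_order"   # aäåoösßø (same order as the plain ls listing)
printf '%s\n' "$nfc_order"   # aosßäåöø (same order as the iconv | sort listing)
```

In decomposed form ä starts with the byte for a plain a, so it sorts right behind a. In composed form every non-ASCII character starts with the byte c3 and therefore lands behind all ASCII letters.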
And one more time all four tries:
$ ls -1 | tr -d "\n" | echo `cat` | tee -a ../de_list
aäåoøösß
$ ls -1 | iconv -f utf-8-mac -t utf-8 | sort | tr -d "\n" |\
echo `cat` | tee -a ../de_list
aåäoöøsß
$ LC_COLLATE=sv_SE.UTF-8 ls -1 | tr -d "\n" | echo `cat` |\
tee -a ../se_list
aäåoösøß
$ ls -1 | iconv -f utf-8-mac -t utf-8 | LC_COLLATE=sv_SE.UTF-8 sort |\
tr -d "\n" | echo `cat` | tee -a ../se_list
aosåäöøß
Revert the symbolic links to their previous state:
$ sudo ln -fs ../la_LN.US-ASCII/LC_COLLATE \
/usr/share/locale/de_DE.UTF-8/LC_COLLATE
$ sudo ln -fs ../la_LN.US-ASCII/LC_COLLATE \
/usr/share/locale/sv_SE.UTF-8/LC_COLLATE
Now let’s see how many unique results we have:
$ cat ../de_list | uniq | gwc -l
4
$ cat ../se_list | uniq | gwc -l
4
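One caveat about the counting: uniq only collapses adjacent duplicate lines, so it would miss duplicates that are separated by other lines. A minimal sketch of a count that doesn’t depend on line order follows. The helper name count_unique is made up for this example, and it uses plain wc rather than gwc. The C locale is forced so that lines are only treated as equal when their bytes are equal.

```shell
# Hypothetical helper: count distinct lines in file $1, wherever the duplicates sit.
count_unique() {
    LC_ALL=C sort -u "$1" | wc -l | tr -d ' '   # tr strips the padding BSD wc adds
}

# Demonstration with a throwaway file in which the duplicates are not adjacent.
tmp=$(mktemp)
printf 'x\ny\nx\n' > "$tmp"

adjacent_only=$(uniq "$tmp" | wc -l | tr -d ' ')
distinct=$(count_unique "$tmp")
rm -f "$tmp"

echo "uniq alone:   $adjacent_only"   # 3, the two x lines are not collapsed
echo "count_unique: $distinct"        # 2
```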
Hm, all results are unique. Let’s compare:
$ cat -n ../de_list
1 aäåoösßø
2 aosßäåöø
3 aäåoøösß
4 aåäoöøsß
$ cat -n ../se_list
1 aäåoösßø
2 aosßäåöø
3 aäåoösøß
4 aosåäöøß
Of the German results, the second one is obviously wrong. The fourth Swedish result seems quite reasonable. Let’s see what Ubuntu says for comparison:
$ cat /etc/lsb-release | grep DISTRIB_DESCRIPTION
DISTRIB_DESCRIPTION="Ubuntu 10.10"
$ locale | grep LC_COLLATE
LC_COLLATE="de_DE.utf8"
$ ls -1 | tr -d "\n" | echo `cat`
aåäoöøsß
$ LC_COLLATE=sv_SE.UTF-8 ls -1 | tr -d "\n" | echo `cat`
aosßåäöø
The German result is identical to the fourth result from Mac OS X. The Swedish result is yet another variation. And finally, what does the Finder say with a German locale:

And with a Swedish locale:

Interestingly, the Finder’s results are identical to Ubuntu’s, which leaves us with the question of how to achieve the Finder/Ubuntu sort order in the Mac OS X shell. Oddly, the German sort order is identical to the fourth result, but the Swedish sort order never matches.
grokking hfs+ character encoding 19. Mar 2011
Linux and (most?) other Unix-like operating systems use the so-called normalization form C (NFC) for its UTF-8 encoding by default but do not enforce this. Darwin, the base of the Macintosh OS enforces normalization form D (NFD), where a few characters are encoded in a different way. On OS X it’s not possible to create NFC UTF-8 filenames because this is prevented at filesystem layer. On HFS+ filenames are internally stored in UTF-16 and when converted back to UTF-8, for the underlying BSD system to be handable, NFD is created. See here for details. I think it was a very bad idea and breaks many things under OS X which expect a normal POSIX conforming system. Anywhere else convmv is able to convert files from NFC to NFD or vice versa which makes interoperability with such systems a lot easier. (Source: convmv)
If you print the German umlaut ä, the composed form is used.
$ printf ä | hexdump
0000000 c3 a4
0000002
If you create a file named ä, the decomposed form is used instead.
$ touch ä
$ ls | tr -d '\n' | hexdump
0000000 61 cc 88
0000003
You can convert the decomposed form into the composed form.
$ ls | iconv -f utf-8-mac -t utf-8 | tr -d '\n' | hexdump
0000000 c3 a4
0000002
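One practical consequence is that the two forms look the same on screen but are different strings to the shell. The following sketch builds both forms of ä from octal escapes, so it doesn’t depend on the terminal or the filesystem, and compares them. The helper name hexbytes is made up for this example.

```shell
# Hypothetical helper: print the bytes of a string as one run of hex digits.
hexbytes() {
    printf '%s' "$1" | od -An -tx1 | tr -d ' \n'
}

composed=$(printf '\303\244')      # ä as a single code point, U+00E4
decomposed=$(printf 'a\314\210')   # a followed by combining diaeresis, U+0308

composed_hex=$(hexbytes "$composed")
decomposed_hex=$(hexbytes "$decomposed")
echo "$composed_hex"     # c3a4
echo "$decomposed_hex"   # 61cc88

# A plain string comparison sees two different byte sequences.
if [ "$composed" = "$decomposed" ]; then
    same=yes
else
    same=no
fi
echo "equal: $same"      # equal: no
```

So a filename taken from ls on HFS+ will not match the same name typed or printed in composed form unless one of them is converted first.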
setting a time limit for curl and wget 10. Mar 2011
To record an HTTP live stream (e.g. a radio broadcast) you can use either curl or wget. Usually you’d want to limit the duration of the recording. curl makes it easy. You can set the duration in seconds with the -m option. This would record 24 hours of live stream:
curl -m 86400 www.server.com > recording.flv
Of course you’d need to adjust the file extension according to the supplied container format. Furthermore, if you’re doing a longer recording session you’d probably want to send the process to the background and record the errors in a logfile:
curl -sS -m 86400 www.server.com > recording.flv 2> error.log &
With wget it works too, but it gets a bit more verbose. It doesn’t provide an option out of the box for setting a time limit, but you can write a little shell wrapper:
#!/bin/bash
wget -nv -o error.log -O recording.flv www.server.com &
sleep 86400
kill $!
If you want to close the terminal window after executing the commands, you might need to run them with nohup to prevent them from being killed. If you use the script more often, you’d probably want to use command line parameters instead of hard-coded values.
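A parameterized variant could look roughly like the sketch below. It keeps the same background/sleep/kill approach but takes the time limit and the command as arguments. The function name run_for is made up for this example and is not part of wget or curl.

```shell
# Hypothetical helper: run_for SECONDS COMMAND [ARGS...]
# Starts the command in the background and stops it after SECONDS.
run_for() {
    seconds=$1
    shift
    "$@" &                          # start the command in the background
    pid=$!
    sleep "$seconds"                # wait for the requested duration
    kill "$pid" 2>/dev/null || true # ignore the error if it already exited
    wait "$pid" 2>/dev/null || true # reap the background process
    return 0
}

# Example call, using the placeholder server name from above:
#   run_for 86400 wget -nv -o error.log -O recording.flv www.server.com
```

Like the original wrapper, this always sleeps for the full duration, even if the download ends early.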
update the mac os x locate database 10. Mar 2011
By default there is no updatedb command on Mac OS X. The proper way to update the locate database is to call locate.updatedb. The directory in which the executable is located is however not in the default search path, but you can simply create a symbolic link with the proper name:
ln -s /usr/libexec/locate.updatedb ~/bin/updatedb
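The link only helps if ~/bin exists and is on the search path, which the command above takes for granted. A small check, sketched with a made-up helper name in_path_list, could look like this:

```shell
# Hypothetical helper: succeed if directory $1 is an entry in the
# colon-separated list $2 (defaults to the current $PATH).
in_path_list() {
    case ":${2-$PATH}:" in
        *":$1:"*) return 0 ;;
        *)        return 1 ;;
    esac
}

if [ ! -d "$HOME/bin" ]; then
    echo "$HOME/bin does not exist yet, create it with: mkdir -p ~/bin"
fi
if ! in_path_list "$HOME/bin"; then
    echo "$HOME/bin is not on the search path"
fi
```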
installing ssh-copy-id on mac os x 10. Mar 2011
By default ssh-copy-id is missing on Mac OS X, but since it’s just a simple bash script you can install it quickly from the official Portable OpenSSH CVS repository:
$ sudo bash -c "cvs -d anoncvs@anoncvs.mindrot.org:/cvs \
get -p openssh/contrib/ssh-copy-id > /usr/local/bin/ssh-copy-id"
$ sudo chmod +x /usr/local/bin/ssh-copy-id