Exploring Unicode sorting on Mac OS X

21 Mar 2011

I was wondering how well Unicode sorting works on Mac OS X, so I ran some experiments. Let’s collect some system information first:

$ sw_vers
ProductName:    Mac OS X
ProductVersion: 10.6.6
BuildVersion:   10J567
$ locale | grep LC_COLLATE
LC_COLLATE="de_DE.UTF-8"

So my default collation is German. I created a directory with some dummy files:

$ touch ä ö ß å ø a o s

Let’s see how they list:

$ ls -1 | tr -d "\n" | echo `cat` | tee -a ../de_list
aäåoösßø
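This order happens to match a plain byte-wise sort of decomposed (NFD) file names. NFD is the form HFS+ uses on disk, and it is the reason the later commands pipe through `iconv -f utf-8-mac`. Whether `ls` really does this here is an assumption; the sketch below only shows that the two orders coincide. The names are built from octal escapes so the decomposed spelling is explicit, and the variable names are made up for illustration.

```shell
# Decomposed (NFD) names: a base letter followed by a combining mark.
# \314\210 is U+0308 (combining diaeresis), \314\212 is U+030A (combining ring).
# ß (\303\237) and ø (\303\270) have no decomposition and stay single characters.
nfd_names=$(printf 'a\314\210\no\314\210\n\303\237\na\314\212\n\303\270\na\no\ns\n')

# LC_ALL=C makes sort compare raw bytes, independent of any collation table.
nfd_order=$(printf '%s\n' "$nfd_names" | LC_ALL=C sort | tr -d '\n')
echo "$nfd_order"   # letters appear in the order a ä å o ö s ß ø
```

Each accented letter starts with the byte of its base letter, so it sorts directly after it, while ß and ø start with 0xC3 and end up last.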

Looks halfway reasonable, although I would have expected the ø to be in line with the other o’s. Now let’s pipe the results through the sort utility:

$ ls -1 | iconv -f utf-8-mac -t utf-8 | sort | tr -d "\n" |\
echo `cat` | tee -a ../de_list
aosßäåöø
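This second order is what a byte-wise comparison of the precomposed (NFC) UTF-8 names produces, which would fit a collation table that knows nothing about non-ASCII letters. A minimal sketch, forcing the C locale to make the byte comparison explicit (the variable name is illustrative):

```shell
# In UTF-8, a, o and s are single bytes (0x61, 0x6F, 0x73), while ß, ä, å, ö
# and ø all start with 0xC3, so a byte-wise sort puts each of them after "s".
byte_order=$(printf '%s\n' ä ö ß å ø a o s | LC_ALL=C sort | tr -d '\n')
echo "$byte_order"   # prints aosßäåöø
```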

Hm, it’s different, and clearly wrong for German: every accented letter lands after the plain ones. Let’s try the Swedish sorting order for a change:

$ LC_COLLATE=sv_SE.UTF-8 ls -1 | tr -d "\n" |\
echo `cat` | tee -a ../se_list
aäåoösßø

Again way off, and identical to the German listing. With the sort utility:

$ ls -1 | iconv -f utf-8-mac -t utf-8 | LC_COLLATE=sv_SE.UTF-8 sort |\
tr -d "\n" | echo `cat` | tee -a ../se_list
aosßäåöø

Much better, but still wrong, and identical to the German sort output. The sorting order is defined by the LC_COLLATE files. Oddly, the German and Swedish LC_COLLATE files are both symbolic links to ../la_LN.US-ASCII/LC_COLLATE, which seems rather meaningless for a UTF-8 locale. Let’s try to “fix” that with a dirty hack:

$ sudo ln -fs ../de_DE.ISO8859-1/LC_COLLATE \
/usr/share/locale/de_DE.UTF-8/LC_COLLATE
$ sudo ln -fs ../sv_SE.ISO8859-1/LC_COLLATE \
/usr/share/locale/sv_SE.UTF-8/LC_COLLATE
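Before rerunning the listings it may be worth confirming where each link now points. A small sketch: `collate_target` is a hypothetical helper, and its optional second argument (the locale root) exists only so it can be tried against a scratch directory.

```shell
# Print the target of a locale's LC_COLLATE symlink.
# $1: locale name, $2: locale root (defaults to /usr/share/locale).
collate_target() {
  readlink "${2:-/usr/share/locale}/$1/LC_COLLATE" || echo "(not a symlink or missing)"
}

for loc in de_DE.UTF-8 sv_SE.UTF-8; do
  printf '%s -> %s\n' "$loc" "$(collate_target "$loc")"
done
```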

And once more, all four attempts:

$ ls -1 | tr -d "\n" | echo `cat` | tee -a ../de_list
aäåoøösß
$ ls -1 | iconv -f utf-8-mac -t utf-8 | sort | tr -d "\n" |\
echo `cat` | tee -a ../de_list
aåäoöøsß
$ LC_COLLATE=sv_SE.UTF-8 ls -1 | tr -d "\n" | echo `cat` |\
tee -a ../se_list
aäåoösøß
$ ls -1 | iconv -f utf-8-mac -t utf-8 | LC_COLLATE=sv_SE.UTF-8 sort |\
tr -d "\n" | echo `cat` | tee -a ../se_list
aosåäöøß

Revert the symbolic links to their previous state:

$ sudo ln -fs ../la_LN.US-ASCII/LC_COLLATE \
/usr/share/locale/de_DE.UTF-8/LC_COLLATE
$ sudo ln -fs ../la_LN.US-ASCII/LC_COLLATE \
/usr/share/locale/sv_SE.UTF-8/LC_COLLATE

Now let’s see how many unique results we have:

$ sort -u ../de_list | gwc -l
4
$ sort -u ../se_list | gwc -l
4

Hm, all four results in each list are distinct. Let’s compare:

$ cat -n ../de_list
     1  aäåoösßø
     2  aosßäåöø
     3  aäåoøösß
     4  aåäoöøsß
$ cat -n ../se_list
     1  aäåoösßø
     2  aosßäåöø
     3  aäåoösøß
     4  aosåäöøß
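The comparison across the two lists can also be done mechanically. The sketch below recreates both lists in a scratch directory (contents copied from the output above, file names illustrative) and uses `comm` to print the results that the German and Swedish runs have in common.

```shell
workdir=$(mktemp -d)
printf '%s\n' 'aäåoösßø' 'aosßäåöø' 'aäåoøösß' 'aåäoöøsß' > "$workdir/de_list"
printf '%s\n' 'aäåoösßø' 'aosßäåöø' 'aäåoösøß' 'aosåäöøß' > "$workdir/se_list"

# comm needs sorted input; LC_ALL=C keeps the order byte-based and predictable.
LC_ALL=C sort "$workdir/de_list" > "$workdir/de_sorted"
LC_ALL=C sort "$workdir/se_list" > "$workdir/se_sorted"

# -12 suppresses the lines unique to either file, leaving only the shared ones.
shared=$(LC_ALL=C comm -12 "$workdir/de_sorted" "$workdir/se_sorted")
echo "$shared"   # two lines: aosßäåöø and aäåoösßø

rm -rf "$workdir"
```

Only the first two results are shared, that is, the runs made before the symbolic links were changed.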

Of the German results, the second one is obviously wrong. The fourth Swedish result seems quite reasonable. Let’s see what Ubuntu says for comparison:

$ cat /etc/lsb-release | grep DISTRIB_DESCRIPTION
DISTRIB_DESCRIPTION="Ubuntu 10.10"
$ locale | grep LC_COLLATE
LC_COLLATE="de_DE.utf8"
$ ls -1 | tr -d "\n" | echo `cat`
aåäoöøsß
$ LC_COLLATE=sv_SE.UTF-8 ls -1 | tr -d "\n" | echo `cat`
aosßåäöø

The German result is identical to the fourth result from Mac OS X. The Swedish result is yet another variation. And finally, here is what the Finder shows with a German locale:

[Screenshot: German sort order in the Finder]

And with a swedish locale:

[Screenshot: Swedish sort order in the Finder]

Interestingly, the Finder’s results are identical to Ubuntu’s, which leaves the question of how to achieve the Finder/Ubuntu sort order in the Mac OS X shell. Oddly, the German order matches the fourth Mac OS X result (sort with the ISO8859-1 collation file linked in), but the Swedish order never matches any of the shell results.
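One crude way to get there for this particular experiment is a decorate-sort-undecorate pass with a hand-written alphabet. This is only a sketch under several assumptions: it handles single-letter names only, the alphabet string is copied from the Ubuntu sv_SE result above rather than taken from any collation table, `rank_of` is a hypothetical helper, and it relies on the word splitting of a POSIX sh or bash. In Swedish, å, ä and ö are letters of their own that follow z, which is what the explicit order encodes.

```shell
# Target order for the letters used in this experiment.
alphabet='a o s ß å ä ö ø'

# Print the 1-based position of "$1" in the alphabet, or 0 if it is absent.
rank_of() {
  rank=0
  i=0
  for letter in $alphabet; do
    i=$((i + 1))
    if [ "$letter" = "$1" ]; then
      rank=$i
    fi
  done
  echo "$rank"
}

# Decorate each name with its rank, sort numerically, then strip the rank.
swedish_order=$(
  for name in ä ö ß å ø a o s; do
    printf '%s\t%s\n' "$(rank_of "$name")" "$name"
  done | LC_ALL=C sort -n | cut -f2 | tr -d '\n'
)
echo "$swedish_order"   # prints aosßåäöø
```

A real solution would need a proper collation implementation; this only shows that the desired order is easy to state once the alphabet is spelled out.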

 

Comments (1)

  1. x 156 days later

    Actually, the Swedish åäö ain’t “umlaute” but true letters of the alphabet. Placed behind xyz. That’s why the sort order is different.
