Exploring Unicode sorting on Mac OS X
21. Mar 2011
I was wondering how well Unicode sorting works on Mac OS X and did some experiments. Let’s collect some system information first:
$ sw_vers
ProductName: Mac OS X
ProductVersion: 10.6.6
BuildVersion: 10J567
$ locale | grep LC_COLLATE
LC_COLLATE="de_DE.UTF-8"
So my default collation is German. I created a directory with some dummy files:
$ touch ä ö ß å ø a o s
Let’s see how they list:
$ ls -1 | tr -d "\n" | echo `cat` | tee -a ../de_list
aäåoösßø
Looks halfway reasonable, although I would have expected the ø to be in line with the other o’s. Now let’s pipe the results through the sort utility. HFS+ stores file names in decomposed form, so iconv converts them to precomposed UTF-8 first:
$ ls -1 | iconv -f utf-8-mac -t utf-8 | sort | tr -d "\n" |\
echo `cat` | tee -a ../de_list
aosßäåöø
Hm, it’s different and obviously completely wrong: every non-ASCII letter ends up after the plain ones. Let’s try the Swedish sorting order for a change:
$ LC_COLLATE=sv_SE.UTF-8 ls -1 | tr -d "\n" |\
echo `cat` | tee -a ../se_list
aäåoösßø
Again, way off, and identical to the German listing. With the sort utility:
$ ls -1 | iconv -f utf-8-mac -t utf-8 | LC_COLLATE=sv_SE.UTF-8 sort |\
tr -d "\n" | echo `cat` | tee -a ../se_list
aosßäåöø
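Much better, since å, ä and ö now follow the plain letters as Swedish expects, but still wrong: ä comes before å, and the output is the same as with the German locale. The sorting order is stored in the LC_COLLATE files. Oddly, if you look at the German and Swedish LC_COLLATE files, they’re both symbolic links to ../la_LN.US-ASCII/LC_COLLATE, which seems rather meaningless. Let’s try to “fix” that with a dirty hack: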
$ sudo ln -fs ../de_DE.ISO8859-1/LC_COLLATE \
/usr/share/locale/de_DE.UTF-8/LC_COLLATE
$ sudo ln -fs ../sv_SE.ISO8859-1/LC_COLLATE \
/usr/share/locale/sv_SE.UTF-8/LC_COLLATE
And one more time all four tries:
$ ls -1 | tr -d "\n" | echo `cat` | tee -a ../de_list
aäåoøösß
$ ls -1 | iconv -f utf-8-mac -t utf-8 | sort | tr -d "\n" |\
echo `cat` | tee -a ../de_list
aåäoöøsß
$ LC_COLLATE=sv_SE.UTF-8 ls -1 | tr -d "\n" | echo `cat` |\
tee -a ../se_list
aäåoösøß
$ ls -1 | iconv -f utf-8-mac -t utf-8 | LC_COLLATE=sv_SE.UTF-8 sort |\
tr -d "\n" | echo `cat` | tee -a ../se_list
aosåäöøß
Revert the symbolic links to their previous state:
$ sudo ln -fs ../la_LN.US-ASCII/LC_COLLATE \
/usr/share/locale/de_DE.UTF-8/LC_COLLATE
$ sudo ln -fs ../la_LN.US-ASCII/LC_COLLATE \
/usr/share/locale/sv_SE.UTF-8/LC_COLLATE
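To check that the links really are back in their original state, the link targets can be printed. This is only a minimal sketch: collate_target is a hypothetical helper name, and the default directory is simply the /usr/share/locale path used in the commands above, which may differ on other systems.

```shell
# Print the target of a locale's LC_COLLATE symlink.
# collate_target is a hypothetical helper; the second argument defaults to
# /usr/share/locale, the directory used in the commands above.
collate_target() {
    readlink "${2:-/usr/share/locale}/$1/LC_COLLATE"
}

for loc in de_DE.UTF-8 sv_SE.UTF-8; do
    # Falls back to a note when the path is missing or not a symlink
    # (for example on a system with a different locale layout).
    printf '%s -> %s\n' "$loc" "$(collate_target "$loc" || echo 'not a symlink')"
done
```

After the revert, both lines should show ../la_LN.US-ASCII/LC_COLLATE again.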
Now let’s see how many unique results we have:
$ cat ../de_list | uniq | gwc -l
4
$ cat ../se_list | uniq | gwc -l
4
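One caveat about the counting: uniq only collapses duplicate lines that are adjacent, so the pipeline above would overcount if the same result showed up in, say, the first and the third run. A more robust count sorts first. The following is a small sketch with a hypothetical helper count_unique and a made-up demo file; LC_ALL=C is an assumption to force byte-exact comparison, so that a questionable collation table cannot make different strings compare as equal.

```shell
# count_unique is a hypothetical helper: number of distinct lines in a file.
count_unique() {
    LC_ALL=C sort -u "$1" | awk 'END { print NR }'
}

demo=$(mktemp)
printf '%s\n' first second first > "$demo"

uniq "$demo" | awk 'END { print NR }'   # 3: the two "first" lines are not adjacent
count_unique "$demo"                    # 2
rm -f "$demo"
```

For the two list files here both approaches give the same number, because no two results within a list are the same.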
Hm, all results are unique. Let’s compare:
$ cat -n ../de_list
1 aäåoösßø
2 aosßäåöø
3 aäåoøösß
4 aåäoöøsß
$ cat -n ../se_list
1 aäåoösßø
2 aosßäåöø
3 aäåoösøß
4 aosåäöøß
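The first two results in both lists look like plain byte-order sorting, which would fit a collation table that is really just US-ASCII. The sketch below reproduces them without any locale support. byte_sort is a hypothetical helper, and the octal escapes spell out the UTF-8 bytes so that the precomposed and decomposed forms cannot be mixed up by an editor.

```shell
# Sort lines strictly by byte value and join them into one string.
byte_sort() {
    LC_ALL=C sort | tr -d '\n'
}

# Precomposed (NFC) names, as they look after `iconv -f utf-8-mac -t utf-8`.
# UTF-8 bytes in octal: ä=\303\244 ö=\303\266 ß=\303\237 å=\303\245 ø=\303\270
printf '\303\244\n\303\266\n\303\237\n\303\245\n\303\270\na\no\ns\n' | byte_sort
echo    # prints aosßäåöø, the second result in both lists

# Decomposed (NFD) names, as HFS+ stores them: a or o followed by a combining
# mark (U+0308 diaeresis = \314\210, U+030A ring above = \314\212).
# ß and ø have no decomposition and stay as they are.
printf 'a\314\210\no\314\210\n\303\237\na\314\212\n\303\270\na\no\ns\n' | byte_sort
echo    # prints aäåoösßø (in decomposed form), the first result in both lists
```

So the unpatched ls output is consistent with a byte-wise comparison of the decomposed names, and the unpatched sort output with a byte-wise comparison of the precomposed ones.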
Of the German results the second one is obviously wrong. The fourth Swedish result seems quite reasonable. Let’s see what Ubuntu says for comparison:
$ cat /etc/lsb-release | grep DISTRIB_DESCRIPTION
DISTRIB_DESCRIPTION="Ubuntu 10.10"
$ locale | grep LC_COLLATE
LC_COLLATE="de_DE.utf8"
$ ls -1 | tr -d "\n" | echo `cat`
aåäoöøsß
$ LC_COLLATE=sv_SE.UTF-8 ls -1 | tr -d "\n" | echo `cat`
aosßåäöø
The German result is identical to the fourth result from Mac OS X. The Swedish result is yet another variation. And finally, what does the Finder say with a German locale:

And with a Swedish locale:

Interestingly, the Finder’s results are identical to Ubuntu’s, which leaves us with the question of how to achieve the Finder/Ubuntu sort order in the Mac OS X shell. Oddly, the German sort order is identical to the fourth result, but the Swedish sort order never matches.
Actually, the Swedish åäö aren’t “Umlaute” but true letters of the alphabet, placed after xyz. That’s why the sort order is different.
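The point about å, ä and ö coming after z can be illustrated without a working Swedish locale by a decorate-sort-undecorate trick. This is only a rough sketch, not a replacement for real collation: swedish_sort is a hypothetical helper, the sample words are made up, and it assumes lowercase, precomposed input that does not contain the "{" character (which is used as a placeholder because it is the byte right after "z").

```shell
# Sort lines so that å, ä, ö come after z, in that order.
swedish_sort() {
    # Decorate: replace å, ä, ö with placeholders that sort after "z" in byte order
    LC_ALL=C sed -e 's/å/{1/g' -e 's/ä/{2/g' -e 's/ö/{3/g' |
        LC_ALL=C sort |
        # Undecorate: put the original letters back
        LC_ALL=C sed -e 's/{1/å/g' -e 's/{2/ä/g' -e 's/{3/ö/g'
}

printf '%s\n' zebra ärta apa öl åka | swedish_sort
```

This prints apa, zebra, åka, ärta, öl, one per line. Letters such as ø and ß are left to plain byte order here, so the sketch does not reproduce the full Finder/Ubuntu result.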