grokking HFS+ character encoding 19. Mar 2011

Linux and (most?) other Unix-like operating systems use the so called normalization form C (NFC) for its UTF-8 encoding by default but do not enforce this. Darwin, the base of the Macintosh OS enforces normalization form D (NFD), where a few characters are encoded in a different way. On OS X it’s not possible to create NFC UTF-8 filenames because this is prevented at filesystem layer. On HFS+ filenames are internally stored in UTF-16 and when converted back to UTF-8, for the underlying BSD system to be handable, NFD is created. See here for details. I think it was a very bad idea and breaks many things under OS X which expect a normal POSIX conforming system. Anywhere else convmv is able to convert files from NFC to NFD or vice versa which makes interoperability with such systems a lot easier. (Source: convmv)

If you print the German umlaut ä, the composed form is used: the single code point U+00E4, encoded in UTF-8 as the two bytes c3 a4.

$ printf ä | hexdump
0000000 c3 a4                                          
0000002

If you create a file named ä, HFS+ stores the name in decomposed form instead: the letter a (61) followed by the combining diaeresis U+0308 (cc 88).

$ touch ä
$ ls | tr -d '\n' | hexdump
0000000 61 cc 88                                       
0000003
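
The two byte sequences can also be reproduced without touching the filesystem, which makes it easier to see that a plain string comparison treats them as different names. This is a minimal sketch, assuming a POSIX shell with `printf` and `od`; the variable names `nfc` and `nfd` are only chosen for illustration.

```shell
# Composed form (NFC): U+00E4, UTF-8 bytes c3 a4 (octal 303 244)
nfc=$(printf '\303\244')
# Decomposed form (NFD): "a" followed by U+0308, UTF-8 bytes cc 88 (octal 314 210)
nfd=$(printf 'a\314\210')

# Dump both strings byte by byte
printf '%s' "$nfc" | od -An -tx1
printf '%s' "$nfd" | od -An -tx1

# A byte-wise comparison does not know the two forms render identically
if [ "$nfc" = "$nfd" ]; then
  echo "same"
else
  echo "different"
fi
```

Both strings render as ä in a terminal, but the comparison takes the "different" branch, which is why tools that compare filenames byte by byte can stumble over names coming from HFS+.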

You can convert the decomposed form back into the composed form with iconv, using the utf-8-mac encoding that the iconv shipped with OS X provides.

$ ls | iconv -f utf-8-mac -t utf-8 | tr -d '\n' | hexdump
0000000 c3 a4                                          
0000002
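
Since `utf-8-mac` is not offered by every iconv build, a script that has to run on more than one system may want to check for it first. The following is a rough sketch under that assumption; the function name `to_nfc` is hypothetical, and passing the input through unchanged when the encoding is missing is just one possible fallback.

```shell
# to_nfc: read filenames on stdin and write them in composed form (NFC),
# provided the local iconv knows the utf-8-mac encoding.
to_nfc() {
  # Collect the list of supported encodings, lowercased for matching
  case "$(iconv -l 2>/dev/null | tr 'A-Z' 'a-z')" in
    *utf-8-mac*)
      iconv -f utf-8-mac -t utf-8
      ;;
    *)
      # Assumption: no utf-8-mac available, so leave the bytes untouched
      echo "to_nfc: iconv has no utf-8-mac, passing input through" >&2
      cat
      ;;
  esac
}

# Feed the decomposed ä (61 cc 88) through the function and dump the result
printf 'a\314\210' | to_nfc | od -An -tx1
```

Where `utf-8-mac` is available this should show the composed bytes c3 a4, as in the iconv example above; elsewhere the input comes back unchanged and a warning goes to stderr. In a pipeline it would be used as `ls | to_nfc`.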
 
