SVN + OS X + “Umlaute”

March 23rd, 2010 by nils

The Mac OS file system, Mac OS Extended (Journaled), stores umlaut characters in decomposed form, as two separate code points (e.g. ‘a’ followed by the combining diaeresis ‘¨’). This is referred to as NFD, or Normalization Form D (canonical decomposition); see “Unicode Standard Annex #15 – Unicode Normalization Forms”, http://unicode.org/reports/tr15/#Norm_Forms.
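To make the difference concrete, here is a small sketch using Python's standard unicodedata module. It only illustrates the two normalization forms for a single ‘ä’; the helper name code_points is my own and not part of any library.

```python
import unicodedata

# 'ä' as a single precomposed code point (this is the NFC form).
composed = "\u00e4"

# The same character in NFD: base letter 'a' plus U+0308 COMBINING DIAERESIS.
decomposed = unicodedata.normalize("NFD", composed)


def code_points(text):
    """Hypothetical helper: list the code points of a string as U+XXXX."""
    return [f"U+{ord(ch):04X}" for ch in text]


print(code_points(composed))    # ['U+00E4']
print(code_points(decomposed))  # ['U+0061', 'U+0308']

# Both strings render as the same character, but a plain comparison fails.
print(composed == decomposed)   # False

# Normalizing both sides to the same form makes them compare equal.
print(unicodedata.normalize("NFC", decomposed) == composed)  # True
```

Both strings look identical on screen, which is exactly why this kind of mismatch is hard to spot.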

This behavior can have unfortunate side effects in applications. Remote applications that identify files by path and interact with different operating systems are especially likely to run into problems.

I came across this when I tried to access a Subversion repository that contained file names with German umlauts from my Mac. I am running Subversion 1.6.5, and when I check out a file with an umlaut in its name, executing “svn stat” lists the file twice: once as missing (with a ‘!’) and once as unversioned (with a ‘?’). A search in the CollabNet discussion forums finally confirmed that this is a known issue. The following links provide some documentation:
http://www.opensimwiki.net/index.php/SVN
http://svn.apache.org/repos/asf/subversion/trunk/notes/unicode-composition-for-filenames
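The double listing can be reproduced in miniature. The sketch below is not how Subversion is implemented; it only assumes that the repository records the name in composed form while the file system hands back the decomposed form, and that the two lists are compared without normalization. The file name and the functions nfc and classify are made up for illustration.

```python
import unicodedata


def nfc(name):
    """Normalize a file name to NFC (composed form)."""
    return unicodedata.normalize("NFC", name)


def classify(versioned_names, names_on_disk):
    """Compare two name lists naively and with normalization.

    Returns (missing, unversioned, missing_after_normalizing).
    """
    versioned = set(versioned_names)
    on_disk = set(names_on_disk)
    # Naive comparison: names must match code point for code point.
    missing = sorted(versioned - on_disk)
    unversioned = sorted(on_disk - versioned)
    # Normalization-aware comparison: bring both sides to NFC first.
    still_missing = sorted({nfc(n) for n in versioned} - {nfc(n) for n in on_disk})
    return missing, unversioned, still_missing


# Assumption: the repository holds the composed name ...
versioned_names = ["M\u00fcller.txt"]
# ... and the Mac file system returns the decomposed one.
names_on_disk = [unicodedata.normalize("NFD", "M\u00fcller.txt")]

missing, unversioned, still_missing = classify(versioned_names, names_on_disk)
print("!", missing)        # the versioned name is not found on disk
print("?", unversioned)    # the name on disk is not known to the repository
print(still_missing)       # [] once both sides are normalized
```

One file ends up in both the ‘!’ and the ‘?’ list, which matches what “svn stat” shows; normalizing both sides before comparing makes the mismatch disappear.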

However, the Subversion issue is just one specific bug. For application developers, the important term to keep in mind is Unicode equivalence. The Wikipedia article (http://en.wikipedia.org/wiki/Unicode_equivalence) mentions a bug in Samba caused by different representations of Unicode characters.

So, next time you come across an issue that involves a Mac and umlauts, Unicode equivalence might be the term to look for.