29 August 2007

DUCET not dulcet

So I was working on a little piece of code that was responsible for sorting a list of names—titles of surveys, to be specific. And I noticed that my test data was sorting a little funny. I had a couple of surveys named "Case 4310-1" and "Case 4310/2" and I saw that the former sorted after the latter, even though the crib sheet that I always keep in my Day-Timer says that "-" is ASCII 45 (decimal, hex 2D) and "/" is ASCII 47 (hex 2F). How could this be?

Well, the first thing that I did was explicitly specify the method that I wanted to use to perform the sorting. We're developing in .NET 2.0, and I'm using a generic SortedList to build up the list of survey titles. If I chose, I could specify a strict binary character-by-character sort with this constructor

SortedList sortedList = new SortedList(StringComparer.Ordinal);

and I would get the sort that I "expected." But an ordinal comparer is case-sensitive, and I really want to be able to sort "case 10" and "Case 10" together. Now, fortunately, this code runs only on our servers and there is no provision in the app (at present!) to allow a user to specify a culture (what we called a "locale" in my old UNIX days) for sorting: one size fits all. So instead I wrote

SortedList sortedList = new SortedList(StringComparer.InvariantCultureIgnoreCase);

and I was back to the odd behavior that initially puzzled me. Furthermore, inserting space around the punctuation changed the sort order: "Case 4310 - 1" sorts ahead of "Case 4310 / 2".
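For anyone who wants to poke at this themselves, here is a minimal sketch that sorts the same titles with three comparers. The class and method names (SortDemo, SortWith) are made up for illustration. StringComparer.OrdinalIgnoreCase is a real comparer that I did not use in the app; it is included only as a point of comparison. I have left out the expected output for the culture-sensitive case on purpose, since it can differ between runtimes and platforms.

```csharp
using System;
using System.Collections.Generic;

static class SortDemo
{
    // Hypothetical helper: returns a sorted copy of the titles using the given comparer.
    public static List<string> SortWith(IEnumerable<string> titles, StringComparer comparer)
    {
        List<string> copy = new List<string>(titles);
        copy.Sort(comparer);
        return copy;
    }

    public static void Main()
    {
        string[] titles = { "Case 4310/2", "Case 4310-1", "case 10", "Case 10" };

        // Ordinal: compares raw character values, so '-' (0x2D) sorts before
        // '/' (0x2F) and 'C' (0x43) sorts before 'c' (0x63).
        // Prints: Case 10 | Case 4310-1 | Case 4310/2 | case 10
        Console.WriteLine(string.Join(" | ",
            SortWith(titles, StringComparer.Ordinal).ToArray()));

        // OrdinalIgnoreCase: still character values, but case is ignored, so the
        // two "case 10" titles land together (their relative order is not guaranteed).
        Console.WriteLine(string.Join(" | ",
            SortWith(titles, StringComparer.OrdinalIgnoreCase).ToArray()));

        // InvariantCultureIgnoreCase: culture-sensitive word sort. The result
        // depends on the runtime's collation data, so no output is claimed here.
        Console.WriteLine(string.Join(" | ",
            SortWith(titles, StringComparer.InvariantCultureIgnoreCase).ToArray()));
    }
}
```

If all you need is "ignore case, otherwise sort by character value," the middle option may be enough; it gives up any linguistic niceties that the culture-based comparers provide.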

It was clear that some culture-based behavior was in play—some kind of special treatment of a hyphen in certain cases—and I was perfectly happy to ship the app this way: all we really cared about was getting the names sorted into some usable order. But I was curious: what is the sort order for the "invariant culture"? I prowled around the Microsoft documentation and found little more than this explanation:

InvariantCulture retrieves an instance of the invariant culture. It is associated with the English language but not with any country/region.

and the hand-waving

The .NET Framework uses three distinct ways of sorting: word sort, string sort, and ordinal sort. Word sort performs a culture-sensitive comparison of strings. Certain nonalphanumeric characters might have special weights assigned to them; for example, the hyphen ("-") might have a very small weight assigned to it so that "coop" and "co-op" appear next to each other in a sorted list. String sort is similar to word sort, except that there are no special cases; therefore, all nonalphanumeric symbols come before all alphanumeric characters. Ordinal sort compares strings based on the Unicode values of each element of the string.

and a pointer to the Unicode docs.

So I opened up Unicode Technical Standard #10: Unicode Collation Algorithm, and OMG life is so much more complicated than the old POSIX days when about all you had to know was that "ll" sorted after "l" in Spanish. Consider this tidbit from the introduction:

For example, Swedish and French have clear and different rules on sorting ä (either after z or as an accented character with a secondary difference from a), but neither defines the ordering of other characters such as Ж, ש, ♫, ∞, ◊, or ⌂.

I was taken back to the days when I worked in the music library, where the librarians had to figure out how to shelve a score whose title was in Russian.

I also learned about the concept of equivalence: different sequences of Unicode characters that can be treated exactly the same for collation purposes. For instance, there are three different ways to represent the angstrom symbol, a capital A with a ring.
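The angstrom example can be checked with a few lines of code. This is only a sketch: it uses Unicode normalization (string.Normalize with NormalizationForm.FormC) to show that the three spellings are canonically equivalent, which is related to, but not the same thing as, running the collation algorithm. The class and method names (EquivalenceDemo, SameAfterNfc) are invented for the example.

```csharp
using System;
using System.Text;

static class EquivalenceDemo
{
    // Hypothetical helper: true if the two strings are identical once both
    // are normalized to Unicode Normalization Form C.
    public static bool SameAfterNfc(string a, string b)
    {
        return string.Equals(a.Normalize(NormalizationForm.FormC),
                             b.Normalize(NormalizationForm.FormC),
                             StringComparison.Ordinal);
    }

    public static void Main()
    {
        string angstromSign = "\u212B";  // ANGSTROM SIGN
        string precomposed  = "\u00C5";  // LATIN CAPITAL LETTER A WITH RING ABOVE
        string decomposed   = "A\u030A"; // 'A' followed by COMBINING RING ABOVE

        // As raw character sequences the three spellings differ...
        Console.WriteLine(string.Equals(angstromSign, precomposed, StringComparison.Ordinal));
        Console.WriteLine(string.Equals(precomposed, decomposed, StringComparison.Ordinal));

        // ...but after normalization they compare as the same string.
        Console.WriteLine(SameAfterNfc(angstromSign, precomposed));
        Console.WriteLine(SameAfterNfc(precomposed, decomposed));
    }
}
```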

UTS #10 points to the Default Unicode Collation Element Table (DUCET), a huge text file that provides, for one collation, all the data the sorting algorithm needs for all Unicode characters and their combinations. Here's a snip of what I think is the relevant data for my question:

002A ; [*02FB.0020.0002.002A] # ASTERISK
002B ; [*04B8.0020.0002.002B] # PLUS SIGN
002C ; [*0232.0020.0002.002C] # COMMA
002D ; [*0222.0020.0002.002D] # HYPHEN-MINUS
002E ; [*0266.0020.0002.002E] # FULL STOP
002F ; [*02FF.0020.0002.002F] # SOLIDUS
003A ; [*0241.0020.0002.003A] # COLON
003B ; [*023E.0020.0002.003B] # SEMICOLON
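To make the layout of those lines a little less cryptic, here is a rough sketch of a parser for them. It assumes the simple shape shown in the snip (one code point, one collation element with four dot-separated hex fields) and will reject anything else in the real file. The names (DucetDemo, TryParse) are mine. In UTS #10 the first field in the brackets is the primary weight, and a leading "*" marks the element as "variable," meaning an implementation may choose to ignore it at the primary level. Whether .NET's invariant culture actually uses these weights is a guess on my part, not something the documentation says.

```csharp
using System;
using System.Globalization;
using System.Text.RegularExpressions;

static class DucetDemo
{
    // Matches only the one-code-point, one-element form shown in the snip.
    static readonly Regex Line = new Regex(
        @"^([0-9A-F]{4,6}) ; \[([.*])([0-9A-F]{4})\.[0-9A-F]{4}\.[0-9A-F]{4}\.[0-9A-F]{4,6}\]");

    // Hypothetical helper: extracts the code point, the variable flag, and the
    // primary weight. Returns false if the line has some other shape.
    public static bool TryParse(string line, out int codePoint, out bool variable, out int primary)
    {
        codePoint = 0;
        variable = false;
        primary = 0;

        Match m = Line.Match(line);
        if (!m.Success) return false;

        codePoint = int.Parse(m.Groups[1].Value, NumberStyles.HexNumber);
        variable = m.Groups[2].Value == "*";
        primary = int.Parse(m.Groups[3].Value, NumberStyles.HexNumber);
        return true;
    }

    public static void Main()
    {
        string[] lines =
        {
            "002D ; [*0222.0020.0002.002D] # HYPHEN-MINUS",
            "002F ; [*02FF.0020.0002.002F] # SOLIDUS"
        };

        foreach (string line in lines)
        {
            int cp, primary;
            bool variable;
            if (TryParse(line, out cp, out variable, out primary))
            {
                Console.WriteLine("U+{0:X4} variable={1} primary=0x{2:X4}", cp, variable, primary);
            }
        }
    }
}
```

Read this way, the hyphen's primary weight (0222) is lower than the solidus's (02FF), and both are flagged variable, so the table alone does not obviously account for the hyphenated title sorting after the one with the slash.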

With a little more patience, and some stumbling through the algorithm, I may be able to derive an explanation for the sorting behavior I've observed. But it's too bad that it can't just be reduced to a brief explanation in words, like "ignore a hyphen when it's followed by another alphanumeric character." The problem is just too elaborate now.
