29 August 2007

DUCET not dulcet

So I was working on a little piece of code that was responsible for sorting a list of names—titles of surveys, to be specific. And I noticed that my test data was sorting a little funny. I had a couple of surveys named "Case 4310-1" and "Case 4310/2" and I saw that the former sorted after the latter, even though the crib sheet that I always keep in my Day-Timer says that "-" is ASCII 45 (decimal, hex 2D) and "/" is ASCII 47 (hex 2F). How could this be?

Well, the first thing that I did was explicitly specify the method that I wanted to use to perform the sorting. We're developing in .NET 2.0, and I'm using a generic SortedList to build up the list of survey titles. If I chose, I could specify a strict binary character-by-character sort with this constructor

SortedList sortedList = new SortedList(StringComparer.Ordinal);

and I would get the sort that I "expected." But an ordinal comparer is case-sensitive, and I really want to be able to sort "case 10" and "Case 10" together. Now, fortunately, this code runs only on our servers and there is no provision in the app (at present!) to allow a user to specify a culture (what we called a "locale" in my old UNIX days) for sorting: one size fits all. So instead I wrote

SortedList sortedList = new SortedList(StringComparer.InvariantCultureIgnoreCase);

and I was back to the odd behavior that initially puzzled me. Furthermore, inserting space around the punctuation changed the sort order: "Case 4310 - 1" sorts ahead of "Case 4310 / 2".
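For a quick check of the "expected" ordering outside .NET, here is a minimal Python sketch. Python compares strings code point by code point, which stands in here for an ordinal sort; sorting on a case-folded key is only a rough analogue of a case-insensitive ordinal comparison and does not apply the invariant culture's rules. The function names and the sample list are made up for illustration.

```python
# Sample titles taken from the post, plus the two "case 10" variants.
titles = ["Case 4310/2", "case 10", "Case 4310-1", "Case 10"]


def ordinal_sort(names):
    """Sort by raw code point values: '-' (0x2D) comes before '/' (0x2F)."""
    return sorted(names)


def ordinal_ignore_case_sort(names):
    """Sort by code points after case folding, so 'case 10' and 'Case 10' sit together."""
    return sorted(names, key=str.casefold)


if __name__ == "__main__":
    print(ordinal_sort(titles))
    print(ordinal_ignore_case_sort(titles))
```

In the plain code point sort, "Case 4310-1" lands ahead of "Case 4310/2", but "case 10" is pushed to the end because lowercase letters have higher code points than uppercase ones. The case-folded variant keeps the two "10" titles adjacent while still comparing punctuation by code point.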

It was clear that some culture-based behavior was in play—some kind of special treatment of a hyphen in certain cases—and I was perfectly happy to ship the app this way: all we really cared about was getting the names sorted into some usable order. But I was curious: what is the sort order for the "invariant culture"? I prowled around the Microsoft documentation and found little more than this explanation:

InvariantCulture retrieves an instance of the invariant culture. It is associated with the English language but not with any country/region.

and the hand-waving

The .NET Framework uses three distinct ways of sorting: word sort, string sort, and ordinal sort. Word sort performs a culture-sensitive comparison of strings. Certain nonalphanumeric characters might have special weights assigned to them; for example, the hyphen ("-") might have a very small weight assigned to it so that "coop" and "co-op" appear next to each other in a sorted list. String sort is similar to word sort, except that there are no special cases; therefore, all nonalphanumeric symbols come before all alphanumeric characters. Ordinal sort compares strings based on the Unicode values of each element of the string.

and a pointer to the Unicode docs.
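The "coop"/"co-op" remark can be turned into a toy model. The Python sketch below assumes, purely for illustration, that a word sort ignores hyphens on a first pass and uses the original string only to break ties. That assumption is a guess at the spirit of the documentation, not the actual .NET algorithm, and the function names are hypothetical.

```python
def toy_word_sort_key(title):
    """Build a two-part key: the case-folded title without hyphens, then the title itself.

    Assumption: hyphens carry no weight on the first pass. A real word sort
    uses culture-specific weight tables instead of this shortcut.
    """
    return (title.casefold().replace("-", ""), title)


def toy_word_sort(names):
    """Sort names with the toy hyphen-ignoring key."""
    return sorted(names, key=toy_word_sort_key)


if __name__ == "__main__":
    print(toy_word_sort(["co-op", "coat", "coop"]))
    print(toy_word_sort(["Case 4310-1", "Case 4310/2"]))
    print(toy_word_sort(["Case 4310 / 2", "Case 4310 - 1"]))
```

Under this toy model "Case 4310-1" is compared as if it were "case 43101", and since "/" has a lower code point than "1", "Case 4310/2" comes first. With spaces around the punctuation, the extra space left behind by the dropped hyphen sorts ahead of "/", so "Case 4310 - 1" comes first. Both results match the observed orderings, though that agreement does not show that .NET works this way.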

So I opened up Unicode Technical Standard #10: Unicode Collation Algorithm, and OMG life is so much more complicated than the old POSIX days when about all you had to know was that "ll" sorted after "l" in Spanish. Consider this tidbit from the introduction:

For example, Swedish and French have clear and different rules on sorting ä (either after z or as an accented character with a secondary difference from a), but neither defines the ordering of other characters such as Ж, ש, ♫, ∞, ◊, or ⌂.

I was taken back to the days when I worked in the music library, where the librarians had to figure out how to shelve a score whose title was in Russian.

I also learned about the concept of equivalence: different sequences of Unicode characters that can be treated exactly the same for collation purposes. For instance, there are three different ways to represent the angstrom symbol, a capital A with a ring.
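A small Python sketch of that equivalence, assuming the three representations meant here are ANGSTROM SIGN (U+212B), the precomposed LATIN CAPITAL LETTER A WITH RING ABOVE (U+00C5), and a plain "A" followed by COMBINING RING ABOVE (U+030A). It uses Unicode normalization from the standard library rather than a full collation algorithm, so it shows only that the three sequences are canonically equivalent, not how a collator weighs them.

```python
import unicodedata

# Assumed set of three representations; the dictionary labels are informal.
angstrom_forms = {
    "angstrom sign": "\u212B",
    "precomposed A with ring": "\u00C5",
    "A plus combining ring": "A\u030A",
}


def nfc(text):
    """Return the NFC (canonical composed) form of text."""
    return unicodedata.normalize("NFC", text)


# As raw strings the three forms are all different...
distinct_raw = set(angstrom_forms.values())
# ...but they collapse to a single string after normalization.
distinct_normalized = {nfc(form) for form in angstrom_forms.values()}

if __name__ == "__main__":
    print(len(distinct_raw), len(distinct_normalized))
```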

UTS #10 points to the Default Unicode Collation Element Table (DUCET), a huge text file that provides, for one collation, all the data to the sorting algorithm for all Unicode characters and their combinations. Here's a snip of what I think is the relevant data for my question:

002A ; [*02FB.0020.0002.002A] # ASTERISK
002B ; [*04B8.0020.0002.002B] # PLUS SIGN
002C ; [*0232.0020.0002.002C] # COMMA
002D ; [*0222.0020.0002.002D] # HYPHEN-MINUS
002E ; [*0266.0020.0002.002E] # FULL STOP
002F ; [*02FF.0020.0002.002F] # SOLIDUS
003A ; [*0241.0020.0002.003A] # COLON
003B ; [*023E.0020.0002.003B] # SEMICOLON
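The entries above can be read mechanically. In this version of the table format, each line gives a code point, then a bracketed collation element whose leading "*" marks it as variable (a "." would mark it as non-variable), followed by dot-separated hexadecimal weights, with the primary weight first. The Python sketch below parses just these eight lines and orders the characters by primary weight. It handles only single-code-point, single-element lines like the ones quoted, and the names in it are hypothetical. It says nothing about whether the .NET comparer uses these weights.

```python
import re

DUCET_SNIPPET = """\
002A ; [*02FB.0020.0002.002A] # ASTERISK
002B ; [*04B8.0020.0002.002B] # PLUS SIGN
002C ; [*0232.0020.0002.002C] # COMMA
002D ; [*0222.0020.0002.002D] # HYPHEN-MINUS
002E ; [*0266.0020.0002.002E] # FULL STOP
002F ; [*02FF.0020.0002.002F] # SOLIDUS
003A ; [*0241.0020.0002.003A] # COLON
003B ; [*023E.0020.0002.003B] # SEMICOLON
"""

# code point ; [flag weights] # name
ENTRY = re.compile(
    r"^(?P<cp>[0-9A-F]+)\s*;\s*\[(?P<flag>[.*])(?P<weights>[0-9A-F.]+)\]\s*#\s*(?P<name>.+)$"
)


def parse_entry(line):
    """Parse one simple table line into a dict of character, flag, weights, and name."""
    match = ENTRY.match(line.strip())
    if match is None:
        raise ValueError(f"unrecognized line: {line!r}")
    return {
        "char": chr(int(match["cp"], 16)),
        "variable": match["flag"] == "*",
        "weights": tuple(int(w, 16) for w in match["weights"].split(".")),
        "name": match["name"],
    }


entries = [parse_entry(line) for line in DUCET_SNIPPET.splitlines()]
# Order the characters by their primary weight (the first number in the brackets).
by_primary = sorted(entries, key=lambda entry: entry["weights"][0])
order = "".join(entry["char"] for entry in by_primary)

if __name__ == "__main__":
    for entry in by_primary:
        print(f"{entry['weights'][0]:04X}  {entry['char']}  {entry['name']}")
```

By primary weight alone, these entries put HYPHEN-MINUS (0222) ahead of SOLIDUS (02FF). All eight are flagged as variable, though, and UTS #10 lets an implementation treat variable elements specially, for example by ignoring them at the first levels of comparison, so the primary weights do not settle the question by themselves.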

With a little more patience, and some stumbling through the algorithm, I may be able to derive an explanation for the sorting behavior I've observed. But it's too bad that it can't just be reduced to a brief explanation in words, like "ignore a hyphen when it's followed by another alphanumeric character." The problem is just too elaborate now.

17 August 2007


Jeff Atwood critiques Yahoo!'s Thirteen Simple Rules for Speeding Up Your Web Site.
There's some good advice here, but there's also a lot of advice that only makes sense if you run a website that gets millions of unique users per day.

16 August 2007

How not to do it

Mark Liberman points to his own irreverent guide to using the web interface to FacilityFocus, an automated system for submitting and tracking building maintenance requests on the University of Pennsylvania campus. The irreverence is necessary, because the interface is appallingly bad. Consider Step 2 in Liberman's guide, which helps you get past a search page that also gateways the "add new work request" page:
You may be tempted to fill out some of the 23 temptingly-empty text boxes on this screen, with information like e-Mail (that's easy) and "Desired Date" (that one's a little personal, don't you think?) -- BUT DON'T! This is a search screen, and you've got nothing to search for yet, since you haven't actually gotten your work request into the system.

Yes, I know that it says "Search Criteria Required!" at the top of the screen, in red letters, with an exclamation point. But that's just to fool you into thinking that search criteria are required. In fact, the only thing that's required (or even permitted!) for you to do at this point is to click on the large button labelled "Insert" at the top of the page...

It only gets worse. The popups for specifying what the problem is, and in what building, are populated with pages and pages of codes, and apparently these can't be sorted (it's a little hard to tell from the screen shots). A column labelled PROBLEM DESC has unhelpful entries like "HVAC" (Liberman glosses this abbreviation for the users), "HOT," and "COLD." Does "HOT" mean "it's too hot" or "I need more heat"?

Liberman, who wrote the guide as if it were a handbook for an adventure/role-playing game, comments:
But adventure-game interaction is really the wrong metaphor. The designers of good adventure games have a excellent idea of what their target users are like, and they've carefully planned and tested for their users' reactions to each display and each event in the game. The obscurity and difficulty of the interaction is carefully crafted to be suspenseful, entertaining -- and eventually overcome. In contrast, an interface like FacilityFocus seems to be "mind blind". The obscurity and difficulty of the interaction is a random result of an apparent failure to try to model user reactions at all.