12 January 2011

Close enough

Michael Donohoe describes an update New York Times's Emphasis feature, which enables readers and bloggers to deep link to individual paragraphs within a story.

What's most interesting is the solution that the team devised to automatically assign durable keys to paragraphs of text, in a dynamic news environment when grafs are being added, subtracted, and moved around post-publication.

Each six-character key refers to a specific paragraph. Keys are generated by breaking a paragraph into sentences. Then, using the first and last sentences (which are sometimes the same), the key-generation code takes the first character from the first three words of each sentence.

* * *

For example, let’s look at this paragraph — the relevant characters for key generation are [set off as code]:

Last summer, after 10 years of debate and interagency wrangling, a prestigious committee from the National Academy of Sciences gave highest priority among big space projects in the coming decade to a satellite telescope that would take precise measure of dark energy, as it is known, and also look for planets beyond our solar system. The proposed project goes by the slightly unwieldy acronym Wfirst, for Wide-Field Infrared Survey Telescope.


The result is this key: LsaTpp


The matching process uses two means to perform a fuzzy equality match between the key derived from the current story text and the key on the hyperlink. If there is no perfect match, the 3-character partial keys for the first and last sentence are compared, in hopes of a match on either one. In the event that rewriting has changed the text that affects both parts of the keys, the Levenshtein distance between the linking key and the text's key is computed. If it's small enough, it's considered to be a match.

The Levenshtein metric makes perfect sense in this context, because the two keys have moved away from one another precisely because of insertion, deletion, and substitution. Andrew Hedges supplies a web app and JavaScript source for computing Levenshtein distances.

My only qualm with this strategy is that the keys it generates are not evenly distributed. There will be a lot of 3-character partial keys that start with "T" and "A."

No comments: