31 January 2011

Fun with MySQL

The problem: Given a pair of tables in a header-detail relationship (or any one-to-many relationship), write a query that returns one row for each header row, including a column that is a comma-delimited vector of the subkeys that are found in the detail rows (for instance, line item numbers). For example:

Table INVOICE

Invoice  Date
-------  ----
AZ456    1-Jan-2010
AZ457    2-Jan-2010
AZ459    2-Jan-2010

Table INVOICE_LINE

Invoice  Line  Part
-------  ----  ----
AZ456    1     ZQX3
AZ456    2     WF612
AZ457    1     TY301
AZ459    3     TY301

Colleague Jared gave me this screwdriver from the MySQL toolbox: the GROUP_CONCAT() function.

SELECT I.Invoice, I.Date, GROUP_CONCAT(IL.Line) AS Vector
FROM INVOICE I
INNER JOIN INVOICE_LINE IL ON IL.Invoice = I.Invoice
GROUP BY I.Invoice

The results:

Invoice  Date        Vector
-------  ----        ------
AZ456    1-Jan-2010  1,2
AZ457    2-Jan-2010  1
AZ459    2-Jan-2010  3

In particular, we used this query to produce a table of photo image assets retrieved from the CMS (that met certain search criteria against the metadata like caption, photographer, agency). Each image asset has one or more crops in various aspect ratios (standard 4:3, square, wide 16:9). The search results table includes a vector of which crops are available for each image asset. In the UI, this vector is rendered as a nifty, compact row of gray and black icons designed by colleague Vincent.

This technique would work for any other attributes of the detail table, not just keys, in which case you may want to add the DISTINCT keyword to GROUP_CONCAT's argument. The SEPARATOR clause can overide the default comma delimiter.

14 January 2011

12 January 2011

Close enough

Michael Donohoe describes an update New York Times's Emphasis feature, which enables readers and bloggers to deep link to individual paragraphs within a story.

What's most interesting is the solution that the team devised to automatically assign durable keys to paragraphs of text, in a dynamic news environment when grafs are being added, subtracted, and moved around post-publication.

Each six-character key refers to a specific paragraph. Keys are generated by breaking a paragraph into sentences. Then, using the first and last sentences (which are sometimes the same), the key-generation code takes the first character from the first three words of each sentence.

* * *

For example, let’s look at this paragraph — the relevant characters for key generation are [set off as code]:

Last summer, after 10 years of debate and interagency wrangling, a prestigious committee from the National Academy of Sciences gave highest priority among big space projects in the coming decade to a satellite telescope that would take precise measure of dark energy, as it is known, and also look for planets beyond our solar system. The proposed project goes by the slightly unwieldy acronym Wfirst, for Wide-Field Infrared Survey Telescope.


The result is this key: LsaTpp


The matching process uses two means to perform a fuzzy equality match between the key derived from the current story text and the key on the hyperlink. If there is no perfect match, the 3-character partial keys for the first and last sentence are compared, in hopes of a match on either one. In the event that rewriting has changed the text that affects both parts of the keys, the Levenshtein distance between the linking key and the text's key is computed. If it's small enough, it's considered to be a match.

The Levenshtein metric makes perfect sense in this context, because the two keys have moved away from one another precisely because of insertion, deletion, and substitution. Andrew Hedges supplies a web app and JavaScript source for computing Levenshtein distances.

My only qualm with this strategy is that the keys it generates are not evenly distributed. There will be a lot of 3-character partial keys that start with "T" and "A."