01 February 2016


Jane Cotler and Evan Sandhaus describe two neat tricks that the New York Times used to bring a recent 20-year block of articles into its TimesMachine service. The first is an image tiling and rendering procedure that minimizes how much data a reader has to download. Even more interesting is a fuzzy string-matching algorithm that lines up a batch of texts taken from OCR with their counterparts in a digital archive. The trick to reducing the search space is to divide each text into overlapping sequences of tokens called shingles, a.k.a. n-grams.
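To make the shingling idea concrete, here is a minimal Python sketch. It is not the Times' code: the function names, the choice of word-level 3-token shingles, the inverted index, and the Jaccard score are all assumptions picked for illustration. The point it shows is how shared shingles shrink the search space: an OCR text is only compared against archive texts that share at least one shingle with it.

```python
from collections import defaultdict


def shingles(text, n=3):
    """Return the set of overlapping n-token shingles in text (hypothetical helper)."""
    tokens = text.lower().split()
    if len(tokens) < n:
        # Short texts yield a single shingle holding every token, or nothing.
        return {tuple(tokens)} if tokens else set()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def jaccard(a, b):
    """Jaccard similarity of two shingle sets (an assumed scoring choice)."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def build_index(archive, n=3):
    """Map each shingle to the ids of the archive texts that contain it."""
    index = defaultdict(set)
    for doc_id, text in archive.items():
        for sh in shingles(text, n):
            index[sh].add(doc_id)
    return index


def best_match(ocr_text, archive, index, n=3):
    """Score only the archive texts that share a shingle with the OCR text."""
    query = shingles(ocr_text, n)
    candidates = set()
    for sh in query:
        candidates |= index.get(sh, set())
    best_id, best_score = None, 0.0
    for doc_id in candidates:
        score = jaccard(query, shingles(archive[doc_id], n))
        if score > best_score:
            best_id, best_score = doc_id, score
    return best_id, best_score


# Made-up sample data: "fax" stands in for an OCR misread of "fox".
archive = {
    "a": "the quick brown fox jumps over the lazy dog",
    "b": "stocks fell sharply on wall street today",
}
index = build_index(archive)
match_id, score = best_match("the quick brown fax jumps over the lazy dog", archive, index)
print(match_id, score)
```

A single OCR error spoils every shingle that overlaps it, but the untouched shingles still point at the right archive text, and text "b" is never scored at all because it shares no shingle with the query.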
