Jane Cotler and Evan Sandhaus describe two neat tricks that the New York
Times used to bring a recent
20-year block of articles into its TimesMachine service. First, an image tiling and rendering procedure that minimizes download requirements. Even more interesting, a fuzzy-logic string-matching algorithm that lines up a batch of texts taken from OCR with their counterparts from a digital archive. The trick to reducing the search space depends on dividing each text into blocks of overlapping tokens called shingles, a/k/a n-grams.
No comments:
Post a Comment