03 December 2008

Decisions, decisions

Tak Cheung Lam et al. explore the details of syntactic parsing of XML under four technologies: the well-known DOM and SAX and the lesser-known StAX (Streaming API for XML) and VTD-XML (Virtual Token Descriptor). They begin with a preparatory exploration of character decoding and lexical analysis, processing steps that are common to all XML analyzers.

Since neither SAX nor StAX creates an in-memory representation of the complete document, they are not well suited to applications that must transform the document, but they can be effective for simple streaming applications. StAX uses a "pull" model that puts the processing loop in the application, so many developers will find it easier to use than SAX, whose "push" model delivers parse events to application callbacks.
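To make the push/pull distinction concrete, here is a minimal sketch that counts elements of a given name both ways, using only the standard JDK APIs (javax.xml.parsers, org.xml.sax, javax.xml.stream). The class and method names (PushVsPull, countWithSax, countWithStax) are made up for illustration and do not come from the article.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class: the same task written against SAX and StAX.
final class PushVsPull {

    // SAX (push): the parser owns the loop and calls back into our handler.
    static int countWithSax(String xml, String name) {
        final int[] count = {0}; // state must live outside the callback
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes attrs) {
                if (qName.equals(name)) {
                    count[0]++;
                }
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException("SAX parse failed", e);
        }
        return count[0];
    }

    // StAX (pull): the application owns the loop and asks for the next event.
    static int countWithStax(String xml, String name) {
        int count = 0;
        try {
            XMLStreamReader reader = XMLInputFactory.newInstance()
                    .createXMLStreamReader(new StringReader(xml));
            try {
                while (reader.hasNext()) {
                    if (reader.next() == XMLStreamConstants.START_ELEMENT
                            && reader.getLocalName().equals(name)) {
                        count++;
                    }
                }
            } finally {
                reader.close();
            }
        } catch (Exception e) {
            throw new IllegalStateException("StAX parse failed", e);
        }
        return count;
    }
}
```

In the StAX version the counter is an ordinary local variable and the loop can stop or branch whenever the application likes; in the SAX version the same state has to be threaded through a handler object, which is the usability difference described above.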

By contrast, DOM and VTD are the tools to use if you need to rewrite the XML. Unlike DOM, VTD does not build its in-memory representation as an object tree, but rather as lightweight arrays of 64-bit integers, and the article gives a one-figure sketch of how this works. The authors estimate that VTD is 5 to 8 times faster than DOM and takes up 20% of the space that DOM does, especially for incremental updates, but they don't back up these estimates with empirical measurements. They also speculate that VTD is a good candidate for hardware acceleration.
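The idea of describing tokens with 64-bit integers instead of objects can be sketched roughly as follows. This is only an illustration: the field widths (30-bit offset, 20-bit length, 8-bit depth, 4-bit type), the type codes, and the names TokenRecord, pack, and text are assumptions made for this example and are not taken from the article. The sketch also indexes Java chars, whereas a real parser would more likely index bytes of the encoded document.

```java
// Hypothetical sketch: one token of the source document described by one long.
// Assumed layout, low bits to high: offset (30) | length (20) | depth (8) | type (4).
final class TokenRecord {
    static final int OFFSET_BITS = 30, LENGTH_BITS = 20, DEPTH_BITS = 8, TYPE_BITS = 4;
    static final int LENGTH_SHIFT = OFFSET_BITS;               // 30
    static final int DEPTH_SHIFT = LENGTH_SHIFT + LENGTH_BITS; // 50
    static final int TYPE_SHIFT = DEPTH_SHIFT + DEPTH_BITS;    // 58

    // Illustrative type codes, not an official enumeration.
    static final int ELEMENT_NAME = 0, CHARACTER_DATA = 1;

    static long pack(int type, int depth, int length, int offset) {
        if (type < 0 || type >= (1 << TYPE_BITS)
                || depth < 0 || depth >= (1 << DEPTH_BITS)
                || length < 0 || length >= (1 << LENGTH_BITS)
                || offset < 0 || offset >= (1 << OFFSET_BITS)) {
            throw new IllegalArgumentException("field out of range");
        }
        return ((long) type << TYPE_SHIFT)
                | ((long) depth << DEPTH_SHIFT)
                | ((long) length << LENGTH_SHIFT)
                | offset;
    }

    static int type(long rec)   { return (int) (rec >>> TYPE_SHIFT) & 0xF; }
    static int depth(long rec)  { return (int) (rec >>> DEPTH_SHIFT) & 0xFF; }
    static int length(long rec) { return (int) ((rec >>> LENGTH_SHIFT) & 0xFFFFF); }
    static int offset(long rec) { return (int) (rec & 0x3FFFFFFF); }

    // The record stores no text; it points back into the unmodified document.
    static String text(String doc, long rec) {
        return doc.substring(offset(rec), offset(rec) + length(rec));
    }
}
```

For the document `<a><b>hi</b></a>`, the character data `hi` would be described by `TokenRecord.pack(TokenRecord.CHARACTER_DATA, 1, 2, 6)`, taking depth 1 to mean "inside `b`, one level below the root" under this sketch's convention. A whole document then becomes a `long[]`, with no per-node object allocated and the original text left untouched, which is plausibly where the space savings and the ease of incremental updates come from.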
