This work was done by Pankaj M. Tolani at the Database Systems Laboratory under Dr. Jayant R. Haritsa. Database Systems Laboratory is part of the Indian Institute of Science, Bangalore, INDIA - 560012. The related research paper was selected for the proceedings of the International Conference of Data Engineering (San Jose, 2002).
XML documents are extremely verbose since the "schema" is repeated for every "record" in the document. While a variety of compressors are available to address the problem, they are not designed to support direct querying of the compressed document, a useful feature from a database perspective. In this paper, we propose a new compression tool called XGrind, which directly supports queries in the compressed domain. A special feature of XGrind is that the compressed document retains the structure of the original document, permitting reuse of the standard XML techniques for processing the compressed document. Performance evaluation over a variety of XML documents and user queries indicates that XGrind simultaneously delivers improved query processing times and reasonable compression ratios.
1.The compression rates of XGrind are competitive with XMILL (XML-specific state-of-the-art compressor) and other similar tools, with the added benefit of permitting querying of the compressed data.
2.XGrind makes use of simple but effective techniques non-adaptive Huffman compression for arbitrary strings, compression of enumerated values in the context of XML and demonstrates that the techniques work well compared to similar tools.
3.XGrind preserves the hierarchical structure of a document and permits decompression of fragments of the document, both of which are important for query processing.
4.The XGrind prototype was implemented in C/C++ (about 5K lines of code) with the following modules: compressor, decompressor, query processing engine.
5.Optimizations done to speed up compression
a.XMILL SAX API XML parser used instead of the relatively much slower
commercially available XML parsers.
For this, we had to implement a DTD parser using Lex and Yacc tools to populate the data structures used for the statistics.
b.Optimized Huffman-Compressor/Enum-Encoder modules which avoid C++ libraries/constructs.
6.Optimizations done to speed up decompression/querying
The compressed XML document format is such that decompression/querying
a.needs a simple lexical analyzer to create the lexical tokens
b.needs a simple parser to parse the compressed XML document
Code (under GPL), Distribution, and Documentation
The binaries and source are available at http://sourceforge.net/projects/xgrind/ .
The copyright of the research paper is owned by IEEE and it appeared in ICDE as follows:
"XGrind: A Query-Friendly XML Compressor"
P. Tolani and J. Haritsa
Proc. of 18th IEEE Intl. Conf. on Data Engineering (ICDE),
San Jose, USA, February 2002, pgs. 225-234
Research Index version of the paper
contact author: firstname.lastname@example.org
contact author for the binaries and source: email@example.com