Overview

This work was done by Pankaj M. Tolani at the Database Systems Laboratory under Dr. Jayant R. Haritsa. Database Systems Laboratory is part of the Indian Institute of Science, Bangalore, INDIA - 560012.  The related research paper was selected for the proceedings of the International Conference of Data Engineering (San Jose, 2002).

About XGrind

XML documents are extremely verbose since the "schema" is repeated for every "record" in the document. While a variety of compressors are available to address the problem, they are not designed to support direct querying of the compressed document, a useful feature from a database perspective. In this paper, we propose a new compression tool called XGrind, which directly supports queries in the compressed domain. A special feature of XGrind is that the compressed document retains the structure of the original document, permitting reuse of the standard XML techniques for processing the compressed document. Performance evaluation over a variety of XML documents and user queries indicates that XGrind simultaneously delivers improved query processing times and reasonable compression ratios.

Features

1.The compression rates of XGrind are competitive with XMILL (XML-specific state-of-the-art compressor) and other similar tools, with the added benefit of permitting querying of the compressed data.

2.XGrind makes use of simple but effective techniques  non-adaptive Huffman compression for arbitrary strings, compression of enumerated values  in the context of XML and demonstrates that the techniques work well compared to similar tools.

3.XGrind preserves the hierarchical structure of a document and permits decompression of fragments of the document, both of which are important for query processing.

4.The XGrind prototype was implemented in C/C++ (about 5K lines of code) with the following modules: compressor, decompressor, query processing engine.

5.Optimizations done to speed up compression

a.XMILL SAX API XML parser used instead of the relatively much slower commercially available XML parsers.
For this, we had to implement a DTD parser using Lex and Yacc tools to populate the data structures used for the statistics.

b.Optimized Huffman-Compressor/Enum-Encoder modules which avoid C++ libraries/constructs.

6.Optimizations done to speed up decompression/querying

The compressed XML document format is such that decompression/querying

a.needs a simple lexical analyzer to create the lexical tokens

b.needs a simple parser to parse the compressed XML document

Code (under GPL), Distribution, and Documentation

The binaries and source are available at  http://sourceforge.net/projects/xgrind/ . SourceForge Logo

Documentation

The copyright of the research paper is owned by IEEE and it appeared in ICDE as follows:

    "XGrind: A Query-Friendly XML Compressor"
    P. Tolani and J. Haritsa
    Proc. of 18th IEEE Intl. Conf. on Data Engineering (ICDE),
    San Jose, USA, February 2002, pgs. 225-234

Research Index version of the paper

Technical Report

Contacts

contact author: haritsa@dsl.serc.iisc.ernet.in
contact author for the binaries and source: pankaj@dsl.serc.iisc.ernet.in