Abstract
The eXtensible Markup Language (XML) provides a powerful and flexible means of encoding and exchanging data. As it turns out, its main advantage as an encoding format (namely, its requirement that all open and close markup tags are present and properly balanced) yields also one of its main disadvantages: verbosity. XML-conscious compression techniques seek to overcome this drawback. Many of these techniques first separate XML structure from the document content, and then compress each independently. Further compression gains can be realized by identifying and compressing together document content that is highly similar, thereby amortizing the storage costs of auxiliary information required by the chosen compression algorithm. Additionally, the proper choice of compression algorithm is an important factor not only for the achievable compression gain, but also for access performance. Hence, choosing a compression configuration that optimizes compression gain requires one to determine (1) a partitioning strategy for document content, and (2) the best available compression algorithm to apply to each set within this partition. In this paper, we show that finding an optimal compression configuration with respect to compression gain is an NP-hard optimization problem. This problem remains intractable even if one considers a single compression algorithm for all content. We also describe an approximation algorithm for selecting a partitioning strategy for document content based on the branch-and-bound paradigm.
Access this chapter
Tax calculation will be finalised at checkout
Purchases are for personal use only
Preview
Unable to display preview. Download preview PDF.
References
Adiego, J., la Fuente, P.D., Navarro, G.: Combining structural and textual contexts for compressing semistructured databases. In: ENC, pp. 68–73 (2005)
Cheney, J.: Compressing XML with multiplexed hierarchical PPM models. In: DCC, pp. 163–172 (2001)
Cheney, J.: An empirical evaluation of simple DTD-conscious compression techniques. In: WebDB, pp. 43–48 (2005)
Leighton, G., Müldner, T., Diamond, J.: TREECHOP: a tree-based query-able compressor for XML. In: CWIT, pp. 115–118 (2005)
Min, J.K., Park, M.J., Chung, C.W.: XPRESS: A queriable compression for XML data. In: SIGMOD, pp. 122–133 (2003)
Tolani, P.M., Haritsa, J.R.: XGRIND: A query-friendly XML compressor. In: ICDE, pp. 225–234 (2002)
Arion, A., Bonifati, A., Manolescu, I., Pugliese, A.: XQueC: A query-conscious compressed XML database. ACM TOIT 7(2), Article 10 (May 2007)
Leighton, G., Diamond, J., Müldner, T.: AXECHOP: a grammar-based compressor for XML. In: DCC, p. 467 (2005)
Liefke, H., Suciu, D.: XMill: An efficient compressor for XML data. In: SIGMOD, pp. 153–164 (2000)
Maneth, S., Mihaylov, N., Sakr, S.: XML tree structure compression. In: XANTEC, pp. 243–247 (2008)
Skibinski, P., Grabowski, S., Swacha, J.: Effective asymmetric XML compression. Software: Practice and Experience 38(10), 1027–1047 (2008)
Berglund, A., Boag, S., Chamberlin, D., Fernández, M.F., Kay, M., Robie, J., Siméon, J. (eds.): XML path language (XPath) 2.0, W3C Recommendation (January 2007), http://www.w3.org/TR/xpath20/
Boag, S., Chamberlin, D., Fernández, M.F., Florescu, D., Robie, J., Siméon, J. (eds.): XQuery 1.0: An XML query language, W3C Recommendation (January 2007), http://www.w3.org/TR/xquery/
Leighton, G., Barbosa, D.: Optimizing XML compression (extended version). CoRR abs/0905.4761 (2009), http://arxiv.org/abs/0905.4761
Lempel, A., Ziv, J.: On the complexity of finite sequences. IEEE Trans. Inf. Theory 22(1), 75–81 (1976)
Ziv, J., Lempel, A.: Compression of individual sequences via variable-rate coding. IEEE Trans. Inf. Theory 24(5), 530–536 (1978)
Author information
Authors and Affiliations
Editor information
Editors and Affiliations
Rights and permissions
Copyright information
© 2009 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Leighton, G., Barbosa, D. (2009). Optimizing XML Compression. In: Bellahsène, Z., Hunt, E., Rys, M., Unland, R. (eds) Database and XML Technologies. XSym 2009. Lecture Notes in Computer Science, vol 5679. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-03555-5_8
Download citation
DOI: https://doi.org/10.1007/978-3-642-03555-5_8
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-03554-8
Online ISBN: 978-3-642-03555-5
eBook Packages: Computer ScienceComputer Science (R0)