Nothing Special   »   [go: up one dir, main page]

skip to main content
research-article

XXS: Efficient XPath Evaluation on Compressed XML Documents

Published: 08 July 2014 Publication History

Abstract

The eXtensible Markup Language (XML) is acknowledged as the de facto standard for semistructured data representation and data exchange on the Web and many other scenarios. A well-known shortcoming of XML is its verbosity, which increases manipulation, transmission, and processing costs. Various structure-blind and structure-conscious compression techniques can be applied to XML, and some are even access-friendly, meaning that the documents can be efficiently accessed in compressed form. Direct access is necessary to implement the query languages XPath and XQuery, which are the standard ones to exploit the expressiveness of XML. While a good deal of theoretical and practical proposals exist to solve XPath/XQuery operations on XML, only a few ones are well integrated with a compression format that supports the required access operations on the XML data. In this work we go one step further and design a compression format for XML collections that boosts the performance of XPath queries on the data. This is done by designing compressed representations of the XML data that support some complex operations apart from just accessing the data, and those are exploited to solve key components of the XPath queries. Our system, called XXS, is aimed at XML collections containing natural language text, which are compressed to within 35%--50% of their original size while supporting a large subset of XPath operations in time competitive with, and many times outperforming, the best state-of-the-art systems that work on uncompressed representations.

References

[1]
J. Adiego, G. Navarro, and P. de la Fuente. 2007a. Lempel-Ziv compression of highly structured documents. J. Amer. Soc. Inf. Sci. Tech. 58, 4, 461--478.
[2]
J. Adiego, G. Navarro, and P. de la Fuente. 2007b. Using structural contexts to compress semistructured text collections. Inf. Process. Manage. 43, 3, 769--790.
[3]
A. Arion, A. Bonifati, I. Manolescu, and A. Pugliese. 2007. XQueC: A query-conscious compressed XML database. ACM Trans. Internet Technol. 7, 2.
[4]
D. Arroyuelo, F. Claude, S. Maneth, V. Mäkinen, G. Navarro, K. Nguyen, J. Sirén, and N. Välimäki. 2010. Fast in-memory XPath search using compressed indexes. In Proceedings of the 26th IEEE International Conference on Data Engineering. 417--428.
[5]
R. A. Baeza-Yates and B. Ribeiro-Neto. 1999. Modern Information Retrieval. Addison-Wesley Longman.
[6]
P. A. Boncz, T. Grust, M. Van Keulen, S. Manegold, J. Rittinger, and J. Teubner. 2006. MonetDB/XQuery: A fast XQuery processor powered by a relational engine. In Proceedings of the ACM SIGMOD International Conference on Management of Data. 479--490.
[7]
R. Bourret. 2009. Going native: Use cases for native XML databases. http://www.rpbourret.com/xml/UseCases.htm.
[8]
N. R. Brisaboa, A. Cerdeira-Pena, and G. Navarro. 2009. A compressed self-indexed representation of XML documents. In Proceedings of the 13th European Conference on Digital Libraries. Lecture Notes in Computer Science, vol. 5714, 273--284.
[9]
N. R. Brisaboa, A. Cerdeira-Pena, G. Navarro, and O. Pedreira. 2012a. Ranked document retrieval in (almost) no space. In Proceedings of the 19th International Symposium on String Processing and Inf. Retrieval. 155--160.
[10]
N. R. Brisaboa, A. Fariña, S. Ladra, and G. Navarro. 2012b. Implicit indexing of natural language text by reorganizing bytecodes. Inf. Retrieval 15, 6, 527--557.
[11]
N. R. Brisaboa, A. Fariña, G. Navarro, and J. R. Paramá. 2007. Lightweight natural language text compression. Inf. Retrieval 10, 1, 1--33.
[12]
M. Burrows and D. Wheeler. 1994. A block-sorting lossless data compression algorithm. Tech. Rep., 124, Digital Equipment Corporation.
[13]
A. Cerdeira-Pena. 2013. Compressed Self-Indexed XML Representation with Efficient XPath Evaluation. Ph.D. thesis, Department of Computer Science, University of A Coruña, Spain.
[14]
J. Cheney. 2001. Compressing XML with multiplexed hierarchical PPM models. In Proceedings of the 11th Data Compression Conference. 163--172.
[15]
J. Cheng and W. Ng. 2004. XQzip: Querying compressed XML using structural indexing. In Proceedings of the 9th International Conference on Extending Database Technology. 219--236.
[16]
J. Cleary and I. H. Witten. 1984. Data compression using adaptive coding and partial string matching. IEEE Trans. Commun. 32, 4, 396--402.
[17]
J. S. Culpepper and A. Moffat. 2005. Enhanced byte codes with restricted prefix properties. In Proceedings of the International Symposium on String Processing and Information Retrieval. 1--12.
[18]
E. S. de Moura, G. Navarro, N. Ziviani, and R. A. Baeza-Yates. 2000. Fast and flexible word searching on compressed text. ACM Trans. Inf. Syst. 18, 2, 113--139.
[19]
M. F. Fernández, J. Siméon, B. Choi, A. Marian, and G. Sur. 2003. Implementing XQuery 1.0: The Galax experience. In Proceedings of the 29th International Conference on Very Large Data Bases. 1077--1080.
[20]
P. Ferragina, F. Luccio, G. Manzini, and S. Muthukrishnan. 2006. Compressing and searching XML data via two zips. In Proceedings of the 15th International World Wide Web Conference. 751--760.
[21]
P. Ferragina, F. Luccio, G. Manzini, and S. Muthukrishnan. 2009. Compressing and indexing labeled trees, with applications. J. ACM 57, 1, 4:1--4:33.
[22]
M. Girardot and N. Sundaresan. 2000. Millau: An encoding format for efficient representation and exchange of XML over the web. Comput. Networks 33, 1--6, 747--765.
[23]
G. Gottlob, C. Koch, and R. Pichler. 2005. Efficient algorithms for processing XPath queries. ACM Trans. Database Syst. 30, 2, 444--491.
[24]
R. Grossi, A. Gupta, and J. S. Vitter. 2003. High-order entropy-compressed text indexes. In Proceedings of the 14th Annual ACM-SIAM Symposium on Discrete Algorithms. 841--850.
[25]
D. Huffman. 1952. A method for the construction of minimum-redundancy codes. Proc. Inst. Radio Eng. 40, 9, 1098--1101.
[26]
G. Jacobson. 1989. Space-efficient static trees and graphs. In Proceedings of the 30th Annual IEEE Symposium on Foundations of Computer Science. 549--554.
[27]
M. Kay. 2008. Ten reasons why Saxon XQuery is fast. IEEE Data Eng. Bull. 31, 4, 65--74.
[28]
M. Lalmas. 2009. XML Retrieval. Morgan & Claypool Publishers.
[29]
C. League and K. Eng. 2007. Schema-based compression of XML data with RELAX NG. J. Computers 2, 10, 9--17.
[30]
G. Leighton, J. Diamond, and T. Müldner. 2005. AXECHOP: A grammar-based compressor for XML. In Proceedings of the 15th Data Compression Conference. 467.
[31]
G. Leighton, T. Müldner, and J. Diamond. 2005. TREECHOP: A tree-based queriable compressor for XML. Tech. Rep., Acadia University.
[32]
M. Levene and P. Wood. 2002. XML structure compression. In Proceedings of the 2nd International Workshop on Web Dynamics.
[33]
W. Li. 2003. XComp: An XML Compression Tool. M.S. thesis, University of Waterloo, Waterloo, Ontario, Canada.
[34]
H. Liefke and D. Suciu. 2000. XMill: An efficient compressor for XML data. In Proceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD). 153--164.
[35]
Y. Lin, Y. Zhang, Q. Li, and J. Yang. 2005. Supporting efficient query processing on compressed XML files. In Proceedings of the 20th Annual ACM Symposium on Applied Computing. 660--665.
[36]
J. A. List, V. Mihajlovic, G. Ramírez, A. P. De Vries, D. Hiemstra, and H. E. Blok. 2005. Tijah: Embracing IR methods in XML databases. Inf. Retrieval 8, 4, 547--570.
[37]
S. Maneth and T. Sebastian. 2010. Fast and tiny structural self-indexes for XML. CoRR abs/1012.5696.
[38]
W. Meier. 2002. eXist: An Open Source Native XML Database. In Revised Papers from the NODe 2002 Web and Database-Related Workshops on Web, Web-Services, and Database Systems. 169--183.
[39]
J. Min, M. Park, and C. Chung. 2003. XPRESS: A queriable compression for XML data. In Proceedings of theACM SIGMOD International Conference on Management of Data. 122--133.
[40]
J. I. Munro and V. Raman. 2001. Succinct representation of balanced parentheses and static trees. SIAM J. Comput. 31, 3, 762--776.
[41]
G. Navarro and R. Baeza-Yates. 1997. Proximal Nodes: A model to query document databases by content and structure. ACM Trans. Inf. Syst. 15, 4, 400--435.
[42]
G. Navarro and V. Mäkinen. 2007. Compressed full-text indexes. ACM Comput. Surv. 39, 1.
[43]
W. Ng, W. Y. Lam, P. T. Wood, and M. Levene. 2006. XCQ: A queriable XML compression system. Knowl. Inf. Syst. 10, 4, 421--452.
[44]
D. Olteanu. 2007. SPEX: Streamed and progressive evaluation of XPath. IEEE Trans. KnowledgeData Eng. 19, 7, 934--949.
[45]
T. Oszu. 2003. XBench - a family of benchmarks for XML DBMSs. https://cs.uwaterloo.ca/∼tozsu/ddbms/projects/xbench/Specification.html.
[46]
F. Peng and S. S. Chawathe. 2005. XSQ: A streaming XPath engine. ACM Trans. Database Syst. 30, 2, 577--623.
[47]
K. Sadakane and G. Navarro. 2010. Fully-functional succinct trees. In Proceedings of the 21th Annual ACM/SIAM Symposium on Discrete Algorithms. 134--149.
[48]
S. Sakr. 2009. XML compression techniques: A survey and comparison. J. Comput. Syst. Sci. 75, 5, 303--322.
[49]
M. Schmidt, S. Scherzinger, and C. Koch. 2007. Combined static and dynamic analysis for effective buffer minimization in streaming XQuery evaluation. In Proceedings of the 23rd International Conference on Data Engineering. 236--245.
[50]
P. Skibinski, S. Grabowski, and J. Swacha 2008. Effective asymmetric XML compression. Softw. Practice Experi. 38, 10, 1027--1047.
[51]
H. Subramanian and P. Shankar 2005. Compressing XML documents using recursive finite state automata. In Proceedings of the 10th International Conference on Implementation and Application of Automata (CIAA). 282--293.
[52]
P. M. Tolani and J. R. Haritsa. 2002. XGrind: A query-friendly XML compressor. In Proceedings of the 18th International Conference on Data Engineering. 225--234.
[53]
V. Toman. 2003. Compression of XML data. M.S. thesis, Charles University, Prague, Czech Republic.
[54]
W3C. 1998. Recommendation of Extensible Markup Language (XML) 1.0. http://www.w3.org/TR/REC-xml.
[55]
W3C. 1999. Recommendation of XML Path Language (XPath) 1.0. http://www.w3.org/TR/xpath.
[56]
W3C. 2010a. Recommendation of XML Path Language (XPath) 2.0. http://www.w3.org/TR/xpath20.
[57]
W3C. 2010b. Recommendation of XML Query Language (XQuery) 1.0. http://www.w3.org/TR/xquery.
[58]
W3C. 2011. Recommendation of XQuery and XPath Full Text 1.0. http://www.w3.org/TR/xpath-full-text-10.
[59]
H. Wang, J. Li, J. Luo, and Z. He. 2004. XCPaqs: Compression of XML documents with XPath query support. In Proceedings of the International Conference on Information Technology: Coding and Computing. 354--358.
[60]
T. A. Welch. 1984. A technique for high-performance data compression. IEEE Computer17, 6, 8--19.
[61]
R. K. Wong, F. Lam, and W. M. Shui 2007. Querying and maintaining a compact XML storage. In Proceedings of the 16th International World Wide Web Conference (WWW). 1073--1082.
[62]
XML Mind Products. 2008. Qizx XML database engine. http://www.axyana.com/qizx.
[63]
J. Ziv and A. Lempel. 1977. A universal algorithm for sequential data compression. IEEE Trans. Inf. Theory 23, 3, 337--343.
[64]
J. Ziv and A. Lempel. 1978. Compression of individual sequences via variable-rate coding. IEEE Trans. Inf. Theory 24, 5, 530--536.

Cited By

View all
  • (2024)Comprehensive Review and Future Research Directions on ICT StandardisationInformation10.3390/info1511069115:11(691)Online publication date: 2-Nov-2024
  • (2018)Managing Compressed Structured TextEncyclopedia of Database Systems10.1007/978-1-4614-8265-9_72(2176-2183)Online publication date: 7-Dec-2018
  • (2017)Managing Compressed Structured TextEncyclopedia of Database Systems10.1007/978-1-4899-7993-3_72-2(1-8)Online publication date: 2-Aug-2017

Recommendations

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image ACM Transactions on Information Systems
ACM Transactions on Information Systems  Volume 32, Issue 3
June 2014
197 pages
ISSN:1046-8188
EISSN:1558-2868
DOI:10.1145/2647579
Issue’s Table of Contents
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 08 July 2014
Accepted: 01 March 2014
Revised: 01 March 2014
Received: 01 October 2013
Published in TOIS Volume 32, Issue 3

Permissions

Request permissions for this article.

Check for updates

Author Tags

  1. Semistructured data
  2. XML
  3. XPath
  4. compression
  5. self-index

Qualifiers

  • Research-article
  • Research
  • Refereed

Funding Sources

Contributors

Other Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months)0
  • Downloads (Last 6 weeks)0
Reflects downloads up to 18 Feb 2025

Other Metrics

Citations

Cited By

View all
  • (2024)Comprehensive Review and Future Research Directions on ICT StandardisationInformation10.3390/info1511069115:11(691)Online publication date: 2-Nov-2024
  • (2018)Managing Compressed Structured TextEncyclopedia of Database Systems10.1007/978-1-4614-8265-9_72(2176-2183)Online publication date: 7-Dec-2018
  • (2017)Managing Compressed Structured TextEncyclopedia of Database Systems10.1007/978-1-4899-7993-3_72-2(1-8)Online publication date: 2-Aug-2017

View Options

Login options

Full Access

View options

PDF

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Figures

Tables

Media

Share

Share

Share this Publication link

Share on social media