Abstract
We propose XIR, a novel method for processing partial match queries on heterogeneous XML documents using information retrieval (IR) techniques. A partial match query is defined as the one having the descendent-or-self axis “//” in its path expression. In its general form, a partial match query has branch predicates forming branching paths. The objective of XIR is to efficiently support this type of queries for large-scale documents of heterogeneous schemas. XIR has its basis on the conventional schema-level methods using relational tables and significantly improves their efficiency using two techniques: an inverted index technique and a novel prefix match join. The former indexes the labels in label paths as keywords in texts, and allows for finding the label paths matching the queries more efficiently than string match used in the conventional methods. The latter supports branching path expressions, and allows for finding the result nodes more efficiently than containment joins used in the conventional methods. We compare the efficiency of XIR with those of XRel and XParent using XML documents crawled from the Internet. The results show that XIR is more efficient than both XRel and XParent by several orders of magnitude for linear path expressions, and by several factors for branching path expressions.
Access this chapter
Tax calculation will be finalised at checkout
Purchases are for personal use only
Preview
Unable to display preview. Download preview PDF.
Similar content being viewed by others
References
Aboulnaga, A., Alameldeen, A.R., Naughton, J.: Estimating the Selectivity of XML Path Expressions for Internet Scale Applications. In: Proc. the 27th Int’l Conf. on Very Large Data Bases(VLDB), Rome, Italy, September 11-14 , pp. 591–600 (2001)
Amer-Yahia, S., Cho, S., Lakshmanan, L.V.S., Srivastava, D.: Minimization of Tree Pattern Queries. In: Proc. 2001 ACM SIGMOD Int’l Conf. on Management of Data, Santa Barbara, California, May 21-24 , pp. 497–508.(2001)
Altinel, M., Franklin, M.J.: Efficient Filtering of XML Documents for Selective Dissemination of Information. In: Proc. the 26th Int’l Conf. on Very Large Data Bases(VLDB), Cairo, Egypt, September 10-14, pp. 53–64 (2000)
Al-Khalifa, S., Jagadish, H.V., Koudas, N., Patel, J.M.: Structural Joins: A Primitive for Efficient XML Query Pattern Matching. In: Proc. the 18th Int’l Conf. on Data Engineering (ICDE), San Jose, California, February 26 - March 1, pp. 141–152 (2002)
Bremer, J.-M., Gertz, M.: XQuery/IR: Integrating XML Document and Data Retrieval. In: Proc. the Fifth Int’l Workshop on the Web and Databases (WebDB 2002), Madison, Wisconsin, pp. 1–6 (2002)
Bruno, N., Koudas, N., Srivastava, D.: Holistic Twig Joins: Optimal XML Pattern Matching. In: Proc. 2002 ACM SIGMOD Int’l Conf. on Management of Data, Madison, Wisconsin, June 3-6, pp. 310–321 (2002)
Clark, J., DeRose, S.: XML Path Language (XPath), W3C Recommendation (November 1999), http://www.w3.org/TR/xpath
Cooper, B.F., Sample, N., Franklin, M.J., Hjaltason, G.R., Shadmon, M.: A Fast Index for Semistructured Data. In: Proc. the 27th Int’l Conf. on Very Large Data Bases(VLDB), Rome, Italy, September 11-14, pp. 341–350 (2001)
Florescu, D., Kossmann, D., Manolescu, I.: Integrating Keyword Search into XML Query Processing. In: Proc. the 9th WWW Conference/Computer Networks, Amsterdam, NL (May 2000), pp. 119–135 (2000)
Guo, L., Shao, F., Botev, C., Shanmugasundaram, J.: XRANK: Ranked Keyword Search over XML Documents. In: Proc. 2003 ACM SIGMOD Int’l Conf. on Management of Data, San Diego, California, June 9-12, pp. 16–27 (2003)
Goldman, R., Widom, J.: DataGuides: Enabling Query Formulation and Optimization in Semistructured Databases. In: Proc. the 23th Int’l Conf. on Very Large Data Bases(VLDB), Athens, Greece, August 26-29, pp. 436–445 (1997)
Halverson, A., Burger, J., Galanis, L., Kini, A., Krishnamurthy, R., Rao, A.N., Tian, F., Viglas, S., Wang, Y., Naughton, J.F., DeWitt, D.J.: Mixed Mode XML Query Processing. In: Proc. the 29th Int’l Conf. on Very Large Data Bases(VLDB), Berlin, Germany, September 9-12, pp. 225–236 (2003)
Ives, Z., Levy, A., Weld, D.: Efficient Evaluation of Regular Path Expressions on Streaming XML Data, Technical Report UW-CSE-2000-05-02, University of Washington (2000)
Jiang, H., Lu, H., Wang, W., Xu Yu, J.: Path Materialization Revisited: An Efficient Storage Model for XML Data. In: Proc. the 13th Australasian Database Conference (ADC), Melbourne, Australia, January 28 - February 1, pp. 85–94 (2002)
Jiang, H., Lu, H., Wang, W., Xu Yu, J.: XParent: An Efficient RDBMS-Based XML Database System. In: Proc. the 18th Int’l Conf. on Data Engineering (ICDE), San Jose, California, February 26 - March 1, pp. 335–336 (2002)
Jiang, H., Wang, W., Lu, H., Yu, J.X.: Holistic Twig Joins on Indexed XML Documents. In: Proc. the 29th Int’l Conf. on Very Large Data Bases(VLDB), Berlin, Germany, September 9-12, pp. 273–284 (2003)
Li, Q., Moon, B.: Indexing and Querying XML Data for Regular Path Expressions. In: Proc. the 27th Int’l Conf. on Very Large Data Bases(VLDB), Rome, Italy, September 11-14, pp. 361–370 (2001)
Mandreoli, F., Martoglia, R., Tiberio, P.: Searching Similar (Sub)Sentences for Example-Based Machine Translation. In: Proc. 2002 Italian Symposium on Sistemi Evoluti per Basi di Dati (SEBD 2002), Isola d’Elba, Italy, (June 2002)
McHugh, J., Widom, J.: Query Optimization for XML. In: Proc. the 25th Int’l Conf. on Very Large Data Bases (VLDB), Edinburgh, Scotland, UK, September 7-10, pp. 315–326 (1999)
Naughton, J., et al.: The Niagara Internet Query System. IEEE Data Engineering Bulletin 24(2), 27–33 (2001)
Polyzotis, N., Garofalakis, M.: Statistical Synopses for Graph-structured XML Databases. In: Proc. 2002 ACM SIGMOD Int’l Conf. on Management of Data, Madison, Wisconsin, June 3-6, pp. 358–369 (2002)
Salton, G., McGill, M.: Introduction to Modern Information Retrieval. McGraw-Hill, New York (1983)
Tatarinov, I., Viglas, S., Beyer, K.S., Shanmugasundaram, J., Shekita, E.J., Zhang, C.: Storing and Querying Ordered XML Using a Relational Database System. In: Proc. 2002 ACM SIGMOD Int’l Conf. on Management of Data, Madison, Wisconsin, June 3-6, pp. 204–215 (2002)
Yoshikawa, M., Amagasa, T., Shimura, T., Uemura, S.: XRel: A Path-based Approach to Storage and Retrieval of XML Documents using Relational Databases. ACM Trans. on Internet Technology(TOIT) 1(1), 110–141 (2001)
Park, Y., Whang, K.-Y., Lee, B., Han, W.: Efficient Evaluation of Partial Match Queries for XML Documents Using Information Retrieval Techniques, Technical Report CS-TR-2004-212, Department of Computer Science, KAIST (December 2004), Also, available on AITrc Technical Report No. 04-11-048, http://aitrc.kaist.ac.kr/util_tr.htm (December 28, 2004)
Zhang, C., Naughton, J.F., DeWitt, D.J., Luo, Q., Lohmann, G.M.: On Supporting Containment Queries in Relational Database Management Systems. In: Proc. 2001 ACM SIGMOD Int’l Conf. on Management of Data, Santa Barbara, California, May 21-24, pp. 425–436 (2001)
ReGet Deluxe 3.3 Beta (build 173), http://deluxe.reget.com/en/
Teleport Pro Version 1.29, http://www.tenmax.com/teleport/pro/home.htm
Xyleme, http://www.xyleme.com
eXtensible Markup Language(XML), http://www.w3.org/XML/
Author information
Authors and Affiliations
Editor information
Editors and Affiliations
Rights and permissions
Copyright information
© 2005 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Park, YH., Whang, KY., Lee, B.S., Han, WS. (2005). Efficient Evaluation of Partial Match Queries for XML Documents Using Information Retrieval Techniques. In: Zhou, L., Ooi, B.C., Meng, X. (eds) Database Systems for Advanced Applications. DASFAA 2005. Lecture Notes in Computer Science, vol 3453. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11408079_11
Download citation
DOI: https://doi.org/10.1007/11408079_11
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-25334-1
Online ISBN: 978-3-540-32005-0
eBook Packages: Computer ScienceComputer Science (R0)