Nothing Special   »   [go: up one dir, main page]

skip to main content
article

Path sharing and predicate evaluation for high-performance XML filtering

Published: 01 December 2003 Publication History

Abstract

XML filtering systems aim to provide fast, on-the-fly matching of XML-encoded data to large numbers of query specifications containing constraints on both structure and content. It is now well accepted that approaches using event-based parsing and Finite State Machines (FSMs) can provide the basis for highly scalable structure-oriented XML filtering systems. The XFilter system [Altinel and Franklin 2000] was the first published FSM-based XML filtering approach. XFilter used a separate FSM per path query and a novel indexing mechanism to allow all of the FSMs to be executed simultaneously during the processing of a document. Building on the insights of the XFilter work, we describe a new method, called "YFilter" that combines all of the path queries into a single Nondeterministic Finite Automaton (NFA). YFilter exploits commonality among queries by merging common prefixes of the query paths such that they are processed at most once. The resulting shared processing provides tremendous improvements in structure matching performance but complicates the handling of value-based predicates.In this article, we first describe the XFilter and YFilter approaches and present results of a detailed performance comparison of structure matching for these algorithms as well as a hybrid approach. The results show that the path sharing employed by YFilter can provide order-of-magnitude performance benefits. We then propose two alternative techniques for extending YFilter's shared structure matching with support for value-based predicates, and compare the performance of these two techniques. The results of this latter study demonstrate some key differences between shared XML filtering and traditional database query processing. Finally, we describe how the YFilter approach is extended to handle more complicated queries containing nested path expressions.

References

[1]
Aksoy, D., Altinel, M., Bose, R., Cetintemel, U., Franklin, M. J., Wang, J., and Zdonik, S. B. 1998. Research in data broadcast and dissemination. In Proceedings of the of the 1st International Conference on Advanced Multimedia Content Processing. Springer, Berlin, Germany, 194--207.]]
[2]
Altinel, M., Aksoy, D., Baby, T., Franklin, M. J., Shapiro, W., and Zdonik, S. B. 1999. DBIS-Toolkit: Adaptable middleware for large scale data delivery. In Proceedings of the of SIGMOD 1999. ACM, New York, 544--546.]]
[3]
Altinel, M., and Franklin, M. J. 2000. Efficient filtering of XML documents for selective dissemination of information. In Proceedings of VLDB 2000. Morgan-Kaufmann, San Francisco, Calif., 53--64.]]
[4]
Apache XML Project. 1999. Xerces Java parser 1.2.3 Release. http://xml.apache.org/xerces-j/index.html.]]
[5]
Belkin, N. J. and Croft, B. W. 1992. Information filtering and information retrieval: Two sides of the same coin? Commun. ACM, 35, 12, 29--38.]]
[6]
Bruno, N., Gravano, L., Koudas, N., and Srivastava, D. 2003. Navigation- vs. index-based XML multi-query processing. In Proceedings of ICDE 2003. IEEE Computer Society Press, Los Alamitos, Calif.]]
[7]
Bruno, N., Srivastava, D., and Koudas, N. 2002. Holistic twig joins: Optimal XML pattern matching. In Proceedings of SIGMOD 2002. ACM, New York, 310--321.]]
[8]
Busse, R., Carey, M., Florescu, D., Kersten, M., Manolescu, I., Schmidt, A., and Waas, F. 2001. Xmark: An XML benchmark project. http://monetdb.cwi.nl/xml/index.html.]]
[9]
Cetintemel, U., Franklin, M. J., and Giles, C. L. 2000. Self-adaptive user profiles for large scale data delivery. In Proceedings of ICDE 2000. IEEE Computer Society, Los Alamitos, Calif., 622--633.]]
[10]
Chamberlin, D., Fankhauser, P., Florescu, D., Marchiori, M., and Robie, J. 2002. XML query use cases. W3C Working Draft. http://www.w3.org/TR/2002/WD-xmlquery-use-cases-20020430.]]
[11]
Chan, C., Felber, P., Garofalakis, M., and Rastogi, R. 2002. Efficient filtering of XML documents with XPath expressions. In Proceedings of ICDE 2002. IEEE Computer Society, Los Alamitos, Calif., 235--244.]]
[12]
Chen, J., DeWitt, D. J., and Naughton, J. F. 2002. Design and evaluation of alternative selection placement strategies in optimizing continuous queries. In Proceedings of ICDE 2002. IEEE Computer Society Press, Los Alamitos, Calif., 345--356.]]
[13]
Chen, J., Dewitt, D. J., Tian, F., and Wang, Y. 2000. NiagaraCQ: A scalable continuous query system for Internet databases. In Proceedings of SIGMOD 2000. ACM, New York, 379--390.]]
[14]
Clark, J. 1999. XSL transformations (XSLT)---version 1.0. http://www.w3.org/TR/xslt.]]
[15]
Clark, J., and DeRose, S. 1999. XML path language (XPath)---version 1.0. http://www.w3.org/TR/xpath.]]
[16]
Cover, R. 1999. The SGML/XML Web Page. http://www.w3.org/TR/xslt.]]
[17]
DeRose, S., Daniel Jr., R., and Maler, E. 1999. XML pointer language (XPointer). http://www.w3.org/TR/WD-xptr.]]
[18]
Diaz, A. L., and Lovell, D. 1999. XML generator. http://www.alphaworks.ibm.com/tech/xmlgenerator.]]
[19]
Fabret, F., Jacobsen, H-A., Llirbat, F., Pereira, J., Ross, K. A., and Shasha, D. 2001. Filtering algorithms and implementation for very fast publish/subscribe systems. In Proceedings of SIGMOD 2001. ACM, New York, 115--126.]]
[20]
Foltz, P. W., and Dumais, S. T. 1992. Personalized information delivery: An analysis of information filtering methods. Commun. ACM 35, 12, 51--60.]]
[21]
Goldman, R. and Widom, J. 1997. DataGuides: Enabling query formulation and optimization in semistructured databases. In Proceedings of VLDB 1997. Morgan-Kaufmann, San Francisco, Calif., 436--445.]]
[22]
Green, T. J., Miklau, G., Onizuka, M., and Suciu, D. 2003. Processing XML streams with deterministic automata. In Proceedings of ICDT 2003. Springer, Berlin, Germany, 173--189.]]
[23]
Hanson, E. N., Carnes, C., Huang, L., Konyala, M., Noronha, L., Parthasarathy, S., Park, J. B., and Vernon, A. 1999. Scalable trigger processing. In Proceedings of ICDE 1999. IEEE Computer Society Press, Los Alamitos, Calif., 266--275.]]
[24]
Hopcroft, J. E., and Ullman, J. D. 1979. Introduction to Automata Theory, Languages and Computation. Addition-Wesley, Boston, Mass.]]
[25]
Ives, Z., Levy, A., and Weld, D. 2000. Efficient evaluation of regular path expressions on streaming XML data. Technical Report, University of Washington, Seattle, Wash.]]
[26]
Kay, M. 2001. Saxon: the XSLT processor. http://users.iclway.co.uk/mhkay/saxon/.]]
[27]
Lakshmanan, L. V. S., and Parthasarathy, S. 2002. On efficient matching of streaming XML documents and queries. In Proceedings of EDBT 2002. Springer, Berlin, Germany, 142--160.]]
[28]
Ley, M. 2001. DBLP DTD. http://www.acm.org/sigmod/dblp/db/about/dblp.dtd.]]
[29]
Liu, L., Pu, C., and Tang, W. 1999. Continual queries for Internet scale event-driven information delivery. Special Issue on Web Technologies, IEEE TKDE, 11, 4, 610--628.]]
[30]
Madden, S., Shah, M., Hellerstein, J. M., and Raman, V. 2002. Continuously adaptive continuous queries over streams. In Proceedings of SIGMOD 2002. ACM, New York, 49--60.]]
[31]
Nestorov, S., Ullman, J. D., Wiener, J. L., and Chawathe, S. S. 1997. Representative objects: Concise representations of semistrctured hierarchical data. In Proceedings of ICDE 1997. IEEE Computer Society Press, Los Alamitos, Calif., 79--90.]]
[32]
Nguyen, B., Abiteboul, S., Cobena, G., and Preda, M. 2001. Monitoring XML data on the Web. In Proceedings of SIGMOD 2001. ACM, New York, 437--448.]]
[33]
Ozen, B., Kilic, O., Altinel, M., and Dogac, A. 2001. Highly personalized information delivery to mobile clients. In Proceedings of the 2nd ACM International Workshop on Data Engineering for Wireless and Mobile Access. ACM, New York, 35--41.]]
[34]
Pereira, J., Fabret, F., Llirbat, F., and Jacobsen, H.-A. 2001. WebFilter: A high-throughput XML-based publish and subscribe System. In Proceedings of VLDB 2001. Morgan-Kaufmann, San Francisco, Calif., 723--724.]]
[35]
Salton, G. 1989. Automatic Text Processing. Addison-Wesley, Boston, Mass.]]
[36]
Sax Project Organization. 2001. SAX: Simple API for XML. http://www. saxproject.org.]]
[37]
Schreier, U., Pirahesh, H., Agrawal, R., and Mohan, C. 1991. Alert: An architecture for transforming a passive DBMS into an active DBMS. In Proceedings of VLDB 1991. Morgan-Kaufmann, San Francisco, Calif., 469--478.]]
[38]
Stonebraker, M., Jhingran, A., Goh, J., and Potamianos, S. 1990. On rules, procedures, caching and views in data base systems. In Proceedings of SIGMOD 1990. ACM, New York, 281--290.]]
[39]
Sun Microsystems, Inc. 2001. Java XML pack. Winter 01 update release. http://java.sun.com/xml/downloads/javaxmlpack.html.]]
[40]
Terry, D. B., Goldberg, D., Nichols, D. A., and Oki, B. M. 1992. Continuous queries over append-only databases. In Proceedings of SIGMOD 1992. ACM, New York, 321--330.]]
[41]
Watson, B. W. 1997. Practical optimization for automata. In Proceedings of the 2nd International Workshop on Implementing Automata. Springer, Berlin, Germany, 232--240.]]
[42]
Widom, J. and Finklestein, S. J. 1990. Set-oriented production rules in relational database systems. In Proceedings of SIGMOD 1990. ACM, New York, 259--270.]]
[43]
Wutka. 2000. DTD parser. http://www.wutka.com/dtdparser.html.]]
[44]
Yan, T. W. and Garcia-Molina, H. 1994. Index structures for selective dissemination of information under Boolean model. ACM Trans. Datab. Syst. 19, 2, 332--364.]]
[45]
Yan, T. W. and Garcia-Molina, H. 1999. The SIFT information dissemination system. ACM Trans. Datab. Syst. 24, 4, 529--565.]]

Cited By

View all
  • (2024)Nona: A Framework for Elastic Stream Provenance2024 IEEE 44th International Conference on Distributed Computing Systems (ICDCS)10.1109/ICDCS60910.2024.00071(703-714)Online publication date: 23-Jul-2024
  • (2023)Exploiting Structure in Regular Expression QueriesProceedings of the ACM on Management of Data10.1145/35892971:2(1-28)Online publication date: 20-Jun-2023
  • (2021)A-Tree: A Dynamic Data Structure for Efficiently Indexing Arbitrary Boolean ExpressionsProceedings of the 2021 International Conference on Management of Data10.1145/3448016.3457266(817-829)Online publication date: 9-Jun-2021
  • Show More Cited By

Recommendations

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image ACM Transactions on Database Systems
ACM Transactions on Database Systems  Volume 28, Issue 4
December 2003
286 pages
ISSN:0362-5915
EISSN:1557-4644
DOI:10.1145/958942
Issue’s Table of Contents

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 01 December 2003
Published in TODS Volume 28, Issue 4

Permissions

Request permissions for this article.

Check for updates

Author Tags

  1. Nondeterministic Finite Automaton
  2. XML filtering
  3. content-based matching
  4. nested path expressions.
  5. path sharing
  6. predicate evaluation
  7. structure matching

Qualifiers

  • Article

Contributors

Other Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months)27
  • Downloads (Last 6 weeks)1
Reflects downloads up to 18 Nov 2024

Other Metrics

Citations

Cited By

View all
  • (2024)Nona: A Framework for Elastic Stream Provenance2024 IEEE 44th International Conference on Distributed Computing Systems (ICDCS)10.1109/ICDCS60910.2024.00071(703-714)Online publication date: 23-Jul-2024
  • (2023)Exploiting Structure in Regular Expression QueriesProceedings of the ACM on Management of Data10.1145/35892971:2(1-28)Online publication date: 20-Jun-2023
  • (2021)A-Tree: A Dynamic Data Structure for Efficiently Indexing Arbitrary Boolean ExpressionsProceedings of the 2021 International Conference on Management of Data10.1145/3448016.3457266(817-829)Online publication date: 9-Jun-2021
  • (2021)Comparative Analysis of Genetic Algorithm and XML Filtering Technique for Multi-Tenant SaaS Configuration Management2021 IEEE 12th Annual Information Technology, Electronics and Mobile Communication Conference (IEMCON)10.1109/IEMCON53756.2021.9623123(0223-0228)Online publication date: 27-Oct-2021
  • (2021)A survey on semi-structured web data manipulations by non-expert usersComputer Science Review10.1016/j.cosrev.2021.10036740(100367)Online publication date: May-2021
  • (2020)Hardware-Software XML-Documents ProcessingÈlektronnoe modelirovanie10.15407/emodel.42.01.03342:1(33-50)Online publication date: 5-Feb-2020
  • (2020)Optimal algorithms for ranked enumeration of answers to full conjunctive queriesProceedings of the VLDB Endowment10.14778/3397230.339725013:9(1582-1597)Online publication date: 26-Jun-2020
  • (2020)Understanding the effect of data center resource disaggregation on production DBMSsProceedings of the VLDB Endowment10.14778/3397230.339724913:9(1568-1581)Online publication date: 26-Jun-2020
  • (2020)Anytime stochastic routing with hybrid learningProceedings of the VLDB Endowment10.14778/3397230.339724813:9(1555-1567)Online publication date: 26-Jun-2020
  • (2020)AJoinProceedings of the VLDB Endowment10.14778/3372716.337271813:4(435-448)Online publication date: 6-Jan-2020
  • Show More Cited By

View Options

Login options

Full Access

View options

PDF

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Media

Figures

Other

Tables

Share

Share

Share this Publication link

Share on social media