Nothing Special   »   [go: up one dir, main page]

Skip to main content

LAX: An Efficient Approximate XML Join Based on Clustered Leaf Nodes for XML Data Integration

  • Conference paper
Database: Enterprise, Skills and Innovation (BNCOD 2005)

Part of the book series: Lecture Notes in Computer Science ((LNISA,volume 3567))

Included in the following conference series:

Abstract

Recently, more and more data are published and exchanged by XML on the Internet. However, different XML data sources might contain the same data but have different structures. Therefore, it requires an efficient method to integrate such XML data sources so that more complete and useful information can be conveniently accessed and acquired by users.

The tree edit distance is regarded as an effective metric for evaluating the structural similarity in XML documents. However, its computational cost is extremely expensive and the traditional wisdom in join algorithms cannot be applied easily. In this paper, we propose LAX (Leaf-clustering based Approximate XML join algorithm), in which the two XML document trees are clustered into subtrees representing independent items and the similarity between them is determined by calculating the similarity degree based on the leaf nodes of each pair of subtrees. We also propose an effective algorithm for clustering the XML document for LAX. We show that it is easily to apply the traditional wisdom in join algorithms to LAX and the join result contains complete information of the two documents. We then do experiments to compare LAX with the tree edit distance and evaluate its performance using both synthetic and real data sets. Our experimental results show that LAX is more efficient in performance and more effective for measuring the approximate similarity between XML documents than the tree edit distance.

This is a preview of subscription content, log in via an institution to check access.

Access this chapter

Subscribe and save

Springer+ Basic
$34.99 /Month
  • Get 10 units per month
  • Download Article/Chapter or eBook
  • 1 Unit = 1 Article or 1 Chapter
  • Cancel anytime
Subscribe now

Buy Now

Chapter
USD 29.95
Price excludes VAT (USA)
  • Available as PDF
  • Read on any device
  • Instant download
  • Own it forever
eBook
USD 39.99
Price excludes VAT (USA)
  • Available as PDF
  • Read on any device
  • Instant download
  • Own it forever
Softcover Book
USD 54.99
Price excludes VAT (USA)
  • Compact, lightweight edition
  • Dispatched in 3 to 5 business days
  • Free shipping worldwide - see info

Tax calculation will be finalised at checkout

Purchases are for personal use only

Institutional subscriptions

Preview

Unable to display preview. Download preview PDF.

Unable to display preview. Download preview PDF.

Similar content being viewed by others

References

  1. ACM SIGMOD Record in XML, Available at http://www.acm.org/sigmod/record/xml/

  2. Arenas, M., Libkin, L.: A Normal Form for XML Documents. ACM Transactions on Database Systems 29(1), 195–232 (2004)

    Article  Google Scholar 

  3. Chawathe, S., Garacia-Molina, H.: Meaningful Change Detection in Structured Data. In: Proc. of ACM SIGMOD 1997, pp. 26–37 (1997)

    Google Scholar 

  4. Chawathe, S., Tajaraman, A., Garacia-Molina, H., Widom, J.: Change Detection in Hierarchically Structured Information. In: Proc. of ACM SIGMOD 1996, pp. 493–504 (1996)

    Google Scholar 

  5. Cruz, I.F., Xiao, H., Hsu, F.: An Ontology-Based Framework for XML Semantic Integration. In: Proc. of IDEAS 2004, pp. 217–226 (2004)

    Google Scholar 

  6. Doan, A., Domingos, P., Halevy, A.: Reconciling Schemas of Disparate Data Sources: A Machine-learning Approch. In: Proc. of ACM SIGMOD 2001, pp. 509–520 (2001)

    Google Scholar 

  7. Fan, W., Libkin, L.: On XML Integrity Constraints in the Presence of DTDs. In: Proc. of PODS 2001, pp. 114–125 (2001)

    Google Scholar 

  8. Guha, S., Jagadish, H.V., Koudas, N., Srivastava, D., Yu, T.: Approximate XML Joins. In: Proc. of ACM SIGMOD 2002, pp. 287–298 (2002)

    Google Scholar 

  9. Guha, S., Koudas, N., Srivastava, D., Yu, T.: Index-Based Approximate XML Joins. In: Proc. of ICDE 2003, pp. 708–710 (2003)

    Google Scholar 

  10. IBM XML Generator, Available at http://www.alphaworks.ibm.com/xml/

  11. Lee, M., Yang, L., Hus, W., Yang, X.: XClust: Clustering XML Schemas for Effective Integration. In: Proc. of CIKM 2002, pp. 292–299 (2002)

    Google Scholar 

  12. MAGE (MicroArray and Gene Expression), Available at http://www.mged.org/Workgroups/MAGE/mage.html

  13. Marian, A., Abiteboul, S., Cobena, G., Mignet, L.: Change-Centric Management of Versions in an XML Warehouse. In: Proc. of 27th VLDB, pp. 581–590 (2001)

    Google Scholar 

  14. Nierman, A., Jagadish, H.V.: Evaluating Structural Similarity in XML Documents. In: Proc. of WebDB 2002, pp. 61–66 (2002)

    Google Scholar 

  15. Rahm, E., Bernstein, P.A.: A Survey of approaches to automatic schema matching. The VLDB Journal 10(1), 334–350 (2001)

    Article  MATH  Google Scholar 

  16. Selkow, S.: The Tree-to-tree Editing Problem. Information Processing Letters 6(6), 184–186 (1977)

    Article  MATH  MathSciNet  Google Scholar 

  17. Wang, Y., DeWitt, D.J., Cai, J.: X-Diff: An Effective Change Detection Algo-rithm for XML Documents. In: Proc. of ICDE 2003, pp. 519–530 (March 2003)

    Google Scholar 

  18. World Wide Web Consortium (W3C). The Document Object Model (DOM), http://www.w3.org/DOM/

  19. XML Version of DBLP, Available at http://dblp.uni-trier.de/xml/

  20. Yang, X., Lee, M., Ling, T.: Resolving Structural Conflicts in the Integration of XML Schemas: A Semantic Approach. In: Song, I.-Y., Liddle, S.W., Ling, T.-W., Scheuermann, P. (eds.) ER 2003. LNCS, vol. 2813, pp. 520–533. Springer, Heidelberg (2003)

    Chapter  Google Scholar 

  21. Zhang, K., Shasha, D.: Simple Fast Algorithm for the Editing Distance Between Trees and Related Problems. SIAM Journal of Computing 18(6), 1245–1262 (1989)

    Article  MATH  MathSciNet  Google Scholar 

  22. Zhang, K., Shasha, D.: Tree Pattern Matching. In: Pattern Matching Algorithms, ch. 11, Oxford University Press, Oxford (1997)

    Google Scholar 

Download references

Author information

Authors and Affiliations

Authors

Editor information

Editors and Affiliations

Rights and permissions

Reprints and permissions

Copyright information

© 2005 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Liang, W., Yokota, H. (2005). LAX: An Efficient Approximate XML Join Based on Clustered Leaf Nodes for XML Data Integration. In: Jackson, M., Nelson, D., Stirk, S. (eds) Database: Enterprise, Skills and Innovation. BNCOD 2005. Lecture Notes in Computer Science, vol 3567. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11511854_7

Download citation

  • DOI: https://doi.org/10.1007/11511854_7

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-26973-1

  • Online ISBN: 978-3-540-31677-0

  • eBook Packages: Computer ScienceComputer Science (R0)

Publish with us

Policies and ethics