Abstract
Recently, more and more data are published and exchanged by XML on the Internet. However, different XML data sources might contain the same data but have different structures. Therefore, it requires an efficient method to integrate such XML data sources so that more complete and useful information can be conveniently accessed and acquired by users.
The tree edit distance is regarded as an effective metric for evaluating the structural similarity in XML documents. However, its computational cost is extremely expensive and the traditional wisdom in join algorithms cannot be applied easily. In this paper, we propose LAX (Leaf-clustering based Approximate XML join algorithm), in which the two XML document trees are clustered into subtrees representing independent items and the similarity between them is determined by calculating the similarity degree based on the leaf nodes of each pair of subtrees. We also propose an effective algorithm for clustering the XML document for LAX. We show that it is easily to apply the traditional wisdom in join algorithms to LAX and the join result contains complete information of the two documents. We then do experiments to compare LAX with the tree edit distance and evaluate its performance using both synthetic and real data sets. Our experimental results show that LAX is more efficient in performance and more effective for measuring the approximate similarity between XML documents than the tree edit distance.
Access this chapter
Tax calculation will be finalised at checkout
Purchases are for personal use only
Preview
Unable to display preview. Download preview PDF.
Similar content being viewed by others
References
ACM SIGMOD Record in XML, Available at http://www.acm.org/sigmod/record/xml/
Arenas, M., Libkin, L.: A Normal Form for XML Documents. ACM Transactions on Database Systems 29(1), 195–232 (2004)
Chawathe, S., Garacia-Molina, H.: Meaningful Change Detection in Structured Data. In: Proc. of ACM SIGMOD 1997, pp. 26–37 (1997)
Chawathe, S., Tajaraman, A., Garacia-Molina, H., Widom, J.: Change Detection in Hierarchically Structured Information. In: Proc. of ACM SIGMOD 1996, pp. 493–504 (1996)
Cruz, I.F., Xiao, H., Hsu, F.: An Ontology-Based Framework for XML Semantic Integration. In: Proc. of IDEAS 2004, pp. 217–226 (2004)
Doan, A., Domingos, P., Halevy, A.: Reconciling Schemas of Disparate Data Sources: A Machine-learning Approch. In: Proc. of ACM SIGMOD 2001, pp. 509–520 (2001)
Fan, W., Libkin, L.: On XML Integrity Constraints in the Presence of DTDs. In: Proc. of PODS 2001, pp. 114–125 (2001)
Guha, S., Jagadish, H.V., Koudas, N., Srivastava, D., Yu, T.: Approximate XML Joins. In: Proc. of ACM SIGMOD 2002, pp. 287–298 (2002)
Guha, S., Koudas, N., Srivastava, D., Yu, T.: Index-Based Approximate XML Joins. In: Proc. of ICDE 2003, pp. 708–710 (2003)
IBM XML Generator, Available at http://www.alphaworks.ibm.com/xml/
Lee, M., Yang, L., Hus, W., Yang, X.: XClust: Clustering XML Schemas for Effective Integration. In: Proc. of CIKM 2002, pp. 292–299 (2002)
MAGE (MicroArray and Gene Expression), Available at http://www.mged.org/Workgroups/MAGE/mage.html
Marian, A., Abiteboul, S., Cobena, G., Mignet, L.: Change-Centric Management of Versions in an XML Warehouse. In: Proc. of 27th VLDB, pp. 581–590 (2001)
Nierman, A., Jagadish, H.V.: Evaluating Structural Similarity in XML Documents. In: Proc. of WebDB 2002, pp. 61–66 (2002)
Rahm, E., Bernstein, P.A.: A Survey of approaches to automatic schema matching. The VLDB Journal 10(1), 334–350 (2001)
Selkow, S.: The Tree-to-tree Editing Problem. Information Processing Letters 6(6), 184–186 (1977)
Wang, Y., DeWitt, D.J., Cai, J.: X-Diff: An Effective Change Detection Algo-rithm for XML Documents. In: Proc. of ICDE 2003, pp. 519–530 (March 2003)
World Wide Web Consortium (W3C). The Document Object Model (DOM), http://www.w3.org/DOM/
XML Version of DBLP, Available at http://dblp.uni-trier.de/xml/
Yang, X., Lee, M., Ling, T.: Resolving Structural Conflicts in the Integration of XML Schemas: A Semantic Approach. In: Song, I.-Y., Liddle, S.W., Ling, T.-W., Scheuermann, P. (eds.) ER 2003. LNCS, vol. 2813, pp. 520–533. Springer, Heidelberg (2003)
Zhang, K., Shasha, D.: Simple Fast Algorithm for the Editing Distance Between Trees and Related Problems. SIAM Journal of Computing 18(6), 1245–1262 (1989)
Zhang, K., Shasha, D.: Tree Pattern Matching. In: Pattern Matching Algorithms, ch. 11, Oxford University Press, Oxford (1997)
Author information
Authors and Affiliations
Editor information
Editors and Affiliations
Rights and permissions
Copyright information
© 2005 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Liang, W., Yokota, H. (2005). LAX: An Efficient Approximate XML Join Based on Clustered Leaf Nodes for XML Data Integration. In: Jackson, M., Nelson, D., Stirk, S. (eds) Database: Enterprise, Skills and Innovation. BNCOD 2005. Lecture Notes in Computer Science, vol 3567. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11511854_7
Download citation
DOI: https://doi.org/10.1007/11511854_7
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-26973-1
Online ISBN: 978-3-540-31677-0
eBook Packages: Computer ScienceComputer Science (R0)