Abstract
Similarity join is widely used because data items that represent the same real-world object may differ across sources due to differing conventions. Traditional similarity join methods, however, are inefficient, so a method that offers both high efficiency and high join quality is needed. In this paper, we propose two new edit operations (reversing and mapping), together with similarity join algorithms based on the newly defined measure. Our method replaces the computation of tree edit distance with the computation of the k-generation set distance between trees, which greatly simplifies the join process. The time complexity of our method is O(n²), where n is the tree size. We show that our method has some advantages over other methods and that it scales to large data sets.
This research is partially supported by the National Science Foundation of China under Grant Nos. 61003046, 60831160525, and 61111130189; the Key Program of the National Natural Science Foundation of China under Grant No. 60933001; the National Postdoctoral Foundation of China under Grant Nos. 20090450126 and 201003447; and the Doctoral Fund of the Ministry of Education of China under Grant No. 20102302120054.
Copyright information
© 2012 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Wang, Y., Wang, H., Wang, Y., Gao, H. (2012). Similarity Join on XML Based on k-Generation Set Distance. In: Wang, L., Jiang, J., Lu, J., Hong, L., Liu, B. (eds) Web-Age Information Management. WAIM 2011. Lecture Notes in Computer Science, vol 7142. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-28635-3_11
DOI: https://doi.org/10.1007/978-3-642-28635-3_11
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-28634-6
Online ISBN: 978-3-642-28635-3
eBook Packages: Computer Science, Computer Science (R0)