XML Structural Similarity Search Using MapReduce

Peisen Yuan^20,21,
Chaofeng Sha^20,21,
Xiaoling Wang²²,
Bin Yang^20,21,
Aoying Zhou^21,22 &
…
Su Yang^20,21

Part of the book series: Lecture Notes in Computer Science ((LNISA,volume 6184))

Included in the following conference series:

International Conference on Web-Age Information Management

Abstract

XML is a de-facto standard for web data exchange and information representation. Efficient management of these large volumes of XML data brings challenges to conventional technique. To cope with large scale data, MapReduce computing framework as an efficient solution has attracted more and more attention in the database community recently. In this paper, an efficient and scalable framework is proposed for XML structural similarity search on large cluster with MapReduce. First, sub-structures of XML structure are extracted from large XML corpus located on a large cluster in parallel. Then Min-Hashing and locality sensitive hashing techniques are developed on the distributed and parallel computing framework for efficient structural similarity search processing. An empirical study on the cluster with real large datasets demonstrates the effectiveness and efficiency of our approach.

This is a preview of subscription content, log in via an institution to check access.

Access this chapter

Subscribe and save

Springer+ Basic

$34.99 /Month

Get 10 units per month
Download Article/Chapter or eBook
1 Unit = 1 Article or 1 Chapter
Cancel anytime

Buy Now

Chapter: USD 29.95; Price excludes VAT (USA)

eBook: USD 84.99; Price excludes VAT (USA)

Softcover Book: USD 109.99; Price excludes VAT (USA)

Tax calculation will be finalised at checkout

Purchases are for personal use only

Institutional subscriptions

Preview

Unable to display preview. Download preview PDF.

An Efficient Document Indexing-Based Similarity Search in Large Datasets

Locality Sensitive Hashing for Similarity Search Using MapReduce on Large Scale Data

An Elastic Approximate Similarity Search in Very Large Datasets with MapReduce

References

Özsu, M.T.: Distributed XML Processing. In: Li, Q., Feng, L., Pei, J., Wang, S.X., Zhou, X., Zhu, Q.-M. (eds.) APWeb/WAIM 2009. LNCS, vol. 5446, p. 1. Springer, Heidelberg (2009)
Chapter Google Scholar
Abiteboul, S., Benjelloun, O., Milo, T.: The Active XML project: an overview. The VLDB Journal 17(5), 1019–1040 (2008)
Article Google Scholar
Jiang, T., Wang, L., Zhang, K.: Alignment of Trees-An Alternative to Tree Edit. In: Crochemore, M., Gusfield, D. (eds.) CPM 1994. LNCS, vol. 807, pp. 75–86. Springer, Heidelberg (1994)
Google Scholar
Augsten, N., Böhlen, M., Gamper, J.: Approximate matching of hierarchical data using pq-grams. In: VLDB, pp. 301–312 (2005)
Google Scholar
Yuan, P., Wang, X., Sha, C., Gao, M., Zhou, A.: Grams³: An efficient framework for xml structural similarity search. In: DASFAA workshop: UDM (to appear, 2010)
Google Scholar
Apache Hadoop Project (2009), http://hadoop.apache.org/
Dean, J., Ghemawat, S.: MapReduce: Simplified Data Processing on Large Clusters. In: OSDI, pp. 1–13 (2004)
Google Scholar
Broder, A.Z.: On the resemblance and containment of documents. In: Proceedings of Compression and Complexity of Sequences 1997, pp. 21–29 (1997)
Google Scholar
Chum, O., Philbin, J., Isard, M., Zisserman, A.: Scalable near identical image and shot detection. In: ICVR, pp. 549–556. ACM, New York (2007)
Google Scholar
Broder, A.Z., Charikar, M., Frieze, A.M., Mitzenmacher, M.: Min-wise independent permutations. JCSS 60(3), 630–659 (2000)
MATH MathSciNet Google Scholar
Indyk, P., Motwani, R.: Approximate nearest neighbors: towards removing the curse of dimensionality. In: ASTC, pp. 604–613. ACM, New York (1998)
Google Scholar
Charikar, M.S.: Similarity estimation techniques from rounding algorithms. In: STOC, pp. 380–388. ACM, New York (2002)
Google Scholar
XML Data Repository (2009), http://www.cs.washington.edu/research/xmldatasets/
Sigmod Record (2009), http://www.sigmod.org/publications/sigmod-record/xml-edition
XML Twig (2009), http://xmltwig.com/xmltwig/
Joshi, S., Agrawal, N., Krishnapuram, R., Negi, S.: A bag of paths model for measuring structural similarity in Web documents. In: SIGKDD, pp. 577–582. ACM, New York (2003)
Google Scholar
Yang, R., Kalnis, P., Tung, A.K.H.: Similarity evaluation on tree-structured data. In: SIGMOD, pp. 754–765 (2005)
Google Scholar
Ghemawat, S., Gobioff, H., Leung, S.T.: The Google file system. SIGOPS 37(5), 43 (2003)
Google Scholar
Isard, M., Budiu, M., Yu, Y., Birrell, A., Fetterly, D.: Dryad: Distributed data-parallel programs from sequential building blocks. In: SIGOPS/EuroSys, pp. 59–72. ACM, New York (2007)
Google Scholar
Hbase Project (2009), http://hadoop.apache.org/hbase/
Das, A.S., Datar, M., Garg, A., Rajaram, S.: Google news personalization: scalable online collaborative filtering. In: WWW, pp. 271–280. ACM, New York (2007)
Chapter Google Scholar
Manku, G.S., Jain, A., Das Sarma, A.: Detecting near-duplicates for web crawling. In: WWW, pp. 141–150. ACM, New York (2007)
Chapter Google Scholar

Download references

Author information

Authors and Affiliations

School of Computer Science, Fudan University, P.R.C
Peisen Yuan, Chaofeng Sha, Bin Yang & Su Yang
Shanghai Key Laboratory of Intelligent Information Processing, P.R.C
Peisen Yuan, Chaofeng Sha, Bin Yang, Aoying Zhou & Su Yang
Shanghai Key Laboratory of Trustworthy Computing, Software Engineering Institute, East China Normal University, P.R.C
Xiaoling Wang & Aoying Zhou

Authors

Peisen Yuan
View author publications
You can also search for this author in PubMed Google Scholar
Chaofeng Sha
View author publications
You can also search for this author in PubMed Google Scholar
Xiaoling Wang
View author publications
You can also search for this author in PubMed Google Scholar
Bin Yang
View author publications
You can also search for this author in PubMed Google Scholar
Aoying Zhou
View author publications
You can also search for this author in PubMed Google Scholar
Su Yang
View author publications
You can also search for this author in PubMed Google Scholar

Editor information

Editors and Affiliations

Department of Computer Science, Hong Kong University of Science and Technology, Clear Water Bay, Kowloon, Hong Kong, China
Lei Chen
Computer Department, Sichuan University, 610064, Chengdu, China
Changjie Tang
Department of Computer Science, Duke University, Box 90129, NC 27708-0129, Durham, USA
Jun Yang
College of Computer Science, Zhejiang University, 388 Yuhangtang Road, 310058, Hangzhou, China
Yunjun Gao

Rights and permissions

Reprints and permissions

Copyright information

About this paper

Cite this paper

Yuan, P., Sha, C., Wang, X., Yang, B., Zhou, A., Yang, S. (2010). XML Structural Similarity Search Using MapReduce. In: Chen, L., Tang, C., Yang, J., Gao, Y. (eds) Web-Age Information Management. WAIM 2010. Lecture Notes in Computer Science, vol 6184. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-14246-8_19

Download citation

DOI: https://doi.org/10.1007/978-3-642-14246-8_19
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-14245-1
Online ISBN: 978-3-642-14246-8
eBook Packages: Computer ScienceComputer Science (R0)

Publish with us

Policies and ethics

XML Structural Similarity Search Using MapReduce

Abstract

Access this chapter

Subscribe and save

Buy Now

Preview

Similar content being viewed by others

An Efficient Document Indexing-Based Similarity Search in Large Datasets

Locality Sensitive Hashing for Similarity Search Using MapReduce on Large Scale Data

An Elastic Approximate Similarity Search in Very Large Datasets with MapReduce

References

Author information

Authors and Affiliations

Editor information

Editors and Affiliations

Rights and permissions

Copyright information

About this paper

Cite this paper

Download citation

Publish with us

Subscribe and save

Buy Now

Navigation

XML Structural Similarity Search Using MapReduce

Abstract

Access this chapter

Subscribe and save

Buy Now

Preview

Similar content being viewed by others

An Efficient Document Indexing-Based Similarity Search in Large Datasets

Locality Sensitive Hashing for Similarity Search Using MapReduce on Large Scale Data

An Elastic Approximate Similarity Search in Very Large Datasets with MapReduce

References

Author information

Authors and Affiliations

Editor information

Editors and Affiliations

Rights and permissions

Copyright information

About this paper

Cite this paper

Download citation

Share this paper

Publish with us

Search

Navigation