DOI: 10.1145/2245276.2245363

Entity matching for semistructured data in the Cloud

Published: 26 March 2012

Abstract

The amount of available information, on the Web or inside companies, is expanding rapidly. With Cloud infrastructure maturing (including tools for parallel data processing, text analytics, clustering, etc.), there is more interest in integrating data to produce higher-value content. This raises new challenges, notably entity matching over large volumes of heterogeneous data.
In this paper, we describe an approach for entity matching over large amounts of semistructured data in the Cloud. The approach combines ChuQL [4], a recently proposed extension of XQuery with MapReduce, and a blocking technique for entity matching that can be executed efficiently on top of MapReduce. We illustrate the proposed approach by applying it to automatically extract and enrich references in Wikipedia, and we report on an experimental evaluation of the approach.
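
The general idea of blocking on top of a map/reduce pattern can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the ChuQL implementation from the paper: the blocking key (first three letters of the title), the Jaccard similarity, the 0.5 threshold, and all function names are hypothetical choices made for the example, and the "shuffle" step is simulated in memory rather than run on a cluster.

```python
from collections import defaultdict
from itertools import combinations


def blocking_key(record):
    # Hypothetical blocking key: first three letters of the lowercased title.
    return record["title"].lower()[:3]


def jaccard(a, b):
    # Jaccard similarity between the word sets of two strings.
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def map_phase(records):
    # Map: emit (blocking key, record) so similar records share a key.
    for record in records:
        yield blocking_key(record), record


def shuffle(pairs):
    # Simulated shuffle: group mapped values by key, as MapReduce would.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(block, threshold=0.5):
    # Reduce: compare only the records that landed in the same block.
    for left, right in combinations(block, 2):
        if jaccard(left["title"], right["title"]) >= threshold:
            yield left["id"], right["id"]


def match_entities(records, threshold=0.5):
    matches = []
    for block in shuffle(map_phase(records)).values():
        matches.extend(reduce_phase(block, threshold))
    return matches


records = [
    {"id": 1, "title": "Entity matching for semistructured data in the Cloud"},
    {"id": 2, "title": "Entity Matching for Semistructured Data in the cloud"},
    {"id": 3, "title": "Parallel sorted neighborhood blocking with MapReduce"},
    {"id": 4, "title": "Entity resolution survey"},
]

print(match_entities(records))
```

With these sample records, records 1, 2, and 4 fall into the same block, record 3 is never compared with them, and only the pair (1, 2) exceeds the threshold. The point of blocking is visible even at this scale: three comparisons are made instead of the six a full pairwise comparison would need.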

References

[1]
R. Baxter, P. Christen, and T. Churches. A comparison of fast blocking methods for record linkage. In SIGKDD '03 Workshop on Data Cleaning, Record Linkage, and Object Consolidation, pages 25--27, 2003.
[2]
P. Christen and T. Churches. Febrl: Freely extensible biomedical record linkage. Manual, release 0.2, 2003.
[3]
M. A. Hernández and S. J. Stolfo. Real-world data is dirty: Data cleansing and the merge/purge problem. Data Min. Knowl. Discov., 2: 9--37, 1998.
[4]
S. Khatchadourian, M. Consens, and J. Simeon. Having a ChuQL at XML on the Cloud. In AMW, 2011.
[5]
S. Khatchadourian, M. Consens, and J. Simeon. ChuQL: Processing XML with XQuery using Hadoop. In Cascon, 2011.
[6]
T. Kirsten, L. Kolb, M. Hartung, A. Gross, H. Köpcke, and E. Rahm. Data Partitioning for Parallel Entity Matching. CoRR, 2010.
[7]
L. Kolb, H. Köpcke, A. Thor, and E. Rahm. Learning-based Entity Resolution with MapReduce. In CloudDb 2011, 2011.
[8]
L. Kolb, A. Thor, and E. Rahm. Parallel Sorted Neighborhood Blocking with MapReduce. In BTW, volume 180, pages 45--64, 2011.
[9]
S. Lawrence, C. L. Giles, and K. D. Bollacker. Autonomous citation matching. In AGENTS '99, pages 392--393, 1999.
[10]
A. McCallum, K. Nigam, and L. H. Ungar. Efficient clustering of high-dimensional data sets with application to reference matching. In KDD '00, pages 169--178, 2000.
[11]
H. Pasula, B. Marthi, B. Milch, S. Russell, and I. Shpitser. Identity uncertainty and citation matching. In NIPS, pages 1425--1432, 2003.
[12]
R. Vernica, M. J. Carey, and C. Li. Efficient parallel set-similarity joins using MapReduce. In SIGMOD Conference, pages 495--506, 2010.

Cited By

  • (2013) Large Scale Citation Matching Using Apache Hadoop. Research and Advanced Technology for Digital Libraries, pages 362--365. DOI: 10.1007/978-3-642-40501-3_37. Online publication date: 2013.

Published In

SAC '12: Proceedings of the 27th Annual ACM Symposium on Applied Computing
March 2012
2179 pages
ISBN: 9781450308571
DOI: 10.1145/2245276
Conference Chairs: Sascha Ossowski, Paola Lecca
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

  • Research-article

Conference

SAC 2012
Sponsor:
SAC 2012: ACM Symposium on Applied Computing
March 26 - 30, 2012
Trento, Italy

Acceptance Rates

SAC '12 Paper Acceptance Rate 270 of 1,056 submissions, 26%;
Overall Acceptance Rate 1,650 of 6,669 submissions, 25%
