research-article

Entity matching for semistructured data in the Cloud

Authors:

Marcus Paradies,

Susan Malaika,

Jérôme Siméon,

Shahan Khatchadourian,

Kai-Uwe Sattler

SAC '12: Proceedings of the 27th Annual ACM Symposium on Applied Computing

Pages 453 - 458

https://doi.org/10.1145/2245276.2245363

Published: 26 March 2012

Abstract

The amount of available information, on the Web or inside companies, is expanding rapidly. With Cloud infrastructure maturing (including tools for parallel data processing, text analytics, clustering, etc.), there is growing interest in integrating data to produce higher-value content. New challenges notably include entity matching over large volumes of heterogeneous data.

In this paper, we describe an approach for entity matching over large amounts of semistructured data in the Cloud. The approach combines ChuQL [4], a recently proposed extension of XQuery with MapReduce, and a blocking technique for entity matching that can be executed efficiently on top of MapReduce. We illustrate the proposed approach by applying it to automatically extract and enrich references in Wikipedia, and we report on an experimental evaluation of the approach.
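To make the idea of blocking on top of MapReduce concrete, the following is a minimal single-process sketch in Python. It is not the ChuQL implementation described in the paper: the blocking key (first three characters of the title), the similarity measure (`difflib.SequenceMatcher`), the 0.8 threshold, and all function names are assumptions chosen only for illustration. The map step assigns each record a blocking key, the shuffle step groups records by key, and the reduce step compares records only within a block.

```python
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations


def blocking_key(record):
    """Hypothetical blocking key: first three characters of the normalized title."""
    return record["title"].lower().replace(" ", "")[:3]


def map_phase(records):
    """Map: emit one (blocking key, record) pair per input record."""
    for record in records:
        yield blocking_key(record), record


def shuffle(pairs):
    """Shuffle: group records by key, as a MapReduce runtime would do."""
    blocks = defaultdict(list)
    for key, record in pairs:
        blocks[key].append(record)
    return blocks


def reduce_phase(block, threshold=0.8):
    """Reduce: compare records pairwise, but only within a single block."""
    for a, b in combinations(block, 2):
        score = SequenceMatcher(None, a["title"].lower(), b["title"].lower()).ratio()
        if score >= threshold:  # assumed threshold, not taken from the paper
            yield a["id"], b["id"]


def match_entities(records, threshold=0.8):
    """Run map, shuffle, and reduce in sequence and collect matching id pairs."""
    matches = []
    for block in shuffle(map_phase(records)).values():
        matches.extend(reduce_phase(block, threshold))
    return matches


records = [
    {"id": 1, "title": "Autonomous citation matching"},
    {"id": 2, "title": "Autonomous citation matching."},
    {"id": 3, "title": "Parallel Sorted Neighborhood Blocking with MapReduce"},
]
matches = match_entities(records)
print(matches)
```

Because candidate pairs are generated only inside a block, the number of comparisons depends on block sizes rather than on the square of the total record count. The price is that true matches whose keys differ are never compared, so the choice of blocking key matters.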

References

[1]

R. Baxter, P. Christen, and T. Churches. A comparison of fast blocking methods for record linkage. In SIGKDD '03 Workshop on Data Cleaning, Record Linkage, and Object Consolidation, pages 25--27, 2003.

[2]

P. Christen and T. Churches. Febrl: Freely extensible biomedical record linkage. Manual, release 0.2, 2003.

[3]

M. A. Hernández and S. J. Stolfo. Real-world data is dirty: Data cleansing and the merge/purge problem. Data Min. Knowl. Discov., 2: 9--37, 1998.

[4]

S. Khatchadourian, M. Consens, and J. Simeon. Having a ChuQL at XML on the Cloud. In AMW, 2011.

[5]

S. Khatchadourian, M. Consens, and J. Simeon. ChuQL: Processing XML with XQuery using Hadoop. In Cascon, 2011.

[6]

T. Kirsten, L. Kolb, M. Hartung, A. Gross, H. Köpcke, and E. Rahm. Data Partitioning for Parallel Entity Matching. CoRR, 2010.

[7]

L. Kolb, H. Köpcke, A. Thor, and E. Rahm. Learning-based Entity Resolution with MapReduce. In CloudDb 2011, 2011.

[8]

L. Kolb, A. Thor, and E. Rahm. Parallel Sorted Neighborhood Blocking with MapReduce. In BTW, volume 180, pages 45--64, 2011.

[9]

S. Lawrence, C. L. Giles, and K. D. Bollacker. Autonomous citation matching. In AGENTS '99, pages 392--393, 1999.

[10]

A. McCallum, K. Nigam, and L. H. Ungar. Efficient clustering of high-dimensional data sets with application to reference matching. In KDD '00, pages 169--178, 2000.

[11]

H. Pasula, B. Marthi, B. Milch, S. Russell, and I. Shpitser. Identity uncertainty and citation matching. In NIPS, pages 1425--1432, 2003.

[12]

R. Vernica, M. J. Carey, and C. Li. Efficient parallel set-similarity joins using MapReduce. In SIGMOD Conference, pages 495--506, 2010.

Cited By

M. Fedoryszak, D. Tkaczyk, and Ł. Bolikowski. Large Scale Citation Matching Using Apache Hadoop. In Research and Advanced Technology for Digital Libraries, pages 362--365, 2013.
https://doi.org/10.1007/978-3-642-40501-3_37

Information & Contributors

Information

Published In

SAC '12: Proceedings of the 27th Annual ACM Symposium on Applied Computing

March 2012

2179 pages

ISBN:9781450308571

DOI:10.1145/2245276

Conference Chairs:
Sascha Ossowski, University Rey Juan Carlos, Spain
Paola Lecca, The Microsoft Research - University of Trento COSBI, Italy

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article

Conference

SAC 2012

Sponsor:

SIGAPP

SAC 2012: ACM Symposium on Applied Computing

March 26 - 30, 2012

Trento, Italy

Acceptance Rates

SAC '12 Paper Acceptance Rate 270 of 1,056 submissions, 26%

Overall Acceptance Rate 1,650 of 6,669 submissions, 25%
