Tuning large scale deduplication with reduced effort

Authors:

Guilherme Dal Bianco,

Renata Galante,

Carlos A. Heuser,

Marcos André Gonçalves

SSDBM '13: Proceedings of the 25th International Conference on Scientific and Statistical Database Management

Article No.: 18, Pages 1 - 12

https://doi.org/10.1145/2484838.2484873

Published: 29 July 2013

Abstract

Deduplication is the task of identifying which objects in a data repository are potentially the same. It usually demands user intervention at several steps of the process, mainly to label some pairs as matches or non-matches. This information is then used to help identify other potentially duplicated records. When deduplication is applied to very large datasets, performance and matching quality depend on expert users configuring the most important steps of the process (e.g., blocking and classification). In this paper, we propose a new framework, FS-Dedup, that helps tune the deduplication process on large datasets with reduced user effort: the user is only required to label a small, automatically selected subset of pairs. FS-Dedup uses Signature-Based Deduplication (Sig-Dedup) algorithms as its deduplication core. Sig-Dedup is highly efficient and scalable on large datasets but requires an expert user to tune several parameters. FS-Dedup addresses this drawback by providing a framework that achieves high effectiveness without demanding specialized user knowledge about the dataset or thresholds. Our evaluation on large real and synthetic datasets (containing millions of records) shows that FS-Dedup is able to reach or even surpass the maximal matching quality obtained by Sig-Dedup techniques with reduced manual effort from the user.
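To make the threshold sensitivity mentioned in the abstract concrete, the following is a minimal sketch of a generic two-step deduplication pipeline: a blocking step that generates candidate pairs, followed by a classification step that applies a similarity threshold. It is not FS-Dedup and not any specific Sig-Dedup algorithm from the paper. The token-overlap blocking, the Jaccard measure, the function names, the sample records, and the threshold values are all assumptions chosen for illustration.

```python
from collections import defaultdict
from itertools import combinations


def tokenize(record):
    """Lowercase a record and split it into a set of word tokens."""
    return frozenset(record.lower().split())


def jaccard(a, b):
    """Jaccard similarity of two token sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def candidate_pairs(token_sets):
    """Blocking step (illustrative): pair records sharing at least one token."""
    index = defaultdict(set)
    for rid, tokens in enumerate(token_sets):
        for tok in tokens:
            index[tok].add(rid)
    pairs = set()
    for rids in index.values():
        pairs.update(combinations(sorted(rids), 2))
    return pairs


def find_duplicates(records, threshold):
    """Classification step: keep candidate pairs at or above the threshold."""
    token_sets = [tokenize(r) for r in records]
    return sorted(
        (i, j)
        for i, j in candidate_pairs(token_sets)
        if jaccard(token_sets[i], token_sets[j]) >= threshold
    )


# Hypothetical records; the first three plausibly describe the same person.
records = [
    "john smith 12 main street",
    "john smith 12 main st",
    "jon smith 12 main street",
    "mary jones 40 oak avenue",
]

for threshold in (0.8, 0.6, 0.4):
    print(threshold, find_duplicates(records, threshold))
```

On this toy input, the set of reported pairs changes with each threshold value, which is the kind of manual tuning decision the abstract says normally falls to an expert user.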

Cited By

R. Maskat, N. Paton, and S. Embury. Pay-as-you-go Configuration of Entity Resolution. Transactions on Large-Scale Data- and Knowledge-Centered Systems XXIX, Volume 10120, pages 40-65, 2016. Online publication date: 1-Sep-2016.
https://dl.acm.org/doi/10.1007/978-3-662-54037-4_2

G. Dal Bianco, R. Galante, M. Gonçalves, S. Canuto, and C. Heuser. A Practical and Effective Sampling Selection Strategy for Large Scale Deduplication. IEEE Transactions on Knowledge and Data Engineering, 27(9):2305-2319, 2015. Online publication date: 1-Sep-2015.
https://dl.acm.org/doi/10.1109/TKDE.2015.2416734

Q. Wang, D. Vatsalan, and P. Christen. Efficient Interactive Training Selection for Large-Scale Entity Resolution. Advances in Knowledge Discovery and Data Mining, pages 562-573, 2015. Online publication date: 9-May-2015.
https://doi.org/10.1007/978-3-319-18032-8_44

Index Terms

Tuning large scale deduplication with reduced effort
1. Information systems
  1. Data management systems
    1. Database management system engines

Information & Contributors

Information

Published In

SSDBM '13: Proceedings of the 25th International Conference on Scientific and Statistical Database Management

July 2013

401 pages

ISBN:9781450319218

DOI:10.1145/2484838

Copyright © 2013 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article

Funding Sources

Conselho Nacional de Desenvolvimento Científico e Tecnológico

Conference

SSDBM '13

SSDBM '13: Conference on Scientific and Statistical Database Management

July 29 - 31, 2013

Baltimore, Maryland, USA

Acceptance Rates

Overall Acceptance Rate 56 of 146 submissions, 38%
