DOI: 10.1145/2484838.2484873
research-article

Tuning large scale deduplication with reduced effort

Published: 29 July 2013

Abstract

Deduplication is the task of identifying which objects in a data repository are potentially the same. It usually demands user intervention at several steps of the process, mainly to identify some pairs that represent matches and non-matches. This information is then used to help identify other potentially duplicated records. When deduplication is applied to very large datasets, performance and matching quality depend on expert users configuring the most important steps of the process (e.g., blocking and classification). In this paper, we propose a new framework called FS-Dedup that helps tune the deduplication process on large datasets with reduced effort from the user, who is only required to label a small, automatically selected subset of pairs. FS-Dedup uses Signature-Based Deduplication (Sig-Dedup) algorithms as its deduplication core. Sig-Dedup is characterized by high efficiency and scalability on large datasets, but it requires an expert user to tune several parameters. FS-Dedup addresses this drawback by providing a framework that demands no specialized user knowledge about the dataset or thresholds to achieve high effectiveness. Our evaluation over large real and synthetic datasets (containing millions of records) shows that FS-Dedup is able to reach or even surpass the maximal matching quality obtained by Sig-Dedup techniques with reduced manual effort from the user.
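
To make the two steps named in the abstract concrete (blocking and classification), the following is a minimal sketch of a generic signature-based pipeline. It is not FS-Dedup or the authors' implementation. It assumes whitespace tokenization, Jaccard similarity, and prefix filtering as the signature scheme; all function names and the sample records are hypothetical. The hand-set `threshold` parameter is the kind of value that, according to the abstract, normally requires an expert to tune.

```python
import math
from collections import defaultdict
from itertools import combinations


def tokenize(text):
    """Lowercase a record and split it into a set of whitespace tokens."""
    return set(text.lower().split())


def jaccard(a, b):
    """Jaccard similarity of two token sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def candidate_pairs(token_sets, threshold):
    """Blocking: pair records that share a token in their prefix signature.

    Assumes 0 < threshold <= 1. Under a fixed global token order, two sets
    with Jaccard >= threshold must share a token within these prefixes.
    """
    # Global order: rarest tokens first, ties broken alphabetically.
    freq = defaultdict(int)
    for tokens in token_sets:
        for tok in tokens:
            freq[tok] += 1

    index = defaultdict(list)  # token -> ids of records with it in the prefix
    for rid, tokens in enumerate(token_sets):
        ordered = sorted(tokens, key=lambda t: (freq[t], t))
        n = len(ordered)
        # Small epsilon guards against float error in threshold * n.
        prefix_len = n - math.ceil(threshold * n - 1e-9) + 1
        for tok in ordered[:prefix_len]:
            index[tok].append(rid)

    pairs = set()
    for rids in index.values():
        pairs.update(combinations(rids, 2))  # ids are appended in increasing order
    return pairs


def deduplicate(records, threshold=0.5):
    """Classification: keep candidate pairs whose Jaccard meets the threshold."""
    token_sets = [tokenize(r) for r in records]
    matches = []
    for i, j in sorted(candidate_pairs(token_sets, threshold)):
        if jaccard(token_sets[i], token_sets[j]) >= threshold:
            matches.append((i, j))
    return matches


# Hypothetical sample records, for illustration only.
records = [
    "tuning large scale deduplication with reduced effort",
    "tuning large-scale deduplication with reduced effort",
    "scaling up all pairs similarity search",
]
print(deduplicate(records, threshold=0.5))
```

In this sketch the first two records have a Jaccard similarity of 5/8, so whether they are reported as duplicates depends entirely on where the threshold is set. This sensitivity illustrates why reducing manual threshold tuning is useful.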


    Published In

    SSDBM '13: Proceedings of the 25th International Conference on Scientific and Statistical Database Management
    July 2013
    401 pages
    ISBN:9781450319218
    DOI:10.1145/2484838

    Publisher

    Association for Computing Machinery

    New York, NY, United States


    Author Tags

    1. deduplication
    2. signature-based deduplication

    Conference

    SSDBM '13

    Acceptance Rates

    Overall Acceptance Rate 56 of 146 submissions, 38%
