Deduplication in Data Cleaning

Raghav Kaushik

Synonyms

Clustering; Merge-purge; Record matching; Reference reconciliation

Definition

Often, the same real-world entity has multiple representations in a relation because of data entry errors, varying conventions, and other causes. For example, when Lisa purchases products from SuperMart twice, she might be entered as two different customers, e.g., (Lisa Simpson, Seattle, WA, USA, 98025) and (Simson Lisa, Seattle, WA, United States, 98025). Such duplicated information can cause significant problems for users of the data. For example, it can increase direct mailing costs because some customers may be sent multiple catalogs. Duplicates can also cause incorrect results in analytic queries (say, the number of SuperMart customers in Seattle) and lead to erroneous data mining models. Hence, a significant amount of time and effort is spent on detecting and eliminating duplicates.
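To make the example concrete, the following minimal Python sketch compares the two customer records above. It is only an illustration and not the method of this entry: the token-based Jaccard similarity, the helper names `tokens` and `jaccard`, and the threshold value are all assumptions chosen for the sketch.

```python
import re

def tokens(record):
    """Lowercase all fields of a record and split them into a set of word tokens."""
    return set(re.findall(r"[a-z0-9]+", " ".join(record).lower()))

def jaccard(a, b):
    """Jaccard similarity of the token sets of two records (0.0 if both are empty)."""
    ta, tb = tokens(a), tokens(b)
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0

# The two representations of the same customer from the example above.
r1 = ("Lisa Simpson", "Seattle", "WA", "USA", "98025")
r2 = ("Simson Lisa", "Seattle", "WA", "United States", "98025")

# Exact comparison treats the records as two different customers.
exact_match = r1 == r2

# A similarity score still captures the tokens the records share.
similarity = jaccard(r1, r2)

# Hypothetical threshold; in practice it would be tuned on the data at hand.
THRESHOLD = 0.4
likely_duplicates = similarity >= THRESHOLD

print(exact_match, round(similarity, 3), likely_duplicates)
```

Exact equality fails here because of the swapped name order, the misspelling, and the differing country convention, whereas the token overlap remains substantial. A token-level measure like this one does not credit near-miss spellings such as "Simson" and "Simpson", which is one reason practical systems tend to combine several similarity functions.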

This problem of detecting and eliminating duplicated...

Author information

Authors and Affiliations

Microsoft Research, Redmond, WA, USA
Raghav Kaushik

Editor information

Editors and Affiliations

Georgia Institute of Technology College of Computing, Atlanta, GA, USA
Ling Liu
University of Waterloo School of Computer Science, Waterloo, ON, Canada
M. Tamer Özsu

Section Editor information

Microsoft Research, Microsoft Corporation, Redmond, WA, USA
Venkatesh Ganti

About this entry

Cite this entry

Kaushik, R. (2018). Deduplication in Data Cleaning. In: Liu, L., Özsu, M.T. (eds) Encyclopedia of Database Systems. Springer, New York, NY. https://doi.org/10.1007/978-1-4614-8265-9_596

DOI: https://doi.org/10.1007/978-1-4614-8265-9_596
Published: 07 December 2018
Publisher Name: Springer, New York, NY
Print ISBN: 978-1-4614-8266-6
Online ISBN: 978-1-4614-8265-9
eBook Packages: Computer Science; Reference Module Computer Science and Engineering
