Synonyms
Clustering; Merge-purge; Record matching; Reference reconciliation
Definition
Often, the same logical real-world entity has multiple representations in a relation because of data entry errors, varying conventions, and a variety of other reasons. For example, when Lisa purchases products from SuperMart twice, she might be entered as two different customers, e.g., (Lisa Simpson, Seattle, WA, USA, 98025) and (Simson Lisa, Seattle, WA, United States, 98025). Such duplicated information can cause significant problems for the users of the data. For example, it can increase direct mailing costs because several customers may each be sent multiple catalogs. Duplicates can also cause incorrect results in analytic queries (say, the number of SuperMart customers in Seattle) and lead to erroneous data mining models. Hence, a significant amount of time and effort is spent on detecting and eliminating duplicates.
This problem of detecting and eliminating duplicated...
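To make the SuperMart example concrete, the sketch below shows one simple way the two Lisa records could be flagged as likely duplicates. It is only an illustration under stated assumptions, not the method described in this entry: the function names (`tokens`, `soft_jaccard`, `is_probable_duplicate`), the token-level similarity cutoff of 0.8, and the record-level threshold of 0.5 are all hypothetical choices, and the third record (Bart) is made up for contrast.

```python
import re
from difflib import SequenceMatcher


def tokens(record):
    """Lowercase a record's fields and split them into a set of word tokens."""
    return set(re.findall(r"[a-z0-9]+", " ".join(record).lower()))


def soft_jaccard(rec_a, rec_b, token_cutoff=0.8):
    """Jaccard-style overlap in which near-identical tokens count as equal.

    Tokens are paired greedily; a pair counts as a match when its
    difflib similarity ratio reaches token_cutoff (an assumed value).
    """
    left, remaining = sorted(tokens(rec_a)), tokens(rec_b)
    size_a, size_b = len(left), len(remaining)
    matched = 0
    for tok in left:
        best, best_score = None, 0.0
        for cand in remaining:
            score = SequenceMatcher(None, tok, cand).ratio()
            if score > best_score:
                best, best_score = cand, score
        if best is not None and best_score >= token_cutoff:
            remaining.remove(best)  # each token may be matched only once
            matched += 1
    union = size_a + size_b - matched
    return matched / union if union else 1.0


def is_probable_duplicate(rec_a, rec_b, threshold=0.5):
    """Flag a pair as a likely duplicate (threshold is an assumed value)."""
    return soft_jaccard(rec_a, rec_b) >= threshold


lisa_1 = ("Lisa Simpson", "Seattle", "WA", "USA", "98025")
lisa_2 = ("Simson Lisa", "Seattle", "WA", "United States", "98025")
bart = ("Bart Simpson", "Springfield", "OR", "USA", "97477")  # made-up record

print(is_probable_duplicate(lisa_1, lisa_2))  # the two Lisa records
print(is_probable_duplicate(lisa_1, bart))    # an unrelated customer
```

The token-set comparison ignores word order, so the swapped name in the second record does not hurt the score, and the fuzzy token match absorbs the misspelling "Simson". It does not recognize that "USA" and "United States" are equivalent; handling such convention differences would need an extra normalization step.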
© 2018 Springer Science+Business Media, LLC, part of Springer Nature
About this entry
Cite this entry
Kaushik, R. (2018). Deduplication in Data Cleaning. In: Liu, L., Özsu, M.T. (eds) Encyclopedia of Database Systems. Springer, New York, NY. https://doi.org/10.1007/978-1-4614-8265-9_596
DOI: https://doi.org/10.1007/978-1-4614-8265-9_596
Publisher Name: Springer, New York, NY
Print ISBN: 978-1-4614-8266-6
Online ISBN: 978-1-4614-8265-9
eBook Packages: Computer Science; Reference Module Computer Science and Engineering