Deduplication in Data Cleaning

Encyclopedia of Database Systems

Synonyms

Clustering; Merge-purge; Record matching; Reference reconciliation

Definition

The same logical real-world entity often has multiple representations in a relation because of data entry errors, varying conventions, and a variety of other reasons. For example, when Lisa purchases products from SuperMart twice, she might be entered as two different customers, e.g., (Lisa Simpson, Seattle, WA, USA, 98025) and (Simson Lisa, Seattle, WA, United States, 98025). Such duplicated information can cause significant problems for the users of the data. For example, it can increase direct mailing costs because several customers may each be sent multiple catalogs. Duplicates can also cause incorrect results in analytic queries (say, the number of SuperMart customers in Seattle) and lead to erroneous data mining models. Hence, a significant amount of time and effort is spent on the task of detecting and eliminating duplicates.
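To make the SuperMart example concrete, the following is a minimal sketch of how two such records might be compared. It is not the method described in this entry: the alias table, the function names, and both thresholds are assumptions chosen for illustration, and the token comparison relies on Python's standard `difflib.SequenceMatcher`.

```python
from difflib import SequenceMatcher

# Hypothetical alias table that maps varying conventions to one form.
ALIASES = {"united states": "usa"}


def normalize(record):
    """Lowercase each field, apply aliases, and split the fields into tokens."""
    tokens = []
    for field in record:
        value = str(field).strip().lower()
        value = ALIASES.get(value, value)
        tokens.extend(value.split())
    return tokens


def token_sim(a, b):
    """Character-level similarity of two tokens, between 0.0 and 1.0."""
    return SequenceMatcher(None, a, b).ratio()


def record_similarity(r1, r2, token_threshold=0.8):
    """Share of tokens that have a close partner in the other record.

    Token order is ignored, so swapped first and last names still match.
    The greedy pairing and the 0.8 token threshold are assumptions.
    """
    t1, t2 = normalize(r1), normalize(r2)
    if not t1 or not t2:
        return 0.0
    unmatched = list(t2)
    matched = 0
    for tok in t1:
        # Pick the most similar token that has not been paired yet.
        best = max(unmatched, key=lambda u: token_sim(tok, u), default=None)
        if best is not None and token_sim(tok, best) >= token_threshold:
            matched += 1
            unmatched.remove(best)
    return 2 * matched / (len(t1) + len(t2))


def is_probable_duplicate(r1, r2, record_threshold=0.8):
    """Flag a pair as a likely duplicate; the cutoff is an assumed value."""
    return record_similarity(r1, r2) >= record_threshold


lisa_1 = ("Lisa Simpson", "Seattle", "WA", "USA", "98025")
lisa_2 = ("Simson Lisa", "Seattle", "WA", "United States", "98025")
print(is_probable_duplicate(lisa_1, lisa_2))
```

Under these assumptions the two Lisa records are flagged as a probable duplicate, because the misspelled and reordered name tokens still pair up and the country alias is normalized before comparison. A real system would need thresholds tuned on its own data.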

This problem of detecting and eliminating duplicated...



Correspondence to Raghav Kaushik.

Copyright information

© 2018 Springer Science+Business Media, LLC, part of Springer Nature

Cite this entry

Kaushik, R. (2018). Deduplication in Data Cleaning. In: Liu, L., Özsu, M.T. (eds) Encyclopedia of Database Systems. Springer, New York, NY. https://doi.org/10.1007/978-1-4614-8265-9_596
