DOI: 10.1145/3357384.3360316
tutorial

Learning-Based Methods with Human-in-the-Loop for Entity Resolution

Published: 03 November 2019

Abstract

This tutorial is intended for researchers and practitioners working in the data integration area and, in particular, on entity resolution (ER), a sub-area focused on linking entities across heterogeneous datasets. We outline the ideal requirements of modern ER systems: (1) capture domain knowledge via (minimal) human interaction, (2) provide as much automation as possible via machine learning techniques, and (3) achieve high explainability. We describe recent research trends towards bringing such ideal ER systems closer to reality. We begin with an overview of human-in-the-loop methods that are based on techniques such as crowdsourcing and active learning. We then dive into recent trends that involve deep learning techniques such as representation learning to automate feature engineering, and combinations of transfer and active learning to reduce the number of user labels required. We also discuss how explainable AI relates to ER, and outline some of the recent advances towards explainable ER.
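
To make the human-in-the-loop idea concrete, the following is a minimal sketch of active learning for ER by uncertainty sampling. It is not taken from the tutorial or from any cited system. It assumes a single similarity feature (token-level Jaccard), a simple threshold classifier, and a simulated labeler; the names `jaccard`, `active_learn_threshold`, `oracle`, and the example records are all hypothetical. It also assumes that true matches score higher than non-matches, which real data often violates.

```python
def jaccard(a: str, b: str) -> float:
    """Token-level Jaccard similarity between two strings."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def active_learn_threshold(pairs, oracle, budget):
    """Learn a match threshold on Jaccard similarity using at most `budget` labels.

    pairs  : list of (record_a, record_b) string pairs
    oracle : callable(pair) -> bool, standing in for the human labeler
    Returns (threshold, number_of_label_queries).
    """
    scored = [(jaccard(a, b), (a, b)) for a, b in pairs]
    # lo: highest score known (or assumed) to be a non-match.
    # hi: lowest score known (or assumed) to be a match.
    # Assumption: score 0.0 is a non-match and score 1.0 is a match.
    lo, hi = 0.0, 1.0
    labeled = set()
    queries = 0
    while queries < budget:
        threshold = (lo + hi) / 2
        # Uncertainty sampling: among unlabeled pairs whose label is still
        # undetermined (lo < score < hi), pick the one closest to the threshold.
        candidates = [
            (abs(s - threshold), s, p)
            for s, p in scored
            if p not in labeled and lo < s < hi
        ]
        if not candidates:
            break  # every remaining pair is already decided by lo and hi
        _, s, p = min(candidates, key=lambda c: c[0])
        labeled.add(p)
        queries += 1
        if oracle(p):   # human says "match": tighten the upper bound
            hi = s
        else:           # human says "non-match": tighten the lower bound
            lo = s
    return (lo + hi) / 2, queries


# Hypothetical records; `truth` plays the role of the human's knowledge.
pairs = [
    ("ibm corp new york", "ibm corporation new york"),
    ("acme inc", "acme inc ltd"),
    ("global foods ltd", "global motors ltd"),
    ("apple inc", "orange llc"),
    ("big data systems", "big data systems"),
]
truth = {pairs[0], pairs[1], pairs[4]}
oracle = lambda pair: pair in truth  # simulated labeler

threshold, queries = active_learn_threshold(pairs, oracle, budget=5)
predicted = {p for p in pairs if jaccard(*p) >= threshold}
```

In this sketch the labeler is only asked about pairs whose status is still ambiguous, which is the sense in which active learning reduces labeling effort. Systems of the kind the tutorial surveys use much richer features and models.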



Information & Contributors

Information

Published In

CIKM '19: Proceedings of the 28th ACM International Conference on Information and Knowledge Management
November 2019
3373 pages
ISBN: 9781450369763
DOI: 10.1145/3357384
Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the Owner/Author.

Publisher

Association for Computing Machinery

New York, NY, United States


Author Tags

  1. entity resolution
  2. human-in-the-loop
  3. machine learning

Qualifiers

  • Tutorial

Conference

CIKM '19

Acceptance Rates

CIKM '19 Paper Acceptance Rate 202 of 1,031 submissions, 20%;
Overall Acceptance Rate 1,861 of 8,427 submissions, 22%


Cited By

  • (2024) Gen-T: Table Reclamation in Data Lakes. 2024 IEEE 40th International Conference on Data Engineering (ICDE), 3532-3545. DOI: 10.1109/ICDE60146.2024.00272. Online publication date: 13-May-2024
  • (2023) Effective Entity Augmentation by Querying External Data Sources. Proceedings of the VLDB Endowment 16:11, 3404-3417. DOI: 10.14778/3611479.3611535. Online publication date: 24-Aug-2023
  • (2023) Genie in the Model. Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies 7:1, 1-29. DOI: 10.1145/3580815. Online publication date: 28-Mar-2023
  • (2023) Effective entity matching with transformers. The VLDB Journal 32:6, 1215-1235. DOI: 10.1007/s00778-023-00779-z. Online publication date: 17-Jan-2023
  • (2021) Neural Networks for Entity Matching: A Survey. ACM Transactions on Knowledge Discovery from Data 15:3, 1-37. DOI: 10.1145/3442200. Online publication date: 21-Apr-2021
  • (2021) Deep Entity Matching. Journal of Data and Information Quality 13:1, 1-17. DOI: 10.1145/3431816. Online publication date: 6-Jan-2021
  • (2020) Deep entity matching with pre-trained language models. Proceedings of the VLDB Endowment 14:1, 50-60. DOI: 10.14778/3421424.3421431. Online publication date: 1-Sep-2020
