research-article

Deep Learning for Entity Matching: A Design Space Exploration

Authors:

Sidharth Mudgal,

Theodoros Rekatsinas,

Youngchoon Park,

Ganesh Krishnan,

Esteban Arcaute,

Vijay Raghavendra

SIGMOD '18: Proceedings of the 2018 International Conference on Management of Data

Pages 19 - 34

https://doi.org/10.1145/3183713.3196926

Published: 27 May 2018

Abstract

Entity matching (EM) finds data instances that refer to the same real-world entity. In this paper, we examine applying deep learning (DL) to EM, to understand DL's benefits and limitations. We review many DL solutions that have been developed for related matching tasks in text processing (e.g., entity linking and textual entailment). We categorize these solutions and define a space of DL solutions for EM, as embodied by four solutions with varying representational power: SIF, RNN, Attention, and Hybrid. Next, we investigate the types of EM problems for which DL can be helpful. We consider three such problem types, which match structured data instances, textual instances, and dirty instances, respectively. We empirically compare the above four DL solutions with Magellan, a state-of-the-art learning-based EM solution. The results show that DL does not outperform current solutions on structured EM, but it can significantly outperform them on textual and dirty EM. For practitioners, this suggests that they should seriously consider using DL for textual and dirty EM problems. Finally, we analyze DL's performance and discuss future research directions.
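
The simplest of the four solutions, SIF, takes its name from smooth inverse frequency weighting, in which each word vector is weighted by a / (a + p(w)), where p(w) is the word's relative frequency. The following is a minimal sketch of that idea applied to comparing two records, not the paper's implementation. The toy word vectors, the token lists, and the helper names (`sif_weights`, `sif_embed`, `match_score`) are hypothetical, and the sketch omits both the common-component removal step of full SIF and any learned classifier on top of the similarity.

```python
import math
from collections import Counter

def sif_weights(corpus_tokens, a=1e-3):
    """Weight each word by a / (a + p(w)); frequent words get small weights."""
    counts = Counter(corpus_tokens)
    total = sum(counts.values())
    return {w: a / (a + c / total) for w, c in counts.items()}

def sif_embed(tokens, vectors, weights):
    """Weighted average of the word vectors of the known tokens."""
    known = [t for t in tokens if t in vectors]
    if not known:
        return None
    dim = len(vectors[known[0]])
    acc = [0.0] * dim
    for t in known:
        w = weights.get(t, 1.0)  # unseen word: p(w) = 0, so weight is 1
        for i, x in enumerate(vectors[t]):
            acc[i] += w * x
    return [x / len(known) for x in acc]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def match_score(tokens_a, tokens_b, vectors, weights):
    """Cosine similarity between the SIF-style embeddings of two records."""
    ea = sif_embed(tokens_a, vectors, weights)
    eb = sif_embed(tokens_b, vectors, weights)
    if ea is None or eb is None:
        return 0.0
    return cosine(ea, eb)

# Hypothetical 3-dimensional word vectors, made up for illustration only.
VECTORS = {
    "apple":   [1.0, 0.0, 0.0],
    "iphone":  [0.9, 0.1, 0.0],
    "phone":   [0.7, 0.3, 0.0],
    "samsung": [0.0, 1.0, 0.0],
    "tv":      [0.0, 0.2, 0.9],
    "the":     [0.5, 0.5, 0.5],
}

RECORD_A = ["apple", "iphone"]
RECORD_B = ["the", "iphone", "apple", "phone"]
RECORD_C = ["the", "samsung", "tv"]

# Toy corpus in which "the" is very frequent, so its weight is small.
CORPUS = RECORD_A + RECORD_B + RECORD_C + ["the"] * 10
WEIGHTS = sif_weights(CORPUS)

if __name__ == "__main__":
    print("A vs B:", match_score(RECORD_A, RECORD_B, VECTORS, WEIGHTS))
    print("A vs C:", match_score(RECORD_A, RECORD_C, VECTORS, WEIGHTS))
```

In this toy setup the two records describing the same product score much higher than the unrelated pair. A real matcher would feed such comparisons, per attribute, into a trained classifier rather than thresholding a single cosine value.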

Cited By

Han Y, Li C (2024). Entity Matching by Pool-Based Active Learning. Electronics 13:3 (559). Online publication date: 30-Jan-2024.
https://doi.org/10.3390/electronics13030559
Torres N, Olivares P (2024). De-Anonymizing Users across Rating Datasets via Record Linkage and Quasi-Identifier Attacks. Data 9:6 (75). Online publication date: 27-May-2024.
https://doi.org/10.3390/data9060075
Shahbazi N, Erfanian M, Asudeh A, Nargesian F, Srivastava D (2024). FairEM360: A Suite for Responsible Entity Matching. Proceedings of the VLDB Endowment 17:12 (4417-4420). Online publication date: 8-Nov-2024.
https://doi.org/10.14778/3685800.3685889

Index Terms

Deep Learning for Entity Matching: A Design Space Exploration
1. Computing methodologies
  1. Machine learning
    1. Machine learning approaches
      1. Neural networks
2. Information systems
  1. Data management systems
    1. Information integration
      1. Entity resolution

Information & Contributors

Information

Published In

SIGMOD '18: Proceedings of the 2018 International Conference on Management of Data

May 2018

1874 pages

ISBN: 9781450347037

DOI: 10.1145/3183713

General Chairs:
Gautam Das, University of Texas at Arlington, USA
Christopher Jermaine, Rice University, USA
Philip Bernstein, Microsoft Research, USA

Copyright © 2018 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

SIGMOD: ACM Special Interest Group on Management of Data

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article

Conference

SIGMOD/PODS '18

Sponsor:

SIGMOD

SIGMOD/PODS '18: International Conference on Management of Data

June 10 - 15, 2018

Houston, TX, USA

Acceptance Rates

SIGMOD '18 Paper Acceptance Rate 90 of 461 submissions, 20%;

Overall Acceptance Rate 785 of 4,003 submissions, 20%
