DOI: 10.1145/3183713.3196926
research-article

Deep Learning for Entity Matching: A Design Space Exploration

Published: 27 May 2018

Abstract

Entity matching (EM) finds data instances that refer to the same real-world entity. In this paper we examine applying deep learning (DL) to EM, to understand DL's benefits and limitations. We review many DL solutions that have been developed for related matching tasks in text processing (e.g., entity linking and textual entailment). We categorize these solutions and define a space of DL solutions for EM, as embodied by four solutions with varying representational power: SIF, RNN, Attention, and Hybrid. Next, we investigate the types of EM problems for which DL can be helpful. We consider three such problem types, which match structured data instances, textual instances, and dirty instances, respectively. We empirically compare the above four DL solutions with Magellan, a state-of-the-art learning-based EM solution. The results show that DL does not outperform current solutions on structured EM, but it can significantly outperform them on textual and dirty EM. For practitioners, this suggests that they should seriously consider using DL for textual and dirty EM problems. Finally, we analyze DL's performance and discuss future research directions.
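
To give a feel for the simplest of the four solutions named in the abstract, here is a minimal sketch of SIF-style (smooth inverse frequency) matching. It is not the paper's implementation. Each word is weighted by a / (a + p(word)), the word vectors of an attribute value are averaged with those weights, and two values are compared by cosine similarity. The toy three-dimensional word vectors, the example product strings, and the smoothing constant a = 1e-3 are all assumptions made for illustration. The principal-component removal step of the full SIF method is omitted, and a real system would learn a classifier on top of the comparison rather than read off the raw similarity.

```python
import math
from collections import Counter

def sif_weights(corpus_tokens, a=1e-3):
    """Map each word to a / (a + p(word)), with p its relative corpus frequency."""
    counts = Counter(corpus_tokens)
    total = sum(counts.values())
    return {w: a / (a + c / total) for w, c in counts.items()}

def sif_embed(tokens, vectors, weights):
    """Weighted average of the word vectors of the tokens we have vectors for."""
    dim = len(next(iter(vectors.values())))
    known = [t for t in tokens if t in vectors and t in weights]
    if not known:
        return [0.0] * dim
    summed = [0.0] * dim
    for t in known:
        for i, x in enumerate(vectors[t]):
            summed[i] += weights[t] * x
    return [s / len(known) for s in summed]

def cosine(u, v):
    """Cosine similarity, defined as 0.0 when either vector is all zeros."""
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(x * x for x in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return sum(x * y for x, y in zip(u, v)) / (norm_u * norm_v)

# Hypothetical toy word vectors; a real system would load pretrained embeddings.
vectors = {
    "apple":   [1.0, 0.0, 0.0],
    "iphone":  [0.9, 0.1, 0.0],
    "samsung": [0.0, 1.0, 0.0],
    "galaxy":  [0.1, 0.9, 0.0],
    "phone":   [0.5, 0.5, 0.2],
}

# Hypothetical product names taken from two tables to be matched.
record_a = "apple iphone phone".split()
record_b = "iphone phone".split()
record_c = "samsung galaxy phone".split()

# Word frequencies are estimated from all attribute values seen so far.
weights = sif_weights(record_a + record_b + record_c)

emb_a = sif_embed(record_a, vectors, weights)
emb_b = sif_embed(record_b, vectors, weights)
emb_c = sif_embed(record_c, vectors, weights)

sim_match = cosine(emb_a, emb_b)      # two descriptions of the same product
sim_nonmatch = cosine(emb_a, emb_c)   # descriptions of different products
print(f"a vs b: {sim_match:.3f}   a vs c: {sim_nonmatch:.3f}")
```

Because the frequent word "phone" receives a smaller weight than the rarer brand words, the brand words dominate each embedding, so the pair describing the same product scores higher than the pair that shares only the generic word.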



Published In

SIGMOD '18: Proceedings of the 2018 International Conference on Management of Data
May 2018
1874 pages
ISBN: 9781450347037
DOI: 10.1145/3183713

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. deep learning
  2. entity matching
  3. entity resolution

Qualifiers

  • Research-article

Conference

SIGMOD/PODS '18

Acceptance Rates

SIGMOD '18 Paper Acceptance Rate 90 of 461 submissions, 20%;
Overall Acceptance Rate 785 of 4,003 submissions, 20%

Article Metrics

  • Downloads (Last 12 months): 412
  • Downloads (Last 6 weeks): 51
Reflects downloads up to 12 Nov 2024

Cited By

  • (2024) Entity Matching by Pool-Based Active Learning. Electronics 13:3 (559). DOI: 10.3390/electronics13030559. Online publication date: 30-Jan-2024
  • (2024) De-Anonymizing Users across Rating Datasets via Record Linkage and Quasi-Identifier Attacks. Data 9:6 (75). DOI: 10.3390/data9060075. Online publication date: 27-May-2024
  • (2024) FairEM360: A Suite for Responsible Entity Matching. Proceedings of the VLDB Endowment 17:12 (4417-4420). DOI: 10.14778/3685800.3685889. Online publication date: 8-Nov-2024
  • (2024) Enriching Relations with Additional Attributes for ER. Proceedings of the VLDB Endowment 17:11 (3109-3123). DOI: 10.14778/3681954.3681987. Online publication date: 30-Aug-2024
  • (2024) Automatic Data Repair: Are We Ready to Deploy? Proceedings of the VLDB Endowment 17:10 (2617-2630). DOI: 10.14778/3675034.3675051. Online publication date: 6-Aug-2024
  • (2024) How Do Categorical Duplicates Affect ML? A New Benchmark and Empirical Analyses. Proceedings of the VLDB Endowment 17:6 (1391-1404). DOI: 10.14778/3648160.3648178. Online publication date: 1-Feb-2024
  • (2024) Making It Tractable to Detect and Correct Errors in Graphs. ACM Transactions on Database Systems. DOI: 10.1145/3702315. Online publication date: 2-Nov-2024
  • (2024) Disambiguate Entity Matching using Large Language Models through Relation Discovery. Proceedings of the Conference on Governance, Understanding and Integration of Data for Effective and Responsible AI (36-39). DOI: 10.1145/3665601.3669844. Online publication date: 9-Jun-2024
  • (2024) Unicorn: A Unified Multi-Tasking Matching Model. ACM SIGMOD Record 53:1 (44-53). DOI: 10.1145/3665252.3665263. Online publication date: 14-May-2024
  • (2024) Table-GPT: Table Fine-tuned GPT for Diverse Table Tasks. Proceedings of the ACM on Management of Data 2:3 (1-28). DOI: 10.1145/3654979. Online publication date: 30-May-2024