Probing the Robustness of Pre-trained Language Models for Entity Matching

CIKM '22 short paper
Published: 17 October 2022
DOI: 10.1145/3511808.3557673

Abstract

The paradigm of fine-tuning Pre-trained Language Models (PLMs) has been successful in Entity Matching (EM). Despite their remarkable performance, PLMs tend to learn spurious correlations from training data. In this work, we investigate whether PLM-based entity matching models can be trusted in real-world applications where the data distribution differs from that of training. To this end, we design an evaluation benchmark to assess the robustness of EM models and facilitate their deployment in real-world settings. Our assessments reveal that data imbalance in the training data is a key problem for robustness. We also find that data augmentation alone is not sufficient to make a model robust. As a remedy, we prescribe simple modifications that improve the robustness of PLM-based EM models. Our experiments show that, while yielding superior results for in-domain generalization, our proposed model significantly improves robustness compared to state-of-the-art EM models.
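The setting described above treats entity matching as binary classification over pairs of entity records, solved by fine-tuning a PLM. The sketch below is a minimal, hypothetical illustration of that general setup using the Hugging Face Transformers library, with a class-weighted cross-entropy loss as one simple way to counter the label imbalance the abstract highlights. The backbone model, record serialization scheme, and class weights are assumptions for illustration, not the authors' exact configuration.

```python
# Minimal sketch (assumptions, not the authors' exact setup): entity matching as
# sequence-pair classification with a fine-tuned PLM, plus a class-weighted
# cross-entropy loss to mitigate the match/non-match imbalance in training data.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_name = "roberta-base"  # assumed backbone; any encoder PLM would do
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

def serialize(record: dict) -> str:
    # Flatten an entity record into "attribute: value" text (one common scheme).
    return " ".join(f"{attr}: {val}" for attr, val in record.items())

left = {"title": "iPhone 13 128GB", "brand": "Apple"}
right = {"title": "Apple iPhone 13 (128 GB)", "brand": "Apple"}
label = torch.tensor([1])  # 1 = match, 0 = non-match

inputs = tokenizer(serialize(left), serialize(right),
                   truncation=True, padding=True, return_tensors="pt")

# Non-matches usually dominate EM training data, so weight the match class higher.
class_weights = torch.tensor([1.0, 5.0])  # illustrative values; tune per dataset
loss_fn = torch.nn.CrossEntropyLoss(weight=class_weights)

logits = model(**inputs).logits
loss = loss_fn(logits, label)
loss.backward()  # an optimizer step would follow in a full fine-tuning loop
```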

    Published In

    CIKM '22: Proceedings of the 31st ACM International Conference on Information & Knowledge Management
    October 2022
    5274 pages
    ISBN: 9781450392365
    DOI: 10.1145/3511808
    General Chairs: Mohammad Al Hasan, Li Xiong

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. entity linking
    2. entity matching
    3. named entity disambiguation

    Qualifiers

    • Short-paper

    Conference

    CIKM '22

    Acceptance Rates

    CIKM '22 Paper Acceptance Rate 621 of 2,257 submissions, 28%;
    Overall Acceptance Rate 1,861 of 8,427 submissions, 22%

    Cited By

    • (2024) Cost-Effective In-Context Learning for Entity Resolution: A Design Space Exploration. 2024 IEEE 40th International Conference on Data Engineering (ICDE), 3696-3709. https://doi.org/10.1109/ICDE60146.2024.00284
    • (2024) Cost-efficient prompt engineering for unsupervised entity resolution in the product matching domain. Discover Artificial Intelligence 4(1). https://doi.org/10.1007/s44163-024-00159-8
    • (2024) An in-depth analysis of pre-trained embeddings for entity resolution. The VLDB Journal 34(1). https://doi.org/10.1007/s00778-024-00879-4
    • (2024) Entity Matching with Large Language Models as Weak and Strong Labellers. New Trends in Database and Information Systems, 58-67. https://doi.org/10.1007/978-3-031-70421-5_6
    • (2023) Product Entity Matching via Tabular Data. Proceedings of the 32nd ACM International Conference on Information and Knowledge Management, 4215-4219. https://doi.org/10.1145/3583780.3615172
    • (2023) Train Once, Match Everywhere: Harnessing Generative Language Models for Entity Matching. 2023 International Conference on Computational Science and Computational Intelligence (CSCI), 30-36. https://doi.org/10.1109/CSCI62032.2023.00012
    • (2023) Leveraging Knowledge Graphs for Matching Heterogeneous Entities and Explanation. 2023 IEEE International Conference on Big Data (BigData), 2910-2919. https://doi.org/10.1109/BigData59044.2023.10386157
    • (2023) Using ChatGPT for Entity Matching. New Trends in Database and Information Systems, 221-230. https://doi.org/10.1007/978-3-031-42941-5_20
