research-article

Deep transfer learning for multi-source entity linkage via domain adaptation

Authors:

Bunyamin Sisman,

Danai Koutra

Proceedings of the VLDB Endowment, Volume 15, Issue 3

Pages 465 - 477

https://doi.org/10.14778/3494124.3494131

Published: 01 November 2021

Abstract

Multi-source entity linkage integrates knowledge from multiple sources by linking the records that represent the same real-world entity. It is critical in high-impact applications such as data cleaning and user stitching. State-of-the-art entity linkage pipelines depend mainly on supervised learning, which requires large amounts of training data. However, collecting well-labeled training data becomes expensive when data from many sources arrives incrementally over time. Moreover, trained models can easily overfit to specific data sources and thus fail to generalize to new sources because of significant differences in data and label distributions. To address these challenges, we present AdaMEL, a deep transfer learning framework that learns generic high-level knowledge to perform multi-source entity linkage. AdaMEL models the importance of each attribute for matching entities through an attribute-level self-attention mechanism, and leverages the massive unlabeled data from new data sources through domain adaptation to make the model generic and data-source agnostic. In addition, AdaMEL can incorporate an additional set of labeled data to more accurately integrate data sources with different attribute importance. Extensive experiments show that our framework achieves state-of-the-art results, with an 8.21% improvement on average over methods based on supervised learning. It is also more stable across different sets of data sources and takes less runtime.
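The abstract does not spell out the architecture, so the following is only a minimal sketch of the general idea behind attribute-level attention, not AdaMEL's actual model. It assumes each record pair is described by one small feature vector per attribute, scores every attribute against a hypothetical learned `query` vector (fixed by hand here), normalizes the scores with a softmax to obtain attribute importance weights, and feeds the attention-weighted sum to a logistic matcher. All names (`attribute_attention`, `match_score`, `query`, `classifier`) and all numbers are illustrative assumptions. The domain adaptation step is omitted.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attribute_attention(features, query):
    """Return one attention weight per attribute.

    features: dict mapping attribute name -> feature vector of a record pair
    query:    hypothetical learned vector (assumption; hand-set in this toy)
    """
    names = list(features)
    # Dot-product score of each attribute's features against the query.
    scores = [sum(f * q for f, q in zip(features[n], query)) for n in names]
    return dict(zip(names, softmax(scores)))


def match_score(features, weights, classifier):
    """Attention-weighted sum of attribute vectors, then a logistic score."""
    dim = len(classifier)
    pooled = [0.0] * dim
    for name, vec in features.items():
        for i in range(dim):
            pooled[i] += weights[name] * vec[i]
    logit = sum(p * c for p, c in zip(pooled, classifier))
    return 1.0 / (1.0 + math.exp(-logit))


# Toy record pair: each attribute has features [similarity, both_present].
pair = {
    "title": [0.9, 1.0],
    "brand": [0.2, 1.0],
    "price": [0.0, 0.0],  # missing in one source
}
query = [2.0, 1.0]  # illustrative values, not learned
weights = attribute_attention(pair, query)
score = match_score(pair, weights, classifier=[3.0, -1.0])
```

In this toy setup the weights sum to one, and an attribute that is similar and present in both records (`title`) receives more weight than one that is missing (`price`). In a trained model the `query` and `classifier` parameters would be learned rather than fixed.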






Published In

Proceedings of the VLDB Endowment, Volume 15, Issue 3 (November 2021), 364 pages

ISSN: 2150-8097

Editors: Juliana Freire (New York University), Xuemin Lin (University of New South Wales)

Publisher: VLDB Endowment


