Nothing Special   »   [go: up one dir, main page]

skip to main content
research-article

Deep transfer learning for multi-source entity linkage via domain adaptation

Published: 01 November 2021 Publication History

Abstract

Multi-source entity linkage focuses on integrating knowledge from multiple sources by linking the records that represent the same real world entity. This is critical in high-impact applications such as data cleaning and user stitching. The state-of-the-art entity linkage pipelines mainly depend on supervised learning that requires abundant amounts of training data. However, collecting well-labeled training data becomes expensive when the data from many sources arrives incrementally over time. Moreover, the trained models can easily overfit to specific data sources, and thus fail to generalize to new sources due to significant differences in data and label distributions. To address these challenges, we present AdaMEL, a deep transfer learning framework that learns generic high-level knowledge to perform multi-source entity linkage. AdaMEL models the attribute importance that is used to match entities through an attribute-level self-attention mechanism, and leverages the massive unlabeled data from new data sources through domain adaptation to make it generic and data-source agnostic. In addition, AdaMEL is capable of incorporating an additional set of labeled data to more accurately integrate data sources with different attribute importance. Extensive experiments show that our framework achieves state-of-the-art results with 8.21% improvement on average over methods based on supervised learning. Besides, it is more stable in handling different sets of data sources in less runtime.

References

[1]
Yoshua Bengio. 2012. Deep learning of representations for unsupervised and transfer learning. In Proceedings of ICML workshop on unsupervised and transfer learning. 17--36.
[2]
Mikhail Bilenko and Raymond J Mooney. 2003. Adaptive duplicate detection using learnable string similarity measures. In KDD. 39--48.
[3]
Rita Chattopadhyay, Qian Sun, Wei Fan, Ian Davidson, Sethuraman Panchanathan, and Jieping Ye. 2012. Multisource domain adaptation and its application to early detection of fatigue. ACM TKDD 6, 4 (2012), 1--26.
[4]
William W Cohen and Jacob Richman. 2002. Learning to match and cluster large high-dimensional data sets for data integration. In KDD. 475--480.
[5]
di2kg. 2020. 2nd International Workshop on Challenges and Experiences from Data Integration to Knowledge Graphs. http://di2kg.inf.uniroma3.it/2020/.
[6]
AnHai Doan and Alon Y Halevy. 2005. Semantic integration research in the database community: A brief survey. AI magazine 26, 1 (2005), 83--83.
[7]
Xin Luna Dong and Felix Naumann. 2009. Data fusion: resolving data conflicts for integration. Proceedings of the VLDB Endowment 2, 2 (2009), 1654--1655.
[8]
Lixin Duan, Dong Xu, and Ivor Wai-Hung Tsang. 2012. Domain adaptation from multiple sources: A domain-dependent regularization approach. IEEE Transactions on neural networks and learning systems 23, 3 (2012), 504--518.
[9]
Wenfei Fan, Xibei Jia, Jianzhong Li, and Shuai Ma. 2009. Reasoning about record matching rules. Proceedings of the VLDB Endowment 2, 1 (2009), 407--418.
[10]
Chelsea Finn, Pieter Abbeel, and Sergey Levine. 2017. Model-agnostic meta-learning for fast adaptation of deep networks. In International Conference on Machine Learning. PMLR, 1126--1135.
[11]
Cheng Fu, Xianpei Han, Jiaming He, and Le Sun. 2020. Hierarchical matching network for heterogeneous entity resolution. In Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence. 3665--3671.
[12]
Yaroslav Ganin and Victor Lempitsky. 2015. Unsupervised domain adaptation by backpropagation. In International conference on machine learning. PMLR, 1180--1189.
[13]
Lise Getoor and Ashwin Machanavajjhala. 2012. Entity resolution: theory, practice & open challenges. Proceedings of the VLDB Endowment 5, 12 (2012), 2018--2019.
[14]
Ian Goodfellow, Yoshua Bengio, Aaron Courville, and Yoshua Bengio. 2016. Deep learning. Vol. 1. MIT press Cambridge.
[15]
Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick. 2017. Mask r-cnn. In Proceedings of the IEEE ICCV. 2961--2969.
[16]
Di Jin, Mark Heimann, Ryan A Rossi, and Danai Koutra. 2019. node2bits: Compact Time-and Attribute-aware Node Representations for User Stitching. In ECML-PKDD. Springer, 483--506.
[17]
Muhammad Ebraheem Saravanan Thirumuruganathan Shafiq Joty and Mourad Ouzzani Nan Tang. 2018. Distributed Representations of Tuples for Entity Resolution. Proceedings of the VLDB Endowment 11, 11 (2018).
[18]
Armand Joulin, Édouard Grave, Piotr Bojanowski, and Tomáš Mikolov. 2017. Bag of Tricks for Efficient Text Classification. In Proceedings of the 15th Conference of the European Chapter of ACL: Volume 2, Short Papers. 427--431.
[19]
Jungo Kasai, Kun Qian, Sairam Gurajada, Yunyao Li, and Lucian Popa. 2019. Low-resource Deep Entity Resolution with Transfer and Active Learning. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. 5851--5861.
[20]
Jacob Devlin Ming-Wei Chang Kenton and Lee Kristina Toutanova. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of NAACL-HLT. 4171--4186.
[21]
Diederik P Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014).
[22]
Hanna Köpcke and Erhard Rahm. 2010. Frameworks for entity matching: A comparison. Data & Knowledge Engineering 69, 2 (2010), 197--210.
[23]
Yuliang Li, Jinfeng Li, Yoshihiko Suhara, AnHai Doan, and Wang-Chiew Tan. 2020. Deep entity matching with pre-trained language models. Proceedings of the VLDB Endowment 14, 1 (2020), 50--60.
[24]
Minh-Thang Luong, Hieu Pham, and Christopher D Manning. 2015. Effective Approaches to Attention-based Neural Machine Translation. In Proceedings of the 2015 Conference on EMNLP. 1412--1421.
[25]
Laurens van der Maaten and Geoffrey Hinton. 2008. Visualizing data using t-SNE. Journal of machine learning research 9, Nov (2008), 2579--2605.
[26]
Sidharth Mudgal, Han Li, Theodoros Rekatsinas, AnHai Doan, Youngchoon Park, Ganesh Krishnan, Rohit Deep, Esteban Arcaute, and Vijay Raghavendra. 2018. Deep learning for entity matching: A design space exploration. In Proceedings of the 2018 International Conference on Management of Data. 19--34.
[27]
Hao Nie, Xianpei Han, Ben He, Le Sun, Bo Chen, Wei Zhang, Suhui Wu, and Hao Kong. 2019. Deep sequence-to-sequence entity matching for heterogeneous entity resolution. In Proceedings of the 28th ACM International Conference on Information and Knowledge Management. 629--638.
[28]
Mark Palatucci, Dean Pomerleau, Geoffrey E Hinton, and Tom M Mitchell. 2009. Zero-shot learning with semantic output codes. NIPS (2009), 1410--1418.
[29]
Sinno Jialin Pan and Qiang Yang. 2009. A survey on transfer learning. IEEE Transactions on knowledge and data engineering 22, 10 (2009), 1345--1359.
[30]
Kun Qian, Lucian Popa, and Prithviraj Sen. 2017. Active learning for large-scale entity resolution. In Proceedings of the 2017 ACM CIKM. 1379--1388.
[31]
Gabriele Schweikert, Gunnar Rätsch, Christian Widmer, and Bernhard Schölkopf. 2008. An empirical analysis of domain adaptation algorithms for genomic sequence analysis. Advances in neural information processing systems 21 (2008), 1433--1440.
[32]
Rohit Singh, Venkata Vamsikrishna Meduri, Ahmed Elmagarmid, Samuel Madden, Paolo Papotti, Jorge-Arnulfo Quiané-Ruiz, Armando Solar-Lezama, and Nan Tang. 2017. Synthesizing entity matching rules by examples. VLDB 11, 2 (2017), 189--202.
[33]
Shiliang Sun, Honglei Shi, and Yuanbin Wu. 2015. A survey of multi-source domain adaptation. Information Fusion 24 (2015), 84--92.
[34]
Saravanan Thirumuruganathan, Shameem A Puthiya Parambath, Mourad Ouzzani, Nan Tang, and Shafiq Joty. 2018. Reuse and adaptation for entity resolution through transfer learning. arXiv preprint arXiv:1809.11084 (2018).
[35]
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in NIPS. 5998--6008.
[36]
Petar Veličković, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Liò, and Yoshua Bengio. 2018. Graph Attention Networks. In ICLR.
[37]
Zhengyang Wang, Bunyamin Sisman, Hao Wei, Xin Luna Dong, and Shuiwang Ji. 2020. CorDEL: A Contrastive Deep Learning Approach for Entity Linkage. arXiv preprint arXiv:2009.07203 (2020).
[38]
Garrett Wilson and Diane J Cook. 2020. A survey of unsupervised deep domain adaptation. ACM TIST 11, 5 (2020), 1--46.
[39]
Wei Ying, Yu Zhang, Junzhou Huang, and Qiang Yang. 2018. Transfer learning via learning to transfer. In ICML. 5085--5094.
[40]
Chen Zhao and Yeye He. 2019. Auto-EM: End-to-end Fuzzy Entity-Matching using Pre-trained Deep Models and Transfer Learning. In WWW. 2413--2424.

Cited By

View all
  • (2023)CampER: An Effective Framework for Privacy-Aware Deep Entity ResolutionProceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining10.1145/3580305.3599266(626-637)Online publication date: 6-Aug-2023
  • (2023)Effective entity matching with transformersThe VLDB Journal — The International Journal on Very Large Data Bases10.1007/s00778-023-00779-z32:6(1215-1235)Online publication date: 1-Nov-2023
  • (2023)QA-Matcher: Unsupervised Entity Matching Using a Question Answering ModelAdvances in Knowledge Discovery and Data Mining10.1007/978-3-031-33383-5_14(174-185)Online publication date: 25-May-2023
  • Show More Cited By

Index Terms

  1. Deep transfer learning for multi-source entity linkage via domain adaptation
    Index terms have been assigned to the content through auto-classification.

    Recommendations

    Comments

    Please enable JavaScript to view thecomments powered by Disqus.

    Information & Contributors

    Information

    Published In

    cover image Proceedings of the VLDB Endowment
    Proceedings of the VLDB Endowment  Volume 15, Issue 3
    November 2021
    364 pages
    ISSN:2150-8097
    Issue’s Table of Contents

    Publisher

    VLDB Endowment

    Publication History

    Published: 01 November 2021
    Published in PVLDB Volume 15, Issue 3

    Qualifiers

    • Research-article

    Contributors

    Other Metrics

    Bibliometrics & Citations

    Bibliometrics

    Article Metrics

    • Downloads (Last 12 months)7
    • Downloads (Last 6 weeks)1
    Reflects downloads up to 27 Nov 2024

    Other Metrics

    Citations

    Cited By

    View all
    • (2023)CampER: An Effective Framework for Privacy-Aware Deep Entity ResolutionProceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining10.1145/3580305.3599266(626-637)Online publication date: 6-Aug-2023
    • (2023)Effective entity matching with transformersThe VLDB Journal — The International Journal on Very Large Data Bases10.1007/s00778-023-00779-z32:6(1215-1235)Online publication date: 1-Nov-2023
    • (2023)QA-Matcher: Unsupervised Entity Matching Using a Question Answering ModelAdvances in Knowledge Discovery and Data Mining10.1007/978-3-031-33383-5_14(174-185)Online publication date: 25-May-2023
    • (2022)PromptEMProceedings of the VLDB Endowment10.14778/3565816.356583616:2(369-378)Online publication date: 1-Oct-2022

    View Options

    Login options

    Full Access

    View options

    PDF

    View or Download as a PDF file.

    PDF

    eReader

    View online with eReader.

    eReader

    Media

    Figures

    Other

    Tables

    Share

    Share

    Share this Publication link

    Share on social media