DOI: 10.1145/3580305.3599266

CampER: An Effective Framework for Privacy-Aware Deep Entity Resolution

Published: 04 August 2023

Abstract

Entity Resolution (ER) is a fundamental problem in data preparation. Standard deep ER methods achieve state-of-the-art effectiveness but assume that relations from different organizations are stored centrally. In practice, privacy concerns often make it difficult to centralize data, rendering standard deep ER solutions inapplicable. Existing rule-based privacy-preserving ER methods often neglect subtle matching mechanisms and consequently have poor effectiveness. To bridge effectiveness and privacy, we propose CampER, an effective framework for privacy-aware deep entity resolution. Specifically, we first design a training pair self-generation strategy to overcome the absence of manually labeled data in privacy-aware scenarios. Based on the self-constructed training pairs, we present a collaborative fine-tuning approach that learns match-aware, uni-space individual tuple embeddings for accurate matching decisions. For the matching decision-making process, we first introduce a cryptographically secure approach to determine matches. We then propose an order-preserving perturbation strategy that significantly accelerates the matching computation while guaranteeing the consistency of ER results. Extensive experiments on eight widely used benchmark datasets demonstrate that CampER is comparable in effectiveness with state-of-the-art standard deep ER solutions while also preserving privacy.
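
The abstract does not describe how the order-preserving perturbation works, so the following is only a generic sketch of the underlying idea, not CampER's actual strategy. It assumes, for illustration, that similarity scores are masked with a secret, strictly increasing affine map; the function names, the scale and shift parameters, and the sample scores are all hypothetical. Because the map is strictly increasing, the ranking of candidates and the best match are the same before and after masking.

```python
import random


def perturb_scores(scores, scale, shift):
    """Mask scores with a strictly increasing affine map (hypothetical helper).

    The map s -> scale * s + shift preserves order only when scale > 0.
    """
    if scale <= 0:
        raise ValueError("scale must be positive to preserve order")
    return [scale * s + shift for s in scores]


def ranking(scores):
    """Candidate indices sorted from highest to lowest score."""
    return sorted(range(len(scores)), key=scores.__getitem__, reverse=True)


def best_match(scores):
    """Index of the highest-scoring candidate."""
    return ranking(scores)[0]


# Made-up similarity scores between one tuple and four candidate tuples.
similarities = [0.12, 0.87, 0.45, 0.86]

# Secret perturbation parameters (assumed here; drawn from a seeded RNG
# only so that the example is reproducible).
rng = random.Random(0)
scale = rng.uniform(1.0, 10.0)
shift = rng.uniform(-5.0, 5.0)

masked = perturb_scores(similarities, scale, shift)

# The masked values differ from the originals, but the order is unchanged.
print(ranking(similarities))
print(ranking(masked))
```

In this toy setting, a party that sees only the masked values can still pick the top candidate, which is the sense in which the decision stays consistent. A real scheme would also need a security argument that this sketch does not attempt.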

Supplementary Material

MP4 File (rtfp0370-2min-promo.mp4)
Promotional video of the paper "CampER: An Effective Framework for Privacy-Aware Deep Entity Resolution"

    Published In

    KDD '23: Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining
    August 2023
    5996 pages
    ISBN: 9798400701030
    DOI: 10.1145/3580305
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. entity resolution
    2. representation learning
    3. similarity measurement

    Qualifiers

    • Research-article

    Conference

    KDD '23

    Acceptance Rates

    Overall Acceptance Rate 1,133 of 8,635 submissions, 13%
