CampER: An Effective Framework for Privacy-Aware Deep Entity Resolution

Authors:

Yunjun Gao

KDD '23: Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining

Pages 626 - 637

https://doi.org/10.1145/3580305.3599266

Published: 04 August 2023

Abstract

Entity Resolution (ER) is a fundamental problem in data preparation. Standard deep ER methods achieve state-of-the-art effectiveness, but they assume that relations from different organizations are stored centrally. In practice, privacy concerns often make it difficult to centralize data, which renders standard deep ER solutions inapplicable. Rule-based privacy-preserving ER methods have been developed, but they often neglect subtle matching mechanisms and therefore have poor effectiveness. To bridge effectiveness and privacy, we propose CampER, an effective framework for privacy-aware deep entity resolution. Specifically, we first design a training pair self-generation strategy to overcome the absence of manually labeled data in privacy-aware scenarios. Based on the self-constructed training pairs, we present a collaborative fine-tuning approach that learns match-aware and uni-space individual tuple embeddings for accurate matching decisions. For the matching decision-making process, we first introduce a cryptographically secure approach to determine matches. We then propose an order-preserving perturbation strategy that significantly accelerates the matching computation while guaranteeing the consistency of ER results. Extensive experiments on eight widely used benchmark datasets demonstrate that CampER is not only comparable to state-of-the-art standard deep ER solutions in effectiveness but also preserves privacy.
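The abstract does not spell out how the order-preserving perturbation works, so the following is only a generic illustration of why such a perturbation can leave matching results unchanged. It assumes a strictly increasing affine map applied to similarity scores; the helper names `perturb` and `ranking`, the sample scores, and the choice of an affine map are all hypothetical and are not taken from CampER.

```python
import random


def perturb(scores, scale, shift):
    """Apply the strictly increasing affine map s -> scale * s + shift.

    This is an assumed, illustrative perturbation, not CampER's actual one.
    """
    if scale <= 0:
        raise ValueError("scale must be positive to preserve order")
    return [scale * s + shift for s in scores]


def ranking(scores):
    """Return candidate indices sorted from highest to lowest score."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)


# Hypothetical similarity scores between one tuple and five candidates.
similarities = [0.12, 0.87, 0.45, 0.91, 0.30]

# Secret perturbation parameters, drawn from a seeded generator for repeatability.
rng = random.Random(0)
scale = rng.uniform(1.0, 10.0)
shift = rng.uniform(-5.0, 5.0)

masked = perturb(similarities, scale, shift)

# The masked scores hide the raw values but rank the candidates identically.
same_order = ranking(similarities) == ranking(masked)
```

Because any strictly increasing map preserves pairwise comparisons, a decision that depends only on the relative order of scores (such as picking the best candidate) gives the same answer on the masked values as on the originals.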

Supplementary Material

MP4 File (rtfp0370-2min-promo.mp4)

Promotional video of the paper "CampER: An Effective Framework for Privacy-Aware Deep Entity Resolution"


Index Terms

CampER: An Effective Framework for Privacy-Aware Deep Entity Resolution
1. Information systems
  1. Data management systems
    1. Information integration
      1. Entity resolution

Published In

KDD '23: Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining

August 2023

5996 pages

ISBN:9798400701030

DOI:10.1145/3580305

General Chairs: Ambuj Singh (UC Santa Barbara, USA), Yizhou Sun (UC Los Angeles, USA)

Program Chairs: Leman Akoglu (Carnegie Mellon University, USA), Dimitrios Gunopulos (University of Athens, Greece), Xifeng Yan (UC Santa Barbara, USA), Ravi Kumar (Google, USA), Fatma Ozcan (Google, USA), Jieping Ye (Alibaba DAMO Academy)

Copyright © 2023 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery

New York, NY, United States

Conference

KDD '23

KDD '23: The 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining

August 6 - 10, 2023

Long Beach, CA, USA

Acceptance Rates

Overall Acceptance Rate 1,133 of 8,635 submissions, 13%
