research-article

A Graph-Based Blocking Approach for Entity Matching Using Contrastively Learned Embeddings

Authors:

John Bosco Mugeni,

Toshiyuki Amagasa

ACM SIGAPP Applied Computing Review, Volume 22, Issue 4

Pages 37 - 46

https://doi.org/10.1145/3584014.3584017

Published: 10 February 2023

Abstract

Data integration is considered a crucial task in the entity matching process. In this process, redundant and cunning entries must be identified and eliminated to improve data quality. To achieve this, every entity is compared with every other entity, which has quadratic computational complexity. To avoid this cost, "blocking" limits comparisons to probable matches. This paper presents a k-nearest neighbor graph-based blocking approach that uses state-of-the-art context-aware sentence embeddings from pre-trained transformers. Our approach maps each database tuple to a node and generates a graph in which nodes are connected by edges if they are related. We then apply unsupervised community detection techniques to this graph, treating blocking as a graph clustering problem. Our work is motivated by the scarcity of training data for entity matching in real-world scenarios and by the limited scalability of blocking schemes in the presence of proliferating data. Additionally, we investigate the impact of contrastively trained embeddings on this system and test its capabilities on four data sets exhibiting more than 6 million comparisons. We show that our block processing times on the target benchmarks vary owing to the efficient data structure of the k-nearest neighbor graph. Our results also show that our method achieves a better F1 score than current deep learning-based blocking solutions.
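The pipeline described in the abstract (tuples as nodes, k-nearest neighbor edges, graph clustering into blocks) can be illustrated with a minimal sketch. This is not the authors' implementation. The 2-D vectors stand in for transformer sentence embeddings, the neighbor search is brute force, and connected components are used as a simple substitute for the unsupervised community detection step. All function names and the toy data are hypothetical.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def knn_graph(vectors, k):
    """Undirected adjacency map linking each node to its k most similar nodes.

    Brute-force search for clarity; a real system would likely use an
    approximate nearest-neighbor index instead.
    """
    adj = {i: set() for i in range(len(vectors))}
    for i, u in enumerate(vectors):
        scored = sorted(
            ((cosine(u, v), j) for j, v in enumerate(vectors) if j != i),
            reverse=True,
        )
        for _, j in scored[:k]:
            adj[i].add(j)
            adj[j].add(i)
    return adj


def connected_components(adj):
    """Group nodes into blocks (assumed stand-in for community detection)."""
    seen, blocks = set(), []
    for start in adj:
        if start in seen:
            continue
        stack, block = [start], []
        seen.add(start)
        while stack:
            node = stack.pop()
            block.append(node)
            for neighbor in adj[node]:
                if neighbor not in seen:
                    seen.add(neighbor)
                    stack.append(neighbor)
        blocks.append(sorted(block))
    return blocks


def candidate_pairs(blocks):
    """Only records that share a block are compared by the matcher."""
    return {pair for block in blocks for pair in combinations(block, 2)}


# Hypothetical 2-D "embeddings" for six records forming two obvious groups.
toy_vectors = [
    (1.0, 0.1), (0.9, 0.2), (0.95, 0.15),
    (0.1, 1.0), (0.2, 0.9), (0.15, 0.95),
]

blocks = connected_components(knn_graph(toy_vectors, k=2))
pairs = candidate_pairs(blocks)
all_pairs = len(toy_vectors) * (len(toy_vectors) - 1) // 2

print(blocks)                 # two blocks of three records each
print(len(pairs), all_pairs)  # comparisons after blocking vs. exhaustive
```

On this toy input, blocking reduces the 15 exhaustive comparisons to the 6 pairs that fall inside a block, which is the saving the quadratic-cost argument above is about.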

Published In

ACM SIGAPP Applied Computing Review Volume 22, Issue 4

December 2022

42 pages

ISSN: 1559-6915

EISSN: 1931-0161

DOI: 10.1145/3584014

Copyright © 2023 Copyright is held by the owner/author(s).

Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the Owner/Author.

Publisher

Association for Computing Machinery

New York, NY, United States
