A Graph-Based Blocking Approach for Entity Matching Using Contrastively Learned Embeddings

Published: 10 February 2023

Abstract

Entity matching is considered a crucial task in the data integration process. In this process, redundant and duplicate entries must be identified and eliminated to improve data quality. To achieve this, a comparison between all entities is performed, which has quadratic computational complexity. To avoid this cost, "blocking" limits comparisons to probable matches. This paper presents a k-nearest neighbor graph-based blocking approach utilizing state-of-the-art context-aware sentence embeddings from pre-trained transformers. Our approach maps each database tuple to a node and generates a graph where nodes are connected by edges if they are related. We then invoke unsupervised community detection techniques over this graph and treat blocking as a graph clustering problem. Our work is motivated by the scarcity of training data for entity matching in real-world scenarios and the limited scalability of blocking schemes in the presence of proliferating data. Additionally, we investigate the impact of contrastively trained embeddings on the above system and test its capabilities on four data sets exhibiting more than 6 million comparisons. We show that our block processing times on the target benchmarks vary owing to the efficient data structure of the k-nearest neighbor graph. Our results also show that our method achieves better performance in terms of F1 score when compared to current deep learning-based blocking solutions.
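The pipeline described in the abstract (one node per tuple, edges to nearest neighbors, blocks from graph clustering) can be illustrated with a minimal sketch. It is not the authors' implementation, and several parts are assumptions made only so the example runs with the standard library: character-trigram counts stand in for transformer sentence embeddings, neighbors are found by exhaustive search rather than an efficient index, and connected components stand in for the community detection step. The names `trigram_profile`, `knn_edges`, `connected_blocks`, `build_blocks`, and the parameters `k` and `min_sim` are hypothetical.

```python
from collections import Counter
from math import sqrt


def trigram_profile(text):
    """Assumed stand-in for a sentence embedding: character-trigram counts."""
    padded = f"  {text.lower()} "
    return Counter(padded[i:i + 3] for i in range(len(padded) - 2))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b)


def knn_edges(vectors, k, min_sim):
    """Link each node to its k most similar nodes, keeping edges above min_sim.

    The neighbor search here is exhaustive, which is fine for a toy example;
    a real system would use a nearest-neighbor index instead.
    """
    edges = set()
    for i, vec in enumerate(vectors):
        scored = sorted(
            ((cosine(vec, other), j) for j, other in enumerate(vectors) if j != i),
            reverse=True,
        )
        for sim, j in scored[:k]:
            if sim >= min_sim:
                edges.add((min(i, j), max(i, j)))
    return edges


def connected_blocks(n, edges):
    """Group nodes into blocks via union-find (stand-in for community detection)."""
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in edges:
        parent[find(a)] = find(b)
    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


def pair_count(n):
    """Number of unordered pairs among n items: n(n-1)/2."""
    return n * (n - 1) // 2


def build_blocks(records, k=2, min_sim=0.5):
    vectors = [trigram_profile(r) for r in records]
    return connected_blocks(len(records), knn_edges(vectors, k, min_sim))


if __name__ == "__main__":
    records = [
        "Apple iPhone 12 64GB black",
        "iPhone 12 Apple 64 GB black",
        "Samsung Galaxy S21 128GB",
        "Samsung Galaxy S21 128 GB grey",
        "Sony WH-1000XM4 headphones",
        "Sony WH1000XM4 wireless headphones",
    ]
    blocks = build_blocks(records)
    print("blocks:", blocks)
    print("all-pairs comparisons:", pair_count(len(records)))
    print("comparisons within blocks:", sum(pair_count(len(b)) for b in blocks))
```

On these six toy records, exhaustive matching needs 15 comparisons, while comparing only within the three resulting blocks needs 3. The saving grows with data set size because the all-pairs count is quadratic in the number of records.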

Published In

ACM SIGAPP Applied Computing Review, Volume 22, Issue 4
December 2022
42 pages
Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the Owner/Author.

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. contrastive learning
  2. entity matching
  3. graph-based blocking

Qualifiers

  • Research-article

Cited By

  • (2024) Entity Matching by Pool-Based Active Learning. Electronics 13:3 (559). doi:10.3390/electronics13030559. Online publication date: 30-Jan-2024
  • (2024) Enhancing Deep Entity Resolution with Integrated Blocker-Matcher Training: Balancing Consensus and Discrepancy. Proceedings of the 33rd ACM International Conference on Information and Knowledge Management (508-518). doi:10.1145/3627673.3679843. Online publication date: 21-Oct-2024
  • (2024) GSM. Information Systems 120:C. doi:10.1016/j.is.2023.102307. Online publication date: 4-Mar-2024
  • (2024) E-commerce bookstore user alignment model based on multidimensional feature joint representation and implicit behavior compensation. Expert Systems with Applications: An International Journal 238:PD. doi:10.1016/j.eswa.2023.122084. Online publication date: 27-Feb-2024
  • (2024) MultiMatch: Low-Resource Generalized Entity Matching Using Task-Conditioned Hyperadapters in Multitask Learning. Big Data Analytics and Knowledge Discovery (51-65). doi:10.1007/978-3-031-68323-7_4. Online publication date: 26-Aug-2024
  • (2024) SC-Block: Supervised Contrastive Blocking Within Entity Resolution Pipelines. The Semantic Web (121-142). doi:10.1007/978-3-031-60626-7_7. Online publication date: 26-May-2024
  • (2023) Leveraging Semantic Technologies for Collaborative Inference of Threatening IoT Dependencies. ACM SIGAPP Applied Computing Review 23:3 (32-48). doi:10.1145/3626307.3626310. Online publication date: 29-Sep-2023
