Entity Matching by Pool-Based Active Learning
Figure 1. An example of entity matching.
Figure 2. The process of active learning on entity matching.
Figure 3. Algorithm of active learning for entity matching.
Figure 4. Construction of the feature vector.
Figure 5. Stop criterion.
Figure 6. F1-iteration curves for the initialization strategies.
Figure 7. Box plots of ten repeated experiments on the seven datasets.
Abstract
1. Introduction
- In many scenarios, it is difficult to obtain a large number of labeled samples. Manual labeling is labor- and time-intensive, and it is usually difficult to obtain enough effective labels in a short time.
- Entity matching tasks usually have extremely imbalanced data samples: the number of mismatched pairs is generally much larger than the number of matched pairs. Binary classification under such an imbalanced label distribution can leave the matched class insufficiently trained.
- For entity matching, labeling too many entity pairs yields little benefit. For example, "Journey to the West" and "Pilgrimage to the West" refer to the same entity, and "Dream of the Red Mansions" and "The Story of the Stone" also refer to the same entity. The former pair can be judged identical through a simple character-level comparison, while the latter can be judged only by domain experts. This situation is very common in entity matching: many entity pairs can easily be determined to be matched or mismatched by simple comparison, so the benefit of labeling them is very low. Directly handing the records of different data sets to experts for pairwise labeling therefore incurs an extraordinary workload.
- Although deep learning methods based on language models can achieve good matching results, they usually require a suitably pre-trained model and domain-related texts for fine-tuning. When a new domain is encountered, such methods struggle to achieve good results without a pre-trained language model and domain-related knowledge.
- We propose a pool-based active learning method for entity matching, which finds the most valuable samples to label and builds the learning model from only a small number of labeled samples, achieving good performance compared to existing methods. This work effectively addresses the difficulty of acquiring labeled samples and the lack of domain knowledge in entity matching.
- Our method integrates query strategies to effectively select valuable samples from the unlabeled pool for labeling. Experimental results show that the selected samples are highly representative and that the method effectively reduces the labeling workload.
- We verified the performance of our method on seven public datasets. Compared to existing ML-based and DL-based methods, the proposed method reaches a similar F1 score while using only a small number of labeled samples. On small-scale datasets, the proposed method even outperforms state-of-the-art deep learning methods.
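The pool-based loop behind these contributions can be sketched in a few lines. The snippet below is an illustrative sketch on synthetic data with a plain entropy-based query strategy, not the authors' exact algorithm (which uses the hybrid uncertainty strategies of Section 3.4); the feature construction, classifier choice, and pool sizes here are assumptions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Synthetic stand-in for similarity feature vectors of candidate record pairs.
X = rng.normal(size=(500, 4))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)  # oracle labels (the "expert")

# Small initial labeled pool, seeded with both classes.
pos = np.where(y == 1)[0][:5]
neg = np.where(y == 0)[0][:5]
labeled = list(pos) + list(neg)
unlabeled = [i for i in range(len(X)) if i not in labeled]

clf = LogisticRegression()
for _ in range(20):  # M iterations
    clf.fit(X[labeled], y[labeled])
    probs = clf.predict_proba(X[unlabeled])
    # Query the most uncertain unlabeled pair (maximum prediction entropy).
    entropy = -(probs * np.log(probs + 1e-12)).sum(axis=1)
    query = unlabeled[int(entropy.argmax())]
    labeled.append(query)      # expert supplies the label L(q)
    unlabeled.remove(query)

accuracy = clf.score(X, y)
```

After 20 queries the model has seen only 30 labels out of 500 candidate pairs, which is the workload reduction the contributions above describe.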
2. Related Works
2.1. Rule-Based Entity Matching
2.2. Machine Learning-Based Entity Matching
2.3. Deep Learning-Based Entity Matching
3. The Pool-Based Active Learning Method for Entity Matching
3.1. The Framework of Pool-Based Active Learning
3.2. Data Preprocessing
3.2.1. Generating Data Set
3.2.2. Pruning
3.3. Generating the Initial Labeled Pool
3.4. Query Strategy
3.4.1. Entropy-Average Uncertainty
3.4.2. Entropy-Variance Uncertainty
3.4.3. Probability-Variance Uncertainty
3.4.4. Hybrid Uncertainty
3.5. Stop Criterion
4. Experiment and Evaluation
4.1. Experiment Setup
4.1.1. Data Sets
4.1.2. Classifiers
4.1.3. Other Hyper Parameters
4.2. Evaluation Metrics
4.3. Experiment Results
4.3.1. Performance of Query Strategies
4.3.2. Comparing with Deep Learning-Based Methods
4.3.3. Comparing with Existing Active Learning-Based Methods
4.3.4. Performance of Initial Labeled Pool
4.3.5. Performance of Pruning Method
4.3.6. Stability
5. Discussion
6. Conclusions
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
References
- Tan, W.C. Technical Perspective: Toward Building Entity Matching Management Systems. SIGMOD Rec. 2018, 47, 32. [Google Scholar] [CrossRef]
- Koepcke, H.; Rahm, E. Frameworks for Entity Matching: A Comparison. Data Knowl. Eng. 2010, 69, 197–210. [Google Scholar] [CrossRef]
- Konda, P.; Das, S.; Doan, A.; Ardalan, A.; Ballard, J.R.; Li, H.; Panahi, F.; Zhang, H.; Naughton, J.; Prasad, S.; et al. Magellan: Toward Building Entity Matching Management Systems. VLDB Endow. 2016, 9, 1197–1208. [Google Scholar] [CrossRef]
- Christen, P. Data Matching; Springer: Berlin/Heidelberg, Germany, 2012. [Google Scholar]
- Singh, R.; Meduri, V.; Elmagarmid, A.; Madden, S.; Papotti, P.; Quiané-Ruiz, J.-A.; Solar-Lezama, A.; Tang, N. Generating Concise Entity Matching Rules. In Proceedings of the ACM International Conference on Management of Data, Chicago, IL, USA, 14–19 May 2017. [Google Scholar]
- Shen, W.; Li, X.; Doan, A.H. Constraint-based Entity Matching. In Proceedings of the AAAI Conference on Artificial Intelligence, Pittsburgh, PA, USA, 9–13 July 2005. [Google Scholar]
- Whang, S.E.; Benjelloun, O.; Garcia-Molina, H. Generic Entity Resolution with Negative Rules. VLDB J. 2009, 18, 1261–1277. [Google Scholar] [CrossRef]
- Singla, P.; Domingos, P. Entity Resolution with Markov Logic. In Proceedings of the Sixth International Conference on Data Mining, Hong Kong, China, 18–22 December 2006. [Google Scholar]
- Chaudhuri, S.; Chen, B.C.; Ganti, V.; Kaushik, R. Example-Driven Design of Efficient Record Matching Queries. In Proceedings of the 33rd International Conference on Very Large Data Bases, Vienna, Austria, 23–28 September 2007. [Google Scholar]
- Schmidhuber, J. Deep Learning in Neural Networks: An overview. Neural Netw. 2015, 61, 85–117. [Google Scholar] [CrossRef]
- Barlaug, N.; Atle Gulla, J. Neural Networks for Entity Matching: A Survey. ACM Trans. Knowl. Discov. Data 2021, 15, 37. [Google Scholar] [CrossRef]
- Settles, B. Active Learning Literature Survey; Technical Report; University of Wisconsin: Madison, WI, USA, 2010. [Google Scholar]
- Balcan, M.; Beygelzimer, A.; Langford, J. Agnostic Active Learning. J. Comput. Syst. Sci. 2009, 75, 78–89. [Google Scholar] [CrossRef]
- Attenberg, J.; Provost, F. Inactive learning? Difficulties Employing Active Learning in Practice. SIGKDD Explor. Newsl. 2010, 12, 36–41. [Google Scholar] [CrossRef]
- Chen, Z.; Tao, R.; Wu, X.; Wei, Z.; Luo, X. Active Learning for Spam Email Classification. In Proceedings of the 2nd International Conference on Algorithms, Computing and Artificial Intelligence, Sanya, China, 20–22 December 2019. [Google Scholar]
- Budd, S.; Robinson, E.C.; Kainz, B. A Survey on Active Learning and Human-in-the-Loop Deep Learning for Medical Image Analysis. Med. Image Anal. 2021, 71, 102062. [Google Scholar]
- Agoun, J.; Hacid, M. Access Control based on Entity Matching for Secure Data Sharing. Serv. Oriented Comput. Appl. 2022, 16, 31–44. [Google Scholar] [CrossRef]
- Zhang, P.; Kang, X. Similar Physical Entity Matching Strategy for Mobile Edge Search. Digit. Commun. Netw. 2020, 6, 203–209. [Google Scholar] [CrossRef]
- Singh, R.; Meduri, V.V.; Elmagarmid, A.; Madden, S.; Papotti, P.; Quiané-Ruiz, J.-A.; Solar-Lezama, A.; Tang, N. Synthesizing Entity Matching Rules by Examples. VLDB Endow. 2017, 11, 189–202. [Google Scholar] [CrossRef]
- Ngomo, A.C.N. Link Discovery with Guaranteed Reduction Ratio in Affine Spaces with Minkowski Measures. In Proceedings of the International Semantic Web Conference, Boston, MA, USA, 11–15 November 2012. [Google Scholar]
- Jaro, M.A. Advances in Record-Linkage Methodology as Applied to Matching the 1985 Census of Tampa. J. Am. Stat. Assoc. 1989, 84, 414–420. [Google Scholar] [CrossRef]
- Rodrigues, E.O.; Casanova, D.; Teixeira, M.; Pegorini, V.; Favarim, F.; Clua, E.; Conci, A.; Liatsis, P. Proposal and Study of Statistical Features for String Similarity Computation and Classification. Int. J. Data Min. Model. Manag. 2020, 12, 277–280. [Google Scholar] [CrossRef]
- Verykios, V.S.; Moustakides, G.V.; Elfeky, M.G. A Bayesian Decision Model for Cost Optimal Record Matching. VLDB J. 2002, 12, 28–40. [Google Scholar] [CrossRef]
- Dey, D. Entity Matching in Heterogeneous Databases: A Logistic Regression Approach. Decis. Support Syst. 2007, 44, 740–747. [Google Scholar] [CrossRef]
- Primpeli, A.; Bizer, C. Profiling Entity Matching Benchmark Tasks. In Proceedings of the 29th ACM International Conference on Information & Knowledge Management, Online, 19–23 October 2020. [Google Scholar]
- Palumbo, E.; Rizzo, G.; Troncy, R. STEM: Stacked Threshold-based Entity Matching for Knowledge Base Generation. Semant. Web. 2018, 10, 117–137. [Google Scholar] [CrossRef]
- Mugeni, J.B.; Amagasa, T. A Graph-Based Blocking Approach for Entity Matching Using Contrastively Learned Embeddings. ACM SIGAPP Appl. Comput. Rev. 2023, 22, 37–46. [Google Scholar] [CrossRef]
- Ebraheem, M.; Thirumuruganathan, S.; Joty, S.; Ouzzani, M.; Tang, N. DeepER–Deep Entity Resolution. arXiv 2017, arXiv:1710.00597. [Google Scholar]
- Mudgal, S.; Li, H.; Rekatsinas, T.; Doan, A.; Park, Y.; Krishnan, G.; Deep, R.; Arcaute, E.; Raghavendra, V. Deep Learning for Entity Matching: A Design Space Exploration. In Proceedings of the International Conference on Management of Data, Houston, TX, USA, 10–15 June 2018. [Google Scholar]
- Pennington, J.; Socher, R.; Manning, C.D. GloVe: Global Vectors for Word Representation. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP 2014), Doha, Qatar, 25–29 October 2014. [Google Scholar]
- Bojanowski, P.; Grave, E.; Joulin, A.; Mikolov, T. Enriching Word Vectors with Subword Information. Trans. Assoc. Comput. Linguist. 2017, 5, 135–146. [Google Scholar] [CrossRef]
- Arora, S.; Liang, Y.; Ma, T. A Simple but Tough-to-Beat Baseline for Sentence Embeddings. In Proceedings of the 5th International Conference on Learning Representations, Toulon, France, 24–26 April 2017. [Google Scholar]
- Cho, K.; Van Merriënboer, B.; Gulcehre, C.; Bahdanau, D.; Bougares, F.; Schwenk, H.; Bengio, Y. Learning Phrase Representations using RNN Encoder–Decoder for Statistical Machine Translation. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP 2014), Doha, Qatar, 25–29 October 2014. [Google Scholar]
- Parikh, A.P.; Täckström, O.; Das, D.; Uszkoreit, J. A Decomposable Attention Model for Natural Language Inference. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP 2016), Austin, TX, USA, 1–4 November 2016. [Google Scholar]
- Huang, J.; Hu, W.; Bao, Z.; Chen, Q.; Qu, Y. Deep Entity Matching with Adversarial Active Learning. VLDB J. 2023, 32, 229–255. [Google Scholar] [CrossRef]
- Li, Y.; Li, J.; Suhara, Y.; Doan, A.H.; Tan, W.C. Deep Entity Matching with Pre-trained Language Models. VLDB Endow. 2020, 14, 50–60. [Google Scholar] [CrossRef]
- Reimers, N.; Gurevych, I. Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP 2019), Hong Kong, China, 3–7 November 2019. [Google Scholar]
- Li, Y.; Li, J.; Suhara, Y.; Wang, J.; Hirota, W.; Tan, W.C. Deep Entity Matching: Challenges and Opportunities. J. Data Inf. Qual. 2021, 13, 1–17. [Google Scholar] [CrossRef]
- Li, Y.; Li, J.; Suhara, Y.; Doan, A.; Tan, W.-C. Effective Entity Matching with Transformers. VLDB J. 2023, 32, 1215–1235. [Google Scholar] [CrossRef]
- Brunner, U.; Stockinger, K. Entity Matching with Transformer Architectures: A Step Forward in Data Integration. In Proceedings of the 23rd International Conference on Extending Database Technology, Copenhagen, Denmark, 30 March–2 April 2020. [Google Scholar]
- Peeters, R.; Bizer, C.; Glavaš, G. Intermediate Training of BERT for Product Matching. In Proceedings of the DI2KG Workshop at VLDB, Tokyo, Japan, 31 August 2020. [Google Scholar]
- Zhao, C.; He, Y. Auto-EM: End-to-end Fuzzy Entity-Matching using Pre-trained Deep Models and Transfer Learning. In Proceedings of the World Wide Web Conference (WWW 2019), San Francisco, CA, USA, 13–17 May 2019. [Google Scholar]
- Dagan, I.; Engelson, S.P. Committee-based Sampling for Training Probabilistic Classifiers. In Proceedings of the 12th International Conference on Machine Learning, Tahoe City, CA, USA, 9–12 July 1995. [Google Scholar]
- Lewis, D.D.; Gale, W.A. A Sequential Algorithm for Training Text Classifiers. In Proceedings of the 17th Annual International ACM SIGIR Conference on Research and development in Information Retrieval, Dublin, Ireland, 3–6 July 1994. [Google Scholar]
- Baxter, R.; Christen, P.; Churches, T. A Comparison of Fast Blocking Methods for Record Linkage. In Proceedings of the ACM SIGKDD Workshop on Data Cleaning, Record Linkage, and Object Consolidation, Washington, DC, USA, 24 August 2003. [Google Scholar]
- Shannon, C.E. A Mathematical Theory of Communication. Bell Syst. Tech. J. 1948, 27, 379–423. [Google Scholar] [CrossRef]
- Menestrina, D.; Whang, S.E.; Garcia-Molina, H. Evaluating Entity Resolution Results. VLDB Endow. 2010, 3, 208–219. [Google Scholar] [CrossRef]
- Wang, P.; Zheng, W.; Wang, J.; Pei, J. Automating Entity Matching Model Development. In Proceedings of the IEEE 37th International Conference on Data Engineering (ICDE), Chania, Greece, 19–22 April 2021. [Google Scholar]
- Qian, K.; Popa, L.; Sen, P. Active Learning for Large-scale Entity Resolution. In Proceedings of the ACM on Conference on Information and Knowledge Management, Singapore, 6–10 November 2017. [Google Scholar]
- Arasu, A.; Götz, M.; Kaushik, R. On Active Learning of Record Matching Packages. In Proceedings of the ACM SIGMOD International Conference on Management of Data, Chicago, IL, USA, 14–19 May 2010. [Google Scholar]
- Sarawagi, S.; Bhamidipaty, A. Interactive Deduplication Using Active Learning. In Proceedings of the eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Edmonton, AB, Canada, 23–26 July 2002. [Google Scholar]
- Kasai, J.; Qian, K.; Gurajada, S.; Li, Y.; Popa, L. Low-resource Deep Entity Resolution with Transfer and Active Learning. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, 28 July–2 August 2019. [Google Scholar]
| Symbol | Definition |
|---|---|
| M | Number of iterations |
| S | Set of labeled record pairs in the training set (labeled data pool) |
| Q | Set of all record pairs in the training set |
| Qn = {q1, q2, …, qn} | Set of n queried record pairs |
| L(q1), L(q2), …, L(qn) | Labels provided by experts |
| C = {C1, C2, …, Cn} | Set of classifiers |
| T | Set of labeled record pairs in the test set |
| V | Set of labeled record pairs in the validation set |
| F1(V, Ci) | F1 score of classifier Ci's predicted labels on V, w.r.t. the ground truth of V |
| Ci,j | Classifier Ci in the j-th iteration |
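The metric F1(V, Ci) and the stop criterion of Section 3.5 can be illustrated with a minimal sketch. The plateau-based rule below (stop once validation F1 has stopped improving for a few iterations) is a common convention and an assumption here, not necessarily the paper's exact criterion; `patience` and `min_delta` are hypothetical parameters.

```python
def f1_score(true_labels, pred_labels):
    """F1 w.r.t. ground truth, as in F1(V, Ci): harmonic mean of precision and recall."""
    tp = sum(1 for t, p in zip(true_labels, pred_labels) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(true_labels, pred_labels) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(true_labels, pred_labels) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def should_stop(f1_history, patience=3, min_delta=1e-3):
    """Stop once validation F1 has not improved by min_delta for `patience` iterations."""
    if len(f1_history) <= patience:
        return False
    best_before = max(f1_history[:-patience])
    return max(f1_history[-patience:]) < best_before + min_delta
```

In each iteration j, the learner would record F1(V, Ci,j) into the history and halt querying once `should_stop` returns true.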
| Data Set | Domain | Pairs | Match | # Attr. |
|---|---|---|---|---|
| Amazon-Google | Software | 11,460 | 1167 | 3 |
| BeerAdvo-RateBeer | Beer | 450 | 68 | 4 |
| DBLP-ACM | Citation | 12,363 | 2220 | 4 |
| DBLP-Scholar | Citation | 28,707 | 5347 | 4 |
| Fodors-Zagats | Restaurant | 946 | 110 | 6 |
| iTunes-Amazon | Music | 539 | 132 | 8 |
| Walmart-Amazon | Electronics | 10,242 | 962 | 5 |
| DataSet | TrainSet Match | TrainSet Pairs | TestSet Match | TestSet Pairs | ValidSet Match | ValidSet Pairs |
|---|---|---|---|---|---|---|
| Amazon-Google | 589/699 | 5077/6874 | 234 | 2293 | 234 | 2293 |
| BeerAdvo-RateBeer | 29/40 | 103/268 | 14 | 91 | 14 | 91 |
| DBLP-ACM | 1323/1332 | 2602/7417 | 444 | 2473 | 444 | 2473 |
| DBLP-Scholar | 2915/3207 | 6231/17,223 | 1070 | 5742 | 1070 | 5742 |
| Fodors-Zagats | 62/66 | 227/567 | 22 | 190 | 22 | 189 |
| iTunes-Amazon | 74/78 | 166/321 | 27 | 109 | 27 | 109 |
| Walmart-Amazon | 536/576 | 5379/6144 | 193 | 2049 | 193 | 2049 |
| DataSet | Entropy (F1, N) | Ave_Entropy (F1, N) | Var_Entropy (F1, N) | Var_Prob (F1, N) | Hybrid (F1, N) |
|---|---|---|---|---|---|
| Amazon-Google | 0.337, 450 | 0.348, 430 | 0.376, 450 | 0.209, 450 | 0.424, 370 |
| BeerAdvo-RateBeer | 0.839, 30 | 0.828, 22 | 0.839, 22 | 0.867, 22 | 0.867, 22 |
| DBLP-ACM | 0.982, 310 | 0.982, 330 | 0.979, 210 | 0.976, 440 | 0.984, 210 |
| DBLP-Scholar | 0.903, 520 | 0.896, 600 | 0.909, 510 | 0.900, 630 | 0.909, 360 |
| Fodors-Zagats | 1.000, 22 | 1.000, 10 | 1.000, 10 | 1.000, 18 | 1.000, 10 |
| iTunes-Amazon | 0.982, 38 | 0.982, 50 | 1.000, 22 | 0.982, 58 | 1.000, 46 |
| Walmart-Amazon | 0.696, 70 | 0.696, 120 | 0.707, 210 | 0.700, 240 | 0.721, 80 |
| DataSet | SIF | RNN | Attention | Magellan | DeepER | DeepMatcher | AutoML-EM | Ours | ΔF1 |
|---|---|---|---|---|---|---|---|---|---|
| Amazon-Google | 0.606 | 0.599 | 0.611 | 0.491 | 0.561 | 0.693 | 0.664 | 0.424 | −0.269 |
| BeerAdvo-RateBeer | 0.581 | 0.722 | 0.808 | 0.788 | 0.500 | 0.727 | 0.823 | 0.867 | +0.044 |
| DBLP-ACM | 0.975 | 0.983 | 0.984 | 0.984 | 0.976 | 0.984 | 0.984 | 0.984 | 0.000 |
| DBLP-Scholar | 0.909 | 0.930 | 0.933 | 0.923 | 0.908 | 0.947 | 0.946 | 0.909 | −0.038 |
| Fodors-Zagats | 1.000 | 1.000 | 0.821 | 1.000 | 0.977 | 1.000 | 1.000 | 1.000 | 0.000 |
| iTunes-Amazon | 0.814 | 0.885 | 0.808 | 0.912 | 0.880 | 0.880 | 0.963 | 1.000 | +0.037 |
| Walmart-Amazon | 0.651 | 0.676 | 0.500 | 0.719 | 0.506 | 0.669 | 0.785 | 0.721 | −0.064 |
| DataSet | Proposed Method | Other Methods | Ratio |
|---|---|---|---|
| Amazon-Google | 370 | 5077 | 0.073 |
| BeerAdvo-RateBeer | 22 | 103 | 0.216 |
| DBLP-ACM | 210 | 2602 | 0.081 |
| DBLP-Scholar | 360 | 6231 | 0.058 |
| Fodors-Zagats | 10 | 277 | 0.036 |
| iTunes-Amazon | 46 | 166 | 0.277 |
| Walmart-Amazon | 80 | 5379 | 0.015 |
| Method | DBLP-Scholar (F1, N) | DBLP-ACM (F1, N) |
|---|---|---|
| ERLEARN | 0.87, 163 | N/A |
| ALGPR | 0.80, 210 | N/A |
| ALIAS | 0.78, 160 | N/A |
| DTAL | 0.895, 1000 | 0.979, 400 |
| DAL | 0.888, 1000 | 0.954, 400 |
| Ours | 0.908, 360 | 0.983, 210 |
| DataSet | Hybrid (F1, N) | Hybrid * (F1, N) |
|---|---|---|
| Amazon-Google | 0.282, 450 | 0.424, 370 |
| BeerAdvo-RateBeer | 0.763, 46 | 0.867, 22 |
| DBLP-ACM | 0.973, 250 | 0.984, 210 |
| DBLP-Scholar | 0.865, 470 | 0.909, 360 |
| Fodors-Zagats | 1.000, 50 | 1.000, 10 |
| iTunes-Amazon | 0.912, 62 | 1.000, 46 |
| Walmart-Amazon | 0.651, 150 | 0.721, 80 |
© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Share and Cite
Han, Y.; Li, C. Entity Matching by Pool-Based Active Learning. Electronics 2024, 13, 559. https://doi.org/10.3390/electronics13030559