Abstract
Blocking is a mechanism to improve the efficiency of entity resolution (ER) which aims to quickly prune out all non-matching record pairs. However, depending on the distributions of entity cluster sizes, existing techniques can be either (a) too aggressive, such that they help scale but can adversely affect the ER effectiveness, or (b) too permissive, potentially harming ER efficiency. In this paper, we propose a new methodology of progressive blocking (pBlocking) to enable both efficient and effective ER, which works seamlessly across different entity cluster size distributions. pBlocking is based on the insight that the effectiveness–efficiency trade-off is revealed only when the output of ER starts to be available. Hence, pBlocking leverages partial ER output in a feedback loop to refine the blocking result in a data-driven fashion. Specifically, we bootstrap pBlocking with traditional blocking methods and progressively improve the building and scoring of blocks until we get the desired trade-off, leveraging a limited amount of ER results as a guidance at every round. We formally prove that pBlocking converges efficiently (\(O(n \log ^2 n)\) time complexity, where n is the total number of records). Our experiments show that incorporating partial ER output in a feedback loop can improve the efficiency and effectiveness of blocking by 5\(\times \) and 60%, respectively, improving the overall F-score of the entire ER process up to 60%.
Similar content being viewed by others
Notes
This assumption holds for block building methods such as standard blocking, q-grams blocking and sorted neighborhood with multiple orderings [13] and extends naturally to canopy clustering by using multiple similarity functions.
For algorithms such as [35], progress can be defined as a fraction \(\phi \cdot n\) of processed records since the previous round.
\(\epsilon = O(r_\epsilon ^t)\).
This threshold on block size was shown to have best blocking quality in [19].
We note that setting a score threshold rather than a limit on the number of pairs would not take into account different scores distributions fairly.
Each subsample was generated by using Febrl dataset generator with smaller value of n, the number of records.
References
Altowim, Y., Kalashnikov, D.V., Mehrotra, S.: Progressive approach to relational entity resolution. PVLDB 7(11), 999–1010 (2014)
Bilenko, M., Kamath, B., Mooney, R.J.: Adaptive blocking: learning to scale up record linkage. In: ICDM (2006)
Christen, P., Churches, T., Hegland, M.: Febrl-a parallel open source data linkage system. In: Pacific-Asia Conference on Knowledge Discovery and Data Mining. Springer, pp. 638–647 (2004)
Crescenzi, V., Angelis, A. D., Firmani, D., Mazzei, M., Merialdo, P., Piai, F., Srivastava, D.: Alaska: a flexible benchmark for data integration tasks (2021)
dal Bianco, G., Gonçalves, M.A., Duarte, D.: Bloss: effective meta-blocking with almost no effort. Inf. Syst. 75, 75–89 (2018)
Das, S., Paul Suganthan, G.C., Doan, A., Naughton, J.F., Krishnan, G., Deep, R., Arcaute, E., Raghavendra, V., Park, Y.: Falcon: Scaling up hands-off crowdsourced entity matching to build cloud services. In: SIGMOD (2017)
Elmagarmid, A.K., Ipeirotis, P.G., Verykios, V.S.: Duplicate record detection: a survey. IEEE Trans. Knowl. Data Eng. 19(1), 1–16 (2007)
Firmani, D., Saha, B., Srivastava, D.: Online entity resolution using an oracle. PVLDB 9(5), 384–395 (2016)
Galhotra, S., Firmani, D., Saha, B., Srivastava, D.: Robust entity resolution using random graphs. In: SIGMOD (2018)
Gokhale, C., Das, S., Doan, A., Naughton, J.F., Rampalli, N., Shavlik, J., Zhu, X.: Corleone: hands-off crowdsourcing for entity matching. In: SIGMOD (2014)
Gravano, L., Ipeirotis, P.G., Jagadish, H.V., Koudas, N., Muthukrishnan, S., Srivastava, D.: Approximate string joins in a database (almost) for free. VLDB 1, 491–500 (2001)
Gruenheid, A., Dong, X.L., Srivastava, D.: Incremental record linkage. PVLDB 7(9), 697–708 (2014)
Hernández, M.A., Stolfo, S.J.: The merge/purge problem for large databases. ACM Sigmod Rec. 24, 127–138 (1995)
Hoeffding, W.: Probability inequalities for sums of bounded random variables. In: Hoeffding, W. (ed.) The Collected Works of Wassily Hoeffding, pp. 409–426. Springer, Berlin (1994)
Konda, P., Das, S., Paul Suganthan, G.C., Doan, A., Ardalan, A., Ballard, J.R., Li, H., Panahi, F., Zhang, H., Naughton, J., et al.: Magellan: toward building entity matching management systems. PVLDB 9(12), 1197–1208 (2016)
Krause, J., Stark, M., Deng, J., Fei-Fei, L.: 3d object representations for fine-grained categorization. In: 4th International IEEE Workshop on 3D Representation and Recognition (3dRR-13) (2013)
Manning, C.D., Manning, C.D., Schütze, H.: Foundations of statistical natural language processing (1999)
McCallum, A., Nigam, K., Ungar, L.H.: Efficient clustering of high-dimensional data sets with application to reference matching. In: Proceedings of ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 169–178 (2000)
McNeill, N., Kardes, H., Borthwick, A.: Dynamic record blocking: efficient linking of massive databases in mapreduce. Citeseer (2012)
Mudgal, S., Li, H., Rekatsinas, T., Doan, A., Park, Y., Krishnan, G., Deep, R., Arcaute, E., Raghavendra, V.: Deep learning for entity matching: a design space exploration. In: SIGMOD (2018)
Papadakis, G., Alexiou, G., Papastefanatos, G., Koutrika, G.: Schema-agnostic vs schema-based configurations for blocking methods on homogeneous data. PVLDB 9(4), 312–323 (2015)
Papadakis, G., Ioannou, E., Palpanas, T., Niederee, C., Nejdl, W.: A blocking framework for entity resolution in highly heterogeneous information spaces. IEEE Trans. Knowl. Data Eng. 25(12), 2665–2682 (2012)
Papadakis, G., Koutrika, G., Palpanas, T., Nejdl, W.: Meta-blocking: taking entity resolutionto the next level. TKDE 26, 1946–1960 (2014)
Papadakis, G., Mandilaras, G., Gagliardelli, L., Simonini, G., Thanos, E., Giannakopoulos, G., Bergamaschi, S., Palpanas, T., Koubarakis, M.: Three-dimensional entity resolution with JedAI. Inf. Sys. 93, 101565 (2020)
Papadakis, G., Papastefanatos, G., Koutrika, G.: Supervised meta-blocking. PVLDB 7(14), 1929–1940 (2014)
Papadakis, G., Svirsky, J., Gal, A., Palpanas, T.: Comparative analysis of approximate blocking techniques for entity resolution. PVLDB 9(9), 684–695 (2016)
Papadakis, G., Tsekouras, L., Thanos, E., Giannakopoulos, G., Palpanas, T., Koubarakis, M.: The return of JedAI: end-to-end entity resolution for structured and semi-structured data. PVLDB 11(12), 1950–1953 (2018)
Papenbrock, T., Heise, A., Naumann, F.: Progressive duplicate detection. TKDE 27(5), 1316–1329 (2015)
Penrose, M., et al.: Random Geometric Graphs, vol. 5. Oxford University Press, Oxford (2003)
Schütze, H., Manning, C.D., Raghavan, P.: Introduction to information retrieval. In: Proceedings of the International Communication of Association for Computing Machinery Conference, pp. 260 (2008)
Simonini, G., Bergamaschi, S., Jagadish, H.: Blast: a loosely schema-aware meta-blocking approach for entity resolution. PVLDB 9(12), 1173–1184 (2016)
Simonini, G., Papadakis, G., Palpanas, T., Bergamaschi, S.: Schema-agnostic progressive entity resolution. IEEE Trans. Knowl. Data Eng. 31(6), 1208–1221 (2018)
Verroios, V., Garcia-Molina, H.: Entity resolution with crowd errors. In: ICDE, pp. 219–230 (2015)
Verroios, V., Garcia-Molina, H., Papakonstantinou, Y.: Waldo: an adaptive human interface for crowd entity resolution. In: SIGMOD (2017)
Vesdapunt, N., Bellare, K., Dalvi, N.: Crowdsourcing algorithms for entity resolution. PVLDB 7(12), 1071–1082 (2014)
Wang, J., Li, G., Kraska, T., Franklin, M. J., Feng, J.: Leveraging transitive relations for crowdsourced joins. In: SIGMOD (2013)
Whang, S.E., Garcia-Molina, H.: Incremental entity resolution on rules and data. VLDB J. 23(1), 77–102 (2014)
Whang, S.E., Marmaros, D., Garcia-Molina, H.: Pay-as-you-go entity resolution. TKDE 25(5), 1111–1124 (2013)
Whang, S.E., Menestrina, D., Koutrika, G., Theobald, M., Garcia-Molina, H.: Entity resolution with iterative blocking. In: SIGMOD (2009)
Acknowledgements
This work is supported partly by NSF 1652303, 1909046, and HDR TRIPODS 1934846 grants, and an Alfred P. Sloan Fellowship.
Author information
Authors and Affiliations
Corresponding author
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
About this article
Cite this article
Galhotra, S., Firmani, D., Saha, B. et al. Efficient and effective ER with progressive blocking. The VLDB Journal 30, 537–557 (2021). https://doi.org/10.1007/s00778-021-00656-7
Received:
Revised:
Accepted:
Published:
Issue Date:
DOI: https://doi.org/10.1007/s00778-021-00656-7