Nothing Special   »   [go: up one dir, main page]

Skip to main content
Log in

Efficient and effective ER with progressive blocking

  • Regular Paper
  • Published:
The VLDB Journal Aims and scope Submit manuscript

Abstract

Blocking is a mechanism to improve the efficiency of entity resolution (ER) which aims to quickly prune out all non-matching record pairs. However, depending on the distributions of entity cluster sizes, existing techniques can be either (a) too aggressive, such that they help scale but can adversely affect the ER effectiveness, or (b) too permissive, potentially harming ER efficiency. In this paper, we propose a new methodology of progressive blocking (pBlocking) to enable both efficient and effective ER, which works seamlessly across different entity cluster size distributions. pBlocking is based on the insight that the effectiveness–efficiency trade-off is revealed only when the output of ER starts to be available. Hence, pBlocking leverages partial ER output in a feedback loop to refine the blocking result in a data-driven fashion. Specifically, we bootstrap pBlocking with traditional blocking methods and progressively improve the building and scoring of blocks until we get the desired trade-off, leveraging a limited amount of ER results as a guidance at every round. We formally prove that pBlocking converges efficiently (\(O(n \log ^2 n)\) time complexity, where n is the total number of records). Our experiments show that incorporating partial ER output in a feedback loop can improve the efficiency and effectiveness of blocking by 5\(\times \) and 60%, respectively, improving the overall F-score of the entire ER process up to 60%.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Subscribe and save

Springer+ Basic
$34.99 /Month
  • Get 10 units per month
  • Download Article/Chapter or eBook
  • 1 Unit = 1 Article or 1 Chapter
  • Cancel anytime
Subscribe now

Buy Now

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Fig. 1
Fig. 2
Fig. 3
Fig. 4
Fig. 5
Fig. 6
Fig. 7
Fig. 8
Fig. 9

Similar content being viewed by others

Notes

  1. This assumption holds for block building methods such as standard blocking, q-grams blocking and sorted neighborhood with multiple orderings [13] and extends naturally to canopy clustering by using multiple similarity functions.

  2. For algorithms such as [35], progress can be defined as a fraction \(\phi \cdot n\) of processed records since the previous round.

  3. \(\epsilon = O(r_\epsilon ^t)\).

  4. https://cloud.google.com/vision, https://www.ibm.com/watson/services/visual-recognition/

  5. http://sourceforge.net/projects/erframework/

  6. This threshold on block size was shown to have best blocking quality in [19].

  7. We note that setting a score threshold rather than a limit on the number of pairs would not take into account different scores distributions fairly.

  8. https://scikit-learn.org/stable/

  9. https://spacy.io/

  10. Each subsample was generated by using Febrl dataset generator with smaller value of n, the number of records.

References

  1. Altowim, Y., Kalashnikov, D.V., Mehrotra, S.: Progressive approach to relational entity resolution. PVLDB 7(11), 999–1010 (2014)

    Google Scholar 

  2. Bilenko, M., Kamath, B., Mooney, R.J.: Adaptive blocking: learning to scale up record linkage. In: ICDM (2006)

  3. Christen, P., Churches, T., Hegland, M.: Febrl-a parallel open source data linkage system. In: Pacific-Asia Conference on Knowledge Discovery and Data Mining. Springer, pp. 638–647 (2004)

  4. Crescenzi, V., Angelis, A. D., Firmani, D., Mazzei, M., Merialdo, P., Piai, F., Srivastava, D.: Alaska: a flexible benchmark for data integration tasks (2021)

  5. dal Bianco, G., Gonçalves, M.A., Duarte, D.: Bloss: effective meta-blocking with almost no effort. Inf. Syst. 75, 75–89 (2018)

    Article  Google Scholar 

  6. Das, S., Paul Suganthan, G.C., Doan, A., Naughton, J.F., Krishnan, G., Deep, R., Arcaute, E., Raghavendra, V., Park, Y.: Falcon: Scaling up hands-off crowdsourced entity matching to build cloud services. In: SIGMOD (2017)

  7. Elmagarmid, A.K., Ipeirotis, P.G., Verykios, V.S.: Duplicate record detection: a survey. IEEE Trans. Knowl. Data Eng. 19(1), 1–16 (2007)

    Article  Google Scholar 

  8. Firmani, D., Saha, B., Srivastava, D.: Online entity resolution using an oracle. PVLDB 9(5), 384–395 (2016)

    Google Scholar 

  9. Galhotra, S., Firmani, D., Saha, B., Srivastava, D.: Robust entity resolution using random graphs. In: SIGMOD (2018)

  10. Gokhale, C., Das, S., Doan, A., Naughton, J.F., Rampalli, N., Shavlik, J., Zhu, X.: Corleone: hands-off crowdsourcing for entity matching. In: SIGMOD (2014)

  11. Gravano, L., Ipeirotis, P.G., Jagadish, H.V., Koudas, N., Muthukrishnan, S., Srivastava, D.: Approximate string joins in a database (almost) for free. VLDB 1, 491–500 (2001)

    Google Scholar 

  12. Gruenheid, A., Dong, X.L., Srivastava, D.: Incremental record linkage. PVLDB 7(9), 697–708 (2014)

    Google Scholar 

  13. Hernández, M.A., Stolfo, S.J.: The merge/purge problem for large databases. ACM Sigmod Rec. 24, 127–138 (1995)

    Article  Google Scholar 

  14. Hoeffding, W.: Probability inequalities for sums of bounded random variables. In: Hoeffding, W. (ed.) The Collected Works of Wassily Hoeffding, pp. 409–426. Springer, Berlin (1994)

    Chapter  Google Scholar 

  15. Konda, P., Das, S., Paul Suganthan, G.C., Doan, A., Ardalan, A., Ballard, J.R., Li, H., Panahi, F., Zhang, H., Naughton, J., et al.: Magellan: toward building entity matching management systems. PVLDB 9(12), 1197–1208 (2016)

    Google Scholar 

  16. Krause, J., Stark, M., Deng, J., Fei-Fei, L.: 3d object representations for fine-grained categorization. In: 4th International IEEE Workshop on 3D Representation and Recognition (3dRR-13) (2013)

  17. Manning, C.D., Manning, C.D., Schütze, H.: Foundations of statistical natural language processing (1999)

  18. McCallum, A., Nigam, K., Ungar, L.H.: Efficient clustering of high-dimensional data sets with application to reference matching. In: Proceedings of ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 169–178 (2000)

  19. McNeill, N., Kardes, H., Borthwick, A.: Dynamic record blocking: efficient linking of massive databases in mapreduce. Citeseer (2012)

  20. Mudgal, S., Li, H., Rekatsinas, T., Doan, A., Park, Y., Krishnan, G., Deep, R., Arcaute, E., Raghavendra, V.: Deep learning for entity matching: a design space exploration. In: SIGMOD (2018)

  21. Papadakis, G., Alexiou, G., Papastefanatos, G., Koutrika, G.: Schema-agnostic vs schema-based configurations for blocking methods on homogeneous data. PVLDB 9(4), 312–323 (2015)

    Google Scholar 

  22. Papadakis, G., Ioannou, E., Palpanas, T., Niederee, C., Nejdl, W.: A blocking framework for entity resolution in highly heterogeneous information spaces. IEEE Trans. Knowl. Data Eng. 25(12), 2665–2682 (2012)

    Article  Google Scholar 

  23. Papadakis, G., Koutrika, G., Palpanas, T., Nejdl, W.: Meta-blocking: taking entity resolutionto the next level. TKDE 26, 1946–1960 (2014)

    Google Scholar 

  24. Papadakis, G., Mandilaras, G., Gagliardelli, L., Simonini, G., Thanos, E., Giannakopoulos, G., Bergamaschi, S., Palpanas, T., Koubarakis, M.: Three-dimensional entity resolution with JedAI. Inf. Sys. 93, 101565 (2020)

    Article  Google Scholar 

  25. Papadakis, G., Papastefanatos, G., Koutrika, G.: Supervised meta-blocking. PVLDB 7(14), 1929–1940 (2014)

    Google Scholar 

  26. Papadakis, G., Svirsky, J., Gal, A., Palpanas, T.: Comparative analysis of approximate blocking techniques for entity resolution. PVLDB 9(9), 684–695 (2016)

    Google Scholar 

  27. Papadakis, G., Tsekouras, L., Thanos, E., Giannakopoulos, G., Palpanas, T., Koubarakis, M.: The return of JedAI: end-to-end entity resolution for structured and semi-structured data. PVLDB 11(12), 1950–1953 (2018)

    Google Scholar 

  28. Papenbrock, T., Heise, A., Naumann, F.: Progressive duplicate detection. TKDE 27(5), 1316–1329 (2015)

    Google Scholar 

  29. Penrose, M., et al.: Random Geometric Graphs, vol. 5. Oxford University Press, Oxford (2003)

    Book  Google Scholar 

  30. Schütze, H., Manning, C.D., Raghavan, P.: Introduction to information retrieval. In: Proceedings of the International Communication of Association for Computing Machinery Conference, pp. 260 (2008)

  31. Simonini, G., Bergamaschi, S., Jagadish, H.: Blast: a loosely schema-aware meta-blocking approach for entity resolution. PVLDB 9(12), 1173–1184 (2016)

    Google Scholar 

  32. Simonini, G., Papadakis, G., Palpanas, T., Bergamaschi, S.: Schema-agnostic progressive entity resolution. IEEE Trans. Knowl. Data Eng. 31(6), 1208–1221 (2018)

    Article  Google Scholar 

  33. Verroios, V., Garcia-Molina, H.: Entity resolution with crowd errors. In: ICDE, pp. 219–230 (2015)

  34. Verroios, V., Garcia-Molina, H., Papakonstantinou, Y.: Waldo: an adaptive human interface for crowd entity resolution. In: SIGMOD (2017)

  35. Vesdapunt, N., Bellare, K., Dalvi, N.: Crowdsourcing algorithms for entity resolution. PVLDB 7(12), 1071–1082 (2014)

    Google Scholar 

  36. Wang, J., Li, G., Kraska, T., Franklin, M. J., Feng, J.: Leveraging transitive relations for crowdsourced joins. In: SIGMOD (2013)

  37. Whang, S.E., Garcia-Molina, H.: Incremental entity resolution on rules and data. VLDB J. 23(1), 77–102 (2014)

    Article  Google Scholar 

  38. Whang, S.E., Marmaros, D., Garcia-Molina, H.: Pay-as-you-go entity resolution. TKDE 25(5), 1111–1124 (2013)

    Google Scholar 

  39. Whang, S.E., Menestrina, D., Koutrika, G., Theobald, M., Garcia-Molina, H.: Entity resolution with iterative blocking. In: SIGMOD (2009)

  40. www.cs.umass.edu/mccallum/data/cora-refs.tar.gz

Download references

Acknowledgements

This work is supported partly by NSF 1652303, 1909046, and HDR TRIPODS 1934846 grants, and an Alfred P. Sloan Fellowship.

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Sainyam Galhotra.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Reprints and permissions

About this article

Check for updates. Verify currency and authenticity via CrossMark

Cite this article

Galhotra, S., Firmani, D., Saha, B. et al. Efficient and effective ER with progressive blocking. The VLDB Journal 30, 537–557 (2021). https://doi.org/10.1007/s00778-021-00656-7

Download citation

  • Received:

  • Revised:

  • Accepted:

  • Published:

  • Issue Date:

  • DOI: https://doi.org/10.1007/s00778-021-00656-7

Keywords

Navigation