research-article

Crowdsourcing algorithms for entity resolution

Published: 01 August 2014

Abstract

In this paper, we study a hybrid human-machine approach for solving the problem of Entity Resolution (ER). The goal of ER is to identify all records in a database that refer to the same underlying entity and are therefore duplicates of each other. Our input is a graph over all the records in a database, where each edge carries a probability denoting our prior belief (based on Machine Learning models) that the pair of records joined by that edge are duplicates. Our objective is to resolve all the duplicates by asking humans to verify the equality of a subset of edges, leveraging the transitivity of the equality relation to infer the remaining edges (e.g., a = c can be inferred given a = b and b = c). We consider the problem of designing optimal strategies for asking questions to humans that minimize the expected number of questions asked. Using our theoretical framework, we analyze several strategies and show that a strategy claimed to be "optimal" for this problem in a recent work can perform arbitrarily badly in theory. We propose alternate strategies with theoretical guarantees. Using both public datasets and the production system at Facebook, we show that our techniques are effective in practice.
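
The transitivity argument in the abstract can be illustrated with a minimal sketch. It is not one of the strategies analyzed in the paper: the function name `resolve`, the `ask` callback standing in for a human answer, and the choice to ask edges in order of decreasing prior probability are all assumptions made for illustration, and the answers are assumed to be error-free and consistent. A union-find structure records confirmed equalities, and a set of cluster pairs records confirmed inequalities, so that any edge whose answer already follows from earlier answers is skipped.

```python
from typing import Callable, Dict, FrozenSet, Hashable, Iterable, List, Set, Tuple

Edge = Tuple[Hashable, Hashable, float]  # (record, record, prior probability of a match)


def resolve(
    records: Iterable[Hashable],
    edges: Iterable[Edge],
    ask: Callable[[Hashable, Hashable], bool],
) -> Tuple[List[Set[Hashable]], int]:
    """Cluster records, asking only about edges not implied by earlier answers.

    Hypothetical helper: `ask(a, b)` stands in for a (perfect) human answer.
    """
    records = list(records)
    parent: Dict[Hashable, Hashable] = {r: r for r in records}

    def find(x: Hashable) -> Hashable:
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    distinct: Set[FrozenSet[Hashable]] = set()  # pairs of cluster roots known to differ
    questions = 0

    # Illustrative ordering only: most likely duplicates first.
    for a, b, _prob in sorted(edges, key=lambda e: -e[2]):
        ra, rb = find(a), find(b)
        if ra == rb:
            continue  # equality already inferred (a = b, b = c gives a = c)
        if frozenset((ra, rb)) in distinct:
            continue  # inequality already inferred (a = b, b != d gives a != d)
        questions += 1
        if ask(a, b):
            parent[rb] = ra  # merge the two clusters
            # Re-point recorded inequalities from the old root rb to ra.
            distinct = {
                frozenset(ra if r == rb else r for r in pair) for pair in distinct
            }
        else:
            distinct.add(frozenset((ra, rb)))

    clusters: Dict[Hashable, Set[Hashable]] = {}
    for r in records:
        clusters.setdefault(find(r), set()).add(r)
    return list(clusters.values()), questions


# Toy ground truth (made up for this example): a, b, c are one entity; d is another.
truth = {"a": 1, "b": 1, "c": 1, "d": 2}
toy_edges = [
    ("a", "b", 0.9), ("b", "c", 0.8), ("a", "c", 0.7),
    ("c", "d", 0.4), ("a", "d", 0.3), ("b", "d", 0.2),
]
clusters, asked = resolve(truth, toy_edges, lambda x, y: truth[x] == truth[y])
print(clusters, asked)  # 3 questions are asked for 6 edges
```

In this toy run, the answers for a-b, b-c, and c-d determine the other three edges, so only half of the edges need a human answer. How many questions are saved depends on the order in which edges are asked, which is the design question the paper studies.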

Published In

Proceedings of the VLDB Endowment  Volume 7, Issue 12
August 2014
296 pages
ISSN:2150-8097

Publisher

VLDB Endowment
