research-article

Crowdsourcing algorithms for entity resolution

Authors:

Norases Vesdapunt,

Nilesh Dalvi

Proceedings of the VLDB Endowment, Volume 7, Issue 12

Pages 1071 - 1082

https://doi.org/10.14778/2732977.2732982

Published: 01 August 2014

Abstract

In this paper, we study a hybrid human-machine approach for solving the problem of Entity Resolution (ER). The goal of ER is to identify all records in a database that refer to the same underlying entity and are therefore duplicates of each other. Our input is a graph over all the records in a database, where each edge has a probability denoting our prior belief (based on Machine Learning models) that the pair of records represented by the given edge are duplicates. Our objective is to resolve all the duplicates by asking humans to verify the equality of a subset of edges, leveraging the transitivity of the equality relation to infer the remaining edges (e.g., a = c can be inferred given a = b and b = c). We consider the problem of designing optimal strategies for asking questions to humans that minimize the expected number of questions asked. Using our theoretical framework, we analyze several strategies and show that a strategy claimed as "optimal" for this problem in a recent work can perform arbitrarily badly in theory. We propose alternative strategies with theoretical guarantees. Using both public datasets and the production system at Facebook, we show that our techniques are effective in practice.
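The transitivity idea in the abstract can be illustrated with a minimal sketch. It is not one of the strategies analyzed in the paper: the function name `resolve`, the `ask_human` callback, and the choice to ask about edges in descending order of prior probability are all assumptions made for illustration. The sketch keeps a union-find structure over records, skips any edge whose answer is already implied (a = b and b = c give a = c; a = b and b ≠ d give a ≠ d), and counts how many questions are actually asked.

```python
from typing import Callable, Dict, Hashable, List, Set, Tuple

# (record_a, record_b, prior probability that the two records match)
Edge = Tuple[Hashable, Hashable, float]


def resolve(
    records: List[Hashable],
    edges: List[Edge],
    ask_human: Callable[[Hashable, Hashable], bool],
) -> Tuple[List[Set[Hashable]], int]:
    """Cluster records, asking a human only when the answer cannot be inferred."""
    parent: Dict[Hashable, Hashable] = {r: r for r in records}
    # For each cluster root, the roots of clusters known to be different from it.
    different: Dict[Hashable, Set[Hashable]] = {r: set() for r in records}

    def find(x: Hashable) -> Hashable:
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    questions = 0
    # Illustrative ordering (an assumption): most likely duplicates first.
    for a, b, _prob in sorted(edges, key=lambda e: e[2], reverse=True):
        ra, rb = find(a), find(b)
        if ra == rb:
            continue  # equality already follows by transitivity
        if rb in different[ra]:
            continue  # inequality already follows from earlier answers
        questions += 1
        if ask_human(a, b):
            parent[rb] = ra  # merge the two clusters
            for other in different.pop(rb):
                different[other].discard(rb)
                different[other].add(ra)
                different[ra].add(other)
        else:
            different[ra].add(rb)
            different[rb].add(ra)

    clusters: Dict[Hashable, Set[Hashable]] = {}
    for r in records:
        clusters.setdefault(find(r), set()).add(r)
    return list(clusters.values()), questions


# Hypothetical ground truth standing in for human answers: a, b, c are one entity.
truth = {"a": 1, "b": 1, "c": 1, "d": 2}
example_edges = [
    ("a", "b", 0.9), ("b", "c", 0.8), ("a", "c", 0.7),
    ("a", "d", 0.4), ("c", "d", 0.3), ("b", "d", 0.2),
]
clusters, questions = resolve(
    list(truth), example_edges, lambda x, y: truth[x] == truth[y]
)
print(sorted(sorted(c) for c in clusters), questions)
```

In this toy instance only three of the six candidate edges are sent to the human; the other three are inferred. The number of questions depends on the order in which edges are asked, which is the quantity the abstract proposes to minimize in expectation.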

Published In

Proceedings of the VLDB Endowment Volume 7, Issue 12

August 2014

296 pages

ISSN:2150-8097

Editors:
H. V. Jagadish, University of Michigan
Aoying Zhou, East China Normal University, China

Publisher

VLDB Endowment
