Falcon: Scaling up hands-off crowdsourced entity matching to build cloud services

S Das, PS GC, AH Doan, JF Naughton, G Krishnan, R Deep, E Arcaute, V Raghavendra…
Proceedings of the 2017 ACM International Conference on Management of Data, 2017
Many works have applied crowdsourcing to entity matching (EM). While promising, these approaches are limited in that they often require a developer to be in the loop. As such, it is difficult for an organization to deploy multiple crowdsourced EM solutions, because there are simply not enough developers. To address this problem, a recent work has proposed Corleone, a solution that crowdsources the entire EM workflow, requiring no developers. While promising, Corleone is severely limited in that it does not scale to large tables. We propose Falcon, a solution that scales up the hands-off crowdsourced EM approach of Corleone, using RDBMS-style query execution and optimization over a Hadoop cluster. Specifically, we define a set of operators and develop efficient implementations. We translate a hands-off crowdsourced EM workflow into a plan consisting of these operators, optimize, then execute the plan. These plans involve both machine and crowd activities, giving rise to novel optimization techniques such as using crowd time to mask machine time. Extensive experiments show that Falcon can scale up to tables of millions of tuples, thus providing a practical solution for hands-off crowdsourced EM, to build cloud-based EM services.
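The idea of using crowd time to mask machine time can be illustrated with a small sketch. This is not Falcon's implementation, which the abstract describes as running over a Hadoop cluster; it is a minimal single-process Python analogy. The functions `ask_crowd` and `machine_step`, their delays, and the toy matching rules are all hypothetical stand-ins chosen for illustration.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def ask_crowd(pairs, delay=0.2):
    """Hypothetical stand-in for one crowd labeling round.

    The sleep mimics crowd latency; the "label" is a toy
    case-insensitive equality check, not a real crowd answer.
    """
    time.sleep(delay)
    return {pair: pair[0].lower() == pair[1].lower() for pair in pairs}


def machine_step(table_a, table_b, delay=0.2):
    """Hypothetical machine-side work that does not need the pending labels.

    Here: a toy blocking step that pairs values sharing a first letter.
    """
    time.sleep(delay)
    return [(a, b) for a in table_a for b in table_b
            if a[0].lower() == b[0].lower()]


def run_sequential(pairs, table_a, table_b):
    # Wait for the crowd first, then do the machine work.
    labels = ask_crowd(pairs)
    candidates = machine_step(table_a, table_b)
    return labels, candidates


def run_masked(pairs, table_a, table_b):
    # Start the crowd round in the background, do the machine work while
    # waiting, and block only when the labels are actually needed.
    with ThreadPoolExecutor(max_workers=1) as pool:
        pending = pool.submit(ask_crowd, pairs)
        candidates = machine_step(table_a, table_b)
        labels = pending.result()
    return labels, candidates
```

Both functions return the same labels and candidates; the masked version finishes in roughly the time of the slower of the two steps rather than their sum. The overlap is only valid when the machine step does not depend on the labels still being collected.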