Revolt: Collaborative Crowdsourcing for Labeling Machine Learning Datasets

Published: 02 May 2017 · DOI: 10.1145/3025453.3026044

Abstract

Crowdsourcing provides a scalable and efficient way to construct labeled datasets for training machine learning systems. However, creating comprehensive label guidelines for crowdworkers is often prohibitive, even for seemingly simple concepts. Incomplete or ambiguous label guidelines can then result in differing interpretations of concepts and inconsistent labels. Existing approaches for improving label quality, such as worker screening or detection of poor work, are ineffective for this problem and can lead to rejection of honest work and a missed opportunity to capture rich interpretations about data. We introduce Revolt, a collaborative approach that brings ideas from expert annotation workflows to crowd-based labeling. Revolt eliminates the burden of creating detailed label guidelines by harnessing crowd disagreements to identify ambiguous concepts and create rich structures (groups of semantically related items) for post-hoc label decisions. Experiments comparing Revolt to traditional crowdsourced labeling show that Revolt produces high-quality labels without requiring label guidelines, in exchange for an increase in monetary cost. This up-front cost, however, is mitigated by Revolt's ability to produce reusable structures that can accommodate a variety of label boundaries without requiring new data to be collected. Further comparisons of Revolt's collaborative and non-collaborative variants show that collaboration reaches higher label accuracy with lower monetary cost.
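The abstract describes a workflow in which items that crowdworkers label unanimously are accepted directly, items that draw disagreement are flagged as ambiguous and organized into groups of semantically related items, and the label for each group is decided after collection ("post-hoc"), so the same groups can be relabeled under a different label boundary without gathering new data. The sketch below illustrates that flow; it is a minimal illustration rather than the authors' implementation, and all function and variable names (split_by_agreement, apply_post_hoc_boundary, the toy items) are hypothetical.

```python
from collections import Counter

def split_by_agreement(item_labels):
    """item_labels maps item -> list of crowd labels collected for that item."""
    certain, ambiguous = {}, []
    for item, labels in item_labels.items():
        top_label, votes = Counter(labels).most_common(1)[0]
        if votes == len(labels):      # unanimous agreement: accept the label directly
            certain[item] = top_label
        else:                         # disagreement: defer, send to the structuring stage
            ambiguous.append(item)
    return certain, ambiguous

def apply_post_hoc_boundary(groups, group_decisions):
    """groups maps a group name (e.g., a crowd-proposed category) -> its items;
    group_decisions maps a group name -> the final label the requester assigns.
    The same groups can later be relabeled under a different boundary without
    collecting any new crowd data."""
    return {item: group_decisions[name]
            for name, items in groups.items()
            for item in items}

# Toy run: three labelers per item; "promo_email" draws disagreement.
votes = {
    "newsletter":  ["spam", "spam", "spam"],
    "invoice":     ["not_spam", "not_spam", "not_spam"],
    "promo_email": ["spam", "not_spam", "spam"],
}
certain, ambiguous = split_by_agreement(votes)
groups = {"promotional": ambiguous}   # structure built around the ambiguous items
final_labels = {**certain, **apply_post_hoc_boundary(groups, {"promotional": "spam"})}
print(final_labels)  # {'newsletter': 'spam', 'invoice': 'not_spam', 'promo_email': 'spam'}
```

Deciding on a different label boundary later (say, treating promotional mail as not spam) only requires changing the group decision passed to apply_post_hoc_boundary, which is the reusability property the abstract refers to.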

Information

Published In

CHI '17: Proceedings of the 2017 CHI Conference on Human Factors in Computing Systems
May 2017, 7138 pages
ISBN: 9781450346559
DOI: 10.1145/3025453

Publisher

Association for Computing Machinery, New York, NY, United States

Publication History

Published: 02 May 2017

Author Tags

1. collaboration
2. crowdsourcing
3. machine learning
4. real-time

Qualifiers

• Research-article

Conference

CHI '17

Acceptance Rates

CHI '17 Paper Acceptance Rate: 600 of 2,400 submissions, 25%
Overall Acceptance Rate: 6,199 of 26,314 submissions, 24%


