DOI: 10.1145/3383652.3423860

Can we trust online crowdworkers?: Comparing online and offline participants in a preference test of virtual agents

Published: 19 October 2020

Abstract

Conducting user studies is a crucial component of many scientific fields. While some studies require participants to be physically present, others can be conducted both physically (e.g. in-lab) and online (e.g. via crowdsourcing). Inviting participants to the lab can be a time-consuming and logistically difficult endeavor, and research groups may sometimes be unable to run in-lab experiments at all, for example because of a pandemic. Crowdsourcing platforms such as Amazon Mechanical Turk (AMT) or Prolific can therefore be a suitable alternative for certain experiments, such as evaluating virtual agents. Although previous studies have investigated the use of crowdsourcing platforms for running experiments, there is still uncertainty as to whether the results are reliable for perceptual studies. Here we replicate a previous experiment in which participants evaluated a gesture generation model for virtual agents. The experiment is conducted across three participant pools - in-lab, Prolific, and AMT - with the in-lab and Prolific samples matched on demographics. Our results show no difference between the three participant pools with regard to their evaluations of the gesture generation models or their reliability scores. This indicates that online platforms can successfully be used for perceptual evaluations of this kind.
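To make the kind of comparison described above concrete, here is a minimal, purely illustrative Python sketch. The data, pool sizes, rating scale, and choice of statistics are all assumptions made for this example; they are not the authors' analysis, which is not reproduced on this page.

```python
# Hypothetical sketch, NOT the authors' analysis code: illustrates comparing
# preference ratings from three participant pools and a crude reliability proxy.
from itertools import combinations

import numpy as np
from scipy.stats import kruskal, spearmanr

rng = np.random.default_rng(0)

# Simulated ratings: rows = participants, columns = stimuli, 1-5 ordinal scale.
pools = {
    "in-lab": rng.integers(1, 6, size=(20, 10)),
    "Prolific": rng.integers(1, 6, size=(20, 10)),
    "AMT": rng.integers(1, 6, size=(20, 10)),
}

# Omnibus test: do the three pools' rating distributions differ?
h_stat, p_value = kruskal(*[r.ravel() for r in pools.values()])
print(f"Kruskal-Wallis H = {h_stat:.2f}, p = {p_value:.3f}")

# Reliability proxy per pool: mean pairwise Spearman correlation between
# participants' rating profiles (higher = more agreement within the pool).
for name, ratings in pools.items():
    rhos = [spearmanr(a, b)[0] for a, b in combinations(ratings, 2)]
    print(f"{name}: mean pairwise Spearman rho = {np.mean(rhos):.2f}")
```

A nonparametric omnibus test and a rank correlation are used here simply because preference ratings are ordinal; the paper may well have analyzed its data differently.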




Published In

IVA '20: Proceedings of the 20th ACM International Conference on Intelligent Virtual Agents
October 2020, 394 pages
ISBN: 9781450375863
DOI: 10.1145/3383652


      Publisher

      Association for Computing Machinery

      New York, NY, United States


      Qualifiers

      • Research-article
      • Research
      • Refereed limited

      Funding Sources

      • Wallenberg Artificial Intelligence, Autonomous Systems and Software Program (WASP)
      • Stiftelsen för Strategisk Forskning

      Conference

IVA '20: ACM International Conference on Intelligent Virtual Agents
October 20 - 22, 2020
Virtual Event, Scotland, UK

      Acceptance Rates

      Overall Acceptance Rate 53 of 196 submissions, 27%

Article Metrics

• Downloads (Last 12 months): 15
• Downloads (Last 6 weeks): 0
      Reflects downloads up to 13 Nov 2024

Cited By
• (2024) Influence of Simulation and Interactivity on Human Perceptions of a Robot During Navigation Tasks. ACM Transactions on Human-Robot Interaction. https://doi.org/10.1145/3675784. Online publication date: 16-Jul-2024.
• (2024) Evaluating Gesture Generation in a Large-scale Open Challenge: The GENEA Challenge 2022. ACM Transactions on Graphics 43(3), 1-28. https://doi.org/10.1145/3656374. Online publication date: 27-Apr-2024.
• (2023) Investigating the effect of visual realism on empathic responses to emotionally expressive virtual humans. ACM Symposium on Applied Perception 2023, 1-7. https://doi.org/10.1145/3605495.3605799. Online publication date: 5-Aug-2023.
• (2023) How Do We Perceive Our Trainee Robots? Exploring the Impact of Robot Errors and Appearance When Performing Domestic Physical Tasks on Teachers' Trust and Evaluations. ACM Transactions on Human-Robot Interaction 12(3), 1-41. https://doi.org/10.1145/3582516. Online publication date: 5-May-2023.
• (2022) "Cool glasses, where did you get them?": Generating Visually Grounded Conversation Starters for Human-Robot Dialogue. Proceedings of the 2022 ACM/IEEE International Conference on Human-Robot Interaction, 821-825. https://doi.org/10.5555/3523760.3523884. Online publication date: 7-Mar-2022.
• (2022) Political affiliation moderates subjective interpretations of COVID-19 graphs. Big Data & Society 9(1). https://doi.org/10.1177/20539517221080678. Online publication date: 4-Mar-2022.
• (2022) The GENEA Challenge 2022: A large evaluation of data-driven co-speech gesture generation. Proceedings of the 2022 International Conference on Multimodal Interaction, 736-747. https://doi.org/10.1145/3536221.3558058. Online publication date: 7-Nov-2022.
• (2022) The challenges of providing explanations of AI systems when they do not behave like users expect. Proceedings of the 30th ACM Conference on User Modeling, Adaptation and Personalization, 110-120. https://doi.org/10.1145/3503252.3531306. Online publication date: 4-Jul-2022.
• (2022) A Review of Evaluation Practices of Gesture Generation in Embodied Conversational Agents. IEEE Transactions on Human-Machine Systems 52(3), 379-389. https://doi.org/10.1109/THMS.2022.3149173. Online publication date: Jun-2022.
• (2022) "Cool glasses, where did you get them?": Generating Visually Grounded Conversation Starters for Human-Robot Dialogue. 2022 17th ACM/IEEE International Conference on Human-Robot Interaction (HRI), 821-825. https://doi.org/10.1109/HRI53351.2022.9889489. Online publication date: 7-Mar-2022.
