HINT: Integration Testing for AI-based features with Humans in the Loop

Research article · Open access · DOI: 10.1145/3490099.3511141

Published: 22 March 2022

Abstract

The dynamic nature of AI technologies makes testing human-AI interaction and collaboration challenging – especially before such features are deployed in the wild. This presents a challenge for designers and AI practitioners, as early feedback for iteration is often unavailable in the development phase. In this paper, we take inspiration from integration testing concepts in software development and present HINT (Human-AI INtegration Testing), a crowd-based framework for testing AI-based experiences integrated with a humans-in-the-loop workflow. HINT supports early testing of AI-based features within the context of realistic user tasks and makes use of successive sessions to simulate AI experiences that evolve over time. Finally, it provides practitioners with reports to evaluate and compare aspects of these experiences.
Through a crowd-based study, we demonstrate the need for over-time testing, where user behaviors evolve as users interact with an AI system. We also show that HINT is able to capture and reveal these distinct user behavior patterns across a variety of common AI performance modalities, using two AI-based feature prototypes. We further evaluate HINT's potential to support practitioners' pre-deployment evaluation of human-AI interaction experiences through semi-structured interviews with 13 practitioners.
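
As a rough illustration of the over-time testing idea described in the abstract, the minimal Python sketch below simulates an AI-based feature whose quality changes between successive sessions and records simple per-session acceptance metrics. This is a hypothetical sketch only: the names (SimulatedAIFeature, run_session) and the random acceptance signal are invented for illustration and do not reflect HINT's actual implementation or APIs.

import random
from dataclasses import dataclass


@dataclass
class SimulatedAIFeature:
    """Stand-in for an AI-based feature whose quality changes between sessions."""
    accuracy: float = 0.6              # probability a suggestion is acceptable
    improvement_per_session: float = 0.1

    def suggest(self) -> bool:
        # True when the simulated suggestion is good enough to accept.
        return random.random() < self.accuracy

    def evolve(self) -> None:
        # Simulate a model update between sessions (the "evolving AI" condition).
        self.accuracy = min(1.0, self.accuracy + self.improvement_per_session)


def run_session(feature: SimulatedAIFeature, n_tasks: int = 20) -> dict:
    """Run one session of user tasks and record simple behavioral metrics."""
    accepted = sum(feature.suggest() for _ in range(n_tasks))
    return {"model_accuracy": feature.accuracy, "accepted": accepted, "tasks": n_tasks}


if __name__ == "__main__":
    random.seed(0)
    feature = SimulatedAIFeature()
    for session in range(1, 4):          # three successive sessions per participant
        print({"session": session, **run_session(feature)})
        feature.evolve()                 # the AI experience evolves between sessions

In a HINT-style setup, the acceptance signal would come from crowd workers completing realistic tasks across successive sessions rather than from a random draw, and the per-session metrics would feed the practitioner-facing reports.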

Published In

IUI '22: Proceedings of the 27th International Conference on Intelligent User Interfaces
March 2022, 888 pages
ISBN: 9781450391443
DOI: 10.1145/3490099

Publisher

Association for Computing Machinery, New York, NY, United States

Author Tags

1. Human-AI interaction
2. crowdsourcing
3. prototyping
4. testing
