HINT: Integration Testing for AI-based features with Humans in the Loop

Research article · Open access · DOI: 10.1145/3490099.3511141

Published: 22 March 2022

Abstract

The dynamic nature of AI technologies makes testing human-AI interaction and collaboration challenging – especially before such features are deployed in the wild. This presents a challenge for designers and AI practitioners, as early feedback for iteration is often unavailable in the development phase. In this paper, we take inspiration from integration testing concepts in software development and present HINT (Human-AI INtegration Testing), a crowd-based framework for testing AI-based experiences integrated with a humans-in-the-loop workflow. HINT supports early testing of AI-based features within the context of realistic user tasks and makes use of successive sessions to simulate AI experiences that evolve over time. Finally, it provides practitioners with reports to evaluate and compare aspects of these experiences.
Through a crowd-based study, we demonstrate the need for over-time testing, where user behaviors evolve as users interact with an AI system. We also show that HINT is able to capture and reveal these distinct user behavior patterns across a variety of common AI performance modalities, using two AI-based feature prototypes. We further evaluate HINT's potential to support practitioners' pre-deployment evaluation of human-AI interaction experiences through semi-structured interviews with 13 practitioners.
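
As a rough illustration of the over-time testing idea described in the abstract, the minimal Python sketch below simulates an AI-based feature whose quality changes between successive sessions and records simple per-session acceptance metrics. This is a hypothetical sketch only: the names (SimulatedAIFeature, run_session) and the random acceptance signal are invented for illustration and do not reflect HINT's actual implementation or APIs.

import random
from dataclasses import dataclass


@dataclass
class SimulatedAIFeature:
    """Stand-in for an AI-based feature whose quality changes between sessions."""
    accuracy: float = 0.6              # probability a suggestion is acceptable
    improvement_per_session: float = 0.1

    def suggest(self) -> bool:
        # True when the simulated suggestion is good enough to accept.
        return random.random() < self.accuracy

    def evolve(self) -> None:
        # Simulate a model update between sessions (the "evolving AI" condition).
        self.accuracy = min(1.0, self.accuracy + self.improvement_per_session)


def run_session(feature: SimulatedAIFeature, n_tasks: int = 20) -> dict:
    """Run one session of user tasks and record simple behavioral metrics."""
    accepted = sum(feature.suggest() for _ in range(n_tasks))
    return {"model_accuracy": feature.accuracy, "accepted": accepted, "tasks": n_tasks}


if __name__ == "__main__":
    random.seed(0)
    feature = SimulatedAIFeature()
    for session in range(1, 4):          # three successive sessions per participant
        print({"session": session, **run_session(feature)})
        feature.evolve()                 # the AI experience evolves between sessions

In a HINT-style setup, the acceptance signal would come from crowd workers completing realistic tasks across successive sessions rather than from a random draw, and the per-session metrics would feed the practitioner-facing reports.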

Published In

IUI '22: Proceedings of the 27th International Conference on Intelligent User Interfaces
March 2022, 888 pages
ISBN: 9781450391443
DOI: 10.1145/3490099

Publisher

Association for Computing Machinery, New York, NY, United States

Author Tags

1. Human-AI interaction
2. crowdsourcing
3. prototyping
4. testing
