
Research Article | Open Access

Evaluating What Others Say: The Effect of Accuracy Assessment in Shaping Mental Models of AI Systems

Published: 08 November 2024

Abstract

Forming accurate mental models that align with the actual behavior of an AI system is critical for successful user experience and interactions. One way users develop mental models is through information shared by other users. However, this social information can be inaccurate, and little research has examined whether inaccurate social information influences the development of accurate mental models. To address this gap, our study investigates the impact of social information accuracy on mental models, as well as whether prompting users to validate the social information can mitigate that impact. We conducted a between-subjects experiment with 39 crowdworkers, in which each participant interacted with our AI system that automates a workflow given a natural language sentence. We compared the mental models of participants exposed to social information about how the AI system worked, both correct and incorrect, with those of participants who formed mental models solely through their own use of the system. Specifically, we designed three experimental conditions: (1) a validation condition, which presented the social information followed by an opportunity to validate its accuracy by testing example utterances; (2) a social information condition, which presented the social information only, without the validation opportunity; and (3) a control condition, in which users interacted with the system without any social information. Our results revealed that the validation process had a positive impact on the development of accurate mental models, especially with respect to the knowledge distribution aspect of mental models. Furthermore, participants were more willing to share comments with others when they had the chance to validate the social information. The impact of inaccurate social information on altering user mental models was non-significant, even though 69.23% of participants misjudged the accuracy of the social information at least once. We discuss the implications of these findings for designing tools that support the validation of social information and thereby improve human-AI interactions.



Published In

Proceedings of the ACM on Human-Computer Interaction, Volume 8, Issue CSCW2 (November 2024), 5177 pages
EISSN: 2573-0142
DOI: 10.1145/3703902

This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.

Publisher

Association for Computing Machinery, New York, NY, United States

Publication History

Published: 08 November 2024
Published in PACMHCI Volume 8, Issue CSCW2
Author Tags

1. mental model
2. natural language interface
3. social information
4. validation


