Research Article · DOI: 10.1145/3301275.3302265

What can AI do for me?: evaluating machine learning interpretations in cooperative play

Published: 17 March 2019

Abstract

Machine learning is an important tool for decision making, but its ethical and responsible application requires rigorous vetting of its interpretability and utility: an understudied problem, particularly for natural language processing models. We propose an evaluation of interpretation on a real task with real human users, where the effectiveness of interpretation is measured by how much it improves human performance. We design a grounded, realistic human-computer cooperative setting using a question answering task, Quizbowl. We recruit both trivia experts and novices to play this game with a computer as their teammate, which communicates its predictions via three different interpretations. We also provide design guidance for natural language processing human-in-the-loop settings.
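
The paper's central metric is behavioral: an interpretation is considered effective to the extent that showing it raises human performance on the task. Below is a minimal sketch of that scoring idea, assuming a simple Quizbowl-style reward in which earlier correct buzzes score higher; the `Trial` fields, the scoring rule, and `interpretation_effect` are illustrative assumptions for exposition, not the authors' exact protocol.

```python
from dataclasses import dataclass
from statistics import mean

@dataclass
class Trial:
    correct: bool         # did the human-computer team answer correctly?
    buzz_position: float  # fraction of the question revealed at buzz time, in [0, 1]

def trial_score(t: Trial) -> float:
    # Quizbowl rewards answering correctly from as little text as possible,
    # so an earlier correct buzz scores higher; wrong answers score zero.
    # (Hypothetical linear reward, not the paper's exact scoring.)
    return (1.0 - t.buzz_position) if t.correct else 0.0

def interpretation_effect(with_interp: list, without_interp: list) -> float:
    # Effectiveness of an interpretation = how much it improves
    # mean team performance relative to playing without it.
    return (mean(trial_score(t) for t in with_interp)
            - mean(trial_score(t) for t in without_interp))

# Illustrative data: with the interpretation shown, the team buzzes
# earlier and more accurately than at baseline.
baseline = [Trial(True, 0.9), Trial(False, 0.7), Trial(True, 0.8)]
assisted = [Trial(True, 0.6), Trial(True, 0.8), Trial(True, 0.7)]
print(interpretation_effect(assisted, baseline))  # positive => the interpretation helped
```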

Supplementary Material

MP4 File (p229-feng.mp4)



Index Terms

  1. What can AI do for me?: evaluating machine learning interpretations in cooperative play

      Recommendations

      Comments

      Please enable JavaScript to view thecomments powered by Disqus.

      Information & Contributors

      Information

Published In

IUI '19: Proceedings of the 24th International Conference on Intelligent User Interfaces
March 2019, 713 pages
ISBN: 9781450362726
DOI: 10.1145/3301275

Publisher

Association for Computing Machinery, New York, NY, United States


Author Tags

1. interpretability
2. natural language processing
3. question answering


Acceptance Rates

IUI '19: 71 of 282 submissions, 25%
Overall: 746 of 2,811 submissions, 27%


Cited By

• (2024) When Should I Lead or Follow: Understanding Initiative Levels in Human-AI Collaborative Gameplay. Proceedings of the 2024 ACM Designing Interactive Systems Conference, 2037-2056. DOI: 10.1145/3643834.3661583. Online publication date: 1-Jul-2024.
• (2024) The Impact of Explanations on Fairness in Human-AI Decision-Making: Protected vs Proxy Features. Proceedings of the 29th International Conference on Intelligent User Interfaces, 155-180. DOI: 10.1145/3640543.3645210. Online publication date: 18-Mar-2024.
• (2024) Benefits of Human-AI Interaction for Expert Users Interacting with Prediction Models: a Study on Marathon Running. Proceedings of the 29th International Conference on Intelligent User Interfaces, 245-258. DOI: 10.1145/3640543.3645205. Online publication date: 18-Mar-2024.
• (2024) Designing for Appropriate Reliance: The Roles of AI Uncertainty Presentation, Initial User Decision, and User Demographics in AI-Assisted Decision-Making. Proceedings of the ACM on Human-Computer Interaction 8(CSCW1), 1-32. DOI: 10.1145/3637318. Online publication date: 26-Apr-2024.
• (2024) A Decision Theoretic Framework for Measuring AI Reliance. Proceedings of the 2024 ACM Conference on Fairness, Accountability, and Transparency, 221-236. DOI: 10.1145/3630106.3658901. Online publication date: 3-Jun-2024.
• (2024) Visual Explanation for Advertising Creative Workflow. Extended Abstracts of the 2024 CHI Conference on Human Factors in Computing Systems, 1-8. DOI: 10.1145/3613905.3651123. Online publication date: 11-May-2024.
• (2024) Unraveling the Dilemma of AI Errors: Exploring the Effectiveness of Human and Machine Explanations for Large Language Models. Proceedings of the 2024 CHI Conference on Human Factors in Computing Systems, 1-20. DOI: 10.1145/3613904.3642934. Online publication date: 11-May-2024.
• (2024) Understanding Choice Independence and Error Types in Human-AI Collaboration. Proceedings of the 2024 CHI Conference on Human Factors in Computing Systems, 1-19. DOI: 10.1145/3613904.3641946. Online publication date: 11-May-2024.
• (2024) Towards Human-Centered Explainable AI: A Survey of User Studies for Model Explanations. IEEE Transactions on Pattern Analysis and Machine Intelligence 46(4), 2104-2122. DOI: 10.1109/TPAMI.2023.3331846. Online publication date: Apr-2024.
• (2024) Better Understanding of Humans for Cooperative AI through Clustering. 2024 IEEE Conference on Games (CoG), 1-8. DOI: 10.1109/CoG60054.2024.10645647. Online publication date: 5-Aug-2024.
