Research Article · DOI: 10.1145/3301275.3302265

What can AI do for me?: evaluating machine learning interpretations in cooperative play

Published: 17 March 2019

Abstract

Machine learning is an important tool for decision making, but its ethical and responsible application requires rigorous vetting of its interpretability and utility: an understudied problem, particularly for natural language processing models. We propose an evaluation of interpretation on a real task with real human users, where the effectiveness of interpretation is measured by how much it improves human performance. We design a grounded, realistic human-computer cooperative setting using a question answering task, Quizbowl. We recruit both trivia experts and novices to play this game with a computer as their teammate, which communicates its predictions via three different interpretations. We also provide design guidance for natural language processing human-in-the-loop settings.
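
The paper's central metric is behavioral: an interpretation is considered effective to the extent that showing it raises human performance on the task. Below is a minimal sketch of that scoring idea, assuming a simple Quizbowl-style reward in which earlier correct buzzes score higher; the `Trial` fields, the scoring rule, and `interpretation_effect` are illustrative assumptions for exposition, not the authors' exact protocol.

```python
from dataclasses import dataclass
from statistics import mean

@dataclass
class Trial:
    correct: bool         # did the human-computer team answer correctly?
    buzz_position: float  # fraction of the question revealed at buzz time, in [0, 1]

def trial_score(t: Trial) -> float:
    # Quizbowl rewards answering correctly from as little text as possible,
    # so an earlier correct buzz scores higher; wrong answers score zero.
    # (Hypothetical linear reward, not the paper's exact scoring.)
    return (1.0 - t.buzz_position) if t.correct else 0.0

def interpretation_effect(with_interp: list, without_interp: list) -> float:
    # Effectiveness of an interpretation = how much it improves
    # mean team performance relative to playing without it.
    return (mean(trial_score(t) for t in with_interp)
            - mean(trial_score(t) for t in without_interp))

# Illustrative data: with the interpretation shown, the team buzzes
# earlier and more accurately than at baseline.
baseline = [Trial(True, 0.9), Trial(False, 0.7), Trial(True, 0.8)]
assisted = [Trial(True, 0.6), Trial(True, 0.8), Trial(True, 0.7)]
print(interpretation_effect(assisted, baseline))  # positive => the interpretation helped
```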

Supplementary Material

MP4 File (p229-feng.mp4)



Index Terms

  1. What can AI do for me?: evaluating machine learning interpretations in cooperative play

      Recommendations

      Comments

      Please enable JavaScript to view thecomments powered by Disqus.

      Information & Contributors

      Information

Published In

IUI '19: Proceedings of the 24th International Conference on Intelligent User Interfaces
March 2019, 713 pages
ISBN: 9781450362726
DOI: 10.1145/3301275

Publisher

Association for Computing Machinery, New York, NY, United States


Author Tags

1. interpretability
2. natural language processing
3. question answering


Acceptance Rates

IUI '19: 71 of 282 submissions, 25%
Overall: 746 of 2,811 submissions, 27%


Cited By

• (2024) When Should I Lead or Follow: Understanding Initiative Levels in Human-AI Collaborative Gameplay. Proceedings of the 2024 ACM Designing Interactive Systems Conference, 2037-2056. DOI: 10.1145/3643834.3661583. Online publication date: 1-Jul-2024.
• (2024) The Impact of Explanations on Fairness in Human-AI Decision-Making: Protected vs Proxy Features. Proceedings of the 29th International Conference on Intelligent User Interfaces, 155-180. DOI: 10.1145/3640543.3645210. Online publication date: 18-Mar-2024.
• (2024) Benefits of Human-AI Interaction for Expert Users Interacting with Prediction Models: a Study on Marathon Running. Proceedings of the 29th International Conference on Intelligent User Interfaces, 245-258. DOI: 10.1145/3640543.3645205. Online publication date: 18-Mar-2024.
• (2024) Designing for Appropriate Reliance: The Roles of AI Uncertainty Presentation, Initial User Decision, and User Demographics in AI-Assisted Decision-Making. Proceedings of the ACM on Human-Computer Interaction 8(CSCW1), 1-32. DOI: 10.1145/3637318. Online publication date: 26-Apr-2024.
• (2024) A Decision Theoretic Framework for Measuring AI Reliance. Proceedings of the 2024 ACM Conference on Fairness, Accountability, and Transparency, 221-236. DOI: 10.1145/3630106.3658901. Online publication date: 3-Jun-2024.
• (2024) Visual Explanation for Advertising Creative Workflow. Extended Abstracts of the 2024 CHI Conference on Human Factors in Computing Systems, 1-8. DOI: 10.1145/3613905.3651123. Online publication date: 11-May-2024.
• (2024) Unraveling the Dilemma of AI Errors: Exploring the Effectiveness of Human and Machine Explanations for Large Language Models. Proceedings of the 2024 CHI Conference on Human Factors in Computing Systems, 1-20. DOI: 10.1145/3613904.3642934. Online publication date: 11-May-2024.
• (2024) Understanding Choice Independence and Error Types in Human-AI Collaboration. Proceedings of the 2024 CHI Conference on Human Factors in Computing Systems, 1-19. DOI: 10.1145/3613904.3641946. Online publication date: 11-May-2024.
• (2024) Towards Human-Centered Explainable AI: A Survey of User Studies for Model Explanations. IEEE Transactions on Pattern Analysis and Machine Intelligence 46(4), 2104-2122. DOI: 10.1109/TPAMI.2023.3331846. Online publication date: Apr-2024.
• (2024) Better Understanding of Humans for Cooperative AI through Clustering. 2024 IEEE Conference on Games (CoG), 1-8. DOI: 10.1109/CoG60054.2024.10645647. Online publication date: 5-Aug-2024.
