DOI: 10.1145/3290605.3300322
Research Article | Public Access

Towards Effective Foraging by Data Scientists to Find Past Analysis Choices

Published: 02 May 2019

Abstract

Data scientists are responsible for the analysis decisions they make, but it is hard for them to track the process by which they achieved a result. Even when data scientists keep logs, it is onerous to make sense of the resulting large number of history records full of overlapping variants of code, output, plots, etc. We developed algorithmic and visualization techniques for notebook code environments to help data scientists forage for information in their history. To test these interventions, we conducted a think-aloud evaluation with 15 data scientists, where participants were asked to find specific information from the history of another person's data science project. The participants succeeded on a median of 80% of the tasks they performed. The quantitative results suggest promising aspects of our design, while qualitative results motivated a number of design improvements. The resulting system, called Verdant, is released as an open-source extension for JupyterLab.
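As a rough illustration of the bookkeeping such a history tool implies, consider a per-cell version log that keys snapshots by a hash of the cell source, so repeated runs of identical code collapse into a single stored variant while every execution is still recorded on a timeline. This is a hypothetical sketch only, not Verdant's actual implementation; the CellHistory class and its methods are invented for this example.

# Hypothetical sketch (not Verdant's implementation): a minimal per-cell
# version log that de-duplicates the overlapping variants of notebook code
# the abstract describes.
import hashlib
import time
from dataclasses import dataclass, field

@dataclass
class CellHistory:
    """Version log for one notebook cell, keyed by a hash of its source."""
    versions: dict = field(default_factory=dict)   # digest -> source text
    timeline: list = field(default_factory=list)   # (timestamp, digest) per run

    def record_run(self, source: str) -> str:
        """Store the source if it is a new variant; log the run either way."""
        digest = hashlib.sha1(source.encode("utf-8")).hexdigest()[:10]
        if digest not in self.versions:
            self.versions[digest] = source           # a new variant of this cell
        self.timeline.append((time.time(), digest))  # every execution is recorded
        return digest

    def distinct_variants(self) -> int:
        return len(self.versions)

# Two identical runs collapse into one stored variant; the edited run adds a second.
history = CellHistory()
history.record_run("df = df.dropna()")
history.record_run("df = df.dropna()")
history.record_run("df = df.dropna(subset=['age'])")
assert history.distinct_variants() == 2
assert len(history.timeline) == 3

Keying variants by content rather than by run keeps such a log compact even when a cell is re-executed many times without change, which is one way to reduce the number of overlapping records a user must later forage through.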

Supplementary Material

MP4 File (pn4092.mp4)
Supplemental video



      Information

      Published In

      CHI '19: Proceedings of the 2019 CHI Conference on Human Factors in Computing Systems
      May 2019
      9077 pages
      ISBN: 9781450359702
      DOI: 10.1145/3290605

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      Published: 02 May 2019


      Author Tags

      1. data science
      2. end-user programmers (eup)
      3. end-user software engineering (euse)
      4. exploratory programming
      5. literate programming

      Qualifiers

      • Research-article

      Conference

      CHI '19

      Acceptance Rates

      CHI '19 Paper Acceptance Rate: 703 of 2,958 submissions, 24%
      Overall Acceptance Rate: 6,199 of 26,314 submissions, 24%


