DOI: 10.1145/3290605.3300322
Research Article | Public Access

Towards Effective Foraging by Data Scientists to Find Past Analysis Choices

Published: 02 May 2019

Abstract

Data scientists are responsible for the analysis decisions they make, but it is hard for them to track the process by which they achieved a result. Even when data scientists keep logs, it is onerous to make sense of the resulting large number of history records full of overlapping variants of code, output, plots, etc. We developed algorithmic and visualization techniques for notebook code environments to help data scientists forage for information in their history. To test these interventions, we conducted a think-aloud evaluation with 15 data scientists, where participants were asked to find specific information from the history of another person's data science project. The participants succeeded on a median of 80% of the tasks they performed. The quantitative results suggest promising aspects of our design, while qualitative results motivated a number of design improvements. The resulting system, called Verdant, is released as an open-source extension for JupyterLab.
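As a rough illustration of the bookkeeping such a history tool implies, consider a per-cell version log that keys snapshots by a hash of the cell source, so repeated runs of identical code collapse into a single stored variant while every execution is still recorded on a timeline. This is a hypothetical sketch only, not Verdant's actual implementation; the CellHistory class and its methods are invented for this example.

# Hypothetical sketch (not Verdant's implementation): a minimal per-cell
# version log that de-duplicates the overlapping variants of notebook code
# the abstract describes.
import hashlib
import time
from dataclasses import dataclass, field

@dataclass
class CellHistory:
    """Version log for one notebook cell, keyed by a hash of its source."""
    versions: dict = field(default_factory=dict)   # digest -> source text
    timeline: list = field(default_factory=list)   # (timestamp, digest) per run

    def record_run(self, source: str) -> str:
        """Store the source if it is a new variant; log the run either way."""
        digest = hashlib.sha1(source.encode("utf-8")).hexdigest()[:10]
        if digest not in self.versions:
            self.versions[digest] = source           # a new variant of this cell
        self.timeline.append((time.time(), digest))  # every execution is recorded
        return digest

    def distinct_variants(self) -> int:
        return len(self.versions)

# Two identical runs collapse into one stored variant; the edited run adds a second.
history = CellHistory()
history.record_run("df = df.dropna()")
history.record_run("df = df.dropna()")
history.record_run("df = df.dropna(subset=['age'])")
assert history.distinct_variants() == 2
assert len(history.timeline) == 3

Keying variants by content rather than by run keeps such a log compact even when a cell is re-executed many times without change, which is one way to reduce the number of overlapping records a user must later forage through.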

Supplementary Material

MP4 File (pn4092.mp4)
Supplemental video



      Information

      Published In

      CHI '19: Proceedings of the 2019 CHI Conference on Human Factors in Computing Systems
      May 2019
      9077 pages
      ISBN: 9781450359702
      DOI: 10.1145/3290605

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      Published: 02 May 2019


      Author Tags

      1. data science
      2. end-user programmers (eup)
      3. end-user software engineering (euse)
      4. exploratory programming
      5. literate programming

      Qualifiers

      • Research-article

      Conference

      CHI '19

      Acceptance Rates

      CHI '19 Paper Acceptance Rate: 703 of 2,958 submissions, 24%
      Overall Acceptance Rate: 6,199 of 26,314 submissions, 24%


