Research article, DOI: 10.1145/3290605.3300356

How Data Science Workers Work with Data: Discovery, Capture, Curation, Design, Creation

Published: 02 May 2019

Abstract

With the rise of big data, there has been an increasing need for practitioners in this space and an increasing opportunity for researchers to understand their workflows and design new tools to improve them. Data science is often described as data-driven, comprising unambiguous data and proceeding through regularized steps of analysis. However, this view focuses more on abstract processes, pipelines, and workflows, and less on how data science workers engage with the data. In this paper, we build on the work of other CSCW and HCI researchers in describing the ways that scientists, scholars, engineers, and others work with their data, through analyses of interviews with 21 data science professionals. We set five approaches to data along a dimension of interventions: data as given, as captured, as curated, as designed, and as created. Data science workers develop an intuitive sense of their data and processes, and actively shape their data. We propose new ways to apply these interventions analytically, to make sense of the complex activities around data practices.

    Published In

    CHI '19: Proceedings of the 2019 CHI Conference on Human Factors in Computing Systems
    May 2019
    9077 pages
    ISBN:9781450359702
    DOI:10.1145/3290605

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. data capture
    2. data creation
    3. data curation
    4. data design
    5. data discovery
    6. data science
    7. grounded theory
    8. work-practices

    Qualifiers

    • Research-article

    Conference

    CHI '19

    Acceptance Rates

    CHI '19 Paper Acceptance Rate 703 of 2,958 submissions, 24%;
    Overall Acceptance Rate 6,199 of 26,314 submissions, 24%
