DOI: 10.1145/3650203.3663337
research-article

DLProv: A Data-Centric Support for Deep Learning Workflow Analyses

Published: 09 June 2024

Abstract

A Deep Learning (DL) workflow involves several data transformation steps. Evaluating the many possible configurations at each step makes selecting DL models a complex task. Making this choice requires decisions grounded in metrics and continuous monitoring of the workflow's progress. With the plethora of frameworks that manage the execution of DL workflows and algorithms, metrics such as accuracy and loss, together with hyperparameters, are no longer enough for choosing which models to deploy. The need for richer data to analyze the artifacts generated during workflow execution has led existing solutions to offer new environments for monitoring and analyzing data transformations. Nevertheless, these solutions commonly represent monitoring and analysis data with limited relationships between artifacts and workflow steps. This makes it hard to identify and associate the input data used in model training, which in turn complicates tracing the data derivation path. Furthermore, these solutions often adopt ad-hoc data representations. Our goal in this paper is to provide a service, named DLProv, that complies with the W3C PROV recommendation, supporting decision-making queries during training while representing the relationships needed to generate provenance traces. DLProv is independent of Machine Learning (ML) frameworks. We evaluated the DLProv service through a case study involving the widely recognized AlexNet convolutional neural network architecture. Additionally, we analyzed the query support of MLflow, MLflow2PROV, and DLProv using typical DL provenance queries.
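To illustrate what a data derivation path means in this setting, the following is a minimal sketch in plain Python. It is not DLProv's API or data model: the `ProvGraph` class, its methods, and the workflow step and artifact names are all hypothetical. It only assumes the two core W3C PROV relations, `used` (an activity consumed an entity) and `wasGeneratedBy` (an entity was produced by an activity), and walks them backwards to recover the inputs behind a trained model.

```python
from collections import defaultdict


class ProvGraph:
    """Hypothetical, simplified store for PROV-style relations."""

    def __init__(self):
        # PROV `used`: activity -> set of input entities
        self.used = defaultdict(set)
        # PROV `wasGeneratedBy`: entity -> activity that produced it
        self.generated_by = {}

    def record(self, activity, inputs, outputs):
        """Register one workflow step with its input and output entities."""
        self.used[activity].update(inputs)
        for out in outputs:
            self.generated_by[out] = activity

    def lineage(self, entity):
        """Return every upstream entity that `entity` was derived from."""
        seen = set()
        stack = [entity]
        while stack:
            current = stack.pop()
            activity = self.generated_by.get(current)
            if activity is None:
                # No generating activity recorded: a source entity.
                continue
            for parent in self.used[activity]:
                if parent not in seen:
                    seen.add(parent)
                    stack.append(parent)
        return seen


# Illustrative three-step workflow; all names are assumptions.
graph = ProvGraph()
graph.record("preprocess", inputs={"raw_images"}, outputs={"train_set", "test_set"})
graph.record("train", inputs={"train_set", "hyperparameters"}, outputs={"model"})
graph.record("evaluate", inputs={"model", "test_set"}, outputs={"accuracy"})

print(sorted(graph.lineage("model")))
```

Because each artifact is linked to the step that produced it, a query such as "which input data was this model trained on" becomes a graph traversal rather than a search through logged metrics.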



Published In

DEEM '24: Proceedings of the Eighth Workshop on Data Management for End-to-End Machine Learning
June 2024
89 pages
ISBN: 9798400706110
DOI: 10.1145/3650203
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. Deep Learning
  2. Provenance
  3. W3C PROV

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Funding Sources

  • Coordenação de Aperfeiçoamento de Pessoal de Nível Superior - Brasil (CAPES)
  • EPSRC
  • Conselho Nacional de Desenvolvimento Científico e Tecnológico (CNPq)

Conference

SIGMOD/PODS '24
Acceptance Rates

DEEM '24 Paper Acceptance Rate 12 of 17 submissions, 71%;
Overall Acceptance Rate 44 of 67 submissions, 66%

Article Metrics

  • Total Citations: 0
  • Total Downloads: 71
  • Downloads (Last 12 months): 71
  • Downloads (Last 6 weeks): 6
Reflects downloads up to 23 Nov 2024
