DLProv: A Data-Centric Support for Deep Learning Workflow Analyses

Authors: Adriane Chapman, Liliane Kunstmann, Daniel de Oliveira, Marta Mattoso

DEEM '24: Proceedings of the Eighth Workshop on Data Management for End-to-End Machine Learning

Pages 77 - 85

https://doi.org/10.1145/3650203.3663337

Published: 09 June 2024

Abstract

A Deep Learning (DL) workflow involves several data transformation steps. Selecting a DL model requires evaluating various configurations at each step of the workflow, which can be a complex task. This decision-making process relies on metrics and on continuous monitoring of the workflow's progress. Given the many frameworks that manage the execution of DL workflows and algorithms, metrics such as accuracy and loss, together with hyperparameters, are no longer sufficient for choosing which models to deploy. The need to enrich the data used to analyze artifacts generated during workflow execution has led existing solutions to offer new environments for monitoring and analyzing data transformations. Nevertheless, these solutions commonly represent monitoring and analysis data with limited relationships between artifacts and workflow steps. This makes it difficult to identify and associate the input data used in model training, and therefore to trace the data derivation path. Furthermore, these solutions often adopt ad-hoc data representations. Our goal in this paper is to provide a service, named DLProv, that complies with the W3C PROV recommendation and supports decision-making queries during training while representing the relationships needed to generate provenance traces. DLProv is independent of Machine Learning (ML) frameworks. We evaluated the DLProv service through a case study involving the widely recognized AlexNet convolutional neural network architecture. Additionally, we analyzed the query support of MLflow, MLflow2PROV, and DLProv using typical DL provenance queries.
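To make the idea of a data derivation path concrete, the following is a minimal sketch in Python that records PROV-style `used` and `wasGeneratedBy` relationships between workflow steps and artifacts, then walks them backwards from a trained model to its inputs. It is not the DLProv API: the function names (`record`, `derivation_path`) and all step and artifact identifiers are hypothetical, chosen only for illustration.

```python
from collections import defaultdict

# PROV-style relations kept in plain dictionaries.
# Every identifier below is hypothetical and not taken from DLProv.
used = defaultdict(set)        # activity -> entities it consumed
was_generated_by = {}          # entity -> activity that produced it


def record(activity, inputs, outputs):
    """Register one workflow step with the entities it used and generated."""
    used[activity].update(inputs)
    for entity in outputs:
        was_generated_by[entity] = activity


def derivation_path(entity):
    """Return every upstream entity that the given entity was derived from."""
    upstream, stack = set(), [entity]
    while stack:
        current = stack.pop()
        activity = was_generated_by.get(current)
        if activity is None:   # raw input: no recorded step generated it
            continue
        for source in used[activity]:
            if source not in upstream:
                upstream.add(source)
                stack.append(source)
    return upstream


# A toy three-step workflow: preprocessing, training, evaluation.
record("preprocess", {"raw_images"}, {"train_set", "test_set"})
record("train", {"train_set", "hyperparameters"}, {"model"})
record("evaluate", {"model", "test_set"}, {"accuracy"})

print(sorted(derivation_path("model")))
```

Because each artifact is linked to the step that generated it and each step to the artifacts it used, a query such as "which input data was this model trained on?" reduces to a graph traversal. When those relationships are missing, as the abstract notes for some existing solutions, the same question has to be answered by hand.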

Published In

DEEM '24: Proceedings of the Eighth Workshop on Data Management for End-to-End Machine Learning

June 2024

89 pages

ISBN:9798400706110

DOI:10.1145/3650203

Copyright © 2024 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Sponsors

SIGMOD: ACM Special Interest Group on Management of Data

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article
Research
Refereed limited

Funding Sources

Coordenação de Aperfeiçoamento de Pessoal de Nível Superior - Brasil (CAPES)
EPSRC
Conselho Nacional de Desenvolvimento Científico e Tecnológico (CNPq)

Conference

SIGMOD/PODS '24

Sponsor:

SIGMOD

SIGMOD/PODS '24: International Conference on Management of Data

June 9, 2024

AA, Santiago, Chile

Acceptance Rates

DEEM '24 Paper Acceptance Rate 12 of 17 submissions, 71%;

Overall Acceptance Rate 44 of 67 submissions, 66%
