research-article

Improving reproducibility of data science pipelines through transparent provenance capture

Authors: Lukas Rupprecht, James C. Davis, Constantine Arnold, Deepavali Bhagwat

Proceedings of the VLDB Endowment, Volume 13, Issue 12

Pages 3354 - 3368

https://doi.org/10.14778/3415478.3415556

Published: 01 August 2020

Abstract

Data science has become prevalent in a large variety of domains. Inherent in its practice is an exploratory, probing, and fact-finding journey, which consists of the assembly, adaptation, and execution of complex data science pipelines. The trustworthiness of the results of such pipelines rests entirely on their ability to be reproduced with fidelity, which is difficult if pipelines are not documented or recorded minutely and consistently. This difficulty has led to a reproducibility crisis and presents a major obstacle to the safe adoption of pipeline results in production environments. The crisis can be resolved if the provenance for each data science pipeline is captured transparently as pipelines are executed. However, due to the complexity of modern data science pipelines, transparently capturing sufficient provenance to allow for reproducibility is challenging. As a result, most existing systems require users to augment their code or use specific tools to capture provenance, which hinders productivity and results in a lack of adoption.

In this paper, we present Ursprung, a transparent provenance collection system designed for data science environments. The Ursprung philosophy is to capture provenance and build lineage by integrating with the execution environment to automatically track static and runtime configuration parameters of data science pipelines. Rather than requiring data scientists to make changes to their code, Ursprung records basic provenance information from system-level sources and combines it with provenance from application-level sources (e.g., log files, stdout), which can be accessed and recorded through a domain-specific language. In our evaluation, we show that Ursprung is able to capture sufficient provenance for a variety of use cases and only adds an overhead of up to 4%.
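The idea of combining system-level and application-level provenance can be illustrated with a minimal Python sketch. It is not Ursprung's implementation or its domain-specific language, neither of which is shown in this abstract. The `ProcessEvent` record, the regular-expression rules, the log format, and the function names are all hypothetical and chosen only for illustration. The sketch assumes that a system-level source has already reported which files a process read and wrote, and that runtime parameters appear in the process's log output.

```python
import re
from dataclasses import dataclass, field


@dataclass
class ProcessEvent:
    """Hypothetical system-level record, e.g. derived from an audit trail."""
    pid: int
    command: str
    files_read: list = field(default_factory=list)
    files_written: list = field(default_factory=list)


def extract_app_provenance(log_lines, rules):
    """Apply each regex rule to each log line and collect its named groups."""
    records = []
    for line in log_lines:
        for rule in rules:
            match = rule.search(line)
            if match:
                records.append(match.groupdict())
    return records


def build_lineage(event, app_records):
    """Merge system-level file accesses with parameters found in the logs."""
    parameters = {}
    for record in app_records:
        parameters.update(record)
    return {
        "pid": event.pid,
        "command": event.command,
        "inputs": list(event.files_read),
        "outputs": list(event.files_written),
        "parameters": parameters,
    }


# Hypothetical inputs: one process event and the log lines it produced.
event = ProcessEvent(
    pid=4242,
    command="python train.py",
    files_read=["data/train.csv"],
    files_written=["models/model.pkl"],
)
rules = [
    re.compile(r"learning_rate=(?P<learning_rate>[\d.]+)"),
    re.compile(r"epochs=(?P<epochs>\d+)"),
]
log_lines = [
    "INFO starting run",
    "INFO learning_rate=0.01 epochs=20",
    "INFO done",
]

lineage = build_lineage(event, extract_app_provenance(log_lines, rules))
print(lineage)
```

In this sketch the training script itself is untouched: the file accesses come from the (assumed) system-level record and the hyperparameters come from log output, which mirrors the transparency goal described above.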



Published In

Proceedings of the VLDB Endowment, Volume 13, Issue 12

August 2020

1710 pages

ISSN: 2150-8097

Editors: Magdalena Balazinska (University of Washington), Xiaofang Zhou (University of Queensland, Australia)

Publisher

VLDB Endowment
