Improving reproducibility of data science pipelines through transparent provenance capture

Published: 01 August 2020

Abstract

Data science has become prevalent in a wide variety of domains. Inherent in its practice is an exploratory, probing, and fact-finding journey, which consists of the assembly, adaptation, and execution of complex data science pipelines. The trustworthiness of the results of such pipelines rests entirely on their ability to be reproduced with fidelity, which is difficult if pipelines are not documented or recorded minutely and consistently. This difficulty has led to a reproducibility crisis and presents a major obstacle to the safe adoption of pipeline results in production environments. The crisis can be resolved if the provenance for each data science pipeline is captured transparently as pipelines are executed. However, due to the complexity of modern data science pipelines, transparently capturing sufficient provenance to allow for reproducibility is challenging. As a result, most existing systems require users to augment their code or use specific tools to capture provenance, which hinders productivity and results in a lack of adoption.
In this paper, we present Ursprung, a transparent provenance collection system designed for data science environments. The Ursprung philosophy is to capture provenance and build lineage by integrating with the execution environment to automatically track static and runtime configuration parameters of data science pipelines. Rather than requiring data scientists to make changes to their code, Ursprung records basic provenance information from system-level sources and combines it with provenance from application-level sources (e.g., log files, stdout), which can be accessed and recorded through a domain-specific language. In our evaluation, we show that Ursprung is able to capture sufficient provenance for a variety of use cases and only adds an overhead of up to 4%.
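
The general idea of combining system-level and application-level provenance can be illustrated with a minimal Python sketch. This is not Ursprung's implementation or its domain-specific language. `ProcessEvent`, `Rule`, and `build_provenance` are hypothetical names, the event is hand-written rather than read from a real audit source, and the regex rules merely stand in for whatever extraction rules a real system would provide.

```python
import re
from dataclasses import dataclass


@dataclass
class ProcessEvent:
    """Hypothetical system-level record (e.g. what an audit source might report)."""
    pid: int
    command: str
    files_read: list
    files_written: list


@dataclass
class Rule:
    """Hypothetical extraction rule: a regex with named groups applied to log lines."""
    pattern: str

    def extract(self, line):
        match = re.search(self.pattern, line)
        return match.groupdict() if match else {}


def build_provenance(event, log_lines, rules):
    """Merge one system-level event with parameters found in application output."""
    record = {
        "pid": event.pid,
        "command": event.command,
        "inputs": list(event.files_read),
        "outputs": list(event.files_written),
        "parameters": {},
    }
    for line in log_lines:
        for rule in rules:
            # Later matches overwrite earlier ones for the same parameter name.
            record["parameters"].update(rule.extract(line))
    return record


# Illustrative data only; none of these values come from the paper.
event = ProcessEvent(4242, "python train.py", ["data/train.csv"], ["models/model.pkl"])
rules = [
    Rule(r"learning_rate=(?P<learning_rate>[0-9.eE+-]+)"),
    Rule(r"epochs=(?P<epochs>\d+)"),
]
log = ["INFO starting run", "INFO learning_rate=0.01 epochs=20"]
record = build_provenance(event, log, rules)
print(record)
```

The pipeline script itself is untouched in this sketch: the file inputs and outputs come from the (assumed) system-level event, while the runtime parameters are recovered from the program's existing log output.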




Published In

Proceedings of the VLDB Endowment, Volume 13, Issue 12
August 2020
1710 pages
ISSN: 2150-8097

Publisher

VLDB Endowment


