research-article

Capturing and querying fine-grained provenance of preprocessing pipelines in data science

Authors:

Adriane Chapman,

Giulia Simonelli,

Riccardo Torlone

Proceedings of the VLDB Endowment, Volume 14, Issue 4

Pages 507 - 520

https://doi.org/10.14778/3436905.3436911

Published: 01 December 2020

Abstract

Data processing pipelines that are designed to clean, transform and alter data in preparation for learning predictive models have an impact on those models' accuracy and performance, as well as on other properties, such as model fairness. It is therefore important to provide developers with the means to gain an in-depth understanding of how the pipeline steps affect the data, from the raw input to training sets ready to be used for learning. While other efforts track creation and changes of pipelines of relational operators, in this work we analyze the typical operations of data preparation within a machine learning process, and provide infrastructure for generating very granular provenance records from it, at the level of individual elements within a dataset. Our contributions include: (i) the formal definition of a core set of preprocessing operators, and the definition of provenance patterns for each of them, and (ii) a prototype implementation of an application-level provenance capture library that works alongside Python. We report on provenance processing and storage overhead and scalability experiments, carried out over both real ML benchmark pipelines and TPC-DI, and show how the resulting provenance can be used to answer a suite of provenance benchmark queries that underpin some of the developers' debugging questions, as expressed on the Data Science Stack Exchange.
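
To make the idea of element-level provenance concrete, the following is a minimal sketch and not the authors' library or API. It assumes a dataset held as a list of dictionaries, and every name in it (`ProvenanceLog`, `cell_id`, `impute_mean`, the `v{version}/r{row}/{column}` identifier scheme) is hypothetical. It shows one preprocessing operator, mean imputation, recording which input cells each imputed output cell was derived from.

```python
from dataclasses import dataclass, field


@dataclass
class ProvenanceLog:
    """Hypothetical store of element-level derivation records."""
    records: list = field(default_factory=list)

    def derived(self, entity, source, activity):
        # One record per (output element, input element) pair.
        self.records.append(
            {"entity": entity, "derived_from": source, "activity": activity}
        )

    def sources_of(self, entity):
        # Query: which input elements contributed to this output element?
        return {r["derived_from"] for r in self.records if r["entity"] == entity}


def cell_id(version, row, column):
    # Illustrative naming scheme for one cell of one dataset version.
    return f"v{version}/r{row}/{column}"


def impute_mean(rows, column, version, log):
    """Fill missing values (None) in `column` with the mean of the present
    ones, logging which input cells each imputed cell was derived from."""
    present = [(i, r[column]) for i, r in enumerate(rows) if r[column] is not None]
    if not present:
        raise ValueError(f"no values available to impute column {column!r}")
    mean = sum(value for _, value in present) / len(present)

    output = []
    for i, row in enumerate(rows):
        new_row = dict(row)  # leave the input dataset untouched
        if row[column] is None:
            new_row[column] = mean
            # The imputed cell depends on every cell that fed the mean.
            for j, _ in present:
                log.derived(
                    cell_id(version + 1, i, column),
                    cell_id(version, j, column),
                    "impute_mean",
                )
        output.append(new_row)
    return output


log = ProvenanceLog()
raw = [{"age": 30}, {"age": None}, {"age": 50}]
clean = impute_mean(raw, "age", version=0, log=log)
print(clean[1]["age"])
print(sorted(log.sources_of("v1/r1/age")))
```

The sketch logs only the cells an operator changes, which keeps the record count small. A fuller capture scheme would presumably also need to relate unchanged cells across dataset versions so that lineage can be traced through a whole pipeline.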

Published In

Proceedings of the VLDB Endowment Volume 14, Issue 4

December 2020

263 pages

ISSN:2150-8097

Editors: Xin Luna Dong (Amazon), Felix Naumann (HPI, University of Potsdam)

Publisher

VLDB Endowment
