
DOI: 10.1145/3533028.3533305
Research Article · Public Access

How I stopped worrying about training data bugs and started complaining

Published: 12 June 2022

Abstract

There is increasing awareness of the gap between machine learning research and production. The research community has largely focused on building models that perform well on a validation set, but a production environment must also ensure that the model performs well in its downstream application. The latter is more challenging because the test/inference-time data seen by the application can differ substantially from the training data. To address this challenge, we advocate "complaint-driven" data debugging, which lets the user complain about unexpected model behaviors in the downstream application and proposes interventions on the training data errors that likely led to those complaints. This debugging paradigm helps solve a range of training data quality problems, such as labeling errors, fairness issues, and data drift. We present our long-term vision, highlight the milestones achieved so far, and outline a research roadmap with a number of open problems.
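To make the paradigm concrete, below is a minimal, hypothetical sketch of complaint-driven debugging in Python. It scores each training point by brute-force leave-one-out retraining: how much does deleting that point reduce the loss on the one test prediction the user complained about? The synthetic data, the injected label bugs, and helper names such as complaint_loss are illustrative assumptions for this sketch, not the paper's actual system.

```python
# Hypothetical sketch: rank training points by how much their removal
# would fix a user complaint about one test prediction.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=220, n_features=5, random_state=0)
X_tr, y_tr = X[:200], y[:200].copy()
X_te, y_te = X[200:], y[200:]

# Inject label bugs into the training data (the "training data errors").
bad = rng.choice(200, size=15, replace=False)
y_tr[bad] = 1 - y_tr[bad]

model = LogisticRegression().fit(X_tr, y_tr)

# The complaint: the user flags the test point the model serves worst
# (lowest predicted probability for its true label).
probs = model.predict_proba(X_te)[np.arange(len(y_te)), y_te]
c = int(np.argmin(probs))

def complaint_loss(m):
    # Log-loss of model m on the complained-about test point.
    p = m.predict_proba(X_te[c:c + 1])[0, y_te[c]]
    return -np.log(p + 1e-12)

base = complaint_loss(model)

# Leave-one-out: retrain without each training point and measure how
# much the complaint's loss improves.
idx = np.arange(len(X_tr))
scores = np.empty(len(X_tr))
for i in idx:
    m_i = LogisticRegression().fit(X_tr[idx != i], y_tr[idx != i])
    scores[i] = base - complaint_loss(m_i)  # > 0 means removing i helps

suspects = np.argsort(scores)[::-1][:15]  # proposed interventions
print("proposed deletions (training rows):", sorted(suspects.tolist()))
print("injected bugs among them:", len(set(suspects.tolist()) & set(bad.tolist())))
```

The O(n) retraining loop is for exposition only; a practical system would approximate these deletion effects (for example, with influence functions) so that interventions on the training data can be proposed at interactive speeds.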


Published In

DEEM '22: Proceedings of the Sixth Workshop on Data Management for End-To-End Machine Learning
June 2022, 63 pages
ISBN: 9781450393751
DOI: 10.1145/3533028


Publisher

Association for Computing Machinery, New York, NY, United States



Conference

SIGMOD/PODS '22

Acceptance Rates

DEEM '22 paper acceptance rate: 9 of 13 submissions (69%)
Overall acceptance rate: 44 of 67 submissions (66%)
