
DOI: 10.1145/3533028.3533305
Research Article · Public Access

How I stopped worrying about training data bugs and started complaining

Published: 12 June 2022

Abstract

There is increasing awareness of the gap between machine learning research and production. The research community has largely focused on building models that perform well on a validation set, but a production environment must also ensure that the model performs well in its downstream application. The latter is more challenging because the test/inference-time data seen by the application can differ substantially from the training data. To address this challenge, we advocate "complaint-driven" data debugging, which lets the user complain about unexpected model behaviors in the downstream application and proposes interventions on the training data errors that likely led to those complaints. This debugging paradigm helps solve a range of training data quality problems, such as labeling errors, fairness issues, and data drift. We present our long-term vision, highlight the milestones achieved so far, and outline a research roadmap with a number of open problems.
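To make the paradigm concrete, below is a minimal, hypothetical sketch of complaint-driven debugging in Python. It scores each training point by brute-force leave-one-out retraining: how much does deleting that point reduce the loss on the one test prediction the user complained about? The synthetic data, the injected label bugs, and helper names such as complaint_loss are illustrative assumptions for this sketch, not the paper's actual system.

```python
# Hypothetical sketch: rank training points by how much their removal
# would fix a user complaint about one test prediction.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=220, n_features=5, random_state=0)
X_tr, y_tr = X[:200], y[:200].copy()
X_te, y_te = X[200:], y[200:]

# Inject label bugs into the training data (the "training data errors").
bad = rng.choice(200, size=15, replace=False)
y_tr[bad] = 1 - y_tr[bad]

model = LogisticRegression().fit(X_tr, y_tr)

# The complaint: the user flags the test point the model serves worst
# (lowest predicted probability for its true label).
probs = model.predict_proba(X_te)[np.arange(len(y_te)), y_te]
c = int(np.argmin(probs))

def complaint_loss(m):
    # Log-loss of model m on the complained-about test point.
    p = m.predict_proba(X_te[c:c + 1])[0, y_te[c]]
    return -np.log(p + 1e-12)

base = complaint_loss(model)

# Leave-one-out: retrain without each training point and measure how
# much the complaint's loss improves.
idx = np.arange(len(X_tr))
scores = np.empty(len(X_tr))
for i in idx:
    m_i = LogisticRegression().fit(X_tr[idx != i], y_tr[idx != i])
    scores[i] = base - complaint_loss(m_i)  # > 0 means removing i helps

suspects = np.argsort(scores)[::-1][:15]  # proposed interventions
print("proposed deletions (training rows):", sorted(suspects.tolist()))
print("injected bugs among them:", len(set(suspects.tolist()) & set(bad.tolist())))
```

The O(n) retraining loop is for exposition only; a practical system would approximate these deletion effects (for example, with influence functions) so that interventions on the training data can be proposed at interactive speeds.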


Published In

DEEM '22: Proceedings of the Sixth Workshop on Data Management for End-To-End Machine Learning
June 2022, 63 pages
ISBN: 9781450393751
DOI: 10.1145/3533028


Publisher

Association for Computing Machinery, New York, NY, United States



Conference

SIGMOD/PODS '22

Acceptance Rates

DEEM '22 paper acceptance rate: 9 of 13 submissions (69%)
Overall acceptance rate: 44 of 67 submissions (66%)
