tutorial

Overview and Importance of Data Quality for Machine Learning Tasks

Authors:

Lokesh Nagalapatti,

Shanmukha Guttula,

Shashank Mujumdar,

Ruhi Sharma Mittal,

Vitobha Munigala

KDD '20: Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining

Pages 3561 - 3562

https://doi.org/10.1145/3394486.3406477

Published: 20 August 2020

Abstract

It is well understood from the literature that the performance of a machine learning (ML) model is upper bounded by the quality of the data. While researchers and practitioners have focused on improving the quality of models (through, for example, neural architecture search and automated feature selection), there have been limited efforts towards improving data quality. A crucial requirement before consuming a dataset for any application is to understand the dataset at hand; failure to do so can result in inaccurate analytics and unreliable decisions. Assessing the quality of the data across intelligently designed metrics, and developing corresponding transformation operations to address the quality gaps, helps reduce the effort a data scientist spends on iterative debugging of the ML pipeline to improve model performance. This tutorial highlights the importance of analysing data quality in terms of its value for machine learning applications. It surveys all the important data-quality-related approaches discussed in the literature, focusing on the intuition behind them, highlighting their strengths and similarities, and illustrating their applicability to real-world problems. Finally, we will discuss the interesting work IBM Research is doing in this space.

Index Terms

Overview and Importance of Data Quality for Machine Learning Tasks
1. Computing methodologies
  1. Machine learning

Published In

KDD '20: Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining

August 2020

3664 pages

ISBN: 9781450379984

DOI: 10.1145/3394486

General Chairs: Rajesh Gupta (UC San Diego, USA), Yan Liu (USC, USA)

Program Chairs: Mohak Shah (LG Electronics, USA), Suju Rajan (LinkedIn, USA)

Publications Chairs: Jiliang Tang (Michigan State, USA), B. Aditya Prakash (Georgia Tech, USA)

Copyright © 2020 Owner/Author.

Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the Owner/Author.

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Tutorial

Conference

KDD '20

KDD '20: The 26th ACM SIGKDD Conference on Knowledge Discovery and Data Mining

July 6 - 10, 2020

CA, Virtual Event, USA

Acceptance Rates

Overall Acceptance Rate 1,133 of 8,635 submissions, 13%
