DOI: 10.1145/3394486.3406477
tutorial

Overview and Importance of Data Quality for Machine Learning Tasks

Published: 20 August 2020

Abstract

The literature has established that the performance of a machine learning (ML) model is upper bounded by the quality of its data. While researchers and practitioners have focused on improving models (for example, through neural architecture search and automated feature selection), comparatively little effort has gone into improving data quality. A crucial requirement before consuming a dataset for any application is to understand the dataset at hand; failure to do so can result in inaccurate analytics and unreliable decisions. Assessing data quality across intelligently designed metrics, and developing corresponding transformation operations to address the quality gaps, reduces the effort a data scientist spends on iteratively debugging the ML pipeline to improve model performance. This tutorial highlights the importance of analysing data quality in terms of its value for machine learning applications. It surveys the important data quality approaches discussed in the literature, focusing on the intuition behind them, highlighting their strengths and similarities, and illustrating their applicability to real-world problems. Finally, we discuss the work IBM Research is doing in this space.
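
To make the idea of quality metrics paired with transformation operations concrete, the following is a minimal sketch in plain Python. It is not taken from the tutorial: the three metrics (missing-value ratio, exact-duplicate ratio, class imbalance), the function names `quality_report` and `drop_exact_duplicates`, and the toy rows are all assumptions chosen for illustration, and the sketch assumes every cell value is hashable.

```python
from collections import Counter


def quality_report(rows, label_key):
    """Return three illustrative quality metrics for a list of dict rows.

    Hypothetical helper, not an API from the tutorial. Assumes string keys
    and hashable cell values (str, int, float, None, ...).
    """
    if not rows:
        raise ValueError("rows must be non-empty")
    n = len(rows)
    columns = sorted({key for row in rows for key in row})

    # Completeness: fraction of rows where a column is absent or None.
    missing_ratio = {
        col: sum(1 for row in rows if row.get(col) is None) / n
        for col in columns
    }

    # Uniqueness: fraction of rows that exactly repeat an earlier row.
    seen = set()
    duplicates = 0
    for row in rows:
        fingerprint = tuple(row.get(col) for col in columns)
        if fingerprint in seen:
            duplicates += 1
        seen.add(fingerprint)

    # Class balance: count of the most frequent label divided by the count
    # of the least frequent one (1.0 means perfectly balanced).
    counts = Counter(
        row[label_key] for row in rows if row.get(label_key) is not None
    )
    class_imbalance = (
        max(counts.values()) / min(counts.values()) if counts else None
    )

    return {
        "missing_ratio": missing_ratio,
        "duplicate_ratio": duplicates / n,
        "class_imbalance": class_imbalance,
    }


def drop_exact_duplicates(rows):
    """Transformation for the uniqueness gap: keep the first copy of each row."""
    seen = set()
    kept = []
    for row in rows:
        fingerprint = tuple(sorted(row.items()))
        if fingerprint not in seen:
            seen.add(fingerprint)
            kept.append(row)
    return kept


# Toy data, invented for this example only.
example_rows = [
    {"age": 34, "city": "Pune", "label": "yes"},
    {"age": None, "city": "Delhi", "label": "no"},
    {"age": 34, "city": "Pune", "label": "yes"},
    {"age": 51, "city": None, "label": "yes"},
]

report = quality_report(example_rows, "label")
print(report)
print(len(drop_exact_duplicates(example_rows)), "rows after de-duplication")
```

Each metric here points at one quality gap, and a transformation such as `drop_exact_duplicates` addresses one of them; a real assessment would likely use richer, task-specific metrics than these.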

    Published In

    KDD '20: Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining
    August 2020
    3664 pages
    ISBN: 9781450379984
    DOI: 10.1145/3394486
    Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the Owner/Author.

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. data quality
    2. machine learning
    3. quality metrics

    Qualifiers

    • Tutorial

    Conference

    KDD '20

    Acceptance Rates

    Overall Acceptance Rate 1,133 of 8,635 submissions, 13%
