Advances in Exploratory Data Analysis, Visualisation and Quality for Data Centric AI Systems

Authors:

Shanmukha Guttula,

Ruhi Sharma Mittal,

Naresh Manwani,

Laure Berti-Equille,

Abhijit Manatkar

KDD '22: Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining

Pages 4814 - 4815

https://doi.org/10.1145/3534678.3542604

Published: 14 August 2022

Abstract

It is widely accepted that data preparation is one of the most time-consuming steps of the machine learning (ML) lifecycle. It is also one of the most important steps, as the quality of data directly influences the quality of a model. In this tutorial, we will discuss the importance and role of exploratory data analysis (EDA) and data visualisation techniques in finding data quality issues and preparing data for ML pipelines. We will also discuss the latest advances in these fields and highlight areas that need innovation. To make the tutorial actionable for practitioners, we will also cover the most popular open-source packages for getting started, along with their strengths and weaknesses. Finally, we will discuss the challenges posed by industry workloads and the gaps that must be addressed to make data-centric AI practical in industry settings.

Published In

KDD '22: Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining

August 2022

5033 pages

ISBN: 9781450393850

DOI: 10.1145/3534678

General Chairs:
Aidong Zhang (University of Virginia),
Huzefa Rangwala (Amazon/George Mason University)

Copyright © 2022 Owner/Author.

Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the Owner/Author.

Publisher

Association for Computing Machinery

New York, NY, United States

Conference

KDD '22: The 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining

August 14 - 18, 2022

Washington DC, USA

Acceptance Rates

Overall Acceptance Rate 1,003 of 6,772 submissions, 15%
