
A Data-centric AI Framework for Automating Exploratory Data Analysis and Data Quality Tasks

Published: 01 November 2023

Abstract

Democratisation of machine learning (ML) has been an important theme in the research community for the last several years, with notable progress made by the model-building community on automated machine learning models. However, data play a central role in building ML models, and there is a need to focus on data-centric AI innovations. In this article, we first map the steps taken by data scientists in the data preparation phase and identify open areas and pain points via user interviews. We then propose a framework and four novel algorithms for the exploratory data analysis and data quality for AI steps, addressing the pain points from the user interviews. We also validate our algorithms on open source datasets and show the effectiveness of our proposed methods. Next, we build a tool that automatically generates python code encompassing the above algorithms and study the usefulness of these algorithms via two user studies with data scientists. We observe from the first study that the participants who used the tool gained 2× productivity and a 6% model improvement over the control group. The second study is performed in a more realistic environment to understand how the tool would be used in real-world scenarios. Its results are coherent with the first study and show average time savings of 30–50% that can be attributed to the tool.

1 Introduction

Data science is a multidisciplinary field focused on deriving insights from data. Data scientists today have a wide variety of skills, including artificial intelligence, machine learning, data visualisation, and computer science techniques, including cloud computing. As more and more companies embrace data science techniques, there is a dearth of data scientists, and companies face difficulty recruiting them. This has led to a new trend of upskilling professionals from business intelligence and software engineering into the data science stream. With this trend, there is a large cohort of data scientists within the “early career bracket” for data science. To address this market need, there has been a lot of research on democratising data science, with notable progress on automated model-building technologies or AutoML techniques. There are several commercial and open source tools to automate the model-building work. Examples include Google’s AutoML [7], H2O [12], DataRobot [5], IBM AutoAI [10] and open source libraries such as Auto-sklearn [6] and TPOT [11]. However, automation in the model-building steps alone is not sufficient, as the quality of a model is directly dependent on the quality of the data. Preparation of high-quality datasets has been called out as one of the most time-consuming steps of the machine learning (ML) lifecycle [8, 9]. Hence, there is a need to focus on data-centric AI to bring in further automation.
Since the mid-2010s, a few libraries have been built to measure the quality of data, such as Deequ [58], DQLearn [60], Pandas Profiling [20], Google’s TensorFlow Validation [18], TDDA [3], and Great Expectations [2]. These address a basic subset of challenges, such as missing values, ranges, outliers, and simple statistical checks, and do not focus on metrics that can affect an ML model’s performance directly. Amazon Deequ [58] allows users to express common quality constraints with custom validation code and thereby enables unit tests for data. Pandas Profiling [20] provides basic data analysis and visual functionalities. Similarly, DQLearn [60] focuses on basic data quality metrics for structured and time-series data, such as missing values, duplicates, data profiling, and so on. Great Expectations [2] provides support for SQL and Parquet formats along with pandas, whereas Google’s TensorFlow Validation [18] automatically detects data errors using data visualisation and schema generation techniques. As we can see, the current data quality toolkits mostly focus on basic analysis and are very limited in their functionalities for both exploratory data analysis (EDA) and data quality (DQ) for AI analysis.
To further understand the gaps in the current tools and the pain points faced by data scientists, we conducted user interviews to understand the as-is practices of data scientists and identify opportunities for innovation. We recruited 21 data scientists from different multi-national companies: six with 0–3 years of experience, eight with 4–7 years, and seven with more than 7 years. We interviewed each data scientist for an hour. We made a template of questions, asked of every data scientist, that fell into three buckets: understanding the current methods and mechanisms used for data preparation, understanding the open challenges faced by data scientists, and identifying time-consuming or manual activities. We recorded the responses of all the data scientists and grouped the observations into the insights below. We share the main insights as follows:
(1) Insight 1: Most data scientists performed the following steps as part of their EDA and DQ analysis using open source libraries: statistical tests, bivariate and multivariate plots, detection of missing values, detection of NA values, outlier detection, class imbalance analysis, data labeling, and so on.
(2) Insight 2: Most data scientists have their own ways of performing EDA and DQ analysis, using rules of thumb and experience; variation was also observed between how early career data scientists and experienced data scientists approached this process. Most data scientists acknowledged that beyond light processes at a team level, this was an ad hoc process that lacked standardisation.
(3) Insight 3: Often data scientists want to understand details about the dataset when they load the data. Common ways of doing this are creating a subset by printing the top \(N\) or bottom \(N\) samples or randomly sampling the data. One challenge with these methods is that they do not guarantee showing all the variations in the data.
(4) Insight 4: Data scientists spend time plotting and visualising the different columns of the dataset as part of understanding or exploring the data. This can become cumbersome and time-consuming for very wide datasets (datasets with a large number of columns).
(5) Insight 5: Data scientists shared that a lot of time was spent coding and bug fixing code written mainly to explore and understand the data. The time spent on coding varied with the experience of the data scientist. On the question of code reuse, data scientists shared that they were able to reuse only \(30\%\) of their code across projects. All data scientists agreed that much of the code written for the data preprocessing phase was for getting insights from the data and was not used in the final production code when the model is deployed.
(6) Insight 6: In enterprise settings, dataset sizes run into gigabytes. Many open source packages and libraries are unable to handle such large datasets, and hence most of the time a data scientist is forced to work on a sample of the data or write extra code to run analyses on the full dataset.
(7) Insight 7: All data scientists agreed that data preparation is one of the most time-consuming activities. In terms of the percentage of time spent by a data scientist, the most common answer was greater than \(50\%\), and it could go up to \(60\%\) or \(70\%\), depending on the complexity of the project. The complexity increased when the user was working with multiple tables of a database instead of a single table.
The first two insights give us an understanding of the current methods employed by data scientists for data preparation. Insights 3 and 4 relate to exploratory data analysis and the challenges that a user faces today in understanding the data. Specifically for insight 3, a user wants a quick view of the data that brings out all the unique variations in it, without having to go through the full dataset. We address this pain point by developing an intelligent sampling algorithm that brings out all the unique variations in every column of the dataset while picking the minimum number of rows from the data. We describe the problem statement and our algorithm formally in Section 4. Insight 4 is also related to understanding the data. Each column or feature in the dataset may have a different distribution and may relate to other columns or features. To understand these characteristics, a user has to go through the different columns and use visualisation and broad statistics to find which columns are interesting for the ML task at hand. In our user interviews, we also asked about typical dataset sizes in enterprise settings. Most users agreed that the number of columns in a dataset is often greater than 100 and can go up to 500 or 1,000. There are instances where the number of columns exceeds 1,000, but those are less frequent. It is easy to see that finding information about relevant columns via manual plotting and observation is not scalable. We address this pain point by devising a novel mechanism to rank the columns of a dataset and show the most interesting columns to the user, along with their plots, important details about each column, and its relationships to other columns in the dataset. For every column, we also explain why the column is interesting by tagging the properties of that column. We formally explain this problem and our solution in Section 5. We also devise two novel algorithms to address the problems of class overlap and label noise and find these issues in the dataset. These are in addition to the issues that a user finds today via open source libraries, as noted in Insight 1. We describe the problem statements and our algorithms in Sections 6 and 7. We also address the challenge of time spent coding and the lack of standardisation of data preparation steps by building an automated tool to auto-generate python code that data scientists can use for the analysis. We discuss this in Section 3. We evaluate our individual algorithms using meaningful baselines and discuss the validation results in the same section as the algorithm details. We also present a more comprehensive evaluation of all our algorithms by evaluating the tool in a user study with data scientists employed at multi-national companies. From the insights of the user study, we note that the two other pressing challenges that data scientists face are the need for algorithms scalable to large datasets and the ability to perform data preparation when the input is not a single table but multiple tables from a database. These topics need deeper investigation and will be addressed in future work. Our main contributions in this article are as follows:
Conduct user interviews to get insights on as-is data preparation steps and pain points (Section 1).
Design a tool that provides a framework to add various functionalities based on the insights from the user study (Section 3).
Propose a novel algorithm to find a data subset that captures all column variations of the data (Section 4).
Propose a novel algorithm to rank columns of the dataset to aid for easy exploration and understanding of the data (Section 5).
Propose a novel algorithm to find overlapping regions in the dataset and validate the algorithm using several open source datasets (Section 6).
Propose a novel algorithm to find label noise in the dataset and validate the algorithm using several open source datasets (Section 7).
Conduct a user study to understand the usefulness of the above algorithms and how they help a data scientist in terms of time, productivity, or model improvement (Section 8).

2 Related Work

In this section, we review the related work in this area. Data preparation is a long-studied area with a focus on the quality of data to be stored in databases. Our work differs in its focus on data preparation for AI pipelines, and hence we review only the work related to AI pipelines.

2.1 Exploratory Data Analysis

The first step is to understand the data, and visualisation tools help by finding interesting properties that can guide curative measures. Reference [67] recommends data-driven visualisations using statistical and deviation-based metrics such as Earth Mover’s distance, Jenson-Shannon distance, and so on. Reference [61] uses the ZQL query language for interactive visual analytics. The QUDE system [17, 76] also takes a statistical approach and provides data exploration using multiple hypothesis testing. Reference [30] discovers and summarizes interesting slices of the entire data. One of the gaps that we heard in user interviews was around the ability to quickly understand the data (Insight 3). Data scientists often seek to understand the details of a dataset upon its initial loading. This is commonly achieved by creating subsets of the data, printing the top or bottom few samples, or randomly sampling the data. However, these methods do not guarantee the inclusion of all variations within the data. Various sampling techniques have been proposed in the literature, including random sampling [68], systematic sampling [62], stratified sampling [50], convenience sampling [59], snowball sampling [27], association-based sampling [52], and quota sampling [44]. While these methods may effectively replicate the distribution of the full dataset, they may not capture every single variation within each column of the data. It is important to capture all variations within a dataset, particularly in real-world applications where rare data patterns may be of significant value. Existing methods may capture the data points having some association with respect to the target column [59] or the most frequently occurring patterns but may miss rare patterns that could be of importance to the user. Moreover, these techniques focus on variations in column data values and not on variations in data value patterns, such as regex patterns, and therefore they are not able to rank rows based on the uniqueness and importance of row values in the data.

2.2 Data Quality for AI

A wide range of methods for statistical analysis [57], constraint mining and checking [22], entity matching [32, 46], and machine learning [34, 53, 71] are used nowadays for data quality checks, data cleaning, and data repairing [29], but still with a large focus on data quality for databases. An important step in data quality for AI is filling missing values in the data, called data imputation. Most research in the field of imputation focuses on numerical imputation. Some notable approaches include k-nearest neighbors [14], multivariate imputation by chained equations [38], matrix factorisation [33, 41, 66], and deep learning methods [15, 16, 26, 40, 75]. While some recent work addresses imputation for heterogeneous data types [28], heterogeneous in some works [47, 63, 64, 73] refers to binary, ordinal, or categorical variables, which can be easily transformed into numerical representations. Since this is a mature area, we do not make any algorithmic contributions here but design our framework in such a way that these methods can be easily plugged into it (Section 3). We focus on two important problems, label noise and class overlap, that we believe are more important as their presence has been shown to adversely affect ML models (please see the experiments in Sections 6 and 7).

2.2.1 Label Noise.

Most real-world datasets that are generated or annotated have some inconsistent/noisy labels. Training data with noisy or inconsistent annotations or labels [48] can have a significant impact on the data science pipeline, leading to decreased model accuracy, increased complexity, and a greater need for training samples. Many approaches [23, 25, 36, 37, 42, 45, 72] have been proposed to address this problem, but most of them focus on designing robust loss functions for classifiers that can handle noisy labels. They do not focus on detecting the noise present in an existing dataset. Reference [48] looks at the problem of label noise with respect to the detection task, but we found certain challenges in our experiments with its detector and propose a new label noise algorithm that outperforms Reference [48].

2.2.2 Class Overlap.

Overlap among classes can cause ML classifiers to suffer performance degradation in those regions by misclassifying points or being less confident in their predictions there. Several techniques have been proposed to detect class overlap in datasets. Reference [70] proposed an algorithm that uses Support Vector Data Description (SVDD) to approximate class boundaries and determine overlap by the number of instances in the common boundary. Since this approach relies on SVDD, it can find only spherical boundaries, which is not ideal for real datasets. Other methods, such as those discussed in Reference [54], separate overlapping classes into separate binary classes and use a one-versus-one decomposition strategy to learn a classifier. However, these approaches do not focus on detecting overlap in existing datasets. References [35, 69, 74] examine the problem of overlap in conjunction with class imbalance, but their focus is on solving the problem rather than detecting it. We propose a novel algorithm to find class overlap in Section 6 below.
Data preparation also includes transforming data before feeding it into the ML pipeline. Some typical transformations include normalisation, bucketisation, winsorising, one-hot encoding, feature crosses, and using a pre-trained model or mappings from values such as words to real numbers (embeddings) to extract features [43]. Again, these are mature, and we have plugged them into our framework from open source libraries or our own implementations. We also discussed various open source tools to check the quality of data in Section 1. These do not address the challenges that we summarised from our user interviews. Next, we discuss our framework, which overcomes these challenges and provides a mechanism to add algorithms for specific problems.

3 Framework Design

In this section, we discuss the design and implementation of our proposed framework and give a high-level overview of each of its blocks. Each individual block of the framework is described in the following sections. Our main motivation was to design a framework/tool that helps a data scientist accelerate their work while minimising time-consuming and mundane tasks. Specifically, our framework addresses insights 1–5 from Section 1. We design a tool that can automatically generate python code in Jupyter notebooks to perform advanced exploratory data analysis and data quality for AI assessment and remediation, and that also keeps track of all the data transformations done in the notebook. The history of all operations is presented via a data readiness report, which serves as a comprehensive record of all data properties, insights, and quality issues, including the lineage of all data operations, to give a detailed record of how the data have evolved. This can serve as accompanying documentation for governance and audit purposes. We build this based on our earlier work presented in Reference [13]. Our design is centered around the data scientist, to help reduce time to value and, more importantly, give them flexibility and control over the data preparation process. An automated notebook with code gives them the flexibility to add/edit/delete code and customise it for their business use case. We use the following guiding principles for the design of our tool:
Consistent input and output interfaces so that it is easy to learn and use the tool. All the algorithms output results in a JSON format that is kept consistent across all operators.
Ability for the data scientist to customise the data preparation process for their business use case by giving the capability to add, delete, and edit code as necessary. All the operations return results in a temporary pandas dataframe, which can be inspected by the data scientist, who can then update the main dataframe.
Encapsulate all the algorithms via function calls but keep enough options so that the user can vary the parameters as required.
Documentation supporting the code that tells the user what a functionality does and when it should be used. It also calls out the code dependencies, the parameters the user can change, and the effect of these parameters on the algorithm.
Record the history of data operations applied to the dataset in the notebook.
Consistent input–output interfaces help the user understand the code quickly. Another factor that we debated was how much code should be shown to the user so that it is neither overwhelming nor underwhelming. After consulting a few data scientists, we came up with a design that encapsulates the code via function calls, thereby keeping a balance between simplicity and transparency. We apply all the design points discussed above to our framework, which we describe next. Our framework consists of the following main sections, each containing multiple functionalities/algorithms. Beyond our contributions, we also add the standard operators from open source libraries that data scientists use regularly, for the completeness of the tool (see Insight 1 from Section 1).
Exploratory Data Analysis: This section has algorithms that help a data scientist quickly explore the data, understand the issues/patterns, and decide on what data transformations should be done. We design two new algorithms that are presented in the corresponding sections: Dataset Snapshot (Section 4) and Interesting Columns Detection (Section 5) based on our user interviews from Section 1.
Data Quality Assessment and Remediation for AI (DQAI): This section has algorithms that help identify quality issues in the data that will affect the building of an ML model. We describe two novel algorithms, Label Purity and Class Overlap, presented in the corresponding sections, that each identify a specific problem in the data that can hinder the performance of an ML model. For each quality assessment algorithm, we make recommendations for cleaning the data and add cleaning operators for the respective assessment operators.
Data Readiness Report: This section produces a summary of data insights and a history of operations performed on the data in the notebook.
All these sections are combined in one single Jupyter notebook. Our tool generates python code via a Jupyter notebook using nbformat library [4]. We choose Python and Jupyter notebook as a library and tool of choice, respectively, as it is popular among data scientists. It takes as inputs the path to the dataset and the label column and generates a notebook with python code that a data scientist can start working with.
Figure 1 shows a screenshot of the tool. A user needs to add the path to the dataset and the label column, as shown in Figure 1(a). Once a user executes these five lines of code, a new Jupyter notebook is generated that contains both the documentation and the Python code for the user to get started with, as shown in Figure 1(b). Figure 1(c) shows a code snippet for the class overlap algorithm that we discuss in Section 6. Each functionality returns a JSON with a standard structure. The tool is designed in a modular and extensible fashion so that adding new algorithms is very easy; it only takes a small effort to write a wrapper function so that a new algorithm adheres to the input–output interface of the tool. We describe the proposed algorithms in the following sections.
Fig. 1.
Fig. 1. Tool design and overview.
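The notebook generation itself relies on the nbformat library [4]. The following is a minimal sketch of this generation step; the cell contents and the detect_class_overlap operator name are illustrative assumptions, not the tool's actual code:

```python
import nbformat as nbf

def generate_notebook(dataset_path, label_column, out_path):
    # Build a notebook with a documentation cell plus one code cell per operator.
    nb = nbf.v4.new_notebook()
    nb.cells.append(nbf.v4.new_markdown_cell(
        "# Exploratory Data Analysis and Data Quality for AI\n"
        "Auto-generated notebook; add, edit, or delete cells as needed."))
    nb.cells.append(nbf.v4.new_code_cell(
        "import pandas as pd\n"
        f"df = pd.read_csv({dataset_path!r})\n"
        f"label_column = {label_column!r}"))
    nb.cells.append(nbf.v4.new_markdown_cell(
        "## Class Overlap\n"
        "Detects overlapping regions in the data (see Section 6); "
        "returns a JSON result, consistent with all other operators."))
    nb.cells.append(nbf.v4.new_code_cell(
        "overlap_result = detect_class_overlap(df, label_column)  # hypothetical operator"))
    nbf.write(nb, out_path)

generate_notebook("adult.csv", "income", "eda_dq_notebook.ipynb")
```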

4 EDA Method Details: Dataset Snapshot

We start with a discussion of our algorithm for building a subset of the data that serves as a snapshot of all the column variations in the dataset. We call this the dataset snapshot. Based on our user interviews (Insight 3), data scientists want to explore and understand the data. Today they use heuristic methods, such as printing the top \(N\) and bottom \(N\) rows, or known sampling methods like random sampling. As discussed in the related work section, these techniques may not be sufficient, as they do not guarantee capturing all the variations in the data and may miss rare patterns.
For example, when a categorical column has minority classes, those minority classes may be absent from a randomly generated sample when the sample size is not big. This can happen because all classes are sampled according to their frequencies, which puts small classes at a disadvantage (a small illustrative snippet follows). Another issue with these methods is that column-level information is not taken into account to get an effective sample. For example, if a gender column has four values, Male, M, Female, and F, then a user would want to see all the variations to standardise the data before applying other ML operations; otherwise, an encoding technique may assume that there are four categories in the column, whereas there should only be two. Another example is a date column that may have multiple date formats like dd-mm-yy and dd/mm/yy. Again, these need to be known so that they can be standardised before applying any other operations on the dataset. Another pain point is determining the number of rows to sample, which requires a lot of trial and error by the user to derive a good sample. We therefore felt the need to design a different approach that considers column-level information, covers all the different patterns across different columns, and is aware of string patterns present in text columns, to generate better data subsets or samples that users can use to explore and understand the data. We also set ourselves the goal that the algorithm should automatically determine the number of samples to pick, without user input. We propose a principled way to solve this problem and guarantee that all the variations in a column are picked up and presented in the subset. We discuss our formulation and algorithm in Section 4.1.
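To make the failure mode concrete, the following illustrative snippet (synthetic data, not from the article) shows how a small random sample can easily miss the rare spellings in such a gender column:

```python
import pandas as pd

# Synthetic gender column with two rare spellings, "M" and "F" (illustrative only).
df = pd.DataFrame({"gender": ["Male"] * 490 + ["Female"] * 500 + ["M"] * 6 + ["F"] * 4})

sample = df.sample(n=10, random_state=0)
print(sorted(df["gender"].unique()))      # ['F', 'Female', 'M', 'Male']
print(sorted(sample["gender"].unique()))  # most likely only ['Female', 'Male']
```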

4.1 Algorithm Details

Let us consider \(D_N\) as a full dataset with \(N\) rows consisting of a finite set of data patterns denoted by \(P(D_N)\). We define a data pattern as follows: We process the data in every column into a finite set of patterns. For categorical columns, we denote each unique value as a pattern. For numeric columns, we bin the data and take the bin index as a pattern to be covered. For string pattern columns like dates and phone numbers, we convert each string to a regex pattern and take each regex pattern as a pattern to be covered. Unique columns like ID columns are excluded from the data, since no pattern can be derived from them. Thus the set of data patterns for the entire dataset can be defined as follows:
\begin{equation} P(D_N) = \cup _{i=1}^{k} P(C_i), \tag{1} \end{equation}
where \(P(C_i)\) represents the set of patterns present in column \(i\) of \(D_N\). Our goal is to pick a minimal sample dataset \(D_S\) such that \(P(D_S)=P(D_N)\). Note that when \(S=N\), the condition is automatically satisfied. We formalise our goal as follows. Let us define
\begin{equation} H = \lbrace D_s\subset D_N|P(D_s)=P(D_N)\rbrace , \tag{2} \end{equation}
and we pick an optimal sample by optimising the condition
\begin{equation} D_S = \mathop{\text{argmin}}\limits _{D_s \in H} s, \tag{3} \end{equation}
thereby picking the sample with the minimum number of rows. This problem can be viewed as a bipartite graph \(G = (P, R, E)\), where \(P\) represents all patterns in \(D_N\), \(R\) is the set of all rows in \(D_N\), and \(E\) represents the edges connecting partitions \(P\) and \(R\). In this bipartite graph, we optimise the condition
\begin{equation} \min \limits _{E^{\prime } \subseteq E} |E^{\prime }| \quad \text{such that } C_{E^{\prime }}(P) = P(D_N), \tag{4} \end{equation}
selecting as few edges, and thereby as few rows, as possible, where \(C_{E^{\prime }}\) represents the coverage of patterns by the edge subset \(E^{\prime }\). This is an instance of the minimum set cover problem, which is NP-Complete (a candidate solution can be verified in polynomial time, but finding the optimum is intractable). Therefore, we use a greedy algorithm to solve this problem. We use a row importance metric to determine which rows are more important to pick. We define row importance as
\begin{equation} I(r_i) = \sum _{j=1}^{K}I(p_{c_{ij}}), \tag{5} \end{equation}
where \(I(r_i)\) denotes the importance of row \(i\), \(K\) denotes the number of columns, \(p_{c_{ij}}\) denotes the data pattern present in column \(j\) of row \(i\), and \(I(p)\) denotes the importance of a pattern \(p\). The importance of pattern \(p\) is defined as the relative frequency with which \(p\) occurs in the dataset.
We perform the following processing steps on data \(D_N\) before applying the snapshot algorithm (a condensed code sketch follows the list):
(1) All unique value columns are dropped.
(2) Numeric columns are binned, and the numeric values of the column are replaced by bin values.
(3) String columns with patterns are converted to regex patterns, and the string values are replaced by the regex values.
(4) Categorical columns are not processed and are retained as is.
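A minimal sketch of this preprocessing in pandas is shown below; the digit/letter-run abstraction rule and the default of five bins are illustrative assumptions, not the tool's exact rules:

```python
import re
import pandas as pd

def to_regex_pattern(value):
    # Collapse digit runs into num{k} and letter runs into alpha{k},
    # e.g., "17.05.2011" -> "num{2}.num{2}.num{4}" (abstraction rule is illustrative).
    out = re.sub(r"\d+", lambda m: f"num{{{len(m.group())}}}", str(value))
    return re.sub(r"[A-Za-z]+", lambda m: f"alpha{{{len(m.group())}}}", out)

def preprocess(df, n_bins=5):
    processed = {}
    for col in df.columns:
        s = df[col]
        if s.nunique() == len(s):                      # all-unique ID-like column: drop
            continue
        if pd.api.types.is_numeric_dtype(s):           # numeric: bin index is the pattern
            processed[col] = pd.cut(s, bins=n_bins, labels=False)
        elif s.astype(str).str.contains(r"\d").any():  # string-pattern column: regex pattern
            processed[col] = s.astype(str).map(to_regex_pattern)
        else:                                          # categorical: unique values are patterns
            processed[col] = s
    return pd.DataFrame(processed)
```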
Algorithm 1 is then run on the processed data to find \(D_S\) (a sketch of the greedy loop appears after the list below). In each iteration, we rank all rows by their importance and pick the top-ranked row, so that rows with the most frequently occurring patterns are picked first. We then set the importance of already-picked patterns to 0 to avoid giving them weight in the next iteration of row selection. Once all the patterns have been picked, we stop iterating and return the sample dataset \(D_S\). The sampled dataset \(D_S\) is built such that
(1) Rows are ordered to place the most important rows at the top and the least important at the bottom.
(2) We rank and pick sample rows using the information about data patterns from all columns.
(3) We avoid picking patterns that are already covered and pick the most important rows, thereby enabling a small subset of data to be picked with all the patterns.
(4) The number of samples to pick is determined automatically and is not needed as input from the user.
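The greedy loop can be sketched in a few lines of Python; this is a condensed, unoptimised rendering of Algorithm 1 over the processed dataframe from the preprocessing sketch above, not the tool's implementation:

```python
from collections import Counter
import pandas as pd

def dataset_snapshot(processed):
    n = len(processed)
    # Pattern importance = relative frequency of the pattern in the data (Equation (5)).
    importance = {(c, v): cnt / n
                  for c in processed.columns
                  for v, cnt in Counter(processed[c].astype(str)).items()}
    # Pre-compute each row's set of (column, pattern) pairs.
    row_patterns = [{(c, str(processed.at[i, c])) for c in processed.columns}
                    for i in processed.index]
    uncovered, selected = set(importance), []
    while uncovered:
        # Row importance sums the importances of its not-yet-covered patterns;
        # covered patterns contribute 0, mirroring the "set importance to 0" step.
        best = max(range(n),
                   key=lambda r: sum(importance[p] for p in row_patterns[r] & uncovered))
        selected.append(processed.index[best])
        uncovered -= row_patterns[best]
    return selected  # row labels, ordered from most to least important
```

Each iteration covers at least one new pattern, so the loop terminates with a small, ordered subset; on data shaped like the worked example below, it stops as soon as all patterns are covered.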
To explain the working of our algorithm, we take a dataset of 200 rows and four columns, as shown in Figure 2(a). Each column has the following type of values:
Fig. 2.
Fig. 2. Full dataset, its processed version, and the final sample produced by Dataset Snapshot algorithm.
(1) The ID column is a unique column of numbers.
(2) The LEVEL column is a categorical column with LOW, MEDIUM, and HIGH as its categories.
(3) The DATE column is a column with three different date formats: (a) 17.05.2011 (with a full stop as separator), (b) 26/02/2011 (with a forward slash as separator), and (c) 11-01-2011 (with a hyphen as separator).
(4) The SCALE column is a numeric column.
Our algorithm processes each column according to its datatype, and the processed data are shown in Figure 2(b):
(1) Categorical columns like the LEVEL column are retained as is, without any processing. We consider all three unique values, i.e., LOW, MEDIUM, and HIGH, as three patterns in column LEVEL.
(2) Columns with all unique values, like the ID column, are excluded from the processed dataset.
(3) String or date columns with patterns, like the DATE column, are converted to their regex patterns. For example, 17.05.2011 is converted to the num{2}.num{2}.num{4} regex pattern. We therefore end up with three unique patterns for the DATE column: (a) num{2}.num{2}.num{4}, (b) num{2}-num{2}-num{4}, and (c) num{2}/num{2}/num{4}.
(4) For numeric columns like the SCALE column, we bin the data and transform the column to bin numbers. In the SCALE column, we bin the numeric data into five bins and thus end up with five unique values, i.e., 0, 1, 2, 3, and 4. We consider these bin values as five patterns in column SCALE.
We have a total of 11 data patterns (3+3+5) in the processed dataset. We then use Algorithm 1 on the processed dataset to pick all patterns found in it. Finally, we end up with just five samples representing the full dataset as shown in Figure 2(c), and all the patterns from the full dataset are represented in the data snapshot.

4.2 Experiment Results

We have taken 11 open source datasets from Reference [1] and Reference [21] and run our snapshot algorithm on them. We record the number of samples picked by the algorithm. Table 1 shows the results of our analysis. Compared to the full data, the number of samples picked by our algorithm is significantly smaller while still capturing all the data patterns. On average, we picked 3.6% of the rows of the full data for the sample data. We also validate our algorithm’s claim that the sampled data cover all patterns present in the full data. For that, we manually check the number of patterns present in each full original dataset and do the same for the sampled data; the results are shown in Table 1. We can see that the sampled data cover 100% of the patterns present in the original full data for all the datasets. We also produce visualisations for two datasets using standard TSNE plots, shown in Figure 3. In Figure 3, the plots of the Breast Cancer and German Categorical datasets are shown, with blue circles representing the full dataset and yellow squares representing the sampled data points the algorithm identified. The average runtime of the snapshot algorithm was 23.10 seconds, measured across the datasets listed in Table 1. A system with a 2.3-GHz eight-core processor and 64 GB of memory was used for all runtime calculations.
Table 1.
| Dataset Name | Full Data Rows | Sample Data Rows | Full Data Columns | Full Data Patterns | Sample Data Patterns |
| --- | --- | --- | --- | --- | --- |
| mfeat-zernike | 2,000 | 107 | 48 | 938 | 938 |
| mfeat-factors | 2,000 | 204 | 217 | 4,277 | 4,277 |
| mfeat-karhunen | 2,000 | 227 | 65 | 1,281 | 1,281 |
| segment | 2,310 | 67 | 20 | 310 | 310 |
| sick | 3,772 | 87 | 30 | 140 | 140 |
| phoneme | 5,404 | 44 | 6 | 100 | 100 |
| wall-robot-navigation | 5,456 | 126 | 25 | 452 | 452 |
| texture | 5,500 | 58 | 41 | 806 | 806 |
| optdigits | 5,620 | 119 | 65 | 924 | 924 |
| satimage | 6,430 | 65 | 37 | 726 | 726 |
| pendigits | 10,992 | 48 | 17 | 330 | 330 |
Table 1. Experiment Results for the Dataset Snapshot Algorithm
Column “Sample Data Rows” shows the number of rows picked by our algorithm. The last two columns verify that the number of patterns found in the full dataset is also found in the sample dataset. The patterns are counted as a sum of patterns across all columns for the dataset.
Fig. 3.
Fig. 3. TSNE plots of both datasets show that sampled data points replicate the original data distribution.

5 EDA Method Details: Interesting Columns Detection

In our user interviews, we learned that data scientists spend a lot of time plotting and visualising different columns of the dataset to understand their characteristics as part of exploratory data analysis (see Insight 4 in Section 1). It is difficult for a data scientist to go through all the columns and plan operations such as dropping columns, encoding columns, and so on, to get the data ready for the later stages of an ML pipeline. This becomes an acute pain point when the dataset is large, with a very high number of columns. Thus, there is a need to analyse data from multiple perspectives and summarize the information at a column level, showing interesting insights into the different columns. To the best of our knowledge, no other techniques address this in the data science field.
In this section, we discuss our methodology for finding interesting columns in a given dataset. The characteristics of an interesting column are the presence of dominant values, the presence of quality issues, correlations with other columns, and the presence of spurious syntactic patterns. A job-position column may be correlated with multiple columns, e.g., the education, salary, and age of a person. A column with a large number of missing values is of interest to a data scientist who needs to fill the missing values to get the data ready for model training. A business analyst might be interested in a column where one value dominates.
Figure 4 shows an example of an interesting column named relationship from the Adult dataset [31]. Note that the value Husband dominates over other values, as shown in the data distribution in Figure 4(a), and therefore the entropy is low. Moreover, this column is also related to columns age and gender, since every Husband is male and every Wife is female, as shown in the data distribution in Figure 4(b).
Fig. 4.
Fig. 4. Interesting column relationships: Panel (a) shows data distribution of relationship column, and panel (b) shows association between relationship and gender column.
We use four metrics to determine if a column is interesting: pattern score, association score, entropy, and missing fraction. We rank columns based on an interestingness score to show the columns exhibiting interesting properties,
\begin{equation*} interestingness\_score = (1-entropy)+missing\_fraction+association\_score+pattern\_score. \end{equation*}
Entropy: Entropy measures the degree of uncertainty in data. Entropy is maximum for a uniformly distributed random variable and minimum for a constant random variable. For a column with \(n\) unique values, with \(P(x_i)\) denoting the probability of a value \(x_i\) , entropy is computed as follows:
\begin{equation*} entropy = -\sum _{i=1}^n P(x_i)\log P(x_i), \text{ where } P(x_i) = \dfrac{number\ of\ $x_i$\ in\ column}{total\ number\ of\ values\ in\ column}. \end{equation*}
Missing Fraction: Data values in a dataset can be missing due to mishandling or human error. This can be an interesting metric for a data scientist preparing or cleaning the data, especially for training a machine learning model. Missing fraction is computed as the fraction of values missing in a column,
\begin{equation*} missing\_fraction = \dfrac{number\ of\ missing\ values\ in\ a\ column}{number\ of\ rows\ in\ data}. \end{equation*}
Association Score: The association score captures the relation between different columns and determines if a column is related to other columns. We compute the Pearson [51] correlation coefficient for a pair of numeric columns and the uncertainty coefficient, Theil’s \(U\) [65], between categorical columns, considering integer columns with few unique values as categorical to get more accurate correlations. These coefficients and correlations have been used in other works, such as data imputation [28] and data synthesis [55]. The association score is computed as follows:
\begin{equation*} association\_score = \dfrac{number\ of\ associations\ of\ a\ column\ with\ correlation\ \gt 0.5}{number\ of\ associations\ in\ data\ with\ correlation\ \gt 0.5}. \end{equation*}
Pattern Score: The pattern score aims to find the severity of minority patterns in a given column. These patterns capture syntactic abstractions and different variations in column values and are different from the data patterns of Section 4.1, where binned numeric values and unique categorical values serve as patterns. For example, the value Data is mapped to Aaaa, and EMPID1234 is mapped to AAAAA9999. Once the values in a column are mapped to syntactic patterns, they are grouped to identify the frequency of each identified pattern. A pattern is called a minority pattern if it is followed by less than 5% of the cells in a given column. To reduce false positives and not flag minority patterns in columns that do not follow a particular syntax, we first check whether the given column has a dominant or majority pattern, i.e., a pattern that is followed by at least 50% of the cells in the column,
\begin{equation*} pattern\_score = \dfrac{number\ of\ minority\ patterns\ in\ a\ column}{total\ number\ of\ patterns\ in\ a\ column}. \end{equation*}
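All four per-column scores are straightforward to compute with pandas. The sketch below follows the definitions above; normalising entropy by \(\log n\) (so that the \(1-entropy\) term lies in [0, 1]) and the exact syntactic mapping are our assumptions, and the association score is passed in, computed per the Pearson/Theil's \(U\) definition above:

```python
import numpy as np
import pandas as pd

def column_entropy(s):
    # Shannon entropy of the value distribution, normalised by log(n) (our assumption)
    # so that (1 - entropy) is comparable across columns.
    p = s.value_counts(normalize=True, dropna=True)
    return 0.0 if len(p) <= 1 else float(-(p * np.log(p)).sum() / np.log(len(p)))

def missing_fraction(s):
    return float(s.isna().mean())

def pattern_score(s, minority=0.05, majority=0.5):
    # Map values to syntactic patterns: "Data" -> "Aaaa", "EMPID1234" -> "AAAAA9999".
    def syntax(v):
        return "".join("9" if ch.isdigit() else "A" if ch.isupper() else
                       "a" if ch.islower() else ch for ch in str(v))
    freq = s.dropna().map(syntax).value_counts(normalize=True)
    if freq.empty or freq.iloc[0] < majority:   # no dominant pattern: do not flag
        return 0.0
    return float((freq < minority).sum() / len(freq))

def interestingness(s, association_score):
    return (1 - column_entropy(s)) + missing_fraction(s) + association_score + pattern_score(s)
```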
Figure 5 shows an example of an interesting column relationship and an example of a not-so-interesting column occupation from the Adult dataset [31]. Note that the interesting column relationship is shown with tags Low Entropy and High Correlation (on the right top corner). This column is interesting owing to the skewed data distribution, as shown in the distribution plot, and highly correlated column values with other columns, as discussed earlier. However, column occupation is not very interesting according to our metric. The average runtime of the algorithm for detecting interesting columns was found to be 8.8 seconds, as measured across the datasets listed in Table 1.
Fig. 5.
Fig. 5. Interesting column example.

6 DQAI Method Details: Detection of Class Overlap in the Data

One common problem seen in datasets for machine learning tasks is the presence of overlapping regions in the dataset, or class overlap. Class overlap occurs when several data points lie close to each other in the vector space but have different class labels. The presence of such regions makes it difficult to build classification models, as a model is likely to make mistakes in overlapping regions. Depending on the number and sizes of the overlapping regions present in the data, the complexity of building a machine learning model differs. Formally, one can define an overlapping region \(R\) with data points \(x_{1},\ldots ,x_{n}\) having class labels \(y_{1},\ldots ,y_{n}\) drawn from \(k\) classes, where \(n\) is the number of data points, such that the distance between the points is less than a threshold \(\theta\) and the number of distinct classes in the region is greater than 1. This can be represented as follows:
\begin{equation} R = {\left\lbrace \begin{array}{ll} dist(x_{i}, x_{j}) \lt \theta & \forall i, j \in (1..n)\\ |\lbrace y_{1}, \ldots , y_{n}\rbrace | \gt 1 & \text{where $y_{i}$ corresponds to the class label for $x_{i}$}.\\ \end{array}\right.} \tag{6} \end{equation}
Typically, the problem of overlapping regions has been studied in the context of the class imbalance problem [24, 49] by considering the effect of overlap regions between the majority and minority classes. In recent work, the authors of Reference [69] showed with experiments on synthetic datasets that while the presence of class overlap amplifies the class imbalance problem, the reverse may not be true. Their experiments also showed that class overlap can significantly hurt classifier performance, even on balanced datasets. They show their experiments on synthetic datasets using Random Forest as the classifier of choice. While these experiments provide some initial signals, there are still some open questions that need to be answered. Does the presence of overlapping regions in datasets also affect state-of-the-art models like automated machine learning models [39]? Does class overlap affect only certain configurations of datasets, or is it independent of dataset size and shape? Is there any difference based on the number of classes in the dataset? We answer all these questions systematically to better understand the problem of class overlap and its impact on downstream models.

6.1 Understanding the Effect of Overlapping Regions on ML Models

We pick 12 datasets from UCI [21] and Kaggle [1] with variation in the number of rows, columns, and classes (see Table 2). We use the approach followed in Reference [54] to induce \(20\%\) overlap in the data. We then follow the steps outlined below to perform systematic experiments using the AutoML [39] classifier.
Table 2.
| Dataset Name | Number of Columns | Number of Rows | Number of Classes |
| --- | --- | --- | --- |
| mfeat-zernike | 47 | 2,000 | 10 |
| mfeat-factors | 216 | 2,000 | 10 |
| mfeat-karhunen | 64 | 2,000 | 10 |
| segment | 19 | 2,310 | 7 |
| sick | 29 | 3,772 | 2 |
| phoneme | 5 | 5,404 | 2 |
| wall-robot-navigation | 24 | 5,456 | 4 |
| texture | 40 | 5,500 | 11 |
| optdigits | 64 | 5,620 | 10 |
| satimage | 36 | 6,430 | 6 |
| pendigits | 16 | 10,992 | 10 |
| gas-drift | 128 | 13,910 | 6 |
Table 2. Summary of Datasets Used for Experiments for Class Overlap
Steps to understand the impact of class overlap on AutoML classifiers (a condensed sketch of this protocol follows the list):
(1) Divide the dataset into train and test splits. Let us call the train split \({D_{tr}}\) and the test split \({D_{te}}\).
(2) Create a copy \(D^{\prime }\) of the dataset, with the same splits as \(D\). Let us call its train split \({D_{tr^{\prime }}}\) and test split \({D_{te^{\prime }}}\).
(3) Add synthetically generated overlapping data points to \({D_{tr^{\prime }}}\). No changes are made to \({D_{te^{\prime }}}\).
(4) Train classifier \(C\) on the original data \(D\) by applying threefold CV on the training data \({D_{tr}}\), and record the accuracy on the test split \({D_{te}}\).
(5) Train classifier \(C^{\prime }\) on the overlap-induced dataset \(D^{\prime }\) by applying threefold CV on the training data \({D_{tr^{\prime }}}\), and record the accuracy on the test split \({D_{te^{\prime }}}\). Note that \({D_{te}}\) and \({D_{te^{\prime }}}\) are exactly the same.
(6) Compare the accuracy of \(C\) and \(C^{\prime }\) on the test splits.
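The protocol is easy to reproduce in scikit-learn. The sketch below substitutes a RandomForest for the AutoML classifier, simplifies the CV step to a single fit, and uses a plain interpolation-based overlap inducer rather than the exact method of Reference [54]; all of these substitutions are our assumptions:

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

def induce_overlap(X, y, fraction=0.20, seed=0):
    # Simple stand-in for the induction method of Reference [54]: add points
    # interpolated between random pairs; when a pair spans two classes, the
    # new point sits between them but carries only one side's label.
    rng = np.random.default_rng(seed)
    n_new = int(fraction * len(X))
    a, b = rng.integers(0, len(X), n_new), rng.integers(0, len(X), n_new)
    lam = rng.uniform(0.4, 0.6, (n_new, 1))
    return np.vstack([X, lam * X[a] + (1 - lam) * X[b]]), np.concatenate([y, y[a]])

X, y = load_digits(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
X_tr2, y_tr2 = induce_overlap(X_tr, y_tr)

for name, (Xt, yt) in {"original": (X_tr, y_tr), "overlap-induced": (X_tr2, y_tr2)}.items():
    clf = RandomForestClassifier(random_state=0).fit(Xt, yt)
    print(name, round(clf.score(X_te, y_te), 3))  # identical untouched test split
```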
We train 24 models over the 12 datasets by following the steps outlined above using threefold CV. For each of these models, we keep the settings of the AutoML classifier the same, except for the change in the training dataset. For each model, we report the accuracy numbers on the test split in Table 3. Note that the test split is the same for a given dataset for both \(C\) and \(C^{\prime }\). We observe that the accuracy of the classifier trained on the overlap datasets drops for all the datasets, with an average drop of \(5.9\%\), a minimum drop of \(2.5\%\), and a maximum drop of \(15\%\). This provides strong evidence that the presence of class overlap degrades the accuracy of state-of-the-art ML models. We do not see any patterns in the effect of the number of classes or other dataset characteristics on the impact of class overlap on ML models. We next describe an algorithm to detect class overlap in the data. Our algorithm both detects overlap regions and explains why a certain region is considered one.
Table 3.
| Dataset Name | Accuracy on raw data (D) | Accuracy on overlap added data (D\(^{\prime }\)) | Accuracy difference |
| --- | --- | --- | --- |
| mfeat-zernike | 0.967 | 0.891 | 0.076 |
| mfeat-factors | 0.982 | 0.957 | 0.025 |
| mfeat-karhunen | 0.983 | 0.906 | 0.077 |
| segment | 0.993 | 0.947 | 0.046 |
| sick | 0.998 | 0.943 | 0.055 |
| phoneme | 0.988 | 0.943 | 0.045 |
| wall-robot-navigation | 0.982 | 0.953 | 0.029 |
| texture | 0.993 | 0.937 | 0.056 |
| optdigits | 0.989 | 0.955 | 0.034 |
| satimage | 0.989 | 0.942 | 0.047 |
| pendigits | 0.995 | 0.952 | 0.043 |
| gas-drift | 0.898 | 0.748 | 0.15 |
Table 3. Impact of Class Overlap Issue on Classifier Accuracy

6.2 Algorithm to Detect Class Overlap

Our goal is to detect the class overlap problem to give the user insight into the challenges in the data before they start any data preparation work. We propose a graph-based method for class overlap detection, where the nodes of a graph are the data points in the dataset and the edges are weighted by the distance between the points. Every vertex stores the data point id and the corresponding class label. The graph construction starts with every vertex forming an edge with its \(k\) neighboring vertices. Next, for every vertex in the graph, the class labels of the neighboring vertices are compared. If all the neighbors have the same class label, then the edges between these vertices are pruned. We also prune all vertices of degree 0 from the graph. As a next step, for every vertex with a degree greater than \(d\), we find the connected components. For every connected component, we count the number of vertices belonging to different classes. If the ratio is less than \(r\), then the edges are pruned and that connected component is dissolved. The remaining connected components are the overlap regions in the dataset. For every overlap region, we find the feature ranges for every data point and find the features and corresponding ranges that cause the overlap in that region. The connected component can also be visualised using standard graph drawing tools to further explain the reason for the overlap. We also quantify the amount of overlap as the number of points in the overlap regions divided by the total number of data points. Our proposed method for class overlap detection has the following properties (a condensed sketch appears after the list):
Quantifies the amount of overlap by giving a score that ranges from 0 to 1, so as to objectively compare the amount of overlap across the datasets
Insights on which classes contribute to overlap and the percentage of contribution for each class
Index of data points belonging to each overlapping region along with the feature ranges that cause an overlap to provide explanations to users
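A condensed sketch of the detection step follows, built on scikit-learn's k-nearest-neighbour graph and scipy's connected components. It assumes integer-encoded labels, uses illustrative values for the thresholds \(k\), \(d\), and \(r\), and omits the feature-range explanation step:

```python
import numpy as np
from scipy.sparse.csgraph import connected_components
from sklearn.neighbors import kneighbors_graph

def detect_overlap_regions(X, y, k=5, d=1, r=0.1):
    y = np.asarray(y)
    A = kneighbors_graph(X, n_neighbors=k, mode="connectivity").tolil()
    # Prune edges of vertices whose neighbourhoods are label-pure.
    for i in range(len(y)):
        if all(y[j] == y[i] for j in A.rows[i]):
            A.rows[i], A.data[i] = [], []
    A = A.tocsr()
    A = A.maximum(A.T)  # symmetrise before finding components
    n_comp, comp = connected_components(A, directed=False)
    regions = []
    for c in range(n_comp):
        members = np.where(comp == c)[0]
        if len(members) <= d:      # size filter; also drops isolated vertices
            continue
        counts = np.bincount(y[members])
        counts = counts[counts > 0]
        if len(counts) < 2 or counts.min() / counts.sum() < r:
            continue               # dissolve components dominated by a single class
        regions.append(members)
    overlap_score = sum(len(m) for m in regions) / len(y)  # score in [0, 1]
    return regions, overlap_score
```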

6.3 Results

To demonstrate the effectiveness of our algorithm, we use the same 12 datasets and synthetically add overlapping data points, which serve as ground truth. We then calculate the precision and recall of our algorithm on these datasets. We add \(30\%\) overlap by using the method described in Reference [54]. Table 4 shows the results on precision and recall for the different datasets. The average precision for class overlap detection is 0.865, and the average recall is 0.7825. These results demonstrate the effectiveness of the proposed approach. The runtime of the proposed algorithm was measured to be 0.96 seconds, on average, across the datasets listed in Table 2.
Table 4.
| Dataset Name | Precision | Recall |
| --- | --- | --- |
| mfeat-zernike | 0.8 | 0.93 |
| mfeat-factors | 0.87 | 0.88 |
| mfeat-karhunen | 0.67 | 1 |
| segment | 0.93 | 0.83 |
| sick | 1 | 0.68 |
| phoneme | 0.92 | 0.81 |
| wall-robot-navigation | 0.8 | 0.72 |
| texture | 0.84 | 0.98 |
| optdigits | 0.8 | 0.1 |
| satimage | 0.89 | 0.855 |
| pendigits | 0.91 | 0.90 |
| gas-drift | 0.96 | 0.71 |
Table 4. Results on Class Overlap Detection

7 DQAI Method Details: Label Noise

The correctness of the training data labels plays a very important role in determining the quality of the training data used to build ML models. Most real-world datasets that are generated or annotated have some inconsistent/noisy labels [48]. Training data with noisy or inconsistent labels is one of the important issues in classification settings, with a potential impact on the data science pipeline. For instance, label noise can lead to a decrease in model accuracy, an increase in model complexity, and an increase in the number of training samples required [19]. It is therefore important to detect and correct noisy samples in the dataset. Formally, the label noise operator can be defined as an operator that analyses and identifies inconsistencies or noise in the training data labels. For all the noisy labels detected, we also recommend clean labels.

7.1 Understanding the Effect of Presence of Label Noise on ML Models

Noise in the data mostly follows a random distribution. To analyze the need for a label noise algorithm, we pick diverse datasets from the UCI [21] and Kaggle [1] repositories to capture variations in the number of rows, columns, and classes. We collected 21 such datasets (shown in Table 5) that meet these requirements. We use threefold cross-validation, where in each fold we introduce \(10\%\) random label noise per class in the training set only and do not make any changes to the test set (a sketch of this injection step follows). We then follow steps similar to those outlined in Section 6 for our experiments, with the only difference that instead of inducing overlap, we induce label noise in the data.
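The injection step can be expressed in a few lines; in this sketch the replacement class for each flipped sample is drawn uniformly from the other classes, which is our assumption:

```python
import numpy as np

def add_label_noise(y_train, noise=0.10, seed=0):
    # Flip `noise` fraction of labels per class in the training split only.
    rng = np.random.default_rng(seed)
    y_orig = np.asarray(y_train)
    y_noisy = y_orig.copy()
    classes = np.unique(y_orig)
    for c in classes:
        idx = np.where(y_orig == c)[0]
        flip = rng.choice(idx, size=int(noise * len(idx)), replace=False)
        y_noisy[flip] = rng.choice(classes[classes != c], size=len(flip))
    return y_noisy
```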
Table 5.
| Dataset Name | Number of Columns | Number of Rows | Number of Classes |
| --- | --- | --- | --- |
| credit-approval | 15 | 690 | 2 |
| mfeat-pixel | 240 | 2,000 | 10 |
| mfeat-zernike | 47 | 2,000 | 10 |
| cardiotocography | 35 | 2,126 | 10 |
| mfeat-morphological | 6 | 2,000 | 10 |
| soybean | 35 | 683 | 19 |
| phoneme | 5 | 5,404 | 2 |
| banknote | 4 | 1,372 | 2 |
| texture | 40 | 5,500 | 11 |
| balance-scale | 4 | 625 | 3 |
| mfeat-fourier | 76 | 2,000 | 10 |
| pendigits | 16 | 10,992 | 10 |
| wall-robot-navigation | 24 | 5,456 | 4 |
| spambase | 57 | 4,601 | 2 |
| mfeat-karhunen | 64 | 2,000 | 10 |
| segment | 19 | 2,310 | 7 |
| mushroom | 22 | 8,124 | 2 |
| spambase-reduced | 15 | 4,601 | 2 |
| waveform-5000 | 40 | 5,000 | 3 |
| eeg-eye-state | 14 | 14,980 | 2 |
| kr-vs-kp | 36 | 3,196 | 2 |
Table 5. Summary of Datasets Used for Experiments for Label Noise
Table 6 shows the effect of inducing \(10\%\) random noise on the performance of the AutoML classifier [39]. We can clearly observe a drop in the performance of the classifiers after noise is induced. In some cases, there is no drop in accuracy; since the data points for inducing label noise were selected randomly, we hypothesize that in these cases the points may be far from the classifier boundary and do not impact the model. For 13 of the 21 datasets, there is a decrease in performance of more than \(1\%\). There are 6 datasets where the decrease in performance is high (greater than \(2\%\)), and the maximum drop observed for these datasets is \(4\%\). This provides strong evidence that label noise degrades the accuracy of state-of-the-art ML models and needs to be detected and corrected. With this motivation, we next describe an algorithm to detect label noise in the data.
Table 6.
| Dataset Name | Accuracy on raw data (D) | Accuracy on noise added data (D\(^{\prime }\)) | Accuracy difference |
| --- | --- | --- | --- |
| credit-approval | 0.86 | 0.85 | 0.01 |
| mfeat-pixel | 0.95 | 0.95 | 0.00 |
| mfeat-zernike | 0.78 | 0.78 | 0.00 |
| cardiotocography | 1 | 1 | 0.00 |
| mfeat-morphological | 0.72 | 0.71 | 0.01 |
| soybean | 0.90 | 0.86 | 0.04 |
| phoneme | 0.88 | 0.85 | 0.03 |
| banknote | 1 | 0.98 | 0.02 |
| texture | 0.99 | 0.99 | 0.00 |
| balance-scale | 0.88 | 0.88 | 0.00 |
| mfeat-fourier | 0.81 | 0.80 | 0.01 |
| pendigits | 0.99 | 0.99 | 0.00 |
| wall-robot-navigation | 1 | 0.99 | 0.01 |
| spambase | 0.94 | 0.92 | 0.02 |
| mfeat-karhunen | 0.94 | 0.93 | 0.01 |
| segment | 0.97 | 0.96 | 0.01 |
| mushroom | 1 | 1 | 0.00 |
| spambase-reduced | 0.83 | 0.83 | 0.00 |
| waveform-5000 | 0.87 | 0.86 | 0.01 |
| eeg-eye-state | 0.94 | 0.92 | 0.02 |
| kr-vs-kp | 0.99 | 0.97 | 0.02 |
Table 6. Impact of Label Noise Issue on AutoAI Classifier Accuracy
D denotes the raw data, and D\(^{\prime }\) denotes the data after induction of \(10\%\) random noise.

7.2 Algorithm Details

The proposed label noise detection algorithm (see Algorithm 2) is built on top of the confident learning–based approach [48] (CleanLab). One limitation of this approach is that it can tag some correct samples as noisy if they lie in an overlap region, thereby generating false positives. An overlap region is a region where several data points lie close to each other in the vector space but have different class labels. We improve the algorithm to address this problem in two ways: (1) we use our proposed overlap region detection algorithm (discussed in Section 6.2) to prune, from the list of possible noisy-sample candidates, those samples that were flagged only because of the confusing probability distribution in an overlapping region (orange blocks in Figure 6; refer to Algorithm 2), and (2) we further propose effective neighborhood-based strategies (blue blocks in Figure 6) that analyze neighboring samples together with the label suggested by the probability distribution-based approach (green blocks in Figure 6) to further prune the identified noisy candidate samples, which helps in reducing false positives.
Fig. 6.
Fig. 6. Proposed label noise detection framework.
Figure 6 illustrates the proposed label noise framework, which takes a noisy dataset as input and generates a list of noisy points with suggested labels. The noisy data are passed through a sequence of steps that compare classwise mean probabilities with sample probabilities to detect possible noisy-sample candidates. The system also utilizes the probability information to find an initial set of label suggestions for the detected candidates. As a next step, the system refines the list of detected noisy samples by removing the candidates that belong to overlap/confusing regions. Finally, the retained samples are fed into the nearest-neighbor analysis module, which further prunes the remaining candidates. The final list of noisy samples, with suggestions for the correct labels, is then shown to the user for further analysis. The detailed steps of the proposed label noise algorithm are given in Algorithm 2.
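The flow in Figure 6 can be approximated on top of the open source CleanLab package. In the sketch below, find_label_issues supplies the probability-based candidate step, detect_overlap_regions is the Section 6.2 sketch above, a RandomForest stands in for the underlying classifier, and the majority-vote threshold is illustrative:

```python
import numpy as np
from cleanlab.filter import find_label_issues
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_predict
from sklearn.neighbors import NearestNeighbors

def detect_label_noise(X, y, k=5):
    y = np.asarray(y)
    # Green blocks: probability-based noisy candidates and suggested labels.
    pred_probs = cross_val_predict(RandomForestClassifier(random_state=0),
                                   X, y, cv=3, method="predict_proba")
    candidates = find_label_issues(labels=y, pred_probs=pred_probs)
    suggested = pred_probs.argmax(axis=1)
    # Orange blocks: drop candidates that fall inside detected overlap regions.
    regions, _ = detect_overlap_regions(X, y)
    in_overlap = np.zeros(len(y), dtype=bool)
    for members in regions:
        in_overlap[members] = True
    candidates = candidates & ~in_overlap
    # Blue blocks: keep a candidate only if most of its neighbours agree with
    # the suggested label rather than the observed one.
    nbrs = NearestNeighbors(n_neighbors=k + 1).fit(X).kneighbors(X)[1]
    noisy = [i for i in np.where(candidates)[0]
             if np.mean(y[nbrs[i, 1:]] == suggested[i]) > 0.5]
    return np.array(noisy), suggested
```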

7.3 Results

We validate the improvement of our proposed algorithm by comparing its precision and recall (see Table 7) with CleanLab as a baseline. We demonstrate these results on the same 21 datasets that we used to demonstrate the impact of noise on AutoAI classifier performance. Table 7 shows that our algorithm outperforms CleanLab in terms of precision. The overall average precision over the 21 datasets is 0.93 for the proposed algorithm and 0.74 for CleanLab. The overall average recall of the proposed algorithm is around 0.76, and for CleanLab it is 0.81. At the cost of a 0.05 drop in recall, our algorithm outperforms the state of the art by 0.19 in precision.
Table 7.
| Dataset Name | Precision (Our) | Precision (CleanLab) | Recall (Our) | Recall (CleanLab) |
| --- | --- | --- | --- | --- |
| credit-approval | 0.63 | 0.50 | 0.60 | 0.65 |
| mfeat-pixel | 1 | 0.99 | 0.73 | 0.74 |
| mfeat-zernike | 0.96 | 0.25 | 0.73 | 0.78 |
| cardiotocography | 1 | 0.97 | 0.74 | 0.74 |
| mfeat-morphological | 0.70 | 0.16 | 0.74 | 0.83 |
| soybean | 0.96 | 0.80 | 0.63 | 0.84 |
| phoneme | 0.85 | 0.45 | 0.82 | 0.82 |
| banknote | 1 | 1 | 0.83 | 0.84 |
| texture | 0.99 | 0.99 | 0.83 | 0.84 |
| balance-scale | 0.97 | 0.39 | 0.70 | 0.89 |
| mfeat-fourier | 0.86 | 0.30 | 0.67 | 0.73 |
| pendigits | 0.99 | 0.99 | 0.82 | 0.83 |
| wall-robot-navigation | 0.99 | 0.97 | 0.76 | 0.84 |
| spambase | 0.91 | 0.74 | 0.80 | 0.80 |
| mfeat-karhunen | 0.99 | 0.98 | 0.68 | 0.71 |
| segment | 1 | 0.98 | 0.81 | 0.82 |
| mushroom | 1 | 1 | 0.81 | 0.82 |
| spambase-reduced | 0.86 | 0.74 | 0.86 | 0.88 |
| waveform-5000 | 0.94 | 0.64 | 0.66 | 0.87 |
| eeg-eye-state | 0.96 | 0.70 | 0.78 | 0.81 |
| kr-vs-kp | 0.98 | 0.96 | 0.87 | 0.89 |
Table 7. Comparison of (a) Precision and (b) Recall of Proposed Label Noise Algorithm with CleanLab Algorithm [48]
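The evaluation protocol above can be illustrated with a short sketch: inject \(10\%\) random label flips (mirroring the setup of Table 6) and score a detector's output against the known flipped indices. This is a hedged reading of the protocol, not the article's exact evaluation code; details such as the random seed and the uniform choice of replacement class are assumptions.

```python
import numpy as np

def inject_label_noise(y, frac=0.10, seed=0):
    """Flip `frac` of labels uniformly at random; return the noisy labels and
    the set of flipped indices (the ground truth for evaluation)."""
    rng = np.random.default_rng(seed)
    y = np.asarray(y)
    y_noisy = y.copy()
    flipped = rng.choice(len(y), size=int(frac * len(y)), replace=False)
    classes = np.unique(y)
    for i in flipped:
        y_noisy[i] = rng.choice(classes[classes != y[i]])  # any other class
    return y_noisy, set(flipped.tolist())

def precision_recall(detected, truly_noisy):
    """Score detected indices against the known flipped indices."""
    detected = set(detected)
    tp = len(detected & truly_noisy)
    precision = tp / len(detected) if detected else 0.0
    recall = tp / len(truly_noisy) if truly_noisy else 0.0
    return precision, recall
```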
Table 8 shows the impact of correcting noisy data by applying the label noise recommendations on AutoAI classifier performance. In Table 8, D \(^{\prime }\) represents the noisy data, and D \(^{\prime \prime }\) represents the cleaned data after applying the recommendations from the label noise algorithm. We observe a clear improvement in AutoAI classifier performance for some of the datasets, while for the others performance remains the same. For 6 of the 21 datasets, performance increases by more than \(1\%\), with a maximum observed improvement of \(5\%\). These results demonstrate the effectiveness of the proposed label noise algorithm. The runtime of the proposed algorithm was measured to be 14.87 seconds, on average, across the datasets listed in Table 5.
Table 8.
Dataset Name | Accuracy on noise added data (D \(^{\prime }\)) | Accuracy on cleaned data (D \(^{\prime \prime }\)) | Accuracy difference
credit-approval | 0.85 | 0.86 | 0.01
mfeat-pixel | 0.95 | 0.95 | 0.00
mfeat-zernike | 0.78 | 0.78 | 0.00
cardiotocography | 1 | 1 | 0.00
mfeat-morphological | 0.71 | 0.71 | 0.00
soybean | 0.86 | 0.91 | 0.05
phoneme | 0.85 | 0.86 | 0.01
banknote | 0.98 | 0.99 | 0.01
texture | 0.99 | 0.99 | 0.00
balance-scale | 0.85 | 0.85 | 0.00
mfeat-fourier | 0.80 | 0.80 | 0.00
pendigits | 0.99 | 0.99 | 0.00
wall-robot-navigation | 0.99 | 0.99 | 0.00
spambase | 0.92 | 0.93 | 0.01
mfeat-karhunen | 0.93 | 0.94 | 0.01
segment | 0.96 | 0.96 | 0.00
mushroom | 1 | 1 | 0.00
spambase-reduced | 0.83 | 0.83 | 0.00
waveform-5000 | 0.86 | 0.86 | 0.00
eeg-eye-state | 0.92 | 0.92 | 0.01
kr-vs-kp | 0.97 | 0.97 | 0.00
Table 8. Impact of Cleaning the Label Noise Issue on AutoAI Classifier Accuracy
D \(^{\prime }\) denotes noisy data after induction of \(10\%\) random noise, and D \(^{\prime \prime }\) denotes the cleaned data after applying the proposed label noise recommendations.
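A minimal sketch of the cleaning step behind Table 8 follows: detected samples are relabeled with the algorithm's suggestions, and a classifier is retrained on D \(^{\prime }\) and D \(^{\prime \prime }\) for comparison. A plain random forest stands in for the AutoAI pipelines used in the article, and `noisy_idx` and `suggestions` are assumed to come from a detector such as the earlier sketch.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

def accuracy_before_after_cleaning(X, y_noisy, noisy_idx, suggestions):
    """Compare accuracy when training on D' (noisy) vs. D'' (cleaned) labels."""
    y_noisy = np.asarray(y_noisy)
    y_clean = y_noisy.copy()
    y_clean[noisy_idx] = np.asarray(suggestions)[noisy_idx]  # apply recommendations

    scores = {}
    for name, labels in [("D'", y_noisy), ("D''", y_clean)]:
        X_tr, X_te, y_tr, y_te = train_test_split(
            X, labels, test_size=0.25, random_state=0
        )
        model = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)
        scores[name] = accuracy_score(y_te, model.predict(X_te))
    return scores
```

Note that, for brevity, this sketch evaluates each variant against its own held-out labels; a stricter comparison would score both models against the original clean labels.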

8 User Study to Understand the Usefulness of the Algorithms

In this article, we have covered several aspects of EDA and data quality for ML analysis, including the results of our proposed algorithms. We have also provided a tool that streamlines these processes by automatically generating code snippets, saving time and reducing the risk of errors. To further assess the usefulness of these algorithms and the automated tool in the ML lifecycle, we conduct a user study with 16 data scientists from IBM. The study aims to answer the following questions: (a) Does the tool help a user reduce the time spent in the data preprocessing phase? (b) Does it have any impact on the user's productivity? (c) Does it have any impact on the performance of the model that is built?

8.1 Design of the Study

To understand the usefulness of the algorithms and the tool that encapsulates them, we design our study by dividing the data scientists into two groups: a control group and a user group. All of these data scientists work in horizontal platform teams and thus across different verticals like finance, healthcare, and so on. We keep all conditions the same for both groups, except that the user group has access to the tool and the functionalities described in Sections 3 and 4–7. We recruit 16 data scientists with varying levels of experience: each group has three data scientists with 0–3 years of experience and five within the bracket of 3–7 years. We specifically check for years of experience in the data science area, not overall experience. We assign participants to the two groups randomly, such that both groups have the same average level of experience.
Task Details: All participants are tasked with building a model for a given dataset within the same time limit. We choose the kc2 dataset from the OpenML repository [56] and impose no other constraints on the steps that can be used in the process, such as the choice of model. To eliminate any bias that may arise from familiarity with popular datasets, we intentionally choose a dataset that is not widely known. We divide the data into training, validation, and test sets and provide the training and validation sets to the data scientists; the test set is withheld and used to evaluate the models they build. We give all the data scientists 3 hours to complete the task, a duration decided by running a pilot with two data scientists and observing the time they took to build a model on a small dataset like kc2. To avoid logistic issues, we share the dataset on a shared drive, give all participants code snippets for basic tasks like data loading and model saving, and install the packages commonly needed by data scientists on the virtual machines used for the study. We share an instructions document describing the task and dataset details, and we make it clear before the study that the goal is to understand the steps taken by a data scientist, not to evaluate their skill, so that they can work on the dataset in their natural mode.
Metrics used for the study: We decide on the metrics to capture as part of the study design; they are shown in Table 9. We define an operator as an atomic unit of operation; for example, the detection of missing values is considered one operator. To capture these metrics, we request the data scientists to share a screen recording of their work during the study and the notebook that they used for their analysis. The metrics are designed to observe the impact on time savings, productivity, and the quality of the dataset as measured by the quality of the model.
Table 9.
Metric | Source | Control Group | User Group
Number of exploratory operators used for EDA and DQ | Video | Y | Y
Number of operators that were found useful and retained in the notebook | Notebook | Y | Y
Number of code errors during EDA and DQ steps | Video | Y | Y
Time spent (in mins) in fixing code errors | Video | Y | Y
Total lines of code that a user had to write for data preprocessing | Notebook | N | Y
Model Score (F1) | Notebook | Y | Y
Table 9. Metrics to Be Captured for User Study
Training for the tool: We provide a 1-hour training session to all data scientists in the user group to make them familiar with the functionalities of the tool described in Section 3. We provide a mechanism for users from both groups to reach out to us in case of any logistic challenges.

8.2 Study Results and Analysis

We collect 48 hours of video and 16 notebooks as a result of the study. We analyse these data to derive the metrics described in Table 9 for each user and show the average results for each metric in Table 10.
Table 10.
Metric | Control Group (Average values) | User Group (Average values)
Number of exploratory operators used for EDA and DQ | 11 | 20
Number of operators that were found useful and retained in the notebook | 5.78 | 17.11
Number of code errors during EDA and DQ steps | 2.28 | 0.2
Time spent (in mins) in fixing code errors | 3.67 | 0.33
Total lines of code that a user had to write for data preprocessing | NA | 10.11
Model Score (F1) | 71.63% | 77.22%
Table 10. Analysis from the User Study
In Table 10, we observe that the group using the tool was able to execute almost \(2\times\) more operators than the control group. This can be attributed to the availability of code snippets in the tool for the user group, which can be executed directly rather than writing the code for each operator. It can also be seen that participants in the user group retained \(85\%\) of the operators, as opposed to \(52.54\%\) in the control group, from which we infer that the operators provided in the tool were genuinely useful to the participants. Notably, all four operators corresponding to the four algorithms described in this article were retained by \(100\%\) of the users in the user group.

Participants from the user group were also able to save time in the data preprocessing stage, as both the number of coding errors and the time spent fixing them are lower for the user group than for the control group. We also analysed how many lines of code the user-group participants had to write manually. From the notebooks, we found that new lines of code were added for two reasons: (1) code written to drop non-useful columns following the interesting-columns analysis (Section 5) and (2) code written for dataset visualisation using t-SNE or PCA. We plan to make code snippets available for these functionalities in the next version of the tool.

Finally, we analyse how the proposed tool helps improve model accuracy. To test this effectively, we keep the test split of the dataset hidden; we download each data scientist's model from their notebook and write code to run inference on the test set (a sketch of this scoring step follows the list below), recording the F1 score of every model. We find that the model performance for the user group was, on average, about \(6\%\) higher than for the control group. We also analysed the videos of all control group participants and noted that they did not write any code to find the deep insights given by the algorithms from Sections 6 and 7. We believe that finding and correcting these issues in the data contributed to the better model performance of the user-group participants. We thus conclude from the study that our algorithms and tool helped the participants along the following two dimensions:
Productivity and Time Savings: Participants in the user group were able to run \(2\times\) more operators and retain \(3\times\) more operators for their final analysis in the same amount of time. This can be attributed to the availability of ready-made code as well as of advanced operators that are not available in open source or existing DQ libraries.
Model Performance: Improvement in model accuracy (6%) for the user group due to the ability to find deep issues in the datasets.
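The sketch below illustrates the scoring step mentioned above, under the assumption that participants saved their models as pickle files; the directory layout and file naming here are ours, not part of the study tooling.

```python
import glob
import pickle

from sklearn.metrics import f1_score

def score_submissions(model_dir, X_test, y_test):
    """Load each participant's saved model and compute its F1 score
    on the hidden test split."""
    results = {}
    for path in sorted(glob.glob(f"{model_dir}/*.pkl")):
        with open(path, "rb") as f:
            model = pickle.load(f)
        results[path] = f1_score(y_test, model.predict(X_test))
    return results
```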

9 User Study in the Wild

In the previous section, we described a user study performed in a constrained environment, as we wanted to compare and contrast how the tool aids a data scientist. In this section, we describe insights from a user study in which we relax the constraints of the previous section. We work with an external customer, installing the tool in their environment and allowing them to use it on their own datasets. This study is done with three data scientists: the first with 0–2 years of experience, the second with 4–7 years, and the third with more than 15 years of experience in the data science field. We provide detailed training on the tool and ongoing support for questions as they use it. We capture their feedback via a feedback survey as well as an interview with the users. We again want to capture productivity and time-savings metrics, and we do so by asking the users to note the time taken using the tool and to compare it with the time they had taken in the past without the tool. All three users report in the survey that their estimate of time savings is about 25–30%. On probing this topic further in the interview, the most experienced user explains that this estimate is conservative and that she expects the savings to exceed \(50\%\); she attributes this both to the automation currently provided by the tool and to the confidence in its results (which avoids the need for a recheck by a different team member). We also ask the users to list the typical operations they would perform and count the number of operations performed using our tool. From the survey results and the discussion, a novice user (0–3 years) sees a productivity improvement of \(4\times\), whereas an experienced user sees an improvement of \(2\times\). The users also appreciate the design of the tool, especially that it gives them a head start, offers novel operators that surface insights that might otherwise be missed, and can be extended as required. Both studies show that the tool gives time and productivity gains to a data scientist.

10 Conclusion and Next Steps

We have presented our innovations in the space of exploratory data analysis and data quality for AI. We discover the challenges faced by data scientists through several user interviews, present a framework and four novel algorithms that address these pain points, and present a way to automate them for a data scientist.

First, we propose an algorithm to produce a snapshot of the data so that a data scientist can easily understand the dataset. The algorithm outputs a subset of rows that covers all the variations present in each column: for categorical columns, all possible categories; for numerical columns, values from every bin after binning the data; and for string data, values covering every type of pattern. It finds this subset by picking as few rows as possible, surfacing rows that contribute to variations in more than one column, and we have manually verified on several datasets that the subset indeed covers all the variations in every column. Next, we present an innovative way to rank the columns of a dataset: we define an interestingness score and a mechanism to show the results to a data scientist, with tags explaining why a column is interesting for that dataset. We then discuss an algorithm to find confusing or overlapping regions in the dataset, explain why this problem is important to detect, and show that it affects the accuracy of state-of-the-art AutoML models; we propose a new algorithm to detect overlapping regions and test it rigorously on open source datasets. We also discuss our proposed algorithm for label noise and show results on open source datasets.

From our user interviews, we also learn that another pain point for data scientists is the time spent writing code to explore the data and find data quality issues. We therefore build a tool that generates code for the data preparation phases and encapsulates all the algorithms proposed in this article. Finally, we present two user studies with data scientists showing that our tool and algorithms help data scientists improve productivity, save time, and improve model accuracy. As part of our next steps, we plan to build scalable technologies that address very large datasets and to extend our innovations on EDA and DQ to inter-table analysis.

References

[1]
Kaggle Repository. Retrieved from https://www.kaggle.com.
[3]
2020. Test Driven Data Analysis. Retrieved from https://github.com/tdda.
[5]
2022. DataRobot. Retrieved from https://www.datarobot.com.
[6]
2022. Automated Machine Learning with Scikit-learn. Retrieved from https://github.com/automl/auto-sklearn.
[7]
2022. AutoML Vision—Google Cloud AutoML. Retrieved from https://cloud.google.com//automl.
[8]
2022. Cleaning Big Data: Most Time-Consuming, Least Enjoyable Data Science Task, Survey Says. Retrieved from https://www.forbes.com/sites/gilpress/2016/03/23/data-preparation-most-time-consuming-least-enjoyable-data-science-task-survey-says/?sh=1304712a6f63.
[9]
2022. Data Preparation: Time Consuming and Tedious? Retrieved from https://rapidminer.com/blog/data-prep-time-consuming-tedious/.
[11]
2022. A Python Automated Machine Learning Tool that Optimizes Machine Learning Pipelines Using Genetic Programming. Retrieved from https://github.com/EpistasisLab/tpot.
[12]
2022. Scalable AutoML in H2O-3 Open Source. Retrieved from https://h2o.ai/platform/h2o-automl/.
[13]
Shazia Afzal, C. Rajmohan, Manish Kesarwani, Sameep Mehta, and Hima Patel. 2021. Data readiness report. In SMDS’21. IEEE.
[14]
Gustavo E. A. P. A. Batista and Maria Carolina Monard. 2003. An analysis of four missing data treatment methods for supervised learning. Appl. Artif. Intell. 17, 5-6 (2003), 519–533.
[15]
Felix Biessmann, Tammo Rukat, Phillipp Schmidt, Prathik Naidu, Sebastian Schelter, Andrey Taptunov, Dustin Lange, and David Salinas. 2019. DataWig: Missing value imputation for tables. J. Mach. Learn. Res. 20, 175 (2019), 1–6. http://jmlr.org/papers/v20/18-753.html
[16]
Felix Bießmann, David Salinas, Sebastian Schelter, Philipp Schmidt, and Dustin Lange. 2018. “Deep” learning for missing value imputationin tables with non-numerical data. In CIKM’18.
[17]
Carsten Binnig, Lorenzo De Stefani, Tim Kraska, Eli Upfal, Emanuel Zgraggen, and Zheguang Zhao. 2017. Toward sustainable insights, or why polygamy is bad for you. In CIDR’17.
[18]
Eric Breck, Neoklis Polyzotis, Sudip Roy, Steven Whang, and Martin Zinkevich. 2019. Data validation for machine learning. In MLSys’19.
[19]
Carla E. Brodley and Mark A. Friedl. 1999. Identifying mislabeled training data. J. Artif. Intell. Res. 11 (1999), 131–167.
[20]
Simon Brugman. 2019. pandas-profiling: Exploratory Data Analysis for Python. Retrieved from https://github.com/pandas-profiling/pandas-profiling.
[21]
Dheeru Dua and Casey Graff. 2017. UCI Machine Learning Repository. Retrieved from http://archive.ics.uci.edu/ml.
[22]
Wenfei Fan. 2015. Data quality: From theory to practice. ACM SIGMOD Rec. 44, 3 (2015), 7–18.
[23]
Chen Feng, Georgios Tzimiropoulos, and Ioannis Patras. 2022. SSR: An Efficient and Robust Framework for Learning with Unknown Label Noise. arXiv preprint arXiv:2111.11288 (2021).
[24]
Vicente García, Roberto Alejo, José Salvador Sánchez, José Martínez Sotoca, and Ramón Alberto Mollineda. 2006. Combined effects of class imbalance and class overlap on instance-based classification. In IDEAL’06.
[25]
Aritra Ghosh, Himanshu Kumar, and P. Shanti Sastry. 2017. Robust loss functions under label noise for deep neural networks. In AAAI’17.
[26]
Lovedeep Gondara and Ke Wang. 2018. MIDA: Multiple imputation using denoising autoencoders. In PAKDD’18.
[27]
Leo A. Goodman. 1961. Snowball sampling. Ann. Math. Stat. (1961), 148–170.
[28]
Sandeep Hans, Diptikalyan Saha, and Aniya Aggarwal. 2022. Explainable data imputation using constraints.
[29]
Ihab F. Ilyas, Xu Chu, et al. 2015. Trends in cleaning relational data: Consistency and deduplication. Found. Trends Databases 5, 4 (2015), 281–393.
[30]
Manas Joglekar, Hector Garcia-Molina, and Aditya Parameswaran. 2017. Interactive data exploration with smart drill-down. IEEE Trans. Knowl. Data Eng. 31, 1 (2017), 46–60.
[31]
Ronny Kohavi and Barry Becker. 1996. Census Income Data. Retrieved from https://archive.ics.uci.edu/ml/datasets/adult.
[32]
Pradap Venkatramanan Konda. 2018. Magellan: Toward Building Entity Matching Management Systems. The University of Wisconsin—Madison.
[33]
Yehuda Koren, Robert M. Bell, and Chris Volinsky. 2009. Matrix factorization techniques for recommender systems. IEEE Comput. 42, 8 (2009).
[34]
Sanjay Krishnan, Michael J. Franklin, Ken Goldberg, and Eugene Wu. 2017. Boostclean: Automated error detection and repair for machine learning. arXiv:1711.01299. Retrieved from https://arxiv.org/abs/1711.01299.
[35]
Han Kyu Lee and Seoung Bum Kim. 2018. An overlap-sensitive margin classifier for imbalanced and overlapping data. Expert Syst. Appl. 98 (2018), 72–83.
[36]
Shikun Li, Xiaobo Xia, Shiming Ge, and Tongliang Liu. 2022. Selective-supervised contrastive learning with noisy labels. In CVPR’22. 316–325.
[37]
Kevin J. Liang, Samrudhdhi B. Rangrej, Vladan Petrovic, and Tal Hassner. 2022. Few-shot learning with noisy labels. In CVPR’22. 9089–9098.
[38]
R. J. A. Little and D. B. Rubin. 2002. Statistical Analysis with Missing Data. Wiley.
[39]
Sijia Liu, Parikshit Ram, Deepak Vijaykeerthy, Djallel Bouneffouf, Gregory Bramble, Horst Samulowitz, Dakuo Wang, Andrew Conn, and Alexander Gray. 2020. An ADMM based framework for automl pipeline configuration. In AAAI’20.
[40]
Pierre-Alexandre Mattei and Jes Frellsen. 2019. MIWAE: Deep generative modelling and imputation of incomplete data sets. In ICML’19.
[41]
Rahul Mazumder, Trevor Hastie, and Robert Tibshirani. 2010. Spectral regularization algorithms for learning large incomplete matrices. J. Mach. Learn. Res. (2010).
[42]
Aditya Krishna Menon, Ankit Singh Rawat, Sashank J. Reddi, and Sanjiv Kumar. 2019. Can gradient clipping mitigate label noise? In ICLR’19.
[43]
Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Efficient estimation of word representations in vector space. arXiv:1301.3781. Retrieved from https://arxiv.org/abs/1301.3781.
[44]
Claus Adolf Moser. 1952. Quota sampling. J. Roy. Stat. Soc. Ser. A (Gen.) 115, 3 (1952), 411–423.
[45]
Nagarajan Natarajan, Inderjit S. Dhillon, Pradeep K. Ravikumar, and Ambuj Tewari. 2013. Learning with noisy labels. In NIPS’13.
[46]
Gonzalo Navarro. 2001. A guided tour to approximate string matching. ACM Comput. Surv. 33, 1 (2001), 31–88.
[47]
Alfredo Nazábal, Pablo M. Olmos, Zoubin Ghahramani, and Isabel Valera. 2020. Handling incomplete heterogeneous data using VAEs. Pattern Recognition 107 (2020), 107501.
[48]
Curtis G. Northcutt, Lu Jiang, and Isaac L. Chuang. 2021. Confident learning: Estimating uncertainty in dataset labels. Journal of Artificial Intelligence Research 70 (2021), 1373–1411.
[49]
Boutkhoum Omar, Furqan Rustam, Arif Mehmood, Gyu Sang Choi, et al. 2021. Minimizing the overlapping degree to improve class-imbalanced learning under sparse feature selection: application to fraud detection. IEEE Access 9 (2021), 28101–28110.
[50]
Van L. Parsons. 2014. Stratified sampling. In Wiley StatsRef: Statistics Reference Online, 1–11.
[51]
Karl Pearson. 1900. X. On the criterion that a given system of deviations from the probable in the case of a correlated system of variables is such that it can be reasonably supposed to have arisen from random sampling. Philos. Mag. J. Sci. 50, 302 (1900), 157–175.
[52]
Kathy Razmadze, Yael Amsterdamer, Amit Somech, Susan B. Davidson, and Tova Milo. 2022. Selecting Sub-tables for Data Exploration. arXiv preprint (2022).
[53]
Theodoros Rekatsinas, Xu Chu, Ihab F. Ilyas, and Christopher Ré. 2017. Holoclean: Holistic data repairs with probabilistic inference. arXiv:1702.00820. Retrieved from https://arxiv.org/abs/1702.00820.
[54]
José A. Sáez, Mikel Galar, and Bartosz Krawczyk. 2019. Addressing the overlapping data problem in classification using the one-vs-one decomposition strategy. IEEE Access 7 (2019), 83396–83411.
[55]
Diptikalyan Saha, Aniya Aggarwal, and Sandeep Hans. 2022. Data synthesis for testing black-box machine learning models. In CODS COMAD.
[56]
J. Sayyad Shirabad and T. J. Menzies. 2005. The PROMISE Repository of Software Engineering Databases. School of Information Technology and Engineering, University of Ottawa, Canada. Retrieved from https://www.openml.org/search?type=data&sort=runs&id=1063&status=active.
[57]
Joseph L. Schafer. 1997. Analysis of Incomplete Multivariate Data. CRC Press.
[58]
Sebastian Schelter, Dustin Lange, Philipp Schmidt, Meltem Celikel, Felix Biessmann, and Andreas Grafberger. 2018. Automating large-scale data quality verification. Proc. VLDB Endow. 11, 12 (2018), 1781–1794.
[59]
Philip Sedgwick. 2013. Convenience sampling. Br. Med. J. 347 (2013).
[60]
Shrey Shrivastava, Dhaval Patel, Nianjun Zhou, Arun Iyengar, and Anuradha Bhamidipaty. 2020. DQLearn: A toolkit for structured data quality learning. In Big Data’20. IEEE, 1644–1653.
[61]
Tarique Siddiqui, Albert Kim, John Lee, Karrie Karahalios, and Aditya Parameswaran. 2016. Effortless data exploration with zenvisage: An expressive and interactive visual analytics system. arXiv:1604.03583 (2016). Retrieved from https://arxiv.org/abs/1604.03583.
[62]
D. Singh and Padam Singh. 1977. New systematic sampling. J. Stat. Plan. Infer. 1, 2 (1977), 163–177.
[63]
Shaoxu Song, Yu Sun, Aoqian Zhang, Lei Chen, and Jianmin Wang. 2018. Enriching data imputation under similarity rule constraints. In TKDE’18.
[64]
Daniel J. Stekhoven and Peter Bühlmann. 2012. MissForest—Non-parametric missing value imputation for mixed-type data. Bioinformatics (2012).
[65]
Henri Theil. 1970. On the estimation of relationships involving qualitative variables. Am. J. Sociol. 76, 1 (1970), 103–154.
[66]
Olga G. Troyanskaya, Michael N. Cantor, Gavin Sherlock, Patrick O. Brown, Trevor Hastie, Robert Tibshirani, David Botstein, and Russ B. Altman. 2001. Missing value estimation methods for DNA microarrays. Bioinformatics 17 (2001).
[67]
Manasi Vartak, Sajjadur Rahman, Samuel Madden, Aditya Parameswaran, and Neoklis Polyzotis. 2015. Seedb: Efficient data-driven visualization recommendations to support visual analytics. In VLDB’15.
[68]
Jeffrey Scott Vitter. 1984. Faster methods for random sampling. Commun. ACM 27, 7 (1984), 703–718.
[69]
Pattaramon Vuttipittayamongkol, Eyad Elyan, and Andrei Petrovski. 2021. On the class overlap problem in imbalanced data classification. Knowl.-bas. Syst. 212 (2021), 106631.
[70]
Haitao Xiong, Junjie Wu, and Lu Liu. 2010. Classification with class overlapping: A systematic study. In ICEBI’10.
[71]
Mohamed Yakout, Laure Berti-Équille, and Ahmed K. Elmagarmid. 2013. Don’t be scared: Use scalable automatic repairing with maximal likelihood and bounded changes. In SIGMOD’13. 553–564.
[72]
Kun Yi and Jianxin Wu. 2019. Probabilistic end-to-end noise correction for learning with noisy labels. In CVPR’19. 7017–7025.
[73]
Jinsung Yoon, James Jordon, and Mihaela van der Schaar. 2018. GAIN: Missing data imputation using generative adversarial nets. In ICML’18.
[74]
HaiYue Yu and KunHong Liu. 2017. Classification of multi-class microarray datasets using a minimizing class-overlapping based ECOC algorithm. In ICBCB’17.
[75]
Hongbao Zhang, Pengtao Xie, and Eric P. Xing. 2018. Missing value imputation based on deep generative models.
[76]
Zheguang Zhao, Lorenzo De Stefani, Emanuel Zgraggen, Carsten Binnig, Eli Upfal, and Tim Kraska. 2017. Controlling false discoveries during interactive data exploration. In SIGMOD’17. 527–540.
