
A Data-centric AI Framework for Automating Exploratory Data Analysis and Data Quality Tasks

Published: 01 November 2023

Abstract

Democratisation of machine learning (ML) has been an important theme in the research community for the last several years, with notable progress made by the model-building community on automated machine learning models. However, data play a central role in building ML models, and there is a need to focus on data-centric AI innovations. In this article, we first map the steps taken by data scientists in the data preparation phase and identify open areas and pain points via user interviews. We then propose a framework and four novel algorithms for the exploratory data analysis and data quality for AI steps, addressing the pain points from the user interviews. We also validate our algorithms on open source datasets and show the effectiveness of our proposed methods. Next, we build a tool that automatically generates python code encompassing the above algorithms and study the usefulness of these algorithms via two user studies with data scientists. We observe from the first study that the participants who used the tool gained 2× productivity and a 6% model improvement over the control group. The second study is performed in a more realistic environment to understand how the tool would be used in real-world scenarios. Its results are coherent with the first study and show average time savings of 30–50% that can be attributed to the tool.

1 Introduction

Data science is a multidisciplinary field focused on deriving insights from data. Data scientists today have a wide variety of skills, including artificial intelligence, machine learning, data visualisation, and computer science techniques, including cloud computing. As more and more companies embrace data science techniques, there is a dearth of data scientists, and companies face difficulty recruiting them. This has led to a new trend of upskilling professionals from business intelligence and software engineering into the data science stream. With this trend, there is a large cohort of data scientists within the “early career bracket” for data science. To address this market need, there has been a lot of research on democratising data science, with notable progress on automated model-building technologies or AutoML techniques. There are several commercial and open source tools to automate the model-building work. Examples include Google’s AutoML [7], H2O [12], DataRobot [5], IBM AutoAI [10] and open source libraries such as Auto-sklearn [6] and TPOT [11]. However, automation in the model-building steps alone is not sufficient, as the quality of a model is directly dependent on the quality of the data. Preparation of high-quality datasets has been called out as one of the most time-consuming steps of the machine learning (ML) lifecycle [8, 9]. Hence, there is a need to focus on data-centric AI to bring in further automation.
Since the mid-2010s, a few libraries have been built to measure the quality of data, such as Deequ [58], DQLearn [60], Pandas Profiling [20], Google’s TensorFlow Validation [18], TDDA [3], and Great Expectations [2]. These address a basic subset of challenges, such as missing values, ranges, outliers, and simple statistical checks, and do not focus on metrics that can affect an ML model’s performance directly. Amazon Deequ [58] allows users to express common quality constraints with custom validation code and thereby enables unit tests for data. Pandas Profiling [20] provides basic data analysis and visual functionalities. Similarly, DQLearn [60] focuses on basic data quality metrics for structured and time-series data, such as missing values, duplicates, data profiling, and so on. Great Expectations [2] provides support for SQL and Parquet formats along with pandas, whereas Google’s TensorFlow Validation [18] automatically detects data errors using data visualisation and schema generation techniques. As we can see, the current data quality toolkits mostly focus on basic analysis and are very limited in their functionalities for both exploratory data analysis (EDA) and data quality (DQ) for AI analysis.
To further understand the gaps in the current tools and the pain points faced by data scientists, we conducted user interviews to understand the as-is practices of data scientists and identify opportunities for innovation. We recruited 21 data scientists from different multi-national companies: six with 0–3 years of experience, eight with 4–7 years, and seven with more than 7 years. We interviewed each data scientist for an hour. We made a template of questions, asked of every data scientist, that fell into three buckets: understanding the current methods and mechanisms used for data preparation, understanding the open challenges faced by data scientists, and identifying time-consuming or manual activities. We recorded the responses of all the data scientists and grouped the observations into the insights below. We share the main insights as follows:
(1) Insight 1: Most data scientists performed the following steps as part of their EDA and DQ analysis using open source libraries: statistical tests, bivariate and multivariate plots, detection of missing values, detection of NA values, outlier detection, class imbalance analysis, data labeling, and so on.
(2) Insight 2: Most data scientists have their own ways of performing EDA and DQ analysis, using rules of thumb and experience; variation was also observed between how early career data scientists and experienced data scientists approached this process. Most data scientists acknowledged that beyond light processes at a team level, this was an ad hoc process that lacked standardisation.
(3) Insight 3: Often data scientists want to understand details about the dataset when they load the data. Common ways of doing this are creating a subset by printing the top \(N\) or bottom \(N\) samples or randomly sampling the data. One challenge with these methods is that they do not guarantee showing all the variations in the data.
(4) Insight 4: Data scientists spend time plotting and visualising the different columns of the dataset as part of understanding or exploring the data. This can become cumbersome and time-consuming for very wide datasets (datasets with a large number of columns).
(5) Insight 5: Data scientists shared that a lot of time was spent coding and bug fixing code written mainly to explore and understand the data. The time spent on coding varied with the experience of the data scientist. On the question of code reuse, data scientists shared that they were able to reuse only \(30\%\) of their code across projects. All data scientists agreed that much of the code written for the data preprocessing phase was for getting insights from the data and was not used in the final production code when the model is deployed.
(6) Insight 6: In enterprise settings, dataset sizes run into gigabytes. Many open source packages and libraries are unable to handle such large datasets, and hence most of the time a data scientist is forced to work on a sample of the data or write extra code to run analyses on the full dataset.
(7) Insight 7: All data scientists agreed that data preparation is one of the most time-consuming activities. In terms of the percentage of time spent by a data scientist, the most common answer was greater than \(50\%\), and it could go up to \(60\%\) or \(70\%\), depending on the complexity of the project. The complexity increased when the user was working with multiple tables of a database instead of a single table.
The first two insights give us an understanding of the current methods employed by data scientists for data preparation. Insights 3 and 4 relate to exploratory data analysis and the challenges that a user faces today in understanding the data. Specifically for insight 3, a user wants a quick view of the data that brings out all the unique variations in it, without having to go through the full dataset. We address this pain point by developing an intelligent sampling algorithm that brings out all the unique variations in every column of the dataset while picking the minimum number of rows from the data. We describe the problem statement and our algorithm formally in Section 4. Insight 4 is also related to understanding the data. Each column or feature in the dataset may have a different distribution and may relate to other columns or features. To understand these characteristics, a user has to go through the different columns and use visualisation and broad statistics to find which columns are interesting for the ML task at hand. In our user interviews, we also asked about typical dataset sizes in enterprise settings. Most users agreed that the number of columns in a dataset is often greater than 100 and can go up to 500 or 1,000. There are instances where the number of columns exceeds 1,000, but those are less frequent. It is easy to see that finding information about relevant columns via manual plotting and observation is not scalable. We address this pain point by devising a novel mechanism to rank the columns of a dataset and show the most interesting columns to the user, along with their plots, important details about each column, and its relationships to other columns in the dataset. For every column, we also explain why the column is interesting by tagging the properties of that column. We formally explain this problem and our solution in Section 5. We also devise two novel algorithms to address the problems of class overlap and label noise and find these issues in the dataset. These are in addition to the issues that a user finds today via open source libraries, as noted in Insight 1. We describe the problem statements and our algorithms in Sections 6 and 7. We also address the challenge of time spent coding and the lack of standardisation of data preparation steps by building an automated tool to auto-generate python code that data scientists can use for the analysis. We discuss this in Section 3. We evaluate our individual algorithms using meaningful baselines and discuss the validation results in the same section as the algorithm details. We also present a more comprehensive evaluation of all our algorithms by evaluating the tool in a user study with data scientists employed at multi-national companies. From the insights of the user study, we note that the two other pressing challenges that data scientists face are the need for algorithms scalable to large datasets and the ability to perform data preparation when the input is not a single table but multiple tables from a database. These topics need deeper investigation and will be addressed in future work. Our main contributions in this article are as follows:
Conduct user interviews to get insights on as-is data preparation steps and pain points (Section 1).
Design a tool that provides a framework to add various functionalities based on the insights from the user study (Section 3).
Propose a novel algorithm to find a data subset that captures all column variations of the data (Section 4).
Propose a novel algorithm to rank columns of the dataset to aid for easy exploration and understanding of the data (Section 5).
Propose a novel algorithm to find overlapping regions in the dataset and validate the algorithm using several open source datasets (Section 6).
Propose a novel algorithm to find label noise in the dataset and validate the algorithm using several open source datasets (Section 7).
Conduct a user study to understand the usefulness of the above algorithms and how they help a data scientist in terms of time, productivity, or model improvement (Section 8).

2 Related Work

In this section, we review the related work in this area. Data preparation is a long-studied area with a focus on the quality of data to be stored in databases. Our work differs in its focus on data preparation for AI pipelines, and hence we review only the work related to AI pipelines.

2.1 Exploratory Data Analysis

The first step is to understand the data, and visualisation tools help by finding interesting properties that can guide curative measures. Reference [67] recommends data-driven visualisations using statistical and deviation-based metrics such as Earth Mover’s distance, Jenson-Shannon distance, and so on. Reference [61] uses the ZQL query language for interactive visual analytics. The QUDE system [17, 76] also takes a statistical approach and provides data exploration using multiple hypothesis testing. Reference [30] discovers and summarizes interesting slices of the entire data. One of the gaps that we heard in user interviews was around the ability to quickly understand the data (Insight 3). Data scientists often seek to understand the details of a dataset upon its initial loading. This is commonly achieved by creating subsets of the data, printing the top or bottom few samples, or randomly sampling the data. However, these methods do not guarantee the inclusion of all variations within the data. Various sampling techniques have been proposed in the literature, including random sampling [68], systematic sampling [62], stratified sampling [50], convenience sampling [59], snowball sampling [27], association-based sampling [52], and quota sampling [44]. While these methods may effectively replicate the distribution of the full dataset, they may not capture every single variation within each column of the data. It is important to capture all variations within a dataset, particularly in real-world applications where rare data patterns may be of significant value. Existing methods may capture the data points having some association with respect to the target column [59] or the most frequently occurring patterns but may miss rare patterns that could be of importance to the user. Moreover, these techniques focus on variations in column data values and not on variations in data value patterns, such as regex patterns, and therefore they are not able to rank rows based on the uniqueness and importance of row values in the data.

2.2 Data Quality for AI

A wide range of methods for statistical analysis [57], constraint mining and checking [22], entity matching [32, 46], and machine learning [34, 53, 71] are used nowadays for data quality checks, data cleaning, and data repairing [29], but still with a large focus on data quality for databases. An important step in data quality for AI is filling missing values in the data, called data imputation. Most research in the field of imputation focuses on numerical imputation. Some notable approaches include k-nearest neighbors [14], multivariate imputation by chained equations [38], matrix factorisation [33, 41, 66], and deep learning methods [15, 16, 26, 40, 75]. While some recent work addresses imputation for heterogeneous data types [28], heterogeneous in some works [47, 63, 64, 73] refers to binary, ordinal, or categorical variables, which can be easily transformed into numerical representations. Since this is a mature area, we do not make any algorithmic contributions here but design our framework in such a way that these methods can be easily plugged into it (Section 3). We focus on two important problems, label noise and class overlap, that we believe are more important as their presence has been shown to adversely affect ML models (please see the experiments in Sections 6 and 7).

2.2.1 Label Noise.

Most real-world datasets that are generated or annotated have some inconsistent/noisy labels. Training data with noisy or inconsistent annotations or labels [48] can have a significant impact on the data science pipeline, leading to decreased model accuracy, increased complexity, and a greater need for training samples. Many approaches [23, 25, 36, 37, 42, 45, 72] have been proposed to address this problem, but most of them focus on designing robust loss functions for classifiers that can handle noisy labels. They do not focus on detecting the noise present in an existing dataset. Reference [48] looks at the problem of label noise with respect to the detection task, but we found certain challenges in our experiments with its detector and propose a new label noise algorithm that outperforms Reference [48].

2.2.2 Class Overlap.

Overlap among classes can cause ML classifiers to suffer performance degradation in those regions by misclassifying points or being less confident in their predictions there. Several techniques have been proposed to detect class overlap in datasets. Reference [70] proposed an algorithm that uses Support Vector Data Description (SVDD) to approximate class boundaries and determine overlap by the number of instances in the common boundary. Since this approach relies on SVDD, it can find only spherical boundaries, which is not ideal for real datasets. Other methods, such as those discussed in Reference [54], separate overlapping classes into separate binary classes and use a one-versus-one decomposition strategy to learn a classifier. However, these approaches do not focus on detecting overlap in existing datasets. References [35, 69, 74] examine the problem of overlap in conjunction with class imbalance, but their focus is on solving the problem rather than detecting it. We propose a novel algorithm to find class overlap in Section 6 below.
Data preparation also includes transforming data before feeding it into the ML pipeline. Some typical transformations include normalisation, bucketisation, winsorising, one-hot encoding, feature crosses, and using a pre-trained model or mappings from values such as words to real numbers (embeddings) to extract features [43]. Again, these are mature, and we have plugged them into our framework from open source libraries or our own implementations. We also discussed various open source tools to check the quality of data in Section 1. These do not address the challenges that we summarised from our user interviews. Next, we discuss our framework, which overcomes these challenges and provides a mechanism to add algorithms for specific problems.

3 Framework Design

In this section, we discuss the design and implementation of our proposed framework and give a high-level overview of each of its blocks. Each individual block of the framework is described in the following sections. Our main motivation was to design a framework/tool that helps a data scientist accelerate their work while minimising time-consuming and mundane tasks. Specifically, our framework addresses insights 1–5 from Section 1. We design a tool that can automatically generate python code in Jupyter notebooks to perform advanced exploratory data analysis and data quality for AI assessment and remediation, and that also keeps track of all the data transformations done in the notebook. The history of all operations is presented via a data readiness report, which serves as a comprehensive record of all data properties, insights, and quality issues, including the lineage of all data operations, to give a detailed record of how the data have evolved. This can serve as accompanying documentation for governance and audit purposes. We build this based on our earlier work presented in Reference [13]. Our design is centered around the data scientist, to help reduce time to value and, more importantly, give them flexibility and control over the data preparation process. An automated notebook with code gives them the flexibility to add/edit/delete code and customise it for their business use case. We use the following guiding principles for the design of our tool:
Consistent input and output interfaces so that it is easy to learn and use the tool. All the algorithms output results in a JSON format that is kept consistent across all operators.
Ability for the data scientist to customise the data preparation process for their business use case by giving the capability to add, delete, and edit code as necessary. All the operations return results in a temporary pandas dataframe, which can be inspected by the data scientist, who can then update the main dataframe.
Encapsulate all the algorithms via function calls but keep enough options so that the user can vary the parameters as required.
Documentation supporting the code that tells the user what a functionality does and when it should be used. It also calls out the code dependencies, the parameters the user can change, and the effect of these parameters on the algorithm.
Record the history of data operations applied to the dataset in the notebook.
Consistent input–output interfaces help the user understand the code quickly. Another factor that we debated was how much code should be shown to the user so that it is neither overwhelming nor underwhelming. After consulting a few data scientists, we came up with a design that encapsulates the code via function calls, thereby keeping a balance between simplicity and transparency. We apply all the design points discussed above to our framework, which we describe next. Our framework consists of the following main sections, each containing multiple functionalities/algorithms. Beyond our contributions, we also add the standard operators from open source libraries that data scientists use regularly, for the completeness of the tool (see Insight 1 from Section 1).
Exploratory Data Analysis: This section has algorithms that help a data scientist quickly explore the data, understand the issues/patterns, and decide on what data transformations should be done. We design two new algorithms that are presented in the corresponding sections: Dataset Snapshot (Section 4) and Interesting Columns Detection (Section 5) based on our user interviews from Section 1.
Data Quality Assessment and Remediation for AI (DQAI): This section has algorithms that help identify quality issues in the data that will affect the building of an ML model. We describe two novel algorithms, Label Purity and Class Overlap, presented in the corresponding sections, that each identify a specific problem in the data that can hinder the performance of an ML model. For each quality assessment algorithm, we make recommendations for cleaning the data and add cleaning operators for the respective assessment operators.
Data Readiness Report: This section produces a summary of data insights and a history of operations performed on the data in the notebook.
All these sections are combined in one single Jupyter notebook. Our tool generates python code via a Jupyter notebook using nbformat library [4]. We choose Python and Jupyter notebook as a library and tool of choice, respectively, as it is popular among data scientists. It takes as inputs the path to the dataset and the label column and generates a notebook with python code that a data scientist can start working with.
Figure 1 shows a screenshot of the tool. A user needs to add the path to the dataset and the label column, as shown in Figure 1(a). Once a user executes these five lines of code, a new Jupyter notebook is generated that contains both the documentation and the Python code for the user to get started with, as shown in Figure 1(b). Figure 1(c) shows a code snippet for the class overlap algorithm that we discuss in Section 6. Each functionality returns a JSON with a standard structure. The tool is designed in a modular and extensible fashion so that adding new algorithms is very easy; it only takes a small effort to write a wrapper function so that a new algorithm adheres to the input–output interface of the tool. We describe the proposed algorithms in the following sections.
Fig. 1.
Fig. 1. Tool design and overview.
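The notebook generation itself relies on the nbformat library [4]. The following is a minimal sketch of this generation step; the cell contents and the detect_class_overlap operator name are illustrative assumptions, not the tool's actual code:

```python
import nbformat as nbf

def generate_notebook(dataset_path, label_column, out_path):
    # Build a notebook with a documentation cell plus one code cell per operator.
    nb = nbf.v4.new_notebook()
    nb.cells.append(nbf.v4.new_markdown_cell(
        "# Exploratory Data Analysis and Data Quality for AI\n"
        "Auto-generated notebook; add, edit, or delete cells as needed."))
    nb.cells.append(nbf.v4.new_code_cell(
        "import pandas as pd\n"
        f"df = pd.read_csv({dataset_path!r})\n"
        f"label_column = {label_column!r}"))
    nb.cells.append(nbf.v4.new_markdown_cell(
        "## Class Overlap\n"
        "Detects overlapping regions in the data (see Section 6); "
        "returns a JSON result, consistent with all other operators."))
    nb.cells.append(nbf.v4.new_code_cell(
        "overlap_result = detect_class_overlap(df, label_column)  # hypothetical operator"))
    nbf.write(nb, out_path)

generate_notebook("adult.csv", "income", "eda_dq_notebook.ipynb")
```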

4 EDA Method Details: Dataset Snapshot

We start with a discussion of our algorithm for building a subset of the data that serves as a snapshot of all the column variations in the dataset. We call this the dataset snapshot. Based on our user interviews (Insight 3), data scientists want to explore and understand the data. Today they use heuristic methods, such as printing the top \(N\) and bottom \(N\) rows, or known sampling methods like random sampling. As discussed in the related work section, these techniques may not be sufficient, as they do not guarantee capturing all the variations in the data and may miss rare patterns.
For example, when a categorical column has minority classes, those minority classes may be absent from a randomly generated sample when the sample size is not big. This can happen because all classes are sampled according to their frequencies, which puts small classes at a disadvantage (a small illustrative snippet follows). Another issue with these methods is that column-level information is not taken into account to get an effective sample. For example, if a gender column has four values, Male, M, Female, and F, then a user would want to see all the variations to standardise the data before applying other ML operations; otherwise, an encoding technique may assume that there are four categories in the column, whereas there should only be two. Another example is a date column that may have multiple date formats like dd-mm-yy and dd/mm/yy. Again, these need to be known so that they can be standardised before applying any other operations on the dataset. Another pain point is determining the number of rows to sample, which requires a lot of trial and error by the user to derive a good sample. We therefore felt the need to design a different approach that considers column-level information, covers all the different patterns across different columns, and is aware of string patterns present in text columns, to generate better data subsets or samples that users can use to explore and understand the data. We also set ourselves the goal that the algorithm should automatically determine the number of samples to pick, without user input. We propose a principled way to solve this problem and guarantee that all the variations in a column are picked up and presented in the subset. We discuss our formulation and algorithm in Section 4.1.
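To make the failure mode concrete, the following illustrative snippet (synthetic data, not from the article) shows how a small random sample can easily miss the rare spellings in such a gender column:

```python
import pandas as pd

# Synthetic gender column with two rare spellings, "M" and "F" (illustrative only).
df = pd.DataFrame({"gender": ["Male"] * 490 + ["Female"] * 500 + ["M"] * 6 + ["F"] * 4})

sample = df.sample(n=10, random_state=0)
print(sorted(df["gender"].unique()))      # ['F', 'Female', 'M', 'Male']
print(sorted(sample["gender"].unique()))  # most likely only ['Female', 'Male']
```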

4.1 Algorithm Details

Let us consider \(D_N\) as a full dataset with \(N\) rows consisting of a finite set of data patterns denoted by \(P(D_N)\). We define a data pattern as follows: We process the data in every column into a finite set of patterns. For categorical columns, we denote each unique value as a pattern. For numeric columns, we bin the data and take the bin index as a pattern to be covered. For string pattern columns like dates and phone numbers, we convert each string to a regex pattern and take each regex pattern as a pattern to be covered. Unique columns like ID columns are excluded from the data, since no pattern can be derived from them. Thus the set of data patterns for the entire dataset can be defined as follows:
\begin{equation} P(D_N) = \cup _{i=1}^{k} P(C_i), \tag{1} \end{equation}
where \(P(C_i)\) represents the set of patterns present in column \(i\) of \(D_N\). Our goal is to pick a minimal sample dataset \(D_S\) such that \(P(D_S)=P(D_N)\). Note that when \(S=N\), the condition is automatically satisfied. We formalise our goal as follows. Let us define
\begin{equation} H = \lbrace D_s\subset D_N|P(D_s)=P(D_N)\rbrace , \tag{2} \end{equation}
and we pick an optimal sample by optimising the condition
\begin{equation} D_S = \mathop{\text{argmin}}\limits _{D_s \in H} s, \tag{3} \end{equation}
thereby picking the sample with the minimum number of rows. This problem can be viewed as a bipartite graph \(G = (P, R, E)\), where \(P\) represents all patterns in \(D_N\), \(R\) is the set of all rows in \(D_N\), and \(E\) represents the edges connecting partitions \(P\) and \(R\). In this bipartite graph, we optimise the condition
\begin{equation} \min \limits _{E^{\prime } \subseteq E} |E^{\prime }| \quad \text{such that } C_{E^{\prime }}(P) = P(D_N), \tag{4} \end{equation}
selecting as few edges, and thereby as few rows, as possible, where \(C_{E^{\prime }}\) represents the coverage of patterns by the edge subset \(E^{\prime }\). This is an instance of the minimum set cover problem, which is NP-Complete (a candidate solution can be verified in polynomial time, but finding the optimum is intractable). Therefore, we use a greedy algorithm to solve this problem. We use a row importance metric to determine which rows are more important to pick. We define row importance as
\begin{equation} I(r_i) = \sum _{j=1}^{K}I(p_{c_{ij}}), \tag{5} \end{equation}
where \(I(r_i)\) denotes the importance of row \(i\), \(K\) denotes the number of columns, \(p_{c_{ij}}\) denotes the data pattern present in column \(j\) of row \(i\), and \(I(p)\) denotes the importance of a pattern \(p\). The importance of pattern \(p\) is defined as the relative frequency with which \(p\) occurs in the dataset.
We perform the following processing steps on data \(D_N\) before applying the snapshot algorithm (a condensed code sketch follows the list):
(1) All unique value columns are dropped.
(2) Numeric columns are binned, and the numeric values of the column are replaced by bin values.
(3) String columns with patterns are converted to regex patterns, and the string values are replaced by the regex values.
(4) Categorical columns are not processed and are retained as is.
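A minimal sketch of this preprocessing in pandas is shown below; the digit/letter-run abstraction rule and the default of five bins are illustrative assumptions, not the tool's exact rules:

```python
import re
import pandas as pd

def to_regex_pattern(value):
    # Collapse digit runs into num{k} and letter runs into alpha{k},
    # e.g., "17.05.2011" -> "num{2}.num{2}.num{4}" (abstraction rule is illustrative).
    out = re.sub(r"\d+", lambda m: f"num{{{len(m.group())}}}", str(value))
    return re.sub(r"[A-Za-z]+", lambda m: f"alpha{{{len(m.group())}}}", out)

def preprocess(df, n_bins=5):
    processed = {}
    for col in df.columns:
        s = df[col]
        if s.nunique() == len(s):                      # all-unique ID-like column: drop
            continue
        if pd.api.types.is_numeric_dtype(s):           # numeric: bin index is the pattern
            processed[col] = pd.cut(s, bins=n_bins, labels=False)
        elif s.astype(str).str.contains(r"\d").any():  # string-pattern column: regex pattern
            processed[col] = s.astype(str).map(to_regex_pattern)
        else:                                          # categorical: unique values are patterns
            processed[col] = s
    return pd.DataFrame(processed)
```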
Algorithm 1 is then run on the processed data to find \(D_S\) (a sketch of the greedy loop appears after the list below). In each iteration, we rank all rows by their importance and pick the top-ranked row, so that rows with the most frequently occurring patterns are picked first. We then set the importance of already-picked patterns to 0 to avoid giving them weight in the next iteration of row selection. Once all the patterns have been picked, we stop iterating and return the sample dataset \(D_S\). The sampled dataset \(D_S\) is built such that
(1) Rows are ordered to place the most important rows at the top and the least important at the bottom.
(2) We rank and pick sample rows using the information about data patterns from all columns.
(3) We avoid picking patterns that are already covered and pick the most important rows, thereby enabling a small subset of data to be picked with all the patterns.
(4) The number of samples to pick is determined automatically and is not needed as input from the user.
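The greedy loop can be sketched in a few lines of Python; this is a condensed, unoptimised rendering of Algorithm 1 over the processed dataframe from the preprocessing sketch above, not the tool's implementation:

```python
from collections import Counter
import pandas as pd

def dataset_snapshot(processed):
    n = len(processed)
    # Pattern importance = relative frequency of the pattern in the data (Equation (5)).
    importance = {(c, v): cnt / n
                  for c in processed.columns
                  for v, cnt in Counter(processed[c].astype(str)).items()}
    # Pre-compute each row's set of (column, pattern) pairs.
    row_patterns = [{(c, str(processed.at[i, c])) for c in processed.columns}
                    for i in processed.index]
    uncovered, selected = set(importance), []
    while uncovered:
        # Row importance sums the importances of its not-yet-covered patterns;
        # covered patterns contribute 0, mirroring the "set importance to 0" step.
        best = max(range(n),
                   key=lambda r: sum(importance[p] for p in row_patterns[r] & uncovered))
        selected.append(processed.index[best])
        uncovered -= row_patterns[best]
    return selected  # row labels, ordered from most to least important
```

Each iteration covers at least one new pattern, so the loop terminates with a small, ordered subset; on data shaped like the worked example below, it stops as soon as all patterns are covered.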
To explain the working of our algorithm, we take a dataset of 200 rows and four columns, as shown in Figure 2(a). Each column has the following type of values:
Fig. 2.
Fig. 2. Full dataset, its processed version, and the final sample produced by Dataset Snapshot algorithm.
(1) The ID column is a unique column of numbers.
(2) The LEVEL column is a categorical column with LOW, MEDIUM, and HIGH as its categories.
(3) The DATE column is a column with three different date formats: (a) 17.05.2011 (with a full stop as separator), (b) 26/02/2011 (with a forward slash as separator), and (c) 11-01-2011 (with a hyphen as separator).
(4) The SCALE column is a numeric column.
Our algorithm processes each column according to its datatype, and the processed data are shown in Figure 2(b):
(1) Categorical columns like the LEVEL column are retained as is, without any processing. We consider all three unique values, i.e., LOW, MEDIUM, and HIGH, as three patterns in column LEVEL.
(2) Columns with all unique values, like the ID column, are excluded from the processed dataset.
(3) String or date columns with patterns, like the DATE column, are converted to their regex patterns. For example, 17.05.2011 is converted to the num{2}.num{2}.num{4} regex pattern. We therefore end up with three unique patterns for the DATE column: (a) num{2}.num{2}.num{4}, (b) num{2}-num{2}-num{4}, and (c) num{2}/num{2}/num{4}.
(4) For numeric columns like the SCALE column, we bin the data and transform the column to bin numbers. In the SCALE column, we bin the numeric data into five bins and thus end up with five unique values, i.e., 0, 1, 2, 3, and 4. We consider these bin values as five patterns in column SCALE.
We have a total of 11 data patterns (3+3+5) in the processed dataset. We then use Algorithm 1 on the processed dataset to pick all patterns found in it. Finally, we end up with just five samples representing the full dataset as shown in Figure 2(c), and all the patterns from the full dataset are represented in the data snapshot.

4.2 Experiment Results

We have taken 11 open source datasets from Reference [1] and Reference [21] and run our snapshot algorithm on them. We record the number of samples picked by the algorithm. Table 1 shows the results of our analysis. Compared to the full data, the number of samples picked by our algorithm is significantly smaller while still capturing all the data patterns. On average, we picked 3.6% of the rows of the full data for the sample data. We also validate our algorithm’s claim that the sampled data cover all patterns present in the full data. For that, we manually check the number of patterns present in each full original dataset and do the same for the sampled data; the results are shown in Table 1. We can see that the sampled data cover 100% of the patterns present in the original full data for all the datasets. We also produce visualisations for two datasets using standard TSNE plots, shown in Figure 3. In Figure 3, the plots of the Breast Cancer and German Categorical datasets are shown, with blue circles representing the full dataset and yellow squares representing the sampled data points the algorithm identified. The average runtime of the snapshot algorithm was 23.10 seconds, measured across the datasets listed in Table 1. A system with a 2.3-GHz eight-core processor and 64 GB of memory was used for all runtime calculations.
Table 1.
| Dataset Name | Full Data Rows | Sample Data Rows | Full Data Columns | Full Data Patterns | Sample Data Patterns |
| --- | --- | --- | --- | --- | --- |
| mfeat-zernike | 2,000 | 107 | 48 | 938 | 938 |
| mfeat-factors | 2,000 | 204 | 217 | 4,277 | 4,277 |
| mfeat-karhunen | 2,000 | 227 | 65 | 1,281 | 1,281 |
| segment | 2,310 | 67 | 20 | 310 | 310 |
| sick | 3,772 | 87 | 30 | 140 | 140 |
| phoneme | 5,404 | 44 | 6 | 100 | 100 |
| wall-robot-navigation | 5,456 | 126 | 25 | 452 | 452 |
| texture | 5,500 | 58 | 41 | 806 | 806 |
| optdigits | 5,620 | 119 | 65 | 924 | 924 |
| satimage | 6,430 | 65 | 37 | 726 | 726 |
| pendigits | 10,992 | 48 | 17 | 330 | 330 |
Table 1. Experiment Results for the Dataset Snapshot Algorithm
Column “Sample Data Rows” shows the number of rows picked by our algorithm. The last two columns verify that the number of patterns found in the full dataset is also found in the sample dataset. The patterns are counted as a sum of patterns across all columns for the dataset.
Fig. 3.
Fig. 3. TSNE plots of both datasets show that sampled data points replicate the original data distribution.

5 EDA Method Details: Interesting Columns Detection

In our user interviews, we learned that data scientists spend a lot of time plotting and visualising different columns of the dataset to understand their characteristics as part of exploratory data analysis (see Insight 4 in Section 1). It is difficult for a data scientist to go through all the columns and plan operations such as dropping columns, encoding columns, and so on, to get the data ready for the later stages of an ML pipeline. This becomes an acute pain point when the dataset is large, with a very high number of columns. Thus, there is a need to analyse data from multiple perspectives and summarize the information at a column level, showing interesting insights into the different columns. To the best of our knowledge, no other techniques address this in the data science field.
In this section, we discuss our methodology for finding interesting columns in a given dataset. The characteristics of an interesting column are the presence of dominant values, the presence of quality issues, correlations with other columns, and the presence of spurious syntactic patterns. A job-position column may be correlated with multiple columns, e.g., the education, salary, and age of a person. A column with a large number of missing values is of interest to a data scientist who needs to fill the missing values to get the data ready for model training. A business analyst might be interested in a column where one value dominates.
Figure 4 shows an example of an interesting column named relationship from the Adult dataset [31]. Note that the value Husband dominates over other values, as shown in the data distribution in Figure 4(a), and therefore the entropy is low. Moreover, this column is also related to columns age and gender, since every Husband is male and every Wife is female, as shown in the data distribution in Figure 4(b).
Fig. 4.
Fig. 4. Interesting column relationships: Panel (a) shows data distribution of relationship column, and panel (b) shows association between relationship and gender column.
We use four metrics to determine if a column is interesting: pattern score, association score, entropy, and missing fraction. We rank columns based on an interestingness score to show the columns exhibiting interesting properties,
\begin{equation*} interestingness\_score = (1-entropy)+missing\_fraction+association\_score+pattern\_score. \end{equation*}
Entropy: Entropy measures the degree of uncertainty in data. Entropy is maximum for a uniformly distributed random variable and minimum for a constant random variable. For a column with \(n\) unique values, with \(P(x_i)\) denoting the probability of a value \(x_i\) , entropy is computed as follows:
\begin{equation*} entropy = -\sum _{i=1}^n P(x_i)\log P(x_i), \text{ where } P(x_i) = \dfrac{number\ of\ $x_i$\ in\ column}{total\ number\ of\ values\ in\ column}. \end{equation*}
Missing Fraction: Data values in a dataset can be missing due to mishandling or human error. This can be an interesting metric for a data scientist preparing or cleaning the data, especially for training a machine learning model. Missing fraction is computed as the fraction of values missing in a column,
\begin{equation*} missing\_fraction = \dfrac{number\ of\ missing\ values\ in\ a\ column}{number\ of\ rows\ in\ data}. \end{equation*}
Association Score: The association score captures the relation between different columns and determines if a column is related to other columns. We compute the Pearson [51] correlation coefficient for a pair of numeric columns and the uncertainty coefficient, Theil’s \(U\) [65], between categorical columns, considering integer columns with few unique values as categorical to get more accurate correlations. These coefficients and correlations have been used in other works, such as data imputation [28] and data synthesis [55]. The association score is computed as follows:
\begin{equation*} association\_score = \dfrac{number\ of\ associations\ of\ a\ column\ with\ correlation\ \gt 0.5}{number\ of\ associations\ in\ data\ with\ correlation\ \gt 0.5}. \end{equation*}
Pattern Score: The pattern score aims to find the severity of minority patterns in a given column. These patterns capture syntactic abstractions and different variations in column values and are different from the data patterns of Section 4.1, where binned numeric values and unique categorical values serve as patterns. For example, the value Data is mapped to Aaaa, and EMPID1234 is mapped to AAAAA9999. Once the values in a column are mapped to syntactic patterns, they are grouped to identify the frequency of each identified pattern. A pattern is called a minority pattern if it is followed by less than 5% of the cells in a given column. To reduce false positives and not flag minority patterns in columns that do not follow a particular syntax, we first check whether the given column has a dominant or majority pattern, i.e., a pattern that is followed by at least 50% of the cells in the column,
\begin{equation*} pattern\_score = \dfrac{number\ of\ minority\ patterns\ in\ a\ column}{total\ number\ of\ patterns\ in\ a\ column}. \end{equation*}
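All four per-column scores are straightforward to compute with pandas. The sketch below follows the definitions above; normalising entropy by \(\log n\) (so that the \(1-entropy\) term lies in [0, 1]) and the exact syntactic mapping are our assumptions, and the association score is passed in, computed per the Pearson/Theil's \(U\) definition above:

```python
import numpy as np
import pandas as pd

def column_entropy(s):
    # Shannon entropy of the value distribution, normalised by log(n) (our assumption)
    # so that (1 - entropy) is comparable across columns.
    p = s.value_counts(normalize=True, dropna=True)
    return 0.0 if len(p) <= 1 else float(-(p * np.log(p)).sum() / np.log(len(p)))

def missing_fraction(s):
    return float(s.isna().mean())

def pattern_score(s, minority=0.05, majority=0.5):
    # Map values to syntactic patterns: "Data" -> "Aaaa", "EMPID1234" -> "AAAAA9999".
    def syntax(v):
        return "".join("9" if ch.isdigit() else "A" if ch.isupper() else
                       "a" if ch.islower() else ch for ch in str(v))
    freq = s.dropna().map(syntax).value_counts(normalize=True)
    if freq.empty or freq.iloc[0] < majority:   # no dominant pattern: do not flag
        return 0.0
    return float((freq < minority).sum() / len(freq))

def interestingness(s, association_score):
    return (1 - column_entropy(s)) + missing_fraction(s) + association_score + pattern_score(s)
```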
Figure 5 shows an example of an interesting column relationship and an example of a not-so-interesting column occupation from the Adult dataset [31]. Note that the interesting column relationship is shown with tags Low Entropy and High Correlation (on the right top corner). This column is interesting owing to the skewed data distribution, as shown in the distribution plot, and highly correlated column values with other columns, as discussed earlier. However, column occupation is not very interesting according to our metric. The average runtime of the algorithm for detecting interesting columns was found to be 8.8 seconds, as measured across the datasets listed in Table 1.
Fig. 5.
Fig. 5. Interesting column example.

6 DQAI Method Details: Detection of Class Overlap in the Data

One common problem seen in datasets for machine learning tasks is the presence of overlapping regions in the dataset, or class overlap. Class overlap occurs when several data points lie close to each other in the vector space but have different class labels. The presence of such regions makes it difficult to build classification models, as a model is likely to make mistakes in overlapping regions. Depending on the number and sizes of the overlapping regions present in the data, the complexity of building a machine learning model differs. Formally, one can define an overlapping region \(R\) with data points \(x_{1},\ldots ,x_{n}\) having class labels \(y_{1},\ldots ,y_{n}\) drawn from \(k\) classes, where \(n\) is the number of data points, such that the distance between the points is less than a threshold \(\theta\) and the number of distinct classes in the region is greater than 1. This can be represented as follows:
\begin{equation} R = {\left\lbrace \begin{array}{ll} dist(x_{i}, x_{j}) \lt \theta & \forall i, j \in (1..n)\\ |\lbrace y_{1}, \ldots , y_{n}\rbrace | \gt 1 & \text{where $y_{i}$ corresponds to the class label for $x_{i}$}.\\ \end{array}\right.} \tag{6} \end{equation}
Typically, the problem of overlapping regions has been studied in the context of the class imbalance problem [24, 49] by considering the effect of overlap regions between the majority and minority classes. In recent work, the authors of Reference [69] showed with experiments on synthetic datasets that while the presence of class overlap amplifies the class imbalance problem, the reverse may not be true. Their experiments also showed that class overlap can significantly hurt classifier performance, even on balanced datasets. They show their experiments on synthetic datasets using Random Forest as the classifier of choice. While these experiments provide some initial signals, there are still some open questions that need to be answered. Does the presence of overlapping regions in datasets also affect state-of-the-art models like automated machine learning models [39]? Does class overlap affect only certain configurations of datasets, or is it independent of dataset size and shape? Is there any difference based on the number of classes in the dataset? We answer all these questions systematically to better understand the problem of class overlap and its impact on downstream models.

6.1 Understanding the Effect of Overlapping Regions on ML Models

We pick 12 datasets from UCI [21] and Kaggle [1] with variation in the number of rows, columns, and classes (see Table 2). We use the approach followed in Reference [54] to induce \(20\%\) overlap in the data. We then follow the steps outlined below to perform systematic experiments using the AutoML [39] classifier.
Table 2.
| Dataset Name | Number of Columns | Number of Rows | Number of Classes |
| --- | --- | --- | --- |
| mfeat-zernike | 47 | 2,000 | 10 |
| mfeat-factors | 216 | 2,000 | 10 |
| mfeat-karhunen | 64 | 2,000 | 10 |
| segment | 19 | 2,310 | 7 |
| sick | 29 | 3,772 | 2 |
| phoneme | 5 | 5,404 | 2 |
| wall-robot-navigation | 24 | 5,456 | 4 |
| texture | 40 | 5,500 | 11 |
| optdigits | 64 | 5,620 | 10 |
| satimage | 36 | 6,430 | 6 |
| pendigits | 16 | 10,992 | 10 |
| gas-drift | 128 | 13,910 | 6 |
Table 2. Summary of Datasets Used for Experiments for Class Overlap
Steps to understand the impact of class overlap on AutoML classifiers (a condensed sketch of this protocol follows the list):
(1) Divide the dataset into train and test splits. Let us call the train split \({D_{tr}}\) and the test split \({D_{te}}\).
(2) Create a copy \(D^{\prime }\) of the dataset, with the same splits as \(D\). Let us call its train split \({D_{tr^{\prime }}}\) and test split \({D_{te^{\prime }}}\).
(3) Add synthetically generated overlapping data points to \({D_{tr^{\prime }}}\). No changes are made to \({D_{te^{\prime }}}\).
(4) Train classifier \(C\) on the original data \(D\) by applying threefold CV on the training data \({D_{tr}}\), and record the accuracy on the test split \({D_{te}}\).
(5) Train classifier \(C^{\prime }\) on the overlap-induced dataset \(D^{\prime }\) by applying threefold CV on the training data \({D_{tr^{\prime }}}\), and record the accuracy on the test split \({D_{te^{\prime }}}\). Note that \({D_{te}}\) and \({D_{te^{\prime }}}\) are exactly the same.
(6) Compare the accuracy of \(C\) and \(C^{\prime }\) on the test splits.
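The protocol is easy to reproduce in scikit-learn. The sketch below substitutes a RandomForest for the AutoML classifier, simplifies the CV step to a single fit, and uses a plain interpolation-based overlap inducer rather than the exact method of Reference [54]; all of these substitutions are our assumptions:

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

def induce_overlap(X, y, fraction=0.20, seed=0):
    # Simple stand-in for the induction method of Reference [54]: add points
    # interpolated between random pairs; when a pair spans two classes, the
    # new point sits between them but carries only one side's label.
    rng = np.random.default_rng(seed)
    n_new = int(fraction * len(X))
    a, b = rng.integers(0, len(X), n_new), rng.integers(0, len(X), n_new)
    lam = rng.uniform(0.4, 0.6, (n_new, 1))
    return np.vstack([X, lam * X[a] + (1 - lam) * X[b]]), np.concatenate([y, y[a]])

X, y = load_digits(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
X_tr2, y_tr2 = induce_overlap(X_tr, y_tr)

for name, (Xt, yt) in {"original": (X_tr, y_tr), "overlap-induced": (X_tr2, y_tr2)}.items():
    clf = RandomForestClassifier(random_state=0).fit(Xt, yt)
    print(name, round(clf.score(X_te, y_te), 3))  # identical untouched test split
```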
We train 24 models over the 12 datasets by following the steps outlined above using threefold CV. For each of these models, we keep the settings of the AutoML classifier the same, except for the change in the training dataset. For each model, we report the accuracy numbers on the test split in Table 3. Note that the test split is the same for a given dataset for both \(C\) and \(C^{\prime }\). We observe that the accuracy of the classifier trained on the overlap datasets drops for all the datasets, with an average drop of \(5.9\%\), a minimum drop of \(2.5\%\), and a maximum drop of \(15\%\). This provides strong evidence that the presence of class overlap degrades the accuracy of state-of-the-art ML models. We do not see any patterns in the effect of the number of classes or other dataset characteristics on the impact of class overlap on ML models. We next describe an algorithm to detect class overlap in the data. Our algorithm both detects overlap regions and explains why a certain region is considered one.
Table 3.
| Dataset Name | Accuracy on raw data (D) | Accuracy on overlap added data (D\(^{\prime }\)) | Accuracy difference |
| --- | --- | --- | --- |
| mfeat-zernike | 0.967 | 0.891 | 0.076 |
| mfeat-factors | 0.982 | 0.957 | 0.025 |
| mfeat-karhunen | 0.983 | 0.906 | 0.077 |
| segment | 0.993 | 0.947 | 0.046 |
| sick | 0.998 | 0.943 | 0.055 |
| phoneme | 0.988 | 0.943 | 0.045 |
| wall-robot-navigation | 0.982 | 0.953 | 0.029 |
| texture | 0.993 | 0.937 | 0.056 |
| optdigits | 0.989 | 0.955 | 0.034 |
| satimage | 0.989 | 0.942 | 0.047 |
| pendigits | 0.995 | 0.952 | 0.043 |
| gas-drift | 0.898 | 0.748 | 0.15 |
Table 3. Impact of Class Overlap Issue on Classifier Accuracy

6.2 Algorithm to Detect Class Overlap

Our goal is to detect the class overlap problem to give the user insight into the challenges in the data before they start any data preparation work. We propose a graph-based method for class overlap detection, where the nodes of a graph are the data points in the dataset and the edges are weighted by the distance between the points. Every vertex stores the data point id and the corresponding class label. The graph construction starts with every vertex forming an edge with its \(k\) neighboring vertices. Next, for every vertex in the graph, the class labels of the neighboring vertices are compared. If all the neighbors have the same class label, then the edges between these vertices are pruned. We also prune all vertices of degree 0 from the graph. As a next step, for every vertex with a degree greater than \(d\), we find the connected components. For every connected component, we count the number of vertices belonging to different classes. If the ratio is less than \(r\), then the edges are pruned and that connected component is dissolved. The remaining connected components are the overlap regions in the dataset. For every overlap region, we find the feature ranges for every data point and find the features and corresponding ranges that cause the overlap in that region. The connected component can also be visualised using standard graph drawing tools to further explain the reason for the overlap. We also quantify the amount of overlap as the number of points in the overlap regions divided by the total number of data points. Our proposed method for class overlap detection has the following properties (a condensed sketch appears after the list):
Quantifies the amount of overlap by giving a score that ranges from 0 to 1, so as to objectively compare the amount of overlap across the datasets
Insights on which classes contribute to overlap and the percentage of contribution for each class
Index of data points belonging to each overlapping region along with the feature ranges that cause an overlap to provide explanations to users
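A condensed sketch of the detection step follows, built on scikit-learn's k-nearest-neighbour graph and scipy's connected components. It assumes integer-encoded labels, uses illustrative values for the thresholds \(k\), \(d\), and \(r\), and omits the feature-range explanation step:

```python
import numpy as np
from scipy.sparse.csgraph import connected_components
from sklearn.neighbors import kneighbors_graph

def detect_overlap_regions(X, y, k=5, d=1, r=0.1):
    y = np.asarray(y)
    A = kneighbors_graph(X, n_neighbors=k, mode="connectivity").tolil()
    # Prune edges of vertices whose neighbourhoods are label-pure.
    for i in range(len(y)):
        if all(y[j] == y[i] for j in A.rows[i]):
            A.rows[i], A.data[i] = [], []
    A = A.tocsr()
    A = A.maximum(A.T)  # symmetrise before finding components
    n_comp, comp = connected_components(A, directed=False)
    regions = []
    for c in range(n_comp):
        members = np.where(comp == c)[0]
        if len(members) <= d:      # size filter; also drops isolated vertices
            continue
        counts = np.bincount(y[members])
        counts = counts[counts > 0]
        if len(counts) < 2 or counts.min() / counts.sum() < r:
            continue               # dissolve components dominated by a single class
        regions.append(members)
    overlap_score = sum(len(m) for m in regions) / len(y)  # score in [0, 1]
    return regions, overlap_score
```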

6.3 Results

To demonstrate the effectiveness of our algorithm, we use the same 12 datasets and synthetically add overlapping data points, which serve as ground truth. We then calculate the precision and recall of our algorithm on these datasets. We add \(30\%\) overlap by using the method described in Reference [54]. Table 4 shows the results on precision and recall for the different datasets. The average precision for class overlap detection is 0.865, and the average recall is 0.7825. These results demonstrate the effectiveness of the proposed approach. The runtime of the proposed algorithm was measured to be 0.96 seconds, on average, across the datasets listed in Table 2.
Table 4.
| Dataset Name | Precision | Recall |
| --- | --- | --- |
| mfeat-zernike | 0.8 | 0.93 |
| mfeat-factors | 0.87 | 0.88 |
| mfeat-karhunen | 0.67 | 1 |
| segment | 0.93 | 0.83 |
| sick | 1 | 0.68 |
| phoneme | 0.92 | 0.81 |
| wall-robot-navigation | 0.8 | 0.72 |
| texture | 0.84 | 0.98 |
| optdigits | 0.8 | 0.1 |
| satimage | 0.89 | 0.855 |
| pendigits | 0.91 | 0.90 |
| gas-drift | 0.96 | 0.71 |
Table 4. Results on Class Overlap Detection

7 DQAI Method Details: Label Noise

The correctness of the training data labels plays a very important role in determining the quality of the training data used to build ML models. Most real-world datasets that are generated or annotated have some inconsistent/noisy labels [48]. Training data with noisy or inconsistent labels is one of the important issues in classification settings, with a potential impact on the data science pipeline. For instance, label noise can lead to a decrease in model accuracy, an increase in model complexity, and an increase in the number of training samples required [19]. It is therefore important to detect and correct noisy samples in the dataset. Formally, the label noise operator can be defined as an operator that analyses and identifies inconsistencies or noise in the training data labels. For all the noisy labels detected, we also recommend clean labels.

7.1 Understanding the Effect of Presence of Label Noise on ML Models

Noise in the data mostly follows a random distribution. To analyze the need for a label noise algorithm, we pick diverse datasets from the UCI [21] and Kaggle [1] repositories to capture variations in the number of rows, columns, and classes. We collected 21 such datasets (shown in Table 5) that meet these requirements. We use threefold cross-validation, where in each fold we introduce \(10\%\) random label noise per class in the training set only and do not make any changes to the test set (a sketch of this injection step follows). We then follow steps similar to those outlined in Section 6 for our experiments, with the only difference that instead of inducing overlap, we induce label noise in the data.
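The injection step can be expressed in a few lines; in this sketch the replacement class for each flipped sample is drawn uniformly from the other classes, which is our assumption:

```python
import numpy as np

def add_label_noise(y_train, noise=0.10, seed=0):
    # Flip `noise` fraction of labels per class in the training split only.
    rng = np.random.default_rng(seed)
    y_orig = np.asarray(y_train)
    y_noisy = y_orig.copy()
    classes = np.unique(y_orig)
    for c in classes:
        idx = np.where(y_orig == c)[0]
        flip = rng.choice(idx, size=int(noise * len(idx)), replace=False)
        y_noisy[flip] = rng.choice(classes[classes != c], size=len(flip))
    return y_noisy
```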
Table 5.
| Dataset Name | Number of Columns | Number of Rows | Number of Classes |
| --- | --- | --- | --- |
| credit-approval | 15 | 690 | 2 |
| mfeat-pixel | 240 | 2,000 | 10 |
| mfeat-zernike | 47 | 2,000 | 10 |
| cardiotocography | 35 | 2,126 | 10 |
| mfeat-morphological | 6 | 2,000 | 10 |
| soybean | 35 | 683 | 19 |
| phoneme | 5 | 5,404 | 2 |
| banknote | 4 | 1,372 | 2 |
| texture | 40 | 5,500 | 11 |
| balance-scale | 4 | 625 | 3 |
| mfeat-fourier | 76 | 2,000 | 10 |
| pendigits | 16 | 10,992 | 10 |
| wall-robot-navigation | 24 | 5,456 | 4 |
| spambase | 57 | 4,601 | 2 |
| mfeat-karhunen | 64 | 2,000 | 10 |
| segment | 19 | 2,310 | 7 |
| mushroom | 22 | 8,124 | 2 |
| spambase-reduced | 15 | 4,601 | 2 |
| waveform-5000 | 40 | 5,000 | 3 |
| eeg-eye-state | 14 | 14,980 | 2 |
| kr-vs-kp | 36 | 3,196 | 2 |
Table 5. Summary of Datasets Used for Experiments for Label Noise
Table 6 shows the effect of inducing \(10\%\) random noise on the performance of the AutoML classifier [39]. We can clearly observe a drop in the performance of the classifiers after noise is induced. In some cases, there is no drop in accuracy; since the data points for inducing label noise were selected randomly, we hypothesize that in these cases the points may be far from the classifier boundary and do not impact the model. For 13 of the 21 datasets, there is a decrease in performance of more than \(1\%\). There are 6 datasets where the decrease in performance is high (greater than \(2\%\)), and the maximum drop observed for these datasets is \(4\%\). This provides strong evidence that label noise degrades the accuracy of state-of-the-art ML models and needs to be detected and corrected. With this motivation, we next describe an algorithm to detect label noise in the data.
Table 6.
| Dataset Name | Accuracy on raw data (D) | Accuracy on noise added data (D\(^{\prime }\)) | Accuracy difference |
| --- | --- | --- | --- |
| credit-approval | 0.86 | 0.85 | 0.01 |
| mfeat-pixel | 0.95 | 0.95 | 0.00 |
| mfeat-zernike | 0.78 | 0.78 | 0.00 |
| cardiotocography | 1 | 1 | 0.00 |
| mfeat-morphological | 0.72 | 0.71 | 0.01 |
| soybean | 0.90 | 0.86 | 0.04 |
| phoneme | 0.88 | 0.85 | 0.03 |
| banknote | 1 | 0.98 | 0.02 |
| texture | 0.99 | 0.99 | 0.00 |
| balance-scale | 0.88 | 0.88 | 0.00 |
| mfeat-fourier | 0.81 | 0.80 | 0.01 |
| pendigits | 0.99 | 0.99 | 0.00 |
| wall-robot-navigation | 1 | 0.99 | 0.01 |
| spambase | 0.94 | 0.92 | 0.02 |
| mfeat-karhunen | 0.94 | 0.93 | 0.01 |
| segment | 0.97 | 0.96 | 0.01 |
| mushroom | 1 | 1 | 0.00 |
| spambase-reduced | 0.83 | 0.83 | 0.00 |
| waveform-5000 | 0.87 | 0.86 | 0.01 |
| eeg-eye-state | 0.94 | 0.92 | 0.02 |
| kr-vs-kp | 0.99 | 0.97 | 0.02 |
Table 6. Impact of Label Noise Issue on AutoAI Classifier Accuracy
D denotes the raw data, and D\(^{\prime }\) denotes the data after induction of \(10\%\) random noise.

7.2 Algorithm Details

The proposed label noise detection algorithm (see Algorithm 2) is built on top of the confident learning–based approach [48] (CleanLab). One limitation of this approach is that it can tag some correct samples as noisy if they lie in an overlap region, thereby generating false positives. An overlap region is a region where several data points lie close to each other in the vector space but have different class labels. We improve the algorithm to address this problem in two ways: (1) we use our proposed overlap region detection algorithm (discussed in Section 6.2) to prune, from the list of possible noisy-sample candidates, those samples that were flagged only because of the confusing probability distribution in an overlapping region (orange blocks in Figure 6; refer to Algorithm 2), and (2) we further propose effective neighborhood-based strategies (blue blocks in Figure 6) that analyze neighboring samples together with the label suggested by the probability distribution-based approach (green blocks in Figure 6) to further prune the identified noisy candidate samples, which helps in reducing false positives.
Fig. 6.
Fig. 6. Proposed label noise detection framework.
Figure 6 illustrates the proposed label noise framework, which takes a noisy dataset as input and generates a list of noisy points with suggested labels. The noisy data are passed through a sequence of steps that compare classwise mean probabilities with sample probabilities to detect possible noisy-sample candidates. The system also utilizes the probability information to find an initial set of label suggestions for the detected candidates. As a next step, the system refines the list of detected noisy samples by removing the candidates that belong to overlap/confusing regions. Finally, the retained samples are fed into the nearest-neighbor analysis module, which further prunes the remaining candidates. The final list of noisy samples, with suggestions for the correct labels, is then shown to the user for further analysis. The detailed steps of the proposed label noise algorithm are given in Algorithm 2.
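The flow in Figure 6 can be approximated on top of the open source CleanLab package. In the sketch below, find_label_issues supplies the probability-based candidate step, detect_overlap_regions is the Section 6.2 sketch above, a RandomForest stands in for the underlying classifier, and the majority-vote threshold is illustrative:

```python
import numpy as np
from cleanlab.filter import find_label_issues
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_predict
from sklearn.neighbors import NearestNeighbors

def detect_label_noise(X, y, k=5):
    y = np.asarray(y)
    # Green blocks: probability-based noisy candidates and suggested labels.
    pred_probs = cross_val_predict(RandomForestClassifier(random_state=0),
                                   X, y, cv=3, method="predict_proba")
    candidates = find_label_issues(labels=y, pred_probs=pred_probs)
    suggested = pred_probs.argmax(axis=1)
    # Orange blocks: drop candidates that fall inside detected overlap regions.
    regions, _ = detect_overlap_regions(X, y)
    in_overlap = np.zeros(len(y), dtype=bool)
    for members in regions:
        in_overlap[members] = True
    candidates = candidates & ~in_overlap
    # Blue blocks: keep a candidate only if most of its neighbours agree with
    # the suggested label rather than the observed one.
    nbrs = NearestNeighbors(n_neighbors=k + 1).fit(X).kneighbors(X)[1]
    noisy = [i for i in np.where(candidates)[0]
             if np.mean(y[nbrs[i, 1:]] == suggested[i]) > 0.5]
    return np.array(noisy), suggested
```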

7.3 Results

We validate the improvement of our proposed algorithm by comparing its precision and recall (see Table 7) with CleanLab as a baseline. We demonstrate these results on the same 21 datasets that we used to demonstrate the impact of noise on AutoAI classifier performance. Table 7 shows that our algorithm outperforms CleanLab in terms of precision. The overall average precision over the 21 datasets is 0.93 for the proposed algorithm and 0.74 for CleanLab. The overall average recall of the proposed algorithm is around 0.76, and for CleanLab it is 0.81. At the cost of a 0.05 drop in recall, our algorithm outperforms the state of the art by 0.19 in precision.
Table 7.
| Dataset Name | Precision (Our) | Precision (CleanLab) | Recall (Our) | Recall (CleanLab) |
| --- | --- | --- | --- | --- |
| credit-approval | 0.63 | 0.50 | 0.60 | 0.65 |
| mfeat-pixel | 1 | 0.99 | 0.73 | 0.74 |
| mfeat-zernike | 0.96 | 0.25 | 0.73 | 0.78 |
| cardiotocography | 1 | 0.97 | 0.74 | 0.74 |
| mfeat-morphological | 0.70 | 0.16 | 0.74 | 0.83 |
| soybean | 0.96 | 0.80 | 0.63 | 0.84 |
| phoneme | 0.85 | 0.45 | 0.82 | 0.82 |
| banknote | 1 | 1 | 0.83 | 0.84 |
| texture | 0.99 | 0.99 | 0.83 | 0.84 |
| balance-scale | 0.97 | 0.39 | 0.70 | 0.89 |
| mfeat-fourier | 0.86 | 0.30 | 0.67 | 0.73 |
| pendigits | 0.99 | 0.99 | 0.82 | 0.83 |
| wall-robot-navigation | 0.99 | 0.97 | 0.76 | 0.84 |
| spambase | 0.91 | 0.74 | 0.80 | 0.80 |
| mfeat-karhunen | 0.99 | 0.98 | 0.68 | 0.71 |
| segment | 1 | 0.98 | 0.81 | 0.82 |
| mushroom | 1 | 1 | 0.81 | 0.82 |
| spambase-reduced | 0.86 | 0.74 | 0.86 | 0.88 |
| waveform-5000 | 0.94 | 0.64 | 0.66 | 0.87 |
| eeg-eye-state | 0.96 | 0.70 | 0.78 | 0.81 |
| kr-vs-kp | 0.98 | 0.96 | 0.87 | 0.89 |
Table 7. Comparison of (a) Precision and (b) Recall of Proposed Label Noise Algorithm with CleanLab Algorithm [48]
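The evaluation protocol above can be illustrated with a short sketch: inject \(10\%\) random label flips (mirroring the setup of Table 6) and score a detector's output against the known flipped indices. This is a hedged reading of the protocol, not the article's exact evaluation code; details such as the random seed and the uniform choice of replacement class are assumptions.

```python
import numpy as np

def inject_label_noise(y, frac=0.10, seed=0):
    """Flip `frac` of labels uniformly at random; return the noisy labels and
    the set of flipped indices (the ground truth for evaluation)."""
    rng = np.random.default_rng(seed)
    y = np.asarray(y)
    y_noisy = y.copy()
    flipped = rng.choice(len(y), size=int(frac * len(y)), replace=False)
    classes = np.unique(y)
    for i in flipped:
        y_noisy[i] = rng.choice(classes[classes != y[i]])  # any other class
    return y_noisy, set(flipped.tolist())

def precision_recall(detected, truly_noisy):
    """Score detected indices against the known flipped indices."""
    detected = set(detected)
    tp = len(detected & truly_noisy)
    precision = tp / len(detected) if detected else 0.0
    recall = tp / len(truly_noisy) if truly_noisy else 0.0
    return precision, recall
```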
Table 8 shows the impact of correcting noisy data by applying the label noise recommendations on AutoAI classifier performance. In Table 8, D \(^{\prime }\) represents the noisy data, and D \(^{\prime \prime }\) represents the cleaned data after applying the recommendations from the label noise algorithm. We observe a clear improvement in AutoAI classifier performance for some of the datasets, while for the others performance remains the same. For 6 of the 21 datasets, performance increases by more than \(1\%\), with a maximum observed improvement of \(5\%\). These results demonstrate the effectiveness of the proposed label noise algorithm. The runtime of the proposed algorithm was measured to be 14.87 seconds, on average, across the datasets listed in Table 5.
Table 8.
Dataset Name | Accuracy on noise added data (D \(^{\prime }\)) | Accuracy on cleaned data (D \(^{\prime \prime }\)) | Accuracy difference
credit-approval | 0.85 | 0.86 | 0.01
mfeat-pixel | 0.95 | 0.95 | 0.00
mfeat-zernike | 0.78 | 0.78 | 0.00
cardiotocography | 1 | 1 | 0.00
mfeat-morphological | 0.71 | 0.71 | 0.00
soybean | 0.86 | 0.91 | 0.05
phoneme | 0.85 | 0.86 | 0.01
banknote | 0.98 | 0.99 | 0.01
texture | 0.99 | 0.99 | 0.00
balance-scale | 0.85 | 0.85 | 0.00
mfeat-fourier | 0.80 | 0.80 | 0.00
pendigits | 0.99 | 0.99 | 0.00
wall-robot-navigation | 0.99 | 0.99 | 0.00
spambase | 0.92 | 0.93 | 0.01
mfeat-karhunen | 0.93 | 0.94 | 0.01
segment | 0.96 | 0.96 | 0.00
mushroom | 1 | 1 | 0.00
spambase-reduced | 0.83 | 0.83 | 0.00
waveform-5000 | 0.86 | 0.86 | 0.00
eeg-eye-state | 0.92 | 0.92 | 0.01
kr-vs-kp | 0.97 | 0.97 | 0.00
Table 8. Impact of Cleaning the Label Noise Issue on AutoAI Classifier Accuracy
D \(^{\prime }\) denotes noisy data after induction of \(10\%\) random noise, and D \(^{\prime \prime }\) denotes the cleaned data after applying the proposed label noise recommendations.
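A minimal sketch of the cleaning step behind Table 8 follows: detected samples are relabeled with the algorithm's suggestions, and a classifier is retrained on D \(^{\prime }\) and D \(^{\prime \prime }\) for comparison. A plain random forest stands in for the AutoAI pipelines used in the article, and `noisy_idx` and `suggestions` are assumed to come from a detector such as the earlier sketch.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

def accuracy_before_after_cleaning(X, y_noisy, noisy_idx, suggestions):
    """Compare accuracy when training on D' (noisy) vs. D'' (cleaned) labels."""
    y_noisy = np.asarray(y_noisy)
    y_clean = y_noisy.copy()
    y_clean[noisy_idx] = np.asarray(suggestions)[noisy_idx]  # apply recommendations

    scores = {}
    for name, labels in [("D'", y_noisy), ("D''", y_clean)]:
        X_tr, X_te, y_tr, y_te = train_test_split(
            X, labels, test_size=0.25, random_state=0
        )
        model = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)
        scores[name] = accuracy_score(y_te, model.predict(X_te))
    return scores
```

Note that, for brevity, this sketch evaluates each variant against its own held-out labels; a stricter comparison would score both models against the original clean labels.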

8 User Study to Understand the Usefulness of the Algorithms

In this article, we have covered several aspects of EDA and data quality for ML analysis, including the results of our proposed algorithms. We have also provided a tool that streamlines these processes by automatically generating code snippets, saving time and reducing the risk of errors. To further assess the usefulness of these algorithms and the automated tool in the ML lifecycle, we conduct a user study with 16 data scientists from IBM. The study aims to answer the following questions: (a) Does the tool help a user reduce the time spent in the data preprocessing phase? (b) Does it have any impact on the user's productivity? (c) Does it have any impact on the performance of the model that is built?

8.1 Design of the Study

To understand the usefulness of the algorithms and the tool that encapsulates them, we design our study by dividing the data scientists into two groups: a control group and a user group. All of these data scientists work in horizontal platform teams and thus across different verticals like finance, healthcare, and so on. We keep all conditions the same for both groups, except that the user group has access to the tool and the functionalities described in Sections 3 and 4–7. We recruit 16 data scientists with varying levels of experience: each group has three data scientists with 0–3 years of experience and five within the bracket of 3–7 years. We specifically check for years of experience in the data science area, not overall experience. We assign participants to the two groups randomly, such that both groups have the same average level of experience.
Task Details: All participants are tasked with building a model for a given dataset within the same time limit. We choose the kc2 dataset from the OpenML repository [56] and impose no other constraints on the steps that can be used in the process, such as the choice of model. To eliminate any bias that may arise from familiarity with popular datasets, we intentionally choose a dataset that is not widely known. We divide the data into training, validation, and test sets and provide the training and validation sets to the data scientists; the test set is withheld and used to evaluate the models they build. We give all the data scientists 3 hours to complete the task, a duration decided by running a pilot with two data scientists and observing the time they took to build a model on a small dataset like kc2. To avoid logistic issues, we share the dataset on a shared drive, give all participants code snippets for basic tasks like data loading and model saving, and install the packages commonly needed by data scientists on the virtual machines used for the study. We share an instructions document describing the task and dataset details, and we make it clear before the study that the goal is to understand the steps taken by a data scientist, not to evaluate their skill, so that they can work on the dataset in their natural mode.
Metrics used for the study: We decide on the metrics to capture as part of the study design; they are shown in Table 9. We define an operator as an atomic unit of operation; for example, the detection of missing values is considered one operator. To capture these metrics, we request the data scientists to share a screen recording of their work during the study and the notebook that they used for their analysis. The metrics are designed to observe the impact on time savings, productivity, and the quality of the dataset as measured by the quality of the model.
Table 9.
Metric | Source | Control Group | User Group
Number of exploratory operators used for EDA and DQ | Video | Y | Y
Number of operators that were found useful and retained in the notebook | Notebook | Y | Y
Number of code errors during EDA and DQ steps | Video | Y | Y
Time spent (in mins) in fixing code errors | Video | Y | Y
Total lines of code that a user had to write for data preprocessing | Notebook | N | Y
Model Score (F1) | Notebook | Y | Y
Table 9. Metrics to Be Captured for User Study
Training for the tool: We provide a 1-hour training session to all data scientists in the user group to make them familiar with the functionalities of the tool described in Section 3. We provide a mechanism for users from both groups to reach out to us in case of any logistic challenges.

8.2 Study Results and Analysis

We collect 48 hours of video and 16 notebooks as a result of the study. We analyse these data to derive the metrics described in Table 9 for each user and show the average results for each metric in Table 10.
Table 10.
Metric | Control Group (Average values) | User Group (Average values)
Number of exploratory operators used for EDA and DQ | 11 | 20
Number of operators that were found useful and retained in the notebook | 5.78 | 17.11
Number of code errors during EDA and DQ steps | 2.28 | 0.2
Time spent (in mins) in fixing code errors | 3.67 | 0.33
Total lines of code that a user had to write for data preprocessing | NA | 10.11
Model Score (F1) | 71.63% | 77.22%
Table 10. Analysis from the User Study
In Table 10, we observe that the group using the tool was able to execute almost \(2\times\) more operators than the control group. This can be attributed to the availability of code snippets in the tool for the user group, which can be executed directly rather than writing the code for each operator. It can also be seen that participants in the user group retained \(85\%\) of the operators, as opposed to \(52.54\%\) in the control group, from which we infer that the operators provided in the tool were genuinely useful to the participants. Notably, all four operators corresponding to the four algorithms described in this article were retained by \(100\%\) of the users in the user group.

Participants from the user group were also able to save time in the data preprocessing stage, as both the number of coding errors and the time spent fixing them are lower for the user group than for the control group. We also analysed how many lines of code the user-group participants had to write manually. From the notebooks, we found that new lines of code were added for two reasons: (1) code written to drop non-useful columns following the interesting-columns analysis (Section 5) and (2) code written for dataset visualisation using t-SNE or PCA. We plan to make code snippets available for these functionalities in the next version of the tool.

Finally, we analyse how the proposed tool helps improve model accuracy. To test this effectively, we keep the test split of the dataset hidden; we download each data scientist's model from their notebook and write code to run inference on the test set (a sketch of this scoring step follows the list below), recording the F1 score of every model. We find that the model performance for the user group was, on average, about \(6\%\) higher than for the control group. We also analysed the videos of all control group participants and noted that they did not write any code to find the deep insights given by the algorithms from Sections 6 and 7. We believe that finding and correcting these issues in the data contributed to the better model performance of the user-group participants. We thus conclude from the study that our algorithms and tool helped the participants along the following two dimensions:
Productivity and Time Savings: Participants in the user group were able to run \(2\times\) more operators and retain \(3\times\) more operators for their final analysis in the same amount of time. This can be attributed to the availability of ready-made code as well as of advanced operators that are not available in open source or existing DQ libraries.
Model Performance: Improvement in model accuracy (6%) for the user group due to the ability to find deep issues in the datasets.
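The sketch below illustrates the scoring step mentioned above, under the assumption that participants saved their models as pickle files; the directory layout and file naming here are ours, not part of the study tooling.

```python
import glob
import pickle

from sklearn.metrics import f1_score

def score_submissions(model_dir, X_test, y_test):
    """Load each participant's saved model and compute its F1 score
    on the hidden test split."""
    results = {}
    for path in sorted(glob.glob(f"{model_dir}/*.pkl")):
        with open(path, "rb") as f:
            model = pickle.load(f)
        results[path] = f1_score(y_test, model.predict(X_test))
    return results
```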

9 User Study in the Wild

In the previous section, we described a user study performed in a constrained environment, as we wanted to compare and contrast how the tool aids a data scientist. In this section, we describe insights from a user study in which we relax the constraints of the previous section. We work with an external customer, installing the tool in their environment and allowing them to use it on their own datasets. This study is done with three data scientists: the first with 0–2 years of experience, the second with 4–7 years, and the third with more than 15 years of experience in the data science field. We provide detailed training on the tool and ongoing support for questions as they use it. We capture their feedback via a feedback survey as well as an interview with the users. We again want to capture productivity and time-savings metrics, and we do so by asking the users to note the time taken using the tool and to compare it with the time they had taken in the past without the tool. All three users report in the survey that their estimate of time savings is about 25–30%. On probing this topic further in the interview, the most experienced user explains that this estimate is conservative and that she expects the savings to exceed \(50\%\); she attributes this both to the automation currently provided by the tool and to the confidence in its results (which avoids the need for a recheck by a different team member). We also ask the users to list the typical operations they would perform and count the number of operations performed using our tool. From the survey results and the discussion, a novice user (0–3 years) sees a productivity improvement of \(4\times\), whereas an experienced user sees an improvement of \(2\times\). The users also appreciate the design of the tool, especially that it gives them a head start, offers novel operators that surface insights that might otherwise be missed, and can be extended as required. Both studies show that the tool gives time and productivity gains to a data scientist.

10 Conclusion and Next Steps

We have presented our innovations in the space of exploratory data analysis and data quality for AI. We discover the challenges faced by data scientists through several user interviews, present a framework and four novel algorithms that address these pain points, and present a way to automate them for a data scientist.

First, we propose an algorithm to produce a snapshot of the data so that a data scientist can easily understand the dataset. The algorithm outputs a subset of rows that covers all the variations present in each column: for categorical columns, all possible categories; for numerical columns, values from every bin after binning the data; and for string data, values covering every type of pattern. It finds this subset by picking as few rows as possible, surfacing rows that contribute to variations in more than one column, and we have manually verified on several datasets that the subset indeed covers all the variations in every column. Next, we present an innovative way to rank the columns of a dataset: we define an interestingness score and a mechanism to show the results to a data scientist, with tags explaining why a column is interesting for that dataset. We then discuss an algorithm to find confusing or overlapping regions in the dataset, explain why this problem is important to detect, and show that it affects the accuracy of state-of-the-art AutoML models; we propose a new algorithm to detect overlapping regions and test it rigorously on open source datasets. We also discuss our proposed algorithm for label noise and show results on open source datasets.

From our user interviews, we also learn that another pain point for data scientists is the time spent writing code to explore the data and find data quality issues. We therefore build a tool that generates code for the data preparation phases and encapsulates all the algorithms proposed in this article. Finally, we present two user studies with data scientists showing that our tool and algorithms help data scientists improve productivity, save time, and improve model accuracy. As part of our next steps, we plan to build scalable technologies that address very large datasets and to extend our innovations on EDA and DQ to inter-table analysis.

References

[1]
Kaggle Repository. Retrieved from https://www.kaggle.com.
[3]
2020. Test Driven Data Analysis. Retrieved from https://github.com/tdda.
[5]
2022. DataRobot. Retrieved from https://www.datarobot.com.
[6]
2022. Automated Machine Learning with Scikit-learn. Retrieved from https://github.com/automl/auto-sklearn.
[7]
2022. AutoML Vision—Google Cloud AutoML. Retrieved from https://cloud.google.com//automl.
[8]
2022. Cleaning Big Data: Most Time-Consuming, Least Enjoyable Data Science Task, Survey Says. Retrieved from https://www.forbes.com/sites/gilpress/2016/03/23/data-preparation-most-time-consuming-least-enjoyable-data-science-task-survey-says/?sh=1304712a6f63.
[9]
2022. Data Preparation: Time Consuming and Tedious? Retrieved from https://rapidminer.com/blog/data-prep-time-consuming-tedious/.
[11]
2022. A Python Automated Machine Learning Tool that Optimizes Machine Learning Pipelines Using Genetic Programming. Retrieved from https://github.com/EpistasisLab/tpot.
[12]
2022. Scalable AutoML in H2O-3 Open Source. Retrieved from https://h2o.ai/platform/h2o-automl/.
[13]
Shazia Afzal, C. Rajmohan, Manish Kesarwani, Sameep Mehta, and Hima Patel. 2021. Data readiness report. In SMDS’21. IEEE.
[14]
Gustavo E. A. P. A. Batista and Maria Carolina Monard. 2003. An analysis of four missing data treatment methods for supervised learning. Appl. Artif. Intell. 17, 5-6 (2003), 519–533.
[15]
Felix Biessmann, Tammo Rukat, Phillipp Schmidt, Prathik Naidu, Sebastian Schelter, Andrey Taptunov, Dustin Lange, and David Salinas. 2019. DataWig: Missing value imputation for tables. J. Mach. Learn. Res. 20, 175 (2019), 1–6. http://jmlr.org/papers/v20/18-753.html
[16]
Felix Bießmann, David Salinas, Sebastian Schelter, Philipp Schmidt, and Dustin Lange. 2018. “Deep” learning for missing value imputationin tables with non-numerical data. In CIKM’18.
[17]
Carsten Binnig, Lorenzo De Stefani, Tim Kraska, Eli Upfal, Emanuel Zgraggen, and Zheguang Zhao. 2017. Toward sustainable insights, or why polygamy is bad for you. In CIDR’17.
[18]
Eric Breck, Neoklis Polyzotis, Sudip Roy, Steven Whang, and Martin Zinkevich. 2019. Data validation for machine learning. In MLSys’19.
[19]
Carla E. Brodley and Mark A. Friedl. 1999. Identifying mislabeled training data. J. Artif. Intell. Res. 11 (1999), 131–167.
[20]
Simon Brugman. 2019. pandas-profiling: Exploratory Data Analysis for Python. Retrieved from https://github.com/pandas-profiling/pandas-profiling.
[21]
Dheeru Dua and Casey Graff. 2017. UCI Machine Learning Repository. Retrieved from http://archive.ics.uci.edu/ml.
[22]
Wenfei Fan. 2015. Data quality: From theory to practice. ACM SIGMOD Rec. 44, 3 (2015), 7–18.
[23]
Chen Feng, Georgios Tzimiropoulos, and Ioannis Patras. 2022. SSR: An Efficient and Robust Framework for Learning with Unknown Label Noise. arXiv preprint arXiv:2111.11288 (2021).
[24]
Vicente García, Roberto Alejo, José Salvador Sánchez, José Martínez Sotoca, and Ramón Alberto Mollineda. 2006. Combined effects of class imbalance and class overlap on instance-based classification. In IDEAL’06.
[25]
Aritra Ghosh, Himanshu Kumar, and P. Shanti Sastry. 2017. Robust loss functions under label noise for deep neural networks. In AAAI’17.
[26]
Lovedeep Gondara and Ke Wang. 2018. MIDA: Multiple imputation using denoising autoencoders. In PAKDD’18.
[27]
Leo A. Goodman. 1961. Snowball sampling. Ann. Math. Stat. (1961), 148–170.
[28]
Sandeep Hans, Diptikalyan Saha, and Aniya Aggarwal. 2022. Explainable data imputation using constraints.
[29]
Ihab F. Ilyas, Xu Chu, et al. 2015. Trends in cleaning relational data: Consistency and deduplication. Found. Trends Databases 5, 4 (2015), 281–393.
[30]
Manas Joglekar, Hector Garcia-Molina, and Aditya Parameswaran. 2017. Interactive data exploration with smart drill-down. IEEE Trans. Knowl. Data Eng. 31, 1 (2017), 46–60.
[31]
Ronny Kohavi and Barry Becker. 1996. Census Income Data. Retrieved from https://archive.ics.uci.edu/ml/datasets/adult.
[32]
Pradap Venkatramanan Konda. 2018. Magellan: Toward Building Entity Matching Management Systems. The University of Wisconsin—Madison.
[33]
Yehuda Koren, Robert M. Bell, and Chris Volinsky. 2009. Matrix factorization techniques for recommender systems. IEEE Comput. 42, 8 (2009).
[34]
Sanjay Krishnan, Michael J. Franklin, Ken Goldberg, and Eugene Wu. 2017. Boostclean: Automated error detection and repair for machine learning. arXiv:1711.01299. Retrieved from https://arxiv.org/abs/1711.01299.
[35]
Han Kyu Lee and Seoung Bum Kim. 2018. An overlap-sensitive margin classifier for imbalanced and overlapping data. Expert Syst. Appl. 98 (2018), 72–83.
[36]
Shikun Li, Xiaobo Xia, Shiming Ge, and Tongliang Liu. 2022. Selective-supervised contrastive learning with noisy labels. In CVPR’22. 316–325.
[37]
Kevin J. Liang, Samrudhdhi B. Rangrej, Vladan Petrovic, and Tal Hassner. 2022. Few-shot learning with noisy labels. In CVPR’22. 9089–9098.
[38]
R. J. A. Little and D. B. Rubin. 2002. Statistical Analysis with Missing Data. Wiley.
[39]
Sijia Liu, Parikshit Ram, Deepak Vijaykeerthy, Djallel Bouneffouf, Gregory Bramble, Horst Samulowitz, Dakuo Wang, Andrew Conn, and Alexander Gray. 2020. An ADMM based framework for automl pipeline configuration. In AAAI’20.
[40]
Pierre-Alexandre Mattei and Jes Frellsen. 2019. MIWAE: Deep generative modelling and imputation of incomplete data sets. In ICML’19.
[41]
Rahul Mazumder, Trevor Hastie, and Robert Tibshirani. 2010. Spectral regularization algorithms for learning large incomplete matrices. J. Mach. Learn. Res. (2010).
[42]
Aditya Krishna Menon, Ankit Singh Rawat, Sashank J. Reddi, and Sanjiv Kumar. 2019. Can gradient clipping mitigate label noise? In ICLR’19.
[43]
Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Efficient estimation of word representations in vector space. arXiv:1301.3781. Retrieved from https://arxiv.org/abs/1301.3781.
[44]
Claus Adolf Moser. 1952. Quota sampling. J. Roy. Stat. Soc. Ser. A (Gen.) 115, 3 (1952), 411–423.
[45]
Nagarajan Natarajan, Inderjit S. Dhillon, Pradeep K. Ravikumar, and Ambuj Tewari. 2013. Learning with noisy labels. In NIPS’13.
[46]
Gonzalo Navarro. 2001. A guided tour to approximate string matching. ACM Comput. Surv. 33, 1 (2001), 31–88.
[47]
Alfredo Nazábal, Pablo M. Olmos, Zoubin Ghahramani, and Isabel Valera. 2020. Handling incomplete heterogeneous data using VAEs. Pattern Recognition 107 (2020), 107501.
[48]
Curtis G. Northcutt, Lu Jiang, and Isaac L. Chuang. 2021. Confident learning: Estimating uncertainty in dataset labels. Journal of Artificial Intelligence Research 70 (2021), 1373–1411.
[49]
Boutkhoum Omar, Furqan Rustam, Arif Mehmood, Gyu Sang Choi, et al. 2021. Minimizing the overlapping degree to improve class-imbalanced learning under sparse feature selection: application to fraud detection. IEEE Access 9 (2021), 28101–28110.
[50]
Van L. Parsons. 2014. Stratified sampling. In Wiley StatsRef: Statistics Reference Online, 1–11.
[51]
Karl Pearson. 1900. X. On the criterion that a given system of deviations from the probable in the case of a correlated system of variables is such that it can be reasonably supposed to have arisen from random sampling. Philos. Mag. J. Sci. 50, 302 (1900), 157–175.
[52]
Kathy Razmadze, Yael Amsterdamer, Amit Somech, Susan B. Davidson, and Tova Milo. 2022. Selecting Sub-tables for Data Exploration. arXiv preprint (2022).
[53]
Theodoros Rekatsinas, Xu Chu, Ihab F. Ilyas, and Christopher Ré. 2017. Holoclean: Holistic data repairs with probabilistic inference. arXiv:1702.00820. Retrieved from https://arxiv.org/abs/1702.00820.
[54]
José A. Sáez, Mikel Galar, and Bartosz Krawczyk. 2019. Addressing the overlapping data problem in classification using the one-vs-one decomposition strategy. IEEE Access 7 (2019), 83396–83411.
[55]
Diptikalyan Saha, Aniya Aggarwal, and Sandeep Hans. 2022. Data synthesis for testing black-box machine learning models. In CODS COMAD.
[56]
J. Sayyad Shirabad and T. J. Menzies. 2005. The PROMISE Repository of Software Engineering Databases. School of Information Technology and Engineering, University of Ottawa, Canada. Retrieved from https://www.openml.org/search?type=data&sort=runs&id=1063&status=active.
[57]
Joseph L. Schafer. 1997. Analysis of Incomplete Multivariate Data. CRC Press.
[58]
Sebastian Schelter, Dustin Lange, Philipp Schmidt, Meltem Celikel, Felix Biessmann, and Andreas Grafberger. 2018. Automating large-scale data quality verification. Proc. VLDB Endow. 11, 12 (2018), 1781–1794.
[59]
Philip Sedgwick. 2013. Convenience sampling. Br. Med. J. 347 (2013).
[60]
Shrey Shrivastava, Dhaval Patel, Nianjun Zhou, Arun Iyengar, and Anuradha Bhamidipaty. 2020. DQLearn: A toolkit for structured data quality learning. In Big Data’20. IEEE, 1644–1653.
[61]
Tarique Siddiqui, Albert Kim, John Lee, Karrie Karahalios, and Aditya Parameswaran. 2016. Effortless data exploration with zenvisage: An expressive and interactive visual analytics system. arXiv:1604.03583 (2016). Retrieved from https://arxiv.org/abs/1604.03583.
[62]
D. Singh and Padam Singh. 1977. New systematic sampling. J. Stat. Plan. Infer. 1, 2 (1977), 163–177.
[63]
Shaoxu Song, Yu Sun, Aoqian Zhang, Lei Chen, and Jianmin Wang. 2018. Enriching data imputation under similarity rule constraints. In TKDE’18.
[64]
Daniel J. Stekhoven and Peter Bühlmann. 2012. MissForest—Non-parametric missing value imputation for mixed-type data. Bioinformatics (2012).
[65]
Henri Theil. 1970. On the estimation of relationships involving qualitative variables. Am. J. Sociol. 76, 1 (1970), 103–154.
[66]
Olga G. Troyanskaya, Michael N. Cantor, Gavin Sherlock, Patrick O. Brown, Trevor Hastie, Robert Tibshirani, David Botstein, and Russ B. Altman. 2001. Missing value estimation methods for DNA microarrays. Bioinformatics 17 (2001).
[67]
Manasi Vartak, Sajjadur Rahman, Samuel Madden, Aditya Parameswaran, and Neoklis Polyzotis. 2015. Seedb: Efficient data-driven visualization recommendations to support visual analytics. In VLDB’15.
[68]
Jeffrey Scott Vitter. 1984. Faster methods for random sampling. Commun. ACM 27, 7 (1984), 703–718.
[69]
Pattaramon Vuttipittayamongkol, Eyad Elyan, and Andrei Petrovski. 2021. On the class overlap problem in imbalanced data classification. Knowl.-bas. Syst. 212 (2021), 106631.
[70]
Haitao Xiong, Junjie Wu, and Lu Liu. 2010. Classification with class overlapping: A systematic study. In ICEBI’10.
[71]
Mohamed Yakout, Laure Berti-Équille, and Ahmed K. Elmagarmid. 2013. Don’t be scared: Use scalable automatic repairing with maximal likelihood and bounded changes. In SIGMOD’13. 553–564.
[72]
Kun Yi and Jianxin Wu. 2019. Probabilistic end-to-end noise correction for learning with noisy labels. In CVPR’19. 7017–7025.
[73]
Jinsung Yoon, James Jordon, and Mihaela van der Schaar. 2018. GAIN: Missing data imputation using generative adversarial nets. In ICML’18.
[74]
HaiYue Yu and KunHong Liu. 2017. Classification of multi-class microarray datasets using a minimizing class-overlapping based ECOC algorithm. In ICBCB’17.
[75]
Hongbao Zhang, Pengtao Xie, and Eric P. Xing. 2018. Missing value imputation based on deep generative models.
[76]
Zheguang Zhao, Lorenzo De Stefani, Emanuel Zgraggen, Carsten Binnig, Eli Upfal, and Tim Kraska. 2017. Controlling false discoveries during interactive data exploration. In SIGMOD’17. 527–540.
