Let us consider
\(D_N\) as a full dataset with
\(N\) number of rows, and it consists of a finite set of data patterns denoted by
\(P(D_N)\) . We define a data pattern as the following: We process the data in every column to a finite set of patterns. For categorical columns, we denote each unique value as a pattern. For numeric columns, we bin the data and take the bin index as a pattern to be covered. For string pattern columns like dates and phone numbers, we convert string to regex patterns and take each regex pattern as a pattern to be covered. Unique columns like ID columns are excluded from the data, since no pattern can be derived from them. Thus a data pattern for the entire dataset can be defined as follows:
where
\(P(C_i)\) represents set of patterns present in column
\(i\) in
\(D_N\) . Our goal is to pick a minimal sample dataset
\(D_S\) such that
\(P(D_S)=P(D_N)\) . Note that when
\(S=N\) , the condition is automatically satisfied. We formalise our goal as follows. Let us define
and we pick an optimal sample by optimising the condition
thereby picking sample data with minimum samples. This problem can be viewed as a Bi-Partite Graph
\(G = (P, R, E)\) where
\(P\) represents all patterns in
\(D_N\) ,
\(R\) is the set of all rows in
\(D_N\) , and
\(E\) represents the edges connecting partitions
\(P\) and
\(R\) . In this bi-partite graph, we optimise the condition
to select as minimum edges as possible, thereby selecting minimum rows as possible, such that
\(C_e(P) = P(D_N)\) , where
\(C_e\) represents coverage of patterns by edges
\(e\) . Finding the optimal solution for this problem takes polynomial time, and it is NP-Complete. Therefore, we use a greedy algorithm to solve this problem. We use
row importance metric to determine which rows are more important to pick. We define row importance as
where
\(I(r_i)\) denotes importance of row
\(i\) ,
\(K\) denotes number of columns,
\(p_{c_{ij}}\) denotes the data pattern present in column
\(j\) of row
\(i\) , and
\(I(p)\) denotes importance of a pattern
\(p\) . The importance of pattern
\(p\) is defined as the relative frequency of pattern
\(p\) occurring in the dataset.