1 Introduction
Classic Machine Learning (ML) mainly includes data preprocessing, feature extraction, feature engineering, predictive modeling, and evaluation. The evolution of deep AI, however, has resulted in a new principled and widely used paradigm: i) collecting data, ii) computing data representations, and iii) applying ML models. Indeed, the success of ML algorithms highly depends on data representation [1]. Building a good representation space is critical and fundamental because it can help to 1) identify and disentangle the underlying explanatory factors hidden in observed data, 2) ease the extraction of useful information in predictive modeling, 3) reconstruct distance measures to form discriminative and machine-learnable patterns, and 4) embed structural knowledge and priors into the representation space, thus making classic ML algorithms applicable to complex graph, spatiotemporal, sequence, multi-media, or even hybrid data.
In this article, we study the problem of learning to reconstruct an optimal and traceable feature representation space to advance a downstream ML task (Figure 1). Formally, given a set of original features, a prediction target, and a downstream ML task (e.g., classification and regression), the objective is to automatically reconstruct an optimal and traceable set of features for the ML task by mathematically transforming the original features.
Prior literature has partially addressed the problem. The first relevant work is feature engineering, which designs preprocessing, feature extraction, feature selection [11, 19], and feature generation [16] to extract a transformed representation of the data. These techniques are essential but labor-intensive, highlighting the limited ability of current ML practice to automate the extraction of a discriminative feature representation space.
Issue 1 (full automation): how can we make ML less dependent on feature engineering, construct ML systems faster, and expand the scope and applicability of ML? The second relevant work is representation learning, such as factorization [8], embedding [9], and deep representation learning [32, 34]. These studies are devoted to learning effective latent features. However, the learned features are implicit and non-explainable. Such traits limit the deployment of these approaches in many application scenarios (e.g., patient and biomedical domains) that require not just high predictive accuracy but also a trusted understanding and interpretation of the underlying drivers.
Issue 2 (explainable explicitness): how can we ensure that the reconstructed representation space is traceable and explainable? The third relevant work is learning-based feature transformation, such as principal component analysis [3], traversal transformation graph based feature generation [16], and sparsity regularization based feature selection [13]. These methods are either deeply embedded into or totally irrelevant to a specific ML model. For example, LASSO regression extracts an optimal feature subset for regression, but not for any given ML model.
Issue 3 (flexible optimal): how can we create a framework to reconstruct a new representation space for any given predictor? The three issues are well-known challenges. Our goal is to develop a new perspective to address these issues.
In the preliminary work [35], we propose a novel and principled Traceable Group-wise Reinforcement Generation Framework to address the automation, explicitness, and optimality issues in representation space reconstruction. Specifically, we view feature space reconstruction as a traceable iterative process that combines feature generation and feature selection. The generation step creates new features by mathematically transforming the original features, and the selection step avoids feature set explosion. To make the process self-optimizing, we formulate it as three Markov Decision Processes (MDPs): one to select the first meta feature, one to select an operation, and one to select the second meta feature. The two candidate features are transformed by applying the selected operation to them to generate new ones. We develop a cascading agent structure to coordinate the three MDPs and learn better feature transformation policies. Meanwhile, we propose a group-wise feature generation mechanism to accelerate feature space reconstruction and augment reward signals so that the cascading agents can learn clear policies.
While our initial research has achieved significant results, there remains scope for further improvement. To improve the performance and generalization capabilities of our initial work, we explore two distinct yet interconnected avenues: 1) Refinement of the existing framework: We propose a graph-based state representation method, enabling reinforced agents to perceive the current feature space more effectively. Furthermore, we employ enhanced Q-learning strategies to mitigate the overestimation of Q-values during policy learning. 2) Introduction of a novel optimization approach: We utilize an Actor-Critic policy learning method to train the overall framework. It aims to expedite model convergence and enhance feature transformation performance.
In the first avenue, we assume that a robust state representation can empower reinforced agents to accurately comprehend the current feature space, thereby facilitating the learning of more intelligent policies for feature transformation. Consequently, for the graph-based state representation method, we first construct a comprehensive attributed graph predicated on feature-feature correlations. Each node signifies a feature, each edge weight denotes the correlation between two nodes, and each node attribute corresponds to the feature vector. Then, we capture the knowledge of the entire feature set and the interactions between features via message passing and graph reconstruction, preserving this information in an embedding vector as a state representation. In addition, we recognize Q-value overestimation as a critical challenge that hinders the unbiased learning of transformation policies, subsequently leading to sub-optimal model performance. To mitigate this issue, we employ Q-learning with a dual network structure: one network for action selection and a second for Q-value estimation. Importantly, these two networks are not updated concurrently. This asynchronous update strategy decouples the max operation in traditional Q-learning, thereby reducing Q-value overestimation and enabling the learning of more robust policies.
In the second avenue, we examine the optimization technique of the original framework and observe that Q-learning treats all candidate features as the complete action space, which results in substantial computational costs when our framework deals with high-dimensional spaces. Simultaneously, the \(\epsilon\)-greedy-based learning strategy of Q-learning makes it difficult for reinforced agents to balance the trade-off between efficiency and effectiveness. Consequently, we utilize the Actor-Critic technique to optimize the entire feature transformation framework in order to directly learn intelligent feature-feature crossing policies. In particular, we develop three Actor-Critic agents to select candidate features and operations for feature transformation. The actor component directly learns the selection policy through policy gradient updates and makes the action selection by sampling. The critic component leverages gradient signals from temporal-difference errors at each iteration to evaluate and improve selection policies. The collaboration between the actor and critic components can expedite model convergence and produce effective combinations for feature transformation.
In summary, in this article, we develop a generic and principled framework: group-wise reinforcement feature transformation, for optimal and explainable representation space reconstruction. The contributions of this article are summarized as follows:
(1)
We formulate the feature space reconstruction as an interactive process of nested feature generation and selection, where feature generation is to generate new meaningful and explicit features, and feature selection is to remove redundant features to control feature sizes.
(2)
We develop a group-wise reinforcement feature transformation framework to automatically and efficiently generate an optimal and traceable feature set for a downstream ML task, without requiring extensive professional experience or human intervention.
(3)
We present refinements for the original framework. In particular, a graph-based state representation method is proposed to capture the feature-feature correlations and the knowledge of the entire feature set, improving the comprehension of the reinforced agents. A decoupled Q-learning strategy is employed to alleviate Q-value overestimation and learn unbiased and effective policies.
(4)
We utilize the Actor-Critic optimization technique to reimplement the self-optimizing feature transformation framework. The actor component directly learns the selection policies for candidate features and operations instead of finding a surrogate value function. The critic component evaluates the selection policy and provides feedback to the actor so that it learns a better policy in the next iteration.
(5)
Extensive experiments and case studies are conducted to demonstrate the efficacy of our method. Compared with the preliminary work, we add the outlier detection task with five additional datasets, evaluate different state representation approaches, compare different Q-learning strategies and optimization techniques, and provide comprehensive analyses.
3 Optimal and Explainable Feature Space Reconstruction
In this section, we present an overview, and then detail each technical component of our framework.
3.1 Framework Overview
Figure 2 shows our proposed framework, Group-wise Reinforcement Feature Generation (GRFG). In the first step, we cluster the original features into different feature groups by maximizing intra-group feature cohesion and inter-group feature distinctness. In the second step, we leverage a group-operation-group strategy to cross two feature groups to generate multiple features each time. For this purpose, we develop a novel cascading reinforcement learning method to learn three agents that select the two most informative feature groups and the most appropriate operation from the operation set. As a key enabling technique, the cascading reinforcement method coordinates the three agents to share their perceived states in a cascading fashion, that is, (agent1, state of the first feature group), (agent2, fused state of the operation and agent1's state), and (agent3, fused state of the second feature group and agent2's state), in order to learn better choice policies. After the two feature groups and the operation are selected, we generate new features via a group-group crossing strategy. In particular, if the operation is unary, for example, sqrt, we choose the feature group of higher relevance to the target labels from the two feature groups, and apply the operation to this more relevant feature group to generate new features. If the operation is binary, we choose the K most distinct feature pairs from the two feature groups, and apply the binary operation to the chosen feature pairs to generate new features. In the third step, we add the newly generated features to the original features to create a generated feature set. We feed the generated feature set into a downstream task to collect predictive accuracy as reward feedback to update the policy parameters. Finally, we employ feature selection to eliminate redundant features and control the dimensionality of the newly generated feature set, which is then used as the original feature set to restart the iterations and regenerate new features until the maximum number of iterations is reached.
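For concreteness, the following self-contained Python sketch imitates one pass of this loop on toy data; random choices stand in for the learned cascading agents, a correlation score stands in for the downstream-task reward, and all names and thresholds are illustrative rather than taken from the actual GRFG implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for the GRFG loop: random policies replace the learned cascading agents.
def toy_grfg(X, y, n_iters=3, size_budget=20):
    ops = {"add": np.add, "mul": np.multiply, "sqrt": lambda a: np.sqrt(np.abs(a))}
    for _ in range(n_iters):
        groups = np.array_split(rng.permutation(X.shape[1]), 3)   # crude stand-in for M-Clustering
        g1, g2 = rng.choice(len(groups), size=2, replace=False)   # stand-ins for agents 1 and 3
        op_name = rng.choice(list(ops))                           # stand-in for the operation agent
        if op_name == "sqrt":                                     # unary: transform one group
            new = ops[op_name](X[:, groups[g1]])
        else:                                                     # binary: cross the two groups
            new = np.column_stack([ops[op_name](X[:, i], X[:, j])
                                   for i in groups[g1] for j in groups[g2]])
        X = np.column_stack([X, new])                             # add generated features
        if X.shape[1] > size_budget:                              # K-best selection by |corr(f, y)|
            scores = np.abs([np.corrcoef(X[:, k], y)[0, 1] for k in range(X.shape[1])])
            X = X[:, np.argsort(-scores)[:size_budget]]
    return X

X, y = rng.normal(size=(100, 6)), rng.normal(size=100)
print(toy_grfg(X, y).shape)
```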
3.2 Generation-oriented Feature Grouping
One of our key findings is that group-wise feature generation can accelerate generation and exploration and, moreover, augment the reward feedback of agents to learn clear policies. Inspired by this finding, our system starts with generation-oriented feature clustering, which aims to create feature groups that are meaningful for group-group crossing. Another insight is that crossing features of high (low) information distinctness is more (less) likely to generate meaningful variables in a new feature space. As a result, unlike classic clustering, the objective of feature grouping is to maximize inter-group feature information distinctness and minimize intra-group feature information distinctness. To achieve this goal, we propose M-Clustering for feature grouping, which starts with each feature as a feature group and then merges the most similar pair of feature groups at each iteration.
Distance Between Feature Groups: A Group-level Relevance-Redundancy Ratio Perspective. To achieve the goal of minimizing intra-group feature distinctness and maximizing inter-group feature distinctness, we develop a new distance measure to quantify the distance between two feature groups. We highlight two interesting findings: 1) relevance to the predictive target: if the relevance between one feature group and the predictive target is similar to the relevance between another feature group and the predictive target, the two feature groups are similar; 2) mutual information: if the mutual information between the features of the two groups is large, the two feature groups are similar. Based on the two insights, we devise a feature group-group distance. The distance can be used to evaluate the distinctness of two feature groups and, further, to estimate how likely crossing the two feature groups is to generate more meaningful features. Formally, the distance is given by:
\[ dis(\mathcal {C}_i,\mathcal {C}_j) = \frac{1}{|\mathcal {C}_i|\cdot |\mathcal {C}_j|}\sum _{f_i\in \mathcal {C}_i}\sum _{f_j\in \mathcal {C}_j}\frac{|MI(f_i,y)-MI(f_j,y)|}{MI(f_i,f_j)+\epsilon }, \]
where \(\mathcal {C}_i\) and \(\mathcal {C}_j\) denote two feature groups, \(|\mathcal {C}_i|\) and \(|\mathcal {C}_j|\), respectively, are the numbers of features in \(\mathcal {C}_i\) and \(\mathcal {C}_j\), \(f_i\) and \(f_j\) are two features in \(\mathcal {C}_i\) and \(\mathcal {C}_j\), respectively, and \(y\) is the target label vector. In particular, \(|MI(f_i,y)-MI(f_j,y)|\) quantifies the difference in relevance to \(y\) between \(f_i\) and \(f_j\). If \(|MI(f_i,y)-MI(f_j,y)|\) is small, \(f_i\) and \(f_j\) have a similar influence on classifying \(y\). \(MI(f_i,f_j)+\epsilon\) quantifies the redundancy between \(f_i\) and \(f_j\), where \(\epsilon\) is a small value used to prevent the denominator from being zero. If \(MI(f_i,f_j)+\epsilon\) is large, \(f_i\) and \(f_j\) share more information.
Feature Group Distance based M-Clustering Algorithm: We develop a group-group distance instead of a point-point distance, and under such a group-level distance, the shape of a feature cluster can be non-spherical. Therefore, it is not appropriate to use K-means or density-based methods. Inspired by agglomerative clustering, given a feature set \(\mathcal {F}\), we propose a three-step method: 1) INITIALIZATION: we regard each feature in \(\mathcal {F}\) as a small feature cluster. 2) REPEAT: we calculate the information overlap between any two feature clusters and determine which cluster pair is the most closely overlapped; we then merge the two clusters into one and remove the two original clusters. 3) STOP CRITERIA: we iterate the REPEAT step until the distance between the closest feature group pair reaches a certain threshold. Although the classic stopping criterion is to stop when only one cluster remains, using the distance between the closest feature groups as the stopping criterion better helps us to semantically understand, gauge, and identify the information distinctness among feature groups, and it eases implementation in practical deployment.
The mutual information \(MI\) used above is defined as
\[ MI(X_1,X_2)=\sum _{x_1\in X_1}\sum _{x_2\in X_2}p(x_1,x_2)\log \frac{p(x_1,x_2)}{p(x_1)p(x_2)}, \]
where \(X_1\) and \(X_2\) are two vectors, \(x_1\) and \(x_2\) are elements of \(X_1\) and \(X_2\), respectively, \(p(x_1,x_2)\) is the joint distribution of \(X_1\) and \(X_2\), and \(p(x_1)\) and \(p(x_2)\) are the marginal distributions of \(X_1\) and \(X_2\), respectively. The greater the value of MI, the more information the two vectors share. By using this distance metric, we can group features that share similar information and have a similar impact on identifying the target label into a single feature cluster. Our framework begins by clustering the feature set into feature clusters. This step groups similar features into a single cluster and dissimilar features into different clusters, because we presume that generated features are meaningful when the interacting features come from different information distributions; otherwise, the generated features tend to be redundant. Considering that the boundary and information distribution of feature clusters vary over the feature generation process, we develop the new M-Clustering algorithm rather than using a conventional clustering technique.
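To make the grouping procedure concrete, the following self-contained Python sketch implements the group-group distance and the agglomerative merging loop described above; the nearest-neighbor MI estimator from scikit-learn and the stopping threshold are illustrative choices, not those of the original implementation.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_regression

def mi(a, b):
    # Nearest-neighbor estimate of mutual information between two feature vectors.
    return mutual_info_regression(a.reshape(-1, 1), b, random_state=0)[0]

def group_distance(Ci, Cj, y, eps=1e-6):
    # Relevance-redundancy ratio averaged over all cross-group feature pairs.
    total = 0.0
    for fi in Ci:
        for fj in Cj:
            total += abs(mi(fi, y) - mi(fj, y)) / (mi(fi, fj) + eps)
    return total / (len(Ci) * len(Cj))

def m_clustering(features, y, threshold=2.0):   # threshold is an illustrative value
    # Start from singleton groups and repeatedly merge the closest pair until the
    # smallest group-group distance exceeds the threshold.
    groups = [[f] for f in features]
    while len(groups) > 1:
        pairs = [(group_distance(groups[i], groups[j], y), i, j)
                 for i in range(len(groups)) for j in range(i + 1, len(groups))]
        d, i, j = min(pairs)
        if d > threshold:
            break
        groups[i] = groups[i] + groups[j]
        del groups[j]
    return groups

rng = np.random.default_rng(0)
X, y = rng.normal(size=(200, 5)), rng.normal(size=200)
print([len(g) for g in m_clustering(list(X.T), y)])
```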
3.3 Cascading Reinforcement Feature Groups and Operation Selection
To achieve group-wise feature generation, we need to select a feature group, an operation, and another feature group to perform group-operation-group based crossing. Two key findings inspire us to leverage cascading reinforcement.
Firstly, we highlight that although it is hard to define and program the optimal selection criteria for feature groups and operations, we can view the selection criteria as a form of machine-learnable policies. We propose three agents to learn these policies by trial and error.
Secondly, we find that the three selection agents form a cascading, sequentially auto-correlated decision structure rather than being independent and separate. Here, “cascading” refers to the fact that within each iteration the agents make decisions sequentially, and a downstream agent awaits the completion of an upstream agent. The decision of an upstream agent changes the environment states of downstream agents. As shown in Figure 3, the first agent picks the first feature group based on the state of the entire feature space, the second agent picks the operation based on the entire feature space and the selection of the first agent, and the third agent chooses the second feature group based on the entire feature space and the selections of the first and second agents.
We next propose two generic metrics to quantify the usefulness (reward) of a feature set, and then form three MDPs to learn three selection policies.
Two Utility Metrics for Reward Quantification. The two utility metrics are from the supervised and unsupervised perspectives.
Metric 1: Integrated Feature Set Redundancy and Relevance. We propose a metric to quantify feature set utility from an information-theoretic perspective: a higher-utility feature set has less redundant information and is more relevant to the prediction targets. Formally, given a feature set \(\mathcal {F}\) and a predictive target label \(y\), this utility metric can be calculated by
\[ U(\mathcal {F}|y) = -\frac{1}{|\mathcal {F}|^2}\sum _{f_i,f_j\in \mathcal {F}}MI(f_i,f_j) + \frac{1}{|\mathcal {F}|}\sum _{f\in \mathcal {F}}MI(f,y), \]
where \(MI\) is the mutual information, \(f_i, f_j, f\) are features in \(\mathcal {F}\), and \(|\mathcal {F}|\) is the size of the feature set \(\mathcal {F}\).
Metric 2: Downstream Task Performance. Another utility metric is whether the feature set can improve a downstream task (e.g., regression or classification). We use the performance indicator of a downstream predictive task (e.g., 1-RAE, Precision, Recall, F1) as a utility metric of a feature set.
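The two metrics can be instantiated, for example, as follows; the MI estimator, the skipping of self-pairs in the redundancy term, and the 1-RAE example are illustrative simplifications rather than the exact formulation above.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_regression

def mi(a, b):
    # Nearest-neighbor MI estimate between two vectors (an approximation).
    return mutual_info_regression(a.reshape(-1, 1), b, random_state=0)[0]

def utility(F, y):
    # Metric 1: reward average relevance to y, penalize average pairwise redundancy.
    n = F.shape[1]
    relevance = sum(mi(F[:, i], y) for i in range(n)) / n
    redundancy = sum(mi(F[:, i], F[:, j])
                     for i in range(n) for j in range(n) if i != j) / (n * n)
    return relevance - redundancy

def one_minus_rae(y_true, y_pred):
    # Metric 2 example: 1 - relative absolute error for a regression task.
    return 1 - np.sum(np.abs(y_true - y_pred)) / np.sum(np.abs(y_true - np.mean(y_true)))

rng = np.random.default_rng(1)
F, y = rng.normal(size=(300, 4)), rng.normal(size=300)
print(utility(F, y), one_minus_rae(y, np.full_like(y, y.mean())))
```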
Learning Selection Agents of Feature Group 1, Operation, and Feature Group 2. Leveraging the two metrics, we next develop three MDPs to learn three agent policies that select the best feature group 1, operation, and feature group 2.
Learning the Selection Agent of Feature Group 1. The feature group agent 1 iteratively selects the best meta feature group 1. Its learning system includes: i) Action: its action at the tth iteration is the meta feature group 1 selected from the feature groups of the previous iteration, denoted by \(a_t^1 = \mathcal {C}^1_{t-1}\). ii) State: its state at the tth iteration is a vectorized representation of the generated feature set of the previous iteration. Let Rep be a state representation method; the state can be denoted by \(s_t^1 = Rep(\mathcal {F}_{t-1})\). We discuss the state representation methods in Section 3.5. iii) Reward: its reward at the tth iteration is the utility score of the selected feature group 1, denoted by \(\mathcal {R}(s_t^1,a_t^1)=U(\mathcal {F}_{t-1}|y)\).
Learning the Selection Agent of Operation. The operation agent iteratively selects the best operation (e.g., +, -) from an operation set as a feature crossing tool for feature generation. Its learning system includes: i) Action: its action at the tth iteration is the selected operation, denoted by \(a_t^o = o_t\). ii) State: its state at the tth iteration is the combination of \(Rep(\mathcal {F}_{t-1})\) and the representation of the feature group selected by the previous agent, denoted by \(s^o_t = Rep(\mathcal {F}_{t-1}) \oplus Rep(\mathcal {C}_{t-1}^1)\), where \(\oplus\) indicates the concatenation operation. iii) Reward: the selected operation is used to generate new features by feature crossing, and we combine these new features with the original feature set to form a new feature set. Thus, the feature set at the tth iteration is \(\mathcal {F}_t = \mathcal {F}_{t-1} \cup g_t\), where \(g_t\) denotes the newly generated features. The reward of this iteration is the improvement in the utility score of the feature set compared with the previous iteration, denoted by \(\mathcal {R}(s_t^o,a_t^o) = U(\mathcal {F}_t|y) - U(\mathcal {F}_{t-1}|y)\).
Learning the Selection Agent of Feature Group 2. The feature group agent 2 iteratively selects the best meta feature group 2 for feature generation. Its learning system includes: i) Action: its action at the tth iteration is the meta feature group 2 selected from the clustered feature groups of the previous iteration, denoted by \(a_t^2 = \mathcal {C}^2_{t-1}\). ii) State: its state at the tth iteration is the combination of \(Rep(\mathcal {F}_{t-1})\), \(Rep(\mathcal {C}_{t-1}^1)\), and the vectorized representation of the operation selected by the operation agent, denoted by \(s_t^2 = Rep(\mathcal {F}_{t-1})\oplus Rep(\mathcal {C}_{t-1}^1) \oplus Rep(o_t)\). iii) Reward: its reward at the tth iteration is the improvement of the feature set utility plus the feedback of the downstream task, denoted by \(\mathcal {R}(s_t^2, a_t^2) = U(\mathcal {F}_t|y)-U(\mathcal {F}_{t-1}|y)+V_{A_t}\), where \(V_{A_t}\) is the performance (e.g., F1) of a downstream predictive task at the tth iteration.
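The following toy snippet illustrates how the three cascading states and rewards are assembled; the vector sizes, utility values, and operation-set size are placeholders, not values from the framework.

```python
import numpy as np

# Illustrative shapes only: the Rep(.) vectors stand in for the state representations
# of Section 3.5, and the utility / downstream scores are placeholders.
rep_F_prev = np.random.rand(49)        # Rep(F_{t-1}), e.g., descriptive-statistics state
rep_C1 = np.random.rand(49)            # Rep(C1_{t-1}), first selected feature group
rep_op = np.eye(8)[2]                  # Rep(o_t), one-hot over a hypothetical 8-operation set

s1 = rep_F_prev                                    # state of the feature-group-1 agent
s_op = np.concatenate([rep_F_prev, rep_C1])        # state of the operation agent
s2 = np.concatenate([rep_F_prev, rep_C1, rep_op])  # state of the feature-group-2 agent

U_prev, U_curr, downstream_perf = 0.31, 0.42, 0.85  # placeholder utility / task scores
r1 = U_prev                                  # reward of agent 1: U(F_{t-1} | y)
r_op = U_curr - U_prev                       # reward of agent 2: utility improvement
r2 = U_curr - U_prev + downstream_perf       # reward of agent 3: improvement + task performance
print(s1.shape, s_op.shape, s2.shape, r1, r_op, r2)
```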
3.4 Optimization Technique
In the preliminary work [35], we use a value-based optimization technique to train the entire framework. To further enhance the generalization ability and performance of our framework, we adopt a new optimization perspective, actor-critic, to learn the feature transformation policies.
Value-based Optimization Perspective. We train the three agents by maximizing the discounted cumulative reward during the iterative feature generation process. In other words, we encourage the cascading agents to collaborate to generate a feature set that is independent, informative, and performs well in the downstream task. For each agent, we first define the memory unit of the tth exploration as \(\mathcal {M}_t = (s_t, a_t, r_t, s_{t+1}, a_{t+1})\). In the preliminary work, we minimize the temporal difference error \(\mathcal {L}\) based on Q-learning. The optimization objective can be formulated as:
\[ \mathcal {L} = \big (r_t + \gamma \max _{a_{t+1}}Q(s_{t+1},a_{t+1}) - Q(s_t,a_t)\big)^2, \]
where \(Q(\cdot)\) is the Q function that estimates the value of choosing action \(a\) given state \(s\), and \(\gamma\) is a discount factor, a constant between 0 and 1. It emphasizes rewards in the near future and diminishes the significance of rewards in the uncertain distant future, ensuring the convergence of the cumulative reward.
However, Q-learning frequently overestimates the Q-value of each action, resulting in suboptimal policy learning. The reason is that traditional Q-learning employs a single Q-network for both Q-value estimation and action selection: the max operation in Equation (4) accumulates overly optimistic Q-values during the learning process, leading to Q-value overestimation.
Improving the Q-learning. In this journal version, we propose a new training strategy, namely GRFG\(^\upsilon\), to alleviate this issue. Specifically, we use two independent Q-networks with the same structure to conduct Q-learning. The first Q-network is used to select the action with the highest Q-value, and the index of the selected action is used to query the Q-value in the second Q-network for Q-value estimation. The optimization goal can be formulated as follows:
\[ \mathcal {L} = \big (r_t + \gamma \, Q^{-}\big (s_{t+1}, \arg \max _{a_{t+1}}Q(s_{t+1},a_{t+1})\big) - Q(s_t,a_t)\big)^2, \]
where \(Q(\cdot)\) and \(Q^{-}(\cdot)\) are two independent networks with the same structure whose parameters are not updated simultaneously. This asynchronous learning strategy reduces the possibility that the maximum Q-value of the two networks is associated with the same action, thereby alleviating Q-value overestimation.
Furthermore, to make the learning of each Q-network robust, we decouple the Q-learning into state value estimation and action advantage value estimation [36]. Thus, the calculation of the Q-value can be formulated as follows:
\[ Q(s_t,a_t) = V_\upsilon (s_t) + A_\upsilon (s_t,a_t), \]
where \(V_\upsilon (\cdot)\) and \(A_\upsilon (\cdot)\) are the state value function and the action advantage value function, respectively. This decoupling generalizes action-level learning without excessive change, resulting in more robust and consistent Q-learning.
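A minimal PyTorch sketch of this decoupled (double and dueling) Q-learning target is given below; the network sizes, the mean-centered advantage head, and the target-network handling are illustrative assumptions rather than the exact GRFG\(^\upsilon\) configuration.

```python
import torch
import torch.nn as nn

class DuelingQNet(nn.Module):
    # Dueling head: combine a state value and a (mean-centered) action advantage.
    def __init__(self, state_dim, n_actions, hidden=64):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(state_dim, hidden), nn.ReLU())
        self.value = nn.Linear(hidden, 1)
        self.advantage = nn.Linear(hidden, n_actions)

    def forward(self, s):
        h = self.body(s)
        a = self.advantage(h)
        return self.value(h) + a - a.mean(dim=-1, keepdim=True)

def double_q_loss(q_net, q_target_net, s, a, r, s_next, gamma=0.99):
    # The online network selects the next action; the asynchronously updated
    # target network evaluates it, decoupling selection from estimation.
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        next_a = q_net(s_next).argmax(dim=1, keepdim=True)
        target = r + gamma * q_target_net(s_next).gather(1, next_a).squeeze(1)
    return nn.functional.mse_loss(q_sa, target)

# toy usage
q, q_tgt = DuelingQNet(8, 4), DuelingQNet(8, 4)
q_tgt.load_state_dict(q.state_dict())
s = torch.randn(5, 8); a = torch.randint(0, 4, (5,)); r = torch.randn(5); s2 = torch.randn(5, 8)
print(double_q_loss(q, q_tgt, s, a, r, s2))
```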
After the agents converge, we expect to find the optimal policy \(\pi ^*_\upsilon\) that chooses the most appropriate action (i.e., feature group or operation) based on the state via the Q-value, which can be formulated as follows:
\[ \pi ^*_\upsilon (s_t) = \arg \max _{a}Q(s_t,a). \]
Actor-Critic Optimization Perspective. We adopt the Actor-Critic optimization approach [12, 20] to enhance the convergence and performance of the model, namely GRFG\(^\alpha\). The Actor-Critic architecture consists of two independent networks: the Actor network for action selection and the Critic network for reward estimation.
Actor: The goal of the actor is to learn the selection policy, denoted by \(\pi _\alpha (\cdot)\), based on the current state. This policy is used to select appropriate candidate feature groups or operations. In the tth iteration, given the state \(s_t\), the agent chooses an action \(a_t\) as follows:
\[ a_t \sim \pi _\alpha (\cdot \mid s_t), \]
where the output of \(\pi _\alpha (\cdot)\) is the probability distribution over candidate actions and the \(\sim\) operation denotes sampling from this distribution.
Critic: The critic's objective is to estimate the expected reward of a given input state, which can be expressed as:
\[ v_t = V_\alpha (s_t), \]
where \(V_\alpha (\cdot)\) is the state-value function and \(v_t\) represents the resulting value.
Optimization: We update the Actor and Critic in the cascading agents after each step of feature transformation. Suppose that in step t, for each agent, we obtain the memory \(\mathcal {M}_t = (a_t, s_t, s_{t+1}, r_t)\). The estimation of the temporal difference error (\(\delta\)) for the Critic is given by:
\[ \delta _t = r_t + \gamma V_\alpha (s_{t+1}) - V_\alpha (s_t), \]
where \(\gamma \in [0, 1]\) is the discount factor. With this, we can update the Critic by the mean square error (MSE):
\[ \mathcal {L}_{critic} = \delta _t^{2}. \]
Then, we can update the policy \(\pi _\alpha\) in the Actor by policy gradient, defined as:
\[ \mathcal {L}_{actor} = -\log \pi _\alpha (a_t|s_t)\,\delta _t - \beta H(\pi _\alpha (\cdot |s_t)), \]
where \(H(\cdot)\) is an entropy regularization term that aims to increase the randomness of exploration, \(\beta\) controls the strength of \(H\), and \(\pi _\alpha (a_t|s_t)\) denotes the probability of action \(a_t\). The overall loss function for each agent is given by:
\[ \mathcal {L} = \mathcal {L}_{actor} + \mathcal {L}_{critic}. \]
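The per-agent update can be sketched in PyTorch as follows; the network architecture, hyperparameters, and loss weighting are illustrative assumptions rather than the exact GRFG\(^\alpha\) configuration.

```python
import torch
import torch.nn as nn
from torch.distributions import Categorical

class ActorCritic(nn.Module):
    # One cascading agent: the actor outputs a categorical policy over candidate
    # actions (feature groups or operations); the critic scores the input state.
    def __init__(self, state_dim, n_actions, hidden=64):
        super().__init__()
        self.actor = nn.Sequential(nn.Linear(state_dim, hidden), nn.ReLU(),
                                   nn.Linear(hidden, n_actions))
        self.critic = nn.Sequential(nn.Linear(state_dim, hidden), nn.ReLU(),
                                    nn.Linear(hidden, 1))

    def forward(self, s):
        return Categorical(logits=self.actor(s)), self.critic(s).squeeze(-1)

def update(agent, optimizer, s, a, r, s_next, gamma=0.99, beta=0.01):
    dist, v = agent(s)
    with torch.no_grad():
        _, v_next = agent(s_next)
    td_target = r + gamma * v_next
    delta = (td_target - v).detach()                     # TD error, used as a constant signal
    critic_loss = (td_target - v).pow(2).mean()          # MSE update for the critic
    actor_loss = (-dist.log_prob(a) * delta - beta * dist.entropy()).mean()  # policy gradient + entropy
    loss = actor_loss + critic_loss
    optimizer.zero_grad(); loss.backward(); optimizer.step()
    return loss.item()

agent = ActorCritic(state_dim=8, n_actions=4)
optimizer = torch.optim.Adam(agent.parameters(), lr=1e-3)
s, s_next = torch.randn(5, 8), torch.randn(5, 8)
a, r = torch.randint(0, 4, (5,)), torch.randn(5)
print(update(agent, optimizer, s, a, r, s_next))
```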
After agents converge, we expect to discover the optimal policy
\(\pi _\alpha ^*\) that can choose the most appropriate action (i.e., feature group or operation).
3.5 State Representation of a Feature Set (Group) and an Operation
State representation reflects the status of current features or operations, which serves as the input for reinforced agents to learn more effective policies in the future. We propose to map a feature group or an operation to a vectorized representation for agent training.
For a feature set (group), we propose three state representation methods. To simplify the description, in the following we assume a feature set \(\mathcal {F}\in \mathbb {R}^{m\times n}\), where \(m\) is the number of samples and \(n\) is the number of feature columns.
Descriptive Statistics Based State Representation. In the preliminary work [31], we use the descriptive statistics matrix of a feature set as its state representation. Figure 4(a) shows the calculation process:
Step-1: We calculate the descriptive statistics matrix (i.e., count, standard deviation, minimum, maximum, first, second, and third quartile) of \(\mathcal {F}\) column by column.
Step-2: We calculate the descriptive statistics of the outcome matrix of the previous step row by row to obtain the meta descriptive statistics matrix, whose shape is \(\mathbb {R}^{7\times 7}\).
Step-3: We obtain the state representation of the feature set, \(Rep_{ds}(\mathcal {F})\in \mathbb {R}^{1\times 49}\), by flattening the descriptive matrix from the former step.
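A compact pandas-based sketch of these three steps is shown below, using the seven statistics listed above; the exact statistic list of the original implementation may differ.

```python
import numpy as np
import pandas as pd

def descriptive_state(F: np.ndarray) -> np.ndarray:
    # Step-1: per-column descriptive statistics; Step-2: the same statistics applied
    # row-wise to the Step-1 matrix; Step-3: flatten the 7x7 meta-statistics matrix.
    stats = ["count", "std", "min", "max", "25%", "50%", "75%"]   # the 7 statistics used here
    col_desc = pd.DataFrame(F).describe().loc[stats]              # 7 x n
    meta_desc = col_desc.T.describe().loc[stats]                  # 7 x 7
    return meta_desc.to_numpy().flatten()                         # length 49

F = np.random.rand(100, 5)
print(descriptive_state(F).shape)   # (49,)
```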
While the descriptive statistics based state representation has shown exceptional performance, it may cause information loss due to its fixed statistical views. To better comprehend the feature set, we propose two new state representation methods in this journal version.
Autoencoder Based State Representation. Intuitively, an effective state representation should be able to recover the original feature set. Thus, we use an autoencoder to convert the feature set \(\mathcal {F}\) into a latent embedding vector and regard it as the state representation. Figure 4(b) shows the calculation process, which can be divided into three steps:
Step-1: We apply a fully connected network \(FC_{col}(\cdot)\) to convert each column of \(\mathcal {F}\) into the matrix \(Z \in \mathbb {R}^{k\times n}\), where \(k\) is the dimension of the latent representation of each column. The process can be represented as:
\[ Z = FC_{col}(\mathcal {F}). \]
Step-2: We apply another fully connected network \(FC_{row}(\cdot)\) to convert each row of \(Z\) into the matrix \(Z^{\prime } \in \mathbb {R}^{k\times d}\), where \(d\) is the dimension of the latent embedding of each row. The process can be defined as:
\[ Z^{\prime } = FC_{row}(Z). \]
Step-3: We flatten \(Z^{\prime }\) and input it into a decoder constructed by fully connected layers to reconstruct the original feature space, which can be defined as:
\[ \mathcal {F}^{\prime } = \text{Decoder}(\text{Flatten}(Z^{\prime })). \]
During the learning process, we minimize the reconstruction error between \(\mathcal {F}\) and \(\mathcal {F}^{\prime }\). The optimization objective is as follows:
\[ \min \Vert \mathcal {F} - \mathcal {F}^{\prime }\Vert _2^2. \]
When the model converges, \(\text{Flatten}(Z^{\prime })\) used in Equation (16) is taken as the state representation \(Rep_{ae}(\mathcal {F})\).
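A minimal PyTorch sketch of this column-wise/row-wise autoencoder follows; the layer sizes and the decoder architecture are illustrative assumptions.

```python
import torch
import torch.nn as nn

class FeatureSetAE(nn.Module):
    # Column-wise then row-wise encoding of an m x n feature matrix, followed by a
    # fully connected decoder that reconstructs the flattened matrix.
    def __init__(self, m, n, k=8, d=8):
        super().__init__()
        self.fc_col = nn.Linear(m, k)          # encodes each column (length m) to length k
        self.fc_row = nn.Linear(n, d)          # encodes each row of Z (length n) to length d
        self.decoder = nn.Sequential(nn.Linear(k * d, 128), nn.ReLU(), nn.Linear(128, m * n))
        self.m, self.n = m, n

    def forward(self, F):                      # F: (m, n)
        Z = self.fc_col(F.T).T                 # (k, n): column-wise encoding
        Z_prime = self.fc_row(Z)               # (k, d): row-wise encoding
        state = Z_prime.flatten()              # Rep_ae(F), the state representation
        F_hat = self.decoder(state).view(self.m, self.n)
        return state, F_hat

m, n = 100, 6
F = torch.randn(m, n)
ae = FeatureSetAE(m, n)
state, F_hat = ae(F)
loss = nn.functional.mse_loss(F_hat, F)        # minimize the reconstruction error
print(state.shape, loss.item())
```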
Graph Autoencoder Based State Representation. The prior two methods only preserve the information of individual features in the state representation while disregarding their correlations. To further enhance the state representation, we propose a graph autoencoder [17] to incorporate the knowledge of the entire feature set and the feature-feature correlations into the state representation. Figure 4(d) shows the calculation process. Specifically,
Step-1: We build a complete correlation graph \(\mathcal {G}\) by calculating the similarity matrix between each pair of feature columns. The adjacency matrix of \(\mathcal {G}\) is \(\mathbf {A} \in \mathbb {R}^{n\times n}\), where a node is a feature column in \(\mathcal {F}\) and an edge reflects the similarity between two nodes. Then, we use an encoder constructed by a one-layer GCN to aggregate feature knowledge and generate enhanced embeddings \(Z\in \mathbb {R}^{n\times k}\) based on \(\mathbf {A}\), where \(k\) is the dimension of the latent embedding. The calculation process is defined as follows:
\[ Z = \mathbf {D}^{-\frac{1}{2}}\mathbf {A}\mathbf {D}^{-\frac{1}{2}}\mathcal {F}^{\top }\mathbf {W}, \]
where \(\mathcal {F}^{ \top }\) is the transpose of the input feature matrix \(\mathcal {F}\), \(\mathbf {D}\) is the degree matrix, and \(\mathbf {W} \in \mathbb {R}^{m\times k}\) is the parameter of the GCN.
Step-2: We use the decoder to reconstruct the adjacency matrix \(\mathbf {A}^{\prime }\) based on \(Z\), which can be represented as:
\[ \mathbf {A}^{\prime } = \text{sigmoid}(ZZ^{\top }). \]
During the learning process, we minimize the reconstruction error between \(\mathbf {A}\) and \(\mathbf {A}^{\prime }\). The optimization objective is:
\[ \min \ \text{BCELoss}(\mathbf {A}, \mathbf {A}^{\prime }), \]
where BCELoss refers to the binary cross entropy loss function. When the model converges, we average \(Z\) column-wise to obtain the state representation \(Rep_{gae}(\mathcal {F}) \in \mathbb {R}^{1\times k}\). For an operation \(o\), we use its one-hot vector as the state representation, denoted by \(Rep(o)\).
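The graph-autoencoder state representation can be sketched as follows; the correlation-based similarity, the absence of an activation in the encoder, and the single untrained forward pass are illustrative assumptions.

```python
import torch
import torch.nn as nn

def gae_state(F, k=8):
    # F: (m, n) feature matrix; the n feature columns are the graph nodes.
    m, n = F.shape
    A = torch.corrcoef(F.T).abs()                     # similarity adjacency matrix, n x n, in [0, 1]
    D_inv_sqrt = torch.diag(A.sum(dim=1).pow(-0.5))   # D^{-1/2} built from the degree matrix
    W = nn.Parameter(0.01 * torch.randn(m, k))        # one-layer GCN weight, m x k
    Z = D_inv_sqrt @ A @ D_inv_sqrt @ F.T @ W         # GCN encoder: n x k node embeddings
    A_hat = torch.sigmoid(Z @ Z.T)                    # inner-product decoder
    loss = nn.functional.binary_cross_entropy(A_hat, A)   # reconstruction objective (one forward pass)
    return Z.mean(dim=0), loss                        # column-wise average -> Rep_gae(F) of size k

F = torch.randn(100, 6)
state, loss = gae_state(F)
print(state.shape, loss.item())
```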
3.6 Group-wise Feature Generation
We found that using group-level crossing can generate more features each time and, thus, accelerate exploration, augment reward feedback by adding a significant number of features, and help the agents learn effective policies. The selection results of our reinforcement learning system lead to two generation scenarios: (1) a binary operation and two feature groups are selected; (2) a unary operation and two feature groups are selected. However, a challenge arises: what is the most effective generation strategy for each scenario? We next propose two strategies for the two scenarios.
Scenario 1: Cosine Similarity Prioritized Feature-Feature Crossing. We highlight that crossing two features that overlap less in terms of information is more likely to generate informative features. We propose a simple yet efficient strategy: select the top K dissimilar feature pairs between the two feature groups. Specifically, we first cross the two selected feature groups to prepare feature pairs. We then compute the cosine similarities of all feature pairs. Later, we rank and select the top K dissimilar feature pairs. Finally, we apply the operation to the top K selected feature pairs to generate K new features.
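A small Python sketch of this top-K dissimilar pair selection follows; treating the absolute cosine similarity as the information-overlap measure is an assumption.

```python
import numpy as np

def top_k_dissimilar_pairs(G1, G2, K):
    # G1, G2: lists of 1-D feature vectors; return the K cross-group pairs with the
    # lowest absolute cosine similarity (i.e., the least overlapped information).
    def cos(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))
    pairs = [(abs(cos(f1, f2)), i, j) for i, f1 in enumerate(G1) for j, f2 in enumerate(G2)]
    pairs.sort()
    return [(i, j) for _, i, j in pairs[:K]]

rng = np.random.default_rng(0)
G1, G2 = list(rng.normal(size=(3, 50))), list(rng.normal(size=(4, 50)))
chosen = top_k_dissimilar_pairs(G1, G2, K=2)
new_features = [G1[i] + G2[j] for i, j in chosen]     # e.g., binary operation "+"
print(chosen, len(new_features))
```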
Scenario 2: Relevance Prioritized Unary Feature Generation. When a unary operation and two feature groups are selected, we directly apply the operation to the feature group that is more relevant to the target labels. Given a feature group \(\mathcal {C}\), we use the average mutual information between all the features in \(\mathcal {C}\) and the prediction target y to quantify the relevance between the feature group and the prediction target, which is given by: \(rel = \frac{1}{|\mathcal {C}|}\sum _{f_i\in \mathcal {C}} MI(f_i,y)\), where MI is the mutual information function. After the more relevant feature group is identified, we apply the unary operation to it to generate new features.
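A sketch of the relevance-prioritized unary case, using scikit-learn's MI estimator as an illustrative stand-in for the MI function:

```python
import numpy as np
from sklearn.feature_selection import mutual_info_regression

def group_relevance(C, y):
    # Average mutual information between each feature in the group and the target y.
    return np.mean([mutual_info_regression(f.reshape(-1, 1), y, random_state=0)[0] for f in C])

def unary_generate(C1, C2, y, op=lambda f: np.sqrt(np.abs(f))):  # `op` stands in for, e.g., sqrt
    chosen = C1 if group_relevance(C1, y) >= group_relevance(C2, y) else C2
    return [op(f) for f in chosen]

rng = np.random.default_rng(0)
C1, C2 = list(rng.normal(size=(3, 200))), list(rng.normal(size=(2, 200)))
y = C1[0] * 2 + rng.normal(size=200) * 0.1         # make C1 clearly more relevant
print(len(unary_generate(C1, C2, y)))
```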
Post-generation Processing. After feature generation, we combine the newly generated features with the original feature set to form an updated feature set, which will be fed into a downstream task to evaluate predictive performance. Such performance is exploited as reward feedback to update the policies of the three cascading agents in order to optimize the next round of feature generation. To prevent feature number explosion during the iterative generation process, we use a feature selection step to control feature size. When the size of the new feature set exceeds a feature size tolerance threshold, we leverage the K-best feature selection method to reduce the feature size. Otherwise, we do not perform feature selection. We use the tailored new feature set as the original feature set of the next iteration.
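The size-control step can be sketched, for example, with scikit-learn's SelectKBest; the F-test scorer and the budget value are illustrative choices rather than the configuration used in our experiments.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_regression

def control_feature_size(F, y, size_budget):
    # Keep the feature set unchanged if it is within budget; otherwise keep the
    # K best features scored against the target (here with an F-test scorer).
    if F.shape[1] <= size_budget:
        return F
    return SelectKBest(score_func=f_regression, k=size_budget).fit_transform(F, y)

rng = np.random.default_rng(0)
F, y = rng.normal(size=(150, 30)), rng.normal(size=150)
print(control_feature_size(F, y, size_budget=20).shape)   # (150, 20)
```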
Finally, when the maximum number of iterations is reached, the algorithm returns the optimal feature set \(\mathcal {F^{*}}\) that has the best performance on the downstream task over the entire exploration.
3.7 Applications
Our proposed method (GRFG) aims to automatically optimize and generate a traceable and optimal feature space for an ML task. This framework is applicable to a variety of domain tasks without the need for prior knowledge. We apply GRFG to classification, regression, and outlier detection tasks on 29 different tabular datasets to validate its performance.
5 Related Works
Reinforcement Learning (RL) is the study of how intelligent agents should act in a given environment in order to maximize the expectation of cumulative rewards [27]. According to the learned policy, reinforcement learning algorithms can be classified into two categories: value-based and policy-based. Value-based algorithms (e.g., DQN [22], Double DQN [30]) estimate the value of the state or state-action pair for action selection. Policy-based algorithms (e.g., PG [28]) learn a probability distribution that maps states to actions for action selection. Additionally, actor-critic reinforcement learning frameworks have been proposed to incorporate the advantages of value-based and policy-based algorithms [25]. In recent years, RL has been applied to many domains (e.g., spatial-temporal data mining and recommender systems) and has achieved great success [33, 35]. In this article, we formulate the selection of feature groups and operations as MDPs and propose a new cascading agent structure to resolve these MDPs.
Automated Feature Engineering aims to enhance the feature space through feature generation and feature selection in order to improve the performance of ML models [5]. Feature selection removes redundant features and retains the important ones, whereas feature generation creates and adds meaningful variables.
Feature Selection approaches include: (i) filter methods (e.g., univariate selection [7], correlation based selection [38]), in which features are ranked by a specific score such as redundancy or relevance; (ii) wrapper methods (e.g., reinforcement learning [21], branch and bound [18]), in which the optimal feature subset is identified by a search strategy under a predictive task; and (iii) embedded methods (e.g., LASSO [29], decision tree [26]), in which selection is part of the optimization objective of a predictive task.
Feature Generation methods include: (i) latent representation learning based methods, for example, deep factorization machines [10] and deep representation learning [1]; because these methods generate a latent feature space, it is hard to trace and explain the extraction process. (ii) feature transformation based methods, which use arithmetic or aggregate operations to generate new features [4, 16, 37]. These approaches have two weaknesses: (a) they ignore feature-feature heterogeneity among different feature pairs; and (b) the generated feature space grows exponentially as the number of exploration steps increases. Compared with prior literature, our personalized feature crossing strategy captures feature distinctness, our cascading agents learn effective feature interaction policies, and our group-wise generation manner accelerates feature generation.