1 Introduction
Classic Machine Learning (ML) mainly includes data preprocessing, feature extraction, feature engineering, predictive modeling, and evaluation. The evolution of deep AI, however, has resulted in a new principled and widely used paradigm: i) collecting data, ii) computing data representations, and iii) applying ML models. Indeed, the success of ML algorithms highly depends on data representation [1]. Building a good representation space is critical and fundamental because it can help to 1) identify and disentangle the underlying explanatory factors hidden in observed data, 2) ease the extraction of useful information in predictive modeling, 3) reconstruct distance measures to form discriminative and machine-learnable patterns, and 4) embed structural knowledge and priors into the representation space, thus making classic ML algorithms applicable to complex graph, spatiotemporal, sequence, multi-media, or even hybrid data.
In this article, we study the problem of learning to reconstruct an optimal and traceable feature representation space to advance a downstream ML task (Figure 1). Formally, given a set of original features, a prediction target, and a downstream ML task (e.g., classification and regression), the objective is to automatically reconstruct an optimal and traceable set of features for the ML task by mathematically transforming the original features.
Prior literature has partially addressed the problem. The first relevant work is feature engineering, which designs preprocessing, feature extraction, feature selection [11, 19], and feature generation [16] to extract a transformed representation of the data. These techniques are essential but labor-intensive, highlighting the limited ability of current ML practice to automate the extraction of a discriminative feature representation space.
Issue 1 (full automation): how can we make ML less dependent on feature engineering, construct ML systems faster, and expand the scope and applicability of ML? The second relevant work is representation learning, such as factorization [8], embedding [9], and deep representation learning [32, 34]. These studies are devoted to learning effective latent features. However, the learned features are implicit and non-explainable. Such traits limit the deployment of these approaches in many application scenarios (e.g., patient and biomedical domains) that require not just high predictive accuracy but also a trusted understanding and interpretation of the underlying drivers.
Issue 2 (explainable explicitness): how can we ensure that the reconstructed representation space is traceable and explainable? The third relevant work is learning-based feature transformation, such as principal component analysis [3], traversal transformation graph based feature generation [16], and sparsity regularization based feature selection [13]. These methods are either deeply embedded into or totally irrelevant to a specific ML model. For example, LASSO regression extracts an optimal feature subset for regression, but not for any given ML model.
Issue 3 (flexible optimal): how can we create a framework to reconstruct a new representation space for any given predictor? The three issues are well-known challenges. Our goal is to develop a new perspective to address these issues.
In the preliminary work [35], we propose a novel and principled Traceable Group-wise Reinforcement Generation Framework to address the automation, explicitness, and optimality issues in representation space reconstruction. Specifically, we view feature space reconstruction as a traceable iterative process that combines feature generation and feature selection. The generation step creates new features by mathematically transforming the original features, and the selection step avoids feature set explosion. To make the process self-optimizing, we formulate it as three Markov Decision Processes (MDPs): one to select the first meta feature, one to select an operation, and one to select the second meta feature. The two candidate features are transformed by applying the selected operation to them to generate new ones. We develop a cascading agent structure to coordinate the three MDPs and learn better feature transformation policies. Meanwhile, we propose a group-wise feature generation mechanism to accelerate feature space reconstruction and augment reward signals so that the cascading agents can learn clear policies.
While our initial research has achieved significant results, there remains scope for further improvement. To improve the performance and generalization capabilities of our initial work, we explore two distinct yet interconnected avenues: 1) Refinement of the existing framework: We propose a graph-based state representation method, enabling reinforced agents to perceive the current feature space more effectively. Furthermore, we employ enhanced Q-learning strategies to mitigate the overestimation of Q-values during policy learning. 2) Introduction of a novel optimization approach: We utilize an Actor-Critic policy learning method to train the overall framework. It aims to expedite model convergence and enhance feature transformation performance.
In the first avenue, we assume that a robust state representation can empower reinforced agents to accurately comprehend the current feature space, thereby facilitating the learning of more intelligent policies for feature transformation. Consequently, for the graph-based state representation method, we first construct a comprehensive attributed graph predicated on feature-feature correlations. Each node signifies a feature, each edge weight denotes the correlation between two nodes, and each node attribute corresponds to the feature vector. Then, we capture the knowledge of the entire feature set and the interactions between features via message passing and graph reconstruction, preserving this information in an embedding vector as a state representation. In addition, we recognize Q-value overestimation as a critical challenge that hinders the unbiased learning of transformation policies, subsequently leading to sub-optimal model performance. To mitigate this issue, we employ Q-learning with a dual network structure: one network for action selection and a second for Q-value estimation. Importantly, these two networks are not updated concurrently. This asynchronous update strategy decouples the max operation in traditional Q-learning, thereby reducing Q-value overestimation and enabling the learning of more robust policies.
In the second avenue, we examine the optimization technique of the original framework and observe that Q-learning treats all candidate features as the complete action space, which results in substantial computational costs when our framework deals with high-dimensional spaces. Simultaneously, the \(\epsilon\)-greedy-based learning strategy of Q-learning makes it difficult for reinforced agents to balance the trade-off between efficiency and effectiveness. Consequently, we utilize the Actor-Critic technique to optimize the entire feature transformation framework in order to directly learn intelligent feature-feature crossing policies. In particular, we develop three Actor-Critic agents to select candidate features and operations for feature transformation. The actor component directly learns the selection policy through policy gradient updates and makes the action selection by sampling. The critic component leverages gradient signals from temporal-difference errors at each iteration to evaluate and improve selection policies. The collaboration between the actor and critic components can expedite model convergence and produce effective combinations for feature transformation.
In summary, in this article, we develop a generic and principled framework: group-wise reinforcement feature transformation, for optimal and explainable representation space reconstruction. The contributions of this article are summarized as follows:
(1)
We formulate the feature space reconstruction as an interactive process of nested feature generation and selection, where feature generation is to generate new meaningful and explicit features, and feature selection is to remove redundant features to control feature sizes.
(2)
We develop a group-wise reinforcement feature transformation framework to automatically and efficiently generate an optimal and traceable feature set for a downstream ML task, without requiring extensive professional experience or human intervention.
(3)
We present refinements for the original framework. In particular, a graph-based state representation method is proposed to capture the feature-feature correlations and the knowledge of the entire feature set, improving the comprehension of the reinforced agents. A decoupled Q-learning strategy is employed to alleviate Q-value overestimation and learn unbiased and effective policies.
(4)
We utilize the Actor-Critic optimization technique to reimplement the self-optimizing feature transformation framework. The actor component directly learns the selection policies for candidate features and operations instead of finding a surrogate value function. The critic component evaluates the selection policy and provides feedback to the actor so that it learns a better policy in the next iteration.
(5)
Extensive experiments and case studies are conducted to demonstrate the efficacy of our method. Compared with the preliminary work, we add the outlier detection task with five additional datasets, evaluate different state representation approaches, compare different Q-learning strategies and optimization techniques, and provide comprehensive analyses.
3 Optimal and Explainable Feature Space Reconstruction
In this section, we present an overview, and then detail each technical component of our framework.
3.1 Framework Overview
Figure 2 shows our proposed framework, Group-wise Reinforcement Feature Generation (GRFG). In the first step, we cluster the original features into different feature groups by maximizing intra-group feature cohesion and inter-group feature distinctness. In the second step, we leverage a group-operation-group strategy to cross two feature groups to generate multiple features each time. For this purpose, we develop a novel cascading reinforcement learning method to learn three agents that select the two most informative feature groups and the most appropriate operation from the operation set. As a key enabling technique, the cascading reinforcement method coordinates the three agents to share their perceived states in a cascading fashion, that is, (agent1, state of the first feature group), (agent2, fused state of the operation and agent1's state), and (agent3, fused state of the second feature group and agent2's state), in order to learn better choice policies. After the two feature groups and the operation are selected, we generate new features via a group-group crossing strategy. In particular, if the operation is unary, for example, sqrt, we choose the feature group of higher relevance to the target labels from the two feature groups, and apply the operation to this more relevant feature group to generate new features. If the operation is binary, we choose the K most distinct feature pairs from the two feature groups, and apply the binary operation to the chosen feature pairs to generate new features. In the third step, we add the newly generated features to the original features to create a generated feature set. We feed the generated feature set into a downstream task to collect predictive accuracy as reward feedback to update the policy parameters. Finally, we employ feature selection to eliminate redundant features and control the dimensionality of the newly generated feature set, which is then used as the original feature set to restart the iterations and regenerate new features until the maximum number of iterations is reached.
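For concreteness, the following self-contained Python sketch imitates one pass of this loop on toy data; random choices stand in for the learned cascading agents, a correlation score stands in for the downstream-task reward, and all names and thresholds are illustrative rather than taken from the actual GRFG implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for the GRFG loop: random policies replace the learned cascading agents.
def toy_grfg(X, y, n_iters=3, size_budget=20):
    ops = {"add": np.add, "mul": np.multiply, "sqrt": lambda a: np.sqrt(np.abs(a))}
    for _ in range(n_iters):
        groups = np.array_split(rng.permutation(X.shape[1]), 3)   # crude stand-in for M-Clustering
        g1, g2 = rng.choice(len(groups), size=2, replace=False)   # stand-ins for agents 1 and 3
        op_name = rng.choice(list(ops))                           # stand-in for the operation agent
        if op_name == "sqrt":                                     # unary: transform one group
            new = ops[op_name](X[:, groups[g1]])
        else:                                                     # binary: cross the two groups
            new = np.column_stack([ops[op_name](X[:, i], X[:, j])
                                   for i in groups[g1] for j in groups[g2]])
        X = np.column_stack([X, new])                             # add generated features
        if X.shape[1] > size_budget:                              # K-best selection by |corr(f, y)|
            scores = np.abs([np.corrcoef(X[:, k], y)[0, 1] for k in range(X.shape[1])])
            X = X[:, np.argsort(-scores)[:size_budget]]
    return X

X, y = rng.normal(size=(100, 6)), rng.normal(size=100)
print(toy_grfg(X, y).shape)
```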
3.2 Generation-oriented Feature Grouping
One of our key findings is that group-wise feature generation can accelerate generation and exploration and, moreover, augment the reward feedback of agents to learn clear policies. Inspired by this finding, our system starts with generation-oriented feature clustering, which aims to create feature groups that are meaningful for group-group crossing. Another insight is that crossing features of high (low) information distinctness is more (less) likely to generate meaningful variables in a new feature space. As a result, unlike classic clustering, the objective of feature grouping is to maximize inter-group feature information distinctness and minimize intra-group feature information distinctness. To achieve this goal, we propose M-Clustering for feature grouping, which starts with each feature as a feature group and then merges the most similar pair of feature groups at each iteration.
Distance Between Feature Groups: A Group-level Relevance-Redundancy Ratio Perspective. To achieve the goal of minimizing intra-group feature distinctness and maximizing inter-group feature distinctness, we develop a new distance measure to quantify the distance between two feature groups. We highlight two interesting findings: 1) relevance to the predictive target: if the relevance between one feature group and the predictive target is similar to the relevance between another feature group and the predictive target, the two feature groups are similar; 2) mutual information: if the mutual information between the features of the two groups is large, the two feature groups are similar. Based on the two insights, we devise a feature group-group distance. The distance can be used to evaluate the distinctness of two feature groups and, further, to estimate how likely crossing the two feature groups is to generate more meaningful features. Formally, the distance is given by:
\[ dis(\mathcal {C}_i,\mathcal {C}_j) = \frac{1}{|\mathcal {C}_i|\cdot |\mathcal {C}_j|}\sum _{f_i\in \mathcal {C}_i}\sum _{f_j\in \mathcal {C}_j}\frac{|MI(f_i,y)-MI(f_j,y)|}{MI(f_i,f_j)+\epsilon }, \]
where \(\mathcal {C}_i\) and \(\mathcal {C}_j\) denote two feature groups, \(|\mathcal {C}_i|\) and \(|\mathcal {C}_j|\), respectively, are the numbers of features in \(\mathcal {C}_i\) and \(\mathcal {C}_j\), \(f_i\) and \(f_j\) are two features in \(\mathcal {C}_i\) and \(\mathcal {C}_j\), respectively, and \(y\) is the target label vector. In particular, \(|MI(f_i,y)-MI(f_j,y)|\) quantifies the difference in relevance to \(y\) between \(f_i\) and \(f_j\). If \(|MI(f_i,y)-MI(f_j,y)|\) is small, \(f_i\) and \(f_j\) have a similar influence on classifying \(y\). \(MI(f_i,f_j)+\epsilon\) quantifies the redundancy between \(f_i\) and \(f_j\), where \(\epsilon\) is a small value used to prevent the denominator from being zero. If \(MI(f_i,f_j)+\epsilon\) is large, \(f_i\) and \(f_j\) share more information.
Feature Group Distance based M-Clustering Algorithm: We develop a group-group distance instead of a point-point distance, and under such a group-level distance, the shape of a feature cluster can be non-spherical. Therefore, it is not appropriate to use K-means or density-based methods. Inspired by agglomerative clustering, given a feature set \(\mathcal {F}\), we propose a three-step method: 1) INITIALIZATION: we regard each feature in \(\mathcal {F}\) as a small feature cluster. 2) REPEAT: we calculate the information overlap between any two feature clusters and determine which cluster pair is the most closely overlapped; we then merge the two clusters into one and remove the two original clusters. 3) STOP CRITERIA: we iterate the REPEAT step until the distance between the closest feature group pair reaches a certain threshold. Although the classic stopping criterion is to stop when only one cluster remains, using the distance between the closest feature groups as the stopping criterion better helps us to semantically understand, gauge, and identify the information distinctness among feature groups, and it eases implementation in practical deployment.
The mutual information \(MI\) used above is defined as
\[ MI(X_1,X_2)=\sum _{x_1\in X_1}\sum _{x_2\in X_2}p(x_1,x_2)\log \frac{p(x_1,x_2)}{p(x_1)p(x_2)}, \]
where \(X_1\) and \(X_2\) are two vectors, \(x_1\) and \(x_2\) are elements of \(X_1\) and \(X_2\), respectively, \(p(x_1,x_2)\) is the joint distribution of \(X_1\) and \(X_2\), and \(p(x_1)\) and \(p(x_2)\) are the marginal distributions of \(X_1\) and \(X_2\), respectively. The greater the value of MI, the more information the two vectors share. By using this distance metric, we can group features that share similar information and have a similar impact on identifying the target label into a single feature cluster. Our framework begins by clustering the feature set into feature clusters. This step groups similar features into a single cluster and dissimilar features into different clusters, because we presume that generated features are meaningful when the interacting features come from different information distributions; otherwise, the generated features tend to be redundant. Considering that the boundary and information distribution of feature clusters vary over the feature generation process, we develop the new M-Clustering algorithm rather than using a conventional clustering technique.
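To make the grouping procedure concrete, the following self-contained Python sketch implements the group-group distance and the agglomerative merging loop described above; the nearest-neighbor MI estimator from scikit-learn and the stopping threshold are illustrative choices, not those of the original implementation.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_regression

def mi(a, b):
    # Nearest-neighbor estimate of mutual information between two feature vectors.
    return mutual_info_regression(a.reshape(-1, 1), b, random_state=0)[0]

def group_distance(Ci, Cj, y, eps=1e-6):
    # Relevance-redundancy ratio averaged over all cross-group feature pairs.
    total = 0.0
    for fi in Ci:
        for fj in Cj:
            total += abs(mi(fi, y) - mi(fj, y)) / (mi(fi, fj) + eps)
    return total / (len(Ci) * len(Cj))

def m_clustering(features, y, threshold=2.0):   # threshold is an illustrative value
    # Start from singleton groups and repeatedly merge the closest pair until the
    # smallest group-group distance exceeds the threshold.
    groups = [[f] for f in features]
    while len(groups) > 1:
        pairs = [(group_distance(groups[i], groups[j], y), i, j)
                 for i in range(len(groups)) for j in range(i + 1, len(groups))]
        d, i, j = min(pairs)
        if d > threshold:
            break
        groups[i] = groups[i] + groups[j]
        del groups[j]
    return groups

rng = np.random.default_rng(0)
X, y = rng.normal(size=(200, 5)), rng.normal(size=200)
print([len(g) for g in m_clustering(list(X.T), y)])
```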
3.3 Cascading Reinforcement Feature Groups and Operation Selection
To achieve group-wise feature generation, we need to select a feature group, an operation, and another feature group to perform group-operation-group based crossing. Two key findings inspire us to leverage cascading reinforcement.
Firstly, we highlight that although it is hard to define and program the optimal selection criteria for feature groups and operations, we can view the selection criteria as a form of machine-learnable policies. We propose three agents to learn these policies by trial and error.
Secondly, we find that the three selection agents form a cascading, sequentially auto-correlated decision structure rather than being independent and separate. Here, “cascading” refers to the fact that within each iteration the agents make decisions sequentially, and a downstream agent awaits the completion of an upstream agent. The decision of an upstream agent changes the environment states of downstream agents. As shown in Figure 3, the first agent picks the first feature group based on the state of the entire feature space, the second agent picks the operation based on the entire feature space and the selection of the first agent, and the third agent chooses the second feature group based on the entire feature space and the selections of the first and second agents.
We next propose two generic metrics to quantify the usefulness (reward) of a feature set, and then form three MDPs to learn three selection policies.
Two Utility Metrics for Reward Quantification. The two utility metrics are from the supervised and unsupervised perspectives.
Metric 1: Integrated Feature Set Redundancy and Relevance. We propose a metric to quantify feature set utility from an information-theoretic perspective: a higher-utility feature set has less redundant information and is more relevant to the prediction targets. Formally, given a feature set \(\mathcal {F}\) and a predictive target label \(y\), this utility metric can be calculated by
\[ U(\mathcal {F}|y) = -\frac{1}{|\mathcal {F}|^2}\sum _{f_i,f_j\in \mathcal {F}}MI(f_i,f_j) + \frac{1}{|\mathcal {F}|}\sum _{f\in \mathcal {F}}MI(f,y), \]
where \(MI\) is the mutual information, \(f_i, f_j, f\) are features in \(\mathcal {F}\), and \(|\mathcal {F}|\) is the size of the feature set \(\mathcal {F}\).
Metric 2: Downstream Task Performance. Another utility metric is whether the feature set can improve a downstream task (e.g., regression or classification). We use the performance indicator of a downstream predictive task (e.g., 1-RAE, Precision, Recall, F1) as a utility metric of a feature set.
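The two metrics can be instantiated, for example, as follows; the MI estimator, the skipping of self-pairs in the redundancy term, and the 1-RAE example are illustrative simplifications rather than the exact formulation above.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_regression

def mi(a, b):
    # Nearest-neighbor MI estimate between two vectors (an approximation).
    return mutual_info_regression(a.reshape(-1, 1), b, random_state=0)[0]

def utility(F, y):
    # Metric 1: reward average relevance to y, penalize average pairwise redundancy.
    n = F.shape[1]
    relevance = sum(mi(F[:, i], y) for i in range(n)) / n
    redundancy = sum(mi(F[:, i], F[:, j])
                     for i in range(n) for j in range(n) if i != j) / (n * n)
    return relevance - redundancy

def one_minus_rae(y_true, y_pred):
    # Metric 2 example: 1 - relative absolute error for a regression task.
    return 1 - np.sum(np.abs(y_true - y_pred)) / np.sum(np.abs(y_true - np.mean(y_true)))

rng = np.random.default_rng(1)
F, y = rng.normal(size=(300, 4)), rng.normal(size=300)
print(utility(F, y), one_minus_rae(y, np.full_like(y, y.mean())))
```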
Learning Selection Agents of Feature Group 1, Operation, and Feature Group 2. Leveraging the two metrics, we next develop three MDPs to learn three agent policies that select the best feature group 1, operation, and feature group 2.
Learning the Selection Agent of Feature Group 1. The feature group agent 1 iteratively selects the best meta feature group 1. Its learning system includes: i) Action: its action at the tth iteration is the meta feature group 1 selected from the feature groups of the previous iteration, denoted by \(a_t^1 = \mathcal {C}^1_{t-1}\). ii) State: its state at the tth iteration is a vectorized representation of the generated feature set of the previous iteration. Let Rep be a state representation method; the state can be denoted by \(s_t^1 = Rep(\mathcal {F}_{t-1})\). We discuss the state representation methods in Section 3.5. iii) Reward: its reward at the tth iteration is the utility score of the selected feature group 1, denoted by \(\mathcal {R}(s_t^1,a_t^1)=U(\mathcal {F}_{t-1}|y)\).
Learning the Selection Agent of Operation. The operation agent iteratively selects the best operation (e.g., +, -) from an operation set as a feature crossing tool for feature generation. Its learning system includes: i) Action: its action at the tth iteration is the selected operation, denoted by \(a_t^o = o_t\). ii) State: its state at the tth iteration is the combination of \(Rep(\mathcal {F}_{t-1})\) and the representation of the feature group selected by the previous agent, denoted by \(s^o_t = Rep(\mathcal {F}_{t-1}) \oplus Rep(\mathcal {C}_{t-1}^1)\), where \(\oplus\) indicates the concatenation operation. iii) Reward: the selected operation is used to generate new features by feature crossing, and we combine these new features with the original feature set to form a new feature set. Thus, the feature set at the tth iteration is \(\mathcal {F}_t = \mathcal {F}_{t-1} \cup g_t\), where \(g_t\) denotes the newly generated features. The reward of this iteration is the improvement in the utility score of the feature set compared with the previous iteration, denoted by \(\mathcal {R}(s_t^o,a_t^o) = U(\mathcal {F}_t|y) - U(\mathcal {F}_{t-1}|y)\).
Learning the Selection Agent of Feature Group 2. The feature group agent 2 iteratively selects the best meta feature group 2 for feature generation. Its learning system includes: i) Action: its action at the tth iteration is the meta feature group 2 selected from the clustered feature groups of the previous iteration, denoted by \(a_t^2 = \mathcal {C}^2_{t-1}\). ii) State: its state at the tth iteration is the combination of \(Rep(\mathcal {F}_{t-1})\), \(Rep(\mathcal {C}_{t-1}^1)\), and the vectorized representation of the operation selected by the operation agent, denoted by \(s_t^2 = Rep(\mathcal {F}_{t-1})\oplus Rep(\mathcal {C}_{t-1}^1) \oplus Rep(o_t)\). iii) Reward: its reward at the tth iteration is the improvement of the feature set utility plus the feedback of the downstream task, denoted by \(\mathcal {R}(s_t^2, a_t^2) = U(\mathcal {F}_t|y)-U(\mathcal {F}_{t-1}|y)+V_{A_t}\), where \(V_{A_t}\) is the performance (e.g., F1) of a downstream predictive task at the tth iteration.
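The following toy snippet illustrates how the three cascading states and rewards are assembled; the vector sizes, utility values, and operation-set size are placeholders, not values from the framework.

```python
import numpy as np

# Illustrative shapes only: the Rep(.) vectors stand in for the state representations
# of Section 3.5, and the utility / downstream scores are placeholders.
rep_F_prev = np.random.rand(49)        # Rep(F_{t-1}), e.g., descriptive-statistics state
rep_C1 = np.random.rand(49)            # Rep(C1_{t-1}), first selected feature group
rep_op = np.eye(8)[2]                  # Rep(o_t), one-hot over a hypothetical 8-operation set

s1 = rep_F_prev                                    # state of the feature-group-1 agent
s_op = np.concatenate([rep_F_prev, rep_C1])        # state of the operation agent
s2 = np.concatenate([rep_F_prev, rep_C1, rep_op])  # state of the feature-group-2 agent

U_prev, U_curr, downstream_perf = 0.31, 0.42, 0.85  # placeholder utility / task scores
r1 = U_prev                                  # reward of agent 1: U(F_{t-1} | y)
r_op = U_curr - U_prev                       # reward of agent 2: utility improvement
r2 = U_curr - U_prev + downstream_perf       # reward of agent 3: improvement + task performance
print(s1.shape, s_op.shape, s2.shape, r1, r_op, r2)
```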
3.4 Optimization Technique
In the preliminary work [35], we use a value-based optimization technique to train the entire framework. To further enhance the generalization ability and performance of our framework, we adopt a new optimization perspective, actor-critic, to learn the feature transformation policies.
Value-based Optimization Perspective. We train the three agents by maximizing the discounted cumulative reward during the iterative feature generation process. In other words, we encourage the cascading agents to collaborate to generate a feature set that is independent, informative, and performs well in the downstream task. For each agent, we first define the memory unit of the tth exploration as \(\mathcal {M}_t = (s_t, a_t, r_t, s_{t+1}, a_{t+1})\). In the preliminary work, we minimize the temporal difference error \(\mathcal {L}\) based on Q-learning. The optimization objective can be formulated as:
\[ \mathcal {L} = \big (r_t + \gamma \max _{a_{t+1}}Q(s_{t+1},a_{t+1}) - Q(s_t,a_t)\big)^2, \]
where \(Q(\cdot)\) is the Q function that estimates the value of choosing action \(a\) given state \(s\), and \(\gamma\) is a discount factor, a constant between 0 and 1. It emphasizes rewards in the near future and diminishes the significance of rewards in the uncertain distant future, ensuring the convergence of the cumulative reward.
However, Q-learning frequently overestimates the Q-value of each action, resulting in suboptimal policy learning. The reason is that traditional Q-learning employs a single Q-network for both Q-value estimation and action selection: the max operation in Equation (4) accumulates overly optimistic Q-values during the learning process, leading to Q-value overestimation.
Improving the Q-learning. In this journal version, we propose a new training strategy, namely GRFG\(^\upsilon\), to alleviate this issue. Specifically, we use two independent Q-networks with the same structure to conduct Q-learning. The first Q-network is used to select the action with the highest Q-value, and the index of the selected action is used to query the Q-value in the second Q-network for Q-value estimation. The optimization goal can be formulated as follows:
\[ \mathcal {L} = \big (r_t + \gamma \, Q^{-}\big (s_{t+1}, \arg \max _{a_{t+1}}Q(s_{t+1},a_{t+1})\big) - Q(s_t,a_t)\big)^2, \]
where \(Q(\cdot)\) and \(Q^{-}(\cdot)\) are two independent networks with the same structure whose parameters are not updated simultaneously. This asynchronous learning strategy reduces the possibility that the maximum Q-value of the two networks is associated with the same action, thereby alleviating Q-value overestimation.
Furthermore, to make the learning of each Q-network robust, we decouple the Q-learning into state value estimation and action advantage value estimation [36]. Thus, the calculation of the Q-value can be formulated as follows:
\[ Q(s_t,a_t) = V_\upsilon (s_t) + A_\upsilon (s_t,a_t), \]
where \(V_\upsilon (\cdot)\) and \(A_\upsilon (\cdot)\) are the state value function and the action advantage value function, respectively. This decoupling generalizes action-level learning without excessive change, resulting in more robust and consistent Q-learning.
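A minimal PyTorch sketch of this decoupled (double and dueling) Q-learning target is given below; the network sizes, the mean-centered advantage head, and the target-network handling are illustrative assumptions rather than the exact GRFG\(^\upsilon\) configuration.

```python
import torch
import torch.nn as nn

class DuelingQNet(nn.Module):
    # Dueling head: combine a state value and a (mean-centered) action advantage.
    def __init__(self, state_dim, n_actions, hidden=64):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(state_dim, hidden), nn.ReLU())
        self.value = nn.Linear(hidden, 1)
        self.advantage = nn.Linear(hidden, n_actions)

    def forward(self, s):
        h = self.body(s)
        a = self.advantage(h)
        return self.value(h) + a - a.mean(dim=-1, keepdim=True)

def double_q_loss(q_net, q_target_net, s, a, r, s_next, gamma=0.99):
    # The online network selects the next action; the asynchronously updated
    # target network evaluates it, decoupling selection from estimation.
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        next_a = q_net(s_next).argmax(dim=1, keepdim=True)
        target = r + gamma * q_target_net(s_next).gather(1, next_a).squeeze(1)
    return nn.functional.mse_loss(q_sa, target)

# toy usage
q, q_tgt = DuelingQNet(8, 4), DuelingQNet(8, 4)
q_tgt.load_state_dict(q.state_dict())
s = torch.randn(5, 8); a = torch.randint(0, 4, (5,)); r = torch.randn(5); s2 = torch.randn(5, 8)
print(double_q_loss(q, q_tgt, s, a, r, s2))
```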
After the agents converge, we expect to find the optimal policy \(\pi ^*_\upsilon\) that chooses the most appropriate action (i.e., feature group or operation) based on the state via the Q-value, which can be formulated as follows:
\[ \pi ^*_\upsilon (s_t) = \arg \max _{a}Q(s_t,a). \]
Actor-Critic Optimization Perspective. We adopt the Actor-Critic optimization approach [12, 20] to enhance the convergence and performance of the model, namely GRFG\(^\alpha\). The Actor-Critic architecture consists of two independent networks: the Actor network for action selection and the Critic network for reward estimation.
Actor: The goal of the actor is to learn the selection policy, denoted by \(\pi _\alpha (\cdot)\), based on the current state. This policy is used to select appropriate candidate feature groups or operations. In the tth iteration, given the state \(s_t\), the agent chooses an action \(a_t\) as follows:
\[ a_t \sim \pi _\alpha (\cdot \mid s_t), \]
where the output of \(\pi _\alpha (\cdot)\) is the probability distribution over candidate actions and the \(\sim\) operation denotes sampling from this distribution.
Critic: The critic's objective is to estimate the expected reward of a given input state, which can be expressed as:
\[ v_t = V_\alpha (s_t), \]
where \(V_\alpha (\cdot)\) is the state-value function and \(v_t\) represents the resulting value.
Optimization: We update the Actor and Critic in the cascading agents after each step of feature transformation. Suppose that in step t, for each agent, we obtain the memory \(\mathcal {M}_t = (a_t, s_t, s_{t+1}, r_t)\). The estimation of the temporal difference error (\(\delta\)) for the Critic is given by:
\[ \delta _t = r_t + \gamma V_\alpha (s_{t+1}) - V_\alpha (s_t), \]
where \(\gamma \in [0, 1]\) is the discount factor. With this, we can update the Critic by the mean square error (MSE):
\[ \mathcal {L}_{critic} = \delta _t^{2}. \]
Then, we can update the policy \(\pi _\alpha\) in the Actor by policy gradient, defined as:
\[ \mathcal {L}_{actor} = -\log \pi _\alpha (a_t|s_t)\,\delta _t - \beta H(\pi _\alpha (\cdot |s_t)), \]
where \(H(\cdot)\) is an entropy regularization term that aims to increase the randomness of exploration, \(\beta\) controls the strength of \(H\), and \(\pi _\alpha (a_t|s_t)\) denotes the probability of action \(a_t\). The overall loss function for each agent is given by:
\[ \mathcal {L} = \mathcal {L}_{actor} + \mathcal {L}_{critic}. \]
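The per-agent update can be sketched in PyTorch as follows; the network architecture, hyperparameters, and loss weighting are illustrative assumptions rather than the exact GRFG\(^\alpha\) configuration.

```python
import torch
import torch.nn as nn
from torch.distributions import Categorical

class ActorCritic(nn.Module):
    # One cascading agent: the actor outputs a categorical policy over candidate
    # actions (feature groups or operations); the critic scores the input state.
    def __init__(self, state_dim, n_actions, hidden=64):
        super().__init__()
        self.actor = nn.Sequential(nn.Linear(state_dim, hidden), nn.ReLU(),
                                   nn.Linear(hidden, n_actions))
        self.critic = nn.Sequential(nn.Linear(state_dim, hidden), nn.ReLU(),
                                    nn.Linear(hidden, 1))

    def forward(self, s):
        return Categorical(logits=self.actor(s)), self.critic(s).squeeze(-1)

def update(agent, optimizer, s, a, r, s_next, gamma=0.99, beta=0.01):
    dist, v = agent(s)
    with torch.no_grad():
        _, v_next = agent(s_next)
    td_target = r + gamma * v_next
    delta = (td_target - v).detach()                     # TD error, used as a constant signal
    critic_loss = (td_target - v).pow(2).mean()          # MSE update for the critic
    actor_loss = (-dist.log_prob(a) * delta - beta * dist.entropy()).mean()  # policy gradient + entropy
    loss = actor_loss + critic_loss
    optimizer.zero_grad(); loss.backward(); optimizer.step()
    return loss.item()

agent = ActorCritic(state_dim=8, n_actions=4)
optimizer = torch.optim.Adam(agent.parameters(), lr=1e-3)
s, s_next = torch.randn(5, 8), torch.randn(5, 8)
a, r = torch.randint(0, 4, (5,)), torch.randn(5)
print(update(agent, optimizer, s, a, r, s_next))
```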
After agents converge, we expect to discover the optimal policy
\(\pi _\alpha ^*\) that can choose the most appropriate action (i.e., feature group or operation).
3.5 State Representation of a Feature Set (Group) and an Operation
State representation reflects the status of current features or operations, which serves as the input for reinforced agents to learn more effective policies in the future. We propose to map a feature group or an operation to a vectorized representation for agent training.
For a feature set (group), we propose three state representation methods. To simplify the description, in the following we assume a feature set \(\mathcal {F}\in \mathbb {R}^{m\times n}\), where \(m\) is the number of samples and \(n\) is the number of feature columns.
Descriptive Statistics Based State Representation. In the preliminary work [31], we use the descriptive statistics matrix of a feature set as its state representation. Figure 4(a) shows the calculation process:
Step-1: We calculate the descriptive statistics matrix (i.e., count, standard deviation, minimum, maximum, first, second, and third quartile) of \(\mathcal {F}\) column by column.
Step-2: We calculate the descriptive statistics of the outcome matrix of the previous step row by row to obtain the meta descriptive statistics matrix, whose shape is \(\mathbb {R}^{7\times 7}\).
Step-3: We obtain the state representation of the feature set, \(Rep_{ds}(\mathcal {F})\in \mathbb {R}^{1\times 49}\), by flattening the descriptive matrix from the former step.
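A compact pandas-based sketch of these three steps is shown below, using the seven statistics listed above; the exact statistic list of the original implementation may differ.

```python
import numpy as np
import pandas as pd

def descriptive_state(F: np.ndarray) -> np.ndarray:
    # Step-1: per-column descriptive statistics; Step-2: the same statistics applied
    # row-wise to the Step-1 matrix; Step-3: flatten the 7x7 meta-statistics matrix.
    stats = ["count", "std", "min", "max", "25%", "50%", "75%"]   # the 7 statistics used here
    col_desc = pd.DataFrame(F).describe().loc[stats]              # 7 x n
    meta_desc = col_desc.T.describe().loc[stats]                  # 7 x 7
    return meta_desc.to_numpy().flatten()                         # length 49

F = np.random.rand(100, 5)
print(descriptive_state(F).shape)   # (49,)
```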
While the descriptive statistics based state representation has shown exceptional performance, it may cause information loss due to its fixed statistical views. To better comprehend the feature set, we propose two new state representation methods in this journal version.
Autoencoder Based State Representation. Intuitively, an effective state representation should be able to recover the original feature set. Thus, we use an autoencoder to convert the feature set \(\mathcal {F}\) into a latent embedding vector and regard it as the state representation. Figure 4(b) shows the calculation process, which can be divided into three steps:
Step-1: We apply a fully connected network \(FC_{col}(\cdot)\) to convert each column of \(\mathcal {F}\) into the matrix \(Z \in \mathbb {R}^{k\times n}\), where \(k\) is the dimension of the latent representation of each column. The process can be represented as:
\[ Z = FC_{col}(\mathcal {F}). \]
Step-2: We apply another fully connected network \(FC_{row}(\cdot)\) to convert each row of \(Z\) into the matrix \(Z^{\prime } \in \mathbb {R}^{k\times d}\), where \(d\) is the dimension of the latent embedding of each row. The process can be defined as:
\[ Z^{\prime } = FC_{row}(Z). \]
Step-3: We flatten \(Z^{\prime }\) and input it into a decoder constructed by fully connected layers to reconstruct the original feature space, which can be defined as:
\[ \mathcal {F}^{\prime } = \text{Decoder}(\text{Flatten}(Z^{\prime })). \]
During the learning process, we minimize the reconstruction error between \(\mathcal {F}\) and \(\mathcal {F}^{\prime }\). The optimization objective is as follows:
\[ \min \Vert \mathcal {F} - \mathcal {F}^{\prime }\Vert _2^2. \]
When the model converges, \(\text{Flatten}(Z^{\prime })\) used in Equation (16) is taken as the state representation \(Rep_{ae}(\mathcal {F})\).
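A minimal PyTorch sketch of this column-wise/row-wise autoencoder follows; the layer sizes and the decoder architecture are illustrative assumptions.

```python
import torch
import torch.nn as nn

class FeatureSetAE(nn.Module):
    # Column-wise then row-wise encoding of an m x n feature matrix, followed by a
    # fully connected decoder that reconstructs the flattened matrix.
    def __init__(self, m, n, k=8, d=8):
        super().__init__()
        self.fc_col = nn.Linear(m, k)          # encodes each column (length m) to length k
        self.fc_row = nn.Linear(n, d)          # encodes each row of Z (length n) to length d
        self.decoder = nn.Sequential(nn.Linear(k * d, 128), nn.ReLU(), nn.Linear(128, m * n))
        self.m, self.n = m, n

    def forward(self, F):                      # F: (m, n)
        Z = self.fc_col(F.T).T                 # (k, n): column-wise encoding
        Z_prime = self.fc_row(Z)               # (k, d): row-wise encoding
        state = Z_prime.flatten()              # Rep_ae(F), the state representation
        F_hat = self.decoder(state).view(self.m, self.n)
        return state, F_hat

m, n = 100, 6
F = torch.randn(m, n)
ae = FeatureSetAE(m, n)
state, F_hat = ae(F)
loss = nn.functional.mse_loss(F_hat, F)        # minimize the reconstruction error
print(state.shape, loss.item())
```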
Graph Autoencoder Based State Representation. The prior two methods only preserve the information of individual features in the state representation while disregarding their correlations. To further enhance the state representation, we propose a graph autoencoder [17] to incorporate the knowledge of the entire feature set and the feature-feature correlations into the state representation. Figure 4(d) shows the calculation process. Specifically,
Step-1: We build a complete correlation graph \(\mathcal {G}\) by calculating the similarity matrix between each pair of feature columns. The adjacency matrix of \(\mathcal {G}\) is \(\mathbf {A} \in \mathbb {R}^{n\times n}\), where a node is a feature column in \(\mathcal {F}\) and an edge reflects the similarity between two nodes. Then, we use an encoder constructed by a one-layer GCN to aggregate feature knowledge and generate enhanced embeddings \(Z\in \mathbb {R}^{n\times k}\) based on \(\mathbf {A}\), where \(k\) is the dimension of the latent embedding. The calculation process is defined as follows:
\[ Z = \mathbf {D}^{-\frac{1}{2}}\mathbf {A}\mathbf {D}^{-\frac{1}{2}}\mathcal {F}^{\top }\mathbf {W}, \]
where \(\mathcal {F}^{ \top }\) is the transpose of the input feature matrix \(\mathcal {F}\), \(\mathbf {D}\) is the degree matrix, and \(\mathbf {W} \in \mathbb {R}^{m\times k}\) is the parameter of the GCN.
Step-2: We use the decoder to reconstruct the adjacency matrix \(\mathbf {A}^{\prime }\) based on \(Z\), which can be represented as:
\[ \mathbf {A}^{\prime } = \text{sigmoid}(ZZ^{\top }). \]
During the learning process, we minimize the reconstruction error between \(\mathbf {A}\) and \(\mathbf {A}^{\prime }\). The optimization objective is:
\[ \min \ \text{BCELoss}(\mathbf {A}, \mathbf {A}^{\prime }), \]
where BCELoss refers to the binary cross entropy loss function. When the model converges, we average \(Z\) column-wise to obtain the state representation \(Rep_{gae}(\mathcal {F}) \in \mathbb {R}^{1\times k}\). For an operation \(o\), we use its one-hot vector as the state representation, denoted by \(Rep(o)\).
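The graph-autoencoder state representation can be sketched as follows; the correlation-based similarity, the absence of an activation in the encoder, and the single untrained forward pass are illustrative assumptions.

```python
import torch
import torch.nn as nn

def gae_state(F, k=8):
    # F: (m, n) feature matrix; the n feature columns are the graph nodes.
    m, n = F.shape
    A = torch.corrcoef(F.T).abs()                     # similarity adjacency matrix, n x n, in [0, 1]
    D_inv_sqrt = torch.diag(A.sum(dim=1).pow(-0.5))   # D^{-1/2} built from the degree matrix
    W = nn.Parameter(0.01 * torch.randn(m, k))        # one-layer GCN weight, m x k
    Z = D_inv_sqrt @ A @ D_inv_sqrt @ F.T @ W         # GCN encoder: n x k node embeddings
    A_hat = torch.sigmoid(Z @ Z.T)                    # inner-product decoder
    loss = nn.functional.binary_cross_entropy(A_hat, A)   # reconstruction objective (one forward pass)
    return Z.mean(dim=0), loss                        # column-wise average -> Rep_gae(F) of size k

F = torch.randn(100, 6)
state, loss = gae_state(F)
print(state.shape, loss.item())
```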
3.6 Group-wise Feature Generation
We found that using group-level crossing can generate more features each time and, thus, accelerate exploration, augment reward feedback by adding a significant number of features, and help the agents learn effective policies. The selection results of our reinforcement learning system lead to two generation scenarios: (1) a binary operation and two feature groups are selected; (2) a unary operation and two feature groups are selected. However, a challenge arises: what is the most effective generation strategy for each scenario? We next propose two strategies for the two scenarios.
Scenario 1: Cosine Similarity Prioritized Feature-Feature Crossing. We highlight that crossing two features that overlap less in terms of information is more likely to generate informative features. We propose a simple yet efficient strategy: select the top K dissimilar feature pairs between the two feature groups. Specifically, we first cross the two selected feature groups to prepare feature pairs. We then compute the cosine similarities of all feature pairs. Later, we rank and select the top K dissimilar feature pairs. Finally, we apply the operation to the top K selected feature pairs to generate K new features.
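A small Python sketch of this top-K dissimilar pair selection follows; treating the absolute cosine similarity as the information-overlap measure is an assumption.

```python
import numpy as np

def top_k_dissimilar_pairs(G1, G2, K):
    # G1, G2: lists of 1-D feature vectors; return the K cross-group pairs with the
    # lowest absolute cosine similarity (i.e., the least overlapped information).
    def cos(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))
    pairs = [(abs(cos(f1, f2)), i, j) for i, f1 in enumerate(G1) for j, f2 in enumerate(G2)]
    pairs.sort()
    return [(i, j) for _, i, j in pairs[:K]]

rng = np.random.default_rng(0)
G1, G2 = list(rng.normal(size=(3, 50))), list(rng.normal(size=(4, 50)))
chosen = top_k_dissimilar_pairs(G1, G2, K=2)
new_features = [G1[i] + G2[j] for i, j in chosen]     # e.g., binary operation "+"
print(chosen, len(new_features))
```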
Scenario 2: Relevance Prioritized Unary Feature Generation. When a unary operation and two feature groups are selected, we directly apply the operation to the feature group that is more relevant to the target labels. Given a feature group \(\mathcal {C}\), we use the average mutual information between all the features in \(\mathcal {C}\) and the prediction target y to quantify the relevance between the feature group and the prediction target, which is given by: \(rel = \frac{1}{|\mathcal {C}|}\sum _{f_i\in \mathcal {C}} MI(f_i,y)\), where MI is the mutual information function. After the more relevant feature group is identified, we apply the unary operation to it to generate new features.
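A sketch of the relevance-prioritized unary case, using scikit-learn's MI estimator as an illustrative stand-in for the MI function:

```python
import numpy as np
from sklearn.feature_selection import mutual_info_regression

def group_relevance(C, y):
    # Average mutual information between each feature in the group and the target y.
    return np.mean([mutual_info_regression(f.reshape(-1, 1), y, random_state=0)[0] for f in C])

def unary_generate(C1, C2, y, op=lambda f: np.sqrt(np.abs(f))):  # `op` stands in for, e.g., sqrt
    chosen = C1 if group_relevance(C1, y) >= group_relevance(C2, y) else C2
    return [op(f) for f in chosen]

rng = np.random.default_rng(0)
C1, C2 = list(rng.normal(size=(3, 200))), list(rng.normal(size=(2, 200)))
y = C1[0] * 2 + rng.normal(size=200) * 0.1         # make C1 clearly more relevant
print(len(unary_generate(C1, C2, y)))
```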
Post-generation Processing. After feature generation, we combine the newly generated features with the original feature set to form an updated feature set, which will be fed into a downstream task to evaluate predictive performance. Such performance is exploited as reward feedback to update the policies of the three cascading agents in order to optimize the next round of feature generation. To prevent feature number explosion during the iterative generation process, we use a feature selection step to control feature size. When the size of the new feature set exceeds a feature size tolerance threshold, we leverage the K-best feature selection method to reduce the feature size. Otherwise, we do not perform feature selection. We use the tailored new feature set as the original feature set of the next iteration.
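The size-control step can be sketched, for example, with scikit-learn's SelectKBest; the F-test scorer and the budget value are illustrative choices rather than the configuration used in our experiments.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_regression

def control_feature_size(F, y, size_budget):
    # Keep the feature set unchanged if it is within budget; otherwise keep the
    # K best features scored against the target (here with an F-test scorer).
    if F.shape[1] <= size_budget:
        return F
    return SelectKBest(score_func=f_regression, k=size_budget).fit_transform(F, y)

rng = np.random.default_rng(0)
F, y = rng.normal(size=(150, 30)), rng.normal(size=150)
print(control_feature_size(F, y, size_budget=20).shape)   # (150, 20)
```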
Finally, when the maximum number of iterations is reached, the algorithm returns the optimal feature set \(\mathcal {F^{*}}\) that has the best performance on the downstream task over the entire exploration.
3.7 Applications
Our proposed method (GRFG) aims to automatically optimize and generate a traceable and optimal feature space for an ML task. This framework is applicable to a variety of domain tasks without the need for prior knowledge. We apply GRFG to classification, regression, and outlier detection tasks on 29 different tabular datasets to validate its performance.
5 Related Works
Reinforcement Learning (RL) is the study of how intelligent agents should act in a given environment in order to maximize the expectation of cumulative rewards [27]. According to the learned policy, reinforcement learning algorithms can be classified into two categories: value-based and policy-based. Value-based algorithms (e.g., DQN [22], Double DQN [30]) estimate the value of the state or state-action pair for action selection. Policy-based algorithms (e.g., PG [28]) learn a probability distribution that maps states to actions for action selection. Additionally, actor-critic reinforcement learning frameworks have been proposed to incorporate the advantages of value-based and policy-based algorithms [25]. In recent years, RL has been applied to many domains (e.g., spatial-temporal data mining and recommender systems) and has achieved great success [33, 35]. In this article, we formulate the selection of feature groups and operations as MDPs and propose a new cascading agent structure to resolve these MDPs.
Automated Feature Engineering aims to enhance the feature space through feature generation and feature selection in order to improve the performance of ML models [5]. Feature selection removes redundant features and retains the important ones, whereas feature generation creates and adds meaningful variables.
Feature Selection approaches include: (i) filter methods (e.g., univariate selection [7], correlation based selection [38]), in which features are ranked by a specific score such as redundancy or relevance; (ii) wrapper methods (e.g., reinforcement learning [21], branch and bound [18]), in which the optimal feature subset is identified by a search strategy under a predictive task; and (iii) embedded methods (e.g., LASSO [29], decision tree [26]), in which selection is part of the optimization objective of a predictive task.
Feature Generation methods include: (i) latent representation learning based methods, for example, deep factorization machines [10] and deep representation learning [1]; because these methods generate a latent feature space, it is hard to trace and explain the extraction process. (ii) feature transformation based methods, which use arithmetic or aggregate operations to generate new features [4, 16, 37]. These approaches have two weaknesses: (a) they ignore feature-feature heterogeneity among different feature pairs; and (b) the generated feature space grows exponentially as the number of exploration steps increases. Compared with prior literature, our personalized feature crossing strategy captures feature distinctness, our cascading agents learn effective feature interaction policies, and our group-wise generation manner accelerates feature generation.