3.1. Overview
The overview of our proposed SCM framework, which contains a Feature Clustering step and a Feature Ranking step, is depicted in Figure 1. In the Feature Clustering step, SC groups highly correlated features into clusters by capturing both linear and non-linear relationships among features. This clustering not only reduces feature redundancy but also provides a structured foundation for further analysis. In the Feature Ranking step, the grouped features are first normalized to ensure consistency in scale. This preprocessing step prevents features with larger ranges from disproportionately influencing the MIC computation, enabling fair comparisons across features within each cluster. MIC is then used to evaluate the relevance of each feature to the target labels, and the most representative feature from each cluster is selected to form the final optimized feature subset.
The interaction between these two components is crucial: Feature Clustering reduces the complexity of the data by grouping related features, while Feature Ranking refines these groups by isolating the most impactful features. This collaborative process produces an optimized feature subset, which is then input into the decision tree classifier to predict whether the residence of a crashing fault is inside or outside the stack trace.
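To make this interaction concrete, the sketch below outlines the two-step pipeline in Python with scikit-learn. The helper name select_features_scm, the number of clusters, and the use of mutual_info_classif as a stand-in for MIC are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of the SCM pipeline: spectral clustering of features,
# relevance ranking within each cluster, then a decision tree on the subset.
# All names and parameter values are illustrative assumptions.
import numpy as np
from sklearn.cluster import SpectralClustering
from sklearn.feature_selection import mutual_info_classif  # stand-in for MIC
from sklearn.preprocessing import MinMaxScaler
from sklearn.tree import DecisionTreeClassifier


def select_features_scm(X, y, n_clusters=10):
    """Return the column indices of the selected subset (one feature per cluster)."""
    X = MinMaxScaler().fit_transform(X)              # normalize feature scales
    affinity = np.abs(np.corrcoef(X, rowvar=False))  # feature-feature similarity
    labels = SpectralClustering(n_clusters=n_clusters, affinity="precomputed",
                                random_state=0).fit_predict(affinity)
    relevance = mutual_info_classif(X, y)            # MIC would be used here
    return [int(np.argmax(np.where(labels == c, relevance, -np.inf)))
            for c in range(n_clusters)]


# Usage (X_train, y_train are an assumed crash-data feature matrix and labels):
# selected = select_features_scm(X_train, y_train, n_clusters=10)
# clf = DecisionTreeClassifier(random_state=0).fit(X_train[:, selected], y_train)
```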
Detailed descriptions of the SC and MIC methods are provided in Section 3.2 and Section 3.3, respectively, as both are integral to the proposed SCM framework.
3.2. Feature Clustering with SC
The aim of the feature clustering step is to group highly correlated features while minimizing redundancy across groups. Traditional clustering methods such as k-means, which assume spherical data distributions, often struggle to capture the complex, non-linear relationships among crash data features. We therefore adopt the spectral clustering (SC) algorithm [4,5], a graph-based method capable of capturing non-linear correlations through pairwise similarity analysis. Specifically, SC divides the original feature set into multiple clusters such that features within the same cluster exhibit high correlation, whereas features in different clusters exhibit minimal overlap. The technical details of SC are elaborated below.
Let the feature set of the crash data be $F = \{f_1, f_2, \ldots, f_m\}$, where $m$ is the number of features. We construct an undirected graph $G = (V, E)$, in which $V = \{v_1, v_2, \ldots, v_m\}$ is the set of points corresponding to the features and $E$ is the set of edges. Considering a point pair $(v_i, v_j)$, we define $w_{ij}$ as the weight of the edge connecting $v_i$ and $v_j$. If there is a connection between $v_i$ and $v_j$, we set $w_{ij} > 0$; otherwise, $w_{ij} = 0$.
To quantify the similarity between features, we first establish the adjacency matrix $W = (w_{ij})_{m \times m}$ by applying the fully connected technique, in which the weights between all point pairs are non-negative. The Gaussian radial basis function is chosen to define the weight of each pair, offering a continuous and differentiable metric for quantifying similarity, as shown in Equation (1):

$$ w_{ij} = \exp\left( -\frac{\| v_i - v_j \|_2^2}{2\sigma^2} \right) \quad (1) $$

where $\sigma$ is a free parameter denoting the width of the Gaussian kernel and $\| \cdot \|_2$ denotes the $\ell_2$-norm. $\sigma$ controls the scale of the similarity measurement: smaller $\sigma$ values emphasize local relationships by assigning higher weights to closer features, while larger $\sigma$ values capture broader relationships with reduced local sensitivity. In this study, $\sigma$ was selected through cross-validation to balance local and global feature relationships. This function ensures that the adjacency matrix $W$ effectively represents the relationships between features in the feature space.
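As a concrete illustration, the following minimal sketch computes the fully connected adjacency matrix of Equation (1) over the feature vectors (columns of the data matrix); the kernel width is passed in as a parameter, since the paper selects $\sigma$ by cross-validation.

```python
import numpy as np


def rbf_adjacency(X, sigma=1.0):
    """Fully connected adjacency matrix W over the features (columns of X),
    weighted by the Gaussian radial basis function of Equation (1).
    sigma is a placeholder; the paper tunes it by cross-validation."""
    F = X.T                                              # one row per feature vector
    sq_dists = np.sum((F[:, None, :] - F[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq_dists / (2.0 * sigma ** 2))
```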
Based on the adjacency matrix $W$, we compute a degree matrix $D = \mathrm{diag}(d_1, \ldots, d_m)$ by defining the degree $d_i$ corresponding to $v_i$ as $d_i = \sum_{j=1}^{m} w_{ij}$. This degree matrix reflects the overall connectivity of each feature within the graph, summarizing the influence of each node.
Next, using $W$ and $D$, we derive the unnormalized Laplacian matrix, $L = D - W$, and the normalized Laplacian matrix, $L_{\mathrm{sym}} = D^{-1/2} L D^{-1/2}$. The normalized Laplacian matrix ensures that features with varying connectivity are appropriately scaled, making the clustering results less sensitive to the magnitude of node degrees.
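A minimal sketch of these constructions, assuming the adjacency matrix W produced in the previous step:

```python
import numpy as np


def laplacians(W):
    """Degree matrix D, unnormalized Laplacian L = D - W, and the
    symmetric normalized Laplacian D^{-1/2} L D^{-1/2}."""
    d = W.sum(axis=1)                        # degree of each node
    D = np.diag(d)
    L = D - W
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))   # degrees are positive for a fully connected graph
    return D, L, D_inv_sqrt @ L @ D_inv_sqrt
```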
Subsequently, to obtain the clustering results, SC adopts the idea of graph cutting: it makes the weight of the edges between different subgraphs as low as possible while keeping the weight of the inner edges of each subgraph as high as possible. Different subgraphs represent different groups, and points in the same subgraph represent features in the same group with higher similarity. For each feature $v_i$, we have the formula shown in Equation (2). The eigenvectors corresponding to the $k$ smallest eigenvalues capture the most significant structural information of the graph, allowing the features to be effectively projected into a reduced $k$-dimensional space. These eigenvectors are then used to construct the final eigenmatrix, $H \in \mathbb{R}^{m \times k}$. Finally, the $k$-means algorithm is applied to the rows of $H$, with each row representing a feature in the reduced $k$-dimensional space. This approach ensures that the clustering is performed on features with preserved pairwise similarities, as encoded by the graph.
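This projection-and-clustering step could be sketched as follows, assuming the normalized Laplacian from the previous step; the number of clusters k is a parameter of the method.

```python
import numpy as np
from scipy.linalg import eigh
from sklearn.cluster import KMeans


def spectral_feature_clusters(L_sym, k):
    """Assign each feature to one of k clusters using the eigenvectors of the
    normalized Laplacian associated with the k smallest eigenvalues."""
    _, eigvecs = eigh(L_sym)                 # eigenvalues in ascending order
    H = eigvecs[:, :k]                       # m x k eigenmatrix, one row per feature
    return KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(H)
```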
3.3. Feature Ranking with MIC
The goal of the feature ranking stage is to select the feature in each group that is most relevant to the class label and discard the redundant ones. The key to this step is to precisely measure the correlation between the crash data features and the corresponding instance labels. To achieve this, we employ MIC [6], a robust statistical measure capable of capturing both linear and non-linear dependencies. By ranking features within each cluster based on MIC values and retaining the top-ranked feature, we ensure the final subset preserves high relevance to fault residence while eliminating redundant features.
MIC was chosen for its ability to detect both linear and non-linear dependencies, making it particularly suitable for crash data with potentially complex relationships. While Pearson’s correlation coefficient effectively captures linear relationships, it may fail to identify non-linear patterns. In comparison, MIC normalizes its values within the range [0, 1], enhancing interpretability and facilitating feature ranking. These qualities make MIC a strong choice for identifying relevant features in diverse and high-dimensional datasets.
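As a small illustration of this difference, the snippet below compares Pearson's correlation with MIC on a synthetic quadratic relationship; it assumes the minepy package, one common MIC implementation, is available.

```python
import numpy as np
from scipy.stats import pearsonr
from minepy import MINE  # assumes the minepy package is installed

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, 500)
y = x ** 2 + rng.normal(0, 0.05, 500)   # non-linear (quadratic) dependence

mine = MINE(alpha=0.6, c=15)            # alpha=0.6 corresponds to B(n) = n^0.6
mine.compute_score(x, y)
print(f"Pearson r = {pearsonr(x, y)[0]:.3f}")  # near zero: misses the pattern
print(f"MIC       = {mine.mic():.3f}")         # close to one: detects it
```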
To better understand its suitability, recall that MIC was originally designed to measure the dependence between two variables. For a given dataset $D$ with $n$ samples, let the feature set and the corresponding label set be represented as two variables, $X$ and $Y$, respectively. If a relevance between $X$ and $Y$ exists, then we can draw a grid on their scatter plot. Given that the values of the two variables are divided into $p$ bins and $q$ bins, respectively, we construct a $p$-by-$q$ grid, $G$. We denote $D|_G$ as the probability distribution induced by the number of points falling into each cell of $G$ for the dataset $D$; the probability of each cell of $G$ is determined by the proportion of all points that fall inside it. The maximum mutual information of $D$ over $p$-by-$q$ grids is denoted $I^*(D, p, q)$, as shown in Equation (3):

$$ I^*(D, p, q) = \max_{G} I(D|_G) \quad (3) $$
where $I(D|_G)$ denotes the mutual information of $D|_G$, and $I^*(D, p, q)$ is the maximum value of $I(D|_G)$ over all $p$-by-$q$ grids $G$. Equation (4) provides the definition of mutual information:

$$ I(X; Y) = H(X) + H(Y) - H(X, Y) \quad (4) $$

where $H(X)$ and $H(Y)$ represent the entropies of the random variables $X$ and $Y$, respectively, and $H(X, Y)$ denotes the joint entropy of $X$ and $Y$.
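A brief sketch of Equation (4), estimating the mutual information from a p-by-q grid whose cell probabilities are the fractions of points falling inside each cell (an equal-width grid is used here for simplicity):

```python
import numpy as np


def grid_mutual_information(x, y, p, q):
    """I(X;Y) = H(X) + H(Y) - H(X,Y), estimated from a p-by-q grid whose cell
    probabilities are the proportions of points falling inside each cell."""
    counts, _, _ = np.histogram2d(x, y, bins=(p, q))
    pxy = counts / counts.sum()
    px, py = pxy.sum(axis=1), pxy.sum(axis=0)

    def entropy(dist):
        dist = dist[dist > 0]
        return -np.sum(dist * np.log2(dist))

    return entropy(px) + entropy(py) - entropy(pxy.ravel())
```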
For different choices of $p$ and $q$, we obtain multiple values of $I^*(D, p, q)$. In this case, a characteristic matrix $M(D)$ is constructed, as defined in Equation (5):

$$ M(D)_{p,q} = \frac{I^*(D, p, q)}{\log \min\{p, q\}} \quad (5) $$

The goal of Equation (5) is to restrict the matrix values to between 0 and 1, reducing the differences between dimensions. Then, the MIC value can be calculated as shown in Equation (6):

$$ \mathrm{MIC}(D) = \max_{pq < B(n)} M(D)_{p,q} \quad (6) $$
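The following toy routine sketches Equations (5) and (6), reusing the grid_mutual_information helper from the previous sketch: each grid's mutual information is normalized by $\log \min\{p, q\}$, and the maximum is taken over grid sizes with $pq < B(n)$. It searches equal-width grids only, whereas the exact MIC definition maximizes over all possible grid partitions.

```python
import numpy as np


def mic_toy(x, y, alpha=0.6):
    """Simplified MIC: max over p*q < B(n) of the normalized mutual information
    M(D)_{p,q} = I*(D,p,q) / log2(min(p,q)), with I* approximated on
    equal-width grids via grid_mutual_information from the previous sketch."""
    n = len(x)
    B = n ** alpha                                # upper bound on the grid size
    best = 0.0
    for p in range(2, int(B) + 1):
        for q in range(2, int(B) + 1):
            if p * q >= B:
                continue
            m_pq = grid_mutual_information(x, y, p, q) / np.log2(min(p, q))
            best = max(best, m_pq)                # entry of the characteristic matrix
    return best
```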
where $B(n)$ represents the upper bound of the grid size. We set $B(n) = n^{0.6}$, following the same setting as in previous work [6]. This choice of $B(n)$ balances computational efficiency and granularity, ensuring that the dependencies are adequately represented without excessive computational overhead. In each feature group obtained in the clustering step, we compute the MIC value for the corresponding features and retain the one with the highest MIC value by adding it to the final feature subset. Thus, we obtain an optimal feature set containing one feature per cluster, in which the remaining features have low redundancy with each other but are highly correlated with the class labels. While MIC is robust for capturing feature–label dependencies, its computation per feature grows expensive as the sample size increases, and for large datasets this cost can become prohibitive. To address this, we reduce the feature space through clustering, significantly limiting the number of feature pairs requiring MIC computation. Additionally, we leverage parallel computing to accelerate the MIC calculation, improving the efficiency and scalability of the process within the dataset sizes used in this study.
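Putting the two stages together, the per-cluster selection can be sketched as follows; mic_score stands for any MIC implementation (for example, a wrapper around the minepy call shown earlier), and cluster_labels are the assignments produced by the spectral-clustering step.

```python
import numpy as np


def select_by_mic(X, y, cluster_labels, mic_score):
    """Keep, from each feature cluster, the single feature whose MIC with the
    labels y is highest; return the column indices of the final subset."""
    selected = []
    for c in np.unique(cluster_labels):
        members = np.flatnonzero(cluster_labels == c)
        scores = [mic_score(X[:, j], y) for j in members]
        selected.append(int(members[int(np.argmax(scores))]))
    return sorted(selected)
```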
At the end of the two-stage feature selection process, the most representative feature from each cluster is selected based on its MIC value, forming the final optimized feature subset. This subset, with reduced redundancy and high relevance to the class labels, is then used as input for the decision tree classifier. The classifier leverages these selected features to construct a model for predicting the residence of crashing faults efficiently.