1. Introduction
In classical learning problems, data are usually assumed to be independently and identically distributed, without considering the relationships between samples; however, when edges are used to explicitly express these relationships and a graph representation of the data is constructed, the performance of the learning algorithm is often significantly improved [1]. In the past, graph-based learning methods were mainly devoted to propagating information from sample nodes over graphs in a linear fashion, with typical algorithms including spectral clustering [2], random walk [3], and label propagation [4]. In recent years, with the rapid development of deep learning, studies on applying deep learning techniques to graph data with a high structural specificity, i.e., studies on graph neural networks (GNNs), have been conducted. Such studies aim to develop more flexible and effective nonlinear information propagation algorithms for learning high-quality graph representations, so as to better conduct various tasks on graphs, such as node classification, link prediction, and community detection. As a popular research direction in machine learning and data mining, GNNs have shown excellent performance and a wide range of applications in many fields, such as logic inference [5], knowledge graphs [6], recommendation systems [7], natural language processing [8], air quality monitoring [9], and remote sensing [10].
In a considerable number of real-world graph node classification tasks, the sample data follow a long-tailed distribution; that is, a few classes have abundant samples, whereas most classes have only a handful of samples. For example, in the NCI chemical compound graphs, only approximately 5% of the molecules were labeled as active in an anti-cancer bioassay test [11]. As another example, in the Cora dataset [12], which is widely used in the graph representation learning (GRL) field, the largest category, "Probabilistic Methods", contains 4.54 times as many samples as the smallest category, "Theory". Furthermore, in the real world, it is sometimes extremely important to correctly classify samples of minority classes that occur infrequently. For example, a rare positive result in a COVID-19 test should not be incorrectly classified as negative, because undetected patients may accelerate the spread of the virus. As another example, in the continuous surveillance of a particular street, occasional criminal activity is rare compared with normal activity, yet it must still be detected by the surveillance system. However, most standard learning algorithms attempt to minimize the overall classification error during training and implicitly assign the same misclassification cost to all types of errors. As a result, the imbalanced class distribution of the data forces the classifier to overfit the majority classes; that is, the features of the minority class samples cannot be adequately learned [13].
In graph data, the problem of the classifier overfitting the majority classes is exacerbated by the presence of the topological interplay effect [11]. This effect is manifested in the widespread presence of different forms of topological connections (i.e., explicit representations of inter-sample relationships) between nodes in graph data, which makes the classifier's judgment of a node's class depend not only on the node's own features but also on the other nodes connected to it. Therefore, the learning of a specific class using a GNN is strongly influenced by the neighboring classes. Furthermore, through message passing between nodes, feature propagation is dominated by the majority classes, making it more difficult for the GNN to define the demarcation line between classes and thus to learn the more discriminative features required for classification.
To overcome the above problem, in this paper, a new graph neural network model adapted to node classification on imbalanced graph datasets is proposed, i.e., the dual cost-sensitive graph convolutional network (DCSGCN). To the best of our knowledge, our study is among the first devoted to the imbalanced graph node classification task. Our idea is inspired by research in the field of cost-sensitive learning. Our starting point is that different types of errors lead to different degrees of misclassification cost, and the learning algorithm needs to focus on samples that may lead to high misclassification costs, thereby decreasing the overall misclassification cost of the classifier. In the case of imbalanced data classification, the classifier tends to overfit the majority classes. Therefore, to accurately identify the minority class samples, minority classes should be considered more important during learning; that is, misclassifying a minority class sample should incur a higher cost than misclassifying a majority class sample.
We conducted experiments on a large number of standard datasets within the graph representation learning domain. In addition to the classical classification metrics Micro-F1 [14] and Macro-F1 [14], we also tested the standard metrics in the field of cost-sensitive learning, i.e., the misclassification cost [15] and the number of high-cost errors [15]. To the best of our knowledge, this study is the first to report the performance of graph node classification models under the cost and high-cost error metrics. Further, we report the performance variation of DCSGCN under various changes in the experimental settings (e.g., different training set sizes and degrees of imbalance) and the possible causes of such changes. We also conducted a detailed ablation study and parameter analysis of DCSGCN to verify the role of its key components and parameters. Finally, we visualized the learned vector representations of DCSGCN to illustrate its superior classification performance.
In summary, our contribution is two-fold:
Unlike the standard GNNs based on maximizing the posterior probability, we present the first cost-sensitive graph node classification model, DCSGCN, which is based on minimizing the classification risk. In addition, we are the first to introduce methods for computing node cost labels based on graph topological information and the node class distribution. Our study is among the first devoted to the task of semi-supervised multi-class imbalanced long-tailed graph node classification.
In extensive experiments conducted on a wide range of standard datasets, for the first time, we report the performance of graph neural network models from a cost-sensitive learning perspective. The experimental results demonstrate the simplicity and effectiveness of the proposed approach. Compared with the current state-of-the-art model RECT [16], DCSGCN shows an improvement of 10.6% in terms of Micro-F1 and a reduction of 34.3% in the average cost.
The remainder of this paper is structured as follows. We survey the related work in Section 2. In Section 3, we outline our methods for learning representations of long-tailed imbalanced graphs and for generating cost labels based on the label distribution and graph topology. Section 4 explains the experimental settings, while Section 5 describes the results of our experiments and answers the research questions of interest. Finally, we conclude the paper in Section 6.
3. Methodology
3.1. Method Overview
Conventionally, for a labeled sample $v$ with true class $i$, the standard GNN attempts to minimize the cross-entropy loss between the predicted and labeled class of $v$; that is, it maximizes the posterior probability that $v$ belongs to class $i$:

$$\max_{\theta}\; P(i \mid v;\, \theta) \;\;\Longleftrightarrow\;\; \min_{\theta}\; -\log P(i \mid v;\, \theta). \qquad (1)$$
Herein, we omit the regularization term for simplicity. In our cost-sensitive setting, let the $(i, j)$ entry of the cost matrix $C$ be the misclassification cost of predicting $v$ in class $i$ when the true class is $j$. Instead of maximizing the posterior probability that $v$ belongs to class $i$, the proposed DCSGCN model seeks to minimize the risk (or misclassification cost) of placing $v$ into class $i$, denoted as $R(i \mid v)$, which is the summation of the product of the posterior probability $P(j \mid v)$ and the misclassification cost $C(i, j)$ over the potential classes:

$$R(i \mid v) = \sum_{j=1}^{m} P(j \mid v)\; C(i, j). \qquad (2)$$
According to the above equation, the lower the posterior probability that $v$ belongs to a non-$i$ class, and the lower the cost of classifying $v$ into $i$ when $v$ actually belongs to a non-$i$ class, the smaller the total risk. In the case of imbalanced data, assuming $i$ is a minority class and $j$ is a majority class, according to the above assumption we should have $C(j, i) > C(i, j)$. Therefore, for a minority class sample $v$, even if the classifier overfits the majority class features such that its learned posterior probability satisfies $P(j \mid v) > P(i \mid v)$, we can still have $R(i \mid v) < R(j \mid v)$ and thus make a correct prediction.
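As a concrete illustration with hypothetical numbers of our own choosing (not taken from the experiments): suppose the classifier outputs $P(i \mid v) = 0.4$ and $P(j \mid v) = 0.6$ for a minority-class node $v$ whose true class is $i$, and the costs are $C(i, j) = 1$ and $C(j, i) = 4$ with zero cost for correct predictions. Then

$$R(i \mid v) = 0.6 \times 1 = 0.6, \qquad R(j \mid v) = 0.4 \times 4 = 1.6,$$

so the risk-minimizing decision selects the minority class $i$ even though class $j$ has the larger posterior probability.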
According to Equation (2), DCSGCN is naturally designed as a two-tower model (see Figure 1). The model contains two bilayer graph convolutional network (GCN) components, which we denote as $f_P$ and $f_C$. Among them, $f_P$ is responsible for learning the posterior probability $P(j \mid v)$, and $f_C$ learns the misclassification cost information $C(i, j)$. The integration of the two components yields the risk of each candidate class for the node.
Another major problem faced during training is that the training set only has category labels and lacks cost labels; thus, $f_C$ cannot be trained directly. Therefore, in this study, we propose three heuristics for computing the cost labels. Our idea is quite simple: suppose the numbers of samples of class $i$ and class $j$ in a region $S$ near node $v$ are $n_i^S$ and $n_j^S$, respectively; we then require the misclassification costs to satisfy

$$\frac{c_i}{c_j} \approx \frac{n_j^{S}}{n_i^{S}},$$

i.e., the cost of misclassifying a class-$i$ sample (denoted as $c_i$) is negatively related to the number of samples contained in that class ($n_i^S$). When $i$ is a minority class in $S$ and $j$ is a majority class, then $n_i^S < n_j^S$ and $c_i > c_j$, which is consistent with the above cost requirement. To make the calculation more accurate, $S$ is designed as multiple regions of different sizes centered at $v$, and each region is weighted based on its node features. Based on the cost labels $c$ obtained in this way, the final loss function combines two terms balanced by a weight parameter: the loss of the cost-sensitive prediction and the loss of the cost matrix learning, the latter being a mean squared error against the cost labels. By minimizing this combined loss, DCSGCN can simultaneously learn effective cost labels and make correct classifications.
3.2. Notations and Definitions
Input. The graph $G = (V, A, X)$. Here, $V$ denotes the set of graph nodes ($|V| = n$), and $A \in \{0, 1\}^{n \times n}$ denotes the adjacency matrix. In this case, $A_{uv} = 1$ when there is an undirected edge between nodes $u$ and $v$; otherwise, $A_{uv} = 0$. The self-loops in $G$ have been removed, and thus $A_{vv} = 0$ for all $v \in V$. Moreover, $X \in \mathbb{R}^{n \times k}$ is the feature matrix, where $x_v$ is the feature vector of node $v$ and has dimension $k$. In addition, $Y$ is the class information for the labeled node set $L \subseteq V$, where $y_v \in \{1, 2, \ldots, m\}$ denotes the class of node $v$ and $|L_i|$ represents the number of samples belonging to the $i$th category in $L$. Given the imbalanced class distribution of the nodes, the number of samples contained in different classes may vary significantly. We define the sets of majority and minority classes as $C_{\mathrm{maj}}$ and $C_{\mathrm{min}}$, respectively; for $i \in C_{\mathrm{maj}}$ and $j \in C_{\mathrm{min}}$, we have $|L_i| \gg |L_j|$. The commonly used notations in this paper are listed in Table 1.
Output. Our goal is to learn a graph neural network $f$ that maps the input information $G$ into a low-dimensional dense vector representation $Z \in \mathbb{R}^{n \times d}$ for all nodes, where $z_v$ is the vector of node $v$ with dimension $d$, together with the category labels for the unlabeled node set $U = V \setminus L$. Clearly, an effective $f$ can correctly predict unlabeled samples from both the majority and minority classes.
Cost matrix. The cost matrix $C$ is an $m \times m$ real square matrix (where $m$ is the number of node classes), in which the $(i, j)$ entry of $C$ is the misclassification cost of predicting class $i$ when the true class is $j$, and $C(i, j) \geq 0$. Naturally, because the prediction of the classifier is correct when $i = j$, we have $C(i, i) = 0$. In the case of imbalanced classification, the minority class samples are more likely to be misclassified as majority classes, and thus when $i \in C_{\mathrm{maj}}$ and $j \in C_{\mathrm{min}}$, we should have $C(i, j) > C(j, i)$. The element values of $C$ can be set manually using domain knowledge or learned from the training data. In this study, we designed three heuristic algorithms for automatically computing the cost labels of the training set based on graph topological information, which will be presented later.
Cost-sensitive prediction. When the cost matrix $C$ is known, we can make a cost-sensitive prediction consistent with the Bayes optimal decision, which computes the corresponding cost of assigning node $v$ to each class (denoted as $R(y \mid v)$ for class $y$) and selects the class with the lowest classification cost:

$$\hat{y}_v = \arg\min_{y} R(y \mid v) = \arg\min_{y} \sum_{k=1}^{m} P(k \mid v)\, C(y, k). \qquad (5)$$

For any class $y$, $R(y \mid v)$ measures the total cost of assigning node $v$ to class $y$, $P(k \mid v)$ represents the estimated likelihood of the classifier assigning $v$ to class $k$, and $C(y, k)$ denotes the cost of misclassifying $v$, whose true class is $k$, into class $y$. The total cost is small when $v$ is unlikely to belong to a non-$y$ class and when the cost of misclassifying a non-$y$ class sample into class $y$ is small. This is the essence of cost sensitivity: one class may be judged optimal even when another class is more probable. In the non-cost-sensitive case, i.e., when all misclassifications have the same cost and the correct classification cost is zero, the above equation degenerates to the commonly used prediction $\hat{y}_v = \arg\max_{y} P(y \mid v)$, i.e., the pursuit of the class with the largest posterior probability. It follows that minimizing the misclassification cost is a more general setting than maximizing the posterior probability.
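As a minimal illustration (not code from the paper), the following NumPy sketch computes the expected risk of every class from a posterior matrix and a cost matrix and returns the risk-minimizing class; all names are ours:

```python
import numpy as np

def cost_sensitive_predict(posterior: np.ndarray, cost: np.ndarray) -> np.ndarray:
    """posterior: (n, m) class probabilities per node;
    cost: (m, m) matrix with cost[y, k] = cost of predicting y when the true class is k."""
    # risk[v, y] = sum_k posterior[v, k] * cost[y, k]   (Equation (5))
    risk = posterior @ cost.T
    # With a uniform off-diagonal cost matrix this reduces to argmax of the posterior.
    return risk.argmin(axis=1)
```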
3.3. Proposed Model: Dual Cost-Sensitive Graph Convolutional Network (DCSGCN)
3.3.1. Estimating Node Probability: $f_P$
We use a classical two-layer GCN structure [39] to compute the posterior probability of the class to which a node belongs and denote this subnetwork as $f_P$. The first and second layers of $f_P$ are denoted as $f_P^{(1)}$ and $f_P^{(2)}$, respectively, and their corresponding outputs $\{H_P^{(1)}, H_P^{(2)}\}$ are as follows:

$$H_P^{(1)} = \mathrm{ReLU}\big(\hat{A}\, X\, W_P^{(1)}\big), \qquad (6)$$

$$H_P^{(2)} = \mathrm{softmax}\big(\hat{A}\, H_P^{(1)}\, W_P^{(2)}\big), \qquad (7)$$

where $\hat{A} = \tilde{D}^{-1/2} \tilde{A} \tilde{D}^{-1/2}$, $\tilde{A} = A + I_n$, and $I_n$ is an identity matrix of size $n$. In addition, $\tilde{D}$ is a diagonal matrix with $\tilde{D}_{uu} = \sum_{v} \tilde{A}_{uv}$. Here, $\hat{A}$ is the normalized adjacency matrix, and $A$ and $X$ are the adjacency and feature matrices introduced in Section 3.2, respectively.
Furthermore, $W_P^{(1)} \in \mathbb{R}^{k \times r}$ and $W_P^{(2)} \in \mathbb{R}^{r \times m}$ are the learnable parameters of the first and second layers of $f_P$, where $r$ and $m$ are the dimensions of the output vectors of $f_P^{(1)}$ and $f_P^{(2)}$, respectively; here, $m$ equals the number of node classes. In addition, ReLU and softmax are the respective activation functions of the first and second layers. Moreover, $H_P^{(2)} \in \mathbb{R}^{n \times m}$, and its $v$th row represents the posterior probability of the class to which node $v$ belongs. For $\{H_P^{(1)}, H_P^{(2)}, f_P^{(1)}, f_P^{(2)}, W_P^{(1)}, W_P^{(2)}\}$, the superscript indicates the layer number, and the subscript $P$ indicates that the quantity belongs to $f_P$, distinguishing it from those of $f_C$ introduced below.
In (6) and (7), the role of $\hat{A}$, the normalized adjacency matrix, is to enrich the feature vector of a node by linearly aggregating the feature vectors of its neighbors. This follows the basic assumption of a GCN that neighboring nodes (and nodes having similar neighbors) are more likely to belong to the same class. The role of $W_P^{(1)}$ and $W_P^{(2)}$ is to transform the feature dimension of the nodes, turning sparse high-dimensional node features into dense low-dimensional ones. In addition, (6) and (7) can also be described equivalently as the process by which the input signal (i.e., the node features $X$) is filtered through a graph Fourier transform in the graph spectral domain [44]; however, in this paper, we adopt the spatial-domain view.
Eventually, given the one-hot training labels $Y \in \{0, 1\}^{|L| \times m}$, we compute the posterior probability $P(i \mid v) = (H_P^{(2)})_{vi}$ and the cross-entropy error $\mathcal{L}_P$ as follows:

$$\mathcal{L}_P = -\sum_{v \in L} \sum_{i=1}^{m} Y_{vi} \ln \big(H_P^{(2)}\big)_{vi}. \qquad (8)$$

By minimizing $\mathcal{L}_P$, we can learn the parameters of $f_P$ such that it predicts the posterior probability of the class to which an unlabeled node belongs.
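For reference, a minimal PyTorch sketch of such a two-layer posterior head is given below. The symmetric normalization follows Equations (6) and (7); the class names, layer sizes, and the commented training line are our own illustrative choices rather than the authors' implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def normalize_adjacency(A: torch.Tensor) -> torch.Tensor:
    """Build A_hat = D^(-1/2) (A + I) D^(-1/2), as used in Equations (6) and (7)."""
    A_tilde = A + torch.eye(A.size(0))
    d_inv_sqrt = A_tilde.sum(dim=1).pow(-0.5)
    return d_inv_sqrt.unsqueeze(1) * A_tilde * d_inv_sqrt.unsqueeze(0)

class PosteriorGCN(nn.Module):
    """Posterior head f_P: ReLU in the first layer, softmax in the second."""
    def __init__(self, in_dim: int, hidden_dim: int, num_classes: int):
        super().__init__()
        self.W1 = nn.Linear(in_dim, hidden_dim, bias=False)
        self.W2 = nn.Linear(hidden_dim, num_classes, bias=False)

    def forward(self, A_hat: torch.Tensor, X: torch.Tensor) -> torch.Tensor:
        H1 = F.relu(A_hat @ self.W1(X))                 # Equation (6)
        return F.softmax(A_hat @ self.W2(H1), dim=1)    # Equation (7)

# Training on the labeled nodes minimizes the cross-entropy of Equation (8), e.g.:
# loss_p = F.nll_loss(torch.log(model(A_hat, X)[train_idx] + 1e-12), y_train)
```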
3.3.2. Estimating Node Cost: $f_C$
In addition to $f_P$, we use $f_C$, another neural network with a two-layer GCN structure, to predict the node misclassification cost. For a sample $(v, y_v)$ in the training set, where $x_v$ and $y_v$ denote the feature vector and label of node $v$, respectively, $f_C$ predicts the $v$-dependent cost matrix $\hat{C}^{v}$, whose element $\hat{C}^{v}(i, j)$ denotes the cost of predicting $v$ in class $i$ when the true class is $j$. Here, we use a GCN instead of a feedforward neural network to predict the cost because we want to exploit the topological information of the graph.
The setting we adopt here is an example-dependent cost, meaning that the cost matrix may differ between samples (even if they belong to the same class), in contrast to a uniform cost matrix used for each category. This is because different nodes occupy different positions in the network. For example, consider two nodes $v$ and $u$ that belong to the same global minority class. Suppose $v$ belongs to the local majority class (i.e., most of the neighbors of $v$ belong to the same global minority class as $v$), whereas $u$ belongs to the minority class even locally. Because $f_P$ tends to assign similar classes to neighboring nodes, the probability of correctly predicting $v$ as a global minority class will be much higher than that of correctly predicting $u$. Therefore, to prevent the model from making incorrect predictions, the cost generated by $f_C$ for incorrectly predicting $u$ as a global majority class should be much larger than the corresponding cost for $v$. From this perspective, the example-dependent cost information computed by $f_C$ can be interpreted as a "correction" to the posterior probabilities generated by $f_P$. Clearly, the class-dependent cost matrix is a special case of the sample-dependent setting.
Similar to (6) and (7), the outputs of the first layer $H_C^{(1)}$ and the second layer $H_C^{(2)}$ of $f_C$ are as follows:

$$H_C^{(1)} = \mathrm{ReLU}\big(\hat{A}\, X\, W_C^{(1)}\big), \qquad (9)$$

$$H_C^{(2)} = \mathrm{ReLU}\big(\hat{A}\, H_C^{(1)}\, W_C^{(2)}\big). \qquad (10)$$

In the above equations, $\hat{A}$ is the same as that in $f_P$. In addition, $W_C^{(1)} \in \mathbb{R}^{k \times r}$ and $W_C^{(2)} \in \mathbb{R}^{r \times m^2}$ are the learnable parameters of the first and second layers of $f_C$. Moreover, $r$ and $m^2$ are the respective dimensions of the output vectors of $f_C^{(1)}$ and $f_C^{(2)}$, where $m$ is the number of node categories, and each row of $H_C^{(2)} \in \mathbb{R}^{n \times m^2}$ is reshaped into the estimated $m \times m$ cost matrix $\hat{C}^{v}$ of the corresponding node. Note that the second layer also uses ReLU as the activation function because the cost matrix generated by the second layer is not a posterior probability. Finally, the mean squared error (MSE) is used as the loss function of $f_C$ for training its parameters:

$$\mathcal{L}_C = \sum_{v \in L} \big\| \hat{C}^{v} - C^{v} \big\|_2^2, \qquad (11)$$

where $C^{v}$ is the cost matrix label of training sample $v$, and $\|\cdot\|_2$ denotes the 2-norm of the matrix. We elaborate on three heuristics for computing $C^{v}$ based on the graph topological information in Section 3.4.
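A corresponding sketch of the cost head under these assumptions (in particular, that the per-node output is reshaped into an $m \times m$ matrix, which is our reading rather than a quoted detail of the paper) could look as follows:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CostGCN(nn.Module):
    """Cost head f_C: both layers use ReLU; the per-node output is reshaped into an
    m x m cost matrix (this reshaping is an assumption of this sketch)."""
    def __init__(self, in_dim: int, hidden_dim: int, num_classes: int):
        super().__init__()
        self.m = num_classes
        self.W1 = nn.Linear(in_dim, hidden_dim, bias=False)
        self.W2 = nn.Linear(hidden_dim, num_classes * num_classes, bias=False)

    def forward(self, A_hat: torch.Tensor, X: torch.Tensor) -> torch.Tensor:
        H1 = F.relu(A_hat @ self.W1(X))        # Equation (9)
        H2 = F.relu(A_hat @ self.W2(H1))       # Equation (10)
        return H2.view(-1, self.m, self.m)     # one estimated cost matrix per node

# Equation (11): squared error between predicted and heuristic cost labels, e.g.:
# loss_c = F.mse_loss(cost_model(A_hat, X)[train_idx], cost_labels, reduction='sum')
```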
3.3.3. Ensemble of Node Probability and Cost
For a node sample $v$, $f_P$ and $f_C$ compute the posterior probability $P(\cdot \mid v)$ and the cost matrix $\hat{C}^{v}$, respectively. Based on these two quantities, we can calculate the misclassification cost $R(y \mid v)$ of predicting each class according to (5) and select the class with the minimal expected cost as follows:

$$\hat{y}_v = \arg\min_{y} R(y \mid v) = \arg\min_{y} \sum_{k=1}^{m} P(k \mid v)\, \hat{C}^{v}(y, k), \qquad (12)$$

where $P(k \mid v)$ is the probability that $v$ will be predicted as the $k$th class by $f_P$, and $\hat{C}^{v}(y, k)$ is the cost of misclassifying $v$, which belongs to the $k$th class, into the $y$th class as estimated by $f_C$. Equation (12) is a cost-sensitive prediction consistent with the Bayes optimal decision.
Minimizing $R(y \mid v)$ is equivalent to maximizing $-R(y \mid v)$. Using the softmax function to normalize $-R(\cdot \mid v)$, we obtain the predicted probability distribution and the corresponding cross-entropy error $\mathcal{L}_{\mathrm{ens}}$:

$$\tilde{P}(y \mid v) = \frac{\exp\big(-R(y \mid v)\big)}{\sum_{k=1}^{m} \exp\big(-R(k \mid v)\big)}, \qquad (13)$$

$$\mathcal{L}_{\mathrm{ens}} = -\sum_{v \in L} \sum_{y=1}^{m} Y_{vy} \ln \tilde{P}(y \mid v), \qquad (14)$$

where $\tilde{P}$ and $\mathcal{L}_{\mathrm{ens}}$ represent the predicted probabilities and the loss obtained by ensembling $f_P$ and $f_C$, respectively.
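In code, the ensemble step amounts to weighting each node-specific cost matrix by the posterior and applying a softmax over the negative risks; the following is a hedged sketch with our own function and variable names:

```python
import torch
import torch.nn.functional as F

def ensemble_loss(posterior: torch.Tensor, cost_matrices: torch.Tensor,
                  labels: torch.Tensor) -> torch.Tensor:
    """posterior: (n, m) output of f_P; cost_matrices: (n, m, m) output of f_C with
    entry [v, y, k] = cost of predicting y when the true class of v is k; labels: (n,)."""
    # risk[v, y] = sum_k posterior[v, k] * cost_matrices[v, y, k]   (Equation (12))
    risk = torch.einsum('nk,nyk->ny', posterior, cost_matrices)
    log_probs = F.log_softmax(-risk, dim=1)   # normalize the negative risk (Equations (13)-(14))
    return F.nll_loss(log_probs, labels)
```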
Algorithm 1 shows the “two-step” training process of DCSGCN. First, we pre-train $f_P$ by minimizing $\mathcal{L}_P$ to compute the posterior probabilities of the classes to which the training samples belong. After $f_P$ converges, we train the parameters of $f_C$ and fine-tune $f_P$. In this second step, the loss function $\mathcal{L}$ is a combination of $\mathcal{L}_{\mathrm{ens}}$ and $\mathcal{L}_C$ weighted by $\lambda$ and $1 - \lambda$:

$$\mathcal{L} = \lambda\, \mathcal{L}_{\mathrm{ens}} + (1 - \lambda)\, \mathcal{L}_{C}. \qquad (15)$$
Algorithm 1 Dual Cost-Sensitive Graph Convolutional Network
Inputs: Graph data $G = (V, A, X)$; labeled node set $L$ with class labels $Y$
Outputs: Network parameters $\{W_P^{(1)}, W_P^{(2)}, W_C^{(1)}, W_C^{(2)}\}$; node embeddings $Z$
1: Initialize $\{W_P^{(1)}, W_P^{(2)}, W_C^{(1)}, W_C^{(2)}\}$, the numbers of pre-training and training epochs, and the weight $\lambda$
2: for each pre-training epoch do
3:   Compute the posterior probabilities $H_P^{(2)}$            ▹ Equation (7)
4:   Compute the cross-entropy loss $\mathcal{L}_P$              ▹ Equation (8)
5:   Update $\{W_P^{(1)}, W_P^{(2)}\}$ using $\nabla \mathcal{L}_P$
6: end for
7: Compute the cost labels $C^{v}$ for $v \in L$                 ▹ Equation (16), (17), or (19)
8: for each training epoch do
9:   Compute the posterior probabilities $H_P^{(2)}$             ▹ Equation (7)
10:  Compute the predicted cost matrices $H_C^{(2)}$             ▹ Equation (10)
11:  Compute the cost loss $\mathcal{L}_C$                       ▹ Equation (11)
12:  Compute the ensemble loss $\mathcal{L}_{\mathrm{ens}}$      ▹ Equation (14)
13:  Compute the total loss $\mathcal{L}$                        ▹ Equation (15)
14:  Update $\{W_P^{(1)}, W_P^{(2)}, W_C^{(1)}, W_C^{(2)}\}$ using $\nabla \mathcal{L}$
15: end for
By minimizing $\mathcal{L}$, DCSGCN can simultaneously learn the cost matrices of the samples and conduct cost-aware node classification. Here, $\lambda$ is used to adjust the weights of the two subnetworks. We pre-train $f_P$ because we want it to compute reasonable posterior probabilities of the nodes independently of $f_C$, with $f_C$ only providing a “correction” of the posterior probabilities. If pre-training is omitted and $f_P$ and $f_C$ are only trained jointly, the output of $f_P$ may contain too much cost information, thus deviating from the meaning of a posterior probability.
Another key point regarding DCSGCN is whether the parameters of the first layers of $f_P$ and $f_C$ should be shared (i.e., $W_P^{(1)} = W_C^{(1)}$). The argument in favor of sharing is that if $f_C$ is to provide “complementary” information about the balance of the sample class distribution, the implicit premise is that it makes the same guess about the sample class as $f_P$. For a node $v$, the first layers of $f_P$ and $f_C$ learn the representations $(H_P^{(1)})_v$ and $(H_C^{(1)})_v$ of $v$, respectively. If there is a significant difference between them, the predictions for $v$ by $f_P$ and $f_C$ are likely to differ; therefore, forcing parameter sharing can more effectively ensure the consistency of the two subnetworks in terms of node class prediction. However, we occasionally need $f_P$ and $f_C$ to be inconsistent in their label predictions. For example, samples belonging to a global majority class but a local minority class are more likely to be misclassified as the local majority class. In such a case, we want DCSGCN to generate the posterior probability of the global majority class and the cost complement of the global minority class. Not requiring $f_P$ and $f_C$ to share the first-layer parameters makes the model more flexible in utilizing the topological information of the nodes.
In Section 5.6, we verify whether pre-training $f_P$ and sharing the first-layer parameters of $f_P$ and $f_C$ help improve the classification performance. The results show that pre-training has a positive effect, whereas sharing the parameters slightly degrades the model performance. Finally, Figure 1 illustrates the structure of the designed network.
3.4. Methods for Computing Cost Labels
In this section, we present three heuristics for computing the cost matrix of the training samples, which are based on the class distribution of the training set and the topological information of the input graph.
3.4.1. Global Cost
Our first approach assigns the same cost matrix $C^{\mathrm{glo}}$ to all samples in the training set without considering their position in the graph. In Section 3.2, we defined $|L_i|$ as the number of samples belonging to the $i$th class in the labeled node set $L$, and $C_{\mathrm{maj}}$ and $C_{\mathrm{min}}$ as the sets of global majority and global minority classes, respectively. Then, for each sample node, the misclassification cost of predicting it as class $i$ when its true class is $j$ is

$$C^{\mathrm{glo}}(i, j) = \begin{cases} 0, & i = j, \\[4pt] \left( \dfrac{|L_i| + k}{|L_j| + k} \right)^{\gamma}, & i \neq j. \end{cases} \qquad (16)$$

From (16), the misclassification cost is zero when the classifier classifies a sample correctly, and the cost label for a misclassified sample depends on the numbers of samples in the predicted class $i$ and the true class $j$. Supposing $i \in C_{\mathrm{maj}}$ and $j \in C_{\mathrm{min}}$, then because $|L_i| \gg |L_j|$, $C^{\mathrm{glo}}(i, j)$ is large; when $i, j \in C_{\mathrm{maj}}$ or $i, j \in C_{\mathrm{min}}$, $C^{\mathrm{glo}}(i, j)$ is approximately equal to 1 because $|L_i| \approx |L_j|$. Finally, when $i \in C_{\mathrm{min}}$ and $j \in C_{\mathrm{maj}}$, and thus $|L_i| \ll |L_j|$, $C^{\mathrm{glo}}(i, j)$ is small and approximately equal to 0. Therefore, (16) is consistent with the principle behind our cost computation: in the case of imbalanced training samples, owing to the topological interplay effect, minority-class samples are more likely to be surrounded by majority-class samples and misclassified into a majority class. To avoid this situation, the cost of misclassifying a minority-class sample should be much larger than that of misclassifying a majority-class sample. We use the ratio of the class sizes to approximate the misclassification cost. To improve the fit, $k$ and $\gamma$ are added to the model as parameters to adjust the proportion. Here, $k$ is designed to be a small fraction of the total number of samples in the training set (e.g., 1%), and the presence of $k$ prevents a minority-class count from being zero. In addition, $\gamma$ is the exponential scaling of the sample proportion after add-$k$ smoothing. Together with the other DCSGCN parameters, the optimal values of $k$ and $\gamma$ are obtained from the validation set.
Figure 2 shows a toy example of how to compute $C^{\mathrm{glo}}$ for a target node $v$. There are only two classes for simplicity. The minority class 1 (represented by the red nodes) contains two samples, and the majority class 2 (represented by the blue nodes) contains seven samples. Here, $k$ is 10% of the total number of nodes, including $v$, i.e., $k = 1$; in addition, $\gamma = 1$. Therefore, the cost of misclassifying a class-1 sample into class 2 is $C^{\mathrm{glo}}(2, 1) = (7 + 1)/(2 + 1) \approx 2.67$, and the cost of misclassifying a class-2 sample into class 1 is $C^{\mathrm{glo}}(1, 2) = (2 + 1)/(7 + 1) \approx 0.38$.
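A small Python sketch of this global-cost heuristic, assuming the reconstructed form of Equation (16) (class-size ratio with add-$k$ smoothing and an exponent $\gamma$), reproduces the toy numbers above; the function and variable names are ours:

```python
import numpy as np

def global_cost(class_counts, k=1.0, gamma=1.0):
    """class_counts[i] = number of labeled samples of class i.
    Returns C with C[i, j] = cost of predicting i when the true class is j."""
    n = np.asarray(class_counts, dtype=float) + k      # add-k smoothing
    C = (n[:, None] / n[None, :]) ** gamma             # ratio of predicted to true class size
    np.fill_diagonal(C, 0.0)                           # correct predictions cost nothing
    return C

print(global_cost([2, 7]))   # toy example: C[1, 0] ~= 2.67, C[0, 1] ~= 0.38
```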
3.4.2. Simple Average Cost
The global cost relates class size to misclassification cost. However, its major drawback is that the cost matrix is the same for all nodes in the training set. In fact, the cost matrix of a sample node is influenced by its position in the network. Consider the case in which all neighbors of a global-majority-class node belong to a global minority class. Such a situation is rare but does occur; for example, COVID-19 patients represent a small fraction of the global population, yet a healthy person may have family members who are infected with the virus. Owing to the topological interplay effect, a GNN-based classifier tends to misclassify this node into the majority class of its neighbors, i.e., the global minority class. Moreover, its global cost matrix further encourages the classifier to classify it as a global minority class. Instead of correcting the posterior probability, the global cost exacerbates the tendency to misclassify.
Naturally, the cost matrix should be built by considering not only the proportions of the global inter-class samples (which are the same for all nodes) but also the proportions of the local inter-class samples (which may vary between nodes). For a node $v$, let $S_1$ denote the set of nodes directly connected to $v$, and let $S_2$ denote $S_1$ together with the set of nodes directly connected to nodes in $S_1$ but not to $v$ (i.e., at a distance of 2 from $v$); $S_2$ is then the set of nodes at a maximum distance of 2 from $v$. Similarly, the set of nodes within a maximum distance of 3 from $v$ is denoted as $S_3$, and so on, until the set of nodes within the maximum distance $l$ from $v$ is denoted as $S_l$, where $l$ is the distance from the farthest training-set node to $v$. In the case of no isolated nodes, $S_l$ includes all nodes in the training set. The search for $\{S_1, S_2, \ldots, S_l\}$ can be conducted simply through a breadth-first search on the graph. For each node set $S_t$, we can compute a cost matrix $C^{S_t}$ based on the node class distribution within $S_t$ using (16). In other words, letting $n_x^{S_t}$ denote the number of $x$th-class samples in $S_t$, then $C^{S_t}(i, j) = \big( (n_i^{S_t} + k)/(n_j^{S_t} + k) \big)^{\gamma}$ when $i \neq j$, and $C^{S_t}(i, j) = 0$ when $i = j$. Thus, $\{C^{S_t}\}$ reflects the different costs assigned to $v$ by the different node distributions in regions centered at $v$ and expanding toward the network edge. In the above example, because $v$ belongs to a global majority class but a local minority class, $C^{S_1}$ tends to steer the classifier toward the correct class for $v$, thus attenuating the negative effect of the global cost. Eventually, the cost matrix $C^{\mathrm{avg}}$ of a node is the simple average of $\{C^{S_t}\}$:

$$C^{\mathrm{avg}} = \frac{1}{l} \sum_{t=1}^{l} C^{S_t}. \qquad (17)$$
Algorithm 2 shows the computation process for $C^{\mathrm{avg}}$.
Figure 2 also illustrates a toy example of how to compute $C^{\mathrm{avg}}$. The sets of nodes within maximum distances of {1, 2, 3} from node $v$ are $S_1$, $S_2$, and $S_3$, respectively, where $S_3$ includes all nodes. In $S_1$, the numbers of red and blue class samples are 2 and 0, respectively, and the costs of misclassifying $v$ into the red and blue classes are 3 and 0.33, respectively (assuming $\gamma = 1$ and, as before, $k = 1$). Similarly, in $S_2$, the numbers of red and blue class samples are 2 and 4, resulting in costs of misclassifying $v$ into red and blue of 0.60 and 1.67, respectively. Finally, in the global view ($S_3$), the numbers of red and blue samples are 2 and 7, and the corresponding misclassification costs are 0.38 and 2.67, respectively. $C^{\mathrm{avg}}$ is the average over the three sets, i.e., approximately 1.33 for misclassifying $v$ into red and 1.56 for misclassifying it into blue. Compared to the global cost $C^{\mathrm{glo}}$, $C^{\mathrm{avg}}$ encourages the classifier to classify $v$ into the correct class, i.e., blue.
Algorithm 2 Calculate Simple Average Cost
Inputs: Target node $v$; adjacency matrix $A$; labeled node set $L$ with class labels
Outputs: Simple average cost matrix $C^{\mathrm{avg}}$ for $v$
1: Initialize $Q$ as a queue and $D$ as a dictionary of hop distances
2: Label $v$ as explored
3: $D[v] \leftarrow 0$
4: $Q.\mathrm{enqueue}(v)$
5: while $Q$ is not empty do
6:   $u \leftarrow Q.\mathrm{dequeue}()$
7:   for all edges from $u$ to $w$ in $A$ do
8:     if $w$ is not labeled as explored then
9:       Label $w$ as explored
10:      $D[w] \leftarrow D[u] + 1$
11:      $Q.\mathrm{enqueue}(w)$
12:    end if
13:  end for
14: end while
15: $l \leftarrow \max_{w \in L} D[w]$
16: for $t = 1$ to $l$ do
17:   Construct node set $S_t \leftarrow \{ w \in L : 1 \leq D[w] \leq t \}$
18:   Compute $C^{S_t}$ based on $S_t$            ▹ Equation (16)
19: end for
20: Compute $C^{\mathrm{avg}}$ using $\{C^{S_t}\}$            ▹ Equation (17)
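For illustration, the Python sketch below combines the breadth-first hop-distance computation of Algorithm 2 with the global_cost function from the earlier sketch; the function and variable names are ours, and the labeled nodes are assumed to be given as a node-to-class dictionary:

```python
from collections import deque
import numpy as np

def hop_distances(adj, v):
    """BFS distances from node v; adj[u] is the list of neighbors of u."""
    dist = {v: 0}
    q = deque([v])
    while q:
        u = q.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                q.append(w)
    return dist

def simple_average_cost(adj, labels, v, num_classes, k=1.0, gamma=1.0):
    """labels: dict mapping labeled nodes to class ids; reuses global_cost() above."""
    dist = hop_distances(adj, v)
    l = max(dist[u] for u in labels if u in dist)
    costs = []
    for t in range(1, l + 1):
        counts = np.zeros(num_classes)
        for u, c in labels.items():
            if u in dist and 1 <= dist[u] <= t:      # labeled nodes within hop distance t of v
                counts[c] += 1
        costs.append(global_cost(counts, k, gamma))  # Equation (16) on the local region
    return np.mean(costs, axis=0)                    # Equation (17): simple average
```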
3.4.3. Weighted Average Cost
The simple average cost distinguishes the global dominance of a category from its local dominance and is thus node-dependent. However, it still ignores the fact that the dominance of a category in different regions may have different importance. Returning to the example shown in Figure 2, to compute the cost matrix of node $v$, we first construct the three node sets $\{S_1, S_2, S_3\}$, compute the cost for each node set separately to obtain $\{C^{S_t}\}$, and average them to obtain the final cost $C^{\mathrm{avg}}$. However, $C^{\mathrm{avg}}$ still tends to classify $v$ into the incorrect class, i.e., the global minority class red, because $v$ is in the minority only in $S_1$, and $S_1$ is outvoted in the averaging over $\{S_1, S_2, S_3\}$.
In the process of set voting, we want to highlight the sets in which the target node is in the minority. We provide an intuitive interpretation as follows. For a target node $v$ and its corresponding node sets $\{S_1, S_2, \ldots, S_l\}$, suppose first that $v$ belongs to a global majority class; then, as $i$ approaches $l$, the category distribution in $S_i$ tends to approximate the global distribution. Therefore, if $v$ is in a minority category in $S_i$, $i$ is more likely to be small, and $v$ belongs to a local minority category. Because the topological interplay effect is more pronounced locally around $v$, the classifier should strongly avoid misclassifying $v$ as a local majority class, and thus the cost information calculated based on such an $S_i$ should be given more weight. When $v$ belongs to a global minority class, the situation is reversed. If $v$ is a majority class in $S_i$, then $i$ is more likely to be small; at this point, the cost information calculated based on $S_i$ would incorrectly encourage classifying $v$ as a minority class of $S_i$. When $i$ is large, the class distribution in $S_i$ is close to the global distribution, and the cost calculated based on $S_i$ encourages the classifier to make the correct choice, i.e., the global minority class; therefore, such an $S_i$ should have a larger weight at this point.
Based on the above analysis, we designed a weighting method that relies on the node features. Naturally, the average features of a node set are closer to the features of its majority-class nodes. Therefore, when the target node $v$ is in the minority in $S_t$, the cosine distance between its features $x_v$ and the average features of the nodes in $S_t$, $\bar{x}_{S_t} = \frac{1}{|S_t|} \sum_{u \in S_t} x_u$, is large, where $|S_t|$ is the number of nodes contained in $S_t$. This cosine distance can be regarded as the weight of $S_t$. In summary, given the target node $v$ and the corresponding $\{S_1, S_2, \ldots, S_l\}$ (constructed in the same way as described in Section 3.4.2), the weighted-average-based cost matrix $C^{\mathrm{wavg}}$ of $v$ is

$$w_t = 1 - \cos\big(x_v,\, \bar{x}_{S_t}\big), \qquad (18)$$

$$C^{\mathrm{wavg}} = \frac{\sum_{t=1}^{l} w_t\, C^{S_t}}{\sum_{t=1}^{l} w_t}. \qquad (19)$$
In the example shown in Figure 2, assuming that the blue class samples all share one feature vector and the red class samples all share another, orthogonal feature vector, the weights of $S_1$, $S_2$, and $S_3$ are 1.0, 0.10, and 0.04, respectively. Combining these weights with the per-region costs calculated before, the weighted cost of misclassifying $v$ as red is approximately 2.7, and the cost of misclassifying it as blue is approximately 0.5. It follows that $C^{\mathrm{wavg}}$ considers the cost incurred by misclassifying $v$ as red to be larger, thus encouraging the classifier to make the correct choice, i.e., the blue class.
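A short Python sketch of this feature-based weighting (reusing the global_cost function from the earlier sketch; the orthogonal toy feature vectors are our own assumption) is shown below:

```python
import numpy as np

def cosine_distance(a, b):
    return 1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def weighted_average_cost(region_costs, region_means, x_v):
    """region_costs: list of (m, m) matrices C^{S_t} (Equation (16));
    region_means: list of mean feature vectors of each S_t; x_v: target node features."""
    weights = np.array([cosine_distance(x_v, f) for f in region_means])  # Equation (18)
    weights = weights / weights.sum()
    return sum(w * c for w, c in zip(weights, region_costs))             # Equation (19)

# toy example from Figure 2, assuming blue = (1, 0) and red = (0, 1); v is blue
x_v = np.array([1.0, 0.0])
means = [np.array([0.0, 1.0]),        # S1: 2 red, 0 blue
         np.array([4.0, 2.0]) / 6,    # S2: 2 red, 4 blue
         np.array([7.0, 2.0]) / 9]    # S3: 2 red, 7 blue
region_costs = [global_cost([2, 0]), global_cost([2, 4]), global_cost([2, 7])]
# off-diagonal entries come out at roughly 2.7 (predict red) and 0.53 (predict blue)
print(weighted_average_cost(region_costs, means, x_v))
```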
4. Experimental Settings
4.1. Datasets
For the fairness and validity of the experiments, we compared the performance of all analyzed methods on nine widely used standard datasets covering different domains in the field of graph representation learning. In Table 2, we list the statistics of all datasets used, including the number of nodes, the number of edges, the dimension of the node features, the number of node classes, the minority classes, and the tuned optimal values of the key DCSGCN parameters {$\lambda$, $\gamma$, $k$}. Below is a brief description of each dataset used.
Citation networks (3). Cora, Citeseer, and Pubmed [
49,
50]. In these datasets, nodes represent papers, edges represent citations between papers, and the features and classes of the nodes are the bag-of-words representation of the paper and the research topic (e.g., neural networks and reinforcement learning), respectively.
WebKB (3). Three sub-datasets, i.e., Cornell, Texas, and Wisconsin, are included (
http://www.cs.cmu.edu/afs/cs.cmu.edu/project/theo-20/www/data/, accessed on 31 May 2022). In these datasets, nodes and edges represent the relevant web pages of a university and links between web pages, respectively. The features of a node are the bag-of-words representation of the corresponding web page, and the node category is the web page topic (e.g., Student and Faculty).
Wikipedia networks (2). Two sub-datasets, i.e., Chameleon and Squirrel (
https://snap.stanford.edu/data/wikipedia-article-networks.html, accessed on 31 May 2022), are included, each describing a network of Wikipedia pages on a particular topic. Nodes represent article pages, and edges represent the links between pages. Node features are bag-of-word vectors of articles, and node categories are built based on the monthly traffic of the articles.
Actor co-occurrence network (1). In this dataset, a node represents an actor, and an edge indicates whether two actors co-occur on the same Wikipedia page. Node features correspond to keywords on the Wikipedia pages, and node categories were built based on each actor's Wikipedia page. This dataset was proposed by [51].
All of the above datasets are available through the PyTorch geometric package (
https://github.com/rusty1s/pytorch_geometric, accessed on 31 May 2022). In addition, we notice that although GNNs have been widely adopted in the remote sensing field, the current publicly-released benchmark datasets for GRL (such as the PyTorch geometric datasets (
https://pytorch-geometric.readthedocs.io/en/latest/modules/datasets.html, accessed on 31 May 2022) and the deep graph library datasets (
https://docs.dgl.ai/api/python/dgl.data.html, accessed on 31 May 2022)) still have not covered the remote sensing field. Thus, unfortunately, we are not able to verify the effectiveness of our approach directly on remote sensing-related graphs. However, according to previous studies such as [9,10], GNNs demonstrated to be effective on non-remote-sensing datasets can still play an important role in remote sensing applications.
4.2. Analyzed Methods
In this section, we describe all the analyzed methods, including three types of baselines and our proposed three models. The types of baselines are as follows.
Type I: Graph neural networks (2). These include the vanilla GCN [
39] and its modified version, GCNII [
52].
Type II: Graph neural networks + cost-sensitive learning methods (4). For GCN and GCNII, we tested their combination with two classical cost-sensitive learning techniques, i.e., under-sampling and threshold-moving (both have multiple implementation schemes; this study applies the algorithms proposed in [15]). The former reduces the degree of imbalance in the training set by under-sampling the majority class samples, whereas the latter moves the classification decision boundary to increase the classifier's preference for the minority classes. This yields four combined methods: GCN and GCNII, each paired with under-sampling and with threshold-moving.
Type III: Graph neural networks specific for imbalanced classification (3). Three state-of-the-art GNN models [
16] designed for imbalanced datasets were tested: RECT-L, RECT-N, and RECT. Unlike traditional GNNs, the RECT family adopts a novel objective function that explores class-semantic knowledge.
Type IV: Proposed methods (3). The methods proposed in this paper are denoted as DCSGCN-G, DCSGCN-S, and DCSGCN-W, representing the dual cost-sensitive GCN based on the global cost, the simple average cost, and the weighted average cost, respectively.
4.3. Evaluation Metrics
Four evaluation metrics were used to evaluate the performance of all the methods. First, we tested two metrics that are standard in the graph node classification task: Micro-F1 and Macro-F1. Micro-F1 measures the F1-score of the aggregated contributions of all classes. In our multi-class setting, Micro-F1 and accuracy are equivalent. Macro-F1 is defined as the arithmetic mean of the label-wise F1-scores. Compared to Micro-F1, Macro-F1 does not consider the class size. Both Micro-F1 and Macro-F1 combine the precision and recall of the model, the range of which is [0, 1]. The larger the value is, the stronger the model performance.
Second, we tested two standard metrics from the cost-sensitive learning task: the misclassification cost and the number of high-cost errors. The former is the average cost resulting from all misclassifications (the ground-truth misclassification cost will be discussed later in Section 4.4), and the latter is the number of misclassifications in which a minority class sample is predicted as a majority class. Both metrics take non-negative values, and the smaller the value, the stronger the model. To the best of our knowledge, this study is the first to report the performance of graph neural network models under the cost and high-cost error metrics in a graph node classification task. The formulas for all metrics are shown below:

$$\text{Micro-F1} = \frac{2\sum_{u=1}^{m} TP_u}{2\sum_{u=1}^{m} TP_u + \sum_{u=1}^{m} FP_u + \sum_{u=1}^{m} FN_u}, \qquad \text{Macro-F1} = \frac{1}{m}\sum_{u=1}^{m} \frac{2\, TP_u}{2\, TP_u + FP_u + FN_u},$$

$$\text{Cost} = \frac{1}{N}\sum_{v=1}^{N} C(\hat{y}_v, y_v), \qquad \text{High-cost errors} = \sum_{v=1}^{N} \big[\, \hat{y}_v \in C_{\mathrm{maj}} \ \wedge\ y_v \in C_{\mathrm{min}} \,\big].$$

In the above equations, $TP_u$, $FP_u$, and $FN_u$ denote the numbers of true positives, false positives, and false negatives in the $u$th class, respectively. In addition, $m$ denotes the number of classes, and $N$ is the number of classifications (i.e., the number of test set samples). Moreover, $\hat{y}_v$ and $y_v$ denote the predicted class and label class of test node $v$, respectively; $C$ is the ground-truth cost matrix; $C_{\mathrm{maj}}$ and $C_{\mathrm{min}}$ are the majority and minority class sets, respectively; and $[\cdot]$ is the Iverson bracket, which represents an indicator function.
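For completeness, a small sketch of how these metrics could be computed is given below (Micro-F1 and Macro-F1 are taken directly from scikit-learn); the convention cost_matrix[i, j] = cost of predicting i when the true class is j matches the definition above, and all names are ours:

```python
import numpy as np
from sklearn.metrics import f1_score

def evaluate(y_true, y_pred, cost_matrix, minority_classes):
    """cost_matrix[i, j] = cost of predicting i when the true class is j;
    correct predictions are assumed to carry zero cost."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    micro = f1_score(y_true, y_pred, average='micro')
    macro = f1_score(y_true, y_pred, average='macro')
    avg_cost = cost_matrix[y_pred, y_true].mean()
    minority = np.isin(y_true, list(minority_classes))
    high_cost_errors = int(np.sum(minority & ~np.isin(y_pred, list(minority_classes))))
    return micro, macro, avg_cost, high_cost_errors
```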
4.4. Training Settings
In this study, we assume that different types of errors lead to different degrees of misclassification cost and, in particular, that misclassifying a minority class sample results in a higher cost than misclassifying a majority class sample. We use a benchmark widely applied in cost-sensitive learning to generate the ground-truth misclassification cost matrix $C$, i.e., the randomized proportional setup [28,53]. The diagonal entries $C(i, i)$ are set to zero, whereas each non-diagonal entry $C(i, j)$ is uniformly sampled from an interval whose upper bound is proportional to $|L_i| / |L_j|$, where $|L_i|$ represents the number of nodes in class $i$. This randomized proportional setup charges a higher expected cost for misclassifying a minority class than a majority class because it takes the class distribution of the dataset into account. All experiments were conducted in PyTorch (1 February 2020, community edition), and the Adam optimizer was used to train all analyzed methods. For each dataset, following [16], we first remove 70% of the samples from each minority class to increase the data imbalance. Then, we randomly split the labeled nodes of each class into 20%/20%/60% for training, validation, and testing. Five-fold cross-validation is adopted to evaluate the model performance. For the pre-training of $f_P$, the number of training epochs was set to 50; for the training of DCSGCN, the number of training epochs was 200.
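A hedged sketch of such a randomized proportional cost matrix generator is given below; the proportionality constant `scale` and the helper name are our own choices, as the exact sampling interval used in the paper is not reproduced here:

```python
import numpy as np

def randomized_proportional_cost(class_counts, scale=2000.0, seed=0):
    """Ground-truth cost matrix: C[i, j] ~ Uniform(0, scale * n_i / n_j) for i != j,
    C[i, i] = 0. Misclassifying a small true class j as a large class i is expensive.
    The value of `scale` is an assumption of this sketch."""
    rng = np.random.default_rng(seed)
    n = np.asarray(class_counts, dtype=float)
    upper = n[:, None] / n[None, :]
    C = rng.uniform(0.0, scale * upper)
    np.fill_diagonal(C, 0.0)
    return C
```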
4.5. Parameter Tuning
On the validation set of each dataset, we used a grid search to tune the parameters in the following order: learning rate (range {0.001, 0.005, 0.01, 0.05, 0.1}) → weight decay (a five-value logarithmic grid) → number of hidden units (range {8, 16, 32, 64}) → dropout rate (range {0.1–0.9}, step size 0.1) → $\lambda$ (range {0.0–1.0}, step size 0.1) → $\gamma$ (range {0.25, 0.5, 1.0, 1.5}) → $k$ (range {0.001, 0.01, 0.1}). The last three parameters are specific to the DCSGCN family, and the effects of changing their values on the performance of DCSGCN are shown in Section 5.8.
5. Experimental Results
5.1. Research Questions
In this section, the experimental results are discussed. Specifically, we answer the following research questions: (1) Does DCSGCN demonstrate a higher classification accuracy on imbalanced datasets compared to existing models (see Section 5.2)? (2) Can DCSGCN achieve a smaller misclassification cost and fewer high-cost errors while maintaining the classification accuracy (see Section 5.3)? (3) How does the classification performance of DCSGCN vary on datasets with different degrees of imbalance (see Section 5.4)? (4) How does the classification performance of DCSGCN vary on training sets of different sizes (see Section 5.5)? (5) Do the two key practices in training DCSGCN, i.e., pre-training $f_P$ and not sharing the first-layer parameters of $f_P$ and $f_C$, indeed improve the model performance (see Section 5.6)? (6) Does DCSGCN perform better than classic data-oversampling methods (see Section 5.7)? (7) How do the three key parameters $\lambda$, $\gamma$, and $k$ of DCSGCN affect the network performance (see Section 5.8)? (8) Finally, what do the node representations learned by DCSGCN look like when visualized (see Section 5.9)? In the following sections, we answer the above questions with a detailed analysis of the experimental data.
5.2. Classification Accuracy
Table 3 and
Table 4 show the performance of all methods under two measures of classification accuracy: Micro-F1 (which is equivalent to the accuracy under our multi-class classification scenario) and Macro-F1. The results are presented as the mean based on 10 replicated experiments. From the experimental results, we conducted the following analysis:
In our setting of removing 70% of the minority class samples, we noticed that the Micro-F1 values of both GCN and GCNII decreased to different degrees compared to the performance reported in related studies [
39,
52], and the performance of GCNII decreased more significantly. For example, in the Cora dataset, the decrease in the Micro-F1 values for the GCN and GCNII was approximately 7.5% and 30.6%, respectively. These data demonstrate the necessity of developing specific algorithms for imbalanced graph data.
The classic cost-sensitive learning methods under-sampling and threshold-moving improve the accuracy of type I methods. For under-sampling, it brings a 5.0% average performance improvement in terms of Micro-F1. For threshold-moving, this figure is 2.3%. It follows that under-sampling is the better of the two methods based on our experiments.
Stronger baselines than under-sampling and threshold-moving are the three type III models specifically designed for imbalanced datasets, i.e., RECT-L, RECT-N, and RECT. Among them, RECT exhibited the best performance: it achieves the best Micro-F1 scores on the Cora and Texas datasets and the best Macro-F1 score on the Cora dataset. Compared to under-sampling, RECT brings an average improvement of 5.6% in terms of Micro-F1; compared to threshold-moving, this figure is 8.5%.
It is encouraging to note that our proposed DCSGCN model achieves the best overall classification performance and the strongest robustness. In six of the nine datasets (all except Cora, Citeseer, and Texas), DCSGCN achieved the highest Micro-F1; in eight of the nine datasets (all except Cora), DCSGCN achieved the best Macro-F1. On Citeseer, the Micro-F1 of DCSGCN differs from that of the best performer by only 0.44%. As another observation, for Micro-F1, among the six datasets where DCSGCN is the best performer, the gains over the best baseline are relatively small on Pubmed, Cornell, and Actor (0.4%, 2.8%, and 0.3%, respectively), whereas on Wisconsin, Chameleon, and Squirrel, DCSGCN has an extremely significant advantage over the strongest baseline (5.8%, 14.2%, and 11.3%, respectively). Similar observations were made for Macro-F1. Naturally, DCSGCN fits datasets with different sample distributions and topological features to different degrees; we regard the study of such variation as a follow-up research problem.
Among the three variants of DCSGCN, we noted that the simplest, the global cost-based DCSGCN-G, achieves the best performance, winning on three (Micro-F1) and four (Macro-F1) of the datasets, respectively. Of the two methods based on node-dependent costs, DCSGCN-S, which uses the local context, wins on two (Micro-F1) and three (Macro-F1) of the datasets, and DCSGCN-W, which additionally uses the region features, wins on only one dataset under both metrics. This suggests that the information provided by the local context and region semantics is not always helpful for classification. One possible explanation is that the sample-dependent cost, while bringing more flexibility to the neural network, is also more likely to result in overfitting, particularly when the amount of training data is insufficient. In summary, the performance ranking of the three DCSGCN variants is DCSGCN-G > DCSGCN-S > DCSGCN-W.
In summary, our answer to research question (1) (see
Section 5.1) is “yes.” DCSGCN demonstrates a SOTA performance in the task of imbalanced graph node classification. Our idea of using cost to “complement” the posterior probability is simple but effective. In addition, the simplest method of estimating the sample cost achieves the highest performance.
5.3. Cost and High-Cost Errors
Table 5 and
Table 6 show the performance of all methods under two metrics from a cost-sensitive classification perspective, i.e., cost and high-cost errors, respectively. Both metrics measure the ability of the classifier to correctly identify the minority class samples under the topological interplay effect. Similarly, the results are expressed in the form of the mean based on 10 replicate experiments. From the collected results, we conducted the following analysis.
As a counterintuitive finding, both under-sampling and threshold-moving increase the cost of the type I models (1.4%↑ and 2.1%↑, respectively). Past research has demonstrated that both under-sampling and threshold-moving are effective strategies for reducing the cost when the training samples are independently and identically distributed [15]. Our observations show that under-sampling and threshold-moving can fail on graph data: the presence of edges indicates explicit dependency relationships among nodes, so the nodes are not independently and identically distributed. Combined with the previous analysis in Section 5.2, we infer that the improvements brought by under-sampling and threshold-moving on graph data come more from reducing the misclassification of majority class samples than of minority class samples.
The RECT family can effectively reduce the misclassification cost. Among the three type III methods, RECT performs best in terms of high-cost errors. Combined with the previous analysis, we conclude that the RECT family outperforms the GCN- and GCNII-based models both in classifying samples of all classes and in classifying the minority class samples.
Our proposed DCSGCN models are the best performers among all compared methods. In terms of cost, DCSGCN achieves the minimal cost on all datasets (9/9); in terms of high-cost errors, it achieves the fewest errors on most of the datasets (8/9). Compared to its performance in terms of Micro-F1 and Macro-F1, DCSGCN shows an even more significant advantage under the cost and high-cost error metrics. Specifically, compared to the average performance of the RECT family, DCSGCN reduces the cost by 34.3% on average, and its strongest variant reduces the cost by 37.7% on average.
Among the three variants of DCSGCN, DCSGCN-G performs the best; this observation is consistent with that described in Section 5.2. DCSGCN-G achieves the lowest cost in five of the datasets and the fewest high-cost errors in four of the datasets. For the other two methods, DCSGCN-S and DCSGCN-W, in terms of cost, both achieve the best performance in three of the datasets; in terms of high-cost errors, DCSGCN-S is better (3 > 1). In general, based on the classification accuracy of the minority samples, we have DCSGCN-G > DCSGCN-S > DCSGCN-W.
In summary, our answer to question (2) (see Section 5.1) is also “yes.” Under the standard metrics for cost-sensitive classification, DCSGCN showed the best performance: while leading in terms of classification accuracy, DCSGCN still manages to achieve the smallest misclassification cost. DCSGCN-G is still the best variant, whereas both DCSGCN-S and DCSGCN-W can still achieve the best performance on certain datasets; neither of the latter two is consistently superior to the other.
5.4. Influence of Imbalance Ratio
In this subsection, we explore how the performance of DCSGCN changes as the imbalance of the training set changes. We first reduced the percentage of randomly removed minority class samples from 70% to 30% for each dataset (which will reduce the imbalance of the training set) and compared the performance of DCSGCN with all baselines under this moderated setting (see
Table 7 and
Table 8). The results in the tables are the mean values based on 10 replicated experiments, and we omit the standard deviation owing to space limitations. Subsequently, we varied the percentage of removed minority samples in [10%, 20%, …, 90%] and recorded the performance change of DCSGCN on the Cora dataset (see
Figure 3).
As shown in Table 7 and Table 8, there is a small increase in Micro-F1 and a small decrease in cost for each compared method when more minority class samples are available in the training set. This finding is consistent with our hypothesis. Second, DCSGCN maintains its performance advantage: compared with the two strong baselines, GCN and RECT, DCSGCN increases Micro-F1 by 8.6% and 10.3% on average and reduces the cost by 18.7% and 27.5% on average, respectively. In terms of cost, the superiority of DCSGCN is more significant. Finally, we observe that the relative advantages of DCSGCN-G, DCSGCN-S, and DCSGCN-W show no obvious differences from our previous findings on the more imbalanced data.
As seen in Figure 3, Micro-F1 shows a significant decreasing trend as the imbalance of the training set increases. Among the three DCSGCN models, the change in DCSGCN-W is the most drastic, whereas the other two variants change more moderately. On Cora, when the proportion of removed minority class samples reaches 70%, the change in Micro-F1 with respect to the imbalance shifts from a small fluctuation to a drastic decrease. The cost is positively correlated with the imbalance in the training set. Similarly, we found that DCSGCN-W has the strongest sensitivity to imbalance in terms of cost. We infer that this is because, unlike the other two methods, DCSGCN-W depends on one more quantity that is affected by the training set in addition to the inter-class sample ratio, i.e., the semantic features of the training data. The more parameters depend on the training set, the stronger the sensitivity.
5.5. Performance under Dense Training
Similar to [
54], in this section, we discuss whether DCSGCN can maintain its leading performance on a larger training set. For each dataset, we randomly retained 60% of its nodes for training and 20% for testing. As in the sparse-training setting, we then tested the performance of all compared methods when removing 30% and when removing 70% of the minority class samples. All other settings of the experiments were the same as those described in
Section 4.4 and
Section 4.5.
Table 9 and Table 10 show the Micro-F1 and cost, respectively, of each analyzed model under dense training on each dataset when 70% of the minority class samples are removed. Correspondingly, Table 11 and Table 12 show the model performance under dense training when 30% of the minority class samples are removed. The results in the tables are the mean values based on 10 replicated experiments, and we omit the standard deviations owing to space limitations.
Comparing Table 3 with Table 9, and Table 5 with Table 10, it can easily be seen that as the amount of training data increases, the Micro-F1 values generally increase and the cost values generally decrease for almost all methods. Similar observations can be made when comparing Table 7 with Table 11 and Table 8 with Table 12, where more training data are given.
In addition, we found that DCSGCN still has a significant advantage with large training sets. For 70% sample removal, compared with the GCN, the average Micro-F1 improvement of DCSGCN is 17.1% and the average cost reduction is 44.8%; the corresponding values compared with RECT are 14.7% and 37.1%. For 30% sample removal, the corresponding performance gain of DCSGCN over the GCN is (Micro-F1, 11.4%↑; cost, 30.6%↓), and over RECT is (Micro-F1, 5.8%↑; cost, 13.4%↓). This again demonstrates the effectiveness of DCSGCN. Moreover, we observed a significant performance improvement for DCSGCN-S and DCSGCN-W compared to the case of sparse training, whereas the advantage of DCSGCN-G decreases. We infer that, with more training data, the more flexible sample-dependent cost-sensitive learning exhibits a stronger classification performance than class-dependent cost-sensitive learning.
5.6. Validation of Key Procedures in Training DCSGCN
In this section, we answer research question (5) in Section 5.1: do the two key practices in training DCSGCN, i.e., pre-training $f_P$ and not sharing the first-layer parameters of $f_P$ and $f_C$, indeed improve the model performance? For all DCSGCN models, we tested the following four combinations: {A, pre-trained, shared; B, pre-trained, not shared; C, not pre-trained, shared; D, not pre-trained, not shared}, and present the results in Table 13. The test dataset used was Cora, and all experimental settings and parameters were the same as those in Section 4.4 and Section 4.5. Naturally, (A + B) − (C + D) describes the performance difference between pre-training and not pre-training, whereas (A + C) − (B + D) describes the performance difference between sharing and not sharing the parameters.
Looking at Table 13, we find that, for all models, pre-training $f_P$ improves Micro-F1 and reduces the cost; the average accuracy improvement and cost reduction are 2.4% and 0.38, respectively. By contrast, not sharing the first-layer parameters of $f_P$ and $f_C$ significantly improves the performance of two of the three DCSGCN variants and slightly impairs the performance of the third. Therefore, we conclude that setting B (pre-trained, not shared) achieves the best performance.
Figure 4 shows how Micro-F1 and the cost evolve during the training of DCSGCN under each of the practices above (A–D) on the Cora dataset. It can be clearly observed that practice B (pre-trained, not shared) has the fastest convergence speed and the best converged performance.
5.7. Comparison of DCSGCN and Advanced Data Over-Sampling Strategies
In this section, we explore the performance differences between DCSGCN and {GNN + data over-sampling} further. Our choice of GNNs includes GCN [
39] and GCNII [
52], while our selection of data-oversampling approach comprises SMOTE [
48] and ADASYN [
55]. We then compare the performance of DCSGCN
with the following four combinations: {A, GCN + SMOTE; B, GCNII + SMOTE; C, GCN + ADASYN; D, GCNII + ADASYN}, and present the results in
Table 14. The test dataset used was Cora, and all experimental settings and parameters were the same as those in
Section 4.4 and
Section 4.5. The implementation of SMOTE and ADASYN relies on the imbalanced-learn library [
56]. By looking at
Table 14, we find that the performance of the {GNN + data-oversampling} combinations is roughly in between type II and type III methods (see
Section 4.2). Therefore, we conclude that our proposed methods are better at learning how to embed imbalanced graph data than the classic data-oversampling strategies.
5.8. Influence of Key Parameters
In this section, we explore how the three key parameters $\lambda$, $\gamma$, and $k$ affect the performance of DCSGCN. Among them, $\lambda$ controls the relative weights of the category label learning and the cost label learning (see Equation (15)), whereas $\gamma$ and $k$ control the scaling of the inter-class sample proportion when computing the cost matrix (see Equation (16)). The values of $\lambda$, $\gamma$, and $k$ are varied within [0.0, 0.1, 0.2, …, 1.0], [0.25, 0.5, 1.0, 1.5], and [0.001, 0.01, 0.1], respectively.
Figure 5 shows the variation in the performance of DCSGCN on the validation set of each dataset as the values of $\lambda$, $\gamma$, and $k$ change. As shown in Figure 5a, Micro-F1 increases on almost all datasets as $\lambda$ grows from 0 and then enters a small fluctuation phase. In terms of cost, excluding the three smallest datasets (Cornell, Texas, and Wisconsin), where the performance fluctuates significantly, the cost similarly levels off. Therefore, we conclude that [0.2, 1.0] is a suitable range of values for $\lambda$. When $\lambda$ is too small, DCSGCN leans too much toward learning the cost labels, weakening the learning of the posterior probability.
For $\gamma$, we found that, excluding Cornell, Texas, and Wisconsin, its value does not have a significant effect on Micro-F1 or the cost; we made similar observations for $k$. For Cornell, Texas, and Wisconsin, Micro-F1 first increases and then decreases as $\gamma$ increases; in addition, Micro-F1 decreases and the cost increases when $k$ changes from 0.01 to 0.1. One possible reason for this observation is that the category distribution of these three datasets is the most imbalanced, and thus the inter-class sample ratio is far from 1 and therefore sensitive to $\gamma$; moreover, the number of nodes in these datasets is the smallest, so when $k$ is large (i.e., $k = 0.1$), the number of nodes added by add-$k$ smoothing is not negligible compared with the real number of nodes in the minority classes, making the model performance sensitive to $k$. In summary, using $\gamma = 0.5$ and $k = 0.01$ is a good choice, as revealed experimentally.
5.9. Graph Visualization
To more intuitively analyze the features of the node representations learned by the DCSGCN models and how they differ from those of the GCN, we projected the output vectors of the last layer of GCN, DCSGCN-G, DCSGCN-S, and DCSGCN-W on Cora into two dimensions, as illustrated in Figure 6. For dimensionality reduction, we used t-SNE [57].
It is clear from
Figure 6 that some of the class representations learned by the GCN lack clear separation boundaries. For example, for classes 0, 1, and 6, the distances between the nodes from these three classes are relatively small, and some of the nodes from these classes are mixed together with class 5 nodes. In particular, for minority classes 1 and 6, many of the samples lie on the boundary between the majority classes. Our observations show that the GCN lacks the ability to learn discriminative minority class features under the influence of the majority classes.
By contrast, the representations of the minority classes learned by DCSGCN are significantly more discriminative. We can easily delineate the feature space of each minority class and its boundary with the majority classes. In particular, the minority classes 1 and 6 in Figure 6c are separated from the other classes by an extremely clear boundary region containing nearly no nodes. This observation demonstrates that DCSGCN can, to a considerable extent, weaken the negative impact of an imbalanced class distribution and learn the unique features of minority classes more effectively. We also see that even DCSGCN cannot avoid misclassifying a considerable number of nodes on the class boundaries, which reflects the difficulty of the imbalanced node classification task.
6. Conclusions
In this paper, we propose a new model, the dual cost-sensitive graph convolutional network (DCSGCN), for the task of long-tailed graph node classification. Compared with many GNNs based on maximizing the posterior probability, the proposed DCSGCN is based on minimizing the misclassification cost for a class prediction. To generate sample cost labels for network training, we propose three algorithms based on the node class distribution and graph topology. Extensive experiments on different datasets demonstrate the high effectiveness and robustness of DCSGCN in handling imbalanced data. In addition, a large number of tests under different experimental settings describe, in detail, the performance variations and characteristics of DCSGCN under different application scenarios.
To the best of our knowledge, this study is the first to introduce the idea of cost-sensitive learning into the field of graph representation learning and imbalanced node classification. The proposed model is simple and effective. In future work, we will explore in more depth new theoretical directions and application scenarios arising from the combination of cost-sensitive learning and network science.