Introduction

Graphs, as non-Euclidean data structures, are widely used in fields such as social networks 1,2,3,4, chemistry 5, and transportation planning 6,7. However, traditional deep learning techniques face challenges when processing graph data due to its distinctive properties, including nonlinearity, high heterogeneity, irregularity, and dynamic fluctuations 8. These characteristics diverge from the assumptions underlying traditional deep learning models, which are typically designed for Euclidean data. To address these challenges, researchers have proposed a variety of graph neural networks (GNNs) 9,10,11,12 for diverse machine learning applications, including community mining 13,14 and graph embedding 15,16. In contrast to traditional deep learning techniques, GNNs are adept at representing complex structures and relationships in graph data while effectively incorporating node attributes into an end-to-end learning framework.

Most GNNs update node information by considering only first-order neighbor details. For example, various Graph Convolutional Networks (GCNs) developed in recent years use message passing algorithms within convolutional layers to aggregate attribute information from nearby nodes 17,18,19,20. To capture higher-order node information, GCNs typically increase the depth of the model. However, deeper models often suffer from over-smoothing, where node representations become indistinguishable, making it difficult to preserve the unique features of each node 21,22,23,24. GraphSAGE addresses this by sampling higher-order neighbors for each node in each layer and aggregating their information to produce new representations 25. However, this approach requires manual selection of sampling parameters, and both the number of samples and the neighbor depth can significantly affect model performance. Existing studies have shown that higher-order structures contain vital information in graph data, so it is crucial to capture higher-order topological information effectively for graph learning tasks.

Complex networks often employ higher-order topological structures to characterize intricate relationships within the network 26,27,28,29. By analyzing the local connections of each node, these higher-order topological structures reveal hidden organizational patterns within the network, thereby exposing more intricate higher-order functional regions in networked data. Motifs and graphlets efficiently capture complex structural information from graphs, representing complex relationships in the network 30,31. In 2019, John et al. introduced a novel methodology that integrates the analysis of motifs within graph convolutional neural networks 32. This integration significantly improves the model’s ability to capture complex higher-order topological structural information. Although the method can handle directed graphs with self-loops and accurately capture higher-order structural information, it comes with higher computational complexity. Moreover, identifying essential motifs and developing autonomous methods for learning these motifs from the available data can be a complex problem. Defining and using higher-order topological structures to extract complex information from networks therefore remains a significant scientific challenge.

In this article, we design and construct an innovative graph neural network model, named TWC-GNN, with the primary objective of capturing complex relationships and hidden information in the network to achieve more accurate node classification. Compared to related models, TWC-GNN explores higher-order information throughout the network using a self-attention mechanism and a higher-order topological structure. Since first-order neighbors are significant in a graph, we employ the attention mechanism a second time to update node feature information using the adjacency matrix. In summary, the main contributions of this paper are as follows:

  • We propose a higher-order topological structure to capture the structural characteristics of nodes in a directed graph. We use the concept of node centrality to measure the importance of nodes in the network. By considering three fundamental relationships among triples of nodes, this structure assesses the mutual influence between a central node and its neighboring nodes, revealing intricate relationships within the data.

  • We encode the structural information of directed graphs using Laplacian eigenvectors, which enables the application of the self-attention mechanism from the Transformer model to sparsely structured graph data. This extension of the attention mechanism includes all nodes, even those at greater distances, allowing for the acquisition of comprehensive global information about the graph.

  • We employ the Graph Attention Network (GAT) approach, which integrates the adjacency matrix with the self-attention mechanism to emphasize important first-order neighboring node information. This method enables a thorough exploration of the node features, leading to an effective fusion of global and local interactions within the network.

We implement the proposed TWC-GNN method for node classification tasks in directed networks, using the Cora, CiteSeer, Pubmed, and Reddit datasets. To evaluate its performance, we compare it against a range of baseline methods, including DeepWalk, MLP, GCN, FAGCN, FedGCN, S\(\hat{3}\)GC, GCKSVM, and GAT. The results demonstrate that our algorithm achieves relatively outstanding performance across all datasets.

Fig. 1

The architecture of the proposed TWC-GNN model. There are three key components: the Centrality Information Module, the Transformer Attention Mechanism, and the GAT Attention Mechanism. The Transformer is utilized to capture hidden information from distant nodes, while the Centrality Information Module extracts relevant data about the local structure of the graph. This allows for a deeper understanding of the core nodes and their roles within the network, enhancing the interpretation of complex relationships. Additionally, the GAT mechanism prioritizes the importance of neighboring node information, further refining the model’s ability to capture both local and global interactions.

Methods

Node classification aims to predict the labels of unlabeled nodes in a graph \(G=\{V, E\}\), where some vertices are already labeled and others are not. \(V_L \subseteq V\) represents the set of nodes with known labels, denoted \(Y_L\). \(V_U=V \backslash V_L\) represents the set of nodes with unknown labels, denoted \(Y_U\). The objective of supervised learning is to train a model f such that it can accurately predict the label of an unlabeled node \(v \in V_U\). The node features are represented by a matrix \(X \in R^{n \times d}\), where \(n=|V|\) is the number of nodes in G, and d is the dimension of the feature vectors.

In this section, we present a detailed overview of the architecture and key components of the proposed TWC-GNN. As illustrated in Figure 1, TWC-GNN integrates higher-order topological structures (Centrality Information) with attention mechanism modules (the Transformer and GAT methods). To capture the mutual influence between central and neighboring nodes, we construct a higher-order topological structure using Centrality Information, derived from node degrees, to represent the complex interactions within the network. The Transformer method, employing the self-attention mechanism, is used to extract implicit information from higher-order nodes. Finally, the GAT method, also based on self-attention, updates node attributes by focusing specifically on first-order neighboring nodes.

Centrality information

The inherent structure of graph data offers a wealth of valuable information. If a node exhibits a high degree, it can be inferred that the node possesses a significant level of influence. Quantifying node importance is crucial to understanding graph structures, yet it remains insufficiently explored in the existing literature 33,34,35,36. In this analysis, we examine node centrality by focusing on three fundamental types of higher-order relationships. To assess the impact of a central node on its neighboring nodes, we transform node information into edge information using Eq. 1 and then explore the correlation between pairs of edges.

$$\begin{aligned} Z=M^{T} X W_{a} \end{aligned}$$
(1)

where \(M \in R^{|V| \times |E|}\) is the incidence matrix that encodes the connections between nodes and edges, defined as \(M_{i,{i\rightarrow j}}=M_{j,{i\rightarrow j}}=1\) and 0 otherwise. Moreover, \(W_{a}\) is a learnable projection matrix that transforms the node feature matrix X into the edge feature matrix Z.
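
As an illustration, the node-to-edge transformation in Eq. 1 can be sketched in PyTorch as follows; the function and variable names (node_to_edge_features, edge_index) are hypothetical, and the dense incidence matrix is used purely for clarity rather than reflecting the authors’ implementation.

```python
import torch

def node_to_edge_features(edge_index, X, W_a):
    """Sketch of Eq. (1): Z = M^T X W_a. edge_index is a (2, |E|) tensor of
    directed edges (i -> j); X is the (|V|, d) node feature matrix."""
    num_nodes, num_edges = X.size(0), edge_index.size(1)
    # Incidence matrix M: M[i, e] = M[j, e] = 1 for edge e = (i -> j), 0 otherwise.
    M = torch.zeros(num_nodes, num_edges)
    M[edge_index[0], torch.arange(num_edges)] = 1.0
    M[edge_index[1], torch.arange(num_edges)] = 1.0
    # Each row of Z aggregates the projected features of an edge's two endpoints.
    return M.t() @ X @ W_a                     # shape: (|E|, d_out)

# Hypothetical usage on a toy graph with three nodes and three directed edges.
edge_index = torch.tensor([[0, 1, 2], [1, 2, 0]])
X = torch.randn(3, 8)
W_a = torch.nn.Parameter(torch.randn(8, 16))
Z = node_to_edge_features(edge_index, X, W_a)  # (3, 16)
```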

The transformation from node features to edge features is critical in tasks such as node classification because it allows the model to capture both direct and indirect interactions between nodes 37. By converting node information into edge representations, the model captures not only the immediate relationships between nodes but also the flow of information across the graph, so that higher-order dependencies beyond first-order neighbor interactions are preserved. This is particularly important for directed graphs, where the flow and influence of information are directional, and where such indirect interactions often carry important latent information about the network’s structure.

Node liquidity

In a citation network, if a paper j on genetic algorithms is cited by only one paper i and cites only one paper k, then the probability that the three papers belong to the same category is high. If paper j is cited by or cites a greater number of other papers, the likelihood that papers i and k are related to genetic algorithms decreases. As shown in Figure 2(a), when the central node j has few neighbors, the correlation between the citations \((i \rightarrow j)\) and \((j \rightarrow k)\) increases. As shown in Figure 2(b), if the central node j has many neighbors (i.e., the degree of node j is large), then the correlation between \((i \rightarrow j)\) and \((j \rightarrow k)\) weakens due to the influence of other neighboring nodes. Likewise, in social networks, if the central node, user j, is connected to both user i and user k through friendship relations, and both user i and user k are engaged in fraudulent activities, then it is highly probable that user j is also involved. Therefore, when we discover that user j has close social relationships with known fraudsters, we need to be more vigilant about their behavior to prevent potential fraudulent activities. In Eq. 2, the correlation coefficient between \((i \rightarrow j)\) and \((j \rightarrow k)\) is calculated as follows:

$$\begin{aligned} A_{e,(i \rightarrow j),(j \rightarrow k)}=\exp \left( -\frac{\left( \operatorname {deg}^{-}(j)+\operatorname {deg}^{+}(j)-2\right) ^{2}}{\sigma ^{2}}\right) \end{aligned}$$
(2)

where \(\sigma\) is the standard deviation of node degrees and \(\operatorname {deg}^{-}(j)\) and \(\operatorname {deg}^{+}(j)\) indicate the in-degree and out-degree of node j in the graph. \(A_e \in R^{|E| \times |E|}\) denotes the correlation coefficient between edges. When a central node j has only one citing node i and one cited node k, the correlation between these two edges is at its maximum, with a value of 1.
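
For concreteness, Eq. 2 can be evaluated with a few lines of Python; the function name liquidity_correlation and the example degree values are hypothetical.

```python
import math

def liquidity_correlation(in_deg_j, out_deg_j, sigma):
    """Sketch of Eq. (2): correlation between edges (i -> j) and (j -> k), which
    depends only on the in- and out-degree of the shared central node j.
    sigma is the standard deviation of node degrees over the whole graph."""
    return math.exp(-((in_deg_j + out_deg_j - 2) ** 2) / sigma ** 2)

# When j has exactly one citing node and one cited node, the correlation is exp(0) = 1.
print(liquidity_correlation(in_deg_j=1, out_deg_j=1, sigma=2.0))  # 1.0
```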

Fig. 2

Node liquidity structure diagram.

Multi-reference ability

As shown in Figure 3(a), a paper typically cites relevant works within the same research field. A higher out-degree means a paper cites more works and is therefore more likely to reference literature from different categories. As shown in Figure 3(b), a genetic algorithms paper j is cited by a genetic algorithms paper i and a probabilistic methods paper k; however, paper k may also cite literature within the field of neural networks. The correlation between \((i \rightarrow j)\) and \((k \rightarrow j)\) is therefore influenced by the out-degree of the citing nodes i and k.

$$\begin{aligned} A_{e,(i \rightarrow j),(k \rightarrow j)}=\exp \left( -\frac{\left( \operatorname {deg}^{+}(i)+\operatorname {deg}^{+}(k)-2\right) ^{2}}{\sigma ^{2}}\right) \end{aligned}$$
(3)

where \(\operatorname {deg}^{+}(i)\) and \(\operatorname {deg}^{+}(k)\) denote the out-degrees of node i and node k. Papers with smaller out-degrees are more likely to cite works from the same category. In Eq. 3, the correlation between edge \((i \rightarrow j)\) and edge \((k \rightarrow j)\) is maximized when the out-degrees of both node i and node k are 1.

Fig. 3

Node multi-reference ability structure diagram.

Multi-propagation property

The more frequently a paper is cited, the stronger its influence and citation impact: it will be cited not only by literature within the same category but also by works from other domains. For instance, a probabilistic methods paper may be cited by literature on both genetic algorithms and neural networks. As shown in Figure 4 and Eq. 4, the correlation between \((j \rightarrow i)\) and \((j \rightarrow k)\) is not solely determined by the degree of the central node j, but rather by the in-degrees of nodes i and k.

$$\begin{aligned} A_{e,(j \rightarrow i),(j \rightarrow k)}=\exp \left( -\frac{\left( \operatorname {deg}^{-}(i)+\operatorname {deg}^{-}(k)-2\right) ^{2}}{\sigma ^{2}}\right) \end{aligned}$$
(4)

where the in-degrees of nodes i and k are denoted by \(\operatorname {deg}^{-}(i)\) and \(\operatorname {deg}^{-}(k)\), respectively. The higher the importance of a node, the lower its similarity to its neighboring nodes. The correlation between edge \((j \rightarrow i)\) and edge \((j \rightarrow k)\) is determined by nodes i and k, and is maximized when both node i and node k have an in-degree of 1.

Fig. 4

Node multi-propagation structure diagram.

This defines a higher-order topology involving three nodes and two edges, based on the structural characteristics of the data. Its primary objectives are to identify the central node of the graph and to evaluate the relationship between it and its neighboring nodes. This higher-order topology is utilized to extract complex relationships within the graph data, as further elaborated in the comprehensive model introduction below.
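
To make the construction concrete, the following sketch assembles an edge correlation matrix \(A_e\) covering all three relations (Eqs. 2–4) for pairs of edges that share a node; the dense double loop and the function name edge_correlation_matrix are illustrative assumptions rather than the authors’ implementation.

```python
import torch

def edge_correlation_matrix(edge_index, in_deg, out_deg, sigma):
    """Illustrative sketch of the higher-order topology (Eqs. 2-4): fill the
    (|E|, |E|) edge correlation matrix A_e for ordered pairs of distinct edges
    that share a node. edge_index is (2, |E|) with rows (source, target);
    in_deg / out_deg are per-node degree tensors."""
    E = edge_index.size(1)
    A_e = torch.zeros(E, E)
    for a in range(E):
        i, j = edge_index[0, a].item(), edge_index[1, a].item()      # edge a: i -> j
        for b in range(E):
            if a == b:
                continue
            p, q = edge_index[0, b].item(), edge_index[1, b].item()  # edge b: p -> q
            if j == p:                     # node liquidity: (i -> j), (j -> k)
                s = in_deg[j] + out_deg[j]
            elif i == q:                   # node liquidity with the roles of a and b swapped
                s = in_deg[i] + out_deg[i]
            elif j == q:                   # multi-reference: (i -> j), (k -> j)
                s = out_deg[i] + out_deg[p]
            elif i == p:                   # multi-propagation: (j -> i), (j -> k)
                s = in_deg[j] + in_deg[q]
            else:
                continue                   # edges share no node: correlation stays 0
            A_e[a, b] = torch.exp(-((s - 2.0) ** 2) / sigma ** 2)
    return A_e
```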

Self-attention based mechanisms

The self-attention mechanism enables the thorough extraction of node feature information. In this section, we examine in detail the integration of self-attention mechanisms, including the Transformer and Graph Attention Network (GAT). We provide a comprehensive explanation of how these methods work together to capture higher-order node information, ultimately leading to an enriched representation of node features.

Transformer for higher-order node information

The Transformer integrates the attention mechanism with a positional encoding matrix 38. This approach considers all nodes within the network, capturing the interdependence of nodes, even those far apart, instead of focusing solely on neighboring nodes 39,40,41. Networks are generally sparse, meaning that not all nodes are connected. To obtain comprehensive global information while preserving the sparsity of the graph, encoding the graph’s structural information is crucial. Commonly used techniques for encoding graph positions include Node Index Positional Encoding, Laplacian Eigenvector Encoding 42,43, and Learnable Positional Encoding 44. We utilize Laplacian Eigenvector Encoding as follows:

$$\begin{aligned} A_{\text{ lp }}=\textrm{I}-D^{-1 / 2} A D^{-1 / 2}=U^{T} \Lambda U, \end{aligned}$$
(5)

where A is the directed adjacency matrix of G, with \(a_{ij}=1\) if there is a directed edge between nodes i and j, and \(a_{ij} = 0\) otherwise, D is the diagonal degree matrix, and \(\Lambda\) and U contain the eigenvalues and eigenvectors of the normalized Laplacian \(A_{\text{lp}}\).
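
A minimal sketch of this positional encoding is shown below; symmetrizing the directed adjacency matrix before forming the Laplacian, the truncation to the first k non-trivial eigenvectors, and the function name laplacian_positional_encoding are assumptions of this sketch.

```python
import torch

def laplacian_positional_encoding(A, k=8):
    """Sketch of Eq. (5): symmetric normalized Laplacian I - D^{-1/2} A D^{-1/2}
    and its eigenvectors, used as a k-dimensional positional encoding.
    A is a dense (n, n) adjacency matrix; it is symmetrized here so the
    eigendecomposition is real-valued (an assumption for directed graphs)."""
    A_sym = ((A + A.t()) > 0).float()
    deg = A_sym.sum(dim=1).clamp(min=1.0)
    D_inv_sqrt = torch.diag(deg.pow(-0.5))
    L = torch.eye(A.size(0)) - D_inv_sqrt @ A_sym @ D_inv_sqrt
    eigvals, eigvecs = torch.linalg.eigh(L)   # eigenvalues in ascending order
    return eigvecs[:, 1:k + 1]                # drop the trivial constant eigenvector
```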

The Transformer’s attention module makes use of the self-attention mechanism, which has the following definition:

$$\begin{aligned} Q=X W_{Q}, \quad K=X W_{K}, \quad V=X W_{V} \end{aligned}$$
(6)
$$\begin{aligned} A_{\text{ soft } }=\operatorname {softmax}\left( \frac{Q K^{\top }}{\sqrt{d_{K}}} \cdot A_{\text{ lp }}\right) \end{aligned}$$
(7)
$$\begin{aligned} \operatorname {Attn}(H)=A_{\text{ soft } } V \end{aligned}$$
(8)

The self-attention mechanism uses new vector representations, denoted Q, K, and V, to represent the Query, Key, and Value of the nodes, respectively. The weight matrices \(W_{Q} \in \mathbb {R}^{d \times d_{K}}\), \(W_{K} \in \mathbb {R}^{d \times d_{K}}\) and \(W_{V} \in \mathbb {R}^{d \times d_{V}}\) transform the input into the query, key, and value representations. \(\operatorname {Attn}(H)\) denotes the output of the single-headed self-attention module, computed from the similarity between the query and the key.
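
A single-head sketch of Eqs. 6–8 follows; interpreting the product with \(A_{\text{lp}}\) inside the softmax as an element-wise modulation of the score matrix is an assumption of this sketch, as is the class name StructureAwareSelfAttention.

```python
import math
import torch
import torch.nn as nn

class StructureAwareSelfAttention(nn.Module):
    """Single-head sketch of Eqs. (6)-(8): Q/K/V attention whose scores are
    modulated by the structural encoding A_lp before the softmax."""
    def __init__(self, d, d_k, d_v):
        super().__init__()
        self.W_Q = nn.Linear(d, d_k, bias=False)
        self.W_K = nn.Linear(d, d_k, bias=False)
        self.W_V = nn.Linear(d, d_v, bias=False)
        self.d_k = d_k

    def forward(self, X, A_lp):
        Q, K, V = self.W_Q(X), self.W_K(X), self.W_V(X)        # Eq. (6)
        scores = (Q @ K.t()) / math.sqrt(self.d_k) * A_lp      # Eq. (7), element-wise modulation
        A_soft = torch.softmax(scores, dim=-1)
        return A_soft @ V                                       # Eq. (8)
```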

The fundamental components of the Transformer architecture are multi-head self-attention (MHA), which enables the model to attend to several segments of the input concurrently; layer normalization (LN), which is applied before each sub-block; and the feed-forward network (FFN), composed of two linear layers followed by a GELU (Gaussian Error Linear Unit) nonlinearity. This configuration allows the model to capture intricate patterns in the data effectively. The equations governing these operations are shown below:

$$\begin{aligned} h^{\prime (l)}=\operatorname {MHA}\left( \operatorname {LN}\left( h^{(l-1)}\right) \right) +h^{(l-1)} \end{aligned}$$
(9)
$$\begin{aligned} h^{(l)}=\textrm{FFN}\left( \operatorname {LN}\left( h^{\prime (l)}\right) \right) +h^{\prime (l)} \end{aligned}$$
(10)

where \(h^{(l-1)}\) denotes the input of layer l, and \(h^{\prime (l)}\) represents the output of the multi-head attention sub-block of layer l, which also serves as the input to the feed-forward sub-block of that layer.
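
The pre-layer-norm block of Eqs. 9–10 can be sketched as follows; this sketch uses PyTorch’s built-in nn.MultiheadAttention rather than the structure-aware attention of Eq. 7, and the hidden width of the feed-forward network is an assumption.

```python
import torch
import torch.nn as nn

class PreLNTransformerBlock(nn.Module):
    """Sketch of Eqs. (9)-(10): pre-layer-norm block with multi-head attention,
    a GELU feed-forward network, and residual connections."""
    def __init__(self, d_model, n_heads, d_ff):
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)
        self.ln2 = nn.LayerNorm(d_model)
        self.mha = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))

    def forward(self, h):
        x = self.ln1(h)
        h = h + self.mha(x, x, x, need_weights=False)[0]   # Eq. (9)
        h = h + self.ffn(self.ln2(h))                      # Eq. (10)
        return h
```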

Distances between nodes in complex networks often vary. More distant nodes frequently lie at the periphery of the network or on connections between different components, and they can provide crucial information about the network as a whole.

GAT for feature update among first-order neighbors

Both GAT and the Transformer incorporate self-attention mechanisms. One distinction lies in the Transformer’s ability to acquire global node information, whereas the GAT method concentrates solely on the first-order neighbors of the given nodes. In a network, however, first-order neighbors carry particularly important information. Therefore, after extracting the latent information from the network, directing attention to the information from first-order neighboring nodes can effectively consolidate the information within the network.

Using the attention mechanism, the GAT module first computes the attention coefficient \(e_{i j}\) between nodes i and j. The LeakyReLU activation and softmax function are then applied to normalize the inter-node attention coefficients and obtain the attention weight \(\alpha _{i j}\) between connected nodes. Finally, node features are updated over the first-order neighborhood N(i) using a learnable weight matrix W, ensuring a refined representation of the graph’s structural information:

$$\begin{aligned} h_i=\sigma \left( \sum _{j \in N(i)} \alpha _{i j} W h_j\right) \end{aligned}$$
(11)
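
A minimal single-head sketch of Eq. 11 is given below; the dense adjacency mask, the added self-loops, and the choice of ELU for \(\sigma\) are assumptions made for brevity, and the class name SimpleGATLayer is hypothetical.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleGATLayer(nn.Module):
    """Minimal single-head sketch of Eq. (11): attention-weighted aggregation over
    first-order neighbours, using a dense adjacency matrix for brevity."""
    def __init__(self, d_in, d_out):
        super().__init__()
        self.W = nn.Linear(d_in, d_out, bias=False)
        self.a = nn.Linear(2 * d_out, 1, bias=False)

    def forward(self, X, adj):
        n = X.size(0)
        H = self.W(X)                                          # (n, d_out)
        # Attention logits e_ij for every ordered node pair (i, j).
        e = self.a(torch.cat([H.unsqueeze(1).expand(n, n, -1),
                              H.unsqueeze(0).expand(n, n, -1)], dim=-1)).squeeze(-1)
        e = F.leaky_relu(e)
        # Keep only first-order neighbours; self-loops avoid empty neighbourhoods.
        mask = adj + torch.eye(n, device=adj.device)
        e = e.masked_fill(mask == 0, float('-inf'))
        alpha = torch.softmax(e, dim=-1)                       # attention weights alpha_ij
        return F.elu(alpha @ H)                                # Eq. (11) with sigma = ELU
```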

TWC-GNN: an implementation of the method

In this section, we present TWC-GNN, a novel model that integrates higher-order topology and attention mechanisms for a thorough exploration of complex relationships and latent information within the network. TWC-GNN consists of higher-order topological structures (Centrality Information) and attention mechanism modules (the Transformer method and the GAT method). The Centrality Information module represents a higher-order topology defined based on the intrinsic features of the data and is used to reveal intricate relationships within the network. The Transformer method leverages the self-attention mechanism and positional-encoding information to discover concealed information in higher-order nodes. The GAT technique focuses on extracting more significant information from first-order neighbors.
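
As a rough illustration of how these modules fit together, the sketch below reuses the StructureAwareSelfAttention and SimpleGATLayer classes defined above; the layer sizes, the residual fusion, and the classifier head are assumptions of this sketch rather than the authors’ exact implementation, which is given in Algorithm 1 and the released code.

```python
import torch.nn as nn

class TWCGNNSketch(nn.Module):
    """Rough composition sketch of the TWC-GNN pipeline described above."""
    def __init__(self, d_in, d_hidden, n_classes):
        super().__init__()
        self.attn = StructureAwareSelfAttention(d_in, d_in, d_in)  # global, higher-order information
        self.gat = SimpleGATLayer(d_in, d_hidden)                  # first-order refinement
        self.classifier = nn.Linear(d_hidden, n_classes)

    def forward(self, X, adj, A_lp):
        # Global higher-order node information; the edge correlation matrix A_e
        # from the Centrality Information module would also be injected here.
        h = X + self.attn(X, A_lp)
        # Local refinement over first-order neighbours via attention (Eq. 11).
        h = self.gat(h, adj)
        return self.classifier(h)   # per-node class logits
```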

Code Availability: The step-by-step procedure is illustrated in Algorithm 1. The implementation of the TWC-GNN model is publicly available on GitHub: https://github.com/xiajiwen/TWC-GNN and https://doi.org/10.5281/zenodo.14264082.

Algorithm 1

The TWC-GNN Algorithm.

Results and discussion

In this section, we conduct experiments on multiple benchmark networks to assess the effectiveness of the proposed solution. We employ visualization and analysis techniques to showcase the efficacy of our suggested framework based on experimental findings. In addition, we perform ablation experiments to emphasize the necessity of collecting complex interactions and hidden information within the network.

Table 1 Statistical characteristics of all the datasets used in our experiments.

Baselines and datasets

In order to assess the significance of higher-order information and complex interactions, we performed a series of experiments on widely used benchmark datasets drawn from two common types of networks in graph representation learning: academic paper citation networks and social networks. To handle the extensive scale of the social network dataset, this study uses a subset constituting 3% of the data for experimental purposes. The statistical characteristics of all datasets are summarized in Table 1.

  • Cora: This dataset is a widely used academic citation network and a common benchmark for graph neural networks. Its 2,708 computer science research articles are classified into seven classes, including machine learning, databases, artificial intelligence, and others. The word frequencies of each paper are represented by a 1433-dimensional vector created using a bag-of-words model.

  • CiteSeer: This dataset comprises over 6000 computer science papers from 1998 to 2002 together with the connections between them. Publications are represented as nodes and citation links between publications as edges. In addition, each node is assigned a label from one of six thematic categories: artificial intelligence, distributed systems, databases, machine learning, parallel computing, and information retrieval.

  • Pubmed: This dataset is a widely used citation network in academic research. It is provided by the United States National Library of Medicine and contains approximately 2.4 million scientific papers on medicine and the biological sciences. Each article is represented as a node in the network, with metadata such as title, abstract, and keywords, and edges are formed by the citation links between articles.

  • Reddit: The dataset is widely used and originates from the social news platform Reddit, containing posts and comments posted by Reddit users. The dataset comprises posts from diverse categories, including politics, sports, and entertainment, and spans a broad spectrum of social interactions such as upvoting, commenting, and sharing.

The fundamental methods for comparison in our experiments are thoroughly explained as follows.

  • DeepWalk 15: DeepWalk is a method that utilizes random walks and neural networks for unsupervised graph embedding, enabling efficient learning of node representations on large-scale graphs.

  • MLP 45: MLP is a feedforward neural network based on multiple fully connected layers, capable of learning non-linear features. It has been widely applied in classification, regression, and other machine learning tasks.

  • GCN 18: GCN is a neural network model designed to handle graph data. It is characterized by its ability to learn feature representations of nodes while considering their interactions within the graph structure.

  • FAGCN 46: This model is designed to process graph data by adaptively integrating both low-frequency and high-frequency signals from node features and their interactions within the graph structure. It allows for more flexible and efficient learning of node representations across a variety of network scenarios.

  • FedGCN 47: This neural network model is designed to handle graph data distributed across multiple clients. It learns node feature representations while accounting for their interactions within the graph structure, achieving fast convergence with minimal communication and enhanced privacy through federated learning.

  • S\(\hat{3}\)GC 48: This neural network model is designed for processing graph data, with a key feature being its ability to learn node feature representations while incorporating their interactions within the graph structure. It accomplishes this in a computationally efficient manner, reducing complexity compared to traditional GCNs.

  • GCKSVM 49: GCKSVM integrates graph convolution with kernel learning to efficiently process graph data. It stands out for its ability to generate expressive node features by capturing both inherent attributes and the relational context of the graph structure.

  • GAT 19: This method incorporates the self-attention mechanism into graph convolutional networks (GCN), establishing dynamic and weighted connections among nodes. This enhancement aims to better capture the relationships between nodes and their importance.

Node classification and visualization

In this work, we applied L2 regularization to prevent overfitting during training and used the Adam optimizer to minimize the cross-entropy loss 50. Furthermore, we implemented early stopping when the negative log-likelihood (NLL) loss did not decrease for 20 consecutive iterations 51, and we set the maximum number of training epochs to 2000. The learning rates were set to 0.03, 0.005, 0.05, and 0.005 for the Cora, Citeseer, Pubmed, and Reddit datasets, respectively. For the respective datasets, the number of attention heads was set to 8, 4, 1, and 4.
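
A training-loop sketch matching this setup follows, assuming a model with the interface of the composition sketch above; the weight-decay value and the use of the validation loss for early stopping are assumptions, since the paper does not state them explicitly.

```python
import torch
import torch.nn.functional as F

def train(model, X, adj, A_lp, y, train_mask, val_mask,
          lr=0.005, weight_decay=5e-4, max_epochs=2000, patience=20):
    """Sketch of the training setup described above: Adam with L2 regularization
    (weight decay), cross-entropy/NLL loss, and early stopping when the
    validation loss has not improved for `patience` consecutive checks."""
    opt = torch.optim.Adam(model.parameters(), lr=lr, weight_decay=weight_decay)
    best_val, wait = float('inf'), 0
    for epoch in range(max_epochs):
        model.train()
        opt.zero_grad()
        logits = model(X, adj, A_lp)
        loss = F.cross_entropy(logits[train_mask], y[train_mask])
        loss.backward()
        opt.step()

        model.eval()
        with torch.no_grad():
            val_loss = F.cross_entropy(model(X, adj, A_lp)[val_mask], y[val_mask]).item()
        if val_loss < best_val:
            best_val, wait = val_loss, 0
        else:
            wait += 1
            if wait >= patience:   # early stopping
                break
    return model
```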

Fig. 5

The loss comparison of node classification under different embedding methods in three real citation networks: (a) Cora citation network; (b) Citeseer citation network; (c) Pubmed citation network. The x axis represents training epochs, and the y axis represents the loss value.

Table 2 An analysis of the classification outcomes for the Citation Networks and Social Network datasets.

Table 2 summarizes the classification accuracy and standard deviation of each approach used in our experiments. Clearly, our TWC-GNN model exhibits the best performance on the Cora network, with a 4.7% improvement over the GCN technique. Similarly, the TWC-GNN approach achieves the best performance on the social network. Figure 5 illustrates the variations in the loss curves of the GCN, GAT, and TWC-GNN techniques on the citation networks. Compared to the other two models, our proposed TWC-GNN technique achieves a lower final loss value. The substantially higher loss value of the GCN model can be attributed to its less effective use of higher-order node information and its simpler handling of local neighborhood information. The left part of Figure 6 displays the ground-truth labels of each dataset, while the middle part shows the predicted labels. To facilitate the comparison of predicted results with ground-truth labels, the right part uses darker hues to signify correct predictions and lighter shades to highlight prediction mistakes.

In general, this paper introduces a novel graph neural network model, TWC-GNN, which utilizes higher-order topological structures and self-attention mechanisms. Compared to other baseline algorithms, TWC-GNN demonstrates superior capability to capture complex relationships and latent information within the network.

Table 3 Ablation experiments on higher-order topological structures and higher-order node information.

Ablation study

We conducted a series of ablation experiments to determine the significance of higher-order node information and higher-order topological structures, as outlined in Table 3. The Transformer-GAT model adds higher-order node information to the GAT model; the comparison with the GAT model indicates that higher-order nodes possess significant latent information. The Centrality-GAT model adds higher-order topological features to the GAT model. The EdgeAdj-Transformer-GAT model differs from the full TWC-GNN model in that its edge matrix weights are restricted to the binary values 0 and 1. The empirical findings suggest that higher-order topological structures are better at capturing intricate relationships within the network, thereby enhancing the accuracy of the model.

Fig. 6

Data visualization for three citation datasets: Cora, Citeseer, and Pubmed. Each dataset is first visualized according to its own node connections and true labels (nodes of the same type are positioned as close together as possible). The classification results of our model are then visualized using the same layout. Finally, dark colors indicate correct predictions and light colors indicate incorrect predictions.

Conclusion

In this paper, we have presented the TWC-GNN model, designed for the node classification task within directed graphs by leveraging higher-order topological structures and self-attention mechanisms. Specifically, we have introduced the concept of Centrality Information to quantify the reciprocal influence between central nodes and their neighboring nodes, allowing for a deeper understanding of intricate network relationships. Additionally, we have incorporated a Transformer-based self-attention mechanism using Laplacian Eigenvector Encoding to extract higher-order node information, which plays a crucial role in understanding the structure of directed graphs. The TWC-GNN model further refines node representations by using the GAT approach to focus on first-order information, balancing global and local node interactions. Through extensive experiments on citation network datasets, we have demonstrated the effectiveness of our model in node classification tasks. A series of ablation studies have shown that the performance of TWC-GNN is significantly influenced by the integration of higher-order topological structures and higher-order node information. In conclusion, our findings highlight the importance of higher-order structural information in directed networks, providing a strong foundation for future research in complex graph-based learning tasks. The proposed TWC-GNN model offers a promising approach for extracting and leveraging global and local graph features, enabling more accurate and efficient node classification in directed networks.