

Neural Architecture Search for GNN-Based Graph Classification

Published: 18 August 2023

Abstract

Graph classification is an important problem with applications across many domains, for which graph neural networks (GNNs) have been state-of-the-art (SOTA) methods. In the literature, to adopt GNNs for the graph classification task, there are two groups of methods: global pooling and hierarchical pooling. The global pooling methods obtain the graph representation vectors by globally pooling all of the node embeddings together at the end of several GNN layers, whereas the hierarchical pooling methods provide one extra pooling operation between the GNN layers to extract hierarchical information and improve the graph representations. Both global and hierarchical pooling methods are effective in different scenarios. Due to highly diverse applications, it is challenging to design data-specific pooling methods with human expertise. To address this problem, we propose PAS (Pooling Architecture Search) to design adaptive pooling architectures by using neural architecture search (NAS). To enable the search space design, we propose a unified pooling framework consisting of four modules: Aggregation, Pooling, Readout, and Merge. Two variants, PAS-G and PAS-NE, are provided to design the pooling operations in different scales. Based on this framework, a set of candidate operations is designed in the search space, so that existing human-designed pooling methods, including global and hierarchical ones, can be incorporated. To enable efficient search, a coarsening strategy is developed to continuously relax the search space, and then a differentiable search method can be adopted.
We conduct extensive experiments on six real-world datasets, including the large-scale datasets MR and ogbg-molhiv. Experimental results in this article demonstrate the effectiveness and efficiency of the proposed PAS in designing the pooling architectures for graph classification. The Top-1 performance on two Open Graph Benchmark (OGB) datasets further indicates the utility of PAS when facing diverse realistic data. The implementation of PAS is available at: https://github.com/AutoML-Research/PAS.

1 Introduction

The graph classification task aims to learn the mapping function between a graph and its label. Graph neural networks (GNNs) [69, 77] have been the state-of-the-art (SOTA) methods for this task in recent years; they require generating a representation for a given graph. This task has applications in various domains, for example, chemistry [19], bioinformatics [58, 69], and text categorization [39, 51, 78]. It can be extended to the link-prediction task by learning the graph representations of the subgraphs around links [41, 59, 60, 75, 76]. To obtain a graph representation, a straightforward way is to add a global pooling function, also called the readout function, at the end of the GNN to globally pool all of the node embeddings; for example, the mean or sum of all node embeddings can serve as the graph-level representation vector. These methods [66, 77] can capture global information and are easy to build on top of existing GNNs; they are denoted as global pooling in this article. However, global pooling directly reduces the given graph into one vector, which ignores the hierarchical and compositional structure of the graph [26] and leads to a flat graph-level representation [69]. More advanced pooling methods have been proposed to preserve the hierarchical information by aggregating messages on increasingly coarser graphs; they are denoted as hierarchical pooling. These methods apply a pooling operation between the aggregation operations to reduce the graph size, as shown in Figure 1(a). They can be categorized into two classes according to the node preparation mechanism used when generating the coarse graph. The selection-based pooling methods design the coarse graph by preserving the most informative nodes and the edges among them; the representative ones are SAGPool [31] and Graph U-Net [14]. The grouping-based methods, for example, DiffPool [69] and MinCut [2], are devoted to grouping nodes into clusters, from which the new nodes in the coarse graph are obtained. Graphical illustrations are provided in Figure 1(b).
Fig. 1. (a) Graphical illustration of a 2-layer architecture backbone. In each layer, the aggregation operation is responsible for updating the node features, and the following pooling operation produces a smaller coarse graph. The readout operation generates the graph representation vector for the given graph, and the merge operation integrates the features of the intermediate layers. (b) Among the pooling operations, the selection-based operations (top) focus on selecting and renumbering the top-ranked nodes in the coarse graph, that is, the nodes \(\mathcal {V}_{se}=\lbrace 0,1,4,5\rbrace\) are selected and then form the coarse graph on the right side. The grouping-based operations (bottom) aim to assign nodes into clusters, which are denoted as \(\mathcal {V}_{group}=\lbrace a,b,c,d\rbrace\). The two types of nodes are mismatched since \(\mathcal {V}_{se} \cap \mathcal {V}_{group}= \phi\).
Despite the success of these pooling methods, real-world graphs come from highly diverse domains and sizes. Existing methods use fixed pooling operations, which are difficult to generalize to various domains, and it is time-consuming to hand-design one data-specific pooling architecture for a given dataset. Moreover, hierarchical pooling methods are not beneficial in all cases [11, 20, 42], leading to a new demand to design the pooling architectures adaptively. Very recently, to obtain data-specific GNN architectures, researchers have turned to neural architecture search (NAS) [38, 82], for example, GraphNAS [17], Policy-GNN [29], and SANE [80]. However, most of the existing methods are designed for node-level tasks, focusing either on aggregation operations [17, 80] or on the number of layers [29]. When dealing with a graph-level task, fixed global pooling operations are used in GNNs [6, 71] or jointly learned with the aggregation operations [25]. These methods pay more attention to global pooling and fail to incorporate hierarchical pooling. It is non-trivial to design the pooling architectures adaptively since the following two aspects should be considered. First, both global and hierarchical pooling methods should be incorporated into the search space due to their effectiveness in different scenarios [11, 42], while there is a lack of a systematic pooling framework for the graph classification task on which to base the search space. Second, because of its search efficiency and effectiveness, the differentiable search algorithm is expected to learn architectures on top of the search space. However, it cannot be applied directly to pooling methods, since the pooling results have different sizes and node sets and thus cannot be integrated and relaxed; as a result, the usage of gradient descent is infeasible.
To learn adaptive pooling architectures with the help of NAS, we propose a novel method called PAS (Pooling Architecture Search) to address these challenges. It contains a unified pooling framework to design the search space and one coarsening strategy to enable the usage of the differentiable architecture search. The unified framework is defined based on four key modules derived from the existing pooling architectures: Aggregation, Pooling, Readout, and Merge. Considering that the two classes of pooling operations in existing hierarchical methods, selection-based and grouping-based, construct the coarse graph in different manners, we provide two variants of the Pooling Module that design the pooling operations in different scales based on these two classes. In the first variant, we employ selection-based operations in the Pooling Module. They focus on designing the node score functions, and the coarse graph can be formulated based on the most informative nodes; in other words, only one step is needed to design the pooling result. This coarse-grained pooling variant is dubbed PAS-G (Pooling Architecture Search in Graph scale).
For PAS-G, which is designed on the selection-based methods, the unselected nodes and edges are dropped directly when formulating the coarse graph, which may lead to information loss to some extent. To improve graph representation learning, we further introduce grouping-based pooling operations in the Pooling Module, which have advantages in grouping the information from all nodes and edges in the graph. To unify these two classes of pooling operations, we first refine the Pooling Module and decompose it into Selection, Reduction, and Connection blocks based on the three steps in designing the coarse graph. They are responsible for preparing the nodes in the coarse graph, updating the features, and reconstructing the adjacency matrix, respectively. Then, we observe that the different node preparation mechanisms in the Selection Block result in mismatched nodes in the coarse graph, as shown in Figure 1(b). To address this problem, we modify the grouping-based methods by assigning nodes to the cluster heads prepared in the Selection Block, from which matched nodes are obtained. Based on this block, the coarse graph can be formulated by designing the reduction and connection operations to update the node features and reconstruct the edges in the graph. This fine-grained pooling variant improves the coarse graph design in the node and edge scales and is dubbed PAS-NE.
As to the search algorithm, to address the mismatch problem in the coarse graph, a coarsening strategy is designed to relax the search space by padding the unselected nodes and keeping the node sets identical. We further decompose this strategy into the node and edge scales to suit the fine-grained variant PAS-NE. Then, the coarse graphs can be matched with each other, and the usage of gradient descent becomes available. We conduct extensive experiments on six real-world datasets from five domains, for example, the large-scale text classification dataset MR [78] and the molecular property prediction dataset ogbg-molhiv [23]. The experimental results demonstrate that the two PAS variants achieve better performance than various baselines. By designing the nodes and edges in the coarse graph, PAS-NE performs better than PAS-G. With the help of the coarsening strategy, PAS is optimized with gradient descent and achieves a two-orders-of-magnitude reduction of the search cost. Furthermore, on the popular OGB leaderboard for graph classification, we have achieved 1st, 1st, and 3rd place on the datasets MolHIV, PPA, and MolPCBA, respectively, with PAS [58] since April 2022, which further shows its effectiveness for the graph classification task.
To summarize, the contributions of this work are as follows:
To the best of our knowledge, PAS is the first work to design data-specific pooling architectures for GNN-based graph classification, which is based on the differentiable neural architecture search.
To design an expressive search space for PAS, we propose a unified pooling framework that contains four key modules derived from the literature: Aggregation, Pooling, Readout, and Merge. Two variants, PAS-G and PAS-NE, are proposed on the selection-based and grouping-based pooling methods, respectively. With these variants, the global and two hierarchical pooling methods in the literature can be incorporated.
Considering its efficiency and effectiveness, we provide one coarsening strategy to enable the usage of the differentiable architecture search.
We validate the proposed two variants, PAS-G and PAS-NE, on six real-world datasets. The results demonstrate that we achieve higher performance than various baselines for graph classification. Moreover, PAS-NE outperforms PAS-G by designing coarse graphs in the node and edge scales. In addition, the efficiency of these two methods is improved by two orders of magnitude due to the coarsening strategies.
Differences with the Conference Version. Preliminary results of this article have been published in CIKM 2021 [62] in which the unified framework is proposed and the PAS-G variant is employed in the Pooling Module. Here, we have made the following important extensions:
We add one comprehensive variant, PAS-NE, that subsumes the PAS-G variant and incorporates the grouping-based hierarchical pooling methods. A novel search space is constructed based on the designed unified pooling operation (Sections 3.2.2 and 3.2.3). In addition, we further refine the coarsening strategy to enable the usage of the differentiable architecture search, in particular for the PAS-NE variant (Section 3.3.2).
New datasets and experiments. We use two large-scale datasets, MR and ogbg-molhiv, in the journal version and update the experiments in Section 4.2. We conduct additional ablation studies to evaluate the PAS-NE variant in Section 4.3 to Section 4.5.
We conduct a literature investigation to check the latest pooling methods for the graph classification task (Section 2.1).
Notations. We represent a graph as \(G =({\bf A}, {\bf H})\), where \({\bf A} \in \mathbb {R}^{N \times N}\) is the adjacency matrix of this graph and \({\bf H}\in \mathbb {R}^{N \times d}\) is the node feature matrix; \(N\) is the node number and \(d\) is the feature dimension. For simplicity, all of the intermediate layers have the same feature dimension \(d\). \(\widetilde{\mathcal {N}}(v) = \lbrace v\rbrace \cup \lbrace u | {\bf A}_{uv} \ne 0 \rbrace\) represents the set of the self-contained first-order neighbors of node \(v\). In an \(L\)-layer GNN, for clear presentation, the input graph is denoted by \(G^0 = ({\bf A}^0, {\bf H}^0)\), and the input of the \(l\)-th layer is \(G^{l-1} = ({\bf A}^{l-1}, {\bf H}^{l-1})\). One aggregation and one pooling operation are placed in this layer, and the operation results are represented as \(G^{la}=({\bf A}^{l-1}, {\bf H}^{la})\) and \(G^{l} = ({\bf A}^{l}, {\bf H}^{l})\), respectively. The features of node \(v\) in the \(l\)-th layer are denoted by \({\bf h}_v^l\). \(\mathcal {V}^l\) represents the nodes in \(G^{la}\), and \(n_l = |\mathcal {V}^l|\) is the input graph size of the \(l\)-th layer.

2 Related Work

2.1 GNN for Graph Classification

The graph classification task aims to learn the mapping function between a graph and its label. One popular approach is to use a graph kernel to quantify the similarity of graph pairs, for example, the shortest path kernel [4] and the random walk kernel [18, 40]. Nevertheless, graph kernels suffer from significant computational bottlenecks. GNNs extract a representation for the given graph and perform the classification directly; they are more efficient and have been the SOTA methods for this task [63, 77]. In the literature, existing GNN methods for graph classification can be roughly classified into two groups: global pooling and hierarchical pooling. The hierarchical pooling methods can be further categorized into two classes according to the node preparation mechanism: selection-based methods and grouping-based methods.

2.1.1 Global Pooling.

These methods focus on calculating the graph representation vector by adding one global pooling function, also called the readout function, at the end of GNNs to globally pool the node features. The widely used global pooling operations sum (or average) the node features. DGCNN [77] generated the graph representations based on the features of top-ranked nodes. GNP [28] proposed a norm-based framework that can express existing global pooling operations, such as sum, mean, min, and max, and beyond. SSRead [30] proposed a graph readout technique aiming to identify structurally meaningful positions; it clusters nodes according to position information and then aggregates node embeddings based on position-specific weights. On top of the aggregation operations, global pooling methods directly reduce a whole graph into a single graph representation vector. They learn a flat representation and cannot capture the potential hierarchical information in real-world graphs.

2.1.2 Hierarchical Pooling.

Compared with the global pooling methods, the hierarchical ones aggregate messages on increasingly coarser graphs to capture the hierarchical information in the graph compositional nature [26]. This is achieved by adding one pooling operation between the aggregation operations to reduce the graph size. A pooling operation first prepares the nodes in the coarse graph, then updates the features and reconstructs the adjacency matrix, from which the coarse graph is obtained. The pooling operations can be categorized into two classes according to the node preparation mechanisms.
\(\bullet\) Selection-based methods. The selection-based methods focus on preserving the most informative nodes and edges in the graph. They first evaluate the node importance with one score function and then formulate the coarse graph based on the top-ranked nodes by dropping the unselected nodes and edges connected with them. The representative method, Graph U-Net [14], estimated scalar projection values of the node feature, whereas more methods design the score functions based on local information. For example, SAGPool [31] selected nodes based on the self-attention scores obtained by graph convolutions. VIPool [34] and CGIPool [44] selected nodes based on the neural estimation of mutual information between nodes and their neighbors, while iPool [16] followed the signal processing theory and preserved the nodes that cannot be represented by their neighbors. ASAP [48] calculated the importance score of the 1-hop subgraph with the self-attention mechanism. Then, the top-ranked nodes were preserved and the edges were generated based on the cluster assignment matrix. Compared with these methods, which used a single evaluation metric, GSAPool [74] incorporated the local structure and feature scores learned by graph convolution and Multilayer Perceptrons (MLP) operations, respectively. TAP [15] selected nodes according to the local and global structures. RePool [33] used the node features, degree, and the distance between node pairs in the score function, then assigned the unselected nodes to their selected neighbors.
\(\bullet\) Grouping-based methods. The grouping-based methods focus on grouping similar nodes into one cluster. They first design one assignment matrix, on top of which the cluster features and new edges are constructed. The representative methods are DiffPool [69] and MinCut [2]. DiffPool designed the assignment matrix with the aggregation layer, while MinCut [2] computed cluster assignments with the MLP operation and optimized with the relaxed minCUT objective. STRUCTPOOL [72] used conditional random fields (CRFs) to capture the assignment relationships. CommonPool [54] and MHIBPool [52] assigned nodes based on community detection techniques.
Apart from these methods that focus on designing the pooling operations, SRCPool [20] proposed a formal characterization of pooling operations on three main aspects: selection, reduction, and connection. The authors of [11] provided a fair evaluation of the literature pooling methods. The authors of [42] studied the role of local pooling and its impact on performance. In this article, we learn to design the data-specific pooling architectures for graph classification. PAS provides two pooling variants and a set of candidate operations to design a coarse graph in different scales, on top of which existing methods can be unified and new architectures can be designed.

2.2 Differentiable Neural Architecture Search (NAS)

NAS methods were proposed to automatically find SOTA CNN architectures. They require a search space to represent the architectures and a search algorithm to explore that space [10]. Representative methods are presented in [38, 49, 50, 65, 82]. Early NAS approaches employ Reinforcement Learning (RL) [82] and Evolutionary Algorithms (EA) [49, 50] to select architectures from the predefined search space. They follow a trial-and-error pipeline that first samples a candidate and then trains it from scratch, which is inherently time-consuming because thousands of candidate architectures must be trained during the search process. Due to their superior efficiency, differentiable methods are preferred in the latest NAS work. An over-parameterized network (supernet) is built from a unified framework and the search space, on top of which the methods in the literature can be subsumed. The discrete search space is relaxed into a continuous one with the weighted summation \(o(x) = \sum \nolimits _{i=1}^{\left|\mathcal {O}\right|} c_io_i(x)\), where \(c_i \in (0,1)\) is the weight of the \(i\)-th operation \(o_i(\cdot)\) in the operation set \(\mathcal {O}\) and \(o(\cdot)\) is the mixed operation. The weight \(c_i\) is generated by a reparameterization trick \(c_i=g(\mathbf {\alpha }), \mathbf {\alpha } \in \mathbb {R}^{\left|\mathcal {O}\right|}\), where \(\mathbf {\alpha }_i\) is the corresponding supernet parameter for \(c_i\). The Softmax [38] and Gumbel-Softmax [65] tricks are widely used in existing methods. After relaxing the search space, the mixed output of the supernet and the loss can be obtained, and gradient descent can be applied to optimize the supernet since the loss is differentiable with respect to \(\mathbf {\alpha }\). In other words, the architecture design in the supernet is equivalent to learning the parameter \(\mathbf {\alpha }\) by solving the following bilevel optimization problem:
\[\begin{eqnarray*} \min _{\mathbf {\alpha }\in \mathcal {A}} && \mathcal {L}_{\text{val}} ({\bf W}^*(\mathbf {\alpha }), \mathbf {\alpha }), \\ \text{s.t.} \; {\bf W}^*(\mathbf {\alpha }) &=& \text{arg min}_{\bf W}\mathcal {L}_{\text{train}}({\bf W}, \mathbf {\alpha }), \end{eqnarray*}\]
where \(\mathcal {A}\) represents the search space, \(\mathbf {\alpha }\) represents one candidate architecture in \(\mathcal {A}\), \({\bf W}\) represents the parameters of a model from the search space, and \({\bf W}^*(\mathbf {\alpha })\) represents the corresponding operation parameters after training. As shown in Algorithm 1, the operation weights \({\bf W}\) and the supernet parameters \(\mathbf {\alpha }\) are updated iteratively based on the training loss \(\mathcal {L}_{train}\) and the validation loss \(\mathcal {L}_{val}\), respectively. After the supernet training finishes, the operation with the largest supernet parameter in each operation set is preserved, from which the searched architecture is obtained. Differentiable methods [5, 22, 38, 65, 73] have been widely used in the image classification task; the increased performance and significantly reduced search cost demonstrate their advantages over other search algorithms.
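As a self-contained sketch of this relaxation and the alternating updates in Algorithm 1 (the two-operation set, toy data, and learning rates below are placeholders, not the actual supernet of any method discussed here):

```python
import torch
import torch.nn.functional as F

# Candidate operation set O; two toy linear maps stand in for real operations.
ops = torch.nn.ModuleList([torch.nn.Linear(8, 8), torch.nn.Linear(8, 8)])
alpha = torch.zeros(len(ops), requires_grad=True)        # supernet parameters
w_opt = torch.optim.Adam(ops.parameters(), lr=1e-2)      # optimizer for operation weights W
a_opt = torch.optim.Adam([alpha], lr=3e-3)               # optimizer for alpha

def mixed_op(x):
    # o(x) = sum_i c_i * o_i(x), with c = g(alpha) (Softmax relaxation)
    c = F.softmax(alpha, dim=-1)
    return sum(ci * op(x) for ci, op in zip(c, ops))

x_tr, y_tr = torch.randn(32, 8), torch.randn(32, 8)      # toy training split
x_va, y_va = torch.randn(32, 8), torch.randn(32, 8)      # toy validation split

for step in range(100):
    # (1) update W on the training loss
    w_opt.zero_grad()
    F.mse_loss(mixed_op(x_tr), y_tr).backward()
    w_opt.step()
    # (2) update alpha on the validation loss
    a_opt.zero_grad()
    F.mse_loss(mixed_op(x_va), y_va).backward()
    a_opt.step()

print("kept operation:", int(alpha.argmax()))            # the operation with the largest weight is preserved
```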

2.3 Graph Neural Architecture Search

Very recently, researchers tried to design GNN architectures by NAS automatically. The majority of these methods focus on designing the aggregation layers in GNNs. For example, GraphNAS [17], Auto-GNN [81], AutoGM [70], DSS [37], and NAS-HAR [46] learned to design aggregation layers with several micro dimensions, such as aggregation function, attention function, attention head number, embedding size; SGAS [32] designed adaptive GNNs for a 3D object detection task; SANE [80], SNAG [79], AutoGraph [35], and F2GNN [61] provided the extra layer-wise skip connections design dimension; GNAS [6], DeepGNAS [12], and Policy-GNN [29] learned to design the depth of GNNs. For the graph classification task, a straightforward way to obtain the data-specific graph-level representations is to combine the fixed global pooling operations and the aforementioned adaptive GNNs [6, 71]. Apart from these methods, NAS-GCN [25] learns adaptive global pooling functions as well. As to the search algorithm, most of these methods use the RL-based algorithms [12, 17, 29, 79, 81] and EA [25, 35, 46] to select architectures from the search space. These methods need thousands of evaluations, which are computationally expensive. Therefore, the differentiable search algorithms are adopted in [6, 32, 37, 80] to improve search efficiency.
More graph neural architecture search methods can be found in [57]. In this article, PAS makes the first attempt to design the pooling architectures for the graph classification task with the differentiable search algorithm. Compared with existing methods in Table 1, PAS provides one search space that can cover existing global and hierarchical pooling methods, and designs one coarsening strategy to enable the usage of the gradient descent to accelerate the search phase.
Table 1.
| Category | Method | Task | Aggregation | Pooling (Se / Red / Conn) | Readout | Merge | Search Algorithm |
| GNNs | GCN [27] | Node | \(\surd\) | - | - | \(\times\) | - |
| | JK-Net [67] | Node | \(\surd\) | - | - | \(\surd\) | - |
| Global | GIN [66] | Graph | \(\surd\) | \(\times\) | \(\surd\) | \(\surd\) | - |
| | DGCNN [77] | Graph | \(\surd\) | \(\times\) | \(\surd\) | \(\surd\) | - |
| Hierarchical | Graph U-Nets [14] | Graph | \(\surd\) | \(\surd\) / \(\times\) / \(\times\) | \(\surd\) | \(\times\) | - |
| | SAGPool [31] | Graph | \(\surd\) | \(\surd\) / \(\times\) / \(\times\) | \(\surd\) | \(\surd\) | - |
| | ASAP [48] | Graph | \(\surd\) | \(\surd\) / \(\times\) / \(\surd\) | \(\surd\) | \(\times\) | - |
| | DiffPool [69] | Graph | \(\surd\) | \(\surd\) / \(\surd\) / \(\surd\) | \(\surd\) | \(\times\) | - |
| | MinCut [2] | Graph | \(\surd\) | \(\surd\) / \(\surd\) / \(\surd\) | \(\surd\) | \(\times\) | - |
| NAS | GraphNAS [17] | Node | \(\surd\) | - | - | \(\times\) | RL |
| | AutoGM [70] | Node | \(\surd\) | - | - | \(\times\) | Bayesian |
| | SANE [80] | Node | \(\surd\) | - | - | \(\surd\) | Gradient Descent |
| Global | NAS-GCN [25] | Graph | \(\surd\) | \(\times\) | \(\surd\) | \(\surd\) | EA |
| Adaptive (ours) | PAS-G | Graph | \(\surd\) | \(\surd\) / \(\times\) / \(\times\) | \(\surd\) | \(\surd\) | Gradient Descent |
| | PAS-NE | Graph | \(\surd\) | \(\surd\) / \(\surd\) / \(\surd\) | \(\surd\) | \(\surd\) | Gradient Descent |
Table 1. Comparing Existing GNN and NAS Methods with PAS
“Se,” “Red,” and “Conn” are short for Selection, Reduction, and Connection blocks, respectively. For node-level methods, we set the Pooling Module and Readout Module as “-”; the search algorithm of hand-designed methods is represented as “-”; \(\times\) indicates that the current module or block is unexplored in the architecture design.

3 The Proposed Method

In this section, we elaborate on the proposed PAS. First, we design a unified pooling framework that can cover global and hierarchical pooling methods with the help of the Pooling Module. We provide two variants of the Pooling Module to design the coarse graph in different scales. Then, a novel search space is provided based on the framework. Considering search efficiency, we design one coarsening strategy and employ the differentiable search algorithm to design architectures in this article.

3.1 Designing a Search Space by a Unified Pooling Framework

A unified framework is required to incorporate the existing methods when designing neural networks based on NAS. To design pooling architectures automatically for the graph classification task, both global and hierarchical pooling paradigms should be incorporated due to their effectiveness in different scenarios [11, 42]. Nevertheless, existing NAS methods for graph classification are defined based on the global ones that cannot capture hierarchical information.
Here, we define a unified framework that consists of four key modules for learning graph-level representations, derived from existing pooling architectures: the Aggregation, Pooling, Readout, and Merge Modules. In general, one Pooling Module is placed after each Aggregation Module in each layer, and the graph representation vector is generated with a Readout Module. Based on [67], a Merge Module is utilized to incorporate the intermediate graph representations produced by the Readout Modules. In Figure 2(a), we use a 2-layer architecture backbone as an illustrative example of the unified framework. Given the input graph \(G^0\), the Aggregation Module in the first layer updates the node embeddings and produces the graph \(G^{1a}=({\bf A}^0, {\bf H}^{1a})\). Then, the Pooling Module generates the coarse graph \(G^1=({\bf A}^1,{\bf H}^1)\). Three Readout Modules are used to capture the graph representations \({\bf z}\) in all layers, and the Merge Module generates the final graph representation \({\bf z}^F\), from which the prediction can be obtained.
Fig. 2. (a) We choose a 2-layer supernet as an illustrative example of the unified framework; on the right are the coarse-grained variant (top) and the fine-grained variant (bottom). (b) Using one selection-based method as an example, the selection block preserves the top-ranked nodes {0,1,4,5} in the coarse graph at first. Then, the reduction block updates the features of these selected nodes. The connection block reconstructs the graph structure by dropping the edges connected with the unselected nodes, on top of which the coarse graph is obtained after renumbering the nodes. (c) Graphs with different graph sizes and node sets (left) are obtained in the Pooling Module, which cannot be matched with each other. We propose one coarsening strategy to address this challenge by padding the unselected nodes.
Aggregation Module. The Aggregation Module aims to capture local information and update the node features. Existing methods provided diverse design dimensions for the aggregation operation as introduced in Section 2.3. Therefore, we provide a simple operation set in this article; more candidates can be trivially added based on the literature [17, 70, 71]. Five widely used aggregation operations are provided: GCN [27], GAT [55], GraphSAGE [21] with mean aggregator, GIN [66], and GraphConv [43], which are denoted as GCN, GAT, SAGE, GIN, and GRAPHCONV. In addition, we incorporate the operation MLP, which applies a two-layer MLP to update node embeddings without using the graph structure.
Pooling Module. We provide the Pooling Module to design adaptive pooling operations. As introduced in Section 2.1.2, existing pooling methods can be categorized as selection based and grouping based. We then provide two pooling variants built on these methods: one incorporates the selection-based pooling methods, and the other incorporates both classes. They are introduced in detail next.
Readout Module. The Readout Module is responsible for generating the global view of the given graph; for example, it can extract node statistics by averaging all of the node embeddings or capture the "skeleton" of the features by taking the maximum in each dimension. We provide seven widely used readout functions here to improve global view representation learning: three existing methods, GLOBAL_SORT [77], GLOBAL_ATT [36], and SET2SET [56]; simple global mean, max, and sum functions denoted as GLOBAL_MEAN, GLOBAL_MAX, and GLOBAL_SUM, respectively; and the ZERO operation, which generates a zero vector, indicating that the graph embeddings in this layer are not used for the final representation.
Merge Module. The intermediate layers help to formulate expressive embeddings both in human-designed methods [7, 66] and searched GNNs [35, 79, 80]. Based on these methods, in this article, we add a Readout Module in each layer and provide a Merge Module to incorporate the graph representations in these layers. Five merge functions are provided: long short-term memory (LSTM), concatenation, max, mean, and sum, which are denoted as M_LSTM, M_CONCAT, M_MAX, M_MEAN, and M_SUM in our search space.
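As a toy illustration of how per-layer readouts and the Merge Module combine the intermediate representations (the tensors below are hypothetical layer outputs, and only three of the candidate operations are shown):

```python
import torch

def global_mean(H):                 # GLOBAL_MEAN
    return H.mean(dim=0)

def global_max(H):                  # GLOBAL_MAX
    return H.max(dim=0).values

def global_sum(H):                  # GLOBAL_SUM
    return H.sum(dim=0)

# Hypothetical node-feature matrices of the input graph and the two layers.
H0, H1, H2 = torch.randn(9, 16), torch.randn(9, 16), torch.randn(4, 16)
z = [global_mean(H0), global_max(H1), global_sum(H2)]    # one readout per layer
z_concat = torch.cat(z, dim=-1)                          # M_CONCAT -> shape [48]
z_sum = torch.stack(z, dim=0).sum(dim=0)                 # M_SUM    -> shape [16]
print(z_concat.shape, z_sum.shape)
```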

3.2 Pooling Module Design

3.2.1 Coarse-Grained Pooling Module.

In the first variant, we design the Pooling Module based on the selection-based hierarchical pooling methods. The key difference among these pooling operations is the design of the score functions. This indicates that only one step needs to be designed when generating the coarse graph; this variant is denoted as \({\bf PAS-G}\) (Pooling Architecture Search in Graph scale). Then, the pooling operations in the search space can be unified by the following computation process:
\begin{equation} \begin{aligned} {\bf s}&=f_s({\bf A}, {\bf H}), idx = \texttt {TOP}_k({\bf s}), \\ G^{\prime }&=({\bf A}^{\prime }, {\bf H}^{\prime })=({\bf A}[idx,idx], ({\bf H}\odot {\bf s})[idx,:]), \end{aligned} \end{equation}
(1)
where \(f_s\) is the score function to evaluate the node importance in the graph and \({\bf s}\in \mathbb {R}^{N}\) is the corresponding node score vector. \(idx\) is the set of \(k\) top-ranked nodes selected with the function \(\texttt {TOP}_k\); these nodes are preserved in the coarse graph \(G^{\prime }\). In this coarse graph, the edges \({\bf A}^{\prime } \in \mathbb {R}^{k \times k}\) are generated by preserving the edges among the selected nodes, which can be denoted as \({\bf A}[idx,idx]\). For the node feature matrix \({\bf H}^{\prime } \in \mathbb {R}^{k \times d}\) in the coarse graph, each row represents one \(d\)-dimensional feature vector of the selected node. The features of these selected nodes can be obtained by \(({\bf H}\odot {\bf s})[idx,:],\) where \(\odot\) represents the broadcasted element-wise product.
We provide a set of score functions to estimate node importance based on node feature, structure, and local information.
Feature. Graph U-Net [14] selects nodes based on learnable feature projection values. It uses only the features and neglects the structure in the pooling operation. Based on this method, we further provide one operation in the search space that uses a 2-layer MLP to generate the node scores from the feature matrix. These two operations are denoted as TOPKPOOL and MLPPOOL, respectively.
Structure. In this article, we use the number of neighbors to estimate the structural importance of nodes. The importance score of node \(u\) can be calculated by \({\bf s}_u=\sum _{i=1}^t\sum _{v \in \widetilde{\mathcal {N}}(u)}{\bf A}^i_{uv}\), where \(t\) is the expected range. We provide 3 operations — HOPPOOL_1, HOPPOOL_2, and HOPPOOL_3 — which correspond to \(t=1, 2, 3\), respectively.
Local information. The selection-based methods in the literature, for example, SAGPool [31] and ASAP [48], calculate the node scores based on the 1-hop neighbors. Here, two score functions are derived from these two methods, which are denoted as SAGPOOL and ASAP, respectively. Considering that the PAS-G variant focuses on preserving the informative nodes in the pooling operation, we further provide two new functions. GCPOOL uses GRAPHCONV to generate the node scores, which is similar to SAGPool [31], and the node score in GAPPOOL can be calculated by \({\bf s}_u=\frac{1}{2}{\bf W}_{\text{GAP}} \sum _{v \in \widetilde{\mathcal {N}}(u)}({\bf h}_u-{\bf h}_v)^2\), where \({\bf W}_{\text{GAP}}\) is the learnable parameter in this operation.
By designing the score functions, diverse pooling operations can be obtained from Equation (1), and more functions can be added trivially. Apart from these pooling operations, we also add the IDENTITY operation to incorporate the global pooling methods. In the \(l\)-th layer, this operation is defined by \({\bf s}={\bf 1}^{n_l}, idx = \mathcal {V}^l\), where \({\bf 1}^{n_l}\) is a vector of size \(n_l\) with all components being 1. In this way, we can search for hierarchical and global methods adaptively with PAS-G, which is more flexible than existing methods.
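As a minimal illustration of Equation (1), the following sketch uses one hypothetical score function (a sigmoid-gated linear projection, loosely in the spirit of TOPKPOOL) and forms the coarse graph from the top-\(k\) nodes; dense random inputs are used only for brevity.

```python
import torch

def selection_pooling(A, H, ratio=0.5):
    """One selection-based pooling step following Equation (1):
    s = f_s(A, H), idx = TOP_k(s), A' = A[idx, idx], H' = (H * s)[idx, :]."""
    N, d = H.shape
    w = torch.randn(d, 1)                        # hypothetical score projection (stands in for f_s)
    s = torch.sigmoid(H @ w).squeeze(-1)         # node scores, shape [N]
    k = max(1, int(ratio * N))
    idx = torch.topk(s, k).indices               # top-k informative nodes
    H_coarse = (H * s.unsqueeze(-1))[idx]        # gate the features by the scores, then slice
    A_coarse = A[idx][:, idx]                    # keep only edges among the selected nodes
    return A_coarse, H_coarse, idx

A = (torch.rand(9, 9) > 0.7).float()
A = ((A + A.t()) > 0).float()                    # toy symmetric adjacency matrix
H = torch.randn(9, 16)
A1, H1, idx = selection_pooling(A, H)
print(A1.shape, H1.shape, idx.tolist())          # [4, 4], [4, 16], indices of the 4 selected nodes
```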

3.2.2 Fine-Grained Pooling Module.

Although PAS-G can unify the global and hierarchical pooling methods, it is built on the selection-based approaches that focus on preparing the most informative nodes and formulating a coarse graph by dropping the unselected nodes and edges connected with them, which may lead to information loss to some extent.
Taking this drawback into consideration, we further incorporate the grouping-based pooling operations in this module, which have advantages in utilizing the features of unselected nodes and reconstructing the adjacency matrix based on all of the input edges. They can be unified by the following process:
\begin{equation} \begin{aligned} {\bf B}&=f_b({\bf A}, {\bf H}),\\ G^{\prime }&=({\bf A}^{\prime }, {\bf H}^{\prime })=({\bf B}^T{\bf A}{\bf B},{\bf B}^T{\bf H}), \end{aligned} \end{equation}
(2)
where the assignment function \(f_b\) calculates the probabilistic assignment of the nodes to the \(n_{l+1}\) clusters in the \(l\)-th layer and \({\bf B}\in \mathbb {R}^{n_l \times n_{l+1}}\) is the corresponding assignment matrix.
To enable the search space design, we need to unify the selection-based and grouping-based pooling operations in this module. First, we review the computation processes in Equations (1) and (2) and summarize three general steps in designing the coarse graph: preparing the nodes in the coarse graph, updating the node features, and reconstructing the adjacency matrix. Graphical illustrations are provided in Figure 2(b). These steps are denoted as selection, reduction, and connection following SRCPool [20]. Then, we decompose the Pooling Module into three blocks following these steps to unify the two classes of pooling operations. As shown in Table 2, a straightforward way to construct the search space is to include the processes of each block. However, arbitrarily combining the operations of the three blocks does not always yield a valid new pooling operation in this search space. For example, the combination \(\lbrace idx\); \({\bf H}^{\prime }={\bf B}^T{\bf H}\); \({\bf A}^{\prime }={\bf B}^T{\bf A}{\bf B}\rbrace\) is meaningless since \({\bf B}\) is not generated in the selection block and thus cannot be used in the following two steps. Therefore, this naive search space cannot be used to design pooling operations. We observe that the problem is caused by the mismatched node preparation mechanisms. Specifically, in the grouping-based methods, nodes are grouped into clusters that are treated as the new nodes \(\mathcal {V}_{group}\) in the coarse graph, whereas in the selection-based methods, the nodes \(\mathcal {V}_{se}\) are derived from the input graph. As shown in Figure 1(b), we can observe \(\mathcal {V}_{se} \subseteq \mathcal {V}\) and \(\mathcal {V}_{group} \cap \mathcal {V}_{se} = \phi\), where \(\mathcal {V}\) is the node set of the input graph, and we cannot map \(\mathcal {V}_{se}\) to \(\mathcal {V}_{group}\) in the node preparation mechanisms \({\bf s}\) and \({\bf B}\). In other words, the two kinds of pooling operations have mismatched node preparation mechanisms and cannot be integrated in the selection block; the following two blocks are affected as well.
Table 2.
| | Selection | Reduction | Connection |
| Selection-based pooling | \({\bf s}, idx = f({\bf s})\) | \({\bf H}^{\prime }=({\bf H}\odot {\bf s})[idx,:]\) | \({\bf A}^{\prime }={\bf A}[idx,idx]\) |
| Grouping-based pooling | \({\bf B}\) | \({\bf H}^{\prime }={\bf B}^T{\bf H}\) | \({\bf A}^{\prime }={\bf B}^T{\bf A}{\bf B}\) |
| PAS-NE | \({\bf s}, idx = f({\bf s})\) | \({\bf H}^{\prime }=({\bf H}\odot {\bf s})[idx,:]\); \({\bf H}^{\prime }={\bf B}^T{\bf H}, {\bf B}=f({\bf A}, {\bf H}, {\bf s}, idx)\) | \({\bf A}^{\prime } = {\bf A}[idx, idx]\); \({\bf A}^{\prime }={\bf B}^T{\bf A}{\bf B}, {\bf B}=f({\bf A}, {\bf H}, {\bf s}, idx)\) |
Table 2. The Comparisons of the Computation Process in Different Pooling Methods
PAS-NE can incorporate two classes of pooling operations by decomposing the Pooling Module and modifying the computation process of the grouping-based methods.
We address this problem by modifying the computation process of the grouping-based methods. As shown in the following, the top-ranked nodes \(idx\) are treated as cluster heads and are preserved in the coarse graph. The assignment matrix \({\bf B}\) is moved to the following two steps and is treated as one preprocessing procedure to update the features and reconstruct the adjacency matrix. \(f_H\) and \(f_A\) represent the assignment function for the feature and adjacency matrix, respectively.
\[\begin{eqnarray*} {\bf s}&=&f_s({\bf A}, {\bf H}), \quad idx = \texttt {TOP}_k({\bf s}), \\ {\bf H}^{\prime } &=& {\bf B}_H^T{\bf H}, \quad {\bf B}_H = f_H({\bf A}, {\bf H}, {\bf s}, idx), \\ {\bf A}^{\prime } &=& {\bf B}_A^T{\bf A}{\bf B}_A, \quad {\bf B}_A = f_A({\bf A}, {\bf H}, {\bf s}, idx). \end{eqnarray*}\]
By doing this, the search space can be constructed following the computation process in Table 2. The two classes of pooling operations share the same selection block; then, the coarse graph can be obtained by designing the reduction and connection operations to update the features and the adjacency matrix. For example, the combination \(\lbrace {\bf s}, idx\); \({\bf H}^{\prime }={\bf B}^T{\bf H}\); \({\bf A}^{\prime }={\bf B}^T{\bf A}{\bf B}\rbrace\) is transformed into \(\lbrace {\bf s}, idx;{\bf H}^{\prime } = {\bf B}_H^T{\bf H}, {\bf B}_H;{\bf A}^{\prime }= {\bf B}_A^T{\bf A}{\bf B}_A, {\bf B}_A\rbrace\) with Table 2, from which the coarse graph can be obtained. Compared with PAS-G, this fine-grained pooling variant improves the coarse graph design in the node and edge scales with the help of the reduction and connection operations, and is called PAS-NE here. We provide a set of operations for the three blocks in the Pooling Module.
Selection Block. This block prepares the nodes in the coarse graph by evaluating the node importance. Therefore, we reuse the score functions in PAS-G.
Reduction Block. Based on the nodes generated in the Selection Block, the Reduction Block is responsible for updating the feature matrix \({\bf H}^{\prime }\) in the coarse graph \(G^{\prime }=({\bf H}^{\prime }, {\bf A}^{\prime })\). As shown in
\begin{align} {\bf H}^{\prime }&=({\bf H}\odot {\bf s})[idx, :], \end{align}
(3)
\begin{align} {\bf H}^{\prime } &= (\texttt {GRAPHCONV}({\bf A}, {\bf H})\odot {\bf s})[idx, :], \end{align}
(4)
\begin{align} {\bf H}^{\prime } &={\bf B}_H^T{\bf H},\;{\bf B}_H=\text{SOFTMAX}(\sigma ({\bf H}{\bf W}_H, ({\bf H}\odot {\bf s})[idx,:])), \end{align}
(5)
we provide three reduction operations in this block, which are denoted as RED_DROP, RED_AGG, and RED_CLUSTER, respectively. The selection-based methods adopt the operation RED_DROP and preserve the most informative nodes in the coarse graph. Following GSAPool [74], which uses an extra aggregation operation to utilize the unselected features, we provide one strategy RED_AGG and use the GRAPHCONV aggregation operation here as shown in Equation (4). Considering the grouping-based pooling methods, one operation RED_CLUSTER is proposed to group nodes based on the feature similarity. As shown in Equation (5), \(\text{SOFTMAX}(\cdot)\) is the row-wise Softmax function, \(\sigma (\cdot)\) is the cosine similarity function, and \({\bf W}_H\) is the learnable parameter in this operation.
Connection Block. For the selection-based pooling operations in the literature, the edges among the selected nodes are preserved, that is, \({\bf A}^{\prime } = {\bf A}[idx,idx]\), whereas for the grouping-based ones, based on the designed assignment matrix, the new adjacency matrix can be obtained by
\begin{align} {\bf A}^{\prime } = {\bf B}_A^T{\bf A}{\bf B}_A, \;{\bf B}_A=\text{SOFTMAX}(\sigma ({\bf H}{\bf W}_A, ({\bf H}\odot {\bf s})[idx,:])), \end{align}
(6)
where \({\bf W}_A\) is the corresponding learnable parameter matrix in this operation. These two operations are denoted as CONN_DROP and CONN_CLUSTER, respectively.
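The cluster-style reduction and connection in Equations (5) and (6) can be sketched as follows, assuming dense matrices; the random scores, the pooling size \(k\), and the matrices standing in for \({\bf W}_H\) and \({\bf W}_A\) are placeholders for the learned quantities.

```python
import torch
import torch.nn.functional as F

def cluster_assignment(H, s, idx, W):
    """B = row-wise softmax of cos( H W, (H * s)[idx, :] ), shape [n_l, k]."""
    heads = (H * s.unsqueeze(-1))[idx]                   # cluster heads from the Selection Block, [k, d]
    x = F.normalize(H @ W, dim=-1)                       # [n_l, d]
    y = F.normalize(heads, dim=-1)                       # [k, d]
    return torch.softmax(x @ y.t(), dim=-1)              # cosine similarity + row-wise softmax

N, d, k = 9, 16, 4
A = (torch.rand(N, N) > 0.7).float()
H = torch.randn(N, d)
s = torch.sigmoid(torch.randn(N))                        # node scores from a selection operation
idx = torch.topk(s, k).indices
W_H, W_A = torch.randn(d, d), torch.randn(d, d)          # stand-ins for the learnable parameters

B_H = cluster_assignment(H, s, idx, W_H)
B_A = cluster_assignment(H, s, idx, W_A)
H_coarse = B_H.t() @ H                                   # RED_CLUSTER, Equation (5)
A_coarse = B_A.t() @ A @ B_A                             # CONN_CLUSTER, Equation (6)
print(H_coarse.shape, A_coarse.shape)                    # [4, 16], [4, 4]
```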

3.2.3 Summary and Discussion.

To automatically design the pooling architectures for graph classification, we propose a unified pooling framework and provide a set of candidate operations for each module and block, as shown in Table 3. For the Pooling Module, the coarse-grained variant PAS-G is built on the selection-based pooling operations and designs the coarse graph in the graph scale. The fine-grained variant PAS-NE designs the coarse graph in the node and edge scales. This variant can incorporate both the selection-based and the grouping-based pooling operations; that is, the newly proposed PAS-NE subsumes the PAS-G variant. By designing a proper search space for these two variants, global and hierarchical methods can be unified. For example, in PAS-NE, the pooling operation in Graph U-Net [14] can be represented as \(\lbrace \texttt {TOPKPOOL}; \texttt {RED_DROP}; \texttt {CONN_DROP}\rbrace\) and DiffPool [69] can be simulated as \(\lbrace \texttt {SAGPOOL}; \texttt {RED_CLUSTER}; \texttt {CONN_CLUSTER}\rbrace\). For the global pooling methods, \(\lbrace \texttt {IDENTITY}; \texttt {RED_DROP}; \texttt {CONN_DROP}\rbrace\) represents that the Pooling Module is skipped and \(G^l=G^{la}\) is obtained. Moreover, the proposed two variants can generate data-specific pooling architectures beyond existing human-designed ones (see Figure 3).
Table 3.
| | PAS-G | PAS-NE |
| Aggregation \(\mathcal {O}_a\) | GCN, GAT, SAGE, GIN, GRAPHCONV, MLP | same as PAS-G |
| Readout \(\mathcal {O}_r\) | GLOBAL_SORT, GLOBAL_ATT, SET2SET, GLOBAL_MEAN, GLOBAL_MAX, GLOBAL_SUM, ZERO | same as PAS-G |
| Merge \(\mathcal {O}_m\) | M_LSTM, M_CONCAT, M_MAX, M_MEAN, M_SUM | same as PAS-G |
| Pooling: Selection \(\mathcal {O}_{se}\) | TOPKPOOL, MLPPOOL, HOPPOOL_1, HOPPOOL_2, HOPPOOL_3, SAGPOOL, ASAP, GCPOOL, GAPPOOL, IDENTITY | same as PAS-G |
| Pooling: Reduction \(\mathcal {O}_{red}\) | RED_DROP | RED_DROP, RED_AGG, RED_CLUSTER |
| Pooling: Connection \(\mathcal {O}_{conn}\) | CONN_DROP | CONN_DROP, CONN_CLUSTER |
Table 3. The Operations used in the Search Space
In PAS-NE, the Pooling Module is decomposed into the Selection, Reduction, and Connection blocks. PAS-G focuses on designing the score functions, which is equivalent to designing the selection operations based on fixed reduction and connection operations.
Fig. 3. The searched architectures on all datasets (Left: PAS-G; Right: PAS-NE). These architectures vary across datasets. In particular, for the PAS-G architecture on COX2, the first pooling operation is IDENTITY, which means the model is reduced from a hierarchical pooling method to a global pooling method. Besides, the architectures searched by PAS-NE tend to explore the unselected parts, in either the node or the edge scale.
Compared with DiffPool, the representative grouping-based method, the two assignment matrices \(\lbrace {\bf B}_A, {\bf B}_H\rbrace \in \mathbb {R}^{n_l \times n_{l+1}}\) in PAS-NE are designed based on node similarity, which decouples the cluster number from the feature dimension. For the \(l\)-th pooling operation in DiffPool, the assignment matrix is calculated as
\[\begin{eqnarray*} {\bf B}^l &=& \texttt {softmax}(\texttt {GNN}_{l, pool}({\bf A}^{l-1}, {\bf H}^{l-1}))=\texttt {softmax}(M({\bf A}^{l-1}, {\bf H}^{l-1};{\bf W}^l)), \end{eqnarray*}\]
where \({\bf B}^l \in \mathbb {R}^{n_l \times n_{l+1}}\) is the assignment matrix in this layer, \(M\) is the propagation function, \({\bf W}^l \in \mathbb {R}^{d \times n_{l+1}}\) are the learnable parameters, and \(d\) is the feature dimension. The learnable parameters are shared among all graphs. In other words, graphs with different node numbers \(n_0\) will have the same graph size after the pooling operation, which is problematic for small graphs whose size may even become larger after pooling. The same issue exists in the MinCut method. In PAS-NE, the learnable parameters \({\bf W}_H\) and \({\bf W}_A\) in Equations (5) and (6) have the dimension \(\mathbb {R}^{d \times d}\), which is independent of the graph size \(n_{l+1}\). This indicates that the PAS-NE variant decouples the cluster number from the feature dimension with the node-similarity assignment function.
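The shape argument can be made concrete with a small sketch (toy dimensions and a hypothetical pooling ratio of 0.5; the similarity computation itself is omitted):

```python
import torch

d = 16

# DiffPool-style assignment: the learnable parameter has shape [d, n_{l+1}],
# so every graph is pooled to the same number of clusters, regardless of its size.
n_next = 10
W_diff = torch.randn(d, n_next)
for n in (6, 40):
    B = torch.softmax(torch.randn(n, d) @ W_diff, dim=-1)       # [n, n_next]
    print("DiffPool-style:", n, "nodes ->", B.shape[1], "clusters")

# PAS-NE-style assignment: the learnable parameters have shape [d, d]; the cluster
# number follows the per-graph top-k selection, so it adapts to each input graph.
W_pas = torch.randn(d, d)
print("PAS-NE parameter shape:", tuple(W_pas.shape))             # independent of n_{l+1}
for n in (6, 40):
    k = max(1, int(0.5 * n))                                     # hypothetical pooling ratio 0.5
    print("PAS-NE-style: ", n, "nodes ->", k, "clusters")
```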
As shown in Figure 2(a), the example 2-layer architecture backbone has 2 Aggregation Modules, 2 Pooling Modules, 3 Readout Modules, and 1 Merge Module that need to be designed. Combined with the candidate operations shown in Table 3, the search space size of the PAS-G variant is \(6^2 \times 10^2 \times 7^3 \times 5 \approx 6.2\times 10^6\), and that of the PAS-NE variant is \(6^2 \times (10\times 3\times 2)^2 \times 7^3 \times 5 \approx 2.2\times 10^8\). More operations can be trivially added to the search space if the computational budget allows, for example, different score functions \(f_s(\cdot)\) and similarity functions \(\sigma (\cdot)\) in Equations (5) and (6). This also means that an efficient search method is needed over such a large search space.
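For reference, the two counts can be reproduced directly from the operation numbers in Table 3:

```python
# Search space size of the 2-layer backbone in Figure 2(a).
agg, se, red, conn, readout, merge = 6, 10, 3, 2, 7, 5
pas_g = agg**2 * se**2 * readout**3 * merge                      # = 6,174,000   (about 6.2e6)
pas_ne = agg**2 * (se * red * conn)**2 * readout**3 * merge      # = 222,264,000 (about 2.2e8)
print(f"PAS-G: {pas_g:.1e}, PAS-NE: {pas_ne:.1e}")
```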

3.3 Enabling Differentiable Architecture Search

Because of its efficiency and effectiveness, the differentiable search algorithm is employed in this article. Technically speaking, as done in existing NAS works [38, 65, 80], the discrete selection of operations is relaxed into a continuous one with a weighted summation over all possible operations, \(o(x)= \sum \nolimits _{i=1}^{\left|\mathcal {O}\right|} c_io_i(x)=\sum \nolimits _{i=1}^{\left|\mathcal {O}\right|} g(\mathbf {\alpha })o_i(x)\). For the Aggregation, Readout, and Merge Modules in the supernet, we can generate mixed results by constraining the node embedding dimension as shown in
\begin{align} {\bf H}^{la}&=\sum \limits _{i=1}^{\left|\mathcal {O}_a\right|} c_i^{l,a}o_i({\bf A}^{l-1},{\bf H}^{l-1}), \end{align}
(7)
\begin{align} {\bf z}^l&=\sum \limits _{i=1}^{\left|\mathcal {O}_r\right|} c_i^{l,r}o_i({\bf A}^l, {\bf H}^l), \end{align}
(8)
\begin{align} {\bf z}^F &=\sum \limits _{i=1}^{\left|\mathcal {O}_m\right|}c_i^{m} o_i({\bf z}^0,{\bf z}^1, \ldots ,{\bf z}^L), \end{align}
(9)
where \(c_i^{l,a}\) and \(c_i^{l,r}\) denote the weights of the \(i\)-th aggregation operation and readout operation in the \(l\)-th layer, respectively, and \(c_i^m\) denotes the weight of the \(i\)-th operation in the Merge Module. However, it is infeasible to directly compute the weighted summation of the outputs of all operations in the Pooling Module due to the mismatch problem. As shown in Figure 2(c), three coarse graphs generated by different operations are given on the left. They cannot match each other considering (a) different graph sizes: the coarse graphs generated by the pooling operations have four nodes, whereas the global pooling methods update the node features and keep all nine nodes of the graph; and (b) different node sets: compared with the input graph, \(\lbrace 0,1,4,5\rbrace\) and \(\lbrace 0,4,5,8\rbrace\) are preserved by the two pooling operations. They have the same coarse graph size, but the feature matrices \({\bf H}^{\prime } \in \mathbb {R}^{4\times d}\) cannot be directly relaxed with the relaxation function due to the node mismatch. These two aspects should be considered when relaxing the pooling operations. To address these challenges in employing the differentiable search algorithm, we design one coarsening strategy to relax the pooling operations.

3.3.1 Coarsening Strategy in PAS-G.

The coarse-grained variant PAS-G focuses on designing the coarse graph in the graph scale. The coarse graphs produced by different score functions have different sizes and mismatched node sets. In this coarsening strategy, we first calculate the node score vector \({\bf s}\) and select the top-\(k\) nodes \(idx\) for each pooling operation. Rather than formulating the coarse graphs by removing the unselected nodes and the edges connected with them, as shown in Figure 2(c), we generate the result \(G_i^l=({\bf A}^l_i,{\bf H}^l_i)\) of the \(i\)-th pooling operation by making the coarse graph keep the same "shape," that is, the same number of nodes as the input graph \(G^{la}\). In this way, the results of the global and hierarchical methods will have the same shape, and different coarse graphs can be matched with each other. This is achieved by setting the unselected "part" to 0, that is, the features of unselected nodes and the weights of unselected edges. Then, the different pooling results \(G^l_i\) can be added to generate the mixed result of the Pooling Module, which is shown in the following:
\begin{align} G^l = (\mathbf {A}^l,\mathbf {H}^l) = \left(\sum \limits _{i=1}^{\left|\mathcal {O}_{se}\right|}c_i^{l,se}\mathbf {A}_i^l, \sum \nolimits _{i=1}^{\left|\mathcal {O}_{se}\right|}c_i^{l,se}\mathbf {H}_i^{l}\right), \end{align}
(10)
where \(c_i^{l,se}\) denotes the weight of the \(i\)-th selection operation in the \(l\)-th layer. With the designed coarsening strategy, we can relax the Pooling Module into continuous and allow the usage of gradient descent. Therefore, it is easy to generate the graph representation \({\bf z}^F\) following the procedure in Figure 2(a).
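A rough sketch of this strategy under simplifying assumptions (random scores stand in for the candidate score functions, and dense matrices are used): each candidate keeps the input shape, zeros out the unselected part, and the results are mixed as in Equation (10).

```python
import torch

def padded_pooling(A, H, s, idx):
    """Keep the input 'shape': zero out the features of unselected nodes and the
    weights of edges touching them, instead of removing them."""
    N = H.size(0)
    m = torch.zeros(N)
    m[idx] = 1.0                                         # node mask
    H_pad = H * s.unsqueeze(-1) * m.unsqueeze(-1)        # gated, padded features, still [N, d]
    A_pad = A * m.unsqueeze(0) * m.unsqueeze(1)          # padded adjacency, still [N, N]
    return A_pad, H_pad

N, d = 9, 16
A = (torch.rand(N, N) > 0.7).float()
H = torch.randn(N, d)
c = torch.softmax(torch.randn(3), dim=-1)                # weights of three candidate selection ops
A_mix, H_mix = torch.zeros(N, N), torch.zeros(N, d)
for i in range(3):                                       # each candidate has its own scores / top-k set
    s = torch.sigmoid(torch.randn(N))
    idx = torch.topk(s, 4).indices
    A_i, H_i = padded_pooling(A, H, s, idx)
    A_mix = A_mix + c[i] * A_i                           # Equation (10): weighted sum of padded graphs
    H_mix = H_mix + c[i] * H_i
print(A_mix.shape, H_mix.shape)                          # same shape as the input graph
```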

3.3.2 Coarsening Strategy in PAS-NE.

Compared with PAS-G, the fine-grained variant PAS-NE decomposes the Pooling Module into three blocks. In this variant, the Selection Block is responsible for selecting the nodes of the coarse graph, and different node sets \(idx\) are preserved by different score functions. This is the source of the node mismatch problem and affects the following two blocks along the computational procedure. Following PAS-G, this problem can be addressed by applying the node scale coarsening strategy, that is, setting the features of unselected nodes to "0." We use a mask \({\bf m} \in \lbrace 0,1\rbrace ^{N}\) to represent the node selection results, where 1 indicates that the node is selected and 0 otherwise. The mixed results of the Selection Block can be obtained by
\begin{align} {\bf s}^{l}&=\sum \limits _{i=1}^{\left|\mathcal {O}_{se}\right|} c_i^{l,se}o_i({\bf A}^{l-1},{\bf H}^{la}), \end{align}
(11)
\begin{align} {\bf m}^l &=\sum \limits _{i=1}^{\left|\mathcal {O}_{se}\right|} c_i^{l,se}{\bf m}^l_i, \end{align}
(12)
where \({\bf m}^l_i\) is the mask of the \(i\)-th selection operation in the \(l\)-th layer and \({\bf m}^l\) is the mixed mask result.
Based on the mixed mask result \({\bf m}^l\), the reduction operations RED_DROP and RED_AGG can be relaxed by setting the features of unselected nodes to "0," that is, replacing \(({\bf H}\odot {\bf s})[idx,:]\) in Equations (3) and (4) with \({\bf H}\odot {\bf s}\odot {\bf m}^l\) directly in the \(l\)-th layer. Nevertheless, this is infeasible for RED_CLUSTER in Equation (5), which calculates the cosine similarity between the nodes and the selected ones in the assignment matrix \({\bf B}_H \in \mathbb {R}^{n_l \times n_{l+1}}\) and then assigns nodes to the selected ones. The cosine similarity cannot be calculated on the zeroed features, so this relaxation cannot be applied directly. Because of this, we use a masked row-wise Softmax function instead. As shown in
\begin{equation} {\bf B}_{H,uv}^{l}=\left\lbrace \begin{aligned}&0 & {\bf m}_v^l =0 \\ &{{\bf m}_v^l \times e^{\sigma _{uv}}}/{\sum _{k=0}^{n_l} {\bf m}_k^l \times e^{\sigma _{uk}}} & \text{otherwise} \end{aligned} \right., \end{equation}
(13)
\({\bf B}_H^{l} \in \mathbb {R}^{n_l \times n_l}\) is the feature assignment matrix in the \(l\)-th layer, \({\bf m}_v^l\) is the mask of node \(v\), and \(\sigma _{uv}\) is the cosine similarity of nodes \(u\) and \(v\) calculated in Equation (5). With this function, nodes will be assigned to the selected ones rather than the unselected ones, and the feature matrix in Equation (5) will be matched with the other two reduction operations.
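A small sketch of the masked row-wise Softmax in Equation (13), assuming the cosine similarities \(\sigma _{uv}\) have already been computed; each row renormalizes over the selected nodes only.

```python
import torch

def masked_row_softmax(sim, m):
    """B[u, v] = m_v * exp(sim[u, v]) / sum_k m_k * exp(sim[u, k]); zero where m_v = 0."""
    weights = m.unsqueeze(0) * torch.exp(sim)            # zero out the unselected columns
    return weights / weights.sum(dim=-1, keepdim=True)

sim = torch.randn(6, 6)                                  # precomputed cosine similarities sigma_uv
m = torch.tensor([1., 0., 1., 1., 0., 1.])               # node mask from the Selection Block
B = masked_row_softmax(sim, m)
print(B[:, m == 0].abs().sum())                          # unselected nodes receive zero mass
print(B.sum(dim=-1))                                     # each row still sums to 1
```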
The mixed results of the Reduction Block can be represented as
\begin{equation} \begin{aligned} {\bf H}^l &= \sum \limits _{i=1}^{\left|\mathcal {O}_{red}\right|} c_i^{l,red} o_i({\bf A}^{l-1},{\bf H}^{la}) \\ &=c_1^{l, red}({\bf H}^{la}\odot {\bf s}^l\odot {\bf m}^l) + c_2^{l, red}(\texttt {GRAPHCONV}({\bf A}^{l-1}, {\bf H}^{la})\odot {\bf s}^l\odot {\bf m}^l) + c_3^{l,red}(({\bf B}_H^{l})^T{\bf H}^{la}). \end{aligned} \end{equation}
(14)
Three terms represent the weighted results of RED_DROP, RED_AGG, and RED_CLUSTER, respectively. \(c_i^{l,red}\) is the weight of the \(i\)-th reduction operation in layer \(l\), and the equations of the three operations are provided in Equations (3) to (5). Similar to the Reduction Block, the edge scale coarsening strategy is employed, which sets the weights of unselected edges to "0." Then, the mixed results of the Connection Block in layer \(l\) can be represented as
\begin{equation} \begin{aligned} {\bf A}^l &= \sum \limits _{i=1}^{\left|\mathcal {O}_{conn}\right|} c_i^{l,conn} o_i({\bf A}^{l-1},{\bf H}^{la}) \\ &=c_1^{l,conn}(({\bf m}^l)^T\odot {\bf A}^{l-1}\odot {\bf m}^l) +c_2^{l,conn}(({\bf B}_A^{l})^T{\bf A}^{l-1} {\bf B}_A^{l}). \end{aligned} \end{equation}
(15)
Two terms represent the weighted results of CONN_DROP and CONN_CLUSTER, respectively. \(c_i^{l,conn}\) is the weight of the \(i\)-th connection operation in this block, and \({\bf B}_A^{l}\) is the corresponding adjacency assignment matrix, which is calculated with Equation (13).
In this variant, we observe the mismatched node preparation mechanisms between the two kinds of pooling methods and address this problem by modifying the search space; these pooling operations can therefore be incorporated into the designed search space. As to the mismatched graph sizes and node sets, we address these challenges by applying the node and edge scale coarsening strategies. We then provide the masked assignment matrix to relax the grouping-based operations, that is, RED_CLUSTER and CONN_CLUSTER. In this way, the global and the two hierarchical pooling methods can be relaxed and matched in the Pooling Module. With the designed coarsening strategy, we can employ the differentiable search algorithm, and it is easy to generate the graph representation \({\bf z}^F\) as in Figure 2(a).

3.3.3 Optimization and Deriving Process.

In this article, we choose Gumbel-Softmax [24] as the continuous relaxation function, which is designed to approximate discrete distributions in a differentiable manner and has shown its effectiveness in supernet training [9, 65]. As shown in
\begin{equation} c_i = g(\mathbf {\alpha })=\frac{\text{exp}((\text{log}\mathbf {\alpha _i} + {\bf G}_i)/\lambda)}{\sum _{j=1}^{|\mathcal {O}|} \text{exp}((\text{log}\mathbf {\alpha _j} + {\bf G}_j)/\lambda)}, \end{equation}
(16)
\(\mathbf {\alpha _i}\) is the corresponding supernet parameter for \(c_i\), \({\bf G}_i = -\text{log}(-\text{log}({\bf U}_i))\) is the Gumbel random variable, \({\bf U}_i\) is a uniform random variable, and \(\lambda\) is the temperature of the Softmax. Based on this relaxation function, the mixed results of each module and each block can be calculated along the computational procedure shown in Lines 5 to 13. Finally, we optimize the expected performance of the architectures sampled from \(p({\bf C})\) as shown in
\begin{align} \min \nolimits _{\mathbf {\alpha }} & \mathbb {E}_{{\bf C} \sim p(\mathbf {\alpha })} [\mathcal {L}_{val}({\bf C}, {\bf W}^*)], \end{align}
(17)
\begin{align} \text{s.t.} &{\bf W}^* = \text{arg min}_{{\bf W}} \mathbb {E}_{{\bf C} \sim p(\mathbf {\alpha })} [\mathcal {L}_{train}({\bf C}, {\bf W})]. \end{align}
(18)
The detailed optimization stage is shown in Algorithm 2. Since the training loss \(\mathcal {L}_{train}\), validation loss \(\mathcal {L}_{val}\), and the relaxation function are differentiable, we can optimize the parameters \(\mathbf {\alpha }\) and \({\bf W}\) with gradient descent. After finishing the search process, we obtain the searched architecture by preserving the operations with the largest weights in each module and block.
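As a reference for how the relaxation and the bilevel objective fit together, the following sketch shows Gumbel-Softmax weights (with \(\text{log}\,\mathbf{\alpha}\) stored directly as the learnable parameter) and one epoch of alternating gradient updates of \(\mathbf{\alpha}\) and \({\bf W}\). The `supernet(batch, weights)` interface is a simplifying assumption and not the exact form of Algorithm 2.

```python
# A minimal sketch of Equations (16)-(18): Gumbel-Softmax relaxation plus alternating
# updates of alpha (validation loss) and W (training loss).
import torch
import torch.nn.functional as F

def gumbel_softmax(log_alpha: torch.Tensor, temperature: float = 1.0) -> torch.Tensor:
    # Equation (16), with log(alpha) stored directly as the supernet parameter.
    u = torch.rand_like(log_alpha).clamp_min(1e-20)
    gumbel = -torch.log(-torch.log(u))
    return F.softmax((log_alpha + gumbel) / temperature, dim=-1)

def search_epoch(supernet, log_alpha, w_opt, a_opt, train_loader, val_loader, criterion):
    for train_batch, val_batch in zip(train_loader, val_loader):
        # Update alpha on the validation loss (Equation (17)).
        a_opt.zero_grad()
        criterion(supernet(val_batch, gumbel_softmax(log_alpha)), val_batch.y).backward()
        a_opt.step()
        # Update W on the training loss (Equation (18)).
        w_opt.zero_grad()
        criterion(supernet(train_batch, gumbel_softmax(log_alpha)), train_batch.y).backward()
        w_opt.step()

# Deriving step: keep the operation with the largest weight in each module/block,
# e.g., [int(a.argmax()) for a in per_module_log_alpha].
```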
Compared with Algorithm 1, the optimization stage in Algorithm 2 can be treated as its implementation for the pooling architecture design. We observe that the node-set mismatch problem blocks gradient-based optimization, and we address it with the designed coarsening strategy, which pads the unselected nodes and keeps the node set unchanged.

4 Experiments

4.1 Experimental Settings

4.1.1 Datasets.

In this article, we use six datasets as shown in Table 4.
Table 4.
Dataset | # of Graphs | # of Features | # of Classes | Avg. # of Nodes | Avg. # of Edges | Domain
D&D | 1,178 | 89 | 2 | 384.3 | 715.7 | Bioinfo
PROTEINS | 1,113 | 3 | 2 | 39.1 | 72.8 | Bioinfo
IMDBM | 1,500 | 0 | 3 | 13 | 65.9 | Social
COX2 | 467 | 3 | 2 | 41.2 | 43.5 | Chemistry
MR | 10,654 | 300 | 2 | 18.5 | 73.5 | Text
ogbg | 41,127 | 2 | 2 | 25.5 | 27.5 | Molecule
Table 4. Statistics of the Datasets from Five Domains
IMDBM and ogbg are short for IMDB-MULTI and ogbg-molhiv, respectively.
The D&D and PROTEINS datasets, provided by [8], are both protein graphs. In the D&D dataset, nodes represent amino acids, and two nodes are connected if they are less than 6 Angstroms apart. In the PROTEINS dataset, nodes are secondary structure elements, and edges indicate that two elements are neighbors in the amino-acid sequence or close in 3D space. The IMDB-MULTI dataset (denoted as IMDBM in this article), provided by [68], is a movie-collaboration dataset that contains the actor/actress and genre information from IMDB. Nodes represent actors/actresses, and edges indicate that they appear in the same movie. The COX2 dataset, provided by [53], is a set of 467 cyclooxygenase-2 (COX-2) inhibitors that are classified as active or inactive according to their in vitro activity against the human recombinant enzyme. The MR dataset, provided by [78], is a movie review dataset for binary sentiment classification. The corpus has 5,331 positive and 5,331 negative reviews, and each review contains only one sentence. Each graph is a review, in which nodes represent unique words and edges represent the co-occurrences of word pairs, that is, words that occur within a fixed-size sliding window (length three by default). The initial node features are the word embeddings, which are preprocessed in a standard way, including tokenization and stopword removal [3, 51]. The ogbg-molhiv dataset (denoted as ogbg in this article) is a molecular property prediction dataset derived from MoleculeNet [64]. Each graph represents a molecule in which nodes are atoms and edges are chemical bonds. The molecular properties are cast into binary labels, for example, whether a molecule inhibits HIV virus replication or not. The ROC-AUC evaluation metric is adopted on this dataset.
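As an illustration of the MR graph construction described above, the sketch below turns a tokenized review into a PyG graph with unique words as nodes and edges between words that co-occur within a window of three. The embedding lookup and function name are assumptions for the example, not the preprocessing code used in this article.

```python
# A minimal sketch of the described MR construction: unique words as nodes, edges between
# words co-occurring within a sliding window of size three, and pre-trained word embeddings
# as node features. `embedding` (a word -> tensor lookup) is assumed to exist.
import itertools
import torch
from torch_geometric.data import Data

def review_to_graph(tokens, embedding, window=3, label=0):
    words = sorted(set(tokens))
    idx = {w: i for i, w in enumerate(words)}
    edges = set()
    for start in range(max(1, len(tokens) - window + 1)):
        for u, v in itertools.combinations(tokens[start:start + window], 2):
            if u != v:
                edges.add((idx[u], idx[v]))
                edges.add((idx[v], idx[u]))           # store both directions (undirected)
    if edges:
        edge_index = torch.tensor(sorted(edges), dtype=torch.long).t().contiguous()
    else:
        edge_index = torch.empty((2, 0), dtype=torch.long)
    x = torch.stack([embedding[w] for w in words])    # node features: word embeddings
    return Data(x=x, edge_index=edge_index, y=torch.tensor([label]))
```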

4.1.2 Baselines.

We use 3 types of baselines: global pooling methods, hierarchical pooling methods, and NAS methods for GNNs.
For global pooling methods, we provide a readout function at the end of existing node classification methods, for example, GCN [27], GAT [55], SAGE [21], and GIN [66]. Combining them with the JK-Network [67], which adds skip connections to these models, we formulate another four methods, denoted as GCN-JK, GAT-JK, SAGE-JK, and GIN-JK, respectively. We also include the existing global pooling method DGCNN [77] and an MLP baseline, which contains only MLPs and one readout function.
For hierarchical pooling methods, we use three selection-based methods, Graph U-Net [14], SAGPool [31], and ASAP [48], and two grouping-based methods, DiffPool [69] and MinCut [2].
Existing NAS methods focus on designing aggregation operations and lack exploration of the pooling operations. We choose five representative methods to learn pooling architectures based on their search spaces and the fixed readout operation GLOBAL_MEAN: (a) GraphNAS [17], an RL-based method that designs different aggregation operations in each layer; (b) GraphNAS-WS, GraphNAS with the weight-sharing schema [47]; (c) SNAG [79], an RL-based method that designs node aggregations, skip connections, and layer aggregations in GNNs; (d) SNAG-WS, SNAG with the weight-sharing schema; and (e) SANE [80], a differentiable method that designs the GNNs with aggregation, skip-connection, and layer aggregation operations.
Evaluating other global pooling NAS methods that use diverse search spaces with differentiable search algorithms is equivalent to the experiments in Section 4.3. Since existing methods cannot learn data-specific pooling operations, we further provide two baselines based on the proposed search space: (f) Random search (Random), which randomly samples architectures from the search space; and (g) Bayesian optimization (Bayesian) [1], which incorporates prior information and uses a tree-structured Parzen estimator to find better architectures. In this respect, these NAS baselines cover the widely used search spaces and search algorithms; thus, existing NAS methods can be fully evaluated in our experiments.

4.1.3 Implementation Details.

All models are implemented with PyTorch [45] on an RTX 3090 GPU (memory: 24 GB, CUDA v. 11.2). For consistent comparisons, we use the implementations of all GNN baselines from the popular GNN library PyTorch Geometric (PyG, v. 1.7.2) [13], which provides a unified code framework for various GNN models. For the ogbg dataset, we ignore the edge features in this article and adopt the split provided in [23]. For the others, we perform 10-fold cross-validation to evaluate model performance and use an inner holdout with an 80%/10%/10% training/validation/test split for model training and selection. Further, we adopt the same data preprocessing as provided by PyG and split the data by means of a stratification technique with the same seed.
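The following sketch illustrates the two evaluation protocols just described: the official OGB split for ogbg-molhiv and stratified 10-fold cross-validation with an inner 80%/10%/10% holdout for the other datasets. It uses standard ogb and scikit-learn utilities and is a simplified stand-in for the actual evaluation scripts.

```python
# A minimal sketch of the data splits described above; `y` is the array of graph labels.
import numpy as np
from sklearn.model_selection import StratifiedKFold, train_test_split
from ogb.graphproppred import PygGraphPropPredDataset

def ogbg_split():
    dataset = PygGraphPropPredDataset(name="ogbg-molhiv")
    split = dataset.get_idx_split()                    # official train/valid/test indices
    return dataset, split["train"], split["valid"], split["test"]

def ten_fold_splits(y, seed=0):
    y = np.asarray(y)
    skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=seed)
    for train_val_idx, test_idx in skf.split(np.zeros(len(y)), y):
        train_idx, val_idx = train_test_split(
            train_val_idx,
            test_size=1 / 9,                           # 10% of all graphs for validation
            stratify=y[train_val_idx],
            random_state=seed,
        )
        yield train_idx, val_idx, test_idx
```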
We use a training stage and a fine-tuning stage to obtain the performance of all NAS baselines and PAS. In the training stage, we derive the candidate GNNs from the corresponding search space; different search algorithms have different settings in this stage.
For PAS-G and PAS-NE, we set the training epoch to 200. As shown in Algorithm 2, in each training epoch, we sample a set of mini-batches and use the training loss to update parameters \({\bf W}\) and use the validation loss to update \(\mathbf {\alpha }\). After the training stage is finished, we derive the candidate architecture from the supernet.
For the four RL-based methods, GraphNAS, GraphNAS-WS, SNAG, and SNAG-WS, we provide a readout function GLOBAL_MEAN at the end of the GNNs so that these methods can be applied to graph classification. We set the training epoch to 200. In each training epoch, we sample 10 architectures and use the validation performance to update the controller parameters. The searched architecture is obtained at the end of the training stage.
For SANE, we provide a readout function GLOBAL_MEAN at the end of GNNs and set the training epoch to 20. It uses the training loss to update parameters \({\bf W}\) and uses the validation loss to update \(\mathbf {\alpha }\). After the training stage is finished, we derive the candidate architecture from the supernet.
For the Random and Bayesian methods, we set the training epoch to 200. In each training epoch, we sample one architecture and train it from scratch. After training is finished, we select the candidate with the best validation performance.
We fine-tune the human-designed architectures and the searched architectures with the help of Hyperopt. In the fine-tuning stage, each GNN has 30 hyperparameter-tuning steps. In each step, a set of hyperparameters is sampled from Table 5. Then, we select the final hyperparameters on the validation set, from which we obtain the final performance of this GNN. In particular, for the human-designed global pooling methods, which need an additional readout function, we provide this extra design dimension so that these methods can obtain a better global view of the given graph. For the ogbg dataset, the ROC-AUC evaluation metric is used; we report the mean and standard deviation of the test performance over five repeated runs. For the others, we report the mean test accuracy and standard deviation based on the 10-fold test results.
Table 5.
Table 5. Hyperparameter Space for Human-Designed Baselines (Left) and NAS Methods in the Fine-tuning Stage (Right)
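A minimal sketch of this fine-tuning stage with Hyperopt is given below. The hyperparameter space is illustrative rather than the exact space of Table 5, and `train_and_eval` / `searched_arch` are assumed helpers that train the derived GNN and return its validation accuracy.

```python
# A minimal sketch of the Hyperopt-based fine-tuning: 30 evaluations, keeping the
# hyperparameters with the best validation accuracy. The space below is illustrative.
from hyperopt import fmin, tpe, hp, Trials

space = {
    "lr": hp.loguniform("lr", -9.2, -2.3),                 # roughly 1e-4 ... 1e-1
    "weight_decay": hp.loguniform("weight_decay", -11.5, -4.6),
    "hidden_size": hp.choice("hidden_size", [64, 128, 256]),
    "dropout": hp.uniform("dropout", 0.0, 0.6),
}

def objective(params):
    val_acc = train_and_eval(searched_arch, **params)      # assumed helper
    return -val_acc                                         # Hyperopt minimizes the objective

trials = Trials()
best = fmin(objective, space, algo=tpe.suggest, max_evals=30, trials=trials)
```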
For the five human-designed hierarchical pooling baselines, we set the pooling ratio to 10%, 25%, 30%, 40%, and 50% when the number of layers ranges from 1 to 5. For all NAS methods on the D&D, PROTEINS, COX2, and MR datasets, we directly choose a 2-layer backbone. In each layer, one aggregation operation and one pooling operation are employed as shown in Figure 2(a), and we set the pooling ratio to 25%. For the IMDBM dataset, we choose a 1-layer backbone considering the small graph size and set the pooling ratio to 10%. For the ogbg dataset, we observe that more aggregation operations are preferable; for example, five aggregation operations are used in GNNs on this dataset by default. The five global pooling NAS baselines therefore use a 4-layer backbone in which one aggregation operation is designed in each layer, whereas PAS, Random, and Bayesian use a 2-layer backbone in which one Pooling Module is placed after two Aggregation Modules in each layer.

4.2 Performance Comparisons

The results are given in Table 6, from which we can see that there is no absolute winner among the human-designed models across all datasets. For example, GCN performs best on D&D, whereas MinCut performs best on PROTEINS. Besides, the hierarchical pooling methods are not always superior to the global ones, which is consistent with two recent works [11, 42], and the grouping-based methods cannot achieve consistently higher accuracy than the selection-based ones. Considering that these datasets are from different domains, this demonstrates the need for adaptive pooling architectures for graph classification.
Table 6.
Group | Methods | D&D | PROTEINS | IMDBM | COX2 | MR | ogbg | Avg. Rank (Group) | Avg. Rank (All)
Global | GCN | 78.12\(^{4.33}\) | 74.84\(^{2.82}\) | 50.40\(^{3.02}\) | 79.23\(^{2.19}\) | 76.88\(^{1.11}\) | 73.89\(^{0.46}\) | 4.7 | 11.8
Global | GAT | 75.56\(^{3.72}\) | 75.30\(^{3.72}\) | 49.67\(^{4.30}\) | 81.56\(^{4.17}\) | 77.78\(^{1.07}\) | 73.72\(^{0.50}\) | 4.8 | 11.3
Global | SAGE | 77.27\(^{4.06}\) | 73.75\(^{2.97}\) | 48.53\(^{5.43}\) | 80.31\(^{5.94}\) | 77.98\(^{1.02}\) | 73.68\(^{1.43}\) | 6.3 | 14.5
Global | GIN | 75.40\(^{3.68}\) | 74.48\(^{2.78}\) | 49.80\(^{2.50}\) | 83.09\(^{4.17}\) | 76.82\(^{0.09}\) | 74.22\(^{1.53}\) | 5.2 | 12.5
Global | GCN-JK | 77.69\(^{2.35}\) | 75.39\(^{5.17}\) | 48.86\(^{4.96}\) | 79.66\(^{2.26}\) | 77.84\(^{1.06}\) | 74.26\(^{0.34}\) | 4.7 | 12.0
Global | GAT-JK | 78.09\(^{6.18}\) | 74.57\(^{4.05}\) | 49.53\(^{3.84}\) | 80.72\(^{2.87}\) | 77.92\(^{0.90}\) | 73.23\(^{0.64}\) | 5.0 | 11.2
Global | SAGE-JK | 77.35\(^{4.20}\) | 74.94\(^{3.55}\) | 49.73\(^{3.40}\) | 79.90\(^{4.15}\) | 76.59\(^{1.11}\) | 73.57\(^{0.47}\) | 6.0 | 14.5
Global | GIN-JK | 75.13\(^{3.95}\) | 73.12\(^{4.51}\) | 50.80\(^{3.02}\) | 81.17\(^{4.67}\) | 78.42\(^{0.87}\) | 72.03\(^{1.50}\) | 5.5 | 12.0
Global | DGCNN | 76.66\(^{4.03}\) | 73.57\(^{4.69}\) | 49.00\(^{3.56}\) | 79.85\(^{2.64}\) | 78.17\(^{1.03}\) | 68.91\(^{2.24}\) | 7.2 | 17.2
Global | MLP | 77.52\(^{3.90}\) | 72.39\(^{3.53}\) | 49.80\(^{4.28}\) | 78.38\(^{1.04}\) | 78.30\(^{1.21}\) | 73.71\(^{0.14}\) | 5.7 | 14.5
Hierarchical | ASAP | 77.35\(^{4.15}\) | 74.93\(^{3.57}\) | 50.13\(^{3.44}\) | 80.95\(^{3.20}\) | 78.19\(^{1.38}\) | 69.59\(^{0.90}\) | 2.2 | 10.8
Hierarchical | SAGPool | 75.06\(^{5.06}\) | 73.12\(^{4.47}\) | 49.33\(^{4.90}\) | 79.45\(^{2.98}\) | 74.12\(^{1.53}\) | 65.47\(^{1.34}\) | 4.7 | 21.8
Hierarchical | Graph U-Net | 77.10\(^{5.17}\) | 74.40\(^{3.49}\) | 48.80\(^{3.19}\) | 80.30\(^{4.21}\) | 78.62\(^{1.22}\) | 70.46\(^{1.18}\) | 3.0 | 14.0
Hierarchical | DiffPool | 77.75\(^{4.00}\) | 73.55\(^{3.22}\) | 49.53\(^{3.98}\) | 79.66\(^{2.64}\) | 77.84\(^{1.39}\) | 74.59\(^{0.75}\) | 2.7 | 12.5
Hierarchical | MinCut | 78.03\(^{3.63}\) | 75.75\(^{3.80}\) | 49.00\(^{2.83}\) | 80.07\(^{3.85}\) | 77.43\(^{1.22}\) | 74.33\(^{1.67}\) | 2.5 | 11.0
NAS | GraphNAS | 71.98\(^{4.54}\) | 72.51\(^{3.36}\) | 46.93\(^{3.64}\) | 77.73\(^{1.40}\) | 77.00\(^{1.93}\) | 70.31\(^{0.68}\) | 7.3 | 23.0
NAS | GraphNAS-WS | 76.74\(^{4.55}\) | 75.20\(^{2.51}\) | 48.27\(^{3.50}\) | 78.16\(^{0.58}\) | 75.89\(^{1.13}\) | 69.35\(^{0.76}\) | 6.3 | 20.0
NAS | SNAG | 72.23\(^{3.86}\) | 70.53\(^{3.11}\) | 49.33\(^{2.89}\) | 79.03\(^{2.12}\) | 76.83\(^{1.76}\) | 67.31\(^{0.27}\) | 7.3 | 22.7
NAS | SNAG-WS | 73.51\(^{3.03}\) | 72.33\(^{2.44}\) | 50.00\(^{2.48}\) | 80.54\(^{3.81}\) | 76.43\(^{0.96}\) | 65.07\(^{0.64}\) | 6.3 | 19.5
NAS | SANE | 72.57\(^{3.49}\) | 72.87\(^{3.35}\) | 51.33\(^{4.12}\) | 81.16\(^{1.32}\) | 78.46\(^{1.79}\) | 66.12\(^{0.83}\) | 4.0 | 13.7
NAS | Random-G | 77.92\(^{4.82}\) | 73.94\(^{4.23}\) | 49.80\(^{3.98}\) | 80.73\(^{2.31}\) | 77.81\(^{1.52}\) | 72.84\(^{1.04}\) | 3.7 | 11.7
NAS | Bayesian-G | 75.55\(^{3.21}\) | 73.14\(^{2.39}\) | 49.80\(^{3.97}\) | 80.29\(^{1.72}\) | 77.83\(^{1.93}\) | 70.29\(^{3.23}\) | 4.8 | 16.2
NAS | Random-NE | 79.71\(^{3.03}\) | 75.39\(^{3.58}\) | 50.00\(^{4.04}\) | 80.51\(^{2.09}\) | 78.30\(^{0.65}\) | 73.54\(^{1.55}\) | 2.0 | 5.8
NAS | Bayesian-NE | 77.50\(^{0.38}\) | 75.30\(^{3.99}\) | 50.87\(^{2.83}\) | 78.59\(^{4.26}\) | 78.29\(^{1.91}\) | 72.64\(^{1.24}\) | 3.2 | 10.7
 | PAS-G | 78.96\(^{3.68}\) | 76.64\(^{3.29}\) | 52.20\(^{3.73}\) | 83.44\(^{6.33}\) | 77.59\(^{1.46}\) | 74.41\(^{0.99}\) | 1.8 | 4.7
 | PAS-NE | 79.62\(^{1.75}\) | 77.36\(^{3.69}\) | 53.13\(^{4.49}\) | 82.66\(^{3.51}\) | 78.72\(^{1.08}\) | 75.47\(^{1.50}\) | 1.2 | 1.5
Table 6. Performance Comparisons of PAS and All Baselines
For the ogbg dataset, we use the ROC-AUC evaluation metric and report the averaged test performance and standard deviations with five repeated runs. For others, the results are obtained on top of the 10-fold test results. The best results in different groups of baselines are underlined. The group accuracy rank and the overall accuracy rank of each method are calculated on each dataset. The Top-2 methods in each group and the Top-3 methods in this table are highlighted in gray.
Compared with these human-designed architectures, we can observe that the proposed two PAS variants achieve top-ranking performance on these datasets, demonstrating the effectiveness of these methods in searching for data-specific pooling architectures for graph classification.
When it comes to NAS baselines, the performance gains of the proposed PAS variants are also significant. On one hand, compared with the global pooling methods, GraphNAS, SNAG, and SANE, the performance gains are mainly from the Pooling Module, which can capture the hierarchical information. On the other hand, compared with Random and Bayesian baselines, which have the same search space as PAS, the performance gains are from the differentiable search algorithm in obtaining better architectures. In addition, when the Random and Bayesian search algorithms are employed, the fine-grained variant also outperforms the coarse-grained variant. The higher performance, regardless of the search algorithm, is attributed to the fine-grained pooling, that is, node and edge scale design in the coarse graph.
We also visualize the searched architectures on all datasets in Figure 3, from which it is clear that different combinations of the designed operations, that is, data-specific architectures, are obtained. Especially for PAS-G, a global pooling architecture is obtained on COX2 since IDENTITY is selected in the Pooling Module of the first layer, whereas a hierarchical pooling architecture is obtained on the D&D dataset. This observation further indicates the flexibility of PAS in designing both global and hierarchical pooling methods. Moreover, compared with PAS-G, which focuses on preserving the most informative nodes, the PAS-NE variant further improves graph-level representation learning by designing the coarse graph in the node and edge scales. As shown in Figure 3, all of the searched architectures in PAS-NE apply the operations RED_AGG or RED_CLUSTER to utilize the features of the unselected nodes or employ the operation CONN_CLUSTER to utilize the unselected edges.
We give a detailed analysis of the Pooling Module based on the fine-grained variant PAS-NE, in which the searched operations may reflect the preferences of different datasets. For the Selection Block, we provide three kinds of functions to estimate the node importance based on features, structures, and local information. On three datasets, D&D, PROTEINS, and ogbg, the searched architectures compute the node scores based on the node features and local information, whereas those on the other three datasets use only the node features to indicate node significance. For example, the MR dataset is constructed from text data, and the node features are generated from the word embedding dictionary, which may contain more information than the graph structure. As shown in Table 6, the MLP baseline, which utilizes only the node features, achieves performance comparable with the other baselines, from which the same conclusion can be drawn. For the Reduction Block, most of the searched architectures are devoted to exploring the unselected features, that is, using RED_AGG and RED_CLUSTER, and none of them employs two RED_DROP reduction operations in the network. This demonstrates the significance of preserving the node features when generating the coarse graph for all datasets. For the Connection Block, we observe that two CONN_CLUSTER operations are present on the ogbg dataset as shown in Figure 3(f) (right-hand side), which indicates the importance of enhancing the information extraction on the graph structure. As shown in Table 6, the two grouping-based pooling methods DiffPool and MinCut achieve the top two performances among the human-designed baselines on this dataset, which helps explain the SOTA performance of PAS-NE.

4.3 Ablation Studies on Search Space

We conduct ablation studies to show the importance of the four modules and three blocks in the search space. For simplicity, we use two datasets, PROTEINS and IMDBM, and conduct experiments over different variants of the search space. The results are shown in Table 7. We can observe that the performance drops on all reduced search spaces, which indicates the importance of the four modules and three blocks used in the framework. In the Aggregation Module, we use the fixed aggregation operations GCN and GAT in the search space; these two reduced search space variants are denoted as FGCN and FGAT, respectively, and a performance drop is observed for both. FGAT performs better than FGCN in general, which is consistent with [55]. For the Readout Module, we use the fixed operation GLOBAL_MEAN, denoted as FR. The performance drop means that it is far from satisfactory to use a simple mean function to generate a fixed-size representation out of all nodes in a graph. As shown in Figure 3, complex global pooling functions such as GLOBAL_ATT and GLOBAL_SORT are selected for real-world datasets. The Merge Module is designed to integrate the graph representations from different layers. By removing the Merge Module (variant RM), we preserve only the last Readout Module, whose output is used as the graph representation \({\bf z}^F\). The performance drop demonstrates the importance of utilizing the intermediate layers in the final representation, which has been shown in previous works [7, 67, 80].
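For reference, the sketch below maps a few readout labels used in this article to PyG (v1.7.x) pooling utilities. The exact candidate set and naming in the search space may differ, so this is only an illustration of the kinds of readout operations involved in the comparison.

```python
# An illustrative mapping from readout labels to PyG pooling utilities; the actual
# candidate operations in the search space may differ.
import torch
from torch_geometric.nn import (GlobalAttention, global_add_pool, global_max_pool,
                                global_mean_pool, global_sort_pool)

hidden = 128
readouts = {
    "GLOBAL_MEAN": lambda x, batch: global_mean_pool(x, batch),
    "GLOBAL_SUM": lambda x, batch: global_add_pool(x, batch),
    "GLOBAL_MAX": lambda x, batch: global_max_pool(x, batch),
    "GLOBAL_ATT": GlobalAttention(gate_nn=torch.nn.Linear(hidden, 1)),
    "GLOBAL_SORT": lambda x, batch: global_sort_pool(x, batch, k=10),
}
```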
Table 7.
Module | Space Variant | Fixed Operation | PAS-G (PROTEINS) | PAS-G (IMDBM) | PAS-NE (PROTEINS) | PAS-NE (IMDBM)
Aggregation | FGCN | GCN | 72.25\(^{4.47}\) | 50.27\(^{4.09}\) | 75.93\(^{4.07}\) | 51.40\(^{3.13}\)
Aggregation | FGAT | GAT | 73.86\(^{4.72}\) | 50.87\(^{4.17}\) | 75.92\(^{4.26}\) | 51.45\(^{3.07}\)
Readout | FR | GLOBAL_MEAN | 73.69\(^{6.45}\) | 50.33\(^{4.36}\) | 74.75\(^{2.78}\) | 51.53\(^{3.88}\)
Merge | RM | - | 72.60\(^{2.25}\) | 50.47\(^{3.80}\) | 75.21\(^{2.48}\) | 51.93\(^{3.17}\)
Pooling | PGLOBAL | IDENTITY | 75.84\(^{3.44}\) | 51.73\(^{4.47}\) | 76.20\(^{3.09}\) | 51.87\(^{3.65}\)
Pooling | PTOPK | TOPKPOOL | 75.30\(^{2.78}\) | 50.33\(^{2.65}\) | 76.01\(^{3.77}\) | 52.33\(^{4.32}\)
Pooling | REDS | RED_DROP | - | - | 76.46\(^{3.95}\) | 51.93\(^{3.93}\)
Pooling | CONNS | CONN_DROP | - | - | 76.73\(^{3.54}\) | 51.40\(^{3.94}\)
Full Space | - | - | 76.64\(^{3.29}\) | 52.20\(^{3.73}\) | 77.36\(^{3.69}\) | 53.13\(^{4.49}\)
Table 7. Performance of PAS Using Different Variants of Search Space
The first column represents the corresponding module we try to evaluate; the second column is the reduced search space. The third column is the fixed operation used in the reduced search space. The RM variant uses a different framework and is denoted as ‘-’ in the third column.
For the Pooling Module, the coarse-grained variant PAS-G designs the score functions in the Selection Block, and the fine-grained variant PAS-NE designs the coarse graph based on the three blocks. For the Selection Block, the fixed selection operations IDENTITY and TOPKPOOL are used. The former is denoted as PGLOBAL, which is equivalent to the global pooling method; the latter, PTOPK, is a hierarchical method with a fixed score function. The performance drop demonstrates the effectiveness of this block. PGLOBAL achieves better performance than PTOPK in general, and we observe that it is challenging to achieve SOTA performance with fixed score functions; thus, it is necessary to design the pooling operation adaptively. For the Reduction and Connection Blocks, we use the RED_DROP and CONN_DROP operations, denoted as REDS and CONNS, respectively, which means the unselected nodes or edges are ignored. We observe a performance drop whenever the exploration in either scale is missing, demonstrating the importance of these blocks and the design of PAS-NE. PAS-NE achieves higher performance than PAS-G under the same reduced search space variants. Considering that the only difference is the Pooling Module, the superiority of PAS-NE over PAS-G is evident.
Three modules, namely Aggregation, Readout, and Merge, make up the backbone of graph classification GNNs. They are responsible for extracting local information, generating the graph representation vector for the given graph, and merging the readout results from the intermediate layers, respectively. In Table 7, we observe a larger performance drop on these three modules (FGCN, FGAT, FR, and RM) than on the Pooling Module (PGLOBAL, PTOPK, REDS, and CONNS) when using the fixed operations. In other words, it is insufficient to merely design the pooling operations. The SOTA performance is obtained by designing the four modules simultaneously, which further demonstrates the effectiveness of the designed unified framework.
Taking all results in Table 7 into consideration, we can see that it is essential for graph classification to search for combinations of operations from the four essential modules, which demonstrates the contribution of the designed unified framework as well as the search space.

4.4 The Evaluation of the Search Algorithm

4.4.1 The Efficiency of PAS.

In this subsection, we show the superior efficiency of the differentiable search process of PAS over the NAS baselines. In Figure 4, we show the efficiency improvements with regard to the search cost compared with GraphNAS. Because SANE has fewer search epochs in the training stage (20 epochs in SANE and 200 epochs in the others), we exclude this method from the comparison in this subsection. The detailed results of the search cost and search space size are provided in Tables 10 and 11 in Appendix A.
Table 8.
Variant | Search Algorithm | PROTEINS | IMDBM
PAS-G | Gumbel-Softmax | 76.64\(^{3.29}\) | 52.20\(^{3.73}\)
PAS-G | Softmax | 75.75\(^{3.68}\) | 51.60\(^{3.59}\)
PAS-G | Random | 73.94\(^{4.23}\) | 49.80\(^{3.98}\)
PAS-G | Bayesian | 73.14\(^{2.39}\) | 49.80\(^{3.97}\)
PAS-NE | Gumbel-Softmax | 77.36\(^{3.69}\) | 53.13\(^{4.49}\)
PAS-NE | Softmax | 76.46\(^{3.54}\) | 52.33\(^{4.69}\)
PAS-NE | Random | 75.39\(^{3.58}\) | 50.00\(^{4.04}\)
PAS-NE | Bayesian | 75.30\(^{3.99}\) | 50.87\(^{2.83}\)
Table 8. Performance Comparisons of Different Search Algorithms
Table 9.
Method | IMDBM | PROTEINS
PAS-NE | 53.13\(^{4.49}\) | 77.36\(^{3.69}\)
PAS-NE (reuse) | 51.67\(^{4.46}\) | 76.64\(^{4.33}\)
DiffPool | 49.53\(^{3.98}\) | 73.55\(^{3.22}\)
MinCut | 49.00\(^{2.83}\) | 75.75\(^{3.80}\)
Table 9. Performance Comparisons on Whether to Reuse \({\bf B}\)
Fig. 4.
Fig. 4. The search efficiency improvements over GraphNAS of each model on all datasets, calculated based on the search cost shown in Table 10. We show the origin results (left) and their zoomed-in views (right). The size of each circle represents the search space size each method uses; the details are provided in Table 11. We provide two variants of search space in this article: “-G” and “-NE,” as shown in Table 3. In each variant, Random, Bayesian, and the proposed PAS use the same search space, thus, the circles are the same size.
As shown in Figure 4, compared with these baselines, whose architectures are generated based on hundreds of evaluations, PAS has the smallest search cost and the biggest efficiency improvements, which is mainly attributed to the use of the differentiable architecture search together with the coarsening strategy. Especially for large datasets such as MR and ogbg, which are extremely costly in the training stage, our method significantly relieves the pressure on the time budget. Combined with the top-ranking performance in Table 6, the efficiency and effectiveness of PAS are evident.
Compared with the PAS-G variant, the PAS-NE variant has more operations that need extra matrix multiplications, and thus a longer search time is required. Owing to the differentiable search algorithm, the PAS-NE variant takes only about twice as long as the PAS-G variant in general. Nevertheless, on the D&D dataset, the graph size is more than one order of magnitude larger than on the other datasets, so it is more expensive to design the assignment matrix. PAS-NE costs 5.08 GPU hours in the training stage on this dataset, whereas PAS-G costs 0.27 GPU hours, roughly 18\(\times\) less computation. Combined with the performance gains in Table 6, it is still worth employing the fine-grained variant PAS-NE for large-sized graphs.
To summarize, based on the unified framework and coarsening strategy, we employ the differentiable search algorithm in this article to learn data-specific pooling architectures for graph classification. The fine-grained variant PAS-NE achieves higher performance with comparable search cost on most datasets, whereas the coarse-grained variant PAS-G is more efficient on large-sized graphs, for example, graphs with hundreds of nodes.

4.4.2 Comparisons of Different Search Algorithms.

Considering the performance and efficiency improvements achieved by existing differentiable search algorithms, we propose a coarsening strategy to enable the use of differentiable architecture search. The Gumbel-Softmax function is directly adopted in this article to relax the search space into a continuous one. In Table 8, we provide comparisons of different relaxation functions. The Gumbel-Softmax function achieves higher performance than the Softmax function in both variants, which is consistent with SNAS [65], a previously proposed NAS method for computer vision tasks. Compared with the Random and Bayesian baselines, which sample architectures and train them from scratch, the differentiable methods optimize the supernet with gradient descent, from which the performance gain is evident.
In this article, we provide a platform to employ the existing effective differentiable search algorithm rather than designing a specific search algorithm for this task. The efficiency and effectiveness of using the differentiable search algorithm are apparent in Figure 4 and Table 8. Based on the designed supernet and coarsening strategy, more advanced search algorithms can be used to improve the proposed PAS further, which will be left to future work.

4.5 Evaluation of the Hyperparameters

4.5.1 Number of Layers.

Here, we conduct experiments to show the influence of the number of layers \(L\) of the proposed PAS by varying \(L \in \lbrace 1,2,3,4,5\rbrace\). Each layer contains one Aggregation Module and one Pooling Module, as introduced in Section 3.1 and Figure 2(a). The results are shown in Figure 5. On the PROTEINS dataset, the performance first increases with \(L\) and then decreases. On IMDBM, which has a smaller graph size, more layers are not suitable, and the performance first decreases and then stabilizes. PAS-NE generally performs better than PAS-G due to the fine-grained coarse graph design.
Fig. 5.
Fig. 5. The test accuracy with regard to the layer numbers for PAS.

4.5.2 Reuse B in the Connection Block.

The proposed fine-grained variant PAS-NE can emulate the grouping-based methods, and it moves \({\bf B}\) into the following Reduction and Connection Blocks. Therefore, we calculate separate assignment matrices \({\bf B}_H\) and \({\bf B}_A\) for the Reduction and Connection Blocks. In the literature, the grouping-based methods require only one assignment matrix \({\bf B}\) in each pooling operation, on top of which the feature and adjacency matrices are updated, respectively. Here, we conduct an experiment that reuses the assignment matrix \({\bf B}\) between the RED_CLUSTER and CONN_CLUSTER operations in the same module, which is denoted as PAS-NE (reuse) in this article. Compared with PAS-NE, it has a narrower range of application scenarios because only \(\frac{1}{6}\) of the candidate architectures in the search space contain these two operations (\(\frac{1}{3}\) in the Reduction Block and \(\frac{1}{2}\) in the Connection Block from Table 3), and the others cannot share \({\bf B}\) in the Pooling Module. As shown in Figure 6, even though we reuse the assignment matrix in the training stage, the searched architectures do not satisfy the requirement for sharing the matrix \({\bf B}\). Furthermore, as shown in Table 9, PAS-NE achieves higher performance than PAS-NE (reuse) and the two grouping-based methods by using different feature and adjacency assignment matrices. Considering the different degrees of importance of the nodes and edges, different assignment matrices provide more flexibility in designing coarse graphs, leading to potentially more expressive architectures.
Fig. 6.
Fig. 6. The searched architectures when reusing \({\bf B}\).

5 Conclusion

In this article, we propose a novel method, PAS, to automatically learn data-specific pooling architectures for GNN-based graph classification. To design an effective search space, we propose a unified pooling framework consisting of four essential modules, namely Aggregation, Pooling, Readout, and Merge, by revisiting various human-designed pooling architectures. The coarse-grained variant PAS-G aims to design the score function in the Pooling Module to preserve the most informative nodes. The fine-grained variant PAS-NE decomposes the Pooling Module into three blocks and then designs the coarse graph in the node and edge scales. A set of candidate operations is provided for each module and block, on top of which the global and hierarchical methods in the literature can be incorporated.
Considering its efficiency and effectiveness, we employ the differentiable search algorithm and develop a coarsening strategy to relax the search space continuously. We conduct extensive experiments on six datasets from five domains. The experimental results show that PAS achieves top-ranking performance among all baselines, and the fine-grained variant PAS-NE performs better than PAS-G by designing the coarse graphs in the node and edge scales. With the designed coarsening strategy, PAS is optimized with gradient descent, which accelerates the training stage and makes the search orders of magnitude more efficient. For future work, we plan to investigate the influence of the nodes and edges in the coarse graph, which can help us better understand the graph classification task.

A Experiment Results

The detailed comparisons of the search cost and the search space size are provided in Table 10 and Table 11, respectively.
Table 10.
MethodsD& DPROTEINSIMDBMCOX2MRogbg
GraphNAS103.788.34257.3518.987368502.22
SNAG8.74714.97417.1435.5283.2140.42
SANE0.0150.0170.0190.0130.090.48
Random-G49.36.7934.22.98222.2347.54
Random-NE57.515.476.671.5224.8944.48
Bayesian-G44.456.7126.333.27922.153.64
Bayesian-NE55.315.735.671.9722.7241.25
PAS-G0.270.3270.1690.1240.793.86
PAS-NE5.080.610.290.2151.427.19
Table 10. Search Cost (GPU Hours) Comparisons in the Search Stage
Table 11.
Methods | D&D | PROTEINS | IMDBM | COX2 | MR | ogbg
GraphNAS | 8.85 \(\times 10^7\) | 8.85 \(\times 10^7\) | 9.41 \(\times 10^3\) | 8.85 \(\times 10^7\) | 8.85 \(\times 10^7\) | 7.83 \(\times 10^{15}\)
SNAG | 1.20 \(\times 10^3\) | 1.20 \(\times 10^3\) | 3.00 \(\times 10^1\) | 1.20 \(\times 10^3\) | 1.20 \(\times 10^3\) | 1.80 \(\times 10^5\)
SANE | 7.26 \(\times 10^2\) | 7.26 \(\times 10^2\) | 3.30 \(\times 10^1\) | 7.26 \(\times 10^2\) | 7.26 \(\times 10^2\) | 3.51 \(\times 10^5\)
Random-G | 6.17 \(\times 10^6\) | 6.17 \(\times 10^6\) | 1.47 \(\times 10^4\) | 6.17 \(\times 10^6\) | 6.17 \(\times 10^6\) | 2.22 \(\times 10^8\)
Random-NE | 2.22 \(\times 10^8\) | 2.22 \(\times 10^8\) | 8.82 \(\times 10^4\) | 2.22 \(\times 10^8\) | 2.22 \(\times 10^8\) | 8.00 \(\times 10^9\)
Bayesian-G | 6.17 \(\times 10^6\) | 6.17 \(\times 10^6\) | 1.47 \(\times 10^4\) | 6.17 \(\times 10^6\) | 6.17 \(\times 10^6\) | 2.22 \(\times 10^8\)
Bayesian-NE | 2.22 \(\times 10^8\) | 2.22 \(\times 10^8\) | 8.82 \(\times 10^4\) | 2.22 \(\times 10^8\) | 2.22 \(\times 10^8\) | 8.00 \(\times 10^9\)
PAS-G | 6.17 \(\times 10^6\) | 6.17 \(\times 10^6\) | 1.47 \(\times 10^4\) | 6.17 \(\times 10^6\) | 6.17 \(\times 10^6\) | 2.22 \(\times 10^8\)
PAS-NE | 2.22 \(\times 10^8\) | 2.22 \(\times 10^8\) | 8.82 \(\times 10^4\) | 2.22 \(\times 10^8\) | 2.22 \(\times 10^8\) | 8.00 \(\times 10^9\)
Table 11. Search Space Size Comparisons of All NAS Methods

References

[1]
James S. Bergstra, Rémi Bardenet, Yoshua Bengio, and Balázs Kégl. 2011. Algorithms for hyper-parameter optimization. In NeurIPS. 2546–2554.
[2]
Filippo Maria Bianchi, Daniele Grattarola, and Cesare Alippi. 2020. Spectral clustering with graph neural networks for graph pooling. In ICML. 874–883.
[3]
Roi Blanco and Christina Lioma. 2012. Graph-based term weighting for information retrieval. Information Retrieval 15, 1 (2012), 54–92.
[4]
Karsten M. Borgwardt and Hans-Peter Kriegel. 2005. Shortest-path kernels on graphs. In ICDM. IEEE, 8 pp.
[5]
Han Cai, Ligeng Zhu, and Song Han. 2019. ProxylessNAS: Direct neural architecture search on target task and hardware. In ICLR.
[6]
Shaofei Cai, Liang Li, Jincan Deng, Beichen Zhang, Zheng-Jun Zha, Li Su, and Qingming Huang. 2021. Rethinking graph neural network search from message-passing. In CVPR.
[7]
Ting Chen, Song Bian, and Yizhou Sun. 2019. Are powerful graph neural nets necessary? A dissection on graph classification. (2019).
[8]
Paul D. Dobson and Andrew J. Doig. 2003. Distinguishing enzyme structures from non-enzymes without alignments. Journal of Molecular Biology 330, 4 (2003), 771–783.
[9]
Xuanyi Dong and Yi Yang. 2019. Searching for a robust neural architecture in four GPU hours. In CVPR. 1761–1770.
[10]
Thomas Elsken, Jan Hendrik Metzen, and Frank Hutter. 2019. Neural architecture search: A survey. The Journal of Machine Learning Research 20, 1 (2019), 1997–2017.
[11]
Federico Errica, Marco Podda, Davide Bacciu, and Alessio Micheli. 2020. A fair comparison of graph neural networks for graph classification. In ICLR.
[12]
Guosheng Feng, Chunnan Wang, and Hongzhi Wang. 2021. Search for deep graph neural networks. arXiv preprint arXiv:2109.10047 (2021).
[13]
Matthias Fey and Jan E. Lenssen. 2019. Fast Graph Representation Learning with PyTorch Geometric. (2019).
[14]
Hongyang Gao and Shuiwang Ji. 2019. Graph U-Nets. In ICML. 2083–2092.
[15]
Hongyang Gao, Yi Liu, and Shuiwang Ji. 2021. Topology-aware graph pooling networks. IEEE Transactions on Pattern Analysis and Machine Intelligence (2021).
[16]
Xing Gao, Wenrui Dai, Chenglin Li, Hongkai Xiong, and Pascal Frossard. 2021. iPool–information-based pooling in hierarchical graph neural networks. IEEE Transactions on Neural Networks and Learning Systems (2021).
[17]
Yang Gao, Hong Yang, Peng Zhang, Chuan Zhou, and Yue Hu. 2020. Graph neural architecture search. In IJCAI.
[18]
Thomas Gärtner, Peter Flach, and Stefan Wrobel. 2003. On graph kernels: Hardness results and efficient alternatives. In Learning Theory and Kernel Machines. Springer, 129–143.
[19]
Justin Gilmer, Samuel S. Schoenholz, Patrick F. Riley, Oriol Vinyals, and George E. Dahl. 2017. Neural message passing for quantum chemistry. In ICML. 1263–1272.
[20]
Daniele Grattarola, Daniele Zambon, Filippo Maria Bianchi, and Cesare Alippi. 2021. Understanding pooling in graph neural networks. In NeurIPS.
[21]
Will Hamilton, Zhitao Ying, and Jure Leskovec. 2017. Inductive representation learning on large graphs. In NeurIPS. 1024–1034.
[22]
Ramtin Hosseini, Xingyi Yang, and Pengtao Xie. 2021. DSRNA: Differentiable search of robust neural architectures. In CVPR. 6196–6205.
[23]
Weihua Hu, Matthias Fey, Marinka Zitnik, Yuxiao Dong, Hongyu Ren, Bowen Liu, Michele Catasta, and Jure Leskovec. 2020. Open graph benchmark: Datasets for machine learning on graphs. In NeurIPS, H. Larochelle, M. Ranzato, R. Hadsell, M. F. Balcan, and H. Lin (Eds.), Vol. 33. 22118–22133.
[24]
Eric Jang, Shixiang Gu, and Ben Poole. 2017. Categorical reparameterization with Gumbel-Softmax. In ICLR (2017).
[25]
Shengli Jiang and Prasanna Balaprakash. 2020. Graph Neural Network Architecture Search for Molecular Property Prediction. (2020).
[26]
Amir Hosein Khasahmadi, Kaveh Hassani, Parsa Moradi, Leo Lee, and Quaid Morris. 2020. Memory-based graph networks. In ICLR (2020).
[27]
Thomas N. Kipf and Max Welling. 2016. Semi-supervised classification with graph convolutional networks. In ICLR (2016).
[28]
Jihoon Ko, Taehyung Kwon, Kijung Shin, and Juho Lee. 2021. Learning to pool in graph neural networks for extrapolation. arXiv preprint arXiv:2106.06210 (2021).
[29]
Kwei-Herng Lai, Daochen Zha, Kaixiong Zhou, and Xia Hu. 2020. Policy-GNN: Aggregation optimization for graph neural networks. In KDD. 461–471.
[30]
Dongha Lee, Su Kim, Seonghyeon Lee, Chanyoung Park, and Hwanjo Yu. 2021. Learnable structural semantic readout for graph classification. arXiv preprint arXiv:2111.11523 (2021).
[31]
Junhyun Lee, Inyeop Lee, and Jaewoo Kang. 2019. Self-attention graph pooling. In ICML. 3734–3743.
[32]
Guohao Li, Guocheng Qian, Itzel C. Delgadillo, Matthias Muller, Ali Thabet, and Bernard Ghanem. 2020. SGAS: Sequential greedy architecture search. In CVPR. 1620–1630.
[33]
Juanhui Li, Yao Ma, Yiqi Wang, Charu Aggarwal, Chang-Dong Wang, and Jiliang Tang. 2020. Graph pooling with representativeness. In ICDM. IEEE, 302–311.
[34]
Maosen Li, Siheng Chen, Ya Zhang, and Ivor W. Tsang. 2020. Graph cross networks with vertex infomax pooling. In NeurIPS.
[35]
Yaoman Li and Irwin King. 2020. AutoGraph: Automated graph neural network. In ICONIP. 189–201.
[36]
Yujia Li, Daniel Tarlow, Marc Brockschmidt, and Richard S. Zemel. 2016. Gated graph sequence neural networks. In ICLR.
[37]
Yanxi Li, Zean Wen, Yunhe Wang, and Chang Xu. 2021. One-shot graph neural architecture search with dynamic search space. In AAAI.
[38]
Hanxiao Liu, Karen Simonyan, and Yiming Yang. 2019. DARTS: Differentiable architecture search. In ICLR.
[39]
Xien Liu, Xinxin You, Xiao Zhang, Ji Wu, and Ping Lv. 2020. Tensor graph convolutional networks for text classification. In AAAI, Vol. 34. 8409–8416.
[40]
Pierre Mahé, Nobuhisa Ueda, Tatsuya Akutsu, Jean-Luc Perret, and Jean-Philippe Vert. 2004. Extensions of marginalized graph kernels. In ICML. 70.
[41]
Hafez Eslami Manoochehri and Mehrdad Nourani. 2020. Drug-target interaction prediction using semi-bipartite graph model and deep learning. BMC Bioinformatics 21, 4 (2020), 1–16.
[42]
Diego P. P. Mesquita, Amauri H. Souza Jr., and Samuel Kaski. 2020. Rethinking pooling in graph neural networks. In NeurIPS. 4800–4810.
[43]
Christopher Morris, Martin Ritzert, Matthias Fey, William L. Hamilton, Jan Eric Lenssen, Gaurav Rattan, and Martin Grohe. 2019. Weisfeiler and Leman go neural: Higher-order graph neural networks. In AAAI, Vol. 33. 4602–4609.
[44]
Yunsheng Pang, Yunxiang Zhao, and Dongsheng Li. 2021. Graph pooling via coarsened graph infomax. In SIGIR. ACM, 2177–2181.
[45]
Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, et al. 2019. PyTorch: An imperative style, high-performance deep learning library. In NeurIPS. 8026–8037.
[46]
Wei Peng, Xiaopeng Hong, Haoyu Chen, and Guoying Zhao. 2020. Learning graph convolutional network for skeleton-based human action recognition by neural searching. In AAAI. 2669–2676.
[47]
Hieu Pham, Melody Guan, Barret Zoph, Quoc Le, and Jeff Dean. 2018. Efficient neural architecture search via parameter sharing. In ICML. 4092–4101.
[48]
Ekagra Ranjan, Soumya Sanyal, and Partha P. Talukdar. 2020. ASAP: Adaptive structure aware pooling for learning hierarchical graph representations. In AAAI. 5470–5477.
[49]
Esteban Real, Alok Aggarwal, Yanping Huang, and Quoc V. Le. 2019. Regularized evolution for image classifier architecture search. In AAAI, Vol. 33. 4780–4789.
[50]
Esteban Real, Sherry Moore, Andrew Selle, Saurabh Saxena, Yutaka Leon Suematsu, Jie Tan, Quoc V. Le, and Alexey Kurakin. 2017. Large-scale evolution of image classifiers. In ICML. PMLR, 2902–2911.
[51]
François Rousseau, Emmanouil Kiagias, and Michalis Vazirgiannis. 2015. Text categorization as a graph classification problem. In ACL-IJCNLP. 1702–1712.
[52]
Kashob Kumar Roy, Amit Roy, A. K. M. Rahman, M. Ashraful Amin, and Amin Ahsan Ali. 2021. Structure-aware hierarchical graph pooling using information bottleneck. In IJCNN. IEEE, 1–8.
[53]
Jeffrey J. Sutherland, Lee A. O’Brien, and Donald F. Weaver. 2003. Spline-fitting with a genetic algorithm: A method for developing classification structure-activity relationships. Journal of Chemical Information and Computer Sciences 43, 6 (2003), 1906–1915.
[54]
Haoteng Tang, Guixiang Ma, Lifang He, Heng Huang, and Liang Zhan. 2021. CommPOOL: An interpretable graph pooling framework for hierarchical graph representation learning. Neural Networks 143 (2021), 669–677.
[55]
Petar Veličković, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Lio, and Yoshua Bengio. 2018. Graph attention networks. In ICLR (2018).
[56]
Oriol Vinyals, Samy Bengio, and Manjunath Kudlur. 2016. Order matters: Sequence to sequence for sets. In ICLR (2016).
[57]
Xin Wang, Ziwei Zhang, and Wenwu Zhu. 2022. Automated graph machine learning: Approaches, libraries and directions. arXiv preprint arXiv:2201.01288 (2022).
[58]
Xu Wang, Huan Zhao, Lanning Wei, and Quanming Yao. 2022. Graph property prediction on open graph benchmark: A winning solution by graph neural architecture search. arXiv preprint arXiv:2207.06027 (2022).
[59]
Zhili Wang, Shimin Di, and Lei Chen. 2021. AutoGEL: An automated graph neural network with explicit link information. In NeurIPS 34 (2021).
[60]
Zhenyi Wang, Huan Zhao, and Chuan Shi. 2022. Profiling the design space for graph neural networks based collaborative filtering. In WSDM. 1109–1119.
[61]
Lanning Wei, Huan Zhao, and Zhiqiang He. 2022. Designing the topology of graph neural networks: A novel feature fusion perspective. The WebConf.
[62]
Lanning Wei, Huan Zhao, Quanming Yao, and Zhiqiang He. 2021. Pooling architecture search for graph classification. In CIKM. 2091–2100.
[63]
Zonghan Wu, Shirui Pan, Fengwen Chen, Guodong Long, Chengqi Zhang, and S. Yu Philip. 2020. A comprehensive survey on graph neural networks. IEEE Transactions on Neural Networks and Learning Systems (2020).
[64]
Zhenqin Wu, Bharath Ramsundar, Evan N. Feinberg, Joseph Gomes, Caleb Geniesse, Aneesh S. Pappu, Karl Leswing, and Vijay Pande. 2018. MoleculeNet: A benchmark for molecular machine learning. Chemical Science 9, 2 (2018), 513–530.
[65]
Sirui Xie, Hehui Zheng, Chunxiao Liu, and Liang Lin. 2018. SNAS: Stochastic neural architecture search. In ICLR (2018).
[66]
Keyulu Xu, Weihua Hu, Jure Leskovec, and Stefanie Jegelka. 2019. How powerful are graph neural networks? In ICLR (2019).
[67]
Keyulu Xu, Chengtao Li, Yonglong Tian, Tomohiro Sonobe, Ken-ichi Kawarabayashi, and Stefanie Jegelka. 2018. Representation learning on graphs with jumping knowledge networks. In ICML. 5453–5462.
[68]
Pinar Yanardag and S. V. N. Vishwanathan. 2015. Deep graph kernels. In KDD. 1365–1374.
[69]
Zhitao Ying, Jiaxuan You, Christopher Morris, Xiang Ren, Will Hamilton, and Jure Leskovec. 2018. Hierarchical graph representation learning with differentiable pooling. In NeurIPS. 4800–4810.
[70]
Minji Yoon, Théophile Gervet, Bryan Hooi, and Christos Faloutsos. 2020. Autonomous graph mining algorithm search with best speed/accuracy trade-off. In ICDM (2020).
[71]
Jiaxuan You, Rex Ying, and Jure Leskovec. 2020. Design space for graph neural networks. In NeurIPS.
[72]
Hao Yuan and Shuiwang Ji. 2020. StructPool: Structured graph pooling via conditional random fields. In ICLR (2020).
[73]
Arber Zela, Thomas Elsken, Tonmoy Saikia, Yassine Marrakchi, Thomas Brox, and Frank Hutter. 2020. Understanding and robustifying differentiable architecture search. In ICLR (2020).
[74]
Liang Zhang, Xudong Wang, Hongsheng Li, Guangming Zhu, Peiyi Shen, Ping Li, Xiaoyuan Lu, Syed Afaq Ali Shah, and Mohammed Bennamoun. 2020. Structure-feature based graph self-adaptive pooling. In WWW. 3098–3104.
[75]
Muhan Zhang and Yixin Chen. 2017. Weisfeiler-Lehman neural machine for link prediction. In KDD. 575–583.
[76]
Muhan Zhang and Yixin Chen. 2018. Link prediction based on graph neural networks. In NeurIPS. 5165–5175.
[77]
Muhan Zhang, Zhicheng Cui, Marion Neumann, and Yixin Chen. 2018. An end-to-end deep learning architecture for graph classification. In AAAI.
[78]
Yufeng Zhang, Xueli Yu, Zeyu Cui, Shu Wu, Zhongzhen Wen, and Liang Wang. 2020. Every document owns its structure: Inductive text classification via graph neural networks. In ACL. Association for Computational Linguistics, 334–339.
[79]
Huan Zhao, Lanning Wei, and Quanming Yao. 2020. Simplifying Architecture Search for Graph Neural Network. (2020).
[80]
Huan Zhao, Quanming Yao, and Weiwei Tu. 2021. Search to aggregate neighborhood for graph neural network. In ICDE.
[81]
Kaixiong Zhou, Qingquan Song, Xiao Huang, and Xia Hu. 2019. Auto-GNN: Neural architecture search of graph neural networks. (2019).
[82]
Barret Zoph and Quoc V. Le. 2017. Neural architecture search with reinforcement learning. In ICLR (2017).

