1 Introduction
The graph classification task aims to learn a mapping function from a graph to its label. Graph neural networks (GNNs) [69, 77] have been the state-of-the-art (SOTA) methods for this task in recent years, and they require generating a representation for a given graph. This task has applications in various domains, for example, chemistry [19], bioinformatics [58, 69], and text categorization [39, 51, 78]. It can be extended to the link-prediction task by learning the graph representations of the subgraphs around links [41, 59, 60, 75, 76]. To obtain a graph representation, a straightforward way is to add a global pooling function, also called the readout function, at the end of GNNs to pool all node embeddings globally, for example, by taking the mean or summation of all node embeddings as the graph-level representation vector. These methods [66, 77], denoted as global pooling in this article, can capture global information and are easy to build on top of existing GNNs. However, global pooling directly reduces the given graph into one vector, which ignores the hierarchical structure inherent in the compositional nature of graphs [26] and leads to a flat graph-level representation [69]. More advanced pooling methods have been proposed to preserve hierarchical information by aggregating messages on increasingly coarser graphs, which we denote as hierarchical pooling. These methods apply a pooling operation between aggregation operations to reduce the graph size, as shown in Figure 1(a). They can be categorized into two classes according to the node preparation mechanism used when generating the coarse graph. Selection-based pooling methods design the coarse graph by preserving the most informative nodes and the edges among them; representative ones are SAGPool [31] and Graph U-Net [14]. Grouping-based methods, for example, DiffPool [69] and MinCut [2], group nodes into clusters from which the new nodes of the coarse graph are obtained. Graphical illustrations are provided in Figure 1(b).
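As a concrete illustration, the following minimal sketch (our own, not code from any of the cited architectures) shows how a global readout collapses all node embeddings \({\bf H} \in \mathbb{R}^{N \times d}\) into a single flat graph-level vector:

```python
import torch

def global_readout(H: torch.Tensor, mode: str = "mean") -> torch.Tensor:
    # H has shape [N, d]: one d-dimensional embedding per node.
    # Global pooling collapses the whole graph into one flat vector of
    # shape [d], discarding any hierarchical structure of the graph.
    return H.mean(dim=0) if mode == "mean" else H.sum(dim=0)
```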
Despite the success of these pooling methods, real-world graphs come from highly diverse domains and vary greatly in size. Existing methods use fixed pooling operations, which are difficult to generalize across domains, and hand-designing a data-specific pooling architecture for each dataset is time-consuming. Moreover, hierarchical pooling methods are not always beneficial [11, 20, 42], which creates a demand for designing pooling architectures adaptively. Very recently, to obtain data-specific GNN architectures, researchers have turned to neural architecture search (NAS) [38, 82], for example, GraphNAS [17], Policy-GNN [29], and SANE [80]. However, most existing methods are designed for node-level tasks, searching either aggregation operations [17, 80] or the number of layers [29]. When dealing with graph-level tasks, fixed global pooling operations are used in GNNs [6, 71] or jointly learned with aggregation operations [25]; these methods focus on global pooling and fail to incorporate hierarchical pooling. Designing pooling architectures adaptively is non-trivial for two reasons. First, both global and hierarchical pooling methods should be incorporated into the search space due to their effectiveness in different scenarios [11, 42], yet there is no systematic pooling framework for the graph classification task from which to design such a search space. Second, a differentiable search algorithm is desirable on top of the search space because of its search efficiency and effectiveness. However, it is not directly applicable to pooling methods, since the pooling results have different sizes and node sets that cannot be integrated and relaxed; therefore, gradient descent cannot be used.
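To see why the relaxation fails, consider the standard differentiable search recipe (a DARTS-style sketch under our own naming): the discrete choice among candidate operations is relaxed into a softmax-weighted sum of their outputs, which requires all outputs to share one shape, whereas candidate pooling operations return coarse graphs of different sizes:

```python
import torch
import torch.nn.functional as F

def mixed_op(x, candidate_ops, alpha):
    # Differentiable relaxation: a softmax over architecture parameters
    # alpha turns the discrete choice among candidate_ops into a weighted
    # sum of their outputs, so alpha can be learned by gradient descent.
    weights = F.softmax(alpha, dim=0)
    # This sum is only well defined if every op(x) has the same shape.
    # Candidate pooling operations return coarse graphs with different
    # node counts and node sets, so they cannot be summed directly.
    return sum(w * op(x) for w, op in zip(weights, candidate_ops))
```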
To learn adaptive pooling architectures with the help of NAS, we propose a novel method called PAS (Pooling Architecture Search) to address these challenges. It contains a unified pooling framework to design the search space and a coarsening strategy to enable the use of differentiable architecture search. The unified framework is defined based on four key modules derived from existing pooling architectures: Aggregation, Pooling, Readout, and Merge. Considering that the two classes of pooling operations in existing hierarchical methods, selection-based and grouping-based, differ in how they construct the coarse graph, we provide two variants of the Pooling Module that design the pooling operation at different scales over these two classes. In the first variant, we employ selection-based operations in the Pooling Module. These operations focus on designing node score functions, and the coarse graph is formed from the most informative nodes; in other words, the pooling result is designed in a single step. This coarse-grained pooling variant is dubbed PAS-G (Pooling Architecture Search at the graph scale).
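For intuition, a minimal sketch of one selection-based pooling step (in the spirit of SAGPool and Graph U-Net; the names and the particular scoring choice are our own illustrative assumptions) is shown below: a learnable score ranks the nodes, the top \(k\) are kept, and the induced subgraph becomes the coarse graph:

```python
import torch

def topk_selection_pool(A, H, w, ratio=0.5):
    # w is a learnable score vector of shape [d]; the score function is
    # exactly what selection-based methods design and what PAS-G searches.
    score = (H @ w) / w.norm()                 # one importance score per node
    k = max(1, int(ratio * H.shape[0]))
    idx = score.topk(k).indices                # keep the k best-scored nodes
    H_coarse = H[idx] * torch.tanh(score[idx]).unsqueeze(-1)  # gated features
    A_coarse = A[idx][:, idx]                  # edges among kept nodes only;
    return A_coarse, H_coarse                  # everything else is dropped
```

The last two lines make the drawback explicit: everything outside the selected node set is discarded, which motivates the grouping-based alternative discussed next.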
For PAS-G, which is designed on selection-based methods, the unselected nodes and the edges among them are dropped directly when forming the coarse graph, which may cause information loss. To improve graph representation learning, we further introduce grouping-based pooling operations into the Pooling Module, which have the advantage of aggregating information from all nodes and edges in the graph. To unify these two classes of pooling operations, we first refine the Pooling Module and decompose it into Selection, Reduction, and Connection blocks according to the three steps of designing the coarse graph: preparing the nodes of the coarse graph, updating their features, and reconstructing the adjacency matrix, respectively. We then observe that the different node preparation mechanisms in the Selection Block result in mismatched nodes in the coarse graph, as shown in Figure 1(b). To address this problem, we modify the grouping-based methods by assigning nodes to the cluster heads prepared in the Selection Block, from which matched nodes are obtained. Based on this block, the coarse graph is formed by designing the reduction and connection operations, which update the node features and reconstruct the edges in the graph. This fine-grained pooling variant improves the coarse graph design at the node and edge scales and is dubbed PAS-NE. Based on this unified pooling framework, a search space arises naturally by providing a set of candidate operations for each module and block, on top of which the different global and hierarchical methods can be incorporated.
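A minimal sketch of the three blocks follows (our own simplified instantiation; the head-selection and soft-assignment functions here are illustrative placeholders for the searched operations):

```python
import torch
import torch.nn.functional as F

def select_reduce_connect(A, H, ratio=0.5):
    k = max(1, int(ratio * H.shape[0]))
    # Selection Block: prepare k cluster heads (here simply the nodes with
    # the largest feature norms; PAS-NE searches over candidate selection ops).
    heads = H.norm(dim=1).topk(k).indices
    # Soft-assign every node to one of the selected heads, so grouping-based
    # pooling yields the same coarse node set as selection-based pooling.
    S = F.softmax(H @ H[heads].T, dim=1)        # [N, k] assignment matrix
    # Reduction Block: update coarse node features from the grouped nodes.
    H_coarse = S.T @ H                          # [k, d]
    # Connection Block: reconstruct the adjacency among the coarse nodes.
    A_coarse = S.T @ A @ S                      # [k, k]
    return A_coarse, H_coarse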
As for the search algorithm, to address the mismatch problem in the coarse graph, a coarsening strategy is designed to relax the search space by padding the unselected nodes, keeping the node sets identical. We further decompose this strategy into node and edge scales to accommodate the fine-grained variant PAS-NE. The coarse graphs can then be matched with each other, and gradient descent becomes applicable. We conduct extensive experiments on six real-world datasets from five domains, including the large-scale text classification dataset MR [78] and the molecular property prediction dataset ogbg-molhiv [23]. The experimental results demonstrate that the two PAS variants achieve better performance than a wide range of baselines. By designing the nodes and edges of the coarse graph, PAS-NE outperforms PAS-G. With the help of the coarsening strategy, PAS is optimized with gradient descent and achieves a two-orders-of-magnitude reduction in search cost. Furthermore, on the popular OGB leaderboard for graph classification, PAS [58] has ranked 1st, 1st, and 3rd on the MolHIV, PPA, and MolPCBA datasets, respectively, since April 2022, which further shows its effectiveness for the graph classification task.
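A minimal sketch of the node-scale part of this strategy (with hypothetical names; the edge scale is handled analogously) pads the coarse graph back onto the original node set with zeros, so that all candidate pooling outputs share one shape and can be mixed differentiably:

```python
import torch

def pad_coarse_graph(A_coarse, H_coarse, kept_idx, N):
    # Scatter the coarse graph back onto the full node set of size N.
    # Unselected nodes get zero features and no edges, so every candidate
    # pooling operation now returns tensors of shapes [N, N] and [N, d],
    # which can be combined by the weighted sum of differentiable search.
    H_full = torch.zeros(N, H_coarse.shape[1])
    H_full[kept_idx] = H_coarse
    A_full = torch.zeros(N, N)
    A_full[kept_idx.unsqueeze(1), kept_idx] = A_coarse
    return A_full, H_full
```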
To summarize, the contributions of this work are as follows:
• To the best of our knowledge, PAS is the first work to design data-specific pooling architectures for GNN-based graph classification based on differentiable neural architecture search.
• To design an expressive search space for PAS, we propose a unified pooling framework that contains four key modules derived from the literature: Aggregation, Pooling, Readout, and Merge. Two variants, PAS-G and PAS-NE, are proposed on top of the selection-based and grouping-based pooling methods, respectively. With these variants, the global and both classes of hierarchical pooling methods in the literature can be incorporated.
• Considering its efficiency and effectiveness, we provide a coarsening strategy that enables the use of differentiable architecture search.
• We validate the two proposed variants, PAS-G and PAS-NE, on six real-world datasets. The results demonstrate higher performance than various baselines for graph classification. Moreover, PAS-NE outperforms PAS-G by designing coarse graphs at the node and edge scales. In addition, the search efficiency of both variants is improved by two orders of magnitude thanks to the coarsening strategy.
Differences with the Conference Version. Preliminary results of this article were published in CIKM 2021 [62], in which the unified framework was proposed and the PAS-G variant was employed in the Pooling Module. Here, we make the following important extensions:
• We add the comprehensive variant PAS-NE, which subsumes the PAS-G variant and the grouping-based hierarchical pooling methods. A novel search space is constructed based on the designed unified pooling operation (Sections 3.2.2 and 3.2.3). In addition, we further refine the coarsening strategy to enable the use of differentiable architecture search for the PAS-NE variant in particular (Section 3.3.2).
• New datasets and experiments. We use two large-scale datasets, MR and ogbg-molhiv, in this journal version and update the experiments in Section 4.2. We conduct additional ablation studies to evaluate the PAS-NE variant in Sections 4.3 to 4.5.
• We conduct a literature investigation to cover the latest pooling methods for the graph classification task (Section 2.1).
Notations. We represent a graph as \(G = ({\bf A}, {\bf H})\), where \({\bf A} \in \mathbb{R}^{N \times N}\) is the adjacency matrix of the graph, \({\bf H} \in \mathbb{R}^{N \times d}\) contains the node features, \(N\) is the number of nodes, and \(d\) is the feature dimension. For simplicity, all intermediate layers have the same feature dimension \(d\). \(\widetilde{\mathcal{N}}(v) = \lbrace v \rbrace \cup \lbrace u \mid {\bf A}_{uv} \ne 0 \rbrace\) represents the set of self-included first-order neighbors of node \(v\). In an \(L\)-layer GNN, for clarity of presentation, the input graph is denoted by \(G^0 = ({\bf A}^0, {\bf H}^0)\), and the input of the \(l\)-th layer is \(G^{l-1} = ({\bf A}^{l-1}, {\bf H}^{l-1})\). One aggregation and one pooling operation are placed in this layer, and their results are represented as \(G^{la} = ({\bf A}^{l-1}, {\bf H}^{la})\) and \(G^{l} = ({\bf A}^{l}, {\bf H}^{l})\), respectively. The features of node \(v\) in the \(l\)-th layer are denoted by \({\bf h}_v^l\). \(\mathcal{V}^l\) represents the node set of \(G^{la}\), and \(n_l = |\mathcal{V}^l|\) is the input graph size of the \(l\)-th layer.
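To make the notation concrete, here is a minimal sketch (a toy example with names of our own choosing) of a graph stored as an adjacency matrix \({\bf A}\) and a feature matrix \({\bf H}\), together with the self-included neighborhood \(\widetilde{\mathcal{N}}(v)\):

```python
import torch

# A toy graph with N = 4 nodes and d = 3 features, following G = (A, H).
A = torch.tensor([[0., 1., 0., 0.],
                  [1., 0., 1., 0.],
                  [0., 1., 0., 1.],
                  [0., 0., 1., 0.]])
H = torch.randn(4, 3)

def self_included_neighbors(A, v):
    # \tilde{N}(v) = {v} ∪ {u | A_uv != 0}: node v plus its first-order neighbors.
    return {v} | set(torch.nonzero(A[:, v]).flatten().tolist())

print(self_included_neighbors(A, 1))  # {0, 1, 2}
```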