
Ada-MIP: Adaptive Self-supervised Graph Representation Learning via Mutual Information and Proximity Optimization

Published: 07 April 2023

Abstract

Self-supervised graph-level representation learning has recently received considerable attention. Given varied input distributions, jointly learning graphs’ unique and common features is vital to downstream tasks. Inspired by graph contrastive learning (GCL), which targets maximizing the agreement between graph representations from different views, we propose an Adaptive self-supervised framework, Ada-MIP, considering both Mutual Information between views (unique features) and inter-graph Proximity (common features). Specifically, Ada-MIP learns graphs’ unique information through a learnable and probably injective augmenter, which can acquire more adaptive views compared to the augmentation strategies applied by existing GCL methods; to learn graphs’ common information, we employ graph kernels to calculate graphs’ proximity and learn graph representations among which the precomputed proximity is preserved. By sharing a global encoder, graphs’ unique and common information can be well integrated into the graph representations learned by Ada-MIP. Ada-MIP is also extendable to semi-supervised scenarios, with our experiments confirming its superior performance in both unsupervised and semi-supervised tasks.

1 Introduction

Graph-level representation learning (GRL) aims at encoding an entire graph into a low-dimensional embedding, which has a variety of downstream applications in social science, biochemistry, and physics [Senior et al. 2020; Shlomi et al. 2020; Hamilton 2020]. In recent years, graph neural networks (GNNs) have demonstrated superior power in many graph-relevant tasks [Kipf and Welling 2016a; Xu et al. 2018]. However, in the majority of graph-level tasks, such as graph classification, GNNs are trained with a large number of labels, which are extremely expensive to obtain for graph-structured data [Sun et al. 2019]. The label scarcity issue emphasizes the significance of unsupervised and semi-supervised GRL.
One traditional solution to unsupervised GRL is graph kernel methods [Borgwardt and Kriegel 2005; Shervashidze et al. 2011]. Roughly speaking, these methods measure the similarity of two graphs using a kernel function that corresponds to an inner product in a reproducing kernel Hilbert space [Schölkopf et al. 2002]. Unfortunately, graph kernel methods cannot provide the explicit graph embeddings that many machine learning algorithms operate on. Besides, the handcrafted features of graph kernels lead to high-dimensional, sparse, or non-smooth representations and thus result in poor generalization performance [Narayanan et al. 2017]. Another kind of solution is based on language models, such as node2vec [Grover and Leskovec 2016] and graph2vec [Narayanan et al. 2017]. These methods aim at learning certain substructure embeddings as graph representations while paying less attention to global structural information, resulting in poor performance in graph-level tasks.
To handle these issues, self-supervised learning (SSL) has recently attracted more attention, adopting self-defined pseudo-labels as supervision and using the learned representations for downstream tasks. Designing proper SSL principles for GNNs is crucial, as they determine what information about graphs will be captured by GNNs and hence the GNNs' performance in downstream tasks. Among the SSL methods, graph contrastive learning (GCL) stands out by leveraging the InfoMax Principle [Linsker 1988]. Specifically, it maximizes the agreement across the representations of one graph from different views to mine graphs’ unique information. Ideally, a discriminator can distinguish a graph from others based on its learned representation.
Inspired by GCL, given varied graph distributions, we propose to learn graph representations capturing both graphs’ unique and common features, addressing the following two challenges. (i) Adaptive and injective graph augmentation. Previous GCL methods usually apply a preset strategy such as random node/edge dropping [You et al. 2020] or graph diffusion [Hassani and Khasahmadi 2020] to obtain augmented graphs, which limits their adaptability to different input distributions. Moreover, existing methods may yield the same augmentation for two different graphs distinguishable by the Weisfeiler-Lehman (WL) test [Weisfeiler and Leman 1968]. This makes it impossible for subsequent modules to distinguish the augmentations generated from the two graphs. Both the adaptability and the injectivity of the augmenter are significant for mining graphs’ unique information. (ii) Integration with proximity information. GCL focuses on learning a graph’s individual-specific information while neglecting the proximity inherent in graphs’ natural topologies. As shown on the left of Figure 1, graphs \(G_1\) and \(G_3\) have similar structures and the same semantic label but might be mapped to two distant points in the latent space by GCL, impairing its performance on downstream tasks. Thus how to properly integrate graphs’ unique and common information is worth exploring.
Fig. 1.
Fig. 1. Motivation of Ada-MIP. \(g_i\) denotes the representation of graph \(G_i\). The similarity between graph representations learned by GCL is uncertain. In contrast, Ada-MIP learns the graph representations integrating both individual-specific and inter-graph proximity information with the aid of precomputed proximity matrix.
To fully mine graphs’ unique and common information, we develop an Adaptive self-supervised framework considering both Mutual Information (MI) between views and inter-graph Proximity, called Ada-MIP. Specifically, Ada-MIP learns graphs’ unique information through a top-k augmenter, where we summarize the derivation of all the types of augmented graphs as a top-k node/edge selection problem. We also theoretically demonstrate the augmenter’s injectivity for graphs distinguishable by the WL test. Then we follow the InfoMax principle to maximize the MI between learned views; to mine graphs’ common features, we naturally draw on graph kernels. Although these methods cannot provide explicit graph embeddings, they remind us to learn graph representations among which graphs’ proximity should be preserved. Finally, we employ a shared encoder to encode the augmented and original graphs into latent embeddings and project them into two distinct spaces for MI maximization and proximity preservation, respectively. As shown on the right of Figure 1, Ada-MIP learns graph representations integrated with both graphs’ unique and common information.
We further extend Ada-MIP to Ada-MIP-AD for semi-supervised learning. With a handful of labels, Ada-MIP-AD aims at learning augmented views by applying adversarial training based on the top-k augmenter, which retains more information about the labels and less redundant information between views. We then integrate our frameworks into an existing framework, InfoGraph* [Sun et al. 2019], for semi-supervised tasks.
Our main contributions are summarized as follows:
We explore the significance of combining graphs’ unique and common information for self-supervised graph-level representation learning.
We make graph augmentations learnable by designing a unified top-k graph augmenter and theoretically demonstrate its injectivity for different graphs distinguishable by the WL test.
We extend Ada-MIP to Ada-MIP-AD to learn task-dependent views for semi-supervised tasks on graphs.
Empirical findings validate the superb quality of graph-level representations learned by Ada-MIP in both unsupervised and semi-supervised tasks.
Organization. The rest of this article is organized as follows. Sections 2 and 3 introduce the related research and background knowledge. We then elaborate on the proposed Ada-MIP in unsupervised and semi-supervised scenarios in Sections 4 and 5, respectively. The experimental studies are presented in Section 6, followed by concluding remarks in Section 7.

2 Related Work

2.1 Graph Kernels

Graph kernels refer to a class of methods that compare pairs of graph data points using particular similarity measures. Research on graph kernels can be traced back to 2003 [Gärtner et al. 2003; Kashima et al. 2003]. Since then, several kernels specifically designed for graphs have been proposed. The Shortest-Path (SP) kernel [Borgwardt and Kriegel 2005] compared the attributes and lengths of the shortest paths between all pairs of vertices in two graphs. The Graphlet kernel [Shervashidze et al. 2009] was based on counting occurrences of subgraph patterns of a fixed size. Shervashidze et al. [2011] introduced the Weisfeiler-Lehman subtree kernel for graphs with discrete labels based on the 1-dimensional Weisfeiler-Lehman (1-WL) test. The Sub-graph Matching kernel [Kriege and Mutzel 2012] was computed by considering all bijections between all sub-graphs on at most \(k\) vertices. The Pyramid Match (PM) kernel [Nikolentzos et al. 2017] represented each graph as a set of vectors corresponding to the embeddings of its vertices and employed the Earth Mover’s Distance metric for similarity computation. The Multiscale Laplacian Graph (MLG) kernel [Kondor and Pan 2016] was able to capture the graph structure at multiple scales, i.e., neighborhoods around vertices of increasing depth, by using ideas from spectral graph theory.

2.2 Graph Contrastive Learning

In recent years, GCL has emerged as a promising method for self-supervised graph representation learning. Deep Graph Infomax (DGI) [Velickovic et al. 2019] and InfoGraph [Sun et al. 2019] followed the InfoMax Principle [Linsker 1988] to learn node-level/graph-level representations by maximizing the mutual information (MI) between global and local representations. Graph Contrastive Learning (GraphCL) [You et al. 2020] applied hand-picked graph augmentations to incorporate various priors and learned to maximize the agreement of graph augmentations originating from the same graph. Multi-View Graph Representation Learning (MVGRL) [Hassani and Khasahmadi 2020] augmented graphs via graph diffusion and randomly sampled corresponding sub-graphs from the original and augmented graphs as two views. It then cross-contrasted the local and global features of the two views. GCA [Zhu et al. 2021b] used heuristics based on network centrality to adaptively sample the nodes (features) to be preserved. GMI [Peng et al. 2020] aimed at maximizing the MI between the input, i.e., node features and topological structure, and the output embeddings of the GNN encoder. Joint Augmentation Optimization (JOAO) [You et al. 2021] extended GraphCL by employing a bi-level min-max optimization to select the combination of augmentations. Adversarial Graph Contrastive Learning (AD-GCL) [Suresh et al. 2021] employed an adversarial training strategy to learn a GNN-based augmenter and a GNN encoder, which aim at capturing the minimal sufficient information to distinguish graphs. AutoGCL [Yin et al. 2021] used a set of learnable graph view generators, each of which learned a probability distribution of augmented graphs in the contrastive learning procedure. SimGRACE [Xia et al. 2022] augmented input graphs by perturbing the parameters of the graph encoder, thus not requiring manually picked graph augmentations. Although existing works have made some attempts at automatic augmentation in graph contrastive learning, they do not consider the injectivity of the employed augmenter or the inter-graph proximity information. Instead, our Ada-MIP goes beyond state-of-the-art GCL frameworks by designing a theoretically guaranteed injective graph augmenter that better preserves a graph’s unique information, together with a multi-head encoder that integrates both individual-specific and inter-graph proximity information.

3 Preliminaries

3.1 Graph Neural Networks

We first review some background knowledge of GNN. Given a graph \(G=(\mathcal {V}, \mathcal {E})\) with raw node features \(X\in \mathcal {R}^{|\mathcal {V}|\times d_x}\), the feature updating process in GNN’s \(l\)th layer is as follows:
\begin{equation} h_{v}^{(l)}=\rho \left(h_{v}^{(l-1)}, \omega \left(\left\lbrace h_{u}^{(l-1)}: u \in \mathcal {N}(v)\right\rbrace \right)\right) , \end{equation}
(1)
where \(h_{v}^{(l)}\) is the hidden feature of node \(v\) in GNN’s \(l\)th layer and \(h_{v}^{(0)}\) is initialized as \(X_v\). \(\lbrace h_{u}^{(l-1)}: u \in \mathcal {N}(v)\rbrace\) is the multiset (a set allows possibly repeating elements) of the neighboring node features. \(\omega\) and \(\rho\) are the aggregation and updating functions, respectively. For example, in Graph Isomorphism Network (GIN) [Xu et al. 2018], the multi-layer perceptron (MLP) is used to fit the composition of aggregation and combination functions:
\begin{equation} h_{v}^{(l)}=\mathrm{MLP}^{(l)}\left(\left(1+\epsilon ^{(l)}\right) \cdot h_{v}^{(l-1)}+\sum _{u \in \mathcal {N}(v)} h_{u}^{(l-1)}\right), \end{equation}
(2)
where \(\epsilon\) is a learnable parameter or a fixed scalar. After \(L\) layers of iteration, the graph representation \(h_G\) is calculated as
\begin{equation} h_{G}=\operatorname{Readout}\left(\left\lbrace h_{v}^{(L)} \mid v \in G\right\rbrace \right). \end{equation}
(3)
The Readout function can be any permutation-invariant function such as summation or maximization.
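To make the update rule concrete, here is a minimal sketch of the GIN layer in Equation (2) and the sum readout in Equation (3), written with a dense adjacency matrix; the two-layer MLP and the tensor shapes are illustrative choices rather than the exact configuration used in the paper.

```python
# Minimal GIN-style layer (Equation (2)) and sum readout (Equation (3)).
import torch
import torch.nn as nn


class GINLayer(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.eps = nn.Parameter(torch.zeros(1))  # learnable epsilon
        self.mlp = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, h: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
        # adj @ h sums the features of each node's neighbors (multiset aggregation)
        return self.mlp((1.0 + self.eps) * h + adj @ h)


def readout(h: torch.Tensor) -> torch.Tensor:
    # permutation-invariant graph-level readout (summation)
    return h.sum(dim=0)


if __name__ == "__main__":
    n_nodes, dim = 5, 16
    adj = (torch.rand(n_nodes, n_nodes) > 0.5).float()
    adj = ((adj + adj.t()) > 0).float().fill_diagonal_(0)  # symmetric, no self-loops
    h = torch.randn(n_nodes, dim)
    h_G = readout(GINLayer(dim)(h, adj))
    print(h_G.shape)  # torch.Size([16])
```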

3.2 Weisfeiler-Lehman Test

We briefly introduce the WL graph isomorphism test [Weisfeiler and Leman 1968] to establish the groundwork for later theoretical analysis. The graph isomorphism problem asks whether two graphs have the same topology. No polynomial-time algorithm is known for this problem [Babai 2016]. The WL test is an effective algorithm to discern non-isomorphic graphs except for some corner cases [Cai et al. 1992]. Concretely, it comprises a two-step iterative process: for each node, (1) aggregate the labels from its neighborhood, and (2) use a hash function to map its label and the aggregated labels to a unique new label. The WL test decides that two graphs are non-isomorphic once they have different label distributions in any iteration. It is noteworthy that this two-step process is strikingly similar to the aggregation and combination process in GNNs. Thus Xu et al. [2018] investigated the link between GNNs and the WL test and proved that traditional GNNs are at most as powerful as the WL test, upon which we build our theoretical analysis later.
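As an illustration of this two-step process, the sketch below runs 1-WL color refinement on two small graphs given as adjacency lists and compares their label histograms; the dictionary-based graph format and the fixed iteration count are our own simplifications of the test.

```python
# 1-WL color refinement: aggregate neighbor labels, then hash into new labels.
from collections import Counter


def wl_histogram(adj: dict, labels: dict, iters: int = 3) -> Counter:
    for _ in range(iters):
        labels = {
            v: hash((labels[v], tuple(sorted(labels[u] for u in adj[v]))))
            for v in adj
        }
    return Counter(labels.values())


def wl_indistinguishable(adj1, labels1, adj2, labels2) -> bool:
    # True means the WL test cannot tell the two graphs apart
    return wl_histogram(adj1, labels1) == wl_histogram(adj2, labels2)


# a triangle vs. a path on three nodes: distinguishable by 1-WL
triangle = {0: [1, 2], 1: [0, 2], 2: [0, 1]}
path = {0: [1], 1: [0, 2], 2: [1]}
uniform = {0: 1, 1: 1, 2: 1}
print(wl_indistinguishable(triangle, uniform, path, uniform))  # False
```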

4 Proposed Ada-MIP

This section gives a detailed introduction to the proposed framework, Ada-MIP. The overview of Ada-MIP is shown in Figure 2. We first introduce the top-k augmenter, which learns adaptive and injective graph augmentations, followed by the multi-head encoder, which learns graph representations integrated with both individual-specific and inter-graph proximity information.
Fig. 2.
Fig. 2. Overview of Ada-MIP. Given a graph \(G\), Ada-MIP first encodes it with the augmentation encoder \(g_{\eta }\) and obtains the combination \(C_n/C_e\) of node/edge and graph embeddings. Then it uses \(M\) top-k augmentation heads \(\lbrace q_{\phi _i}\rbrace _{i=1}^{M}\) to learn the augmented graphs \(\lbrace t_{\phi _i}^{\psi _i}(G)\rbrace _{i=1}^{M}\) based on the node/edge score vector \(\mathbf {y_n}/\mathbf {y_e}\). We put graph \(G\) and its augmented graphs \(\lbrace G_i=t_{\phi _i}^{\psi _i}(G)\rbrace _{i=1}^{M}\) through the shared encoder \(g_{\theta }\) and use two projection heads, i.e., \(f_c\) and \(f_p\), to project the embeddings of the augmented and original graphs into two latent spaces. Finally, the contrastive and proximity losses are optimized simultaneously.

4.1 Top-k Graph Augmenter

Recall that graph contrastive learning aims at preserving graph’s unique information by maximizing the mutual information between representations from different views. As a key component, a good graph augmenter should satisfy the following two properties: (1) adaptivity, i.e., learning informative augmented graphs according to the input graph distribution; (2) injectivity, i.e., for different input graphs, generating augmented graphs distinguishable by subsequent encoder and discriminator. We then provide our top-k graph augmenter, which satisfies these two properties.
Formally, assume that we are given a set of training graphs \(\lbrace G_n\rbrace _{n=1}^N\) in an input space \(\mathcal {G}\), with empirical probability distribution \(\mathcal {P}\). Let \(\Phi\) be a family of graph transformations (e.g., node dropping), \(\Psi\) be the parameter space and \(t_{\phi }^{\psi }:\mathcal {G}\rightarrow \mathcal {G}\) be a parametric function with transformation \(\phi \in \Phi\) and parameters \(\psi \in \Psi\) (e.g., a neural network). Then we define \(\mathcal {D}_{\phi ,\mathcal {P}}^{\psi }\) as the marginal distribution induced by pushing samples from \(\mathcal {P}\) through \(t_{\phi }^{\psi }\). In other words, \(\mathcal {D}_{\phi ,\mathcal {P}}^{\psi }\) is the distribution of augmented graphs \(t_{\phi }^{\psi }(G)\), where \(G\sim \mathcal {P}\). Then given two types of graph transformations \(\phi _i,\phi _j\in \Phi\), we aim at learning the parameters \(\psi _i\) and \(\psi _j\) to maximize the mutual information, i.e., \(I(\mathcal {D}_{\phi _i,\mathcal {P}}^{\psi _i}; \mathcal {D}_{\phi _j,\mathcal {P}}^{\psi _j})\). For simplicity, we signify \(\mathcal {D}_{\phi _i,\mathcal {P}}^{\psi _i}\) as \(v_i\) in the rest of the article.
After determining the criteria of augmenter, we then provide an instantiation of obtaining \(t^{\psi _i}_{\phi _i}(G)\) and \(t^{\psi _j}_{\phi _j}(G)\) while leaving the optimization of \(I(v_i;v_j)\) to Section 4.2. Following [You et al. 2020], we consider four standard forms of graph transformations as \(\Phi\), including node dropping (ND), edge perturbation (EP), subgraph inducing (SI), and feature masking (FM). Specifically, the top-k augmenter consists of an augmentation encoder and four augmentation heads.
Augmentation encoder. For an input graph \(G=(\mathcal {V}, \mathcal {E})\) with node feature matrix \(X\in \mathcal {R}^{|\mathcal {V}|\times d_x}\), a GNN augmentation encoder \(g_\eta (\cdot): \mathcal {R}^{|\mathcal {V}| \times d_{x}} \times \mathcal {R}^{|\mathcal {V}|^2} \rightarrow \mathcal {R}^{|\mathcal {V}| \times d_{h}} \times \mathcal {R}^{d_{h}}\) shared by all the transformations learns node representations \(H_{v}\) and a graph representation \(h_G\) to provide subsequent modules with expressive encodings. Specifically, it consists of a GNN producing node representations and a read-out function (summation) aggregating these representations into a graph representation. To encode graphs, we opt for expressive power and adopt the Graph Isomorphism Network.
Top-k augmentation head. Given \(H_{v}\) and \(h_G\), we aim at learning the importance score of each node/edge and selecting the top-k ones that preserve most of the graph’s unique information for each transformation. Specifically, we select the top-k important nodes to be preserved/feature-preserved for ND/FM, the top-k important edges from all the node pairs for EP, and the top-one important node with its \(l\)-hop neighborhood as the subgraph for SI. Formally, for each \(\phi \in \Phi\), the propagation rule of the top-k augmentation head \(q_{\phi }(\cdot):\mathcal {G}\times \mathcal {R}^{|\mathcal {V}| \times d_{h}} \times \mathcal {R}^{d_{h}}\rightarrow \mathcal {G}\) is defined as follows:
\begin{equation} \begin{aligned}\mathbf {y_{n}^{\phi }} &=\operatorname{MLP}_{n}^{\phi }(\operatorname{C_n}), \quad \operatorname{C_n}=\left[h_{v} \,\Vert \, h_{G}\right]_{v\in \mathcal {V}} \\ \mathbf {y_{e}^{\phi }} &=\operatorname{MLP}_{e}^{\phi }(\operatorname{C_e}), \quad \operatorname{C_e}=\left[h_{u}+h_{v} \,\Vert \, h_{G} \,\Vert \, \mathbb {1}_{(u,v)\in \mathcal {E}}\right]_{u,v\in \mathcal {V}} \\ \mathrm{idx_{n}^{\phi }} &=\operatorname{Rank}(\mathbf {y_{n}^{\phi }}, k_n^{\phi }) \\ \mathrm{idx_{e}^{\phi }} &=\operatorname{Rank}(\mathbf {y_{e}^{\phi }}, k_e^{\phi }) \\ X^{\phi } &=\mu _X^{\phi }(X, A, \mathrm{idx_{n}^{\phi }}) \\ A^{\phi } &=\mu _A^{\phi }(A, \mathrm{idx_{n}^{\phi }}, \mathrm{idx_{e}^{\phi }}, \mathbf {y_{n}^{\phi }}, \mathbf {y_{e}^{\phi }}) \end{aligned} \end{equation}
(4)
where \(\operatorname{C_n}\)/\(\operatorname{C_e}\) indicates the combination of node/edge and graph embeddings for node/edge selection and \(\mathbb {1}\) is the indicator function. For large graphs, \(C_e\) can be obtained from all the existing edges and randomly sampled non-existing edges. \(\operatorname{MLP}\) denotes the multi-layer perceptron. \(\mathbf {y_{n}^{\phi }} \in \mathcal {R}^{|\mathcal {V}|\times 1}\)/\(\mathbf {y_{e}^{\phi }}\in \mathcal {R}^{|\mathcal {V}|^2\times 1}\) denotes the node/edge scores and \(\operatorname{Rank}(\mathbf {y}, k)\) returns the indices of the \(k\)-largest values in \(\mathbf {y}\). In practice, \(k_n\)/\(k_e\) is set as a ratio of node/edge numbers to fit different graph sizes. \(\mu _X^{\phi }\) and \(\mu _A^{\phi }\) are the transformation-specific functions for the augmented feature matrix \(X^{\phi }\) and adjacency matrix \(A^{\phi }\), respectively. For ND and SI, \(X^{ND}\)/\({X^{SI}}\) is the submatrix of \(X\) representing the features of the top \(k^{ND}_n\) nodes or the nodes in the \(l\)-hop subgraph centered on the top-one node. \(A^{ND}\)/\(A^{SI}\) contains the edges induced by the preserved nodes, with the sum of their corresponding node scores as edge weight. For EP, \(X^{EP}\) is identical to \(X\) while \(A^{EP}\) contains the selected \(k^{EP}_e\) edges weighted by edge scores. For FM, the features of the top \(k^{FM}_n\) nodes are preserved from \(X\) while the others are masked with Gaussian noise. The edges in \(A^{FM}\) are identical to those in \(A\) while weighted in the same way as ND and SI. The detailed algorithms for the four augmentation heads can be found in the Appendix.
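For concreteness, the sketch below shows one reading of Equation (4) for the ND head: node scores come from an MLP over the combination of node and graph embeddings, the top-k nodes are kept, and the induced edges are re-weighted by the sum of their endpoint scores. The concatenation-based scorer and the module names are our assumptions, not the exact implementation.

```python
# A possible node-dropping (ND) augmentation head following Equation (4).
import torch
import torch.nn as nn


class NodeDropHead(nn.Module):
    def __init__(self, dim: int, keep_ratio: float = 0.9):
        super().__init__()
        self.keep_ratio = keep_ratio
        self.scorer = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU(), nn.Linear(dim, 1))

    def forward(self, H: torch.Tensor, h_G: torch.Tensor, adj: torch.Tensor):
        n = H.size(0)
        C_n = torch.cat([H, h_G.unsqueeze(0).expand(n, -1)], dim=1)  # [h_v || h_G]
        y_n = self.scorer(C_n).squeeze(-1)                           # node scores
        k = max(1, int(self.keep_ratio * n))
        idx = torch.topk(y_n, k).indices                             # Rank(y_n, k)
        sub_adj = adj[idx][:, idx]                                   # edges induced by kept nodes
        weights = y_n[idx].unsqueeze(1) + y_n[idx].unsqueeze(0)      # sum of endpoint scores
        return idx, sub_adj * weights                                # weighted augmented adjacency


H = torch.randn(6, 16)
h_G = H.sum(dim=0)
adj = torch.ones(6, 6).fill_diagonal_(0)
kept_idx, aug_adj = NodeDropHead(16)(H, h_G, adj)
```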
Crucially, by introducing edge weights to the augmented graphs, the top-k graph augmenter is not only trainable by back-propagation but also enjoys injectivity. Note that according to Lemma 1, GNNs are never injective between the graph space and the representation space since a GNN’s expressive power is limited by the WL test. Thus we slightly abuse the definition of “injectivity” and redefine it in the graph space.
Lemma 1.
Given any two non-isomorphic graphs \(G_1\) and \(G_2\), if a GNN \(g:\mathcal {G}\rightarrow \mathcal {R}^d\) maps \(G_1\) and \(G_2\) to different embeddings, the Weisfeiler-Lehman graph isomorphism test judges \(G_1\) and \(G_2\) as non-isomorphic.
Due to the limitation of GNN’s expressive power, we consider GNNs as powerful as the WL test, which satisfy the conditions in Lemma 2.
Lemma 2.
Let \(g:\mathcal {G}\rightarrow \mathcal {R}^d\) be a GNN. \(g\) maps any graphs \(G_1\) and \(G_2\) judged as non-isomorphic by the WL test to different embeddings if the following conditions hold:
(a) The aggregation function \(\omega\) and updating function \(\rho\) are injective.
(b) The graph-level readout function is injective.
We define these GNNs as Injective GNN (I-GNN) and redefine the injectivity in the graph space.
Definition 1 (Injectivity in the Graph Space).
Given a function \(t:\mathcal {G}\rightarrow \mathcal {G}\) and any two graphs \(G_1,G_2\in \mathcal {G}\) distinguishable by I-GNN, \(t\) is injective if \(t(G_1)\) and \(t(G_2)\) can also be distinguished by I-GNN.
Note that I-GNNs map graphs distinguishable by the WL test to different representations, while our top-k augmentation head can map the graph representations to different augmented graphs, bridging the gap from representation space to graph space. Specifically, with the I-GNN as the augmentation encoder, the top-k augmenter satisfies the injectivity.
Theorem 1 (Injectivity of Augmenter).
Assume the input feature space \(\mathcal {X}\) is countable and contains no zero-vectors. Then for any transformation \(\phi \in \lbrace ND,SI,FM,EP\rbrace\) with parameters \(\psi\), the top-k augmenter \(t_{\phi }^{\psi }:\mathcal {G}\rightarrow \mathcal {G}\) is injective.
The proofs of Lemmas 1 and 2 can be found in Xu et al. [2018], and the proof of Theorem 1 is provided in the Appendix.

4.2 Multi-head Encoder

Given graph \(G\) and its \(M\) augmentations \(\lbrace G_{\phi _i}=t_{\phi _i}^{\psi _i}(G)\rbrace _{i=1}^{M}\) as input, the multi-head encoder aims at learning high-quality representations via MI maximization and inter-graph proximity preservation. To this end, we borrow ideas from multi-task learning whose architectures usually follow the same outline [Crawshaw 2020]: a shared feature extractor and an individual output branch for each task. Drawing on its essence, we employ a shared encoder with two projection heads supervised by the contrastive loss and proximity loss, respectively.
Shared encoder. We first pass \(\lbrace G_{\phi _i}\rbrace _{i=1}^{M}\) and \(G\) through a shared GNN encoder \(g_\theta (\cdot)\) to obtain their graph-level embeddings \(\lbrace h^{\prime }_{G_{\phi _i}}\!\rbrace _{i=1}^{M}\) and \(h^{\prime }_G\) as input to the projection heads. The architecture of the shared encoder is the same as that of the augmentation encoder.
Contrastive projection head. We use a non-linear transformation \(f_c(\cdot)\) to map the augmented graph embeddings \(\lbrace h^{\prime }_{G_{\phi _i}}\!\rbrace _{i=1}^{M}\) to a new latent space, i.e., \(z_i=f_c(h^{\prime }_{G_{\phi _i}}\!)\). Then for each pair of views \((i,j)\), we maximize the consistency between positive pairs \(z_i,z_j\) compared with negative pairs by optimizing the normalized temperature-scaled cross entropy loss (NT-Xent) [Oord et al. 2018] which is defined as follows:
\begin{equation} \ell _{cont}=\sum _{1 \le i\lt j \le M} \ell (z_i,z_j)=\sum _{1 \le i\lt j \le M}-\frac{1}{N}\sum _{n=1}^{N}\log \frac{e^{(\operatorname{sim}(\boldsymbol {z}_{n, i}, \boldsymbol {z}_{n, j})/\tau)}}{\sum _{n^{\prime }=1, n^{\prime } \ne n}^{N} e^{(\operatorname{sim}(\boldsymbol {z}_{n, i}, \boldsymbol {z}_{n^{\prime }, j})/\tau)}}, \end{equation}
(5)
where \(N\) denotes the batch size, \(\operatorname{sim}(\cdot)\) denotes the cosine similarity and \(\tau\) is the temperature hyper-parameter. As proved in You et al. [2020], NT-Xent fits the formulation of the InfoNCE loss [Oord et al. 2018]. Thus minimizing NT-Xent is equivalent to maximizing a lower bound of \(I(v_i;v_j)\). Supervised by NT-Xent, the top-k augmenter can learn informative views, making it easier for subsequent modules to optimize \(I(v_i;v_j)\). Thus the shared encoder can learn graph representations preserving most of the graphs’ unique information.
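The sketch below computes the NT-Xent loss of Equation (5) for one pair of views, with row n of z_i and z_j holding the projections of graph n under the two augmentations; it is a minimal re-implementation, not the authors’ code.

```python
# NT-Xent for one pair of views (Equation (5)).
import torch
import torch.nn.functional as F


def nt_xent(z_i: torch.Tensor, z_j: torch.Tensor, tau: float = 0.2) -> torch.Tensor:
    z_i = F.normalize(z_i, dim=1)
    z_j = F.normalize(z_j, dim=1)
    sim = z_i @ z_j.t() / tau                      # cosine similarities between all pairs
    pos = sim.diag()                               # sim(z_{n,i}, z_{n,j})
    diag = torch.eye(sim.size(0), dtype=torch.bool, device=sim.device)
    # the denominator of Equation (5) sums over negatives z_{n',j} with n' != n
    denom = torch.logsumexp(sim.masked_fill(diag, float("-inf")), dim=1)
    return -(pos - denom).mean()


loss = nt_xent(torch.randn(128, 64), torch.randn(128, 64))
```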
Proximity projection head. Similar to the contrastive projection head, we apply another non-linear transformation \(f_p(\cdot)\) to map the original graph embedding \(h^{\prime }_G\) to another latent space, i.e., \(z=f_p(h^{\prime }_{G})\). Given the embeddings \(\lbrace z_n\rbrace _{n=1}^N\) of graph samples in a batch and a pre-defined proximity metric, we first use the metric to determine the proximity between all pairs of graphs in a batch, obtaining a proximity matrix \(S\). Then we force the similarity between any pair of graph embeddings \(z_n\) and \(z_{n^{\prime }}\) to be consistent with the values in \(S\) by optimizing the proximity loss:
\begin{equation} \ell _{prox}=\mathbb {E}_{(n,n^{\prime })}\left[\left(\operatorname{sim}(z_{n}, z_{n^{\prime }})-S_{n,n^{\prime }}\right)^{2}\right], \end{equation}
(6)
where \((n,n^{\prime })\) is sampled from a batch with equal probability. \(S\) can be derived from any graph kernel or domain knowledge; we find that the Weisfeiler-Lehman Subtree Kernel [Shervashidze et al. 2011] works well and make it the default choice of Ada-MIP.
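A minimal sketch of this objective is given below, assuming a squared-error consistency between the pairwise cosine similarities of the projected embeddings and the precomputed proximity matrix S (e.g., from a WL subtree kernel); the exact functional form and the function name are our assumptions.

```python
# Proximity preservation: match embedding similarities to the kernel proximity S.
import torch
import torch.nn.functional as F


def proximity_loss(z: torch.Tensor, S: torch.Tensor) -> torch.Tensor:
    z = F.normalize(z, dim=1)
    sim = z @ z.t()                                      # pairwise cosine similarities
    off_diag = ~torch.eye(z.size(0), dtype=torch.bool, device=z.device)
    # average the squared deviation over all pairs (n, n') with n != n'
    return ((sim - S)[off_diag] ** 2).mean()


z = torch.randn(128, 64)          # projected embeddings f_p(g_theta(G)) of a batch
S = torch.rand(128, 128)          # placeholder for the precomputed proximity matrix
loss = proximity_loss(z, S)
```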
Practically, two-layer perceptrons (MLP) are applied to model \(f_c\) and \(f_p\). Overall, the unsupervised loss is
\begin{equation} \begin{aligned}\ell _{u}=\ell _{prox}+\alpha \ell _{cont} \end{aligned}, \end{equation}
(7)
where \(\alpha \in [0, 1]\) is a hyper-parameter to control the contributions towards the overall objective. After training, we can simply maintain the shared encoder \(g_\theta\) for downstream tasks.

4.3 Time Complexity Analysis

In this section, we analyze the complexity of Ada-MIP with GIN as the GNN encoder and the WL Subtree kernel as the preprocessing method. Suppose the numbers of nodes and edges in the input graph are \(N\) and \(M\), respectively. Let \(s\) denote the number of training epochs, \(B\) the size of each training batch, \(V\) the number of input graphs, \(d\) the embedding size, and \(L\) the total number of GNN layers in the augmentation and shared encoders.
Specifically, in the preprocessing phase, Ada-MIP adopts the WL Subtree kernel, whose complexity is \(O(M)\), and pre-computes the proximity matrix of graphs in the same batch. Thus the complexity of preprocessing is \(O(\frac{V}{B}B^2M)=O(VBM)\). In the learning phase, Ada-MIP uses GIN as the graph encoders, whose complexity is \(O(BLMd)\). As for contrastive learning, both the positive and negative samples are drawn from the augmented graphs in the same batch. Within a batch, the complexity of calculating Equation (5) is \(O(B^2d)\). As for proximity learning, the graph pairs are also sampled from the same batch and the complexity of calculating Equation (6) is also \(O(B^2d)\). Therefore, the time complexity of the whole training phase is \(O(s\frac{V}{B}(BLMd+B^2d))=O(sV(LMd+Bd))\) and the overall complexity of Ada-MIP is \(O(VBM+sV(LMd+Bd))\). It is notable that the time complexity of the graph contrastive learning baselines (e.g., GraphCL [You et al. 2020], AD-GCL [Suresh et al. 2021]) in the experiments is also \(O(sV(LMd+Bd))\). Thus the time complexity of Ada-MIP is on par with the baselines. Also notice that GVAE [Kipf and Welling 2016b], one of the unsupervised baselines, has \(O(sV(LMd+N^2d))\) time complexity due to the adjacency matrix reconstruction in the VAE framework, where \(N\) denotes the number of nodes in the graphs. Since \(M\ll N^2\) for sparse graphs, our proposed method is much more scalable than GVAE.

5 Semi-supervised Ada-MIP

Based on the semi-supervised framework InfoGraph* [Sun et al. 2019], Ada-MIP can be easily extended to semi-supervised tasks. Specifically, InfoGraph* uses a supervised module \(c_{\zeta }=g_{\zeta }\circ f_{\zeta }\) comprised of a GNN encoder \(g_{\zeta }\) and a prediction head \(f_{\zeta }\) on labeled data and an unsupervised module, which could be any self-supervised framework (e.g., Ada-MIP), on all data. Then InfoGraph* maximizes the MI between embeddings learned by the supervised encoder and the unsupervised encoder to transfer knowledge between the two modules. In each iteration, it samples an unlabeled batch and a labeled batch. The total loss of Ada-MIP embedded into InfoGraph* is defined as follows:
\begin{equation} \begin{aligned}\ell _{t}=\sum _{i=1}^{|G^u|+|G^l|}\ell _{u}+ \lambda \sum _{i=1}^{|G^u|+|G^l|} \ell _{m}(g_{\theta }(G_i);g_{\zeta }(G_i)) +\sum _{i=1}^{|G^l|}\ell _{s}(c_{\zeta }(G_i),y_i) \end{aligned} , \end{equation}
(8)
where \(\ell _{m}\) can be any MI estimator such as the NT-Xent or the Jensen-Shannon (JS) MI estimator [Nowozin et al. 2016] and \(\lambda\) is a tunable hyper-parameter. \(\ell _{s}\) denotes the supervised loss measuring the discrepancy between the output \(c_{\zeta }(G_i)\) and the label \(y_i\). The terms \(\ell _u\) and \(\ell _{m}\) are for all data while \(\ell _{s}\) is only for labeled data.
However, directly embedding Ada-MIP into InfoGraph* does not make full use of the valuable labels. In the unsupervised scenario, we propose to maximize the shared information between views since we are not sure which one would benefit the downstream tasks. But if a handful of labels \(y\) are available, we can guide the top-k graph augmenter \(t_{\phi }^{\psi }(G)\) to retain as much task-relevant information as possible while dropping irrelevant noise between views based on the InfoMin principle [Tian et al. 2020]. In other words, let \(y\) be the ground-truth labels; we aim at retaining \(\sum _{i=1}^M I(v_i;y)\) while minimizing \(\sum _{1\le i\lt j\le M}I(v_i;v_j)\).
To this end, we introduce \(M\) supervised modules \(\lbrace c_{\phi _i}\rbrace _{i=1}^M\) whose architectures are identical to \(c_{\zeta }\) and apply the adversarial training strategy [Goodfellow et al. 2014] to optimize:
\begin{equation} \min _{g_{\theta },f_{c}, f_{p}} \max _{\lbrace t_{\phi _i}^{\psi _i}\rbrace _{i=1}^M,\lbrace c_{\phi _i}\rbrace _{i=1}^M} -\sum _{j=1}^{|G^l|}\sum _{i=1}^M\ell _{s}(c_{\phi _i}(t_{\phi _i}^{\psi _i}(G_j)),y_j)+\sum _{i=1}^{|G^u|+|G^l|}\ell _{u}, \end{equation}
(9)
After combining Equation (8) with Equation (9), the final objective of semi-supervised Ada-MIP becomes:
\begin{equation} \begin{aligned}\min _{g_{\theta },f_{c}, f_{p},c_{\zeta }}& \max _{\lbrace t_{\phi _i}^{\psi _i}\rbrace _{i=1}^M,\atop \lbrace c_{\phi _i}\rbrace _{i=1}^M} \sum _{j=1}^{|G^u|+|G^l|}[\ell _{u} +\lambda \ell _{m}(g_{\theta }(G_j);g_{\zeta }(G_j))] \\ &+\sum _{j=1}^{|G^l|}\left[\ell _{s}(c_{\zeta }(G_j),y_j) -\sum _{i=1}^M\ell _{s}(c_{\phi _i}(t_{\phi _i}^{\psi _i}(G_j)),y_j)\right], \end{aligned} \end{equation}
(10)
where the first two terms are applied to all data and the last two are only for labeled data. At inference time, we can only maintain the supervised module \(c_{\zeta }\) for predictions.
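To illustrate how the min-max objective of Equation (10) can be optimized, the schematic below alternates a maximization step over a stand-in augmenter and a minimization step over a stand-in encoder using two optimizers; the toy modules and the placeholder objective only show the control flow and are not the authors’ implementation.

```python
# Adversarial alternation for a min-max objective (schematic of Equation (10)).
import torch
import torch.nn as nn

dim = 8
encoder = nn.Linear(dim, dim)        # stands in for g_theta, f_c, f_p, and c_zeta
augmenter = nn.Linear(dim, dim)      # stands in for {t_{phi_i}^{psi_i}} and {c_{phi_i}}

opt_min = torch.optim.Adam(encoder.parameters(), lr=1e-3)
opt_max = torch.optim.Adam(augmenter.parameters(), lr=1e-3)


def objective(x: torch.Tensor) -> torch.Tensor:
    # placeholder for l_u + lambda * l_m + l_s(c_zeta) - sum_i l_s(c_{phi_i})
    return (encoder(augmenter(x)) ** 2).mean()


for _ in range(100):
    x = torch.randn(32, dim)

    # maximization step: augmenter and view classifiers ascend the objective
    opt_max.zero_grad()
    (-objective(x)).backward()
    opt_max.step()

    # minimization step: shared encoder, projection heads, and c_zeta descend it
    opt_min.zero_grad()
    objective(x).backward()
    opt_min.step()
```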
In the semi-supervised experiments, we refer to the method optimizing Equation (8) as Ada-MIP and that optimizing Equation (10) as Ada-MIP-AD, which is fully summarized in Figure 3. To alleviate the extra cost brought by Ada-MIP-AD, we provide another framework, named Ada-MIP-SAD, which employs a shared supervised encoder \(g_{\phi }\) to replace \(\lbrace g_{\phi _i}\rbrace _{i=1}^M\).
Fig. 3.
Fig. 3. Illustration of Ada-MIP-AD. Different from Ada-MIP, we add a supervised module \(c_{\phi _i}\) after each augmentation head and guide the top-k augmenter to optimize the supervised loss while minimizing the mutual information between views. Then we introduce another supervised module \(c_{\zeta }=g_{\zeta }\circ f_{\zeta }\) for labeled data and maximize the mutual information between representations learned by the shared encoder \(g_{\theta }\) and supervised encoder \(g_{\zeta }\).

6 Experiments

In this section, we perform extensive experiments to evaluate the quality of graph-level representations learned by Ada-MIP in both unsupervised and semi-supervised scenarios. We first provide the descriptions of the datasets used in our experiments and the model configurations and then list several state-of-the-art baseline methods. After that, we present the experimental results and discussions. More detailed experimental analyses are also included.

6.1 Datasets

For the unsupervised learning scenario, we evaluate Ada-MIP for graph classification on five biochemical datasets and four social network datasets from the standard TU Benchmark [Morris et al. 2020]. Table 1 shows their statistics and properties. For the semi-supervised learning scenario, we evaluate Ada-MIP for molecular property prediction on the benchmark QM9 dataset [Ramakrishnan et al. 2014]. The QM9 dataset consists of about 134,000 drug-like organic molecules with a wide range of chemical compositions and characteristics. All of the molecules in the dataset are made up of Hydrogen (H), Carbon (C), Oxygen (O), Nitrogen (N), and Fluorine (F) atoms. For each molecule, a total of 12 fundamental chemical properties are pre-computed. A more detailed description of the QM9 dataset can be found in Gilmer et al. [2017]. Following InfoGraph* [Sun et al. 2019], we randomly select 5,000 samples as labeled training data, another 10,000 for validation, 10,000 for testing, and the rest as unlabeled training data. We use the same data split for all the methods. All the targets are normalized to have mean 0 and variance 1.
Table 1.
Biochemical Network: NCI1, PROTEINS, DD, PTC-MR, MUTAG; Social Network: COLLAB, IMDB-B, IMDB-M, RDT-B
Dataset | NCI1 | PROTEINS | DD | PTC-MR | MUTAG | COLLAB | IMDB-B | IMDB-M | RDT-B
#Graphs | 4,110 | 1,113 | 1,178 | 344 | 188 | 5,000 | 1,000 | 1,500 | 2,000
Avg. #Nodes | 29.87 | 39.06 | 284.32 | 14.29 | 17.93 | 74.5 | 19.77 | 13.0 | 508.52
Avg. #Edges | 32.30 | 72.82 | 715.66 | 14.69 | 19.79 | 2,457.78 | 96.53 | 65.94 | 497.75
#Classes | 2 | 2 | 2 | 2 | 2 | 3 | 2 | 3 | 2
Table 1. Summary of Biochemical and Social Networks Used for Unsupervised Learning Experiments
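A sketch of the QM9 split and target normalization described above is shown below; the dataset size, index names, and placeholder targets are illustrative only.

```python
# QM9-style split: 5,000 labeled, 10,000 validation, 10,000 test, rest unlabeled.
import torch

num_graphs = 134_000                        # approximate size of QM9
perm = torch.randperm(num_graphs)
labeled_idx = perm[:5_000]
valid_idx = perm[5_000:15_000]
test_idx = perm[15_000:25_000]
unlabeled_idx = perm[25_000:]

targets = torch.randn(num_graphs, 12)       # placeholder for the 12 QM9 properties
mean, std = targets.mean(dim=0), targets.std(dim=0)
targets = (targets - mean) / std            # each target normalized to mean 0, variance 1
```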

6.2 Model Configuration

In the unsupervised experiments, we use the Graph Isomorphism Network (GIN) [Xu et al. 2018] as the encoder. We follow InfoGraph [Sun et al. 2019] and MVGRL [Hassani and Khasahmadi 2020] for graph classification and choose the number of GNN layers, the number of epochs, and the learning rate from \(\lbrace 3,5,7\rbrace\), \(\lbrace 10,20,50,100,300\rbrace\) and \(\lbrace 1e-3,5e-4,1e-4\rbrace\), respectively. The hidden dimension, batch size, and the C parameter of the SVM are fixed at \(64, 128\), and 10. Both projection heads are parameterized by 2-layer feed-forward neural networks. Each linear layer is followed by a ReLU activation function.
In the semi-supervised experiments, we follow InfoGraph* [Sun et al. 2019] for molecular property prediction and use enn-s2s [Gilmer et al. 2017] with the number of set2set computations as 3. All the involved models are trained with an initial learning rate 0.001 for 500 epochs with a batch size 20 and \(\lambda =10^{-3}\).
In all the experiments, we set the hyper-parameters involved by Ada-MIP as a fixed value. We set the sampling ratio \(k\) of nodes/edges in the ND, EP, and FM augmentation heads as 0.9 and the hop number \(l\) in SI head as 2. We fix \(\tau\) in NT-Xent as 0.2, \(\alpha\) in Equation (7) as 0.005 and \(\lambda\) in Equation (10) as 0.001.
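For reference, the fixed hyper-parameters listed above can be gathered into a single configuration, as in the sketch below; the key names are our own.

```python
# Fixed Ada-MIP hyper-parameters from Section 6.2.
ADA_MIP_CONFIG = {
    "keep_ratio_nd": 0.9,   # node-keeping ratio k for the ND head
    "keep_ratio_ep": 0.9,   # edge-keeping ratio k for the EP head
    "keep_ratio_fm": 0.9,   # feature-keeping ratio k for the FM head
    "subgraph_hops": 2,     # l-hop neighborhood for the SI head
    "tau": 0.2,             # temperature in NT-Xent
    "alpha": 0.005,         # weight of the contrastive loss in Equation (7)
    "lambda": 0.001,        # MI-transfer weight in Equation (10)
    "hidden_dim": 64,
    "batch_size": 128,
    "svm_C": 10,
}
```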

6.3 Baseline Methods

For unsupervised learning, we compare Ada-MIP with four graph kernel baselines including Shortest Path Kernel (SP) [Borgwardt and Kriegel 2005], Weisfeiler-Lehman Sub-tree Kernel (WL) [Shervashidze et al. 2011], Pyramid Match Kernel (PM) [Nikolentzos et al. 2017], Multi-scale Laplacian kernel (MLG) [Kondor and Pan 2016], three unsupervised graph-level representation learning baselines including Node2vec [Grover and Leskovec 2016], Sub2vec [Adhikari et al. 2018], Graph2vec [Narayanan et al. 2017], and nine self-supervised learning baselines including GVAE [Kipf and Welling 2016b], InfoGraph [Sun et al. 2019], GraphCL [You et al. 2020], JOAO [You et al. 2021], MVGRL [Hassani and Khasahmadi 2020], AD-GCL [Suresh et al. 2021], GASSL [Yang et al. 2021], AutoGCL [Yin et al. 2021], SimGRACE [Xia et al. 2022]. In order to make fair comparisons with these baseline methods, we directly quote the results from the previous literature or their original articles if available. If results are not previously reported, we implement them based on the public source codes and conduct a hyperparameter search according to the original article.
For semi-supervised learning, we consider the three proposed models mentioned in Section 5 and compare them with several baselines: (1) the supervised GNN (Sup-GNN), which provides a performance anchor; (2) InfoGraph*; (3) GraphCL; (4) MVGRL; and (5) AD-GCL. For a fair comparison, we embed all the baselines into InfoGraph* using the same hyper-parameters and implement \(\ell _{m}\) in Equation (8) as the JS estimator. For GraphCL, we choose ND and FM as the two graph transformations, which have been empirically shown to perform the best [Zhu et al. 2021a].

6.4 Experimental Results on Unsupervised Learning

In this part, we investigate the effectiveness of the proposed Ada-MIP in the unsupervised learning scenario, which focuses on the downstream graph classification task. Following previous works, we compute the accuracy using LIBSVM [Chang and Lin 2011] and perform 10-fold cross-validation. All the experiments are repeated 5 times. We report the average accuracy and standard deviation for each dataset in Table 2, where we see that the proposed method achieves results better than or comparable to the baseline methods. In particular, Ada-MIP outperforms the SOTA methods by a maximum of 3.1% and an average of 1.5% on 7 of the 9 benchmark datasets. This indicates the high quality of graph representations learned by Ada-MIP in the unsupervised scenario.
Table 2.
Dataset | NCI1 | PROTEINS | DD | PTC-MR | MUTAG | COLLAB | IMDB-B | IMDB-M | RDT-B
SP | \(73.2\pm 0.3\) | \(69.8 \pm 1.2\) | \(70.4 \pm 3.6\) | \(58.2\pm 2.4\) | \(85.2\pm 2.4\) | \(64.5 \pm 2.1\) | \(55.6\pm 0.2\) | \(38.0\pm 0.3\) | \(64.1\pm 0.1\)
WL | \(80.0\pm 0.5\) | \(72.9\pm 0.6\) | \(74.0 \pm 2.2\) | \(58.0\pm 0.5\) | \(80.7\pm 3.0\) | \(69.3 \pm 3.4\) | \(72.3\pm 3.4\) | \(47.0\pm 0.5\) | \(68.8\pm 0.4\)
PM | \(73.3\pm 0.3\) | \(71.8 \pm 0.9\) | \(75.1 \pm 1.6\) | \(61.5\pm 1.3\) | \(84.9\pm 1.2\) | \(71.2 \pm 0.9\) | \(70.7\pm 0.6\) | \(47.8\pm 0.6\) | \(82.3\pm 0.2\)
MLG | \(74.1 \pm 0.6\) | \(72.4 \pm 1.3\) | \(75.9 \pm 2.4\) | \(63.3\pm 1.5\) | \(87.9\pm 1.6\) | \(72.7 \pm 0.8\) | \(66.6\pm 0.3\) | \(41.2\pm 0.03\) | >1 Day
Node2vec | \(54.9 \pm 1.6\) | \(57.5 \pm 3.6\) | \(67.1 \pm 3.5\) | \(58.6\pm 8.0\) | \(72.6\pm 10.2\) | \(66.2 \pm 5.3\) | \(56.4 \pm 2.8\) | \(38.1 \pm 3.3\) | \(69.7 \pm 4.1\)
Sub2vec | \(52.8 \pm 1.5\) | \(53.0 \pm 5.6\) | \(67.5 \pm 1.3\) | \(60.0\pm 6.4\) | \(61.1\pm 15.8\) | \(67.4 \pm 1.7\) | \(55.3\pm 1.5\) | \(36.7\pm 0.8\) | \(71.5\pm 0.4\)
Graph2vec | \(73.2 \pm 1.8\) | \(71.9 \pm 1.8\) | \(68.7 \pm 2.3\) | \(60.2\pm 6.9\) | \(83.2\pm 9.3\) | \(72.0 \pm 1.3\) | \(67.0\pm 0.6\) | \(44.6\pm 0.5\) | \(75.8\pm 1.0\)
GVAE | \(74.4 \pm 0.2\) | \(72.9 \pm 0.4\) | \(71.1 \pm 1.3\) | \(61.2\pm 1.8\) | \(87.7\pm 0.7\) | \(75.1 \pm 0.2\) | \(70.7\pm 0.7\) | \(49.3\pm 0.4\) | \(87.1\pm 0.1\)
InfoGraph | \(76.2\pm 1.1\) | \(74.4\pm 0.3\) | \(72.9\pm 1.8\) | \(61.7\pm 1.4\) | \(89.0\pm 1.1\) | \(70.7 \pm 1.1\) | \(73.0\pm 0.9\) | \(49.7\pm 0.5\) | \(82.5\pm 1.4\)
GraphCL | \(77.9\pm 0.4\) | \(74.4\pm 0.5\) | \(\underline{78.6\pm 0.4}\) | \(60.2 \pm 1.7\) | \(86.8\pm 1.3\) | \(71.4 \pm 1.1\) | \(71.1\pm 0.4\) | \(48.5\pm 0.6\) | \(89.5\pm 0.8\)
JOAO | \(77.5 \pm 0.9\) | \(74.6\pm 0.4\) | \(77.3\pm 0.5\) | \(60.4 \pm 1.7\) | \(87.4\pm 1.0\) | \(69.5 \pm 0.4\) | \(70.2\pm 3.1\) | \(48.8 \pm 0.8\) | \(85.3\pm 1.4\)
JOAOv2 | \(78.2 \pm 1.4\) | \(74.1\pm 1.1\) | \(77.4\pm 1.2\) | \(61.0 \pm 2.8\) | \(87.7\pm 0.8\) | \(69.3 \pm 0.4\) | \(70.8\pm 0.3\) | \(49.2 \pm 0.5\) | \(86.4\pm 1.5\)
MVGRL | \(75.1\pm 0.5\) | \(71.5\pm 0.3\) | OOM | \(62.5\pm 1.7\) | \(89.7\pm 1.1\) | OOM | \(\underline{74.2\pm 0.7}\) | \(51.2\pm 0.5\) | \(84.5\pm 0.6\)
AD-GCL | \(75.8\pm 0.5\) | \(75.0\pm 0.5\) | \(75.4\pm 0.4\) | \(63.2 \pm 2.4\) | \(88.6\pm 1.1\) | \(74.8 \pm 0.4\) | \(71.5\pm 1.0\) | \(50.6 \pm 0.7\) | \(\mathbf {92.0\pm 0.4}\)
GASSL | \(80.2\pm 1.9\) | \(-\) | \(-\) | \(\underline{64.6\pm 6.1}\) | \(\underline{90.9\pm 7.9}\) | \(\underline{78.0 \pm 2.0}\) | \(\underline{74.2\pm 0.5}\) | \(\underline{51.7\pm 2.5}\) | \(-\)
AutoGCL | \(\mathbf {82.2\pm 0.3}\) | \(75.8 \pm 0.4\) | \(77.6 \pm 0.6\) | \(63.1 \pm 2.3\) | \(88.6\pm 1.2\) | \(70.1 \pm 0.7\) | \(73.3\pm 0.4\) | \(50.6 \pm 0.8\) | \(88.6\pm 0.5\)
SimGRACE | \(79.1\pm 0.4\) | \(75.4 \pm 0.1\) | \(77.4 \pm 1.1\) | \(63.2 \pm 3.1\) | \(89.0\pm 1.3\) | \(71.7 \pm 0.8\) | \(71.3\pm 0.8\) | \(50.9 \pm 0.9\) | \(89.5\pm 0.9\)
Ada-MIP | \(\underline{80.7\pm 0.7}\) | \(\mathbf {77.0\pm 0.5}\) | \(\mathbf {81.7\pm 1.2}\) | \(\mathbf {66.0\pm 1.4}\) | \(\mathbf {91.5\pm 2.5}\) | \(\mathbf {79.5 \pm 1.1}\) | \(\mathbf {75.4\pm 1.9}\) | \(\mathbf {52.1\pm 0.5}\) | \(\underline{91.1\pm 1.1}\)
Table 2. Unsupervised Biochemical and Social Network Classification Performance in TU Datasets (Averaged Accuracy \(\pm\) std. over 10-fold Cross-validation)
The best and second-best results are highlighted with boldface and underlined. - denotes that there’s no public record or code for the method on the dataset. “> 1 day” represents that the computation exceeds 24 hours. “OOM” is the out-of-memory error.
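The evaluation protocol above can be sketched as follows, with scikit-learn’s libsvm-backed SVC standing in for LIBSVM; the random embeddings and labels are placeholders for the frozen Ada-MIP representations and the dataset labels.

```python
# 10-fold SVM evaluation of frozen graph embeddings, repeated 5 times.
import numpy as np
from sklearn.model_selection import KFold, cross_val_score
from sklearn.svm import SVC

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(1000, 64))     # placeholder for g_theta outputs
labels = rng.integers(0, 2, size=1000)       # placeholder graph labels

scores = []
for seed in range(5):                        # 5 repetitions; the paper retrains the encoder each time
    folds = KFold(n_splits=10, shuffle=True, random_state=seed)
    scores.append(cross_val_score(SVC(C=10), embeddings, labels, cv=folds).mean())
print(f"{np.mean(scores):.3f} +/- {np.std(scores):.3f}")
```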

6.5 Experimental Results on Semi-supervised Learning

In this part, we investigate the effectiveness of the proposed Ada-MIP in a semi-supervised learning scenario, which focuses on the molecular property prediction task. Following InfoGraph*, we optimize the mean square error (MSE) loss while evaluating the mean absolute error (MAE). We report the MAE for each task in Table 3, where we see that simply using Ada-MIP yields the best performance on 10 out of 12 tasks compared to all the baselines, which again validates the superiority of Ada-MIP. Furthermore, by applying adversarial learning on the augmenter, the performance on 11 out of 12 tasks is further boosted by Ada-MIP-AD and Ada-MIP-SAD, which demonstrates that filtering out the task-irrelevant information between views is able to fully leverage the valuable labels. Besides, the superior performance of Ada-MIP-SAD also indicates its effectiveness with less cost than Ada-MIP-AD. In practice, Ada-MIP-SAD is a good tradeoff between efficiency and effectiveness.
Table 3.
Target | Mu(0) | Alpha(1) | HOMO(2) | LUMO(3) | Gap(4) | R2(5) | ZPVE(6) | U0(7) | U(8) | H(9) | G(10) | Cv(11)
Sup-GNN | 0.254 | 0.554 | 0.171 | 0.168 | 0.259 | 3.91 | 0.0096 | 7.69 | 9.30 | 7.32 | 7.63 | 0.182
InfoGraph | 0.236 | \(\underline{0.517}\) | 0.170 | 0.159 | 0.249 | 3.87 | 0.0093 | 7.58 | 8.29 | 6.17 | 6.39 | 0.180
GraphCL | 0.242 | 0.551 | 0.167 | 0.163 | 0.254 | 3.54 | 0.0091 | 7.49 | 8.85 | 7.14 | 6.70 | 0.177
MVGRL | 0.248 | 0.537 | 0.165 | 0.165 | 0.251 | 3.57 | 0.0088 | 5.55 | 5.96 | 6.45 | 6.38 | 0.181
AD-GCL | 0.250 | 0.536 | 0.168 | 0.166 | 0.234 | 3.17 | 0.0085 | 7.20 | 5.91 | 6.98 | 6.08 | 0.175
Ada-MIP | 0.230 | 0.549 | 0.161 | 0.149 | \(\underline{0.228}\) | 2.42 | 0.0077 | 5.46 | 4.98 | 5.59 | 5.14 | 0.176
Ada-MIP-AD | \(\underline{0.227}\) | \(\mathbf {0.511}\) | \(\mathbf {0.159}\) | \(\mathbf {0.147}\) | 0.231 | \(\mathbf {2.14}\) | \(\mathbf {0.0074}\) | \(\underline{4.38}\) | \(\mathbf {4.08}\) | \(\underline{3.98}\) | \(\underline{3.91}\) | \(\underline{0.173}\)
Ada-MIP-SAD | \(\mathbf {0.223}\) | 0.535 | \(\underline{0.160}\) | \(\underline{0.148}\) | \(\mathbf {0.225}\) | \(\underline{2.28}\) | \(\underline{0.0076}\) | \(\mathbf {4.06}\) | \(\underline{4.23}\) | \(\mathbf {3.39}\) | \(\mathbf {3.67}\) | \(\mathbf {0.172}\)
Table 3. Semi-supervised Molecular Property Prediction Performance (Mean Absolute Error) on QM9 Dataset
The top row shows the MAE of the supervised model. The best and second-best results are highlighted with boldface and underlined.
Besides, to validate the convergence of Ada-MIP-AD, we draw the training curves for all the loss items in Equation (10) on the QM9 dataset. The results in Figure 4 show that the training process is stable and all the loss items converge after 200 epochs.
Fig. 4.
Fig. 4. Convergence plot of Ada-MIP-AD on QM9 target 0.

6.6 Ablation Studies

Effect of Components in Ada-MIP. We first investigate the contribution of each component in Ada-MIP by removing the contrastive head, the proximity head, and the top-k graph augmenter, respectively. Specifically, Ada-MIP w/o CH denotes the model that only optimizes the proximity loss. Ada-MIP w/o PH denotes the model that optimizes the contrastive loss with the top-k augmenter. Ada-MIP w/o TGA denotes the model that simultaneously optimizes the above two losses while replacing the top-k augmenter with the same stochastic augmentation strategy as GraphCL. The results are shown at the top of Table 4.
Table 4.
Dataset | PROTEINS | DD | PTC-MR | IMDB-B | IMDB-M
Ada-MIP w/o CH | 73.5 | 73.4 | 60.3 | 72.5 | 48.5
Ada-MIP w/o PH | 75.7 | 80.4 | 63.6 | 74.5 | 51.7
Ada-MIP w/o TGA | 76.0 | 79.4 | 64.6 | 74.8 | 51.5
Ada-MIP w SP | 76.5 | 80.9 | 64.2 | 75.3 | \(\mathbf {52.1}\)
Ada-MIP w PM | 76.5 | 80.7 | 64.5 | 75.0 | 51.8
Ada-MIP w MLG | 76.7 | 80.2 | 63.9 | 74.4 | 51.9
Ada-MIP w WL | \(\mathbf {77.0}\) | \(\mathbf {81.7}\) | \(\mathbf {66.0}\) | \(\mathbf {75.4}\) | \(\mathbf {52.1}\)
2 Views (Avg) | 76.3 | 80.5 | 64.3 | 74.3 | 51.6
2 Views (Max) | 76.7 | 80.9 | 65.3 | 74.7 | 51.7
3 Views (Avg) | 76.4 | 81.6 | 65.3 | 74.4 | 51.9
3 Views (Max) | 76.5 | \(\mathbf {82.5}\) | \(\mathbf {66.2}\) | 74.8 | \(\mathbf {52.1}\)
4 Views | \(\mathbf {77.0}\) | 81.7 | 66.0 | \(\mathbf {75.4}\) | \(\mathbf {52.1}\)
Table 4. Performance of Ada-MIP with Different Graph Kernels and Number of Views
It is observed that simply optimizing the proximity loss can boost the performance of the WL Subtree kernel. This is because the graph encoder maps the inter-graph information into low-dimensional embeddings, leading to better generalization performance on downstream tasks. Besides, the consistently superior performance of Ada-MIP w/o PH over Ada-MIP w/o CH implies that learning graphs’ unique features benefits downstream tasks more than learning common features. Moreover, although the optimization objectives of the contrastive loss and the proximity loss are quite different, they are complementary, and combining them further boosts the model performance. Overall, Ada-MIP consistently shows a significant improvement over its three incomplete versions, indicating that all three components are essential in Ada-MIP.
Effect of Graph Kernel Methods. It might be argued that the performance improvement of Ada-MIP mainly depends on the choice of graph kernel method. Thus, in this part, we investigate the effect of Ada-MIP with three other graph kernels: SP, PM, and MLG. The results shown in the middle of Table 4 suggest that Ada-MIP also achieves performance gains with graph kernels other than WL. The consistently best performance of Ada-MIP with WL also suggests that WL is a sensible default choice for Ada-MIP.
Effect of View Numbers. To investigate the relationship between the number of views and the quality of the learned representations in Ada-MIP, we evaluate all pairs and triplets of augmentation combinations on five datasets and report the average and best results in Table 4. Overall, we observe that the average accuracy of Ada-MIP increases as the number of views grows on all datasets. Even considering all the combinations of views, Ada-MIP with full views still performs the best on 3 out of 5 datasets. From another perspective, Ada-MIP with two views is still competitive with the baselines, implying a good tradeoff between effectiveness and efficiency.

6.7 Hyper-parameter Sensitivity Analysis

We first investigate Ada-MIP’s sensitivity to \(\alpha\) in Equation (7). Different \(\alpha\) controls the relative learning pace of the two objectives. We tune \(\alpha\) from \(\lbrace 0.001,0.005,0.01,0.05\rbrace\) and report the results on the top of Table 5. We find that Ada-MIP is robust to different \(\alpha\) within a reasonable range.
Table 5.
Dataset | PROTEINS | DD | PTC-MR | IMDB-B | IMDB-M
\(\alpha =0.001\) | 76.9 | 81.1 | 66.4 | 74.3 | 51.7
\(\alpha =0.005\) | \(\mathbf {77.0}\) | \(\mathbf {81.7}\) | 66.0 | \(\mathbf {75.4}\) | 52.1
\(\alpha =0.01\) | 76.8 | 81.5 | \(\mathbf {66.1}\) | 74.7 | \(\mathbf {52.2}\)
\(\alpha =0.05\) | 76.5 | 80.7 | 64.8 | 74.7 | 51.9
GraphCL | 74.4 | 78.6 | 60.2 | 71.1 | 48.5
AutoGCL | 75.0 | 77.4 | 63.1 | 73.3 | 50.6
\(k_e^{EP}=0.95\) | 75.1 | 79.7 | 63.1 | 73.2 | \(\mathbf {51.3}\)
\(k_e^{EP}=0.9\) | 75.2 | 79.9 | 63.4 | 73.6 | 51.0
\(k_e^{EP}=0.85\) | \(\mathbf {75.5}\) | \(\mathbf {80.1}\) | \(\mathbf {63.9}\) | \(\mathbf {73.8}\) | 50.7
\(k_e^{EP}=0.8\) | 74.9 | 79.5 | 63.3 | 73.1 | 50.4
\(k^{ND}_n=0.95\) | 75.1 | 79.9 | 63.3 | 72.9 | 50.1
\(k^{ND}_n=0.9\) | \(\mathbf {75.3}\) | \(\mathbf {80.2}\) | \(\mathbf {64.1}\) | 73.4 | 50.4
\(k^{ND}_n=0.85\) | 74.8 | 79.8 | 63.5 | \(\mathbf {73.5}\) | \(\mathbf {50.7}\)
\(k^{ND}_n=0.8\) | 74.5 | 79.8 | 63.1 | 73.1 | 50.2
\(k^{FM}_n=0.95\) | 75.1 | 79.4 | 62.9 | 73.9 | 50.9
\(k^{FM}_n=0.9\) | 75.4 | \(\mathbf {79.6}\) | \(\mathbf {63.2}\) | 74.1 | \(\mathbf {51.3}\)
\(k^{FM}_n=0.85\) | \(\mathbf {75.5}\) | 79.2 | 62.7 | \(\mathbf {74.2}\) | 51.0
\(k^{FM}_n=0.8\) | 75.2 | 79.0 | 62.3 | 73.7 | 50.7
\(l=1\) | 75.1 | 79.3 | 62.1 | 73.8 | 50.5
\(l=2\) | \(\mathbf {75.7}\) | 79.7 | \(\mathbf {63.0}\) | 74.0 | 50.8
\(l=3\) | 75.4 | 79.4 | 62.3 | \(\mathbf {74.4}\) | \(\mathbf {51.1}\)
Table 5. Sensitivity of Ada-MIP to Hyper-parameters \(\alpha\), Hyper-parameter \(k\) in Edge Perturbation, Node Dropping, Feature Masking Heads and \(l\) in Sub-graph Inducing Head
We then investigate the hyper-parameter in each augmentation head of the graph augmenter. While experimenting on one head, we drop out all the other augmentation heads and the proximity projection head as a vanilla version of Ada-MIP. We contrast the augmented graphs with original graphs and tune the ratio of node/edge selection \(k\) from \(\lbrace 0.95, 0.9, 0.85, 0.8\rbrace\) and the hops of the induced sub-graph \(l\) from \(\lbrace 1,2,3\rbrace\). Table 5 shows that the vanilla Ada-MIP also consistently outperforms GraphCL and AutoGCL on all the datasets, which validates the effectiveness of our top-k augmenter. Besides, we also observe that our augmenter is robust to the slight tuning of the hyper-parameters \(k\) and \(l\).

6.8 Contrastive Loss Plot

To validate our theoretical analysis of the top-k augmenter’s injectivity, we draw the four contrastive loss curves of Ada-MIP and GraphCL for different augmentations on the PROTEINS dataset. We argue that if an augmenter maps more non-isomorphic graphs to augmented graphs distinguishable by the subsequent encoder and discriminator, it can better help to optimize the contrastive loss. To ensure a fair comparison, we apply the vanilla Ada-MIP mentioned in Section 6.7 and the same hyper-parameter settings for Ada-MIP and GraphCL. Figure 5 shows that the vanilla Ada-MIP consistently optimizes the lower bound of mutual information better than GraphCL. This result confirms that the top-k augmenter with each augmentation head can learn more informative views than those generated by the fixed augmentation strategy in GraphCL.
Fig. 5.
Fig. 5. Contrastive loss on PROTEINS.

6.9 Model Running Time

We study the efficiency of the proposed method by taking two datasets, i.e., RDT-B and DD, as examples. This experiment is run on a server with an NVIDIA GeForce GTX 1080 Ti (11 GB). We set the same hyper-parameters and run 300 epochs for all the tested methods. The results measured in wall-clock time in seconds are presented in Table 6, where we compare the proposed Ada-MIP with GraphCL and GVAE. Particularly, we report the total running time of all the methods and the mean training time per epoch, including the forward pass, objective loss calculation, and backward pass. GraphCL needs the least training time as it uses random augmentations and simply optimizes the contrastive loss. GVAE costs the most time since it has to reconstruct the adjacency matrix of each graph. Ada-MIP ranks second, slightly behind GraphCL, as it requires precomputing the proximity matrix, learning the augmented views, and optimizing the proximity loss. However, it is observed that Ada-MIP only spends less than 2% additional training time on preprocessing, and its training efficiency is quite close to that of GraphCL. Considering the prediction accuracy on downstream tasks, our proposed method consumes reasonable training time and achieves superior results.
Table 6.
Dataset | GraphCL | | GVAE | | Ada-MIP | |
 | Total | Time per epoch | Total | Time per epoch | Total | Precomputing | Time per epoch
RDT-B | 939.3 | \(3.13 \pm 0.32\) | 2,364.6 | \(7.88 \pm 0.47\) | 1,194.25 | 18.25 | \(3.92\pm 0.28\)
DD | 864.4 | \(2.88 \pm 0.35\) | 2,016.8 | \(6.72 \pm 0.38\) | 1,116.6 | 15.60 | \(3.67 \pm 0.39\)
Table 6. Wall-clock Time (s) of Different Methods on the RDT-B and DD Datasets

6.10 Visualization

To better interpret the augmentations learned by the top-k augmenter, we randomly select three graphs from the MUTAG dataset and visualize their augmentations in Figure 6. The model is trained for 100 epochs and we set \(k=0.9\) for ND, EP, and FM and \(l=2\) for SI. From the results, we have several observations. First, the top-k augmenter tends to preserve the connectivity of augmented graphs for ND and the features of special atoms like oxygen and nitrogen for FM. Second, the center nodes have higher importance scores in the augmented graphs of SI, indicating that the augmenter tends to choose subgraphs of larger size and preserve as much information of the original graph as possible. Third, for all the transformations, the top-k augmenter can identify and preserve key chemical groups such as nitro and hydroxyl, which might be important to the mutagenic property of chemicals.
Fig. 6.
Fig. 6. Visualization of the learned augmentations in the MUTAG dataset. The numbers denote the learned node/edge importance scores.

7 Conclusion

In this work, we have proposed Ada-MIP, a novel self-supervised framework for graph-level representation learning. Ada-MIP first employs an adaptive and injective top-k graph augmenter to learn informative views, which preserves the unique information of original graphs. Then a multi-head encoder learns graph representations fused with graphs’ unique and common information by simultaneously optimizing the contrastive loss and proximity loss. We also extend Ada-MIP to semi-supervised scenarios by applying adversarial training to filter out task-irrelevant information between views. We conduct comprehensive experiments on several graph classification benchmarks and molecular property prediction tasks to validate the high quality of graph representations learned by Ada-MIP in both unsupervised and semi-supervised scenarios.

A Algorithms of Augmentation Heads

In this section, we provide the detailed algorithms for the node dropping head, edge perturbation head, subgraph inducing head, and the feature masking head.

B Proof of Theorem 1

Before proving Theorem 1, we review two lemmas which have been proved in Xu et al. [2018]. First, a GNN can be as powerful as the Weisfeiler-Lehman test [Weisfeiler and Leman 1968] by using injective functions over multisets to map two different graphs into distinct representations, as stated in Lemma 2. Following Xu et al. [2018], we consider the situation where input node features are from a countable set, and the following lemma holds.
Lemma 3.1.
Assume the input feature space \(\mathcal {X}\) is countable. Let \(g:\mathcal {G}\rightarrow \mathcal {R}^d\) be a GNN with a sufficient number of layers \(L\). The space of node hidden features \(h_v^{(l)}\) and the space of graph feature \(h_G\) output by \(g\) are also countable for all \(l=1,\ldots ,L\).
Next, we begin our proof. Given a graph \(G=(\mathcal {V}, \mathcal {E}, X)\), we use an I-GNN as the augmentation encoder to obtain its node features \(H_v\) and graph feature \(h_G\). Then we first consider the ND, SI, and FM transformations, where edges in the augmented graphs are weighted by the summation of the two corresponding node scores. For transformation \(\phi \in \lbrace ND,SI,FM\rbrace\), we denote the function \(c_{\phi }\) as the mapping from a node representation \(h_{v}\) and its corresponding graph representation \(h_G\) to a node score \(Y^{\phi }_{v}\), i.e.:
\begin{equation} Y^{\phi }_{v}=c_{\phi }(h_{v}, h_G). \end{equation}
(11)
Then the edge score \(Y^{\phi }_{(u,v)}\) is calculated as
\begin{equation} Y^{\phi }_{(u,v)}=Y^{\phi }_{u}+Y^{\phi }_{v}. \end{equation}
(12)
For the transformation \(\phi =EP\), we denote by \(d_{\phi }\) the mapping from the sum of two node representations \(h_{u}+h_{v}\), their corresponding graph representation \(h_G\), and the indicator variable \(h_{e}=\mathbb {1}_{(u,v)\in \mathcal {E}}\) to the edge score \(Y^{\phi }_{(u,v)}\), i.e.,
\begin{equation} Y^{\phi }_{(u,v)}=d_{\phi }(h_{u}+h_{v}, h_G,h_e). \end{equation}
(13)
Then we denote the augmented graph of \(G\) under transformation \(\phi\) by \(G_{\phi }=(\mathcal {V}^{\phi }, \mathcal {E}^{\phi }, X^{\phi }, Y^{\phi })\), where \(Y^{\phi }\in \mathcal {R}^{|\mathcal {E}^{\phi }|}\) denotes the edge weight vector. The node features of \(G_{\phi }\) are updated in the \(l\)th GNN layer as follows:
\begin{equation} h_{v}^{(l)}=\rho \left(h_{v}^{(l-1)}, \omega \left(\left\lbrace Y^{\phi }_{(u,v)}\cdot h_{u}^{(l-1)}: u \in \mathcal {N}(v)\right\rbrace \right)\right). \end{equation}
(14)
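For concreteness, the update in Eq. (14) can be instantiated with sum aggregation for \(\omega\) and an MLP for \(\rho\), in the spirit of GIN. The layer below is a minimal sketch under that assumption, not the paper's exact encoder.

```python
import torch
import torch.nn as nn


class WeightedInjectiveLayer(nn.Module):
    """One message-passing layer following Eq. (14): sum-aggregate the
    edge-weighted neighbor features (omega) and combine them with the
    node's own feature through an MLP (rho). A GIN-style sketch."""

    def __init__(self, dim: int):
        super().__init__()
        self.rho = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))
        self.eps = nn.Parameter(torch.zeros(1))

    def forward(self, h, edge_index, edge_weight):
        # h: (n, dim); edge_index: (2, E) with messages u -> v; edge_weight: (E,)
        u, v = edge_index
        msg = edge_weight.unsqueeze(-1) * h[u]            # Y^phi_(u,v) * h_u
        agg = torch.zeros_like(h).index_add_(0, v, msg)   # sum over u in N(v)
        return self.rho((1.0 + self.eps) * h + agg)
```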
Next, we provide a sufficient condition for two weighted graphs to be distinguishable by I-GNNs.
Lemma 4.2.
Given an I-GNN \(g\), and two weighted graphs \(G_1=(\mathcal {V}_1, \mathcal {E}_1, X_1, Y_1)\), \(G_2=(\mathcal {V}_2, \mathcal {E}_2, X_2, Y_2)\), \(g\) can map \(G_1\) and \(G_2\) to two different representations if the following condition holds:
\(\forall v_1\in \mathcal {V}_1, u_1\in \mathcal {N}(v_1)\) and \(\forall v_2\in \mathcal {V}_2, u_2\in \mathcal {N}(v_2)\),
\begin{equation} Y_{1,(u_1,v_1)}\cdot X_{1,v_1} \ne Y_{2,(u_2,v_2)}\cdot X_{2,v_2}. \end{equation}
(15)
Proof.
For \(\forall v_1\in \mathcal {V}_1, v_2\in \mathcal {V}_2\), we consider their features updated by the first layer of I-GNN:
\begin{equation} \begin{aligned}h_{v_1}^{(1)}&=\rho \left(X_{1,v_1}, \omega \left(\left\lbrace Y_{1,(u_1,v_1)}\cdot X_{1,u_1}: u_1 \in \mathcal {N}(v_1)\right\rbrace \right)\right)\\ h_{v_2}^{(1)}&=\rho \left(X_{2,v_2}, \omega \left(\left\lbrace Y_{2,(u_2,v_2)}\cdot X_{2,u_2}: u_2 \in \mathcal {N}(v_2)\right\rbrace \right)\right) \end{aligned} . \end{equation}
(16)
By the condition, we have \(\lbrace Y_{1,(u_1,v_1)}\cdot X_{1,u_1}: u_1 \in \mathcal {N}(v_1)\rbrace \ne \lbrace Y_{2,(u_2,v_2)}\cdot X_{2,u_2}: u_2 \in \mathcal {N}(v_2)\rbrace\). Because \(\omega\) and \(\rho\) are injective, for all \(v_1\in \mathcal {V}_1, v_2\in \mathcal {V}_2\) we have \(h_{v_1}^{(1)}\ne h_{v_2}^{(1)}\), and hence \(\lbrace h_{v_1}^{(1)},v_1 \in \mathcal {V}_1\rbrace \ne \lbrace h_{v_2}^{(1)},v_2 \in \mathcal {V}_2\rbrace\). Since the readout of an I-GNN is also injective over multisets of node features, we conclude that \(h_{G_1} \ne h_{G_2}\).□
In practice, \(c_{\phi }\) and \(d_{\phi }\) are modeled by the MLP in each augmentation head. Thus, based on the universal approximation theorem [Hornik 1991] and Lemma 4.2, it suffices to prove that there exist \(c_\phi\) and \(d_\phi\) such that the scores \(Y^{\phi }_{1}\) and \(Y^{\phi }_{2}\), generated by \(c_\phi\) for \(\phi \in \lbrace ND,SI,FM\rbrace\) and by \(d_\phi\) for \(\phi =EP\), satisfy the condition in Lemma 4.2. We now provide an example of such a \(c_\phi\) and \(d_\phi\), respectively.
Since the input feature space \(\mathcal {X}\) is countable, according to Lemma 3.1, the space \(\mathcal {H}_v\) of node features \(h_v^{(L)}\) and the space \(\mathcal {H}_G\) of graph features \(h_G\) output by the augmentation encoder are also countable. Thus there exist two mappings, \(Z_V:\mathcal {H}_v\rightarrow \mathbb {N}\) and \(Z_G:\mathcal {H}_G\rightarrow \mathbb {N}\), from \(h_v\) and \(h_G\) to natural numbers. Because the cardinality of the node feature multiset \(H_v\subset \mathcal {H}_v\) is bounded, there exists a natural number \(N\in \mathbb {N}\) such that \(|H_v|\lt N\) for all \(H_v\), and thus \(Z_V(h_v)\lt N\) for all \(h_v\in H_v\). Since the cardinality of the input node feature multiset \(X\subset \mathcal {X}\) is also bounded, there exists a natural number \(K\in \mathbb {N}\) that upper-bounds the scalar multiple between any two raw node features, i.e., \(\forall x_1,x_2\in X, x_1=k\cdot x_2 \Rightarrow k\lt K\). We then define \(A=\max (K,N)\) and provide an example of such a \(c_{\phi }\):
\begin{equation} c_{\phi }(h_v,h_G)=A^{2\cdot Z_G(h_G)}\cdot Z_V(h_v). \end{equation}
(17)
The edge weight \(Y_{(u,v)}^{\phi }\) is then calculated as
\begin{equation} \begin{aligned}Y_{(u,v)}^{\phi }&=c_{\phi }(h_{u},h_G)+c_{\phi }(h_{v},h_G)\\ &=A^{2\cdot Z_G(h_G)}\cdot (Z_V(h_{u})+Z_V(h_{v})) \end{aligned}. \end{equation}
(18)
This \(c_{\phi }\) maps the edge weights of different graphs into disjoint intervals of positive natural numbers: the ratio of an edge weight in one graph to an edge weight in another graph is at least \(A\) or at most \(\frac{1}{A}\), as we prove in detail below.
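As a quick sanity check of this separation property, the toy values below (chosen arbitrarily for illustration) instantiate Eqs. (17)–(18) for edges in two graphs with different graph codes \(Z_G\) and confirm that the ratio of their edge weights indeed exceeds \(A\), as Eq. (20) shows in general.

```python
# Toy numerical check of the separation construction (arbitrary illustrative values).
A = 7                     # A = max(K, N); node codes Z_V are assumed to be < A
zg1, zg2 = 3, 2           # graph codes with Z_G(h_G1) > Z_G(h_G2)
zv1 = (1, 4)              # Z_V codes of an edge's endpoints in G1
zv2 = (6, 2)              # Z_V codes of an edge's endpoints in G2


def edge_weight(zg, zv_u, zv_v):
    # c_phi(h_u, h_G) + c_phi(h_v, h_G) = A^(2*Z_G) * (Z_V(u) + Z_V(v)), Eqs. (17)-(18)
    return A ** (2 * zg) * (zv_u + zv_v)


w1 = edge_weight(zg1, *zv1)   # edge weight in G1
w2 = edge_weight(zg2, *zv2)   # edge weight in G2
assert w1 / w2 > A            # the ratio exceeds A, matching Eq. (20)
print(w1, w2, w1 / w2)        # 588245 19208 30.625
```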
Similarly, we provide an example of \(d_\phi\):
\begin{equation} d_{\phi }(h_{u}+h_{v}, h_G,h_e)=h_e\cdot A^{2\cdot Z_G(h_G)}\cdot Z_V(h_{u}+h_{v}). \end{equation}
(19)
Then for all \(\phi \in \lbrace ND,SI,FM,EP\rbrace\), all \(v_1\in \mathcal {V}^{\phi }_1, u_1\in \mathcal {N}(v_1)\) and all \(v_2\in \mathcal {V}^{\phi }_2, u_2\in \mathcal {N}(v_2)\), there are two situations for \(Y_{1,(u_1,v_1)}^{\phi }\cdot X_{1,v_1}^{\phi }\) and \(Y_{2,(u_2,v_2)}^{\phi }\cdot X_{2,v_2}^{\phi }\):
(a) There exists no \(k\in \mathcal {R}\), s.t., \(X_{1,v_1}^{\phi }=k\cdot X_{2,v_2}^{\phi }\). Then we have \(Y^{\phi }_{1,(u_1,v_1)}\cdot X_{1,v_1}^{\phi } \ne Y^{\phi }_{2,(u_2,v_2)}\cdot X_{2,v_2}^{\phi }\) for any non-zero value of \(Y^{\phi }_{1,(u_1,v_1)}\) and \(Y^{\phi }_{2,(u_2,v_2)}\).
(b) There exists a \(k\in \mathcal {R}\), s.t., \(X_{1,v_1}^{\phi }=k\cdot X_{2,v_2}^{\phi }\). Since \(G_1\) and \(G_2\) are distinguishable by the augmentation encoder, we have \(h_{G_1}\ne h_{G_2}\) and \(Z_G(h_{G_1})\ne Z_G(h_{G_2})\). We first assume that \(Z_G(h_{G_1})\gt Z_G(h_{G_2})\). Then for \(\phi \in \lbrace ND,SI,FM\rbrace\), we have
\begin{equation} \begin{aligned}\frac{Y^{\phi }_{1,(u_1,v_1)}}{Y^{\phi }_{2,(u_2,v_2)}}&=\frac{c_\phi (h_{u_1}, h_{G_1})+c_\phi (h_{v_1},h_{G_1})}{c_\phi (h_{u_2},h_{G_2})+c_\phi (h_{v_2}, h_{G_2})}\\ &=\frac{A^{2\cdot Z_G(h_{G_1})}\cdot (Z_V(h_{u_1})+Z_V(h_{v_1}))}{A^{2\cdot Z_G(h_{G_2})}\cdot (Z_V(h_{u_2})+Z_V(h_{v_2}))}\\ &= A^{2(Z_G(h_{G_1})-Z_G(h_{G_2}))}\cdot \frac{(Z_V(h_{u_1})+Z_V(h_{v_1}))}{(Z_V(h_{u_2})+Z_V(h_{v_2}))}\\ &\gt \frac{A^{2(Z_G(h_{G_1}) -Z_G(h_{G_2}))}}{N}\\ &\ge A^{[2(Z_G(h_{G_1}) -Z_G(h_{G_2}))-1]}\ge A\gt k. \end{aligned} \end{equation}
(20)
For \(\phi =EP\), we have
\begin{equation} \begin{aligned}\frac{Y^{\phi }_{1,(u_1,v_1)}}{Y^{\phi }_{2,(u_2,v_2)}}&=\frac{d_\phi (h_{u_1}+h_{v_1},h_{G_1},1)}{d_\phi (h_{u_2}+h_{v_2},h_{G_2},1)}\\ &=\frac{A^{2\cdot Z_G(h_{G_1})}\cdot (Z_V(h_{u_1}+h_{v_1}))}{A^{2\cdot Z_G(h_{G_2})}\cdot (Z_V(h_{u_2}+h_{v_2}))}\\ &= A^{2(Z_G(h_{G_1})-Z_G(h_{G_2}))}\cdot \frac{(Z_V(h_{u_1}+h_{v_1}))}{(Z_V(h_{u_2}+h_{v_2}))}\\ &\gt \frac{A^{2(Z_G(h_{G_1}) -Z_G(h_{G_2}))}}{N}\\ &\ge A^{[2(Z_G(h_{G_1}) -Z_G(h_{G_2}))-1]}\ge A\gt k. \end{aligned} \end{equation}
(21)
In the same way, for all \(\phi \in \lbrace ND,SI,FM,EP\rbrace\), if \(Z_G(h_{G_1})\lt Z_G(h_{G_2})\), we have
\begin{equation} 0\lt \frac{Y^{\phi }_{1,(u_1,v_1)}}{Y^{\phi }_{2,(u_2,v_2)}}\lt \frac{1}{A}. \end{equation}
(22)
Thus we have \(\frac{Y^{\phi }_{1,(u_1,v_1)}}{Y^{\phi }_{2,(u_2,v_2)}}\ne k\) regardless of the sign of \(k\). Because \(X_{1,v_1}^{\phi }\) and \(X_{2,v_2}^{\phi }\) are non-zero vectors, we conclude that \(Y^{\phi }_{1,(u_1,v_1)}\cdot X_{1,v_1}^{\phi } \ne Y^{\phi }_{2,(u_2,v_2)}\cdot X_{2,v_2}^{\phi }\). The condition of Lemma 4.2 is therefore satisfied, which completes the proof.□
\begin{equation} \max \sum _{i \in I} \frac{1}{\left|S_{i}||N_{i}\right|} \sum _{s_{i} \in S_{i}, n_{i} \in N_{i}} \log \frac{\exp (\mathbf {h}_{i}^{\top } \mathbf {h}_{s_{i}}/\tau)-\exp (\mathbf {h}_{i}^{\top } \mathbf {h}_{n_{i}}/\tau)}{\sum _{j \in I \backslash \lbrace i\rbrace } \exp (\mathbf {h}_{i}^{\top } \mathbf {h}_{j}/\tau)}. \end{equation}
(23)

References

[1] Bijaya Adhikari, Yao Zhang, Naren Ramakrishnan, and B. Aditya Prakash. 2018. Sub2vec: Feature learning for subgraphs. In Proceedings of the Pacific-Asia Conference on Knowledge Discovery and Data Mining. Springer, 170–182.
[2] László Babai. 2016. Graph isomorphism in quasipolynomial time. In Proceedings of the 48th Annual ACM Symposium on Theory of Computing. 684–697.
[3] Karsten M. Borgwardt and Hans-Peter Kriegel. 2005. Shortest-path kernels on graphs. In Proceedings of the ICDM. IEEE, 8 pp.
[4] Jin-Yi Cai, Martin Fürer, and Neil Immerman. 1992. An optimal lower bound on the number of variables for graph identification. Combinatorica 12, 4 (1992), 389–410.
[5] Chih-Chung Chang and Chih-Jen Lin. 2011. LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology 2, 3 (2011), 1–27.
[6] Michael Crawshaw. 2020. Multi-task learning with deep neural networks: A survey. arXiv:2009.09796. Retrieved from https://arxiv.org/abs/2009.09796.
[7] Thomas Gärtner, Peter Flach, and Stefan Wrobel. 2003. On graph kernels: Hardness results and efficient alternatives. In Proceedings of Learning Theory and Kernel Machines. Springer, 129–143.
[8] Justin Gilmer, Samuel S. Schoenholz, Patrick F. Riley, Oriol Vinyals, and George E. Dahl. 2017. Neural message passing for quantum chemistry. In Proceedings of the International Conference on Machine Learning. PMLR, 1263–1272.
[9] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. 2014. Generative adversarial nets. NeurIPS 27 (2014).
[10] Aditya Grover and Jure Leskovec. 2016. node2vec: Scalable feature learning for networks. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 855–864.
[11] William L. Hamilton. 2020. Graph representation learning. Synthesis Lectures on Artificial Intelligence and Machine Learning 14, 3 (2020), 1–159.
[12] Kaveh Hassani and Amir Hosein Khasahmadi. 2020. Contrastive multi-view representation learning on graphs. In Proceedings of the International Conference on Machine Learning. PMLR, 4116–4126.
[13] Kurt Hornik. 1991. Approximation capabilities of multilayer feedforward networks. Neural Networks 4, 2 (1991), 251–257.
[14] Hisashi Kashima, Koji Tsuda, and Akihiro Inokuchi. 2003. Marginalized kernels between labeled graphs. In Proceedings of the 20th International Conference on Machine Learning. 321–328.
[15] Thomas N. Kipf and Max Welling. 2016a. Semi-supervised classification with graph convolutional networks. arXiv:1609.02907. Retrieved from https://arxiv.org/abs/1609.02907.
[16] Thomas N. Kipf and Max Welling. 2016b. Variational graph auto-encoders. arXiv:1611.07308. Retrieved from https://arxiv.org/abs/1611.07308.
[17] Risi Kondor and Horace Pan. 2016. The multiscale Laplacian graph kernel. NeurIPS 29 (2016), 2990–2998.
[18] Nils Kriege and Petra Mutzel. 2012. Subgraph matching kernels for attributed graphs. arXiv:1206.6483. Retrieved from https://arxiv.org/abs/1206.6483.
[19] Ralph Linsker. 1988. Self-organization in a perceptual network. Computer 21, 3 (1988), 105–117.
[20] Christopher Morris, Nils M. Kriege, Franka Bause, Kristian Kersting, Petra Mutzel, and Marion Neumann. 2020. TUDataset: A collection of benchmark datasets for learning with graphs. arXiv:2007.08663. Retrieved from https://arxiv.org/abs/2007.08663.
[21] Annamalai Narayanan, Mahinthan Chandramohan, Rajasekar Venkatesan, Lihui Chen, Yang Liu, and Shantanu Jaiswal. 2017. graph2vec: Learning distributed representations of graphs. arXiv:1707.05005. Retrieved from https://arxiv.org/abs/1707.05005.
[22] Giannis Nikolentzos, Polykarpos Meladianos, and Michalis Vazirgiannis. 2017. Matching node embeddings for graph similarity. In Proceedings of the AAAI.
[23] Sebastian Nowozin, Botond Cseke, and Ryota Tomioka. 2016. f-GAN: Training generative neural samplers using variational divergence minimization. In Proceedings of the NeurIPS. 271–279.
[24] Aaron van den Oord, Yazhe Li, and Oriol Vinyals. 2018. Representation learning with contrastive predictive coding. arXiv:1807.03748. Retrieved from https://arxiv.org/abs/1807.03748.
[25] Zhen Peng, Wenbing Huang, Minnan Luo, Qinghua Zheng, Yu Rong, Tingyang Xu, and Junzhou Huang. 2020. Graph representation learning via graphical mutual information maximization. In Proceedings of the Web Conference 2020. 259–270.
[26] Raghunathan Ramakrishnan, Pavlo O. Dral, Matthias Rupp, and O. Anatole Von Lilienfeld. 2014. Quantum chemistry structures and properties of 134 kilo molecules. Scientific Data 1, 1 (2014), 1–7.
[27] Bernhard Schölkopf, Alexander J. Smola, Francis Bach, et al. 2002. Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. MIT Press.
[28] Andrew W. Senior, Richard Evans, John Jumper, James Kirkpatrick, Laurent Sifre, Tim Green, Chongli Qin, Augustin Žídek, Alexander W. R. Nelson, Alex Bridgland, et al. 2020. Improved protein structure prediction using potentials from deep learning. Nature 577, 7792 (2020), 706–710.
[29] Nino Shervashidze, Pascal Schweitzer, Erik Jan Van Leeuwen, Kurt Mehlhorn, and Karsten M. Borgwardt. 2011. Weisfeiler-Lehman graph kernels. Journal of Machine Learning Research 12, 9 (2011).
[30] Nino Shervashidze, S. V. N. Vishwanathan, Tobias Petri, Kurt Mehlhorn, and Karsten Borgwardt. 2009. Efficient graphlet kernels for large graph comparison. In Proceedings of Artificial Intelligence and Statistics. PMLR, 488–495.
[31] Jonathan Shlomi, Peter Battaglia, and Jean-Roch Vlimant. 2020. Graph neural networks in particle physics. Machine Learning: Science and Technology 2, 2 (2020), 021001.
[32] Fan-Yun Sun, Jordan Hoffmann, Vikas Verma, and Jian Tang. 2019. InfoGraph: Unsupervised and semi-supervised graph-level representation learning via mutual information maximization. arXiv:1908.01000. Retrieved from https://arxiv.org/abs/1908.01000.
[33] Susheel Suresh, Pan Li, Cong Hao, and Jennifer Neville. 2021. Adversarial graph augmentation to improve graph contrastive learning. arXiv:2106.05819. Retrieved from https://arxiv.org/abs/2106.05819.
[34] Yonglong Tian, Chen Sun, Ben Poole, Dilip Krishnan, Cordelia Schmid, and Phillip Isola. 2020. What makes for good views for contrastive learning? arXiv:2005.10243. Retrieved from https://arxiv.org/abs/2005.10243.
[35] Petar Velickovic, William Fedus, William L. Hamilton, Pietro Liò, Yoshua Bengio, and R. Devon Hjelm. 2019. Deep graph infomax. ICLR 2, 3 (2019), 4.
[36] Boris Weisfeiler and Andrei Leman. 1968. The reduction of a graph to canonical form and the algebra which appears therein. NTI, Series 2, 9 (1968), 12–16.
[37] Jun Xia, Lirong Wu, Jintao Chen, Bozhen Hu, and Stan Z. Li. 2022. SimGRACE: A simple framework for graph contrastive learning without data augmentation. arXiv:2202.03104. Retrieved from https://arxiv.org/abs/2202.03104.
[38] Keyulu Xu, Weihua Hu, Jure Leskovec, and Stefanie Jegelka. 2018. How powerful are graph neural networks? arXiv:1810.00826. Retrieved from https://arxiv.org/abs/1810.00826.
[39] Longqi Yang, Liangliang Zhang, and Wenjing Yang. 2021. Graph adversarial self-supervised learning. NeurIPS 34 (2021).
[40] Yihang Yin, Qingzhong Wang, Siyu Huang, Haoyi Xiong, and Xiang Zhang. 2021. AutoGCL: Automated graph contrastive learning via learnable view generators. arXiv:2109.10259. Retrieved from https://arxiv.org/abs/2109.10259.
[41] Yuning You, Tianlong Chen, Yang Shen, and Zhangyang Wang. 2021. Graph contrastive learning automated. arXiv:2106.07594. Retrieved from https://arxiv.org/abs/2106.07594.
[42] Yuning You, Tianlong Chen, Yongduo Sui, Ting Chen, Zhangyang Wang, and Yang Shen. 2020. Graph contrastive learning with augmentations. NeurIPS 33 (2020), 5812–5823.
[43] Yanqiao Zhu, Yichen Xu, Qiang Liu, and Shu Wu. 2021a. An empirical study of graph contrastive learning. (2021).
[44] Yanqiao Zhu, Yichen Xu, Feng Yu, Qiang Liu, Shu Wu, and Liang Wang. 2021b. Graph contrastive learning with adaptive augmentation. In Proceedings of the Web Conference 2021. 2069–2080.
