Collaborative filtering (CF) is widely used to learn informative latent representations of users and items from observed interactions. Existing CF-based methods commonly adopt negative sampling to discriminate between different items: observed user-item pairs are treated as positive instances, while unobserved pairs are considered negative instances and sampled under a defined distribution for training. Training with negative sampling on large datasets is computationally expensive. Further, negative items should be sampled carefully under the defined distribution, in order to avoid selecting an observed positive item in the training dataset. Unavoidably, some negative items sampled from the training dataset could be positive in the test set. Recently, self-supervised learning (SSL) has emerged as a powerful tool to learn a model without negative samples. In this paper, we propose a self-supervised collaborative filtering framework (SelfCF) that is specially designed for recommender scenarios with implicit feedback. The proposed SelfCF framework simplifies Siamese networks and can be easily applied to existing deep learning-based CF models, which we refer to as backbone networks. The main idea of SelfCF is to augment the latent embeddings generated by backbone networks instead of the raw input of user/item ids. We propose and study three embedding perturbation techniques that can be applied to different types of backbone networks, including both traditional CF models and graph-based models. The framework enables learning informative representations of users and items without negative samples, and is agnostic to the encapsulated backbones. Our experimental comparisons on four datasets, one self-supervised framework, and eight baselines show that our framework may achieve even better recommendation accuracy than the encapsulated supervised counterpart with a 2×–4× faster training speed. The results also demonstrate that SelfCF can boost the accuracy of the self-supervised framework BUIR by 17.79% on average and shows competitive performance with the baselines.
1 Introduction
Recommender systems aim to provide users with personalized products or services. They help handle the increasing information overload problem and improve customer relationship management. In Figure 1, we present an illustration of recommendation under implicit feedback. Recommender systems are designed to infer the missing values of the matrix (right) transformed from the user-item interactions (left). In the top-K scenario, the inferred values are further ranked for each user for personalized recommendation. Collaborative Filtering (CF) is a canonical recommendation technique that predicts the interests of a user by aggregating information from similar users or items. In detail, existing CF-based methods [20, 21, 26, 38] learn latent representations of users and items by first factorizing the observed interaction matrix, and then predicting the potential interests of user-item pairs based on the dot product of the learned embeddings. However, existing CF models rely heavily on negative sampling techniques to discriminate between different items, because negative samples are not naturally available.
Fig. 1.
Nevertheless, negative sampling techniques suffer from a few limitations. Firstly, they introduce additional computation and memory costs. In existing CF-based methods, the negative sampling algorithm needs to be carefully designed so that it does not select observed positive user-item pairs. Specifically, to sample one negative user-item pair for a specific user, the algorithm checks each candidate for conflicts with all the observed positive items of that user. As a result, much computation is needed for users who have a large number of interactions. Secondly, even if non-conflicting negative samples are selected for a user, the samples may turn out to be future positive items of the user. The reason is that the unobserved user-item pairs can be either true negative instances (i.e., the user is not interested in these items) or missing values (e.g., interaction pairs observed in the test set but not in the training set) [28, 36]. We denote the sampled pairs that fall in the test set as false negative samples [6]. Although another line of work [7, 8, 9] dispenses with negative sampling and treats all unobserved interactions as negative samples, it may still treat future positive samples as negative.
To illustrate the negative sampling problem in current models, we employ uniform sampling (UniS) and Dynamic Negative Sampling (DNS) [57] in LightGCN [20] to study the aforementioned issues. Uniform sampling is a widely used and classical solution in the item recommendation domain with implicit feedback [6]. DNS improves upon uniform sampling by selecting a set of negative candidates and ranking the candidates based on the learned user/item embeddings; the top-ranked item is used as a hard instance. As a result, DNS is model-sensitive. We plot the percentage of sampled false negative pairs in the test set along the training progress of LightGCN under uniform sampling and DNS in Figure 2. We test the negative sampling methods on two diverse datasets, MOOC and Amazon Video Games (Games). MOOC contains 458,453 interactions collected from 82,535 learners on 1,302 courses. Games has 50,677 users, 16,897 items, and 454,529 interactions. The sparsity of MOOC and Games is 99.4039% and 99.9469%, respectively. From Figure 2, we observe that the percentage of false negative pairs sampled by DNS is over 50% at the point where LightGCN early-stops on the MOOC dataset (we use the original early stopping setting of LightGCN [20]). The sparser dataset, Games, has a relatively small percentage of sampled false negative instances, under 10%. However, the overhead of sampling a negative instance increases with the number of candidates, as shown in Table 1: although DNS can sample hard negative instances, its sampling overhead is 2-3 times that of uniform sampling. These observations suggest that training a model without negative sampling is promising.
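As a concrete sketch of the two samplers (a minimal, hypothetical Python implementation, not the code used in our experiments; `pos_items` maps each user to the set of observed items), note that both must re-draw whenever a candidate collides with an observed positive item, and DNS additionally scores a whole candidate set per negative, which explains its higher overhead in Table 1:

```python
import numpy as np

def uniform_negative(user, pos_items, n_items, rng):
    """Redraw until the candidate is not an observed positive item of `user`."""
    while True:
        item = int(rng.integers(n_items))
        if item not in pos_items[user]:  # conflict check against all observed positives
            return item

def dns_negative(user, pos_items, n_items, user_emb, item_emb, rng, n_candidates=8):
    """Dynamic negative sampling: draw several candidates, keep the hardest one."""
    candidates = [uniform_negative(user, pos_items, n_items, rng)
                  for _ in range(n_candidates)]
    # rank candidates with the current model (dot product of numpy embeddings);
    # the top-scored unobserved item is used as the hard negative instance
    return max(candidates, key=lambda i: float(user_emb[user] @ item_emb[i]))

# usage: rng = np.random.default_rng(); neg = dns_negative(0, pos_items, n_items,
#        user_emb, item_emb, rng)
```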
Fig. 2.
Table 1. Negative Sampling Time under Various Sampling Methods and Datasets

| Dataset | Method | Sampling time (s) | Training time (s) | Overhead |
|---------|--------|-------------------|-------------------|----------|
| MOOC    | UniS   | 2.32              | 8.89              | 26.1%    |
| MOOC    | DNS    | 32.38             | 40.33             | 80.3%    |
| Games   | UniS   | 3.47              | 8.22              | 42.2%    |
| Games   | DNS    | 39.15             | 43.68             | 89.6%    |

Overhead is calculated as the percentage of sampling time over training time per epoch. UniS denotes uniform sampling. DNS denotes dynamic negative sampling.
Self-supervised learning (SSL) models [13, 17, 51] provide a possible solution to the aforementioned limitations. SSL enables training a model by iteratively updating network parameters without using negative samples. Research in various domains, ranging from Computer Vision (CV) to Natural Language Processing (NLP), has shown that SSL can achieve competitive or even better results than supervised learning [12, 17, 51]. The underlying idea is to maximize the similarity of representations obtained from different distorted versions of a sample using a variant of Siamese networks [18]. Siamese networks usually include two symmetric networks (i.e., an online network and a target network) for comparing inputs. The problem with only positive samples in model training is that the Siamese networks collapse to a trivial constant solution [13]. Thus, in recent work, BYOL [17] and SimSiam [13] introduce asymmetry into the network architecture together with a dedicated parameter update technique. Specifically, in the network architecture, an additional "predictor" network is stacked onto the online encoder. For parameter updates, a special "stop gradient" operation is highlighted to prevent the solution from collapsing. SimSiam simplifies BYOL by removing its "momentum update", which updates the parameters of the target network based on the online network. We illustrate the architectures in detail in the related work section.
To the best of our knowledge, BUIR [28] is the only recommendation framework that learns user and item latent representations without negative samples. BUIR is derived from BYOL [17]. Similar to BYOL, BUIR employs two distinct encoder networks (i.e., online and target networks) to avoid collapsing to trivial constant solutions in SSL. In BUIR, the parameters of the online network are optimized towards those of the target network, while the parameters of the target network are updated with a momentum-based moving average [17, 19, 42] to slowly approximate the online network [28]. As BUIR is built upon BYOL, which stems from the vision domain, its architecture is redundant and suffers from slow convergence because of the design of the momentum-based parameter update. The SimSiam network was also originally proposed in the vision domain, where the input is an image and the techniques for data augmentation are relatively mature [39], such as random cropping, resizing, horizontal flipping, color jittering, converting to grayscale, Gaussian blurring, and solarization. For a pair of user and item ids observed in implicit feedback, there is no standard solution for distorting it while keeping its representation invariant.
The learning paradigm of SSL without negative samples differs from existing paradigms that use negative samples to learn representations. SSL without negative samples intends to learn an encoder with augmentation-invariant representations [13, 17]. That is, it minimizes the representation distance between two positive samples based on a Siamese network architecture [3]. When negative samples are used in SSL, their repulsive force prevents solutions from collapsing. Our proposed framework achieves competitive performance without harnessing this repulsive force.
In this paper, we propose a Self-supervised Collaborative Filtering (SelfCF) framework, which performs posterior perturbation on user and item latent embeddings to obtain a contrastive pair. In terms of architecture, our framework uses only one encoder instead of two, which simplifies BYOL and SimSiam. Besides, instead of perturbing inputs ahead of encoding, we generate different but related contrastive views with posterior embedding perturbations. An additional benefit of posterior embedding perturbation is that the framework can treat the internal implementation of the encapsulated backbones as a black box. In contrast, BUIR adds momentum-based parameter updating to the encoders in order to generate different views. Our experiments on four real-world datasets validate that the proposed SelfCF framework is able to learn informative representations solely based on positive user-item pairs. In our experiments, we encapsulate two popular CF-based models into the framework, and the results on top-K item recommendation are competitive with or even better than their supervised counterparts.
We summarize our contributions as follows:
• We propose a novel framework, SelfCF, that learns latent representations of users/items solely from positively observed interactions. The framework uses posterior output perturbation to generate different augmented views of the same user/item embeddings for contrastive learning.
• We design three output perturbation techniques: historical embedding, embedding dropout, and edge pruning, to distort the output of the backbone. The techniques are applicable to all existing CF-based models as long as their outputs are embedding-like.
• We investigate the underlying mechanisms of the framework by performing an ablation study on each component. We find the representations of users/items can be learned even without the "stop gradient" operator, which differs from the behavior of previous SSL frameworks (e.g., BYOL [17] and SimSiam [13]).
• Finally, we conduct experiments on four public datasets by encapsulating two popular backbones. Results show SelfCF is competitive with or better than the supervised counterparts and outperforms the existing SSL framework BUIR by 17.79% on average.
2 Related Work
In this section, we first review the CF technique, then summarize the current progress of SSL.
2.1 Collaborative Filtering
CF is a typical and prevalent technique adopted in modern recommender systems [48]. The core concept is that similar users tend to have similar tastes in items. To tackle the data sparsity and scalability issues of CF, a more advanced method, Matrix Factorization (MF), decomposes the original sparse matrix into low-dimensional, denser matrices with latent factors/features. To learn informative and compressed latent features, deep learning-based models have been further proposed for recommendation [21, 43, 56].
With the emergence of graph convolutional networks (GCNs), which generalize convolutional neural networks (CNNs) to graph-structured data [31, 54, 63], GCN-based CF has been widely researched recently [2, 45, 48, 60, 61, 62]. The user-item interaction matrix can naturally be treated as a bipartite graph. GCN-based CF takes advantage of fusing both high-order information and the inherent graph structure. GCNs propagate information using the normalized adjacency matrix and aggregate information from neighbors via nonlinear activation and linear transformation layers. He et al. [20] simplify the GCN architecture by removing the feature transformation as well as the nonlinear activation layers, as they impose a negative effect on recommendation performance. In [11], the authors add residual preference learning to the GCN and obtain better recommendation performance.
2.2 Self-supervised Learning
SSL has achieved competitive results on various tasks in vision and natural language processing domains. We review two lines of work on SSL.
Contrastive learning. Contrastive approaches learn representations by attracting positive sample pairs and repulsing negative sample pairs [18]. A line of work [12, 19, 22, 23, 47, 53, 55] is developed based on this concept. These works benefit from a large number of negative samples, which require a memory bank [47] or a queue [19] to store them. In [46], the authors integrate supplemental signals into supervised baselines for contrastive learning and show that it performs better than their baselines. Following this line, Yu et al. propose a graph-augmentation-free recommendation model [49] that enforces the learning of uniform representations for users and items. The uniform representations can mitigate popularity bias and achieve better recommendation accuracy. Liu et al. summarize contrastive learning applied across broad fields, e.g., NLP and Computer Vision, in [32].
Siamese networks. Siamese networks [3] are general models for comparing entities. BYOL [17] and SimSiam [13] are two specializations of the Siamese network that achieve remarkable results using only positive samples. BYOL proposes two coupled networks (i.e., online and target networks) that are optimized and updated iteratively. In detail, the online network is optimized towards the target network, while the target network is updated with a moving average of the online network to avoid collapse. In contrast, SimSiam verifies that a "stop gradient" operator is crucial in preventing collapse; as a result, it removes the dashed "momentum update" line in Figure 3(a).
Fig. 3.
Derived from BYOL, the recently proposed self-supervised framework BUIR learns the representations of users and items solely from positive interactions. It introduces different views by differentiating the parameters of the online and target networks. However, the framework modifies the underlying logic of the encapsulated graph-based CF models for the sake of introducing contrastive user-item pairs. In our solution, we choose to augment the output of the encoder f to generate two different but related embeddings for representation learning. For comparison, we present our proposed framework specialized for CF, SelfCF, in Figure 3(b). The framework shares the same encoder between the online and target networks, thus avoiding the unnecessary memory and computational resources for storing and executing an additional encoder in the target network. We elaborate on our framework in the following section.
3 The SelfCF Framework
Our framework (shown in Figure 3(b)) partially inherits the Siamese network architecture of SimSiam, as shown in Figure 3. In our framework, SelfCF, the goal is to learn informative representations of users and items based on positive user-item interactions only. The latent embeddings of users and items are learned from the online network. Analogous to convolutions [27], a successful inductive bias that models translation-invariance via weight-sharing, weight-sharing Siamese networks can model invariance with regard to more complicated transformations (e.g., data augmentations) [13]. The online and target networks in SelfCF use the same copy of the parameters as well as the backbone for modeling representation invariance. In addition, we drop the momentum encoder used in BYOL and BUIR. As a result, with the same input, the online and target networks would generate the same output, making the loss vanish entirely. We discuss how to tackle this issue in the following section.
When considering data augmentations of the input in CF, it is not a trivial task to distort the positive samples. In the vision domain, where SSL is popularly applied, images can be easily distorted under a wide range of transformations. However, positive user-item pairs are difficult to distort while preserving their representation invariance. We use the following embedding perturbation techniques to achieve the same effect. For clarity, we denote the bold value \({\bf E}\) as the embedding matrix of users and items within a batch, and differentiate the embedding matrix of users as \({\bf E}_u\) and that of items as \({\bf E}_i\). The lowercase value e denotes the embedding of a single user or item, specified as \(e_u\) or \(e_i\).
3.1 Data Augmentation via Output Perturbation
In vision, researchers use image transformations to augment input data and generate two different but related views. Instead, our framework augments the output embeddings of users and items to generate two contrastive views. We propose three methods to introduce embedding perturbation in our framework, shown in Figure 4. Historical embedding and embedding dropout are general techniques for output augmentation in our framework, while edge pruning is specially designed for graph-based CF models.
Fig. 4.
Historical embedding. We introduce embedding perturbation by utilizing historical embeddings [10, 15] from prior training iterations. Specifically, we use a momentum update to generate the contrastive embeddings in the target network. Suppose \({\bf E}^t\) is the embedding matrix generated by a backbone encoder f in a batch \(\mathbb {B}^t\). The perturbed embeddings \(\tilde{{\bf E}}^t\) are calculated by combining the output embeddings \({\bf E}^t\) with the historical embeddings \({\bf E}^{t-1}\):

\(\tilde{{\bf E}}^t = \tau \, {\bf E}^{t-1} + (1 - \tau) \, {\bf E}^t, \qquad (1)\)

where \(\tau\) is a parameter that controls the proportion of information preserved from the prior iteration.
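As a minimal PyTorch-style sketch (hypothetical class and variable names, not our released implementation), the historical embedding perturbation amounts to interpolating the current output with a buffered copy of the previous iteration's output:

```python
import torch

class HistoricalEmbedding:
    """Minimal sketch of Equation (1): mix the current batch embeddings with a
    buffered copy from the previous iteration. A full implementation would
    index the buffer by user/item ids rather than assume aligned batches."""

    def __init__(self, tau: float = 0.2):
        self.tau = tau
        self.prev = None  # holds E^{t-1}, lazily initialized

    @torch.no_grad()  # the target branch receives no gradient
    def perturb(self, emb: torch.Tensor) -> torch.Tensor:
        if self.prev is None:
            self.prev = emb.clone()
        out = self.tau * self.prev + (1.0 - self.tau) * emb  # \tilde{E}^t
        self.prev = emb.clone()  # buffer E^t for the next iteration
        return out
```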
Embedding dropout. We apply an embedding dropout scheme to perturb the embeddings of users and items in the target network. In classical CF models, the parameters are not modified until the loss is backpropagated. To avoid the null loss that would result from feeding two identical views into the loss, our framework applies dropout to the resulting user and item vectors, analogous to node dropout [40]. In this way, the framework generates two different but related views of the output, which are then fed into the loss function for optimization. The resulting embeddings under a dropout ratio p are calculated as:

\(\tilde{{\bf E}} = \frac{1}{1-p}\, {\bf M} \odot {\bf E}, \qquad (2)\)

where \({\bf M}\) is a binary mask whose elements are drawn from a Bernoulli distribution with probability \(1-p\), and \(\odot\) denotes element-wise multiplication.
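A minimal sketch of this perturbation (a hypothetical helper, assuming the inverted-dropout form of Equation (2)):

```python
import torch

@torch.no_grad()  # the target branch receives no gradient
def embedding_dropout(emb: torch.Tensor, p: float = 0.5) -> torch.Tensor:
    """Minimal sketch of Equation (2): zero each entry with probability p and
    rescale the survivors by 1/(1-p). Equivalent to
    torch.nn.functional.dropout(emb, p=p, training=True)."""
    mask = torch.bernoulli(torch.full_like(emb, 1.0 - p))
    return emb * mask / (1.0 - p)
```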
Edge pruning. As for graph-based CF models, the edge pruning method used in [34, 37] provides an alternative way to augment the output embeddings. Given the user-item bipartite graph, we randomly prune a certain proportion of edges from the graph in each batch. The output embeddings are then updated by aggregating the embeddings of neighbors, so for the same positive user-item pair, the output is distorted by a different adjacency matrix (i.e., a different set of neighbors). Let \(A_{pruned}\) be the pruned adjacency matrix; the resulting embeddings with edge pruning are denoted as:

\(\tilde{{\bf E}} = A_{pruned}\, {\bf E}. \qquad (3)\)
Note that, in implementation, edge pruning requires re-calculating the adjacency matrix of users and items, which is computationally more expensive than the embedding dropout technique.
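A sketch of this operation on a sparse adjacency matrix (hypothetical helper; real implementations would typically re-normalize the pruned adjacency before propagation):

```python
import torch

@torch.no_grad()  # the target branch receives no gradient
def edge_pruning(adj: torch.Tensor, emb: torch.Tensor, keep: float = 0.9) -> torch.Tensor:
    """Minimal sketch of Equation (3): randomly keep a proportion of the edges
    in the sparse (users+items) x (users+items) adjacency matrix, then run one
    extra propagation step over the pruned graph to perturb the embeddings."""
    adj = adj.coalesce()
    indices, values = adj.indices(), adj.values()
    mask = torch.rand(values.size(0), device=values.device) < keep
    pruned = torch.sparse_coo_tensor(indices[:, mask], values[mask], adj.shape)
    return torch.sparse.mm(pruned, emb)  # aggregate neighbors on the pruned graph
```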
To summarize, our framework augments the output via embedding perturbation in the target network instead of distorting the input directly, as is common in the vision domain. It is worth noting that historical embedding perturbation operates on embeddings from the prior and current iterations, embedding dropout perturbs the current embedding with noise, and edge pruning operates on future embeddings generated by stacking one more convolutional layer on the current embeddings. Both historical embedding and embedding dropout perturbation remove the requirement of auxiliary graphs to generate a contrastive view as in [28, 46, 49]. We will discuss their performance from this perspective in the experiments section.
3.2 The Loss Function
Our framework, as shown in Figure 3(b), takes a positive user-item pair \((u,i)\) as input. The \((u,i)\) pair is first processed by an encoder network f in a backbone (e.g., LightGCN [20]). The output of the encoder f is then copied to the target network for embedding perturbation. Formally, we denote the output of the encoder from the online network as \((e_u, e_i) = f(u, i)\). Finally, the linear predictor in our framework transforms the output \((e_u, e_i)\) with \((\dot{e}_u, \dot{e}_i) = h(e_u, e_i)\) and matches it to the perturbed embeddings \((\tilde{e}_u, \tilde{e}_i) = g(e_u, e_i)\) of the other view, as in BYOL [17] and SimSiam [13].
We define a symmetrized loss function as the negative cosine similarity between \((\dot{e}_u, \tilde{e}_i)\) and \((\tilde{e}_u, \dot{e}_i)\):

\(\mathcal {C}(u, i) = -\frac{1}{2}\left(\frac{\dot{e}_u}{||\dot{e}_u||_2} \cdot \frac{\tilde{e}_i}{||\tilde{e}_i||_2} + \frac{\tilde{e}_u}{||\tilde{e}_u||_2} \cdot \frac{\dot{e}_i}{||\dot{e}_i||_2}\right), \qquad (4)\)

where \(||\cdot ||_2\) is the \(\ell _2\)-norm. The total loss is averaged over all user-item pairs in a batch. The intuition is that we intend to maximize the prediction of the perturbed item i given a user u, and vice versa. The minimum possible value of this loss is \(-1\).
Finally, we stop the gradient on the target network and force the backpropagation of the loss over the online network only. We follow the stop gradient (sg) operator as in [13, 17], and implement the operator by updating Equation (4) as:

\(\mathcal {C}(u, i) = -\frac{1}{2}\left(\frac{\dot{e}_u}{||\dot{e}_u||_2} \cdot \frac{sg(\tilde{e}_i)}{||sg(\tilde{e}_i)||_2} + \frac{sg(\tilde{e}_u)}{||sg(\tilde{e}_u)||_2} \cdot \frac{\dot{e}_i}{||\dot{e}_i||_2}\right). \qquad (5)\)
With the stop gradient operator, the target network receives no gradient from \((\tilde{e}_u, \tilde{e}_i)\); only the encoder f in the online network receives gradients through \((\dot{e}_u, \dot{e}_i)\) and optimizes its parameters towards the global optimum. Conversely, removing this operator can cause instability in online network learning, which we will verify through an ablation study. The reason is that the online and target networks simulate a student-teacher-like network [42] in which only the online network is optimized to predict the positively interacted item (user) presented by the target network. Additionally, we add regularization penalties on the online embeddings (i.e., \(e_u\) and \(e_i\)) and the predictor h. The final loss function is:

\(\mathcal {L} = \mathcal {C}(u, i) + \lambda _1 \left(||e_u||_2^2 + ||e_i||_2^2\right) + \lambda _2 \, ||h||_1, \qquad (6)\)

where \(||\cdot ||_1\) is the \(\ell _1\)-norm (applied to the parameters of the predictor h) and \(\lambda _1\), \(\lambda _2\) are regularization coefficients. The pseudo-code of SelfCF is in Algorithm 1.
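The following is a condensed sketch of one training step (hypothetical function names; `backbone` stands for the encapsulated encoder f, `perturb` for any of the three techniques above, and the regularization terms of Equation (6) are omitted for brevity):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def neg_cosine(p: torch.Tensor, z: torch.Tensor) -> torch.Tensor:
    """Negative cosine similarity; detaching z realizes the sg(.) operator."""
    return -F.cosine_similarity(p, z.detach(), dim=-1).mean()

def selfcf_step(backbone, predictor: nn.Linear, perturb, users, items):
    """One SelfCF training step computing the symmetrized loss of Equation (5)."""
    e_u, e_i = backbone(users, items)           # online outputs of the encoder f
    p_u, p_i = predictor(e_u), predictor(e_i)   # \dot{e}_u, \dot{e}_i = h(e_u), h(e_i)
    t_u, t_i = perturb(e_u), perturb(e_i)       # \tilde{e}_u, \tilde{e}_i
    # symmetrized cross-prediction: the user predicts the item and vice versa
    loss = 0.5 * neg_cosine(p_u, t_i) + 0.5 * neg_cosine(p_i, t_u)
    return loss

# e.g., predictor = nn.Linear(64, 64)  # the linear predictor h
```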
3.3 Top-K Recommendation
Classical CF methods recommend top-K items by ranking the inner-product scores of a user embedding with all candidate item embeddings. However, in SSL, we minimize the prediction loss between u and i for each positive interaction \((u, i)\). Intuitively, we predict the future interaction score based on a cross-prediction task [28]. That is, we predict both the interaction probability of item i given u and that of user u given i. Given \((e_u, e_i)\) as the output of the encoder f, the recommendation score is calculated as:

\(s(u, i) = \dot{e}_u \cdot e_i + e_u \cdot \dot{e}_i, \qquad (7)\)

where \(\dot{e}_u = h(e_u)\) and \(\dot{e}_i = h(e_i)\) are the outputs of the predictor.
It is worth noting that since the encoder f is shared between both online and target networks, we use the representations obtained from the online network to predict top-K items for each user.
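As a sketch of the cross-prediction scoring of Equation (7) for full ranking (hypothetical tensor names; `e_users` is an (n_users, d) matrix and `e_items` an (n_items, d) matrix from the online network):

```python
import torch

@torch.no_grad()
def recommend_topk(e_users, e_items, predictor, k=10):
    """Score s(u, i) = h(e_u)·e_i + e_u·h(e_i) for all pairs and return the
    indices of the top-K items per user. In practice, items observed in the
    training set should be masked out before taking the top-K."""
    p_users, p_items = predictor(e_users), predictor(e_items)
    scores = p_users @ e_items.T + e_users @ p_items.T  # (n_users, n_items)
    return scores.topk(k, dim=1).indices
```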
4 Experiments
We evaluate the framework on four publicly available datasets and compare its performance with BUIR [28] and eight baselines by encapsulating BPR and LightGCN as our backbones. Our framework is mainly compared with BUIR, as it is the only recommendation framework that works without negative samples. All baselines as well as our frameworks are trained on a single GeForce RTX 2080 Ti (11 GB).
We list the research questions addressed in our evaluation as follows:
RQ1: Can self-supervised models that leverage only positive user-item interactions outperform their supervised counterparts?
RQ2: How does SelfCF shape the recommendation results for cold-start and loyal users?
RQ3: Why does SelfCF work, and which component is essential in preventing collapse?
We address the first research question by evaluating our framework against supervised baselines on four datasets under six evaluation metrics. Next, we dive into the recommendation results of the baselines under both supervised and self-supervised settings and analyze their performance on users with different numbers of interactions. Finally, to investigate the underlying mechanisms of SelfCF, we perform an ablation study on the components of SelfCF, such as the linear predictor, the loss function, and so on.
4.1 Dataset Description
We chose the evaluated datasets carefully by considering the following principles in order to introduce as much diversity as possible.
Domain: Interactions within the same domain may exhibit similar patterns across datasets. Hence, we choose evaluation datasets from two different domains ranging from education to e-commerce under different categories.
Released date: Existing recommender systems are usually evaluated on outdated datasets collected nearly 10 years ago. With the rapid growth of e-commerce platforms, user behaviors are gradually being shaped by online purchasing.
Graph size: Since the user-item interactions can be viewed as a bipartite graph (Figure 1), we consider graph sizes with the number of nodes ranging from 10K to 100K.
We describe each dataset with regard to the above selection principles.
• Amazon Video Games (Games): This is the version of the Amazon Video Games review dataset newly released in 2018. We select the rating-only version for evaluation. The dataset is available from [35].
• Amazon Arts, Crafts and Sewing (Arts): This dataset is similar to the Amazon Video Games dataset under a different genre.
• Amazon Grocery and Gourmet Food (Food): This dataset has a large-scale interaction graph with more than 100K users.
• COCO: A large-scale dataset from the education domain. The raw dataset includes over 43K online courses and 2.5M learners [14].
All raw datasets are preprocessed with a 5-core setting on both items and users and the filtered results are presented in Table 2.
Table 2. Statistics of the Experimented Data

| Dataset | # of Users | # of Items | # of Interactions | Sparsity |
|---------|------------|------------|-------------------|----------|
| Arts    | 45,624     | 21,104     | 396,556           | 99.9588% |
| Games   | 50,677     | 16,897     | 454,529           | 99.9469% |
| Food    | 115,144    | 39,688     | 1,025,169         | 99.9776% |
| COCO    | 144,773    | 20,969     | 1,204,697         | 99.9603% |
4.2 Encapsulated Baselines and Framework BUIR
To compare the performance of our proposed framework, we first consider the following baselines, all of which adopt negative sampling for supervised learning except the popularity algorithm.
• Pop: The popularity algorithm recommends the most popular items to each user.
• BPR [36]: A matrix factorization model optimized by a pairwise ranking loss in a Bayesian way.
• MultiVAE [30]: A generative model that adopts a variational auto-encoder (VAE) for item-based CF. It uses a multinomial likelihood to fit the distribution of data and adopts Bayesian inference for parameter estimation.
• EHCF [9]: An efficient recommendation model that learns the representations of users and items by reconstructing the interaction matrix without negative sampling. It takes all unobserved user-item pairs as negative samples.
• NGCF [45]: This model explicitly injects the collaborative signal from the high-order connectivity of the user-item graph into the embedding learning process.
• LR-GCCF [11]: The model first simplifies the vanilla GCN by removing the nonlinear function, then uses a residual preference learning process for prediction.
• LightGCN [20]: A simplified graph convolution network that only performs linear propagation and aggregation between neighbors. The hidden layer embeddings are averaged to calculate the final user/item embeddings for prediction.
• SimGCL [49]: This self-supervised model injects uniform noise into the latent embeddings to generate contrastive views.
We also consider the following self-supervised frameworks that learn the representations of users and items without negative samples. Our framework is mainly compared with BUIR [28], a self-supervised framework that is derived from BYOL [17]. Its architecture follows the Siamese network in Figure 3(a). To compare the performance of our proposed framework, we encapsulate two state-of-the-art models, BPR and LightGCN, into the frameworks. That is, we substitute the encoder f in Figure 3(b) with BPR and LightGCN, respectively.
• BUIR [28]: This framework uses an asymmetric network architecture to update its backbone network parameters.
• SelfCF\(_{\textrm {he}}\): Our proposed framework that uses historical embeddings for data augmentation.
• SelfCF\(_{\textrm {ed}}\): Our proposed framework that uses embedding dropout for data augmentation.
• SelfCF\(_{\textrm {ep}}\): Our proposed framework that uses edge pruning for data augmentation.
To demonstrate the generalization of our framework, we consider two backbone networks for BUIR and SelfCF\(_{\textrm {ed}}\): the classic BPR and the recent graph-based LightGCN. The other frameworks use only LightGCN as their backbone network because LightGCN consistently shows better performance than BPR.
4.3 Evaluation Metrics
We use \(Recall@K\) and \(NDCG@K\) computed by the all-ranking protocol as the evaluation metrics for recommendation accuracy. In the recommendation phase, all items that a specific user has not interacted with are regarded as candidates. That is, we do not use sampled evaluation.
Formally, we define \(I^r_u(i)\) as the i-th ranked item recommended for u, \(\mathcal {I}[\cdot ]\) as the indicator function, and \(I^t_u\) as the set of items that user u interacted with in the testing data. Then:

\(Recall@K = \frac{1}{|I^t_u|} \sum _{i=1}^{K} \mathcal {I}\left[I^r_u(i) \in I^t_u\right], \qquad DCG@K = \sum _{i=1}^{K} \frac{\mathcal {I}\left[I^r_u(i) \in I^t_u\right]}{\log _2 (i+1)}.\)

\(NDCG@K\) is normalized to [0, 1] with \(NDCG@K = DCG@K / IDCG@K\), where \(IDCG@K\) is calculated by placing the interacted items in the testing data at the top of the ranking and then applying the formula for \(DCG@K\). We set \(K=10\), \(K=20\), and \(K=50\) in our experimental comparison. For simplicity, we denote \(Recall@K\) and \(NDCG@K\) by \(R@K\) and \(N@K\) in the following sections.
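A sketch of the two per-user metrics (hypothetical helper functions; `ranked` is the ordered list of recommended item ids, `relevant` the user's test items):

```python
import math

def recall_at_k(ranked, relevant, k):
    """Fraction of the user's test items retrieved within the top-K."""
    hits = sum(1 for item in ranked[:k] if item in relevant)
    return hits / len(relevant)

def ndcg_at_k(ranked, relevant, k):
    """DCG of the recommendation list normalized by the ideal DCG."""
    dcg = sum(1.0 / math.log2(i + 2)  # position i is 0-based, hence i + 2
              for i, item in enumerate(ranked[:k]) if item in relevant)
    ideal_hits = min(len(relevant), k)  # all test items ranked at the top
    idcg = sum(1.0 / math.log2(i + 2) for i in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0
```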
4.4 Hyper-parameter Settings
Following other work [11, 20], we fix the embedding size of both users and items to 64 for all models, initialize the embedding parameters with the Xavier method [16], and use Adam [25] as the optimizer. For a fair comparison, we carefully tune the parameters of each model following their published papers. For our proposed frameworks, we perform a grid search across all datasets to confirm the optimal settings. We summarize the settings in Table 3. We penalize the predictor with \(L_1\) regularization when BPR is encapsulated; otherwise, we use \(L_2\) regularization. The reason is that BPR learns the embeddings of users and items without leveraging the graph structure and is prone to over-fitting, so we add \(L_1\) regularization to learn a sparsified predictor. For convergence considerations, the early stopping and total epochs are fixed at 50 and 1,000, respectively. Following [45], we use Recall@20 on validation data as the training stopping indicator. We implement our model on top of RecBole [58] at: https://github.com/enoche/SelfCF.
4.5 Overall Performance Comparison
While we acknowledge the significance of online evaluation for recommender systems, it is not feasible to evaluate our model in such a manner in an academic environment. Therefore, to avoid data leakage under offline evaluation [41], we adopt the evaluation setting used in [5, 29], which splits the data chronologically in a 7:1:2 ratio for training, validation, and testing. We define the global comparison perspective as the comparison across supervised and self-supervised baselines, and the local comparison perspective as the comparison between the self-supervised frameworks BUIR and SelfCF. We analyze the comparison results with regard to recommendation accuracy (Table 4) under the following perspectives:
Table 4. Overall Performance Comparison

Arts

| Paradigm | Framework | Model | R@10 | R@20 | R@50 | N@10 | N@20 | N@50 |
|---|---|---|---|---|---|---|---|---|
| Non-parametric | - | Pop | 0.0091 | 0.0164 | 0.0283 | 0.0072 | 0.0095 | 0.0128 |
| Supervised (with NS) | - | BPR | 0.0201 | 0.0327 | 0.0589 | 0.0137 | 0.0177 | 0.0245 |
| | - | MultiVAE | 0.0171 | 0.0268 | 0.0503 | 0.0113 | 0.0145 | 0.0205 |
| | - | EHCF | 0.0202 | 0.0319 | 0.0567 | 0.0136 | 0.0175 | 0.0240 |
| | - | NGCF | 0.0205 | 0.0342 | 0.0623 | 0.0142 | 0.0186 | 0.0260 |
| | - | LR-GCCF | 0.0221 | 0.0365 | 0.0636 | 0.0151 | 0.0197 | 0.0268 |
| | - | LightGCN | 0.0231 | 0.0371 | 0.0663 | 0.0156 | 0.0201 | 0.0277 |
| Self-Supervised (with NS) | - | SimGCL | 0.0198 | 0.0322 | 0.0558 | 0.0133 | 0.0172 | 0.0234 |
| Self-Supervised (without NS) | BUIR | BPR | 0.0197 | 0.0309 | 0.0560 | 0.0139 | 0.0174 | 0.0239 |
| | BUIR | LightGCN | 0.0208 | 0.0334 | 0.0636 | 0.0149 | 0.0190 | 0.0270 |
| | SelfCF\(_{\textrm {he}}\) | LightGCN | 0.0236 | 0.0397 | 0.0709 | 0.0157 | 0.0208 | 0.0289 |
| | | \(\Delta\) | 13.46% | 18.86% | 11.48% | 5.37% | 9.47% | 7.04% |
| | SelfCF\(_{\textrm {ed}}\) | BPR | 0.0231 | 0.0354 | 0.0632 | 0.0157 | 0.0197 | 0.0269 |
| | | \(\Delta\) | 17.26% | 14.56% | 12.86% | 12.95% | 13.22% | 12.55% |
| | SelfCF\(_{\textrm {ed}}\) | LightGCN | 0.0246 | 0.0391 | 0.0708 | 0.0170 | 0.0218 | 0.0300 |
| | | \(\Delta\) | 18.27% | 17.07% | 11.32% | 14.09% | 14.74% | 11.11% |
| | SelfCF\(_{\textrm {ep}}\) | LightGCN | 0.0239 | 0.0395 | 0.0714 | 0.0158 | 0.0208 | 0.0290 |
| | | \(\Delta\) | 14.90% | 18.26% | 12.26% | 6.04% | 9.47% | 7.41% |

Games

| Paradigm | Framework | Model | R@10 | R@20 | R@50 | N@10 | N@20 | N@50 |
|---|---|---|---|---|---|---|---|---|
| Non-parametric | - | Pop | 0.0117 | 0.0175 | 0.0379 | 0.0049 | 0.0067 | 0.0117 |
| Supervised (with NS) | - | BPR | 0.0210 | 0.0369 | 0.0699 | 0.0135 | 0.0183 | 0.0265 |
| | - | MultiVAE | 0.0238 | 0.0376 | 0.0718 | 0.0154 | 0.0196 | 0.0280 |
| | - | EHCF | 0.0278 | 0.0445 | 0.0772 | 0.0175 | 0.0227 | 0.0308 |
| | - | NGCF | 0.0254 | 0.0425 | 0.0825 | 0.0166 | 0.0217 | 0.0314 |
| | - | LR-GCCF | 0.0259 | 0.0446 | 0.0824 | 0.0171 | 0.0228 | 0.0320 |
| | - | LightGCN | 0.0275 | 0.0461 | 0.0841 | 0.0175 | 0.0231 | 0.0326 |
| Self-Supervised (with NS) | - | SimGCL | 0.0310 | 0.0502 | 0.0879 | 0.0194 | 0.0251 | 0.0344 |
| Self-Supervised (without NS) | BUIR | BPR | 0.0217 | 0.0361 | 0.0674 | 0.0135 | 0.0180 | 0.0257 |
| | BUIR | LightGCN | 0.0227 | 0.0384 | 0.0749 | 0.0143 | 0.0192 | 0.0282 |
| | SelfCF\(_{\textrm {he}}\) | LightGCN | 0.0295 | 0.0473 | 0.0859 | 0.0187 | 0.0241 | 0.0336 |
| | | \(\Delta\) | 29.96% | 23.18% | 14.69% | 30.77% | 25.52% | 19.15% |
| | SelfCF\(_{\textrm {ed}}\) | BPR | 0.0241 | 0.0402 | 0.0744 | 0.0152 | 0.0200 | 0.0285 |
| | | \(\Delta\) | 11.06% | 11.36% | 10.39% | 12.59% | 11.11% | 10.89% |
| | SelfCF\(_{\textrm {ed}}\) | LightGCN | 0.0289 | 0.0485 | 0.0857 | 0.0181 | 0.0240 | 0.0332 |
| | | \(\Delta\) | 27.31% | 26.30% | 14.42% | 26.57% | 25.00% | 17.73% |
| | SelfCF\(_{\textrm {ep}}\) | LightGCN | 0.0301 | 0.0517 | 0.0930 | 0.0189 | 0.0255 | 0.0358 |
| | | \(\Delta\) | 32.60% | 34.64% | 24.17% | 32.17% | 32.81% | 26.95% |

Food

| Paradigm | Framework | Model | R@10 | R@20 | R@50 | N@10 | N@20 | N@50 |
|---|---|---|---|---|---|---|---|---|
| Non-parametric | - | Pop | 0.0125 | 0.0189 | 0.0346 | 0.0112 | 0.0133 | 0.0173 |
| Supervised (with NS) | - | BPR | 0.0138 | 0.0222 | 0.0390 | 0.0097 | 0.0124 | 0.0167 |
| | - | MultiVAE | 0.0133 | 0.0208 | 0.0374 | 0.0092 | 0.0116 | 0.0159 |
| | - | EHCF | 0.0158 | 0.0243 | 0.0416 | 0.0111 | 0.0137 | 0.0182 |
| | - | NGCF | 0.0158 | 0.0254 | 0.0456 | 0.0102 | 0.0132 | 0.0185 |
| | - | LR-GCCF | 0.0172 | 0.0277 | 0.0478 | 0.0120 | 0.0154 | 0.0206 |
| | - | LightGCN | 0.0184 | 0.0286 | 0.0497 | 0.0125 | 0.0157 | 0.0211 |
| Self-Supervised (with NS) | - | SimGCL | 0.0173 | 0.0265 | 0.0453 | 0.0116 | 0.0147 | 0.0195 |
| Self-Supervised (without NS) | BUIR | BPR | 0.0113 | 0.0178 | 0.0313 | 0.0075 | 0.0096 | 0.0130 |
| | BUIR | LightGCN | 0.0145 | 0.0236 | 0.0469 | 0.0111 | 0.0141 | 0.0201 |
| | SelfCF\(_{\textrm {he}}\) | LightGCN | 0.0195 | 0.0299 | 0.0516 | 0.0132 | 0.0166 | 0.0221 |
| | | \(\Delta\) | 34.48% | 26.69% | 10.02% | 18.92% | 17.73% | 9.95% |
| | SelfCF\(_{\textrm {ed}}\) | BPR | 0.0165 | 0.0259 | 0.0443 | 0.0111 | 0.0141 | 0.0188 |
| | | \(\Delta\) | 46.02% | 45.51% | 41.53% | 48.00% | 46.88% | 44.62% |
| | SelfCF\(_{\textrm {ed}}\) | LightGCN | 0.0198 | 0.0316 | 0.0555 | 0.0135 | 0.0173 | 0.0235 |
| | | \(\Delta\) | 36.55% | 33.90% | 18.34% | 21.62% | 22.70% | 16.92% |
| | SelfCF\(_{\textrm {ep}}\) | LightGCN | 0.0186 | 0.0296 | 0.0514 | 0.0126 | 0.0161 | 0.0216 |
| | | \(\Delta\) | 28.28% | 25.42% | 9.59% | 13.51% | 14.18% | 7.46% |

COCO

| Paradigm | Framework | Model | R@10 | R@20 | R@50 | N@10 | N@20 | N@50 |
|---|---|---|---|---|---|---|---|---|
| Non-parametric | - | Pop | 0.0574 | 0.0798 | 0.1393 | 0.0318 | 0.0385 | 0.0525 |
| Supervised (with NS) | - | BPR | 0.1181 | 0.1745 | 0.2681 | 0.0741 | 0.0908 | 0.1129 |
| | - | MultiVAE | 0.1243 | 0.1816 | 0.2786 | 0.0775 | 0.0946 | 0.1175 |
| | - | EHCF | 0.1146 | 0.1674 | 0.2507 | 0.0724 | 0.0880 | 0.1078 |
| | - | NGCF | 0.1210 | 0.1817 | 0.2843 | 0.0740 | 0.0921 | 0.1163 |
| | - | LR-GCCF | 0.1215 | 0.1784 | 0.2734 | 0.0754 | 0.0923 | 0.1147 |
| | - | LightGCN | 0.1213 | 0.1781 | 0.2723 | 0.0762 | 0.0932 | 0.1154 |
| Self-Supervised (with NS) | - | SimGCL | 0.1238 | 0.1758 | 0.2564 | 0.0784 | 0.0939 | 0.1130 |
| Self-Supervised (without NS) | BUIR | BPR | 0.0977 | 0.1445 | 0.2222 | 0.0601 | 0.0740 | 0.0924 |
| | BUIR | LightGCN | 0.1162 | 0.1745 | 0.2672 | 0.0719 | 0.0893 | 0.1113 |
| | SelfCF\(_{\textrm {he}}\) | LightGCN | 0.1147 | 0.1758 | 0.2722 | 0.0716 | 0.0898 | 0.1127 |
| | | \(\Delta\) | -1.29% | 0.74% | 1.87% | -0.42% | 0.56% | 1.26% |
| | SelfCF\(_{\textrm {ed}}\) | BPR | 0.1126 | 0.1672 | 0.2508 | 0.0684 | 0.0847 | 0.1046 |
| | | \(\Delta\) | 15.25% | 15.71% | 12.87% | 13.81% | 14.46% | 13.20% |
| | SelfCF\(_{\textrm {ed}}\) | LightGCN | 0.1287 | 0.1892 | 0.2877 | 0.0796 | 0.0977 | 0.1210 |
| | | \(\Delta\) | 10.76% | 8.42% | 7.67% | 10.71% | 9.41% | 8.72% |
| | SelfCF\(_{\textrm {ep}}\) | LightGCN | 0.1174 | 0.1734 | 0.2712 | 0.0740 | 0.0906 | 0.1137 |
| | | \(\Delta\) | 1.03% | -0.63% | 1.50% | 2.92% | 1.46% | 2.16% |

We mark the global best results on each dataset under each metric in boldface, and the second best underlined. We also calculate the performance improvement of SelfCF over BUIR (with the same backbone) on each evaluation metric as \(\Delta\). "NS" denotes Negative Samples.
• Classic CF vs. graph-based CF. In general, graph-based CF (i.e., NGCF, LR-GCCF, LightGCN) performs better than the other supervised baselines. We speculate that graph-based CF models naturally encode structural embeddings that are preferred for contrastive learning. Analogously, self-supervised frameworks encapsulating LightGCN also show better performance. The performance of LightGCN under SelfCF is on par with or better than that under supervised learning. Classic CF models, e.g., BPR, use pairwise learning to differentiate positive and negative user-item samples, which encodes less information between positive instances, resulting in worse performance under the self-supervised framework BUIR. In contrast, in our framework, we penalize the predictor h with an L1 regularization term. As a result, a sparse and weak predictor encourages the framework to learn informative representations for users and items.
• Comparison between self-supervised frameworks. When comparing frameworks without negative samples, our proposed framework SelfCF\(_{\textrm {ed}}\) improves on BUIR on every evaluation metric across all datasets. The proposed framework with three output perturbations achieves significant improvement, as high as 17.79% on average over the four datasets. In particular, our framework SelfCF\(_{\textrm {ed}}\) gains 21.19% over BUIR when both use BPR as the backbone network. It is worth mentioning that SimGCL, which leverages negative samples for representation learning, obtains competitive performance on ranking metrics (e.g., NDCG@10).
• Output perturbation techniques in SelfCF. Among the three output perturbation techniques, the historical embedding technique integrates the embedding from the previous training iteration; the embedding dropout technique introduces noise on the current output embedding; and the edge pruning technique achieves embedding augmentation by merging embeddings from neighbors. Within the three proposed output perturbation techniques, Table 4 shows the embedding dropout technique is preferable across all datasets. The reason is that the embedding dropout and edge pruning techniques can remove noisy information and preserve the salient features in the embeddings. However, embedding dropout is better than edge pruning in retaining the similarity between the original embedding and the augmented embedding.
We conclude our analysis to address research question RQ1: both classical CF and graph-based CF can benefit from SelfCF. Specifically, the supervised counterparts, BPR and LightGCN, can be improved by up to 7.36% and 6.55% across the four datasets under SelfCF\(_{\textrm {ed}}\), respectively.
4.6 Efficiency of SelfCF
We evaluate the efficiency of SelfCF compared with LightGCN with regard to the number of layers in Table 5. From the results, we observe SelfCF\(_{\textrm {ed}}\) is on par with or better than 4-layer LightGCN while requiring only one half to one quarter of its training time.
Table 5. Efficiency of SelfCF

| Dataset | Model | R@10 | R@20 | R@50 | N@10 | N@20 | N@50 | Time (s) |
|---|---|---|---|---|---|---|---|---|
| Games | SelfCF\(_{\textrm {ed}}\) 1-Layer | 0.0274 | 0.0456 | 0.0857 | 0.0175 | 0.0231 | 0.0332 | 3.19 |
| Games | SelfCF\(_{\textrm {ed}}\) 2-Layer | 0.0289 | 0.0485 | 0.0857 | 0.0181 | 0.0240 | 0.0332 | 3.75 |
| Games | LightGCN 4-Layer | 0.0275 | 0.0461 | 0.0841 | 0.0175 | 0.0231 | 0.0326 | 8.22 |
| Games | LightGCN 3-Layer | 0.0270 | 0.0458 | 0.0836 | 0.0176 | 0.0233 | 0.0326 | 7.60 |
| Games | LightGCN 2-Layer | 0.0271 | 0.0454 | 0.0818 | 0.0174 | 0.0230 | 0.0320 | 6.78 |
| Games | LightGCN 1-Layer | 0.0263 | 0.0448 | 0.0798 | 0.0172 | 0.0228 | 0.0315 | 5.05 |
| Food | SelfCF\(_{\textrm {ed}}\) 1-Layer | 0.0197 | 0.0316 | 0.0547 | 0.0135 | 0.0173 | 0.0233 | 14.09 |
| Food | SelfCF\(_{\textrm {ed}}\) 2-Layer | 0.0198 | 0.0316 | 0.0555 | 0.0135 | 0.0173 | 0.0235 | 17.20 |
| Food | LightGCN 4-Layer | 0.0184 | 0.0286 | 0.0497 | 0.0125 | 0.0157 | 0.0211 | 59.81 |
| Food | LightGCN 3-Layer | 0.0176 | 0.0280 | 0.0484 | 0.0122 | 0.0155 | 0.0207 | 48.54 |
| Food | LightGCN 2-Layer | 0.0177 | 0.0280 | 0.0482 | 0.0121 | 0.0154 | 0.0206 | 41.46 |
| Food | LightGCN 1-Layer | 0.0167 | 0.0267 | 0.0460 | 0.0118 | 0.0149 | 0.0198 | 26.02 |
4.7 Understanding the Learning of SelfCF
In this section, we attempt to answer "why do SelfCF frameworks work well for recommendation?" Based on the line of work [1, 52], we hypothesize that the "stop-gradient" design in SelfCF has a de-correlation effect on the learned representations. Following [1], we define the covariance matrix of \({\bf E} = [{\bf e}_1, \dots , {\bf e}_n]\) as:

\({\rm Cov}({\bf E}) = \frac{1}{n-1}\sum _{k=1}^{n} ({\bf e}_k - \bar{{\bf e}})({\bf e}_k - \bar{{\bf e}})^{\top }, \quad \bar{{\bf e}} = \frac{1}{n}\sum _{k=1}^{n} {\bf e}_k,\)

and measure the covariance value as \(c({\bf E}) = \frac{1}{d}\sum _{i \ne j} \left[{\rm Cov}({\bf E})\right]_{i,j}^{2}\), where \(1/d\) is a scale factor over the embedding dimension d. A lower covariance value indicates a better de-correlation effect on representations.
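A sketch of this computation, following the reconstruction above (hypothetical helper; `emb` is the (n, d) matrix of learned representations):

```python
import torch

@torch.no_grad()
def covariance_value(emb: torch.Tensor) -> float:
    """Mean-center the (n, d) embeddings, form the d x d covariance matrix,
    and sum the squared off-diagonal entries with the 1/d scale factor;
    lower values indicate better de-correlated representations."""
    n, d = emb.shape
    centered = emb - emb.mean(dim=0, keepdim=True)
    cov = centered.T @ centered / (n - 1)          # Cov(E)
    off_diag = cov - torch.diag(torch.diag(cov))   # zero out the diagonal
    return float(off_diag.pow(2).sum() / d)
```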
Then, we compare the performance of SelfCF with and without the "stop-gradient" component during training on the Food dataset, as shown in Figure 5. From the figure, we observe that SelfCF with "stop-gradient" decreases the covariance value of the learned representations to a significantly greater extent than SelfCF without it. Thanks to this de-correlation effect, the recommendation performance with regard to Recall@20 improves consistently with training. In contrast, performance stagnates at a fixed level when the "stop-gradient" component is removed.
Fig. 5.
4.8 Hyper-parameter Sensitivity
To guide the selection of parameters of our framework, we perform a hyper-parameter study on the performance of SelfCF. In the implementation, we use the Food dataset for evaluation and LightGCN as the backbone of SelfCF. The results on Games and other datasets show patterns similar to Food; we put the results on Games in the Appendix for reference. We investigate the performance changes of our framework with regard to the momentum in historical embedding, the number of layers, the ratio of embedding dropout, and the proportion of edges pruned.
The number of layers. We study how the number of layers in LightGCN affects the performance of SelfCF by varying it over [1, 2, 3, 4, 5, 6, 7, 8]. We plot the results in Figure 6.
Fig. 6.
SelfCF\(_{\textrm {he}}\) and SelfCF\(_{\textrm {ed}}\) show relatively slow performance degradation as the number of layers increases, whereas the performance of SelfCF\(_{\textrm {ep}}\) is unstable with regard to the number of layers. In particular, the performance of SelfCF\(_{\textrm {ed}}\) is hardly affected by the number of layers in LightGCN; SelfCF\(_{\textrm {ed}}\) is capable of boosting the recommendation performance of graph-based models within a few layers.
The momentum/dropout and regularization coefficient. We set both the momentum of SelfCF\(_{\textrm {he}}\) and the dropout ratio of SelfCF\(_{\textrm {ed}}\) in the range of [0.1, 0.6] with a step of 0.1. The L2 regularization coefficient \(\lambda _1\) is searched in the range of {0.0, 1e-05, 1e-04, ..., 1e-01}. We plot the heatmaps for SelfCF\(_{\textrm {he}}\), SelfCF\(_{\textrm {ed}}\), and SelfCF\(_{\textrm {ep}}\) over Recall@20 and NDCG@20 in Figures 7 and 8, respectively.
Fig. 7.
Fig. 8.
From Figures 7 and 8, we observe that the performance of SelfCF on Recall@20 is consistent with NDCG@20: a higher Recall usually results in a higher NDCG. The performance of our framework is less sensitive to the momentum and dropout than to the regularization factor. In practice, it is preferable to apply weak regularization to normalize the learned embeddings.
The hyper-parameter studies also show that the three variants of SelfCF exhibit similar behaviors. Hence, we analyze the recommendation results and perform the ablation study on SelfCF\(_{\textrm {ed}}\) in the following sections.
4.9 Diving into the Recommendation Results
In our framework, we recommend top-K items to users relying solely on positive user-item interaction pairs. We further study how our framework differs from the supervised models in recommendation results. Specifically, we encapsulate LightGCN into our framework and compare the recommendation results between SelfCF and LightGCN with regard to user degree on Food. We plot the results in Figure 9.
Fig. 9.
On Recall@50 and NDCG@50, we see SelfCF outperforms LightGCN in every category, so our proposed framework is able to alleviate the cold-start issue to a certain extent. The most significant improvement (14.4%) is observed on cold-start users, who account for about 63.92% of users in the testing dataset. The second highest improvement, 12.80%, is observed on loyal users. From our data analysis, we find that users with a high degree of interactions in training tend to select items with a low degree in testing; thus, it is difficult to recommend the right items to these users. Our self-supervised framework can partially tackle the problem of recommendation degradation on loyal users [24]. We speculate the underlying reason is that, for these users, the supervised models sample a large number of unobserved but potentially positive items as negatives for training, which prevents the models from placing these items in the recommendation list.
Regarding research question RQ2, SelfCF boosts the recommendation performance for all users. Especially for cold-start users, it improves the recommendation accuracy of LightGCN by 14.4% on Recall@50.
4.10 Representation of Nodes
SelfCF leverages only positive samples to learn the latent user and item representations. We examine how the representations differ between supervised and self-supervised learning. We draw 2D t-SNE plots of the node representations learned by LightGCN and SelfCF\(_{\textrm {ed}}\) on the Food dataset in Figure 10. For computational complexity considerations, we only plot the representations of users and items in the test set. In this figure, we can observe that the representations of users and items are highly melded with each other in the supervised model, LightGCN. In contrast, the representations of users and items are pushed apart in SelfCF\(_{\textrm {ed}}\). The result is in line with Equation (6): without negative samples, the loss function does not enforce the embeddings of positive users and items to be similar to each other; instead, it maximizes the similarity subject to different conditions. That is, our proposed self-supervised model encourages similar users to congregate in a group and similar items to cluster together.
Fig. 10.
5 Ablation Study
We investigate each component of SelfCF to study its contribution to the recommendation performance. All ablation studies are performed on SelfCF\(_{\textrm {ed}}\) trained on the Food dataset. The encapsulated backbone is LightGCN with two convolutional layers, and the embedding dropout ratio of SelfCF\(_{\textrm {ed}}\) is set to 0.5.
5.1 Predictor
We study the recommendation performance under several variants of the predictor h. Table 6 summarizes the variants and their recommendation performance.
Table 6. Impact of Predictor h

| MLP h | R@10 | R@20 | R@50 | N@10 | N@20 | N@50 |
|---|---|---|---|---|---|---|
| 1-layer MLP | 0.0198 | 0.0316 | 0.0555 | 0.0135 | 0.0173 | 0.0235 |
| No predictor | 0.0124 | 0.0191 | 0.0359 | 0.0109 | 0.0132 | 0.0176 |
| Fixed random init. | 0.0127 | 0.0191 | 0.0309 | 0.0109 | 0.0129 | 0.0160 |
| 2-layer MLP | 0.0199 | 0.0313 | 0.0545 | 0.0137 | 0.0173 | 0.0233 |
Different from the predictor in SimSiam [13], our framework still works when the predictor h is removed, but its performance degenerates to the level of the Popularity algorithm. Fixing the predictor at its random initialization makes it difficult for the self-supervised framework to learn good representations of users/items. In contrast, a 2-layer MLP achieves performance competitive with the 1-layer version.
5.2 Loss Function
In contrastive learning, it is common practice for losses to measure a cosine similarity [12, 17, 44]. We substitute the loss function with a cross-entropy similarity by modifying \(\mathcal {C}\), replacing each negative cosine term with the channel-wise cross-entropy of [13]:

\(\mathcal {D}(\dot{e}, \tilde{e}) = -\, \mathrm{softmax}(sg(\tilde{e})) \cdot \log \mathrm{softmax}(\dot{e}).\)
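As a sketch, the substituted loss can be implemented analogously to the `neg_cosine` helper above (assuming the channel-wise softmax formulation of [13]):

```python
import torch
import torch.nn.functional as F

def neg_xent(p: torch.Tensor, z: torch.Tensor) -> torch.Tensor:
    """Cross-entropy similarity: treat each embedding dimension as a pseudo-class
    and match the softmax of the prediction to that of the detached target."""
    return -(F.softmax(z.detach(), dim=-1) * F.log_softmax(p, dim=-1)).sum(-1).mean()
```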
Table 7 shows the results compared with cosine similarity. The cross-entropy similarity can prevent the solution from collapsing to some extent. Cosine similarity captures the interaction preference between user and item directly, hence it shows better performance.
Table 7. Effectiveness of Loss Function

| Similarity loss | R@10 | R@20 | R@50 | N@10 | N@20 | N@50 |
|---|---|---|---|---|---|---|
| Cosine | 0.0198 | 0.0316 | 0.0555 | 0.0135 | 0.0173 | 0.0235 |
| Cross-entropy | 0.0124 | 0.0191 | 0.0350 | 0.0110 | 0.0133 | 0.0174 |
5.3 Stop-gradient
Existing research [4, 13, 17, 64] on SSL highlights the crucial role of stop-gradient in preventing the solution from collapsing. We evaluate adding or removing the stop-gradient operator with and without a linear predictor. The results in Table 8 show that our self-supervised framework works even under a completely symmetric setting. The loss function of Equation (6) is able to capture the invariant and salient features in the embeddings of users/items by dropping the noise signal. However, without the "stop gradient" operator, the performance of SelfCF decreases greatly. We speculate that backpropagating the loss in both directions (to the online and target networks) leaves the framework unable to learn the optimal parameters of the backbone.
Table 8. Effectiveness of Stop-gradient (sg) Operator

| Case | sg | Predictor | R@10 | R@20 | R@50 | N@10 | N@20 | N@50 |
|---|---|---|---|---|---|---|---|---|
| Baseline | \(\checkmark\) | \(\checkmark\) | 0.0198 | 0.0316 | 0.0555 | 0.0135 | 0.0173 | 0.0235 |
| (a) | - | \(\checkmark\) | 0.0124 | 0.0191 | 0.0359 | 0.0109 | 0.0133 | 0.0176 |
| (b) | - | - | 0.0124 | 0.0191 | 0.0359 | 0.0109 | 0.0132 | 0.0176 |
Based on our ablation studies with regard to research question RQ3, we observe that SelfCF does not rely on a single component to prevent solution collapse. It shows different behavior from other self-supervised models, in which the "stop gradient" operator is identified as a crucial component to prevent solution collapse [13]. The underlying reason is that our loss function is designed as the similarity between the latent embeddings of a user and an item, hence it can capture the preference of the user to some extent.
6 Conclusion and Future Directions
In this paper, we propose a framework on top of Siamese networks to learn representations of users and items without negative samples or labels. We argue that the self-supervised learning techniques widely used in vision cannot be directly applied to the recommendation domain. Hence, we design a Siamese network architecture that perturbs the output of the backbone instead of augmenting the input data. By encapsulating two popular recommendation models into the framework, our experiments on four datasets show the proposed framework is on par with or better than another self-supervised framework, BUIR. The performance is also competitive with the supervised counterparts, obtaining a gain of 6.55% over LightGCN. We hope our study will shed light on further research on self-supervised collaborative filtering.
We also discuss potential directions based on our framework, SelfCF. (a) More powerful predictor. In the above ablation study, we observed that both the 1-layer and 2-layer MLP show promising performance. Unlike the supervised baselines, e.g., LightGCN, our framework includes the predictor as a crucial component: it learns not only the representations of users and items but also the parameters of the predictor. As a result, future research can be devoted to the design of the predictor in our framework. (b) Combining with supervised signals. In recent years, a line of work integrates self-supervised learning into the classic pairwise BPR supervised loss and shows promising improvement in recommendation performance [33, 46, 49, 50]. However, in our framework, we only use the positive user-item interaction pairs for recommendation. It is worth researching the integration of supervised signals into our framework. (c) Embedding augmentation methods. This paper proposes three embedding perturbation techniques that can be divided into two categories: (i) graph-based augmentation, where, like BUIR, the edge pruning method uses another graph propagation to generate the contrastive embeddings; and (ii) non-graph-based augmentation, where the other two techniques directly distort the original view and can significantly ease the computation burden. However, other methods [33] for embedding augmentation can be explored in SelfCF. (d) Fusing multimodal features for effective recommendation. To alleviate the data sparsity problem and the cold-start issue in CF, various methods have been developed to fuse multimodal information (e.g., text descriptions and images) of items into the current CF paradigm [59]. In this direction of future work, we are interested in exploring effective ways of fusing multimodal features for recommendation.
Acknowledgments
We thank Professor Toru Ishida from the Department of Computer Science of Hong Kong Baptist University for his valuable advice on this work.
A Hyper-parameter Sensitivity on Games
We plot the performance of SelfCF as it varies with the number of layers in Figure 11. The changes in Recall@20 and NDCG@20 with momentum, embedding dropout, edge pruning ratio, and regularization coefficient are shown in Figures 12 and 13, respectively. The patterns on Games are in line with those on Food.
Fig. 11.
Fig. 12.
Fig. 13.
B Performance of LightGCN Under Different Sampling Methods
In response to the research issue mentioned in the introduction, we evaluate the performance of LightGCN under both uniform sampling (UniS) and dynamic negative sampling (DNS) on MOOC. The results are summarized in Table 9. DNS samples hard negative user-item pairs for LightGCN, hence it obtains a higher ranking score (i.e., NDCG@10) when K is low in top-K. However, when K increases, it becomes difficult to retrieve related but low-ranked items for a target user, because DNS always ranks items based on the current representations of users and items learned by LightGCN, and in many cases the sampled negative pairs come from the test set. As a result, the recall of dynamic negative sampling on LightGCN with regard to \(K=20\) and \(K=50\) is worse than that of the uniform sampling method.
Table 9. Influence of Sampling Methods on LightGCN Evaluated with the MOOC Dataset

| Model | R@10 | R@20 | R@50 | N@10 | N@20 | N@50 |
|---|---|---|---|---|---|---|
| LightGCN-UniS | 0.2507 | 0.3321 | 0.4844 | 0.1588 | 0.1835 | 0.2208 |
| LightGCN-DNS | 0.2560 | 0.3297 | 0.4644 | 0.1644 | 0.1864 | 0.2191 |
| SelfCF\(_{\textrm {he}}\) | 0.2545 | 0.3500 | 0.4914 | 0.1696 | 0.1986 | 0.2328 |
| SelfCF\(_{\textrm {ed}}\) | 0.2460 | 0.3337 | 0.5088 | 0.1752 | 0.2009 | 0.2443 |
| SelfCF\(_{\textrm {ep}}\) | 0.2514 | 0.3485 | 0.4964 | 0.1671 | 0.1963 | 0.2323 |
References
[1] Adrien Bardes, Jean Ponce, and Yann LeCun. 2022. VICReg: Variance-invariance-covariance regularization for self-supervised learning. In ICLR 2022-10th International Conference on Learning Representations.
[2] Rianne van den Berg, Thomas N. Kipf, and Max Welling. 2018. Graph convolutional matrix completion. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining.
[3] Jane Bromley, Isabelle Guyon, Yann LeCun, Eduard Säckinger, and Roopak Shah. 1993. Signature verification using a "Siamese" time delay neural network. Advances in Neural Information Processing Systems 6 (1993).
[4] Mathilde Caron, Ishan Misra, Julien Mairal, Priya Goyal, Piotr Bojanowski, and Armand Joulin. 2020. Unsupervised learning of visual features by contrasting cluster assignments. In Proceedings of the 34th Annual Conference on Neural Information Processing Systems.
[5] Chao Chen, Dongsheng Li, Junchi Yan, and Xiaokang Yang. 2021. Modeling dynamic user preference via dictionary learning for sequential recommendation. IEEE Transactions on Knowledge and Data Engineering (2021).
[6] Chong Chen, Weizhi Ma, Min Zhang, Chenyang Wang, Yiqun Liu, and Shaoping Ma. 2022. Revisiting negative sampling vs. non-sampling in implicit recommendation. ACM Transactions on Information Systems (TOIS) (2022).
[7] Chong Chen, Min Zhang, Weizhi Ma, Yiqun Liu, and Shaoping Ma. 2020. Efficient non-sampling factorization machines for optimal context-aware recommendation. In Proceedings of The Web Conference 2020. 2400–2410.
[8] Chong Chen, Min Zhang, Weizhi Ma, Yiqun Liu, and Shaoping Ma. 2020. Jointly non-sampling learning for knowledge graph enhanced recommendation. In Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval. 189–198.
[9] Chong Chen, Min Zhang, Yongfeng Zhang, Weizhi Ma, Yiqun Liu, and Shaoping Ma. 2020. Efficient heterogeneous collaborative filtering without negative sampling for recommendation. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 34. 19–26.
[10] Jianfei Chen, Jun Zhu, and Le Song. 2017. Stochastic training of graph convolutional networks with variance reduction. In International Conference on Machine Learning.
[11] Lei Chen, Le Wu, Richang Hong, Kun Zhang, and Meng Wang. 2020. Revisiting graph based collaborative filtering: A linear residual graph convolutional network approach. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 34. 27–34.
[12] Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. 2020. A simple framework for contrastive learning of visual representations. In International Conference on Machine Learning. 1597–1607.
[13] Xinlei Chen and Kaiming He. 2021. Exploring simple Siamese representation learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. IEEE.
[14] Danilo Dessì, Gianni Fenu, Mirko Marras, and Diego Reforgiato Recupero. 2018. COCO: Semantic-enriched collection of online courses at scale with experimental use cases. In World Conference on Information Systems and Technologies. Springer, 1386–1396.
[15] Matthias Fey, Jan E. Lenssen, Frank Weichert, and Jure Leskovec. 2021. GNNAutoScale: Scalable and expressive graph neural networks via historical embeddings. In International Conference on Machine Learning.
[16] Xavier Glorot and Yoshua Bengio. 2010. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics. JMLR Workshop and Conference Proceedings, 249–256.
[17] Jean-Bastien Grill, Florian Strub, Florent Altché, Corentin Tallec, Pierre H. Richemond, Elena Buchatskaya, Carl Doersch, Bernardo Avila Pires, Zhaohan Daniel Guo, Mohammad Gheshlaghi Azar, et al. 2020. Bootstrap your own latent: A new approach to self-supervised learning. In Proceedings of the 34th Annual Conference on Neural Information Processing Systems. 21271–21284.
[18] Raia Hadsell, Sumit Chopra, and Yann LeCun. 2006. Dimensionality reduction by learning an invariant mapping. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Vol. 2. IEEE, 1735–1742.
[19] Kaiming He, Haoqi Fan, Yuxin Wu, Saining Xie, and Ross Girshick. 2020. Momentum contrast for unsupervised visual representation learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 9729–9738.
[20] Xiangnan He, Kuan Deng, Xiang Wang, Yan Li, Yongdong Zhang, and Meng Wang. 2020. LightGCN: Simplifying and powering graph convolution network for recommendation. In Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval. 639–648.
[21] Xiangnan He, Lizi Liao, Hanwang Zhang, Liqiang Nie, Xia Hu, and Tat-Seng Chua. 2017. Neural collaborative filtering. In Proceedings of the 26th International Conference on World Wide Web. 173–182.
Olivier Henaff. 2020. Data-efficient image recognition with contrastive predictive coding. In International Conference on Machine Learning. PMLR, 4182–4192.
R. Devon Hjelm, Alex Fedorov, Samuel Lavoie-Marchildon, Karan Grewal, Phil Bachman, Adam Trischler, and Yoshua Bengio. 2018. Learning deep representations by mutual information estimation and maximization. In International Conference on Learning Representations.
Yehuda Koren. 2008. Factorization meets the neighborhood: A multifaceted collaborative filtering model. In Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 426–434.
Yann LeCun, Bernhard Boser, John S. Denker, Donnie Henderson, Richard E. Howard, Wayne Hubbard, and Lawrence D. Jackel. 1989. Backpropagation applied to handwritten zip code recognition. Neural Computation 1, 4 (1989), 541–551.
Dongha Lee, SeongKu Kang, Hyunjun Ju, Chanyoung Park, and Hwanjo Yu. 2021. Bootstrapping user and item representations for one-class collaborative filtering. In Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval.
Xiaohan Li, Mengqi Zhang, Shu Wu, Zheng Liu, Liang Wang, and Philip S. Yu. 2020. Dynamic graph collaborative filtering. In International Conference on Data Mining. 322–331.
Dawen Liang, Rahul G. Krishnan, Matthew D. Hoffman, and Tony Jebara. 2018. Variational autoencoders for collaborative filtering. In Proceedings of the 2018 World Wide Web Conference. 689–698.
Xiao Liu, Fanjin Zhang, Zhenyu Hou, Li Mian, Zhaoyu Wang, Jing Zhang, and Jie Tang. 2021. Self-supervised learning: Generative or contrastive. IEEE Transactions on Knowledge and Data Engineering (2021).
Yixin Liu, Ming Jin, Shirui Pan, Chuan Zhou, Yu Zheng, Feng Xia, and Philip Yu. 2022. Graph self-supervised learning: A survey. IEEE Transactions on Knowledge and Data Engineering (2022).
Dongsheng Luo, Wei Cheng, Wenchao Yu, Bo Zong, Jingchao Ni, Haifeng Chen, and Xiang Zhang. 2021. Learning to drop: Robust graph neural network via topological denoising. In Proceedings of the 14th ACM International Conference on Web Search and Data Mining. 779–787.
Jianmo Ni, Jiacheng Li, and Julian McAuley. 2019. Justifying recommendations using distantly-labeled reviews and fine-grained aspects. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP’19). 188–197.
Steffen Rendle, Christoph Freudenthaler, Zeno Gantner, and Lars Schmidt-Thieme. 2009. BPR: Bayesian personalized ranking from implicit feedback. In Proceedings of the Twenty-Fifth Conference on Uncertainty in Artificial Intelligence. 452–461.
Yu Rong, Wenbing Huang, Tingyang Xu, and Junzhou Huang. 2020. DropEdge: Towards deep graph convolutional networks on node classification. In International Conference on Learning Representations.
Yue Shi, Martha Larson, and Alan Hanjalic. 2014. Collaborative filtering beyond the user-item matrix: A survey of the state of the art and future challenges. ACM Computing Surveys (CSUR) 47, 1 (2014), 1–45.
Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. 2014. Dropout: A simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research 15, 1 (2014), 1929–1958.
Antti Tarvainen and Harri Valpola. 2017. Mean teachers are better role models: Weight-averaged consistency targets improve semi-supervised deep learning results. Advances in Neural Information Processing Systems 30 (2017).
Hao Wang, Naiyan Wang, and Dit-Yan Yeung. 2015. Collaborative deep learning for recommender systems. In Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 1235–1244.
Tongzhou Wang and Phillip Isola. 2020. Understanding contrastive representation learning through alignment and uniformity on the hypersphere. In International Conference on Machine Learning. PMLR, 9929–9939.
Xiang Wang, Xiangnan He, Meng Wang, Fuli Feng, and Tat-Seng Chua. 2019. Neural graph collaborative filtering. In Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval. 165–174.
Jiancan Wu, Xiang Wang, Fuli Feng, Xiangnan He, Liang Chen, Jianxun Lian, and Xing Xie. 2021. Self-supervised graph learning for recommendation. In Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval. 726–735.
Zhirong Wu, Yuanjun Xiong, Stella X. Yu, and Dahua Lin. 2018. Unsupervised feature learning via non-parametric instance discrimination. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 3733–3742.
Rex Ying, Ruining He, Kaifeng Chen, Pong Eksombatchai, William L. Hamilton, and Jure Leskovec. 2018. Graph convolutional neural networks for web-scale recommender systems. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 974–983.
Chaoning Zhang, Kang Zhang, Chang-Dong Yoo, and In-So Kweon. 2022. How does SimSiam avoid collapse without negative samples? Towards a unified understanding of progress in SSL. In International Conference on Learning Representations (ICLR'22).
Lingzi Zhang, Yong Liu, Xin Zhou, Chunyan Miao, Guoxin Wang, and Haihong Tang. 2022. Diffusion-based graph contrastive learning for recommendation with implicit feedback. In Database Systems for Advanced Applications: 27th International Conference, DASFAA 2022, Virtual Event, April 11–14, 2022, Proceedings, Part II. Springer, 232–247.
Shuai Zhang, Lina Yao, Aixin Sun, and Yi Tay. 2019. Deep learning based recommender system: A survey and new perspectives. ACM Computing Surveys (CSUR) 52, 1 (2019), 1–38.
Weinan Zhang, Tianqi Chen, Jun Wang, and Yong Yu. 2013. Optimizing top-n collaborative filtering via dynamic negative item sampling. In Proceedings of the 36th International ACM SIGIR Conference on Research and Development in Information Retrieval. 785–788.
Wayne Xin Zhao, Shanlei Mu, Yupeng Hou, Zihan Lin, Yushuo Chen, Xingyu Pan, Kaiyuan Li, Yujie Lu, Hui Wang, Changxin Tian, et al. 2021. RecBole: Towards a unified, comprehensive and efficient framework for recommendation algorithms. In Proceedings of the 30th ACM International Conference on Information & Knowledge Management. 4653–4664.
Xin Zhou, Jinglong Wang, Yong Liu, Xingyu Wu, Zhiqi Shen, and Cyril Leung. 2023. Inductive graph transformer for delivery time estimation. In Proceedings of the Sixteenth ACM International Conference on Web Search and Data Mining. 679–687.