Representation learning-based recommendation models play a dominant role among recommendation techniques. However, most existing methods assume that historical interactions and embedding dimensions are independent of each other, and thus regrettably ignore the high-order interaction information among historical interactions and embedding dimensions. In this article, we propose a novel representation learning-based model called COMET (COnvolutional diMEnsion inTeraction), which simultaneously models the high-order interaction patterns among historical interactions and embedding dimensions. To be specific, COMET first stacks the embeddings of historical interactions horizontally, resulting in two “embedding maps”. In this way, internal interactions and dimensional interactions can be exploited simultaneously by convolutional neural networks (CNN) with kernels of different sizes. A fully connected multi-layer perceptron (MLP) is then applied to obtain two interaction vectors. Lastly, the representations of users and items are enriched by the learnt interaction vectors, which are further used to produce the final prediction. Extensive experiments and ablation studies on various public implicit feedback datasets clearly demonstrate the effectiveness and rationality of our proposed method.
1 Introduction
In the era of big data, analyzing customers’ demands and behaviors is necessary for exploiting potential insights and building intelligent systems, and modern recommender systems serve exactly this purpose. Suggestions for videos on YouTube or products on Amazon [3, 21, 34] are real-world examples of intelligent systems that help users navigate a growing ocean of choices. To this end, collaborative filtering (CF) methods have been proposed to estimate users’ preferences from their historical behaviors, and they are widely adopted due to their impressive recommendation performance. For example, representation learning-based models, which aim to learn effective users’ and items’ representations using deep learning techniques, have played a dominant role in recent years [1, 10, 20, 22, 36]. The matching score between the target user and the target item can then be predicted by leveraging their representations.
In order to effectively capture the latent relationships between users and items, both the recommender system research community and industry have paid great attention to modeling the interaction information between contextual features. A typical solution is to model domain-specific cross features manually [2, 35]; for example, a cross-product transformation is applied to sparse features to encode feature interactions in [2]. Although these methods are able to discover the relationships between feature pairs and target labels in an explicit manner, tedious effort is required to construct cross features, and such feature interactions cannot generalize to unseen cross features. Alternatively, recommendation models can learn feature interactions automatically. In particular, Factorization Machines (FM) and their extensions [7, 26] map contextual features to low-dimensional embeddings, which enables the interaction between contextual features to be estimated by the inner product of their embeddings. For example, the interaction between a user’s gender and an item’s category is recognized as a second-order interaction. However, most FM-based methods only model pairwise (second-order) feature interactions due to the computational inefficiency of explicitly enumerating high-order feature interactions (i.e., interactions among three or more features).
Another attempt to model interaction information is to capture dimension-level interactions. The motivation is to treat the latent factors encoded in the embeddings as user/item features [9, 19, 25, 37]. For example, a user’s representation may encode her gender, her spending power, and her preferred color; similarly, an item can be characterized as male-oriented versus female-oriented, by its price, and by its color. Recent works use the outer product to model the interaction between input pairs, and then a multi-layer perceptron (MLP) [25] or convolutional neural network (CNN) [9, 37] is used to estimate matching scores. By applying an outer-product operation to the target user’s and the target item’s representations, the generated interaction map explicitly encodes all pairwise dimensional interactions. In this way, a CNN can capture the dimensional interactions more explicitly than methods which directly employ an MLP on embedding concatenations, such as Deep Crossing [9, 29, 37]. However, the input to such an interaction modeling process is only pairwise; in other words, the outer product can only encode the explicit interactions between two dimensions at a time, which may ignore the rich information among latent embeddings.
Based on the above observations, we can see that both types of methods for modeling interaction information are deeply mired in the difficulty of explicitly modeling high-order interactions: most existing works consider either pairwise (second-order) feature interactions or pairwise dimension-level interactions. Furthermore, most FM-based methods that model feature interaction information rely on contextual features (e.g., item descriptions, ratings, and user check-in data), which are not always available. In contrast, implicit feedback (e.g., click, browse, or purchase behaviors) is much easier to collect [9, 10, 27]. In this article, we propose a novel approach called COMET (COnvolutional diMEnsion inTeraction) to simultaneously capture the high-order interaction information among historical interactions and embedding dimensions from implicit feedback. To be specific, we treat the interacted items and interacted users (i.e., items purchased by the target user and users who have consumed the target item) as “contextual features” in this work. By stacking such historical interactions horizontally, two “embedding maps” can be obtained. For each embedding map, we employ a single-layer CNN with kernels of varying sizes, aiming to capture high-order interaction signals among historical interactions and all embedding dimensions simultaneously. A fully connected MLP is then employed to obtain two interaction vectors. By enriching the original representations of the target user and target item with these interaction vectors, our proposed method obtains impressive performance. In summary, the main contributions of this work are:
• We propose a novel approach, COMET, to capture interaction signals for recommendation from implicit feedback. COMET exploits the high-order interaction information among historical interactions and embedding dimensions simultaneously.
• We propose to enrich the representations of the target user and the target item with the learnt interaction information. In this way, the target user’s representation and the target item’s representation are dependently learnt.
• We conduct extensive experiments on public implicit feedback data to evaluate the performance of our proposed method. Experimental results show that it achieves impressive results; moreover, ablation studies are conducted to analyze the advantages of COMET.
The rest of the article is organized as follows: In Section 2, related works are briefly reviewed. We then elaborate on our method in Section 3 and Section 4. In Section 5, we empirically evaluate our proposed method on recommendation tasks. We conclude our work and discuss future directions in Section 6.
2 Related Work
Our work is built on the foundation of latent factor models and representation-based models, and takes advantage of feature interaction modeling.
Latent factor models learn users’ and items’ latent embeddings in a shared latent space. These methods use low-rank approximation to fit the rating matrix. For example, Karatzoglou et al. introduced a technique called tensor factorization (TensorF) [15] that allows for the incorporation of multiple features into a recommendation model. This is done by representing the data as a multi-dimensional tensor rather than a traditional 2D matrix. Based on TensorF, Symeonidis et al. [32] developed a recommendation model called HOSVD that utilizes tensor factorization for user-tag-item triplet data. More recently, Yu et al. proposed a tensor factorization model called DCFA [41] which uses aesthetic features, rather than traditional features, to make recommendations based on a user’s preferences. They believed that a user’s decision is often influenced by whether the product aligns with their personal aesthetics. As special cases of tensor factorization, matrix factorization techniques (MF) [19, 27] factorize the rating matrix into user-specific and item-specific matrices for rating prediction. Another notable latent factor model is SVD++ [18], which integrates the embedding of the target user with additional latent embeddings of interacted items.
Intuitively, the relationships between users and items are complex; representation-based recommendation models are therefore proposed to learn the complex matching function that maps user-item pairs to matching scores. For example, Generalized Matrix Factorization (GMF) [10] generalizes MF in a non-linear manner. NeuMF [10] and DeepCF [4] use MLP to learn effective matching functions from user/item representations and user/item rating data, respectively. In recent years, many representation learning-based sequential recommendation methods have been proposed, owing to the impressive ability of deep learning to model the complex behaviors of users [11, 14, 30, 33, 42]. Specifically, sequential recommendation methods model users’ dynamic preferences and make personalized recommendations based on users’ sequential interactions with items. For example, GRU4Rec [11] processes a sequence of items and makes recommendations based on the hidden state learned by a gated recurrent unit (GRU). In addition, DIN [42] and SASRec [14] use attention mechanisms to weigh the importance of different items in a user’s interaction history, making recommendations based on the learned attention weights and item representations. Meanwhile, Caser [33] is a CNN-based method which aims to model the skip behaviors of sequential patterns. BERT4Rec [30] adapts the bidirectional Transformer architecture of BERT to recommendation, training it on sequences of user-item interactions and making recommendations based on a given sequence of items. However, the motivation and experimental settings of sequential methods differ from ours, so we omit comparisons with sequential recommendation models in this work. To be specific, COMET and other general recommendation methods [8, 9, 10, 13, 18, 27] aim to learn static and inherent user preferences from historical data, recommending items to users based on their overall preferences and interests; these methods do not take into account the order in which items were consumed or the time at which they were consumed [31]. Sequential recommendation methods, on the other hand, take into account the sequence of items a user has consumed: they learn sequential information from sorted historical interactions and model dynamic user preferences that change over time [14, 30, 33]. Since we focus on modeling general user preferences, we only compare against baselines that model static user preferences from unsorted historical interactions in this article. Overall, we observe that most existing latent factor approaches and representation-based models regrettably ignore the static interaction information among historical interactions and embedding dimensions.
In the meantime, several works have shown the importance and effectiveness of modeling interaction information for recommendation tasks. The representative works in this field are FM and its extensions [7, 26, 37]. The typical paradigm of FM-based methods is to model the second-order interactions between feature vectors; for example, NFM [7] models non-linear pairwise feature interactions. Although FM-based methods generally achieve satisfactory performance in recommendation tasks, existing works on modeling feature interactions mainly focus on context-aware recommendation [2, 35, 37]. However, such contextual features are not always available, in particular when a user has very little historical data. Recently, some works exploit dimension-level interaction information to enhance recommendation performance. For example, ConvNCF [9] applies an outer-product operation to encode pairwise dimension-level interactions, and CFM [37] models second-order interactions for context-aware recommendation. In this work, we propose to model the high-order interactions among historical interactions and embedding dimensions simultaneously. Instead of modeling interaction effects from contextual features like FM-based methods, we exploit interaction signals from implicit feedback data: the interacted users and items are treated as “contextual features” in our work. In this way, COMET captures the internal interaction patterns among the target user’s and target item’s historical interactions as well as the dimensional interaction signals among all latent dimensions, which has not been studied before. Moreover, we present how to enrich the representations of users and items with the learnt interaction information; in this way, users’ representations and items’ representations are learnt dependently, leading to better recommendation performance.
3 Preliminaries
Before we detail our proposed method, we first formulate the problem and define the notations used in this article.
3.1 Problem Formulation and Notations
Let \(\mathcal {U}=\lbrace u_{1},u_{2},\ldots , u_{m}\rbrace\) be the set of users, and \(\mathcal {I}=\lbrace i_{1},i_{2},\ldots , i_{n}\rbrace\) be the set of items. The user-item interaction matrix is denoted by \(\mathbf {Y}=[y_{ui}]\) of size \(m \times n\) from implicit feedback data as:
\[y_{ui}={\left\lbrace \begin{array}{ll}1,&\text{if the interaction between $u$ and $i$ is observed}; \\ 0,&\text{otherwise},\end{array}\right.} \tag{1}\]
Specifically, \(y_{ui}=1\) represents the existence of an observed interaction between user u and item i, while \(y_{ui}=0\) means the user-item interaction was not observed. Intuitively, the goal of recommendation is to compute the interaction scores of the missing entries in \(\mathbf {Y}\), from which a meaningful recommendation list can then be generated.
Throughout this article, we use u and i to represent a user and an item, respectively. We use bold symbols in lower case (e.g., \(\mathbf {u}\)) to denote vectors and bold symbols in upper case (e.g., \(\mathbf {Y}\)) to denote matrices. Moreover, \(y_{ui}\) denotes the \((u,i)\)-th element of matrix \(\mathbf {Y}\). In addition, we denote predicted values with a hat; for example, the final predicted interaction score between user u and item i is represented as \(\hat{y}_{ui}\).
3.2 Relationship with Matrix Factorization
MF plays an important role among latent factor models: it factorizes the rating matrix into a user matrix and an item matrix for rating prediction [10, 19]. We denote the latent representations of user u and item i as \(\mathbf {p}_u\) and \(\mathbf {q}_i\), respectively. MF estimates the interaction score \(\hat{y}_{ui}\) of \(y_{ui}\) by the inner product of \(\mathbf {p}_u\) and \(\mathbf {q}_i\):
\[\hat{y}_{ui}=\mathbf {p}_u^\top \mathbf {q}_i=\sum _{k=1}^{K}p_{uk}\,q_{ik}, \tag{2}\]
where K denotes the dimension of the latent representations. As the above equation shows, MF combines the latent features linearly, whereas our proposed COMET model aims to enrich the users’ and items’ representations generated by MF. In COMET, the prediction \(\hat{y}_{ui}\) of \(y_{ui}\) is computed as follows:
\[\hat{y}_{ui}=\sigma \left(\mathbf {h}^\top \left(\left(\mathbf {p}_u+\mathbf {p}_u^{\prime }\right)\odot \left(\mathbf {q}_i+\mathbf {q}_i^{\prime }\right)\right)\right), \tag{3}\]
where \(\mathbf {p}_u^{\prime }\) and \(\mathbf {q}_i^{\prime }\) are the learnt user and item interaction vectors, \(\odot\) represents the element-wise product between two vectors, \(\sigma (\cdot)\) is the sigmoid function, and \(\mathbf {h}\) denotes a weight vector. If we set the weight vector \(\mathbf {h}=\mathbf {1}\), where \(\mathbf {1}\) is a vector whose elements all equal 1, and set the interaction vectors \(\mathbf {p}_u^{\prime }=\mathbf {q}_i^{\prime }=\mathbf {0}\), where \(\mathbf {0}\) is a vector whose elements all equal 0, then COMET recovers the MF model exactly except for the activation function, since \(\mathbf {p}_u^\top \mathbf {q}_i=\mathbf {1}^\top ((\mathbf {p}_u+ \mathbf {0}) \odot (\mathbf {q}_i+ \mathbf {0}))\). In the next section, we will introduce how to model the interaction vectors.
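To make the reduction concrete, the following sketch (illustrative PyTorch, not the authors’ released code; all sizes are arbitrary) evaluates Equation (3) with \(\mathbf {h}=\mathbf {1}\) and zero interaction vectors and checks that the pre-activation score coincides with the MF inner product:

```python
import torch

# Illustrative check that Eq. (3) reduces to MF when h = 1 and the
# interaction vectors are zero; the embedding size here is arbitrary.
K = 8
p_u, q_i = torch.randn(K), torch.randn(K)

h = torch.ones(K)                   # weight vector h = 1
p_u_prime = torch.zeros(K)          # user interaction vector p_u' = 0
q_i_prime = torch.zeros(K)          # item interaction vector q_i' = 0

# COMET score before the sigmoid: h^T((p_u + p_u') * (q_i + q_i'))
comet_logit = h @ ((p_u + p_u_prime) * (q_i + q_i_prime))
mf_score = p_u @ q_i                # plain MF inner product

assert torch.allclose(comet_logit, mf_score)
y_hat = torch.sigmoid(comet_logit)  # COMET's final prediction in (0, 1)
```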
4 The Proposed Approach
Figure 1 illustrates the proposed framework, which encodes the high-order interactions among historical interactions and embedding dimensions to enrich the representations of the target user and the target item. In this section, we detail our proposed COMET layer by layer.
Fig. 1. The overall framework of the proposed COMET.
4.1 Input Layer
Most latent factor models [7, 9] only take the one-hot encodings of the target user’s ID and the target item’s ID as input. In this work, we additionally consider multi-hot encodings of the target user u’s interacted items and the target item i’s interacted users. Such a design not only takes more historical information into account but also benefits the construction of the embedding maps in the next layer.
Let us take the target user u and her interacted items \(\text{N}(u)\) as an example. The one-hot encoding of u can be represented as a vector \(\mathbf {u}\in \lbrace 0,1\rbrace ^m\) whose non-zero entry indicates the ID of the target user, where m is the number of users. Similarly, the multi-hot encoding of the interacted items \(\text{N}(u)\) can be represented as a vector \(\mathbf {u}^{\prime }\in \lbrace 0,1\rbrace ^n\) whose non-zero entries record the IDs of the items that the target user has interacted with, where n denotes the number of items. In other words, if the jth element \(\mathbf {u}^{\prime }_{j}\) of the multi-hot encoding vector is non-zero, the target user has interacted with the jth item before.
4.2 Embedding Layer
The embedding layer projects the target user u and the target item i into the latent space, yielding the feature vectors \(\mathbf {p}_u\in \mathbb {R}^K\) and \(\mathbf {q}_i\in \mathbb {R}^K\), respectively. Similarly, we obtain \(\lbrace \mathbf {q}_j\in \mathbb {R}^K \mid j\in \text{N}(u)\rbrace\) and \(\lbrace \mathbf {p}_k\in \mathbb {R}^K \mid k\in \text{N}(i)\rbrace\) for each interacted item \(j\in \text{N}(u)\) and each interacted user \(k\in \text{N}(i)\), where K represents the embedding size. Note that the one-hot and multi-hot encoding vectors are fixed, since they only encode the IDs of the target user and of the interacted items; the embeddings themselves are learnt with the model in an end-to-end manner.
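The following is a minimal sketch of how the input and embedding layers might be realized; the IDs and table sizes are hypothetical, and only K = 128 follows the experimental setting:

```python
import torch
import torch.nn as nn

# A minimal sketch of the input and embedding layers; all IDs and sizes
# here are illustrative (the experiments in Section 5 use K = 128).
m, n, K = 1000, 2000, 128              # #users, #items, embedding size
user_emb = nn.Embedding(m, K)          # user embedding table
item_emb = nn.Embedding(n, K)          # item embedding table

u = torch.tensor(42)                   # target user ID (index of the one-hot vector)
i = torch.tensor(7)                    # target item ID
N_u = torch.tensor([3, 15, 88, 901])   # items the target user interacted with
N_i = torch.tensor([42, 5, 77])        # users who interacted with the target item

p_u = user_emb(u)                      # (K,) target user embedding
q_i = item_emb(i)                      # (K,) target item embedding
item_hist = item_emb(N_u)              # (|N(u)|, K) interacted items' embeddings
user_hist = user_emb(N_i)              # (|N(i)|, K) interacted users' embeddings
```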
4.3 Embedding Maps
In this work, we treat the historical interactions as “contextual features”. Their embeddings are horizontally stacked into “embedding maps” above the embedding layer. For example, given a set of interacted items’ embeddings for a user \(\lbrace \mathbf {q}_1, \mathbf {q}_3, \ldots , \mathbf {q}_j\rbrace\), the stacked item embedding map is constructed as follows:
\[\mathbf {E}_i=\left[\mathbf {q}_1, \mathbf {q}_3, \ldots , \mathbf {q}_j\right]^\top \in \mathbb {R}^{|\text{N}(u)| \times K}. \tag{4}\]
In this way, the historical interactions are represented in matrix form. Likewise, for a set of interacted users’ embeddings \(\lbrace \mathbf {p}_1, \mathbf {p}_3, \ldots , \mathbf {p}_k\rbrace\), the stacked user embedding map can be represented as:
\[\mathbf {E}_u=\left[\mathbf {p}_1, \mathbf {p}_3, \ldots , \mathbf {p}_k\right]^\top \in \mathbb {R}^{|\text{N}(i)| \times K}, \tag{5}\]
where \(\mathbf {E}_i\) and \(\mathbf {E}_u\) represent the stacked item embedding map and the stacked user embedding map, respectively. Note that \(|\text{N}(u)|\) and \(|\text{N}(i)|\) denote the cardinality of \(\text{N}(u)\) and \(\text{N}(i)\), that is, the numbers of historical interactions, whose values can be controlled by sampling the historical data.
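A minimal sketch of this map construction, assuming the interacted items’ embeddings have already been looked up and using the history budget of 50 from Section 5 (all other sizes are illustrative):

```python
import torch

# A sketch of the embedding-map construction with illustrative sizes: the
# interacted items' embeddings are stacked row-wise into E_i, after
# subsampling histories longer than the budget (50 in the experiments).
K, max_hist = 128, 50
item_hist = torch.randn(73, K)         # stand-in for the interacted items' embeddings

if item_hist.size(0) > max_hist:       # control |N(u)| by sampling the history
    keep = torch.randperm(item_hist.size(0))[:max_hist]
    item_hist = item_hist[keep]

E_i = item_hist                        # (|N(u)|, K) stacked item embedding map
# E_u is built identically from the interacted users' embeddings.
```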
Constructing such embedding maps is advantageous in three ways. First, by representing the historical interactions as embedding maps, our model is able to exploit interaction signals internally (i.e., relationships among items and relationships among users), which empowers our model to learn users’ representations and items’ representations. Second, unlike the outer-product operation, which only considers pairwise dimensional interactions, our proposed embedding maps preserve the latent information in the original embedding space, which enables our model to explicitly capture high-order dimensional interactions. Third, this design makes it possible for our proposed COMET to capture the internal interactions and dimensional interactions simultaneously.
4.4 Interaction Modeling
The latent representations characterize both users and items by vectors of factors, and a high matching score between user and item factors leads to a recommendation [19]. Therefore, modeling the dimensional interactions among such factors is important for achieving personalized recommendations. Since the user interaction vector and the item interaction vector are obtained through the same process, we focus on illustrating how to obtain the user interaction vector in this subsection. As shown in Figure 2, the interaction modeling process aims to generate an interaction vector which encodes the high-order interaction information within the item embedding map constructed in the previous layer. Technically speaking, any method that can transform a matrix into a vector can be used here. Intuitively, MLP is a common choice to capture high-order interactions, and it has been widely used in recommendation research [7, 10, 12]. However, recent studies [9, 37] demonstrate that interactions are modeled inefficiently and implicitly by MLP under current optimization techniques, resulting in sub-optimal performance on recommendation tasks. Inspired by recent works that explicitly encode pairwise interactions and treat the pairwise interaction map as a 2D image or 3D cube [6, 9, 37, 38], we propose to use a single-layer CNN with filters of varying sizes to capture the high-order interactions encoded in the item embedding map. A fully connected MLP is then used to generate the user interaction vector. The efficacy of this design is studied in Section 5.7. Specifically, the cross-features generated by the convolution of the item embedding map \(\mathbf {E}_i\) with the lth filter are denoted as:
\[\mathbf {c}_i^l=\psi \left(\mathbf {E}_i * \mathbf {W}_i^l + \mathbf {b}_i^l\right), \tag{6}\]
where \(*\) represents the convolution operator, \(\mathbf {W}_i^l\in \mathbb {R}^{|\text{N}(u)| \times H}\) is a weight matrix, and \(\mathbf {b}_i^l\) is the corresponding bias. Besides, \(\psi (\cdot)\) is a non-linear activation function; here we employ the rectified linear unit (ReLU) [23, 39, 40]. Note that \(|\text{N}(u)| \times H\) denotes the size of the filter, which is designed to cover all the rows (interacted items) of the embedding map. We use multiple filters with varying widths, as in [16], to extract features at both the local and global scale. The resulting cross-features are concatenated into \(\mathbf {c}_i\) and fed into a fully connected MLP:
\[\mathbf {p}_u^{1}=\psi _2\left(\mathbf {W}_i^{1}\,\mathbf {c}_i+\mathbf {b}_i^{1}\right),\quad \ldots ,\quad \mathbf {p}_u^{L}=\psi _2\left(\mathbf {W}_i^{L}\,\mathbf {p}_u^{L-1}+\mathbf {b}_i^{L}\right), \tag{7}\]
where the number of hidden layers is denoted by L, \(\mathbf {W}_i^L\) represents the weight matrix, \(\mathbf {b}_i^L\) is a bias vector, and \(\psi _2\) denotes the activation function for the MLP layers. The output of the last hidden layer \(\mathbf {p}_u^L\) is then transformed into the user interaction vector \(\mathbf {p}_u^{\prime }\):
\[\mathbf {p}_u^{\prime }=\mathbf {W}_i\,\mathbf {p}_u^L + \mathbf {b}_i, \tag{8}\]
where \(\mathbf {W}_i\) and \(\mathbf {b}_i\) represent the weight matrix and bias vector of the output layer. The item interaction vector \(\mathbf {q}_i^{\prime }\) is obtained in the same way.
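Putting the pieces together, the following sketch shows one possible PyTorch realization of this interaction-modeling block, using the filter widths (1, 8, 32, 128) and 8 channels reported in Section 5; the hidden layer width of 256 is our own assumption, not a value from the paper:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# A sketch of the interaction-modeling block under our reading of Section 4.4:
# a single conv layer whose filters span all rows of the embedding map, with
# several widths, followed by a fully connected MLP and an output layer.
class InteractionModule(nn.Module):
    def __init__(self, n_rows, K=128, widths=(1, 8, 32, 128), channels=8):
        super().__init__()
        # Each filter covers all |N(u)| rows and w of the K dimensions.
        self.convs = nn.ModuleList(
            [nn.Conv2d(1, channels, kernel_size=(n_rows, w)) for w in widths]
        )
        feat = channels * sum(K - w + 1 for w in widths)  # concatenated cross-features
        self.mlp = nn.Sequential(nn.Linear(feat, 256), nn.ReLU(),
                                 nn.Linear(256, K))       # output layer -> R^K

    def forward(self, E):                 # E: (batch, |N(u)|, K)
        x = E.unsqueeze(1)                # (batch, 1, |N(u)|, K) single input channel
        feats = [F.relu(conv(x)).flatten(1) for conv in self.convs]
        return self.mlp(torch.cat(feats, dim=1))          # interaction vector

module = InteractionModule(n_rows=50)
p_u_prime = module(torch.randn(4, 50, 128))  # (4, 128) user interaction vectors
```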
4.5 Prediction Layer
Given the two interaction vectors \(\mathbf {p}_u^{\prime }\) and \(\mathbf {q}_i^{\prime }\) for \(\mathbf {p}_u\) and \(\mathbf {q}_i\), the original representations and the learnt interaction information are combined in the prediction layer. To be specific, the interaction score between the target user and the target item is predicted as follows (cf. Section 3.2):
\[\hat{y}_{ui}=\sigma \left(\mathbf {h}^\top \left(\left(\mathbf {p}_u+\mathbf {p}_u^{\prime }\right)\odot \left(\mathbf {q}_i+\mathbf {q}_i^{\prime }\right)\right)\right), \tag{9}\]
where \(\mathbf {h}\) denotes a learnt weight vector, the sigmoid function \(\sigma (\cdot)\) is used as the activation function, and \(\odot\) represents the element-wise product between two vectors. By enriching the original representations of the target user and target item with the internal interaction vectors, users’ representations and items’ representations are dependently learnt. The efficacy of this design is discussed in Section 5.8.
4.6 Loss Function
In this article, we focus on the task of recommendation from implicit feedback data, so our model should learn its parameters with a ranking-aware objective. Since the sigmoid constrains the output \(\hat{y}_{ui}\) to the range \([0,1]\), the binary cross entropy (BCE) loss is employed:
\[\mathcal {L}=-\sum _{(u,i)\in \mathcal {O}^{+}\cup \mathcal {O}^{-}} y_{ui}\log \hat{y}_{ui}+\left(1-y_{ui}\right)\log \left(1-\hat{y}_{ui}\right), \tag{10}\]
where \(\mathcal {O}^+\) is the set of positive samples and \(\mathcal {O}^-\) is the set of negative samples. During training, four negative samples are randomly drawn for each positive sample in every training epoch.
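A sketch of how this objective might be assembled, with hypothetical item IDs; a complete sampler would also exclude items the user has already interacted with:

```python
import torch
import torch.nn.functional as F

# A sketch of the training objective with illustrative IDs: BCE over the
# observed positives and 4 random negatives per positive (Section 4.6).
n_items, neg_ratio = 2000, 4
pos_items = torch.tensor([7, 19, 344])                   # observed positives for a user
neg_items = torch.randint(n_items, (neg_ratio * pos_items.numel(),))

scores = torch.randn(pos_items.numel() + neg_items.numel())  # stand-in for model logits
labels = torch.cat([torch.ones(pos_items.numel()),
                    torch.zeros(neg_items.numel())])

# Equivalent to applying the sigmoid and then BCE, but numerically more stable.
loss = F.binary_cross_entropy_with_logits(scores, labels)
```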
5 Experiments
We aim to evaluate the effectiveness and rationality of our proposed method in this section. We hence design extensive experiments and ablation studies in order to answer the following research questions:
• RQ1: Is COMET able to outperform the state-of-the-art latent factor models?
• RQ2: Can our method effectively capture the interaction information from historical interactions and embedding dimensions?
• RQ3: Does COMET benefit from the learnt internal interaction signals?
• RQ4: How do the key hyperparameters influence the performance of our method?
5.1 Experimental Settings
5.2 Datasets
We conduct experiments and sensitivity analyses on three public datasets: Amazon Movies & TV, Amazon CDs & Vinyl, and MovieLens 1M (ML-1M). Since it is difficult to evaluate recommendation models on highly sparse datasets, we follow the common practice [10, 27] and discard users with fewer than 10 interactions for both Amazon datasets. Since COMET aims to generate personalized recommendations from implicit feedback, we convert all ratings to implicit feedback by representing each rating entry as either 1 or 0, indicating whether the user has interacted with the item. The characteristics of the three datasets are shown in Table 1.
Table 1. Characteristics of the Datasets

Dataset                  ML-1M       Movies&TV   CDs&Vinyl
Number of users          6,040       40,928      26,876
Number of items          3,706       51,509      66,820
Number of interactions   1,000,209   1,163,413   770,188
Rating density           0.04468     0.00055     0.00043
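A minimal preprocessing sketch along these lines, assuming a generic ratings file; the file name and column names are illustrative rather than the datasets’ actual schemas:

```python
import pandas as pd

# A sketch of the preprocessing described above; "ratings.csv" and the
# column names are placeholders for the actual dataset files.
df = pd.read_csv("ratings.csv", names=["user", "item", "rating", "timestamp"])

# Keep only users with at least 10 interactions (applied to the Amazon datasets).
counts = df["user"].value_counts()
df = df[df["user"].isin(counts[counts >= 10].index)]

# Binarize: every remaining (user, item) pair becomes an observed interaction.
df["label"] = 1
```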
5.3 Compared Methods
To demonstrate the effectiveness of COMET, we also study the performance of the following state-of-the-art counterparts:
• MF [27]: The classic MF model, trained by optimizing the binary cross entropy loss.
• SVD++ [18]: SVD++ enriches the user latent factor with the embeddings of her interacted items.
• FISM [13]: An item-based latent factor model, FISM factorizes the item-item similarity matrix into two low-dimensional latent factor matrices.
• GMF [10]: GMF generalizes the MF model in a non-linear manner.
• MLP [10]: The interaction function between users’ and items’ representations is learnt by an MLP.
• NeuMF [10]: NeuMF combines GMF and MLP. We compare with NeuMF-p, which pre-trains GMF and MLP [10].
• ConvNCF [9]: It uses the outer product to model pairwise interactions between latent dimensions; a CNN is then used to discover the high-level interactions among embedding dimensions.
• LightGCN [8]: A state-of-the-art graph convolution network-based recommendation approach.
5.4 Training Details
In order to find optimal parameter settings for the compared approaches, we carefully tune the hyperparameters suggested by the respective literature. To be specific, for all the recommendation models, we choose the learning rate from \([5\text{e}^{-7},1\text{e}^{-6},5\text{e}^{-6},1\text{e}^{-5},5\text{e}^{-5}, 1\text{e}^{-4}, 5\text{e}^{-4}, 1\text{e}^{-3}, 5\text{e}^{-3}]\), select the embedding size K from \([16, 32, 64, 128]\), and choose the regularization parameter from \([5\text{e}^{-8},1\text{e}^{-7},5\text{e}^{-7},1\text{e}^{-6}, 5\text{e}^{-6}, 1\text{e}^{-5}, 5\text{e}^{-5}]\). Since there are multiple fully connected layers in MLP and NeuMF, the number of hidden layers is fairly tuned from 1 to 3 [10]. For ConvNCF and LightGCN [8], we follow the settings proposed in [9]. Note that we train MF and ConvNCF with the binary cross entropy loss, following [28], so as to conduct a fair comparison among all baselines and our proposed method. All the recommendation approaches are trained until convergence.
For our proposed method, the weight vectors are initialized with the Xavier initialization [5], while the embedding layer and the weight matrices of the CNN are initialized from a uniform distribution. We employ the Adaptive Moment Estimation (Adam) optimizer [17] to train our model and implement it in PyTorch [24]. The learning rates we tried are \([5\text{e}^{-5}, 1\text{e}^{-4}, 5\text{e}^{-4}, 1\text{e}^{-3}, 5\text{e}^{-3}]\), and the regularization parameters we tried are \([1\text{e}^{-6}, 5\text{e}^{-6}, 1\text{e}^{-5}, 5\text{e}^{-5}]\). The embedding size K is fixed at 128 and the dropout rate at 0.3, which consistently achieves better results under our setting. As for the CNN, we empirically set the number of channels, stride, and padding to 8, 1, and 0, respectively. The filters are designed to cross all the latent dimensions; for example, the sizes of the filters for the item embedding map are \(|\text{N}(u)| \times 1\), \(|\text{N}(u)| \times 8\), \(|\text{N}(u)| \times 32\), and \(|\text{N}(u)| \times 128\), respectively. However, a user may have interacted with many items in real-world scenarios, leading to a large \(|\text{N}(u)|\) and thus potentially requiring intensive computation. To alleviate this problem, we empirically cap the number of historical interactions at 50 in the experiments.
5.5 Evaluation Protocols
To make a fair comparison between COMET and the other approaches, the leave-one-out evaluation method is adopted, which is the common choice for recommendation from implicit feedback [9, 27]. To be specific, we randomly sample one interaction for each user as the validation set, on which we tune the hyperparameters of all approaches. We then hold out the latest interacted item of each user as the test positive sample, and randomly sample 99 items with which the user has no interaction as the test negative samples. Each model therefore generates recommendations for a user by ranking these 100 items.
To evaluate the quality of the generated recommendation list, we employ two evaluation metrics in this article, namely Hit Ratio (HR) and Normalized Discounted Cumulative Gain (NDCG). HR@k measures whether the testing item is included in the top-k recommendation list, while NDCG@k also takes the position of correct recommendations into account [9].
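For reference, a minimal sketch of the two metrics under this protocol (the scores are illustrative stand-ins for model outputs):

```python
import numpy as np

# A sketch of the leave-one-out metrics: the held-out positive item is
# ranked against 99 sampled negatives.
def hr_ndcg_at_k(scores, k=10):
    """scores: length-100 array with the positive item's score at index 0."""
    rank = int((scores > scores[0]).sum())        # 0-based rank of the positive
    hr = 1.0 if rank < k else 0.0                 # hit if it lands in the top-k
    ndcg = 1.0 / np.log2(rank + 2) if rank < k else 0.0
    return hr, ndcg

hr, ndcg = hr_ndcg_at_k(np.random.randn(100), k=10)
```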
5.6 Performance Comparison (The Answer to RQ1)
Tables 2 and 3 show the top-k evaluation results on all three datasets. We run each method 5 times and perform significance tests. COMET clearly achieves the best performance on all datasets in terms of both HR and NDCG. We attribute this to the interaction modeling process: by efficiently capturing the high-order interaction information in the embedding maps, better representations of users and items are learnt. Besides, we can see that SVD++ achieves performance comparable to some deep models (i.e., MLP and ConvNCF) on the ML-1M dataset, which may benefit from the abundant latent factors of interacted items in ML-1M.
Table 2. HR@5 and NDCG@5 of All Recommendation Models

            ML-1M                         Movies&TV                     CDs&Vinyl
Method      HR@5           NDCG@5         HR@5           NDCG@5         HR@5           NDCG@5
BCE-MF      0.540±0.001 •  0.376±0.002 •  0.634±0.002    0.498±0.001 *  0.606±0.001 •  0.466±0.001 •
SVD++       0.557±0.002    0.388±0.001 •  0.606±0.001 •  0.462±0.001 •  0.607±0.002 •  0.466±0.001 •
FISM        0.528±0.002 •  0.372±0.002 •  0.583±0.003 •  0.452±0.002 •  0.592±0.001 •  0.457±0.002 •
MLP         0.526±0.003 •  0.362±0.003 •  0.570±0.002 •  0.425±0.001 •  0.588±0.003 •  0.445±0.001 •
GMF         0.540±0.001 •  0.372±0.001 •  0.569±0.004 •  0.427±0.003 •  0.620±0.004 •  0.481±0.002 •
NeuMF       0.548±0.003 •  0.381±0.002 •  0.596±0.001 •  0.453±0.001 •  0.629±0.001 •  0.491±0.001 •
ConvNCF     0.539±0.001 •  0.376±0.001 •  0.623±0.002 •  0.488±0.001 •  0.603±0.002 •  0.457±0.001 •
LightGCN    0.542±0.002 •  0.381±0.001 •  0.618±0.001 •  0.476±0.001 •  0.658±0.004 •  0.525±0.003
COMET       0.558±0.001 *  0.392±0.001 *  0.637±0.002 *  0.491±0.002    0.667±0.002 *  0.528±0.002 *

The best result in each column is marked with *. In addition, • indicates that the performance of COMET is significantly superior to the compared method on that dataset (paired t-test at the 0.05 significance level).
Table 3. HR@10 and NDCG@10 of All Recommendation Models

            ML-1M                         Movies&TV                     CDs&Vinyl
Method      HR@10          NDCG@10        HR@10          NDCG@10        HR@10          NDCG@10
BCE-MF      0.706±0.001 •  0.429±0.002 •  0.738±0.002 •  0.533±0.001 *  0.727±0.001 •  0.509±0.001 •
SVD++       0.713±0.003 •  0.438±0.002 •  0.728±0.001 •  0.502±0.001 •  0.719±0.001 •  0.502±0.001 •
FISM        0.699±0.003 •  0.433±0.001 •  0.708±0.002 •  0.471±0.001 •  0.729±0.002 •  0.512±0.001 •
MLP         0.703±0.003 •  0.421±0.004 •  0.703±0.002 •  0.471±0.001 •  0.712±0.003 •  0.485±0.004 •
GMF         0.711±0.001 •  0.429±0.002 •  0.712±0.006 •  0.479±0.005 •  0.729±0.006 •  0.515±0.002 •
NeuMF       0.727±0.004    0.443±0.002 •  0.721±0.001 •  0.493±0.002 •  0.750±0.001 •  0.529±0.001 •
ConvNCF     0.710±0.002 •  0.431±0.001 •  0.726±0.001 •  0.521±0.001 •  0.722±0.001 •  0.496±0.001 •
LightGCN    0.709±0.003 •  0.434±0.002 •  0.744±0.002 •  0.517±0.001 •  0.757±0.003 •  0.556±0.003
COMET       0.729±0.002 *  0.448±0.001 *  0.759±0.002 *  0.529±0.003    0.780±0.003 *  0.560±0.004 *

The best result in each column is marked with *. In addition, • indicates that the performance of COMET is significantly superior to the compared method on that dataset (paired t-test at the 0.05 significance level).
5.7 Study of Interaction Modeling (The Answer to RQ2)
COMET aims to model the high-order interactions among historical interactions and embedding dimensions. In this work, we apply a single-layer CNN to extract cross-features over the embedding maps. With filters of varying sizes, interaction features can be effectively obtained at both local and global scales. These features then serve as the input of the fully connected MLP layer; in this way, dimensional interaction signals are captured in a rather explicit manner. To clearly demonstrate the rationality of our proposed interaction modeling process, we compare against two variants:
• COMET-CNN Only: This variant uses a 3-layer CNN to transform the embedding maps into interaction vectors. An output layer guarantees that each interaction vector has dimension K. The filter size, number of channels, stride, and padding are set to 3 \(\times\) 3, 8, 2, and 0, respectively.
• COMET-MLP Only: The embeddings of the interacted users or the interacted items are concatenated and fed directly into a fully connected MLP. Besides the dropout rate, we carefully tune the number of hidden layers from 1 to 3 according to the tower structure of neural networks [10].
The comparison among COMET, COMET-CNN only, and COMET-MLP only is displayed in Figure 3. We can see that by first encoding the high-order feature interactions with a CNN in a rather explicit way, the subsequent MLP is better able to capture complex relationships. This observation agrees with the conclusions of recent works [9, 37]. Furthermore, to discover the benefit of modeling high-order interactions among all the latent dimensions, we fairly compare different sets of filters here. For example, COMET(1) means that only filters of width 1 (e.g., \(\mathbf {W}\in \mathbb {R}^{|\text{N}(u)| \times 1}\) for the item embedding map) are used, so no dimensional interaction is captured, whereas the full COMET (i.e., COMET(1,8,32,128)) captures not only the independent dimensional information but also the interaction signals among all the embedding dimensions. The performance of COMET with different filter sets is shown in Table 4. COMET generally performs better than the other variants, particularly under the NDCG metric, which demonstrates the effectiveness of modeling interaction effects across latent dimensions. In addition, we observe that the performance gaps between the variants are not as large as expected. The underlying reason may be the relatively small number of cross-features that encode the interaction among all (i.e., K) embedding dimensions. We may alleviate this problem by weighting or selecting the cross-features before feeding them into the MLP; we leave this challenge as future work. Figure 4 shows the heat maps of two randomly selected kernels. A quick observation is that the two selected kernels capture different interaction signals among the latent dimensions.
Fig. 3. Performance comparison among COMET, COMET-CNN only, and COMET-MLP only.
Fig. 4. Heat maps of two randomly selected kernels.
Table 4. Performance of COMET with Different Filter Sets on the Three Datasets

Dataset     Method          HR@5    NDCG@5   HR@10   NDCG@10
ML-1M       COMET(1)        0.557   0.389    0.730   0.445
            COMET(1,8)      0.549   0.386    0.721   0.442
            COMET(1,8,32)   0.553   0.385    0.722   0.440
            COMET           0.558   0.392    0.729   0.448
Movies&TV   COMET(1)        0.631   0.483    0.758   0.524
            COMET(1,8)      0.619   0.476    0.748   0.515
            COMET(1,8,32)   0.619   0.477    0.749   0.517
            COMET           0.637   0.491    0.759   0.529
CDs&Vinyl   COMET(1)        0.642   0.513    0.749   0.547
            COMET(1,8)      0.660   0.525    0.772   0.558
            COMET(1,8,32)   0.665   0.523    0.779   0.559
            COMET           0.667   0.528    0.780   0.560
5.8 Study of the Interaction-aware Representation (The Answer to RQ3)
As mentioned before, the representations of users and items are enriched by the learnt interaction vectors in the prediction layer. To study the efficacy of this design, we devise the following variants:
• COMET-Original Only: The interaction score is predicted only by the inner product of the original representations of the target user and the target item. In this way, the construction of the embedding maps and the interaction modeling process are omitted.
• COMET-Interaction Only: Conversely, we ignore the original representations of the target user (i.e., \(\mathbf {p}_u\)) and the target item (i.e., \(\mathbf {q}_i\)) in the prediction layer. In other words, we only use \(\mathbf {p}_u^{\prime }\) and \(\mathbf {q}_i^{\prime }\) to estimate the prediction score between the target user u and the target item i.
The performance of COMET, COMET-Original only, and COMET-Interaction only is reported in Figure 5. Clearly, the combination of the original representations and the interaction vectors achieves the best performance among the three models. By enriching the original representations with the internal interaction vectors, the latent representations of users and items are dependently learnt in COMET. This observation demonstrates a promising way to improve implicit-feedback recommendation without any additional data such as text reviews, social networks, or knowledge graphs [31].
Fig. 5. Performance of COMET, COMET-Original only, and COMET-Interaction only.
5.9 Sensitivity Analysis (The Answer to RQ4)
Here we investigate the effect of the regularization parameter, the embedding size K, and the number of channels on COMET. From Figure 6, we observe that COMET obtains better performance as the embedding size increases, which is reasonable since a larger embedding size can encode richer representations of users and items. Besides, COMET generally achieves better performance when the regularization parameter lies in the range [5e-6, 1e-5], which indicates the importance of choosing a suitable regularization parameter to balance overfitting and underfitting. Finally, we find that the performance of COMET is very stable across different numbers of channels, which shows the strong expressiveness of the CNN. We only test three different numbers of channels here, since the training time increases dramatically with more channels under our setting. In addition, to show the convergence of our proposed method, we plot the training loss per epoch in Figure 7.
Fig. 6. Sensitivity of COMET to the regularization parameter, the embedding size K, and the number of channels.
Fig. 7. Training loss of COMET per epoch.
6 Conclusion
Representation learning-based recommendation models aim to learn effective representations of users and items. In this article, we studied how the interactions among historical interactions and embedding dimensions enrich the representations learnt by MF. By representing the interacted items and interacted users as two “embedding maps”, COMET is able to exploit high-order interaction signals among historical interactions and embedding dimensions simultaneously. The advantage of enriching the representations of users and items with the learnt interaction information is also demonstrated. Extensive experiments and ablation studies demonstrate the efficacy of our proposed method over existing state-of-the-art methods. In future work, we will explore more efficient and scalable approaches to capture the interactions among historical interactions and embedding dimensions. In addition, we would like to apply this idea to sequential recommendation tasks and investigate dimensional interaction effects in sequential settings.
Emmanuel J. Candès and Benjamin Recht. 2009. Exact matrix completion via convex optimization. Foundations of Computational Mathematics 9, 6 (2009), 717.
Heng-Tze Cheng, Levent Koc, Jeremiah Harmsen, Tal Shaked, Tushar Chandra, Hrishi Aradhye, Glen Anderson, Greg Corrado, Wei Chai, Mustafa Ispir, Rohan Anil, Zakaria Haque, Lichan Hong, Vihan Jain, Xiaobing Liu, and Hemal Shah. 2016. Wide & deep learning for recommender systems. In Proceedings of the 1st Workshop on Deep Learning for Recommender Systems. 7–10.
Paul Covington, Jay Adams, and Emre Sargin. 2016. Deep neural networks for YouTube recommendations. In Proceedings of the 10th ACM Conference on Recommender Systems. 191–198.
Zhi-Hong Deng, Ling Huang, Chang-Dong Wang, Jian-Huang Lai, and S. Yu Philip. 2019. DEEPCF: A unified framework of representation learning and matching function learning in recommender system. In Proceedings of the 33rd AAAI Conference on Artificial Intelligence. 61–68.
Xavier Glorot and Yoshua Bengio. 2010. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the 13th International Conference on Artificial Intelligence and Statistics. 249–256.
Liu Han, Hailong Wu, Nan Hu, and Binbin Qu. 2019. Convolutional neural collaborative filtering with stacked embeddings. In Proceedings of the 11th Asian Conference on Machine Learning. 726–741.
Xiangnan He and Tat-Seng Chua. 2017. Neural factorization machines for sparse predictive analytics. In Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval. 355–364.
Xiangnan He, Kuan Deng, Xiang Wang, Yan Li, YongDong Zhang, and Meng Wang. 2020. LightGCN: Simplifying and powering graph convolution network for recommendation. In Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval. 639–648.
Xiangnan He, Xiaoyu Du, Xiang Wang, Feng Tian, Jinhui Tang, and Tat-Seng Chua. 2018. Outer product-based neural collaborative filtering. In Proceedings of the 27th International Joint Conference on Artificial Intelligence. 2227–2233.
Xiangnan He, Lizi Liao, Hanwang Zhang, Liqiang Nie, Xia Hu, and Tat-Seng Chua. 2017. Neural collaborative filtering. In Proceedings of the 26th International Conference on World Wide Web. 173–182.
Balázs Hidasi, Alexandros Karatzoglou, Linas Baltrunas, and Domonkos Tikk. 2015. Session-based recommendations with recurrent neural networks. In Proceedings of the 4th International Conference on Learning Representations.
Santosh Kabbur, Xia Ning, and George Karypis. 2013. FISM: Factored item similarity models for top-n recommender systems. In Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 659–667.
Wang-Cheng Kang and Julian McAuley. 2018. Self-attentive sequential recommendation. In Proceedings of the 20th IEEE International Conference on Data Mining. 197–206.
Alexandros Karatzoglou, Xavier Amatriain, Linas Baltrunas, and Nuria Oliver. 2010. Multiverse recommendation: n-dimensional tensor factorization for context-aware collaborative filtering. In Proceedings of the 4th ACM Conference on Recommender Systems. 79–86.
Yoon Kim. 2014. Convolutional neural networks for sentence classification. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing. 1746–1751.
Diederik P. Kingma and Jimmy Ba. 2015. Adam: A method for stochastic optimization. In Proceedings of the 3rd International Conference on Learning Representations.
Yehuda Koren. 2008. Factorization meets the neighborhood: A multifaceted collaborative filtering model. In Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 426–434.
Zhuoyi Lin, Lei Feng, Rui Yin, Chi Xu, and Chee Keong Kwoh. 2021. GLIMG: Global and local item graphs for top-N recommender systems. Information Sciences 580 (2021), 1–14.
Vinod Nair and Geoffrey E. Hinton. 2010. Rectified linear units improve restricted Boltzmann machines. In Proceedings of the 27th International Conference on Machine Learning.
Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. 2017. Automatic differentiation in PyTorch. In Proceedings of the 31st Conference on Neural Information Processing Systems.
Yanru Qu, Han Cai, Kan Ren, Weinan Zhang, Yong Yu, Ying Wen, and Jun Wang. 2016. Product-based neural networks for user response prediction. In Proceedings of the 16th IEEE International Conference on Data Mining. 1149–1154.
Steffen Rendle, Christoph Freudenthaler, Zeno Gantner, and Lars Schmidt-Thieme. 2009. BPR: Bayesian personalized ranking from implicit feedback. In Proceedings of the 25th Conference on Uncertainty in Artificial Intelligence. 452–461.
Steffen Rendle, Walid Krichene, Li Zhang, and John Anderson. 2020. Neural collaborative filtering vs. matrix factorization revisited. In Proceedings of the 14th ACM Conference on Recommender Systems. 240–248.
Ying Shan, T. Ryan Hoens, Jian Jiao, Haijing Wang, Dong Yu, and J. C. Mao. 2016. Deep crossing: Web-scale modeling without manually crafted combinatorial features. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 255–262.
Fei Sun, Jun Liu, Jian Wu, Changhua Pei, Xiao Lin, Wenwu Ou, and Peng Jiang. 2019. BERT4Rec: Sequential recommendation with bidirectional encoder representations from transformer. In Proceedings of the 28th ACM International Conference on Information and Knowledge Management. 1441–1450.
Zhu Sun, Qing Guo, Jie Yang, Hui Fang, Guibing Guo, Jie Zhang, and Robin Burke. 2019. Research commentary on recommendations with side information: A survey and research directions. Electronic Commerce Research and Applications 37 (2019), 100879.
Panagiotis Symeonidis, Alexandros Nanopoulos, and Yannis Manolopoulos. 2010. A unified framework for providing recommendations in social tagging systems based on ternary semantic analysis. IEEE Transactions on Knowledge and Data Engineering 22, 2 (2010), 179–192.
Jiaxi Tang and Ke Wang. 2018. Personalized top-n sequential recommendation via convolutional sequence embedding. In Proceedings of the 11th ACM International Conference on Web Search and Data Mining. 565–573.
Xiang Wang, Xiangnan He, Fuli Feng, Liqiang Nie, and Tat-Seng Chua. 2018. TEM: Tree-enhanced embedding model for explainable recommendation. In Proceedings of the 27th World Wide Web Conference. 1543–1552.
Markus Weimer, Alexandros Karatzoglou, Quoc V. Le, and Alex J. Smola. 2008. CofiRank: Maximum margin matrix factorization for collaborative ranking. In Proceedings of the 22nd Conference on Advances in Neural Information Processing Systems. 1593–1600.
Xin Xin, Bo Chen, Xiangnan He, Dong Wang, Yue Ding, and Joemon Jose. 2019. CFM: Convolutional factorization machines for context-aware recommendation. In Proceedings of the 28th International Joint Conference on Artificial Intelligence. 3926–3932.
An Yan, Shuo Cheng, Wang-Cheng Kang, Mengting Wan, and Julian McAuley. 2019. CosRec: 2D convolutional neural networks for sequential recommendation. In Proceedings of the 28th ACM International Conference on Information and Knowledge Management. 2173–2176.
Rui Yin, Zihan Luo, Pei Zhuang, Zhuoyi Lin, and Chee Keong Kwoh. 2021. VirPreNet: A weighted ensemble convolutional neural network for the virulence prediction of influenza A virus using all eight segments. Bioinformatics 37, 6 (2021), 737–743.
Rui Yin, Nyi Nyi Thwin, Pei Zhuang, Zhuoyi Lin, and Chee Keong Kwoh. 2021. IAV-CNN: A 2D convolutional neural network model to predict antigenic variants of influenza A virus. IEEE/ACM Transactions on Computational Biology and Bioinformatics 19, 6 (2021), 3497–3506.
Wenhui Yu, Huidi Zhang, Xiangnan He, Xu Chen, Li Xiong, and Zheng Qin. 2018. Aesthetic-based clothing recommendation. In Proceedings of the 27th International World Wide Web Conference. 649–658.
Guorui Zhou, Xiaoqiang Zhu, Chenru Song, Ying Fan, Han Zhu, Xiao Ma, Yanghui Yan, Junqi Jin, Han Li, and Kun Gai. 2018. Deep interest network for click-through rate prediction. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 1059–1068.