CN110781409A - Article recommendation method based on collaborative filtering - Google Patents
- Publication number
- CN110781409A (application CN201911022328.2A)
- Authority
- CN
- China
- Prior art keywords
- item
- attention
- user
- layer
- recommendation
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/953—Querying, e.g. by the use of web search engines
- G06F16/9536—Search customisation based on social or collaborative filtering
Landscapes
- Engineering & Computer Science (AREA)
- Databases & Information Systems (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention provides an item recommendation method based on collaborative filtering, relating to the technical field of recommendation systems. A dedicated dynamic weight is introduced to better predict the preference of a user u for an item i; this dynamic weight is estimated with an attention mechanism, and recommendation performance is evaluated with recall and precision, improving the effectiveness and recommendation quality of the recommendation system. The attention mechanism is shown to help estimate the contribution of the user's historically interacted items to the representation of the user's preference, making personalized recommendation more accurate. Attention scores are computed with both pointwise attention and self-attention, with marked effect; in addition, the Transformer model is combined with the recommendation algorithm and compared against conventional embedding models, showing an improvement in recommendation quality.
Description
Technical Field
The invention relates to the technical field of recommendation systems, in particular to an article recommendation method based on collaborative filtering.
Background
Collaborative Filtering (CF) is the earliest and best-known class of recommendation algorithms. Its main functions are prediction and recommendation; it has been studied in depth in academia and widely applied in industry. The algorithm discovers user preferences by mining the user's historical behavior data and recommends items of similar taste based on those preferences. Collaborative filtering recommendation algorithms fall into two main categories: User-based Collaborative Filtering (UserCF) and Item-based Collaborative Filtering (ItemCF). In short: birds of a feather flock together. User-based collaborative filtering finds, from historical behavior data, which goods or content a user likes (e.g., purchases, favorites, comments, or shares) and measures and scores these preferences. It then computes relationships between users from their attitudes toward the same goods or content and recommends goods among users with the same preferences. For example, if users A and B both purchased books x, y and z and gave them five-star reviews, A and B belong to the same class, so book w viewed by A can be recommended to user B. UserCF has found application on some websites (e.g., Digg), but the algorithm has drawbacks. First, as the number of users of a website grows, computing the user interest similarity matrix becomes harder and harder: time and space complexity grow approximately quadratically with the number of users. Second, user-based collaborative filtering makes it difficult to explain recommendation results. For these reasons Amazon, the well-known e-commerce company, proposed the alternative item-based collaborative filtering algorithm.
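As a toy illustration of the user-based scheme just described, the relationship between users can be computed as the cosine between their rating vectors. The sketch below uses hypothetical ratings of our own, not data from the invention:

```python
import numpy as np

# Hypothetical user-item rating matrix: rows = users A, B, C; columns = items x, y, z, w.
# A zero means the user has not rated the item.
R = np.array([
    [5.0, 5.0, 5.0, 4.0],   # user A rated x, y, z and viewed/rated w
    [5.0, 5.0, 5.0, 0.0],   # user B rated x, y, z but has not seen w
    [0.0, 1.0, 0.0, 5.0],   # user C has different tastes
])

def cosine_sim(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two users' rating vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# A and B are highly similar, so item w (liked by A, unseen by B)
# becomes a candidate recommendation for B.
print(cosine_sim(R[0], R[1]))   # close to 1.0
print(cosine_sim(R[0], R[2]))   # noticeably lower
```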
An item-based collaborative filtering algorithm (ICF) recommends to users items similar to those they previously liked. For example, the algorithm may recommend "Machine Learning" to you because you purchased "Data Mining Guide". However, the ICF algorithm does not compute item similarity from the items' content attributes; it computes it mainly by analyzing users' behavior records. ICF not only provides a convincing explanation of the prediction results in many recommendation scenarios, but also facilitates real-time personalization: the main computation, estimating the similarity between items, can be done offline, while the online recommendation module only needs to perform a series of lookups of similar items, which is easily done in real time.
The earliest item-based collaborative filtering method, ItemCF, decides whether to add a target item to the user's recommendation list by computing the similarity between the items the user has interacted with in the past and the current target item. That is, the predicted score of user u for a particular item i equals the similarity $s_{ij}$ between item i and each item j that user u has interacted with, multiplied by the user's score $r_{uj}$ for j, accumulated over all interacted items:

$$\hat{r}_{ui} = \sum_{j} s_{ij} \, r_{uj}$$

where the sum runs over the items j that user u has interacted with.
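A minimal sketch of this prediction rule; the similarity matrix, ratings and shapes below are illustrative assumptions of ours, not values from the patent:

```python
import numpy as np

def predict_score(u: int, i: int, S: np.ndarray, R: np.ndarray) -> float:
    """Classic ItemCF prediction: sum of similarity(i, j) * rating(u, j)
    over all items j the user u has interacted with (R[u, j] > 0)."""
    interacted = np.nonzero(R[u])[0]          # items j with feedback from u
    return float(sum(S[i, j] * R[u, j] for j in interacted if j != i))

# Toy data: 4 items, one user. S is a hypothetical item-item similarity
# matrix; R holds the user's known scores.
S = np.array([
    [1.0, 0.8, 0.1, 0.3],
    [0.8, 1.0, 0.2, 0.4],
    [0.1, 0.2, 1.0, 0.0],
    [0.3, 0.4, 0.0, 1.0],
])
R = np.array([[0.0, 5.0, 2.0, 4.0]])          # user 0 has not scored item 0
print(predict_score(0, 0, S, R))              # 0.8*5 + 0.1*2 + 0.3*4 = 5.4
```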
Early ItemCF methods used statistical measures such as the Pearson coefficient and cosine similarity to compute the similarity between the user's historical items and the target item. This approach is simple, but such heuristic similarity estimation lacks optimization tailored to the recommendation task and can therefore yield suboptimal performance. Moreover, under sparse data, cosine similarity implicitly sets the user's affinity for unrated items to 0, and the set of items co-rated by users, on which the Pearson coefficient is computed, may be small. These methods therefore need adjustment and optimization to adapt different data sets to the recommendation scheme. With the development of machine learning, a learning-based approach called SLIM was proposed. It customizes a recommendation objective function so that item similarities are learned adaptively from the data; that is, it minimizes the loss between the original user-item interaction matrix and the interaction matrix reconstructed by the item-based CF model. Although SLIM can achieve better recommendation accuracy, it has two inherent limitations. First, offline training can be very time-consuming for large-scale data, since learning the similarity matrix S directly has time complexity on the order of O(I²). Second, it can only estimate the similarity between two items that were purchased or rated together; it cannot estimate the similarity of unrelated items and therefore cannot capture transitive relationships between items. In actual recommendation tasks, particularly when data is sparse, the recommendation quality of SLIM degrades.
FISM addresses these limitations well. The method represents items as low-dimensional embedding vectors, so that the similarity $s_{ij}$ between items i and j is parameterized as the inner product of their embedding vectors. As the numbers of users and items grow, the interaction matrix becomes sparse and the effectiveness of existing Top-K recommendation methods declines; the FISM algorithm therefore proposes an item-based method for generating Top-K recommendations in which the item similarity matrix is learned as the product of two low-dimensional latent factor matrices. A full set of experiments on multiple data sets at several sparsity levels shows that the method proposed by the FISM algorithm can efficiently process sparse data sets. As a result, the recommendation accuracy of FISM is superior to other popular Top-K recommendation algorithms, and its advantage grows as the data set becomes sparser. Despite this strong performance, the assumption that all of a user's historically interacted items contribute equally to the representation of the user's preference is clearly unreasonable. For example, a basketball and an everyday household item should not have the same influence when recommending basketball shoes in real time. We therefore introduce a dedicated dynamic weight to better predict the preference of user u for item i, and estimate this dynamic weight with an attention mechanism.
Disclosure of Invention
Aiming at the problems in the prior art, the invention provides an article recommendation method based on collaborative filtering.
An item recommendation method based on collaborative filtering comprises the following steps:
step 1: calculating the prediction score of user u for a target item i: using one-hot coding, obtaining through the embedding layer the embedding vector p of the predicted target item i and the embedding vectors q of the historical interacted items, where p marks an item as a predicted item and q marks it as a historically interacted item, and obtaining the evaluation score of the item; the attention-based ItemCF model is defined as:

$$a_{ij} = f(p_i, q_j), \qquad \hat{y}_{ui} = \left(\left|R_u^+ \setminus \{i\}\right|\right)^{-\alpha} \sum_{j \in R_u^+ \setminus \{i\}} a_{ij}\, p_i^{\top} q_j$$

wherein i is the predicted target item, j is a historically interacted item of the user, $a_{ij}$ is the weight, calculated with an attention network, of the historical interacted item in the representation of the user's preference, $p_i$ and $q_j$ respectively represent the embedding vector of the predicted item set and the embedding vector of the user-interacted item, $R_u^+$ represents the positive example set of user u, $R_u^+ \setminus \{i\}$ denotes the positive example set with item i removed, and $\alpha$ is a coefficient;
step 1.1: concatenating the embedding vector $p_i$ of the queried item set with the embedding vector $q_j$ of the user-interacted item set to obtain the concatenated vector $c = [p_i; q_j]$, and taking the concatenated vector as the input of a pointwise attention model; this first attempt at the attention model is named Dot;
step 1.1.1: independently performing three linear transformations on the concatenated vector c, with coefficient matrices $W_Q$, $W_K$ and $W_V$ respectively, thereby obtaining the inputs Query, Key and Value (Q, K, V) of the attention network;
step 1.1.2: implementing the dot product of Q with the transpose of K using a highly optimized matrix multiplication, and after softmax multiplying by V to obtain a weight matrix; the Attention function Attention(Q, K, V) is calculated as:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V$$

wherein $d_k$ denotes the dimension of K and the softmax function converts the values into a probability distribution; if the dimensions of Q, K and V are the same, the dimension of the output attention weight matrix is the same as theirs;
step 1.2: feeding the concatenated vector c into the network as input, repeating the preceding single-head dot-product attention h times, concatenating the h result matrices, and finally converting them to the required dimension through a linear transformation, i.e., setting the attention function as a self-attention model to calculate the weight of the contribution of historical item j to user u's score for the target item i; this variant is named Self;
step 1.3: utilizing the main framework of the Transformer model, divided into an encoder module and a decoder module, and setting the input of the first sub-module of the encoder module to the embedding vector $p_i$ of the target item to be predicted; the input of each remaining sub-module is the output of the previous sub-module; each encoder sub-module consists of two layers, the first being a self-attention layer and the second a feed-forward layer; after the attention operation, the encoder and decoder both contain a fully connected feed-forward network comprising two linear transformations and a ReLU activation:

$$\mathrm{FFN}(x) = \max(0,\, xW_1 + b_1)\,W_2 + b_2$$
The input of the first sub-module of the decoder module is the set of embedding vectors $q_j$ of the user's historically interacted items, and the input of each remaining sub-module is the output of the previous sub-module; each decoder sub-module consists of three layers: the first and second are self-attention layers, except that the input Q of the second layer is the output of the previous layer while its K and V are the outputs of the encoder; the third layer is a feed-forward layer. An "Add & Normalize" layer is added after each layer to prevent gradient vanishing or explosion while preventing overfitting; the output of the model is converted to the required size by a fully connected layer and a softmax function to obtain the attention weight $a_{ij}$ for the subsequent computation, and this model is defined as Trans;
step 1.4: customizing the objective function: observed user-item interactions are treated as positive examples, negative examples are extracted from the remaining unobserved interactions, and $R^+$ and $R^-$ represent the positive and negative example sets; using the log loss as the loss term and penalizing the embedding vectors and the coefficient and bias terms of each network with the L2 norm, the loss function is:

$$L = -\frac{1}{N}\left[\sum_{(u,i)\in R^+} \log \sigma(\hat{y}_{ui}) + \sum_{(u,j)\in R^-} \log\bigl(1-\sigma(\hat{y}_{uj})\bigr)\right] + \lambda \lVert \Theta \rVert^2$$

where N represents the total number of training examples, σ is the sigmoid function converting predicted values into probability values, the hyperparameter λ controls the strength of the L2 regularization used to prevent overfitting, and $\Theta = \{\{p_i\}, \{q_j\}, W, b, h\}$ represents all trainable parameters; W, b, h and all parameters of the linear transformations carry the regularization penalty. Adagrad, a variant of the stochastic gradient descent algorithm, optimizes the objective function by applying an adaptive learning rate to each parameter, extracting random samples from all training examples, and updating the relevant parameters in the negative direction of the gradient. A mini-batch method is used to randomly pick a user and then use all of that user's interacted items as a small batch.
Step 2: and (4) performing an experiment on the real article data set on the evaluation index, judging the performance according to the recommendation result, and comparing the experiment result with other recommendation methods.
The invention has the beneficial effects that:
the method applies a machine translation attention mechanism transformer in natural language processing to a recommendation model, performs experiments on the method provided by the invention on two real data sets of a movie and a picture, and evaluates by using two common recommendation model evaluation indexes of recall ratio and precision ratio. Based on the recall ratio, the method realizes the improvement of 3.2 percent relatively, and based on the precision ratio, the method realizes the improvement of 4.3 percent relatively, so the method can generate a more accurate personalized recommendation list for the user. The efficient recommendation system can provide an efficient and intelligent information filtering technology for the user under the condition that the user lacks experience in related fields or cannot process massive data, explores potential consumption tendency of the user, and provides personalized services for numerous users. Through recommending articles to the user accurately, the interest of the user can be improved, the browsing amount of the website, the click rate and the purchase rate are improved, and great convenience is brought to the life and leisure of the user while income is brought to the website. The better recommendation method can bring business value to the enterprise entity, optimize sales boundary and profit, help the product to expand the boundary, provide more various and more intimate experience through scene construction, and finally improve profit and the like.
Drawings
FIG. 1 is the basic framework of the attention-based ItemCF model;
FIG. 2 is a point-by-point attention model structure;
FIG. 3 is the basic framework of the Transformer model;
FIG. 4 is a comparison of the performance of the models FISM, Dot, Self, Trans at an embedding size of 16;
in panel (a), ML-1M HR; in panel (b), ML-1M NDCG; in panel (c), Pinterest-20 HR; and in panel (d), Pinterest-20 NDCG.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention clearer, the present invention is described in further detail below with reference to the accompanying drawings and specific embodiments. The specific embodiments described herein are merely illustrative of the invention and are not intended to limit it.
An item recommendation method based on collaborative filtering comprises the following steps:
step 1: as shown in FIG. 1, u is represented by multi-hot coding, i.e., over all items that user u has interacted with in the case of implicit feedback; the multi-hot coding of the user passes through the embedding layer and generates a set of vectors, where each vector represents a historical item associated with the user, and the target item to be predicted obtains its embedding vector through the embedding layer using one-hot coding. The prediction score of user u for the target item i is then calculated: one-hot coding and the embedding layer yield the embedding vector p of the predicted target item i and the embedding vectors q of the historical items, where p marks an item as a predicted item and q marks it as a historically interacted item. As shown in FIG. 1, the attention-based ItemCF model is defined as:

$$a_{ij} = f(p_i, q_j), \qquad \hat{y}_{ui} = \left(\left|R_u^+ \setminus \{i\}\right|\right)^{-\alpha} \sum_{j \in R_u^+ \setminus \{i\}} a_{ij}\, p_i^{\top} q_j$$

wherein i is the predicted target item, j is a historically interacted item of the user, $a_{ij}$ is the weight, calculated with an attention network, of the historical interacted item in the representation of the user's preference, $p_i$ and $q_j$ respectively represent the embedding vector of the predicted item set and the embedding vector of the user-interacted item, $R_u^+$ represents the positive example set of user u, $R_u^+ \setminus \{i\}$ denotes the positive example set with item i removed, and $\alpha$ is a coefficient;
step 1.1: concatenating the embedding vector $p_i$ of the queried item set with the embedding vector $q_j$ of the user-interacted item set to obtain the concatenated vector $c = [p_i; q_j]$ from which the interaction weight is learned, and taking the concatenated vector as the input of the pointwise attention model, as shown in FIG. 2; this first attempt at the attention mechanism is named Dot;
step 1.1.1: independently performing three linear transformations on the concatenated vector c, with coefficient matrices $W_Q$, $W_K$ and $W_V$ respectively, thereby obtaining the inputs Query, Key and Value (Q, K, V) of the attention network;
step 1.1.2: computing the dot product of Q with the transpose of K, implemented as a highly optimized matrix multiplication (the dot product is faster and more space-saving than alternatives and benefits from such optimized multiplication), and scaling it by the factor $1/\sqrt{d_k}$ so that the inner products do not become too large; otherwise the softmax output saturates toward 0 or 1, causing gradient vanishing or explosion, whereas the scaling keeps the values in the region where gradients are large. The softmax function converts the values into a probability distribution, which is very friendly to gradient calculation; after softmax, multiplying by V yields a weight matrix. The Attention function Attention(Q, K, V) is calculated as:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V$$

wherein $d_k$ denotes the dimension of K; if the dimensions of Q, K and V are the same, the dimension of the output attention weight matrix is the same as theirs;
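The Dot variant just described comes down to standard scaled dot-product attention. The following NumPy sketch is our own illustration (shapes and names are assumptions) of the computation and of why the $1/\sqrt{d_k}$ scaling matters:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q @ K.T / sqrt(d_k)) @ V.
    Scaling by sqrt(d_k) keeps the logits in the region where the
    softmax is not saturated, so gradients neither vanish nor explode."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                    # (n_q, n_k) raw logits
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)     # row-wise softmax
    return weights @ V, weights                        # output and attention map

# Toy shapes: 4 queries/keys of dimension d_k = 8, values of the same size.
rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(4, 8)) for _ in range(3))
out, attn = scaled_dot_product_attention(Q, K, V)
print(out.shape, attn.shape)   # (4, 8) (4, 4); each attention row sums to 1
```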
step 1.2: feeding the concatenated vector c into the network as input, repeating the preceding single-head dot-product attention h times, concatenating the h result matrices, and finally converting them to the required dimension through a linear transformation, i.e., setting the attention function as a self-attention model to calculate the weight of the contribution of historical item j to user u's score for the target item i; this variant is named Self.
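A sketch of the Self variant as we read step 1.2, with hypothetical per-head projection matrices: h single-head attentions run independently, their outputs are concatenated and mapped back to the required dimension:

```python
import numpy as np

def attention(Q, K, V):
    """Single-head scaled dot-product attention, as in step 1.1.2."""
    s = Q @ K.T / np.sqrt(K.shape[-1])
    w = np.exp(s - s.max(axis=-1, keepdims=True))
    return (w / w.sum(axis=-1, keepdims=True)) @ V

def multi_head(c, W_Q, W_K, W_V, W_O):
    """h independent heads over the concatenated vectors c; the head
    outputs are concatenated and mapped back by one linear transform W_O."""
    heads = [attention(c @ wq, c @ wk, c @ wv)
             for wq, wk, wv in zip(W_Q, W_K, W_V)]
    return np.concatenate(heads, axis=-1) @ W_O

rng = np.random.default_rng(1)
n, d, d_h, h = 4, 16, 8, 2
c = rng.normal(size=(n, d))                              # input vectors
W_Q, W_K, W_V = ([rng.normal(size=(d, d_h)) for _ in range(h)] for _ in range(3))
W_O = rng.normal(size=(h * d_h, d))                      # back to model size
print(multi_head(c, W_Q, W_K, W_V, W_O).shape)           # (4, 16)
```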
Step 1.3: the main framework of the transform model is mainly divided into an encoder module and a decoder module as shown in FIG. 3, wherein the input of a first sub-module of the encoder module is set as an embedded vector p of a target object to be predicted
iThe input of each remaining submodule is the output of the previous submodule, each encoder submodule is composed of two layers, the first layer is a self-Attention model layer, the second layer is a feedback layer, after the extension operation, the encoder and decoder both contain a fully connected forward network, including two linear transformations and a Relu activation output, and the formula is as follows:
FFN(x)=max(0,xW
1+b
1)W
2+b
2
The input of the first sub-module of the decoder module is the set of embedding vectors $q_j$ of the user's historically interacted items, which greatly enhances the interpretability of the model; the input of each remaining sub-module is the output of the previous sub-module. Each decoder sub-module consists of three layers: the first and second are self-attention layers, except that the input Q of the second layer is the output of the previous layer while its K and V are the outputs of the encoder; the third layer is a feed-forward layer. An "Add & Normalize" layer is added after each layer to prevent gradient vanishing or explosion while preventing overfitting. The output of the model is converted to the required size by a fully connected layer and a softmax function to obtain the attention weight $a_{ij}$ for the subsequent computation; this model is defined as Trans;
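One encoder sub-module of step 1.3 might be sketched as follows in TensorFlow (the framework this embodiment is implemented with, see below); the layer sizes and class name are illustrative assumptions, not the patent's implementation:

```python
import tensorflow as tf

class EncoderBlock(tf.keras.layers.Layer):
    """One encoder sub-module as we read step 1.3: self-attention, then
    FFN(x) = max(0, x W1 + b1) W2 + b2, each followed by Add & Normalize."""
    def __init__(self, d_model=16, num_heads=2, d_ff=64):
        super().__init__()
        self.attn = tf.keras.layers.MultiHeadAttention(
            num_heads=num_heads, key_dim=d_model // num_heads)
        self.ffn = tf.keras.Sequential([
            tf.keras.layers.Dense(d_ff, activation="relu"),  # max(0, x W1 + b1)
            tf.keras.layers.Dense(d_model),                  # ... W2 + b2
        ])
        self.norm1 = tf.keras.layers.LayerNormalization()
        self.norm2 = tf.keras.layers.LayerNormalization()

    def call(self, x):
        x = self.norm1(x + self.attn(x, x))   # self-attention + Add & Normalize
        return self.norm2(x + self.ffn(x))    # feed-forward + Add & Normalize

# Illustrative input: a batch of one sequence of 5 item embeddings, d_model = 16.
p = tf.random.normal((1, 5, 16))
print(EncoderBlock()(p).shape)                # (1, 5, 16)
```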
step 1.4: customizing the objective function: observed user-item interactions are treated as positive examples, negative examples are extracted from the remaining unobserved interactions, and $R^+$ and $R^-$ represent the positive and negative example sets. Using the cross-entropy (log) loss as the objective and penalizing the embedding vectors and the coefficient and bias terms of each network with the L2 norm, the objective function is:

$$L = -\frac{1}{N}\left[\sum_{(u,i)\in R^+} \log \sigma(\hat{y}_{ui}) + \sum_{(u,j)\in R^-} \log\bigl(1-\sigma(\hat{y}_{uj})\bigr)\right] + \lambda \lVert \Theta \rVert^2$$

where N represents the total number of training examples, σ is the sigmoid function converting a predicted value into the probability that user u will interact with item i, the hyperparameter λ controls the strength of the L2 regularization used to prevent overfitting, and $\Theta = \{\{p_i\}, \{q_j\}, W, b, h\}$ represents all trainable parameters; W, b, h and all parameters of the linear transformations carry the regularization penalty. Adagrad, a variant of the stochastic gradient descent algorithm, optimizes the objective function by applying an adaptive learning rate to each parameter, extracting random samples from all training examples, and updating the relevant parameters in the negative direction of the gradient.
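A sketch of the step 1.4 objective in TensorFlow, assuming the predicted scores for the positive and sampled negative examples have already been computed; the function and symbol names are ours:

```python
import tensorflow as tf

def loss_fn(y_hat_pos, y_hat_neg, params, lam=1e-5):
    """Log loss over R+ and the sampled R-, plus lam * ||Theta||^2.
    y_hat_pos / y_hat_neg are predicted scores for positive / negative
    examples; params is the list of trainable tensors to regularize."""
    pos = tf.math.log(tf.sigmoid(y_hat_pos))         # log sigma(y_ui), (u,i) in R+
    neg = tf.math.log(1.0 - tf.sigmoid(y_hat_neg))   # log(1 - sigma(y_uj)), (u,j) in R-
    n = tf.cast(tf.size(y_hat_pos) + tf.size(y_hat_neg), tf.float32)
    l2 = tf.add_n([tf.reduce_sum(tf.square(p)) for p in params])
    return -(tf.reduce_sum(pos) + tf.reduce_sum(neg)) / n + lam * l2

# Adagrad applies a per-parameter adaptive learning rate, as described above.
optimizer = tf.keras.optimizers.Adagrad(learning_rate=0.01)

# Toy check with dummy scores and one dummy parameter tensor.
print(loss_fn(tf.constant([2.0, 1.5]), tf.constant([-1.0]),
              [tf.ones((2, 2))]).numpy())
```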
The present embodiment implements all models using TensorFlow, which requires that all training instances in a batch have the same length. Since some active users may have interacted with thousands of items, a sampled mini-batch training set can still be very large. To solve this problem, this embodiment uses a mini-batch method that randomly picks one user and then uses all of that user's interacted items as a small batch, instead of randomly drawing a fixed number of training examples as a small training set. This approach has two advantages: 1) no masking trick is needed, so it is faster; 2) no batch size has to be specified, which avoids tuning the batch size. If the attention network and the item embedding vectors are trained simultaneously, the output of the attention network changes the item embeddings, so joint training easily causes a co-adaptation effect and slows convergence. To solve this practical problem in model training, the present embodiment pre-trains the model with the FISM algorithm proposed by Kabbur et al. and initializes it with the item embedding vectors learned by FISM. Since the FISM algorithm has no co-adaptation problem, it can learn the similarity of the embedded items well; initializing the model with FISM therefore greatly assists the learning of the attention network, yielding better performance and fast convergence.
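The user-as-batch sampling described above might look like the following sketch; the dict-of-lists data layout is a hypothetical choice of ours:

```python
import random

def user_minibatches(user_items, epochs=1):
    """Pick users in random order and yield each user's full set of
    interacted items as one batch, so no padding/masking is needed and
    the batch size never has to be fixed in advance."""
    users = list(user_items)
    for _ in range(epochs):
        random.shuffle(users)
        for u in users:
            yield u, user_items[u]          # all items u interacted with

# Toy interaction lists keyed by user id.
interactions = {0: [3, 7, 9], 1: [2], 2: [1, 4, 5, 8, 11]}
for u, batch in user_minibatches(interactions):
    print(u, batch)
```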
Step 2: and (4) performing an experiment on the real article data set on the evaluation index, judging the performance according to the recommendation result, and comparing the experiment result with other recommendation methods.
The embodiment gives a weight to each item in the user's interaction history, so that when predicting the user's score for a target item the historical item set represents the user's preference more accurately, improving the recommendation quality and making personalized recommendation more precise; these improvements are attributed to an effective attention mechanism introduced to distinguish the importance of historical items in the user representation. We performed comprehensive Top-K evaluation experiments on two real item data sets, ML-1M and Pinterest-20, with the evaluation indices HR and NDCG. Performance is measured by the Hit Ratio (HR) and Normalized Discounted Cumulative Gain (NDCG) of the first 10 positions of the recommendation list; these two indicators are widely used for evaluating Top-K recommendation and in the information retrieval literature. HR@10 can be interpreted as a recall-based metric, the percentage of users served successfully (i.e., the positive item appears in the top 10), while NDCG@10 additionally accounts for the predicted position of the positive item; for both metrics, larger values are better.
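For reference, HR@10 and NDCG@10 for a single test user with one positive item can be computed as below; this is the standard leave-one-out formulation, which the patent describes in prose rather than formulas:

```python
import math

def hr_at_k(rank: int, k: int = 10) -> float:
    """Hit Ratio: 1 if the positive test item appears in the top-k list."""
    return 1.0 if rank < k else 0.0

def ndcg_at_k(rank: int, k: int = 10) -> float:
    """NDCG credits a hit more the closer it is to the top of the list:
    log2-discounted gain at the predicted position of the positive item."""
    return 1.0 / math.log2(rank + 2) if rank < k else 0.0

# Example: positive item ranked 4th (0-based rank 3) in the top-10 list.
print(hr_at_k(3), round(ndcg_at_k(3), 4))   # 1.0 0.4307
```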
We compare the experimental results with several popular recommendation methods. For the embedding-based methods (MF, MLP, FISM, and the models herein) the embedding size controls the modeling capability, so we set it to 16 for all methods. The results are shown in Table 1. The attention-based models all achieve better results, and their final results are similar: they obtain the highest NDCG and HR scores on all data sets. On the ML-1M data set, all three models improve on FISM to some extent; the Self model improves on FISM by 3.1% in HR and 4.3% in NDCG, possibly because a model of relatively simple structure captures user features more fully on a less sparse data set and characterizes user preference well. On Pinterest-20, Trans outperforms the other two, reaching the highest score and improving on FISM by 3.2% in NDCG, probably because a deeper network captures sparse data better. The learning-based collaborative filtering methods generally perform much better than heuristic ones such as Pop and ItemKNN; in particular, FISM scores much higher than ItemKNN. Since the two methods use the same predictive model and differ only in how item similarity is estimated, the positive impact of customized optimization on the recommendations is clearly visible.
Table 1 Comparison of experimental results
As shown in FIG. 4, with an item embedding size of 16, the per-epoch performance of FISM and of the Dot, Self and Trans models proposed in this application is plotted. The three models reach the highest HR and NDCG scores on both data sets, attain the same performance level, and achieve a significant improvement over FISM. We believe these advantages are due to the efficient design of the attention network when learning item-to-item interactions. Even at the first epoch the performance of our models significantly exceeds FISM, and as training proceeds the experimental results keep improving until convergence.
Based on the above discussion, this research on item-based collaborative filtering algorithms connects various attention models to improve the learning of the dynamic weight coefficients, implements and tests them, and achieves better results than the compared models. The main contributions are: (1) demonstrating that the attention mechanism helps capture the dynamic weight of a new item's contribution to the similarity computation against the set of historical items the user has interacted with, making personalized recommendation more accurate; (2) computing attention scores with both pointwise attention and self-attention, with good results; (3) combining the Transformer model with the recommendation algorithm and comparing it with conventional embedding models, showing an improvement in recommendation quality.
Finally, it should be noted that the above examples are only intended to illustrate the technical solution of the present invention, not to limit it. Although the present invention has been described in detail with reference to the foregoing embodiments, those skilled in the art will understand that the technical solutions described in the foregoing embodiments may still be modified, or some or all of the technical features may be equivalently replaced; such modifications and substitutions do not depart from the spirit and scope of the corresponding technical solutions as defined in the appended claims.
Claims (2)
1. An item recommendation method based on collaborative filtering, characterized in that the method comprises the following steps:
step 1: calculating the prediction score of user u for a target item i: using one-hot coding, obtaining through the embedding layer the embedding vector p of the predicted target item i and the embedding vectors q of the historical interacted items, where p marks an item as a predicted item and q marks it as a historically interacted item, and obtaining the evaluation score of the item; the attention-based ItemCF model is defined as:

$$a_{ij} = f(p_i, q_j), \qquad \hat{y}_{ui} = \left(\left|R_u^+ \setminus \{i\}\right|\right)^{-\alpha} \sum_{j \in R_u^+ \setminus \{i\}} a_{ij}\, p_i^{\top} q_j$$

wherein i is the predicted target item, j is a historically interacted item of the user, $a_{ij}$ is the weight, calculated with an attention network, of the historical interacted item in the representation of the user's preference, $p_i$ and $q_j$ respectively represent the embedding vector of the predicted item set and the embedding vector of the user-interacted item, $R_u^+$ represents the positive example set of user u, $R_u^+ \setminus \{i\}$ denotes the positive example set with item i removed, and $\alpha$ is a coefficient;
step 2: conducting experiments on real item data sets with the chosen evaluation indices, judging performance from the recommendation results, and comparing the experimental results with other recommendation methods.
2. The collaborative filtering-based item recommendation method according to claim 1, characterized in that step 1 specifically comprises:
step 1.1: concatenating the embedding vector $p_i$ of the predicted item set with the embedding vector $q_j$ of the user-interacted item set to obtain the concatenated vector $c = [p_i; q_j]$, and taking the concatenated vector as the input of a pointwise attention model; this first attempt at the attention model is named Dot;
step 1.1.1: independently performing three linear transformations on the concatenated vector c, with coefficient matrices $W_Q$, $W_K$ and $W_V$ respectively, thereby obtaining the inputs Query, Key and Value (Q, K, V) of the attention network;
step 1.1.2: implementing the dot product of Q with the transpose of K using a highly optimized matrix multiplication, and after softmax multiplying by V to obtain a weight matrix; the Attention function Attention(Q, K, V) is calculated as:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V$$

wherein $d_k$ denotes the dimension of K and the softmax function converts the values into a probability distribution; if the dimensions of Q, K and V are the same, the dimension of the output attention weight matrix is the same as theirs;
step 1.2: feeding the concatenated vector c into the network as input, repeating the preceding single-head dot-product attention h times, concatenating the h result matrices, and finally converting them to the required dimension through a linear transformation, i.e., setting the attention function as a self-attention model to calculate the weight of the contribution of historical item j to user u's score for the target item i; this variant is named Self;
step 1.3: utilizing the main framework of the Transformer model, divided into an encoder module and a decoder module, and setting the input of the first sub-module of the encoder module to the embedding vector $p_i$ of the target item to be predicted; the input of each remaining sub-module is the output of the previous sub-module; each encoder sub-module consists of two layers, the first being a self-attention layer and the second a feed-forward layer; after the attention operation, the encoder and decoder both contain a fully connected feed-forward network comprising two linear transformations and a ReLU activation:

$$\mathrm{FFN}(x) = \max(0,\, xW_1 + b_1)\,W_2 + b_2$$
The input of the first sub-module of the decoder module is the set of embedding vectors $q_j$ of the user's historically interacted items, and the input of each remaining sub-module is the output of the previous sub-module; each decoder sub-module consists of three layers: the first and second are self-attention layers, except that the input Q of the second layer is the output of the previous layer while its K and V are the outputs of the encoder; the third layer is a feed-forward layer. An "Add & Normalize" layer is added after each layer to prevent gradient vanishing or explosion while preventing overfitting; the output of the model is converted to the required size by a fully connected layer and a softmax function to obtain the attention weight $a_{ij}$ for the subsequent computation, and this model is defined as Trans;
step 1.4: customizing the objective function: observed user-item interactions are treated as positive examples, negative examples are extracted from the remaining unobserved interactions, and $R^+$ and $R^-$ represent the positive and negative example sets; using the log loss as the loss term and penalizing the embedding vectors and the coefficient and bias terms of each network with the L2 norm, the loss function is:

$$L = -\frac{1}{N}\left[\sum_{(u,i)\in R^+} \log \sigma(\hat{y}_{ui}) + \sum_{(u,j)\in R^-} \log\bigl(1-\sigma(\hat{y}_{uj})\bigr)\right] + \lambda \lVert \Theta \rVert^2$$

wherein N represents the total number of training examples, σ is the sigmoid function converting predicted values into probability values, the hyperparameter λ controls the strength of the L2 regularization used to prevent overfitting, and $\Theta = \{\{p_i\}, \{q_j\}, W, b, h\}$ represents all trainable parameters, with W, b, h and all parameters of the linear transformations carrying the regularization penalty; Adagrad, a variant of the stochastic gradient descent algorithm, optimizes the objective function by applying an adaptive learning rate to each parameter, extracting random samples from all training examples, and updating the relevant parameters in the negative direction of the gradient; a mini-batch method randomly picks a user and then uses all of that user's interacted items as a small batch.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201911022328.2A CN110781409B (en) | 2019-10-25 | 2019-10-25 | Article recommendation method based on collaborative filtering |
Publications (2)
Publication Number | Publication Date |
---|---|
CN110781409A true CN110781409A (en) | 2020-02-11 |
CN110781409B CN110781409B (en) | 2022-02-01 |
Family
ID=69388037
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201911022328.2A Active CN110781409B (en) | 2019-10-25 | 2019-10-25 | Article recommendation method based on collaborative filtering |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110781409B (en) |
- 2019-10-25: CN application CN201911022328.2A granted as patent CN110781409B (status: active)
Patent Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2018212710A1 (en) * | 2017-05-19 | 2018-11-22 | National University Of Singapore | Predictive analysis methods and systems |
CN109087130A (en) * | 2018-07-17 | 2018-12-25 | 深圳先进技术研究院 | A kind of recommender system and recommended method based on attention mechanism |
CN109299396A (en) * | 2018-11-28 | 2019-02-01 | 东北师范大学 | Merge the convolutional neural networks collaborative filtering recommending method and system of attention model |
Cited By (16)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111737573A (en) * | 2020-06-17 | 2020-10-02 | 北京三快在线科技有限公司 | Resource recommendation method, device, equipment and storage medium |
CN112182156A (en) * | 2020-09-28 | 2021-01-05 | 齐鲁工业大学 | Aspect-level interpretable deep network scoring prediction recommendation method based on text processing |
CN112182156B (en) * | 2020-09-28 | 2023-02-07 | 齐鲁工业大学 | Aspect-level interpretable deep network scoring prediction recommendation method based on text processing |
CN112328908A (en) * | 2020-11-11 | 2021-02-05 | 北京工业大学 | Personalized recommendation method based on collaborative filtering |
CN112529414A (en) * | 2020-12-11 | 2021-03-19 | 西安电子科技大学 | Article scoring method based on multitask neural collaborative filtering network |
CN112529414B (en) * | 2020-12-11 | 2023-08-01 | 西安电子科技大学 | Article scoring method based on multi-task neural collaborative filtering network |
CN112784153A (en) * | 2020-12-31 | 2021-05-11 | 山西大学 | Tourist attraction recommendation method integrating attribute feature attention and heterogeneous type information |
CN112784153B (en) * | 2020-12-31 | 2022-09-20 | 山西大学 | Tourist attraction recommendation method integrating attribute feature attention and heterogeneous type information |
CN113158024B (en) * | 2021-02-26 | 2022-07-15 | 中国科学技术大学 | Causal reasoning method for correcting popularity deviation of recommendation system |
CN113158024A (en) * | 2021-02-26 | 2021-07-23 | 中国科学技术大学 | Causal reasoning method for correcting popularity deviation of recommendation system |
CN112967101A (en) * | 2021-04-07 | 2021-06-15 | 重庆大学 | Collaborative filtering article recommendation method based on multi-interaction information of social users |
WO2023020185A1 (en) * | 2021-08-18 | 2023-02-23 | 华为技术有限公司 | Image classification method and related device |
CN114548864A (en) * | 2022-02-15 | 2022-05-27 | 南京邮电大学 | Goods source recommendation method based on graph attention machine mechanism reinforcement learning |
CN115309975A (en) * | 2022-06-28 | 2022-11-08 | 中银金融科技有限公司 | Product recommendation method and system based on interactive features |
CN115309975B (en) * | 2022-06-28 | 2024-06-07 | 中银金融科技有限公司 | Product recommendation method and system based on interaction characteristics |
CN118656550A (en) * | 2024-08-16 | 2024-09-17 | 数据空间研究院 | Location-aware collaborative recommendation method and device |
Also Published As
Publication number | Publication date |
---|---|
CN110781409B (en) | 2022-02-01 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110781409B (en) | Article recommendation method based on collaborative filtering | |
Wu et al. | Session-based recommendation with graph neural networks | |
CN111538912B (en) | Content recommendation method, device, equipment and readable storage medium | |
CN109299396B (en) | Convolutional neural network collaborative filtering recommendation method and system fusing attention model | |
CN111797321B (en) | Personalized knowledge recommendation method and system for different scenes | |
CN110717098B (en) | Meta-path-based context-aware user modeling method and sequence recommendation method | |
Karatzoglou et al. | Multiverse recommendation: n-dimensional tensor factorization for context-aware collaborative filtering | |
Wu et al. | Improving performance of tensor-based context-aware recommenders using bias tensor factorization with context feature auto-encoding | |
CN111737578B (en) | Recommendation method and system | |
CN112364976A (en) | User preference prediction method based on session recommendation system | |
CN111581520A (en) | Item recommendation method and system based on item importance in session | |
CN114202061A (en) | Article recommendation method, electronic device and medium based on generation of confrontation network model and deep reinforcement learning | |
Chen et al. | Generative inverse deep reinforcement learning for online recommendation | |
CN111178986B (en) | User-commodity preference prediction method and system | |
Chen et al. | Session-based recommendation: Learning multi-dimension interests via a multi-head attention graph neural network | |
CN116228368A (en) | Advertisement click rate prediction method based on deep multi-behavior network | |
CN114741590B (en) | Multi-interest recommendation method based on self-attention routing and Transformer | |
CN110851705A (en) | Project-based collaborative storage recommendation method and recommendation device thereof | |
CN117216281A (en) | Knowledge graph-based user interest diffusion recommendation method and system | |
CN114781503A (en) | Click rate estimation method based on depth feature fusion | |
Jang et al. | Attention-based multi attribute matrix factorization for enhanced recommendation performance | |
CN113763031A (en) | Commodity recommendation method and device, electronic equipment and storage medium | |
Wen et al. | Extended factorization machines for sequential recommendation | |
CN115687757A (en) | Recommendation method fusing hierarchical attention and feature interaction and application system thereof | |
CN110956528B (en) | Recommendation method and system for e-commerce platform |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||