WO2021179640A1

WO2021179640A1 - Graph model-based short video recommendation method, intelligent terminal and storage medium

Info

Publication number: WO2021179640A1
Application number: PCT/CN2020/125527
Authority: WO
Inventors: 王娜; 刘兑
Original assignee: 深圳大学
Priority date: 2020-03-10
Filing date: 2020-10-30
Publication date: 2021-09-16
Also published as: CN111382309A; CN111382309B

Abstract

A graph model-based short video recommendation method, an intelligent terminal and a storage medium. The method comprises: constructing a bipartite graph of the correspondence between users and short videos according to the users' interaction behavior towards the short videos (S10); an aggregation layer outputting a high-order representation vector of a target vertex by aggregating neighborhood information of the target vertex (S20); an integration layer integrating target node information and the neighborhood information (S30); a fusion layer fusing multiple pieces of modal information of the target vertex (S40); and an output layer calculating the similarity between a user vector and a short video vector, predicting the probability that the user interacts with the short video, and recommending short videos to the user (S50). A bipartite graph and a corresponding graph convolutional network are constructed for each modality of the short videos, and vector representations of the user vertices and short video vertices in the different modalities are learned, thereby achieving fine-grained personalized recommendation for the user.

Description

Short video recommendation method based on graph model, and intelligent terminal and storage medium

Technical field

The present invention relates to the technical field of information processing, in particular to a short video recommendation method based on a graph model, an intelligent terminal and a storage medium.

Background art

In the information age, with the ever-increasing volume of Internet information, personalized recommendation serves as a bridge between service providers and users. For companies, it makes it possible to mine and exploit useful information from massive data, discover users' interests and preferences, improve user experience, increase user stickiness, and thereby increase revenue; for users, it makes it possible to quickly find targets of interest in a platform's massive information base. Personalized recommendation has become a core component of many online content sharing services, such as picture, blog, and music recommendation. The recent rise of short video sharing platforms such as Kuaishou and Douyin has drawn further attention to short video recommendation methods. Unlike single-modal media content such as images and music, short videos contain rich multimedia information: video cover pictures, video background music, and text descriptions of the video, which constitute visual, auditory, and textual content in multiple modalities. Integrating this multi-modal information with the historical interaction behavior between users and short videos helps capture user preferences more deeply.

Traditional recommendation algorithms for short videos generally include Collaborative Filtering (CF) and Graph Convolutional Network (GCN) methods.

Collaborative filtering methods can be roughly divided into two types, both of which use the historical "user-video" interaction behavior to construct a "user-video" interaction matrix: they either recommend to the target user the items liked by similar users (user-based collaborative filtering) or recommend items similar to the target user's favorite items (item-based collaborative filtering). A model based on collaborative filtering can make full use of the user's explicit feedback (likes, follows, comments, etc.) and implicit feedback (browsing records, dwell time, etc.) to predict the interaction between users and items, but it is easily restricted by data sparsity, so the recommendation results have certain limitations. For example, when explicit feedback is insufficient and users give little feedback, it is difficult for the recommendation algorithm to learn meaningful user preference information; and relying on implicit feedback can easily make the recommendation system "short-sighted", that is, the recommendation list consists mostly of popular head items, sacrificing the personalization and diversity of recommendations. Although collaborative filtering is simple and fast, it can only use the user's interaction behavior with short videos and cannot use the rich multi-modal information of short videos.

Methods based on graph convolutional networks generally construct a "user-video" bipartite graph from the user's interaction behavior on items. In the bipartite graph, the attribute information of the target node's neighborhood set is aggregated as the node's own high-order representation, information is passed between nodes, and the representation vectors of the user nodes and video nodes are finally learned. By calculating the similarity between the user vector and the video vector, the probability of the user interacting with the short video is predicted. Compared with collaborative filtering, the graph convolutional network method converts the non-Euclidean behavior data of user interaction sequences into a bipartite graph structure and uses node neighborhood aggregation to pass the attribute information of short videos between nodes in the graph. However, currently proposed graph convolutional network methods generally combine the multi-modal attribute information of short video nodes into a whole for calculation and transmission, without considering the semantic gap between different modalities, that is, the difference in the information contained in each modality. As a result, the representation learning of users and short videos is not fine-grained enough.

Both the collaborative filtering method and the graph convolutional network method use the historical interaction behavior between users and videos (items), but in different forms: the former uses it to construct a "user-video" interaction matrix, while the latter transforms it into a "user-video" bipartite graph. The interaction matrix constructed by collaborative filtering can only use interaction behavior information (for example, it only knows that "user A clicked on video 1") and cannot use video attribute information (such as the visual, textual, auditory and other multi-modal information of the video). The graph convolutional network can be regarded as an improvement on collaborative filtering: it can use the attribute information of the video to learn the representation vectors of users and videos, but it generally feeds the multi-modal information of the video into the model as a whole and does not model the modalities separately.

The common problem of the existing collaborative filtering methods and graph-based convolutional network methods is that they do not learn the representation of users and short videos from the modal level, and cannot measure the influence of modal differences on user preferences.

Therefore, the existing technology needs to be improved and developed.

Summary of the invention

In view of the fact that the prior art does not perform user and short video representation learning from the modal level, and cannot measure the influence of modal differences on user preferences, the present invention provides a short video recommendation method based on a graph model, an intelligent terminal and storage medium.

The technical solutions adopted by the present invention to solve the technical problems are as follows:

A short video recommendation method based on a graph model, wherein the short video recommendation method based on a graph model includes:

According to the user's interactive behavior on the short video, construct a bipartite graph of the corresponding relationship between the user and the short video;

The aggregation layer outputs the high-order representation vector of the target vertex itself by aggregating the neighborhood information of the target vertex;

The integration layer integrates target node information with neighborhood information;

The fusion layer fuses multiple modal information of the target vertex;

The output layer calculates the degree of similarity between the user vector and the short video vector, predicts the probability that the user will interact with the short video, and recommends the short video for the user.

In the method for recommending a short video based on a graph model, the interactive behavior is defined as a user watching a short video in full or performing a thumbs-up operation on the watched short video.

In the method for recommending short videos based on the graph model, the constructing a bipartite graph of the corresponding relationship between the user and the short video according to the user's interactive behavior on the short video further includes:

Construct a bipartite graph of the corresponding relationship between users and short videos at the modal level.

The short video recommendation method based on the graph model, wherein the short video includes visual modal information, text modal information, and auditory modal information;

The visual modal information is represented by a 128-dimensional vector output from a video cover picture through a convolutional neural network;

The text modal information is represented by a 128-dimensional vector outputted by word segmentation and natural language processing model vectorization of the video title text;

The auditory modal information is represented by a 128-dimensional vector after the background music and speech sounds of characters are truncated and passed through a convolutional neural network.

In the short video recommendation method based on the graph model, the aggregation layer is used to aggregate the neighborhood information of the target vertex to obtain a vector representing the target neighborhood, and each aggregation operation consists of neighborhood aggregation and nonlinear processing.

In the short video recommendation method based on the graph model, the neighborhood aggregation is: performing an aggregation operation on the neighborhood of the target vertex through an aggregation function;

The nonlinear processing is: obtaining the first-order and second-order neighborhood information of the target vertex through the neighborhood aggregation operation, splicing the original information of the target vertex with its neighborhood information, and inputting the result into a single-layer neural network to obtain the high-order features of the target vertex.

In the short video recommendation method based on the graph model, the construction mode of the aggregation function includes: average aggregation, maximum pooling aggregation, and attention mechanism aggregation.

In the short video recommendation method based on the graph model, the integration layer is used to integrate input information from different sources in the same modality, and to integrate the low-level information and high-level information of the target vertex in a specific modality, so as to obtain the representation vectors of user vertices and short video vertices in different modalities;

The fusion layer is used to merge multiple modal representation vectors of the user vertex and the short video vertex.

An intelligent terminal, wherein the intelligent terminal includes the above-mentioned graph model-based short video recommendation system, and further includes: a memory, a processor, and a graph model-based short video recommendation program stored in the memory and executable on the processor, wherein the steps of the above-mentioned graph model-based short video recommendation method are implemented when the graph model-based short video recommendation program is executed by the processor.

A storage medium, wherein the storage medium stores a graph model-based short video recommendation program, and when the graph model-based short video recommendation program is executed by a processor, the steps of the graph model-based short video recommendation method described above are implemented.

The present invention constructs a bipartite graph of the corresponding relationship between the user and the short video according to the user's interactive behavior on the short video; the aggregation layer outputs the high-order representation vector of the target vertex itself by aggregating the neighborhood information of the target vertex; the integration layer integrates the target node information with the neighborhood information; the fusion layer fuses multiple modal information of the target vertex; the output layer calculates the similarity between the user vector and the short video vector, predicts the probability of the user interacting with the short video, and recommends short videos to the user. The present invention constructs a bipartite graph and a corresponding graph convolutional network for each modality of short videos, learns vector representations of user vertices and short video vertices in the different modalities, and achieves the purpose of fine-grained personalized recommendation for users.

Description of the drawings

Fig. 1 is a flowchart of a preferred embodiment of a short video recommendation method based on a graph model of the present invention;

Fig. 2 is a schematic diagram of the overall framework principle of the preferred embodiment of the short video recommendation method based on the graph model of the present invention;

Fig. 3 is a schematic diagram of a bipartite graph model in the preferred embodiment of the short video recommendation method based on the graph model of the present invention;

Fig. 4 is a schematic diagram of the construction of a "user-short video" interaction bipartite graph based on user interaction behavior in the preferred embodiment of the short video recommendation method based on the graph model of the present invention;

Fig. 5 is a schematic diagram of the modal-level "user-short video" bipartite graph in the preferred embodiment of the short video recommendation method based on the graph model of the present invention;

Fig. 6 is a schematic diagram of the aggregation layer in the preferred embodiment of the short video recommendation method based on the graph model of the present invention;

Fig. 7 is a schematic diagram of the operating environment of the preferred embodiment of the intelligent terminal of the present invention.

Detailed description

In order to make the objectives, technical solutions, and advantages of the present invention clearer and more definite, the present invention will be further described in detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described here are only used to explain the present invention, but not used to limit the present invention. Based on the embodiments of the present invention, all other embodiments obtained by those of ordinary skill in the art without creative work shall fall within the protection scope of the present invention.

The short video recommendation method based on a graph model according to a preferred embodiment of the present invention, as shown in Fig. 1, includes the following steps:

Step S10: Construct a bipartite graph of the corresponding relationship between the user and the short video according to the user's interactive behavior on the short video;

Step S20: The aggregation layer outputs the high-level representation vector of the target vertex itself by aggregating the neighborhood information of the target vertex;

Step S30: The integration layer integrates the target node information with the neighborhood information;

Step S40: The fusion layer fuses multiple modal information of the target vertex;

Step S50: The output layer calculates the degree of similarity between the user vector and the short video vector, predicts the probability that the user will interact with the short video, and recommends the short video for the user.

As shown in Figure 2, the framework of the short video recommendation method based on the graph model of the present invention is composed of a bipartite graph (user-short video), an aggregation layer, an integration layer, a fusion layer and an output layer.

The bipartite graph is a special model in graph theory. As shown in Figure 3, suppose the graph G = (V, E) is composed of a vertex set V and an edge set E. If the vertex set V can be divided into two mutually disjoint subsets {A, B}, and the two vertices i and j connected by any edge $e_{ij}$ in the graph belong to these two different subsets (i ∈ A, j ∈ B), then the graph G is a bipartite graph, and vertices i and j are first-order neighbors of each other.

A "user-short video" bipartite graph is constructed from the user's historical interaction behaviors, which reflect the user's interests and preferences. In the "user-short video" bipartite graph, the vertices are divided into two subsets: the user vertex set and the short video vertex set. If a user has interacted with a short video (for example, by watching the video completely or liking it), then an edge directly connects the user vertex and the short video vertex in the "user-short video" bipartite graph. The set of short video vertices in the user's interaction history is the first-order neighborhood set of the user vertex, and each short video vertex contains the attribute information of the short video. In order to measure the degree to which the attribute information of different modalities of short videos (such as video cover pictures, titles, and background music) affects user preferences, the present invention constructs a corresponding "user-short video" bipartite graph for each modality of short videos (such as visual, text, and auditory). The bipartite graphs of the different modalities have the same topological structure, and their vertices contain the attribute information of the corresponding modality.

The neighborhood of a vertex is the set of its neighbor vertices, that is, the set of all vertices directly connected to it. The first-order neighborhood refers to the set of first-order neighbor vertices. Because pooling aggregation performs a calculation over every neighbor vertex in a given neighborhood, it can measure the degree of influence of different neighbors on the target vertex.

Following the "aggregation/integration/readout" structure of the graph convolutional network, the aggregation layer designed in the present invention aggregates the neighborhood information of the target vertex and outputs the high-order representation vector of the target vertex itself; the integration layer integrates the target node information with the neighborhood information; the fusion layer fuses the multiple modal information of the target vertex, so that the learned user and short video vector representations contain information at different aggregation levels and reflect the differences in the information contained in the different modalities of short videos; and the output layer calculates the degree of similarity between the user vector and the short video vector, predicts the probability that the user will interact with the short video, and generates recommendations for the user.

Specifically, a "user-short video" bipartite graph is constructed based on the user's interaction behavior on short videos. The interaction behavior is defined as the user having fully watched a short video or liked a watched short video, and the sequence of short videos the user has interacted with takes the form user 1 [video 1, video 2, …, video n]. As shown in Figure 4, users and short videos correspond to the vertices of the graph, and an edge directly connects a user vertex with each short video vertex the user has interacted with, thereby constructing the "user-short video" interaction bipartite graph.
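The construction described above can be sketched in a few lines of Python. This is a minimal illustration rather than the implementation of the invention: the function names, the sample interaction log, and the reading of the second-order neighborhood as "vertices two hops away" are assumptions made for the example.

```python
from collections import defaultdict


def build_bipartite_graph(interactions):
    """Map each user to its videos and each video to its users (one edge per pair)."""
    user_to_videos = defaultdict(set)
    video_to_users = defaultdict(set)
    for user, video in interactions:
        user_to_videos[user].add(video)
        video_to_users[video].add(user)
    return user_to_videos, video_to_users


def second_order_neighbors(user, user_to_videos, video_to_users):
    """Assumed definition: vertices two hops away, i.e. users sharing a video."""
    reached = set()
    for video in user_to_videos.get(user, set()):
        reached |= video_to_users[video]
    reached.discard(user)
    return reached


# Hypothetical log: each pair means the user fully watched or liked the video.
interactions = [
    ("user1", "video1"),
    ("user1", "video2"),
    ("user2", "video2"),
    ("user3", "video3"),
]
user_to_videos, video_to_users = build_bipartite_graph(interactions)
first_order = user_to_videos["user1"]
second_order = second_order_neighbors("user1", user_to_videos, video_to_users)
print(sorted(first_order), sorted(second_order))
```

Storing both directions of the adjacency keeps first-order lookups for user vertices and short video vertices equally cheap.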

Next, the "user-short video" bipartite graph is constructed at the modal level. Each source or form of information can be called a modality: people receive information through sight, hearing, smell and touch, and information is transmitted in forms such as images, text and voice. A short video includes three types of modal information: visual modal information, text modal information, and auditory modal information. The information contained in each modality is represented by a vector of fixed dimension: the visual modal information is represented by the 128-dimensional vector output by a convolutional neural network from the video cover picture; the text modal information is represented by the 128-dimensional vector output by a natural language processing model after word segmentation of the video title text; and the auditory modal information is represented by the 128-dimensional vector output by a convolutional neural network after the background music and the characters' speech are truncated. As shown in Figure 5, the vertices are distinguished by modality type, where

$$\mathcal{M} = \{V, T, A\}$$

is the set of modality types, V is the visual modality, T is the text modality, and A is the auditory modality. A modal-level "user-short video" bipartite graph $G_m$ is constructed for each m ∈ {V, T, A}. The short video vertex attribute information in each bipartite graph is the corresponding modal information of the short video, and the distance between vertices in the different modal graphs represents the difference in information between the modalities of the vertices.
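As a minimal sketch of the modal-level construction, the following Python fragment builds one graph per modality that shares the same edges but carries modality-specific vertex attributes. The dictionary layout, the function name `build_modal_graphs`, and the 4-dimensional sample features (standing in for the 128-dimensional vectors) are assumptions made for illustration.

```python
def build_modal_graphs(edges, video_features):
    """Return one graph per modality: shared edges, modality-specific attributes."""
    graphs = {}
    for modality, features in video_features.items():
        graphs[modality] = {
            "edges": list(edges),          # identical topology in every modality
            "video_attr": dict(features),  # attributes differ per modality
        }
    return graphs


edges = [("user1", "video1"), ("user1", "video2"), ("user2", "video2")]

# Hypothetical 4-dimensional features stand in for the 128-dimensional vectors.
video_features = {
    "V": {"video1": [0.1, 0.2, 0.3, 0.4], "video2": [0.5, 0.6, 0.7, 0.8]},
    "T": {"video1": [0.9, 0.1, 0.0, 0.2], "video2": [0.3, 0.3, 0.1, 0.0]},
    "A": {"video1": [0.0, 0.7, 0.2, 0.5], "video2": [0.4, 0.4, 0.6, 0.1]},
}
modal_graphs = build_modal_graphs(edges, video_features)
```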

Further, as shown in Fig. 6, following the idea in recommendation systems that "the user's historical interaction behavior can reflect the user's interest preferences", the present invention adopts a two-layer GCN (Graph Convolutional Network) structure and performs a bi-level aggregation operation (first-order and second-order neighborhood aggregation) on vertices; Figure 6 shows the aggregation operation from different display angles. The role of the aggregation layer is to aggregate the neighborhood information of the target vertex to obtain a vector that characterizes the target neighborhood. Each aggregation operation is composed of neighborhood aggregation and nonlinear processing.

Neighborhood aggregation: for the k-th order neighborhood $N_k(v)$ of the target vertex v in modality m, the aggregation operation is performed by the aggregation function $f_{agg}(\cdot)$:

$$h^{l}_{m,N_k(v)} = f_{agg}\left(\left\{h^{l-1}_{m,u},\ \forall u \in N_k(v)\right\}\right)$$

where $l$ is the layer number of the GCN, vertex u is a vertex in the k-th order neighborhood $N_k(v)$ of the target vertex v, and $h^{l-1}_{m,u}$ is the representation vector of vertex u at layer $l-1$; when $l-1 = 0$, it is the original attribute feature of the vertex in the specific modality (written $x_{m,v}$ for vertex v). $h^{l}_{m,N_k(v)}$ is the aggregated information of the k-th order neighborhood of the target vertex v.

Nonlinear processing: the first-order and second-order neighborhood information of the target vertex is obtained by the neighborhood aggregation operation, and the original information of the target vertex is spliced with its neighborhood information and input into a single-layer neural network to obtain the high-order features of the target vertex:

$$h^{l}_{m,v} = \sigma\left(W^{l}_{m} \cdot \left[h^{l-1}_{m,v},\ h^{l}_{m,N_1(v)},\ h^{l}_{m,N_2(v)}\right]\right)$$

where $W^{l}_{m}$ is the neural network parameter matrix, $h^{l-1}_{m,v}$ is the representation vector of vertex v at layer $l-1$, $h^{l}_{m,N_1(v)}$ and $h^{l}_{m,N_2(v)}$ are the first-order and second-order neighborhood representation vectors of the target vertex v respectively, [·,·] is the vector splicing operation, and σ(·) = max(0, ·) is the ReLU function, which applies a nonlinear transformation to the vector. $h^{l}_{m,v}$ is the output vector of the aggregation layer for vertex v in modality m at layer $l$ of the GCN and represents the high-order representation information of vertex v in modality m.
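The nonlinear processing step (splice the vertex's own vector with its first-order and second-order neighborhood vectors, apply a single-layer network, then ReLU) can be sketched as follows. The helper names and the toy weight matrix are hypothetical and chosen only to make the arithmetic easy to follow; no bias term is used because the text does not mention one for this step.

```python
def relu(vector):
    """Element-wise max(0, x)."""
    return [max(0.0, x) for x in vector]


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def aggregation_layer_output(self_vec, first_order_vec, second_order_vec, weight):
    """Splice the three vectors, apply one linear layer, then ReLU."""
    spliced = self_vec + first_order_vec + second_order_vec  # list concatenation
    return relu(matvec(weight, spliced))


self_vec = [1.0, -1.0]     # the vertex's own representation
first_order = [0.5, 0.5]   # aggregated first-order neighborhood
second_order = [2.0, 0.0]  # aggregated second-order neighborhood
weight = [                 # toy 2x6 parameter matrix
    [1.0, 0.0, 1.0, 0.0, 1.0, 0.0],
    [0.0, 1.0, 0.0, -1.0, 0.0, 1.0],
]
output = aggregation_layer_output(self_vec, first_order, second_order, weight)
```

In this toy case the second output component is negative before activation, so ReLU clips it to zero.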

Since the neighbors of a vertex in the "user-short video" bipartite graph are unordered, the constructed aggregation function $f_{agg}(\cdot)$ should be permutation invariant, that is, its output should not change with the order in which the neighbors of the input vertex are presented, while still effectively capturing the neighbor vertex information. The present invention constructs the aggregation function in the following three ways:

(1) Average aggregation: The simplest and most intuitive way to aggregate neighbor information is to take each vertex u in the k-th order neighborhood $N_k(v)$ of the target vertex v in modality m and average, element-wise, their representation vectors $h^{l-1}_{m,u}$ at layer $l-1$ of the GCN:

$$h^{l}_{m,N_k(v)} = \frac{1}{|N_k(v)|} \sum_{u \in N_k(v)} h^{l-1}_{m,u}$$

$h^{l}_{m,N_k(v)}$ is the k-th order neighborhood representation vector of vertex v in modality m, where $|N_k(v)|$ indicates the number of neighbors in the k-th order neighborhood of vertex v.
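A minimal sketch of average aggregation, together with a check of the permutation invariance required of the aggregation function, is given below. The function name `mean_aggregate` and the sample vectors are illustrative assumptions.

```python
def mean_aggregate(neighbor_vectors):
    """Element-wise average of the neighbors' representation vectors."""
    count = len(neighbor_vectors)
    return [sum(column) / count for column in zip(*neighbor_vectors)]


neighbors = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
averaged = mean_aggregate(neighbors)

# Presenting the same neighbors in a different order gives the same result.
averaged_reversed = mean_aggregate(list(reversed(neighbors)))
```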

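After introducing a self-connection in the adjacency matrix of the target vertex so that the target vertex information is retained, the aggregation function is transformed into:

$$h^{l}_{m,N_k(v)} = \mathrm{MEAN}\left(\left\{h^{l-1}_{m,v}\right\} \cup \left\{h^{l-1}_{m,u},\ \forall u \in N_k(v)\right\}\right)$$

where $h^{l-1}_{m,v}$ and $h^{l-1}_{m,u}$ are the layer $l-1$ representation vectors of the target vertex v and of its neighbor u in modality m, and $N_k(v)$ is the k-th order neighborhood of v. After the transformation, the aggregation function is equivalent to integrating the features of the target vertex into the neighborhood features. In the subsequent nonlinear processing, the neighborhood features are used directly as the input of the single-layer network, which avoids the noise introduced by splicing in the target vertex's own vector and reduces the computational complexity at the same time. The corresponding aggregation layer output is:

$$h^{l}_{m,v} = \sigma\left(W^{l}_{m} \cdot \left[h^{l}_{m,N_1(v)},\ h^{l}_{m,N_2(v)}\right]\right)$$

where $W^{l}_{m}$ is the parameter matrix of the single-layer network and σ(·) is the ReLU function.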

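(2) Maximum pooling aggregation: The pooling operation is usually used in deep neural networks to extract and compress the information passed in from a network layer. The present invention introduces the maximum pooling aggregation operation into the single-layer network structure of the GCN:

$$h^{l}_{m,N_k(v)} = \max\left(\left\{\sigma\left(W_{pool}\, h^{l-1}_{m,u} + b\right),\ \forall u \in N_k(v)\right\}\right)$$

where $W_{pool}$ is the pooling parameter matrix, b is the bias, and the maximum is taken element-wise over the vertices u in the k-th order neighborhood $N_k(v)$ of the target vertex v.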

Since deep neural networks can extract high-level features of the input information, information passed through the network is effectively encoded into features over multiple channels. In order to intuitively measure the degree of influence of different neighbors on the target vertex, the present invention performs the maximum pooling operation element-wise on the features of the target vertex's neighbor set, so that in each feature dimension the most salient neighbor vertex has the greatest influence on the target vertex in that dimension. Compared with average aggregation, maximum pooling aggregation distinguishes more effectively the contribution of different neighbors to the output in each feature dimension.
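Maximum pooling aggregation can be sketched as follows: each neighbor vector is passed through a single layer with the pooling parameter matrix and bias, and the element-wise maximum is then taken over the neighbors. Applying a ReLU inside the pooling is an assumption borrowed from common pooling aggregators, and the identity matrix used for `w_pool` is a toy value.

```python
def relu(vector):
    """Element-wise max(0, x)."""
    return [max(0.0, x) for x in vector]


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def max_pool_aggregate(neighbor_vectors, w_pool, bias):
    """Transform every neighbor, then keep the largest value per dimension."""
    transformed = [
        relu([s + b for s, b in zip(matvec(w_pool, vec), bias)])
        for vec in neighbor_vectors
    ]
    return [max(column) for column in zip(*transformed)]


neighbors = [[1.0, -2.0], [0.5, 3.0], [-1.0, 1.0]]
w_pool = [[1.0, 0.0], [0.0, 1.0]]  # toy identity matrix
bias = [0.0, 0.0]
pooled = max_pool_aggregate(neighbors, w_pool, bias)
```

Here the first dimension of the result comes from the first neighbor and the second dimension from the second neighbor, which illustrates how different neighbors dominate different feature dimensions.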

(3) Attention mechanism aggregation: In order to aggregate vertex neighborhood information more concisely and effectively, the present invention introduces node-wise attention scores between graph vertices to measure the similarity between the target vertex and its neighbor vertices. Assuming that vertex i is a neighbor of vertex v, the similarity $sim_{v,i}$ between the two is defined as:

$$sim_{v,i} = a\left(W\, h^{l-1}_{m,v},\ W\, h^{l-1}_{m,i}\right),\quad i \in N_1(v) \cup N_2(v)$$

where W is the parameter matrix in the feedforward neural network (written $W_v$ and $W_i$ when applied to vertices v and i), which is multiplied by the representation vector of the vertex to expand the feature dimension of the vertex, the function a(·,·) maps the spliced high-dimensional vector features to the real number domain, and $N_1(v)$ and $N_2(v)$ are the first-order neighborhood and the second-order neighborhood of vertex v.

The similarity $sim_{v,i}$ between vertices v and i is used as the input of the LeakyReLU activation function for a nonlinear transformation, and the result is input into the softmax function for normalization, which constrains the values to the interval [0, 1] and gives the attention score $\alpha_{v,i}$ between the vertices v and i:

$$\alpha_{v,i} = \mathrm{softmax}\left(\mathrm{LeakyReLU}(sim_{v,i})\right) = \frac{\exp\left(\mathrm{LeakyReLU}(sim_{v,i})\right)}{\sum_{j \in N_k(v)} \exp\left(\mathrm{LeakyReLU}(sim_{v,j})\right)}$$

Neighbor-by-neighbor aggregation is then performed on the target vertex v:

$$h^{l}_{m,N_k(v)} = \sum_{u \in N_k(v)} \alpha_{v,u}\, W\, h^{l-1}_{m,u}$$

where W is the same as W in the formula for calculating the similarity.

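In order to make the aggregation result more robust, the present invention introduces the multi-head attention mechanism into the aggregation operation and sets the number of attention heads to P:

$$h^{l}_{m,N_k(v)} = \frac{1}{P} \sum_{p=1}^{P} \sum_{u \in N_k(v)} \alpha^{p}_{v,u}\, W^{p}\, h^{l-1}_{m,u}$$

where $\alpha^{p}_{v,u}$ is the attention score between the target vertex v and its neighbor vertex u in its k-th order neighborhood $N_k(v)$ in the p-th attention space, $W^{p}$ is the parameter matrix of the p-th attention space, $h^{l-1}_{m,u}$ is the representation vector of vertex u at the previous layer, and $\frac{1}{P}\sum_{p=1}^{P}$ is the averaging operation over the attention heads.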
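The attention pipeline described above (similarity, LeakyReLU, softmax normalization, weighted sum over neighbors, and averaging over heads) can be sketched as follows. Several simplifications are assumptions made for this example: the projection by the parameter matrix is taken to be the identity, the mapping a(·,·) is implemented as a dot product between a hypothetical per-head `score_vector` and the spliced pair of vectors, and the LeakyReLU slope of 0.2 is an arbitrary choice.

```python
import math


def leaky_relu(x, slope=0.2):
    return x if x > 0 else slope * x


def softmax(scores):
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_aggregate(target, neighbors, score_vector):
    """One head: score each (target, neighbor) pair, normalise, weighted sum."""
    raw = [
        leaky_relu(sum(a * x for a, x in zip(score_vector, target + nb)))
        for nb in neighbors
    ]
    weights = softmax(raw)
    aggregated = [
        sum(w * nb[d] for w, nb in zip(weights, neighbors))
        for d in range(len(neighbors[0]))
    ]
    return aggregated, weights


def multi_head_aggregate(target, neighbors, score_vectors):
    """Average the per-head aggregation results."""
    heads = [attention_aggregate(target, neighbors, sv)[0] for sv in score_vectors]
    return [sum(column) / len(heads) for column in zip(*heads)]


target = [1.0, 0.0]
neighbors = [[1.0, 2.0], [3.0, 4.0]]
single, weights = attention_aggregate(target, neighbors, [0.0, 0.0, 1.0, 0.0])
uniform = multi_head_aggregate(target, neighbors, [[0.0] * 4] * 3)
```

With all-zero score vectors every neighbor receives the same attention score, so the multi-head result reduces to the plain average of the neighbors.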

Optimization of the aggregation layer: If the number of neighbors of the target vertex is not limited in the aggregation layer, the worst-case complexity depends on the number of all vertices in the "user-short video" bipartite graph and on the numbers of first-order and second-order neighbors of each vertex v. When attention aggregation is used, P neighborhood aggregations are required, so the computational complexity must be multiplied by P. In addition, since different target vertices have different numbers of neighbors, their neighborhoods cannot be input into the model directly. In order to balance computational complexity and accuracy, the present invention fixes the number of first-order neighbors and the number of second-order neighbors of the target vertex according to results obtained in practice, and sets the number of attention heads to P = 3. If the target vertex has fewer neighbors than the set value, the number is filled up by repeated sampling; if it has more neighbors than the set value, then, when the aggregation method is average or maximum pooling, the set number of neighbors is selected at random, and when the aggregation method is the attention mechanism, the neighbor vertices with larger attention scores are selected preferentially.
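The neighbor-count rule described above can be sketched as follows. The function name `fix_neighbor_count`, the sample sizes, and the interpretation of "repeated sampling" as sampling with replacement from the existing neighbors are assumptions made for illustration.

```python
import random


def fix_neighbor_count(neighbors, target_size, scores=None, rng=None):
    """Pad or truncate a neighbor list to exactly target_size entries."""
    if rng is None:
        rng = random.Random(0)  # fixed seed so the example is reproducible
    neighbors = list(neighbors)
    if not neighbors:
        raise ValueError("a vertex needs at least one neighbor to sample from")
    if len(neighbors) < target_size:
        # Too few neighbors: fill up by sampling existing ones again.
        padding = [rng.choice(neighbors) for _ in range(target_size - len(neighbors))]
        return neighbors + padding
    if len(neighbors) > target_size:
        if scores is not None:
            # Attention aggregation: prefer neighbors with larger scores.
            ranked = sorted(neighbors, key=lambda n: scores[n], reverse=True)
            return ranked[:target_size]
        # Average or max pooling aggregation: pick at random.
        return rng.sample(neighbors, target_size)
    return neighbors


padded = fix_neighbor_count(["v1", "v2"], 4)
random_subset = fix_neighbor_count(["v1", "v2", "v3", "v4", "v5"], 3)
top_scored = fix_neighbor_count(
    ["v1", "v2", "v3", "v4"], 2, scores={"v1": 0.1, "v2": 0.4, "v3": 0.3, "v4": 0.2}
)
```

Fixing the count this way gives every target vertex an input of the same size, which is what allows neighborhoods to be batched into the model.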

Further, in the aggregation layer, the information contained in a vertex itself is propagated through the GCN between neighbor vertices at two levels for high-order interaction. However, previous GCN models used for recommendation regard the attribute information of the recommended item and the structural information of the corresponding graph vertex as homogeneous information and input them into the model as a whole, ignoring the influence that the different sources of an item's information have on the representation learning process. To address this, the present invention designs an integration layer that integrates input information from different sources in the same modality:

$$H_{m,v} = f_{merge}\left(h_{m,v},\ x_{m,v},\ h_{v,id}\right)$$

where $f_{merge}(\cdot)$ is the integration function and the integration layer output $H_{m,v}$ is the representation vector of vertex v in modality m. $h_{m,v} \in \mathbb{R}^{d_m}$ (a vector in the real number domain with dimension $d_m$) is the output of vertex v from the aggregation layer in modality m and represents the high-order aggregation information of the vertex; $x_{m,v}$ is the original information contained in the vertex in modality m, which can be regarded as zeroth-order information; and $h_{v,id}$ is the embedding vector of vertex v obtained by a graph embedding method in the "user-short video" bipartite graph, which can be regarded as the representation vector of the vertex's structural information. The function of the integration layer in the model is to integrate the low-level information (the vertex's own attribute information) and the high-level information (neighborhood information) of the target vertex in a specific modality. The present invention uses two integration functions to integrate the vertex information:

(1) Hierarchical integration: The original information and the ID embedding information of the vertex are defined as the low-order information of the vertex. The two are concatenated and passed through a single-layer feedforward neural network, and the resulting vector is defined as the low-order representation containing the vertex's structure and content information:

$$h_{m,v,low} = \mathrm{LeakyReLU}(W_{merge}[x_{m,v}, h_{v,id}] + b);$$

where $W_{merge}$ is the parameter matrix of the single-layer neural network of the integration layer and $b$ is the bias. The low-order representation $h_{m,v,low}$ of the vertex and the high-order information $h_{m,v}$ of the vertex are then concatenated as the output of the integration layer:

$$H_{m,v} = [h_{m,v,low}, h_{m,v}].$$
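The hierarchical integration above can be sketched in plain Python as follows. This is an illustrative sketch only: the helper names, the LeakyReLU negative slope of 0.01, and the list-based vector representation are assumptions, not values taken from the description.

```python
def leaky_relu(vec, slope=0.01):
    """Element-wise LeakyReLU; the slope value is an assumption."""
    return [x if x > 0 else slope * x for x in vec]

def affine(W, x, b):
    """Compute W x + b for a row-major matrix W given as a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)]

def hierarchical_integration(x_mv, h_vid, h_mv, W_merge, b):
    """Integrate low-order and high-order information of one vertex.

    x_mv:  original (zeroth-order) information of the vertex in modality m.
    h_vid: ID embedding of the vertex from the graph embedding method.
    h_mv:  aggregation layer output (high-order information).
    """
    low_input = list(x_mv) + list(h_vid)        # concatenation [x, h_id]
    h_low = leaky_relu(affine(W_merge, low_input, b))
    return h_low + list(h_mv)                   # output [h_low, h_mv]

# Toy usage with a 2x3 parameter matrix (values chosen only for illustration).
H = hierarchical_integration(
    x_mv=[1.0, -2.0], h_vid=[0.5], h_mv=[3.0, 4.0],
    W_merge=[[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]], b=[0.0, 0.0],
)
```

The output dimension is the number of rows of the parameter matrix plus the dimension of the aggregation layer output, since the two parts are concatenated rather than summed.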

(2) Outer product integration: The present invention divides the information of a vertex in a specific modality into two types, content information and structural information, crosses the vectors of the two types through an outer product, and finally outputs the result through a single-layer feedforward neural network. The formula itself is not reproduced in this text; its terms are, in order: the content information vector, the structural information vector, the parameter matrix learned by the integration layer, and the bias.
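Because the outer product formula is not legible in this text, the following is only a hedged sketch of one way such an integration could look: the outer product of a content vector and a structure vector is flattened and passed through a single feedforward layer. The row-major flattening, the LeakyReLU activation, its slope, and all names are assumptions.

```python
def outer_product_integration(content, structure, W, b, slope=0.01):
    """Cross a content vector with a structure vector via an outer product.

    content:   content information vector of the vertex.
    structure: structural information vector of the vertex.
    W, b:      parameters of a single feedforward layer; W has one row per
               output dimension and len(content) * len(structure) columns.
    """
    # Outer product: a len(content) x len(structure) matrix of pairwise products.
    outer = [[c * s for s in structure] for c in content]
    # Flatten row by row so the layer can treat the matrix as one vector
    # (the flattening order is an assumption).
    flat = [value for row in outer for value in row]
    pre = [sum(w * f for w, f in zip(row, flat)) + bi for row, bi in zip(W, b)]
    # LeakyReLU is assumed here by analogy with the hierarchical integration.
    return [p if p > 0 else slope * p for p in pre]

# Toy usage: a 2-dim content vector crossed with a 2-dim structure vector.
H = outer_product_integration(
    content=[1.0, 2.0], structure=[3.0, 4.0],
    W=[[1.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, -1.0]], b=[0.0, 0.0],
)
```

Unlike concatenation, the outer product exposes every pairwise product of a content feature and a structure feature to the feedforward layer, at the cost of a parameter matrix whose width grows with the product of the two dimensions.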

Further, after the integration layer has integrated the data from different sources of each vertex in a specific modality to obtain the representation vectors of the user vertices and the short video vertices in the different modalities, the fusion layer fuses the multiple modal representation vectors of each vertex (user vertices and short video vertices):

$$z_u = [H_{V,u}, H_{T,u}, H_{A,u}], \qquad z_i = [H_{V,i}, H_{T,i}, H_{A,i}],$$

where $u$ ranges over the set of user vertices and $i$ over the set of short video vertices in the "user-short video" bipartite graph. For the user vertex $u$, its fusion layer output $z_u$ is the concatenation of the integration layer output vectors $H_{V,u}$, $H_{T,u}$, and $H_{A,u}$ in the visual (V), text (T), and auditory (A) modalities; similarly, for the short video vertex $i$, its fusion layer output $z_i$ is the concatenation of the integration layer output vectors $H_{V,i}$, $H_{T,i}$, and $H_{A,i}$ in the three modalities.
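The fusion step described above amounts to concatenating the three per-modality vectors of a vertex in a fixed order. A minimal sketch, with a hypothetical function name and toy values:

```python
def fuse_modalities(H_visual, H_text, H_audio):
    """Concatenate the integration layer outputs of one vertex.

    The order visual, text, auditory follows the description; the same
    order must be used for user vertices and short video vertices so that
    matching positions of the fused vectors refer to the same modality.
    """
    return list(H_visual) + list(H_text) + list(H_audio)

# Toy usage: three 2-dimensional modality vectors give a 6-dimensional z_u.
z_u = fuse_modalities([0.1, 0.2], [0.3, 0.4], [0.5, 0.6])
```

Keeping the modality order identical for users and short videos matters later, because the output layer compares the fused vectors position by position through an inner product.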

In order to model user vectors in finer detail, so that the representations of similar vertices in the "user-short video" bipartite graph are more alike and the representations of distant vertices are more distinguishable, the fusion layer of the present invention uses a negative sampling method for unsupervised optimization. A positive sample is defined as a short video vertex $i_p$ that is directly connected to the user vertex $u$ in the "user-short video" bipartite graph; a negative sample is defined as a short video vertex $i_n$ that has a high degree in the bipartite graph and has no edge directly connecting it to the target user vertex. The reasoning is that a high degree means the short video has been interacted with many times and can be regarded as a popular item, and a user having no behavior toward a popular item is generally taken to mean that the user is not interested in it. Based on experiments, and in order to keep the positive and negative samples balanced, the number of positive samples and the number of negative samples are each set to Q=20, a ratio of 1:1; the negative samples are randomly selected from the short video vertices ranked in the top 15% by degree, and a loss function is designed for optimization.

The loss function itself is not reproduced in this text; its terms are, in order: the sigmoid function; the "user-short video" pair formed by the user $u$ and a short video vertex $i_p$ with which the user has interacted; and the indication that the short video vertex $i_n$ has no interaction with the user vertex $u$ and is selected as a negative sample.
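The negative sampling rule described above can be sketched as follows. This is a hedged illustration: the function name, the dictionary-based graph representation, the tie-breaking by video id, and drawing with replacement when the candidate pool is smaller than Q are assumptions. The loss function is deliberately not sketched because its formula is not legible in this text.

```python
import random

def sample_negatives(user, interactions, q=20, top_percent=15, rng=None):
    """Pick q popular short videos that the user has not interacted with.

    interactions: dict mapping each user id to the set of short video ids
                  the user interacted with (the edges of the bipartite graph).
    top_percent:  share of short videos, ranked by degree, used as the pool.
    """
    rng = rng or random.Random()
    # Degree of a short video vertex = number of users connected to it.
    degree = {}
    for videos in interactions.values():
        for video in videos:
            degree[video] = degree.get(video, 0) + 1
    # Rank by descending degree; ties are broken by id for determinism.
    ranked = sorted(degree, key=lambda v: (-degree[v], v))
    n_top = max(1, len(ranked) * top_percent // 100)
    seen = interactions.get(user, set())
    candidates = [v for v in ranked[:n_top] if v not in seen]
    if not candidates:
        return []
    if len(candidates) >= q:
        return rng.sample(candidates, q)
    # Pool smaller than q: draw with replacement (an assumption).
    return rng.choices(candidates, k=q)
```

Integer arithmetic is used for the 15% cutoff so that the pool size does not depend on floating-point rounding.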

Further, the inner product of the optimized user vector $z_u$ and the short video vector $z_i$ to be scored is computed, and the probability p(Interact) that the user interacts with the short video is output. The formula is not reproduced in this text; the condition attached to it states that the short video $i$ has not yet been interacted with by the user $u$.
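A hedged sketch of the output layer follows. The description states only that an inner product of the two vectors yields the interaction probability; squashing the inner product with a sigmoid, the top-k ranking step, and all names below are assumptions made for illustration.

```python
import math

def interaction_probability(z_u, z_i):
    """Map the inner product of two fused vectors to a value in (0, 1)."""
    score = sum(a * b for a, b in zip(z_u, z_i))
    # Numerically stable sigmoid (the use of a sigmoid is an assumption).
    if score >= 0:
        return 1.0 / (1.0 + math.exp(-score))
    e = math.exp(score)
    return e / (1.0 + e)

def recommend(z_u, video_vectors, seen, k=10):
    """Rank short videos the user has not interacted with, best first.

    video_vectors: dict mapping short video id to its fused vector z_i.
    seen:          set of short video ids the user already interacted with.
    """
    scored = [
        (interaction_probability(z_u, z_i), video)
        for video, z_i in video_vectors.items()
        if video not in seen
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [video for _, video in scored[:k]]

# Toy usage: video "c" is excluded because the user already interacted with it.
top = recommend([1.0, 0.0],
                {"a": [2.0, 0.0], "b": [-1.0, 0.0], "c": [5.0, 0.0]},
                seen={"c"}, k=2)
```

Filtering out already-seen short videos mirrors the condition in the description that the scored short video has not yet been interacted with by the user.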

Technical effect:

(1) Representation learning of the vertices is carried out by constructing a "user-short video" bipartite graph at the modality level. Because of the "semantic gap" between modalities in multimodal data, existing graph convolutional networks applied to recommendation have difficulty distinguishing the differences in the information contained in different modalities and modeling the modalities separately. The present invention constructs a bipartite graph and a corresponding graph convolutional network for each modality of the short videos and learns vector representations of the user and short video vertices in the different modalities, thereby achieving fine-grained personalized recommendation for users.

(2) A two-level aggregation operation is performed on the vertices (user vertices and short video vertices) in the aggregation layer to quantify the influence of vertex neighbors and to model the high-order representation of the vertices. As the number of GCN layers increases, the efficiency with which information from high-order neighbors is transmitted gradually decreases, and gradient vanishing tends to occur during transmission, which makes the information of high-order neighbor vertices difficult to use in the representation learning of the target vertex. Inspired by the use of skip connections in convolutional neural networks to add information transmission paths and suppress gradient vanishing, the present invention performs a second-level aggregation operation between the target vertex and its second-order neighbors in the graph. This strengthens the role of the second-order neighbor information in the representation learning of the target vertex and maintains the integrity of high-order neighbor information transmission.

(3) The idea of a multi-head attention mechanism is introduced in the aggregation layer to construct the aggregation function. Compared with the mean aggregation and max-pooling aggregation methods commonly used in existing graph convolutional networks, the attention-based method of the present invention uses the attention scores between vertices as the metric for the aggregation process. By taking the correlation constraints between vertex features into account, it filters out irrelevant neighbor information and strengthens the influence of related neighbors on the target vertex. Introducing multi-head attention is equivalent to ensemble learning over multiple attention aggregation operations, which makes the learned vertex representation vectors more robust.

(4) An outer product operation is performed on the content vector and the structure vector of the vertex in the integration layer. In the present invention, a graph embedding method is applied to the bipartite graph to learn the topological structure representation of the target vertex in the graph, which serves as the structure vector; the original attribute vector of the target vertex and its higher-order representation vector from the aggregation layer are concatenated into the content vector of the vertex. From the data point of view, the outer product of the two is equivalent to a feature dimension expansion: the two one-dimensional representation vectors are mapped into a two-dimensional plane space and then transformed by a single-layer feedforward neural network into a one-dimensional output vector $H_{m,v} \in \mathbb{R}^d$ that contains both kinds of information, thereby integrating the different source information of the target vertex.

The present invention learns the representation of vertices by constructing a "user-short video" bipartite graph at the modality level. Alternative variants can be obtained by constructing a graph with a single type of vertex at the modality level, such as a "user-user" graph or a "short video-short video" graph, and using graph convolutional networks to learn the representations of the user or short video vertices. The present invention performs two-level (first-order and second-order) aggregation operations on the vertices (user vertices and short video vertices) in the aggregation layer to quantify the influence of vertex neighbors and to model the high-order representations of the vertices; a variant scheme can perform higher-order (third-order or above) aggregation on the vertices (user vertices and short video vertices) for representation learning.

Further, as shown in FIG. 7, based on the foregoing short video recommendation method based on the graph model, the present invention also provides an intelligent terminal correspondingly. The intelligent terminal includes a processor 10, a memory 20 and a display 30. FIG. 7 only shows part of the components of the smart terminal, but it should be understood that it is not required to implement all the shown components, and more or fewer components may be implemented instead.

In some embodiments, the memory 20 may be an internal storage unit of the smart terminal, such as a hard disk or a memory of the smart terminal. In other embodiments, the memory 20 may also be an external storage device of the smart terminal, for example, a plug-in hard disk equipped on the smart terminal, a Smart Media Card (SMC), a Secure Digital (SD) card, a flash card, etc. Further, the memory 20 may also include both an internal storage unit of the smart terminal and an external storage device. The memory 20 is used to store application software and various types of data installed on the smart terminal, such as the program code installed on the smart terminal. The memory 20 can also be used to temporarily store data that has been output or will be output. In an embodiment, a short video recommendation program 40 based on a graph model is stored in the memory 20, and the short video recommendation program 40 based on a graph model can be executed by the processor 10, so as to implement the short video recommendation method based on the graph model in this application.

In some embodiments, the processor 10 may be a central processing unit (CPU), a microprocessor, or another data processing chip, and is used to run the program code or process the data stored in the memory 20, for example to perform the short video recommendation method based on the graph model.

In some embodiments, the display 30 may be an LED display, a liquid crystal display, a touch liquid crystal display, an OLED (Organic Light-Emitting Diode) touch device, and the like. The display 30 is used for displaying information on the smart terminal and for displaying a visualized user interface. The components 10-30 of the smart terminal communicate with each other via a system bus.

In an embodiment, when the processor 10 executes the graph model-based short video recommendation program 40 in the memory 20, the following steps are implemented:

The fusion layer fuses multiple modal information of the target vertex;

The interactive behavior is defined as the user watching a short video in its entirety or performing a thumbs-up operation on the watched short video.

The constructing a bipartite graph of the corresponding relationship between the user and the short video according to the user's interactive behavior on the short video also includes:

The short video includes visual modal information, text modal information, and auditory modal information;

The auditory modal information is represented by a 128-dimensional vector that is output after the background music and the speech sounds of characters are truncated and passed through a convolutional neural network.

The aggregation layer is used to aggregate the neighborhood information of the target vertex to obtain a vector representing the target neighborhood, and each aggregation operation consists of neighborhood aggregation and nonlinear processing.

The neighborhood aggregation is: performing an aggregation operation on the neighborhood of the target vertex through an aggregation function;

The construction methods of the aggregation function include: average aggregation, maximum pooling aggregation, and attention mechanism aggregation.

The integration layer is used to integrate input information from different sources in the same modality, and to integrate the low-order information and high-order information of the target vertex in a specific modality to obtain the representation vectors of the user vertices and short video vertices in different modalities;

The present invention also provides a storage medium, wherein the storage medium stores a short video recommendation program based on a graph model, and the short video recommendation program based on the graph model, when executed by a processor, implements the steps of the short video recommendation method based on the graph model; the details are as described above.

In summary, the present invention provides a short video recommendation method based on a graph model, an intelligent terminal, and a storage medium. The method includes: constructing a bipartite graph of the corresponding relationship between the user and the short video according to the user's interactive behavior on the short video; outputting, by the aggregation layer, the high-order representation vector of the target vertex by aggregating the neighborhood information of the target vertex; integrating, by the integration layer, the target node information with the neighborhood information; fusing, by the fusion layer, the multiple modal information of the target vertex; and calculating, by the output layer, the degree of similarity between the user vector and the short video vector, predicting the probability that the user will interact with the short video, and recommending short videos to the user. The present invention constructs a bipartite graph and a corresponding graph convolutional network for each modality of the short video and learns vector representations of the user and short video vertices in the different modalities, thereby achieving fine-grained personalized recommendation for users.

Of course, those of ordinary skill in the art can understand that all or part of the processes in the methods of the above-mentioned embodiments can be implemented by a computer program instructing relevant hardware (such as a processor, a controller, etc.), and the program can be stored in a computer-readable storage medium. When executed, the program may include the processes of the foregoing method embodiments. The storage medium mentioned may be a memory, a magnetic disk, an optical disk, and the like.

It should be understood that the application of the present invention is not limited to the above examples, and those of ordinary skill in the art can make improvements or changes based on the above description, and all these improvements and changes should fall within the protection scope of the appended claims of the present invention.

Claims

A short video recommendation method based on a graph model, characterized in that the short video recommendation method based on a graph model includes:

According to the user's interactive behavior on the short video, construct a bipartite graph of the corresponding relationship between the user and the short video;

The aggregation layer outputs the high-order representation vector of the target vertex itself by aggregating the neighborhood information of the target vertex;

The integration layer integrates target node information with neighborhood information;

The fusion layer fuses multiple modal information of the target vertex;

The output layer calculates the degree of similarity between the user vector and the short video vector, predicts the probability that the user will interact with the short video, and recommends the short video for the user.
The method for recommending a short video based on a graph model according to claim 1, wherein the interactive behavior is defined as the user watching a short video in full or performing a thumbs-up operation on the watched short video.
The method for recommending a short video based on a graph model according to claim 1, wherein the constructing a bipartite graph of the corresponding relationship between the user and the short video according to the user's interaction behavior on the short video further comprises:

Construct a bipartite graph of the corresponding relationship between users and short videos at the modal level.
The method for recommending a short video based on a graph model according to claim 3, wherein the short video includes visual modal information, text modal information, and auditory modal information;

The visual modal information is represented by a 128-dimensional vector output from a video cover picture through a convolutional neural network;

The text modal information is represented by a 128-dimensional vector outputted by word segmentation and natural language processing model vectorization of the video title text;

The auditory modal information is represented by a 128-dimensional vector after the background music and speech sounds of characters are truncated and passed through a convolutional neural network.
The method for recommending short videos based on a graph model according to claim 1, wherein the aggregation layer is used to aggregate the neighborhood information of the target vertex to obtain a vector that characterizes the target neighborhood, and each aggregation operation is composed of neighborhood aggregation and non-linear processing.
The method for recommending short videos based on a graph model according to claim 5, wherein the neighborhood aggregation is: performing an aggregation operation on the neighborhood of the target vertex through an aggregation function;

The non-linear processing is: obtaining the first-order and second-order neighborhood information of the target vertex through the neighborhood aggregation operation, concatenating the original information of the target vertex with its neighborhood information, and inputting the result into a single-layer neural network to obtain the high-order features of the target vertex.
The method for recommending a short video based on a graph model according to claim 6, characterized in that the construction mode of the aggregation function includes: average aggregation, maximum pooling aggregation, and attention mechanism aggregation.
The method for recommending short videos based on a graph model according to claim 1, wherein the integration layer is used to integrate input information from different sources in the same modality, and to integrate the low-order information and high-order information of the target vertex in a specific modality to obtain the representation vectors of user vertices and short video vertices in different modalities;

The fusion layer is used to merge multiple modal representation vectors of the user vertex and the short video vertex.
An intelligent terminal, characterized in that the intelligent terminal includes: a memory, a processor, and a short video recommendation program based on a graph model that is stored in the memory and can run on the processor, and when the short video recommendation program based on the graph model is executed by the processor, the steps of the short video recommendation method based on the graph model according to any one of claims 1-8 are implemented.
A storage medium, wherein the storage medium stores a short video recommendation program based on a graph model, and when the short video recommendation program based on the graph model is executed by a processor, the steps of the short video recommendation method based on the graph model according to any one of claims 1-8 are implemented.