
CN113705402B - Video behavior prediction method, system, electronic device and storage medium - Google Patents


Info

Publication number
CN113705402B
CN113705402B (application CN202110950812.2A)
Authority
CN
China
Prior art keywords
video
behavior prediction
neural network
mode
graph
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110950812.2A
Other languages
Chinese (zh)
Other versions
CN113705402A (en)
Inventor
徐常胜 (Xu Changsheng)
杨小汕 (Yang Xiaoshan)
黄毅 (Huang Yi)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute of Automation of Chinese Academy of Science
Original Assignee
Institute of Automation of Chinese Academy of Science
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Institute of Automation of Chinese Academy of Science filed Critical Institute of Automation of Chinese Academy of Science
Priority to CN202110950812.2A
Publication of CN113705402A
Application granted
Publication of CN113705402B
Active legal status
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/25 Fusion techniques
    • G06F 18/253 Fusion techniques of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computational Linguistics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Evolutionary Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides a video behavior prediction method, a video behavior prediction system, electronic equipment and a storage medium, wherein the method comprises the following steps: acquiring a target video to be predicted; inputting the target video into a video behavior prediction model to obtain a behavior prediction result output by the video behavior prediction model; the video behavior prediction model is used for carrying out dynamic relation modeling on historical moment characteristics and state characteristics of future moments of a target video through a graph convolution neural network, optimizing the graph convolution neural network through knowledge distillation, and obtaining a video behavior prediction result by fusing the multi-mode characteristics after dynamic relation modeling based on the optimized graph convolution neural network. The video behavior prediction method, the system, the electronic equipment and the storage medium provided by the invention can effectively capture the multi-mode dynamic relation change of the historical fragments and the future fragments in the video, and the future occurrence behavior of the video can be predicted more accurately through the graph convolution neural network after knowledge distillation optimization.

Description

Video behavior prediction method, system, electronic device and storage medium
Technical Field
The present invention relates to the field of video analysis technologies, and in particular, to a video behavior prediction method, a system, an electronic device, and a storage medium.
Background
With the rapid development of computer and Internet of Things technologies, video-based future behavior prediction has increasingly broad practical application scenarios in fields such as autonomous driving, human-computer interaction, and wearable device assistants.
At present, the traditional video behavior prediction method directly uses hidden state representation of the observed video to generate future behavior characteristics after context modeling is carried out on the observed video, so that behavior prediction is realized. But this way of directly predicting future behavior based on past video clips ignores the potentially strong correlation between past behavior and future behavior.
In addition, in the training stage of the model, the traditional video behavior prediction method does not consider using training samples containing future video clips, so the learning of the correlation knowledge between past and future behaviors is insufficient, and the obtained behavior prediction results are inaccurate and unreliable.
Disclosure of Invention
The invention provides a video behavior prediction method, a video behavior prediction system, an electronic device and a storage medium, which are used for solving the technical problem that video behavior prediction in the prior art is inaccurate and unreliable.
In a first aspect, the present invention provides a video behavior prediction method, including:
Acquiring a target video to be predicted;
Inputting the target video into a video behavior prediction model to obtain a behavior prediction result output by the video behavior prediction model;
the video behavior prediction model is used for carrying out dynamic relation modeling on historical moment characteristics of a target video and predicted state characteristics of future moments through a graph convolution neural network, optimizing the graph convolution neural network through knowledge distillation, and obtaining a video behavior prediction result by fusing the multi-mode characteristics after dynamic relation modeling based on the optimized graph convolution neural network.
According to the video behavior prediction method provided by the invention, the training process of the video behavior prediction model comprises the following steps:
Feature extraction: extracting historical moment characteristics of an observation video in a training set, carrying out multi-mode characteristic learning on the historical moment characteristics, and predicting to obtain state characteristics of future moments;
Modeling of dynamic relationship: carrying out dynamic relation modeling on the historical moment characteristics of the observed video and the state characteristics of the future moment through a pre-constructed graph convolution neural network to obtain updated graph node characteristics of each mode;
network optimization: obtaining a complete video, carrying out sequence dynamic relation modeling on the complete video, respectively distilling the feature knowledge and the relation knowledge of the complete video into the graph convolutional neural network, and carrying out multi-modal feature mutual learning and relation mutual learning to obtain an optimized graph convolutional neural network; wherein the complete video comprises a video history segment and a real future segment;
Feature fusion: and based on the optimized graph convolution neural network, fusing graph node characteristics of each updated mode to obtain a video behavior prediction result.
According to the video behavior prediction method provided by the invention, the characteristic extraction process comprises the following steps:
extracting historical moment characteristics of observed videos in a training set, wherein the historical moment characteristics comprise video characteristics of multiple modes;
Respectively carrying out sequence context modeling on video features of each mode in the historical moment features, and mapping the video features of each mode to the same dimension;
And predicting and obtaining state characteristics at future time according to the sequence context modeling and the video characteristics of each mode after dimension unification.
According to the video behavior prediction method provided by the invention, the video characteristics of the multiple modes comprise: RGB visual features, optical flow features, and target object features.
According to the video behavior prediction method provided by the invention, the updated graph node features of all modalities are fused, and the fusion expression is:

$$y = \sum_{m} \mathrm{softmax}\big(W_m h_{l+1}^{m} + b_m\big)$$

where $y$ is the probability of occurrence of the final predicted future behavior, $m$ is the video feature modality, $h_{l+1}^{m}$ is the updated graph node feature at time $l+1$, $W_m$ is the weight parameter of the behavior classifier, and $b_m$ is the bias parameter of the behavior classifier.
According to the video behavior prediction method provided by the invention, the dynamic relationship modeling process comprises the following steps:
Establishing a graph convolution neural network for behavior prediction;
taking hidden state features of video clips in the observed video as graph network nodes, and constructing a node feature matrix corresponding to each mode video feature;
respectively calculating dynamic node relations corresponding to the video features of each mode according to the node feature matrix;
And respectively updating graph node characteristics in the characteristic graphs corresponding to the video characteristics of each mode according to the node characteristic matrix and the dynamic node relation.
According to the video behavior prediction method provided by the invention, the network optimization process comprises the following steps:
acquiring a complete video, wherein the complete video comprises a video history fragment and a real future fragment;
modeling the sequence dynamic relationship of the complete video, and learning to obtain a teacher model;
distilling the characteristic knowledge and the relation knowledge of the teacher model into the graph convolution neural network respectively;
And learning complementary information among the video features of each mode through feature mutual learning and relationship mutual learning to obtain the optimized graph convolution neural network.
In a second aspect, the present invention also provides a video behavior prediction system, including:
the acquisition module is used for acquiring a target video to be predicted;
The behavior prediction module is used for inputting the target video into a video behavior prediction model to obtain a behavior prediction result output by the video behavior prediction model;
the video behavior prediction model is used for carrying out dynamic relation modeling on historical moment characteristics of a target video and predicted state characteristics of future moments through a graph convolution neural network, optimizing the graph convolution neural network through knowledge distillation, and obtaining a video behavior prediction result by fusing the multi-mode characteristics after dynamic relation modeling based on the optimized graph convolution neural network.
In a third aspect, the present invention also provides an electronic device, including a memory, a processor, and a computer program stored on the memory and executable on the processor, where the processor implements the steps of any one of the video behavior prediction methods described above when the program is executed.
In a fourth aspect, the present invention also provides a non-transitory computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of any of the video behavior prediction methods described above.
According to the video behavior prediction method, the system, the electronic equipment and the storage medium, dynamic relation modeling is carried out on the historical moment characteristics of the target video and the predicted state characteristics of the future moment, so that the dynamic relation between the historical fragments and the future fragments in the video can be effectively inferred, the multi-mode dynamic relation changes of the historical fragments and the future fragments in the video can be effectively captured, and finally, the future occurrence behavior of the video can be predicted more accurately through the graph convolution neural network after knowledge distillation optimization.
Drawings
In order to more clearly illustrate the invention or the technical solutions of the prior art, the following description will briefly explain the drawings used in the embodiments or the description of the prior art, and it is obvious that the drawings in the following description are some embodiments of the invention, and other drawings can be obtained according to the drawings without inventive effort for a person skilled in the art.
FIG. 1 is a schematic flow chart of a video behavior prediction method provided by the invention;
FIG. 2 is a schematic diagram of a training flow of a video behavior prediction model;
FIG. 3 is a schematic diagram of training principles of a video behavior prediction model;
FIG. 4 is a schematic diagram of a video behavior prediction system according to the present invention;
Fig. 5 is a schematic structural diagram of an electronic device according to the present invention.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the present invention more apparent, the technical solutions of the present invention will be clearly and completely described below with reference to the accompanying drawings, and it is apparent that the described embodiments are some embodiments of the present invention, not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
Fig. 1 shows a video behavior prediction method provided by an embodiment of the present invention, including:
S110: acquiring a target video to be predicted;
S120: inputting the target video into a video behavior prediction model to obtain a behavior prediction result output by the video behavior prediction model;
the video behavior prediction model is used for carrying out dynamic relation modeling on historical moment characteristics of a target video and predicted state characteristics of future moments through a graph convolution neural network, optimizing the graph convolution neural network through knowledge distillation, and obtaining a video behavior prediction result by fusing the dynamic relation modeled multi-mode characteristics based on the optimized graph convolution neural network.
Referring to fig. 2 and fig. 3, the training process of the video behavior prediction model specifically includes:
S210: and a feature extraction step, namely extracting the historical moment features of the observed video in the training set, carrying out multi-mode feature learning on the historical moment features, and predicting to obtain the state features at the future moment.
The multi-mode feature learning method mainly considers multi-mode information of videos and carries out multi-mode feature learning on observation videos in a training set. For each modality feature of the observed video, a context sequence dependency is modeled and state features at future times are predicted.
Specifically, first, for each video in the training data set, the RGB visual features of the video are extracted using a convolutional neural network and denoted $\{r_1, r_2, \dots, r_l\}$ with $r_i \in \mathbb{R}^{D_r}$, and the optical flow features are denoted $\{f_1, f_2, \dots, f_l\}$ with $f_i \in \mathbb{R}^{D_f}$; the features of the target objects are extracted using Faster R-CNN and denoted $\{o_1, o_2, \dots, o_l\}$ with $o_i \in \mathbb{R}^{D_o}$, where $D_r$, $D_f$ and $D_o$ represent the dimensions of the RGB visual features, optical flow features and target object features respectively, $i$ is the index of a video clip, and the input video contains $l$ clips in total.
Then, three gated recurrent unit (Gated Recurrent Unit, GRU) networks perform sequential context modeling on the RGB visual features $\{r_1, r_2, \dots, r_l\}$, the optical flow features $\{f_1, f_2, \dots, f_l\}$ and the target object features $\{o_1, o_2, \dots, o_l\}$ respectively, mapping the three kinds of features to a unified dimension $D_h$. Applying this processing to the three modality features, the corresponding historical-time features are:

$$h_i^r = \mathrm{GRU}_r\big(r_i, h_{i-1}^r\big),\quad h_i^f = \mathrm{GRU}_f\big(f_i, h_{i-1}^f\big),\quad h_i^o = \mathrm{GRU}_o\big(o_i, h_{i-1}^o\big)$$

where $h_i^r, h_i^f, h_i^o \in \mathbb{R}^{D_h}$.
Finally, three progressive gated recurrent units (Progressive Gated Recurrent Unit, PGRU) are designed to predict the multi-modal video features of the future time node:

$$\hat{h}_{l+1}^r = \mathrm{PGRU}_r\big(h_l^r\big),\quad \hat{h}_{l+1}^f = \mathrm{PGRU}_f\big(h_l^f\big),\quad \hat{h}_{l+1}^o = \mathrm{PGRU}_o\big(h_l^o\big)$$

where $\hat{h}_{l+1}^r$, $\hat{h}_{l+1}^f$ and $\hat{h}_{l+1}^o$ are the three predicted modality features at the future time.
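The GRU-based context modeling above can be sketched in plain NumPy. The gate equations below are the standard GRU formulation; the weight shapes, dimensions, and random initialization are illustrative assumptions, and the patent's PGRU variant is not reproduced here.

```python
import numpy as np

def gru_cell(x, h, Wz, Uz, Wr, Ur, Wh, Uh):
    """One standard GRU step: gates computed from input x and previous hidden state h."""
    sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))
    z = sigmoid(Wz @ x + Uz @ h)           # update gate
    r = sigmoid(Wr @ x + Ur @ h)           # reset gate
    h_tilde = np.tanh(Wh @ x + Uh @ (r * h))
    return (1 - z) * h + z * h_tilde

def encode_modality(segments, params, d_h):
    """Run a GRU over one modality's segment features, mapping each to dimension D_h."""
    h = np.zeros(d_h)
    states = []
    for x in segments:
        h = gru_cell(x, h, *params)
        states.append(h)
    return np.stack(states)                # hidden states h_1 .. h_l

rng = np.random.default_rng(0)
d_in, d_h, l = 8, 4, 5                     # toy dimensions: D_r = 8, D_h = 4, l = 5 clips
params = tuple(rng.standard_normal(shape) * 0.1
               for shape in [(d_h, d_in), (d_h, d_h)] * 3)
rgb_segments = rng.standard_normal((l, d_in))   # stand-in for CNN features r_1..r_l
H = encode_modality(rgb_segments, params, d_h)
print(H.shape)                             # one D_h-dim hidden state per observed segment
```

The same encoder would be instantiated three times (RGB, optical flow, target objects), each with its own parameters, as the text describes.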
S220: and a dynamic relation modeling step, namely carrying out dynamic relation modeling on the historical moment characteristics of the observed video and the state characteristics of the future moment through a pre-constructed graph convolution neural network to obtain updated graph node characteristics of each mode.
The method uses a graph convolution neural network (Graph Convolutional Networks, GCN) to model the dynamic relationship between the historical moment characteristics in the observed video and the predicted state characteristics of the future moment, and further infers the impending behavior of the future moment.
Specifically, the operation of the GCN in this embodiment is defined as:

$$\mathrm{GCN}(X) = \mathrm{ReLU}(AXW)$$

where $X$ is the input matrix formed by arranging all node features in the graph network, $A$ is the adjacency matrix describing the node-to-node relations in the graph convolutional neural network, $W$ is the network parameter of the GCN, and ReLU is a nonlinear activation function.
The hidden state features of the modeled video segments are taken as graph network nodes to form the node feature matrices of the three modalities, $H^r = [h_1^r; \dots; h_l^r; \hat{h}_{l+1}^r]$, $H^f = [h_1^f; \dots; h_l^f; \hat{h}_{l+1}^f]$ and $H^o = [h_1^o; \dots; h_l^o; \hat{h}_{l+1}^o]$. The dynamic node relations are then calculated from the node features:

$$A_r(i,j) = \mathrm{softmax}_j\big((h_i^r)^{\top} h_j^r\big),\quad A_f(i,j) = \mathrm{softmax}_j\big((h_i^f)^{\top} h_j^f\big),\quad A_o(i,j) = \mathrm{softmax}_j\big((h_i^o)^{\top} h_j^o\big)$$

where $A_r(i,j)$, $A_f(i,j)$ and $A_o(i,j)$ respectively represent the relation between the $i$-th and $j$-th nodes in the relation graphs of the three modalities.
Finally, the node features of the three modality feature graphs are updated using three GCN branches, one per modality:

$$H^{r\prime} = \mathrm{GCN}_r\big(H^r\big),\quad H^{f\prime} = \mathrm{GCN}_f\big(H^f\big),\quad H^{o\prime} = \mathrm{GCN}_o\big(H^o\big)$$

where $H^{r\prime}$, $H^{f\prime}$ and $H^{o\prime}$ are the updated graph node features of the three modalities.
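A minimal NumPy sketch of one per-modality graph update. The layer follows the ReLU(A·X·W) form defined above; the patent's exact node-similarity function is not recoverable from this text, so the row-softmax over dot-product similarities below is an assumed, plausible choice.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def dynamic_adjacency(H):
    """Assumed relation: row-softmax over pairwise dot-product node similarities."""
    return softmax(H @ H.T, axis=1)

def gcn_layer(A, X, W):
    """Graph convolution as defined above: ReLU(A X W)."""
    return np.maximum(A @ X @ W, 0.0)

rng = np.random.default_rng(1)
n_nodes, d_h = 6, 4                        # l observed nodes plus 1 predicted future node
H = rng.standard_normal((n_nodes, d_h))    # node feature matrix for one modality
A = dynamic_adjacency(H)                   # dynamic node relations A(i, j)
W = rng.standard_normal((d_h, d_h)) * 0.1
H_updated = gcn_layer(A, H, W)             # updated graph node features
print(A.sum(axis=1))                       # each row of A sums to 1 (a distribution over nodes)
```

The same update would be applied independently to the RGB, optical flow, and target object graphs.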
S230: a network optimization step, namely acquiring a complete video, carrying out sequence dynamic relationship modeling on the complete video, respectively distilling the feature knowledge and the relationship knowledge of the complete video into a graph convolution neural network, and carrying out multi-mode feature mutual learning and relationship mutual learning to obtain an optimized graph convolution neural network; wherein the complete video contains a video history segment and a real future segment.
In this step, the complete video comprising the video history segments and the real future segments is used to perform sequential dynamic relationship modeling and to learn the teacher (Teacher) model, and the relation knowledge of the teacher network is distilled into the graph convolutional neural network, i.e., the student (Student) model. The future node features in the teacher model are calculated based on the real future video clips rather than predicted by the PGRU in S210 above.
Specifically, this embodiment distills the knowledge of the teacher model into the graph convolutional neural network for behavior prediction in S220 using two knowledge distillation strategies: feature distillation and relation distillation.
The loss function of the feature distillation strategy is the 2-norm difference between the graph node features obtained in the teacher and student models:

$$L_{kd\_fea} = \big\|H^{r\prime} - H^{r\prime}_{T}\big\|_2 + \big\|H^{f\prime} - H^{f\prime}_{T}\big\|_2 + \big\|H^{o\prime} - H^{o\prime}_{T}\big\|_2$$

where $H^{r\prime}$, $H^{f\prime}$ and $H^{o\prime}$ are the graph node features obtained from the graph convolutional neural network for behavior prediction (i.e. the student model), and $H^{r\prime}_{T}$, $H^{f\prime}_{T}$ and $H^{o\prime}_{T}$ are the graph node features obtained from the teacher model.
The loss function of the relation distillation strategy is the Kullback-Leibler divergence between the graph relation matrices obtained in the teacher and student models:

$$L_{kd\_rel} = D_{KL}\big(A_r^{T}, A_r\big) + D_{KL}\big(A_f^{T}, A_f\big) + D_{KL}\big(A_o^{T}, A_o\big)$$

where $A_r$, $A_f$ and $A_o$ are the graph relation matrices obtained by the student model, and $A_r^{T}$, $A_f^{T}$ and $A_o^{T}$ are the graph relation matrices obtained by the teacher model.
Specifically, the Kullback-Leibler divergence is computed as:

$$D_{KL}(p, q) = E\big[\log(p) - \log(q)\big] \tag{16}$$
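Equation (16) can be realized for discrete distributions as follows; the small clipping constant is an implementation detail added here to avoid taking the logarithm of zero.

```python
import numpy as np

def kl_divergence(p, q, eps=1e-12):
    """D_KL(p, q) = E_p[log(p) - log(q)] for discrete probability vectors p, q."""
    p = np.clip(p, eps, None)
    q = np.clip(q, eps, None)
    return float(np.sum(p * (np.log(p) - np.log(q))))

p = np.array([0.7, 0.2, 0.1])
q = np.array([0.5, 0.3, 0.2])
print(kl_divergence(p, p))       # 0.0: a distribution has zero divergence from itself
print(kl_divergence(p, q) > 0)   # True: the divergence is non-negative
```

Applied row-wise to the graph relation matrices, this yields the relation distillation and relation mutual learning terms.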
Meanwhile, the complementary information among the three video modalities is learned through two multi-modal mutual learning strategies: feature mutual learning and relation mutual learning. The feature mutual learning loss function is the 2-norm difference between the graph node features obtained by the per-modality graph convolutional networks:

$$L_{mu\_fea} = \big\|H^{r\prime} - H^{f\prime}\big\|_2 + \big\|H^{r\prime} - H^{o\prime}\big\|_2 + \big\|H^{f\prime} - H^{o\prime}\big\|_2$$
The loss function of the relation mutual learning is the Kullback-Leibler divergence among the graph relation matrices obtained by the per-modality graph convolutional networks:

$$L_{mu\_rel} = D_{KL}(A_r, A_f) + D_{KL}(A_f, A_r) + D_{KL}(A_r, A_o) + D_{KL}(A_o, A_r) + D_{KL}(A_f, A_o) + D_{KL}(A_o, A_f) \tag{18}$$
Through the knowledge distillation and mutual learning processes, the graph convolution neural network for behavior prediction can be optimized, and the accuracy and reliability of network processing data are improved.
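The feature and relation distillation terms described above can be sketched with two small helper functions. The toy teacher/student features and relation matrices below are fabricated for illustration only; the mutual learning terms would reuse the same two functions across modality pairs rather than teacher/student pairs.

```python
import numpy as np

def feature_loss(h_student, h_teacher):
    """2-norm gap between student and teacher graph-node features (feature distillation)."""
    return float(np.linalg.norm(h_student - h_teacher))

def relation_loss(A_student, A_teacher, eps=1e-12):
    """Element-wise KL divergence from the teacher relation matrix to the student's."""
    P = np.clip(A_teacher, eps, None)
    Q = np.clip(A_student, eps, None)
    return float(np.sum(P * (np.log(P) - np.log(Q))))

rng = np.random.default_rng(2)
h_s = rng.standard_normal(4)
h_t = h_s + 0.1 * rng.standard_normal(4)   # teacher features come from the real future clip
A_s = np.full((3, 3), 1 / 3)               # student relations: uniform rows
A_t = np.eye(3) * 0.7 + 0.1                # teacher relations: peaked rows, each summing to 1
l_fea = feature_loss(h_s, h_t)
l_rel = relation_loss(A_s, A_t)
print(l_fea >= 0, l_rel >= 0)              # both distillation losses are non-negative
```

Minimizing both terms pulls the student's node features and relation matrices toward those the teacher computed from the complete video.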
S240: and a feature fusion step, namely fusing updated graph node features of all modes based on the optimized graph convolution neural network to obtain a video behavior prediction result.
In this step, the multi-modal prediction results obtained in S220 are fused, learned and optimized under a unified framework, and the video behavior prediction result is finally output.
The multi-modal fusion strategy involved in the feature fusion is as follows:

$$y = \sum_{m \in \{r, f, o\}} \mathrm{softmax}\big(W_m h_{l+1}^{m} + b_m\big)$$

where $y$ is the final predicted probability distribution of future behavior, $m$ is the video feature modality, $h_{l+1}^{m}$ is the updated graph node feature at time $l+1$ obtained in S220, $W_m$ is the weight parameter of the behavior classifier, and $b_m$ is the bias parameter of the behavior classifier.
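A sketch of the fusion step, assuming each modality's future graph-node feature passes through its own softmax classifier (W_m, b_m) and the per-modality outputs are averaged; the averaging, which keeps the output a probability distribution, is an assumption about the exact normalization.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def fuse(future_nodes, classifiers):
    """Average the per-modality classifier outputs softmax(W_m h_m + b_m)."""
    probs = [softmax(W @ h + b) for h, (W, b) in zip(future_nodes, classifiers)]
    return np.mean(probs, axis=0)

rng = np.random.default_rng(3)
d_h, n_classes = 4, 5
future_nodes = [rng.standard_normal(d_h) for _ in range(3)]   # RGB / flow / object nodes
classifiers = [(rng.standard_normal((n_classes, d_h)), np.zeros(n_classes))
               for _ in range(3)]                             # one (W_m, b_m) per modality
y = fuse(future_nodes, classifiers)
print(np.isclose(y.sum(), 1.0))          # True: the fused output is a valid distribution
```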
The learning optimization loss function of the unified framework is as follows:

$$L = L_{ce}\big(y, \hat{y}\big) + L_{kd\_fea} + L_{kd\_rel} + L_{mu\_fea} + L_{mu\_rel}$$

where $L_{ce}$ is the cross entropy loss function, $\hat{y}$ is the true future behavior label, and $L_{kd\_fea}$, $L_{kd\_rel}$, $L_{mu\_fea}$ and $L_{mu\_rel}$ are the loss functions in knowledge distillation and multi-modal mutual learning.
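The unified objective can be sketched as a cross-entropy term plus the four auxiliary losses. The equal weighting of the auxiliary terms is an assumption, since the text does not give trade-off coefficients, and the numeric loss values below are fabricated placeholders.

```python
import numpy as np

def cross_entropy(y_pred, label, eps=1e-12):
    """L_ce for a one-hot ground-truth future behavior label."""
    return float(-np.log(max(y_pred[label], eps)))

def total_loss(y_pred, label, l_kd_fea, l_kd_rel, l_mu_fea, l_mu_rel,
               weights=(1.0, 1.0, 1.0, 1.0)):
    """Unified objective: classification loss plus weighted auxiliary terms (weights assumed)."""
    aux = (l_kd_fea, l_kd_rel, l_mu_fea, l_mu_rel)
    return cross_entropy(y_pred, label) + sum(w * l for w, l in zip(weights, aux))

y_pred = np.array([0.1, 0.6, 0.3])       # fused prediction y from the feature fusion step
loss = total_loss(y_pred, label=1, l_kd_fea=0.2, l_kd_rel=0.05,
                  l_mu_fea=0.1, l_mu_rel=0.02)
print(round(loss, 4))                    # -log(0.6) + 0.37 ≈ 0.8808
```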
Although the graph convolutional neural network has achieved great success in dynamic relationship modeling, few approaches have applied the GCN to video-based behavior prediction. In order to enable the GCN to effectively model the relation between the past and future behaviors of a video, and to fully utilize complete video segments to learn the dynamic association between past and future behaviors, the embodiment of the invention fully considers three aspects, namely multi-modal feature learning, global relation modeling and relation knowledge distillation from complete video segments, and provides the above video behavior prediction method.
The video behavior prediction system provided by the invention is described below, and the video behavior prediction system described below and the video behavior prediction method described above can be referred to correspondingly.
Referring to fig. 4, a video behavior prediction system provided by an embodiment of the present invention includes:
an obtaining module 410, configured to obtain a target video to be predicted;
the behavior prediction module 420 is configured to input the target video to the video behavior prediction model, and obtain a behavior prediction result output by the video behavior prediction model;
the video behavior prediction model is used for carrying out dynamic relation modeling on historical moment characteristics of a target video and predicted state characteristics of future moments through a graph convolution neural network, optimizing the graph convolution neural network through knowledge distillation, and obtaining a video behavior prediction result by fusing the dynamic relation modeled multi-mode characteristics based on the optimized graph convolution neural network.
The behavior prediction module 420 implements prediction of future behavior in video through a video behavior prediction model, specifically, regarding a training portion of the video behavior prediction model, including:
the feature extraction unit is used for extracting historical moment features of the observation video in the training set, carrying out multi-mode feature learning on the historical moment features, and predicting to obtain state features at future moments;
The dynamic relation modeling unit is used for carrying out dynamic relation modeling on the historical moment characteristics of the observed video and the state characteristics of the future moment through a pre-constructed graph convolution neural network to obtain updated graph node characteristics of each mode;
The network optimization unit is used for acquiring a complete video, carrying out sequence dynamic relationship modeling on the complete video, respectively distilling the feature knowledge and the relation knowledge of the complete video into the graph convolutional neural network, and carrying out multi-modal feature mutual learning and relation mutual learning to obtain an optimized graph convolutional neural network; wherein the complete video comprises a video history segment and a real future segment;
And the feature fusion unit is used for fusing the updated graph node features of each mode based on the optimized graph convolution neural network to obtain a video behavior prediction result.
It can be understood that the feature extraction unit needs to extract the historical moment features of the observation video in the training set, where the historical moment features include video features of multiple modes, and in this embodiment, features of three modes including RGB visual features, optical flow features and target object features are used; then respectively carrying out sequence context modeling on video features of each mode in the historical moment features, and mapping the video features of each mode to the same dimension; and finally, predicting and obtaining the state characteristics of the future moment according to the sequence context modeling and the video characteristics of each mode after the dimension is unified.
It can be appreciated that the dynamic relationship modeling unit first needs to build a graph convolution neural network for behavior prediction; then taking the hidden state characteristics of the video clips in the observed video as graph network nodes, and constructing a node characteristic matrix corresponding to each mode video characteristic; then, respectively calculating dynamic node relations corresponding to the video features of each mode according to the node feature matrix; and finally, respectively updating the graph node characteristics in the characteristic graphs corresponding to the video characteristics of each mode according to the node characteristic matrix and the dynamic node relation.
It can be appreciated that the network optimization unit first obtains a complete video including a video history segment and a real future segment; then, modeling the sequence dynamic relationship of the complete video, and learning to obtain a teacher model; then, respectively distilling the characteristic knowledge and the relation knowledge of the teacher model into a graph convolution neural network; and finally, learning complementary information among the video features of each mode through feature mutual learning and relationship mutual learning to obtain the optimized graph convolution neural network.
According to the video behavior prediction system provided by the embodiment of the invention, the behavior prediction module is used for carrying out dynamic relation modeling on the historical moment characteristics of the target video and the predicted state characteristics of the future moment, so that the dynamic relation between the historical fragments and the future fragments in the video can be effectively inferred, the multi-mode dynamic relation changes of the historical fragments and the future fragments in the video can be effectively captured, and finally, the future occurrence behavior of the video can be predicted more accurately through the graph convolution neural network after knowledge distillation optimization.
Fig. 5 illustrates a physical schematic diagram of an electronic device, as shown in fig. 5, which may include: processor 510, communication interface (Communications Interface) 520, memory 530, and communication bus 540, wherein processor 510, communication interface 520, memory 530 complete communication with each other through communication bus 540. Processor 510 may invoke logic instructions in memory 530 to perform a video behavior prediction method comprising: acquiring a target video to be predicted; inputting the target video into a video behavior prediction model to obtain a behavior prediction result output by the video behavior prediction model; the video behavior prediction model is used for carrying out dynamic relation modeling on historical moment characteristics of a target video and predicted state characteristics of future moments through a graph convolution neural network, optimizing the graph convolution neural network through knowledge distillation, and obtaining a video behavior prediction result by fusing the dynamic relation modeled multi-mode characteristics based on the optimized graph convolution neural network.
Further, the logic instructions in the memory 530 described above may be implemented in the form of software functional units and may be stored in a computer-readable storage medium when sold or used as a stand-alone product. Based on this understanding, the technical solution of the present invention may be embodied essentially or in a part contributing to the prior art or in a part of the technical solution, in the form of a software product stored in a storage medium, comprising several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to perform all or part of the steps of the method according to the embodiments of the present invention. And the aforementioned storage medium includes: a usb disk, a removable hard disk, a Read-Only Memory (ROM), a random access Memory (RAM, random Access Memory), a magnetic disk, or an optical disk, or other various media capable of storing program codes.
In another aspect, the present invention also provides a computer program product comprising a computer program stored on a non-transitory computer-readable storage medium, the computer program comprising program instructions which, when executed by a computer, enable the computer to perform the video behavior prediction method provided above, the method comprising: acquiring a target video to be predicted; inputting the target video into a video behavior prediction model to obtain a behavior prediction result output by the video behavior prediction model; the video behavior prediction model is used for carrying out dynamic relation modeling on historical moment characteristics of a target video and predicted state characteristics of future moments through a graph convolution neural network, optimizing the graph convolution neural network through knowledge distillation, and obtaining a video behavior prediction result by fusing the multi-mode characteristics after dynamic relation modeling based on the optimized graph convolution neural network.
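The knowledge-distillation step mentioned in the method summary (distilling feature knowledge and relation knowledge from a teacher modeled on the complete video into the prediction network) can be sketched with two loss terms. This is a minimal numpy sketch under assumed shapes; the MSE/KL formulation and all names are illustrative assumptions, not the patent's exact losses.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def distillation_losses(H_student, H_teacher):
    """Feature- and relation-distillation terms.

    H_student: (T, d) graph node features from the prediction network
               (observed segments + predicted future states).
    H_teacher: (T, d) features from a teacher modeled on the complete
               video (history segments + real future segments).
    Feature loss: mean-squared error between node features.
    Relation loss: KL divergence between the row-normalized pairwise
    affinity matrices (the "relation knowledge"), averaged over rows.
    """
    feat_loss = np.mean((H_student - H_teacher) ** 2)
    P = softmax(H_teacher @ H_teacher.T)   # teacher node relations
    Q = softmax(H_student @ H_student.T)   # student node relations
    rel_loss = np.sum(P * (np.log(P + 1e-12) - np.log(Q + 1e-12))) / len(P)
    return feat_loss, rel_loss

rng = np.random.default_rng(1)
H_teacher = rng.standard_normal((6, 16))
H_student = H_teacher + 0.1 * rng.standard_normal((6, 16))
f, r = distillation_losses(H_student, H_teacher)
print(f"feat_loss={f:.4f} rel_loss={r:.4f}")
```

During training both terms would be added to the behavior-classification loss; the teacher sees the real future segments, while the student only observes the history, which is what makes the transfer useful for prediction.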
In yet another aspect, the present invention also provides a non-transitory computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the video behavior prediction method provided above, the method comprising: acquiring a target video to be predicted; inputting the target video into a video behavior prediction model to obtain a behavior prediction result output by the video behavior prediction model; the video behavior prediction model is used for carrying out dynamic relation modeling on historical moment characteristics of a target video and predicted state characteristics of future moments through a graph convolution neural network, optimizing the graph convolution neural network through knowledge distillation, and obtaining a video behavior prediction result by fusing the multi-mode characteristics after dynamic relation modeling based on the optimized graph convolution neural network.
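The final fusion of the per-modality graph node features described above (and spelled out in claim 4) can be sketched as late fusion of per-modality behavior classifiers. A minimal numpy sketch, assuming softmax classifiers whose outputs are averaged; the modality names, class count, and averaging choice are illustrative assumptions.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax for a 1-D score vector."""
    e = np.exp(x - x.max())
    return e / e.sum()

def fuse_modalities(h, params):
    """Late fusion of per-modality graph node features.

    h: dict modality -> (d,) final node feature h_m^(l+1)
    params: dict modality -> (W_m, b_m) behavior-classifier parameters
    Returns class probabilities averaged over modalities.
    """
    probs = [softmax(W @ h[m] + b) for m, (W, b) in params.items()]
    return np.mean(probs, axis=0)

rng = np.random.default_rng(2)
mods = ["rgb", "flow", "obj"]  # RGB visual, optical-flow, target-object features
h = {m: rng.standard_normal(16) for m in mods}
params = {m: (rng.standard_normal((10, 16)), rng.standard_normal(10))
          for m in mods}
y = fuse_modalities(h, params)   # probability of each future behavior class
print(round(float(y.sum()), 6))  # 1.0
```

The averaged vector remains a valid probability distribution over behavior classes, so the most likely future behavior is simply its argmax.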
The apparatus embodiments described above are merely illustrative, wherein the units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed over a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. Those of ordinary skill in the art can understand and implement the invention without creative effort.
From the above description of the embodiments, it will be apparent to those skilled in the art that the embodiments may be implemented by means of software plus a necessary general hardware platform, or of course by means of hardware. Based on this understanding, the foregoing technical solution, in essence, or the part thereof contributing to the prior art, may be embodied in the form of a software product, which may be stored in a computer-readable storage medium, such as a ROM/RAM, a magnetic disk, or an optical disk, and which includes several instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the method described in the respective embodiments or some parts of the embodiments.
Finally, it should be noted that: the above embodiments are only for illustrating the technical solution of the present invention, and are not limiting; although the invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims (9)

1. A method for predicting video behavior, comprising:
Acquiring a target video to be predicted;
Inputting the target video into a video behavior prediction model to obtain a behavior prediction result output by the video behavior prediction model;
The video behavior prediction model is used for carrying out dynamic relation modeling on historical moment characteristics of a target video and predicted state characteristics of future moments through a graph convolution neural network, optimizing the graph convolution neural network through knowledge distillation, and obtaining a video behavior prediction result by fusing the multi-mode characteristics after dynamic relation modeling based on the optimized graph convolution neural network;
The training process of the video behavior prediction model comprises the following steps:
Feature extraction: extracting historical moment characteristics of an observation video in a training set, carrying out multi-mode characteristic learning on the historical moment characteristics, and predicting to obtain state characteristics of future moments;
Modeling of dynamic relationship: carrying out dynamic relation modeling on the historical moment characteristics of the observed video and the state characteristics of the future moment through a pre-constructed graph convolution neural network to obtain updated graph node characteristics of each mode;
network optimization: obtaining a complete video, carrying out sequence dynamic relation modeling on the complete video, respectively distilling the characteristic knowledge and the relation knowledge of the complete video into the graph convolution neural network, and carrying out multi-mode characteristic mutual learning and relation mutual learning to obtain an optimized graph convolution neural network; wherein the complete video comprises a video history segment and a real future segment;
Feature fusion: and based on the optimized graph convolution neural network, fusing graph node characteristics of each updated mode to obtain a video behavior prediction result.
2. A method of predicting video behavior as recited in claim 1, wherein the feature extraction process comprises:
extracting historical moment characteristics of observed videos in a training set, wherein the historical moment characteristics comprise video characteristics of multiple modes;
Respectively carrying out sequence context modeling on video features of each mode in the historical moment features, and mapping the video features of each mode to the same dimension;
And predicting and obtaining state characteristics at future time according to the sequence context modeling and the video characteristics of each mode after dimension unification.
3. A method of predicting video behavior according to claim 2, wherein the video features of the plurality of modalities include: RGB visual features, optical flow features, and target object features.
4. A video behavior prediction method according to claim 3, wherein the updated graph node features of each mode are fused, and the fusion expression is:
y = Σ_{m∈M} softmax(W_m · h_m^{(l+1)} + b_m)
where y is the probability of occurrence of the final predicted future behavior, m is the video feature modality, h_m^{(l+1)} is the graph node feature of modality m at layer l+1, W_m is the weight parameter of the behavior classifier, and b_m is the bias parameter of the behavior classifier.
5. A method of video behavior prediction according to claim 1, wherein the process of dynamic relationship modeling comprises:
Establishing a graph convolution neural network for behavior prediction;
taking hidden state features of video clips in the observed video as graph network nodes, and constructing a node feature matrix corresponding to each mode video feature;
respectively calculating dynamic node relations corresponding to the video features of each mode according to the node feature matrix;
And respectively updating graph node characteristics in the characteristic graphs corresponding to the video characteristics of each mode according to the node characteristic matrix and the dynamic node relation.
6. A video behavior prediction method according to claim 1, wherein the network optimization process comprises:
acquiring a complete video, wherein the complete video comprises a video history fragment and a real future fragment;
modeling the sequence dynamic relationship of the complete video, and learning to obtain a teacher model;
distilling the characteristic knowledge and the relation knowledge of the teacher model into the graph convolution neural network respectively;
And learning complementary information among the video features of each mode through feature mutual learning and relationship mutual learning to obtain the optimized graph convolution neural network.
7. A video behavior prediction system, comprising:
the acquisition module is used for acquiring a target video to be predicted;
The behavior prediction module is used for inputting the target video into a video behavior prediction model to obtain a behavior prediction result output by the video behavior prediction model;
The video behavior prediction model is used for carrying out dynamic relation modeling on historical moment characteristics of a target video and predicted state characteristics of future moments through a graph convolution neural network, optimizing the graph convolution neural network through knowledge distillation, and obtaining a video behavior prediction result by fusing the multi-mode characteristics after dynamic relation modeling based on the optimized graph convolution neural network;
The training process of the video behavior prediction model comprises the following steps:
Feature extraction: extracting historical moment characteristics of an observation video in a training set, carrying out multi-mode characteristic learning on the historical moment characteristics, and predicting to obtain state characteristics of future moments;
Modeling of dynamic relationship: carrying out dynamic relation modeling on the historical moment characteristics of the observed video and the state characteristics of the future moment through a pre-constructed graph convolution neural network to obtain updated graph node characteristics of each mode;
network optimization: obtaining a complete video, carrying out sequence dynamic relation modeling on the complete video, respectively distilling the characteristic knowledge and the relation knowledge of the complete video into the graph convolution neural network, and carrying out multi-mode characteristic mutual learning and relation mutual learning to obtain an optimized graph convolution neural network; wherein the complete video comprises a video history segment and a real future segment;
Feature fusion: and based on the optimized graph convolution neural network, fusing graph node characteristics of each updated mode to obtain a video behavior prediction result.
8. An electronic device comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, characterized in that the processor implements the steps of the video behavior prediction method of any one of claims 1 to 6 when executing the program.
9. A non-transitory computer readable storage medium having stored thereon a computer program, which when executed by a processor implements the steps of the video behavior prediction method of any one of claims 1 to 6.
CN202110950812.2A 2021-08-18 2021-08-18 Video behavior prediction method, system, electronic device and storage medium Active CN113705402B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110950812.2A CN113705402B (en) 2021-08-18 2021-08-18 Video behavior prediction method, system, electronic device and storage medium

Publications (2)

Publication Number Publication Date
CN113705402A CN113705402A (en) 2021-11-26
CN113705402B true CN113705402B (en) 2024-08-13

Family

ID=78653369

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110950812.2A Active CN113705402B (en) 2021-08-18 2021-08-18 Video behavior prediction method, system, electronic device and storage medium

Country Status (1)

Country Link
CN (1) CN113705402B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114495278A (en) * 2022-01-27 2022-05-13 苏州大学 Unknown category-oriented action prediction method
CN114310917B (en) * 2022-03-11 2022-06-14 山东高原油气装备有限公司 Oil pipe transfer robot joint track error compensation method

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112541440A (en) * 2020-12-16 2021-03-23 中电海康集团有限公司 Subway pedestrian flow network fusion method based on video pedestrian recognition and pedestrian flow prediction method

Family Cites Families (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109784997B (en) * 2019-01-11 2022-07-01 重庆邮电大学 Short video active user prediction method based on big data
US11790213B2 (en) * 2019-06-12 2023-10-17 Sri International Identifying complex events from hierarchical representation of data set features
CN110324728B (en) * 2019-06-28 2021-11-23 浙江传媒学院 Sports event full-field review short video generation method based on deep reinforcement learning
CN111079601A (en) * 2019-12-06 2020-04-28 中国科学院自动化研究所 Video content description method, system and device based on multi-mode attention mechanism
CN111259804B (en) * 2020-01-16 2023-03-14 合肥工业大学 Multi-modal fusion sign language recognition system and method based on graph convolution
CN111488815B (en) * 2020-04-07 2023-05-09 中山大学 Event prediction method based on graph convolution network and long-short-time memory network
CN111860353A (en) * 2020-07-23 2020-10-30 北京以萨技术股份有限公司 Video behavior prediction method, device and medium based on double-flow neural network
CN112183391A (en) * 2020-09-30 2021-01-05 中国科学院计算技术研究所 First-view video behavior prediction system and method
CN112668438A (en) * 2020-12-23 2021-04-16 深圳壹账通智能科技有限公司 Infrared video time sequence behavior positioning method, device, equipment and storage medium

Similar Documents

Publication Publication Date Title
CN111651671B (en) User object recommendation method, device, computer equipment and storage medium
CN111741330A (en) Video content evaluation method and device, storage medium and computer equipment
CN109543112A (en) A kind of sequence of recommendation method and device based on cyclic convolution neural network
CN113705402B (en) Video behavior prediction method, system, electronic device and storage medium
CN113628059A (en) Associated user identification method and device based on multilayer graph attention network
CN114780831A (en) Sequence recommendation method and system based on Transformer
CN114580794B (en) Data processing method, apparatus, program product, computer device and medium
CN116561446B (en) Multi-mode project recommendation method, system and device and storage medium
CN112561031A (en) Model searching method and device based on artificial intelligence and electronic equipment
CN114492601A (en) Resource classification model training method and device, electronic equipment and storage medium
CN110245310B (en) Object behavior analysis method, device and storage medium
CN111897943A (en) Session record searching method and device, electronic equipment and storage medium
CN115062779A (en) Event prediction method and device based on dynamic knowledge graph
CN116484016B (en) Time sequence knowledge graph reasoning method and system based on automatic maintenance of time sequence path
CN116702784B (en) Entity linking method, entity linking device, computer equipment and storage medium
CN118115932A (en) Image regressor training method, related method, device, equipment and medium
CN114627085A (en) Target image identification method and device, storage medium and electronic equipment
CN111935259B (en) Method and device for determining target account set, storage medium and electronic equipment
Bordes et al. Evidential grammars: A compositional approach for scene understanding. Application to multimodal street data
CN111966889A (en) Method for generating graph embedding vector and method for generating recommended network model
CN113283394B (en) Pedestrian re-identification method and system integrating context information
CN116050508B (en) Neural network training method and device
CN115359654B (en) Updating method and device of flow prediction system
CN112699909B (en) Information identification method, information identification device, electronic equipment and computer readable storage medium
CN118132788A (en) Image text semantic matching method and system applied to image-text retrieval

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant