
CN113705402B - Video behavior prediction method, system, electronic device and storage medium - Google Patents


Info

Publication number
CN113705402B
CN113705402B (application CN202110950812.2A)
Authority
CN
China
Prior art keywords
video
behavior prediction
neural network
mode
graph
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110950812.2A
Other languages
Chinese (zh)
Other versions
CN113705402A (en)
Inventor
徐常胜 (Xu Changsheng)
杨小汕 (Yang Xiaoshan)
黄毅 (Huang Yi)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute of Automation of Chinese Academy of Science
Original Assignee
Institute of Automation of Chinese Academy of Science
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Institute of Automation of Chinese Academy of Science filed Critical Institute of Automation of Chinese Academy of Science
Priority to CN202110950812.2A
Publication of CN113705402A
Application granted
Publication of CN113705402B
Active legal status
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/25 Fusion techniques
    • G06F 18/253 Fusion techniques of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computational Linguistics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Evolutionary Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides a video behavior prediction method, a video behavior prediction system, electronic equipment and a storage medium, wherein the method comprises the following steps: acquiring a target video to be predicted; inputting the target video into a video behavior prediction model to obtain a behavior prediction result output by the video behavior prediction model; the video behavior prediction model is used for carrying out dynamic relation modeling on historical moment characteristics and state characteristics of future moments of a target video through a graph convolution neural network, optimizing the graph convolution neural network through knowledge distillation, and obtaining a video behavior prediction result by fusing the multi-mode characteristics after dynamic relation modeling based on the optimized graph convolution neural network. The video behavior prediction method, the system, the electronic equipment and the storage medium provided by the invention can effectively capture the multi-mode dynamic relation change of the historical fragments and the future fragments in the video, and the future occurrence behavior of the video can be predicted more accurately through the graph convolution neural network after knowledge distillation optimization.

Description

Video behavior prediction method, system, electronic device and storage medium
Technical Field
The present invention relates to the field of video analysis technologies, and in particular, to a video behavior prediction method, a system, an electronic device, and a storage medium.
Background
With the rapid development of computer and Internet of Things technologies, video-based future behavior prediction has increasingly broad practical application scenarios in fields such as autonomous driving, human-computer interaction, and wearable device assistants.
At present, the traditional video behavior prediction method directly uses hidden state representation of the observed video to generate future behavior characteristics after context modeling is carried out on the observed video, so that behavior prediction is realized. But this way of directly predicting future behavior based on past video clips ignores the potentially strong correlation between past behavior and future behavior.
In addition, in the training stage of the model, the traditional video behavior prediction method does not consider using training samples containing future video clips, so the learning of the correlation knowledge between past and future behaviors is insufficient, and the obtained behavior prediction results are inaccurate and unreliable.
Disclosure of Invention
The invention provides a video behavior prediction method, a video behavior prediction system, an electronic device and a storage medium, which are used for solving the technical problem that video behavior prediction in the prior art is inaccurate and unreliable.
In a first aspect, the present invention provides a video behavior prediction method, including:
Acquiring a target video to be predicted;
Inputting the target video into a video behavior prediction model to obtain a behavior prediction result output by the video behavior prediction model;
the video behavior prediction model is used for carrying out dynamic relation modeling on historical moment characteristics of a target video and predicted state characteristics of future moments through a graph convolution neural network, optimizing the graph convolution neural network through knowledge distillation, and obtaining a video behavior prediction result by fusing the multi-mode characteristics after dynamic relation modeling based on the optimized graph convolution neural network.
According to the video behavior prediction method provided by the invention, the training process of the video behavior prediction model comprises the following steps:
Feature extraction: extracting historical moment characteristics of an observation video in a training set, carrying out multi-mode characteristic learning on the historical moment characteristics, and predicting to obtain state characteristics of future moments;
Modeling of dynamic relationship: carrying out dynamic relation modeling on the historical moment characteristics of the observed video and the state characteristics of the future moment through a pre-constructed graph convolution neural network to obtain updated graph node characteristics of each mode;
network optimization: obtaining a complete video, carrying out sequence dynamic relation modeling on the complete video, respectively distilling the feature knowledge and the relation knowledge of the complete video into the graph convolutional neural network, and carrying out multi-modal feature mutual learning and relation mutual learning to obtain an optimized graph convolutional neural network; wherein the complete video comprises a video history segment and a real future segment;
Feature fusion: and based on the optimized graph convolution neural network, fusing graph node characteristics of each updated mode to obtain a video behavior prediction result.
According to the video behavior prediction method provided by the invention, the characteristic extraction process comprises the following steps:
extracting historical moment characteristics of observed videos in a training set, wherein the historical moment characteristics comprise video characteristics of multiple modes;
Respectively carrying out sequence context modeling on video features of each mode in the historical moment features, and mapping the video features of each mode to the same dimension;
And predicting and obtaining state characteristics at future time according to the sequence context modeling and the video characteristics of each mode after dimension unification.
According to the video behavior prediction method provided by the invention, the video characteristics of the multiple modes comprise: RGB visual features, optical flow features, and target object features.
According to the video behavior prediction method provided by the invention, the updated graph node features of all modalities are fused, and the fusion expression is:

$$y = \sum_{m} \mathrm{softmax}\big(W_m h_{l+1}^{m} + b_m\big)$$

where $y$ is the probability of occurrence of the final predicted future behavior, $m$ is the video feature modality, $h_{l+1}^{m}$ is the updated graph node feature at time $l+1$, $W_m$ is the weight parameter of the behavior classifier, and $b_m$ is the bias parameter of the behavior classifier.
According to the video behavior prediction method provided by the invention, the dynamic relationship modeling process comprises the following steps:
Establishing a graph convolution neural network for behavior prediction;
taking hidden state features of video clips in the observed video as graph network nodes, and constructing a node feature matrix corresponding to each mode video feature;
respectively calculating dynamic node relations corresponding to the video features of each mode according to the node feature matrix;
And respectively updating graph node characteristics in the characteristic graphs corresponding to the video characteristics of each mode according to the node characteristic matrix and the dynamic node relation.
According to the video behavior prediction method provided by the invention, the network optimization process comprises the following steps:
acquiring a complete video, wherein the complete video comprises a video history fragment and a real future fragment;
modeling the sequence dynamic relationship of the complete video, and learning to obtain a teacher model;
distilling the characteristic knowledge and the relation knowledge of the teacher model into the graph convolution neural network respectively;
And learning complementary information among the video features of each mode through feature mutual learning and relationship mutual learning to obtain the optimized graph convolution neural network.
In a second aspect, the present invention also provides a video behavior prediction system, including:
the acquisition module is used for acquiring a target video to be predicted;
The behavior prediction module is used for inputting the target video into a video behavior prediction model to obtain a behavior prediction result output by the video behavior prediction model;
the video behavior prediction model is used for carrying out dynamic relation modeling on historical moment characteristics of a target video and predicted state characteristics of future moments through a graph convolution neural network, optimizing the graph convolution neural network through knowledge distillation, and obtaining a video behavior prediction result by fusing the multi-mode characteristics after dynamic relation modeling based on the optimized graph convolution neural network.
In a third aspect, the present invention also provides an electronic device, including a memory, a processor, and a computer program stored on the memory and executable on the processor, where the processor implements the steps of any one of the video behavior prediction methods described above when the program is executed.
In a fourth aspect, the present invention also provides a non-transitory computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of any of the video behavior prediction methods described above.
According to the video behavior prediction method, the system, the electronic equipment and the storage medium, dynamic relation modeling is carried out on the historical moment characteristics of the target video and the predicted state characteristics of the future moment, so that the dynamic relation between the historical fragments and the future fragments in the video can be effectively inferred, the multi-mode dynamic relation changes of the historical fragments and the future fragments in the video can be effectively captured, and finally, the future occurrence behavior of the video can be predicted more accurately through the graph convolution neural network after knowledge distillation optimization.
Drawings
In order to more clearly illustrate the invention or the technical solutions of the prior art, the following description will briefly explain the drawings used in the embodiments or the description of the prior art, and it is obvious that the drawings in the following description are some embodiments of the invention, and other drawings can be obtained according to the drawings without inventive effort for a person skilled in the art.
FIG. 1 is a schematic flow chart of a video behavior prediction method provided by the invention;
FIG. 2 is a schematic diagram of a training flow of a video behavior prediction model;
FIG. 3 is a schematic diagram of training principles of a video behavior prediction model;
FIG. 4 is a schematic diagram of a video behavior prediction system according to the present invention;
Fig. 5 is a schematic structural diagram of an electronic device according to the present invention.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the present invention more apparent, the technical solutions of the present invention will be clearly and completely described below with reference to the accompanying drawings, and it is apparent that the described embodiments are some embodiments of the present invention, not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
Fig. 1 shows a video behavior prediction method provided by an embodiment of the present invention, including:
S110: acquiring a target video to be predicted;
S120: inputting the target video into a video behavior prediction model to obtain a behavior prediction result output by the video behavior prediction model;
the video behavior prediction model is used for carrying out dynamic relation modeling on historical moment characteristics of a target video and predicted state characteristics of future moments through a graph convolution neural network, optimizing the graph convolution neural network through knowledge distillation, and obtaining a video behavior prediction result by fusing the dynamic relation modeled multi-mode characteristics based on the optimized graph convolution neural network.
Referring to fig. 2 and fig. 3, the training process of the video behavior prediction model specifically includes:
S210: and a feature extraction step, namely extracting the historical moment features of the observed video in the training set, carrying out multi-mode feature learning on the historical moment features, and predicting to obtain the state features at the future moment.
The multi-mode feature learning method mainly considers multi-mode information of videos and carries out multi-mode feature learning on observation videos in a training set. For each modality feature of the observed video, a context sequence dependency is modeled and state features at future times are predicted.
Specifically, first, for each video in the training data set, the RGB visual features of the video are extracted using a convolutional neural network and denoted $\{r_1, r_2, \dots, r_l\}$ with $r_i \in \mathbb{R}^{D_r}$, and the optical flow features are denoted $\{f_1, f_2, \dots, f_l\}$ with $f_i \in \mathbb{R}^{D_f}$; the features of the target objects are extracted using Faster R-CNN and denoted $\{o_1, o_2, \dots, o_l\}$ with $o_i \in \mathbb{R}^{D_o}$, where $D_r$, $D_f$ and $D_o$ represent the dimensions of the RGB visual features, optical flow features and target object features respectively, $i$ is the index of a video clip, and the input video contains $l$ clips in total.
Then, three gated recurrent unit (Gated Recurrent Unit, GRU) networks perform sequential context modeling on the RGB visual features $\{r_1, r_2, \dots, r_l\}$, the optical flow features $\{f_1, f_2, \dots, f_l\}$ and the target object features $\{o_1, o_2, \dots, o_l\}$ respectively, mapping the three kinds of features to a unified dimension $D_h$. Applying this processing to the three modality features, the corresponding historical-time features are:

$$h_i^r = \mathrm{GRU}_r\big(r_i, h_{i-1}^r\big),\quad h_i^f = \mathrm{GRU}_f\big(f_i, h_{i-1}^f\big),\quad h_i^o = \mathrm{GRU}_o\big(o_i, h_{i-1}^o\big)$$

where $h_i^r, h_i^f, h_i^o \in \mathbb{R}^{D_h}$.
Finally, three progressive gated recurrent units (Progressive Gated Recurrent Unit, PGRU) are designed to predict the multi-modal video features of the future time node:

$$\hat{h}_{l+1}^r = \mathrm{PGRU}_r\big(h_l^r\big),\quad \hat{h}_{l+1}^f = \mathrm{PGRU}_f\big(h_l^f\big),\quad \hat{h}_{l+1}^o = \mathrm{PGRU}_o\big(h_l^o\big)$$

where $\hat{h}_{l+1}^r$, $\hat{h}_{l+1}^f$ and $\hat{h}_{l+1}^o$ are the three predicted modality features at the future time.
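The GRU-based context modeling above can be sketched in plain NumPy. The gate equations below are the standard GRU formulation; the weight shapes, dimensions, and random initialization are illustrative assumptions, and the patent's PGRU variant is not reproduced here.

```python
import numpy as np

def gru_cell(x, h, Wz, Uz, Wr, Ur, Wh, Uh):
    """One standard GRU step: gates computed from input x and previous hidden state h."""
    sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))
    z = sigmoid(Wz @ x + Uz @ h)           # update gate
    r = sigmoid(Wr @ x + Ur @ h)           # reset gate
    h_tilde = np.tanh(Wh @ x + Uh @ (r * h))
    return (1 - z) * h + z * h_tilde

def encode_modality(segments, params, d_h):
    """Run a GRU over one modality's segment features, mapping each to dimension D_h."""
    h = np.zeros(d_h)
    states = []
    for x in segments:
        h = gru_cell(x, h, *params)
        states.append(h)
    return np.stack(states)                # hidden states h_1 .. h_l

rng = np.random.default_rng(0)
d_in, d_h, l = 8, 4, 5                     # toy dimensions: D_r = 8, D_h = 4, l = 5 clips
params = tuple(rng.standard_normal(shape) * 0.1
               for shape in [(d_h, d_in), (d_h, d_h)] * 3)
rgb_segments = rng.standard_normal((l, d_in))   # stand-in for CNN features r_1..r_l
H = encode_modality(rgb_segments, params, d_h)
print(H.shape)                             # one D_h-dim hidden state per observed segment
```

The same encoder would be instantiated three times (RGB, optical flow, target objects), each with its own parameters, as the text describes.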
S220: and a dynamic relation modeling step, namely carrying out dynamic relation modeling on the historical moment characteristics of the observed video and the state characteristics of the future moment through a pre-constructed graph convolution neural network to obtain updated graph node characteristics of each mode.
The method uses a graph convolution neural network (Graph Convolutional Networks, GCN) to model the dynamic relationship between the historical moment characteristics in the observed video and the predicted state characteristics of the future moment, and further infers the impending behavior of the future moment.
Specifically, the operation of the GCN in this embodiment is defined as:

$$\mathrm{GCN}(X) = \mathrm{ReLU}(AXW)$$

where $X$ is the input matrix formed by arranging all node features in the graph network, $A$ is the adjacency matrix describing the node-to-node relations in the graph convolutional neural network, $W$ is the network parameter of the GCN, and ReLU is a nonlinear activation function.
The hidden state features of the modeled video segments are taken as graph network nodes to form the node feature matrices of the three modalities, $H^r = [h_1^r; \dots; h_l^r; \hat{h}_{l+1}^r]$, $H^f = [h_1^f; \dots; h_l^f; \hat{h}_{l+1}^f]$ and $H^o = [h_1^o; \dots; h_l^o; \hat{h}_{l+1}^o]$. The dynamic node relations are then calculated from the node features:

$$A_r(i,j) = \mathrm{softmax}_j\big((h_i^r)^{\top} h_j^r\big),\quad A_f(i,j) = \mathrm{softmax}_j\big((h_i^f)^{\top} h_j^f\big),\quad A_o(i,j) = \mathrm{softmax}_j\big((h_i^o)^{\top} h_j^o\big)$$

where $A_r(i,j)$, $A_f(i,j)$ and $A_o(i,j)$ respectively represent the relation between the $i$-th and $j$-th nodes in the relation graphs of the three modalities.
Finally, the node features of the three modality feature graphs are updated using three GCN branches, one per modality:

$$H^{r\prime} = \mathrm{GCN}_r\big(H^r\big),\quad H^{f\prime} = \mathrm{GCN}_f\big(H^f\big),\quad H^{o\prime} = \mathrm{GCN}_o\big(H^o\big)$$

where $H^{r\prime}$, $H^{f\prime}$ and $H^{o\prime}$ are the updated graph node features of the three modalities.
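A minimal NumPy sketch of one per-modality graph update. The layer follows the ReLU(A·X·W) form defined above; the patent's exact node-similarity function is not recoverable from this text, so the row-softmax over dot-product similarities below is an assumed, plausible choice.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def dynamic_adjacency(H):
    """Assumed relation: row-softmax over pairwise dot-product node similarities."""
    return softmax(H @ H.T, axis=1)

def gcn_layer(A, X, W):
    """Graph convolution as defined above: ReLU(A X W)."""
    return np.maximum(A @ X @ W, 0.0)

rng = np.random.default_rng(1)
n_nodes, d_h = 6, 4                        # l observed nodes plus 1 predicted future node
H = rng.standard_normal((n_nodes, d_h))    # node feature matrix for one modality
A = dynamic_adjacency(H)                   # dynamic node relations A(i, j)
W = rng.standard_normal((d_h, d_h)) * 0.1
H_updated = gcn_layer(A, H, W)             # updated graph node features
print(A.sum(axis=1))                       # each row of A sums to 1 (a distribution over nodes)
```

The same update would be applied independently to the RGB, optical flow, and target object graphs.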
S230: a network optimization step, namely acquiring a complete video, carrying out sequence dynamic relationship modeling on the complete video, respectively distilling the feature knowledge and the relationship knowledge of the complete video into a graph convolution neural network, and carrying out multi-mode feature mutual learning and relationship mutual learning to obtain an optimized graph convolution neural network; wherein the complete video contains a video history segment and a real future segment.
In this step, the complete video comprising the video history segments and the real future segments is used to perform sequential dynamic relationship modeling and to learn the teacher (Teacher) model, and the relation knowledge of the teacher network is distilled into the graph convolutional neural network, i.e., the student (Student) model. The future node features in the teacher model are calculated based on the real future video clips rather than predicted by the PGRU in S210 above.
Specifically, this embodiment distills the knowledge of the teacher model into the graph convolutional neural network for behavior prediction in S220 using two knowledge distillation strategies: feature distillation and relation distillation.
The loss function of the feature distillation strategy is the 2-norm difference between the graph node features obtained in the teacher and student models:

$$L_{kd\_fea} = \big\|H^{r\prime} - H^{r\prime}_{T}\big\|_2 + \big\|H^{f\prime} - H^{f\prime}_{T}\big\|_2 + \big\|H^{o\prime} - H^{o\prime}_{T}\big\|_2$$

where $H^{r\prime}$, $H^{f\prime}$ and $H^{o\prime}$ are the graph node features obtained from the graph convolutional neural network for behavior prediction (i.e. the student model), and $H^{r\prime}_{T}$, $H^{f\prime}_{T}$ and $H^{o\prime}_{T}$ are the graph node features obtained from the teacher model.
The loss function of the relation distillation strategy is the Kullback-Leibler divergence between the graph relation matrices obtained in the teacher and student models:

$$L_{kd\_rel} = D_{KL}\big(A_r^{T}, A_r\big) + D_{KL}\big(A_f^{T}, A_f\big) + D_{KL}\big(A_o^{T}, A_o\big)$$

where $A_r$, $A_f$ and $A_o$ are the graph relation matrices obtained by the student model, and $A_r^{T}$, $A_f^{T}$ and $A_o^{T}$ are the graph relation matrices obtained by the teacher model.
Specifically, the Kullback-Leibler divergence is computed as:

$$D_{KL}(p, q) = E\big[\log(p) - \log(q)\big] \tag{16}$$
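Equation (16) can be realized for discrete distributions as follows; the small clipping constant is an implementation detail added here to avoid taking the logarithm of zero.

```python
import numpy as np

def kl_divergence(p, q, eps=1e-12):
    """D_KL(p, q) = E_p[log(p) - log(q)] for discrete probability vectors p, q."""
    p = np.clip(p, eps, None)
    q = np.clip(q, eps, None)
    return float(np.sum(p * (np.log(p) - np.log(q))))

p = np.array([0.7, 0.2, 0.1])
q = np.array([0.5, 0.3, 0.2])
print(kl_divergence(p, p))       # 0.0: a distribution has zero divergence from itself
print(kl_divergence(p, q) > 0)   # True: the divergence is non-negative
```

Applied row-wise to the graph relation matrices, this yields the relation distillation and relation mutual learning terms.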
Meanwhile, the complementary information among the three video modalities is learned through two multi-modal mutual learning strategies: feature mutual learning and relation mutual learning. The feature mutual learning loss function is the 2-norm difference between the graph node features obtained by the per-modality graph convolutional networks:

$$L_{mu\_fea} = \big\|H^{r\prime} - H^{f\prime}\big\|_2 + \big\|H^{r\prime} - H^{o\prime}\big\|_2 + \big\|H^{f\prime} - H^{o\prime}\big\|_2$$
The loss function of the relation mutual learning is the Kullback-Leibler divergence among the graph relation matrices obtained by the per-modality graph convolutional networks:

$$L_{mu\_rel} = D_{KL}(A_r, A_f) + D_{KL}(A_f, A_r) + D_{KL}(A_r, A_o) + D_{KL}(A_o, A_r) + D_{KL}(A_f, A_o) + D_{KL}(A_o, A_f) \tag{18}$$
Through the knowledge distillation and mutual learning processes, the graph convolution neural network for behavior prediction can be optimized, and the accuracy and reliability of network processing data are improved.
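The feature and relation distillation terms described above can be sketched with two small helper functions. The toy teacher/student features and relation matrices below are fabricated for illustration only; the mutual learning terms would reuse the same two functions across modality pairs rather than teacher/student pairs.

```python
import numpy as np

def feature_loss(h_student, h_teacher):
    """2-norm gap between student and teacher graph-node features (feature distillation)."""
    return float(np.linalg.norm(h_student - h_teacher))

def relation_loss(A_student, A_teacher, eps=1e-12):
    """Element-wise KL divergence from the teacher relation matrix to the student's."""
    P = np.clip(A_teacher, eps, None)
    Q = np.clip(A_student, eps, None)
    return float(np.sum(P * (np.log(P) - np.log(Q))))

rng = np.random.default_rng(2)
h_s = rng.standard_normal(4)
h_t = h_s + 0.1 * rng.standard_normal(4)   # teacher features come from the real future clip
A_s = np.full((3, 3), 1 / 3)               # student relations: uniform rows
A_t = np.eye(3) * 0.7 + 0.1                # teacher relations: peaked rows, each summing to 1
l_fea = feature_loss(h_s, h_t)
l_rel = relation_loss(A_s, A_t)
print(l_fea >= 0, l_rel >= 0)              # both distillation losses are non-negative
```

Minimizing both terms pulls the student's node features and relation matrices toward those the teacher computed from the complete video.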
S240: and a feature fusion step, namely fusing updated graph node features of all modes based on the optimized graph convolution neural network to obtain a video behavior prediction result.
In this step, the multi-modal prediction results obtained in S220 are fused, learned and optimized under a unified framework, and the video behavior prediction result is finally output.
The multi-modal fusion strategy involved in the feature fusion is as follows:

$$y = \sum_{m \in \{r, f, o\}} \mathrm{softmax}\big(W_m h_{l+1}^{m} + b_m\big)$$

where $y$ is the final predicted probability distribution of future behavior, $m$ is the video feature modality, $h_{l+1}^{m}$ is the updated graph node feature at time $l+1$ obtained in S220, $W_m$ is the weight parameter of the behavior classifier, and $b_m$ is the bias parameter of the behavior classifier.
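A sketch of the fusion step, assuming each modality's future graph-node feature passes through its own softmax classifier (W_m, b_m) and the per-modality outputs are averaged; the averaging, which keeps the output a probability distribution, is an assumption about the exact normalization.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def fuse(future_nodes, classifiers):
    """Average the per-modality classifier outputs softmax(W_m h_m + b_m)."""
    probs = [softmax(W @ h + b) for h, (W, b) in zip(future_nodes, classifiers)]
    return np.mean(probs, axis=0)

rng = np.random.default_rng(3)
d_h, n_classes = 4, 5
future_nodes = [rng.standard_normal(d_h) for _ in range(3)]   # RGB / flow / object nodes
classifiers = [(rng.standard_normal((n_classes, d_h)), np.zeros(n_classes))
               for _ in range(3)]                             # one (W_m, b_m) per modality
y = fuse(future_nodes, classifiers)
print(np.isclose(y.sum(), 1.0))          # True: the fused output is a valid distribution
```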
The learning optimization loss function of the unified framework is as follows:

$$L = L_{ce}\big(y, \hat{y}\big) + L_{kd\_fea} + L_{kd\_rel} + L_{mu\_fea} + L_{mu\_rel}$$

where $L_{ce}$ is the cross entropy loss function, $\hat{y}$ is the true future behavior label, and $L_{kd\_fea}$, $L_{kd\_rel}$, $L_{mu\_fea}$ and $L_{mu\_rel}$ are the loss functions in knowledge distillation and multi-modal mutual learning.
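The unified objective can be sketched as a cross-entropy term plus the four auxiliary losses. The equal weighting of the auxiliary terms is an assumption, since the text does not give trade-off coefficients, and the numeric loss values below are fabricated placeholders.

```python
import numpy as np

def cross_entropy(y_pred, label, eps=1e-12):
    """L_ce for a one-hot ground-truth future behavior label."""
    return float(-np.log(max(y_pred[label], eps)))

def total_loss(y_pred, label, l_kd_fea, l_kd_rel, l_mu_fea, l_mu_rel,
               weights=(1.0, 1.0, 1.0, 1.0)):
    """Unified objective: classification loss plus weighted auxiliary terms (weights assumed)."""
    aux = (l_kd_fea, l_kd_rel, l_mu_fea, l_mu_rel)
    return cross_entropy(y_pred, label) + sum(w * l for w, l in zip(weights, aux))

y_pred = np.array([0.1, 0.6, 0.3])       # fused prediction y from the feature fusion step
loss = total_loss(y_pred, label=1, l_kd_fea=0.2, l_kd_rel=0.05,
                  l_mu_fea=0.1, l_mu_rel=0.02)
print(round(loss, 4))                    # -log(0.6) + 0.37 ≈ 0.8808
```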
Although the graph convolutional neural network has achieved great success in dynamic relationship modeling, few approaches have applied the GCN to video-based behavior prediction. In order to enable the GCN to effectively model the relation between the past and future behaviors of a video, and to fully utilize complete video segments to learn the dynamic association between past and future behaviors, the embodiment of the invention fully considers three aspects, namely multi-modal feature learning, global relation modeling and relation knowledge distillation from complete video segments, and provides the above video behavior prediction method.
The video behavior prediction system provided by the invention is described below, and the video behavior prediction system described below and the video behavior prediction method described above can be referred to correspondingly.
Referring to fig. 4, a video behavior prediction system provided by an embodiment of the present invention includes:
an obtaining module 410, configured to obtain a target video to be predicted;
the behavior prediction module 420 is configured to input the target video to the video behavior prediction model, and obtain a behavior prediction result output by the video behavior prediction model;
the video behavior prediction model is used for carrying out dynamic relation modeling on historical moment characteristics of a target video and predicted state characteristics of future moments through a graph convolution neural network, optimizing the graph convolution neural network through knowledge distillation, and obtaining a video behavior prediction result by fusing the dynamic relation modeled multi-mode characteristics based on the optimized graph convolution neural network.
The behavior prediction module 420 implements prediction of future behavior in video through a video behavior prediction model, specifically, regarding a training portion of the video behavior prediction model, including:
the feature extraction unit is used for extracting historical moment features of the observation video in the training set, carrying out multi-mode feature learning on the historical moment features, and predicting to obtain state features at future moments;
The dynamic relation modeling unit is used for carrying out dynamic relation modeling on the historical moment characteristics of the observed video and the state characteristics of the future moment through a pre-constructed graph convolution neural network to obtain updated graph node characteristics of each mode;
The network optimization unit is used for acquiring a complete video, carrying out sequence dynamic relationship modeling on the complete video, respectively distilling the feature knowledge and the relation knowledge of the complete video into the graph convolutional neural network, and carrying out multi-modal feature mutual learning and relation mutual learning to obtain an optimized graph convolutional neural network; wherein the complete video comprises a video history segment and a real future segment;
And the feature fusion unit is used for fusing the updated graph node features of each mode based on the optimized graph convolution neural network to obtain a video behavior prediction result.
It can be understood that the feature extraction unit needs to extract the historical moment features of the observation video in the training set, where the historical moment features include video features of multiple modes, and in this embodiment, features of three modes including RGB visual features, optical flow features and target object features are used; then respectively carrying out sequence context modeling on video features of each mode in the historical moment features, and mapping the video features of each mode to the same dimension; and finally, predicting and obtaining the state characteristics of the future moment according to the sequence context modeling and the video characteristics of each mode after the dimension is unified.
It can be appreciated that the dynamic relationship modeling unit first needs to build a graph convolution neural network for behavior prediction; then taking the hidden state characteristics of the video clips in the observed video as graph network nodes, and constructing a node characteristic matrix corresponding to each mode video characteristic; then, respectively calculating dynamic node relations corresponding to the video features of each mode according to the node feature matrix; and finally, respectively updating the graph node characteristics in the characteristic graphs corresponding to the video characteristics of each mode according to the node characteristic matrix and the dynamic node relation.
It can be appreciated that the network optimization unit first obtains a complete video including a video history segment and a real future segment; then, modeling the sequence dynamic relationship of the complete video, and learning to obtain a teacher model; then, respectively distilling the characteristic knowledge and the relation knowledge of the teacher model into a graph convolution neural network; and finally, learning complementary information among the video features of each mode through feature mutual learning and relationship mutual learning to obtain the optimized graph convolution neural network.
According to the video behavior prediction system provided by the embodiment of the invention, the behavior prediction module is used for carrying out dynamic relation modeling on the historical moment characteristics of the target video and the predicted state characteristics of the future moment, so that the dynamic relation between the historical fragments and the future fragments in the video can be effectively inferred, the multi-mode dynamic relation changes of the historical fragments and the future fragments in the video can be effectively captured, and finally, the future occurrence behavior of the video can be predicted more accurately through the graph convolution neural network after knowledge distillation optimization.
Fig. 5 illustrates a physical schematic diagram of an electronic device, as shown in fig. 5, which may include: processor 510, communication interface (Communications Interface) 520, memory 530, and communication bus 540, wherein processor 510, communication interface 520, memory 530 complete communication with each other through communication bus 540. Processor 510 may invoke logic instructions in memory 530 to perform a video behavior prediction method comprising: acquiring a target video to be predicted; inputting the target video into a video behavior prediction model to obtain a behavior prediction result output by the video behavior prediction model; the video behavior prediction model is used for carrying out dynamic relation modeling on historical moment characteristics of a target video and predicted state characteristics of future moments through a graph convolution neural network, optimizing the graph convolution neural network through knowledge distillation, and obtaining a video behavior prediction result by fusing the dynamic relation modeled multi-mode characteristics based on the optimized graph convolution neural network.
Further, the logic instructions in the memory 530 described above may be implemented in the form of software functional units and may be stored in a computer-readable storage medium when sold or used as a stand-alone product. Based on this understanding, the technical solution of the present invention may be embodied essentially or in a part contributing to the prior art or in a part of the technical solution, in the form of a software product stored in a storage medium, comprising several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to perform all or part of the steps of the method according to the embodiments of the present invention. And the aforementioned storage medium includes: a usb disk, a removable hard disk, a Read-Only Memory (ROM), a random access Memory (RAM, random Access Memory), a magnetic disk, or an optical disk, or other various media capable of storing program codes.
In another aspect, the present invention also provides a computer program product comprising a computer program stored on a non-transitory computer-readable storage medium, the computer program comprising program instructions which, when executed by a computer, enable the computer to perform the video behavior prediction method provided above, the method comprising: acquiring a target video to be predicted; inputting the target video into a video behavior prediction model to obtain a behavior prediction result output by the video behavior prediction model; the video behavior prediction model is used for carrying out dynamic relation modeling on historical moment characteristics of a target video and predicted state characteristics of future moments through a graph convolution neural network, optimizing the graph convolution neural network through knowledge distillation, and obtaining a video behavior prediction result by fusing the multi-mode characteristics after dynamic relation modeling based on the optimized graph convolution neural network.
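The knowledge-distillation step mentioned in the method summary (distilling feature knowledge and relation knowledge from a teacher modeled on the complete video into the prediction network) can be sketched with two loss terms. This is a minimal numpy sketch under assumed shapes; the MSE/KL formulation and all names are illustrative assumptions, not the patent's exact losses.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def distillation_losses(H_student, H_teacher):
    """Feature- and relation-distillation terms.

    H_student: (T, d) graph node features from the prediction network
               (observed segments + predicted future states).
    H_teacher: (T, d) features from a teacher modeled on the complete
               video (history segments + real future segments).
    Feature loss: mean-squared error between node features.
    Relation loss: KL divergence between the row-normalized pairwise
    affinity matrices (the "relation knowledge"), averaged over rows.
    """
    feat_loss = np.mean((H_student - H_teacher) ** 2)
    P = softmax(H_teacher @ H_teacher.T)   # teacher node relations
    Q = softmax(H_student @ H_student.T)   # student node relations
    rel_loss = np.sum(P * (np.log(P + 1e-12) - np.log(Q + 1e-12))) / len(P)
    return feat_loss, rel_loss

rng = np.random.default_rng(1)
H_teacher = rng.standard_normal((6, 16))
H_student = H_teacher + 0.1 * rng.standard_normal((6, 16))
f, r = distillation_losses(H_student, H_teacher)
print(f"feat_loss={f:.4f} rel_loss={r:.4f}")
```

During training both terms would be added to the behavior-classification loss; the teacher sees the real future segments, while the student only observes the history, which is what makes the transfer useful for prediction.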
In yet another aspect, the present invention also provides a non-transitory computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the video behavior prediction method provided above, the method comprising: acquiring a target video to be predicted; inputting the target video into a video behavior prediction model to obtain a behavior prediction result output by the video behavior prediction model; the video behavior prediction model is used for carrying out dynamic relation modeling on historical moment characteristics of a target video and predicted state characteristics of future moments through a graph convolution neural network, optimizing the graph convolution neural network through knowledge distillation, and obtaining a video behavior prediction result by fusing the multi-mode characteristics after dynamic relation modeling based on the optimized graph convolution neural network.
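The final fusion of the per-modality graph node features described above (and spelled out in claim 4) can be sketched as late fusion of per-modality behavior classifiers. A minimal numpy sketch, assuming softmax classifiers whose outputs are averaged; the modality names, class count, and averaging choice are illustrative assumptions.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax for a 1-D score vector."""
    e = np.exp(x - x.max())
    return e / e.sum()

def fuse_modalities(h, params):
    """Late fusion of per-modality graph node features.

    h: dict modality -> (d,) final node feature h_m^(l+1)
    params: dict modality -> (W_m, b_m) behavior-classifier parameters
    Returns class probabilities averaged over modalities.
    """
    probs = [softmax(W @ h[m] + b) for m, (W, b) in params.items()]
    return np.mean(probs, axis=0)

rng = np.random.default_rng(2)
mods = ["rgb", "flow", "obj"]  # RGB visual, optical-flow, target-object features
h = {m: rng.standard_normal(16) for m in mods}
params = {m: (rng.standard_normal((10, 16)), rng.standard_normal(10))
          for m in mods}
y = fuse_modalities(h, params)   # probability of each future behavior class
print(round(float(y.sum()), 6))  # 1.0
```

The averaged vector remains a valid probability distribution over behavior classes, so the most likely future behavior is simply its argmax.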
The apparatus embodiments described above are merely illustrative, wherein the units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed over a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. Those of ordinary skill in the art can understand and implement the invention without creative effort.
From the above description of the embodiments, it will be apparent to those skilled in the art that the embodiments may be implemented by means of software plus a necessary general hardware platform, or of course by means of hardware. Based on this understanding, the foregoing technical solution, in essence, or the part thereof contributing to the prior art, may be embodied in the form of a software product, which may be stored in a computer-readable storage medium, such as a ROM/RAM, a magnetic disk, or an optical disk, and which includes several instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the method described in the respective embodiments or some parts of the embodiments.
Finally, it should be noted that: the above embodiments are only for illustrating the technical solution of the present invention, and are not limiting; although the invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims (9)

1. A method for predicting video behavior, comprising:
Acquiring a target video to be predicted;
Inputting the target video into a video behavior prediction model to obtain a behavior prediction result output by the video behavior prediction model;
The video behavior prediction model is used for carrying out dynamic relation modeling on historical moment characteristics of a target video and predicted state characteristics of future moments through a graph convolution neural network, optimizing the graph convolution neural network through knowledge distillation, and obtaining a video behavior prediction result by fusing the multi-mode characteristics after dynamic relation modeling based on the optimized graph convolution neural network;
The training process of the video behavior prediction model comprises the following steps:
Feature extraction: extracting historical moment characteristics of an observation video in a training set, carrying out multi-mode characteristic learning on the historical moment characteristics, and predicting to obtain state characteristics of future moments;
Modeling of dynamic relationship: carrying out dynamic relation modeling on the historical moment characteristics of the observed video and the state characteristics of the future moment through a pre-constructed graph convolution neural network to obtain updated graph node characteristics of each mode;
network optimization: obtaining a complete video, carrying out sequence dynamic relation modeling on the complete video, respectively distilling the characteristic knowledge and the relation knowledge of the complete video into the graph convolution neural network, and carrying out multi-mode characteristic mutual learning and relation mutual learning to obtain an optimized graph convolution neural network; wherein the complete video comprises a video history segment and a real future segment;
Feature fusion: and based on the optimized graph convolution neural network, fusing graph node characteristics of each updated mode to obtain a video behavior prediction result.
2. A method of predicting video behavior as recited in claim 1, wherein the feature extraction process comprises:
extracting historical moment characteristics of observed videos in a training set, wherein the historical moment characteristics comprise video characteristics of multiple modes;
Respectively carrying out sequence context modeling on video features of each mode in the historical moment features, and mapping the video features of each mode to the same dimension;
And predicting and obtaining state characteristics at future time according to the sequence context modeling and the video characteristics of each mode after dimension unification.
3. A method of predicting video behavior according to claim 2, wherein the video features of the plurality of modalities include: RGB visual features, optical flow features, and target object features.
4. A video behavior prediction method according to claim 3, wherein the updated graph node features of each mode are fused, and the fusion expression is:
y = Σ_{m∈M} softmax(W_m · h_m^{(l+1)} + b_m)
where y is the probability of occurrence of the final predicted future behavior, m is the video feature modality, h_m^{(l+1)} is the graph node feature of modality m at layer l+1, W_m is the weight parameter of the behavior classifier, and b_m is the bias parameter of the behavior classifier.
5. A method of video behavior prediction according to claim 1, wherein the process of dynamic relationship modeling comprises:
Establishing a graph convolution neural network for behavior prediction;
taking hidden state features of video clips in the observed video as graph network nodes, and constructing a node feature matrix corresponding to each mode video feature;
respectively calculating dynamic node relations corresponding to the video features of each mode according to the node feature matrix;
And respectively updating graph node characteristics in the characteristic graphs corresponding to the video characteristics of each mode according to the node characteristic matrix and the dynamic node relation.
6. A video behavior prediction method according to claim 1, wherein the network optimization process comprises:
acquiring a complete video, wherein the complete video comprises a video history fragment and a real future fragment;
modeling the sequence dynamic relationship of the complete video, and learning to obtain a teacher model;
distilling the characteristic knowledge and the relation knowledge of the teacher model into the graph convolution neural network respectively;
And learning complementary information among the video features of each mode through feature mutual learning and relationship mutual learning to obtain the optimized graph convolution neural network.
7. A video behavior prediction system, comprising:
the acquisition module is used for acquiring a target video to be predicted;
The behavior prediction module is used for inputting the target video into a video behavior prediction model to obtain a behavior prediction result output by the video behavior prediction model;
The video behavior prediction model is used for carrying out dynamic relation modeling on historical moment characteristics of a target video and predicted state characteristics of future moments through a graph convolution neural network, optimizing the graph convolution neural network through knowledge distillation, and obtaining a video behavior prediction result by fusing the multi-mode characteristics after dynamic relation modeling based on the optimized graph convolution neural network;
The training process of the video behavior prediction model comprises the following steps:
Feature extraction: extracting historical moment characteristics of an observation video in a training set, carrying out multi-mode characteristic learning on the historical moment characteristics, and predicting to obtain state characteristics of future moments;
Modeling of dynamic relationship: carrying out dynamic relation modeling on the historical moment characteristics of the observed video and the state characteristics of the future moment through a pre-constructed graph convolution neural network to obtain updated graph node characteristics of each mode;
network optimization: obtaining a complete video, carrying out sequence dynamic relation modeling on the complete video, respectively distilling the characteristic knowledge and the relation knowledge of the complete video into the graph convolution neural network, and carrying out multi-mode characteristic mutual learning and relation mutual learning to obtain an optimized graph convolution neural network; wherein the complete video comprises a video history segment and a real future segment;
Feature fusion: and based on the optimized graph convolution neural network, fusing graph node characteristics of each updated mode to obtain a video behavior prediction result.
8. An electronic device comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, characterized in that the processor implements the steps of the video behavior prediction method of any one of claims 1 to 6 when executing the program.
9. A non-transitory computer readable storage medium having stored thereon a computer program, which when executed by a processor implements the steps of the video behavior prediction method of any one of claims 1 to 6.
CN202110950812.2A 2021-08-18 2021-08-18 Video behavior prediction method, system, electronic device and storage medium Active CN113705402B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110950812.2A CN113705402B (en) 2021-08-18 2021-08-18 Video behavior prediction method, system, electronic device and storage medium

Publications (2)

Publication Number Publication Date
CN113705402A CN113705402A (en) 2021-11-26
CN113705402B true CN113705402B (en) 2024-08-13

Family

ID=78653369

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110950812.2A Active CN113705402B (en) 2021-08-18 2021-08-18 Video behavior prediction method, system, electronic device and storage medium

Country Status (1)

Country Link
CN (1) CN113705402B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114495278A (en) * 2022-01-27 2022-05-13 苏州大学 Unknown category-oriented action prediction method
CN114310917B (en) * 2022-03-11 2022-06-14 山东高原油气装备有限公司 Oil pipe transfer robot joint track error compensation method

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112541440A (en) * 2020-12-16 2021-03-23 中电海康集团有限公司 Subway pedestrian flow network fusion method based on video pedestrian recognition and pedestrian flow prediction method

Family Cites Families (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109784997B (en) * 2019-01-11 2022-07-01 重庆邮电大学 Short video active user prediction method based on big data
US11790213B2 (en) * 2019-06-12 2023-10-17 Sri International Identifying complex events from hierarchical representation of data set features
CN110324728B (en) * 2019-06-28 2021-11-23 浙江传媒学院 Sports event full-field review short video generation method based on deep reinforcement learning
CN111079601A (en) * 2019-12-06 2020-04-28 中国科学院自动化研究所 Video content description method, system and device based on multi-mode attention mechanism
CN111259804B (en) * 2020-01-16 2023-03-14 合肥工业大学 Multi-modal fusion sign language recognition system and method based on graph convolution
CN111488815B (en) * 2020-04-07 2023-05-09 中山大学 Event prediction method based on graph convolution network and long-short-time memory network
CN111860353A (en) * 2020-07-23 2020-10-30 北京以萨技术股份有限公司 Video behavior prediction method, device and medium based on double-flow neural network
CN112183391A (en) * 2020-09-30 2021-01-05 中国科学院计算技术研究所 First-view video behavior prediction system and method
CN112668438A (en) * 2020-12-23 2021-04-16 深圳壹账通智能科技有限公司 Infrared video time sequence behavior positioning method, device, equipment and storage medium

Similar Documents

Publication Publication Date Title
CN111651671B (en) User object recommendation method, device, computer equipment and storage medium
CN111741330A (en) Video content evaluation method and device, storage medium and computer equipment
CN109543112A (en) A kind of sequence of recommendation method and device based on cyclic convolution neural network
CN113705402B (en) Video behavior prediction method, system, electronic device and storage medium
CN113628059A (en) Associated user identification method and device based on multilayer graph attention network
CN114780831A (en) Sequence recommendation method and system based on Transformer
CN114580794B (en) Data processing method, apparatus, program product, computer device and medium
CN116561446B (en) Multi-mode project recommendation method, system and device and storage medium
CN112561031A (en) Model searching method and device based on artificial intelligence and electronic equipment
CN114492601A (en) Resource classification model training method and device, electronic equipment and storage medium
CN110245310B (en) Object behavior analysis method, device and storage medium
CN111897943A (en) Session record searching method and device, electronic equipment and storage medium
CN115062779A (en) Event prediction method and device based on dynamic knowledge graph
CN116484016B (en) Time sequence knowledge graph reasoning method and system based on automatic maintenance of time sequence path
CN116702784B (en) Entity linking method, entity linking device, computer equipment and storage medium
CN118115932A (en) Image regressor training method, related method, device, equipment and medium
CN114627085A (en) Target image identification method and device, storage medium and electronic equipment
CN111935259B (en) Method and device for determining target account set, storage medium and electronic equipment
Bordes et al. Evidential grammars: A compositional approach for scene understanding. Application to multimodal street data
CN111966889A (en) Method for generating graph embedding vector and method for generating recommended network model
CN113283394B (en) Pedestrian re-identification method and system integrating context information
CN116050508B (en) Neural network training method and device
CN115359654B (en) Updating method and device of flow prediction system
CN112699909B (en) Information identification method, information identification device, electronic equipment and computer readable storage medium
CN118132788A (en) Image text semantic matching method and system applied to image-text retrieval

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant