CN114724058A - Method for extracting key frames of fusion characteristic motion video based on human body posture recognition - Google Patents
- Publication number
- CN114724058A (application CN202210245767.5A)
- Authority
- CN
- China
- Prior art keywords
- frame
- video
- extracting
- motion
- fusion
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/25—Fusion techniques
- G06F18/253—Fusion techniques of extracted features
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- Data Mining & Analysis (AREA)
- Evolutionary Computation (AREA)
- Life Sciences & Earth Sciences (AREA)
- Artificial Intelligence (AREA)
- General Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Health & Medical Sciences (AREA)
- Software Systems (AREA)
- Molecular Biology (AREA)
- Computing Systems (AREA)
- Biophysics (AREA)
- Biomedical Technology (AREA)
- Mathematical Physics (AREA)
- Computational Linguistics (AREA)
- Health & Medical Sciences (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Bioinformatics & Computational Biology (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Evolutionary Biology (AREA)
- Image Analysis (AREA)
Abstract
The invention discloses a method for extracting key frames from a fused-feature motion video based on human body posture recognition, which comprises the following steps: S1, segmenting a target video clip frame by frame; S2, extracting static features with a residual network and performing dimensionality reduction on the data to obtain the static features of the video frames; S3, abstracting the skeleton data of the human body in three-dimensional space and extracting the motion features of the video frames to obtain motion features S_d; S4, performing linear weighted fusion of the extracted static features and motion features; S5, extracting global features from the fused features through a self-attention mechanism, calculating the importance of each video frame, selecting key frames for the corresponding actions through a Bernoulli function, and optimizing the result set with reinforcement learning.
Description
Technical Field
The invention relates to the field of video processing, in particular to a method for extracting a key frame of a motion video with fused features.
Background
A video is an image sequence: it contains far more content than a single image, is highly expressive, and carries a large amount of information. Video analysis is generally performed after the video has been decomposed into video frames, but video frames usually contain a great deal of redundancy, so extracting the key frames first and analyzing only those can effectively reduce computation time.
With the development of networks, multimedia information retrieval increasingly influences all areas of society. Traditional video retrieval can apply image-retrieval methods frame by frame, but this requires processing a large amount of image information and places a heavy burden on transmission and computation. In addition, now that home camera equipment is widespread, the monitored area often needs to be recorded, and storing the raw video occupies a large amount of space; storing the video as key frames preserves the authenticity of the video information while saving space to a great extent.
In a motion video, the state of the moving object changes frequently. Because of the diversity of moving targets and the similarity between motions, relying on motion features alone easily leads to missed detections and large deviations in feature extraction. This application therefore studies key frame extraction for motion video by means of feature fusion.
Disclosure of Invention
In order to solve the above problems, the present application provides a method for extracting key frames from a feature-fused motion video, which fuses static features and motion features and thereby improves the accuracy and completeness of key frame extraction to a certain extent.
In order to achieve the purpose, the technical scheme of the invention is as follows:
a method for extracting a fusion characteristic motion video key frame based on human body posture recognition comprises the following steps:
s1, performing frame-by-frame segmentation on a target video segment, and segmenting a video into a series of video frames;
S2, extracting static features by using a residual network, and performing dimensionality reduction on the data to obtain the static features of the video frames S_s = [S_s1, S_s2, …, S_sT];
S3, abstracting the skeleton data of the human body in three-dimensional space, and extracting the motion features of the video frames to obtain motion features S_d = [S_d1, S_d2, …, S_dT];
S4, performing linear weighted fusion of the extracted static features S_s and motion features S_d, S = m·S_s + n·S_d, where m and n are the weight factors of the static features and the motion features respectively;
and S5, extracting global features from the fused features through a self-attention mechanism, then calculating the importance of the video frame, extracting key frames of corresponding actions through a Bernoulli function, and optimizing a result set by using reinforcement learning.
Further preferably, the specific method of step S3 is as follows:
s31, extracting human skeleton for each frame in the video, and analyzing human posture by using a light-weight HRNet;
s32, using the coordinates and confidence degrees of the bone key points identified in each frame of the video as input, constructing a topological graph according to physical relations among bones, and then carrying out batch normalization processing on the topological graph;
S33, performing feature extraction on the processed data through a plurality of S-GCN units, assigning different weight coefficients to different trunks, and obtaining the feature representation of the video S_d = {S_d1, S_d2, …, S_dT}.
Further preferably, the specific method of step S31 is as follows:
s311, each sub-network of each branch at each stage comprises two residual blocks and a multi-resolution fusion module;
s312, replacing all residual blocks in the original network with a Shuffle module of the Shuffle Net, wherein the Shuffle module divides a channel into two parts, one part directly passes through the channel without any convolution operation, and the other part needs to be subjected to deep separable convolution;
S313, replacing the 1×1 convolution in the depthwise separable convolution with channel weighting, down-sampling by average pooling to the same size as the minimum resolution, performing channel-wise additive feature fusion on the processed feature maps of the i branches of different resolutions, obtaining a weight matrix W_t with an SE module, up-sampling W_t for each branch, restoring the original size, and weighting the channels.
Further preferably, the specific method of step S5 is as follows:
s51, modeling position information between video frames through bidirectional masks;
s52, after global context information of a video sequence is obtained, calculating a feature matching degree based on global correlation features, and then predicting an importance score of a video frame by adopting a full connection layer;
S53, after the frame score of each video frame is obtained, selecting key frames for the corresponding action through a Bernoulli distribution a_t ~ B(Y), where a_t represents the probability of taking the current frame as a key frame;
and S54, judging the quality of the extracted key frame result set by using reinforcement learning, where the state-action value is represented as the sum of the importance and the diversity of the result set; the importance of the result set is evaluated by how well the key frame set covers the complete video information, and the diversity is evaluated by the size of the feature-space differences between the selected frames.
Further preferably, the specific method of step S51 is as follows:
s511, the forward mask indicates that the attention weight is related to the calculation result before the current position, and the reverse mask indicates that the attention weight is related to the calculation result after the current position;
S512, inputting a T-frame video X = {x_i | i = 1, …, T}, each frame containing N key points, and calculating the correlation coefficients through the self-attention mechanism,
where t, i ∈ [0, T), U and V are the weight matrices of the two frames respectively, M is the position coding matrix, the forward mask retains the upper triangular information, the reverse mask retains the lower triangular information, λ is the eigenvalue of the fused feature matrix, s_t is the fused feature of the current frame, and s_i is the fused feature of a preceding or following frame;
S513, combining the correlation coefficients with the relative position information of the frames (representing the positional relationship with the preceding and following frames), fusing the forward and backward directions, and mapping back to the original video frame sequence to obtain a sequence c = {c_t | t = 1, …, T} containing context information.
Advantageous effects
(1) The invention provides a motion video key frame extraction technique that fuses features through human body posture recognition, spatial graph convolution and feature fusion, meeting the requirements on the accuracy and completeness of key frame extraction.
(2) By the proposed way of extracting video frame features, the static features obtained through human body posture recognition and the motion features obtained through spatial graph convolution are fused as the final video frame features for importance analysis, which effectively avoids missed and false detections.
(3) By replacing the residual modules and adding an attention mechanism, the invention makes HRNet lightweight, greatly reducing the amount of computation without losing accuracy.
Drawings
Fig. 1 is a schematic diagram of a stage of a method for extracting a feature-fused key frame of a motion video according to an embodiment of the present invention;
fig. 2 is a specific schematic diagram of a human body gesture recognition module of the method for extracting a feature-fused motion video key frame according to the embodiment of the present invention;
fig. 3 is a schematic diagram of a key frame extraction result of a method for extracting a feature-fused motion video key frame according to an embodiment of the present invention;
Detailed Description
The technical solution in the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention.
The invention provides a method for extracting key frames from a feature-fused motion video. As shown in Figure 1, it fuses the static features extracted by a lightweight human body posture recognition algorithm and the motion features extracted by spatial graph convolution, improving the accuracy and completeness of key frame detection. The specific embodiment comprises the following steps:
(1) The target video segment is segmented frame by frame into a series of video frames.
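As an illustration of step (1), the following minimal sketch decodes a video file into individual frames; OpenCV is assumed here, since the patent does not name a particular decoding library.

```python
# Sketch of step (1): split the target video segment into individual frames.
import cv2

def split_into_frames(video_path):
    """Decode a video file and return its frames as a list of BGR arrays."""
    capture = cv2.VideoCapture(video_path)
    frames = []
    while True:
        ok, frame = capture.read()
        if not ok:              # end of the stream
            break
        frames.append(frame)
    capture.release()
    return frames
```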
(2) In order to better retain the original information of the input image and reduce loss, the residual network ResNet50 is used for static feature extraction; the data dimensionality is reduced to 256, and the resulting static features of the video frames are represented as S_s = [S_s1, S_s2, …, S_sT].
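A minimal sketch of the static branch in step (2), assuming a PyTorch/torchvision ResNet-50 backbone and a learned linear projection for the reduction to 256 dimensions (the patent states the target dimensionality but not the reduction layer itself):

```python
# Per-frame static features S_s from ResNet-50, reduced to 256 dimensions.
import torch
import torch.nn as nn
from torchvision import models

class StaticFeatureExtractor(nn.Module):
    def __init__(self, out_dim=256):
        super().__init__()
        backbone = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
        # Drop the classification head and keep the 2048-d pooled features.
        self.backbone = nn.Sequential(*list(backbone.children())[:-1])
        self.reduce = nn.Linear(2048, out_dim)    # assumed dimensionality reduction

    def forward(self, frames):                    # frames: (T, 3, H, W)
        feats = self.backbone(frames).flatten(1)  # (T, 2048)
        return self.reduce(feats)                 # (T, 256) = S_s
```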
(3) The skeleton data of the human body is abstracted in three-dimensional space, human posture analysis is performed with the lightweight HRNet algorithm, and the motion features of the video frames are then extracted with an ST-GCN network, yielding motion features S_d = [S_d1, S_d2, …, S_dT].
The specific method of the step (3) is as follows:
(3.1) Human skeleton extraction is performed for each frame in the video. To improve accuracy without adding excessive computational burden, this application uses the lightweight HRNet for human posture analysis.
As shown in fig. 2, the specific method of step (3.1) is as follows:
(3.1.1) HRNet is considerably more accurate than other bottom-up algorithms, but its obvious drawbacks are a large number of parameters and a slow running speed; the method therefore makes lightweight improvements targeting these drawbacks to speed up the analysis.
(3.1.2) to make the model as light as possible, the depth and width of the original HRNet network are first reduced, reducing the sub-net of each branch at each stage into two residual blocks and one multi-resolution fusion module.
(3.1.3) All residual blocks in the original network are replaced by the Shuffle module of ShuffleNet. This module splits the channels into two parts: one part passes through directly without any convolution operation, and the other part undergoes a depthwise separable convolution.
(3.1.4) The 1×1 convolution in the depthwise separable convolution is replaced by channel weighting, which also exchanges information between channels but at a time complexity far lower than that of the 1×1 convolution. The feature maps are down-sampled by average pooling to the same size as the minimum resolution, the processed feature maps of the i branches of different resolutions are fused by channel-wise addition, and an SE module (consisting of a Squeeze part and an Excitation part) then produces a weight matrix W_t; W_t is up-sampled for each branch, restored to the original size, and used to weight the channels.
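The following sketch illustrates steps (3.1.3)–(3.1.4): a channel-split shuffle unit in which the pointwise (1×1) convolution of the depthwise separable block is replaced by an SE-style per-channel weighting. Layer sizes, the reduction ratio and the placement of batch normalization are assumptions, not details fixed by the patent.

```python
import torch
import torch.nn as nn

class ChannelWeighting(nn.Module):
    """SE-style per-channel reweighting used in place of the 1x1 convolution."""
    def __init__(self, channels, reduction=4):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels), nn.Sigmoid())

    def forward(self, x):                      # x: (N, C, H, W)
        w = self.fc(x.mean(dim=(2, 3)))        # squeeze -> (N, C)
        return x * w[:, :, None, None]         # excite: weight each channel

class LiteShuffleBlock(nn.Module):
    """Channel-split block: one half passes through, the other half is convolved."""
    def __init__(self, channels):              # channels assumed even
        super().__init__()
        half = channels // 2
        self.depthwise = nn.Conv2d(half, half, 3, padding=1, groups=half)
        self.bn = nn.BatchNorm2d(half)
        self.reweight = ChannelWeighting(half)

    def forward(self, x):
        identity, branch = x.chunk(2, dim=1)   # split the channels into two parts
        branch = self.reweight(self.bn(self.depthwise(branch)))
        out = torch.cat([identity, branch], dim=1)
        # Channel shuffle so information mixes between the two halves.
        n, c, h, w = out.shape
        return out.view(n, 2, c // 2, h, w).transpose(1, 2).reshape(n, c, h, w)
```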
And (3.2) taking the coordinates and confidence degrees of the bone key points identified in each frame of the video as input, constructing a topological graph according to physical relations among bones, and then carrying out batch normalization processing on the topological graph to unify scattered data.
(3.3) Feature extraction is performed on the processed data through 9 S-GCN units, different weight coefficients are assigned to different trunks, and the feature representation of the video S_d = {S_d1, S_d2, …, S_dT} is obtained.
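As an illustration of step (3.3), a single spatial graph-convolution (S-GCN) unit can be sketched as below; the normalised adjacency matrix and the learnable per-edge (trunk) weights are implementation assumptions consistent with the description.

```python
import torch
import torch.nn as nn

class SGCNUnit(nn.Module):
    def __init__(self, in_channels, out_channels, adjacency):
        super().__init__()
        # adjacency: (N, N) normalised skeleton graph over N key points.
        self.register_buffer("A", adjacency)
        # Learnable edge importance: different trunks receive different weights.
        self.edge_weight = nn.Parameter(torch.ones_like(adjacency))
        self.proj = nn.Conv2d(in_channels, out_channels, kernel_size=1)

    def forward(self, x):                      # x: (batch, C, T, N)
        x = self.proj(x)                       # per-joint feature transform
        return torch.einsum("bctn,nm->bctm", x, self.A * self.edge_weight)
```

Nine such units are stacked in the embodiment described above.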
(4) The extracted static features and motion features are fused by linear weighting, S = m·S_s + n·S_d, where m and n are the weight factors of the static features and the motion features respectively.
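A worked example of the fusion in step (4); the weight factors m and n and the assumption that both feature streams share the same dimensionality are illustrative choices, since the patent leaves their values open.

```python
import torch

T, D = 120, 256                  # example: 120 frames, 256-d features per stream
S_s = torch.randn(T, D)          # static features from the residual network
S_d = torch.randn(T, D)          # motion features from the S-GCN branch
m, n = 0.6, 0.4                  # example weight factors
S = m * S_s + n * S_d            # fused per-frame features
```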
(5) extracting global features from the fused features through a self-attention mechanism, then calculating the importance of video frames, extracting key frames of corresponding actions through a Bernoulli function, and optimizing a result set by using reinforcement learning.
The specific method of the step (5) is as follows:
(5.1) modeling the position information between the video frames through the bidirectional mask can ensure that the importance of the current video frame is influenced not only by the previous video frame but also by the subsequent video frame.
The specific method of step (5.1) is as follows:
(5.1.1) the forward mask indicates that the weights of attention are related to the calculation results before the current position, and the backward mask indicates that the weights of the current position are related to the calculation results after.
(5.1.2) A T-frame video X = {x_i | i = 1, …, T} is input, each frame containing N key points, and the correlation coefficients are calculated through the self-attention mechanism,
where t, i ∈ [0, T), U and V are the weight matrices of the two frames respectively, M is the position coding matrix, the forward mask retains the upper triangular information, and the reverse mask retains the lower triangular information.
(5.1.3) The correlation coefficients are combined with the relative position information of the frames (representing the positional relationship with the preceding and following frames); the forward and backward directions are fused and mapped back to the original video frame sequence, yielding a sequence c = {c_t | t = 1, …, T} containing context information.
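A minimal sketch of steps (5.1.1)–(5.1.3), assuming scaled dot-product attention: the fused frame features are projected with U and V, the position code M is added, a forward (upper-triangular) and a reverse (lower-triangular) mask are applied, and the two directions are fused and mapped back to the frame order to give the context sequence c. The additive fusion of the two directions and the scaling factor are assumptions.

```python
import torch
import torch.nn.functional as F

def bidirectional_context(S, U, V, M):
    """S: (T, D) fused frame features; U, V: (D, D) projections; M: (T, T) position codes."""
    T, D = S.shape
    logits = (S @ U) @ (S @ V).t() / D ** 0.5 + M     # frame-to-frame correlation
    upper = torch.triu(torch.ones(T, T)).bool()       # forward mask keeps the upper triangle
    lower = torch.tril(torch.ones(T, T)).bool()       # reverse mask keeps the lower triangle
    neg = torch.finfo(logits.dtype).min
    attn_fwd = F.softmax(logits.masked_fill(~upper, neg), dim=-1)
    attn_bwd = F.softmax(logits.masked_fill(~lower, neg), dim=-1)
    # Fuse both directions and map back to the original frame order: c = {c_t}.
    return (attn_fwd + attn_bwd) @ S
```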
And (5.2) after obtaining the global context information of the video sequence, calculating a feature matching degree based on the global correlation features, and then predicting the importance score of the video frame by adopting the full connection layer.
(5.3) After the frame score of each video frame is obtained, key frames are selected for the corresponding action through a Bernoulli distribution a_t ~ B(Y), where a_t represents the probability of taking the current frame as a key frame.
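A sketch of steps (5.2)–(5.3): a fully connected layer predicts a per-frame importance score from the context feature c_t, and key frames are drawn from a Bernoulli distribution over those scores. The 256-dimensional context size and the sigmoid output are assumptions.

```python
import torch
import torch.nn as nn

score_head = nn.Sequential(nn.Linear(256, 1), nn.Sigmoid())

def select_key_frames(context):                   # context: (T, 256) sequence c
    scores = score_head(context).squeeze(-1)      # per-frame importance in (0, 1)
    actions = torch.bernoulli(scores)             # a_t = 1 means "keep as key frame"
    return scores, actions.bool()
```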
(5.4) The quality of the extracted key frame result set is judged with reinforcement learning: the state-action value is represented as the sum of the importance and the diversity of the result set, the importance is evaluated by how well the key frame set covers the complete video information, and the diversity is evaluated by the size of the feature-space differences between the selected frames.
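A sketch of the reward used in step (5.4): the state-action value is the sum of an importance (coverage) term and a diversity term. The concrete formulas below (mean distance to the nearest selected frame, mean pairwise cosine dissimilarity) are assumptions consistent with the description, not formulas taken from the patent.

```python
import torch
import torch.nn.functional as F

def reward(features, selected):
    """features: (T, D) fused frame features; selected: (T,) bool mask of key frames."""
    picked = features[selected]
    if picked.shape[0] == 0:
        return torch.tensor(0.0)
    # Importance: how well the key-frame set covers the complete video.
    dist = torch.cdist(features, picked)                     # (T, K) pairwise distances
    r_importance = torch.exp(-dist.min(dim=1).values.mean())
    # Diversity: feature-space differences between the selected frames.
    normed = F.normalize(picked, dim=1)
    r_diversity = 1.0 - (normed @ normed.t()).mean()
    return r_importance + r_diversity
```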
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles disclosed herein.
Claims (5)
1. A method for extracting a fusion characteristic motion video key frame based on human body posture recognition is characterized by comprising the following steps:
s1, performing frame-by-frame segmentation on a target video segment, and segmenting a video into a series of video frames;
S2, extracting static features by using a residual network, and performing dimensionality reduction on the data to obtain the static features of the video frames S_s = [S_s1, S_s2, ..., S_sT];
S3, abstracting the skeleton data of the human body in three-dimensional space, and extracting the motion features of the video frames to obtain motion features S_d = [S_d1, S_d2, ..., S_dT];
S4, performing linear weighted fusion of the extracted static features S_s and motion features S_d, S = m·S_s + n·S_d, where m and n are the weight factors of the static features and the motion features respectively;
and S5, extracting global features from the fused features through a self-attention mechanism, then calculating the importance of the video frame, extracting key frames of corresponding actions through a Bernoulli function, and optimizing a result set by using reinforcement learning.
2. The method for extracting the key frame of the fusion feature motion video based on human body posture recognition according to claim 1, wherein the specific method of step S3 is as follows:
s31, extracting human skeleton for each frame in the video, and analyzing human posture by using a light-weight HRNet;
s32, using the coordinates and confidence degrees of the bone key points identified in each frame of the video as input, constructing a topological graph according to physical relations among bones, and then carrying out batch normalization processing on the topological graph;
S33, performing feature extraction on the processed data through a plurality of S-GCN units, assigning different weight coefficients to different trunks, and obtaining the feature representation of the video S_d = {S_d1, S_d2, ..., S_dT}.
3. The method for extracting the key frame of the fusion feature motion video based on human body posture recognition according to claim 2, wherein the specific method of the step S31 is as follows:
s311, each sub-network of each branch at each stage comprises two residual blocks and a multi-resolution fusion module;
s312, replacing all residual blocks in the original network with a Shuffle module of the Shuffle Net, wherein the Shuffle module divides a channel into two parts, one part directly passes through the channel without any convolution operation, and the other part needs to be subjected to deep separable convolution;
S313, replacing the convolution in the depthwise separable convolution with channel weighting, down-sampling by average pooling to the same size as the minimum resolution, performing channel-wise additive feature fusion on the processed feature maps of the i branches of different resolutions, obtaining a weight matrix W_t with an SE module, up-sampling W_t for each branch, restoring the original size, and weighting the channels.
4. The method for extracting the key frame of the fusion feature motion video based on human body posture recognition according to claim 1, wherein the specific method of step S5 is as follows:
s51, modeling position information between video frames through bidirectional masks;
s52, after global context information of a video sequence is obtained, calculating a feature matching degree based on global correlation characteristics, and then predicting an importance score of a video frame by adopting a full connection layer;
S53, after the frame score of each video frame is obtained, selecting key frames for the corresponding action through a Bernoulli distribution a_t ~ B(Y), where a_t represents the probability of taking the current frame as a key frame;
and S54, judging the quality of the extracted key frame result set by using reinforcement learning, representing the sum of the importance and diversity of the result set by using a state-action value, evaluating the importance of the result set by using the covering capability of the key frame set on complete video information, and evaluating the diversity of the result set by using the difference of feature spaces among selected frames.
5. The method for extracting the key frame of the fusion feature motion video based on human body posture recognition according to claim 4, wherein the specific method of the step S51 is as follows:
s511, the forward mask indicates that the attention weight is related to the calculation result before the current position, and the reverse mask indicates that the attention weight is related to the calculation result after the current position;
S512, inputting a T-frame video X = {x_i | i = 1, ..., T}, each frame containing N key points, and calculating the correlation coefficients through the self-attention mechanism,
where t, i ∈ [0, T), U and V are the weight matrices of the two frames respectively, M is the position coding matrix, the forward mask retains the upper triangular information, the reverse mask retains the lower triangular information, λ is the eigenvalue of the fused feature matrix, s_t is the fused feature of the current frame, and s_i is the fused feature of a preceding or following frame;
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210245767.5A CN114724058B (en) | 2022-03-14 | Human body gesture recognition-based fusion feature motion video key frame extraction method |
Publications (2)
Publication Number | Publication Date |
---|---|
CN114724058A true CN114724058A (en) | 2022-07-08 |
CN114724058B CN114724058B (en) | 2024-11-15 |
Patent Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106228111A (en) * | 2016-07-08 | 2016-12-14 | 天津大学 | A kind of method based on skeleton sequential extraction procedures key frame |
US20200302180A1 (en) * | 2018-03-13 | 2020-09-24 | Tencent Technology (Shenzhen) Company Limited | Image recognition method and apparatus, terminal, and storage medium |
KR20200108548A (en) * | 2019-03-11 | 2020-09-21 | 광운대학교 산학협력단 | A system of compressing the sequence of 3D point clouds and the method thereof |
CN112686153A (en) * | 2020-12-30 | 2021-04-20 | 西安邮电大学 | Three-dimensional skeleton key frame selection method for human behavior recognition |
CN113283400A (en) * | 2021-07-19 | 2021-08-20 | 成都考拉悠然科技有限公司 | Skeleton action identification method based on selective hypergraph convolutional network |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117726977A (en) * | 2024-02-07 | 2024-03-19 | 南京百伦斯智能科技有限公司 | Experimental operation key node scoring method and system based on DCNN |
CN117726977B (en) * | 2024-02-07 | 2024-04-12 | 南京百伦斯智能科技有限公司 | Experimental operation key node scoring method and system based on DCNN |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
WO2019136591A1 (en) | Salient object detection method and system for weak supervision-based spatio-temporal cascade neural network | |
CN109948475B (en) | Human body action recognition method based on skeleton features and deep learning | |
CN112163498B (en) | Method for establishing pedestrian re-identification model with foreground guiding and texture focusing functions and application of method | |
CN112950477A (en) | High-resolution saliency target detection method based on dual-path processing | |
CN113033454B (en) | Method for detecting building change in urban video shooting | |
CN116030498A (en) | Virtual garment running and showing oriented three-dimensional human body posture estimation method | |
CN114821374A (en) | Knowledge and data collaborative driving unmanned aerial vehicle aerial photography target detection method | |
CN114743273B (en) | Human skeleton behavior recognition method and system based on multi-scale residual error map convolution network | |
CN113807232B (en) | Fake face detection method, system and storage medium based on double-flow network | |
CN115482523A (en) | Small object target detection method and system of lightweight multi-scale attention mechanism | |
CN111882495B (en) | Image highlight processing method based on user-defined fuzzy logic and GAN | |
CN116934796B (en) | Visual target tracking method based on twinning residual error attention aggregation network | |
CN112528077A (en) | Video face retrieval method and system based on video embedding | |
CN114724058A (en) | Method for extracting key frames of fusion characteristic motion video based on human body posture recognition | |
CN117876905A (en) | Quick high-accuracy unmanned aerial vehicle aerial photographing target detection method | |
CN116993775A (en) | Pedestrian multi-target tracking method combined with instance segmentation | |
CN114758285B (en) | Video interaction action detection method based on anchor freedom and long-term attention perception | |
CN114724058B (en) | Human body gesture recognition-based fusion feature motion video key frame extraction method | |
CN113627245B (en) | CRTS target detection method | |
CN117058235A (en) | Visual positioning method crossing various indoor scenes | |
CN115965905A (en) | Crowd counting method and system based on multi-scale fusion convolutional network | |
CN115830707A (en) | Multi-view human behavior identification method based on hypergraph learning | |
CN110503061B (en) | Multi-feature-fused multi-factor video occlusion area detection method and system | |
Ranjan et al. | Video Frame Prediction by Joint Optimization of Direct Frame Synthesis and Optical-Flow Estimation | |
CN114882417B (en) | Light LIGHTDIMP single-target tracking method based on dimp tracker |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
GR01 | Patent grant |