WO2024160038A1 - Action recognition method, apparatus, device, storage medium and product - Google Patents
Action recognition method, apparatus, device, storage medium and product
- Publication number
- WO2024160038A1 (PCT application PCT/CN2024/072044)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- video
- features
- audio
- action recognition
- feature
- Prior art date
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/03—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
- G10L25/18—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being spectral information of each sub-band
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V40/00—Recognition of biometric, human-related or animal-related patterns in image or video data
- G06V40/20—Movements or behaviour, e.g. gesture recognition
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/27—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/48—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
Definitions
- the embodiments of the present application relate to the field of image processing technology, and in particular to an action recognition method, apparatus, device, storage medium and product.
- the embodiments of the present application provide an action recognition method, apparatus, device, storage medium and product, to solve the technical problem in the related art that only image information is considered and action recognition accuracy is low, and to effectively improve action recognition accuracy.
- an action recognition method comprising:
- Action recognition is performed based on the fusion features to obtain an action recognition result.
- an embodiment of the present application provides an action recognition device, including an audio analysis module, a video analysis module, a feature fusion module and an action recognition module, wherein:
- the audio analysis module is configured to obtain a spectrogram of each audio frame in the video to be identified, and extract audio features of each audio frame based on the spectrogram;
- the video analysis module is configured to obtain video features of each video frame in the video to be identified
- the feature fusion module is configured to map the audio feature and the video feature to the same dimension, and fuse the audio feature and the video feature to obtain a plurality of fused features;
- the action recognition module is configured to perform action recognition based on the fusion features to obtain an action recognition result.
- an embodiment of the present application provides an action recognition device, including: a memory and one or more processors;
- the memory is used to store one or more programs
- when the one or more programs are executed by the one or more processors, the one or more processors implement the action recognition method as described in the first aspect.
- an embodiment of the present application provides a non-volatile storage medium storing computer executable instructions, wherein the computer executable instructions are used to execute the action recognition method as described in the first aspect when executed by a computer processor.
- an embodiment of the present application provides a computer program product, which includes a computer program stored in a computer-readable storage medium, and at least one processor of the device reads and executes the computer program from the computer-readable storage medium, so that the device performs the action recognition method described in the first aspect.
- the embodiments of the present application obtain the spectrogram of each audio frame in the video to be identified, extract the audio features of the audio frame based on the spectrogram, and extract the video features of each video frame in the video to be identified. After mapping the audio features and the video features to the same dimension, the audio features and the video features are fused to obtain fused features, and the fused features are used for action recognition to obtain action recognition results.
- the dimensions for action recognition are enriched, and the action recognition accuracy of the video to be identified is effectively improved.
- FIG1 is a flow chart of an action recognition method provided by an embodiment of the present application.
- FIG2 is a flow chart of another action recognition method provided by an embodiment of the present application.
- FIG3 is a schematic diagram of an audio feature extraction network structure provided in an embodiment of the present application.
- FIG4 is a schematic diagram of a video feature extraction network structure provided in an embodiment of the present application.
- FIG5 is a schematic diagram of the structure of a stem convolution block provided in an embodiment of the present application.
- FIG6 is a schematic diagram of a residual block group structure provided in an embodiment of the present application.
- FIG7 is a schematic diagram of an action recognition model provided in an embodiment of the present application.
- FIG8 is a schematic diagram of the structure of an action recognition apparatus provided in an embodiment of the present application.
- FIG9 is a schematic diagram of the structure of an action recognition device provided in an embodiment of the present application.
- the action recognition method provided by the present application can be applied to action recognition in scenes such as live broadcast and video posting, such as real-time recognition of action types such as singing and dancing in the live broadcast room in the live broadcast scene, and content recognition and recommendation of user-uploaded videos in the video posting scene, etc. It aims to enrich the dimensions of action recognition and improve the action recognition accuracy of the video to be recognized by multi-modal feature fusion of audio features and video features in the video to be recognized.
- action recognition is generally performed on images in the video through a deep learning network, but this action recognition method only considers the impact of image information on action recognition, ignores the action timing information in the video, and has low action recognition accuracy for the video.
- an action recognition method of an embodiment of the present application is provided to solve the technical problem that the existing action recognition scheme only considers the image information in the video and has low action recognition accuracy.
- FIG1 shows a flow chart of an action recognition method provided in an embodiment of the present application.
- the action recognition method provided in an embodiment of the present application can be performed by an action recognition apparatus, which can be implemented by hardware and/or software and integrated in an action recognition device (such as a live broadcast platform, a video service platform, etc.).
- the action recognition method includes:
- S101 Obtain a spectrogram of each audio frame in a video to be identified, and extract audio features of each audio frame based on the spectrogram.
- the video to be identified provided by this solution can be understood as a video that requires action recognition, such as identifying the actions of a person in the video, wherein the action recognition results obtained by action recognition can be different action types such as dancing, singing and games.
- the video to be identified can be video stream data (such as video stream data uploaded in real time by the anchor end during a live broadcast) or complete video data (such as a complete video uploaded by a user in a video post).
- video stream data collected within a set time interval can be used as the video to be identified.
- the video to be identified provided by this solution records multiple continuous audio frames and multiple continuous video frames.
- a video to be identified in which an action needs to be identified is obtained, and audio frames and video frames in the video to be identified are extracted.
- a spectrogram of each audio frame is determined, and audio features corresponding to the audio frame are extracted based on the spectrogram.
- the spectrogram corresponding to each audio frame is input into a trained audio feature extraction network, and the audio feature extraction network processes and analyzes each spectrogram and outputs the corresponding audio features.
- for each video frame in the video to be identified, the video features of each video frame are extracted respectively.
- for example, each video frame is sequentially input into a trained video feature extraction network, and the video feature extraction network processes and analyzes each video frame and outputs the corresponding video features.
- S103 Mapping the audio features and the video features to the same dimension, and fusing the audio features and the video features to obtain a plurality of fused features.
- the audio features and the video features are mapped to the same dimension.
- the audio features and the video features can each be mapped to a set dimension through a fully connected layer. It should be explained that, since the dimension of the audio features may be inconsistent with the dimension of the video features, it is necessary to map the audio features and the video features to the same dimension so that the feature weights of the audio frames and the video frames are consistent, and the audio frames and the video frames can work together in action recognition to improve action recognition accuracy.
- the audio features and the video features are fused to obtain a plurality of fused features, wherein the number of audio frames in a video to be identified is consistent with the number of video frames, and the number of fused features obtained by fusion of the audio features and the video features is consistent with the number of audio frames or video frames.
- the fusion processing of audio features and video features can be to fuse audio features and video features between frames, for example, to fuse audio features corresponding to an audio frame of one frame with video features corresponding to a video frame of one frame.
- the audio features and video features can also be fused as a whole at the video level in a multimodal manner, for example, to fuse the audio features and video features of the entire video to be identified.
- S104 Perform action recognition based on the fused features to obtain an action recognition result.
- action recognition can be performed based on these fusion features to obtain the action recognition result corresponding to the video to be identified.
- the multiple fusion features obtained above are input into a trained action recognition model, and the action recognition model analyzes and processes the received multiple fusion features and outputs the corresponding action recognition result.
- the action recognition result can be expressed as the probabilities corresponding to different action types.
- the action type corresponding to the maximum probability in the action recognition result can be determined as the action type corresponding to the video to be recognized, or the action type whose probability reaches a set threshold can be determined as the action type corresponding to the video to be recognized (when the probabilities of multiple action types all reach the set threshold, the action type corresponding to the maximum probability can be determined as the action type corresponding to the video to be recognized, or multiple action types can be determined as the action type corresponding to the video to be recognized).
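The selection rule described above reduces to an argmax-with-threshold over the per-class probabilities. A minimal sketch is given below; the threshold value, the class names and the option of returning multiple types are illustrative assumptions, since the embodiment leaves these choices open.

```python
from typing import Dict, List

def select_action_types(probs: Dict[str, float], threshold: float = 0.5,
                        allow_multiple: bool = False) -> List[str]:
    """Pick action type(s) from per-class probabilities, as described above.

    `threshold` and `allow_multiple` are illustrative parameters; the patent
    leaves the exact threshold value and the tie-breaking policy open.
    """
    above = {k: v for k, v in probs.items() if v >= threshold}
    if not above:                       # nothing reaches the set threshold
        return []
    if allow_multiple:                  # report every type above the threshold
        return sorted(above, key=above.get, reverse=True)
    return [max(above, key=above.get)]  # otherwise keep only the most likely type

# Example: a recognition result expressed as class probabilities
print(select_action_types({"dancing": 0.72, "singing": 0.55, "games": 0.08}))
```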
- in a possible embodiment, when the video to be recognized provided by this solution is video stream data provided by the anchor end, after the action recognition result is obtained by performing action recognition based on the fusion features, this solution can also perform live broadcast room recommendation processing for the live broadcast room of the anchor end based on the time information of the video to be recognized and the action recognition result.
- the video stream data provided by the anchor terminal is collected at set time intervals, and the video stream data corresponding to each time interval is used as the video to be identified for action recognition in sequence, so as to obtain the action recognition result corresponding to the video to be identified in each time interval.
- the time information of the video to be identified can be determined based on the timestamp of the video stream data. For example, the timestamp of the first video frame, the middle video frame or the last video frame in the video to be identified can be used as the time information of the video to be identified, or the average time of the timestamps of the video frames in the video to be identified can be used as the time information of the video to be identified.
- live broadcast room recommendation processing is performed on the live broadcast room of the anchor end based on the time information of the video to be identified and the action recognition result.
- the live broadcast room recommendation processing may be to increase or decrease the weight of the live broadcast room in the corresponding channel and sort the live broadcast rooms in the corresponding channel based on their weights, or to sort the live broadcast rooms across all channels based on their weights.
- the audience can see the recommended live broadcast room in the corresponding channel or the whole channel first, which improves the audience's live viewing experience and the popularity of the anchor's live broadcast room.
- This solution recommends the live broadcast room based on time information and action recognition results, so that the audience can see the live broadcast room with the live broadcast content they are interested in first, which improves the audience's live viewing experience and the popularity of the anchor's live broadcast room.
- for example, when the action recognition result begins to indicate that a set action type is recognized (the probability reaches the set threshold), the corresponding time information is determined as the start time of the corresponding action type, and the live broadcast room is weighted up to increase the recommendation of the live broadcast room.
- when the action recognition result begins to indicate that the set action type is no longer recognized (the probability is less than the set threshold), the corresponding time information is determined as the end time of the corresponding action type, the live broadcast room is weighted down, and other live broadcast rooms performing the corresponding action type are recommended first.
- for example, when it is detected that the anchor in the live broadcast room starts dancing and/or singing, the weight of the live broadcast room in the dance channel and/or singing channel is increased to improve the recommendation of the live broadcast room in the dance channel and/or singing channel.
- when it is detected that the anchor in the live broadcast room stops dancing and/or singing, the weight of the live broadcast room in the dance channel and/or singing channel is decreased to reduce the recommendation of the live broadcast room in the dance channel and/or singing channel, so that the audience can preferentially see live broadcast rooms with the corresponding live content, improving the audience's live viewing experience and the popularity of the anchor's live broadcast room.
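The start/end and weight-up/weight-down logic above can be captured by a small state update per live broadcast room. The sketch below is illustrative only: the field names, the boost and cut factors and the probability threshold are assumptions not specified in the description.

```python
def update_room_weight(room: dict, action_prob: float, timestamp: float,
                       threshold: float = 0.5, boost: float = 1.2, cut: float = 0.8) -> None:
    """Illustrative recommendation-weight update for one live room and one
    tracked action type (e.g. dancing). Only the start/end logic comes from
    the description; everything else here is an assumption."""
    active = action_prob >= threshold
    if active and not room.get("action_active", False):
        room["action_start"] = timestamp                    # action type begins: record start time
        room["weight"] = room.get("weight", 1.0) * boost    # weight the room up
    elif not active and room.get("action_active", False):
        room["action_end"] = timestamp                      # action type ends: record end time
        room["weight"] = room.get("weight", 1.0) * cut      # weight the room down
    room["action_active"] = active

room = {"weight": 1.0}
update_room_weight(room, action_prob=0.8, timestamp=120.0)  # dancing detected
update_room_weight(room, action_prob=0.2, timestamp=300.0)  # dancing stops
print(room)
```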
- in a possible embodiment, when the video to be recognized provided by the present solution is complete video data, after the action recognition result is obtained by performing action recognition based on the fusion features, the present solution may also perform video recommendation processing on the video to be recognized based on the action recognition result.
- the user end or the anchor end can upload complete video data to the live broadcast platform or the video service platform, wherein the video data can be a screen recording video during the live broadcast, or a separately produced video.
- the video data uploaded by the user end or the anchor end is obtained, and the video data is used as the video to be identified for action recognition, and the action recognition result of the video to be identified is obtained.
- the video to be identified is processed for video recommendation based on the action recognition result.
- the video recommendation processing method for the video to be identified can be to mark the action type of the video to be identified, and recommend the video to be identified to interested users based on the action type. This solution performs video recommendation processing on the video data uploaded by the user according to the action recognition result, so that the user can see the video data of interest first, thereby improving the audience's video viewing experience.
- in summary, the spectrogram of each audio frame in the video to be identified is obtained, the audio features of the audio frames are extracted based on the spectrograms, and the video features of each video frame in the video to be identified are extracted; after the audio features and the video features are mapped to the same dimension, the audio features and the video features are fused to obtain fused features, and action recognition is performed on the fused features to obtain the action recognition result. By performing multimodal feature fusion of the audio features and the video features in the video to be identified,
- the dimensions of action recognition are enriched, and the action recognition accuracy of the video to be identified is effectively improved.
- FIG2 shows a flow chart of another action recognition method provided by an embodiment of the present application, which is a specific embodiment of the above action recognition method.
- the action recognition method includes:
- S201 Resample each audio frame in the video to be identified into mono audio of a set frequency.
- S202 Perform a short-time Fourier transform on each resampled audio frame to obtain the frequency spectrum of the audio frame.
- S203 Map each frequency spectrum to a Mel filter bank to calculate the spectrogram corresponding to each frequency spectrum.
- each audio frame in the video to be identified is resampled based on the set frequency, so that each audio frame in the video to be identified is resampled into a mono audio of the set frequency.
- the audio file in the video to be identified is an audio file in wav format
- the audio file includes multiple continuous audio frames, and these audio frames can be resampled into 16kHz mono audio.
- each resampled audio frame is subjected to a short-time Fourier transform to obtain the frequency spectrum of each audio frame, the frequency spectrum of each audio frame obtained above is mapped to a Mel filter bank (mel filter bank), and the spectrogram (mel spectrogram) corresponding to each frequency spectrum is calculated by the Mel filter bank.
- the short-time Fourier transform of the audio frame can be performed using a time window of a set size and a set frame shift amplitude.
- a 25ms Hann time window is used to perform a short-time Fourier transform on the audio frame with a frame shift of 10ms to obtain a spectrogram of the audio frame.
- the frequency spectrum can be mapped to a 64-band Mel filter bank to calculate the mel spectrum, and log(mel spectrum + 0.01) is calculated to obtain a stable mel spectrogram (i.e., the spectrogram).
- 0.01 is used as a bias term to effectively reduce the situation of calculating the logarithm of 0, ensuring that a valid spectrogram is obtained.
- This scheme obtains the frequency spectrum by performing a short-time Fourier transform on the resampled audio frames, and maps the frequency spectrum to a Mel filter bank to calculate the spectrogram, accurately calculating the spectrogram of each audio frame and improving the quality of audio feature extraction and the accuracy of action recognition for the video to be recognized.
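A minimal sketch of this audio preprocessing pipeline is shown below, assuming the parameters named in the description (16 kHz mono resampling, a 25 ms Hann window, a 10 ms frame shift, a 64-band Mel filter bank and the log(mel + 0.01) stabilization). torchaudio is used here purely as a convenience; the embodiment does not prescribe a particular library.

```python
import torch
import torchaudio

def log_mel_spectrogram(waveform: torch.Tensor, sample_rate: int) -> torch.Tensor:
    """Compute the stabilized log-mel spectrogram described above.

    Uses a 25 ms Hann window (400 samples at 16 kHz), a 10 ms frame shift
    (160 samples) and a 64-band Mel filter bank; log(mel + 0.01) avoids
    taking the logarithm of zero.
    """
    if waveform.dim() > 1 and waveform.size(0) > 1:       # average channels to mono
        waveform = waveform.mean(dim=0, keepdim=True)
    if sample_rate != 16000:                              # resample to the set 16 kHz
        waveform = torchaudio.transforms.Resample(sample_rate, 16000)(waveform)
    mel = torchaudio.transforms.MelSpectrogram(
        sample_rate=16000, n_fft=400, win_length=400, hop_length=160,
        n_mels=64, window_fn=torch.hann_window)(waveform)
    return torch.log(mel + 0.01)                          # stable mel spectrogram

# Example: 3 seconds of audio at 16 kHz -> (1, 64, n_frames) log-mel features
print(log_mel_spectrogram(torch.randn(1, 48000), 16000).shape)
```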
- S204 Input each spectrogram into the trained audio feature extraction network, and extract the audio features of each audio frame through the audio feature extraction network based on a first number of convolutional layers and a second number of fully connected layers connected in sequence.
- the present solution extracts audio features from a spectrogram through a trained audio feature extraction network, wherein the audio feature extraction network includes a first number of convolutional layers and a second number of fully connected layers connected in sequence.
- the first number and the second number may be set to 6 and 3, respectively, that is, the audio feature extraction network may include 6 convolutional layers and 3 fully connected layers connected in sequence.
- each spectrogram is input into a trained audio feature extraction network, so as to extract the audio features of each audio frame through the audio feature extraction network based on the first number of convolutional layers and the second number of fully connected layers connected in sequence.
- This solution extracts the audio features of the audio frames through the audio feature extraction network based on the first number of convolutional layers and the second number of fully connected layers connected in sequence, accurately extracts the sound features of each audio frame, and effectively improves the action recognition accuracy of the video to be recognized.
- the audio feature extraction network provided by the present solution can be built based on the VGGish model, and the spectrogram can be directly input into the audio feature extraction network to extract audio features.
- the VGGish model, as a tensorflow-based VGG model, supports extracting a semantic 128-dimensional embedding feature vector from the spectrogram as the audio feature.
- the audio feature extraction network includes 6 convolutional layers connected in sequence (including conv1 composed of a single convolutional layer with 64 channels, conv2 composed of a single convolutional layer with 128 channels, conv3 composed of two convolutional layers with 256 channels, and conv4 composed of two convolutional layers with 512 channels, and conv1-4 are connected by an activation function) and 3 fully connected layers (including fc1 composed of two fully connected layers with 4096 channels and fc2 composed of a single fully connected layer with 128 channels, and conv4 and fc1 are connected by an activation function).
- the 96×64 spectrogram is input into the audio feature extraction network, and the audio feature extraction network can output 128-dimensional audio features.
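The layer counts and channel widths above can be assembled into a VGGish-style extractor as sketched below. The 3×3 kernels and the 2×2 max pooling between convolution groups follow the standard VGGish layout and are assumptions; only the channel and layer counts and the 96×64 input to 128-dimensional output mapping come from the description.

```python
import torch
import torch.nn as nn

class AudioFeatureNet(nn.Module):
    """VGGish-style extractor sketched from the description: six convolutional
    layers (64, 128, 2x256, 2x512 channels) followed by three fully connected
    layers (2x4096, 128). Kernel sizes and inter-group pooling are assumed."""
    def __init__(self) -> None:
        super().__init__()
        def block(c_in, c_out, n):        # n conv layers followed by 2x2 pooling
            layers = []
            for i in range(n):
                layers += [nn.Conv2d(c_in if i == 0 else c_out, c_out, 3, padding=1),
                           nn.ReLU(inplace=True)]
            return layers + [nn.MaxPool2d(2)]
        self.features = nn.Sequential(*block(1, 64, 1), *block(64, 128, 1),
                                      *block(128, 256, 2), *block(256, 512, 2))
        self.classifier = nn.Sequential(  # fc1 (two 4096 layers) and fc2 (128)
            nn.Linear(512 * 6 * 4, 4096), nn.ReLU(inplace=True),
            nn.Linear(4096, 4096), nn.ReLU(inplace=True),
            nn.Linear(4096, 128))

    def forward(self, spectrogram: torch.Tensor) -> torch.Tensor:
        x = self.features(spectrogram)        # (B, 512, 6, 4) for a 96x64 input
        return self.classifier(x.flatten(1))  # 128-dimensional audio feature

print(AudioFeatureNet()(torch.randn(2, 1, 96, 64)).shape)  # torch.Size([2, 128])
```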
- S205 Input each video frame in the video to be identified into a trained video feature extraction network, and extract video features of each video frame through the video feature extraction network.
- the video feature extraction network includes a stem convolution block composed of multiple convolutional layers connected in sequence, a third number of residual block groups based on convolutional layers, and a pooling layer with an attention mechanism.
- the stem convolution block can extract feature maps containing image semantic information in the video frame, and the pooling layer with an attention mechanism can apply the attention mechanism to reduce the information redundancy of the extracted features, increase the network receptive field, prevent the network from overfitting, and ensure the quality of video feature extraction.
- each video frame in the video to be identified is sequentially input into the video feature extraction network, and the video feature extraction network extracts the video features of each video frame.
- when the video feature extraction network extracts video features, the video frames are sequentially processed by the stem convolution block, the residual block groups and the pooling layer to obtain the video features.
- This solution extracts the video features of the video frames based on the sequentially connected stem convolution blocks, residual block groups and pooling layers through the video feature extraction network, accurately extracts the video features of each video frame, and effectively improves the action recognition accuracy of the video to be identified.
- the video feature extraction network includes a stem convolution block connected in sequence with a third number of residual block groups (set to 4 residual block groups in the figure) and a pooling layer with an attention mechanism (attention pool, 1×1024).
- a 224×224 video frame can be input into the video feature extraction network to obtain a 1024-dimensional video feature.
- the video feature extraction network provided in this solution can be built based on a ResNet50 network (a 50-layer residual neural network).
- the stem convolution block in the video feature extraction network provided by the present scheme includes three 3×3 convolutional layers connected in sequence (a "3×3 conv, 32, /2" convolutional layer, a "3×3 conv, 32, /1" convolutional layer and a "3×3 conv, 64, 1" convolutional layer) and a pooling layer, and the pooling layer in the stem convolution block provided by the present scheme is an average pooling layer (an "avgpool, /2" pooling layer).
- compared with the single 7×7 convolutional layer in the original ResNet50 network, this solution adjusts the 7×7 convolutional layer to three 3×3 convolutional layers. Stacking multiple small convolutional layers achieves a better effect, and the stacked small convolutional layers can more effectively extract the image semantic information in the video frames.
- the pooling layer in the original ResNet50 network is the maximum pooling layer. This solution adjusts the maximum pooling layer to the average pooling layer, which can effectively retain the background information in the video frame. The extracted video features will contain more background information, and the action judgment of the video will be more accurate, thereby improving the action recognition accuracy of the video to be recognized.
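A sketch of the stem convolution block described above (three 3×3 convolutions with the strides given in FIG5, followed by average pooling) is shown below; the BatchNorm and ReLU after each convolution are assumptions carried over from the usual ResNet stem.

```python
import torch
import torch.nn as nn

class StemBlock(nn.Module):
    """Stem convolution block sketched from FIG5: "3x3 conv, 32, /2",
    "3x3 conv, 32, /1", "3x3 conv, 64" (stride 1) and a 2x2 average pooling
    layer in place of the usual max pooling."""
    def __init__(self, in_channels: int = 3) -> None:
        super().__init__()
        def conv_bn_relu(c_in, c_out, stride):
            return [nn.Conv2d(c_in, c_out, 3, stride=stride, padding=1, bias=False),
                    nn.BatchNorm2d(c_out), nn.ReLU(inplace=True)]
        self.stem = nn.Sequential(
            *conv_bn_relu(in_channels, 32, stride=2),     # 3x3 conv, 32, /2
            *conv_bn_relu(32, 32, stride=1),              # 3x3 conv, 32, /1
            *conv_bn_relu(32, 64, stride=1),              # 3x3 conv, 64, stride 1
            nn.AvgPool2d(kernel_size=2, stride=2))        # avgpool, /2 (instead of max pool)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.stem(x)

# A 224x224 video frame is reduced to a 56x56 feature map with 64 channels
print(StemBlock()(torch.randn(1, 3, 224, 224)).shape)  # torch.Size([1, 64, 56, 56])
```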
- meanwhile, the last pooling layer in the original ResNet50 network is an average pooling layer; this solution replaces it with an attention pooling layer, which can effectively reduce the information redundancy of the extracted features, enlarge the network receptive field, and prevent the network from overfitting.
- the attention mechanism maps a query vector Query and a set of key-value pairs (Key-Value) to the output, where Query (the query matrix), Key (the key matrix), Value (the value matrix) and the output of the attention pooling layer are all vectors.
- the output can be calculated by the weighted sum of the values, where the weight assigned to each value can be calculated by the fitness function of the Query and the corresponding Key.
- the query matrix Query, the key matrix Key and the value matrix Value are created from the input vector through the learnable weights WQ, WK, and WV.
- the calculation of the attention mechanism computes the final result from the three matrices Query, Key and Value.
- compared with average pooling and max pooling, which squeeze features into the one-dimensional features of the final classification layer by averaging or taking the maximum, attention-mechanism pooling can select the best two-dimensional features to convert into the one-dimensional features of the final classification layer, thereby improving the video feature extraction effect.
- This solution uses a stem convolution block constructed with multiple layers of small convolution kernels to replace a single-layer large convolution kernel.
- the output of the stem convolution block uses an average pooling layer instead of a maximum pooling layer, and the extracted features are sent to the classification layer.
- the pooling layer of the attention mechanism is used instead of average pooling.
- based on the improved ResNet50 model, the receptive field of the video features is effectively enlarged and higher-dimensional semantic information in the video frames is extracted, so that the extracted video features better match the actions.
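The attention pooling described above can be sketched as single-head attention over the spatial positions of the feature map, with learnable weights WQ, WK and WV producing Query, Key and Value. Using the mean spatial feature as the query and a single head are simplifying assumptions; the description does not fix these details.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionPool(nn.Module):
    """Attention-based pooling sketched from the description: learnable W_Q,
    W_K and W_V build Query, Key and Value from the input feature map, and the
    output is the attention-weighted sum of the values (a 1x1024 video feature)."""
    def __init__(self, in_dim: int, out_dim: int = 1024) -> None:
        super().__init__()
        self.w_q = nn.Linear(in_dim, out_dim)   # W_Q
        self.w_k = nn.Linear(in_dim, out_dim)   # W_K
        self.w_v = nn.Linear(in_dim, out_dim)   # W_V

    def forward(self, feature_map: torch.Tensor) -> torch.Tensor:
        b, c, h, w = feature_map.shape
        tokens = feature_map.flatten(2).transpose(1, 2)   # (B, H*W, C)
        q = self.w_q(tokens.mean(dim=1, keepdim=True))    # query from the pooled feature (assumption)
        k, v = self.w_k(tokens), self.w_v(tokens)
        attn = F.softmax(q @ k.transpose(1, 2) / k.size(-1) ** 0.5, dim=-1)
        return (attn @ v).squeeze(1)                      # (B, out_dim) video feature

# A (B, 2048, 7, 7) backbone feature map is pooled into a 1024-d video feature
print(AttentionPool(2048)(torch.randn(2, 2048, 7, 7)).shape)  # torch.Size([2, 1024])
```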
- the video feature extraction network provided by the present solution includes four residual block groups connected in sequence, wherein the residual block group includes a plurality of residual blocks, and the residual block includes a 1×1 convolutional layer, a 3×3 convolutional layer, and a 1×1 convolutional layer connected in sequence.
- the video feature extraction network provided by the present solution has a 1×1 convolutional layer bypass-connected at the residual block at the entrance of each residual block group.
- the video feature extraction network is set to 4 residual block groups (stages), with a total of 13 residual blocks using convolution.
- the first residual block group includes 3 residual blocks, and the last 2 residual blocks have the same structure;
- the second residual block group includes 4 residual blocks, and the last 3 residual blocks have the same structure;
- the third residual block group includes 4 residual blocks, and the last 3 residual blocks have the same structure;
- the fourth residual block group includes 2 residual blocks.
- the first residual block at the entrance of each residual block group has a 1×1 convolutional layer connected as a bypass.
- each residual block uses a bottleneck-based design; that is, each residual block consists of three convolutional layers (with convolution kernels of 1×1, 3×3 and 1×1, respectively), of which the two 1×1 convolutional layers at the entrance and exit are used to compress and restore the number of channels of the feature map (the original feature map is output by the stem convolution block), effectively reducing the number of channels involved in the convolution operations, reducing the number of parameters in the calculation process, and improving the efficiency of video feature extraction.
- since each time the feature map passes through a residual block group its size is reduced to a quarter and its number of channels is doubled, this solution uses a 2D projection residual block at the entrance of each residual block group; that is, a 1×1 convolutional layer is bypass-connected at the residual block at the entrance of each residual block group to ensure that the size and number of channels of the feature maps remain consistent when they are added pixel by pixel.
- using only 2D projection residual blocks at the entrance of each residual block group can further reduce network parameters and improve the efficiency of video feature extraction.
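The residual block groups can be sketched as below: bottleneck blocks of 1×1, 3×3 and 1×1 convolutions, a 1×1 projection shortcut only at the entrance of each group, and the 3/4/4/2 block counts given above. The channel widths and strides follow the usual ResNet convention and are assumptions.

```python
import torch
import torch.nn as nn

class Bottleneck(nn.Module):
    """Bottleneck residual block from the description: 1x1, 3x3 and 1x1
    convolutions, with a 1x1 projection on the shortcut only for the first
    block of each group (the 2D projection residual block)."""
    def __init__(self, c_in: int, c_mid: int, stride: int = 1, project: bool = False):
        super().__init__()
        c_out = c_mid * 4
        self.body = nn.Sequential(
            nn.Conv2d(c_in, c_mid, 1, bias=False), nn.BatchNorm2d(c_mid), nn.ReLU(True),
            nn.Conv2d(c_mid, c_mid, 3, stride=stride, padding=1, bias=False),
            nn.BatchNorm2d(c_mid), nn.ReLU(True),
            nn.Conv2d(c_mid, c_out, 1, bias=False), nn.BatchNorm2d(c_out))
        # 1x1 projection shortcut only at the entrance of a residual block group
        self.shortcut = (nn.Sequential(nn.Conv2d(c_in, c_out, 1, stride=stride, bias=False),
                                       nn.BatchNorm2d(c_out)) if project else nn.Identity())
        self.relu = nn.ReLU(True)

    def forward(self, x):
        return self.relu(self.body(x) + self.shortcut(x))

def make_group(c_in: int, c_mid: int, blocks: int, stride: int) -> nn.Sequential:
    layers = [Bottleneck(c_in, c_mid, stride, project=True)]            # entrance block
    layers += [Bottleneck(c_mid * 4, c_mid) for _ in range(blocks - 1)]  # identical blocks
    return nn.Sequential(*layers)

# Four groups with 3, 4, 4 and 2 blocks (13 residual blocks in total)
backbone = nn.Sequential(make_group(64, 64, 3, 1), make_group(256, 128, 4, 2),
                         make_group(512, 256, 4, 2), make_group(1024, 512, 2, 2))
print(backbone(torch.randn(1, 64, 56, 56)).shape)  # torch.Size([1, 2048, 7, 7])
```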
- S206 Map the audio features and the video features to the same dimension through a fully connected layer respectively.
- S207 Based on a feature splicing fusion method, a feature superposition fusion method or a weighted superposition fusion method, fuse the audio features and the video features to obtain a plurality of fused features.
- after the audio features and the video features are obtained, the audio features and the video features are each mapped to the same dimension through a fully connected layer.
- the audio features and the video features can be fused by a set fusion method to obtain multiple fusion features.
- the fusion method provided by this solution can be one of the feature splicing fusion method, feature superposition fusion method and weighted superposition fusion method, wherein the feature splicing fusion method can be to splice audio features and video features to obtain fusion features, such as splicing 384-dimensional audio features and 384-dimensional video features into 768-dimensional fusion features.
- the feature superposition fusion method can be to add audio features and video features to obtain fusion features, such as adding 384-dimensional audio features and 384-dimensional video features to obtain 384-dimensional fusion features.
- the weighted superposition fusion method can be to obtain fusion features by weighted summing audio features and video features based on set weight coefficients, such as weighted summing 384-dimensional audio features and 384-dimensional video features to obtain 384-dimensional fusion features.
- This solution maps audio features and video features to the same dimension, and then fuses audio features and video features into fusion features based on the set fusion method, so that audio frames and video frames can play a role in action recognition together, thereby improving the accuracy of action recognition.
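A sketch of the dimension mapping and the three fusion modes is given below; the 128-dimensional audio and 1024-dimensional video inputs and the 384-dimensional shared dimension come from the examples above, while the 0.5/0.5 weights for the weighted mode are an assumption.

```python
import torch
import torch.nn as nn

class FeatureFusion(nn.Module):
    """Per-frame fusion sketched from the description: two fully connected
    layers map the audio and video features to the same dimension, which are
    then fused by concatenation, addition or a weighted sum."""
    def __init__(self, audio_dim=128, video_dim=1024, dim=384, mode="concat"):
        super().__init__()
        self.audio_fc = nn.Linear(audio_dim, dim)  # map audio feature to the shared dimension
        self.video_fc = nn.Linear(video_dim, dim)  # map video feature to the shared dimension
        self.mode = mode

    def forward(self, audio_feat: torch.Tensor, video_feat: torch.Tensor) -> torch.Tensor:
        a, v = self.audio_fc(audio_feat), self.video_fc(video_feat)
        if self.mode == "concat":                  # splicing: 384 + 384 -> 768-d fused feature
            return torch.cat([a, v], dim=-1)
        if self.mode == "sum":                     # superposition: element-wise addition -> 384-d
            return a + v
        return 0.5 * a + 0.5 * v                   # weighted superposition (illustrative weights)

fusion = FeatureFusion(mode="concat")
print(fusion(torch.randn(8, 128), torch.randn(8, 1024)).shape)  # torch.Size([8, 768])
```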
- the audio features of an audio frame and the video features of a video frame may be fused.
- the audio features and the video features may be aligned based on the timestamp, and when performing fusion processing, the audio features of an audio frame and the video features of the video frame aligned therewith may be fused to improve the fusion processing effect.
- S208 Perform action recognition based on the fused features to obtain an action recognition result.
- This solution can perform action recognition on fused features through an action recognition model built based on the vision transformer model. Based on this, when this solution performs action recognition based on fused features to obtain action recognition results, it can be to input multiple fused features into the trained action recognition model, obtain the feature vectors of each fused feature based on the transformer encoder through the action recognition model, and input each feature vector into the fully connected layer, and the fully connected layer outputs the action recognition result.
- the action recognition model provided by this solution includes a transformer encoder (Transformer Decoder based on the vision transformer model) and a fully connected layer (MLP Head), and the action type (Class) corresponding to the output action recognition result can be dancing, singing, etc.; multiple transformer encoders are provided, and optionally the number of transformer encoders can be set according to the number of video frames.
- the transformer encoder includes a normalization layer (Norm), a multi-head attention layer (Multi-Head Attention), a normalization layer (Norm) and a fully connected layer (MLP) connected in sequence.
- the action recognition model provided by this solution can be implemented using pytorch as the underlying support library.
- each of the fused features determined above is input into the action recognition model, and the action recognition model will obtain the feature vector of the corresponding fused feature through a transformer encoder respectively, and input the feature vector corresponding to each fused feature into the fully connected layer (MLP Head), and the fully connected layer outputs the action recognition result.
- This solution uses an action recognition model built on the vision transformer model to perform action recognition on the fused features, where the vision transformer model applies the transformer encoder algorithm in the NLP field to the visual field, treats the features of each frame as NLP tokens, and directly applies the standard transformer encoder of NLP to these tokens, and can perform video classification based on this, effectively improving the accuracy of action recognition in the video.
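A sketch of such a vision-transformer-style classifier is shown below: the per-frame fused features are treated as tokens, passed through pre-norm transformer encoder layers (Norm, Multi-Head Attention, Norm, MLP) and classified by an MLP head. The encoder depth, head count and mean pooling over tokens are assumptions; only the overall layout follows the description.

```python
import torch
import torch.nn as nn

class ActionRecognizer(nn.Module):
    """Vision-transformer-style classifier sketched from FIG7: fused per-frame
    features are the tokens, a stack of pre-norm transformer encoder layers
    processes them, and an MLP head outputs per-class probabilities."""
    def __init__(self, dim: int = 768, num_classes: int = 3, depth: int = 4):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8, dim_feedforward=4 * dim,
                                           batch_first=True, norm_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        self.head = nn.Sequential(nn.LayerNorm(dim), nn.Linear(dim, num_classes))  # MLP Head

    def forward(self, fused_tokens: torch.Tensor) -> torch.Tensor:
        x = self.encoder(fused_tokens)               # (B, num_frames, dim)
        return self.head(x.mean(dim=1)).softmax(-1)  # probability per action type

# 16 fused per-frame features -> probabilities for e.g. dancing / singing / games
print(ActionRecognizer()(torch.randn(2, 16, 768)).shape)  # torch.Size([2, 3])
```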
- the audio features and video features are fused to obtain fused features, and the fused features are used for action recognition to obtain action recognition results.
- the dimensions for action recognition are enriched, and the action recognition accuracy of the video to be identified is effectively improved.
- the sound features of each audio frame are accurately extracted through the audio feature extraction network, and the video features of the video frame are extracted through the video feature extraction network.
- the video features of each video frame are accurately extracted, and the action recognition model built based on the vision transformer model is used to perform action recognition on the fused features, effectively improving the action recognition accuracy of the video to be identified.
- FIG8 is a schematic diagram of the structure of an action recognition apparatus provided in an embodiment of the present application.
- the action recognition apparatus includes an audio analysis module 81, a video analysis module 82, a feature fusion module 83 and an action recognition module 84.
- the audio analysis module 81 is configured to obtain the spectrogram of each audio frame in the video to be identified, and extract the audio features of each audio frame based on the spectrogram;
- the video analysis module 82 is configured to obtain the video features of each video frame in the video to be identified;
- the feature fusion module 83 is configured to map the audio features and the video features to the same dimension, and fuse the audio features and the video features to obtain multiple fused features;
- the action recognition module 84 is configured to perform action recognition based on the fused features to obtain action recognition results.
- the audio features and the video features are fused to obtain fused features, and the fused features are used for action recognition to obtain the action recognition result.
- the dimension of action recognition is enriched, which is effective. Improve the accuracy of action recognition in videos to be recognized.
- when the audio analysis module 81 obtains the spectrogram of each audio frame in the video to be identified, it is configured as follows:
- resample each audio frame in the video to be identified into mono audio of a set frequency; perform a short-time Fourier transform on each resampled audio frame to obtain the frequency spectrum of the audio frame; and map the frequency spectrum to a Mel filter bank to calculate the spectrogram corresponding to each frequency spectrum.
- the audio analysis module 81 extracts the audio features of each audio frame based on the spectrogram, it is configured as follows:
- Each spectrogram is input into a trained audio feature extraction network, and the audio feature extraction network extracts audio features of each audio frame based on a first number of convolutional layers and a second number of fully connected layers connected in sequence.
- the video analysis module 82 when acquiring the video features of each video frame in the video to be identified, is configured as follows:
- Each video frame in the video to be identified is input into a trained video feature extraction network, and the video features of each video frame are extracted through the video feature extraction network, wherein the video feature extraction network includes a stem convolution block composed of multiple convolutional layers connected in sequence, a third number of residual block groups based on convolutional layers, and a pooling layer with an attention mechanism.
- the video feature extraction network includes a stem convolution block composed of multiple convolutional layers connected in sequence, a third number of residual block groups based on convolutional layers, and a pooling layer with an attention mechanism.
- the stem convolution block in the video feature extraction network includes three 3×3 convolutional layers and a pooling layer, and the pooling layer in the stem convolution block is an average pooling layer.
- the video feature extraction network includes four residual block groups connected in sequence, the residual block group includes multiple residual blocks, the residual block includes a 1×1 convolutional layer, a 3×3 convolutional layer and a 1×1 convolutional layer connected in sequence, and the video feature extraction network has a 1×1 convolutional layer bypass-connected at the residual block at the entrance of each residual block group.
- when the feature fusion module 83 maps the audio features and the video features to the same dimension and fuses the audio features and the video features to obtain multiple fused features, it is configured as follows:
- the audio features and video features are mapped to the same dimension through the fully connected layers respectively;
- based on a feature splicing fusion method, a feature superposition fusion method or a weighted superposition fusion method, the audio features and the video features are fused to obtain a plurality of fused features.
- when the action recognition module 84 performs action recognition based on the fused features to obtain the action recognition result, it is configured as follows:
- fused features are input into the trained action recognition model.
- the feature vectors of each fused feature are obtained through the action recognition model based on the transformer encoder, and each feature vector is input into the fully connected layer, which outputs the action recognition result.
- the video to be identified is video stream data provided by the anchor end
- the action recognition device also includes a first processing module, which is configured to perform live broadcast room recommendation processing on the live broadcast room of the anchor end based on the time information of the video to be identified and the action recognition result, after the action recognition module 84 performs action recognition based on the fused features to obtain the action recognition result.
- the video to be identified is complete video data
- the action recognition device further includes a second processing module configured to perform video recommendation processing on the video to be identified based on the action recognition result after the action recognition module 84 performs action recognition based on the fusion features to obtain the action recognition result.
- the embodiment of the present application also provides an action recognition device, which can integrate the action recognition apparatus provided by the embodiments of the present application.
- FIG9 is a structural diagram of an action recognition device provided by an embodiment of the present application.
- the action recognition device includes: an input device 93, an output device 94, a memory 92 and one or more processors 91; the memory 92 is used to store one or more programs; when the one or more programs are executed by the one or more processors 91, the one or more processors 91 implement the action recognition method provided in the above embodiments.
- the action recognition apparatus, device and computer provided above can be used to execute the action recognition method provided by any of the above embodiments, and have corresponding functions and beneficial effects.
- the embodiment of the present application also provides a non-volatile storage medium storing computer executable instructions, which are used to execute the action recognition method provided in the above embodiment when executed by a computer processor.
- of course, for the non-volatile storage medium storing computer-executable instructions provided in the embodiments of the present application, the computer-executable instructions are not limited to the action recognition method provided above, and can also execute related operations in the action recognition method provided in any embodiment of the present application.
- the action recognition device, equipment and storage medium provided in the above embodiment can execute the action recognition method provided in any embodiment of the present application.
- the embodiments of the present application also provide a computer program product.
- the technical solution of the present application, or the part that contributes to the relevant technology, or all or part of the technical solution can be embodied in the form of a software product.
- the computer program product is stored in a storage medium, and includes a number of instructions for enabling a computer device, a mobile terminal or a processor therein to execute all or part of the steps of the action recognition method provided in each embodiment of the present application.
Abstract
The embodiments of the present application provide an action recognition method, apparatus, device, storage medium and product. The technical solution provided by the embodiments of the present application obtains the spectrogram of each audio frame in a video to be identified, extracts the audio features of the audio frames based on the spectrograms, and extracts the video features of each video frame in the video to be identified; after the audio features and the video features are mapped to the same dimension, the audio features and the video features are fused to obtain fused features, and action recognition is performed on the fused features to obtain an action recognition result. By performing multimodal feature fusion of the audio features and the video features in the video to be identified, the dimensions used for action recognition are enriched, and the action recognition accuracy for the video to be identified is effectively improved.
Description
本申请要求在2023年02月02日提交中国专利局,申请号为202310124729.9的中国专利申请的优先权,该申请的全部内容通过引用结合在本申请中。
本申请实施例涉及图像处理技术领域,尤其涉及一种动作识别方法、装置、设备、存储介质及产品。
随着互联网和图像处理技术的发展,视频直播的用户也越来越多,直播中的精彩画面的识别和记录可吸引大量用户进行观看,从而提升用户使用感与主播的活跃度。
随着直播业务以及用户的增加,对于直播内容或者视频内容的自动识别越来越重要。传统的动作识别方法一般是利用深度学习对视频流或视频中的图片进行动作识别,但是这样的动作识别方法只考虑了视频流或视频中的图像信息,动作识别精度较低。
发明内容
本申请实施例提供一种动作识别方法、装置、设备、存储介质及产品,以解决相关技术中只考虑图像信息,动作识别精度较低的技术问题,有效提高动作识别精度。
在第一方面,本申请实施例提供了一种动作识别方法,包括:
获取待识别视频中各个音频帧的声谱图,并基于所述声谱图提取各个所述音频帧的音频特征;
获取待识别视频中各个视频帧的视频特征;
将所述音频特征和所述视频特征映射到相同维度,并对所述音频特征和所述视频特征进行融合处理得到多个融合特征;
基于所述融合特征进行动作识别得到动作识别结果。
在第二方面,本申请实施例提供了一种动作识别装置,包括音频分析模块、视频分析模块、特征融合模块和动作识别模块,其中:
所述音频分析模块,配置为获取待识别视频中各个音频帧的声谱图,并基于所述声谱图提取各个所述音频帧的音频特征;
所述视频分析模块,配置为获取待识别视频中各个视频帧的视频特征;
所述特征融合模块,配置为将所述音频特征和所述视频特征映射到相同维度,并对所述音频特征和所述视频特征进行融合处理得到多个融合特征;
所述动作识别模块,配置为基于所述融合特征进行动作识别得到动作识别结果。
在第三方面,本申请实施例提供了一种动作识别设备,包括:存储器以及一个或多个处理器;
所述存储器,用于存储一个或多个程序;
当所述一个或多个程序被所述一个或多个处理器执行,使得所述一个或多个处理器实现如第一方面所述的动作识别方法。
在第四方面,本申请实施例提供了一种存储计算机可执行指令的非易失性存储介质,所述计算机可执行指令在由计算机处理器执行时用于执行如第一方面所述的动作识别方法。
在第五方面,本申请实施例提供了一种计算机程序产品,该计算机程序产品包括计算机程序,该计算机程序存储在计算机可读存储介质中,设备的至少一个处理器从计算机可读存储介质读取并执行计算机程序,使得设备执行如第一方面所述的动作识别方法。
本申请实施例通过获取待识别视频中各个音频帧的声谱图,并基于声谱图提取音频帧的音频特征,以及提取待识别视频中各个视频帧的视频特征,将音频特征和视频特征映射到相同维度后,对音频特征和视频特征进行融合处理得到融合特征,并对融合特征进行动作识别得到动作识别结果,通过对待识别视频中的音频特征和视频特征进行多模态的特征融合,丰富进行动作识别的维度,有效提高对待识别视频的动作识别精度。
图1是本申请实施例提供的一种动作识别方法的流程图;
图2是本申请实施例提供的另一种动作识别方法的流程图;
图3是本申请实施例提供的一种音频特征提取网络结构示意图;
图4是本申请实施例提供的一种视频特征提取网络结构示意图;
图5是本申请实施例提供的一种茎部卷积块的结构示意图;
图6是本申请实施例提供的一种残差块组结构示意图;
图7是本申请实施例提供的一种动作识别模型示意图;
图8是本申请实施例提供的一种动作识别装置的结构示意图;
图9是本申请实施例提供的一种动作识别设备的结构示意图。
为了使本申请的目的、技术方案和优点更加清楚,下面结合附图对本申请具体实施例作进一步的详细描述。可以理解的是,此处所描述的具体实施例仅仅用于解释本申请,而非对本申请的限定。另外还需要说明的是,为了便于描述,附图中仅示出了与本申请相关的部分而非全部内容。在更加详细地讨论示例性实施例之前应当提到的是,一些示例性实施例被描述成作为流程图描绘的处理或方法。虽然流程图将各项操作(或步骤)描述成顺序的处理,但是其中的许多操作可以被并行地、并发地或者同时实施。此外,各项操作的顺序可以被重新安排。当其操作完成时上述处理可以被终止,但是还可以具有未包括在附图中的附加步骤。上述处理可以对应于方法、函数、规程、子例程、子程序等等。
本申请提供的动作识别方法可应用于直播、视频贴等场景的动作识别,例如对直播场景中直播间唱歌、跳舞等动作类型的实时识别,对视频贴场景中用户上传视频的内容识别推荐等,旨在通过对待识别视频中的音频特征和视频特征进行多模态的特征融合,丰富进行动作识别的维度,提高对待识别视频的动作识别精度。对于传统的视频动作识别方案,一般是通过深度学习网络对视频中的图像进行动作识别,但是这种动作识别方式只考虑了图像信息对动作识别的影响,忽略了视频中的动作时序性信息,对视频的动作识别精度较低。基于此,提供本申请实施例的一种动作识别方法,以解决现有动作识别方案只考虑视频中的图像信息,动作识别精度较低的技术问题。
图1给出了本申请实施例提供的一种动作识别方法的流程图,本申请实施例提供的动作识别方法可以由动作识别装置来执行,该动作识别装置可以通过硬件和/或软件的方式实现,并集成在动作识别设备(例如直播平台、视频服务平台等)中。
下述以动作识别装置执行动作识别方法为例进行描述。参考图1,该动作识别方法包括:
S101:获取待识别视频中各个音频帧的声谱图,并基于声谱图提取各个音频帧的音频特征。
S102:获取待识别视频中各个视频帧的视频特征。
本方案提供的待识别视频可理解为需要进行动作识别的视频,例如对视频中任务的动作进行识别,其中动作识别得到的动作识别结果可以是舞蹈、唱歌、游戏等不同动作类型。其中,待识别信息可以是视频流数据(例如主播端在直播期间实时上传的视频流数据)或完整的视频数据(例如用户在视频贴中上传的完整视频)。可选的,对于视频流数据,可基于设定的时间间隔内收集的视频流数据作为待识别视频。本方案提供的待识别信息中记录有多个连续的音频帧和多个连续的视频帧。
示例性的,获取需要识别动作的待识别视频,提取出待识别视频中的音频帧和视频帧。对于待识别信息中的每个音频帧,确定每个音频帧的声谱图,并基于声谱图提取音频帧对应的音频特征。例如将每个音频帧对应的声谱图输入至训练好的音频特征提取网络中,由音频特征提取网络对各个声谱图进行处理分析并输出对应的音频特征。
对于待识别信息中的每个视频帧,分别提取每个视频帧的视频特征。例如,将各个视频帧依次输入至训练好的视频特征提取网络中,由视频网络对各个视频帧进行处理分析并输出对应的视频特征。
S103:将音频特征和视频特征映射到相同维度,并对音频特征和视频特征进行融合处理得到多个融合特征。
示例性的,在得到各个音频帧的音频特征以及各个视频帧的视频特征后,将音频特征和视频特征映射到相同维度。可选的,可分别通过全连接层实现音频特征和视频特征映射到设定的维度映射。需要进行解释的是,由于存在音频特征的维度与视频特征的维度不一致的情况,需要将音频特征和视频特征映射到相同的维度,使得音频帧和视频帧的特征权重保持一致,使得音频帧和视频
帧可共同在动作识别中发挥作用,提高动作识别精度。
在一个实施例中,对音频特征和视频特征进行融合处理得到多个融合特征。其中,一个待识别视频中的音频帧的数量与视频帧的数量一致,对各个音频特征和视频特征融合得到的融合特征的数量,与音频帧或视频帧的数量一致。
可选的,对音频特征和视频特征的融合处理,可以是在帧间融合音频特征和视频特征,例如一帧的音频帧对应的音频特征与一帧的视频帧对应的视频特征进行融合。在其他实施例中,也可以是在视频级别对音频特征和视频特征进行整体的多模态融合,例如将整个待识别视频的音频特征和视频特征进行融合处理。
S104:基于融合特征进行动作识别得到动作识别结果。
示例性的,在得到待识别视频在音频和视频模态上的融合特征后,可基于这些融合特征进行动作识别,得到待识别视频对应的动作识别结果。例如,将上述得到的多个融合特征输入至训练好的动作识别模型中,由动作识别模型对接收到的多个融合特征进行分析处理,并输出对应的动作识别结果。
在一个实施例中,动作识别结果可表示为不同的动作类型对应的概率,可将动作识别结果中最大概率对应的动作类型确定为待识别视频对应的动作类型,或者是将概率达到设定阈值的动作类型确定为待识别视频对应的动作类型(在多个动作类型的概率均达到设定阈值时,可将最大概率对应的动作类型确定为待识别视频对应的动作类型,或者是将多个动作类型均确定为待识别视频对应的动作类型)。
在一个可能的实施例中,在本方案按提供的识别视频为主播端提供的视频流数据时,本方案在基于融合特征进行动作识别得到动作识别结果之后,还可基于待识别视频的时间信息以及动作识别结果,对主播端的直播间进行直播间推荐处理。
示例性的,按照设定的时间间隔收集主播端提供的视频流数据,并将每个时间间隔对应的视频流数据作为待识别视频依次进行动作识别,得到每个时间间隔的待识别视频对应的动作识别结果。其中,待识别视频的时间信息可基于视频流数据的时间戳确定,例如,可将待识别视频中第一个视频帧、中间视频帧或最后一个视频帧的时间戳作为待识别视频的时间信息,也可以是将待识别视频中各个帧视频帧的时间戳的平均时间作为待识别视频的时间信息。
在一个实施例中,根据待识别视频的时间信息以及动作识别结果,对主播
端的直播间进行直播间推荐处理。其中,对直播间的直播间推荐处理可以是对直播间在对应频道中的权重进行加权或降权处理,并基于直播间的权重在对应频道中对各个直播间进行排序,还可以是基于直播间的权重在全频道中对各个直播间进行排序,观众端可在对应频道或全频道优先看到推荐的直播间,提升观众直播观看体验以及主播直播间热度。本方案通过基于时间信息和动作识别结果对直播间进行直播间推荐处理,使得观众可优先看到感兴趣的直播内容的直播间,提升观众直播观看体验以及主播直播间热度。
例如动作识别结果开始指示识别到设定的动作类型(概率达到设定阈值)时,将对应的时间信息确定为对应动作类型的开始时间,并对直播间进行加权处理,提高对直播间的推荐度。在动作识别结果开始指示未识别到设定的动作类型(概率小于设定阈值)时,将对应的时间信息确定为对应动作类型的结束时间,并对直播间进行降权处理,优先推荐其他进行对应动作类型的直播间。例如在检测到直播间中的主播开始跳舞和/或唱歌时,对直播间在舞蹈频道和/或唱歌频道中的权重进行加权处理,提高在舞蹈频道和/或唱歌频道对该直播间的推荐度。而在检测到直播间中的主播结束跳舞和/或唱歌时,对直播间在舞蹈频道和/或唱歌频道中的权重进行降权处理,降低在舞蹈频道和/或唱歌频道对该直播间的推荐度,使得观众端可优先看到相应直播内容的直播间,提升观众直播观看体验以及主播直播间热度。
在一个可能的实施例中,在本方案按提供的识别视频为完整的视频数据时,本方案在基于融合特征进行动作识别得到动作识别结果之后,还可基于动作识别结果对待识别视频进行视频推荐处理。
其中,用户端或主播端可向直播平台或视频服务平台上传完整的视频数据,其中视频数据可以是直播期间的录屏视频,也可以是另外制作的视频。示例性的,获取用户端或主播端上传的视频数据,并将视频数据作为待识别视频进行动作识别,得到待识别视频的动作识别结果。在一个实施例中,根据动作识别结果对该待识别视频进行视频推荐处理。其中,对待识别视频的视频推荐处理方式可以是标记待识别视频的动作类型,并基于动作类型将待识别视频推荐给感兴趣的用户。本方案通过根据动作识别结果对用户上传的视频数据进行视频推荐处理,使得用户可优先看到感兴趣的视频数据,提升观众视频播观看体验。
上述,通过获取待识别视频中各个音频帧的声谱图,并基于声谱图提取音频帧的音频特征,以及提取待识别视频中各个视频帧的视频特征,将音频特征
和视频特征映射到相同维度后,对音频特征和视频特征进行融合处理得到融合特征,并对融合特征进行动作识别得到动作识别结果,通过对待识别视频中的音频特征和视频特征进行多模态的特征融合,丰富进行动作识别的维度,有效提高对待识别视频的动作识别精度。
在上述实施例的基础上,图2给出了本申请实施例提供的另一种动作识别方法的流程图,该动作识别方法是对上述动作识别方法的具体化。参考图2,该动作识别方法包括:
S201:将待识别视频中的各个音频帧重采样为设定频率的单声道音频。
S202:对重采样后的各个音频帧进行短时傅里叶变换得到音频帧的频谱图。
S203:将频谱图映射到梅尔滤波器组中计算各个频谱图对应的声谱图。
示例性的,在得到待识别视频后,基于设定频率对待识别视频中的各个音频帧进行重采样处理,从而将待识别视频中的各个音频帧重采样为设定频率的单声道音频。例如,假设待识别视频中的音频文件为wav格式的音频文件,音频文件中包括多个连续的音频帧,可将这些音频帧重采样为16kHz的单声道音频。
在一个实施例中,对重采样后的各个音频帧进行短时傅里叶变换得到,得到每个音频帧的频谱图。并将上述得到的每个音频帧的频谱图映射到梅尔滤波器组(mel滤波器组)中,通过梅尔滤波器组计算各个频谱图对应的声谱图(mel声谱)。
可选的,对音频帧的短时傅里叶变换可利用设定尺寸的时间窗口,以设定的移帧幅度进行,例如使用25ms的Hann时窗,以10ms的帧移对音频帧进行短时傅里叶变换得到音频帧的频谱图。在得到各个音频帧的频谱图后,可将频谱图映射到64阶的梅尔滤波器组中计算mel声谱,并计算log(mel声谱+0.01)得到稳定的mel声谱(即声谱图)。其中,0.01作为偏置项,可有效减少计算对0取对数的情况,保证得到有效的声谱图。
本方案通过对重采样后的音频帧进行短时傅里叶变换得到频谱图,并将频谱图映射到梅尔滤波器组中计算声谱图,准确计算各个音频帧的声谱图,提高音频特征的提取质量以及对待识别视频的动作识别精度。
S204:将各个声谱图输入至训练好的音频特征提取网络中,通过音频特征提取网络基于依次连接的第一数量的卷积层和第二数量的全连接层提取各个音
频帧的音频特征。
本方案通过训练好的音频特征提取网络从声谱图中提取音频特征,其中音频特征提取网络包括依次连接的第一数量的卷积层和第二数量的全连接层。例如,第一数量和第二数量可分别设置为6和3,即音频特征提取网络可包括依次连接的6个卷积层和3个全连接层。
示例性的,在得到各个频谱图对应的声谱图后,将各个声谱图输入至训练好的音频特征提取网络中,以通过该音频特征提取网络基于依次连接的第一数量的卷积层和第二数量的全连接层提取各个音频帧的音频特征。本方案通过音频特征提取网络基于依次连接的第一数量的卷积层和第二数量的全连接层提取音频帧的音频特征,准确提取各个音频帧的声音特征,有效提高对待识别视频的动作识别精度。
在一个实施例中,本方案提供的音频特征提取网络可以是基于VGGish模型进行搭建,声谱图可直接输入音频特征提取网络中提取音频特征,其中,VGGish模型作为基于tensorflow的VGG模型,支持从声谱图中提取具有语义的128维embedding特征向量作为音频特征。
如图3提供的一种音频特征提取网络结构示意图所示,其中音频特征提取网络包括依次连接的6个卷积层(包括由通道数为64的单个卷积层构成的conv1、由通道数为128的单个卷积层构成的conv2、由通道数为256的两个卷积层构成的conv3和由通道数为512的两个卷积层构成的conv4,conv1-4之间通过通过激活函数连接)和3个全连接层(包括由通道数为4096的两个全连接层构成的fc1和由通道数为128的单个全连接层构成的fc2,conv4和fc1之间通过激活函数连接),将96×64的声谱图输入音频特征提取网络中,音频特征提取网络可输出128维的音频特征。
S205:将待识别视频中的各个视频帧输入至训练好的视频特征提取网络中,通过视频特征提取网络提取各个视频帧的视频特征。
本方案通过训练好的视频特征提取网络提取视频帧的视频特征。其中,本方案提供的视频特征提取网络包括依次连接的基于多个卷积层组成的茎部卷积块、第三数量的基于卷积层的残差块组以及带注意力机制的池化层。其中,茎部卷积块可提取视频帧中包含图像语义信息的特征图,带注意力机制的池化层可应用注意力机制降低提取到的特征的信息冗余,增大网络感受野,防止网络出现过拟合的情况,保证视频特征提取质量。
示例性的,将待识别视频中的各个视频帧依次输入到视频特征提取网络中,由视频特征提取网络提取各个视频帧的视频特征。其中,视频特征提取网络在提取视频特征时,视频帧依次经过茎部卷积块、残差块组以及池化层进行处理并得到视频特征。本方案通过视频特征提取网络基于依次连接的茎部卷积块、残差块组以及池化层提取视频帧的视频特征,准确提取各个视频帧的视频特征,有效提高对待识别视频的动作识别精度。
如图4提供的一种视频特征提取网络结构示意图所示,其中视频特征提取网络包括依次连接的茎部卷积块、第三数量的残差块组(图中设置为4个残差块组)以及带注意力机制的池化层(attention pool,1×1024),可将224×224的视频帧输入视频特征提取网络中,得到1024维的视频特征。可选的,本方案提供的视频特征提取网络可基于ResNet50网络(50层的残差神经网络)进行搭建得到。
在一个可能的实施例中,如图5提供的一种茎部卷积块的结构示意图所示,本方案提供的视频特征提取网络中的茎部卷积块包括依次连接的三层3×3的卷积层(包括依次连接的“3×3conv,32,/2”的卷积层、“3×3conv,32,/1”的卷积层以及“3×3conv,64,1”的卷积层)以及池化层,并且本方案提供的茎部卷积块中的池化层为平均池化层(“avgpool,/2”的池化层)。
需要进行解释的是,相对于原始的ResNet50网络中设置为一层的7×7的卷积层,本方案将7×7的卷积层调整为三层3×3的卷积层,通多个小卷积层可实现更好的堆积效果,小卷积层堆积更能更有效地提取视频帧中的图像语义信息。并且原始的ResNet50网络中的池化层为最大池化层,本方案将最大池化层调整为平均池化层,可有效保留视频帧中的背景信息,提取出来的视频特征会包含更多的背景信息,对视频的动作判断更精准,提高对待识别视频的动作识别精度。同时,原始的ResNet50网络中最后接入的池化层为平均池化层,本方案将平均池化层调整为注意力池化层,可有效降低提取到的特征的信息冗余,增大网络感受野,防止网络出现过拟合的情况。其中,注意力机制将查询向量Query与一组键值对(Key-Value)映射到输出,其中Query(查询矩阵)、Key(键值矩阵)、Value(键值矩阵)和注意力池化层的输出都是向量。输出可以通过值的加权和计算得出,其中分配到每一个值的权重可通过Query和对应Key的适应度函数计算。通过可学习的权重WQ、WK、WV从输入向量中创建查询矩阵Query、键值矩阵Key和值矩阵Value。注意力机制的计算就是对Query、
Key和Value这三个矩阵计算得到最终的结果。相较于平均池化和最大池化通过平均或者最大挤压到最终分类层的一维特征,注意力机制池化可以选择最佳的二维特征转换为最终分类层的一维特征,提高视频特征提效果。本方案采用多层小卷积核构建的茎部卷积块代替单层大卷积核,茎部卷积块输出用平均池化层代替最大池化层,将提取后的特征到分类层,并利用注意力机制的池化层代替平均池化,基于改进的ResNet50模型,有效提升视频特征的感受野,提取视频帧中更高维的语义信息,使得提取的视频特征和动作具有更好的匹配性。
在一个可能的实施例中,如图6提供的一种残差块组结构示意图所示,图中提供了4个依次连接的残差块组,本方案提供的视频特征提取网络包括4个依次连接的残差块组,其中残差块组包括多个残差块,残差块包括依次连接的1×1的卷积层、3×3的卷积层以及1×1的卷积层。同时,本方案提供的视频特征提取网络在每个残差块组入口的残差块处旁路连接有1×1的卷积层。
其中,视频特征提取网络中设置为4个残差块组(stage),共13个使用卷积的残差块。其中第一个残差块组包括3个残差块,最后2个残差块结构相同;第二个残差块组包括4个残差块,最后3个残差块结构相同;第三个残差块组包括4个残差块,最后3个残差块结构相同;第四个残差块组包括2个残差块。其中每个残差块组入口的第一个残差块旁路连接1×1的卷积层。
本方案提供的每个残差块都使用了基于bottleneck的设计方式,即每个残差块都由3个卷积层组成(卷积核参数分别为1×1、3×3和1×1),其中进出口的两个1×1卷积层分别用于压缩和还原特征图的通道数(原始的特征图由茎部卷积块输出),有效减少卷积层运算的通道数,减少计算过程中的参数量,提高视频特征提取效率。另外,由于特征图每经过一个残差块组,需要把特征图的尺寸缩小至四分之一、通道扩大为两倍,本方案在每个残差块组的入口处都使用了2D投影残差块,即在每个残差块组入口的残差块处旁路连接1×1的卷积层,以保证在对特征图做逐像素相加操作时,特征图的尺寸和通道数保持一致。同时,只在每个残差块组的入口处使用2D投影残差块可进一步减少网络参数,提高视频特征提取效率。
S206:分别通过全连接层将音频特征和视频特征映射到相同维度。
S207:基于特征拼接融合方式、特征叠加融合方式或加权叠加融合方式,对音频特征和视频特征进行融合处理得到多个融合特征。
示例性的,在得到音频特征和视频特征后,分别通过一个全连接层将音频
特征和视频特征映射到相同维度。在一个实施例中,可通过设定的融合方式对音频特征和视频特征进行融合处理,得到多个融合特征。
本方案提供的融合方式可以是特征拼接融合方式、特征叠加融合方式和加权叠加融合方式中的一种,其中特征拼接融合方式可以是将音频特征和视频特征进行拼接得到融合特征,例如将384维的音频特征和384维的视频特征拼接为768维的融合特征。特征叠加融合方式可以是将音频特征和视频特征相加得到融合特征,例如将384维的音频特征和384维的视频特征相加得到384维的融合特征。加权叠加融合方式可以是基于设定的权值系数对音频特征和视频特征进行加权求和得到融合特征,例如将384维的音频特征和384维的视频特征加权求和得到384维的融合特征。本方案通过将音频特征和视频特征映射到相同维度后,基于设定的融合方式将音频特征和视频特征融合为融合特征,使得音频帧和视频帧可共同在动作识别中发挥作用,提高动作识别精度。
在一个实施例中,在进行融合处理时,可以是将一帧音频帧的音频特征与一帧视频帧的视频特征进行融合处理。可选的,可在对音频特征和视频特征进行融合处理之前,或者是在得到待识别视频后,基于时间戳对音频特征和视频特征进行对齐,在行融合处理时,可以是将一帧音频帧的音频特征和与其对齐的视频帧的视频特征进行融合处理,提高融合处理效果。
S208: Perform action recognition based on the fused features to obtain an action recognition result.
This solution can perform action recognition on the fused features through an action recognition model built on the basis of a vision transformer model. On this basis, when performing action recognition based on the fused features to obtain the action recognition result, this solution may input the multiple fused features into a trained action recognition model, obtain the feature vector of each fused feature through the action recognition model based on a transformer encoder, input each feature vector into a fully connected layer, and output the action recognition result from the fully connected layer.
As shown in the schematic diagram of an action recognition model provided in FIG. 7, the action recognition model provided by this solution includes a transformer encoder (the Transformer Encoder of the vision transformer model) and a fully connected layer (MLP Head), and the action type (Class) corresponding to the output action recognition result can be, for example, dancing or singing. Multiple transformer encoders are provided; optionally, the number of transformer encoders can be set according to the number of video frames. Each transformer encoder includes a sequentially connected normalization layer (Norm), multi-head attention layer (Multi-Head Attention), normalization layer (Norm) and fully connected layer (MLP). Optionally, the action recognition model provided by this solution can be implemented using pytorch as the underlying support library.
Exemplarily, the fused features determined above are input into the action recognition model, which obtains the feature vector of each fused feature through a transformer encoder and inputs the feature vector corresponding to each fused feature into the fully connected layer (MLP Head), which outputs the action recognition result. This solution performs action recognition on the fused features through an action recognition model built on the basis of a vision transformer model. The vision transformer model applies the transformer encoder from the NLP field to the visual field, treats the feature of each frame as an NLP token, applies the standard NLP transformer encoder directly to these tokens, and can classify the video accordingly, effectively improving the accuracy of action recognition for the video.
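As an illustrative sketch, a ViT-style classifier over the fused per-frame features might look as follows; the encoder depth, number of heads, feature dimension, class count and the mean pooling before the MLP Head are assumptions, while the Norm → Multi-Head Attention → Norm → MLP encoder structure and the pytorch implementation follow the description.

```python
import torch
import torch.nn as nn

class ActionRecognizer(nn.Module):
    """ViT-style classifier over fused per-frame features: a stack of standard
    transformer encoder layers followed by an MLP head (action class logits)."""
    def __init__(self, dim=768, num_classes=10, depth=4, heads=8):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                           dim_feedforward=4 * dim,
                                           batch_first=True, norm_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        self.head = nn.Linear(dim, num_classes)    # MLP Head -> class logits (e.g. dancing, singing)

    def forward(self, fused_tokens):                # (B, num_frames, dim): fused features as tokens
        encoded = self.encoder(fused_tokens)
        return self.head(encoded.mean(dim=1))       # assumed mean pooling before the head

# Usage: 16 fused 768-dimensional features from one video clip.
logits = ActionRecognizer()(torch.randn(1, 16, 768))
```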
In summary, by obtaining the spectrogram of each audio frame in the video to be recognized, extracting the audio features of the audio frames based on the spectrograms, extracting the video feature of each video frame in the video to be recognized, mapping the audio features and the video features to the same dimension, fusing the audio features and the video features to obtain fused features, and performing action recognition on the fused features to obtain the action recognition result, the multimodal fusion of audio features and video features from the video to be recognized enriches the dimensions used for action recognition and effectively improves the action recognition accuracy for the video to be recognized. Meanwhile, the sound features of each audio frame are accurately extracted through the audio feature extraction network, the video feature of each video frame is accurately extracted through the video feature extraction network, and action recognition is performed on the fused features through the action recognition model built on the basis of the vision transformer model, effectively improving the action recognition accuracy for the video to be recognized.
FIG. 8 is a schematic structural diagram of an action recognition apparatus provided by an embodiment of the present application. Referring to FIG. 8, the action recognition apparatus includes an audio analysis module 81, a video analysis module 82, a feature fusion module 83 and an action recognition module 84.
The audio analysis module 81 is configured to obtain a spectrogram of each audio frame in a video to be recognized and extract the audio feature of each audio frame based on the spectrogram; the video analysis module 82 is configured to obtain the video feature of each video frame in the video to be recognized; the feature fusion module 83 is configured to map the audio features and the video features to the same dimension and fuse the audio features and the video features to obtain multiple fused features; and the action recognition module 84 is configured to perform action recognition based on the fused features to obtain an action recognition result.
In summary, by obtaining the spectrogram of each audio frame in the video to be recognized, extracting the audio features of the audio frames based on the spectrograms, extracting the video feature of each video frame in the video to be recognized, mapping the audio features and the video features to the same dimension, fusing the audio features and the video features to obtain fused features, and performing action recognition on the fused features to obtain the action recognition result, the multimodal fusion of audio features and video features from the video to be recognized enriches the dimensions used for action recognition and effectively improves the action recognition accuracy for the video to be recognized.
In a possible embodiment, when obtaining the spectrogram of each audio frame in the video to be recognized, the audio analysis module 81 is configured to:
resample each audio frame in the video to be recognized into mono audio at a set frequency;
perform a short-time Fourier transform on each resampled audio frame to obtain the frequency spectrum of the audio frame; and
map the frequency spectrum to a mel filter bank to compute the spectrogram corresponding to each frequency spectrum (a sketch of this preprocessing pipeline follows below).
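The following is a minimal sketch of the resample / short-time Fourier transform / mel filter bank pipeline listed above, written with the librosa library; the 16 kHz sample rate, the window and hop sizes and the 64 mel bands are assumed parameters that the description does not fix.

```python
import numpy as np
import librosa

def audio_frame_to_logmel(frame, sr_in, sr_out=16000, n_fft=400, hop=160, n_mels=64):
    """Resample an audio frame to mono at a set rate, apply an STFT to get the
    frequency spectrum, then map it onto a mel filter bank to obtain a log-mel
    spectrogram suitable as network input (e.g. 96x64 patches)."""
    mono = librosa.to_mono(frame) if frame.ndim > 1 else frame
    mono = librosa.resample(mono, orig_sr=sr_in, target_sr=sr_out)
    spec = np.abs(librosa.stft(mono, n_fft=n_fft, hop_length=hop)) ** 2   # frequency spectrum
    mel = librosa.feature.melspectrogram(S=spec, sr=sr_out, n_mels=n_mels)
    return np.log(mel + 1e-6).T        # (time, 64) log-mel spectrogram
```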
In a possible embodiment, when extracting the audio feature of each audio frame based on the spectrogram, the audio analysis module 81 is configured to:
input each spectrogram into a trained audio feature extraction network, and extract the audio feature of each audio frame through the audio feature extraction network based on a first number of sequentially connected convolutional layers and a second number of fully connected layers.
In a possible embodiment, when obtaining the video feature of each video frame in the video to be recognized, the video analysis module 82 is configured to:
input each video frame in the video to be recognized into a trained video feature extraction network, and extract the video feature of each video frame through the video feature extraction network, where the video feature extraction network includes a sequentially connected stem convolutional block composed of multiple convolutional layers, a third number of convolution-based residual block groups, and a pooling layer with an attention mechanism.
In a possible embodiment, the stem convolutional block in the video feature extraction network includes three 3×3 convolutional layers and a pooling layer, and the pooling layer in the stem convolutional block is an average pooling layer.
In a possible embodiment, the video feature extraction network includes four sequentially connected residual block groups, each residual block group includes multiple residual blocks, each residual block includes a sequentially connected 1×1 convolutional layer, 3×3 convolutional layer and 1×1 convolutional layer, and a 1×1 convolutional layer is connected as a bypass at the residual block at the entry of each residual block group in the video feature extraction network.
In a possible embodiment, when mapping the audio features and the video features to the same dimension and fusing the audio features and the video features to obtain multiple fused features, the feature fusion module 83 is configured to:
map the audio features and the video features to the same dimension through fully connected layers, respectively; and
fuse the audio features and the video features based on a feature concatenation fusion method, a feature addition fusion method or a weighted addition fusion method to obtain multiple fused features.
In a possible embodiment, when performing action recognition based on the fused features to obtain the action recognition result, the action recognition module 84 is configured to:
input the multiple fused features into a trained action recognition model, obtain the feature vector of each fused feature through the action recognition model based on a transformer encoder, input each feature vector into a fully connected layer, and output the action recognition result from the fully connected layer.
In a possible embodiment, the video to be recognized is video stream data provided by a streamer client, and the action recognition apparatus further includes a first processing module configured to, after the action recognition module 84 performs action recognition based on the fused features to obtain the action recognition result, perform live streaming room recommendation processing for the live streaming room of the streamer client based on the time information of the video to be recognized and the action recognition result.
In a possible embodiment, the video to be recognized is complete video data, and the action recognition apparatus further includes a second processing module configured to, after the action recognition module 84 performs action recognition based on the fused features to obtain the action recognition result, perform video recommendation processing on the video to be recognized based on the action recognition result.
It should be noted that, in the above embodiment of the action recognition apparatus, the units and modules included are divided only according to functional logic, and the division is not limited to the above as long as the corresponding functions can be realized; in addition, the specific names of the functional units are only used to distinguish them from each other and are not intended to limit the protection scope of the embodiments of the present application.
An embodiment of the present application further provides an action recognition device, in which the action recognition apparatus provided by the embodiments of the present application can be integrated. FIG. 9 is a schematic structural diagram of an action recognition device provided by an embodiment of the present application. Referring to FIG. 9, the action recognition device includes: an input apparatus 93, an output apparatus 94, a memory 92 and one or more processors 91; the memory 92 is configured to store one or more programs; and when the one or more programs are executed by the one or more processors 91, the one or more processors 91 implement the action recognition method provided by the above embodiments. The action recognition apparatus and device provided above can be used to perform the action recognition method provided by any of the above embodiments, and have the corresponding functions and beneficial effects.
An embodiment of the present application further provides a non-volatile storage medium storing computer-executable instructions, where the computer-executable instructions, when executed by a computer processor, are used to perform the action recognition method provided by the above embodiments. Of course, the computer-executable instructions of the non-volatile storage medium storing computer-executable instructions provided by the embodiments of the present application are not limited to the action recognition method provided above, and can also perform related operations in the action recognition method provided by any embodiment of the present application. The action recognition apparatus, device and storage medium provided in the above embodiments can perform the action recognition method provided by any embodiment of the present application; for technical details not described in detail in the above embodiments, reference may be made to the action recognition method provided by any embodiment of the present application.
On the basis of the above embodiments, an embodiment of the present application further provides a computer program product. The technical solution of the present application, in essence, or the part contributing to the related art, or all or part of the technical solution, can be embodied in the form of a software product. The computer program product is stored in a storage medium and includes several instructions for causing a computer device, a mobile terminal or a processor therein to perform all or part of the steps of the action recognition method provided by the embodiments of the present application.
Claims (14)
- An action recognition method, comprising: obtaining a spectrogram of each audio frame in a video to be recognized, and extracting an audio feature of each audio frame based on the spectrogram; obtaining a video feature of each video frame in the video to be recognized; mapping the audio features and the video features to the same dimension, and fusing the audio features and the video features to obtain a plurality of fused features; and performing action recognition based on the fused features to obtain an action recognition result.
- The action recognition method according to claim 1, wherein obtaining the spectrogram of each audio frame in the video to be recognized comprises: resampling each audio frame in the video to be recognized into mono audio at a set frequency; performing a short-time Fourier transform on each resampled audio frame to obtain a frequency spectrum of the audio frame; and mapping the frequency spectrum to a mel filter bank to compute the spectrogram corresponding to each frequency spectrum.
- The action recognition method according to claim 1, wherein extracting the audio feature of each audio frame based on the spectrogram comprises: inputting each spectrogram into a trained audio feature extraction network, and extracting the audio feature of each audio frame through the audio feature extraction network based on a first number of sequentially connected convolutional layers and a second number of fully connected layers.
- The action recognition method according to claim 1, wherein obtaining the video feature of each video frame in the video to be recognized comprises: inputting each video frame in the video to be recognized into a trained video feature extraction network, and extracting the video feature of each video frame through the video feature extraction network, wherein the video feature extraction network comprises a sequentially connected stem convolutional block composed of a plurality of convolutional layers, a third number of convolution-based residual block groups, and a pooling layer with an attention mechanism.
- The action recognition method according to claim 4, wherein the stem convolutional block in the video feature extraction network comprises three 3×3 convolutional layers and a pooling layer, and the pooling layer in the stem convolutional block is an average pooling layer.
- The action recognition method according to claim 4, wherein the video feature extraction network comprises four sequentially connected residual block groups, each residual block group comprises a plurality of residual blocks, each residual block comprises a sequentially connected 1×1 convolutional layer, 3×3 convolutional layer and 1×1 convolutional layer, and a 1×1 convolutional layer is connected as a bypass at the residual block at the entry of each residual block group in the video feature extraction network.
- The action recognition method according to claim 1, wherein mapping the audio features and the video features to the same dimension and fusing the audio features and the video features to obtain the plurality of fused features comprises: mapping the audio features and the video features to the same dimension through fully connected layers, respectively; and fusing the audio features and the video features based on a feature concatenation fusion method, a feature addition fusion method or a weighted addition fusion method to obtain the plurality of fused features.
- The action recognition method according to claim 1, wherein performing action recognition based on the fused features to obtain the action recognition result comprises: inputting the plurality of fused features into a trained action recognition model, obtaining a feature vector of each fused feature through the action recognition model based on a transformer encoder, inputting each feature vector into a fully connected layer, and outputting the action recognition result by the fully connected layer.
- The action recognition method according to any one of claims 1 to 8, wherein the video to be recognized is video stream data provided by a streamer client, and after performing action recognition based on the fused features to obtain the action recognition result, the method further comprises: performing live streaming room recommendation processing for the live streaming room of the streamer client based on time information of the video to be recognized and the action recognition result.
- The action recognition method according to any one of claims 1 to 8, wherein the video to be recognized is complete video data, and after performing action recognition based on the fused features to obtain the action recognition result, the method further comprises: performing video recommendation processing on the video to be recognized based on the action recognition result.
- An action recognition apparatus, comprising an audio analysis module, a video analysis module, a feature fusion module and an action recognition module, wherein: the audio analysis module is configured to obtain a spectrogram of each audio frame in a video to be recognized and extract an audio feature of each audio frame based on the spectrogram; the video analysis module is configured to obtain a video feature of each video frame in the video to be recognized; the feature fusion module is configured to map the audio features and the video features to the same dimension and fuse the audio features and the video features to obtain a plurality of fused features; and the action recognition module is configured to perform action recognition based on the fused features to obtain an action recognition result.
- An action recognition device, comprising: a memory and one or more processors; the memory being configured to store one or more programs; wherein when the one or more programs are executed by the one or more processors, the one or more processors implement the action recognition method according to any one of claims 1 to 10.
- A non-volatile storage medium storing computer-executable instructions, wherein the computer-executable instructions, when executed by a computer processor, are used to perform the action recognition method according to any one of claims 1 to 10.
- A computer program product, comprising a computer program, wherein the computer program, when executed by a processor, implements the action recognition method according to any one of claims 1 to 10.
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310124729.9A CN116129936A (zh) | 2023-02-02 | 2023-02-02 | Action recognition method, apparatus, device, storage medium and product |
CN202310124729.9 | 2023-02-02 |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2024160038A1 true WO2024160038A1 (zh) | 2024-08-08 |
Family
ID=86299003
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/CN2024/072044 WO2024160038A1 (zh) | 2024-01-12 | Action recognition method, apparatus, device, storage medium and product |
Country Status (2)
Country | Link |
---|---|
CN (1) | CN116129936A (zh) |
WO (1) | WO2024160038A1 (zh) |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116129936A (zh) * | 2023-02-02 | 2023-05-16 | 百果园技术(新加坡)有限公司 | 一种动作识别方法、装置、设备、存储介质及产品 |
Citations (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111598026A (zh) * | 2020-05-20 | 2020-08-28 | 广州市百果园信息技术有限公司 | Action recognition method, apparatus, device and storage medium |
CN112188306A (zh) * | 2020-09-23 | 2021-01-05 | 腾讯科技(深圳)有限公司 | Label generation method, apparatus, device and storage medium |
CN112418011A (zh) * | 2020-11-09 | 2021-02-26 | 腾讯科技(深圳)有限公司 | Method, apparatus, device and storage medium for recognizing integrity of video content |
WO2021184026A1 (en) * | 2021-04-08 | 2021-09-16 | Innopeak Technology, Inc. | Audio-visual fusion with cross-modal attention for video action recognition |
CN114529846A (zh) * | 2021-12-28 | 2022-05-24 | 北京邮电大学 | Sound-based video action classification method and related device |
CN114612884A (zh) * | 2021-12-24 | 2022-06-10 | 亚信科技(中国)有限公司 | Abnormal driving behavior recognition method based on a video-stream neural network and related apparatus |
CN114882889A (zh) * | 2022-04-13 | 2022-08-09 | 厦门快商通科技股份有限公司 | Speaker recognition model training method, apparatus, device and readable medium |
CN115273046A (zh) * | 2022-07-29 | 2022-11-01 | 黄山学院 | Driver behavior recognition method for intelligent video analysis |
CN115641533A (zh) * | 2022-10-21 | 2023-01-24 | 湖南大学 | Target object emotion recognition method, apparatus and computer device |
CN116129936A (zh) * | 2023-02-02 | 2023-05-16 | 百果园技术(新加坡)有限公司 | Action recognition method, apparatus, device, storage medium and product |
Also Published As
Publication number | Publication date |
---|---|
CN116129936A (zh) | 2023-05-16 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| 121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 24749519; Country of ref document: EP; Kind code of ref document: A1 |