
CN112685597B - Weak supervision video clip retrieval method and system based on erasure mechanism - Google Patents


Info

Publication number
CN112685597B
CN112685597B
Authority
CN
China
Prior art keywords
video
frame
candidate
branch
language
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110272729.4A
Other languages
Chinese (zh)
Other versions
CN112685597A (en)
Inventor
李昊沅
周楚程
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hangzhou Yizhi Intelligent Technology Co ltd
Original Assignee
Hangzhou Yizhi Intelligent Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hangzhou Yizhi Intelligent Technology Co ltd filed Critical Hangzhou Yizhi Intelligent Technology Co ltd
Priority to CN202110272729.4A
Publication of CN112685597A
Application granted
Publication of CN112685597B

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a weakly supervised video clip retrieval method and system based on an erasing mechanism, belonging to the field of video clip retrieval. For a video-query sentence pair, language features and frame features are acquired respectively; a language-aware dual-branch visual filter is constructed to generate an enhanced video stream and a suppressed video stream; a dual-branch shared candidate network based on a dynamic erasure mechanism is constructed to generate positive candidate segments and negative candidate segments; the dynamic erasure mechanism is introduced into the enhancement branch of the candidate network, and an enhancement score and a suppression score are calculated; the language-aware dual-branch visual filter and the dual-branch shared candidate network based on the dynamic erasure mechanism are trained with a multi-task loss to obtain a trained model; for a query sentence and video to be processed, the trained model takes the segment corresponding to the highest candidate score output by the enhancement branch as the final retrieval result. The invention enhances video-sentence matching and improves video retrieval performance.

Description

Weak supervision video clip retrieval method and system based on erasure mechanism
Technical Field
The invention relates to the field of video segment retrieval, in particular to a weak supervision video segment retrieval method and system based on an erasing mechanism.
Background
Video clip retrieval is a new topic in information retrieval that combines computer vision and natural language processing. Given an untrimmed video and a natural language description, the goal of video clip retrieval is to locate the temporal boundaries of the segment that matches the semantics of the description. However, most existing methods are trained in a fully supervised setting, which requires the temporal boundary of the matching segment to be annotated for every sentence. Such manual annotation is expensive and time-consuming, especially for ambiguous descriptions.
Existing weakly supervised methods typically adopt MIL-based or reconstruction-based approaches to train the localization network, and both have drawbacks. MIL-based methods learn latent visual-text matching through an inter-sample loss, treating the given video-sentence pairs as positive samples and constructing unmatched sentence-video pairs as negative samples. However, this places high demands on the quality of the randomly selected negative samples; low-quality negatives are easy to distinguish and cannot provide a strong supervision signal. Reconstruction-based approaches, on the other hand, attempt to reconstruct the query sentence from the visual content during training and locate candidate targets during inference using intermediate results such as attention weights. These methods do not directly optimize the visual-text matching score used at inference time: a candidate with a higher attention weight does not necessarily have a higher association with the query sentence, and such indirect optimization limits model performance. Therefore, existing weakly supervised methods have at least the following problems:
1) high-quality negative samples are required; low-quality samples are easy to distinguish and cannot provide a strong supervision signal;
2) the visual-text matching scores used at inference time cannot be directly optimized; candidates with high attention weights do not necessarily have high relevance to the query sentence, and such indirect optimization limits the performance of the model.
Erasure is an effective data augmentation method for suppressing overfitting and enhancing model robustness. Conventional erasure methods are typically applied to images: regions of an image are selected at random and their pixels are replaced with zeros or with the image mean, generating a large number of new images for training. However, erasing video frames in this way contributes little to improving video-sentence matching. The present method therefore provides a novel regularized dual-branch candidate network with an erasing mechanism: a fine-grained intra-sample confrontation is constructed by finding credible negative candidate moments, and a more complete visual-text relationship is captured through attention-guided dynamic erasing.
Disclosure of Invention
In the prior art, only inter-sample confrontation is usually considered while intra-sample confrontation is ignored, so it is difficult to select the correct result from plausible candidate segments; moreover, existing methods concentrate on a few dominant words in the video-sentence pair and ignore the global context, so samples that do not appear in the training data are hard to localize, high accuracy is obtained only on the training set, and practical applicability is poor. To overcome these defects, the invention provides a weakly supervised video segment retrieval method and system based on an erasing mechanism, which can retrieve video segments efficiently and accurately.
According to the invention, a dual-branch candidate module is constructed in which the two branches adopt the same structure and share parameters, making the model lighter and more robust; by constructing a dynamic erasing mechanism, the words that receive the most attention in the query sentence are erased, which enhances video-sentence matching and improves video retrieval performance.
In order to achieve the purpose, the invention adopts the following specific technical scheme:
one of the purposes of the present invention is to provide a weak supervised video segment retrieval method based on an erasure mechanism, which includes the following steps:
1) aiming at the video-query statement, acquiring the language characteristic of the query statement and the frame characteristic of the video;
2) constructing a language-aware double-branch visual filter, and obtaining an enhanced modal characteristic and a suppressed modal characteristic of each frame in a video by using the frame characteristic and the language characteristic to form an enhanced video stream and a suppressed video stream;
3) constructing a double-branch sharing candidate network based on a dynamic erasure mechanism, wherein the double-branch sharing candidate network comprises an enhanced branch and a suppressed branch;
in the enhancement branch, according to the language features of the query statement and the enhancement video stream, aggregating the language features of each frame of the video to obtain an enhancement aggregate text representation of each frame in the video; performing visual-text interaction on the enhanced modal characteristics and the enhanced aggregation text representation of each frame in the video to obtain enhanced language perception frame characteristics of each frame in the video; acquiring the relation between adjacent segments in the video by using the 2D segment characteristic diagram to obtain the cross-modal characteristics of the video segment level and generate an active candidate segment set and candidate scores thereof;
in the suppression branch, generating a negative candidate segment set and a candidate score thereof by adopting the same method as the enhancement branch according to the language characteristics of the query statement and the suppression video stream;
4) introducing a dynamic erasing mechanism into the enhancement branch in the step 3) to obtain the erased enhancement language perception frame characteristics of each frame in the video; calculating erasure loss by using the erased enhanced language perception frame characteristics, calculating a candidate score of each active segment after erasure, and performing weighted summation with the candidate scores before erasure to serve as a final candidate score of the active candidate segment;
5) combining the candidate score of the passive candidate segment and the final candidate score of the active candidate segment, training a language-aware double-branch visual filter and a double-branch shared candidate network based on a dynamic erasure mechanism by adopting multitask loss to obtain a trained model;
6) and aiming at the query sentences and videos to be processed, utilizing a trained model and combining the steps 1) to 3), and taking the segment corresponding to the highest candidate score output by the enhanced branch as a final retrieval result.
Another object of the present invention is to provide a system for retrieving a weakly supervised video segment based on the above method, which includes:
the video preprocessing module is used for acquiring the frame characteristics of the video;
the query statement preprocessing module is used for acquiring the language features of the query statement;
the double-branch visual filtering module is used for obtaining the enhanced modal characteristics and the suppressed modal characteristics of each frame in the video by utilizing the frame characteristics and the language characteristics to form an enhanced video stream and a suppressed video stream;
a dual-branch shared candidate network module comprising an enhanced branch and a suppressed branch; in the enhancement branch, according to the language features of the query statement and the enhancement video stream, aggregating the language features of each frame of the video to obtain an enhancement aggregate text representation of each frame in the video; performing visual-text interaction on the enhanced modal characteristics and the enhanced aggregation text representation of each frame in the video to obtain enhanced language perception frame characteristics of each frame in the video; acquiring the relation between adjacent segments in the video by using the 2D segment characteristic diagram to obtain the cross-modal characteristics of the video segment level and generate an active candidate segment set and candidate scores thereof;
in the suppression branch, generating a negative candidate segment set and a candidate score thereof by adopting the same method as the enhancement branch according to the language characteristics of the query statement and the suppression video stream;
the dynamic erasing module is used for erasing the query sentence in a word level in an enhancement branch of the double-branch sharing candidate network module to obtain the erased enhancement language perception frame characteristics of each frame in the video; calculating the erasure loss by using the erased enhanced language perception frame characteristics;
a candidate score calculation module for calculating final candidate scores, comprising the final candidate score of each positive candidate segment and the final candidate score of each negative candidate segment; the final candidate score of a positive candidate segment is the weighted sum of its candidate score after erasure and its candidate score before erasure, and the final candidate score of a negative candidate segment is the candidate score output by the suppression branch of the dual-branch shared candidate network module;
the training module is used for training the double-branch visual filtering module and the double-branch sharing candidate network module based on multi-task loss to obtain a trained model;
and the retrieval module is used for retrieving the query statement and the video to be processed according to the trained model, the video preprocessing module and the query statement preprocessing module, and outputting the segment corresponding to the highest candidate score output by the enhanced branch as a final retrieval result.
Compared with the prior art, the invention has the advantages that:
(1) according to the video segment retrieval method based on weak supervision, only video-level sentence annotation is required to be provided in the training process, and alignment annotation is not required to be carried out on each sentence, so that the annotation cost and time are greatly reduced;
(2) compared with most Multiple Instance Learning (MIL)-based methods, which mainly rely on inter-sample confrontation to determine the result, the invention fully performs intra-sample confrontation between moments with similar information in the video, and can therefore select the correct result from plausible candidate segments;
(3) the invention introduces a dynamic erasure mechanism of antagonism in the double-branch candidate network training, covers the word set with the highest text attention in the query sentence, utilizes the enhanced video stream and the language features corresponding to the erased query sentence for interaction, fully considers global information, and performs weighted summation of the obtained candidate scores and the candidate scores obtained by the original enhanced branch to be the final scores of the active candidate segments, which is beneficial to improving the matching of video sentences, not only can improve the training effect in the training set, but also can be migrated to practical application, and improves the retrieval performance.
Drawings
FIG. 1 is a schematic flow diagram of the process of the present invention;
FIG. 2 is a schematic diagram of the overall architecture of the present invention.
Detailed Description
The invention will be further elucidated and described with reference to the drawings and the detailed description.
As shown in fig. 1, the method for retrieving a weakly supervised video segment based on an erasure mechanism provided by the present invention mainly includes the following steps:
the method comprises the steps of firstly, aiming at a video-query statement, obtaining language features of the query statement and frame features of a video;
secondly, a language-aware double-branch visual filter is constructed, and the enhanced modal characteristics and the suppressed modal characteristics of each frame in the video are obtained by utilizing the frame characteristics and the language characteristics to form an enhanced video stream and a suppressed video stream;
step three, constructing a double-branch sharing candidate network based on a dynamic erasure mechanism, wherein the double-branch sharing candidate network comprises an enhanced branch and a suppressed branch;
in the enhancement branch, according to the language features of the query statement and the enhancement video stream, aggregating the language features of each frame of the video to obtain an enhancement aggregate text representation of each frame in the video; performing visual-text interaction on the enhanced modal characteristics and the enhanced aggregation text representation of each frame in the video to obtain enhanced language perception frame characteristics of each frame in the video; acquiring the relation between adjacent segments in the video by using the 2D segment characteristic diagram to obtain the cross-modal characteristics of the video segment level and generate an active candidate segment set and candidate scores thereof;
in the suppression branch, generating a negative candidate segment set and a candidate score thereof by adopting the same method as the enhancement branch according to the language characteristics of the query statement and the suppression video stream;
step four, introducing a dynamic erasing mechanism into the enhancement branch in the step three to obtain the erased enhancement language perception frame characteristics of each frame in the video; calculating erasure loss by using the erased enhanced language perception frame characteristics, calculating a candidate score of each active segment after erasure, and performing weighted summation with the candidate scores before erasure to serve as a final candidate score of the active candidate segment;
combining the candidate scores of the passive candidate segments and the final candidate scores of the active candidate segments, and training a language-aware double-branch visual filter and a double-branch shared candidate network based on a dynamic erasure mechanism by adopting multi-task loss to obtain a trained model;
and step six, aiming at the query sentences and videos to be processed, using the trained model and combining the steps one to three, and taking the segment corresponding to the highest candidate score output by the enhanced branch as a final retrieval result.
In one embodiment of the present invention, step one is described as follows:
aiming at a query sentence, extracting character features in the query sentence through a pre-trained Glove model, inputting the character features into a Bi-GRU to learn word semantic representation, and obtaining language features Q of the query sentence as { Q ═ Qi},1≤i≤nqWherein n isqIs the number of words in the query sentence, qiIs the linguistic feature of the ith word in the query statement;
aiming at a section of video, extracting video features through a pre-trained video feature extractor, and then shortening the length of a video sequence by utilizing time-average pooling to obtain frame features V ═ { V ═ Vi},1≤i≤nvWherein n isvIs the number of frames in the video, viThe frame characteristics of the ith frame in the video. In this embodiment, the pre-trained video feature extractionThe extractor may employ a C3D feature extractor.
In one embodiment of the present invention, step two is described as follows:
in step two, a language-aware dual-branch visual filter is constructed, and the frame feature V and the language feature Q are used for generating the enhanced video stream
Figure GDA0003041639280000061
And suppressing video streams
Figure GDA0003041639280000062
Wherein
Figure GDA0003041639280000063
Is the enhanced modal feature corresponding to the ith frame in the video,
Figure GDA0003041639280000064
is the corresponding suppressed modal feature of the ith frame in the video.
Specifically, the method comprises the following steps:
2.1) Each language scene is regarded as a cluster center, the language features Q = {q_i}, 1 ≤ i ≤ n_q are projected onto each cluster center, and the NetVLAD model is used to compute the accumulated residuals between the language features Q = {q_i}, 1 ≤ i ≤ n_q and a trainable center vector C = {c_j}, 1 ≤ j ≤ n_c:

α_i = softmax(W_c q_i + b_c),   u_j = Σ_{i=1}^{n_q} α_ij (q_i − c_j)

where c_j is the j-th cluster center, n_c is the number of cluster centers, and W_c and b_c are the projection matrix and bias of the NetVLAD model; α_i is the coefficient vector relating the language feature of the i-th word to the n_c cluster centers, and α_ij, the j-th element of α_i, represents the coefficient associated with the j-th cluster center; u_j is the accumulated residual of the language features Q = {q_i}, 1 ≤ i ≤ n_q with respect to the j-th cluster center and represents the j-th scene-based language feature;

2.2) A cross-modal matching score is computed from the accumulated residuals:

β_ij = σ( w_a^T tanh( W_a^v v_i + W_a^u u_j + b_a ) )

where W_a^v and W_a^u are projection matrices, b_a is a bias, w_a^T is a row vector, T denotes transpose, σ is the sigmoid function, and tanh(·) is the tanh function; β_ij ∈ (0, 1) is the matching score between the frame feature of the i-th frame in the video and the j-th scene-based language feature;

2.3) Following steps 2.1) to 2.2), the matching scores between the frame feature of the i-th frame in the video and all scenes are obtained, and the maximum matching score is taken as the global score:

g_i = max_{1 ≤ j ≤ n_c} β_ij

where g_i is the global score of the frame feature of the i-th frame in the video over all scenes; the global scores of all frame features in the video are further obtained and denoted G = {g_i}, 1 ≤ i ≤ n_v;

2.4) The global scores are normalized:

ĝ_i = (g_i − min(G)) / (max(G) − min(G))

where min(G) is the minimum of the global scores of all frame features in the video and max(G) is the maximum.

Through the above steps, the normalized scores of all frames ĝ_i, 1 ≤ i ≤ n_v, are obtained; the normalized score of each frame expresses the relation between that frame and the query sentence;

2.5) The enhanced modal feature and the suppressed modal feature of each frame in the video are computed from the normalized global scores:

v_i^en = ĝ_i · v_i,   v_i^sp = (1 − ĝ_i) · v_i

where v_i^en is the enhanced modal feature of the i-th frame in the video, and all frames together form the enhanced video stream V^en = {v_i^en}; according to the normalized scores, the enhanced stream strengthens the important frames and weakens the non-important frames. v_i^sp is the suppressed modal feature of the i-th frame in the video, and all frames together form the suppressed video stream V^sp = {v_i^sp}; in contrast to the enhanced video stream, the suppressed video stream weakens the important frames and strengthens the non-important frames.
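A minimal sketch of this dual-branch filter is given below. It assumes the formulas reconstructed above (soft-assignment NetVLAD residuals, a sigmoid matching score, min-max normalization) and uses our own module name LanguageAwareFilter, so it should be read as an illustration rather than the patent's exact implementation.

```python
import torch
import torch.nn as nn

class LanguageAwareFilter(nn.Module):
    """Sketch of step 2: produce enhanced / suppressed video streams from Q and V."""
    def __init__(self, dim=512, n_centers=16):
        super().__init__()
        self.centers = nn.Parameter(torch.randn(n_centers, dim))  # trainable center vectors C
        self.assign = nn.Linear(dim, n_centers)                   # W_c, b_c (NetVLAD soft assignment)
        self.proj_v = nn.Linear(dim, dim)                         # W_a^v
        self.proj_u = nn.Linear(dim, dim)                         # W_a^u, b_a
        self.score = nn.Linear(dim, 1)                            # w_a

    def forward(self, Q, V):                                      # Q: (B, n_q, d), V: (B, n_v, d)
        alpha = torch.softmax(self.assign(Q), dim=-1)             # (B, n_q, n_c) assignment coefficients
        residual = Q.unsqueeze(2) - self.centers                  # (B, n_q, n_c, d): q_i - c_j
        U = (alpha.unsqueeze(-1) * residual).sum(dim=1)           # (B, n_c, d) scene-based language features u_j
        # cross-modal matching score beta_ij between frame i and scene j
        beta = torch.sigmoid(self.score(torch.tanh(
            self.proj_v(V).unsqueeze(2) + self.proj_u(U).unsqueeze(1)))).squeeze(-1)  # (B, n_v, n_c)
        g = beta.max(dim=-1).values                               # global score per frame
        g_hat = (g - g.min(dim=1, keepdim=True).values) / \
                (g.max(dim=1, keepdim=True).values - g.min(dim=1, keepdim=True).values + 1e-8)
        V_en = g_hat.unsqueeze(-1) * V                            # enhanced stream: important frames kept
        V_sp = (1.0 - g_hat).unsqueeze(-1) * V                    # suppressed stream: important frames weakened
        return V_en, V_sp
```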
In one embodiment of the present invention, step three is described as follows:
in the third step, the established dual-branch sharing candidate module based on the dynamic erasure mechanism comprises an enhanced branch and a suppressed branch with the same structure, and parameters are shared among the branches;
the processing flows of the enhanced branch and the suppressed branch are similar, and only the calculation process of the enhanced branch is specifically described below, including:
3.1a) From the language features Q = {q_i}, 1 ≤ i ≤ n_q of the query sentence and the enhanced video stream V^en = {v_i^en} generated by the dual-branch visual filter, a cross-modal unit aggregates the language features for each frame to obtain the enhanced aggregate text representation of each frame in the video:

δ_ij = w_m^T tanh( W_m^v v_i^en + W_m^q q_j + b_m ),   δ̂_ij = exp(δ_ij) / Σ_{k=1}^{n_q} exp(δ_ik),   h_i^en = Σ_{j=1}^{n_q} δ̂_ij q_j

where δ_ij is the cross-modal attention between the i-th frame in the video and the j-th word in the query sentence; W_m^v and W_m^q are projection matrices, b_m is a bias, w_m^T is a row vector, T denotes transpose, and tanh(·) is the tanh function; δ̂_ij is the normalized cross-modal attention, and h_i^en is the enhanced aggregate text representation of the i-th frame in the video;

3.2a) The enhanced modal feature and the enhanced aggregate text representation of each frame in the video are concatenated, and a Bi-GRU network is applied for visual-text interaction to obtain the enhanced language-aware frame feature of each frame:

f_i^en = Bi-GRU([v_i^en; h_i^en])

where f_i^en is the enhanced language-aware frame feature of the i-th frame in the video;

3.3a) Cross-modal features at the video segment level are computed:

the video is divided into segments and a 2D segment feature map F is constructed, in which the first two dimensions represent the start and end frames of a segment and the third dimension is the fused feature of the segment; the fused feature is computed as:

F[a, b, :] = Σ_{i=a}^{b} f_i^en

where a and b are the start frame and end frame of the video segment, and F[a, b, :] is the fused feature of the segment [a, b];

based on the 2D segment feature map, two layers of 2D convolution (the kernel size can be chosen according to the actual situation) are applied to model the relations between adjacent segments, yielding the segment-level cross-modal features {m_i^en}, 1 ≤ i ≤ M_en, where M_en is the number of segments in the 2D segment feature map;

3.4a) The candidate score of each segment is computed as:

p_i^en = σ( W_p m_i^en + b_p )

where p_i^en is the candidate score of the i-th video segment, W_p and b_p are the projection matrix and bias, σ is the sigmoid function, and m_i^en is the cross-modal feature of the i-th video segment;

the T segments with the highest scores are selected to form the positive candidate segment set Φ^en = {φ_i^en}, 1 ≤ i ≤ T, and the candidate score p_i^en of each positive candidate segment is extracted, where φ_i^en and p_i^en denote the i-th positive candidate segment and its candidate score, respectively;

similarly, the suppression branch with the same structure takes the language features Q = {q_i}, 1 ≤ i ≤ n_q of the query sentence and the suppressed video stream V^sp = {v_i^sp} generated by the dual-branch visual filter and generates, following steps 3.1a) to 3.4a), the negative candidate segment set Φ^sp = {φ_i^sp}, 1 ≤ i ≤ T, and the candidate score p_i^sp of each negative candidate segment, where φ_i^sp and p_i^sp denote the i-th negative candidate segment and its candidate score, respectively.
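The sketch below illustrates the enhancement branch of the shared candidate network under the formulas reconstructed above; the suppression branch would reuse the same module on the suppressed stream (shared parameters). The 2D map construction is simplified (every start/end pair with start ≤ end is filled) and the module name ProposalBranch is our own.

```python
import torch
import torch.nn as nn

class ProposalBranch(nn.Module):
    """Sketch of steps 3.1a-3.4a; called with V^en for the enhancement branch, V^sp for the suppression branch."""
    def __init__(self, dim=512, top_t=16):
        super().__init__()
        self.att_v = nn.Linear(dim, dim)
        self.att_q = nn.Linear(dim, dim)
        self.att_w = nn.Linear(dim, 1)
        self.interact = nn.GRU(2 * dim, dim // 2, batch_first=True, bidirectional=True)
        self.conv = nn.Sequential(nn.Conv2d(dim, dim, 3, padding=1), nn.ReLU(),
                                  nn.Conv2d(dim, dim, 3, padding=1), nn.ReLU())
        self.score = nn.Linear(dim, 1)
        self.top_t = top_t

    def forward(self, Q, V_stream):                     # Q: (B, n_q, d), V_stream: (B, n_v, d)
        # 3.1a cross-modal attention and aggregate text representation h_i
        att = self.att_w(torch.tanh(self.att_v(V_stream).unsqueeze(2) +
                                    self.att_q(Q).unsqueeze(1))).squeeze(-1)       # (B, n_v, n_q)
        H = torch.softmax(att, dim=-1) @ Q                                         # (B, n_v, d)
        # 3.2a visual-text interaction with a Bi-GRU
        F, _ = self.interact(torch.cat([V_stream, H], dim=-1))                     # (B, n_v, d)
        # 3.3a 2D segment feature map: F2d[a, b] = sum of frame features a..b (via prefix sums)
        csum = torch.cumsum(F, dim=1)
        n_v = F.size(1)
        prev = torch.cat([torch.zeros_like(csum[:, :1]), csum[:, :-1]], dim=1)
        F2d = csum.unsqueeze(1) - prev.unsqueeze(2)                                # (B, n_v, n_v, d)
        F2d = self.conv(F2d.permute(0, 3, 1, 2)).permute(0, 2, 3, 1)               # relations of adjacent segments
        # 3.4a candidate scores; keep only valid segments (start <= end) and take the top-T
        scores = torch.sigmoid(self.score(F2d)).squeeze(-1)                        # (B, n_v, n_v)
        valid = torch.triu(torch.ones(n_v, n_v, dtype=torch.bool, device=F.device))
        flat = scores.masked_fill(~valid, -1.0).flatten(1)
        top_scores, top_idx = flat.topk(self.top_t, dim=1)                         # candidate set and scores
        return scores, top_scores, top_idx
```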
In one embodiment of the present invention, step four is described as follows:
4.1) Using the cross-modal attention between each frame in the video and each word in the query sentence computed in step 3.1a), the aggregate text attention is calculated by weighted summation over the frames, giving δ̄_j, the aggregate text attention between the video and the j-th word in the query sentence;

4.2) The n_e words with the highest aggregate text attention in the query sentence are selected to form the mask word set W* = {w_i*}, 1 ≤ i ≤ n_e, where n_e = ⌊n_q · E%⌋, ⌊·⌋ denotes rounding down, and E% is the mask percentage threshold;

the mask token 'Unknown' is substituted for the n_e words in the mask word set W*, giving the erased query sentence; the method of step one is then used to obtain the language features of the erased query sentence Q^en* = {q_i*}, 1 ≤ i ≤ n_q;

4.3) From the enhanced video stream V^en = {v_i^en} generated by the dual-branch visual filter and the language features Q^en* = {q_i*}, 1 ≤ i ≤ n_q of the erased query sentence, the method of step two is applied to obtain the erased enhanced video stream V^en* = {v_i^en*}, where v_i^en* is the erased enhanced modal feature of the i-th frame in the video; from the erased enhanced video stream V^en* and the language features Q^en* of the erased query sentence, the method of step 3.1a) is used to obtain the cross-modal attention δ_ij* between the i-th frame in the video and the j-th word in the erased query sentence;

4.4) The erasure-aggregated visual representation of each word in the query sentence is computed as:

δ̂_ij* = exp(δ_ij*) / Σ_{k=1}^{n_v} exp(δ_kj*),   e_j = Σ_{i=1}^{n_v} δ̂_ij* v_i^en*

where δ̂_ij* is the normalized cross-modal attention and e_j is the erasure-aggregated visual representation of the j-th word in the query sentence;

4.5) The erasure loss is calculated: the erasure loss L_ers is computed from the erased enhanced language-aware frame features f_i^en* of each frame in the video (obtained as described below) through a projection matrix W_e, with s as an intermediate variable.

Further, after step 4.4), the method further comprises the following steps:

the erased enhanced modal features and the erased aggregated representations of each frame in the video are concatenated, and a Bi-GRU network is applied for visual-text interaction to obtain the erased enhanced language-aware frame features f_i^en* of each frame in the video, analogously to step 3.2a);

the candidate score p_i^en* of each erased positive segment is then calculated using the methods of steps 3.3a) to 3.4a) and weighted and summed with the candidate score before erasure to obtain the final candidate score of the positive candidate segment:

p̂_i^en = w_c · p_i^en + (1 − w_c) · p_i^en*

where w_c is a learnable parameter with an initial value of 0.5, and p̂_i^en is the final candidate score of the i-th positive candidate segment.
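A compact sketch of the dynamic erasure step follows. The mask percentage, the 'Unknown' token handling, and the convex-combination score fusion follow the description above, while the helper names (erase_query, fuse_scores), the UNK_ID constant, and the aggregation by a plain sum over frames are illustrative assumptions.

```python
import torch

UNK_ID = 0  # assumed id of the 'Unknown' mask token in the vocabulary

def erase_query(token_ids, attn, mask_pct=0.3):
    """Replace the words with the highest aggregate text attention by the mask token.

    token_ids: (n_q,) word ids of the query sentence
    attn:      (n_v, n_q) normalized cross-modal attention from step 3.1a)
    """
    agg_attn = attn.sum(dim=0)                       # aggregate text attention per word (summed over frames)
    n_e = int(token_ids.numel() * mask_pct)          # floor(n_q * E%)
    top_words = agg_attn.topk(max(n_e, 1)).indices   # mask word set W*
    erased = token_ids.clone()
    erased[top_words] = UNK_ID                       # substitute the 'Unknown' token
    return erased

def fuse_scores(score_before, score_after, w_c):
    """Final candidate score of a positive candidate segment: weighted sum of pre- and post-erasure scores."""
    return w_c * score_before + (1.0 - w_c) * score_after

# usage sketch
token_ids = torch.randint(1, 1000, (20,))
attn = torch.rand(32, 20)
erased_ids = erase_query(token_ids, attn)            # re-encode with step one to get Q^en*
w_c = torch.nn.Parameter(torch.tensor(0.5))          # learnable fusion weight, initialised to 0.5
final_score = fuse_scores(torch.rand(16), torch.rand(16), w_c)
```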
In one embodiment of the present invention, step five is described as follows:
the multi-task loss adopted by the invention is the weighted sum of an erasure loss value, an inter-sample loss value, an intra-sample loss value, a global loss value and a gap loss value;
a. The intra-sample loss is computed as:

K_en = (1/T) Σ_{i=1}^{T} p̂_i^en,   K_sp = (1/T) Σ_{i=1}^{T} p_i^sp,
L_intra = max(0, Δ_intra + K_sp − K_en)

where L_intra is the intra-sample loss value, Δ_intra is a margin value, p̂_i^en is the final candidate score of the i-th positive candidate segment, p_i^sp is the candidate score of the i-th negative candidate segment, K_en is the enhancement score, K_sp is the suppression score, and T is the number of candidate segments.

b. The inter-sample loss is computed as:

L_inter = max(0, Δ_inter − K_en(V, Q) + K_en(V̄, Q)) + max(0, Δ_inter − K_en(V, Q) + K_en(V, Q̄))

where L_inter is the inter-sample loss value, Δ_inter is a margin value, K_en(V̄, Q) is the enhancement score corresponding to the negative sample (V̄, Q), and K_en(V, Q̄) is the enhancement score corresponding to the negative sample (V, Q̄);

the negative samples (V̄, Q) and (V, Q̄) are obtained as follows: for each video-query sentence pair (V, Q), an unmatched video V̄ is randomly selected from the training set to form the negative sample (V̄, Q), and an unmatched query Q̄ is selected to form another negative sample (V, Q̄) with the video V.

c. The global loss is computed as:

L_glob = (1/M_en) Σ_{i=1}^{M_en} p_i^en

where L_glob is the global loss value, used to keep the score of every candidate segment relatively low, and M_en is the number of segments in the 2D segment feature map.

d. The gap loss L_gap is computed from the candidate scores of the positive candidate segments and is used to enlarge the score gap between the positive candidate segments.

The multi-task loss is constructed as

L = Σ_c λ_c L_c

where L_c ranges over the erasure, inter-sample, intra-sample, global and gap losses, and λ_c is the hyper-parameter used to control each loss. The model is trained with the multi-task loss to obtain the trained model. In use, the video and text to be inferred are input, and the segment with the highest candidate score is selected as the required segment.
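A sketch of how these losses could be combined is shown below. The hinge formulations follow the intra- and inter-sample losses reconstructed above; the erasure, global, and gap terms are passed in as precomputed tensors, and the example weight values are assumptions (the patent only states that each loss has its own hyper-parameter).

```python
import torch

def intra_sample_loss(pos_scores, neg_scores, margin=0.4):
    """Hinge loss pushing the enhancement score K_en above the suppression score K_sp (same video-query pair)."""
    k_en, k_sp = pos_scores.mean(dim=-1), neg_scores.mean(dim=-1)
    return torch.clamp(margin + k_sp - k_en, min=0.0).mean()

def inter_sample_loss(k_en, k_en_wrong_video, k_en_wrong_query, margin=0.6):
    """Hinge loss against unmatched-video and unmatched-query negative samples."""
    return (torch.clamp(margin - k_en + k_en_wrong_video, min=0.0) +
            torch.clamp(margin - k_en + k_en_wrong_query, min=0.0)).mean()

def multitask_loss(terms, weights):
    """Weighted sum L = sum_c lambda_c * L_c over the loss terms."""
    return sum(weights[name] * value for name, value in terms.items())

# usage sketch with dummy tensors
terms = {
    "intra": intra_sample_loss(torch.rand(2, 16), torch.rand(2, 16)),
    "inter": inter_sample_loss(torch.rand(2), torch.rand(2), torch.rand(2)),
    "erasure": torch.tensor(0.10),               # L_ers from step 4.5)
    "global": torch.rand(2, 64, 64).mean(),      # mean candidate score over the 2D map
    "gap": torch.tensor(0.05),                   # gap regularizer over positive candidate scores
}
loss = multitask_loss(terms, {"intra": 1.0, "inter": 0.1, "erasure": 0.1, "global": 0.01, "gap": 0.01})
```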
The invention also provides a system for retrieving the weakly supervised video clips based on the erasure mechanism, which can be referred to as fig. 2 and comprises:
the video preprocessing module is used for acquiring the frame characteristics of the video;
the query statement preprocessing module is used for acquiring the language features of the query statement;
the double-branch visual filtering module is used for obtaining the enhanced modal characteristics and the suppressed modal characteristics of each frame in the video by utilizing the frame characteristics and the language characteristics to form an enhanced video stream and a suppressed video stream;
a dual-branch shared candidate network module comprising an enhanced branch and a suppressed branch; in the enhancement branch, according to the language features of the query statement and the enhancement video stream, aggregating the language features of each frame of the video to obtain an enhancement aggregate text representation of each frame in the video; performing visual-text interaction on the enhanced modal characteristics and the enhanced aggregation text representation of each frame in the video to obtain enhanced language perception frame characteristics of each frame in the video; acquiring the relation between adjacent segments in the video by using the 2D segment characteristic diagram to obtain the cross-modal characteristics of the video segment level and generate an active candidate segment set and candidate scores thereof;
in the suppression branch, generating a negative candidate segment set and a candidate score thereof by adopting the same method as the enhancement branch according to the language characteristics of the query statement and the suppression video stream;
the dynamic erasing module is used for erasing the query sentence in a word level in an enhancement branch of the double-branch sharing candidate network module to obtain the erased enhancement language perception frame characteristics of each frame in the video; calculating the erasure loss by using the erased enhanced language perception frame characteristics;
a candidate score calculation module for calculating final candidate scores, comprising the final candidate score of each positive candidate segment and the final candidate score of each negative candidate segment; the final candidate score of a positive candidate segment is the weighted sum of its candidate score after erasure and its candidate score before erasure, and the final candidate score of a negative candidate segment is the candidate score output by the suppression branch of the dual-branch shared candidate network module. In this embodiment, the candidate score calculation module is located inside the enhancement branch and the suppression branch and is therefore not shown in fig. 2. In addition, it should be noted that each candidate segment further requires an enhancement score and a suppression score computed from the obtained candidate scores; for the calculation, please refer to the description of step five in the method part above, which is not repeated here.
The training module is used for training the double-branch visual filtering module and the double-branch sharing candidate network module based on multi-task loss to obtain a trained model;
and the retrieval module is used for retrieving the query statement and the video to be processed according to the trained model, the video preprocessing module and the query statement preprocessing module, and outputting the segment corresponding to the highest candidate score output by the enhanced branch as a final retrieval result.
In this embodiment, the dual-branch visual filtering module includes:
the accumulated residual calculation module is used for projecting the language features of the query statement to the clustering center corresponding to each language scene and calculating the accumulated residual between the language features and the center vectors;
a cross-modal matching score calculation module for calculating the matching score of the language features of each frame and all scenes in the video according to the accumulated residual;
the global score calculation module is used for screening out the highest score from the matching scores of the language features of each frame and all scenes in the video to be used as a global score, and carrying out normalization processing to obtain the normalization scores of all frames in the video;
and the modal characteristic calculation module is used for calculating to obtain the enhanced modal characteristic and the suppressed modal characteristic corresponding to each frame of the video according to the normalized fraction of each frame of the video, and respectively forming an enhanced video stream and a suppressed video stream.
In this embodiment, the dual-branch sharing candidate network module includes an enhanced branch and a suppressed branch, and parameter sharing between the branches;
the two branches comprise:
the aggregation text representation calculation module is used for obtaining cross-modal attention between each frame and each word in the video according to the language features of the query sentences and the enhancement video stream or the inhibition video stream, and obtaining enhancement aggregation text representation or inhibition aggregation text representation after normalization and cumulative summation;
the language perception frame feature calculation module is used for connecting modal features and aggregate text representations of each frame in the video, and performing visual-text interaction by applying a Bi-GRU network to obtain enhanced language perception frame features or suppressed language perception frame features of each frame in the video;
the cross-modal feature calculation module is used for constructing a 2D fragment feature map, performing two-layer 2D convolution based on the 2D fragment feature map, calculating a matrix relation between adjacent video fragments and obtaining cross-modal features of video fragment levels;
a candidate score calculating module for calculating a candidate score of each video segment, and selecting the T segments with the highest scores as candidate segments; the candidate segment obtained in the enhancement branch is a positive candidate segment, and the candidate segment obtained in the suppression branch is a negative candidate segment.
In this embodiment, the dynamic erase module includes:
the cross-modal attention calculation module is used for weighting and summing the cross-modal attention between each frame in the video and each word in the query sentence to obtain the aggregate text attention between the video and each word in the query sentence;
the query sentence erasing module is used for screening a plurality of words with the highest attention of the aggregated text in the query sentence and replacing the words with mask symbols; obtaining the language features of the erased query statement according to the query statement preprocessing module;
the erasure cross-modal attention calculation module is used for obtaining an erased enhanced video stream according to the language features of the enhanced video stream and the erased query statement; further obtaining cross-modal attention between each frame in the video and the word of the erased query sentence according to the erased enhanced video stream and the language characteristics of the erased query sentence;
the erasure aggregation visual representation calculation module is used for normalizing, accumulating and summing the cross-modal attention between each frame in the video and the erased words of the query statement to obtain the erasure aggregation visual representation of each word in the query statement;
and the erasure loss calculation module is used for calculating the erasure loss.
The implementation of each module in the above description may refer to the description of the method portion, and is not described herein again.
In the embodiments provided by the present invention, it should be understood that the above-described system embodiments are merely illustrative; for example, the division into modules such as the dual-branch shared candidate network module is only a logical functional division, and another division may be used in an actual implementation: multiple modules may be combined or integrated into another system, or some features may be omitted or not executed. In addition, the connections between the modules shown or discussed may be communication connections via interfaces, which may be electrical or take other forms.
The method is applied to the following embodiments to achieve the technical effects of the present invention, and detailed steps in the embodiments are not described again.
Examples
The invention was tested in the following three data sets:
ActivityCaption: a dataset containing more than 19k videos with rich content and an average duration of about 2 minutes. The sentence pairs are divided into 37k, 18k and 17k pairs for training, validation and testing.
Charades-STA: this dataset is built on the Charades dataset, which contains video-level paragraph descriptions and temporal action annotations. The video-level descriptions are decomposed into segment-level descriptions using a semi-automatic method. The dataset contains 10k indoor activity videos with an average duration of about 30 seconds. There are approximately 12k segment-sentence pairs for training and 4k pairs for testing.
DiDeMo: the dataset consists of 10k videos, each 25-30 seconds long. It contains approximately 33k segment-sentence pairs for training, 4k for validation and 4k for testing. Each video in the dataset is divided into six five-second segments, and the target moment contains one or more consecutive segments.
Evaluation criteria:
R@n, IoU=m was used as the evaluation metric on the ActivityCaption and Charades-STA datasets.
Rank@1, Rank@5 and mIoU were used as the evaluation metrics on DiDeMo.
Data processing:
C3D features were extracted on the ActivityCaption and Charades-STA datasets, and VGG16 and optical-flow features were used on the DiDeMo dataset.
Temporal average pooling with strides of 8 and 4 was used on ActivityCaption and Charades-STA, respectively, to shorten the feature sequences. For DiDeMo, the average feature of each fixed five-second segment was computed.
Each sentence is decomposed into a list of word tokens, words not in the GloVe vocabulary are removed, and the first MaxSeq tokens are kept; MaxSeq is set to 25, 20 and 20 for ActivityCaption, Charades-STA and DiDeMo, respectively. Pre-trained 300-d GloVe embeddings are used to extract the text feature of each word.
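A small sketch of this text preprocessing is given below; the tokenizer, the toy vocabulary object, and the max_seq value are assumptions for illustration.

```python
import torch

def preprocess_sentence(sentence, glove_vocab, glove_vectors, max_seq=25):
    """Tokenize, drop out-of-vocabulary words, truncate to max_seq, and look up 300-d GloVe embeddings."""
    tokens = [w.lower() for w in sentence.split() if w.lower() in glove_vocab]
    tokens = tokens[:max_seq]                          # keep the first MaxSeq tokens
    ids = torch.tensor([glove_vocab[w] for w in tokens])
    return glove_vectors[ids]                          # (len(tokens), 300) text features

# usage sketch with a toy vocabulary
glove_vocab = {"a": 0, "person": 1, "opens": 2, "the": 3, "door": 4}
glove_vectors = torch.randn(5, 300)
features = preprocess_sentence("A person opens the door", glove_vocab, glove_vectors)
```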
Setting a model:
The dimensions of W_c and b_c in NetVLAD and of the projection matrices and bias b_m in the cross-modal interaction unit are set to 512. The hidden-state dimension of each direction in the Bi-GRU is set to 256, and the dimension of the trainable center vectors is set to 512. In constructing the 2D map, for the DiDeMo dataset every position [a, b] with a ≤ b is filled; for Charades-STA the positions satisfying a ≤ b and (b − a) mod 2 = 1 are filled; and for ActivityCaption the positions satisfying a ≤ b and (b − a) mod 8 = 0 are filled. The convolution kernel size K is set to 5, 3 and 1 for ActivityCaption, Charades-STA and DiDeMo, respectively. In the center-based candidate method, the number T of positive/negative candidate segments is set to 16, 32 and 6 for ActivityCaption, Charades-STA and DiDeMo, respectively. λ_1 to λ_5 are set to 0.1, 1, 0.1, 0.01 and 0.01, respectively. Δ_intra and Δ_inter are set to 0.4 and 0.6, respectively. An Adam optimizer with an initial learning rate of 10^-4 and weight decay of 10^-7 is used. A non-maximum suppression (NMS) threshold of 0.55 is used during inference to select multiple moments.
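Since inference relies on temporal non-maximum suppression over the scored candidates, a minimal 1D temporal NMS sketch is given below; the 0.55 threshold follows the settings above, while the function name and greedy formulation are our own.

```python
import torch

def temporal_nms(segments, scores, threshold=0.55):
    """Greedy temporal NMS: keep high-scoring segments whose IoU with already-kept segments stays below the threshold.

    segments: (N, 2) tensor of [start, end] times; scores: (N,) candidate scores.
    """
    order = scores.argsort(descending=True)
    keep = []
    for idx in order.tolist():
        s, e = segments[idx]
        ok = True
        for kept in keep:
            ks, ke = segments[kept]
            inter = max(0.0, min(e, ke) - max(s, ks))
            union = (e - s) + (ke - ks) - inter
            if union > 0 and inter / union > threshold:
                ok = False
                break
        if ok:
            keep.append(idx)
    return keep

# usage sketch
segs = torch.tensor([[0.0, 10.0], [1.0, 11.0], [20.0, 30.0]])
scores = torch.tensor([0.9, 0.8, 0.7])
print(temporal_nms(segs, scores))   # -> [0, 2]: the heavily overlapping second segment is suppressed
```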
The experimental results are as follows:
TABLE 1 Performance on dataset ActivityCaption (the detailed results table is reproduced as an image in the original publication)

TABLE 2 Performance on dataset Charades-STA (the detailed results table is reproduced as an image in the original publication)
TABLE 3 Performance on data set DiDeMo
Method              Input      Rank@1   Rank@5   mIoU
WSLLN               RGB        19.30    53.00    25.30
RTPEN (invention)   RGB        20.19    60.38    28.22
WSLLN               Flow       18.41    54.51    27.41
RTPEN (invention)   Flow       18.39    54.39    27.39
TGA                 RGB+Flow   20.90    60.17    30.99
RTPEN (invention)   RGB+Flow   21.55    62.95    30.98
From the results, the invention achieves the best retrieval results on the three datasets, and the proposed RTPEN method attains nearly the best weakly supervised performance.
Under IoU=0.3 on ActivityCaption and IoU=0.5 on Charades-STA, the results show that the dynamic erasing mechanism of the invention helps capture the comprehensive correspondence between videos and sentences and retrieve more accurate moments; in particular, the Rank@5 performance on DiDeMo even exceeds that of a fully supervised method, which verifies the effectiveness of the two-branch framework with the erasing mechanism and the regularization strategy.
On ActivityCaption and Charades-STA, the proposed RTPEN outperforms the reconstruction-based SCN method, which shows that the intra-sample confrontation adopted by the invention can effectively exploit negative samples; the confrontation is fully carried out between moments with similar information, which improves video-sentence matching and therefore yields excellent retrieval capability.
The foregoing lists merely illustrate specific embodiments of the invention. It is obvious that the invention is not limited to the above embodiments, but that many variations are possible. All modifications which can be derived or suggested by a person skilled in the art from the disclosure of the present invention are to be considered within the scope of the invention.

Claims (10)

1. A weak supervision video segment retrieval method based on an erasure mechanism is characterized by comprising the following steps:
1) aiming at the video-query statement, acquiring the language characteristic of the query statement and the frame characteristic of the video;
2) constructing a language-aware double-branch visual filter, and obtaining an enhanced modal characteristic and a suppressed modal characteristic of each frame in a video by using the frame characteristic and the language characteristic to form an enhanced video stream and a suppressed video stream;
3) constructing a double-branch sharing candidate network based on a dynamic erasure mechanism, wherein the double-branch sharing candidate network comprises an enhanced branch and a suppressed branch;
in the enhancement branch, according to the language features of the query statement and the enhancement video stream, aggregating the language features of each frame of the video to obtain an enhancement aggregate text representation of each frame in the video; performing visual-text interaction on the enhanced modal characteristics and the enhanced aggregation text representation of each frame in the video to obtain enhanced language perception frame characteristics of each frame in the video; acquiring the relation between adjacent segments in the video by using the 2D segment characteristic diagram to obtain the cross-modal characteristics of the video segment level and generate an active candidate segment set and candidate scores thereof;
in the suppression branch, generating a negative candidate segment set and a candidate score thereof by adopting the same method as the enhancement branch according to the language characteristics of the query statement and the suppression video stream;
4) introducing a dynamic erasing mechanism into the enhancement branch in the step 3) to obtain the erased enhancement language perception frame characteristics of each frame in the video; calculating an erasure loss value by using the erased enhanced language perception frame characteristics, calculating a candidate score of each erased active segment, and performing weighted summation with the candidate scores before erasure to serve as a final candidate score of the active candidate segment;
5) combining the candidate score of the passive candidate segment and the final candidate score of the active candidate segment, training a language-aware double-branch visual filter and a double-branch shared candidate network based on a dynamic erasure mechanism by adopting multitask loss to obtain a trained model;
6) and aiming at the query sentences and videos to be processed, utilizing a trained model and combining the steps 1) to 3), and taking the segment corresponding to the highest candidate score output by the enhanced branch as a final retrieval result.
2. The method according to claim 1, wherein the language-aware dual-branch visual filter in step 2) is specifically:
2.1) each language scene is regarded as a cluster center, the language features Q = {q_i}, 1 ≤ i ≤ n_q of the query sentence are projected onto each cluster center, and the NetVLAD model is used to compute the accumulated residuals between the language features Q = {q_i}, 1 ≤ i ≤ n_q and a trainable center vector C = {c_j}, 1 ≤ j ≤ n_c:

α_i = softmax(W_c q_i + b_c),   u_j = Σ_{i=1}^{n_q} α_ij (q_i − c_j)

where n_q is the number of words in the query sentence and q_i is the language feature of the i-th word; c_j is the j-th cluster center, n_c is the number of cluster centers, and W_c and b_c are the projection matrix and bias of the NetVLAD model; α_i is the coefficient vector relating the language feature of the i-th word to the n_c cluster centers, and α_ij, the j-th element of α_i, represents the coefficient associated with the j-th cluster center; u_j is the accumulated residual of the language features with respect to the j-th cluster center and represents the j-th scene-based language feature;

2.2) a cross-modal matching score is computed from the accumulated residuals:

β_ij = σ( w_a^T tanh( W_a^v v_i + W_a^u u_j + b_a ) )

where the frame features of the video are denoted V = {v_i}, 1 ≤ i ≤ n_v, n_v is the number of frames in the video, and v_i is the frame feature of the i-th frame; W_a^v and W_a^u are projection matrices, b_a is a bias, w_a^T is a row vector, T denotes transpose, σ is the sigmoid function, and tanh(·) is the tanh function; β_ij ∈ (0, 1) is the matching score between the frame feature of the i-th frame in the video and the j-th scene-based language feature;

2.3) following steps 2.1) to 2.2), the matching scores between the frame feature of the i-th frame in the video and all scenes are obtained, and the maximum matching score is taken as the global score:

g_i = max_{1 ≤ j ≤ n_c} β_ij

where g_i is the global score of the frame feature of the i-th frame in the video over all scenes; the global scores of all frame features in the video are further obtained and denoted G = {g_i}, 1 ≤ i ≤ n_v;

2.4) the global scores are normalized:

ĝ_i = (g_i − min(G)) / (max(G) − min(G))

giving the normalized scores of all frames ĝ_i, 1 ≤ i ≤ n_v, the normalized score of each frame expressing the relation between that frame and the query sentence;

2.5) the enhanced modal feature and the suppressed modal feature of each frame in the video are computed from the normalized global scores:

v_i^en = ĝ_i · v_i,   v_i^sp = (1 − ĝ_i) · v_i

where v_i^en is the enhanced modal feature of the i-th frame in the video, all frames together forming the enhanced video stream V^en = {v_i^en}, 1 ≤ i ≤ n_v; v_i^sp is the suppressed modal feature of the i-th frame in the video, all frames together forming the suppressed video stream V^sp = {v_i^sp}, 1 ≤ i ≤ n_v.
3. The method according to claim 1, wherein the candidate network based on the dynamic erasure mechanism for dual-branch sharing in step 3) comprises an enhanced branch and a suppressed branch with the same structure, and the parameters between the branches are shared;
the calculation process of the enhancement branch is as follows:
3.1a) language feature Q ═ { Q) from query statementi},1≤i≤nqEnhanced video stream generated with a dual-branch visual filter
Figure FDA0003041639270000032
And aggregating the language features of each frame by using a cross-modal unit to obtain an enhanced aggregate text representation of each frame in the video, wherein the calculation formula is as follows:
Figure FDA0003041639270000033
Figure FDA0003041639270000034
wherein,
Figure FDA0003041639270000035
is the enhanced modal feature, δ, corresponding to the ith frame in the videoijRepresenting videoCross-modal attention between the ith frame and the jth word in the query statement;
Figure FDA0003041639270000036
and
Figure FDA0003041639270000037
is a projection matrix, bmIs a bias that is a function of the bias,
Figure FDA0003041639270000038
is a row vector, T represents a transpose, tanh (·) is a tanh function;
Figure FDA0003041639270000039
the cross-modal attention after normalization is represented,
Figure FDA00030416392700000310
is an enhanced aggregate text representation of the ith frame in the video;
3.2a) connecting the enhanced modal characteristics and the enhanced aggregation text representation of each frame in the video, and performing visual-text interaction by applying a Bi-GRU network to obtain the enhanced language perception frame characteristics of each frame in the video, wherein the calculation formula is as follows:
Figure FDA00030416392700000311
wherein,
Figure FDA00030416392700000312
is an enhanced language aware frame feature of the ith frame in the video;
3.3a) calculating cross-modal features at the video segment level:
dividing the video into segments and constructing a 2D segment feature map, wherein the first two dimensions represent the starting frame and the ending frame of a segment and the third dimension is the fused feature of the segment; the fused feature is obtained by accumulating the enhanced language-aware frame features of the frames within the segment;
performing a two-layer 2D convolution on the 2D segment feature map to capture the relation between adjacent segments and obtain the segment-level cross-modal features F^en = {f_i^en}, 1 ≤ i ≤ M_en, wherein M_en is the number of segments in the 2D segment feature map;
3.4a) calculating the candidate score of each segment as o_i^en = σ(W_p f_i^en + b_p), wherein o_i^en is the candidate score of the ith video segment, W_p and b_p are the projection matrix and the bias, σ is the sigmoid function, and f_i^en is the cross-modal feature of the ith video segment;
selecting the T candidate segments with the highest scores to form the positive candidate segment set P = {p_i}, 1 ≤ i ≤ T, and extracting the candidate score o_i^p of each positive candidate segment, wherein p_i and o_i^p respectively denote the ith positive candidate segment and its candidate score;
similarly, the suppression branch of the same structure takes the language features Q = {q_i}, 1 ≤ i ≤ n_q, of the query statement and the suppressed video stream V^sp generated by the dual-branch visual filter, and generates the negative candidate segment set N = {n_i}, 1 ≤ i ≤ T, together with the candidate score o_i^n of each negative candidate segment, according to the procedure of steps 3.1a) to 3.4a), wherein n_i and o_i^n respectively denote the ith negative candidate segment and its candidate score.
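As a concrete reading of steps 3.1a)–3.4a), the sketch below runs the enhancement branch on toy tensors: additive cross-modal attention over the words, frame-text fusion, a cumulative 2D segment feature map, and sigmoid candidate scoring with top-T selection. The feature dimension, the mean pooling inside a segment, the replacement of the Bi-GRU by a single linear layer, and the omission of the two 2D convolution layers are simplifications made for this sketch, not the patent's method; every name is illustrative.

```python
import torch
import torch.nn as nn

d = 256                                      # shared feature dimension (assumed)
W1, W2 = nn.Linear(d, d), nn.Linear(d, d)    # frame / word projections of step 3.1a)
w_row = nn.Linear(d, 1)                      # stands in for the row vector w^T of step 3.1a)
W_int = nn.Linear(2 * d, d)                  # stands in for the Bi-GRU of step 3.2a)
W_p = nn.Linear(d, 1)                        # candidate-score projection of step 3.4a)

def enhancement_branch(v_en: torch.Tensor, q: torch.Tensor, T: int = 8):
    """v_en: (n_v, d) enhanced video stream; q: (n_q, d) language features of the query."""
    # 3.1a) additive cross-modal attention, normalized over the words of the query
    delta = w_row(torch.tanh(W1(v_en)[:, None, :] + W2(q)[None, :, :])).squeeze(-1)
    attn = torch.softmax(delta, dim=1)                 # (n_v, n_q)
    c_en = attn @ q                                    # enhanced aggregated text representation

    # 3.2a) visual-text interaction (linear map instead of a Bi-GRU, to keep the sketch short)
    h_en = torch.tanh(W_int(torch.cat([v_en, c_en], dim=-1)))

    # 3.3a) 2D segment feature map: entry (a, b) fuses frames a..b (mean pooling, assumed)
    n_v = h_en.size(0)
    prefix = torch.cumsum(h_en, dim=0)
    spans, seg_feats = [], []
    for a in range(n_v):
        for b in range(a, n_v):
            seg_sum = prefix[b] - (prefix[a - 1] if a > 0 else torch.zeros_like(prefix[b]))
            spans.append((a, b))
            seg_feats.append(seg_sum / (b - a + 1))
    f_en = torch.stack(seg_feats)                      # (M_en, d); the 2D conv layers are omitted

    # 3.4a) sigmoid candidate scores and top-T positive candidate segments
    o_en = torch.sigmoid(W_p(f_en)).squeeze(-1)
    top = torch.topk(o_en, k=min(T, o_en.numel()))
    return [spans[i] for i in top.indices.tolist()], top.values
```

Because the parameters are shared between the two branches, calling the same function on the suppressed video stream gives the suppression branch and its negative candidate segments.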
4. The method according to claim 3, wherein step 4) is specifically as follows:
4.1) obtaining the aggregated text attention of each word in the query statement by weighted summation of the cross-modal attention, calculated in step 3.1a), between each frame in the video and that word;
4.2) selecting the n_e words with the highest aggregated text attention in the query statement to form the mask word set W* = {w_i*}, 1 ≤ i ≤ n_e, wherein n_e = ⌊E% · n_q⌋, ⌊·⌋ denotes rounding down and E% denotes the mask percentage threshold;
replacing the n_e words of the mask word set W* with mask tokens to obtain the erased query statement; obtaining the language features Q^en* = {q_i*}, 1 ≤ i ≤ n_q, of the erased query statement by the method of step 1), wherein q_i* denotes the language feature of the ith word in the erased query statement;
4.3) from the enhanced video stream V^en generated by the dual-branch visual filter and the language features Q^en* = {q_i*}, 1 ≤ i ≤ n_q, of the erased query statement, obtaining the erased enhanced video stream V^en* = {v_i^en*} by the method of step 2), wherein v_i^en* is the erased enhanced modal feature of the ith frame in the video; according to the erased enhanced video stream V^en* and the language features Q^en* of the erased query statement, obtaining the cross-modal attention δ_ij* between the ith frame in the video and the jth word in the erased query statement by the method of step 3.1a);
4.4) calculating the erasure-aggregated visual representation of each word in the query statement by weighted summation over the frames of the video, with the normalized cross-modal attention α_ij* as weights, wherein α_ij* denotes the normalized cross-modal attention and a_j* denotes the erasure-aggregated visual representation of the jth word in the query statement;
4.5) calculating the erasure loss, wherein L_ers denotes the erasure loss value, s is an intermediate variable, W_e is a projection matrix, and h_i^en* is the erased enhanced language-aware frame feature of the ith frame in the video;
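A small sketch of the word-level erasure of steps 4.1)–4.2), under two assumptions the claim leaves open: the aggregated text attention of a word is taken as the sum of its cross-modal attention over the frames, and n_e = ⌊E% · n_q⌋; the mask token and all names are illustrative.

```python
import math
import torch

def erase_query(words: list, attn: torch.Tensor, erase_pct: float = 0.3,
                mask_token: str = "<mask>") -> list:
    """words: the n_q query words; attn: (n_v, n_q) cross-modal attention from step 3.1a).

    Returns the erased query, with the n_e most-attended words replaced by a mask token."""
    agg = attn.sum(dim=0)                        # aggregated text attention per word (assumed sum)
    n_e = math.floor(erase_pct * len(words))     # n_e = floor(E% * n_q) (assumed)
    top_idx = torch.topk(agg, k=max(n_e, 0)).indices.tolist()
    return [mask_token if j in top_idx else w for j, w in enumerate(words)]
```

The erased query is then re-encoded by the method of step 1) and passed back through the visual filter and the enhancement branch, as steps 4.3) to 4.5) describe.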
5. The weakly supervised video segment retrieval method based on the erasure mechanism as recited in claim 4, further comprising:
connecting the erased enhanced modal feature and the erased aggregated visual representation of each frame in the video, and performing visual-text interaction with a Bi-GRU network to obtain the erased enhanced language-aware frame feature of each frame in the video, wherein h_i^en* is the erased enhanced language-aware frame feature of the ith frame in the video;
calculating the candidate score o_i^p* of each erased segment by the methods of steps 3.3a) to 3.4a), and then performing a weighted summation with the candidate score before erasure to obtain the final candidate score of the positive candidate segment, wherein w_c is a learnable parameter with an initial value of 0.5, and ô_i^p is the final candidate score of the ith positive candidate segment.
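The re-scoring of claim 5 can be sketched as below; reading the "weighted summation" as the convex combination ô_i^p = w_c · o_i^p + (1 − w_c) · o_i^p* is an assumption, since the claim only states that w_c is learnable with an initial value of 0.5.

```python
import torch
import torch.nn as nn

class FinalScore(nn.Module):
    """Combine the pre- and post-erasure candidate scores of the positive segments."""
    def __init__(self):
        super().__init__()
        self.w_c = nn.Parameter(torch.tensor(0.5))   # learnable weight, initial value 0.5

    def forward(self, score_before: torch.Tensor, score_erased: torch.Tensor) -> torch.Tensor:
        # assumed convex combination of the two candidate scores
        return self.w_c * score_before + (1.0 - self.w_c) * score_erased
```

Since w_c is trained together with the other parameters, the model can learn how much weight to give the erased view of each positive candidate segment.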
6. The method according to claim 1, wherein the multi-task loss in step 5) is a weighted sum of the erasure loss value in step 4) and the inter-sample loss value, the intra-sample loss value, the global loss value and the gap loss value;
a. the inter-sample loss, the expression of which involves the final candidate score ô_i^p of the ith positive candidate segment, the candidate score o_i^n of the ith negative candidate segment, the enhancement score K_en, the suppression score K_sp, the number T of candidate segments and the margin Δ_intra, with L_inter denoting the inter-sample loss value;
b. the intra-sample loss, the expression of which involves the margin Δ_inter, the enhancement score K_en(V′, Q) corresponding to the negative sample (V′, Q) and the enhancement score K_en(V, Q′) corresponding to the negative sample (V, Q′), with L_intra denoting the intra-sample loss value;
c. the global loss, the expression of which involves the number M_en of segments in the 2D segment feature map, with L_glo denoting the global loss value;
d. the gap loss, with L_gap denoting the gap loss value.
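The exact loss expressions of claim 6 live in the patent's figures and are not recoverable from the text above, so the sketch below only illustrates a plausible margin-based form for the inter-sample and intra-sample terms and their weighted combination with the erasure loss; the global and gap terms are omitted, and every formula, margin and weight here is an assumption.

```python
import torch

def multitask_loss(o_pos, o_neg, k_en_neg_video, k_en_neg_query, erase_loss,
                   d_intra=0.4, d_inter=0.4, weights=(1.0, 1.0, 1.0)):
    """o_pos / o_neg: candidate scores of the T positive / negative candidate segments.

    k_en_neg_video / k_en_neg_query: enhancement scores of the negative samples
    (V', Q) and (V, Q'). The margin-ranking forms and the mean-pooled enhancement /
    suppression scores are assumptions, not the patent's formulas."""
    k_en = o_pos.mean()                                    # enhancement score K_en (assumed mean)
    k_sp = o_neg.mean()                                    # suppression score K_sp (assumed mean)
    l_inter = torch.clamp(d_intra - (k_en - k_sp), min=0.0)             # claim 6 a., assumed form
    l_intra = torch.clamp(d_inter - (k_en - k_en_neg_video), min=0.0) \
            + torch.clamp(d_inter - (k_en - k_en_neg_query), min=0.0)   # claim 6 b., assumed form
    return weights[0] * erase_loss + weights[1] * l_inter + weights[2] * l_intra
```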
7. The weakly supervised video segment retrieval method based on the erasure mechanism as claimed in claim 6, wherein the negative samples (V′, Q) and (V, Q′) used in the intra-sample loss are obtained as follows: for each video-query pair (V, Q), an unmatched video V′ is randomly selected from the training set to form the negative sample (V′, Q), and an unmatched query Q′ is selected to form, together with the video V, another negative sample (V, Q′).
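Claim 7's negative sampling is straightforward to sketch; the helper below pairs each matched (V, Q) with a mismatched video and a mismatched query drawn from other training pairs (all names illustrative).

```python
import random

def sample_negatives(index: int, videos: list, queries: list):
    """videos[i] and queries[i] form the matched pair (V, Q); `index` selects the current pair.

    Returns the two negative samples (V', Q) and (V, Q') of claim 7."""
    others = [j for j in range(len(videos)) if j != index]
    v_neg = videos[random.choice(others)]    # unmatched video V'
    q_neg = queries[random.choice(others)]   # unmatched query Q'
    return (v_neg, queries[index]), (videos[index], q_neg)
```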
8. A retrieval system based on the weak supervision video clip retrieval method of claim 1, characterized by comprising:
the video preprocessing module is used for acquiring the frame characteristics of the video;
the query statement preprocessing module is used for acquiring the language features of the query statement;
the double-branch visual filtering module is used for obtaining the enhanced modal characteristics and the suppressed modal characteristics of each frame in the video by utilizing the frame characteristics and the language characteristics to form an enhanced video stream and a suppressed video stream;
a dual-branch shared candidate network module comprising an enhanced branch and a suppressed branch; in the enhancement branch, according to the language features of the query statement and the enhancement video stream, aggregating the language features of each frame of the video to obtain an enhanced aggregate text representation of each frame in the video; performing visual-text interaction on the enhanced modal characteristics and the enhanced aggregation text representation of each frame in the video to obtain enhanced language perception frame characteristics of each frame in the video; acquiring the relation between adjacent segments in the video by using the 2D segment characteristic diagram to obtain the cross-modal characteristics of the video segment level and generate a positive candidate segment set and candidate scores thereof;
in the suppression branch, generating a negative candidate segment set and a candidate score thereof by adopting the same method as the enhancement branch according to the language characteristics of the query statement and the suppression video stream;
the dynamic erasing module is used for erasing the query statement at the word level in the enhancement branch of the double-branch sharing candidate network module to obtain the erased enhanced language perception frame characteristics of each frame in the video, and calculating the erasure loss by using the erased enhanced language perception frame characteristics;
a candidate score calculation module for calculating final candidate scores, comprising the final candidate score of each positive candidate segment and the final candidate score of each negative candidate segment; the final candidate score of a positive candidate segment is obtained by weighted summation of the candidate score of that positive segment after erasure and its candidate score before erasure, and the final candidate score of a negative candidate segment adopts the candidate score output by the suppression branch of the double-branch sharing candidate network module;
the training module is used for training the double-branch visual filtering module and the double-branch sharing candidate network module based on multi-task loss to obtain a trained model;
and the retrieval module is used for retrieving the query statement and the video to be processed according to the trained model, the video preprocessing module and the query statement preprocessing module, and outputting the segment corresponding to the highest candidate score output by the enhanced branch as a final retrieval result.
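At retrieval time the system of claim 8 reduces to running the enhancement branch on the new video-query pair and returning the highest-scoring span; a minimal sketch reusing the dual_branch_filter and enhancement_branch helpers sketched after claims 2 and 3 (so it inherits their assumptions):

```python
import torch

def retrieve(frame_feats: torch.Tensor, global_scores: torch.Tensor, word_feats: torch.Tensor):
    """Return the (start_frame, end_frame) span with the highest enhancement-branch score."""
    v_en, _ = dual_branch_filter(frame_feats, global_scores)   # suppressed stream is unused at test time
    spans, scores = enhancement_branch(v_en, word_feats, T=1)  # keep only the top candidate segment
    return spans[0], float(scores[0])
```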
9. The retrieval system of claim 8, wherein the dual-branch visual filter module comprises:
the accumulated residual calculation module is used for projecting the language features of the query statement to the clustering center corresponding to each language scene and calculating the accumulated residual between the language features and the center vectors;
a cross-modal matching score calculation module for calculating, according to the accumulated residuals, the matching scores between the features of each frame in the video and the language features of all scenes;
the global score calculation module is used for selecting, for each frame, the highest of the matching scores between that frame and all scenes as the global score, and performing normalization to obtain the normalized scores of all frames in the video;
and the modal feature calculation module is used for calculating the enhanced modal feature and the suppressed modal feature corresponding to each frame of the video according to the normalized score of each frame of the video, forming the enhanced video stream and the suppressed video stream respectively.
10. The retrieval system of claim 9, wherein the dual-branch sharing candidate network module includes an enhancement branch and a suppression branch, with parameters shared between the branches;
both branches comprise:
the aggregation text representation calculation module is used for obtaining the cross-modal attention between each frame in the video and each word in the query statement according to the language features of the query statement and the enhancement video stream or the suppression video stream, and obtaining the enhanced aggregation text representation or the suppressed aggregation text representation after normalization and cumulative summation;
the language perception frame feature calculation module is used for connecting modal features and aggregate text representations of each frame in the video, and performing visual-text interaction by applying a Bi-GRU network to obtain enhanced language perception frame features or suppressed language perception frame features of each frame in the video;
the cross-modal feature calculation module is used for constructing a 2D segment feature map, performing a two-layer 2D convolution on the 2D segment feature map, and calculating the relation between adjacent video segments to obtain the cross-modal features of the video segment level;
a candidate score calculating module for calculating a candidate score of each video segment, and selecting the T segments with the highest scores as candidate segments; the candidate segment obtained in the enhancement branch is a positive candidate segment, and the candidate segment obtained in the suppression branch is a negative candidate segment.
CN202110272729.4A 2021-03-12 2021-03-12 Weak supervision video clip retrieval method and system based on erasure mechanism Active CN112685597B (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant