CN112685597B - Weak supervision video clip retrieval method and system based on erasure mechanism - Google Patents
Weak supervision video clip retrieval method and system based on erasure mechanism
- Publication number: CN112685597B
- Application number: CN202110272729.4A
- Authority
- CN
- China
- Prior art keywords
- video
- frame
- candidate
- branch
- language
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Links
- 238000000034 method Methods 0.000 title claims abstract description 66
- 230000007246 mechanism Effects 0.000 title claims abstract description 29
- 230000000007 visual effect Effects 0.000 claims abstract description 38
- 230000001629 suppression Effects 0.000 claims abstract description 27
- 238000012549 training Methods 0.000 claims abstract description 24
- 238000004364 calculation method Methods 0.000 claims description 46
- 230000008447 perception Effects 0.000 claims description 31
- 239000012634 fragment Substances 0.000 claims description 19
- 230000002776 aggregation Effects 0.000 claims description 18
- 238000004220 aggregation Methods 0.000 claims description 18
- 239000011159 matrix material Substances 0.000 claims description 16
- 230000003993 interaction Effects 0.000 claims description 14
- 238000007781 pre-processing Methods 0.000 claims description 13
- 230000006870 function Effects 0.000 claims description 12
- 239000013598 vector Substances 0.000 claims description 11
- 238000010586 diagram Methods 0.000 claims description 9
- 230000004931 aggregating effect Effects 0.000 claims description 8
- 238000010606 normalization Methods 0.000 claims description 8
- 238000001914 filtration Methods 0.000 claims description 7
- 230000004927 fusion Effects 0.000 claims description 5
- 230000005764 inhibitory process Effects 0.000 claims description 5
- 230000008569 process Effects 0.000 claims description 5
- 238000012216 screening Methods 0.000 claims description 5
- 238000012545 processing Methods 0.000 claims description 4
- 230000002401 inhibitory effect Effects 0.000 claims description 3
- 230000001186 cumulative effect Effects 0.000 claims description 2
- 238000011156 evaluation Methods 0.000 description 3
- 238000012360 testing method Methods 0.000 description 3
- 230000000694 effects Effects 0.000 description 2
- 238000005457 optimization Methods 0.000 description 2
- 238000005192 partition Methods 0.000 description 2
- 238000011176 pooling Methods 0.000 description 2
- 230000002123 temporal effect Effects 0.000 description 2
- 238000012795 verification Methods 0.000 description 2
- 230000008485 antagonism Effects 0.000 description 1
- 238000013459 approach Methods 0.000 description 1
- 230000009286 beneficial effect Effects 0.000 description 1
- 238000004891 communication Methods 0.000 description 1
- 238000010276 construction Methods 0.000 description 1
- 230000007547 defect Effects 0.000 description 1
- 230000002708 enhancing effect Effects 0.000 description 1
- 238000012986 modification Methods 0.000 description 1
- 230000004048 modification Effects 0.000 description 1
- 238000003058 natural language processing Methods 0.000 description 1
- 238000004904 shortening Methods 0.000 description 1
- 230000003313 weakening effect Effects 0.000 description 1
Landscapes
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention discloses a weakly supervised video clip retrieval method and system based on an erasing mechanism, belonging to the field of video clip retrieval. For a video-query sentence pair, language features and frame features are obtained respectively; a language-aware double-branch visual filter is constructed to generate an enhanced video stream and a suppressed video stream; a double-branch shared candidate network based on a dynamic erasing mechanism is constructed to generate positive candidate segments and negative candidate segments; the dynamic erasing mechanism is introduced into the enhancement branch of the candidate network, and an enhancement score and a suppression score are calculated; the language-aware double-branch visual filter and the double-branch shared candidate network based on the dynamic erasing mechanism are trained with a multi-task loss to obtain a trained model; for the query sentence and video to be processed, the trained model takes the segment corresponding to the highest candidate score output by the enhancement branch as the final retrieval result. The invention enhances video-sentence matching and improves video retrieval performance.
Description
Technical Field
The invention relates to the field of video segment retrieval, in particular to a weak supervision video segment retrieval method and system based on an erasing mechanism.
Background
Video clip retrieval is a new topic in information retrieval that integrates computer vision and natural language processing. Given an untrimmed video and a natural language description, the goal of video clip retrieval is to locate the temporal boundaries of the clip that matches the semantics of the description. However, most existing methods are trained in a fully supervised setting, which requires the temporal boundaries of every sentence to be annotated manually. Such manual annotation is very expensive and time-consuming, especially for ambiguous descriptions.
Existing weakly supervised methods typically train the localization network with MIL-based (multiple instance learning) or reconstruction-based approaches, and both have drawbacks. The former learns latent visual-text matching through an inter-sample loss, treating the given video-language pairs as positive samples and constructing unmatched language-video pairs as negative samples. However, this places high quality requirements on the randomly selected negative samples: low-quality samples are easy to distinguish and cannot provide a strong supervision signal. Reconstruction-based approaches, on the other hand, attempt to reconstruct the query sentence from the visual content during training and locate candidate targets during inference using intermediate results such as attention weights. These methods do not directly optimize the visual-text matching score used at inference. Because a candidate with a higher attention weight does not necessarily have a higher association with the query sentence, such indirect optimization limits model performance. The existing weakly supervised methods therefore have at least the following problems:
1) high-quality negative samples are required; low-quality samples are easy to distinguish and cannot provide a strong supervision signal;
2) the visual-text matching score used at inference cannot be directly optimized; a candidate with a high attention weight does not necessarily have a high relevance to the query sentence, and such indirect optimization limits model performance.
Erasure is an effective data augmentation method for suppressing overfitting and enhancing model robustness. Conventional erasure methods are generally applied to images: regions of an image are selected at random, their pixels are replaced with 0 or the image mean, and a large number of new images are generated for training. Erasing video frames, however, has limited ability to improve video-sentence matching. The present method provides a novel regularized double-branch candidate network with an erasing mechanism: a fine-grained intra-sample confrontation is constructed by finding credible negative candidate moments, and a more complete visual-text relation is captured through attention-guided dynamic erasing.
Disclosure of Invention
In the prior art, only inter-sample confrontation is usually considered while intra-sample confrontation is ignored, so it is difficult to select the correct result from plausible candidate segments; moreover, the prior art concentrates the attended video-sentence matching on a few dominant words and ignores the global context, so samples that do not appear in the training data are hard to localize, high accuracy is obtained only on the training set, and practical applicability is poor. To overcome these defects, the invention provides a weakly supervised video segment retrieval method and system based on an erasing mechanism, which can retrieve video segments efficiently and accurately.
According to the invention, a double-branch candidate module is constructed in which the two branches adopt the same structure and share parameters, making the model lighter and more robust; by constructing a dynamic erasing mechanism that erases the words receiving the highest attention in the query sentence, the video-sentence matching capability is enhanced and the performance of video retrieval is improved.
In order to achieve the purpose, the invention adopts the following specific technical scheme:
One object of the present invention is to provide a weakly supervised video segment retrieval method based on an erasure mechanism, which includes the following steps:
1) aiming at the video-query statement, acquiring the language characteristic of the query statement and the frame characteristic of the video;
2) constructing a language-aware double-branch visual filter, and obtaining an enhanced modal characteristic and a suppressed modal characteristic of each frame in a video by using the frame characteristic and the language characteristic to form an enhanced video stream and a suppressed video stream;
3) constructing a double-branch sharing candidate network based on a dynamic erasure mechanism, wherein the double-branch sharing candidate network comprises an enhanced branch and a suppressed branch;
in the enhancement branch, according to the language features of the query sentence and the enhanced video stream, the language features are aggregated for each frame of the video to obtain an enhanced aggregated text representation of each frame; visual-text interaction is performed on the enhanced modal features and the enhanced aggregated text representation of each frame to obtain enhanced language-aware frame features of each frame; the relation between adjacent segments in the video is captured with a 2D segment feature map to obtain segment-level cross-modal features, and a positive candidate segment set and its candidate scores are generated;
in the suppression branch, a negative candidate segment set and its candidate scores are generated from the language features of the query sentence and the suppressed video stream by the same method as the enhancement branch;
4) introducing the dynamic erasing mechanism into the enhancement branch of step 3) to obtain the erased enhanced language-aware frame features of each frame in the video; calculating an erasure loss with the erased enhanced language-aware frame features, calculating a candidate score for each positive segment after erasure, and taking the weighted sum with the candidate scores before erasure as the final candidate scores of the positive candidate segments;
5) combining the candidate scores of the negative candidate segments and the final candidate scores of the positive candidate segments, and training the language-aware double-branch visual filter and the double-branch shared candidate network based on the dynamic erasing mechanism with a multi-task loss to obtain a trained model;
6) for the query sentence and video to be processed, using the trained model and combining steps 1) to 3), the segment corresponding to the highest candidate score output by the enhancement branch is taken as the final retrieval result (a high-level sketch of these steps is given after this list).
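At a high level, steps 1) to 6) can be organized into the following training and inference skeleton. Every function name in this sketch is a hypothetical placeholder for the components detailed in the later sections, assuming a PyTorch-style model and optimizer; it is not an actual API.

```python
def train_step(video, query, model, optimizer):
    """One weakly supervised training step on a video-query pair (steps 1-5)."""
    Q = model.encode_query(query)                              # 1) language features
    V = model.encode_video(video)                              # 1) frame features
    V_en, V_sp = model.visual_filter(V, Q)                     # 2) enhanced / suppressed streams
    pos_segs, pos_scores = model.proposal_branch(V_en, Q)      # 3) enhancement branch
    neg_segs, neg_scores = model.proposal_branch(V_sp, Q)      # 3) suppression branch (shared weights)
    erased_scores, erase_loss = model.dynamic_erase(V_en, query, pos_segs)  # 4) erasing mechanism
    final_pos_scores = model.fuse_scores(pos_scores, erased_scores)         # 4) weighted summation
    loss = model.multitask_loss(erase_loss, final_pos_scores, neg_scores)   # 5) multi-task loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss

def retrieve(video, query, model):
    """Inference (step 6): return the segment with the highest enhancement-branch score."""
    Q, V = model.encode_query(query), model.encode_video(video)
    V_en, _ = model.visual_filter(V, Q)
    segs, scores = model.proposal_branch(V_en, Q)
    best = max(range(len(segs)), key=lambda i: float(scores[i]))
    return segs[best]
```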
Another object of the present invention is to provide a system for retrieving a weakly supervised video segment based on the above method, which includes:
the video preprocessing module is used for acquiring the frame characteristics of the video;
the query statement preprocessing module is used for acquiring the language features of the query statement;
the double-branch visual filtering module is used for obtaining the enhanced modal characteristics and the suppressed modal characteristics of each frame in the video by utilizing the frame characteristics and the language characteristics to form an enhanced video stream and a suppressed video stream;
a dual-branch shared candidate network module comprising an enhanced branch and a suppressed branch; in the enhancement branch, according to the language features of the query statement and the enhancement video stream, aggregating the language features of each frame of the video to obtain an enhancement aggregate text representation of each frame in the video; performing visual-text interaction on the enhanced modal characteristics and the enhanced aggregation text representation of each frame in the video to obtain enhanced language perception frame characteristics of each frame in the video; acquiring the relation between adjacent segments in the video by using the 2D segment characteristic diagram to obtain the cross-modal characteristics of the video segment level and generate an active candidate segment set and candidate scores thereof;
in the suppression branch, generating a negative candidate segment set and a candidate score thereof by adopting the same method as the enhancement branch according to the language characteristics of the query statement and the suppression video stream;
the dynamic erasing module is used for erasing the query sentence in a word level in an enhancement branch of the double-branch sharing candidate network module to obtain the erased enhancement language perception frame characteristics of each frame in the video; calculating the erasure loss by using the erased enhanced language perception frame characteristics;
a candidate score calculation module for calculating final candidate scores, including the final candidate score of each positive candidate segment and the final candidate score of each negative candidate segment; the final candidate score of a positive candidate segment is obtained by the weighted sum of its candidate score after erasure and its candidate score before erasure, and the final candidate score of a negative candidate segment is the candidate score output by the suppression branch of the double-branch shared candidate network module;
the training module is used for training the double-branch visual filtering module and the double-branch sharing candidate network module based on multi-task loss to obtain a trained model;
and the retrieval module is used for retrieving the query statement and the video to be processed according to the trained model, the video preprocessing module and the query statement preprocessing module, and outputting the segment corresponding to the highest candidate score output by the enhanced branch as a final retrieval result.
Compared with the prior art, the invention has the advantages that:
(1) the video segment retrieval method is weakly supervised: only video-level sentence annotation is required during training and no alignment annotation is needed for each sentence, which greatly reduces annotation cost and time;
(2) compared with most Multiple Instance Learning-based methods, which mainly rely on inter-sample confrontation to judge the result, the method fully performs intra-sample confrontation between moments with similar information in the video and can select the correct result from plausible candidate segments;
(3) the invention introduces an adversarial dynamic erasing mechanism into the training of the double-branch candidate network: the word set with the highest textual attention in the query sentence is masked, the enhanced video stream interacts with the language features of the erased query sentence so that global information is fully considered, and the resulting candidate scores are weighted and summed with the candidate scores obtained by the original enhancement branch as the final scores of the positive candidate segments. This improves video-sentence matching, not only raising the training effect on the training set but also transferring to practical applications and improving retrieval performance.
Drawings
FIG. 1 is a schematic flow diagram of the process of the present invention;
FIG. 2 is a schematic diagram of the overall architecture of the present invention.
Detailed Description
The invention will be further elucidated and described with reference to the drawings and the detailed description.
As shown in fig. 1, the method for retrieving a weakly supervised video segment based on an erasure mechanism provided by the present invention mainly includes the following steps:
the method comprises the steps of firstly, aiming at a video-query statement, obtaining language features of the query statement and frame features of a video;
secondly, a language-aware double-branch visual filter is constructed, and the enhanced modal characteristics and the suppressed modal characteristics of each frame in the video are obtained by utilizing the frame characteristics and the language characteristics to form an enhanced video stream and a suppressed video stream;
step three, constructing a double-branch sharing candidate network based on a dynamic erasure mechanism, wherein the double-branch sharing candidate network comprises an enhanced branch and a suppressed branch;
in the enhancement branch, according to the language features of the query sentence and the enhanced video stream, the language features are aggregated for each frame of the video to obtain an enhanced aggregated text representation of each frame; visual-text interaction is performed on the enhanced modal features and the enhanced aggregated text representation of each frame to obtain enhanced language-aware frame features of each frame; the relation between adjacent segments in the video is captured with a 2D segment feature map to obtain segment-level cross-modal features, and a positive candidate segment set and its candidate scores are generated;
in the suppression branch, a negative candidate segment set and its candidate scores are generated from the language features of the query sentence and the suppressed video stream by the same method as the enhancement branch;
step four, introducing the dynamic erasing mechanism into the enhancement branch of step three to obtain the erased enhanced language-aware frame features of each frame in the video; calculating an erasure loss with the erased enhanced language-aware frame features, calculating a candidate score for each positive segment after erasure, and taking the weighted sum with the candidate scores before erasure as the final candidate scores of the positive candidate segments;
step five, combining the candidate scores of the negative candidate segments and the final candidate scores of the positive candidate segments, and training the language-aware double-branch visual filter and the double-branch shared candidate network based on the dynamic erasing mechanism with a multi-task loss to obtain a trained model;
step six, for the query sentence and video to be processed, using the trained model and combining steps one to three, the segment corresponding to the highest candidate score output by the enhancement branch is taken as the final retrieval result.
In one embodiment of the present invention, step one is described as follows:
For a query sentence, word features are extracted with a pre-trained GloVe model and fed into a Bi-GRU to learn word semantic representations, giving the language features of the query sentence Q = {q_i}, 1 ≤ i ≤ n_q, where n_q is the number of words in the query sentence and q_i is the language feature of the i-th word;
For a video, video features are extracted with a pre-trained video feature extractor, and temporal average pooling is then used to shorten the video sequence, giving the frame features V = {v_i}, 1 ≤ i ≤ n_v, where n_v is the number of frames in the video and v_i is the frame feature of the i-th frame. In this embodiment, the pre-trained video feature extractor may be a C3D feature extractor.
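A minimal PyTorch sketch of this feature-preparation step is given below, assuming pre-extracted GloVe word embeddings and C3D features; the dimensions, class names and pooling stride are illustrative assumptions rather than the patent's exact implementation.

```python
import torch
import torch.nn as nn

class QueryEncoder(nn.Module):
    """Bi-GRU over pre-extracted GloVe word embeddings -> language features Q = {q_i}."""
    def __init__(self, glove_dim=300, hidden_dim=256):
        super().__init__()
        self.bigru = nn.GRU(glove_dim, hidden_dim, batch_first=True, bidirectional=True)

    def forward(self, glove_embeddings):           # (B, n_q, 300)
        q, _ = self.bigru(glove_embeddings)        # (B, n_q, 512) language features
        return q

def frame_features(c3d_features, stride=8):
    """Temporal average pooling over pre-extracted C3D features -> frame features V = {v_i}."""
    # c3d_features: (B, n_raw, d_v); non-overlapping windows of `stride` frames are averaged.
    B, n_raw, d_v = c3d_features.shape
    n_v = n_raw // stride
    v = c3d_features[:, :n_v * stride].reshape(B, n_v, stride, d_v).mean(dim=2)
    return v                                        # (B, n_v, d_v)

if __name__ == "__main__":
    Q = QueryEncoder()(torch.randn(2, 20, 300))     # toy query of 20 words
    V = frame_features(torch.randn(2, 256, 4096))   # toy C3D sequence
    print(Q.shape, V.shape)
```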
In one embodiment of the present invention, step two is described as follows:
In step two, a language-aware dual-branch visual filter is constructed, and the frame features V and language features Q are used to generate the enhanced video stream V^en = {v_i^en}, 1 ≤ i ≤ n_v, and the suppressed video stream V^sp = {v_i^sp}, 1 ≤ i ≤ n_v, where v_i^en is the enhanced modal feature of the i-th frame in the video and v_i^sp is the suppressed modal feature of the i-th frame.
Specifically, the method comprises the following steps:
2.1) each language scene is regarded as a cluster center, the language features Q = {q_i}, 1 ≤ i ≤ n_q, are projected to each cluster center, and the accumulated residual between the language features and a trainable center vector C = {c_j}, 1 ≤ j ≤ n_c, is computed in the NetVLAD manner:
α_i = softmax(W_c q_i + b_c),  u_j = Σ_{i=1..n_q} α_ij (q_i − c_j)
where c_j is the j-th cluster center, n_c is the number of cluster centers, W_c and b_c are the projection matrix and bias of the NetVLAD model; α_i is the coefficient vector relating the language feature of the i-th word to the n_c cluster centers, and α_ij, the j-th element of α_i, is the coefficient associated with the j-th cluster center; u_j is the accumulated residual of the language features Q = {q_i}, 1 ≤ i ≤ n_q, with respect to the j-th cluster center, and represents the j-th scene-based language feature;
2.2) a cross-modal matching score is computed from the accumulated residuals:
β_ij = σ( w_a · tanh( W_a^v v_i + W_a^u u_j + b_a ) )
where W_a^v and W_a^u are projection matrices, b_a is a bias, w_a is a row vector, σ is the sigmoid function and tanh(·) is the tanh function; β_ij ∈ (0,1) is the matching score between the frame feature of the i-th frame in the video and the j-th scene-based language feature;
2.3) from steps 2.1) to 2.2) the matching scores between the frame feature of the i-th frame and all scenes are obtained, and the maximum matching score is taken as the global score:
β̂_i = max_{1≤j≤n_c} β_ij
where β̂_i is the global score of the i-th frame feature over all scenes. The global scores of all frame features in the video are thus obtained, written as {β̂_i}, 1 ≤ i ≤ n_v;
2.4) the global scores are normalized:
β̄_i = ( β̂_i − min_k β̂_k ) / ( max_k β̂_k − min_k β̂_k )
where min_k β̂_k and max_k β̂_k are the minimum and maximum of the global scores over all frame features in the video.
Through the above steps the normalized scores of all frames {β̄_i}, 1 ≤ i ≤ n_v, are obtained, and the relation between each frame and the query sentence is expressed by the frame's normalized score;
2.5) the enhanced modal feature and suppressed modal feature of each frame are computed from the normalized global scores:
v_i^en = β̄_i · v_i,   v_i^sp = (1 − β̄_i) · v_i
where v_i^en is the enhanced modal feature of the i-th frame, and all frames together form the enhanced video stream V^en, which strengthens the important frames and weakens the unimportant frames according to the normalized scores; v_i^sp is the suppressed modal feature of the i-th frame, and all frames together form the suppressed video stream V^sp, which, in contrast to the enhanced stream, weakens the important frames and strengthens the unimportant frames.
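The following sketch illustrates steps 2.1) to 2.5) in PyTorch. The NetVLAD-style soft assignment, additive matching score, min-max normalization and complementary weighting follow the reconstruction above; tensor shapes, parameter names and the small epsilon added for numerical stability are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LanguageAwareFilter(nn.Module):
    """Language-aware dual-branch visual filter: frame scores from scene-based language features."""
    def __init__(self, d_q=512, d_v=4096, n_c=16, d_h=512):
        super().__init__()
        self.centers = nn.Parameter(torch.randn(n_c, d_q))   # trainable center vectors C
        self.assign = nn.Linear(d_q, n_c)                     # W_c, b_c (soft assignment)
        self.proj_v = nn.Linear(d_v, d_h)                     # W_a^v
        self.proj_u = nn.Linear(d_q, d_h)                     # W_a^u, b_a
        self.w_a = nn.Linear(d_h, 1, bias=False)              # row vector w_a

    def forward(self, q, v):
        # q: (B, n_q, d_q) language features, v: (B, n_v, d_v) frame features
        alpha = F.softmax(self.assign(q), dim=-1)             # (B, n_q, n_c) assignment coefficients
        resid = q.unsqueeze(2) - self.centers                 # (B, n_q, n_c, d_q) residuals q_i - c_j
        u = (alpha.unsqueeze(-1) * resid).sum(dim=1)          # (B, n_c, d_q) accumulated residuals u_j
        beta = torch.sigmoid(self.w_a(torch.tanh(
            self.proj_v(v).unsqueeze(2) + self.proj_u(u).unsqueeze(1)))).squeeze(-1)  # (B, n_v, n_c)
        g = beta.max(dim=-1).values                           # global score per frame
        g_norm = (g - g.min(dim=1, keepdim=True).values) / (
            g.max(dim=1, keepdim=True).values - g.min(dim=1, keepdim=True).values + 1e-8)
        v_en = g_norm.unsqueeze(-1) * v                       # enhanced video stream
        v_sp = (1.0 - g_norm).unsqueeze(-1) * v               # suppressed video stream
        return v_en, v_sp, g_norm
```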
In one embodiment of the present invention, step three is described as follows:
in the third step, the established dual-branch sharing candidate module based on the dynamic erasure mechanism comprises an enhanced branch and a suppressed branch with the same structure, and parameters are shared among the branches;
the processing flows of the enhanced branch and the suppressed branch are similar, and only the calculation process of the enhanced branch is specifically described below, including:
3.1a) from the language features of the query sentence Q = {q_i}, 1 ≤ i ≤ n_q, and the enhanced video stream V^en = {v_i^en} generated by the dual-branch visual filter, a cross-modal unit aggregates the language features for each frame to obtain the enhanced aggregated text representation of each frame:
δ_ij = w_m · tanh( W_m^v v_i^en + W_m^q q_j + b_m ),  δ̄_ij = exp(δ_ij) / Σ_{k=1..n_q} exp(δ_ik),  h_i^en = Σ_{j=1..n_q} δ̄_ij · q_j
where δ_ij is the cross-modal attention between the i-th frame in the video and the j-th word in the query sentence; W_m^v and W_m^q are projection matrices, b_m is a bias, w_m is a row vector and tanh(·) is the tanh function; δ̄_ij is the normalized cross-modal attention, and h_i^en is the enhanced aggregated text representation of the i-th frame in the video;
3.2a) the enhanced modal feature and the enhanced aggregated text representation of each frame are concatenated, and a Bi-GRU network is applied for visual-text interaction to obtain the enhanced language-aware frame feature of each frame:
f_i^en = Bi-GRU([ v_i^en ; h_i^en ])
where [ ; ] denotes concatenation and f_i^en is the enhanced language-aware frame feature of the i-th frame;
3.3a) calculating cross-modal features at video segment level:
The video is divided into segments and a 2D segment feature map F is constructed, whose first two dimensions index the start and end frames of a segment and whose third dimension is the fused feature of the segment. The fused feature is obtained by accumulating the enhanced language-aware frame features within the segment:
F[a, b] = Σ_{i=a..b} f_i^en
where a and b are the start frame and end frame of the video segment and F[a, b] is the fused feature of segment [a, b].
Two layers of 2D convolution are applied on the 2D segment feature map (the convolution kernel size can be chosen according to the actual situation) to model the relation between adjacent segments, giving the segment-level cross-modal features M^en = {m_i^en}, 1 ≤ i ≤ M_en, where M_en is the number of segments in the 2D segment feature map;
3.4a) the candidate score of each segment is computed as:
p_i^en = σ( W_p m_i^en + b_p )
where p_i^en is the candidate score of the i-th video segment, W_p and b_p are the projection matrix and bias, σ is the sigmoid function, and m_i^en is the cross-modal feature of the i-th video segment;
The T candidate segments with the highest scores are selected to form the positive candidate segment set {P_i^en}, 1 ≤ i ≤ T, and the candidate score c_i^en of each positive candidate segment is extracted, where P_i^en and c_i^en denote the i-th positive candidate segment and its candidate score;
Similarly, in the suppression branch with the same structure, a negative candidate segment set {P_i^sp}, 1 ≤ i ≤ T, and the candidate score c_i^sp of each negative candidate segment are generated from the language features of the query sentence Q = {q_i}, 1 ≤ i ≤ n_q, and the suppressed video stream V^sp generated by the dual-branch visual filter, following the procedure of steps 3.1a) to 3.4a), where P_i^sp and c_i^sp denote the i-th negative candidate segment and its candidate score.
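Below is an illustrative sketch of one branch of the shared candidate network (steps 3.1a) to 3.4a)); the same module with shared parameters is applied to the enhanced and suppressed streams. The prefix-sum construction of the 2D segment feature map, the kernel size and the upper-triangular mask for valid segments (start ≤ end) are assumptions made for a compact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SharedProposalBranch(nn.Module):
    """One branch of the dual-branch shared candidate network (parameters shared across branches)."""
    def __init__(self, d_v=4096, d_q=512, d_h=256, kernel=5):
        super().__init__()
        self.att_v = nn.Linear(d_v, d_h)                  # W_m^v
        self.att_q = nn.Linear(d_q, d_h)                  # W_m^q, b_m
        self.w_m = nn.Linear(d_h, 1, bias=False)          # row vector w_m
        self.bigru = nn.GRU(d_v + d_q, d_h, batch_first=True, bidirectional=True)
        self.conv = nn.Sequential(                        # two 2D conv layers over the segment map
            nn.Conv2d(2 * d_h, d_h, kernel, padding=kernel // 2), nn.ReLU(),
            nn.Conv2d(d_h, d_h, kernel, padding=kernel // 2), nn.ReLU())
        self.score = nn.Linear(d_h, 1)                    # W_p, b_p

    def forward(self, v_stream, q, top_t=16):
        # v_stream: (B, n_v, d_v) enhanced or suppressed stream, q: (B, n_q, d_q)
        att = self.w_m(torch.tanh(self.att_v(v_stream).unsqueeze(2)
                                  + self.att_q(q).unsqueeze(1))).squeeze(-1)     # (B, n_v, n_q)
        delta = F.softmax(att, dim=-1)                    # normalized cross-modal attention
        h = torch.bmm(delta, q)                           # aggregated text representation per frame
        f, _ = self.bigru(torch.cat([v_stream, h], dim=-1))   # language-aware frame features
        csum = torch.cumsum(f, dim=1)                     # prefix sums -> segment fusion features
        shifted = torch.cat([torch.zeros_like(csum[:, :1]), csum[:, :-1]], dim=1)
        seg = csum.unsqueeze(1) - shifted.unsqueeze(2)    # seg[:, a, b] = F[a, b]
        m = self.conv(seg.permute(0, 3, 1, 2)).permute(0, 2, 3, 1)   # segment cross-modal features
        scores = torch.sigmoid(self.score(m)).squeeze(-1)            # (B, n_v, n_v)
        scores = torch.triu(scores)                       # keep only segments with start <= end
        top_scores, top_idx = scores.flatten(1).topk(top_t, dim=1)   # T highest-scoring candidates
        return top_scores, top_idx, scores, delta
```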
In one embodiment of the present invention, step four is described as follows:
4.1) using the cross-modal attention between each frame in the video and each word in the query sentence computed in step 3.1a), the aggregated text attention is computed:
a_j = Σ_{i=1..n_v} δ̄_ij
where a_j represents the aggregated text attention between the video and the j-th word in the query sentence;
4.2) the n_e words with the highest aggregated text attention in the query sentence are selected to form the mask word set W* = {w_i*}, 1 ≤ i ≤ n_e, with n_e = ⌊n_q · E%⌋, where ⌊·⌋ denotes rounding down and E% is the mask percentage threshold;
the n_e words in the mask word set W* are replaced by the mask token 'Unknown' to obtain the erased query sentence; the language features of the erased query sentence Q^en* = {q_i*}, 1 ≤ i ≤ n_q, are then obtained by the method of step one;
4.3) from the enhanced video stream V^en generated by the dual-branch visual filter and the language features of the erased query sentence Q^en* = {q_i*}, 1 ≤ i ≤ n_q, the erased enhanced video stream V^en* = {v_i^en*} is obtained by the method of step two, where v_i^en* is the erased enhanced modal feature of the i-th frame in the video; from the erased enhanced video stream V^en* and the language features of the erased query sentence Q^en*, the cross-modal attention δ_ij* between the i-th frame in the video and the j-th word of the erased query sentence is obtained by the method of step 3.1a);
4.4) the erased aggregated visual representation of each word in the query sentence is computed:
δ̂_ij* = exp(δ_ij*) / Σ_{k=1..n_v} exp(δ_kj*),   g_j = Σ_{i=1..n_v} δ̂_ij* · v_i^en*
where δ̂_ij* is the normalized cross-modal attention and g_j is the erased aggregated visual representation of the j-th word in the query sentence;
4.5) calculating the erasure loss:
where L_e denotes the erasure loss, s is an intermediate variable, W_e is a projection matrix, and f_i^en* is the erased enhanced language-aware frame feature of the i-th frame in the video;
Further, after step 4.4), the method further comprises:
concatenating the erased enhanced modal feature and the erased aggregated representation of each frame in the video, and applying a Bi-GRU network for visual-text interaction to obtain the erased enhanced language-aware frame feature of each frame, analogously to step 3.2a):
f_i^en* = Bi-GRU([ v_i^en* ; h_i^en* ])
where f_i^en* is the erased enhanced language-aware frame feature of the i-th frame in the video and h_i^en* is the erased aggregated text representation of the i-th frame;
The candidate score c_i^en* of each erased positive segment is computed by the method of steps 3.3a) to 3.4a), and its weighted sum with the candidate score before erasure gives the final candidate score of the positive candidate segment:
ĉ_i^en = w_c · c_i^en + (1 − w_c) · c_i^en*
where w_c is a learnable parameter with an initial value of 0.5, and ĉ_i^en is the final candidate score of the i-th positive candidate segment.
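A sketch of the attention-guided dynamic erasing and score fusion of step four follows; the `rescore_fn` callable, which stands in for re-running the enhancement branch on the erased queries, and the simple sum used as the aggregated text attention are assumptions.

```python
import math
import torch

def dynamic_erase_scores(delta, words, erase_ratio, rescore_fn, scores_before, w_c):
    """Attention-guided dynamic erasing (steps 4.1)-4.4)) and the score fusion of step four.

    delta:         (B, n_v, n_q) normalized cross-modal attention from the enhancement branch
    words:         list of token lists, one per query sentence
    erase_ratio:   mask percentage threshold E%
    rescore_fn:    callable re-running the enhancement branch on the erased queries -> (B, T)
    scores_before: (B, T) candidate scores of the positive segments before erasing
    w_c:           learnable scalar (initialized to 0.5) fusing the two scores
    """
    agg_attention = delta.sum(dim=1)                        # (B, n_q) aggregated text attention a_j
    erased_queries = []
    for b, tokens in enumerate(words):
        n_e = math.floor(len(tokens) * erase_ratio)         # n_e = floor(n_q * E%)
        top_words = agg_attention[b, :len(tokens)].topk(max(n_e, 1)).indices.tolist()
        erased_queries.append(['Unknown' if j in top_words else w
                               for j, w in enumerate(tokens)])
    scores_after = rescore_fn(erased_queries)               # scores of the erased positive segments
    final_scores = w_c * scores_before + (1.0 - w_c) * scores_after
    return final_scores, erased_queries
```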
In one embodiment of the present invention, step five is described as follows:
the multi-task loss adopted by the invention is the weighted sum of an erasure loss value, an inter-sample loss value, an intra-sample loss value, a global loss value and a gap loss value;
a. The intra-sample loss value is computed as:
L_intra = max( 0, Δ_intra + K_sp − K_en ),  with  K_en = (1/T) Σ_{i=1..T} ĉ_i^en,  K_sp = (1/T) Σ_{i=1..T} c_i^sp
where L_intra is the intra-sample loss value, Δ_intra is the margin value, ĉ_i^en is the final candidate score of the i-th positive candidate segment, c_i^sp is the candidate score of the i-th negative candidate segment, K_en is the enhancement score, K_sp is the suppression score, and T is the number of candidate segments.
b. The inter-sample loss value is computed as:
L_inter = max( 0, Δ_inter − K_en(V, Q) + K_en(V̄, Q) ) + max( 0, Δ_inter − K_en(V, Q) + K_en(V, Q̄) )
where L_inter is the inter-sample loss value, Δ_inter is the margin value, K_en(V̄, Q) is the enhancement score corresponding to the negative sample (V̄, Q), and K_en(V, Q̄) is the enhancement score corresponding to the negative sample (V, Q̄);
the negative samples (V̄, Q) and (V, Q̄) are obtained as follows: for each pair of video-query sentences (V, Q), an unmatched video V̄ is randomly selected from the training set to form the negative sample (V̄, Q), and an unmatched query Q̄ is selected to form another negative sample (V, Q̄) with the video V.
c. The global loss value is computed as:
L_glob = (1/M_en) Σ_{i=1..M_en} p_i^en
where L_glob is the global loss value, used to keep the score of each candidate segment relatively low, and M_en is the number of segments in the 2D segment feature map.
d. A gap loss value L_gap is further computed.
The total multi-task loss is the weighted sum L = λ_1·L_e + λ_2·L_intra + λ_3·L_inter + λ_4·L_glob + λ_5·L_gap, where λ_1 to λ_5 are hyper-parameters used to balance the individual losses. Training with this multi-task loss yields the trained model. At inference, the video and query sentence to be processed are input and the segment with the highest candidate score is selected as the retrieved segment.
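The sketch below assembles the multi-task loss of step five with the weights and margins used later in the examples; the hinge forms of the intra-sample and inter-sample losses follow the reconstruction above and the gap loss is passed in as a precomputed term, so the exact formulas remain assumptions.

```python
import torch

def multitask_loss(erase_loss, pos_final_scores, neg_scores,
                   k_en_mismatch_video, k_en_mismatch_query, all_seg_scores,
                   gap_loss, margins=(0.4, 0.6), lambdas=(0.1, 1.0, 0.1, 0.01, 0.01)):
    """Weighted multi-task loss (step five); hinge forms are a reconstruction of the text above.

    pos_final_scores:    (B, T) final candidate scores of the positive segments
    neg_scores:          (B, T) candidate scores of the negative segments (suppression branch)
    k_en_mismatch_*:     (B,) enhancement scores of the unmatched (video, query) negative samples
    all_seg_scores:      (B, M_en) scores of all segments in the 2D segment feature map
    gap_loss:            precomputed gap loss term (formula not reproduced here)
    """
    d_intra, d_inter = margins
    k_en = pos_final_scores.mean(dim=1)                    # enhancement score K_en
    k_sp = neg_scores.mean(dim=1)                          # suppression score K_sp
    loss_intra = torch.clamp(d_intra + k_sp - k_en, min=0).mean()
    loss_inter = (torch.clamp(d_inter - k_en + k_en_mismatch_video, min=0)
                  + torch.clamp(d_inter - k_en + k_en_mismatch_query, min=0)).mean()
    loss_glob = all_seg_scores.mean()                      # keep all segment scores relatively low
    l1, l2, l3, l4, l5 = lambdas
    return l1 * erase_loss + l2 * loss_intra + l3 * loss_inter + l4 * loss_glob + l5 * gap_loss
```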
The invention also provides a system for retrieving the weakly supervised video clips based on the erasure mechanism, which can be referred to as fig. 2 and comprises:
the video preprocessing module is used for acquiring the frame characteristics of the video;
the query statement preprocessing module is used for acquiring the language features of the query statement;
the double-branch visual filtering module is used for obtaining the enhanced modal characteristics and the suppressed modal characteristics of each frame in the video by utilizing the frame characteristics and the language characteristics to form an enhanced video stream and a suppressed video stream;
a dual-branch shared candidate network module comprising an enhanced branch and a suppressed branch; in the enhancement branch, according to the language features of the query statement and the enhancement video stream, aggregating the language features of each frame of the video to obtain an enhancement aggregate text representation of each frame in the video; performing visual-text interaction on the enhanced modal characteristics and the enhanced aggregation text representation of each frame in the video to obtain enhanced language perception frame characteristics of each frame in the video; acquiring the relation between adjacent segments in the video by using the 2D segment characteristic diagram to obtain the cross-modal characteristics of the video segment level and generate an active candidate segment set and candidate scores thereof;
in the suppression branch, generating a negative candidate segment set and a candidate score thereof by adopting the same method as the enhancement branch according to the language characteristics of the query statement and the suppression video stream;
the dynamic erasing module is used for erasing the query sentence in a word level in an enhancement branch of the double-branch sharing candidate network module to obtain the erased enhancement language perception frame characteristics of each frame in the video; calculating the erasure loss by using the erased enhanced language perception frame characteristics;
a candidate score calculation module for calculating final candidate scores, including the final candidate score of each positive candidate segment and the final candidate score of each negative candidate segment; the final candidate score of a positive candidate segment is obtained by the weighted sum of its candidate score after erasure and its candidate score before erasure, and the final candidate score of a negative candidate segment is the candidate score output by the suppression branch of the double-branch shared candidate network module; in this embodiment, the candidate score calculation module is located inside the enhancement branch and the suppression branch and is therefore not shown in fig. 2. In addition, it should be noted that an enhancement score and a suppression score are further computed from the candidate scores of the candidate segments; for the calculation, please refer to the description of step five of the method above, which is not repeated here.
The training module is used for training the double-branch visual filtering module and the double-branch sharing candidate network module based on multi-task loss to obtain a trained model;
and the retrieval module is used for retrieving the query statement and the video to be processed according to the trained model, the video preprocessing module and the query statement preprocessing module, and outputting the segment corresponding to the highest candidate score output by the enhanced branch as a final retrieval result.
In this embodiment, the dual-branch visual filtering module includes:
the accumulated residual calculation module is used for projecting the language features of the query statement to the clustering center corresponding to each language scene and calculating the accumulated residual between the language features and the center vectors;
a cross-modal matching score calculation module for calculating the matching score of the language features of each frame and all scenes in the video according to the accumulated residual;
the global score calculation module is used for screening out the highest score from the matching scores of the language features of each frame and all scenes in the video to be used as a global score, and carrying out normalization processing to obtain the normalization scores of all frames in the video;
and the modal characteristic calculation module is used for calculating to obtain the enhanced modal characteristic and the suppressed modal characteristic corresponding to each frame of the video according to the normalized fraction of each frame of the video, and respectively forming an enhanced video stream and a suppressed video stream.
In this embodiment, the dual-branch sharing candidate network module includes an enhanced branch and a suppressed branch, and parameter sharing between the branches;
the two branches comprise:
the aggregation text representation calculation module is used for obtaining cross-modal attention between each frame and each word in the video according to the language features of the query sentences and the enhancement video stream or the inhibition video stream, and obtaining enhancement aggregation text representation or inhibition aggregation text representation after normalization and cumulative summation;
the language perception frame feature calculation module is used for connecting modal features and aggregate text representations of each frame in the video, and performing visual-text interaction by applying a Bi-GRU network to obtain enhanced language perception frame features or suppressed language perception frame features of each frame in the video;
the cross-modal feature calculation module is used for constructing a 2D fragment feature map, performing two-layer 2D convolution based on the 2D fragment feature map, calculating a matrix relation between adjacent video fragments and obtaining cross-modal features of video fragment levels;
a candidate score calculating module for calculating a candidate score of each video segment, and selecting the T segments with the highest scores as candidate segments; the candidate segment obtained in the enhancement branch is a positive candidate segment, and the candidate segment obtained in the suppression branch is a negative candidate segment.
In this embodiment, the dynamic erase module includes:
the cross-modal attention calculation module is used for weighting and summing the cross-modal attention between each frame in the video and each word in the query sentence to obtain the aggregate text attention between the video and each word in the query sentence;
the query sentence erasing module is used for screening a plurality of words with the highest attention of the aggregated text in the query sentence and replacing the words with mask symbols; obtaining the language features of the erased query statement according to the query statement preprocessing module;
the erasure cross-modal attention calculation module is used for obtaining an erased enhanced video stream according to the language features of the enhanced video stream and the erased query statement; further obtaining cross-modal attention between each frame in the video and the word of the erased query sentence according to the erased enhanced video stream and the language characteristics of the erased query sentence;
the erasure aggregation visual representation calculation module is used for normalizing, accumulating and summing the cross-modal attention between each frame in the video and the erased words of the query statement to obtain the erasure aggregation visual representation of each word in the query statement;
and the erasure loss calculation module is used for calculating the erasure loss.
The implementation of each module in the above description may refer to the description of the method portion, and is not described herein again.
In the embodiments provided by the present invention, it should be understood that the system embodiments described above are merely illustrative. For example, the division into modules such as the dual-branch shared candidate network module is only a logical functional division; in actual implementation there may be another division, for example multiple modules may be combined or integrated into another system, or some features may be omitted or not executed. In addition, the connections between the modules shown or discussed may be communication connections via interfaces, electrical or otherwise.
The method is applied to the following embodiments to achieve the technical effects of the present invention, and detailed steps in the embodiments are not described again.
Examples
The invention was tested in the following three data sets:
ActivityCaption: a dataset containing more than 19k videos with rich content and an average duration of about 2 minutes. The sentence pairs are divided into 37k, 18k and 17k pairs for the training, validation and test sets.
Charades-STA: this dataset is built on the Charades dataset, which contains video-level paragraph descriptions and temporal action annotations. The video-level descriptions are decomposed into segment-level descriptions with a semi-automatic method. The dataset contains about 10k videos of indoor activities with an average duration of about 30 seconds. There are approximately 12k segment-sentence pairs for training and 4k pairs for testing.
DiDeMo: the data set consists of 10k videos, each of 25-30 seconds in duration. It contains approximately 33k segment sentence pairs for training, 4k for verification, and 4k for testing. Each video is divided into six five-second segments in the data set, with the target segment containing one or more consecutive segments.
Evaluation criteria:
R@n, IoU=m is used as the evaluation metric on the ActivityCaption and Charades-STA datasets.
Rank@1, Rank@5 and mIoU are used as the evaluation metrics on DiDeMo.
Data processing:
C3D features are extracted on the ActivityCaption and Charades-STA datasets, and VGG16 and optical-flow features are used on the DiDeMo dataset.
Temporal average pooling with strides of 8 and 4 is used on ActivityCaption and Charades-STA respectively to shorten the feature sequences. For DiDeMo, the average feature of each fixed five-second segment is computed.
Each sentence is decomposed into a list of word tokens, words not in the GloVe vocabulary are removed, and the first MaxSeq tokens are kept; MaxSeq is set to 25, 20 and 20 for ActivityCaption, Charades-STA and DiDeMo respectively. The text feature of each word is extracted with pre-trained 300-d GloVe embeddings.
Setting a model:
The dimensions of W_c and b_c in NetVLAD and of W_m and b_m in the cross-modal interaction unit are set to 512. The dimension of the hidden state in each direction of the Bi-GRU is set to 256, and the dimension of the trainable center vectors is set to 512. In constructing the 2D map, for the DiDeMo dataset every position [a, b] with a ≤ b is filled; for the Charades-STA dataset the positions satisfying a ≤ b and (b − a) mod 2 = 1 are filled; for the ActivityCaption dataset the positions satisfying a ≤ b and (b − a) mod 8 = 0 are filled. The convolution kernel size K is set to 5, 3 and 1 for ActivityCaption, Charades-STA and DiDeMo respectively. In the center-based candidate method, the number T of positive/negative candidate segments is set to 16, 32 and 6 for ActivityCaption, Charades-STA and DiDeMo respectively. λ_1 to λ_5 are set to 0.1, 1, 0.1, 0.01 and 0.01 respectively. Δ_intra and Δ_inter are set to 0.4 and 0.6 respectively. The Adam optimizer is used with an initial learning rate of 1e-4 and weight decay of 1e-7. Non-maximum suppression (NMS) with a threshold of 0.55 is used during inference to select multiple moments.
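During inference, non-maximum suppression over the candidate segments can be implemented as in the short sketch below (IoU threshold 0.55 as stated above); the helper functions are illustrative and not taken from the patent.

```python
def temporal_iou(seg_a, seg_b):
    """IoU between two temporal segments given as (start, end)."""
    inter = max(0.0, min(seg_a[1], seg_b[1]) - max(seg_a[0], seg_b[0]))
    union = max(seg_a[1], seg_b[1]) - min(seg_a[0], seg_b[0])
    return inter / union if union > 0 else 0.0

def temporal_nms(segments, scores, threshold=0.55):
    """Keep the highest-scoring segments, suppressing overlaps above the IoU threshold."""
    order = sorted(range(len(segments)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(temporal_iou(segments[i], segments[j]) <= threshold for j in keep):
            keep.append(i)
    return keep  # indices of retained candidate moments, best first
```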
The experimental results are as follows:
TABLE 1 Performance on dataset ActivityCaption
TABLE 2 Performance on data set Charades-STA
TABLE 3 Performance on data set DiDeMo
Method | Input | Rank@1 | Rank@5 | mIoU |
WSLLN | RGB | 19.30 | 53.00 | 25.30 |
RTPEN (invention) | RGB | 20.19 | 60.38 | 28.22 |
WSLLN | Flow | 18.41 | 54.51 | 27.41 |
RTPEN (invention) | Flow | 18.39 | 54.39 | 27.39 |
TGA | RGB+Flow | 20.90 | 60.17 | 30.99 |
RTPEN (invention) | RGB+Flow | 21.55 | 62.95 | 30.98 |
From the results, the invention shows strong retrieval results on the three datasets, and the proposed RTPEN method achieves nearly the best weakly supervised performance throughout.
In particular at IoU=0.3 on ActivityCaption and IoU=0.5 on Charades-STA, the results show that the dynamic erasing mechanism helps capture the full correspondence between videos and sentences and retrieve more accurate moments; on Rank@5 for DiDeMo the method even exceeds a fully supervised method, verifying the effectiveness of the two-branch framework with the erasing mechanism and the regularization strategy.
On ActivityCaption and Charades-STA, the proposed RTPEN outperforms the reconstruction-based SCN method, which shows that the intra-sample confrontation adopted by the invention can effectively exploit negative samples; the confrontation is fully carried out between moments with similar information, improving video-sentence matching and hence yielding excellent retrieval capability.
The foregoing lists merely illustrate specific embodiments of the invention. It is obvious that the invention is not limited to the above embodiments, but that many variations are possible. All modifications which can be derived or suggested by a person skilled in the art from the disclosure of the present invention are to be considered within the scope of the invention.
Claims (10)
1. A weak supervision video segment retrieval method based on an erasure mechanism is characterized by comprising the following steps:
1) aiming at the video-query statement, acquiring the language characteristic of the query statement and the frame characteristic of the video;
2) constructing a language-aware double-branch visual filter, and obtaining an enhanced modal characteristic and a suppressed modal characteristic of each frame in a video by using the frame characteristic and the language characteristic to form an enhanced video stream and a suppressed video stream;
3) constructing a double-branch sharing candidate network based on a dynamic erasure mechanism, wherein the double-branch sharing candidate network comprises an enhanced branch and a suppressed branch;
in the enhancement branch, according to the language features of the query statement and the enhancement video stream, aggregating the language features of each frame of the video to obtain an enhancement aggregate text representation of each frame in the video; performing visual-text interaction on the enhanced modal characteristics and the enhanced aggregation text representation of each frame in the video to obtain enhanced language perception frame characteristics of each frame in the video; acquiring the relation between adjacent segments in the video by using the 2D segment characteristic diagram to obtain the cross-modal characteristics of the video segment level and generate an active candidate segment set and candidate scores thereof;
in the suppression branch, generating a negative candidate segment set and a candidate score thereof by adopting the same method as the enhancement branch according to the language characteristics of the query statement and the suppression video stream;
4) introducing a dynamic erasing mechanism into the enhancement branch in the step 3) to obtain the erased enhancement language perception frame characteristics of each frame in the video; calculating an erasure loss value by using the erased enhanced language perception frame characteristics, calculating a candidate score of each erased active segment, and performing weighted summation with the candidate scores before erasure to serve as a final candidate score of the active candidate segment;
5) combining the candidate score of the passive candidate segment and the final candidate score of the active candidate segment, training a language-aware double-branch visual filter and a double-branch shared candidate network based on a dynamic erasure mechanism by adopting multitask loss to obtain a trained model;
6) and aiming at the query sentences and videos to be processed, utilizing a trained model and combining the steps 1) to 3), and taking the segment corresponding to the highest candidate score output by the enhanced branch as a final retrieval result.
2. The method according to claim 1, wherein the language-aware dual-branch visual filter in step 2) is specifically:
2.1) each language scene is regarded as a cluster center, the language features of the query sentence Q = {q_i}, 1 ≤ i ≤ n_q, are projected to each cluster center, and the accumulated residual between the language features and a trainable center vector C = {c_j}, 1 ≤ j ≤ n_c, is computed in the NetVLAD manner:
α_i = softmax(W_c q_i + b_c),  u_j = Σ_{i=1..n_q} α_ij (q_i − c_j)
where n_q is the number of words in the query sentence and q_i is the language feature of the i-th word; c_j is the j-th cluster center, n_c is the number of cluster centers, W_c and b_c are the projection matrix and bias of the NetVLAD model; α_i is the coefficient vector relating the language feature of the i-th word to the n_c cluster centers, and α_ij, the j-th element of α_i, is the coefficient associated with the j-th cluster center; u_j is the accumulated residual of the language features with respect to the j-th cluster center and represents the j-th scene-based language feature;
2.2) a cross-modal matching score is computed from the accumulated residuals:
β_ij = σ( w_a · tanh( W_a^v v_i + W_a^u u_j + b_a ) )
the frame features of the video being denoted V = {v_i}, 1 ≤ i ≤ n_v, where n_v is the number of frames in the video and v_i is the frame feature of the i-th frame; W_a^v and W_a^u are projection matrices, b_a is a bias, w_a is a row vector, σ is the sigmoid function and tanh(·) is the tanh function; β_ij ∈ (0,1) is the matching score between the frame feature of the i-th frame and the j-th scene-based language feature;
2.3) from steps 2.1) to 2.2) the matching scores between the frame feature of the i-th frame and all scenes are obtained, and the maximum matching score is taken as the global score:
β̂_i = max_{1≤j≤n_c} β_ij
where β̂_i is the global score of the i-th frame feature over all scenes; the global scores of all frame features in the video are thus obtained, written as {β̂_i}, 1 ≤ i ≤ n_v;
2.4) the global scores are normalized:
β̄_i = ( β̂_i − min_k β̂_k ) / ( max_k β̂_k − min_k β̂_k )
giving the normalized scores of all frames {β̄_i}, 1 ≤ i ≤ n_v, the relation between each frame and the query sentence being expressed by the frame's normalized score;
2.5) the enhanced modal feature and suppressed modal feature of each frame are computed from the normalized global scores:
v_i^en = β̄_i · v_i,   v_i^sp = (1 − β̄_i) · v_i.
3. The method according to claim 1, wherein the candidate network based on the dynamic erasure mechanism for dual-branch sharing in step 3) comprises an enhanced branch and a suppressed branch with the same structure, and the parameters between the branches are shared;
the calculation process of the enhancement branch is as follows:
3.1a) from the language features of the query sentence Q = {q_i}, 1 ≤ i ≤ n_q, and the enhanced video stream V^en = {v_i^en} generated by the dual-branch visual filter, a cross-modal unit aggregates the language features for each frame to obtain the enhanced aggregated text representation of each frame:
δ_ij = w_m · tanh( W_m^v v_i^en + W_m^q q_j + b_m ),  δ̄_ij = exp(δ_ij) / Σ_{k=1..n_q} exp(δ_ik),  h_i^en = Σ_{j=1..n_q} δ̄_ij · q_j
where v_i^en is the enhanced modal feature of the i-th frame in the video, and δ_ij is the cross-modal attention between the i-th frame and the j-th word in the query sentence; W_m^v and W_m^q are projection matrices, b_m is a bias, w_m is a row vector and tanh(·) is the tanh function; δ̄_ij is the normalized cross-modal attention and h_i^en is the enhanced aggregated text representation of the i-th frame;
3.2a) the enhanced modal feature and the enhanced aggregated text representation of each frame are concatenated, and a Bi-GRU network is applied for visual-text interaction to obtain the enhanced language-aware frame feature of each frame:
f_i^en = Bi-GRU([ v_i^en ; h_i^en ])
3.3a) computing segment-level cross-modal features:
the video is divided into segments and a 2D segment feature map is constructed, whose first two dimensions index the start and end frames of a segment and whose third dimension is the fused feature of the segment; the fused feature is obtained by accumulating the enhanced language-aware frame features of the frames in the segment:
F[a, b] = Σ_{i=a..b} f_i^en
two layers of 2D convolution are applied on the 2D segment feature map to model the relation between adjacent segments, giving the segment-level cross-modal features M^en = {m_i^en}, 1 ≤ i ≤ M_en, where M_en is the number of segments in the 2D segment feature map;
3.4a) the candidate score of each segment is computed as:
p_i^en = σ( W_p m_i^en + b_p )
where p_i^en is the candidate score of the i-th video segment, W_p and b_p are the projection matrix and bias, σ is the sigmoid function and m_i^en is the cross-modal feature of the i-th video segment;
the T candidate segments with the highest scores are selected to form the positive candidate segment set {P_i^en}, 1 ≤ i ≤ T, and the candidate score c_i^en of each positive candidate segment is extracted, where P_i^en and c_i^en denote the i-th positive candidate segment and its candidate score;
similarly, in the suppression branch with the same structure, a negative candidate segment set {P_i^sp}, 1 ≤ i ≤ T, and the candidate score c_i^sp of each negative candidate segment are generated from the language features of the query sentence Q = {q_i}, 1 ≤ i ≤ n_q, and the suppressed video stream V^sp generated by the dual-branch visual filter, following the procedure of steps 3.1a) to 3.4a), where P_i^sp and c_i^sp denote the i-th negative candidate segment and its candidate score.
4. The method according to claim 3, wherein the step 4) is specifically as follows:
4.1) obtaining the aggregated text attention between the video and each word in the query sentence by weighted summation of the cross-modal attention, calculated in step 3.1a), between each frame in the video and each word in the query sentence;
4.2) screening n with highest attention of aggregated texts in query statementeIndividual words forming a mask word set W*={wi *},1≤i≤ne, Represents a rounding down, E% represents a mask percentage threshold;
using mask tokens to replace mask word sets W*N in (1)eObtaining the erased query sentence by using the words; obtaining the language characteristic Q of the erased query statement by using the method in the step 1)en*={qi *},1≤i≤nq,qi *The language features of the ith word in the query sentence after erasure are represented;
4.3) enhanced video streams generated from Dual-Branch visual FilterAnd the language feature Q of the erased query statementen*={qi *},1≤i≤nqObtaining the erased enhanced video stream by adopting the method of the step 2) Is the erased enhanced modal characteristics of the ith frame in the video; according to the erased enhanced video stream Ven*And the language feature Q of the erased query statementenObtaining the cross-modal attention between the ith frame in the video and the ith word in the erased query sentence by using the method in the step 3.1a)
4.4) calculating the erasure-aggregated visual representation of each word in the query statement, wherein the calculation formula is as follows:
wherein δ̄_ij* represents the normalized cross-modal attention after erasure, and c_j* represents the erasure-aggregated visual representation of the jth word in the query statement;
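Step 4.4) mirrors the earlier aggregation in the opposite direction: attention is normalized over the frames and used to pool frame features into one visual representation per word. A minimal sketch under the same assumed shapes:

```python
import torch

def erasure_aggregated_visual(frame_feats_erased: torch.Tensor,
                              delta_erased: torch.Tensor) -> torch.Tensor:
    """frame_feats_erased: (n_v, d) erased enhanced modal features.
    delta_erased:       (n_v, n_q) cross-modal attention after erasure.
    Returns (n_q, d): one erasure-aggregated visual representation per word."""
    attn = torch.softmax(delta_erased, dim=0)        # normalize over the frames
    return attn.T @ frame_feats_erased               # (n_q, d)
```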
4.5) calculating the erasure loss:
5. The weakly supervised video segment retrieval method based on the erasure mechanism as recited in claim 4, further comprising:
connecting the erased enhanced modal features and the erased aggregated visual representation of each frame in the video, and performing visual-text interaction by applying a Bi-GRU network to obtain the erased enhanced language-aware frame features of each frame in the video, wherein the calculation formula is as follows:
wherein g_i^{en*} is the erased enhanced language-aware frame feature of the ith frame in the video;
calculating the candidate score of each segment after erasure by the methods of steps 3.3a) to 3.4a), and then computing a weighted sum with the candidate score before erasure to obtain the final candidate score of each positive candidate segment, wherein the calculation formula is as follows:
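The weighted summation producing the final candidate score can be illustrated as below; the weight lam is a hypothetical hyperparameter, since the formula itself is not reproduced in this text.

```python
def final_candidate_score(score_before: float, score_after: float, lam: float = 0.5) -> float:
    """Weighted sum of a positive segment's candidate scores before and after
    erasure; lam is an assumed weighting hyperparameter, not a value from the patent."""
    return lam * score_before + (1.0 - lam) * score_after
```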
6. The method according to claim 1, wherein the multi-task loss in step 5) is the weighted sum of the erasure loss value of step 4) and the intra-sample loss value, the inter-sample loss value, the global loss value, and the gap loss value;
a. the calculation formula of the intra-sample loss value is as follows:
wherein L_intra is the intra-sample loss value, Δ_intra is the margin value, ŝ_i^{en} is the final candidate score of the ith positive candidate segment, s_i^{sp} is the candidate score of the ith negative candidate segment, K_en is the enhancement score, K_sp is the suppression score, and T is the number of candidate segments (a hinge-form illustration of this loss and of the inter-sample loss in item b is sketched after item d below);
b. the calculation formula of the inter-sample loss value is as follows:
wherein L_inter is the inter-sample loss value, Δ_inter is the margin value, and the remaining two terms are the enhancement scores corresponding to the negative sample formed by an unmatched video with the query Q and the negative sample formed by the video V with an unmatched query, respectively;
c. the global loss is calculated by the following formula:
d. the gap loss is calculated by the following formula:
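The exact loss formulas of items a-d are not reproduced in this text. Purely as an illustration of items a and b, the sketch below uses a common hinge (margin ranking) form over the quantities they name; the hinge form, the use of the mean over the T candidate scores as K_en and K_sp, and the margin values are assumptions, not the patented definitions.

```python
import torch

def intra_sample_loss(pos_final_scores: torch.Tensor,
                      neg_scores: torch.Tensor,
                      margin: float = 0.4) -> torch.Tensor:
    """pos_final_scores: (T,) final scores of the positive candidate segments.
    neg_scores:       (T,) scores of the negative candidate segments.
    Assumed hinge over branch-level scores K_en and K_sp (scalar tensors)."""
    k_en = pos_final_scores.mean()     # enhancement score K_en (assumed: mean over T)
    k_sp = neg_scores.mean()           # suppression score K_sp (assumed: mean over T)
    return torch.clamp(margin - k_en + k_sp, min=0.0)

def inter_sample_loss(k_en: torch.Tensor,
                      k_en_neg_video: torch.Tensor,
                      k_en_neg_query: torch.Tensor,
                      margin: float = 0.4) -> torch.Tensor:
    """k_en: enhancement score of the matched pair (V, Q); the other two arguments
    are the enhancement scores of the negative samples of claim 7. Assumed hinge form."""
    return (torch.clamp(margin - k_en + k_en_neg_video, min=0.0)
            + torch.clamp(margin - k_en + k_en_neg_query, min=0.0))
```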
7. The weakly supervised video segment retrieval method based on the erasure mechanism as claimed in claim 6, wherein the two negative samples used in the inter-sample loss are obtained as follows: for each video-query statement pair (V, Q), an unmatched video is randomly selected from the training set and paired with the query Q to compose one negative sample, and an unmatched query is selected and paired with the video V to compose the other negative sample.
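The negative-pair sampling of claim 7 is straightforward; a minimal sketch assuming an indexable training set of (video, query) pairs:

```python
import random

def sample_negative_pairs(dataset, idx):
    """dataset: list of matched (video, query) pairs; idx: index of the current pair.
    Returns the two negative samples described in claim 7: an unmatched video paired
    with the query, and the video paired with an unmatched query (here both drawn
    from a single other pair for simplicity)."""
    video, query = dataset[idx]
    other = random.choice([i for i in range(len(dataset)) if i != idx])
    neg_video, neg_query = dataset[other]
    return (neg_video, query), (video, neg_query)
```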
8. A retrieval system based on the weak supervision video clip retrieval method of claim 1, characterized by comprising:
the video preprocessing module is used for acquiring the frame characteristics of the video;
the query statement preprocessing module is used for acquiring the language features of the query statement;
the dual-branch visual filtering module is used for obtaining the enhanced modal features and the suppressed modal features of each frame in the video from the frame features and the language features, so as to form an enhanced video stream and a suppressed video stream;
a dual-branch shared candidate network module comprising an enhancement branch and a suppression branch; in the enhancement branch, according to the language features of the query statement and the enhanced video stream, the language features are aggregated for each frame of the video to obtain the enhanced aggregated text representation of each frame in the video; visual-text interaction is performed on the enhanced modal feature and the enhanced aggregated text representation of each frame in the video to obtain the enhanced language-aware frame feature of each frame in the video; the relationships between adjacent segments in the video are acquired by using the 2D segment feature map to obtain the cross-modal features at the video segment level and to generate the positive candidate segment set and its candidate scores;
in the suppression branch, a negative candidate segment set and its candidate scores are generated by the same method as in the enhancement branch, according to the language features of the query statement and the suppressed video stream;
the dynamic erasing module is used for erasing the query statement at the word level in the enhancement branch of the dual-branch shared candidate network module to obtain the erased enhanced language-aware frame features of each frame in the video, and for calculating the erasure loss by using the erased enhanced language-aware frame features;
a candidate score calculation module for calculating final candidate scores, comprising a final candidate score for each positive candidate segment and a final candidate score for each negative candidate segment; the final candidate score of a positive candidate segment is obtained by the weighted sum of its candidate score after erasure and its candidate score before erasure, and the final candidate score of a negative candidate segment adopts the candidate score output by the suppression branch of the dual-branch shared candidate network module;
the training module is used for training the dual-branch visual filtering module and the dual-branch shared candidate network module based on the multi-task loss to obtain a trained model;
and the retrieval module is used for performing retrieval on a query statement and a video to be processed according to the trained model, the video preprocessing module, and the query statement preprocessing module, and for outputting the segment corresponding to the highest candidate score output by the enhancement branch as the final retrieval result.
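Purely as an illustration of how the modules of claim 8 might compose at retrieval time, the skeleton below wires the preprocessing, filtering, and candidate stages together; every class and method name here is hypothetical.

```python
class WeaklySupervisedClipRetriever:
    """Illustrative composition of the modules named in claim 8 (names hypothetical)."""

    def __init__(self, video_preproc, query_preproc, dual_branch_filter, candidate_net):
        self.video_preproc = video_preproc            # video preprocessing module
        self.query_preproc = query_preproc            # query statement preprocessing module
        self.dual_branch_filter = dual_branch_filter  # dual-branch visual filtering module
        self.candidate_net = candidate_net            # dual-branch shared candidate network

    def retrieve(self, video, query):
        frames = self.video_preproc(video)                      # frame features
        words = self.query_preproc(query)                       # language features
        enhanced, _suppressed = self.dual_branch_filter(frames, words)
        segments, scores = self.candidate_net.enhancement_branch(enhanced, words)
        best = max(zip(segments, scores), key=lambda x: x[1])   # highest candidate score
        return best[0]                                          # (start, end) of the clip
```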
9. The retrieval system of claim 8, wherein the dual-branch visual filtering module comprises:
the accumulated residual calculation module is used for projecting the language features of the query statement to the clustering center corresponding to each language scene and calculating the accumulated residual between the language features and the center vectors;
a cross-modal matching score calculation module for calculating, according to the accumulated residuals, the matching score between each frame in the video and the language features of every scene;
the global score calculation module is used for screening out, for each frame in the video, the highest of its matching scores with all scenes as the global score, and performing normalization to obtain the normalized scores of all frames in the video;
and the modal feature calculation module is used for calculating the enhanced modal feature and the suppressed modal feature corresponding to each frame of the video according to the normalized score of each frame, so as to respectively form an enhanced video stream and a suppressed video stream.
10. The retrieval system of claim 9, wherein the dual-branch shared candidate network module comprises an enhancement branch and a suppression branch with parameters shared between the two branches;
each of the two branches comprises:
the aggregated text representation calculation module is used for obtaining the cross-modal attention between each frame in the video and each word in the query statement according to the language features of the query statement and the enhanced video stream or the suppressed video stream, and obtaining the enhanced aggregated text representation or the suppressed aggregated text representation after normalization and cumulative summation;
the language-aware frame feature calculation module is used for connecting the modal feature and the aggregated text representation of each frame in the video, and performing visual-text interaction by applying a Bi-GRU network to obtain the enhanced or suppressed language-aware frame features of each frame in the video;
the cross-modal feature calculation module is used for constructing a 2D segment feature map, performing two layers of 2D convolution on the 2D segment feature map, modeling the relationships between adjacent video segments, and obtaining the cross-modal features at the video segment level;
a candidate score calculating module for calculating a candidate score of each video segment, and selecting the T segments with the highest scores as candidate segments; the candidate segment obtained in the enhancement branch is a positive candidate segment, and the candidate segment obtained in the suppression branch is a negative candidate segment.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110272729.4A CN112685597B (en) | 2021-03-12 | 2021-03-12 | Weak supervision video clip retrieval method and system based on erasure mechanism |
Publications (2)
Publication Number | Publication Date |
---|---|
CN112685597A CN112685597A (en) | 2021-04-20 |
CN112685597B true CN112685597B (en) | 2021-07-13 |
Family
ID=75455541
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110272729.4A Active CN112685597B (en) | 2021-03-12 | 2021-03-12 | Weak supervision video clip retrieval method and system based on erasure mechanism |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112685597B (en) |
Families Citing this family (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113254716B (en) * | 2021-05-26 | 2022-05-24 | 北京亮亮视野科技有限公司 | Video clip retrieval method and device, electronic equipment and readable storage medium |
CN113590881B (en) * | 2021-08-09 | 2024-03-19 | 北京达佳互联信息技术有限公司 | Video clip retrieval method, training method and device for video clip retrieval model |
CN113792594B (en) * | 2021-08-10 | 2024-04-12 | 南京大学 | Method and device for locating language fragments in video based on contrast learning |
CN113901847B (en) * | 2021-09-16 | 2024-05-24 | 昆明理工大学 | Neural machine translation method based on source language syntax enhancement decoding |
CN113590874B (en) * | 2021-09-28 | 2022-02-11 | 山东力聚机器人科技股份有限公司 | Video positioning method and device, and model training method and device |
CN113806589B (en) * | 2021-09-29 | 2024-03-08 | 云从科技集团股份有限公司 | Video clip positioning method, device and computer readable storage medium |
CN113963304B (en) * | 2021-12-20 | 2022-06-28 | 山东建筑大学 | Cross-modal video time sequence action positioning method and system based on time sequence-space diagram |
CN115187917B (en) * | 2022-09-13 | 2022-11-25 | 山东建筑大学 | Unmanned vehicle historical scene detection method based on video clip retrieval |
CN115687687B (en) * | 2023-01-05 | 2023-03-28 | 山东建筑大学 | Video segment searching method and system for open domain query |
CN117690191B (en) * | 2024-02-02 | 2024-04-30 | 南京邮电大学 | Intelligent detection method for weak supervision abnormal behavior of intelligent monitoring system |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110991290A (en) * | 2019-11-26 | 2020-04-10 | 西安电子科技大学 | Video description method based on semantic guidance and memory mechanism |
CN112417206A (en) * | 2020-11-24 | 2021-02-26 | 杭州一知智能科技有限公司 | Weak supervision video time interval retrieval method and system based on two-branch proposed network |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US7818430B2 (en) * | 2008-10-15 | 2010-10-19 | Patentvc Ltd. | Methods and systems for fast segment reconstruction |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN112685597B (en) | Weak supervision video clip retrieval method and system based on erasure mechanism | |
Gu et al. | Stack-captioning: Coarse-to-fine learning for image captioning | |
CN108875807B (en) | Image description method based on multiple attention and multiple scales | |
CN110516536B (en) | Weak supervision video behavior detection method based on time sequence class activation graph complementation | |
CN108549658B (en) | Deep learning video question-answering method and system based on attention mechanism on syntax analysis tree | |
CN111414461B (en) | Intelligent question-answering method and system fusing knowledge base and user modeling | |
CN111652357B (en) | Method and system for solving video question-answer problem by using specific target network based on graph | |
CN111581966A (en) | Context feature fusion aspect level emotion classification method and device | |
CN112784929B (en) | Small sample image classification method and device based on double-element group expansion | |
CN111460883B (en) | Video behavior automatic description method based on deep reinforcement learning | |
CN110321805B (en) | Dynamic expression recognition method based on time sequence relation reasoning | |
CN108959512B (en) | Image description network and technology based on attribute enhanced attention model | |
CN110852066B (en) | Multi-language entity relation extraction method and system based on confrontation training mechanism | |
CN116304984A (en) | Multi-modal intention recognition method and system based on contrast learning | |
CN115858726A (en) | Multi-stage multi-modal emotion analysis method based on mutual information method representation | |
Xue et al. | Lcsnet: End-to-end lipreading with channel-aware feature selection | |
Tong et al. | Automatic error correction for speaker embedding learning with noisy labels | |
CN117033961A (en) | Multi-mode image-text classification method for context awareness | |
CN117033558A (en) | BERT-WWM and multi-feature fused film evaluation emotion analysis method | |
CN116541507A (en) | Visual question-answering method and system based on dynamic semantic graph neural network | |
CN116245115A (en) | Video content description method based on concept parser and multi-modal diagram learning | |
CN116975347A (en) | Image generation model training method and related device | |
CN115830401A (en) | Small sample image classification method | |
CN114781356A (en) | Text abstract generation method based on input sharing | |
CN113283520A (en) | Member reasoning attack-oriented depth model privacy protection method and device based on feature enhancement |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||