CN112685597B

CN112685597B - Weak supervision video clip retrieval method and system based on erasure mechanism

Info

Publication number: CN112685597B
Application number: CN202110272729.4A
Authority: CN
Inventors: 李昊沅; 周楚程
Original assignee: Hangzhou Yizhi Intelligent Technology Co ltd
Current assignee: Hangzhou Yizhi Intelligent Technology Co ltd
Priority date: 2021-03-12
Filing date: 2021-03-12
Publication date: 2021-07-13
Anticipated expiration: 2041-03-12
Also published as: CN112685597A

Abstract

The invention discloses a weak supervision video clip retrieval method and system based on an erasing mechanism, and belongs to the field of video clip retrieval. Aiming at a video-query statement, respectively acquiring language features and frame features; constructing a language-aware dual-branch visual filter to generate an enhanced video stream and a suppressed video stream; constructing a double-branch sharing candidate network based on a dynamic erasure mechanism, and generating an active candidate segment and a passive candidate segment; introducing a dynamic erasure mechanism into an enhancement branch of the candidate network, and calculating an enhancement score and a suppression score; training a language-aware double-branch visual filter and a double-branch shared candidate network based on a dynamic erasure mechanism by adopting multi-task loss to obtain a trained model; and aiming at the query sentences and videos to be processed, using the trained model to take the segment corresponding to the highest candidate score output by the enhanced branch as the final retrieval result. The invention enhances the matching capability of the video sentences and improves the performance of video retrieval.

Description

Weak supervision video clip retrieval method and system based on erasure mechanism

Technical Field

The invention relates to the field of video segment retrieval, in particular to a weak supervision video segment retrieval method and system based on an erasing mechanism.

Background

Video clip retrieval is a new topic in information retrieval systems that integrate computer vision and natural language processing. Given an untrimmed video and a natural language description, the purpose of video clip retrieval is to locate temporal boundaries that match the semantics of the target clip. However, most existing methods are trained in a fully supervised environment. Such manual annotation is very expensive and time consuming, especially for ambiguous descriptions.

Existing weakly supervised methods typically employ MIL-based or reconstruction-based methods to train the weakly supervised positioning network. Both of these methods have some drawbacks. The former trains potential visual text matching through inter-sample loss by defining some initial visual language pairs as positive samples, constructing unmatched language visual pairs as negative samples. However, the method has high quality requirements on randomly selected negative samples, and the samples with low quality are easy to identify and cannot provide strong supervision signals. Reconstruction-based approaches, on the other hand, attempt to reconstruct the query statement from the visual content in training and locate candidate targets during reasoning using intermediate results such as attention weights. These methods do not directly optimize the visual text matching score used for reasoning. Because the candidate with higher attention weight does not necessarily have higher association with the query sentence, such indirect optimization limits the performance of the model, and therefore, the existing weak supervision method has at least the following problems:

1) a high-quality negative sample is required, a low-quality sample is easy to identify and cannot provide a strong supervision signal;

2) visual text matching scores used for reasoning cannot be directly optimized, candidates with high attention weights do not necessarily have high relevance to problem statements, and indirect optimization limits the performance of the model.

Erasure is an effective data enhancement method to suppress overfitting and enhance model robustness, and conventional erasure methods are generally used in images, randomly select regions in the image, replace their pixels with 0 or average values of the image, generate a large number of new images for training, but erasure of video images has limited ability to improve video-sentence matching. The method provides a novel regularized double-branch candidate network with an erasing mechanism, a fine-grained in-sample countermeasure is constructed by finding out credible negative candidate moments, and a more complete visual-text relation is captured by paying attention to guided dynamic erasing.

Disclosure of Invention

In order to overcome the defect that in the prior art, only inter-sample countermeasures are usually concerned, and intra-sample countermeasures are ignored, so that a correct result is difficult to select from plausible candidate segments; and in the prior art, the focused video-sentence pairs are concentrated on a plurality of dominant words, the global situation is ignored, samples which do not appear in the training data and are not trained can not be positioned easily, higher accuracy can be obtained only in the training data set, and the practical applicability is poor. The invention provides a weak supervision video segment retrieval method and system based on an erasing mechanism, which can efficiently and accurately retrieve video segments.

According to the invention, a double-branch candidate module is constructed, two branches adopt the same structure, and parameters are shared between the branches, so that the model is lighter and more robust; by constructing a dynamic erasing mechanism, words with higher occupation ratio in the query sentence are erased, the matching capability of the video sentence is enhanced, and the performance of video retrieval is improved.

In order to achieve the purpose, the invention adopts the following specific technical scheme:

one of the purposes of the present invention is to provide a weak supervised video segment retrieval method based on an erasure mechanism, which includes the following steps:

1) aiming at the video-query statement, acquiring the language characteristic of the query statement and the frame characteristic of the video;

2) constructing a language-aware double-branch visual filter, and obtaining an enhanced modal characteristic and a suppressed modal characteristic of each frame in a video by using the frame characteristic and the language characteristic to form an enhanced video stream and a suppressed video stream;

3) constructing a double-branch sharing candidate network based on a dynamic erasure mechanism, wherein the double-branch sharing candidate network comprises an enhanced branch and a suppressed branch;

in the enhancement branch, according to the language features of the query statement and the enhancement video stream, aggregating the language features of each frame of the video to obtain an enhancement aggregate text representation of each frame in the video; performing visual-text interaction on the enhanced modal characteristics and the enhanced aggregation text representation of each frame in the video to obtain enhanced language perception frame characteristics of each frame in the video; acquiring the relation between adjacent segments in the video by using the 2D segment characteristic diagram to obtain the cross-modal characteristics of the video segment level and generate an active candidate segment set and candidate scores thereof;

in the suppression branch, generating a negative candidate segment set and a candidate score thereof by adopting the same method as the enhancement branch according to the language characteristics of the query statement and the suppression video stream;

4) introducing a dynamic erasing mechanism into the enhancement branch in the step 3) to obtain the erased enhancement language perception frame characteristics of each frame in the video; calculating erasure loss by using the erased enhanced language perception frame characteristics, calculating a candidate score of each active segment after erasure, and performing weighted summation with the candidate scores before erasure to serve as a final candidate score of the active candidate segment;

5) combining the candidate score of the passive candidate segment and the final candidate score of the active candidate segment, training a language-aware double-branch visual filter and a double-branch shared candidate network based on a dynamic erasure mechanism by adopting multitask loss to obtain a trained model;

6) and aiming at the query sentences and videos to be processed, utilizing a trained model and combining the steps 1) to 3), and taking the segment corresponding to the highest candidate score output by the enhanced branch as a final retrieval result.

Another object of the present invention is to provide a system for retrieving a weakly supervised video segment based on the above method, which includes:

the video preprocessing module is used for acquiring the frame characteristics of the video;

the query statement preprocessing module is used for acquiring the language features of the query statement;

the double-branch visual filtering module is used for obtaining the enhanced modal characteristics and the suppressed modal characteristics of each frame in the video by utilizing the frame characteristics and the language characteristics to form an enhanced video stream and a suppressed video stream;

a dual-branch shared candidate network module comprising an enhanced branch and a suppressed branch; in the enhancement branch, according to the language features of the query statement and the enhancement video stream, aggregating the language features of each frame of the video to obtain an enhancement aggregate text representation of each frame in the video; performing visual-text interaction on the enhanced modal characteristics and the enhanced aggregation text representation of each frame in the video to obtain enhanced language perception frame characteristics of each frame in the video; acquiring the relation between adjacent segments in the video by using the 2D segment characteristic diagram to obtain the cross-modal characteristics of the video segment level and generate an active candidate segment set and candidate scores thereof;

the dynamic erasing module is used for erasing the query sentence in a word level in an enhancement branch of the double-branch sharing candidate network module to obtain the erased enhancement language perception frame characteristics of each frame in the video; calculating the erasure loss by using the erased enhanced language perception frame characteristics;

a candidate score calculation module for calculating a final candidate score comprising a final candidate score for a positive candidate segment and a final candidate score for a negative candidate segment; the final candidate score of the extreme candidate segment is obtained by weighting and summing the candidate score of each positive segment after erasure and the candidate score before erasure, and the final candidate score of the negative candidate segment adopts a candidate score for inhibiting branch output by a double-branch sharing candidate network module;

the training module is used for training the double-branch visual filtering module and the double-branch sharing candidate network module based on multi-task loss to obtain a trained model;

and the retrieval module is used for retrieving the query statement and the video to be processed according to the trained model, the video preprocessing module and the query statement preprocessing module, and outputting the segment corresponding to the highest candidate score output by the enhanced branch as a final retrieval result.

Compared with the prior art, the invention has the advantages that:

(1) according to the video segment retrieval method based on weak supervision, only video-level sentence annotation is required to be provided in the training process, and alignment annotation is not required to be carried out on each sentence, so that the annotation cost and time are greatly reduced;

(2) compared with the majority of multi-Instance Learning-based methods (Multiple Instance Learning) which mainly focus on inter-sample confrontation to judge the result, the method fully performs intra-sample confrontation between the moments with similar information in the video, and can select the correct result from plausible candidate segments;

(3) the invention introduces a dynamic erasure mechanism of antagonism in the double-branch candidate network training, covers the word set with the highest text attention in the query sentence, utilizes the enhanced video stream and the language features corresponding to the erased query sentence for interaction, fully considers global information, and performs weighted summation of the obtained candidate scores and the candidate scores obtained by the original enhanced branch to be the final scores of the active candidate segments, which is beneficial to improving the matching of video sentences, not only can improve the training effect in the training set, but also can be migrated to practical application, and improves the retrieval performance.

Drawings

FIG. 1 is a schematic flow diagram of the process of the present invention;

FIG. 2 is a schematic diagram of the overall architecture of the present invention.

Detailed Description

The invention will be further elucidated and described with reference to the drawings and the detailed description.

As shown in fig. 1, the method for retrieving a weakly supervised video segment based on an erasure mechanism provided by the present invention mainly includes the following steps:

the method comprises the steps of firstly, aiming at a video-query statement, obtaining language features of the query statement and frame features of a video;

secondly, a language-aware double-branch visual filter is constructed, and the enhanced modal characteristics and the suppressed modal characteristics of each frame in the video are obtained by utilizing the frame characteristics and the language characteristics to form an enhanced video stream and a suppressed video stream;

step three, constructing a double-branch sharing candidate network based on a dynamic erasure mechanism, wherein the double-branch sharing candidate network comprises an enhanced branch and a suppressed branch;

step four, introducing a dynamic erasing mechanism into the enhancement branch in the step three to obtain the erased enhancement language perception frame characteristics of each frame in the video; calculating erasure loss by using the erased enhanced language perception frame characteristics, calculating a candidate score of each active segment after erasure, and performing weighted summation with the candidate scores before erasure to serve as a final candidate score of the active candidate segment;

combining the candidate scores of the passive candidate segments and the final candidate scores of the active candidate segments, and training a language-aware double-branch visual filter and a double-branch shared candidate network based on a dynamic erasure mechanism by adopting multi-task loss to obtain a trained model;

and step six, aiming at the query sentences and videos to be processed, using the trained model and combining the steps one to three, and taking the segment corresponding to the highest candidate score output by the enhanced branch as a final retrieval result.

In one embodiment of the present invention, step one is described as follows:

aiming at a query sentence, extracting character features in the query sentence through a pre-trained Glove model, inputting the character features into a Bi-GRU to learn word semantic representation, and obtaining language features Q of the query sentence as { Q ═ Q_i},1≤i≤n_qWherein n is_qIs the number of words in the query sentence, q_iIs the linguistic feature of the ith word in the query statement;

aiming at a section of video, extracting video features through a pre-trained video feature extractor, and then shortening the length of a video sequence by utilizing time-average pooling to obtain frame features V ═ { V ═ V_i},1≤i≤n_vWherein n is_vIs the number of frames in the video, v_iThe frame characteristics of the ith frame in the video. In this embodiment, the pre-trained video feature extractionThe extractor may employ a C3D feature extractor.

In one embodiment of the present invention, step two is described as follows:

in step two, a language-aware dual-branch visual filter is constructed, and the frame feature V and the language feature Q are used for generating the enhanced video stream

And suppressing video streams

Wherein

Is the enhanced modal feature corresponding to the ith frame in the video,

is the corresponding suppressed modal feature of the ith frame in the video.

Specifically, the method comprises the following steps:

2.1) regarding each language scene as a cluster center, and setting the language feature Q as { Q ═ Q }_i},1≤i≤n_qProjected to each cluster center, calculating language characteristic Q ═ Q { Q } by NetVLAD model_i},1≤i≤n_qAnd a trainable center vector C ═ { C ═ C_j},1≤j≤n_cThe accumulated residual error between the two is calculated by the formula:

wherein, c_jIs the jth cluster center, n_cIs the number of cluster centers, W^cAnd b^cProjection matrix and bias of NetVLAD model, respectively;

linguistic features representing the ith word and n_cCoefficient vector, alpha, related to the center of each cluster_ijIs alpha_iThe j-th element in (1) representsA coefficient associated with the jth cluster center; u. of_jIs the language feature Q ═ Q_i},1≤i≤n_qThe accumulated residual error of the jth cluster center is used for representing the jth scene-based language feature;

2.2) calculating a cross-modal matching score from the accumulated residuals:

wherein,

and

is a projection matrix, b^aIs a bias that is a function of the bias,

is a row vector, T denotes a transpose, σ is a sigmoid function, tanh (·) is a tanh function; beta is a_ijE (0,1) is the matching score of the frame characteristics of the ith frame and the jth scene-based language characteristics in the video;

2.3) obtaining the matching scores between the frame characteristics of the ith frame in the video and all scenes according to the steps 2.1) to 2.2), and taking the maximum matching score as a global score

The calculation formula is as follows:

wherein,

representing the global scores of all scenes corresponding to the frame characteristics of the ith frame in the video; further obtaining the global scores of the scenes corresponding to all the frame characteristics in the video, and expressing the global scores as

1≤i≤n_v；

2.4) normalizing the global score, wherein the calculation formula is as follows:

wherein,

represents the minimum value of the global score of all frame features in the video corresponding to the scene,

represents the maximum value of the global score of the scene corresponding to all the frame features in the video.

Through the steps, the normalized fraction of all the frames is obtained

1≤i≤n_vExpressing the relation between each frame and the query statement by using the corresponding normalized score of the frame;

2.5) respectively calculating the enhanced modal characteristics and the suppressed modal characteristics corresponding to each frame in the video according to the normalized global score, wherein the calculation formula is as follows:

wherein,

is the enhanced modal feature of the ith frame in the video, all the frames together form the enhanced video stream

Enhancing the important frames according to the normalized fraction and weakening the non-important frames;

is the suppression modal characteristic of the ith frame in the video, and all the frames form the suppression video stream together

In contrast to enhancing the video stream, the suppression video stream weakens the important frames and enhances the non-important frames.

In one embodiment of the present invention, step three is described as follows:

in the third step, the established dual-branch sharing candidate module based on the dynamic erasure mechanism comprises an enhanced branch and a suppressed branch with the same structure, and parameters are shared among the branches;

the processing flows of the enhanced branch and the suppressed branch are similar, and only the calculation process of the enhanced branch is specifically described below, including:

3.1a) language feature Q ═ { Q) from query statement_i},1≤i≤n_qEnhanced video stream generated with a dual-branch visual filter

And aggregating the language features of each frame by using a cross-modal unit to obtain an enhanced aggregate text representation of each frame in the video, wherein the calculation formula is as follows:

wherein, delta_ijRepresenting cross-modal attention between the ith frame in the video and the jth word in the query statement;

and

is a projection matrix, b^mIs a bias that is a function of the bias,

is a row vector, T represents a transpose, tanh (·) is a tanh function;

the cross-modal attention after normalization is represented,

is an enhanced aggregate text representation of the ith frame in the video;

3.2a) connecting the enhanced modal characteristics and the enhanced aggregation text representation of each frame in the video, and performing visual-text interaction by applying a Bi-GRU network to obtain the enhanced language perception frame characteristics of each frame in the video, wherein the calculation formula is as follows:

wherein,

is an enhanced language aware frame feature of the ith frame in the video;

3.3a) calculating cross-modal features at video segment level:

dividing video into segments and constructing 2D segment feature map

The first two dimensions represent the beginning and ending frames of a segment, and the third dimension is the fusion feature of the segment; the calculation formula of the fusion characteristics is as follows:

wherein a and b are respectively a start frame and an end frame of the video segment, F [ a, b ]; is a fusion characteristic of the fragment [ a, b ];

based onThe 2D segment characteristic graph is subjected to two-layer 2D convolution, the size of a convolution kernel can be selected according to actual conditions to calculate the matrix relation between adjacent segments, and cross-modal characteristics of the video segment level are obtained

Wherein M is_enIs the number of fragments in the 2D fragment profile;

3.4a) calculating the candidate score of each segment according to the following formula:

in the formula,

is the candidate score of the ith video segment, W^pAnd b^pIs the projection matrix and the offset, sigma is the sigmoid function,

is the cross-modal feature of the ith video segment;

selecting T candidate segments with highest scores to form an active candidate segment set

And extracting a candidate score for each active candidate segment

Wherein

And

respectively representing the ith positive candidate segment and the corresponding candidate score;

similarly, the same suppression branch is used, and the language feature Q of the query sentence is { Q ═ Q { (Q) }_i},1≤i≤n_qAnd double branchesVisual filter generated suppressed video streams

Generating a set of negative candidate segments according to the procedure of step 3.1a) to step 3.4a)

And a candidate score for each passive candidate segment

Wherein, therein

And

respectively, the ith negative candidate segment and the corresponding candidate score.

In one embodiment of the present invention, step four is described as follows:

4.1) calculating the attention of the aggregated text by utilizing the cross-modal attention between each frame in the video and each word in the query sentence calculated in the step 3.1 a):

wherein,

representing aggregate textual attention between the video and the jth word in the query sentence;

4.2) screening n with highest attention of aggregated texts in query statement_eIndividual words forming a mask word set W^*＝{w_i ^*},1≤i≤n_e，

Represents a rounding down, E% represents a mask percentage threshold;

substituting mask character 'Unknown' for mask word set W^*N in (1)_eObtaining the erased query sentence by using the words; obtaining the language characteristic Q of the erased query statement by using the method in the step one^en*＝{q_i ^*},1≤i≤n_q；

4.3) enhanced video streams generated from Dual-Branch visual Filter

And the language feature Q of the erased query statement^en*＝{q_i ^*},1≤i≤n_qObtaining the erased enhanced video stream by adopting the method of the step two

Is the erased enhanced modal characteristics of the ith frame in the video; according to the erased enhanced video stream V^en*And the language feature Q of the erased query statement^enObtaining the cross-modal attention between the ith frame in the video and the ith word in the erased query sentence by using the method in the step 3.1a)

4.4) calculating the erasure aggregation visual representation of each word in the query sentence, wherein the calculation formula is as follows:

wherein,

representing the normalized cross-modal attention,

an erasure-aggregated visual representation representing a jth word in the query statement;

4.5) calculating the erasure loss:

wherein,

representing erase loss, s being an intermediate variable, W_eIn order to be a projection matrix, the projection matrix,

is the erased enhanced speech perception frame characteristic of the ith frame in the video;

further, after the step 4.4), the method further comprises the following steps:

connecting the erased enhanced modal characteristics and the erased aggregate visual representation of each frame in the video, and performing visual-text interaction by applying a Bi-GRU network to obtain the erased enhanced language-aware frame characteristics of each frame in the video, wherein the calculation formula is as follows:

wherein,

calculating the candidate score of each erased fragment by adopting the methods from the step 3.3a) to the step 3.4a)

And then carrying out weighted summation with the candidate scores before erasure to obtain the final candidate score of the active candidate segment, wherein the calculation formula is as follows:

wherein, w^cIs a learnable parameter, the initial value is 0.5;

is the final candidate score for the ith positive candidate segment.

In one embodiment of the present invention, step five is described as follows:

the multi-task loss adopted by the invention is the weighted sum of an erasure loss value, an inter-sample loss value, an intra-sample loss value, a global loss value and a gap loss value;

a. the calculation formula of the loss value between samples is as follows:

wherein,

is the value of the loss between samples, Δ_intraIs the value of the margin at which,

is the final candidate score for the ith positive candidate segment,

is the candidate score of the ith negative candidate segment, K^enIs the enhancement score, K^spIs the inhibition score, and T is the number of candidate segments.

b. The calculation formula of the loss value in the sample is as follows:

wherein,

is the value of the loss in the sample, Δ_interIs the value of the margin at which,

is a negative sample

The corresponding enhancement score is calculated based on the corresponding enhancement score,

is a negative sample

A corresponding enhancement score;

negative samples

And negative samples

The acquisition method comprises the following steps: for each pair of video-query statements (V, Q), an unmatched video is randomly selected from the training set

Composing passive samples

Selecting unmatched queries

And video V form another negative sample

c. The global loss is calculated by the following formula:

wherein,

is a global penalty value for keeping the score of each candidate segment relatively low again, M_enIs the number of fragments in the 2D fragment profile.

d. The gap loss is calculated by the following formula:

wherein,

is a gap loss value used to extend the score gap between active segment candidates.

Building multitask losses

Wherein λ_CIs a hyper-parameter used to control various losses. And training by utilizing multitask loss to obtain a trained model. When the method is used, the video and the text to be inferred are input, and the segment with the highest candidate score is selected as the required segment.

The invention also provides a system for retrieving the weakly supervised video clips based on the erasure mechanism, which can be referred to as fig. 2 and comprises:

a candidate score calculation module for calculating a final candidate score comprising a final candidate score for a positive candidate segment and a final candidate score for a negative candidate segment; the final candidate score of the extreme candidate segment is obtained by weighting and summing the candidate score of each positive segment after erasure and the candidate score before erasure, and the final candidate score of the negative candidate segment adopts a candidate score for inhibiting branch output by a double-branch sharing candidate network module; in this embodiment, the candidate score calculation module is located inside the enhancement branch and the suppression branch, and therefore is not shown in fig. 2. In addition, it should be noted that each candidate segment further needs to calculate an enhancement score and a suppression score according to the obtained candidate score, and please refer to the description of step five of the method part above for the calculation formula, which is not repeated here.

In this embodiment, the dual-branch visual filtering module includes:

the accumulated residual calculation module is used for projecting the language features of the query statement to the clustering center corresponding to each language scene and calculating the accumulated residual between the language features and the center vectors;

a cross-modal matching score calculation module for calculating the matching score of the language features of each frame and all scenes in the video according to the accumulated residual;

the global score calculation module is used for screening out the highest score from the matching scores of the language features of each frame and all scenes in the video to be used as a global score, and carrying out normalization processing to obtain the normalization scores of all frames in the video;

and the modal characteristic calculation module is used for calculating to obtain the enhanced modal characteristic and the suppressed modal characteristic corresponding to each frame of the video according to the normalized fraction of each frame of the video, and respectively forming an enhanced video stream and a suppressed video stream.

In this embodiment, the dual-branch sharing candidate network module includes an enhanced branch and a suppressed branch, and parameter sharing between the branches;

the two branches comprise:

the aggregation text representation calculation module is used for obtaining cross-modal attention between each frame and each word in the video according to the language features of the query sentences and the enhancement video stream or the inhibition video stream, and obtaining enhancement aggregation text representation or inhibition aggregation text representation after normalization and cumulative summation;

the language perception frame feature calculation module is used for connecting modal features and aggregate text representations of each frame in the video, and performing visual-text interaction by applying a Bi-GRU network to obtain enhanced language perception frame features or suppressed language perception frame features of each frame in the video;

the cross-modal feature calculation module is used for constructing a 2D fragment feature map, performing two-layer 2D convolution based on the 2D fragment feature map, calculating a matrix relation between adjacent video fragments and obtaining cross-modal features of video fragment levels;

a candidate score calculating module for calculating a candidate score of each video segment, and selecting the T segments with the highest scores as candidate segments; the candidate segment obtained in the enhancement branch is a positive candidate segment, and the candidate segment obtained in the suppression branch is a negative candidate segment.

In this embodiment, the dynamic erase module includes:

the cross-modal attention calculation module is used for weighting and summing the cross-modal attention between each frame in the video and each word in the query sentence to obtain the aggregate text attention between the video and each word in the query sentence;

the query sentence erasing module is used for screening a plurality of words with the highest attention of the aggregated text in the query sentence and replacing the words with mask symbols; obtaining the language features of the erased query statement according to the query statement preprocessing module;

the erasure cross-modal attention calculation module is used for obtaining an erased enhanced video stream according to the language features of the enhanced video stream and the erased query statement; further obtaining cross-modal attention between each frame in the video and the word of the erased query sentence according to the erased enhanced video stream and the language characteristics of the erased query sentence;

the erasure aggregation visual representation calculation module is used for normalizing, accumulating and summing the cross-modal attention between each frame in the video and the erased words of the query statement to obtain the erasure aggregation visual representation of each word in the query statement;

and the erasure loss calculation module is used for calculating the erasure loss.

The implementation of each module in the above description may refer to the description of the method portion, and is not described herein again.

In the embodiments provided in the present invention, it should be understood that the above-described system embodiments are merely illustrative, and for example, the dual-branch shared candidate network module may be a logical functional partition, and may have another partition in actual implementation, for example, multiple modules may be combined or may be integrated into another system, or some features may be omitted, or not executed. In addition, the connections between the modules shown or discussed may be communication connections via interfaces, electrical or otherwise.

The method is applied to the following embodiments to achieve the technical effects of the present invention, and detailed steps in the embodiments are not described again.

Examples

The invention was tested in the following three data sets:

ActivityCaption A data set containing more than 19k of video is rich in content with an average duration of about 2 minutes. And dividing the sentence into 37k, 18k and 17k pairs as training verification and test sets.

Charrades-STA: the data set is built on a Charades data set, which contains video level paragraph descriptions and temporal action annotations. The description at the video level is decomposed into descriptions at the segment level using a semi-automatic method. The data set contains 10k of live video in the room, which are on average about 30 seconds in duration. There are approximately 12k pairs of segment statements for training and 4k pairs for testing.

DiDeMo: the data set consists of 10k videos, each of 25-30 seconds in duration. It contains approximately 33k segment sentence pairs for training, 4k for verification, and 4k for testing. Each video is divided into six five-second segments in the data set, with the target segment containing one or more consecutive segments.

Evaluation criteria:

r @ n, IoU ═ m was used as an evaluation index on ActivityCaption and charads-STA datasets.

Rank @1, Rank @5 and mlou were used as evaluation indices of the DiDeMo.

Data processing:

C3D features were extracted on ActivityCaption and chardes-STA datasets and VGG16 and visual flow features were applied on the didymo dataset.

Time-averaged pooling with steps of 8 and 4 was used at activityCaption and Charides-STA, respectively, to shorten the signature sequence. The average signature of the five second segments fixed in the DiDeMo was calculated.

Each sentence is decomposed into words and a tone list, the vocabulary which does not belong to Glove is removed, the first MaxSeq element is obtained, and MaxSeq elements of ActivityCaption, Chardes-STA and DiDeMo are respectively set to be 25, 20 and 20. And extracting text features of each word by using pre-trained 300-d Glove coding.

Setting a model:

cross-modal interaction unit with Wc bc in NetVLAD

And b^mSet to 512. The dimension of the hidden state for each direction in the Bi-GRU is set to 256 and the dimension of the trainable center vector is set to 512. In the construction of the 2D map, for the data set DiDeMo, for each position of a ≦ b [ a, b]Will be filled. For a dataset, Chardes-STA, the positions that satisfy a ≦ b and (b-a) mod 2 ≦ 1 will be filled. For the dataset activityCaption, the locations that satisfy a ≦ b and (b-a) mod 8 ≦ 0 will be filled. The convolution kernel size K applied to ActivityCaption, Charads-STA and DiDeMo is set to 5, 3, 1, respectively. In the center-based candidate method, the number T of active/passive candidate segments found by ActivityCaption, chardes-STA, and didymo is set to 16, 32, 6, respectively. Will be lambda₁-λ₅Set to 0.1, 1, 0.1, 0.01 and 0.01, respectively. Delta_intraAnd Δ_interSet to 0.4 and 0.6, respectively. Using an initial learning rate of 10^-4Weight attenuation of 10^-7Adam optimizer of (1). A threshold of 0.55 non-maximum suppression (NMS) is used in the inference process to select multiple times.

The experimental results are as follows:

TABLE 1 Performance on dataset ActivityCaption

TABLE 2 Performance on data set Chardes-STA

TABLE 3 Performance on data set DiDeMo

Method	Input	Rank@1	Rank@5	mIoU
					WSLLN	RGB	19.30	53.00	25.30
RTPEN (invention)	RGB	20.19	60.38	28.22
					WSLLN	Flow	18.41	54.51	27.41
RTPEN (invention)	Flow	18.39	54.39	27.39
					TGA	RGB+Flow	20.90	60.17	30.99
RTPEN (invention)	RGB+Flow	21.55	62.95	30.98

From the results, the invention shows the optimal retrieval results on three data sets, and the RTPEN method provided by the invention almost achieves the optimal weak supervision performance.

IoU-0.3 and IoU-0.5 are respectively set on activityCaption and Charads-STA, which shows that the dynamic erasing mechanism of the invention is helpful for capturing the correspondence of comprehensive videos and sentences and retrieving more accurate time, especially on the performance of Rank @5 in DiDeMo, even exceeding a fully supervised method, and verifies the effectiveness of two branch frameworks with the erasing mechanism and the regularization strategy.

On activityCaption and Charads-STA, the RTPEN of the invention is superior to the SCN method based on reconstruction, which shows that the intra-sample countermeasure mode adopted by the invention can effectively utilize negative samples, and the intra-sample countermeasure is fully carried out between the moments of similar information, so that the matching capability of video-sentences is improved, and therefore, excellent retrieval capability is obtained.

The foregoing lists merely illustrate specific embodiments of the invention. It is obvious that the invention is not limited to the above embodiments, but that many variations are possible. All modifications which can be derived or suggested by a person skilled in the art from the disclosure of the present invention are to be considered within the scope of the invention.

Claims

1. A weak supervision video segment retrieval method based on an erasure mechanism is characterized by comprising the following steps:

4) introducing a dynamic erasing mechanism into the enhancement branch in the step 3) to obtain the erased enhancement language perception frame characteristics of each frame in the video; calculating an erasure loss value by using the erased enhanced language perception frame characteristics, calculating a candidate score of each erased active segment, and performing weighted summation with the candidate scores before erasure to serve as a final candidate score of the active candidate segment;

2. The method according to claim 1, wherein the language-aware dual-branch visual filter in step 2) is specifically:

2.1) regarding each language scene as a cluster center, and setting the language feature Q of the query sentence as Q_i},1≤i≤n_qProjected to each cluster center, calculating language characteristic Q ═ Q { Q } by NetVLAD model_i},1≤i≤n_qAnd a trainable center vector C ═ { C ═ C_j},1≤j≤n_cThe accumulated residual error between the two is calculated by the formula:

wherein n is_qIs the number of words in the query sentence, q_iIs the linguistic feature of the ith word in the query statement; c. C_jIs the jth cluster center, n_cIs the number of cluster centers, W^cAnd b^cProjection matrix and bias of NetVLAD model, respectively;

linguistic features representing the ith word and n_cCoefficient vector, alpha, related to the center of each cluster_ijIs alpha_iThe jth element in (a), representing a coefficient associated with the jth cluster center; u. of_jThe accumulated residual error of the language feature to the jth clustering center is used for representing the jth scene-based language feature;

2.2) calculating a cross-modal matching score from the accumulated residuals:

representing a frame feature of a video as V ═ V_i},1≤i≤n_vWherein n is_vIs the number of frames in the video, v_iThe frame characteristics of the ith frame in the video are obtained;

and

is a projection matrix, b^aIs a bias that is a function of the bias,

The calculation formula is as follows:

wherein,

obtaining normalized scores for all frames

Expressing the relation between each frame and the query statement by using the corresponding normalized score of the frame;

wherein,

3. The method according to claim 1, wherein the candidate network based on the dynamic erasure mechanism for dual-branch sharing in step 3) comprises an enhanced branch and a suppressed branch with the same structure, and the parameters between the branches are shared;

the calculation process of the enhancement branch is as follows:

wherein,

is the enhanced modal feature, δ, corresponding to the ith frame in the video_ijRepresenting videoCross-modal attention between the ith frame and the jth word in the query statement;

and

is a projection matrix, b^mIs a bias that is a function of the bias,

is a row vector, T represents a transpose, tanh (·) is a tanh function;

the cross-modal attention after normalization is represented,

is an enhanced aggregate text representation of the ith frame in the video;

wherein,

is an enhanced language aware frame feature of the ith frame in the video;

3.3a) calculating cross-modal features at video segment level:

dividing a video into segments, constructing a 2D segment feature map, wherein the first two dimensions represent a starting frame and an ending frame of a segment, and the third dimension is a fusion feature of the segment; the fusion features are obtained by accumulating the enhanced language perception frame features of each frame in the segment;

performing two-layer 2D convolution based on the 2D segment characteristic diagram to calculate the matrix relation between adjacent segments and obtain the cross-modal characteristics of the video segment level

Wherein M is_enIs the number of fragments in the 2D fragment profile;

in the formula,

is the cross-modal feature of the ith video segment;

And extracting a candidate score for each active candidate segment

Wherein

And

similarly, the same structure of the suppression branch is used for inquiring the language of the statementSaid characteristic Q ═ Q_i},1≤i≤n_qAnd suppression video streaming with dual-branch visual filter generation

And a candidate score for each passive candidate segment

Wherein, therein

And

4. The method according to claim 3, wherein the step 4) is specifically as follows:

4.1) obtaining the aggregate text attention between each frame in the video and each word in the query sentence by utilizing the cross-modal attention between each frame in the video and each word in the query sentence calculated in the step 3.1a) through weighted summation;

Represents a rounding down, E% represents a mask percentage threshold;

using mask tokens to replace mask word sets W^*N in (1)_eObtaining the erased query sentence by using the words; obtaining the language characteristic Q of the erased query statement by using the method in the step 1)^en*＝{q_i ^*},1≤i≤n_q，q_i ^*The language features of the ith word in the query sentence after erasure are represented;

4.3) enhanced video streams generated from Dual-Branch visual Filter

And the language feature Q of the erased query statement^en*＝{q_i ^*},1≤i≤n_qObtaining the erased enhanced video stream by adopting the method of the step 2)

wherein,

representing normalized cross-modal attention，

4.5) calculating the erasure loss:

wherein,

representing the value of the erase loss, s being an intermediate variable, W_eIn order to be a projection matrix, the projection matrix,

5. the method for retrieving the weakly supervised video segment based on the erasure mechanism as recited in claim 4, further comprising:

wherein,

is the erasure of the ith frame in the videoThe enhanced language perception frame characteristics after the division;

wherein, w^cIs a learnable parameter, the initial value is 0.5;

is the final candidate score for the ith positive candidate segment.

6. The method according to claim 1, wherein the multitask penalty in step 5) is the weighted sum of the erasure penalty value in step 4) and the inter-sample penalty value, the intra-sample penalty value, the global penalty value, and the gap penalty value;

a. the calculation formula of the loss value between samples is as follows:

wherein,

is the final candidate score for the ith positive candidate segment,

is the candidate score of the ith negative candidate segment, K^enIs the enhancement score, K^spIs the suppression score, T is the number of candidate segments;

b. the calculation formula of the loss value in the sample is as follows:

wherein,

is a negative sample

is a negative sample

A corresponding enhancement score;

c. the global loss is calculated by the following formula:

wherein,

is the global penalty value, M_enIs the number of fragments in the 2D fragment profile;

d. the gap loss is calculated by the following formula:

wherein,

is the gap loss value.

7. The method as claimed in claim 6, wherein the intra-sample loss is a negative sample of the weakly supervised video segment retrieval method based on the erasure mechanism

And negative samples

Composing passive samples

Selecting unmatched queries

And video V form another negative sample

8. A retrieval system based on the weak supervision video clip retrieval method of claim 1, characterized by comprising:

9. The retrieval system of claim 8, wherein the dual-branch visual filter module comprises:

10. The retrieval system of claim 9, wherein the dual-branch sharing candidate network modules include an enhanced branch and a suppressed branch, and parameter sharing between the branches;

the two branches comprise: