CN118427396B - Multi-modal large-model-assisted unsupervised cross-modal video retrieval method and equipment
- Publication number: CN118427396B (application CN202410893508.2A)
- Authority: CN (China)
- Legal status: Active
Abstract
The application provides a multi-modal large-model-assisted unsupervised cross-modal video retrieval method and equipment. In one example of the application, the method includes: for any video in the training set, performing video frame sampling on the video using a representative video frame sampling method based on difference evaluation to obtain representative frames of the video, and generating corresponding text labels using a pre-trained multi-modal text labeling large model; filtering out the text labels that do not meet the relevance requirement according to the relevance between the representative frames and the corresponding text labels; determining text description information of the video according to the filtered text labels to obtain a video-text description information pair; and training a cross-modal video retrieval model according to the video-text description information pairs corresponding to the videos in the training set. The method can reduce the dependence of cross-modal video retrieval model training on manual annotation.
Description
Technical Field
The application relates to the technical field of text-video cross-modal retrieval, and in particular to a multi-modal large-model-assisted unsupervised cross-modal video retrieval method and device.
Background
With the rise of social media and video-sharing platforms, short video software attracts a large number of users who upload and share a large amount of video content. As short video software has developed, the demand for searching for videos related to users' interests has been increasing, so the task of text-to-video cross-modal retrieval has received considerable attention in academia.
In text-video retrieval, the relevance between a user-given natural language query and video content is computed, and the semantically related videos are retrieved from a large video library.
In recent years, a cross-modal video retrieval method based on deep learning has made remarkable progress. However, the current mainstream methods rely heavily on a large amount of manually annotated training data. In the case of limited labor costs, it is often difficult to collect a large-scale and high-quality video text pair (video-text pair) data set, which limits the application of cross-modal retrieval technology in practical scenarios to some extent.
Disclosure of Invention
In view of the above, the application provides a multi-mode large-model-assisted unsupervised cross-mode video retrieval method and device.
Specifically, the application is realized by the following technical scheme:
according to a first aspect of an embodiment of the present application, there is provided a multi-modal large model assisted unsupervised cross-modal video retrieval method, including:
for any video in the training set, performing video frame sampling on the video by using a representative video frame sampling method based on the difference evaluation to obtain a representative frame of the video;
generating a corresponding text label by utilizing a pre-training multi-mode text label large model according to the representative frame of the video;
Determining the correlation degree between the representative frame and the corresponding text label by using a pre-training cross-mode text image matching model, and filtering the text labels which do not meet the correlation degree requirement according to the correlation degree between the representative frame and the corresponding text label;
Determining text description information of the video according to the filtered text labels to obtain a video-text description information pair;
training the cross-modal video retrieval model according to the video-text description information pairs corresponding to the videos in the training set to obtain a trained cross-modal video retrieval model; the trained cross-modal video retrieval model is used for executing cross-modal video retrieval tasks.
According to a second aspect of the embodiment of the present application, there is provided a multi-modal large-model-assisted unsupervised cross-modal video retrieval method, including:
Acquiring target text information for video retrieval;
Extracting text features of the target text information by using the trained cross-mode video retrieval model to obtain text feature information of the target text information; wherein the trained cross-modal video retrieval model is trained and obtained by the method provided by the first aspect;
determining a target video matched with the target text information according to the text characteristic information of the target text information and the video characteristics of the candidate video; and extracting video features of the candidate videos by using the trained cross-mode video retrieval model.
According to a third aspect of embodiments of the present application, there is provided a multi-modal large-model-aided unsupervised cross-modal video retrieval apparatus, comprising:
the sampling unit is used for sampling video frames of any video in the training set by using a representative video frame sampling method based on the difference evaluation to obtain a representative frame of the video;
the generation unit is used for generating corresponding text labels by utilizing a pre-training multi-mode text label large model according to the representative frame of the video;
The filtering unit is used for determining the correlation degree between the representative frame and the corresponding text label by utilizing the pre-training cross-mode text image matching model, and filtering the text label which does not meet the correlation degree requirement according to the correlation degree between the representative frame and the corresponding text label;
The determining unit is used for determining text description information of the video according to the filtered text labels to obtain a video-text description information pair;
The training unit is used for training the cross-modal video retrieval model according to the video-text description information pairs corresponding to the videos in the training set to obtain a trained cross-modal video retrieval model; the trained cross-modal video retrieval model is used for executing cross-modal video retrieval tasks.
According to a fourth aspect of embodiments of the present application, there is provided an electronic device comprising a processor and a memory, wherein,
A memory for storing a computer program;
and a processor configured to implement the method provided in the first aspect when executing the program stored in the memory.
According to a fifth aspect of embodiments of the present application, there is provided a computer program product having a computer program stored therein, which when executed by a processor implements the method provided by the first aspect.
According to the multi-modal large-model-assisted unsupervised cross-modal video retrieval method, for any video in the training set, a representative video frame sampling method based on difference evaluation is used to sample the video and obtain its representative frames, and corresponding text labels are generated from the representative frames using a pre-trained multi-modal text labeling large model. By introducing the pre-trained multi-modal text labeling large model, automatic generation of text labels is realized and the dependence on manual annotation is reduced. Representative video frame sampling based on difference evaluation is designed to select the representative frames in each video, so that the key information in the video is extracted, frequent calls to the pre-trained multi-modal text labeling large model are avoided, the calling cost is reduced, and the amount of calculation is greatly reduced. For the generated text labels, noise in the machine labels is filtered with a pre-trained cross-modal text-image matching model, which improves the accuracy of the machine labels and reduces errors in the label information. Further, text description information of the video is determined from the filtered text labels to obtain video-text description information pairs, and the cross-modal video retrieval model is trained on the video-text description information pairs corresponding to the videos in the training set to obtain a trained cross-modal video retrieval model, which is used for executing cross-modal video retrieval tasks. This can improve the accuracy and reliability of cross-modal retrieval. Moreover, the training samples used in training the cross-modal video retrieval model do not need to be annotated manually, so unsupervised training of the cross-modal video retrieval model is realized and model training efficiency is improved.
Drawings
FIG. 1 is a flow chart of a pre-training multi-modal large-model assisted unsupervised cross-modal video retrieval method according to an exemplary embodiment of the present application;
FIG. 2 is a schematic diagram illustrating a data preparation phase of an unsupervised cross-modality video retrieval based on a pre-trained multi-modality large model, according to an exemplary embodiment of the present application;
FIG. 3 is a schematic diagram of a training model phase of an unsupervised cross-modality video retrieval based on a pre-training multi-modality large model, according to an exemplary embodiment of the present application;
FIG. 4 is a schematic diagram of a pre-training multi-modal large-model assisted unsupervised cross-modal video retrieval device according to an exemplary embodiment of the present application;
Fig. 5 is a schematic diagram of a hardware structure of an electronic device according to an exemplary embodiment of the present application.
Detailed Description
In order to enable those skilled in the art to better understand the technical solutions provided by the embodiments of the present application, some technical terms related to the embodiments of the present application are briefly described below.
1. Pretrained multi-modal large model: the pre-training multi-mode large model is a model architecture based on deep learning, and aims to process data of multiple modes (such as images, texts, audios and the like) and realize cross-mode tasks. It combines techniques in the fields of computer vision, natural language processing, and audio processing to establish associations between multimodal data and semantic understanding. The pre-training stage learns a shared modal representation space through joint encoding of multi-modal input data, so that features between different modalities can be effectively converted and aligned in the space.
2. Unsupervised learning: unsupervised learning is a learning paradigm in the field of deep learning that, unlike supervised learning and reinforcement learning, does not require reliance on manually annotated training data or external reward signals for training. The goal of unsupervised learning is to discover structures, patterns, or associations in data from unlabeled data.
3. Cross-modality video retrieval: cross-modal video retrieval aims to retrieve video data associated therewith from text query content. This task combines techniques of text information retrieval and video content analysis such that a user can find and obtain video clips associated with their query content through text descriptions or keywords. In text-to-video retrieval, the text query may be a phrase, sentence, or a natural language description that expresses the user's need for the desired video content. The system needs to semantically understand and represent these text queries in order to match and calculate similarity to the video data.
In order to make the above objects, features and advantages of the embodiments of the present application more comprehensible, the following describes the technical solution of the embodiments of the present application in detail with reference to the accompanying drawings.
Referring to fig. 1, which is a flow chart of a pre-trained multi-modal large-model-assisted unsupervised cross-modal video retrieval method provided by an embodiment of the present application, as shown in fig. 1, the method may include the following steps:
Step S100, for any video in the training set, performing video frame sampling on the video by using a representative video frame sampling method based on the difference evaluation to obtain a representative frame of the video.
Step S110, generating corresponding text labels by utilizing a pre-training multi-mode text label large model according to the representative frames of the video.
In the embodiment of the application, the pre-training multi-mode large model has strong generating capability, and text labels can be automatically generated for visual contents, so that in order to reduce the dependence of cross-mode video retrieval model training on manpower, the pre-training multi-mode large model can be introduced for generating training data in the cross-mode video retrieval model training.
In addition, considering that the video generally includes more video frames, there is a certain redundancy in the content between frames, and a huge amount of calculation overhead is caused by calling a large model, in order to reduce the cost of calling the large model frequently, all video frames may not be automatically generated with text labels, but a representative video frame sampling method based on the difference evaluation is used to select a frame with higher representativeness (may be referred to as a representative frame) from the video frames.
Accordingly, for any video in the training set, a representative video frame sampling method based on the difference evaluation can be utilized to sample the video in a video frame manner, so as to obtain a representative frame of the video.
For example, the videos in the training set may originate from a mainstream video website.
For example, a clustering method may be used to implement the difference assessment of video frames, and a feature clustering manner is used to screen different cluster centers, that is, each representative frame, from the feature space, so as to obtain a more representative video frame.
In the embodiment of the application, for the representative frames obtained by sampling, a pre-trained multi-mode large model (which can be called a pre-trained multi-mode text labeling large model) can be utilized to generate corresponding text labels.
Step S120, determining the correlation degree between the representative frame and the corresponding text label by utilizing the pre-training cross-mode text image matching model, and filtering the text labels which do not meet the correlation degree requirement according to the correlation degree between the representative frame and the corresponding text label.
In the embodiment of the application, it is considered that the pre-trained multi-modal large model has difficulty stably outputting high-quality labels (i.e., the text labels), which usually contain noise, and the presence of noise in the training data tends to affect model performance during training of the cross-modal video retrieval model. Therefore, it is necessary to evaluate the quality of the generated text labels and filter out samples of significantly lower quality.
Illustratively, since the generated text labels originate from video frames, the relevance of the generated text labels and the corresponding video frames can be measured as a quality assessment criterion for the text. The higher the correlation degree between the text label and the video frame is, the higher the quality of the generated text label is; and vice versa.
For example, the quality of a text annotation may be evaluated using a pre-trained multimodal large model for evaluating the quality of the text annotation (which may be referred to as a pre-trained cross-modal text image matching model or a pre-trained multi-modal text image matching large model).
The pre-training cross-mode text image matching model refers to a model trained by using a large number (hundreds of millions) of matched texts and images, and through training, the model learns to encode the images and the texts into a unified vector space, so that the pre-training cross-mode text image matching model has strong alignment capability for the images and the texts.
By way of example, the pre-trained cross-modal text image matching model may include a CLIP (Contrastive Language-Image Pre-training) model.
In the embodiment of the application, under the condition that the text labels corresponding to the representative frames in the video frames are generated in the above manner, a pre-training cross-mode text image matching model can be utilized to determine the correlation degree between the representative frames and the corresponding text labels, and text labels which do not meet the correlation degree requirement are filtered according to the correlation degree between the representative frames and the corresponding text labels.
And step S130, determining text description information of the video according to the filtered text labels to obtain a video-text description information pair.
According to the embodiment of the application, the text description information of the video can be generated according to the filtered text labels obtained in the mode, so that a video-text description information pair is obtained.
For any video in the training set, the generation of the video-text description information pair can be performed in the manner described above, and training data for training the cross-modal video retrieval model can be obtained according to the obtained video-text description information pair.
Step S140, training a cross-modal video retrieval model according to video-text description information pairs corresponding to videos in a training set to obtain a trained cross-modal video retrieval model; the trained cross-modal video retrieval model is used for executing cross-modal video retrieval tasks.
According to the embodiment of the application, the video-text description information pairs corresponding to the videos in the training set can be generated in the mode, and the cross-modal video retrieval model is trained according to the video-text description information pairs corresponding to the videos in the training set, so that the trained cross-modal video retrieval model is obtained.
Illustratively, the trained cross-modal video retrieval model is used to perform cross-modal video retrieval tasks.
It can be seen that, in the method flow shown in fig. 1, for any video in the training set, a representative video frame sampling method based on difference evaluation is used to sample the video and obtain its representative frames, and corresponding text labels are generated from the representative frames using a pre-trained multi-modal text labeling large model. Introducing the pre-trained multi-modal text labeling large model realizes automatic generation of text labels, reduces the dependence on manual annotation, and makes unsupervised training of the cross-modal video retrieval model possible. By designing representative video frame sampling based on difference evaluation to select the representative frames in each video, the key information in the video is extracted, frequent calls to the pre-trained multi-modal text labeling large model are avoided, the calling cost is reduced, and the amount of calculation is greatly reduced. For the generated text labels, noise in the machine labels is filtered with a pre-trained cross-modal text-image matching model, which improves the accuracy of the machine labels and reduces errors in the label information. Further, text description information of the video is determined from the filtered text labels to obtain video-text description information pairs, and the cross-modal video retrieval model is trained on the video-text description information pairs corresponding to the videos in the training set to obtain a trained cross-modal video retrieval model for executing cross-modal video retrieval tasks. This improves the accuracy and reliability of cross-modal retrieval, and since the training samples used in the training process do not need to be annotated manually, unsupervised training of the cross-modal video retrieval model is realized and model training efficiency is improved.
In some embodiments, the video sampling the video using the representative video frame sampling method based on the difference evaluation may include:
Extracting visual features of each video frame in the video respectively to obtain the visual features of each video frame;
selecting a video frame with the largest visual characteristic amplitude as a first clustering center according to the visual characteristics of each video frame;
Sequentially selecting video frames which are farthest from the existing cluster centers as new cluster centers until the number of the selected cluster centers reaches a first number;
each cluster center is determined as a representative frame of the video.
For example, in order to implement the determination of the representative frame, visual feature extraction may be performed on each video frame in the video, so as to obtain the visual feature of each video frame.
For example, for a given video containing N (N ≥ 2) frames, the visual features of each video frame in the video may be extracted, with the final features denoted as F = [f_1, f_2, ..., f_N].
For example, visual feature extraction may be performed on the video frames using a CLIP pre-trained model developed by OpenAI (e.g., the pre-trained cross-modal text image matching model described above). Each frame in the video may be converted into a feature vector of dimension k (k being a positive integer), and the resulting video features are represented as a matrix of dimension k × N.
Illustratively, the extracted visual features may be clustered to identify key frames (i.e., representative frames as described above) in the video.
For example, the visual feature amplitude of each video frame may be determined as the L2 norm of its visual features. The video frame with the largest visual feature amplitude is selected as the first cluster center; video frames farthest from the existing cluster centers are then selected in turn as new cluster centers until the number of selected cluster centers reaches a first number (whose specific value can be set according to actual requirements); and each cluster center is then determined as a representative frame of the video.
Because the new cluster center selected each time is the video frame which is farthest from the existing cluster center, the difference between the selected representative frames is ensured, and finally, the multi-frame video frame which can better represent the video content is obtained.
In the embodiment of the present application, the process of clustering the visual features of each video frame in the video to identify the representative frames is not limited to the above clustering method; clustering algorithms such as K-medoids (K-medoid clustering) or K-means (K-means clustering) may also be used, and their specific implementations will not be described herein.
Compared with the K-means algorithm, which defines the cluster center as the average of the points in the cluster, the clustering approach provided by the embodiment of the application defines the cluster center as an actual data point in the cluster, which offers better interpretability.
Compared with the K-medoids algorithm, the clustering method provided by the embodiment of the application reduces the randomness of the initial clustering center selection, and is beneficial to accelerating the convergence of the algorithm.
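The sampling procedure above can be sketched in a few lines. The following is a minimal illustration (not the patent's implementation) that assumes the per-frame visual features have already been extracted, e.g., by a CLIP image encoder, into an N × k array; all names are illustrative.

```python
import numpy as np

def select_representative_frames(features: np.ndarray, num_centers: int) -> list:
    """KKZ-style farthest-point selection of cluster centers.

    features: array of shape (N, k), one row of visual features per video frame.
    num_centers: the "first number" of representative frames to keep.
    Returns the indices of the selected representative frames.
    """
    n = features.shape[0]
    num_centers = min(num_centers, n)

    # First center: the frame with the largest feature magnitude (L2 norm).
    norms = np.linalg.norm(features, axis=1)
    centers = [int(np.argmax(norms))]

    # Each subsequent center: the frame farthest from its nearest existing center.
    dist_to_nearest = np.full(n, np.inf)
    while len(centers) < num_centers:
        newest = features[centers[-1]]
        dist_to_nearest = np.minimum(
            dist_to_nearest, np.linalg.norm(features - newest, axis=1)
        )
        centers.append(int(np.argmax(dist_to_nearest)))
    return centers
```

A K-medoids refinement (reassigning frames to the nearest center and updating each medoid) could follow this KKZ-style seeding; only the center initialization and selection step is shown here.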
In some embodiments, the generating the corresponding text label according to the representative frame of the video by using the pre-training multi-mode text label large model may include:
And for any representative frame, inputting the representative frame and the prompt word corresponding to the representative frame into a pre-training multi-mode text labeling large model to generate a corresponding text label.
For example, to achieve automatic generation of text labels, a multimodal big model for automatically generating text labels (i.e., the above-mentioned pre-trained multimodal text label big model) may be pre-trained, and with the pre-trained multimodal text label big model, text labels may be automatically generated for representative frames.
For example, in order to improve the efficiency of text labeling, for each representative frame, a sentence of a prompt word may be designed for the representative frame (for prompting the pre-training multi-mode text labeling large model to perform text labeling on the representative frame), and the prompt word and the representative frame are input into the pre-training multi-mode text labeling large model together, so as to obtain a corresponding text label.
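As a rough sketch of this step, the loop below pairs each representative frame with a prompt sentence and queries a captioning model once per frame; the `caption_fn` interface and the prompt text are assumptions made for illustration, since the patent only names LLaVA as an example model and does not fix an API.

```python
from typing import Callable, List

def label_representative_frames(
    frames: List[object],
    caption_fn: Callable[[object, str], str],
    prompt: str = "Describe the main content of this video frame in one sentence.",
) -> List[str]:
    """Generate one text label per representative frame.

    `caption_fn(image, prompt)` stands in for the pre-trained multi-modal text
    labeling large model (LLaVA is the example cited in the text); its exact
    call signature and the prompt wording are assumptions of this sketch.
    """
    return [caption_fn(frame, prompt) for frame in frames]
```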
In some embodiments, determining the relevance between the representative frame and the corresponding text label by using the pre-training cross-modal text image matching model, and filtering the text label that does not meet the relevance requirement according to the relevance between the representative frame and the corresponding text label may include:
For any video, each representative frame in the video and the text label corresponding to the representative frame are input into a pre-training cross-mode text image matching model, and the correlation degree between each representative frame in the video and the corresponding text label is determined;
And according to the correlation degree between each representative frame in the video and the corresponding text label, retaining a second number of representative frames whose correlation degree ranks highest, together with their corresponding text labels, and deleting the remaining text labels.
For example, in order to improve sample quality, for any video, once the text labels corresponding to each representative frame in the video have been generated in the above manner, a pre-trained cross-modal text-image matching model may be used to determine the correlation degree between each representative frame in the video and its corresponding text label. According to these correlation degrees, the second number of representative frames with the highest correlation degree ranking (the specific value can be set according to actual requirements) and their corresponding text labels are retained, i.e., the top-K pairs are kept (K being the second number), and the remaining text labels are deleted.
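A minimal sketch of this relevance filtering, assuming the representative frames and their generated labels have been embedded into a shared space by a pre-trained cross-modal text-image matching model such as CLIP; the cosine-similarity scoring and the tensor shapes are assumptions of the sketch.

```python
import torch
import torch.nn.functional as F

def filter_labels_by_relevance(
    frame_feats: torch.Tensor,   # (R, d) embeddings of the R representative frames
    label_feats: torch.Tensor,   # (R, d) embeddings of their generated text labels
    keep_top_k: int,             # the "second number" of pairs to retain
):
    """Keep the top-K (frame, label) pairs ranked by image-text relevance,
    where relevance is the cosine similarity between each representative frame
    and its own generated text label in the shared embedding space."""
    frame_feats = F.normalize(frame_feats, dim=-1)
    label_feats = F.normalize(label_feats, dim=-1)
    relevance = (frame_feats * label_feats).sum(dim=-1)            # (R,)
    kept = torch.topk(relevance, k=min(keep_top_k, relevance.numel())).indices
    return kept, relevance
```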
In some embodiments, the training the cross-modal video retrieval model according to the video-text description information pair corresponding to each video in the training set may include:
for any video-text description information pair in the training set, determining similarity scores of videos and text description information in the video-text description information pair by using a pre-trained cross-mode similarity evaluation model;
determining training weights of the video-text description information pairs according to the similarity scores; wherein the training weight of the video-text description information pair is positively correlated with the similarity score;
And training the cross-mode video retrieval model according to the video-text description information pairs corresponding to the videos and training weights of the video-text description information pairs corresponding to the videos.
Illustratively, considering that the training samples are generated by the pre-trained multi-modal text labeling large model, some noise may still be present in the retained samples even though some lower-quality samples have been removed through sample filtering. Thus, to reduce the effect of noise in the samples, different weight adjustments may be made to each sample pair during the optimization training process.
For example, a cross-modal large model (which may be referred to as a pre-trained cross-modal similarity assessment model) for measuring similarity between video and text descriptors may be trained.
For any video-text description information pair in the training set, determining the similarity score of the video and the text description information in the video-text description information pair by utilizing a pre-trained cross-mode similarity evaluation model, and determining the training weight of the video-text description information pair according to the similarity score.
Under the condition that the training weights of the video-text description information pairs are determined in the above manner, the cross-mode video retrieval model can be trained according to the video-text description information pairs corresponding to the videos and the training weights of the video-text description information pairs corresponding to the videos.
In the training process of the cross-modal video retrieval model, for training data of any batch, training loss of each video-text description information pair in the batch of training data can be weighted and averaged according to the training weight of each video-text description information pair in the batch of training data determined in the above manner, so as to obtain training loss corresponding to the batch of training data, and specific implementation of the training loss can be described in the following in connection with specific examples.
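Written out, the weighting scheme described above amounts to the following (a reconstruction from the description, not a formula quoted from the patent), where sim(·,·) is the score from the pre-trained cross-modal similarity evaluation model, or any function positively correlated with it:

```latex
w_i = \mathrm{sim}(v_i, t_i), \qquad
\mathcal{L}_{\mathrm{batch}} = \frac{\sum_{i=1}^{B} w_i\,\ell_i}{\sum_{i=1}^{B} w_i}
```

Here ℓ_i denotes the training loss of the i-th video-text description information pair in a batch of size B.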
In some embodiments, the cross-modality video retrieval model includes a text encoder, a video encoder, and a sequence Transformer module;
the training of the cross-mode video retrieval model according to the video-text description information pairs corresponding to each video in the training set may include:
For any video-text description information pair in any batch of training data, performing text feature extraction on the text description information in the video-text description information pair using the text encoder to obtain the text feature representation corresponding to the text description information, performing video feature extraction on the downsampled video corresponding to the video in the video-text description information pair using the video encoder, and processing the video feature extraction result using the sequence Transformer module to obtain the video feature representation;
Determining a first type of penalty based on a first similarity between a video feature representation of the video in the video-text description information pair and a text feature representation of the text description information in the video-text description information pair, and a second similarity between the video feature representation and text feature representations of the text description information in other video-text description information pairs in the same batch of training data; and
Determining a second type of loss based on a first similarity between the video feature representation of the video in the video-text description information pair and the text feature representation of the text description information in the video-text description information pair, and a third similarity between the text feature representation and the video feature representations of the video in other video-text description information pairs in the same batch of training data;
And determining the training loss of the batch of training data according to the first type loss and the second type loss corresponding to each video-text description information in the batch of training data.
By way of example, the cross-modality search model may include a video encoder for feature extraction of video and a text encoder for feature extraction of text.
Furthermore, it is contemplated that the information contained in a video is far richer than that contained in text. In order to better mine the rich information contained in the video, a sequence Transformer (SeqTransformer) module can be introduced, whose self-attention and multi-head attention mechanisms can be utilized to effectively capture the contextual temporal dependencies in the sequence.
Correspondingly, for any video-text description information pair in any batch of training data, text feature extraction can be performed on the text description information in the pair using the text encoder to obtain the text feature representation corresponding to the text description information, video feature extraction is performed on the downsampled video corresponding to the video in the pair using the video encoder, and the sequence Transformer module is used to process the video feature extraction result to obtain the video feature representation.
Illustratively, in order to optimize the training effect of the cross-modal search model, the loss of video search text and the loss of text search video need to be considered during training, respectively.
Accordingly, where the video feature representation and the text feature representation are derived in the manner described above, for any video-text description information pair in any batch of training data, on the one hand, the loss of video retrieving text (which may be referred to as the first type of loss) may be determined based on the similarity (which may be referred to as the first similarity) between the video feature representation of the video in the video-text description information pair and the text feature representation of the text description information in the video-text description information pair, and the similarity (which may be referred to as the second similarity) between the video feature representation and the text feature representations of the text description information in the other video-text description information pairs in the same batch of training data.
For example, for any video-text description information pair, the first type of loss may be determined according to the ratio of the natural exponential of the first similarity corresponding to the video-text description information pair to the sum of the natural exponentials of the second similarities corresponding to the video-text description information pair, and a specific implementation thereof is described below in connection with a specific example.
Here, exp denotes the natural exponential function.
On the other hand, the loss of text retrieving video (which may be referred to as the second type of loss) may be determined based on the similarity between the video feature representation of the video in the video-text description information pair and the text feature representation of the text description information in the video-text description information pair (i.e., the first similarity described above), and the similarity between the text feature representation and the video feature representations of the videos in the other video-text description information pairs in the same batch of training data (which may be referred to as the third similarity).
For example, for any video-text description information pair, the second type of loss may be determined according to the ratio of the natural exponential of the first similarity corresponding to the video-text description information pair to the sum of the natural exponentials of the third similarities corresponding to the video-text description information pair, and a specific implementation thereof is described below in connection with a specific example.
Under the condition that the first type loss and the second type loss corresponding to the video-text description information in the same batch of training data are determined in the above manner, the training loss of the batch of training data can be determined according to the first type loss and the second type loss corresponding to the video-text description information in the same batch of training data.
For example, a weighted average of the first type of loss corresponding to each video-text description information in the same batch of training data (the weighting coefficient of each video-text description information pair may be the training weight determined in the above manner), and a weighted average of the second type of loss may be determined separately, and the sum of the weighted average of the first type of loss and the weighted average of the second type of loss may be determined as the training loss of the batch of training data, and a specific implementation thereof may be described below in connection with a specific example.
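The two loss terms and their weighted combination can be sketched as follows; the temperature value, the feature normalization, and the use of a softmax denominator that includes the positive pair are assumptions made for a runnable example, since the text only specifies the ratio-of-natural-exponentials structure and the weighted averaging.

```python
import torch
import torch.nn.functional as F

def weighted_symmetric_contrastive_loss(
    video_feats: torch.Tensor,   # (B, d) video feature representations of the batch
    text_feats: torch.Tensor,    # (B, d) text feature representations of the batch
    weights: torch.Tensor,       # (B,) per-pair training weights (similarity scores)
    temperature: float = 0.05,   # assumed; the text does not give a value
) -> torch.Tensor:
    video_feats = F.normalize(video_feats, dim=-1)
    text_feats = F.normalize(text_feats, dim=-1)
    sim = video_feats @ text_feats.t() / temperature                 # (B, B) pairwise similarities

    targets = torch.arange(sim.size(0), device=sim.device)
    loss_v2t = F.cross_entropy(sim, targets, reduction="none")       # first type: video retrieves text
    loss_t2v = F.cross_entropy(sim.t(), targets, reduction="none")   # second type: text retrieves video

    weights = weights / weights.sum().clamp_min(1e-8)                # weighted average per direction
    return (weights * loss_v2t).sum() + (weights * loss_t2v).sum()
```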
In some embodiments, the similarity between the video feature representation of the video and the text feature representation of the text description information is determined by:
for any video frame in the video, searching a word with the highest feature similarity between the feature and the video frame according to the feature of the video frame, and determining the fourth similarity between the word and the video frame as the similarity corresponding to the video frame; and
For any word in the text description information, searching a video frame with the highest feature similarity between the feature and the word according to the feature of the word, and determining the fifth similarity between the video frame and the word as the similarity corresponding to the word;
And determining the similarity between the video characteristic representation of the video and the text characteristic representation of the text description information according to the fourth similarity corresponding to each video frame in the video and the fifth similarity corresponding to each word in the text description information.
For example, to more accurately determine the similarity between the video and the text, the similarity between the video and the text may be determined based on fine-grained similarity between the video frames in the video and the sentence words in the text.
For any video frame in the video, a word with the highest feature similarity between the feature and the video frame is searched according to the feature of the video frame, and the similarity (which may be called as a fourth similarity) between the word and the video frame is determined as the similarity corresponding to the video frame.
For example, for any video frame in the video, the similarity between the feature of the video frame and the feature of each word in the text description information can be calculated, the word with the highest feature similarity between the feature and the feature of the video frame is determined, and the similarity between the feature of the word and the feature of the video frame is determined as the similarity (i.e. fourth similarity) corresponding to the video frame.
Similarly, for any word in the text description information, a video frame with the highest feature similarity between the feature and the word can be searched according to the feature of the word, and the similarity (which can be called as a fifth similarity) between the video frame and the word can be determined as the similarity corresponding to the word.
For any word in the text description information, the similarity between the feature of the word and the feature of each video frame in the video can be calculated, the video frame with the highest feature similarity between the feature and the word is determined, and the similarity between the feature of the video frame and the feature of the word is determined as the corresponding similarity (namely, the fifth similarity) of the word.
Further, a similarity between the video feature representation of the video and the text feature representation of the text description information may be determined according to a fourth similarity corresponding to each video frame in the video and a fifth similarity corresponding to each word in the text description information.
For example, the sum of the fourth similarity corresponding to each video frame in the video and the fifth similarity corresponding to each word in the text description information may be determined as the similarity between the video feature representation of the video and the text feature representation of the text description information, and the specific implementation thereof may be described below in connection with specific examples.
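A small sketch of this fine-grained video-text similarity, assuming frame and word features already lie in a shared space; the cosine similarity and the tensor shapes are assumptions of the sketch.

```python
import torch
import torch.nn.functional as F

def fine_grained_similarity(frame_feats: torch.Tensor, word_feats: torch.Tensor) -> torch.Tensor:
    """Similarity between one video and one text description.

    frame_feats: (m, d) features of the m sampled video frames
    word_feats:  (p, d) features of the p words in the text description
    """
    frame_feats = F.normalize(frame_feats, dim=-1)
    word_feats = F.normalize(word_feats, dim=-1)
    sim = frame_feats @ word_feats.t()            # (m, p) frame-word similarities

    frame_to_best_word = sim.max(dim=1).values    # fourth similarity, one per frame
    word_to_best_frame = sim.max(dim=0).values    # fifth similarity, one per word
    return frame_to_best_word.sum() + word_to_best_frame.sum()
```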
The embodiment of the application also provides a multi-modal large-model-assisted unsupervised cross-modal video retrieval method, which includes the following steps:
And step T100, acquiring target text information for video retrieval.
In the embodiment of the application, the target text information can be text information for video retrieval, which is input by a user.
And step T110, extracting text features of the target text information by using the trained cross-mode video retrieval model to obtain the text feature information of the target text information.
For example, the training manner of the cross-modal video retrieval model can be referred to the relevant description in the above embodiment.
In the embodiment of the application, for the obtained target text information, the text feature extraction can be carried out on the target text information by utilizing the trained cross-mode video retrieval model, so as to obtain the text feature information of the target text information.
For example, text feature extraction may be performed on target text information using a text encoder in a cross-modality video retrieval model.
Step T120, determining a target video matched with the target text information according to the text characteristic information of the target text information and the video characteristics of the candidate video; the video features of the candidate videos are obtained by extracting the video features of the candidate videos by using a trained cross-mode video retrieval model.
In the embodiment of the application, under the condition that the text characteristic information of the target text information is obtained, the target video matched with the target text information can be determined according to the text characteristic information and the video characteristics of the candidate video.
For example, based on the similarity between the text feature information of the target text information and the video features of the candidate videos, a certain number (which may be called a third number, whose specific value can be set according to actual requirements) of candidate videos with the highest similarity ranking, i.e., the top-M candidate videos (M being the third number), may be used as the target videos.
By way of example, the candidate video may be a video in a video library for searching. For example, videos in a video library of a video website.
For example, the video features of the candidate video may be obtained by extracting video features of the candidate video using a trained cross-modal video retrieval model.
For example, video feature extraction can be performed on the candidate videos using the video encoder and the sequence Transformer module in the cross-modal video retrieval model to obtain the video feature information of the candidate videos.
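At retrieval time the flow can be sketched as below, assuming the candidate video features have been precomputed offline with the trained model's video branch; the cosine scoring is an assumption of the sketch.

```python
import torch
import torch.nn.functional as F

def retrieve_top_m(
    query_text_feat: torch.Tensor,        # (d,) feature of the target text information
    candidate_video_feats: torch.Tensor,  # (C, d) precomputed features of the candidate videos
    m: int,                               # the "third number" of videos to return
) -> torch.Tensor:
    """Return the indices of the top-M candidate videos for one text query."""
    q = F.normalize(query_text_feat, dim=-1)
    v = F.normalize(candidate_video_feats, dim=-1)
    scores = v @ q                                   # (C,) similarity of each candidate video
    return torch.topk(scores, k=min(m, scores.numel())).indices
```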
In order to enable those skilled in the art to better understand the technical solutions provided by the embodiments of the present application, the technical solutions provided by the embodiments of the present application are described below with reference to specific examples.
In this embodiment, considering that text labeling is very difficult for a large-scale dataset, the pre-training multi-mode large model has very strong generating capability, and text labeling can be automatically generated for visual content, so in order to reduce the dependence of cross-mode video retrieval model training on manpower, the pre-training multi-mode large model can be introduced for generating training data in cross-mode video retrieval model training.
But the following problems exist with generating text labels using pre-trained multi-modal large models: first, calling the large model a large number of times brings huge calculation cost; second, the large model has difficulty stably outputting high-quality text labels (which may be called pseudo labels), so the sample quality is uneven.
Aiming at the problems, the embodiment of the application provides an unsupervised cross-modal video retrieval framework based on a pre-training multi-modal large model, wherein the framework comprises a data preparation stage and a training model stage; wherein:
1) Data preparation stage: as shown in fig. 2, a representative video frame sampling module based on difference evaluation (the clustering module in the corresponding figure) is designed at this stage to acquire the relatively representative frames of the video (i.e., the representative frames), machine labeling information is provided for the video through a pre-trained multi-modal text labeling large model (such as the LLaVA model), and finally the noise in the machine labeling is filtered through a large-scale pre-trained model (the filtering module in the corresponding figure), so as to obtain better-quality video-text data (i.e., the video-text description information pairs).
2) Training a model stage: and training a cross-mode video retrieval model by utilizing the original video and the filtered machine annotation information and comparing and learning the optimization of the loss function.
Illustratively, considering that the training samples are generated by the pre-trained multi-modal large model, some noise may be present in the retained samples even though some lower-quality samples have been removed through sample filtering. Therefore, noise-robust similarity learning is proposed: the quality of the samples is considered on the basis of the contrastive loss, and different weight adjustments are made to each sample pair during optimization training, thereby reducing the influence of noise in the samples.
Through the two stages, the task of non-supervision cross-modal retrieval based on the pre-training multi-modal large model is realized.
Relevant implementation details of an unsupervised cross-modality video retrieval framework based on a pre-trained multi-modality large model are described below.
1. Representative video frame sampling module based on difference evaluation
Although the pre-trained multi-modal large model has strong generation capability, the cost per call is not low. In addition, since a video typically includes many video frames, there is also some redundancy in the content from frame to frame. In order to reduce the cost of frequently calling the large model, automatic text label generation is not carried out on all video frames; instead, a representative video frame sampling method based on difference evaluation is provided, frames with higher representativeness (namely, representative frames) are selected from the original video frames, and text labels are automatically generated for the representative frames using the pre-trained multi-modal text labeling large model.
For example, the difference evaluation of the video frames can be implemented in a clustering-based manner, and different cluster centers, namely, each representative frame, are screened out from the feature space by using a feature clustering manner, so that more representative video frames are acquired.
For a given video containing N (N ≥ 2) frames, the visual features of each video frame in the video may be extracted, with the final features denoted as F = [f_1, f_2, ..., f_N].
For example, visual feature extraction may be performed on the video frames using the CLIP pre-trained model developed by OpenAI. Each frame in the video may be converted into a feature vector of dimension k (k being a positive integer), and the resulting video features are represented as a matrix of dimension k × N.
For example, where visual features of each video frame are extracted, the visual features may be clustered to enable identification of representative frames in the video.
For the specific clustering algorithm, the K-medoids algorithm can be adopted in consideration of its simplicity and effectiveness. The K-medoids algorithm divides the data into K clusters by minimizing the sum of the distances between all data points in a cluster and the cluster center, where the center of a cluster is defined as the actual data point that has the smallest average distance to all points in the cluster. The conventional K-means algorithm, by contrast, defines the center as the average of the points in the cluster. In comparison, the center of the K-medoids algorithm is more interpretable.
To further optimize the clustering process, KKZ (Katsavounidis-Kuo-Zhang) initialization may be used as the initialization method for the cluster centers.
I.e. a data point with the largest L2 norm (i.e. the video frame with the largest visual feature amplitude) may be selected as the first cluster center, and then the data points with the largest distance from the existing cluster center (i.e. the video frames) are sequentially selected as new cluster centers until the number of selected cluster centers reaches the first number. This strategy not only reduces the randomness of the K-means algorithm in the initial cluster center selection, but also helps to speed up the convergence of the algorithm.
Through video clustering sampling, multiple frames that can well represent the video content are finally obtained and output as the representative frames of the video, F_r = [f_r1, f_r2, ..., f_rK'], where K' is the number of cluster centers specified in the K-medoids clustering algorithm. Subsequently, text labels may be generated for each representative frame using the pre-trained multi-modal text labeling large model.
2. Generating module based on pre-training multi-mode large model
Exemplarily, given a video, the representative frames F_r of the video can be obtained through the processing described above and, together with the prompt words, are fed into the pre-trained multi-modal text labeling large model (such as LLaVA).
For example, each representative frame may be input into LLaVA to obtain a corresponding text label, so that, for any video, the text label corresponding to each representative frame in the video may be obtained.
For example, for a video, text description information corresponding to the whole video can be obtained according to text labels corresponding to representative frames in the video.
For example, to improve the efficiency of text labeling, for each representative frame, a sentence of a prompt word may be designed for the representative frame, and the prompt word and the representative frame are input together into LLaVa, so as to obtain a corresponding text label.
In this manner, the text labels corresponding to each representative frame in the video can be obtained, denoted as Q = [q_1, q_2, ..., q_K'].
3. Data filtering module based on large-scale pre-training model
For example, when the text labels Q corresponding to each representative frame in the video are obtained through the pre-trained multi-modal text labeling large model, it is considered that the text descriptions generated by the pre-trained multi-modal text labeling large model contain noise, and the presence of noise in the training data tends to affect the training of the cross-modal video retrieval model. Thus, the quality of the generated text labels can be evaluated, and the text labels of representative frames with significantly lower quality can be filtered out.
Illustratively, since the generated text labels originate from video frames, the relevance of the generated text labels and the corresponding video frames can be measured as a quality assessment criterion for the text. The higher the correlation degree between the text label and the video frame is, the higher the quality of the generated text label is; and vice versa.
For example, a pre-trained cross-modal text image matching model (e.g., CLIP model) may be employed to perform relevance calculations for text and video frames.
For example, given the corresponding text labels Q and the original video frames F_r of a video, the relevance score of each (frame, text label) pair can be calculated through the large-scale pre-trained model, and the top-K texts with the highest scores are retained according to the relevance scores, giving the filtered text data:
Q' = TopK(CLIP(F_r, Q))
where CLIP denotes calculating the relevance score between a video frame and its generated text description using the CLIP model, and TopK denotes taking the top-K texts as the filtering result.
4. Multi-mode coding module
To verify the validity of the generated text information, a cross-modal video retrieval model is proposed that trains on the generated text information and video pairs.
As shown in fig. 3, the cross-modal video retrieval model is composed of a video encoder and a text encoder; a SeqTransformer (sequence Transformer) module is introduced at the video end, whose self-attention and multi-head attention mechanisms are utilized to effectively capture the contextual dependencies in the sequence, and finally similarity calculation is performed between the two modalities. In this way, unsupervised cross-modal retrieval can be achieved.
Illustratively, to obtain a representation of a video feature, frames may be extracted from the video, for example, downsampling (e.g., uniformly sampling) the video frames to obtain a downsampled video (including a plurality of sampled frames), and passing the downsampled video through a video encoder to obtain a series of features.
For example, the video encoder may employ ViT-B/32, with 12 layers and a patch size of 32.
Given a video, m frames can be uniformly sampled from it to obtain a downsampled video, which reduces the computational complexity of the video encoder.
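A trivial sketch of the uniform sampling (the value of m is an assumption):

```python
import numpy as np

def uniform_sample_indices(n_frames: int, m: int = 12) -> np.ndarray:
    """Pick m frame indices spread evenly over the whole video."""
    return np.linspace(0, n_frames - 1, num=m).round().astype(int)

# e.g. a 300-frame video downsampled to 12 frames before entering the video encoder
idx = uniform_sample_indices(300, m=12)
```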
The downsampled video is input into the video (image) encoder of the cross-modal video retrieval model to obtain the corresponding video features. The video features consist of a sequence of m frame features, each with feature dimension d.
The encoded visual features may be expressed as:

$$V = \mathrm{CLIP}(\tilde{v})$$

where $\tilde{v}$ is the downsampled video and CLIP denotes the video encoder of the cross-modal video retrieval model.
Furthermore, since a video contains far more information than the corresponding text, a SeqTransformer can be introduced to better mine the rich information contained in the video; its self-attention and multi-head attention mechanisms effectively capture the contextual temporal dependencies in the frame sequence. The final video features are expressed as:

$$\hat{V} = \mathrm{SeqTransformer}(V)$$
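As a hedged sketch, the temporal module could be realized with a small PyTorch TransformerEncoder applied on top of the per-frame features (the layer count, dimension, and head count are assumptions, not the configuration claimed here):

```python
import torch
import torch.nn as nn

class SeqTransformer(nn.Module):
    """Self-attention over the m frame features to capture temporal context."""
    def __init__(self, dim: int = 512, heads: int = 8, layers: int = 2):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=layers)

    def forward(self, frame_feats: torch.Tensor) -> torch.Tensor:
        # frame_feats: (batch, m, d) frame-level features from the video encoder.
        # The output keeps the frame dimension so that frame-level (fine-grained)
        # matching remains possible downstream.
        return self.encoder(frame_feats)

video_feats = torch.randn(4, 12, 512)        # batch of 4 videos, 12 sampled frames each
contextual = SeqTransformer()(video_feats)   # -> (4, 12, 512)
```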
For the text representation, given a text query of word length p, the text encoder in the cross-modal video retrieval model may be used to generate the text feature representation.
Illustratively, the text encoder is a standard Transformer architecture.
For example, the text encoder may be a 12-layer model with dimension 512, containing 8 attention heads.
For the text information processed by the text encoder, the [EOS] token of the last Transformer layer can be used as the text feature.
For a text query processed in this way, the corresponding text features are obtained. The text feature consists of a sequence of p word features, each with feature dimension d.
Illustratively, the encoded text features may be represented as:

$$T = \mathrm{CLIP}(t)$$

where $t$ is the text query and CLIP denotes the text encoder of the cross-modal video retrieval model.
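A brief sketch of the text side using the CLIP text encoder from Hugging Face transformers (the checkpoint is an assumption; as noted above, the sentence-level feature is taken at the [EOS] position of the last Transformer layer, while the token-level features can serve fine-grained matching):

```python
import torch
from transformers import CLIPTextModel, CLIPTokenizer

tok = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")
text_enc = CLIPTextModel.from_pretrained("openai/clip-vit-base-patch32")

def encode_text(query: str):
    inputs = tok(query, return_tensors="pt")
    with torch.no_grad():
        out = text_enc(**inputs)
    word_feats = out.last_hidden_state   # (1, p, d): word-level features
    sent_feat = out.pooler_output        # (1, d): hidden state at the [EOS] position
    return word_feats, sent_feat

words, sentence = encode_text("a dog catching a frisbee on the beach")
```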
5. Noise robust similarity learning module
For example, for similarity learning between videos and query texts, semantically related videos and query texts should be as close as possible in the feature space (i.e., have higher similarity), whereas irrelevant videos and texts should be as far apart as possible in the feature space.
Thus, contrast loss can be used to optimally train the features of text and video.
In addition, because the training samples are generated by the pre-training multi-mode text labeling large model, even if some samples with lower quality are removed by a sample filtering mode, certain noise still exists in the reserved samples. Therefore, the embodiment of the application provides similarity learning based on noise robustness, considers the quality of samples on the basis of contrast loss, and carries out different weight adjustment on each sample pair in the process of optimizing training, thereby reducing the influence of noise in the samples.
Illustratively, a pre-trained cross-modality similarity assessment model (which may correspond to the metric learning module in fig. 3) may be used to measure the similarity between each video-text pair (the text here being the text description information described above), and the resulting similarity score is used as a weight.
Illustratively, pairs of samples with higher similarity scores will be weighted more heavily during training, while pairs of samples with lower similarity scores will be weighted less heavily.
For example, given one batch of B (B ≥ 2) video-text pairs, it is desirable to simultaneously optimize the loss of video retrieving text $\mathcal{L}_{v2t}$ and the loss of text retrieving video $\mathcal{L}_{t2v}$. The loss function of the model is therefore:

$$\mathcal{L}_{v2t}=-\frac{1}{B}\sum_{i=1}^{B} w_i \log\frac{\exp\!\left(\mathrm{sim}(v_i,t_i)/\tau\right)}{\sum_{j=1}^{B}\exp\!\left(\mathrm{sim}(v_i,t_j)/\tau\right)},\qquad
\mathcal{L}_{t2v}=-\frac{1}{B}\sum_{i=1}^{B} w_i \log\frac{\exp\!\left(\mathrm{sim}(v_i,t_i)/\tau\right)}{\sum_{j=1}^{B}\exp\!\left(\mathrm{sim}(v_j,t_i)/\tau\right)},\qquad
\mathcal{L}=\mathcal{L}_{v2t}+\mathcal{L}_{t2v}$$

where, for the i-th video-text pair in the batch, $\mathrm{sim}(v_i,t_i)$ is the first similarity between the video feature representation of the video and the text feature representation of the text in that pair; $\mathrm{sim}(v_i,t_j)$ with $j\neq i$ is the second similarity between the video feature representation and the text feature representations of the text description information in the other video-text pairs in the same batch of training data; $\mathrm{sim}(v_j,t_i)$ with $j\neq i$ is the third similarity between the text feature representation and the video feature representations of the videos in the other video-text pairs; the i-th term of $\mathcal{L}_{v2t}$ is the first-type loss of the video-text pair, and the i-th term of $\mathcal{L}_{t2v}$ is its second-type loss.

The weight $w_i$ is the similarity score of the video and the text in the video-text pair, determined by the pre-trained cross-modal similarity assessment model, and is used as the training weight (i.e., the weighting coefficient of the weighted average) for that pair. $\tau$ is a temperature factor, and $\mathrm{sim}(\cdot,\cdot)$ denotes the video-text similarity.
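A compact PyTorch sketch of this noise-robust weighted contrastive loss (the symmetric cross-entropy form and the per-pair weighting follow the description above; the temperature value is an assumption):

```python
import torch
import torch.nn.functional as F

def weighted_contrastive_loss(sim: torch.Tensor, weights: torch.Tensor, tau: float = 0.05) -> torch.Tensor:
    """sim: (B, B) matrix with sim[i, j] = similarity(video_i, text_j).
    weights: (B,) per-pair quality scores from the pre-trained similarity model."""
    logits = sim / tau
    targets = torch.arange(sim.size(0), device=sim.device)
    loss_v2t = F.cross_entropy(logits, targets, reduction="none")      # video retrieves text
    loss_t2v = F.cross_entropy(logits.t(), targets, reduction="none")  # text retrieves video
    # Higher-quality pairs contribute more to the total loss, lower-quality pairs less.
    return (weights * (loss_v2t + loss_t2v)).mean()

B = 8
sim = torch.randn(B, B)
weights = torch.rand(B)  # e.g. normalized similarity scores of the B video-text pairs
loss = weighted_contrastive_loss(sim, weights)
```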
To exploit fine-grained similarity between video frames and sentence words, the calculation formula is as follows:

$$\mathrm{sim}(v,t)=\sum_{i=1}^{m}\max_{1\le j\le p}\left(f_i^{\top} w_j\right)+\sum_{j=1}^{p}\max_{1\le i\le m}\left(f_i^{\top} w_j\right)$$

This is the fine-grained matching formula between text and video, where $v$ denotes the video feature representation, $f_i$ the feature of a single frame, $t$ the sentence-level text feature representation, and $w_j$ a word-level feature. The first max(·) searches, for each frame, for the word with the highest similarity to that frame; the second max(·) searches, for each word, for the frame with the highest similarity to that word, embodying fine-grained matching.

Here, $\max_j\left(f_i^{\top} w_j\right)$ is the fourth similarity determined for video frame $f_i$, and $\max_i\left(f_i^{\top} w_j\right)$ is the fifth similarity determined for word $w_j$; the first sum aggregates the fourth similarities over all video frames in the video, and the second sum aggregates the fifth similarities over all words in the text. Finally, the two sums are added, so that both directions, video retrieving text and text retrieving video, are taken into account in the optimization.
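The fine-grained video-text similarity can be sketched as follows (cosine normalization is an added assumption; the two max-then-sum terms mirror the formula above):

```python
import torch
import torch.nn.functional as F

def fine_grained_sim(frame_feats: torch.Tensor, word_feats: torch.Tensor) -> torch.Tensor:
    """frame_feats: (m, d) frame features; word_feats: (p, d) word features."""
    f = F.normalize(frame_feats, dim=-1)
    w = F.normalize(word_feats, dim=-1)
    pair = f @ w.t()                                 # (m, p) frame-word similarity matrix
    frame_to_word = pair.max(dim=1).values.sum()     # fourth similarity, summed over frames
    word_to_frame = pair.max(dim=0).values.sum()     # fifth similarity, summed over words
    return frame_to_word + word_to_frame

score = fine_grained_sim(torch.randn(12, 512), torch.randn(7, 512))
```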
The method provided by the application is described above. The device provided by the application is described below:
Referring to fig. 4, a schematic structural diagram of a multi-mode large-model-assisted unsupervised cross-mode video retrieval device provided by an embodiment of the present application, as shown in fig. 4, the multi-mode large-model-assisted unsupervised cross-mode video retrieval device may include:
The sampling unit 410 is configured to sample, for any video in the training set, a video frame by using a representative video frame sampling method based on a difference evaluation, so as to obtain a representative frame of the video;
The generating unit 420 is configured to generate a corresponding text label by using the pre-training multi-mode text label large model according to the representative frame of the video;
the filtering unit 430 is configured to determine a correlation between the representative frame and the corresponding text label by using the pre-training cross-mode text image matching model, and filter the text label that does not meet the correlation requirement according to the correlation between the representative frame and the corresponding text label;
a determining unit 440, configured to determine text description information of the video according to the filtered text labels, so as to obtain a video-text description information pair;
The training unit 450 is configured to train the cross-modal video retrieval model according to the video-text description information pairs corresponding to each video in the training set, so as to obtain a trained cross-modal video retrieval model; the trained cross-modal video retrieval model is used for executing cross-modal video retrieval tasks.
In some embodiments, the sampling unit 410 performs video frame sampling on the video using a representative video frame sampling method based on a difference estimate, including:
Extracting visual features of each video frame in the video respectively to obtain the visual features of each video frame;
selecting a video frame with the largest visual characteristic amplitude as a first clustering center according to the visual characteristics of each video frame;
Sequentially selecting video frames which are farthest from the existing cluster centers as new cluster centers until the number of the selected cluster centers reaches a first number;
each cluster center is determined as a representative frame of the video.
In some embodiments, the generating unit 420 generates the corresponding text label according to the representative frame of the video by using the pre-trained multi-modal text label large model, including:
For any representative frame, inputting the representative frame and a prompt word corresponding to the representative frame into the pre-training multi-mode text labeling large model to generate a corresponding text label; the prompt word is used for prompting the pretrained multi-mode text labeling large model to label the text of the representative frame.
In some embodiments, the filtering unit 430 determines a relevance between the representative frame and the corresponding text label using a pre-trained cross-modal text image matching model, and filters text labels that do not meet the relevance requirement according to the relevance between the representative frame and the corresponding text label, including:
For any video, each representative frame in the video and the text label corresponding to the representative frame are input into a pre-training cross-mode text image matching model, and the correlation degree between each representative frame in the video and the corresponding text label is determined;
and according to the correlation degree between each representative frame and the corresponding text label in the video, reserving a second number of representative frames with the top correlation degree rank and the text labels in the corresponding text labels, and deleting the rest text labels.
In some embodiments, the training unit 450 trains the cross-modal video retrieval model according to the video-text description information pair corresponding to each video in the training set, including:
for any video-text description information pair in the training set, determining similarity scores of videos and text description information in the video-text description information pair by using a pre-trained cross-mode similarity evaluation model;
determining training weights of the video-text description information pairs according to the similarity scores; wherein the training weight of the video-text description information pair is positively correlated with the similarity score;
And training the cross-mode video retrieval model according to the video-text description information pairs corresponding to the videos and training weights of the video-text description information pairs corresponding to the videos.
In some embodiments, the cross-modality video retrieval model includes a text encoder, a video encoder, and a sequence Transformer module;
The training unit 450 trains the cross-modal video retrieval model according to the video-text description information pairs corresponding to each video in the training set, including:
For any video-text description information pair in any batch of training data, respectively extracting text characteristics of text description information in the video-text description information pair by using the text encoder to obtain text characteristic representation corresponding to the text description information, extracting video characteristics of downsampled video corresponding to video in the video-text description information pair by using the video encoder, and processing video characteristic extraction results by using the sequence Transformer module to obtain video characteristic representation;
Determining a first type of penalty based on a first similarity between a video feature representation of the video in the video-text description information pair and a text feature representation of the text description information in the video-text description information pair, and a second similarity between the video feature representation and text feature representations of the text description information in other video-text description information pairs in the same batch of training data; and
Determining a second type of loss based on a first similarity between the video feature representation of the video in the video-text description information pair and the text feature representation of the text description information in the video-text description information pair, and a third similarity between the text feature representation and the video feature representations of the video in other video-text description information pairs in the same batch of training data;
And determining the training loss of the batch of training data according to the first type loss and the second type loss corresponding to each video-text description information pair in the batch of training data.
In some embodiments, the similarity between the video feature representation of the video and the text feature representation of the text description information is determined by:
for any video frame in the video, searching a word with the highest feature similarity between the feature and the video frame according to the feature of the video frame, and determining the fourth similarity between the word and the video frame as the similarity corresponding to the video frame; and
For any word in the text description information, searching a video frame with the highest feature similarity between the feature and the word according to the feature of the word, and determining the fifth similarity between the video frame and the word as the similarity corresponding to the word;
And determining the similarity between the video characteristic representation of the video and the text characteristic representation of the text description information according to the fourth similarity corresponding to each video frame in the video and the fifth similarity corresponding to each word in the text description information.
The embodiment of the application also provides a multi-mode large-model-assisted non-supervision cross-mode video retrieval device, which can comprise:
an acquisition unit for acquiring target text information for video retrieval;
The feature extraction unit is used for extracting text features of the target text information by using the trained cross-mode video retrieval model to obtain text feature information of the target text information;
the trained cross-modal video retrieval model can be obtained by training in the mode described in the embodiment of the method;
The searching unit is used for determining a target video matched with the target text information according to the text characteristic information of the target text information and the video characteristics of the candidate video; and extracting video features of the candidate videos by using the trained cross-mode video retrieval model.
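For illustration, retrieval with the trained model might proceed as in the following sketch; encode_text and encode_video stand in for the trained text and video encoders described above and are hypothetical placeholders:

```python
import torch
import torch.nn.functional as F

def retrieve(query: str, candidate_videos: list, encode_text, encode_video, top_k: int = 5):
    """Rank candidate videos by the similarity of their features to the text query."""
    q = encode_text(query)                                             # (d,) query text feature
    v = torch.stack([encode_video(vid) for vid in candidate_videos])   # (N, d) candidate features
    scores = F.cosine_similarity(v, q.unsqueeze(0), dim=-1)            # (N,) similarity scores
    best = torch.topk(scores, k=min(top_k, len(candidate_videos))).indices
    return [candidate_videos[i] for i in best.tolist()]
```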
The embodiment of the application also provides electronic equipment, which comprises a processor and a memory, wherein the memory is used for storing a computer program; and the processor is used for realizing the multi-mode large-model-assisted non-supervision cross-mode video retrieval method when executing the programs stored in the memory.
Fig. 5 is a schematic diagram of a hardware structure of an electronic device according to an embodiment of the present application. The electronic device may include a processor 501, a memory 502 storing machine-executable instructions. The processor 501 and the memory 502 may communicate via a system bus 503. Also, by reading and executing machine-executable instructions corresponding to the multimodal, large model-assisted, unsupervised, cross-modal video retrieval logic in memory 502, processor 501 may perform the multimodal, large model-assisted, unsupervised, cross-modal video retrieval method described above.
The memory 502 referred to herein may be any electronic, magnetic, optical, or other physical storage device that may contain or store information, such as executable instructions, data, or the like. For example, a machine-readable storage medium may be: RAM (Random Access Memory), volatile memory, non-volatile memory, flash memory, a storage drive (e.g., a hard drive), a solid state disk, any type of storage disk (e.g., an optical disc, a DVD, etc.), a similar storage medium, or a combination thereof.
In some embodiments, a machine-readable storage medium, such as memory 502 in fig. 5, is also provided, having stored thereon machine-executable instructions that when executed by a processor implement the multi-modal large model-aided, unsupervised cross-modal video retrieval method described above. For example, the machine-readable storage medium may be ROM, RAM, CD-ROM, magnetic tape, floppy disk, optical data storage device, etc.
Embodiments of the present application also provide a computer program product storing a computer program which, when executed by a processor, causes the processor to perform the multi-modal large-model-assisted unsupervised cross-modal video retrieval method described hereinabove.
Claims (11)
1. A multi-mode large model assisted non-supervision cross-mode video retrieval method is characterized by comprising the following steps:
for any video in the training set, performing video frame sampling on the video by using a representative video frame sampling method based on the difference evaluation to obtain a representative frame of the video;
generating a corresponding text label by utilizing a pre-training multi-mode text label large model according to the representative frame of the video;
Determining the correlation degree between the representative frame and the corresponding text label by using a pre-training cross-mode text image matching model, and filtering the text labels which do not meet the correlation degree requirement according to the correlation degree between the representative frame and the corresponding text label;
Determining text description information of the video according to the filtered text labels to obtain a video-text description information pair;
training the cross-modal video retrieval model according to the video-text description information pairs corresponding to the videos in the training set to obtain a trained cross-modal video retrieval model; the trained cross-modal video retrieval model is used for executing cross-modal video retrieval tasks.
2. The method of claim 1, wherein the video is video frame sampled using a representative video frame sampling method based on a difference estimate, comprising:
Extracting visual features of each video frame in the video respectively to obtain the visual features of each video frame;
selecting a video frame with the largest visual characteristic amplitude as a first clustering center according to the visual characteristics of each video frame;
Sequentially selecting video frames which are farthest from the existing cluster centers as new cluster centers until the number of the selected cluster centers reaches a first number;
each cluster center is determined as a representative frame of the video.
3. The method of claim 1, wherein generating the corresponding text labels from the representative frames of the video using the pre-trained multi-modal text label large model comprises:
For any representative frame, inputting the representative frame and a prompt word corresponding to the representative frame into the pre-training multi-mode text labeling large model to generate a corresponding text label; the prompt word is used for prompting the pretrained multi-mode text labeling large model to label the text of the representative frame.
4. The method of claim 1, wherein determining the relevance between the representative frame and the corresponding text label using the pre-trained cross-modal text image matching model, and filtering text labels that do not meet the relevance requirement based on the relevance between the representative frame and the corresponding text label, comprises:
For any video, each representative frame in the video and the text label corresponding to the representative frame are input into a pre-training cross-mode text image matching model, and the correlation degree between each representative frame in the video and the corresponding text label is determined;
and according to the correlation degree between each representative frame and the corresponding text label in the video, reserving a second number of representative frames with the top correlation degree rank and the text labels in the corresponding text labels, and deleting the rest text labels.
5. The method according to claim 1, wherein training the cross-modality video retrieval model based on the video-text description information pairs corresponding to each video in the training set comprises:
for any video-text description information pair in the training set, determining similarity scores of videos and text description information in the video-text description information pair by using a pre-trained cross-mode similarity evaluation model;
determining training weights of the video-text description information pairs according to the similarity scores; wherein the training weight of the video-text description information pair is positively correlated with the similarity score;
And training the cross-mode video retrieval model according to the video-text description information pairs corresponding to the videos and training weights of the video-text description information pairs corresponding to the videos.
6. The method of claim 1, wherein the cross-modality video retrieval model comprises a text encoder, a video encoder, and a sequence Transformer module;
Training the cross-mode video retrieval model according to the video-text description information pairs corresponding to each video in the training set, wherein the training comprises the following steps:
For any video-text description information pair in any batch of training data, respectively extracting text characteristics of text description information in the video-text description information pair by using the text encoder to obtain text characteristic representation corresponding to the text description information, extracting video characteristics of downsampled video corresponding to video in the video-text description information pair by using the video encoder, and processing video characteristic extraction results by using the sequence Transformer module to obtain video characteristic representation;
Determining a first type of penalty based on a first similarity between a video feature representation of the video in the video-text description information pair and a text feature representation of the text description information in the video-text description information pair, and a second similarity between the video feature representation and text feature representations of the text description information in other video-text description information pairs in the same batch of training data; and
Determining a second type of loss based on a first similarity between the video feature representation of the video in the video-text description information pair and the text feature representation of the text description information in the video-text description information pair, and a third similarity between the text feature representation and the video feature representations of the video in other video-text description information pairs in the same batch of training data;
And determining the training loss of the batch of training data according to the first type loss and the second type loss corresponding to each video-text description information pair in the batch of training data.
7. The method of claim 6, wherein the similarity between the video feature representation of the video and the text feature representation of the text description information is determined by:
for any video frame in the video, searching a word with the highest feature similarity between the feature and the video frame according to the feature of the video frame, and determining the fourth similarity between the word and the video frame as the similarity corresponding to the video frame; and
For any word in the text description information, searching a video frame with the highest feature similarity between the feature and the word according to the feature of the word, and determining the fifth similarity between the video frame and the word as the similarity corresponding to the word;
And determining the similarity between the video characteristic representation of the video and the text characteristic representation of the text description information according to the fourth similarity corresponding to each video frame in the video and the fifth similarity corresponding to each word in the text description information.
8. A multi-mode large model assisted non-supervision cross-mode video retrieval method is characterized by comprising the following steps:
Acquiring target text information for video retrieval;
Extracting text features of the target text information by using the trained cross-mode video retrieval model to obtain text feature information of the target text information; wherein the trained cross-modal video retrieval model is trained using the method of any one of claims 1-7;
determining a target video matched with the target text information according to the text characteristic information of the target text information and the video characteristics of the candidate video; and extracting video features of the candidate videos by using the trained cross-mode video retrieval model.
9. A multi-modal large-model-aided unsupervised cross-modal video retrieval apparatus, comprising:
the sampling unit is used for sampling video frames of any video in the training set by using a representative video frame sampling method based on the difference evaluation to obtain a representative frame of the video;
the generation unit is used for generating corresponding text labels by utilizing a pre-training multi-mode text label large model according to the representative frame of the video;
The filtering unit is used for determining the correlation degree between the representative frame and the corresponding text label by utilizing the pre-training cross-mode text image matching model, and filtering the text label which does not meet the correlation degree requirement according to the correlation degree between the representative frame and the corresponding text label;
The determining unit is used for determining text description information of the video according to the filtered text labels to obtain a video-text description information pair;
The training unit is used for training the cross-modal video retrieval model according to the video-text description information pairs corresponding to the videos in the training set to obtain a trained cross-modal video retrieval model; the trained cross-modal video retrieval model is used for executing cross-modal video retrieval tasks.
10. An electronic device comprising a processor and a memory, wherein,
A memory for storing a computer program;
a processor for implementing the method of any one of claims 1-7 or 8 when executing a program stored on a memory.
11. A computer program product, characterized in that the computer program product has stored therein a computer program which, when executed by a processor, implements the method of any of claims 1-7 or 8.