CN118427608A - Multi-modal image language model combined prompt learning method and device - Google Patents
Multi-modal image language model combined prompt learning method and device
Info
- Publication number
- CN118427608A CN118427608A CN202410477595.3A CN202410477595A CN118427608A CN 118427608 A CN118427608 A CN 118427608A CN 202410477595 A CN202410477595 A CN 202410477595A CN 118427608 A CN118427608 A CN 118427608A
- Authority
- CN
- China
- Prior art keywords
- image
- prompt
- text
- encoder
- language model
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Links
- 238000000034 method Methods 0.000 title claims abstract description 62
- 230000004927 fusion Effects 0.000 claims abstract description 58
- 238000012549 training Methods 0.000 claims abstract description 34
- 230000006870 function Effects 0.000 claims description 27
- 239000011159 matrix material Substances 0.000 claims description 18
- 230000007246 mechanism Effects 0.000 claims description 16
- 238000010606 normalization Methods 0.000 claims description 14
- 238000012545 processing Methods 0.000 claims description 14
- 238000009826 distribution Methods 0.000 claims description 12
- 230000008569 process Effects 0.000 claims description 12
- 230000004913 activation Effects 0.000 claims description 11
- 238000010276 construction Methods 0.000 claims description 9
- 238000004590 computer program Methods 0.000 claims description 8
- 230000010365 information processing Effects 0.000 claims description 4
- 230000003993 interaction Effects 0.000 abstract description 11
- 238000012360 testing method Methods 0.000 description 12
- 238000010586 diagram Methods 0.000 description 10
- 238000007781 pre-processing Methods 0.000 description 6
- 230000000694 effects Effects 0.000 description 5
- 238000005457 optimization Methods 0.000 description 5
- 238000013507 mapping Methods 0.000 description 4
- 238000012795 verification Methods 0.000 description 4
- 238000013135 deep learning Methods 0.000 description 3
- 238000005516 engineering process Methods 0.000 description 3
- 238000003058 natural language processing Methods 0.000 description 3
- 238000010200 validation analysis Methods 0.000 description 3
- 230000000007 visual effect Effects 0.000 description 3
- 230000009286 beneficial effect Effects 0.000 description 2
- 238000004364 calculation method Methods 0.000 description 2
- 230000008014 freezing Effects 0.000 description 2
- 238000007710 freezing Methods 0.000 description 2
- 230000006872 improvement Effects 0.000 description 2
- 238000011160 research Methods 0.000 description 2
- 238000000137 annealing Methods 0.000 description 1
- 238000013528 artificial neural network Methods 0.000 description 1
- 230000000052 comparative effect Effects 0.000 description 1
- 230000001143 conditioned effect Effects 0.000 description 1
- 230000008878 coupling Effects 0.000 description 1
- 238000010168 coupling process Methods 0.000 description 1
- 238000005859 coupling reaction Methods 0.000 description 1
- 230000001351 cycling effect Effects 0.000 description 1
- 230000007547 defect Effects 0.000 description 1
- 238000013461 design Methods 0.000 description 1
- 230000002708 enhancing effect Effects 0.000 description 1
- 238000011156 evaluation Methods 0.000 description 1
- 238000002474 experimental method Methods 0.000 description 1
- 238000012986 modification Methods 0.000 description 1
- 230000004048 modification Effects 0.000 description 1
- 238000004806 packaging method and process Methods 0.000 description 1
- 230000000750 progressive effect Effects 0.000 description 1
- 238000006467 substitution reaction Methods 0.000 description 1
- 238000012546 transfer Methods 0.000 description 1
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/241—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
- G06F18/2415—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/25—Fusion techniques
- G06F18/253—Fusion techniques of extracted features
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/048—Activation functions
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Physics & Mathematics (AREA)
- Artificial Intelligence (AREA)
- General Engineering & Computer Science (AREA)
- Life Sciences & Earth Sciences (AREA)
- Evolutionary Computation (AREA)
- General Physics & Mathematics (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Bioinformatics & Computational Biology (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Evolutionary Biology (AREA)
- Biophysics (AREA)
- Computing Systems (AREA)
- Mathematical Physics (AREA)
- Software Systems (AREA)
- Molecular Biology (AREA)
- General Health & Medical Sciences (AREA)
- Computational Linguistics (AREA)
- Biomedical Technology (AREA)
- Health & Medical Sciences (AREA)
- Probability & Statistics with Applications (AREA)
- Image Analysis (AREA)
Abstract
The embodiments of this specification provide a multi-modal image language model combined prompt learning method, device, equipment and medium. The method comprises: constructing a dataset comprising text data spliced with text prompts and image data spliced with image prompts; constructing a CCPL model based on the CLIP model, the CCPL model comprising a prompt updating module and a feature fusion module, wherein the prompt updating module is arranged between corresponding layers of the image encoder and the text encoder and updates the prompts of the image and text branches, and the feature fusion module deeply fuses the output text features and image features and predicts the classification probability; and inputting the dataset into the multi-modal image language model for training until a convergence condition is met, obtaining the trained multi-modal image language model. The method solves the problem of poor generalization caused by the lack of deep interaction information between the image and text branches in existing methods.
Description
Technical Field
The present invention relates to the field of information processing technologies, and in particular to a multi-modal image language model combined prompt learning method, apparatus, device, and medium.
Background
Because of their huge number of parameters and the limited size of downstream datasets, large multi-modal image language models have difficulty transferring learned knowledge to downstream tasks. Combining prompt learning with multi-modal image language models has therefore recently become a new research direction.
In recent years, research combining multi-modal image language models with prompt learning has grown steadily. The general architecture of existing mainstream prompt learning methods is shown in fig. 1. In terms of model structure, they can be divided into two categories: structures that add prompt information only to the image encoder branch or only to the text encoder branch (as shown in fig. 1 (a)), and structures that add prompt information to both the text encoder and image encoder branches (as shown in fig. 1 (b)).
The text input prompt of the existing CLIP (Contrastive Language-Image Pre-training) model is manually set to "a photo of a <class>" (where class represents the category of the image). The content obtained by feeding this prompt into the text encoder is used as classification weights to be matched with the image features; through a contrastive learning loss objective, the distance between an image and its related text description is reduced in the embedding space while unmatched images are pushed apart in the feature space, improving the performance of the model. However, small changes in manually tuned text prompts cause large performance changes, so a significant amount of time and effort is required to find a good prompt. Inspired by prompt learning research in natural language processing (NLP), the CoOp (Context Optimization) method converts the text prompt into learnable context vectors and learns them using only a small number of labeled images, achieving large improvements over intensive manual prompt tuning on a wide range of image recognition datasets and realizing automatic generation of text prompts adapted to downstream tasks.
CoOp also has some shortcomings: the learned context overfits the base classes and generalizes poorly to unseen classes; that is, it works well on the base classes, but its performance on unseen classes is unsatisfactory and the accuracy drops significantly. To address this problem, the CoCoOp (Conditional Context Optimization) method was proposed, in which the learned text prompts depend on the input image. Specifically, CoCoOp generates an input-conditional token for each image through a lightweight neural network and combines it with the learnable context vectors in CoOp, so that each text prompt contains information about its associated image rather than only the category information fixed during training. However, the above methods focus only on the prompt of the text branch and ignore the image branch. Addressing this problem, Visual Prompt Tuning (VPT) demonstrates the feasibility of adding prompts to the image branch. However, the intra-class variance of image features and the inter-class variance of text embeddings lead to differences in the data distributions of the two, so consistent performance improvements cannot be obtained. UPT (Unified Prompt Tuning) and MaPLe (Multi-modal Prompt Learning) combine the prompts of both the text and image branches, highlighting the advantages of multi-modal prompts. MaPLe conditions on the text prompt and generates the corresponding image prompt through a mapping function. However, the text prompts and image prompts are combined only through contrastive learning, and interaction is achieved only via a simple linear mapping. In view of this, DCP (Deeply coupled Cross-modal Prompt learning) couples the text branch and the image branch through a cross-modal prompt attention module, enabling deeper interaction between the two and thereby improving performance.
In summary, existing methods either add prompts only to the text branch, or only to the image branch, or add prompts to both the image and text branches but fuse them merely through a simple linear mapping function, without deeper interaction.
Disclosure of Invention
In order to overcome the problems in the related art, the present disclosure provides a multi-modal image language model combined prompt learning method, apparatus, device, and medium, so as to solve the above technical problems.
One or more embodiments of the present disclosure provide a multi-modal image language model combined prompt learning method, comprising: constructing a dataset comprising text data spliced with text prompts and image data spliced with image prompts;
constructing a CCPL multi-modal image language model based on an open-source CLIP model, wherein the CCPL multi-modal image language model comprises a prompt updating module and a feature fusion module;
the prompt updating module is arranged between adjacent layers of the image encoder and the text encoder and is connected to the image information and text information input to and output by the two adjacent layers; the prompt updating module fuses the image prompt and the text prompt output by the upper-layer image encoder and text encoder using a cross-attention mechanism to obtain a fused image prompt and a fused text prompt respectively, and sums them, according to preset weights, with the image prompt and text prompt obtained by the current-layer image encoder and text encoder to obtain a new image prompt and text prompt, which serve respectively as the image prompt and text prompt input to the lower-layer image encoder and text encoder;
The feature fusion module is used for carrying out depth fusion on the image features output by the last layer of image encoder and the text features output by the text encoder, and predicting classification probability;
and inputting the data set into the multi-modal image language model for training until convergence conditions are met, and obtaining the trained multi-modal image language model.
One or more embodiments of the present specification provide a multi-modal image language model combined prompt learning apparatus, comprising:
a dataset construction module, used for constructing a dataset comprising text data spliced with text prompts and image data spliced with image prompts;
a model construction module, used for constructing a CCPL multi-modal image language model based on an open-source CLIP model, the CCPL multi-modal image language model comprising a prompt updating module and a feature fusion module;
wherein the prompt updating module is connected to the image information and text information input to and output by two adjacent layers of the image encoder and the text encoder; the prompt updating module fuses the image prompt and the text prompt output by the upper-layer image encoder and text encoder using a cross-attention mechanism to obtain a fused image prompt and a fused text prompt respectively, and sums them, according to preset weights, with the image prompt and text prompt obtained by the current-layer image encoder and text encoder to obtain a new image prompt and text prompt, which serve respectively as the image prompt and text prompt input to the lower-layer image encoder and text encoder;
The feature fusion module is used for carrying out depth fusion on the image features output by the last layer of image encoder and the text features output by the text encoder, and predicting classification probability;
The model training module is used for inputting the data set into the multi-modal image language model for training until convergence conditions are met, and a trained multi-modal image language model is obtained.
One or more embodiments of the present specification provide a computer device including a memory, a processor, and a computer program stored in the memory and executable on the processor, the processor implementing the multimodal image language model combined prompt learning method as described above when executing the computer program.
One or more embodiments of the present specification provide a computer readable storage medium storing a computer program which, when executed by a processor, implements the steps of the multimodal image language model combined prompt learning method as described above.
The multi-modal image language model combined prompt learning method, device, equipment and medium have the following beneficial effects: a CCPL model is constructed based on an open-source CLIP model with the base model parameters frozen, and a prompt updating module and a feature fusion module are added. The prompt updating module updates the prompts of the image and text branches, encouraging the image prompt and the text prompt to capture each other's key information; this strengthens the interaction between the prompt information of the text and image branches and improves the ability of each branch to capture the other's key information and integrate it into its own. The feature fusion module, placed at the last layer of the model, strengthens the deep fusion of image features and text features at the output end and effectively maintains the consistency of image and text features. This solves the problem that prior-art prompt learning methods focus only on prompt information within a single modality or lack obvious cross-modal information interaction, so the proposed cross-modal prompt learning technique is innovative and has relatively high potential.
Drawings
In order to more clearly describe the technical solutions in one or more embodiments of the present specification or in the prior art, the drawings required for the description of the embodiments or the prior art are briefly introduced below. It is apparent that the drawings in the following description show only some of the embodiments described in this specification, and other drawings can be obtained from them by a person skilled in the art without inventive effort.
Fig. 1 is a schematic diagram of conventional mainstream prompt learning methods according to one or more embodiments of the present disclosure, where fig. 1 (a) shows a structure in which prompt information is added only to the image encoder branch or the text encoder branch, and fig. 1 (b) shows a structure in which prompt information is added to both the text encoder branch and the image encoder branch;
FIG. 2 is a flowchart of a method for learning a multimodal image language model in combination with hints provided in one or more embodiments of the present disclosure;
FIG. 3 is an architecture diagram of a CCPL model provided in one or more embodiments of the present disclosure;
FIG. 4 is a network structure diagram of the multi-head attention (MHA) layer provided by one or more embodiments of the present disclosure;
FIG. 5 is a table of experimental data for the CCPL model and other comparative models over 7 data sets provided in one or more embodiments of the present disclosure;
FIG. 6 is a graph comparing the CCPL model and the DCP model over 7 datasets Accuracy-Epochs provided in one or more embodiments of the present disclosure;
FIG. 7 is a block diagram of a multi-modal image language model combined hint learning device provided in one or more embodiments of the present disclosure;
fig. 8 is a schematic structural diagram of a computer device according to one or more embodiments of the present disclosure.
Detailed Description
In order to enable a person skilled in the art to better understand the technical solutions in one or more embodiments of the present specification, the technical solutions in one or more embodiments of the present specification will be clearly and completely described below with reference to the drawings. It is obvious that the described embodiments are only some embodiments of the present specification, not all embodiments. All other embodiments obtained by a person of ordinary skill in the art based on one or more embodiments of the present specification without inventive effort fall within the protection scope of the present disclosure.
The invention is described in detail below with reference to the detailed description and the accompanying drawings.
Method embodiment
According to an embodiment of the present invention, a multi-modal image language model combined prompt learning method is provided. Fig. 2 is a flowchart of the multi-modal image language model combined prompt learning method provided in this embodiment. The method according to the embodiment of the present invention includes:
Step S1, acquiring a dataset, constructing a prompt template, and generating text data containing spliced text prompts and image data containing spliced image prompts.
Step S2, constructing a CCPL (Cross-Coupled Prompt Learning) multi-modal image language model based on an open-source CLIP (Contrastive Language-Image Pre-training) model, wherein the multi-modal image language model comprises a prompt updating module and a feature fusion module;
the prompt updating module is arranged between corresponding layers of the image encoder and the text encoder and is connected to the image information and text information input to and output by the two adjacent layers; the prompt updating module fuses the image prompt and the text prompt output by the upper-layer image encoder and text encoder using a cross-attention mechanism to obtain a fused image prompt and a fused text prompt respectively, and sums them, according to preset weights, with the image prompt and text prompt obtained by the current-layer image encoder and text encoder to obtain a new image prompt and text prompt, which serve respectively as the image prompt and text prompt input to the lower-layer image encoder and text encoder;
The feature fusion module is used for carrying out depth fusion on the image features output by the last layer of image encoder and the text features output by the text encoder, and predicting classification probability.
In this embodiment, a pre-trained and optimized open-source CLIP model is used, and the image encoder and text encoder parameters of the entire model are frozen, so that only the parameters of the prompt updating module and the feature fusion module are updated throughout the process.
And step S3, inputting the data set into the multi-modal image language model for training until convergence conditions are met, and obtaining the trained multi-modal image language model.
In the above method, a CCPL model is constructed based on an open-source CLIP model with the base model parameters frozen, and a prompt updating module and a feature fusion module are added. An image prompt and a text prompt are spliced with the original image input and text input to serve as the input of the model. The prompts of the image and text branches are updated by the prompt updating module, which encourages the image prompt and text prompt to capture each other's key information, strengthens the interaction between the prompt information of the text and image branches, and improves the ability of each branch to capture the other's key information and integrate it into its own. The feature fusion module at the last layer of the model strengthens the deep fusion of image features and text features at the output end and effectively maintains the consistency of image and text features. This solves the problem that prior-art prompt learning methods focus only on prompt information within a single modality or lack cross-modal information interaction, resulting in poor generalization.
In this embodiment, the constructing the data set in step S1 specifically includes:
Step S11, acquiring datasets from public platforms and dividing each dataset into a training set, a validation set and a test set, wherein the training set and the validation set are used respectively for training and validating the CCPL model, and the test set is used for verifying the overall performance of the model.
In a particular embodiment, the datasets may cover a wide variety of image recognition tasks, such as classification of automobiles, flowers, airplanes, scenes, textures, satellite images, and motion videos, and each image dataset is divided into training, validation, and test sets according to its associated split json file.
Step S12, preprocessing the images in each dataset, for example by image enhancement. The images may be preprocessed using enhancement methods such as random flipping, random cropping, and random rotation, producing RGB three-channel images of 224×224 pixels, which helps the model learn robustness to different scales and positions of objects. Random horizontal flipping is further applied to the images, which helps the model learn to be insensitive to the horizontal position of an object in the image and thereby improves the generalization ability of the model.
Step S13, normalizing the data in each dataset. Specifically,
the mean and standard deviation of the training set are computed first, and all images of the dataset are standardized, which helps the model converge faster and may improve its performance. The formula is as follows:
x_norm = (x - μ) / σ
where x is the input image, μ is the mean of the corresponding dataset, and σ is the standard deviation of the corresponding dataset.
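The following is a minimal sketch of such a preprocessing pipeline, assuming torchvision transforms; the specific augmentation parameters and the dataset statistics shown are illustrative, not values taken from this embodiment.

```python
import torch
from torchvision import transforms

def build_train_transform(mean, std):
    # mean/std are the per-channel statistics computed on the corresponding training set
    return transforms.Compose([
        transforms.RandomResizedCrop(224),      # random cropping to 224x224 pixels
        transforms.RandomHorizontalFlip(),      # random horizontal flipping
        transforms.RandomRotation(15),          # random rotation (angle is an assumption)
        transforms.ToTensor(),                  # RGB image -> float tensor in [0, 1]
        transforms.Normalize(mean, std),        # x_norm = (x - mu) / sigma
    ])

# Example usage with assumed per-channel statistics:
train_tf = build_train_transform(mean=[0.481, 0.457, 0.408], std=[0.268, 0.261, 0.276])
```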
In this embodiment, the lengths of the text prompt and the image prompt are set to 16. The initially input text prompt is initialized using the word embeddings of "a photo of a <class>" from the pre-trained CLIP model, the image prompt is randomly initialized from a normal distribution, and the image prompt and text prompt are spliced with the original visual input and text input to serve as the input of the model.
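A minimal sketch of this prompt initialization is given below, assuming the OpenAI `clip` package. Filling the positions beyond the template tokens with small random vectors, and the scaling factor 0.02, are assumptions for illustration.

```python
import torch
import torch.nn as nn
import clip  # OpenAI CLIP package (assumed available)

n_ctx = 16                                                # prompt length used in this embodiment
clip_model, _ = clip.load("ViT-B/16", device="cpu")
txt_dim = clip_model.token_embedding.embedding_dim        # text embedding width (512 for ViT-B/16)
vis_dim = clip_model.visual.conv1.out_channels            # vision transformer width (768 for ViT-B/16)

# Text prompt: initialized from the word embeddings of the hand-crafted template;
# remaining context positions are filled with small random vectors (assumption).
tokens = clip.tokenize("a photo of a")                    # the <class> token is appended per category later
with torch.no_grad():
    emb = clip_model.token_embedding(tokens).squeeze(0)   # (77, txt_dim)
init = torch.randn(n_ctx, txt_dim) * 0.02
init[:4] = emb[1:5]                                       # the 4 template tokens after the SOS token
text_prompt = nn.Parameter(init)                          # learnable text prompt

# Image prompt: randomly initialized from a normal distribution.
image_prompt = nn.Parameter(torch.randn(n_ctx, vis_dim) * 0.02)
```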
In this embodiment, referring to fig. 3, the CCPL model framework is built on the deep learning library PyTorch. The CCPL backbone uses a pre-trained ViT-B/16 CLIP (Contrastive Language-Image Pre-training) model, whose image encoder and text encoder each consist of 12 layers, with all encoder parameters frozen. Fig. 3 schematically shows a CCPL model framework with 2 prompt updating modules.
In this embodiment, the prompt updating module is a CCPG module consisting of two multi-head attention (Multihead Attention) layers, a residual connection, and an MLP (Multilayer Perceptron) layer, wherein,
① The Multihead Attention layer realizes the following steps:
Attention score matrix calculation: one multi-head attention layer takes the text prompt information as the query (Q) and the image prompt information as the key (K), and their dot product gives a similarity matrix, i.e., the first attention score matrix; the other multi-head attention layer takes the image prompt information as the query (Q) and the text prompt information as the key (K), and their dot product gives a similarity matrix, i.e., the second attention score matrix.
Normalization: the first and second attention score matrices are normalized by a softmax activation function, respectively, to obtain probability distributions.
Weighted summation: and respectively carrying out weighted summation on the normalization processing results and the image prompt information (Vi) and the text prompt information (V t) to obtain a final fusion image prompt and a final fusion text prompt.
Accordingly, in this embodiment, at least one CCPG module is provided, and at most 11 modules are provided.
In this embodiment, nine CCPG modules are preferably provided, which respectively update the text prompts and image prompts of layers 1-10 of the image encoder and text encoder of the CCPL model.
② The residual connection specifically realizes the following processes:
the image prompt and text prompt output respectively by the upper-layer image encoder and text encoder are passed to the MLP layer with weight coefficient α, and the fused prompts obtained from the Multihead Attention layers are passed to the MLP layer with weight coefficient (1-α); the residual connection thus balances the importance of the upper-layer encoder prompt information and the updated prompt information.
In this embodiment, the value of α is preferably in the range of 0 to 1, and specifically, the value of α is 0.9.
③ The MLP layer specifically realizes the following processes:
the MLP layer adds the input image prompt and fused image prompt, and the input text prompt and fused text prompt, respectively, according to the preset weight coefficients of the residual connection, and correspondingly obtains the image prompt and text prompt for the next layer of the image encoder and text encoder.
Thus, referring to fig. 3 and fig. 4 (fig. 4 is the network structure diagram of the multi-head attention MHA layer), the information processing and calculation of the CCPG module can be expressed by the following formulas:
P_t^(z+1) = MLP(α·P_t^z + (1-α)·MHA(Q=P_t^z, K=P_i^z, V=P_i^z))
P_i^(z+1) = MLP(α·P_i^z + (1-α)·MHA(Q=P_i^z, K=P_t^z, V=P_t^z)), z = 1, 2, ..., N-1
where P_t^z and P_i^z respectively represent the text prompt and the image prompt at the z-th layer encoder, and P_t^(z+1) and P_i^(z+1) respectively represent the text prompt and the image prompt at the (z+1)-th layer encoder. Notably, CCPG modules are added in layers 1 through N of the text encoder and image encoder, N ≤ L (L represents the number of encoder layers).
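A minimal PyTorch sketch of such a CCPG-style module is given below, under the description above: two multi-head attention layers exchange information between the text and image prompts, a weighted residual with coefficient α balances the old and fused prompts, and an MLP produces the prompts for the next encoder layer. The class name, hidden sizes, GELU activation, and the assumption that both prompts share one dimension (in practice the two branches may need projections) are all illustrative.

```python
import torch
import torch.nn as nn

class CCPG(nn.Module):
    def __init__(self, dim: int, num_heads: int = 8, alpha: float = 0.9):
        super().__init__()
        self.alpha = alpha
        # text queries attend to image prompts, and image queries attend to text prompts
        self.t2i_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.i2t_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.mlp_t = nn.Sequential(nn.Linear(dim, dim * 4), nn.GELU(), nn.Linear(dim * 4, dim))
        self.mlp_i = nn.Sequential(nn.Linear(dim, dim * 4), nn.GELU(), nn.Linear(dim * 4, dim))

    def forward(self, p_t: torch.Tensor, p_i: torch.Tensor):
        # p_t, p_i: (batch, prompt_len, dim) text / image prompts from encoder layer z
        fused_t, _ = self.t2i_attn(query=p_t, key=p_i, value=p_i)   # fused text prompt
        fused_i, _ = self.i2t_attn(query=p_i, key=p_t, value=p_t)   # fused image prompt
        # weighted residual, then MLP, giving the prompts fed to layer z+1
        p_t_next = self.mlp_t(self.alpha * p_t + (1 - self.alpha) * fused_t)
        p_i_next = self.mlp_i(self.alpha * p_i + (1 - self.alpha) * fused_i)
        return p_t_next, p_i_next
```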
In this embodiment, the core principle of the Cross-Coupled Prompt Generator (CCPG) module is to update the image prompt and text prompt of each image encoder layer and text encoder layer using a cross-attention mechanism, allowing the image prompt and text prompt to capture each other's key information and integrate it into their own. This effectively addresses the problem that conventional prompt learning methods focus only on prompt information within a single modality, or focus on both modalities but lack information interaction between them, leaving the interaction weak.
In this embodiment, a feature fusion module for deep feature fusion is placed at the end of the model; its main role is to strengthen the deep fusion of image features and text features and to maintain the semantic consistency of image and text. The feature fusion module is a CMF (Cross-Modal Fusion) module consisting of a cross-attention module, a multi-layer perceptron (MLP) layer, a linear layer, and a sigmoid activation function. The module takes the image features output by the last image encoder layer as Q in the cross-attention mechanism and the text features output by the last text encoder layer as K and V, inputs the result into the MLP layer, and finally predicts the classification probability through a classification head consisting of a linear layer and a sigmoid activation function, thereby obtaining the classification result.
In this embodiment, the MLP layer likewise consists of two linear layers and one QuickGELU activation unit.
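A sketch of such a CMF-style fusion head is shown below, consistent with the description above: image features serve as queries over text features in a cross-attention layer, followed by an MLP of two linear layers with QuickGELU and a linear-plus-sigmoid classification head. The token pooling, dimensions, and the QuickGELU definition used here are assumptions for illustration.

```python
import torch
import torch.nn as nn

class QuickGELU(nn.Module):
    def forward(self, x):
        # smooth GELU approximation commonly used in CLIP-style models
        return x * torch.sigmoid(1.702 * x)

class CMF(nn.Module):
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(dim, dim * 4), QuickGELU(), nn.Linear(dim * 4, dim))
        self.head = nn.Linear(dim, 1)            # matching score per image-text pair

    def forward(self, img_feat: torch.Tensor, txt_feat: torch.Tensor):
        # img_feat: (batch, n_img_tokens, dim); txt_feat: (batch, n_txt_tokens, dim)
        fused, _ = self.cross_attn(query=img_feat, key=txt_feat, value=txt_feat)
        fused = self.mlp(fused).mean(dim=1)                   # pool the fused tokens (assumption)
        return torch.sigmoid(self.head(fused)).squeeze(-1)    # predicted match probability p_itm
```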
In this embodiment, in the CCPL model training stage, the CCPL model is optimized through the configured loss function. The loss function of the CCPL model consists of two parts, an image-text matching loss function L_ITM and the original contrastive learning loss L_CL, and their sum is taken as the final optimization objective. The loss L_ITM is used in the image-text matching (ITM) task: to determine whether an image-text feature pair matches, the cross entropy between the prediction probability p_itm of the CMF module and the label y_itm is used. The original contrastive learning loss L_CL is used in the contrastive learning (CL) task to reduce the distance between a predicted sample and its positive sample while enlarging the distance to negative samples: the CLS token is extracted from the last image encoder layer, the EOT token is extracted from the last text encoder layer, the cosine similarity between them is computed as the contrastive learning prediction probability p_cl, and the contrastive learning loss is the cross entropy between p_cl and the image label y_cl. The specific formulas of the two loss functions are as follows:
L_CL = E_(I,T)~D H(y_cl, p_cl(I,T))
L_ITM = E_(I,T)~D H(y_itm, p_itm(I,T))
L = L_CL + L_ITM;
where y_cl and y_itm both represent the true class of the image, p represents the probability of the image class prediction, D represents the probability distribution of the prediction classification, and H represents the cross entropy.
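The following is a sketch of this combined objective, assuming cosine-similarity logits with a temperature for the contrastive term and binary cross entropy for the ITM term; the temperature value and function signature are illustrative.

```python
import torch
import torch.nn.functional as F

def ccpl_loss(cls_feat, eot_feat, labels_cl, p_itm, y_itm, temperature=0.01):
    """L = L_CL + L_ITM as described above (a sketch, not the exact implementation)."""
    # Contrastive term: cosine similarity between the image CLS feature and per-class text EOT features
    cls_feat = F.normalize(cls_feat, dim=-1)          # (batch, dim)
    eot_feat = F.normalize(eot_feat, dim=-1)          # (num_classes, dim)
    logits = cls_feat @ eot_feat.t() / temperature    # p_cl before the softmax inside cross_entropy
    loss_cl = F.cross_entropy(logits, labels_cl)

    # Image-text matching term: cross entropy between the CMF prediction p_itm and the match label y_itm
    loss_itm = F.binary_cross_entropy(p_itm, y_itm.float())

    return loss_cl + loss_itm
```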
This embodiment further comprises performing classification prediction with the trained CCPL model on a test set composed of few-shot image data, which specifically includes the following steps:
generating few-shot data according to num_shot and storing the sampled data in a shot_{num_shot}-seed_{seed_num}.pkl file, such as shot_16-seed1.pkl; the data in each .pkl file are num_shot samples drawn at random from every category of the source dataset; seed_num takes the values 1, 2, and 3, producing 3 .pkl files, and the average over the three random seeds is finally taken as the result for each num_shot.
For each dataset, 1/2/4/8/16-shot settings are selected for training.
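A minimal sketch of this few-shot split generation follows; the file layout, field names, and the `train_samples` variable are assumptions for illustration.

```python
import pickle
import random
from collections import defaultdict

def build_fewshot_split(samples, num_shot, seed_num, out_dir="."):
    # samples: list of (image_path, class_label) pairs from the source training set
    random.seed(seed_num)
    by_class = defaultdict(list)
    for path, label in samples:
        by_class[label].append((path, label))
    subset = []
    for label, items in by_class.items():
        # draw num_shot samples per category (fewer if the category is small)
        subset.extend(random.sample(items, min(num_shot, len(items))))
    out_path = f"{out_dir}/shot_{num_shot}-seed_{seed_num}.pkl"
    with open(out_path, "wb") as f:
        pickle.dump(subset, f)
    return out_path

# Example: 16-shot splits for three random seeds, later averaged during evaluation
paths = [build_fewshot_split(train_samples, 16, s) for s in (1, 2, 3)]  # train_samples assumed given
```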
In this embodiment, the average value of the num_shot results is obtained by the following steps:
Step A1, to ensure the stability of the experimental results, three random seeds are used for each few-shot setting, and the final result is the average of the three accuracies obtained with seed1, seed2, and seed3.
Step A2, for seed1, the number of iterations is initialized to epoch = 0, where epoch ranges from 0 to 20 and the training batch size batchsize = 4;
In this embodiment, 20 epochs are used for most datasets; the exception is SUN397, where 5 epochs are used for the 1/2-shot settings. These parameters are set to optimize performance.
Step A3, the model is trained with a stochastic gradient descent (SGD) optimizer, with a maximum training period of 20 epochs; one warm-up epoch uses a learning rate of lr = 0.00001, formal training uses a learning rate of lr = 0.0035, and the learning rate is adjusted with a cosine annealing strategy during training (a training-configuration sketch is given after step A5).
Step A4, the CCPL model is evaluated through PyTorch built-in functions to obtain the loss values on the training set and the results of the few-shot task; training iterates over the set epoch values, and after the 20 training epochs end, the current model is evaluated on the validation set, with the experimental results of the whole process saved.
Step A5, after the test with seed1 is completed, seed2 is taken, and the loop continues through steps A1-A4 up to seed3; the average of the three results for seed1, seed2, and seed3 is taken as the final result.
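The following is a training-configuration sketch matching step A3: SGD with a one-epoch warm-up at lr = 1e-5, a base rate of 0.0035, and cosine annealing over the remaining epochs. Implementing the schedule with LambdaLR, the momentum value, and the `ccpl_model` variable are assumptions; only the prompt-update and feature-fusion parameters are passed to the optimizer because the CLIP backbone is frozen.

```python
import math
import torch

max_epoch, base_lr, warmup_lr = 20, 0.0035, 1e-5
params = [p for p in ccpl_model.parameters() if p.requires_grad]   # ccpl_model assumed built already
optimizer = torch.optim.SGD(params, lr=base_lr, momentum=0.9)      # momentum value is an assumption

def lr_lambda(epoch):
    if epoch < 1:                                  # one warm-up epoch at the constant warm-up rate
        return warmup_lr / base_lr
    t = (epoch - 1) / max(max_epoch - 1, 1)        # cosine annealing for the remaining epochs
    return 0.5 * (1.0 + math.cos(math.pi * t))

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)

for epoch in range(max_epoch):
    # ... one training epoch over the few-shot loader ...
    scheduler.step()
```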
In this embodiment, accuracy is selected as the evaluation index for the CCPL model few-shot task; the closer the accuracy is to 1, the better the prediction result.
The effectiveness and advantages of the present technology are described below by way of specific examples.
The first step: acquire the datasets required by the model, preprocess the data, and divide the data into a training set, a validation set, and a test set.
Seven public image recognition datasets were downloaded: StanfordCars, Flowers102, FGVCAircraft, SUN397, DTD, EuroSAT, and UCF101. These datasets cover a wide variety of image recognition tasks, such as classification of cars, flowers, airplanes, scenes, textures, satellite images, and motion videos.
The related dataset split files are downloaded, such as split_zhou_OxfordFlowers.json for the Flowers102 dataset.
Each dataset was divided into a training set, a validation set, and a test set; the training and validation sets are used respectively to train and validate our CCPL model, while the test set is used to verify the overall performance of the model. In this study, each image dataset is partitioned into training, validation, and test sets according to its associated split json file.
The second step: construct the Cross-Coupled Prompt Learning model.
Build the deep learning environment. A PyTorch GPU virtual environment and the PyTorch library are installed on the server: python=3.8, torch==1.9.0+cu111, torchvision==0.10.0+cu111, torchaudio==0.9.0.
The Dassl.pytorch tool library is installed and initialized.
A Cross-Coupled Prompt Learning (CCPL) model framework is built based on the deep learning library PyTorch.
The third step: the test stage.
A large number of sample tests show that the average performance over the 7 datasets is the best, with higher accuracy on few-shot image recognition tasks. Fig. 5 gives the experimental data of the CCPL model provided in this embodiment and other comparison models on the 7 datasets; compared with the second-best model, the average performance over the 7 datasets improves by 1.06%, 1.29%, 1.93%, 1.74%, and 0.81% for the 1, 2, 4, 8, and 16-shot settings, respectively. In addition, an Accuracy-Epochs comparison between the CCPL model and the DCP model on the 7 datasets is provided, as shown in fig. 6.
Device embodiment
According to an embodiment of the present invention, a multi-modal image language model combined prompt learning apparatus is provided, as shown in fig. 7, which is a block diagram of the apparatus provided in this embodiment. The apparatus according to the embodiment of the present invention includes:
the dataset construction module 10, configured to acquire a dataset and construct a prompt template, generating text data containing spliced text prompts and image data containing spliced image prompts.
The model construction module 20, configured to construct a CCPL multi-modal image language model based on an open-source CLIP model, where the multi-modal image language model comprises a prompt updating module and a feature fusion module;
the prompt updating module is arranged between corresponding layers of the image encoder and the text encoder and is connected to the image information and text information input to and output by the two adjacent layers; the prompt updating module fuses the image prompt and the text prompt output by the upper-layer image encoder and text encoder using a cross-attention mechanism to obtain a fused image prompt and a fused text prompt respectively, and sums them, according to preset weights, with the image prompt and text prompt obtained by the current-layer image encoder and text encoder to obtain a new image prompt and text prompt, which serve respectively as the image prompt and text prompt input to the lower-layer image encoder and text encoder;
The feature fusion module is used for carrying out depth fusion on the image features output by the last layer of image encoder and the text features output by the text encoder, and predicting classification probability.
The model training module 30, configured to input the dataset into the multi-modal image language model for training until a convergence condition is met, obtaining a trained multi-modal image language model.
In the above apparatus, the image prompt and the text prompt are spliced with the original image input and text input by the dataset construction module 10 to serve as the input of the model. The prompts of the image and text branches are updated by the prompt updating module, which encourages the image prompt and text prompt to capture each other's key information, strengthens the interaction between the prompt information of the text and image branches, and improves the ability of each branch to capture the other's key information and integrate it into its own. The feature fusion module at the last layer of the model strengthens the deep fusion of image features and text features at the output end and effectively maintains the consistency of image and text features. This solves the problem that prior-art prompt learning methods focus only on prompt information within a single modality or lack cross-modal information interaction, resulting in poor generalization.
In this embodiment, a data processing module 40 is further provided for preprocessing and normalizing the images in each dataset, wherein,
The preprocessing comprises image enhancement operations; the images may be preprocessed using enhancement methods such as random flipping, random cropping, and random rotation, producing RGB three-channel images of 224×224 pixels, which helps the model learn robustness to different scales and positions of objects. Random horizontal flipping of the images further helps the model learn to be insensitive to the horizontal position of objects in the image and thereby improves the generalization ability of the model.
The normalization process specifically comprises the following steps:
the mean and standard deviation of the training set are computed first, and all images of the dataset are standardized, which helps the model converge faster and may improve its performance. The formula is as follows:
x_norm = (x - μ) / σ
where x is the input image, μ is the mean of the corresponding dataset, and σ is the standard deviation of the corresponding dataset.
In this embodiment, the text prompt and image prompt set by the data processing module 40 are spliced with the visual input and text input to serve as the input of the model, where the lengths of the text prompt and image prompt are set to 16, the initially input text prompt is initialized using the word embeddings of "a photo of a <class>" from the pre-trained CLIP model, and the image prompt is randomly initialized from a normal distribution.
In this embodiment, the prompt updating module is a CCPG module consisting of two multi-head attention (Multihead Attention) layers, a residual connection, and an MLP (Multilayer Perceptron) layer, wherein,
① The Multihead Attention layer realizes the following steps:
Attention score matrix calculation: one multi-head attention layer takes the text prompt information as the query (Q) and the image prompt information as the key (K), and their dot product gives a similarity matrix, i.e., the first attention score matrix; the other multi-head attention layer takes the image prompt information as the query (Q) and the text prompt information as the key (K), and their dot product gives a similarity matrix, i.e., the second attention score matrix.
Normalization: the first and second attention score matrices are normalized by a softmax activation function, respectively, to obtain probability distributions.
Weighted summation: and respectively carrying out weighted summation on the normalization processing results and the image prompt information (Vi) and the text prompt information (V t) to obtain a final fusion image prompt and a final fusion text prompt.
Accordingly, in this embodiment, at least one CCPG module is provided, and at most 11 modules are provided.
In this embodiment, nine CCPG modules are preferably provided, which respectively update the text prompts and image prompts of layers 1-10 of the image encoder and text encoder of the CCPL model.
② The residual connection specifically realizes the following processes:
the image prompt and text prompt output respectively by the upper-layer image encoder and text encoder are passed to the MLP layer with weight coefficient α, and the fused prompts obtained from the Multihead Attention layers are passed to the MLP layer with weight coefficient (1-α); the residual connection thus balances the importance of the upper-layer encoder prompt information and the updated prompt information.
In this embodiment, the value of α is preferably in the range of 0 to 1, and specifically, the value of α is 0.9.
③ The MLP layer specifically realizes the following processes:
the MLP layer adds the input image prompt and fused image prompt, and the input text prompt and fused text prompt, respectively, according to the preset weight coefficients of the residual connection, and correspondingly obtains the image prompt and text prompt for the next layer of the image encoder and text encoder.
Thus, referring to fig. 3 and fig. 4 (fig. 4 is the network structure diagram of the multi-head attention MHA layer), the information processing and calculation of the CCPG module can be expressed by the following formulas:
P_t^(z+1) = MLP(α·P_t^z + (1-α)·MHA(Q=P_t^z, K=P_i^z, V=P_i^z))
P_i^(z+1) = MLP(α·P_i^z + (1-α)·MHA(Q=P_i^z, K=P_t^z, V=P_t^z)), z = 1, 2, ..., N-1
where P_t^z and P_i^z respectively represent the text prompt and the image prompt at the z-th layer encoder, and P_t^(z+1) and P_i^(z+1) respectively represent the text prompt and the image prompt at the (z+1)-th layer encoder.
In this embodiment, the feature fusion module is a CMF (Cross-Modal Fusion) module consisting of a cross-attention module, a multi-layer perceptron (MLP) layer, a linear layer, and a sigmoid function. The module takes the image features output by the last image encoder layer as Q in the cross-attention mechanism and the text features output by the last text encoder layer as K and V, inputs the result into the MLP layer, and finally predicts the classification probability through a classification head consisting of a linear layer and a sigmoid activation function, thereby obtaining the classification result.
In this embodiment, the loss function of the CCPL model consists of two parts, an image-text matching loss function L_ITM and the original contrastive learning loss L_CL, and their sum is taken as the final optimization objective. The central idea is as follows: in the image-text matching (ITM) task, to determine whether an image-text feature pair matches, the cross entropy between the prediction probability p_itm of the CMF module and the label y_itm is used; in the contrastive learning (CL) task, to reduce the distance between a predicted sample and its positive sample while enlarging the distance to negative samples, the CLS token is extracted from the last image encoder layer, the EOT token is extracted from the last text encoder layer, and the cosine similarity between them is computed as the contrastive learning prediction probability p_cl; the contrastive learning loss is then the cross entropy between p_cl and the image label y_cl, which reduces the distance between the predicted sample and the positive sample while enlarging the distance to the negative samples. The specific formulas of the two loss functions are as follows:
L_CL = E_(I,T)~D H(y_cl, p_cl(I,T))
L_ITM = E_(I,T)~D H(y_itm, p_itm(I,T))
L = L_CL + L_ITM;
Wherein y cl and y itm both represent the true class of the image, p represents the probability of image class prediction, D represents the probability distribution of prediction classification, and H represents cross entropy.
In this embodiment, the model training module 30 performs classification prediction on the CCPL model after training through a test set formed by few-shot image data, and a large number of experiments prove that the overall level of the proposed method on the few-shot image recognition task is optimal.
The embodiment of the present invention is an embodiment of a device corresponding to the embodiment of the method, and specific operations of processing steps of each module may be understood by referring to descriptions of the embodiment of the method, which are not repeated herein.
As shown in fig. 8, the present invention further provides a computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements the multi-modal image language model combined prompt learning method of the above embodiments; when executed by the processor, the computer program implements the following method steps:
Step S1, acquiring a dataset, constructing a prompt template, and generating text data containing spliced text prompts and image data containing spliced image prompts.
Step S2, constructing a CCPL multi-modal image language model based on an open-source CLIP model, wherein the multi-modal image language model comprises a prompt updating module and a feature fusion module;
And step S3, inputting the data set into the multi-modal image language model for training until convergence conditions are met, and obtaining the trained multi-modal image language model.
In this specification, the embodiments are described in a progressive manner; identical or similar parts of the embodiments may be referred to each other, and each embodiment mainly describes its differences from the others. In particular, for the apparatus or system embodiments, since they are substantially similar to the method embodiments, the description is relatively brief, and reference may be made to the description of the method embodiments for relevant parts. The apparatus and system embodiments described above are merely illustrative; the units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units, i.e., they may be located in one place or distributed over a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. Those of ordinary skill in the art can understand and implement the invention without inventive effort.
Finally, it should be noted that the above embodiments are only intended to illustrate the technical solution of the present invention, not to limit it. Although the invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art will understand that the technical solutions described in the foregoing embodiments may still be modified, or some or all of their technical features may be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present invention.
Claims (10)
1. A multi-modal image language model combined prompt learning method is characterized by comprising the following steps:
acquiring a dataset, constructing a prompt template, and generating text data containing spliced text prompts and image data containing spliced image prompts;
constructing a CCPL multi-modal image language model based on an open-source CLIP model, wherein the CCPL multi-modal image language model comprises a prompt updating module and a feature fusion module;
wherein the prompt updating module is connected to the image information and text information input to and output by two adjacent layers of the image encoder and the text encoder; the prompt updating module fuses the image prompt and the text prompt output by the upper-layer image encoder and text encoder using a cross-attention mechanism to obtain a fused image prompt and a fused text prompt respectively, and sums them, according to preset weights, with the image prompt and text prompt obtained by the current-layer image encoder and text encoder to obtain a new image prompt and text prompt, which serve respectively as the image prompt and text prompt input to the lower-layer image encoder and text encoder;
The feature fusion module is used for carrying out depth fusion on the image features output by the last layer of image encoder and the text features output by the text encoder, and predicting classification probability;
and inputting the data set into the multi-modal image language model for training until convergence conditions are met, and obtaining the trained multi-modal image language model.
2. The multi-modal image language model combined prompt learning method as claimed in claim 1, wherein the prompt updating module is a CCPG module composed of two multi-head attention layers, a residual connection, and an MLP layer, wherein,
one multi-head attention layer takes the text prompt information as the query (Q) and the image prompt information as the key (K), performs a dot product of the two to obtain an attention score matrix, normalizes the attention score matrix through an activation function to obtain a probability distribution, and performs a weighted summation of the normalized result with the image prompt information (V) to obtain a fused image prompt;
the other multi-head attention layer takes the image prompt information as the query (Q) and the text prompt information as the key (K), performs a dot product of the two to obtain an attention score matrix, normalizes the attention score matrix through an activation function to obtain a probability distribution, and performs a weighted summation of the normalized result with the text prompt information (V) to obtain a fused text prompt;
The residual connection is used for outputting the image prompt and the text prompt respectively output by the image encoder and the text encoder at the upper layer to the MLP layer by taking the weight coefficient as alpha, and outputting the fused image prompt and the fused text prompt obtained by updating the two multi-head attention layers to the MLP layer by taking the weight coefficient as 1-alpha;
the MLP layer adds the input image prompt, text prompt, fusion image prompt and fusion text prompt respectively through residual connection with preset weight coefficients, and correspondingly obtains the image prompt and text prompt of the next layer of image encoder and text encoder.
3. The multi-modal image language model combined prompt learning method of claim 2 wherein the α value is 0.9.
4. The multi-modal image language model combined prompt learning method as claimed in claim 2 or 3, wherein the information processing and calculation process of the CCPG module is given by the following formulas:
$$[T_{z+1},\ V_{z+1}] = \mathrm{CCPG}(T_z,\ V_z), \qquad z = 1, 2, \ldots, N-1$$

wherein $T_z$ and $V_z$ respectively represent the text prompt and the image prompt at the z-th layer encoder, and $T_{z+1}$ and $V_{z+1}$ respectively represent the text prompt and the image prompt at the (z+1)-th layer encoder.
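Expanding the compact form above, one possible per-layer update consistent with claims 2 and 3 is written below; the projection matrices $W_Q, W_K, W_V$ (and their primed counterparts), the scaling by $\sqrt{d}$, and the use of softmax as the normalizing activation follow the standard attention convention and are assumptions rather than text taken from the source.

```latex
\begin{aligned}
\tilde{V}_z &= \operatorname{softmax}\!\left(\frac{(T_z W_Q)(V_z W_K)^{\top}}{\sqrt{d}}\right) V_z W_V
  && \text{(fused image prompt: Q = text, K = V = image)}\\
\tilde{T}_z &= \operatorname{softmax}\!\left(\frac{(V_z W_Q')(T_z W_K')^{\top}}{\sqrt{d}}\right) T_z W_V'
  && \text{(fused text prompt: Q = image, K = V = text)}\\
T_{z+1} &= \mathrm{MLP}\!\left(\alpha\, T_z + (1-\alpha)\,\tilde{T}_z\right), \qquad
V_{z+1} = \mathrm{MLP}\!\left(\alpha\, V_z + (1-\alpha)\,\tilde{V}_z\right)
\end{aligned}
```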
5. The multi-modal image language model combined prompt learning method of claim 1, wherein the feature fusion module is a CMF module composed of a cross attention module, an MLP layer, a linear layer and an activation function; the image feature output by the last layer of the image encoder is used as Q in the cross attention mechanism, the text features output by the last layer of the text encoder are used as K and V in the cross attention mechanism, the cross attention output is input into the MLP layer, and the classification probability is predicted through a classification head composed of a linear layer and a sigmoid activation function to obtain the classification result.
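A minimal sketch of the CMF feature fusion module described in claim 5, again using PyTorch's `nn.MultiheadAttention`. Treating the final image feature as a single query token and the text features as keys/values is one plausible reading; the hidden sizes and the GELU activation in the MLP are illustrative assumptions.

```python
import torch
import torch.nn as nn

class CMF(nn.Module):
    """Cross-modal fusion head (illustrative sketch of claim 5)."""

    def __init__(self, dim, num_classes, num_heads=8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))
        self.head = nn.Linear(dim, num_classes)   # linear layer of the classification head

    def forward(self, image_feat, text_feat):
        # image_feat: (B, 1, dim) final image-encoder feature, used as the query Q
        # text_feat:  (B, C, dim) text-encoder features, used as keys/values K, V
        fused, _ = self.cross_attn(image_feat, text_feat, text_feat)
        fused = self.mlp(fused).squeeze(1)          # (B, dim) deeply fused representation
        return torch.sigmoid(self.head(fused))      # classification probabilities
```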
6. The multi-modal image language model combined prompt learning method of claim 1, wherein the loss function of the CCPL multi-modal image language model consists of an image-text matching loss function $L_{ITM}$ and the original contrastive learning loss $L_{CL}$, and the two loss functions are given by the following formulas:
$$L_{CL} = \mathbb{E}_{(I,T)\sim D}\, H\!\left(y^{cl},\, p^{cl}(I,T)\right)$$
$$L_{ITM} = \mathbb{E}_{(I,T)\sim D}\, H\!\left(y^{itm},\, p^{itm}(I,T)\right)$$
$$L = L_{CL} + L_{ITM}$$
wherein $y^{cl}$ and $y^{itm}$ both represent the true class of the image, $p^{cl}$ and $p^{itm}$ represent the predicted class probabilities, $D$ represents the distribution from which the image-text pairs $(I,T)$ are drawn, and $H$ denotes the cross entropy.
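The combined objective of claim 6 can be sketched as two cross-entropy terms that are summed. The logit shapes and the two-way (match / no-match) form of the ITM head are assumptions commonly used with CLIP-style models; the claim itself only specifies the cross-entropy form and the sum.

```python
import torch.nn.functional as F

def ccpl_loss(p_cl, p_itm, y_cl, y_itm):
    """Sketch of L = L_CL + L_ITM.

    p_cl:  (B, C) contrastive image-text similarity logits
    p_itm: (B, 2) image-text matching logits (match / no match), assumed shape
    y_cl, y_itm: ground-truth targets for the two heads
    """
    loss_cl = F.cross_entropy(p_cl, y_cl)     # L_CL  = E H(y_cl,  p_cl(I, T))
    loss_itm = F.cross_entropy(p_itm, y_itm)  # L_ITM = E H(y_itm, p_itm(I, T))
    return loss_cl + loss_itm                 # L = L_CL + L_ITM
```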
7. A multi-modal image language model combined prompt learning device, comprising:
the data set construction module is used for acquiring and constructing a data set, constructing prompt templates, and generating text data comprising spliced text prompts and image data comprising spliced image prompts;
the model construction module is used for constructing a CCPL multi-modal image language model based on the open-source CLIP model, wherein the CCPL multi-modal image language model comprises a prompt updating module and a feature fusion module;
the prompt updating module connects the input and output image information and text information of two adjacent layers of the image encoder and the text encoder; the prompt updating module fuses, using a cross attention mechanism, the image prompt and the text prompt output by the upper-layer image encoder and text encoder to obtain a fused image prompt and a fused text prompt respectively, and sums these, according to preset weights, with the image prompt and the text prompt obtained by the current-layer image encoder and text encoder to obtain a new image prompt and a new text prompt, which are respectively used as the image prompt and the text prompt input to the lower-layer image encoder and text encoder;
the feature fusion module is used for performing deep fusion of the image features output by the last layer of the image encoder with the text features output by the text encoder, and for predicting the classification probability;
the model training module is used for inputting the data set into the multi-modal image language model for training until a convergence condition is met, to obtain a trained multi-modal image language model.
8. The multi-modal image language model combined prompt learning device of claim 7, wherein the prompt updating module is a CCPG module composed of two multi-head attention layers, a residual connection and an MLP layer, wherein:
one multi-head attention layer takes the text prompt information as the query (Q) and the image prompt information as the key (K); their dot product yields an attention score matrix, which is normalized through an activation function into a probability distribution, and the normalized result is used to compute a weighted sum over the image prompt information (V) to obtain the fused image prompt;
the other multi-head attention layer takes the image prompt information as the query (Q) and the text prompt information as the key (K); their dot product yields an attention score matrix, which is normalized through an activation function into a probability distribution, and the normalized result is used to compute a weighted sum over the text prompt information (V) to obtain the fused text prompt;
the residual connection forwards the image prompt and the text prompt output by the upper-layer image encoder and text encoder to the MLP layer with a weight coefficient of α, and forwards the fused image prompt and the fused text prompt produced by the two multi-head attention layers to the MLP layer with a weight coefficient of 1-α;
the MLP layer combines, through the residual connection with the preset weight coefficients, the input image prompt with the fused image prompt and the input text prompt with the fused text prompt, correspondingly obtaining the image prompt and the text prompt for the next layer of the image encoder and the text encoder.
9. A computer device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the computer program, implements the multi-modal image language model combined prompt learning method of any one of claims 1 to 6.
10. A computer readable storage medium storing a computer program, wherein the computer program, when executed by a processor, implements the steps of the multi-modal image language model combined prompt learning method of any one of claims 1 to 6.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202410477595.3A CN118427608A (en) | 2024-04-19 | 2024-04-19 | Multi-modal image language model combined prompt learning method and device |
Publications (1)
Publication Number | Publication Date |
---|---|
CN118427608A true CN118427608A (en) | 2024-08-02 |
Family
ID=92309818
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202410477595.3A Pending CN118427608A (en) | 2024-04-19 | 2024-04-19 | Multi-modal image language model combined prompt learning method and device |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN118427608A (en) |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |