
WO2023241410A1 - Data processing method, apparatus, device and computer medium - Google Patents

Data processing method, apparatus, device and computer medium

Info

Publication number
WO2023241410A1
WO2023241410A1 (PCT application PCT/CN2023/098690)
Authority
WO
WIPO (PCT)
Prior art keywords
information
target
initial
model
task
Prior art date
Application number
PCT/CN2023/098690
Other languages
English (en)
French (fr)
Inventor
张新松
刁诗哲
周王春澍
王嘉伟
Original Assignee
北京有竹居网络技术有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 北京有竹居网络技术有限公司
Publication of WO2023241410A1

Classifications

    • G — PHYSICS
    • G06 — COMPUTING; CALCULATING OR COUNTING
    • G06V — IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 — Arrangements for image or video recognition or understanding
    • G06V 10/70 — Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/74 — Image or video pattern matching; Proximity measures in feature spaces
    • G06V 10/75 — Organisation of the matching processes, e.g. simultaneous or sequential comparisons of image or video features; Coarse-fine approaches, e.g. multi-scale approaches; using context analysis; Selection of dictionaries
    • G06V 10/757 — Matching configurations of points or features
    • G06V 10/77 — Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V 10/774 — Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06V 20/00 — Scenes; Scene-specific elements
    • G06V 20/70 — Labelling scene content, e.g. deriving syntactic or semantic representations

Definitions

  • This application belongs to the field of artificial intelligence, and in particular relates to a data processing method, device, equipment and computer medium.
  • pre-training models in the related art generally focus only on tasks of a single task category.
  • the task categories can include: text understanding tasks, visual understanding tasks, multi-modal understanding tasks, image-to-text generation tasks, and text-to-image generation tasks.
  • the multi-modal understanding task is to understand visual information and language information at the same time to solve tasks such as visual question answering, visual reasoning, and visual entailment.
  • the image-to-text generation task requires understanding the input image information to generate corresponding text descriptions.
  • the text-to-image generation task requires generating corresponding images based on input text information.
  • the embodiments of the present application provide an implementation solution, different from the existing technology, to solve the technical problem in the existing technology that training efficiency is low when a pre-training model must be trained for task processing models corresponding to multiple tasks.
  • this application provides a data processing method, which includes: obtaining image and text feature information to be processed, where the image and text feature information to be processed includes text feature information to be processed and image feature information to be processed; the text feature information to be processed or the image feature information to be processed contains a mask identifier, and the text feature information to be processed matches the image feature information to be processed;
  • the first vector information group and the second vector information group are encoded according to the initial encoding rules to obtain the corresponding fusion vector information group; wherein the fusion vector information group includes multiple fusion vector information, each fusion vector information Related to the first vector information group and the second vector information group;
  • the initial pre-training model is trained based on the prediction result and the mask identifier to obtain a target pre-training model.
  • the target pre-training model is used to train, according to the acquired target task category, the target task processing model corresponding to the target task category.
  • this application provides a model training method, including:
  • sample task information corresponding to the target task category includes sample task input information, and a sample task result label corresponding to the sample task input information;
  • the plurality of candidate units include: a preprocessing unit, a first target vector generation unit, a first target cross-modal encoding unit, and a first target cross-modal decoding unit;
  • the target pre-training model is obtained by training the initial pre-training model in the aforementioned data processing method.
  • this application provides a task processing method, including:
  • obtaining target task information of the target task category, where the target task information includes target task input information;
  • the target task processing model is trained through the aforementioned model training method.
  • this application provides a data processing device, including:
  • the acquisition unit is used to obtain the image and text feature information to be processed, which includes the text feature information to be processed and the image feature information to be processed; the text feature information to be processed or the image feature information to be processed contains a mask identifier, and the text feature information to be processed matches the image feature information to be processed;
  • a generation unit configured to generate a first vector information group corresponding to the text feature information to be processed and a second vector information group corresponding to the image feature information to be processed based on the initial vector generation rule;
  • An encoding unit configured to encode the first vector information group and the second vector information group through initial encoding rules to obtain a corresponding fusion vector information group; wherein the fusion vector information group includes a plurality of fusion vector information , each fusion vector information is related to the first vector information group and the second vector information group;
  • a decoding unit configured to decode the fusion vector information group through initial decoding rules to obtain the prediction result corresponding to the mask identifier
  • a determination unit configured to train the initial pre-training model based on the prediction result and the mask identifier to obtain a target pre-training model, where the target pre-training model is used to train, according to the obtained target task category, the target task processing model corresponding to the target task category.
  • this application provides a model training device, including:
  • An acquisition unit configured to acquire a target task category and sample task information corresponding to the target task category, where the sample task information includes sample task input information and a sample task result label corresponding to the sample task input information; and to obtain multiple candidate units in the target pre-training model, where the plurality of candidate units include: a preprocessing unit, a first target vector generation unit, a first target cross-modal encoding unit, and a first target cross-modal decoding unit;
  • a determination unit configured to determine the target unit corresponding to the target task category from the plurality of candidate units according to the preset correspondence relationship
  • a construction unit configured to construct an initial task processing model corresponding to the target task category based on the target unit
  • a training unit configured to use the sample task information to train the initial task processing model to obtain a target task processing model for completing the target task corresponding to the target task category;
  • the target task processing model is obtained by training the initial pre-training model in the aforementioned model training method.
  • this application provides a task processing device, including:
  • An acquisition unit configured to acquire target task information of the target task category, where the target task information includes target task input information
  • a determination unit configured to determine a corresponding target task processing model according to the target task category
  • An input unit configured to input the target task input information into the target task processing model to obtain a target task processing result corresponding to the target task category and the target task input information;
  • the target task processing model is trained through the aforementioned model training method.
  • this application provides an electronic device, including:
  • a memory used to store executable instructions for the processor;
  • the processor is configured, by executing the executable instructions, to perform any method of the first aspect, the second aspect, the third aspect, or any of their possible implementations.
  • embodiments of the present application provide a computer-readable storage medium on which a computer program is stored; when the computer program is executed by a processor, any method of the first aspect, the second aspect, the third aspect, or any of their possible implementations is realized.
  • This application obtains image and text feature information to be processed, which includes text feature information to be processed and image feature information to be processed; the text feature information to be processed or the image feature information to be processed includes a mask identifier, and the text feature information to be processed matches the image feature information to be processed. The first vector information group corresponding to the text feature information to be processed and the second vector information group corresponding to the image feature information to be processed are generated based on the initial vector generation rule; the first vector information group and the second vector information group are encoded through the initial encoding rules to obtain the corresponding fusion vector information group, where the fusion vector information group includes multiple pieces of fusion vector information, each related to the first vector information group and the second vector information group; the fusion vector information group is decoded through initial decoding rules to obtain the prediction result corresponding to the mask identifier; and the initial pre-training model is trained based on the prediction result and the mask identifier to obtain a target pre-training model.
  • the target pre-training model is used to train, according to the acquired target task category, the target task processing model corresponding to the target task category.
  • this solution unifies the processing of an image and of the text matching that image into the same pre-training model, and the sample data used to train the target pre-training model involves multi-modal information. The pre-training model trained through the solution of this application can provide material for the training of multiple task processing models, thus improving training efficiency when task processing models corresponding to multiple tasks need to be trained.
  • Figure 1 is a schematic structural diagram of a data processing system provided by an embodiment of the present application.
  • Figure 2a is a schematic flow chart of a data processing method provided by an embodiment of the present application.
  • Figure 2b is another schematic flowchart of a data processing method provided by an embodiment of the present application.
  • Figure 2c is another schematic flowchart of a data processing method provided by an embodiment of the present application.
  • Figure 3a is a schematic flow chart of a model training method provided by an embodiment of the present application.
  • Figure 3b is a schematic diagram of a method for determining a target unit provided by an embodiment of the present application.
  • Figure 3c is an example diagram of the target task processing results when the target task category provided by an embodiment of the present application is to classify an image, i.e., to analyze the semantic recognition result corresponding to the image;
  • Figure 3d is an example diagram of the corresponding target task processing results when the target task category provided by an embodiment of the present application is to answer questions based on graphic and text information;
  • Figure 3e is an example diagram of the target task processing results provided by an embodiment of the present application when the target task category is to determine whether a text correctly describes an image;
  • Figure 3f is an example diagram of the target task processing results when the target task category provided by an embodiment of the present application is, given an image and a text description, to determine whether the relationship between the image and the text is entailment, contradiction, or neutral;
  • Figure 3g is an example diagram of the corresponding target task processing results when a text description of the image is output when the target task category provided by an embodiment of the present application is a given image;
  • Figure 3h is an example diagram of the corresponding target task processing results when an image corresponding to the text description is output when the target task category provided by an embodiment of the present application is a given text description;
  • Figure 3i is a comparison between the target task processing result (the image described by a given text) output by a target task processing model trained with the model training method of the present application, and the results determined by the corresponding DALL·E-based and OFA-based models in the related art;
  • Figure 4 is a schematic flowchart of a task processing method provided by an embodiment of the present application.
  • Figure 5 is a schematic structural diagram of a data processing device provided by an embodiment of the present application.
  • Figure 6 is a schematic structural diagram of a model training device provided by an embodiment of the present application.
  • Figure 7 is a schematic structural diagram of a task processing device provided by an embodiment of the present application.
  • FIG. 8 is a schematic structural diagram of an electronic device provided by an embodiment of the present application.
  • MIM Masked Image Model, masked visual model.
  • MLM Masked Language Model, masked language model.
  • FLAVA A Foundational Language And Vision Alignment Model, a basic language and visual alignment model.
  • CLIP Contrastive Language–Image Pre-training, a contrastive text-image pre-training model. It is a multi-modal model with parallel image and text branches, whose training objective is constructed from the similarity of the feature vectors of the two branches.
  • SimVLM Simple Visual Language Model Pre-training with Weak Supervision, weakly supervised simple visual language model pre-training.
  • MNLI Multi-Genre Natural Language Inference, a textual entailment recognition task.
  • CoLA The Corpus of Linguistic Acceptability, a language acceptable corpus, a data set about grammar. The task is mainly to determine whether a given sentence is grammatically correct.
  • MRPC Microsoft Research Paraphrase Corpus, Microsoft Research Paraphrase Corpus, determines whether two given sentences have the same semantics, which is a text classification task for sentence pairs.
  • QQP Quora Question Pairs, text matching, is a data set published by Quora to determine whether two sentences are semantically consistent. It is a text classification task for sentence pairs.
  • SST The Stanford Sentiment Treebank, a Stanford sentiment analysis data set that mainly performs sentiment classification on movie reviews; SST is a single-sentence text classification task (SST-2 is binary classification, SST-5 is five-way classification that distinguishes sentiment polarity more finely).
  • QNLI Question Natural Language Inference, natural language question reasoning; its predecessor is the SQuAD 1.0 data set. Given a question, it must be determined whether the given text contains the correct answer to the question. It is a binary text classification task over sentence pairs.
  • RTE Recognizing Textual Entailment, a textual entailment recognition task. Similar to MNLI, except that MNLI is a three-way classification while RTE only needs to determine whether one sentence can be inferred from the other; it is a binary text classification task over sentence pairs.
  • STS-B the semantic textual similarity benchmark, semantic text similarity data set.
  • ImageNet is a large-scale visualization database used for visual object recognition software research.
  • Food-101 dataset This dataset contains image datasets of 101 food categories, with a total of 101,000 images. Each category has an average of 250 test images and 750 training images. Training images have not been data cleaned. All images have been rescaled to a maximum side length of 512 pixels.
  • CIFAR-10 dataset is a small dataset used to identify universal objects. It contains a total of 10 categories of RGB color images. There are a total of 50,000 training images and 10,000 test images in the data set.
  • CIFAR100 dataset There are 100 classes. Each class has 600 color images, 500 of which are used as training set and 100 as test set.
  • Cars Cars dataset.
  • Aircraft dataset This dataset contains 10,200 images of aircraft, 102 different types of aircraft, each with 100 images.
  • DTD Describable Textures Dataset, a texture recognition data set.
  • the Pets data set is a pet data set provided by Oxford. It contains about 7,000 images of cats and dogs, and some of the images mark the locations of the faces of cats and dogs.
  • Flowers102 dataset This dataset contains image datasets of 102 types of flowers, each category contains 40-258 images. The images are rich in variation in proportion, pose, and lighting.
  • MNIST data set It is an image data set of handwritten digits.
  • STL-10 data set It is an image recognition data set used to develop unsupervised feature learning, deep learning, and self-taught learning algorithms.
  • VQA_v2 visual question answering, the second version of the visual question answering task, the form is to give an image and a question about the image, and output an answer.
  • NLVR2 is a dataset launched by a Cornell University research team containing 107,292 examples of human written English sentences based on paired photos.
  • OFA Unifying Architectures, Tasks, and Modalities Through a Simple Sequence-to-Sequence Learning Framework.
  • OFA is a multi-task training framework that unifies different tasks into sequence-to-sequence training objectives and trains multiple downstream multi-modal tasks simultaneously to achieve the purpose of pre-training.
  • This model requires annotated data from downstream tasks, so it has shortcomings in scalability and operability.
  • the DALL·E model is a model that generates images from text. It discretizes the image and then jointly models image tokens and text tokens to generate images from text.
  • the prefix language model is a front-to-back language model that generates the remaining text based on an input image and prefix text, and generates the remaining image based on input text and a prefix image.
  • Figure 1 is a schematic structural diagram of a data processing system provided by an exemplary embodiment of the present application.
  • the structure includes: a task processing device 11 and a model training device 12, both of which can be computer equipment; the computer equipment may be a terminal, a server, or other equipment.
  • the terminal can be a smartphone, tablet, laptop, intelligent voice interaction device, smart home appliance, wearable smart device, aircraft, smart vehicle terminal, or other device.
  • the terminal can also include a client, which can be a video client, browser client, instant messaging client, etc.
  • the server can be an independent physical server, a server cluster or distributed system composed of multiple physical servers, or a cloud server providing basic cloud computing services such as cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communications, middleware services, domain name services, security services, Content Delivery Network (CDN), and big data and artificial intelligence platforms.
  • the task processing device 11 is used for:
  • obtain target task information of the target task category, where the target task information includes target task input information;
  • the target task processing model is trained through the aforementioned model training method.
  • the model training device 12 may be used to train an initial pre-training model to obtain a target pre-training model, and to train an initial task processing model to obtain the aforementioned target task processing model.
  • when the model training device 12 is used to train the initial pre-training model to obtain the target pre-training model, it is specifically used to:
  • obtain the image and text feature information to be processed, which includes the text feature information to be processed and the image feature information to be processed; the text feature information to be processed or the image feature information to be processed includes a mask identifier, and the text feature information to be processed matches the image feature information to be processed;
  • the first vector information group and the second vector information group are encoded according to the initial encoding rules to obtain the corresponding fusion vector information group; wherein the fusion vector information group includes multiple fusion vector information, each fusion vector information Related to the first vector information group and the second vector information group;
  • the initial pre-training model is trained based on the prediction result and the mask identifier to obtain a target pre-training model.
  • the target pre-training model is used to train, according to the acquired target task category, the target task processing model corresponding to the target task category.
  • when the model training device 12 is used to train the initial task processing model to obtain the aforementioned target task processing model, it is specifically used to:
  • sample task information corresponding to the target task category includes sample task input information, and a sample task result label corresponding to the sample task input information;
  • the plurality of candidate units include: a preprocessing unit, a first target vector generation unit, a first target cross-modal encoding unit, and a first target cross-modal decoding unit;
  • the target pre-training model is trained based on the aforementioned initial pre-training model.
  • details of each component unit in this system embodiment, such as the task processing device 11 and the model training device 12, can be found in the following descriptions of each method embodiment.
  • Figure 2a is a schematic flowchart of a data processing method provided by an exemplary embodiment of the present application.
  • the method can be applied to model training equipment and is used to train an initial pre-training model into a target pre-training model.
  • the method at least includes the following S201-S205:
  • S201. Obtain the image and text feature information to be processed, which includes text feature information to be processed and image feature information to be processed; the text feature information to be processed or the image feature information to be processed includes a mask identifier, and the text feature information to be processed matches the image feature information to be processed;
  • the mask identifier may be a preset identifier, for example, the preset Arabic numeral 0.
  • the method further includes the following steps S01-S04:
  • sample graphic and text information includes sample text information and sample image information, and the sample text information and sample image information match;
  • matching between the sample text information and the sample image information means that the content indicated by the sample text information is related to the content indicated by the sample image information; for example, the content indicated by the sample text information is "puppy", and the sample image information is an image of a puppy.
  • sample text information refers to the content: "The Last Supper of Jesus with the Twelve Apostles, painting by Leonardo da Vinci";
  • sample image information refers to the image "The Last Supper".
  • S02. Determine the tag information of each character in the sample text information in the preset lexicon according to the sample text information and the preset lexicon, and obtain the tag information group corresponding to the sample text information;
  • the initial pre-training model includes a preprocessing unit, which includes a table lookup unit and a first image encoder;
  • the table lookup unit performs the aforementioned S02; the aforementioned first preset encoding rule may be built into the first image encoder, and the first image encoder performs the aforementioned S03.
  • the tag information group contains multiple tag information, and each tag information corresponds to the characters in the sample text information one-to-one.
  • the mark information can be: address information, index information, location information, etc.
  • the tag information contained in the tag information group can be Arabic numeric values, for example: a1 a2 a3 a4 a5 a6 a7 a8 a9 a10 a11 a12 a13 a14 in Figure 2b and Figure 2c, where a1 through a14 can be Arabic numerals, and the multiple pieces of tag information are not necessarily consecutive.
  • the first image encoder is used to extract features from the sample image information. It may specifically include a splitting module and an encoding module: the splitting module splits the sample image information into a preset number of sub-images, and the encoding module determines the corresponding multiple pieces of initial vector information from the multiple sub-images; the multiple pieces of initial vector information constitute an initial vector information group.
  • the initial vector information corresponds to the sub-image one-to-one.
  • the multiple pieces of initial vector information contained in the initial vector information group can be numerical vectors, e.g., b1 b2 b3 b4 b5 b6 b7 b8 b9 in Figure 2c; a minimal sketch of such an encoder follows.
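  • As an illustrative sketch (not the patent's exact implementation), the splitting and encoding modules can be realized as patch extraction followed by a linear projection; the patch size, hidden dimension, and all names below are assumptions:

```python
import torch
import torch.nn as nn

class FirstImageEncoder(nn.Module):
    """Splitting module + encoding module: image -> initial vector group."""
    def __init__(self, patch_size=16, in_channels=3, dim=768):
        super().__init__()
        self.patch_size = patch_size
        # Encoding module: one linear projection per flattened sub-image.
        self.proj = nn.Linear(patch_size * patch_size * in_channels, dim)

    def forward(self, image):  # image: (B, C, H, W), H and W divisible by patch_size
        p = self.patch_size
        # Splitting module: carve the image into non-overlapping sub-images.
        patches = image.unfold(2, p, p).unfold(3, p, p)            # (B, C, H/p, W/p, p, p)
        patches = patches.permute(0, 2, 3, 1, 4, 5).flatten(1, 2)  # (B, N, C, p, p)
        patches = patches.flatten(2)                               # (B, N, C*p*p)
        return self.proj(patches)                                  # (B, N, dim)

encoder = FirstImageEncoder()
initial_vector_group = encoder(torch.randn(1, 3, 224, 224))  # (1, 196, 768)
```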
  • the initial pre-training model also includes a mask processing unit.
  • the aforementioned preset mask rules can be built into the mask processing unit.
  • the preset mask rules can include a first mask rule and a second mask rule.
  • through a first mask processing module with the first mask rule built in within the mask processing unit, part of the tag information in the tag information group is masked; the text feature information to be processed consists of the resulting mask identifiers together with the remaining unmasked tag information, while the image feature information to be processed is the same as the multiple pieces of initial vector information in the initial vector information group.
  • the image and text feature information to be processed can be referred to as the image and text feature information to be processed 1 in Figure 2b.
  • the number of pieces of tag information contained in the masked partial tag information is the same as the number of mask identifiers.
  • the mask processing unit can also be configured with a built-in second mask rule, through which a second mask processing module performs mask processing on part of the initial vector information in the initial vector information group.
  • in this case, the text feature information to be processed is the same as the multiple pieces of tag information in the tag information group, while mask identifiers are obtained after part of the initial vector information is masked; the mask identifiers together with the remaining unmasked initial vector information in the initial vector information group constitute the image feature information to be processed.
  • the image and text feature information to be processed can be seen as the image and text feature information to be processed 2 in Figure 2c.
  • the number of pieces of initial vector information contained in the masked partial initial vector information is the same as the number of mask identifiers.
  • when masking part of the tag information in the tag information group or part of the initial vector information in the initial vector information group through the mask processing unit, the mask processing unit may mask a span of random length, or of random proportion, from back to front.
  • since the sample text information matches the sample image information, the text feature information to be processed matches the image feature information to be processed.
  • the first mask processing module and the second mask processing module may or may not be the same module, which is not limited in this application.
  • in the tag information group a1 a2 a3 a4 a5 a6 a7 a8 a9 a10 a11 a12 a13 a14, the partial tag information a5 a6 a7 a8 a9 a10 a11 a12 a13 a14 is the masked content.
  • an example of masking part of the initial vector information in the initial vector information group through a mask processing unit with built-in preset mask rules can be seen in Figure 2c: in the initial vector information group b1 b2 b3 b4 b5 b6 b7 b8 b9, the partial initial vector information b6 b7 b8 b9 is the masked content. A sketch of such back-to-front masking follows.
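  • A minimal sketch of the masking step, assuming a list-of-identifiers representation, a mask identifier of 0, and illustrative ratio bounds (none of which are fixed by the patent):

```python
import random

MASK_ID = 0  # the preset mask identifier, e.g. the Arabic numeral 0

def mask_from_back(sequence, min_ratio=0.1, max_ratio=0.5):
    """Mask a random-proportion span at the end of the sequence."""
    n = len(sequence)
    span = max(1, int(n * random.uniform(min_ratio, max_ratio)))
    masked = list(sequence[:n - span]) + [MASK_ID] * span
    targets = list(sequence[n - span:])  # ground truth for the masked positions
    return masked, targets

# Masking the tag information group (first mask rule); the second mask rule
# applies the same operation to the initial vector information group instead.
tags = ["a1", "a2", "a3", "a4", "a5", "a6", "a7", "a8", "a9"]
masked_tags, target_tags = mask_from_back(tags)
```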
  • the first vector information group includes a plurality of first vector information.
  • the aforementioned initial pre-training model can also include an initial vector generation unit, and the aforementioned initial vector generation rules are built into the initial vector generation unit.
  • the aforementioned S202 may be executed specifically through the initial vector generation unit.
  • the aforementioned plurality of first vector information may be t1, t2, ..., tn in Figure 2b and Figure 2c;
  • the second vector information group includes a plurality of second vector information, which may be v1, v2, ..., vm in Figure 2b and Figure 2c.
  • the aforementioned initial vector generation rules include a first vector generation rule and a second vector generation rule.
  • the text feature information to be processed can be processed through a first vector generation module, with the first vector generation rule built in, within the initial vector generation unit, to obtain the first vector information group corresponding to the text feature information to be processed; and the image feature information to be processed can be processed through a second vector generation module, with the second vector generation rule built in, within the initial vector generation unit, to obtain the second vector information group corresponding to the image feature information to be processed.
  • the second vector generation module may include the first three stages of ResNet-101.
  • the first vector generation module can perform position embedding and normalization processing on the text feature information to be processed, and the second vector generation module can perform position embedding and normalization processing on the image feature information to be processed; a sketch of both modules follows.
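  • A sketch of the two vector generation modules under assumed dimensions: the text side combines token and position embeddings with layer normalization, and the image side reuses the early stages of ResNet-101. The module names and the exact cut point are illustrative:

```python
import torch
import torch.nn as nn
from torchvision.models import resnet101

class FirstVectorGenerator(nn.Module):
    """Text side: token + position embedding followed by normalization."""
    def __init__(self, vocab_size=30000, dim=768, max_len=512):
        super().__init__()
        self.tok = nn.Embedding(vocab_size, dim)
        self.pos = nn.Embedding(max_len, dim)
        self.norm = nn.LayerNorm(dim)

    def forward(self, tag_ids):  # (B, N) tag/mask identifiers
        positions = torch.arange(tag_ids.size(1), device=tag_ids.device)
        return self.norm(self.tok(tag_ids) + self.pos(positions))  # (B, N, dim)

# Image side: reuse ResNet-101 up to its third stage as the feature extractor.
backbone = resnet101(weights=None)
second_vector_generator = nn.Sequential(
    backbone.conv1, backbone.bn1, backbone.relu, backbone.maxpool,
    backbone.layer1, backbone.layer2, backbone.layer3,
)
```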
  • S203. Encode the first vector information group and the second vector information group through initial encoding rules to obtain a corresponding fusion vector information group, where the fusion vector information group includes multiple pieces of fusion vector information, each related to the first vector information group and the second vector information group;
  • the initial pre-training model may include an initial cross-modal coding unit, and the aforementioned initial coding rules may be built into the initial cross-modal coding unit. Specifically, the aforementioned S203 may be performed through the initial cross-modal coding unit.
  • the initial coding rules may include splicing rules and cross-modal coding rules.
  • the initial cross-modal coding unit may include a splicing unit with built-in splicing rules and a cross-modal encoder with built-in cross-modal coding rules; the splicing unit is used to splice the first vector information group with the second vector information group to obtain a splicing vector, and the cross-modal encoder is used to encode the splicing vector to obtain the fusion vector information group (see the sketch below).
  • the splicing vectors can be: v1, v2, ..., vm, t1, t2, ..., tn, or t1, t2, ..., tn, v1, v2, ..., vm in Figure 2b and Figure 2c.
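  • A minimal sketch of the splicing unit and cross-modal encoder, here realized with a standard Transformer encoder so that each fusion vector can attend to both modalities; layer counts and sizes are assumptions:

```python
import torch
import torch.nn as nn

dim = 768
layer = nn.TransformerEncoderLayer(d_model=dim, nhead=12, batch_first=True)
cross_modal_encoder = nn.TransformerEncoder(layer, num_layers=6)

text_vectors = torch.randn(1, 14, dim)   # t1 ... tn (first vector group)
image_vectors = torch.randn(1, 9, dim)   # v1 ... vm (second vector group)

# Splicing unit: concatenate the two groups along the sequence axis.
spliced = torch.cat([image_vectors, text_vectors], dim=1)

# Cross-modal encoder: self-attention over the spliced sequence, so each
# fusion vector is related to both the first and the second vector group.
fusion_vector_group = cross_modal_encoder(spliced)  # (1, 23, dim)
```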
  • the prediction results may include predicted text feature information corresponding to the text feature information to be processed, or predicted image feature information corresponding to the image feature information to be processed.
  • the initial pre-training model also includes an initial cross-modal decoding unit, and the aforementioned initial decoding rules can be built into the initial cross-modal decoding unit.
  • the aforementioned S204 may be performed by the initial cross-modal decoding unit.
  • when the text feature information to be processed contains the mask identifier, the prediction result may include predicted text feature information corresponding to the text feature information to be processed; when the image feature information to be processed contains the mask identifier, the prediction result may include predicted coded values corresponding to the image feature information to be processed.
  • the aforementioned S205 may be performed specifically through the training unit in the initial pre-training model.
  • the initial pre-training model is trained based on the prediction result and the mask identifier to obtain a target pre-training model, including the following S2051-S2053:
  • when the mask object is the tag information group, the target information corresponding to the mask identifier is the masked partial tag information.
  • the method also includes a method for determining the target information corresponding to the mask identifier.
  • the sample image information is encoded according to the second preset encoding rule to obtain a coded value group corresponding to the sample image information; the coded value group includes a plurality of coded values, where the number of coded values in the coded value group is the same as the number of pieces of initial vector information in the initial vector information group; the target information corresponding to the mask identifier is then determined according to the coded value group.
  • the coded value can be an Arabic numeral, as shown in Figure 2c, and the multiple coded values can be: 123 234 345 987 654 321 999 888 777.
  • the second preset encoding rule can be built into a second image encoder.
  • the function of the second image encoder with the built-in second preset encoding rule is to convert the features of the sample image information into discrete values, that is, the multiple coded values in the coded value group; this makes it convenient for the initial cross-modal decoding unit to output predicted coded values corresponding to the sample image information and to compare them with the target information corresponding to the mask identifier in the coded value group, so as to train the initial pre-training model. A sketch of such discretization follows.
  • determining the target information corresponding to the mask identifier according to the encoded value group may include the following S001-S003:
  • if the image and text feature information to be processed is obtained by masking part of the initial vector information in the initial vector information group based on preset masking rules, then the masked content is that partial initial vector information, and the mask object is the initial vector information group.
  • the first position information may include the index of each piece of initial vector information of the masked content within the initial vector information group. For example, when the partial initial vector information b6 b7 b8 b9 in the initial vector information group b1 b2 b3 b4 b5 b6 b7 b8 b9 in Figure 2c is the masked content, the first position information of b6 b7 b8 b9 can be: 6, 7, 8, 9.
  • the second position information of each coded value within the coded value group corresponds one-to-one with the first position information of each piece of initial vector information within the initial vector information group.
  • the second location information may also be index information.
  • the index information of each coded value in the coded value group in the coded value group in Figure 2c, and the index information of each initial vector information in the initial vector information group can be as shown in Table 1:
  • the training unit may include a comparison unit, and the comparison unit may perform S2052.
  • if the similarity value is not less than the preset similarity value, update the model parameters in the initial pre-training model based on the similarity value to obtain an initial pre-training model with updated model parameters, and return to the step of generating the first vector information group corresponding to the text feature information to be processed and the second vector information group corresponding to the image feature information to be processed based on the initial vector generation rule, until the similarity value is less than the preset similarity value and the target pre-training model is obtained;
  • model parameters in the initial pre-training model include: at least one parameter among the parameters in the initial vector generation rule, the parameters in the initial encoding rule, and the parameters in the initial decoding rule.
  • model parameters in the initial pre-training model may also include: parameters in the pre-processing unit.
  • the initial pre-training model may also include another table lookup unit, allowing relevant personnel to confirm the training status of the initial pre-training model based on its output results. For example, as shown in Figure 2b, if the output of the initial cross-modal decoding unit is a5 a6 a7 a8 a9 a10 a11 a12 a13 a14, then through the table lookup unit it can be determined that the corresponding text is: "Jesus with the Twelve Apostles, painting by Leonardo da Vinci".
  • the initial pre-training model may also include an image decoder corresponding to the second image encoder, allowing relevant personnel to confirm the training status of the initial pre-training model based on the output of the image decoder. For example, as shown in Figure 2c, the image decoder can decode, among the aforementioned multiple sub-images, the sub-images corresponding to the coded values 321, 999, 888, and 777.
  • in some embodiments, the initial pre-training model is trained based on the prediction result and the mask identifier to obtain a target pre-training model as follows: the model parameters in the initial pre-training model are updated to obtain an initial pre-training model with updated parameters, and the process returns to the step of generating, based on the initial vector generation rules, the first vector information group corresponding to the text feature information to be processed and the second vector information group corresponding to the image feature information to be processed; once the number of updates is greater than the preset number of times, the initial pre-training model is used as the target pre-training model;
  • the model parameters in the initial pre-training model include at least one of: the parameters in the initial vector generation rule, the parameters in the initial encoding rule, and the parameters in the initial decoding rule.
  • the loss function can be a cross-entropy function; a schematic training step follows.
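  • A schematic training step under these rules, assuming a model that returns per-position logits and a boolean mask of the masked positions; all names are illustrative, and only the masked positions contribute to the cross-entropy loss:

```python
import torch.nn.functional as F

def train_step(model, optimizer, inputs, masked_positions, targets):
    logits = model(inputs)                 # (B, N, vocab_or_codebook_size)
    pred = logits[masked_positions]        # logits at the masked positions only
    loss = F.cross_entropy(pred, targets)  # cross-entropy, as stated above
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Either stopping criterion described above (illustrative names):
# for step in range(preset_number_of_updates):
#     loss = train_step(model, optimizer, batch, positions, labels)
#     if loss < preset_threshold:
#         break
```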
  • the initial pre-training model may include: the pre-processing unit, the mask processing unit, the initial vector generation unit, the initial cross-modal coding unit, the initial cross-modal decoding unit, and the training unit.
  • the method also includes the following S1-S2:
  • the sample task information includes sample task input information, and the sample task result label corresponding to the sample task input information;
  • the target task category may be any one of: text understanding tasks, image understanding tasks (i.e., visual understanding tasks), text-to-image generation, image-to-text generation, multi-modal recognition tasks, etc.
  • the aforementioned sample task input information includes the task prerequisites required for analysis in order to obtain the sample task results; the sample task result label is the task processing result obtained based on the sample task input information.
  • if the target task category is image-to-text generation, multiple sample images are required, as well as the text generation result corresponding to each sample image; then at least one sample image among the multiple sample images is the sample task input information, and the text generation result corresponding to each sample image is the sample task result label.
  • the target pre-training model is the trained initial pre-training model.
  • the target pre-training model includes a plurality of candidate units: a preprocessing unit, a first target vector generation unit, a first target cross-modal encoding unit, and a first target cross-modal decoding unit;
  • each candidate unit may include multiple subunits, and the determined target unit may include only some of the subunits in a candidate unit, or may include all of the subunits.
  • the preprocessing unit includes a table lookup unit and a first image encoder.
  • the preprocessing unit included in the determined target unit may include only the table lookup unit, only the first image encoder, or both the table lookup unit and the first image encoder.
  • the first target cross-modal coding unit includes a splicing unit and a cross-modal encoder; the first target cross-modal coding unit included in the determined target unit may include only the cross-modal encoder, or may include both the splicing unit and the cross-modal encoder.
  • the sample task information and the target pre-training model are used to train the target task processing model for completing the target task corresponding to the target task category, including the following S21-S23:
  • the preset correspondence relationship stores multiple task categories, as well as the association between each task category and its corresponding task processing model.
  • the first target vector generation unit corresponds to the initial vector generation unit
  • the first target cross-modal coding unit corresponds to the initial cross-modal coding unit
  • the first target cross-modal decoding unit corresponds to the initial cross-modal decoding unit.
  • the first target vector generation unit is: after the target pre-training model is trained, the initial vector generation unit in the target pre-training model;
  • the first target cross-modal coding unit is: after the target pre-training model is trained, the initial cross-modal coding unit in the target pre-training model;
  • the first target cross-modal decoding unit is: after the target pre-training model is trained, the initial cross-modal decoding unit in the target pre-training model.
  • the aforementioned method of determining the target unit can also be implemented based on the user's selection instructions for the target unit, which is not limited by this application; a sketch of a category-to-unit lookup follows.
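  • A sketch of such a preset correspondence as a plain lookup table from task category to the candidate-unit names used to assemble the initial task processing model; the category names and unit groupings below are illustrative, not the patent's actual table:

```python
PRESET_CORRESPONDENCE = {
    "text_understanding": ["table_lookup", "vector_generation",
                           "cross_modal_encoder", "cross_modal_decoder"],
    "image_to_text": ["table_lookup", "first_image_encoder", "vector_generation",
                      "splicing", "cross_modal_encoder", "cross_modal_decoder"],
    "text_to_image": ["table_lookup", "vector_generation",
                      "cross_modal_encoder", "cross_modal_decoder", "image_decoder"],
}

def select_target_units(task_category, candidate_units):
    """candidate_units: dict of unit name -> module from the target pre-training model."""
    return [candidate_units[name] for name in PRESET_CORRESPONDENCE[task_category]]
```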
  • sample data used to train the target pre-training model may also include plain text sample data.
  • the sample data used to train the target pre-training model (such as sample image information, plain text sample data) can come from the Internet or public data sets.
  • the target pre-training model in this application is a prefix language model, which can fully associate language and images, so that the target pre-training model has text generation capabilities and image encoding capabilities, and correlates text and images sufficiently to enhance its cross-modal understanding ability.
  • This application obtains image and text feature information to be processed, which includes text feature information to be processed and image feature information to be processed; the text feature information to be processed or the image feature information to be processed includes a mask identifier, and the text feature information to be processed matches the image feature information to be processed. The first vector information group corresponding to the text feature information to be processed and the second vector information group corresponding to the image feature information to be processed are generated based on the initial vector generation rule; the two groups are encoded through the initial encoding rules to obtain the corresponding fusion vector information group, which includes multiple pieces of fusion vector information, each related to the first vector information group and the second vector information group; the fusion vector information group is decoded through the initial decoding rule to obtain the prediction result corresponding to the mask identifier; and the initial pre-training model is trained based on the prediction result and the mask identifier to obtain a target pre-training model.
  • the target pre-training model is used to train, according to the obtained target task category, the target task processing model corresponding to the target task category.
  • this unifies the processing of an image and of the text matching that image into the same pre-training model, and the sample data used to train the target pre-training model involves multi-modal information.
  • the pre-training model trained through the solution of this application can provide material for the training of various task processing models, thus improving training efficiency when task processing models corresponding to multiple tasks need to be trained.
  • the second image encoder in the target pre-training model can give the target pre-training model the image generation capability
  • the table lookup unit in the target pre-training model can give the target pre-training model the text generation capability
  • the first target cross-modal encoding unit, together with the first target cross-modal decoding unit, endows the target pre-training model with multi-modal understanding capabilities, text understanding capabilities, and visual understanding capabilities, making the target pre-training model highly compatible and scalable: it can provide material for the training of various task processing models, so the efficiency of processing various tasks is also improved.
  • this solution encodes the image into discrete data through the second image encoder, so that image and text information containing both image information and text information can be used as sample data to train the target pre-training model; this makes the processing of image information similar to the processing of text information, and the target pre-training model can be trained faster.
  • the target task processing model trained through the aforementioned model training method is trained based on the target pre-training model related to multiple modalities, and has high task processing accuracy when processing tasks.
  • Table 2 compares the accuracy of the task processing results of the target task processing model determined through the solution of this application with the accuracy of the task processing results of task processing models determined by other methods in the related art.
  • Table 2. Comparison of task processing accuracy between the task processing model of this solution and task processing models determined by other methods.
  • MIM, MLM, FLAVA, CLIP, and SimVLM refer to the categories of task processing models;
  • MNLI, CoLA, MRPC, QQP, SST-2, QNLI, RTE, and STS-B are the category information of the tasks being processed; among them, the MNLI result is the average of MNLI-m and MNLI-mm. MRPC and QQP results are the average of accuracy and F1 score.
  • CoLA reports the Matthews correlation coefficient (MCC) and STS-B reports the Pearson correlation coefficient (PCC).
  • the “M” in 70M, 46.4M, and 647.7M refers to "million", that is, 70M, 46.4M, and 647.7M refer to the amount of data used to calculate the accuracy value of the task processing result.
  • NLP Avg refers to the average accuracy value of task processing results at the natural language processing level.
  • Vision Avg refers to the average accuracy value of task processing results at the visual recognition (ie, image recognition) level.
  • Multi-modal refers to the average of the accuracy values of task processing results for multi-modal processing tasks.
  • Eval method refers to the evaluation method of the corresponding task, specifically: 1) Fine-tuning refers to fully training the model on the corresponding task; 2) Linear eval refers to fixing the model and predicting results for the corresponding task by adding a classifier; 3) Zero-shot refers to a completely fixed model, with no added learnable parameters, solving the corresponding task. A sketch of the three settings follows.
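  • A sketch of the three evaluation settings, assuming a generic PyTorch model and an illustrative classifier head (the function and mode names are assumptions):

```python
import torch.nn as nn

def prepare_for_eval(model, dim, num_classes, mode):
    if mode == "fine-tuning":
        head = nn.Linear(dim, num_classes)   # the whole model stays trainable
    elif mode == "linear-eval":
        for p in model.parameters():
            p.requires_grad = False          # freeze the pre-trained model
        head = nn.Linear(dim, num_classes)   # only the added classifier learns
    elif mode == "zero-shot":
        for p in model.parameters():
            p.requires_grad = False          # completely fixed, nothing added
        head = None
    return model, head
```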
  • ImageNet, Food101, CIFAR10, Cars, Aircraft, DTD, Pets, Flowers102, MNIST, STL10, and Country211 refer to the data set names.
  • the accuracy value of the task processing result corresponding to each data set refers to the accuracy value of the task processing result using the current data set as the task analysis data.
  • VQAv2, SNLI-VE, and NLVR2 refer to the name of the data set
  • I2T and T2I represent the task of generating text from images and generating images from text.
  • I2T@B4 and I2T@C are evaluation indicators for the task of generating text from images.
  • B4 refers to the 4-gram Bilingual Evaluation Understudy (BLEU);
  • C refers to Consensus-based Image Description Evaluation (CIDEr).
  • T2I@IS and T2I@FID are evaluation indicators for the task of generating images based on text.
  • IS refers to Inception Score (IS)
  • FID refers to Frechet Inception Distance (FID).
  • for some indicators, a larger value means the task processing result is more accurate; for others (for example, FID), a smaller value means the result is more accurate.
  • when the target task processing model determined by this solution is used to process the target task, it achieves state-of-the-art results among similar models at the same model scale and data scale.
  • the relevant models compared with the target task processing model determined through the solution of this application are FLAVA and SimVLM.
  • the target task processing model determined through the solution of this application performs best on 22 of all 26 tasks; it is on par with the related models on text understanding tasks and better than them on everything else, including significant improvements on visual understanding tasks, multi-modal understanding tasks, text-to-image generation, and image-to-text generation tasks.
  • Figure 3a is a schematic flowchart of a model training method provided by an exemplary embodiment of the present application.
  • the method can be applied to model training equipment.
  • the method at least includes the following steps S301-S305:
  • the sample task information includes sample task input information, and the sample task result label corresponding to the sample task input information;
  • the multiple candidate units include: a preprocessing unit, a first target vector generation unit, a first target cross-modal coding unit, and a first target cross-modal decoding unit;
  • the target pre-training model is obtained by training the initial pre-training model in the data processing method in the embodiment corresponding to Figure 2a.
  • the initial pre-training model includes: an initial vector generation unit, an initial cross-modal coding unit, and an initial cross-modal decoding unit.
  • the first vector information group corresponding to the text feature information to be processed and the second vector corresponding to the image feature information to be processed can be generated specifically based on the initial vector generation rule built in the initial vector generation unit. information group;
  • the first vector information group and the second vector information group are encoded through the initial coding rules built into the initial cross-modal coding unit to obtain a corresponding fused vector information group; wherein the fused vector information group includes multiple Fusion vector information, each fusion vector information is related to the first vector information group and the second vector information group;
  • the initial pre-training model is trained based on the prediction result and the mask identifier to obtain a target pre-training model; the target pre-training model is used to train, according to the obtained target task category, the target task processing model corresponding to the target task category.
  • the target pre-training model is a trained initial pre-training model
  • the first target vector generation unit is a trained initial vector generation unit in the initial pre-training model
  • the first target cross-modal coding unit is the trained initial cross-modal coding unit in the initial pre-training model.
  • the first target cross-modal decoding unit is the trained initial cross-modal decoding unit in the initial pre-training model.
  • the target pre-training model includes a plurality of candidate units, and the plurality of candidate units include at least 2 of the following units: a preprocessing unit, a first target vector generation unit, a first target cross-modal coding unit, a first target Cross-modal decoding unit.
  • when building the initial task processing model based on the target unit, additional units can also be obtained according to the target task category. That is, the initial task processing model can be constructed from the target unit and the additional units. When training the target task processing model, in addition to updating the parameters in the target unit, the parameters of the additional units can also be updated.
  • the aforementioned additional unit may also be a recovery unit, and the recovery unit may include a table lookup unit and/or an image decoder corresponding to the second image encoder.
  • the target unit includes: a preprocessing unit, a first target vector generation unit, a first target cross-modal coding unit, and a first target cross-modal decoding unit
  • the trained initial task processing model, that is, the target task processing model, may include: a trained preprocessing unit, a second target vector generation unit, a second target cross-modal encoding unit, and a second target cross-modal decoding unit, and may also include an image decoder.
  • the second target vector generation unit is the trained first target vector generation unit
  • the second target cross-modal coding unit is the trained first target cross-modal coding unit
  • the second target cross-modal decoding unit is the trained first target cross-modal decoding unit.
  • different target task categories determine different target units.
  • the target unit corresponding to the determined target task category may include: the table lookup unit in the preprocessing unit, the first target vector generation unit, the cross-modal encoder in the first target cross-modal coding unit, and the first target cross-modal decoding unit.
  • the table lookup unit in the preprocessing unit is connected to the first target vector generation unit, the first target vector generation unit is connected to the cross-modal encoder in the first target cross-modal coding unit, and the cross-modal encoder in the first target cross-modal coding unit is connected to the first target cross-modal decoding unit.
  • the initial task processing model may also include other units besides the aforementioned target unit.
  • the initial task processing model may also include an initial classifier.
  • the input interface of the initial classifier is connected to the output interface of the first target cross-modal decoding unit.
  • the output interface of the initial classifier is used to output the predicted task processing results, so that in the process of training the target task processing model, the initial task processing model is trained based on the sample task information, the predicted task processing results, and the sample task result labels. A minimal sketch of such a wired-up task model is given below.
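  • The following is a minimal sketch of how the selected target units and the initial classifier can be wired in series, assuming PyTorch; all module names are hypothetical stand-ins for the units described here, not the authors' actual implementation.

```python
# Minimal sketch of an initial task processing model (assumption: PyTorch;
# the concrete units are passed in as modules taken from the target pre-training model).
import torch
import torch.nn as nn

class InitialTaskProcessingModel(nn.Module):
    """Wires the target units in series and adds an initial classifier on top."""
    def __init__(self, lookup_unit, vector_gen, cross_modal_encoder,
                 cross_modal_decoder, hidden_dim, num_classes):
        super().__init__()
        self.lookup_unit = lookup_unit      # table lookup unit from the preprocessing unit
        self.vector_gen = vector_gen        # first target vector generation unit
        self.encoder = cross_modal_encoder  # cross-modal encoder of the coding unit
        self.decoder = cross_modal_decoder  # first target cross-modal decoding unit
        self.classifier = nn.Linear(hidden_dim, num_classes)  # initial classifier

    def forward(self, text_tokens):
        ids = self.lookup_unit(text_tokens)   # characters -> mark (token) information
        vecs = self.vector_gen(ids)           # mark information -> first vector information group
        fused = self.encoder(vecs)            # -> fusion vector information group
        decoded = self.decoder(fused)         # -> decoded representation
        return self.classifier(decoded[:, 0]) # predicted task processing result
```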
  • after training, when the target task category is text classification, the corresponding target task processing model is obtained.
  • for the target task of obtaining two statements and determining the relationship between the second statement and the first statement, the first statement and the second statement can be used as the target task input information of the target task processing model, and the target task processing result can be obtained: implication (indicating that the semantics of the second statement implies the semantics of the first statement).
  • when the target task category is the task of analyzing the semantic recognition results corresponding to an image and classifying the image, the target task category only involves the analysis and classification of the image. Therefore, from the plurality of candidate units, the determined target unit corresponding to the target task category may include: the first image encoder in the preprocessing unit, the first target vector generation unit, the cross-modal encoder in the first target cross-modal coding unit, and the first target cross-modal decoding unit.
  • the first image encoder in the preprocessing unit is connected to the first target vector generation unit, the first target vector generation unit is connected to the cross-modal encoder in the first target cross-modal coding unit, and the cross-modal encoder in the first target cross-modal coding unit is connected to the first target cross-modal decoding unit.
  • the initial task processing model may also include other units besides the aforementioned target unit.
  • the initial task processing model may also include an initial classifier.
  • the input interface of the initial classifier is connected to the output interface of the first target cross-modal decoding unit.
  • the output interface of the initial classifier is used to output the predicted task processing results, so that in the process of training the target task processing model, the initial task processing model is trained based on the sample task information, predicted task processing results, and sample task result labels.
  • after training, when the target task category is analyzing the semantic recognition result corresponding to an image and classifying the image, the corresponding target task processing model is obtained.
  • for the target task of analyzing the semantic recognition result of image 1 in Figure 3c and classifying the image, image 1 can be used as the target task input information of the target task processing model, and the target task processing result can be obtained: desk lamp.
  • for the target task of analyzing the semantic recognition result of image 2 in Figure 3c and classifying the image, image 2 can be used as the target task input information of the target task processing model, and the target task processing result can be obtained: ice cream.
  • the target unit corresponding to the determined target task category may include: a preprocessing unit, a first target vector generation unit, a first target cross-modal encoding unit, and a first target cross-modal decoding unit.
  • the preprocessing unit is connected to the first target vector generation unit, the first target vector generation unit is connected to the first target cross-modal coding unit, and the first target cross-modal coding unit is connected to the first target cross-modal decoding unit.
  • the initial task processing model may also include other units besides the aforementioned target unit.
  • the initial task processing model may also include an initial classifier.
  • the input interface of the initial classifier is connected to the output interface of the first target cross-modal decoding unit.
  • the output interface of the initial classifier is used to output the predicted task processing results, so that in the process of training the target task processing model, the initial task processing model is trained based on the sample task information, predicted task processing results, and sample task result labels.
  • after training, when the target task category is the task of answering questions based on image and text information, the corresponding target task processing model is obtained.
  • for the target task of answering a question according to the image and text information 1 in Figure 3d, the image and text information 1 can be used as the target task input information of the target task processing model, and the target task processing result can be obtained: man.
  • for the target task of answering a question according to the image and text information 2 in Figure 3d, the image and text information 2 can be used as the target task input information of the target task processing model, and the target task processing result can be obtained: woman.
  • the target unit corresponding to the determined target task category may include: a preprocessing unit, a first target vector generation unit, a first target cross-modal encoding unit, and a first target cross-modal decoding unit.
  • the preprocessing unit is connected to the first target vector generation unit, the first target vector generation unit is connected to the first target cross-modal coding unit, and the first target The cross-modal coding unit is connected to the first target cross-modal decoding unit.
  • the initial task processing model may also include other units besides the aforementioned target unit.
  • the initial task processing model may also include an initial classifier.
  • the input interface of the initial classifier is connected to the output interface of the first target cross-modal decoding unit.
  • the output interface of the initial classifier is used to output the predicted task processing results, so that in the process of training the target task processing model, the initial task processing model is trained based on the sample task information, predicted task processing results, and sample task result labels.
  • after training, when the target task category is judging whether text correctly describes an image pair, the corresponding target task processing model is obtained.
  • for the target task of determining whether the text in image and text information 3 in Figure 3e correctly describes the image pair in image and text information 3, image and text information 3 can be used as the target task input information of the target task processing model, and the target task processing result can be obtained: yes.
  • for the target task of determining whether the text in image and text information 4 correctly describes the image pair in image and text information 4, image and text information 4 can be used as the target task input information of the target task processing model, and the target task processing result can be obtained: wrong.
  • when the target task category is the task of, given an image and a text description, determining whether the relationship between the image and the text is implication, contradiction, or neutral, the target task category involves analyzing both images and texts. Therefore, from the plurality of candidate units, the determined target unit corresponding to the target task category may include: the preprocessing unit, the first target vector generation unit, the first target cross-modal coding unit, and the first target cross-modal decoding unit.
  • the preprocessing unit is connected to the first target vector generation unit, the first target vector generation unit is connected to the first target cross-modal coding unit, and the first target cross-modal coding unit is connected to the first target cross-modal decoding unit.
  • the initial task processing model may also include other units besides the aforementioned target unit.
  • the initial task processing model may also include an initial classifier.
  • the input interface of the initial classifier is connected to the output interface of the first target cross-modal decoding unit.
  • the output interface of the initial classifier is used to output the predicted task processing results, so that in the process of training the target task processing model, the initial task processing model is trained based on the sample task information, predicted task processing results, and sample task result labels.
  • after training, when the target task category is, given an image and a text description, determining whether the relationship between the image and the text is implication, contradiction, or neutral, the corresponding target task processing model is obtained.
  • for the target task of, given the premise image in Figure 3f and text description 1: "Two women are holding packages.", determining whether the relationship between the premise image and text description 1 is implication, contradiction, or neutral, the premise image and text description 1 can be used as the target task input information of the target task processing model, and the target task processing result can be obtained: implication.
  • for the target task of, given the premise image in Figure 3f and text description 2: "The sisters are hugging goodbye while holding to go packages after just eating lunch.", determining whether the relationship between the premise image and text description 2 is implication, contradiction, or neutral, the premise image and text description 2 can be used as the target task input information of the target task processing model, and the target task processing result can be obtained: neutral.
  • likewise, for a text description that conflicts with the premise image, the target task processing result can be obtained: contradiction.
  • the determined target unit corresponding to the target task category may include: the first image encoder in the preprocessing unit, the first target vector generation unit, the cross-modal encoder in the first target cross-modal encoding unit, and the first target cross-modal decoding unit.
  • the first image encoder in the preprocessing unit is connected to the first target vector generation unit, the first target vector generation unit is connected to the cross-modal encoder in the first target cross-modal coding unit, and the cross-modal encoder in the first target cross-modal coding unit is connected to the first target cross-modal decoding unit.
  • after training, when the target task category is, given an image, outputting a text description of the image, the corresponding target task processing model is obtained.
  • the image can be used as the target task input information of the target task processing model, and the target task processing result can be obtained: "A seabird is walking on the shore."
  • the target unit corresponding to the determined target task category may include: a table lookup unit in the preprocessing unit, a first target vector generation unit, a cross-modal encoder in the first target cross-modal coding unit, and a first Target cross-modal decoding unit.
  • the table lookup unit in the preprocessing unit is connected to the first target vector generation unit, the first target vector generation unit is connected to the cross-modal encoder in the first target cross-modal coding unit, and the cross-modal encoder in the first target cross-modal coding unit is connected to the first target cross-modal decoding unit.
  • after training, when the target task category is, given a text description, outputting the image corresponding to the text description, the corresponding target task processing model is obtained.
  • for the target task of, given the text in Figure 3h: "a baseball player holding a bat next to a base", outputting the image corresponding to the text, the text "a baseball player holding a bat next to a base" can be used as the target task input information of the target task processing model, resulting in the image in Figure 3h.
  • Figure 3i compares, for the task of outputting an image for a given text description, the target task processing results of the target task processing model trained by the model training method of this application with those of related models.
  • as can be seen from Figure 3i, the image generation quality in the target task processing results obtained through the solution of this application is higher, more realistic, and more accurate.
  • FIG. 4 is a schematic flowchart of a task processing method provided by an exemplary embodiment of the present application.
  • the method includes the following S401-S403:
  • the target task processing model is trained through the aforementioned model training method.
  • Figure 5 is a schematic structural diagram of a data processing device provided by an exemplary embodiment of the present application.
  • the device includes:
  • the acquisition unit 51 is used to obtain the image and text feature information to be processed, which includes the text feature information to be processed and the image feature information to be processed; the text feature information to be processed or the image feature information to be processed contains a mask identifier, and the text feature information to be processed matches the image feature information to be processed;
  • Generating unit 52 configured to generate a first vector information group corresponding to the text feature information to be processed and a second vector information group corresponding to the image feature information to be processed based on the initial vector generation rule;
  • Encoding unit 53 configured to encode the first vector information group and the second vector information group through initial encoding rules to obtain a corresponding fusion vector information group; wherein the fusion vector information group includes multiple fusion vectors Information, each fusion vector information is related to the first vector information group and the second vector information group;
  • Decoding unit 54, configured to decode the fusion vector information group through initial decoding rules to obtain the prediction result corresponding to the mask identifier;
  • Determining unit 55, configured to train the initial pre-training model based on the prediction result and the mask identifier to obtain a target pre-training model.
  • the target pre-training model is used to train, according to the obtained target task category, the target task processing model corresponding to the target task category.
  • the device is also used for:
  • sample graphic and text information includes sample text information and sample image information, and the sample text information and sample image information match;
  • Mask processing is performed on part of the mark information in the mark information group or on part of the initial vector information in the initial vector information group based on preset masking rules to obtain the image and text feature information to be processed.
  • the device when the device is used to train the initial pre-training model based on the prediction result and the mask identifier to obtain a target pre-training model, it is specifically used to:
  • if the similarity value is less than the preset similarity value, use the initial pre-training model as the target pre-training model;
  • if the similarity value is not less than the preset similarity value, update the model parameters in the initial pre-training model based on the similarity value to obtain an initial pre-training model with updated model parameters, and return to the step of generating, based on the initial vector generation rule, the first vector information group corresponding to the text feature information to be processed and the second vector information group corresponding to the image feature information to be processed, until the similarity value is less than the preset similarity value and the target pre-trained model is obtained (a minimal training-loop sketch is given below);
  • the model parameters in the initial pre-training model include at least one of: the parameters in the initial vector generation rule, the parameters in the initial encoding rule, and the parameters in the initial decoding rule.
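  • The following is a minimal training-loop sketch, assuming PyTorch and realizing the similarity value as a cross-entropy loss (as the description suggests); the threshold and step budget are illustrative, not values from this application.

```python
# Minimal pre-training loop sketch (assumption: PyTorch; the "similarity value"
# is computed here as a cross-entropy loss between the prediction result and
# the target information of the mask identifiers).
import torch.nn.functional as F

def pretrain(model, batches, optimizer, preset_value=0.05, max_steps=100_000):
    for step, (inputs, mask_targets) in enumerate(batches):
        logits = model(inputs)                        # prediction result for masked positions
        loss = F.cross_entropy(logits, mask_targets)  # similarity value
        if loss.item() < preset_value:                # below preset: training is done
            return model                              # model becomes the target pre-training model
        optimizer.zero_grad()
        loss.backward()                               # update model parameters based on the value
        optimizer.step()
        if step >= max_steps:
            break
    return model
```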
  • the device is also used for:
  • the sample image information is encoded according to the second preset encoding rule to obtain a coded value group corresponding to the sample image information.
  • the coded value group includes a plurality of coded values, where the number of coded values in the coded value group is the same as the number of pieces of initial vector information in the initial vector information group;
  • the target information corresponding to the mask identification is determined according to the encoded value group.
  • the device when the device is used to determine the target information corresponding to the mask identifier according to the encoded value group, it is specifically used to:
  • if the image and text feature information to be processed is obtained by masking part of the initial vector information in the initial vector information group based on preset masking rules, the masked content is the partial initial vector information, and the mask object is the initial vector information group.
  • when the aforementioned device is used to train an initial pre-training model based on the prediction result and the mask identifier to obtain a target pre-training model, it is specifically used to:
  • if the number of updates is not greater than the preset number, update the model parameters in the model to obtain an initial pre-trained model with updated model parameters, and return to the step of generating, based on the initial vector generation rules, the first vector information group corresponding to the text feature information to be processed and the second vector information group corresponding to the image feature information to be processed; when the number of updates is greater than the preset number, use the initial pre-training model as the target pre-training model;
  • the model parameters in the initial pre-training model include at least one of: the parameters in the initial vector generation rule, the parameters in the initial encoding rule, and the parameters in the initial decoding rule.
  • Figure 6 is a schematic structural diagram of a model training device provided by an exemplary embodiment of the present application.
  • the device includes:
  • the acquisition unit 61 is configured to acquire a target task category and sample task information corresponding to the target task category.
  • the sample task information includes sample task input information, and a sample task result label corresponding to the sample task input information; and
  • and is used to acquire multiple candidate units in the target pre-training model, where the multiple candidate units include: a pre-processing unit, a first target vector generation unit, a first target cross-modal encoding unit, and a first target cross-modal decoding unit;
  • the determining unit 62 is configured to determine the target unit corresponding to the target task category from the plurality of candidate units according to the preset correspondence relationship;
  • Building unit 63 configured to build an initial task processing model corresponding to the target task category based on the target unit;
  • the training unit 64 is configured to use the sample task information to train the initial task processing model to obtain a target task processing model for completing the target task corresponding to the target task category;
  • the target pre-training model is obtained by training the initial pre-training model in the aforementioned data processing method.
  • Figure 7 is a schematic structural diagram of a task processing device provided by an exemplary embodiment of the present application; wherein, the device includes:
  • the acquisition unit 71 is used to acquire target task information of the target task category, where the target task information includes target task input information;
  • Determining unit 72, configured to determine the corresponding target task processing model according to the target task category;
  • the input unit 73 is used to input the target task input information into the target task processing model to obtain a target task processing result corresponding to the target task category and the target task input information;
  • the target task processing model is trained through the aforementioned model training method.
  • the device embodiments and the method embodiments may correspond to each other, and for similar descriptions, reference may be made to the method embodiments; to avoid repetition, they are not repeated here.
  • the device can execute the above method embodiments, and the foregoing and other operations and/or functions of each module in the device implement the corresponding processes of the methods in the above method embodiments; for brevity, they are not described here again.
  • the software module may be located in a storage medium mature in the field, such as random access memory, flash memory, read-only memory, programmable read-only memory, electrically erasable programmable memory, or a register.
  • the storage medium is located in the memory, and the processor reads the information in the memory and completes the steps in the above method embodiments in combination with its hardware.
  • FIG. 8 is a schematic block diagram of an electronic device provided by an embodiment of the present application.
  • the electronic device may include:
  • a memory 801 and a processor 802, where the memory 801 is used to store a computer program and transmit the program code to the processor 802. In other words, the processor 802 can call and run the computer program from the memory 801 to implement the methods in the embodiments of this application.
  • the processor 802 may be configured to execute the above method embodiments according to instructions in the computer program.
  • the processor 802 may include but is not limited to:
  • a Digital Signal Processor (DSP);
  • an Application Specific Integrated Circuit (ASIC);
  • a Field Programmable Gate Array (FPGA).
  • the memory 801 includes but is not limited to:
  • Non-volatile memory can be read-only memory (Read-Only Memory, ROM), programmable read-only memory (Programmable ROM, PROM), erasable programmable read-only memory (Erasable PROM, EPROM), electrically erasable programmable read-only memory (Electrically EPROM, EEPROM), or flash memory. Volatile memory may be Random Access Memory (RAM), which is used as an external cache.
  • By way of example but not limitation, many forms of RAM are available, such as:
  • static random access memory (Static RAM, SRAM);
  • dynamic random access memory (Dynamic RAM, DRAM);
  • synchronous dynamic random access memory (Synchronous DRAM, SDRAM);
  • double data rate synchronous dynamic random access memory (Double Data Rate SDRAM, DDR SDRAM);
  • enhanced synchronous dynamic random access memory (Enhanced SDRAM, ESDRAM);
  • synchronous link dynamic random access memory (Synchlink DRAM, SLDRAM);
  • direct Rambus random access memory (Direct Rambus RAM, DR RAM).
  • the computer program can be divided into one or more modules, and the one or more modules are stored in the memory 801 and executed by the processor 802 to complete the methods provided by this application.
  • the one or more modules may be a series of computer program instruction segments capable of completing specific functions. The instruction segments are used to describe the execution process of the computer program in the electronic device.
  • the electronic device may also include:
  • Transceiver 803 which may be connected to the processor 802 or the memory 801.
  • the processor 802 can control the transceiver 803 to communicate with other devices. Specifically, it can send information or data to other devices, or receive information or data sent by other devices.
  • Transceiver 803 may include a transmitter and a receiver.
  • the transceiver 803 may further include an antenna, and the number of antennas may be one or more.
  • the components of the electronic device are connected through a bus system, where in addition to the data bus, the bus system also includes a power bus, a control bus, and a status signal bus.
  • This application also provides a computer storage medium on which a computer program is stored.
  • the computer program When the computer program is executed by a computer, the computer can perform the method of the above method embodiment.
  • embodiments of the present application also provide a computer program product containing instructions, which when executed by a computer causes the computer to perform the method of the above method embodiments.
  • the computer program product includes one or more computer instructions.
  • the computer may be a general purpose computer, a special purpose computer, a computer network, or other programmable device.
  • the computer instructions may be stored in a computer-readable storage medium or transferred from one computer-readable storage medium to another computer-readable storage medium.
  • the computer instructions may be transmitted from a website, computer, server, or data center to another website, computer, server, or data center via wired (e.g., coaxial cable, optical fiber, digital subscriber line (DSL)) or wireless (e.g., infrared, wireless, microwave, etc.) means.
  • the computer-readable storage medium can be any available medium that can be accessed by a computer or a data storage device such as a server or data center integrated with one or more available media.
  • the available media may be magnetic media (eg, floppy disk, hard disk, tape), optical media (eg, digital video disc (DVD)), or semiconductor media (eg, solid state disk (SSD)), etc.
  • a data processing method including:
  • the image and text feature information to be processed which includes the text feature information to be processed and the image feature information to be processed; the text feature information to be processed or the image feature information to be processed includes a mask identifier, The feature information of the text to be processed matches the feature information of the image to be processed;
  • the first vector information group and the second vector information group are encoded according to the initial encoding rules to obtain the corresponding fusion vector information group; wherein the fusion vector information group includes multiple fusion vector information, each fusion vector information Related to the first vector information group and the second vector information group;
  • the initial pre-training model is trained based on the prediction result and the mask identifier to obtain a target pre-training model.
  • the target pre-training model is used to train, according to the acquired target task category, the target task processing model corresponding to the target task category.
  • the method further includes:
  • sample graphic and text information includes sample text information and sample image information, and the sample text information and sample image information match;
  • Mask processing is performed on part of the mark information in the mark information group or on part of the initial vector information in the initial vector information group based on preset masking rules to obtain the image and text feature information to be processed.
  • the initial pre-training model is trained based on the prediction result and the mask identifier to obtain a target pre-training model, including:
  • if the similarity value is less than the preset similarity value, use the initial pre-training model as the target pre-training model; if the similarity value is not less than the preset similarity value, update the model parameters in the initial pre-training model based on the similarity value to obtain an initial pre-training model with updated model parameters, and return to the step of generating, based on the initial vector generation rule, the first vector information group corresponding to the text feature information to be processed and the second vector information group corresponding to the image feature information to be processed, until the similarity value is less than the preset similarity value and the target pre-trained model is obtained;
  • model parameters in the initial pre-training model include: at least one parameter among the parameters in the initial vector generation rule, the parameters in the initial encoding rule, and the parameters in the initial decoding rule.
  • the method further includes:
  • the sample image information is encoded according to the second preset encoding rule to obtain a coded value group corresponding to the sample image information.
  • the coded value group includes a plurality of coded values, where the number of coded values in the coded value group is the same as the number of pieces of initial vector information in the initial vector information group;
  • the target information corresponding to the mask identification is determined according to the encoded value group.
  • determining the target information corresponding to the mask identifier based on the encoded value group includes:
  • if the image and text feature information to be processed is obtained by masking part of the initial vector information in the initial vector information group based on preset masking rules, the masked content is the partial initial vector information, and the mask object is the initial vector information group.
  • training the initial pre-training model based on the prediction result and the mask identifier to obtain the target pre-training model includes:
  • if the number of updates is not greater than the preset number, update the model parameters in the model to obtain an initial pre-trained model with updated model parameters, and return to the step of generating, based on the initial vector generation rules, the first vector information group corresponding to the text feature information to be processed and the second vector information group corresponding to the image feature information to be processed; when the number of updates is greater than the preset number, use the initial pre-training model as the target pre-training model;
  • model parameters in the initial pre-training model include: at least one parameter among the parameters in the initial vector generation rule, the parameters in the initial encoding rule, and the parameters in the initial decoding rule.
  • a model training method including:
  • sample task information corresponding to the target task category includes sample task input information, and a sample task result label corresponding to the sample task input information;
  • the plurality of candidate units include: a preprocessing unit, a first target vector generation unit, a first target cross-modal encoding unit, and a first target cross-modal decoding unit;
  • the target pre-training model is obtained by training the initial pre-training model in the aforementioned data processing method.
  • a task processing method including:
  • target task information of the target task category where the target task information includes target task input information
  • the target task processing model is trained through the aforementioned model training method.
  • a data processing device including:
  • the acquisition unit is used to obtain the image and text feature information to be processed, which includes the text feature information to be processed and the image feature information to be processed; the text feature information to be processed or the image feature information to be processed contains a mask identifier, and the text feature information to be processed matches the image feature information to be processed;
  • a generation unit configured to generate a first vector information group corresponding to the text feature information to be processed and a second vector information group corresponding to the image feature information to be processed based on the initial vector generation rule;
  • An encoding unit configured to encode the first vector information group and the second vector information group through initial encoding rules to obtain a corresponding fusion vector information group; wherein the fusion vector information group includes a plurality of fusion vector information , each fusion vector information is related to the first vector information group and the second vector information group;
  • a decoding unit configured to decode the fusion vector information group through initial decoding rules to obtain the prediction result corresponding to the mask identifier
  • a determination unit, configured to train the initial pre-training model based on the prediction result and the mask identifier to obtain a target pre-training model, where the target pre-training model is used to train, according to the obtained target task category, the target task processing model corresponding to the target task category.
  • the device is also used for:
  • sample graphic and text information includes sample text information and sample image information, and the sample text information and sample image information match;
  • Mask processing is performed on part of the mark information in the mark information group or on part of the initial vector information in the initial vector information group based on preset masking rules to obtain the image and text feature information to be processed.
  • the device when the device is used to train the initial pre-training model based on the prediction result and the mask identifier to obtain a target pre-training model, it is specifically used to:
  • if the similarity value is less than the preset similarity value, use the initial pre-training model as the target pre-training model; if the similarity value is not less than the preset similarity value, update the model parameters in the initial pre-training model based on the similarity value to obtain an initial pre-training model with updated model parameters, and return to the step of generating, based on the initial vector generation rule, the first vector information group corresponding to the text feature information to be processed and the second vector information group corresponding to the image feature information to be processed, until the similarity value is less than the preset similarity value and the target pre-trained model is obtained;
  • model parameters in the initial pre-training model include: at least one parameter among the parameters in the initial vector generation rule, the parameters in the initial encoding rule, and the parameters in the initial decoding rule.
  • the device is also used for:
  • the sample image information is encoded according to the second preset encoding rule to obtain a coded value group corresponding to the sample image information.
  • the coded value group includes a plurality of coded values, where the number of coded values in the coded value group is the same as the number of pieces of initial vector information in the initial vector information group;
  • the target information corresponding to the mask identification is determined according to the encoded value group.
  • the device when the device is used to determine the target information corresponding to the mask identifier according to the encoded value group, it is specifically used to:
  • if the image and text feature information to be processed is obtained by masking part of the initial vector information in the initial vector information group based on preset masking rules, the masked content is the partial initial vector information, and the mask object is the initial vector information group.
  • the device when used to: train an initial pre-training model based on the prediction result and the mask identifier to obtain a target pre-training model, it is specifically used to:
  • if the number of updates is not greater than the preset number, update the model parameters in the model to obtain an initial pre-trained model with updated model parameters, and return to the step of generating, based on the initial vector generation rules, the first vector information group corresponding to the text feature information to be processed and the second vector information group corresponding to the image feature information to be processed; when the number of updates is greater than the preset number, use the initial pre-training model as the target pre-training model;
  • model parameters in the initial pre-training model include: at least one parameter among the parameters in the initial vector generation rule, the parameters in the initial encoding rule, and the parameters in the initial decoding rule.
  • a model training device including:
  • An acquisition unit, configured to acquire a target task category and sample task information corresponding to the target task category, where the sample task information includes sample task input information and a sample task result label corresponding to the sample task input information; and configured to acquire multiple candidate units in the target pre-training model, where the multiple candidate units include: a preprocessing unit, a first target vector generation unit, a first target cross-modal encoding unit, and a first target cross-modal decoding unit;
  • a determination unit configured to determine the target unit corresponding to the target task category from the plurality of candidate units according to the preset correspondence relationship
  • a construction unit configured to construct an initial task processing model corresponding to the target task category based on the target unit
  • a training unit configured to use the sample task information to train the initial task processing model to obtain a target task processing model for completing the target task corresponding to the target task category;
  • the target pre-training model is obtained by training the initial pre-training model in the aforementioned data processing method.
  • a task processing device including:
  • An acquisition unit configured to acquire target task information of the target task category, where the target task information includes target task input information
  • a determination unit configured to determine a corresponding target task processing model according to the target task category
  • An input unit configured to input the target task input information into the target task processing model to obtain a target task processing result corresponding to the target task category and the target task input information;
  • the target task processing model is trained through the aforementioned model training method.
  • an electronic device including:
  • a memory, used to store executable instructions of the processor;
  • where the processor is configured to execute each of the foregoing methods by executing the executable instructions.
  • a computer-readable storage medium is provided with a computer program stored thereon; when the computer program is executed by the processor, the aforementioned methods are implemented.
  • the disclosed systems, devices and methods can be implemented in other ways.
  • the device embodiments described above are only illustrative.
  • the division of the modules is only a logical function division; in actual implementation, there may be other division methods.
  • multiple modules or components may be combined or integrated into another system, or some features can be ignored or not implemented.
  • the coupling or direct coupling or communication connection between components shown or discussed may be indirect coupling or communication connection through some interfaces, devices, or modules, and may be in electrical, mechanical, or other forms.
  • Modules described as separate components may or may not be physically separated, and components shown as modules may or may not be physical modules, that is, they may be located in one place, or they may be distributed to multiple network units. Some or all of the modules can be selected according to actual needs to achieve the purpose of the solution of this embodiment. For example, each functional module in each embodiment of the present application can be integrated into a processing module, or each module can exist physically alone, or two or more modules can be integrated into one module.


Abstract

Provided are a data processing method, device, and equipment, and a computer medium. The method includes: acquiring image and text feature information to be processed, which includes mutually matched text feature information to be processed and image feature information to be processed, where the text feature information to be processed or the image feature information to be processed contains a mask identifier; processing the image and text feature information to be processed based on an initial vector generation rule to obtain a first vector information group and a second vector information group; encoding the first vector information group and the second vector information group through an initial encoding rule to obtain a fusion vector information group; decoding the fusion vector information group through an initial decoding rule to obtain the prediction result corresponding to the mask identifier; and training an initial pre-training model based on the prediction result and the mask identifier to obtain a target pre-training model.

Description

Data processing method, device, equipment and computer medium
Cross-reference to related applications
This application claims priority to Chinese Patent Application No. 202210671652.2, entitled "Data processing method, device, equipment and computer medium" and filed on June 14, 2022, the entire contents of which are incorporated herein by reference.
Technical field
This application belongs to the field of artificial intelligence, and in particular relates to a data processing method, device, equipment and computer medium.
Background
At present, pre-training models in the related art generally focus only on processing tasks of a single task category, where the task categories may be: text understanding tasks, visual understanding tasks, multi-modal understanding tasks, image-to-text generation tasks, and text-to-image generation tasks. A multi-modal understanding task understands visual information and language information at the same time to solve tasks such as visual question answering, visual reasoning, and visual implication. An image-to-text generation task needs to understand the input image information to generate a corresponding text description. A text-to-image generation task needs to generate a corresponding image according to input text information. When tasks of multiple categories need to be completed, the related art needs to train a task processing model corresponding to each category of task and a pre-training model corresponding to each category of task, and the training efficiency of the pre-training models is low.
Summary
The embodiments of this application provide an implementation solution different from the prior art, to solve the technical problem in the prior art that, when task processing models corresponding to multiple kinds of tasks need to be trained, the training efficiency of training the corresponding pre-training models is low.
In a first aspect, this application provides a data processing method, including: acquiring image and text feature information to be processed, where the image and text feature information to be processed includes text feature information to be processed and image feature information to be processed; the text feature information to be processed or the image feature information to be processed contains a mask identifier, and the text feature information to be processed matches the image feature information to be processed;
generating, based on an initial vector generation rule, a first vector information group corresponding to the text feature information to be processed and a second vector information group corresponding to the image feature information to be processed;
encoding the first vector information group and the second vector information group through an initial encoding rule to obtain a corresponding fusion vector information group, where the fusion vector information group includes multiple pieces of fusion vector information, and each piece of fusion vector information is related to the first vector information group and the second vector information group;
decoding the fusion vector information group through an initial decoding rule to obtain the prediction result corresponding to the mask identifier;
training the initial pre-training model based on the prediction result and the mask identifier to obtain a target pre-training model, where the target pre-training model is used to train, according to an acquired target task category, the target task processing model corresponding to the target task category.
In a second aspect, this application provides a model training method, including:
acquiring a target task category and the sample task information corresponding to the target task category, where the sample task information includes sample task input information and a sample task result label corresponding to the sample task input information;
acquiring multiple candidate units in a target pre-training model, where the multiple candidate units include: a preprocessing unit, a first target vector generation unit, a first target cross-modal coding unit, and a first target cross-modal decoding unit;
determining, from the multiple candidate units according to a preset correspondence, the target unit corresponding to the target task category;
building, based on the target unit, the initial task processing model corresponding to the target task category;
training the initial task processing model with the sample task information to obtain the target task processing model for completing the target task corresponding to the target task category;
where the target pre-training model is obtained by training the initial pre-training model in the foregoing data processing method.
In a third aspect, this application provides a task processing method, including:
acquiring target task information of a target task category, where the target task information includes target task input information;
determining the corresponding target task processing model according to the target task category;
inputting the target task input information into the target task processing model to obtain a target task processing result corresponding to the target task category and the target task input information;
where the target task processing model is trained through the foregoing model training method.
In a fourth aspect, this application provides a data processing device, including:
an acquisition unit, used to acquire image and text feature information to be processed, where the image and text feature information to be processed includes text feature information to be processed and image feature information to be processed; the text feature information to be processed or the image feature information to be processed contains a mask identifier, and the text feature information to be processed matches the image feature information to be processed;
a generation unit, used to generate, based on an initial vector generation rule, the first vector information group corresponding to the text feature information to be processed and the second vector information group corresponding to the image feature information to be processed;
an encoding unit, used to encode the first vector information group and the second vector information group through an initial encoding rule to obtain the corresponding fusion vector information group, where the fusion vector information group includes multiple pieces of fusion vector information, and each piece of fusion vector information is related to the first vector information group and the second vector information group;
a decoding unit, used to decode the fusion vector information group through an initial decoding rule to obtain the prediction result corresponding to the mask identifier;
a determination unit, used to train the initial pre-training model based on the prediction result and the mask identifier to obtain a target pre-training model, where the target pre-training model is used to train, according to an acquired target task category, the target task processing model corresponding to the target task category.
In a fifth aspect, this application provides a model training device, including:
an acquisition unit, used to acquire a target task category and the sample task information corresponding to the target task category, where the sample task information includes sample task input information and a sample task result label corresponding to the sample task input information; and used to acquire multiple candidate units in a target pre-training model, where the multiple candidate units include: a preprocessing unit, a first target vector generation unit, a first target cross-modal coding unit, and a first target cross-modal decoding unit;
a determination unit, used to determine, from the multiple candidate units according to a preset correspondence, the target unit corresponding to the target task category;
a building unit, used to build, based on the target unit, the initial task processing model corresponding to the target task category;
a training unit, used to train the initial task processing model with the sample task information to obtain the target task processing model for completing the target task corresponding to the target task category;
where the target pre-training model is obtained by training the initial pre-training model in the foregoing data processing method.
In a sixth aspect, this application provides a task processing device, including:
an acquisition unit, used to acquire target task information of a target task category, where the target task information includes target task input information;
a determination unit, used to determine the corresponding target task processing model according to the target task category;
an input unit, used to input the target task input information into the target task processing model to obtain a target task processing result corresponding to the target task category and the target task input information;
where the target task processing model is trained through the foregoing model training method.
In a seventh aspect, this application provides an electronic device, including:
a processor; and
a memory for storing executable instructions of the processor;
where the processor is configured to perform, by executing the executable instructions, any method of the first aspect, the second aspect, the third aspect, or any of the possible implementations of the first aspect, the second aspect, or the third aspect.
In an eighth aspect, an embodiment of this application provides a computer-readable storage medium on which a computer program is stored; when the computer program is executed by a processor, any method of the first aspect, the second aspect, the third aspect, or any of the possible implementations of the first aspect, the second aspect, or the third aspect is implemented.
In this application, the following solution is adopted: acquiring image and text feature information to be processed, where the image and text feature information to be processed includes text feature information to be processed and image feature information to be processed, the text feature information to be processed or the image feature information to be processed contains a mask identifier, and the text feature information to be processed matches the image feature information to be processed; generating, based on an initial vector generation rule, a first vector information group corresponding to the text feature information to be processed and a second vector information group corresponding to the image feature information to be processed; encoding the first vector information group and the second vector information group through an initial encoding rule to obtain a corresponding fusion vector information group, where the fusion vector information group includes multiple pieces of fusion vector information and each piece of fusion vector information is related to the first vector information group and the second vector information group; decoding the fusion vector information group through an initial decoding rule to obtain the prediction result corresponding to the mask identifier; and training the initial pre-training model based on the prediction result and the mask identifier to obtain a target pre-training model, where the target pre-training model is used to train, according to an acquired target task category, the target task processing model corresponding to the target task category. This solution unifies, into one and the same pre-training model, the processing of the text that matches an image and of the image itself, and the sample data used to train the target pre-training model involves multi-modal information. The pre-training model trained through the solution of this application can provide material for the training of multiple kinds of task processing models, thereby improving the training efficiency of training the corresponding pre-training model when task processing models corresponding to multiple kinds of tasks need to be trained.
Brief description of the drawings
To explain the technical solutions in the embodiments of this application or the prior art more clearly, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below are some embodiments of this application, and those of ordinary skill in the art can obtain other drawings from them without creative work. In the drawings:
Figure 1 is a schematic structural diagram of a data processing system provided by an embodiment of this application;
Figure 2a is a schematic flowchart of a data processing method provided by an embodiment of this application;
Figure 2b is another schematic flowchart of the data processing method provided by an embodiment of this application;
Figure 2c is another schematic flowchart of the data processing method provided by an embodiment of this application;
Figure 3a is a schematic flowchart of a model training method provided by an embodiment of this application;
Figure 3b is a schematic diagram of how a target unit is determined, provided by an embodiment of this application;
Figure 3c is an example diagram of the target task processing results when the target task category is analyzing the semantic recognition result corresponding to an image and classifying the image;
Figure 3d is an example diagram of the target task processing results when the target task category is answering questions according to image and text information;
Figure 3e is an example diagram of the target task processing results when the target task category is judging whether text correctly describes an image pair;
Figure 3f is an example diagram of the target task processing results when the target task category is the task of, given an image and a text description, judging whether the relationship between the image and the text is implication, contradiction, or neutral;
Figure 3g is an example diagram of the target task processing results when the target task category is, given an image, outputting a text description of the image;
Figure 3h is an example diagram of the target task processing results when the target task category is, given a text description, outputting the image corresponding to the text description;
Figure 3i is a comparison, for the task of outputting an image for a given text description, between the target task processing results of the target task processing model trained by the model training method of this application and the target task processing results determined by related models based on DALLE and OFA;
Figure 4 is a schematic flowchart of a task processing method provided by an embodiment of this application;
Figure 5 is a schematic structural diagram of a data processing device provided by an embodiment of this application;
Figure 6 is a schematic structural diagram of a model training device provided by an embodiment of this application;
Figure 7 is a schematic structural diagram of a task processing device provided by an embodiment of this application;
Figure 8 is a schematic structural diagram of an electronic device provided by an embodiment of this application.
Detailed description
The embodiments of this application are described in detail below, and examples of the embodiments are shown in the drawings. The embodiments described below with reference to the drawings are exemplary and are intended to explain this application, and shall not be construed as limiting this application.
The terms "first", "second", and the like in the specification, claims, and drawings of the embodiments of this application are used to distinguish similar objects and are not necessarily used to describe a particular order or sequence. It should be understood that data used in this way are interchangeable under appropriate circumstances, so that the embodiments of this application described here can, for example, be implemented in orders other than those illustrated or described here. In addition, the terms "include" and "have" and any variations of them are intended to cover non-exclusive inclusion; for example, a process, method, system, product, or device that includes a series of steps or units is not necessarily limited to the steps or units clearly listed, but may include other steps or units that are not clearly listed or that are inherent to the process, method, product, or device.
First, some terms in the embodiments of this application are explained below to facilitate understanding by those skilled in the art.
MIM: Masked Image Model.
MLM: Masked Language Model.
FLAVA: A Foundational Language And Vision Alignment Model.
CLIP: Contrastive Language-Image Pre-training, a multi-modal model based on parallel image and text branches, which builds its training objective by computing the similarity of the feature vectors of the two branches.
SimVLM: Simple Visual Language Model Pre-training with Weak Supervision.
MNLI: Multi-Genre Natural Language Inference, textual entailment recognition.
CoLA: The Corpus of Linguistic Acceptability, a dataset about grammar; the task is mainly to judge, for a given sentence, whether it is grammatically correct.
MRPC: Microsoft Research Paraphrase Corpus, judging whether two given sentences have the same semantics; a binary text classification task over sentence pairs.
QQP: Quora Question Pairs, text matching; a dataset released by Quora on whether two sentences are semantically consistent; a binary text classification task over sentence pairs.
SST: The Stanford Sentiment Treebank, a sentiment analysis dataset mainly for sentiment classification of movie reviews; SST is therefore a single-sentence text classification task (SST-2 is binary classification, and SST-5 is five-way classification with a finer-grained distinction of sentiment polarity).
QNLI: Question Natural Language Inference, whose predecessor is the SQuAD 1.0 dataset; given a question, it must be judged whether a given text contains the correct answer to the question; a binary text classification task over sentence pairs.
RTE: Recognizing Textual Entailment; like MNLI, it is a textual entailment task, except that MNLI is three-way classification, while RTE only needs to judge whether two sentences can be inferred or aligned; a binary text classification task over sentence pairs.
STS-B: the semantic textual similarity benchmark dataset.
ImageNet: a large visual database for research on visual object recognition software.
Food-101 dataset: contains images of 101 food categories, 101,000 images in total, with an average of 250 test images and 750 training images per category. The training images have not been cleaned. All images have been rescaled to a maximum side length of 512 pixels.
CIFAR-10 dataset: a small dataset for recognizing common objects, containing RGB color images of 10 categories, with 50,000 training images and 10,000 test images in total.
CIFAR-100 dataset: has 100 classes; each class has 600 color images, of which 500 serve as the training set and 100 as the test set.
Cars: a car dataset.
Aircraft dataset: contains images of 10,200 aircraft, covering 102 different aircraft types with 100 images each.
DTD: Describable Textures Dataset, a texture recognition dataset.
Pets dataset: a pet dataset provided by Oxford, containing about 7,000 cat and dog images, part of which are annotated with the positions of the cat or dog faces.
Flowers102 dataset: contains images of 102 flower categories, with 40 to 258 images per category; the images vary richly in scale, pose, and lighting.
MNIST dataset: an image dataset of handwritten digits.
STL-10 dataset: an image recognition dataset for developing unsupervised feature learning, deep learning, and self-taught learning algorithms.
Country211: a country statistics dataset.
VQA_v2: visual question answering, version 2 of the visual question answering task; its form is: given an image and a question about the image, output an answer.
SNLI-VE: based on The Stanford Natural Language Inference corpus, which consists of 500,000 labeled English sentence pairs.
NLVR2: a dataset launched by a Cornell University research team containing 107,292 examples of human-written English sentences grounded in pairs of photographs.
OFA model: OFA (Unifying architectures, tasks, modalities through a simple sequence-to-sequence learning framework) is a multi-task training framework that unifies different tasks into a sequence-to-sequence training objective and achieves the purpose of pre-training by training multiple downstream multi-modal tasks at the same time. The model needs to use the labeled data of downstream tasks, so it has deficiencies in scalability and operability.
DALL.E model: a text-to-image generation model that discretizes images and then jointly models image tokens and text tokens to achieve the purpose of generating pictures from text.
Prefix language model: a front-to-back language model that can generate the remaining text according to an input image and a prefix text, and generate the remaining image according to an input text and a prefix image.
The technical solution of this application, and how the technical solution of this application solves the above technical problem, are described in detail below with specific embodiments. The following specific embodiments can be combined with each other, and the same or similar concepts or processes may not be repeated in some embodiments. The embodiments of this application are described below with reference to the drawings.
Figure 1 is a schematic structural diagram of a data processing system provided by an exemplary embodiment of this application. The structure includes: a task processing device 11 and a model training device 12, both of which can be computer devices, where a computer device can be a terminal or a server. The terminal can be a smartphone, a tablet computer, a notebook computer, an intelligent voice interaction device, a smart home appliance, a wearable smart device, an aircraft, a smart vehicle terminal, or another device, and the terminal can also include a client, which can be a video client, a browser client, an instant messaging client, or the like. The server can be an independent physical server, a server cluster or distributed system composed of multiple physical servers, or a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, content delivery networks (Content Delivery Network, CDN), and big data and artificial intelligence platforms.
Optionally, the task processing device 11 is used to:
acquire target task information of a target task category, where the target task information includes target task input information;
determine the corresponding target task processing model according to the target task category;
input the target task input information into the target task processing model to obtain a target task processing result corresponding to the target task category and the target task input information;
where the target task processing model is trained through the foregoing model training method.
Specifically, the model training device 12 can be used to train the initial pre-training model to obtain the target pre-training model, and to train the initial task processing model to obtain the foregoing target task processing model.
Optionally, when the model training device 12 is used to train the initial pre-training model to obtain the target pre-training model, it is specifically used to:
acquire image and text feature information to be processed, where the image and text feature information to be processed includes text feature information to be processed and image feature information to be processed; the text feature information to be processed or the image feature information to be processed contains a mask identifier, and the text feature information to be processed matches the image feature information to be processed;
generate, based on an initial vector generation rule, the first vector information group corresponding to the text feature information to be processed and the second vector information group corresponding to the image feature information to be processed;
encode the first vector information group and the second vector information group through an initial encoding rule to obtain the corresponding fusion vector information group, where the fusion vector information group includes multiple pieces of fusion vector information, and each piece of fusion vector information is related to the first vector information group and the second vector information group;
decode the fusion vector information group through an initial decoding rule to obtain the prediction result corresponding to the mask identifier;
train the initial pre-training model based on the prediction result and the mask identifier to obtain the target pre-training model, where the target pre-training model is used to train, according to an acquired target task category, the target task processing model corresponding to the target task category.
Further, when the model training device 12 is used to train the initial task processing model to obtain the foregoing target task processing model, it is specifically used to:
acquire a target task category and the sample task information corresponding to the target task category, where the sample task information includes sample task input information and a sample task result label corresponding to the sample task input information;
acquire multiple candidate units in the target pre-training model, where the multiple candidate units include: a preprocessing unit, a first target vector generation unit, a first target cross-modal coding unit, and a first target cross-modal decoding unit;
determine, from the multiple candidate units according to a preset correspondence, the target unit corresponding to the target task category;
build, based on the target unit, the initial task processing model corresponding to the target task category;
train the initial task processing model with the sample task information to obtain the target task processing model for completing the target task corresponding to the target task category;
where the target pre-training model is obtained through training based on the foregoing initial pre-training model.
For the execution principles and interaction processes of the component units in this system embodiment, such as the task processing device 11 and the model training device 12, reference can be made to the descriptions of the following method embodiments.
Figure 2a is a schematic flowchart of a data processing method provided by an exemplary embodiment of this application. The method is applicable to a model training device and is used to train an initial pre-training model into a target pre-training model. The method at least includes the following S201-S205:
S201: Acquire image and text feature information to be processed, where the image and text feature information to be processed includes text feature information to be processed and image feature information to be processed; the text feature information to be processed or the image feature information to be processed contains a mask identifier, and the text feature information to be processed matches the image feature information to be processed.
Specifically, the mask identifier can be a preset identifier, for example, the preset Arabic numeral 0. Regarding how the image and text feature information to be processed is determined, the method further includes the following steps S01-S04:
S01: Acquire sample image and text information, where the sample image and text information includes sample text information and sample image information, and the sample text information matches the sample image information.
Optionally, that the sample text information matches the sample image information means that the content referred to by the sample text information is related to the content referred to by the sample image information; for example, the content referred to by the sample text information is "puppy", and the content referred to by the sample image information is an image of a puppy.
For another example, when the content referred to by the sample text information is "The Last Supper of Jesus with the Twelve Aposities, painting by Leonardo da Vinci", the content referred to by the sample image information is the painting The Last Supper.
S02: Determine, according to the sample text information and a preset vocabulary, the mark information of each character in the sample text information in the preset vocabulary, to obtain the mark information group corresponding to the sample text information.
S03: Encode the sample image information according to a first preset encoding rule to obtain the initial vector information group corresponding to the sample image information.
In some optional embodiments of this application, as shown in Figures 2b and 2c, the initial pre-training model includes a preprocessing unit, and the preprocessing unit contains a table lookup unit and a first image encoder; the foregoing S02 can be performed through the table lookup unit, the foregoing first preset encoding rule can be built into the first image encoder, and the foregoing S03 can be performed through the first image encoder.
The mark information group contains multiple pieces of mark information, and each piece of mark information corresponds one-to-one to a character in the sample text information. Optionally, the mark information can be address information, index information, position information, etc.
The multiple pieces of mark information contained in the mark information group can all be Arabic numerals; for example, they can be a1 a2 a3 a4 a5 a6 a7 a8 a9, a10 a11 a12 a13 a14 in Figures 2b and 2c, where a1, a2, a3, a4, a5, a6, a7, a8, a9, a10, a11, a12, a13, and a14 can be Arabic numerals, and the multiple pieces of mark information are not necessarily consecutive.
The first image encoder is used for feature extraction of the sample image information. Specifically, it can include a splitting module and an encoding module, and can split the sample image information into a preset number of sub-images; then, multiple pieces of corresponding initial vector information are determined based on the encoding module and the multiple sub-images, and the multiple pieces of initial vector information compose the initial vector information group. Optionally, the initial vector information corresponds one-to-one to the sub-images.
The multiple pieces of initial vector information contained in the initial vector information group can be numerical vectors, specifically b1 b2 b3 b4 b5 b6 b7 b8 b9 in Figure 2c. A minimal sketch of this preprocessing is given below.
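The following is a minimal sketch of such a preprocessing unit, assuming PyTorch; the toy vocabulary, patch size, and embedding width are illustrative stand-ins, not values from this application.

```python
# Minimal preprocessing sketch (assumptions: PyTorch; a toy vocabulary;
# illustrative patch size and embedding width).
import torch
import torch.nn as nn

vocab = {"[MASK]": 0, "jesus": 1, "with": 2, "the": 3, "twelve": 4}  # toy lookup table

def lookup_unit(tokens):
    """Table lookup unit: maps each character/token to its mark information (an id)."""
    return torch.tensor([vocab[t] for t in tokens])

class FirstImageEncoder(nn.Module):
    """Splits an image into sub-images (patches) and encodes each to an initial vector."""
    def __init__(self, patch=16, dim=256):
        super().__init__()
        # A strided convolution both splits into patches and encodes them.
        self.proj = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)

    def forward(self, image):                 # image: (B, 3, H, W)
        x = self.proj(image)                  # (B, dim, H/patch, W/patch)
        return x.flatten(2).transpose(1, 2)   # initial vector information group: (B, num_patches, dim)
```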
S04: Perform mask processing on part of the mark information in the mark information group, or on part of the initial vector information in the initial vector information group, based on a preset mask rule, to obtain the image and text feature information to be processed.
Optionally, the initial pre-training model further includes a mask processing unit, and the foregoing preset mask rule can be built into the mask processing unit. The preset mask rule can include a first mask rule and a second mask rule. When mask processing is performed on part of the mark information in the mark information group through the mask processing unit with the built-in preset mask rule, mask processing can specifically be performed on part of the mark information in the mark information group through a first mask processing module, built into the mask processing unit, with the first mask rule. In this case, the image feature information to be processed is the same as the multiple pieces of initial vector information in the initial vector information group. After mask processing is performed on part of the mark information in the mark information group, the mask identifiers obtained by masking the part of the mark information, and the remaining unmasked mark information in the mark information group, are obtained; the sum of the mask identifiers and the remaining mark information is the text feature information to be processed. In this case, for the image and text feature information to be processed, see image and text feature information 1 in Figure 2b.
The amount of mark information contained in the masked part of the mark information is the same as the number of mask identifiers.
When mask processing is performed on part of the initial vector information in the initial vector information group through the mask processing unit with the built-in preset mask rule, mask processing can specifically be performed on part of the initial vector information in the initial vector information group through a second mask processing module, built into the mask processing unit, with the second mask rule. In this case, the text feature information to be processed is the same as the multiple pieces of mark information in the mark information group. After mask processing is performed on part of the initial vector information in the initial vector information group, the mask identifiers obtained by masking the part of the initial vector information, and the remaining unmasked initial vector information in the initial vector information group, are obtained; the sum of the mask identifiers and the remaining initial vector information is the image feature information to be processed. In this case, for the image and text feature information to be processed, see image and text feature information 2 in Figure 2c.
The amount of initial vector information contained in the masked part of the initial vector information is the same as the number of mask identifiers.
Optionally, when mask processing is performed on part of the mark information in the mark information group, or on part of the initial vector information in the initial vector information group, through the mask processing unit, a mask of random length or random proportion can be applied from back to front (see the masking sketch below).
Optionally, when the sample text information matches the sample image information, the text feature information to be processed is regarded as matching the image feature information to be processed.
Optionally, the first mask processing module and the second mask processing module may or may not be the same module, which is not limited in this application. Specifically, for an example of performing mask processing on part of the mark information in the mark information group through the mask processing unit, see Figure 2b. In Figure 2b, part of the mark information a5 a6 a7 a8 a9, a10 a11 a12 a13 a14 in the mark information group a1 a2 a3 a4 a5 a6 a7 a8 a9, a10 a11 a12 a13 a14 is the masked content.
For an example of performing mask processing on part of the initial vector information in the initial vector information group through the mask processing unit with the built-in preset mask rule, see Figure 2c. In Figure 2c, part of the initial vector information b6 b7 b8 b9 in the initial vector information group b1 b2 b3 b4 b5 b6 b7 b8 b9 is the masked content.
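The following is a minimal sketch of the back-to-front random-length masking described above, assuming plain Python; MASK_ID = 0 mirrors the preset Arabic numeral 0 mentioned under S201.

```python
# Minimal masking sketch (assumption: the random-length suffix strategy described
# above; MASK_ID = 0 is the illustrative preset mask identifier).
import random

MASK_ID = 0

def mask_suffix(sequence):
    """Masks a random-length suffix of a mark-information (or vector-index) group."""
    cut = random.randint(1, len(sequence) - 1)        # keep at least one unmasked item
    masked = sequence[:cut] + [MASK_ID] * (len(sequence) - cut)
    targets = sequence[cut:]                          # masked content, later used as labels
    return masked, targets, cut

tokens = [17, 42, 8, 99, 23, 5]
masked, targets, cut = mask_suffix(tokens)
print(masked, targets, cut)
```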
S202: Generate, based on an initial vector generation rule, the first vector information group corresponding to the text feature information to be processed and the second vector information group corresponding to the image feature information to be processed.
For example, the first vector information group includes multiple pieces of first vector information. Optionally, as shown in Figures 2b and 2c, the foregoing initial pre-training model can further include an initial vector generation unit, the foregoing initial vector generation rule is built into the initial vector generation unit, and the foregoing S202 can specifically be performed through the initial vector generation unit.
The foregoing multiple pieces of first vector information can be t1, t2, ..., tn in Figures 2b and 2c; the second vector information group includes multiple pieces of second vector information, which can be v1, v2, ..., vm in Figures 2b and 2c.
Specifically, the foregoing initial vector generation rule includes a first vector generation rule and a second vector generation rule. The text feature information to be processed can be processed through a first vector generation module, built into the initial vector generation unit, with the first vector generation rule, to obtain the first vector information group corresponding to the text feature information to be processed; the image feature information to be processed is processed through a second vector generation module, built into the initial vector generation unit, with the second vector generation rule, to obtain the second vector information group corresponding to the image feature information to be processed. The second vector generation module can include the first three layers of the Resnet101 network.
The first vector generation module can perform position embedding and normalization processing on the text feature information to be processed; correspondingly, the second vector generation module can perform position embedding and normalization processing on the image feature information to be processed. A minimal sketch of such a vector generation module is given below.
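The following is a minimal sketch of a vector generation module that performs position embedding and normalization, assuming PyTorch; the vocabulary size, sequence length, and width are illustrative.

```python
# Minimal vector-generation sketch (assumptions: PyTorch; learned position
# embeddings plus LayerNorm realize the position embedding and normalization
# described above; sizes are illustrative).
import torch
import torch.nn as nn

class VectorGenerationUnit(nn.Module):
    def __init__(self, vocab_size=30000, max_len=512, dim=256):
        super().__init__()
        self.tok = nn.Embedding(vocab_size, dim)   # mark information -> vector
        self.pos = nn.Embedding(max_len, dim)      # position embedding
        self.norm = nn.LayerNorm(dim)              # normalization processing

    def forward(self, ids):                        # ids: (B, n)
        positions = torch.arange(ids.size(1), device=ids.device)
        return self.norm(self.tok(ids) + self.pos(positions))  # first vector information group
```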
S203: Encode the first vector information group and the second vector information group through an initial encoding rule to obtain the corresponding fusion vector information group, where the fusion vector information group includes multiple pieces of fusion vector information, and each piece of fusion vector information is related to the first vector information group and the second vector information group.
Optionally, the initial pre-training model can include an initial cross-modal coding unit, the foregoing initial encoding rule can be built into the initial cross-modal coding unit, and the foregoing S203 can specifically be performed through the initial cross-modal coding unit.
Specifically, the initial encoding rule can include a splicing rule and a cross-modal encoding rule.
As shown in Figures 2b and 2c, the initial cross-modal coding unit can include a splicing unit with the built-in splicing rule and a cross-modal encoder with the built-in cross-modal encoding rule; the splicing unit is used to splice the first vector information group and the second vector information group to obtain a spliced vector, and the cross-modal encoder is used to encode the spliced vector to obtain the fusion vector information group.
The spliced vector can be v1, v2, ..., vm, t1, t2, ..., tn, or t1, t2, ..., tn, v1, v2, ..., vm in Figures 2b and 2c. The multiple pieces of fusion vector information contained in the fusion vector information group include h1, h2, ..., hl, where l = m + n. A minimal sketch of this splice-and-encode step is given below.
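The following is a minimal sketch of the splice-and-encode step, assuming PyTorch and using a standard Transformer encoder as a stand-in for the cross-modal encoder; because of self-attention, each fused vector h_i depends on both v1...vm and t1...tn.

```python
# Minimal fusion sketch (assumptions: PyTorch; a standard Transformer encoder
# stands in for the cross-modal encoder; dimensions are illustrative).
import torch
import torch.nn as nn

dim = 256
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True),
    num_layers=6,
)

t = torch.randn(1, 10, dim)          # first vector information group t1..tn (n = 10)
v = torch.randn(1, 9, dim)           # second vector information group v1..vm (m = 9)
spliced = torch.cat([v, t], dim=1)   # splicing unit: v1..vm, t1..tn
fused = encoder(spliced)             # fusion vector information group h1..hl, l = m + n
print(fused.shape)                   # torch.Size([1, 19, 256])
```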
S204: Decode the fusion vector information group through an initial decoding rule to obtain the prediction result corresponding to the mask identifier.
The prediction result can include the predicted text feature information corresponding to the text feature information to be processed, or the predicted image feature information corresponding to the image feature information to be processed.
Optionally, as shown in Figures 2b and 2c, the initial pre-training model further includes an initial cross-modal decoding unit, and the foregoing initial decoding rule can be built into the initial cross-modal decoding unit. The foregoing S204 can specifically be performed through the initial cross-modal decoding unit.
Specifically, when the mask identifier is obtained by performing mask processing on part of the mark information in the mark information group through the mask processing unit, the prediction result can include the predicted text feature information corresponding to the text feature information to be processed; when the mask identifier is obtained by performing mask processing on part of the initial vector information in the initial vector information group through the mask processing unit, the prediction result can include the predicted encoded value corresponding to the image feature information to be processed. The explanation of the predicted encoded value is detailed below.
S205: Train the initial pre-training model based on the prediction result and the mask identifier to obtain a target pre-training model, where the target pre-training model is used to train, according to an acquired target task category, the target task processing model corresponding to the target task category.
可选地,具体可通过初始预训练模型中的训练单元执行前述S205。
可选地,S205中,基于所述预测结果与所述掩码标识对所述初始预训练模型进行训练,得到目标预训练模型,包括以下S2051-S2053:
S2051、获取所述掩码标识对应的目标信息;
可选地,当所述掩码标识为通过掩码处理单元,对所述标记信息组中的部分标记信息进行掩码处理得到的时,掩码标识对应的目标信息为部分标记信息。
当所述掩码标识为通过掩码处理单元,对所述初始向量信息组中的部分初始向量信息进行掩码处理得到的时,针对掩码标识对应的目标信息的确定方式,所述方法还包括:
按照第二预设编码规则对所述样本图像信息进行编码,得到所述样本图像信息对应的编码数值组,所述编码数值组包括多个编码数值,其中,所述编码数值组中的编码数值的数量,与所述初始向量信息组中的初始向量信息的数量相同;并根据所述编码数值组确定所述掩码标识对应的目标信息。
其中,编码数值可以为阿拉伯数值,如图2c所示,多个编码数值可以为:123 234 345 987 654 321 999 888 777。
具体可参见图2c所示，第二预设编码规则可内置于第二图像编码器，内置有第二预设编码规则的第二图像编码器的作用是将样本图像信息的特征转换为离散的数值，即编码数值组中的多个编码数值，便于初始交叉模态解码单元输出样本图像信息对应的预测编码数值，并将预测编码数值与编码数值组中掩码标识对应的目标信息进行对比，对初始预训练模型进行训练。
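将连续特征转换为离散编码数值的一种常见做法是最近邻码本量化（类似VQ类方法的量化步骤），以下给出一个示意草图（NumPy）；码本大小与维度为本示例的假设，本申请并未限定第二图像编码器的具体离散化实现：

```python
import numpy as np

def quantize(features, codebook):
    """将每个子图像特征映射到码本中最近邻条目的索引，得到离散的编码数值。
    features: (k, dim)；codebook: (V, dim)；返回 (k,) 的编码数值。"""
    d = ((features[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    return d.argmin(axis=1)

rng = np.random.default_rng(0)
codes = quantize(rng.normal(size=(9, 8)), rng.normal(size=(1024, 8)))
print(codes.shape)  # (9,)：编码数值的数量与初始向量信息的数量相同
```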
进一步地,根据所述编码数值组确定所述掩码标识对应的目标信息,可包括以下S001-S003:
S001、获取所述掩码标识对应的被掩码内容在掩码对象中的第一位置信息;
其中,若所述待处理图文特征信息为基于预设掩码规则对所述初始向量信息组中的部分初始向量信息进行掩码处理得到的,则所述被掩码内容为所述部分初始向量信息,所述掩码对象为初始向量信息组。
其中，第一位置信息可包括：被掩码内容中的各初始向量信息在初始向量信息组中的索引信息，例如，图2c中的初始向量信息组：b1 b2 b3 b4 b5 b6 b7 b8 b9中的部分初始向量信息b6 b7 b8 b9为被掩码内容时，b6 b7 b8 b9的第一位置信息可以为：6、7、8、9。
S002、从所述编码数值组中选择出所述第一位置信息对应的目标编码数值;
由于编码数值组中包含的编码数值的数量,与初始向量信息组中包含的初始向量信息的数量相同,可选地,编码数值组中各编码数值在编码数值组中的第二位置信息,与初始向量信息组中各初始向量信息在初始向量信息组中的第一位置信息一一对应。第二位置信息也可以为索引信息。具体地,图2c中的编码数值组中各编码数值在编码数值组中的索引信息,以及初始向量信息组中的各初始向量信息的索引信息可如表1所示:
表1 编码数值组与初始向量信息组中，各索引信息对应的初始向量信息与编码数值

索引信息：1 2 3 4 5 6 7 8 9
初始向量信息：b1 b2 b3 b4 b5 b6 b7 b8 b9
编码数值：123 234 345 987 654 321 999 888 777
可选地,编码数值组中还包含各编码数值的第二位置信息,从所述编码数值组中选择出所述第一位置信息对应的目标编码数值可包括:从所述编码数值组中确定出与所述第一位置信息匹配的第二位置信息对应的目标编码数值。其中,当第一位置信息与第二位置信息相同时,可视为第一位置信息与第二位置信息匹配。
S003、将所述目标编码数值作为所述目标信息;
其中,若所述待处理图文特征信息为基于预设掩码规则对所述初始向量信息组中的部分初始向量信息进行掩码处理得到的,则所述被掩码内容为所述部分初始向量信息,所述掩码对象为所述初始向量信息组。
例如,参见图2c与表1所示,当第一位置信息为:6、7、8、9时,则目标编码数值为:“321 999 888 777”。
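前述S001-S003的选取逻辑可用如下几行Python示意，数值沿用图2c与表1的例子（位置从1开始计数）：

```python
# 编码数值组（见图2c）与被掩码内容的第一位置信息（b6 b7 b8 b9）
code_group = [123, 234, 345, 987, 654, 321, 999, 888, 777]
first_positions = [6, 7, 8, 9]

# S002-S003：按第一位置信息选出目标编码数值，作为掩码标识对应的目标信息
target_info = [code_group[p - 1] for p in first_positions]
print(target_info)  # [321, 999, 888, 777]
```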
S2052、确定所述预测结果与所述目标信息的相似度值,具体地,前述相似度值可通过交叉熵函数确定;
可选地,可参见图2b与图2c所示,训练单元可包括对比单元,该对比单元可执行S2052。
S2053、若所述相似度值小于预设相似度值,则将所述初始预训练模型作为目标预训练模型;
若所述相似度值不小于所述预设相似度值,则基于所述相似度值对所述初始预训练模型中的模型参数进行更新,得到模型参数进行更新后的初始预训练模型,并返回执行基于初始向量生成规则生成所述待处理文本特征信息对应的第一向量信息组,以及所述待处理图像特征信息对应的第二向量信息组的步骤,直至所述相似度值小于所述预设相似度值,得出目标预训练模型为止;
其中,所述初始预训练模型中的模型参数包括:所述初始向量生成规则中的参数、所述初始编码规则中的参数,以及所述初始解码规则中的参数中的至少一种参数。
可选地,初始预训练模型中的模型参数还可以包括:预处理单元中的参数。
可选地，初始预训练模型中还可包括另一查表单元，用于供相关人员根据该查表单元的输出结果确认初始预训练模型的训练完成情况，例如，如图2b所示，若初始交叉模态解码单元的输出为：a5 a6 a7 a8 a9，a10 a11 a12 a13 a14，通过查表单元，则可确定出：“Jesus with the Twelve Apostles, painting by Leonardo da Vinci”。
可选地，初始预训练模型中还可包括与第二图像编码器对应的图像解码器，用于供相关人员根据该图像解码器的输出结果确认初始预训练模型的训练完成情况，例如，如图2c所示，若初始交叉模态解码单元的输出为：321 999 888 777，则通过图像解码器可解码出：前述多个子图像中，321对应的子图像、999对应的子图像、888对应的子图像，以及777对应的子图像。
在本申请的另一些可选的实施例中,基于所述预测结果与所述掩码标识对所述初始预训练模型进行训练,得到目标预训练模型,包括:
确定对所述初始预训练模型中的模型参数的更新次数是否大于预设次数,若是,则将所述初始预训练模型作为目标预训练模型;
若否,则获取所述掩码标识对应的目标信息;根据所述预测结果、所述目标信息以及预设的损失函数,确定出对应的损失信息;根据所述损失信息对所述初始预训练模型中的模型参数进行更新,得到模型参数进行更新后的初始预训练模型,并返回执行基于初始向量生成规则生成所述待处理文本特征信息对应的第一向量信息组,以及所述待处理图像特征信息对应的第二向量信息组的步骤,直至所述更新次数大于所述预设次数时,将所述初始预训练模型作为目标预训练模型;
其中,所述初始预训练模型中的模型参数包括:所述初始向量生成规则中的参数、所述初始编码规则中的参数,以及所述初始解码规则中的参数中的至少一种参数。
其中,损失函数可以为交叉熵函数。
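结合上述两种训练停止方式（相似度值小于预设值，或更新次数大于预设次数），以下给出一个示意性训练循环草图（PyTorch）；其中model、data_iter等接口均为本示例的假设，交叉熵值按正文所述被用作相似度值/损失信息：

```python
import torch.nn.functional as F

def pretrain(model, data_iter, optimizer, max_updates=100, loss_threshold=0.1):
    """示意：当交叉熵小于预设值，或更新次数大于预设次数时停止，
    并将当前的初始预训练模型作为目标预训练模型。"""
    for updates, (inputs, targets) in enumerate(data_iter):
        logits = model(inputs)                   # 掩码标识对应的预测结果
        loss = F.cross_entropy(logits, targets)  # 预测结果与目标信息的对比
        if loss.item() < loss_threshold or updates >= max_updates:
            break
        optimizer.zero_grad()
        loss.backward()                          # 更新初始预训练模型中的模型参数
        optimizer.step()
    return model
```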
综上，参见图2b与图2c所示，初始预训练模型可包括：所述预处理单元、所述掩码处理单元、所述初始向量生成单元、所述初始交叉模态编码单元、所述初始交叉模态解码单元，以及所述训练单元。
进一步地,所述方法还包括以下S1-S2:
S1、获取所述目标任务类别,与所述目标任务类别对应的样本任务信息,所述样本任务信息包括样本任务输入信息,以及所述样本任务输入信息对应的样本任务结果标签;
S2、利用所述样本任务信息与所述目标预训练模型训练用于完成所述目标任务类别对应的目标任务的所述目标任务处理模型。
可选地,目标任务类别可以为:文本理解任务、图像理解任务(即视觉理解任务)、文本生成图像、图像生成文本、多模态识别任务等中的任一种。
前述样本任务输入信息，包括为得出样本任务结果所需的、用于分析的任务前提；样本任务结果标签，为根据样本任务输入信息得出的任务处理结果。例如，若目标任务类别为图像生成文本，在训练该目标任务类别对应的目标任务处理模型时，需多个样本图像，以及各样本图像对应的文本生成结果；则该多个样本图像中的至少一个样本图像为样本任务输入信息，各样本图像对应的文本生成结果为样本任务结果标签。
可选地,目标预训练模型则为训练好的初始预训练模型。具体地,所述目标预训练模型包括多个候选单元:预处理单元、第一目标向量生成单元、第一目标交叉模态编码单元,以及第一目标交叉模态解码单元;
可选地,各候选单元中可包含多个子单元,确定出的目标单元可仅包括一候选单元中的部分子单元,也可以包含全部子单元。
可选地,预处理单元包括查表单元与第一图像编码器,确定出的目标单元中包含的预处理单元可仅包括查表单元或第一图像编码器,也可既包含查表单元也包含第一图像编码器。
可选地,第一目标交叉模态编码单元包括拼接单元与交叉模态编码器,确定出的目标单元包含的第一目标交叉模态编码单元中可仅包括交叉模态编码器,也可既包含拼接单元又包含交叉模态编码器。
S2中,利用所述样本任务信息与所述目标预训练模型训练用于完成所述目标任务类别对应的目标任务的所述目标任务处理模型,包括以下S21-S23:
S21、根据预设对应关系,从所述多个候选单元中,确定出所述目标任务类别对应的目标单元;
其中，预设对应关系中，存储有多个任务类别，以及各任务类别与其对应的目标单元的关联关系，一种可能的组织方式可参见下方示意。
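预设对应关系的一种可能组织方式如下方Python草图所示；其中的任务类别名与单元名称列表依据正文各实施例整理，仅为示例：

```python
# 预设对应关系：任务类别 -> 目标单元名称列表（依据正文实施例整理的示例）
PRESET_MAPPING = {
    "文本理解": ["查表单元", "第一目标向量生成单元", "交叉模态编码器", "第一目标交叉模态解码单元"],
    "图像分类": ["第一图像编码器", "第一目标向量生成单元", "交叉模态编码器", "第一目标交叉模态解码单元"],
    "图文问答": ["预处理单元", "第一目标向量生成单元", "第一目标交叉模态编码单元", "第一目标交叉模态解码单元"],
}

def select_units(task_category):
    """S21：根据预设对应关系，从多个候选单元中确定目标任务类别对应的目标单元。"""
    return PRESET_MAPPING[task_category]

print(select_units("图文问答"))
```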
S22、基于所述目标单元构建所述目标任务类别对应的初始任务处理模型;
S23、利用所述样本任务信息对所述初始任务处理模型进行训练,得到所述目标任务处理模型;
其中，所述第一目标向量生成单元与所述初始向量生成单元对应，所述第一目标交叉模态编码单元与所述初始交叉模态编码单元对应，以及所述第一目标交叉模态解码单元与所述初始交叉模态解码单元对应。
具体地,第一目标向量生成单元为:训练好目标预训练模型后,该目标预训练模型中的初始向量生成单元;第一目标交叉模态编码单元为:训练好目标预训练模型后,该目标预训练模型中的初始交叉模态编码单元;第一目标交叉模态解码单元为:训练好目标预训练模型后,该目标预训练模型中的初始交叉模态解码单元。
在本申请的另一些可选的实施例中,前述确定目标单元的方式,还可以基于用户针对目标单元的选择指令实现,对此,本申请不做限定。
可选地,用于训练目标预训练模型的样本数据还可以包括纯文本的样本数据。用于训练目标预训练模型的样本数据(如样本图像信息,纯文本的样本数据),可来源于网络,或公开数据集。
可选地，本申请中的目标预训练模型为一前缀语言模型，可以进行充分的语言和图像的关联，可使得目标预训练模型具有文本生成能力与图像编码能力，并充分关联文本与图像，以增强跨模态理解的能力。
本申请通过获取待处理图文特征信息，所述待处理图文特征信息包括待处理文本特征信息与待处理图像特征信息；所述待处理文本特征信息或所述待处理图像特征信息中包含掩码标识，所述待处理文本特征信息与所述待处理图像特征信息匹配；基于初始向量生成规则生成所述待处理文本特征信息对应的第一向量信息组，以及所述待处理图像特征信息对应的第二向量信息组；通过初始编码规则对所述第一向量信息组与所述第二向量信息组进行编码，得到对应的融合向量信息组；其中，所述融合向量信息组包括多个融合向量信息，各融合向量信息与所述第一向量信息组以及所述第二向量信息组相关；通过初始解码规则对所述融合向量信息组进行解码，得到所述掩码标识对应的预测结果；基于所述预测结果与所述掩码标识对所述初始预训练模型进行训练，得到目标预训练模型，所述目标预训练模型用于根据获取到的目标任务类别训练所述目标任务类别对应的目标任务处理模型的方案，将针对与图像匹配的文本，以及图像本身的处理过程，统一到了同一预训练模型中，且用于训练目标预训练模型的样本数据涉及多模态的信息，通过本申请的方案训练得出的预训练模型，可为多种任务处理模型的训练提供素材，从而提高了当需要训练对应多种任务的任务处理模型时，训练相应的预训练模型的训练效率。
另外，目标预训练模型中的第二图像编码器可赋予目标预训练模型图像生成能力，目标预训练模型中的查表单元可赋予目标预训练模型文字生成能力；第一目标交叉模态编码单元与第一目标交叉模态解码单元赋予目标预训练模型多模态理解能力、文本理解能力，以及视觉理解能力；进而使得目标预训练模型的兼容能力与扩展性较强。其可为多种任务处理模型的训练提供素材，使得处理多种任务的效率也得到了提高。
并且，本方案提出通过第二图像编码器将图像编码为离散的数据，进而可将包含图像信息与文本信息的图文信息作为样本数据训练目标预训练模型，使得图像信息的处理方式类似于文本信息的处理方式，训练得出目标预训练模型的速度更快。
通过前述模型训练方法训练得出的目标任务处理模型,基于与多种模态相关的目标预训练模型训练得出,在对任务处理时,具有较高的任务处理准确度。表2为通过本申请的方案确定出的目标任务处理模型在处理任务时,相应的目标任务处理结果的准确度值,与通过相关技术中的其他方法确定出的任务处理模型在处理任务时的任务处理结果的准确度值的对比情况。
表2:本方案的任务处理模型在处理任务时,相应的任务处理结果的准确度值,与通过其他方法确定出的任务处理模型在处理任务时的任务处理结果的准确度值的对比情况
其中，MIM、MLM、FLAVA、CLIP、SimVLM指任务处理模型的类别；
MNLI、CoLA、MRPC、QQP、SST-2、QNLI、RTE、STS-B为被处理的任务的类别信息;其中,MNLI结果是MNLI-m和MNLI-mm的平均值。MRPC和QQP结果是准确度和F1分数的平均值。CoLA报告了马修斯相关系数(MCC),STS-B报告了皮尔逊相关系数(PCC)。
70M、46.4M，以及647.7M中的“M”指“million”，即70M、46.4M，以及647.7M指用于计算任务处理结果的准确度值的数据量。
NLP Avg指针对自然语言处理层面的任务处理结果的准确度值的平均值。
Vision Avg指针对视觉识别(即图像识别)层面的任务处理结果的准确度值的平均值。
Multi-modal指针对多模态处理任务的任务处理结果的准确度值的平均值。
Eval method指对应任务的测评方法,具体包括:1)Fine-tuning指在对应任务上完整训练模型;2)Linear eval指固定模型,通过添加一个分类器预测对应任务的结果;3)zero-shot指完全固定模型,不增加任何可学习参数,来解决对应任务。
ImageNet、Food101、CIFAR10、Cars、Aircraft、DTD、Pets、Flowers102、MNIST、STL10,以及Country211指数据集名称。各数据集对应的任务处理结果的准确度值指以当前数据集为任务分析数据的任务处理结果的准确度值。
VQAv2、SNLI-VE、NLVR2指数据集名称;
I2T和T2I表示图像生成文本与文本生成图像的任务。I2T@B4和I2T@C是图像生成文本任务的测评指标，B4指4-gram双语评估替换指数（Bilingual Evaluation Understudy，BLEU），C指基于共识的图像描述评估（Consensus-based Image Description Evaluation，CIDEr）。T2I@IS和T2I@FID是文本生成图像任务的测评指标，IS指Inception Score，FID指Frechet Inception Distance。
其中,“↑”表示对应的准确度值越大,代表任务处理结果的准确度越高,“↓”表示对应的准确度值越小,代表任务处理结果的准确度越高。
由表2可知,通过本方案确定出的目标任务处理模型对目标任务进行处理时,在相同模型规模和数据规模情况下达到了同类模型的最先进的效果。与通过本申请的方案确定出的目标任务处理模型进行比较的相关模型是FLAVA和SimVLM。通过本申请的方案确定出的目标任务处理模型在全部26个任务中有22个任务的表现为最佳。在文本理解任务上和相关模型持平,除此之外,全部优于相关模型,包括在视觉理解任务、多模态理解任务、文本到图像的生成和图像到文本的生成任务上取得了大幅的提升。
图3a是本申请一示例性实施例提供的一种模型训练方法的流程示意图，该方法可适用于模型训练设备，该方法至少包括以下S301-S305：
S301、获取目标任务类别,与所述目标任务类别对应的样本任务信息,所述样本任务信息包括样本任务输入信息,以及所述样本任务输入信息对应的样本任务结果标签;
S302、获取目标预训练模型中的多个候选单元,所述多个候选单元包括:预处理单元、第一目标向量生成单元、第一目标交叉模态编码单元,以及第一目标交叉模态解码单元;
S303、根据预设对应关系,从所述多个候选单元中,确定出所述目标任务类别对应的目标单元;
S304、基于所述目标单元构建所述目标任务类别对应的初始任务处理模型;
S305、利用所述样本任务信息对所述初始任务处理模型进行训练,得到用于完成所述目标任务类别对应的目标任务的目标任务处理模型;
可选地,所述目标预训练模型为通过图2a对应的实施例中的数据处理方法中的初始预训练模型训练得出的。
结合图2a对应的实施例可知,所述初始预训练模型包括:初始向量生成单元、初始交叉模态编码单元,以及初始交叉模态解码单元。
前述数据处理方法中,可具体基于初始向量生成单元中内置的初始向量生成规则,生成所述待处理文本特征信息对应的第一向量信息组,以及所述待处理图像特征信息对应的第二向量信息组;
通过内置于初始交叉模态编码单元的初始编码规则对所述第一向量信息组与所述第二向量信息组进行编码,得到对应的融合向量信息组;其中,所述融合向量信息组包括多个融合向量信息,各融合向量信息与所述第一向量信息组以及所述第二向量信息组相关;
并通过内置于初始交叉模态解码单元中的初始解码规则对所述融合向量信息组进行解码,得到所述掩码标识对应的预测结果;
并基于所述预测结果与所述掩码标识对初始预训练模型进行训练,得到目标预训练模型,所述目标预训练模型用于根据获取到的目标任务类别训练所述目标任务类别对应的目标任务处理模型。
其中，目标预训练模型为训练好的初始预训练模型，所述第一目标向量生成单元为初始预训练模型中训练好的初始向量生成单元、所述第一目标交叉模态编码单元为初始预训练模型中训练好的初始交叉模态编码单元，以及所述第一目标交叉模态解码单元为初始预训练模型中训练好的初始交叉模态解码单元。所述目标预训练模型包括多个候选单元，该多个候选单元包括以下单元中的至少两个：预处理单元、第一目标向量生成单元、第一目标交叉模态编码单元、第一目标交叉模态解码单元。
可选地，在基于所述目标单元构建初始任务处理模型时，还可根据目标任务类别获取目标单元以外的附加单元。即初始任务处理模型可以由目标单元与附加单元构建完成；在训练目标任务处理模型中，除了可对目标单元中的参数进行更新，还可以对附加单元的参数进行更新。
可选地，前述附加单元还可以为恢复单元，恢复单元可包括查表单元和/或与第二图像编码器对应的图像解码器。
可选地,若目标单元中包括:预处理单元、第一目标向量生成单元、第一目标交叉模态编码单元、第一目标交叉模态解码单元,则训练好的初始任务处理模型,即目标任务处理模型中可包括:训练好的预处理单元、第二目标向量生成单元、第二目标交叉模态编码单元,以及第二目标交叉模态解码单元,还可包括图像解码器。
其中,第二目标向量生成单元为训练好的第一目标向量生成单元、第二目标交叉模态编码单元为训练好的所述第一目标交叉模态编码单元,第二目标交叉模态解码单元为训练好的第一目标交叉模态解码单元。
可选地,不同的目标任务类别,确定出的目标单元不同,在本申请的一些可选的实施例中,若目标任务类别为文本理解任务时,参见图3b所示,从所述多个候选单元中,确定出的该目标任务类别对应的目标单元可包括:预处理单元中的查表单元、第一目标向量生成单元、第一目标交叉模态编码单元中的交叉模态编码器,以及第一目标交叉模态解码单元。
在本申请的一些可选的实施例中,若目标任务类别为文本分类时,仅涉及到对文本的分析,不涉及对图像的分析,因此,从所述多个候选单元中,确定出的该目标任务类别对应的目标单元可包括:预处理单元中的查表单元、第一目标向量生成单元、第一目标交叉模态编码单元中的交叉模态编码器,以及第一目标交叉模态解码单元。
其中,在基于所述目标单元构建出的初始任务处理模型中,预处理单元中的查表单元与第一目标向量生成单元连接、第一目标向量生成单元与第一目标交叉模态编码单元中的交叉模态编码器连接,且第一目标交叉模态编码单元中的交叉模态编码器与第一目标交叉模态解码单元连接。
进一步地,初始任务处理模型还可以包括前述目标单元以外的其他单元,如初始任务处理模型还包括初始分类器,该初始分类器的输入接口与第一目标交叉模态解码单元的输出接口连接,初始分类器的输出接口用于输出预测的任务处理结果,进而使得在训练目标任务处理模型过程中,基于样本任务信息、预测的任务处理结果,以及样本任务结果标签对初始任务处理模型进行训练。
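以下给出“基于目标单元构建初始任务处理模型并外接初始分类器”的一个简化草图（PyTorch）；其中以占位层代替真实的目标单元，隐藏维度与类别数均为本示例的假设：

```python
import torch.nn as nn

def build_task_model(target_units, num_classes, hidden=8):
    """示意：将选出的目标单元按正文所述顺序串联，
    并在第一目标交叉模态解码单元之后外接一个初始分类器。"""
    classifier = nn.Linear(hidden, num_classes)  # 初始分类器（目标单元以外的附加单元）
    return nn.Sequential(*target_units, classifier)

# 用法示意：实际应传入目标预训练模型中的对应单元，此处以占位层代替
model = build_task_model([nn.Identity(), nn.Identity()], num_classes=3)
print(model)
```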
在本申请的一些可选的实施例中,训练得出目标任务类别为文本分类时,对应的目标任务处理模型后,针对目标任务:获取两个语句,确定第二语句与第一语句的关系,其中:
第一语句:One of our number will carry out your instructions minutely;
第二语句:A member of my team will execute your orders with immense precision.
则可将第一语句与第二语句作为目标任务处理模型的目标任务输入信息,则可得出目标任务处理结果:蕴含(表示第二语句的语义蕴含第一语句的语义)。
在本申请的一些可选的实施例中,若目标任务类别为分析图像对应的语义识别结果,对图像进行分类的任务时,该目标任务类别仅涉及到对图像的分析与图像分类,因此,从所述多个候选单元中,确定出的该目标任务类别对应的目标单元可包括:预处理单元中的第一图像编码器,第一目标向量生成单元、第一目标交叉模态编码单元中的交叉模态编码器,以及第一目标交叉模态解码单元。
其中,在基于所述目标单元构建出的初始任务处理模型中,预处理单元中的第一图像编码器与第一目标向量生成单元连接、第一目标向量生成单元与第一目标交叉模态编码单元中的交叉模态编码器连接,且第一目标交叉模态编码单元中的交叉模态编码器与第一目标交叉模态解码单元连接。
进一步地,初始任务处理模型还可以包括前述目标单元以外的其他单元,如初始任务处理模型还包括初始分类器,该初始分类器的输入接口与第一目标交叉模态解码单元的输出接口连接,初始分类器的输出接口用于输出预测的任务处理结果,进而使得在训练目标任务处理模型过程中,基于样本任务信息、预测的任务处理结果,以及样本任务结果标签对初始任务处理模型进行训练。
在本申请的一些可选的实施例中,训练得出目标任务类别为分析图像对应的语义识别结果,对图像进行分类时,对应的目标任务处理模型后,针对目标任务:分析图3c中的图像1的语义识别结果,对图像进行分类,则可将图像1作为目标任务处理模型的目标任务输入信息,则可得出目标任务处理结果:台灯。针对目标任务:分析图3c中的图像2的语义识别结果,对图像进行分类,则可将图像2作为目标任务处理模型的目标任务输入信息,则可得出目标任务处理结果:冰激凌。
在本申请的一些可选的实施例中,若目标任务类别为根据图文信息回答问题的任务时,该目标任务类别涉及到对图像与文本的分析,因此,从所述多个候选单元中,确定出的该目标任务类别对应的目标单元可包括:预处理单元、第一目标向量生成单元、第一目标交叉模态编码单元,以及第一目标交叉模态解码单元。
其中，在基于所述目标单元构建出的初始任务处理模型中，预处理单元与第一目标向量生成单元连接、第一目标向量生成单元与第一目标交叉模态编码单元连接，且第一目标交叉模态编码单元与第一目标交叉模态解码单元连接。
进一步地,初始任务处理模型还可以包括前述目标单元以外的其他单元,如初始任务处理模型还包括初始分类器,该初始分类器的输入接口与第一目标交叉模态解码单元的输出接口连接,初始分类器的输出接口用于输出预测的任务处理结果,进而使得在训练目标任务处理模型过程中,基于样本任务信息、预测的任务处理结果,以及样本任务结果标签对初始任务处理模型进行训练。
在本申请的一些可选的实施例中,训练得出目标任务类别为根据图文信息回答问题的任务时,对应的目标任务处理模型后,针对目标任务:根据图3d中的图文信息1中包含的图像与文本:“Who is wearing glasses?”,回答问题,则可将图文信息1作为目标任务处理模型的目标任务输入信息,则可得出目标任务处理结果:man。针对目标任务:根据图3d中的图文信息2中包含的图像与文本:“Who is wearing glasses?”,回答问题,则可将图文信息2作为目标任务处理模型的目标任务输入信息,则可得出目标任务处理结果:woman。
在本申请的一些可选的实施例中,若目标任务类别为判断文字是否正确描述了图像对的任务时,该目标任务类别涉及到对图像与文本的分析,因此,从所述多个候选单元中,确定出的该目标任务类别对应的目标单元可包括:预处理单元、第一目标向量生成单元、第一目标交叉模态编码单元,以及第一目标交叉模态解码单元。
其中,在基于所述目标单元构建出的初始任务处理模型中,预处理单元与第一目标向量生成单元连接、第一目标向量生成单元与第一目标交叉模态编码单元连接,且第一目标交叉模态编码单元与第一目标交叉模态解码单元连接。
进一步地,初始任务处理模型还可以包括前述目标单元以外的其他单元,如初始任务处理模型还包括初始分类器,该初始分类器的输入接口与第一目标交叉模态解码单元的输出接口连接,初始分类器的输出接口用于输出预测的任务处理结果,进而使得在训练目标任务处理模型过程中,基于样本任务信息、预测的任务处理结果,以及样本任务结果标签对初始任务处理模型进行训练。
在本申请的一些可选的实施例中,训练得出目标任务类别为判断文字是否正确描述了图像对的任务时,对应的目标任务处理模型后,针对目标任务:判断图3e中的图文信息3中的文本是否正确描述了图文信息3中的图像对,则可将图文信息3作为目标任务处理模型的目标任务输入信息,则可得出目标任务处理结果:对。针对目标任务:判断图文信息4中的文本是否正确描述了图文信息4中的图像对,则可将图文信息4作为目标任务处理模型的目标任务输入信息,则可得出目标任务处理结果:错。
在本申请的一些可选的实施例中,若目标任务类别为给出一张图像和一个文本描述,判断图像和文本之间的关系是蕴含、矛盾还是中立的任务时,该目标任务类别涉及到对图像与文本的分析,因此,从所述多个候选单元中,确定出的该目标任务类别对应的目标单元可包括:预处理单元、第一目标向量生成单元、第一目标交叉模态编码单元,以及第一目标交叉模态解码单元。
其中,在基于所述目标单元构建出的初始任务处理模型中,预处理单元与第一目标向量生成单元连接、第一目标向量生成单元与第一目标交叉模态编码单元连接,且第一目标交叉模态编码单元与第一目标交叉模态解码单元连接。
进一步地,初始任务处理模型还可以包括前述目标单元以外的其他单元,如初始任务处理模型还包括初始分类器,该初始分类器的输入接口与第一目标交叉模态解码单元的输出接口连接,初始分类器的输出接口用于输出预测的任务处理结果,进而使得在训练目标任务处理模型过程中,基于样本任务信息、预测的任务处理结果,以及样本任务结果标签对初始任务处理模型进行训练。
在本申请的一些可选的实施例中,训练得出目标任务类别为给出一张图像和一个文本描述,判断图像和文本之间的关系是蕴含、矛盾还是中立的任务时,对应的目标任务处理模型后,
针对目标任务:给出图3f中的前提图像,与文本描述1:“Two woman are holding packages.”,判断前提图像,与文本描述1之间的关系是蕴含、矛盾还是中立,则可将前提图像与文本描述1作为目标任务处理模型的目标任务输入信息,则可得出目标任务处理结果:蕴含。
针对目标任务：给出图3f中的前提图像，与文本描述2：“The sisters are hugging goodbye while holding to go packages after just eating lunch.”，判断前提图像，与文本描述2之间的关系是蕴含、矛盾还是中立，则可将前提图像与文本描述2作为目标任务处理模型的目标任务输入信息，则可得出目标任务处理结果：中立。
针对目标任务:给出图3f中的前提图像,与文本描述3:“The men are fighting outside a deli.”,判断前提图像,与文本描述3之间的关系是蕴含、矛盾还是中立,则可将前提图像与文本描述3作为目标任务处理模型的目标任务输入信息,则可得出目标任务处理结果:矛盾。
在本申请的一些可选的实施例中，若目标任务类别为给定一张图像，输出该图像的文本描述的任务时，该目标任务类别仅涉及到对图像的分析，因此，从所述多个候选单元中，确定出的该目标任务类别对应的目标单元可包括：预处理单元中的第一图像编码器，第一目标向量生成单元、第一目标交叉模态编码单元中的交叉模态编码器，以及第一目标交叉模态解码单元。
其中,在基于所述目标单元构建出的初始任务处理模型中,预处理单元中的第一图像编码器与第一目标向量生成单元连接、第一目标向量生成单元与第一目标交叉模态编码单元中的交叉模态编码器连接,且第一目标交叉模态编码单元中的交叉模态编码器与第一目标交叉模态解码单元连接。
在本申请的一些可选的实施例中,训练得出目标任务类别为给定一张图像,输出该图像的文本描述时,对应的目标任务处理模型后,针对目标任务:给定图3g中的图像,输出该图像对应的文本描述,则可将该图像作为目标任务处理模型的目标任务输入信息,则可得出目标任务处理结果:“一只海鸟在岸边散步”。
在本申请的一些可选的实施例中,若目标任务类别为给定一段文本描述,输出该文本描述对应的图像时,仅涉及到对文本的分析,因此,从所述多个候选单元中,确定出的该目标任务类别对应的目标单元可包括:预处理单元中的查表单元、第一目标向量生成单元、第一目标交叉模态编码单元中的交叉模态编码器,以及第一目标交叉模态解码单元。
其中,在基于所述目标单元构建出的初始任务处理模型中,预处理单元中的查表单元与第一目标向量生成单元连接、第一目标向量生成单元与第一目标交叉模态编码单元中的交叉模态编码器连接,且第一目标交叉模态编码单元中的交叉模态编码器与第一目标交叉模态解码单元连接。
在本申请的一些可选的实施例中,训练得出目标任务类别为给定一段文本描述,输出该文本描述对应的图像时,对应的目标任务处理模型后,针对目标任务:给定图3h中的文本:“a baseball player holding a bat next to a base”,输出该文本对应的图像,则可将文本:“a baseball player holding a bat next to a base”,作为目标任务处理模型的目标任务输入信息,则可得出图3h中的图像。
通过本申请的模型训练方法训练得出的目标任务处理模型，针对给定一段文本描述、输出该文本描述对应图像的目标任务，其处理结果与相关技术中的DALLE以及OFA相比，可参见图3i所示。由图3i可知，通过本申请的方案得出的目标任务处理结果中的图像生成质量更高，表现为更真实和准确。
图4是本申请一示例性实施例提供的一种任务处理方法的流程示意图,该方法包括以下S401-S403:
S401、获取目标任务类别的目标任务信息,所述目标任务信息包括目标任务输入信息;
S402、根据所述目标任务类别确定对应的目标任务处理模型;
S403、将所述目标任务输入信息输入所述目标任务处理模型,得到对应所述目标任务类别与所述目标任务输入信息的目标任务处理结果;
其中,所述目标任务处理模型为通过前述模型训练方法训练得出的。
其中,目标任务类别与目标任务输入信息的对应关系,可参见前述图3a对应的实施例,此处不再赘述。
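S401-S403的调用流程可用如下Python草图示意；registry为本示例假设的“任务类别到目标任务处理模型”的登记结构：

```python
registry = {}  # 假设的登记结构：目标任务类别 -> 训练好的目标任务处理模型

def process_task(task_category, task_input):
    """S402：根据目标任务类别确定对应的目标任务处理模型；
    S403：将目标任务输入信息输入该模型，得到目标任务处理结果。"""
    model = registry[task_category]
    return model(task_input)
```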
图5为本申请一示例性实施例提供的一种数据处理装置的结构示意图;
其中,该装置包括:
获取单元51,用于获取待处理图文特征信息,所述待处理图文特征信息包括待处理文本特征信息与待处理图像特征信息;所述待处理文本特征信息或所述待处理图像特征信息中包含掩码标识,所述待处理文本特征信息与所述待处理图像特征信息匹配;
生成单元52,用于基于初始向量生成规则生成所述待处理文本特征信息对应的第一向量信息组,以及所述待处理图像特征信息对应的第二向量信息组;
编码单元53,用于通过初始编码规则对所述第一向量信息组与所述第二向量信息组进行编码,得到对应的融合向量信息组;其中,所述融合向量信息组包括多个融合向量信息,各融合向量信息与所述第一向量信息组以及所述第二向量信息组相关;
解码单元54,用于通过初始解码规则对所述融合向量信息组进行解码,得到所述掩码标识对应的预测结果;
确定单元55,用于基于所述预测结果与所述掩码标识对所述初始预训练模型进行训练,得到目标预训练模型,所述目标预训练模型用于根据获取到的目标任务类别训练所述目标任务类别对应的目标任务处理模型。
根据本申请的一个或多个实施例,所述装置还用于:
获取样本图文信息,所述样本图文信息包括样本文本信息与样本图像信息,所述样本文本信息与样本图像信息匹配;
根据所述样本文本信息与预设词库确定所述样本文本信息中的各字符在预设词库中的标记信息，得到所述样本文本信息对应的标记信息组；
按照第一预设编码规则对所述样本图像信息进行编码,得到所述样本图像信息对应的初始向量信息组;
基于预设掩码规则对所述标记信息组中的部分标记信息,或对初始向量信息组中的部分初始向量信息进行掩码处理,得到所述待处理图文特征信息。
根据本申请的一个或多个实施例,所述装置在用于基于所述预测结果与所述掩码标识对所述初始预训练模型进行训练,得到目标预训练模型时,具体用于:
获取所述掩码标识对应的目标信息;
确定所述预测结果与所述目标信息的相似度值;
若所述相似度值小于预设相似度值,则将初始预训练模型作为目标预训练模型;
若所述相似度值不小于所述预设相似度值,则基于所述相似度值对所述初始预训练模型中的模型参数进行更新,得到模型参数进行更新后的初始预训练模型,并返回执行基于初始向量生成规则生成所述待处理文本特征信息对应的第一向量信息组,以及所述待处理图像特征信息对应的第二向量信息组的步骤,直至所述相似度值小于所述预设相似度值,得出目标预训练模型为止;
其中,所述初始预训练模型中的模型参数包括:所述初始向量生成规则中的参数、所述初始编码规则中的参数,以及所述初始解码规则中的参数中的至少一种参数。
根据本申请的一个或多个实施例,所述装置还用于:
按照第二预设编码规则对所述样本图像信息进行编码,得到所述样本图像信息对应的编码数值组,所述编码数值组包括多个编码数值,其中,所述编码数值组中的编码数值的数量,与所述初始向量信息组中的初始向量信息的数量相同;
根据所述编码数值组确定所述掩码标识对应的目标信息。
根据本申请的一个或多个实施例,所述装置在用于根据所述编码数值组确定所述掩码标识对应的目标信息时,具体用于:
获取所述掩码标识对应的被掩码内容在掩码对象中的第一位置信息;
从所述编码数值组中选择出所述第一位置信息对应的目标编码数值;
将所述目标编码数值作为所述目标信息;
其中,若所述待处理图文特征信息为基于预设掩码规则对所述初始向量信息组中的部分初始向量信息进行掩码处理得到的,则所述被掩码内容为所述部分初始向量信息,所述掩码对象为所述初始向量信息组。
根据本申请的一个或多个实施例,前述装置在用于基于所述预测结果与所述掩码标识对初始预训练模型进行训练,得到目标预训练模型时,具体用于:
确定对初始预训练模型中的模型参数的更新次数是否大于预设次数,若是,则将所述初始预训练模型作为目标预训练模型;
若否,则获取所述掩码标识对应的目标信息;根据所述预测结果、所述目标信息以及预设的损失函数,确定出对应的损失信息;根据所述损失信息对所述初始预训练模型中的模型参数进行更新,得到模型参数进行更新后的初始预训练模型,并返回执行基于初始向量生成规则生成所述待处理文本特征信息对应的第一向量信息组,以及所述待处理图像特征信息对应的第二向量信息组的步骤,直至所述更新次数大于所述预设次数时,将所述初始预训练模型作为目标预训练模型;
其中,所述初始预训练模型中的模型参数包括:所述初始向量生成规则中的参数、所述初始编码规则中的参数,以及所述初始解码规则中的参数中的至少一种参数。
图6为本申请一示例性实施例提供的一种模型训练装置的结构示意图,该装置包括:
获取单元61,用于获取目标任务类别,与所述目标任务类别对应的样本任务信息,所述样本任务信息包括样本任务输入信息,以及所述样本任务输入信息对应的样本任务结果标签;以及用于获取目标预训练模型中的多个候选单元,所述多个候选单元包括:预处理单元、第一目标向量生成单元、第一目标交叉模态编码单元,以及第一目标交叉模态解码单元;
确定单元62,用于根据预设对应关系,从所述多个候选单元中,确定出所述目标任务类别对应的目标单元;
构建单元63,用于基于所述目标单元构建所述目标任务类别对应的初始任务处理模型;
训练单元64,用于利用所述样本任务信息对所述初始任务处理模型进行训练,得到用于完成所述目标任务类别对应的目标任务的目标任务处理模型;
其中,所述目标预训练模型为通过前述数据处理方法中的初始预训练模型训练得出的。
图7为本申请一示例性实施例提供的一种任务处理装置的结构示意图;其中,该装置包括:
获取单元71，用于获取目标任务类别的目标任务信息，所述目标任务信息包括目标任务输入信息；
确定单元72,用于根据所述目标任务类别确定对应的目标任务处理模型;
输入单元73,用于将所述目标任务输入信息输入所述目标任务处理模型,得到对应所述目标任务类别与所述目标任务输入信息的目标任务处理结果;
其中,所述目标任务处理模型为通过前述模型训练方法训练得出的。
应理解的是,装置实施例与方法实施例可以相互对应,类似的描述可以参照方法实施例。为避免重复,此处不再赘述。具体地,该装置可以执行上述方法实施例,并且该装置中的各个模块的前述和其它操作和/或功能分别为了上述方法实施例中的各个方法中的相应流程,为了简洁,在此不再赘述。
上文中结合附图从功能模块的角度描述了本申请实施例的装置。应理解，该功能模块可以通过硬件形式实现，也可以通过软件形式的指令实现，还可以通过硬件和软件模块组合实现。具体地，本申请实施例中的方法实施例的各步骤可以通过处理器中的硬件的集成逻辑电路和/或软件形式的指令完成，结合本申请实施例公开的方法的步骤可以直接体现为硬件译码处理器执行完成，或者用译码处理器中的硬件及软件模块组合执行完成。可选地，软件模块可以位于随机存储器、闪存、只读存储器、可编程只读存储器、电可擦写可编程存储器、寄存器等本领域的成熟的存储介质中。该存储介质位于存储器，处理器读取存储器中的信息，结合其硬件完成上述方法实施例中的步骤。
图8是本申请实施例提供的电子设备的示意性框图,该电子设备可包括:
存储器801和处理器802,该存储器801用于存储计算机程序,并将该程序代码传输给该处理器802。换言之,该处理器802可以从存储器801中调用并运行计算机程序,以实现本申请实施例中的方法。
例如,该处理器802可用于根据该计算机程序中的指令执行上述方法实施例。
在本申请的一些实施例中,该处理器802可以包括但不限于:
通用处理器、数字信号处理器(Digital Signal Processor,DSP)、专用集成电路(Application Specific Integrated Circuit,ASIC)、现场可编程门阵列(Field Programmable Gate Array,FPGA)或者其他可编程逻辑器件、分立门或者晶体管逻辑器件、分立硬件组件等等。
在本申请的一些实施例中,该存储器801包括但不限于:
易失性存储器和/或非易失性存储器。其中,非易失性存储器可以是只读存储器(Read-Only Memory,ROM)、可编程只读存储器(Programmable ROM,PROM)、可擦除可编程只读存储器(Erasable PROM,EPROM)、电可擦除可编程只读存储器(Electrically EPROM,EEPROM)或闪存。易失性存储器可以是随机存取存储器(Random Access Memory,RAM),其用作外部高速缓存。通过示例性但不是限制性说明,许多形式的RAM可用,例如静态随机存取存储器(Static RAM,SRAM)、动态随机存取存储器(Dynamic RAM,DRAM)、同步动态随机存取存储器(Synchronous DRAM,SDRAM)、双倍数据速率同步动态随机存取存储器(Double Data Rate SDRAM,DDR SDRAM)、增强型同步动态随机存取存储器(Enhanced SDRAM,ESDRAM)、同步连接动态随机存取存储器(synch link DRAM,SLDRAM)和直接内存总线随机存取存储器(Direct Rambus RAM,DR RAM)。
在本申请的一些实施例中,该计算机程序可以被分割成一个或多个模块,该一个或者多个模块被存储在该存储器801中,并由该处理器802执行,以完成本申请提供的方法。该一个或多个模块可以是能够完成特定功能的一系列计算机程序指令段,该指令段用于描述该计算机程序在该电子设备中的执行过程。
如图8所示,该电子设备还可包括:
收发器803,该收发器803可连接至该处理器802或存储器801。
其中,处理器802可以控制该收发器803与其他设备进行通信,具体地,可以向其他设备发送信息或数据,或接收其他设备发送的信息或数据。收发器803可以包括发射机和接收机。收发器803还可以进一步包括天线,天线的数量可以为一个或多个。
应当理解,该电子设备中的各个组件通过总线系统相连,其中,总线系统除包括数据总线之外,还包括电源总线、控制总线和状态信号总线。
本申请还提供了一种计算机存储介质,其上存储有计算机程序,该计算机程序被计算机执行时使得该计算机能够执行上述方法实施例的方法。或者说,本申请实施例还提供一种包含指令的计算机程序产品,该指令被计算机执行时使得计算机执行上述方法实施例的方法。
当使用软件实现时，可以全部或部分地以计算机程序产品的形式实现。该计算机程序产品包括一个或多个计算机指令。在计算机上加载和执行该计算机程序指令时，全部或部分地产生按照本申请实施例所述的流程或功能。该计算机可以是通用计算机、专用计算机、计算机网络、或者其他可编程装置。该计算机指令可以存储在计算机可读存储介质中，或者从一个计算机可读存储介质向另一个计算机可读存储介质传输，例如，该计算机指令可以从一个网站站点、计算机、服务器或数据中心通过有线（例如同轴电缆、光纤、数字用户线（digital subscriber line，DSL））或无线（例如红外、无线、微波等）方式向另一个网站站点、计算机、服务器或数据中心进行传输。该计算机可读存储介质可以是计算机能够存取的任何可用介质或者是包含一个或多个可用介质集成的服务器、数据中心等数据存储设备。该可用介质可以是磁性介质（例如，软盘、硬盘、磁带）、光介质（例如数字视频光盘（digital video disc，DVD））、或者半导体介质（例如固态硬盘（solid state disk，SSD））等。
根据本申请的一个或多个实施例,提供一种数据处理方法,包括:
获取待处理图文特征信息,所述待处理图文特征信息包括待处理文本特征信息与待处理图像特征信息;所述待处理文本特征信息或所述待处理图像特征信息中包含掩码标识,所述待处理文本特征信息与所述待处理图像特征信息匹配;
基于初始向量生成规则生成所述待处理文本特征信息对应的第一向量信息组,以及所述待处理图像特征信息对应的第二向量信息组;
通过初始编码规则对所述第一向量信息组与所述第二向量信息组进行编码,得到对应的融合向量信息组;其中,所述融合向量信息组包括多个融合向量信息,各融合向量信息与所述第一向量信息组以及所述第二向量信息组相关;
通过初始解码规则对所述融合向量信息组进行解码,得到所述掩码标识对应的预测结果;
基于所述预测结果与所述掩码标识对所述初始预训练模型进行训练,得到目标预训练模型,所述目标预训练模型用于根据获取到的目标任务类别训练所述目标任务类别对应的目标任务处理模型。
根据本申请的一个或多个实施例,所述方法还包括:
获取样本图文信息,所述样本图文信息包括样本文本信息与样本图像信息,所述样本文本信息与样本图像信息匹配;
根据所述样本文本信息与预设词库确定所述样本文本信息中的各字符在预设词库中的标记信息,得到所述样本文本信息对应的标记信息组;
按照第一预设编码规则对所述样本图像信息进行编码,得到所述样本图像信息对应的初始向量信息组;
基于预设掩码规则对所述标记信息组中的部分标记信息,或对初始向量信息组中的部分初始向量信息进行掩码处理,得到所述待处理图文特征信息。
根据本申请的一个或多个实施例,基于所述预测结果与所述掩码标识对所述初始预训练模型进行训练,得到目标预训练模型,包括:
获取所述掩码标识对应的目标信息;
确定所述预测结果与所述目标信息的相似度值;
若所述相似度值小于预设相似度值,则将所述初始预训练模型作为目标预训练模型;
若所述相似度值不小于所述预设相似度值,则基于所述相似度值对所述初始预训练模型中的模型参数进行更新,得到模型参数进行更新后的初始预训练模型,并返回执行基于初始向量生成规则生成所述待处理文本特征信息对应的第一向量信息组,以及所述待处理图像特征信息对应的第二向量信息组的步骤,直至所述相似度值小于所述预设相似度值,得出目标预训练模型为止;
其中,所述初始预训练模型中的模型参数包括:所述初始向量生成规则中的参数、所述初始编码规则中的参数,以及所述初始解码规则中的参数中的至少一种参数。
根据本申请的一个或多个实施例,所述方法还包括:
按照第二预设编码规则对所述样本图像信息进行编码,得到所述样本图像信息对应的编码数值组,所述编码数值组包括多个编码数值,其中,所述编码数值组中的编码数值的数量,与所述初始向量信息组中的初始向量信息的数量相同;
根据所述编码数值组确定所述掩码标识对应的目标信息。
根据本申请的一个或多个实施例,根据所述编码数值组确定所述掩码标识对应的目标信息,包括:
获取所述掩码标识对应的被掩码内容在掩码对象中的第一位置信息;
从所述编码数值组中选择出所述第一位置信息对应的目标编码数值;
将所述目标编码数值作为所述目标信息;
其中,若所述待处理图文特征信息为基于预设掩码规则对所述初始向量信息组中的部分初始向量信息进行掩码处理得到的,则所述被掩码内容为所述部分初始向量信息,所述掩码对象为所述初始向量信息组。
根据本申请的一个或多个实施例，基于所述预测结果与所述掩码标识对初始预训练模型进行训练，得到目标预训练模型，包括：
确定对初始预训练模型中的模型参数的更新次数是否大于预设次数,若是,则将所述初始预训练模型作为目标预训练模型;
若否,则获取所述掩码标识对应的目标信息;根据所述预测结果、所述目标信息以及预设的损失函数,确定出对应的损失信息;根据所述损失信息对所述初始预训练模型中的模型参数进行更新,得到模型参数进行更新后的初始预训练模型,并返回执行基于初始向量生成规则生成所述待处理文本特征信息对应的第一向量信息组,以及所述待处理图像特征信息对应的第二向量信息组的步骤,直至所述更新次数大于所述预设次数时,将所述初始预训练模型作为目标预训练模型;
其中,所述初始预训练模型中的模型参数包括:所述初始向量生成规则中的参数、所述初始编码规则中的参数,以及所述初始解码规则中的参数中的至少一种参数。
根据本申请的一个或多个实施例,提供一种模型训练方法,包括:
获取目标任务类别,与所述目标任务类别对应的样本任务信息,所述样本任务信息包括样本任务输入信息,以及所述样本任务输入信息对应的样本任务结果标签;
获取目标预训练模型中的多个候选单元,所述多个候选单元包括:预处理单元、第一目标向量生成单元、第一目标交叉模态编码单元,以及第一目标交叉模态解码单元;
根据预设对应关系,从所述多个候选单元中,确定出所述目标任务类别对应的目标单元;
基于所述目标单元构建所述目标任务类别对应的初始任务处理模型;
利用所述样本任务信息对所述初始任务处理模型进行训练,得到用于完成所述目标任务类别对应的目标任务的目标任务处理模型;
其中,所述目标预训练模型为通过前述数据处理方法中的初始预训练模型训练得出的。
根据本申请的一个或多个实施例,提供一种任务处理方法,包括:
获取目标任务类别的目标任务信息,所述目标任务信息包括目标任务输入信息;
根据所述目标任务类别确定对应的目标任务处理模型;
将所述目标任务输入信息输入所述目标任务处理模型,得到对应所述目标任务类别与所述目标任务输入信息的目标任务处理结果;
其中,所述目标任务处理模型为通过前述模型训练方法训练得出的。
根据本申请的一个或多个实施例,提供一种数据处理装置,包括:
获取单元,用于获取待处理图文特征信息,所述待处理图文特征信息包括待处理文本特征信息与待处理图像特征信息;所述待处理文本特征信息或所述待处理图像特征信息中包含掩码标识,所述待处理文本特征信息与所述待处理图像特征信息匹配;
生成单元，用于基于初始向量生成规则生成所述待处理文本特征信息对应的第一向量信息组，以及所述待处理图像特征信息对应的第二向量信息组；
编码单元,用于通过初始编码规则对所述第一向量信息组与所述第二向量信息组进行编码,得到对应的融合向量信息组;其中,所述融合向量信息组包括多个融合向量信息,各融合向量信息与所述第一向量信息组以及所述第二向量信息组相关;
解码单元,用于通过初始解码规则对所述融合向量信息组进行解码,得到所述掩码标识对应的预测结果;
确定单元,用于基于所述预测结果与所述掩码标识对所述初始预训练模型进行训练,得到目标预训练模型,所述目标预训练模型用于根据获取到的目标任务类别训练所述目标任务类别对应的目标任务处理模型。
根据本申请的一个或多个实施例,所述装置还用于:
获取样本图文信息,所述样本图文信息包括样本文本信息与样本图像信息,所述样本文本信息与样本图像信息匹配;
根据所述样本文本信息与预设词库确定所述样本文本信息中的各字符在预设词库中的标记信息,得到所述样本文本信息对应的标记信息组;
按照第一预设编码规则对所述样本图像信息进行编码,得到所述样本图像信息对应的初始向量信息组;
基于预设掩码规则对所述标记信息组中的部分标记信息,或对初始向量信息组中的部分初始向量信息进行掩码处理,得到所述待处理图文特征信息。
根据本申请的一个或多个实施例,所述装置在用于基于所述预测结果与所述掩码标识对所述初始预训练模型进行训练,得到目标预训练模型时,具体用于:
获取所述掩码标识对应的目标信息;
确定所述预测结果与所述目标信息的相似度值;
若所述相似度值小于预设相似度值,则将所述初始预训练模型作为目标预训练模型;
若所述相似度值不小于所述预设相似度值,则基于所述相似度值对所述初始预训练模型中的模型参数进行更新,得到模型参数进行更新后的初始预训练模型,并返回执行基于初始向量生成规则生成所述待处理文本特征信息对应的第一向量信息组,以及所述待处理图像特征信息对应的第二向量信息组的步骤,直至所述相似度值小于所述预设相似度值,得出目标预训练模型为止;
其中,所述初始预训练模型中的模型参数包括:所述初始向量生成规则中的参数、所述初始编码规则中的参数,以及所述初始解码规则中的参数中的至少一种参数。
根据本申请的一个或多个实施例,所述装置还用于:
按照第二预设编码规则对所述样本图像信息进行编码,得到所述样本图像信息对应的编码数值组,所述编码数值组包括多个编码数值,其中,所述编码数值组中的编码数值的数量,与所述初始向量信息组中的初始向量信息的数量相同;
根据所述编码数值组确定所述掩码标识对应的目标信息。
根据本申请的一个或多个实施例,所述装置在用于根据所述编码数值组确定所述掩码标识对应的目标信息时,具体用于:
获取所述掩码标识对应的被掩码内容在掩码对象中的第一位置信息;
从所述编码数值组中选择出所述第一位置信息对应的目标编码数值;
将所述目标编码数值作为所述目标信息;
其中,若所述待处理图文特征信息为基于预设掩码规则对所述初始向量信息组中的部分初始向量信息进行掩码处理得到的,则所述被掩码内容为所述部分初始向量信息,所述掩码对象为所述初始向量信息组。
根据本申请的一个或多个实施例,所述装置在用于:基于所述预测结果与所述掩码标识对初始预训练模型进行训练,得到目标预训练模型时,具体用于:
确定对初始预训练模型中的模型参数的更新次数是否大于预设次数,若是,则将所述初始预训练模型作为目标预训练模型;
若否,则获取所述掩码标识对应的目标信息;根据所述预测结果、所述目标信息以及预设的损失函数,确定出对应的损失信息;根据所述损失信息对所述初始预训练模型中的模型参数进行更新,得到模型参数进行更新后的初始预训练模型,并返回执行基于初始向量生成规则生成所述待处理文本特征信息对应的第一向量信息组,以及所述待处理图像特征信息对应的第二向量信息组的步骤,直至所述更新次数大于所述预设次数时,将所述初始预训练模型作为目标预训练模型;
其中,所述初始预训练模型中的模型参数包括:所述初始向量生成规则中的参数、所述初始编码规则中的参数,以及所述初始解码规则中的参数中的至少一种参数。
根据本申请的一个或多个实施例,提供一种模型训练装置,包括:
获取单元,用于获取目标任务类别,与所述目标任务类别对应的样本任务信息,所述样本任务信息包括样本任务输入信息,以及所述样本任务输入信息对应的样本任务结果标签;以及用于获取目标预训练模型中的多个候选单元,所述多个候选单元包括:预处理单元、第一目标向量生成单元、第一目标交叉模态编码单元,以及第一目标交叉模态解码单元;
确定单元,用于根据预设对应关系,从所述多个候选单元中,确定出所述目标任务类别对应的目标单元;
构建单元,用于基于所述目标单元构建所述目标任务类别对应的初始任务处理模型;
训练单元,用于利用所述样本任务信息对所述初始任务处理模型进行训练,得到用于完成所述目标任务类别对应的目标任务的目标任务处理模型;
其中,所述目标预训练模型为通过前述数据处理方法中的初始预训练模型训练得出的。
根据本申请的一个或多个实施例,提供一种任务处理装置,包括:
获取单元,用于获取目标任务类别的目标任务信息,所述目标任务信息包括目标任务输入信息;
确定单元,用于根据所述目标任务类别确定对应的目标任务处理模型;
输入单元,用于将所述目标任务输入信息输入所述目标任务处理模型,得到对应所述目标任务类别与所述目标任务输入信息的目标任务处理结果;
其中,所述目标任务处理模型为通过前述模型训练方法训练得出的。
根据本申请的一个或多个实施例,提供一种电子设备,包括:
处理器;以及
存储器,用于存储处理器的可执行指令;
其中,处理器配置为经由执行可执行指令来执行前述的各方法。
根据本申请的一个或多个实施例，提供一种计算机可读存储介质，其上存储有计算机程序，计算机程序被处理器执行时实现前述的各方法。
本领域普通技术人员可以意识到,结合本文中所公开的实施例描述的各示例的模块及算法步骤,能够以电子硬件、或者计算机软件和电子硬件的结合来实现。这些功能究竟以硬件还是软件方式来执行,取决于技术方案的特定应用和设计约束条件。专业技术人员可以对每个特定的应用来使用不同方法来实现所描述的功能,但是这种实现不应认为超出本申请的范围。
在本申请所提供的几个实施例中,应该理解到,所揭露的系统、装置和方法,可以通过其它的方式实现。例如,以上所描述的装置实施例仅仅是示意性的,例如,该模块的划分,仅仅为一种逻辑功能划分,实际实现时可以有另外的划分方式,例如多个模块或组件可以结合或者可以集成到另一个系统,或一些特征可以忽略,或不执行。另一点,所显示或讨论的相互之间的耦合或直接耦合或通信连接可以是通过一些接口,装置或模块的间接耦合或通信连接,可以是电性,机械或其它的形式。
作为分离部件说明的模块可以是或者也可以不是物理上分开的,作为模块显示的部件可以是或者也可以不是物理模块,即可以位于一个地方,或者也可以分布到多个网络单元上。可以根据实际的需要选择其中的部分或者全部模块来实现本实施例方案的目的。例如,在本申请各个实施例中的各功能模块可以集成在一个处理模块中,也可以是各个模块单独物理存在,也可以两个或两个以上模块集成在一个模块中。
以上仅为本申请的具体实施方式,但本申请的保护范围并不局限于此,任何熟悉本技术领域的技术人员在本申请揭露的技术范围内,可轻易想到变化或替换,都应涵盖在本申请的保护范围之内。因此,本申请的保护范围应以该权利要求的保护范围为准。

Claims (13)

  1. 一种数据处理方法,包括:
    获取待处理图文特征信息,所述待处理图文特征信息包括待处理文本特征信息与待处理图像特征信息;所述待处理文本特征信息或所述待处理图像特征信息中包含掩码标识,所述待处理文本特征信息与所述待处理图像特征信息匹配;
    基于初始向量生成规则生成所述待处理文本特征信息对应的第一向量信息组,以及所述待处理图像特征信息对应的第二向量信息组;
    通过初始编码规则对所述第一向量信息组与所述第二向量信息组进行编码,得到对应的融合向量信息组;其中,所述融合向量信息组包括多个融合向量信息,各融合向量信息与所述第一向量信息组以及所述第二向量信息组相关;
    通过初始解码规则对所述融合向量信息组进行解码,得到所述掩码标识对应的预测结果;
    基于所述预测结果与所述掩码标识对初始预训练模型进行训练,得到目标预训练模型,所述目标预训练模型用于根据获取到的目标任务类别训练所述目标任务类别对应的目标任务处理模型。
  2. 根据权利要求1所述的方法,其中,所述方法还包括:
    获取样本图文信息,所述样本图文信息包括样本文本信息与样本图像信息,所述样本文本信息与样本图像信息匹配;
    根据所述样本文本信息与预设词库确定所述样本文本信息中的各字符在预设词库中的标记信息,得到所述样本文本信息对应的标记信息组;
    按照第一预设编码规则对所述样本图像信息进行编码,得到所述样本图像信息对应的初始向量信息组;
    基于预设掩码规则对所述标记信息组中的部分标记信息,或对初始向量信息组中的部分初始向量信息进行掩码处理,得到所述待处理图文特征信息。
  3. 根据权利要求2所述的方法,其中,基于所述预测结果与所述掩码标识对所述初始预训练模型进行训练,得到目标预训练模型,包括:
    获取所述掩码标识对应的目标信息;
    确定所述预测结果与所述目标信息的相似度值;
    若所述相似度值小于预设相似度值,则将初始预训练模型作为目标预训练模型;
    若所述相似度值不小于所述预设相似度值,则基于所述相似度值对所述初始预训练模型中的模型参数进行更新,得到模型参数进行更新后的初始预训练模型,并返回执行基于初始向量生成规则生成所述待处理文本特征信息对应的第一向量信息组,以及所述待处理图像特征信息对应的第二向量信息组的步骤,直至所述相似度值小于所述预设相似度值,得出目标预训练模型为止;
    其中,所述初始预训练模型中的模型参数包括:所述初始向量生成规则中的参数、所述初始编码规则中的参数,以及所述初始解码规则中的参数中的至少一种参数。
  4. 根据权利要求3所述的方法,其中,所述方法还包括:
    按照第二预设编码规则对所述样本图像信息进行编码,得到所述样本图像信息对应的编码数值组,所述编码数值组包括多个编码数值,其中,所述编码数值组中的编码数值的数量,与所述初始向量信息组中的初始向量信息的数量相同;
    根据所述编码数值组确定所述掩码标识对应的目标信息。
  5. 根据权利要求4所述的方法,其中,根据所述编码数值组确定所述掩码标识对应的目标信息,包括:
    获取所述掩码标识对应的被掩码内容在掩码对象中的第一位置信息;
    从所述编码数值组中选择出所述第一位置信息对应的目标编码数值;
    将所述目标编码数值作为所述目标信息;
    其中,若所述待处理图文特征信息为基于预设掩码规则对所述初始向量信息组中的部分初始向量信息进行掩码处理得到的,则所述被掩码内容为所述部分初始向量信息,所述掩码对象为所述初始向量信息组。
  6. 根据权利要求2所述的方法,其中,基于所述预测结果与所述掩码标识对初始预训练模型进行训练,得到目标预训练模型,包括:
    确定对初始预训练模型中的模型参数的更新次数是否大于预设次数,若是,则将所述初始预训练模型作为目标预训练模型;
    若否，则获取所述掩码标识对应的目标信息；根据所述预测结果、所述目标信息以及预设的损失函数，确定出对应的损失信息；根据所述损失信息对所述初始预训练模型中的模型参数进行更新，得到模型参数进行更新后的初始预训练模型，并返回执行基于初始向量生成规则生成所述待处理文本特征信息对应的第一向量信息组，以及所述待处理图像特征信息对应的第二向量信息组的步骤，直至所述更新次数大于所述预设次数时，将所述初始预训练模型作为目标预训练模型；
    其中,所述初始预训练模型中的模型参数包括:所述初始向量生成规则中的参数、所述初始编码规则中的参数,以及所述初始解码规则中的参数中的至少一种参数。
  7. 一种模型训练方法,包括:
    获取目标任务类别,与所述目标任务类别对应的样本任务信息,所述样本任务信息包括样本任务输入信息,以及所述样本任务输入信息对应的样本任务结果标签;
    获取目标预训练模型中的多个候选单元,所述多个候选单元包括:预处理单元、第一目标向量生成单元、第一目标交叉模态编码单元,以及第一目标交叉模态解码单元;
    根据预设对应关系,从所述多个候选单元中,确定出所述目标任务类别对应的目标单元;
    基于所述目标单元构建所述目标任务类别对应的初始任务处理模型;
    利用所述样本任务信息对所述初始任务处理模型进行训练,得到用于完成所述目标任务类别对应的目标任务的目标任务处理模型;
    其中,所述目标预训练模型为通过权利要求1-6中任一项所述的数据处理方法中的初始预训练模型训练得出的。
  8. 一种任务处理方法,包括:
    获取目标任务类别的目标任务信息,所述目标任务信息包括目标任务输入信息;
    根据所述目标任务类别确定对应的目标任务处理模型;
    将所述目标任务输入信息输入所述目标任务处理模型,得到对应所述目标任务类别与所述目标任务输入信息的目标任务处理结果;
    其中,所述目标任务处理模型为通过权利要求7中所述的模型训练方法训练得出的。
  9. 一种数据处理装置,包括:
    获取单元,用于获取待处理图文特征信息,所述待处理图文特征信息包括待处理文本特征信息与待处理图像特征信息;所述待处理文本特征信息或所述待处理图像特征信息中包含掩码标识,所述待处理文本特征信息与所述待处理图像特征信息匹配;
    生成单元,用于基于初始向量生成规则生成所述待处理文本特征信息对应的第一向量信息组,以及所述待处理图像特征信息对应的第二向量信息组;
    编码单元,用于通过初始编码规则对所述第一向量信息组与所述第二向量信息组进行编码,得到对应的融合向量信息组;其中,所述融合向量信息组包括多个融合向量信息,各融合向量信息与所述第一向量信息组以及所述第二向量信息组相关;
    解码单元,用于通过初始解码规则对所述融合向量信息组进行解码,得到所述掩码标识对应的预测结果;
    确定单元,用于基于所述预测结果与所述掩码标识对所述初始预训练模型进行训练,得到目标预训练模型,所述目标预训练模型用于根据获取到的目标任务类别训练所述目标任务类别对应的目标任务处理模型。
  10. 一种模型训练装置,包括:
    获取单元,用于获取目标任务类别,与所述目标任务类别对应的样本任务信息,所述样本任务信息包括样本任务输入信息,以及所述样本任务输入信息对应的样本任务结果标签;以及用于获取目标预训练模型中的多个候选单元,所述多个候选单元包括:预处理单元、第一目标向量生成单元、第一目标交叉模态编码单元,以及第一目标交叉模态解码单元;
    确定单元,用于根据预设对应关系,从所述多个候选单元中,确定出所述目标任务类别对应的目标单元;
    构建单元,用于基于所述目标单元构建所述目标任务类别对应的初始任务处理模型;
    训练单元,用于利用所述样本任务信息对所述初始任务处理模型进行训练,得到用于完成所述目标任务类别对应的目标任务的目标任务处理模型;
    其中,所述目标预训练模型为通过权利要求1-6中任一项所述的数据处理方法中的初始预训练模型训练得出的。
  11. 一种任务处理装置,包括:
    获取单元,用于获取目标任务类别的目标任务信息,所述目标任务信息包括目标任务输入信息;
    确定单元,用于根据所述目标任务类别确定对应的目标任务处理模型;
    输入单元,用于将所述目标任务输入信息输入所述目标任务处理模型,得到对应所述目标任务类别与所述目标任务输入信息的目标任务处理结果;
    其中,所述目标任务处理模型为通过所述权利要求7的模型训练方法训练得出的。
  12. 一种电子设备,包括:
    处理器;以及
    存储器,用于存储所述处理器的可执行指令;
    其中,所述处理器配置为经由执行所述可执行指令来执行权利要求1-6中任一项所述的数据处理方法,或权利要求7所述的模型训练方法,或权利要求8所述的任务处理方法。
  13. 一种计算机可读存储介质,其上存储有计算机程序,所述计算机程序被处理器执行时实现权利要求1-6中任一项所述的数据处理方法,或权利要求7所述的模型训练方法,或权利要求8所述的任务处理方法。
PCT/CN2023/098690 2022-06-14 2023-06-06 数据处理方法、装置、设备及计算机介质 WO2023241410A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202210671652.2A CN114972823A (zh) 2022-06-14 2022-06-14 数据处理方法、装置、设备及计算机介质
CN202210671652.2 2022-06-14

Publications (1)

Publication Number Publication Date
WO2023241410A1 (zh)

Family

ID=82963036

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2023/098690 WO2023241410A1 (zh) 2022-06-14 2023-06-06 数据处理方法、装置、设备及计算机介质

Country Status (2)

Country Link
CN (1) CN114972823A (zh)
WO (1) WO2023241410A1 (zh)

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114972823A (zh) * 2022-06-14 2022-08-30 北京有竹居网络技术有限公司 数据处理方法、装置、设备及计算机介质
WO2024108472A1 (zh) * 2022-11-24 2024-05-30 北京京东方技术开发有限公司 模型训练方法及装置、文本图像处理方法、设备、介质
CN118153657A (zh) * 2022-11-30 2024-06-07 北京有竹居网络技术有限公司 网络模型的训练方法、数据处理方法及装置
CN115601485B (zh) * 2022-12-15 2023-04-07 阿里巴巴(中国)有限公司 任务处理模型的数据处理方法及虚拟人物动画生成方法
CN116383027B (zh) * 2023-06-05 2023-08-25 阿里巴巴(中国)有限公司 人机交互的数据处理方法及服务器
CN118429658B (zh) * 2024-06-25 2024-10-08 阿里巴巴(中国)有限公司 信息抽取方法以及信息抽取模型训练方法

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20200334334A1 (en) * 2019-04-18 2020-10-22 Salesforce.Com, Inc. Systems and methods for unifying question answering and text classification via span extraction
CN113792113A (zh) * 2020-07-31 2021-12-14 北京京东尚科信息技术有限公司 视觉语言模型获得及任务处理方法、装置、设备及介质
CN114372414A (zh) * 2022-01-06 2022-04-19 腾讯科技(深圳)有限公司 多模态模型构建方法、装置和计算机设备
CN114549935A (zh) * 2022-02-25 2022-05-27 北京百度网讯科技有限公司 信息生成方法和装置
CN114972823A (zh) * 2022-06-14 2022-08-30 北京有竹居网络技术有限公司 数据处理方法、装置、设备及计算机介质

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117726990A (zh) * 2023-12-27 2024-03-19 浙江恒逸石化有限公司 纺丝车间的检测方法、装置、电子设备及存储介质
CN117726990B (zh) * 2023-12-27 2024-05-03 浙江恒逸石化有限公司 纺丝车间的检测方法、装置、电子设备及存储介质
CN118098274A (zh) * 2024-04-19 2024-05-28 腾讯科技(深圳)有限公司 模型训练方法、装置、电子设备及存储介质
CN118155214A (zh) * 2024-05-11 2024-06-07 腾讯科技(深圳)有限公司 一种提示学习方法、图像分类方法及相关装置
CN118196398A (zh) * 2024-05-15 2024-06-14 海信集团控股股份有限公司 一种目标检测方法、装置、设备及介质

Also Published As

Publication number Publication date
CN114972823A (zh) 2022-08-30

Similar Documents

Publication Publication Date Title
WO2023241410A1 (zh) 数据处理方法、装置、设备及计算机介质
CN112164391B (zh) 语句处理方法、装置、电子设备及存储介质
CN112487182B (zh) 文本处理模型的训练方法、文本处理方法及装置
WO2021082953A1 (zh) 机器阅读理解方法、设备、存储介质及装置
CN112470160B (zh) 个性化自然语言理解的装置和方法
US10592607B2 (en) Iterative alternating neural attention for machine reading
CN110377714A (zh) 基于迁移学习的文本匹配方法、装置、介质及设备
US20200372217A1 (en) Method and apparatus for processing language based on trained network model
CN110413746A (zh) 对用户问题进行意图识别的方法及装置
CN112214591B (zh) 一种对话预测的方法及装置
CN114676234A (zh) 一种模型训练方法及相关设备
CN112131883B (zh) 语言模型训练方法、装置、计算机设备和存储介质
US20240185602A1 (en) Cross-Modal Processing For Vision And Language
US20200081973A1 (en) Methods, apparatuses, devices, and computer-readable storage media for determining category of entity
CN116861995A (zh) 多模态预训练模型的训练及多模态数据处理方法和装置
WO2021120779A1 (zh) 一种基于人机对话的用户画像构建方法、系统、终端及存储介质
CN113158656B (zh) 讽刺内容识别方法、装置、电子设备以及存储介质
CN113139391A (zh) 翻译模型的训练方法、装置、设备和存储介质
CN111144102B (zh) 用于识别语句中实体的方法、装置和电子设备
CN113779225B (zh) 实体链接模型的训练方法、实体链接方法及装置
CN111814496A (zh) 文本处理方法、装置、设备及存储介质
CN113421551B (zh) 语音识别方法、装置、计算机可读介质及电子设备
CN116975288A (zh) 文本处理方法及文本处理模型训练方法
CN114492661B (zh) 文本数据分类方法和装置、计算机设备、存储介质
CN113657092B (zh) 识别标签的方法、装置、设备以及介质

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 23822979

Country of ref document: EP

Kind code of ref document: A1