CN110472688A - Image description method and device, and image description model training method and device - Google Patents
Image description method and device, and image description model training method and device
- Publication number
- CN110472688A CN110472688A CN201910760737.6A CN201910760737A CN110472688A CN 110472688 A CN110472688 A CN 110472688A CN 201910760737 A CN201910760737 A CN 201910760737A CN 110472688 A CN110472688 A CN 110472688A
- Authority
- CN
- China
- Prior art keywords
- image
- feature
- tag
- input
- model
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/241—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/40—Extraction of image or video features
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Artificial Intelligence (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Evolutionary Biology (AREA)
- Evolutionary Computation (AREA)
- Bioinformatics & Computational Biology (AREA)
- General Engineering & Computer Science (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Life Sciences & Earth Sciences (AREA)
- Multimedia (AREA)
- Image Analysis (AREA)
Abstract
This application provides an image description method and device, and a training method and device for an image description model. The image description method includes: extracting image features from a target image; performing label extraction on the image features to generate corresponding image labels; inputting the image features and image labels of the target image into the encoder of an image description model to generate a feature matrix corresponding to the target image; and inputting the feature matrix into the decoder of the image description model for decoding to obtain the image description sentence corresponding to the target image. When generating the image description sentence, the image description model can thus refer to the information of specific, reliable image labels, so that the generated sentence contains more key information, improving the accuracy and reliability of the image description sentence. Moreover, because the reliable image labels serve as guidance during the generation phase of the image description sentence, the generation of redundant data is reduced.
Description
Technical field
This application relates to the technical field of image processing, and in particular to an image description method and device, a training method and device for an image description model, a computing device, and a computer-readable storage medium.
Background art
Image description (image captioning) aims to automatically generate a passage of descriptive text from an image, that is, to "talk about a picture". The process not only has to detect the objects in the image, but must also understand the relationships between the objects and finally express them in natural language.
Currently, in image description tasks, the information of an image mainly takes the form of feature maps extracted by a convolutional neural network model, or of feature representations of specific targets detected by an object detection model. All of this information exists in matrix form, so the representation of the same key information may differ. For example, two images may both show a car, but because the parking position or parking angle differs, the features extracted by the convolutional neural network model and the object detection model differ as well, which increases the redundancy and unreliability of the information.
In summary, current image description tasks generate the description of an image mainly by extracting features from the image itself and generating the description from the extracted features. Because of the redundancy of the image features after extraction, the key descriptors finally generated for the image may deviate, or the generated description may even be wrong.
Summary of the invention
In view of this, embodiments of the present application provide an image description method and device, a training method and device for an image description model, a computing device, and a computer-readable storage medium, so as to solve the technical deficiencies in the prior art.
An embodiment of the present application provides an image description method for an image description model, the method comprising:
extracting image features from a target image;
performing label extraction on the image features to generate corresponding image labels;
inputting the image features and image labels of the target image into the encoder of the image description model to generate a feature matrix corresponding to the target image;
inputting the feature matrix into the decoder of the image description model for decoding to obtain the image description sentence corresponding to the target image.
Optionally, performing label extraction on the image features to generate corresponding image labels comprises:
inputting the image features into a multi-label classification model for label extraction, generating at least one corresponding image label.
Optionally, the encoder comprises one coding layer;
inputting the image features and image labels of the target image into the encoder of the image description model to generate the feature matrix corresponding to the target image comprises:
preprocessing the image features and image labels of the target image respectively to generate preprocessed image features and label vectors;
inputting the preprocessed image features and label vectors into the coding layer, and taking the output features of the coding layer as the feature matrix corresponding to the target image.
Optionally, the encoder comprises N sequentially connected coding layers;
inputting the image features and image labels of the target image into the encoder of the image description model to generate the feature matrix corresponding to the target image comprises:
S11, preprocessing the image features and image labels of the target image respectively to generate preprocessed image features and label vectors;
S12, inputting the preprocessed image features and label vectors into the first coding layer to obtain the output features of the first coding layer;
S13, inputting the output features of the (i-1)-th coding layer and the label vectors into the i-th coding layer to obtain the output features of the i-th coding layer;
S14, incrementing i by 1 and determining whether the incremented i is less than N; if so, executing step S13; if not, executing step S15;
S15, taking the output features of the N-th coding layer as the feature matrix corresponding to the target image.
Optionally, each coding layer comprises a first self-attention layer, a first multi-head attention layer, and a first feed-forward layer;
inputting the preprocessed image features and label vectors into the i-th coding layer to obtain the output features of the i-th coding layer comprises:
inputting the preprocessed image features into the first self-attention layer of the i-th coding layer for processing, generating first self-attention features;
inputting the first self-attention features and the label vectors into the first multi-head attention layer of the i-th coding layer, generating first fusion features;
passing the first fusion features through the first feed-forward layer to generate the output features of the i-th coding layer.
Optionally, inputting the feature matrix into the decoder of the image description model for decoding to obtain the image description sentence corresponding to the target image comprises:
inputting a reference decoding vector and the feature matrix into the decoder for decoding to obtain the decoding vector output by the decoder;
performing linearization and normalization on the decoding vector to generate the image description sentence corresponding to the target image.
An embodiment of the present application provides a training method for an image description model, the method comprising:
extracting image features from a sample image;
performing label extraction on the image features to generate corresponding image labels;
inputting the image features and image labels of the sample image, together with the sample image description sentence corresponding to the sample image, into the image description model, and training the image description model until a training stop condition is reached.
Optionally, the training stop condition comprises: comparing the decoding vector generated by the image description model with a preset vector validation set, and the change rate of the error of the decoding vector being less than a stability threshold.
An embodiment of the present application provides an image description device, the device comprising:
a first feature extraction module, configured to extract image features from a target image;
a first label extraction module, configured to perform label extraction on the image features to generate corresponding image labels;
an encoding module, configured to input the image features and image labels of the target image into the encoder of an image description model to generate the feature matrix corresponding to the target image;
a decoding module, configured to input the feature matrix into the decoder of the image description model for decoding to obtain the image description sentence corresponding to the target image.
An embodiment of the present application provides a training device for an image description model, the device comprising:
a second feature extraction module, configured to extract image features from a sample image;
a second label extraction module, configured to perform label extraction on the image features to generate corresponding image labels;
a training module, configured to input the image features and image labels of the sample image, together with the sample image description sentence corresponding to the sample image, into the image description model, and train the image description model until a training stop condition is reached.
An embodiment of the present application provides a computing device comprising a memory, a processor, and computer instructions stored in the memory and executable on the processor, wherein the processor, when executing the instructions, implements the steps of the image description method or the image description model training method described above.
An embodiment of the present application provides a computer-readable storage medium storing computer instructions that, when executed by a processor, implement the steps of the image description method or the image description model training method described above.
In the image description method and device provided by this application, label extraction is performed on the image features of the target image to generate corresponding image labels, and the image features and image labels of the target image are input into the image description model to obtain the image description sentence corresponding to the target image. When generating the image description sentence, the image description model can thus refer to the information of specific, reliable image labels, so that the generated sentence contains more key information, improving the accuracy and reliability of the image description sentence; and because the reliable image labels serve as guidance in the generation phase of the image description sentence, the generation of redundant data is reduced.
In the training method and device for an image description model provided by this application, the image features and image labels of a sample image, together with the sample image description sentence corresponding to the sample image, are input into the image description model, which is trained until the training stop condition is reached, yielding an image description model that can generate a description sentence from a target image.
Brief description of the drawings
Fig. 1 is a schematic architecture diagram of the Transformer model involved in embodiments of the application;
Fig. 2 is a schematic flowchart of the image description method of an embodiment of the application;
Fig. 3 is a schematic flowchart of the image description method of an embodiment of the application;
Fig. 4 is a schematic structural diagram of a coding layer of an embodiment of the application;
Fig. 5 is a schematic flowchart of the image description method of an embodiment of the application;
Fig. 6 is a schematic diagram of the model framework implementing the image description method of an embodiment of the application;
Fig. 7 is a schematic flowchart of the training method of the image description model of an embodiment of the application;
Fig. 8 is a schematic structural diagram of the image description device of another embodiment of the application;
Fig. 9 is a schematic structural diagram of the training device of the image description model of another embodiment of the application;
Fig. 10 is a schematic structural diagram of the computing device of another embodiment of the application.
Specific embodiments
Many specific details are set forth in the following description to facilitate a full understanding of the application. However, the application can be implemented in many ways different from those described herein, and those skilled in the art can make similar generalizations without departing from the essence of the application; therefore, the application is not limited by the specific implementations disclosed below.
The terminology used in one or more embodiments of this specification is only for the purpose of describing particular embodiments and is not intended to limit the one or more embodiments of this specification. The singular forms "a", "said", and "the" used in one or more embodiments of this specification and the appended claims are also intended to include the plural forms, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" used in one or more embodiments of this specification refers to and includes any and all possible combinations of one or more of the associated listed items.
It should be understood that although the terms first, second, etc. may be used in one or more embodiments of this specification to describe various pieces of information, the information should not be limited by these terms. These terms are only used to distinguish information of the same type from one another. For example, without departing from the scope of one or more embodiments of this specification, "first" may also be referred to as "second", and similarly "second" may be referred to as "first". Depending on the context, the word "if" as used herein may be interpreted as "when", "upon", or "in response to determining".
First, the terms involved in one or more embodiments of the invention are explained.
Transformer model: a neural network architecture used for machine translation. Its main idea is to encode the features or vectors to be translated through coding layers (the encoder) into encoded features or vectors, then decode the encoded features or vectors with decoding layers (the decoder) to obtain decoding vectors, and finally translate the decoding vectors into the corresponding translated sentence.
Image description: a comprehensive task fusing computer vision, natural language processing, and machine learning, which produces a natural language sentence that can describe the content of a given image. In plain terms, it translates a picture into a passage of descriptive text.
Multi-label classification model: for a given text or image, there may be more than one corresponding label. A multi-label classification model can predict the labels corresponding to a given text or image.
This application provides an image description method and device, a training method and device for an image description model, a computing device, and a computer-readable storage medium, which are described in detail one by one in the following embodiments.
First, the image description model of the embodiments of the present application is schematically illustrated. Many models can implement image description, such as a convolutional neural network (CNN) model, a recurrent neural network (RNN) model, or a Transformer model.
A CNN model generally comprises an input layer, convolutional layers, pooling layers, and a fully connected layer. On the one hand, the connections between the neurons of a CNN model are not fully connected; on the other hand, the weights of the connections between certain neurons in the same layer are shared (i.e., identical). Its non-fully-connected, weight-sharing network structure makes it more similar to a biological neural network, reducing the complexity of the network model and the number of weights.
An RNN model, also called a recurrent neural network, is a neural network with a feedback structure whose output is related not only to the current input and the network weights, but also to the inputs the network has seen before. The RNN model models time by adding self-connected hidden layers that span time steps; in other words, the feedback of the hidden layer not only enters the output, but also enters the hidden layer of the next time step.
The architecture of the Transformer model comprises an encoder and a decoder. The encoder encodes an input target sentence into an encoding vector, or encodes target image features into encoded features; the decoder decodes the encoding vector or encoded features to generate the corresponding image description sentence.
This embodiment schematically explains the image description method of this embodiment taking the Transformer model as an example. It should be noted that other single models or combinations of models implementing an encoder-decoder architecture can also implement the image description method of this application, and also fall within the protection scope of this application.
Fig. 1 shows the architecture of a Transformer model. The model is divided into two parts, an encoder and a decoder. The encoder is a stack of N identical coding layers, each comprising three sublayers: a first self-attention layer, a first multi-head attention layer, and a first feed-forward layer, where N is a positive integer with N >= 1.
The decoder is a stack of M identical decoding layers, each comprising three sublayers: a masked multi-head attention layer, a second multi-head attention layer, and a second feed-forward layer, where M is a positive integer with M >= 1.
In use, in the encoder, the image features and image labels of the target image are each subjected to feature processing to generate preprocessed image features and label vectors. The preprocessed image features and label vectors serve as the input of the first coding layer, yielding the output features of the first coding layer; the output features of each coding layer serve as the input of the next coding layer, and finally the output features of the last coding layer serve as the feature matrix output by the whole encoder, which is input into each decoding layer of the decoder.
On the decoder side, a reference vector and the feature matrix are input into the first decoding layer to obtain the decoding vector output by the first decoding layer; the feature matrix and the decoding vector output by the previous decoding layer are input into the current decoding layer to obtain the decoding vector output by the current decoding layer, and so on, until the decoding vector output by the last decoding layer is obtained as the decoding vector of the decoder.
The decoding vector of the decoder is converted via a linear layer and a normalization (softmax) layer to obtain the final target sentence.
It should be noted that an image description sentence comprises multiple description words; the decoder obtains one description word per decoding step, and the final target sentence is obtained when decoding is complete. For the first description word of the image description sentence, the reference decoding vector is a preset initial decoding vector; for the other description words, the reference decoding vector is the decoding vector corresponding to the previous description word.
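This data flow can be summarized in a minimal sketch (illustrative Python only; `encoder_layers` and `decoder_layers` stand in for the coding and decoding layers of Fig. 1 and are assumptions, not the patent's API):

```python
def run_model(encoder_layers, decoder_layers, img_feats, label_vecs, ref_vec):
    # Encoder: each coding layer consumes the previous layer's output
    # together with the label vectors.
    x = img_feats
    for layer in encoder_layers:
        x = layer(x, label_vecs)      # output features of this coding layer
    feature_matrix = x                # output of the whole encoder

    # Decoder: every decoding layer sees the feature matrix; each layer's
    # output vector feeds the next layer.
    y = ref_vec
    for layer in decoder_layers:
        y = layer(y, feature_matrix)
    return y                          # decoding vector of the decoder
```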
An embodiment of the present application discloses an image description method which, referring to Fig. 2, comprises the following steps 201 to 204:
201. Extract image features from the target image.
The image features can be extracted from the target image using a feature extraction model. The feature extraction model can take many forms, such as a CNN (convolutional neural network) model or an LSTM model.
For example, the feature extraction model generates image features of size P*Q*L1, that is, L1-dimensional image features over a spatial grid of size P*Q, where P*Q is the height * width of the feature map.
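By way of illustration only (the patent does not mandate any particular backbone), a minimal sketch of extracting such a P*Q grid of L1-dimensional features, assuming a torchvision ResNet-50:

```python
import torch
import torchvision.models as models

backbone = models.resnet50(weights=None)
# Drop the pooling and classification head to keep the spatial feature map.
extractor = torch.nn.Sequential(*list(backbone.children())[:-2])

image = torch.randn(1, 3, 224, 224)      # one RGB target image
feats = extractor(image)                 # shape [1, L1, P, Q] = [1, 2048, 7, 7]
# Flatten the P*Q grid into a sequence of n = P*Q vectors (v1, v2, ..., vn),
# each of dimension L1, as used later by the encoder.
v = feats.flatten(2).transpose(1, 2)     # shape [1, 49, 2048]
```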
Redundant data can be generated during image feature extraction. Redundant data refers to repeated data generated in the image description task. For example, two images may express the same category, yet the features extracted by the feature extraction model can differ; the feature extraction model thus extracts the same type of image feature in different forms, producing redundant data.
For different images, the features extracted for data of the same category differ. This firstly increases the difficulty and complexity of model learning, and secondly can bias the actual image description because of the differences in feature representation, especially for features at category edges, adversely affecting the image description task.
202. Perform label extraction on the image features to generate corresponding image labels.
Specifically, step 202 comprises: inputting the image features into a multi-label classification model for label extraction, generating at least one corresponding image label.
For example, for an image of a child flying a kite on a lawn, performing label extraction on the image yields the two labels "child" and "kite".
It should be noted that, compared with an object detection model, a multi-label classification model has the advantages of a simpler model structure, simpler training data annotation, richer data, and higher model accuracy; at the same time, a multi-label classification model consolidates and presents the objects and scene in the image, which better matches the way humans describe images.
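A minimal sketch of such a multi-label classifier (an assumed design: one independent sigmoid score per label with a fixed threshold; the patent does not fix the classifier's internals):

```python
import torch
import torch.nn as nn

class MultiLabelClassifier(nn.Module):
    """Predicts a set of image labels from the extracted image features."""
    def __init__(self, feat_dim=2048, num_labels=1000):
        super().__init__()
        self.fc = nn.Linear(feat_dim, num_labels)

    def forward(self, feats):                  # feats: [batch, n, feat_dim]
        pooled = feats.mean(dim=1)             # pool the P*Q feature grid
        return torch.sigmoid(self.fc(pooled))  # one independent score per label

clf = MultiLabelClassifier()
scores = clf(torch.randn(1, 49, 2048))
labels = (scores > 0.5).nonzero()  # e.g. indices for "child" and "kite"
```

Because each label gets its own sigmoid score, the model can output more than one label per image, matching the multi-label setting described above.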
203. Input the image features and image labels of the target image into the encoder of the image description model to generate the feature matrix corresponding to the target image.
The encoder comprises at least one coding layer.
For the case where the encoder comprises one coding layer, step 203 comprises the following steps S2031 to S2032:
S2031. Preprocess the image features and image labels of the target image respectively to generate preprocessed image features and label vectors.
Here, relative position encoding (positional encoding) is applied to the image features of the target image to obtain the preprocessed image features. Specifically, with relative position encoding the encoder adds a feature to each input image feature, so that the position of each image feature, or the distance between different image features, can be determined.
In the case where the input image features are two-dimensional features of size height * width, the generated preprocessed image features are still two-dimensional features of size height * width. For example, for image features of size P*Q*L1, the generated preprocessed image features are (v1, v2, ..., vn), where n = P*Q and each vn is a one-dimensional vector of L1 numbers.
For the image labels, embedding-layer (embedding) processing is applied to obtain the label vectors. For example, the image labels of an image can be "apple" and "football"; the label vectors are then the one-dimensional vectors corresponding to "apple" and "football".
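The two preprocessing branches might be sketched as follows (illustrative only; the patent does not give the positional encoding formula, so the standard sinusoidal form is assumed, and the label vocabulary size is made up):

```python
import math
import torch
import torch.nn as nn

def positional_encoding(n, dim):
    """Standard sinusoidal encoding, added so that each of the n image
    features v1..vn carries its position in the P*Q grid."""
    pos = torch.arange(n).unsqueeze(1).float()
    div = torch.exp(torch.arange(0, dim, 2).float() * (-math.log(10000.0) / dim))
    pe = torch.zeros(n, dim)
    pe[:, 0::2] = torch.sin(pos * div)
    pe[:, 1::2] = torch.cos(pos * div)
    return pe

img_feats = torch.randn(49, 512)                 # n = P*Q features of dimension L1
pre_feats = img_feats + positional_encoding(49, 512)  # preprocessed image features

embedding = nn.Embedding(num_embeddings=1000, embedding_dim=512)
label_ids = torch.tensor([17, 305])              # e.g. ids for "apple", "football"
label_vecs = embedding(label_ids)                # label vectors u1, u2
```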
S2032. Input the preprocessed image features and label vectors into the coding layer, and take the output features of the coding layer as the feature matrix corresponding to the target image.
Step S2032 comprises: inputting the preprocessed image features into the first self-attention layer of the coding layer for processing, generating first self-attention features; inputting the first self-attention features and the label vectors into the first multi-head attention layer of the coding layer, generating first fusion features; and passing the first fusion features through the first feed-forward layer to generate the output features of the coding layer.
For the first self-attention layer, the preprocessed image features can serve as the key-value feature pair and also as the query feature, and self-attention is then computed.
For the first multi-head attention layer, the label vectors serve as the key-value feature pair, and the first self-attention features serve as the query feature.
The first self-attention features or first fusion features can be expressed as:

Attention(Q, K, V) = softmax(QKᵀ / √d_k) V

wherein d_k is the smoothing factor.
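A sketch of this scaled attention computation in matrix form (single head shown for brevity; the shapes are illustrative):

```python
import math
import torch

def attention(query, key, value):
    """Attention(Q, K, V) = softmax(Q Kᵀ / sqrt(d_k)) V."""
    d_k = query.size(-1)  # the smoothing factor
    scores = query @ key.transpose(-2, -1) / math.sqrt(d_k)
    return torch.softmax(scores, dim=-1) @ value

pre_feats = torch.randn(49, 512)   # preprocessed image features
label_vecs = torch.randn(2, 512)   # label vectors u1, u2

# First self-attention layer: image features are query, key, and value.
self_attn = attention(pre_feats, pre_feats, pre_feats)
# First multi-head attention layer (one head shown): labels are key/value,
# the self-attention features are the query.
fused = attention(self_attn, label_vecs, label_vecs)  # first fusion features
```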
For the case where the encoder comprises multiple coding layers, referring to Fig. 3, step 203 comprises the following steps 301 to 305:
301. Preprocess the image features and image labels of the target image respectively to generate preprocessed image features and label vectors.
302. Input the preprocessed image features and label vectors into the first coding layer to obtain the output features of the first coding layer.
303. Input the output features of the (i-1)-th coding layer and the label vectors into the i-th coding layer to obtain the output features of the i-th coding layer.
304. Increment i by 1 and determine whether the incremented i is less than N; if so, execute step 303; if not, execute step 305.
305. Take the output features of the N-th coding layer as the feature matrix corresponding to the target image.
More specifically, referring to Fig. 4, each coding layer comprises a first self-attention layer, a first multi-head attention layer, and a first feed-forward layer.
Step 302 comprises: inputting the preprocessed image features into the first self-attention layer of the first coding layer for processing, generating first self-attention features; inputting the first self-attention features and the label vectors into the first multi-head attention layer of the first coding layer, generating first fusion features; and passing the first fusion features through the first feed-forward layer to generate the output features of the first coding layer.
Step 303 comprises: inputting the output features of the (i-1)-th coding layer into the first self-attention layer of the i-th coding layer for processing, generating first self-attention features; inputting the first self-attention features and the label vectors into the first multi-head attention layer of the i-th coding layer, generating first fusion features; and passing the first fusion features through the first feed-forward layer to generate the output features of the i-th coding layer.
The output features of each coding layer form a three-dimensional tensor of shape [batch, seq_length, hidden_dim], where batch is the batch size, seq_length is the number of labels or the size (height * width) of the processed feature map, and hidden_dim carries the label content or image feature information fused by the coding layer.
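Putting steps 301 to 305 together, a coding layer and the encoder stack might be sketched as follows (illustrative; residual connections and layer normalization are omitted for brevity, and N = 6 is an assumption, not a value from the patent):

```python
import torch
import torch.nn as nn

class CodingLayer(nn.Module):
    def __init__(self, dim=512, heads=8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.label_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(dim, 2048), nn.ReLU(), nn.Linear(2048, dim))

    def forward(self, x, labels):
        sa, _ = self.self_attn(x, x, x)                  # first self-attention features
        fused, _ = self.label_attn(sa, labels, labels)   # first fusion features
        return self.ffn(fused)                           # output features of this layer

layers = nn.ModuleList([CodingLayer() for _ in range(6)])  # N = 6 coding layers
x = torch.randn(1, 49, 512)   # preprocessed image features [batch, seq, hidden]
u = torch.randn(1, 2, 512)    # label vectors
for layer in layers:
    x = layer(x, u)           # steps 302/303: the labels enter every layer
feature_matrix = x            # step 305
```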
In addition, since the extracted image features contain redundant data, the generated preprocessed image features also contain redundant data. During encoding, the preprocessed image features of the target image can be adjusted using the image labels, reducing redundant data and making the feature representation more accurate. For example, a specific region of an image shows a "flower", but the preprocessed image feature v1 of the target image differs from the label vector u1 corresponding to the label "flower". Since the label vector of the label "flower" is more accurate, the label vector u1 directly substitutes for the preprocessed image feature v1 of the target image, thereby reducing the generation of redundant data.
204. Input the feature matrix into the decoder of the image description model for decoding to obtain the image description sentence corresponding to the target image.
Specifically, step 204 comprises:
S2041. Input a reference decoding vector and the feature matrix into the decoder for decoding to obtain the decoding vector output by the decoder.
Specifically, for a decoder comprising M sequentially connected decoding layers, referring to Fig. 5, step S2041 comprises:
501. Input the reference decoding vector and the feature matrix into the first decoding layer to obtain the output vector of the first decoding layer.
502. Input the output vector of the (j-1)-th decoding layer and the feature matrix into the j-th decoding layer to obtain the output vector of the j-th decoding layer, where 2 <= j <= M.
503. Increment j by 1 and determine whether the incremented j is less than M; if so, execute step 502; if not, continue to step 504.
504. Take the output vector of the M-th decoding layer as the decoding vector corresponding to the target image.
S2042. Perform linearization and normalization on the decoding vector to generate the image description sentence corresponding to the target image.
Specifically, for each decoding step, linearization and normalization are performed on the decoding vector to generate the word corresponding to the target image, and the current decoding vector serves as the reference decoding vector for the next decoding step. Finally, the image description sentence is generated from the multiple words corresponding to the target image.
Through linear (linear-layer) processing, the decoding vector can be mapped to a linear vector.
The normalization can take many forms; this embodiment preferably uses softmax, so that the statistical probabilities are distributed in [0, 1], and the word corresponding to each decoding vector is determined by these probabilities.
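The word-by-word decoding described above can be sketched as a greedy loop (a sketch only; `decoder`, `linear`, the start vector, and the end-of-sentence id are assumptions introduced for illustration):

```python
import torch

def greedy_decode(decoder, linear, feature_matrix, start_vec, max_len=20, end_id=2):
    """One description word per step (steps S2041 + S2042)."""
    ref = start_vec                        # preset initial reference decoding vector
    words = []
    for _ in range(max_len):
        dec = decoder(ref, feature_matrix)          # decoding vector of this step
        probs = torch.softmax(linear(dec), dim=-1)  # linearization + softmax
        word_id = int(probs.argmax(dim=-1))         # word chosen by probability
        if word_id == end_id:
            break
        words.append(word_id)
        ref = dec   # the current decoding vector guides the next step
    return words
```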
In the image description method provided by this application, label extraction is performed on the image features of the target image to generate corresponding image labels, and the image features and image labels of the target image are input into the image description model to obtain the image description sentence corresponding to the target image. When generating the image description sentence, the image description model can thus refer to the information of specific, reliable image labels, so that the generated sentence contains more key information, improving the accuracy and reliability of the image description sentence; and because the reliable image labels serve as guidance in the generation phase of the image description sentence, the generation of redundant data is reduced.
Furthermore, the reduction of redundant data can have the following positive effects on the image description of this embodiment:
1) It can make the model easier to converge.
2) The image description becomes more controllable (or visualizable); the categories involved can be used to control issues such as the legality of the image description.
3) The image description can be more accurate and reliable, reducing the influence of irrelevant data.
To further illustrate the image description method of the embodiments of this application, Fig. 6 shows a specific schematic diagram of the model framework implementing the image description method of this embodiment.
Fig. 6 includes three models: a feature extraction model (CNN), a multi-label classification model, and a Transformer model. The target image in Fig. 6 shows a diver diving in the sea, with a sea turtle at the lower left.
The method of this embodiment comprises:
1) Extract image features V from the target image.
2) Perform label extraction on the image features to generate corresponding image labels U.
3) Input the image features V and image labels U of the target image into the encoder of the image description model to generate the feature matrix corresponding to the target image.
Specifically, step 3) comprises the following steps S11 to S15:
S11. Preprocess the image features V and image labels U of the target image respectively to generate preprocessed image features {v1, v2, ..., vn} and label vectors {u1, u2}.
S12. Input the preprocessed image features {v1, v2, ..., vn} and label vectors {u1, u2} into the first coding layer to obtain the output features of the first coding layer.
S13. Input the output features of the (i-1)-th coding layer and the label vectors {u1, u2} into the i-th coding layer to obtain the output features of the i-th coding layer.
S14. Increment i by 1 and determine whether the incremented i is less than N; if so, execute step S13; if not, execute step S15.
S15. Take the output features of the N-th coding layer as the feature matrix corresponding to the target image.
4) Input the feature matrix into the decoder of the image description model for decoding to obtain the image description sentence corresponding to the target image.
Specifically, step 4) comprises the following steps S21 to S24:
S21. Input the reference decoding vector and the feature matrix into the first decoding layer to obtain the output vector of the first decoding layer.
S22. Input the output vector of the (j-1)-th decoding layer and the feature matrix into the j-th decoding layer to obtain the output vector of the j-th decoding layer, where 2 <= j <= M.
S23. Increment j by 1 and determine whether the incremented j is less than M; if so, execute step S22; if not, continue to step S24.
S24. Take the output vector of the M-th decoding layer as the decoding vector corresponding to the target image.
5) Perform linearization and normalization on the decoding vector to generate the image description sentence corresponding to the target image.
Specifically, the first decoding vector is linearized and normalized to generate the first description word of the target image;
taking the first decoding vector as the reference decoding vector, the above steps S21 to S24 are repeated to obtain the second decoding vector; the second decoding vector is linearized and normalized to generate the second description word of the target image;
...
And so on. In the original Chinese, the description words obtained one per step are the characters for "a", "diver", "at", "the seabed", "observes", and "sea turtle", and the image description sentence finally obtained is "A diver observes a sea turtle on the seabed."
An embodiment of the present application also discloses a training method for an image description model, in which sample images and sample image description sentences are input into the image description model as the training set.
Referring to Fig. 7, the training method comprises:
701. Extract image features from a sample image.
702. Perform label extraction on the image features to generate corresponding image labels.
703. Input the image features and image labels of the sample image, together with the sample image description sentence corresponding to the sample image, into the image description model, and train the image description model until the training stop condition is reached.
The training stop condition comprises: comparing the decoding vector generated by the image description model with a preset vector validation set, and the change rate of the error of the decoding vector being less than a stability threshold.
The stability threshold can be set according to actual needs, for example to 1%. When the error stabilizes in this way, the model can be considered trained.
Specifically, inputting the image features and image labels of the sample image, together with the sample image description sentence corresponding to the sample image, into the image description model and training the image description model comprises the following steps S7031 to S7034:
S7031. Input the image features and image labels of the sample image into the encoder of the image description model to generate the output features of the encoder.
S7032. Input a reference decoding vector and the output features into the decoder for decoding to obtain the decoding vector output by the decoder.
S7033. Perform linearization and normalization on the decoding vector to generate the image description sentence corresponding to the sample image.
S7034. Compare the error between the image description sentence corresponding to the sample image and the sample image description sentence, and adjust the parameters of the image description model.
In the image description model training method provided by this embodiment, the image features and image labels of a sample image, together with the sample image description sentence corresponding to the sample image, are input into the image description model, which is trained until the training stop condition is reached, yielding an image description model that can generate a description sentence from a target image.
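A minimal sketch of one training step covering S7031 to S7034 (the per-word cross-entropy loss is an assumed choice; the patent only requires comparing the error and adjusting the parameters):

```python
import torch
import torch.nn as nn

def train_step(model, optimizer, img_feats, label_vecs, target_ids):
    """One pass of S7031-S7034 for a single sample image.
    target_ids holds the word ids of the sample description sentence."""
    criterion = nn.CrossEntropyLoss()
    logits = model(img_feats, label_vecs, target_ids)  # [seq_len, vocab] word scores
    loss = criterion(logits, target_ids)               # compare with sample sentence
    optimizer.zero_grad()
    loss.backward()                                    # adjust the model parameters
    optimizer.step()
    return loss.item()
```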
An embodiment of the present application also discloses an image description device, referring to Fig. 8, comprising:
a first feature extraction module 801, configured to extract image features from a target image;
a first label extraction module 802, configured to perform label extraction on the image features to generate corresponding image labels;
an encoding module 803, configured to input the image features and image labels of the target image into the encoder of an image description model to generate the feature matrix corresponding to the target image;
a decoding module 804, configured to input the feature matrix into the decoder of the image description model for decoding to obtain the image description sentence corresponding to the target image.
Optionally, the first label extraction module 802 is specifically configured to: input the image features into a multi-label classification model for label extraction, generating at least one corresponding image label.
Optionally, the encoder comprises one coding layer, and the encoding module 803 is specifically configured to:
preprocess the image features and image labels of the target image respectively to generate preprocessed image features and label vectors;
input the preprocessed image features and label vectors into the coding layer, and take the output features of the coding layer as the feature matrix corresponding to the target image.
Optionally, the encoder comprises N sequentially connected coding layers, and the encoding module 803 specifically comprises:
a feature processing unit, configured to preprocess the image features and image labels of the target image respectively to generate preprocessed image features and label vectors;
a first encoding unit, configured to input the preprocessed image features and label vectors into the first coding layer to obtain the output features of the first coding layer;
a second encoding unit, configured to input the output features of the (i-1)-th coding layer and the label vectors into the i-th coding layer to obtain the output features of the i-th coding layer;
a judging unit, configured to increment i by 1 and determine whether the incremented i is less than N; if so, execute the second encoding unit; if not, execute the feature matrix acquiring unit;
a feature matrix acquiring unit, configured to take the output features of the N-th coding layer as the feature matrix corresponding to the target image.
Optionally, each coding layer comprises a first self-attention layer, a first multi-head attention layer, and a first feed-forward layer; the second encoding unit is configured to:
input the preprocessed image features into the first self-attention layer of the i-th coding layer for processing, generating first self-attention features;
input the first self-attention features and the label vectors into the first multi-head attention layer of the i-th coding layer, generating first fusion features;
pass the first fusion features through the first feed-forward layer to generate the output features of the i-th coding layer.
Optionally, the decoding module 804 is specifically configured to:
input a reference decoding vector and the feature matrix into the decoder for decoding to obtain the decoding vector output by the decoder;
perform linearization and normalization on the decoding vector to generate the image description sentence corresponding to the target image.
In the image description device provided by this embodiment, label extraction is performed on the image features of the target image to generate corresponding image labels, and the image features and image labels of the target image are input into the image description model to obtain the image description sentence corresponding to the target image. When generating the image description sentence, the image description model can thus refer to the information of specific, reliable image labels, so that the generated sentence contains more key information, improving the accuracy and reliability of the image description sentence; and because the reliable image labels serve as guidance in the generation phase of the image description sentence, the generation of redundant data is reduced.
The above is an exemplary scheme of the image description device of this embodiment. It should be noted that the technical scheme of the device and the technical scheme of the image description method described above belong to the same concept; for details not described in the technical scheme of the device, refer to the description of the technical scheme of the image description method above.
An embodiment of the present application discloses a training device for an image description model, referring to Fig. 9, comprising:
a second feature extraction module 901, configured to extract image features from a sample image;
a second label extraction module 902, configured to perform label extraction on the image features to generate corresponding image labels;
a training module 903, configured to input the image features and image labels of the sample image, together with the sample image description sentence corresponding to the sample image, into the image description model, and train the image description model until the training stop condition is reached.
Optionally, the training stop condition comprises: comparing the decoding vector generated by the image description model with a preset vector validation set, and the change rate of the error of the decoding vector being less than a stability threshold.
In the image description model training device provided by this embodiment, the image features and image labels of a sample image, together with the corresponding sample image description sentence, are input into the image description model, which is trained until the training stop condition is reached, yielding an image description model that can generate a description sentence from a target image.
The above is an exemplary scheme of the training device for an image description model of this embodiment. It should be noted that the technical scheme of the training device and the technical scheme of the training method described above belong to the same concept; for details not described in the technical scheme of the training device, refer to the description of the technical scheme of the training method above.
An embodiment of the application also provides a computing device storing computer instructions that, when executed by a processor, implement the steps of the image description method or the image description model training method described above.
Fig. 10 shows a structural block diagram of a computing device 100 according to an embodiment of this specification. The components of the computing device 100 include, but are not limited to, a memory 110 and a processor 120. The processor 120 is connected to the memory 110 through a bus 130, and a database 150 is used for saving data.
The computing device 100 also includes an access device 140, which enables the computing device 100 to communicate via one or more networks 160. Examples of these networks include the public switched telephone network (PSTN), a local area network (LAN), a wide area network (WAN), a personal area network (PAN), or a combination of communication networks such as the Internet. The access device 140 may include one or more of any type of wired or wireless network interface (for example, a network interface card (NIC)), such as an IEEE 802.11 wireless local area network (WLAN) interface, a worldwide interoperability for microwave access (Wi-MAX) interface, an Ethernet interface, a universal serial bus (USB) interface, a cellular network interface, a Bluetooth interface, a near-field communication (NFC) interface, and so on.
In an embodiment of this specification, the above components of the computing device 100 and other components not shown in Fig. 10 may also be connected to each other, for example through a bus. It should be understood that the structural block diagram of the computing device shown in Fig. 10 is for exemplary purposes only and is not a limitation on the scope of this specification. Those skilled in the art may add or replace other components as needed.
The computing device 100 may be any type of static or mobile computing device, including a mobile computer or mobile computing device (for example, a tablet computer, personal digital assistant, laptop computer, notebook computer, netbook, etc.), a mobile phone (for example, a smartphone), a wearable computing device (for example, a smartwatch, smart glasses, etc.), or another type of mobile device, or a static computing device such as a desktop computer or PC. The computing device 100 may also be a mobile or stationary server.
An embodiment of the application also provides a computer-readable storage medium storing computer instructions that, when executed by a processor, implement the steps of the image description method or the image description model training method described above.
The above is an exemplary scheme of the computer-readable storage medium of this embodiment. It should be noted that the technical scheme of the storage medium and the technical scheme of the image description method or the image description model training method described above belong to the same concept; for details not described in the technical scheme of the storage medium, refer to the description of the technical scheme of the image description method or the image description model training method above.
The computer instructions include computer program code, which may be in source code form, object code form, an executable file, or some intermediate form, etc. The computer-readable medium may include: any entity or device capable of carrying the computer program code, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disc, a computer memory, a read-only memory (ROM), a random access memory (RAM), an electrical carrier signal, a telecommunication signal, a software distribution medium, and so on. It should be noted that the content of the computer-readable medium may be appropriately increased or decreased according to the requirements of legislation and patent practice in a jurisdiction; for example, in some jurisdictions, according to legislation and patent practice, the computer-readable medium does not include electrical carrier signals and telecommunication signals.
It should be noted that, for simplicity of description, the above method embodiments are all expressed as a series of action combinations, but those skilled in the art should know that the application is not limited by the described order of actions, because according to the application certain steps may be performed in other orders or simultaneously. Secondly, those skilled in the art should also know that the embodiments described in the specification are all preferred embodiments, and the actions and modules involved are not necessarily required by the application.
In the above embodiments, the description of each embodiment has its own emphasis. For parts not detailed in a certain embodiment, refer to the related descriptions of other embodiments.
The preferred embodiments of the application disclosed above are only intended to help illustrate the application. The alternative embodiments do not describe all the details, nor do they limit the invention to the specific implementations described. Obviously, many modifications and variations can be made according to the content of this specification. This specification selects and specifically describes these embodiments in order to better explain the principles and practical applications of the application, so that those skilled in the art can better understand and use the application. The application is limited only by the claims and their full scope and equivalents.
Claims (12)
1. An image description method for an image description model, characterized in that the method comprises:
extracting image features from a target image;
performing label extraction on the image features to generate corresponding image labels;
inputting the image features and image labels of the target image into the encoder of the image description model to generate a feature matrix corresponding to the target image;
inputting the feature matrix into the decoder of the image description model for decoding to obtain the image description sentence corresponding to the target image.
2. The method of claim 1, characterized in that performing label extraction on the image features to generate corresponding image labels comprises:
inputting the image features into a multi-label classification model for label extraction, generating at least one corresponding image label.
3. The method of claim 1, characterized in that the encoder comprises one coding layer;
inputting the image features and image labels of the target image into the encoder of the image description model to generate the feature matrix corresponding to the target image comprises:
preprocessing the image features and image labels of the target image respectively to generate preprocessed image features and label vectors;
inputting the preprocessed image features and label vectors into the coding layer, and taking the output features of the coding layer as the feature matrix corresponding to the target image.
4. The method of claim 1, characterized in that the encoder comprises N sequentially connected coding layers;
inputting the image features and image labels of the target image into the encoder of the image description model to generate the feature matrix corresponding to the target image comprises:
S11, preprocessing the image features and image labels of the target image respectively to generate preprocessed image features and label vectors;
S12, inputting the preprocessed image features and label vectors into the first coding layer to obtain the output features of the first coding layer;
S13, inputting the output features of the (i-1)-th coding layer and the label vectors into the i-th coding layer to obtain the output features of the i-th coding layer;
S14, incrementing i by 1 and determining whether the incremented i is less than N; if so, executing step S13; if not, executing step S15;
S15, taking the output features of the N-th coding layer as the feature matrix corresponding to the target image.
5. The method of claim 4, characterized in that each coding layer comprises a first self-attention layer, a first multi-head attention layer, and a first feed-forward layer;
inputting the preprocessed image features and label vectors into the i-th coding layer to obtain the output features of the i-th coding layer comprises:
inputting the preprocessed image features into the first self-attention layer of the i-th coding layer for processing, generating first self-attention features;
inputting the first self-attention features and the label vectors into the first multi-head attention layer of the i-th coding layer, generating first fusion features;
passing the first fusion features through the first feed-forward layer to generate the output features of the i-th coding layer.
6. The method of claim 1, characterized in that inputting the feature matrix into the decoder of the image description model for decoding to obtain the image description sentence corresponding to the target image comprises:
inputting a reference decoding vector and the feature matrix into the decoder for decoding to obtain the decoding vector output by the decoder;
performing linearization and normalization on the decoding vector to generate the image description sentence corresponding to the target image.
7. A training method for an image description model, characterized in that the method comprises:
extracting image features from a sample image;
performing label extraction on the image features to generate corresponding image labels;
inputting the image features and image labels of the sample image, together with the sample image description sentence corresponding to the sample image, into the image description model, and training the image description model until a training stop condition is reached.
8. The method of claim 7, characterized in that the training stop condition comprises:
comparing the decoding vector generated by the image description model with a preset vector validation set, and the change rate of the error of the decoding vector being less than a stability threshold.
9. An image description device, characterized in that the device comprises:
a first feature extraction module, configured to extract image features from a target image;
a first label extraction module, configured to perform label extraction on the image features to generate corresponding image labels;
an encoding module, configured to input the image features and image labels of the target image into the encoder of an image description model to generate the feature matrix corresponding to the target image;
a decoding module, configured to input the feature matrix into the decoder of the image description model for decoding to obtain the image description sentence corresponding to the target image.
10. A training device for an image description model, wherein the device comprises:
a second feature extraction module configured to extract an image feature from a sample image;
a second tag extraction module configured to perform tag extraction on the image feature to generate a corresponding image tag;
a training module configured to input the image feature of the sample image, the image tag and the sample image description sentence corresponding to the sample image into the image description model, and to train the image description model until a training stop condition is reached.
11. A computing device, comprising a memory, a processor, and computer instructions stored in the memory and executable on the processor, wherein the processor implements the steps of the method according to any one of claims 1-6 or 7-8 when executing the instructions.
12. A computer-readable storage medium storing computer instructions, wherein the instructions, when executed by a processor, implement the steps of the method according to any one of claims 1-6 or 7-8.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910760737.6A CN110472688A (en) | 2019-08-16 | 2019-08-16 | The method and device of iamge description, the training method of image description model and device |
Publications (1)
Publication Number | Publication Date |
---|---|
CN110472688A true CN110472688A (en) | 2019-11-19 |
Family
ID=68511036
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910760737.6A Pending CN110472688A (en) | 2019-08-16 | 2019-08-16 | The method and device of iamge description, the training method of image description model and device |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110472688A (en) |
Patent Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108268629A (en) * | 2018-01-15 | 2018-07-10 | 北京市商汤科技开发有限公司 | Image Description Methods and device, equipment, medium, program based on keyword |
CN109446534A (en) * | 2018-09-21 | 2019-03-08 | 清华大学 | Machine translation method and device |
Non-Patent Citations (3)
Title |
---|
JIANGYUN LI ET AL: "Boosted Transformer for Image Captioning", Applied Sciences * |
陆泉 (LU Quan): 《图像语义信息可视化交互研究》 [Research on the Visual Interaction of Image Semantic Information], 31 July 2015, 国防图书馆出版社 * |
高扬 (GAO Yang): 《智能摘要与深度学习》 [Intelligent Summarization and Deep Learning], 30 April 2019, 北京理工大学出版社 * |
Cited By (27)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111275110B (en) * | 2020-01-20 | 2023-06-09 | 北京百度网讯科技有限公司 | Image description method, device, electronic equipment and storage medium |
CN111275110A (en) * | 2020-01-20 | 2020-06-12 | 北京百度网讯科技有限公司 | Image description method and device, electronic equipment and storage medium |
CN111639594B (en) * | 2020-05-29 | 2023-09-22 | 苏州遐迩信息技术有限公司 | Training method and device for image description model |
CN111639594A (en) * | 2020-05-29 | 2020-09-08 | 苏州遐迩信息技术有限公司 | Training method and device of image description model |
CN113869337A (en) * | 2020-06-30 | 2021-12-31 | 北京金山数字娱乐科技有限公司 | Training method and device of image recognition model, and image recognition method and device |
CN111914842A (en) * | 2020-08-10 | 2020-11-10 | 深圳市视美泰技术股份有限公司 | License plate information identification method and device, computer equipment and storage medium |
CN112699948A (en) * | 2020-12-31 | 2021-04-23 | 无锡祥生医疗科技股份有限公司 | Ultrasonic breast lesion classification method and device and storage medium |
CN112818975A (en) * | 2021-01-27 | 2021-05-18 | 北京金山数字娱乐科技有限公司 | Text detection model training method and device and text detection method and device |
CN112862727A (en) * | 2021-03-16 | 2021-05-28 | 上海壁仞智能科技有限公司 | Cross-mode image conversion method and device |
CN112862727B (en) * | 2021-03-16 | 2023-06-23 | 上海壁仞智能科技有限公司 | Cross-modal image conversion method and device |
CN113095405A (en) * | 2021-04-13 | 2021-07-09 | 沈阳雅译网络技术有限公司 | Construction method of image description generation system based on pre-training and double-layer attention |
CN113095405B (en) * | 2021-04-13 | 2024-04-30 | 沈阳雅译网络技术有限公司 | Method for constructing image description generation system based on pre-training and double-layer attention |
CN113988274A (en) * | 2021-11-11 | 2022-01-28 | 电子科技大学 | Text intelligent generation method based on deep learning |
CN113988274B (en) * | 2021-11-11 | 2023-05-12 | 电子科技大学 | Text intelligent generation method based on deep learning |
WO2023116507A1 (en) * | 2021-12-22 | 2023-06-29 | 北京沃东天骏信息技术有限公司 | Target detection model training method and apparatus, and target detection method and apparatus |
WO2023134082A1 (en) * | 2022-01-11 | 2023-07-20 | 平安科技(深圳)有限公司 | Training method and apparatus for image caption statement generation module, and electronic device |
CN114358203A (en) * | 2022-01-11 | 2022-04-15 | 平安科技(深圳)有限公司 | Training method and device for image description sentence generation module and electronic equipment |
CN114358203B (en) * | 2022-01-11 | 2024-09-27 | 平安科技(深圳)有限公司 | Training method and device for image description sentence generation module and electronic equipment |
CN114627353A (en) * | 2022-03-21 | 2022-06-14 | 北京有竹居网络技术有限公司 | Image description generation method, device, equipment, medium and product |
CN114627353B (en) * | 2022-03-21 | 2023-12-12 | 北京有竹居网络技术有限公司 | Image description generation method, device, equipment, medium and product |
CN114881242A (en) * | 2022-04-21 | 2022-08-09 | 西南石油大学 | Image description method and system based on deep learning, medium and electronic equipment |
CN114743018A (en) * | 2022-04-21 | 2022-07-12 | 平安科技(深圳)有限公司 | Image description generation method, device, equipment and medium |
CN114743018B (en) * | 2022-04-21 | 2024-05-31 | 平安科技(深圳)有限公司 | Image description generation method, device, equipment and medium |
CN114842299A (en) * | 2022-05-10 | 2022-08-02 | 平安科技(深圳)有限公司 | Training method, device, equipment and medium for image description information generation model |
CN114821271A (en) * | 2022-05-19 | 2022-07-29 | 平安科技(深圳)有限公司 | Model training method, image description generation device and storage medium |
CN116778011A (en) * | 2023-05-22 | 2023-09-19 | 阿里巴巴(中国)有限公司 | Image generating method |
CN116778011B (en) * | 2023-05-22 | 2024-05-24 | 阿里巴巴(中国)有限公司 | Image generating method |
Similar Documents
Publication | Title |
---|---|
CN110472688A (en) | The method and device of iamge description, the training method of image description model and device |
CN109977428A (en) | A kind of method and device that answer obtains |
CN110781663A (en) | Training method and device of text analysis model and text analysis method and device |
CN109190131A (en) | A kind of English word and its capital and small letter unified prediction based on neural machine translation |
CN111985239A (en) | Entity identification method and device, electronic equipment and storage medium |
CN113609965B (en) | Training method and device of character recognition model, storage medium and electronic equipment |
CN114549850B (en) | Multi-mode image aesthetic quality evaluation method for solving modal missing problem |
CN113609326B (en) | Image description generation method based on relationship between external knowledge and target |
CN110765791A (en) | Automatic post-editing method and device for machine translation |
CN114580424B (en) | Labeling method and device for named entity identification of legal document |
CN113408287B (en) | Entity identification method and device, electronic equipment and storage medium |
CN109214407A (en) | Event detection model, calculates equipment and storage medium at method, apparatus |
CN115731552A (en) | Stamp character recognition method and device, processor and electronic equipment |
CN117540221A (en) | Image processing method and device, storage medium and electronic equipment |
CN114445832A (en) | Character image recognition method and device based on global semantics and computer equipment |
CN117762499A (en) | Task instruction construction method and task processing method |
CN116975288A (en) | Text processing method and text processing model training method |
CN110570484B (en) | Text-guided image coloring method under image decoupling representation |
CN111008531B (en) | Training method and device for sentence selection model, sentence selection method and device |
CN112084788A (en) | Automatic marking method and system for implicit emotional tendency of image captions |
CN117473359A (en) | Training method and related device of abstract generation model |
CN117093864A (en) | Text generation model training method and device |
CN113792550B (en) | Method and device for determining predicted answers, reading and understanding method and device |
CN109871946A (en) | A kind of application method and device, training method and device of neural network model |
CN115455144A (en) | Data enhancement method of completion type space filling type for small sample intention recognition |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
 | PB01 | Publication | |
 | SE01 | Entry into force of request for substantive examination | |
 | RJ01 | Rejection of invention patent application after publication | Application publication date: 20191119 |