CN109002852B - Image processing method, apparatus, computer readable storage medium and computer device
- Publication number: CN109002852B (application number CN201810758796.5A)
- Authority: CN (China)
- Prior art keywords: image, text, features, feature, input image
- Legal status: Active (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G—PHYSICS › G06—COMPUTING; CALCULATING OR COUNTING › G06F—ELECTRIC DIGITAL DATA PROCESSING › G06F18/00—Pattern recognition › G06F18/20—Analysing › G06F18/24—Classification techniques › G06F18/241—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
- G—PHYSICS › G06—COMPUTING; CALCULATING OR COUNTING › G06F—ELECTRIC DIGITAL DATA PROCESSING › G06F18/00—Pattern recognition › G06F18/20—Analysing › G06F18/25—Fusion techniques › G06F18/253—Fusion techniques of extracted features
Abstract
The application relates to an image processing method, an image processing apparatus, a computer readable storage medium and a computer device, wherein the method comprises the following steps: acquiring an input image; extracting image features of the input image through a first model; determining category label text corresponding to the input image through the first model according to the image features; performing cross-modal fusion on the image features and the corresponding category label text to obtain comprehensive features; and processing the comprehensive features through a second model, and outputting image description text of the input image. The scheme provided by the application can improve the accuracy of image understanding information.
Description
Technical Field
The present invention relates to the field of computer technology, and in particular, to an image processing method, an image processing apparatus, a computer readable storage medium, and a computer device.
Background
As computer technology has evolved, people increasingly use computer devices to handle complex problems and to interact with one another. For example, computer devices are used to help people understand images, which is particularly helpful for infants, elderly people, visually impaired people, language-impaired people, and the like.
The conventional image understanding method generally extracts image features of an image, inputs the image features together with a preset text into an encoder, and decodes them with a decoder to obtain image understanding information. However, because the conventional method processes the image through an encoding-decoding structure, the guidance provided by the image features gradually weakens as decoding proceeds, so the resulting image understanding is not accurate enough.
Disclosure of Invention
Based on this, it is necessary to provide an image processing method, apparatus, computer-readable storage medium and computer device for solving the technical problem of insufficient accuracy of image understanding in the conventional image understanding scheme.
An image processing method, comprising:
acquiring an input image;
extracting image features of the input image through a first model;
determining category label text corresponding to the input image through the first model according to the image features;
performing cross-modal fusion on the image features and the corresponding category label text to obtain comprehensive features;
and processing the comprehensive features through a second model, and outputting image description text of the input image.
An image processing apparatus, the apparatus comprising:
an acquisition module, used for acquiring an input image;
an extraction module, used for extracting image features of the input image through a first model;
a determining module, used for determining category label text corresponding to the input image through the first model according to the image features;
a fusion module, used for performing cross-modal fusion on the image features and the corresponding category label text to obtain comprehensive features;
and an output module, used for processing the comprehensive features through a second model and outputting image description text of the input image.
A computer readable storage medium storing a computer program which, when executed by a processor, causes the processor to perform the steps of the image processing method.
A computer device comprising a memory and a processor, the memory storing a computer program which, when executed by the processor, causes the processor to perform the steps of the image processing method.
According to the image processing method, the image processing apparatus, the computer readable storage medium and the computer device, the image features of the input image are extracted through the first model and the category label text corresponding to the input image is determined, so that the image features of the input image and the corresponding category label text can be obtained quickly and accurately. Cross-modal fusion is performed on the image features and the corresponding category label text to obtain comprehensive features, and the comprehensive features are processed through the second model to obtain the image description text. In this way, the second model can make full use of the image features of the input image while incorporating the category information of the input image during processing. The characteristics of the input image are thus mined carefully and fully, and the image understanding receives dual guidance from both the image features and the category label text, which greatly improves the accuracy of the image understanding information and the ability of the computer device to understand images.
An image processing method, comprising:
acquiring an input image and a question text corresponding to the input image;
extracting image features of the input image;
extracting text features of the question text;
performing attention distribution processing on the image features according to the text features to obtain attention weights;
determining weighted image features according to the image features and the attention weights;
and performing classification processing according to the weighted image features to obtain answer text corresponding to the question text.
An image processing apparatus comprising:
an acquisition module, used for acquiring an input image and a question text corresponding to the input image;
an extraction module, used for extracting image features of the input image;
the extraction module is also used for extracting text features of the question text;
an attention module, used for performing attention distribution processing on the image features according to the text features to obtain attention weights;
a determining module, used for determining weighted image features according to the image features and the attention weights;
and a classification module, used for performing classification processing according to the weighted image features to obtain answer text corresponding to the question text.
A computer readable storage medium storing a computer program which, when executed by a processor, causes the processor to perform the steps of the image processing method.
A computer device comprising a memory and a processor, the memory storing a computer program which, when executed by the processor, causes the processor to perform the steps of the image processing method.
According to the image processing method, the apparatus, the computer readable storage medium and the computer device, image features of the input image are extracted, text features of the question text corresponding to the input image are extracted, attention distribution processing is performed on the image features according to the text features to obtain attention weights, and weighted image features are determined according to the image features and the attention weights. Classification is then performed according to the weighted image features, and the answer text corresponding to the question text is output. In this way, attention distribution processing can be performed on the image features according to the text features of the question text to obtain weighted image features, so that the image features relevant to the question text are focused on during image processing; classifying the weighted image features then greatly improves the accuracy of the answer text, that is, the accuracy of the image understanding information, and improves the ability of the computer device to understand images.
Drawings
FIG. 1 is a diagram of an application environment for an image processing method in one embodiment;
FIG. 2 is a flow chart of an image processing method in one embodiment;
FIG. 3 is a schematic diagram of an input image in one embodiment;
FIG. 4 is a schematic flow chart of a step of cross-modal fusion of image features and corresponding class label text to obtain integrated features in one embodiment;
FIG. 5 is a flow chart illustrating steps for performing an image question and answer in one embodiment;
FIG. 6 is a flow chart of an image processing method according to another embodiment;
FIG. 7 is a flow chart of an image processing method according to another embodiment;
FIG. 8 is a flow chart of an image processing method in one embodiment;
FIG. 9 is a flow diagram of steps for extracting text features of a question text in one embodiment;
FIG. 10 is a flowchart of an image processing method according to another embodiment;
FIG. 11 is a flow chart of an image processing method according to another embodiment;
FIG. 12 is a block diagram showing the structure of an image processing apparatus in one embodiment;
- FIG. 13 is a block diagram showing the structure of an image processing apparatus in another embodiment;
FIG. 14 is a block diagram showing the structure of an image processing apparatus in one embodiment;
FIG. 15 is a block diagram of a computer device in one embodiment;
- FIG. 16 is a block diagram of a computer device in another embodiment.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application will be further described in detail with reference to the accompanying drawings and examples. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the present application.
FIG. 1 is a diagram of an application environment for an image processing method in one embodiment. Referring to FIG. 1, the image processing method is applied to an image processing system. The image processing system includes a terminal 110 and a server 120. The image processing method may be performed in the terminal 110 or the server 120: the terminal 110 may directly acquire an input image and perform the image processing method on the terminal side; alternatively, the terminal 110 may transmit the input image to the server 120 after acquiring it, so that the server acquires the input image and performs the image processing method. The terminal 110 and the server 120 are connected through a network. The terminal 110 may be a desktop terminal or a mobile terminal, and the mobile terminal may be at least one of a mobile phone, a tablet computer, a notebook computer, and the like. The server 120 may be implemented as a stand-alone server or as a server cluster composed of a plurality of servers.
As shown in fig. 2, in one embodiment, an image processing method is provided. The embodiment is mainly exemplified by the method applied to the computer device in fig. 1, such as the terminal 110 or the server 120. Referring to fig. 2, the image processing method specifically includes the steps of:
s202, an input image is acquired.
Specifically, the computer device may acquire a local image as the input image, or acquire the input image from another computer device through a network connection, a USB (Universal Serial Bus) interface connection, or the like.
In one embodiment, the terminal may collect an image through the camera under the current field of view of the camera, and use the collected image as the input image. Alternatively, the terminal may display an image display interface to the user, the user may perform a selection operation in the image display interface, and the terminal may use the selected image as the input image. The image displayed in the image display interface can be an image stored locally by the terminal, or an image obtained by the terminal accessing the server through network connection.
In one embodiment, the terminal may perform the image processing method locally after acquiring the input image. Alternatively, the terminal may transmit the input image to the server so that the server acquires the input image and performs the image processing method.
S204, extracting image features of the input image through the first model.
Here, a model is a model composed of an artificial neural network. An artificial neural network (Artificial Neural Network, ANN), also known as a neural network (NN) or connectionist model, abstracts the information-processing behavior of the human brain's neural network into a mathematical model, and different connection patterns form different networks. In engineering and academia, such models are also commonly referred to simply as neural networks or neural-network-like models.
Neural network models include, for example, the CNN (Convolutional Neural Network) model, the DNN (Deep Neural Network) model, and the RNN (Recurrent Neural Network) model.

A convolutional neural network comprises convolutional layers (Convolutional Layer) and pooling layers (Pooling Layer). There are various convolutional neural network models, such as the VGG (Visual Geometry Group) network model, the GoogLeNet model, or the ResNet (Residual Network) model. A deep neural network comprises an input layer, hidden layers and an output layer, with full connections between layers. A recurrent neural network is a neural network that models sequence data, i.e., the current output for a sequence is also related to the previous outputs. Concretely, the network memorizes previous information and applies it to the computation of the current output: the nodes between hidden layers are no longer unconnected, and the input of a hidden layer includes not only the output of the input layer but also the output of the hidden layer at the previous time step. An example of a recurrent neural network model is the LSTM (Long Short-Term Memory) model.
Image features are features that represent the color, texture, shape, spatial relationships, or the like of an image. In this embodiment, an image feature may specifically be data extracted from the input image by the computer device that represents the color, texture, shape, or spatial relationships of the image, yielding a non-image representation or description of the image, such as numerical values, vectors, or symbols.
In this embodiment, the first model may specifically be a convolutional neural network model, such as ResNet-80. The computer device may input the input image into a first model by which image features of the input image are extracted. For example, the computer device may input the input image into a convolutional neural network model, perform convolutional processing on the input image through a convolutional layer of the convolutional neural network, and extract image features of the input image. That is, the convolutional neural network may perform convolutional processing on the input image through the convolutional layer to obtain a feature map (feature map) of the input image, where the feature map is an image feature in this embodiment.
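The following is a minimal sketch of this feature-extraction step, not part of the patent itself; it assumes a PyTorch/torchvision environment and uses a standard resnet50 backbone as a stand-in for the ResNet-80 mentioned above, for which no standard public implementation exists.

```python
import torch
import torchvision.models as models

# Keep everything up to (but not including) the global pooling layer, so the
# output is a spatial feature map rather than a class-probability vector.
backbone = models.resnet50(weights=None)
feature_extractor = torch.nn.Sequential(*list(backbone.children())[:-2])

image = torch.randn(1, 3, 224, 224)          # a dummy input image batch
with torch.no_grad():
    feature_map = feature_extractor(image)   # feature map, shape (1, 2048, 7, 7)
```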
In one embodiment, the first model is a model for classifying the input image, obtained by training with images from an image library (ImageNet) and their corresponding class labels as training data. After the computer device acquires the input image, the input image is input into the first model, the image features of the input image are extracted through the convolutional layer structure of the first model, and the category label text corresponding to the input image is determined through the pooling layer structure and/or fully connected layer structure of the first model.
S206, determining category label text corresponding to the input image through the first model according to the image characteristics.
The category label text is a label text corresponding to the category to which the input image belongs. Specifically, the computer device may extract image features through the first model, and then perform subsequent classification processing on the extracted image features to obtain a category of the input image, so as to determine a category label text corresponding to the input image.
In one embodiment, the first model may specifically be a convolutional neural network model. The computer device may input the input image into the convolutional neural network model to extract image features of the input image, process the image features through the pooling layer and the fully connected layer to obtain probability values for the categories to which the input image may belong, and take the class label corresponding to the maximum probability value as the class label of the input image.
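As an illustration only, a hedged sketch of this pooling + fully connected classification head follows; the label set and dimensions are invented placeholders, not values from the patent.

```python
import torch
import torch.nn.functional as F

LABELS = ["dog", "person", "stream", "house"]     # hypothetical label set

pool = torch.nn.AdaptiveAvgPool2d(1)              # pooling layer
fc = torch.nn.Linear(2048, len(LABELS))           # fully connected layer

feature_map = torch.randn(1, 2048, 7, 7)          # feature map from the first model
logits = fc(pool(feature_map).flatten(1))         # (1, number of classes)
probs = F.softmax(logits, dim=1)                  # probability of each category
label_text = LABELS[probs.argmax(dim=1).item()]   # label with max probability
```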
In one embodiment, the computer device may process the input image through a multi-task convolutional neural network to obtain a plurality of category label texts corresponding to the input image. A multi-task convolutional neural network is a convolutional neural network capable of multi-task learning, and its network structure differs slightly from that of a single-task convolutional neural network. A single-task convolutional neural network, i.e., an independent neural network, is a function with only one output for a given input. A multi-task convolutional neural network may have multiple outputs for the input, one for each task. It will be appreciated that these outputs may all connect to the neurons of a shared hidden layer, and the features learned in the shared hidden layers for one task can also be exploited by the other tasks, facilitating joint learning of multiple tasks, so that features learned by one network can assist the learning of another.
S208, performing cross-modal fusion on the image features and the corresponding category label text to obtain comprehensive features.
Cross-modal fusion is the fusion of data of different modalities. In this embodiment, the data of different modalities specifically refers to the image features corresponding to the input image and the text data corresponding to the category label text. Specifically, the computer device may map the extracted image features and the corresponding category label text to data in the same space, and then perform fusion processing on the mapped data to obtain the comprehensive features.
In one embodiment, the image features of the input image are extracted by the first model, and the computer device may extract text features of the category label text through a recurrent neural network. Both the image features and the text features may be expressed as vectors. Before fusing the image features and the text features, the computer device can convert them into standard forms respectively, so that the feature vectors of the image features and the text features lie in the same range; for example, the image features and the text features may be normalized separately. Common normalization algorithms are function methods and probability density methods. Function methods include, for example, a max-min function, a mean-variance function (features are normalized to a consistent distribution, such as zero mean and unit variance), or a sigmoid (S-shaped growth curve) function.
Further, the computer device may perform a fusion operation on the normalized image features and the text features corresponding to the category label text to obtain the comprehensive features. The algorithm for fusing the image features and the text features may specifically be based on Bayesian decision theory, sparse representation theory, or deep learning theory. Alternatively, the computer device may perform a weighted summation of the two normalized vectors to fuse the image features with the text features and obtain the comprehensive features.
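A minimal sketch of this normalize-then-fuse step follows, assuming mean-variance normalization and an equal 0.5/0.5 weighting; both choices are illustrative assumptions rather than values fixed by the patent.

```python
import torch

def zscore(v: torch.Tensor) -> torch.Tensor:
    """Mean-variance normalization: zero mean, unit variance."""
    return (v - v.mean()) / (v.std() + 1e-8)

image_feat = torch.randn(512)   # image feature vector of the input image
text_feat = torch.randn(512)    # text feature vector of the category label text

# Normalize both modalities into the same range, then fuse by weighted sum.
comprehensive = 0.5 * zscore(image_feat) + 0.5 * zscore(text_feat)
```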
In one embodiment, the computer device may extract text features of the category label text through the recurrent neural network, perform attention distribution processing (that is, attention processing) on the image features and the text features to obtain attention distribution weights, that is, attention weights, and then combine the attention weights with the image features to obtain the comprehensive features.

Attention processing can be understood as selectively screening a small amount of important information out of a large amount of information, focusing on that important information, and ignoring most of the unimportant information. The focusing process is embodied in the computation of the attention distribution weights: the larger an attention distribution weight, the more the corresponding image feature is focused on.
S210, processing the comprehensive characteristics through a second model, and outputting an image description text of the input image.
The image description text is text describing the input image, such as identifying objects in the input image and understanding the relationships among the objects; it may specifically be a word, a whole sentence, or a paragraph of text. The second model may specifically be a recurrent neural network model, such as an LSTM (Long Short-Term Memory) model.
Specifically, the computer device may input the comprehensive features into the second model, and process the comprehensive features through the second model to output the image description text of the input image.

In one embodiment, step S210 may specifically include the following steps: acquiring an image pre-description text corresponding to the input image; sequentially inputting the comprehensive features and each word vector of the image pre-description text into the second model; and processing the sequentially input comprehensive features and word vectors through the second model, and outputting the image description text of the input image.
The image pre-description text is text that describes the input image in advance; it may specifically be an initial, coarser description text obtained from a preliminary understanding of the input image. The image pre-description text and the image description text may be in the same language, or in different languages. For example, the image pre-description text may describe the input image in Chinese while the image description text describes it in English.

In one embodiment, a computer device may obtain the image pre-description text corresponding to the input image and obtain the respective word vectors of the image pre-description text. Using an encoding-decoding scheme, the computer device can input the comprehensive features at the first time step and each word vector at the subsequent time steps, process the sequentially input comprehensive features and word vectors through the second model, and output the image description text. In this way, the second model can combine the comprehensive features with the image pre-description text, and the output image description text fits the input image better, which greatly improves the accuracy of the image understanding information.
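The following sketch illustrates this decoding scheme under stated assumptions: the fused feature enters the LSTM at the first time step and the pre-description word vectors at later steps. The vocabulary size, dimensions, and word ids are invented for illustration.

```python
import torch
import torch.nn as nn

vocab_size, embed_dim, hidden_dim = 10000, 512, 512
embed = nn.Embedding(vocab_size, embed_dim)       # word vectors
lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
to_vocab = nn.Linear(hidden_dim, vocab_size)      # hidden state -> word scores

comprehensive = torch.randn(1, 1, embed_dim)      # fused feature, first time step
word_ids = torch.tensor([[2, 57, 913]])           # pre-description word ids (dummy)
steps = torch.cat([comprehensive, embed(word_ids)], dim=1)

out, _ = lstm(steps)                              # run all time steps in order
next_word_logits = to_vocab(out[:, -1])           # scores for the next output word
```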
According to the above image processing method, the image features of the input image are extracted through the first model and the category label text corresponding to the input image is determined, so that the image features of the input image and the corresponding category label text can be obtained quickly and accurately. Cross-modal fusion is performed on the image features and the corresponding category label text to obtain comprehensive features, and the comprehensive features are processed through the second model to obtain the image description text. In this way, the second model can make full use of the image features of the input image while incorporating the category information of the input image during processing. The characteristics of the input image are thus mined carefully and fully, and the image understanding receives dual guidance from both the image features and the category label text, which greatly improves the accuracy of the image understanding information and the ability of the computer device to understand images.
In one embodiment, the step of extracting image features of the input image through the first model includes: determining a plurality of mutually different candidate regions in the input image through the first model; and respectively extracting the image features of each candidate region through the first model.
Specifically, the computer device may process the input image through the first model, determine a plurality of targets in the input image, and determine a plurality of mutually different candidate regions (Region Proposals) in the input image according to the respective targets. The candidate regions are different from each other and may partially overlap or not overlap at all, where overlapping means that different candidate regions contain some of the same pixels. The computer device may extract the image features of each candidate region separately through the first model.
There are various algorithms for determining the candidate regions of an input image; for example, a sliding-window method, a selective search method (Selective Search for Object Recognition), or the SSD (Single Shot MultiBox Detector) algorithm can be used.
In one embodiment, the computer device may determine the category label text corresponding to each candidate region through the first model, according to the image features corresponding to each candidate region. For example, referring to FIG. 3, which shows a schematic diagram of an input image in one embodiment, the input image includes a house, a stream, a dog, and a person: the stream is in front of the house, the dog is beside the stream, and the person is to the left of the house. The input image is input into the first model, which may determine a plurality of candidate regions, such as the regions A-D enclosed by dashed boxes in FIG. 3. Accordingly, the first model may extract the image features of the respective candidate regions and determine the category label text corresponding to each candidate region: "house" for candidate region A, "person" for candidate region B, "stream" for candidate region C, and "dog" for candidate region D.
In the above embodiment, a plurality of candidate areas different from each other in the input image are determined by the first model, and the image features of the respective candidate areas are extracted, respectively, so as to determine a plurality of category label texts corresponding to the input image.
In one embodiment, step S210, i.e., processing the comprehensive features through the second model and outputting the image description text of the input image, specifically includes: splicing (concatenating) the comprehensive features corresponding to the candidate regions to obtain spliced features; and processing the spliced features through the second model, and outputting the image description text of the input image.

Specifically, the computer device may perform cross-modal fusion on the image features and the category label text corresponding to each candidate region to obtain the comprehensive features corresponding to each candidate region. The computer device can then splice the comprehensive features of the candidate regions to obtain the spliced features, process the spliced features through the second model, and output the image description text of the input image.
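As a small illustration (not from the patent), the splicing step amounts to concatenating the per-region fused vectors into one vector before it enters the second model; the sizes are assumptions.

```python
import torch

# One comprehensive (fused) feature vector per candidate region.
region_features = [torch.randn(512) for _ in range(4)]
spliced = torch.cat(region_features)   # spliced feature, shape (2048,)
```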
In one embodiment, after determining a plurality of mutually different candidate regions in the input image, the computer device may select the candidate regions satisfying a preset condition as target candidate regions, extract the image features of the target candidate regions, and determine the category label text corresponding to each target candidate region, so as to perform cross-modal fusion on the image features and category label text corresponding to each target candidate region and obtain a plurality of comprehensive features.

The preset condition may be, for example, that the ratio of the area of a candidate region to the area of the input image reaches a preset ratio, or that the region ranks among the top few (such as the top three) by that ratio. Another example of a preset condition is to select a preset number of candidate regions containing the targets that the network model, trained on big data, has learned to be the most popular.
In this embodiment, the comprehensive features corresponding to the candidate regions are spliced to obtain the spliced features, and the image description text is then output according to the spliced features, so that the image information is used more fully and the image features and the category label text are effectively combined, greatly improving the accuracy of the image understanding information.
In one embodiment, step S208, that is, performing cross-modal fusion on the image feature and the corresponding category label text, the step of obtaining the integrated feature specifically includes the following steps:
s402, determining coded data corresponding to the category label text.
The encoded data is data obtained by encoding the category label text; that is, in this embodiment the encoded data represents the category label text. Common coding schemes include: unipolar codes, polar codes, bipolar codes, return-to-zero codes, biphase codes, non-return-to-zero codes, Manchester codes, differential Manchester codes, multilevel codes, and the like.
In one embodiment, the computer device may preset the mapping of category label text and encoded data. And determining the coded data corresponding to the category label text according to the mapping relation. For example, it may be preset that the category label text "dog" corresponds to the encoded data "0001", the category label text "person" corresponds to the encoded data "0002", the category label text "mountain" corresponds to the encoded data "0003", the category label text "house" corresponds to the encoded data "0101", and the like. When the computer device determines that the category label corresponding to the image feature is "dog," then the corresponding encoded data "0001" may be determined.
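A trivial sketch of such a preset mapping follows, using the example pairs given above; the lookup function name is an invented placeholder.

```python
# Preset mapping between category label text and encoded data (example pairs
# from the text above).
LABEL_CODES = {"dog": "0001", "person": "0002", "mountain": "0003", "house": "0101"}

def encode_label(label_text: str) -> str:
    """Look up the encoded data corresponding to a category label text."""
    return LABEL_CODES[label_text]

assert encode_label("dog") == "0001"
```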
In one embodiment, the computer device may extract text features of the category label text via the recurrent neural network, with the corresponding text features as encoded data corresponding to the category label text.
S404, performing attention distribution processing on the image features according to the encoded data to obtain attention weights.
In one embodiment, the computer device may perform an attention allocation process on the image features based on the encoded data to obtain the attention weight.
In one embodiment, the computer device may map the encoded data and the image features to standard vectors in the same space according to preset standard rules, perform a dot-product operation on the standard vectors corresponding to the encoded data and the image features to obtain an intermediate result, and then sequentially perform pooling processing (such as sum pooling) and regression processing (such as softmax) on the intermediate result to obtain the attention weights.
S406, calculating to obtain comprehensive characteristics according to the attention weight and the image characteristics.
Specifically, the computer device may combine the attention weights with the corresponding image features to obtain the weighted comprehensive features. In one embodiment, the computer device may implement the step of cross-modal fusion of the image features and the corresponding category label text through an attention model to obtain the comprehensive features. The image features and the corresponding category label text are input into the attention model, which obtains the attention weights by automatically learning weights through its network structure; the attention weights are then combined with the image features to obtain the comprehensive features. In the resulting comprehensive features, the elements the attention model focuses on more occupy larger weights.
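The sketch below walks through steps S402-S406 under stated assumptions (the dimensions, projections, and one feature vector per spatial location are all invented for illustration): both modalities are projected into a shared space, combined elementwise, sum-pooled and passed through softmax to obtain attention weights, which then reweight the image features.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

regions, img_dim, txt_dim, shared = 49, 2048, 300, 512
img_proj = nn.Linear(img_dim, shared)   # map image features into the shared space
txt_proj = nn.Linear(txt_dim, shared)   # map encoded label text into the shared space

image_feats = torch.randn(regions, img_dim)   # one feature per spatial location
label_code = torch.randn(txt_dim)             # encoded category label text

joint = img_proj(image_feats) * txt_proj(label_code)   # dot-product interaction
scores = joint.sum(dim=1)                              # sum pooling per location
attn = F.softmax(scores, dim=0)                        # attention weights (S404)

# S406: combine the attention weights with the image features.
comprehensive = (attn.unsqueeze(1) * image_feats).sum(dim=0)
```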
In the above embodiment, the attention weight is obtained by performing attention distribution processing on the image feature and the corresponding encoded data, and then the attention weight is combined with the image feature to obtain the comprehensive feature, so that the more important elements in the comprehensive feature occupy larger weight, the target elements can be focused in the image processing process, the accuracy of image understanding information is greatly improved, and the understanding capability of the computer equipment on the image is improved.
In one embodiment, the image processing method further includes: text content in the input image is extracted by the first model. The step of performing cross-modal fusion on the image features and the corresponding category label text to obtain the comprehensive features specifically comprises the following steps: and performing cross-modal fusion on the image features, the text content corresponding to the image features and the category label text corresponding to the image features to obtain comprehensive features.
Specifically, the input image includes text content. The computer device may employ a multiple instance learning (Multiple Instance Learning) approach to extract semantically meaningful text content from the input image, and then perform cross-modal fusion on the image features, the text content corresponding to the image features, and the category label text corresponding to the image features to obtain the comprehensive features.
In one embodiment, the computer device determines a plurality of candidate regions in the input image that are different from each other through the first model, and when the computer device extracts text content having a semantic meaning from the input image, the text content may be mapped to the respective candidate regions. Accordingly, the computer device may perform cross-modal fusion on the image features, text content, and category label text corresponding to each candidate region to obtain the integrated features.
In the embodiment, by extracting the text content in the input image and performing cross-modal fusion on the image features, the text content corresponding to the image features and the category label text corresponding to the image features, the features of the input image can be more fully and carefully mined, so that the image description text is more accurate, the accuracy of image understanding information is further improved, and the understanding capability of computer equipment on the image is improved.
In one embodiment, the image processing method further includes a step of performing an image question and answer, and the step specifically includes:
s502, acquiring a corresponding question text of the input image.
The question text is text describing a question about the input image. For example, referring to the input image in FIG. 3, the corresponding question text may be "What is in front of the house?", "What is to the left of the house?", or "What is beside the stream?", and the like.

Specifically, the computer device may obtain text corresponding to the input image locally as the question text, or obtain the question text from other computer devices through a network connection, a USB (Universal Serial Bus) interface connection, or the like.
In one embodiment, the terminal may present the image presentation interface to the user, where the user may perform a selection operation, and the terminal may use the selected image as the input image. The terminal can display preset problem text beside the input image displayed in the image display interface. The user can perform a selection operation in the image display interface, and the terminal takes the question text selected by the user as the question text corresponding to the input image.
In one embodiment, the terminal may invoke a local sound collection device to collect voice data, and recognize the voice data locally or send it to a server for recognition, so as to obtain the corresponding question text.
In one embodiment, the terminal may perform the image processing method locally after acquiring the input image and the corresponding question text. Alternatively, the terminal may transmit the input image and the corresponding question text to the server, so that the server acquires the input image and the corresponding question text and performs the image processing method.
S504, extracting text characteristics of the question text.
Specifically, the computer device may extract text features of the question text through a recurrent neural network, such as an LSTM network. In one embodiment, the computer device may extract text features of the characters, the words, or the whole sentence of the question text.
S506, performing attention distribution processing on the image features according to the text features to obtain attention weights.
In one embodiment, the computer device may perform an attention allocation process on the image features according to the text features to obtain the attention weight.
In one embodiment, the computer device may map the text features and the image features to standard vectors within the same space according to preset standard rules, perform a dot-product operation on the standard vectors corresponding to the text features and the image features to obtain an intermediate result, and sequentially perform pooling processing (such as sum pooling) and regression processing (such as softmax) on the intermediate result to obtain the attention weights.
S508, determining weighted image features according to the image features and the attention weight.
Specifically, the computer device may combine the attention weights with the corresponding image features to obtain the weighted image features. In one embodiment, the computer device may implement cross-modal fusion of the image features and the corresponding question text through an attention model to obtain the weighted image features. The image features and the corresponding question text are input into the attention model, which obtains the attention weights by automatically learning weights through its network structure; the attention weights are then combined with the image features to obtain the weighted image features. The more relevant an image feature is to the question text, the greater the weight it occupies in the resulting weighted image features.
S510, classifying according to the weighted image features to obtain answer texts corresponding to the question texts.
Specifically, the computer device may perform classification processing on the weighted image features through a machine learning classifier to obtain a class label text to which the weighted image features belong. And taking the corresponding category label text as an answer text corresponding to the question text.
In one embodiment, the computer device may input the weighted image features to a trained machine learning classifier, perform 3000-class classification, obtain a corresponding class label text, and use the class label text as an answer text corresponding to the question text.
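A hedged sketch of this answer-classification step follows; the 3000-way output matches the class count mentioned above, while the feature dimension and the answer lookup are assumptions.

```python
import torch
import torch.nn as nn

NUM_ANSWERS = 3000                       # 3000-class classification, as above
classifier = nn.Linear(2048, NUM_ANSWERS)

weighted_feat = torch.randn(1, 2048)     # weighted image feature from the attention step
answer_id = classifier(weighted_feat).argmax(dim=1).item()
# answer_texts[answer_id] (a hypothetical lookup list) would then be returned
# as the answer text corresponding to the question text.
```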
For example, referring to the input image of FIG. 3, when the question text corresponding to the input image is "What is in front of the house?", the answer text obtained according to the image processing method is "stream"; when the question text is "What is beside the stream?", the answer text obtained according to the image processing method is "dog".
In the above embodiment, text features of the question text corresponding to the input image are extracted, attention distribution processing is performed on the image features according to the text features, attention weights are obtained, and weighted image features are determined according to the image features and the attention weights. And then classifying according to the weighted image characteristics, and outputting answer texts corresponding to the question texts. In this way, attention distribution processing can be performed on the image features according to the text features corresponding to the question text to obtain weighted image features, so that the image features related to the question text can be focused in the image processing process, and then the accuracy of the answer text can be greatly improved by classifying the weighted image features, namely, the accuracy of image understanding information is greatly improved, and the understanding capability of computer equipment on the images is improved.
In one embodiment, referring to FIG. 6, which shows a flow diagram of an image processing method in one embodiment, the computer device may combine the first model, the second model, and the attention model to construct an Image Caption system for processing the input image to obtain the image description text of the input image. In the Image Caption system, the structure serving as the first model is a CNN model structure and the structure serving as the second model is an RNN model structure. Thus, the input image can be processed through a complete Image Caption system, and the image description text corresponding to the input image can be output.

As shown in FIG. 6, an input image (Image) may be fed into the Image Caption system; a plurality of candidate regions (Region Proposals) are determined by the convolutional neural network model (CNN network structure), and the image features (Feature maps) of the corresponding candidate regions are then extracted by the same convolutional neural network model. The category label text (Label) corresponding to each candidate region is determined by the convolutional neural network model. Attention distribution processing is performed on the category label text and the image features through the attention model to obtain the corresponding comprehensive features. The comprehensive features are input into the long short-term memory network model (LSTM network structure), and the corresponding image description text (Image Caption) is output.
As shown in fig. 7, in a specific embodiment, the image processing method includes:
s702, acquiring an input image.
S704, determining a plurality of candidate regions different from each other in the input image by the first model.
S706, extracting the image characteristics of each candidate region through the first model.
S708, determining category label text corresponding to the input image through the first model according to the image features.
S710, determining the coded data corresponding to the category label text.
S712, performing attention distribution processing on the image features according to the encoded data to obtain attention weights.
S714, calculating the comprehensive characteristics according to the attention weight and the image characteristics.
S716, splicing the comprehensive features corresponding to the candidate regions to obtain spliced features.

S718, acquiring an image pre-description text corresponding to the input image.

S720, sequentially inputting the spliced features and each word vector of the image pre-description text into the second model.

S722, processing the sequentially input spliced features and word vectors through the second model, and outputting the image description text of the input image.
According to the above image processing method, the image features of the input image are extracted through the first model and the category label text corresponding to the input image is determined, so that the image features of the input image and the corresponding category label text can be obtained quickly and accurately. Cross-modal fusion is performed on the image features and the corresponding category label text to obtain comprehensive features, and the comprehensive features are processed through the second model to obtain the image description text. In this way, the second model can make full use of the image features of the input image while incorporating the category information of the input image during processing. The characteristics of the input image are thus mined carefully and fully, and the image understanding receives dual guidance from both the image features and the category label text, which greatly improves the accuracy of the image understanding information and the ability of the computer device to understand images.
FIG. 7 is a flow chart of an image processing method in one embodiment. It should be understood that, although the steps in the flowchart of FIG. 7 are shown in the sequence indicated by the arrows, they are not necessarily performed in that sequence. Unless explicitly stated herein, the order of execution is not strictly limited, and the steps may be executed in other orders. Moreover, at least some of the steps in FIG. 7 may include multiple sub-steps or stages that are not necessarily performed at the same time but may be performed at different times, and these sub-steps or stages are not necessarily executed sequentially; they may be performed in turn or alternately with at least some of the sub-steps or stages of other steps.
As shown in fig. 8, in one embodiment, an image processing method is provided. The embodiment is mainly exemplified by the method applied to the computer device in fig. 1, such as the terminal 110 or the server 120. Referring to fig. 8, the image processing method specifically includes the steps of:
s802, acquiring an input image and a question text corresponding to the input image.
Specifically, the computer device may obtain the local image and the corresponding text as the input image and the corresponding question text, or obtain the input image and the corresponding question text from other computer devices through a network connection, a USB interface connection, or the like.
In one embodiment, the terminal may collect an image through the camera under the current field of view of the camera, and use the collected image as the input image. In one embodiment, the terminal may invoke a local sound collection device to collect voice data. And recognizing the voice data locally or sending the corresponding voice data to a server to recognize the voice data so as to obtain the corresponding problem text.
In one embodiment, the terminal may present an image presentation interface in which a user may perform a selection operation, and the terminal may use the selected image as an input image. The image displayed in the image display interface can be an image stored locally by the terminal, or an image obtained by the terminal accessing the server through network connection. The terminal can display preset problem text beside the input image displayed in the image display interface. The user can perform a selection operation in the image display interface, and the terminal takes the question text selected by the user as the question text corresponding to the input image.
In one embodiment, the terminal may perform the image processing method locally after acquiring the input image and the corresponding question text. Alternatively, the terminal may transmit the input image and the corresponding question text to the server, so that the server acquires the input image and the corresponding question text and performs the image processing method.
S804, extracting image characteristics of the input image.
In one embodiment, the computer device may extract the image features of the input image through a convolutional neural network, such as ResNet-80. The input image is input into the convolutional neural network, convolution processing is performed on the input image through the convolutional layers, and the image features of the input image are extracted. That is, the convolutional neural network may perform convolution processing on the input image through the convolutional layers to obtain a feature map of the input image, where the feature map constitutes the image features in this embodiment.
In one embodiment, the convolutional neural network is obtained by learning and training with images in an image library (ImageNet) and corresponding class labels as training data. After the computer equipment acquires the input image, the input image is input into a convolutional neural network, and the image characteristics of the input image are extracted through a convolutional layer structure of the convolutional neural network.
S806, extracting text features of the question text.
Specifically, the computer device may extract text features of the question text through a recurrent neural network, such as an LSTM network. In one embodiment, the computer device may extract text features of the characters, the words, or the whole sentence of the question text.
S808, performing attention distribution processing on the image features according to the text features to obtain attention weights.
Specifically, the computer device may perform attention allocation processing on the image feature according to the text feature to obtain an attention weight.
In one embodiment, the computer device may map the text features to first standard features and the image features to second standard features, where the first and second standard features are features in the same mapping space. The first standard features and the second standard features are added, a nonlinear operation is performed, and finally softmax processing is applied to obtain the attention weights.

In another embodiment, the computer device may map the text features to first standard features and the image features to second standard features in the same mapping space, perform a dot-product operation on the first standard features and the second standard features to obtain intermediate features, and sequentially perform pooling processing (such as sum pooling) and regression processing (such as softmax) on the intermediate features to obtain the attention weights.
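A sketch of the additive variant described first (add, nonlinearity, then softmax) follows; the dimensions and the tanh nonlinearity are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

regions, shared = 49, 512
w_img = nn.Linear(2048, shared)   # image features -> second standard features
w_txt = nn.Linear(512, shared)    # text features  -> first standard features
score = nn.Linear(shared, 1)      # scalar score per image location

image_feats = torch.randn(regions, 2048)   # one image feature per location
text_feat = torch.randn(512)               # text feature of the question text

hidden = torch.tanh(w_img(image_feats) + w_txt(text_feat))   # add + nonlinearity
attn = F.softmax(score(hidden).squeeze(1), dim=0)            # attention weights
```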
S810, determining weighted image features according to the image features and the attention weight.
In particular, the computer device may combine the attention weight with the corresponding image feature to obtain a weighted image feature that is weighted. In one embodiment, the computer device may implement the step of cross-modal fusion of image features and corresponding question text via an attention model, resulting in weighted image features. The image characteristics and the corresponding problem text are input into an attention model, and the attention model can obtain attention weight through the automatic learning weight of the network structure. And combining the attention weight with the image characteristic to obtain a weighted image characteristic. The more relevant the problem text, the greater the weight that is taken up in the resulting weighted image features.
S812, classifying according to the weighted image features to obtain answer texts corresponding to the question texts.
Specifically, the computer device may perform classification processing on the weighted image features through a machine learning classifier to obtain a class label text to which the weighted image features belong. And taking the corresponding category label text as an answer text corresponding to the question text.
In one embodiment, the computer device may input the weighted image features to a trained machine learning classifier, perform 3000-class classification, obtain a corresponding class label text, and use the class label text as an answer text corresponding to the question text.
In the above embodiment, text features of the question text corresponding to the input image are extracted, attention distribution processing is performed on the image features according to the text features, attention weights are obtained, and weighted image features are determined according to the image features and the attention weights. And then classifying according to the weighted image characteristics, and outputting answer texts corresponding to the question texts. In this way, attention distribution processing can be performed on the image features according to the text features corresponding to the question text to obtain weighted image features, so that the image features related to the question text can be focused in the image processing process, and then the accuracy of the answer text can be greatly improved by classifying the weighted image features, namely, the accuracy of image understanding information is greatly improved, and the understanding capability of computer equipment on the images is improved.
In one embodiment, step S806, i.e. the step of extracting text features of the question text, specifically includes:
S902, acquiring a character sequence corresponding to the question text.

Specifically, the computer device may split the question text into individual characters to obtain the corresponding character sequence.

S904, performing word segmentation processing on the question text to obtain a word sequence corresponding to the question text.
Specifically, the computer device may perform word segmentation processing on the question text by using a word segmentation method to obtain a word sequence composed of words. The computer device may segment the question text with a dictionary-based word segmentation algorithm or a word segmentation model. The dictionary-based word segmentation algorithm may specifically be a forward maximum matching algorithm, a reverse maximum matching algorithm, a least-segmentation algorithm, a bidirectional maximum matching algorithm, or the like. The word segmentation model may specifically be a hidden Markov model, a CRF (Conditional Random Field) model, or the like.
In one embodiment, after the computer device performs word segmentation on the question text, stop words are removed from the words obtained by segmentation to obtain the word sequence. In information retrieval, stop words (Stop Words) are words that are automatically filtered out before or after processing natural language data (or text) in order to save storage space and improve retrieval efficiency, such as very widely used articles, modal particles, adverbs, prepositions, or conjunctions.
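A toy sketch of dictionary-based forward maximum matching followed by stop-word removal is shown below; the dictionary, the stop-word list, and the example sentence are made-up assumptions.

```python
# Forward maximum matching: at each position, take the longest dictionary word.
DICTIONARY = {"图片", "里", "有", "什么", "动物"}
STOP_WORDS = {"里", "有"}

def forward_max_match(text, dictionary, max_len=4):
    words, i = [], 0
    while i < len(text):
        for j in range(min(len(text), i + max_len), i, -1):  # try longest match first
            if text[i:j] in dictionary or j == i + 1:        # fall back to a single character
                words.append(text[i:j])
                i = j
                break
    return words

words = forward_max_match("图片里有什么动物", DICTIONARY)
word_sequence = [w for w in words if w not in STOP_WORDS]    # stop-word removal
print(word_sequence)  # ['图片', '什么', '动物']
```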
S906, respectively extracting text features of the character sequence, the word sequence, and the whole sentence of the question text.
Specifically, the computer device may extract the text features of the character sequence, the word sequence, and the whole sentence of the question text, respectively, through a recurrent neural network.
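One possible realization, sketched below under assumed vocabulary and dimension sizes, uses separate LSTM encoders for the character sequence and the word sequence and stacks a further LSTM on the word features for the whole-sentence feature; the exact network layout is not fixed by this embodiment.

```python
import torch
import torch.nn as nn

class MultiLevelTextEncoder(nn.Module):
    def __init__(self, char_vocab=5000, word_vocab=10000, embed_dim=300, hidden_dim=512):
        super().__init__()
        self.char_embed = nn.Embedding(char_vocab, embed_dim)
        self.word_embed = nn.Embedding(word_vocab, embed_dim)
        self.char_lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.word_lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.sent_lstm = nn.LSTM(hidden_dim, hidden_dim, batch_first=True)

    def forward(self, char_ids, word_ids):
        # char_ids: (B, Tc) character sequence, word_ids: (B, Tw) word sequence
        _, (hc, _) = self.char_lstm(self.char_embed(char_ids))
        word_out, (hw, _) = self.word_lstm(self.word_embed(word_ids))
        _, (hs, _) = self.sent_lstm(word_out)        # whole-sentence feature on top of word features
        return hc[-1], hw[-1], hs[-1]                # character-, word-, and sentence-level features

enc = MultiLevelTextEncoder()
cf, wf, sf = enc(torch.randint(0, 5000, (1, 16)), torch.randint(0, 10000, (1, 8)))
```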
In the above embodiment, the text features of the character sequence, the word sequence, and the whole sentence corresponding to the question text are extracted separately, so that the text information of the question text can be fully mined through multi-level feature extraction at the character level, the word level, and the sentence level.
In one embodiment, step S808, i.e. the step of performing attention distribution processing on the image features according to the text features to obtain the attention weight, includes: respectively performing attention distribution processing on the image features according to the text features of the character sequence, the word sequence, and the whole sentence of the question text to obtain a first attention weight, a second attention weight, and a third attention weight. Step S810, i.e. the step of determining weighted image features from the image features and the attention weights, includes: determining the weighted image features according to the first attention weight, the second attention weight, and the third attention weight in combination with the image features.
Specifically, the computer device may perform attention distribution processing on the image features according to the text features of the character sequence, the word sequence, and the whole sentence of the question text, to obtain a first attention weight, a second attention weight, and a third attention weight. Further, the weighted image features are determined according to the first attention weight, the second attention weight, and the third attention weight in combination with the image features.
In one embodiment, the computer device may weight the image features according to the first attention weight, the second attention weight, and the third attention weight, respectively, to obtain corresponding first intermediate image features. The first intermediate image features are fused to obtain a second intermediate image feature, and the second intermediate image feature is directly taken as the weighted image feature.
In one embodiment, the computer device may fuse the first attention weight, the second attention weight, and the third attention weight, such as by weighted summation, to obtain a comprehensive attention weight. A second intermediate image feature is then obtained from the comprehensive attention weight and the image features, and the second intermediate image feature is directly taken as the weighted image feature.
In one embodiment, the computer device may weight the image features according to the first attention weight, the second attention weight, and the third attention weight, respectively, to obtain corresponding first intermediate image features, and fuse the first intermediate image features to obtain a second intermediate image feature. Attention distribution processing is then performed on the second intermediate image feature according to the text features of the whole sentence of the question text to obtain a fourth attention weight, and the weighted image feature is determined according to the second intermediate image feature and the fourth attention weight.
In one embodiment, the computer device combines the first attention weight, the second attention weight, and the third attention weight with the image features, respectively, to obtain first intermediate image features corresponding to the character level, the word level, and the sentence level of the question text. The computer device may superimpose the first intermediate image feature corresponding to the character level with the first intermediate image feature corresponding to the word level, and then superimpose the first intermediate image feature corresponding to the sentence level, to obtain the second intermediate image feature.
In one embodiment, the computer device may perform attention distribution processing on the second intermediate image feature according to the text features of the whole sentence of the question text to obtain a fourth attention weight, and determine the weighted image feature according to the second intermediate image feature and the fourth attention weight. In this embodiment, the second intermediate image feature is obtained after attention distribution processing over the multiple levels of the question text and the image features. Performing a further attention distribution over the second intermediate image feature according to the text features of the whole sentence brings the emphasis of the weighted image feature closer to the content of the question text, which improves the accuracy of the answer text obtained by the subsequent classification processing.
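The multi-level fusion described above can be sketched as follows; `text_proj` and `image_proj` are assumed shared mappings into a common space, and all sizes are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

text_proj = nn.Linear(512, 256)     # shared mapping into the common space (assumed sizes)
image_proj = nn.Linear(2048, 256)

def attend(text_feat, feats):
    scores = (text_proj(text_feat).unsqueeze(1) * image_proj(feats)).sum(-1)
    return F.softmax(scores, dim=-1)                      # one weight per image region

def multi_level_weighted_feature(char_f, word_f, sent_f, image_feats):
    # first/second/third attention weights, one per text level
    firsts = [attend(f, image_feats).unsqueeze(-1) * image_feats
              for f in (char_f, word_f, sent_f)]          # first intermediate image features
    second = firsts[0] + firsts[1] + firsts[2]            # superimpose -> second intermediate feature
    w4 = attend(sent_f, second)                           # fourth attention weight (whole sentence)
    return (w4.unsqueeze(-1) * second).sum(dim=1)         # weighted image feature

feat = multi_level_weighted_feature(torch.randn(1, 512), torch.randn(1, 512),
                                    torch.randn(1, 512), torch.randn(1, 36, 2048))
```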
In the above embodiment, attention distribution processing is performed on the image features according to the text features of the character sequence, the word sequence, and the whole sentence of the question text, to obtain the first attention weight, the second attention weight, and the third attention weight, and the weighted image features are then determined according to these weights in combination with the image features. In this way, the text information of the question text is fully mined, the emphasis of the weighted image features is brought closer to the content of the question text, and the accuracy of the answer text obtained by classifying the weighted image features can be improved.
In one embodiment, referring to FIG. 10, FIG. 10 illustrates a flowchart of an image processing method in one embodiment. As shown in FIG. 10, the computer device may extract the image features of the input image through a convolutional neural network, extract the text features of the question text through a recurrent neural network, and input the weighted image features into a machine learning classifier for classification processing to obtain the answer text corresponding to the question text. In this embodiment, the computer device may combine the convolutional neural network, the recurrent neural network, and the machine learning classifier to construct a visual question answering (VQA) system.
As shown in FIG. 10, an input image (image) may be input into the visual question-answering system, and the image features (feature map) of the input image are extracted by a convolutional neural network model (CNN network structure). The question text is input into the visual question-answering system, and the text features (query features) of the question text are extracted through a long short-term memory network model (LSTM network structure). Attention distribution processing (Attention) is performed on the image features and the text features, followed by regression processing (softmax), to obtain the attention weight (Attention value). A second intermediate image feature (Attention map) is obtained from the attention weight and the image features. Attention distribution processing (Attention) is then performed on the second intermediate image feature (Attention map) with the whole sentence of the question text to obtain the weighted image features. The weighted image features are input into a machine learning classifier for classification (Classification) processing, and the answer text (Answer) corresponding to the question text is obtained.
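To make the FIG. 10 pipeline concrete, the following sketch assembles a feature map and a query feature; torchvision's ResNet-50 is used here only as a stand-in, since the embodiment requires only "a convolutional neural network".

```python
import torch
import torch.nn as nn
from torchvision.models import resnet50  # assumes torchvision >= 0.13 for the weights argument

# CNN network structure: drop the average-pooling and fully connected layers to keep the feature map
cnn = nn.Sequential(*list(resnet50(weights=None).children())[:-2])
image = torch.randn(1, 3, 448, 448)
feature_map = cnn(image).flatten(2).transpose(1, 2)   # (1, 196, 2048): one feature per region

# LSTM network structure: encode the embedded question tokens into a query feature
lstm = nn.LSTM(300, 1024, batch_first=True)
query = torch.randn(1, 12, 300)                       # embedded question tokens (assumed)
_, (h, _) = lstm(query)
query_feature = h[-1]                                 # (1, 1024)
# attention and classification then proceed as in the sketches above
```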
In one embodiment, the computer device may also perform attention distribution processing on the image features and the text features in a co-attention (collaborative attention) manner. The co-attention manner mainly refers to performing attention distribution processing on the image features according to the text features, performing attention distribution processing on the text features according to the image features, and combining the results of the two, which is not detailed here.
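A rough sketch of such a co-attention step is given below; the affinity-matrix formulation and all dimensions are assumptions, as the patent does not spell out the computation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def co_attention(text_feats, image_feats, bilinear):
    # text_feats: (B, T, d_t), image_feats: (B, R, d_v)
    affinity = torch.bmm(bilinear(text_feats), image_feats.transpose(1, 2))  # (B, T, R)
    text_attn = F.softmax(affinity.max(dim=2).values, dim=-1)    # attend text tokens by image
    image_attn = F.softmax(affinity.max(dim=1).values, dim=-1)   # attend image regions by text
    attended_text = (text_attn.unsqueeze(-1) * text_feats).sum(1)
    attended_image = (image_attn.unsqueeze(-1) * image_feats).sum(1)
    return attended_text, attended_image                         # combined downstream

bilinear = nn.Linear(1024, 2048, bias=False)  # maps the text space into the image space
t, v = co_attention(torch.randn(2, 12, 1024), torch.randn(2, 36, 2048), bilinear)
```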
As shown in fig. 11, in one specific embodiment, the image processing method includes the steps of:
S1102, an input image and a question text corresponding to the input image are acquired.
S1104, extracting image characteristics of the input image through a convolutional neural network.
S1106, a character sequence corresponding to the question text is acquired.
S1108, word segmentation processing is carried out on the question text, and a word sequence corresponding to the question text is obtained.
S1110, the text features of the character sequence, the word sequence, and the whole sentence of the question text are respectively extracted through a recurrent neural network.
S1112, attention distribution processing is performed on the image features according to the text features of the character sequence, the word sequence, and the whole sentence of the question text, respectively, to obtain a first attention weight, a second attention weight, and a third attention weight.
S1114, the image features are weighted according to the first attention weight, the second attention weight, and the third attention weight, respectively, to obtain corresponding first intermediate image features.
S1116, fusing the first intermediate image features to obtain second intermediate image features.
S1118, attention distribution processing is performed on the second intermediate image feature according to the text features of the whole sentence of the question text to obtain a fourth attention weight.
S1120, determining weighted image features according to the second intermediate image features and the fourth attention weight.
S1122, the weighted image features are input into a machine learning classifier for classification processing, and answer texts corresponding to the question texts are obtained.
The above image processing method extracts the image features of the input image, extracts the text features of the question text corresponding to the input image, performs attention distribution processing on the image features according to the text features to obtain the attention weights, and determines the weighted image features according to the image features and the attention weights. Classification is then performed according to the weighted image features, and the answer text corresponding to the question text is output. Because the attention distribution is driven by the text features of the question text, the image features relevant to the question are emphasized during processing, so classifying the weighted image features greatly improves the accuracy of the answer text, that is, the accuracy of the image understanding information, and improves the ability of the computer device to understand images.
FIG. 11 is a flowchart of an image processing method in one embodiment. It should be understood that, although the steps in the flowchart of FIG. 11 are shown in the order indicated by the arrows, they are not necessarily performed in that order. Unless explicitly stated herein, the execution of these steps is not strictly ordered, and they may be performed in other orders. Moreover, at least some of the steps in FIG. 11 may include multiple sub-steps or stages that are not necessarily performed at the same moment but may be performed at different moments, and their execution order is not necessarily sequential; they may be performed in turn or alternately with at least a portion of the sub-steps or stages of other steps.
In a specific application scenario, a user may input a new image into the image processing system, and the image processing system performs the above image processing method to give an understanding of the image. For example, the image processing system may output image description text for the image. Alternatively, for a given image, the user may pose a number of questions, and the image processing system may output the corresponding answer text by performing the above image processing method. In the education field in particular, the image processing method can help a user effectively and quickly understand the semantic information in an image and can interact with the user through question and answer, which is especially helpful for infants, the elderly, visually impaired persons, or persons with language understanding impairments.
As shown in fig. 12, in one embodiment, there is provided an image processing apparatus 1200 including: an acquisition module 1201, an extraction module 1202, a determination module 1203, a fusion module 1204, and an output module 1205.
An acquisition module 1201 is used for acquiring an input image.
An extraction module 1202 is configured to extract image features of an input image through a first model.
A determining module 1203 is configured to determine, by using the first model and according to the image features, a category label text corresponding to the input image.
A fusion module 1204 is used for performing cross-modal fusion on the image features and the corresponding category label text to obtain comprehensive features.
An output module 1205 is used for processing the comprehensive features through the second model and outputting the image description text of the input image.
In one embodiment, the extraction module 1202 is further configured to determine a plurality of candidate regions in the input image that are different from each other by the first model; and respectively extracting the image characteristics of each candidate region through the first model.
In one embodiment, the output module 1205 is further configured to splice the corresponding comprehensive features of each candidate region to obtain a spliced feature; and processing the spliced characteristic through a second model, and outputting an image description text of the input image.
In one embodiment, the fusion module 1204 is further configured to determine encoded data corresponding to the category label text; according to the encoded data, performing attention distribution processing on the image characteristics to obtain attention weights; and calculating to obtain comprehensive characteristics according to the attention weight and the image characteristics.
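A hedged sketch of this label-guided fusion follows; encoding the category label text with an embedding lookup, and all sizes, are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

label_embed = nn.Embedding(3000, 512)   # encoded data, one vector per category label text (assumed)
image_proj = nn.Linear(2048, 512)

def fuse(label_ids, image_feats):
    # label_ids: (B,) predicted label per image; image_feats: (B, R, 2048)
    v = image_proj(image_feats)                       # (B, R, 512)
    e = label_embed(label_ids).unsqueeze(1)           # (B, 1, 512)
    w = F.softmax((e * v).sum(-1), dim=-1)            # attention distribution over regions
    return (w.unsqueeze(-1) * v).sum(1)               # comprehensive feature (B, 512)

comprehensive = fuse(torch.tensor([5]), torch.randn(1, 36, 2048))
```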
In one embodiment, the extraction module 1202 is further configured to extract text content in the input image by the first model. The fusion module 1204 is further configured to cross-modal fuse the image feature, the text content corresponding to the image feature, and the category label text corresponding to the image feature to obtain a comprehensive feature.
In one embodiment, the output module 1205 is further configured to obtain an image pre-description text corresponding to the input image; sequentially inputting the comprehensive characteristics and each word vector of the image pre-description text into a second model; and processing the sequentially input comprehensive features and word vectors through a second model, and outputting an image description text of the input image.
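A minimal sketch of such a decoder is shown below: the comprehensive feature is consumed at the first step and the word vectors of the image pre-description text at subsequent steps. The vocabulary size, the dimensions, and the start/end token handling are assumptions.

```python
import torch
import torch.nn as nn

class CaptionDecoder(nn.Module):
    def __init__(self, vocab_size=10000, embed_dim=512, hidden_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, comprehensive_feat, word_ids):
        # step 0 consumes the comprehensive feature, later steps consume word vectors
        inputs = torch.cat([comprehensive_feat.unsqueeze(1), self.embed(word_ids)], dim=1)
        h, _ = self.lstm(inputs)
        return self.out(h)  # next-word logits for the image description text

logits = CaptionDecoder()(torch.randn(1, 512), torch.tensor([[2, 7, 9]]))
```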
As shown in fig. 13, in one embodiment, the image processing apparatus 1200 further includes an attention distribution processing module 1206.
The acquisition module 1201 is further configured to acquire the question text corresponding to the input image.
The extraction module 1202 is also configured to extract text features of the question text.
The attention distribution processing module 1206 is configured to perform attention distribution processing on the image feature according to the text feature, so as to obtain an attention weight.
The determination module 1203 is further configured to determine weighted image features based on the image features and the attention weights.
The output module 1205 is further configured to perform classification processing according to the weighted image features, so as to obtain an answer text corresponding to the question text.
According to the above image processing apparatus, the image features of the input image are extracted through the first model, and the category label text corresponding to the input image is determined, so that the image features and the corresponding category label text can be obtained quickly and accurately. The image features and the corresponding category label text are cross-modally fused to obtain the comprehensive features, and the comprehensive features are processed through the second model to obtain the image description text. In this way, the second model can fully utilize the image features of the input image while incorporating the category information of the input image during processing. The features of the input image are thus fully utilized, and image understanding proceeds under the dual guidance of the image features and the category label text, which greatly improves the accuracy of the image understanding information and improves the ability of the computer device to understand images.
As shown in fig. 14, in one embodiment, there is provided an image processing apparatus 1400 including: an acquisition module 1401, an extraction module 1402, an attention distribution processing module 1403, a determination module 1404, and a classification module 1405.
An acquisition module 1401 is configured to acquire an input image and a question text corresponding to the input image.
An extraction module 1402 is configured to extract image features of an input image.
The extraction module 1402 is also configured to extract text features of the question text.
The attention allocation processing module 1403 is configured to perform attention allocation processing on the image feature according to the text feature, so as to obtain an attention weight.
A determination module 1404 is used for determining weighted image features based on the image features and the attention weights.
The classification module 1405 is configured to perform classification processing according to the weighted image features, and obtain answer text corresponding to the question text.
In one embodiment, the extraction module 1402 is further configured to acquire a character sequence corresponding to the question text; perform word segmentation processing on the question text to obtain a word sequence corresponding to the question text; and respectively extract the text features of the character sequence, the word sequence, and the whole sentence of the question text.
In one embodiment, the attention allocation processing module 1403 is further configured to perform attention allocation processing on the image features according to the text features of the character sequence, the word sequence, and the whole sentence of the question text, respectively, to obtain a first attention weight, a second attention weight, and a third attention weight. The determining module 1404 is further configured to determine the weighted image features according to the first attention weight, the second attention weight, and the third attention weight in combination with the image features.
In one embodiment, the determining module 1404 is further configured to weight the image features according to the first attention weight, the second attention weight, and the third attention weight, respectively, to obtain corresponding first intermediate image features; fuse the first intermediate image features to obtain a second intermediate image feature; perform attention distribution processing on the second intermediate image feature according to the text features of the whole sentence of the question text to obtain a fourth attention weight; and determine the weighted image features according to the second intermediate image feature and the fourth attention weight.
In one embodiment, the attention allocation processing module 1403 is further configured to map the text features to first standard features; map the image features to second standard features; perform a dot multiplication operation on the first standard features and the second standard features to obtain intermediate features; and sequentially perform pooling processing and regression processing on the intermediate features to obtain the attention weight.
In one embodiment, the extraction module 1402 is further configured to extract the image features of the input image through a convolutional neural network and extract the text features of the question text through a recurrent neural network. The classification module 1405 is further configured to input the weighted image features into a machine learning classifier for classification processing to obtain the answer text corresponding to the question text.
The above image processing apparatus extracts the image features of the input image, extracts the text features of the question text corresponding to the input image, performs attention distribution processing on the image features according to the text features to obtain the attention weights, and determines the weighted image features according to the image features and the attention weights. Classification is then performed according to the weighted image features, and the answer text corresponding to the question text is output. Because the attention distribution is driven by the text features of the question text, the image features relevant to the question are emphasized during processing, so classifying the weighted image features greatly improves the accuracy of the answer text, that is, the accuracy of the image understanding information, and improves the ability of the computer device to understand images.
FIG. 15 illustrates an internal block diagram of a computer device in one embodiment. The computer device may specifically be the terminal 110 of FIG. 1. As shown in FIG. 15, the computer device includes a processor, a memory, a network interface, an input device, and a display screen connected by a system bus. The memory includes a nonvolatile storage medium and an internal memory. The nonvolatile storage medium of the computer device stores an operating system and may also store a computer program that, when executed by the processor, causes the processor to implement the image processing method. The internal memory may also store a computer program that, when executed by the processor, causes the processor to perform the image processing method. The display screen of the computer device may be a liquid crystal display screen or an electronic ink display screen, and the input device of the computer device may be a touch layer covering the display screen, a key, a trackball, or a touchpad arranged on the housing of the computer device, or an external keyboard, touchpad, or mouse.
FIG. 16 illustrates an internal block diagram of a computer device in one embodiment. The computer device may specifically be the server 120 of FIG. 1. As shown in FIG. 16, the computer device includes a processor, a memory, and a network interface connected by a system bus. The memory includes a nonvolatile storage medium and an internal memory. The nonvolatile storage medium of the computer device stores an operating system and may also store a computer program that, when executed by the processor, causes the processor to implement the image processing method. The internal memory may also store a computer program that, when executed by the processor, causes the processor to perform the image processing method.
It will be appreciated by those skilled in the art that the structures shown in fig. 15 and 16 are merely block diagrams of partial structures related to the aspects of the present application and do not constitute a limitation of the computer device to which the aspects of the present application apply, and that a particular computer device may include more or fewer components than shown in the figures, or may combine certain components, or have a different arrangement of components.
In one embodiment, the image processing apparatus provided herein may be implemented in the form of a computer program that is executable on a computer device as shown in FIG. 15 or FIG. 16. The memory of the computer device may store the various program modules constituting the image processing apparatus, such as the acquisition module, the extraction module, the determination module, the fusion module, and the output module shown in FIG. 12, or the acquisition module, the extraction module, the attention distribution processing module, the determination module, and the classification module shown in FIG. 14. The computer program composed of these program modules causes the processor to execute the steps of the image processing method of each embodiment of the present application described in this specification.
For example, the computer device shown in fig. 15 or fig. 16 may execute step S202 by the acquisition module in the image processing apparatus shown in fig. 12. The computer device may perform step S204 through the extraction module. The computer device may perform step S206 by the determination module. The computer device may execute step S208 through the fusion module. The computer device may perform step S210 through the output module.
For example, the computer device shown in FIG. 15 or FIG. 16 may execute step S802 through the acquisition module in the image processing apparatus shown in FIG. 14. The computer device may perform steps S804 and S806 through the extraction module. The computer device may perform step S808 through the attention distribution processing module. The computer device may perform step S810 through the determination module. The computer device may perform step S812 through the classification module.
In one embodiment, a computer device is provided that includes a memory and a processor, the memory storing a computer program that, when executed by the processor, causes the processor to perform the steps of: acquiring an input image; extracting image features of an input image through a first model; determining a category label text corresponding to the input image through the first model according to the image characteristics; performing cross-modal fusion on the image features and the corresponding category label texts to obtain comprehensive features; and processing the comprehensive characteristics through a second model, and outputting the image description text of the input image.
In one embodiment, the computer program causes the processor to, when executing the step of extracting image features of the input image by the first model, specifically perform the steps of: determining a plurality of candidate regions which are different from each other in the input image through a first model; and respectively extracting the image characteristics of each candidate region through the first model.
In one embodiment, the computer program causes the processor to, when executing the step of processing the integrated feature by the second model, output the image description text of the input image, specifically execute the steps of: splicing the corresponding comprehensive characteristics of each candidate region to obtain splicing characteristics; and processing the spliced characteristic through a second model, and outputting an image description text of the input image.
In one embodiment, the computer program causes the processor to perform the step of cross-modal fusing the image features and the corresponding category label text to obtain the composite feature by: determining coded data corresponding to the category label text; according to the encoded data, performing attention distribution processing on the image characteristics to obtain attention weights; and calculating to obtain comprehensive characteristics according to the attention weight and the image characteristics.
In one embodiment, the computer program causes the processor to further perform the steps of: extracting text content in an input image through a first model; the computer program causes the processor to perform the steps of cross-modal fusion of the image features and the corresponding category label text to obtain the integrated features, comprising: and performing cross-modal fusion on the image features, the text content corresponding to the image features and the category label text corresponding to the image features to obtain comprehensive features.
In one embodiment, the computer program causes the processor to, when executing the step of processing the integrated feature by the second model, output the image description text of the input image, specifically execute the steps of: acquiring an image pre-description text corresponding to an input image; sequentially inputting the comprehensive characteristics and each word vector of the image pre-description text into a second model; and processing the sequentially input comprehensive features and word vectors through a second model, and outputting an image description text of the input image.
In one embodiment, the computer program causes the processor to further perform the steps of: acquiring the question text corresponding to the input image; extracting text features of the question text; performing attention distribution processing on the image features according to the text features to obtain attention weights; determining weighted image features from the image features and the attention weights; and performing classification processing according to the weighted image features to obtain the answer text corresponding to the question text.
According to the computer equipment, the image characteristics of the input image are extracted through the first model, the category label text corresponding to the input image is determined, and the image characteristics of the input image and the corresponding category label text can be quickly and accurately obtained. And performing cross-modal fusion on the image features and the corresponding category label texts to obtain comprehensive features, and processing the comprehensive features through a second model to obtain the image description text. Therefore, the second model can fully utilize the image characteristics of the input image and combine the category information of the input image in the processing process. Thus, the characteristics of the input image are carefully and fully utilized, and when the image is understood, double guidance of the image characteristics and the category label text is obtained, so that the accuracy of the image understanding information is greatly improved, and the understanding capability of the computer equipment on the image is improved.
In one embodiment, a computer device is provided that includes a memory and a processor, the memory storing a computer program that, when executed by the processor, causes the processor to perform the steps of: acquiring an input image and a question text corresponding to the input image; extracting image features of the input image; extracting text features of the question text; performing attention distribution processing on the image features according to the text features to obtain attention weights; determining weighted image features from the image features and the attention weights; and performing classification processing according to the weighted image features to obtain answer text corresponding to the question text.
In one embodiment, the computer program causes the processor to, when executing the step of extracting the text features of the question text, specifically perform the steps of: acquiring a character sequence corresponding to the question text; performing word segmentation processing on the question text to obtain a word sequence corresponding to the question text; and respectively extracting the text features of the character sequence, the word sequence, and the whole sentence of the question text.
In one embodiment, the computer program causes the processor to, when executing the step of performing attention distribution processing on the image features according to the text features to obtain the attention weight, specifically perform the steps of: respectively performing attention distribution processing on the image features according to the text features of the character sequence, the word sequence, and the whole sentence of the question text to obtain a first attention weight, a second attention weight, and a third attention weight. The computer program causes the processor to, when executing the step of determining weighted image features from the image features and the attention weights, specifically perform the step of: determining the weighted image features according to the first attention weight, the second attention weight, and the third attention weight in combination with the image features.
In one embodiment, the computer program causes the processor to, when executing the step of determining the weighted image features according to the first attention weight, the second attention weight, and the third attention weight in combination with the image features, specifically perform the steps of: weighting the image features according to the first attention weight, the second attention weight, and the third attention weight, respectively, to obtain corresponding first intermediate image features; fusing the first intermediate image features to obtain a second intermediate image feature; performing attention distribution processing on the second intermediate image feature according to the text features of the whole sentence of the question text to obtain a fourth attention weight; and determining the weighted image features according to the second intermediate image feature and the fourth attention weight.
In one embodiment, the computer program causes the processor to, when executing the step of performing attention distribution processing on the image features according to the text features to obtain the attention weight, specifically perform the steps of: mapping the text features to first standard features; mapping the image features to second standard features; performing a dot multiplication operation on the first standard features and the second standard features to obtain intermediate features; and sequentially performing pooling processing and regression processing on the intermediate features to obtain the attention weight.
In one embodiment, the computer program causes the processor to, when executing the step of extracting the image features of the input image, specifically perform the step of: extracting the image features of the input image through a convolutional neural network. The computer program causes the processor to, when executing the step of extracting the text features of the question text, specifically perform the step of: extracting the text features of the question text through a recurrent neural network. The computer program causes the processor to, when executing the step of performing classification processing according to the weighted image features to obtain the answer text corresponding to the question text, specifically perform the step of: inputting the weighted image features into a machine learning classifier for classification processing to obtain the answer text corresponding to the question text.
The above computer device extracts the image features of the input image, extracts the text features of the question text corresponding to the input image, performs attention distribution processing on the image features according to the text features to obtain the attention weights, and determines the weighted image features according to the image features and the attention weights. Classification is then performed according to the weighted image features, and the answer text corresponding to the question text is output. Because the attention distribution is driven by the text features of the question text, the image features relevant to the question are emphasized during processing, so classifying the weighted image features greatly improves the accuracy of the answer text, that is, the accuracy of the image understanding information, and improves the ability of the computer device to understand images.
A computer readable storage medium is provided, storing a computer program that, when executed by a processor, performs the steps of: acquiring an input image; extracting image features of the input image through a first model; determining category label text corresponding to the input image through the first model according to the image features; performing cross-modal fusion on the image features and the corresponding category label text to obtain comprehensive features; and processing the comprehensive features through a second model and outputting image description text of the input image.
In one embodiment, the computer program causes the processor to, when executing the step of extracting image features of the input image by the first model, specifically perform the steps of: determining a plurality of candidate regions which are different from each other in the input image through a first model; and respectively extracting the image characteristics of each candidate region through the first model.
In one embodiment, the computer program causes the processor to, when executing the step of processing the integrated feature by the second model, output the image description text of the input image, specifically execute the steps of: splicing the corresponding comprehensive characteristics of each candidate region to obtain splicing characteristics; and processing the spliced characteristic through a second model, and outputting an image description text of the input image.
In one embodiment, the computer program causes the processor to perform the step of cross-modal fusing the image features and the corresponding category label text to obtain the composite feature by: determining coded data corresponding to the category label text; according to the encoded data, performing attention distribution processing on the image characteristics to obtain attention weights; and calculating to obtain comprehensive characteristics according to the attention weight and the image characteristics.
In one embodiment, the computer program causes the processor to further perform the steps of: extracting text content in an input image through a first model; the computer program causes the processor to perform the steps of cross-modal fusion of the image features and the corresponding category label text to obtain the integrated features, comprising: and performing cross-modal fusion on the image features, the text content corresponding to the image features and the category label text corresponding to the image features to obtain comprehensive features.
In one embodiment, the computer program causes the processor to, when executing the step of processing the integrated feature by the second model, output the image description text of the input image, specifically execute the steps of: acquiring an image pre-description text corresponding to an input image; sequentially inputting the comprehensive characteristics and each word vector of the image pre-description text into a second model; and processing the sequentially input comprehensive features and word vectors through a second model, and outputting an image description text of the input image.
In one embodiment, the computer program causes the processor to further perform the steps of: acquiring the question text corresponding to the input image; extracting text features of the question text; performing attention distribution processing on the image features according to the text features to obtain attention weights; determining weighted image features from the image features and the attention weights; and performing classification processing according to the weighted image features to obtain the answer text corresponding to the question text.
The computer readable storage medium extracts the image characteristics of the input image through the first model and determines the category label text corresponding to the input image, so that the image characteristics of the input image and the corresponding category label text can be quickly and accurately obtained. And performing cross-modal fusion on the image features and the corresponding category label texts to obtain comprehensive features, and processing the comprehensive features through a second model to obtain the image description text. Therefore, the second model can fully utilize the image characteristics of the input image and combine the category information of the input image in the processing process. Thus, the characteristics of the input image are carefully and fully utilized, and when the image is understood, double guidance of the image characteristics and the category label text is obtained, so that the accuracy of the image understanding information is greatly improved, and the understanding capability of the computer equipment on the image is improved.
A computer readable storage medium is provided, storing a computer program that, when executed by a processor, performs the steps of: acquiring an input image and a question text corresponding to the input image; extracting image features of the input image; extracting text features of the question text; performing attention distribution processing on the image features according to the text features to obtain attention weights; determining weighted image features from the image features and the attention weights; and performing classification processing according to the weighted image features to obtain answer text corresponding to the question text.
In one embodiment, the computer program causes the processor to, when executing the step of extracting the text features of the question text, specifically perform the steps of: acquiring a character sequence corresponding to the question text; performing word segmentation processing on the question text to obtain a word sequence corresponding to the question text; and respectively extracting the text features of the character sequence, the word sequence, and the whole sentence of the question text.
In one embodiment, the computer program causes the processor to, when executing the step of performing attention distribution processing on the image features according to the text features to obtain the attention weight, specifically perform the steps of: respectively performing attention distribution processing on the image features according to the text features of the character sequence, the word sequence, and the whole sentence of the question text to obtain a first attention weight, a second attention weight, and a third attention weight. The computer program causes the processor to, when executing the step of determining weighted image features from the image features and the attention weights, specifically perform the step of: determining the weighted image features according to the first attention weight, the second attention weight, and the third attention weight in combination with the image features.
In one embodiment, the computer program causes the processor to, when executing the step of determining the weighted image features according to the first attention weight, the second attention weight, and the third attention weight in combination with the image features, specifically perform the steps of: weighting the image features according to the first attention weight, the second attention weight, and the third attention weight, respectively, to obtain corresponding first intermediate image features; fusing the first intermediate image features to obtain a second intermediate image feature; performing attention distribution processing on the second intermediate image feature according to the text features of the whole sentence of the question text to obtain a fourth attention weight; and determining the weighted image features according to the second intermediate image feature and the fourth attention weight.
In one embodiment, the computer program causes the processor to, when executing the step of performing attention distribution processing on the image features according to the text features to obtain the attention weight, specifically perform the steps of: mapping the text features to first standard features; mapping the image features to second standard features; performing a dot multiplication operation on the first standard features and the second standard features to obtain intermediate features; and sequentially performing pooling processing and regression processing on the intermediate features to obtain the attention weight.
In one embodiment, the computer program causes the processor to, when executing the step of extracting the image features of the input image, specifically perform the step of: extracting the image features of the input image through a convolutional neural network. The computer program causes the processor to, when executing the step of extracting the text features of the question text, specifically perform the step of: extracting the text features of the question text through a recurrent neural network. The computer program causes the processor to, when executing the step of performing classification processing according to the weighted image features to obtain the answer text corresponding to the question text, specifically perform the step of: inputting the weighted image features into a machine learning classifier for classification processing to obtain the answer text corresponding to the question text.
According to the above computer readable storage medium, the image features of the input image are extracted, the text features of the question text corresponding to the input image are extracted, attention distribution processing is performed on the image features according to the text features to obtain the attention weights, and the weighted image features are determined according to the image features and the attention weights. Classification is then performed according to the weighted image features, and the answer text corresponding to the question text is output. Because the attention distribution is driven by the text features of the question text, the image features relevant to the question are emphasized during processing, so classifying the weighted image features greatly improves the accuracy of the answer text, that is, the accuracy of the image understanding information, and improves the ability of the computer device to understand images.
Those skilled in the art will appreciate that all or part of the processes in the methods of the above embodiments may be implemented by a computer program instructing relevant hardware. The program may be stored in a nonvolatile computer readable storage medium and, when executed, may include the processes of the embodiments of the methods described above. Any reference to memory, storage, database, or other medium used in the various embodiments provided herein may include nonvolatile and/or volatile memory. The nonvolatile memory may include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. The volatile memory may include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).
The technical features of the above embodiments may be combined arbitrarily. For brevity of description, not all possible combinations of the technical features in the above embodiments are described; however, as long as there is no contradiction between the combinations of these technical features, they should be considered within the scope of this specification.
The above examples represent only a few embodiments of the present application, which are described in detail but are not to be construed as limiting the scope of the present application. It should be noted that those skilled in the art could make various modifications and improvements without departing from the concept of the present application, and these would all fall within the protection scope of the present application. Accordingly, the scope of protection of the present application shall be determined by the appended claims.
Claims (24)
1. An image processing method, comprising:
acquiring an input image and a question text corresponding to the input image;
extracting image features of the input image;
extracting multi-level text features of the question text, wherein the multi-level text features comprise the text features of a character sequence corresponding to the question text, a word sequence corresponding to the question text, and the whole sentence of the question text;
performing attention distribution processing on the image features according to the text features of the character sequence, the word sequence, and the whole sentence of the question text, respectively, to obtain a first attention weight, a second attention weight, and a third attention weight;
according to the first attention weight, the second attention weight and the third attention weight, combining the image features to obtain corresponding first intermediate image features;
fusing the first intermediate image features to obtain second intermediate image features;
performing attention distribution processing on the second intermediate image feature according to the text feature of the whole sentence of the question text, and combining the second intermediate image feature to obtain a weighted image feature;
and carrying out classification processing according to the weighted image characteristics to obtain answer texts corresponding to the question texts.
2. The method of claim 1, wherein extracting multi-level text features of the question text comprises:
acquiring a character sequence corresponding to the question text;
word segmentation processing is carried out on the question text, and a word sequence corresponding to the question text is obtained;
and respectively extracting the text features of the character sequence, the word sequence, and the whole sentence of the question text.
3. The method according to claim 1, wherein the performing attention allocation processing on the second intermediate image feature according to the text feature of the whole sentence of the question text, and combining the second intermediate image feature to obtain a weighted image feature includes:
according to the text characteristics of the whole sentence of the question text, performing attention distribution processing on the second intermediate image characteristics to obtain a fourth attention weight;
and determining a weighted image feature according to the second intermediate image feature and the fourth attention weight.
4. The method of claim 1, wherein the determining of any of the first, second, or third attention weights comprises:
for any one of the multi-level text features, mapping the targeted text feature into a first standard feature;
mapping the image features to second standard features;
performing dot multiplication operation on the first standard feature and the second standard feature to obtain an intermediate feature;
and sequentially performing pooling processing and regression processing on the intermediate feature to obtain the attention weight.
5. The method according to any one of claims 1 to 4, wherein the extracting image features of the input image comprises:
extracting image features of the input image through a convolutional neural network;
the extracting the multi-level text features of the question text comprises the following steps:
extracting the multi-level text features of the question text through a recurrent neural network;
the step of classifying according to the weighted image features to obtain answer texts corresponding to the question texts comprises the following steps:
and inputting the weighted image features into a machine learning classifier for classification processing to obtain answer texts corresponding to the question texts.
6. The method according to claim 1, wherein the method further comprises:
extracting image features of the input image through a first model;
determining category label text corresponding to the input image through the first model according to the image characteristics;
performing cross-modal fusion on the image features and the corresponding category label texts to obtain comprehensive features;
and processing the comprehensive characteristics through a second model, and outputting image description text of the input image.
7. The method of claim 6, wherein extracting image features of the input image by the first model comprises:
determining a plurality of candidate regions which are different from each other in the input image through a first model;
and respectively extracting the image characteristics of each candidate region through the first model.
8. The method of claim 7, wherein processing the composite feature through a second model, outputting image description text of the input image comprises:
splicing the corresponding comprehensive characteristics of each candidate region to obtain splicing characteristics;
and processing the spliced characteristic through a second model, and outputting an image description text of the input image.
9. The method of claim 6, wherein cross-modal fusing the image features and corresponding category label text to obtain integrated features comprises:
determining coded data corresponding to the category label text;
according to the encoded data, performing attention distribution processing on the image characteristics to obtain attention weight;
and calculating to obtain comprehensive characteristics according to the attention weight and the image characteristics.
10. The method of claim 6, further comprising:
extracting text content in the input image through the first model;
wherein performing cross-modal fusion on the image features and the corresponding category label text to obtain the comprehensive feature comprises:
performing cross-modal fusion on the image features, the text content corresponding to the image features, and the category label text corresponding to the image features to obtain the comprehensive feature.
11. The method of claim 6, wherein processing the comprehensive feature through the second model and outputting the image description text of the input image comprises:
acquiring an image pre-description text corresponding to the input image;
sequentially inputting the comprehensive feature and each word vector of the image pre-description text into the second model;
and processing the sequentially input comprehensive feature and word vectors through the second model, and outputting the image description text of the input image.
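Claim 11's input order, with the comprehensive feature first and the pre-description word vectors afterwards, suggests a sequential decoder. The LSTM below is one hedged interpretation with invented dimensions; the greedy argmax output is an illustrative simplification.

```python
import torch
import torch.nn as nn

class SecondModel(nn.Module):
    """Toy sequential decoder: the comprehensive feature is consumed as the
    first time step, followed by each word vector of the pre-description text."""
    def __init__(self, feat_dim=256, vocab_size=5000):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, feat_dim, batch_first=True)
        self.vocab = nn.Linear(feat_dim, vocab_size)

    def forward(self, comprehensive_feature, word_vectors):
        # comprehensive_feature: (batch, feat_dim); word_vectors: (batch, steps, feat_dim)
        inputs = torch.cat([comprehensive_feature.unsqueeze(1), word_vectors], dim=1)
        out, _ = self.lstm(inputs)           # inputs are processed strictly in sequence order
        return self.vocab(out).argmax(-1)    # token ids forming the image description text

tokens = SecondModel()(torch.randn(2, 256), torch.randn(2, 7, 256))  # (2, 8)
```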
12. An image processing apparatus, comprising:
an acquisition module, configured to acquire an input image and a question text corresponding to the input image;
an extraction module, configured to extract image features of the input image;
the extraction module being further configured to extract multi-level text features of the question text, wherein the multi-level text features comprise a character sequence corresponding to the question text, a word sequence, and a whole-sentence text feature of the question text;
an attention distribution processing module, configured to perform attention distribution processing on the image features according to the character sequence, the word sequence, and the whole-sentence text feature of the question text, respectively, to obtain a first attention weight, a second attention weight, and a third attention weight; and to combine the image features according to the first, second, and third attention weights to obtain corresponding first intermediate image features, and fuse the first intermediate image features to obtain a second intermediate image feature;
a determining module, configured to perform attention distribution processing on the second intermediate image feature according to the whole-sentence text feature of the question text, in combination with the second intermediate image feature, to obtain a weighted image feature;
and a classification module, configured to perform classification processing according to the weighted image feature to obtain an answer text corresponding to the question text.
13. The apparatus of claim 12, wherein the extraction module is further configured to obtain a character sequence corresponding to the question text; perform word segmentation processing on the question text to obtain a word sequence corresponding to the question text; and extract the character sequence, the word sequence, and the whole-sentence text feature of the question text respectively.
14. The apparatus of claim 12, wherein the determining module is further configured to perform attention distribution processing on the second intermediate image feature according to the whole-sentence text feature of the question text to obtain a fourth attention weight, and determine the weighted image feature according to the second intermediate image feature and the fourth attention weight.
15. The apparatus of claim 12, wherein the attention distribution processing module is further configured to, for any one of the multi-level text features, map the targeted text feature to a first standard feature; map the image features to a second standard feature; perform a dot product operation on the first standard feature and the second standard feature to obtain an intermediate feature; and sequentially perform pooling processing and regression processing on the intermediate feature to obtain the attention weight.
16. The apparatus according to any one of claims 12 to 15, wherein the extraction module is further configured to extract the image features of the input image through a convolutional neural network, and to extract the multi-level text features of the question text through a recurrent neural network;
and the classification module is further configured to input the weighted image features into a machine learning classifier for classification processing to obtain the answer text corresponding to the question text.
17. The apparatus of claim 12, wherein:
the extraction module is further configured to extract image features of the input image through a first model;
the apparatus further comprises a category label determining module, configured to determine a category label text corresponding to the input image through the first model according to the image features;
a fusion module, configured to perform cross-modal fusion on the image features and the corresponding category label text to obtain a comprehensive feature;
and an output module, configured to process the comprehensive feature through a second model and output an image description text of the input image.
18. The apparatus of claim 17, wherein the extraction module is further configured to determine a plurality of mutually distinct candidate regions in the input image through the first model, and to extract the image features of each candidate region respectively through the first model.
19. The apparatus of claim 18, wherein the output module is further configured to splice the comprehensive features corresponding to each candidate region to obtain a spliced feature, and to process the spliced feature through the second model to output the image description text of the input image.
20. The apparatus of claim 17, wherein the fusion module is further configured to determine encoded data corresponding to the category label text; perform attention distribution processing on the image features according to the encoded data to obtain an attention weight; and calculate the comprehensive feature according to the attention weight and the image features.
21. The apparatus of claim 17, wherein:
the extraction module is further configured to extract text content in the input image through the first model;
and the fusion module is further configured to perform cross-modal fusion on the image features, the text content corresponding to the image features, and the category label text corresponding to the image features to obtain the comprehensive feature.
22. The apparatus of claim 17, wherein the output module is further configured to acquire an image pre-description text corresponding to the input image; sequentially input the comprehensive feature and each word vector of the image pre-description text into the second model; and process the sequentially input comprehensive feature and word vectors through the second model to output the image description text of the input image.
23. A computer readable storage medium storing a computer program which, when executed by a processor, causes the processor to perform the steps of the method of any one of claims 1 to 11.
24. A computer device comprising a memory and a processor, the memory storing a computer program that, when executed by the processor, causes the processor to perform the steps of the method of any one of claims 1 to 11.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810758796.5A CN109002852B (en) | 2018-07-11 | 2018-07-11 | Image processing method, apparatus, computer readable storage medium and computer device |
Publications (2)
Publication Number | Publication Date |
---|---|
CN109002852A CN109002852A (en) | 2018-12-14 |
CN109002852B true CN109002852B (en) | 2023-05-23 |
Family
ID=64598961
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201810758796.5A Active CN109002852B (en) | 2018-07-11 | 2018-07-11 | Image processing method, apparatus, computer readable storage medium and computer device |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109002852B (en) |
Families Citing this family (21)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109635150B (en) * | 2018-12-19 | 2021-07-02 | 腾讯科技(深圳)有限公司 | Text generation method, device and storage medium |
CN109766465A (en) * | 2018-12-26 | 2019-05-17 | 中国矿业大学 | A kind of picture and text fusion book recommendation method based on machine learning |
CN109740515B (en) * | 2018-12-29 | 2021-08-17 | 科大讯飞股份有限公司 | Evaluation method and device |
CN109858499A (en) * | 2019-01-23 | 2019-06-07 | 哈尔滨理工大学 | A kind of tank armor object detection method based on Faster R-CNN |
CN109886309A (en) * | 2019-01-25 | 2019-06-14 | 成都浩天联讯信息技术有限公司 | A method of digital picture identity is forged in identification |
CN109919888B (en) * | 2019-02-26 | 2023-09-19 | 腾讯科技(深圳)有限公司 | Image fusion method, model training method and related device |
CN109903314A (en) * | 2019-03-13 | 2019-06-18 | 腾讯科技(深圳)有限公司 | A kind of method, the method for model training and the relevant apparatus of image-region positioning |
CN109947977A (en) * | 2019-03-13 | 2019-06-28 | 广东小天才科技有限公司 | Image-combined intention identification method and device and terminal equipment |
CN110110772A (en) * | 2019-04-25 | 2019-08-09 | 北京小米智能科技有限公司 | Determine the method, apparatus and computer readable storage medium of image tag accuracy |
CN110135441B (en) * | 2019-05-17 | 2020-03-03 | 北京邮电大学 | Text description method and device for image |
CN110689052B (en) * | 2019-09-06 | 2022-03-11 | 平安国际智慧城市科技股份有限公司 | Session message processing method, device, computer equipment and storage medium |
CN110717514A (en) * | 2019-09-06 | 2020-01-21 | 平安国际智慧城市科技股份有限公司 | Session intention identification method and device, computer equipment and storage medium |
CN111209961B (en) * | 2020-01-03 | 2020-10-09 | 广州海洋地质调查局 | Method for identifying benthos in cold spring area and processing terminal |
CN111967487B (en) * | 2020-03-23 | 2022-09-20 | 同济大学 | Incremental data enhancement method for visual question-answer model training and application |
CN111669587B (en) * | 2020-04-17 | 2021-07-20 | 北京大学 | Mimic compression method and device of video image, storage medium and terminal |
CN111563551B (en) * | 2020-04-30 | 2022-08-30 | 支付宝(杭州)信息技术有限公司 | Multi-mode information fusion method and device and electronic equipment |
CN111611420B (en) * | 2020-05-26 | 2024-01-23 | 北京字节跳动网络技术有限公司 | Method and device for generating image description information |
CN111767727B (en) * | 2020-06-24 | 2024-02-06 | 北京奇艺世纪科技有限公司 | Data processing method and device |
CN112016493B (en) * | 2020-09-03 | 2024-08-23 | 科大讯飞股份有限公司 | Image description method, device, electronic equipment and storage medium |
CN113705595A (en) * | 2021-03-04 | 2021-11-26 | 腾讯科技(深圳)有限公司 | Method, device and storage medium for predicting degree of abnormal cell metastasis |
CN118674937A (en) * | 2023-03-15 | 2024-09-20 | 华为技术有限公司 | Image description generation method and electronic equipment |
Family Cites Families (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107683469A (en) * | 2015-12-30 | 2018-02-09 | 中国科学院深圳先进技术研究院 | A kind of product classification method and device based on deep learning |
CN107391505B (en) * | 2016-05-16 | 2020-10-23 | 腾讯科技(深圳)有限公司 | Image processing method and system |
CN107766349B (en) * | 2016-08-16 | 2022-03-01 | 阿里巴巴集团控股有限公司 | Method, device, equipment and client for generating text |
CN106503055B (en) * | 2016-09-27 | 2019-06-04 | 天津大学 | A kind of generation method from structured text to iamge description |
CN106777185B (en) * | 2016-12-23 | 2020-07-10 | 浙江大学 | Cross-media Chinese herbal medicine plant image searching method based on deep learning |
CN107979764B (en) * | 2017-12-06 | 2020-03-31 | 中国石油大学(华东) | Video subtitle generating method based on semantic segmentation and multi-layer attention framework |
Patent Citations (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107066583A (en) * | 2017-04-14 | 2017-08-18 | 华侨大学 | A kind of picture and text cross-module state sensibility classification method merged based on compact bilinearity |
Non-Patent Citations (2)
Title |
---|
"Deep Visual-Semantic Alignments for Generating Image Descriptions;Andrej Karpathy et al.;《IEEE》;20151231;第1-17页 * |
基于语义理解注意力神经网络的多元特征融合中文文本分类;谢金宝 等;《电子与信息学报》;20180531;第40卷(第5期);第1258-1265页 * |
Also Published As
Publication number | Publication date |
---|---|
CN109002852A (en) | 2018-12-14 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||