CN109086811B

CN109086811B - Multi-label image classification method and device and electronic equipment

Info

Publication number: CN109086811B
Application number: CN201810802045.9A
Authority: CN
Inventors: 魏秀参; 陈钊民
Original assignee: Xuzhou Kuangshi Data Technology Co ltd; Nanjing Kuangyun Technology Co ltd; Beijing Kuangshi Technology Co Ltd
Current assignee: Xuzhou Kuangshi Data Technology Co ltd; Nanjing Kuangyun Technology Co ltd; Beijing Kuangshi Technology Co Ltd
Priority date: 2018-07-19
Filing date: 2018-07-19
Publication date: 2021-06-22
Anticipated expiration: 2038-07-19
Also published as: CN109086811A

Abstract

The invention provides a multi-label image classification method, a multi-label image classification device and electronic equipment. On one hand, a first feature image is obtained using a convolutional layer and is then classified using a pooling layer and a fully connected layer to obtain a first label classification prediction result. On the other hand, feature filtering is performed on the first feature image according to the parameters of the fully connected layer to obtain a second feature image, wherein the parameters of the fully connected layer and the parameters of the convolutional layer are optimized based on a metric learning algorithm; pooling is then performed on the second feature image to obtain a second label classification prediction result. Finally, the first label classification prediction result and the second label classification prediction result are considered together to obtain a target label classification prediction result. The method thus classifies labels from two aspects, and the first label classification prediction result is corrected by the second label classification prediction result obtained from the second feature map, so that the number of label combinations is reduced and the multi-label image recognition precision is improved.

Description

Multi-label image classification method and device and electronic equipment

Technical Field

The invention relates to the technical field of image processing, in particular to a multi-label image classification method and device and electronic equipment.

Background

Multi-label image classification is a very important research topic in computer vision. Because a picture taken in a real scene usually contains a plurality of objects, the image carries a plurality of labels, and the number of possible classification results grows exponentially compared with the single-label case. Compared with the single-label image classification problem, the multi-label image classification problem is more difficult, achieves lower precision, and has greater research significance.

Most conventional methods use a graph to model the relationships between labels, thereby manually adding constraints to the final prediction in order to reduce the number of possible classification results. Such methods depend heavily on human prior knowledge and on the quality of the constructed graph, and are therefore very limited. In recent years, with the rapid development of deep learning, more and more methods model the relationships between labels using a neural network. Such methods can overcome the above limitations, but most current deep learning methods rely on an attention mechanism and improve accuracy on the basis of a single-label classification method. Therefore, there is still no good solution for improving multi-label image recognition precision by genuinely exploiting the relationships between labels.

Disclosure of Invention

In view of the above, the present invention provides a multi-label image classification method, apparatus and electronic device to effectively reduce the number of label combinations and improve the accuracy of multi-label image recognition.

In a first aspect, an embodiment of the present invention provides a multi-label image classification method, including:

extracting a first characteristic image of an image to be processed;

processing the first characteristic image through a pooling layer and a full-link layer in sequence to obtain a first label classification prediction result;

obtaining a second characteristic image according to the parameters of the full connection layer and the first characteristic image, wherein the second characteristic image comprises a sub-characteristic image corresponding to each label of a preset category;

performing pooling processing on the second characteristic image to obtain a second label classification prediction result;

and obtaining a target label classification prediction result according to the first label classification prediction result and the second label classification prediction result.

With reference to the first aspect, an embodiment of the present invention provides a first possible implementation manner of the first aspect, where the method further includes: determining a first loss function according to the target label classification prediction result;

according to a preset metric learning loss function and the first loss function, determining a final loss function;

wherein the metric learning loss function is set based on a metric learning algorithm.

With reference to the first possible implementation manner of the first aspect, an embodiment of the present invention provides a second possible implementation manner of the first aspect, where the method further includes:

and in the training process, optimizing the parameters of the full-connection layer and the parameters of the convolution layer based on the metric learning loss function and the correlation between the labels.

With reference to the second possible implementation manner of the first aspect, an embodiment of the present invention provides a third possible implementation manner of the first aspect, where the optimizing, by using a back propagation algorithm, the parameters of the fully-connected layer and the parameters of the convolutional layer based on the metric learning loss function and the correlation between the labels includes:

mapping the sub-feature images corresponding to each label into a preset space based on a metric learning algorithm, and respectively calculating the distance between each sub-feature image in the preset space;

and optimizing parameters of the full-connection layer and parameters of the convolution layer based on the metric learning loss function and the correlation between the labels by utilizing a back propagation algorithm so as to adjust the distance between the sub-feature images in the preset space.

With reference to the third possible implementation manner of the first aspect, an embodiment of the present invention provides a fourth possible implementation manner of the first aspect, where in the sub-feature image corresponding to each label, a sub-feature image corresponding to a label that belongs to a currently input image is used as a correlation image, and a sub-feature image corresponding to a label that does not belong to the currently input image is used as a non-correlation image;

and reducing the distance between the correlation images in the preset space, and increasing the distance of the non-correlation images so that the non-correlation images are far away from the correlation images in the preset space.

With reference to the first aspect, an embodiment of the present invention provides a fifth possible implementation manner of the first aspect, where the obtaining a first label classification prediction result by sequentially processing the first feature image through a pooling layer and a full link layer includes:

inputting the first feature image into a pooling layer, and performing pooling processing on the first feature image based on a global maximum pooling function to obtain a first feature image after dimension reduction;

and inputting the first feature image subjected to dimension reduction to a full connection layer for classification processing, and generating a first label classification prediction result.

With reference to the first aspect, an embodiment of the present invention provides a sixth possible implementation manner of the first aspect, where obtaining a second feature image according to the parameter of the full connection layer and the first feature image includes:

and multiplying the first characteristic image and the parameters of the full connection layer to obtain a second characteristic image.

With reference to the first aspect, an embodiment of the present invention provides a seventh possible implementation manner of the first aspect, where obtaining a target label classification prediction result according to the first label classification prediction result and the second label classification prediction result includes:

and taking the sum of the first label classification prediction result and the second label classification prediction result as a target label classification prediction result.

In a second aspect, an embodiment of the present invention further provides a multi-label image classification apparatus, including:

the first extraction module is used for extracting a first characteristic image of the image to be processed;

the first prediction module is used for processing the first characteristic image through a pooling layer and a full-link layer in sequence to obtain a first label classification prediction result;

the second extraction module is used for obtaining a second characteristic image according to the parameters of the full connection layer and the first characteristic image, wherein the second characteristic image comprises a sub-characteristic image corresponding to each label in a preset category;

the second prediction module is used for performing pooling processing on the second characteristic image to obtain a second label classification prediction result;

and the target prediction module is used for obtaining a target label classification prediction result according to the first label classification prediction result and the second label classification prediction result.

In a third aspect, an embodiment of the present invention further provides an electronic device, including a memory and a processor, where the memory stores a computer program that is executable on the processor, and the processor executes the computer program to implement the method described in the first aspect and any possible implementation manner thereof.

In a fourth aspect, the present invention further provides a computer-readable medium having non-volatile program code executable by a processor, where the program code causes the processor to execute the method described in the first aspect and any possible implementation manner thereof.

The embodiment of the invention has the following beneficial effects:

In the embodiment provided by the invention, on one hand, a first feature image is obtained using the convolutional layer and is then classified using the pooling layer and the fully connected layer to obtain a first label classification prediction result. On the other hand, feature filtering is performed on the first feature image according to the parameters of the fully connected layer to obtain a second feature image, wherein the parameters of the fully connected layer and the parameters of the convolutional layer are optimized based on a metric learning algorithm; pooling is then performed on the second feature image to obtain a second label classification prediction result. Finally, the first label classification prediction result and the second label classification prediction result are considered together to obtain a target label classification prediction result. The method classifies labels from two aspects, obtains a second feature map that carries the relationships between labels based on a metric learning algorithm, and corrects the first label classification prediction result with the second label classification prediction result obtained from the second feature map, so that the number of label combinations is reduced and the multi-label image recognition precision is improved.

Additional features and advantages of the invention will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention. The objectives and other advantages of the invention will be realized and attained by the structure particularly pointed out in the written description and claims hereof as well as the appended drawings.

In order to make the aforementioned and other objects, features and advantages of the present invention comprehensible, preferred embodiments accompanied with figures are described in detail below.

Drawings

In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and other drawings can be obtained by those skilled in the art without creative efforts.

Fig. 1 is a schematic flowchart of a multi-label image classification method according to an embodiment of the present invention;

fig. 2 is a schematic diagram of a network model structure of a multi-label image classification method according to an embodiment of the present invention;

fig. 3 is a schematic structural diagram of a multi-label image classification apparatus according to an embodiment of the present invention;

fig. 4 is a schematic structural diagram of another multi-label image classification apparatus according to an embodiment of the present invention;

fig. 5 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.

Detailed Description

To make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions of the present invention will be clearly and completely described below with reference to the accompanying drawings, and it is apparent that the described embodiments are some, but not all embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

At present, modeling the relationships among labels with a graph is very limited. To overcome this limitation, neural network methods have been adopted to model the relationships among labels. However, most of these methods use an attention mechanism and improve accuracy on the basis of single-label classification, so they do not genuinely model the relationships among labels, and the multi-label image recognition accuracy needs to be further improved.

In view of the above problem of how to model the relationship between labels in the multi-label image classification problem, it was found that some labels have strong correlation, such as sky and white cloud, and some labels have weak or even negative correlation, such as polar bear and penguin (unless they can occur simultaneously in the zoo). Therefore, the probability that some labels appear in the images at the same time is high, other labels basically do not appear in one image at the same time, and for the labels which do not appear in one image at the same time, the label combination does not need to be considered, so that the combination number of the predicted labels is reduced, and the multi-label image classification precision is improved.

Based on the above research thought, assuming that the labels appearing in the same image at the same time have correlation, the multi-label image classification method, apparatus, and electronic device provided in the embodiments of the present invention, on one hand, obtain the first feature image by using the convolution layer, and then perform classification by using the pooling layer and the full link layer, so as to obtain the first label classification prediction result. On the other hand, according to the parameters of the full connection layer and the first characteristic image, performing characteristic filtering on the first characteristic image to obtain a second characteristic image; and then performing pooling processing on the second characteristic image to obtain a second label classification prediction result. And finally, comprehensively considering the first label classification prediction result and the second label classification prediction result to obtain a target label classification prediction result. The method classifies labels from two aspects, obtains a second feature map bearing the relationship between labels based on a metric learning algorithm, and corrects a first label classification prediction result by using a second label classification prediction result obtained from the second feature map, so that the number of label combinations is reduced, and the multi-label image identification precision is assisted to be improved.

In one embodiment, the multi-label image classification method is implemented by a network model as shown in fig. 1. The network model comprises two sub-networks: a main network, used for extracting image features and generating a first label classification prediction result, and a metric learning network, used for applying constraints to the main network and assisting the main network in correcting and enhancing the label classification result. During training, the metric learning network first applies a metric learning algorithm to model the relationships among the labels, so that during testing the feature map that carries the relationships among the labels can be used to improve the final multi-label image recognition precision. For ease of understanding the present embodiment, the multi-label image classification method disclosed in the present embodiment is first described in detail with reference to fig. 1.

Fig. 2 is a flowchart illustrating a multi-label image classification method according to an embodiment of the present invention. As shown in fig. 2, the multi-label image classification method includes:

in step S201, a first feature image of an image to be processed is extracted.

In the embodiment of the present invention, the image to be processed may be an image in a picture format such as bmp, jpg, or png that is uploaded by a user, an image captured by an image capture device such as a camera, or an image in a picture format downloaded by the user over a network.

In one embodiment, as shown in fig. 1, the image to be processed is passed through the convolutional layers of the main network, such as the deep convolutional network Resnet101, to obtain the corresponding convolutional features, i.e., the first feature image. For example, if the image to be processed is 448 × 448, a first feature image with dimensions 2048 × 14 × 14 is obtained through the Resnet101 network, and the first feature image contains information on all labels in the image to be processed. For convenience of the following description, the first feature image is denoted z_i, where the index i denotes the i-th image input into the main network.

In practical applications, the features of the image to be processed may also be extracted in other ways to obtain the first feature image, for example with a VGG network or an Inception network.

Step S202, the first characteristic image is processed by a pooling layer and a full-link layer in sequence to obtain a first label classification prediction result.

To address the problem that small-object information is easily lost with Global Average Pooling (GAP), in an embodiment Global Max Pooling (GMP) is used to perform dimension reduction on the first feature image. On this basis, step S202 includes: inputting the first feature image into a pooling layer and pooling it with a global max pooling function to obtain a dimension-reduced first feature image; and inputting the dimension-reduced first feature image into a fully connected (FC) layer for classification to generate the first label classification prediction result.

Continuing with the example in step S201, as shown in fig. 1, assuming that the image to be processed is 448 × 448, a first feature image with dimensions 2048 × 14 × 14 is obtained through the Resnet101 network. After processing by the global max pooling function, a feature vector of dimension 2048 is obtained. Assume that the total number of labels in the preset category of the network model is C (i.e., the total number of labels included in the training set used to train the network model), and that C is less than 2048. The parameters of the fully connected layer are then configured according to the total number of labels of the preset category and the dimension of the first feature image after dimension reduction, so that the fully connected layer has dimensions 2048 × C. Finally, the 2048-dimensional feature vector is passed through the 2048 × C fully connected layer to obtain the first label classification prediction result, which is a set of C values, each representing the prediction result for one category.

Assuming that C is 3, the first label classification prediction result is a set of 3 values, for example {a1, a2, a3}, where a1 represents the prediction result for the first category, a2 the prediction result for the second category, and a3 the prediction result for the third category.
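As an illustration of step S202, the following is a minimal sketch of the main-network branch (global max pooling followed by a fully connected layer). It uses toy sizes instead of 2048 × 14 × 14, stores the weights as C rows of length D (the transpose of the 2048 × C layout in the text), omits any bias term, and uses random numbers in place of real convolutional features; the function names are hypothetical and all of these choices are assumptions made for the example.

```python
import random


def global_max_pool(feature_map):
    """Reduce a D x H x W feature map to a D-dimensional vector (max per channel)."""
    return [max(max(row) for row in channel) for channel in feature_map]


def fully_connected(vector, weights):
    """Apply C weight rows of length D to a D-dimensional vector, giving C confidences."""
    return [sum(w * v for w, v in zip(row, vector)) for row in weights]


random.seed(0)
D, height, width, C = 8, 4, 4, 3  # toy sizes; the text uses D=2048, height=width=14

# Stand-in for the first feature image z_i (random values, not real features).
z_i = [[[random.random() for _ in range(width)] for _ in range(height)]
       for _ in range(D)]
# Stand-in for the fully connected parameters W, one row per label.
W = [[random.uniform(-1.0, 1.0) for _ in range(D)] for _ in range(C)]

pooled = global_max_pool(z_i)        # D values, one per channel
y_pcls = fully_connected(pooled, W)  # C values, one confidence per label
print(len(pooled), len(y_pcls))
```

With C = 3, `y_pcls` plays the role of the set {a1, a2, a3} in the example above.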

It should be noted that the number of the labels of the preset category in the embodiment of the present invention may be 1, and may also be greater than 1. When the number of the labels of the preset category is greater than 1, the effect of the multi-label image classification method in the embodiment of the invention is more obvious.

Step S203, obtaining a second characteristic image according to the parameters of the full connection layer and the first characteristic image, wherein the second characteristic image comprises a sub-characteristic image corresponding to each label of a preset category.

For example, the preset type of label (hereinafter referred to as a label in the training set) includes a blue sky, a white cloud, a puppy, and a kitten, and the second feature map includes a sub-feature image corresponding to the blue sky, a sub-feature image corresponding to the white cloud, a sub-feature image corresponding to the puppy, and a sub-feature image corresponding to the kitten.

The step S203 includes: and multiplying the first characteristic image by the parameters of the full connection layer to obtain a second characteristic image.

Still taking fig. 1 as an example, the parameter W of the fully connected layer is first multiplied by the first feature map, where the dimension of W is 2048 × C. The parameter W may be understood as a filter that extracts the information required by each label category from the first feature map and filters out information unrelated to each label category. As can be seen from fig. 1, the dimension of the second feature map is C × 14 × 14, where C represents the total number of labels of the preset category of the network model; for example, when the training set used to train the network model contains 80 labels in total, C is 80. The second feature image includes a sub-feature image corresponding to each label of a preset category. The second feature image may also be referred to as a multi-label activation map, and the sub-feature image corresponding to a label may also be referred to as the activation map corresponding to that label, for example the activation map corresponding to the label sky.

In a possible embodiment, A_c may be defined as the activation map corresponding to label c: A_c = W^c z_i, where W^c represents the parameters corresponding to label c in the parameter W, with dimension 1 × 2048.

Specifically, in the process of multiplying the parameter W of the fully connected layer by the first feature map, the first feature map is first straightened (flattened) to 2048 × 196, the parameter W is then multiplied by it to obtain a C × 196 result, and finally the C × 196 result is reshaped to C × 14 × 14.
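The flatten, multiply, and reshape sequence of step S203 can be sketched as follows. This is only an illustrative sketch on a tiny 2 × 2 × 2 feature map with three labels; the function name `class_activation_maps` and the row-per-label weight layout are assumptions made for the example.

```python
def class_activation_maps(feature_map, weights):
    """Multiply C weight rows (length D) with a D x H x W feature map.

    Returns a C x H x W list: one activation map (sub-feature image) per label.
    """
    depth = len(feature_map)
    height = len(feature_map[0])
    width = len(feature_map[0][0])
    # Straighten each channel: D x H x W -> D x (H*W)
    flat = [[v for row in channel for v in row] for channel in feature_map]
    # (C x D) times (D x H*W) -> C x (H*W)
    product = [
        [sum(w_row[d] * flat[d][p] for d in range(depth))
         for p in range(height * width)]
        for w_row in weights
    ]
    # Reshape each label's row back to H x W
    return [[row[r * width:(r + 1) * width] for r in range(height)]
            for row in product]


# Toy first feature map with D=2 channels of size 2 x 2.
z = [[[1, 2], [3, 4]],
     [[10, 20], [30, 40]]]
# Toy parameters for C=3 labels: the first label reads channel 0, the second
# reads channel 1, and the third reads the sum of both channels.
W = [[1, 0], [0, 1], [1, 1]]

A = class_activation_maps(z, W)
print(A[2])  # activation map of the third label
```

In this toy case each label's weight row selects or mixes channels, which is the sense in which W acts as a per-label filter on the first feature map.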

It should be noted that the parameters of the convolutional layer and the parameters of the fully-connected layer in fig. 1 are optimized based on a metric learning algorithm in the training process.

And step S204, performing pooling processing on the second characteristic image to obtain a second label classification prediction result.

As a possible embodiment, in the metric learning network, referring to fig. 1, the optimized second feature image is subjected to pooling layer processing based on a global maximum pooling function, so as to obtain a second label classification prediction result.
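A minimal sketch of step S204 follows, assuming the second feature image is held as a C × H × W nested list; the values are made up for illustration and the function name is hypothetical.

```python
def global_max_pool(maps):
    """Take the maximum activation of each label's H x W map, giving C confidences."""
    return [max(max(row) for row in label_map) for label_map in maps]


# Toy second feature image with C=2 labels and 2 x 2 activation maps.
second_feature_image = [
    [[0.1, 0.9], [0.3, 0.2]],      # activation map of label 0
    [[-1.0, -0.5], [-0.2, -0.7]],  # activation map of label 1
]

y_scls = global_max_pool(second_feature_image)
print(y_scls)
```

Each entry of `y_scls` is the strongest response of the corresponding label anywhere in its activation map.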

And S205, obtaining a target label classification prediction result according to the first label classification prediction result and the second label classification prediction result.

In a possible embodiment, the sum of the first label classification prediction result and the second label classification prediction result is used as a target label classification prediction result, so that the first label classification prediction result obtained by the main network is corrected by using the second label classification prediction result obtained by the metric learning network, and the accuracy of multi-label classification identification is assisted to be enhanced.

Specifically, the first label classification prediction result and the second label classification prediction result may be expressed as confidence degrees. Referring to fig. 1, the target label classification prediction result \hat{y} can be expressed as:

\hat{y} = y_pcls + y_scls (1)

where y_pcls represents the first label classification prediction result and y_scls represents the second label classification prediction result.
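The fusion in step S205 is an element-wise sum of the two confidence vectors. The sketch below uses made-up confidences for three labels; the function name `fuse_predictions` is hypothetical.

```python
def fuse_predictions(y_pcls, y_scls):
    """Element-wise sum of the main-network and metric-learning-network confidences."""
    if len(y_pcls) != len(y_scls):
        raise ValueError("both predictions must cover the same C labels")
    return [p + s for p, s in zip(y_pcls, y_scls)]


y_pcls = [1.5, -0.5, 0.25]  # first label classification prediction (toy values)
y_scls = [0.5, -1.0, 0.25]  # second label classification prediction (toy values)

y_target = fuse_predictions(y_pcls, y_scls)
print(y_target)
```

Here the second prediction reinforces the first label, further suppresses the second, and slightly raises the third.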

The method carries out label classification from two aspects, obtains a second feature map bearing the relationship between labels based on a metric learning algorithm, and then corrects the first label classification prediction result by using a second label classification prediction result obtained from the second feature map, so that the number of label combinations is reduced, and the multi-label image identification precision is assisted to be improved.

In a possible embodiment, the method further comprises:

(a1) Determining a first loss function according to the target label classification prediction result.

In particular, the first loss function may be expressed as:

L_cls = -\sum_{i=1}^{N} \sum_{c=1}^{C} [ y_i^c \log(p_i^c) + (1 - y_i^c) \log(1 - p_i^c) ] (2)

where N is the number of images in the image batch; y_i^c is an indicator of whether the i-th image in the image batch contains the c-th label, taking the value 0 or 1: if the i-th image contains the c-th label, y_i^c is 1, and if the i-th image does not contain the c-th label, y_i^c is 0; \hat{y}_i^c represents the confidence of the prediction corresponding to the c-th label of the i-th image in the image batch (namely the target label classification prediction result of the c-th label of the i-th image); and p_i^c represents the probability of the prediction corresponding to the c-th label of the i-th image in the image batch. The confidence can be processed by a sigmoid function to obtain the corresponding probability.
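The following sketch shows how a loss of this kind could be computed, assuming the first loss function takes the standard binary cross-entropy form over sigmoid probabilities and is summed (not averaged) over the batch. The function names, the small `eps` guard, and the toy inputs are assumptions made for the example.

```python
import math


def sigmoid(x):
    """Numerically stable logistic function mapping a confidence to a probability."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def classification_loss(confidences, targets, eps=1e-12):
    """Summed binary cross-entropy; confidences and targets are N x C nested lists."""
    total = 0.0
    for conf_row, y_row in zip(confidences, targets):
        for conf, y in zip(conf_row, y_row):
            p = sigmoid(conf)
            total -= y * math.log(p + eps) + (1 - y) * math.log(1.0 - p + eps)
    return total


# One image, two labels: the image contains label 0 but not label 1.
confidences = [[0.0, 0.0]]  # zero confidence gives probability 0.5 for each label
targets = [[1, 0]]

loss = classification_loss(confidences, targets)
print(loss)
```

With both probabilities at 0.5, each label contributes log 2 to the loss, so the total is 2 log 2.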

(a2) Determining a final loss function according to a preset metric learning loss function and the first loss function.

In a possible embodiment, the final loss function is a weighted sum of the first loss function and the metric learning loss function, and may be specifically represented as:

L = L_cls + αL_dis (3)

where L_cls is the first loss function, L_dis is the metric learning loss function, and α is a hyper-parameter with value range α ∈ [0, 1].
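Equation (3) can be written as a small helper. The sketch below only combines two already-computed loss values; the function name `final_loss` and the numbers used are illustrative assumptions.

```python
def final_loss(l_cls, l_dis, alpha):
    """Combine the first loss and the metric learning loss as L = L_cls + alpha * L_dis."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha is expected to lie in [0, 1]")
    return l_cls + alpha * l_dis


total = final_loss(l_cls=1.2, l_dis=0.5, alpha=0.5)
print(total)
```

Setting `alpha` to 0 recovers the classification loss alone, while larger values give the distance term more influence during training.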

In a possible embodiment, the metric learning loss function is set based on a metric learning algorithm, and is a function of distance.

Metric learning learns the similarity between samples through a distance function. When the similarity between two images needs to be calculated, the goal of metric learning is to find a measure under which the similarity between images of different categories is small and the similarity between images of the same category is large.

In this embodiment, the relationship between the labels is modeled based on the metric learning algorithm, and then the network model shown in fig. 1 is trained. The following describes the training process.

The method further comprises the following steps: in the training process, parameters of the full connection layer and parameters of the convolution layer are optimized based on the metric learning loss function and the correlation among the labels.

Specifically, the parameters of the fully-connected layer and the parameters of the convolutional layer are optimized, and the purpose is to improve the second feature diagram, so that the second feature diagram can bear the relationship among the labels, realize the modeling of the relationship among the labels, and then use the second feature diagram to constrain the first prediction classification result, so as to assist in improving the final classification result.

In a possible embodiment, a metric learning algorithm is utilized, and the correlation among the labels is embodied through the distance among the sub-feature graphs corresponding to the labels, so that the relational modeling is realized. Based on this, parameters of the full connection layer and parameters of the convolution layer are optimized, and the parameters specifically include:

(b1) Mapping the sub-feature images corresponding to each label into a preset space based on a metric learning algorithm, and respectively calculating the distance between each sub-feature image in the preset space.

Specifically, using the metric learning algorithm requires determining a matrix M, so that the sub-feature image, i.e., the activation map, corresponding to each label is mapped into the preset space T. Let A_j and A_k denote the sub-feature images corresponding to any two different labels, and let a_j and a_k denote the corresponding one-dimensional vectors after straightening, where j ≠ k. Then, the distance between any two sub-feature images can be expressed as:

d_M(a_j, a_k) = \sqrt{(a_j - a_k)^T M (a_j - a_k)} (4)

Because d_M(a_j, a_k) represents a distance, it must satisfy non-negativity, symmetry, and the triangle inequality, so M is a positive semi-definite matrix and can be decomposed as:

M = B^T B (5)

Thus, according to equation (5), equation (4) can be rewritten as:

d_M(a_j, a_k) = \sqrt{(B a_j - B a_k)^T (B a_j - B a_k)} = \|B a_j - B a_k\|_2 (6)

where B can be modeled and learned by a neural network. Therefore, referring to fig. 1, the sub-feature image corresponding to each label can be mapped into the preset space T through the matrix B, and the distance between the sub-feature images in the preset space, that is, the Euclidean distance, is calculated.
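To make the relationship between the matrix M and the mapping B concrete, the sketch below checks numerically that a distance defined through M = B^T B equals the Euclidean distance between the mapped vectors B a_j and B a_k. The matrix B and the vectors are small hand-picked examples, and the function names are hypothetical; in the method itself B would be learned by the network.

```python
import math


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def gram(B):
    """Return M = B^T B for a matrix B given as a list of rows."""
    cols = len(B[0])
    return [[sum(row[r] * row[c] for row in B) for c in range(cols)]
            for r in range(cols)]


def distance_via_m(a_j, a_k, M):
    """sqrt((a_j - a_k)^T M (a_j - a_k))."""
    diff = [x - y for x, y in zip(a_j, a_k)]
    return math.sqrt(sum(d * md for d, md in zip(diff, matvec(M, diff))))


def distance_via_b(a_j, a_k, B):
    """Euclidean distance between the mapped vectors B a_j and B a_k."""
    u, v = matvec(B, a_j), matvec(B, a_k)
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))


B = [[1.0, 2.0, 0.0],
     [0.0, 1.0, -1.0]]  # maps 3-dimensional vectors into a 2-dimensional space
M = gram(B)

a_j = [1.0, 0.0, 2.0]   # straightened sub-feature image of label j (toy values)
a_k = [0.0, 1.0, 1.0]   # straightened sub-feature image of label k (toy values)

d_m = distance_via_m(a_j, a_k, M)
d_b = distance_via_b(a_j, a_k, B)
print(d_m, d_b)
```

Both computations give the square root of 5 for these inputs, which illustrates why only B needs to be learned rather than M itself.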

(b2) And optimizing parameters of the full-connection layer and parameters of the convolution layer by using a back propagation algorithm based on the metric learning loss function and the correlation among the labels so as to adjust the distance of each sub-feature image in a preset space.

In order to realize the purpose of embodying the correlation among the labels by using the distance, in the preset space T, the distance of the sub-feature image corresponding to the label with strong correlation is shortened, and the distance of the sub-feature image corresponding to the label with weak correlation or even the sub-feature image corresponding to the label with negative correlation is lengthened, so that the labels with strong correlation are gathered together. In this embodiment, it is assumed that the labels appearing in the same image to be detected have correlation, and the distance between the sub-feature images corresponding to the labels in the image to be detected is shortened, but the distance between the sub-feature images not belonging to the labels in the image to be detected is lengthened. Based on the above mode, in the training process, each image in the training set is trained, so that modeling of the relation between the labels is realized.

Based on this, step (b2) includes: among the sub-feature images corresponding to the labels, taking the sub-feature images corresponding to labels that belong to the currently input image as correlation images, and taking the sub-feature images corresponding to labels that do not belong to the currently input image as non-correlation images; then shortening the distance between the correlation images in the preset space, and pushing the non-correlation images away so that they lie far from the correlation images in the preset space.

During training, the labels of each input image are known from its annotation. For example, if the labels in the training set are puppy, kitten, sky, and white cloud, and the input image carries only the labels sky and white cloud, then the sub-feature images corresponding to sky and white cloud are taken as correlation images, and the distance between them in the preset space is reduced. The sub-feature images corresponding to puppy and kitten are taken as non-correlation images, and their distance in the preset space from the correlation images (the sub-feature images corresponding to sky and white cloud) is increased.
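The selection of correlation and non-correlation images in this example can be sketched as follows. This is only an illustration of the bookkeeping; the function name and the representation of sub-feature images by their label names are assumptions made for the sketch.

```python
# Sketch: split the labels of the training set into those whose sub-feature
# images are "correlation images" for the current input image and those whose
# sub-feature images are "non-correlation images", then list which pairs
# should be pulled together and which should be pushed apart.
from itertools import combinations

def split_labels(all_labels, image_labels):
    """Return (correlated, uncorrelated) label lists for one input image."""
    present = set(image_labels)
    correlated = [label for label in all_labels if label in present]
    uncorrelated = [label for label in all_labels if label not in present]
    return correlated, uncorrelated

all_labels = ["puppy", "kitten", "sky", "white cloud"]
image_labels = ["sky", "white cloud"]

correlated, uncorrelated = split_labels(all_labels, image_labels)

# Pairs of correlation images: their distance in the preset space is shortened.
pull_pairs = list(combinations(correlated, 2))
# (correlation, non-correlation) pairs: their distance is lengthened.
push_pairs = [(c, u) for c in correlated for u in uncorrelated]
```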

After the metric learning is completed, back propagation is performed based on the final loss function to optimize the parameters of the fully-connected layer and the parameters of the convolutional layer. To achieve the distance adjustment, a specific implementation constructs a metric learning loss function and uses the back-propagation algorithm to optimize the final loss function, in particular the metric learning loss term it contains. This adjusts the distances in the preset space T between the sub-feature images corresponding to the labels and thereby improves the second feature image, so that the second feature image carries the correlation among the labels.

Accordingly, under the assumption that the labels within one image are correlated, the constructed metric learning loss function should have the following property: the closer the sub-feature images corresponding to labels of the same image are to one another, and the farther the sub-feature images corresponding to labels not belonging to that image are from them, the smaller the metric learning loss value that is finally calculated.

In a possible embodiment, the metric learning loss function is expressed in particular as equation (7):

where a' = f_flat(f_B(A)), (8)

Here f_B(A) denotes the sub-feature image in the preset space T obtained by transforming the sub-feature image A corresponding to a label with the matrix B, and a' denotes the one-dimensional feature vector obtained by flattening f_B(A). N denotes the number of images (serving as images to be detected) in an image batch (a sample set drawn from the training set) that is input into the main network at one time; i is the index of an image in the batch; j is the index of a sub-feature image corresponding to a label belonging to the i-th image; and k is the index of a sub-feature image corresponding to a training-set label that does not belong to the i-th image. S denotes the set of sub-feature images corresponding to labels belonging to the i-th image, and C' is the size of S, namely the number of sub-feature images corresponding to labels belonging to the i-th image. The complement of S denotes the set of sub-feature images corresponding to the training-set labels that do not belong to the i-th image.
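The exact form of equation (7) is not reproduced in this text, so the following is only an illustrative sketch of a loss that has the property described above: it decreases as the mapped vectors a' of labels in S move together and as those of labels outside S move away. The hinge margin, the averaging scheme, and all names are assumptions of the sketch and should not be read as the loss function of the embodiment.

```python
# Sketch of a pull/push metric learning loss over one batch.
# Assumptions (not from the source): squared Euclidean distance in the preset
# space, a hinge with a hypothetical margin for the push term, and simple
# averaging over pairs and over the N images of the batch.

def sq_dist(u, v):
    """Squared Euclidean distance between two flattened vectors."""
    return sum((x - y) ** 2 for x, y in zip(u, v))

def image_metric_loss(mapped, labels_in_image, margin=1.0):
    """mapped: dict label -> flattened vector a' in the preset space T.
    labels_in_image: labels belonging to the current image (the set S)."""
    in_s = [l for l in mapped if l in labels_in_image]
    not_in_s = [l for l in mapped if l not in labels_in_image]
    # Pull term: distances between sub-feature images of labels in S.
    pull = [sq_dist(mapped[j], mapped[k]) for j in in_s for k in in_s if j != k]
    # Push term: penalise labels outside S that lie within the margin.
    push = [max(0.0, margin - sq_dist(mapped[j], mapped[k]))
            for j in in_s for k in not_in_s]
    pull_term = sum(pull) / len(pull) if pull else 0.0
    push_term = sum(push) / len(push) if push else 0.0
    return pull_term + push_term

def batch_metric_loss(batch, margin=1.0):
    """batch: list of (mapped, labels_in_image) pairs, one per image."""
    return sum(image_metric_loss(m, s, margin) for m, s in batch) / len(batch)

labels = {"sky", "white cloud"}
# Correlated labels close together, uncorrelated label far away.
good = {"sky": [0.0, 0.0], "white cloud": [0.1, 0.0], "puppy": [3.0, 3.0]}
# Correlated labels far apart, uncorrelated label close to one of them.
bad = {"sky": [0.0, 0.0], "white cloud": [3.0, 3.0], "puppy": [0.1, 0.0]}

good_loss = batch_metric_loss([(good, labels)])
bad_loss = batch_metric_loss([(bad, labels)])
```

In this toy case the well-arranged embedding yields a much smaller loss than the badly arranged one, which is the behaviour the text requires of the metric learning loss.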

By applying the metric learning algorithm, the parameters of the network model are optimized through back propagation of the final loss function, in particular its metric learning loss term. This adjusts the distances in the preset space T between the sub-feature images (the activation maps of the labels) and thereby improves the second feature image, that is, the multi-label activation map. As a result, the activation maps corresponding to correlated labels can be activated together at test time, because the activation maps corresponding to labels belonging to the same image have been drawn close to one another in the preset space T.

The multi-label activation map is thus optimized based on the correlation among the labels, so that it contains information about that correlation. This models the relationships among the labels, and the multi-label activation map is then applied as a constraint to improve classification accuracy. For example, suppose the labels of an image include moon, sky, and white cloud. Originally, because the moon occupies too small a proportion of the image, the neural network cannot capture that label well. After the metric learning algorithm is applied, the distances between the activation maps corresponding to moon, sky, and white cloud are reduced, so the neural network treats moon, white cloud, and sky as correlated and activates the activation map corresponding to the moon label in the multi-label activation map. The result is an optimized second feature image, that is, a second feature image that carries the relationships among the labels.

In the embodiment provided by the invention, on the one hand, a first feature image is obtained with the convolutional layer and is then classified with the pooling layer and the fully-connected layer to obtain a first label classification prediction result. On the other hand, feature filtering is performed on the first feature image according to the parameters of the fully-connected layer and the first feature image to obtain a second feature image, and pooling is then performed on the second feature image to obtain a second label classification prediction result. Finally, the first and second label classification prediction results are considered together to obtain a target label classification prediction result. The method thus classifies labels from two aspects: it obtains, based on a metric learning algorithm, a second feature image that carries the relationships among the labels, and it uses the second label classification prediction result obtained from that second feature image to correct the first label classification prediction result. This reduces the number of label combinations and helps improve the precision of multi-label image recognition.

To show the beneficial effects of the multi-label image classification method of the embodiment of the present invention more intuitively, the multi-label classification accuracy of the method on MS-COCO, a large-scale image data set regarded as authoritative in the industry, is compared with that of existing methods, as shown in Table 1:

TABLE 1

To better assess the effectiveness of the method, thirteen metrics are provided in Table 1 as measurement standards: OP (overall precision), OR (overall recall), OF1 (overall F1), CP (per-class precision), CR (per-class recall), CF1 (per-class F1), and mAP (mean average precision); in addition, CP/top3, CR/top3, CF1/top3, OP/top3, OR/top3, and OF1/top3 denote the CP, CR, CF1, OP, OR, and OF1 calculated from the three categories with the best prediction results. For every metric in Table 1, larger is better. The calculation formula for each metric in Table 1 is as follows.

where D denotes the number of categories to be predicted and e denotes the category index; the three counts appearing in the formulas denote, for the e-th category, the number of correct predictions, the number of predictions made, and the total number of ground-truth occurrences, respectively.
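As a worked illustration, the six precision, recall, and F1 metrics can be computed from the three per-category counts using the definitions commonly used in the multi-label classification literature. The formulas below follow those common definitions and are an assumption as far as this text is concerned; the function name and the toy counts are hypothetical.

```python
# Sketch: overall (OP, OR, OF1) and per-class (CP, CR, CF1) metrics computed
# from per-category counts. Assumes every category has at least one prediction
# and at least one ground-truth occurrence, so no division by zero occurs.

def multilabel_metrics(n_correct, n_predicted, n_truth):
    """Each argument is a list with one count per category (length D)."""
    d = len(n_correct)
    op = sum(n_correct) / sum(n_predicted)                # overall precision
    o_r = sum(n_correct) / sum(n_truth)                   # overall recall
    of1 = 2 * op * o_r / (op + o_r)                       # overall F1
    cp = sum(c / p for c, p in zip(n_correct, n_predicted)) / d   # per-class precision
    cr = sum(c / g for c, g in zip(n_correct, n_truth)) / d       # per-class recall
    cf1 = 2 * cp * cr / (cp + cr)                         # per-class F1
    return {"OP": op, "OR": o_r, "OF1": of1, "CP": cp, "CR": cr, "CF1": cf1}

# Toy counts for D = 2 categories (illustrative only, not experimental data).
metrics = multilabel_metrics(n_correct=[8, 3], n_predicted=[10, 6], n_truth=[16, 4])
```

The overall metrics pool the counts over all categories before dividing, so frequent categories dominate them, whereas the per-class metrics average the per-category ratios and weight every category equally.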

In addition, WARP is from the paper "Deep convolutional ranking for multilabel image annotation"; CNN-RNN (Convolutional Neural Network-Recurrent Neural Network) is from the paper "CNN-RNN: A unified framework for multi-label image classification"; RLSD is from the paper "Multi-label image classification with regional latent semantic dependencies"; RNN-Attention is from the paper "Multi-label image recognition by recurrently discovering attentional regions"; RNN-Reinforcement is from the paper "Recurrent attentional reinforcement learning for multi-label image recognition"; Order-Free RNN is from the paper "Order-free RNN with visual attention for multi-label classification"; SRN is from the paper "Learning spatial regularization with image-level supervisions for multi-label image classification"; and Multi-Evidence is from the paper "Multi-evidence filtering and fusion for multi-label classification, object detection and semantic segmentation based on weakly supervised learning". Table 1 was obtained by comparing the method of the present embodiment with the method of each paper.

Of these metrics, OF1 and CF1 are important, and mAP is the most important. Table 1 shows that the OF1, CF1, and mAP values obtained by the multi-label image classification method of the embodiment of the present invention are all the highest among the compared prior-art methods, so the method can effectively improve the accuracy of multi-label image classification compared with the prior art.

Corresponding to the multi-label image classification method of the first embodiment, Fig. 3 shows a multi-label image classification apparatus whose modules correspond one-to-one to the steps of that method. The multi-label image classification apparatus comprises:

the first extraction module 11 is configured to extract a first feature image of an image to be processed;

the first prediction module 12 is configured to process the first feature image sequentially through a pooling layer and a fully-connected layer to obtain a first label classification prediction result;

the second extraction module 13 is configured to obtain a second feature image according to the parameters of the fully-connected layer and the first feature image, where the second feature image includes a sub-feature image corresponding to each label of a preset category;

the second prediction module 14 is configured to perform pooling processing on the second feature image to obtain a second label classification prediction result;

and the target prediction module 15 is configured to obtain a target label classification prediction result according to the first label classification prediction result and the second label classification prediction result.
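The data flow through modules 12 to 15 can be sketched as below, starting from a first feature image that module 11 is taken to have already extracted. The sketch assumes global average pooling in the first branch, class-activation-style weighting of the feature channels by the fully-connected weights for the second feature image, global max pooling in the second branch, and a plain average to combine the two results; these choices, the omission of bias and activation functions, and all names and values are assumptions for illustration and are not confirmed by this passage.

```python
# Sketch of the two-branch prediction flow on a tiny first feature image.

def global_avg_pool(feature):
    """feature: list of channels, each an H x W list of lists."""
    return [sum(sum(row) for row in ch) / (len(ch) * len(ch[0])) for ch in feature]

def fully_connected(weights, vec):
    """weights: one row of per-channel weights for each label."""
    return [sum(w * v for w, v in zip(row, vec)) for row in weights]

def second_feature_image(weights, feature):
    """One sub-feature image per label: channels weighted by that label's FC row."""
    h, w = len(feature[0]), len(feature[0][0])
    return [[[sum(wd * ch[y][x] for wd, ch in zip(row, feature))
              for x in range(w)] for y in range(h)] for row in weights]

def global_max_pool(maps):
    return [max(max(row) for row in m) for m in maps]

def classify(feature, weights):
    first = fully_connected(weights, global_avg_pool(feature))   # module 12
    sub_images = second_feature_image(weights, feature)          # module 13
    second = global_max_pool(sub_images)                         # module 14
    target = [(a + b) / 2 for a, b in zip(first, second)]        # module 15
    return first, second, target

# Hypothetical first feature image (2 channels, 2 x 2) and FC weights (2 labels).
feature = [[[1.0, 0.0], [0.0, 1.0]],
           [[0.0, 2.0], [0.0, 0.0]]]
weights = [[1.0, 0.0],
           [0.0, 1.0]]

first, second, target = classify(feature, weights)
```

With these choices the second branch responds to the strongest local evidence for a label, while the first branch reflects the average evidence over the whole image, so the two results can differ and the combination can correct the first.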

Further, referring to fig. 4, a function determining module 16 is further included for:

determining a first loss function according to the target label classification prediction result;

Further, a model training module 17 is included for:

in the training process, optimizing the parameters of the fully-connected layer and the parameters of the convolutional layer based on the metric learning loss function and the correlation among the labels.

Further, the model training module 17 is further configured to:

optimize the parameters of the fully-connected layer and the parameters of the convolutional layer with a back-propagation algorithm, based on the metric learning loss function and the correlation among the labels, so as to adjust the distances between the sub-feature images in the preset space.

Further, the model training module 17 is further configured to:

among the sub-feature images corresponding to the labels, take the sub-feature images corresponding to labels that belong to the currently input image as correlation images, and take the sub-feature images corresponding to labels that do not belong to the currently input image as non-correlation images; and shorten the distance between the correlation images in the preset space, and push the non-correlation images away so that they lie far from the correlation images in the preset space.

Referring to fig. 5, an embodiment of the present invention further provides an electronic device 100, including: a processor 40, a memory 41, a bus 42 and a communication interface 43, wherein the processor 40, the communication interface 43 and the memory 41 are connected through the bus 42; the processor 40 is arranged to execute executable modules, such as computer programs, stored in the memory 41.

The memory 41 may include a high-speed random access memory (RAM) and may also include a non-volatile memory, such as at least one disk memory. The communication connection between the network element of the system and at least one other network element is realized through at least one communication interface 43 (which may be wired or wireless), and may use the Internet, a wide area network, a local area network, a metropolitan area network, or the like.

The bus 42 may be an ISA bus, PCI bus, EISA bus, or the like. The bus may be divided into an address bus, a data bus, a control bus, etc. For ease of illustration, only one double-headed arrow is shown in FIG. 5, but this does not indicate only one bus or one type of bus.

The memory 41 is used to store a program, and the processor 40 executes the program after receiving an execution instruction. The method performed by the apparatus defined by the flow disclosed in any of the foregoing embodiments of the present invention may be applied to, or implemented by, the processor 40.

The processor 40 may be an integrated circuit chip having signal processing capabilities. In implementation, the steps of the above method may be performed by integrated logic circuits of hardware in the processor 40 or by instructions in the form of software. The processor 40 may be a general-purpose processor, including a Central Processing Unit (CPU), a Network Processor (NP), and the like; it may also be a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component. The processor may implement or perform the methods, steps, and logic blocks disclosed in the embodiments of the present invention. A general-purpose processor may be a microprocessor or any conventional processor. The steps of the method disclosed in connection with the embodiments of the present invention may be performed directly by a hardware decoding processor, or by a combination of hardware and software modules in the decoding processor. The software module may be located in a storage medium well known in the art, such as RAM, flash memory, ROM, PROM, EPROM, or registers. The storage medium is located in the memory 41; the processor 40 reads the information in the memory 41 and completes the steps of the method in combination with its hardware.

The multi-label image classification device and the electronic equipment provided by the embodiment of the invention have the same technical characteristics as the multi-label image classification method provided by the embodiment, so that the same technical problems can be solved, and the same technical effects can be achieved.

The computer program product for performing the multi-label image classification method provided in the embodiment of the present invention includes a computer-readable storage medium storing a non-volatile program code executable by a processor, where instructions included in the program code may be used to execute the method described in the foregoing method embodiment, and specific implementation may refer to the method embodiment, which is not described herein again.

It can be clearly understood by those skilled in the art that, for convenience and brevity of description, the specific working processes of the apparatus and the electronic device described above may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.

The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of methods and computer program products according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.

Furthermore, the terms "first," "second," and "third" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance. Unless specifically stated otherwise, the relative steps, numerical expressions, and values of the components and steps set forth in these embodiments do not limit the scope of the present invention.

In the several embodiments provided in the present application, it should be understood that the disclosed apparatus and method may be implemented in other ways. The above-described embodiments of the apparatus are merely illustrative, and for example, the division of the units is only one logical division, and there may be other divisions when actually implemented, and for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection of devices or units through some communication interfaces, and may be in an electrical, mechanical or other form.

The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.

In addition, functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit.

The functions, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a non-volatile computer-readable storage medium executable by a processor. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and other various media capable of storing program codes.

Finally, it should be noted that the above embodiments are only specific embodiments of the present invention; they are used to illustrate the technical solutions of the present invention and not to limit them, and the protection scope of the present invention is not limited to them. Although the present invention has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that any person skilled in the art may, within the technical scope disclosed by the present invention, still modify the technical solutions described in the foregoing embodiments, readily conceive of changes to them, or make equivalent substitutions for some of their technical features. Such modifications, changes, or substitutions do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the embodiments of the present invention, and they shall be construed as being included in the protection scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims

1. A multi-label image classification method is characterized by comprising the following steps:

extracting a first characteristic image of an image to be processed;

obtaining a second characteristic image according to the parameters of the full connection layer and the first characteristic image, wherein the second characteristic image comprises a sub-characteristic image corresponding to each label of a preset category; wherein the parameters of the fully-connected layer and the parameters of the convolutional layer are optimized based on a metric learning algorithm;

obtaining a target label classification prediction result according to the first label classification prediction result and the second label classification prediction result;

further comprising:

wherein the metric learning loss function is set based on a metric learning algorithm;

further comprising:

in the training process, optimizing parameters of the full connection layer and parameters of the convolution layer based on the metric learning loss function and the correlation between the labels;

the optimizing parameters of the fully-connected layer and parameters of the convolutional layer based on the metric learning loss function and the correlation between the labels comprises:

2. The method of claim 1, wherein the adjusting the spacing of the sub-feature images in the preset space comprises:

in the sub-feature images corresponding to each label, the sub-feature images corresponding to the labels belonging to the currently input image are taken as correlation images, and the sub-feature images corresponding to the labels not belonging to the currently input image are taken as non-correlation images;

3. The method according to claim 1, wherein the processing the first feature image sequentially through a pooling layer and a fully-connected layer to obtain a first label classification prediction result comprises:

4. The method of claim 1, wherein obtaining a second feature image according to the parameters of the fully-connected layer and the first feature image comprises:

5. The method of claim 1, wherein obtaining a target label classification prediction result according to the first label classification prediction result and the second label classification prediction result comprises:

6. A multi-label image classification apparatus, comprising:

the target prediction module is used for obtaining a target label classification prediction result according to the first label classification prediction result and the second label classification prediction result;

further comprising:

the function determining module is used for determining a first loss function according to the target label classification prediction result;

further comprising:

the model training module is used for optimizing parameters of the full-connection layer and parameters of the convolutional layer based on the metric learning loss function and the correlation between the labels in the training process;

the model training module is further to: mapping the sub-feature images corresponding to each label into a preset space based on a metric learning algorithm, and respectively calculating the distance between each sub-feature image in the preset space;

7. An electronic device comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor implements the method of any of claims 1 to 5 when executing the computer program.

8. A computer-readable medium having non-volatile program code executable by a processor, wherein the program code causes the processor to perform the method of any of claims 1 to 5.