
CN113902948A - Fine-grained image classification method and system based on double-branch network


Info

Publication number
CN113902948A
Authority
CN
China
Prior art keywords: maximum, classified, double, target image, image
Prior art date
2021-10-09
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111175746.2A
Other languages
Chinese (zh)
Inventor
苗壮
赵勋
王家宝
李阳
张睿
许博
王亚鹏
杨利
赵昕昕
杨义鑫
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Army Engineering University of PLA
Original Assignee
Army Engineering University of PLA
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Application filed by Army Engineering University of PLA filed Critical Army Engineering University of PLA
Priority to CN202111175746.2A priority Critical patent/CN113902948A/en
Publication of CN113902948A publication Critical patent/CN113902948A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/243 Classification techniques relating to the number of classes
    • G06F18/2431 Multiple classes
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/25 Fusion techniques
    • G06F18/254 Fusion techniques of classification results, e.g. of results related to same input data
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Molecular Biology (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Probability & Statistics with Applications (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a fine-grained image classification method based on a double-branch network, which comprises the following steps: preprocessing a target image to be classified; inputting the preprocessed target image into a pre-trained non-maximum activation double-branch network and extracting the features of the target image to be classified; performing class prediction with a classifier based on the obtained target image features to be classified, to obtain the class prediction results of the target image features to be classified; and fusing the class prediction results by a preset fusion method to obtain the classification result of the target image to be classified. The method can solve the problems that convolutional neural network methods in the prior art cannot be extended to fine-grained image classification tasks based on Transformer-architecture networks and that attention mechanisms extract target features insufficiently.

Description

Fine-grained image classification method and system based on double-branch network
Technical Field
The invention relates to a fine-grained image classification method and system based on a double-branch network, and belongs to the technical field of computer vision.
Background
Fine-grained image classification belongs to the field of image classification tasks and differs from ordinary classification in that it aims to distinguish subclasses within a broad class, for example: different types of cars, birds, airplanes, etc. Such classification targets are characterized by large intra-class differences and small inter-class differences, so the key to classification lies in extracting fine features of the targets. At the present stage, fine-grained image classification methods mainly use an attention mechanism to perform maximum activation on the target to obtain effective local discriminative features, and lack extraction of non-maximum activation features; on the other hand, most existing fine-grained classification methods extract target features based on convolutional neural networks and lack consideration of designing the network framework and objective function from the perspective of the Transformer architecture, and existing convolutional neural network methods are difficult to extend directly to fine-grained image classification methods based on Transformer-architecture networks.
Disclosure of Invention
The invention aims to overcome the defects in the prior art and provides a fine-grained image classification method and system based on a double-branch network, which can solve the problems that convolutional neural network methods in the prior art cannot be extended to fine-grained image classification tasks based on Transformer-architecture networks and that attention mechanisms extract target features insufficiently. To achieve this purpose, the invention adopts the following technical scheme:
in a first aspect, the present invention provides a fine-grained image classification method based on a dual-branch network, including:
preprocessing a target image to be classified;
inputting the preprocessed target image into a pre-trained non-maximum activation double-branch network, and extracting the features of the target image to be classified; the non-maximum activation double-branch network comprises a non-maximum activation module and a homogeneous double-branch subnetwork, wherein the non-maximum activation module is used for outputting maximum activation features and non-maximum activation features, the output features are input into the homogeneous double-branch subnetwork, and the homogeneous double-branch subnetwork learns and outputs the target image features to be classified;
based on the obtained target image features to be classified, adopting a classifier to perform class prediction to obtain class prediction results of the target image features to be classified;
and fusing the category prediction results by adopting a preset fusion method to obtain the classification result of the target image to be classified.
With reference to the first aspect, further, the preprocessing the target image to be classified includes:
the target image to be classified is scaled to 600 pixels × 600 pixels, and a 448 pixel × 448 pixel image area is cropped centered on the image center.
With reference to the first aspect, further, the non-maximum activation double-branch network comprises an image preprocessing module, a backbone network feature extraction module, a non-maximum activation module and a homogeneous double-branch subnetwork.
With reference to the first aspect, further, the non-maximum activation double-branch network is trained by the following steps:
inputting an image for training into the non-maximum activation double-branch network; and, under the guidance of a preset target loss function, training the parameters of the non-maximum activation double-branch network with a stochastic gradient descent algorithm to obtain the optimal network parameters.
With reference to the first aspect, preferably, the image for training is obtained by: scaling the image to 600 pixels × 600 pixels, randomly cropping a 448 pixel × 448 pixel region, and horizontally flipping the image at random with probability 0.5.
With reference to the first aspect, further, the preset target loss function is:
L = L_CE + λ·L_DB    (1)
in formula (1), L denotes the target loss function; λ denotes the weight between L_CE and L_DB; L_CE denotes the cross-entropy classification target loss function, given by:
L_CE = -(1/B) Σ_{b=1}^{B} log[ exp(W_{y_b}^T f_b) / Σ_{j=1}^{C} exp(W_j^T f_b) ]    (2)
in formula (2), B denotes the number of input images; f_b denotes the feature of the b-th image; y_b denotes the real label of the b-th image, with y_b ∈ {1, 2, ..., C}; W_{y_b} denotes the weight parameter mapping feature f_b to the real label class y_b; W_j denotes the weight parameter mapping feature f_b to the j-th class; C is the total number of classes;
in formula (1), L_DB denotes the similarity-measure target loss function, given by:
L_DB = (1/B) Σ_{b=1}^{B} || S^b - diag(S^b) ||_F^2,  where S^b = F̂^b (F̂^b)^T    (3)
in formula (3), S^b denotes the similarity matrix between the features of the different branches; F̂^b denotes the spliced and standardized features F_m^b and F_n^b of the b-th image; (F̂^b)^T denotes F̂^b after transposition; diag(S^b) denotes extraction of the main-diagonal values of S^b; and B denotes the number of input images.
With reference to the first aspect, further, the calculation process of the non-maximum activation module includes:
F_m, F_n = NAM(F)    (4)
in formula (4), NAM(·) denotes the module calculation process; F denotes the feature input to the non-maximum activation module, satisfying F ∈ R^{B×L×C}, where B denotes the number of input images, L denotes the feature dimension, and C denotes the number of feature channels; F_m denotes the maximum activation feature output by the non-maximum activation module, satisfying F_m ∈ R^{B×L×C}; F_n denotes the non-maximum activation feature output by the non-maximum activation module, satisfying F_n ∈ R^{B×L×C}.
With reference to the first aspect, further, the maximum activation feature F_m output by the non-maximum activation module is calculated as follows:
the feature F input to the non-maximum activation module is divided equally into k groups along the 2nd dimension, the feature dimension L being required to be an integer multiple of the group number k, giving the i-th group of features F^i ∈ R^{B×(L/k)×C};
the weight matrix W_m^i ∈ R^{B×(L/k)} corresponding to each group of features F^i is calculated by the following formula:
W_m^i(b, l) = exp(A^i(b, l)) / Σ_{j=1}^{k} exp(A^j(b, l))    (5)
in formula (5), W_m^i(b, l) denotes the weight of the l-th dimension of the b-th image, the denominator normalizing the weights over the k blocks; A^i(b, l) denotes the element at position (b, l) in A^i, where A^i denotes the i-th group of feature vectors, given by:
A^i = σ(Conv(GAP(F^i)))    (6)
in formula (6), GAP(·) denotes a global average pooling operation, Conv(·) denotes a convolution operation, and σ(·) denotes the ReLU activation operation;
the weight matrix W_m^i is expanded along the channel dimension to weight the feature F^i, giving the weighted feature F_m^i; the k groups of features F_m^1, ..., F_m^k are spliced to obtain the maximum activation feature F_m.
With reference to the first aspect, further, the non-maximum activation feature F_n output by the non-maximum activation module is calculated as follows:
a maximum suppression operation is performed on each group of features:
W_n^i = Rank_{α,β}(W_m^i)    (7)
in formula (7), Rank_{α,β}(·) denotes a ranking suppression of the weight matrix W_m^i, in which the α largest values of W_m^i are scaled by β, where β ∈ [0, 1] denotes the degree of suppression of the maximum activation feature weights; W_n^i denotes the i-th group non-maximum activation feature weight matrix;
the weight matrix W_n^i is expanded along the channel dimension to weight the feature F^i, giving the weighted feature F_n^i; the k groups of features F_n^1, ..., F_n^k are spliced to obtain the non-maximum activation feature F_n.
With reference to the first aspect, further, fusing the class prediction results by a preset fusion method includes:
the class probability prediction results are denoted p_1, p_2 and p_3, corresponding respectively to the three features X_1, X_2 and X_3 output by the non-maximum activation double-branch network; X_1 denotes the feature of the target image to be classified containing the maximum activation feature, X_2 denotes the feature of the target image to be classified containing the non-maximum activation feature, and X_3 denotes the spliced feature, satisfying X_3 = Concat(X_1, X_2), where Concat(·) denotes a splicing operation along the feature channel dimension;
the prediction result is obtained by weighted-sum fusion, calculated by the following formula:
p̂_c = (1/M) Σ_{k=1}^{M} p_c^k    (8)
in formula (8), p̂_c denotes the prediction probability of the c-th class after weighted fusion, p_c^k denotes the probability of the c-th class output by the k-th path, C is the number of classes, and M denotes the total number of paths;
the classification result of the target image to be classified is the class corresponding to the maximum value in p̂ = [p̂_1, p̂_2, ..., p̂_C], calculated as ŷ = argmax_{c ∈ {1, ..., C}} p̂_c.
In a second aspect, the present invention provides a fine-grained image classification system based on a dual-branch network, including:
a preprocessing module: used for preprocessing a target image to be classified;
a feature extraction module: used for inputting the preprocessed target image into a pre-trained non-maximum activation double-branch network and extracting the features of the target image to be classified; the non-maximum activation double-branch network comprises a non-maximum activation module and a homogeneous double-branch subnetwork, wherein the non-maximum activation module is used for outputting maximum activation features and non-maximum activation features, the output features are input into the homogeneous double-branch subnetwork, and the homogeneous double-branch subnetwork learns and outputs the target image features to be classified;
a category prediction module: used for performing class prediction with a classifier based on the obtained target image features to be classified, to obtain the class prediction results of the target image features to be classified;
a fusion output module: used for fusing the class prediction results by a preset fusion method to obtain the classification result of the target image to be classified.
Compared with the prior art, the fine-grained image classification method and the fine-grained image classification system based on the double-branch network have the advantages that:
the method comprises the steps of preprocessing a target image to be classified; inputting the preprocessed target image into a pre-trained non-maximum activated double-branch network, and extracting to obtain the characteristics of the target image to be classified; the non-maximum activation double-branch network comprises a non-maximum activation module and a homogeneous double-branch subnetwork, wherein the non-maximum activation module is used for outputting maximum activation characteristics and non-maximum activation characteristics, the output characteristics are input into the homogeneous double-branch subnetwork, the homogeneous double-branch subnetwork learns and outputs target image characteristics to be classified; the method is improved on the basis of the Swin transducer deep neural network, and a non-maximum value activation module and an isomorphic double-branch subnetwork are introduced, so that more sufficient target discrimination area characteristics can be obtained;
based on the obtained target image features to be classified, a classifier is adopted to carry out class prediction, and a class prediction result of the target image features to be classified is obtained; fusing the category prediction results by adopting a preset fusion method to obtain a classification result of the target image to be classified; the method can solve the problem that the current attention mechanism feature extraction is insufficient, can solve the defect that the attention mechanism method based on the convolutional neural network in the prior art is difficult to be directly expanded to a fine-grained classification method based on a transform frame, and can improve the accuracy and robustness of a fine-grained classification result.
Drawings
Fig. 1 is a flowchart of a fine-grained image classification method based on a dual-branch network according to an embodiment of the present invention;
fig. 2 is a structural diagram of a non-maximum activated dual-branch network in a fine-grained image classification method based on a dual-branch network according to an embodiment of the present invention;
fig. 3 is a structural diagram of a non-maximum activation module in a fine-grained image classification method based on a dual-branch network according to an embodiment of the present invention;
fig. 4 is a structural diagram of a fine-grained image classification system based on a dual-branch network according to a second embodiment of the present invention.
Detailed Description
The invention is further described below with reference to the accompanying drawings. The following examples are only for illustrating the technical solutions of the present invention more clearly, and the protection scope of the present invention is not limited thereby.
Embodiment one:
as shown in fig. 1, an embodiment of the present invention provides a fine-grained image classification method based on a dual-branch network, including:
preprocessing a target image to be classified;
inputting the preprocessed target image into a pre-trained non-maximum activation double-branch network, and extracting the features of the target image to be classified; the non-maximum activation double-branch network comprises a non-maximum activation module and a homogeneous double-branch subnetwork, wherein the non-maximum activation module is used for outputting maximum activation features and non-maximum activation features, the output features are input into the homogeneous double-branch subnetwork, and the homogeneous double-branch subnetwork learns and outputs the target image features to be classified;
based on the obtained target image features to be classified, adopting a classifier to perform class prediction to obtain class prediction results of the target image features to be classified;
and fusing the category prediction results by adopting a preset fusion method to obtain the classification result of the target image to be classified.
As shown in fig. 1, the specific steps are as follows:
step 1: and preprocessing the target image to be classified.
The target image to be classified is scaled to a size of 600 pixels × 600 pixels, and an image area of 448 pixels × 448 pixels in size is cropped centering on the center of the image.
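For illustration, this preprocessing can be sketched with torchvision as follows; the interpolation mode, the placeholder file path and the normalization statistics (ImageNet defaults) are assumptions not specified above:

from PIL import Image
from torchvision import transforms

# Inference-time preprocessing: scale to 600 x 600, then center-crop 448 x 448.
eval_transform = transforms.Compose([
    transforms.Resize((600, 600)),
    transforms.CenterCrop(448),
    transforms.ToTensor(),
    # ImageNet statistics, an assumption of this sketch.
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

image = Image.open("target.jpg").convert("RGB")   # "target.jpg" is a placeholder path
x = eval_transform(image).unsqueeze(0)            # tensor of shape (1, 3, 448, 448)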
Step 2: constructing and training the non-maximum activation double-branch network.
The non-maximum activation double-branch network comprises an image preprocessing module, a backbone network feature extraction module, a non-maximum activation module and a homogeneous double-branch subnetwork.
Step 2.1: constructing the non-maximum activation double-branch network.
After the first three stages of the conventional Swin Transformer deep neural network, a non-maximum activation module is introduced, and the last stage of the Swin Transformer deep neural network is copied to construct a homogeneous double-branch subnetwork, finally forming the non-maximum activation double-branch network. The constructed network is shown in fig. 2.
Specifically, for the Swin Transformer deep neural network see Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin and Baining Guo, Swin Transformer: Hierarchical Vision Transformer using Shifted Windows, [DB/OL] [2021-04-08] https://arxiv.org/abs/2103.14030.
As shown in fig. 3, the non-maximum activation module processes the output characteristics of the first three stages to obtain the maximum activation characteristics and the non-maximum activation characteristics, and sends the maximum activation characteristics and the non-maximum activation characteristics to the homogeneous dual-branch subnetwork for processing.
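As a structural illustration only, this construction can be sketched in PyTorch as below; the attribute name swin_backbone.stages, the NAM interface, and the omitted classifier heads are assumptions, since the text does not fix an implementation:

import copy
import torch.nn as nn

class NMADualBranchNet(nn.Module):
    """Sketch of the non-maximum activation double-branch network:
    Swin stages 1-3 -> NAM -> two homogeneous copies of stage 4."""
    def __init__(self, swin_backbone, nam_module):
        super().__init__()
        # First three stages of the backbone (assumed exposed as a
        # list-like attribute swin_backbone.stages of stage modules).
        self.stages_1to3 = nn.Sequential(*swin_backbone.stages[:3])
        self.nam = nam_module
        # Homogeneous double-branch subnetwork: the last stage is copied.
        self.branch_m = swin_backbone.stages[3]
        self.branch_n = copy.deepcopy(swin_backbone.stages[3])

    def forward(self, x):
        f = self.stages_1to3(x)      # shared features from stages 1-3
        f_m, f_n = self.nam(f)       # maximum / non-maximum activation features
        x1 = self.branch_m(f_m)      # branch over the maximum activation feature
        x2 = self.branch_n(f_n)      # branch over the non-maximum activation feature
        return x1, x2                # X3 = Concat(X1, X2) is formed downstream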
The calculation process of the non-maximum activation module comprises the following steps:
F_m, F_n = NAM(F)    (1)
in formula (1), NAM(·) denotes the module calculation process; F denotes the feature input to the non-maximum activation module, satisfying F ∈ R^{B×L×C}, where B denotes the number of input images, L denotes the feature dimension, and C denotes the number of feature channels; F_m denotes the maximum activation feature output by the non-maximum activation module, satisfying F_m ∈ R^{B×L×C}; F_n denotes the non-maximum activation feature output by the non-maximum activation module, satisfying F_n ∈ R^{B×L×C}.
The maximum activation feature F_m output by the non-maximum activation module is calculated as follows:
the feature F input to the non-maximum activation module is divided equally into k groups along the 2nd dimension, the feature dimension L being required to be an integer multiple of the group number k, giving the i-th group of features F^i ∈ R^{B×(L/k)×C};
the weight matrix W_m^i ∈ R^{B×(L/k)} corresponding to each group of features F^i is calculated by the following formula:
W_m^i(b, l) = exp(A^i(b, l)) / Σ_{j=1}^{k} exp(A^j(b, l))    (2)
in formula (2), W_m^i(b, l) denotes the weight of the l-th dimension of the b-th image, the denominator normalizing the weights over the k blocks; A^i(b, l) denotes the element at position (b, l) in A^i, where A^i denotes the i-th group of feature vectors, given by:
A^i = σ(Conv(GAP(F^i)))    (3)
in formula (3), GAP(·) denotes a global average pooling operation, Conv(·) denotes a convolution operation, and σ(·) denotes the ReLU activation operation;
the weight matrix W_m^i is expanded along the channel dimension to weight the feature F^i, giving the weighted feature F_m^i; the k groups of features F_m^1, ..., F_m^k are spliced to obtain the maximum activation feature F_m.
The non-maximum activation feature F_n output by the non-maximum activation module is calculated as follows:
a maximum suppression operation is performed on each group of features:
W_n^i = Rank_{α,β}(W_m^i)    (4)
in formula (4), Rank_{α,β}(·) denotes a ranking suppression of the weight matrix W_m^i, in which the α largest values of W_m^i are scaled by β, where β ∈ [0, 1] denotes the degree of suppression of the maximum activation feature weights; W_n^i denotes the i-th group non-maximum activation feature weight matrix;
the weight matrix W_n^i is expanded along the channel dimension to weight the feature F^i, giving the weighted feature F_n^i; the k groups of features F_n^1, ..., F_n^k are spliced to obtain the non-maximum activation feature F_n.
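A minimal PyTorch sketch of the module as just described follows; the convolution kernel size, the broadcasting of GAP over channels, and the per-image top-α selection across all blocks are assumptions where the text leaves details open:

import torch
import torch.nn as nn
import torch.nn.functional as F

class NAM(nn.Module):
    """Sketch of the non-maximum activation module, formulas (1)-(4)."""
    def __init__(self, k=4, alpha=8, beta=0.1):
        super().__init__()
        self.k, self.alpha, self.beta = k, alpha, beta
        # 1-D convolution over the grouped feature dimension; kernel size 3
        # is an assumption, the text only names "a convolution operation".
        self.conv = nn.Conv1d(1, 1, kernel_size=3, padding=1)

    def forward(self, x):                          # x: (B, L, C), L divisible by k
        B, L, C = x.shape
        groups = x.chunk(self.k, dim=1)            # k groups, each (B, L/k, C)
        # A^i = ReLU(Conv(GAP(F^i))), formula (3); GAP averages over channels.
        a = [F.relu(self.conv(g.mean(dim=2, keepdim=True).transpose(1, 2))).squeeze(1)
             for g in groups]                      # each (B, L/k)
        a = torch.stack(a, dim=0)                  # (k, B, L/k)
        w_m = torch.softmax(a, dim=0)              # normalize over the k blocks, formula (2)
        # Maximum suppression, formula (4): scale the alpha largest weights by beta.
        flat = w_m.clone().permute(1, 0, 2).reshape(B, -1)    # (B, L)
        top_vals, top_idx = flat.topk(self.alpha, dim=1)
        flat.scatter_(1, top_idx, top_vals * self.beta)
        w_n = flat.reshape(B, self.k, -1).permute(1, 0, 2)    # (k, B, L/k)
        # Expand the weights over the channel dimension and splice the groups.
        f_m = torch.cat([w_m[i].unsqueeze(-1) * g for i, g in enumerate(groups)], dim=1)
        f_n = torch.cat([w_n[i].unsqueeze(-1) * g for i, g in enumerate(groups)], dim=1)
        return f_m, f_n                            # both (B, L, C)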
Step 2.2: training the non-maximum activation double-branch network.
An image for training is input into the non-maximum activation double-branch network; under the guidance of a preset target loss function, the parameters of the non-maximum activation double-branch network are trained with a stochastic gradient descent algorithm to obtain the optimal network parameters.
The images used for training are obtained as follows: the image is scaled to 600 pixels × 600 pixels, a 448 pixel × 448 pixel region is randomly cropped, and the image is horizontally flipped at random with probability 0.5.
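For illustration, the corresponding training-time pipeline in torchvision, under the same assumed normalization as the inference sketch above:

from torchvision import transforms

# Training-time preprocessing: scale to 600 x 600, random 448 x 448 crop,
# random horizontal flip with probability 0.5.
train_transform = transforms.Compose([
    transforms.Resize((600, 600)),
    transforms.RandomCrop(448),
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])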
The preset target loss function consists of a cross-entropy classification target loss function and a similarity-measure target loss function, calculated by the following formula:
L = L_CE + λ·L_DB    (5)
in formula (5), L denotes the target loss function; λ denotes the weight between L_CE and L_DB; L_CE denotes the cross-entropy classification target loss function, given by:
L_CE = -(1/B) Σ_{b=1}^{B} log[ exp(W_{y_b}^T f_b) / Σ_{j=1}^{C} exp(W_j^T f_b) ]    (6)
in formula (6), B denotes the number of input images; f_b denotes the feature of the b-th image; y_b denotes the real label of the b-th image, with y_b ∈ {1, 2, ..., C}; W_{y_b} denotes the weight parameter mapping feature f_b to the real label class y_b; W_j denotes the weight parameter mapping feature f_b to the j-th class; C is the total number of classes;
in formula (5), L_DB denotes the similarity-measure target loss function, given by:
L_DB = (1/B) Σ_{b=1}^{B} || S^b - diag(S^b) ||_F^2,  where S^b = F̂^b (F̂^b)^T    (7)
in formula (7), S^b denotes the similarity matrix between the features of the different branches; F̂^b denotes the spliced and standardized features F_m^b and F_n^b of the b-th image; (F̂^b)^T denotes F̂^b after transposition; diag(S^b) denotes extraction of the main-diagonal values of S^b; and B denotes the number of input images.
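A minimal sketch of this objective in PyTorch follows; the cross-entropy term is standard, while the similarity term implements the off-diagonal reading of formula (7) reconstructed above, which is an assumption:

import torch
import torch.nn.functional as F

def target_loss(logits, labels, f_m, f_n, lam=1.0):
    """L = L_CE + lambda * L_DB, formula (5); lam plays the role of lambda."""
    # Cross-entropy classification loss, formula (6).
    l_ce = F.cross_entropy(logits, labels)
    # Similarity-measure loss, formula (7): splice the two branch features,
    # normalize, and penalize the off-diagonal self-similarity per image.
    fb = F.normalize(torch.cat([f_m, f_n], dim=1), dim=2)   # (B, 2L, C)
    s = torch.bmm(fb, fb.transpose(1, 2))                   # (B, 2L, 2L)
    off_diag = s - torch.diag_embed(torch.diagonal(s, dim1=1, dim2=2))
    l_db = off_diag.pow(2).sum(dim=(1, 2)).mean()           # average over the batch
    return l_ce + lam * l_db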
Step 3: inputting the preprocessed target image into the pre-trained non-maximum activation double-branch network, and extracting the features of the target image to be classified.
The non-maximum activation module outputs the maximum activation features and the non-maximum activation features; the output features are input into the homogeneous double-branch subnetwork, which learns from them and outputs the target image features to be classified.
Step 4: based on the obtained target image features to be classified, performing class prediction with a classifier to obtain the class prediction results of the target image features to be classified.
Specifically, the image to be classified I ∈ R^{3×448×448} passes through the non-maximum activation double-branch network to obtain three classification features X_1, X_2 and X_3, and independent classifiers predict class probabilities for the three classification features, giving the class probability prediction results p_1, p_2 and p_3. Here, X_1 denotes the feature of the target image to be classified containing the maximum activation feature, X_2 denotes the feature of the target image to be classified containing the non-maximum activation feature, and X_3 denotes the spliced feature, satisfying X_3 = Concat(X_1, X_2), where Concat(·) denotes a splicing operation along the feature channel dimension.
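For illustration, a sketch of the three independent classifiers follows; linear heads with softmax outputs and equal feature dimensions for X_1 and X_2 are assumptions, as the text only specifies "independent classifiers":

import torch
import torch.nn as nn

class TriHead(nn.Module):
    """Three independent classifiers over X1, X2 and X3 = Concat(X1, X2)."""
    def __init__(self, dim, num_classes):
        super().__init__()
        self.head1 = nn.Linear(dim, num_classes)        # over X1
        self.head2 = nn.Linear(dim, num_classes)        # over X2
        self.head3 = nn.Linear(2 * dim, num_classes)    # over X3

    def forward(self, x1, x2):
        x3 = torch.cat([x1, x2], dim=1)                 # splice along the channel dimension
        p1 = self.head1(x1).softmax(dim=1)
        p2 = self.head2(x2).softmax(dim=1)
        p3 = self.head3(x3).softmax(dim=1)
        return p1, p2, p3                               # class probability predictions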
Step 5: fusing the class prediction results by a preset fusion method to obtain the classification result of the target image to be classified.
Step 5.1: based on the class probability prediction results p_1, p_2 and p_3, the prediction result is obtained by weighted-sum fusion:
p̂_c = (1/M) Σ_{k=1}^{M} p_c^k    (8)
in formula (8), p̂_c denotes the prediction probability of the c-th class after weighted fusion, p_c^k denotes the probability of the c-th class output by the k-th path, C is the number of classes, and M is the total number of paths.
Step 5.2: the classification result of the target image to be classified is the class corresponding to the maximum value in p̂ = [p̂_1, p̂_2, ..., p̂_C], calculated as ŷ = argmax_{c ∈ {1, ..., C}} p̂_c.
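A sketch of steps 5.1 and 5.2, using the equal-weight reading of formula (8) adopted above (other fixed path weights are possible):

import torch

def fuse_and_classify(p1, p2, p3):
    """Weighted-sum fusion of the three path probabilities, then argmax."""
    probs = torch.stack([p1, p2, p3], dim=0)   # (M, B, C), here M = 3
    fused = probs.mean(dim=0)                  # formula (8) with equal weights
    return fused.argmax(dim=1)                 # predicted class per image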
The method can obtain more sufficient target discriminative region features, can solve the problem of insufficient feature extraction by current attention mechanisms, can overcome the difficulty that attention mechanism methods based on convolutional neural networks in the prior art are hard to extend directly to fine-grained classification methods based on the Transformer framework, and can improve the accuracy and robustness of fine-grained classification results.
Embodiment two:
as shown in fig. 4, an embodiment of the present invention provides a fine-grained image classification system based on a dual-branch network, including:
a preprocessing module: used for preprocessing a target image to be classified;
a feature extraction module: used for inputting the preprocessed target image into a pre-trained non-maximum activation double-branch network and extracting the features of the target image to be classified; the non-maximum activation double-branch network comprises a non-maximum activation module and a homogeneous double-branch subnetwork, wherein the non-maximum activation module is used for outputting maximum activation features and non-maximum activation features, the output features are input into the homogeneous double-branch subnetwork, and the homogeneous double-branch subnetwork learns and outputs the target image features to be classified;
a category prediction module: used for performing class prediction with a classifier based on the obtained target image features to be classified, to obtain the class prediction results of the target image features to be classified;
a fusion output module: used for fusing the class prediction results by a preset fusion method to obtain the classification result of the target image to be classified.
Embodiment three:
the embodiment of the invention provides a fine-grained image classification device based on a double-branch network, which comprises a processor and a storage medium, wherein the processor is used for processing images;
the storage medium is used for storing instructions;
the processor is configured to operate in accordance with the instructions to perform the steps of the method of embodiment one.
Embodiment four:
embodiments of the present invention also provide a computer-readable storage medium, on which a computer program is stored, where the computer program, when executed by a processor, implements the steps of the method according to one embodiment.
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
The above description is only a preferred embodiment of the present invention, and it should be noted that, for those skilled in the art, several modifications and variations can be made without departing from the technical principle of the present invention, and these modifications and variations should also be regarded as the protection scope of the present invention.

Claims (10)

1. A fine-grained image classification method based on a double-branch network is characterized by comprising the following steps:
preprocessing a target image to be classified;
inputting the preprocessed target image into a pre-trained non-maximum activated double-branch network, and extracting to obtain the characteristics of the target image to be classified; the non-maximum activation double-branch network comprises a non-maximum activation module and a homogeneous double-branch subnetwork, wherein the non-maximum activation module is used for outputting maximum activation characteristics and non-maximum activation characteristics, the output characteristics are input into the homogeneous double-branch subnetwork, the homogeneous double-branch subnetwork learns and outputs target image characteristics to be classified;
based on the obtained target image features to be classified, adopting a classifier to perform class prediction to obtain class prediction results of the target image features to be classified;
and fusing the category prediction results by adopting a preset fusion method to obtain the classification result of the target image to be classified.
2. The fine-grained image classification method based on the double-branch network according to claim 1, wherein the preprocessing of the target image to be classified comprises:
the target image to be classified is scaled to 600 pixels × 600 pixels, and a 448 pixel × 448 pixel image area is cropped centered on the image center.
3. The fine-grained image classification method based on the double-branch network according to claim 1, wherein the non-maximum activated double-branch network comprises an image preprocessing module, a backbone network feature extraction module, a non-maximum activated module and a homogeneous double-branch subnetwork.
4. The fine-grained image classification method based on the double-branch network according to claim 1, characterized in that the non-maximum activated double-branch network is trained by the following steps:
inputting an image for training into the non-maximum activation double-branch network; and, under the guidance of a preset target loss function, training the parameters of the non-maximum activation double-branch network with a stochastic gradient descent algorithm to obtain the optimal network parameters.
5. The fine-grained image classification method based on the dual-branch network according to claim 4, wherein the preset target loss function is:
L = L_CE + λ·L_DB    (1)
in formula (1), L denotes the target loss function; λ denotes the weight between L_CE and L_DB; L_CE denotes the cross-entropy classification target loss function, given by:
L_CE = -(1/B) Σ_{b=1}^{B} log[ exp(W_{y_b}^T f_b) / Σ_{j=1}^{C} exp(W_j^T f_b) ]    (2)
in formula (2), B denotes the number of input images; f_b denotes the feature of the b-th image; y_b denotes the real label of the b-th image, with y_b ∈ {1, 2, ..., C}; W_{y_b} denotes the weight parameter mapping feature f_b to the real label class y_b; W_j denotes the weight parameter mapping feature f_b to the j-th class; C is the total number of classes;
in formula (1), L_DB denotes the similarity-measure target loss function, given by:
L_DB = (1/B) Σ_{b=1}^{B} || S^b - diag(S^b) ||_F^2,  where S^b = F̂^b (F̂^b)^T    (3)
in formula (3), S^b denotes the similarity matrix between the features of the different branches; F̂^b denotes the spliced and standardized features F_m^b and F_n^b of the b-th image; (F̂^b)^T denotes F̂^b after transposition; diag(S^b) denotes extraction of the main-diagonal values of S^b; and B denotes the number of input images.
6. The fine-grained image classification method based on the dual-branch network according to claim 1, wherein the calculation process of the non-maximum activation module comprises:
F_m, F_n = NAM(F)    (4)
in formula (4), NAM(·) denotes the module calculation process; F denotes the feature input to the non-maximum activation module, satisfying F ∈ R^{B×L×C}, where B denotes the number of input images, L denotes the feature dimension, and C denotes the number of feature channels; F_m denotes the maximum activation feature output by the non-maximum activation module, satisfying F_m ∈ R^{B×L×C}; F_n denotes the non-maximum activation feature output by the non-maximum activation module, satisfying F_n ∈ R^{B×L×C}.
7. The fine-grained image classification method based on the double-branch network according to claim 6, wherein the maximum activation feature F_m output by the non-maximum activation module is calculated as follows:
the feature F input to the non-maximum activation module is divided equally into k groups along the 2nd dimension, the feature dimension L being required to be an integer multiple of the group number k, giving the i-th group of features F^i ∈ R^{B×(L/k)×C};
the weight matrix W_m^i ∈ R^{B×(L/k)} corresponding to each group of features F^i is calculated by the following formula:
W_m^i(b, l) = exp(A^i(b, l)) / Σ_{j=1}^{k} exp(A^j(b, l))    (5)
in formula (5), W_m^i(b, l) denotes the weight of the l-th dimension of the b-th image, the denominator normalizing the weights over the k blocks; A^i(b, l) denotes the element at position (b, l) in A^i, where A^i denotes the i-th group of feature vectors, given by:
A^i = σ(Conv(GAP(F^i)))    (6)
in formula (6), GAP(·) denotes a global average pooling operation, Conv(·) denotes a convolution operation, and σ(·) denotes the ReLU activation operation;
the weight matrix W_m^i is expanded along the channel dimension to weight the feature F^i, giving the weighted feature F_m^i; the k groups of features F_m^1, ..., F_m^k are spliced to obtain the maximum activation feature F_m.
8. The fine-grained image classification method based on the double-branch network according to claim 7, wherein the non-maximum activation feature F_n output by the non-maximum activation module is calculated as follows:
a maximum suppression operation is performed on each group of features:
W_n^i = Rank_{α,β}(W_m^i)    (7)
in formula (7), Rank_{α,β}(·) denotes a ranking suppression of the weight matrix W_m^i, in which the α largest values of W_m^i are scaled by β, where β ∈ [0, 1] denotes the degree of suppression of the maximum activation feature weights; W_n^i denotes the i-th group non-maximum activation feature weight matrix;
the weight matrix W_n^i is expanded along the channel dimension to weight the feature F^i, giving the weighted feature F_n^i; the k groups of features F_n^1, ..., F_n^k are spliced to obtain the non-maximum activation feature F_n.
9. The fine-grained image classification method based on the dual-branch network according to claim 1, wherein fusing the class prediction results by a preset fusion method comprises:
the class probability prediction results are denoted p_1, p_2 and p_3, corresponding respectively to the three features X_1, X_2 and X_3 output by the non-maximum activation double-branch network; X_1 denotes the feature of the target image to be classified containing the maximum activation feature, X_2 denotes the feature of the target image to be classified containing the non-maximum activation feature, and X_3 denotes the spliced feature, satisfying X_3 = Concat(X_1, X_2), where Concat(·) denotes a splicing operation along the feature channel dimension;
the prediction result is obtained by weighted-sum fusion, calculated by the following formula:
p̂_c = (1/M) Σ_{k=1}^{M} p_c^k    (8)
in formula (8), p̂_c denotes the prediction probability of the c-th class after weighted fusion, p_c^k denotes the probability of the c-th class output by the k-th path, C is the number of classes, and M denotes the total number of paths;
the classification result of the target image to be classified is the class corresponding to the maximum value in p̂ = [p̂_1, p̂_2, ..., p̂_C], calculated as ŷ = argmax_{c ∈ {1, ..., C}} p̂_c.
10. A fine-grained image classification system based on a double-branch network is characterized by comprising the following components:
a preprocessing module: used for preprocessing a target image to be classified;
a feature extraction module: used for inputting the preprocessed target image into a pre-trained non-maximum activation double-branch network and extracting the features of the target image to be classified; the non-maximum activation double-branch network comprises a non-maximum activation module and a homogeneous double-branch subnetwork, wherein the non-maximum activation module is used for outputting maximum activation features and non-maximum activation features, the output features are input into the homogeneous double-branch subnetwork, and the homogeneous double-branch subnetwork learns and outputs the target image features to be classified;
a category prediction module: used for performing class prediction with a classifier based on the obtained target image features to be classified, to obtain the class prediction results of the target image features to be classified;
a fusion output module: used for fusing the class prediction results by a preset fusion method to obtain the classification result of the target image to be classified.
CN202111175746.2A 2021-10-09 2021-10-09 Fine-grained image classification method and system based on double-branch network Pending CN113902948A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111175746.2A CN113902948A (en) 2021-10-09 2021-10-09 Fine-grained image classification method and system based on double-branch network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111175746.2A CN113902948A (en) 2021-10-09 2021-10-09 Fine-grained image classification method and system based on double-branch network

Publications (1)

Publication Number Publication Date
CN113902948A true CN113902948A (en) 2022-01-07

Family

ID=79190664

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111175746.2A Pending CN113902948A (en) 2021-10-09 2021-10-09 Fine-grained image classification method and system based on double-branch network

Country Status (1)

Country Link
CN (1) CN113902948A (en)


Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114626476A (en) * 2022-03-21 2022-06-14 北京信息科技大学 Bird fine-grained image recognition method and device based on Transformer and component feature fusion
CN116452896A (en) * 2023-06-16 2023-07-18 中国科学技术大学 Method, system, device and medium for improving fine-grained image classification performance
CN116452896B (en) * 2023-06-16 2023-10-20 中国科学技术大学 Method, system, device and medium for improving fine-grained image classification performance


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination