CN114626476A - Bird fine-grained image recognition method and device based on Transformer and component feature fusion - Google Patents
Bird fine-grained image recognition method and device based on Transformer and component feature fusion
- Publication number
- CN114626476A CN114626476A CN202210279684.8A CN202210279684A CN114626476A CN 114626476 A CN114626476 A CN 114626476A CN 202210279684 A CN202210279684 A CN 202210279684A CN 114626476 A CN114626476 A CN 114626476A
- Authority
- CN
- China
- Prior art keywords
- feature
- component
- attention
- map
- formula
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/241—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
- G06F18/2415—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/25—Fusion techniques
- G06F18/253—Fusion techniques of extracted features
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/048—Activation functions
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
Abstract
The invention discloses a bird fine-grained image recognition method and device based on Transformer and component feature fusion. The method comprises the following steps: step 1, inputting the preprocessed image into a feature encoder based on a Transformer architecture network to extract a basic feature map, and inputting the basic feature map into an attention module to generate a component attention map; step 2, performing bilinear attention pooling on the basic feature map and the component attention map to obtain discriminative component features; step 3, concatenating the discriminative component features along the channel dimension to obtain an enhanced feature representation fused with discriminative component information; and step 4, inputting the enhanced feature representation into the full connection layer to complete the mapping to categories, and optimizing the model parameters through cross entropy loss and center loss. The method can achieve high-precision recognition of bird images under weak supervision.
Description
Technical Field
The invention relates to the technical field of computer vision, in particular to deep learning and fine-grained image recognition technology, and more particularly to a bird fine-grained image recognition method and device based on Transformer and component feature fusion.
Background
Bird image recognition belongs to the fine-grained image recognition task. Fine-grained image recognition distinguishes different subclasses within the same general class. Generic image recognition classifies objects into broad categories, such as horses and cats, whose features differ greatly, so the categories are relatively easy to separate. In fine-grained images, the differences between objects lie in subtle parts, while the same object can show large visual variation due to scale, viewing angle, background, and so on, which makes recognition considerably harder.
Disclosure of Invention
The invention aims to provide a bird fine-grained image recognition method and device based on Transformer and component feature fusion, which can achieve higher recognition accuracy on bird fine-grained images.
In order to achieve the above object, the present invention provides a bird fine-grained image recognition method based on Transformer and component feature fusion, which comprises:
step 1, inputting the preprocessed image into a feature encoder based on a Transformer architecture network to extract a basic feature map, and inputting the basic feature map into an attention module to generate a component attention map;
step 2, performing bilinear attention pooling on the basic feature map and the component attention map to obtain discriminative component features;
step 3, concatenating the discriminative component features along the channel dimension to obtain an enhanced feature representation fused with discriminative component information;
and step 4, inputting the enhanced feature representation into the full connection layer to complete the mapping to categories, and optimizing the model parameters through cross entropy loss and center loss.
Further, the method for extracting the basic feature map in step 1 specifically includes:
step 11a, inputting the preprocessed original image I into a feature extraction network f and extracting a two-dimensional basic feature map F, where F ∈ (H·W) × D, H and W respectively denote the height and width of the basic feature map, and D denotes the size of the embedding dimension;
step 12a, reshaping the basic feature map F to obtain a three-dimensional basic feature map F̂ ∈ H × W × D; the process is shown in the following formula (1):
F̂ = reshape(F)  (1)
in the formula, reshape(·) represents reorganization of the basic feature map.
Further, the method for generating the component attention diagram in step 1 specifically includes:
step 11b, determining the number M of channels of the component attention map to be generated, i.e., the number of component features to be generated;
step 12b, forming an attention module G from a two-dimensional convolution with a 1 × 1 convolution kernel and a Sigmoid function, inputting the feature map F̂ into the attention module G, and generating a component attention map A representing the distribution of different components of the target object, as shown in the following formula (2):
A = G(F̂)  (2)
in the formula, A_i (i = 1, 2, …, M) represents the ith component attention map of the target object.
Further, the step 2 specifically includes:
step 21, expanding each component attention map A_i to the dimensions of the basic feature map F̂, and then multiplying the expanded attention map A_i with the basic feature map F̂ element by element according to the following formula (3) to obtain the discriminative component feature P_i:
P_i = A_i ⊙ F̂  (3)
in the formula, "⊙" indicates element-wise multiplication;
step 22, aggregating each discriminative component feature P_i by global average pooling according to the following formula (4):
h_i = ψ(P_i)  (4)
in the formula, h_i represents the aggregated feature of the ith component, and ψ(·) represents global average pooling (GAP).
Further, the step 3 specifically includes:
step 31, concatenating the aggregated discriminative component features h_i along the channel dimension to obtain an enhanced feature representation, namely the global component feature Q; this feature fuses discriminative component information and has stronger expressive power:
Q = Concate(h_1, h_2, …, h_M)  (5)
in the formula, Concate(·) represents feature concatenation;
step 32, after L2-norm normalization of the global component feature Q, passing it into the full connection layer to complete the mapping from feature vectors to categories.
Further, the step 4 specifically includes:
step 41, inputting the global component feature Q into the full connection layer to complete the mapping to bird image categories, and obtaining the cross entropy loss between the predicted value and the label, which is used to penalize the classification result; the loss of a single sample is shown in formula (6):
in the formula, y represents the category label, y' represents the predicted value, and P represents the probability after Softmax processing;
step 42, the center loss of a single sample described by formula (8) is used to weakly supervise the generation of the component attention, so that each component feature continuously approaches its feature center:
in the formula, q_i is the ith component feature of the global component feature Q, and c_i is the center of the ith component feature;
step 43, initializing c_i and updating it during training according to the following formula (9):
c_i ← c_i + α(q_i − c_i)  (9)
in the formula, α ∈ [0, 1] is the learning rate for updating c_i, and the total loss of the model during the training phase is defined by formula (10):
the invention also provides a bird fine-grained image recognition device based on Transformer and component feature fusion, which comprises:
a component attention generation unit, which is used for extracting a basic feature map by inputting the preprocessed image into a feature encoder based on a Transformer architecture network, and inputting the basic feature map into an attention module to generate a component attention map;
a discriminative component feature generation unit configured to perform bilinear attention pooling operations on the basic feature map and the component attention map to obtain discriminative component features;
the feature fusion unit is used for splicing the discriminative component features on the channel dimension to obtain enhanced feature representation fused with the discriminative component information;
and the parameter learning optimization unit is used for completing the mapping of the categories by inputting the enhanced feature representation into the full-connection layer, and optimizing the model parameters through cross entropy loss and center loss.
Further, the component attention generating unit includes:
the basic feature map extraction subunit, which specifically includes:
the two-dimensional basic feature map extraction module, used for inputting the preprocessed original image I into a feature extraction network f and extracting a two-dimensional basic feature map F, where F ∈ (H·W) × D, H and W respectively denote the height and width of the basic feature map, and D denotes the size of the embedding dimension;
the three-dimensional basic feature map module, used for reshaping the basic feature map F to obtain a three-dimensional basic feature map F̂; the process is shown in the following formula (1):
F̂ = reshape(F)  (1)
in the formula, reshape(·) indicates that the feature map is reorganized;
the component attention map generation subunit, used for determining the number M of channels of the component attention map to be generated; an attention module G is formed by a two-dimensional convolution with a 1 × 1 convolution kernel and a Sigmoid function, the feature map F̂ is input into the attention module G, and a component attention map A representing the distribution of different components of the target object is generated, as shown in the following formula (2):
A = G(F̂)  (2)
in the formula, A_i (i = 1, 2, …, M) represents the ith component attention map of the target object.
Further, the discriminative component feature generation unit specifically includes:
a single discriminative component feature generation module, which expands each component attention map A_i to match the dimensions of the basic feature map F̂ and then multiplies the expanded attention map A_i with the basic feature map F̂ element by element according to the following formula (3) to obtain the discriminative component feature P_i:
P_i = A_i ⊙ F̂  (3)
in the formula, "⊙" indicates element-wise multiplication;
a discriminative component feature fusion module, which aggregates each discriminative component feature P_i by global average pooling according to the following formula (4):
h_i = ψ(P_i)  (4)
in the formula, h_i represents the aggregated feature of the ith component, and ψ(·) represents global average pooling (GAP).
Further, the parameter learning optimization unit specifically includes:
a sample classification loss acquisition module, which inputs the global component feature Q into the full connection layer to complete the mapping to bird image categories and obtains the cross entropy loss between the predicted value and the label, used to penalize the classification result; the classification loss of a single sample is shown in formula (6):
in the formula, y represents the category label, y' represents the predicted value, and P represents the probability after Softmax processing;
a component feature center update module, which uses the center loss of a single sample described by formula (8) to weakly supervise the generation of the component attention, initializes c_i, and updates it during training according to the following formula (9); the total loss of the model in the training phase is defined by formula (10):
c_i ← c_i + α(q_i − c_i)  (9)
in the formula, q_i is the ith component feature of the global component feature Q, c_i is the center of the ith component feature, and α ∈ [0, 1] is the learning rate for updating c_i.
Due to the adoption of the technical scheme, the invention has the following advantages:
the method generates the component attention diagram in an attention mode, combines the component attention diagram with a feature extraction network based on a Transformer architecture to realize the fusion of the characteristics of the discriminant component, and can not only focus on the discriminant component, but also obtain the feature representation with better expression capability; in the training stage, the model can realize high identification precision of bird images under weak supervision only by category labels without other labeling information.
Drawings
Fig. 1 is a schematic flow chart of a method according to an embodiment of the present invention.
Fig. 2 is a general model structure diagram corresponding to fig. 1.
Fig. 3 is a diagram of the attention module of fig. 1.
Fig. 4 is a diagram of the extraction and fusion process of the feature in fig. 1.
FIG. 5 is a graph illustrating the effect of center loss on model performance in FIG. 1.
Detailed Description
In order to make the aforementioned objects, features and advantages more comprehensible, the present invention is described in detail below with reference to the accompanying drawings and the detailed description thereof.
Interpretation of terms: in the field of computer vision, a network based on the Transformer architecture is mainly composed of multi-layer perceptrons; it first divides an image into a plurality of image patches and then passes the patches to the subsequent network. The self-attention mechanism in the network enables the extracted feature map to contain global information, which is beneficial to downstream tasks.
As shown in fig. 1, the method for identifying a bird fine-grained image based on fusion of a Transformer and a component feature provided by the embodiment of the invention includes the following steps:
Step 1, inputting the preprocessed image into a feature encoder based on a Transformer architecture network to extract a basic feature map, and inputting the basic feature map into an attention module to generate a component attention map.
Step 2, performing bilinear attention pooling on the basic feature map and the component attention map to obtain the discriminative component features.
Step 3, concatenating the discriminative component features along the channel dimension to obtain an enhanced feature representation fused with discriminative component information.
Step 4, inputting the enhanced feature representation into the full connection layer to complete the mapping to categories, and optimizing the model parameters through cross entropy loss and center loss. The model whose parameters are optimized consists of the feature extraction network f, the constructed attention module G, and a full connection layer.
Thus, step 1 is used to obtain a base feature map and a discriminative part attention map of an image.
In one embodiment, the method for acquiring the basic feature map of the image in step 1 specifically includes:
step 11a, image preprocessing.
For example: the published bird data sets CUB-200 and NABirds are selected, and the selected bird data sets are divided into training sets and test sets. The following illustrates the specific implementation of image preprocessing at this stage, and the image preprocessing methods for both bird datasets are the same.
Training stage: the training images are first resized to 496 × 496 pixels, a 384 × 384 pixel region is then cropped at random, data augmentation is performed by random horizontal flipping, and finally the image data are normalized with mean [0.485, 0.456, 0.406] and standard deviation [0.229, 0.224, 0.225].
Testing stage: the image is center-cropped to 384 × 384 pixels and normalized in the same way as in the training stage.
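For concreteness, the preprocessing described above can be sketched with torchvision as follows; the test-stage resize to 496 × 496 before the center crop is an assumption, since the text only specifies the 384 × 384 crop and the normalization statistics.

```python
import torchvision.transforms as T

# Normalization statistics given in the text (standard ImageNet mean/std).
MEAN, STD = [0.485, 0.456, 0.406], [0.229, 0.224, 0.225]

# Training: resize to 496x496, random 384x384 crop, random horizontal flip, normalize.
train_transform = T.Compose([
    T.Resize((496, 496)),
    T.RandomCrop(384),
    T.RandomHorizontalFlip(),
    T.ToTensor(),
    T.Normalize(MEAN, STD),
])

# Testing: resize (assumed 496x496) then center-crop 384x384, same normalization.
test_transform = T.Compose([
    T.Resize((496, 496)),
    T.CenterCrop(384),
    T.ToTensor(),
    T.Normalize(MEAN, STD),
])
```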
Step 12a, the preprocessed original image I is input into the feature extraction network f to extract a two-dimensional basic feature map F, where F ∈ (H·W) × D, H and W respectively denote the height and width of the basic feature map, and D denotes the size of the embedding dimension. As shown in fig. 2, the feature extraction network f used in this embodiment is Swin-L, which is based on the Transformer architecture.
Step 13a, the basic feature map F is reshaped to obtain a three-dimensional basic feature map F̂ ∈ H × W × D, so that the basic feature map can be adapted to the attention network constructed later and the input dimensions remain consistent.
The method of obtaining the basic feature map of the image in step 1 can therefore be described by the following formula (1):
F̂ = reshape(F)  (1)
in the formula, reshape(·) represents reorganization of the feature map.
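A minimal sketch of steps 12a–13a is given below; it assumes a Transformer backbone (e.g., Swin-L) that returns a token sequence of shape (B, H·W, D), and the channel-first layout chosen for F̂ is an implementation convenience for the 1 × 1 convolution used later.

```python
import torch
import torch.nn as nn

def extract_base_feature(backbone: nn.Module, image: torch.Tensor,
                         H: int, W: int) -> torch.Tensor:
    """Extract the 2-D base feature F of shape (B, H*W, D) and reshape it into
    the 3-D base feature map F_hat, formula (1)."""
    F = backbone(image)                              # (B, H*W, D) token sequence
    B, _, D = F.shape
    F_hat = F.reshape(B, H, W, D)                    # reshape(.) in formula (1)
    F_hat = F_hat.permute(0, 3, 1, 2).contiguous()   # (B, D, H, W) for the 2-D conv
    return F_hat
```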
In the above embodiment, ResNet may alternatively be used to extract image features; its output is already a three-dimensional feature map and requires no reshaping. However, it extracts only local features, so its feature expression capability is limited. The feature extraction network f in step 1 may also adopt other prior-art network structures, as long as a basic feature map with rich expression can be obtained; they are not listed here.
In one embodiment, the method for obtaining the discriminative component attention map of the image in step 1 specifically includes:
Step 11b, determining the number M of channels of the component attention map to be generated, i.e., the number of component features to be generated; the value can be chosen for different datasets according to the actual situation.
Since the number of channels of the component attention map reflects how many of the object's discriminative components the model attends to, attending to more components gives the model better discrimination of subtle differences. Balancing the number of learnable parameters against accuracy, M can be set to 64 and 32 on the CUB-200-2011 and NABirds datasets, respectively. The value of M can be selected according to the experimental results on different datasets.
Step 12b, an attention module is constructed to generate the component attention map.
Generally, an image attention module is composed of fully connected layers, two-dimensional convolutions, batch normalization, activation functions (e.g., ReLU, Sigmoid, Softmax) and the like, and different attention architectures generate different attention maps and improve model performance to different degrees. Experimental analysis shows that an attention generation module G composed of a two-dimensional convolution with a 1 × 1 convolution kernel and a Sigmoid function is better suited to the backbone network of this embodiment; its structure is shown in fig. 3, where the 1 × 1 convolution changes the number of feature channels to the preset number of components M. This also shows that generating attention by combining a two-dimensional convolution with an activation function is effective not only for convolutional neural networks but also for networks based on the Transformer architecture. The process of generating the component attention map A is shown in the following formula (2):
A = G(F̂)  (2)
where A ∈ H × W × M, and A_i ∈ H × W (i = 1, 2, …, M) represents the ith component attention map of the target object, such as the bird's head or torso.
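A sketch of the attention module G of formula (2) might look as follows; the module and parameter names are illustrative only, and the (B, D, H, W) tensor layout follows the sketch above.

```python
import torch.nn as nn

class PartAttention(nn.Module):
    """Attention module G: a 1x1 conv maps the D feature channels to M component
    channels, and Sigmoid bounds each attention value to (0, 1) -- formula (2)."""
    def __init__(self, embed_dim: int, num_parts: int):
        super().__init__()
        self.conv = nn.Conv2d(embed_dim, num_parts, kernel_size=1)
        self.act = nn.Sigmoid()

    def forward(self, f_hat):                 # f_hat: (B, D, H, W)
        return self.act(self.conv(f_hat))     # A: (B, M, H, W)
```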
As another implementation of the discriminative component attention map in step 1, step 12b may instead use an attention module G composed of a 1 × 1 two-dimensional convolution, two-dimensional batch normalization and a ReLU function; an attention module G composed of a fully connected layer, one-dimensional layer normalization and a Softmax function; or even an attention module G composed of a fully connected layer, one-dimensional layer normalization and a ReLU function, without changing step 11b.
From the above, step 2 obtains the discriminative component features by Bilinear Attention Pooling (BAP) of the basic feature map and the component attention map. In one embodiment, as shown in fig. 4, the method for extracting the discriminative component features in step 2 specifically includes:
Step 21, each component attention map A_i is expanded to the dimensions of the basic feature map F̂, that is, A_i is repeated along the channel dimension until its number of channels matches that of F̂; the two are then multiplied element by element to obtain the discriminative component feature P_i ∈ H × W × D, so that the basic feature map is activated at the positions of the discriminative components. The specific process is shown in formula (3):
P_i = A_i ⊙ F̂  (3)
in the formula, "⊙" indicates element-wise multiplication.
Step 22, for the image classification task, Global Average Pooling (GAP) is typically used to aggregate features. The discriminative component features obtained in step 21 are aggregated by global average pooling to facilitate the subsequent fusion of component features. The aggregation of the ith component feature is defined as follows:
h_i = ψ(P_i)  (4)
in the formula, h_i ∈ D represents the aggregated feature of the ith component, and ψ(·) represents global average pooling.
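The bilinear attention pooling of steps 21–22 (formulas (3)–(4)) can be sketched as below, continuing the (B, D, H, W) / (B, M, H, W) tensor layouts assumed above.

```python
import torch

def bilinear_attention_pooling(f_hat: torch.Tensor, attn: torch.Tensor) -> torch.Tensor:
    """f_hat: (B, D, H, W), attn: (B, M, H, W) -> aggregated part features h: (B, M, D)."""
    # Broadcast each A_i over all D channels of F_hat and multiply element-wise (formula (3)).
    parts = attn.unsqueeze(2) * f_hat.unsqueeze(1)    # P: (B, M, D, H, W)
    # Global average pooling psi(.) over the spatial positions (formula (4)).
    h = parts.mean(dim=(3, 4))                        # h: (B, M, D)
    return h
```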
In another embodiment, step 21 can be implemented by directly channel-stitching the obtained feature map with the attention map. However, this method does not extract the discriminative component and then performs fusion of the features, and thus the representation capability of the features is limited.
From the above, it can be seen that: step 3 is used for fusing the discriminative part features, and specifically comprises the following steps:
Step 31, the aggregated discriminative component features h_i are concatenated along the channel dimension to obtain an enhanced feature representation, namely the global component feature Q ∈ M·D shown in the following formula (5); this feature fuses discriminative component information and has stronger expressive power.
Q = Concate(h_1, h_2, …, h_M)  (5)
in the formula, Concate(·) represents feature concatenation;
step 32, after L2-norm normalization, the global component feature Q is passed into the full connection layer.
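Steps 31–32 (formula (5), L2 normalization and the full connection layer) might be sketched as follows; whether the L2 norm is taken over the whole concatenated vector or per component is not stated in the text, so whole-vector normalization is assumed here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as Fn

class PartFusionHead(nn.Module):
    """Concatenate the part features, L2-normalize, and map to categories."""
    def __init__(self, num_parts: int, embed_dim: int, num_classes: int):
        super().__init__()
        self.fc = nn.Linear(num_parts * embed_dim, num_classes)

    def forward(self, h):                        # h: (B, M, D)
        q = h.flatten(1)                         # Q = Concate(h_1, ..., h_M): (B, M*D)
        q = Fn.normalize(q, p=2, dim=1)          # L2-norm normalization (assumed over all of Q)
        logits = self.fc(q)                      # full connection layer: mapping to categories
        return logits, q
```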
In step 31, feature fusion may alternatively be performed by directly adding the feature maps instead of concatenating them along the channel dimension.
In one embodiment, step 4 specifically includes:
and 41, forming a classification network of the model by the full connection layer and the Softmax. Inputting the global component characteristics Q into a full connection layer, completing the mapping of bird image categories, and obtaining the Cross entropy (Cross entropy) loss of a predicted value and a labelAnd the method is used for punishing the classification result and measuring the difference between the classes. The classification loss of a single sample is shown as the formula (6):
where y represents a category label previously marked in the image, such as 0, 1, 2, …, y' represents a predicted value obtained after the component feature Q is input into the full-link layer, and P represents a classification probability of 0-1 after being processed by Softmax, which can be represented by equation (7):
of formula (II) to (III)'jAnd C is the total number of the categories in the data set.
Step 42, to avoid homogenization of the component attention maps during model training, that is, to ensure that different attention channels represent different target components, a center loss function is adopted to weakly supervise the generated component attention; the component attention maps are constrained so that the component features of Q continuously approach their feature centers. During training, the center loss makes the expressions of the same target component as similar as possible while keeping different component features apart. The center loss of a single sample is defined as follows:
in the formula, q_i ∈ D is the ith component feature in the global component feature Q, and c_i ∈ D is the feature center of the ith component. c_i is initialized to 0 and is updated during training as follows:
c_i ← c_i + α(q_i − c_i)  (9)
in the formula, α ∈ [0, 1] is the learning rate for updating c_i. In experiments, better results were obtained with α = 0.05. The total loss of the model in the training phase is defined as follows:
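Formulas (8)–(10) might be sketched as follows; treating formula (8) as a squared Euclidean distance averaged over the batch, updating the centers with a batch-mean approximation of formula (9), and taking the total loss as an unweighted sum are all assumptions, since formula (10) itself is not reproduced in the text.

```python
import torch
import torch.nn as nn

class PartCenterLoss(nn.Module):
    """Center loss over the M part features; centers are initialized to 0 and
    updated by c_i <- c_i + alpha * (q_i - c_i), formula (9)."""
    def __init__(self, num_parts: int, embed_dim: int, alpha: float = 0.05):
        super().__init__()
        self.alpha = alpha
        self.register_buffer("centers", torch.zeros(num_parts, embed_dim))

    def forward(self, q):                                 # q: (B, M, D) part features
        # Squared distance to the running centers (assumed form of formula (8)).
        loss = ((q - self.centers) ** 2).sum(dim=2).mean()
        with torch.no_grad():                             # batch-mean approximation of formula (9)
            self.centers += self.alpha * (q.detach().mean(dim=0) - self.centers)
        return loss

# Assumed total training loss (formula (10)):
# total_loss = classification_loss(logits, labels) + center_loss(q)
```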
the model uses only the cross entropy loss as the overall loss at the test stage.
The embodiment of the invention also provides a bird fine-grained image recognition device based on Transformer and component feature fusion, which comprises a component attention generation unit, a discriminant component feature generation unit, a feature fusion unit and a parameter learning optimization unit, wherein:
the component attention generating unit is used for inputting the preprocessed image into a feature encoder based on a Transformer architecture network, extracting a basic feature map, inputting the basic feature map into an attention module, and generating a component attention map.
And the discriminant part feature generation unit is used for performing bilinear attention pooling operation on the basic feature map and the part attention map to obtain discriminant part features.
The feature fusion unit is used for splicing the discriminative component features on the channel dimension to obtain the enhanced feature representation fused with the discriminative component information.
And the parameter learning optimization unit is used for completing the mapping of the categories by inputting the enhanced feature representation into the full-connection layer and optimizing the model parameters through cross entropy loss and central loss.
In one embodiment, the component attention generating unit includes a base feature map extracting sub-unit and a component attention map generating sub-unit.
The two-dimensional basic feature map extraction module is used for inputting the preprocessed original image I into the feature extraction network f and extracting a two-dimensional basic feature map F, where F ∈ (H·W) × D, H and W respectively denote the height and width of the basic feature map, and D denotes the size of the embedding dimension. The three-dimensional basic feature map module is used for reshaping the basic feature map F to obtain a three-dimensional basic feature map F̂; the process is shown in the following formula (1):
F̂ = reshape(F)  (1)
in the formula, reshape(·) represents reorganization of the feature map.
The component attention map generation subunit is used to determine the number M of channels of the component attention map to be generated; an attention module G is formed by a two-dimensional convolution with a 1 × 1 convolution kernel and a Sigmoid function, the feature map F̂ is input into the attention module G, and a component attention map A representing the distribution of different components of the target object is generated, as shown in the following formula (2):
A = G(F̂)  (2)
in the formula, A_i (i = 1, 2, …, M) represents the ith component attention map of the target object.
In one embodiment, the discriminative component feature generation unit specifically includes a single discriminative component feature generation module and a discriminative component feature fusion module, wherein:
the single discriminative component feature generation module expands each component attention map A_i to match the dimensions of the basic feature map F̂, and then multiplies the expanded attention map A_i with the basic feature map F̂ element by element according to the following formula (3) to obtain the discriminative component feature P_i:
P_i = A_i ⊙ F̂  (3)
in the formula, "⊙" indicates element-wise multiplication.
The discriminative component feature fusion module aggregates each discriminative component feature P_i by global average pooling according to the following formula (4):
h_i = ψ(P_i)  (4)
in the formula, h_i represents the aggregated feature of the ith component, and ψ(·) represents global average pooling (GAP).
In one embodiment, the parameter learning optimization unit specifically includes a single sample loss acquisition module and a central update module of component features, wherein:
the sample classification loss acquisition module is used for inputting global component characteristics Q into the full-connection layer to complete the mapping of bird image categories and obtain the cross entropy loss of a predicted value and a labelFor punishing the classification result, the classification loss of a single sample is shown as formula (6):
in the formula, y represents a category label, y' represents a predicted value, and P represents a probability after the Softmax processing.
The component feature center update module uses the center loss of a single sample described by formula (8) to weakly supervise the generation of the component attention, initializes c_i, and updates it during training according to the following formula (9); the total loss of the model in the training phase is defined by formula (10):
c_i ← c_i + α(q_i − c_i)  (9)
in the formula, q_i is the ith component feature of the global component feature Q, c_i is the center of the ith component feature, and α ∈ [0, 1] is the learning rate for updating c_i.
In practical use, the input image is first preprocessed as in the above embodiment, the trained model parameters are then loaded, and finally the preprocessed image is input into the model to output the class probabilities.
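A minimal inference sketch tying the pieces above together might read as follows; it reuses the components and test_transform sketched earlier (all assumed), and the spatial size 12 × 12 is illustrative for a 384-pixel input with a 32× downsampling backbone.

```python
import torch
from PIL import Image

def predict(image_path: str, backbone, attention, head, device: str = "cuda"):
    """Preprocess one image and return its class probabilities.
    Assumes the trained parameters have already been loaded into the modules."""
    image = Image.open(image_path).convert("RGB")
    x = test_transform(image).unsqueeze(0).to(device)          # preprocessing from the sketch above
    with torch.no_grad():
        f_hat = extract_base_feature(backbone, x, H=12, W=12)  # basic feature map F_hat
        a = attention(f_hat)                                   # component attention maps A
        h = bilinear_attention_pooling(f_hat, a)               # aggregated part features h_i
        logits, _ = head(h)                                    # global component feature -> logits
        probs = torch.softmax(logits, dim=1)                   # class probabilities
    return probs
```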
To verify that the center loss improves model performance, Grad-CAM is used to visualize the feature map output by the last layer of the feature extraction network; the result is shown in FIG. 5. Without the center loss, the high-energy regions of the heat map are scattered sporadically over the birds' bodies or include large background areas, whereas with the center loss the high-energy regions are concentrated on the birds' bodies, indicating that these regions contribute more to the classification result and yield better classification performance.
By adopting the method provided by the invention, the model can attend to discriminative components and achieve high-precision recognition of bird fine-grained images under weak supervision.
Finally, it should be pointed out that: the above examples are only for illustrating the technical solutions of the present invention, and are not limited thereto. Those of ordinary skill in the art will understand that: modifications can be made to the technical solutions described in the foregoing embodiments, or some technical features may be equivalently replaced; such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.
Claims (10)
1. A bird fine-grained image identification method based on Transformer and component feature fusion is characterized by comprising the following steps:
step 1, inputting a preprocessed image into a feature encoder based on a Transformer architecture network, extracting a basic feature map, inputting the basic feature map into an attention module, and generating a component attention map;
step 2, performing bilinear attention pooling operation on the basic feature diagram and the component attention diagram to obtain a discriminant component feature;
step 3, splicing the discriminative part features on the channel dimension to obtain an enhanced feature representation fused with the discriminative part information;
and 4, inputting the enhanced feature representation into the full connection layer to complete the mapping to categories, and optimizing the model parameters through cross entropy loss and center loss.
2. The method for identifying bird fine-grained images based on Transformer and component feature fusion as claimed in claim 1, wherein the method for extracting the basic feature map in the step 1 specifically comprises:
step 11a, inputting the preprocessed original image I into a feature extraction network f and extracting a two-dimensional basic feature map F, where F ∈ (H·W) × D, H and W respectively denote the height and width of the basic feature map, and D denotes the size of the embedding dimension;
step 12a, reshaping the basic feature map F to obtain a three-dimensional basic feature map F̂; the process is shown in the following formula (1):
F̂ = reshape(F)  (1)
in the formula, reshape(·) represents reorganization of the basic feature map.
3. The method for identifying bird fine-grained images based on Transformer and component feature fusion according to claim 1 or 2, wherein the method for generating the component attention map in the step 1 specifically comprises the following steps:
step 11b, determining the number M of channels of the component attention diagram needing to be generated, namely the number of generated component characteristics;
step 12b, forming an attention module G from a two-dimensional convolution with a 1 × 1 convolution kernel and a Sigmoid function, inputting the feature map F̂ into the attention module G, and generating a component attention map A representing the distribution of different components of the target object, as shown in the following formula (2):
A = G(F̂)  (2)
in the formula, A_i (i = 1, 2, …, M) represents the ith component attention map of the target object.
4. The method for identifying bird fine-grained images based on Transformer and component feature fusion according to claim 3, wherein the step 2 specifically comprises the following steps:
step 21, expanding each component attention map A_i to the dimensions of the basic feature map F̂, and then multiplying the expanded attention map A_i with the basic feature map F̂ element by element according to the following formula (3) to obtain the discriminative component feature P_i:
P_i = A_i ⊙ F̂  (3)
in the formula, "⊙" indicates element-wise multiplication;
step 22, aggregating each discriminative component feature P_i by global average pooling according to the following formula (4):
h_i = ψ(P_i)  (4)
in the formula, h_i represents the aggregated feature of the ith component, and ψ(·) represents global average pooling (GAP).
5. The method for identifying bird fine-grained images based on Transformer and component feature fusion according to claim 4, wherein the step 3 specifically comprises the following steps:
step 31, concatenating the aggregated discriminative component features h_i along the channel dimension to obtain an enhanced feature representation, namely the global component feature Q; this feature fuses discriminative component information and has stronger expressive power:
Q = Concate(h_1, h_2, …, h_M)  (5)
in the formula, Concate(·) represents feature concatenation;
step 32, after L2-norm normalization of the global component feature Q, passing it into the full connection layer to complete the mapping from the feature vector to the category.
6. The method for identifying bird fine-grained images based on the fusion of Transformer and component features according to any one of claims 1 to 5, wherein the step 4 specifically comprises the following steps:
step 41, inputting the global component feature Q into the full connection layer to complete the mapping to bird image categories, and obtaining the cross entropy loss between the predicted value and the label, which is used to penalize the classification result; the loss of a single sample is shown in formula (6):
in the formula, y represents the category label, y' represents the predicted value, and P represents the probability after Softmax processing;
step 42, the center loss of a single sample described by formula (8) is used to weakly supervise the generation of the component attention, so that each component feature continuously approaches its feature center:
in the formula, q_i is the ith component feature of the global component feature Q, and c_i is the center of the ith component feature;
step 43, initializing c_i and updating it during training according to the following formula (9):
c_i ← c_i + α(q_i − c_i)  (9)
in the formula, α ∈ [0, 1] is the learning rate for updating c_i, and the total loss of the model during the training phase is defined by formula (10):
7. a bird fine-grained image recognition device based on Transformer and component feature fusion is characterized by comprising:
a component attention generation unit, which is used for extracting a basic feature map by inputting the preprocessed image into a feature encoder based on a Transformer architecture network, and inputting the basic feature map into an attention module to generate a component attention map;
a discriminative component feature generation unit configured to perform bilinear attention pooling on the basic feature map and the component attention map to obtain discriminative component features;
the feature fusion unit is used for splicing the discriminative component features on the channel dimension to obtain enhanced feature representation fused with the discriminative component information;
and the parameter learning optimization unit is used for completing the mapping of the categories by inputting the enhanced feature representation into the full-connection layer, and optimizing the model parameters through cross entropy loss and center loss.
8. The device for bird fine-grained image recognition based on Transformer and component feature fusion according to claim 7, wherein the component attention generating unit comprises:
the basic feature map extraction subunit, which specifically includes:
the two-dimensional basic feature map extraction module, used for inputting the preprocessed original image I into a feature extraction network f and extracting a two-dimensional basic feature map F, where F ∈ (H·W) × D, H and W respectively denote the height and width of the basic feature map, and D denotes the size of the embedding dimension;
the three-dimensional basic feature map module, used for reshaping the basic feature map F to obtain a three-dimensional basic feature map F̂; the process is shown in the following formula (1):
F̂ = reshape(F)  (1)
in the formula, reshape(·) indicates that the feature map is reorganized;
a component attention map generation subunit, used for determining the number M of channels of the component attention map to be generated; an attention module G is formed by a two-dimensional convolution with a 1 × 1 convolution kernel and a Sigmoid function, the feature map F̂ is input into the attention module G, and a component attention map A representing the distribution of different components of the target object is generated, as shown in the following formula (2):
A = G(F̂)  (2)
in the formula, A_i (i = 1, 2, …, M) represents the ith component attention map of the target object.
9. The apparatus for bird fine-grained image recognition based on transform and component feature fusion as claimed in claim 7, wherein said discriminant component feature generation unit specifically comprises:
a single discriminative component feature generation module, which expands each component attention map A_i to match the dimensions of the basic feature map F̂ and then multiplies the expanded attention map A_i with the basic feature map F̂ element by element according to the following formula (3) to obtain the discriminative component feature P_i:
P_i = A_i ⊙ F̂  (3)
in the formula, "⊙" indicates element-wise multiplication;
a discriminative component feature fusion module, which aggregates each discriminative component feature P_i by global average pooling according to the following formula (4):
h_i = ψ(P_i)  (4)
in the formula, h_i represents the aggregated feature of the ith component, and ψ(·) represents global average pooling (GAP).
10. The apparatus for bird fine-grained image recognition based on Transformer and component feature fusion according to claim 4, wherein the parameter learning optimization unit specifically comprises:
a sample classification loss acquisition module, which inputs the global component feature Q into the full connection layer to complete the mapping to bird image categories and obtains the cross entropy loss between the predicted value and the label, used to penalize the classification result; the classification loss of a single sample is shown in formula (6):
in the formula, y represents a category label, y' represents a predicted value, and P represents the probability after the Softmax processing;
a component feature center update module, which uses the center loss of a single sample described by formula (8) to weakly supervise the generation of the component attention, initializes c_i, and updates it during training according to the following formula (9); the total loss of the model in the training phase is defined by formula (10):
c_i ← c_i + α(q_i − c_i)  (9)
in the formula, q_i is the ith component feature of the global component feature Q, c_i is the center of the ith component feature, and α ∈ [0, 1] is the learning rate for updating c_i.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210279684.8A CN114626476A (en) | 2022-03-21 | 2022-03-21 | Bird fine-grained image recognition method and device based on Transformer and component feature fusion |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210279684.8A CN114626476A (en) | 2022-03-21 | 2022-03-21 | Bird fine-grained image recognition method and device based on Transformer and component feature fusion |
Publications (1)
Publication Number | Publication Date |
---|---|
CN114626476A true CN114626476A (en) | 2022-06-14 |
Family
ID=81903433
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202210279684.8A Pending CN114626476A (en) | 2022-03-21 | 2022-03-21 | Bird fine-grained image recognition method and device based on Transformer and component feature fusion |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114626476A (en) |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115035389A (en) * | 2022-08-10 | 2022-09-09 | 华东交通大学 | Fine-grained image identification method and device based on reliability evaluation and iterative learning |
CN115471724A (en) * | 2022-11-02 | 2022-12-13 | 青岛杰瑞工控技术有限公司 | Fine-grained fish epidemic disease identification fusion algorithm based on self-adaptive normalization |
CN117853875A (en) * | 2024-03-04 | 2024-04-09 | 华东交通大学 | Fine-granularity image recognition method and system |
Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20200151424A1 (en) * | 2018-11-09 | 2020-05-14 | Sap Se | Landmark-free face attribute prediction |
CN111523534A (en) * | 2020-03-31 | 2020-08-11 | 华东师范大学 | Image description method |
CN112232293A (en) * | 2020-11-09 | 2021-01-15 | 腾讯科技(深圳)有限公司 | Image processing model training method, image processing method and related equipment |
CN112381830A (en) * | 2020-10-23 | 2021-02-19 | 山东黄河三角洲国家级自然保护区管理委员会 | Method and device for extracting bird key parts based on YCbCr superpixels and graph cut |
EP3923185A2 (en) * | 2021-03-03 | 2021-12-15 | Beijing Baidu Netcom Science And Technology Co., Ltd. | Image classification method and apparatus, electronic device and storage medium |
CN113902948A (en) * | 2021-10-09 | 2022-01-07 | 中国人民解放军陆军工程大学 | Fine-grained image classification method and system based on double-branch network |
CN114140353A (en) * | 2021-11-25 | 2022-03-04 | 苏州大学 | Swin-Transformer image denoising method and system based on channel attention |
Patent Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20200151424A1 (en) * | 2018-11-09 | 2020-05-14 | Sap Se | Landmark-free face attribute prediction |
CN111523534A (en) * | 2020-03-31 | 2020-08-11 | 华东师范大学 | Image description method |
CN112381830A (en) * | 2020-10-23 | 2021-02-19 | 山东黄河三角洲国家级自然保护区管理委员会 | Method and device for extracting bird key parts based on YCbCr superpixels and graph cut |
CN112232293A (en) * | 2020-11-09 | 2021-01-15 | 腾讯科技(深圳)有限公司 | Image processing model training method, image processing method and related equipment |
EP3923185A2 (en) * | 2021-03-03 | 2021-12-15 | Beijing Baidu Netcom Science And Technology Co., Ltd. | Image classification method and apparatus, electronic device and storage medium |
CN113902948A (en) * | 2021-10-09 | 2022-01-07 | 中国人民解放军陆军工程大学 | Fine-grained image classification method and system based on double-branch network |
CN114140353A (en) * | 2021-11-25 | 2022-03-04 | 苏州大学 | Swin-Transformer image denoising method and system based on channel attention |
Non-Patent Citations (1)
Title |
---|
MENGZE LI: "Multi-task attribute-fusion model for fine-grained image recognition", CONFERENCE ON OPTOELECTRONIC IMAGING AND MULTIMEDIA TECHNOLOGY VII, 10 October 2020 (2020-10-10), pages 1 - 15500 * |
Cited By (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115035389A (en) * | 2022-08-10 | 2022-09-09 | 华东交通大学 | Fine-grained image identification method and device based on reliability evaluation and iterative learning |
CN115035389B (en) * | 2022-08-10 | 2022-10-25 | 华东交通大学 | Fine-grained image identification method and device based on reliability evaluation and iterative learning |
CN115471724A (en) * | 2022-11-02 | 2022-12-13 | 青岛杰瑞工控技术有限公司 | Fine-grained fish epidemic disease identification fusion algorithm based on self-adaptive normalization |
CN117853875A (en) * | 2024-03-04 | 2024-04-09 | 华东交通大学 | Fine-granularity image recognition method and system |
CN117853875B (en) * | 2024-03-04 | 2024-05-14 | 华东交通大学 | Fine-granularity image recognition method and system |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110543892B (en) | Part identification method based on multilayer random forest | |
CN110443143B (en) | Multi-branch convolutional neural network fused remote sensing image scene classification method | |
CN111583263B (en) | Point cloud segmentation method based on joint dynamic graph convolution | |
CN111476806B (en) | Image processing method, image processing device, computer equipment and storage medium | |
CN112801015B (en) | Multi-mode face recognition method based on attention mechanism | |
CN114626476A (en) | Bird fine-grained image recognition method and device based on Transformer and component feature fusion | |
CN108960059A (en) | A kind of video actions recognition methods and device | |
CN110222767B (en) | Three-dimensional point cloud classification method based on nested neural network and grid map | |
CN114821014B (en) | Multi-mode and countermeasure learning-based multi-task target detection and identification method and device | |
CN111652273B (en) | Deep learning-based RGB-D image classification method | |
CN106408037A (en) | Image recognition method and apparatus | |
CN109740539B (en) | 3D object identification method based on ultralimit learning machine and fusion convolution network | |
CN113221987A (en) | Small sample target detection method based on cross attention mechanism | |
CN112329771B (en) | Deep learning-based building material sample identification method | |
CN114283325A (en) | Underwater target identification method based on knowledge distillation | |
CN114187506B (en) | Remote sensing image scene classification method of viewpoint-aware dynamic routing capsule network | |
CN112149526A (en) | Lane line detection method and system based on long-distance information fusion | |
CN114926691A (en) | Insect pest intelligent identification method and system based on convolutional neural network | |
CN113496260B (en) | Grain depot personnel non-standard operation detection method based on improved YOLOv3 algorithm | |
CN114494773A (en) | Part sorting and identifying system and method based on deep learning | |
CN113011506A (en) | Texture image classification method based on depth re-fractal spectrum network | |
CN111143544B (en) | Method and device for extracting bar graph information based on neural network | |
CN117036904A (en) | Attention-guided semi-supervised corn hyperspectral image data expansion method | |
CN111553437A (en) | Neural network based image classification method | |
CN111046861B (en) | Method for identifying infrared image, method for constructing identification model and application |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
CB03 | Change of inventor or designer information |
Inventor after: Zhang Haimiao; Liu Chang; Qiu Jun; Ruan Tao
Inventor before: Ruan Tao; Zhang Haimiao; Liu Chang; Qiu Jun
CB03 | Change of inventor or designer information |