
CN114626476A - Bird fine-grained image recognition method and device based on Transformer and component feature fusion - Google Patents

Bird fine-grained image recognition method and device based on Transformer and component feature fusion

Info

Publication number
CN114626476A
Authority
CN
China
Prior art keywords
feature
component
attention
map
formula
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210279684.8A
Other languages
Chinese (zh)
Inventor
阮涛 (Ruan Tao)
张海苗 (Zhang Haimiao)
刘畅 (Liu Chang)
邱钧 (Qiu Jun)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Information Science and Technology University
Original Assignee
Beijing Information Science and Technology University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Information Science and Technology University
Priority to CN202210279684.8A
Publication of CN114626476A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/25 Fusion techniques
    • G06F18/253 Fusion techniques of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/048 Activation functions
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Probability & Statistics with Applications (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a method and a device for bird fine-grained image recognition based on Transformer and component feature fusion. The method comprises the following steps: step 1, inputting a preprocessed image into a feature encoder based on a Transformer architecture network, extracting a basic feature map, and inputting the basic feature map into an attention module to generate a component attention map; step 2, performing a bilinear attention pooling operation on the basic feature map and the component attention map to obtain discriminative component features; step 3, concatenating the discriminative component features along the channel dimension to obtain an enhanced feature representation that fuses the discriminative component information; and step 4, inputting the enhanced feature representation into a fully connected layer to complete the mapping to categories, and optimizing the model parameters through a cross-entropy loss and a center loss. The method can achieve high-accuracy recognition of bird images under weak supervision.

Description

Bird fine-grained image recognition method and device based on Transformer and component feature fusion
Technical Field
The invention relates to the technical field of computer vision, in particular to deep learning and fine-grained image recognition, and specifically to a method and a device for fine-grained bird recognition based on Transformer and component feature fusion.
Background
Bird image recognition belongs to the fine-grained image recognition task. Fine-grained image recognition distinguishes different subcategories belonging to the same general category. General image recognition classifies objects into broad categories, such as horses and cats, whose features differ greatly, so the categories are relatively easy to distinguish. In fine-grained images, the differences between objects usually lie in subtle parts, and the same object can show large visual differences due to scale, viewing angle, background, and so on, so recognition is more difficult.
Disclosure of Invention
The invention aims to provide a method and a device for bird fine-grained image recognition based on Transformer and component feature fusion, which can achieve higher recognition accuracy for fine-grained bird images.
In order to achieve the above object, the present invention provides a method for identifying bird fine-grained images based on Transformer and component feature fusion, which comprises:
step 1, inputting a preprocessed image into a feature encoder based on a Transformer architecture network, extracting a basic feature map, inputting the basic feature map into an attention module, and generating a component attention map;
step 2, performing bilinear attention pooling operation on the basic feature diagram and the component attention diagram to obtain a discriminant component feature;
step 3, splicing the discriminative part features on the channel dimension to obtain an enhanced feature representation fused with the discriminative part information;
and 4, inputting the enhancement feature representation into the full-connection layer to complete the mapping of the categories, and optimizing the model parameters through cross entropy loss and central loss.
Further, the method for extracting the basic feature map in step 1 specifically includes:
step 11a, inputting the preprocessed original image I into the feature extraction network f to extract a two-dimensional basic feature map F ∈ R^{(H·W)×D}, where H and W denote the height and width of the basic feature map and D denotes the embedding dimension;
step 12a, reshaping the basic feature map F to obtain a three-dimensional basic feature map F̂ ∈ R^{H×W×D}, as shown in the following formula (1):
F̂ = Reshape(F)   (1)
where Reshape(·) denotes reorganization of the basic feature map.
Further, the method for generating the component attention diagram in step 1 specifically includes:
step 11b, determining the number M of channels of the component attention map to be generated, i.e., the number of component features to be generated;
step 12b, forming an attention module G from a two-dimensional convolution with a 1 × 1 kernel and a Sigmoid function, inputting the feature map F̂ into the attention module G, and generating a component attention map A that represents the distribution of different components of the target object, as shown in the following formula (2):
A = G(F̂)   (2)
where A_i (i = 1, 2, …, M) denotes the attention map of the i-th component of the target object.
Further, the step 2 specifically includes:
step 21, expanding each component attention map A_i to the dimensions of the basic feature map F̂, and then multiplying the expanded attention map A_i with the basic feature map F̂ element by element according to the following formula (3) to obtain the discriminative component feature P_i:
P_i = A_i ⊙ F̂   (3)
where ⊙ denotes element-wise multiplication;
step 22, aggregating each discriminative component feature P_i by global average pooling according to the following formula (4):
h_i = ψ(P_i)   (4)
where h_i denotes the aggregated feature of the i-th component and ψ(·) denotes global average pooling (GAP).
Further, the step 3 specifically includes:
step 31, concatenating the aggregated discriminative component features h_i along the channel dimension to obtain an enhanced feature representation, namely the global component feature Q; this feature fuses the discriminative component information and has stronger expressive power:
Q = Concat(h_1, h_2, …, h_M)   (5)
where Concat(·) denotes feature concatenation;
step 32, after L2-norm normalization of the global component feature Q, feeding it into a fully connected layer to complete the mapping from feature vectors to categories.
Further, the step 4 specifically includes:
step 41, inputting the global component feature Q into the fully connected layer to complete the mapping to bird image categories, and obtaining the cross-entropy loss L_CE between the predicted value and the label, which penalizes the classification result; the loss of a single sample is shown in formula (6):
L_CE = −log P(y′_y)   (6)
where y denotes the category label, y′ denotes the predicted value, and P(y′_y) denotes the Softmax probability assigned to the true category y;
step 42, weakly supervising the generation of the component attention with the center loss of a single sample described by formula (8), so that different component features continuously approach their feature centers:
L_CT = Σ_{i=1}^{M} ||q_i − c_i||₂²   (8)
where q_i is the i-th component feature of the global component feature Q and c_i is the center of the i-th component feature;
step 43, initializing c_i and updating it during training according to the following formula (9):
c_i ← c_i + α(q_i − c_i)   (9)
where α ∈ [0, 1] is the learning rate for updating c_i; the total loss L of the model in the training stage is defined by formula (10):
L = L_CE + L_CT   (10)
the invention also provides a bird fine-grained image recognition device based on Transformer and component feature fusion, which comprises:
a component attention generation unit, which is used for extracting a basic feature map by inputting the preprocessed image into a feature encoder based on a Transformer architecture network, and inputting the basic feature map into an attention module to generate a component attention map;
a discriminative component feature generation unit configured to perform bilinear attention pooling operations on the basic feature map and the component attention map to obtain discriminative component features;
the feature fusion unit is used for splicing the discriminative component features on the channel dimension to obtain enhanced feature representation fused with the discriminative component information;
and the parameter learning optimization unit is used for completing the mapping of the categories by inputting the enhanced feature representation into the full-connection layer, and optimizing the model parameters through cross entropy loss and center loss.
Further, the component attention generating unit includes:
the basic feature map extraction subunit specifically includes:
the two-dimensional basic feature map extraction module, which is used for inputting the preprocessed original image I into the feature extraction network f and extracting a two-dimensional basic feature map F ∈ R^{(H·W)×D}, where H and W denote the height and width of the basic feature map and D denotes the embedding dimension;
the three-dimensional basic feature map module, which is used for reshaping the basic feature map F to obtain a three-dimensional basic feature map F̂ ∈ R^{H×W×D}, as shown in the following formula (1):
F̂ = Reshape(F)   (1)
where Reshape(·) denotes reorganization of the feature map;
a component attention map generation subunit, which is used for determining the number M of channels of the component attention map to be generated, forming an attention module G from a two-dimensional convolution with a 1 × 1 kernel and a Sigmoid function, inputting the feature map F̂ into the attention module G, and generating a component attention map A that represents the distribution of different components of the target object, as shown in the following formula (2):
A = G(F̂)   (2)
where A_i (i = 1, 2, …, M) denotes the attention map of the i-th component of the target object.
Further, the discriminative component feature generation unit specifically includes:
a single discriminative component feature generation module, which is used for expanding each component attention map A_i to match the basic feature map F̂ and then multiplying the expanded attention map A_i with the basic feature map F̂ element by element according to the following formula (3) to obtain the discriminative component feature P_i:
P_i = A_i ⊙ F̂   (3)
where ⊙ denotes element-wise multiplication;
a discriminative component feature fusion module, which is used for aggregating each discriminative component feature P_i by global average pooling according to the following formula (4):
h_i = ψ(P_i)   (4)
where h_i denotes the aggregated feature of the i-th component and ψ(·) denotes global average pooling (GAP).
Further, the parameter learning optimization unit specifically includes:
a sample classification loss acquisition module, which is used for inputting the global component feature Q into the fully connected layer to complete the mapping to bird image categories and obtaining the cross-entropy loss L_CE between the predicted value and the label, which penalizes the classification result; the classification loss of a single sample is shown in formula (6):
L_CE = −log P(y′_y)   (6)
where y denotes the category label, y′ denotes the predicted value, and P(y′_y) denotes the Softmax probability assigned to the true category y;
a component feature center update module, which is used for weakly supervising the generation of the component attention with the center loss of a single sample described by formula (8), initializing c_i and updating it during training according to formula (9); the total loss L of the model in the training stage is defined by formula (10):
L_CT = Σ_{i=1}^{M} ||q_i − c_i||₂²   (8)
c_i ← c_i + α(q_i − c_i)   (9)
L = L_CE + L_CT   (10)
where q_i is the i-th component feature of the global component feature Q, c_i is the center of the i-th component feature, and α ∈ [0, 1] is the learning rate for updating c_i.
Due to the adoption of the technical scheme, the invention has the following advantages:
the method generates the component attention diagram in an attention mode, combines the component attention diagram with a feature extraction network based on a Transformer architecture to realize the fusion of the characteristics of the discriminant component, and can not only focus on the discriminant component, but also obtain the feature representation with better expression capability; in the training stage, the model can realize high identification precision of bird images under weak supervision only by category labels without other labeling information.
Drawings
Fig. 1 is a schematic flow chart of a method according to an embodiment of the present invention.
Fig. 2 is a general model structure diagram corresponding to fig. 1.
Fig. 3 is a diagram of the attention module of fig. 1.
Fig. 4 is a diagram of the feature extraction and fusion process in fig. 1.
Fig. 5 illustrates the effect of the center loss on the performance of the model in fig. 1.
Detailed Description
In order to make the aforementioned objects, features and advantages more comprehensible, the present invention is described in detail below with reference to the accompanying drawings and the detailed description thereof.
Interpretation of terms: in the field of computer vision, a network based on the Transformer architecture is mainly composed of self-attention and multi-layer perceptron blocks; it first divides an image into a number of image patches and then passes them to the subsequent network. The self-attention mechanism in the network enables the extracted feature map to contain global information, which benefits downstream tasks.
As shown in fig. 1, the method for identifying a bird fine-grained image based on fusion of a Transformer and a component feature provided by the embodiment of the invention includes the following steps:
step 1, inputting the preprocessed image into a feature encoder based on a Transformer architecture network, extracting a basic feature map, inputting the basic feature map into an attention module, and generating a component attention map.
Step 2, performing a bilinear attention pooling operation on the basic feature map and the component attention map to obtain the discriminative component features.
Step 3, concatenating the discriminative component features along the channel dimension to obtain an enhanced feature representation that fuses the discriminative component information.
Step 4, inputting the enhanced feature representation into the fully connected layer to complete the mapping to categories, and optimizing the model parameters through the cross-entropy loss and the center loss. The model to which these parameters belong is composed of the feature extraction network f, the constructed attention module G, and a fully connected layer.
Thus, step 1 is used to obtain a base feature map and a discriminative part attention map of an image.
In one embodiment, the method for acquiring the basic feature map of the image in step 1 specifically includes:
step 11a, image preprocessing.
For example: the published bird data sets CUB-200 and NABirds are selected, and the selected bird data sets are divided into training sets and test sets. The following illustrates the specific implementation of image preprocessing at this stage, and the image preprocessing methods for both bird datasets are the same.
Training stage: the training images are first resized to 496 × 496 pixels, a 384 × 384 pixel region is then cropped at random, data augmentation is performed by random horizontal flipping, and finally the image data are normalized with mean [0.485, 0.456, 0.406] and standard deviation [0.229, 0.224, 0.225].
Testing stage: the image is center-cropped to 384 × 384 pixels and normalized in the same way as in the training stage.
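For illustration, this preprocessing can be sketched with standard torchvision transforms; the exact pipeline (in particular the resize before the test-time center crop) is an assumption reconstructed from the text rather than the inventors' reference code:

```python
# Minimal torchvision sketch of the preprocessing described above.
# Assumptions: PIL-image inputs and a resize to 496x496 before the
# test-time center crop, which the text does not state explicitly.
from torchvision import transforms

MEAN = [0.485, 0.456, 0.406]
STD = [0.229, 0.224, 0.225]

train_transform = transforms.Compose([
    transforms.Resize((496, 496)),       # resize training images to 496 x 496
    transforms.RandomCrop(384),          # randomly crop a 384 x 384 region
    transforms.RandomHorizontalFlip(),   # augmentation by random horizontal flipping
    transforms.ToTensor(),
    transforms.Normalize(MEAN, STD),     # normalize with the stated mean and std
])

test_transform = transforms.Compose([
    transforms.Resize((496, 496)),
    transforms.CenterCrop(384),          # center crop at test time
    transforms.ToTensor(),
    transforms.Normalize(MEAN, STD),
])
```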
Step 12a, inputting the preprocessed original image I into the feature extraction network f and extracting a two-dimensional basic feature map F ∈ R^{(H·W)×D}, where H and W denote the height and width of the basic feature map and D denotes the embedding dimension. As shown in fig. 2, the feature extraction network f used in this embodiment is Swin-L, which is based on the Transformer architecture.
Step 13a, reshaping the basic feature map F to obtain a three-dimensional basic feature map F̂ ∈ R^{H×W×D}, so that the basic feature map is adapted to the attention network constructed later and the input dimensions remain consistent.
The method of obtaining the basic feature map of the image in step 1 can therefore be described by the following formula (1):
F̂ = Reshape(F)   (1)
where Reshape(·) denotes reorganization of the feature map.
Alternatively, ResNet may be used to extract the image features; its output is already a three-dimensional feature map and does not require reshaping. However, such a network extracts local features, and its feature expression capability is limited. The feature extraction network f in step 1 may also adopt other existing network structures, as long as a richly expressive basic feature map can be obtained; these are not enumerated here.
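A minimal PyTorch-style sketch of steps 12a and 13a is given below. The Swin-L backbone is only stubbed by a random token tensor; the 12 × 12 token grid and the 1536-dimensional embedding are assumptions chosen to match a 384 × 384 input to a Swin-L encoder, not values stated in the text:

```python
# Sketch of formula (1): reshape the token sequence F of shape (B, H*W, D)
# returned by the Transformer encoder into a 3-D map F_hat of shape (B, H, W, D).
import torch

def reshape_tokens(F: torch.Tensor, H: int, W: int) -> torch.Tensor:
    """Formula (1): F in (B, H*W, D) -> F_hat in (B, H, W, D)."""
    B, N, D = F.shape
    assert N == H * W, "token count must equal H * W"
    return F.reshape(B, H, W, D)

# Dummy tokens standing in for the Swin-L output (assumed 12x12 tokens, D = 1536).
F = torch.randn(2, 12 * 12, 1536)
F_hat = reshape_tokens(F, H=12, W=12)    # (2, 12, 12, 1536)
```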
In one embodiment, the method for obtaining the discriminative component attention map in step 1 specifically includes:
and step 11b, determining the number M of channels of the component attention diagram needing to be generated, namely the number of generated component characteristics, wherein different data sets can be selected according to actual conditions.
Since the number of channels of the component attention map reflects the coverage of the components which focus on the object discrimination by the model, the discrimination performance of the model on the object subtle differences is better when the number of focused components is large. In the case where the equalization model can learn the number of parameters and the accuracy, the M values on the CUB-200-2011 and NABirds datasets can be set to 64 and 32, respectively. The value of M can be selected according to experimental effects on different data sets.
Step 12b, constructing an attention module to generate the component attention map.
In general, an attention module for images is composed of fully connected layers, two-dimensional convolutions, batch normalization, activation functions (e.g., ReLU, Sigmoid, Softmax), and the like, and different attention architectures generate different attention maps that improve model performance to different degrees. Analysis in experiments shows that an attention module G composed of a two-dimensional convolution with a 1 × 1 kernel and a Sigmoid function is better suited to the backbone network of this embodiment; its specific structure is shown in fig. 3, where the 1 × 1 convolution is used to change the number of feature channels to the set number of components. It follows that generating attention by combining a two-dimensional convolution with an activation function is effective not only for convolutional neural networks but also for networks based on the Transformer architecture. The process of generating the component attention map A is shown in the following formula (2):
A = G(F̂)   (2)
where A ∈ R^{H×W×M} and A_i ∈ R^{H×W} (i = 1, 2, …, M) denotes the attention map of the i-th component of the target object, such as the bird's head or torso.
As alternative implementations of obtaining the discriminative component attention map in step 1, step 12b may also use an attention module G composed of a 1 × 1 two-dimensional convolution, two-dimensional batch normalization and a ReLU function, or one composed of a fully connected layer, one-dimensional layer normalization and a Softmax function, or even one composed of a fully connected layer, one-dimensional layer normalization and a ReLU function, without changing step 11b.
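The preferred attention module G of formula (2) can be sketched as follows in PyTorch; the channel-first layout and the example sizes (D = 1536, M = 64) are assumptions for illustration:

```python
# Sketch of the attention module G: a 1x1 two-dimensional convolution followed
# by Sigmoid maps the D feature channels to M component attention maps.
import torch
import torch.nn as nn

class PartAttention(nn.Module):
    def __init__(self, in_dim: int, num_parts: int):
        super().__init__()
        self.conv = nn.Conv2d(in_dim, num_parts, kernel_size=1)  # change channels D -> M
        self.act = nn.Sigmoid()

    def forward(self, f_hat: torch.Tensor) -> torch.Tensor:
        # f_hat: (B, D, H, W) -> A: (B, M, H, W); F_hat from formula (1)
        # would first be permuted from (B, H, W, D) to (B, D, H, W).
        return self.act(self.conv(f_hat))

G = PartAttention(in_dim=1536, num_parts=64)   # M = 64 as used for CUB-200-2011
A = G(torch.randn(2, 1536, 12, 12))            # (2, 64, 12, 12)
```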
As described above, step 2 obtains the discriminative component features by Bilinear Attention Pooling (BAP) of the basic feature map and the component attention map. In one embodiment, as shown in fig. 4, the method for extracting the discriminative component features in step 2 specifically includes:
Step 21, expanding each component attention map A_i to the dimensions of the basic feature map F̂, i.e., repeating A_i along the channel dimension so that its number of channels matches that of F̂, and then multiplying them element by element to obtain the discriminative component feature P_i ∈ R^{H×W×D}; the basic feature map at the discriminative component locations is thereby activated, yielding the discriminative component feature. The specific process is shown in formula (3):
P_i = A_i ⊙ F̂   (3)
where ⊙ denotes element-wise multiplication.
Step 22, for the image classification task, features are typically aggregated by global average pooling (GAP). The discriminative component features obtained in step 21 are aggregated by global average pooling to facilitate the fusion of the component features. The feature aggregation process for the i-th component is defined as follows:
h_i = ψ(P_i)   (4)
where h_i ∈ R^{D} denotes the aggregated feature of the i-th component and ψ(·) denotes global average pooling.
In another embodiment, step 21 can be implemented by directly concatenating the obtained feature map with the attention map along the channel dimension. However, this approach fuses the features without first extracting the discriminative components, so the representation capability of the features is limited.
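A compact sketch of the bilinear attention pooling of formulas (3) and (4), computing all M component features at once by broadcasting, is shown below; the channel-first shapes follow the previous sketch and are an assumption:

```python
# Sketch of bilinear attention pooling: P_i = A_i * F_hat (element-wise),
# followed by global average pooling h_i = GAP(P_i) over the spatial dimensions.
import torch

def bilinear_attention_pooling(f_hat: torch.Tensor, A: torch.Tensor) -> torch.Tensor:
    """f_hat: (B, D, H, W), A: (B, M, H, W) -> component features h: (B, M, D)."""
    P = A.unsqueeze(2) * f_hat.unsqueeze(1)   # formula (3), broadcast to (B, M, D, H, W)
    return P.mean(dim=(-2, -1))               # formula (4): spatial average pooling

h = bilinear_attention_pooling(torch.randn(2, 1536, 12, 12),
                               torch.randn(2, 64, 12, 12))   # (2, 64, 1536)
```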
As described above, step 3 fuses the discriminative component features and specifically includes the following steps:
Step 31, concatenating the aggregated discriminative component features h_i along the channel dimension to obtain an enhanced feature representation, namely the global component feature Q ∈ R^{M·D} shown in the following formula (5); this feature fuses the discriminative component information and has stronger expressive power:
Q = Concat(h_1, h_2, …, h_M)   (5)
where Concat(·) denotes feature concatenation;
Step 32, after L2-norm normalization of the global component feature Q, feeding it into the fully connected layer.
In step 31, feature fusion may also be performed by directly adding the feature maps instead of concatenating them along the channel dimension.
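Steps 31 and 32 can be sketched as follows; the class count of 200 (CUB-200-2011) is an assumption for illustration:

```python
# Sketch of formula (5) and the subsequent L2 normalization and fully connected
# mapping: the M pooled component features are concatenated into the global
# component feature Q, normalized, and mapped to class logits.
import torch
import torch.nn as nn
import torch.nn.functional as F

num_parts, embed_dim, num_classes = 64, 1536, 200
fc = nn.Linear(num_parts * embed_dim, num_classes)

h = torch.randn(2, num_parts, embed_dim)   # pooled component features h_1 ... h_M
Q = h.reshape(h.size(0), -1)               # formula (5): concatenation along channels
Q = F.normalize(Q, p=2, dim=1)             # L2-norm normalization
logits = fc(Q)                             # mapping to the categories, shape (2, 200)
```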
In one embodiment, step 4 specifically includes:
and 41, forming a classification network of the model by the full connection layer and the Softmax. Inputting the global component characteristics Q into a full connection layer, completing the mapping of bird image categories, and obtaining the Cross entropy (Cross entropy) loss of a predicted value and a label
Figure BDA0003556345140000084
And the method is used for punishing the classification result and measuring the difference between the classes. The classification loss of a single sample is shown as the formula (6):
Figure BDA0003556345140000085
where y represents a category label previously marked in the image, such as 0, 1, 2, …, y' represents a predicted value obtained after the component feature Q is input into the full-link layer, and P represents a classification probability of 0-1 after being processed by Softmax, which can be represented by equation (7):
Figure BDA0003556345140000091
of formula (II) to (III)'jAnd C is the total number of the categories in the data set.
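Formulas (6) and (7) correspond to standard Softmax followed by cross-entropy; a minimal sketch using PyTorch's fused implementation is:

```python
# Sketch of formulas (6)-(7): Softmax over the class logits and the
# cross-entropy loss between the predicted values y' and the labels y.
import torch
import torch.nn.functional as F

logits = torch.randn(2, 200)                # predicted values y' for C = 200 classes
labels = torch.tensor([3, 42])              # category labels y
probs = F.softmax(logits, dim=1)            # formula (7): class probabilities P
ce_loss = F.cross_entropy(logits, labels)   # formula (6), averaged over the batch
```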
Step 42, to avoid homogenization of the component attention maps during model training, i.e., to ensure that different attention channels represent different target components, a center loss function is adopted to weakly supervise the generated component attention, constraining the component features Q to continuously approach their feature centers. During model training, the center loss makes features of the same target component as similar as possible while features of different components differ more. The center loss of a single sample is defined as follows:
L_CT = Σ_{i=1}^{M} ||q_i − c_i||₂²   (8)
where q_i ∈ R^{D} is the i-th component feature of the global component feature Q and c_i ∈ R^{D} is the feature center of the i-th component. c_i is initialized to 0 and updated during model training as follows:
c_i ← c_i + α(q_i − c_i)   (9)
where α ∈ [0, 1] is the learning rate for updating c_i. Experiments show that better results are obtained when α = 0.05. The total loss L of the model in the training stage is defined as follows:
L = L_CE + L_CT   (10)
In the testing stage, the model uses only the cross-entropy loss as the overall loss.
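A minimal sketch of the center loss, the center update of formula (9) with α = 0.05, and the total loss of formula (10) is given below; it is reconstructed from the text (the per-sample update of formula (9) is approximated here by a batch-mean update) and is not the inventors' reference implementation:

```python
# Sketch of formulas (8)-(10): the center loss pulls each component feature q_i
# toward its running center c_i, the centers are updated with learning rate
# alpha, and the total training loss is the sum of cross-entropy and center loss.
import torch

class PartCenterLoss:
    def __init__(self, num_parts: int, feat_dim: int, alpha: float = 0.05):
        self.centers = torch.zeros(num_parts, feat_dim)   # c_i initialized to 0
        self.alpha = alpha

    def __call__(self, q: torch.Tensor) -> torch.Tensor:
        # q: (B, M, D) component features; formula (8), averaged over the batch
        loss = ((q - self.centers) ** 2).sum(dim=(1, 2)).mean()
        with torch.no_grad():                             # formula (9), batch-mean form
            self.centers += self.alpha * (q.mean(dim=0) - self.centers)
        return loss

center_loss_fn = PartCenterLoss(num_parts=64, feat_dim=1536)
q = torch.randn(2, 64, 1536)
ce_loss = torch.tensor(1.2)                               # placeholder classification loss
total_loss = ce_loss + center_loss_fn(q)                  # formula (10): L = L_CE + L_CT
```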
The embodiment of the invention also provides a bird fine-grained image recognition device based on Transformer and component feature fusion, which comprises a component attention generation unit, a discriminant component feature generation unit, a feature fusion unit and a parameter learning optimization unit, wherein:
the component attention generating unit is used for inputting the preprocessed image into a feature encoder based on a Transformer architecture network, extracting a basic feature map, inputting the basic feature map into an attention module, and generating a component attention map.
And the discriminant part feature generation unit is used for performing bilinear attention pooling operation on the basic feature map and the part attention map to obtain discriminant part features.
The feature fusion unit is used for splicing the discriminative component features on the channel dimension to obtain the enhanced feature representation fused with the discriminative component information.
And the parameter learning optimization unit is used for completing the mapping of the categories by inputting the enhanced feature representation into the full-connection layer and optimizing the model parameters through cross entropy loss and central loss.
In one embodiment, the component attention generating unit includes a base feature map extracting sub-unit and a component attention map generating sub-unit.
The two-dimensional basic feature map extraction module is used for inputting the preprocessed original image I into the feature extraction network f and extracting a two-dimensional basic feature map F ∈ R^{(H·W)×D}, where H and W denote the height and width of the basic feature map and D denotes the embedding dimension. The three-dimensional basic feature map module is used for reshaping the basic feature map F to obtain a three-dimensional basic feature map F̂ ∈ R^{H×W×D}, as shown in the following formula (1):
F̂ = Reshape(F)   (1)
where Reshape(·) denotes reorganization of the feature map.
The component attention map generation subunit is used for determining the number M of channels of the component attention map to be generated, forming an attention module G from a two-dimensional convolution with a 1 × 1 kernel and a Sigmoid function, inputting the feature map F̂ into the attention module G, and generating a component attention map A that represents the distribution of different components of the target object, as shown in the following formula (2):
A = G(F̂)   (2)
where A_i (i = 1, 2, …, M) denotes the attention map of the i-th component of the target object.
In one embodiment, the discriminative component feature generation unit specifically includes a single discriminative component feature generation module and a discriminative component feature fusion module, wherein:
The single discriminative component feature generation module is used for expanding each component attention map A_i to match the basic feature map F̂ and then multiplying the expanded attention map A_i with the basic feature map F̂ element by element according to the following formula (3) to obtain the discriminative component feature P_i:
P_i = A_i ⊙ F̂   (3)
where ⊙ denotes element-wise multiplication.
The discriminative component feature fusion module is used for aggregating each discriminative component feature P_i by global average pooling according to the following formula (4):
h_i = ψ(P_i)   (4)
where h_i denotes the aggregated feature of the i-th component and ψ(·) denotes global average pooling (GAP).
In one embodiment, the parameter learning optimization unit specifically includes a single sample loss acquisition module and a central update module of component features, wherein:
the sample classification loss acquisition module is used for inputting global component characteristics Q into the full-connection layer to complete the mapping of bird image categories and obtain the cross entropy loss of a predicted value and a label
Figure BDA0003556345140000111
For punishing the classification result, the classification loss of a single sample is shown as formula (6):
Figure BDA0003556345140000112
in the formula, y represents a category label, y' represents a predicted value, and P represents a probability after the Softmax processing.
The central updating module of the component features is used for weakly supervising the generation process of the attention of the component by adopting the central loss of a single sample described by the formula (8) and initializing ciThe model is updated in the training process according to the following formula (9), and the total loss of the model in the training stage
Figure BDA0003556345140000113
The following (10) is defined:
Figure BDA0003556345140000114
ci←ci+α(qi-ci) (9)
Figure BDA0003556345140000115
in the formula, qiIs the ith component feature of the global component feature Q, ciIs the center of the ith part feature, α ∈ [0, 1 ]]Is ciAn updated learning rate.
In practical use, the input image is firstly preprocessed as in the above embodiment, then trained model parameters are loaded, and finally the preprocessed image is input into the model, so as to output the class probability.
To verify that the center loss improves model performance, Grad-CAM is used to visualize the feature map output by the last layer of the feature extraction network; the result is shown in fig. 5. Without the center loss, the high-energy regions of the heat map are sporadically distributed over the birds' bodies or contain many background areas, whereas with the center loss they are more concentrated on the birds' bodies, indicating that these regions have a greater impact on the classification result and lead to better classification.
By adopting the method provided by the invention, the discriminative components can be attended to, and high-accuracy recognition of fine-grained bird images is achieved under weak supervision.
Finally, it should be pointed out that: the above examples are only for illustrating the technical solutions of the present invention, and are not limited thereto. Those of ordinary skill in the art will understand that: modifications can be made to the technical solutions described in the foregoing embodiments, or some technical features may be equivalently replaced; such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims (10)

1. A bird fine-grained image identification method based on Transformer and component feature fusion is characterized by comprising the following steps:
step 1, inputting a preprocessed image into a feature encoder based on a Transformer architecture network, extracting a basic feature map, inputting the basic feature map into an attention module, and generating a component attention map;
step 2, performing bilinear attention pooling operation on the basic feature diagram and the component attention diagram to obtain a discriminant component feature;
step 3, splicing the discriminative part features on the channel dimension to obtain an enhanced feature representation fused with the discriminative part information;
and 4, inputting the enhancement feature representation into the full-connection layer to complete the mapping of the categories, and optimizing the model parameters through cross entropy loss and central loss.
2. The method for identifying bird fine-grained images based on Transformer and component feature fusion as claimed in claim 1, wherein the method for extracting the basic feature map in the step 1 specifically comprises:
step 11a, inputting the preprocessed original image I into the feature extraction network f to extract a two-dimensional basic feature map F ∈ R^{(H·W)×D}, where H and W denote the height and width of the basic feature map and D denotes the embedding dimension;
step 12a, reshaping the basic feature map F to obtain a three-dimensional basic feature map F̂ ∈ R^{H×W×D}, as shown in the following formula (1):
F̂ = Reshape(F)   (1)
where Reshape(·) denotes reorganization of the basic feature map.
3. The method for identifying bird fine-grained images based on Transformer and component feature fusion according to claim 1 or 2, wherein the method for generating the component attention map in the step 1 specifically comprises the following steps:
step 11b, determining the number M of channels of the component attention map to be generated, i.e., the number of component features to be generated;
step 12b, forming an attention module G from a two-dimensional convolution with a 1 × 1 kernel and a Sigmoid function, inputting the feature map F̂ into the attention module G, and generating a component attention map A that represents the distribution of different components of the target object, as shown in the following formula (2):
A = G(F̂)   (2)
where A_i (i = 1, 2, …, M) denotes the attention map of the i-th component of the target object.
4. The method for identifying bird fine-grained images based on Transformer and component feature fusion according to claim 3, wherein the step 2 specifically comprises the following steps:
step 21, expanding each component attention map A_i to the dimensions of the basic feature map F̂, and then multiplying the expanded attention map A_i with the basic feature map F̂ element by element according to the following formula (3) to obtain the discriminative component feature P_i:
P_i = A_i ⊙ F̂   (3)
where ⊙ denotes element-wise multiplication;
step 22, aggregating each discriminative component feature P_i by global average pooling according to the following formula (4):
h_i = ψ(P_i)   (4)
where h_i denotes the aggregated feature of the i-th component and ψ(·) denotes global average pooling (GAP).
5. The method for identifying bird fine-grained images based on Transformer and component feature fusion according to claim 4, wherein the step 3 specifically comprises the following steps:
step 31, concatenating the aggregated discriminative component features h_i along the channel dimension to obtain an enhanced feature representation, namely the global component feature Q; this feature fuses the discriminative component information and has stronger expressive power:
Q = Concat(h_1, h_2, …, h_M)   (5)
where Concat(·) denotes feature concatenation;
step 32, after L2-norm normalization of the global component feature Q, feeding it into a fully connected layer to complete the mapping from feature vectors to categories.
6. The method for identifying bird fine-grained images based on the fusion of Transformer and component features according to any one of claims 1 to 5, wherein the step 4 specifically comprises the following steps:
step 41, inputting the global component feature Q into the fully connected layer to complete the mapping to bird image categories, and obtaining the cross-entropy loss L_CE between the predicted value and the label, which penalizes the classification result; the loss of a single sample is shown in formula (6):
L_CE = −log P(y′_y)   (6)
where y denotes the category label, y′ denotes the predicted value, and P(y′_y) denotes the Softmax probability assigned to the true category y;
step 42, weakly supervising the generation of the component attention with the center loss of a single sample described by formula (8), so that different component features continuously approach their feature centers:
L_CT = Σ_{i=1}^{M} ||q_i − c_i||₂²   (8)
where q_i is the i-th component feature of the global component feature Q and c_i is the center of the i-th component feature;
step 43, initializing c_i and updating it during training according to the following formula (9):
c_i ← c_i + α(q_i − c_i)   (9)
where α ∈ [0, 1] is the learning rate for updating c_i; the total loss L of the model in the training stage is defined by formula (10):
L = L_CE + L_CT   (10)
7. a bird fine-grained image recognition device based on Transformer and component feature fusion is characterized by comprising:
a component attention generation unit, which is used for extracting a basic feature map by inputting the preprocessed image into a feature encoder based on a Transformer architecture network, and inputting the basic feature map into an attention module to generate a component attention map;
a discriminative component feature generation unit configured to perform bilinear attention pooling on the basic feature map and the component attention map to obtain discriminative component features;
the feature fusion unit is used for splicing the discriminative component features on the channel dimension to obtain enhanced feature representation fused with the discriminative component information;
and the parameter learning optimization unit is used for completing the mapping of the categories by inputting the enhanced feature representation into the full-connection layer, and optimizing the model parameters through cross entropy loss and center loss.
8. The device for bird fine-grained image recognition based on Transformer and component feature fusion according to claim 7, wherein the component attention generating unit comprises:
the basic feature map extraction subunit specifically includes:
the two-dimensional basic feature map extraction module, which is used for inputting the preprocessed original image I into the feature extraction network f and extracting a two-dimensional basic feature map F ∈ R^{(H·W)×D}, where H and W denote the height and width of the basic feature map and D denotes the embedding dimension;
the three-dimensional basic feature map module, which is used for reshaping the basic feature map F to obtain a three-dimensional basic feature map F̂ ∈ R^{H×W×D}, as shown in the following formula (1):
F̂ = Reshape(F)   (1)
where Reshape(·) denotes reorganization of the feature map;
a component attention map generation subunit, which is used for determining the number M of channels of the component attention map to be generated, forming an attention module G from a two-dimensional convolution with a 1 × 1 kernel and a Sigmoid function, inputting the feature map F̂ into the attention module G, and generating a component attention map A that represents the distribution of different components of the target object, as shown in the following formula (2):
A = G(F̂)   (2)
where A_i (i = 1, 2, …, M) denotes the attention map of the i-th component of the target object.
9. The apparatus for bird fine-grained image recognition based on transform and component feature fusion as claimed in claim 7, wherein said discriminant component feature generation unit specifically comprises:
single discriminationA characteristic component feature generation module for drawing different component attention diagrams AiIs expanded to conform to the base feature map F, and the expanded component attention map a is then appliediMultiplying the feature value of the feature value P by the basic feature map F element by element according to the following formula (3)i
Figure FDA0003556345130000042
In the formula, "indicates element-by-element multiplication operation;
a discriminative part feature fusion module for fusing discriminative part features PiThe aggregation operation is performed according to the global average pooling provided by the following formula (4), fusing each discriminant part feature Pi
hi=ψ(Pi) (4)
In the formula, hiRepresents the aggregated characteristics of the ith part, and ψ (·) represents the Global Average Pooling (GAP).
10. The apparatus for bird fine-grained image recognition based on Transformer and component feature fusion according to claim 4, wherein the parameter learning optimization unit specifically comprises:
a sample classification loss acquisition module, which is used for inputting the global component feature Q into the fully connected layer to complete the mapping to bird image categories and obtaining the cross-entropy loss L_CE between the predicted value and the label, which penalizes the classification result; the classification loss of a single sample is shown in formula (6):
L_CE = −log P(y′_y)   (6)
where y denotes the category label, y′ denotes the predicted value, and P(y′_y) denotes the Softmax probability assigned to the true category y;
a component feature center update module, which is used for weakly supervising the generation of the component attention with the center loss of a single sample described by formula (8), initializing c_i and updating it during training according to formula (9); the total loss L of the model in the training stage is defined by formula (10):
L_CT = Σ_{i=1}^{M} ||q_i − c_i||₂²   (8)
c_i ← c_i + α(q_i − c_i)   (9)
L = L_CE + L_CT   (10)
where q_i is the i-th component feature of the global component feature Q, c_i is the center of the i-th component feature, and α ∈ [0, 1] is the learning rate for updating c_i.
CN202210279684.8A 2022-03-21 2022-03-21 Bird fine-grained image recognition method and device based on Transformer and component feature fusion Pending CN114626476A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210279684.8A CN114626476A (en) 2022-03-21 2022-03-21 Bird fine-grained image recognition method and device based on Transformer and component feature fusion

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210279684.8A CN114626476A (en) 2022-03-21 2022-03-21 Bird fine-grained image recognition method and device based on Transformer and component feature fusion

Publications (1)

Publication Number Publication Date
CN114626476A true CN114626476A (en) 2022-06-14

Family

ID=81903433

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210279684.8A Pending CN114626476A (en) 2022-03-21 2022-03-21 Bird fine-grained image recognition method and device based on Transformer and component feature fusion

Country Status (1)

Country Link
CN (1) CN114626476A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115035389A (en) * 2022-08-10 2022-09-09 华东交通大学 Fine-grained image identification method and device based on reliability evaluation and iterative learning
CN115471724A (en) * 2022-11-02 2022-12-13 青岛杰瑞工控技术有限公司 Fine-grained fish epidemic disease identification fusion algorithm based on self-adaptive normalization
CN117853875A (en) * 2024-03-04 2024-04-09 华东交通大学 Fine-granularity image recognition method and system

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20200151424A1 (en) * 2018-11-09 2020-05-14 Sap Se Landmark-free face attribute prediction
CN111523534A (en) * 2020-03-31 2020-08-11 华东师范大学 Image description method
CN112232293A (en) * 2020-11-09 2021-01-15 腾讯科技(深圳)有限公司 Image processing model training method, image processing method and related equipment
CN112381830A (en) * 2020-10-23 2021-02-19 山东黄河三角洲国家级自然保护区管理委员会 Method and device for extracting bird key parts based on YCbCr superpixels and graph cut
EP3923185A2 (en) * 2021-03-03 2021-12-15 Beijing Baidu Netcom Science And Technology Co., Ltd. Image classification method and apparatus, electronic device and storage medium
CN113902948A (en) * 2021-10-09 2022-01-07 中国人民解放军陆军工程大学 Fine-grained image classification method and system based on double-branch network
CN114140353A (en) * 2021-11-25 2022-03-04 苏州大学 Swin-Transformer image denoising method and system based on channel attention

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20200151424A1 (en) * 2018-11-09 2020-05-14 Sap Se Landmark-free face attribute prediction
CN111523534A (en) * 2020-03-31 2020-08-11 华东师范大学 Image description method
CN112381830A (en) * 2020-10-23 2021-02-19 山东黄河三角洲国家级自然保护区管理委员会 Method and device for extracting bird key parts based on YCbCr superpixels and graph cut
CN112232293A (en) * 2020-11-09 2021-01-15 腾讯科技(深圳)有限公司 Image processing model training method, image processing method and related equipment
EP3923185A2 (en) * 2021-03-03 2021-12-15 Beijing Baidu Netcom Science And Technology Co., Ltd. Image classification method and apparatus, electronic device and storage medium
CN113902948A (en) * 2021-10-09 2022-01-07 中国人民解放军陆军工程大学 Fine-grained image classification method and system based on double-branch network
CN114140353A (en) * 2021-11-25 2022-03-04 苏州大学 Swin-Transformer image denoising method and system based on channel attention

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
MENGZE LI: "Multi-task attribute-fusion model for fine-grained image recognition", CONFERENCE ON OPTOELECTRONIC IMAGING AND MULTIMEDIA TECHNOLOGY VII, 10 October 2020 (2020-10-10), pages 1 - 15500 *

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115035389A (en) * 2022-08-10 2022-09-09 华东交通大学 Fine-grained image identification method and device based on reliability evaluation and iterative learning
CN115035389B (en) * 2022-08-10 2022-10-25 华东交通大学 Fine-grained image identification method and device based on reliability evaluation and iterative learning
CN115471724A (en) * 2022-11-02 2022-12-13 青岛杰瑞工控技术有限公司 Fine-grained fish epidemic disease identification fusion algorithm based on self-adaptive normalization
CN117853875A (en) * 2024-03-04 2024-04-09 华东交通大学 Fine-granularity image recognition method and system
CN117853875B (en) * 2024-03-04 2024-05-14 华东交通大学 Fine-granularity image recognition method and system

Similar Documents

Publication Publication Date Title
CN110543892B (en) Part identification method based on multilayer random forest
CN110443143B (en) Multi-branch convolutional neural network fused remote sensing image scene classification method
CN111583263B (en) Point cloud segmentation method based on joint dynamic graph convolution
CN111476806B (en) Image processing method, image processing device, computer equipment and storage medium
CN112801015B (en) Multi-mode face recognition method based on attention mechanism
CN114626476A (en) Bird fine-grained image recognition method and device based on Transformer and component feature fusion
CN108960059A (en) A kind of video actions recognition methods and device
CN110222767B (en) Three-dimensional point cloud classification method based on nested neural network and grid map
CN114821014B (en) Multi-mode and countermeasure learning-based multi-task target detection and identification method and device
CN111652273B (en) Deep learning-based RGB-D image classification method
CN106408037A (en) Image recognition method and apparatus
CN109740539B (en) 3D object identification method based on ultralimit learning machine and fusion convolution network
CN113221987A (en) Small sample target detection method based on cross attention mechanism
CN112329771B (en) Deep learning-based building material sample identification method
CN114283325A (en) Underwater target identification method based on knowledge distillation
CN114187506B (en) Remote sensing image scene classification method of viewpoint-aware dynamic routing capsule network
CN112149526A (en) Lane line detection method and system based on long-distance information fusion
CN114926691A (en) Insect pest intelligent identification method and system based on convolutional neural network
CN113496260B (en) Grain depot personnel non-standard operation detection method based on improved YOLOv3 algorithm
CN114494773A (en) Part sorting and identifying system and method based on deep learning
CN113011506A (en) Texture image classification method based on depth re-fractal spectrum network
CN111143544B (en) Method and device for extracting bar graph information based on neural network
CN117036904A (en) Attention-guided semi-supervised corn hyperspectral image data expansion method
CN111553437A (en) Neural network based image classification method
CN111046861B (en) Method for identifying infrared image, method for constructing identification model and application

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB03 Change of inventor or designer information

Inventor after: Zhang Haimiao

Inventor after: Liu Chang

Inventor after: Qiu Jun

Inventor after: Ruan Tao

Inventor before: Ruan Tao

Inventor before: Zhang Haimiao

Inventor before: Liu Chang

Inventor before: Qiu Jun
