CN114626476A - Bird fine-grained image recognition method and device based on Transformer and component feature fusion - Google Patents
Bird fine-grained image recognition method and device based on Transformer and component feature fusion
- Publication number
- CN114626476A CN114626476A CN202210279684.8A CN202210279684A CN114626476A CN 114626476 A CN114626476 A CN 114626476A CN 202210279684 A CN202210279684 A CN 202210279684A CN 114626476 A CN114626476 A CN 114626476A
- Authority
- CN
- China
- Prior art keywords
- feature
- component
- attention
- map
- formula
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/241—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
- G06F18/2415—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/25—Fusion techniques
- G06F18/253—Fusion techniques of extracted features
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/048—Activation functions
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
Abstract
The invention discloses a bird fine-grained image recognition method and device based on Transformer and component feature fusion. The method comprises the following steps: step 1, inputting the preprocessed image into a feature encoder based on a Transformer architecture network to extract a basic feature map, and inputting the basic feature map into an attention module to generate a component attention map; step 2, performing bilinear attention pooling on the basic feature map and the component attention map to obtain discriminative component features; step 3, concatenating the discriminative component features along the channel dimension to obtain an enhanced feature representation fused with discriminative component information; and step 4, inputting the enhanced feature representation into the full connection layer to complete the mapping to categories, and optimizing the model parameters through cross entropy loss and center loss. The method can achieve high-precision recognition of bird images under weak supervision.
Description
Technical Field
The invention relates to the technical field of computer vision, in particular to deep learning and fine-grained image recognition technology, and more particularly to a bird fine-grained image recognition method and device based on Transformer and component feature fusion.
Background
Bird image recognition belongs to the fine-grained image recognition task. Fine-grained image recognition distinguishes different subclasses within the same general class. Generic image recognition classifies objects into broad categories, such as horses and cats, whose features differ greatly, so the categories are relatively easy to separate. In fine-grained images, the differences between objects lie in subtle parts, while the same object can show large visual variation due to scale, viewing angle, background, and so on, which makes recognition considerably harder.
Disclosure of Invention
The invention aims to provide a bird fine-grained image recognition method and device based on Transformer and component feature fusion, which can achieve higher recognition accuracy on bird fine-grained images.
In order to achieve the above object, the present invention provides a bird fine-grained image recognition method based on Transformer and component feature fusion, which comprises:
step 1, inputting the preprocessed image into a feature encoder based on a Transformer architecture network to extract a basic feature map, and inputting the basic feature map into an attention module to generate a component attention map;
step 2, performing bilinear attention pooling on the basic feature map and the component attention map to obtain discriminative component features;
step 3, concatenating the discriminative component features along the channel dimension to obtain an enhanced feature representation fused with discriminative component information;
and step 4, inputting the enhanced feature representation into the full connection layer to complete the mapping to categories, and optimizing the model parameters through cross entropy loss and center loss.
Further, the method for extracting the basic feature map in step 1 specifically includes:
step 11a, inputting the preprocessed original image I into a feature extraction network f and extracting a two-dimensional basic feature map F, where F ∈ (H·W) × D, H and W respectively denote the height and width of the basic feature map, and D denotes the size of the embedding dimension;
step 12a, reshaping the basic feature map F to obtain a three-dimensional basic feature map F̂ ∈ H × W × D; the process is shown in the following formula (1):
F̂ = reshape(F)  (1)
in the formula, reshape(·) represents reorganization of the basic feature map.
Further, the method for generating the component attention diagram in step 1 specifically includes:
step 11b, determining the number M of channels of the component attention map to be generated, i.e., the number of component features to be generated;
step 12b, forming an attention module G from a two-dimensional convolution with a 1 × 1 convolution kernel and a Sigmoid function, inputting the feature map F̂ into the attention module G, and generating a component attention map A representing the distribution of different components of the target object, as shown in the following formula (2):
A = G(F̂)  (2)
in the formula, A_i (i = 1, 2, …, M) represents the ith component attention map of the target object.
Further, the step 2 specifically includes:
step 21, expanding each component attention map A_i to the dimensions of the basic feature map F̂, and then multiplying the expanded attention map A_i with the basic feature map F̂ element by element according to the following formula (3) to obtain the discriminative component feature P_i:
P_i = A_i ⊙ F̂  (3)
in the formula, "⊙" indicates element-wise multiplication;
step 22, aggregating each discriminative component feature P_i by global average pooling according to the following formula (4):
h_i = ψ(P_i)  (4)
in the formula, h_i represents the aggregated feature of the ith component, and ψ(·) represents global average pooling (GAP).
Further, the step 3 specifically includes:
step 31, concatenating the aggregated discriminative component features h_i along the channel dimension to obtain an enhanced feature representation, namely the global component feature Q; this feature fuses discriminative component information and has stronger expressive power:
Q = Concate(h_1, h_2, …, h_M)  (5)
in the formula, Concate(·) represents feature concatenation;
step 32, after L2-norm normalization of the global component feature Q, passing it into the full connection layer to complete the mapping from feature vectors to categories.
Further, the step 4 specifically includes:
step 41, inputting the global component feature Q into the full connection layer to complete the mapping to bird image categories, and obtaining the cross entropy loss between the predicted value and the label, which is used to penalize the classification result; the loss of a single sample is shown in formula (6):
in the formula, y represents the category label, y' represents the predicted value, and P represents the probability after Softmax processing;
step 42, the center loss of a single sample described by formula (8) is used to weakly supervise the generation of the component attention, so that each component feature continuously approaches its feature center:
in the formula, q_i is the ith component feature of the global component feature Q, and c_i is the center of the ith component feature;
step 43, initializing c_i and updating it during training according to the following formula (9):
c_i ← c_i + α(q_i − c_i)  (9)
in the formula, α ∈ [0, 1] is the learning rate for updating c_i, and the total loss of the model during the training phase is defined by formula (10):
the invention also provides a bird fine-grained image recognition device based on Transformer and component feature fusion, which comprises:
a component attention generation unit, which is used for extracting a basic feature map by inputting the preprocessed image into a feature encoder based on a Transformer architecture network, and inputting the basic feature map into an attention module to generate a component attention map;
a discriminative component feature generation unit configured to perform bilinear attention pooling operations on the basic feature map and the component attention map to obtain discriminative component features;
the feature fusion unit is used for splicing the discriminative component features on the channel dimension to obtain enhanced feature representation fused with the discriminative component information;
and the parameter learning optimization unit is used for completing the mapping of the categories by inputting the enhanced feature representation into the full-connection layer, and optimizing the model parameters through cross entropy loss and center loss.
Further, the component attention generating unit includes:
the basic feature map extraction subunit, which specifically includes:
the two-dimensional basic feature map extraction module, used for inputting the preprocessed original image I into a feature extraction network f and extracting a two-dimensional basic feature map F, where F ∈ (H·W) × D, H and W respectively denote the height and width of the basic feature map, and D denotes the size of the embedding dimension;
the three-dimensional basic feature map module, used for reshaping the basic feature map F to obtain a three-dimensional basic feature map F̂; the process is shown in the following formula (1):
F̂ = reshape(F)  (1)
in the formula, reshape(·) indicates that the feature map is reorganized;
the component attention map generation subunit, used for determining the number M of channels of the component attention map to be generated; an attention module G is formed by a two-dimensional convolution with a 1 × 1 convolution kernel and a Sigmoid function, the feature map F̂ is input into the attention module G, and a component attention map A representing the distribution of different components of the target object is generated, as shown in the following formula (2):
A = G(F̂)  (2)
in the formula, A_i (i = 1, 2, …, M) represents the ith component attention map of the target object.
Further, the discriminative component feature generation unit specifically includes:
a single discriminative component feature generation module, which expands each component attention map A_i to match the dimensions of the basic feature map F̂ and then multiplies the expanded attention map A_i with the basic feature map F̂ element by element according to the following formula (3) to obtain the discriminative component feature P_i:
P_i = A_i ⊙ F̂  (3)
in the formula, "⊙" indicates element-wise multiplication;
a discriminative component feature fusion module, which aggregates each discriminative component feature P_i by global average pooling according to the following formula (4):
h_i = ψ(P_i)  (4)
in the formula, h_i represents the aggregated feature of the ith component, and ψ(·) represents global average pooling (GAP).
Further, the parameter learning optimization unit specifically includes:
a sample classification loss acquisition module, which inputs the global component feature Q into the full connection layer to complete the mapping to bird image categories and obtains the cross entropy loss between the predicted value and the label, used to penalize the classification result; the classification loss of a single sample is shown in formula (6):
in the formula, y represents the category label, y' represents the predicted value, and P represents the probability after Softmax processing;
a component feature center update module, which uses the center loss of a single sample described by formula (8) to weakly supervise the generation of the component attention, initializes c_i, and updates it during training according to the following formula (9); the total loss of the model in the training phase is defined by formula (10):
c_i ← c_i + α(q_i − c_i)  (9)
in the formula, q_i is the ith component feature of the global component feature Q, c_i is the center of the ith component feature, and α ∈ [0, 1] is the learning rate for updating c_i.
Due to the adoption of the technical scheme, the invention has the following advantages:
the method generates the component attention diagram in an attention mode, combines the component attention diagram with a feature extraction network based on a Transformer architecture to realize the fusion of the characteristics of the discriminant component, and can not only focus on the discriminant component, but also obtain the feature representation with better expression capability; in the training stage, the model can realize high identification precision of bird images under weak supervision only by category labels without other labeling information.
Drawings
Fig. 1 is a schematic flow chart of a method according to an embodiment of the present invention.
Fig. 2 is a general model structure diagram corresponding to fig. 1.
Fig. 3 is a diagram of the attention module of fig. 1.
Fig. 4 is a diagram of the extraction and fusion process of the feature in fig. 1.
FIG. 5 is a graph illustrating the effect of center loss on model performance in FIG. 1.
Detailed Description
In order to make the aforementioned objects, features and advantages more comprehensible, the present invention is described in detail below with reference to the accompanying drawings and the detailed description thereof.
Interpretation of terms: in the field of computer vision, a network based on the Transformer architecture is mainly composed of multi-layer perceptrons; it first divides an image into a plurality of image patches and then passes the patches to the subsequent network. The self-attention mechanism in the network enables the extracted feature map to contain global information, which is beneficial to downstream tasks.
As shown in fig. 1, the method for identifying a bird fine-grained image based on fusion of a Transformer and a component feature provided by the embodiment of the invention includes the following steps:
Step 1, inputting the preprocessed image into a feature encoder based on a Transformer architecture network to extract a basic feature map, and inputting the basic feature map into an attention module to generate a component attention map.
Step 2, performing bilinear attention pooling on the basic feature map and the component attention map to obtain the discriminative component features.
Step 3, concatenating the discriminative component features along the channel dimension to obtain an enhanced feature representation fused with discriminative component information.
Step 4, inputting the enhanced feature representation into the full connection layer to complete the mapping to categories, and optimizing the model parameters through cross entropy loss and center loss. The model whose parameters are optimized consists of the feature extraction network f, the constructed attention module G, and a full connection layer.
Thus, step 1 is used to obtain a base feature map and a discriminative part attention map of an image.
In one embodiment, the method for acquiring the basic feature map of the image in step 1 specifically includes:
step 11a, image preprocessing.
For example: the published bird data sets CUB-200 and NABirds are selected, and the selected bird data sets are divided into training sets and test sets. The following illustrates the specific implementation of image preprocessing at this stage, and the image preprocessing methods for both bird datasets are the same.
Training stage: the training images are first resized to 496 × 496 pixels, a 384 × 384 pixel region is then cropped at random, data augmentation is performed by random horizontal flipping, and finally the image data are normalized with mean [0.485, 0.456, 0.406] and standard deviation [0.229, 0.224, 0.225].
Testing stage: the image is center-cropped to 384 × 384 pixels and normalized in the same way as in the training stage.
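For concreteness, the preprocessing described above can be sketched with torchvision as follows; the test-stage resize to 496 × 496 before the center crop is an assumption, since the text only specifies the 384 × 384 crop and the normalization statistics.

```python
import torchvision.transforms as T

# Normalization statistics given in the text (standard ImageNet mean/std).
MEAN, STD = [0.485, 0.456, 0.406], [0.229, 0.224, 0.225]

# Training: resize to 496x496, random 384x384 crop, random horizontal flip, normalize.
train_transform = T.Compose([
    T.Resize((496, 496)),
    T.RandomCrop(384),
    T.RandomHorizontalFlip(),
    T.ToTensor(),
    T.Normalize(MEAN, STD),
])

# Testing: resize (assumed 496x496) then center-crop 384x384, same normalization.
test_transform = T.Compose([
    T.Resize((496, 496)),
    T.CenterCrop(384),
    T.ToTensor(),
    T.Normalize(MEAN, STD),
])
```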
Step 12a, the preprocessed original image I is input into the feature extraction network f to extract a two-dimensional basic feature map F, where F ∈ (H·W) × D, H and W respectively denote the height and width of the basic feature map, and D denotes the size of the embedding dimension. As shown in fig. 2, the feature extraction network f used in this embodiment is Swin-L, which is based on the Transformer architecture.
Step 13a, the basic feature map F is reshaped to obtain a three-dimensional basic feature map F̂ ∈ H × W × D, so that the basic feature map can be adapted to the attention network constructed later and the input dimensions remain consistent.
The method of obtaining the basic feature map of the image in step 1 can therefore be described by the following formula (1):
F̂ = reshape(F)  (1)
in the formula, reshape(·) represents reorganization of the feature map.
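A minimal sketch of steps 12a–13a is given below; it assumes a Transformer backbone (e.g., Swin-L) that returns a token sequence of shape (B, H·W, D), and the channel-first layout chosen for F̂ is an implementation convenience for the 1 × 1 convolution used later.

```python
import torch
import torch.nn as nn

def extract_base_feature(backbone: nn.Module, image: torch.Tensor,
                         H: int, W: int) -> torch.Tensor:
    """Extract the 2-D base feature F of shape (B, H*W, D) and reshape it into
    the 3-D base feature map F_hat, formula (1)."""
    F = backbone(image)                              # (B, H*W, D) token sequence
    B, _, D = F.shape
    F_hat = F.reshape(B, H, W, D)                    # reshape(.) in formula (1)
    F_hat = F_hat.permute(0, 3, 1, 2).contiguous()   # (B, D, H, W) for the 2-D conv
    return F_hat
```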
In the above embodiment, ResNet may alternatively be used to extract image features; its output is already a three-dimensional feature map and requires no reshaping. However, it extracts only local features, so its feature expression capability is limited. The feature extraction network f in step 1 may also adopt other prior-art network structures, as long as a basic feature map with rich expression can be obtained; they are not listed here.
In one embodiment, the method for obtaining the discriminative component attention map of the image in step 1 specifically includes:
Step 11b, determining the number M of channels of the component attention map to be generated, i.e., the number of component features to be generated; the value can be chosen for different datasets according to the actual situation.
Since the number of channels of the component attention map reflects how many of the object's discriminative components the model attends to, attending to more components gives the model better discrimination of subtle differences. Balancing the number of learnable parameters against accuracy, M can be set to 64 and 32 on the CUB-200-2011 and NABirds datasets, respectively. The value of M can be selected according to the experimental results on different datasets.
Step 12b, an attention module is constructed to generate the component attention map.
Generally, an image attention module is composed of fully connected layers, two-dimensional convolutions, batch normalization, activation functions (e.g., ReLU, Sigmoid, Softmax) and the like, and different attention architectures generate different attention maps and improve model performance to different degrees. Experimental analysis shows that an attention generation module G composed of a two-dimensional convolution with a 1 × 1 convolution kernel and a Sigmoid function is better suited to the backbone network of this embodiment; its structure is shown in fig. 3, where the 1 × 1 convolution changes the number of feature channels to the preset number of components M. This also shows that generating attention by combining a two-dimensional convolution with an activation function is effective not only for convolutional neural networks but also for networks based on the Transformer architecture. The process of generating the component attention map A is shown in the following formula (2):
A = G(F̂)  (2)
where A ∈ H × W × M, and A_i ∈ H × W (i = 1, 2, …, M) represents the ith component attention map of the target object, such as the bird's head or torso.
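A sketch of the attention module G of formula (2) might look as follows; the module and parameter names are illustrative only, and the (B, D, H, W) tensor layout follows the sketch above.

```python
import torch.nn as nn

class PartAttention(nn.Module):
    """Attention module G: a 1x1 conv maps the D feature channels to M component
    channels, and Sigmoid bounds each attention value to (0, 1) -- formula (2)."""
    def __init__(self, embed_dim: int, num_parts: int):
        super().__init__()
        self.conv = nn.Conv2d(embed_dim, num_parts, kernel_size=1)
        self.act = nn.Sigmoid()

    def forward(self, f_hat):                 # f_hat: (B, D, H, W)
        return self.act(self.conv(f_hat))     # A: (B, M, H, W)
```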
As another implementation of the discriminative component attention map in step 1, step 12b may instead use an attention module G composed of a 1 × 1 two-dimensional convolution, two-dimensional batch normalization and a ReLU function; an attention module G composed of a fully connected layer, one-dimensional layer normalization and a Softmax function; or even an attention module G composed of a fully connected layer, one-dimensional layer normalization and a ReLU function, without changing step 11b.
From the above, step 2 obtains the discriminative component features by Bilinear Attention Pooling (BAP) of the basic feature map and the component attention map. In one embodiment, as shown in fig. 4, the method for extracting the discriminative component features in step 2 specifically includes:
Step 21, each component attention map A_i is expanded to the dimensions of the basic feature map F̂, that is, A_i is repeated along the channel dimension until its number of channels matches that of F̂; the two are then multiplied element by element to obtain the discriminative component feature P_i ∈ H × W × D, so that the basic feature map is activated at the positions of the discriminative components. The specific process is shown in formula (3):
P_i = A_i ⊙ F̂  (3)
in the formula, "⊙" indicates element-wise multiplication.
Step 22, for the image classification task, Global Average Pooling (GAP) is typically used to aggregate features. The discriminative component features obtained in step 21 are aggregated by global average pooling to facilitate the subsequent fusion of component features. The aggregation of the ith component feature is defined as follows:
h_i = ψ(P_i)  (4)
in the formula, h_i ∈ D represents the aggregated feature of the ith component, and ψ(·) represents global average pooling.
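The bilinear attention pooling of steps 21–22 (formulas (3)–(4)) can be sketched as below, continuing the (B, D, H, W) / (B, M, H, W) tensor layouts assumed above.

```python
import torch

def bilinear_attention_pooling(f_hat: torch.Tensor, attn: torch.Tensor) -> torch.Tensor:
    """f_hat: (B, D, H, W), attn: (B, M, H, W) -> aggregated part features h: (B, M, D)."""
    # Broadcast each A_i over all D channels of F_hat and multiply element-wise (formula (3)).
    parts = attn.unsqueeze(2) * f_hat.unsqueeze(1)    # P: (B, M, D, H, W)
    # Global average pooling psi(.) over the spatial positions (formula (4)).
    h = parts.mean(dim=(3, 4))                        # h: (B, M, D)
    return h
```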
In another embodiment, step 21 can be implemented by directly channel-stitching the obtained feature map with the attention map. However, this method does not extract the discriminative component and then performs fusion of the features, and thus the representation capability of the features is limited.
From the above, it can be seen that: step 3 is used for fusing the discriminative part features, and specifically comprises the following steps:
Step 31, the aggregated discriminative component features h_i are concatenated along the channel dimension to obtain an enhanced feature representation, namely the global component feature Q ∈ M·D shown in the following formula (5); this feature fuses discriminative component information and has stronger expressive power.
Q = Concate(h_1, h_2, …, h_M)  (5)
in the formula, Concate(·) represents feature concatenation;
step 32, after L2-norm normalization, the global component feature Q is passed into the full connection layer.
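Steps 31–32 (formula (5), L2 normalization and the full connection layer) might be sketched as follows; whether the L2 norm is taken over the whole concatenated vector or per component is not stated in the text, so whole-vector normalization is assumed here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as Fn

class PartFusionHead(nn.Module):
    """Concatenate the part features, L2-normalize, and map to categories."""
    def __init__(self, num_parts: int, embed_dim: int, num_classes: int):
        super().__init__()
        self.fc = nn.Linear(num_parts * embed_dim, num_classes)

    def forward(self, h):                        # h: (B, M, D)
        q = h.flatten(1)                         # Q = Concate(h_1, ..., h_M): (B, M*D)
        q = Fn.normalize(q, p=2, dim=1)          # L2-norm normalization (assumed over all of Q)
        logits = self.fc(q)                      # full connection layer: mapping to categories
        return logits, q
```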
In step 31, feature fusion may alternatively be performed by directly adding the feature maps instead of concatenating them along the channel dimension.
In one embodiment, step 4 specifically includes:
and 41, forming a classification network of the model by the full connection layer and the Softmax. Inputting the global component characteristics Q into a full connection layer, completing the mapping of bird image categories, and obtaining the Cross entropy (Cross entropy) loss of a predicted value and a labelAnd the method is used for punishing the classification result and measuring the difference between the classes. The classification loss of a single sample is shown as the formula (6):
where y represents a category label previously marked in the image, such as 0, 1, 2, …, y' represents a predicted value obtained after the component feature Q is input into the full-link layer, and P represents a classification probability of 0-1 after being processed by Softmax, which can be represented by equation (7):
of formula (II) to (III)'jAnd C is the total number of the categories in the data set.
Step 42, to avoid homogenization of the component attention maps during model training, that is, to ensure that different attention channels represent different target components, a center loss function is adopted to weakly supervise the generated component attention; the component attention maps are constrained so that the component features of Q continuously approach their feature centers. During training, the center loss makes the expressions of the same target component as similar as possible while keeping different component features apart. The center loss of a single sample is defined as follows:
in the formula, q_i ∈ D is the ith component feature in the global component feature Q, and c_i ∈ D is the feature center of the ith component. c_i is initialized to 0 and is updated during training as follows:
c_i ← c_i + α(q_i − c_i)  (9)
in the formula, α ∈ [0, 1] is the learning rate for updating c_i. In experiments, better results were obtained with α = 0.05. The total loss of the model in the training phase is defined as follows:
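Formulas (8)–(10) might be sketched as follows; treating formula (8) as a squared Euclidean distance averaged over the batch, updating the centers with a batch-mean approximation of formula (9), and taking the total loss as an unweighted sum are all assumptions, since formula (10) itself is not reproduced in the text.

```python
import torch
import torch.nn as nn

class PartCenterLoss(nn.Module):
    """Center loss over the M part features; centers are initialized to 0 and
    updated by c_i <- c_i + alpha * (q_i - c_i), formula (9)."""
    def __init__(self, num_parts: int, embed_dim: int, alpha: float = 0.05):
        super().__init__()
        self.alpha = alpha
        self.register_buffer("centers", torch.zeros(num_parts, embed_dim))

    def forward(self, q):                                 # q: (B, M, D) part features
        # Squared distance to the running centers (assumed form of formula (8)).
        loss = ((q - self.centers) ** 2).sum(dim=2).mean()
        with torch.no_grad():                             # batch-mean approximation of formula (9)
            self.centers += self.alpha * (q.detach().mean(dim=0) - self.centers)
        return loss

# Assumed total training loss (formula (10)):
# total_loss = classification_loss(logits, labels) + center_loss(q)
```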
the model uses only the cross entropy loss as the overall loss at the test stage.
The embodiment of the invention also provides a bird fine-grained image recognition device based on Transformer and component feature fusion, which comprises a component attention generation unit, a discriminant component feature generation unit, a feature fusion unit and a parameter learning optimization unit, wherein:
the component attention generating unit is used for inputting the preprocessed image into a feature encoder based on a Transformer architecture network, extracting a basic feature map, inputting the basic feature map into an attention module, and generating a component attention map.
And the discriminant part feature generation unit is used for performing bilinear attention pooling operation on the basic feature map and the part attention map to obtain discriminant part features.
The feature fusion unit is used for splicing the discriminative component features on the channel dimension to obtain the enhanced feature representation fused with the discriminative component information.
And the parameter learning optimization unit is used for completing the mapping of the categories by inputting the enhanced feature representation into the full-connection layer and optimizing the model parameters through cross entropy loss and central loss.
In one embodiment, the component attention generating unit includes a base feature map extracting sub-unit and a component attention map generating sub-unit.
The two-dimensional basic feature map extraction module is used for inputting the preprocessed original image I into the feature extraction network f and extracting a two-dimensional basic feature map F, where F ∈ (H·W) × D, H and W respectively denote the height and width of the basic feature map, and D denotes the size of the embedding dimension. The three-dimensional basic feature map module is used for reshaping the basic feature map F to obtain a three-dimensional basic feature map F̂; the process is shown in the following formula (1):
F̂ = reshape(F)  (1)
in the formula, reshape(·) represents reorganization of the feature map.
The component attention map generation subunit is used to determine the number M of channels of the component attention map to be generated; an attention module G is formed by a two-dimensional convolution with a 1 × 1 convolution kernel and a Sigmoid function, the feature map F̂ is input into the attention module G, and a component attention map A representing the distribution of different components of the target object is generated, as shown in the following formula (2):
A = G(F̂)  (2)
in the formula, A_i (i = 1, 2, …, M) represents the ith component attention map of the target object.
In one embodiment, the discriminative component feature generation unit specifically includes a single discriminative component feature generation module and a discriminative component feature fusion module, wherein:
the single discriminative component feature generation module expands each component attention map A_i to match the dimensions of the basic feature map F̂, and then multiplies the expanded attention map A_i with the basic feature map F̂ element by element according to the following formula (3) to obtain the discriminative component feature P_i:
P_i = A_i ⊙ F̂  (3)
in the formula, "⊙" indicates element-wise multiplication.
The discriminative component feature fusion module aggregates each discriminative component feature P_i by global average pooling according to the following formula (4):
h_i = ψ(P_i)  (4)
in the formula, h_i represents the aggregated feature of the ith component, and ψ(·) represents global average pooling (GAP).
In one embodiment, the parameter learning optimization unit specifically includes a single sample loss acquisition module and a central update module of component features, wherein:
the sample classification loss acquisition module is used for inputting global component characteristics Q into the full-connection layer to complete the mapping of bird image categories and obtain the cross entropy loss of a predicted value and a labelFor punishing the classification result, the classification loss of a single sample is shown as formula (6):
in the formula, y represents a category label, y' represents a predicted value, and P represents a probability after the Softmax processing.
The component feature center update module uses the center loss of a single sample described by formula (8) to weakly supervise the generation of the component attention, initializes c_i, and updates it during training according to the following formula (9); the total loss of the model in the training phase is defined by formula (10):
c_i ← c_i + α(q_i − c_i)  (9)
in the formula, q_i is the ith component feature of the global component feature Q, c_i is the center of the ith component feature, and α ∈ [0, 1] is the learning rate for updating c_i.
In practical use, the input image is first preprocessed as in the above embodiment, the trained model parameters are then loaded, and finally the preprocessed image is input into the model to output the class probabilities.
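A minimal inference sketch tying the pieces above together might read as follows; it reuses the components and test_transform sketched earlier (all assumed), and the spatial size 12 × 12 is illustrative for a 384-pixel input with a 32× downsampling backbone.

```python
import torch
from PIL import Image

def predict(image_path: str, backbone, attention, head, device: str = "cuda"):
    """Preprocess one image and return its class probabilities.
    Assumes the trained parameters have already been loaded into the modules."""
    image = Image.open(image_path).convert("RGB")
    x = test_transform(image).unsqueeze(0).to(device)          # preprocessing from the sketch above
    with torch.no_grad():
        f_hat = extract_base_feature(backbone, x, H=12, W=12)  # basic feature map F_hat
        a = attention(f_hat)                                   # component attention maps A
        h = bilinear_attention_pooling(f_hat, a)               # aggregated part features h_i
        logits, _ = head(h)                                    # global component feature -> logits
        probs = torch.softmax(logits, dim=1)                   # class probabilities
    return probs
```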
To verify that the center loss improves model performance, Grad-CAM is used to visualize the feature map output by the last layer of the feature extraction network; the result is shown in FIG. 5. Without the center loss, the high-energy regions of the heat map are scattered sporadically over the birds' bodies or include large background areas, whereas with the center loss the high-energy regions are concentrated on the birds' bodies, indicating that these regions contribute more to the classification result and yield better classification performance.
By adopting the method provided by the invention, the model can attend to discriminative components and achieve high-precision recognition of bird fine-grained images under weak supervision.
Finally, it should be pointed out that: the above examples are only for illustrating the technical solutions of the present invention, and are not limited thereto. Those of ordinary skill in the art will understand that: modifications can be made to the technical solutions described in the foregoing embodiments, or some technical features may be equivalently replaced; such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.
Claims (10)
1. A bird fine-grained image identification method based on Transformer and component feature fusion is characterized by comprising the following steps:
step 1, inputting a preprocessed image into a feature encoder based on a Transformer architecture network, extracting a basic feature map, inputting the basic feature map into an attention module, and generating a component attention map;
step 2, performing bilinear attention pooling operation on the basic feature diagram and the component attention diagram to obtain a discriminant component feature;
step 3, splicing the discriminative part features on the channel dimension to obtain an enhanced feature representation fused with the discriminative part information;
and 4, inputting the enhanced feature representation into the full connection layer to complete the mapping to categories, and optimizing the model parameters through cross entropy loss and center loss.
2. The method for identifying bird fine-grained images based on Transformer and component feature fusion as claimed in claim 1, wherein the method for extracting the basic feature map in the step 1 specifically comprises:
step 11a, inputting the preprocessed original image I into a feature extraction network f and extracting a two-dimensional basic feature map F, where F ∈ (H·W) × D, H and W respectively denote the height and width of the basic feature map, and D denotes the size of the embedding dimension;
step 12a, reshaping the basic feature map F to obtain a three-dimensional basic feature map F̂; the process is shown in the following formula (1):
F̂ = reshape(F)  (1)
in the formula, reshape(·) represents reorganization of the basic feature map.
3. The method for identifying bird fine-grained images based on Transformer and component feature fusion according to claim 1 or 2, wherein the method for generating the component attention map in the step 1 specifically comprises the following steps:
step 11b, determining the number M of channels of the component attention diagram needing to be generated, namely the number of generated component characteristics;
step 12b, forming an attention module G from a two-dimensional convolution with a 1 × 1 convolution kernel and a Sigmoid function, inputting the feature map F̂ into the attention module G, and generating a component attention map A representing the distribution of different components of the target object, as shown in the following formula (2):
A = G(F̂)  (2)
in the formula, A_i (i = 1, 2, …, M) represents the ith component attention map of the target object.
4. The method for identifying bird fine-grained images based on Transformer and component feature fusion according to claim 3, wherein the step 2 specifically comprises the following steps:
step 21, expanding each component attention map A_i to the dimensions of the basic feature map F̂, and then multiplying the expanded attention map A_i with the basic feature map F̂ element by element according to the following formula (3) to obtain the discriminative component feature P_i:
P_i = A_i ⊙ F̂  (3)
in the formula, "⊙" indicates element-wise multiplication;
step 22, aggregating each discriminative component feature P_i by global average pooling according to the following formula (4):
h_i = ψ(P_i)  (4)
in the formula, h_i represents the aggregated feature of the ith component, and ψ(·) represents global average pooling (GAP).
5. The method for identifying bird fine-grained images based on Transformer and component feature fusion according to claim 4, wherein the step 3 specifically comprises the following steps:
step 31, concatenating the aggregated discriminative component features h_i along the channel dimension to obtain an enhanced feature representation, namely the global component feature Q; this feature fuses discriminative component information and has stronger expressive power:
Q = Concate(h_1, h_2, …, h_M)  (5)
in the formula, Concate(·) represents feature concatenation;
step 32, after L2-norm normalization of the global component feature Q, passing it into the full connection layer to complete the mapping from the feature vector to the category.
6. The method for identifying bird fine-grained images based on the fusion of Transformer and component features according to any one of claims 1 to 5, wherein the step 4 specifically comprises the following steps:
step 41, inputting the global component feature Q into the full connection layer to complete the mapping to bird image categories, and obtaining the cross entropy loss between the predicted value and the label, which is used to penalize the classification result; the loss of a single sample is shown in formula (6):
in the formula, y represents the category label, y' represents the predicted value, and P represents the probability after Softmax processing;
step 42, the center loss of a single sample described by formula (8) is used to weakly supervise the generation of the component attention, so that each component feature continuously approaches its feature center:
in the formula, q_i is the ith component feature of the global component feature Q, and c_i is the center of the ith component feature;
step 43, initializing c_i and updating it during training according to the following formula (9):
c_i ← c_i + α(q_i − c_i)  (9)
in the formula, α ∈ [0, 1] is the learning rate for updating c_i, and the total loss of the model during the training phase is defined by formula (10):
7. a bird fine-grained image recognition device based on Transformer and component feature fusion is characterized by comprising:
a component attention generation unit, which is used for extracting a basic feature map by inputting the preprocessed image into a feature encoder based on a Transformer architecture network, and inputting the basic feature map into an attention module to generate a component attention map;
a discriminative component feature generation unit configured to perform bilinear attention pooling on the basic feature map and the component attention map to obtain discriminative component features;
the feature fusion unit is used for splicing the discriminative component features on the channel dimension to obtain enhanced feature representation fused with the discriminative component information;
and the parameter learning optimization unit is used for completing the mapping of the categories by inputting the enhanced feature representation into the full-connection layer, and optimizing the model parameters through cross entropy loss and center loss.
8. The device for bird fine-grained image recognition based on Transformer and component feature fusion according to claim 7, wherein the component attention generating unit comprises:
the basic feature map extraction subunit, which specifically includes:
the two-dimensional basic feature map extraction module, used for inputting the preprocessed original image I into a feature extraction network f and extracting a two-dimensional basic feature map F, where F ∈ (H·W) × D, H and W respectively denote the height and width of the basic feature map, and D denotes the size of the embedding dimension;
the three-dimensional basic feature map module, used for reshaping the basic feature map F to obtain a three-dimensional basic feature map F̂; the process is shown in the following formula (1):
F̂ = reshape(F)  (1)
in the formula, reshape(·) indicates that the feature map is reorganized;
a component attention map generation subunit, used for determining the number M of channels of the component attention map to be generated; an attention module G is formed by a two-dimensional convolution with a 1 × 1 convolution kernel and a Sigmoid function, the feature map F̂ is input into the attention module G, and a component attention map A representing the distribution of different components of the target object is generated, as shown in the following formula (2):
A = G(F̂)  (2)
in the formula, A_i (i = 1, 2, …, M) represents the ith component attention map of the target object.
9. The apparatus for bird fine-grained image recognition based on transform and component feature fusion as claimed in claim 7, wherein said discriminant component feature generation unit specifically comprises:
a single discriminative component feature generation module, which expands each component attention map A_i to match the dimensions of the basic feature map F̂ and then multiplies the expanded attention map A_i with the basic feature map F̂ element by element according to the following formula (3) to obtain the discriminative component feature P_i:
P_i = A_i ⊙ F̂  (3)
in the formula, "⊙" indicates element-wise multiplication;
a discriminative component feature fusion module, which aggregates each discriminative component feature P_i by global average pooling according to the following formula (4):
h_i = ψ(P_i)  (4)
in the formula, h_i represents the aggregated feature of the ith component, and ψ(·) represents global average pooling (GAP).
10. The apparatus for bird fine-grained image recognition based on Transformer and component feature fusion according to claim 4, wherein the parameter learning optimization unit specifically comprises:
a sample classification loss acquisition module, which inputs the global component feature Q into the full connection layer to complete the mapping to bird image categories and obtains the cross entropy loss between the predicted value and the label, used to penalize the classification result; the classification loss of a single sample is shown in formula (6):
in the formula, y represents a category label, y' represents a predicted value, and P represents the probability after the Softmax processing;
a component feature center update module, which uses the center loss of a single sample described by formula (8) to weakly supervise the generation of the component attention, initializes c_i, and updates it during training according to the following formula (9); the total loss of the model in the training phase is defined by formula (10):
c_i ← c_i + α(q_i − c_i)  (9)
in the formula, q_i is the ith component feature of the global component feature Q, c_i is the center of the ith component feature, and α ∈ [0, 1] is the learning rate for updating c_i.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210279684.8A CN114626476A (en) | 2022-03-21 | 2022-03-21 | Bird fine-grained image recognition method and device based on Transformer and component feature fusion |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210279684.8A CN114626476A (en) | 2022-03-21 | 2022-03-21 | Bird fine-grained image recognition method and device based on Transformer and component feature fusion |
Publications (1)
Publication Number | Publication Date |
---|---|
CN114626476A true CN114626476A (en) | 2022-06-14 |
Family
ID=81903433
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202210279684.8A Pending CN114626476A (en) | 2022-03-21 | 2022-03-21 | Bird fine-grained image recognition method and device based on Transformer and component feature fusion |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114626476A (en) |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115035389A (en) * | 2022-08-10 | 2022-09-09 | 华东交通大学 | Fine-grained image identification method and device based on reliability evaluation and iterative learning |
CN115471724A (en) * | 2022-11-02 | 2022-12-13 | 青岛杰瑞工控技术有限公司 | Fine-grained fish epidemic disease identification fusion algorithm based on self-adaptive normalization |
CN117853875A (en) * | 2024-03-04 | 2024-04-09 | 华东交通大学 | Fine-granularity image recognition method and system |
Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20200151424A1 (en) * | 2018-11-09 | 2020-05-14 | Sap Se | Landmark-free face attribute prediction |
CN111523534A (en) * | 2020-03-31 | 2020-08-11 | 华东师范大学 | Image description method |
CN112232293A (en) * | 2020-11-09 | 2021-01-15 | 腾讯科技(深圳)有限公司 | Image processing model training method, image processing method and related equipment |
CN112381830A (en) * | 2020-10-23 | 2021-02-19 | 山东黄河三角洲国家级自然保护区管理委员会 | Method and device for extracting bird key parts based on YCbCr superpixels and graph cut |
EP3923185A2 (en) * | 2021-03-03 | 2021-12-15 | Beijing Baidu Netcom Science And Technology Co., Ltd. | Image classification method and apparatus, electronic device and storage medium |
CN113902948A (en) * | 2021-10-09 | 2022-01-07 | 中国人民解放军陆军工程大学 | Fine-grained image classification method and system based on double-branch network |
CN114140353A (en) * | 2021-11-25 | 2022-03-04 | 苏州大学 | Swin-Transformer image denoising method and system based on channel attention |
Patent Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20200151424A1 (en) * | 2018-11-09 | 2020-05-14 | Sap Se | Landmark-free face attribute prediction |
CN111523534A (en) * | 2020-03-31 | 2020-08-11 | 华东师范大学 | Image description method |
CN112381830A (en) * | 2020-10-23 | 2021-02-19 | 山东黄河三角洲国家级自然保护区管理委员会 | Method and device for extracting bird key parts based on YCbCr superpixels and graph cut |
CN112232293A (en) * | 2020-11-09 | 2021-01-15 | 腾讯科技(深圳)有限公司 | Image processing model training method, image processing method and related equipment |
EP3923185A2 (en) * | 2021-03-03 | 2021-12-15 | Beijing Baidu Netcom Science And Technology Co., Ltd. | Image classification method and apparatus, electronic device and storage medium |
CN113902948A (en) * | 2021-10-09 | 2022-01-07 | 中国人民解放军陆军工程大学 | Fine-grained image classification method and system based on double-branch network |
CN114140353A (en) * | 2021-11-25 | 2022-03-04 | 苏州大学 | Swin-Transformer image denoising method and system based on channel attention |
Non-Patent Citations (1)
Title |
---|
MENGZE LI: "Multi-task attribute-fusion model for fine-grained image recognition", CONFERENCE ON OPTOELECTRONIC IMAGING AND MULTIMEDIA TECHNOLOGY VII, 10 October 2020 (2020-10-10), pages 1 - 15500 * |
Cited By (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115035389A (en) * | 2022-08-10 | 2022-09-09 | 华东交通大学 | Fine-grained image identification method and device based on reliability evaluation and iterative learning |
CN115035389B (en) * | 2022-08-10 | 2022-10-25 | 华东交通大学 | Fine-grained image identification method and device based on reliability evaluation and iterative learning |
CN115471724A (en) * | 2022-11-02 | 2022-12-13 | 青岛杰瑞工控技术有限公司 | Fine-grained fish epidemic disease identification fusion algorithm based on self-adaptive normalization |
CN117853875A (en) * | 2024-03-04 | 2024-04-09 | 华东交通大学 | Fine-granularity image recognition method and system |
CN117853875B (en) * | 2024-03-04 | 2024-05-14 | 华东交通大学 | Fine-granularity image recognition method and system |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110543892B (en) | Part identification method based on multilayer random forest | |
CN110443143B (en) | Multi-branch convolutional neural network fused remote sensing image scene classification method | |
CN111583263B (en) | Point cloud segmentation method based on joint dynamic graph convolution | |
CN111476806B (en) | Image processing method, image processing device, computer equipment and storage medium | |
CN112801015B (en) | Multi-mode face recognition method based on attention mechanism | |
CN114626476A (en) | Bird fine-grained image recognition method and device based on Transformer and component feature fusion | |
CN108960059A (en) | A kind of video actions recognition methods and device | |
CN110222767B (en) | Three-dimensional point cloud classification method based on nested neural network and grid map | |
CN114821014B (en) | Multi-mode and countermeasure learning-based multi-task target detection and identification method and device | |
CN111652273B (en) | Deep learning-based RGB-D image classification method | |
CN106408037A (en) | Image recognition method and apparatus | |
CN109740539B (en) | 3D object identification method based on ultralimit learning machine and fusion convolution network | |
CN113221987A (en) | Small sample target detection method based on cross attention mechanism | |
CN112329771B (en) | Deep learning-based building material sample identification method | |
CN114283325A (en) | Underwater target identification method based on knowledge distillation | |
CN114187506B (en) | Remote sensing image scene classification method of viewpoint-aware dynamic routing capsule network | |
CN112149526A (en) | Lane line detection method and system based on long-distance information fusion | |
CN114926691A (en) | Insect pest intelligent identification method and system based on convolutional neural network | |
CN113496260B (en) | Grain depot personnel non-standard operation detection method based on improved YOLOv3 algorithm | |
CN114494773A (en) | Part sorting and identifying system and method based on deep learning | |
CN113011506A (en) | Texture image classification method based on depth re-fractal spectrum network | |
CN111143544B (en) | Method and device for extracting bar graph information based on neural network | |
CN117036904A (en) | Attention-guided semi-supervised corn hyperspectral image data expansion method | |
CN111553437A (en) | Neural network based image classification method | |
CN111046861B (en) | Method for identifying infrared image, method for constructing identification model and application |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
CB03 | Change of inventor or designer information |
Inventor after: Zhang Haimiao; Liu Chang; Qiu Jun; Ruan Tao
Inventor before: Ruan Tao; Zhang Haimiao; Liu Chang; Qiu Jun
CB03 | Change of inventor or designer information |