Abstract
The accurate identification of bird images is of great significance for protecting the ecological environment and bird species diversity. To address the low recognition accuracy caused by the highly similar features of different bird species and the susceptibility of shallow edge features to loss, this paper proposes a fine-grained bird image classification model that incorporates hierarchical feature fusion and counterfactual feature selection. The model is based on the vision transformer and adds a hierarchical feature fusion module and a counterfactual feature enhancement module. The hierarchical feature fusion module superimposes shallow features, which are rich in fine-grained information, onto deep features to compensate for the lack of edge detail in key features. The counterfactual feature enhancement module selects discriminative features through counterfactual intervention to reduce classification errors caused by the high similarity among different bird species. Experimental results show that the method achieves 91.9% and 91.4% accuracy on two public datasets, CUB-200-2011 and NABirds, respectively, surpassing current mainstream fine-grained bird recognition algorithms and demonstrating excellent classification performance.
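The two ideas in the abstract can be illustrated with a minimal NumPy sketch. Everything here is an assumption for illustration, not the paper's actual implementation: the function names (`fuse_hierarchical`, `counterfactual_effect`), the use of token L2 norm as a saliency proxy, the random counterfactual attention, and the toy dimensions are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def fuse_hierarchical(shallow_tokens, deep_tokens, k=4):
    """Append the k most salient shallow-layer tokens (ranked here by
    L2 norm as a stand-in for attention scores) to the deep tokens,
    so fine-grained edge detail reaches the classification head."""
    saliency = np.linalg.norm(shallow_tokens, axis=1)
    top_idx = np.argsort(saliency)[-k:]
    return np.concatenate([deep_tokens, shallow_tokens[top_idx]], axis=0)

def counterfactual_effect(tokens, attention, classifier_w):
    """Score learned attention by the effect of intervening on it:
    the gap between logits under the learned attention and logits
    under a random (counterfactual) attention. A large gap suggests
    the attended features are genuinely discriminative."""
    pooled_real = attention @ tokens            # (d,) attention-pooled feature
    random_attn = rng.dirichlet(np.ones(len(tokens)))
    pooled_cf = random_attn @ tokens            # counterfactual pooling
    return classifier_w @ pooled_real - classifier_w @ pooled_cf

# Toy dimensions: 8 shallow tokens, 6 deep tokens, 16-dim embeddings.
shallow = rng.standard_normal((8, 16))
deep = rng.standard_normal((6, 16))
fused = fuse_hierarchical(shallow, deep, k=4)
print(fused.shape)  # (10, 16): 6 deep tokens + 4 selected shallow tokens

attn = rng.dirichlet(np.ones(len(fused)))
w = rng.standard_normal(16)
effect = counterfactual_effect(fused, attn, w)
```

In training, the counterfactual effect would be maximized as an auxiliary objective so the attention module is rewarded only for attention maps that actually change the prediction; this sketch only computes the effect for one sample.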
Data availability
Publicly available datasets were analyzed in this study. The CUB-200-2011 dataset is available at http://www.vision.caltech.edu/datasets/cub_200_2011/, accessed in 2011. The NABirds dataset is available at https://dl.allaboutbirds.org/nabirds, accessed in 2015.
Acknowledgements
The authors are grateful to the reviewers for their valuable suggestions.
Funding
Not applicable.
Author information
Contributions
TC, YL and QQ contributed to the conception of the study; TC and YL performed the experiments and contributed significantly to the analysis and manuscript preparation; TC, YL and QQ performed the data analyses, wrote the manuscript, and refined the analysis through constructive discussions.
Ethics declarations
Conflict of interest
The authors declare no conflict of interest.
Ethical approval
Not applicable.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Chen, T., Li, Y. & Qiao, Q. Fine-grained bird image classification based on counterfactual method of vision transformer model. J Supercomput 80, 6221–6239 (2024). https://doi.org/10.1007/s11227-023-05701-6