Abstract
Convolutional neural networks (CNNs) have been widely used in image scene classification and have achieved remarkable progress. However, the deep features they extract neither focus on the local semantics of an image nor capture its spatial morphological variation, so using a CNN directly does not yield sufficiently discriminative feature representations. To relieve this limitation, a global-local feature adaptive fusion (GLFAF) network is proposed. The GLFAF framework extracts multi-scale and multi-level features with a purpose-designed CNN. Then, to leverage the complementary advantages of these features, we design a global feature aggregate module that discovers global attention features and further learns the multiple deep dependencies of spatial scale variations among them. Meanwhile, a local feature aggregate module is designed to aggregate the multi-scale and multi-level features. Specifically, multi-level features at the same scale are fused based on channel attention, and the resulting spatially fused features at different scales are then aggregated based on channel dependence. Moreover, spatial contextual attention is designed to refine spatial features across scales, and different Fisher vector layers are designed to learn semantic aggregation among spatial features. Subsequently, two different feature adaptive fusion modules are introduced to explore the complementary associations between the global and local aggregate features, yielding a comprehensive and discriminative image scene representation. Finally, extensive experiments on real scene datasets from three different fields show that the proposed GLFAF approach classifies scenes more accurately than other state-of-the-art models.
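To make two of the ideas in the abstract concrete, the following is a minimal NumPy sketch of (a) fusing two same-scale feature maps with channel attention and (b) adaptively fusing a global and a local feature vector through a learned gate. It is not the GLFAF implementation: the abstract does not specify layer shapes or formulas, so the squeeze-and-excitation-style attention, the sigmoid gate, the function names, and all dimensions and random weights below are assumptions chosen for illustration only.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def channel_attention(feat, w1, w2):
    """Per-channel weights in (0, 1) for a (C, H, W) feature map.

    Assumed squeeze-and-excitation-style design: global average pooling,
    a bottleneck layer with ReLU, then a sigmoid.
    """
    squeezed = feat.mean(axis=(1, 2))        # (C,) one statistic per channel
    hidden = np.maximum(w1 @ squeezed, 0.0)  # (C // r,) bottleneck + ReLU
    return sigmoid(w2 @ hidden)              # (C,)


def fuse_same_scale(feat_a, feat_b, w1, w2):
    """Fuse two same-scale feature maps as a channel-wise convex combination."""
    weights = channel_attention(feat_a + feat_b, w1, w2)
    w = weights[:, None, None]               # broadcast over H and W
    return w * feat_a + (1.0 - w) * feat_b


def adaptive_fusion(global_vec, local_vec, gate_w):
    """Gate each dimension between the global and the local feature vector."""
    gate = sigmoid(gate_w @ np.concatenate([global_vec, local_vec]))
    return gate * global_vec + (1.0 - gate) * local_vec


# Hypothetical sizes and random weights, standing in for learned parameters.
rng = np.random.default_rng(0)
C, H, W, r = 8, 4, 4, 2
low_level = rng.standard_normal((C, H, W))
high_level = rng.standard_normal((C, H, W))
w1 = rng.standard_normal((C // r, C)) * 0.1
w2 = rng.standard_normal((C, C // r)) * 0.1
fused = fuse_same_scale(low_level, high_level, w1, w2)

D = 6
global_vec = rng.standard_normal(D)
local_vec = rng.standard_normal(D)
gate_w = rng.standard_normal((D, 2 * D)) * 0.1
scene_vec = adaptive_fusion(global_vec, local_vec, gate_w)
```

Because both fusion steps are convex combinations, every fused value lies between the two inputs it combines; a trained network would learn `w1`, `w2`, and `gate_w` rather than sample them.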
Data Availability
The UC Merced Land-Use dataset that supports the findings of this study is available in the UC Merced repository: http://weegee.vision.ucmerced.edu/datasets/landuse.html. The UIUC Sports dataset that supports the findings of this study is available in the Stanford repository: http://vision.stanford.edu/lijiali/event_dataset/. The infrared maritime scene dataset that supports the findings of this study is available from the corresponding author on reasonable request.
Funding
This work was supported in part by the Fundamental Research Funds for the Central Universities of China under Grants 3132019340 and 3132019200, and in part by the High-Tech Ship Research Project of the Ministry of Industry and Information Technology of the People's Republic of China under Grant MC-201902-C01.
Ethics declarations
Conflict of Interests
The authors declare no conflicts of interest.
About this article
Cite this article
Lv, G., Dong, L., Zhang, W. et al. A global-local feature adaptive fusion network for image scene classification. Multimed Tools Appl 83, 6521–6554 (2024). https://doi.org/10.1007/s11042-023-15519-2
DOI: https://doi.org/10.1007/s11042-023-15519-2