On the Co-Selection of Vision Transformer Features and Images for Very High-Resolution Image Scene Classification
Figure 1. Overall architecture of the proposed method.
Figure 2. Example images associated with the 21 land-use categories in the UC Merced data set.
Figure 3. Example images associated with the 30 land-use classes in the AID data set.
Figure 4. Example images associated with the 45 land-use categories in the NWPU-RESISC45 data set.
Figure 5. Effect of the rate of important features on the classification accuracy of the UC Merced data set with 50% of randomly selected images per class.
Figure 6. Effect of the rate of important features on the classification accuracy of the UC Merced data set with 80% of randomly selected images per class.
Figure 7. Confusion matrix of our method under the 80% training ratio and 50% of important features on the UC Merced data set.
Figure 8. Confusion matrix of our method under the 50% training ratio and 50% of important features on the UC Merced data set.
Figure 9. Effect of the rate of important features on the classification accuracy of the AID data set with 20% of randomly selected images per class.
Figure 10. Effect of the rate of important features on the classification accuracy of the AID data set with 50% of randomly selected images per class.
Figure 11. Confusion matrix of our method under the 50% training ratio and 90% of important features on the AID data set.
Figure 12. Confusion matrix of our method under the 20% training ratio and 90% of important features on the AID data set.
Figure 13. Effect of the rate of important features on the classification accuracy of the NWPU-RESISC45 data set with 10% of randomly selected images per class.
Figure 14. Effect of the rate of important features on the classification accuracy of the NWPU-RESISC45 data set with 20% of randomly selected images per class.
Figure 15. Confusion matrix of our method under the 10% training ratio and 90% of features on the NWPU data set.
Figure 16. Confusion matrix of our method under the 20% training ratio and 90% of features on the NWPU data set.
Abstract
1. Introduction
The main contributions of this paper are summarized as follows:
1. We explore the performance of the vision transformer based on the pre-trained ViT model. The transformer encoder layer is treated as a feature descriptor, from which a discriminative image scene representation is generated.
2. We present a new approach that selects the most important features and images while detecting unwanted and noisy images. Such images can degrade the accuracy of the final model; removing them yields a clean data set, which improves accuracy and, consequently, reduces the learning time.
3. Another challenging problem in understanding a VHR image scene involves the classification strategy. To this end, we use a support vector machine (SVM) to classify the extracted ViT features corresponding to the selected encoder layers (a sketch of this feature-extraction and classification pipeline follows this list).
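A minimal sketch of the pipeline described in contributions 1 and 3, assuming a timm pre-trained ViT (`vit_small_patch16_224` is an illustrative choice, not necessarily the paper's backbone) and scikit-learn's LIBSVM-based `SVC`:

```python
# Hedged illustration: a frozen ViT encoder as feature descriptor + SVM classifier.
# Model name and helper structure are assumptions, not the authors' exact code.
import numpy as np
import timm
import torch
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# num_classes=0 removes the classification head, so the model returns
# the encoder's pooled representation directly.
model = timm.create_model("vit_small_patch16_224", pretrained=True, num_classes=0)
model.eval()

def extract_features(images: torch.Tensor) -> np.ndarray:
    """images: (N, 3, 224, 224), already resized and normalized.
    Returns one encoder feature vector per image."""
    with torch.no_grad():
        feats = model(images)
    return feats.cpu().numpy()

# Hypothetical usage with tensors X_train/X_test and integer labels y_train:
# train_feats = extract_features(X_train)
# scaler = StandardScaler().fit(train_feats)
# clf = SVC(kernel="linear").fit(scaler.transform(train_feats), y_train)
# preds = clf.predict(scaler.transform(extract_features(X_test)))
```

Freezing the backbone keeps the encoder purely as a descriptor, so only the SVM is trained on the scene labels.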
2. Proposed Framework
2.1. ViT Model and Feature Fusion
2.2. Co-Selection of Features and Images
- Q is a projection matrix to be estimated. It is of dimension $(d \times h)$, where $d$ and $h$ denote the sizes of the original and the new feature sets, respectively.
- A is a binary matrix, which is derived from the label information as follows: $A_{ij} = 1$ if the $i$-th image belongs to the $j$-th class, and $A_{ij} = 0$ otherwise.
- $\alpha$ denotes the regularization hyperparameter used to control the sparsity of the projection matrix Q.
- $\|\cdot\|_{2,1}$ is the $\ell_{2,1}$-norm. If $M$ is a $d \times h$ matrix, then its $\ell_{2,1}$-norm is defined by $\|M\|_{2,1} = \sum_{i=1}^{d} \sqrt{\sum_{j=1}^{h} M_{ij}^{2}}$.
- $\|\cdot\|_{F}$ is the Frobenius norm, defined by $\|M\|_{F} = \sqrt{\sum_{i=1}^{d} \sum_{j=1}^{h} M_{ij}^{2}}$.
- $\beta$ denotes the regularization hyperparameter used to control the sparsity of the residual matrix R (a reconstruction of the full objective follows this list).
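Taken together, these definitions are consistent with a co-selection objective of the following form. This is a reconstruction rather than the paper's verbatim equation (the regularization weights are written here as $\alpha$ and $\beta$, matching the definitions above), in the spirit of the co-selection and penalized-regression literature [35,36,37]:

```latex
\min_{Q,\,R}\;\; \left\| XQ + R - A \right\|_{F}^{2}
\;+\; \alpha \left\| Q \right\|_{2,1}
\;+\; \beta \left\| R \right\|_{2,1}
```

where $X \in \mathbb{R}^{n \times d}$ stacks the ViT features of the $n$ images. Row sparsity of $Q$ selects discriminative features, while rows of $R$ with large norm flag noisy images.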
2.3. Optimization
Algorithm 1: The Proposed Framework
Input: data set X of n images and their label information; the map function of the deep model m; the hyperparameters $\alpha$ and $\beta$.
Output: the fitted Q and R.
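A minimal NumPy sketch of one plausible realization of Algorithm 1 (an assumption-laden illustration of the reconstructed objective above, not the authors' exact optimization scheme): it alternates a closed-form reweighted least-squares update for Q with a row-wise shrinkage, the $\ell_{2,1}$ proximal operator, for R.

```python
import numpy as np

def co_select(X, A, alpha=1.0, beta=1.0, n_iter=50, eps=1e-8):
    """Hypothetical solver for min ||XQ + R - A||_F^2 + alpha||Q||_2,1 + beta||R||_2,1.
    X: (n, d) image features; A: (n, c) binary label matrix."""
    n, d = X.shape
    # Ridge-style initialization keeps the first reweighting step well behaved.
    Q = np.linalg.solve(X.T @ X + alpha * np.eye(d), X.T @ A)
    R = np.zeros_like(A, dtype=float)
    for _ in range(n_iter):
        # R-step: prox of the l2,1-norm = row-wise soft thresholding of E = A - XQ.
        E = A - X @ Q
        row_norms = np.linalg.norm(E, axis=1, keepdims=True)
        R = np.maximum(0.0, 1.0 - beta / (2.0 * row_norms + eps)) * E
        # Q-step: iteratively reweighted least squares for the l2,1-penalized fit.
        D = np.diag(1.0 / (2.0 * np.linalg.norm(Q, axis=1) + eps))
        Q = np.linalg.solve(X.T @ X + alpha * D, X.T @ (A - R))
    return Q, R

# Hypothetical co-selection scores after fitting:
# feat_scores = np.linalg.norm(Q, axis=1)  # rank features, keep the top rate
# img_scores = np.linalg.norm(R, axis=1)   # large residual rows flag noisy images
```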
3. Experimental Results and Setup
3.1. Data Sets
3.2. Experimental Setup
3.3. UC Merced Data Set
3.4. AID Data Set
3.5. NWPU-RESISC45 Data Set
3.6. Ablation Study
4. Conclusions
Author Contributions
Funding
Data Availability Statement
Acknowledgments
Conflicts of Interest
References
1. Oliva, A.; Torralba, A. Modeling the shape of the scene: A holistic representation of the spatial envelope. Int. J. Comput. Vis. 2001, 42, 145–175.
2. Ojala, T.; Pietikainen, M.; Maenpaa, T. Multiresolution gray-scale and rotation invariant texture classification with local binary patterns. IEEE Trans. Pattern Anal. Mach. Intell. 2002, 24, 971–987.
3. Swain, M.J.; Ballard, D.H. Color indexing. Int. J. Comput. Vis. 1991, 7, 11–32.
4. Lowe, D.G. Object recognition from local scale-invariant features. In Proceedings of the Seventh IEEE International Conference on Computer Vision, Kerkyra, Greece, 20–27 September 1999; Volume 2, pp. 1150–1157.
5. Dalal, N.; Triggs, B. Histograms of oriented gradients for human detection. In Proceedings of the 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05), San Diego, CA, USA, 20–25 June 2005; Volume 1, pp. 886–893.
6. Yang, Y.; Newsam, S. Comparing SIFT descriptors and Gabor texture features for classification of remote sensed imagery. In Proceedings of the 2008 15th IEEE International Conference on Image Processing, San Diego, CA, USA, 12–15 October 2008; pp. 1852–1855.
7. dos Santos, J.A.; Penatti, O.A.B.; da Silva Torres, R. Evaluating the potential of texture and color descriptors for remote sensing image retrieval and classification. In Proceedings of the VISAPP, Angers, France, 17–21 May 2010.
8. Risojević, V.; Momić, S.; Babić, Z. Gabor descriptors for aerial image classification. In Proceedings of the International Conference on Adaptive and Natural Computing Algorithms, Ljubljana, Slovenia, 14–16 April 2011; pp. 51–60.
9. Avramović, A.; Risojević, V. Block-based semantic classification of high-resolution multispectral aerial images. Signal Image Video Process. 2016, 10, 75–84.
10. Chen, X.; Fang, T.; Huo, H.; Li, D. Measuring the effectiveness of various features for thematic information extraction from very high resolution remote sensing imagery. IEEE Trans. Geosci. Remote Sens. 2015, 53, 4837–4851.
11. Luo, B.; Jiang, S.; Zhang, L. Indexing of remote sensing images with different resolutions by multiple features. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2013, 6, 1899–1912.
12. Luo, B.; Aujol, J.F.; Gousseau, Y.; Ladjal, S. Indexing of satellite images with different resolutions by wavelet features. IEEE Trans. Image Process. 2008, 17, 1465–1472.
13. Luo, B.; Aujol, J.F.; Gousseau, Y. Local scale measure from the topographic map and application to remote sensing images. Multiscale Model. Simul. 2009, 8, 1–29.
14. Qi, K.; Wu, H.; Shen, C.; Gong, J. Land-use scene classification in high-resolution remote sensing images using improved correlatons. IEEE Geosci. Remote Sens. Lett. 2015, 12, 2403–2407.
15. Cheng, G.; Han, J.; Zhou, P.; Guo, L. Multi-class geospatial object detection and geographic image classification based on collection of part detectors. ISPRS J. Photogramm. Remote Sens. 2014, 98, 119–132.
16. Drăguţ, L.; Blaschke, T. Automated classification of landform elements using object-based image analysis. Geomorphology 2006, 81, 330–344.
17. Zhang, J.; Li, T.; Lu, X.; Cheng, Z. Semantic classification of high-resolution remote-sensing images based on mid-level features. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2016, 9, 2343–2353.
18. Cui, S.; Schwarz, G.; Datcu, M. Remote sensing image classification: No features, no clustering. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2015, 8, 5158–5170.
19. Sivic, J.; Zisserman, A. Video Google: A text retrieval approach to object matching in videos. In Proceedings of the Ninth IEEE International Conference on Computer Vision, Nice, France, 13–16 October 2003; Volume 3, p. 1470.
20. Cheng, G.; Guo, L.; Zhao, T.; Han, J.; Li, H.; Fang, J. Automatic landslide detection from remote-sensing imagery using a scene classification method based on BoVW and pLSA. Int. J. Remote Sens. 2013, 34, 45–59.
21. Zhao, L.; Tang, P.; Huo, L. Feature significance-based multibag-of-visual-words model for remote sensing image scene classification. J. Appl. Remote Sens. 2016, 10, 035004.
22. Wu, H.; Liu, B.; Su, W.; Zhang, W.; Sun, J. Hierarchical coding vectors for scene level land-use classification. Remote Sens. 2016, 8, 436.
23. Zhang, Y.; Sun, X.; Wang, H.; Fu, K. High-resolution remote-sensing image classification via an approximate earth mover's distance-based bag-of-features model. IEEE Geosci. Remote Sens. Lett. 2013, 10, 1055–1059.
24. Cheng, G.; Han, J.; Guo, L.; Liu, Z.; Bu, S.; Ren, J. Effective and efficient mid-level visual elements-oriented land-use classification using VHR remote sensing images. IEEE Trans. Geosci. Remote Sens. 2015, 53, 4238–4249.
25. Hu, F.; Xia, G.S.; Hu, J.; Zhang, L. Transferring deep convolutional neural networks for the scene classification of high-resolution remote sensing imagery. Remote Sens. 2015, 7, 14680–14707.
26. Deng, J.; Berg, A.; Satheesh, S.; Su, H.; Khosla, A.; Li, F. ImageNet Large Scale Visual Recognition Challenge (ILSVRC2012). 2012. Available online: https://image-net.org/challenges/LSVRC/ (accessed on 15 March 2020).
27. Zhang, F.; Du, B.; Zhang, L. Scene classification via a gradient boosting random convolutional network framework. IEEE Trans. Geosci. Remote Sens. 2015, 54, 1793–1802.
28. Yang, Y.; Newsam, S. Bag-of-visual-words and spatial extensions for land-use classification. In Proceedings of the 18th SIGSPATIAL International Conference on Advances in Geographic Information Systems, San Jose, CA, USA, 2–5 November 2010; pp. 270–279.
29. Nogueira, K.; Penatti, O.A.; Dos Santos, J.A. Towards better exploiting convolutional neural networks for remote sensing scene classification. Pattern Recognit. 2017, 61, 539–556.
30. Othman, E.; Bazi, Y.; Alajlan, N.; Alhichri, H.; Melgani, F. Using convolutional features and a sparse autoencoder for land-use scene classification. Int. J. Remote Sens. 2016, 37, 2149–2167.
31. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. In Proceedings of the Advances in Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017; Volume 30.
32. Tsai, Y.H.H.; Bai, S.; Liang, P.P.; Kolter, J.Z.; Morency, L.P.; Salakhutdinov, R. Multimodal transformer for unaligned multimodal language sequences. In Proceedings of the Association for Computational Linguistics Meeting, Florence, Italy, 28 July–2 August 2019; Volume 2019, p. 6558.
33. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv 2020, arXiv:2010.11929.
34. Chaib, S.; Liu, H.; Gu, Y.; Yao, H. Deep feature fusion for VHR remote sensing scene classification. IEEE Trans. Geosci. Remote Sens. 2017, 55, 4775–4784.
35. Benabdeslem, K.; Mansouri, D.E.K.; Makkhongkaew, R. sCOs: Semi-supervised co-selection by a similarity preserving approach. IEEE Trans. Knowl. Data Eng. 2022, 34, 2899–2911.
36. Tang, J.; Liu, H. CoSelect: Feature selection with instance selection for social media data. In Proceedings of the 2013 SIAM International Conference on Data Mining, Austin, TX, USA, 2–4 May 2013; pp. 695–703.
37. She, Y.; Owen, A.B. Outlier detection using nonconvex penalized regression. J. Am. Stat. Assoc. 2011, 106, 626–639.
38. Castelluccio, M.; Poggi, G.; Sansone, C.; Verdoliva, L. Land use classification in remote sensing images by convolutional neural networks. arXiv 2015, arXiv:1508.00092.
39. Xia, G.S.; Hu, J.; Hu, F.; Shi, B.; Bai, X.; Zhong, Y.; Zhang, L.; Lu, X. AID: A benchmark data set for performance evaluation of aerial scene classification. IEEE Trans. Geosci. Remote Sens. 2017, 55, 3965–3981.
40. Cheng, G.; Han, J.; Lu, X. Remote sensing image scene classification: Benchmark and state of the art. Proc. IEEE 2017, 105, 1865–1883.
41. Chang, C.C.; Lin, C.J. LIBSVM: A library for support vector machines. ACM Trans. Intell. Syst. Technol. 2011, 2, 1–27.
42. Zhang, F.; Du, B.; Zhang, L. Saliency-guided unsupervised feature learning for scene classification. IEEE Trans. Geosci. Remote Sens. 2014, 53, 2175–2184.
43. Anwer, R.M.; Khan, F.S.; Van De Weijer, J.; Molinier, M.; Laaksonen, J. Binary patterns encoded convolutional neural networks for texture recognition and remote sensing scene classification. ISPRS J. Photogramm. Remote Sens. 2018, 138, 74–85.
44. Liu, Y.; Zhong, Y.; Qin, Q. Scene classification based on multiscale convolutional neural network. IEEE Trans. Geosci. Remote Sens. 2018, 56, 7109–7121.
45. He, N.; Fang, L.; Li, S.; Plaza, A.; Plaza, J. Remote sensing scene classification using multilayer stacked covariance pooling. IEEE Trans. Geosci. Remote Sens. 2018, 56, 6899–6910.
46. Li, B.; Su, W.; Wu, H.; Li, R.; Zhang, W.; Qin, W.; Zhang, S. Aggregated deep Fisher feature for VHR remote sensing scene classification. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2019, 12, 3508–3523.
47. Ma, C.; Mu, X.; Lin, R.; Wang, S. Multilayer feature fusion with weight adjustment based on a convolutional neural network for remote sensing scene classification. IEEE Geosci. Remote Sens. Lett. 2020, 18, 241–245.
48. Yu, Y.; Liu, F. A two-stream deep fusion framework for high-resolution aerial scene classification. Comput. Intell. Neurosci. 2018, 2018, 8639367.
49. Wang, Q.; Liu, S.; Chanussot, J.; Li, X. Scene classification with recurrent attention of VHR remote sensing images. IEEE Trans. Geosci. Remote Sens. 2018, 57, 1155–1167.
50. Sun, X.; Zhu, Q.; Qin, Q. A multi-level convolution pyramid semantic fusion framework for high-resolution remote sensing image scene classification and annotation. IEEE Access 2021, 9, 18195–18208.
51. Wang, X.; Duan, L.; Shi, A.; Zhou, H. Multilevel feature fusion networks with adaptive channel dimensionality reduction for remote sensing scene classification. IEEE Geosci. Remote Sens. Lett. 2021, 19, 1–5.
52. Lv, Y.; Zhang, X.; Xiong, W.; Cui, Y.; Cai, M. An end-to-end local-global-fusion feature extraction network for remote sensing image scene classification. Remote Sens. 2019, 11, 3006.
53. Fan, R.; Wang, L.; Feng, R.; Zhu, Y. Attention based residual network for high-resolution remote sensing imagery scene classification. In Proceedings of the IGARSS 2019—2019 IEEE International Geoscience and Remote Sensing Symposium, Yokohama, Japan, 28 July–2 August 2019; pp. 1346–1349.
54. Guo, Y.; Ji, J.; Lu, X.; Huo, H.; Fang, T.; Li, D. Global-local attention network for aerial scene classification. IEEE Access 2019, 7, 67200–67212.
55. Wang, W.; Chen, Y.; Ghamisi, P. Transferring CNN with adaptive learning for remote sensing scene classification. IEEE Trans. Geosci. Remote Sens. 2022, 60, 1–18.
| Model | ViT8b | ViT16b | ViT8s | ViT16s |
|---|---|---|---|---|
| Dimension | 768 | 768 | 384 | 384 |
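The fusion evaluated in the following tables combines these four descriptors per image; a minimal sketch, assuming plain concatenation (768 + 768 + 384 + 384 = 2304 dimensions; the paper may instead use a learned fusion such as DCA [34]):

```python
import numpy as np

def fuse_vit_features(vit8b, vit16b, vit8s, vit16s):
    """Each input: (n_images, dim) array with dims 768, 768, 384, 384.
    Returns an (n_images, 2304) fused descriptor (concatenation assumed)."""
    return np.concatenate([vit8b, vit16b, vit8s, vit16s], axis=1)
```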
| Models | Accuracy (50% Training Ratio) | Accuracy (80% Training Ratio) |
|---|---|---|
| ViT8b | | |
| ViT16b | | |
| ViT8s | | |
| ViT16s | | |
| Fusion | 97.7 ± 0.00113 | 99.49 ± 0.001 |
| Methods | 80% Train | 50% Train |
|---|---|---|
| TEX-Net-LF [43] | | |
| Fine-tuned GoogLeNet [29] | | - |
| MCNN [44] | | - |
| MSCP [45] | | - |
| ADFF [46] | | - |
| MLFF_WWA [47] | | - |
| Two-Stream Fusion [48] | | |
| ARCNet-VGG16 [49] | | |
| ACR_MLFF [51] | | |
| LCPP [50] | | - |
| PROPOSED | 99.49 ± 0.001 | 97.90 ± 0.00113 |
| Models | Accuracy (20% Training Ratio) | Accuracy (50% Training Ratio) |
|---|---|---|
| ViT8b | | |
| ViT16b | | |
| ViT8s | | |
| ViT16s | | |
| Fusion | 94.54 ± 0.00071 | 96.75 ± 0.00104 |
| Methods | 50% Train | 20% Train |
|---|---|---|
| VGG-VD-16 [39] | | - |
| DCA fusion [34] | | - |
| Two-Stream Fusion [48] | | |
| ACR_MLFF [51] | | |
| ARCNet-VGG16 [49] | | |
| LGFFE [52] | | |
| LCPP [50] | | |
| PROPOSED | 96.932 ± 0.00024 | 94.625 ± 0.0001 |
| Models | Accuracy (10% Training Ratio) | Accuracy (20% Training Ratio) |
|---|---|---|
| ViT8b | | |
| ViT16b | | |
| ViT8s | | |
| ViT16s | | |
| Fusion | 90.89 ± 0.0011 | 92.23 ± 0.0005 |
| Methods | 10% Train | 20% Train |
|---|---|---|
| AlexNet [40] | | |
| RAN [53] | | |
| GLANet [54] | | |
| ACR_MLFF [51] | 90.01 ± 0.33 | 92.45 ± 0.20 |
| T_CNN [55] | | 93.05 ± 0.12 |
| PROPOSED | 90.89 ± 0.00011 | |
| Configuration | AID | NWPU | UC Merced |
|---|---|---|---|
| With Co-Selection | 94.63 | 89.70 | 97.70 |
| Without Co-Selection | 94.02 | 88.45 | 96.92 |
Publisher's Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
© 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).