WIREs Data Min Knowl - 2018 - Li - Deep Learning For Remote Sensing Image Classification A Survey
DOI: 10.1002/widm.1264

ADVANCED REVIEW

Deep learning for remote sensing image classification: A survey

1 School of Computer Science, Northwestern Polytechnical University, Xi'an, Shaanxi, China
2 Department of Computer Science, Institute of Mathematics, Physics and Computer Science, Aberystwyth University, Aberystwyth, UK

Correspondence
Qiang Shen, Department of Computer Science, Institute of Mathematics, Physics and Computer Science, Aberystwyth University, Aberystwyth SY23 3DB, UK.
Email: qqs@aber.ac.uk

Funding information
National Key Research and Development Program of China, Grant/Award Number: 2016YFB0502502; Foundation Project for Advanced Research Field, Grant/Award Number: 614023804016HK03002; Shaanxi International Scientific and Technological Cooperation Project, Grant/Award Number: 2017KW-006

Abstract
Remote sensing (RS) image classification plays an important role in earth observation technology using RS data, and has been widely exploited in both military and civil fields. However, owing to characteristics of RS data such as high dimensionality and the relatively small number of labeled samples available, RS image classification faces great scientific and practical challenges. In recent years, as new deep learning (DL) techniques have emerged, DL-based approaches to RS image classification have achieved significant breakthroughs, offering novel opportunities for the research and development of RS image classification. In this paper, a brief overview of typical DL models is presented first. This is followed by a systematic review of pixel-wise and scene-wise RS image classification approaches that are based on the use of DL. A comparative analysis of the performance of typical DL-based methods is also provided. Finally, the challenges and potential directions for further research are discussed.

This article is categorized under:
Application Areas > Science and Technology
Technologies > Classification
1 | INTRODUCTION
Recently, deep learning (DL) has become the fastest-growing trend in big data analysis and has been widely and successfully applied to various fields, such as natural language processing (Collobert & Weston, 2008), image classification (Krizhevsky, Sutskever, & Hinton, 2012), and speech enhancement (Xu, Du, Dai, & Lee, 2015), because of its outstanding performance compared with that of traditional learning algorithms. Such work is inspired by findings in biology: in primate visual systems, the brain is organized in a deep architecture and perception is represented at multiple levels of abstraction. DL architectures are artificial neural networks that usually involve more than two layers. As with their shallow counterparts, deep neural networks exploit feature representations learned exclusively from data; however, they do not require hand-crafted features, which are mostly designed on the basis of domain-specific knowledge. This avoids the heavy dependence of hand-crafted features on domain knowledge; moreover, it is impractical to capture all of the details embedded in all forms of real data with predesigned hand-crafted features. Instead of relying on shallow, manually engineered features, DL techniques automatically learn informative representations of raw input data at multiple levels of abstraction. Such learned features have been used successfully in many machine vision tasks.
Representing an important initial breakthrough in DL, deep belief networks (DBNs) (Hinton, Osindero, & Teh, 2006) were proposed through the exploitation of restricted Boltzmann machines (RBMs) (Freund & Haussler, 1991). This was followed by work based on the auto-encoder (Rumelhart & McClelland, 1988; Vincent, Larochelle, Bengio, & Manzagol, 2008), which trains the multiple intermediate levels of representation locally at each level.

This is an open access article under the terms of the Creative Commons Attribution-NonCommercial License, which permits use, distribution and reproduction in any medium, provided the original work is properly cited and is not used for commercial purposes.
© 2018 The Authors. WIREs Data Mining and Knowledge Discovery published by Wiley Periodicals, Inc.
WIREs Data Mining Knowl Discov. 2018;8:e1264. wires.wiley.com/dmkd
https://doi.org/10.1002/widm.1264

Recently,
another DL architecture, the convolutional neural network (CNN) (Lecun, Bottou, Bengio, & Haffner, 1998), has achieved significant results in computer vision, owing to the deep structure that enables the model to capture and generalize filtering mechanisms by performing convolutions in the image domain, leading to highly abstract and effective features.
Despite its great potential, in general, the use of DL in RS image classification brings forward significant new challenges.
There are several reasons for this: First, many RS data, especially hyperspectral images (HSIs), contain hundreds of bands that
can cause a small patch to involve a really large amount of data, which would demand a large number of neurons in a DL net-
work (Berlin & Kay, 1969; Chen, Xiang, Liu, & Pan, 2013; Zhang et al., 2018). Apart from the visual geometrical patterns
within each band, the spectral curve vectors across bands may also provide important information. However, how to utilize
this information still requires further research. Second, the usually impressive performance of DL techniques relies on large
numbers of labeled samples. Unfortunately, very few labeled samples are available in RS data. Third, compared with conventional natural scene images, RS images are more complex. High spatial resolution RS images may involve various types of objects that differ in size, color, location and rotation, and HSIs may be acquired using different sensors in the first place. The complexity of RS data makes it very difficult, if not impossible, to directly construct a DL network model for the classification of such images; assistance is required for DL to perform well.
The aforementioned reasons make the application of DL in RS image classification rather specific, yet challenging. Having recognized this, a good number of approaches have recently been proposed to deal with such challenges. This paper presents a survey of such developments, focusing on two important aspects: one being pixel-wise classification for HSIs and the other being scene classification for high-resolution aerial or satellite images. The former is concerned with identifying which category each pixel in a given RS image belongs to, and the latter aims to automatically assign a semantic label to each RS scene image.
This survey is organized as follows. The second section outlines typical DL models which are used in RS image classifica-
tion, including CNNs, stacked auto-encoders (SAEs), and DBNs. The third section reviews the pixel-wise and scene-wise RS
image classification approaches that are based on DL. The classification performances of typical DL-based methods for RS
images are also compared in this section. The fourth section summarizes the present work and discusses challenges ahead,
pointing out potential directions for further research in RS image classification using DL techniques.
2 | TYPICAL DEEP NETWORK MODELS
In this section, we briefly review the following three typical deep neural network models that have been used for RS image
classification. More details about DL architectures in machine learning can be found in (Bengio, 2009; Bengio, Courville, &
Vincent, 2013).
FIGURE 1 A typical convolutional neural network: the input passes through alternating convolution (C1, C2, C3 feature maps) and pooling (P1, P2 feature maps) layers, followed by a full connection
In a convolution layer, the value at position (x, y) of the jth feature map in the lth layer is computed as

$$\mathrm{map}_{x,y}^{l,j} = f\left(\sum_{m}\sum_{h=0}^{H_l-1}\sum_{w=0}^{W_l-1} k_{h,w}^{l,j,m}\,\mathrm{map}_{(x+h),(y+w)}^{(l-1),m} + b^{l,j}\right) \quad (1)$$

where $k_{h,w}^{l,j,m}$ is the value at position (h, w) of the kernel connected to the mth feature map in the (l − 1)th layer, $H_l$ and $W_l$ are the height and width of the kernel, respectively, and $b^{l,j}$ is the bias of the jth feature map in the lth layer. Such convolution layers introduce a weight-sharing mechanism within each feature map, which significantly reduces the number of parameters otherwise required. A convolution layer can take two-dimensional (2D) images of any scale directly as input while preserving the location information of objects in the images. Given the recognized inherent advantages of the convolution operation, a significant amount of work in the literature has focused on improving the ability of convolution layers. For instance, Lin, Chen, and
Yan (2013) proposed a network in a network, substituting the conventional convolution layer with a multilayer perceptron
consisting of multiple fully connected layers. Long, Shelhamer, and Darrell (2017) replaced the fully connected layers in a
CNN with a deconvolution layer to build a novel convolutional network.
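As a concrete illustration, the convolution of Equation (1) can be sketched in a few lines of NumPy. This is a minimal sketch under stated assumptions: the valid-padding loop implementation, the ReLU activation, and all array sizes are illustrative choices, not details taken from the survey.

```python
import numpy as np

def conv_layer(maps_prev, kernels, biases, f=lambda a: np.maximum(a, 0)):
    """Compute the feature maps of layer l from those of layer l-1 (Eq. 1).

    maps_prev : (M, H, W)       the M feature maps of layer l-1
    kernels   : (J, M, Hl, Wl)  one (M, Hl, Wl) kernel per output map j
    biases    : (J,)            one bias per output map j
    Returns   : (J, H-Hl+1, W-Wl+1) feature maps of layer l ("valid" mode)
    """
    J, M, Hl, Wl = kernels.shape
    _, H, W = maps_prev.shape
    out = np.zeros((J, H - Hl + 1, W - Wl + 1))
    for j in range(J):
        for x in range(out.shape[1]):
            for y in range(out.shape[2]):
                # sum over input maps m and kernel positions (h, w)
                patch = maps_prev[:, x:x + Hl, y:y + Wl]
                out[j, x, y] = np.sum(kernels[j] * patch) + biases[j]
    return f(out)
```

Because every output position (x, y) reuses the same kernel, the layer holds only J × M × Hl × Wl weights regardless of the image size, which is exactly the weight-sharing property discussed above.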
Generally, a pooling layer follows a convolutional layer and it is used to reduce the dimensionality of feature maps. There
are two types of basic pooling operation which are the most commonly used: average pooling and max pooling, as shown in
Figure 2. Detailed theoretical analysis of these is beyond the scope of this paper, but can be found in Scherer, Müller, and
Behnke (2010). As the computation process of pooling operation takes neighboring pixels into account, a pooling layer is
translation invariant. Apart from average and max pooling, there are several other pooling operations, including spatial pyra-
mid pooling (He et al., 2014), stochastic pooling (Zeiler & Fergus, 2013) and def-pooling (Ouyang et al., 2014).
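The two basic operations of Figure 2 can be sketched in plain Python; the non-overlapping 2 × 2 window used here matches the figure, while the input values in the usage note are merely illustrative.

```python
def pool2x2(feature_map, op="max"):
    """Non-overlapping 2x2 pooling over a 2D feature map (list of lists).

    op selects the reduction: "max" keeps the largest value in each window,
    "avg" keeps the window mean.
    """
    avg = lambda block: sum(block) / len(block)
    reduce_window = max if op == "max" else avg
    H, W = len(feature_map), len(feature_map[0])
    return [
        [reduce_window([feature_map[i][j],     feature_map[i][j + 1],
                        feature_map[i + 1][j], feature_map[i + 1][j + 1]])
         for j in range(0, W, 2)]
        for i in range(0, H, 2)
    ]
```

For the input [[1, 1, 2, 2], [1, 1, 2, 2], [3, 3, 4, 4], [3, 3, 4, 4]], both reductions yield the 2 × 2 map [[1, 2], [3, 4]].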
A fully connected layer is basically the same as one in a traditional neural network (such as a back-propagation network). The output maps of the last convolution or pooling layer are arranged into vectors, acting as the inputs to the first fully connected layer. The output of the final fully connected layer can be regarded as the learnt feature extracted from the input image by the convolutional network. Classification can then be implemented simply by connecting this output to a learning classifier, such as softmax (Krizhevsky et al., 2012).
Compared to shallow learning, the advantage of DL is that it introduces deep network architectures to learn more abstract and effective features. However, the large number of parameters introduced in so doing may lead to overfitting. Numerous regularization methods have emerged in defense against potential overfitting, such as dropout (Krizhevsky et al., 2012) and batch normalization (Ioffe & Szegedy, 2015). The former randomly omits part of the feature detectors during each training case, and the latter normalizes a certain part of the model architecture for each training mini-batch.
The learning and working process of a CNN can be summarized in two stages: (a) network training and (b) feature extraction and classification. The first stage has two parts: a forward part and a backward part. In the forward part, the input images are fed through the network to obtain an abstract representation, which is used to compute the loss with respect to the given ground-truth labels. Based on the loss, the backward part computes the gradients of each network parameter, and all the parameters are then updated in response to the gradients in preparation for the next forward computation cycle. After sufficient training iterations, in the second stage, the trained network can be used to extract deep features and classify unknown images.
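The two-part training loop described above can be sketched for the smallest possible "network", a single softmax layer trained by gradient descent; the toy data, learning rate, and iteration count are illustrative assumptions, and a real CNN applies the same forward/backward cycle to all of its convolutional and fully connected parameters.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))                      # toy inputs
labels = (X[:, 0] + X[:, 1] > 0).astype(int)       # ground-truth labels
Y = np.eye(2)[labels]                              # one-hot targets
W, b = np.zeros((2, 2)), np.zeros(2)

for step in range(200):
    # forward part: propagate inputs and obtain class probabilities
    logits = X @ W + b
    p = np.exp(logits - logits.max(axis=1, keepdims=True))
    p /= p.sum(axis=1, keepdims=True)
    # backward part: gradients of the cross-entropy loss w.r.t. parameters
    g = (p - Y) / len(X)
    W -= 0.5 * (X.T @ g)                           # update in response to gradients
    b -= 0.5 * g.sum(axis=0)

# second stage: the trained model is used to classify inputs
accuracy = ((X @ W + b).argmax(axis=1) == labels).mean()
```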
FIGURE 2 Two basic pooling operations. (a) Average pooling. (b) Max pooling
FIGURE 3 Stacked auto-encoder (SAE) and auto-encoder (AE)

An AE maps an input x to a hidden representation y, and then reconstructs x as z:

$$\mathbf{y} = f(W_y\mathbf{x} + \mathbf{b}_y), \qquad \mathbf{z} = f(W_z\mathbf{y} + \mathbf{b}_z) \quad (2)$$

where $W_y$ and $W_z$ denote the input-to-hidden and hidden-to-output weights, respectively, $\mathbf{b}_y$ and $\mathbf{b}_z$ denote the biases of the hidden and output units, respectively, and f(·) denotes the activation function, which applies element-wise to its argument. The loss (or energy) function J(θ) measures the reconstruction z given the input x:

$$J(\theta) = \frac{1}{2M}\sum_{m=1}^{M}\left\| \mathbf{z}^{(m)} - \mathbf{x}^{(m)} \right\|^2 \quad (3)$$
where M denotes the number of training samples. The objective is to find the parameters θ = (W, b_y, b_z) that minimize the difference between the output and the input over the whole training set X = [x(1), x(2), …, x(m), …, x(M)]; this can be
efficiently implemented via the stochastic gradient descent algorithm (Johnson & Zhang, 2013).
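The encoder/decoder mapping and the loss J(θ) of Equation (3) can be sketched as follows; the sigmoid activation and the layer sizes used in the check below are illustrative assumptions.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def ae_forward(X, Wy, by, Wz, bz):
    """Encode inputs X to hidden codes y, then decode to reconstructions z."""
    y = sigmoid(X @ Wy + by)   # input-to-hidden (W_y, b_y)
    z = sigmoid(y @ Wz + bz)   # hidden-to-output (W_z, b_z)
    return y, z

def reconstruction_loss(X, Z):
    """J(theta) = (1 / 2M) * sum_m ||z^(m) - x^(m)||^2."""
    return np.sum((Z - X) ** 2) / (2 * len(X))
```

In the tied-weight variant discussed later in this survey, W_z is simply the transposition of W_y, halving the number of parameters to learn.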
There are two well-known variants of the AE, that is, the denoising AE (Vincent et al., 2008) and the sparse AE (Schölkopf, Platt, & Hofmann, 2006a). The former can recover the correct input from a corrupted version, thus forcing the model to capture the structure of the input distribution. The latter aims to extract sparse features from raw data, where the objective is to
minimize the reconstruction error with a sparsity constraint.
As indicated above, an SAE consists of multiple layers of AEs, each of which is a special type of neural network used for efficient encoding. Instead of being trained to predict a certain target label given inputs, an AE is trained to reconstruct its own inputs. A single AE is not able to obtain discriminative and representative features from raw input data, so multiple AEs are usually stacked on one another to form an SAE, which forwards the code learned by each AE to the next in order to accomplish a given task.
The energy of a joint configuration of the visible units $v^l$ and hidden units $h^l$ of an RBM is defined as

$$E(v^l, h^l \mid \theta^l) = -\sum_{i=1}^{I} a_i^l v_i^l - \sum_{j=1}^{J} b_j^l h_j^l - \sum_{i=1}^{I}\sum_{j=1}^{J} w_{ij}^l h_j^l v_i^l \quad (4)$$
FIGURE 4 Deep belief network (DBN) and restricted Boltzmann machine (RBM)
where $\theta^l = \{w_{ij}^l, a_i^l, b_j^l : i = 1, 2, \ldots, I;\ j = 1, 2, \ldots, J\}$ forms the set of model parameters. An RBM defines a joint probability over the visible and hidden units as

$$p(v^l, h^l \mid \theta^l) = \frac{\exp\left(-E(v^l, h^l \mid \theta^l)\right)}{Z(\theta^l)} \quad (5)$$

where Z is the so-called partition function,

$$Z(\theta^l) = \sum_{v^l}\sum_{h^l} \exp\left(-E(v^l, h^l \mid \theta^l)\right) \quad (6)$$
Then, the conditional distributions $p(h_j^l = 1 \mid v^l)$ and $p(v_i^l = 1 \mid h^l)$ can be readily computed. Figure 4 shows a typical DBN for deep feature learning from hyperspectral images. In a DBN, the output of the preceding RBM is used as input data for the next RBM. Two adjacent layers have a full set of connections between them, but no two units in the same layer are connected.
The input vector $(v_1^0, v_2^0, \ldots, v_I^0)^T$ can be set to the spectral signature of each pixel or to contextual features from neighboring pixels. Every layer outputs a feature of its input data; the further a layer is from the network input, the more abstract the feature it produces.
A DBN is a probabilistic generative model that provides a joint probability distribution over observable data and labels. A DBN first takes advantage of an efficient layer-by-layer greedy learning strategy to initialize the deep network, and then fine-tunes all of the weights jointly with respect to the desired outputs.
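For a binary RBM, the conditional distributions mentioned above take a closed sigmoid form, which is what makes the greedy layer-by-layer training practical. A minimal sketch of one up/down Gibbs step, with illustrative sizes (I = 6 visible, J = 3 hidden units):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def p_h_given_v(v, W, b):
    """p(h_j = 1 | v) = sigmoid(b_j + sum_i w_ij v_i)."""
    return sigmoid(v @ W + b)

def p_v_given_h(h, W, a):
    """p(v_i = 1 | h) = sigmoid(a_i + sum_j w_ij h_j)."""
    return sigmoid(h @ W.T + a)

rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(6, 3))           # weights w_ij
a, b = np.zeros(6), np.zeros(3)                  # visible / hidden biases

v0 = (rng.uniform(size=6) < 0.5).astype(float)   # a binary visible vector
ph = p_h_given_v(v0, W, b)                       # up pass
h0 = (rng.uniform(size=3) < ph).astype(float)    # sample hidden states
pv = p_v_given_h(h0, W, a)                       # down pass (reconstruction)
```

Contrastive-divergence training compares statistics of the data-driven pass with those of the reconstruction to update W, and a DBN repeats this procedure one RBM at a time, feeding each layer's hidden activations to the next.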
3 | DL FOR REMOTE SENSING IMAGE CLASSIFICATION
Within the past decade, DL has emerged as one of the most successful machine learning techniques, achieving impressive performance in the field of computer vision and image processing, with applications such as image classification (He et al., 2014; Krizhevsky et al., 2012), object detection (Girshick, 2015; Girshick et al., 2013), and super-resolution restoration (Dong et al., 2016). DL has also recently been taking off in remote sensing image classification, and a growing number of relevant papers are reported in the literature year by year. In this section, we focus on pixel-wise and scene-wise remote sensing image classification approaches that are based on DL, supported by comparative experimental analyses.
TABLE 1 Benchmark hyperspectral data sets

| Data sets | Indian Pines | Salinas | Kennedy Space Center | Pavia Center | Pavia University | Botswana |
|---|---|---|---|---|---|---|
| Acquisition time | 1992 | 1992 | 1996 | 2001 | 2001 | 2001 |
| Location | Indiana | California | Florida | Northern Italy | Northern Italy | Okavango delta |
| Device | AVIRIS | AVIRIS | AVIRIS | ROSIS | ROSIS | Hyperion |
| Spectrum coverage (nm) | 400–2,500 | 400–2,500 | 400–2,500 | 430–860 | 430–860 | 400–2,500 |
| Data size (pixels) | 145 × 145 | 512 × 217 | 512 × 614 | 1096 × 492 | 610 × 340 | 1476 × 256 |
| Spectrum number (corrected) | 224 | 224 | 224 | 115 | 115 | 242 |
| Sample size | 10,249 | 54,129 | 5,211 | 7,456 | 42,776 | 3,248 |
| Category | 16 | 16 | 13 | 9 | 9 | 14 |
These data have 224 spectral bands in the wavelength range of 400–2,500 nm. The 24 bands covering the region of water absorption were removed due to noise. The Indian Pines ground truth contains 16 classes, with 10,249 pixels labeled in total.
As is common in the literature, three performance indicators, overall accuracy (OA), average accuracy (AA), and the kappa coefficient (K), are employed to evaluate the classification performance on the benchmark data sets. OA equals the number of correctly classified samples divided by the total number of samples. AA is the average of the accuracies across all categories. K is a relatively more comprehensive indicator, computed as follows:
$$K = \frac{N\sum_{i=1}^{n} C_{ii} - \sum_{i=1}^{n} C_{i+}C_{+i}}{N^2 - \sum_{i=1}^{n} C_{i+}C_{+i}} \quad (7)$$
where N is the number of overall samples, n is the number of categories and Cij is the (i, j)th value of the confusion matrix
C (Thompson & Walter, 1988), with Ci+ and C+i, respectively, denoting the sum of the ith row and that of the ith column of C.
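All three indicators can be computed directly from the confusion matrix C; the following sketch and the small example matrix are illustrative, not taken from the survey's experiments.

```python
import numpy as np

def oa_aa_kappa(C):
    """OA, AA and kappa from a confusion matrix C, where C[i, j] counts
    samples of true class i assigned to class j."""
    C = np.asarray(C, dtype=float)
    N = C.sum()                                        # number of overall samples
    oa = np.trace(C) / N                               # overall accuracy
    aa = np.mean(np.diag(C) / C.sum(axis=1))           # mean per-class accuracy
    chance = np.sum(C.sum(axis=1) * C.sum(axis=0))     # sum_i C_{i+} C_{+i}
    kappa = (N * np.trace(C) - chance) / (N ** 2 - chance)  # Eq. (7)
    return oa, aa, kappa
```

For C = [[45, 5], [10, 40]] this returns OA = 0.85, AA = 0.85 and K = 0.7; kappa is lower than OA because it discounts the agreement expected by chance.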
FIGURE 5 HSI classification with deep spectral features: the spectral vector of each pixel is fed to a deep learning model (1D-CNN, SAE, or DBN) to obtain the classification result
Hu, Huang, Wei, Zhang, and Li (2015) attempt to carry out HSI classification using a 1D-CNN that contains four layers: one convolution layer followed by one pooling layer and two fully connected layers. The study of Mei et al. (2016) exploits a similar 1D-CNN to classify HSIs; the difference is that the latter omits the pooling layer but relies on the exploitation of additional techniques like dropout (Krizhevsky et al., 2012) and batch normalization (Xu et al., 2015).
Among the three typical DL models (CNN, SAE, and DBN), SAE and DBN take data represented in vector form as input, thereby fitting the need to extract deep spectral features from spectral vectors. In fact, SAE and DBN were introduced to perform HSI spectral feature classification earlier than CNN.
An initial attempt along this direction can be found in (Chen et al., 2017), where the authors adopt an SAE to extract deep spectral features. By exploiting the relationship between the input layer and the reconstruction layer in an AE, a constraint is introduced such that the hidden-to-output weights are the transposition of the input-to-hidden ones. Following this original work, the use of a DBN instead of an SAE is reported in (Chen et al., 2015). Similarly, Ma, Wang, Geng, and Wang (2016) employ an SAE to learn effective features and add a relative-distance prior in the fine-tuning process, giving more effective guidance regarding the desirable features when there are not sufficient labeled samples. Xing, Ma, and Yang (2015) use a stacked denoising AE (Vincent, Larochelle, Lajoie, Bengio, & Manzagol, 2010) to extract robust spectral features and complete the classification task.
A similar idea was adopted in (He, Li, Zhang, Zhang, & Wang, 2016), where HSIs are classified via a novel model named the deep stacking network (DSN). A DSN stacks many simple modules, each of which contains an input layer, a hidden layer and an output layer, with the input-to-hidden weights initialized randomly or via contrastive divergence (Hinton, 2002), and the hidden-to-output weights initialized by computing a pseudo-inverse (Golub & Kahan, 1965). Zhong, Gong, and Schönlieb (2016) added diversity-promoting priors by incorporating diversity-promoting conditions into the optimization of the training objective in the pretraining and fine-tuning processes of a DBN, which helps improve classification efficiency on HSIs.
FIGURE 6 HSI classification with deep spatial features: dimension reduction is applied to the multi-channel image first, and neighborhood patches (flattened for SAE/DBN) are then fed to a deep learning model to obtain the classification result
spatial features (with a fairly large window size this time, of 42 × 42). Clearly, there is a trade-off between the number of PCs used and the window size to employ.
Both methods proposed in Makantasis et al. (2015) and Yue et al. (2015) are simple and intuitive, performing classification using 2D-CNN-based HSI spatial features. In addition to these, there are several alternative attempts. For instance, in order to capture multiscale spatial features, Zhao, Guo, Yue, Luo, and Luo (2015) used a Laplacian pyramid transformation to extract multiscale data from condensed HSI, with the data at each scale fed to an independent 2D-CNN to extract deep spatial features. Aptoula, Ozdemir, and Yanikoglu (2016) preprocess raw HSIs with PCA and attribute profiles (Mura, Benediktsson, Waske, & Bruzzone, 2010) sequentially, followed by the use of a 2D-CNN that takes 42 × 42-sized HSI patches as input to accomplish the classification task. In Liang and Li (2016), the dimensionality of HSIs is first reduced with PCA; spatial features are next extracted using a 2D-CNN, further processed with sparse coding, and finally fed into a learning classifier. Another novel approach has been proposed in Li, Xie, and Li (2016), where an
HSI reconstruction model based on the use of a deep CNN is proposed to enhance spatial features, with the reconstructed
image classified by the efficient extreme learning machine (Li, Chen, Su, & Du, 2015).
Similar in their underlying ideas, Chen et al. (2017) and Lin, Chen, Zhao, and Wang (2015) exploit PCA to reduce the dimensionality of HSIs (to four and three, respectively); the resultant data cubes are flattened or extracted from neighborhood regions, followed by an SAE performing the actual classification. In the pretraining stage, the hidden-to-output weights are restricted to the transposition of the input-to-hidden ones, with the cross entropy taken as the loss function to minimize. In the fine-tuning stage, softmax is taken as the activation function of the output layer of the SAE. Two similar frameworks that adopt a DBN in their structure can also be found in Chen et al. (2015) and Li, Zhang, and Zhang (2015).
FIGURE 7 HSI classification with deep spectral-spatial features: spectral vectors (via 1D-CNN), HSI subcubes (via 3D-CNN), or flattened spatial cubes from the compacted HSI (via SAE or DBN) are used to produce the classification result
spatial features are learned from the compacted HSI via a 2D-CNN that implements spatial pyramid pooling (He et al., 2014).
In Zhao and Du (2016), based on local discriminant embedding (LDE) (Chen, Chang, & Liu, 2005), a balanced LDE method
was proposed and jointly used with a 2D-CNN to obtain the final classification result.
The main challenge facing 2D-CNN due to 3D HSI is the additional dimension. The CNN-based spectral-spatial classifica-
tion methods in the literature deal with this challenge by either expanding 2D-CNN to 3D-CNN or rearranging 3D HSI to 2D
HSI. Both in Chen et al. (2016) and Li et al. (2017), for example, a 3D-CNN is employed to learn deep spectral-spatial fea-
tures. In particular, the former exploits a large-scale 3D-CNN which takes cubes of 27 × 27 in space size as input, while the
latter uses a much more compact 3D-CNN with input cubes of 5 × 5 in size. In Lee and Kwon (2016) and Slavkovikj, Verstockt, Neve, Hoecke, and Walle (2015), a 2D-CNN is employed to learn spectral-spatial features. Particularly, Lee and Kwon
(2016) presented an approach that convolves the 3D subcubes extracted from raw HSI images along the spatial dimension
with 3 × 3-sized and 1 × 1-sized convolution kernels, and then reconstructs new 3D data using the convolved outputs jointly.
The procedure employs a pointwise convolution layer at last to complete the classification. The work of Slavkovikj
et al. (2015) extracts 3D cubes from raw HSI first, and then reshapes such cubes to 2D images.
Classification methods that rely on SAE- and DBN-based spectral-spatial features always extract spectral and spatial features separately and then join them to form spectral-spatial features. The spectral information does not require any preprocessing, but the spatial information has to be flattened into a 1D vector, as SAE and DBN can only handle 1D input. Following this general approach, Chen et al. applied an SAE (Chen et al., 2017) and a DBN (Chen et al., 2015) for spectral-spatial feature extraction and
classification. Also, Ma, Wang, and Geng (2016) proposed a spatially updated deep AE for spectral-spatial feature extraction,
by adding a sample similarity regularization mechanism and combining it with the collaborative representation-based classifi-
cation to deal with the problem of small training sets. Tao, Pan, Li, and Zou (2015) adopted a stacked sparse AE to extract high-level features from unlabeled data and then feed the learnt features to an SVM for classification. Li, Bruzzone, and Liu
(2015) proposed a two-step framework, where HSI cubes are filtered by 3D Gabor wavelets first and then an SAE is trained
using the outputs of the previous step via unsupervised pre-training, followed by fine-tuning over the entire network last. Ma,
Geng, and Wang (2015) proposed a contextual DL algorithm which extracts spectral-spatial features through a deep SAE
architecture. To extract spectral-spatial features efficiently, Han, Zhong, and Zhang (2016) proposed an unsupervised convolutional sparse AE (UCSAE) with a window-in-window selection strategy. Again, to deal with the problem of limited training samples, Ma, Wang, and Wang (2016) proposed a novel semi-supervised classification framework based on the utilization of multidecision and deep features.
Apart from SAE, DBN, and CNN, there is one more popular DL model, namely the recurrent neural network (RNN), which was proposed for processing sequential data (such as speech) and has also been introduced into HSI classification. Compared with the number of HSI classification methods based on SAE, DBN, and CNN, that of RNN-based methods is relatively small, but such methods are more recent. The first attempt can be found in Mou, Ghamisi, and
Zhu (2017), where Mou et al. used an RNN to capture the sequential property of a hyperspectral pixel vector to perform classi-
fication tasks. They also used parametric rectified tanh (PRetanh) in their network to avoid the risk of divergence during the
training procedure. Wu and Prasad (2017) proposed convolutional recurrent neural network (CRNN), in which a few convolu-
tion layers are followed by recurrent layers. Middle-level and locally invariant features are extracted from raw HSI and spec-
trally contextual features are then extracted from the features generated by convolution layers.
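The idea of treating a hyperspectral pixel vector as a sequence can be sketched with a minimal recurrent cell; the plain tanh nonlinearity (rather than the PRetanh of Mou et al.), the band count, and the hidden size are all illustrative assumptions.

```python
import numpy as np

def rnn_spectral_features(pixel, Wxh, Whh, bh):
    """Run a tanh RNN over the bands of one hyperspectral pixel.

    pixel : (B,) reflectance values, one per spectral band
    Returns the final hidden state, a sequential feature of the spectrum.
    """
    h = np.zeros(Whh.shape[0])
    for band_value in pixel:          # one time step per band
        h = np.tanh(Wxh * band_value + Whh @ h + bh)
    return h

rng = np.random.default_rng(0)
B, H = 103, 16                        # illustrative band count and hidden size
Wxh = rng.normal(scale=0.1, size=H)   # input (scalar band value) to hidden
Whh = rng.normal(scale=0.1, size=(H, H))
features = rnn_spectral_features(rng.uniform(size=B), Wxh, Whh, np.zeros(H))
```

Because the same weights are reused at every band, the feature extractor is independent of the number of bands, which is one attraction of recurrent models for hyperspectral data.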
CNN = convolutional neural networks; SAE = stacked auto-encoders; DBN = deep belief network. The bold values are the best results.
Now, take the Indian Pines data set as another example. Splitting the labeled samples into training data and testing data with a ratio of 1:1, we compare five spectral-spatial feature-based classification frameworks, based on SAE, DBN, 2D-CNN, 3D-CNN, and the dual-channel convolutional neural network (DC-CNN), respectively. The experimental results
are listed in Table 3 and the visual classification results are shown in Figure 8.
As shown in Figure 8, CNN-based HSI spectral-spatial feature classification approaches, including 2D-CNN, 3D-CNN,
and DC-CNN, can achieve better performance than SAE- and DBN-based methods. It is interesting to note that, historically, work on HSI classification methods based on SAEs and DBNs was developed earlier than that based on CNNs. However, in recent years the number of papers regarding the use of CNNs for HSI classification has grown fastest, and the performance of CNNs is generally better.
(Class legend: Alfalfa, Corn-notill, Corn-mintill, Corn, Grass-pasture, Grass-trees, Grass-pasture-mowed, Hay-windrowed, Oats, Soybean-notill, Soybean-mintill, Soybean-clean, Wheat, Woods, Buildings-grass-trees-drives, Stone-steel-towers; scale bar: 0–200 m)
FIGURE 8 Classification of Indian Pines. (a) False-color composite; (b) ground truth; (c) SAE, OA = 93.98%; (d) DBN, OA = 95.91%; (e) 2D-CNN, OA = 95.97%; (f) 3D-CNN, OA = 99.07%; (g) DC-CNN, OA = 99.92%
variations potentially existing in scene images in the spatial arrangements and structural patterns make scene classification a
considerably challenging task (Zhu et al., 2017).
To construct a high-powered scene classification method, the use of efficient and effective feature representations is very
important. The early works for scene classification are mainly based on handcrafted features. These methods generally focus
on the use of a considerable amount of domain-specific properties to design various low-level visual features or on middle-
level feature representations by encoding low-level local features. The former include properties such as color (typically the
color histograms, CH; Swain & Ballard, 1991), texture (typically the local binary patterns, LBP; Ojala, Pietikäinen, & Mäenpää, 2000), and structure (typically the scale invariant feature transforms, SIFT; Kim, Madden, & Warner, 2009), and the latter include representations such as bag of visual words (BoVW; Sivic & Zisserman, 2003), spatial pyramid matching (SPM; Lazebnik, Schmid, & Ponce, 2006), locality-constrained linear coding (LLC; Wang et al., 2010), and the improved Fisher kernel (IFK; Perronnin & Mensink, 2010). The potential for improvement of such traditional approaches is limited by the experts' ability to design the feature extractors and by the representational power of the encoding schemes. In recent years, learned high-level deep features
have been reported to achieve state-of-the-art performance on aerial image classification (Hu et al., 2016, 2017; Xia et al.,
2016; Yang et al., 2015; Zhao & Du, 2016).
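To make the middle-level encoding step concrete, the following NumPy sketch implements the core of BoVW under simplifying assumptions: random 8-D vectors stand in for SIFT descriptors, and a tiny k-means codebook (k = 4) replaces the much larger vocabularies used in practice:

```python
import numpy as np

rng = np.random.default_rng(0)

def build_codebook(descriptors, k=4, iters=10):
    # Plain k-means over local descriptors yields the "visual words".
    centers = descriptors[rng.choice(len(descriptors), k, replace=False)]
    for _ in range(iters):
        d = np.linalg.norm(descriptors[:, None] - centers[None], axis=2)
        labels = d.argmin(axis=1)
        for j in range(k):
            pts = descriptors[labels == j]
            if len(pts):
                centers[j] = pts.mean(axis=0)
    return centers

def bovw_encode(descriptors, centers):
    # Assign each local descriptor to its nearest visual word and
    # return the normalized word-count histogram (the BoVW vector).
    d = np.linalg.norm(descriptors[:, None] - centers[None], axis=2)
    labels = d.argmin(axis=1)
    hist = np.bincount(labels, minlength=len(centers)).astype(float)
    return hist / hist.sum()

# Toy "image": 60 random 8-D local descriptors standing in for SIFT.
descs = rng.normal(size=(60, 8))
codebook = build_codebook(descs, k=4)
vec = bovw_encode(descs, codebook)   # the image-level representation
```

SPM, LLC, and IFK can be seen as refinements of this pipeline that add spatial layout, locality-aware coding, or higher-order statistics to the encoding step.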
Unsupervised feature extraction methods can learn feature representations from the images or patches with
no prior labels. Traditional unsupervised feature extraction methods include RBMs, AEs, sparse coding, and k-means cluster-
ing. For instance, Risojevic and Babic (2014) proposed an approach combining quaternion PCA and k-means for unsupervised
feature learning that makes joint encoding of the intensity and color information possible. Cheriyadat (2013) introduced a
sparse coding-based method where the dense low-level features were extracted and encoded in terms of the basis functions in
an effort to generate a new sparse representation. All of these feature-learning models are shallow and can be stacked to form deep unsupervised models, several of which have been successfully applied to RS image scene classification (Zhang et al., 2014). For instance, Zhang et al. (2014) proposed an unsupervised feature learning framework for scene classification using
the saliency-guided sparse AE model and a new dropout technique (Hinton, Srivastava, Krizhevsky, Sutskever, & Salakhutdi-
nov, 2012). In addition, CNNs can also be trained in an unsupervised fashion, by means of greedy layerwise pretraining (Lee, Grosse, Ranganath, & Ng, 2009; Masci, Meier, Dan, & Schmidhuber, 2011; Schölkopf, Platt, & Hofmann, 2006b). For example, the use of deep CNNs for RS image classification was introduced in Romero, Gatta, and Camps-Valls (2016), with the network being trained by an unsupervised method that seeks sparse feature representations.
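As an illustration of such unsupervised feature learning, the sketch below trains a single shallow autoencoder layer with plain gradient descent on synthetic 5-band "pixels"; all sizes and the data are invented for illustration. In greedy layerwise pretraining, several such layers would be trained in sequence, each on the codes produced by the previous one:

```python
import numpy as np

rng = np.random.default_rng(1)
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

def train_ae_layer(X, n_hidden, lr=0.5, epochs=500):
    # One shallow autoencoder layer trained to reconstruct X from
    # its hidden code; no labels are used anywhere.
    n_in = X.shape[1]
    W1 = rng.normal(scale=0.1, size=(n_in, n_hidden)); b1 = np.zeros(n_hidden)
    W2 = rng.normal(scale=0.1, size=(n_hidden, n_in)); b2 = np.zeros(n_in)
    losses = []
    for _ in range(epochs):
        H = sigmoid(X @ W1 + b1)               # encode
        R = sigmoid(H @ W2 + b2)               # decode
        losses.append(((R - X) ** 2).mean())   # reconstruction error
        dR = (R - X) * R * (1 - R)             # output-layer delta
        dH = (dR @ W2.T) * H * (1 - H)         # hidden-layer delta
        W2 -= lr * H.T @ dR / len(X); b2 -= lr * dR.mean(0)
        W1 -= lr * X.T @ dH / len(X); b1 -= lr * dH.mean(0)
    return (lambda Z: sigmoid(Z @ W1 + b1)), losses

# Synthetic 5-band "pixels" drawn around two spectral signatures.
centers = rng.random((2, 5))
X = np.clip(rng.normal(loc=centers[rng.integers(0, 2, 100)], scale=0.05), 0, 1)
encode, losses = train_ae_layer(X, n_hidden=3)
codes = encode(X)   # learned 3-D features for each pixel
```

The learned codes can then be fed to any classifier, or used as the input of the next autoencoder layer in a stacked model.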
4 | CONCLUSIONS AND FURTHER RESEARCH
In this literature survey, we have briefly introduced a number of typical DL models that may be used to perform RS image
classification, including: CNNs, SAEs and DBNs. Following the introduction, from two main perspectives, pixel-wise image
classification and scene-wise image classification, we have systematically reviewed the state-of-the-art DL approaches for RS
image classification. In particular, classification methods based on spectral features, spatial features, and joint spectral-spatial features have been discussed, covering both supervised and unsupervised DL-based feature extraction. We have also compared and analyzed the performances of such typical methods. The overall accuracies (%) achieved by representative low-, middle-, and high-level feature methods on four scene data sets (training ratios in parentheses) are:

Feature level   Method        UC-Merced (50%)   WHU-RS19 (60%)   RSSCN7 (50%)   AID (50%)
Low             SIFT          28.92             27.21            32.76          16.67
                LBP           34.57             44.08            60.38          29.99
                CH            42.09             51.87            60.54          37.28
Middle          BoVW (SIFT)   71.90             80.13            81.34          67.65
                SPM (SIFT)    56.50             55.82            68.45          45.52
                LLC (SIFT)    77.08             80.71            83.34          75.01
                IFK (SIFT)    77.09             86.95            84.41          77.33
High            CaffeNet      93.98             96.24            88.35          86.86
                VGGNet        94.14             96.05            87.18          86.59
                GoogLeNet     92.70             94.71            85.84          83.44

DL-based RS classification techniques
have shown their effectiveness in solving real-world problems, although such performance does not reflect the full potential of
DL yet. In the upcoming years, rapid advancement of DL in remote sensing image classification is expected, owing to the
increased availability of RS data and computational resources. Nevertheless, there is still a long way to go in order to realize the full potential of DL while coping with many unanswered challenges. We discuss several important open issues and point out the corresponding possible future directions in addressing them as follows.
1. Limited labeled samples: Although DL models can learn high-level abstract features from raw images with excellent per-
formance in dealing with a wide range of problems, we have to pay attention to the observation that such performance
heavily relies on large amounts of training samples. In RS images, the available labeled samples are rather limited, thereby restricting DL-based RS image classification approaches from obtaining better performance. How to build an efficient network and train it with a small number of training samples is both challenging and interesting. Investigating novel models that can exploit unlabeled samples is clearly a desirable direction for further work.
2. Transfer between data sets: For natural image classification, the common practice is to pretrain a DL model using a data
set with a large number of labeled samples, such as ImageNet, and then to fine-tune the model using a data set which contains limited training samples. However, RS data are more complex than natural images; parts of them are typically even acquired by different remote sensors. How to introduce transfer learning to RS image classification therefore presents a major challenge, which needs significant further research.
3. DL model architecture: Recently, an increasing number of novel deep networks have been proposed. These networks can
often achieve excellent performance in their dedicated tasks. For instance, U-Net (Ronneberger, Fischer, & Brox, 2015) obtains impressive performance in segmentation, and ResNet (He, Zhang, Ren, & Sun, 2015) achieves outstanding accuracy in image classification and object detection. However, almost all such networks are aimed at coping with natural image processing. As we mentioned previously, RS images are generally different from natural images. Exploring appropriate network structures for a given RS image classification problem is still an open topic.
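The pretrain-then-fine-tune recipe in point 2 can be sketched in miniature. Below, a frozen random nonlinear map stands in for an ImageNet-pretrained backbone (in practice a CNN such as CaffeNet or VGGNet), and only a small softmax head is retrained on a handful of labeled "RS" samples; all data and sizes are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(2)

# Stand-in "pretrained backbone": a frozen nonlinear feature map.
# In practice this would be a pretrained CNN with its convolutional
# weights kept fixed (or fine-tuned at a very small learning rate).
W_backbone = rng.normal(size=(10, 32))
backbone = lambda X: np.tanh(X @ W_backbone)

def finetune_head(X, y, n_classes=2, lr=0.1, epochs=200):
    # Train only the softmax head on the (few) labeled RS samples.
    F = backbone(X)                       # frozen features
    W = np.zeros((F.shape[1], n_classes))
    Y = np.eye(n_classes)[y]
    for _ in range(epochs):
        P = np.exp(F @ W); P /= P.sum(1, keepdims=True)   # softmax
        W -= lr * F.T @ (P - Y) / len(X)                  # gradient step
    return W

# 40 labeled samples from two synthetic "scene classes".
X = rng.normal(loc=np.repeat([[0.5], [-0.5]], 20, axis=0), size=(40, 10))
y = np.repeat([0, 1], 20)
W_head = finetune_head(X, y)
pred = (backbone(X) @ W_head).argmax(1)
acc = (pred == y).mean()
```

Freezing the backbone keeps the number of trainable parameters proportional to the small labeled set, which is exactly why this recipe is attractive when RS labels are scarce.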
In addition, other HSI classification techniques have been proposed in recent years. These techniques may also lead to good performance and are therefore worth paying attention to. For instance, Zhu, Hu, Jia, and Li (2018) proposed a multiple 3D feature fusion framework to extract spectral-spatial features via 3D morphological profiles, 3D LBPs, and 3D Gabor surface features. Also, Fang, He, Li, Ghamisi, and Benediktsson (2018) proposed a novel fusion framework termed extinction profile fusion to exploit the information contained within and among extinction profiles (EPs) for HSI classification. Such recent
developments are not examined in detail in this work, but may establish themselves in future practical applications.
ACKNOWLEDGMENTS
This research has received funding from the National Key Research and Development Program of China (Grant
No. 2016YFB0502502), Foundation Project for Advanced Research Field (614023804016HK03002), and Shaanxi Interna-
tional Scientific and Technological Cooperation Project (2017KW-006).
CONFLICT OF INTEREST
The authors have declared no conflicts of interest for this article.
REFERENCES
Aptoula, E., Ozdemir, M. C., & Yanikoglu, B. (2016). Deep learning with attribute profiles for hyperspectral image classification. IEEE Geoscience and Remote Sensing
Letters, 13(12), 1970–1974.
Bengio, Y. (2009). Learning deep architectures for AI. Foundations & Trends in Machine Learning, 2(1), 1–127.
Bengio, Y., Courville, A., & Vincent, P. (2013). Representation learning: A review and new perspectives. IEEE Transactions on Pattern Analysis and Machine Intelli-
gence, 35(8), 1798–1828.
Berlin, B., & Kay, P. (1969). Basic color terms: Their universality and evolution. University of California Press.
Castelluccio, M., Poggi, G., Sansone, C., & Verdoliva, L. (2015). Land use classification in remote sensing images by convolutional neural networks. Acta Ecologica
Sinica, 28(2), 627–635.
Chen, H. T., Chang, H. W., & Liu, T. L. (2005). Local discriminant embedding and its variants. Paper presented at IEEE Computer Society Conference on Computer
Vision and Pattern Recognition, San Diego, CA, USA, 846–853.
Chen, X., Xiang, S., Liu, C. L., & Pan, C. H. (2013). Aircraft detection by deep belief nets. Paper presented at 2013 2nd IAPR Asian Conference on Pattern Recognition
(ACPR), Naha, Japan. https://doi.org/10.1109/ACPR.2013.5
Chen, Y., Jiang, H., Li, C., Jia, X., & Ghamisi, P. (2016). Deep feature extraction and classification of hyperspectral images based on convolutional neural networks.
IEEE Transactions on Geoscience and Remote Sensing, 54(10), 6232–6251.
Chen, Y., Lin, Z., Zhao, X., Wang, G., & Gu, Y. (2017). Deep learning-based classification of hyperspectral data. IEEE Journal of Selected Topics in Applied Earth
Observations and Remote Sensing, 7(6), 2094–2107.
Chen, Y., Zhao, X., & Jia, X. (2015). Spectral-spatial classification of hyperspectral data based on deep belief network. IEEE Journal of Selected Topics in Applied Earth
Observations and Remote Sensing, 8(6), 2381–2392.
Cheriyadat, A. M. (2013). Unsupervised feature learning for aerial scene classification. IEEE Transactions on Geoscience and Remote Sensing, 52(1), 439–451.
Deng, J., Dong, W., Socher, R., & Li, L. J. (2009). ImageNet: A large-scale hierarchical image database. Paper presented at IEEE Conference on Computer Vision and
Pattern Recognition, Miami, FL, USA, 248–255.
Dong, C., Chen, C. L., He, K., & Tang, X. (2016). Image super-resolution using deep convolutional networks. IEEE Transactions on Pattern Analysis and Machine
Intelligence, 38(2), 295–307.
Du, Q., & Chang, C. I. (2001). A linear constrained distance-based discriminant analysis for hyperspectral image classification. Pattern Recognition, 34(2), 361–373.
Ediriwickrema, J., & Khorram, S. (1997). Hierarchical maximum-likelihood classification for improved accuracies. IEEE Transactions on Geoscience and Remote Sens-
ing, 35(4), 810–816.
Fang, L., He, N., Li, S., Ghamisi, P., & Benediktsson, J. A. (2018). Extinction profiles fusion for hyperspectral images classification. IEEE Transactions on Geoscience
and Remote Sensing, PP(99), 1–13.
Firat, O., Can, G., & Vural, F. T. Y. (2014). Representation learning for contextual object and region detection in remote sensing.
Paper presented at International Conference on Pattern Recognition, Stockholm, Sweden, 3708–3713.
Freund, Y., & Haussler, D. (1991). Unsupervised learning of distributions on binary vectors using two layer networks. Advances in Neural Information Processing Sys-
tems, 4, 912–919.
Girshick, R. (2015). Fast R-CNN. arXiv:1504.08083.
Girshick, R., Donahue, J., Darrell, T., & Malik, J. (2013). Rich feature hierarchies for accurate object detection and semantic segmentation. Paper presented at Proceedings of the 2014 IEEE Conference on Computer Vision and Pattern Recognition, Portland, OR, USA, 580–587.
Golub, G., & Kahan, W. (1965). Calculating the singular values and pseudo-inverse of a matrix. Journal of the Society for Industrial and Applied Mathematics, 2(2),
205–224.
Han, X., Zhong, Y., & Zhang, L. (2016). Spatial-spectral classification based on the unsupervised convolutional sparse auto-encoder for hyperspectral remote sensing
imagery. ISPRS Annals of the Photogrammetry, Remote Sensing and Spatial Information Sciences, III-7, 25–31.
He, K., Zhang, X., Ren, S., & Sun, J. (2014). Spatial pyramid pooling in deep convolutional networks for visual recognition. IEEE Transactions on Pattern Analysis
and Machine Intelligence, 37(9), 1904–1916.
He, K., Zhang, X., Ren, S., & Sun, J. (2015). Deep residual learning for image recognition. arXiv:1512.03385, 770–778.
He, M., Li, X., Zhang, Y., Zhang, J., & Wang, W. (2016). Hyperspectral image classification based on deep stacking network. Paper presented at Geoscience and
Remote Sensing Symposium, Beijing, China, 3286–3289.
Hinton, G. E. (2002). Training products of experts by minimizing contrastive divergence. Neural Computation, 14(8), 1771–1800.
Hinton, G. E., Osindero, S., & Teh, Y. W. (2006). A fast learning algorithm for deep belief nets. Neural Computation, 18(7), 1527–1554.
Hinton, G. E., Srivastava, N., Krizhevsky, A., Sutskever, I., & Salakhutdinov, R. R. (2012). Improving neural networks by preventing co-adaptation of feature detectors.
Computer Science, 3(4), 212–223.
Hu, F., Xia, G. S., Hu, J., & Zhang, L. (2015). Transferring deep convolutional neural networks for the scene classification of high-resolution remote sensing imagery.
Remote Sensing, 7(11), 14680–14707.
Hu, F., Xia, G. S., Hu, J., Zhong, Y., & Xu, K. (2016). Fast binary coding for the scene classification of high-resolution remote sensing imagery. Remote Sensing, 8(7), 555.
Hu, F., Xia, G. S., Wang, Z., Huang, X., Zhang, L., & Sun, H. (2017). Unsupervised feature learning via spectral clustering of multidimensional patches for remotely
sensed scene classification. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, 8(5), 2015–2030.
Hu, W., Huang, Y., Wei, L., Zhang, F., & Li, H. (2015). Deep convolutional neural networks for hyperspectral image classification. Journal of Sensors, 2015(2), 1–12.
Ioffe, S., & Szegedy, C. (2015). Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv:1502.03167, 448–456.
Jia, Y., Shelhamer, E., Donahue, J., Karayev, S., Long, J., Girshick, R., et al. (2014). Caffe: Convolutional architecture for fast feature embedding, arXiv preprint arXiv:
1408.5093, 675–678.
Johnson, R., & Zhang, T. (2013). Accelerating stochastic gradient descent using predictive variance reduction. Paper presented at International Conference on Neural
Information Processing Systems, Daegu, South Korea, 315–323.
Kim, M. H., Madden, M., & Warner, T. A. (2009). Forest type mapping using object-specific texture measures from multispectral IKONOS imagery: Segmentation quality
and image classification issues. Photogrammetric Engineering & Remote Sensing, 75(7), 819–829.
Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). ImageNet classification with deep convolutional neural networks. Paper presented at International Conference on Neural Information Processing Systems, Lake Tahoe, NV, USA, 1097–1105.
Lazebnik, S., Schmid, C., & Ponce, J. (2006). Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories. Paper presented at IEEE
Computer Society Conference on Computer Vision & Pattern Recognition, New York, NY, USA, 2169–2178
Lecun, Y., Bottou, L., Bengio, Y., & Haffner, P. (1998). Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11), 2278–2324.
Lee, H., Grosse, R., Ranganath, R., & Ng, A. Y. (2009). Convolutional deep belief networks for scalable unsupervised learning of hierarchical representations. Paper
presented at International Conference on Machine Learning, Montréal, Canada, 609–616.
Lee, H., & Kwon, H. (2016). Contextual deep CNN based hyperspectral classification. Paper presented at Geoscience and Remote Sensing Symposium, Beijing, China, 1–1.
Li, J., Bioucas-Dias, J. M., & Plaza, A. (2010). Semisupervised hyperspectral image segmentation using multinomial logistic regression with active learning. IEEE
Transactions on Geoscience and Remote Sensing, 48(11), 4085–4098.
Li, J., Bruzzone, L., & Liu, S. (2015). Deep feature representation for hyperspectral image classification. Paper presented at Geoscience and Remote Sensing Sympo-
sium, Milan, Italy, 4951–4954.
Li, T., Zhang, J., & Zhang, Y. (2015). Classification of hyperspectral image based on deep belief networks. In IEEE international conference on image processing
(p. 5132–5136).
Li, W., Chen, C., Su, H., & Du, Q. (2015). Local binary patterns and extreme learning machine for hyperspectral imagery classification. IEEE Transactions on Geosci-
ence and Remote Sensing, 53(7), 3681–3693.
Li, Y., Xie, W., & Li, H. (2016). Hyperspectral image reconstruction by deep convolutional neural network for classification. Pattern Recognition, 63, 371–383.
Li, Y., Zhang, H., & Shen, Q. (2017). Spectral-spatial classification of hyperspectral imagery with 3d convolutional neural network. Remote Sensing, 9(1), 67.
Liang, H., & Li, Q. (2016). Hyperspectral imagery classification using sparse representations of convolutional neural network features. Remote Sensing, 8(2), 99.
Lin, M., Chen, Q., & Yan, S. (2013). Network in network. arXiv:1312.4400.
Lin, Z., Chen, Y., Zhao, X., & Wang, G. (2015). Spectral-spatial classification of hyperspectral image using autoencoders. Paper presented at International Conference on Information, Communications and Signal Processing, Tainan, Taiwan, 1–5.
Long, J., Shelhamer, E., & Darrell, T. (2017). Fully convolutional networks for semantic segmentation. IEEE Transactions on Pattern Analysis and Machine Intelli-
gence, 39(4), 640–651.
Luus, F. P. S., Salmon, B. P., Bergh, F. V. D., & Maharaj, B. T. J. (2015). Multiview deep learning for land-use classification. IEEE Geoscience and Remote Sensing
Letters, 12(12), 2448–2452.
Ma, X., Geng, J., & Wang, H. (2015). Hyperspectral image classification via contextual deep learning. Eurasip Journal on Image & Video Processing, 2015(1), 20.
Ma, X., Wang, H., & Geng, J. (2016). Spectral-spatial classification of hyperspectral image based on deep auto-encoder. IEEE Journal of Selected Topics in Applied Earth Observations & Remote Sensing, 9(9), 4073–4085.
Ma, X., Wang, H., Geng, J., & Wang, J. (2016). Hyperspectral image classification with small training set by deep network and relative distance prior. Paper presented
at Geoscience and Remote Sensing Symposium, Beijing, China, 3282–3285.
Ma, X., Wang, H., & Wang, J. (2016). Semisupervised classification for hyperspectral image based on multi-decision labeling and deep feature learning. ISPRS Journal of Photogrammetry & Remote Sensing, 120, 99–107.
Makantasis, K., Karantzalos, K., Doulamis, A., & Doulamis, N. (2015). Deep supervised learning for hyperspectral data classification through convolutional neural
networks. Paper presented at Geoscience and Remote Sensing Symposium, Milan, Italy, 4959–4962.
Masci, J., Meier, U., Dan, C., & Schmidhuber, J. (2011). Stacked convolutional autoencoders for hierarchical feature extraction. Paper presented at International Con-
ference on Artificial Neural Networks, Espoo, Finland, 52–59.
Mei, S., Ji, J., Bi, Q., Hou, J., Du, Q., & Li, W. (2016). Integrating spectral and spatial information into deep convolutional neural networks for hyperspectral classifi-
cation. Paper presented at Geoscience and Remote Sensing Symposium, Beijing, China, 5067–5070.
Mou, L., Ghamisi, P., & Zhu, X. (2017). Deep recurrent neural networks for hyperspectral image classification. IEEE Transactions on Geoscience and Remote Sensing,
55(7), 3639–3655.
Mura, M. D., Benediktsson, J. A., Waske, B., & Bruzzone, L. (2010). Morphological attribute profiles for the analysis of very high resolution images. IEEE Transac-
tions on Geoscience and Remote Sensing, 48(10), 3747–3762.
Nogueira, K., Penatti, O. A. B., & Santos, J. A. D. (2016). Towards better exploiting convolutional neural networks for remote sensing scene classification. Pattern Rec-
ognition, 61, 539–556.
Ojala, T., Pietikäinen, M., & Mäenpää, T. (2000). Multiresolution gray-scale and rotation invariant texture classification with local binary patterns. IEEE Transactions on
Pattern Analysis and Machine Intelligence, 24(7), 971–987.
Ouyang, W., Luo, P., Zeng, X., Qiu, S., Tian, Y., Li, H., … Tang, X. (2014). DeepID-Net: Multi-stage and deformable deep convolutional neural networks for object detection. arXiv preprint.
Penatti, O. A. B., Nogueira, K., & Santos, J. A. D. (2015). Do deep features generalize from everyday objects to remote sensing and aerial scenes domains? Paper pre-
sented at Computer Vision and Pattern Recognition Workshops, Boston, MA, USA, 44–51.
Perronnin, F., & Mensink, T. (2010). Improving the fisher kernel for large-scale image classification. Paper presented at European Conference on Computer Vision,
Hersonissos, Heraklion, Crete, Greece, 143–156.
Risojevic, V., & Babic, Z. (2014). Unsupervised learning of quaternion features for image classification. Paper presented at International Conference on Telecommuni-
cation in Modern Satellite, Cable and Broadcasting Services, Nis, Serbia, 345–348.
Romero, A., Gatta, C., & Camps-Valls, G. (2016). Unsupervised deep feature extraction for remote sensing image classification. IEEE Transactions on Geoscience and
Remote Sensing, 54(3), 1349–1362.
Collobert, R., & Weston, J. (2008). A unified architecture for natural language processing: Deep neural networks with multitask learning. Paper presented at Proceedings of the 25th International Conference on Machine Learning, New York, USA, 160–167.
Ronneberger, O., Fischer, P., & Brox, T. (2015). U-net: Convolutional networks for biomedical image segmentation. In Medical Image Computing and Computer-Assisted Intervention (MICCAI) (pp. 234–241). Springer International Publishing.
Rumelhart, D., & Mcclelland, J. (1988). Learning internal representations by error propagation. Cambridge, MA: MIT Press.
Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., … Fei-Fei, L. (2015). ImageNet large scale visual recognition challenge. International Journal of
Computer Vision, 115(3), 211–252.
Samaniego, L., Bardossy, A., & Schulz, K. (2008). Supervised classification of remotely sensed imagery using a modified k-nn technique. IEEE Transactions on Geo-
science and Remote Sensing, 46(7), 2112–2125.
Scherer, D., Muller, A., & Behnke, S. (2010). Evaluation of pooling operations in convolutional architectures for object recognition. Paper presented at International
Conference on Artificial Neural Networks, Thessaloniki, Greece, 92–101.
Schölkopf, B., Platt, J., & Hofmann, T. (2006a). Efficient learning of sparse representations with an energy-based model. Paper presented at Advances in Neural Information Processing Systems, Vancouver, B.C., Canada, 1137–1144.
Schölkopf, B., Platt, J., & Hofmann, T. (2006b). Greedy layer-wise training of deep networks. Paper presented at International Conference on Neural Information Processing Systems, Vancouver, B.C., Canada, 153–160.
Sheng, G., Yang, W., Xu, T., & Sun, H. (2012). High-resolution satellite scene classification using a sparse coding based multiple feature combination. International
Journal of Remote Sensing, 33(8), 2395–2412.
Simonyan, K., & Zisserman, A. (2014). Very deep convolutional networks for large-scale image recognition. arXiv:1409.1556.
Sivic, J., & Zisserman, A. (2003). Video google: A text retrieval approach to object matching in videos. Paper presented at IEEE International Conference on Computer
Vision, Nice, France, 1470.
Slavkovikj, V., Verstockt, S., Neve, W. D., Hoecke, S. V., & Walle, R. V. D. (2015). Hyperspectral image classification with convolutional neural networks. Paper pre-
sented at ACM International Conference on Multimedia, 1159–1162.
Swain, M. J., & Ballard, D. H. (1991). Color indexing. International Journal of Computer Vision, 7(1), 11–32.
Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., … Rabinovich, A. (2014). Going deeper with convolutions, arXiv:1409.4842, 1–9.
Tao, C., Pan, H., Li, Y., & Zou, Z. (2015). Unsupervised spectral-spatial feature learning with stacked sparse autoencoder for hyperspectral imagery classification. IEEE
Geoscience and Remote Sensing Letters, 12(12), 2438–2442.
Thompson, W. D., & Walter, S. D. (1988). A reappraisal of the kappa coefficient. Journal of Clinical Epidemiology, 41(10), 949–958.
Vincent, P., Larochelle, H., Bengio, Y., & Manzagol, P. A. (2008). Extracting and composing robust features with denoising autoencoders. Paper presented at Interna-
tional Conference on Machine Learning, Kunming, China, 1096–1103.
Vincent, P., Larochelle, H., Lajoie, I., Bengio, Y., & Manzagol, P. A. (2010). Stacked denoising autoencoders: Learning useful representations in a deep network with a
local denoising criterion. Journal of Machine Learning Research, 11(12), 3371–3408.
Volpi, M., & Tuia, D. (2016). Dense semantic labeling of subdecimeter resolution images with convolutional neural networks. IEEE Transactions on Geoscience and
Remote Sensing, PP(99), 1–13.
Wang, J., Yang, J., Yu, K., Lv, F., Huang, T., & Gong, Y. (2010). Locality-constrained linear coding for image classification. Paper presented at Computer vision and
pattern recognition, San Francisco, CA, USA, 3360–3367.
Wu, H., & Prasad, S. (2017). Convolutional recurrent neural networks for hyperspectral data classification. Remote Sensing, 9(1), 298.
Xia, G. S., Hu, J., Hu, F., Shi, B., Bai, X., Zhong, Y., & Zhang, L. (2016). Aid: A benchmark data set for performance evaluation of aerial scene classification. IEEE
Transactions on Geoscience and Remote Sensing, PP(99), 1–17.
Xia, G. S., Yang, W., Delon, J., Gousseau, Y., Sun, H., & Maître, H. (2010). Structural high-resolution satellite image indexing. Paper presented at ISPRS TC VII Symposium - 100 Years ISPRS, XXXVIII, Vienna, Austria, 298–303.
Xing, C., Ma, L., & Yang, X. (2016). Stacked denoise autoencoder based feature extraction and classification for hyperspectral images. Journal of Sensors, 2016, 1–10.
Xu, Y., Du, J., Dai, L. R., & Lee, C. H. (2015). A regression approach to speech enhancement based on deep neural networks. IEEE/ACM Transactions on Audio
Speech and Language Processing, 23(1), 7–19.
Yang, J., Zhao, Y., Chan, C. W., & Yi, C. (2016). Hyperspectral image classification using two-channel deep convolutional neural network. Paper presented at Geosci-
ence and Remote Sensing Symposium, Beijing, China, 5079–5082.
Yang, W., Yin, X., & Xia, G. S. (2015). Learning high-level features for satellite image classification with limited labeled samples. IEEE Transactions on Geoscience
and Remote Sensing, 53(8), 4472–4482.
Yue, J., Mao, S., & Li, M. (2016). A deep learning framework for hyperspectral image classification using spatial pyramid pooling. Remote Sensing Letters, 7(9),
875–884.
Yue, J., Zhao, W., Mao, S., & Liu, H. (2015). Spectral-spatial classification of hyperspectral images using deep convolutional neural networks. Remote Sensing Letters,
6(6), 468–477.
Zeiler, M. D., & Fergus, R. (2013). Stochastic pooling for regularization of deep convolutional neural networks. arXiv:1301.3557.
Zhang, F., Du, B., & Zhang, L. (2014). Saliency-guided unsupervised feature learning for scene classification. IEEE Transactions on Geoscience and Remote Sensing,
53(4), 2175–2184.
Zhang, F., Du, B., & Zhang, L. (2016). Scene classification via a gradient boosting random convolutional network framework. IEEE Transactions on Geoscience and
Remote Sensing, 54(3), 1793–1802.
Zhang, H., Li, Y., Zhang, Y., & Shen, Q. (2017). Spectral-spatial classification of hyperspectral imagery using a dual-channel convolutional neural network. Remote
Sensing Letters, 8(5), 438–447.
Zhang, L., Wei, W., Zhang, Y., Shen, C., van den Hengel, A., & Shi, Q. (2018). Cluster sparsity field: An internal hyperspectral imagery prior for reconstruction. Inter-
national Journal of Computer Vision, 1–25.
Zhang, L., Zhang, L., & Du, B. (2016). Deep learning for remote sensing data: A technical tutorial on the state of the art. IEEE Geoscience and Remote Sensing Maga-
zine, 4(2), 22–40.
Zhao, B., Zhong, Y., Xia, G. S., & Zhang, L. (2016). Dirichlet-derived multiple topic scene classification model for high spatial resolution remote sensing imagery.
IEEE Transactions on Geoscience and Remote Sensing, 54(4), 2108–2123.
Zhao, L., Tang, P., & Huo, L. (2016). Feature significance-based multibag-of-visual-words model for remote sensing image scene classification. Journal of Applied
Remote Sensing, 10(3), 035004.
Zhao, W., & Du, S. (2016). Spectral-spatial feature extraction for hyperspectral image classification: A dimension reduction and deep learning approach. IEEE Transactions on Geoscience and Remote Sensing, 54(8), 4544–4554.
Zhao, W., Guo, Z., Yue, J., Luo, L., & Luo, L. (2015). On combining multiscale deep learning features for the classification of hyperspectral remote sensing imagery.
International Journal of Remote Sensing, 36(13), 3368–3379.
Zhong, P., Gong, Z. Q., & Schönlieb, C. (2016). A diversified deep belief network for hyperspectral image classification. ISPRS - International Archives of the Photogrammetry, Remote Sensing and Spatial Information Sciences, XLI-B7, 443–449.
Zhu, J., Hu, J., Jia, S., & Li, Q. (2018). Multiple 3-d feature fusion framework for hyperspectral image classification. IEEE Transactions on Geoscience and Remote
Sensing, PP(99), 1–14.
Zhu, X. X., Tuia, D., Mou, L., Xia, G.-S., Zhang, L., Xu, F., & Fraundorfer, F. (2017). Deep learning in remote sensing: A review. arXiv:1710.03959
Zou, Q., Ni, L., Zhang, T., & Wang, Q. (2015). Deep learning based feature selection for remote sensing scene classification. IEEE Geoscience and Remote Sensing Let-
ters, 12(11), 2321–2325.
How to cite this article: Li Y, Zhang H, Xue X, Jiang Y, Shen Q. Deep learning for remote sensing image classifica-
tion: A survey. WIREs Data Mining Knowl Discov. 2018;8:e1264. https://doi.org/10.1002/widm.1264