1. Introduction
Scene classification [1] assigns an image to a category according to its scene content. Remote sensing (RS) scene classification [2,3] assigns specific labels to RS scene images by means of a classification algorithm. It plays an important role in crop yield estimation, disaster prevention, resource protection, and land planning and utilization. With the extensive application of deep learning [4,5], RS scene classification has achieved great success. Most of this work is based on large-scale remote sensing datasets, with thousands of images per class used to fit the neural network model. However, labeling a large-scale dataset is complex and labor-intensive. In contrast, few-shot learning [6,7] does not require a lot of labeled data. It tries to imitate the human ability to learn from few examples, so that a classification system learns to classify from a small quantity of labeled images (few shots). One-shot learning [8] is a special case of few-shot learning in which each class contains one labeled sample. In addition, zero-shot learning [9] refers to the recognition of new things that the computer has never seen, which is more demanding than few-shot learning.
Since deep learning replaces handcrafted features with learned ones, RS scene classification based on deep learning is of great significance [10,11,12,13]. Recently, Zhai et al. [14] proposed a lifelong learning model that extracts prior knowledge by learning the ability of the classifier to generalize rapidly to new datasets. To learn the global features of an image, Zhang et al. [15] introduced the remote sensing transformer (TRS) into RS scene classification to capture long-range dependencies. To mine the semantic features of different categories from the global features of RS images, Tang et al. [16] constructed RS images by spatial rotation on the basis of previous studies, capturing more useful information and reducing the possibility of misclassification by improving the discrimination of features.
Few-shot learning is used to classify RS scenes with insufficient labeled data, which can remedy the defects of the above methods and improve interpretation performance [17,18,19,20,21,22]. To address the difficulty of learning discriminative features and reliable metrics from RS images, Li et al. [23] proposed DLA-MatchNet, an adaptive distance-based matcher, to improve classification efficiency. Sample-based training is used in most experiments and can achieve good results, but it greatly increases the probability of overfitting individual samples. Summarizing previous methods, Li et al. [24] concluded that features should be learned at the level of tasks rather than samples and proposed a method called RS-MetaNet. Most existing prototype-based few-shot learning methods ignore the effectiveness of the prototype, and calculating the prototype directly from the support samples decreases the accuracy of subsequent inference. In view of these issues, Cheng et al. [25] combined the SC method, which adds no learnable parameters, with the IC method to increase prediction accuracy. Together, these two Siamese prototypes extract more representative feature information for RS scene classification. To address the difficulty of extracting categorical features from insufficient labeled samples, Zeng et al. [26] proposed an iterative looping architecture (IDLN) to improve classification performance. A small number of samples markedly reduces the learning ability of the model. To identify the classification boundary, which depends on the sample deviation, Cui et al. [27] proposed a framework called meta kernel networks (MKNs), in which the distance between different categories is widened and the data of the same category are aggregated. For automatic modulation classification, which requires a large number of labeled samples, Che et al. [28] designed two feature extraction networks that correspond to the spatial and temporal feature spaces, respectively. The classification results of the two feature spaces are fused, and a new mixed loss function is designed to expand the distance between classes. Furthermore, some graph-based methods [29,30] have also achieved advanced results in the field of remote sensing. To address noise and insufficient labeled training samples in hyperspectral classification, Zhang et al. [31] proposed a mechanism that automatically explores receptive fields and learns the importance of different neighborhoods; when a node is updated, its local information is not discarded. Because existing graph-based methods have difficulty identifying the global information of the graph, Ding et al. [32] proposed a semi-supervised network that flexibly aggregates graph nodes and captures deeper relationships based on the relationships between the obtained contexts.
Meta-learning is often used to solve few-shot problems because of its self-learning ability and strong generalization performance. Meta-learning research is currently divided into three families: metric-based methods, optimization-based methods, and model-based methods. Among them, optimization-based meta-learning has made the most progress. At present, the best experimental accuracy results come from subsequent improvements of the Model-Agnostic Meta-Learning (MAML) algorithm [33] in this direction, which has also become the backbone of the meta-learning field. In data augmentation-based methods, Li et al. wrote the features of a set of labeled images (the support set) into memory and retrieved them from memory during inference, making full use of the knowledge in the set; this approach is called MatchingNet [34]. Among the metric-based methods, the classical networks include relation networks [35] and prototypical networks [6]. Many novel meta-learning models have been developed from these two models.
The methods mentioned above mainly focus on sample-level features for few-shot RS scene classification, so the learned features easily overfit individual samples, and most of them use metrics based on image-level features. Meanwhile, the ambiguity of hidden layer representations and decision boundaries in neural networks is ignored, so accurate feature representations are difficult to learn. In addition, RS images differ greatly from natural images because of their content and acquisition methods, such as aerial and satellite photography. Therefore, few-shot learning needs to overcome the indistinguishable features between categories and the unrelated backgrounds that these acquisition methods produce in remote sensing images, as indicated in Figure 1. To address these problems, few-shot learning should be organized at the task level rather than the image level. At the same time, the diversity of feature vectors should be increased to learn accurate feature representations. In addition, image-to-class metrics based on local descriptors are adopted for the final classification.
To meet these challenges, a few-shot RS scene classification method based on metric learning and local descriptors, called MLLD, is proposed; it increases the classification accuracy of RS scene images with few labels and addresses the aforementioned issues. The overall framework of the model consists of the Meta-tasks Module, the Deep Embedded Learning Module, and the Metric Module. Firstly, the Meta-tasks Module simulates human learning: it learns various knowledge through different meta-tasks and finally learns to classify rare samples. The performance of meta-learning on each task increases with experience and the number of tasks; that is, the efficiency of the model is gradually improved by learning multiple tasks. The task-based metrics learned by the Meta-tasks Module extend better to unseen test tasks. Secondly, the Embedded Learning Module, which comprises a feature extraction part and a feature processing part, extracts and fuses features to increase the diversity of feature vectors. The ambiguity of hidden layer representations and decision boundaries in neural networks causes a lot of irrelevant noise to be learned during model training, which affects the adaptability of the model to fresh samples and reduces the classification accuracy on the test data. To address this, the feature processor applies Manifold Mixup [36] to the hidden layers of the deep neural network; that is, the model is required to satisfy linear constraints at the feature level, and this constraint is used to regularize the model. Finally, the processed feature vector is divided into local descriptors, and the last layer, which would otherwise measure image-level features, is replaced by an image-to-class measure over local descriptors. Because local features are invariant, this method can improve classification results. At the same time, in the few-shot setting the cost of searching for nearest neighbors among local descriptors is low, because the number of samples is small.
2. Proposed Method
This paper addresses two major challenges of RS scene classification: (1) The lack of labeled samples and the nature of neural networks make it difficult for the model to learn accurate feature representations. (2) Few-shot RS scene classification must be performed despite small distances between different categories and large intra-class variance. In addition, irrelevant background information can be confused with valid content, which affects classification accuracy. The proposed model is shown in Figure 2. Firstly, the Meta-tasks Module improves the problem-solving ability of the model by learning meta-knowledge from multiple tasks through meta-training, which is an effective way to solve few-shot tasks. Secondly, the Embedded Learning Module not only extracts features but also enriches their diversity. Finally, the Metric Module uses local descriptors to calculate the similarity between an image and a category.
2.1. Meta Task Module
Deep learning essentially uses thousands of samples to train the model, gradually updating the model parameters in the direction opposite to the gradient of the loss, so that classification accuracy improves until the optimal model is learned. General deep learning only considers the information between samples and ignores the relationship between few-shot tasks, which leads to overfitting to single samples. In contrast, meta-learning trains the model at the task level. Through the learning of multiple tasks, the parameters are gradually updated to further fit the model; that is, prediction precision grows with learning experience and an appropriate number of tasks.
The specific method of meta-training is as follows. Two non-overlapping RS scene datasets are constructed; the train set is built by random sampling from the first dataset, and the test set is built by random sampling from the second. Multiple meta-train sets and meta-test sets are then randomly sampled from the train set, and the test set is sampled in the same way.
During the training phase, images of S different classes are randomly sampled from the train set to constitute a meta-train set. The corresponding meta-test set consists of Q images of different categories randomly sampled from the train set. In particular, the categories of meta-train sets and meta-test sets are different. The task-level training of the model is realized by meta-training, so that each meta-task simulates few-shot RS scene classification and the final model has the ability of autonomous learning, as shown in Figure 3.
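The episodic sampling described above can be illustrated with a minimal Python sketch. It assumes the common few-shot convention in which the support (meta-train) and query (meta-test) images of one task are drawn from the same sampled classes but never share an image; the authors' exact sampler may differ. The function name `sample_episode`, the toy class names, the 6/4 class split, and the 15 query images per class are illustrative assumptions, not details taken from this paper.

```python
import random

def sample_episode(images_by_class, n_way, k_shot, n_query, rng):
    """Draw one few-shot task: a support set and a query set.

    images_by_class maps a class name to a list of image identifiers.
    Returns two lists of (image_id, episode_label) pairs.
    """
    classes = rng.sample(sorted(images_by_class), n_way)
    support, query = [], []
    for label, cls in enumerate(classes):
        # Draw k_shot + n_query distinct images so the two sets never overlap.
        picked = rng.sample(images_by_class[cls], k_shot + n_query)
        support += [(img, label) for img in picked[:k_shot]]
        query += [(img, label) for img in picked[k_shot:]]
    return support, query

# Toy data (hypothetical): 10 classes with 20 image ids each.
toy = {f"class_{c}": [f"c{c}_img{i}" for i in range(20)] for c in range(10)}

# Non-overlapping class pools for training tasks and test tasks.
names = sorted(toy)
train_pool = {c: toy[c] for c in names[:6]}
test_pool = {c: toy[c] for c in names[6:]}

# One 5-way 1-shot training episode.
support, query = sample_episode(train_pool, n_way=5, k_shot=1, n_query=15,
                                rng=random.Random(0))
```

Sampling many such episodes from the training pool, and separately from the test pool, gives the task-level training and evaluation data.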
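For meta-tasks, model parameters are gradually optimized by training on different meta-tasks. The loss function is the cross-entropy loss, expressed as:
$$L = -\frac{1}{N}\sum_{i=1}^{N}\sum_{c=1}^{M} y_{ic}\log\left(p_{ic}\right)$$
The optimization of the model loss has a great influence on the final classification result of the model, because it directly affects the feature representation of the sample. Here, $N$ is the number of samples over which the loss is averaged, $M$ represents the number of categories, and $y_{ic}$ represents the sign (indicator) function, which equals 1 if the true category of sample $i$ is $c$ and 0 otherwise. $p_{ic}$ represents the final probability, predicted by the model, that sample $i$ belongs to category $c$.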
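As a small numerical illustration of the cross-entropy loss just described, the following sketch averages the negative log-probability of the true class over a batch. The function name and the toy probabilities are assumptions made for illustration.

```python
import math

def cross_entropy(probs, labels):
    """Mean cross-entropy over a batch.

    probs[i][c] is the predicted probability that sample i belongs to class c;
    labels[i] is the index of the true class of sample i.
    """
    total = 0.0
    for p, y in zip(probs, labels):
        # The indicator is 1 only for the true class, so a single term survives.
        total += -math.log(p[y])
    return total / len(labels)

# Two samples, three classes (toy values).
probs = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]]
loss = cross_entropy(probs, [0, 1])
```

The loss falls toward zero as the probability assigned to the true class approaches 1, which is why minimizing it sharpens the feature representation used by the classifier.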
2.2. Embedding Learning Module
The embedded learning module is composed of a feature extractor and a feature processor. The first part of the deep embedding learning module is the feature extractor. As in previous models, four convolutional blocks are used as the feature extractor in this paper, which allows a fairer comparison in the experiments.
Four convolution blocks are used to extract image features, and all of them use 3 × 3 convolution kernels. A batch normalization layer is placed after each convolutional layer to prevent gradients from vanishing or exploding, which also accelerates training. In addition, because a purely linear model lacks expressiveness, the ReLU function is used to add nonlinearity so that complex tasks can be completed, and to reduce the interdependence of the parameters so that overfitting is alleviated. The max pooling layer provides a translation-invariant way to extract the edge and texture structure of the image.
The feature extractor, denoted here by $F_\theta$, uses the parameter $\theta$ to map the original data domain to the target feature space and thereby learns the feature representation of the image. Because the feature extractor in this article has no fully connected (FC) layer, an image $X$ yields a tensor of $h \times w \times d$ dimensions after passing through it. Each feature vector of length $d$ is regarded as a local descriptor, so the feature representation is expressed as:
$$F_\theta(X) = [x_1, x_2, \ldots, x_m] \in \mathbb{R}^{d \times m}$$
where $m = hw$ represents the number of $d$-dimensional local descriptors. For example, the images of UC Merced are 256 × 256 pixels; the resulting 4096 local descriptors correspond to $h = w = 64$, since $m = 64 \times 64 = 4096$.
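The descriptor count can be checked with a few lines of arithmetic. The sketch below assumes that the 3 × 3 convolutions preserve spatial size (padding 1) and that only some of the four blocks apply 2 × 2 max pooling; the choice of two pooled blocks is an assumption made here because it reproduces the 4096 descriptors quoted for a 256 × 256 input, and it is not stated in the text.

```python
def descriptor_count(input_size, pooled_blocks, pool=2):
    """Side length h of the feature map and number m = h * h of local descriptors.

    Assumes a square input, size-preserving convolutions, and a
    pool x pool max-pooling layer in `pooled_blocks` of the blocks.
    """
    side = input_size
    for _ in range(pooled_blocks):
        side //= pool
    return side, side * side

h, m = descriptor_count(256, pooled_blocks=2)     # h = 64, m = 4096
h4, m4 = descriptor_count(256, pooled_blocks=4)   # h = 16, m = 256
```

Pooling in all four blocks would leave only 256 descriptors per image, so the quoted figure suggests a shallower downsampling schedule.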
The second part of the deep embedded learning module is the feature processor. The adaptability of the model to new samples is affected by the ambiguity of the hidden layer representation and the decision boundary in the neural network, which greatly reduces accuracy on the test data. Therefore, this paper adds a feature processor that adopts Manifold Mixup, a regularization method. The role of the regularizer is to limit the increase in model complexity caused by the excessive parameters of the neural network, to prevent overfitting on the training set, and to constrain the learned model toward properties such as low rank and smoothness. The Manifold Mixup used in this paper fuses both the features and the labels of the samples. The formula is as follows:
$$\tilde{g}_k = \lambda\, g_k(x_i) + (1-\lambda)\, g_k(x_j), \qquad \tilde{y} = \lambda\, y_i + (1-\lambda)\, y_j$$
where $k$ represents the $k$-th layer of the neural network, $g_k(\cdot)$ is the hidden representation at that layer, and $\lambda \in [0, 1]$ is the mixing coefficient. The mixing coefficient $\lambda$ follows the beta distribution, that is, $\lambda \sim \mathrm{Beta}(\alpha, \alpha)$. When $\alpha = 1$, $\mathrm{Beta}(\alpha, \alpha)$ is the uniform distribution on (0,1), that is, $\lambda \sim U(0, 1)$. When $\alpha < 1$, it is a U-shaped distribution, with large probability at both ends and small probability in the middle. When $\alpha \to 0$, it degenerates to a two-point distribution on $\{0, 1\}$; that is, no manifold mixing is performed and the original data are not enhanced. When $\alpha > 1$, it resembles a normal distribution, with small probability at both ends and large probability in the middle. When $\alpha \to \infty$, $\lambda$ concentrates at 0.5, which corresponds to averaging the two samples. Here, $y_i$ and $y_j$ are one-hot labels.
Unlike traditional regularizers applied to the input space, Manifold Mixup is applied to the hidden layers of a deep neural network. It encourages the model to be less confident on interpolated points, so that the visual representations of training examples are concentrated in a low-dimensional subspace, thereby generating more discriminative features. Because the neural network is trained on linear combinations of intermediate hidden representations, the model can enlarge the confidence space and obtain a smoother decision boundary and a simpler hidden layer representation.
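The mixing step itself is short. The following sketch mixes two hidden-layer feature vectors and their one-hot labels with a coefficient drawn from a symmetric beta distribution; it is a standalone illustration on plain Python lists, not the authors' implementation, and the function name, feature values, and `alpha=2.0` are assumptions.

```python
import random

def manifold_mixup(h_i, h_j, y_i, y_j, alpha, rng):
    """Mix two hidden-layer feature vectors and their one-hot labels."""
    lam = rng.betavariate(alpha, alpha)  # lambda ~ Beta(alpha, alpha)
    mixed_h = [lam * a + (1.0 - lam) * b for a, b in zip(h_i, h_j)]
    mixed_y = [lam * a + (1.0 - lam) * b for a, b in zip(y_i, y_j)]
    return mixed_h, mixed_y, lam

rng = random.Random(0)
h, y, lam = manifold_mixup(
    [1.0, 0.0, 2.0], [0.0, 4.0, 2.0],   # hidden features of two samples
    [1.0, 0.0], [0.0, 1.0],             # their one-hot labels
    alpha=2.0, rng=rng,
)
```

The mixed label remains a valid probability vector, and the same coefficient is applied to features and labels, which is what imposes the linear constraint at the feature level.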
2.3. Metrics Module
There are two main traditional measurement methods. (1) The feature information is compressed into a compact image-level representation, and classification is performed by measuring the feature vector. (2) Image-level representations are compared directly, and classification is performed by measuring image-to-image similarity. The first method loses a lot of discriminative information when compressing the features, and in the few-shot setting this loss is difficult to recover. The second method is rarely feasible, because two images of the same category can differ greatly in local areas. Given these defects, this paper achieves scene classification by comparing local descriptors.
Due to the particularity of RS images, different images of the same category can be very dissimilar, and images of different categories can share similarities, so directly comparing the features of whole images produces large errors. However, flipping, shearing, or translating an image does not change its local features. Therefore, the local descriptors extracted and processed by the deep embedding learning module are compared in this paper, and the invariance of local features is fully utilized. Evaluating the similarity between local features departs from the traditional image-level comparison, which increases the diversity of each few-shot task and provides richer and more flexible representations for each class.
In the classification, the k-nearest neighbor algorithm is used in this paper. For each local descriptor $x_i$ of the query image $q$, the $k$ most similar local descriptors are found in each category of the support set. The cosines of the angles between vectors are summed to predict the possibility that the query image $q$ belongs to category $c$, namely cosine similarity summation. Cosine similarity emphasizes the difference in direction between two vectors; the cosine value decreases as the angle between the vectors increases:
$$\Phi(q, c) = \sum_{i=1}^{m}\sum_{j=1}^{k}\cos\left(x_i, \hat{x}_i^{j}\right), \qquad \cos\left(x_i, \hat{x}_i^{j}\right) = \frac{x_i^{\top}\hat{x}_i^{j}}{\lVert x_i\rVert\,\lVert\hat{x}_i^{j}\rVert}$$
where $c$ represents different categories, and $\hat{x}_i^{j}$ represents the $j$-th of the $k$ closest local descriptors in the support set of category $c$ for the query local descriptor $x_i$.
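The image-to-class measure can be sketched in plain Python as follows. The sketch pools all local descriptors of a support class, takes for each query descriptor its k largest cosine similarities, and sums them; the function names and the two-dimensional toy descriptors are illustrative assumptions.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def image_to_class_score(query_desc, class_desc, k):
    """Sum, over query descriptors, of the k largest cosine similarities
    to the descriptors pooled from one support class."""
    score = 0.0
    for x in query_desc:
        sims = sorted((cosine(x, s) for s in class_desc), reverse=True)
        score += sum(sims[:k])
    return score

def classify(query_desc, support_by_class, k):
    """Return the class with the highest image-to-class score and all scores."""
    scores = {c: image_to_class_score(query_desc, d, k)
              for c, d in support_by_class.items()}
    return max(scores, key=scores.get), scores

# Toy support set: two classes with two local descriptors each.
support = {"river": [[1.0, 0.0], [0.9, 0.1]], "forest": [[0.0, 1.0], [0.1, 0.9]]}
pred, scores = classify([[1.0, 0.1], [0.8, 0.0]], support, k=1)
```

Because the score is accumulated per descriptor rather than per image, a query can match different local regions of different support images of the same class.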
2.4. Experiment Methodology
We abandon the traditional model training method and adopt meta-learning, as shown in Algorithm 1. The computational complexity analysis of the model algorithm is . The training and testing of meta-learning are based on few-shot tasks. Each task has its own meta-train dataset and meta-test dataset, also known as the support set and the query set. To acquire the ability to learn new tasks quickly from training data, meta-learning regards each entire task as a training sample during model training, and each few-shot task forms one training episode. During training, features are extracted by a convolutional neural network without a fully connected layer, called the feature extractor, which outputs a tensor of $h \times w \times d$ dimensions; the data and the labels are then fused separately, that is, enhanced by Manifold Mixup. Each feature vector of length $d$ is regarded as a local descriptor. For the samples in the query set, the deep embedding module is used to obtain the processed local descriptors, and the k-nearest neighbors of each local descriptor are found in each category. Then, the similarity between the local descriptors and their k-nearest neighbors is calculated by cosine similarity summation, which gives the similarity between each query image and each category. Finally, the category with the highest similarity is selected as the prediction for the query image.
Algorithm 1. Model Training
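During the training process, a meta-task $T_i$ is sampled from the task distribution $p(T)$. Then the model is trained with $K$ samples and the loss function $L_{T_i}$ of those samples, and the trained model is tested with new samples generated by $T_i$. The test error is used as the training loss of the meta-train process.
The model is represented by a function $f_\theta$ with a parameter $\theta$. When transferred to a new task $T_i$, the parameter $\theta$ of the model is updated to $\theta_i'$ by gradient descent on the task loss. For the first update:
$$\theta_i' = \theta - \alpha\nabla_\theta L_{T_i}(f_\theta)$$
The update step $\alpha$ is a fixed hyperparameter, and the meta-learning process over different tasks is performed by stochastic gradient descent with meta step size $\beta$. Therefore, the update criterion of $\theta$ is:
$$\theta \leftarrow \theta - \beta\nabla_\theta\sum_{T_i \sim p(T)} L_{T_i}(f_{\theta_i'})$$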
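The two-level update can be made concrete with a deliberately tiny example in which every quantity is a scalar. The sketch assumes hypothetical tasks with loss 0.5 * (theta - target)**2, so that both the task-level step and the gradient of the post-adaptation loss have closed forms; the step sizes and targets are arbitrary illustrative values, and the loss is minimized by gradient descent.

```python
def inner_update(theta, target, alpha):
    """One task-level descent step on L_T(theta) = 0.5 * (theta - target)**2."""
    return theta - alpha * (theta - target)

def meta_update(theta, targets, alpha, beta):
    """One meta-step: descend the summed post-adaptation loss of all tasks."""
    meta_grad = 0.0
    for t in targets:
        # L_T(theta') = 0.5 * ((1 - alpha) * (theta - t))**2, so by the chain
        # rule dL_T(theta')/dtheta = (1 - alpha)**2 * (theta - t).
        meta_grad += (1.0 - alpha) ** 2 * (theta - t)
    return theta - beta * meta_grad

def meta_loss(theta, targets, alpha):
    """Sum over tasks of the loss measured after the task-level step."""
    return sum(0.5 * (inner_update(theta, t, alpha) - t) ** 2 for t in targets)

targets = [1.0, 2.0, 3.0]   # one hypothetical scalar task per target
theta = 0.0
before = meta_loss(theta, targets, alpha=0.1)
for _ in range(100):
    theta = meta_update(theta, targets, alpha=0.1, beta=0.1)
after = meta_loss(theta, targets, alpha=0.1)
```

In this toy setting the meta-parameter settles at the point from which one task-level step gives the lowest total loss across tasks, here the mean of the targets.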
2.5. Dataset Description
Three remote sensing datasets are used for the comparative and ablation experiments: UC Merced [37], WHU-RS19 [38], and NWPU-RESISC45 [39]. Detailed descriptions of the three datasets are given in Table 1.
The UC Merced dataset contains 2100 images, which are divided 10:5:6 for the experiments; each image has 256 × 256 pixels and a spatial resolution of 0.3 m, as shown in Figure 4. Many RS scene classification researchers have used this dataset since it appeared. It contains considerable noise, which makes classification more challenging. NWPU-RESISC45 has 45 image categories, the most among the three datasets, and its image size is the same as that of the previous dataset. It currently has the largest number of scene categories and the largest total number of images. WHU-RS19 contains 19 categories with a total of 1005 images, whose image size is the largest among the three datasets.
The RS images are divided into three parts. Two of these are used for training and validation of the model, while the third evaluates the model through cross validation. For each training task, five scene categories are randomly selected from the train set to simulate the few-shot task, and one or five samples are extracted from each category to form a meta-task. For each test task, five scene categories are likewise randomly selected from the test set, with one or five labeled samples for each scene category.
2.6. Experiment Setups
For traditional deep learning, an epoch represents the entire dataset propagating once through the neural network. For the few-shot learning in this experiment, although only one or several labeled samples are randomly sampled in each task, all samples are likely to be sampled when the number of tasks is large enough. In the experiment, 8000 training tasks were used, the initial learning rate was 0.005, and the initial parameters of the optimizer were set to 0.5. In addition, the generalization of all models is evaluated by cross-entropy loss. All experiments used five scene categories, with one sample or five samples selected for each scene category. Theoretically, the more samples per scene category, the higher the accuracy of the experiment. In order to avoid the model’s preference for specific data, all experiments were randomly sampled from for 15 tests. For the test results, 600 few-shot tasks were sampled to test the model, and the accuracies were averaged to obtain the final result.
The evaluation setting of our experiments is the usual N-way K-shot task: N categories are selected, and K labeled samples are selected for each category. In the training phase, the model is constructed and trained on the selected samples. In the test phase, K samples from each of N categories are selected to perform the N-way K-shot classification task. The predicted category is determined from the prediction results, and the accuracy assessment metric is the proportion of samples whose predicted category is consistent with the actual category.
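The averaging of accuracy over sampled test tasks can be sketched as follows. The per-task accuracy matches the metric described above; the 95% confidence half-width is an addition that is common in few-shot evaluation but is an assumption here, not something the text states. The function names and the toy accuracies are illustrative.

```python
import math
import statistics

def task_accuracy(predictions, labels):
    """Fraction of query samples whose predicted class equals the true class."""
    return sum(p == y for p, y in zip(predictions, labels)) / len(labels)

def summarize(accuracies):
    """Mean accuracy and a normal-approximation 95% confidence half-width."""
    mean = statistics.mean(accuracies)
    half_width = 1.96 * statistics.stdev(accuracies) / math.sqrt(len(accuracies))
    return mean, half_width

# One toy task: 4 of 5 query samples are classified correctly.
acc = task_accuracy([0, 1, 2, 3, 4], [0, 1, 2, 3, 0])

# Toy accuracies of three tasks; the experiments average over 600 tasks.
mean, half_width = summarize([0.8, 0.6, 1.0])
```

With 600 tasks the half-width shrinks roughly with the square root of the task count, which is why a large number of test tasks gives a stable final figure.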
4. Discussion
The demand for remote sensing scene imagery in urban supervision, resource exploration, and natural disaster detection is gradually increasing, so remote sensing scene classification is a problem that urgently needs to be solved. However, background confusion and image noise in RS images, together with the blurred decision boundaries of neural networks, reduce classification accuracy. Therefore, this paper proposes a classification method based on metric learning and local descriptors (MLLD), which comprises an embedded learning module and a metric module. The embedded learning module learns model parameters through meta-training on multiple few-shot tasks, then extracts features through a four-layer convolutional network and fuses feature vectors and labels. The visual representation of the sample is concentrated in a low-dimensional subspace to produce more discriminative features. The feature vector is divided into local descriptors, and the similarity between the image and the category is then calculated by the metric module on the basis of local feature invariance.
The contributions of this paper are summarized as follows:
(1) By studying data augmentation strategies, a novel embedded learning module with a data-dependent regularization operation is introduced. This module uses Manifold Mixup to smooth the decision boundary and learn accurate feature representations in few-shot RS scene classification.
(2) Exploiting the invariance of local features, we replace the metric based on image-level features with an image-to-class metric based on local descriptors. The measurement based on local features avoids the error caused by image-level feature representation and prevents the loss of discriminative information that leads to inaccurate measurement.
(3) Experiments were conducted on three remote sensing datasets, namely UC Merced, WHU-RS19, and NWPU-RESISC45. The results show that our model (MLLD) has a significant effect on few-shot RS image classification, remedying shortcomings of previous models and further enhancing classification accuracy. The classification accuracy on the three datasets reaches 59.05%, 65.88%, and 76.07%, respectively, for 5-way 1-shot and 77.76%, 82.06%, and 90.69%, respectively, for 5-way 5-shot. These experiments show that the embedded learning module based on Manifold Mixup and the measurement module based on local descriptors effectively improve the classification accuracy of few-shot RS images.