1. Introduction
With the rapid development of remote sensing technology, the quantity and quality of remote sensing images have improved significantly, and remote sensing interpretation has attracted increasing attention. Image classification, segmentation, detection, and tracking in remote sensing images have gradually become hot topics. Object detection in remote sensing images plays a vital role in military applications, intelligent supervision, urban planning, and other fields. In this paper, we mainly study object detection for optical remote sensing images based on deep learning.
Artificial intelligence is a data-driven technology that has been widely used in computer vision. Compared with traditional models, deep learning-based methods have great advantages, such as strong adaptability, high accuracy, and freedom from hand-crafted features. Deep learning algorithms can autonomously extract features such as object texture from the data distribution and generate high-level semantic information. Benefiting from the development of networks and a large number of publicly available natural image datasets, such as MSCOCO [1] and PASCAL VOC [2], many object detection methods based on deep learning have achieved great success in natural scenes [3,4,5,6,7,8,9,10].
However, object detection algorithms that achieve good performance on natural images cannot be directly applied to optical remote sensing images, because the differences between remote sensing images and natural images are very significant. The most obvious difference is the shooting angle: a remote sensing image is taken from a bird's-eye view, which usually captures the top of the object, while a natural image is taken from the front, which usually captures the outline of the object. Because of this difference in viewpoint, remote sensing images usually have several characteristics that distinguish them from natural images:
Scales of objects vary greatly: since a remote sensing image is taken from a long distance with a wide field of view, a single image may contain objects with large scale differences;
Objects appear with arbitrary orientations: due to the special imaging perspective, objects in remote sensing images can face in any direction;
Complex background: objects of interest in remote sensing images are usually surrounded by complex backgrounds, which can seriously interfere with detection.
These characteristics make object detection in remote sensing images a particularly challenging task. For objects with arbitrary orientations, a horizontal bounding box causes misalignment between the detected box and the object [11]. To solve this problem, oriented bounding boxes are preferred for capturing objects in remote sensing images. Current oriented object detection methods mainly originate from anchor-based detectors. Generally, these detectors either generate oriented anchor boxes and use them as references to learn the orientation information and regress the offsets between the object box and the anchor box parameters (such as R2PN [12] and R-DFPN [13]), or regress the angle parameters after generating regions of interest (ROIs) to achieve oriented object detection (such as the ROI Transformer [11]). Although these methods accomplish the detection of oriented objects in remote sensing images, they are still anchor-based and share the same drawbacks as other anchor-based detectors. For example, the design of the anchor boxes is complicated: the aspect ratios and sizes of the anchor boxes need to be carefully tuned. In addition, the sharp imbalance between positive and negative anchor boxes can lead to slow training and suboptimal performance.
In recent years, researchers have developed keypoint-based object detectors [14,15,16] to overcome the shortcomings of anchor-based solutions in horizontal object detection tasks. Keypoint-based object detectors [17,18] capture the object by detecting keypoints, thereby providing an anchor-free approach. In particular, these methods detect the corner points or center points of the bounding boxes and then infer the size of the bounding boxes from these points. CornerNet [17] is one of the pioneering keypoint-based methods for horizontal object detection. It uses heatmaps to capture the top-left and bottom-right corner points of the horizontal bounding box and then groups the corners. The CenterNet proposed by Duan et al. [18] detects corner points and center points, and ExtremeNet [19] locates the extreme points and center points of the bounding boxes; both methods group the points of a box using the center information. Zhou's CenterNet [20] predicts faster because it requires no grouping post-processing and only recovers the width and height of the bounding box at the center point. Keypoint-based detectors have achieved great success in horizontal detection tasks and have advantages over anchor-based detectors in both speed and accuracy. However, keypoint-based detectors are rarely used in oriented object detection.
In many applications, a lightweight model is a key requirement, and model size is fundamentally in competition with accuracy. Although object detection in remote sensing images has made great progress, existing methods rely on increasingly deep networks, which add substantial computational cost and many parameters. To solve the problem of over-parameterization, some previous works have designed new network structures, such as fully convolutional networks or lightweight networks [21]. These networks have made progress to a certain extent, but they need to be carefully designed and tuned. Some image classification work uses model compression, which decomposes the weights of each layer and then reconstructs or fine-tunes layer by layer to restore partial accuracy [22,23,24,25]. However, there is often a gap between the accuracy of the original model and that of the compressed model, and this gap tends to grow when model compression is applied to more complex problems, such as object detection in remote sensing images. On the other hand, research on knowledge distillation shows that training a smaller or compressed model to imitate the behavior of a deeper or more complex network can make up for part or all of this accuracy gap [26,27,28]. However, these results have so far been demonstrated mainly on classification problems. Object detection is a more complex task, which includes two subtasks: classification and bounding box regression.
Inspired by the above research, we propose a keypoint-based oriented object detector for remote sensing images and lighten this detector through knowledge distillation. We use BBAVectors as a baseline and make improvements on it. First, a semantic transfer block is proposed, which uses semantic information as a guide to refine features; it alleviates the problem of excessive noise in features caused by complex backgrounds. In addition, considering the variety of object sizes in remote sensing images, we propose an adaptive Gaussian kernel to generate adaptive Gaussian heatmaps to locate and classify objects. To address the fact that neural networks are usually over-parameterized, we propose a distillation loss tailored to object detection in remote sensing images to obtain a lightweight student network. Finally, the effectiveness of the method is verified on two datasets. The contributions of this paper are summarized as follows:
In order to avoid the noise caused by the complex backgrounds of remote sensing images, we propose a semantic transfer block to refine the low-level features when fusing high-level and low-level features. The refined low-level features not only contain less noise but also help restore the semantic information of the fused features;
Considering that the scale of objects in remote sensing images varies greatly, an adaptive Gaussian kernel is proposed to produce adaptive Gaussian heatmaps for locating and classifying objects. Compared with the traditional heatmap, our improved heatmap adapts to objects of different sizes and avoids the fuzzy samples caused by traditional heatmap labels;
To address the over-parameterization of remote sensing object detection models, we propose a distillation framework for the keypoint-based object detection network. For the four heads of the network, we calculate the distillation loss based on soft labels and combine it with the loss computed from hard labels to distill the student network.
2. Proposed Methods
In this section, we will describe the various parts of our pipeline in detail.
Figure 1 shows the overall framework of our method. The network is built on a U-shaped structure, and the feature maps are upsampled at the top of the backbone. In the process of upsampling, we use the semantic transfer block (STB) to introduce deep semantic information into the low-level features as a guide, generating low-level features with semantic guidance. Then, we merge the deep features with the low-level features through skip connections, sharing high-level semantic information and finer low-level details. Through convolutional layers, the fused feature maps are converted into four prediction branches: Heatmap, Offset, Box Parameter, and Orientation, where the heatmap has K channels (K is the number of classes) and all branches are predicted at the downsampling scale s. Finally, we use knowledge distillation to obtain a lightweight keypoint-based oriented object detector for remote sensing images. Specifically, we regard the network whose backbone is ResNet50 as the teacher network and the network whose backbone is ResNet18 as the student network. After inputting images into the two networks, we use the final outputs of the teacher network as soft labels to assist student network learning. In this process, the mean square error (MSE) loss function is used to align the heatmap and orientation branches, and the $L_1$ loss is used to align the offset and box parameter branches. In addition, we also use the ground truths of the dataset as hard labels and use them to train the student detection network.
2.1. Keypoint-Based Object Detection
2.1.1. Semantic Transfer Block
The backbone network uses a U-shaped structure, similar to U-Net [29]. Although this U-shaped structure has achieved great success, high-level and low-level features are merged through simple skip connections or lateral connections, and the working mechanism of this fusion is still unclear. Whether such a simple fusion really helps object detection in remote sensing images is worth further study. High-level and low-level features are complementary in nature: low-level features have rich spatial details but lack semantic information, and vice versa. Consider an extreme case where pure low-level features only contain information such as points, lines, or edges. Intuitively, fusing high-level features with such purely low-level features is hardly helpful, because the low-level features contain too much noise, especially in scenes with complex backgrounds such as remote sensing images, and cannot provide enough high-resolution semantic information to guide feature fusion. On the contrary, if the low-level features contain relatively clear semantic boundaries, merging these more semantic low-level features with the high-level features is more helpful. Similarly, high-level semantic features with little spatial information cannot make full use of low-level texture features. However, by introducing additional high-resolution features, high-level semantic features have the opportunity to refine their information by aligning with the boundary information in the related low-level features. Therefore, we propose the semantic transfer block (STB) to provide a more effective feature fusion method.
The fusion performed by the STB can be written as
$$\hat{F}_l = \mathcal{T}\left(F_l, F_{l+1}\right), \quad l = 2, 3, 4, \qquad (1)$$
where $l$ represents the level of the feature map. The meaning of Formula (1) is to introduce more semantic information from the high-level features to guide feature fusion. The function $\mathcal{T}(\cdot)$ represents the semantic transfer block, and its design details are shown in Figure 2. Specifically, we first upsample the deep features to the same size as the low-level features through bilinear interpolation, refine the upsampled feature maps through a convolution, and then combine them with the low-level features. Finally, a convolutional layer is used to refine the channels of the fused features. In this process, batch normalization (BN) and ReLU activation functions are used in the hidden layer. We apply this component to the level 2–4 features (see Figure 1).
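As a concrete illustration of this fusion, the following PyTorch sketch shows one possible implementation of the STB under our reading of Figure 2; the module name, the channel arguments, the 3×3/1×1 kernel sizes, and the additive combination are assumptions rather than the exact configuration used in the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SemanticTransferBlock(nn.Module):
    """Sketch of the STB: high-level semantics guide the low-level features.
    Kernel sizes and the additive combination are assumed, not prescribed."""
    def __init__(self, high_channels, low_channels):
        super().__init__()
        # Refine the upsampled high-level features (hidden layer uses BN + ReLU).
        self.refine_high = nn.Sequential(
            nn.Conv2d(high_channels, low_channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(low_channels),
            nn.ReLU(inplace=True),
        )
        # Refine the channels of the combined features.
        self.refine_out = nn.Conv2d(low_channels, low_channels, kernel_size=1)

    def forward(self, low_feat, high_feat):
        # Bilinear upsampling to the spatial size of the low-level features.
        up = F.interpolate(high_feat, size=low_feat.shape[2:],
                           mode="bilinear", align_corners=False)
        guided = self.refine_high(up) + low_feat  # combine with the low-level features
        return self.refine_out(guided)
```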
To better support the above point, we visualize the feature maps, as shown in Figure 3. Figure 3a shows a low-level feature map in the backbone, and Figure 3c shows the same low-level feature map after STB processing. Figure 3b shows a merged feature map that fuses the original low-level feature with a high-level feature through a skip connection, and Figure 3d shows the fusion of the STB-processed low-level feature with a high-level feature through a skip connection. It can be seen from the figure that the low-level features of the backbone contain too much noise, which makes the feature maps obtained by the skip connection too fuzzy. This shows that if the low-level features contain little semantic information, the skip connection cannot restore the semantics. In contrast, the low-level features processed by the STB introduce more semantic information and contain less noise, so the merged feature maps restore the semantics well.
2.1.2. Adaptive Gaussian Kernel
Heatmaps are usually used to locate specific keypoints in the input image, such as human joints and facial landmarks. In this paper, the heatmap is used to detect the center point of the oriented object in the remote sensing image. Specifically, the heatmap applied in this paper has K channels, and each channel corresponds to an object category. The map on each channel is passed through a sigmoid function. We use the heatmap value predicted at a specific center point as the confidence of object detection.
Assuming that $c = (c_x, c_y)$ represents the center point of the oriented bounding box, a two-dimensional Gaussian kernel is placed around each center point to form the heatmap. The original 2D Gaussian is defined as follows:
$$K(x, y) = \exp\left(-\frac{(x - c_x)^2 + (y - c_y)^2}{2\sigma^2}\right),$$
where $(x, y)$ is a pixel on the heatmap and $\sigma$ is a standard deviation adapted to the size of the bounding box. Since the standard deviations corresponding to the width and the height of the bounding box in the original Gaussian map are equal, the situation shown in Figure 4a may occur, which blurs the division of positive and negative samples and affects the training of the model.
In other words, we use a 2D Gaussian to generate a heatmap to locate the center point of the object. The center point of the object has the highest confidence, and the confidence over the rest of the bounding box gradually decreases according to the Gaussian distribution. The regions outside the bounding box are all negative samples, so in theory the confidence at the corresponding positions of the heatmap should be zero. However, in the case of Figure 4a, the confidence in some areas beyond the bounding box is not zero. These samples are fuzzy samples for the model, which cause inaccurate weighting when the loss is calculated during training, thereby affecting the final performance.
Therefore, we propose an adaptive Gaussian kernel (AGK), as shown in Figure 4b. The specific definition is as follows:
$$K(x, y) = \exp\left(-\frac{(x - c_x)^2}{2\sigma_w^2} - \frac{(y - c_y)^2}{2\sigma_h^2}\right), \quad \sigma_w = \eta w, \;\; \sigma_h = \eta h,$$
where $\sigma_w$ and $\sigma_h$ are, respectively, proportional to the width and height of the bounding box, and $\eta$ is a hyperparameter whose value we set based on experience. Too large a value of $\eta$ would cause the Gaussian map to exceed the bounding box. The adaptive Gaussian heatmap generated by the adaptive Gaussian kernel is shown in Figure 5.
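A minimal sketch of how such an adaptive heatmap could be rendered, assuming standard deviations proportional to the box width and height as in the definition above; the function name, the proportionality value 0.15, and the element-wise maximum used for overlapping objects are illustrative choices, not values taken from the paper.

```python
import numpy as np

def draw_adaptive_gaussian(heatmap, center, w, h, eta=0.15):
    """Splat an anisotropic Gaussian whose spread follows the box width/height.
    `eta` is the proportionality hyperparameter (the value here is illustrative)."""
    cx, cy = center
    sigma_w, sigma_h = eta * w, eta * h
    H, W = heatmap.shape
    xs = np.arange(W, dtype=np.float32)
    ys = np.arange(H, dtype=np.float32)[:, None]
    g = np.exp(-((xs - cx) ** 2) / (2 * sigma_w ** 2)
               - ((ys - cy) ** 2) / (2 * sigma_h ** 2))
    # Keep the element-wise maximum so overlapping objects do not erase each other.
    np.maximum(heatmap, g, out=heatmap)
    return heatmap
```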
When training the heatmap, the peak point of the Gaussian distribution (that is, the center point c of the object) is regarded as a positive sample, and all other pixels are regarded as negative samples. Due to the imbalance of positive and negative samples, directly learning the center point is difficult. Therefore, in this paper, we use the variant focal loss [17] to train the heatmap, which is defined as follows:
$$L_h = -\frac{1}{N}\sum_{i}\begin{cases}(1 - p_i)^{\alpha}\log(p_i), & \text{if } \bar{p}_i = 1,\\ (1 - \bar{p}_i)^{\beta}\, p_i^{\alpha}\log(1 - p_i), & \text{otherwise},\end{cases}$$
where $\bar{p}$ and $p$ represent the object heatmap value and the predicted heatmap, respectively, $i$ represents the pixel position on the feature map, $N$ represents the number of objects, and $\alpha$ and $\beta$ are the hyperparameters used to control the contribution of each pixel. According to the experience in [17], we choose $\alpha = 2$ and $\beta = 4$.
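For reference, a sketch of the variant focal loss as it is commonly implemented following [17]; the tensor shapes, the epsilon for numerical stability, and the reduction over the batch are assumptions.

```python
import torch

def heatmap_focal_loss(pred, gt, alpha=2.0, beta=4.0, eps=1e-6):
    """Variant focal loss: only the Gaussian peaks (gt == 1) count as positives.
    `pred` and `gt` are heatmaps of shape (B, K, H, W) with values in (0, 1)."""
    pos_mask = gt.eq(1).float()              # center keypoints
    neg_mask = 1.0 - pos_mask                # every other pixel
    pos_loss = torch.log(pred + eps) * (1 - pred) ** alpha * pos_mask
    neg_loss = (torch.log(1 - pred + eps) * pred ** alpha
                * (1 - gt) ** beta * neg_mask)
    num_pos = pos_mask.sum().clamp(min=1.0)  # N = number of objects
    return -(pos_loss.sum() + neg_loss.sum()) / num_pos
```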
2.1.3. Offset
In the inference process of the model, we need to extract the peak point of the predicted heatmap and use it as the center location of the object. This peak is an integer position, whereas the true center of the original image becomes a floating-point number after downsampling. In order to compensate for this quantization error, the network needs to predict an offset $o$:
$$o = \left(\frac{c_x}{s} - \left\lfloor\frac{c_x}{s}\right\rfloor,\; \frac{c_y}{s} - \left\lfloor\frac{c_y}{s}\right\rfloor\right),$$
where $s$ is the downsampling scale. The offset is optimized using the smooth $L_1$ loss [3]:
$$L_o = \frac{1}{N}\sum_{n=1}^{N}\mathrm{Smooth}_{L_1}\left(o_n - \bar{o}_n\right),$$
where $N$ denotes the number of objects, $n$ denotes the index of the object, and $\bar{o}$ and $o$ denote the object offset and the predicted offset, respectively. The expression of smooth $L_1$ is as follows:
$$\mathrm{Smooth}_{L_1}(x) = \begin{cases}0.5x^2, & \text{if } |x| < 1,\\ |x| - 0.5, & \text{otherwise}.\end{cases}$$
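The quantization error itself is simple to compute; the sketch below assumes a downsampling scale of 4 purely for illustration and uses PyTorch's built-in smooth L1 loss for the regression.

```python
import torch
import torch.nn.functional as F

def offset_target(cx, cy, s):
    """Quantization error between the true center mapped to the heatmap
    (cx / s, cy / s) and its integer (floored) peak location."""
    fx, fy = cx / s, cy / s
    return fx - int(fx), fy - int(fy)

# During training, the predicted offsets are regressed with the smooth L1 loss.
pred = torch.tensor([[0.3, 0.6]])
target = torch.tensor([offset_target(123.0, 87.0, 4)])  # scale s = 4 is illustrative
loss = F.smooth_l1_loss(pred, target)
```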
2.1.4. Orientation
There are two types of oriented bounding boxes (OBB): the horizontal bounding box (HBB) and the rotated bounding box (RBB). When the oriented bounding box is horizontal, an accurate horizontal bounding box can be obtained through the $w$ and $h$ terms of the box parameters. When the oriented bounding box is rotated, the top $\mathbf{t}$, right $\mathbf{r}$, bottom $\mathbf{b}$, and left $\mathbf{l}$ vectors from the center point of the object describe the rotated bounding box, as shown in Figure 6. For the detector to better learn the orientation information of the bounding box, a branch is introduced to predict the orientation coefficient $\alpha$:
$$\alpha = \begin{cases}1, & \mathrm{IOU} < 0.95,\\ 0, & \text{otherwise},\end{cases}$$
where IOU is the intersection over union between the horizontal bounding box and the oriented bounding box. In training, the orientation coefficient is optimized using the standard binary cross-entropy loss:
$$L_\alpha = -\frac{1}{N}\sum_{n=1}^{N}\left[\bar{\alpha}_n\log(\alpha_n) + \left(1 - \bar{\alpha}_n\right)\log\left(1 - \alpha_n\right)\right],$$
where $\bar{\alpha}$ and $\alpha$ represent the object orientation coefficient and the predicted orientation coefficient, respectively.
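As an aside, the orientation label can be derived directly from the annotated corner points; the sketch below is one possible way to do this with the shapely package, using the IOU threshold from the definition above, and all names are illustrative.

```python
from shapely.geometry import Polygon

def orientation_label(obb_corners, iou_thresh=0.95):
    """alpha = 1 (rotated box) when the OBB deviates enough from its axis-aligned
    horizontal bounding box, alpha = 0 (horizontal box) otherwise."""
    obb = Polygon(obb_corners)                      # four (x, y) corner points
    xmin, ymin, xmax, ymax = obb.bounds             # enclosing horizontal box
    hbb = Polygon([(xmin, ymin), (xmax, ymin), (xmax, ymax), (xmin, ymax)])
    iou = obb.intersection(hbb).area / obb.union(hbb).area
    return 1 if iou < iou_thresh else 0
```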
2.1.5. Box Parameter
In order to obtain the oriented bounding box, we use the four vectors $[\mathbf{t}, \mathbf{r}, \mathbf{b}, \mathbf{l}]$ to describe it, as shown in Figure 6. In addition, $w$ and $h$ are the width and height of the external horizontal bounding box of the oriented bounding box, used to convert coordinates when the predicted bounding box is horizontal. Therefore, the box parameters are defined as $B = [\mathbf{t}, \mathbf{r}, \mathbf{b}, \mathbf{l}, w, h]$, and a smooth $L_1$ loss is used to regress them:
$$L_B = \frac{1}{N}\sum_{n=1}^{N}\mathrm{Smooth}_{L_1}\left(B_n - \bar{B}_n\right),$$
where $\bar{B}$ and $B$ are the object box parameters and the predicted box parameters, respectively.
In the process of model training, the above box parameters $B$ are continuously optimized. In the inference stage, the four corner points of the bounding box need to be obtained. First, the center point is adjusted by the predicted offset $o$; then, the adjusted center point is mapped back to the input image by multiplying it by the downsampling scale $s$. When the predicted orientation coefficient exceeds the threshold, the predicted oriented bounding box is a rotated box; otherwise, it is horizontal. In this paper, we use the top-left ($tl$), bottom-left ($bl$), bottom-right ($br$), and top-right ($tr$) points to describe the decoded bounding box. The rotated box is decoded as follows:
$$tl = c + \mathbf{t} + \mathbf{l}, \quad tr = c + \mathbf{t} + \mathbf{r}, \quad br = c + \mathbf{b} + \mathbf{r}, \quad bl = c + \mathbf{b} + \mathbf{l}.$$
When the oriented bounding box is horizontal, the following definition is utilized to obtain the horizontal bounding box:
$$tl = \left(c_x - \tfrac{w}{2},\, c_y - \tfrac{h}{2}\right), \quad tr = \left(c_x + \tfrac{w}{2},\, c_y - \tfrac{h}{2}\right), \quad br = \left(c_x + \tfrac{w}{2},\, c_y + \tfrac{h}{2}\right), \quad bl = \left(c_x - \tfrac{w}{2},\, c_y + \tfrac{h}{2}\right).$$
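Putting the decoding steps together, the sketch below recovers the four corners from one center keypoint; it assumes that the center, the edge vectors, and w, h have already been mapped back to input-image coordinates, and the orientation threshold value is illustrative.

```python
import numpy as np

def decode_corners(c, t, r, b, l, w, h, alpha, thresh=0.5):
    """Recover (tl, tr, br, bl) from the center point c.
    `t, r, b, l` are 2D edge vectors; `alpha` is the predicted orientation score."""
    c = np.asarray(c, dtype=np.float32)
    t, r, b, l = (np.asarray(v, dtype=np.float32) for v in (t, r, b, l))
    if alpha > thresh:                       # rotated bounding box
        tl, tr = c + t + l, c + t + r
        br, bl = c + b + r, c + b + l
    else:                                    # horizontal bounding box
        half = np.array([w / 2.0, h / 2.0], dtype=np.float32)
        tl, br = c - half, c + half
        tr = np.array([br[0], tl[1]])
        bl = np.array([tl[0], br[1]])
    return tl, tr, br, bl
```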
It can be seen from the above functions that all the outputs are directly estimated from the center keypoints. The total loss is composed of the loss functions of the four branches:
$$L = L_h + L_o + L_\alpha + L_B.$$
2.2. Knowledge Distillation
As can be seen from the previous section, our detector includes an encoding–decoding structure for feature extraction and four prediction branches: Heatmap, Offset, Box Parameter, and Orientation. In this section, we use knowledge distillation to transfer knowledge from the larger teacher network into a smaller, distilled student network. Specifically, the network whose backbone is ResNet50 serves as the teacher network and the network whose backbone is ResNet18 serves as the student network. After inputting images into the two networks, we use the final outputs of the teacher network as soft labels to assist student network learning. In this process, the mean square error (MSE) loss function is used to align the heatmap branch and the orientation branch, and the $L_1$ loss is used to align the offset and box parameter branches. In addition, we also use the ground-truth labels of the dataset as hard labels and use them to train the student detection network.
We utilized the following total loss to train the student model to align with the teacher model, which can be formulated as
$$L_S = L_h^s + L_\alpha^s + L_o^s + L_B^s,$$
where $L_h^s$, $L_\alpha^s$, $L_o^s$, and $L_B^s$ denote the heatmap loss, orientation loss, offset loss, and box parameter loss of the student network, each of which is composed of two components: the loss calculated with hard labels and the knowledge distillation loss calculated with soft labels. We introduced a hyper-parameter $\lambda$ to balance the two components, and set its value based on experience.
Distillation Loss
The heatmap prediction branch is responsible for the classification and localization of objects. Specifically, it predicts the center keypoint to locate the objects, taking the center keypoint as a positive sample and all remaining points as negative samples. Since there are multiple object categories, the heatmap branch trains multiple binary classifiers. In image classification, the knowledge distillation framework distills the logits output of the network, which does not apply to the heatmap branch, where multiple binary classifiers are trained. Therefore, in this paper, we propose a distillation loss for the alignment of the heatmap branches of the student and teacher networks, which is defined as follows:
$$L_h^{KD} = \frac{1}{N}\sum_{i}\left(p_i^s - p_i^t\right)^2.$$
Consequently, the loss function of the heatmap branch is:
$$L_h^s = L_h\left(\bar{p}, p^s\right) + \lambda\, L_h^{KD}\left(p^t, p^s\right),$$
where $\bar{p}$ represents the label of the heatmap branch, $p^s$ and $p^t$ are the heatmap predictions of the student network and the teacher network, and $L_h(\bar{p}, p^s)$ and $L_h^{KD}(p^t, p^s)$ are the losses calculated by the student network using hard labels and soft labels, respectively.
As is well known, a deeper teacher network can better adapt to the training dataset and perform well on the test dataset. Since the soft labels contain the category information learned by the teacher network, the student network can inherit this information by learning soft labels.
Since the prediction of the orientation is also a binary classification, we use the same form of distillation loss as the heatmap branch to learn the orientation knowledge of the teacher network. The specific definition is as follows:
$$L_\alpha^{KD} = \frac{1}{N}\sum_{n=1}^{N}\left(\alpha_n^s - \alpha_n^t\right)^2.$$
The distillation losses of these two branches adopt the MSE loss because these two branches pay more attention to abnormal cases. The heatmap branch only takes the center point as a positive sample and treats all other points as anomalies, so it needs to determine very precisely which points are positive samples. Similarly, for the orientation branch, the oriented bounding box is considered rotated if the output is greater than a certain threshold; otherwise, when the output is below the threshold, it is equivalent to an abnormal value. In summary, the network needs to handle these two situations well, so it is necessary to pay more attention to outliers.
The orientation branch of the student network is optimized by the following:
$$L_\alpha^s = L_\alpha\left(\bar{\alpha}, \alpha^s\right) + \lambda\, L_\alpha^{KD}\left(\alpha^t, \alpha^s\right),$$
where $\bar{\alpha}$ represents the orientation coefficient label, $\alpha^s$ and $\alpha^t$, respectively, represent the predicted outputs of the student network and the teacher network in the orientation branch, $L_\alpha(\bar{\alpha}, \alpha^s)$ is the loss calculated by the student network using the hard labels, and $L_\alpha^{KD}(\alpha^t, \alpha^s)$ is the distillation loss obtained through the soft labels.
For the offset and box parameter branches, we use the $L_1$ loss as the distillation loss:
$$L_o^{KD} = \frac{1}{N}\sum_{n=1}^{N}\left|o_n^s - o_n^t\right|, \qquad L_B^{KD} = \frac{1}{N}\sum_{n=1}^{N}\left|B_n^s - B_n^t\right|.$$
Once the center keypoint and the orientation are predicted, the offset and the box parameters do not need to pay much attention to outliers, so we use the $L_1$ loss to calculate their distillation losses. Specifically, the $L_1$ loss results in sparse features and sets most useless weights to zero. Therefore, these two branches only focus on the difference between the prediction of the student network and the output of the teacher network.
The offset and box parameter branches of the student network are trained by the following definitions:
$$L_o^s = L_o\left(\bar{o}, o^s\right) + \lambda\, L_o^{KD}\left(o^t, o^s\right), \qquad L_B^s = L_B\left(\bar{B}, B^s\right) + \lambda\, L_B^{KD}\left(B^t, B^s\right),$$
where $\bar{o}$ and $\bar{B}$ denote the offset label and the box parameter label, $o^s$, $o^t$, $B^s$, and $B^t$ denote the outputs of the student network and the teacher network in these two branches, $L_o(\bar{o}, o^s)$ and $L_B(\bar{B}, B^s)$ are the losses calculated with hard labels, and $L_o^{KD}(o^t, o^s)$ and $L_B^{KD}(B^t, B^s)$ are the distillation losses calculated based on soft labels.
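To make the combination of hard-label and soft-label terms concrete, the following sketch adds the four distillation terms to precomputed hard-label losses; the dictionary keys, the detaching of the teacher outputs, and the single balancing weight `lam` are assumptions about how the equations above would be assembled in code.

```python
import torch.nn.functional as F

def add_distillation_terms(hard_losses, stu, tea, lam=1.0):
    """hard_losses: dict with the four hard-label losses (focal, BCE, smooth L1 x2).
    stu / tea: dicts of student / teacher branch outputs used as soft labels."""
    kd = {
        # MSE aligns the heatmap and orientation branches (outlier-sensitive).
        "heatmap": F.mse_loss(stu["heatmap"], tea["heatmap"].detach()),
        "orientation": F.mse_loss(stu["orientation"], tea["orientation"].detach()),
        # L1 aligns the offset and box-parameter branches (sparse differences).
        "offset": F.l1_loss(stu["offset"], tea["offset"].detach()),
        "box": F.l1_loss(stu["box"], tea["box"].detach()),
    }
    return sum(hard_losses[k] + lam * kd[k] for k in kd)
```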
3. Experiments and Results
In this section, we introduce the datasets, implementation details, and experimental results. All experiments are performed on a single NVIDIA GeForce RTX 2080Ti (11 GB) with the PyTorch framework.
3.1. Datasets
We evaluated our approach on two public remote sensing datasets: HRSC2016 [30] and UCAS-AOD [31], in which the ground-truth labels are all oriented bounding boxes. The datasets used in the experiments are briefly introduced as follows:
HRSC2016 is a high-resolution optical remote sensing dataset for ship recognition. All the images are collected from Google Earth, and their spatial resolutions are between 2 m and 0.4 m. The image sizes range from 300 × 300 to 1500 × 900, and most of them are larger than 1000 × 600. Each sample in HRSC2016 is annotated with a bounding box, a rotated bounding box, and the ship head location, with three-level classes including ship, ship category, and ship type. The dataset contains 617 training images and 438 test images. We trained the network on the training set and evaluated the performance on the test set.
UCAS-AOD is a dataset collected from Google Earth, in which the objects are airplanes and vehicles in aerial images. The dataset contains a total of 1000 aircraft images and 510 vehicle images, and its image data are divided into three parts: CAR, PLANE, and NEG. Positive images are named with P+number, negative images are named with N+number, and all images are in PNG format. The ground-truth labels include the coordinates, height, width, and rotation angle of the object and are provided in TXT format. Because the dataset is not formally divided into training and test sets, we randomly divided it into 50%, 20%, and 30% as the training set, validation set, and test set, respectively. In our experiments, all images are resized to a fixed input size.
3.2. Implementation Details
3.2.1. Training and Inference
The backbone of the teacher network uses the first five convolutional stages of ResNet50 pre-trained on the ImageNet dataset. The remaining weights are initialized with PyTorch's default settings. We reshape the images to a fixed resolution in the training and inference stages, and the final output has the corresponding downsampled size. In addition, we used standard data augmentations, including random flipping and random cropping within a fixed scale range. In the training process, the Adam algorithm [32] was adopted to optimize our teacher network, and an exponential decay strategy was used to update the learning rate from its initial value. The batch size was set to 10, and 100 epochs were trained on both datasets.
The student network uses the first five convolutional stages of ResNet18 as its backbone, and the initialization of the remaining network weights, data processing, and optimization strategies are the same as for the teacher network. The difference is that the distilled student network is trained with a batch size of 16 for 100 epochs on both datasets.
3.2.2. Evaluation Indicators
In order to quantitatively evaluate the detection accuracy of our model, we used average precision (AP) and mean average precision (mAP), which are two widely used evaluation indicators [33].
AP is the average precision of the object over the recall range [0, 1], and is generally the area under the precision–recall curve (PRC). The PRC can be obtained through four evaluation components: true positives (TP), false positives (FP), false negatives (FN), and true negatives (TN). Recall (R), precision (P), and AP are defined as
$$R = \frac{TP}{TP + FN}, \qquad P = \frac{TP}{TP + FP}, \qquad AP = \int_{0}^{1} P(R)\, dR,$$
where TP, FP, and FN represent the number of correctly detected objects, the number of incorrectly detected objects, and the number of undetected objects, respectively. mAP is adopted for multi-class evaluation and is the average of the AP values over all classes.
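As a rough illustration of how AP can be computed from ranked detections, the sketch below uses a simple trapezoidal area under the precision–recall curve rather than the interpolated AP of any specific benchmark; the function and argument names are illustrative.

```python
import numpy as np

def average_precision(scores, is_tp, num_gt):
    """Area under the precision-recall curve for one class.
    `scores`: detection confidences, `is_tp`: 1 for TP / 0 for FP, `num_gt`: #objects."""
    order = np.argsort(-np.asarray(scores, dtype=np.float64))
    tp = np.cumsum(np.asarray(is_tp, dtype=np.float64)[order])
    fp = np.cumsum(1.0 - np.asarray(is_tp, dtype=np.float64)[order])
    recall = tp / max(num_gt, 1)                   # R = TP / (TP + FN)
    precision = tp / np.maximum(tp + fp, 1e-9)     # P = TP / (TP + FP)
    return np.trapz(precision, recall)             # AP = area under the PR curve
```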
For the quantitative evaluation of model efficiency, we measured the model size and the computational complexity [34]. The model size was expressed by the number of network parameters, and the computational complexity was represented by the number of floating-point operations (FLOPs) in one forward pass on a fixed input size. The number of network parameters and the FLOPs were calculated with the Pytorch-OpCounter package.
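For reference, the measurement with the Pytorch-OpCounter package looks like the following; a torchvision ResNet18 and an arbitrary input resolution stand in for the actual detector and input size, which are not restated here.

```python
import torch
from thop import profile                    # Pytorch-OpCounter package
from torchvision.models import resnet18

model = resnet18()                          # stand-in for the detector under test
dummy = torch.randn(1, 3, 512, 512)         # fixed input size (illustrative value)
macs, params = profile(model, inputs=(dummy,))
print(f"Params: {params / 1e6:.2f} M, MACs: {macs / 1e9:.2f} G")
```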
3.3. Peer Methods Comparison
3.3.1. Results on HRSC2016
In order to show the performance of our proposed method, we compared our results with R2CNN [35], RRPN [36], ROI Transformer [11], and BBAVectors [37]. The performance of these methods is shown in Table 1. Compared with the existing published oriented object detection models, our network gains 14.65%, 8.67%, 1.52%, and 1.4% improvements over R2CNN, RRPN, ROI Transformer, and BBAVectors on the HRSC2016 dataset, respectively.
Note that the backbone used by our teacher network is ResNet50, while the backbone used by R2CNN, RRPN, and ROI Transformer is ResNet101; nevertheless, the results of our method are better than those of these networks. In addition, most of the image sizes used by the compared methods are larger than ours, and a larger image size is helpful for detecting objects in remote sensing images. However, it can be seen from Table 1 that our method still outperforms the other methods at a smaller image size. BBAVectors is the baseline of our method; under the same settings, the results of our method are better than the baseline network, which shows the effectiveness of our method.
Table 1 also shows the comparison between the distilled student network and the other methods on the HRSC2016 dataset. The detection results of the undistilled student network are only better than those of R2CNN and RRPN. However, the distilled student network gains 8.29%, 1.14%, 1.02%, and 3.22% improvements in AP compared with RRPN, ROI Transformer, BBAVectors, and the undistilled student network, respectively. The accuracy of the distilled student network is very close to that of the teacher network, but its parameter size is 15.84 MB, only about one-sixth of the teacher network's. Note that the distilled student network still achieves better performance than the other methods even though its input image size is smaller and its backbone is shallower. This shows the effectiveness of our method.
3.3.2. Results on UCAS-AOD
To further demonstrate the performance of our method, we compared it with other oriented object detection methods on the UCAS-AOD dataset. The results are shown in Table 2. Compared with R-YOLOv3 [38], R-RetinaNet [39], Faster RCNN [10], and BBAVectors [37], our teacher network achieves 8.29%, 1.14%, 1.02%, and 3.22% advances in mAP, respectively. Analyzing the AP values shows that the teacher network achieves a significant improvement on the vehicle class and the second-best performance on the airplane class, indicating that it is relatively robust to small objects such as vehicles. The distilled student network is very close to the teacher network in mAP and, at the same time, gains 7.85%, 2.36%, 1.57%, 1.37%, and 1.74% compared with R-YOLOv3, R-RetinaNet, Faster RCNN, BBAVectors, and the undistilled student network, respectively. The distilled student network outperforms the other methods and is comparable to the teacher network in AP.
3.4. Ablation Study
In order to verify the effectiveness of the semantic transfer block and adaptive Gaussian kernel, we conducted ablation experiments on the test set of the UCAS-AOD dataset. In the experiment, we used AP and mAP as evaluation indicators. We chose BBAVectors as the baseline, and its backbone is ResNet50. For the sake of fairness, all experimental data and parameter settings are strictly consistent.
Table 3 summarizes the results of models with different modules on the UCAS-AOD dataset. It can be seen from the table that adding the semantic transfer block to the baseline increases mAP by 1.11%, replacing the Gaussian heatmap in the baseline with the adaptive Gaussian heatmap generated by the adaptive Gaussian kernel increases mAP by 1.15%, and adding both modules gives the network a 1.39% improvement in mAP. The results show that both the STB and the AGK can improve the detection results of the model, especially for small objects such as vehicles, which proves the validity of our method.
In order to further demonstrate the benefit of knowledge distillation, we measured the computational complexity of our models. Table 4 shows the computational complexity of the teacher and student networks. From Table 1, Table 2 and Table 4, it can be seen that, compared with the teacher network, the FLOPs of the student network before distillation are reduced by a factor of four, but its accuracy is not as good as that of the teacher network. However, the distilled student network not only has the same small FLOPs as the undistilled student but also has accuracy much closer to that of the teacher network. These data lead us to conclude that the knowledge distillation method in this paper can learn a lightweight and accurate oriented object detector for remote sensing images. The distilled student network can greatly reduce the complexity of the model with little loss of accuracy, which is conducive to practical applications.
3.5. Visualization
To visually show the performance of the teacher network, we visualized the detection results on the HRSC2016 dataset, as shown in Figure 7. It can be seen from the figure that the teacher network can detect objects at different scales and obtain accurate bounding boxes. This shows that the adaptive Gaussian heatmap we use can not only adapt to objects of different scales and avoid the adverse effects of fuzzy samples, but also accurately predict the center keypoints of the objects. In addition, the teacher network can precisely detect ships at the edge of the port, because the proposed semantic transfer block provides more refined feature maps for the network. In summary, the experimental results show that the network can effectively avoid the adverse effects of background noise, adapt to objects of different scales, and has good robustness.
The results of the teacher network on the UCAS-AOD dataset are shown in Figure 8. It can be seen that the network can accurately detect vehicles and airplanes in different orientations and scales.
The results of the distilled student network on the HRSC2016 and UCAS-AOD datasets are shown in Figure 9 and Figure 10. We chose to display remote sensing images with more complex backgrounds in Figure 9 and Figure 10; in particular, in the second column of the figures, several objects are hidden in the environment. It can be seen from the figures that the distilled student network retains the advantages of the teacher: even if an object lies in a complex environment, the student network can still detect it properly. These examples indicate that the student network has learned the knowledge of the teacher network well.
4. Discussion
Through comparative experiments and analysis, the validity of our method is verified. From the visualization of the detection results, we can see that our network can avoid the adverse effects of background noise, adapt to objects of various scales, and achieve good robustness.
The semantic transfer block (STB) provides a stronger feature fusion method than a plain skip connection. This is because low-level features contain the location information of the object but lack semantic clues. In addition, the shallow feature maps of remote sensing images usually contain much noise, causing the feature maps after direct fusion to be overly blurred, as shown in Figure 3. The shallow feature maps processed by the STB introduce more semantic information and contain less noise, so that the fused feature maps can restore the semantics well.
The proposed adaptive Gaussian kernel (AGK) adapts to objects of different scales and eliminates the influence of fuzzy samples on model training. This is because the spread of the heatmaps generated by the AGK is proportional to the width and height of the bounding box, as shown in Figure 4. This avoids the situation where, with the original 2D Gaussian, the confidence at some locations beyond the bounding box is not zero, which blurs the distinction between positive and negative samples and has a negative effect on training.
The distillation loss is based on the idea of knowledge distillation in classification tasks. In classification, the ground truths are usually converted into one-hot encodings, which are called hard labels. Correspondingly, the outputs of the softmax function of the teacher network are called soft labels; they represent the probability distribution of the current prediction and contain more information than hard labels. Our distillation loss uses the outputs of the four prediction branches of the teacher as soft labels so that the predictions of the student network are aligned with those of the teacher, thereby obtaining a lightweight distilled student network.
In this paper, we studied oriented object detection in remote sensing images and model compression for object detection, and made some progress, but there are still some shortcomings. Therefore, we put forward some directions for future research:
In model compression, we only distilled the prediction branches of the network and did not consider the imbalance between positive and negative samples. To help the student network better learn positive samples and suppress the learning of background pixels, attention-guided knowledge distillation can be explored in future work;
The relationships between different objects contain valuable information. Some works, such as non-local networks [40], allow the detector to capture and use this information well. However, the existing knowledge distillation methods for object detection in remote sensing images only transfer the information of single pixels, ignoring the relationships between different pixels. Therefore, in the future, we can study non-local knowledge distillation to help the student network learn the relationships between different pixels from the teacher network.
5. Conclusions
In this paper, we propose a lightweight keypoint-based multi-category detector designed for arbitrarily oriented objects in remote sensing images.
Considering that the complex backgrounds in remote sensing images lead to rough feature maps, we proposed a semantic transfer block (STB) to provide a more effective feature fusion method; this module uses semantic information as a guide to align the boundary information of low-level features and refine them. Due to the variety of object sizes in remote sensing images, an adaptive Gaussian kernel (AGK) was proposed to adapt to objects of different scales, eliminate the influence of fuzzy samples in the training stage, and further improve detection performance. It is worth mentioning that BKDet is also an anchor-free detector, which avoids some of the shortcomings of anchor-based methods. Finally, we use knowledge distillation to learn a lighter object detector for remote sensing images: we propose the corresponding distillation losses and use the complex detector as a teacher network to guide the learning of the student model. The experimental results show that the student network obtained by our distillation framework has significantly fewer parameters than the teacher network while its performance remains close to that of the teacher, which shows the validity of the method in this paper.
We conducted experiments on the HRSC2016 and UCAS-AOD datasets. The results show that the teacher network achieves better performance in oriented object detection for remote sensing images, and that the student network obtained by knowledge distillation greatly reduces the model parameters and computational complexity with only a small loss of accuracy, which is very valuable for practical applications.