Article

Towards Feature Decoupling for Lightweight Oriented Object Detection in Remote Sensing Images

1 Chongqing Innovation Center, Beijing Institute of Technology, Chongqing 401135, China
2 School of Information and Electronics, Beijing Institute of Technology, Beijing 100081, China
3 School of Astronautics, Beihang University, Beijing 100191, China
* Author to whom correspondence should be addressed.
Remote Sens. 2023, 15(15), 3801; https://doi.org/10.3390/rs15153801
Submission received: 31 May 2023 / Revised: 15 July 2023 / Accepted: 24 July 2023 / Published: 30 July 2023

Abstract:
In remote sensing images, recent improvements in detection performance have largely relied on deeper convolutional layers and more complex convolutional structures, which significantly increase the storage and computational cost of the detector. Although previous work has designed various novel lightweight convolutions, when these convolutional structures are applied to remote sensing detection tasks, two inconsistencies in the detection architecture are often ignored: the inconsistency between features and targets, and the inconsistency between features and tasks. (1) Features extracted by convolutions sliding in a fixed direction struggle to model targets distributed in arbitrary directions, so the detector needs more parameters to encode direction information and its parameters become highly redundant. (2) The detector shares features from the backbone, but the classification task requires rotation-invariant features while the regression task requires rotation-sensitive features; this task inconsistency leads to inefficient convolutional structures. Therefore, this paper proposes a feature-decoupling detector for lightweight oriented object detection (FDLO-Det). Specifically, we construct a rotational separable convolution that extracts rotation-equivariant features while significantly compressing network parameters and computational complexity through a high degree of parameter sharing. Next, we introduce an orthogonal polarization transformation module that decomposes the rotation-equivariant features along the horizontal and vertical orthogonal directions and uses polarization functions to filter out the features required by the classification and regression tasks, effectively improving detector performance. Extensive experiments on DOTA, HRSC2016, and UCAS-AOD show that the proposed detector achieves the best performance while striking an effective balance between computational complexity and detection accuracy.

1. Introduction

Aerial object detection is a key technology for the intelligent interpretation of remote sensing (RS) image data. It can automatically locate and recognize valuable targets (such as aircraft, ships, and bridges) in massive optical RS images and is widely used in many fields, including environmental monitoring, geological disaster detection, smart cities, and intelligent transportation. Owing to the rapid improvement of GPUs and the rapid growth of optical RS imagery, detection algorithms [1,2,3,4,5,6] based on convolutional neural networks (CNNs) have received increasing attention in recent years.
Existing detection frameworks mainly fall into two categories: one-stage [7,8,9] and two-stage detectors [10,11,12]. One-stage detectors usually obtain possible regions by presetting a large number of horizontal anchors and then extract target features through convolution operations; regression and classification are finally performed on these anchors and feature maps to obtain the target boundaries. However, such horizontal anchors cause misalignment between targets and candidate regions, introduce serious background interference, and mislead detection. At present, the RS detection field mainly uses preset rotated anchors [13,14] to locate targets with arbitrary orientations, but this not only leads to redundant computation but also cannot guarantee accurate features. To eliminate the misalignment of horizontal candidate regions when detecting rotated targets, Gliding Vertex [15] regresses the four vertex coordinates of the detection box, avoiding interference from horizontal candidate regions. To achieve higher performance, two-stage detectors use the region proposal network (RPN) [16] to finely select and refine candidate regions; for example, the RoI Transformer converts horizontal candidate regions into rotated ones to avoid background interference. However, two-stage detectors have high computational complexity and slow inference speed, making them difficult to deploy on embedded platforms.
At present, research on convolutional neural networks has shifted from improving accuracy to optimizing speed. MobileNetV2 [17] splits standard convolutions into grouped (depthwise) convolutions and pointwise convolutions: the feature maps are divided into groups and each group is convolved with its own kernel, which reduces the computational complexity of the convolution. On this basis, a depthwise separable convolution module and an inverted residual module are constructed to achieve large-scale compression of network parameters. Moreover, the architecture of MobileNetV3 [18] is obtained through neural architecture search (NAS) and introduces a channel attention mechanism, effectively achieving a further improvement in network performance. However, the grouping operation in MobileNet loses the connections between different groups, which limits feature learning. ShuffleNet [19,20] was therefore proposed to shuffle channels across groups, alleviating the insufficient features caused by group convolution, and to stack multiple convolutional layers into a more powerful structure. GhostNet [21] divides each convolutional layer into two parts: the first is a conventional convolution with fewer output feature maps, which strictly controls the network parameters and computational complexity; the second does not use conventional convolution but generates new feature maps through simple linear transformations. Compared with ordinary CNNs, the parameters and computational complexity required by GhostNet are significantly reduced. Howard et al. [22] simply replaced the backbone with MobileNet for faster speed; although directly replacing the backbone reduces computational complexity, the accuracy of the detector also drops significantly. Light-Head R-CNN [23] generates thin feature maps with few channels by using large separable convolutions, which greatly reduces the computational complexity of the subsequent RoI subnetwork and makes the detection system memory friendly.
In summary, these lightweight network structures are usually designed for image classification and detection in natural scenes, without fully considering the consistency between the detector and the RS detection task. Specifically, as shown in Figure 1, the first issue is the inconsistency between targets and features within the detector's backbone [24,25]. Existing lightweight convolutions slide along a fixed direction, making it difficult to effectively model the features of aerial targets distributed in arbitrary directions. As a result, more parameters are needed to encode spatial information and the network parameters become highly redundant; for example, the detector requires a large number of rotated anchors to fully cover the targets, which significantly increases the complexity of the detection head. Therefore, without considering the spatial orientation of targets, no matter how lightweight the convolution is, it is difficult to guarantee the optimal performance of the detector. Moreover, there is an inconsistency between the shared backbone features and the classification and regression tasks in the detection head: the detector shares features from the backbone, but the classification task requires rotation-invariant features while the regression task requires rotation-sensitive features [26,27]. This task inconsistency leads to inefficient convolutional structures.
The above analysis shows that, no matter how lightweight the convolutional structure is, it is difficult to avoid the mismatch between remote sensing detection tasks and the detection framework. Rather than designing yet another elaborate convolution, it is more important to build an adaptive feature extractor for rotated targets and to design feature filters that are consistent with the classification and regression tasks. To this end, we propose FDLO-Det, a feature-decoupling detector for lightweight oriented object detection, which consists of two components: a rotational separable convolution and an orthogonal polarization transformation module. First, we construct rotational separable convolutions that form a group by rotating the convolutional kernel in two-dimensional space to change the sampling positions, and merge them into the backbone to generate rotation-equivariant features. This allows the direction to be predicted accurately and reduces the complexity of modeling direction changes. Then, in the orthogonal polarization transformation module, we design a spatial orthogonal attention mechanism that splits the global pooling operation into two orthogonal one-dimensional feature encoding operations along the spatial dimensions, aggregating features from two different spatial directions and enhancing the representation ability of the features. Finally, through the filtering effect of the polarization functions, the critical features required by the classification and regression tasks are selected, thereby achieving accurate target detection. Our contributions are as follows:
(1) We systematically analyze the inconsistency issues faced by existing lightweight convolutions in RS detection tasks, which limit the performance and compression ratio of detectors. (2) A rotational separable convolution is proposed to generate rotation-equivariant features, reducing the parameters and computational complexity. (3) An orthogonal polarization transformation module is proposed to decouple rotation-equivariant features along the horizontal and vertical orthogonal directions and to use polarization functions to filter out the features required by the classification and regression tasks, improving detection performance.

2. Related Works

2.1. Oriented Object Detection

Current convolutional structures lack rotation-invariant feature extraction and rely on horizontal-box localization, making it difficult to accurately describe the orientation of RS targets. Typical deep learning detectors therefore struggle to localize arbitrarily oriented targets compactly and accurately. To this end, current mainstream work mainly focuses on improving rotated feature extraction and optimizing rotated boxes/regions.
Deformable convolution [28] adds offsets in the module to change the spatial sampling positions and learns these offsets from the task without additional manual design. Such convolution structures can replace the common modules of existing CNNs and be trained end-to-end with back-propagation, producing deformable convolutional neural networks whose receptive fields are closer to the actual shape of the object. AlignDet [29] designs RoI convolution to achieve the same effect as RoI alignment in a one-stage detector. However, when these methods are applied to detect rotated and dense targets in aerial images, they often suffer interference from adjacent target features, resulting in poor performance. RIFD-CNN [30] adds rotation-invariant regularization constraints to the objective function of a two-stage detector such as Faster R-CNN, so that the feature representations of samples remain similar after rotation, thereby extracting rotation-invariant features. However, these methods often have complex structures and high computational complexity, making them difficult to apply to various detectors.
FFA [31] added angle information to the region proposal network (RPN) of Faster R-CNN. SCRDet [32] uses IoU-smooth L1 loss and IoU weighting to suppress out-of-bounds angles, and GWD [33] fits a Gaussian distribution to approximate the representation of a rotated rectangle. These methods face the problems of sensitivity to anchor hyperparameters or performance degradation caused by boundary discontinuity. Moreover, IENet [34], PIoU [35], and other methods build on anchor-free natural-scene detectors to optimize the rotated box representation and the regression loss function. However, the detection accuracy of these methods still lags behind that of anchor-based detectors.

2.2. Lightweight Convolutional Design

DenseNet [36] proposed the concept of dense connections, where the input of the current layer comes from all previous layers; by reusing features, the number of feature maps is greatly reduced, resulting in a more compact network structure. ResNeXt [37] applied group convolution in its building block based on ResNet to reduce computational complexity and parameter count; under comparable computational complexity, more groups achieve higher performance in both image recognition and object detection. Zhang et al. [38] proposed IGCNet, in which each building block is composed of two independent group convolutions, a primary group convolution and a secondary group convolution; to enhance feature expression, the input channels of each secondary group convolution are drawn from different groups of the primary group convolution. Dilated convolution [39] enlarges the receptive field without adding any parameters by introducing a new hyperparameter, the dilation rate, so that different receptive field sizes can be obtained by setting different rates. EfficientNet [40] proposed a compound scaling method based on neural architecture search (NAS), which better balances width, depth, and resolution and thus achieves high accuracy with fewer parameters. WeightNet [41] integrates the characteristics of SENet in the weight space, adding a fully connected layer after the activation vector to directly generate the weights of the convolution kernels; it is computationally efficient and can trade off accuracy and speed through a hyperparameter. MicroNet [42] includes two core ideas: micro-factorized convolution and Dynamic Shift-Max. Micro-factorized convolution decomposes the original convolution into multiple small convolutions through low-rank approximation, maintaining input connectivity while reducing the number of connections; Dynamic Shift-Max increases node connectivity through dynamic inter-group feature fusion, compensating for the performance degradation caused by reduced network depth.
In addition, for remote sensing detection tasks, LightDet [43] suggested preserving more feature maps in the shallow layers of the detector. ThunderNet [44] proposed a compressed RPN subnetwork for generating candidate regions and integrated global features with a context enhancement module to strengthen feature expression. Ding et al. [45] conducted a preliminary exploration and proposed a lightweight one-stage detector based on deep feature alignment. Wang et al. [46] designed a lightweight CNN with a simple convolution+pooling structure for ship detection in infrared images; this structure suits single-category RS detection tasks but is difficult to apply to complex scenes with multi-class detection.

3. Methodology

Unlike previous works, we achieve significant network compression by constructing an adaptive feature extractor for aerial targets based on a rotational separable convolution (RSC) and by designing an orthogonal polarization transformation module (OPTM) that is consistent with the classification and regression tasks. The proposed FDLO-Det framework is shown in Figure 2. The rotational separable convolution forms the backbone of the entire detector. The feature maps extracted by the backbone are passed to a Feature Pyramid Network (FPN), and the resulting multi-scale feature maps are fed into the spatial orthogonal attention mechanism in OPTM to enhance the target features. Finally, the feature maps are passed through the polarization functions corresponding to their respective tasks, which generate the features required by each classifier and regressor. The following subsections introduce the proposed RSC and OPTM in detail.

3.1. Rotational Separable Convolution

Rotation-invariant features are crucial for detecting targets in arbitrary directions. However, CNNs are poor at modeling rotational changes, and more parameters are needed to encode directional information [47]. Although some methods [48,49] can approximate rotational equivariance at the image level, they require many samples and parameters, whereas RS detection requires instance-level rotation-equivariant features, which conventional CNNs do not provide. Therefore, in order to obtain rotation-equivariant features and provide richer spatial features for subsequent classification and regression, this section modifies CNNs to meet the requirement of rotation equivariance.
The equivariance of a function refers to the property that, when the input is transformed, the output is transformed in the same way. For example, when a target undergoes a translation and appears at a different position in the image, the response in the output feature map covering the target should undergo the same translation. This is defined as follows:
$$\Phi\left(T_g(f(x))\right) = T_g\left(\Phi(f(x))\right), \quad \forall (x, g) \in (X, G),$$
where $x \in \mathbb{Z}^2$ ($\mathbb{Z}$ denotes the integers) and $f(x)$ is the image pixel value at coordinate $x$. The mapping $\Phi$ is equivariant with respect to the transformation $T_g$. For traditional CNNs, $T_g$ corresponds to translation: although this operation may change the output of the convolution, the change is linear and predictable. In contrast, a transformation that is not equivariant has a non-linear impact on the output.
However, rotation and convolution operations are not commutative, but the stacking of feature maps could be equivariant. Therefore, the rotation operation is not a convolutional equivariant mapping, which reduces the effectiveness of image recognition. To solve this problem, the most traditional method is data augmentation, which directly rotates the image and inputs it into the network for training, but this method is not optimal.
To solve this problem by improving the network itself, we consider the simple symmetry group p4, which has a four-fold rotational symmetry axis. As shown in Figure 3, this group contains four symmetry operations: the identity (e), rotation by 90° (r), rotation by 180° ($r^2$), and rotation by 270° ($r^3$). We need to design a new CNN structure so that, when the input image undergoes the above transformations, the network remains equivariant. Clearly, these operations compose into one another and form a closed cycle. We collect these transformations into a set and define the binary operation on the set as "applying two transformations from the set to an image in succession", which forms a group. For convolutional structures, we have:
$$[T_s f](x) = f(x - s).$$
The first layer of the backbone we constructed is:
$$[f \star \psi](s) = \sum_{x \in \mathbb{Z}^2} \sum_{k=1}^{K} f_k(x)\,[T_s \psi_k](x) = \sum_{x \in \mathbb{Z}^2} \sum_{k=1}^{K} f_k(x)\,\psi_k(x - s).$$
Therefore, we can obtain:
$$[T_g f] \star \psi = T_g\,[f \star \psi].$$
By performing anticlockwise operations on the filter, the equivariant features of the image during clockwise rotation can be obtained.
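The equivariance relation above can be checked numerically with a few lines of PyTorch. The snippet below is a minimal sketch (the tensor shapes and the 90° rotation step are illustrative assumptions): rotating the input and the filter by the same 90° step produces a rotated copy of the original response map.

```python
import torch
import torch.nn.functional as F

# Minimal numeric check of the equivariance relation above (illustrative shapes).
torch.manual_seed(0)
x = torch.randn(1, 3, 64, 64)   # square input image, batch 1, 3 channels
w = torch.randn(8, 3, 3, 3)     # 8 filters of size 3x3

y = F.conv2d(x, w, padding=1)                       # f * psi
x_rot = torch.rot90(x, k=1, dims=(-2, -1))          # T_g f (rotate input by 90 deg)
w_rot = torch.rot90(w, k=1, dims=(-2, -1))          # rotate the filter the same way
y_rot = F.conv2d(x_rot, w_rot, padding=1)           # (T_g f) * (rotated psi)

# The response of the rotated input equals the rotated response of the original.
assert torch.allclose(y_rot, torch.rot90(y, k=1, dims=(-2, -1)), atol=1e-5)
```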
Other Layers of the Network. The feature maps obtained through the rotational separable convolutions satisfy the structure of the group, so the subsequent convolutional layers treat such a set of feature maps as a whole. We use h to denote an element of the group (i.e., one transformed copy in the stack), so that f(h) represents a feature map defined on the group structure. The convolutions in the other layers of the network can then be defined on the group G:
$$[f \star \psi](g) = \sum_{h \in G} \sum_{k=1}^{K} f_k(h)\,\psi_k\left(g^{-1} h\right).$$
Then, we used 1 × 1 convolution to fuse the extracted rotational space features (as shown in Figure 4), which allows RSC to freely change the number of output channels and construct correlation relationships between features of different channels.
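As a concrete illustration of this layer layout, the sketch below shows one way a p4 rotational separable convolution could be written in PyTorch: a single shared depthwise kernel is rotated on the fly into G = 4 orientations, and a 1 × 1 pointwise convolution then fuses the stacked orientation responses. The class name, channel sizes, and the exact weight-sharing bookkeeping are our assumptions and simplifications, not the authors' released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RotSepConv2d(nn.Module):
    """Sketch of a rotational separable convolution for the p4 cyclic group (G = 4)."""

    def __init__(self, channels, out_channels, k=3, group_size=4):
        super().__init__()
        self.group_size = group_size
        # One depthwise kernel per input channel, shared by all G orientations.
        self.dw_weight = nn.Parameter(torch.randn(channels, 1, k, k) * 0.01)
        # 1x1 pointwise convolution fuses the G orientation maps and mixes channels.
        self.pointwise = nn.Conv2d(channels * group_size, out_channels, kernel_size=1)

    def forward(self, x):
        c = x.shape[1]
        maps = []
        for g in range(self.group_size):
            # Rotate the shared kernel by g * 90 degrees instead of storing G kernels.
            w_g = torch.rot90(self.dw_weight, k=g, dims=(-2, -1))
            maps.append(F.conv2d(x, w_g, padding=self.dw_weight.shape[-1] // 2, groups=c))
        # Orientation responses are stacked along the channel axis, then fused.
        return self.pointwise(torch.cat(maps, dim=1))

# Usage with illustrative shapes:
layer = RotSepConv2d(channels=16, out_channels=32)
out = layer(torch.randn(2, 16, 128, 128))   # -> (2, 32, 128, 128)
```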
Compression Ratio. In this section, we demonstrate how RSC makes the object detection task lightweight. Assume that the input feature map dimension is $W_i \times H_i \times M$ and the output feature map dimension is $W_o \times H_o \times N$, where $W$ and $H$ are the width and height of the feature map and $M$ and $N$ are the numbers of input and output channels. For a traditional convolution, the computational complexity and parameters of a single convolutional layer are:
$$F_t = D_K \times D_K \times M \times N \times W_i \times H_i, \qquad P_t = M \times D_K \times D_K \times N,$$
where $F_t$ and $P_t$ denote the FLOPs and parameters, respectively. For RSC, a single convolutional layer is composed of a rotation convolution and a pointwise convolution along the channel dimension. Let $G$ be the number of rotation operations in the group; the computational complexity and parameters of RSC are:
$$F_{rsc} = F_r + F_p = D_K \times D_K \times M \times W_i \times H_i + M \times N \times W_i \times H_i, \qquad P_{rsc} = P_{sd} + P_{sp} = M \times D_K \times D_K / G + N \times M.$$
At this point, the computational complexity and parameter ratio of RSC and traditional convolution are:
$$\mathrm{Ratio}_F = \frac{D_K \times D_K \times M \times W_i \times H_i + M \times N \times W_i \times H_i}{D_K \times D_K \times M \times N \times W_i \times H_i} = \frac{1}{N} + \frac{1}{D_K^2} \approx \frac{1}{D_K^2},$$
$$\mathrm{Ratio}_P = \frac{M \times D_K \times D_K / G + N \times M}{M \times D_K \times D_K \times N} = \frac{1}{GN} + \frac{1}{D_K^2} \approx \frac{1}{D_K^2},$$
where $N$ and $D_K$ are both positive integers and $GN$ is generally large. From Equations (5) and (6), it can be seen that the computational complexity and parameters of the rotational separable convolution are only about $1/D_K^2$ of those of a traditional convolution. Therefore, in remote sensing detection tasks, replacing traditional convolutions with RSC can effectively reduce the computational complexity and parameters of the detector. Compared with ordinary backbones composed of typical CNNs, the proposed RSC has the following advantages: (a) Higher degree of weight sharing. As introduced above, rotation-equivariant feature maps have an additional orientation dimension, and features with different orientations share the same filters under different rotation transformations. (b) Enriched orientation information. For an input image with a fixed orientation, the rotation-equivariant backbone produces features for multiple orientations, which is important for oriented object detection, where accurate orientation information is required.
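To make the ratios above concrete, the short calculation below plugs in one set of illustrative values (a 3 × 3 kernel, $M = N = 256$ channels, and $G = 4$); the numbers are examples only, not measurements from the paper.

```python
# Worked example of the FLOPs and parameter ratios for a single layer.
D_K, M, N, G = 3, 256, 256, 4               # illustrative values, not from the paper

ratio_flops = 1 / N + 1 / D_K ** 2          # 1/256 + 1/9  ~= 0.115
ratio_params = 1 / (G * N) + 1 / D_K ** 2   # 1/1024 + 1/9 ~= 0.112

print(f"FLOPs ratio  ~ {ratio_flops:.3f}")  # roughly 1/9 of a standard convolution
print(f"Params ratio ~ {ratio_params:.3f}")
```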

3.2. Orthogonal Polarization Transformation Module

In most RS detection architecture, classifiers and regressors rely on shared features extracted by backbones to classify and regress targets. Due to the widespread use of pooling layers in the general CNN architecture, these shared features have a certain degree of rotational invariance [28]. However, in RS detection tasks, targets are distributed in any direction, and these rotational invariant features are often beneficial for target classification but not for target localization. For example, regressors need to be sensitive to angles in order to obtain accurate location prediction, while classifiers should have the same response to angles so that targets with different angle distributions can be accurately identified as the same category.
To avoid feature interference from different tasks and effectively extract critical features required for specific tasks, the Orthogonal Polarization Transformation Module was proposed. The overall structure of OPTM is shown in Figure 5.
Firstly, we design a spatial orthogonal attention mechanism that compresses the information of the feature maps directly into the feature channels: in the spatial dimension, two orthogonal one-dimensional feature encoding operations are used to enhance the representation ability of the features. Then, through the filtering effect of the polarization functions, high-response features are selected to reduce noise interference for the classification task, while the influence of irrelevant high-activation regions is suppressed so that features of the target boundary can be selected for the regression task.
Spatial Orthogonal Attention Mechanism. This mechanism models the dependency relationships between input image pixels. Specifically, we use two pooling kernels of sizes $(H, 1)$ and $(1, W)$ to encode each channel along the horizontal and vertical directions, respectively. The formula is as follows:
$$A_s(w, c) = \frac{1}{H} \sum_{i=0}^{H-1} F_{in}(i, w, c), \qquad A_s(h, c) = \frac{1}{W} \sum_{j=0}^{W-1} F_{in}(j, h, c),$$
where $A_s(w, c)$ is the encoding output of column $w$ of the feature map of channel $c$, $A_s(h, c)$ is the encoding output of row $h$ of the feature map of channel $c$, and $W$ and $H$ are the width and height of the feature map. Through this operation, we obtain a pair of aggregated features along two mutually perpendicular, orthogonal directions. This pair of features not only captures the dependency relationships along the two directions of the feature map but also preserves the corresponding precise position information, thereby improving the expressive ability of the convolutional features and helping the network accurately locate targets in the image.
After obtaining the aggregated features containing global spatial information, we concatenate them and process the result with a $1 \times 1$ convolution and the ReLU activation function to obtain the intermediate feature $f$:
$$f = \mathrm{ReLU}\left(\mathrm{Conv}\left(\left[A^h, A^w\right]\right)\right),$$
where $f \in \mathbb{R}^{C/r \times (H+W)}$ is the intermediate feature that encodes the aggregated feature pair, $[\cdot\,, \cdot]$ denotes concatenation along the spatial dimension, and $r$ is the channel compression ratio.
Secondly, we decompose the intermediate feature into two independent features $f^h \in \mathbb{R}^{C/r \times H}$ and $f^w \in \mathbb{R}^{C/r \times W}$ along the spatial dimension, and apply a $1 \times 1$ convolution and a sigmoid activation to transform them into features with the same channel dimension as the input:
$$g^h = \mathrm{Sigmoid}\left(\mathrm{Conv}\left(f^h\right)\right), \qquad g^w = \mathrm{Sigmoid}\left(\mathrm{Conv}\left(f^w\right)\right),$$
where $g^h \in \mathbb{R}^{C \times H}$ and $g^w \in \mathbb{R}^{C \times W}$. Finally, we use $g^h$ and $g^w$ as attention weights and apply them to the input features to obtain the final output feature map:
$$F_{out}(i, j, c) = X(i, j, c) \times g^w(i, c) \times g^h(j, c).$$
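A compact PyTorch sketch of this mechanism is given below. It follows the layout described by the equations above (pool along H and W, jointly encode, split, gate); the module name, the reduction floor of 8 channels, and the use of two separate 1 × 1 projections for the two directions are our assumptions.

```python
import torch
import torch.nn as nn

class SpatialOrthogonalAttention(nn.Module):
    """Sketch of the spatial orthogonal attention mechanism (SOAM)."""

    def __init__(self, channels, r=16):
        super().__init__()
        mid = max(channels // r, 8)   # reduction with a small floor (assumption)
        self.encode = nn.Sequential(nn.Conv2d(channels, mid, 1), nn.ReLU(inplace=True))
        self.proj_h = nn.Conv2d(mid, channels, 1)
        self.proj_w = nn.Conv2d(mid, channels, 1)

    def forward(self, x):
        n, c, h, w = x.shape
        a_h = x.mean(dim=3, keepdim=True)                        # (n, c, h, 1): pool along W
        a_w = x.mean(dim=2, keepdim=True).permute(0, 1, 3, 2)    # (n, c, w, 1): pool along H
        f = self.encode(torch.cat([a_h, a_w], dim=2))            # joint intermediate feature
        f_h, f_w = torch.split(f, [h, w], dim=2)                 # split back per direction
        g_h = torch.sigmoid(self.proj_h(f_h))                    # (n, c, h, 1)
        g_w = torch.sigmoid(self.proj_w(f_w)).permute(0, 1, 3, 2)  # (n, c, 1, w)
        return x * g_h * g_w                                     # broadcasted orthogonal gating

# Usage with illustrative shapes:
soam = SpatialOrthogonalAttention(channels=256)
y = soam(torch.randn(2, 256, 50, 50))   # -> (2, 256, 50, 50)
```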
Polarization Function. Based on SOAM, we further design task-specific polarization functions $\phi$ (as shown in Figure 6) to build the critical features required by different tasks. For the classification task, the features are expected to focus on the highly responsive parts of the feature map and to ignore less important cues that may only be useful for localization or may introduce interference noise. Therefore, we use the following activation function:
$$\phi_{cls}(x) = \frac{1}{1 + e^{-\eta(x - 0.5)}},$$
where $\eta$ is a modulation factor (set to 15 in our experiments). Since the high-response areas of the feature map are sufficient for accurate classification, we enhance the responses above a fixed threshold and suppress the irrelevant features whose weights fall below it. In this way, the interference of irrelevant features is reduced, which further lowers the risk of overfitting. For the regression branch, in contrast, the critical features that accurately locate the target boundary region are often scattered along the edges of the target. We therefore expect the feature map to attend to as many target edge cues as possible, such as target contour information. For this purpose, the following suppression function is used to process the input features:
$$\phi_{reg}(x) = \frac{1}{2}\exp\left(-\frac{(x - \mu)^2}{2\sigma^2}\right).$$
Unlike in the classification task, a single strong response over a small area at the edge of an object is not conducive to locating the entire object. In the above equation, $\phi_{reg}(x)$ suppresses the high-response areas of the feature map, which forces the detector to search for additional visual cues for precise localization. By enhancing the spatial features through the spatial orthogonal attention mechanism and using the polarization functions to filter out the effective features required by each task, a powerful feature representation for precise target detection is obtained, greatly improving detection performance.
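The two polarization functions are simple element-wise maps, sketched below in PyTorch. Since the text does not list $\mu$ and $\sigma$ here, the defaults in the sketch are illustrative assumptions, and the form of `phi_reg` follows the (reconstructed) suppression equation above.

```python
import torch

def phi_cls(x, eta=15.0):
    """Classification polarization: sharpen responses above 0.5, damp the rest."""
    return torch.sigmoid(eta * (x - 0.5))

def phi_reg(x, mu=0.5, sigma=0.3):
    """Regression polarization (sketch): down-weight the strongest responses so the
    regressor must look for scattered boundary cues. mu and sigma are illustrative."""
    return 0.5 * torch.exp(-(x - mu) ** 2 / (2 * sigma ** 2))

# One possible usage (our assumption): re-weight a normalised feature map per task.
feat = torch.rand(2, 256, 50, 50)
cls_feat = feat * phi_cls(feat)
reg_feat = feat * phi_reg(feat)
```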

4. Experiments

4.1. Datasets

HRSC2016 [50] is a nearshore ship detection dataset collected from Google Earth and annotated with rotated boxes, covering a total of 1061 images. The images come from six famous ports, with spatial resolutions between 2 m and 0.4 m, and their sizes range from 300 × 300 to 1500 × 900, including ships with high aspect ratios that are densely docked side by side. The dataset is divided into a training set, a validation set, and a testing set with 436, 181, and 444 images, respectively. In our experiments, the image size was adjusted to 800 × 800.
UCAS-AOD [51] is widely used in aerial object detection, in which images were collected from Google Earth and contained 1000 aircraft images and 510 car images. Due to the lack of official division of the dataset, we randomly divided it into training, validation, and testing sets, with a ratio of 5:2:3. In this experiment, all images in UCAS-AOD were adjusted to 800 × 800.
The DOTA dataset [52] is the largest aerial dataset, released by Guisong Xia's team at Wuhan University in June 2018. The image sizes span a huge range, from 800 × 800 to 20,000 × 20,000, and include objects of different sizes, orientations, and shapes. The images were obtained from different sensors and platforms, mainly from Google Earth, as well as some images captured by the Jilin-1 and Gaofen-2 satellites. DOTA includes 2806 aerial images and a total of 15 categories: basketball court (BC), roundabout (RA), harbor (HA), swimming pool (SP), helicopter (HC), plane (PL), baseball diamond (BD), bridge (BR), ground track field (GTF), soccer ball field (SBF), large vehicle (LV), ship (SH), tennis court (TC), small vehicle (SV), and storage tank (ST). Because the DOTA images are too large, we cropped the original images into 800 × 800 patches with a 200-pixel step between blocks for training and testing.
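Cropping large scenes into fixed-size patches is usually done with a sliding window. The helper below is a generic sketch of that step; the stride value and the reading of the 200-pixel step as the overlap between neighbouring 800 × 800 crops (i.e., a stride of 600 pixels) are our assumptions.

```python
def sliding_crops(width, height, crop=800, stride=600):
    """Yield (x0, y0, x1, y1) windows covering a large image with fixed-size crops."""
    xs = list(range(0, max(width - crop, 0) + 1, stride))
    ys = list(range(0, max(height - crop, 0) + 1, stride))
    # Add a final crop flush with the right/bottom border so nothing is missed.
    if width > crop and xs[-1] + crop < width:
        xs.append(width - crop)
    if height > crop and ys[-1] + crop < height:
        ys.append(height - crop)
    for y0 in ys:
        for x0 in xs:
            yield x0, y0, x0 + crop, y0 + crop

# Example: windows for a 4000 x 3000 DOTA scene.
windows = list(sliding_crops(4000, 3000))
```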

4.2. Experimental Evaluation Metrics

mAP. In RS detection tasks, targets of different positions and categories may appear in each image, so both classification and localization performance must be evaluated, and metrics designed for classification alone cannot be used directly for aerial object detection. Therefore, the mean average precision (mAP) is used to measure detector performance. Specifically, TN, FN, TP, and FP denote true negatives, false negatives, true positives, and false positives, respectively. We first compute the precision $P$ and recall $R$:
$$P = \frac{TP}{TP + FP}, \qquad R = \frac{TP}{TP + FN}.$$
For an excellent detector, both recall and precision should be high, so $P$ and $R$ must be considered jointly. A suitable evaluation indicator is therefore obtained by constructing precision-recall (PR) curves for all categories and averaging the areas under them. For a dataset with $N_c$ categories, mAP is defined as follows:
$$mAP = \frac{1}{N_c} \sum_{i=1}^{N_c} \int_0^1 P_i(R_i)\,\mathrm{d}R_i.$$
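For one class, the integral above reduces to the area under the precision-recall curve built from score-sorted detections. The function below is a generic sketch of that computation (not the official DOTA evaluation code); the matching of detections to ground truth is assumed to have been done beforehand.

```python
import numpy as np

def average_precision(scores, is_tp, num_gt):
    """Area under the precision-recall curve for a single class.

    scores: detection confidences; is_tp: 1 if the detection matched a ground-truth
    box above the IoU threshold, else 0; num_gt: number of ground-truth boxes.
    """
    order = np.argsort(-np.asarray(scores, dtype=float))
    tp = np.asarray(is_tp, dtype=float)[order]
    cum_tp = np.cumsum(tp)
    cum_fp = np.cumsum(1.0 - tp)
    recall = cum_tp / max(num_gt, 1)
    precision = cum_tp / np.maximum(cum_tp + cum_fp, 1e-9)
    return float(np.trapz(precision, recall))   # mAP averages this value over classes

ap = average_precision([0.9, 0.8, 0.7, 0.6], [1, 1, 0, 1], num_gt=4)
```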
Other Metrics. Other indicators are also introduced to evaluate the detector comprehensively. We use the storage space occupied by the model parameters to evaluate the model size, and FLOPs (floating-point operations) to measure the computational complexity of the detector.

4.3. Parameter Setting

We used ResNet as the backbone of FDLO-Det; the original ResNet was fully pre-trained on ImageNet. The P3-P7 levels of the feature pyramid were used for multi-scale detection. The threshold for positive sample matching was 0.5, and the confidence threshold of the detection head was 0.6. The Adam optimizer with a momentum of 0.9 was used to optimize the proposed FDLO-Det.
All models were trained for 200 epochs with an initial learning rate of 0.01, decayed by a factor of 0.1 after 60 epochs. Experiments were conducted on a server equipped with the PyTorch framework and 4 GPUs (RTX 2080Ti). We used a total batch size of 16 for training and a single NVIDIA 2080Ti for inference. The entire training process for DOTA, UCAS-AOD, and HRSC2016 took approximately 46 h, 3.5 h, and 4 h, respectively. RetinaNet, a simple and effective one-stage detector, was used as the baseline. Note that any introduced module may increase the computational cost.
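The schedule above maps onto standard PyTorch components roughly as follows. This is a minimal sketch: the paper pairs Adam with a "momentum" of 0.9, which we read as beta1 = 0.9, and the single decay at epoch 60 is expressed with MultiStepLR; both readings are assumptions.

```python
import torch
import torch.nn as nn

model = nn.Conv2d(3, 16, 3)   # placeholder for the FDLO-Det network
optimizer = torch.optim.Adam(model.parameters(), lr=0.01, betas=(0.9, 0.999))
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[60], gamma=0.1)

for epoch in range(200):
    # ... one training epoch over the dataset with a total batch size of 16 ...
    scheduler.step()
```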

4.4. Ablation Experiment

4.4.1. Evaluation on Different Components of FDLO-Det

We conducted ablation experiments on HRSC2016 and UCAS-AOD to verify the performance of the proposed modules in FDLO-Det. '✓' and '✗' denote using and not using a module, respectively. Table 1 lists the experimental results on HRSC2016. The baseline model only achieves 83.3% mAP, because ordinary CNN structures find it difficult to model arbitrarily oriented ships. When using RSC, the performance of the detector improves by 4.1%, which indicates that accurately extracting target directional features benefits rotated target detection. When adding OPTM, the detector obtains aggregated features from two different spatial directions through the orthogonal feature encoding operations; then, through the filtering effect of the polarization functions, the critical features required by the classification and regression tasks are selected, achieving feature decoupling that is more conducive to classification and localization, and the performance improves by 5.2%. When using both RSC and OPTM, the performance improves by 7.1%, while the network parameters drop from 140.5 M to 31.7 M, a reduction of approximately 4.5×, and the FLOPs drop from 121.6 GFLOPs to 60.1 GFLOPs, a reduction of about 2×. This indicates that our rotation-equivariant feature extraction and feature decoupling strategy can prune ineffective convolutional channels, achieving a significant reduction in parameters and FLOPs while maintaining detector performance.
Similar experimental results shown in Table 2 can be obtained on UCAS-AOD. Compared to using one module, networks stacked with different modules achieve better performance. The addition of RSC and OPTM brings rich spatial features and obtains the precise selection of features, thus achieving better regression and classification results. Meanwhile, the experimental results also showed that the proposed modules have no conflict. When using all proposed modules, the detector exhibits the best performance of 91.3% mAP.

4.4.2. Evaluation on Rotational Separable Convolution

Based on the HRSC2016 dataset, we conducted the experiments shown in Table 3 to test the impact of different cyclic group sizes on detection performance. When using $G_4$ and $G_8$, the detector performance improved by 1.9% and 0.7%, while the parameters were reduced by 4.5× and 7× and the FLOPs by 2× and 4×, respectively. However, when $G_{16}$ was adopted, although the parameters and FLOPs were compressed further, the performance decreased by 3.2%. This is because the spatial features provided by the targets in the image are limited; excessive grouping reduces the effective information required for classification and localization without bringing additional spatial features. As the number of elements in the group increases, the loss of other effective features used for classification and regression becomes particularly significant. Therefore, $G_4$ is the best choice for FDLO-Det.
In addition, we also extended RSC to the other methods shown in Table 4. The Faster R-CNN and SSD with RSC are both superior to their corresponding baselines, further demonstrating the effectiveness of a rotation-equivariant backbone.

4.4.3. Evaluation on Orthogonal Polarization Transformation Module

To further validate the effectiveness of OPTM, comparative experiments were conducted on the HRSC2016 dataset, with results shown in Table 5. Using the polarization functions alone for the classification and regression branches improves the detector by 0.1%: although the polarization functions decouple the features of the two tasks, the critical features are not fully exploited, so the efficiency is relatively low. When we adopt the embedded spatial orthogonal attention mechanism alone, we achieve a further improvement of 0.7%, which shows that SOAM effectively enhances the spatial features for classification and regression. By combining SOAM and the polarization functions, the critical part of the classification features is enhanced, while the high-response areas of the regression features are suppressed so that more potential cues are found and localization accuracy improves; the detector improves by 1.9%, which confirms our viewpoint. These experiments demonstrate that the proposed OPTM can effectively improve detection performance.
In addition, we investigated the impact of the compression ratio $r$ in the attention module on model performance by reducing the compression ratio and observing the performance changes. As shown in Table 6, when $r$ is halved, the model size increases but better performance is achieved, which indicates that adding parameters by reducing the compression ratio helps model performance. When the compression ratio is reduced further, the performance of FDLO-Det no longer increases, because the spatial features obtained by the orthogonal one-dimensional encodings are limited and it is difficult to add useful information through a higher-dimensional mapping.

4.5. Comparisons Results for Different Datasets

4.5.1. Evaluation on DOTA

We compared our FDLO-Det with the latest methods on the DOTA dataset. As shown in Table 7, our method achieved the best AP for plane (PL), bridge (BR), ground track field (GTF), large vehicle (LV), ship (SH), soccer ball field (SBF), roundabout (RA), and helicopter (HC), with 90.63%, 66.60%, 85.68%, 87.11%, 89.83%, 79.85%, 76.04%, and 83.72%, respectively. In addition, we achieved the best average result over all categories with 79.92% mAP. Some visual detection results on DOTA are shown in Figure 7.
Our FDLO-Det can accurately detect densely arranged small objects (such as small boats and vehicles). In addition, as shown in the first row of images, FDLO-Det can still accurately locate the boundaries of different targets when there is a significant difference in scale between them (such as 10 times in length between a small boat and a harbor). For targets with any orientation (such as airplanes and ships), FDLO-Det can accurately obtain the spatial direction of these targets to achieve adaptation to rotation. In addition, our FDLO-Det achieves precise detection of large roundabouts and small vehicles of different orientations and sizes. Our method can also use some square anchors to detect objects with very large aspect ratios (such as bridges and ports here). Even for the image data with poor quality and a large amount of noise interference (such as the last row of images), FDLO-Det still accurately detected ships and bridges with unclear texture features.
These results indicate that the key to precise detection is to build an adaptive feature extractor for rotated targets and to design feature filters that are consistent with the classification and regression tasks. In our FDLO-Det, RSC generates rotation-equivariant features, which accurately capture the spatial orientation of targets and reduce the complexity of modeling direction changes. OPTM enhances the feature representation ability and, through the filtering effect of the polarization functions, selects the discriminative features required by the classification and regression tasks. These two modules work together to ensure network performance.

4.5.2. Evaluation on UCAS-AOD

From the detection results in Table 8, our FDLO-Det shows the best performance among existing methods, including two-stage and one-stage detectors, reaching 91.31% mAP. For some complex scenarios, we compared FDLO-Det with S2ANet (a typical and classical network) on UCAS-AOD. As shown in Figure 8, for densely arranged small vehicles, our detection boxes cover the targets more tightly and there are almost no missed vehicles; these results further indicate the effectiveness of our method. The visualization results on UCAS-AOD are shown in Figure 9. The vehicles in the figure are small targets distributed in arbitrary directions, and the proposed FDLO-Det adapts well to such targets. In the second row of the figure, some vehicles are shaded by trees and lose almost all textural detail, yet our method still detects them accurately. In the fourth row, the aircraft are densely parked side by side and the nose of each aircraft is intertwined with the fuselages of others, but FDLO-Det can still effectively locate the individual targets. These experimental results indicate that FDLO-Det can effectively capture target orientation and provide strong spatial feature representations, enabling robust, high-quality detection of densely packed targets.

4.5.3. Evaluation on HRSC2016

This dataset contains many types of ships moored in ports and the ocean. We compared FDLO-Det with other existing detectors in Table 9. By using the proposed module, our FDLO-Det achieved an excellent performance at 90.4% mAP. Compared to specific ship detectors, our FDLO-Det has a 0.8% and 1.2% higher mAP compared to AR2Det and SDet, respectively. FDLO-Det outperforms other advanced two-stage and one-stage detectors in terms of detection performance with fewer parameters and FLOPs.
The visual detection results are shown in Figure 10. In the first row, for slender targets with high aspect ratios such as ships, FDLO-Det achieves accurate detection; although these ships are distributed in arbitrary directions and some of them lie close to the dock or port, connected to the land, FDLO-Det can still distinguish them effectively. In the second row, ships are densely docked side by side and some are closely connected, which can easily cause missed detections or detection boxes that fail to cover the targets accurately; the proposed FDLO-Det still localizes the boundary regions of the different ships. In the third row, there is a significant scale difference among the detected ships, with some differing in length by about 10 times, which shows that FDLO-Det also adapts well to target scale.
In addition, we compared our FDLO-Det with RetinaNet, a classic and solid detector. As shown in Figure 11, for high-aspect-ratio ships densely parked side by side at the wharf, our method accurately distinguishes the boundaries of different ships and achieves a more compact localization, whereas RetinaNet finds it difficult to separate the boundaries of dense targets and detects two closely arranged targets as one. We attribute this to the proposed RSC, which efficiently extracts the spatial features of targets and accurately aligns their spatial orientations to adapt to rotation, and to the proposed OPTM, which effectively decouples the features of the classification and regression tasks, yielding a more accurate and tighter regression of target boundaries.

4.6. Comparisons with Data Augmentation

In this section, we add a comparative experiment between FDLO-Det and data augmentation in Table 10, where 'aug' denotes rotation augmentation, with each image rotated clockwise by 30 degrees for a total of 11 rotations. Compared with the original RetinaNet, the model improves by 4.8% after rotation augmentation, which indicates that classical CNN models find it difficult to model target orientation effectively and must be trained with a large amount of data covering different orientations. After using RSC and OPTM, the model accuracy improves by 7.1%. Compared to data augmentation, the proposed method therefore achieves a larger performance gain, because it not only models target orientation efficiently but also filters out the discriminative features required by the classification and regression tasks, thereby achieving accurate detection.

4.7. Feature Visualization

We visualized the intermediate feature maps extracted by FDLO-Det. As shown in Figure 12, when the target is rotated or translated, the corresponding high-response area of the heat map also rotates or translates accordingly. This shows that the heat maps guided by RSC respond accurately to the regions corresponding to the critical features required by the detection task. In addition, OPTM extends the critical features required by the regression task to a larger area adjacent to the target, thereby improving localization accuracy.

5. Conclusions

In RS detection tasks, the convolution operations in the backbone slide along fixed axes, and the classifier and regressor share features from the backbone, which limits the performance of the detector. To this end, we proposed FDLO-Det, a feature-decoupling detector for lightweight oriented object detection, which consists of two components: a rotational separable convolution (RSC) and an orthogonal polarization transformation module (OPTM). Specifically, we used RSC to extract rotation-equivariant features while significantly compressing network parameters and computational complexity through a high degree of parameter sharing. Next, OPTM was used to decompose the rotation-equivariant features along the horizontal and vertical orthogonal directions and to filter out, via the polarization functions, the features required by the classification and regression tasks.
The ablation experiment verified the effectiveness of the proposed RSC and OPTM, and the comparison experiment with RetinaNet’s visualization further showed that our method has great adaptability to rotated and densely distributed targets. Comparative experiments with other advanced one-stage and two-stage detectors indicate that the proposed detector can achieve a great performance and obtain an effective balance between computational complexity and detection accuracy.
In future research, we aim to optimize the label assignment required by the classification and regression tasks, providing accurate supervisory information for the model. Meanwhile, we will also design loss functions that match the classification and regression tasks to guide robust training and further improve the performance of the detector.

Author Contributions

Funding acquisition, Y.H. and C.D.; Methodology, C.D. and D.J.; Supervision, Y.H.; Validation, Z.D. and H.Z.; Writing original draft, D.J. and Y.H. All authors have read and agreed to the published version of this manuscript.

Funding

This work is supported by National Natural Science Foundation of China (NSFC) under Grant 62171040.

Data Availability Statement

The DOTA and UCAS-AOD datasets are available at https://captainwhu.github.io/DOTA/dataset.html and https://github.com/Lbx2020/UCAS-AOD-dataset, respectively.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Liu, G.; Zhang, Y.; Zheng, X.; Sun, X.; Fu, K.; Wang, H. A new method on inshore ship detection in high-resolution satellite images using shape and context information. IEEE Geosci. Remote Sens. Lett. 2013, 11, 2272492. [Google Scholar] [CrossRef]
  2. Yang, F.; Xu, Q.; Li, B. Ship Detection From Optical Satellite Images Based on Saliency Segmentation and Structure-LBP Feature. IEEE Geosci. Remote Sens. Lett. 2017, 14, 2664118. [Google Scholar] [CrossRef]
  3. Hong, D.; Yokoya, N.; Chanussot, J.; Zhu, X.X. An Augmented Linear Mixing Model to Address Spectral Variability for Hyperspectral Unmixing. IEEE Trans. Image Process. 2019, 28, 2878958. [Google Scholar] [CrossRef] [PubMed] [Green Version]
  4. Hong, D.; Han, Z.; Yao, J.; Gao, L.; Zhang, B.; Plaza, A.; Chanussot, J. SpectralFormer: Rethinking Hyperspectral Image Classification with Transformers. IEEE Trans. Geosci. Remote. Sens. 2021, 60, 3130716. [Google Scholar] [CrossRef]
  5. Zhao, B.; Zhao, B.; Tang, L.; Han, Y.; Wang, W. Deep Spatial-Temporal Joint Feature Representation for Video Object Detection. Sensors 2018, 18, 774. [Google Scholar] [CrossRef] [Green Version]
  6. Tang, L.; Tang, W.; Qu, X.; Han, Y.; Wang, W.; Zhao, B. A scale-aware pyramid network for multi-scale object detection in SAR images. Remote Sens. 2022, 14, 973. [Google Scholar] [CrossRef]
  7. Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.Y.; Berg, A.C. SSD: Single shot multibox detector. In Computer Vision—ECCV 2016—14th European Conference, Amsterdam, The Netherlands, 11–14 October 2016; Lecture Notes in Computer Science; Springer: Amsterdam, The Netherlands, 2016; Volume 9905. [Google Scholar] [CrossRef] [Green Version]
  8. Wang, K.; Liew, J.H.; Zou, Y.; Zhou, D.; Feng, J. PANet: Few-shot image semantic segmentation with prototype alignment. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 9197–9206. [Google Scholar]
  9. Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You only look once: Unified, real-time object detection. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016. [Google Scholar] [CrossRef] [Green Version]
  10. Cai, Z.; Vasconcelos, N. Cascade R-CNN: Delving Into High Quality Object Detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–23 June 2018. [Google Scholar]
  11. Dai, J.; Li, Y.; He, K.; Sun, J. R-FCN: Object Detection via Region-based Fully Convolutional Networks. Adv. Neural Inf. Process. Syst. 2016, 29, 1–9. [Google Scholar]
  12. Jiang, Y.; Zhu, X.; Wang, X.; Yang, S.; Li, W.; Wang, H.; Fu, P.; Luo, Z. R2CNN: Rotational Region CNN for Orientation Robust Scene Text Detection. arXiv 2017, arXiv:1706.09579. [Google Scholar]
  13. Yang, X.; Yan, J.; Liao, W.; Yang, X.; Tang, J.; He, T. SCRDet++: Detecting Small, Cluttered and Rotated Objects via Instance-Level Feature Denoising and Rotation Loss Smoothing. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 45, 2384–2399. [Google Scholar] [CrossRef]
  14. Qiu, H.; Li, H.; Wu, Q.; Meng, F.; Ngan, K.N.; Shi, H. A2RMNet: Adaptively Aspect Ratio Multi-Scale Network for Object Detection in Remote Sensing Images. Remote Sens. 2019, 11, 1594. [Google Scholar] [CrossRef] [Green Version]
  15. Xu, Y.; Fu, M.; Wang, Q.; Wang, Y.; Chen, K.; Xia, G.S.; Bai, X. Gliding Vertex on the Horizontal Bounding Box for Multi-Oriented Object Detection. IEEE Trans. Pattern Anal. Mach. Intell. 2021, 43, 1452–1459. [Google Scholar] [CrossRef] [PubMed] [Green Version]
  16. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 2577031. [Google Scholar] [CrossRef] [PubMed] [Green Version]
  17. Sandler, M.; Howard, A.; Zhu, M.; Zhmoginov, A.; Chen, L.C. Mobilenetv2: Inverted residuals and linear bottlenecks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 4510–4520. [Google Scholar]
  18. Howard, A.; Sandler, M.; Chu, G.; Chen, L.C.; Chen, B.; Tan, M.; Wang, W.; Zhu, Y.; Pang, R.; Vasudevan, V.; et al. Searching for mobilenetv3. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October 2019–2 November 2019; pp. 1314–1324. [Google Scholar]
  19. Zhang, X.; Zhou, X.; Lin, M.; Sun, J. Shufflenet: An extremely efficient convolutional neural network for mobile devices. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18 June 2018–23 June 2018; pp. 6848–6856. [Google Scholar]
  20. Ma, N.; Zhang, X.; Zheng, H.T.; Sun, J. Shufflenet v2: Practical guidelines for efficient cnn architecture design. In Proceedings of the Computer Vision—ECCV 2018: 15th European Conference, Munich, Germany, 8–14 September 2018; pp. 116–131. [Google Scholar]
  21. Han, K.; Wang, Y.; Tian, Q.; Guo, J.; Xu, C.; Xu, C. GhostNet: More Features From Cheap Operations. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2020, Seattle, WA, USA, 13–19 June 2020; pp. 1577–1586. [Google Scholar] [CrossRef]
  22. Howard, A.G.; Zhu, M.; Chen, B.; Kalenichenko, D.; Wang, W.; Weyand, T.; Andreetto, M.; Adam, H. MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications. arXiv 2017, arXiv:1704.04861. [Google Scholar]
  23. Li, Z.; Peng, C.; Yu, G.; Zhang, X.; Deng, Y.; Sun, J. Light-Head R-CNN: In Defense of Two-Stage Object Detector. arXiv 2017, arXiv:1711.07264. [Google Scholar]
  24. Han, J.; Ding, J.; Li, J.; Xia, G. Align Deep Features for Oriented Object Detection. IEEE Trans. Geosci. Remote Sens. 2022, 60, 3062048. [Google Scholar] [CrossRef]
  25. Wang, J.; Chen, K.; Yang, S.; Loy, C.C.; Lin, D. Region proposal by guided anchoring. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 2960–2969. [Google Scholar] [CrossRef] [Green Version]
  26. Jiang, B.; Luo, R.; Mao, J.; Xiao, T.; Jiang, Y. Acquisition of Localization Confidence for Accurate Object Detection. In Proceedings of the ECCV 2018, Munich, Germany, 8–14 September 2018. [Google Scholar]
  27. Wu, Y.; Chen, Y.; Yuan, L.; Liu, Z.; Wang, L.; Li, H.; Fu, Y. Rethinking Classification and Localization for Object Detection. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2020, Seattle, WA, USA, 13–19 June 2020; pp. 10183–10192. [Google Scholar] [CrossRef]
  28. Dai, J.; Qi, H.; Xiong, Y.; Li, Y.; Zhang, G.; Hu, H.; Wei, Y. Deformable Convolutional Networks. In Proceedings of the 2017 IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017. [Google Scholar] [CrossRef] [Green Version]
  29. Chen, Y.; Han, C.; Wang, N.; Zhang, Z. Revisiting Feature Alignment for One-stage Object Detection. arXiv 2019, arXiv:1908.01570. [Google Scholar]
  30. Cheng, G.; Zhou, P.; Han, J. RIFD-CNN: Rotation-Invariant and Fisher Discriminative Convolutional Neural Networks for Object Detection. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR, Las Vegas, NV, USA, 26 June–1 July 2016; pp. 2884–2893. [Google Scholar] [CrossRef]
  31. Wang, P.; Sun, X.; Diao, W.; Fu, K. FMSSD: Feature-Merged Single-Shot Detection for Multiscale Objects in Large-Scale Remote Sensing Imagery. IEEE Trans. Geosci. Remote Sens. 2020, 58, 3377–3390. [Google Scholar] [CrossRef]
  32. Yang, X.; Yang, J.; Yan, J.; Zhang, Y.; Zhang, T.; Guo, Z.; Sun, X.; Fu, K. SCRDet: Towards more robust detection for small, cluttered and rotated objects. In Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–2 November 2019; pp. 8231–8240. [Google Scholar] [CrossRef] [Green Version]
  33. Yang, X.; Zhang, G.; Yang, X.; Zhou, Y.; Wang, W.; Tang, J.; He, T.; Yan, J. Detecting Rotated Objects as Gaussian Distributions and its 3-D Generalization. IEEE Trans. Pattern Anal. Mach. Intell. 2023, 45, 4335–4354. [Google Scholar] [CrossRef]
  34. Lin, Y.; Feng, P.; Guan, J. IENet: Interacting Embranchment One Stage Anchor Free Detector for Orientation Aerial Object Detection. arXiv 2019, arXiv:1912.00969. [Google Scholar]
  35. Chen, Z.; Chen, K.; Lin, W.; See, J.; Yu, H.; Ke, Y.; Yang, C. PIoU Loss: Towards Accurate Oriented Object Detection in Complex Environments. Comput. Vis. ECCV 2020, 12350, 195–211. [Google Scholar] [CrossRef]
  36. Huang, G.; Liu, Z.; Van Der Maaten, L.; Weinberger, K.Q. Densely connected convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2017), Honolulu, HI, USA, 21–26 July 2017; pp. 4700–4708. [Google Scholar]
  37. Xie, S.; Girshick, R.B.; Dollár, P.; Tu, Z.; He, K. Aggregated Residual Transformations for Deep Neural Networks. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 5987–5995. [Google Scholar] [CrossRef] [Green Version]
  38. Dang, J.; Yang, J. HIGCNN: Hierarchical Interleaved Group Convolutional Neural Networks for Point Clouds Analysis. In Proceedings of the ICASSP 2021—2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Toronto, ON, Canada, 6–11 June 2021; pp. 2825–2829. [Google Scholar] [CrossRef]
  39. Yu, F.; Koltun, V. Multi-Scale Context Aggregation by Dilated Convolutions. In Proceedings of the 4th International Conference on Learning Representations, ICLR, San Juan, Puerto Rico, 2–4 May 2016. [Google Scholar]
  40. Tan, M.; Le, Q. Efficientnet: Rethinking model scaling for convolutional neural networks. In Proceedings of the International Conference on Machine Learning, Long Beach, CA, USA, 9–15 June 2019; pp. 6105–6114. [Google Scholar]
  41. Ma, N.; Zhang, X.; Huang, J.; Sun, J. WeightNet: Revisiting the Design Space of Weight Networks. In Computer Vision—ECCV 2020, 16th European Conference, Glasgow, UK, 23–28 August 2020, Proceedings, Part XV; Springer: Berlin/Heidelberg, Germany, 2020; Volume 12360, pp. 776–792. [Google Scholar] [CrossRef]
  42. Li, Y.; Chen, Y.; Dai, X.; Chen, D.; Liu, M.; Yuan, L.; Liu, Z.; Zhang, L.; Vasconcelos, N. MicroNet: Improving Image Recognition with Extremely Low FLOPs. In Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision, ICCV, Montreal, QC, Canada, 10–17 October 2021; pp. 458–467. [Google Scholar] [CrossRef]
  43. Tang, Q.; Li, J.; Shi, Z.; Hu, Y. Lightdet: A Lightweight and Accurate Object Detection Network. In Proceedings of the 2020 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2020, Barcelona, Spain, 4–8 May 2020; pp. 2243–2247. [Google Scholar] [CrossRef]
  44. Qin, Z.; Li, Z.; Zhang, Z.; Bao, Y.; Yu, G.; Peng, Y.; Sun, J. ThunderNet: Towards Real-Time Generic Object Detection on Mobile Devices. In Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision, ICCV, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 6717–6726. [Google Scholar] [CrossRef]
  45. Ding, P.; Zhang, Y.; Deng, W.J.; Jia, P.; Kuijper, A. A light and faster regional convolutional neural network for object detection in optical remote sensing images. ISPRS J. Photogramm. Remote Sens. 2018, 141, 208–218. [Google Scholar] [CrossRef]
  46. Wang, N.; Li, B.; Wei, X.; Wang, Y.; Yan, H. Ship Detection in Spaceborne Infrared Image Based on Lightweight CNN and Multisource Feature Cascade Decision. IEEE Trans. Geosci. Remote Sens. 2021, 59, 4324–4339. [Google Scholar] [CrossRef]
  47. Zhang, F.; Wang, X.; Zhou, S.; Wang, Y.; Hou, Y. Arbitrary-oriented ship detection through center-head point extraction. IEEE Trans. Geosci. Remote Sens. 2021, 60, 5612414. [Google Scholar] [CrossRef]
  48. Yang, X.; Yan, J.; Feng, Z.; He, T. R3Det: Refined Single-Stage Detector with Feature Refinement for Rotating Object. arXiv 2019, arXiv:1908.05612. [Google Scholar] [CrossRef]
  49. Ma, J.; Shao, W.; Ye, H.; Wang, L.; Wang, H.; Zheng, Y.; Xue, X. Arbitrary-oriented scene text detection via rotation proposals. IEEE Trans. Multimed. 2018, 20, 3111–3122. [Google Scholar] [CrossRef] [Green Version]
  50. Liu, Z.; Yuan, L.; Weng, L.; Yang, Y. A high resolution optical satellite image dataset for ship recognition and some new baselines. In Proceedings of the 6th International Conference on Pattern Recognition Applications and Methods ICPRAM, Porto, Portugal, 24–26 February 2017; Volume 1, pp. 324–331. [Google Scholar] [CrossRef]
  51. Zhu, H.; Chen, X.; Dai, W.; Fu, K.; Ye, Q.; Jiao, J. Orientation robust object detection in aerial images using deep convolutional neural network. In Proceedings of the 2015 IEEE International Conference on Image Processing, ICIP, Quebec City, QC, Canada, 27–30 September 2015; pp. 3735–3739. [Google Scholar] [CrossRef]
  52. Xia, G.; Bai, X.; Ding, J.; Zhu, Z.; Belongie, S.J.; Luo, J.; Datcu, M.; Pelillo, M.; Zhang, L. DOTA: A Large-Scale Dataset for Object Detection in Aerial Images. In Proceedings of the 2018 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–23 June 2018; pp. 3974–3983. [Google Scholar] [CrossRef] [Green Version]
  53. Ding, J.; Xue, N.; Long, Y.; Xia, G.S.; Lu, Q. Learning roi transformer for oriented object detection in aerial images. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019. [Google Scholar] [CrossRef]
  54. Wang, J.; Ding, J.; Guo, H.; Cheng, W.; Pan, T.; Yang, W. Mask OBB: A Semantic Attention-Based Mask Oriented Bounding Box Representation for Multi-Category Object Detection in Aerial Images. Remote Sens. 2019, 11, 2930. [Google Scholar] [CrossRef] [Green Version]
  55. Qian, W.; Yang, X.; Peng, S.; Guo, Y.; Yan, J. Learning Modulated Loss for Rotated Object Detection. arXiv 2019, arXiv:1911.08299. [Google Scholar] [CrossRef]
  56. Lin, T.Y.; Goyal, P.; Girshick, R.; He, K.; Dollar, P. Focal Loss for Dense Object Detection. IEEE Trans. Pattern Anal. Mach. Intell. 2020, 42, 318–327. [Google Scholar] [CrossRef] [Green Version]
  57. Yi, J.; Wu, P.; Liu, B.; Huang, Q.; Qu, H.; Metaxas, D. Oriented object detection in aerial images with box boundary-aware vectors. In Proceedings of the 2021 IEEE Winter Conference on Applications of Computer Vision (WACV), Waikoloa, HI, USA, 3–8 January 2021. [Google Scholar] [CrossRef]
  58. Yang, X.; Yan, J. On the Arbitrary-Oriented Object Detection: Classification based Approaches Revisited. arXiv 2022, arXiv:2003.05597. [Google Scholar] [CrossRef]
  59. Huang, Z.; Li, W.; Xia, X.G.; Wu, X.; Cai, Z.; Tao, R. A Novel Nonlocal-Aware Pyramid and Multiscale Multitask Refinement Detector for Object Detection in Remote Sensing Images. IEEE Trans. Geosci. Remote Sens. 2022, 60, 5601920. [Google Scholar] [CrossRef]
  60. Huang, Z.; Li, W.; Xia, X.G.; Tao, R. A General Gaussian Heatmap Label Assignment for Arbitrary-Oriented Object Detection. IEEE Trans. Image Process. 2022, 31, 1895–1910. [Google Scholar] [CrossRef] [PubMed]
  61. Ming, Q.; Miao, L.; Zhou, Z.; Yang, X.; Dong, Y. Optimization for Arbitrary-Oriented Object Detection via Representation Invariance Loss. IEEE Geosci. Remote Sens. Lett. 2022, 19, 8021505. [Google Scholar] [CrossRef]
  62. Ming, Q.; Miao, L.; Zhou, Z.; Dong, Y. CFC-Net: A Critical Feature Capturing Network for Arbitrary-Oriented Object Detection in Remote Sensing Images. arXiv 2021, arXiv:2101.06849. [Google Scholar] [CrossRef]
  63. Zhang, Z.; Guo, W.; Zhu, S.; Yu, W. Toward Arbitrary-Oriented Ship Detection With Rotated Region Proposal and Discrimination Networks. IEEE Geosci. Remote Sens. Lett. 2018, 15, 1745–1749. [Google Scholar] [CrossRef]
  64. Liao, M.; Zhu, Z.; Shi, B.; Xia, G.S.; Bai, X. Rotation-Sensitive Regression for Oriented Scene Text Detection. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018. [Google Scholar] [CrossRef] [Green Version]
  65. Shu, Z.; Hu, X.; Sun, J. Center-Point-Guided Proposal Generation for Detection of Small and Dense Buildings in Aerial Imagery. IEEE Geosci. Remote Sens. Lett. 2018, 15, 1100–1104. [Google Scholar] [CrossRef]
  66. Yang, Y.; Tang, X.; Cheung, Y.M.; Zhang, X.; Liu, F.; Ma, J.; Jiao, L. AR2Det: An Accurate and Real-Time Rotational One-Stage Ship Detector in Remote Sensing Images. IEEE Trans. Geosci. Remote Sens. 2022, 60, 5605414. [Google Scholar] [CrossRef]
  67. Ren, Z.; Tang, Y.; He, Z.; Tian, L.; Yang, Y.; Zhang, W. Ship Detection in High-Resolution Optical Remote Sensing Images Aided by Saliency Information. IEEE Trans. Geosci. Remote Sens. 2022, 60, 5623616. [Google Scholar] [CrossRef]
Figure 1. A general framework of a one-stage detector. The convolution operation in the backbone slides along a fixed axis, and the classifier and regressor share features from the backbone.
Figure 2. Architecture of the proposed FDLO-Det.
Figure 3. A symmetry group p4.
Figure 4. Rotational separable convolution.
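Figures 3 and 4 can be read together: if a single base kernel is shared across the four 90° rotations of the p4 group, the filter bank encodes orientation without storing four independent sets of weights, which is the source of the parameter compression in RSC. The snippet below is a minimal PyTorch sketch of this weight-sharing idea only; it is not the exact rotational separable convolution of FDLO-Det, and the layer name P4SharedConv and its output layout are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class P4SharedConv(nn.Module):
    """Illustrative p4-equivariant convolution: one base kernel is shared
    across the four 90-degree rotations, so orientation is encoded without
    quadrupling the parameter count."""
    def __init__(self, in_ch, out_ch, k=3):
        super().__init__()
        # A single learnable base kernel of shape (out_ch, in_ch, k, k).
        self.base = nn.Parameter(torch.randn(out_ch, in_ch, k, k) * 0.01)

    def forward(self, x):
        # Rotate the shared kernel by 0/90/180/270 degrees and stack the
        # responses along a new orientation axis.
        outs = []
        for r in range(4):
            w = torch.rot90(self.base, r, dims=(2, 3))
            outs.append(F.conv2d(x, w, padding=self.base.shape[-1] // 2))
        # Rotating the input permutes (and rotates) these orientation
        # responses instead of producing unrelated activations, which is
        # the equivariance property a downstream head can exploit.
        return torch.stack(outs, dim=2)  # (B, out_ch, 4, H, W)

x = torch.randn(1, 16, 64, 64)
y = P4SharedConv(16, 32)(x)
print(y.shape)  # torch.Size([1, 32, 4, 64, 64])
```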
Figure 5. The architecture of OPTM.
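As a companion to Figure 5, the sketch below shows one way an orthogonal-decomposition-and-gating block can be wired: the feature map is pooled along the horizontal and vertical axes, the two profiles are fused, and separate gates are produced for the classification and regression branches. The module name OrthogonalGate, the plain sigmoid gates, and the reduction ratio r are illustrative assumptions rather than the published OPTM/SOAM definition; the task-specific polarization functions of Figure 6 would take the place of the sigmoids.

```python
import torch
import torch.nn as nn

class OrthogonalGate(nn.Module):
    """Illustrative OPTM-style block: pool the feature map along the two
    orthogonal image axes, fuse the profiles, and emit separate gates for
    the classification and regression branches."""
    def __init__(self, ch, r=16):
        super().__init__()
        self.squeeze = nn.Sequential(nn.Conv2d(ch, ch // r, 1), nn.ReLU(inplace=True))
        self.gate_cls = nn.Conv2d(ch // r, ch, 1)
        self.gate_reg = nn.Conv2d(ch // r, ch, 1)

    def forward(self, x):
        # Orthogonal decomposition: average along W (horizontal profile)
        # and along H (vertical profile), then broadcast-add them back to
        # a full-resolution context map.
        h_profile = x.mean(dim=3, keepdim=True)    # (B, C, H, 1)
        v_profile = x.mean(dim=2, keepdim=True)    # (B, C, 1, W)
        ctx = self.squeeze(h_profile + v_profile)  # broadcasts to (B, C/r, H, W)

        # Stand-in "polarization": plain sigmoid gates here; the paper's
        # task-specific polarization functions (Figure 6) are more elaborate.
        f_cls = x * torch.sigmoid(self.gate_cls(ctx))
        f_reg = x * torch.sigmoid(self.gate_reg(ctx))
        return f_cls, f_reg

feat = torch.randn(2, 256, 32, 32)
cls_feat, reg_feat = OrthogonalGate(256)(feat)
print(cls_feat.shape, reg_feat.shape)  # torch.Size([2, 256, 32, 32]) twice
```

In a full detector, f_cls would feed the classification head and f_reg the regression head, so the two tasks no longer have to share an identical feature map.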
Figure 6. The curve of polarization function for classification and regression tasks.
Figure 7. Visualization of results on DOTA dataset with FDLO-Det. Small vehicles and boats parked closely side by side are accurately detected.
Figure 8. Comparison between FDLO-Det and S2ANet (typical CNN) on UCAS-AOD.
Figure 9. Performance on UCAS-AOD.
Figure 10. Visual detection results of FDLO-Det on HRSC2016.
Figure 11. Comparison between FDLO-Det and RetinaNet on HRSC2016.
Figure 12. Visualization results of intermediate features.
Table 1. Effects of each component on HRSC2016.
With RSC? | With OPTM? | Parameters (MB) | FLOPs (G) | mAP
✗ | ✗ | 140.5 | 121.6 | 83.3%
✓ | ✗ | 31.6 | 59.9 | 87.4%
✗ | ✓ | 140.7 | 121.9 | 88.5%
✓ | ✓ | 31.7 | 60.1 | 90.4%
Table 2. Effects of each component on UCAS-AOD.
With RSC? | With OPTM? | Parameters (MB) | FLOPs (G) | mAP
✗ | ✗ | 140.6 | 121.8 | 86.7%
✓ | ✗ | 31.7 | 60.2 | 88.6%
✗ | ✓ | 140.7 | 122.1 | 89.2%
✓ | ✓ | 31.8 | 60.3 | 91.3%
Table 3. Effects of the groups on HRSC2016.
Backbone Group | Parameters (MB) | FLOPs (G) | mAP
– | 140.7 | 121.9 | 88.5%
G4 | 31.7 | 60.1 | 90.4%
G8 | 22.4 | 30.5 | 89.2%
G16 | 10.3 | 11.7 | 85.3%
Table 4. Comparisons of different detectors with RSC on HRSC2016.
Method | Parameters (MB) | FLOPs (G) | mAP
SSD | 246.5 | 121.6 | 83.8%
SSD+G4 | 63.4 | 49.7 | 85.6%
Faster RCNN | 361.1 | 100.5 | 89.2%
Faster RCNN+G4 | 96.1 | 63.2 | 89.7%
Table 5. Effects of each component of OPTM on HRSC2016.
With SOAM? | With Polarization? | mAP
✗ | ✗ | 88.5%
✓ | ✗ | 88.6%
✗ | ✓ | 89.2%
✓ | ✓ | 90.4%
Table 6. Comparisons with different compression ratios of SOAM on HRSC2016.
Compression Ratio | r = 64 | r = 32 | r = 16 | r = 8
mAP | 85.8% | 88.6% | 90.4% | 89.2%
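For context on the r sweep in Table 6: if SOAM's channel bottleneck scales its hidden width as C/r (an assumption that mirrors the OrthogonalGate sketch above, not a statement of the actual SOAM layout), its capacity shrinks roughly in proportion to 1/r, which offers one plausible reading of the accuracy drop at r = 64.

```python
def bottleneck_params(c: int, r: int) -> int:
    """Weight count of a hypothetical C -> C/r -> C gate built from two
    1x1 convolutions (biases ignored); NOT the actual SOAM layer."""
    hidden = max(c // r, 1)
    return c * hidden + hidden * c

for r in (8, 16, 32, 64):
    print(f"r={r:2d}: {bottleneck_params(256, r)} weights")
# r= 8: 16384, r=16: 8192, r=32: 4096, r=64: 2048
```

Under this reading, r = 16 keeps the gate expressive while its weight cost stays negligible next to the backbone, which is consistent with it giving the best mAP in Table 6.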
Table 7. Performance evaluation on the DOTA dataset.
Methods | PL | BD | BR | GTF | SV | LV | SH | TC | BC | ST | SBF | RA | HA | SP | HC | mAP
Two-Stage:
FR-O [16] | 79.42 | 77.13 | 17.70 | 64.05 | 35.30 | 38.02 | 37.16 | 89.41 | 69.64 | 59.28 | 50.30 | 52.91 | 47.89 | 47.40 | 46.30 | 54.13
RRPN [49] | 88.52 | 71.20 | 31.66 | 59.30 | 51.85 | 56.19 | 57.25 | 90.81 | 72.84 | 67.38 | 56.69 | 52.84 | 53.08 | 51.94 | 53.58 | 61.01
RoI-Trans [53] | 88.64 | 78.52 | 43.44 | 75.92 | 68.81 | 73.68 | 83.59 | 90.74 | 77.27 | 81.46 | 58.39 | 53.54 | 62.83 | 58.93 | 47.67 | 69.56
R2CNN [12] | 80.94 | 65.67 | 35.34 | 67.44 | 59.92 | 50.91 | 55.81 | 90.67 | 66.92 | 72.39 | 55.06 | 52.23 | 55.14 | 53.35 | 48.22 | 60.67
Gliding Vertex [15] | 89.64 | 85.00 | 52.26 | 77.34 | 73.01 | 73.14 | 86.82 | 90.74 | 79.02 | 86.81 | 59.55 | 70.91 | 72.94 | 70.86 | 57.32 | 75.02
A2RMNet [14] | 89.84 | 83.39 | 60.06 | 73.46 | 79.25 | 83.07 | 87.88 | 90.90 | 87.02 | 87.35 | 60.74 | 69.05 | 79.88 | 79.74 | 65.17 | 78.45
MASK-OBB [54] | 89.69 | 87.07 | 58.51 | 72.04 | 78.21 | 71.47 | 85.20 | 89.55 | 84.71 | 86.76 | 54.38 | 70.21 | 78.98 | 77.46 | 70.40 | 76.98
SCRDet [8] | 89.98 | 80.65 | 52.09 | 68.36 | 68.36 | 60.32 | 72.41 | 90.85 | 87.94 | 86.86 | 65.02 | 66.68 | 66.25 | 68.24 | 65.21 | 72.61
SCRDet++ [13] | 90.01 | 82.32 | 61.94 | 68.62 | 69.62 | 81.17 | 78.83 | 90.86 | 86.32 | 85.10 | 65.10 | 61.12 | 77.69 | 80.68 | 64.25 | 76.24
RSDet [55] | 89.80 | 82.90 | 48.60 | 65.20 | 69.50 | 70.10 | 70.20 | 90.50 | 85.60 | 83.40 | 62.50 | 63.90 | 65.60 | 67.20 | 68.00 | 72.20
R3Det [48] | 89.54 | 81.99 | 48.46 | 62.52 | 70.48 | 74.29 | 77.54 | 90.80 | 81.39 | 83.54 | 61.97 | 59.82 | 65.44 | 67.46 | 60.05 | 71.69
R-RetinaNet [56] | 88.82 | 81.74 | 44.44 | 65.72 | 67.11 | 55.82 | 72.77 | 90.55 | 82.83 | 76.30 | 54.19 | 63.64 | 63.71 | 69.73 | 53.37 | 68.72
BBAVectors [57] | 88.63 | 84.06 | 52.13 | 69.56 | 78.26 | 80.40 | 88.06 | 90.87 | 87.23 | 86.39 | 56.11 | 65.62 | 67.10 | 72.08 | 63.96 | 75.36
CSL [58] | 90.25 | 85.53 | 54.64 | 75.31 | 70.44 | 73.51 | 77.62 | 90.84 | 86.15 | 86.69 | 69.60 | 68.04 | 73.83 | 71.10 | 68.93 | 76.17
NPMMR-Det [59] | 89.44 | 83.18 | 54.50 | 66.10 | 76.93 | 84.08 | 88.25 | 90.87 | 88.29 | 86.32 | 49.95 | 68.16 | 79.61 | 79.51 | 57.26 | 76.16
GGHL [60] | 89.74 | 85.63 | 44.50 | 77.48 | 76.72 | 80.45 | 86.16 | 90.83 | 88.18 | 86.25 | 67.07 | 69.40 | 73.38 | 68.45 | 70.14 | 76.95
RIDet-O [61] | 88.94 | 78.45 | 46.87 | 72.63 | 77.63 | 80.68 | 88.18 | 90.55 | 81.33 | 83.61 | 64.85 | 63.72 | 73.09 | 73.13 | 56.87 | 74.70
S2A-Net [24] | 89.28 | 84.11 | 56.95 | 79.21 | 80.18 | 82.93 | 89.21 | 90.86 | 84.66 | 87.61 | 71.66 | 68.23 | 78.58 | 78.20 | 65.55 | 79.15
FDLO-Det (ours) | 90.63 | 80.03 | 66.60 | 85.68 | 69.60 | 87.11 | 89.83 | 90.86 | 88.02 | 71.84 | 79.85 | 76.04 | 78.51 | 60.44 | 83.72 | 79.92
The explanation of each category: PL: plane; BD: baseball diamond; BR: bridge; GTF: ground track field; SV: small vehicle; LV: large vehicle; SH: ship; TC: tennis court; BC: basketball court; ST: storage tank; SBF: soccer-ball field; RA: roundabout; HA: harbor; SP: swimming pool; HC: helicopter.
Table 8. Detection results on UCAS-AOD dataset.
Methods | Car | Airplane | mAP
RoI-Trans [53] | 88.02 | 90.02 | 89.02
S2ANet [24] | 89.56 | 90.42 | 89.99
RIDet-O [62] | 88.88 | 90.35 | 89.62
R-RetinaNet [56] | 84.65 | 85.46 | 78.19
R2PN [63] | 76.74 | 88.66 | 78.63
YOLOV3 | 74.63 | 89.52 | 82.08
FDLO-Det | 87.31 | 93.24 | 91.31
Table 9. Performance evaluation on HRSC2016 dataset.
Methods | Backbone | Size | mAP
Two-Stage:
Gliding Vertex [15] | ResNet101 | 512 × 800 | 88.2
RRPN [49] | ResNet101 | 800 × 800 | 79.1
R2CNN [12] | ResNet101 | 800 × 800 | 73.1
RoI-Trans [53] | ResNet101 | 512 × 800 | 86.2
R2PN [63] | VGG16 | – | 79.6
One-Stage:
RRD [64] | VGG16 | 384 × 384 | 84.3
R3Det [48] | ResNet101 | 800 × 800 | 89.3
OPLD [65] | ResNet101 | 800 × 800 | 88.4
BBAVectors [57] | ResNet101 | 800 × 800 | 89.7
R-RetinaNet [56] | ResNet101 | 800 × 800 | 89.2
AR2Det [66] | ResNet101 | 512 × 512 | 89.6
SDet [67] | ResNet101 | 800 × 800 | 89.2
FDLO-Det | ResNet101 | 800 × 800 | 90.4
Table 10. Comparisons with data augmentation on UCAS-AOD dataset.
Methods | Backbone | Size | mAP
RetinaNet | ResNet101 | 800 × 800 | 83.3
RetinaNet+aug | ResNet101 | 800 × 800 | 88.1
RetinaNet+RSC and OPTM | ResNet101 | 800 × 800 | 90.4
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
