FEFN: Feature Enhancement Feedforward Network for Lightweight Object Detection in Remote Sensing Images
Figure 1. Architecture of the feature enhancement feedforward network (FEFN).
Figure 2. Architecture of the lightweight channel feedforward module (LCFM).
Figure 3. Architecture of the depthwise convolution.
Figure 4. Architecture of the channel feedforward module.
Figure 5. Architecture of channel scaling.
Figure 6. Architecture of the feature enhancement module (FEM).
Figure 7. Twenty categories of object images in the DIOR dataset. (a) airplane; (b) airport; (c) baseball field; (d) basketball field; (e) bridge; (f) chimney; (g) dam; (h) highway service area; (i) highway toll station; (j) golf course; (k) track and field; (l) port; (m) overpass; (n) ship; (o) stadium; (p) oil tank; (q) tennis court; (r) fire station; (s) vehicle; (t) windmill.
Figure 8. Visualization results of the 20 categories on the DIOR dataset. (a) airplane; (b) airport; (c) baseball field; (d) basketball field; (e) bridge; (f) chimney; (g) dam; (h) highway service area; (i) highway toll station; (j) golf course; (k) track and field; (l) port; (m) overpass; (n) ship; (o) stadium; (p) oil tank; (q) tennis court; (r) fire station; (s) vehicle; (t) windmill.
Figure 9. Comparison of detection results of different networks on the DIOR dataset. False alarms and missed detections produced by the comparison algorithms are marked with red circles.
Figure 10. Visualization of detection results for different scenarios on the HRSC2016 dataset. (a) Maritime ship; (b) offshore ships; (c) ships of different sizes; (d) ships with complex backgrounds.
Figure 11. Comparison of detection results of different networks on the HRSC2016 dataset. False alarms and missed detections produced by the comparison algorithms are marked with orange circles.
Figure 12. Comparison of detection results of different networks on the HRSC2016 dataset with an SNR of 8.05 dB. False alarms and missed detections produced by the comparison algorithms are marked with orange circles.
Figure 13. Comparison of detection results of different networks on the HRSC2016 dataset with an SNR of 1.99 dB. False alarms and missed detections produced by the comparison algorithms are marked with orange circles.
Abstract
1. Introduction
- (1) A lightweight channel feedforward module (LCFM) is designed to capture shallow spatial information in the images and enhance feature interactions. This module improves the model's ability to recognize densely packed objects against the complex backgrounds of remote sensing images, thereby improving overall performance.
- (2) To help the model learn deeper representations and avoid missing densely arranged small objects, a feature enhancement module (FEM) is proposed. The FEM strengthens feature extraction through residual connections between different convolutional and normalization layers.
- (3) We conduct ablation and comparative experiments on two publicly available remote sensing image datasets, and the results demonstrate the effectiveness of the proposed method.
2. Related Works
2.1. Channel Feedforward Network
2.2. Feature Enhancement Modules
3. Methods
3.1. Lightweight Channel Feedforward Module
Algorithm 1. Pseudocode for the forward propagation of the LCFM.
Input: input feature X. Output: output feature Y.
1. out1 = GroupNorm(X)
2. out2 = DepthwiseConv(out1)
3. out3 = DepthConv(out2)
4. out4 = PointConv(out3)
5. out5 = DropPath(out4)
6. out6 = Concat(X, out5)
7. out7 = GroupNorm(out6)
8. out8 = FullyConnected(out7)
9. out9 = GELU(out8)
10. out10 = FullyConnected(out9)
11. out11 = Dropout(out10)
12. Y = Concat(out6, out11)
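Algorithm 1 can be sketched in NumPy as below. This is an illustrative reconstruction, not the authors' code: the 3x3 depthwise kernels, the two-group GroupNorm, and the interpretation of the "Concat" steps as residual additions (common in MetaFormer-style blocks) are assumptions, and DropPath/Dropout are treated as identity, i.e., inference mode.

```python
import numpy as np

def group_norm(x, groups=2, eps=1e-5):
    # x: (C, H, W); normalize within channel groups (no learnable affine here)
    C, H, W = x.shape
    g = x.reshape(groups, -1)
    g = (g - g.mean(axis=1, keepdims=True)) / np.sqrt(g.var(axis=1, keepdims=True) + eps)
    return g.reshape(C, H, W)

def depthwise_conv(x, k):
    # x: (C, H, W), k: (C, kh, kw); each channel convolved with its own kernel,
    # 'same' zero padding
    C, H, W = x.shape
    kh, kw = k.shape[1:]
    xp = np.pad(x, ((0, 0), (kh // 2, kh // 2), (kw // 2, kw // 2)))
    out = np.zeros_like(x)
    for c in range(C):
        for i in range(H):
            for j in range(W):
                out[c, i, j] = (xp[c, i:i + kh, j:j + kw] * k[c]).sum()
    return out

def pointwise_conv(x, w):
    # 1x1 convolution = per-pixel channel mixing; w: (C_out, C_in)
    return np.einsum('oc,chw->ohw', w, x)

def gelu(x):
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x ** 3)))

def lcfm_forward(x, params):
    # Spatial branch: GroupNorm -> two depthwise convs -> pointwise conv -> residual
    out = group_norm(x)
    out = depthwise_conv(out, params['dw1'])
    out = depthwise_conv(out, params['dw2'])
    out = pointwise_conv(out, params['pw'])
    mid = x + out  # step 6 (DropPath omitted at inference)
    # Channel branch: GroupNorm -> FC -> GELU -> FC -> residual
    out = group_norm(mid)
    C, H, W = out.shape
    t = out.reshape(C, -1).T            # tokens: (H*W, C)
    t = gelu(t @ params['fc1']) @ params['fc2']
    return mid + t.T.reshape(C, H, W)   # step 12 (Dropout omitted at inference)
```

Both branches preserve the feature map shape, so the module can be dropped into a backbone between stages without changing downstream tensor sizes.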
3.2. Feature Enhancement Module
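The contributions describe the FEM as strengthening feature extraction through residual connections between different convolutional and normalization layers. The sketch below is a hypothetical minimal arrangement of that idea, not the paper's exact structure: the use of 1x1 convolutions, two stages, per-channel normalization, and the ReLU activation are all assumptions for illustration.

```python
import numpy as np

def conv1x1(x, w):
    # 1x1 convolution as channel mixing; w: (C_out, C_in), x: (C, H, W)
    return np.einsum('oc,chw->ohw', w, x)

def channel_norm(x, eps=1e-5):
    # per-channel normalization, standing in for the paper's normalization layers
    m = x.mean(axis=(1, 2), keepdims=True)
    v = x.var(axis=(1, 2), keepdims=True)
    return (x - m) / np.sqrt(v + eps)

def fem_forward(x, w1, w2):
    # Stage 1: conv -> norm, with a residual connection from the input
    h = channel_norm(conv1x1(x, w1)) + x
    # Stage 2: conv -> norm, with residual connections from both the input and stage 1
    y = channel_norm(conv1x1(h, w2)) + h + x
    return np.maximum(y, 0.0)  # ReLU
```

The multiple skip paths keep gradients flowing to early layers, which is the usual rationale for residual connections when deepening a feature extractor.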
4. Experiments
4.1. Experimental Conditions
4.1.1. Datasets
4.1.2. Experimental Setup and Evaluation Metrics
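Detection accuracy in the experiments is reported as mAP. The paper does not restate the computation, so the sketch below shows the standard ingredients under the assumption of VOC-style all-point interpolation: an IoU test for matching detections to ground truth, and per-class average precision (mAP is the mean of AP over the 20 DIOR classes). Function names are illustrative.

```python
import numpy as np

def iou(a, b):
    # axis-aligned IoU between boxes given as (x1, y1, x2, y2)
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def average_precision(scores, matches, n_gt):
    # scores: detection confidences; matches[i] = 1 if detection i hit a
    # previously unmatched ground truth at IoU >= threshold, else 0
    order = np.argsort(scores)[::-1]
    hits = np.array(matches, dtype=float)[order]
    tp = np.cumsum(hits)
    fp = np.cumsum(1 - hits)
    recall = tp / n_gt
    precision = tp / (tp + fp)
    # all-point interpolation: integrate the precision envelope over recall
    envelope = np.maximum.accumulate(precision[::-1])[::-1]
    ap, prev_r = 0.0, 0.0
    for r, p in zip(recall, envelope):
        ap += (r - prev_r) * p
        prev_r = r
    return ap
```

For example, three detections with confidences (0.9, 0.8, 0.7), hit pattern (1, 0, 1), and two ground-truth boxes give AP = 5/6.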
4.1.3. Experimental Settings
4.2. Ablation Experiment Evaluation
4.3. Comparative Experiment Evaluation
4.4. Comparative Experiments on Images with Different Image Quality
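This section evaluates robustness on HRSC2016 variants with SNRs of 8.05 dB and 1.99 dB. The paper does not state how the degraded images were produced; a common approach, shown below purely as an assumption, is additive white Gaussian noise scaled so that the image reaches a target SNR.

```python
import numpy as np

def add_noise_at_snr(img, snr_db, rng=None):
    # Add zero-mean Gaussian noise scaled so that
    # 10 * log10(signal_power / noise_power) equals snr_db.
    rng = rng if rng is not None else np.random.default_rng(0)
    x = img.astype(np.float64)
    signal_power = np.mean(x ** 2)
    noise_power = signal_power / (10 ** (snr_db / 10))
    noise = rng.standard_normal(x.shape) * np.sqrt(noise_power)
    return x + noise
```

Lower SNR means stronger noise: at 1.99 dB the noise power is roughly 63% of the signal power, versus about 16% at 8.05 dB, which is consistent with the much larger mAP drops reported at the lower SNR.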
5. Conclusions
Author Contributions
Funding
Data Availability Statement
Acknowledgments
Conflicts of Interest
References
- Guan, X.; Dong, Y.; Tan, W.; Su, Y.; Huang, P.J.R.S. A Parameter-Free Pixel Correlation-Based Attention Module for Remote Sensing Object Detection. Remote Sens. 2024, 16, 312. [Google Scholar] [CrossRef]
- Zhang, J.; Chen, Z.; Yan, G.; Wang, Y.; Hu, B. Faster and Lightweight: An Improved YOLOv5 Object Detector for Remote Sensing Images. Remote Sens. 2023, 15, 4974. [Google Scholar] [CrossRef]
- Roy, P.; Behera, M.; Srivastav, S. Satellite remote sensing: Sensors, applications and techniques. Proc. Natl. Acad. Sci. India Sect. A Phys. Sci. 2017, 87, 465–472. [Google Scholar] [CrossRef]
- Liu, X.; He, J.; Yao, Y.; Zhang, J.; Liang, H.; Wang, H.; Hong, Y. Classifying urban land use by integrating remote sensing and social media data. Int. J. Geogr. Inf. Sci. 2017, 31, 1675–1696. [Google Scholar] [CrossRef]
- Weifeng, F.; Weibao, Z. Review of remote sensing image classification based on deep learning. Appl. Res. Comput. 2018, 35, 3521–3525. [Google Scholar]
- Kiang, C.-W.; Kiang, J.-F. Imaging on underwater moving targets with multistatic synthetic aperture sonar. IEEE Trans. Geosci. Remote Sens. 2022, 60, 1–18. [Google Scholar] [CrossRef]
- Gerg, I.D.; Monga, V. Deep multi-look sequence processing for synthetic aperture sonar image segmentation. IEEE Trans. Geosci. Remote Sens. 2023, 61, 1–15. [Google Scholar] [CrossRef]
- Sledge, I.J.; Emigh, M.S.; King, J.L.; Woods, D.L.; Cobb, J.T.; Principe, J.C. Target detection and segmentation in circular-scan synthetic aperture sonar images using semisupervised convolutional encoder–decoders. IEEE J. Ocean. Eng. 2022, 47, 1099–1128. [Google Scholar] [CrossRef]
- Zhang, X.; Wang, G.; Zhu, P.; Zhang, T.; Li, C.; Jiao, L. GRS-Det: An anchor-free rotation ship detector based on Gaussian-mask in remote sensing images. IEEE Trans. Geosci. Remote Sens. 2020, 59, 3518–3531. [Google Scholar] [CrossRef]
- Li, Q.; Mou, L.; Xu, Q.; Zhang, Y.; Zhu, X.X. R3-net: A deep network for multi-oriented vehicle detection in aerial images and videos. arXiv 2018, arXiv:1808.05560. [Google Scholar]
- Yang, X.; Yang, J.; Yan, J.; Zhang, Y.; Zhang, T.; Guo, Z.; Sun, X.; Fu, K. Scrdet: Towards more robust detection for small, cluttered and rotated objects. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 8232–8241. [Google Scholar]
- Shi, G.; Zhang, J.; Liu, J.; Zhang, C.; Zhou, C.; Yang, S. Global context-augmented objection detection in VHR optical remote sensing images. IEEE Trans. Geosci. Remote Sens. 2020, 59, 10604–10617. [Google Scholar] [CrossRef]
- Yuan, Y.; Xiong, Z.; Wang, Q. VSSA-NET: Vertical spatial sequence attention network for traffic sign detection. IEEE Trans. Image Process. 2019, 28, 3423–3434. [Google Scholar] [CrossRef]
- Li, Z.; Chen, G.; Li, G.; Zhou, L.; Pan, X.; Zhao, W.; Zhang, W. DBANet: Dual-branch Attention Network for hyperspectral remote sensing image classification. Comput. Electr. Eng. 2024, 118, 109269. [Google Scholar] [CrossRef]
- Chang, S.; Deng, Y.; Zhang, Y.; Zhao, Q.; Wang, R.; Zhang, K. An advanced scheme for range ambiguity suppression of spaceborne SAR based on blind source separation. IEEE Trans. Geosci. Remote Sens. 2022, 60, 1–12. [Google Scholar] [CrossRef]
- Qiu, H.; Li, H.; Wu, Q.; Meng, F.; Ngan, K.N.; Shi, H. A2RMNet: Adaptively aspect ratio multi-scale network for object detection in remote sensing images. Remote Sens. 2019, 11, 1594. [Google Scholar] [CrossRef]
- Cheng, G.; Si, Y.; Hong, H.; Yao, X.; Guo, L. Cross-scale feature fusion for object detection in optical remote sensing images. IEEE Geosci. Remote Sens. Lett. 2020, 18, 431–435. [Google Scholar] [CrossRef]
- Huang, Z.; Li, W.; Xia, X.G.; Tao, R. A General Gaussian Heatmap Label Assignment for Arbitrary-Oriented Object Detection. IEEE Trans. Image Process. 2022, 31, 1895–1910. [Google Scholar] [CrossRef]
- Zhang, Z.; Lu, X.; Cao, G.; Yang, Y.; Jiao, L.; Liu, F. ViT-YOLO: Transformer-based YOLO for object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, BC, Canada, 11–17 October 2021; pp. 2799–2808. [Google Scholar]
- Li, Q.; Chen, Y.; Zeng, Y. Transformer with transfer CNN for remote-sensing-image object detection. Remote Sens. 2022, 14, 984. [Google Scholar] [CrossRef]
- Zhu, X.; Su, W.; Lu, L.; Li, B.; Wang, X.; Dai, J. Deformable detr: Deformable transformers for end-to-end object detection. arXiv 2020, arXiv:2010.04159. [Google Scholar]
- Song, H.; Sun, D.; Chun, S.; Jampani, V.; Han, D.; Heo, B.; Kim, W.; Yang, M. Vidt: An efficient and effective fully transformer-based object detector. arXiv 2021, arXiv:2110.03921. [Google Scholar]
- Chen, Y.; Kalantidis, Y.; Li, J.; Yan, S.; Feng, J. A2-nets: Double attention networks. Adv. Neural Inf. Process. Syst. 2018, 31. [Google Scholar]
- Zheng, M.; Gao, P.; Zhang, R.; Li, K.; Wang, X.; Li, H.; Dong, H. End-to-end object detection with adaptive clustering transformer. arXiv 2020, arXiv:2011.09315. [Google Scholar]
- Ramachandran, P.; Parmar, N.; Vaswani, A.; Bello, I.; Levskaya, A.; Shlens, J. Stand-alone self-attention in vision models. Adv. Neural Inf. Process. Syst. 2019, 32. [Google Scholar]
- Vaswani, A.; Ramachandran, P.; Srinivas, A.; Parmar, N.; Hechtman, B.; Shlens, J. Scaling local self-attention for parameter efficient visual backbones. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 12894–12904. [Google Scholar]
- Tolstikhin, I.O.; Houlsby, N.; Kolesnikov, A.; Beyer, L.; Zhai, X.; Unterthiner, T.; Yung, J.; Steiner, A.; Keysers, D.; Uszkoreit, J. Mlp-mixer: An all-mlp architecture for vision. Adv. Neural Inf. Process. Syst. 2021, 34, 24261–24272. [Google Scholar]
- Hou, Q.; Jiang, Z.; Yuan, L.; Cheng, M.-M.; Yan, S.; Feng, J. Vision permutator: A permutable mlp-like architecture for visual recognition. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 45, 1328–1334. [Google Scholar] [CrossRef]
- Liu, H.; Dai, Z.; So, D.; Le, Q.V. Pay attention to mlps. Adv. Neural Inf. Process. Syst. 2021, 34, 9204–9215. [Google Scholar]
- Yu, W.; Luo, M.; Zhou, P.; Si, C.; Zhou, Y.; Wang, X.; Feng, J.; Yan, S. Metaformer is actually what you need for vision. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 10819–10829. [Google Scholar]
- Tu, Z.; Talebi, H.; Zhang, H.; Yang, F.; Milanfar, P.; Bovik, A.; Li, Y. Maxim: Multi-axis mlp for image processing. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 5769–5780. [Google Scholar]
- An, L.; Wang, L.; Li, Y. HEA-Net: Attention and MLP Hybrid Encoder Architecture for Medical Image Segmentation. Sensors 2022, 22, 7024. [Google Scholar] [CrossRef]
- Liang, Z.; Zheng, Z.; Chen, W.; Pei, Z.; Wang, J.; Chen, J. A Novel Deep Transfer Learning Framework Integrating General and Domain-Specific Features for EEG-Based Brain-Computer Interface. Biomed. Signal Process. Control 2024, 95, 106311. [Google Scholar] [CrossRef]
- Mishra, S.; Tripathy, H.K.; Mallick, P.K.; Bhoi, A.K.; Barsocchi, P. EAGA-MLP—An enhanced and adaptive hybrid classification model for diabetes diagnosis. Sensors 2020, 20, 4036. [Google Scholar] [CrossRef]
- Al Bataineh, A.; Manacek, S. MLP-PSO hybrid algorithm for heart disease prediction. J. Pers. Med. 2022, 12, 1208. [Google Scholar] [CrossRef]
- Pahuja, R.; Kumar, A. Sound-spectrogram based automatic bird species recognition using MLP classifier. Appl. Acoust. 2021, 180, 108077. [Google Scholar] [CrossRef]
- Jin, Y.; Hu, Y.; Jiang, Z.; Zheng, Q. Polyp segmentation with convolutional MLP. Vis. Comput. 2023, 39, 4819–4837. [Google Scholar] [CrossRef]
- Kong, J.; Wang, H.; Yang, C.; Jin, X.; Zuo, M.; Zhang, X. A spatial feature-enhanced attention neural network with high-order pooling representation for application in pest and disease recognition. Agriculture 2022, 12, 500. [Google Scholar] [CrossRef]
- Zhai, Y.; Fan, D.-P.; Yang, J.; Borji, A.; Shao, L.; Han, J.; Wang, L. Bifurcated backbone strategy for RGB-D salient object detection. IEEE Trans. Image Process. 2021, 30, 8727–8742. [Google Scholar] [CrossRef] [PubMed]
- Ma, X.; Dong, J.; Wei, W.; Zheng, B.; Ma, J.; Zhou, T. Remote Sensing Image Object Detection by Fusing Multi-Scale Contextual Features and Channel Enhancement. In Proceedings of the 2023 International Joint Conference on Neural Networks (IJCNN), Gold Coast, Australia, 18–23 June 2023; pp. 01–07. [Google Scholar]
- Hu, J.; Shen, L.; Sun, G. Squeeze-and-excitation networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 7132–7141. [Google Scholar]
- Li, K.; Wan, G.; Cheng, G.; Meng, L.; Han, J. Object detection in optical remote sensing images: A survey and a new benchmark. ISPRS J. Photogramm. Remote Sens. 2020, 159, 296–307. [Google Scholar] [CrossRef]
- Liu, Z.; Yuan, L.; Weng, L.; Yang, Y. A high resolution optical satellite image dataset for ship recognition and some new baselines. In Proceedings of the International Conference on Pattern Recognition Applications and Methods, Porto, Portugal, 24–26 February 2017; pp. 324–331. [Google Scholar]
- Jocher, G.; Chaurasia, A.; Stoken, A.; Borovec, J.; Kwon, Y.; Michael, K.; Fang, J.; Wong, C.; Yifu, Z.; Montes, D. ultralytics/yolov5: v6.2 - YOLOv5 Classification Models, Apple M1, Reproducibility, ClearML and Deci.ai Integrations. Zenodo. 2022. Available online: https://zenodo.org/records/7002879 (accessed on 25 June 2024).
- Huang, Z.; Wang, X.; Huang, L.; Huang, C.; Wei, Y.; Liu, W. Ccnet: Criss-cross attention for semantic segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 603–612. [Google Scholar]
- Tan, M.; Le, Q. Efficientnet: Rethinking model scaling for convolutional neural networks. In Proceedings of the International Conference on Machine Learning, Long Beach, CA, USA, 9–15 June 2019; pp. 6105–6114. [Google Scholar]
- Ren, S.; Fang, Z.; Gu, X. A Cross Stage Partial Network with Strengthen Matching Detector for Remote Sensing Object Detection. Remote Sens. 2023, 15, 1574. [Google Scholar] [CrossRef]
- Huang, W.; Li, G.; Chen, Q.; Ju, M.; Qu, J. CF2PN: A cross-scale feature fusion pyramid network based remote sensing target detection. Remote Sens. 2021, 13, 847. [Google Scholar] [CrossRef]
- Ye, Y.; Ren, X.; Zhu, B.; Tang, T.; Tan, X.; Gui, Y.; Yao, Q. An adaptive attention fusion mechanism convolutional network for object detection in remote sensing images. Remote Sens. 2022, 14, 516. [Google Scholar] [CrossRef]
- Huyan, L.; Bai, Y.; Li, Y.; Jiang, D.; Zhang, Y.; Zhou, Q.; Wei, J.; Liu, J.; Zhang, Y.; Cui, T. A lightweight object detection framework for remote sensing images. Remote Sens. 2021, 13, 683. [Google Scholar] [CrossRef]
- Liu, N.; Mao, Z.; Wang, Y.; Shen, J. Remote Sensing Images Target Detection Based on Adjustable Parameter and Receptive field. Acta Photonica Sin. 2021, 50, 1128001–1128012. [Google Scholar]
- Wang, J.; Gong, Z.; Liu, X.; Guo, H.; Yu, D.; Ding, L. Object detection based on adaptive feature-aware method in optical remote sensing images. Remote Sens. 2022, 14, 3616. [Google Scholar] [CrossRef]
- Chen, T.; Li, R.; Fu, J.; Jiang, D. Tucker Bilinear Attention Network for Multi-scale Remote Sensing Object Detection. arXiv 2023, arXiv:2303.05329. [Google Scholar] [CrossRef]
- Li, Z.; Hou, B.; Wu, Z.; Ren, B.; Yang, C. FCOSR: A simple anchor-free rotated detector for aerial object detection. Remote Sens. 2023, 15, 5499. [Google Scholar] [CrossRef]
- Lin, T.-Y.; Goyal, P.; Girshick, R.; He, K.; Dollár, P. Focal loss for dense object detection. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2980–2988. [Google Scholar]
- Yang, X.; Liu, Q.; Yan, J.; Li, A.; Zhang, Z.; Yu, G. R3det: Refined single-stage detector with feature refinement for rotating object. arXiv 2019, arXiv:1908.05612. [Google Scholar] [CrossRef]
- Han, J.; Ding, J.; Li, J.; Xia, G.-S. Align deep features for oriented object detection. IEEE Trans. Geosci. Remote Sens. 2021, 60, 1–11. [Google Scholar] [CrossRef]
- Ding, J.; Xue, N.; Long, Y.; Xia, G.-S.; Lu, Q. Learning RoI transformer for oriented object detection in aerial images. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 2849–2858. [Google Scholar]
- Yang, X.; Yan, J. Arbitrary-oriented object detection with circular smooth label. In Proceedings of the Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, 23–28 August 2020; Proceedings, Part VIII 16. pp. 677–694. [Google Scholar]
- Duan, M.; Meng, R.; Xiao, L. An Orientation-Aware Anchor-Free Detector for Aerial Object Detection. In Proceedings of the IGARSS 2022—2022 IEEE International Geoscience and Remote Sensing Symposium, Kuala Lumpur, Malaysia, 17–22 July 2022; pp. 3075–3078. [Google Scholar]
- Cheng, G.; Wang, J.; Li, K.; Xie, X.; Lang, C.; Yao, Y.; Han, J. Anchor-free oriented proposal generator for object detection. IEEE Trans. Geosci. Remote Sens. 2022, 60, 1–11. [Google Scholar] [CrossRef]
- Wang, J.; Yang, W.; Li, H.-C.; Zhang, H.; Xia, G.-S. Learning center probability map for detecting objects in aerial images. IEEE Trans. Geosci. Remote Sens. 2020, 59, 4307–4323. [Google Scholar] [CrossRef]
- Pan, X.; Ren, Y.; Sheng, K.; Dong, W.; Yuan, H.; Guo, X.; Ma, C.; Xu, C. Dynamic refinement network for oriented and densely packed object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 11207–11216. [Google Scholar]
Ablation results on the DIOR dataset (√ indicates the module is enabled):

| Baseline | FEM | LCFM | mAP | FPS | FLOPs (G) | Params (M) |
|---|---|---|---|---|---|---|
| √ | | | 73.1 | 35 | 130.70 | 68.78 |
| √ | √ | | 74.3 | 26 | 154.29 | 97.09 |
| √ | | √ | 73.6 | 31 | 136.61 | 78.23 |
| √ | √ | √ | 74.7 | 25 | 160.20 | 106.54 |
Ablation results on the HRSC2016 dataset (√ indicates the module is enabled):

| Baseline | FEM | LCFM | mAP | FPS | FLOPs (G) | Params (M) |
|---|---|---|---|---|---|---|
| √ | | | 96.00 | 26 | 130.64 | 68.76 |
| √ | √ | | 96.50 | 23 | 154.23 | 97.07 |
| √ | | √ | 96.70 | 24 | 136.54 | 78.20 |
| √ | √ | √ | 97.10 | 20 | 160.14 | 106.52 |
Per-category ablation results on the DIOR dataset, first 10 categories:

| Methods | mAP | AL | AT | BF | BC | B | C | D | ESA | ETS | GC |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Baseline | 73.1 | 91.7 | 74.2 | 92.6 | 80.7 | 43.4 | 89.8 | 60.1 | 55.9 | 62.0 | 78.3 |
| Baseline + FEM | 74.3 | 92.0 | 76.9 | 93.1 | 80.8 | 45.0 | 90.5 | 66.7 | 54.8 | 64.6 | 79.0 |
| Baseline + LCFM | 73.6 | 91.3 | 77.4 | 92.9 | 81.3 | 44.1 | 91.1 | 62.9 | 54.5 | 61.7 | 78.2 |
| Baseline + FEM + LCFM (Ours) | 74.7 | 92.3 | 79.3 | 91.9 | 81.4 | 43.7 | 91.1 | 66.2 | 56.2 | 63.1 | 80.6 |

Remaining 10 categories:

| Methods | mAP | GTF | HB | O | S | SD | ST | TC | TS | V | W |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Baseline | 73.1 | 79.0 | 56.3 | 59.2 | 85.9 | 83.9 | 82.8 | 86.0 | 53.4 | 71.7 | 77.0 |
| Baseline + FEM | 74.3 | 76.7 | 56.2 | 59.6 | 86.7 | 88.0 | 81.2 | 86.4 | 58.1 | 71.8 | 78.1 |
| Baseline + LCFM | 73.6 | 76.7 | 56.2 | 60.6 | 87.6 | 85.3 | 82.1 | 84.3 | 55.9 | 71.1 | 77.6 |
| Baseline + FEM + LCFM (Ours) | 74.7 | 76.5 | 55.3 | 60.2 | 87.1 | 90.2 | 81.1 | 85.8 | 62.5 | 71.8 | 78.4 |
Comparison with other methods on the DIOR dataset, first 10 categories:

| Methods | mAP | FPS | AL | AT | BF | BC | B | C | D | ESA | ETS | GC |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Yolov5 | 68.6 | 80 | 87.3 | 61.7 | 73.8 | 90.0 | 42.6 | 77.5 | 55.2 | 63.8 | 63.2 | 66.9 |
| Centernet | 63.9 | 10 | 73.6 | 58.0 | 69.7 | 88.5 | 36.2 | 76.9 | 47.9 | 52.7 | 54.0 | 60.5 |
| Efficientnet | 62.2 | 13 | 72.4 | 68.3 | 64.6 | 87.0 | 33.6 | 74.5 | 43.7 | 60.1 | 55.4 | 72.6 |
| StrMCsDet | 65.6 | 38 | 78.6 | 58.4 | 38.1 | 38.3 | 55.0 | 49.5 | 56.8 | 35.5 | 79.1 | 37.1 |
| CF2PN | 57.9 | 18 | 70.0 | 57.4 | 36.9 | 36.3 | 43.4 | 45.1 | 51.2 | 34.8 | 73.8 | 45.9 |
| AAFM-Enhanced EfficientDet | 69.8 | - | 71.6 | 75.1 | 82.6 | 81.0 | 45.9 | 70.4 | 69.0 | 83.2 | 68.2 | 78.4 |
| MSF-SNET | 66.5 | - | 90.3 | 76.6 | 90.9 | 69.6 | 37.5 | 88.3 | 70.6 | 70.8 | 63.6 | 69.9 |
| ASDN | 66.9 | 32 | 63.9 | 73.8 | 71.8 | 81.0 | 46.3 | 73.4 | 56.3 | 73.4 | 66.2 | 74.7 |
| AFADet | 66.1 | 61 | 85.6 | 66.5 | 76.3 | 88.1 | 37.4 | 78.3 | 53.6 | 61.8 | 58.4 | 54.3 |
| GTNet | 73.3 | - | 72.3 | 87.5 | 72.3 | 89.0 | 53.7 | 72.5 | 71.0 | 85.1 | 77.6 | 78.1 |
| Ours | 74.7 | 25 | 92.3 | 79.3 | 91.9 | 81.4 | 43.7 | 91.1 | 66.2 | 56.2 | 63.1 | 80.6 |

Remaining 10 categories:

| Methods | mAP | FPS | GTF | HB | O | S | SD | ST | TC | TS | V | W |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Yolov5 | 68.6 | 80 | 78.0 | 58.2 | 58.1 | 87.8 | 54.3 | 79.3 | 89.7 | 50.2 | 54.0 | 79.6 |
| Centernet | 63.9 | 10 | 62.6 | 45.7 | 52.6 | 88.2 | 63.7 | 76.2 | 83.7 | 51.3 | 54.4 | 79.5 |
| Efficientnet | 62.2 | 13 | 67.0 | 47.0 | 53.0 | 86.3 | 37.6 | 70.9 | 81.2 | 43.4 | 50.3 | 75.5 |
| StrMCsDet | 65.6 | 38 | 42.5 | 66.0 | 38.3 | 66.6 | 62.9 | 80.8 | 49.3 | 35.0 | 72.1 | 81.3 |
| CF2PN | 57.9 | 18 | 38.7 | 59.0 | 35.5 | 46.5 | 55.2 | 50.2 | 47.5 | 33.5 | 63.5 | 77.2 |
| AAFM-Enhanced EfficientDet | 69.8 | - | 80.8 | 48.3 | 59.8 | 76.8 | 81.0 | 56.6 | 85.6 | 60.5 | 45.6 | 76.5 |
| MSF-SNET | 66.5 | - | 61.9 | 59.0 | 57.5 | 20.5 | 90.6 | 72.4 | 80.9 | 60.3 | 39.8 | 58.6 |
| ASDN | 66.9 | 32 | 75.2 | 51.1 | 58.4 | 76.2 | 67.4 | 60.2 | 81.4 | 58.7 | 45.8 | 83.1 |
| AFADet | 66.1 | 61 | 67.2 | 70.4 | 53.1 | 82.7 | 62.8 | 64.0 | 88.2 | 50.3 | 44.0 | 79.2 |
| GTNet | 73.3 | - | 81.9 | 65.9 | 63.9 | 80.8 | 76.2 | 62.5 | 81.5 | 65.5 | 48.5 | 80.9 |
| Ours | 74.7 | 25 | 76.5 | 55.3 | 60.2 | 87.1 | 90.2 | 81.1 | 85.8 | 62.5 | 71.8 | 78.4 |
Comparison with other methods on the HRSC2016 dataset:

| Methods | mAP | FPS |
|---|---|---|
| Rotated FCOS | 88.70 | 24 |
| Rotated RetinaNet | 95.21 | 20 |
| CSL | 96.10 | 24 |
| R3Det | 96.01 | 16 |
| OAF-Net | 89.96 | - |
| AOPG | 96.22 | 11 |
| S2ANET | 95.01 | 13 |
| CenterMap-Net | 92.80 | 6 |
| DRN | 92.70 | - |
| ROI-transformer | 86.20 | 6 |
| Ours | 97.10 | 20 |
Comparison on the HRSC2016 dataset under different noise levels:

| Dataset | SNR (dB) | Methods | mAP |
|---|---|---|---|
| Ori HRSC2016 | - | Rotated FCOS | 88.70 |
| | | Rotated RetinaNet | 95.21 |
| | | Ours | 97.10 |
| HRSC2016 with minor noise | 8.05 | Rotated FCOS | 79.20 |
| | | Rotated RetinaNet | 88.70 |
| | | Ours | 94.50 |
| HRSC2016 with massive noise | 1.99 | Rotated FCOS | 32.74 |
| | | Rotated RetinaNet | 67.30 |
| | | Ours | 78.00 |
© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Share and Cite
Wu, J.; Ni, R.; Chen, Z.; Huang, F.; Chen, L. FEFN: Feature Enhancement Feedforward Network for Lightweight Object Detection in Remote Sensing Images. Remote Sens. 2024, 16, 2398. https://doi.org/10.3390/rs16132398