ARSOD-YOLO: Enhancing Small Target Detection for Remote Sensing Images

1 National Laboratory on Adaptive Optics, Chengdu 610209, China
2 University of Chinese Academy of Sciences, Beijing 101408, China
3 Institute of Optics and Electronics, Chinese Academy of Sciences, Chengdu 610209, China
* Author to whom correspondence should be addressed.
Sensors 2024, 24(23), 7472; https://doi.org/10.3390/s24237472
Submission received: 29 September 2024 / Revised: 12 November 2024 / Accepted: 20 November 2024 / Published: 23 November 2024
(This article belongs to the Section Sensing and Imaging)

Abstract

Remote sensing images play a vital role in domains including environmental monitoring, agriculture, and autonomous driving. However, the detection of targets in remote sensing images remains a challenging task. This study introduces innovative methods to enhance feature extraction, feature fusion, and model optimization. The Adaptive Selective Feature Enhancement Module (AFEM) dynamically adjusts feature weights using GhostModule and sigmoid functions, thereby enhancing the accuracy of small target detection. The Adaptive Multi-scale Convolution Kernel Feature Fusion Module (AKSFFM) enhances feature fusion through multi-scale convolution operations and attention weight learning mechanisms. Building on these modules, our proposed ARSOD-YOLO optimizes the network architecture, component modules, and loss functions of YOLOv8, improving small target detection capability while preserving model efficiency. We conducted experiments on the VEDAI and AI-TOD datasets, showcasing the excellent performance of ARSOD-YOLO. Our algorithm achieved an mAP50 of 74.3% on the VEDAI dataset, surpassing the YOLOv8 baseline by 3.1%. Similarly, on the AI-TOD dataset, the mAP50 reached 47.8%, exceeding the baseline network by 6.1%.

1. Introduction

With the rapid development of remote sensing technology, an increasing number of high-quality remote sensing images are being generated and applied to various fields such as disaster assessment, military surveillance, and automatic driving [1,2]. These images provide substantial data support for remote sensing image interpretation experts. However, the conventional methods that heavily depended on manual annotation are evidently inadequate to handle the vast amount of data present in modern images, necessitating the development of efficient algorithms for processing. Remote sensing object detection is a critical method of interpreting images, playing a central role in both the military and civilian sectors.
As depicted in Figure 1, remote sensing images are different from natural images in shooting distance and image scale, presenting significant challenges [3]. However, due to the restricted size of small targets, mainstream detectors exhibit high false detection and missed detection rates [4,5].
Traditional methods for remote sensing target detection rely on manual feature design, leading to limited generalization and computationally intensive procedures, rendering them inadequate for handling vast image datasets [6,7]. With the advancement of deep learning technology, remote sensing object detection algorithms based on neural networks have gradually become mainstream. These algorithms not only avoid the shortcomings of manually designed traditional algorithms but also excel in feature extraction from extensive datasets, exhibiting high generalization, and significant improvements in both detection speed and accuracy have been realized. Popular deep learning-based remote sensing target detection algorithms can be categorized into single-stage and two-stage detection. A two-stage detection algorithm initially filters possible regions of the target and then conducts classification and regression on the proposed regions. Conversely, a single-stage detection algorithm eliminates the first stage and performs detection directly; single-stage detection is therefore generally faster, albeit at a slight cost to accuracy. The YOLO series is well known for its balance of speed and accuracy, making it a popular choice in both academic research and industry. In this study, YOLOv8 was chosen as the detection benchmark to achieve an optimal trade-off between detection speed and accuracy.
Figure 1. Some examples of remote sensing images.
For algorithms utilizing deep learning to detect remote sensing small targets, the existing challenges can be categorized into the following three main points: (1) Extracting effective features and distinguishing small targets from the background is challenging due to insufficient features caused by long shooting distances and small sizes [8,9]. (2) Small targets often lose information during continuous convolution, emphasizing the significance of feature fusion techniques. Many existing methods manually specify convolution kernel sizes for feature detection and fusion, hindering the network’s ability to autonomously determine the most crucial kernel sizes for accurate detection. (3) The detection of small targets is more prone to positional deviations, leading to lower positioning accuracy. Additionally, remote sensing images show significant scale differences, posing a challenge for mainstream remote sensing detection algorithms that are not optimized for effectively detecting small targets.
To address these challenges, we have made improvements in the following aspects:
(1) Adaptive Selective Feature Enhancement Module (AFEM): For efficient small target detection in remote sensing images, we designed an adaptive selective feature enhancement module (AFEM) to enhance the importance and detection accuracy of small target features by dynamically adjusting feature weights and using GhostModule and sigmoid activation functions.
(2) Adaptive Multi-scale Convolution Kernel Feature Fusion Module (AKSFFM): Introduced to enhance feature fusion in remote sensing object detection, the AKSFFM improves feature dependency through convolution operations at various scales and attention weight learning, enhancing the model’s ability to fuse features effectively in complex remote sensing images.
(3) ARSOD-YOLO model: A model based on YOLOv8, the ARSOD-YOLO model is optimized in terms of network architecture, component modules, and loss functions. It outperforms some SOD models, offering superior detection capabilities for small targets in remote sensing while remaining lightweight.

2. Related Works

2.1. Small Target Detection Algorithm Based on Feature Enhancement

The challenge in detecting small objects predominantly arises from the possible loss of small object features due to the downsampling operation. The abstraction in high-level feature maps results in insufficient spatial position information for small targets. Current research addressing the insufficient feature extraction for small targets can be categorized into two approaches.
One prevalent approach involves incorporating pre-processing techniques. Super resolution technology is employed to enhance image resolution prior to detection. Zhang et al. introduced Deconv-RCNN [10], utilizing deconvolution operations on extracted feature maps to recover information lost during pooling layers, thereby enhancing detection accuracy. Currently, the mainstream trend involves leveraging Generative Adversarial Networks (GANs) [11] for super resolution. Courtrai [12] integrated a GAN-based super-resolution network branch to enhance input image resolution before detection. MT-GAN [13] employs a multi-task network to address target classification and detection simultaneously. Li [13] devised the Perceptual GAN to exploit intrinsic correlations among objects of varying scales. Wu proposed a point-to-region micro-target detection framework [14] that predicts candidate regions using keypoints and applies a multi-task GAN for super-resolution processing. Additionally, Ma [15] enhanced feature distinctiveness by establishing connections between small objects and their surroundings through self-attention mechanisms.
One strategy involves enhancing small target features by leveraging prior knowledge of semantic and spatial associations based on contextual information. Zhao [16] introduced the Receptive Field Block (RFB) module to enhance feature expression capabilities. Luo [16] employed the CSandGLass module to replace residual modules in the Backbone network, enhancing aircraft detection performance. Bell [17] utilized jump connections and spatial recursion networks to capture crucial information inside and outside the region of interest, thereby enhancing target detection accuracy. SCRDet [18] integrates two attention mechanisms to improve effective feature representation and suppress irrelevant noise. PyramidBox [19] combines high-level contextual semantic features with low-level texture features to identify small objects efficiently.

2.2. Small Target Detection Algorithm Based on Feature Fusion

Numerous researchers have explored synthesizing shallow and deep features. They emphasize fusing features from various network branches to incorporate both deep semantic insights and shallow detail information, culminating in enhanced feature representations for detecting small targets.
The Feature Pyramid Network (FPN) introduced by Lin [20] pioneers top-down feature fusion, linking low-level and high-level feature graphs rich in semantic content. PANet [21] has achieved increased accuracy on specific datasets by incorporating bidirectional paths for enhancement. ASFF [22], NAS-FPN [23], and RefineDet [24] have also demonstrated notable results. BiFPN [25] proposes a bidirectional weighted feature network, learning scale-specific weights for effective cross-scale fusion. Lim et al. [26] leveraged varying layers of context information and attention mechanisms to boost small target detection accuracy. Liu et al. [27] proposed two pyramid networks to extract features efficiently while reducing computational load. Nie [28] improved the YOLOv8 Neck network by merging up-sampled deep feature maps with shallow ones. Zhao [29], when detecting UAVs, integrated more feature layers through cross-connections, enriching semantic information. FMSSD [30] emphasizes small target detection accuracy via area weighting functions. FS-SSD [31] combines feature fusion and multi-scale scaling to effectively detect tiny UAV targets. Bai et al. [32] devised a bidirectional pyramid network for infrared small-size targets, enhancing the original FPN structure with a bottom-up pyramid network for richer feature fusion. Li et al. [33] proposed an attention network for context extraction, bolstering detection efficacy. Liu et al. [33] introduced IPG-NET and the IPG transformation module to supplement spatial information through feature extraction at varying resolutions. Gong [34] innovated a statistical fusion factor to regulate feature flow across adjacent layers.

2.3. Remote Sensing Small Target Detection Method Based on YOLO

The YOLO (You Only Look Once) series has found extensive applications in remote sensing target detection. Cao et al. [35] developed a detection head tailored for remote sensing small objects, enhancing a lightweight network based on GhostConv. Liu et al. [36] integrated the CBAM into the YOLOX network. Li et al. [37] combined attention mechanisms with the MobileNetv3 Backbone network to enhance YOLOv3, subsequently improving detection accuracy. Wan [38] introduced the YOLO-HR algorithm, employing a lightweight hybrid attention module, validated with high-resolution optical remote sensing images. Qu [39] designed a branch network to enhance features of small targets, utilizing an adaptive feature fusion technique to improve target detection efficiency. Addressing the diverse scales in remote sensing images, Xu [40] introduced DenseNet to enhance the Backbone for improved multi-scale target detection accuracy. HRDNet [27], focusing on small target detection in UAV images, merges depth and multi-scale pyramids, cross-linking high-resolution and low-resolution features for comprehensive information extraction through shallow and deep network fusion.

3. Materials and Methods

3.1. Baseline Model

YOLO is currently one of the most popular families of object detection algorithms, and among its iterations, YOLOv8 has demonstrated superior detection accuracy and efficiency compared to earlier YOLO versions. The YOLOv8 network model, shown in Figure 2, can be roughly divided into three parts: the Backbone, Neck, and Prediction sections. The Backbone section is used for feature extraction, where the C2f module draws inspiration from the ELAN concept, parallelizing more gradient flow branches to obtain richer gradient flow information while ensuring a lightweight design. The Neck section performs feature fusion on the features extracted by the Backbone, obtaining more context information at different scales, which has proven effective in handling tasks involving multi-scale objects. The Prediction section consists of three detection heads responsible for predicting targets; these detection heads detect large, medium, and small objects, respectively, providing the categories and coordinates of the targets. The YOLOv8 series comprises five models, YOLOv8n, YOLOv8s, YOLOv8m, YOLOv8l, and YOLOv8x, with different scales and parameter counts and gradually increasing computational complexity. To achieve a good balance between speed and accuracy, we adopted the YOLOv8n model in this study.

3.2. ARSOD-YOLO

The network structure of our proposed ARSOD-YOLO (Adaptive Remote Sensing Small Object Detection) is illustrated in Figure 3. We opted for YOLOv8 as our benchmark model due to its ability to strike a balance between speed and accuracy in target detection. To enhance YOLOv8’s performance in detecting small and medium targets within remote sensing images, we implemented several improvements. Firstly, we enhanced the Neck segment by introducing the BiFPN concept and refining the original PANet network structure. Subsequently, prior to feature fusion, we integrated an Adaptive Features Enhancement Module (AFEM) at each stage to amplify small target features, represented by the yellow module in Figure 3. In the Model Feature Fusion phase, we introduced the Adaptive Kernel Size Feature Fusion Module (AKSFFM) to amalgamate context information and seamlessly integrate features across different levels, denoted by the green modules. Lastly, we employed the WIoU loss function to construct the boundary frame loss with a dynamic focusing mechanism, thereby improving the network model’s localization capability.
Figure 2. YOLOv8 network architecture.
In refining the Neck section of the network structure depicted in Figure 3, our aim was to bolster cross-scale connections within the network architecture. Drawing inspiration from the BiFPN concept, we introduced jump connections that bridge the input and output nodes of identical scales. This strategic enhancement facilitated the incorporation of a richer array of features, as highlighted by the conspicuous red lines in Figure 3, all without imposing an undue burden on computational resources.
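To illustrate one way such same-scale skip connections can be realized, the sketch below uses the fast normalized (weighted) fusion from BiFPN. Whether ARSOD-YOLO learns fusion weights or simply concatenates the skipped features is not detailed here, so the class name and weighting scheme are illustrative assumptions rather than the authors' implementation.

```python
import torch
import torch.nn as nn

class WeightedSkipFusion(nn.Module):
    """BiFPN-style fast normalized fusion: one learnable, non-negative weight per
    same-scale input feature map (a hypothetical sketch, not the paper's exact layer)."""
    def __init__(self, num_inputs: int = 2, eps: float = 1e-4):
        super().__init__()
        self.w = nn.Parameter(torch.ones(num_inputs))
        self.eps = eps

    def forward(self, feats):
        w = torch.relu(self.w)            # keep fusion weights non-negative
        w = w / (w.sum() + self.eps)      # normalize so the weights sum to roughly one
        return sum(wi * f for wi, f in zip(w, feats))

# Usage sketch: fuse a backbone feature with the same-scale neck feature.
fused = WeightedSkipFusion(2)([torch.randn(1, 64, 80, 80), torch.randn(1, 64, 80, 80)])
```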
Small targets depend more heavily on local information, which is progressively eroded during continuous convolution; moving the connection layers of the small and medium detection heads to the Concat layer therefore helps to avoid potential information loss due to excessive convolution operations and enables more effective integration of features across different scales. Conversely, considering that larger targets require attention to global information, retaining the large detection head connected to the AKSFFM layer allows multiple convolution operations to observe large-scale features, facilitating comprehensive use of global information. Such adjustments better cater to the varying needs of targets for local and global information, thereby enhancing overall detection accuracy and performance. Specifically, we transitioned the connection layers of the small and medium detection heads to the Concat layer, ensuring the preservation of crucial information, while the connection layer of the large detection head remained unaltered, as indicated by the yellow line segment linking “Detect” in Figure 3.
Figure 3. ARSOD-YOLO network architecture.

3.3. Adaptive Feature Enhancement Module

Due to the nature of remote sensing images being captured from considerable distances, the targets of interest are often small within the image frame. These small targets can easily blend into the background, posing challenges for effective feature extraction. Our approach focuses on selectively amplifying features crucial for detecting these small targets. By fine-tuning the significance of features based on distinct remote sensing scenarios and target characteristics, we aim to enhance the overall capability of target detection. To achieve this, we have devised a feature enhancement module capable of dynamically choosing and enhancing features that are most relevant and effective for the task at hand.
In Figure 4, our Adaptive Features Enhancement Module (AFEM) initiates by establishing two branches. Each branch undergoes feature processing using GhostModule, with one branch employing a sigmoid activation function. The sigmoid function acts as a gating mechanism, facilitating the learning of correlations between diverse features and enabling the dynamic adjustment of their weights. This adaptive weighting mechanism caters to targets of varying scales and complexities. The learned importance weights in the second branch enhance the model’s understanding of target information within remote sensing images. Processing in the two branches allows the model to extract insights from different facets of the image, facilitating a comprehensive extraction of features relevant to small targets.
Subsequently, the attention-weighted features are fed into a multi-layer perceptron (MLP) to bolster the feature representation post-attention selection. This step aims to refine the remote sensing model’s understanding and detection accuracy concerning small targets. Following this enhancement, a residual operation is conducted to combine the enriched features with the initial input features, facilitating feature fusion. This process serves multiple purposes: it mitigates overfitting, preserves additional context information during feature enhancement, and equips the model with improved capabilities to address small targets within remote sensing images.
Figure 4. The basic structure of AFEM. It consists of GhostModule and MLP as the basic components.
We illustrate the calculation process of the Adaptive Features Enhancement Module (AFEM) in Equation (1):
$$
\begin{aligned}
a &= \mathrm{GhostModule}(x,\ \mathrm{kernel\_size}=1)\\
\mathrm{weights} &= \sigma(a), \qquad \sigma(a) = \frac{1}{1+e^{-a}}\\
\mathrm{enhanced\_features} &= \mathrm{MLP}(\mathrm{weights} \times a,\ \mathrm{kernel\_size}=3,\ \mathrm{padding}=1,\ \mathrm{ReLU})\\
\mathrm{output} &= x + \mathrm{enhanced\_features}
\end{aligned}
$$
where x represents the input and a signifies the output after x undergoes GhostModule processing. σ denotes the sigmoid function, and weights corresponds to the weights obtained after processing by the sigmoid function. The term MLP refers to the multi-layer perceptron, with a kernel_size of 3, padding of 1, and ReLU employed as the activation function. enhanced_features signifies the resulting enhanced features.
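For concreteness, a minimal PyTorch sketch of the data flow in Equation (1) is given below. It is an illustration rather than the authors' implementation: a plain 1×1 convolution stands in for the GhostModule (sketched in the next paragraphs), and the channel counts in the usage comment are arbitrary.

```python
import torch
import torch.nn as nn

class AFEM(nn.Module):
    """Sketch of Equation (1): sigmoid-gated feature enhancement with a residual add."""
    def __init__(self, channels: int):
        super().__init__()
        # Stand-in for GhostModule(kernel_size=1); see the GhostModule sketch below.
        self.ghost = nn.Conv2d(channels, channels, kernel_size=1)
        # "MLP" in Equation (1): 3x3 convolution, padding 1, followed by ReLU.
        self.mlp = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        a = self.ghost(x)                  # a = GhostModule(x, kernel_size=1)
        weights = torch.sigmoid(a)         # gating weights in [0, 1]
        enhanced = self.mlp(weights * a)   # enhanced_features
        return x + enhanced                # residual connection with the input

# Usage sketch: y = AFEM(64)(torch.randn(1, 64, 80, 80))  ->  same shape as the input
```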
In the initial processing of input features, it is crucial to highlight the use of GhostModule instead of traditional convolutions. An observation in ResNet50 revealed the presence of highly similar feature maps within the same residual group. GhostModule exploits this redundancy: rather than computing every feature map with a full convolution, it acquires additional "Ghost" feature maps through cheap linear operations, which preserves the feature extraction capability of the CNN at lower cost. The GhostModule comprises three key steps: convolution, Ghost feature map generation, and feature concatenation, as depicted in Figure 4. Initially, feature maps are derived via regular convolutions, followed by the generation of Ghost feature maps through linear operations (implemented as lightweight convolutions). Replacing part of the regular convolutions with Ghost feature maps yields a lighter computational load. The resulting Ghost feature maps are then concatenated with the original feature maps to produce the final output. Compared to the direct use of regular convolutions, the GhostModule notably reduces computational complexity while enriching the feature maps derived from convolutions. The calculation formula for GhostModule is presented in Equation (2):
$$
\begin{aligned}
X &= \mathrm{Conv}(x)\\
Y &= \mathrm{Conv\_linear}(x) \times W\\
\mathrm{Output} &= \mathrm{Concat}(X, Y)
\end{aligned}
$$
where x represents the input, Conv represents an ordinary convolution, and X represents the features processed through Conv. Conv_linear represents a linear operation, and Concat represents concatenation. The weight matrix W is a parameter used for the linear convolution operation, typically learned through the training process of the network. Y represents the features processed through Conv_linear with the weight matrix W.
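The sketch below follows the standard GhostNet formulation corresponding to Equation (2). Note that Equation (2) writes the cheap branch as operating on the input x, whereas the canonical GhostNet module applies it to the primary features X; the sketch follows the latter, and the channel ratio and kernel sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn

class GhostModule(nn.Module):
    """GhostNet-style module: a primary convolution plus cheap linear operations
    (a depthwise convolution here), with the two outputs concatenated."""
    def __init__(self, in_ch: int, out_ch: int, kernel_size: int = 1, ratio: int = 2):
        super().__init__()
        # out_ch is assumed to be divisible by ratio so the depthwise groups divide evenly.
        primary_ch = out_ch // ratio          # channels from the ordinary convolution
        ghost_ch = out_ch - primary_ch        # channels from the cheap operation
        self.primary = nn.Sequential(
            nn.Conv2d(in_ch, primary_ch, kernel_size, padding=kernel_size // 2, bias=False),
            nn.BatchNorm2d(primary_ch),
            nn.ReLU(inplace=True),
        )
        # Cheap linear operation: 3x3 depthwise convolution on the primary features.
        self.cheap = nn.Sequential(
            nn.Conv2d(primary_ch, ghost_ch, 3, padding=1, groups=primary_ch, bias=False),
            nn.BatchNorm2d(ghost_ch),
            nn.ReLU(inplace=True),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        X = self.primary(x)                   # X = Conv(x)
        Y = self.cheap(X)                     # Y = cheap linear mapping of X
        return torch.cat([X, Y], dim=1)       # Output = Concat(X, Y)
```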

3.4. Adaptive Multi-Convolutional Kernel Feature Fusion Module

While feature enhancement has been implemented to handle small targets, remote sensing images typically exhibit a diverse array of features due to the intricate structure and textures of ground objects. It is imperative to select appropriate methods to enhance the feature fusion aspect of YOLOv8 for bolstering its fusion capacity in remote sensing target detection. The underlying features often possess higher resolution and encompass more positional information but undergo fewer convolution steps, leading to increased noise. Conversely, higher-level features, after multiple convolutions, contain richer semantic information. In response to these challenges, we introduced AKSFFM, a versatile module that dynamically selects features of varying convolution kernel sizes and processes them following each concatenation operation in the Neck section. This approach aims to effectively amalgamate feature information of diverse scales and enhance feature fusion capabilities. As depicted in Figure 5, the primary structure of AKSFFM retains the design framework of C2f but incorporates a Bottleneck for enhanced performance.
The AKSFFM module begins by employing convolution operations at four different scales to extract feature information of various scales from the input feature map, bolstering the model’s capability to perceive small targets effectively. Different kernel sizes impact the spatial range of features learned by the network and the strength of localized information. Hence, we opt for diverse kernel sizes to facilitate feature extraction and abstraction at different network levels. Subsequently, the feature map undergoes global pooling to derive a comprehensive global feature representation, capturing the entirety of the information within the remote sensing image. To streamline computation, we apply a 1×1 convolution after pooling to reduce dimensionality. Attention weights for the dimensionally reduced features are computed through a series of convolutions and normalized using softmax. These weights adaptively learn the significance of each feature channel, selecting crucial features extracted from the different kernel sizes. Ultimately, the multi-scale features obtained in the initial step are multiplied by the learned attention weights and merged. This process fosters interaction and information transfer across different channels, reinforcing the interdependence between features. The calculation formula for AKSFFM is provided in Equation (3):
$$
\begin{aligned}
Y_k &= \mathrm{Conv2d}(x,\ \mathrm{kernel}=k), \quad k = 1, 3, 5, 7\\
\mathrm{feats} &= \mathrm{Concat}(Y_1, Y_3, Y_5, Y_7)\\
C &= \sum_{k} Y_k\\
G_1 &= \mathrm{Conv2d}(\mathrm{AvgPool}(C),\ \mathrm{kernel}=1)\\
\mathrm{Weight}_k &= \mathrm{Conv2d}(G_1,\ \mathrm{kernel}=1)\\
\mathrm{attention\_weight}_k &= \mathrm{Softmax}(\mathrm{Weight}_k)\\
\mathrm{Output} &= \sum_{k} \left(\mathrm{attention\_weight}_k \times Y_k\right)
\end{aligned}
$$
where x symbolizes the input feature map. Here, kernel represents the size of the convolution kernel, and Y_k denotes the result obtained from a convolution kernel of size k. Furthermore, feats denotes the feature map acquired by concatenating the features from each group, while C signifies the feature map obtained by directly summing the features from each group. G_1 represents the global feature representation derived after global pooling and dimensionality reduction. Weight_k signifies the attention weights after dimension reduction, whereas attention_weight_k represents the normalized attention weights. Finally, Output represents the output feature.
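A selective-kernel-style sketch of Equation (3) is shown below; the reduction width and the exact placement of this block inside the C2f-like structure of Figure 5 are assumptions made for illustration, not the authors' released code.

```python
import torch
import torch.nn as nn

class AKSFFM(nn.Module):
    """Sketch of Equation (3): selective fusion over kernel sizes 1/3/5/7."""
    def __init__(self, channels: int, reduced: int = 32, kernels=(1, 3, 5, 7)):
        super().__init__()
        self.branches = nn.ModuleList(
            [nn.Conv2d(channels, channels, k, padding=k // 2) for k in kernels]
        )
        self.pool = nn.AdaptiveAvgPool2d(1)                        # global average pooling
        self.reduce = nn.Conv2d(channels, reduced, kernel_size=1)  # G1: dimensionality reduction
        # One 1x1 convolution per branch producing its (unnormalized) attention logits.
        self.weight = nn.ModuleList(
            [nn.Conv2d(reduced, channels, kernel_size=1) for _ in kernels]
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        ys = [branch(x) for branch in self.branches]               # Y_k, k = 1, 3, 5, 7
        stacked = torch.stack(ys, dim=0)                           # [K, B, C, H, W]
        c = stacked.sum(dim=0)                                     # C = sum_k Y_k
        g1 = self.reduce(self.pool(c))                             # G1 = Conv1x1(AvgPool(C))
        logits = torch.stack([w(g1) for w in self.weight], dim=0)  # Weight_k, [K, B, C, 1, 1]
        attn = torch.softmax(logits, dim=0)                        # normalize across branches
        return (attn * stacked).sum(dim=0)                         # sum_k attention_weight_k * Y_k
```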
Figure 5. Structural diagrams of AKSFFM, C2f, and Bottleneck.

3.5. Loss Function

In YOLOv8, the bounding box loss is constructed using the CIoU (Complete Intersection over Union) proposed by Zheng et al. (2020) [41], which augments the DIoU (Distance Intersection over Union) [42] penalty with an additional aspect-ratio term.
However, CIoU fails to account for the sample difficulty balance, and it considers the box’s aspect ratio as one of the penalty terms in the loss function. When the aspect ratios of the bounding box and the predicted box are similar but their width and height values differ significantly, this penalty term becomes largely ineffective. Equation (4) presents the formula for CIoU:
$$
\begin{aligned}
L_{CIoU} &= 1 - IoU + \frac{\rho^2\!\left(b, b^{gt}\right)}{c^2} + \alpha\nu\\
R_{CIoU} &= \frac{\rho^2\!\left(b, b^{gt}\right)}{c^2} + \alpha\nu\\
\alpha &= \frac{\nu}{(1 - IoU) + \nu}\\
\nu &= \frac{4}{\pi^2}\left(\arctan\frac{w^{gt}}{h^{gt}} - \arctan\frac{w}{h}\right)^2
\end{aligned}
$$
where L_CIoU stands for the Complete IoU (CIoU) loss function, with R_CIoU representing the penalty term added on top of the IoU loss function. The variables w and h denote the width and height of the prediction box, while w^gt and h^gt represent the width and height of the ground truth box. The parameter ρ signifies the Euclidean distance between the two center points, and b and b^gt indicate the center points of the prediction box and the ground truth box. c represents the diagonal length of the smallest enclosing box covering the two boxes. The Intersection over Union IoU indicates the overlap between the predicted box and the ground truth box. Additionally, α acts as a balancing coefficient, while ν measures the consistency of the aspect ratios of the predicted and ground truth boxes.
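The following sketch computes the CIoU loss of Equation (4) for boxes in (x1, y1, x2, y2) corner format; the epsilon terms and the box format are implementation assumptions, not taken from the authors' code.

```python
import math
import torch

def ciou_loss(pred: torch.Tensor, target: torch.Tensor, eps: float = 1e-7) -> torch.Tensor:
    """CIoU loss of Equation (4). Both inputs are (N, 4) tensors in (x1, y1, x2, y2) format."""
    # Intersection over Union.
    lt = torch.max(pred[:, :2], target[:, :2])
    rb = torch.min(pred[:, 2:], target[:, 2:])
    wh = (rb - lt).clamp(min=0)
    inter = wh[:, 0] * wh[:, 1]
    area_p = (pred[:, 2] - pred[:, 0]) * (pred[:, 3] - pred[:, 1])
    area_t = (target[:, 2] - target[:, 0]) * (target[:, 3] - target[:, 1])
    iou = inter / (area_p + area_t - inter + eps)

    # Squared center distance rho^2 and squared diagonal of the enclosing box c^2.
    cp = (pred[:, :2] + pred[:, 2:]) / 2
    ct = (target[:, :2] + target[:, 2:]) / 2
    rho2 = ((cp - ct) ** 2).sum(dim=1)
    enc_lt = torch.min(pred[:, :2], target[:, :2])
    enc_rb = torch.max(pred[:, 2:], target[:, 2:])
    c2 = ((enc_rb - enc_lt) ** 2).sum(dim=1) + eps

    # Aspect-ratio consistency term nu and its trade-off coefficient alpha.
    wp, hp = pred[:, 2] - pred[:, 0], pred[:, 3] - pred[:, 1]
    wt, ht = target[:, 2] - target[:, 0], target[:, 3] - target[:, 1]
    v = (4 / math.pi ** 2) * (torch.atan(wt / (ht + eps)) - torch.atan(wp / (hp + eps))) ** 2
    alpha = v / (1 - iou + v + eps)

    return 1 - iou + rho2 / c2 + alpha * v
```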
To address the challenges associated with CIoU, WIoU was integrated into the network. WIoU assesses the anchor frame’s quality through the implementation of a dynamic non-monotonic mechanism and the adoption of a more rational dynamic gradient gain allocation technique. This approach diminishes the competitive advantage of high-quality anchor frames for small targets, thereby mitigating the adverse effects of low-quality anchor frames on gradient calculations. Consequently, the loss function can allocate more focus to the standard quality of small target anchor frames, thereby enhancing the overall performance of the detector. The formula for WIoU v1 is presented in Equation (5):
$$
\begin{aligned}
L_{WIoUv1} &= R_{WIoU} \times L_{IoU}\\
R_{WIoU} &= \exp\!\left(\frac{(x - x_{gt})^2 + (y - y_{gt})^2}{\left(W_g^2 + H_g^2\right)^{*}}\right)\\
L_{IoU} &= 1 - IoU
\end{aligned}
$$
In Equation (5), x, y, x_gt, and y_gt denote the center point coordinates of the prediction box and the ground truth box, while W_g and H_g denote the width and height of the minimum frame that encloses both the predicted box and the ground truth box (the superscript * indicates that this term is detached from the gradient computation so that it does not hinder convergence). Moreover, the Intersection over Union (IoU ∈ [0, 1]) is utilized to quantify the extent of overlap between the anchor box and the target box in object detection tasks. L_WIoUv1 represents the WIoU loss function, while R_WIoU represents the penalty term added on top of the L_IoU loss function. The term R_WIoU significantly amplifies L_IoU for anchor boxes of standard quality.
In WIoU v3, a hyperparameter β , acting as an outlier to characterize the anchor frame mass, is introduced based on the v1 iteration. This β parameter plays a pivotal role in formulating a non-monotonic focusing coefficient γ , which effectively lessens the influence of low-quality anchor frames while amplifying the significance of high-quality anchor frames. Both σ and β are adjustable parameters that can be fine-tuned to cater to various model requirements.
$$
\gamma = \frac{\beta}{\sigma\,\alpha^{\beta - \delta}}, \qquad L_{WIoUv3} = \gamma\, L_{WIoUv1}
$$
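A sketch of WIoU v1 and v3 as written in Equations (5) and (6) follows. Detaching the enclosing-box term, estimating the mean IoU loss from the current batch (rather than a running mean), and the default hyperparameter values are simplifying assumptions made for illustration.

```python
import torch

def wiou_loss(pred, target, alpha: float = 1.9, sigma: float = 1.0, delta: float = 3.0,
              eps: float = 1e-7, version: int = 3) -> torch.Tensor:
    """Sketch of WIoU v1/v3 (Equations (5)-(6)). Boxes are (N, 4) in (x1, y1, x2, y2) format."""
    # IoU term.
    lt = torch.max(pred[:, :2], target[:, :2])
    rb = torch.min(pred[:, 2:], target[:, 2:])
    inter = (rb - lt).clamp(min=0).prod(dim=1)
    area_p = (pred[:, 2:] - pred[:, :2]).prod(dim=1)
    area_t = (target[:, 2:] - target[:, :2]).prod(dim=1)
    iou = inter / (area_p + area_t - inter + eps)
    l_iou = 1 - iou

    # R_WIoU: center distance normalized by the (detached) enclosing-box diagonal.
    cp = (pred[:, :2] + pred[:, 2:]) / 2
    ct = (target[:, :2] + target[:, 2:]) / 2
    enc_wh = torch.max(pred[:, 2:], target[:, 2:]) - torch.min(pred[:, :2], target[:, :2])
    r_wiou = torch.exp(((cp - ct) ** 2).sum(dim=1) /
                       ((enc_wh ** 2).sum(dim=1).detach() + eps))
    loss_v1 = r_wiou * l_iou
    if version == 1:
        return loss_v1

    # v3: outlier degree beta and non-monotonic focusing coefficient gamma (Equation (6)).
    beta = l_iou.detach() / (l_iou.mean().detach() + eps)   # batch-mean approximation
    gamma = beta / (sigma * alpha ** (beta - delta))
    return gamma * loss_v1
```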

4. Experiment

4.1. Related Indexes

In this study, evaluation metrics from the COCO dataset were utilized to examine the detection outcomes of remote sensing small targets, including Precision, Recall, mAP, and GFLOPs. These metrics are recognized as classic and widely employed indicators in the field of remote sensing.
Precision and Recall values are derived from Equation (7), where TP denotes the count of accurately identified small targets, FP signifies the count of incorrectly identified small targets, and FN represents the count of undetected small targets. Precision reflects the proportion of correct detections among all predicted targets, while Recall indicates the proportion of ground-truth small targets that are successfully detected.
$$
P = \frac{TP}{TP + FP}, \qquad R = \frac{TP}{TP + FN}
$$
mAP is a comprehensive evaluation indicator that takes into account the Precision achieved at different Recall rates. AP can be seen as the area under the P-R curve, which takes the Recall rate as the horizontal axis and the Precision as the vertical axis, and mAP is the average of the AP values over all categories. The formula is given in Equation (8); the larger the area, the better the detection performance.
$$
AP = \int_{0}^{1} P(R)\, dR, \qquad mAP = \frac{1}{m}\sum_{i=1}^{m} AP_i
$$
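The per-class AP of Equation (8) can be computed as the area under the precision-recall curve once detections have been matched to ground truth at a given IoU threshold; the sketch below uses all-point interpolation and assumes the matching step has already been done.

```python
import numpy as np

def average_precision(scores: np.ndarray, is_tp: np.ndarray, num_gt: int) -> float:
    """AP for one class (Equation (8)): area under the P-R curve.
    scores: detection confidences; is_tp: 1 if a detection matched a ground truth box."""
    order = np.argsort(-scores)                     # rank detections by confidence
    tp = np.cumsum(is_tp[order])
    fp = np.cumsum(1 - is_tp[order])
    recall = tp / max(num_gt, 1)                    # Equation (7): R = TP / (TP + FN)
    precision = tp / np.maximum(tp + fp, 1e-12)     # Equation (7): P = TP / (TP + FP)
    # Integrate precision over recall (all-point interpolation).
    r = np.concatenate(([0.0], recall, [1.0]))
    p = np.concatenate(([0.0], precision, [0.0]))
    p = np.maximum.accumulate(p[::-1])[::-1]        # precision envelope (non-increasing)
    idx = np.where(r[1:] != r[:-1])[0]
    return float(np.sum((r[idx + 1] - r[idx]) * p[idx + 1]))

# mAP is then the mean of the per-class AP values: mAP = sum_i AP_i / m.
```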

4.2. Datasets

In this research, we selected VEDAI and AI-TOD, two well-known small target datasets, and partitioned them into training, testing, and validation sets in a ratio of 7:2:1. The subsequent summaries offer a concise overview of these datasets. The image dimensions were standardized to 1024 × 1024. Furthermore, we augmented the input images through a mosaic data augmentation technique, as sketched below. This involved randomly selecting four images from the dataset and applying transformations such as flipping, scaling, and adjusting the color space to each of them. Subsequently, the four modified images were combined and randomly cropped to create a new composite image. This approach significantly diversifies the object detection background and enhances the model’s resilience.
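A toy sketch of the mosaic step is given below; it only tiles four already-transformed images around a random center and omits the label/box bookkeeping and the final random crop, so it illustrates the idea rather than the exact training pipeline used here. OpenCV (cv2) is assumed for resizing.

```python
import random
import numpy as np
import cv2

def mosaic4(images, out_size: int = 1024) -> np.ndarray:
    """Tile four HxWx3 uint8 images into one composite around a random center point."""
    s = out_size
    canvas = np.full((s, s, 3), 114, dtype=np.uint8)      # grey background
    cx = random.randint(s // 4, 3 * s // 4)               # random mosaic center
    cy = random.randint(s // 4, 3 * s // 4)
    regions = [(0, 0, cx, cy), (cx, 0, s, cy), (0, cy, cx, s), (cx, cy, s, s)]
    for img, (x1, y1, x2, y2) in zip(images, regions):
        w, h = x2 - x1, y2 - y1
        canvas[y1:y2, x1:x2] = cv2.resize(img, (w, h))    # cv2.resize takes (width, height)
        # In a real pipeline the corresponding boxes are rescaled and shifted here.
    return canvas
```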
VEDAI: VEDAI (Vehicle Detection in Aerial Imagery: A Small Target Detection Benchmark) is an aerial image dataset sourced from the Utah Automated Geographic Reference Center (AGRC). The dataset consists of cropped images with a resolution of approximately 12.5 cm by 12.5 cm per pixel. Vehicles in this dataset are characterized by their small size and diverse types. Containing over 3700 annotated objects across more than 1200 images, VEDAI exhibits variations in orientation, lighting conditions, shadows, occlusions, and other factors. The images encompass diverse backgrounds such as forests, roads, fields, and construction sites. There are nine vehicle categories present, including “plane”, “truck”, “boat”, “pick-up”, and others, as illustrated in Figure 6. Each image is available in both 1024 × 1024 and 512 × 512 dimensions, with the former size being utilized. On average, each image contains 5.5 vehicles, representing about 0.7% of the total pixel count in the image. This percentage is lower compared to many other remote sensing datasets, allowing for a comprehensive assessment of the impact of the proposed algorithm enhancements on remote sensing imagery.
Figure 6. Images illustrating the different categories of the dataset [43].
AI-TOD: The AI-TOD dataset is a large-scale benchmark for tiny object detection in aerial images [44]. It is constructed upon extensive aerial image datasets, including DOTA [45], xView [46], DIOR [47], and VisDrone [48], among others. Comprising 700,621 object instances across eight categories within 28,036 aerial images, AI-TOD has an average target size of 12.8 pixels. Notably, as depicted in Figure 7, targets larger than 16 pixels constitute 67.8% and 79.0% of the remote sensing datasets DOTA and DIOR, respectively. Most of these datasets predominantly feature large objects, rendering them less suitable for small target detection. In AI-TOD, 85% of targets are smaller than 16 pixels, with the largest targets measuring less than 64 pixels. The dataset encompasses eight classes of object instances, including some uncommon classes such as swimming pools and windmills, which are notably less prevalent than classes such as vehicles and boats, aligning with real-world scenarios and enhancing practical applicability.

4.3. Ablation Experiments

To validate the efficacy of ARSOD-YOLO, we conducted experiments on the VEDAI dataset. Building upon the results from YOLOv8n as a baseline, we systematically performed ablation experiments on various modules.
It is important to highlight that we employed the pre-training weights of YOLOv8n acquired after its training on the COCO dataset; the ablation results are listed in Table 1. The first group shows the performance of the baseline YOLOv8n, yielding an mAP50 of 0.712 and an mAP50–95 of 0.459. Subsequently, we iteratively added different modules. The second group showcases results after the addition of the jump connections. The third group presents results after the addition of AFEM. The fourth group demonstrates results after integrating AKSFFM, showcasing a 1.4% accuracy enhancement compared to the pre-addition scenario. The fifth group showcases results from experiments combining the jump connections and AKSFFM. The final set of experimental outcomes corresponds to the full ARSOD-YOLO.
Table 1 displays the outcomes of ablation experiments conducted on the VEDAI dataset employing various modules. As the network undergoes gradual enhancements, the accuracy of target detection steadily improves. Notably, the combination of BiFPN and AKSFFM resulted in the most significant accuracy boost, showcasing a 2.3% enhancement in target detection accuracy. Upon integrating all modules, the model’s target accuracy sees a notable improvement of 3.1%. Specifically, the effects of individual module additions are as follows: The inclusion of jump connections elevates the mAP50 value by 0.6% and the mAP50–95 value by 0.2%. The integration of AFEM leads to a 0.7% improvement. Subsequent to incorporating AKSFFM, there is a 1.4% increase in the mAP50 value. These results collectively indicate a positive influence of our enhancements on the experimental outcomes. Additionally, we visualize the accuracy variations following incremental module additions using a line chart in Figure 8. In the chart, A, B, C, and D signify the magnitude of change in mAP50 after the gradual addition of BiFPN, AFEM, AKSFFM, and WIoU, illustrating a consistent enhancement in accuracy with the inclusion of each module.

4.4. Comparative Experiments on Loss Function

In our study, we conducted experiments on several loss functions using the VEDAI dataset, with results reported in Table 2. It is noteworthy that these experiments were carried out on network models already enhanced with BiFPN, AFEM, and AKSFFM. The table showcases the results of replacing the CIoU loss function in YOLOv8 with SIoU, Focal_CIoU, GIoU, and WIoU, respectively. From the experimental results, it is evident that when the loss function in the network model is WIoU, the mAP50 and mAP50–95 values are maximized, with an increase of 0.8% for mAP50 and 0.6% for mAP50–95 compared to the original CIoU. Conversely, the experimental results after introducing other loss functions showed varying degrees of decline. We posit that the WIoU loss function is better suited to the characteristics and distribution of small targets in remote sensing image datasets, as it can more accurately measure the alignment between predicted and ground truth bounding boxes. Additionally, the WIoU loss function’s adaptability to different shapes through IoU calculations, unrestricted by specific shapes or sizes, enhances the generalization ability of small target detection.
Figure 8. Visualization of mAP effects of different modules.

4.5. Comparative Experiments

Table 3 presents the per-category detection results of the network model before and after enhancement on the VEDAI dataset. YOLOv8n represents the original experimental results, while ARSOD-YOLO illustrates the detection outcomes of the improved network model. The car, pick-up, and boat categories saw a slight decrease in mAP50, whereas all other categories exhibited significant improvements. Notably, the tractor category saw a substantial increase, with a remarkable 10.6% rise in mAP50. On average, there was a 3.1% increase in mAP50 across all categories. Precision also witnessed a significant boost, escalating by 5.7%. These experimental outcomes demonstrate the substantial enhancement brought about by the proposed algorithm on the VEDAI dataset, effectively elevating the detection accuracy of small targets in remote sensing applications.
Figure 9 displays the PR curves for each category in the VEDAI dataset, where P represents Precision and R represents Recall. It is evident that most categories exhibit high detection accuracies, with categories such as “pick-up”, “large”, and “camping_car” achieving over 80%.
The experimental results and PR curves for the AI-TOD dataset are depicted in Table 4 and Figure 10. On the AI-TOD dataset, the enhanced model demonstrated a 6.1% increase in mAP50 compared to the pre-improvement baseline. With the exception of the bridge category, which saw a slight decline, the other categories showed notable improvements. Notably, the airplane category experienced a 5.9% enhancement in mAP50, and the swimming pool category saw a substantial improvement of nearly 30%. The remarkable enhancement in the swimming pool category could be attributed to its relatively small sample size; there were only 42 swimming pool targets within the test dataset. Conversely, the test set contained a considerable number of vehicle targets, reaching 61,107, with a corresponding mAP50 increase of 3.6%.
The PR curve of our proposed ARSOD-YOLO in the AI-TOD dataset is illustrated in Figure 10. The detection accuracy within the AI-TOD dataset is typically low, with an average mAP50 across all categories amounting to 0.476. The windmill category exhibited the lowest mAP50 at just 0.125, while the storage-tank category displayed the highest at 0.753.
To validate the performance of ARSOD-YOLO, we conducted a comparison with various mainstream detection models using the VEDAI dataset. As shown in Table 5, the mAP50 and mAP50–95 metrics for YOLOv8n are 71.2% and 46.2%, respectively, with 8.1 GFLOPs. Following the enhancements, the improved model exhibits increased GFLOPs at 25.1, with mAP50 and mAP50–95 values reaching 74.3% and 46.9%, respectively, marking a 3.1% mAP50 improvement. In the first group of experiments, the YOLOv3-t algorithm achieved detection accuracies of 55.3% for mAP50 and 30% for mAP50–95. For the YOLOv5n algorithm in the second experimental group, although the Precision value increased slightly, the other accuracy metrics decreased, alongside a reduced GFLOPs count. The upgraded YOLOv9t model demonstrated inferior detection accuracy compared to YOLOv8n, with both mAP values lower than those of YOLOv8n despite higher GFLOPs. Furthermore, YOLOv10 achieved mAP50 and mAP50–95 scores of 72.6% and 46.5%, respectively, showcasing improved accuracy compared to YOLOv8n but still falling short by 1.7% and 0.4% in comparison to our approach. Algorithms based on Transformers, such as RT-DETR and TPH-YOLO, displayed lower accuracy than our proposed method, with significantly higher GFLOPs. Likewise, the single-stage SSD and two-stage Fast-RCNN algorithms also exhibited lower accuracy levels compared to our proposed algorithm.
To assess the detection accuracy of ARSOD-YOLO on the AI-TOD dataset, a comparative experiment was conducted. Table 6 presents the detection accuracy of ARSOD-YOLO alongside several other algorithm models, along with the associated GFLOPs requirements. The results for the YOLO series are detailed in the table. In terms of detection accuracy, ARSOD-YOLO consistently outperforms the other algorithms, with YOLOv10n following closely while being more lightweight. The mAP values of the Transformer-based algorithm are significantly lower than those of our designed algorithm, and its parameter count is notably higher than that of our proposed approach. Compared to the upgraded versions of YOLOv8n, namely YOLOv9t and YOLOv10n, our algorithm also demonstrates superior detection accuracy.
Figure 10. PR curves for categories in the AI-TOD dataset.

4.6. Visual Experiment on VEDAI Dataset

In our study, we conducted a comparative analysis between ARSOD-YOLO and YOLOv3, YOLOv5, and YOLOv10. Four images depicting diverse scenes and targets from the VEDAI dataset were selected for evaluation. The detection outcomes were visually represented using distinct color-coded bounding boxes, with category labels and confidence scores annotated on each bounding box.
As illustrated in Figure 11, our findings revealed notable disparities among the models.
For Group 1, among the models evaluated, only ARSOD-YOLO successfully identified a distinct class of objects highlighted in green, a task which eluded the other models. This underscores ARSOD-YOLO’s proficiency in detecting seemingly inconspicuous targets, thereby exhibiting superior detection accuracy. For Group 2, YOLOv3, YOLOv5, and YOLOv10 failed to detect the camping_car across different positions, while ARSOD-YOLO accurately detected it. This underscores ARSOD-YOLO’s efficacy in recognizing challenging targets where other models faltered. For Group 3, YOLOv10 missed detecting a van, while YOLOv5 misclassified it as a car. In contrast, both YOLOv3 and ARSOD-YOLO correctly identified the van, showcasing their robust detection capabilities. For Group 4, ARSOD-YOLO demonstrated comprehensive detection by successfully identifying all targets in the images, whereas the other algorithms exhibited varying degrees of missed detections and misclassifications. These results underscore the superior detection performance of ARSOD-YOLO across diverse scenarios, outperforming YOLOv3, YOLOv5, and YOLOv10 in challenging detection tasks.
Figure 11. Visual comparison of object detection models: ARSOD-YOLO vs. YOLOv3, YOLOv5, and YOLOv10 on VEDAI dataset images.

5. Discussion

The experimental findings on the VEDAI and AI-TOD datasets underscore the superior performance of ARSOD-YOLO in contrast to existing object detection algorithms. Nevertheless, our experiments have brought to light certain limitations, revealing that not all classes demonstrate improved accuracy when specific modules are combined as opposed to when utilized individually. Future research endeavors will concentrate on fine-tuning hyperparameters, refining network structures, and addressing these identified issues. Furthermore, while our study leveraged publicly available datasets, upcoming efforts will strive to develop proprietary datasets or enhance existing ones to facilitate data augmentation. This may involve simulating images under diverse weather conditions and lighting scenarios to enrich the training data. Moreover, plans are in progress to further refine the model’s lightweight attributes, aiming to enhance its efficiency and applicability in real-world remote sensing applications.

6. Conclusions

The detection of small targets in remote sensing images is often hampered by insufficient features, resulting in feature loss and heightened sensitivity to positional deviations during continuous convolution. In response to these challenges, we propose the ARSOD-YOLO network model, an advanced iteration building upon YOLOv8n. This model introduces significant enhancements in network architecture, modules, and loss functions. Drawing inspiration from the BiFPN concept, we integrate skip connections to retain crucial information. The proposed feature enhancement module, AFEM, allows for adaptive adjustment of feature weights and management of multiple branches, effectively addressing feature scarcity and enhancing discrimination between remote sensing backgrounds and targets. Furthermore, we incorporate the AKSFFM module to leverage attention weights at various scales, facilitating the learning of feature dependencies and enabling improved fusion of small target features; this helps mitigate the information loss encountered during convolution in remote sensing scenarios. To enhance the precision of small target localization, we utilize WIoU (Wise IoU) as the bounding box loss function. Our proposed ARSOD-YOLO thus enhances the YOLOv8 network to tackle the challenge of low accuracy in detecting small targets in remote sensing. Experimental results on the VEDAI and AI-TOD datasets, together with comparisons against several popular object detection algorithms, indicate that ARSOD-YOLO achieves a favorable balance between accuracy and speed.

Author Contributions

Conceptualization, Y.Q.; data curation, Y.Q. and X.Z.; investigation, T.L. and P.J.; methodology, Y.Q.; software, Y.Q. and T.L.; writing—original draft, Y.Q.; writing—review and editing, X.H. and G.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The original contributions presented in the study are included in the article; further inquiries can be directed to the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Li, Z.; Wang, Y.; Zhang, N.; Zhang, Y.; Zhao, Z.; Xu, D.; Ben, G.; Gao, Y. Deep learning-based object detection techniques for remote sensing images: A survey. Remote Sens. 2022, 14, 2385. [Google Scholar] [CrossRef]
  2. Zhang, M.; Li, W.; Zhang, Y.; Tao, R.; Du, Q. Hyperspectral and LiDAR data classification based on structural optimization transmission. IEEE Trans. Cybern. 2022, 53, 3153–3164. [Google Scholar] [CrossRef] [PubMed]
  3. Shi, T.; Gong, J.; Hu, J.; Zhi, X.; Zhang, W.; Zhang, Y.; Zhang, P.; Bao, G. Feature-enhanced CenterNet for small object detection in remote sensing images. Remote Sens. 2022, 14, 5488. [Google Scholar] [CrossRef]
  4. Ran, Q.; Wang, Q.; Zhao, B.; Wu, Y.; Pu, S.; Li, Z. Lightweight oriented object detection using multiscale context and enhanced channel attention in remote sensing images. IEEE J. Sel. Top. Appl. Earth Obs. Remote. Sens. 2021, 14, 5786–5795. [Google Scholar] [CrossRef]
  5. Ruan, H.; Qian, W.; Zheng, Z.; Peng, Y. A Decoupled Semantic–Detail Learning Network for Remote Sensing Object Detection in Complex Backgrounds. Electronics 2023, 12, 3201. [Google Scholar] [CrossRef]
  6. Zhang, Y.; Li, W.; Sun, W.; Tao, R.; Du, Q. Single-source domain expansion network for cross-scene hyperspectral image classification. IEEE Trans. Image Process. 2023, 32, 1498–1512. [Google Scholar] [CrossRef]
  7. Zou, Z.; Chen, K.; Shi, Z.; Guo, Y.; Ye, J. Object detection in 20 years: A survey. Proc. IEEE 2023, 111, 257–276. [Google Scholar] [CrossRef]
  8. Zheng, Z.; Zhong, Y.; Wang, J.; Ma, A. Foreground-aware relation network for geospatial object segmentation in high spatial resolution remote sensing imagery. IEEE Trans. Pattern Anal. Mach. Intell. 2023, 45, 4096–4105. [Google Scholar] [CrossRef]
  9. Pang, J.; Li, C.; Shi, J.; Xu, Z.; Feng, H. R2-CNN: Fast tiny object detection in large-scale remote sensing images. arXiv 2019, arXiv:1902.06042. [Google Scholar]
  10. Zhang, W.; Wang, S.; Thachan, S.; Chen, J.; Qian, Y. Deconv R-CNN for small object detection on remote sensing images. In Proceedings of the IGARSS 2018 IEEE International Geoscience and Remote Sensing Symposium, Valencia, Spain, 22–27 July 2018; pp. 2483–2486. [Google Scholar]
  11. Aggarwal, A.; Mittal, M.; Battineni, G. Generative adversarial network: An overview of theory and applications. Int. J. Inf. Manag. Data Insights 2021, 1, 100004. [Google Scholar] [CrossRef]
  12. Courtrai, L.; Pham, M.T.; Lefèvre, S. Small object detection in remote sensing images based on super-resolution with auxiliary generative adversarial networks. Remote Sens. 2020, 12, 3152. [Google Scholar] [CrossRef]
  13. Bai, Y.; Zhang, Y.; Ding, M.; Ghanem, B. Sod-mtgan: Small object detection via multi-task generative adversarial network. In Proceedings of the European conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 206–221. [Google Scholar]
  14. Wu, J.; Xu, S. From point to region: Accurate and efficient hierarchical small object detection in low-resolution remote sensing images. Remote Sens. 2021, 13, 2620. [Google Scholar] [CrossRef]
  15. Ma, T.; Yang, Z.; Wang, J.; Sun, S.; Ren, X.; Ahmad, U. Infrared small target detection network with generate label and feature mapping. IEEE Geosci. Remote. Sens. Lett. 2022, 19, 1–5. [Google Scholar] [CrossRef]
  16. Zhao, Y.; Zhao, J.; Zhao, C.; Xiong, W.; Li, Q.; Yang, J. Robust real-time object detection based on deep learning for very high resolution remote sensing images. In Proceedings of the IGARSS 2019 IEEE International Geoscience and Remote Sensing Symposium, Yokohama, Japan, 14 November 2019; pp. 1314–1317. [Google Scholar]
  17. Bell, S.; Zitnick, C.L.; Bala, K.; Girshick, R. Inside-outside net: Detecting objects in context with skip pooling and recurrent neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 2874–2883. [Google Scholar]
  18. Yang, X.; Yang, J.; Yan, J.; Zhang, Y.; Zhang, T.; Guo, Z.; Sun, X.; Fu, K. Scrdet: Towards more robust detection for small, cluttered and rotated objects. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 8232–8241. [Google Scholar]
  19. Tang, X.; Du, D.K.; He, Z.; Liu, J. Pyramidbox: A context-assisted single shot face detector. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 797–813. [Google Scholar]
  20. Lin, T.Y.; Dollár, P.; Girshick, R.; He, K.; Hariharan, B.; Belongie, S. Feature pyramid networks for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 2117–2125. [Google Scholar]
  21. Liu, S.; Qi, L.; Qin, H.; Shi, J.; Jia, J. Path aggregation network for instance segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 8759–8768. [Google Scholar]
  22. Liu, S.; Huang, D.; Wang, Y. Learning spatial fusion for single-shot object detection. arXiv 2019, arXiv:1911.09516. [Google Scholar]
  23. Ghiasi, G.; Lin, T.Y.; Le, Q.V. Nas-fpn: Learning scalable feature pyramid architecture for object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 7036–7045. [Google Scholar]
  24. Zhang, S.; Wen, L.; Bian, X.; Lei, Z.; Li, S.Z. Single-shot refinement neural network for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 4203–4212. [Google Scholar]
  25. Tan, M.; Pang, R.; Le, Q.V. Efficientdet: Scalable and efficient object detection. In Proceedings of the IEEE/CVF conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 10781–10790. [Google Scholar]
  26. Lim, J.S.; Astrid, M.; Yoon, H.J.; Lee, S.I. Small object detection using context and attention. In Proceedings of the IEEE 2021 international Conference on Artificial Intelligence in Information and Communication (ICAIIC), Jeju Island, Republic of Korea, 13–16 April 2021; pp. 181–186. [Google Scholar]
  27. Liu, Z.; Gao, G.; Sun, L.; Fang, Z. HRDNet: High-resolution Detection Network for Small Objects. In Proceedings of the IEEE International Conference on Multimedia and Expo (ICME), Shenzhen, China, 5–9 July 2021. [Google Scholar]
  28. Nie, H.; Pang, H.; Ma, M.; Zheng, R. A Lightweight Remote Sensing Small Target Image Detection Algorithm Based on Improved YOLOv8. Sensors 2024, 24, 2952. [Google Scholar] [CrossRef]
  29. Zhao, D.; Shao, F.; Liu, Q.; Yang, L.; Zhang, H.; Zhang, Z. A Small Object Detection Method for Drone-Captured Images Based on Improved YOLOv7. Remote Sens. 2024, 16, 1002. [Google Scholar] [CrossRef]
  30. Wang, P.; Sun, X.; Diao, W.; Fu, K. FMSSD: Feature-merged single-shot detection for multiscale objects in large-scale remote sensing imagery. IEEE Trans. Geosci. Remote. Sens. 2019, 58, 3377–3390. [Google Scholar] [CrossRef]
  31. Liang, X.; Zhang, J.; Zhuo, L.; Li, Y.; Tian, Q. Small object detection in unmanned aerial vehicle images using feature fusion and scaling-based single shot detector with spatial context analysis. IEEE Trans. Circuits Syst. Video Technol. 2019, 30, 1758–1770. [Google Scholar] [CrossRef]
  32. Bai, Y.; Li, R.; Gou, S.; Zhang, C.; Chen, Y.; Zheng, Z. Cross-connected bidirectional pyramid network for infrared small-dim target detection. IEEE Geosci. Remote. Sens. Lett. 2022, 19, 1–5. [Google Scholar] [CrossRef]
  33. Li, Y.; Huang, Q.; Pei, X.; Chen, Y.; Jiao, L.; Shang, R. Cross-layer attention network for small object detection in remote sensing imagery. IEEE J. Sel. Top. Appl. Earth Obs. Remote. Sens. 2020, 14, 2148–2161. [Google Scholar] [CrossRef]
  34. Gong, Y.; Yu, X.; Ding, Y.; Peng, X.; Zhao, J.; Han, Z. Effective fusion factor in FPN for tiny object detection. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA, 3–8 January 2021; pp. 1160–1168. [Google Scholar]
  35. Cao, J.; Bao, W.; Shang, H.; Yuan, M.; Cheng, Q. GCL-YOLO: A GhostConv-based lightweight yolo network for UAV small object detection. Remote Sens. 2023, 15, 4932. [Google Scholar] [CrossRef]
  36. Liu, K.; Huang, J.; Li, X. Eagle-eye-inspired attention for object detection in remote sensing. Remote Sens. 2022, 14, 1743. [Google Scholar] [CrossRef]
  37. Li, P.; Che, C. SeMo-YOLO: A multiscale object detection network in satellite remote sensing images. In Proceedings of the IEEE 2021 International Joint Conference on Neural Networks (IJCNN), Shenzhen, China, 18–22 July 2021; pp. 1–8. [Google Scholar]
  38. Wan, D.; Lu, R.; Wang, S.; Shen, S.; Xu, T.; Lang, X. Yolo-hr: Improved yolov5 for object detection in high-resolution optical remote sensing images. Remote Sens. 2023, 15, 614. [Google Scholar] [CrossRef]
  39. Qu, Z.; Zhu, F.; Qi, C. Remote sensing image target detection: Improvement of the YOLOv3 model with auxiliary networks. Remote Sens. 2021, 13, 3908. [Google Scholar] [CrossRef]
  40. Xu, D.; Wu, Y. Improved YOLO-V3 with DenseNet for multi-scale remote sensing target detection. Sensors 2020, 20, 4276. [Google Scholar] [CrossRef]
  41. Zheng, Z.; Wang, P.; Ren, D.; Liu, W.; Ye, R.; Hu, Q.; Zuo, W. Enhancing Geometric Factors in Model Learning and Inference for Object Detection and Instance Segmentation. IEEE Trans. Cybern. 2020, 52, 8574–8586. [Google Scholar]
  42. Liu, Z.; Gao, G.; Sun, L.; Fang, Z. Distance-IoU Loss: Faster and Better Learning for Bounding Box Regression. Proc. AAAI Conf. Artif. Intell. 2019, 34, 12993–13000. [Google Scholar]
  43. Razakarivony, S.; Jurie, F. Vehicle detection in aerial imagery: A small target detection benchmark. J. Vis. Commun. Image Represent. 2016, 34, 187–203. [Google Scholar] [CrossRef]
  44. Liu, C.; Gao, G.; Huang, Z.; Hu, Z.; Liu, Q.; Wang, Y. YOLC: You Only Look Clusters for Tiny Object Detection in Aerial Images. IEEE Trans. Intell. Transp. Syst. 2024, 25, 13863–13875. [Google Scholar] [CrossRef]
  45. Xia, G.S.; Bai, X.; Ding, J.; Zhu, Z.; Belongie, S.; Luo, J.; Datcu, M.; Pelillo, M.; Zhang, L. DOTA: A Large-scale Dataset for Object Detection in Aerial Images. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–23 June 2018. [Google Scholar]
  46. Lam, D.; Kuzma, R.; Mcgee, K.; Dooley, S.; Laielli, M.; Klaric, M.; Bulatov, Y.; Mccord, B. xView: Objects in Context in Overhead Imagery. arXiv 2018, arXiv:1802.07856. [Google Scholar]
  47. Li, K.; Wan, G.; Cheng, G.; Meng, L.; Han, J. Object detection in optical remote sensing images: A survey and a new benchmark. ISPRS J. Photogramm. Remote. Sens. 2020, 159, 296–307. [Google Scholar] [CrossRef]
  48. Zhu, P.; Wen, L.; Bian, X.; Ling, H.; Hu, Q. Vision Meets Drones: A Challenge; Springer: Cham, Switzerland, 2018. [Google Scholar]
Figure 7. Comparison of AI-TOD with other benchmark datasets [44].
Figure 9. PR curves for categories in the VEDAI dataset.
Table 1. Ablation experiments on VEDAI dataset.

Dataset | BiFPN | AFEM | AKSFFM | WIoU | mAP50 (%) | mAP50–95 (%)
VEDAI | × | × | × | × | 0.712 | 0.459
VEDAI | ✓ | × | × | × | 0.718 | 0.461
VEDAI | ✓ | ✓ | × | × | 0.725 | 0.472
VEDAI | ✓ | ✓ | ✓ | × | 0.739 | 0.466
VEDAI | × | × | ✓ | × | 0.729 | 0.469
VEDAI | ✓ | × | ✓ | × | 0.735 | 0.463
VEDAI | ✓ | ✓ | ✓ | ✓ | 0.743 | 0.469
Table 2. Comparative experiments on loss function.

Method | mAP50 | mAP50–95
CIoU | 0.735 | 0.463
SIoU | 0.704 | 0.449
Focal_CIoU | 0.734 | 0.455
GIoU | 0.729 | 0.46
WIoU | 0.743 | 0.469
Table 3. Comparison of target detection results before and after improvement on the VEDAI dataset.

Model | Categories | P (%) | R (%) | mAP50 (%)
YOLOv8n | car | 0.787 | 0.87 | 0.922
YOLOv8n | truck | 0.537 | 0.571 | 0.616
YOLOv8n | pick-up | 0.692 | 0.836 | 0.853
YOLOv8n | tractor | 0.715 | 0.6 | 0.609
YOLOv8n | camping_car | 0.726 | 0.784 | 0.786
YOLOv8n | boat | 0.499 | 0.667 | 0.706
YOLOv8n | van | 0.636 | 0.699 | 0.759
YOLOv8n | other | 0.448 | 0.383 | 0.399
YOLOv8n | large | 1 | 0.617 | 0.756
YOLOv8n | all | 0.671 | 0.67 | 0.712
ARSOD-YOLO | car | 0.82 | 0.84 | 0.895
ARSOD-YOLO | truck | 0.638 | 0.543 | 0.636
ARSOD-YOLO | pick-up | 0.686 | 0.834 | 0.8
ARSOD-YOLO | tractor | 0.74 | 0.568 | 0.715
ARSOD-YOLO | camping_car | 0.694 | 0.789 | 0.816
ARSOD-YOLO | boat | 0.623 | 0.63 | 0.675
ARSOD-YOLO | van | 0.819 | 0.72 | 0.797
ARSOD-YOLO | other | 0.611 | 0.468 | 0.523
ARSOD-YOLO | large | 0.923 | 0.667 | 0.831
ARSOD-YOLO | all | 0.728 | 0.673 | 0.743
Table 4. Comparison of target detection results before and after improvement on the AI-TOD dataset.

Model | Categories | P (%) | R (%) | mAP50 (%)
YOLOv8n | airplane | 0.434 | 0.616 | 0.589
YOLOv8n | bridge | 0.562 | 0.334 | 0.331
YOLOv8n | storage-tank | 0.843 | 0.652 | 0.739
YOLOv8n | ship | 0.603 | 0.64 | 0.629
YOLOv8n | swimming-pool | 1 | 0.0335 | 0.0604
YOLOv8n | vehicle | 0.672 | 0.681 | 0.671
YOLOv8n | person | 0.671 | 0.282 | 0.26
YOLOv8n | windmill | 0.397 | 0.0152 | 0.0591
YOLOv8n | all | 0.616 | 0.407 | 0.417
ARSOD-YOLO | airplane | 0.714 | 0.6 | 0.648
ARSOD-YOLO | bridge | 0.518 | 0.289 | 0.307
ARSOD-YOLO | storage-tank | 0.849 | 0.671 | 0.748
ARSOD-YOLO | ship | 0.731 | 0.596 | 0.668
ARSOD-YOLO | swimming-pool | 0.439 | 0.31 | 0.307
ARSOD-YOLO | vehicle | 0.746 | 0.682 | 0.707
ARSOD-YOLO | person | 0.535 | 0.267 | 0.307
ARSOD-YOLO | windmill | 0.331 | 0.182 | 0.136
ARSOD-YOLO | all | 0.608 | 0.45 | 0.478
Table 5. Comparative experiment on VEDAI dataset.

Method | P (%) | R (%) | mAP50 (%) | mAP50–95 (%) | FLOPs (G)
YOLOv3t | 0.51 | 0.564 | 0.553 | 0.3 | 13
YOLOv5n | 0.602 | 0.511 | 0.517 | 0.302 | 5.9
YOLOv9t | 0.817 | 0.591 | 0.654 | 0.429 | 11.1
YOLOv8n | 0.671 | 0.67 | 0.712 | 0.462 | 8.1
YOLOv10n | 0.729 | 0.658 | 0.726 | 0.465 | 8.3
RT-DETR | 0.455 | 0.436 | 0.455 | 0.436 | 103.56
Fast-RCNN | – | – | 0.459 | – | 196.2
SSD | – | – | 0.451 | – | –
TPH-YOLO | – | – | 0.584 | 0.338 | 270.9
Ours | 0.728 | 0.673 | 0.743 | 0.469 | 25.1
Table 6. Comparative experiment on AI-TOD dataset.

Method | P (%) | R (%) | mAP50 (%) | mAP50–95 (%) | FLOPs (G)
YOLOv3t | 0.475 | 0.298 | 0.309 | 0.131 | 13.1
YOLOv5n | 0.698 | 0.311 | 0.311 | 0.342 | 6.1
YOLOv8n | 0.616 | 0.407 | 0.417 | 0.18 | 8.2
YOLOv9t | 0.592 | 0.277 | 0.384 | 0.299 | 12
YOLOv10n | 0.538 | 0.203 | 0.455 | 0.203 | 8.2
RT-DETR | 0.614 | 0.229 | 0.191 | 0.121 | 108.1
HANet | – | – | 0.529 | 0.210 | –
Ours | 0.608 | 0.45 | 0.478 | 0.209 | 25.3

Share and Cite

MDPI and ACS Style

Qiu, Y.; Zheng, X.; Hao, X.; Zhang, G.; Lei, T.; Jiang, P. ARSOD-YOLO: Enhancing Small Target Detection for Remote Sensing Images. Sensors 2024, 24, 7472. https://doi.org/10.3390/s24237472

