Article

LARS: Remote Sensing Small Object Detection Network Based on Adaptive Channel Attention and Large Kernel Adaptation

1 School of Automation, Chongqing University of Posts and Telecommunications, Chongqing 400050, China
2 School of Computer Science and Technology, Chongqing University of Posts and Telecommunications, Chongqing 400050, China
* Author to whom correspondence should be addressed.
Remote Sens. 2024, 16(16), 2906; https://doi.org/10.3390/rs16162906
Submission received: 31 May 2024 / Revised: 26 July 2024 / Accepted: 7 August 2024 / Published: 8 August 2024
(This article belongs to the Special Issue Recent Advances in Remote Sensing Image Processing Technology)
Figure 1. The overall architecture of LARS.
Figure 2. Structure of the ACA block.
Figure 3. Structure of the LKA block.
Figure 4. Average detection accuracy per category on the DOTA-v2.0 dataset. Each point represents the accuracy of a comparison model in a given category, the horizontal axis represents different models, and the vertical axis represents the AP50 value for each category.
Figure 5. Average detection accuracy for all categories on the DOTA-v2.0 dataset. LARS is represented by the red bar, and the highest detection accuracy of 63.01 was achieved on this dataset.
Figure 6. Comparison of mAP50 and mAP95 metrics on the VisDrone dataset.
Figure 7. Comparison of results from different strategies for the decomposition of large kernels, using the (kernel, dilation) format.
Figure 8. Comparison of detection performance with different blocks added.
Figure 9. Visualization of detection performance after adding different blocks, with the parts of the detection results where it is difficult to find differences circled in ellipses.
Figure 10. Evaluation metric analysis for the proposed model.
Figure 11. Comparison of normalized confusion matrices without (Left) and with (Right) the LBN block on the DOTA-v2.0 dataset, where BG represents the background.
Figure 12. Precision–recall curve and F1–confidence curve for the proposed model.
Figure 13. Examples of test results on the DOTA-v2.0 and VisDrone datasets.

Abstract

In the field of object detection, small object detection in remote sensing images is an important and challenging task. Due to limitations in size and resolution, most existing methods often suffer from localization blurring. To address the above problem, this paper proposes a remote sensing small object detection network based on adaptive channel attention and large kernel adaptation. This approach aims to enhance multi-channel information mining and multi-scale feature extraction to alleviate the problem of localization blurring. To enhance the model’s focus on the features of small objects in remote sensing at varying scales, this paper introduces an adaptive channel attention block. This block applies adaptive attention weighting based on the input feature dimensions, guiding the model to better focus on local information. To mitigate the loss of local information by large kernel convolutions, a large kernel adaptive block is designed. The block dynamically adjusts the surrounding spatial receptive field based on the context around the detection area, improving the model’s ability to extract information around remote sensing small objects. To address the recognition confusion during the sample classification process, a layer batch normalization method is proposed. This method enhances the consistency analysis capabilities of adaptive learning, thereby reducing the decline in the model’s classification accuracy caused by sample misclassification. Experiments on the DOTA-v2.0, SODA-A and VisDrone datasets show that the proposed method achieves state-of-the-art performance.

1. Introduction

Object detection in remote sensing images is an important research direction in the field of computer vision, and its development and application are an important way to advance military and civilian remote sensing applications, with broad market prospects. One of the main difficulties in this task is that the acquisition methods and long imaging distances of remote sensing images result in objects that are small in size and have less distinctive features [1,2]. As shown in Table 1, detected objects can be classified by the size of the area they occupy, following Ref. [3]. This characteristic of small objects limits the amount of information available about each object and increases the difficulty of feature extraction, and small object detection has therefore received extensive academic attention [4,5].
Remote sensing image detection predominantly employs networks based on anchor-based and anchor-free structures. Anchor-based networks use predefined anchor boxes during the detection process, predicting the objects’ position and size relative to the anchor boxes to accomplish classification and localization [6,7,8]. Anchor-free models, on the other hand, directly regress the objects’ positions and avoid reliance on predefined anchors, thus enhancing the network’s adaptability to various object shapes and sizes [9,10,11]. Although both types have achieved notable results in general object detection, each has its strengths and weaknesses when processing small objects in remote sensing images. Anchor-based methods require the careful design and adjustment of the anchor boxes’ size and quantity, making the design and tuning of the network more complex. Anchor-free structures, by directly regressing the object’s position and size, avoid potential errors in the anchor matching process. Consequently, anchor-free networks are more favored for the detection of small objects. However, due to the variations in the size and proportion of objects in remote sensing images and the anchor-free networks’ reliance on local feature points, existing networks often fall short of expectations when processing remote sensing images. Therefore, a new type of detection network suitable for small objects in remote sensing imagery is needed, one that guides the model to focus more on the features of small objects and reduces the loss of local information around them, thereby alleviating localization blurring.
To enhance the model’s focus on the features of small objects in remote sensing images, an attention mechanism is introduced. This approach has seen considerable application in the field of remote sensing image detection [12]. The attention mechanism helps the model to focus on target regions, addressing challenges related to the varying scales, shapes, and orientations in remote sensing images, thereby improving the detection and recognition accuracy and robustness. While traditional attention mechanisms can enhance the model performance, they often overlook the inter-channel positional correlations [13]. Additionally, the use of fixed-size convolution operations in standard attention mechanisms to capture feature correlations can lead to the loss of local information for small objects, resulting in localization blurring. While it is essential to guide the model to focus on small objects, it is also important to consider a broader receptive field to extract more comprehensive features.
To extract a broader range of input features and capture more extensive contextual information, several researchers have employed larger convolution kernels [14,15,16]. By increasing the receptive field, these approaches consider more local features of the object; thus, they have achieved significant application and success in remote sensing image processing [17,18]. However, these methods often overlook the ranging context issue, which involves the local and global correlation relationships among objects of different sizes in remote sensing images. This oversight can lead to the loss of detailed information, resulting in the erroneous detection of small objects in remote sensing images.
Additionally, since batch normalization (BN) [19] relies on batch data during both training and testing, it can lead to changes in the batch statistical information during testing, affecting the model’s detection performance. In contrast, layer normalization (LN) [20] focuses more on the independence of individual samples. LN covers all feature channels for each sample independently, emphasizing the features of each sample. However, because LN does not consider the correlations between the samples within a batch, it may overlook some common features and statistical information shared among the batch samples. This can result in unstable normalization effects and cause classification confusion.
To address the above limitations, this paper proposes a remote sensing small object detection network based on adaptive channel attention and large kernel adaptation (LARS). The classic and representative anchor-free network YOLO is selected as the base architecture, which not only helps to validate the effectiveness of the proposed methods but also demonstrates their broad applicability within limited experiments. To address the insufficient attention to small object features at different scales of remote sensing images mentioned above, an adaptive channel attention (ACA) block is proposed. This block adjusts the convolutional kernel size based on the input feature channels and applies adaptive attention weighting, guiding the model to focus on local information. To overcome the information loss associated with large kernel convolutions when processing small objects, a large kernel adaptive (LKA) block is proposed. The LKA block decomposes a large kernel into several smaller convolutions, retaining a broad receptive field while preserving more detailed feature information. The ACA block enables the network to dynamically adjust the attention paid to different channels, improving the model’s sensitivity to small objects; the extracted attention information is then used to assign weights within the LKA block so that small object information is emphasized more effectively. The combination of the two ensures that each small object feature enhanced by the attention mechanism is also processed by the extended receptive field. This allows the model to better extract the contextual information surrounding small objects, enabling the network to accurately detect them over a wider range while reducing localization ambiguity. Considering the issue of sample correlation in traditional normalization methods, layer batch normalization (LBN) is proposed for normalization computation and is integrated into the ACA and LKA blocks. Finally, extensive experiments are conducted on the DOTA-v2.0 [21], SODA-A [3], and VisDrone [22] datasets, demonstrating the effectiveness of the LARS model and the design of each block.
The main contributions of this paper are as follows.
  • To address the issue of insufficient attention to remote sensing small object features when dealing with features of different scales, the ACA block is proposed. This block applies adaptive attention weighting based on the input feature dimensions, guiding the model to better focus on the local information.
  • The LKA block is designed to address the problem of the incorrect detection of remote sensing small objects caused by the loss of local information in remote sensing images due to large kernel convolutions. This block dynamically adjusts the surrounding spatial receptive field according to the ranging background around the detection area, and it is guided by the weight information extracted by the ACA block, enhancing the model’s ability to extract the contextual information around small objects.
  • The LBN method is designed to resolve the issue of classification confusion caused by the correlation between samples. This method improves the consistency analysis capabilities during adaptive learning, alleviating the decline in the model’s classification accuracy caused by sample misclassification.

2. Related Works

Small object detection in remote sensing images involves the task of detecting and localizing small-sized objects, which is often hindered by factors such as low object resolutions and noise. The current models for small object detection in remote sensing typically improve the performance by enhancing the search strategies, using region proposal methods, and performing loss function regression. Additionally, attention mechanisms and large-sized convolution kernels have been introduced to further improve the detection accuracy and robustness.
Search-based methods generate anchor boxes during the detection process through a sliding window approach, moving the window to the right or down by a certain step size until the entire image is covered. Jiang et al. [23] addressed the automatic detection of transmission towers in drone images by proposing a DPM-based detection model, enhancing the detection robustness. However, this approach neglects the global contextual information surrounding the object, making it difficult to accurately distinguish small objects, and it is prone to missed or false detections.
Region proposal-based methods generate candidate regions by segmenting and merging similar regions at the base level, followed by object detection and localization using deep learning models. Ren et al. [24] improved R-CNN to adapt it to the task of small object detection in optical remote sensing images. Lim et al. [25] proposed a context-aware object detection method to address the challenge of limited small object information. Although these methods have improved the detection performance to some extent, they tend to produce a large number of inaccurate candidate regions when processing remote sensing images. This leads to localization blurring and recognition errors for small objects.
Methods based on loss function regression extract image features using a feature extractor and directly predict the object’s bounding box position and size by optimizing the loss function. This approach eliminates the need for additional candidate region generation and classification calculations, providing a more direct and efficient localization method. Yan et al. [26] proposed a multi-level feature fusion network to address the issues of insufficient information and background noise in the detection of dim small objects in remote sensing images. Fan et al. [27] introduced an anchor-free, efficient single-stage object detection method for optical remote sensing images to tackle the challenges of multi-scale objects and complex backgrounds in remote sensing imagery. However, the bounding boxes generated by this method often lack precise localization, leading to instability in the regression model’s prediction of the bounding box positions and sizes.
Subsequently, attention mechanisms were introduced to enhance the model’s focus on important features, thereby improving the recognition of small objects in remote sensing images. Du et al. [28] addressed the issue of small object sizes and dense distributions in remote sensing imagery by designing an enhanced multi-scale feature fusion network based on spatial attention mechanisms. Paoletti et al. [29] proposed a new multi-attention guided network that uses detailed feature extractors and attention mechanisms to identify the most representative visual parts of an image, improving feature processing for remote sensing hyperspectral image classification. Yan et al. [30] explored the potential of low-cost sparse annotations and introduced an end-to-end RSI-SOD method that relies entirely on scribble annotations. Liu et al. [31] addressed the challenges posed by traditional feature pyramid networks in handling various scale variations in remote sensing images by proposing an attention-based multi-scale feature enhancement and fusion module algorithm. Although these methods can attend to some important feature information, they usually use fixed-size convolutions to calculate the correlations between the channels. Moreover, for high-resolution and multi-band remote sensing images, standard channel attention methods fail to adequately focus on small object features when processing features of different sizes, which can cause the loss of local information and thus blurred localization.
Several studies have found that large convolution kernels can cover a broader range of input features, which helps to capture a broader context. Wang et al. [16] proposed a large kernel convolutional object detection network based on feature capture enhancement and wide receptive field attention to address the issue whereby critical information in a small receptive field is not prominently highlighted. Dong et al. [32] introduced a novel Transformer with a large convolutional kernel decoding network to tackle the problems of blurred semantic information and inaccurate detail and boundary predictions in remote sensing images. Sharshar et al. [33] investigated an object detection model integrating the LSKNet backbone with the DiffusionDet header to solve the problems of small object detection, dense element management, and different orientation considerations in aerial images. Li et al. [17] proposed a lightweight large selective kernel network to address the issue of extracting prior knowledge in remote sensing scenes. Although these methods have achieved good results by considering a larger receptive field, the use of large kernel convolutions fails to effectively leverage the local and global ranging contexts of objects of different sizes in remote sensing images. This can lead to the loss of detailed information, resulting in the incorrect detection of small objects in remote sensing images.
In summary, although current remote sensing small object detection methods can obtain good results, challenges remain. Because small objects in remote sensing images vary greatly in size and scale, the localization blurring caused by insufficient attention to small object features and limited localization accuracy still needs to be addressed. Therefore, a new processing method suited to these problems needs to be explored.

3. Method

3.1. Network Overview

The overall model structure is divided into three parts: the backbone, neck, and head, as shown in Figure 1. In the backbone section, the ACA block is used to capture specific semantic features, such as color and texture, contained in different channels of the image. The adaptive weighting within this block allows the model to focus more on local information, guiding it towards the small object area. The LKA block then analyzes the local and global correlations between this area and the surrounding receptive field, accurately extracting high-level feature representations of the input image for the subsequent object detection task. The neck section uses a feature pyramid network (FPN) architecture for feature fusion and upsampling to further process the features extracted by the backbone, enhancing the model’s sensitivity to objects of different scales. The head section primarily extracts the category and location information of objects of different sizes, using three anchor-free detection heads for information fusion.
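As a structural illustration only, the following minimal PyTorch-style skeleton (an assumption of ours, not the authors’ code) shows how the three parts described above could be composed; the concrete backbone, neck, and head modules are left as injected components.

import torch.nn as nn

class LARSSkeleton(nn.Module):
    # Illustrative skeleton of the Figure 1 layout: backbone with ACA/LKA blocks,
    # an FPN-style neck, and three anchor-free detection heads.
    def __init__(self, backbone: nn.Module, neck: nn.Module, heads: nn.ModuleList):
        super().__init__()
        self.backbone = backbone  # stages containing ACA and LKA blocks (Sections 3.2 and 3.3)
        self.neck = neck          # FPN-style fusion and upsampling of multi-scale features
        self.heads = heads        # three anchor-free heads for objects of different sizes

    def forward(self, images):
        feats = self.backbone(images)   # multi-scale feature maps
        fused = self.neck(feats)        # fused pyramid features
        return [h(f) for h, f in zip(self.heads, fused)]  # per-scale class/box predictions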

3.2. ACA Block

Remote sensing images are typically composed of spectral data from multiple bands, each corresponding to a different spectral range. The channel information provides rich spectral features, which are crucial in distinguishing various land cover types, materials, and vegetation conditions. In contrast, an overemphasis on spatial features can lead to redundant information, causing the model to learn incorrect features. The proposed ACA block in this paper addresses the issue of focusing on regional features, and its structure is shown in Figure 2.
During training, the input is a feature tensor $x \in \mathbb{R}^{B \times C \times H \times W}$, where B is the batch size, C is the number of channels, and H and W are the height and width, respectively. Convolution operations are used for the initial feature extraction from the input image. To enhance the connections between the channels, all channels share the same learning parameters, i.e.,

$$\chi_i = \sigma\Big(\sum_{j=1}^{k} w^j x_i^j\Big), \quad x_i^j \in R_i^k$$
Here, W is a C × C parameter matrix in the general case; with parameter sharing, only the k weights $w^j$ are learned. For each channel feature $x_i$, only a receptive field of k units is considered, where $R_i^k$ denotes the set of k channels adjacent to $x_i$. To capture appropriate channel interaction information, the kernel size k could be adjusted manually for different receptive fields, but this is overly cumbersome. Therefore, an adaptive method is designed to automatically set the convolution kernel size k based on the number of input channels, enabling the adaptive convolution of features of different dimensions. A mapping is set between k and C:

$$k = \tau(C) = \left| \frac{\log_2(C) + b}{\eta} \right|_{odd}$$

Here, $|\cdot|_{odd}$ denotes the nearest odd number to its argument, and $\eta$ and b are the parameters of the mapping. Because the mapping $\tau$ is nonlinear, high-dimensional channels obtain longer-range interactions, while low-dimensional channels obtain shorter-range interactions.
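To make the adaptive weighting concrete, the following is a minimal PyTorch-style sketch of an ECA-style adaptive channel attention in the spirit of the ACA block: the kernel size is derived from the channel count via the mapping τ(C), and a shared 1D convolution produces per-channel weights. The class name, the global-average-pooling descriptor, and the default values η = 2 and b = 1 are illustrative assumptions rather than the paper’s exact configuration.

import math
import torch
import torch.nn as nn

class AdaptiveChannelAttention(nn.Module):
    # Sketch of an ACA-style block: the 1D kernel size k adapts to the channel count C.
    def __init__(self, channels, eta=2, b=1):
        super().__init__()
        # k = |(log2(C) + b) / eta|_odd  (here: truncate, then bump to the next odd number)
        k = int(abs((math.log2(channels) + b) / eta))
        k = k if k % 2 == 1 else k + 1
        self.avg_pool = nn.AdaptiveAvgPool2d(1)  # per-channel descriptor
        self.conv = nn.Conv1d(1, 1, kernel_size=k, padding=k // 2, bias=False)  # shared weights over k adjacent channels
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):                                  # x: (B, C, H, W)
        s = self.avg_pool(x)                               # (B, C, 1, 1)
        s = self.conv(s.squeeze(-1).transpose(1, 2))       # (B, 1, C): interactions among k neighboring channels
        w = self.sigmoid(s).transpose(1, 2).unsqueeze(-1)  # (B, C, 1, 1) attention weights
        return x * w                                       # re-weight the input channels

feat = torch.randn(2, 256, 64, 64)
print(AdaptiveChannelAttention(256)(feat).shape)           # torch.Size([2, 256, 64, 64])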

3.3. LKA Block

Using larger convolution kernels can increase the receptive field, allowing for the capture of more image information and thereby obtaining richer feature representations.
However, in the detection of small objects in remote sensing images, such a large receptive field can lead to overly mixed information, making it difficult to accurately capture the details of small objects and resulting in blurred and lost information. Therefore, this paper proposes the LKA block, whose internal KA block decomposes the original large kernel convolution into a sequence of smaller kernels, thereby enlarging the receptive field while producing a series of features with different long-range receptive fields. The structure of the LKA block is shown in Figure 3, and the internal KA block is described in Algorithm 1.
The determination of the number of decomposed kernels is a key aspect of the LKA block. An increase in the kernel size and dilation rate ensures that the receptive field expands sufficiently quickly. Therefore, this paper defines the kernel size k, dilation rate d, and receptive field R of the i-th convolution as follows:   
$$k_{i-1} \le k_i, \qquad d_1 = 1, \qquad d_{i-1} < d_i \le R_{i-1}$$

$$R_1 = k_1, \qquad R_i = d_i (k_i - 1) + R_{i-1}$$
Based on the above rules, this paper splits a convolution with a theoretical kernel size of 23 into two smaller convolutions with kernel sizes of 5 and 7 and dilation rates of 1 and 3, respectively; the resulting receptive fields are $R_1 = 5$ and $R_2 = 3 \times (7 - 1) + 5 = 23$, matching the original large kernel. This approach allows for more detailed feature extraction. A series of convolutions with different receptive fields is used to realize the above operations:
$$y_0 = X, \qquad y_{i+1} = F_i(y_i)$$
Here, $F_i$ denotes a depthwise convolution with kernel size $k_i$. Assuming that there are N decomposed convolution kernels, after the depthwise convolutions, a 1 × 1 convolution layer is used for further processing. The channels of each decomposed branch are reduced to 1/N of the original number and then concatenated. This enhances the model’s ability to capture features with different receptive field sizes, allowing the model to consider information at multiple scales simultaneously. Subsequently, max pooling and average pooling operations are performed separately on each channel of the features, enhancing the model’s perception of the feature information within each channel and making the model’s processing of each channel’s information more flexible.
$$S_{avg} = P_{avg}(Y), \qquad S_{max} = P_{max}(Y)$$

Here, $S_{avg}$ represents the spatial features obtained through average pooling, and $S_{max}$ represents the spatial features obtained through max pooling. To reflect the information interaction between different descriptors, these two spatial pooling features are concatenated. Then, a sigmoid activation function is applied to obtain an individual spatial selection mask for each decomposed kernel. This allows the model to adaptively select the required features of different sizes:

$$\hat{S} = \sigma(\mathrm{Concat}(S_{avg}, S_{max}))$$

Here, $\sigma(\cdot)$ represents the sigmoid function. The decomposed features are then weighted by their corresponding spatial selection masks and fused through a convolution layer $F(\cdot)$. Finally, the learned weight information is introduced to enhance the attention paid to the input features, and the element-wise product between the input features X and the fused attention features is output:

$$\hat{X} = X \cdot F\Big(\sum_{i=1}^{N} \hat{S}_i \cdot Y_i\Big)$$
Algorithm 1: KA Block: Core of LKA
Input: Image tensor x
Output: Processed tensor X
1: Initialize the convolutional layers;
2: Generate features F_1 = Conv_0(x), F_2 = Conv(F_1);
3: Obtain the concatenated features F_con = Concat(F_1, F_2);
4: Calculate the statistics S_avg = P_avg(F_con), S_max = P_max(F_con);
5: Obtain the aggregate weights S = σ(Concat(S_avg, S_max));
6: Obtain the weight matrix W = S[0] · F_1 + S[1] · F_2;
7: Calculate the weighted sum X = W · x;
8: return X;
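To connect Algorithm 1 with the equations above, the following is a minimal PyTorch-style sketch of a KA-style block under the decomposition chosen later in the ablations (a 5 × 5 depthwise convolution with dilation 1 followed by a 7 × 7 depthwise convolution with dilation 3, i.e., N = 2 branches reduced to 1/2 of the channels each). The module layout, channel split, and the kernel size of the mask convolution are our assumptions and follow an LSKNet-like design rather than the authors’ implementation.

import torch
import torch.nn as nn

class KABlock(nn.Module):
    # Sketch of the KA operation: decomposed large-kernel depthwise convs + spatial selection.
    def __init__(self, dim):
        super().__init__()
        self.conv0 = nn.Conv2d(dim, dim, 5, padding=2, groups=dim)                      # k=5, d=1 -> RF 5
        self.conv_spatial = nn.Conv2d(dim, dim, 7, padding=9, dilation=3, groups=dim)   # k=7, d=3 -> RF 23
        self.conv1 = nn.Conv2d(dim, dim // 2, 1)            # reduce branch 1 to 1/N of the channels
        self.conv2 = nn.Conv2d(dim, dim // 2, 1)            # reduce branch 2 to 1/N of the channels
        self.conv_squeeze = nn.Conv2d(2, 2, 7, padding=3)   # turns pooled maps into one mask per branch
        self.conv_out = nn.Conv2d(dim // 2, dim, 1)         # fusion layer F(.)

    def forward(self, x):
        f1 = self.conv0(x)                                  # F1 = Conv_0(x)
        f2 = self.conv_spatial(f1)                          # F2 = Conv(F1)
        f1, f2 = self.conv1(f1), self.conv2(f2)
        fcon = torch.cat([f1, f2], dim=1)                   # F_con = Concat(F1, F2)
        s_avg = fcon.mean(dim=1, keepdim=True)              # channel-wise average pooling
        s_max = fcon.max(dim=1, keepdim=True)[0]            # channel-wise max pooling
        s = torch.sigmoid(self.conv_squeeze(torch.cat([s_avg, s_max], dim=1)))  # S = sigmoid(Concat(...))
        w = f1 * s[:, 0:1] + f2 * s[:, 1:2]                 # W = S[0]*F1 + S[1]*F2
        return x * self.conv_out(w)                         # X = W * x (element-wise weighting of the input)

x = torch.randn(1, 64, 32, 32)
print(KABlock(64)(x).shape)                                 # torch.Size([1, 64, 32, 32])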

3.4. LBN Method

In the detection of small objects in remote sensing images, the object features are often subtle and few in number. The traditional normalization methods are mainly batch normalization (BN) and layer normalization (LN). BN normalizes the input data using the mean and variance of each feature channel across all batch samples, so it relies mainly on the statistical information of the batch. Consequently, BN may not perform well with small batches or single samples; it may introduce noise and lead to inter-sample interactions and a dependence on the overall statistics. In contrast, LN performs normalization over all channels of each sample and can maintain inter-sample independence. However, because of this, LN ignores global information, resulting in less stable statistical information.
In this paper, we combine the two types of normalization so that the network can take into account the statistical information of the batch samples while dealing with small objects, and we propose the LBN normalization method. LBN first normalizes each sample’s channel to ensure independence between the samples; then, it calculates the mean and variance of the output of the current layer and normalizes the whole batch using these statistics to take advantage of the correlation between the batches. This dual normalization strategy can maintain the stability when dealing with small batches of samples and single samples, and, at the same time, it can use the batch statistical information to improve the overall model performance. It ensures feature independence and utilizes global statistical data, and it is particularly suitable for small object detection tasks. Assuming that the dimension of feature X is ( N , C , H , W ) , BN normalizes each channel across the entire batch, while LN normalizes each sample across all channels. The following formula represents normalization along the ( N , H , W ) dimensions:
$$x_1 = \frac{x_i - \frac{1}{B}\sum_{i=1}^{B} x_i}{\sqrt{\frac{1}{B}\sum_{i=1}^{B}\left(x_i - \frac{1}{B}\sum_{i=1}^{B} x_i\right)^2 + \epsilon}}$$
The variable B represents the number of samples, and ϵ is a small positive smoothing term added to prevent division by zero, ensuring the stability of the BN calculation. Replacing the variable B in the formula with the number of channels C yields the LN result, denoted $x_2$. A learnable parameter λ is then introduced to balance the normalization outputs along the two directions.
$$y = \lambda x_1 + (1 - \lambda) x_2$$
y is the output of the normalization method. In this paper, LBN is embedded into both the ACA and LKA blocks, accelerating the model training process.
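As a concrete illustration of the blending formula above, the following is a minimal PyTorch-style sketch of an LBN-style layer. Constraining the learnable λ to (0, 1) with a sigmoid and adding a per-channel affine scale and shift are assumptions made here for stability, not details taken from the paper.

import torch
import torch.nn as nn

class LayerBatchNorm(nn.Module):
    # Sketch: blend a BN-style output (x1) and an LN-style output (x2) with a learnable lambda.
    def __init__(self, num_channels, eps=1e-5):
        super().__init__()
        self.eps = eps
        self.lam = nn.Parameter(torch.zeros(1))                          # mixing parameter (pre-sigmoid)
        self.weight = nn.Parameter(torch.ones(1, num_channels, 1, 1))    # affine scale (assumption)
        self.bias = nn.Parameter(torch.zeros(1, num_channels, 1, 1))     # affine shift (assumption)

    def forward(self, x):                                                # x: (N, C, H, W)
        # BN direction: statistics over (N, H, W) for each channel
        mu_b = x.mean(dim=(0, 2, 3), keepdim=True)
        var_b = x.var(dim=(0, 2, 3), keepdim=True, unbiased=False)
        x1 = (x - mu_b) / torch.sqrt(var_b + self.eps)
        # LN direction: statistics over (C, H, W) for each sample
        mu_l = x.mean(dim=(1, 2, 3), keepdim=True)
        var_l = x.var(dim=(1, 2, 3), keepdim=True, unbiased=False)
        x2 = (x - mu_l) / torch.sqrt(var_l + self.eps)
        lam = torch.sigmoid(self.lam)                                    # keep lambda in (0, 1) (assumption)
        y = lam * x1 + (1 - lam) * x2                                    # y = lambda*x1 + (1 - lambda)*x2
        return y * self.weight + self.bias

print(LayerBatchNorm(32)(torch.randn(4, 32, 16, 16)).shape)              # torch.Size([4, 32, 16, 16])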

4. Experiments

This section describes extensive experiments to evaluate the model’s effectiveness and performance in remote sensing small object detection. Firstly, the datasets used in the experiments are introduced, followed by explanations of the experimental settings and evaluation metrics. Finally, the results of the ablation studies and comparative experiments are presented, and the observed phenomena and trends are discussed.

4.1. Datasets

DOTA-v2.0. DOTA-v2.0 is a benchmark dataset released by Wuhan University that is widely used for object detection in remote sensing images. The dataset contains 11,268 high-resolution aerial and satellite images and 1,793,658 annotated instances covering 18 object classes, such as aircraft and harbors. The high-resolution images and diverse object classes of the DOTA-v2.0 dataset provide rich test samples for evaluating the performance of different detection algorithms. As a publicly available benchmark, DOTA-v2.0 provides a unified evaluation tool that facilitates direct comparisons with existing methods and ensures reproducible and comparable research results. The details of each category in DOTA-v2.0 are shown in Table 2.
SODA-A. The SODA-A dataset is designed for small object detection and was released by Northwestern Polytechnical University. The dataset contains 2513 high-resolution aerial images, in which 872,069 objects are labeled with oriented boxes, covering nine categories such as airplanes, helicopters, and ships. The high-density and multi-directional small object annotations in the SODA-A dataset provide ideal test samples for the evaluation of small object detection algorithms for remote sensing. The details of each category in SODA-A are shown in Table 3.
VisDrone. The VisDrone dataset is a benchmark for UAV vision tasks and was released by the AISKYEYE team at Tianjin University. The dataset consists of 10,209 high-resolution images and video frames covering 79,658 labeled instances distributed across 10 object classes, including pedestrians, cars, and bicycles. The scene diversity and rich object classes of the VisDrone dataset allow for the analysis of model performance in complex urban environments and dynamic scenes. In addition, the multi-angle and multi-scale characteristics of the objects in the VisDrone dataset can also be exploited to verify the robustness and generalization abilities of the model in practical applications. Table 4 presents the detailed information of each category in the VisDrone dataset.

4.2. Implementation Details

This paper reports experimental results obtained on the DOTA-v2.0 and VisDrone datasets to evaluate the model’s performance. To ensure fairness, a unified data processing method was employed: the original images were cropped into 1024 × 1024 patches with a pixel overlap of 150 between adjacent patches. All experiments were conducted using a single NVIDIA RTX 4090 GPU with a batch size of 6 for model training and testing.
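For illustration, the cropping scheme described above (1024 × 1024 patches with a 150-pixel overlap) could be sketched as follows; the function name and the simplified border handling are our assumptions, and practical toolkits such as the DOTA devkit additionally shift the last window so that it aligns with the image border.

import numpy as np

def crop_into_patches(image, patch=1024, overlap=150):
    # Slide a patch-sized window with stride (patch - overlap) and collect the crops.
    h, w = image.shape[:2]                         # assumes an H x W x C array
    stride = patch - overlap
    crops = []
    for top in range(0, max(h - overlap, 1), stride):
        for left in range(0, max(w - overlap, 1), stride):
            crop = image[top:min(top + patch, h), left:min(left + patch, w)]
            crops.append(((top, left), crop))      # keep the offset for mapping boxes back
    return crops

tiles = crop_into_patches(np.zeros((2048, 3000, 3), dtype=np.uint8))
print(len(tiles))                                  # 12 overlapping patches for this image size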
A stochastic gradient descent (SGD) optimizer was used for training, with a learning rate of 0.01, momentum of 0.9, and weight decay of 0.0005. The classification loss was computed using BCE, and the bounding box regression loss was computed using the CIoU and DFL.
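In PyTorch terms, the optimizer settings listed above correspond to something like the following sketch; the placeholder module merely stands in for the LARS network, and the CIoU and DFL box losses are assumed to be provided by the detection framework itself.

import torch
import torch.nn as nn

model = nn.Conv2d(3, 16, 3)   # placeholder standing in for the LARS network
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9, weight_decay=0.0005)
cls_loss_fn = nn.BCEWithLogitsLoss()   # BCE classification loss; CIoU/DFL are handled by the detector head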
Pre-training was conducted on the ImageNet dataset for 400 epochs. For the ablation studies, the model was trained for 20 epochs to verify that the proposed methods could achieve good results within a limited number of iterations. For the comparative experiments, the model was trained for 50 epochs; performance was evaluated using the per-class AP50 and overall mAP50 on DOTA-v2.0, and the mAP50 and mAP95 on VisDrone.

4.3. Comparative Experiments

Results on the DOTA dataset. The proposed method achieved state-of-the-art performance, with a 63.01% mAP50 on the DOTA-v2.0 OBB benchmark.
From Table 5, it can be observed that, compared to previous methods, LARS achieved significant improvements in detection, achieving higher average precision and more accurate localization. The detection results for each category, as well as the overall detection accuracy, are visually presented in Figure 4 and Figure 5 using line and bar charts.
Results on the SODA-A dataset. This paper further highlights the performance of LARS on the SODA-A dataset. The experimental results demonstrate the performance of our method compared with other multi-stage and single-stage detection methods on the SODA-A dataset.
As shown in Table 6, our method achieves significant performance improvements in all metrics, especially in small object detection. APeS, APrS, APgS, and APN each represent the detection accuracy for extremely small, relatively small, generally small, and normal objects [3]. Our model outperforms all other comparative methods in all four metrics, which indicates that our method has higher accuracy in detecting small objects. In addition, our method also performs well in terms of the overall average precision (AP) and high-confidence detection (AP75), reaching 49.4 and 59.3, respectively, proving our method’s accuracy in detecting small objects in complex aerial photography scenes.
Results on the VisDrone dataset. This paper further examines the performance of LARS on the VisDrone dataset. The VisDrone dataset offers richer scenarios and more challenges, enabling a more comprehensive evaluation of the model’s performance and generalization ability. The experimental results on the VisDrone dataset, shown in Table 7, are analyzed next to further verify the effectiveness and generalization of the proposed approach.
Figure 6 lists the results of the comparison tests on the VisDrone dataset, showing that LARS performs well in dealing with various challenging scenarios. Compared with other methods, LARS achieves higher values in the mAP evaluation metric, which indicates that the model not only covers real objects more effectively but also identifies the object boundaries more accurately.
Overall, the experimental results show that the method proposed in this paper not only achieves significant performance improvements on the DOTA-v2.0 dataset but also achieves excellent detection performance on the VisDrone dataset, which proves the versatility and effectiveness of the method. In addition, the results prove that the proposed method has good generalization.

4.4. Ablation Experiments

This section reports the results of the ablation experiments on the DOTA-v2.0 dataset to investigate the method’s effectiveness.
Different decomposition strategies. Setting the theoretical receptive field as 23, the results of the ablation study on the number of large kernel decompositions are shown in Table 8, and the visualization results are illustrated in Figure 7. From the experimental findings, decomposing the large kernel into a convolution with a kernel size of 5 and a dilation rate of 1, along with another convolution with a kernel size of 7 and a dilation rate of 3, achieves the optimal performance.
Different insertion blocks. In this experiment, the ACA block, the LKA block, and the LBN were gradually added to the model, after which the three blocks were used in combination. The same dataset and training configurations were used, and the performance was evaluated on the validation set. As shown in Table 9, the experimental results reveal that the accuracy is further improved after adding ACA, LKA, and LBN at the same time. The visual comparison of the detection results is depicted in Figure 8. Incorporating all blocks enables the more accurate localization of the objects, reducing both missed detection instances and false positives. Additionally, in areas with densely distributed objects, the use of all blocks, compared to using partial blocks, can reduce the overlap between the detection boxes, thereby more accurately distinguishing individual objects (Figure 9). This indicates that the two blocks complement each other and can jointly enhance the model’s performance.

4.5. Results Analysis

The experimental results on the DOTA-v2.0 dataset are analyzed in this section.
In Figure 10, various evaluation metrics are illustrated, including the loss function, mAP, recall, and precision. The overall pattern of the loss metric in the experiments shows a continuous decrease, indicating the gradual optimization of the model’s bounding box prediction accuracy during training, enabling the accurate localization of the objects. The sustained increase in the mAP50 and mAP95 indicates the model’s good performance in focusing on key features and expanding the receptive field, leading to significant performance improvements at different IoU thresholds and demonstrating strong generalization abilities. During the early stages of training, the precision metric exhibits significant fluctuations due to the model’s lack of learned parameters and features. However, in the later stages of training, through adjustments made by the ACA block, critical information can be extracted, and the LKA block can assign corresponding receptive fields to objects of different sizes, leading to the gradual stabilization of the precision, converging to an optimal state and consistently achieving good performance across different samples. The continuous increase in the recall metric reflects the enhanced ability of the model to recognize positive samples, resulting in a decrease in missed detection instances. Through the proposed ACA and LKA blocks, the model can more accurately focus on critical features and better understand and capture the contextual information of objects, thereby further improving the recognition accuracy and completeness.
Figure 11 illustrates the normalized confusion matrix without the LBN block (left panel) and with the LBN block (right panel), where the rows represent the true categories and the columns represent the categories predicted by the model. The diagonal elements from the top left to the bottom right represent the probability that the model correctly categorizes each category. It can be seen that after adding the LBN block, the model’s classification performance in each category is improved, especially in the categories BC, GTF, BR, and AP, where the number of misclassification instances is significantly reduced, indicating that the LBN block effectively reduces the classification confusion and improves the overall detection accuracy. The ablation experiments illustrated in Table 9 also demonstrate the improvement effect after adding the LBN block. Specifically, the experimental results in the table show that the mAP50 and mAP95 were improved by 1.3% and 1.66%, respectively, after adding the LBN block. This indicates that the LBN block reduces misclassification and improves the overall detection performance.
In addition, the model shows a strong discrimination ability in the categories of PL and TC, with accuracy of more than 90%. This indicates that the model can accurately recognize these categories and distinguish objects from the background. However, some of the PL samples were misclassified as HCs, which may have been due to the similarity in the features of PL and HC, making it challenging for the model to differentiate between them. Similarly, the low accuracy for categories such as CC and AP could be due to insufficient training samples, which prevented the model from learning enough features for accurate classification.
The PR curve and F1–confidence curve are important metrics in evaluating the performance of object detection models.
The PR curve illustrates the relationship between the precision and recall at different thresholds. Typically, the area under the curve (AUC) is used to quantify the model’s performance, where a larger area indicates better performance. In the PR curve (Figure 12, left), most class curves protrude towards the upper right corner, indicating that the model maintains high precision while also improving the recall. This is attributed to the discriminative feature representations provided by the LKA block and the enhanced focus on objects by the ACA block, resulting in more accurate localization and recognition by the combined model.
In the F1–confidence curve (Figure 12, right), the horizontal axis represents the confidence threshold, while the vertical axis represents the F1 score, which is the harmonic mean of the precision and recall. The calculation formula is as follows:
F 1 = 2 × p r e c i s i o n × r e c a l l p r e c i s i o n + r e c a l l
At low confidence levels, the F1 score of the model is relatively low. However, with an increasing confidence threshold, the features extracted by the LKA block are fully utilized, and the ACA block effectively adjusts the importance of the feature channels. As a result, the F1 score gradually increases and reaches its highest value of 0.62 at a confidence level of 0.405. This improvement enhances the precision and recall, reducing instances of false positives and false negatives, thereby achieving more accurate object localization.
Figure 13 illustrates the performance of the proposed model on the two datasets. It can be observed that the model achieves high detection accuracy for small objects in remote sensing images and performs well in precise multi-scale object localization. On the DOTA-v2.0 dataset, the model accurately identifies objects of different scales, such as PL and HB, indicating that LARS can not only recognize normal-sized objects but also accurately identify small-sized objects. The test results on the VisDrone dataset also demonstrate the accurate identification of objects of different scales, such as cars, bicycles, and pedestrians. These experimental results fully demonstrate the effectiveness and feasibility of the proposed method in the task of detecting small objects in remote sensing images.

5. Conclusions

To address the localization blurring issue in small object detection in remote sensing images, a remote sensing small object detection network based on adaptive channel attention and large kernel adaptation was proposed. An adaptive channel attention block was proposed to enhance the attention mechanism and channel features for small objects in remote sensing images. This block could guide the model to focus better on local information. To alleviate the problem of local information loss when processing small objects in remote sensing images with large kernel convolutions, a large kernel adaptive block was designed to dynamically adjust the spatial receptive field of the objects so as to improve the model’s ability to extract the associated information around the small objects. We also designed a layer batch normalization method to alleviate the decrease in the model classification accuracy caused by sample misclassification and address the issue of inter-sample correlation. Extensive experiments and analyses demonstrated the convincing improvements brought by the proposed model.
Although the network proposed in this paper has obtained satisfactory results in mitigating the localization blurring problem during small object detection in remote sensing images, there are still several directions that can be further explored, as well as some limitations.
(1)
To address the lack of interpretability of the model, we are also exploring the combination of some mathematical formulas to explain the workings of the model in order to more clearly understand the inner workings and decision-making process of the model.
(2)
The model has a large number of parameters, leaving room for improvement in terms of a lightweight design. Future research can focus on reducing the model parameters and computational complexity through compression techniques to meet the requirements of applications in resource-constrained environments.
(3)
There is still room for improvement in the model’s accuracy, especially for complex backgrounds in remote sensing small object detection tasks. Future works can further enhance the accuracy and robustness through optimization algorithms or modifications to the model structure.
(4)
We also find that the imbalance in the number of samples in the dataset can significantly affect the detection results. Future research can consider using learning methods to improve the model’s detection abilities for classes with fewer samples.

Author Contributions

Conceptualization, Y.L., Y.Y. and Y.A.; methodology, Y.Y. and Y.S.; software, Y.S.; validation, Y.A.; writing—original draft preparation, Y.L., Y.Y. and Z.Z.; writing—review and editing, Y.L., Y.A. and Z.Z.; visualization, Y.S.; supervision, Y.L. and Z.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This research is jointly sponsored by the National Natural Science Foundation of China (62276037), the Special Key Project of Chongqing Technology, Innovation and Application Development (CSTB2023TIAD-KPX0088), the Science and Technology Innovation Key R&D Program of Chongqing (CSTB2023TIAD-STX0016), and the Special Key Project of Chongqing Technology, Innovation and Application Development (CSTB2022TIAD-KPX0039).

Data Availability Statement

Data are contained within the article.

Acknowledgments

The authors are thankful to the providers of all the datasets used in this study. We also thank the anonymous reviewers and editors for their comments, which helped to improve this paper.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Liu, D.; Zhang, J.; Qi, Y.; Wu, Y.; Zhang, Y. Tiny Object Detection in Remote Sensing Images Based on Object Reconstruction and Multiple Receptive Field Adaptive Feature Enhancement. IEEE Trans. Geosci. Remote Sens. 2024, 62, 1–13. [Google Scholar] [CrossRef]
  2. Shihabudeen, H.; Rajeesh, J. A detail review and analysis on deep learning based fusion of IR and visible images. In AIP Conference Proceedings; AIP Publishing: Melville, NY, USA, 2024; Volume 2965. [Google Scholar]
  3. Cheng, G.; Yuan, X.; Yao, X.; Yan, K. Towards large-scale small object detection: Survey and benchmarks. IEEE Trans. Pattern Anal. Mach. Intell. 2023, 45, 13467–13488. [Google Scholar] [CrossRef] [PubMed]
  4. Wang, Y.; Bashir, S.M.A.; Khan, M.; Ullah, Q.; Wang, R.; Song, Y.; Guo, Z.; Niu, Y. Remote sensing image super-resolution and object detection: Benchmark and state of the art. Expert Syst. Appl. 2022, 197, 116793. [Google Scholar] [CrossRef]
  5. Xie, Q.; Zhou, D.; Tang, R.; Feng, H. A Deep CNN-Based Detection Method for Multi-Scale Fine-Grained Objects in Remote Sensing Images. IEEE Access 2024, 12, 15622–15630. [Google Scholar] [CrossRef]
  6. Chadwick, A.J.; Coops, N.C.; Bater, C.W.; Martens, L.A.; White, B. Transferability of a Mask R–CNN Model for the Delineation and Classification of Two Species of Regenerating Tree Crowns to Untrained Sites. Sci. Remote Sens. 2024, 9, 100109. [Google Scholar] [CrossRef]
  7. Zhu, Z.; He, X.; Qi, G.; Li, Y.; Cong, B.; Liu, Y. Brain Tumor Segmentation Based on the Fusion of Deep Semantics and Edge Information in Multimodal MRI. Inf. Fusion 2023, 91, 376–387. [Google Scholar] [CrossRef]
  8. Sagar, A.S.; Chen, Y.; Xie, Y.; Kim, H.S. MSA R-CNN: A comprehensive approach to remote sensing object detection and scene understanding. Expert Syst. Appl. 2024, 241, 122788. [Google Scholar] [CrossRef]
  9. Zhu, Z.; Sun, M.; Qi, G.; Li, Y.; Gao, X.; Liu, Y. Sparse Dynamic Volume TransUNet with Multi-Level Edge Fusion for Brain Tumor Segmentation. Comput. Biol. Med. 2024, 172, 108284. [Google Scholar] [CrossRef] [PubMed]
  10. Zhu, Z.; Wang, Z.; Qi, G.; Mazur, N.; Yang, P.; Liu, Y. Brain Tumor Segmentation in MRI with Multi-Modality Spatial Information Enhancement and Boundary Shape Correction. Pattern Recognit. 2024, 153, 110553. [Google Scholar] [CrossRef]
  11. Ghadi, Y.Y.; Rafique, A.A.; Al Shloul, T.; Alsuhibany, S.A.; Jalal, A.; Park, J. Robust object categorization and Scene classification over remote sensing images via features fusion and fully convolutional network. Remote Sens. 2022, 14, 1550. [Google Scholar] [CrossRef]
  12. Qu, J.; Tang, Z.; Zhang, L.; Zhang, Y.; Zhang, Z. Remote Sensing Small Object Detection Network Based on Attention Mechanism and Multi-Scale Feature Fusion. Remote Sens. 2023, 15, 2728. [Google Scholar] [CrossRef]
  13. Ghaffarian, S.; Valente, J.; Van Der Voort, M.; Tekinerdogan, B. Effect of attention mechanism in deep learning-based remote sensing image processing: A systematic literature review. IEEE Trans. Geosci. Remote Sens. 2021, 13, 2965. [Google Scholar] [CrossRef]
  14. Wang, J.; Li, W.; Zhang, M.; Chanussot, J. Large Kernel Sparse ConvNet Weighted by Multi-Frequency Attention for Remote Sensing Scene Understanding. IEEE Trans. Geosci. Remote Sens. 2023, 61, 1–12. [Google Scholar] [CrossRef]
  15. Xiang, S.; Liang, Q. Remote Sensing Image Compression with Long-Range Convolution and Improved Non-Local Attention Model. Signal Process. 2023, 209, 109005. [Google Scholar] [CrossRef]
  16. Wang, W.; Li, S.; Shao, J.; Jumahong, H. LKC-Net: Large Kernel Convolution Object Detection Network. Sci. Rep. 2023, 13, 9535. [Google Scholar] [CrossRef] [PubMed]
  17. Li, Y.; Hou, Q.; Zheng, Z.; Cheng, M.M.; Yang, J.; Li, X. Large Selective Kernel Network for Remote Sensing Object Detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 1 October 2023; pp. 16794–16805. [Google Scholar]
  18. Ding, X.; Zhang, X.; Han, J.; Ding, G. Scaling Up Your Kernels to 31x31: Revisiting Large Kernel Design in CNNs. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 11963–11975. [Google Scholar]
  19. Ioffe, S.; Szegedy, C. Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. Proc. Int. Conf. Mach. Learn. 2015, 37, 448–456. [Google Scholar]
  20. Ba, J.L.; Kiros, J.R.; Hinton, G.E. Layer Normalization. arXiv 2016, arXiv:1607.06450. [Google Scholar]
  21. Xia, G.S.; Bai, X.; Ding, J.; Zhu, Z.; Belongie, S.; Luo, J.; Datcu, M.; Pelilli, M.; Zhang, L. DOTA: A Large-Scale Dataset for Object Detection in Aerial Images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 3974–3983. [Google Scholar]
  22. Du, D.; Zhu, P.; Wen, L.; Bian, X.; Lin, H.; Hu, Q.; Peng, T.; Zheng, J.; Wang, X.; Zhang, Y.; et al. VisDrone-DET2019: The Vision Meets Drone Object Detection in Image Challenge Results. In Proceedings of the IEEE/CVF International Conference on Computer Vision Workshops, Seoul, Republic of Korea, 27 October–2 November 2019. [Google Scholar]
  23. Jiang, J.; Zhong, X.; Chang, Z.; Gao, X. Object Detection of Transmission Tower Based on DPM. In Proceedings of the 4th International Conference on Information Technologies and Electrical Engineering, Lviv, Ukraine, 19–21 May 2021; pp. 1–5. [Google Scholar]
  24. Ren, Y.; Zhu, C.; Xiao, S. Small Object Detection in Optical Remote Sensing Images via Modified Faster R-CNN. Appl. Sci. 2018, 8, 813. [Google Scholar] [CrossRef]
  25. Lim, J.S.; Astrid, M.; Yoon, H.J.; Lee, S.I. Small Object Detection Using Context and Attention. In Proceedings of the 2021 International Conference on Artificial Intelligence in Information and Communication (ICAIIC), Jeju, Republic of Korea, 13–16 April 2021; pp. 181–186. [Google Scholar]
  26. Yan, J.; Hu, X.; Zhang, K.; Shi, T.; Zhu, G.; Zhang, Y. Detection of Dim Small Ground Targets in SAR Remote Sensing Image Based on Multi-Level Feature Fusion. J. Imaging Sci. Technol. 2023, 67, 1. [Google Scholar] [CrossRef]
  27. Fan, F.; Zhang, M.; Yu, D.; Li, J.; Zhou, S.; Liu, Y. Lightweight Context Awareness and Feature Enhancement for Anchor-Free Remote Sensing Target Detection. IEEE Sens. J. 2024, 24, 10714–10726. [Google Scholar] [CrossRef]
  28. Du, Z.; Liang, Y. Object Detection of Remote Sensing Image Based on Multi-Scale Feature Fusion and Attention Mechanism. IEEE Access 2024, 12, 8619–8632. [Google Scholar] [CrossRef]
  29. Paoletti, M.E.; Moreno-Alvarez, S.; Haut, J.M. Multiple attention-guided capsule networks for hyperspectral image classification. IEEE Trans. Geosci. Remote Sens. 2021, 60, 1–20. [Google Scholar] [CrossRef]
  30. Yan, R.; Yan, L.; Cao, Y.; Geng, G.; Zhou, P. One-Stop Multiscale Reconciliation Attention Network with Scribble Supervision for Salient Object Detection in Optical Remote Sensing Images. Appl. Intell. 2024, 54, 1–19. [Google Scholar] [CrossRef]
  31. Liu, C.; Zhang, S.; Hu, M.; Song, Q. Object Detection in Remote Sensing Images Based on Adaptive Multi-Scale Feature Fusion Method. Remote Sens. 2024, 16, 907. [Google Scholar] [CrossRef]
  32. Dong, P.; Wang, B.; Cong, R.; Sun, H.H.; Li, C. Transformer with Large Convolution Kernel Decoder Network for Salient Object Detection in Optical Remote Sensing Images. Comput. Vis. Image Underst. 2024, 240, 103917. [Google Scholar] [CrossRef]
  33. Sharshar, A.; Matsun, A. Innovative Horizons in Aerial Imagery: LSKNet Meets DiffusionDet for Advanced Object Detection. arXiv 2023, arXiv:2311.12956. [Google Scholar]
  34. Cha, K.; Seo, J.; Lee, T. A Billion-Scale Foundation Model for Remote Sensing Images. arXiv 2023, arXiv:2304.05215. [Google Scholar] [CrossRef]
  35. Lee, H.; Song, M.; Koo, J. Hausdorff distance matching with adaptive query denoising for rotated detection transformer. arXiv 2023, arXiv:2305.07598. [Google Scholar]
  36. Xie, X.; Cheng, G.; Wang, J. Oriented R-CNN and Beyond. Int. J. Comput. Vis. 2024, 132, 2420–2442. [Google Scholar] [CrossRef]
  37. Han, J.; Ding, J.; Li, J.; Xia, G.S. Align Deep Features for Oriented Object Detection. IEEE Trans. Geosci. Remote Sens. 2021, 60, 1–11. [Google Scholar] [CrossRef]
  38. Li, W.; Chen, Y.; Hu, K.; Zhu, J. Oriented RepPoints for Aerial Object Detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 1829–1838. [Google Scholar]
  39. Biswas, D.; Tešić, J. Progressive Domain Adaptation with Contrastive Learning for Object Detection in the Satellite Imagery. arXiv 2022, arXiv:2209.02564. [Google Scholar]
  40. Zhao, Z.; Li, S. OASL: Orientation-Aware Adaptive Sampling Learning for Arbitrary Oriented Object Detection. Int. J. Appl. Earth Obs. Geoinf. 2024, 128, 103740. [Google Scholar] [CrossRef]
  41. Zhao, J.; Ding, Z.; Zhou, Y.; Zhu, H.; Du, W.; Yao, R.; Saddik, A.E. Efficient Decoder for End-to-End Oriented Object Detection in Remote Sensing Images. arXiv 2023, arXiv:2311.17629. [Google Scholar]
  42. Xie, X.; Cheng, G.; Rao, C.; Lang, C.; Han, J. Oriented Object Detection via Contextual Dependence Mining and Penalty-Incentive Allocation. IEEE Trans. Geosci. Remote Sens. 2024, 62, 1–10. [Google Scholar]
  43. Zhang, M.; Yue, K.; Li, B.; Guo, J.; Li, Y.; Gao, X. Single-Frame Infrared Small Target Detection via Gaussian Curvature Inspired Network. IEEE Trans. Geosci. Remote Sens. 2024, 62, 1–13. [Google Scholar] [CrossRef]
  44. Xie, X.; Cheng, G.; Wang, J.; Yao, X.; Han, J. Oriented R-CNN for Object Detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, BC, Canada, 11–17 October 2021; pp. 3520–3529. [Google Scholar]
  45. Xu, C.; Ding, J.; Wang, J.; Yang, W.; Yu, H.; Yu, L.; Xia, G.S. Dynamic Coarse-to-Fine Learning for Oriented Tiny Object Detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 7318–7328. [Google Scholar]
  46. Yang, X.; Yang, X.; Yang, J.; Ming, Q.; Wang, W.; Tian, Q.; Yan, J. Learning High-Precision Bounding Box for Rotated Object Detection Via Kullback-Leibler Divergence. In Advances in Neural Information Processing Systems; MIT Press: Cambridge, MA, USA, 2021; Volume 34, pp. 18381–18394. [Google Scholar]
  47. Jocher, G. Ultralytics YOLOv8. Available online: https://github.com/ultralytics/ultralytics (accessed on 12 March 2024).
  48. Hou, L.; Lu, K.; Xue, J.; Li, Y. Shape-adaptive selection and measurement for oriented object detection. In Proceedings of the AAAI Conference on Artificial Intelligence, Online, 22 February–1 March 2022; Volume 36, pp. 923–932. [Google Scholar]
  49. Nie, G.; Huang, H. Multi-oriented object detection in aerial images with double horizontal rectangles. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 45, 4932–4944. [Google Scholar]
  50. Xu, Y.; Fu, M.; Wang, Q.; Wang, Y. Gliding vertex on the horizontal bounding box for multi-oriented object detection. IEEE Trans. Pattern Anal. Mach. Intell. 2020, 43, 1452–1459. [Google Scholar] [CrossRef] [PubMed]
  51. Cheng, G.; Yao, Y.; Li, S.; Li, K.; Xie, X. Dual-aligned oriented detector. IEEE Trans. Geosci. Remote Sens. 2022, 60, 1–11. [Google Scholar] [CrossRef]
  52. Yuan, X.; Cheng, G.; Yan, K.; Zeng, Q.; Han, J. Small object detection via coarse-to-fine proposal generation and imitation learning. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 2–6 October 2023; pp. 6317–6327. [Google Scholar]
  53. Shang, J.; Wang, J.; Liu, S.; Wang, C.; Zheng, B. Small Target Detection Algorithm for UAV Aerial Photography Based on Improved YOLOv5s. Electronics 2023, 12, 2434. [Google Scholar] [CrossRef]
  54. Liu, H.; Duan, X.; Lou, H.; Gu, J.; Chen, H. Improved GBS-YOLOv5 Algorithm Based on YOLOv5 Applied to UAV Intelligent Traffic. Sci. Rep. 2023, 13, 9577. [Google Scholar] [CrossRef]
  55. Ding, K.; Li, X.; Guo, W.; Wu, L. Improved object detection algorithm for drone-captured dataset based on yolov5. In Proceedings of the 2022 2nd International Conference on Consumer Electronics and Computer Engineering (ICCECE), Guangzhou, China, 14–16 January 2022; pp. 895–899. [Google Scholar]
  56. Tang, S.; Fang, Y.; Zhang, S. HIC-YOLOv5: Improved YOLOv5 for Small Object Detection. arXiv 2023, arXiv:2309.16393. [Google Scholar]
  57. Yang, C.; Huang, Z.; Wang, N. QueryDet: Cascaded Sparse Query for Accelerating High-Resolution Small Object Detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 13668–13677. [Google Scholar]
  58. Du, B.; Huang, Y.; Chen, J.; Huang, D. Adaptive Sparse Convolutional Networks with Global Context Enhancement for Faster Object Detection on Drone Images. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 13435–13444. [Google Scholar]
  59. Yu, W.; Yang, T.; Chen, C. Towards Resolving the Challenge of Long-Tail Distribution in UAV Images for Object Detection. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Online, 5–9 January 2021; pp. 3258–3267. [Google Scholar]
  60. Akyon, F.C.; Altinuc, S.O.; Temizel, A. Slicing Aided Hyper Inference and Fine-Tuning for Small Object Detection. In Proceedings of the 2022 IEEE International Conference on Image Processing (ICIP), Bordeaux, France, 16–19 October 2022; pp. 966–970. [Google Scholar]
  61. Liu, S.; Zha, J.; Sun, J.; Li, Z.; Wang, G. EdgeYOLO: An edge-real-time object detector. In Proceedings of the 2023 42nd Chinese Control Conference (CCC), Tianjin, China, 24–26 July 2023; pp. 7507–7512. [Google Scholar]
Figure 1. The overall architecture of LARS.
Figure 2. Structure of the ACA block.
Figure 3. Structure of the LKA block.
Figure 4. Average detection accuracy per category on the DOTA-v2.0 dataset. Each point represents the accuracy of a comparison model in a given category; the horizontal axis shows the different models, and the vertical axis shows the AP50 value for each category.
Figure 5. Average detection accuracy over all categories on the DOTA-v2.0 dataset. LARS (red bar) achieves the highest detection accuracy of 63.01 on this dataset.
Figure 6. Comparison of mAP50 and mAP95 metrics on the VisDrone dataset.
Figure 7. Comparison of results from different strategies for decomposing large kernels, given in (kernel, dilation) format.
Figure 8. Comparison of detection performance with different blocks added.
Figure 9. Visualization of detection performance after adding different blocks; regions of the detection results where differences are hard to discern are circled with ellipses.
Figure 10. Evaluation metric analysis for the proposed model.
Figure 11. Comparison of normalized confusion matrices without (Left) and with (Right) the LBN block on the DOTA-v2.0 dataset, where BG represents the background.
Figure 12. Precision–recall curve and F1–confidence curve for the proposed model.
Figure 13. Examples of test results on the DOTA-v2.0 and VisDrone datasets.
Table 1. Classification and corresponding area ranges of objects.

| Area Subset | Small           |                 |                | Normal       |
|             | Extremely Small | Relatively Small | Generally Small |              |
| Area Range  | (0, 144]        | (144, 400]       | (400, 1024]     | (1024, 2000] |
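For readers who want to reproduce the size-based split, the following minimal sketch shows one way the thresholds in Table 1 could be applied when bucketing ground-truth instances. The function name and the assumption that area is measured in pixels are illustrative choices, not part of the original evaluation code.

```python
def area_subset(area: float) -> str:
    """Map an instance's area (assumed to be in pixels) to the subsets in Table 1."""
    if 0 < area <= 144:
        return "Extremely Small"
    if 144 < area <= 400:
        return "Relatively Small"
    if 400 < area <= 1024:
        return "Generally Small"
    if 1024 < area <= 2000:
        return "Normal"
    return "out of range"


# Example: a hypothetical 12 x 25 px vehicle instance (area 300) is Relatively Small.
print(area_subset(12 * 25))  # Relatively Small
```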
Table 2. Instance counts of each category in the DOTA-v2.0 dataset.

| Category           | Abbr. | Inst. Count | Category          | Abbr. | Inst. Count |
| plane              | PL    | 23,930      | large vehicle     | LV    | 89,353      |
| ship               | SH    | 251,883     | small vehicle     | SV    | 1,235,658   |
| storage tank       | ST    | 79,497      | helicopter        | HC    | 893         |
| baseball diamond   | BD    | 3834        | roundabout        | RA    | 6809        |
| tennis court       | TC    | 9396        | soccer ball field | SBF   | 2404        |
| basketball court   | BC    | 3556        | swimming pool     | SP    | 20,095      |
| ground track field | GTF   | 4933        | container crane   | CC    | 3887        |
| harbor             | HB    | 29,581      | airport           | AP    | 5905        |
| bridge             | BR    | 21,433      | helipad           | HP    | 611         |
| Training           | /     | 268,627     | Test/Test-dev     | /     | 353,346     |
| Validation         | /     | 81,048      | Test-challenge    | /     | 1,690,637   |
Table 3. Instance counts of each category in the SODA-A dataset.

| Category      | Instance Count |
| airplane      | 31,529  |
| helicopter    | 1395    |
| small-vehicle | 463,072 |
| large-vehicle | 15,333  |
| ship          | 61,916  |
| container     | 138,223 |
| storage-tank  | 35,027  |
| swimming-pool | 26,953  |
| windmill      | 26,755  |
| Train         | 344,228 |
| Validation    | 159,573 |
| Test          | 296,402 |
| Total         | 800,203 |
Table 4. Number of instances per category in the VisDrone dataset.

| Category        | Instance Count |
| pedestrian      | 79,337  |
| people          | 27,059  |
| bicycle         | 10,477  |
| car             | 144,865 |
| van             | 24,950  |
| truck           | 12,871  |
| tricycle        | 4803    |
| awning-tricycle | 3243    |
| bus             | 5926    |
| motor           | 29,642  |
Table 5. Comparative experimental results on the DOTA-v2.0 dataset. Each column represents a category. Results highlighted in red and blue represent the best and second-best performance in each column, respectively.

| Method | PL | BD | BR | GTF | SV | LV | SH | TC | BC | ST | SBF | RA | HA | SP | HC | CC | AP | HL | mAP50 |
| multi-stage |
| BFM [34] | 80.12 | 54.12 | 50.07 | 65.68 | 43.98 | 60.07 | 67.85 | 79.11 | 64.38 | 60.56 | 45.98 | 58.26 | 58.31 | 64.82 | 69.84 | 32.78 | 89.37 | 11.07 | 58.69 |
| RHINO [35] | 79.74 | 58.79 | 48.13 | 67.12 | 57.21 | 59.11 | 69.48 | 83.54 | 65.14 | 74.05 | 47.93 | 60.49 | 58.43 | 63.25 | 55.59 | 48.49 | 82.06 | 14.40 | 60.72 |
| ORCB [36] | 78.65 | 51.80 | 47.15 | 65.78 | 43.35 | 58.29 | 60.89 | 82.83 | 63.51 | 59.50 | 43.40 | 55.79 | 52.90 | 56.18 | 54.13 | 27.55 | 66.24 | 5.22 | 54.06 |
| S2A-Net [37] | 77.84 | 51.31 | 43.72 | 62.59 | 47.51 | 50.58 | 57.86 | 80.73 | 59.11 | 65.32 | 36.43 | 52.60 | 45.36 | 52.46 | 40.12 | 0 | 62.81 | 11.11 | 49.86 |
| O-Rep [38] | 73.02 | 46.68 | 42.37 | 63.05 | 47.06 | 50.28 | 58.64 | 78.84 | 57.12 | 66.77 | 35.21 | 50.76 | 48.77 | 51.62 | 34.23 | 6.17 | 64.66 | 5.87 | 48.95 |
| PDACL [39] | 83.08 | 68.53 | 44.31 | 58.33 | 63.04 | 79.12 | 88.18 | 93.87 | 58.51 | 72.95 | 54.01 | 54.84 | 73.21 | 57.80 | 40.70 | 3.05 | 61.41 | 49.67 | 61.21 |
| OASL [40] | 76.65 | 55.46 | 46.33 | 62.49 | 53.18 | 56.62 | 66.16 | 80.75 | 63.07 | 67.03 | 44.89 | 55.68 | 54.24 | 59.04 | 59.15 | 35.00 | 77.81 | 14.65 | 57.18 |
| one-stage |
| RRF [41] | 77.58 | 49.96 | 38.60 | 53.82 | 54.97 | 57.12 | 68.93 | 77.88 | 59.59 | 71.92 | 40.17 | 51.22 | 53.00 | 57.12 | 49.66 | 25.42 | 66.92 | 5.17 | 53.28 |
| DFDet [42] | 75.44 | 52.17 | 42.28 | 60.17 | 48.80 | 53.36 | 62.67 | 78.15 | 56.85 | 66.52 | 40.78 | 53.05 | 48.42 | 59.23 | 51.40 | 25.47 | 66.29 | 16.38 | 53.19 |
| GCI-Net [43] | 79.18 | 51.57 | 47.50 | 66.61 | 43.30 | 58.07 | 60.73 | 82.85 | 64.47 | 59.62 | 44.31 | 56.66 | 52.71 | 56.73 | 53.04 | 26.10 | 66.41 | 14.42 | 54.68 |
| O-RCNN [44] | 77.95 | 50.29 | 46.73 | 65.24 | 42.61 | 54.56 | 60.02 | 79.08 | 61.69 | 59.42 | 42.26 | 56.89 | 51.11 | 56.16 | 59.33 | 25.81 | 60.67 | 9.17 | 53.28 |
| DCFL [45] | 79.49 | 55.97 | 50.15 | 61.59 | 49.01 | 55.33 | 59.31 | 81.81 | 66.52 | 60.06 | 52.87 | 56.71 | 57.83 | 58.13 | 60.35 | 35.66 | 78.65 | 13.03 | 57.66 |
| R3Det [46] | 75.44 | 50.95 | 41.16 | 61.61 | 41.11 | 45.76 | 49.65 | 78.52 | 54.97 | 60.79 | 42.07 | 53.20 | 43.08 | 49.55 | 34.09 | 36.26 | 68.65 | 0.06 | 47.26 |
| YOLO8 [47] | 91.89 | 72.87 | 43.42 | 60.95 | 62.13 | 79.11 | 85.79 | 94.51 | 60.78 | 75.22 | 42.36 | 56.53 | 75.99 | 68.35 | 48.51 | 2.24 | 33.82 | 0.15 | 58.51 |
| SASM [48] | 70.30 | 40.62 | 37.01 | 59.03 | 40.21 | 45.46 | 44.60 | 78.58 | 49.34 | 60.73 | 29.89 | 46.57 | 42.95 | 48.31 | 28.13 | 1.82 | 76.37 | 0.74 | 44.53 |
| Ours | 94.53 | 73.45 | 51.43 | 65.02 | 65.49 | 81.32 | 87.66 | 94.82 | 70.20 | 79.41 | 54.17 | 62.53 | 80.42 | 72.50 | 42.67 | 8.01 | 54.81 | 3.90 | 63.01 |
Table 6. Comparative experimental results on the SODA-A dataset. Results highlighted in red and blue represent the best and second-best performance in each column, respectively.

| Method | Publication | AP | AP50 | AP75 | AP_eS | AP_rS | AP_gS | AP_N |
| multi-stage |
| S2A-Net [37] | TGRS'22  | 28.3 | 69.6 | 13.1 | 10.2 | 22.8 | 35.8 | 29.5 |
| O-Rep [38]   | CVPR'22  | 26.3 | 58.8 | 19.0 | 9.4  | 22.6 | 32.4 | 28.5 |
| DHRec [49]   | TPAMI'22 | 30.1 | 68.8 | 19.8 | 10.6 | 24.6 | 40.3 | 34.6 |
| one-stage |
| GV [50]      | TPAMI'21 | 31.7 | 70.8 | 22.6 | 11.7 | 27.0 | 41.1 | 33.8 |
| O-RCNN [44]  | ICCV'21  | 34.4 | 70.7 | 28.6 | 12.5 | 28.6 | 44.5 | 36.7 |
| DODet [51]   | TGRS'22  | 31.6 | 68.1 | 23.4 | 11.3 | 26.3 | 41.0 | 33.5 |
| CFINet [52]  | ICCV'23  | 34.4 | 73.1 | 26.1 | 13.5 | 29.3 | 44.0 | 35.9 |
| Ours         | -        | 49.4 | 72.1 | 59.3 | 15.2 | 30.5 | 45.4 | 37.7 |
Table 7. Main results of the comparison test on the VisDrone dataset. Results highlighted in red and blue represent the best and second-best performance in each column, respectively.

| Method           | mAP50 (%) | mAP95 (%) |
| UTY5S [53]       | 36.41 | 20.18 |
| IGUIT [54]       | 35.32 | 20.04 |
| DCFL [45]        | 32.14 | -     |
| IOD [55]         | 42.93 | 24.62 |
| HIC-YOLOv5 [56]  | 44.32 | 25.99 |
| QueryDet [57]    | 48.15 | 28.71 |
| CEASC [58]       | 50.74 | 28.46 |
| DSH-Net [59]     | 51.81 | 30.94 |
| SAHI [60]        | 43.59 | -     |
| EdgeYOLO [61]    | 44.85 | -     |
| Ours             | 52.87 | 33.92 |
Table 8. The impact of decomposing different numbers of large kernels on the evaluation metrics, assuming a theoretical receptive field of 23, with the best metrics indicated in bold.

| (k, d) Sequence                      | Precision (%) | Recall (%) | mAP50 (%) | mAP95 (%) |
| (23, 1)                              | 68.53 | 50.47 | 52.94 | 32.68 |
| (3, 1) + (5, 1) + (7, 1) + (9, 1)    | 70.90 | 51.54 | 54.45 | 34.27 |
| (5, 1) + (7, 3)                      | 73.97 | 54.72 | 57.34 | 41.12 |
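To make the (kernel, dilation) notation in Table 8 concrete, the sketch below implements the best-performing (5, 1) + (7, 3) depthwise sequence in PyTorch. The module name, the depthwise grouping, and the trailing pointwise convolution are illustrative assumptions rather than the paper's exact LKA block; the point is that stacking a 5 x 5 convolution with a dilated 7 x 7 convolution (effective kernel 19) gives a receptive field of 5 + (7 - 1) x 3 = 23, the theoretical receptive field assumed in the table.

```python
import torch
import torch.nn as nn


class DecomposedLargeKernel(nn.Module):
    """Hypothetical sketch of the (5, 1) + (7, 3) decomposition listed in Table 8."""

    def __init__(self, channels: int):
        super().__init__()
        # (kernel = 5, dilation = 1): captures local detail; padding 2 keeps the spatial size.
        self.dw_small = nn.Conv2d(channels, channels, kernel_size=5,
                                  padding=2, groups=channels)
        # (kernel = 7, dilation = 3): long-range context; effective kernel 19,
        # padding = dilation * (kernel - 1) / 2 = 9 keeps the spatial size.
        self.dw_dilated = nn.Conv2d(channels, channels, kernel_size=7,
                                    padding=9, dilation=3, groups=channels)
        # Pointwise convolution to mix channels after the depthwise pair (an assumed design choice).
        self.pw = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Receptive field of the depthwise pair: 5 + (7 - 1) * 3 = 23.
        return self.pw(self.dw_dilated(self.dw_small(x)))


if __name__ == "__main__":
    x = torch.randn(1, 64, 128, 128)
    print(DecomposedLargeKernel(64)(x).shape)  # torch.Size([1, 64, 128, 128])
```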
Table 9. Effectiveness experiment on the individual proposed blocks; √ indicates the inclusion of the block, and bold indicates the best result.

| LKA | ACA | LBN | Precision (%) | Recall (%) | mAP50 (%) | mAP95 (%) |
|     |     |     | 70.62 | 50.08 | 52.33 | 36.06 |
|     |     |     | 67.98 | 51.38 | 53.63 | 37.72 |
|     |     |     | 72.34 | 51.69 | 54.28 | 37.95 |
|     |     |     | 66.54 | 53.22 | 55.95 | 38.85 |
| √   | √   | √   | 73.97 | 54.72 | 57.40 | 41.12 |