1. Introduction
Infrared images play a crucial role in remote sensing technology due to their unique imaging capabilities. Unlike visible light imaging, infrared imaging can capture images in real time under both day and night conditions. By detecting the infrared radiation emitted by objects, infrared imaging effectively penetrates common environmental obstacles, such as smoke and mist, enabling all-weather imaging [1,2]. With ongoing advancements in infrared imaging technology, there have been significant improvements in image resolution and sensitivity. These enhancements make infrared images highly valuable for applications in environmental monitoring, agriculture, night surveillance, and target detection [3,4,5]. Moreover, the all-weather and multi-angle imaging capabilities of infrared technology are essential for maritime traffic monitoring and ship identification, thereby enhancing maritime safety management.
Although infrared imaging technology has demonstrated strong performance in marine image acquisition, it continues to encounter significant challenges in target detection [6]. First, the marine environment and ship distribution are complex, and images often contain multiple ships against complex backgrounds accompanied by numerous interference factors, which increases the difficulty of identification. Second, when detecting at long distances, small ships may occupy only a limited number of pixels in infrared images, occasionally appearing as patches or dots [7]. In addition, the infrared reflection characteristics vary considerably across different angles, complicating the model’s ability to accurately capture and learn directional information. Finally, environmental noise, such as thermal radiation from the sea surface, can also interfere with infrared image detection, increasing the probability of false and missed detections [8]. To address these problems, it is essential to develop more advanced image processing and target detection algorithms to enhance the precision and dependability of infrared ship detection.
With the advancement of deep learning technology, various advanced ship detection methods have emerged, providing substantial support for monitoring offshore vessels. Depending on their underlying algorithmic processes, these techniques can be divided into one-stage methods, which are faster, and two-stage methods, which offer higher accuracy. The two-stage methods first identify potential target locations through a Region Proposal Network, then classify the objects within these regions, and finally refine the target positions and boundaries. Although this approach can achieve high accuracy, it incurs significant computational complexity and time costs, as seen in methods such as Faster R-CNN [9], Mask R-CNN [10], and Cascade R-CNN [11]. Conversely, one-stage methods treat the detection process as a regression task, eliminating the candidate region extraction step, which simplifies the pipeline and yields faster detection speed but potentially lower accuracy, as seen in SSD [12], RetinaNet [13], and the YOLO series [14,15,16]. Wang et al. [17] designed an improved detection model based on YOLOv5s, which extracts and expresses information about different targets through a feature fusion module and improves sensitivity to small targets through SPD-Conv. Zhang et al. [18] developed a dual-branch supervised strategy for the segmentation of surface ship images, leveraging a dual-branch network to enhance segmentation performance.
In recent years, the transformer architecture has achieved remarkable performance in optical image recognition [19]. In 2020, Carion et al. [20] presented the Detection Transformer (DETR) to achieve end-to-end target detection. In 2023, Lv et al. [21] proposed RT-DETR, surpassing similar detectors in both efficiency and precision. For example, Li et al. [22] introduced a real-time detection transformer for near-shore ships that utilizes multi-head coordinate attention and contrastive learning to enhance the detection efficiency of near-shore ships at night. Similarly, Zhou et al. [23] proposed a dual backbone integrating a convolutional backbone and a transformer architecture to enhance the model’s ability to detect infrared ships in complex environments. Furthermore, Feng et al. [24] introduced an enhanced multi-scale transformer detection framework that improves image target detection performance and feature discrimination capability through a direction enhancement module and a contrastive loss.
Simultaneously, Xu et al. [25] introduced HCF-Net, a deep learning framework for enhancing infrared small object detection, addressing challenges such as small object size and complex backgrounds. Its key components include a multi-scale feature extraction module, the DASI module for automatically selecting the most informative channels, and the MDCR module for spatial feature refinement. Similarly, Chen et al. [26] proposed a new method named MFDS-DETR, which significantly improved target detection performance through multi-layer feature fusion and a deformable self-attention mechanism. Furthermore, Gong et al. [27] optimized the YOLOv7 [28] algorithm, improving its operating efficiency and detection accuracy on mobile devices through technologies such as ShuffleNetV2 [29] and the Vision Transformer.
This paper puts forth MAFF-DETR, a multi-attention and multi-scale feature fusion network for infrared ship detection built on an improved RT-DETR. The main objective of this research is to develop a multi-scale infrared ship detection model that addresses the significant size disparity of ships on the sea surface and copes with increasingly complex marine environments. Its key contributions include the following aspects:
To address the issue of information loss resulting from multiple downsampling in infrared images, we propose a novel C2f-PPA backbone that integrates CSP with parallelized patch-aware attention (PPA). The cross-stage partial feature extraction mechanism in C2f-PPA reduces redundancy while preserving essential features. Its flexible multi-branch approach further improves multi-scale feature fusion and representation, leading to significantly enhanced detection performance in complex infrared scenarios.
To tackle the challenge of scale disparity in infrared images, we introduce the High-level Screening-feature Fusion Pyramid, which optimizes the model’s neck. Additionally, we employ Channel Attention to selectively refine low-level feature information, thereby improving the model’s ability to express and utilize feature representations effectively. Furthermore, we augment the model’s robustness by incorporating a collaborative attention mechanism that integrates multiple attention types.
We present an innovative multi-layer dynamic shuffle transformer (MDST) module that simultaneously reduces computational costs and enhances feature diversity by combining channel shuffling with group convolution. Substituting linear transformations with stacked standard convolutions and DSConv enhances the network’s nonlinear expression ability while reducing model parameters and memory usage.
Experiments conducted on an infrared marine ship dataset show that the proposed approach achieves better performance than the baseline and exceeds the results of most other deep-learning detection models.
2. Related Work
2.1. CNN-Based Detection of Infrared Targets
Currently, AI-driven techniques for detecting ship images are mainly divided into two categories: traditional neural network algorithms and deep learning convolutional neural network algorithms. Traditional algorithms require the manual extraction of ship texture, shape, and other features in infrared images, relying on expert experience; these features are then input into traditional neural networks such as the MLP [30] for detection. Since the features are manually designed, the quality of their selection significantly impacts detection performance. Deep learning algorithms, on the other hand, directly input infrared images into deep convolutional neural networks, such as Swin Transformer v2 [31], automatically learning features through convolutional and pooling layers, extracting high-level semantic features, and outputting detection results. This approach realizes an end-to-end process from inputting infrared images to outputting detection results, thereby improving detection efficiency and accuracy [32].
2.2. Detection Transformer
DETR, launched in 2020, innovatively uses the transformer architecture for end-to-end object detection. It eliminates post-processing steps like ROI pooling and non-maximum suppression, featuring a backbone for feature extraction, transformer encoders and decoders, and a prediction head for bounding box calculation and classification. DETR employs the Hungarian algorithm for set prediction matching and harnesses the global feature learning capabilities of the transformer, thereby achieving superior prediction accuracy and inference speed compared to conventional methods.
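For illustration, a minimal sketch of the set-prediction matching step is given below. The cost function used here (negative class probability plus L1 box distance) is a simplified stand-in for DETR's full matching cost, which also includes a generalized IoU term; the function name and tensor shapes are illustrative assumptions.

```python
import torch
from scipy.optimize import linear_sum_assignment

def hungarian_match(pred_logits, pred_boxes, gt_labels, gt_boxes):
    """One-to-one matching of predictions to ground truth, as in DETR.

    pred_logits: (num_queries, num_classes), pred_boxes: (num_queries, 4)
    gt_labels: (num_gt,), gt_boxes: (num_gt, 4)
    Cost is simplified: -p(class) + L1 box distance.
    """
    prob = pred_logits.softmax(-1)                      # (Q, C)
    cost_cls = -prob[:, gt_labels]                      # (Q, G)
    cost_box = torch.cdist(pred_boxes, gt_boxes, p=1)   # (Q, G)
    cost = (cost_cls + cost_box).detach().cpu().numpy()
    rows, cols = linear_sum_assignment(cost)            # Hungarian algorithm
    return rows, cols   # matched (prediction index, ground-truth index) pairs

# Toy example: 10 queries matched against 3 ground-truth boxes
r, c = hungarian_match(torch.randn(10, 8), torch.rand(10, 4),
                       torch.tensor([1, 3, 5]), torch.rand(3, 4))
print(list(zip(r, c)))
```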
Despite its advantages, compared with other detectors, DETR requires extended training time and exhibits reduced performance in detecting small targets. However, subsequent studies [33,34,35] have alleviated these problems. Conditional DETR introduces a conditional decoder attention mechanism, utilizing the query position as a conditional input to expedite target focus and model convergence. Deformable DETR integrates deformable attention, which calculates attention weights exclusively at key points, diminishing computational complexity while preserving global information and significantly enhancing detection performance and efficiency. Sparse DETR uses sparse attention to compute attention only at high-confidence locations, reducing computational demands and accelerating convergence while maintaining detection accuracy. RT-DETR performs global context modeling through the transformer’s self-attention mechanism, which enables it to better handle complex backgrounds and overlapping targets, reducing false and missed detections. Unlike the YOLO family, which relies on anchor boxes, RT-DETR is anchor-free, simplifying the target localization process and avoiding the complexity of anchor box tuning. Owing to the transformer’s ability to capture long-range dependencies, RT-DETR has an advantage over YOLO when dealing with target-dense and overlapping scenes, and can identify targets more accurately.
However, in infrared ship detection, while RT-DETR leverages the global context provided by its self-attention mechanism to achieve a strong performance in detecting large objects, it shows a relatively weaker performance in detecting small infrared targets, particularly those appearing as patch-like structures. Furthermore, the transformer-based architecture introduces significant computational costs. This paper seeks to enhance RT-DETR’s sensitivity to small infrared targets while reducing its computational resource demands, thereby making it more efficient and applicable to a broader range of real-world scenarios.
2.3. ShuffleNet v2
ShuffleNet v2 is a lightweight convolutional neural network architecture designed for efficient mobile and embedded applications. Its primary purpose is to achieve high accuracy while maintaining a low computational cost, making it suitable for devices with limited resources. Compared to its predecessor, ShuffleNet v1, ShuffleNet v2 introduces several key improvements: it simplifies the network design by eliminating unnecessary operations, uses a more efficient channel shuffle operation, and incorporates a split-transform-merge strategy that enhances feature reuse. These modifications lead to a better balance between accuracy and speed, enabling ShuffleNet v2 to outperform ShuffleNet v1 in various tasks, including image classification. The architecture demonstrates superior performance in terms of computational efficiency and model size, making it an excellent choice for real-time applications in constrained environments.
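The channel shuffle operation at the heart of ShuffleNet v2 is compact enough to sketch directly; the snippet below follows the standard reshape–transpose–flatten formulation (the group count is a free parameter).

```python
import torch

def channel_shuffle(x: torch.Tensor, groups: int) -> torch.Tensor:
    """Interleave channels across groups so that subsequent grouped
    convolutions can exchange information (reshape -> transpose -> flatten)."""
    b, c, h, w = x.shape
    assert c % groups == 0
    x = x.view(b, groups, c // groups, h, w)   # split channels into groups
    x = x.transpose(1, 2).contiguous()         # swap group and sub-channel axes
    return x.view(b, c, h, w)                  # flatten back to (b, c, h, w)

# Example: shuffle a feature map with 8 channels in 2 groups
feat = torch.randn(1, 8, 32, 32)
print(channel_shuffle(feat, groups=2).shape)   # torch.Size([1, 8, 32, 32])
```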
3. MAFF-DETR
The network structure of MAFF-DETR is shown in Figure 1. We innovatively improved the backbone by combining the ideas of CSP and PPA, which preserves the model’s sensitivity to the key information of small infrared targets after multiple downsampling operations. Because the shooting distances of infrared images differ significantly, we adopt the HS-FPN structure in the neck to achieve multi-scale feature fusion. The proposed MDST module skillfully integrates channel shuffle technology and group convolution to achieve a favorable balance between model complexity and efficiency. This approach not only boosts computational performance but also preserves the model’s strong ability to extract essential information. In addition, the MDST module further enhances the stability and accuracy of the model by introducing skip connections and depthwise separable convolutions.
3.1. C2f-PPA
In the task of detecting small infrared objects, these objects are prone to losing key information during multiple downsampling operations. Inspired by the lightweight design philosophy of the CSP Bottleneck, our goal is to enhance the feature flow of small infrared targets, such as fishing boats and canoes, while reducing computational overhead. The standalone PPA module optimizes feature representations by focusing on important regions within the feature map, thereby improving the detection of small targets. C2f ensures the efficient propagation of refined information throughout the network, enabling more efficient processing of the feature map. C2f accelerates the handling and aggregation of low-level features, while PPA further refines these features by focusing on spatially relevant areas within the image. The structure is shown in Figure 2.
The C2f-PPA architecture improves feature aggregation by providing the feature fusion and channel flow optimization that PPA lacks, ensuring better representation of small, low-contrast objects such as fishing boats and canoes. C2f is capable of merging features across different resolutions, while PPA ensures that attention is focused on the most critical regions of the image, thus driving this improvement. Although the attention mechanism in PPA may incur higher computational costs, its combination with C2f strikes a balance between high-resolution attention focusing and minimizing unnecessary computations, significantly enhancing computational efficiency.
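To make the design concrete, the sketch below shows one plausible arrangement of C2f-PPA: a CSP-style split whose bottleneck branch is refined by an attention block. The `PPABlock` here is a simplified placeholder (a single spatial gate) rather than the full parallelized patch-aware attention described in the next subsection, and all channel sizes are assumptions.

```python
import torch
import torch.nn as nn

class PPABlock(nn.Module):
    """Placeholder for parallelized patch-aware attention: a simple spatial
    gate that reweights locations before a 3x3 convolution."""
    def __init__(self, ch):
        super().__init__()
        self.gate = nn.Sequential(nn.Conv2d(ch, 1, 1), nn.Sigmoid())
        self.conv = nn.Sequential(nn.Conv2d(ch, ch, 3, padding=1),
                                  nn.BatchNorm2d(ch), nn.SiLU())

    def forward(self, x):
        return self.conv(x * self.gate(x))

class C2fPPA(nn.Module):
    """CSP-style split: half the channels pass through stacked PPA
    bottlenecks; all intermediate outputs are concatenated and fused."""
    def __init__(self, c_in, c_out, n=2):
        super().__init__()
        self.mid = c_out // 2
        self.cv1 = nn.Conv2d(c_in, c_out, 1)
        self.blocks = nn.ModuleList(PPABlock(self.mid) for _ in range(n))
        self.cv2 = nn.Conv2d((n + 2) * self.mid, c_out, 1)

    def forward(self, x):
        y = list(self.cv1(x).chunk(2, dim=1))    # split into two halves
        for block in self.blocks:
            y.append(block(y[-1]))               # refine and keep every stage
        return self.cv2(torch.cat(y, dim=1))     # cross-stage fusion

out = C2fPPA(64, 64)(torch.randn(1, 64, 40, 40))
print(out.shape)  # torch.Size([1, 64, 40, 40])
```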
3.1.1. Multi-Branch Feature Extraction
The core principle of multi-branch feature extraction proves to be highly effective for infrared ship detection, where capturing diverse image features from various angles and scales is essential. By utilizing multiple parallel feature extraction paths, as illustrated in Figure 3, this method enhances the model’s sensitivity to the distinct characteristics of infrared ship images. These paths, which may incorporate different convolutional kernel sizes, layers, or architectural designs, enable the detection of small or obscured ships in complex infrared scenarios. The parallel processing of features from these varied paths results in a more enriched and comprehensive image representation, significantly improving detection accuracy in challenging environments.
Specifically, this strategy represents an advanced feature extraction mechanism designed to effectively identify ships by combining information from different spatial scales and contexts. The input feature tensor is the raw feature map extracted from the infrared image, which is processed through point-wise 1 × 1 convolutions to adjust the number of channels while maintaining spatial resolution, resulting in $F'$. This refined tensor is then processed by three parallel branches. The local branch focuses on fine-grained, high-frequency details from smaller regions, capturing essential features like ship edges, corners, and textures for detecting smaller ships. The global branch gathers information from larger regions, helping the model understand the ship’s position relative to its environment and distinguishing it from background objects. The convolution branch processes the image hierarchically, extracting features that integrate both spatial and contextual information, balancing the insights from local and global contexts. Finally, the outputs of these three branches are summed to form the unified feature tensor $\tilde{F} = F_{local} + F_{global} + F_{conv}$, combining detailed local features, broader global context, and structured convolutional information to create a comprehensive feature map that captures both the ship’s characteristics and its surrounding environment.
The primary distinction between the local and global branches resides in the size of the image patches in each process. The local branch is dedicated to scrutinizing detail-oriented features, while the global branch is designed to capture overarching global features. This distinction is facilitated through the parameter p, which stipulates the size of the patch. Specifically, p is set by aggregating and shifting non-overlapping patches within the image space dimension, effectively partitioning the image into multiple discrete, non-overlapping regions, each defined by p. In order to proficiently extract and interact with local and global features, an attention matrix is computed among these non-overlapping patches to ascertain the relationship and importance between different patches. This computational strategy enables the model to adeptly capture both the intricate details and the expansive global information of the image at different scales, thereby improving the precision and efficiency of feature extraction.
In the feature map after preliminary processing, the features need to be weighted to select those relevant to the specific task. Following segmentation of the feature map into non-overlapping patches, the resulting token sequence is represented as $U = (u_1, u_2, \ldots, u_n)$, and the weighted result is represented as $\widetilde{U} = (\widetilde{u}_1, \widetilde{u}_2, \ldots, \widetilde{u}_n)$, where each $\widetilde{u}_i$ represents the i-th output token. The feature selection process begins by reweighting each token to determine its relevance to the specific task. The formula
$$\widetilde{u}_i = \mathrm{sim}(u_i, \xi)\,\theta\, u_i \quad (1)$$
defines this operation. In this expression, $\xi$ is the task embedding, which represents the task’s characteristics and identifies which tokens are most pertinent to the task. The parameter matrix $\theta$ contains task-specific weights that adjust the importance of different token channels. The similarity function $\mathrm{sim}(\cdot,\cdot)$ measures the cosine similarity between the token $u_i$ and the task embedding $\xi$, producing a scalar value between 0 and 1. This value indicates how closely aligned each token is with the task, guiding the reweighting process. By multiplying $u_i$ by $\mathrm{sim}(u_i, \xi)\,\theta$, the model ensures that tokens more relevant to the task receive higher weights, while less relevant tokens are diminished. The reweighted token $\widetilde{u}_i$ is the final output, simulating a token selection process that focuses on features most aligned with the task. Subsequently, the weighted tokens undergo a linear transformation to selectively identify and extract the channels relevant to each token. Then, reshape and interpolation operations are performed to finally generate the $F_{local}$ and $F_{global}$ features. Finally, a sequential convolution consisting of three 3 × 3 convolutional layers supersedes the traditional 7 × 7, 5 × 5, and 3 × 3 convolutional layers. This modification generates $F^1_{conv}$, $F^2_{conv}$, and $F^3_{conv}$. These three convolution outputs are then aggregated to produce the final output of the sequential convolution, $F_{conv}$.
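A minimal sketch of the token-reweighting step in Equation (1), assuming tokens are flattened patch features; the learnable task embedding and channel-weight parameterization are simplifying assumptions, and the similarity is clamped to [0, 1] to match the description above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TokenSelect(nn.Module):
    """Reweight each patch token by its cosine similarity to a learned
    task embedding, scaled by task-specific channel weights (Eq. (1))."""
    def __init__(self, dim):
        super().__init__()
        self.task_embed = nn.Parameter(torch.randn(dim))   # xi
        self.theta = nn.Parameter(torch.ones(dim))         # channel weights

    def forward(self, tokens):                  # tokens: (batch, n, dim)
        sim = F.cosine_similarity(tokens, self.task_embed, dim=-1)  # (b, n)
        sim = sim.clamp(min=0)                  # keep weights in [0, 1]
        return tokens * self.theta * sim.unsqueeze(-1)

weighted = TokenSelect(64)(torch.randn(2, 49, 64))
print(weighted.shape)  # torch.Size([2, 49, 64])
```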
3.1.2. Feature Fusion and Attention
Upon obtaining the feature map $\tilde{F}$ through multi-branch feature extraction, the model’s effectiveness is boosted by adaptively amplifying features using an attention mechanism. The attention module consists of both channel and spatial attention mechanisms. First, the feature map $\tilde{F}$ is processed through the channel attention module. Here, the importance of each channel is reweighted using a one-dimensional attention map $M_c \in \mathbb{R}^{C \times 1 \times 1}$, which prioritizes the most relevant channels for infrared ship detection. This ensures that the model focuses on the channels that are most informative for identifying ships, especially under challenging conditions such as low contrast or small object size. Following this, spatial attention is applied via a two-dimensional attention map $M_s \in \mathbb{R}^{1 \times H \times W}$, which highlights critical regions within the feature map. By amplifying the spatial areas that are most likely to contain ships, this mechanism further enhances detection performance. Together, these mechanisms amplify key channels and spatial areas, producing more effective feature representations for detection.
The symbol ⊗ represents element-wise multiplication, which is essential for merging the attention maps with the feature maps. After channel attention, the features $F_c = M_c \otimes \tilde{F}$ are obtained by multiplying the feature map by the one-dimensional attention map, enhancing the most relevant channels. Similarly, spatial attention generates $F_s = M_s \otimes F_c$, emphasizing critical spatial regions. Equation (2),
$$F_{out} = \mathrm{ReLU}\big(\mathrm{BN}(\mathrm{Dropout}(F_s))\big), \quad (2)$$
describes a series of operations applied to the spatially attended feature map $F_s$ to prepare it for the final detection stages. Dropout is first used to prevent overfitting by randomly deactivating some neurons during training. Then, Batch Normalization adjusts and scales activations, improving training efficiency and generalization by reducing internal covariate shift. Finally, the ReLU activation function introduces non-linearity, allowing the model to capture more complex patterns by zeroing out negative values. Together, these steps produce a refined and expressive feature map $F_{out}$, optimized for detection tasks.
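The channel–spatial attention sequence and Equation (2) can be sketched as follows; the attention maps use common squeeze-and-excitation and CBAM-style constructions as stand-ins, since the exact layer configurations are not specified here.

```python
import torch
import torch.nn as nn

class ChannelSpatialAttention(nn.Module):
    """Channel attention (1D map M_c), then spatial attention (2D map M_s),
    then Dropout -> BatchNorm -> ReLU as in Eq. (2)."""
    def __init__(self, ch, reduction=16):
        super().__init__()
        self.mlp = nn.Sequential(nn.Conv2d(ch, ch // reduction, 1), nn.ReLU(),
                                 nn.Conv2d(ch // reduction, ch, 1))
        self.spatial = nn.Conv2d(2, 1, kernel_size=7, padding=3)
        self.drop, self.bn, self.act = nn.Dropout(0.1), nn.BatchNorm2d(ch), nn.ReLU()

    def forward(self, x):
        # M_c: pool over space, excite over channels
        m_c = torch.sigmoid(self.mlp(x.mean((2, 3), keepdim=True))
                            + self.mlp(x.amax((2, 3), keepdim=True)))
        x = x * m_c
        # M_s: pool over channels, convolve over space
        m_s = torch.sigmoid(self.spatial(torch.cat(
            [x.mean(1, keepdim=True), x.amax(1, keepdim=True)], dim=1)))
        x = x * m_s
        return self.act(self.bn(self.drop(x)))   # Eq. (2)

print(ChannelSpatialAttention(64)(torch.randn(1, 64, 40, 40)).shape)
```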
3.2. High-Level Screening-Feature Pyramid Networks
To address the challenges of varying scales in infrared ship data, we employed the HS-FPN. This network is designed to capture and fuse infrared ship features across different scales, addressing the complexities associated with multi-scale target detection. At the outset, feature maps across various scales are filtered and processed within the feature selection module to distill and amplify pertinent information. Subsequently, a selective feature fusion mechanism is used to collaboratively integrate high-level and low-level information across different scales. This integration produces a feature map replete with rich semantic information, better equipped to identify subtle features within the infrared ship dataset, thereby markedly advancing the model’s detection capabilities.
Feature Selection Module: This module includes both channel attention (CA) and dimension matching (DM). The CA component utilizes global average pooling and max pooling to determine channel weights, which enhances the emphasis on key features, as illustrated in Figure 4. The integration of CA for feature selection and PPA for feature extraction forms a highly efficient and multi-dimensional complementary feature processing framework. This framework demonstrates significant applicability in complex near-shore environments, particularly for infrared images of sailboats, where key features may not be immediately apparent. PPA utilizes spatial attention and local–global attention mechanisms to fine-tune the feature map, effectively integrating low-level and high-level features, thereby providing multi-layered input for subsequent channel attention. CA then recalibrates the channels, emphasizing the most informative ones, which enhances the representation of crucial features. This hierarchical attention approach optimizes both spatial location and channel representations. The former improves the model’s adaptability to various input transformations, while the latter boosts robustness against noise or irrelevant backgrounds, ultimately enhancing the overall feature learning capability. In complex target detection tasks, this synergistic mechanism plays a crucial role in improving the accuracy and stability of infrared sailboat recognition.
The processed feature map uses a Sigmoid activation function to obtain the weight of each channel, which is subsequently used to produce a weighted feature map through element-wise multiplication with the corresponding features. Maximum pooling is utilized to extract salient features from each channel to ensure that important features are preserved; average pooling obtains the average information across each channel to minimize information loss. By integrating these two pooling strategies, the CA module comprehensively extracts and retains representative information while reducing information loss. Prior to feature fusion, the disparate channel counts across feature maps of different scales could result in size mismatches during integration. To address this, the DM module implements 1 × 1 convolution to standardize the channel count of all feature maps to 256, thereby facilitating a uniform fusion process.
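A sketch of the feature selection step under these assumptions: channel weights derived from combined average and max pooling, followed by a 1 × 1 dimension-matching convolution to 256 channels. The shared weight layer is a simplifying assumption.

```python
import torch
import torch.nn as nn

class FeatureSelection(nn.Module):
    """HS-FPN-style selection: sigmoid channel weights from avg+max pooling
    (CA), then a 1x1 conv that standardizes channels to 256 (DM)."""
    def __init__(self, ch, out_ch=256):
        super().__init__()
        self.fc = nn.Conv2d(ch, ch, 1)        # shared weight layer
        self.dm = nn.Conv2d(ch, out_ch, 1)    # dimension matching

    def forward(self, x):
        w = torch.sigmoid(self.fc(nn.functional.adaptive_avg_pool2d(x, 1))
                          + self.fc(nn.functional.adaptive_max_pool2d(x, 1)))
        return self.dm(x * w)   # reweight channels, then match dimensions

for ch, size in [(128, 80), (256, 40), (512, 20)]:   # multi-scale inputs
    y = FeatureSelection(ch)(torch.randn(1, ch, size, size))
    print(y.shape)   # all standardized to 256 channels
```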
Feature Fusion Module: This is designed to amalgamate multi-scale feature maps generated from the backbone network to achieve accurate object detection. Multi-scale feature maps consist of high-level features that capture semantic information and object categories and low-level features that include edges and textures. Feature Pyramid Networks and Path Aggregation Networks enhance a model’s ability to perceive objects at different scales through multi-scale feature fusion. However, these methods typically rely on simple top-down or bottom-up fusion strategies, which, while effective at merging low-level and high-level features, may fall short of capturing fine details and key features in complex scenarios. In contrast, the feature fusion module used in this algorithm leverages the CA mechanism to amplify the most discriminative features while suppressing redundant or irrelevant ones. This is especially crucial for infrared sailboat detection, where key features, such as contours and sails, are often subtle. The channel attention mechanism effectively highlights these critical feature channels, enhancing the model’s ability to recognize weak and detailed features, thereby improving detection accuracy.
As illustrated in Figure 4, to integrate these features effectively, a resizing step is required. First, the high-level feature ($f_{high}$) is enlarged, adjusting its size to better integrate with the low-level feature ($f_{low}$). The enlarged high-level feature is obtained by applying a transposed convolution with a stride of 2, which improves resolution while maintaining semantic integrity. Following this, bilinear interpolation is used to align the high-level feature with the low-level feature, standardizing their dimensions to produce an intermediate feature map ($f_{att}$). The CA module then uses the semantic richness of the high-level features to generate attention weights, which are applied to filter and refine the low-level features, leading to efficient feature fusion and ultimately producing the final output feature map ($f_{out}$). The process of feature selection and fusion is illustrated in Equations (3) and (4):
$$f_{att} = \mathrm{BL}\big(\text{T-Conv}(f_{high})\big) \quad (3)$$
$$f_{out} = f_{low} \times \mathrm{CA}(f_{att}) + f_{att} \quad (4)$$
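Under the reconstructed Equations (3) and (4), the fusion step can be sketched as follows; `ca_fc` stands in for the CA module, and the 256-channel width follows the dimension-matching convention above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SelectiveFusion(nn.Module):
    """f_att = BL(T-Conv(f_high));  f_out = f_low * CA(f_att) + f_att."""
    def __init__(self, ch=256):
        super().__init__()
        self.t_conv = nn.ConvTranspose2d(ch, ch, kernel_size=2, stride=2)
        self.ca_fc = nn.Conv2d(ch, ch, 1)   # stand-in for the CA module

    def forward(self, f_high, f_low):
        f_att = self.t_conv(f_high)                        # upsample 2x, Eq. (3)
        f_att = F.interpolate(f_att, size=f_low.shape[2:],
                              mode="bilinear", align_corners=False)
        w = torch.sigmoid(self.ca_fc(F.adaptive_avg_pool2d(f_att, 1)))
        return f_low * w + f_att                           # Eq. (4)

f_high, f_low = torch.randn(1, 256, 20, 20), torch.randn(1, 256, 40, 40)
print(SelectiveFusion()(f_high, f_low).shape)  # torch.Size([1, 256, 40, 40])
```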
3.3. Multi-Layer Dynamic Shuffle Transformer
This section focuses on the proposed MDST module, which synergistically incorporates channel shuffle, group convolution, depthwise separable convolution, and Vision Transformer technologies to optimize the neck network. The structure is shown in Figure 5. This integration significantly improves computational efficiency while maintaining exemplary performance, thereby enhancing both efficiency and adaptability.
As illustrated in Figure 5, group convolution aims to reduce the number of parameters and the computational demands of the model while augmenting its robustness and generalization capabilities. It partitions the input channels into multiple groups, with each group conducting an independent convolution operation. This configuration offers several advantages: by grouping input channels, group convolution significantly reduces the number of convolution kernel parameters. The reduction in the number of parameters helps prevent overfitting and improves the generalization ability of the model. Moreover, by processing input features partially independently, group convolution fortifies the model’s resilience against noise and interference. In addition, we draw on the channel shuffle technique from ShuffleNetV2 to promote the exchange of feature information between groups, enabling the integration and complementation of features across different groups, thus boosting overall feature expressiveness. Although the parameter count is reduced, channel shuffling preserves feature diversity and richness, circumventing the feature redundancy and information loss that grouped convolutions might otherwise induce. The combined application of group convolution and ShuffleNetV2’s channel shuffle not only reduces the number of parameters and the computing requirements of the model but also maintains the diversity and richness of features.
The MDST module innovatively integrates the Vision Transformer to substantially enhance both the model’s computational efficiency and overall performance. Within MDST, the input features are split in a 3:1 ratio; one part undergoes group convolution and channel shuffling operations, which reduces computational requirements while maintaining feature diversity. Considering that convolution operations are generally less computationally intensive than fully connected layers and are better suited to large-scale image data and local feature processing, we replace the traditional fully connected linear layers with convolution operations. This strategic modification not only reduces computational requirements but also aligns more closely with the characteristics of convolutional neural networks, thereby bolstering model performance. To emulate the functions of the fully connected layers, specific convolutions are employed: DSConv uses depthwise convolution to extract spatial features and pointwise convolution to distill channel features. This arrangement significantly cuts down on computational load while boosting computational efficiency.
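A plausible skeleton of an MDST block under the description above: the input is split 3:1, the larger part passes through group convolution plus channel shuffle, and a depthwise-separable convolution replaces the fully connected transformation. The group count, skip-connection placement, and layer stacking are assumptions.

```python
import torch
import torch.nn as nn

def channel_shuffle(x, groups):
    b, c, h, w = x.shape
    return x.view(b, groups, c // groups, h, w).transpose(1, 2).reshape(b, c, h, w)

class DSConv(nn.Module):
    """Depthwise conv for spatial features + pointwise conv for channels."""
    def __init__(self, ch):
        super().__init__()
        self.dw = nn.Conv2d(ch, ch, 3, padding=1, groups=ch)
        self.pw = nn.Conv2d(ch, ch, 1)

    def forward(self, x):
        return self.pw(self.dw(x))

class MDSTBlock(nn.Module):
    def __init__(self, ch, groups=4):
        super().__init__()
        self.c_main = ch * 3 // 4                        # 3:1 channel split
        self.group_conv = nn.Conv2d(self.c_main, self.c_main, 3,
                                    padding=1, groups=groups)
        self.groups = groups
        self.ds = DSConv(ch)

    def forward(self, x):
        main, rest = x.split([self.c_main, x.size(1) - self.c_main], dim=1)
        main = channel_shuffle(self.group_conv(main), self.groups)
        out = torch.cat([main, rest], dim=1)
        return x + self.ds(out)                          # skip connection

print(MDSTBlock(64)(torch.randn(1, 64, 20, 20)).shape)
```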
4. Experiment and Discussion
4.1. Experimental Setup and Dataset Overview
We use the InfiRay platform dataset [36] as experimental data. A collection of over 8000 infrared images was gathered from coastal ports and docks, with resolutions of 384 × 288, 640 × 512, and 1280 × 1024. We used 5881 images for training, 840 for validation, and 1681 for testing, a 7:1:2 ratio. The seven ship classifications and the number of labels in the dataset are shown in Figure 6, where fishing boats, sailboats, and canoes are the majority, reflecting real ocean scenarios. The bounding box overlay shows the size distribution of the ship labels, which covers a wide range of ship shapes and sizes, testing the model’s capability to identify ships across various scales.
The hardware devices for this study are the E5-2680 CPU (Intel, Santa Clara, CA, USA) and RTX 3090 (24G) GPU (NVIDIA, Santa Clara, CA, USA); the Ubuntu version is 20.04 (Canonical, London, UK), and the framework is PyTorch (Facebook, Menlo Park, CA, USA). The experimental parameters were set according to the default configuration of RT-DETR, with the batch size set to 16.
4.2. Performance Measures
In object detection, mAP is a key metric that evaluates the model performance by averaging precision over all classes. mAP50 refers to mAP with a 50% IoU threshold, while mAP50:95 averages mAP across thresholds from 0.5 to 0.95. GFLOPs are calculated by summing the floating-point operations required during the model’s forward and backward passes, providing a measure of its computational complexity and processing efficiency. Parameters represent the total number of trainable weights, calculated by adding the weights across all layers. The number of parameters reflects the model’s size and potential overfitting risk, as larger models typically require more data to generalize well. Together, these metrics help balance model accuracy, speed, and resource consumption in object detection tasks.
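As a quick reference, the parameter count of any PyTorch model can be computed directly; GFLOPs usually require a profiling utility, so only the parameter side is sketched here (the toy model is illustrative).

```python
import torch.nn as nn

model = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
                      nn.Conv2d(16, 32, 3, padding=1))
num_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"{num_params / 1e6:.3f} M parameters")   # total trainable weights
```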
4.3. Impact of Different Modules
Unlike other attention mechanisms, a model incorporating Channel Attention can more effectively capture global context, leading to a better understanding of the entire infrared scene and thus improving the accuracy of ship detection. Additionally, Channel Attention integrates bottleneck transformation to streamline information and reduce redundancy in global features. Although this approach slightly increases computational complexity, the trade-off is a significant improvement in the model’s detection performance, making it a worthwhile enhancement.
To assess the impact of the feature fusion module on detection accuracy, the study employed different attention mechanisms as part of the feature fusion strategy. Feature fusion occurs on transformed data, where various attention mechanisms operate on feature representations from previous layers of the network. This allows for more efficient extraction and refinement of relevant features that improve detection accuracy. Under identical parameter settings and in the absence of pre-trained weights, considering model parameters, GFLOPs, and mAP, Channel Attention, Efficient Local Attention, Coordinate Attention, and Context Anchor Attention are each employed collaboratively with C2f-PPA for comparative experiments. The attention collaboration method proposed in this paper exhibits robust performance in average detection accuracy, and its parameters and computational load are comparable to those of the other algorithms. Experiments indicate that Channel Attention more comprehensively extracts and retains the most representative information in each channel. The results are presented in Table 1, where all values corresponding to mAP50 and mAP50:95 are dimensionless.
4.4. Ablation Results
To assess the performance of the MAFF-DETR model, ablation experiments were performed, with comparisons made against the original RT-DETR algorithm (S0). The findings from the ablation study are presented in Table 2, where all values corresponding to layers, mAP50, and mAP50:95 are dimensionless. The table features six schemes (S0–S5), each representing a combination of different improvement strategies: S0 employs no strategy, while S5 implements all five strategies.
This study utilized sequentially stacked modules for comparison. First, testing was conducted using only the original RT-DETR model; then C2f-PPA, HS-FPN, and MDST were added sequentially. After the inclusion of C2f-PPA, mAP50:95 increased by 0.9%, the number of parameters decreased by 1.9 M, and the computational load increased by 3.2 GFLOPs. Following the addition of C2f-PPA and HS-FPN, the mAP improved with 1.7 M fewer parameters and 3.6 GFLOPs less computation than S1. Upon further addition of the MDST module, the mAP improved by an additional 0.5% with 0.7 M fewer parameters and 5.4 GFLOPs less computation than S4. Compared with the original RT-DETR-r18 model, the final model achieved a reduction in the number of parameters of 4.3 M, a decrease in computational demand of 5.8 GFLOPs, and an increase in mAP of 1.7%. The results show that the model reduces computational cost while improving accuracy and is suitable for infrared ship detection.
The module proposed in this paper has higher accuracy and robust performance in multi-scale and small-target detection. Bulk carriers, container ships, and warships exhibit higher detection accuracies due to their generally small numbers and large sizes. In contrast, fishing boats, sailboats, and canoes, which are numerous and often constitute smaller targets, are prone to blending with complex backgrounds, thereby complicating detection efforts. Therefore, the average accuracy for these three types of ships is lower, as indicated in Table 3, where all values corresponding to mAP50 and mAP50:95 are dimensionless.
4.5. Comparison with Alternative Algorithms
We compared the improved model with current mainstream models to validate its effectiveness, including the YOLO series models [15,16,37,38,39], RT-DETR [21], and other advanced models.
This paper compares the proposed MAFF-DETR with the YOLO series algorithms, both of which were implemented without the use of pre-trained weights. The MAFF-DETR model significantly outperforms the YOLO series models in terms of mAP50, achieving superior detection results with fewer parameters and computational requirements. While its mAP50:95 is comparable to that of the latest YOLO series models, our approach demonstrates significant improvements in model efficiency.
Specifically, as shown in Table 4, MAFF-DETR achieves superior performance with significantly fewer parameters and lower computational overhead. All values in Table 4 corresponding to mAP50 and mAP50:95 are dimensionless. This makes our model particularly suitable for real-time applications on resource-constrained platforms, such as unmanned ships, where minimizing computational cost is critical.
In order to verify the effectiveness of the proposed algorithm, comparisons were made with several other models, including ATSS [40], Cascade-RCNN [11], DDQ-DETR [41], DINO [42], Faster-RCNN [9], GFL [43], RetinaNet [13], and TOOD [44], all of which utilized ResNet50 as the backbone network. To ensure a fair comparison, these algorithms were trained under identical experimental conditions using the same dataset. The comparative results are documented in Table 5, where all values corresponding to mAP50 and mAP50:95 are dimensionless. The mAP of our model surpasses that of the previously most accurate model, RT-DETR-R18, by 1.7%. MAFF-DETR offers significant improvements over DINO, GFL, and TOOD in infrared ship detection, particularly in addressing challenges such as scale disparity and information loss. The C2f-PPA backbone of MAFF-DETR integrates PPA to effectively mitigate the information loss caused by multiple downsampling steps in infrared images, preserving critical features for more accurate detection; this gives it a key advantage over DINO, which lacks a dedicated feature preservation mechanism. Moreover, MAFF-DETR introduces the High-level Screening-feature Fusion Pyramid, optimizing the model’s neck to better handle scale variations in infrared ship detection, enabling it to detect ships of various sizes more effectively than GFL, which tends to struggle with larger or clustered ships as it focuses more on harder-to-detect objects. The Channel Attention mechanism in MAFF-DETR further enhances feature refinement, allowing it to distinguish finer details, an area where TOOD may falter due to its reliance on task alignment without a robust scale-sensitive mechanism. Finally, the MDST in MAFF-DETR reduces computational costs while enhancing feature diversity, making it more efficient than both GFL and TOOD, which tend to be more computationally expensive. The use of stacked standard convolutions and DSConv increases the model’s nonlinear expressive capability, making MAFF-DETR more adaptable and accurate for large-scale infrared ship detection in complex, dynamic marine environments.
The MDST module excels in processing multi-scale information, enhancing small object features through dynamic group convolution and feature shuffling. It surpasses advanced algorithms such as the YOLO series and DINO in infrared ship detection. It demonstrates advantages in accuracy and efficiency, especially when processing scenes characterized by extensive details and dynamic range. Moreover, MDST optimizes computing resources effectively to maintain high performance, even under conditions of limited resources.
4.6. Results Visualization
To intuitively demonstrate the performance of MAFF-DETR, four common scenes were selected for comparison between the detection results of the original RT-DETR model and our model. As shown in Figure 7, the leftmost infrared image shows very few texture features, and the thermal features of the ship are very similar to those of the surrounding environment. In the middle is the detection result of the original RT-DETR model, where the red box marks the missed and misdetected cases of the original model compared with MAFF-DETR.
Figure 8a shows that the original RT-DETR model fails to effectively detect multiple adjacent ships, whereas our model successfully identifies subtle texture features and distinguishes them. In Figure 8b, where the target area is small and exhibits night reflections, all methods show a certain degree of incomplete recognition. However, MAFF-DETR is able to identify more ships and performs more effectively in detecting small and densely packed targets at the sea–sky boundary. Figure 8c reveals that RT-DETR incorrectly detects the shore as a ship, while our model accurately distinguishes shore buildings from ships in the water. Figure 8d presents more complex scenes where ships blend into the environment and suffer from severe occlusion. Despite these challenges, our model maintains high accuracy, especially in identifying occluded sailboats and canoes at the edge of the picture. This difference may stem from the interference of larger targets with the detection of smaller targets when adjacent targets are in close proximity. Our model introduces C2f-PPA to extract features across different scales and levels, enhancing the accuracy of small object detection by capturing multi-scale features, thus enabling the accurate identification of objects of varying sizes.
To verify the effectiveness of the enhanced algorithm, a heat map was generated to illustrate the feature distribution and model focus areas. As depicted in Figure 8, the RT-DETR-r18 model struggles with detecting small targets and blurred scenes, frequently losing target information and incorrectly detecting shore buildings. In contrast, our model effectively suppresses background interference and enhances focus on ship features. In densely occluded scenes, RT-DETR is prone to false and missed detections, while MAFF-DETR demonstrates higher localization and recognition accuracy. Our model integrates HS-FPN and CA, utilizing CA to extract representative information and minimize information loss. The proposed MDST module specifically enhances infrared features, captures complex and diverse feature information, reduces feature redundancy, and provides excellent performance for ship detection.
5. Conclusions
This paper presents MAFF-DETR, an innovative end-to-end network for multi-class object detection in infrared images. We integrated C2f-PPA into the backbone to enhance the representation of small objects during downsampling. Additionally, we refined the neck structure with insights from HS-FPN and incorporated the MDST module, which combines group convolution, ShuffleNetV2’s channel shuffle, and the Vision Transformer. These improvements reduce the model’s parameters and computational cost while maintaining high detection accuracy.
Experimental results on the public infrared ship dataset show that the improved algorithm boosts mAP by 1.7%, reduces parameters by 4.3 M, and decreases computational demand by 5.8 GFLOPs, effectively balancing detection accuracy and simplicity. MAFF-DETR excels in real-world applications such as maritime surveillance, ship navigation, and port security, especially in challenging environments with poor lighting or varying ship sizes. Its lightweight design allows efficient operation on resource-limited devices like drones, edge computing platforms, and mobile monitoring systems.
However, MAFF-DETR’s performance may be sensitive to some environmental factors in the real world, such as sea-surface reflection, sudden weather changes, or variations in infrared sensor quality, which could lower detection accuracy. Additionally, although the model has been improved for small-object detection, it may not perform well in cases involving heavily occluded ships or scenes where ships and background elements share similar thermal signatures, which could reduce its ability to accurately detect targets. Future work will focus on improving the model’s robustness in diverse environments and further optimizing hardware compatibility for broader real-world deployment.