CN117994573A - Infrared dim target detection method based on superpixel and deformable convolution - Google Patents
- Publication number
- CN117994573A (application CN202410073036.6A)
- Authority
- CN
- China
- Prior art keywords
- convolution
- feature
- super
- infrared
- feature map
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G06V10/764: Image or video recognition using pattern recognition or machine learning, using classification, e.g. of video objects
- G06N3/0464: Convolutional networks [CNN, ConvNet]
- G06N3/048: Activation functions
- G06T7/10: Segmentation; edge detection
- G06V10/40: Extraction of image or video features
- G06V10/763: Clustering, non-hierarchical techniques, e.g. based on statistics of modelling distributions
- G06V10/806: Fusion of extracted features at the sensor, preprocessing, feature extraction or classification level
- G06T2207/10048: Infrared image
Abstract
An infrared dim target detection method based on superpixels and deformable convolution, belonging to the field of target detection, comprises: collecting an original image; performing clustering-based superpixel segmentation on the original image to obtain a plurality of superpixel block sequences; inputting the superpixel block sequences into a deformable-convolution feature extraction backbone network to obtain a plurality of feature maps of different scales; inputting the feature maps of different scales into a multi-scale feature fusion network to obtain a plurality of fused feature maps of different scales; and detecting the fused feature maps with anchor boxes, optimizing all anchor boxes with K-means clustering, and extracting all regions where infrared dim targets may exist to obtain the final detection result. Even when the infrared dim target is small and has low contrast against the background, the method can segment candidate target regions from the original image, improving the detection effect and reducing the false alarm rate of infrared dim target detection.
Description
Technical Field
The invention belongs to the technical field of target detection, and particularly relates to an infrared dim target detection method based on superpixels and deformable convolution.
Background
Object detection is a computer-vision task that aims to locate all objects of interest in an image or video; it is widely applied in traffic, video surveillance, military and other fields. Conventional target detection methods, chiefly morphological processing, sliding-window search and HOG detectors, are simple but offer limited detection performance.
In recent years, deep learning has been applied to target detection with increasing frequency. RCNN first brought convolutional neural networks to the task: it generates a set of candidate regions, extracts a feature vector from each, and feeds the vectors to an SVM classifier to predict the probability that each candidate region contains an object. For infrared dim targets, however, existing detection methods face the following problems. First, the targets are small and dim, so they are easily submerged in the background or disturbed by noise. Second, the color, texture and shape information of the target is difficult to extract, so classic descriptors such as moment features, contour features and local invariant feature points are hard to apply, and objects are poorly distinguishable from one another. Third, owing to the infrared imaging mechanism, background regions whose infrared radiation characteristics resemble the target's are treated as false targets that are often difficult to reject; infrared target detection is therefore more susceptible to interference than visible-light detection, especially from heat sources of similar shape. Fourth, infrared imaging sensors have low resolution, so the acquired images are somewhat blurred, heavily corrupted by noise and low in signal-to-noise ratio, which injects erroneous information into the feature extraction process.
Disclosure of Invention
Aiming at the low detection rate and high false alarm rate of existing target detection methods on infrared dim targets, the invention provides an infrared dim target detection method based on superpixels and deformable convolution. Even when the infrared dim target is small and has low contrast against the background, the method segments candidate target regions from the original image and performs detection on those specific regions. It introduces a deformable-convolution feature extraction backbone network to realize multi-scale feature extraction and detect infrared dim targets more adaptively, and can therefore be widely applied in practical infrared image processing systems.
The technical solution adopted by the invention to solve the above technical problems is as follows:
The invention discloses an infrared dim target detection method based on superpixels and deformable convolution, which mainly comprises the following steps:
step S1: collecting an original image;
step S2: performing clustering-based superpixel segmentation on the original image to obtain a plurality of superpixel block sequences;
step S3: inputting the superpixel block sequences into a deformable-convolution feature extraction backbone network to obtain a plurality of feature maps of different scales;
step S4: inputting the feature maps of different scales into a multi-scale feature fusion network to obtain a plurality of fused feature maps of different scales;
step S5: detecting the fused feature maps of different scales with anchor boxes, optimizing all obtained anchor boxes with K-means clustering, and extracting all regions where infrared dim targets may exist to obtain the final detection result.
Furthermore, the original images are generated by an existing infrared scene simulation system, manually annotated, and divided into a training set, a verification set and a test set.
Further, the specific operation flow of step S2 is as follows:
step S2.1: sampling cluster centers on a regular grid with spacing S pixels to obtain initial superpixel blocks;
step S2.2: moving each sampling center to the lowest-gradient position within its 3×3 neighborhood;
step S2.3: introducing a Euclidean distance D to find the nearest cluster center of each pixel, searching for similar pixels within the 2S×2S region around each superpixel center;
step S2.4: recomputing each cluster center from the pixels' new labels; computing the residual between the new and previous cluster centers using the L2 norm and comparing it with a residual threshold: if the residual exceeds the threshold, repeating the computation of the distance D and the cluster centers; if the residual is below the threshold, stopping iteration, finally obtaining the superpixel block sequence after superpixel preprocessing.
Further, the deformable-convolution feature extraction backbone network consists of a deformable convolution layer, a batch normalization layer, a SiLU activation function, three convolution layers and a rectified linear unit (ReLU); the superpixel block sequence is input into the backbone network and processed by the deformable convolution layer, the batch normalization layer, the SiLU activation function, the three convolution layers and the ReLU in sequence to obtain a plurality of feature maps of different scales.
Further, in step S3 an offset is introduced into the convolution kernel during multi-scale feature extraction. The offset is produced by an additional convolution layer applied to the same input feature map, with the same spatial resolution and dilation as the current convolution layer, so the generated offset field has the same spatial resolution as the input feature map. During training, the convolution kernel that produces the output feature map and the convolution kernel that generates the offsets are learned simultaneously. The calculation formula is:

$$Y(p_0) = \sum_{p_n \in \mathcal{R}} w(p_n) \cdot x(p_0 + p_n + \Delta p_n)$$

where $Y$ is the output feature map, $w$ the convolution weight at each sampled position, $x$ the input feature map, $p_0$ a point of the output feature map, $p_n$ the n-th position within the convolution grid $\mathcal{R}$, and $\Delta p_n$ the learned offset, with n = 1, …, N.
Further, the multi-scale feature fusion network consists of three 1×1 convolution layers, three 3×3 convolution layers, three upsampling layers, three fusion modules and three feature extraction modules; the 1×1 convolution layers change the number of feature channels, the 3×3 convolution layers downsample the feature maps, the upsampling layers convert low-resolution feature maps into high-resolution ones, the fusion modules splice and fuse deep and shallow feature maps of the same resolution, and the feature extraction modules extract feature information from the fused feature maps.
Further, the specific operation flow of step S4 is as follows:
step S4.1: denoting the feature maps of different scales fip1, fip2, fip3 and fip4, upsampling feature map fip4 and performing multi-scale fusion with feature map fip3 to obtain feature map fcp1;
step S4.2: upsampling feature map fcp1 and performing multi-scale fusion with feature map fip2 to obtain output feature map fop1;
step S4.3: upsampling output feature map fop1 and performing multi-scale fusion with feature map fip1 to obtain output feature map fop2;
step S4.4: applying convolution processing to output feature map fop2 and performing multi-scale fusion with the upsampled output feature map fop1 to obtain output feature map fop3.
The beneficial effects of the invention are as follows:
Compared with visible-light images, infrared images are formed by a more complex process and, being affected by equipment and environment, are more difficult to acquire, so conventional large-target detection methods cannot be used. In addition, existing CNN-based target detection is limited by oversized models and instability under geometric variation: the convolution units of a CNN sample the input feature map only at fixed positions, all activation units within the same CNN layer share the same receptive field size, and the network lacks an internal mechanism for handling varied geometric shapes. Since different locations may correspond to objects of different scales or shapes, conventional CNNs are ill-suited to finer target detection tasks. The invention therefore provides an infrared dim target detection method based on superpixels and deformable convolution, which uses superpixels to segment the regions where infrared dim targets may exist, extracts target features adaptively through the deformable-convolution feature extraction backbone network, and detects targets with an anchor-based method, greatly improving detection capability and achieving a better detection effect.
Drawings
FIG. 1 is a flow chart of a method for detecting infrared dim targets based on superpixels and deformable convolution according to the present invention.
Fig. 2 is a schematic diagram of a deformable convolution feature extraction backbone network.
Fig. 3 is a schematic diagram of a multi-scale feature fusion network.
Detailed Description
The invention provides an infrared dim target detection method based on superpixels and deformable convolution that mainly comprises two parts. First part: use a superpixel algorithm to segment the regions where infrared dim targets may exist. Second part: extract features from the superpixel-preprocessed original image with the deformable-convolution feature extraction backbone network, generate multi-scale feature maps, process them with the multi-scale feature fusion network, and feed the output to a target detection network, which produces the final detection result.
Reference will now be made in detail to embodiments of the present invention, examples of which are illustrated in the accompanying drawings. The embodiments described below by referring to the drawings are illustrative only and are not to be construed as limiting the invention.
The invention provides an infrared dim target detection method based on superpixels and deformable convolution, which specifically comprises the following steps, as shown in fig. 1:
step S1: collecting an original image;
The infrared video data used by the invention are generated by an infrared scene simulation system and comprise 20 infrared videos, each containing 100 original images; the images are manually annotated and divided into a training set, a verification set and a test set.
Step S2: pre-processing super pixels;
And respectively carrying out simple clustering super-pixel segmentation on the original images to obtain a plurality of super-pixel block sequences. Through super-pixel preprocessing of the input original image, the learning capacity of the background can be enhanced, and meanwhile, the effect of data augmentation is achieved. The specific operation steps are as follows:
Step S2.1: sampling is carried out on a regular grid with S pixels at intervals, so that approximately equal super-pixel blocks are obtained;
Step S2.2: moving the sampling center to a position corresponding to the lowest gradient position in the 3 x 3 neighborhood, avoiding positioning the superpixel on the edge, and reducing the chance of infrared weak and small objects and noise affecting the superpixel result;
Step S2.3: introducing a Euclidean distance D to calculate the nearest cluster center of each pixel, and searching similar pixels in a region 2S multiplied by 2S around the super pixel center;
Step S2.4: re-calculating the center of each cluster according to the new label of each pixel; calculating residual errors of the new clustering center and the previous clustering center by adopting an L2 norm, and comparing the residual errors with a residual error threshold value: if the residual error is larger than the residual error threshold value, repeating calculation of the Euclidean distance D and the clustering center; and if the residual error is smaller than the residual error threshold value, stopping iteration, and finally obtaining the super-pixel block sequence after super-pixel pretreatment.
Step S3: input the superpixel block sequence into the deformable-convolution feature extraction backbone network to obtain a plurality of feature maps of different scales. Using the deformable-convolution backbone makes target detection more adaptive and further raises the detection probability for infrared dim target images.
The specific operation steps are as follows:
Step S3.1: input the superpixel block sequence into the deformable-convolution feature extraction backbone network and extract its multi-scale features.
The deformable-convolution feature extraction backbone network, shown in fig. 2, consists of four stages, each composed of a deformable convolution layer DConv, a batch normalization layer BN (Batch Normalization), an activation function SiLU (Sigmoid Linear Unit), three convolution layers Conv and a rectified linear unit ReLU (Rectified Linear Unit). The input original image is superpixel-preprocessed into a superpixel block sequence, which is fed into the backbone network and processed by DConv, BN, SiLU, the three Conv layers and ReLU in sequence to produce a feature map at each stage; the backbone thus extracts feature maps at four different scales, {fip1, fip2, fip3, fip4}. The backbone replaces ordinary convolution kernels with deformable convolutions, so target features are extracted more adaptively.
Step S3.2: during multi-scale feature extraction, the invention introduces an offset into the convolution kernel. The offset is produced by an additional convolution layer applied to the same input feature map, with the same spatial resolution and dilation as the current convolution layer, so the generated offset field has the same spatial resolution as the input feature map. During training, the convolution kernel that produces the output feature map and the convolution kernel that generates the offsets are learned simultaneously. The calculation formula is:

$$Y(p_0) = \sum_{p_n \in \mathcal{R}} w(p_n) \cdot x(p_0 + p_n + \Delta p_n)$$

where $Y$ is the output feature map, $w$ the convolution weight at each sampled position, $x$ the input feature map, $p_0$ a point of the output feature map, $p_n$ the n-th position within the convolution grid $\mathcal{R}$, and $\Delta p_n$ the learned offset, with n = 1, …, N.
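As an illustration of one backbone stage (DConv, BN, SiLU, three Conv layers, ReLU) together with the jointly learned offset branch described above, here is a hedged PyTorch sketch built on torchvision.ops.DeformConv2d. The channel plan, strides and input size are assumptions for illustration; the patent specifies only the layer sequence and the offset mechanism.

```python
import torch
import torch.nn as nn
from torchvision.ops import DeformConv2d

class DeformableStage(nn.Module):
    """One backbone stage: DConv -> BN -> SiLU -> three Conv -> ReLU.
    Channel widths and the stride-2 downsampling are assumptions."""
    def __init__(self, c_in, c_out, k=3):
        super().__init__()
        # A plain conv predicts the 2*k*k (x, y) offsets; it shares the
        # spatial resolution and dilation of the deformable layer and is
        # learned jointly with it, matching the formula above.
        self.offset = nn.Conv2d(c_in, 2 * k * k, k, stride=2, padding=k // 2)
        self.dconv = DeformConv2d(c_in, c_out, k, stride=2, padding=k // 2)
        self.bn = nn.BatchNorm2d(c_out)
        self.silu = nn.SiLU()
        self.convs = nn.Sequential(
            nn.Conv2d(c_out, c_out, 3, padding=1),
            nn.Conv2d(c_out, c_out, 3, padding=1),
            nn.Conv2d(c_out, c_out, 3, padding=1),
        )
        self.relu = nn.ReLU()

    def forward(self, x):
        y = self.dconv(x, self.offset(x))   # sample at p0 + pn + delta_pn
        y = self.silu(self.bn(y))
        return self.relu(self.convs(y))

# Four stages yield fip1..fip4; the channel plan (1 -> 32 -> 64 -> 128 -> 256)
# is an assumption for illustration.
widths = [(1, 32), (32, 64), (64, 128), (128, 256)]
backbone = nn.ModuleList(DeformableStage(ci, co) for ci, co in widths)

x = torch.randn(1, 1, 640, 640)   # stand-in for a superpixel-preprocessed image
fips = []
for stage in backbone:
    x = stage(x)
    fips.append(x)                # fip1..fip4 at strides 2, 4, 8, 16
```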
Step S4: input the obtained four feature maps of different scales into the multi-scale feature fusion network to obtain three fused feature maps of different scales.
The multi-scale feature fusion network, shown in fig. 3, mainly consists of three 1×1 convolution layers Conv, three 3×3 convolution layers Conv, three upsampling layers, three fusion modules and three feature extraction modules. The 1×1 convolution layers change the number of feature channels, the 3×3 convolution layers downsample the feature maps, the upsampling layers convert low-resolution feature maps into high-resolution ones, the fusion modules splice and fuse deep and shallow feature maps of the same resolution, and the feature extraction modules extract feature information from the fused feature maps. Because high-resolution low-level feature maps contain more detailed structure and texture information while low-resolution high-level feature maps contain rich target semantic information, the proposed deep feature fusion strategy combines structure and texture information with semantic information for infrared dim target detection and alleviates missed detections in complex scenes.
The specific operation steps are as follows:
Step S4.1: upsample feature map fip4, then perform multi-scale fusion with feature map fip3 to obtain feature map fcp1;
Step S4.2: upsample feature map fcp1, then perform multi-scale fusion with feature map fip2 to obtain output feature map fop1;
Step S4.3: upsample output feature map fop1, then perform multi-scale fusion with feature map fip1 to obtain output feature map fop2;
Step S4.4: apply convolution processing to output feature map fop2, then perform multi-scale fusion with the upsampled output feature map fop1 to obtain output feature map fop3.
Thus, three output feature maps fop1, fop2, and fop3 can be obtained by step S4.
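A sketch of the fusion path of steps S4.1 to S4.4 follows, assuming that "splicing and fusing" means channel concatenation followed by a feature-extraction convolution and that the assumed backbone channel plan from the sketch above applies; the reading of the ambiguous step S4.4 (a stride-2 3×3 convolution on fop2 before fusing with fop1) is likewise an assumption.

```python
import torch
import torch.nn as nn

class FuseBlock(nn.Module):
    """Upsample the deeper map, align channels with a 1x1 conv, concatenate
    with the shallower map, then extract features with a 3x3 conv.
    Concatenation-as-fusion and the channel plan are assumptions."""
    def __init__(self, c_deep, c_shallow):
        super().__init__()
        self.reduce = nn.Conv2d(c_deep, c_shallow, 1)          # 1x1: change channel count
        self.up = nn.Upsample(scale_factor=2, mode="nearest")  # low -> high resolution
        self.extract = nn.Conv2d(2 * c_shallow, c_shallow, 3, padding=1)

    def forward(self, deep, shallow):
        fused = torch.cat([self.up(self.reduce(deep)), shallow], dim=1)  # splice deep/shallow
        return self.extract(fused)

# Channel widths follow the assumed backbone plan (fip1..fip4 = 32/64/128/256).
f43 = FuseBlock(256, 128)   # S4.1: fip4 (up) + fip3 -> fcp1
f32 = FuseBlock(128, 64)    # S4.2: fcp1 (up) + fip2 -> fop1
f21 = FuseBlock(64, 32)     # S4.3: fop1 (up) + fip1 -> fop2
down = nn.Conv2d(32, 64, 3, stride=2, padding=1)   # 3x3 conv used for downsampling
ext3 = nn.Conv2d(128, 64, 3, padding=1)            # feature extraction after fusion

def fuse_pyramid(fip1, fip2, fip3, fip4):
    fcp1 = f43(fip4, fip3)   # S4.1
    fop1 = f32(fcp1, fip2)   # S4.2
    fop2 = f21(fop1, fip1)   # S4.3
    # S4.4 is ambiguous in the patent text; this reading downsamples fop2
    # with the 3x3 conv so it matches fop1's resolution before fusing.
    fop3 = ext3(torch.cat([down(fop2), fop1], dim=1))
    return fop1, fop2, fop3
```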
Step S5: detect the three fused feature maps of different scales with anchor boxes, optimize all obtained anchor boxes with K-means clustering, and extract all regions where infrared dim targets may exist to obtain the final detection result.
The specific operation steps are as follows:
Step S5.1: the larger a feature map, the smaller its receptive field, and standard convolutional neural networks struggle to detect infrared dim targets because these targets are usually far smaller than ordinary targets. To address this, the output feature maps fop1, fop2 and fop3, with sizes of 40×40, 80×80 and 160×160 respectively, are input into a target detection network (for example the detection-head network of YOLO, though the choice is not limited to it); feature maps at these scales suit the detection of most infrared dim targets. The lowest-level feature map l3 is fused with the higher-level feature maps l1 and l2 to obtain the 160×160 feature map, which is more sensitive to smaller infrared targets, so the proposed method can detect infrared dim and even extremely small targets in the input original image.
Step S5.2: optimize all obtained anchor boxes with K-means clustering to obtain the final detection result and output the target-annotated image. This completes the training stage of the infrared dim target detection method based on superpixels and deformable convolution.
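The patent does not spell out the K-means anchor optimization, so the sketch below follows the common YOLO-style convention of clustering the labelled box sizes (w, h) under a 1 - IoU distance; that convention, and the helper names, are assumptions rather than the patent's stated procedure.

```python
import numpy as np

def iou_wh(boxes, centers):
    """IoU between boxes and centers compared as (w, h) pairs anchored at a
    common origin; boxes has shape (N, 2), centers has shape (K, 2)."""
    inter = (np.minimum(boxes[:, None, 0], centers[None, :, 0]) *
             np.minimum(boxes[:, None, 1], centers[None, :, 1]))
    union = (boxes[:, 0] * boxes[:, 1])[:, None] + \
            (centers[:, 0] * centers[:, 1])[None, :] - inter
    return inter / union

def kmeans_anchors(wh, k=9, iters=100, seed=0):
    """Cluster labelled box sizes with a 1 - IoU distance (a YOLO-style
    convention; the patent does not state its distance metric)."""
    wh = np.asarray(wh, dtype=float)
    rng = np.random.default_rng(seed)
    centers = wh[rng.choice(len(wh), k, replace=False)]
    assign = np.zeros(len(wh), dtype=int)
    for _ in range(iters):
        new_assign = np.argmin(1.0 - iou_wh(wh, centers), axis=1)
        if (new_assign == assign).all():
            break                      # assignments converged
        assign = new_assign
        for j in range(k):
            if (assign == j).any():    # keep old center if a cluster empties
                centers[j] = wh[assign == j].mean(axis=0)
    return centers[np.argsort(centers.prod(axis=1))]  # small -> large anchors

# e.g. anchors = kmeans_anchors(box_wh_from_training_labels, k=9)
```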
In summary, compared with visible-light images, infrared images are formed by a more complex process and, being affected by equipment and environment, are more difficult to acquire, so infrared dim targets cannot be detected with conventional large-target detection methods. To solve this problem, the invention provides an infrared dim target detection method based on superpixels and deformable convolution: superpixel preprocessing segments the regions where infrared dim targets may exist, the deformable-convolution feature extraction backbone network adaptively extracts target features, and an anchor-based method performs the detection. This greatly improves the network's ability to detect infrared dim targets, achieves a better detection effect than existing methods, raises the infrared dim target detection rate and lowers the false alarm rate.
The foregoing is merely a preferred embodiment of the present invention; it should be noted that those skilled in the art may make modifications and adaptations without departing from the principles of the invention, and such modifications are intended to fall within the scope of the present invention.
Claims (7)
1. An infrared dim target detection method based on superpixels and deformable convolution, characterized by comprising the following steps:
step S1: collecting an original image;
step S2: performing clustering-based superpixel segmentation on the original image to obtain a plurality of superpixel block sequences;
step S3: inputting the superpixel block sequences into a deformable-convolution feature extraction backbone network to obtain a plurality of feature maps of different scales;
step S4: inputting the feature maps of different scales into a multi-scale feature fusion network to obtain a plurality of fused feature maps of different scales;
step S5: detecting the fused feature maps of different scales with anchor boxes, optimizing all obtained anchor boxes with K-means clustering, and extracting all regions where infrared dim targets may exist to obtain the final detection result.
2. The infrared dim target detection method based on superpixels and deformable convolution according to claim 1, wherein the original images are generated by an existing infrared scene simulation system, manually annotated, and divided into a training set, a verification set and a test set.
3. The infrared dim target detection method based on superpixels and deformable convolution according to claim 1, wherein the specific operation flow of step S2 is as follows:
step S2.1: sampling cluster centers on a regular grid with spacing S pixels to obtain initial superpixel blocks;
step S2.2: moving each sampling center to the lowest-gradient position within its 3×3 neighborhood;
step S2.3: introducing a Euclidean distance D to find the nearest cluster center of each pixel, searching for similar pixels within the 2S×2S region around each superpixel center;
step S2.4: recomputing each cluster center from the pixels' new labels; computing the residual between the new and previous cluster centers using the L2 norm and comparing it with a residual threshold: if the residual exceeds the threshold, repeating the computation of the distance D and the cluster centers; if the residual is below the threshold, stopping iteration, finally obtaining the superpixel block sequence after superpixel preprocessing.
4. The infrared dim target detection method based on superpixels and deformable convolution according to claim 1, wherein the deformable-convolution feature extraction backbone network consists of a deformable convolution layer, a batch normalization layer, a SiLU activation function, three convolution layers and a rectified linear unit (ReLU); the superpixel block sequence is input into the backbone network and processed by the deformable convolution layer, the batch normalization layer, the SiLU activation function, the three convolution layers and the ReLU in sequence to obtain a plurality of feature maps of different scales.
5. The infrared dim target detection method based on superpixels and deformable convolution according to claim 1, wherein in step S3 an offset is introduced into the convolution kernel during multi-scale feature extraction; the offset is produced by an additional convolution layer applied to the same input feature map, with the same spatial resolution and dilation as the current convolution layer, so the generated offset field has the same spatial resolution as the input feature map; during training, the convolution kernel that produces the output feature map and the convolution kernel that generates the offsets are learned simultaneously, with the calculation formula:

$$Y(p_0) = \sum_{p_n \in \mathcal{R}} w(p_n) \cdot x(p_0 + p_n + \Delta p_n)$$

where $Y$ is the output feature map, $w$ the convolution weight at each sampled position, $x$ the input feature map, $p_0$ a point of the output feature map, $p_n$ the n-th position within the convolution grid $\mathcal{R}$, and $\Delta p_n$ the learned offset, with n = 1, …, N.
6. The infrared dim target detection method based on superpixels and deformable convolution according to claim 1, wherein the multi-scale feature fusion network consists of three 1×1 convolution layers, three 3×3 convolution layers, three upsampling layers, three fusion modules and three feature extraction modules; the 1×1 convolution layers change the number of feature channels, the 3×3 convolution layers downsample the feature maps, the upsampling layers convert low-resolution feature maps into high-resolution ones, the fusion modules splice and fuse deep and shallow feature maps of the same resolution, and the feature extraction modules extract feature information from the fused feature maps.
7. The infrared dim target detection method based on superpixels and deformable convolution according to claim 1, wherein the specific operation flow of step S4 is as follows:
step S4.1: denoting the feature maps of different scales fip1, fip2, fip3 and fip4, upsampling feature map fip4 and performing multi-scale fusion with feature map fip3 to obtain feature map fcp1;
step S4.2: upsampling feature map fcp1 and performing multi-scale fusion with feature map fip2 to obtain output feature map fop1;
step S4.3: upsampling output feature map fop1 and performing multi-scale fusion with feature map fip1 to obtain output feature map fop2;
step S4.4: applying convolution processing to output feature map fop2 and performing multi-scale fusion with the upsampled output feature map fop1 to obtain output feature map fop3.
Priority Applications (1)

| Application Number | Priority Date | Filing Date | Title |
| --- | --- | --- | --- |
| CN202410073036.6A | 2024-01-18 | 2024-01-18 | Infrared dim target detection method based on superpixel and deformable convolution |
Applications Claiming Priority (1)

| Application Number | Priority Date | Filing Date | Title |
| --- | --- | --- | --- |
| CN202410073036.6A | 2024-01-18 | 2024-01-18 | Infrared dim target detection method based on superpixel and deformable convolution |
Publications (1)

| Publication Number | Publication Date |
| --- | --- |
| CN117994573A | 2024-05-07 |
Family

ID=90890476

Family Applications (1)

| Application Number | Title | Priority Date | Filing Date |
| --- | --- | --- | --- |
| CN202410073036.6A | Infrared dim target detection method based on superpixel and deformable convolution | 2024-01-18 | 2024-01-18 |

Country Status (1)

| Country | Link |
| --- | --- |
| CN | CN117994573A (pending) |
Cited By (1)

| Publication number | Priority date | Publication date | Assignee | Title |
| --- | --- | --- | --- | --- |
| CN118570493A | 2024-08-05 | 2024-08-30 | Beihang University | Feature enhancement method for weak and small targets in infrared image |
Legal Events

| Date | Code | Title |
| --- | --- | --- |
| | PB01 | Publication |
| | SE01 | Entry into force of request for substantive examination |