CN117456330A - MSFAF-Net-based low-illumination target detection method - Google Patents
- Publication number
- CN117456330A (application CN202311333750.6A)
- Authority
- CN
- China
- Prior art keywords
- convolution
- feature
- image
- low
- module
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/82—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/0464—Convolutional networks [CNN, ConvNet]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/048—Activation functions
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/20—Image preprocessing
- G06V10/25—Determination of region of interest [ROI] or a volume of interest [VOI]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/40—Extraction of image or video features
- G06V10/44—Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
- G06V10/443—Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components by matching or filtering
- G06V10/449—Biologically inspired filters, e.g. difference of Gaussians [DoG] or Gabor filters
- G06V10/451—Biologically inspired filters, e.g. difference of Gaussians [DoG] or Gabor filters with interaction between the filter responses, e.g. cortical complex cells
- G06V10/454—Integrating the filters into a hierarchical structure, e.g. convolutional neural networks [CNN]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/764—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/77—Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
- G06V10/774—Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/77—Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
- G06V10/80—Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
- G06V10/806—Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V2201/00—Indexing scheme relating to image or video recognition or understanding
- G06V2201/07—Target detection
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02T—CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
- Y02T10/00—Road transport of goods or passengers
- Y02T10/10—Internal combustion engine [ICE] based vehicles
- Y02T10/40—Engine management systems
Abstract
The invention discloses a low-illumination target detection method based on MSFAF-Net, which comprises the following steps: 1) integrating and constructing a data set; 2) training the feature extraction network module; 3) training the feature enhancement network module FEN; 4) training the feature fusion network module FFM-ECA; 5) training the target detection network; 6) training and testing the low-illumination target model. The method strengthens the feature expression capability of the feature maps used in low-illumination image target detection and thereby improves the accuracy of low-illumination target detection.
Description
Technical Field
The invention relates to deep learning, image enhancement, target detection and related technologies, and in particular to a low-illumination target detection method based on the multi-scale feature adaptive fusion network MSFAF-Net (Multi-Scale Feature Adaptive Fusion, MSFAF for short).
Background
Object detection is the basis of many vision tasks and has been applied to many aspects of daily life, such as autonomous driving, medical detection and face recognition, so target detection algorithms are particularly important. Current target detection methods can be divided into traditional detection techniques and deep-learning-based detection algorithms, and the latter can be further divided into single-stage and two-stage detectors. Single-stage algorithms, represented by the YOLO series and SSD, convert the detection problem directly into a regression problem, which greatly increases the inference speed of the model; two-stage algorithms, represented by RCNN and Faster RCNN, first generate candidate boxes and then classify and refine the candidate regions, achieving higher accuracy but training and detecting pictures more slowly. In vision-based target detection research, common neural network models can reach relatively high detection precision when the data set is of good quality, but when they are applied to low-illumination data sets their detection performance is often unsatisfactory.
Low-illumination image enhancement aims to improve the perceptual quality of data captured under insufficient lighting so that more information can be obtained. It has gradually become a research hotspot in the field of image processing and has very broad application prospects in artificial-intelligence-related industries such as autonomous driving and security. Traditional low-illumination image enhancement techniques often require advanced mathematical skills and rigorous mathematical derivation, and the resulting iterative procedures are usually complex and ill-suited to practical applications. With the successive emergence of large-scale data sets, deep-learning-based low-light image enhancement has become the current mainstream technology. The invention adopts a deep learning approach: it first extracts shallow features of the low-illumination image, then extracts deep features, fuses the extracted features, and finally fuses the enhanced features with the original low-illumination image features.
Disclosure of Invention
The invention aims to solve the problem that existing target detectors perform poorly on low-illumination images, and provides a low-illumination target detection method based on MSFAF-Net. The method further enhances the expression capability of low-illumination image features and improves detection performance to a certain extent.
The technical scheme for realizing the aim of the invention is as follows:
a MSFAF-Net-based low-illumination target detection method comprises the following steps:
1) Integrating the build data set: comprising the following steps:
1-1) the Exdark dataset contains 12 types of objects and 7363 images collected in a real low-light environment, wherein 4800 images are used for training and 2563 images are used for testing;
1-2) selecting 4800 images from the PASCAL VOC dataset and applying gamma transformation and superimposed Gaussian noise to generate low-illumination images for training, the synthesis being as shown in formula (1):
I_low = β·I_in^γ + N_o (1),
where I_low is the synthesized low-illumination image, I_in is the input image, N_o is the superimposed Gaussian noise, and β and γ are the coefficients of the gamma transformation (an illustrative synthesis sketch is given after step 1-3));
1-3) training phase, wherein the model is trained on a data set formed by mixing 4800 real low-illumination images and 4800 synthesized low-illumination images, and the model is tested on 2563 real low-illumination images;
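As a non-limiting illustration of the synthesis in steps 1-2) and 1-3), the following Python sketch darkens a normally exposed image with a gamma transformation and superimposes Gaussian noise according to formula (1); the concrete values of β, γ and the noise standard deviation are assumptions chosen for the example, not values specified by the disclosure.

```python
import numpy as np

def synthesize_low_light(img, beta=0.7, gamma=2.5, noise_sigma=0.03, seed=None):
    """img: H x W x 3 array in [0, 1]; returns a synthetic low-illumination image (formula (1))."""
    rng = np.random.default_rng(seed)
    darkened = beta * np.power(img.astype(np.float32), gamma)   # gamma transformation: beta * I_in ** gamma
    noise = rng.normal(0.0, noise_sigma, img.shape)             # superimposed Gaussian noise N_o
    return np.clip(darkened + noise, 0.0, 1.0).astype(np.float32)

# Example: turn a normally exposed PASCAL VOC image (stand-in array here) into a training sample.
voc_image = np.random.rand(640, 640, 3).astype(np.float32)
low_light = synthesize_low_light(voc_image, beta=0.6, gamma=3.0)
```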
2) Training of the feature extraction network module: shallow features pass through fewer network layers and therefore describe details better, but they have smaller receptive fields, contain more noise and carry less semantic information; deep features contain less noise, have larger receptive fields and stronger semantic information, and describe the whole image better, but they struggle to describe details. The technical scheme therefore fuses the shallow and deep features so that the network has richer global feature information, and comprises the following steps:
2-1) preprocessing the data set, i.e. uniformly scaling the width and height of all images to 640 x 640 pixels;
2-2) sending the low-illumination image into a shallow feature extraction module SFEM, and extracting shallow features of the low-illumination image;
2-3) sending the features extracted in the step 2-2) into a deep feature extraction module DFEM to extract deep features of the low-illumination image;
3) Training of the feature enhancement network module FEN: because the extracted feature information differs from region to region, global enhancement often causes overexposure. For this reason the extracted features are sent to a feature enhancement network FEN that incorporates an attention mechanism; the feature enhancement network is an improved U-Net combined with the attention mechanism, and the step comprises:
3-1) sending the features extracted three times by the feature extraction modules into the feature enhancement network module FEN;
3-2) introducing a loss function comprising image multi-scale structural similarity loss, image perception loss and region loss, as shown in formula (2):
L_total = λ_ms·L_ms + λ_pl·L_pl + λ_rl·L_rl (2),
where λ_ms, λ_pl and λ_rl are loss-weight balance coefficients and L_ms, L_pl and L_rl are the image multi-scale structural similarity loss, the image perception loss and the region loss, respectively. The multi-scale structural similarity loss measures the structural similarity of images at different resolutions; it helps restore the brightness and contrast of low-light images, improves image quality and realism, and keeps the image structure consistent. Its definition is shown in formula (3):
where μ_x and μ_y are the image means, σ_x² and σ_y² are the image variances, σ_xy is the image covariance, C_1 and C_2 are constants that prevent the denominator from being 0, and N is the number of local image regions. The structural loss only attends to low-level image information, and the enhanced image may become over-smoothed for lack of deep information; the image perception loss is therefore introduced to evaluate the similarity between the enhanced image and the real image, and is defined as shown in formula (4):
where E and G are the enhanced image and the real low-light image respectively, φ_{i,j} is the feature map of the j-th convolution layer of the i-th block in the VGG16 network, and W_{i,j}, H_{i,j} and G_{i,j} are the dimensions of the feature maps in the VGG16 network. During training, the varying brightness of different image regions makes the enhancement uneven and prone to overexposure, so the image cannot be enhanced as a whole; to prevent overexposure a region loss function is introduced, as shown in formula (5):
where E_l and G_l are the low-light regions of the enhanced image and the real image, and E_h and G_h are the remaining regions; because more attention must be paid to the low-light regions during training, a larger weight value is assigned to the low-light regions;
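A hedged sketch of the composite FEN loss of formula (2) is given below; it assumes PyTorch, the third-party pytorch_msssim package for the multi-scale structural similarity term, torchvision's VGG16 features for the perception term, and a luminance-threshold mask with weights for the low-light and remaining regions for the region term. The threshold, feature depth and loss weights are illustrative assumptions, not values from the disclosure.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.models import vgg16, VGG16_Weights
from pytorch_msssim import ms_ssim   # third-party package "pytorch-msssim"

class VGGPerceptualLoss(nn.Module):
    def __init__(self, depth=16):                      # feature depth is an assumption
        super().__init__()
        self.features = vgg16(weights=VGG16_Weights.IMAGENET1K_V1).features[:depth].eval()
        for p in self.features.parameters():
            p.requires_grad = False

    def forward(self, enhanced, target):
        # ImageNet normalization of the inputs is omitted here for brevity.
        return F.mse_loss(self.features(enhanced), self.features(target))

def region_loss(enhanced, target, dark_thresh=0.3, w_low=5.0, w_high=1.0):
    """Weight the reconstruction error in low-light regions (mask from the target luminance) more heavily."""
    low_mask = (target.mean(dim=1, keepdim=True) < dark_thresh).float()
    err = torch.abs(enhanced - target)
    return (w_low * err * low_mask + w_high * err * (1.0 - low_mask)).mean()

def total_enhancement_loss(enhanced, target, perceptual, lam_ms=1.0, lam_pl=0.1, lam_rl=1.0):
    l_ms = 1.0 - ms_ssim(enhanced, target, data_range=1.0)   # multi-scale structural similarity loss
    l_pl = perceptual(enhanced, target)                      # image perception loss
    l_rl = region_loss(enhanced, target)                     # region loss
    return lam_ms * l_ms + lam_pl * l_pl + lam_rl * l_rl     # formula (2)
```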
4) Training of the feature fusion module FFM-ECA: the feature fusion module FFM-ECA training comprises:
4-1) the low-illumination image features and the enhanced image features are used as the dual-channel input of the FFM-ECA module; the input features X_0 ∈ R^{C×H×W} are spliced, passed through a 3*3 convolution activated by ReLU, and then the number of channels is adjusted by a 1*1 convolution to obtain the feature map X_1 ∈ R^{C×H×W};
4-2) global average pooling is then applied to obtain an aggregated feature map M_C of size 1×1×C, and a fast one-dimensional convolution with kernel size k is applied to M_C, where k is adaptively determined by a mapping of the channel dimension C, as shown in formula (6):
k = |log_2(C)/γ + b/γ|_odd (6),
where k is the convolution kernel size, C is the number of channels, γ and b are used to change the ratio between the number of channels C and the kernel size, and |·|_odd denotes taking the nearest odd number;
4-3) obtaining the weight ω of each channel through the Sigmoid activation function, as shown in formula (7):
ω = σ(C1D_k(X_1)) (7),
where ω represents the weight of each channel, σ represents the Sigmoid activation function, and C1D represents the one-dimensional convolution;
4-4) multiplying the weights with the characteristics of the channels respectively to obtain outputs, as shown in formula (8):
wherein δ represents a ReLU activation function;
4-5) finally fusing the features obtained in the step 4-4) with the features of the enhanced image after one 3*3 convolution and one 1*1 convolution;
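The following non-limiting sketch illustrates one possible realization of the FFM-ECA fusion in steps 4-1) to 4-5): the two inputs are concatenated, reduced by 3*3 and 1*1 convolutions, re-weighted by an ECA-style one-dimensional convolution whose kernel size follows formula (6), and fused with the enhanced features; the hidden channel handling and the exact final fusion path are assumptions.

```python
import math
import torch
import torch.nn as nn

def eca_kernel_size(channels, gamma=2, b=1):
    """Adaptive kernel size k = |log2(C)/gamma + b/gamma|_odd (formula (6))."""
    k = int(abs(math.log2(channels) / gamma + b / gamma))
    return k if k % 2 == 1 else k + 1

class FFMECA(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.reduce = nn.Sequential(                      # splice -> 3*3 conv + ReLU -> 1*1 conv = X_1
            nn.Conv2d(2 * channels, channels, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 1))
        k = eca_kernel_size(channels)
        self.conv1d = nn.Conv1d(1, 1, kernel_size=k, padding=k // 2, bias=False)
        self.fuse = nn.Sequential(                        # one 3*3 and one 1*1 convolution before fusion
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 1))

    def forward(self, low_light_feat, enhanced_feat):
        x1 = self.reduce(torch.cat([low_light_feat, enhanced_feat], dim=1))
        m_c = x1.mean(dim=(2, 3))                                     # global average pooling, (N, C)
        w = torch.sigmoid(self.conv1d(m_c.unsqueeze(1))).squeeze(1)   # channel weights, formula (7)
        x2 = torch.relu(x1 * w.unsqueeze(-1).unsqueeze(-1))           # re-weighted features, formula (8)
        return self.fuse(x2) + enhanced_feat                          # fuse with the enhanced features
```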
5) Training a target detection network: comprising the following steps:
5-1) uniformly scaling the width and height of the dataset image to 640 x 640 pixels;
5-2) inputting the pictures into the backbone network for feature extraction to obtain feature maps of sizes 80×80×512, 40×40×1024 and 20×20×1024, denoted feature1, feature2 and feature3;
5-3) sending the feature maps obtained in step 5-2) to the detection neck for feature fusion to obtain feature maps of sizes 20×20×512, 40×40×256 and 80×80×128, denoted P5_out, P4_out and P3_out, respectively. feature3 is first passed through an SPPCSPC-CA module for feature extraction and denoted P5; P5 is passed through a 1*1 convolution, upsampled, combined with feature2 after feature2 passes through a 1*1 convolution, and then passed through an ELAN-CA module for feature extraction to give P4; P4 is passed through a 1*1 convolution, upsampled, combined with feature1 after feature1 passes through a 1*1 convolution, and passed through an ELAN-CA module to give P3_out; P3_out is downsampled once and spliced with P4, and features are extracted through an ELAN-CA module to give P4_out; P4_out is downsampled once and spliced with P5, and features are extracted through an ELAN-CA module to give P5_out;
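A structural sketch of the neck wiring in step 5-3) follows; the SPPCSPC-CA and ELAN-CA blocks are passed in as abstract modules (their internals are described later), and the reduction and downsampling channel numbers are inferred from the backbone and neck output sizes above, so they should be read as assumptions rather than disclosed values.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class NeckSketch(nn.Module):
    def __init__(self, sppcspc_ca, elan_p4, elan_p3, elan_p4_out, elan_p5_out):
        super().__init__()
        self.sppcspc_ca = sppcspc_ca            # feature3 (20x20x1024) -> P5 (20x20x512)
        self.reduce_p5 = nn.Conv2d(512, 256, 1)
        self.reduce_f2 = nn.Conv2d(1024, 256, 1)
        self.elan_p4 = elan_p4                  # -> P4 (40x40x256)
        self.reduce_p4 = nn.Conv2d(256, 128, 1)
        self.reduce_f1 = nn.Conv2d(512, 128, 1)
        self.elan_p3 = elan_p3                  # -> P3_out (80x80x128)
        self.down_p3 = nn.Conv2d(128, 256, 3, stride=2, padding=1)
        self.elan_p4_out = elan_p4_out          # -> P4_out (40x40x256)
        self.down_p4 = nn.Conv2d(256, 512, 3, stride=2, padding=1)
        self.elan_p5_out = elan_p5_out          # -> P5_out (20x20x512)

    def forward(self, feature1, feature2, feature3):
        p5 = self.sppcspc_ca(feature3)
        p4 = self.elan_p4(torch.cat([F.interpolate(self.reduce_p5(p5), scale_factor=2),
                                     self.reduce_f2(feature2)], dim=1))
        p3_out = self.elan_p3(torch.cat([F.interpolate(self.reduce_p4(p4), scale_factor=2),
                                         self.reduce_f1(feature1)], dim=1))
        p4_out = self.elan_p4_out(torch.cat([self.down_p3(p3_out), p4], dim=1))
        p5_out = self.elan_p5_out(torch.cat([self.down_p4(p4_out), p5], dim=1))
        return p3_out, p4_out, p5_out
```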
5-4) carrying out feature fusion on the three feature images obtained in the step 5-3) and the features obtained by the enhanced images through a backbone network and a detection neck, and respectively sending the three feature images into three detection heads, wherein the detection heads respectively predict the confidence level, the category and the boundary frame of the object;
5-5) screening the prediction boxes: prediction boxes with low target confidence are filtered out, non-maximum suppression is then performed, and the bounding box with the highest confidence is selected as the final detection result; the total loss of the target detection network is shown in formula (9):
L_total = λ_1·L_box + λ_2·L_obj + λ_3·L_cls (9),
where λ_1, λ_2 and λ_3 are loss-weight balance coefficients and L_box, L_obj and L_cls are the bounding-box regression loss, the confidence loss and the classification loss, respectively; the confidence loss and classification loss use BCELoss and the bounding-box regression loss uses EIoU. The SPPCSPC-CA module consists of two branches: the first branch changes the number of channels through one 1*1 convolution; the second branch changes the number of channels through one 1*1 convolution, passes through a CA attention module, performs max-pooling operations with the four kernel sizes 1×1, 5×5, 9×9 and 13×13, splices the four features together, adjusts the number of channels through a convolution with kernel size 1*1, and then extracts features through a convolution with kernel size 3*3 and stride 1; the features obtained by the two branches are spliced together and the number of channels is adjusted by one more 1*1 convolution;
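A hedged sketch of the SPPCSPC-CA block described above: branch 1 is a single 1*1 convolution, branch 2 applies a 1*1 convolution, a CA attention module, parallel max-pooling with kernel sizes 1/5/9/13, a 1*1 and a 3*3 convolution, and the two branches are concatenated and reduced by a final 1*1 convolution; the hidden width is an assumption, and the CA module is passed in.

```python
import torch
import torch.nn as nn

class SPPCSPCCA(nn.Module):
    def __init__(self, in_ch, out_ch, ca_module, hidden=None):
        super().__init__()
        hidden = hidden or out_ch // 2
        self.branch1 = nn.Conv2d(in_ch, hidden, 1)                 # first branch: 1*1 channel change
        self.pre = nn.Sequential(nn.Conv2d(in_ch, hidden, 1), ca_module)
        self.pools = nn.ModuleList(
            nn.MaxPool2d(kernel_size=k, stride=1, padding=k // 2) for k in (1, 5, 9, 13))
        self.post = nn.Sequential(
            nn.Conv2d(4 * hidden, hidden, 1),                      # adjust channels after splicing
            nn.Conv2d(hidden, hidden, 3, stride=1, padding=1))     # 3*3, stride 1, feature extraction
        self.out = nn.Conv2d(2 * hidden, out_ch, 1)                # final 1*1 channel adjustment

    def forward(self, x):
        b1 = self.branch1(x)
        y = self.pre(x)
        y = self.post(torch.cat([pool(y) for pool in self.pools], dim=1))
        return self.out(torch.cat([b1, y], dim=1))
```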
The ELAN-CA module consists of two branches: the first branch adjusts the number of channels through a convolution with kernel size 1*1; the second branch first applies a 1*1 convolution, then extracts features through convolutions with kernel size 3*3 and stride 1, then passes through a CA attention module, and finally a Concat operation splices the four features together. To improve the network's ability to detect targets, a CA attention mechanism is introduced into the backbone network and the detection neck; the CA attention mechanism effectively combines channel attention with spatial attention and also embeds positional information into the channel attention, thereby improving the network's ability to extract low-illumination image features. For any input feature X, each channel is first encoded along the horizontal and vertical directions using pooling kernels of size (H, 1) and (1, W), and the transformations of formula (10) and formula (11) give the resulting feature layers their coordinate information:
The two aggregated feature maps are then concatenated and, after the 1×1 convolution transformation function of formula (12), an intermediate mapping encoding the spatial information of the vertical and horizontal directions is obtained:
f = δ(F_1([z^h, z^w])) (12),
where [·] denotes a concatenation operation along the spatial dimension, δ is a nonlinear activation function, and f ∈ R^{C/r×(H+W)} is an intermediate feature map encoding the spatial information of both the horizontal and vertical directions.
f is then split along the spatial dimension into two independent tensors f^h ∈ R^{C/r×H} and f^w ∈ R^{C/r×W}, and the two 1×1 convolution transforms of formula (13) and formula (14) convert f^h and f^w into g^h and g^w with the same number of channels:
g^h = σ(F_h(f^h)) (13),
g^w = σ(F_w(f^w)) (14),
Where σ is a Sigmoid function,
finally, g^h and g^w are used as attention weights to re-weight the input feature and obtain the output, as shown in formula (15):
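The following is a minimal coordinate-attention sketch corresponding to formulas (10)-(15): direction-wise average pooling along the two spatial axes, a shared 1×1 transform F_1, a split into f^h and f^w, two 1×1 transforms F_h and F_w with Sigmoid, and re-weighting of the input; the reduction ratio r is an assumption.

```python
import torch
import torch.nn as nn

class CoordAttention(nn.Module):
    def __init__(self, channels, reduction=32):
        super().__init__()
        mid = max(8, channels // reduction)
        self.f1 = nn.Sequential(nn.Conv2d(channels, mid, 1), nn.BatchNorm2d(mid), nn.ReLU(inplace=True))
        self.f_h = nn.Conv2d(mid, channels, 1)
        self.f_w = nn.Conv2d(mid, channels, 1)

    def forward(self, x):
        n, c, h, w = x.shape
        z_h = x.mean(dim=3, keepdim=True)                       # pooling with kernel (1, W): (N, C, H, 1)
        z_w = x.mean(dim=2, keepdim=True).permute(0, 1, 3, 2)   # pooling with kernel (H, 1): (N, C, W, 1)
        f = self.f1(torch.cat([z_h, z_w], dim=2))               # concatenation + 1x1 transform, formula (12)
        f_h, f_w = torch.split(f, [h, w], dim=2)                # split back into the two directions
        g_h = torch.sigmoid(self.f_h(f_h))                      # formula (13)
        g_w = torch.sigmoid(self.f_w(f_w)).permute(0, 1, 3, 2)  # formula (14)
        return x * g_h * g_w                                    # re-weight the input, formula (15)
```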
6) Training and testing the low-illumination target model: comprising the following steps:
6-1) sending the low-illumination image into the MSFAF-Net network trained in the steps 1) -4) for enhancement;
6-2) taking the original low-illumination image and the enhanced image obtained in the step 6-1) as input of a target detection network at the same time to obtain a detection result;
6-3) visualizing the detection result.
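For illustration only, the inference flow of steps 6-1) to 6-3) can be sketched as follows; msfaf_net, detector and the visualization routine are placeholders for the trained modules, not a real API, and the confidence and IoU thresholds are assumptions.

```python
import torch
from torchvision.ops import nms

@torch.no_grad()
def detect_low_light(image, msfaf_net, detector, conf_thresh=0.25, iou_thresh=0.45):
    """image: (1, 3, 640, 640) tensor in [0, 1]; msfaf_net and detector are the trained modules."""
    enhanced = msfaf_net(image)                         # step 6-1): enhancement by the trained MSFAF-Net
    boxes, scores, labels = detector(image, enhanced)   # step 6-2): dual input to the detection network
    keep = scores > conf_thresh                         # filter low-confidence predictions
    boxes, scores, labels = boxes[keep], scores[keep], labels[keep]
    keep = nms(boxes, scores, iou_thresh)               # non-maximum suppression
    return boxes[keep], scores[keep], labels[keep]      # step 6-3): hand these to a visualization routine
```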
The shallow feature extraction module SFEM described in step 2-1) includes:
2-1-1) the SFEM module consists of two branches, wherein the first branch consists of 3 groups of convolution layers, each group of convolution layers is activated by a ReLU and added with a batch normalization layer, the dimension is increased by convolution with a convolution kernel size of 1*1, the characteristics are extracted by convolution with a convolution kernel size of 3*3 and a step size of 1, and finally the number of channels is adjusted by convolution with a convolution kernel size of 1*1;
2-1-2) in another branch, the number of channels is adjusted by adopting a convolution with a convolution kernel size of 1*1, and finally, shallow features are fused by adopting a Concat operation.
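A non-limiting sketch of the SFEM structure in steps 2-1-1) and 2-1-2): branch 1 stacks three conv + batch-normalization + ReLU groups (1*1 expansion, 3*3 extraction with stride 1, 1*1 channel adjustment), branch 2 is a single 1*1 convolution, and the two branches are fused by Concat; the expansion factor and output width are assumptions.

```python
import torch
import torch.nn as nn

def conv_bn_relu(in_ch, out_ch, k, stride=1):
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, k, stride=stride, padding=k // 2, bias=False),
        nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True))

class SFEM(nn.Module):
    def __init__(self, in_ch=3, width=32, expand=2):
        super().__init__()
        self.branch1 = nn.Sequential(
            conv_bn_relu(in_ch, width * expand, 1),             # 1*1 convolution raises the dimension
            conv_bn_relu(width * expand, width * expand, 3),    # 3*3 convolution, stride 1, extracts features
            conv_bn_relu(width * expand, width, 1))             # 1*1 convolution adjusts the channel number
        self.branch2 = nn.Conv2d(in_ch, width, 1)               # second branch: 1*1 channel adjustment

    def forward(self, x):
        return torch.cat([self.branch1(x), self.branch2(x)], dim=1)   # Concat fuses the shallow features
```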
The deep feature extraction module DFEM in step 2-2) includes:
2-2-1) DFEM is stacked from three residual dense blocks RDB;
2-2-2) RDB comprises three parts: dense connection, local feature fusion and feature fusion, each RDB consists of 3 convolution layers and is activated by ReLU, each convolution layer uses a convolution kernel of size 3*3 and adds BN and performs local feature fusion by convolutions of convolution kernel size 1*1.
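A hedged sketch of the DFEM structure in steps 2-2-1) and 2-2-2): each residual dense block (RDB) densely connects three 3*3 conv + BN + ReLU layers, performs local feature fusion with a 1*1 convolution and adds the block input, and DFEM stacks three RDBs; the growth rate is an assumption.

```python
import torch
import torch.nn as nn

class RDB(nn.Module):
    def __init__(self, channels, growth=32):
        super().__init__()
        self.layers = nn.ModuleList()
        in_ch = channels
        for _ in range(3):                                  # three densely connected 3*3 conv + BN + ReLU layers
            self.layers.append(nn.Sequential(
                nn.Conv2d(in_ch, growth, 3, padding=1), nn.BatchNorm2d(growth), nn.ReLU(inplace=True)))
            in_ch += growth
        self.local_fusion = nn.Conv2d(in_ch, channels, 1)   # local feature fusion by a 1*1 convolution

    def forward(self, x):
        feats = [x]
        for layer in self.layers:
            feats.append(layer(torch.cat(feats, dim=1)))    # dense connections
        return x + self.local_fusion(torch.cat(feats, dim=1))

class DFEM(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.blocks = nn.Sequential(RDB(channels), RDB(channels), RDB(channels))   # three stacked RDBs

    def forward(self, x):
        return self.blocks(x)
```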
The feature enhanced network module FEN in step 3-1) includes:
3-1-1) downsampling is first performed to obtain feature maps of sizes 320×320×64, 160×160×128 and 80×80×256. The downsampling path comprises three groups of stacked residual blocks; each residual block contains 2 convolution layers with kernel size 3*3 and stride 2, each followed by a batch normalization layer and a ReLU activation function, and the feature map is reduced to half of its original size after each downsampling. Channel weights are then calculated by a CA module to adaptively strengthen different channels. For the first downsampling the number of input channels is 3 and the number of output channels is 64; for the second downsampling the number of input channels is 64 and the number of output channels is 128; for the third downsampling the number of input channels is 128 and the number of output channels is 256;
3-1-2) context information at multiple scales is obtained by dilated convolutions with dilation rates of 1, 2 and 5 connected in parallel. In the first branch the kernel size is 3*3, the dilation rate is 1, the padding is 1, and the numbers of input and output feature-map channels are 256; in the second branch the numbers of input and output feature-map channels are 256, the kernel size is 3*3, the dilation rate is 2, and the padding is 2; in the third branch the numbers of input and output feature-map channels are 256, the kernel size is 3*3, the dilation rate is 5, and the padding is 5; in addition, each convolution is followed by a batch normalization layer and a ReLU activation layer;
3-1-3) splicing the feature images obtained in the step 3-1-2) according to channel dimensions, then performing a convolution operation to reduce the channel dimensions, wherein the convolution kernel of the convolution is 1*1, the number of input channels is 768, the number of output channels is 256, and finally transmitting the obtained feature images to the next module;
3-1-4) the feature map obtained in step 3-1-3) is upsampled to obtain feature maps of sizes 160×160×128, 320×320×64 and 640×640×3. The upsampling path comprises three groups of stacked residual blocks; each residual block contains 2 deconvolution layers with kernel size 3*3 and stride 2, each followed by a batch normalization layer and a ReLU activation function, and the feature map is enlarged to twice its original size after each upsampling, forming a symmetric structure. The downsampled features are introduced into the corresponding upsampling modules by skip connections. For the first upsampling the number of input channels is 256 and the number of output channels is 128; for the second upsampling the number of input channels is 128 and the number of output channels is 64; for the third upsampling the number of input channels is 64 and the number of output channels is 3;
3-1-5) are finally subjected to 1*1 convolution operations to obtain enhanced features.
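A compact sketch of the FEN structure in steps 3-1-1) to 3-1-5): a three-stage downsampling path weighted by CA modules (3 -> 64 -> 128 -> 256 channels), a parallel dilated-convolution bottleneck with dilation rates 1, 2 and 5, and a symmetric upsampling path with skip connections followed by a final 1*1 convolution. The internal residual shortcuts of each block are omitted for brevity, merging the skips by addition is an assumption, and ca_factory stands for a constructor of the CA module sketched earlier.

```python
import torch
import torch.nn as nn

def down_block(in_ch, out_ch):
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, 3, stride=2, padding=1), nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True),
        nn.Conv2d(out_ch, out_ch, 3, stride=1, padding=1), nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True))

def up_block(in_ch, out_ch):
    return nn.Sequential(
        nn.ConvTranspose2d(in_ch, out_ch, 3, stride=2, padding=1, output_padding=1),
        nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True),
        nn.Conv2d(out_ch, out_ch, 3, stride=1, padding=1), nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True))

def dilated_branch(rate, channels=256):
    return nn.Sequential(
        nn.Conv2d(channels, channels, 3, padding=rate, dilation=rate),
        nn.BatchNorm2d(channels), nn.ReLU(inplace=True))

class FEN(nn.Module):
    def __init__(self, ca_factory):
        super().__init__()
        self.down1, self.ca1 = down_block(3, 64), ca_factory(64)      # 640x640x3 -> 320x320x64
        self.down2, self.ca2 = down_block(64, 128), ca_factory(128)   # -> 160x160x128
        self.down3, self.ca3 = down_block(128, 256), ca_factory(256)  # -> 80x80x256
        self.bottleneck = nn.ModuleList([dilated_branch(r) for r in (1, 2, 5)])
        self.fuse = nn.Conv2d(768, 256, 1)                            # 768 -> 256 channels
        self.up1 = up_block(256, 128)                                 # -> 160x160x128
        self.up2 = up_block(128, 64)                                  # -> 320x320x64
        self.up3 = up_block(64, 3)                                    # -> 640x640x3
        self.out = nn.Conv2d(3, 3, 1)                                 # final 1*1 convolution

    def forward(self, x):
        d1 = self.ca1(self.down1(x))
        d2 = self.ca2(self.down2(d1))
        d3 = self.ca3(self.down3(d2))
        b = self.fuse(torch.cat([branch(d3) for branch in self.bottleneck], dim=1))
        u1 = self.up1(b) + d2        # skip connection from the downsampling path
        u2 = self.up2(u1) + d1
        return self.out(self.up3(u2))
```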
According to the technical scheme, in target detection the feature details extracted by a target detection model are easily lost because the feature details of low-illumination images are not obvious. The original low-illumination image is therefore first enhanced through feature extraction, feature enhancement, feature fusion and related operations: shallow and deep features are extracted so that the network has richer global feature information, a feature enhancement module with an attention mechanism adaptively strengthens different channels, and a feature fusion module fuses the low-illumination image features with the enhanced image features to improve image quality. The target detection network comprises a backbone network, a detection neck and detection heads, and the original low-illumination image and the enhanced image are input into the target detector simultaneously for detection.
The method strengthens the feature expression capability of the feature maps used in low-illumination image target detection, thereby improving the accuracy of low-illumination target detection.
Drawings
FIG. 1 is a diagram of a MSFAF-Net network architecture in an embodiment;
FIG. 2 is a block diagram of an SFEM module in an embodiment;
FIG. 3 is a block diagram of a DFEM module in an embodiment;
FIG. 4 is a diagram of the RDB module architecture in an embodiment;
FIG. 5 is a FEN block diagram in an embodiment;
FIG. 6 is a block diagram of FFM-ECA in an embodiment;
FIG. 7 is a block diagram of an SPPCSPC-CA module in an embodiment;
FIG. 8 is a block diagram of an ELAN-CA module in an embodiment;
fig. 9 is a schematic diagram of a CA attention mechanism in an embodiment.
Detailed Description
The present invention will now be further illustrated with reference to the drawings and examples, but is not limited thereto.
Examples:
referring to fig. 1, a low-illuminance target detection method based on MSFAF-Net includes the steps of:
1) Integrating the build data set: comprising the following steps:
1-1) the Exdark dataset contains 12 types of objects and 7363 images collected in a real low-light environment, wherein 4800 images are used for training and 2563 images are used for testing;
1-2) selecting 4800 images from the PASCAL VOC dataset and applying gamma transformation and superimposed Gaussian noise to generate low-illumination images for training, the synthesis being as shown in formula (1):
I_low = β·I_in^γ + N_o (1),
where I_low is the synthesized low-illumination image, I_in is the input image, N_o is the superimposed Gaussian noise, and β and γ are the coefficients of the gamma transformation;
1-3) training phase, wherein the model is trained on a data set formed by mixing 4800 real low-illumination images and 4800 synthesized low-illumination images, and the model is tested on 2563 real low-illumination images;
2) Training of the feature extraction network module: shallow features pass through fewer network layers and therefore describe details better, but they have smaller receptive fields, contain more noise and carry less semantic information; deep features contain less noise, have larger receptive fields and stronger semantic information, and describe the whole image better, but they struggle to describe details. The shallow and deep features are therefore fused so that the network has richer global feature information. This step comprises:
2-1) preprocessing the data set, i.e. uniformly scaling the width and height of all images to 640 x 640 pixels;
2-2) sending the low-illumination image into a shallow feature extraction module SFEM, and extracting shallow features of the low-illumination image;
2-3) sending the features extracted in the step 2-2) into a deep feature extraction module DFEM to extract deep features of the low-illumination image;
3) Training of the feature enhancement network module FEN: because the extracted feature information differs from region to region, global enhancement often causes overexposure. For this reason the extracted features are sent to a feature enhancement network FEN that incorporates an attention mechanism; the feature enhancement network is an improved U-Net combined with the attention mechanism, and the step comprises:
3-1) sending the features extracted three times by the feature extraction modules into the feature enhancement network module FEN;
3-2) introducing a loss function comprising image multi-scale structural similarity loss, image perception loss and region loss, as shown in formula (2):
L_total = λ_ms·L_ms + λ_pl·L_pl + λ_rl·L_rl (2),
where λ_ms, λ_pl and λ_rl are loss-weight balance coefficients and L_ms, L_pl and L_rl are the image multi-scale structural similarity loss, the image perception loss and the region loss, respectively. The multi-scale structural similarity loss measures the structural similarity of images at different resolutions; it helps restore the brightness and contrast of low-light images, improves image quality and realism, and keeps the image structure consistent. Its definition is shown in formula (3):
where μ_x and μ_y are the image means, σ_x² and σ_y² are the image variances, σ_xy is the image covariance, C_1 and C_2 are constants that prevent the denominator from being 0, and N is the number of local image regions. The structural loss only attends to low-level image information, and the enhanced image may become over-smoothed for lack of deep information; the image perception loss is therefore introduced to evaluate the similarity between the enhanced image and the real image, and is defined as shown in formula (4):
where E and G are the enhanced image and the real low-light image respectively, φ_{i,j} is the feature map of the j-th convolution layer of the i-th block in the VGG16 network, and W_{i,j}, H_{i,j} and G_{i,j} are the dimensions of the feature maps in the VGG16 network. During training, the varying brightness of different image regions makes the enhancement uneven and prone to overexposure, so the image cannot be enhanced as a whole; to prevent overexposure a region loss function is introduced, as shown in formula (5):
where E_l and G_l are the low-light regions of the enhanced image and the real image, and E_h and G_h are the remaining regions; because more attention must be paid to the low-light regions during training, a larger weight value is assigned to the low-light regions, in this example ω_l = 5 and ω_h = 1;
In this example, dilated convolution is introduced between the encoder and the decoder to replace the bottom convolution layer of the traditional U-Net, and context information at multiple scales is obtained by dilated convolutions with dilation rates of 1, 2 and 5 combined in parallel. In the first branch the kernel size is 3*3, the dilation rate is 1, the padding is 1, and the numbers of input and output feature-map channels are 256; in the second branch the numbers of input and output feature-map channels are 256, the kernel size is 3*3, the dilation rate is 2, and the padding is 2; in the third branch the numbers of input and output feature-map channels are 256, the kernel size is 3*3, the dilation rate is 5, and the padding is 5. In addition, each convolution is followed by a batch normalization layer and a ReLU activation layer. The resulting feature maps are spliced along the channel dimension, a convolution operation is then executed to reduce the channel dimension (the kernel size is 1*1, the number of input channels is 768 and the number of output channels is 256), and finally the obtained feature map is passed to the next module;
4) Training of the feature fusion module FFM-ECA, as shown in fig. 6, comprising:
4-1) the low-illumination image features and the enhanced image features are used as the dual-channel input of the FFM-ECA module; the input features X_0 ∈ R^{C×H×W} are spliced, passed through a 3*3 convolution activated by ReLU, and then the number of channels is adjusted by a 1*1 convolution to obtain the feature map X_1 ∈ R^{C×H×W};
4-2) global average pooling is then applied to obtain an aggregated feature map M_C of size 1×1×C, and a fast one-dimensional convolution with kernel size k is applied to M_C, where k is adaptively determined by a mapping of the channel dimension C, as shown in formula (6):
k = |log_2(C)/γ + b/γ|_odd (6),
where k is the convolution kernel size, C is the number of channels, γ and b are used to change the ratio between the number of channels C and the kernel size, and |·|_odd denotes taking the nearest odd number;
4-3) obtaining the weight ω of each channel through the Sigmoid activation function, as shown in formula (7):
ω = σ(C1D_k(X_1)) (7),
where ω represents the weight of each channel, σ represents the Sigmoid activation function, and C1D represents the one-dimensional convolution;
4-4) multiplying the weights with the characteristics of the channels respectively to obtain outputs, as shown in formula (8):
wherein δ represents a ReLU activation function;
4-5) finally fusing the features obtained in the step 4-4) with the features of the enhanced image after one 3*3 convolution and one 1*1 convolution;
5) Training a target detection network: comprising the following steps:
5-1) uniformly scaling the width and height of the dataset image to 640 x 640 pixels;
5-2) inputting the pictures into the backbone network for feature extraction to obtain feature maps of sizes 80×80×512, 40×40×1024 and 20×20×1024, denoted feature1, feature2 and feature3;
5-3) sending the feature maps obtained in step 5-2) to the detection neck for feature fusion to obtain feature maps of sizes 20×20×512, 40×40×256 and 80×80×128, denoted P5_out, P4_out and P3_out, respectively. feature3 is first passed through an SPPCSPC-CA module for feature extraction and denoted P5; P5 is passed through a 1*1 convolution, upsampled, combined with feature2 after feature2 passes through a 1*1 convolution, and then passed through an ELAN-CA module for feature extraction to give P4; P4 is passed through a 1*1 convolution, upsampled, combined with feature1 after feature1 passes through a 1*1 convolution, and passed through an ELAN-CA module to give P3_out; P3_out is downsampled once and spliced with P4, and features are extracted through an ELAN-CA module to give P4_out; P4_out is downsampled once and spliced with P5, and features are extracted through an ELAN-CA module to give P5_out. To improve the network's ability to detect targets, a CA attention mechanism is introduced into the backbone network and the detection neck; the CA attention mechanism effectively combines channel attention with spatial attention and embeds positional information into the channel attention, thereby improving the network's ability to extract low-illumination image features;
5-4) carrying out feature fusion on the three feature images obtained in the step 5-3) and the features obtained by the enhanced images through a backbone network and a detection neck, and respectively sending the three feature images into three detection heads, wherein the detection heads respectively predict the confidence level, the category and the boundary frame of the object;
5-5) screening the prediction boxes: prediction boxes with low target confidence are filtered out, non-maximum suppression is then performed, and the bounding box with the highest confidence is selected as the final detection result; the total loss of the target detection network is shown in formula (9):
L_total = λ_1·L_box + λ_2·L_obj + λ_3·L_cls (9),
where λ_1, λ_2 and λ_3 are loss-weight balance coefficients and L_box, L_obj and L_cls are the bounding-box regression loss, the confidence loss and the classification loss, respectively; the confidence loss and classification loss use BCELoss and the bounding-box regression loss uses EIoU.
In this example, as shown in fig. 7, the SPPCSPC-CA module in the detection neck consists of two branches: the first branch changes the number of channels through one 1*1 convolution; the second branch changes the number of channels through one 1*1 convolution, passes through a CA attention module, performs max-pooling operations with the four kernel sizes 1×1, 5×5, 9×9 and 13×13, splices the four features together, adjusts the number of channels through a convolution with kernel size 1*1, and then extracts features through a convolution with kernel size 3*3 and stride 1; the features obtained by the two branches are spliced together and the number of channels is adjusted by one more 1*1 convolution.
The ELAN-CA module, shown in fig. 8, consists of two branches: the first branch adjusts the number of channels through a convolution with kernel size 1*1; the second branch first applies a 1*1 convolution, then extracts features through convolutions with kernel size 3*3 and stride 1, then passes through a CA attention module, and finally a Concat operation splices the four features together. To improve the network's ability to detect targets, a CA attention mechanism is introduced into the backbone network and the detection neck; the CA attention mechanism effectively combines channel attention with spatial attention and embeds positional information into the channel attention, thereby improving the network's ability to extract low-illumination image features. The CA attention mechanism is shown in fig. 9: for any input feature X, each channel is first encoded along the horizontal and vertical directions using pooling kernels of size (H, 1) and (1, W), and the transformations of formula (10) and formula (11) give the resulting feature layers their coordinate information.
The two aggregated feature maps are then concatenated and, after the 1×1 convolution transformation function of formula (12), an intermediate mapping encoding the spatial information of the vertical and horizontal directions is obtained:
f = δ(F_1([z^h, z^w])) (12),
where [·] denotes a concatenation operation along the spatial dimension, δ is a nonlinear activation function, and f ∈ R^{C/r×(H+W)} is an intermediate feature map encoding the spatial information of both the horizontal and vertical directions.
f is then split along the spatial dimension into two independent tensors f^h ∈ R^{C/r×H} and f^w ∈ R^{C/r×W}, and the two 1×1 convolution transforms of formula (13) and formula (14) convert f^h and f^w into g^h and g^w with the same number of channels:
g^h = σ(F_h(f^h)) (13),
g^w = σ(F_w(f^w)) (14),
Where σ is a Sigmoid function,
finally, g^h and g^w are used as attention weights to re-weight the input feature and obtain the output, as shown in formula (15):
6) Training and testing the low-illumination target model: comprising the following steps:
6-1) sending the low-illumination image into the MSFAF-Net network trained in the steps 1) -4) for enhancement;
6-2) taking the original low-illumination image and the enhanced image obtained in the step 6-1) as input of a target detection network at the same time to obtain a detection result;
6-3) visualizing the detection result.
The shallow feature extraction module SFEM described in step 2-1) is shown in fig. 2, and includes:
2-1-1) the SFEM module consists of two branches, wherein the first branch consists of 3 groups of convolution layers, each group of convolution layers is activated by a ReLU and added with a batch normalization layer, the dimension is increased by convolution with a convolution kernel size of 1*1, the characteristics are extracted by convolution with a convolution kernel size of 3*3 and a step size of 1, and finally the number of channels is adjusted by convolution with a convolution kernel size of 1*1;
2-1-2) in another branch, the number of channels is adjusted by adopting a convolution with a convolution kernel size of 1*1, and finally, shallow features are fused by adopting a Concat operation.
The deep feature extraction module DFEM in step 2-2) is shown in fig. 3, and includes:
2-2-1) DFEM is stacked from three residual dense blocks RDB;
2-2-2) RDB as shown in FIG. 4, includes three parts: dense connection, local feature fusion and feature fusion, each RDB consists of 3 convolution layers and is activated by ReLU, each convolution layer uses a convolution kernel of size 3*3 and adds BN and performs local feature fusion by convolutions of convolution kernel size 1*1.
The feature enhanced network module FEN in step 3-1) is shown in fig. 5, and includes:
3-1-1) downsampling is first performed to obtain feature maps of sizes 320×320×64, 160×160×128 and 80×80×256. The downsampling path comprises three groups of stacked residual blocks; each residual block contains 2 convolution layers with kernel size 3*3 and stride 2, each followed by a batch normalization layer and a ReLU activation function, and the feature map is reduced to half of its original size after each downsampling. Channel weights are then calculated by a CA module to adaptively strengthen different channels. For the first downsampling the number of input channels is 3 and the number of output channels is 64; for the second downsampling the number of input channels is 64 and the number of output channels is 128; for the third downsampling the number of input channels is 128 and the number of output channels is 256;
3-1-2) context information at multiple scales is obtained by dilated convolutions with dilation rates of 1, 2 and 5 connected in parallel. In the first branch the kernel size is 3*3, the dilation rate is 1, the padding is 1, and the numbers of input and output feature-map channels are 256; in the second branch the numbers of input and output feature-map channels are 256, the kernel size is 3*3, the dilation rate is 2, and the padding is 2; in the third branch the numbers of input and output feature-map channels are 256, the kernel size is 3*3, the dilation rate is 5, and the padding is 5; in addition, each convolution is followed by a batch normalization layer and a ReLU activation layer;
3-1-3) splicing the feature images obtained in the step 3-1-2) according to channel dimensions, then performing a convolution operation to reduce the channel dimensions, wherein the convolution kernel of the convolution is 1*1, the number of input channels is 768, the number of output channels is 256, and finally transmitting the obtained feature images to the next module;
3-1-4) the feature map obtained in step 3-1-3) is upsampled to obtain feature maps of sizes 160×160×128, 320×320×64 and 640×640×3. The upsampling path comprises three groups of stacked residual blocks; each residual block contains 2 deconvolution layers with kernel size 3*3 and stride 2, each followed by a batch normalization layer and a ReLU activation function, and the feature map is enlarged to twice its original size after each upsampling, forming a symmetric structure. The downsampled features are introduced into the corresponding upsampling modules by skip connections. For the first upsampling the number of input channels is 256 and the number of output channels is 128; for the second upsampling the number of input channels is 128 and the number of output channels is 64; for the third upsampling the number of input channels is 64 and the number of output channels is 3;
3-1-5) are finally subjected to 1*1 convolution operations to obtain enhanced features.
Claims (4)
1. The MSFAF-Net-based low-illumination target detection method is characterized by comprising the following steps of:
1) Integrating the build data set: comprising the following steps:
1-1) the Exdark dataset contains 12 types of objects and 7363 images collected in a real low-light environment, wherein 4800 images are used for training and 2563 images are used for testing;
1-2) selecting 4800 images from the PASCAL VOC dataset and applying gamma transformation and superimposed Gaussian noise to generate low-illumination images for training, the synthesis being as shown in formula (1):
I_low = β·I_in^γ + N_o (1),
where I_low is the synthesized low-illumination image, I_in is the input image, N_o is the superimposed Gaussian noise, and β and γ are the coefficients of the gamma transformation;
1-3) training phase, wherein the model is trained on a data set formed by mixing 4800 real low-illumination images and 4800 synthesized low-illumination images, and the model is tested on 2563 real low-illumination images;
2) Training of a feature extraction network module: comprising the following steps:
2-1) preprocessing the data set, i.e. uniformly scaling the width and height of all images to 640 x 640 pixels;
2-2) sending the low-illumination image into a shallow feature extraction module SFEM, and extracting shallow features of the low-illumination image;
2-3) sending the features extracted in the step 2-2) into a deep feature extraction module DFEM to extract deep features of the low-illumination image;
3) Training of feature enhanced network module FEN: comprising the following steps:
3-1) sending the features extracted three times by the feature extraction modules into the feature enhancement network module FEN;
3-2) introducing a loss function comprising image multi-scale structural similarity loss, image perception loss and region loss, as shown in formula (2):
L_total = λ_ms·L_ms + λ_pl·L_pl + λ_rl·L_rl (2),
where λ_ms, λ_pl and λ_rl are loss-weight balance coefficients and L_ms, L_pl and L_rl are the image multi-scale structural similarity loss, the image perception loss and the region loss, respectively;
the multi-scale structural similarity loss measures the structural similarity of images at different resolutions by comparing them, and is defined as shown in formula (3):
where μ_x and μ_y are the image means, σ_x² and σ_y² are the image variances, σ_xy is the image covariance, C_1 and C_2 are constants that prevent the denominator from being 0, and N is the number of local image regions; the image perception loss is introduced and is defined as shown in formula (4):
where E and G are the enhanced image and the real low-light image respectively, φ_{i,j} is the feature map of the j-th convolution layer of the i-th block in the VGG16 network, and W_{i,j}, H_{i,j} and G_{i,j} are the dimensions of the feature maps in the VGG16 network; during training the brightness of different image regions makes the enhancement uneven, and a region loss function is introduced, as shown in formula (5):
where E_l and G_l are the low-light regions of the enhanced image and the real image, E_h and G_h are the remaining regions, and a larger weight value is assigned to the low-light regions;
4) Training of the feature fusion module FFM-ECA: the feature fusion module FFM-ECA training comprises:
4-1) the low-illumination image features and the enhanced image features are used as the dual-channel input of the FFM-ECA module; the input features X_0 ∈ R^{C×H×W} are spliced, passed through a 3*3 convolution activated by ReLU, and then the number of channels is adjusted by a 1*1 convolution to obtain the feature map X_1 ∈ R^{C×H×W};
4-2) global average pooling is then applied to obtain an aggregated feature map M_C of size 1×1×C, and a fast one-dimensional convolution with kernel size k is applied to M_C, where k is adaptively determined by a mapping of the channel dimension C, as shown in formula (6):
k = |log_2(C)/γ + b/γ|_odd (6),
where k is the convolution kernel size, C is the number of channels, γ and b are used to change the ratio between the number of channels C and the kernel size, and |·|_odd denotes taking the nearest odd number;
4-3) obtaining the weight ω of each channel through the Sigmoid activation function, as shown in formula (7):
ω = σ(C1D_k(X_1)) (7),
where ω represents the weight of each channel, σ represents the Sigmoid activation function, and C1D represents the one-dimensional convolution;
4-4) the weights are multiplied with the features of the corresponding channels to obtain the output, as shown in formula (8):
X_2 = δ(ω ⊗ X_1) (8),
wherein δ represents the ReLU activation function and ⊗ denotes channel-wise multiplication;
4-5) finally, the features obtained in step 4-4) are fused with the features of the enhanced image after one 3×3 convolution and one 1×1 convolution (a sketch of this fusion module is given below);
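A minimal PyTorch sketch of the FFM-ECA fusion of steps 4-1) to 4-5) follows; class and parameter names are illustrative, the final fusion with the enhanced-image branch is assumed to be an element-wise addition, and γ = 2, b = 1 are assumed values in formula (6).

```python
import math
import torch
import torch.nn as nn

def eca_kernel_size(channels, gamma=2, b=1):
    # Formula (6): k = |log2(C)/gamma + b/gamma|_odd
    k = int(abs(math.log2(channels) / gamma + b / gamma))
    return k if k % 2 == 1 else k + 1

class FFMECA(nn.Module):
    """Sketch of the dual-input feature fusion module with efficient channel attention."""
    def __init__(self, channels):
        super().__init__()
        # Step 4-1): splice the two inputs, 3x3 conv + ReLU, then 1x1 conv to restore C channels.
        self.reduce = nn.Sequential(
            nn.Conv2d(2 * channels, channels, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 1))
        # Step 4-2): fast 1-D convolution across channels with adaptive kernel size k.
        k = eca_kernel_size(channels)
        self.conv1d = nn.Conv1d(1, 1, kernel_size=k, padding=k // 2, bias=False)
        # Step 4-5): enhanced-image branch, one 3x3 conv and one 1x1 conv.
        self.enh_branch = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 1))

    def forward(self, low_feat, enh_feat):
        x1 = self.reduce(torch.cat([low_feat, enh_feat], dim=1))    # X_1, shape (B, C, H, W)
        m = x1.mean(dim=(2, 3))                                     # global average pooling, 1x1xC
        w = torch.sigmoid(self.conv1d(m.unsqueeze(1))).squeeze(1)   # formula (7): channel weights
        out = torch.relu(x1 * w[:, :, None, None])                  # formula (8): reweight channels
        return out + self.enh_branch(enh_feat)                      # step 4-5): assumed additive fusion

# Usage: FFMECA(256)(torch.randn(1, 256, 80, 80), torch.randn(1, 256, 80, 80))
```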
5) Training a target detection network: comprising the following steps:
5-1) uniformly scaling the width and height of the dataset image to 640 x 640 pixels;
5-2) inputting the pictures into a backbone network for feature extraction to obtain feature maps with sizes of 80×80×512, 40×40×1024 and 20×20×1024, denoted feature1, feature2 and feature3 respectively;
5-3) sending the feature maps obtained in step 5-2) to a detection neck for feature fusion to obtain feature maps with sizes of 20×20×512, 40×40×256 and 80×80×128, denoted P5_out, P4_out and P3_out respectively; the deepest feature layer feature3 is first processed by an SPPCSPC-CA module and denoted P5; P5 is convolved once by 1×1 and then upsampled, combined with the features of feature2 after a 1×1 convolution, and passed through an ELAN-CA module for feature extraction to give P4; P4 is convolved by 1×1, upsampled, combined with the features of feature1 after a 1×1 convolution, and passed through an ELAN-CA module to obtain P3_out; P3_out is downsampled once, spliced with P4, and passed through an ELAN-CA module to obtain P4_out; P4_out is downsampled once, spliced with P5, and passed through an ELAN-CA module to obtain P5_out (the fusion order is summarized in the sketch following this step);
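The fusion order of step 5-3) can be summarized by the following PyTorch-style sketch; the dictionary m of sub-modules (SPPCSPC-CA, ELAN-CA blocks, 1×1 convolutions, up/down-sampling) and its key names are assumptions used only to make the data flow explicit.

```python
import torch

def neck_forward(feature1, feature2, feature3, m):
    """Sketch of the top-down / bottom-up fusion of step 5-3); `m` maps assumed key names to modules."""
    p5 = m["sppcspc_ca"](feature3)                                           # 20x20x512
    x = m["up"](m["conv_p5"](p5))                                            # 1x1 conv, then upsample
    p4 = m["elan_ca_p4"](torch.cat([x, m["conv_f2"](feature2)], dim=1))      # merge with feature2
    x = m["up"](m["conv_p4"](p4))                                            # 1x1 conv, then upsample
    p3_out = m["elan_ca_p3"](torch.cat([x, m["conv_f1"](feature1)], dim=1))  # 80x80x128
    p4_out = m["elan_ca_p4o"](torch.cat([m["down"](p3_out), p4], dim=1))     # 40x40x256
    p5_out = m["elan_ca_p5o"](torch.cat([m["down"](p4_out), p5], dim=1))     # 20x20x512
    return p3_out, p4_out, p5_out
```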
5-4) fusing the three feature maps obtained in step 5-3) with the features obtained from the enhanced image through the backbone network and the detection neck, and sending the resulting three feature maps into three detection heads, which respectively predict the confidence, the category and the bounding box of the object;
5-5) screening the prediction boxes by filtering out those with low target confidence, then performing non-maximum suppression and selecting the bounding box with the highest confidence as the final detection result, wherein the total loss of the target detection network is shown in formula (9):
L_total = λ_1·L_box + λ_2·L_obj + λ_3·L_cls (9),
wherein λ_1, λ_2 and λ_3 are loss weight balance coefficients, and L_box, L_obj and L_cls are the bounding-box regression loss, the confidence loss and the classification loss, respectively; the confidence loss and the classification loss adopt BCE loss, and the bounding-box regression loss adopts EIoU; the SPPCSPC-CA module is composed of two branches: the first branch changes the number of channels through a convolution with a 1×1 kernel; the second branch changes the number of channels through a convolution with a 1×1 kernel, passes through a CA attention module, then performs maximum pooling operations with four kernel sizes of 1×1, 5×5, 9×9 and 13×13, splices the four features together, adjusts the number of channels through a convolution with a 1×1 kernel, and then extracts features through a convolution with a 3×3 kernel and stride 1; the features obtained by the two branches are spliced together, and the number of channels is adjusted by one more 1×1 convolution (a minimal sketch of this block is given below);
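A minimal PyTorch sketch of the SPPCSPC-CA block described above; the internal channel split (c_out // 2) is an assumption, and the coordinate attention module is passed in as a parameter (an identity module can be used as a placeholder).

```python
import torch
import torch.nn as nn

class SPPCSPCCA(nn.Module):
    """Sketch of the SPPCSPC-CA block: two branches, CA attention on the second branch,
    parallel max pooling at kernel sizes 1, 5, 9 and 13, and 1x1 / 3x3 merge convolutions."""
    def __init__(self, c_in, c_out, coord_att):
        super().__init__()
        c_mid = c_out // 2                                               # assumed internal width
        self.branch1 = nn.Conv2d(c_in, c_mid, 1)                         # first branch: 1x1 conv only
        self.pre = nn.Sequential(nn.Conv2d(c_in, c_mid, 1), coord_att)   # second branch: 1x1 conv + CA
        self.pools = nn.ModuleList([nn.MaxPool2d(k, stride=1, padding=k // 2) for k in (1, 5, 9, 13)])
        self.post = nn.Sequential(nn.Conv2d(4 * c_mid, c_mid, 1),        # splice pooled features, adjust channels
                                  nn.Conv2d(c_mid, c_mid, 3, padding=1)) # 3x3 conv, stride 1
        self.out = nn.Conv2d(2 * c_mid, c_out, 1)                        # final 1x1 conv over both branches

    def forward(self, x):
        y1 = self.branch1(x)
        y2 = self.pre(x)
        y2 = self.post(torch.cat([p(y2) for p in self.pools], dim=1))
        return self.out(torch.cat([y1, y2], dim=1))

# Usage with a placeholder attention module:
#   SPPCSPCCA(1024, 512, nn.Identity())(torch.randn(1, 1024, 20, 20))  # -> (1, 512, 20, 20)
```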
the ELAN-CA module consists of two branches: the first branch adjusts the number of channels through a convolution with a 1×1 kernel; the second branch first adjusts the number of channels through a convolution with a 1×1 kernel, then performs feature extraction through four convolutions with 3×3 kernels and stride 1 and a CA attention module, and finally the four features are spliced together by a Concat operation; a CA attention mechanism is introduced into the backbone network and the detection neck: for any input feature X, each channel is first encoded along the horizontal and vertical directions by pooling kernels of size (H, 1) and (1, W), so that the generated feature layer carries coordinate information, through the transformations of formula (10) and formula (11):
z_c^h(h) = (1/W)·Σ_{0≤i<W} x_c(h, i) (10),
z_c^w(w) = (1/H)·Σ_{0≤j<H} x_c(j, w) (11),
the two aggregated feature maps are then concatenated and, after the 1×1 convolution transformation function of formula (12), an intermediate mapping encoding the spatial information in the vertical and horizontal directions is obtained:
f = δ(F_1([z^h, z^w])) (12),
wherein [·, ·] represents a concatenation operation along the spatial dimension, δ is a nonlinear activation function, and f ∈ R^{C/r×(H+W)} is an intermediate feature map encoding spatial information in both the horizontal and vertical directions;
f is then divided along the spatial dimension into two independent tensors f^h ∈ R^{C/r×H} and f^w ∈ R^{C/r×W}, and the two 1×1 convolution transformations of formula (13) and formula (14) convert f^h and f^w into g^h and g^w with the same number of channels as the input:
g^h = σ(F_h(f^h)) (13),
g^w = σ(F_w(f^w)) (14),
wherein σ is the Sigmoid function;
finally, g^h and g^w are used as attention weights, and the output is obtained as shown in formula (15) (a sketch of this attention module is given below):
y_c(i, j) = x_c(i, j) × g_c^h(i) × g_c^w(j) (15);
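A minimal PyTorch sketch of the CA attention mechanism of formulas (10)-(15); the reduction ratio r and the BN + ReLU placement in the shared transform F_1 are assumed details. An instance of this module can also be passed as the coord_att argument of the SPPCSPC-CA sketch above.

```python
import torch
import torch.nn as nn

class CoordAtt(nn.Module):
    """Sketch of CA attention: directional average pooling, a shared 1x1 transform F_1,
    and per-direction Sigmoid gates F_h and F_w."""
    def __init__(self, channels, r=16):                    # r is an assumed reduction ratio
        super().__init__()
        mid = max(8, channels // r)
        self.f1 = nn.Sequential(nn.Conv2d(channels, mid, 1),
                                nn.BatchNorm2d(mid), nn.ReLU(inplace=True))
        self.fh = nn.Conv2d(mid, channels, 1)
        self.fw = nn.Conv2d(mid, channels, 1)

    def forward(self, x):
        b, c, h, w = x.shape
        z_h = x.mean(dim=3, keepdim=True)                   # formula (10): (H, 1) pooling
        z_w = x.mean(dim=2, keepdim=True).transpose(2, 3)   # formula (11): (1, W) pooling
        f = self.f1(torch.cat([z_h, z_w], dim=2))           # formula (12): concatenate and transform
        f_h, f_w = torch.split(f, [h, w], dim=2)            # split back into the two directions
        g_h = torch.sigmoid(self.fh(f_h))                   # formula (13)
        g_w = torch.sigmoid(self.fw(f_w.transpose(2, 3)))   # formula (14)
        return x * g_h * g_w                                # formula (15): apply the attention weights

# Usage: CoordAtt(256)(torch.randn(1, 256, 80, 80))
```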
6) Training and testing of the low-illumination target detection model: comprising the following steps:
6-1) sending the low-illumination image into the MSFAF-Net network trained in the steps 1) -4) for enhancement;
6-2) taking the original low-illumination image and the enhanced image obtained in the step 6-1) as input of a target detection network at the same time to obtain a detection result;
6-3) visualizing the detection result.
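Steps 6-1) and 6-2) amount to the following inference sketch, assuming YOLO-style prediction tensors whose fifth column holds the object confidence; enhancer and detector stand for the networks trained in the preceding steps, and the confidence threshold is an assumed value.

```python
import torch

@torch.no_grad()
def detect_low_light(image, enhancer, detector, conf_thresh=0.25):
    """Steps 6-1) and 6-2): enhance the low-illumination image, then run dual-input detection."""
    enhanced = enhancer(image)             # 6-1): MSFAF-Net enhancement
    preds = detector(image, enhanced)      # 6-2): original + enhanced image as joint input
    keep = preds[..., 4] > conf_thresh     # 5-5): drop low-confidence prediction boxes
    return preds[keep]                     # non-maximum suppression would follow before 6-3)
```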
2. The MSFAF-Net-based low-illumination target detection method of claim 1, wherein the shallow feature extraction module SFEM of step 2-1) comprises:
2-1-1) the SFEM module consists of two branches; the first branch consists of 3 groups of convolution layers, each activated by a ReLU and followed by a batch normalization layer: the dimension is raised by a convolution with a 1×1 kernel, features are extracted by a convolution with a 3×3 kernel and stride 1, and the number of channels is finally adjusted by a convolution with a 1×1 kernel;
2-1-2) in the other branch, the number of channels is adjusted by a convolution with a 1×1 kernel, and the shallow features of the two branches are finally fused by a Concat operation.
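A minimal PyTorch sketch of the two-branch SFEM of claim 2; the intermediate and output channel widths are assumed values, and the output has twice c_out channels because the two branches are concatenated.

```python
import torch
import torch.nn as nn

class SFEM(nn.Module):
    """Sketch of the shallow feature extraction module."""
    def __init__(self, c_in=3, c_mid=32, c_out=32):          # channel widths are assumed
        super().__init__()
        def cbr(ci, co, k):                                  # conv + BN + ReLU block
            return nn.Sequential(nn.Conv2d(ci, co, k, padding=k // 2, bias=False),
                                 nn.BatchNorm2d(co), nn.ReLU(inplace=True))
        self.branch1 = nn.Sequential(cbr(c_in, c_mid, 1),    # 2-1-1): raise the dimension
                                     cbr(c_mid, c_mid, 3),   # extract features, stride 1
                                     cbr(c_mid, c_out, 1))   # adjust the number of channels
        self.branch2 = cbr(c_in, c_out, 1)                   # 2-1-2): 1x1 branch

    def forward(self, x):
        # Concat fusion of the shallow features from both branches (output has 2 * c_out channels).
        return torch.cat([self.branch1(x), self.branch2(x)], dim=1)
```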
3. The MSFAF-Net-based low-illumination target detection method of claim 1, wherein the deep feature extraction module DFEM of step 2-2) comprises:
2-2-1) the DFEM is formed by stacking three residual dense blocks RDB;
2-2-2) each RDB comprises three parts: dense connections, local feature fusion and feature fusion; each RDB consists of 3 convolution layers activated by ReLU, each convolution layer uses a 3×3 kernel followed by BN, and local feature fusion is performed by a convolution with a 1×1 kernel.
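A minimal PyTorch sketch of one residual dense block and the DFEM stack of claim 3; the growth rate and the residual connection over the fused features are assumptions consistent with the usual RDB design.

```python
import torch
import torch.nn as nn

class RDB(nn.Module):
    """Sketch of one residual dense block: three densely connected 3x3 conv layers with
    BN + ReLU, 1x1 local feature fusion, and an assumed residual connection."""
    def __init__(self, channels, growth=32):                  # growth rate is an assumed value
        super().__init__()
        def layer(cin):
            return nn.Sequential(nn.Conv2d(cin, growth, 3, padding=1, bias=False),
                                 nn.BatchNorm2d(growth), nn.ReLU(inplace=True))
        self.l1 = layer(channels)
        self.l2 = layer(channels + growth)
        self.l3 = layer(channels + 2 * growth)
        self.lff = nn.Conv2d(channels + 3 * growth, channels, 1)   # local feature fusion (1x1)

    def forward(self, x):
        d1 = self.l1(x)
        d2 = self.l2(torch.cat([x, d1], dim=1))                    # dense connections
        d3 = self.l3(torch.cat([x, d1, d2], dim=1))
        return x + self.lff(torch.cat([x, d1, d2, d3], dim=1))     # fused features + residual

class DFEM(nn.Sequential):
    """2-2-1): the DFEM stacks three RDBs."""
    def __init__(self, channels):
        super().__init__(RDB(channels), RDB(channels), RDB(channels))
```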
4. The MSFAF-Net-based low-illumination target detection method of claim 1, wherein the feature enhancement network module FEN of step 3-1) comprises:
3-1-1) the input is first downsampled to obtain feature maps with sizes of 320×320×64, 160×160×128 and 80×80×256; the downsampling process comprises three groups of stacked residual blocks, each residual block comprising 2 convolution layers with a 3×3 kernel and stride 2, each followed by a batch normalization layer and a ReLU activation function; the feature map after each downsampling is reduced to half of its original size, and channel weights are computed by a CA module to adaptively strengthen different channels; for the first downsampling, the number of input channels is 3 and the number of output channels is 64; for the second downsampling, the number of input channels is 64 and the number of output channels is 128; for the third downsampling, the number of input channels is 128 and the number of output channels is 256;
3-1-2) context information at multiple scales is obtained by dilated convolutions with dilation rates of 1, 2 and 5 connected in parallel; in the first branch, the kernel size is 3×3, the dilation rate is 1, the padding is 1, and the numbers of input and output feature map channels are 256; in the second branch, the numbers of input and output feature map channels are 256, the kernel size is 3×3, the dilation rate is 2, and the padding is 2; in the third branch, the numbers of input and output feature map channels are 256, the kernel size is 3×3, the dilation rate is 5, and the padding is 5; in addition, each convolution is followed by a batch normalization layer and a ReLU activation layer;
3-1-3) the feature maps obtained in step 3-1-2) are spliced along the channel dimension, then a convolution with a 1×1 kernel reduces the channel dimension from 768 input channels to 256 output channels, and the obtained feature map is finally passed to the next module;
3-1-4) the feature map obtained in step 3-1-3) is up-sampled to obtain feature maps with sizes of 160×160×128, 320×320×64 and 640×640×3; the up-sampling process comprises three groups of stacked residual blocks, each residual block comprising 2 deconvolution layers with a 3×3 kernel and stride 2, each followed by a batch normalization layer and a ReLU activation function; the size of the feature map is doubled after each up-sampling, forming a symmetric structure, and the downsampled features are introduced into the corresponding up-sampling modules through skip connections; for the first up-sampling, the number of input channels is 256 and the number of output channels is 128; for the second up-sampling, the number of input channels is 128 and the number of output channels is 64; for the third up-sampling, the number of input channels is 64 and the number of output channels is 3;
3-1-5) a final 1×1 convolution operation is applied to obtain the enhanced features.
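A minimal PyTorch sketch of the parallel dilated-convolution context block of steps 3-1-2) and 3-1-3); the surrounding encoder-decoder of steps 3-1-1), 3-1-4) and 3-1-5) is summarized in the trailing comment rather than implemented.

```python
import torch
import torch.nn as nn

class DilatedContextBlock(nn.Module):
    """Sketch of steps 3-1-2) and 3-1-3): three parallel 3x3 dilated convolutions with
    dilation rates 1, 2 and 5 (padding 1, 2, 5), each with BN + ReLU, concatenated to
    768 channels and reduced back to 256 channels by a 1x1 convolution."""
    def __init__(self, channels=256):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Sequential(nn.Conv2d(channels, channels, 3, padding=d, dilation=d, bias=False),
                          nn.BatchNorm2d(channels), nn.ReLU(inplace=True))
            for d in (1, 2, 5)])
        self.fuse = nn.Conv2d(3 * channels, channels, 1)   # 768 -> 256

    def forward(self, x):
        return self.fuse(torch.cat([b(x) for b in self.branches], dim=1))

# The surrounding FEN (steps 3-1-1, 3-1-4, 3-1-5) is a symmetric encoder-decoder:
# three stride-2 residual downsampling blocks with CA attention (3 -> 64 -> 128 -> 256),
# this context block, three stride-2 deconvolution blocks with skip connections
# (256 -> 128 -> 64 -> 3), and a final 1x1 convolution.
```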
Priority Applications (1)

| Application Number | Priority Date | Filing Date | Title |
| --- | --- | --- | --- |
| CN202311333750.6A CN117456330A (en) | 2023-10-16 | 2023-10-16 | MSFAF-Net-based low-illumination target detection method |
Publications (1)

| Publication Number | Publication Date |
| --- | --- |
| CN117456330A (en) | 2024-01-26 |
Cited By (2)

| Publication number | Priority date | Publication date | Assignee | Title |
| --- | --- | --- | --- | --- |
| CN117952985A (en) * | 2024-03-27 | 2024-04-30 | 江西师范大学 | Image data processing method based on lifting information multiplexing under defect detection scene |
| CN118212240A (en) * | 2024-05-22 | 2024-06-18 | 山东华德重工机械有限公司 | Automobile gear production defect detection method |
Legal Events

| Date | Code | Title | Description |
| --- | --- | --- | --- |
|  | PB01 | Publication |  |
|  | SE01 | Entry into force of request for substantive examination |  |