CN117456330A - MSFAF-Net-based low-illumination target detection method - Google Patents
- Publication number
- CN117456330A (application CN202311333750.6A)
- Authority
- CN
- China
- Prior art keywords
- convolution
- feature
- image
- low
- module
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/82—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/0464—Convolutional networks [CNN, ConvNet]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/048—Activation functions
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/20—Image preprocessing
- G06V10/25—Determination of region of interest [ROI] or a volume of interest [VOI]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/40—Extraction of image or video features
- G06V10/44—Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
- G06V10/443—Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components by matching or filtering
- G06V10/449—Biologically inspired filters, e.g. difference of Gaussians [DoG] or Gabor filters
- G06V10/451—Biologically inspired filters, e.g. difference of Gaussians [DoG] or Gabor filters with interaction between the filter responses, e.g. cortical complex cells
- G06V10/454—Integrating the filters into a hierarchical structure, e.g. convolutional neural networks [CNN]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/764—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/77—Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
- G06V10/774—Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/77—Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
- G06V10/80—Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
- G06V10/806—Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V2201/00—Indexing scheme relating to image or video recognition or understanding
- G06V2201/07—Target detection
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02T—CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
- Y02T10/00—Road transport of goods or passengers
- Y02T10/10—Internal combustion engine [ICE] based vehicles
- Y02T10/40—Engine management systems
Abstract
The invention discloses a low-illumination target detection method based on MSFAF-Net, which comprises the following steps: 1) integrating and constructing a data set; 2) training the feature extraction network module; 3) training the feature enhancement network module FEN; 4) training the feature fusion network module FFM-ECA; 5) training the target detection network; 6) training and testing the low-illumination target model. The method strengthens the feature expression capability of the feature maps used in low-illumination image target detection and thereby improves the accuracy of low-illumination target detection.
Description
Technical Field
The invention relates to deep learning, image enhancement, target detection and related technologies, and in particular to a low-illumination target detection method based on the multi-scale feature adaptive fusion network MSFAF-Net (Multi-Scale Feature Adaptive Fusion, MSFAF for short).
Background
Object detection is the basis of many vision tasks and has been applied to many aspects of daily life, such as autonomous driving, medical detection and face recognition, so target detection algorithms are particularly important. Current target detection methods can be divided into traditional detection techniques and deep-learning-based detection algorithms, and the latter can be further divided into single-stage and two-stage detectors. Single-stage algorithms, represented by the YOLO series and SSD, convert the detection problem directly into a regression problem, which greatly increases the inference speed of the model; two-stage algorithms, represented by RCNN and Faster RCNN, first generate candidate boxes and then classify and refine the candidate regions, achieving higher accuracy but training and detecting pictures more slowly. In vision-based target detection research, common neural network models can reach relatively high detection precision when the data set is of good quality, but when they are applied to low-illumination data sets their detection performance is often unsatisfactory.
Low-illumination image enhancement aims to improve the perceptual quality of data captured under insufficient lighting so that more information can be obtained. It has gradually become a research hotspot in the field of image processing and has very broad application prospects in artificial-intelligence-related industries such as autonomous driving and security. Traditional low-illumination image enhancement techniques often require advanced mathematical skills and rigorous mathematical derivation, and the resulting iterative procedures are usually complex and ill-suited to practical applications. With the successive emergence of large-scale data sets, deep-learning-based low-light image enhancement has become the current mainstream technology. The invention adopts a deep learning approach: it first extracts shallow features of the low-illumination image, then extracts deep features, fuses the extracted features, and finally fuses the enhanced features with the original low-illumination image features.
Disclosure of Invention
The invention aims to solve the problem that existing target detectors perform poorly on low-illumination images, and provides a low-illumination target detection method based on MSFAF-Net. The method further enhances the expression capability of low-illumination image features and improves detection performance to a certain extent.
The technical scheme for realizing the aim of the invention is as follows:
a MSFAF-Net-based low-illumination target detection method comprises the following steps:
1) Integrating the build data set: comprising the following steps:
1-1) the Exdark dataset contains 12 types of objects and 7363 images collected in a real low-light environment, wherein 4800 images are used for training and 2563 images are used for testing;
1-2) selecting 4800 images from the PASCAL VOC dataset and applying gamma transformation and superimposed Gaussian noise to generate low-illumination images for training, the synthesis being as shown in formula (1):
I_low = β·I_in^γ + N_o (1),
where I_low is the synthesized low-illumination image, I_in is the input image, N_o is the superimposed Gaussian noise, and β and γ are the coefficients of the gamma transformation (an illustrative synthesis sketch is given after step 1-3));
1-3) training phase, wherein the model is trained on a data set formed by mixing 4800 real low-illumination images and 4800 synthesized low-illumination images, and the model is tested on 2563 real low-illumination images;
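As a non-limiting illustration of the synthesis in steps 1-2) and 1-3), the following Python sketch darkens a normally exposed image with a gamma transformation and superimposes Gaussian noise according to formula (1); the concrete values of β, γ and the noise standard deviation are assumptions chosen for the example, not values specified by the disclosure.

```python
import numpy as np

def synthesize_low_light(img, beta=0.7, gamma=2.5, noise_sigma=0.03, seed=None):
    """img: H x W x 3 array in [0, 1]; returns a synthetic low-illumination image (formula (1))."""
    rng = np.random.default_rng(seed)
    darkened = beta * np.power(img.astype(np.float32), gamma)   # gamma transformation: beta * I_in ** gamma
    noise = rng.normal(0.0, noise_sigma, img.shape)             # superimposed Gaussian noise N_o
    return np.clip(darkened + noise, 0.0, 1.0).astype(np.float32)

# Example: turn a normally exposed PASCAL VOC image (stand-in array here) into a training sample.
voc_image = np.random.rand(640, 640, 3).astype(np.float32)
low_light = synthesize_low_light(voc_image, beta=0.6, gamma=3.0)
```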
2) Training of the feature extraction network module: shallow features pass through fewer network layers and therefore describe details better, but they have smaller receptive fields, contain more noise and carry less semantic information; deep features contain less noise, have larger receptive fields and stronger semantic information, and describe the whole image better, but they struggle to describe details. The technical scheme therefore fuses the shallow and deep features so that the network has richer global feature information, and comprises the following steps:
2-1) preprocessing the data set, i.e. uniformly scaling the width and height of all images to 640 x 640 pixels;
2-2) sending the low-illumination image into a shallow feature extraction module SFEM, and extracting shallow features of the low-illumination image;
2-3) sending the features extracted in the step 2-2) into a deep feature extraction module DFEM to extract deep features of the low-illumination image;
3) Training of the feature enhancement network module FEN: because the extracted feature information differs from region to region, global enhancement often causes overexposure. For this reason the extracted features are sent to a feature enhancement network FEN that incorporates an attention mechanism; the feature enhancement network is an improved U-Net combined with the attention mechanism, and the step comprises:
3-1) sending the features extracted three times by the feature extraction modules into the feature enhancement network module FEN;
3-2) introducing a loss function comprising image multi-scale structural similarity loss, image perception loss and region loss, as shown in formula (2):
L_total = λ_ms·L_ms + λ_pl·L_pl + λ_rl·L_rl (2),
where λ_ms, λ_pl and λ_rl are loss-weight balance coefficients and L_ms, L_pl and L_rl are the image multi-scale structural similarity loss, the image perception loss and the region loss, respectively. The multi-scale structural similarity loss measures the structural similarity of images at different resolutions; it helps restore the brightness and contrast of low-light images, improves image quality and realism, and keeps the image structure consistent. Its definition is shown in formula (3):
where μ_x and μ_y are the image means, σ_x² and σ_y² are the image variances, σ_xy is the image covariance, C_1 and C_2 are constants that prevent the denominator from being 0, and N is the number of local image regions. The structural loss only attends to low-level image information, and the enhanced image may become over-smoothed for lack of deep information; the image perception loss is therefore introduced to evaluate the similarity between the enhanced image and the real image, and is defined as shown in formula (4):
where E and G are the enhanced image and the real low-light image respectively, φ_{i,j} is the feature map of the j-th convolution layer of the i-th block in the VGG16 network, and W_{i,j}, H_{i,j} and G_{i,j} are the dimensions of the feature maps in the VGG16 network. During training, the varying brightness of different image regions makes the enhancement uneven and prone to overexposure, so the image cannot be enhanced as a whole; to prevent overexposure a region loss function is introduced, as shown in formula (5):
where E_l and G_l are the low-light regions of the enhanced image and the real image, and E_h and G_h are the remaining regions; because more attention must be paid to the low-light regions during training, a larger weight value is assigned to the low-light regions;
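A hedged sketch of the composite FEN loss of formula (2) is given below; it assumes PyTorch, the third-party pytorch_msssim package for the multi-scale structural similarity term, torchvision's VGG16 features for the perception term, and a luminance-threshold mask with weights for the low-light and remaining regions for the region term. The threshold, feature depth and loss weights are illustrative assumptions, not values from the disclosure.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.models import vgg16, VGG16_Weights
from pytorch_msssim import ms_ssim   # third-party package "pytorch-msssim"

class VGGPerceptualLoss(nn.Module):
    def __init__(self, depth=16):                      # feature depth is an assumption
        super().__init__()
        self.features = vgg16(weights=VGG16_Weights.IMAGENET1K_V1).features[:depth].eval()
        for p in self.features.parameters():
            p.requires_grad = False

    def forward(self, enhanced, target):
        # ImageNet normalization of the inputs is omitted here for brevity.
        return F.mse_loss(self.features(enhanced), self.features(target))

def region_loss(enhanced, target, dark_thresh=0.3, w_low=5.0, w_high=1.0):
    """Weight the reconstruction error in low-light regions (mask from the target luminance) more heavily."""
    low_mask = (target.mean(dim=1, keepdim=True) < dark_thresh).float()
    err = torch.abs(enhanced - target)
    return (w_low * err * low_mask + w_high * err * (1.0 - low_mask)).mean()

def total_enhancement_loss(enhanced, target, perceptual, lam_ms=1.0, lam_pl=0.1, lam_rl=1.0):
    l_ms = 1.0 - ms_ssim(enhanced, target, data_range=1.0)   # multi-scale structural similarity loss
    l_pl = perceptual(enhanced, target)                      # image perception loss
    l_rl = region_loss(enhanced, target)                     # region loss
    return lam_ms * l_ms + lam_pl * l_pl + lam_rl * l_rl     # formula (2)
```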
4) Training of the feature fusion module FFM-ECA: the feature fusion module FFM-ECA training comprises:
4-1) the low-illumination image features and the enhanced image features are used as the dual-channel input of the FFM-ECA module; the input features X_0 ∈ R^{C×H×W} are spliced, passed through a 3*3 convolution activated by ReLU, and then the number of channels is adjusted by a 1*1 convolution to obtain the feature map X_1 ∈ R^{C×H×W};
4-2) global average pooling is then applied to obtain an aggregated feature map M_C of size 1×1×C, and a fast one-dimensional convolution with kernel size k is applied to M_C, where k is adaptively determined by a mapping of the channel dimension C, as shown in formula (6):
k = |log_2(C)/γ + b/γ|_odd (6),
where k is the convolution kernel size, C is the number of channels, γ and b are used to change the ratio between the number of channels C and the kernel size, and |·|_odd denotes taking the nearest odd number;
4-3) obtaining the weight ω of each channel through the Sigmoid activation function, as shown in formula (7):
ω = σ(C1D_k(X_1)) (7),
where ω represents the weight of each channel, σ represents the Sigmoid activation function, and C1D represents the one-dimensional convolution;
4-4) multiplying the weights with the characteristics of the channels respectively to obtain outputs, as shown in formula (8):
wherein δ represents a ReLU activation function;
4-5) finally fusing the features obtained in the step 4-4) with the features of the enhanced image after one 3*3 convolution and one 1*1 convolution;
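The following non-limiting sketch illustrates one possible realization of the FFM-ECA fusion in steps 4-1) to 4-5): the two inputs are concatenated, reduced by 3*3 and 1*1 convolutions, re-weighted by an ECA-style one-dimensional convolution whose kernel size follows formula (6), and fused with the enhanced features; the hidden channel handling and the exact final fusion path are assumptions.

```python
import math
import torch
import torch.nn as nn

def eca_kernel_size(channels, gamma=2, b=1):
    """Adaptive kernel size k = |log2(C)/gamma + b/gamma|_odd (formula (6))."""
    k = int(abs(math.log2(channels) / gamma + b / gamma))
    return k if k % 2 == 1 else k + 1

class FFMECA(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.reduce = nn.Sequential(                      # splice -> 3*3 conv + ReLU -> 1*1 conv = X_1
            nn.Conv2d(2 * channels, channels, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 1))
        k = eca_kernel_size(channels)
        self.conv1d = nn.Conv1d(1, 1, kernel_size=k, padding=k // 2, bias=False)
        self.fuse = nn.Sequential(                        # one 3*3 and one 1*1 convolution before fusion
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 1))

    def forward(self, low_light_feat, enhanced_feat):
        x1 = self.reduce(torch.cat([low_light_feat, enhanced_feat], dim=1))
        m_c = x1.mean(dim=(2, 3))                                     # global average pooling, (N, C)
        w = torch.sigmoid(self.conv1d(m_c.unsqueeze(1))).squeeze(1)   # channel weights, formula (7)
        x2 = torch.relu(x1 * w.unsqueeze(-1).unsqueeze(-1))           # re-weighted features, formula (8)
        return self.fuse(x2) + enhanced_feat                          # fuse with the enhanced features
```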
5) Training a target detection network: comprising the following steps:
5-1) uniformly scaling the width and height of the dataset image to 640 x 640 pixels;
5-2) inputting the pictures into the backbone network for feature extraction to obtain feature maps of sizes 80×80×512, 40×40×1024 and 20×20×1024, denoted feature1, feature2 and feature3;
5-3) sending the feature maps obtained in step 5-2) to the detection neck for feature fusion to obtain feature maps of sizes 20×20×512, 40×40×256 and 80×80×128, denoted P5_out, P4_out and P3_out, respectively. feature3 is first passed through an SPPCSPC-CA module for feature extraction and denoted P5; P5 is passed through a 1*1 convolution, upsampled, combined with feature2 after feature2 passes through a 1*1 convolution, and then passed through an ELAN-CA module for feature extraction to give P4; P4 is passed through a 1*1 convolution, upsampled, combined with feature1 after feature1 passes through a 1*1 convolution, and passed through an ELAN-CA module to give P3_out; P3_out is downsampled once and spliced with P4, and features are extracted through an ELAN-CA module to give P4_out; P4_out is downsampled once and spliced with P5, and features are extracted through an ELAN-CA module to give P5_out;
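A structural sketch of the neck wiring in step 5-3) follows; the SPPCSPC-CA and ELAN-CA blocks are passed in as abstract modules (their internals are described later), and the reduction and downsampling channel numbers are inferred from the backbone and neck output sizes above, so they should be read as assumptions rather than disclosed values.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class NeckSketch(nn.Module):
    def __init__(self, sppcspc_ca, elan_p4, elan_p3, elan_p4_out, elan_p5_out):
        super().__init__()
        self.sppcspc_ca = sppcspc_ca            # feature3 (20x20x1024) -> P5 (20x20x512)
        self.reduce_p5 = nn.Conv2d(512, 256, 1)
        self.reduce_f2 = nn.Conv2d(1024, 256, 1)
        self.elan_p4 = elan_p4                  # -> P4 (40x40x256)
        self.reduce_p4 = nn.Conv2d(256, 128, 1)
        self.reduce_f1 = nn.Conv2d(512, 128, 1)
        self.elan_p3 = elan_p3                  # -> P3_out (80x80x128)
        self.down_p3 = nn.Conv2d(128, 256, 3, stride=2, padding=1)
        self.elan_p4_out = elan_p4_out          # -> P4_out (40x40x256)
        self.down_p4 = nn.Conv2d(256, 512, 3, stride=2, padding=1)
        self.elan_p5_out = elan_p5_out          # -> P5_out (20x20x512)

    def forward(self, feature1, feature2, feature3):
        p5 = self.sppcspc_ca(feature3)
        p4 = self.elan_p4(torch.cat([F.interpolate(self.reduce_p5(p5), scale_factor=2),
                                     self.reduce_f2(feature2)], dim=1))
        p3_out = self.elan_p3(torch.cat([F.interpolate(self.reduce_p4(p4), scale_factor=2),
                                         self.reduce_f1(feature1)], dim=1))
        p4_out = self.elan_p4_out(torch.cat([self.down_p3(p3_out), p4], dim=1))
        p5_out = self.elan_p5_out(torch.cat([self.down_p4(p4_out), p5], dim=1))
        return p3_out, p4_out, p5_out
```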
5-4) carrying out feature fusion on the three feature images obtained in the step 5-3) and the features obtained by the enhanced images through a backbone network and a detection neck, and respectively sending the three feature images into three detection heads, wherein the detection heads respectively predict the confidence level, the category and the boundary frame of the object;
5-5) screening the prediction boxes: prediction boxes with low target confidence are filtered out, non-maximum suppression is then performed, and the bounding box with the highest confidence is selected as the final detection result; the total loss of the target detection network is shown in formula (9):
L_total = λ_1·L_box + λ_2·L_obj + λ_3·L_cls (9),
where λ_1, λ_2 and λ_3 are loss-weight balance coefficients and L_box, L_obj and L_cls are the bounding-box regression loss, the confidence loss and the classification loss, respectively; the confidence loss and classification loss use BCELoss and the bounding-box regression loss uses EIoU. The SPPCSPC-CA module consists of two branches: the first branch changes the number of channels through one 1*1 convolution; the second branch changes the number of channels through one 1*1 convolution, passes through a CA attention module, performs max-pooling operations with the four kernel sizes 1×1, 5×5, 9×9 and 13×13, splices the four features together, adjusts the number of channels through a convolution with kernel size 1*1, and then extracts features through a convolution with kernel size 3*3 and stride 1; the features obtained by the two branches are spliced together and the number of channels is adjusted by one more 1*1 convolution;
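A hedged sketch of the SPPCSPC-CA block described above: branch 1 is a single 1*1 convolution, branch 2 applies a 1*1 convolution, a CA attention module, parallel max-pooling with kernel sizes 1/5/9/13, a 1*1 and a 3*3 convolution, and the two branches are concatenated and reduced by a final 1*1 convolution; the hidden width is an assumption, and the CA module is passed in.

```python
import torch
import torch.nn as nn

class SPPCSPCCA(nn.Module):
    def __init__(self, in_ch, out_ch, ca_module, hidden=None):
        super().__init__()
        hidden = hidden or out_ch // 2
        self.branch1 = nn.Conv2d(in_ch, hidden, 1)                 # first branch: 1*1 channel change
        self.pre = nn.Sequential(nn.Conv2d(in_ch, hidden, 1), ca_module)
        self.pools = nn.ModuleList(
            nn.MaxPool2d(kernel_size=k, stride=1, padding=k // 2) for k in (1, 5, 9, 13))
        self.post = nn.Sequential(
            nn.Conv2d(4 * hidden, hidden, 1),                      # adjust channels after splicing
            nn.Conv2d(hidden, hidden, 3, stride=1, padding=1))     # 3*3, stride 1, feature extraction
        self.out = nn.Conv2d(2 * hidden, out_ch, 1)                # final 1*1 channel adjustment

    def forward(self, x):
        b1 = self.branch1(x)
        y = self.pre(x)
        y = self.post(torch.cat([pool(y) for pool in self.pools], dim=1))
        return self.out(torch.cat([b1, y], dim=1))
```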
The ELAN-CA module consists of two branches: the first branch adjusts the number of channels through a convolution with kernel size 1*1; the second branch first applies a 1*1 convolution, then extracts features through convolutions with kernel size 3*3 and stride 1, then passes through a CA attention module, and finally a Concat operation splices the four features together. To improve the network's ability to detect targets, a CA attention mechanism is introduced into the backbone network and the detection neck; the CA attention mechanism effectively combines channel attention with spatial attention and also embeds positional information into the channel attention, thereby improving the network's ability to extract low-illumination image features. For any input feature X, each channel is first encoded along the horizontal and vertical directions using pooling kernels of size (H, 1) and (1, W), and the transformations of formula (10) and formula (11) give the resulting feature layers their coordinate information:
The two aggregated feature maps are then concatenated and, after the 1×1 convolution transformation function of formula (12), an intermediate mapping encoding the spatial information of the vertical and horizontal directions is obtained:
f = δ(F_1([z^h, z^w])) (12),
where [·] denotes a concatenation operation along the spatial dimension, δ is a nonlinear activation function, and f ∈ R^{C/r×(H+W)} is an intermediate feature map encoding the spatial information of both the horizontal and vertical directions.
f is then split along the spatial dimension into two independent tensors f^h ∈ R^{C/r×H} and f^w ∈ R^{C/r×W}, and the two 1×1 convolution transforms of formula (13) and formula (14) convert f^h and f^w into g^h and g^w with the same number of channels:
g^h = σ(F_h(f^h)) (13),
g^w = σ(F_w(f^w)) (14),
Where σ is a Sigmoid function,
finally, g^h and g^w are used as attention weights to re-weight the input feature and obtain the output, as shown in formula (15):
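The following is a minimal coordinate-attention sketch corresponding to formulas (10)-(15): direction-wise average pooling along the two spatial axes, a shared 1×1 transform F_1, a split into f^h and f^w, two 1×1 transforms F_h and F_w with Sigmoid, and re-weighting of the input; the reduction ratio r is an assumption.

```python
import torch
import torch.nn as nn

class CoordAttention(nn.Module):
    def __init__(self, channels, reduction=32):
        super().__init__()
        mid = max(8, channels // reduction)
        self.f1 = nn.Sequential(nn.Conv2d(channels, mid, 1), nn.BatchNorm2d(mid), nn.ReLU(inplace=True))
        self.f_h = nn.Conv2d(mid, channels, 1)
        self.f_w = nn.Conv2d(mid, channels, 1)

    def forward(self, x):
        n, c, h, w = x.shape
        z_h = x.mean(dim=3, keepdim=True)                       # pooling with kernel (1, W): (N, C, H, 1)
        z_w = x.mean(dim=2, keepdim=True).permute(0, 1, 3, 2)   # pooling with kernel (H, 1): (N, C, W, 1)
        f = self.f1(torch.cat([z_h, z_w], dim=2))               # concatenation + 1x1 transform, formula (12)
        f_h, f_w = torch.split(f, [h, w], dim=2)                # split back into the two directions
        g_h = torch.sigmoid(self.f_h(f_h))                      # formula (13)
        g_w = torch.sigmoid(self.f_w(f_w)).permute(0, 1, 3, 2)  # formula (14)
        return x * g_h * g_w                                    # re-weight the input, formula (15)
```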
6) Training and testing the low-illumination target model: comprising the following steps:
6-1) sending the low-illumination image into the MSFAF-Net network trained in the steps 1) -4) for enhancement;
6-2) taking the original low-illumination image and the enhanced image obtained in the step 6-1) as input of a target detection network at the same time to obtain a detection result;
6-3) visualizing the detection result.
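For illustration only, the inference flow of steps 6-1) to 6-3) can be sketched as follows; msfaf_net, detector and the visualization routine are placeholders for the trained modules, not a real API, and the confidence and IoU thresholds are assumptions.

```python
import torch
from torchvision.ops import nms

@torch.no_grad()
def detect_low_light(image, msfaf_net, detector, conf_thresh=0.25, iou_thresh=0.45):
    """image: (1, 3, 640, 640) tensor in [0, 1]; msfaf_net and detector are the trained modules."""
    enhanced = msfaf_net(image)                         # step 6-1): enhancement by the trained MSFAF-Net
    boxes, scores, labels = detector(image, enhanced)   # step 6-2): dual input to the detection network
    keep = scores > conf_thresh                         # filter low-confidence predictions
    boxes, scores, labels = boxes[keep], scores[keep], labels[keep]
    keep = nms(boxes, scores, iou_thresh)               # non-maximum suppression
    return boxes[keep], scores[keep], labels[keep]      # step 6-3): hand these to a visualization routine
```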
The shallow feature extraction module SFEM described in step 2-1) includes:
2-1-1) the SFEM module consists of two branches, wherein the first branch consists of 3 groups of convolution layers, each group of convolution layers is activated by a ReLU and added with a batch normalization layer, the dimension is increased by convolution with a convolution kernel size of 1*1, the characteristics are extracted by convolution with a convolution kernel size of 3*3 and a step size of 1, and finally the number of channels is adjusted by convolution with a convolution kernel size of 1*1;
2-1-2) in another branch, the number of channels is adjusted by adopting a convolution with a convolution kernel size of 1*1, and finally, shallow features are fused by adopting a Concat operation.
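A non-limiting sketch of the SFEM structure in steps 2-1-1) and 2-1-2): branch 1 stacks three conv + batch-normalization + ReLU groups (1*1 expansion, 3*3 extraction with stride 1, 1*1 channel adjustment), branch 2 is a single 1*1 convolution, and the two branches are fused by Concat; the expansion factor and output width are assumptions.

```python
import torch
import torch.nn as nn

def conv_bn_relu(in_ch, out_ch, k, stride=1):
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, k, stride=stride, padding=k // 2, bias=False),
        nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True))

class SFEM(nn.Module):
    def __init__(self, in_ch=3, width=32, expand=2):
        super().__init__()
        self.branch1 = nn.Sequential(
            conv_bn_relu(in_ch, width * expand, 1),             # 1*1 convolution raises the dimension
            conv_bn_relu(width * expand, width * expand, 3),    # 3*3 convolution, stride 1, extracts features
            conv_bn_relu(width * expand, width, 1))             # 1*1 convolution adjusts the channel number
        self.branch2 = nn.Conv2d(in_ch, width, 1)               # second branch: 1*1 channel adjustment

    def forward(self, x):
        return torch.cat([self.branch1(x), self.branch2(x)], dim=1)   # Concat fuses the shallow features
```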
The deep feature extraction module DFEM in step 2-2) includes:
2-2-1) DFEM is stacked from three residual dense blocks RDB;
2-2-2) RDB comprises three parts: dense connection, local feature fusion and feature fusion, each RDB consists of 3 convolution layers and is activated by ReLU, each convolution layer uses a convolution kernel of size 3*3 and adds BN and performs local feature fusion by convolutions of convolution kernel size 1*1.
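A hedged sketch of the DFEM structure in steps 2-2-1) and 2-2-2): each residual dense block (RDB) densely connects three 3*3 conv + BN + ReLU layers, performs local feature fusion with a 1*1 convolution and adds the block input, and DFEM stacks three RDBs; the growth rate is an assumption.

```python
import torch
import torch.nn as nn

class RDB(nn.Module):
    def __init__(self, channels, growth=32):
        super().__init__()
        self.layers = nn.ModuleList()
        in_ch = channels
        for _ in range(3):                                  # three densely connected 3*3 conv + BN + ReLU layers
            self.layers.append(nn.Sequential(
                nn.Conv2d(in_ch, growth, 3, padding=1), nn.BatchNorm2d(growth), nn.ReLU(inplace=True)))
            in_ch += growth
        self.local_fusion = nn.Conv2d(in_ch, channels, 1)   # local feature fusion by a 1*1 convolution

    def forward(self, x):
        feats = [x]
        for layer in self.layers:
            feats.append(layer(torch.cat(feats, dim=1)))    # dense connections
        return x + self.local_fusion(torch.cat(feats, dim=1))

class DFEM(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.blocks = nn.Sequential(RDB(channels), RDB(channels), RDB(channels))   # three stacked RDBs

    def forward(self, x):
        return self.blocks(x)
```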
The feature enhanced network module FEN in step 3-1) includes:
3-1-1) downsampling is first performed to obtain feature maps of sizes 320×320×64, 160×160×128 and 80×80×256. The downsampling path comprises three groups of stacked residual blocks; each residual block contains 2 convolution layers with kernel size 3*3 and stride 2, each followed by a batch normalization layer and a ReLU activation function, and the feature map is reduced to half of its original size after each downsampling. Channel weights are then calculated by a CA module to adaptively strengthen different channels. For the first downsampling the number of input channels is 3 and the number of output channels is 64; for the second downsampling the number of input channels is 64 and the number of output channels is 128; for the third downsampling the number of input channels is 128 and the number of output channels is 256;
3-1-2) context information at multiple scales is obtained by dilated convolutions with dilation rates of 1, 2 and 5 connected in parallel. In the first branch the kernel size is 3*3, the dilation rate is 1, the padding is 1, and the numbers of input and output feature-map channels are 256; in the second branch the numbers of input and output feature-map channels are 256, the kernel size is 3*3, the dilation rate is 2, and the padding is 2; in the third branch the numbers of input and output feature-map channels are 256, the kernel size is 3*3, the dilation rate is 5, and the padding is 5; in addition, each convolution is followed by a batch normalization layer and a ReLU activation layer;
3-1-3) splicing the feature images obtained in the step 3-1-2) according to channel dimensions, then performing a convolution operation to reduce the channel dimensions, wherein the convolution kernel of the convolution is 1*1, the number of input channels is 768, the number of output channels is 256, and finally transmitting the obtained feature images to the next module;
3-1-4) the feature map obtained in step 3-1-3) is upsampled to obtain feature maps of sizes 160×160×128, 320×320×64 and 640×640×3. The upsampling path comprises three groups of stacked residual blocks; each residual block contains 2 deconvolution layers with kernel size 3*3 and stride 2, each followed by a batch normalization layer and a ReLU activation function, and the feature map is enlarged to twice its original size after each upsampling, forming a symmetric structure. The downsampled features are introduced into the corresponding upsampling modules by skip connections. For the first upsampling the number of input channels is 256 and the number of output channels is 128; for the second upsampling the number of input channels is 128 and the number of output channels is 64; for the third upsampling the number of input channels is 64 and the number of output channels is 3;
3-1-5) are finally subjected to 1*1 convolution operations to obtain enhanced features.
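A compact sketch of the FEN structure in steps 3-1-1) to 3-1-5): a three-stage downsampling path weighted by CA modules (3 -> 64 -> 128 -> 256 channels), a parallel dilated-convolution bottleneck with dilation rates 1, 2 and 5, and a symmetric upsampling path with skip connections followed by a final 1*1 convolution. The internal residual shortcuts of each block are omitted for brevity, merging the skips by addition is an assumption, and ca_factory stands for a constructor of the CA module sketched earlier.

```python
import torch
import torch.nn as nn

def down_block(in_ch, out_ch):
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, 3, stride=2, padding=1), nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True),
        nn.Conv2d(out_ch, out_ch, 3, stride=1, padding=1), nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True))

def up_block(in_ch, out_ch):
    return nn.Sequential(
        nn.ConvTranspose2d(in_ch, out_ch, 3, stride=2, padding=1, output_padding=1),
        nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True),
        nn.Conv2d(out_ch, out_ch, 3, stride=1, padding=1), nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True))

def dilated_branch(rate, channels=256):
    return nn.Sequential(
        nn.Conv2d(channels, channels, 3, padding=rate, dilation=rate),
        nn.BatchNorm2d(channels), nn.ReLU(inplace=True))

class FEN(nn.Module):
    def __init__(self, ca_factory):
        super().__init__()
        self.down1, self.ca1 = down_block(3, 64), ca_factory(64)      # 640x640x3 -> 320x320x64
        self.down2, self.ca2 = down_block(64, 128), ca_factory(128)   # -> 160x160x128
        self.down3, self.ca3 = down_block(128, 256), ca_factory(256)  # -> 80x80x256
        self.bottleneck = nn.ModuleList([dilated_branch(r) for r in (1, 2, 5)])
        self.fuse = nn.Conv2d(768, 256, 1)                            # 768 -> 256 channels
        self.up1 = up_block(256, 128)                                 # -> 160x160x128
        self.up2 = up_block(128, 64)                                  # -> 320x320x64
        self.up3 = up_block(64, 3)                                    # -> 640x640x3
        self.out = nn.Conv2d(3, 3, 1)                                 # final 1*1 convolution

    def forward(self, x):
        d1 = self.ca1(self.down1(x))
        d2 = self.ca2(self.down2(d1))
        d3 = self.ca3(self.down3(d2))
        b = self.fuse(torch.cat([branch(d3) for branch in self.bottleneck], dim=1))
        u1 = self.up1(b) + d2        # skip connection from the downsampling path
        u2 = self.up2(u1) + d1
        return self.out(self.up3(u2))
```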
According to the technical scheme, in target detection the feature details extracted by a target detection model are easily lost because the feature details of low-illumination images are not obvious. The original low-illumination image is therefore first enhanced through feature extraction, feature enhancement, feature fusion and related operations: shallow and deep features are extracted so that the network has richer global feature information, a feature enhancement module with an attention mechanism adaptively strengthens different channels, and a feature fusion module fuses the low-illumination image features with the enhanced image features to improve image quality. The target detection network comprises a backbone network, a detection neck and detection heads, and the original low-illumination image and the enhanced image are input into the target detector simultaneously for detection.
The method strengthens the feature expression capability of the feature maps used in low-illumination image target detection, thereby improving the accuracy of low-illumination target detection.
Drawings
FIG. 1 is a diagram of a MSFAF-Net network architecture in an embodiment;
FIG. 2 is a block diagram of an SFEM module in an embodiment;
FIG. 3 is a block diagram of a DFEM module in an embodiment;
FIG. 4 is a diagram of the RDB module architecture in an embodiment;
FIG. 5 is a FEN block diagram in an embodiment;
FIG. 6 is a block diagram of FFM-ECA in an embodiment;
FIG. 7 is a block diagram of an SPPCSPC-CA module in an embodiment;
FIG. 8 is a block diagram of an ELAN-CA module in an embodiment;
fig. 9 is a schematic diagram of a CA attention mechanism in an embodiment.
Detailed Description
The present invention will now be further illustrated with reference to the drawings and examples, but is not limited thereto.
Examples:
referring to fig. 1, a low-illuminance target detection method based on MSFAF-Net includes the steps of:
1) Integrating the build data set: comprising the following steps:
1-1) the Exdark dataset contains 12 types of objects and 7363 images collected in a real low-light environment, wherein 4800 images are used for training and 2563 images are used for testing;
1-2) selecting 4800 images from the PASCAL VOC dataset and applying gamma transformation and superimposed Gaussian noise to generate low-illumination images for training, the synthesis being as shown in formula (1):
I_low = β·I_in^γ + N_o (1),
where I_low is the synthesized low-illumination image, I_in is the input image, N_o is the superimposed Gaussian noise, and β and γ are the coefficients of the gamma transformation;
1-3) training phase, wherein the model is trained on a data set formed by mixing 4800 real low-illumination images and 4800 synthesized low-illumination images, and the model is tested on 2563 real low-illumination images;
2) Training of the feature extraction network module: shallow features pass through fewer network layers and therefore describe details better, but they have smaller receptive fields, contain more noise and carry less semantic information; deep features contain less noise, have larger receptive fields and stronger semantic information, and describe the whole image better, but they struggle to describe details. The shallow and deep features are therefore fused so that the network has richer global feature information. This step comprises:
2-1) preprocessing the data set, i.e. uniformly scaling the width and height of all images to 640 x 640 pixels;
2-2) sending the low-illumination image into a shallow feature extraction module SFEM, and extracting shallow features of the low-illumination image;
2-3) sending the features extracted in the step 2-2) into a deep feature extraction module DFEM to extract deep features of the low-illumination image;
3) Training of the feature enhancement network module FEN: because the extracted feature information differs from region to region, global enhancement often causes overexposure. For this reason the extracted features are sent to a feature enhancement network FEN that incorporates an attention mechanism; the feature enhancement network is an improved U-Net combined with the attention mechanism, and the step comprises:
3-1) sending the features extracted three times by the feature extraction modules into the feature enhancement network module FEN;
3-2) introducing a loss function comprising image multi-scale structural similarity loss, image perception loss and region loss, as shown in formula (2):
L_total = λ_ms·L_ms + λ_pl·L_pl + λ_rl·L_rl (2),
where λ_ms, λ_pl and λ_rl are loss-weight balance coefficients and L_ms, L_pl and L_rl are the image multi-scale structural similarity loss, the image perception loss and the region loss, respectively. The multi-scale structural similarity loss measures the structural similarity of images at different resolutions; it helps restore the brightness and contrast of low-light images, improves image quality and realism, and keeps the image structure consistent. Its definition is shown in formula (3):
where μ_x and μ_y are the image means, σ_x² and σ_y² are the image variances, σ_xy is the image covariance, C_1 and C_2 are constants that prevent the denominator from being 0, and N is the number of local image regions. The structural loss only attends to low-level image information, and the enhanced image may become over-smoothed for lack of deep information; the image perception loss is therefore introduced to evaluate the similarity between the enhanced image and the real image, and is defined as shown in formula (4):
where E and G are the enhanced image and the real low-light image respectively, φ_{i,j} is the feature map of the j-th convolution layer of the i-th block in the VGG16 network, and W_{i,j}, H_{i,j} and G_{i,j} are the dimensions of the feature maps in the VGG16 network. During training, the varying brightness of different image regions makes the enhancement uneven and prone to overexposure, so the image cannot be enhanced as a whole; to prevent overexposure a region loss function is introduced, as shown in formula (5):
where E_l and G_l are the low-light regions of the enhanced image and the real image, and E_h and G_h are the remaining regions; because more attention must be paid to the low-light regions during training, a larger weight value is assigned to the low-light regions, in this example ω_l = 5 and ω_h = 1;
In this example, dilated convolution is introduced between the encoder and the decoder to replace the bottom convolution layer of the traditional U-Net, and context information at multiple scales is obtained by dilated convolutions with dilation rates of 1, 2 and 5 combined in parallel. In the first branch the kernel size is 3*3, the dilation rate is 1, the padding is 1, and the numbers of input and output feature-map channels are 256; in the second branch the numbers of input and output feature-map channels are 256, the kernel size is 3*3, the dilation rate is 2, and the padding is 2; in the third branch the numbers of input and output feature-map channels are 256, the kernel size is 3*3, the dilation rate is 5, and the padding is 5. In addition, each convolution is followed by a batch normalization layer and a ReLU activation layer. The resulting feature maps are spliced along the channel dimension, a convolution operation is then executed to reduce the channel dimension (the kernel size is 1*1, the number of input channels is 768 and the number of output channels is 256), and finally the obtained feature map is passed to the next module;
4) Training of the feature fusion module FFM-ECA, as shown in fig. 6, comprising:
4-1) the low-illumination image features and the enhanced image features are used as the dual-channel input of the FFM-ECA module; the input features X_0 ∈ R^{C×H×W} are spliced, passed through a 3*3 convolution activated by ReLU, and then the number of channels is adjusted by a 1*1 convolution to obtain the feature map X_1 ∈ R^{C×H×W};
4-2) global average pooling is then applied to obtain an aggregated feature map M_C of size 1×1×C, and a fast one-dimensional convolution with kernel size k is applied to M_C, where k is adaptively determined by a mapping of the channel dimension C, as shown in formula (6):
k = |log_2(C)/γ + b/γ|_odd (6),
where k is the convolution kernel size, C is the number of channels, γ and b are used to change the ratio between the number of channels C and the kernel size, and |·|_odd denotes taking the nearest odd number;
4-3) obtaining the weight ω of each channel through the Sigmoid activation function, as shown in formula (7):
ω = σ(C1D_k(X_1)) (7),
where ω represents the weight of each channel, σ represents the Sigmoid activation function, and C1D represents the one-dimensional convolution;
4-4) multiplying the weights with the characteristics of the channels respectively to obtain outputs, as shown in formula (8):
wherein δ represents a ReLU activation function;
4-5) finally fusing the features obtained in the step 4-4) with the features of the enhanced image after one 3*3 convolution and one 1*1 convolution;
5) Training a target detection network: comprising the following steps:
5-1) uniformly scaling the width and height of the dataset image to 640 x 640 pixels;
5-2) inputting the pictures into the backbone network for feature extraction to obtain feature maps of sizes 80×80×512, 40×40×1024 and 20×20×1024, denoted feature1, feature2 and feature3;
5-3) sending the feature maps obtained in step 5-2) to the detection neck for feature fusion to obtain feature maps of sizes 20×20×512, 40×40×256 and 80×80×128, denoted P5_out, P4_out and P3_out, respectively. feature3 is first passed through an SPPCSPC-CA module for feature extraction and denoted P5; P5 is passed through a 1*1 convolution, upsampled, combined with feature2 after feature2 passes through a 1*1 convolution, and then passed through an ELAN-CA module for feature extraction to give P4; P4 is passed through a 1*1 convolution, upsampled, combined with feature1 after feature1 passes through a 1*1 convolution, and passed through an ELAN-CA module to give P3_out; P3_out is downsampled once and spliced with P4, and features are extracted through an ELAN-CA module to give P4_out; P4_out is downsampled once and spliced with P5, and features are extracted through an ELAN-CA module to give P5_out. To improve the network's ability to detect targets, a CA attention mechanism is introduced into the backbone network and the detection neck; the CA attention mechanism effectively combines channel attention with spatial attention and embeds positional information into the channel attention, thereby improving the network's ability to extract low-illumination image features;
5-4) carrying out feature fusion on the three feature images obtained in the step 5-3) and the features obtained by the enhanced images through a backbone network and a detection neck, and respectively sending the three feature images into three detection heads, wherein the detection heads respectively predict the confidence level, the category and the boundary frame of the object;
5-5) screening the prediction boxes: prediction boxes with low target confidence are filtered out, non-maximum suppression is then performed, and the bounding box with the highest confidence is selected as the final detection result; the total loss of the target detection network is shown in formula (9):
L_total = λ_1·L_box + λ_2·L_obj + λ_3·L_cls (9),
where λ_1, λ_2 and λ_3 are loss-weight balance coefficients and L_box, L_obj and L_cls are the bounding-box regression loss, the confidence loss and the classification loss, respectively; the confidence loss and classification loss use BCELoss and the bounding-box regression loss uses EIoU.
In this example, as shown in fig. 7, the SPPCSPC-CA module in the detection neck consists of two branches: the first branch changes the number of channels through one 1*1 convolution; the second branch changes the number of channels through one 1*1 convolution, passes through a CA attention module, performs max-pooling operations with the four kernel sizes 1×1, 5×5, 9×9 and 13×13, splices the four features together, adjusts the number of channels through a convolution with kernel size 1*1, and then extracts features through a convolution with kernel size 3*3 and stride 1; the features obtained by the two branches are spliced together and the number of channels is adjusted by one more 1*1 convolution.
The ELAN-CA module, shown in fig. 8, consists of two branches: the first branch adjusts the number of channels through a convolution with kernel size 1*1; the second branch first applies a 1*1 convolution, then extracts features through convolutions with kernel size 3*3 and stride 1, then passes through a CA attention module, and finally a Concat operation splices the four features together. To improve the network's ability to detect targets, a CA attention mechanism is introduced into the backbone network and the detection neck; the CA attention mechanism effectively combines channel attention with spatial attention and embeds positional information into the channel attention, thereby improving the network's ability to extract low-illumination image features. The CA attention mechanism is shown in fig. 9: for any input feature X, each channel is first encoded along the horizontal and vertical directions using pooling kernels of size (H, 1) and (1, W), and the transformations of formula (10) and formula (11) give the resulting feature layers their coordinate information.
The two aggregated feature maps are then concatenated and, after the 1×1 convolution transformation function of formula (12), an intermediate mapping encoding the spatial information of the vertical and horizontal directions is obtained:
f = δ(F_1([z^h, z^w])) (12),
where [·] denotes a concatenation operation along the spatial dimension, δ is a nonlinear activation function, and f ∈ R^{C/r×(H+W)} is an intermediate feature map encoding the spatial information of both the horizontal and vertical directions.
f is then split along the spatial dimension into two independent tensors f^h ∈ R^{C/r×H} and f^w ∈ R^{C/r×W}, and the two 1×1 convolution transforms of formula (13) and formula (14) convert f^h and f^w into g^h and g^w with the same number of channels:
g^h = σ(F_h(f^h)) (13),
g^w = σ(F_w(f^w)) (14),
Where σ is a Sigmoid function,
finally, g^h and g^w are used as attention weights to re-weight the input feature and obtain the output, as shown in formula (15):
6) Training and testing the low-illumination target model: comprising the following steps:
6-1) sending the low-illumination image into the MSFAF-Net network trained in the steps 1) -4) for enhancement;
6-2) taking the original low-illumination image and the enhanced image obtained in the step 6-1) as input of a target detection network at the same time to obtain a detection result;
6-3) visualizing the detection result.
The shallow feature extraction module SFEM described in step 2-1) is shown in fig. 2, and includes:
2-1-1) the SFEM module consists of two branches, wherein the first branch consists of 3 groups of convolution layers, each group of convolution layers is activated by a ReLU and added with a batch normalization layer, the dimension is increased by convolution with a convolution kernel size of 1*1, the characteristics are extracted by convolution with a convolution kernel size of 3*3 and a step size of 1, and finally the number of channels is adjusted by convolution with a convolution kernel size of 1*1;
2-1-2) in another branch, the number of channels is adjusted by adopting a convolution with a convolution kernel size of 1*1, and finally, shallow features are fused by adopting a Concat operation.
The deep feature extraction module DFEM in step 2-2) is shown in fig. 3, and includes:
2-2-1) DFEM is stacked from three residual dense blocks RDB;
2-2-2) RDB as shown in FIG. 4, includes three parts: dense connection, local feature fusion and feature fusion, each RDB consists of 3 convolution layers and is activated by ReLU, each convolution layer uses a convolution kernel of size 3*3 and adds BN and performs local feature fusion by convolutions of convolution kernel size 1*1.
The feature enhanced network module FEN in step 3-1) is shown in fig. 5, and includes:
3-1-1) downsampling is first performed to obtain feature maps of sizes 320×320×64, 160×160×128 and 80×80×256. The downsampling path comprises three groups of stacked residual blocks; each residual block contains 2 convolution layers with kernel size 3*3 and stride 2, each followed by a batch normalization layer and a ReLU activation function, and the feature map is reduced to half of its original size after each downsampling. Channel weights are then calculated by a CA module to adaptively strengthen different channels. For the first downsampling the number of input channels is 3 and the number of output channels is 64; for the second downsampling the number of input channels is 64 and the number of output channels is 128; for the third downsampling the number of input channels is 128 and the number of output channels is 256;
3-1-2) context information at multiple scales is obtained by dilated convolutions with dilation rates of 1, 2 and 5 connected in parallel. In the first branch the kernel size is 3*3, the dilation rate is 1, the padding is 1, and the numbers of input and output feature-map channels are 256; in the second branch the numbers of input and output feature-map channels are 256, the kernel size is 3*3, the dilation rate is 2, and the padding is 2; in the third branch the numbers of input and output feature-map channels are 256, the kernel size is 3*3, the dilation rate is 5, and the padding is 5; in addition, each convolution is followed by a batch normalization layer and a ReLU activation layer;
3-1-3) splicing the feature images obtained in the step 3-1-2) according to channel dimensions, then performing a convolution operation to reduce the channel dimensions, wherein the convolution kernel of the convolution is 1*1, the number of input channels is 768, the number of output channels is 256, and finally transmitting the obtained feature images to the next module;
3-1-4) the feature map obtained in step 3-1-3) is upsampled to obtain feature maps of sizes 160×160×128, 320×320×64 and 640×640×3. The upsampling path comprises three groups of stacked residual blocks; each residual block contains 2 deconvolution layers with kernel size 3*3 and stride 2, each followed by a batch normalization layer and a ReLU activation function, and the feature map is enlarged to twice its original size after each upsampling, forming a symmetric structure. The downsampled features are introduced into the corresponding upsampling modules by skip connections. For the first upsampling the number of input channels is 256 and the number of output channels is 128; for the second upsampling the number of input channels is 128 and the number of output channels is 64; for the third upsampling the number of input channels is 64 and the number of output channels is 3;
3-1-5) are finally subjected to 1*1 convolution operations to obtain enhanced features.
Claims (4)
1. The MSFAF-Net-based low-illumination target detection method is characterized by comprising the following steps of:
1) Integrating the build data set: comprising the following steps:
1-1) the Exdark dataset contains 12 types of objects and 7363 images collected in a real low-light environment, wherein 4800 images are used for training and 2563 images are used for testing;
1-2) selecting 4800 images from the PASCAL VOC dataset and applying gamma transformation and superimposed Gaussian noise to generate low-illumination images for training, the synthesis being as shown in formula (1):
I_low = β·I_in^γ + N_o (1),
where I_low is the synthesized low-illumination image, I_in is the input image, N_o is the superimposed Gaussian noise, and β and γ are the coefficients of the gamma transformation;
1-3) training phase, wherein the model is trained on a data set formed by mixing 4800 real low-illumination images and 4800 synthesized low-illumination images, and the model is tested on 2563 real low-illumination images;
2) Training of a feature extraction network module: comprising the following steps:
2-1) preprocessing the data set, i.e. uniformly scaling the width and height of all images to 640 x 640 pixels;
2-2) sending the low-illumination image into a shallow feature extraction module SFEM, and extracting shallow features of the low-illumination image;
2-3) sending the features extracted in the step 2-2) into a deep feature extraction module DFEM to extract deep features of the low-illumination image;
3) Training of feature enhanced network module FEN: comprising the following steps:
3-1) sending the features extracted three times by the feature extraction modules into the feature enhancement network module FEN;
3-2) introducing a loss function comprising image multi-scale structural similarity loss, image perception loss and region loss, as shown in formula (2):
L_total = λ_ms·L_ms + λ_pl·L_pl + λ_rl·L_rl (2),
where λ_ms, λ_pl and λ_rl are loss-weight balance coefficients and L_ms, L_pl and L_rl are the image multi-scale structural similarity loss, the image perception loss and the region loss, respectively;
the multi-scale structural similarity loss measures the structural similarity of images at different resolutions by comparing them, and is defined as shown in formula (3):
where μ_x and μ_y are the image means, σ_x² and σ_y² are the image variances, σ_xy is the image covariance, C_1 and C_2 are constants that prevent the denominator from being 0, and N is the number of local image regions; the image perception loss is introduced and is defined as shown in formula (4):
where E and G are the enhanced image and the real low-light image respectively, φ_{i,j} is the feature map of the j-th convolution layer of the i-th block in the VGG16 network, and W_{i,j}, H_{i,j} and G_{i,j} are the dimensions of the feature maps in the VGG16 network; during training the brightness of different image regions makes the enhancement uneven, and a region loss function is introduced, as shown in formula (5):
where E_l and G_l are the low-light regions of the enhanced image and the real image, E_h and G_h are the remaining regions, and a larger weight value is assigned to the low-light regions;
4) Training of the feature fusion module FFM-ECA: the feature fusion module FFM-ECA training comprises:
4-1) the low-illumination image features and the enhanced image features are used as the dual-channel input of the FFM-ECA module; the input features X_0 ∈ R^{C×H×W} are spliced, passed through a 3*3 convolution activated by ReLU, and then the number of channels is adjusted by a 1*1 convolution to obtain the feature map X_1 ∈ R^{C×H×W};
4-2) global average pooling is then applied to obtain an aggregated feature map M_C of size 1×1×C, and a fast one-dimensional convolution with kernel size k is applied to M_C, where k is adaptively determined by a mapping of the channel dimension C, as shown in formula (6):
k = |log_2(C)/γ + b/γ|_odd (6),
where k is the convolution kernel size, C is the number of channels, γ and b are used to change the ratio between the number of channels C and the kernel size, and |·|_odd denotes taking the nearest odd number;
4-3) obtaining the weight ω of each channel through the Sigmoid activation function, as shown in formula (7):
ω = σ(C1D_k(X_1)) (7),
where ω represents the weight of each channel, σ represents the Sigmoid activation function, and C1D represents the one-dimensional convolution;
4-4) the weights are multiplied with the features of the corresponding channels to obtain the output, as shown in formula (8):
X_2 = δ(ω ⊗ X_1) (8),
wherein δ represents the ReLU activation function and ⊗ denotes channel-wise multiplication;
4-5) finally, the features obtained in step 4-4) are fused with the features of the enhanced image after one 3×3 convolution and one 1×1 convolution (a sketch of this fusion module is given below);
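A minimal PyTorch sketch of the FFM-ECA fusion of steps 4-1) to 4-5) follows; class and parameter names are illustrative, the final fusion with the enhanced-image branch is assumed to be an element-wise addition, and γ = 2, b = 1 are assumed values in formula (6).

```python
import math
import torch
import torch.nn as nn

def eca_kernel_size(channels, gamma=2, b=1):
    # Formula (6): k = |log2(C)/gamma + b/gamma|_odd
    k = int(abs(math.log2(channels) / gamma + b / gamma))
    return k if k % 2 == 1 else k + 1

class FFMECA(nn.Module):
    """Sketch of the dual-input feature fusion module with efficient channel attention."""
    def __init__(self, channels):
        super().__init__()
        # Step 4-1): splice the two inputs, 3x3 conv + ReLU, then 1x1 conv to restore C channels.
        self.reduce = nn.Sequential(
            nn.Conv2d(2 * channels, channels, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 1))
        # Step 4-2): fast 1-D convolution across channels with adaptive kernel size k.
        k = eca_kernel_size(channels)
        self.conv1d = nn.Conv1d(1, 1, kernel_size=k, padding=k // 2, bias=False)
        # Step 4-5): enhanced-image branch, one 3x3 conv and one 1x1 conv.
        self.enh_branch = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 1))

    def forward(self, low_feat, enh_feat):
        x1 = self.reduce(torch.cat([low_feat, enh_feat], dim=1))    # X_1, shape (B, C, H, W)
        m = x1.mean(dim=(2, 3))                                     # global average pooling, 1x1xC
        w = torch.sigmoid(self.conv1d(m.unsqueeze(1))).squeeze(1)   # formula (7): channel weights
        out = torch.relu(x1 * w[:, :, None, None])                  # formula (8): reweight channels
        return out + self.enh_branch(enh_feat)                      # step 4-5): assumed additive fusion

# Usage: FFMECA(256)(torch.randn(1, 256, 80, 80), torch.randn(1, 256, 80, 80))
```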
5) Training a target detection network: comprising the following steps:
5-1) uniformly scaling the width and height of the dataset image to 640 x 640 pixels;
5-2) inputting the pictures into a backbone network for feature extraction to obtain feature maps with sizes of 80×80×512, 40×40×1024 and 20×20×1024, denoted feature1, feature2 and feature3 respectively;
5-3) sending the feature maps obtained in step 5-2) to a detection neck for feature fusion to obtain feature maps with sizes of 20×20×512, 40×40×256 and 80×80×128, denoted P5_out, P4_out and P3_out respectively; the deepest feature layer feature3 is first processed by an SPPCSPC-CA module and denoted P5; P5 is convolved once by 1×1 and then upsampled, combined with the features of feature2 after a 1×1 convolution, and passed through an ELAN-CA module for feature extraction to give P4; P4 is convolved by 1×1, upsampled, combined with the features of feature1 after a 1×1 convolution, and passed through an ELAN-CA module to obtain P3_out; P3_out is downsampled once, spliced with P4, and passed through an ELAN-CA module to obtain P4_out; P4_out is downsampled once, spliced with P5, and passed through an ELAN-CA module to obtain P5_out (the fusion order is summarized in the sketch following this step);
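The fusion order of step 5-3) can be summarized by the following PyTorch-style sketch; the dictionary m of sub-modules (SPPCSPC-CA, ELAN-CA blocks, 1×1 convolutions, up/down-sampling) and its key names are assumptions used only to make the data flow explicit.

```python
import torch

def neck_forward(feature1, feature2, feature3, m):
    """Sketch of the top-down / bottom-up fusion of step 5-3); `m` maps assumed key names to modules."""
    p5 = m["sppcspc_ca"](feature3)                                           # 20x20x512
    x = m["up"](m["conv_p5"](p5))                                            # 1x1 conv, then upsample
    p4 = m["elan_ca_p4"](torch.cat([x, m["conv_f2"](feature2)], dim=1))      # merge with feature2
    x = m["up"](m["conv_p4"](p4))                                            # 1x1 conv, then upsample
    p3_out = m["elan_ca_p3"](torch.cat([x, m["conv_f1"](feature1)], dim=1))  # 80x80x128
    p4_out = m["elan_ca_p4o"](torch.cat([m["down"](p3_out), p4], dim=1))     # 40x40x256
    p5_out = m["elan_ca_p5o"](torch.cat([m["down"](p4_out), p5], dim=1))     # 20x20x512
    return p3_out, p4_out, p5_out
```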
5-4) fusing the three feature maps obtained in step 5-3) with the features obtained from the enhanced image through the backbone network and the detection neck, and sending the resulting three feature maps into three detection heads, which respectively predict the confidence, the category and the bounding box of the object;
5-5) screening the prediction boxes by filtering out those with low target confidence, then performing non-maximum suppression and selecting the bounding box with the highest confidence as the final detection result, wherein the total loss of the target detection network is shown in formula (9):
L_total = λ_1·L_box + λ_2·L_obj + λ_3·L_cls (9),
wherein λ_1, λ_2 and λ_3 are loss weight balance coefficients, and L_box, L_obj and L_cls are the bounding-box regression loss, the confidence loss and the classification loss, respectively; the confidence loss and the classification loss adopt BCE loss, and the bounding-box regression loss adopts EIoU; the SPPCSPC-CA module is composed of two branches: the first branch changes the number of channels through a convolution with a 1×1 kernel; the second branch changes the number of channels through a convolution with a 1×1 kernel, passes through a CA attention module, then performs maximum pooling operations with four kernel sizes of 1×1, 5×5, 9×9 and 13×13, splices the four features together, adjusts the number of channels through a convolution with a 1×1 kernel, and then extracts features through a convolution with a 3×3 kernel and stride 1; the features obtained by the two branches are spliced together, and the number of channels is adjusted by one more 1×1 convolution (a minimal sketch of this block is given below);
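A minimal PyTorch sketch of the SPPCSPC-CA block described above; the internal channel split (c_out // 2) is an assumption, and the coordinate attention module is passed in as a parameter (an identity module can be used as a placeholder).

```python
import torch
import torch.nn as nn

class SPPCSPCCA(nn.Module):
    """Sketch of the SPPCSPC-CA block: two branches, CA attention on the second branch,
    parallel max pooling at kernel sizes 1, 5, 9 and 13, and 1x1 / 3x3 merge convolutions."""
    def __init__(self, c_in, c_out, coord_att):
        super().__init__()
        c_mid = c_out // 2                                               # assumed internal width
        self.branch1 = nn.Conv2d(c_in, c_mid, 1)                         # first branch: 1x1 conv only
        self.pre = nn.Sequential(nn.Conv2d(c_in, c_mid, 1), coord_att)   # second branch: 1x1 conv + CA
        self.pools = nn.ModuleList([nn.MaxPool2d(k, stride=1, padding=k // 2) for k in (1, 5, 9, 13)])
        self.post = nn.Sequential(nn.Conv2d(4 * c_mid, c_mid, 1),        # splice pooled features, adjust channels
                                  nn.Conv2d(c_mid, c_mid, 3, padding=1)) # 3x3 conv, stride 1
        self.out = nn.Conv2d(2 * c_mid, c_out, 1)                        # final 1x1 conv over both branches

    def forward(self, x):
        y1 = self.branch1(x)
        y2 = self.pre(x)
        y2 = self.post(torch.cat([p(y2) for p in self.pools], dim=1))
        return self.out(torch.cat([y1, y2], dim=1))

# Usage with a placeholder attention module:
#   SPPCSPCCA(1024, 512, nn.Identity())(torch.randn(1, 1024, 20, 20))  # -> (1, 512, 20, 20)
```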
the ELAN-CA module consists of two branches: the first branch adjusts the number of channels through a convolution with a 1×1 kernel; the second branch first adjusts the number of channels through a convolution with a 1×1 kernel, then performs feature extraction through four convolutions with 3×3 kernels and stride 1 and a CA attention module, and finally the four features are spliced together by a Concat operation; a CA attention mechanism is introduced into the backbone network and the detection neck: for any input feature X, each channel is first encoded along the horizontal and vertical directions by pooling kernels of size (H, 1) and (1, W), so that the generated feature layer carries coordinate information, through the transformations of formula (10) and formula (11):
z_c^h(h) = (1/W)·Σ_{0≤i<W} x_c(h, i) (10),
z_c^w(w) = (1/H)·Σ_{0≤j<H} x_c(j, w) (11),
the two aggregated feature maps are then concatenated and, after the 1×1 convolution transformation function of formula (12), an intermediate mapping encoding the spatial information in the vertical and horizontal directions is obtained:
f = δ(F_1([z^h, z^w])) (12),
wherein [·, ·] represents a concatenation operation along the spatial dimension, δ is a nonlinear activation function, and f ∈ R^{C/r×(H+W)} is an intermediate feature map encoding spatial information in both the horizontal and vertical directions;
f is then divided along the spatial dimension into two independent tensors f^h ∈ R^{C/r×H} and f^w ∈ R^{C/r×W}, and the two 1×1 convolution transformations of formula (13) and formula (14) convert f^h and f^w into g^h and g^w with the same number of channels as the input:
g^h = σ(F_h(f^h)) (13),
g^w = σ(F_w(f^w)) (14),
wherein σ is the Sigmoid function;
finally, g^h and g^w are used as attention weights, and the output is obtained as shown in formula (15) (a sketch of this attention module is given below):
y_c(i, j) = x_c(i, j) × g_c^h(i) × g_c^w(j) (15);
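A minimal PyTorch sketch of the CA attention mechanism of formulas (10)-(15); the reduction ratio r and the BN + ReLU placement in the shared transform F_1 are assumed details. An instance of this module can also be passed as the coord_att argument of the SPPCSPC-CA sketch above.

```python
import torch
import torch.nn as nn

class CoordAtt(nn.Module):
    """Sketch of CA attention: directional average pooling, a shared 1x1 transform F_1,
    and per-direction Sigmoid gates F_h and F_w."""
    def __init__(self, channels, r=16):                    # r is an assumed reduction ratio
        super().__init__()
        mid = max(8, channels // r)
        self.f1 = nn.Sequential(nn.Conv2d(channels, mid, 1),
                                nn.BatchNorm2d(mid), nn.ReLU(inplace=True))
        self.fh = nn.Conv2d(mid, channels, 1)
        self.fw = nn.Conv2d(mid, channels, 1)

    def forward(self, x):
        b, c, h, w = x.shape
        z_h = x.mean(dim=3, keepdim=True)                   # formula (10): (H, 1) pooling
        z_w = x.mean(dim=2, keepdim=True).transpose(2, 3)   # formula (11): (1, W) pooling
        f = self.f1(torch.cat([z_h, z_w], dim=2))           # formula (12): concatenate and transform
        f_h, f_w = torch.split(f, [h, w], dim=2)            # split back into the two directions
        g_h = torch.sigmoid(self.fh(f_h))                   # formula (13)
        g_w = torch.sigmoid(self.fw(f_w.transpose(2, 3)))   # formula (14)
        return x * g_h * g_w                                # formula (15): apply the attention weights

# Usage: CoordAtt(256)(torch.randn(1, 256, 80, 80))
```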
6) Training and testing of the low-illumination target detection model: comprising the following steps:
6-1) sending the low-illumination image into the MSFAF-Net network trained in the steps 1) -4) for enhancement;
6-2) taking the original low-illumination image and the enhanced image obtained in the step 6-1) as input of a target detection network at the same time to obtain a detection result;
6-3) visualizing the detection result.
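Steps 6-1) and 6-2) amount to the following inference sketch, assuming YOLO-style prediction tensors whose fifth column holds the object confidence; enhancer and detector stand for the networks trained in the preceding steps, and the confidence threshold is an assumed value.

```python
import torch

@torch.no_grad()
def detect_low_light(image, enhancer, detector, conf_thresh=0.25):
    """Steps 6-1) and 6-2): enhance the low-illumination image, then run dual-input detection."""
    enhanced = enhancer(image)             # 6-1): MSFAF-Net enhancement
    preds = detector(image, enhanced)      # 6-2): original + enhanced image as joint input
    keep = preds[..., 4] > conf_thresh     # 5-5): drop low-confidence prediction boxes
    return preds[keep]                     # non-maximum suppression would follow before 6-3)
```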
2. The MSFAF-Net-based low-illumination target detection method of claim 1, wherein the shallow feature extraction module SFEM of step 2-1) comprises:
2-1-1) the SFEM module consists of two branches; the first branch consists of 3 groups of convolution layers, each activated by a ReLU and followed by a batch normalization layer: the dimension is raised by a convolution with a 1×1 kernel, features are extracted by a convolution with a 3×3 kernel and stride 1, and the number of channels is finally adjusted by a convolution with a 1×1 kernel;
2-1-2) in the other branch, the number of channels is adjusted by a convolution with a 1×1 kernel, and the shallow features of the two branches are finally fused by a Concat operation.
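A minimal PyTorch sketch of the two-branch SFEM of claim 2; the intermediate and output channel widths are assumed values, and the output has twice c_out channels because the two branches are concatenated.

```python
import torch
import torch.nn as nn

class SFEM(nn.Module):
    """Sketch of the shallow feature extraction module."""
    def __init__(self, c_in=3, c_mid=32, c_out=32):          # channel widths are assumed
        super().__init__()
        def cbr(ci, co, k):                                  # conv + BN + ReLU block
            return nn.Sequential(nn.Conv2d(ci, co, k, padding=k // 2, bias=False),
                                 nn.BatchNorm2d(co), nn.ReLU(inplace=True))
        self.branch1 = nn.Sequential(cbr(c_in, c_mid, 1),    # 2-1-1): raise the dimension
                                     cbr(c_mid, c_mid, 3),   # extract features, stride 1
                                     cbr(c_mid, c_out, 1))   # adjust the number of channels
        self.branch2 = cbr(c_in, c_out, 1)                   # 2-1-2): 1x1 branch

    def forward(self, x):
        # Concat fusion of the shallow features from both branches (output has 2 * c_out channels).
        return torch.cat([self.branch1(x), self.branch2(x)], dim=1)
```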
3. The MSFAF-Net-based low-illumination target detection method of claim 1, wherein the deep feature extraction module DFEM of step 2-2) comprises:
2-2-1) the DFEM is formed by stacking three residual dense blocks RDB;
2-2-2) each RDB comprises three parts: dense connections, local feature fusion and feature fusion; each RDB consists of 3 convolution layers activated by ReLU, each convolution layer uses a 3×3 kernel followed by BN, and local feature fusion is performed by a convolution with a 1×1 kernel.
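A minimal PyTorch sketch of one residual dense block and the DFEM stack of claim 3; the growth rate and the residual connection over the fused features are assumptions consistent with the usual RDB design.

```python
import torch
import torch.nn as nn

class RDB(nn.Module):
    """Sketch of one residual dense block: three densely connected 3x3 conv layers with
    BN + ReLU, 1x1 local feature fusion, and an assumed residual connection."""
    def __init__(self, channels, growth=32):                  # growth rate is an assumed value
        super().__init__()
        def layer(cin):
            return nn.Sequential(nn.Conv2d(cin, growth, 3, padding=1, bias=False),
                                 nn.BatchNorm2d(growth), nn.ReLU(inplace=True))
        self.l1 = layer(channels)
        self.l2 = layer(channels + growth)
        self.l3 = layer(channels + 2 * growth)
        self.lff = nn.Conv2d(channels + 3 * growth, channels, 1)   # local feature fusion (1x1)

    def forward(self, x):
        d1 = self.l1(x)
        d2 = self.l2(torch.cat([x, d1], dim=1))                    # dense connections
        d3 = self.l3(torch.cat([x, d1, d2], dim=1))
        return x + self.lff(torch.cat([x, d1, d2, d3], dim=1))     # fused features + residual

class DFEM(nn.Sequential):
    """2-2-1): the DFEM stacks three RDBs."""
    def __init__(self, channels):
        super().__init__(RDB(channels), RDB(channels), RDB(channels))
```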
4. The MSFAF-Net-based low-illumination target detection method of claim 1, wherein the feature enhancement network module FEN of step 3-1) comprises:
3-1-1) the input is first downsampled to obtain feature maps with sizes of 320×320×64, 160×160×128 and 80×80×256; the downsampling process comprises three groups of stacked residual blocks, each residual block comprising 2 convolution layers with a 3×3 kernel and stride 2, each followed by a batch normalization layer and a ReLU activation function; the feature map after each downsampling is reduced to half of its original size, and channel weights are computed by a CA module to adaptively strengthen different channels; for the first downsampling, the number of input channels is 3 and the number of output channels is 64; for the second downsampling, the number of input channels is 64 and the number of output channels is 128; for the third downsampling, the number of input channels is 128 and the number of output channels is 256;
3-1-2) context information at multiple scales is obtained by dilated convolutions with dilation rates of 1, 2 and 5 connected in parallel; in the first branch, the kernel size is 3×3, the dilation rate is 1, the padding is 1, and the numbers of input and output feature map channels are 256; in the second branch, the numbers of input and output feature map channels are 256, the kernel size is 3×3, the dilation rate is 2, and the padding is 2; in the third branch, the numbers of input and output feature map channels are 256, the kernel size is 3×3, the dilation rate is 5, and the padding is 5; in addition, each convolution is followed by a batch normalization layer and a ReLU activation layer;
3-1-3) the feature maps obtained in step 3-1-2) are spliced along the channel dimension, then a convolution with a 1×1 kernel reduces the channel dimension from 768 input channels to 256 output channels, and the obtained feature map is finally passed to the next module;
3-1-4) the feature map obtained in step 3-1-3) is up-sampled to obtain feature maps with sizes of 160×160×128, 320×320×64 and 640×640×3; the up-sampling process comprises three groups of stacked residual blocks, each residual block comprising 2 deconvolution layers with a 3×3 kernel and stride 2, each followed by a batch normalization layer and a ReLU activation function; the size of the feature map is doubled after each up-sampling, forming a symmetric structure, and the downsampled features are introduced into the corresponding up-sampling modules through skip connections; for the first up-sampling, the number of input channels is 256 and the number of output channels is 128; for the second up-sampling, the number of input channels is 128 and the number of output channels is 64; for the third up-sampling, the number of input channels is 64 and the number of output channels is 3;
3-1-5) a final 1×1 convolution operation is applied to obtain the enhanced features.
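A minimal PyTorch sketch of the parallel dilated-convolution context block of steps 3-1-2) and 3-1-3); the surrounding encoder-decoder of steps 3-1-1), 3-1-4) and 3-1-5) is summarized in the trailing comment rather than implemented.

```python
import torch
import torch.nn as nn

class DilatedContextBlock(nn.Module):
    """Sketch of steps 3-1-2) and 3-1-3): three parallel 3x3 dilated convolutions with
    dilation rates 1, 2 and 5 (padding 1, 2, 5), each with BN + ReLU, concatenated to
    768 channels and reduced back to 256 channels by a 1x1 convolution."""
    def __init__(self, channels=256):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Sequential(nn.Conv2d(channels, channels, 3, padding=d, dilation=d, bias=False),
                          nn.BatchNorm2d(channels), nn.ReLU(inplace=True))
            for d in (1, 2, 5)])
        self.fuse = nn.Conv2d(3 * channels, channels, 1)   # 768 -> 256

    def forward(self, x):
        return self.fuse(torch.cat([b(x) for b in self.branches], dim=1))

# The surrounding FEN (steps 3-1-1, 3-1-4, 3-1-5) is a symmetric encoder-decoder:
# three stride-2 residual downsampling blocks with CA attention (3 -> 64 -> 128 -> 256),
# this context block, three stride-2 deconvolution blocks with skip connections
# (256 -> 128 -> 64 -> 3), and a final 1x1 convolution.
```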
Priority Applications (1)

| Application Number | Priority Date | Filing Date | Title |
| --- | --- | --- | --- |
| CN202311333750.6A CN117456330A (en) | 2023-10-16 | 2023-10-16 | MSFAF-Net-based low-illumination target detection method |
Publications (1)

| Publication Number | Publication Date |
| --- | --- |
| CN117456330A (en) | 2024-01-26 |
Cited By (2)

| Publication number | Priority date | Publication date | Assignee | Title |
| --- | --- | --- | --- | --- |
| CN117952985A (en) * | 2024-03-27 | 2024-04-30 | 江西师范大学 | Image data processing method based on lifting information multiplexing under defect detection scene |
| CN118212240A (en) * | 2024-05-22 | 2024-06-18 | 山东华德重工机械有限公司 | Automobile gear production defect detection method |
Legal Events

| Date | Code | Title | Description |
| --- | --- | --- | --- |
|  | PB01 | Publication |  |
|  | SE01 | Entry into force of request for substantive examination |  |