CN116310359A - Intelligent detection method for photoelectric imaging weak and small target in complex environment - Google Patents
Intelligent detection method for photoelectric imaging weak and small target in complex environment
- Publication number
- CN116310359A CN116310359A CN202310196144.8A CN202310196144A CN116310359A CN 116310359 A CN116310359 A CN 116310359A CN 202310196144 A CN202310196144 A CN 202310196144A CN 116310359 A CN116310359 A CN 116310359A
- Authority
- CN
- China
- Prior art keywords
- feature map
- scale
- feature
- layer
- residual error
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/20—Image preprocessing
- G06V10/32—Normalisation of the pattern dimensions
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/10—Image acquisition
- G06V10/16—Image acquisition using multiple overlapping images; Image stitching
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/764—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/77—Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
- G06V10/80—Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
- G06V10/806—Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/82—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V2201/00—Indexing scheme relating to image or video recognition or understanding
- G06V2201/07—Target detection
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Evolutionary Computation (AREA)
- Multimedia (AREA)
- Health & Medical Sciences (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Computing Systems (AREA)
- Artificial Intelligence (AREA)
- General Health & Medical Sciences (AREA)
- Software Systems (AREA)
- Databases & Information Systems (AREA)
- Medical Informatics (AREA)
- Biomedical Technology (AREA)
- Life Sciences & Earth Sciences (AREA)
- Biophysics (AREA)
- Computational Linguistics (AREA)
- Data Mining & Analysis (AREA)
- Molecular Biology (AREA)
- General Engineering & Computer Science (AREA)
- Mathematical Physics (AREA)
- Image Analysis (AREA)
Abstract
The invention discloses an intelligent detection method for weak and small targets in photoelectric imaging under a complex environment, and belongs to the technical field of target detection. The method comprises the following steps: performing size normalization on the image to be detected, performing group convolution, performing downsampling, and performing multi-scale feature extraction on the second feature map through four stacked residual module layers; performing multi-scale fusion on the output feature maps of the second, third and fourth residual module layers in a ladder-type feature fusion structure to obtain a fused feature map; and finally performing weak and small target detection on the fused feature map through a classification detection determiner to obtain the detection result for the weak and small target. The invention improves the detection network so that the network model carries richer feature information. The invention effectively improves detection accuracy, reduces the false detection rate, and reduces the complexity of the model.
Description
Technical Field
The invention belongs to the technical field of target detection, and particularly relates to an intelligent detection method for weak and small targets in photoelectric imaging under a complex environment.
Background
In various application scenarios, algorithm software is needed to imitate the human eye in intelligently detecting, identifying and classifying targets in image files, and to handle more complex tasks on that basis. Over the long course of biological evolution, the visual system that takes the human eye as information input and the human brain as information processor has become extremely capable, with high processing speed and strong resistance to interference. Early computer vision developed slowly, and reaching the level of the human visual system was very difficult. With the continuous development of machine learning research, after deep learning methods were introduced into computer vision, computers have achieved better performance than the human eye in many vision tasks.
Classification, detection and segmentation constitute the basic tasks of computer vision, and target detection and recognition, as an advanced visual task, has a long research history. The detection and recognition task must not only determine whether the picture contains the object, but also give the object's detailed coordinate information in the form of a rectangular prediction box. After the position of the weak and small target is detected, the target category is classified and judged and the result is output. With the continuous progress of computer vision theory and technology in recent years, weak and small target detection in complex environments has gradually become a new research hotspot in the field. However, existing neural network algorithms still suffer from false alarms, missed detections and similar problems when detecting weak and small targets in complex environments.
Traditional target detection algorithms mainly rely on manually designed features and a classifier applied under a sliding window. Traditional algorithms fall into two main categories: those based on spatial filtering and those based on the human visual system. However, traditional methods consume a great deal of expert knowledge and labor to design templates or detection rules, and they also suffer from a large computational load, poor generalization, and high false-alarm and miss rates in complex environments. With the advent of deep learning, weak and small target detection has increasingly migrated to deep neural networks because of their strong feature extraction and information abstraction capabilities. Convolutional neural networks in particular achieve low false-alarm and miss rates for weak and small target detection in complex environments, so more and more researchers use deep learning methods for this task. Convolutional neural network detectors for target detection can be divided by stage into two-stage networks and single-stage networks. Representative two-stage target detectors are RCNN (Regions with CNN features) and the subsequently optimized Faster RCNN. Representative single-stage object detectors are SSD (Single Shot MultiBox Detector) and YOLO. However, for weak and small targets in complex environments, existing target detection algorithms still lack generalization ability, have a high miss rate, and recognize targets poorly.
Disclosure of Invention
Aiming at the technical problem that detection accuracy is low in the task of detecting weak and small targets in photoelectric imaging under complex environments, owing to small target size, weak feature information, difficult identification and complex backgrounds, the invention provides an intelligent detection method for weak and small targets in photoelectric imaging under complex environments.
The invention adopts the following technical scheme:
the intelligent detection method for the photoelectric imaging weak and small target in the complex environment comprises the following steps:
step S1: performing size normalization processing on the image to be detected to obtain the expected image size;
step S2: performing group convolution on the image processed in step S1 with a depthwise convolution whose kernel size is 5×5 to obtain a first feature map;
step S3: performing a first downsampling of the first feature map with a convolution whose stride is 2 and kernel size is 3×3 to obtain a second feature map;
step S4: performing multi-scale feature extraction on the second feature map through four stacked residual module layers;
the residual module layers are arranged as follows: the second feature map generated in step S3 serves as the input feature map of the first residual module layer, and the input feature map of each of the second, third and fourth residual module layers is the output feature map of the previous residual module layer; that is, the output feature map of the preceding layer serves as the input feature map of the following layer;
each residual module layer has the same network structure and adopts the lightweight convolution structure ShuffleNetV2 as its basic convolution module; the input feature map of a residual module layer passes through the basic convolution module, and the result is spliced by channel with the input feature map of the current residual module layer to obtain the output feature map of the current residual module layer;
step S5: performing multi-scale fusion on the output feature maps of the second, third and fourth residual module layers in a ladder-type feature fusion structure to obtain a fused feature map;
the ladder-type feature fusion structure comprises the following steps:
defining the output feature maps of the second, third and fourth residual module layers as the first-scale, second-scale and third-scale feature maps respectively, and defining their feature map dimensions as the first, second and third scales respectively;
converting the feature map dimension of the third-scale feature map into the second scale and then channel-splicing it with the second-scale feature map to obtain a first splicing result; converting the feature map dimension of the first splicing result into the second scale to obtain a first splicing result at the second scale;
converting the feature map dimension of the second-scale feature map into the first scale and then channel-splicing it with the first-scale feature map to obtain a second splicing result; converting the feature map dimension of the second splicing result into the first scale to obtain a second splicing result at the first scale;
converting the feature map dimension of the first splicing result at the second scale into the first scale, then channel-splicing it with the second splicing result at the first scale to obtain a third splicing result, and converting the feature map dimension of the third splicing result into the first scale to obtain a third splicing result at the first scale;
performing a global average pooling operation on the third-scale feature map, converting the feature map dimension of the pooling result into the first scale, and fusing it with the third splicing result at the first scale to obtain the fused feature map;
step S6: performing weak and small target detection on the fused feature map through a classification detection determiner to obtain the detection result for the weak and small target;
a weak and small target is a target whose size is smaller than or equal to a specified size.
Further, step S6 includes:
step S601, extracting candidate boxes from the fused feature map, and obtaining a plurality of final candidate boxes after screening and non-maximum suppression of the extracted candidate boxes;
step S602, extracting the features of each final candidate box from the fused feature map based on the final candidate boxes obtained in step S601, and pooling the candidate box features to obtain candidate box features of the expected feature size;
step S603, performing weak and small target detection on the candidate box features with the classification detection determiner.
Further, the classification detection determiner consists of a fully connected layer and a softmax function layer.
Further, in step S1, the expected image size is 512×512 pixels.
The technical scheme provided by the invention has at least the following beneficial effects:
Aiming at the technical problem of low accuracy in detecting weak and small targets in photoelectric imaging under complex environments, caused by small target size, weak feature information, difficult identification and complex backgrounds, the invention improves the detection network so that the network model carries richer feature information. The invention effectively improves detection accuracy, reduces the false detection rate and reduces the complexity of the model.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings required for the description of the embodiments will be briefly described below, and it is apparent that the drawings in the following description are only some embodiments of the present invention, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a schematic diagram showing the fusion of U-shaped structural features in an embodiment of the present invention;
FIG. 2 is a schematic diagram showing step-like feature fusion in accordance with an embodiment of the present invention;
FIG. 3 is a schematic diagram of a residual module structure according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of a co-scale residual module according to an embodiment of the present invention;
fig. 5 is a schematic diagram of a process of an intelligent detection method for a small target in a complex environment photoelectric imaging according to an embodiment of the present invention.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the present invention more apparent, the embodiments of the present invention will be described in further detail with reference to the accompanying drawings.
The problem of weak and small target detection has long been an important research problem in the fields of computer vision and artificial intelligence. It is not only a precondition for advanced visual tasks but is also widely applicable in real scenarios such as satellite remote sensing, military reconnaissance and airport scene detection. With the deepening of deep learning research, especially research on neural network algorithms, neural-network-based methods are increasingly applied to the detection of weak and small targets in complex environments. Aiming at the characteristics of weak and small targets in complex environments, namely weak feature information that is easily lost and strong image interference, the invention improves the feature extraction and fusion mode and introduces the residual idea, thereby improving detection efficiency.
The task of recognizing weak and small targets in photoelectric imaging under complex environments is common in social production and daily life, for example small objects such as nuts, screws, nails and fuses on an airport runway. If such small problems can be detected intelligently from surveillance images, the workload of airport staff can be reduced effectively. For automatic driving, intelligent detection from surveillance images makes it possible to judge road congestion and the probability of traffic accidents. For materials produced in factories, foreign objects and flaws can be identified with a weak and small target detection algorithm designed for complex environments. For military reconnaissance, if enemy deployment and movements can be detected accurately from satellite remote sensing images, combat efficiency can be greatly improved and casualties reduced. For public safety management, human behavior can be analyzed from surveillance-view data to give early warning of possible dangerous behaviors such as stampedes and violent attacks, improving public safety. Intelligent detection and recognition of weak and small targets is also an important support for tracking and predicting the routes of criminal suspects.
A complex environment is one in which difficulties such as target overlap and target occlusion exist. Photoelectric imaging refers to the electronic image information acquired by a photoelectric imaging device. A small target is a target whose size is smaller than a specified size, for example a target of no more than 9×9 pixels in a 256×256-pixel image.
As a possible implementation, the embodiment of the invention takes Faster RCNN as the basic framework and improves it with a fusion method that combines same-scale and multi-scale features, that is, feature maps of the same size at the same level are fused on the convolution layers using a residual structure and channel splicing, thereby improving detection accuracy. The lightweight convolution module ShuffleNetV2 replaces the original convolution module to improve algorithm efficiency.
In the Faster RCNN network structure, features are first extracted from the picture to be detected by a convolutional network to obtain a feature map. The convolutional network contains convolution layers (conv layers), ReLU activation layers and pooling layers; each convolution layer is followed by a ReLU activation layer, giving 13 convolution layers, 13 ReLU activation layers and 4 pooling layers in total. The first two pooling layers are each placed after two convolution + ReLU layers, and the last two pooling layers are each placed after three convolution + ReLU layers. According to the convolution and pooling formulas, the feature map size is unchanged after each conv + ReLU layer, and after each pooling layer the feature map width and height become half of what they were. For example, the feature map this network generates from an M×N picture has size (M/16)×(N/16).
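For illustration only, the following is a minimal sketch of such a backbone in PyTorch (the patent does not name a framework); the channel widths follow the standard VGG-16-style configuration and are an assumption here. It reproduces the stated behaviour: 13 convolution + ReLU layers, 4 pooling layers, and an M×N input reduced to (M/16)×(N/16).

```python
import torch
import torch.nn as nn

def vgg_like_backbone():
    # (number of conv+ReLU layers, output channels) per stage; a pooling layer closes each stage
    cfg = [(2, 64), (2, 128), (3, 256), (3, 512)]
    layers, in_ch = [], 3
    for n_convs, out_ch in cfg:
        for _ in range(n_convs):
            layers += [nn.Conv2d(in_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True)]
            in_ch = out_ch
        layers.append(nn.MaxPool2d(2))           # each pooling halves the width and height
    for _ in range(3):                           # three final conv+ReLU layers, no pooling (13 convs total)
        layers += [nn.Conv2d(in_ch, 512, 3, padding=1), nn.ReLU(inplace=True)]
    return nn.Sequential(*layers)

x = torch.randn(1, 3, 512, 512)                  # an M x N = 512 x 512 picture
print(vgg_like_backbone()(x).shape)              # torch.Size([1, 512, 32, 32]), i.e. (M/16) x (N/16)
```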
The resulting feature map is then input to the RPN (Region Proposal Network) to extract candidate boxes, which is the main difference between two-stage and single-stage algorithms. After the feature map passes through the RPN, candidate boxes are obtained and classified by SVM, and a number of highest-scoring candidate boxes (e.g. 2000) are retained through screening and non-maximum suppression.
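A hedged sketch of this proposal screening step is given below; torchvision's nms and the 0.7 IoU threshold are illustrative assumptions, not details fixed by the patent.

```python
import torch
from torchvision.ops import nms

def select_proposals(boxes: torch.Tensor, scores: torch.Tensor,
                     iou_thresh: float = 0.7, top_k: int = 2000):
    """boxes: (N, 4) candidate boxes as (x1, y1, x2, y2); scores: (N,) objectness scores."""
    keep = nms(boxes, scores, iou_thresh)   # indices of kept boxes, sorted by decreasing score
    keep = keep[:top_k]                     # retain the top_k highest-scoring survivors
    return boxes[keep], scores[keep]
```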
Next, the features of the corresponding candidate boxes are taken from the feature map and pooled so that their sizes match the expected size. ROI pooling has a preset width and height, meaning that every proposal feature is unified into a feature map of that size. In this processing, the pre-selected box coordinates, given on the M×N scale, are mapped back to the (M/16)×(N/16) scale. The region corresponding to each pre-selected box is then divided into a grid of the preset size, each grid cell is max-pooled, and the output vectors are unified.
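The ROI pooling step can be sketched as follows; the 7×7 output size is an illustrative assumption, and spatial_scale=1/16 expresses the mapping from M×N image coordinates back to the (M/16)×(N/16) feature map.

```python
import torch
from torchvision.ops import roi_pool

feat = torch.randn(1, 512, 32, 32)                  # the (M/16) x (N/16) feature map for M = N = 512
rois = torch.tensor([[0., 48., 48., 112., 112.]])   # [batch index, x1, y1, x2, y2] on the M x N image
pooled = roi_pool(feat, rois, output_size=(7, 7), spatial_scale=1.0 / 16)
print(pooled.shape)                                 # torch.Size([1, 512, 7, 7])
```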
Finally, based on the generated multi-dimensional feature vectors, target recognition, classification and prediction-box position regression are completed for the candidate boxes. That is, the fully connected layers and softmax classify all pre-selected boxes into their specific categories (usually multiple categories), and bounding-box regression of the pre-selected boxes yields final boxes with higher positional accuracy.
In multi-scale feature map fusion, the conventional method adopts a top-down U-shaped structure, as shown in fig. 1, where 8s, 16s and 32s denote three feature maps of different scales: the 8s scale is 64×64×128, the 16s scale is 32×32×256 and the 32s scale is 16×16×512, where 128, 256 and 512 are the corresponding channel numbers. That is, the 32s feature map is first transformed into a 32×32×256 feature map and fused with the 16s feature map, the fusion result is transformed into a 64×64×128 feature map and fused with the 8s feature map, and finally a 64×64×128 fusion feature map is obtained. The feature map formed by the deep network lacks sufficient spatial location information for weak and small target identification and therefore needs to be supplemented with feature information from the shallow network. However, this traditional connection structure is rather simple, and the shallower the layer, the less its feature information participates in aggregation, which is not enough to supplement the spatial position information of weak and small targets. Meanwhile, the information abstraction degree of shallow features is low during fusion. In summary, for the problem of weak and small target detection, the multi-scale feature fusion mode of the U-shaped structure needs to be improved.
As shown in fig. 2, the embodiment of the invention provides a ladder-type feature fusion structure, which specifically comprises the following steps:
Step 1, the 32s feature map is upsampled along indication direction (1) and fused with the 16s feature map along indication direction (2) to obtain a new 16s feature map.
Step 2, the 8s feature map is fused, along indication direction (4), with the feature map formed by upsampling the 16s feature map along indication direction (3), to obtain a new 8s feature map. The new 16s feature map obtained through indication direction (2) is then upsampled along indication direction (5) and fused with the 8s feature map of indication direction (6) to obtain a new 8s feature map.
Step 3, a global average pooling operation is performed on the 32s feature map along indication direction (7), and the pooled feature map is expanded as a vector along indication direction (8) to generate an 8s feature map. This 8s feature map is fused with the 8s feature map generated in step 2 to obtain the final feature map, whose size is 64×64×128.
This feature fusion proceeds from deep to shallow and finally outputs a feature map of the expected size. Compared with the U-shaped structure, the fusion mode provided by the invention adds fusion of features between adjacent layers. This nonlinear design makes the feature fusion more thorough, so the generated feature map carries richer features and can represent the complete information of the picture. Meanwhile, the invention also performs global average pooling on the original 32s feature map (16×16×512), expands it as a vector and adds it into the fusion process. This step is computationally simple, yet it enlarges the receptive field through global information enhancement.
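A minimal sketch of this ladder-type fusion is shown below, assuming PyTorch, assuming the "scale conversion" steps are realised with 1×1 convolutions plus bilinear upsampling, and assuming the global-information branch is fused by broadcast addition; the patent does not fix these implementation details. With the 8s/16s/32s sizes given above, the output is the expected 64×64×128 fused feature map.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LadderFusion(nn.Module):
    """Hedged sketch of the ladder-type fusion over 8s, 16s and 32s feature maps."""
    def __init__(self, c1=128, c2=256, c3=512):
        super().__init__()
        self.c3_to_s2 = nn.Conv2d(c3, c2, 1)           # third scale -> second scale channels
        self.splice1_to_s2 = nn.Conv2d(2 * c2, c2, 1)  # first splicing result -> second scale
        self.c2_to_s1 = nn.Conv2d(c2, c1, 1)           # second scale -> first scale channels
        self.splice2_to_s1 = nn.Conv2d(2 * c1, c1, 1)  # second splicing result -> first scale
        self.splice1_to_s1 = nn.Conv2d(c2, c1, 1)      # first splicing result -> first scale channels
        self.splice3_to_s1 = nn.Conv2d(2 * c1, c1, 1)  # third splicing result -> first scale
        self.gap_to_s1 = nn.Conv2d(c3, c1, 1)          # pooled 32s map -> first scale channels

    def forward(self, f1, f2, f3):                     # f1, f2, f3: 8s, 16s, 32s feature maps
        up = lambda x, ref: F.interpolate(x, size=ref.shape[-2:], mode="bilinear",
                                          align_corners=False)
        s1 = self.splice1_to_s2(torch.cat([up(self.c3_to_s2(f3), f2), f2], dim=1))
        s2 = self.splice2_to_s1(torch.cat([up(self.c2_to_s1(f2), f1), f1], dim=1))
        s3 = self.splice3_to_s1(torch.cat([up(self.splice1_to_s1(s1), f1), s2], dim=1))
        g = self.gap_to_s1(F.adaptive_avg_pool2d(f3, 1))   # global average pooling of the 32s map
        return s3 + g                                      # broadcast over all 64 x 64 positions

f1, f2, f3 = (torch.randn(1, c, s, s) for c, s in [(128, 64), (256, 32), (512, 16)])
print(LadderFusion()(f1, f2, f3).shape)                 # torch.Size([1, 128, 64, 64])
```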
To reduce the computational load of the model, the invention selects the lightweight convolution module ShuffleNetV2. ShuffleNetV2 is modified from ShuffleNetV1. ShuffleNetV1 introduced group convolution and the channel rearrangement (channel shuffle) algorithm to optimize the convolutional neural network module, so as to reduce the number of arithmetic operations required during convolution.
The group convolution principle used by ShuffleNetV1 is similar to the depthwise convolution principle. The depthwise convolution algorithm designs a separate convolution kernel for each feature channel. The group convolution algorithm divides the channels of the feature map into several groups according to a set value and then lets a convolution kernel process the features of each group. When the number of channels per group is set to 1, the two algorithms are equivalent.
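In PyTorch terms (an illustrative assumption, not part of the patent), this relationship is captured by the groups argument of a convolution:

```python
import torch.nn as nn

in_ch = 32
grouped = nn.Conv2d(in_ch, in_ch, kernel_size=3, padding=1, groups=4)        # 4 groups of 8 channels each
depthwise = nn.Conv2d(in_ch, in_ch, kernel_size=3, padding=1, groups=in_ch)  # one channel per group, i.e. depthwise
```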
ShuffleNetV2 mainly increases the ratio of arithmetic operations to memory-access operations and improves the parallelism of the model, specifically as follows (a code sketch follows this list):
(1) A channel split is added at the beginning, dividing the input feature channels into two groups, and the subsequent group convolution operation is cancelled.
(2) The element-wise addition operation (adding corresponding feature maps) is replaced with channel splicing.
(3) The channel rearrangement (channel shuffle) operation is moved to after the channel splicing and merged with it.
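The sketch below illustrates these three points on a stride-1 unit; the branch layout (1×1 and depthwise 3×3 convolutions) follows the published ShuffleNetV2 design rather than anything spelled out in the patent.

```python
import torch
import torch.nn as nn

def channel_shuffle(x, groups=2):
    n, c, h, w = x.shape
    return x.view(n, groups, c // groups, h, w).transpose(1, 2).reshape(n, c, h, w)

class ShuffleV2Unit(nn.Module):
    def __init__(self, channels):
        super().__init__()
        half = channels // 2
        self.branch = nn.Sequential(
            nn.Conv2d(half, half, 1), nn.BatchNorm2d(half), nn.ReLU(inplace=True),
            nn.Conv2d(half, half, 3, padding=1, groups=half), nn.BatchNorm2d(half),  # depthwise 3x3
            nn.Conv2d(half, half, 1), nn.BatchNorm2d(half), nn.ReLU(inplace=True),
        )

    def forward(self, x):
        x1, x2 = x.chunk(2, dim=1)                      # (1) channel split at the unit input
        out = torch.cat([x1, self.branch(x2)], dim=1)   # (2) channel splicing instead of addition
        return channel_shuffle(out)                     # (3) channel shuffle after the splice
```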
Compared with other common lightweight convolution modules, ShuffleNetV2 not only runs faster but also achieves better accuracy on ImageNet, while being only slightly inferior to MobileNetV2 in computational cost. Since the processing task of the invention is to improve the detection accuracy of weak and small targets in photoelectric imaging under complex environments, the invention uses ShuffleNetV2 as the lightweight convolution module to complete the algorithm optimization.
Aiming at the gradient vanishing problem in weak and small target detection, the invention designs a co-scale residual module. The module fuses feature maps at the same scale mainly through a residual module and channel splicing. In this way, the gradient vanishing and model degradation problems of the algorithm model are alleviated, and the richness of the extracted features is improved.
The specific structure of the residual module is shown in fig. 3: the input feature map first passes through the basic convolution module with a stride of 1 to obtain the output feature map of the basic convolution module, which is then channel-spliced with the input feature map to obtain the output feature map of the residual module. The same-scale residual-module feature fusion structure combines the ideas of densely connected networks and residual networks. Dense connection combines shallow and deep features, improving the richness of the extracted features. Meanwhile, the residual module improves backward propagation and alleviates, to a certain extent, the gradient vanishing problem in deep convolution. The structure of the co-scale residual module is shown in fig. 4: its input is the generated feature map; the rounded rectangle is the residual module; the intersections of the arrows represent channel splicing of feature maps, and the number of feature map channels is kept unchanged by convolution-kernel compression. The feature map first passes through the residual module, and then the feature map and the output of the residual module are channel-spliced. Through the residual structure and channel splicing, this design efficiently accomplishes feature learning, feature reuse and feature selection, and it can increase the number of channels while reducing the feature map size so as to keep the features rich and diverse.
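A minimal sketch of the residual module of fig. 3 follows; the basic convolution module is abstracted as an arbitrary stride-1 module (for example the ShuffleNetV2-style unit above), and the 1×1 compression convolution that keeps the channel count unchanged is an assumed realisation of the convolution-kernel compression mentioned for fig. 4.

```python
import torch
import torch.nn as nn

class CoScaleResidual(nn.Module):
    def __init__(self, channels: int, basic_conv: nn.Module):
        super().__init__()
        self.basic_conv = basic_conv                          # stride-1 basic convolution module
        self.compress = nn.Conv2d(2 * channels, channels, 1)  # keep the channel count unchanged

    def forward(self, x):
        spliced = torch.cat([x, self.basic_conv(x)], dim=1)   # channel splicing with the input
        return self.compress(spliced)
```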
For weak and small targets in complex environments, such as small objects on the ground, the ground objects are relatively small; in order to preserve the original features of small ground objects as much as possible, the embodiment of the invention may use a relatively large size of 512×512 pixels as the input picture size. The network structure acquires feature maps with residual module layers that embody the residual-module and channel-splicing ideas, and finally fuses the feature maps of different sizes generated by the different layers with a multi-scale fusion layer. Meanwhile, to enlarge the receptive field and reduce computation, the image is downsampled repeatedly as it passes through the convolutional network and the feature map size decreases gradually; to reduce the loss of features in this process and keep them rich and diverse, each downsampling halves the feature map size and doubles the number of channels.
Referring to fig. 5, in the embodiment of the invention, the intelligent detection method for weak and small targets in photoelectric imaging under a complex environment is implemented through the following steps:
step S1: the picture to be predicted is normalized so that the input size can be unified, and in this embodiment, the input size of the picture is 512×512 pixels.
Step S2: for the input pictures, first, the input 3-channel (RGB three-channel) packets are convolved using the depth convolution of 5*5, resulting in a feature map of 512×512×3. The large-size convolution kernel is used in this step to obtain a relatively large receptive field, and the group convolution is used to reduce the calculation amount.
Step S3: the first downsampling was performed by a step size of 2 of 3*3 convolutions, resulting in a 256 x 32 feature map. This step is convolved with conventional 3*3 because it is the most commonly used convolution kernel size, which allows efficient feature extraction with relatively little computational effort, with increased receptive fields. The convolutional layer is added in the shallow layer of the network, so that the quality of the network extracted features can be ensured, and the stability of the network is improved.
Step S4: the characteristic diagrams with different scales are mainly obtained through a residual error module layer (4 layers). The module layer mainly uses a lightweight convolution structure in the ShuffleNet2 as a basic convolution module, and simultaneously adds a residual module and a channel splicing idea.
Step S5: and carrying out feature fusion on the obtained multi-scale feature map by using the ladder-type feature fusion method provided by the invention, and finally obtaining the feature map with the size of 64 x 128.
Step S6: and detecting and classifying the finally obtained feature map by using a classification detection determiner.
Aiming at the problem that the weak feature information of small-sized targets is easily lost in deep networks, shallow-network feature information is supplemented through feature fusion, thereby improving the detection accuracy of weak and small targets.
To further verify the detection performance of the method of the invention, detection performance was analyzed on the VisDrone dataset and the VEDAI dataset. The VisDrone dataset contains six thousand training images, five hundred validation images and one thousand test images. It consists of pictures of people and vehicles collected from the perspective of unmanned aerial vehicles, with ten target categories. The targets in this dataset are densely distributed and the environment is complex. The VEDAI dataset consists of satellite aerial images with very small target sizes and relatively complex backgrounds. Its pictures are 1024×1024 pixels, and it contains one thousand training images, eight hundred validation images and one hundred test images. The detection targets are 11 kinds of vehicles; the image backgrounds are rich, covering vehicle targets in residential areas, city streets, expressways and other scenes. Because the original pictures are large, the targets are extremely small under the satellite viewing angle, making detection difficult. Most target sizes in both the VEDAI and VisDrone datasets are 20×20 pixels or less, and few targets exceed 60×60 pixels, so both datasets as a whole meet the definition of weak and small targets. Both datasets also contain target occlusion and target overlap and have complex backgrounds, matching the task requirements of the invention.
The accuracy of the method of the invention is compared with the Faster RCNN and SSD algorithm models in Table 1. The evaluation index is AP (Average Precision: the average precision over a dataset, computed as the area under the Precision-Recall curve and used to measure the quality of a trained target detection model on each target class), reported as the average of the mAP (mean AP over all categories) over IoU (Intersection over Union) thresholds from 0.50 to 0.95 in steps of 0.05. The data in the table show that the invention clearly improves detection accuracy, which demonstrates its effectiveness and efficiency.
Table 1: Algorithm model accuracy (AP) comparison

Model | VEDAI | VisDrone
---|---|---
Faster RCNN | 16.7 | 8.6
SSD512 | 22.8 | 9.1
The present invention | 23.4 | 12.2
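For reference, the sketch below shows how the IoU between two boxes is computed and how the reported AP averages the per-threshold values over IoU thresholds 0.50 to 0.95 in steps of 0.05; the per-threshold AP values themselves come from a full precision-recall evaluation, which is not reproduced here.

```python
import numpy as np

def iou(a, b):
    """a, b: boxes as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter + 1e-9)

iou_thresholds = np.arange(0.50, 1.00, 0.05)          # 0.50, 0.55, ..., 0.95
ap_per_threshold = {t: 0.0 for t in iou_thresholds}   # to be filled by a precision-recall evaluation
ap = float(np.mean(list(ap_per_threshold.values())))  # the reported AP@[0.50:0.05:0.95]
```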
Finally, it should be noted that: the above embodiments are only for illustrating the technical solution of the present invention, and are not limiting; although the invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present invention.
What has been described above is merely some embodiments of the present invention. It will be apparent to those skilled in the art that various modifications and improvements can be made without departing from the spirit of the invention.
Claims (4)
1. The intelligent detection method for the photoelectric imaging weak and small target in the complex environment is characterized by comprising the following steps of:
step S1: performing size normalization processing on the image to be detected to obtain the expected image size;
step S2: performing group convolution on the image processed in step S1 with a depthwise convolution whose kernel size is 5×5 to obtain a first feature map;
step S3: performing a first downsampling of the first feature map with a convolution whose stride is 2 and kernel size is 3×3 to obtain a second feature map;
step S4: performing multi-scale feature extraction on the second feature map through four stacked residual module layers;
the residual module layers are arranged as follows: the second feature map generated in step S3 serves as the input feature map of the first residual module layer, and the input feature map of each of the second, third and fourth residual module layers is the output feature map of the previous residual module layer; that is, the output feature map of the preceding layer serves as the input feature map of the following layer;
each residual module layer has the same network structure and adopts the lightweight convolution structure ShuffleNetV2 as its basic convolution module; the input feature map of a residual module layer passes through the basic convolution module, and the result is spliced by channel with the input feature map of the current residual module layer to obtain the output feature map of the current residual module layer;
step S5: performing multi-scale fusion on the output feature maps of the second, third and fourth residual module layers in a ladder-type feature fusion structure to obtain a fused feature map;
the ladder-type feature fusion structure comprises the following steps:
defining the output feature maps of the second, third and fourth residual module layers as the first-scale, second-scale and third-scale feature maps respectively, and defining their feature map dimensions as the first, second and third scales respectively;
converting the feature map dimension of the third-scale feature map into the second scale and then channel-splicing it with the second-scale feature map to obtain a first splicing result; converting the feature map dimension of the first splicing result into the second scale to obtain a first splicing result at the second scale;
converting the feature map dimension of the second-scale feature map into the first scale and then channel-splicing it with the first-scale feature map to obtain a second splicing result; converting the feature map dimension of the second splicing result into the first scale to obtain a second splicing result at the first scale;
converting the feature map dimension of the first splicing result at the second scale into the first scale, then channel-splicing it with the second splicing result at the first scale to obtain a third splicing result, and converting the feature map dimension of the third splicing result into the first scale to obtain a third splicing result at the first scale;
performing a global average pooling operation on the third-scale feature map, converting the feature map dimension of the pooling result into the first scale, and fusing it with the third splicing result at the first scale to obtain the fused feature map;
step S6: performing weak and small target detection on the fused feature map through a classification detection determiner to obtain the detection result for the weak and small target;
a weak and small target is a target whose size is smaller than or equal to a specified size.
2. The method of claim 1, wherein step S6 comprises:
step S601, extracting candidate boxes from the fused feature map, and obtaining a plurality of final candidate boxes after screening and non-maximum suppression of the extracted candidate boxes;
step S602, extracting the features of each final candidate box from the fused feature map based on the final candidate boxes obtained in step S601, and pooling the candidate box features to obtain candidate box features of the expected feature size;
step S603, performing weak and small target detection on the candidate box features with the classification detection determiner.
3. The method of claim 1, wherein the classification detection determiner consists of a fully connected layer and a softmax function layer.
4. The method of claim 1, wherein in step S1, the expected image size is 512×512 pixels.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310196144.8A CN116310359A (en) | 2023-03-03 | 2023-03-03 | Intelligent detection method for photoelectric imaging weak and small target in complex environment |
Publications (1)
Publication Number | Publication Date |
---|---|
CN116310359A (en) | 2023-06-23
Family
ID=86800764
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202310196144.8A Pending CN116310359A (en) | 2023-03-03 | 2023-03-03 | Intelligent detection method for photoelectric imaging weak and small target in complex environment |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN116310359A (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117237746A (en) * | 2023-11-13 | 2023-12-15 | 光宇锦业(武汉)智能科技有限公司 | Small target detection method, system and storage medium based on multi-intersection edge fusion |
CN117237746B (en) * | 2023-11-13 | 2024-03-15 | 光宇锦业(武汉)智能科技有限公司 | Small target detection method, system and storage medium based on multi-intersection edge fusion |
Legal Events
Date | Code | Title | Description
---|---|---|---
 | PB01 | Publication | 
 | SE01 | Entry into force of request for substantive examination | 