
CN112101113B - Lightweight unmanned aerial vehicle image small target detection method - Google Patents

Lightweight unmanned aerial vehicle image small target detection method Download PDF

Info

Publication number
CN112101113B
CN112101113B (application number CN202010819487.1A)
Authority
CN
China
Prior art keywords
target
sequence
frame
branch
central point
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010819487.1A
Other languages
Chinese (zh)
Other versions
CN112101113A (en)
Inventor
李红光
王蒙
丁文锐
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beihang University
Original Assignee
Beihang University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beihang University filed Critical Beihang University
Priority to CN202010819487.1A priority Critical patent/CN112101113B/en
Publication of CN112101113A publication Critical patent/CN112101113A/en
Application granted granted Critical
Publication of CN112101113B publication Critical patent/CN112101113B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Images

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00: Scenes; Scene-specific elements
    • G06V 20/10: Terrestrial scenes
    • G06V 20/13: Satellite images
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00: Pattern recognition
    • G06F 18/20: Analysing
    • G06F 18/24: Classification techniques
    • G06F 18/241: Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00: Arrangements for image or video recognition or understanding
    • G06V 10/20: Image preprocessing
    • G06V 10/32: Normalisation of the pattern dimensions
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00: Arrangements for image or video recognition or understanding
    • G06V 10/40: Extraction of image or video features
    • G06V 10/44: Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/04: Architecture, e.g. interconnection topology
    • G06N 3/045: Combinations of networks
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/08: Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Data Mining & Analysis (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Computation (AREA)
  • Evolutionary Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Artificial Intelligence (AREA)
  • Astronomy & Astrophysics (AREA)
  • Remote Sensing (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Image Analysis (AREA)
  • Image Processing (AREA)

Abstract

The invention provides a lightweight unmanned aerial vehicle image small target detection method, belonging to the technical field of unmanned aerial vehicle image processing. The invention processes each frame of the unmanned aerial vehicle video to be detected in temporal order as follows: the image is scaled and input into a Revised MobileNetV2 feature extractor, which outputs a feature map; the feature map is input into a synchronous up-sampling and detection module, which predicts the positions of the target center points and their corresponding scales to obtain all predicted target bounding boxes in the frame; after all frames of the video to be detected have been processed, fast sequence non-maximum suppression is applied to the prediction results of all frames, and the target detection result of the video to be detected is output. The invention uses a lightweight backbone network to detect small targets in unmanned aerial vehicle images, reduces false detections of small targets, improves detection efficiency, and enables fast and accurate detection of small targets in unmanned aerial vehicle video.

Description

Lightweight unmanned aerial vehicle image small target detection method
Technical Field
The invention belongs to the technical field of unmanned aerial vehicle image processing, and particularly relates to a light unmanned aerial vehicle image small target detection method.
Background
With the maturation of unmanned aerial vehicle technology and the growing number of unmanned aerial vehicle suppliers, the cost of unmanned aerial vehicles has gradually decreased. In recent years, unmanned aerial vehicles have received wide attention in many fields such as geology, agriculture and forestry, and crowd/traffic monitoring. An unmanned aerial vehicle can carry a variety of peripheral sensors, including infrared image sensors, visible-light image sensors, acceleration sensors, barometric sensors, and the like. Among them, the visible-light image sensor provides rich environmental information, so visible-light image understanding is one of the popular directions of unmanned aerial vehicle application research. Target detection technology can locate targets of the categories of interest in an image and undoubtedly provides effective support for various unmanned aerial vehicle tasks.
According to the definition of the MS COCO (Microsoft Common Objects in Context) data set, an object occupying no more than 32 × 32 pixels is considered a small object. Typical targets in unmanned aerial vehicle images are small, numerous, and densely distributed.
Target detection technology has developed over a long period, and detection accuracy has continuously improved from traditional methods based on hand-crafted features to deep learning methods based on convolutional neural networks. At present, convolutional neural network-based methods are the mainstream of target detection. Most visible-light unmanned aerial vehicle image target detection algorithms are optimized to improve accuracy as much as possible, and efficiency is rarely considered. Target detection on the onboard platform of an unmanned aerial vehicle is of great significance: it not only improves the flexibility of unmanned aerial vehicle use and the intelligence of the vehicle itself, but also allows operation under poor communication conditions. However, the storage and computational resources of the onboard platform are limited, which requires a target detection algorithm with low computational cost and a small number of parameters.
Disclosure of Invention
Based on the importance of target detection on unmanned aerial vehicles and the requirement for a detection method with low computational cost and few parameters, the invention provides a lightweight unmanned aerial vehicle image small target detection method for the scenario of target detection in visible-light unmanned aerial vehicle images on the onboard platform. The unmanned aerial vehicle image data source is in video format, that is, the input unmanned aerial vehicle images are the frames of a video in temporal order.
The invention provides a lightweight unmanned aerial vehicle image small target detection method, which comprises the following steps:
step one: scaling the current frame image to be detected to 512 × 512 pixels;
step two: inputting the scaled image into a Revised MobileNetV2 feature extractor and outputting a feature map of size 16 × 16;
step three: inputting the extracted feature map into a synchronous up-sampling and detection module. The synchronous up-sampling and detection module contains four branches based on the sub-pixel convolution structure: a center point branch, a center point offset branch, a center point target branch and a scale branch. The first three branches jointly determine the positions of the center points, and the scale branch determines the scale of the target corresponding to each center point;
step four: obtaining all predicted target boxes of the current frame according to the predicted target center point positions and the corresponding scales, and storing the result. Judging whether all frames of the current video to be detected have been processed; if so, entering step five, otherwise, returning to step one for the remaining frames;
step five: performing fast sequence non-maximum suppression on the prediction results of all frames of the video to be detected to obtain the final target detection result.
The fast sequence non-maximum suppression first performs suppression and de-duplication of the predicted target boxes, and then performs sequence selection and re-scoring of the predicted target boxes. Sequence selection means selecting, for each frame, the first K predicted target boxes after de-duplication sorted by descending score, computing the IoU values between the predicted target boxes of adjacent frames, and associating the predicted target boxes whose IoU is larger than a threshold B; at this point the whole video sequence yields several overlapping temporal target box sequences of different lengths, from which the temporal target box sequence with the largest total score is selected. Re-scoring means assigning the average score of the target boxes of the selected temporal sequence to every bounding box in that sequence. After these three steps, the temporal target box sequence with the largest total score is excluded, and selection and re-scoring of the predicted target box sequences are repeated until no temporal target box sequence can be selected; where K is a positive integer and B is a real number greater than 0.
Compared with the prior art, the invention has the following advantages and positive effects: (1) the invention applies a lightweight design to the typical framework of the center point prediction method, uses a lightweight backbone network and designs a synchronous up-sampling and detection module, forming a more efficient detection framework; (2) the invention adds a binary center point target branch at the detection head, which brings new information to the center point prediction and reduces false detections of small targets to a certain extent; (3) the invention provides a fast sequence non-maximum suppression method in the post-processing part, optimizing the detection results with target temporal information. The method enables fast and accurate on-board target detection for unmanned aerial vehicles.
Drawings
FIG. 1 is an exemplary framework for a center point based prediction approach;
FIG. 2 is a schematic diagram of a lightweight network for detecting small targets of unmanned aerial vehicle images according to the present invention;
FIG. 3 is a flow chart of a fast sequence non-maxima suppression method of the present invention;
fig. 4 is a flow chart of the light unmanned aerial vehicle image small target detection framework of the invention.
Detailed Description
The present invention will be described in further detail with reference to the drawings and examples.
The method of the present invention is designed on the typical framework of the center point prediction method, as shown in FIG. 1. The input Image is resized and fed into a Feature Extractor; the extracted features are up-sampled and then fed into a Detection Head that predicts the target Center Point and Scale. The feature extractor is typically built on a classification network, and the feature extractor together with the up-sampling module forms the backbone network of the center point prediction framework. The detection head is responsible for target prediction from the extracted features and contains three branches: a center point branch, a center point offset branch, and a scale branch.
As shown in FIG. 2, the lightweight unmanned aerial vehicle image target detection framework mainly comprises the following parts:
the first part is the Revised MobileNetV2 feature extractor, which acts as the backbone of the network structure of the method of the present invention. The second part is a synchronous Up-sampling and Detection Module (Simultaneous Up-sampling and Detection Module) which is used as a Detection head of the network structure of the method and has the Up-sampling function. The synchronous upsampling and detection module includes four branches, which are a center point branch cls, a center point offset branch offset, a center point target branch obj, and a scale branch wh, respectively.
(1) Revised MobileNetV2 feature extractor.
The feature extractor of the present invention is based on the lightweight classification network MobileNetV2. MobileNetV2 adopts depthwise separable convolution and an Inverted Residual structure, and is a mobile network architecture widely applied to tasks such as detection and segmentation. Usually, the structure before the pooling layer of a classification network is selected for feature extraction in other task algorithms. However, experiments show that when the structure before the MobileNetV2 pooling layer is used as the backbone network of the invention, the last 1 × 1 convolution (Conv) layer has a negative influence on detection accuracy. MobileNetV2 is designed for classification tasks: its final goal is to obtain feature vectors with good discrimination, which are then classified by a fully connected layer, so it is not necessarily fully suitable for detection tasks. Just as MobileNetV2 replaces the traditional nonlinear bottleneck with a linear Bottleneck structure, the feature map output by the final 1 × 1 convolution after the nonlinear ReLU activation may lose information necessary for detection relative to the feature map output by the final linear residual block. Meanwhile, the output dimension of the last 1 × 1 convolution is quite large (1280 dimensions), which also imposes a heavy computational burden on the subsequent synchronous up-sampling and detection module. Therefore, the invention removes the last 1 × 1 convolution of the MobileNetV2 feature extractor to form the Revised MobileNetV2 feature extractor.
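For illustration, a minimal sketch of how such a Revised MobileNetV2 feature extractor can be obtained, assuming torchvision's standard MobileNetV2 implementation as the base (the patent does not specify the implementation, so the library calls and the 320-channel output are assumptions of this sketch):

```python
import torch
from torch import nn
from torchvision.models import mobilenet_v2

# Standard MobileNetV2 backbone; weights and training setup are not specified here.
base = mobilenet_v2(weights=None)

# Drop the last 1x1 convolution block (the 320 -> 1280 channel expansion that
# precedes global pooling), keeping everything up to the final inverted residual.
revised_features = nn.Sequential(*list(base.features.children())[:-1])

x = torch.randn(1, 3, 512, 512)      # a resized 512 x 512 input frame
feat = revised_features(x)           # expected shape: (1, 320, 16, 16)
print(feat.shape)
```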
(2) A synchronous upsampling and detection module.
Deconvolution is a common up-sampling method; compared with interpolation-based up-sampling or un-pooling, it is learnable and can produce finer results. However, deconvolution also suffers from a large computational cost and the checkerboard effect.
The present invention uses sub-pixel convolution instead of deconvolution, which is also a learning-based upsampling method and can be defined as:
FM_HR = PS(W_L × FM_LR + b_L)

where PS is the periodic shuffling operation that rearranges a low-resolution feature map of size H × W × C·r² into a high-resolution feature map FM_HR of size rH × rW × C; W_L and b_L are the convolution weights and bias that raise the dimension of the low-resolution feature map FM_LR by a factor of r², and r is the up-sampling factor. Briefly, sub-pixel convolution first raises the dimension of the input feature map with a convolution layer and then obtains the up-sampled output by periodic rearrangement. Because its principle differs from that of deconvolution, sub-pixel convolution has no checkerboard effect; and because the sub-pixel convolution and the detection head share a common convolution structure, the invention shares the computation of these two layers to form the synchronous up-sampling and detection module.
The synchronous up-sampling and detection module can be seen as a detection head with an integrated up-sampling function. The feature map to be up-sampled is input directly into each branch of the detection head. Each branch contains one 1 × 1 convolution layer whose output dimension is 8² times the final output dimension of the corresponding branch (i.e., r = 8), and the output of this convolution is periodically rearranged to obtain the prediction result of the branch. This design exploits the property of sub-pixel convolution: it not only avoids the excessive parameter count of deconvolution, but also shares the computation of the detection head and the up-sampling, further reducing the computational burden of the algorithm.
The input of the synchronous up-sampling and detection module is the 16 × 16 feature map output by the Revised MobileNetV2 feature extractor, and the module performs 8× up-sampling, so the outputs of the center point branch, the center point offset branch, the center point target branch and the scale branch are all of size 128 × 128. The structure of each branch is essentially the same, consisting of a 1 × 1 convolution layer followed by a periodic rearrangement operation; only the output dimension differs. The center point branch outputs a heat map with one channel per class, the center point target branch outputs a fixed 2-channel heat map, the center point offset branch outputs a fixed 2-channel offset for each center point, and the scale branch outputs a fixed 2-channel scale for each center point.
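As an illustration of the branch structure just described, a minimal PyTorch sketch of one sub-pixel-convolution branch is given below; the 320-channel input, the class count, and the module and variable names are assumptions of this sketch rather than details from the patent:

```python
import torch
from torch import nn

class SubPixelBranch(nn.Module):
    """One branch of the synchronous up-sampling and detection module:
    a 1x1 convolution that raises the channel dimension by r^2, followed by
    periodic rearrangement (PixelShuffle) that performs the 8x up-sampling."""
    def __init__(self, in_ch, out_ch, r=8):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch * r * r, kernel_size=1)
        self.shuffle = nn.PixelShuffle(r)

    def forward(self, x):
        return self.shuffle(self.conv(x))

feat = torch.randn(1, 320, 16, 16)               # feature map from the backbone
num_classes = 10                                  # illustrative class count
cls_map = SubPixelBranch(320, num_classes)(feat)  # center point heat map, (1, C, 128, 128)
obj_map = SubPixelBranch(320, 2)(feat)            # binary center point target heat map
off_map = SubPixelBranch(320, 2)(feat)            # center point offsets
wh_map  = SubPixelBranch(320, 2)(feat)            # scales (width, height) per center
print(cls_map.shape, obj_map.shape, off_map.shape, wh_map.shape)
```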
Compared with the branch definition of the detection head in the conventional center point prediction framework, the invention adds a center point target branch. Because the invention uses an extremely lightweight feature extractor and detection head, the discriminative power of the extracted features and of the detection head is limited, which brings a risk of missed and false detections in unmanned aerial vehicle images where targets are small, densely distributed, and hard to learn. Especially in unmanned aerial vehicle images with complex environments, a complex background is easily detected as a target by mistake. The design is inspired by YOLO (You Only Look Once): YOLO additionally predicts an objectness score for each anchor box, whose physical meaning is the IoU (Intersection over Union) between the anchor box and the ground-truth box, and at detection time the actual score of a target box is the product of the classification score and the objectness score. The invention adds a center point target branch to the detection head. This is a binary branch: it predicts a binary heat map marking whether a point in the image space is the center point of a target of any class of interest, without performing specific classification. Therefore, the invention adds new information to the prediction of the target center point, which helps produce better prediction results. The heat maps output by the binary center point target branch and the multi-class center point branch are trained and predicted independently. The overall loss function L_det of the algorithm of the invention is defined as:
L_det = L_cls + λ_size·L_size + λ_off·L_off + λ_obj·L_obj

where L_cls is the center point classification loss, a Focal Loss defined in the same way as in the CenterNet network. L_size is the target box scale loss and L_off is the center point offset loss; both are L1 losses, and λ_size and λ_off are their corresponding coefficients. L_obj is the center point target loss, a Focal Loss defined in the same way as L_cls, and λ_obj is the coefficient of the target loss, set to 0.5 in the invention. At detection time, the final target box Score is obtained by combining the classification score and the target score of each center point:
Score = heatmap_cls × F(heatmap_obj)
[Formula image omitted: definition of the preprocessing function F(x).]
where F(x) is a preprocessing function of the target score, x represents the value of a pixel in the heat map, heatmap_cls is the class-wise prediction heat map output by the center point branch, and heatmap_obj is the binary target heat map output by the center point target branch. The introduction of the target branch reduces the scores of bounding boxes with low targetness, so the risk of false alarms is effectively reduced.
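A minimal sketch of this score combination is given below, assuming PyTorch tensors. Since the exact form of F(x) appears only as a formula image in the original publication, the thresholded identity used here is purely an illustrative assumption:

```python
import torch

def F(x, thr=0.3):
    # Hypothetical preprocessing of the target score: suppress low responses.
    # The patent's actual F(x) is defined in a formula image not reproduced here.
    return torch.where(x > thr, x, torch.zeros_like(x))

heatmap_cls = torch.rand(1, 10, 128, 128)   # class-wise center point heat map
heatmap_obj = torch.rand(1, 1, 128, 128)    # foreground channel of the target heat map
score = heatmap_cls * F(heatmap_obj)        # Score = heatmap_cls x F(heatmap_obj)
```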
The specific structural parameters of the target detection network implemented by the present invention are shown in Table 1:
table 1 structural parameter table of object detection network of the present invention
[Table image omitted: structural parameters of the target detection network.]
where Conv2d refers to a conventional convolution layer, Bottleneck refers to the inverted residual module in MobileNetV2, t, c, n and s are the parameters of the inverted residual structure (t is the channel expansion factor, c is the output dimension, n is the number of inverted residual modules, and s is the stride), cls is the number of detected classes, and ratio is the up-sampling factor.
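For reference, a minimal sketch of the inverted residual (Bottleneck) block whose parameters t, c, n and s appear in Table 1 is given below, following the standard MobileNetV2 design; this is an illustrative reconstruction, not the patent's code:

```python
import torch
from torch import nn

class InvertedResidual(nn.Module):
    """MobileNetV2-style inverted residual: 1x1 expansion (factor t), 3x3 depthwise
    convolution with stride s, and a linear 1x1 projection to c output channels."""
    def __init__(self, in_ch, c, t, s):
        super().__init__()
        hidden = in_ch * t
        self.use_res = (s == 1 and in_ch == c)
        self.block = nn.Sequential(
            nn.Conv2d(in_ch, hidden, 1, bias=False),
            nn.BatchNorm2d(hidden), nn.ReLU6(inplace=True),
            nn.Conv2d(hidden, hidden, 3, stride=s, padding=1, groups=hidden, bias=False),
            nn.BatchNorm2d(hidden), nn.ReLU6(inplace=True),
            nn.Conv2d(hidden, c, 1, bias=False),
            nn.BatchNorm2d(c),                      # linear bottleneck: no activation
        )

    def forward(self, x):
        y = self.block(x)
        return x + y if self.use_res else y

# n such blocks are stacked per table row; only the first block uses stride s.
block = InvertedResidual(in_ch=32, c=16, t=1, s=1)
print(block(torch.randn(1, 32, 128, 128)).shape)
```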
After the predicted target bounding boxes of all frames are obtained, the invention therefore uses Fast Sequence Non-Maximum Suppression (Fast Seq-NMS) to de-duplicate the target detection results of all frames and to correct the detection results using the temporal association of targets between frames.
In a temporal image sequence, the same object is very likely to appear in adjacent frames, and its position and scale change only slightly. Based on this assumption, inter-frame associations of the same target can be established from the IoU values between the predicted boxes of adjacent frames, and the scores of the predicted bounding boxes can then be re-evaluated on the basis of the target detection network's predictions. Sequence non-maximum suppression (Seq-NMS), as shown in FIG. 3, comprises three steps: 1) bounding box sequence selection, 2) bounding box sequence re-scoring, 3) suppression and de-duplication. Bounding box sequence selection first computes the IoU values between the predicted target bounding boxes of all adjacent frames, then associates the predicted boxes whose IoU is larger than a threshold B; at this point the whole video sequence yields several overlapping temporal bounding box sequences of different lengths, and finally the bounding box sequence with the largest total predicted score is selected. Bounding box sequence re-scoring assigns the average score of the selected bounding box sequence to each bounding box in the sequence. Suppression and de-duplication removes the selected bounding box sequence and also removes, within each frame, the bounding boxes whose IoU with an element of the selected sequence exceeds a certain threshold. After the three steps are completed, a new bounding box sequence selection is started, until no sequence can be selected. The threshold B is typically set to 0.5.
Considering that there are many predicted target boxes, and with efficiency in mind, the invention proposes a more efficient Fast Sequence Non-Maximum Suppression (Fast Seq-NMS) method. Fast Seq-NMS first de-duplicates the predicted bounding boxes and then performs selection and re-scoring of the inter-frame bounding box sequences. De-duplication means applying Non-Maximum Suppression (NMS) to all predicted target bounding boxes of each frame. Meanwhile, Fast Seq-NMS performs sequence selection and re-scoring only on the first K bounding boxes of each frame, sorted by descending score after de-duplication; K is set to 100 in the embodiment of the invention, which greatly reduces the computational burden compared with the approach of Seq-NMS. The bounding box sequence selection and re-scoring steps are the same as in Seq-NMS: after the bounding box sequence with the largest total score is re-scored, it is excluded, and the selection and re-scoring of predicted target bounding box sequences restart, repeating until no sequence can be selected. In the invention, K can be set as a proportion of the number of boxes or as a fixed value.
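A minimal NumPy sketch of the Fast Seq-NMS procedure described above is given below. The function names, the [x1, y1, x2, y2] box format, and the dynamic-programming selection of the highest-scoring sequence are assumptions of this sketch, not the patent's implementation:

```python
import numpy as np

def iou(a, b):
    """IoU between one box a (4,) and an array of boxes b (N, 4)."""
    x1 = np.maximum(a[0], b[:, 0]); y1 = np.maximum(a[1], b[:, 1])
    x2 = np.minimum(a[2], b[:, 2]); y2 = np.minimum(a[3], b[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[:, 2] - b[:, 0]) * (b[:, 3] - b[:, 1])
    return inter / (area_a + area_b - inter + 1e-9)

def nms(boxes, scores, thr=0.5):
    """Greedy per-frame NMS; returns kept indices in descending-score order."""
    order = np.argsort(-scores)
    keep = []
    while order.size:
        i = order[0]
        keep.append(i)
        rest = order[1:]
        order = rest[iou(boxes[i], boxes[rest]) <= thr]
    return np.array(keep, dtype=int)

def fast_seq_nms(frames, K=100, B=0.5, nms_thr=0.5):
    """frames: list of (boxes (N,4), scores (N,)) per frame, in temporal order.
    Returns a list of [boxes, re-scored scores] per frame."""
    # 1) Per-frame suppression/de-duplication, then keep the top-K by score.
    dets = []
    for boxes, scores in frames:
        keep = nms(boxes, scores, nms_thr)[:K]
        dets.append([boxes[keep].copy(), scores[keep].copy()])
    active = [np.ones(len(s), bool) for _, s in dets]

    while True:
        # 2) Dynamic programming: best cumulative score of a chain ending at each box.
        best = [s * a for (_, s), a in zip(dets, active)]
        prev = [np.full(len(s), -1, int) for _, s in dets]
        for t in range(1, len(dets)):
            for i in np.flatnonzero(active[t]):
                cand = np.flatnonzero(active[t - 1])
                if cand.size == 0:
                    continue
                link = cand[iou(dets[t][0][i], dets[t - 1][0][cand]) > B]
                if link.size:
                    j = link[np.argmax(best[t - 1][link])]
                    best[t][i] += best[t - 1][j]
                    prev[t][i] = j
        # 3) Select the temporal sequence with the largest total score; stop if none left.
        totals = [b[a].max() if a.any() else -np.inf for b, a in zip(best, active)]
        t = int(np.argmax(totals))
        if totals[t] == -np.inf or totals[t] <= 0:
            break
        i = int(np.flatnonzero(active[t])[np.argmax(best[t][active[t]])])
        chain = []
        while i != -1:
            chain.append((t, i))
            i, t = prev[t][i], t - 1
        # 4) Re-score: every box in the chain gets the chain's average score,
        #    then the chain is excluded and selection restarts.
        avg = np.mean([dets[ft][1][fi] for ft, fi in chain])
        for ft, fi in chain:
            dets[ft][1][fi] = avg
            active[ft][fi] = False
    return dets
```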
As shown in FIG. 4, the implementation of the method of the invention proceeds as follows: the current frame to be detected is taken from the input unmanned aerial vehicle video in temporal order, scaled to 512 × 512 pixels, and input into the lightweight small-target detection network, and all predicted target bounding boxes are stored; all frames of the video to be detected are scaled and detected by the lightweight small-target detection network in this way; the detection results of all frames are then processed by the fast sequence non-maximum suppression method, and the de-duplicated and corrected target detection result of the video to be detected is output.
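Putting the pieces together, a minimal end-to-end sketch of this flow is given below, reusing the components sketched earlier (revised_features, SubPixelBranch, fast_seq_nms); the OpenCV frame reading and the decode_centers() helper that turns heat-map peaks, offsets and scales into boxes are hypothetical placeholders, not the patent's implementation:

```python
import cv2
import torch

def detect_video(video_path, backbone, heads, decode_centers, K=100, B=0.5):
    """backbone: the Revised MobileNetV2 features; heads: the four sub-pixel branches
    (cls, obj, offset, wh); decode_centers: hypothetical heat-map -> boxes decoder."""
    cap = cv2.VideoCapture(video_path)
    per_frame = []
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        img = cv2.resize(frame, (512, 512))                        # step 1: resize
        x = torch.from_numpy(img).permute(2, 0, 1)[None].float() / 255.0
        with torch.no_grad():
            feat = backbone(x)                                     # step 2: 16x16 feature map
            cls_m, obj_m, off_m, wh_m = (h(feat) for h in heads)   # step 3: four branches
        boxes, scores = decode_centers(cls_m, obj_m, off_m, wh_m)  # step 4: boxes per frame
        per_frame.append((boxes, scores))
    cap.release()
    return fast_seq_nms(per_frame, K=K, B=B)                       # step 5: Fast Seq-NMS
```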

Claims (5)

1. A lightweight unmanned aerial vehicle image small target detection method, characterized in that the following steps are executed for an input unmanned aerial vehicle image video to be detected:
step one: taking one frame of image in temporal order, and scaling the current frame image to a set size;
step two: inputting the scaled image into a Revised MobileNetV2 feature extractor and outputting a feature map of size 16 × 16;
removing the last layer of 1 × 1 convolution of the MobileNetV2 feature extractor to form a Revised MobileNetV2 feature extractor;
step three: inputting the extracted feature map into a synchronous up-sampling and detection module; the synchronous up-sampling and detection module comprises four branches based on the sub-pixel convolution structure, namely a center point branch, a center point offset branch, a center point target branch and a scale branch, wherein the first three branches jointly determine the positions of the center points, and the scale branch determines the scale of the target corresponding to each center point;
the center point target branch outputs a binary target heat map, marking whether each point in the corresponding image is the center point of a target of any class of interest;
step four: obtaining all predicted target boxes of the current frame according to the predicted target center point positions and the corresponding scales, and storing the result; judging whether all frames of the current video to be detected have been processed; if so, entering step five, otherwise, returning to step one;
step five: performing fast sequence non-maximum suppression on the prediction results of all frames of the video to be detected to obtain the final target detection result.
2. The method of claim 1, wherein in step one, the image is scaled to a size of 512 × 512 pixels.
3. The method of claim 1, wherein in step three, in each branch of the synchronous up-sampling and detection module, the dimension of the feature map is first raised by a convolution layer, and the up-sampled output is then obtained by periodic rearrangement, i.e., by sub-pixel convolution.
4. The method according to claim 1, wherein in step three, when the synchronous upsampling and detecting module detects the feature map, the predicted target frame score is obtained by combining the classification score and the target score of each center point, and is represented as follows:
Score = heatmap_cls × F(heatmap_obj)
[Formula image omitted: definition of the preprocessing function F(x).]
wherein Score is the target box score, heatmap_cls is the class-wise prediction heat map output by the center point branch, heatmap_obj is the binary target heat map output by the center point target branch, F(x) is a preprocessing function of the target score, and x represents the value of a pixel in the heat map.
5. The method according to claim 1, wherein in step five, the fast sequence non-maximum suppression processing comprises: first performing suppression and de-duplication of the predicted target boxes, and then performing sequence selection and re-scoring of the predicted target boxes; the suppression and de-duplication means applying non-maximum suppression to all predicted target boxes of each frame; the sequence selection means selecting, for each frame, the first K predicted target boxes after de-duplication sorted by descending score, computing the IoU values between the predicted target boxes of adjacent frames, and linking the predicted target boxes whose IoU is larger than a threshold B, at which point the whole video sequence yields several overlapping temporal target box sequences of different lengths, from which the temporal target box sequence with the largest total score is selected; the re-scoring means assigning the average score of the target boxes of the selected temporal sequence to each predicted target box in that sequence; then the temporal target box sequence with the largest total score is excluded, and selection and re-scoring of the predicted target box sequences are repeated until no temporal target box sequence can be selected; wherein K is a positive integer and B is a real number greater than 0.
CN202010819487.1A 2020-08-14 2020-08-14 Lightweight unmanned aerial vehicle image small target detection method Active CN112101113B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010819487.1A CN112101113B (en) 2020-08-14 2020-08-14 Lightweight unmanned aerial vehicle image small target detection method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010819487.1A CN112101113B (en) 2020-08-14 2020-08-14 Lightweight unmanned aerial vehicle image small target detection method

Publications (2)

Publication Number Publication Date
CN112101113A CN112101113A (en) 2020-12-18
CN112101113B (en) 2022-05-27

Family

ID=73753773

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010819487.1A Active CN112101113B (en) 2020-08-14 2020-08-14 Lightweight unmanned aerial vehicle image small target detection method

Country Status (1)

Country Link
CN (1) CN112101113B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112733630A (en) * 2020-12-28 2021-04-30 深圳市捷顺科技实业股份有限公司 Channel gate detection method, device, equipment and storage medium
CN114912486A (en) * 2022-05-10 2022-08-16 南京航空航天大学 Modulation mode intelligent identification method based on lightweight network
CN115861891B (en) * 2022-12-16 2023-09-29 北京多维视通技术有限公司 Video target detection method, device, equipment and medium

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109816695A (en) * 2019-01-31 2019-05-28 中国人民解放军国防科技大学 Target detection and tracking method for infrared small unmanned aerial vehicle under complex background
CN111027547A (en) * 2019-12-06 2020-04-17 南京大学 Automatic detection method for multi-scale polymorphic target in two-dimensional image
CN111401201A (en) * 2020-03-10 2020-07-10 南京信息工程大学 Aerial image multi-scale target detection method based on spatial pyramid attention drive
CN111476252A (en) * 2020-04-03 2020-07-31 南京邮电大学 Computer vision application-oriented lightweight anchor-frame-free target detection method

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9881234B2 (en) * 2015-11-25 2018-01-30 Baidu Usa Llc. Systems and methods for end-to-end object detection

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109816695A (en) * 2019-01-31 2019-05-28 中国人民解放军国防科技大学 Target detection and tracking method for infrared small unmanned aerial vehicle under complex background
CN111027547A (en) * 2019-12-06 2020-04-17 南京大学 Automatic detection method for multi-scale polymorphic target in two-dimensional image
CN111401201A (en) * 2020-03-10 2020-07-10 南京信息工程大学 Aerial image multi-scale target detection method based on spatial pyramid attention drive
CN111476252A (en) * 2020-04-03 2020-07-31 南京邮电大学 Computer vision application-oriented lightweight anchor-frame-free target detection method

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Vehicle detection in UAV images based on inter-frame motion estimation; Chen Yingxue et al.; Journal of Beijing University of Aeronautics and Astronautics; 2020-03-31; Vol. 46, No. 3; full text *

Also Published As

Publication number Publication date
CN112101113A (en) 2020-12-18

Similar Documents

Publication Publication Date Title
CN108961235B (en) Defective insulator identification method based on YOLOv3 network and particle filter algorithm
CN109584248B (en) Infrared target instance segmentation method based on feature fusion and dense connection network
CN112257569B (en) Target detection and identification method based on real-time video stream
CN109903331B (en) Convolutional neural network target detection method based on RGB-D camera
CN111079739B (en) Multi-scale attention feature detection method
CN110659664B (en) SSD-based high-precision small object identification method
CN112101113B (en) Lightweight unmanned aerial vehicle image small target detection method
CN113486764B (en) Pothole detection method based on improved YOLOv3
CN110991444B (en) License plate recognition method and device for complex scene
CN111915583B (en) Vehicle and pedestrian detection method based on vehicle-mounted thermal infrared imager in complex scene
CN111079604A (en) Method for quickly detecting tiny target facing large-scale remote sensing image
CN111462140B (en) Real-time image instance segmentation method based on block stitching
CN111126278A (en) Target detection model optimization and acceleration method for few-category scene
CN115937819A (en) Three-dimensional target detection method and system based on multi-mode fusion
CN114781514A (en) Floater target detection method and system integrating attention mechanism
CN118446987A (en) Cabin section inner surface corrosion visual detection method for long and narrow airtight space
CN105184809A (en) Moving object detection method and moving object detection device
CN117975377A (en) High-precision vehicle detection method
CN117011655A (en) Adaptive region selection feature fusion based method, target tracking method and system
CN114219757B (en) Intelligent damage assessment method for vehicle based on improved Mask R-CNN
CN111178158A (en) Method and system for detecting cyclist
CN111008555A (en) Unmanned aerial vehicle image small and weak target enhancement extraction method
CN118115952B (en) All-weather detection method and system for unmanned aerial vehicle image under urban low-altitude complex background
CN117314841A (en) Pixel defect detection method and device based on YOLO
CN118097365A (en) Lightweight object detection network design method for indoor semantic SLAM system

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant