CN117557774A - Unmanned aerial vehicle image small target detection method based on improved YOLOv8 - Google Patents
Info
- Publication number
- CN117557774A (application number CN202311456286.XA)
- Authority
- CN
- China
- Prior art keywords
- unmanned aerial vehicle
- image
- convolution
- yolov8
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G06V10/25 — Determination of region of interest [ROI] or a volume of interest [VOI]
- G06V10/454 — Local feature extraction; integrating the filters into a hierarchical structure, e.g. convolutional neural networks [CNN]
- G06V10/774 — Generating sets of training patterns; bootstrap methods, e.g. bagging or boosting
- G06V10/82 — Image or video recognition or understanding using neural networks
- G06V20/17 — Terrestrial scenes taken from planes or by drones
- Y02T10/40 — Engine management systems
Abstract
The invention provides an unmanned aerial vehicle image small target detection method based on improved YOLOv8. The method collects and labels various unmanned-aerial-vehicle-captured images to establish an unmanned aerial vehicle image data set. Starting from the original YOLOv8 network, it introduces the backbone network ITNet, replaces Conv convolutions with the dynamic convolution ODConv, adopts the neck module SGFPN, introduces the feature fusion module CSF, and replaces nearest-neighbor sampling with the CARAFE up-sampling method. The improved YOLOv8 network structure serves as the unmanned aerial vehicle image recognition network; a deep learning model for unmanned aerial vehicle small target recognition and detection is obtained through training and used to detect unmanned aerial vehicle images, achieving high-accuracy detection.
Description
Technical Field
The invention belongs to the technical field of deep learning target detection, and particularly relates to an unmanned aerial vehicle image small target detection method based on improved YOLOv8.
Background
In recent years, target detection algorithms based on convolutional neural networks have been widely applied and developed in fields such as remote sensing image processing, unmanned aerial vehicle navigation, automatic driving, medical diagnosis, face recognition and defect detection. Conventional target detection algorithms can basically meet the requirements of various scenes, but they mainly address large and medium targets; for the small targets in an unmanned aerial vehicle's aerial view, effective features are few and sufficient feature information is difficult to extract, so the results are unsatisfactory. Even the most advanced detectors show a large performance gap when detecting small and medium-sized objects.
Currently popular object detectors typically comprise a backbone network and a detection head, and the decisions of the latter depend on the representations output by the former; this design has proven effective. However, small targets carry little feature information to begin with, and hardly any of it survives multiple rounds of downsampling, so the network can scarcely learn useful information and the detection head cannot make correct decisions, which is fatal for small target detection. As a result, current detectors have low detection accuracy for small targets in unmanned aerial vehicle images.
Disclosure of Invention
The invention aims to provide an unmanned aerial vehicle small target recognition and detection method based on improved YOLOv8, so as to solve the technical problem of low accuracy in unmanned aerial vehicle picture detection.
To solve this technical problem, the specific technical scheme of the unmanned aerial vehicle small target recognition and detection method based on improved YOLOv8 is as follows:
An unmanned aerial vehicle small target recognition and detection method based on improved YOLOv8 comprises the following steps:
step 1, obtaining data from pictures shot by an unmanned aerial vehicle in a real living environment, labeling ten categories of targets such as people and vehicles, establishing an unmanned aerial vehicle picture data set, and applying the Mosaic data enhancement method to the data set;
step 2, taking the YOLOv8 network structure as the reference network, introducing the backbone network ITNet (Inverted Triangle Net), replacing Conv convolutions with the dynamic convolution ODConv, using the neck module SGFPN, introducing the feature fusion module CSF, and replacing nearest-neighbor sampling with the CARAFE up-sampling method; the improved YOLOv8 network structure is used as the unmanned aerial vehicle small target recognition network, and a deep learning model for unmanned aerial vehicle small target recognition and detection is obtained through training;
and step 3, inputting the unmanned aerial vehicle image to be detected into the deep learning model for unmanned aerial vehicle small target recognition and detection.
Further, the detection network is modified based on the YOLOv8 network structure and comprises 4 C2f modules, 1 SPPF module, 6 ODConv modules, 7 CSF modules, 7 Concat modules, 3 upsampling modules and 6 Conv modules.
Further, the C2f module comprises a 3×3 convolution layer, a BN (Batch Normalization) layer and a SiLU activation function layer, which are sequentially cascaded;
the SPPF module comprises sequentially cascaded 5×5 max-pooling layers, whose results are spliced through concat;
the Conv module comprises a 1×1 convolution layer, a BN layer and a ReLU activation function layer, which are sequentially cascaded;
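For illustration, a minimal PyTorch sketch of an SPPF-style block as described above (one 5×5 max-pooling layer applied three times in sequence, with all intermediate results spliced through concat) might look as follows; the 1×1 projections and the hidden channel width are common conventions assumed here, not values taken from the patent:

```python
import torch
import torch.nn as nn

class SPPF(nn.Module):
    """Spatial Pyramid Pooling - Fast: one 5x5 max-pool applied three
    times in sequence, with all intermediate results concatenated."""
    def __init__(self, c_in: int, c_out: int, k: int = 5):
        super().__init__()
        c_hidden = c_in // 2                      # assumed channel reduction
        self.cv1 = nn.Conv2d(c_in, c_hidden, 1, 1)
        self.cv2 = nn.Conv2d(c_hidden * 4, c_out, 1, 1)
        # stride-1 max pooling with padding keeps the spatial size unchanged
        self.pool = nn.MaxPool2d(kernel_size=k, stride=1, padding=k // 2)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.cv1(x)
        y1 = self.pool(x)
        y2 = self.pool(y1)
        y3 = self.pool(y2)
        return self.cv2(torch.cat([x, y1, y2, y3], dim=1))
```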
the ODConv module is represented as
y=(α w1 ⊙α f1 ⊙α c1 ⊙α s1 ⊙W 1 +…+α wn ⊙α fn ⊙α cn ⊙α sn ⊙W n )*x
Wherein x ε R (h x ω x c_in) and y ε R (h x ω x c_out) represent the input and output features, respectively (channel number c_in/c_out, width and height of feature h, ω, respectively), W i Representing an ith convolution kernel consisting of a c_out filter (w_i∈r (kxkxc_in), m=1, …, c_out); x 0_wi×1r represents the attention scalar of the convolution kernel w_i; alpha_si epsilon R (k x k), alpha_ci epsilon R (c_in) and alpha_fi epsilon R (c_out) represent three newly introduced notes, calculated along the spatial dimension, input channel dimension and output channel dimension of the convolution kernel W_i, respectively; x 2 represents multiplication operations along different dimensions of the kernel space.
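As a hedged illustration of this formula, the following PyTorch sketch combines n candidate kernels with the four attentions α_w, α_s, α_c and α_f predicted from a pooled summary of the input; the attention-head layout and hyper-parameters are readability assumptions, not the patent's exact design:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ODConv2d(nn.Module):
    """Simplified omni-dimensional dynamic convolution sketch: n candidate
    kernels W_1..W_n weighted by per-kernel, spatial, input-channel and
    output-channel attentions predicted from a pooled input summary."""
    def __init__(self, c_in, c_out, k=3, n=4, reduction=4):
        super().__init__()
        self.c_in, self.c_out, self.k, self.n = c_in, c_out, k, n
        self.weight = nn.Parameter(torch.randn(n, c_out, c_in, k, k) * 0.02)
        hidden = max(c_in // reduction, 4)
        self.fc = nn.Linear(c_in, hidden)
        self.fc_w = nn.Linear(hidden, n)       # kernel attention (alpha_w)
        self.fc_s = nn.Linear(hidden, k * k)   # spatial attention (alpha_s)
        self.fc_c = nn.Linear(hidden, c_in)    # input-channel attention (alpha_c)
        self.fc_f = nn.Linear(hidden, c_out)   # output-channel attention (alpha_f)

    def forward(self, x):
        b = x.size(0)
        ctx = F.relu(self.fc(x.mean(dim=(2, 3))))             # (b, hidden)
        a_w = F.softmax(self.fc_w(ctx), dim=1)                # (b, n)
        a_s = torch.sigmoid(self.fc_s(ctx)).view(b, 1, 1, 1, self.k, self.k)
        a_c = torch.sigmoid(self.fc_c(ctx)).view(b, 1, 1, self.c_in, 1, 1)
        a_f = torch.sigmoid(self.fc_f(ctx)).view(b, 1, self.c_out, 1, 1, 1)
        # y = (sum_i alpha_wi . alpha_fi . alpha_ci . alpha_si . W_i) * x
        w = self.weight.unsqueeze(0) * a_s * a_c * a_f        # (b,n,co,ci,k,k)
        w = (w * a_w.view(b, self.n, 1, 1, 1, 1)).sum(dim=1)  # (b,co,ci,k,k)
        # grouped-conv trick: fold the batch into groups so each sample
        # is convolved with its own dynamically assembled kernel
        x = x.reshape(1, b * self.c_in, *x.shape[2:])
        w = w.reshape(b * self.c_out, self.c_in, self.k, self.k)
        y = F.conv2d(x, w, padding=self.k // 2, groups=b)
        return y.reshape(b, self.c_out, *y.shape[2:])
```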
The CSF module comprises three branches: the first branch is a sequentially cascaded 3×3 RepConv convolution layer, the second branch is a PConv module followed by a Conv module, and the third branch is a Conv module; the outputs of the three branches are spliced through a concat layer.
further, the method for preprocessing the unmanned aerial vehicle image dataset comprises the following steps: the xml file generated using the VOC annotation mode is converted into txt file required for YOLO training.
Further, the data set dividing method is as follows: 60% of the data is used as the training set, 20% as the validation set, and 20% as the test set.
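A plain sketch of such a 60/20/20 split, assuming a single image directory and a fixed random seed for reproducibility:

```python
import random
from pathlib import Path

def split_dataset(image_dir: str, seed: int = 0):
    """Return (train, val, test) file lists in a 60/20/20 ratio."""
    files = sorted(Path(image_dir).glob("*.jpg"))
    random.Random(seed).shuffle(files)
    n = len(files)
    n_train, n_val = int(0.6 * n), int(0.2 * n)
    return (files[:n_train],
            files[n_train:n_train + n_val],
            files[n_train + n_val:])
```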
Further, the model training parameters are set as follows: the initial learning rate is 0.01, the momentum is 0.937, the weight decay is 0.0005, the training threshold is 0.2, the picture size is normalized to 640×640, the number of iterations is 300, and the batch size is 16.
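Assuming the improved network were registered as a custom Ultralytics model file, the stated hyper-parameters would map onto a training call roughly as follows; `improved-yolov8.yaml` and `uav.yaml` are hypothetical file names, the stock Ultralytics release does not include ITNet, ODConv, SGFPN, CSF or CARAFE, and the patent's "training threshold" of 0.2 has no direct stock counterpart:

```python
from ultralytics import YOLO

# Hypothetical model file that would first have to register the
# custom ITNet / ODConv / SGFPN / CSF / CARAFE modules.
model = YOLO("improved-yolov8.yaml")
model.train(
    data="uav.yaml",       # hypothetical dataset config
    epochs=300,            # number of iterations stated above
    batch=16,
    imgsz=640,             # pictures normalized to 640x640
    lr0=0.01,              # initial learning rate
    momentum=0.937,
    weight_decay=0.0005,
)
```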
compared with the original YOLOv8 target detection network, the improved YOLOv8 network provided by the invention can realize accurate detection of small target objects under a complex background on the detection task of small targets of an unmanned aerial vehicle, and reduces the parameter quantity and the calculation quantity. Firstly, a trunk which increases the number of the convolution of the shallow extraction features is designed, the extraction of the shallow information by the network is enhanced, the full-dimensional dynamic convolution is utilized for encoding, and the extraction capability of the network to the features of the small target is effectively improved. Secondly, a feature fusion module is provided to further enhance the multi-layer and feature fusion capability of the network. Thirdly, a neck structure is designed, shallow information extraction is increased, and the mining capability of the network on small target position information is enhanced.
The method detects small targets of the unmanned aerial vehicle in low-altitude scenes with heavy ground-object occlusion and complex backgrounds, and the deep learning approach reduces the manpower and time cost of manually collecting and processing data. Data enhancement is used to obtain more comprehensive, higher-quality data.
Drawings
FIG. 1 is a schematic flow chart of the overall architecture of the present invention;
FIG. 2 is a block diagram of the method of the present invention;
FIG. 3 is a diagram of the improved YOLOv8 network of the present invention;
FIG. 4 is a structural diagram of the CSF module of the present invention;
FIG. 5 is a graph comparing the evaluation indexes before and after the model improvement.
Detailed Description
The following clearly and completely describes the technical solutions in the embodiments of the present invention with reference to the accompanying drawings. The described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by a person skilled in the art based on the embodiments of the present invention without inventive effort fall within the protection scope of the present invention.
A method for detecting low-slow-small targets based on edge computing, as shown in fig. 2, comprises the following steps:
s1, collecting image sets of different targets under different exposure degrees, and processing the image sets to obtain a low-small slow target data set;
specifically, images of different targets under different exposure degrees are acquired through a camera, low and slow targets in the images are marked by a marking tool, so that in order to enhance the generalization, the mosaine and mixup combined data of yolov4 are referenced, the data dimension is enhanced, and the image fuzzy data of different degrees are increased according to different fuzzification of the small targets, so that the detection precision of the fuzzy data is improved.
S2, designing a backbone that enhances the network's extraction of shallow information and encodes with full-dimensional dynamic convolution.
Specifically, through extensive experiments we find that downsampling improves translation invariance, avoids overfitting and reduces computational cost. However, small objects occupy very few pixels, and downsampling may remove the very features that identify them. The only way to preserve information about small features is to encode them with convolution filters in the earliest layers and pass this information on to subsequent layers. In existing backbones, however, the number of shallow convolution filters is kept to a minimum to reduce the computational burden, which may lose the key discriminative features of small targets.
The original CSPDarkNet53 reduces the feature map size by a factor of 4 within 2 convolution layers. Using such a backbone for tiny object detection may cause tiny object information to disappear from the feature map before it is completely extracted. To solve this problem, we propose ITNet. Compared with the original backbone, the number of convolution kernels for feature extraction is increased in the shallow layers, while the number of kernels is decreased in the deep layers to improve computational efficiency. Furthermore, we use the full-dimensional dynamic convolution ODConv for downsampling, so as to preserve the full-dimensional information of objects.
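A minimal sketch of this inverted-triangle idea (wider shallow stages, narrower deep stages) is given below; the width schedule is an assumption, and the actual ITNet would use ODConv rather than plain convolution for its downsampling steps, as stated above:

```python
import torch.nn as nn

def itnet_stem(widths=(128, 96, 64, 48)):
    """Inverted-triangle stem sketch: the filter count *decreases* with
    depth instead of increasing, keeping more shallow-layer capacity
    for small-object details. Widths are illustrative assumptions."""
    layers, c_in = [], 3
    for c_out in widths:
        layers += [nn.Conv2d(c_in, c_out, 3, stride=2, padding=1),
                   nn.BatchNorm2d(c_out),
                   nn.SiLU()]
        c_in = c_out
    return nn.Sequential(*layers)
```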
S3, designing a feature fusion module CSF based on upper and lower layers, and designing a neck structure SGFPN that retains more shallow, high-resolution information.
Specifically, YOLOv8 uses PANet for feature fusion, fusing top-down and bottom-up feature layers of different scales. The PAN structure in YOLOv8 uses top-down and bottom-up paths with lateral connections: the top-down path upsamples spatially coarser but semantically stronger feature maps to yield higher-resolution features, which are then enhanced by fusing them through lateral connections with features at the same level. Each lateral connection merges feature maps of the same spatial size from the bottom-up path and the top-down path.
Shallow feature maps have lower-level semantics, but their activations are more accurately localized because they are downsampled fewer times; therefore this layer's feature map is also fused when multi-scale features are acquired, and a detection head is added. Adding this extra detection head brings a very significant performance improvement for small target detection, although computation and memory costs increase.
In addition, P2 is downsampled only four times relative to the input picture and contains much interference information, so we use a feature fusion module to better extract features.
GFPN enhances feature interaction through queen-fusion, but it also introduces a large number of additional upsampling and downsampling operations, which are disadvantageous for small targets because their features are easily lost during sampling. Skip-layer connections pass information from early nodes to later stages, yet that information almost entirely reaches the subsequent layers through lateral transmission anyway; keeping such connections produces redundant information transfer while introducing more parameters and computation, reducing model efficiency. To further investigate effective multi-scale feature fusion and achieve a better target detection effect, the connection scheme of the feature fusion layers is improved: the structure adds cross-scale links and uses a modified giraffe feature pyramid network for feature fusion.
The SGFPN of the invention retains more small target information in the upper layers by adding fusion of upper-layer features. It can integrate more features, realize multi-scale feature fusion, and obtain a larger receptive field and accurate object positions. After adding the P2 layer, an upsampling step is added on the P3 layer and connected laterally to the P2 layer; the F3, F4 and F5 nodes are connected to the P2, P3 and P4 nodes respectively, and the N3, N4 and N5 nodes are connected to the F2, F3 and F4 nodes respectively. Adding these connections fuses the features better. The final improved structure is shown in figure 3.
The fusion module used in the present invention is CSF (Cross-Scale Fusion), which fuses the incoming multi-layer feature maps; the structure of the CSF module is shown in fig. 4. The original feature fusion module adopts simple channel concatenation, merely stacking the features. To introduce context information and refine the feature maps, we propose the feature fusion module CSF, applied to each scale feature at level k,
where Concat() refers to the concatenation of the feature maps generated in all previous layers, and Conv1() represents a 3×3 convolution,
and Conv2() represents a 1×1 convolution, and
BasicBlock(P1) = Conv1(RepConv(P1))
where RepConv is a convolution block that combines a 3×3 convolution, a 1×1 convolution and an identity mapping in one convolution layer; its structure is shown in fig. 4. RepConv can learn rich features: it is a multi-branch structure, so performance is improved through the multiple branches during training, while for inference, structural re-parameterization converts it into a plain straight-through structure of 3×3 convolution plus ReLU activation, accelerating inference.
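The training-time/inference-time duality described here can be sketched as follows; BatchNorm folding, which a full RepConv also performs, is omitted, so this is a simplified illustration rather than the patent's exact block:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RepConv(nn.Module):
    """Train with three branches (3x3 conv, 1x1 conv, identity), then
    fold them into a single 3x3 kernel for inference."""
    def __init__(self, c: int):
        super().__init__()
        self.conv3 = nn.Conv2d(c, c, 3, padding=1, bias=True)
        self.conv1 = nn.Conv2d(c, c, 1, bias=True)

    def forward(self, x):
        return F.relu(self.conv3(x) + self.conv1(x) + x)

    def fuse(self) -> nn.Conv2d:
        """Merge the branches: pad the 1x1 kernel to 3x3 and add an
        identity kernel, so one conv reproduces the three-branch sum."""
        c = self.conv3.out_channels
        w = self.conv3.weight.clone()
        w += F.pad(self.conv1.weight, [1, 1, 1, 1])   # 1x1 -> center of 3x3
        eye = torch.zeros_like(w)
        for i in range(c):
            eye[i, i, 1, 1] = 1.0                     # identity as a 3x3 kernel
        fused = nn.Conv2d(c, c, 3, padding=1, bias=True)
        fused.weight.data = w + eye
        fused.bias.data = self.conv3.bias.data + self.conv1.bias.data
        return fused
```

After calling fuse(), the single returned 3×3 convolution reproduces the three-branch sum exactly, which is what allows the multi-branch training structure to be collapsed into the straight-through inference structure.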
Finally, the gradient flow is truncated to prevent the different layers from learning duplicate gradient information:
Pout = Concat(P1, P2, P3, BasicBlock(P1), BasicBlock^2(P1), BasicBlock^3(P1))
where BasicBlock^n(P1) denotes n cascaded applications of BasicBlock(). PConv refers to a depthwise convolution with a 3×3 kernel, used to capture the important local spatial regions of each channel.
The CSF module retains the advantages of RepConv feature reuse and structural re-parameterization while truncating the gradient flow to prevent excessive duplicate gradient information; it fuses the various feature maps well and accelerates inference.
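Putting the pieces together, a simplified CSF forward pass consistent with the Pout formula above might look like this; the channel bookkeeping is an assumption, and the RepConv branch is represented by its fused inference form for brevity:

```python
import torch
import torch.nn as nn

class BasicBlock(nn.Module):
    """BasicBlock(P1) = Conv1(RepConv(P1)); the RepConv branch is stood
    in for by its fused inference form (3x3 conv + ReLU)."""
    def __init__(self, c: int):
        super().__init__()
        self.rep = nn.Sequential(nn.Conv2d(c, c, 3, padding=1), nn.ReLU())
        self.conv1 = nn.Conv2d(c, c, 3, padding=1)   # Conv1: 3x3 convolution

    def forward(self, x):
        return self.conv1(self.rep(x))

class CSF(nn.Module):
    """Pout = Concat(P1, P2, P3, B(P1), B^2(P1), B^3(P1))."""
    def __init__(self, c: int, depth: int = 3):
        super().__init__()
        self.blocks = nn.ModuleList(BasicBlock(c) for _ in range(depth))
        self.out = nn.Conv2d(c * (depth + 3), c, 1)  # fuse back to c channels

    def forward(self, p1, p2, p3):
        feats, y = [p1, p2, p3], p1
        for block in self.blocks:
            y = block(y)          # B(P1), then B(B(P1)), then B(B(B(P1)))
            feats.append(y)
        return self.out(torch.cat(feats, dim=1))
```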
In summary, these modules together form the Backbone part of the YOLOv8 network structure and are used to extract and fuse multi-scale feature information, supporting the accuracy and robustness of the target detection task.
A virtual environment for model training is built on a GPU server, and the training set is input into the improved YOLOv8 network structure for target detection model training. After training is completed, a deep learning model for unmanned aerial vehicle image recognition and detection is obtained; the validation set is then input into this model for verification, the model is optimized according to the verification results, and the best-performing deep learning model for unmanned aerial vehicle image recognition and detection is finally obtained.
In one embodiment, a 3×3 convolution and a 1×1 convolution are used as the final output module of the YOLOv8 network. The detected feature maps at three different pixel scales are each input into a YOLO Head for decoding: global features are extracted through the 3×3 convolution layer, the 1×1 convolution layer acts as a fully connected layer, and the prediction bounding box, confidence value and category are finally computed. After the YOLO Head, the loss function value of the detection model is minimized through iterative calculation, and when the training iterations are completed, the model with the highest detection precision is selected as the final detection model.
The modified structures and modules are stacked in sequence following the original YOLOv8 network structure form, so as to obtain the improved YOLOv8 network structure. The model training parameters include:
an initial learning rate of 0.01, momentum set to 0.937, weight decay set to 0.0005, a training threshold of 0.2, picture sizes all normalized to 640×640, 300 iterations, and a batch size of 16.
The data set dividing method is as follows: 60% of the data is used as the training set, 20% as the validation set, and 20% as the test set.
The bounding box loss is calculated using CIoU, the objectness and category losses are calculated using cross entropy, and back-propagation updates the model parameters.
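As a small illustration of these loss terms: torchvision ships a CIoU loss (assuming torchvision >= 0.15), and the objectness/category terms can be computed as binary cross entropy on logits; the tensors below are toy values:

```python
import torch
from torchvision.ops import complete_box_iou_loss

# CIoU bounding-box loss; boxes are in (x1, y1, x2, y2) format.
pred_boxes = torch.tensor([[10., 10., 50., 50.]], requires_grad=True)
true_boxes = torch.tensor([[12., 8., 48., 52.]])
box_loss = complete_box_iou_loss(pred_boxes, true_boxes, reduction="mean")

# objectness / class losses as binary cross entropy on raw logits
obj_logits = torch.tensor([2.0], requires_grad=True)
obj_target = torch.tensor([1.0])
obj_loss = torch.nn.functional.binary_cross_entropy_with_logits(
    obj_logits, obj_target)

(box_loss + obj_loss).backward()   # back-propagate the combined loss
```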
It should be emphasized that the examples described herein are illustrative rather than limiting, and therefore the invention includes, but is not limited to, the examples described in the detailed description, as other embodiments derived from the technical solutions of the invention by a person skilled in the art are equally within the scope of the invention.
It will be understood that the invention has been described in terms of several embodiments, and that various changes and equivalents may be made to these features and embodiments by those skilled in the art without departing from the spirit and scope of the invention. In addition, many modifications may be made to adapt a particular situation or material to the teachings of the invention without departing from the essential scope thereof. Therefore, it is intended that the invention not be limited to the particular embodiment disclosed, but that the invention will include all embodiments falling within the scope of the appended claims.
Claims (5)
1. An unmanned aerial vehicle image small target detection method based on improved YOLOv8 is characterized by comprising the following steps:
step 1, obtaining data from pictures shot by an unmanned aerial vehicle in a real living environment, labeling ten categories of targets such as people and vehicles, establishing an unmanned aerial vehicle picture data set, and applying the Mosaic data enhancement method to the data set;
step 2, taking the YOLOv8 network structure as the reference network, introducing the backbone network ITNet (Inverted Triangle Net), replacing Conv convolutions with the dynamic convolution ODConv, using the neck module SGFPN, introducing the feature fusion module CSF, and replacing nearest-neighbor sampling with the CARAFE up-sampling method; the improved YOLOv8 network structure is used as the unmanned aerial vehicle image recognition network, and a deep learning model for unmanned aerial vehicle small target recognition and detection is obtained through training;
and step 3, inputting the unmanned aerial vehicle image to be detected into the deep learning model for unmanned aerial vehicle small target recognition and detection.
2. The unmanned aerial vehicle image small target detection method based on improved YOLOv8 of claim 1, wherein step 1 is specifically implemented as follows:
step 1.1, acquiring images shot by an unmanned aerial vehicle in a real environment through a mobile terminal device, and labeling the acquired images using the LabelImg tool;
step 1.2, performing data enhancement on the data set using the Mosaic data enhancement method, and establishing the unmanned aerial vehicle image data set.
3. The unmanned aerial vehicle image small target detection method based on improved YOLOv8 of claim 2, wherein the Mosaic data augmentation method performs a series of image processing operations on a given image file, including randomly selecting a plurality of pictures, random scaling, random arrangement, stitching, cropping, horizontal flipping, 90-degree rotation, decreasing the image brightness, increasing the image brightness, blurring the image, adding salt-and-pepper noise, and adding Gaussian noise.
4. The unmanned aerial vehicle image small target detection method based on improved YOLOv8 of claim 1, wherein step 2 is specifically implemented as follows: the number of convolution kernels for feature extraction is increased in the shallow layers, while the number of kernels is decreased in the deep layers to improve computational efficiency; furthermore, the full-dimensional dynamic convolution ODConv is used for downsampling, so as to preserve the full-dimensional information of objects.
5. The unmanned aerial vehicle image small target detection method based on improved YOLOv8 of claim 1, wherein step 2 is further specifically implemented as follows:
the fusion module used is CSF (Cross-Scale Fusion), which fuses the incoming multi-layer feature maps; for each scale feature at level k,
where Concat() refers to the concatenation of the feature maps generated in all previous layers, and Conv1() represents a 3×3 convolution,
and Conv2() represents a 1×1 convolution, and
BasicBlock(P1) = Conv1(RepConv(P1))
where RepConv is a convolution block that combines a 3×3 convolution, a 1×1 convolution and an identity mapping in one convolution layer;
finally, the gradient flow is truncated to prevent the different layers from learning duplicate gradient information:
Pout = Concat(P1, P2, P3, BasicBlock(P1), BasicBlock^2(P1), BasicBlock^3(P1))
where BasicBlock^n(P1) denotes n cascaded applications of BasicBlock(); PConv refers to a depthwise convolution with a 3×3 kernel, used to capture the important local spatial regions of each channel.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title
---|---|---|---
CN202311456286.XA | 2023-11-03 | 2023-11-03 | Unmanned aerial vehicle image small target detection method based on improved YOLOv8
Publications (1)
Publication Number | Publication Date |
---|---|
CN117557774A | 2024-02-13
Family
ID=89813819
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202311456286.XA Pending CN117557774A (en) | 2023-11-03 | 2023-11-03 | Unmanned aerial vehicle image small target detection method based on improved YOLOv8 |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN117557774A (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN118247581A (en) * | 2024-05-23 | 2024-06-25 | 中国科学技术大学 | Method and device for labeling and analyzing gestures of key points of animal images |
CN118658047A (en) * | 2024-08-20 | 2024-09-17 | 成都唐源电气股份有限公司 | Small target detection method based on improved YOLOv model |
Legal Events
Date | Code | Title | Description
---|---|---|---
| PB01 | Publication |
| SE01 | Entry into force of request for substantive examination |