CN111723852B - Robust training method for target detection network - Google Patents
Robust training method for target detection network
- Publication number
- CN111723852B (grant) · CN202010480420.XA (application)
- Authority
- CN
- China
- Prior art keywords
- label
- mining
- network
- training
- target detection
- Prior art date
- 2020-05-30
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/25—Fusion techniques
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/047—Probabilistic or stochastic networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V2201/00—Indexing scheme relating to image or video recognition or understanding
- G06V2201/07—Target detection
Abstract
The invention relates to a robust training method for a target detection network, comprising the following steps: acquiring a training sample, wherein some of the detection targets on the training sample carry manual annotation boxes; performing feature extraction on the training sample using the target detection network, and generating proposal boxes on the training sample; marking the proposal boxes with original sampling labels, the original sampling labels comprising positive labels and negative labels; performing a pooling operation on the positively labeled proposal boxes via a pooling branch, and outputting a first region-of-interest feature; inputting the first region-of-interest feature into a mining network, wherein the mining network is a fully-connected neural network that generates a new proposal-box label, namely a mining label; fusing the mining label with the original sampling label to generate a gold label; and using the gold label to train the target detection network.
Description
Technical Field
The invention relates to the technical field of computer vision and target detection, in particular to a robust training method for a target detection network.
Background
In recent years, object detection frameworks based on convolutional neural networks (CNNs) have become a powerful tool for a variety of computer vision tasks and are widely applied to object localization and object counting. These CNN-based detection frameworks have improved continuously, and a number of excellent architectures have been proposed. Among them, region-based detection frameworks (e.g., Faster R-CNN, FPN), which include a region-proposal preprocessing step, are widely used for their more accurate detection performance. Many approaches also keep improving feature extractors by optimizing their network architectures. However, little work has addressed how to enhance training robustness under non-optimal parameters, or the trainability of the network under varying label quality.
Disclosure of Invention
The present application is proposed to solve the above technical problem, and provides a robust training method for a target detection network.
According to an aspect of the present application, there is provided a robust training method for a target detection network, including: acquiring a training sample image, wherein some of the detection targets on the training sample image carry manual annotation boxes; performing feature extraction on the training sample image using the target detection network, and generating proposal boxes on the training sample image; marking the proposal boxes with original sampling labels, the original sampling labels comprising positive labels and negative labels; performing a pooling operation on the positively labeled proposal boxes via a pooling branch, and outputting a first region-of-interest feature; inputting the first region-of-interest feature into a mining network, wherein the mining network is a fully-connected neural network that generates a new proposal-box label, namely a mining label; fusing the mining label with the original sampling label to generate a gold label; and using the gold label to train the target detection network.
Compared with the prior art, the robust training method for a target detection network adds proposal-box mining and label-fusion stages to the network's training process. This effectively overcomes mislabeled proposal boxes and excessive false positives among the samples, caused by missing manual annotation boxes or by thresholds (the first threshold and the second threshold) set too high or too low, and improves the robustness of the network training process against such interference.
Drawings
The above and other objects, features and advantages of the present application will become more apparent by describing in more detail embodiments of the present application with reference to the attached drawings. The accompanying drawings are included to provide a further understanding of the embodiments of the application and are incorporated in and constitute a part of this specification, illustrate embodiments of the application and together with the description serve to explain the principles of the application. In the drawings, like reference numbers generally represent like parts or steps.
FIG. 1 is a flow chart of the robust training method for a target detection network of the present invention;
FIG. 2 is a breakdown of the processing stages of FIG. 1;
FIG. 3 shows some examples of positive labels generated when training on the sparse VOC2007 training set;
FIG. 4 is a comparison graph (1) of results from a target detection network obtained by the common training method versus the training method proposed in the present application, under sparse COCO;
FIG. 5 is a comparison graph (2) of results from a target detection network obtained by the common training method versus the training method proposed in the present application, under COCO.
Detailed Description
Hereinafter, example embodiments of the present application will be described in detail with reference to the accompanying drawings. It should be apparent that the described embodiments are only a few embodiments of the present application, and not all embodiments of the present application, and it should be understood that the present application is not limited to the example embodiments described herein.
Summary of the application
Take Faster R-CNN as an example of a target detection network. During training, Faster R-CNN generates proposal boxes and computes the intersection-over-union (IoU) between each proposal box and the annotation boxes. If the IoU exceeds a manually set threshold, the proposal box is marked with a category label (positive sample); otherwise it is marked with a background label (negative sample), and the network is trained on these positive and negative samples. However, if manual annotation boxes are missing from an image, proposal boxes will be given erroneous labels. In addition, a non-optimal manually set threshold degrades positive/negative sampling: a threshold set too high loses too many positive samples and weakens the network's ability to recognize targets, while a threshold set too low admits too many false positives into the sampled set, interfering with normal network training and hurting final performance. An illustrative sketch of this sampling step is given below.
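By way of illustration only, the following is a minimal sketch of the IoU-based sampling just described; the function names, the (x1, y1, x2, y2) box format, and the threshold values are assumptions for exposition, not details fixed by the patent.

```python
import numpy as np

def iou(box, boxes):
    """IoU between one box and an array of boxes, all in (x1, y1, x2, y2) format."""
    x1 = np.maximum(box[0], boxes[:, 0])
    y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2])
    y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area_box = (box[2] - box[0]) * (box[3] - box[1])
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area_box + areas - inter)

def sample_labels(proposals, gt_boxes, gt_classes, first_thresh=0.5, second_thresh=0.5):
    """Original sampling: positive label above the first threshold,
    negative (background = 0) below the second, ignored (-1) in between."""
    labels = np.full(len(proposals), -1, dtype=np.int64)
    for i, p in enumerate(proposals):
        overlaps = iou(p, gt_boxes)
        best = overlaps.argmax()
        if overlaps[best] > first_thresh:
            labels[i] = gt_classes[best]   # positive: category of best-matching annotation
        elif overlaps[best] < second_thresh:
            labels[i] = 0                  # negative: background
    return labels
```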
Aiming at these technical problems, the invention seeks to improve the training robustness of a pathological-image detection network on training data of varying annotation quality and under non-optimal parameters. The core component of the invention is a neural network called the "mining network". The mining network learns the characteristics of positive samples and mines potential positive samples in the images. Because the mined positive samples typically include positive samples lost to non-optimal parameters and missing annotations, combining the mined positive samples with the originally sampled positive samples recovers the positive proposals that would otherwise be lost to improper manual parameter settings and missing annotations.
Having described the general principles of the present application, various non-limiting embodiments of the present application will now be described with reference to the accompanying drawings.
Exemplary method
As shown in fig. 1, the robust training method for a target detection network includes:
s10, acquiring a training sample image, wherein a part of detection targets on the training sample image carry a manual labeling frame;
s20, performing feature extraction on the training sample image by using a target detection network, and generating a suggestion box on the training sample image;
the task of the target detection network is to locate and identify an object from an image, and the image space is an Euclidean space which is not an effective feature separable space, so that a feature extractor is needed to be used for feature combination of field pixels of the image, and features of a larger range and even the whole image are mapped to a high-dimensional separable space. Because the performance of the network is closely related to the separability of the feature space, the backbone network of the target detection network often utilizes a mainstream classification network that has been widely verified to extract and combine features. The classification networks are usually pre-trained on a large-scale public data set, so that the search range of a parameter space of network parameters is effectively limited by a transfer learning mode, and the training difficulty of the network on a new target detection task is further reduced. Therefore, in the invention, the classification model ResNet101 after pre-training is used as a backbone network to execute a feature extraction task.
S30, marking original sampling labels on the proposal boxes: determine whether the IoU between a proposal box and a manual annotation box is larger than a set first threshold, and if so, mark the proposal box with a positive label; determine whether the IoU between the proposal box and the manual annotation box is smaller than a set second threshold, and if so, mark the proposal box with a negative label. Both the positive labels and the negative labels are original sampling labels;
Proposal boxes that are neither positive nor negative do not help network training, so the number of positive labels is crucial to training the detector.
S40, adopting two pooling branches to separately pool the proposal boxes marked with positive labels, and outputting a first region-of-interest feature and a second region-of-interest feature;
The feature-map region corresponding to a positive label is called a region of interest (RoI). Two parallel pooling branches pool the feature-map region of each RoI and output, respectively, the RoI feature used for mining (the first RoI feature) and the RoI feature used for detection (the second RoI feature). The two parallel branch structures of RoI pooling ensure that the mining process does not interfere with the training process of the detector, as sketched below. The RoI feature used for detection (the second region-of-interest feature) is input into the target detection network, which outputs the detection result.
S50, inputting the first region-of-interest feature into a mining network, where the mining network is a fully-connected neural network that generates a new proposal-box label, namely the mining label;
the mining network is a fully-connected neural network, the input of which is the RoI characteristic used for mining, and the hidden layer of which can be one or more layers. The mining network outputs a probability distribution (mining score) with the suggested box category activated by softmax, then the mining score is subjected to one-hot coding, and a suggested box mining label represented by m is generated. This process can be expressed as:
wherein, m represents the suggestion box mining label,is to excavate the network(s),is the input RoI feature used for mining.
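A minimal sketch of such a mining head follows; the input dimension, hidden width, and class count are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MiningNetwork(nn.Module):
    """Sketch of the fully-connected mining network: RoI feature -> mining label."""
    def __init__(self, in_dim=2048 * 7 * 7, hidden=1024, num_classes=21):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Flatten(),
            nn.Linear(in_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, num_classes),
        )

    def forward(self, roi_feat):
        logits = self.layers(roi_feat)
        scores = F.softmax(logits, dim=1)                    # mining scores
        m = F.one_hot(scores.argmax(dim=1), scores.size(1))  # mining label m = onehot(M(f))
        return scores, m
```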
S60, fusing the mining label with the original sampling label to generate a gold label, the gold label serving as the final label of the proposal box;
and (4) the label obtained by fusing the mining label of the suggestion box and the original sampling label is called a gold label, and the gold label is used as a real label for detection training. By generating the gold tags through the merging operation, it can be ensured that the performance of the probe is not affected even under the worst condition (excavation network is invalid).
Specifically, the gold tag (g) is the union of the original sampling tag (a) and the suggestion box mining tag (m). Some false negative tags (which should be positive but sampled negative) in the original sample tags will be corrected by the suggestion box mining tag by the merge operation. Therefore, lost due to improper manual thresholds and missing annotations will be recoveredA number of positive tags, the tag merge process can be expressed as:,
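For illustration, the recovery effect of the union might look like the sketch below, with labels stored as class indices (0 = background) rather than one-hot vectors for brevity; the function name is hypothetical.

```python
import torch

def fuse_labels(a, m):
    """Gold label g = a ∪ m. Where the original sampling label a is background
    but the mining label m is a foreground class (a false negative), the
    mining label corrects it back to positive."""
    recovered = (a == 0) & (m != 0)
    g = a.clone()
    g[recovered] = m[recovered]
    return g

# Example: proposal 1 was sampled as background but mined as class 3 -> recovered.
a = torch.tensor([2, 0, 0])
m = torch.tensor([2, 3, 0])
print(fuse_labels(a, m))  # tensor([2, 3, 0])
```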
S70, using the proposal boxes corresponding to the gold labels to train the target detection network. The total loss for network training is given by:

$$L = L_{cls}(p, g) + L_{loc} + L_{mine}$$

where p is the final output probability distribution after softmax activation, $L_{loc}$ is the localization loss, and $L_{cls}$ is the cross-entropy loss, which can be expressed as:

$$L_{cls}(p, g) = -\frac{1}{N}\sum_{i=1}^{N} g_i \log p_i$$

where N is the number of proposal boxes, $p_i$ is the classification probability distribution of the proposal box indexed by i, output by the Fast R-CNN branch, and $g_i$ is the gold label of the proposal box indexed by i; the original sampling labels are thus refined through the gold labels.

$L_{mine}$ denotes the mining loss, i.e. the cross-entropy loss for training the mining network:

$$L_{mine} = -\frac{1}{N}\sum_{i=1}^{N} a_i \log \hat{m}_i$$

where $a_i$ is the assigned (sampled) label indexed by i and $\hat{m}_i$ is the mining-network output indexed by i. Note that the labels used to train the mining network are the sampling labels. Each training step typically yields hundreds of labeled proposals, so hundreds of sampled labels are available, ensuring that the mining network can adequately learn the characteristics of positive labels; the gold labels can then be used to train the whole target detection network. The overall loss function comprises a classification loss and a localization loss, where in this application the classification loss comprises the cross-entropy loss $L_{cls}$ and the mining loss $L_{mine}$; the localization loss follows the conventional calculation and is not described again here.
As shown in FIG. 2, a generic R-CNN training process is drawn with dotted lines: proposal boxes are obtained by refining the positions of default recommended regions (e.g., the "anchors" in Faster R-CNN); each proposal box is then assigned a category label (or background label) and used as a training sample for the detector. The present application adds proposal-box mining and label fusion to this training process, drawn with chain lines in FIG. 2. This effectively overcomes mislabeled proposal boxes and excessive false positives caused by missing manual annotation boxes or by thresholds (the first threshold and the second threshold) set too high or too low, and improves the robustness of the network training process against such interference.
To verify the validity of this patent, experiments were performed on the PASCAL VOC 2007 and MS COCO 2017 datasets. PASCAL VOC 2007 consists of 5k training images and 5k test images covering about 20 object classes. The COCO dataset contains about 118k training images and 5k validation images, and testing uses the validation set. A sparse dataset is created manually by deleting annotations at random until each training image retains only one annotation per class, as shown in FIG. 3(a) (sparse annotation). Sparsification is applied only to the PASCAL and COCO training sets; the PASCAL test set and the COCO validation set are kept complete.
1. Experimental parameters and details:
In the experiments, the target detection network is Faster R-CNN with a ResNet101 feature extractor pre-trained on ImageNet. Training runs for 150k steps on PASCAL and 1500k steps on COCO with a batch size of 1. The learning rate is initially set to 0.0001 and is divided by 10 at 60k and 80k steps on PASCAL and at 600k and 800k steps on COCO. During training, images are scaled so that the short side is 600 pixels and the long side is at most 1000 pixels. In addition, images are randomly flipped horizontally to augment the training data. A proposal box whose IoU with an annotation is higher than 0.5 is assigned a positive label; otherwise it is negative.
2. Quantitative results:
TABLE 1: Faster R-CNN trained on the PASCAL training set; mean average precision (mAP) and average recall (AR) results evaluated on the PASCAL test set
Table 1 lists the results evaluated on the PASCAL 2007 test set. Under training on sparse PASCAL, the method of this patent improves mAP (mean average precision) by 3.0% and AR by 2.1%. Meanwhile, the method achieves a 0.7% AR (average recall) improvement on the original PASCAL.
TABLE 2: Average precision (AP) results using Faster R-CNN trained on the MS COCO training set and evaluated on the MS COCO validation set
Table 2 shows the results evaluated on the COCO validation set: the method of the invention increases AP by 1.6% and 1.0% when trained on sparse COCO and on complete COCO, respectively. In addition, the method improves AP@0.5 by 3.0% and 2.5% under sparse and complete COCO respectively, where AP@0.5 denotes the result at the single IoU threshold of 0.5. AP-s, AP-m, and AP-l are the AP indices for small, medium, and large targets, respectively.
3. Robustness analysis:
TABLE 3: Average recall (AR) results using Faster R-CNN trained on the MS COCO training set and evaluated on the MS COCO validation set
In Table 3, the AR results of the present invention (19.7 and 25.7) improve over the original Faster R-CNN (17.4 and 23.5). This section explores the training performance of the target detection network and the effectiveness of the invention under different IoU thresholds. The number of positive proposal boxes at different IoU thresholds during the last training period on the PASCAL training set is counted, and the average number of positive proposals per image is reported. The mAP results of networks trained on the PASCAL training set and evaluated on the test set are also given.
TABLE 4: Average number of positive proposal boxes over the last training period (at different IoU thresholds), and mAP results evaluated on the PASCAL test set
As shown in Table 4, the mAP results of the method of the invention outperform Faster R-CNN except when the IoU threshold is 0.3. Moreover, as the IoU threshold increases, the method achieves increasingly significant mAP gains: at IoU thresholds of 0.6, 0.7, and 0.8, the mAP improvements are 1.0%, 2.7%, and 6.8%, respectively.
4. Qualitative results
FIG. 4 and FIG. 5 illustrate some detection results produced by the method of this patent, compared against Faster R-CNN. Faster R-CNN trained on the sparse COCO dataset tends to miss some objects (red dashed boxes), an error the method of this patent largely avoids. Meanwhile, the method of this patent obtains more accurate predictions on the COCO dataset.
The previous description of the disclosed aspects is provided to enable any person skilled in the art to make or use the present application. Various modifications to these aspects will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other aspects without departing from the scope of the application. Thus, the present application is not intended to be limited to the aspects shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein. The foregoing description has been presented for purposes of illustration and description. Furthermore, the description is not intended to limit embodiments of the application to the form disclosed herein. While a number of example aspects and embodiments have been discussed above, those of skill in the art will recognize certain variations, modifications, alterations, additions and sub-combinations thereof.
Claims (2)
1. A robust training method for a target detection network, characterized by comprising the following steps:
acquiring a training sample image, wherein some of the detection targets on the training sample image carry manual annotation boxes;
performing feature extraction on the training sample image using the target detection network, and generating proposal boxes on the training sample image;
marking the proposal boxes with original sampling labels, wherein the original sampling labels comprise positive labels and negative labels;
performing a pooling operation on the positively labeled proposal boxes via a pooling branch, and outputting a first region-of-interest feature;
inputting the first region-of-interest feature into a mining network, wherein the mining network is a fully-connected neural network that generates a new proposal-box label, namely a mining label;
fusing the mining label with the original sampling label to generate a gold label;
using the gold label for training of the target detection network;
the generation process of the mining label comprises the following steps: inputting the first region of interest feature into a mining network, outputting the probability distribution with the suggested box category activated by softmax by the mining network, performing one-hot coding on the probability distribution, and generating the mining tag, which is specifically represented as:
wherein, m represents a digging label,it is shown that the network is mined,representing a first region of interest feature;
the gold label being the union of the original sampling label and the mining label, wherein, through the union operation, false-negative labels in the original sampling labels, i.e. labels that should be positive but were marked negative, are corrected by the mining label and restored to positive labels.
2. The robust training method for a target detection network as claimed in claim 1, wherein the loss function for target detection network training is:

$$L = L_{cls}(p, g) + L_{loc} + L_{mine}$$

where p is the softmax-activated probability distribution over the proposal-box categories and g denotes the gold label;

$$L_{cls}(p, g) = -\frac{1}{N}\sum_{i=1}^{N} g_i \log p_i$$

where N denotes the number of proposal boxes, $p_i$ denotes the softmax-activated probability distribution over the proposal-box categories indexed by i, and $g_i$ denotes the gold label indexed by i.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010480420.XA CN111723852B (en) | 2020-05-30 | 2020-05-30 | Robust training method for target detection network |
Publications (2)
Publication Number | Publication Date |
---|---|
CN111723852A CN111723852A (en) | 2020-09-29 |
CN111723852B true CN111723852B (en) | 2022-07-22 |
Family
ID=72565402
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010480420.XA Active CN111723852B (en) | 2020-05-30 | 2020-05-30 | Robust training method for target detection network |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111723852B (en) |
Families Citing this family (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113221970A (en) * | 2021-04-25 | 2021-08-06 | 武汉工程大学 | Deep convolutional neural network-based improved multi-label semantic segmentation method |
CN114612717B (en) * | 2022-03-09 | 2023-05-26 | 四川大学华西医院 | AI model training label generation method, training method, using method and equipment |
CN117572531B (en) * | 2024-01-16 | 2024-03-26 | 电子科技大学 | Intelligent detector embedding quality testing method and system |
Family Cites Families (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US10657364B2 (en) * | 2016-09-23 | 2020-05-19 | Samsung Electronics Co., Ltd | System and method for deep network fusion for fast and robust object detection |
US10579897B2 (en) * | 2017-10-02 | 2020-03-03 | Xnor.ai Inc. | Image based object detection |
US11144065B2 (en) * | 2018-03-20 | 2021-10-12 | Phantom AI, Inc. | Data augmentation using computer simulated objects for autonomous control systems |
KR20200052446A (en) * | 2018-10-30 | 2020-05-15 | 삼성에스디에스 주식회사 | Apparatus and method for training deep learning model |
Patent Citations (11)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106599939A (en) * | 2016-12-30 | 2017-04-26 | 深圳市唯特视科技有限公司 | Real-time target detection method based on region convolutional neural network |
CN108197687A (en) * | 2017-12-27 | 2018-06-22 | 江苏集萃智能制造技术研究所有限公司 | A kind of webpage two-dimensional code generation method |
CN108416287A (en) * | 2018-03-04 | 2018-08-17 | 南京理工大学 | A kind of pedestrian detection method excavated based on omission negative sample |
CN108875819A (en) * | 2018-06-08 | 2018-11-23 | 浙江大学 | A kind of object and component associated detecting method based on shot and long term memory network |
WO2019238976A1 (en) * | 2018-06-15 | 2019-12-19 | Université de Liège | Image classification using neural networks |
CN108960143A (en) * | 2018-07-04 | 2018-12-07 | 北京航空航天大学 | Detect deep learning method in a kind of naval vessel in High Resolution Visible Light remote sensing images |
CN109285139A (en) * | 2018-07-23 | 2019-01-29 | 同济大学 | A kind of x-ray imaging weld inspection method based on deep learning |
CN109800778A (en) * | 2018-12-03 | 2019-05-24 | 浙江工业大学 | A kind of Faster RCNN object detection method for dividing sample to excavate based on hardly possible |
CN110610210A (en) * | 2019-09-18 | 2019-12-24 | 电子科技大学 | Multi-target detection method |
CN110716792A (en) * | 2019-09-19 | 2020-01-21 | 华中科技大学 | Target detector and construction method and application thereof |
CN111091105A (en) * | 2019-12-23 | 2020-05-01 | 郑州轻工业大学 | Remote sensing image target detection method based on new frame regression loss function |
Non-Patent Citations (2)
Title |
---|
Study of object detection based on Faster R-CNN; BIN LIU et al.; 2017 Chinese Automation Congress (CAC); 2017-10-22; pp. 6233-6236 *
Grasp point recognition for irregular 3D objects based on improved Mask RCNN; Tang Boheng; China Master's Theses Full-text Database (Information Science and Technology); 2019-08-15 (No. 08); pp. I138-661 *
Also Published As
Publication number | Publication date |
---|---|
CN111723852A (en) | 2020-09-29 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant |