CN114972711B - Improved weak supervision target detection method based on semantic information candidate frame - Google Patents
- Publication number
- CN114972711B (application CN202210393075.5A)
- Authority
- CN
- China
- Prior art keywords
- target
- network
- candidate
- branch
- category
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Links
- 238000001514 detection method Methods 0.000 title claims abstract description 82
- 238000000034 method Methods 0.000 claims abstract description 59
- 230000000873 masking effect Effects 0.000 claims abstract description 27
- 238000007781 pre-processing Methods 0.000 claims abstract description 6
- 230000004044 response Effects 0.000 claims description 31
- 238000012549 training Methods 0.000 claims description 19
- 238000012360 testing method Methods 0.000 claims description 10
- 238000013461 design Methods 0.000 claims description 9
- 230000008569 process Effects 0.000 claims description 9
- 238000011176 pooling Methods 0.000 claims description 8
- 239000013598 vector Substances 0.000 claims description 4
- 230000015572 biosynthetic process Effects 0.000 claims description 3
- 238000012544 monitoring process Methods 0.000 claims description 3
- 238000003786 synthesis reaction Methods 0.000 claims description 3
- 238000012805 post-processing Methods 0.000 claims description 2
- 238000012545 processing Methods 0.000 abstract description 3
- 238000011156 evaluation Methods 0.000 abstract description 2
- 230000006870 function Effects 0.000 description 26
- 238000002474 experimental method Methods 0.000 description 3
- 230000004913 activation Effects 0.000 description 2
- 230000008859 change Effects 0.000 description 2
- 238000013527 convolutional neural network Methods 0.000 description 2
- 238000000605 extraction Methods 0.000 description 2
- 238000012986 modification Methods 0.000 description 2
- 230000004048 modification Effects 0.000 description 2
- 230000001629 suppression Effects 0.000 description 2
- 238000013528 artificial neural network Methods 0.000 description 1
- 230000009286 beneficial effect Effects 0.000 description 1
- 238000013135 deep learning Methods 0.000 description 1
- 238000010586 diagram Methods 0.000 description 1
- 230000000694 effects Effects 0.000 description 1
- 238000003709 image segmentation Methods 0.000 description 1
- 230000002427 irreversible effect Effects 0.000 description 1
- 238000002372 labelling Methods 0.000 description 1
- 238000013507 mapping Methods 0.000 description 1
- 230000007246 mechanism Effects 0.000 description 1
- 238000010606 normalization Methods 0.000 description 1
- 238000010845 search algorithm Methods 0.000 description 1
- 230000000007 visual effect Effects 0.000 description 1
- 238000012800 visualization Methods 0.000 description 1
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/20—Image preprocessing
- G06V10/22—Image preprocessing by selection of a specific region containing or referencing a pattern; Locating or processing of specific regions to guide the detection or recognition
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/20—Image preprocessing
- G06V10/24—Aligning, centring, orientation detection or correction of the image
- G06V10/242—Aligning, centring, orientation detection or correction of the image by image rotation, e.g. by 90 degrees
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/20—Image preprocessing
- G06V10/32—Normalisation of the pattern dimensions
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/40—Extraction of image or video features
- G06V10/46—Descriptors for shape, contour or point-related descriptors, e.g. scale invariant feature transform [SIFT] or bags of words [BoW]; Salient regional features
- G06V10/462—Salient features, e.g. scale invariant feature transforms [SIFT]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/764—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/77—Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
- G06V10/774—Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/70—Labelling scene content, e.g. deriving syntactic or semantic representations
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V2201/00—Indexing scheme relating to image or video recognition or understanding
- G06V2201/07—Target detection
Abstract
The present invention claims an improved weakly supervised target detection method based on semantic-information candidate boxes, belonging to the field of image processing. The method comprises the following steps. S1: input the image data to be processed and apply preprocessing steps including random horizontal flipping. S2: design a combined backbone network that fuses features from the masked and non-masked network branches. S3: design a multi-branch detection head network based on a multi-instance selection algorithm. S4: design target semantic candidate boxes, generating more reasonable target candidate boxes by cyclically masking the target semantic information produced by the network model. S5: perform detection on natural data sets. The method is evaluated on the challenging PASCAL VOC 2007 and 2012 public data sets. Experimental results show that the proposed method achieves good performance on PASCAL VOC 2007 and 2012 and outperforms many state-of-the-art weakly supervised target detection methods.
Description
Technical Field
The present invention belongs to the field of image processing and relates to a weakly supervised target detection method that combines a combined backbone network driven by semantic-information candidate boxes with a multi-branch detection head network based on a multi-instance selection algorithm.
Background Art
Convolutional neural network models have achieved remarkable performance on many computer vision tasks such as image classification, target detection, and image segmentation; accordingly, advanced networks with various strengths have been proposed in recent years to further improve performance on these tasks. In target detection, however, most approaches are fully supervised methods built on large numbers of anchor boxes or anchor points, which forces researchers to spend considerable effort accurately labeling the coordinate box of every object before training.
To avoid the need for large numbers of ground-truth coordinate box labels, target detection methods based on weakly supervised learning have been proposed in recent years, since they require only image-level category labels. A weakly supervised target detection model is trained on image-level category labels using the information of all candidate boxes, and after training it selects reasonable candidate boxes for output, so far fewer labeling resources are needed. Many current weakly supervised target detection models adopt multi-instance learning and end-to-end model structures to build a connection between image-level category labels and the candidate boxes in an image. For example, the online instance classifier refinement method, an effective weakly supervised target detection model, first trains all candidate boxes with the limited image-level category information and then introduces pseudo ground-truth target box information derived from the image-level categories to focus further on a subset of candidate boxes.
Although the performance of weakly supervised target detection has improved greatly, these methods still suffer from three problems. First, regarding candidate boxes, the target-related candidate boxes are of poor quality and small in number, which hinders convergence of the model during training. Second, regarding features, the target features extracted by the backbone show high responses only in local target regions, so predicted candidate boxes located in such a region receive higher confidence than other candidate boxes that cover the complete target. Third, regarding the detection branches, keeping only one pseudo ground-truth target box per category is unreasonable, because an image may contain several objects of the same category at different positions.
Summary of the Invention
The present invention aims to solve the above problems of the prior art. It proposes a weakly supervised target detection method based on semantic-information candidate boxes that improves three aspects: the rationality of model feature extraction, the ability to mine potential candidate boxes related to ground-truth target boxes, and the generation of high-quality target candidate boxes before training. The technical solution of the present invention is as follows:
An improved weakly supervised target detection method based on semantic-information candidate boxes comprises the following steps:
S1: input the image data to be processed and apply preprocessing steps including random horizontal flipping;
S2: design a combined backbone network that fuses features from the masked and non-masked network branches; the non-masked branch roughly finds locally distinctive target parts and localizes the target, while the masked branch suppresses the salient features and preserves the responses of inconspicuous features in the network;
S3: design a multi-branch detection head network based on a multi-instance selection algorithm, which generates more pseudo ground-truth target boxes with high confidence for each target category, so that the network model obtains more target-related candidate boxes and is trained more reasonably;
S4: design target semantic candidate boxes, generating more reasonable target candidate boxes by cyclically masking the target semantic information produced by the multi-branch detection head network model;
S5: perform detection on natural data sets.
Further, step S1 of inputting the image data to be processed and applying preprocessing steps including random horizontal flipping specifically comprises the following steps:
The network model is trained with random horizontal flipping, randomly resizing the shortest side of each training image to one of five image scales {480, 576, 688, 864, 1200}; in the test phase, the predicted output is computed by jointly analyzing all scales of each image and their horizontal flips.
Further, computing the predicted output by jointly analyzing all scales of each image and their horizontal flips specifically comprises:
The scale of each image is set in turn to 480, 576, 688, 864, and 1200. The images at the five scales, together with their horizontally flipped versions, are fed into the network model to obtain the corresponding outputs. Finally, the predictions from the different scales are rescaled to a common scale, and a post-processing algorithm filters the rescaled outputs to obtain the final predictions.
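The multi-scale, horizontal-flip test procedure above can be sketched as follows. `scaled_size` and `box_to_original` are illustrative helper names (not from the patent), and the sketch assumes boxes given as (x1, y1, x2, y2) pixel coordinates:

```python
# Sketch of multi-scale + horizontal-flip test-time augmentation.
# Names and box conventions are illustrative assumptions.

SCALES = [480, 576, 688, 864, 1200]  # target lengths for the shortest side

def scaled_size(w, h, shortest):
    """Resize (w, h) so the shortest side equals `shortest`, keeping aspect ratio."""
    if w <= h:
        return shortest, round(h * shortest / w)
    return round(w * shortest / h), shortest

def box_to_original(box, sx, sy, flipped, scaled_w):
    """Map a box predicted on a scaled (and possibly horizontally flipped)
    image back to original-image coordinates. sx, sy are the scale factors
    applied to the original image; scaled_w is the scaled image width."""
    x1, y1, x2, y2 = box
    if flipped:  # undo the horizontal flip in the scaled image first
        x1, x2 = scaled_w - x2, scaled_w - x1
    return (x1 / sx, y1 / sy, x2 / sx, y2 / sy)
```

All per-scale, per-flip predictions mapped back this way can then be pooled and filtered by the post-processing step (e.g. non-maximum suppression) to produce the final output.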
The IoU between each predicted result and the ground-truth target box coordinates is computed, and sample attributes are first determined according to a preset threshold (0.5 or 0.75). All samples are then sorted from high to low by their classification scores. The sorted samples are traversed, and for the samples traversed so far, precision and recall are computed according to formulas (1) and (2):
Precision = TP / (TP + FP) (1)
Recall = TP / (TP + FN) (2)
where TP, FP, and FN denote true positives, false positives, and false negatives.
From the precision and recall obtained at each step of the traversal, a curve is constructed with recall on the X axis and precision on the Y axis. Finally, the average precision (AP) is obtained by computing the area under this precision-recall curve, and the mean average precision (mAP) is obtained by averaging the AP over all categories.
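The evaluation steps above can be sketched as follows: an IoU helper for the thresholding step, and an AP computed as the area under the precision-recall curve built by sweeping detections in descending score order (formulas (1) and (2)). Function names are illustrative:

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union else 0.0

def average_precision(scores, is_tp, num_gt):
    """AP for one class: sweep detections by descending score, accumulate
    precision/recall, and sum the area under the stepwise PR curve.
    `is_tp[i]` says whether detection i matched a ground-truth box
    (i.e. IoU above the preset threshold); num_gt is the number of
    ground-truth boxes for this class."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    tp = fp = 0
    ap, prev_recall = 0.0, 0.0
    for i in order:
        if is_tp[i]:
            tp += 1
        else:
            fp += 1
        precision = tp / (tp + fp)
        recall = tp / num_gt
        ap += (recall - prev_recall) * precision
        prev_recall = recall
    return ap
```

mAP is then the mean of `average_precision` over all classes.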
Further, step S2 further comprises the following steps:
Given an image, candidate boxes are generated from pixel-level and target semantic features and fed into the spatial pyramid pooling (SPP) layer. After features and candidate box coordinates are extracted from the combined backbone network, the candidate boxes are mapped onto the feature map to obtain regions of interest (RoIs). Since each RoI has its own size, the SPP layer, which replaces the max pooling layer at the last stage of the combined backbone network, produces RoIs of fixed size. The parallel fully connected layers in the improved detection head network then output a feature vector for each candidate box.
Further, in step S2, salient and inconspicuous features are separated by extracting the mean and maximum values from the feature map before it passes through the two independent branches; if the feature value at a given position is above the threshold, the salient feature is masked, so that the responses of inconspicuous features are enhanced after passing through the combined backbone network. There are two ways to mask the salient features: a hard masking method and a soft masking method, given in formulas (3) and (4) respectively,
where f_{i,j} denotes the feature value at a given position, mean and Max denote the mean and maximum of the current feature map, the two outputs denote the hard-masked and soft-masked results respectively, and α is a hyperparameter controlling the threshold.
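The bodies of formulas (3) and (4) are not reproduced in this text, so the following is only one plausible reading of the masking rule: positions whose value exceeds a fraction α of the map maximum are treated as salient; the hard variant zeroes them and the soft variant replaces them with the map mean. Both the thresholding rule and the fill values are assumptions for illustration:

```python
def mask_salient(feature, alpha=0.7, soft=False):
    """Mask salient activations in a (flattened) feature map.
    Assumed rule: values above alpha * max are 'salient'.
    Hard masking zeroes them; soft masking replaces them with the map
    mean, so the masked branch is dominated by inconspicuous features."""
    mx = max(feature)
    mean = sum(feature) / len(feature)
    fill = mean if soft else 0.0
    return [fill if f > alpha * mx else f for f in feature]
```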
Further, in S3, the multi-branch detection head network comprises the following two steps:
First, all candidate boxes are trained jointly in the top branch of the multi-branch detection head network to find potential target candidate boxes, with the image-level category labels serving as the labels of this branch. The network model then focuses only on the potential target-related candidate box regions and trains the network parts associated with those regions, with the pseudo ground-truth target box labels generated by the previous network branch serving as new supervision for the next network branch.
Further, step S3 introduces a multi-instance selection algorithm that generates multiple pseudo ground-truth coordinate boxes with high confidence in each class and passes them to the following network branch, so that candidate boxes more closely related to the real targets are trained. Since the candidate boxes in each network branch score differently on each category, all scores are sorted from high to low, and attention is restricted to the candidate boxes P′ in the top 10% of scores as the source of pseudo ground-truth target boxes. A threshold K is set to limit the number of pseudo ground-truth target boxes in each category; the same threshold is then used in every class to decide whether a candidate box may serve as a pseudo ground-truth target box.
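The selection step can be sketched as follows for a single class: keep the top 10% of proposals by score, then apply a shared score threshold and the per-class cap K. `min_score` is an illustrative shared threshold, not a value given in the patent:

```python
def select_pseudo_gt(scores, k=3, top_frac=0.10, min_score=0.5):
    """Pick up to k pseudo ground-truth boxes for one class.
    scores[i] is the class score of proposal i. First restrict to the
    top `top_frac` of proposals by score (the candidate set P'), then
    keep those above the shared score threshold, capped at k."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    top = order[:max(1, int(len(order) * top_frac))]  # top 10%, at least one
    return [i for i in top if scores[i] >= min_score][:k]
```

Proposals selected this way become the supervision for the next refinement branch, replacing the single-box choice that the background section criticizes.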
Further, in the improved detection head network, each class generates multiple pseudo ground-truth target boxes; candidate boxes whose maximum IoU with these pseudo ground-truth boxes is above or below 0.5 are treated as foreground and background, respectively. The standard cross-entropy loss in the k-th detection branch is multiplied by a weight γ_k, so that the loss contributions of foreground and background samples can be balanced by adjusting γ_k. For the k-th refined selection detection branch, the loss function is defined as follows,
where the respective terms denote the prediction for the pseudo ground-truth target box associated with the current candidate box, the background category label information, and the output for background category samples; L_k and γ_k denote the loss of the k-th refined selection detection head and the positive/negative sample balancing weight; further terms denote the background cost of the c-th category and of the p-th candidate box in the k-th branch; C is the total number of categories, |P| is the total number of candidate boxes, and C+1 is the category label of the background; the final pair denote the label information and the output of the c-th category for candidate box p in the k-th detection branch.
Then, the standard binary cross-entropy function of the top coarse selection detection head, formula (8), is incorporated into the overall model loss so that the network is trained jointly; the loss function used to train the whole network is therefore given by formula (9),
where P_c, y_c, and L denote the sum of the predictions of all candidate boxes for the c-th class, the sample label information of the c-th class, and the total loss of the network model, respectively.
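A minimal sketch of the γ-weighted cross-entropy idea for one branch. The exact form of the patent's formulas is not reproduced in this text, so this only illustrates the foreground/background re-weighting; it assumes `probs[p][c]` holds normalized class probabilities per proposal and that the last class index is the background:

```python
import math

def balanced_ce(probs, labels, gamma_fg):
    """Per-proposal cross-entropy in which foreground terms (label is not
    the background class) are scaled by gamma_fg, so adjusting gamma_fg
    balances the foreground and background contributions to the loss.
    probs: list of per-proposal probability vectors (last index = background).
    labels: assigned class index per proposal (from the IoU-0.5 rule above)."""
    bg = len(probs[0]) - 1
    total = 0.0
    for p, c in zip(probs, labels):
        w = gamma_fg if c != bg else 1.0
        total -= w * math.log(p[c])
    return total / len(probs)
```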
Further, step S4 uses a cyclic masking method to mask these salient feature parts and thereby generate reasonable target candidate boxes, which specifically comprises:
The whole process of generating target semantic candidate boxes can be divided into three stages, called the masking stage, the synthesis stage, and the generation stage. The purpose of the masking stage is to eliminate the locally salient feature parts of an object;
the synthesis stage fuses, by averaging, the network responses produced in each cycle;
and the generation stage generates candidate boxes from the resulting target regions.
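The three stages can be sketched on a toy 2-D response map as follows. Masking a single peak cell per cycle and the 0.5 generation threshold are illustrative choices, not values from the patent; real responses come from the network:

```python
def cyclic_mask_proposal(response, cycles=3, thr=0.5):
    """Toy mask -> synthesize -> generate pipeline on a 2-D list.
    Masking stage: each cycle records the current response, then
    suppresses its strongest cell so weaker parts respond next.
    Synthesis stage: average the recorded maps.
    Generation stage: return the bounding box (x1, y1, x2, y2) of
    above-threshold cells, or None if nothing survives."""
    h, w = len(response), len(response[0])
    resp = [row[:] for row in response]
    recorded = []
    for _ in range(cycles):                                 # masking stage
        recorded.append([row[:] for row in resp])
        my, mx = max(((y, x) for y in range(h) for x in range(w)),
                     key=lambda p: resp[p[0]][p[1]])
        resp[my][mx] = 0.0                                  # suppress the peak
    merged = [[sum(r[y][x] for r in recorded) / cycles      # synthesis stage
               for x in range(w)] for y in range(h)]
    cells = [(y, x) for y in range(h) for x in range(w)     # generation stage
             if merged[y][x] >= thr]
    if not cells:
        return None
    ys = [y for y, _ in cells]
    xs = [x for _, x in cells]
    return (min(xs), min(ys), max(xs) + 1, max(ys) + 1)
```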
The advantages and beneficial effects of the present invention are as follows:
The innovation of the present invention lies mainly in the combination of claims 2-9. The present invention proposes a weakly supervised target detection method based on semantic-information candidate boxes that detects objects in images in a weakly supervised manner. It first proposes candidate boxes based on target semantic information generated by a cyclic masking method, which improves the number and quality of target-related candidate boxes and helps the network model converge better during training. It then proposes a dual-branch backbone structure, called the combined backbone network, that fuses an original feature branch with a salient-feature-suppression branch, strengthening the network's response to inconspicuous features and preventing the detection results from being dominated by locally salient target features. Finally, it proposes a multi-branch detection head network based on a multi-instance selection algorithm that generates more high-confidence pseudo ground-truth target boxes for each category, so that the model can be trained more reasonably. The present invention thus improves the model in three respects (target candidate boxes, the backbone network, and the multi-branch detection head network), addressing, respectively, the insufficient number and poor quality of target-related candidate boxes, the domination of the network's feature responses by locally salient features, and the label bias caused by an insufficient number of pseudo ground-truth target boxes; together these improvements raise the model's detection accuracy under weakly supervised learning.
BRIEF DESCRIPTION OF THE DRAWINGS
FIG. 1 shows the three core modules and the overall flow of a preferred embodiment of the present invention.
FIG. 2 shows the difference between the hard masking and soft masking methods.
FIG. 3 shows statistics of the relative positions between target region centers and the centers of the corresponding ground-truth target boxes on the PASCAL VOC 2007 test set.
FIG. 4 is a flowchart of the generation of target semantic candidate boxes.
FIG. 5 compares target semantic candidate boxes generated without and with the random constraint function.
DETAILED DESCRIPTION
The technical solutions in the embodiments of the present invention are described clearly and in detail below with reference to the accompanying drawings. The described embodiments are only some of the embodiments of the present invention.
The technical solution of the present invention for solving the above technical problems is:
An improved weakly supervised target detection method based on semantic-information candidate boxes, comprising the following steps:
S1: data preprocessing;
S2: design a combined backbone network to enhance the responses of inconspicuous target features;
S3: design a multi-branch detection head network based on a multi-instance selection algorithm so that more potential target candidate boxes are trained;
S4: design target semantic candidate boxes to improve the quality and quantity of target candidate boxes in the image;
S5: perform detection on the natural data sets PASCAL VOC 2007 and VOC 2012.
Optionally, S1 specifically comprises the following steps:
The network model is trained with random horizontal flipping, randomly resizing the shortest side of each training image to one of five image scales {480, 576, 688, 864, 1200}. In the test phase, the predicted output is computed by jointly analyzing all scales of each image and their horizontal flips.
In S2, to prevent the network from overfitting locally salient target regions, a combined backbone network is designed to fuse features from the masked and non-masked network branches. The non-masked branch roughly finds locally distinctive target parts and localizes the target, while the masked branch suppresses salient features and preserves the responses of inconspicuous features in the network. Combining these two independent branches effectively improves the responses of inconspicuous features and prevents the detection results from being dominated by locally salient target features.
Preferably, in S3, to handle the multiple pseudo ground-truth target boxes generated in each category, the present invention proposes a multi-branch detection head network based on a multi-instance selection algorithm to find more high-confidence pseudo ground-truth target boxes, so that more potential target candidate boxes are trained.
Preferably, in S4, to improve the quality and quantity of candidate boxes, the present invention proposes a method for generating candidate boxes from target semantic features rather than pixel-level image information (as in the Selective Search or Edge Boxes algorithms), further helping the network converge during training. The method cyclically masks the target semantic information generated by the network model to produce more reasonable target candidate boxes, improving their quantity and quality and helping the detection network model of the present invention converge better during training.
Preferably, in S5, during training the present invention uses Adam as the optimizer of the network model with an initial learning rate of 1×10⁻⁵. The mini-batch size is set to 8 input images for PASCAL VOC 2007 and 16 for PASCAL VOC 2012. For both data sets, the model is trained for 30k iterations, and after 20k iterations the learning rate is reduced to 10% of its original value.
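The training schedule above amounts to a simple step decay of the learning rate (the optimizer itself, Adam, is unchanged), sketched here:

```python
def learning_rate(step, base_lr=1e-5, drop_step=20_000, factor=0.1):
    """Step schedule described above: base LR 1e-5 for the first 20k of
    the 30k training iterations, then decayed to 10% of the base."""
    return base_lr * factor if step >= drop_step else base_lr
```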
可选的,所述S2中,给定一幅图像,基于像素级和目标语义特征生成候选框,并将其输入空间金字塔池层(Spatial Pyramid Pooling,SPP)。具体而言,从组合主干网络中提取特征和候选框的坐标后,本发明将候选框映射到特征图,以获得感兴趣区域(Region ofInterest,RoI)。由于每个RoI都有自己的大小,因此SPP层通过替换位于组合主干网络最后一层的最大池层来获得具有固定大小的RoI。然后,改进检测头网络中的并行全连接层将输出一系列候选框的特征向量。Optionally, in S2, given an image, a candidate box is generated based on pixel-level and target semantic features and input into a spatial pyramid pooling layer (SPP). Specifically, after extracting features and coordinates of candidate boxes from the combined backbone network, the present invention maps the candidate boxes to feature maps to obtain a region of interest (RoI). Since each RoI has its own size, the SPP layer obtains a RoI with a fixed size by replacing the maximum pooling layer located at the last layer of the combined backbone network. Then, the parallel fully connected layer in the improved detection head network will output a series of feature vectors of candidate boxes.
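空间金字塔池化将任意大小的RoI特征变为固定长度向量，其思路可用如下示意代码说明（金字塔层级[1, 2, 4]为说明用的假设值，并非本发明的具体配置）。The idea of spatial pyramid pooling, which turns an arbitrary-size RoI feature map into a fixed-length vector, can be illustrated as follows (the pyramid levels [1, 2, 4] are assumed for illustration, not the invention's actual configuration):

```python
import numpy as np

def spp(roi_feat, levels=(1, 2, 4)):
    """Max-pool an arbitrary-size RoI feature map (C, H, W) into a
    fixed-length vector: one l x l grid of max-pooled bins per level."""
    c = roi_feat.shape[0]
    pooled = []
    for l in levels:
        # Split rows and columns into l roughly equal, contiguous bins.
        row_bins = np.array_split(np.arange(roi_feat.shape[1]), l)
        col_bins = np.array_split(np.arange(roi_feat.shape[2]), l)
        for rs in row_bins:
            for cs in col_bins:
                bin_feat = roi_feat[:, rs[0]:rs[-1] + 1, cs[0]:cs[-1] + 1]
                pooled.append(bin_feat.reshape(c, -1).max(axis=1))
    return np.concatenate(pooled)  # length = C * sum(l*l over levels)
```

无论RoI大小如何，输出长度固定为 C·(1+4+16)。Whatever the RoI size, the output length is fixed at C·(1+4+16).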
为进一步减少对局部明显特征的过度拟合，设计了一种组合主干网络结构来提高不明显特征区域的响应。具体来说，在提取底层特征后，网络将被划分为两个独立的分支。第一个分支用于找到物体的显著特征部分从而大致定位目标位置，第二个分支用于掩盖这些显著特征部分，但保留不明显的特征响应。To further reduce overfitting to locally obvious features, a combined backbone network structure is designed to improve the response of inconspicuous feature areas. Specifically, after extracting the underlying features, the network is divided into two independent branches. The first branch finds the salient feature parts of the object to roughly locate it, and the second branch masks these salient parts while retaining the inconspicuous feature responses.
划分显著及不明显特征的方法是在通过这两个独立的分支之前,从特征图中提取平均值和最大值。如果该值高于阈值,显著特征将被屏蔽,从而使得不明显特征经过组合主干网络后其响应得以增强。此外,有两种方法可以对显著特征进行屏蔽。一种是硬性掩膜方法,另一种是软性屏蔽方法,可分别在公式(1)和公式(2)中显示,The method of dividing the significant and insignificant features is to extract the average and maximum values from the feature map before passing through the two independent branches. If the value is higher than the threshold, the significant feature will be masked, so that the response of the insignificant feature after passing through the combined backbone network is enhanced. In addition, there are two methods to mask the significant features. One is the hard masking method and the other is the soft masking method, which can be shown in formula (1) and formula (2) respectively.
其中，f_{i,j}表示特征图上特定位置(i,j)的特征值，公式(1)和公式(2)的输出分别表示采用硬性掩膜和软性掩膜方法得到的结果，α是一个用于控制阈值大小的超参数。Here, f_{i,j} denotes the feature value at position (i,j) of the feature map, the outputs of formula (1) and formula (2) denote the results of the hard and soft masking methods respectively, and α is a hyperparameter that controls the threshold.
与硬性掩膜方法相比,软性掩膜方法有两个优点。首先,它可以避免某些数值略高于阈值的特征无法获得增强的情况,因为硬性掩膜函数会在屏蔽分支中将其设置为零。其次,如果图像背景信息过多,会导致全局平均特征值降低,进而使得硬性掩膜方法的掩膜分支只能聚焦于较低特征值的特征,因此在小目标位于大背景的情况下,软性掩膜方法仍然可以较好工作。Compared with the hard mask method, the soft mask method has two advantages. First, it can avoid the situation where some features with values slightly above the threshold cannot be enhanced, because the hard mask function will set them to zero in the shielding branch. Second, if there is too much background information in the image, the global average eigenvalue will be reduced, and the mask branch of the hard mask method can only focus on features with lower eigenvalues. Therefore, the soft mask method can still work well when a small target is located in a large background.
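硬性与软性掩膜的差别可用如下示意代码说明。由于公式(1)、(2)未在本文中展示，此处假设阈值取特征图的"均值 + α·(最大值−均值)"，硬性掩膜将高于阈值的特征置零，软性掩膜将其截断到阈值，均为假设性草图。The difference between hard and soft masking can be illustrated as follows; since formulas (1) and (2) are not reproduced here, the threshold is assumed to be "mean + α·(max − mean)" of the feature map — a hypothetical sketch, not the invention's exact formula:

```python
import numpy as np

def mask_threshold(feat, alpha):
    # Assumed threshold combining the mean and max of the feature map.
    return feat.mean() + alpha * (feat.max() - feat.mean())

def hard_mask(feat, alpha=0.1):
    """Hard masking: salient features above the threshold are set to zero."""
    t = mask_threshold(feat, alpha)
    return np.where(feat > t, 0.0, feat)

def soft_mask(feat, alpha=0.1):
    """Soft masking: salient features are clamped to the threshold, so
    values slightly above it are kept rather than discarded."""
    t = mask_threshold(feat, alpha)
    return np.minimum(feat, t)
```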
可选的，所述S3中，多分支检测头网络的工作流程可以总结为以下两个步骤。首先，所有候选框会在该网络的顶部分支部分进行共同训练，以找到潜在的目标候选框，图像级类别标签作为该分支的标签。然后，网络模型只会关注与目标相关的潜在候选框区域，并对与该区域关联的网络部分进行训练，并且上一个网络分支中生成的伪真实目标检测框标签作为新的监督信息输入下一个网络分支。Optionally, in S3, the workflow of the multi-branch detection head network can be summarized in the following two steps. First, all candidate boxes are trained together in the top branch of the network to find potential target candidate boxes, with the image-level category labels used as the labels of this branch. Then, the network model focuses only on the potential candidate box areas related to the target and trains the network parts associated with those areas, and the pseudo-real target detection box labels generated in the previous network branch are used as the new supervision input to the next network branch.
该网络会引入多示例选择（Multiple Instance Selection，MIS）算法，其作用是在每一类中生成多个具有高置信度的伪真实坐标框，并将其发送给后一个网络分支，以便训练出与真实目标更相关的候选框。基于每个网络分支中候选框在每个类别上的分数不同，本发明将所有分数从高到低进行排序，并将重点放在前10%分数的候选框P′上，作为伪真实目标检测框的来源。设置阈值K以限制每个类别中伪真实目标检测框的数量。然后，本发明采用与非极大值抑制（Non-Maximum Suppression，NMS）类似的方法，在每一类中使用相同的阈值来确定候选框是否可以充当伪真实目标框。具体而言，对于P′中第i个候选框，如果它和之前已经生成的伪真实目标框之间的IoU低于阈值t，则该候选框将被附加到伪真实目标框列表中。此外，由于这些候选框区域对应的网络部分可能会在其他类别中被训练，这些已被选定的候选框的分数将被清零。最后，在生成伪真实目标框之前，类别向量c的顺序是随机的，以减少排序靠前的类别可以选择更多候选框的影响。The network introduces a Multiple Instance Selection (MIS) algorithm, which generates multiple pseudo-real bounding boxes with high confidence in each category and sends them to the next network branch, so as to train candidate boxes that are more relevant to the real targets. Since candidate boxes score differently on each category in each network branch, the present invention sorts all scores from high to low and focuses on the top-10%-scoring candidate boxes P′ as the source of pseudo-real target detection boxes. A threshold K is set to limit the number of pseudo-real target detection boxes in each category. Then, the present invention uses a method similar to Non-Maximum Suppression (NMS), applying the same threshold in each category to determine whether a candidate box can serve as a pseudo-real target box. Specifically, for the i-th candidate box in P′, if the IoU between it and the previously generated pseudo-real target boxes is lower than the threshold t, the candidate box is appended to the pseudo-real target box list. In addition, since the network parts corresponding to these candidate box areas may be trained on other categories, the scores of the selected candidate boxes are reset to zero. Finally, before generating the pseudo-real target boxes, the order of the category vector c is randomized to reduce the effect that higher-priority categories could otherwise select more candidate boxes.
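上述MIS步骤可以用如下示意代码表示（本文未给出具体实现，以下为基于上述描述的假设性草图，函数名均为示意）。The MIS step described above can be sketched as follows (a hypothetical sketch based on the description; the actual implementation is not given in the text, and all names are illustrative):

```python
import random

def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    ix = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0

def multiple_instance_selection(boxes, scores, classes, k=5, t=0.05, seed=0):
    """Per class (in randomized order): rank candidates by score, keep the
    top-10% pool P', then greedily append up to k boxes whose IoU with every
    already-selected pseudo-GT box is below t; selected scores are zeroed."""
    order = list(classes)
    random.Random(seed).shuffle(order)  # randomize class priority
    pseudo_gt = {}
    for c in order:
        ranked = sorted(range(len(boxes)),
                        key=lambda i: scores[c][i], reverse=True)
        pool = ranked[:max(1, len(ranked) // 10)]  # top 10% of scores
        kept = []
        for i in pool:
            if len(kept) >= k:
                break
            if all(iou(boxes[i], boxes[j]) < t for j in kept):
                kept.append(i)
                scores[c][i] = 0.0  # zero out so other classes skip this box
        pseudo_gt[c] = kept
    return pseudo_gt
```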
在改进检测头网络中，每个类别将生成多个伪真实目标检测框。与这些伪真实目标框计算出的最大IoU高于0.5的候选框将被视为前景，低于0.5的将被视为背景。为了更有效地训练本发明的检测网络模型，需要平衡正负候选框，因为背景的数量远远超过前景。因此，本发明将标准交叉熵损失乘以第k个检测分支中的γk权重，以便通过调整γk的值来平衡前景和背景样本的损失函数。为了避免该参数使模型过度拟合，所有精调分支都使用相同的γ。本发明的实验表明，这种权重可以显著提高模型性能。此外，令wp等于与候选框p最接近的伪真实目标框的分数：在训练开始阶段，这些伪真实目标框在相应类别中的分数并不足够可靠，此时wp的值很小，这个变量可以防止本发明提出的模型受到这些不可靠的伪真实目标框的干扰。因此，对于第k个改进选择检测分支，其损失函数如下所示。In the improved detection head network, each category generates multiple pseudo-real target detection boxes. Candidate boxes whose maximum IoU with these pseudo-real boxes is above 0.5 are regarded as foreground, and those below 0.5 as background. To train the detection network model of the present invention more effectively, the positive and negative candidate boxes need to be balanced, because backgrounds far outnumber foregrounds. Therefore, the present invention multiplies the standard cross-entropy loss by the γk weight in the k-th detection branch, so that the loss of foreground and background samples can be balanced by adjusting γk. To avoid overfitting the model through this parameter, all fine-tuning branches use the same γ. The experiments of the present invention show that this weight significantly improves model performance. In addition, wp is set to the score of the pseudo-real target box closest to candidate box p: at the beginning of training these pseudo-real boxes are not yet reliable in their corresponding categories and wp is accordingly small, which prevents the proposed model from being disturbed by unreliable pseudo-real target boxes. Therefore, for the k-th improved selection detection branch, the loss function is as follows.
其中，式中前两个符号分别表示第k个检测分支中第c个类别和第p个候选框对应的背景代价；C是类别总数，|P|等于候选框总数，C+1代表背景的类别标签；后两个符号分别表示第k个检测分支中候选框p在第c个类别上的标签信息和输出结果。Here, the first two symbols denote the background cost of the c-th category and the p-th candidate box in the k-th detection branch, respectively; C is the total number of categories, |P| equals the total number of candidate boxes, and C+1 denotes the background category label; the last two symbols denote the label information and the output result of the c-th category for candidate box p in the k-th detection branch.
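上述带权重的分支损失可用如下示意代码表示。公式(3)-(5)未在本文中完整展示，此处假设前景项按γ、背景项按(1−γ)加权，且每一项再乘以wp，这一具体形式是假设性的，仅用于说明γ与wp的作用。The weighted branch loss above can be sketched as follows; since the exact formulas are not reproduced here, the split of γ for foreground terms and (1 − γ) for background terms, with every term further scaled by wp, is an assumption made only to illustrate the roles of γ and wp:

```python
import math

def branch_loss(probs, labels, w, gamma=0.7):
    """Sketch of a gamma-balanced, w_p-weighted cross entropy over
    candidate boxes: probs[p] is the predicted probability of p's label,
    labels[p] is True for foreground, and w[p] is the score of the
    pseudo-GT box closest to candidate p."""
    total = 0.0
    for prob, is_fg, wp in zip(probs, labels, w):
        balance = gamma if is_fg else 1.0 - gamma  # assumed fg/bg split
        total += -wp * balance * math.log(prob)
    return total / len(probs)
```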
然后,将顶部粗糙选择检测头的标准二进制交叉熵函数(如公式(6)所示)并入整个模型损失函数,以使网络进行联合训练。因此,用于训练整个网络的损失函数如公式(7)所示。Then, the standard binary cross entropy function of the top coarse selection detection head (as shown in Equation (6)) is incorporated into the overall model loss function so that the network can be trained jointly. Therefore, the loss function used to train the entire network is shown in Equation (7).
可选的,所述S4中,对于改进的候选框模块,虽然Selective Search方法可以生成大量候选框,但背景候选框的数量远远超过目标候选框。此外,许多与目标相关的候选框通常只包含一个目标的局部。为了解决这些问题,本发明根据目标的语义特征提高了与目标相关的候选框的数量和质量。Optionally, in S4, for the improved candidate frame module, although the Selective Search method can generate a large number of candidate frames, the number of background candidate frames far exceeds the number of target candidate frames. In addition, many candidate frames related to the target usually only contain a part of the target. In order to solve these problems, the present invention improves the number and quality of candidate frames related to the target according to the semantic features of the target.
目标语义候选框是通过分析已拥有一定知识表征能力的模型生成的目标语义特征响应而产生的。梯度类激活映射(Gradient Class Activation Mapping,Grad CAM)是一种可视化算法,旨在显示具有特定类别的目标特征在网络中的响应。具体来说,Grad CAM可以通过使用每个类别的反向传播梯度,并采用全局平均池函数来计算每个特征映射的平均权重,从而使模型输出目标响应。然后,将这些平均权重乘以ReLU函数之前特征映射的相应值。整个过程可以总结如下:The target semantic candidate box is generated by analyzing the target semantic feature responses generated by the model that already has a certain knowledge representation capability. Gradient Class Activation Mapping (Grad CAM) is a visualization algorithm designed to display the response of target features with a specific category in the network. Specifically, Grad CAM can use the back-propagation gradient of each category and adopt the global average pooling function to calculate the average weight of each feature map so that the model outputs the target response. These average weights are then multiplied by the corresponding values of the feature map before the ReLU function. The whole process can be summarized as follows:
其中，y_c是类别c的输出，A^k_{i,j}表示第k个特征图在位置(i,j)处的数值，Z表示特征图的像素总数，α^c_k是第k个特征图的平均权重（平均响应），L_c是第c个类别的激活映射。本发明提出的候选框生成方法的特点是不需要重新训练，因为可以使用在ImageNet上预训练的VGG16作为生成目标语义特征的主干网络，这意味着本发明的方法更加通用，可以用于类别更多的数据集。此外，该方法还可以结合本发明提出的循环掩膜方法和目标语义候选框的生成过程，通过使用较少的手动阈值来获得较为合理的候选框。Here, y_c is the output for category c, A^k_{i,j} denotes the value at position (i,j) of the k-th feature map, Z denotes the number of pixels in the feature map, α^c_k is the average weight (average response) of the k-th feature map, and L_c is the activation map of the c-th category. A feature of the candidate box generation method proposed in the present invention is that no retraining is required, because VGG16 pre-trained on ImageNet can be used as the backbone for generating target semantic features, which means the method is more general and can be applied to data sets with more categories. In addition, the method can be combined with the cyclic masking method and the target semantic candidate box generation proposed in the present invention to obtain more reasonable candidate boxes with fewer manual thresholds.
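上述Grad CAM的计算过程可以用如下示意代码表示（仅为对该公开算法的最小化草图，函数名为假设）。The Grad CAM computation above can be sketched as follows (a minimal sketch of this published algorithm; the function name is assumed):

```python
import numpy as np

def grad_cam(feature_maps, gradients):
    """Grad-CAM: global-average-pool the class gradients over each feature
    map to get per-map weights alpha_k, take the weighted sum of the maps,
    and keep only positive responses (ReLU).

    feature_maps, gradients: arrays of shape (K, H, W)."""
    weights = gradients.mean(axis=(1, 2))              # alpha_k = GAP of dY_c/dA_k
    cam = np.tensordot(weights, feature_maps, axes=1)  # sum_k alpha_k * A_k
    return np.maximum(cam, 0.0)                        # ReLU
```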
由于高响应总是由显著特征主导,因此在生成可行的候选框之前,掩盖这些区域是很重要的。因此,本发明采用了一个循环掩膜方法来屏蔽这些显著特征部分,进而生成合理目标候选框。Since high responses are always dominated by salient features, it is important to mask these areas before generating feasible candidate boxes. Therefore, the present invention adopts a cyclic masking method to mask these salient feature parts and then generate reasonable target candidate boxes.
生成目标语义候选框的整个过程可分为三个阶段,分别称为屏蔽阶段、合成阶段和生成阶段。首先,屏蔽阶段的目的是消除物体的局部显著特征部分。当原始图像经过神经网络时,输出将具有与相应类别相关的分数结果。在通过反向传播计算梯度之前,本发明选择前5个分数对应的类作为响应源,并在进入动态二进制阈值函数之前生成网络响应。然后,如果在特定位置生成的响应高于动态函数的阈值,则目标的显著特征将在原始图像中被掩盖。在掩盖这些明显的特征区域后,被掩盖的图像将替换前一张图像,并作为当前循环中网络的输入。The whole process of generating the target semantic candidate box can be divided into three stages, namely the masking stage, the synthesis stage and the generation stage. First, the purpose of the masking stage is to eliminate the local salient feature parts of the object. When the original image passes through the neural network, the output will have a score result related to the corresponding category. Before calculating the gradient by back propagation, the present invention selects the class corresponding to the top 5 scores as the response source and generates the network response before entering the dynamic binary threshold function. Then, if the response generated at a specific location is higher than the threshold of the dynamic function, the salient features of the target will be masked in the original image. After masking these obvious feature areas, the masked image will replace the previous image and serve as the input of the network in the current cycle.
本发明设定的循环次数是2,因为第一个循环是找到物体中最具辨别力的部分,第二个循环是找到次要区域。更多的循环可能有效,但本发明发现效果不明显且耗时,因为上述步骤已经检测到与目标相关的大部分区域。对于动态二进制阈值函数,阈值大小遵循公式(10),其中i=1,2,…,T,Ti表示第i次循环的阈值。The number of cycles set in the present invention is 2, because the first cycle is to find the most discriminative part of the object, and the second cycle is to find the secondary area. More cycles may be effective, but the present invention finds that the effect is not obvious and time-consuming, because the above steps have already detected most of the areas related to the target. For the dynamic binary threshold function, the threshold size follows formula (10), where i = 1, 2, ..., T, Ti represents the threshold of the i-th cycle.
T_i = 100 + 20 × (i − 1)　　(10)
根据对图像的大量经验观察，本发明发现过低的阈值会导致生成包含多个目标的超大掩膜。100是在第一阶段生成合理掩膜的合适基准。由于前一个循环已移除了最具区分性的部分，这意味着网络在处理掩膜后的图像时，其余目标区域较小，网络对噪声更敏感。因此，在下一个循环中使用更高的阈值可以避免网络受到背景等噪声的干扰。Based on extensive empirical observation of images, the present invention finds that a low threshold leads to oversized masks containing multiple targets. 100 is a suitable baseline for generating reasonable masks in the first stage. Since the previous cycle has removed the most discriminative parts, the remaining target areas are smaller when the network processes the masked image, making the network more sensitive to noise. Therefore, using a higher threshold in the next cycle prevents the network from being disturbed by noise such as the background.
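上述循环掩膜与动态阈值可以用如下示意代码表示（response_fn为Grad CAM响应步骤的假设性替身，整体仅为草图）。The cyclic masking with the dynamic threshold above can be sketched as follows (response_fn is a hypothetical stand-in for the Grad CAM response step; the whole block is a sketch):

```python
import numpy as np

def cycle_threshold(i):
    """Dynamic binary threshold of formula (10): T_i = 100 + 20*(i-1)."""
    return 100 + 20 * (i - 1)

def cyclic_mask(image, response_fn, cycles=2):
    """Each cycle thresholds the network response, blanks the salient
    pixels in the image, and feeds the masked image to the next cycle;
    the per-cycle responses are averaged in the synthesis stage."""
    responses = []
    img = image.copy()
    for i in range(1, cycles + 1):
        resp = response_fn(img)
        responses.append(resp)
        img = np.where(resp > cycle_threshold(i), 0, img)  # mask salient pixels
    return img, np.mean(responses, axis=0)
```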
合并阶段是将每个循环中生成的这些网络响应取平均进行融合。综合后的特征响应通过固定阈值生成多组二进制目标区域，作为候选框的基础。此外，由于小目标可能存在于大目标区域中，提高阈值可以生成包含在大目标区域中的小目标区域，从而可以生成与这些小目标相关的候选框。然而，这里需要折衷：阈值间隔过小会导致每组区域之间的差异不明显，而间隔过大会导致生成的区域仅代表同一目标的一部分。The merging stage fuses the network responses generated in each cycle by averaging them. The combined feature response is passed through fixed thresholds to generate multiple sets of binary target regions as the basis for candidate boxes. In addition, since small targets may exist inside large target regions, raising the threshold can generate small target regions contained in large ones, so that candidate boxes related to these small targets can be produced. However, there is a trade-off: a small threshold interval makes the differences between region sets insignificant, while a large interval causes the generated regions to represent only part of the same target.
生成阶段是基于生成的目标区域生成候选框。然而,本发明不能仅从目标区域的边缘获取候选框。相邻目标区域之间的差异是本发明产生目标语义候选框的主要因素。生成的候选框的坐标可计算如下:The generation stage generates candidate boxes based on the generated target regions. However, the present invention cannot obtain candidate boxes only from the edges of the target regions. The difference between adjacent target regions is the main factor in generating target semantic candidate boxes in the present invention. The coordinates of the generated candidate boxes can be calculated as follows:
其中,Pright,i、Pleft,i、Ptop,i和Pbottom,i分别表示第i个候选框的右、左、上和下坐标。Rright、Rleft、Rtop和Rbottom表示之前生成的目标区域的坐标。N是一个常数,表示每个区域的候选框数量,t是确定间隔距离限制因子的阈值。Δright、Δleft、Δtop和Δbottom是在相应位置计算的差异。Where P right,i , P left,i , P top,i and P bottom,i represent the right, left, top and bottom coordinates of the i-th candidate box, respectively. R right , R left , R top and R bottom represent the coordinates of the target region generated previously. N is a constant representing the number of candidate boxes per region, and t is a threshold to determine the spacing distance limit factor. Δ right , Δ left , Δ top and Δ bottom are the differences calculated at the corresponding positions.
根据PASCAL VOC 2007测试集上的统计结果，S_R和S_GT分别表示生成的目标区域的面积及其相关真实目标框的面积。可以注意到，尽管目标区域与真实目标框有很高的相关性（即IoU(R,GT)>0.5），但仍有62.7%的概率生成的目标区域面积小于相关真实目标框的面积。因此，本发明需要扩大生成的目标区域，以获得合理的目标语义候选框。According to the statistics on the PASCAL VOC 2007 test set, S_R and S_GT denote the area of a generated target region and the area of its associated ground-truth target box, respectively. It can be noted that although the target regions are highly correlated with the ground-truth boxes (i.e., IoU(R,GT)>0.5), in 62.7% of cases the area of the generated target region is smaller than that of the associated ground-truth box. Therefore, the present invention needs to enlarge the generated target regions to obtain reasonable target semantic candidate boxes.
有效目标区域应与真实目标框高度相关或位于真实目标框内部。如果有效目标区域位于真实目标框的顶部或左侧，则其他区域一定位于底部或右侧。在弱监督目标检测中，并没有任何示例级别的标签信息。然而，根据本发明得到的统计结果，无论有效的目标区域位于图像中的何处，至少有70%的概率真实目标框的中心更靠近图像中心，这意味着向图像中心生成目标语义候选框将更合理。因此，如果一个有效的目标区域位于图像中心的顶部或左侧，则将更多的注意力集中在扩大生成目标区域的底部或右侧，更有可能缩小目标区域中心与真实目标框中心之间的距离。A valid target region should be highly correlated with the ground-truth box or lie inside it. If the valid target region is at the top or left of the ground-truth box, the other regions must be at the bottom or right. In weakly supervised target detection, there is no instance-level label information. However, according to the statistics obtained by the present invention, no matter where the valid target region lies in the image, with at least 70% probability the center of the ground-truth box is closer to the image center, which means it is more reasonable to generate target semantic candidate boxes toward the image center. Therefore, if a valid target region lies at the top or left of the image center, focusing more on expanding the bottom or right of the generated target region is more likely to reduce the distance between the center of the target region and the center of the ground-truth box.
此外，为了减少面积大于真实目标框的目标区域所带来的不相关信息的影响，将公式(15)中的随机限制函数f并入公式(11)-(14)以缩小坐标的增长，从而减少不相关信息的引入。此外，位于图像边缘的目标在计算目标语义候选框的坐标时会导致间隔不平衡。例如，如果目标区域位于图像的左上角，则Δtop和Δright的结果将比Δbottom和Δleft大得多，因此会存在右下角区域包含过多背景信息的情况。随机约束函数因此考虑了这种情况，并对右下方间隔的增长施加了相对较强的约束。然而，由于大多数目标区域都比其相关的真实目标框小，公式(15)中过强的限制参数将导致生成更多面积更小的候选框。In addition, to reduce the influence of irrelevant information brought by target regions larger than the ground-truth box, the random restriction function f in formula (15) is incorporated into formulas (11)-(14) to shrink the growth of the coordinates and thus reduce the introduction of irrelevant information. Moreover, targets located at the edge of the image cause unbalanced intervals when computing the coordinates of target semantic candidate boxes. For example, if a target region lies in the upper-left corner of the image, Δtop and Δright will be much larger than Δbottom and Δleft, so the lower-right area may contain excessive background information. The random constraint function takes this into account and imposes a relatively strong constraint on the growth of the lower-right interval. However, since most target regions are smaller than their associated ground-truth boxes, an overly strong restriction parameter in formula (15) would generate more candidate boxes with smaller areas.
如上所述，相邻的目标区域决定了Δ的结果。前一个目标区域将成为后一个目标区域的边界，但对于第一个目标区域，以原始图像的大小作为其最远边界。由于大区域内部的小目标区域有时只代表目标的一个局部，而不是另外的小目标，这意味着在生成目标语义候选框时不应考虑这些区域。为了减少其影响，本发明会计算前一个目标区域和后一个目标区域之间的IoU，以决定是否基于这些较小的区域生成候选框。As mentioned above, adjacent target regions determine the value of Δ. The previous target region becomes the boundary of the next one; for the first target region, the size of the original image serves as its farthest boundary. Since a small target region inside a large region sometimes represents only a part of the target rather than another small target, such regions should not be considered when generating target semantic candidate boxes. To reduce their influence, the present invention computes the IoU between the previous and the next target regions to decide whether to generate candidate boxes from these smaller regions.
此外，由于CNN是在局部感受野内计算每个局部特征的，本发明还引入了图像金字塔，这意味着响应会随目标的尺度而改变。因此，本发明设置s=[1,0.75,0.5]，从而进一步为小、中、大尺度目标生成更合理的候选框。初始的目标区域阈值等于100；为了生成更多的目标区域集，每个后续阈值都比上一个阈值高30，即得到固定二进制阈值列表Tfix=[100,130,160]。此外，每个目标区域将生成固定数量的候选框。根据公式(11)-(14)，计算Δ时，间隔距离由N决定。在这里，本发明仅从每个目标区域中选择前10个候选框（即i∈[0,9]），并分别将第一组和其余两组目标区域的N设置为50和20。最后，本发明在公式(8)中设定的阈值t为8。In addition, since a CNN computes each local feature within a local receptive field, an image pyramid is introduced, meaning that the response changes with the scale of the target. Therefore, the present invention sets s=[1, 0.75, 0.5] to further generate more reasonable candidate boxes for small-, medium-, and large-scale targets. The initial target region threshold equals 100; to generate more sets of target regions, each subsequent threshold is 30 higher than the previous one, giving the fixed binary threshold list Tfix=[100, 130, 160]. In addition, a fixed number of candidate boxes is generated for each target region. According to formulas (11)-(14), the interval distance when computing Δ is determined by N. Here, the present invention selects only the first 10 candidate boxes from each target region (i.e., i∈[0,9]) and sets N to 50 for the first group of target regions and 20 for the remaining two groups. Finally, the threshold t set in formula (8) is 8.
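上述超参数可以集中整理为如下配置草图（字段名为示意，数值均取自上文）。The hyperparameters above can be gathered into the following configuration sketch (the field names are illustrative; all values are taken from the text):

```python
# Hyperparameters of the semantic candidate-box generation stated above.
SEMANTIC_PROPOSAL_CFG = {
    "image_scales": [1.0, 0.75, 0.5],  # pyramid for small/medium/large targets
    "base_threshold": 100,             # initial target-region threshold
    "threshold_step": 30,              # each level is 30 above the previous
    "num_thresholds": 3,
    "boxes_per_region": 10,            # i in [0, 9]
    "N_first_group": 50,
    "N_other_groups": 20,
    "t": 8,                            # threshold in formula (8)
}

def fixed_thresholds(cfg):
    """T_fix = [100, 130, 160]: base threshold plus a fixed step per level."""
    return [cfg["base_threshold"] + cfg["threshold_step"] * i
            for i in range(cfg["num_thresholds"])]
```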
如图1所示，本发明实例提供了一种基于语义信息候选框下组合主干网络结合基于多示例选择算法的多分支检测头网络的弱监督目标检测方法，该方法可以由终端或服务器实现。该方法包括三个模块：第一个模块结合像素级和目标语义候选框，我们将该模块定义为改进候选框；第二个模块通过提高不明显物体特征的响应来减少网络对显著特征的过拟合，我们称这个模块为组合主干网络；最后一个是基于多示例选择算法的多分支检测头网络模块，其目的是生成更多伪真实目标框，以获取更多潜在的与目标相关的候选框。As shown in Figure 1, an example of the present invention provides a weakly supervised target detection method combining a backbone network under semantic-information candidate boxes with a multi-branch detection head network based on a multiple-instance selection algorithm; the method can be implemented by a terminal or a server. The method comprises three modules: the first combines pixel-level and object semantic candidate boxes, which we define as the improved candidate box module; the second reduces the network's overfitting to salient features by improving the response of inconspicuous object features, which we call the combined backbone network; the last is the multi-branch detection head network module based on the multiple-instance selection algorithm, whose purpose is to generate more pseudo-real target boxes so as to obtain more potential target-related candidate boxes.
如图2所示为硬性掩膜和软性掩膜方法的区别，这里我们假设α等于0。经过软性掩膜方法后，对象周围的特征值相同，这意味着对象周围的所有特征具有相同的重要程度。然而，由于过大的背景区域会拉低全局平均特征值，硬性掩膜方法并不能对非显著特征响应进行提升。The difference between the hard and soft masking methods is shown in Figure 2, where we assume that α equals 0. After the soft masking method, the feature values around the object are the same, meaning that all features around the object are equally important. However, since an oversized background area lowers the global average feature value, the hard masking method cannot enhance the responses of non-salient features.
如图3所示为PASCAL VOC 2007测试集上目标区域中心和相关真实目标框中心之间相对位置的统计结果。可以观察到，无论有效的目标区域位于图像中的何处，至少有70%的概率真实目标框的中心更靠近图像中心，这意味着向图像中心生成目标语义候选框将更合理。Figure 3 shows the statistics on the PASCAL VOC 2007 test set of the relative position between the center of a target region and the center of its associated ground-truth target box. It can be observed that no matter where a valid target region lies in the image, with at least 70% probability the center of the ground-truth box is closer to the image center, which means it is more reasonable to generate target semantic candidate boxes toward the image center.
如图4所示为目标语义候选框的生成流程图。利用Grad CAM可以从网络中产生对物体的响应。动态阈值函数的目的是生成一个掩膜来屏蔽明显的特征。前一张图像将被屏蔽后的图像替换，之后过程将再次开始。当循环次数达到预设的总循环数时，将多个响应合并以显示综合响应结果。然后，固定阈值函数滤除噪声并生成目标区域，提高该阈值将使生成的目标区域变小。最后，分别根据这些目标区域及公式(11)-(15)生成候选框。Figure 4 shows the flow chart of target semantic candidate box generation. Using Grad CAM, responses to objects can be produced from the network. The purpose of the dynamic threshold function is to generate a mask that shields the obvious features. The previous image is replaced by the masked image, after which the process starts again. When the number of cycles reaches the preset total, the multiple responses are merged to show the combined response. Then, the fixed threshold function filters out noise and generates the target regions; raising this threshold makes the generated regions smaller. Finally, candidate boxes are generated from these target regions according to formulas (11)-(15).
如图5所示的无随机约束函数和有随机约束函数的目标语义候选框之间的比较。橙色和蓝色框分别表示真实目标框和生成的目标语义候选框。生成的带有随机约束函数的候选框与真实目标框会相对紧凑,这意味着可以减少生成的候选框中存在的背景信息。Comparison between the target semantic candidate boxes without random constraint function and with random constraint function as shown in Figure 5. The orange and blue boxes represent the true target box and the generated target semantic candidate box, respectively. The generated candidate box with random constraint function will be relatively compact with the true target box, which means that the background information in the generated candidate box can be reduced.
为了验证本发明实例提供的基于语义信息候选框的改进弱监督目标检测方法，对于该模型的多个超参数，γk等于0.7，表示每个精调检测分支的损失函数中的目标候选框权值；组合主干网络中的α设置为0.1。对于改进的选择检测头，K和阈值t默认设置为5和0.05。精调检测分支的数量设置为3。结合随机水平翻转操作，本发明利用五个图像尺度{480、576、688、864、1200}随机调整训练图像最短边的大小。在测试阶段，通过综合分析每个图像的所有尺度及其水平翻转来计算输出。To validate the improved weakly supervised target detection method based on semantic-information candidate boxes provided by this example of the present invention, the hyperparameters of the model are set as follows: γk equals 0.7, representing the target candidate box weight in the loss function of each fine-tuning detection branch; α in the combined backbone network is set to 0.1. For the improved selection detection head, K and the threshold t are set to 5 and 0.05 by default. The number of fine-tuning detection branches is set to 3. Combined with random horizontal flipping, the present invention randomly resizes the shortest side of training images using five image scales {480, 576, 688, 864, 1200}. In the test phase, the output is computed by comprehensively analyzing all scales of each image and their horizontal flips.
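上述测试阶段的多尺度加翻转评估共产生10种输入变体，可用如下示意代码说明（函数名为假设）。The multi-scale plus flip evaluation at test time yields 10 input variants per image, as the following sketch illustrates (the function name is assumed):

```python
def test_time_variants(scales=(480, 576, 688, 864, 1200)):
    """Enumerate the test-time variants described above: every shortest-side
    scale combined with the original and horizontally flipped image; the
    detector's outputs over these variants are then aggregated."""
    return [(s, flipped) for s in scales for flipped in (False, True)]
```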
本发明使用Selective Search算法生成像素级候选框，并结合本发明的目标语义候选框作为网络的示例样本。采用在ImageNet数据集上预训练的VGG16（无批量归一化）作为特征提取的主干。此外，倒数第二个最大池化层和后面的卷积层被改为空洞卷积层，以扩大感受野。对于新添加的层，使用高斯分布进行初始化，其平均值为0，标准差为0.01。最后，我们的实验是在PyTorch深度学习框架、Python和C++中实现的。所有的实验都在NVIDIA GTX Titan V GPU和Intel Xeon Silver 4110 CPU（2.10GHz）上运行。The present invention uses the Selective Search algorithm to generate pixel-level candidate boxes and combines them with the target semantic candidate boxes of the present invention as example samples for the network. VGG16 (without batch normalization) pre-trained on the ImageNet dataset is used as the feature extraction backbone. In addition, the penultimate max pooling layer and the subsequent convolutional layers are changed to dilated convolutional layers to enlarge the receptive field. Newly added layers are initialized from a Gaussian distribution with a mean of 0 and a standard deviation of 0.01. Finally, our experiments are implemented in the PyTorch deep learning framework, Python, and C++. All experiments run on an NVIDIA GTX Titan V GPU and an Intel Xeon Silver 4110 CPU (2.10 GHz).
实验结果Experimental Results
在本实例中，使用AP和mAP在IoU阈值为0.5下评估模型性能。此外，还使用CorLoc评估指标分别在trainval集和测试集上评估模型的定位精度。In this example, AP and mAP are used to evaluate the model performance at an IoU threshold of 0.5. In addition, the CorLoc metric is used to evaluate the localization accuracy of the model on the trainval and test sets, respectively.
表1在PASCAL VOC 2007上使用VGG16对每个类别的检测结果Table 1 Detection results of each category using VGG16 on PASCAL VOC 2007
表2在PASCAL VOC 2012上使用VGG16对每个类别的检测结果Table 2 Detection results of each category using VGG16 on PASCAL VOC 2012
表3在PASCAL VOC 2007上使用VGG16对每个类别的正确定位结果Table 3 Correct positioning results for each category using VGG16 on PASCAL VOC 2007
表4在PASCAL VOC 2012上使用VGG16对每个类别的正确定位结果Table 4 Correct positioning results for each category using VGG16 on PASCAL VOC 2012
还需要说明的是,术语“包括”、“包含”或者其任何其他变体意在涵盖非排他性的包含,从而使得包括一系列要素的过程、方法、商品或者设备不仅包括那些要素,而且还包括没有明确列出的其他要素,或者是还包括为这种过程、方法、商品或者设备所固有的要素。在没有更多限制的情况下,由语句“包括一个……”限定的要素,并不排除在包括所述要素的过程、方法、商品或者设备中还存在另外的相同要素。It should also be noted that the terms "include", "comprises" or any other variations thereof are intended to cover non-exclusive inclusion, so that a process, method, commodity or device including a series of elements includes not only those elements, but also other elements not explicitly listed, or also includes elements inherent to such process, method, commodity or device. In the absence of more restrictions, the elements defined by the sentence "comprises a ..." do not exclude the existence of other identical elements in the process, method, commodity or device including the elements.
以上这些实施例应理解为仅用于说明本发明而不用于限制本发明的保护范围。在阅读了本发明的记载的内容之后,技术人员可以对本发明作各种改动或修改,这些等效变化和修饰同样落入本发明权利要求所限定的范围。The above embodiments should be understood to be only used to illustrate the present invention and not to limit the protection scope of the present invention. After reading the contents of the present invention, technicians can make various changes or modifications to the present invention, and these equivalent changes and modifications also fall within the scope defined by the claims of the present invention.
Claims (2)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210393075.5A CN114972711B (en) | 2022-04-14 | 2022-04-14 | Improved weak supervision target detection method based on semantic information candidate frame |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210393075.5A CN114972711B (en) | 2022-04-14 | 2022-04-14 | Improved weak supervision target detection method based on semantic information candidate frame |
Publications (2)
Publication Number | Publication Date |
---|---|
CN114972711A CN114972711A (en) | 2022-08-30 |
CN114972711B true CN114972711B (en) | 2024-09-20 |
Family
ID=82977259
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202210393075.5A Active CN114972711B (en) | 2022-04-14 | 2022-04-14 | Improved weak supervision target detection method based on semantic information candidate frame |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114972711B (en) |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115880337B (en) * | 2023-02-16 | 2023-05-30 | 南昌工程学院 | Object Tracking Method and System Based on Heavy Parameter Convolution and Feature Filter |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113378829A (en) * | 2020-12-15 | 2021-09-10 | 浙江大学 | Weak supervision target detection method based on positive and negative sample balance |
CN114332457A (en) * | 2021-08-24 | 2022-04-12 | 腾讯科技(深圳)有限公司 | Image instance segmentation model training method, image instance segmentation method and device |
Family Cites Families (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108549893B (en) * | 2018-04-04 | 2020-03-31 | 华中科技大学 | An End-to-End Recognition Method for Scene Texts of Arbitrary Shapes |
KR20210116872A (en) * | 2020-03-18 | 2021-09-28 | 한국과학기술원 | Method of Joint Unsupervised Disparity and Optical Flow Estimation of Stereo Videos with Spatiotemporal Loop Consistency and Apparatus of the same |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Fan et al. | Associating inter-image salient instances for weakly supervised semantic segmentation | |
WO2019140767A1 (en) | Recognition system for security check and control method thereof | |
CN112818862B (en) | Face tampering detection method and system based on multi-source clues and mixed attention | |
CN107145908B (en) | A small target detection method based on R-FCN | |
CN106874894B (en) | A Human Object Detection Method Based on Regional Fully Convolutional Neural Networks | |
CN105005764B (en) | The multi-direction Method for text detection of natural scene | |
Shuai et al. | Object detection system based on SSD algorithm | |
CN111027493A (en) | Pedestrian detection method based on deep learning multi-network soft fusion | |
CN105260749B (en) | Real-time target detection method based on direction gradient binary pattern and soft cascade SVM | |
CN106815604A (en) | Method for viewing points detecting based on fusion of multi-layer information | |
CN113191359B (en) | Small sample target detection method and system based on support and query samples | |
CN111368660A (en) | A single-stage semi-supervised image human object detection method | |
CN111860587B (en) | Detection method for small targets of pictures | |
CN110598698B (en) | Natural scene text detection method and system based on adaptive regional suggestion network | |
CN109359551A (en) | A kind of nude picture detection method and system based on machine learning | |
WO2024032010A1 (en) | Transfer learning strategy-based real-time few-shot object detection method | |
CN110956158A (en) | Pedestrian shielding re-identification method based on teacher and student learning frame | |
CN106446890A (en) | Candidate area extraction method based on window scoring and superpixel segmentation | |
CN111462068A (en) | A method for screw and nut detection based on transfer learning | |
CN106022223A (en) | High-dimensional local-binary-pattern face identification algorithm and system | |
CN108319672A (en) | Mobile terminal malicious information filtering method and system based on cloud computing | |
CN112733858A (en) | Image character rapid identification method and device based on character region detection | |
CN114972711B (en) | Improved weak supervision target detection method based on semantic information candidate frame | |
CN110728214B (en) | A Scale Matching-Based Approach for Weak and Small Person Object Detection | |
CN114818963B (en) | A small-sample detection method based on cross-image feature fusion |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| GR01 | Patent grant | |