CN111967433A - Action identification method based on self-supervision learning network - Google Patents
- Publication number
- CN111967433A (application CN202010894661.9A)
- Authority
- CN
- China
- Prior art keywords
- action
- information
- model
- training
- positive
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V40/00—Recognition of biometric, human-related or animal-related patterns in image or video data
- G06V40/10—Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/241—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/241—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
- G06F18/2415—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/047—Probabilistic or stochastic networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/048—Activation functions
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V40/00—Recognition of biometric, human-related or animal-related patterns in image or video data
- G06V40/20—Movements or behaviour, e.g. gesture recognition
Abstract
The invention provides an action recognition method based on a self-supervised learning network. First, OpenPose is used to extract human skeleton information from a video stream, and the skeleton information is made into positive and negative sample data sets; these data sets are used to train a ResNet-56 action classification model, and the preliminarily trained model makes an initial judgment on the input action and then undergoes further self-supervised training. While the action classifier is being trained, the deep learning model YOLOv4 detects object information in the video stream; the detected object information is marked with positive and negative labels and passed into the ResNet-56 action classification model for self-supervised training, giving the self-supervised model higher detection precision and reliability.
Description
Technical Field
The invention belongs to the field of computer intelligent learning, and particularly relates to a self-supervised human body action recognition method combining OpenPose and YOLOv4.
Background
Computer vision is a branch of artificial intelligence that identifies and understands image content mainly by processing visual image information. Intelligent video surveillance relies on the powerful computing capacity of computers: with computer vision techniques, key information is extracted from surveillance images or video and analyzed quickly, and the analysis results are fed back to the monitoring system, so that surveillance video can be intelligently identified, understood, and acted upon.
Analysis of human body key points, also referred to as human pose estimation, is one of the most important tasks in human motion recognition. Many technologies can currently read human key-point information; for example, depth somatosensory cameras such as Kinect allow human key-point information to be acquired in real time once the device is deployed. Their disadvantages are that too few people can be captured at once and the hardware is expensive. By comparison, OpenPose acquires human skeleton key-point information more conveniently and can capture several people at once at far lower cost.
OpenPose is the model algorithm currently adopted by mainstream human pose estimation. After training, most actions in a video stream can be recognized from the skeleton key-point information that OpenPose extracts, but its detection precision still falls somewhat short of Kinect's high precision. Most existing models perform action recognition on OpenPose alone: for example, the OpenPose-based eating behavior recognition method CN201911150648.6, and the OpenPose-based fencing motion acquisition method and computer storage medium CN201810338998.4, both complete their model training on information extracted by OpenPose by itself.
To improve the detection accuracy of OpenPose in practical applications, this patent provides an action recognition method based on a self-supervised learning network: on top of traditional OpenPose-only action recognition, YOLOv4 object detection is added, the detected item information is divided into positive and negative samples, and the result is used for self-supervised model training, thereby improving action recognition accuracy.
Disclosure of Invention
The invention provides an action recognition method based on a self-supervised learning network, addressing the low accuracy of current OpenPose-only pipelines that extract human skeleton information and use it directly for recognition.
The invention specifically comprises the following contents:
Step S1: collecting characteristic information of human skeleton joint points: OpenPose predicts the confidence of human body parts in the picture through a feed-forward network, giving a set S = (S1, S2, …, SJ) in which each skeleton joint has J body-part confidence maps; it simultaneously predicts the part affinity fields (the relationships between the joints of the human skeleton), giving a set L = (L1, L2, …, LC) with C part affinity vector fields per limb. Once the sets S and L are obtained, a greedy algorithm recovers the human skeleton joint information. The extracted skeleton feature information is made into positive and negative sample data sets for the action to be recognized, as shown in fig. 1.
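As a concrete illustration of step S1, here is a minimal per-frame extraction sketch, assuming the OpenPose Python bindings (pyopenpose) are installed; the model path, video file name, and buffering scheme are illustrative assumptions, not details from the patent.

```python
# Hedged sketch: per-frame skeleton keypoint extraction with pyopenpose.
import cv2
import pyopenpose as op

params = {"model_folder": "openpose/models/"}   # assumed model location
wrapper = op.WrapperPython()
wrapper.configure(params)
wrapper.start()

cap = cv2.VideoCapture("input_video.mp4")       # hypothetical video stream
skeleton_samples = []
while True:
    ok, frame = cap.read()
    if not ok:
        break
    datum = op.Datum()
    datum.cvInputData = frame
    wrapper.emplaceAndPop(op.VectorDatum([datum]))
    # poseKeypoints has shape (num_people, num_joints, 3): (x, y, confidence)
    if datum.poseKeypoints is not None:
        skeleton_samples.append(datum.poseKeypoints)
cap.release()
```

The positive and negative sample data sets would then be assembled from `skeleton_samples` according to the action labels.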
Step S2: taking the positive and negative sample data sets produced in step S1 as input, train a ResNet-56-based action classifier model. In the ResNet-56 model, classification is performed using a Softmax loss, with the objective function as follows:
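The formula itself did not survive extraction (it was an image in the original document). Given the symbols defined just below, it plausibly takes the standard multi-task form of a Softmax classification term plus a λ-weighted regression term; the following is a hedged reconstruction under that assumption, not the patent's verbatim equation:

$$L = \frac{1}{N}\sum_{i} L_{cls}\big(p_i, p_i^{*}\big) + \lambda \cdot \frac{1}{N}\sum_{i} p_i^{*}\, L_{reg}\big(t_i, t_i^{*}\big)$$

where $L_{cls}$ is the Softmax (cross-entropy) classification loss, $p_i^{*}$ is the ground-truth label of window $i$, and $t_i$, $t_i^{*}$ are predicted and ground-truth regression targets (the latter two are assumptions, as they are not defined in the surviving text).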
N is the number of samples, λ is the regression loss weight, i is the index of a proposed window in the batch, and P is the prediction probability.
The input of the neural network is x and the expected output is H(x); the residual branch learns F(x) = H(x) − x, and training of the ResNet-56 model is complete once F(x) approaches 0. The training data set is then passed into the trained model to obtain the action recognition results for the video-stream training set.
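A minimal PyTorch training loop for this step might look as follows. Note that torchvision does not ship a ResNet-56, so `model` is assumed to be any CIFAR-style 56-layer residual network, and all hyperparameters are illustrative:

```python
# Hedged sketch: training the action classifier with a Softmax
# (cross-entropy) loss on skeleton-derived samples, as in step S2.
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, Dataset

def train_action_classifier(model: nn.Module, dataset: Dataset,
                            epochs: int = 30, lr: float = 1e-3) -> nn.Module:
    loader = DataLoader(dataset, batch_size=64, shuffle=True)
    criterion = nn.CrossEntropyLoss()        # Softmax loss
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    model.train()
    for _ in range(epochs):
        for skeletons, labels in loader:     # positive/negative samples
            optimizer.zero_grad()
            loss = criterion(model(skeletons), labels)
            loss.backward()
            optimizer.step()
    return model
```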
Step S3.1: establishing an object data set: collect and classify data sets of the objects, other than human bodies, involved in the motion to be detected. For example, if the action to be recognized is A and the items related to action A are B1, B2, …, then information on items B1, B2, … can be collected from many different videos or pictures and made into a standard VOC data set.
Step S3.2: input the item VOC data set prepared in step S3.1 into a standard YOLOv4 model for training. The YOLOv4 model uses a convolutional cascade structure, and the input layer of the network is designed to be 448 × 448. The cost function of the logistic regression in the model is:
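The cost formula was also lost in extraction; the standard logistic-regression cross-entropy over m training examples, which matches the surrounding description of a sigmoid hypothesis $h_\theta$, would read as follows (a reconstruction under that assumption):

$$J(\theta) = -\frac{1}{m}\sum_{i=1}^{m}\Big[\, y^{(i)} \log h_\theta\big(x^{(i)}\big) + \big(1 - y^{(i)}\big)\log\Big(1 - h_\theta\big(x^{(i)}\big)\Big) \Big]$$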
where h_θ is the sigmoid function, which serves as an activation function in the network. When the accuracy of item detection exceeds 95%, model training is finished, and the model can be used to detect and record item and pedestrian information in the video stream.
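For the detection stage, the sketch below runs a trained YOLOv4 network on video frames via OpenCV's DNN module; the config/weight file names, class list, and thresholds are assumptions, not artifacts from the patent.

```python
# Hedged sketch: item/pedestrian detection with a trained YOLOv4 model
# through OpenCV's DNN module (step S3).
import cv2

net = cv2.dnn.readNetFromDarknet("yolov4.cfg", "yolov4.weights")
model = cv2.dnn_DetectionModel(net)
# 448x448 matches the input-layer size stated in the patent.
model.setInputParams(size=(448, 448), scale=1 / 255.0, swapRB=True)

def detect_items(frame, class_names, conf_threshold=0.5, nms_threshold=0.4):
    """Return (label, confidence, box) triples for one frame."""
    class_ids, confidences, boxes = model.detect(frame, conf_threshold,
                                                 nms_threshold)
    return [(class_names[int(c)], float(s), box)
            for c, s, box in zip(class_ids, confidences, boxes)]
```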
Step S4.1: the item information identified in step S3 is divided into positive and negative labels: if the item information is consistent with the motion information, it is regarded as a positive sample; if they differ, it is regarded as a negative sample. For the item information, the comparison measures the similarity between two feature representations according to the criterion score(f(x), f(x⁺)) ≫ score(f(x), f(x⁻)).
Constructing a softmax classifier to correctly classify positive and negative samples:
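The classifier formula is likewise missing from the extracted text. Given the score comparison above and the later remark that the denominator contains one positive sample and n − 1 negative samples, a contrastive softmax (InfoNCE-style) loss of the following form is a plausible reconstruction, offered as an assumption rather than the patent's exact expression:

$$\mathcal{L} = -\log \frac{\exp\big(\mathrm{score}(f(x), f(x^{+}))\big)}{\exp\big(\mathrm{score}(f(x), f(x^{+}))\big) + \sum_{j=1}^{n-1} \exp\big(\mathrm{score}(f(x), f(x_{j}^{-}))\big)}$$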
step S4.2: after the article information is divided into positive and negative samples, the positive and negative samples are transmitted to the ResNet-56 for self-supervision training in step S3: when the number of positive samples is large, the confidence of the finally recognized action is increased, and on the contrary, when the number of negative samples is large, the confidence of the predicted change action is reduced. Through continuously inputting a positive and negative label training model made of article information, the recognition rate of the action classifier model can gradually rise until an ideal level is reached.
Drawings
FIG. 1 shows the flow of extracting human skeleton information with OpenPose;
FIG. 2 illustrates the detection of objects around a human body;
FIG. 3 is a diagram of the self-supervised training of the action classification model.
Detailed Description
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it should be understood that the described embodiments are only a part of the embodiments of the present invention, and not all embodiments, and therefore should not be considered as a limitation to the scope of protection. All other embodiments, which can be obtained by a person skilled in the art without any inventive step based on the embodiments of the present invention, are within the scope of the present invention.
Example 1: (recognizing climbing behavior)
Step S1: collect human skeleton joint feature information related to climbing, such as people climbing enclosure walls or ladders: OpenPose predicts the confidence of the key climbing-related body parts through a feed-forward network, giving a set S = (S1, S2, …, SJ) in which each skeleton joint has J body-part confidence maps, and simultaneously predicts the part affinity fields (the relationships between the joints of the human skeleton), giving a set L = (L1, L2, …, LC) with C part affinity vector fields per limb. Once the sets S and L are obtained, a greedy algorithm recovers the human skeleton joint information. The extracted climbing skeleton feature information is made into positive and negative sample data sets for the action to be recognized, where the positive samples are climbing actions and the negative samples are non-climbing actions. Skeletal keypoint prediction is shown in fig. 1.
Step S2: taking the climbing positive and negative sample data set produced in step S1 as input, train a ResNet-56-based action classifier model, shown in fig. 2. In the ResNet-56 model, classification is performed using a Softmax loss with the objective function given in step S2 of the disclosure above, where N is the number of samples, λ is the regression loss weight, i is the index of a proposed window in the batch, and P is the prediction probability.
The input of the neural network is x and the expected output is H(x); the residual branch learns F(x) = H(x) − x, and training of the ResNet-56 model is complete once F(x) approaches 0. The action model at this point can be used to identify climbing behavior in the video stream, but its precision still needs improvement.
Step S3.1: in this patent, YOLOv4 object detection is used with a combination of the official weights and weights trained on a self-built data set. The first step is therefore to establish a data set of climbing-related objects: objects relevant to climbing include enclosure walls, doors, ladders, utility poles, and the like; pictures of these objects are collected by category and made into a standard VOC data set.
Step S3.2: the item VOC data set prepared in step S3.1 above is input into a standard YOLOv4 model for training, as shown in fig. 3.
The YOLOv4 model used in this patent has multiple convolutional layers and a fully connected layer and makes extensive use of a convolutional cascade structure; the input layer of the network is designed to be 448 × 448. The convolutional cascade structure performs feature extraction on the image, and the fully connected layer predicts class probabilities and bounding boxes. The cost function of the logistic regression in the model is the cross-entropy given in step S3.2 above, where h_θ is the sigmoid function serving as an activation function in the network. When the accuracy of item detection exceeds 95%, model training is finished, and the model can be used to detect and record item and pedestrian information in the video stream.
Step S4.1: the item information identified in step S3 is divided into positive and negative labels: climbing-related items such as doors, ladders, and utility poles are regarded as positive samples, while mobile phones, earphones, food, and the like are regarded as negative samples. For the remaining identified objects, a softmax classifier is constructed to correctly classify positive and negative samples, where the denominator term comprises one positive sample and n − 1 negative samples;
step S4.2: after the article information is divided into positive and negative samples, the article information and the ResNet-56 action classifier model are transmitted, and the precision of action recognition is improved under the self-supervision learning.
Once the model is trained, it can be used to detect whether climbing behavior occurs in the video stream, with greatly improved precision compared with a model that has not undergone self-supervision.
Claims (4)
1. An action recognition method based on a self-supervised learning network, characterized by comprising the following steps:
step S1: extracting human skeleton characteristic information of a target human body image in the video stream through OpenPose;
step S2: constructing a neural network model, and training the neural network model by taking the human skeleton information extracted by OpenPose as input to obtain an action recognition result corresponding to the action training data set;
step S3: detecting object information in the video stream by using a YOLOv4 model;
step S4: labeling the identified object information according to the detected action, and retraining the neural network based on the item information labels and the action recognition results corresponding to the pre-training data, to obtain a retrained neural network model;
step S5: and performing motion recognition on the video stream by using the retrained model, and outputting a motion prediction result in the video stream.
2. The method according to claim 1, wherein constructing the neural network model in step S2 comprises the following steps:
step S2.1: extracting human skeleton feature information by using OpenPose: predict the confidence of human body parts in the picture through a feed-forward network, giving a set S = (S1, S2, …, SJ) in which each skeleton joint has J body-part confidence maps, and simultaneously predict the part affinity fields (the relationships between the joints of the human skeleton), giving a set L = (L1, L2, …, LC) with C part affinity vector fields per limb; once the sets S and L are obtained, recover the human skeleton joint information with a greedy algorithm; make the extracted skeleton feature information into positive and negative sample data sets for the motion to be recognized;
step S2.2: taking the positive and negative sample data sets produced in step S2.1 as input, training a ResNet-56-based action classifier model; in the ResNet-56 model, classification is performed using a Softmax loss, with the objective function as follows:
N is the number of samples, λ is the regression loss weight, i is the index of a proposed window in the batch, and P is the prediction probability;
the input of the neural network is x and the expected output is H(x); the residual branch learns F(x) = H(x) − x, and model training is finished once F(x) approaches 0;
step S2.3: passing the training data set into the model trained in step S2.2 to obtain the action recognition results of the video-stream training set.
3. The method according to claim 1, wherein detecting object information with the YOLOv4 model in step S3 comprises the following steps:
step S3.1: establishing an object data set: collecting and classifying data sets of the objects, other than human bodies, involved in the motion to be detected; for example, if the action to be recognized is A and the items related to action A are B1, B2, …, these items are made into a standard VOC data set;
step S3.2: inputting the item VOC data set produced in step S3.1 into a standard YOLOv4 model for training; the input layer of the network is designed to be 448 × 448, a convolutional cascade structure performs feature extraction on the image, and the fully connected layer predicts class probabilities and bounding boxes;
the cost function of logistic regression is:
wherein h_θ is the sigmoid function, used as an activation function in the network; when the accuracy of item detection exceeds 95%, model training is finished;
step S3.3: detecting the incoming video stream with the YOLOv4 model trained in step S3.2, and recording and storing the object information whenever the motion is detected to involve objects B1, B2, ….
4. The method according to claim 1, wherein labeling the identified object information according to the detected action in step S4 comprises the following steps:
dividing the item information identified in step S3 into positive and negative labels: if the item information is consistent with the motion information it is regarded as a positive sample, and if they differ it is regarded as a negative sample, the similarity between two feature representations being measured according to the criterion score(f(x), f(x⁺)) ≫ score(f(x), f(x⁻)); constructing a softmax classifier to correctly classify the positive and negative samples:
after the item information has been divided into positive and negative samples, it is passed to the ResNet-56 model of step S2 for self-supervised training: among the incoming item information, when positive samples dominate, the action confidence finally output by the ResNet-56 action classification model increases; conversely, when negative samples dominate, the confidence of the output action decreases; through repeated input and training on object information, the action classification model's recognition rate for the target action reaches the desired level.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010894661.9A CN111967433A (en) | 2020-08-31 | 2020-08-31 | Action identification method based on self-supervision learning network |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010894661.9A CN111967433A (en) | 2020-08-31 | 2020-08-31 | Action identification method based on self-supervision learning network |
Publications (1)
Publication Number | Publication Date |
---|---|
CN111967433A true CN111967433A (en) | 2020-11-20 |
Family
ID=73400782
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010894661.9A Pending CN111967433A (en) | 2020-08-31 | 2020-08-31 | Action identification method based on self-supervision learning network |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111967433A (en) |
- 2020-08-31: application CN202010894661.9A filed in CN; published as CN111967433A; status: active, pending
Patent Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110135319A (en) * | 2019-05-09 | 2019-08-16 | 广州大学 | A kind of anomaly detection method and its system |
CN110852303A (en) * | 2019-11-21 | 2020-02-28 | 中科智云科技有限公司 | Eating behavior identification method based on OpenPose |
CN111178251A (en) * | 2019-12-27 | 2020-05-19 | 汇纳科技股份有限公司 | Pedestrian attribute identification method and system, storage medium and terminal |
Non-Patent Citations (1)
Title |
---|
LI Bin'ai (李宾皑) et al.: "Surveillance video analysis method combining object detection and human pose estimation algorithms", Electronic Technology & Software Engineering (电子技术与软件工程) *
Cited By (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112560618A (en) * | 2020-12-06 | 2021-03-26 | 复旦大学 | Behavior classification method based on skeleton and video feature fusion |
CN112560649A (en) * | 2020-12-09 | 2021-03-26 | 广州云从鼎望科技有限公司 | Behavior action detection method, system, equipment and medium |
CN112668492A (en) * | 2020-12-30 | 2021-04-16 | 中山大学 | Behavior identification method for self-supervised learning and skeletal information |
CN112668492B (en) * | 2020-12-30 | 2023-06-20 | 中山大学 | Behavior recognition method for self-supervision learning and skeleton information |
CN113191228A (en) * | 2021-04-20 | 2021-07-30 | 上海东普信息科技有限公司 | Express item casting identification method, device, equipment and storage medium |
CN113935407A (en) * | 2021-09-29 | 2022-01-14 | 光大科技有限公司 | Abnormal behavior recognition model determining method and device |
Legal Events
Date | Code | Title | Description
---|---|---|---
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| RJ01 | Rejection of invention patent application after publication | Application publication date: 20201120 |