CN111967433A - Action identification method based on self-supervision learning network - Google Patents
- Publication number
- CN111967433A (application CN202010894661.9A)
- Authority
- CN
- China
- Prior art keywords
- action
- information
- model
- training
- positive
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V40/00—Recognition of biometric, human-related or animal-related patterns in image or video data
- G06V40/10—Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/241—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/241—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
- G06F18/2415—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/047—Probabilistic or stochastic networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/048—Activation functions
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V40/00—Recognition of biometric, human-related or animal-related patterns in image or video data
- G06V40/20—Movements or behaviour, e.g. gesture recognition
Abstract
The invention provides an action recognition method based on a self-supervised learning network. First, OpenPose is used to extract human skeleton information from a video stream, and the skeleton information is made into positive and negative sample data sets; these data sets are used to train a ResNet-56 action classification model, and the preliminarily trained model makes an initial judgment on the input action and then undergoes further self-supervised training. While the action classifier is being trained, the deep learning model YOLOv4 detects object information in the video stream; the detected object information is marked with positive and negative labels and passed into the ResNet-56 action classification model for self-supervised training, giving the self-supervised model higher detection precision and reliability.
Description
Technical Field
The invention belongs to the field of computer intelligent learning, and particularly relates to a self-supervised human body action recognition method combining OpenPose and YOLOv4.
Background
Computer vision is a branch of artificial intelligence that identifies and understands image content mainly by processing visual image information. Intelligent video surveillance relies on the powerful computing capacity of computers: with computer vision techniques, key information is extracted from surveillance images or video and analyzed quickly, and the analysis results are fed back to the monitoring system, so that surveillance video can be intelligently identified, understood, and acted upon.
Analysis of human body key points, also referred to as human pose estimation, is one of the most important tasks in human motion recognition. Many technologies can currently read human key-point information; for example, depth somatosensory cameras such as Kinect allow human key-point information to be acquired in real time once the device is deployed. Their disadvantages are that too few people can be captured at once and the hardware is expensive. By comparison, OpenPose acquires human skeleton key-point information more conveniently and can capture several people at once at far lower cost.
OpenPose is the model algorithm currently adopted by mainstream human pose estimation. After training, most actions in a video stream can be recognized from the skeleton key-point information that OpenPose extracts, but its detection precision still falls somewhat short of Kinect's high precision. Most existing models perform action recognition on OpenPose alone: for example, the OpenPose-based eating behavior recognition method CN201911150648.6, and the OpenPose-based fencing motion acquisition method and computer storage medium CN201810338998.4, both complete their model training on information extracted by OpenPose by itself.
To improve the detection accuracy of OpenPose in practical applications, this patent provides an action recognition method based on a self-supervised learning network: on top of traditional OpenPose-only action recognition, YOLOv4 object detection is added, the detected item information is divided into positive and negative samples, and the result is used for self-supervised model training, thereby improving action recognition accuracy.
Disclosure of Invention
The invention provides an action recognition method based on a self-supervised learning network, addressing the low accuracy of current OpenPose-only pipelines that extract human skeleton information and use it directly for recognition.
The invention specifically comprises the following contents:
Step S1: collecting characteristic information of human skeleton joint points: OpenPose predicts the confidence of human body parts in the picture through a feed-forward network, giving a set S = (S1, S2, …, SJ) in which each skeleton joint has J body-part confidence maps; it simultaneously predicts the part affinity fields (the relationships between the joints of the human skeleton), giving a set L = (L1, L2, …, LC) with C part affinity vector fields per limb. Once the sets S and L are obtained, a greedy algorithm recovers the human skeleton joint information. The extracted skeleton feature information is made into positive and negative sample data sets for the action to be recognized, as shown in fig. 1.
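As a concrete illustration of step S1, here is a minimal per-frame extraction sketch, assuming the OpenPose Python bindings (pyopenpose) are installed; the model path, video file name, and buffering scheme are illustrative assumptions, not details from the patent.

```python
# Hedged sketch: per-frame skeleton keypoint extraction with pyopenpose.
import cv2
import pyopenpose as op

params = {"model_folder": "openpose/models/"}   # assumed model location
wrapper = op.WrapperPython()
wrapper.configure(params)
wrapper.start()

cap = cv2.VideoCapture("input_video.mp4")       # hypothetical video stream
skeleton_samples = []
while True:
    ok, frame = cap.read()
    if not ok:
        break
    datum = op.Datum()
    datum.cvInputData = frame
    wrapper.emplaceAndPop(op.VectorDatum([datum]))
    # poseKeypoints has shape (num_people, num_joints, 3): (x, y, confidence)
    if datum.poseKeypoints is not None:
        skeleton_samples.append(datum.poseKeypoints)
cap.release()
```

The positive and negative sample data sets would then be assembled from `skeleton_samples` according to the action labels.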
Step S2: taking the positive and negative sample data sets produced in step S1 as input, train a ResNet-56-based action classifier model. In the ResNet-56 model, classification is performed using a Softmax loss, with the objective function as follows:
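The formula itself did not survive extraction (it was an image in the original document). Given the symbols defined just below, it plausibly takes the standard multi-task form of a Softmax classification term plus a λ-weighted regression term; the following is a hedged reconstruction under that assumption, not the patent's verbatim equation:

$$L = \frac{1}{N}\sum_{i} L_{cls}\big(p_i, p_i^{*}\big) + \lambda \cdot \frac{1}{N}\sum_{i} p_i^{*}\, L_{reg}\big(t_i, t_i^{*}\big)$$

where $L_{cls}$ is the Softmax (cross-entropy) classification loss, $p_i^{*}$ is the ground-truth label of window $i$, and $t_i$, $t_i^{*}$ are predicted and ground-truth regression targets (the latter two are assumptions, as they are not defined in the surviving text).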
N is the number of samples, λ is the regression loss weight, i is the index of a proposed window in the batch, and P is the prediction probability.
The input of the neural network is x and the expected output is H(x); the residual branch learns F(x) = H(x) − x, and training of the ResNet-56 model is complete once F(x) approaches 0. The training data set is then passed into the trained model to obtain the action recognition results for the video-stream training set.
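A minimal PyTorch training loop for this step might look as follows. Note that torchvision does not ship a ResNet-56, so `model` is assumed to be any CIFAR-style 56-layer residual network, and all hyperparameters are illustrative:

```python
# Hedged sketch: training the action classifier with a Softmax
# (cross-entropy) loss on skeleton-derived samples, as in step S2.
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, Dataset

def train_action_classifier(model: nn.Module, dataset: Dataset,
                            epochs: int = 30, lr: float = 1e-3) -> nn.Module:
    loader = DataLoader(dataset, batch_size=64, shuffle=True)
    criterion = nn.CrossEntropyLoss()        # Softmax loss
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    model.train()
    for _ in range(epochs):
        for skeletons, labels in loader:     # positive/negative samples
            optimizer.zero_grad()
            loss = criterion(model(skeletons), labels)
            loss.backward()
            optimizer.step()
    return model
```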
Step S3.1: establishing an object data set: collect and classify data sets of the objects, other than human bodies, involved in the motion to be detected. For example, if the action to be recognized is A and the items related to action A are B1, B2, …, then information on items B1, B2, … can be collected from many different videos or pictures and made into a standard VOC data set.
Step S3.2: input the item VOC data set prepared in step S3.1 into a standard YOLOv4 model for training. The YOLOv4 model uses a convolutional cascade structure, and the input layer of the network is designed to be 448 × 448. The cost function of the logistic regression in the model is:
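The cost formula was also lost in extraction; the standard logistic-regression cross-entropy over m training examples, which matches the surrounding description of a sigmoid hypothesis $h_\theta$, would read as follows (a reconstruction under that assumption):

$$J(\theta) = -\frac{1}{m}\sum_{i=1}^{m}\Big[\, y^{(i)} \log h_\theta\big(x^{(i)}\big) + \big(1 - y^{(i)}\big)\log\Big(1 - h_\theta\big(x^{(i)}\big)\Big) \Big]$$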
where h_θ is the sigmoid function, which serves as an activation function in the network. When the accuracy of item detection exceeds 95%, model training is finished, and the model can be used to detect and record item and pedestrian information in the video stream.
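For the detection stage, the sketch below runs a trained YOLOv4 network on video frames via OpenCV's DNN module; the config/weight file names, class list, and thresholds are assumptions, not artifacts from the patent.

```python
# Hedged sketch: item/pedestrian detection with a trained YOLOv4 model
# through OpenCV's DNN module (step S3).
import cv2

net = cv2.dnn.readNetFromDarknet("yolov4.cfg", "yolov4.weights")
model = cv2.dnn_DetectionModel(net)
# 448x448 matches the input-layer size stated in the patent.
model.setInputParams(size=(448, 448), scale=1 / 255.0, swapRB=True)

def detect_items(frame, class_names, conf_threshold=0.5, nms_threshold=0.4):
    """Return (label, confidence, box) triples for one frame."""
    class_ids, confidences, boxes = model.detect(frame, conf_threshold,
                                                 nms_threshold)
    return [(class_names[int(c)], float(s), box)
            for c, s, box in zip(class_ids, confidences, boxes)]
```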
Step S4.1: the item information identified in step S3 is divided into positive and negative labels: if the item information is consistent with the motion information, it is regarded as a positive sample; if they differ, it is regarded as a negative sample. For the item information, the comparison measures the similarity between two feature representations according to the criterion score(f(x), f(x⁺)) ≫ score(f(x), f(x⁻)).
Constructing a softmax classifier to correctly classify positive and negative samples:
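The classifier formula is likewise missing from the extracted text. Given the score comparison above and the later remark that the denominator contains one positive sample and n − 1 negative samples, a contrastive softmax (InfoNCE-style) loss of the following form is a plausible reconstruction, offered as an assumption rather than the patent's exact expression:

$$\mathcal{L} = -\log \frac{\exp\big(\mathrm{score}(f(x), f(x^{+}))\big)}{\exp\big(\mathrm{score}(f(x), f(x^{+}))\big) + \sum_{j=1}^{n-1} \exp\big(\mathrm{score}(f(x), f(x_{j}^{-}))\big)}$$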
step S4.2: after the article information is divided into positive and negative samples, the positive and negative samples are transmitted to the ResNet-56 for self-supervision training in step S3: when the number of positive samples is large, the confidence of the finally recognized action is increased, and on the contrary, when the number of negative samples is large, the confidence of the predicted change action is reduced. Through continuously inputting a positive and negative label training model made of article information, the recognition rate of the action classifier model can gradually rise until an ideal level is reached.
Drawings
FIG. 1 shows the flow of extracting human skeleton information with OpenPose;
FIG. 2 illustrates the detection of objects around a human body;
FIG. 3 is a diagram of the self-supervised training of the action classification model.
Detailed Description
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it should be understood that the described embodiments are only a part of the embodiments of the present invention, and not all embodiments, and therefore should not be considered as a limitation to the scope of protection. All other embodiments, which can be obtained by a person skilled in the art without any inventive step based on the embodiments of the present invention, are within the scope of the present invention.
Example 1: (recognizing climbing behavior)
Step S1: collect human skeleton joint feature information related to climbing, such as people climbing enclosure walls or ladders: OpenPose predicts the confidence of the key climbing-related body parts through a feed-forward network, giving a set S = (S1, S2, …, SJ) in which each skeleton joint has J body-part confidence maps, and simultaneously predicts the part affinity fields (the relationships between the joints of the human skeleton), giving a set L = (L1, L2, …, LC) with C part affinity vector fields per limb. Once the sets S and L are obtained, a greedy algorithm recovers the human skeleton joint information. The extracted climbing skeleton feature information is made into positive and negative sample data sets for the action to be recognized, where the positive samples are climbing actions and the negative samples are non-climbing actions. Skeletal keypoint prediction is shown in fig. 1.
Step S2: taking the climbing positive and negative sample data set produced in step S1 as input, train a ResNet-56-based action classifier model, shown in fig. 2. In the ResNet-56 model, classification is performed using a Softmax loss with the objective function given in step S2 of the disclosure above, where N is the number of samples, λ is the regression loss weight, i is the index of a proposed window in the batch, and P is the prediction probability.
The input of the neural network is x and the expected output is H(x); the residual branch learns F(x) = H(x) − x, and training of the ResNet-56 model is complete once F(x) approaches 0. The action model at this point can be used to identify climbing behavior in the video stream, but its precision still needs improvement.
Step S3.1: in this patent, YOLOv4 object detection is used with a combination of the official weights and weights trained on a self-built data set. The first step is therefore to establish a data set of climbing-related objects: objects relevant to climbing include enclosure walls, doors, ladders, utility poles, and the like; pictures of these objects are collected by category and made into a standard VOC data set.
Step S3.2: the item VOC data set prepared in step S3.1 above is input into a standard YOLOv4 model for training, as shown in fig. 3.
The YOLOv4 model used in this patent has multiple convolutional layers and a fully connected layer and makes extensive use of a convolutional cascade structure; the input layer of the network is designed to be 448 × 448. The convolutional cascade structure performs feature extraction on the image, and the fully connected layer predicts class probabilities and bounding boxes. The cost function of the logistic regression in the model is the cross-entropy given in step S3.2 above, where h_θ is the sigmoid function serving as an activation function in the network. When the accuracy of item detection exceeds 95%, model training is finished, and the model can be used to detect and record item and pedestrian information in the video stream.
Step S4.1: the item information identified in step S3 is divided into positive and negative labels: climbing-related items such as doors, ladders, and utility poles are regarded as positive samples, while mobile phones, earphones, food, and the like are regarded as negative samples. For the remaining identified objects, a softmax classifier is constructed to correctly classify positive and negative samples, where the denominator term comprises one positive sample and n − 1 negative samples;
step S4.2: after the article information is divided into positive and negative samples, the article information and the ResNet-56 action classifier model are transmitted, and the precision of action recognition is improved under the self-supervision learning.
Once the model is trained, it can be used to detect whether climbing behavior occurs in the video stream, with greatly improved precision compared with a model that has not undergone self-supervision.
Claims (4)
1. An action recognition method based on a self-supervised learning network, characterized by comprising the following steps:
step S1: extracting human skeleton characteristic information of a target human body image in the video stream through OpenPose;
step S2: constructing a neural network model, and training the neural network model by taking the human skeleton information extracted by OpenPose as input to obtain an action recognition result corresponding to the action training data set;
step S3: detecting object information in the video stream by using a YOLOv4 model;
step S4: labeling the identified object information according to the detected action, and retraining the neural network based on the item information labels and the action recognition results corresponding to the pre-training data, to obtain a retrained neural network model;
step S5: and performing motion recognition on the video stream by using the retrained model, and outputting a motion prediction result in the video stream.
2. The method according to claim 1, wherein constructing the neural network model in step S2 comprises the following steps:
step S2.1: extracting human skeleton feature information by using OpenPose: predict the confidence of human body parts in the picture through a feed-forward network, giving a set S = (S1, S2, …, SJ) in which each skeleton joint has J body-part confidence maps, and simultaneously predict the part affinity fields (the relationships between the joints of the human skeleton), giving a set L = (L1, L2, …, LC) with C part affinity vector fields per limb; once the sets S and L are obtained, recover the human skeleton joint information with a greedy algorithm; make the extracted skeleton feature information into positive and negative sample data sets for the motion to be recognized;
step S2.2: taking the positive and negative sample data sets produced in step S2.1 as input, training a ResNet-56-based action classifier model; in the ResNet-56 model, classification is performed using a Softmax loss, with the objective function as follows:
N is the number of samples, λ is the regression loss weight, i is the index of a proposed window in the batch, and P is the prediction probability;
the input of the neural network is x and the expected output is H(x); the residual branch learns F(x) = H(x) − x, and model training is finished once F(x) approaches 0;
step S2.3: passing the training data set into the model trained in step S2.2 to obtain the action recognition results of the video-stream training set.
3. The method according to claim 1, wherein detecting object information with the YOLOv4 model in step S3 comprises the following steps:
step S3.1: establishing an object data set: collecting and classifying data sets of the objects, other than human bodies, involved in the motion to be detected; for example, if the action to be recognized is A and the items related to action A are B1, B2, …, these items are made into a standard VOC data set;
step S3.2: inputting the item VOC data set produced in step S3.1 into a standard YOLOv4 model for training; the input layer of the network is designed to be 448 × 448, a convolutional cascade structure performs feature extraction on the image, and the fully connected layer predicts class probabilities and bounding boxes;
the cost function of logistic regression is:
wherein h_θ is the sigmoid function, used as an activation function in the network; when the accuracy of item detection exceeds 95%, model training is finished;
step S3.3: detecting the incoming video stream with the YOLOv4 model trained in step S3.2, and recording and storing the object information whenever the motion is detected to involve objects B1, B2, ….
4. The method according to claim 1, wherein labeling the identified object information according to the detected action in step S4 comprises the following steps:
dividing the item information identified in step S3 into positive and negative labels: if the item information is consistent with the motion information it is regarded as a positive sample, and if they differ it is regarded as a negative sample, the similarity between two feature representations being measured according to the criterion score(f(x), f(x⁺)) ≫ score(f(x), f(x⁻)); constructing a softmax classifier to correctly classify the positive and negative samples:
after the item information has been divided into positive and negative samples, it is passed to the ResNet-56 model of step S2 for self-supervised training: among the incoming item information, when positive samples dominate, the action confidence finally output by the ResNet-56 action classification model increases; conversely, when negative samples dominate, the confidence of the output action decreases; through repeated input and training on object information, the action classification model's recognition rate for the target action reaches the desired level.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010894661.9A CN111967433A (en) | 2020-08-31 | 2020-08-31 | Action identification method based on self-supervision learning network |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010894661.9A CN111967433A (en) | 2020-08-31 | 2020-08-31 | Action identification method based on self-supervision learning network |
Publications (1)
Publication Number | Publication Date |
---|---|
CN111967433A true CN111967433A (en) | 2020-11-20 |
Family
ID=73400782
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010894661.9A Pending CN111967433A (en) | 2020-08-31 | 2020-08-31 | Action identification method based on self-supervision learning network |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111967433A (en) |
- 2020-08-31: application CN202010894661.9A filed in CN; published as CN111967433A; status: active, pending
Patent Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110135319A (en) * | 2019-05-09 | 2019-08-16 | 广州大学 | A kind of anomaly detection method and its system |
CN110852303A (en) * | 2019-11-21 | 2020-02-28 | 中科智云科技有限公司 | Eating behavior identification method based on OpenPose |
CN111178251A (en) * | 2019-12-27 | 2020-05-19 | 汇纳科技股份有限公司 | Pedestrian attribute identification method and system, storage medium and terminal |
Non-Patent Citations (1)
Title |
---|
LI Bin'ai (李宾皑) et al.: "Surveillance video analysis method combining object detection and human pose estimation algorithms", Electronic Technology & Software Engineering (电子技术与软件工程) *
Cited By (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112560618A (en) * | 2020-12-06 | 2021-03-26 | 复旦大学 | Behavior classification method based on skeleton and video feature fusion |
CN112560649A (en) * | 2020-12-09 | 2021-03-26 | 广州云从鼎望科技有限公司 | Behavior action detection method, system, equipment and medium |
CN112668492A (en) * | 2020-12-30 | 2021-04-16 | 中山大学 | Behavior identification method for self-supervised learning and skeletal information |
CN112668492B (en) * | 2020-12-30 | 2023-06-20 | 中山大学 | Behavior recognition method for self-supervision learning and skeleton information |
CN113191228A (en) * | 2021-04-20 | 2021-07-30 | 上海东普信息科技有限公司 | Express item casting identification method, device, equipment and storage medium |
CN113935407A (en) * | 2021-09-29 | 2022-01-14 | 光大科技有限公司 | Abnormal behavior recognition model determining method and device |
Legal Events
Date | Code | Title | Description
---|---|---|---
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| RJ01 | Rejection of invention patent application after publication | Application publication date: 20201120 |