CN114202803A - Multi-stage human body abnormal action detection method based on residual error network - Google Patents
- Publication number: CN114202803A (application CN202111553555.5A)
- Authority: CN (China)
- Prior art keywords: network, video, human body, abnormal, layer
- Prior art date: 2021-12-17
- Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
Abstract
The invention relates to a multi-stage human body abnormal action detection method based on a residual error network, which comprises the following steps: segmenting the video segment to be detected into video instances of equal length; obtaining the position and size of the human body target boundary box in each monitoring video instance by using a target detection network model; calculating, according to the boundary box information, the category and confidence of the human body action within the boundary box in each monitoring video instance by using an action recognition network model; and giving an abnormal score for each monitoring video instance by using an abnormal score learning model and carrying out weighted fusion. The invention can quickly obtain abnormal action information from the monitoring video, and designs the target detection network model required for detecting the human body boundary box, the action recognition network model for analyzing the human body action, and the abnormal score learning model for predicting abnormal scores. The method realizes detection of abnormal human body actions in common monitoring video scenes, is simple, has a low false alarm rate, and has practical value.
Description
Technical Field
The invention relates to a multi-stage human body abnormal action detection method based on a residual error network, and in particular to a human body action recognition and classification method based on the residual network. It draws on techniques from machine learning, deep learning, computer vision and anomaly detection, is mainly applied to abnormal action detection in real monitoring video scenes, and is particularly applicable to fields such as smart home and intelligent security.
Background
With the rapid development of society, public safety awareness has risen year by year, and frequent public safety incidents draw increasing attention from the public. Security engineering built on video recording and monitoring systems has developed rapidly: large numbers of monitoring video products are deployed on major roads, at key buildings, and in densely populated public places such as campuses in every city, while security monitoring in the home has also become an indispensable part of modern life. Whether for intelligent life assistance in the home or intelligent safety monitoring in public places, a great deal of key information can be extracted from monitoring video. However, finding this key information manually consumes a great deal of time and effort; in practice, staff often face massive amounts of monitoring data, and the efficiency and accuracy of manually analyzing abnormal action information in video cannot reach an ideal level. Therefore, realizing an intelligent video monitoring system, in which the computer completes the task of detecting abnormal actions and quickly extracts and forwards the key information to staff, can improve the detection efficiency and accuracy of the video monitoring system, achieve real-time abnormal action detection, effectively safeguard public safety and the normal activities of the people, and reduce the incidence of public safety problems.
The traditional video monitoring system is limited to recording video; complex tasks such as real-time detection and analysis of abnormal events still require manual operation. Handling abnormal behavior events by manual methods alone can hardly meet the requirements of current monitoring systems, so intelligent video monitoring technology has become one of the best ways to handle abnormal behavior events quickly and in real time.
In the abnormal action recognition task, abnormal events occur far less frequently than normal events, so there is a lack of fully labeled learning samples; moreover, abnormal actions are highly varied, so all categories of abnormal events cannot be enumerated in advance.
The abnormal action recognition task can therefore be learned in a weakly supervised manner, training the model with coarse-grained (video-level) labels rather than fine-grained labels. The method adopts a multi-stage human body abnormal action detection approach based on a residual error network: several network models are used to compute, respectively, the position and size of the human body target boundary box, the category and confidence of the human body action, and the abnormal score of each monitoring video instance, which are then weighted and fused.
Disclosure of Invention
The invention overcomes some defects of the prior art and provides a multi-stage human body abnormal action detection method based on a residual error network. It adopts a target detection network model, a human body action recognition model and an abnormal score learning model, can quickly obtain abnormal action information from a monitoring video, designs the target detection network model required for detecting the human body boundary box, the action recognition network model for analyzing human body actions and the abnormal score learning model for predicting abnormal scores, classifies and recognizes abnormal human body actions such as falling and fighting in the monitoring video, and achieves good real-time detection speed and accuracy.
The technical solution of the invention is as follows: a multi-stage human body abnormal action detection method based on a residual error network, suitable for classifying and identifying abnormal human body actions (such as falling and fighting) in a monitoring video, comprises the following steps:
(1) segmenting a video segment to be detected into m video instances with equal length;
(2) according to the video example segmented in the step (1), adopting a target detection network model based on a residual error network to obtain the position and the size of a human body target boundary frame appearing in each monitoring video example;
(3) based on the position and the size of the human body target boundary box in the step (2), calculating the category and the confidence coefficient of the human body action in the human body target boundary box in each monitoring video instance by using a human body action recognition model based on a double-flow space-time residual error network;
(4) according to the category and confidence of the human body action in step (3), an abnormal score for each monitoring video instance is given by an abnormal score learning model and weighted fusion is carried out, so that abnormal human body actions such as falling and fighting can be classified and identified.
The step (1) is specifically realized as follows:
(11) if the applicable data set is a training and verification data set, the video segments containing abnormal events are divided into abnormal video instance bags, denoted B_a, according to the abnormal event labels in the data set, and the video segments not containing abnormal events are divided into normal video instance bags, denoted B_n;
(12) If the applicable data set is a test data set, dividing the video clip to be detected into m video instances with equal length;
(13) the resolution of the video instances obtained in step (11) or (12) is cropped or scaled to 228 × 228.
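For illustration of step (1) only, the following minimal sketch (assuming OpenCV and NumPy; the function names are not taken from the patent) splits a frame sequence into m equal-length instances and rescales each frame to 228 × 228:

```python
import cv2
import numpy as np

def split_into_instances(frames, m):
    """Split a sequence of video frames into m equal-length instances.

    Trailing frames that do not fill a complete instance are dropped so
    that every instance has exactly len(frames) // m frames.
    """
    per_instance = len(frames) // m
    return [frames[i * per_instance:(i + 1) * per_instance] for i in range(m)]

def resize_instance(instance, size=228):
    """Scale every frame of one instance to size x size (228 x 228 here)."""
    return [cv2.resize(frame, (size, size)) for frame in instance]

if __name__ == "__main__":
    # Dummy clip: 300 random RGB frames of 320x240, split into m = 10 instances.
    clip = [np.random.randint(0, 255, (240, 320, 3), dtype=np.uint8) for _ in range(300)]
    instances = [resize_instance(inst) for inst in split_into_instances(clip, m=10)]
    print(len(instances), len(instances[0]), instances[0][0].shape)  # 10 30 (228, 228, 3)
```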
The step (2) is specifically realized as follows:
(21) Faster R-CNN is used as the target detection network model, with the video instances obtained in step (13) as input; the model uses a residual network as its skeleton (backbone) network. The main feature-extraction stages of the residual network, conv2_x, conv3_x, conv4_x and conv5_x, are composed of groups of residual modules, each module consisting of 3 convolutional layers with a residual connection, where conv2_x contains 3 residual modules, conv3_x contains 4, conv4_x contains 23 and conv5_x contains 3, and an average pooling layer, a fully connected layer and a Softmax layer are added after the conv5_x network layer;
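As an illustrative aid to step (21), a simplified PyTorch sketch of one residual module (three convolutional layers plus a shortcut) and of a stage built from such modules is shown below; the layer widths and the stem shape are assumptions, and the block is not the patent's exact network:

```python
import torch
import torch.nn as nn

class Bottleneck(nn.Module):
    """One residual module: 3 convolutions (1x1, 3x3, 1x1) with a shortcut."""
    def __init__(self, in_ch, mid_ch, stride=1):
        super().__init__()
        out_ch = mid_ch * 4
        self.body = nn.Sequential(
            nn.Conv2d(in_ch, mid_ch, 1, bias=False), nn.BatchNorm2d(mid_ch), nn.ReLU(inplace=True),
            nn.Conv2d(mid_ch, mid_ch, 3, stride=stride, padding=1, bias=False),
            nn.BatchNorm2d(mid_ch), nn.ReLU(inplace=True),
            nn.Conv2d(mid_ch, out_ch, 1, bias=False), nn.BatchNorm2d(out_ch),
        )
        # Projection shortcut when the spatial size or channel count changes.
        self.shortcut = (nn.Sequential(nn.Conv2d(in_ch, out_ch, 1, stride=stride, bias=False),
                                       nn.BatchNorm2d(out_ch))
                         if stride != 1 or in_ch != out_ch else nn.Identity())
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(self.body(x) + self.shortcut(x))

def make_stage(in_ch, mid_ch, blocks, stride):
    """Stack `blocks` residual modules; conv2_x..conv5_x use 3, 4, 23, 3 blocks."""
    layers = [Bottleneck(in_ch, mid_ch, stride)]
    layers += [Bottleneck(mid_ch * 4, mid_ch) for _ in range(blocks - 1)]
    return nn.Sequential(*layers)

if __name__ == "__main__":
    x = torch.randn(1, 64, 57, 57)          # output of an assumed conv1 + pooling stem
    conv2_x = make_stage(64, 64, blocks=3, stride=1)
    print(conv2_x(x).shape)                  # torch.Size([1, 256, 57, 57])
```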
(22) a feature pyramid network is combined as the candidate region generator: the dimensionality of the upper-layer feature map is reduced by adding a convolutional layer with a 1 × 1 convolution kernel, the feature map of the lateral convolutional layer is up-sampled to the same size, the two processed convolutional feature maps are superposed element by element, and the result is used as the input of the next lateral layer; the feature maps of conv2_x, conv3_x and conv4_x are superposed element by element with the feature map of conv5_x after convolution, dimensionality reduction and sampling operations, and each layer outputs a different prediction according to its feature map;
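A minimal sketch of the lateral merging described in step (22) follows: a 1 × 1 convolution reduces the channel dimensionality of the lateral feature map, the coarser top-down map is up-sampled to the same size, and the two are added element by element. The channel numbers and the use of nearest-neighbour interpolation are illustrative assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FPNLateralMerge(nn.Module):
    """Merge one pyramid level: 1x1-conv lateral map + up-sampled top-down map."""
    def __init__(self, lateral_channels, fpn_channels=256):
        super().__init__()
        self.reduce = nn.Conv2d(lateral_channels, fpn_channels, kernel_size=1)

    def forward(self, lateral_feat, top_down_feat):
        lateral = self.reduce(lateral_feat)                    # 1x1 conv: channel reduction
        upsampled = F.interpolate(top_down_feat,               # resize coarser map to match
                                  size=lateral.shape[-2:], mode="nearest")
        return lateral + upsampled                             # element-by-element addition

if __name__ == "__main__":
    c4 = torch.randn(1, 1024, 29, 29)   # e.g. conv4_x output (channel count assumed)
    p5 = torch.randn(1, 256, 15, 15)    # top-down map coming from conv5_x
    p4 = FPNLateralMerge(1024)(c4, p5)
    print(p4.shape)                      # torch.Size([1, 256, 29, 29])
```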
(23) approximate joint training is used for the region proposal network (candidate region network) and Fast R-CNN within Faster R-CNN: in each stochastic gradient descent iteration, the generated candidate regions are kept fixed while Fast R-CNN is trained, and the training loss from the region proposal network and the training loss of Fast R-CNN are combined in the shared layers through back-propagation computed in advance;
(24) the human body target boundary box is computed in parameterized form as t_x = (x - x_a)/w_a, t_y = (y - y_a)/h_a, t_w = log(w/w_a), t_h = log(h/h_a), and t_x* = (x* - x_a)/w_a, t_y* = (y* - y_a)/h_a, t_w* = log(w*/w_a), t_h* = log(h*/h_a), where x, y, w, h denote the abscissa and ordinate of the center point of the predicted boundary box and its width and height, x_a, y_a, w_a, h_a denote the abscissa and ordinate of the center point of the anchor box and its width and height, and x*, y*, w*, h* denote the abscissa and ordinate of the center point of the real boundary box and its width and height;
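The parameterization of step (24) translates directly into code; the helper below is an illustrative sketch (not taken from the patent), with boxes given as center x, center y, width and height:

```python
import math

def encode_box(box, anchor):
    """Return (t_x, t_y, t_w, t_h) for a (x, y, w, h) box relative to an anchor."""
    x, y, w, h = box
    xa, ya, wa, ha = anchor
    return ((x - xa) / wa,          # t_x = (x - x_a) / w_a
            (y - ya) / ha,          # t_y = (y - y_a) / h_a
            math.log(w / wa),       # t_w = log(w / w_a)
            math.log(h / ha))       # t_h = log(h / h_a)

def decode_box(t, anchor):
    """Invert the parameterization to recover (x, y, w, h)."""
    tx, ty, tw, th = t
    xa, ya, wa, ha = anchor
    return (tx * wa + xa, ty * ha + ya, wa * math.exp(tw), ha * math.exp(th))

if __name__ == "__main__":
    anchor = (100.0, 100.0, 50.0, 80.0)
    pred = (110.0, 95.0, 60.0, 70.0)
    t = encode_box(pred, anchor)
    print(t, decode_box(t, anchor))   # round-trips back to the predicted box
```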
(25) the optimization goal of the target detection network model is to reduce the error of the predicted human target boundary box and category by minimizing the multi-task training objective function L_od({p_i}, {t_i}) = (1/N_cls) Σ_i L_c(p_i, p_i*) + λ (1/N_reg) Σ_i p_i* L_r(t_i, t_i*), where L_c(p, u) = -log p_u, L_r(t, t*) = R(t - t*), and R(x) is the smooth L1 loss function; i is the index of an anchor box, p_i is the probability that the i-th anchor box contains the target object, p_i* is the learning sample label (1 for a positive anchor, 0 otherwise), t_i is the vector of parameterized coordinates of the predicted human target boundary box, t_i* is the set of parameterized coordinates associated with an anchor box carrying a positive sample label, L_c is the log classification loss predicting the presence of a human target, λ is a hyper-parameter balancing the weight of the two terms, and N_cls and N_reg normalize the two terms; the output of the target detection network model is the position and size of the human target boundary box in each video instance.
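A hedged sketch of the multi-task objective of step (25), using cross-entropy for L_c and smooth L1 for L_r as the text specifies; the anchor sampling, the normalization and the value of λ are simplifying assumptions:

```python
import torch
import torch.nn.functional as F

def detection_loss(cls_logits, labels, box_pred, box_target, lam=1.0):
    """Multi-task objective: classification log loss plus smooth-L1 box
    regression counted only for positive anchors (label == 1).

    cls_logits: (N, 2) scores, labels: (N,) in {0, 1},
    box_pred / box_target: (N, 4) parameterized coordinates.
    """
    # L_c: -log p_u, averaged over all sampled anchors.
    cls_loss = F.cross_entropy(cls_logits, labels)
    # L_r: smooth L1 over t - t*, counted only where the anchor is positive.
    positive = labels == 1
    if positive.any():
        reg_loss = F.smooth_l1_loss(box_pred[positive], box_target[positive])
    else:
        reg_loss = box_pred.sum() * 0.0   # keep the graph connected when no positives
    return cls_loss + lam * reg_loss

if __name__ == "__main__":
    torch.manual_seed(0)
    logits = torch.randn(8, 2, requires_grad=True)
    labels = torch.tensor([1, 0, 1, 0, 0, 1, 0, 0])
    t_pred = torch.randn(8, 4, requires_grad=True)
    t_star = torch.randn(8, 4)
    loss = detection_loss(logits, labels, t_pred, t_star)
    loss.backward()
    print(float(loss))
```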
The step (3) is specifically realized as follows:
(31) a dual-stream spatio-temporal network is used as the human body action recognition model, with the position and size of the human body target boundary box as input; the model uses two residual networks, a temporal stream and a spatial stream, as skeleton networks. For a batch of input data samples {x_1, x_2, x_3, …, x_m}, with β and γ as the translation and scaling parameters, batch normalization computes the transformed y_i: the mean of each batch along the channel is μ = (1/m) Σ_{i=1}^{m} x_i, the variance of each batch along the channel is σ^2 = (1/m) Σ_{i=1}^{m} (x_i - μ)^2, the normalized value x̂_i = (x_i - μ)/sqrt(σ^2 + δ) is computed with δ a small constant, and the scaling and translation transformation gives y_i = γ·x̂_i + β;
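The batch-normalization computation of step (31) can be reproduced in a few lines; the sketch below is a per-channel reference implementation for (N, C, H, W) tensors, with the stability constant δ chosen arbitrarily:

```python
import torch

def batch_norm(x, gamma, beta, delta=1e-5):
    """Reference batch normalization over a (N, C, H, W) batch.

    mu and var are computed per channel over the batch and spatial axes,
    then y = gamma * x_hat + beta applies the scaling (gamma) and
    translation (beta) parameters.
    """
    mu = x.mean(dim=(0, 2, 3), keepdim=True)                  # per-channel mean
    var = x.var(dim=(0, 2, 3), unbiased=False, keepdim=True)  # per-channel variance
    x_hat = (x - mu) / torch.sqrt(var + delta)                # normalize
    return gamma.view(1, -1, 1, 1) * x_hat + beta.view(1, -1, 1, 1)

if __name__ == "__main__":
    x = torch.randn(4, 3, 8, 8)
    gamma, beta = torch.ones(3), torch.zeros(3)
    y = batch_norm(x, gamma, beta)
    # Per-channel mean ~0 and std ~1, matching BatchNorm2d in training mode.
    print(y.mean(dim=(0, 2, 3)), y.std(dim=(0, 2, 3)))
```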
(32) the first-layer filters of the temporal stream, originally three RGB filter channels, are adapted to run on stacks of horizontal and vertical optical flow, each stack containing 10 video frames; a cross-layer residual connection Y_l = f(X_l) + g(X_l, W_l) is used in the temporal and spatial stream residual networks respectively, and a cross-stream residual connection Y_l^S = f(X_l^S) + g(X_l^S + X_l^T, W_l^S) is deployed from the temporal stream to the spatial stream before its residual module, where f is the ReLU activation function, X_l is the input matrix of the l-th layer, Y_l is the output matrix of the l-th layer, g(·,·) is the non-linear residual mapping function weighted by the convolution filters W_l, X_l^S is the input of the l-th layer of the spatial stream, X_l^T is the input of the l-th layer of the temporal stream, W_l^S is the connection weight matrix of the l-th layer residual neurons of the spatial stream, and Y_l^S is the output of the l-th layer of the spatial stream;
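Since the exact cross-stream fusion formula is only partially recoverable from the text, the sketch below merely illustrates the general idea of step (32): the temporal-stream activation is added into the spatial-stream residual branch before the residual mapping. The channel counts and the form of g are placeholders:

```python
import torch
import torch.nn as nn

class CrossStreamResidual(nn.Module):
    """Illustrative fusion block: the temporal-stream input X_t is added to the
    spatial-stream input X_s before the residual mapping g, and the output is
    relu(X_s) + g(X_s + X_t)."""
    def __init__(self, channels):
        super().__init__()
        self.g = nn.Sequential(                      # non-linear residual mapping g(., W)
            nn.Conv2d(channels, channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(channels), nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
        )

    def forward(self, x_spatial, x_temporal):
        residual = self.g(x_spatial + x_temporal)    # residual mapping of the fused input
        return torch.relu(x_spatial) + residual      # f(X_s) + g(X_s + X_t)

if __name__ == "__main__":
    xs, xt = torch.randn(2, 64, 28, 28), torch.randn(2, 64, 28, 28)
    print(CrossStreamResidual(64)(xs, xt).shape)     # torch.Size([2, 64, 28, 28])
```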
(33) a residual network with frame rate α and channel width C is used as the spatial stream network to determine the static regions in the video, and a residual network with frame rate 8α and channel width C/8 is used as the temporal stream network to determine the dynamic regions in the video; the spatial stream network is called the slow channel and the temporal stream network is called the fast channel. The slow channel network uses a temporal stride as a hyper-parameter to control the number of video frames skipped per second: on a data set with a video frame rate of 30, features are extracted from about 2 frames per second, while the fast channel network extracts features from 15 frames per second. Data in the fast channel network is transmitted to the slow channel network through lateral connections; the network model converts the dimensions of the data that the fast channel network feeds into each lateral connection and then merges it into the slow channel network, and the dimension conversion is performed by a three-dimensional convolution applied at intervals of a certain number of frames. On the last layer of the fast and slow channel networks, the human body action recognition network model performs global average pooling to reduce the dimensionality of the data samples, fuses the results, feeds them into a fully connected classification layer, and gives a confidence score for each action through Softmax;
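As an illustration of the fast-channel-to-slow-channel lateral connection in step (33), the sketch below uses a time-strided three-dimensional convolution to convert the fast-channel features to the slow channel's temporal length before fusing them; channel-wise concatenation, the kernel size and the channel numbers are assumptions, since the patent only states that the converted data is merged into the slow channel:

```python
import torch
import torch.nn as nn

class FastToSlowLateral(nn.Module):
    """Time-strided 3D convolution that maps fast-channel features
    (many frames, few channels) onto the slow channel's temporal length
    so the two can be fused."""
    def __init__(self, fast_channels=8, out_channels=16, alpha=8):
        super().__init__()
        self.conv = nn.Conv3d(fast_channels, out_channels,
                              kernel_size=(5, 1, 1), stride=(alpha, 1, 1),
                              padding=(2, 0, 0), bias=False)

    def forward(self, fast_feat, slow_feat):
        lateral = self.conv(fast_feat)                 # (N, out_ch, T_slow, H, W)
        return torch.cat([slow_feat, lateral], dim=1)  # channel-wise fusion (assumed)

if __name__ == "__main__":
    slow = torch.randn(1, 64, 4, 28, 28)    # slow channel: 4 frames, 64 channels
    fast = torch.randn(1, 8, 32, 28, 28)    # fast channel: 32 frames, 8 channels
    fused = FastToSlowLateral()(fast, slow)
    print(fused.shape)                       # torch.Size([1, 80, 4, 28, 28])
```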
(34) the optimization goal of the human body action recognition model is to reduce the error of human body action category prediction by minimizing the cross-entropy loss function of the Softmax classifier, L_af = -Σ_i y_i·log x_i, where x_i is the action confidence score of the i-th human body action category in the output vector of the Softmax classifier and y_i is the i-th element of the corresponding real action label vector.
The step (4) is specifically realized as follows:
(41) a multi-layer perceptron is used as the abnormal score learning model, with the category and confidence score of the human body action as input; the multiple-instance learning criterion max_{i∈B_a} h(V_a^i) > max_{i∈B_n} h(V_n^i) is used for training and for predicting the abnormal score of each video instance in the instance bags, where B_a is the abnormal video instance bag, B_n is the normal video instance bag, V_a^i and V_n^i are the corresponding abnormal and normal video instances, and h is the multi-layer perceptron model, which consists of three fully connected layers: the input layer contains 512 neurons, the hidden layer contains 32 neurons, and the final output layer contains 1 neuron that predicts the abnormal score of a given video instance, with the output value ranging between 0 and 1;
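Step (41) fixes the layer sizes of the multi-layer perceptron (512, 32 and 1 neurons) and the output range; the sketch below is one possible realization, where the sigmoid output, the ReLU activation and the dropout rate are assumptions:

```python
import torch
import torch.nn as nn

class AnomalyScorer(nn.Module):
    """Multi-layer perceptron h: 512-d instance feature -> anomaly score in [0, 1]."""
    def __init__(self, in_features=512, hidden=32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_features, hidden),  # input layer: 512 neurons -> hidden: 32 neurons
            nn.ReLU(inplace=True),
            nn.Dropout(0.6),                 # regularization (assumed rate)
            nn.Linear(hidden, 1),            # output layer: 1 neuron
            nn.Sigmoid(),                    # squash the score into (0, 1)
        )

    def forward(self, x):
        return self.net(x).squeeze(-1)

if __name__ == "__main__":
    features = torch.randn(30, 512)          # 30 video instances in a bag
    print(AnomalyScorer()(features).shape)   # torch.Size([30])
```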
(42) the optimization goal of the abnormal score learning model is to maximize the distance between the two classes of samples, the abnormal video instance bag and the normal video instance bag, to facilitate classification by the multi-layer perceptron model, using the hinge-based ranking loss L_h = max(0, 1 - max_{i∈B_a} h(V_a^i) + max_{i∈B_n} h(V_n^i)); considering that in reality the probability and frequency of normal events are far greater than those of abnormal events, a sparsity constraint is added to the loss function, giving L_fil = L_h + ε·Σ_{i∈B_a} h(V_a^i), so that abnormal human body actions that occur less frequently than other normal events, such as falling, fighting and other violence, receive higher abnormal scores, where ε is a hyper-parameter adjusting the weight of the hinge ranking loss and the sparsity constraint;
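A sketch of the hinge-based ranking objective with the sparsity term of step (42), following the standard multiple-instance ranking formulation that the description outlines; taking the maximum score per bag and the concrete value of ε are assumptions:

```python
import torch

def mil_ranking_loss(scores_abnormal, scores_normal, eps=8e-5):
    """Hinge ranking loss between an abnormal bag and a normal bag, plus a
    sparsity term on the abnormal-bag scores weighted by eps.

    scores_abnormal, scores_normal: 1-D tensors of per-instance anomaly scores.
    """
    hinge = torch.clamp(1.0 - scores_abnormal.max() + scores_normal.max(), min=0.0)
    sparsity = eps * scores_abnormal.sum()   # most instances in a long video are normal
    return hinge + sparsity

if __name__ == "__main__":
    torch.manual_seed(0)
    s_a = torch.rand(30, requires_grad=True)   # scores h(V_a^i) of the abnormal bag
    s_n = torch.rand(30, requires_grad=True)   # scores h(V_n^i) of the normal bag
    loss = mil_ranking_loss(s_a, s_n)
    loss.backward()
    print(float(loss))
```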
(43) the target detection network model, the human body action recognition model and the abnormal score learning model are jointly trained and optimized with the objective L = λ_od·L_od + λ_af·L_af + λ_fil·L_fil, which accumulates the loss functions and balances their weights, where λ_od, λ_af and λ_fil are regularization hyper-parameters controlling each loss function.
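The joint objective of step (43) simply accumulates the three losses with their regularization weights; in code it is a one-liner, with placeholder weight values:

```python
def joint_loss(l_od, l_af, l_fil, lam_od=1.0, lam_af=1.0, lam_fil=1.0):
    """Accumulate the detection, action-recognition and anomaly-score losses
    with their regularization hyper-parameters (values are placeholders)."""
    return lam_od * l_od + lam_af * l_af + lam_fil * l_fil

print(joint_loss(0.8, 1.2, 0.05))  # 2.05
```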
Compared with the prior art, the invention has the advantages that:
(1) the model extracts rich human body target boundary box features and human body action features, and can effectively learn abnormal human body action features from training samples; beyond prior-art weakly supervised multiple-instance learning models, which only detect whether an abnormal event occurs, it can also give the category and confidence score of the abnormal human body action;
(2) through the segmented video instances and the features extracted by the model, abnormal patterns can be learned quickly and the spatio-temporal position of an abnormal action in the monitoring video can be detected; beyond prior-art weakly supervised multiple-instance learning models, which only detect the temporal location of an abnormal action, the model can also quickly locate its spatial position;
(3) the model can be applied to unclear monitoring video with a resolution lower than 228 × 228; in experiments on a monitoring video data set, the AUC (area under the ROC curve) for abnormal actions reaches 77.43%, used as the benchmark of the model's classification performance, and compared with the prior art, such as a temporal regularity learning model (50.60%), a high-frame-rate abnormal event detection model (65.51%) and a weakly supervised multiple-instance learning model (75.41%), the model achieves better classification performance.
Drawings
FIG. 1 is a flow chart of a method implementation of the present invention;
FIG. 2 is a schematic diagram of an internal architecture of a dual-stream residual error network;
FIG. 3 is a schematic diagram of the Faster R-CNN architecture;
FIG. 4 is a schematic diagram of a composite network model for target detection;
FIG. 5 is a schematic diagram of a dual-flow network model architecture;
FIG. 6 is a schematic diagram of video example partitioning;
FIG. 7 is a diagram of the multi-layer perceptron architecture.
Detailed Description
The present invention will be described in detail with reference to the following embodiments.
As shown in fig. 1, the method of the present invention specifically includes the following steps:
(1) segmenting a video segment to be detected into m video instances with equal length;
(2) according to the video example segmented in the step (1), adopting a target detection network model based on a residual error network to obtain the position and the size of a human body target boundary frame appearing in each monitoring video example;
(3) based on the position and the size of the human body target boundary box in the step (2), calculating the category and the confidence coefficient of the human body action in the human body target boundary box in each monitoring video instance by using a human body action recognition model based on a double-flow space-time residual error network;
(4) according to the category and confidence of the human body action in step (3), an abnormal score for each monitoring video instance is given by an abnormal score learning model and weighted fusion is carried out, so that abnormal human body actions such as falling and fighting can be classified and identified.
The basic implementation process is as follows:
(1) using the UCF-Crime data set, the video segments containing abnormal event labels are divided into abnormal video instance bags and the video segments not containing abnormal events are divided into normal video instance bags, according to the abnormal event labels of the video segments in the input data set, as shown in FIG. 6;
(2) cropping or scaling the video instance resolution to 228 x 228 using the FFMPEG audio video library;
(3) Faster R-CNN is used as the target detection network model, with a 101-layer residual network as the skeleton (backbone) network; the main feature-extraction stages of the residual network, conv2_x, conv3_x, conv4_x and conv5_x, are composed of groups of residual modules, each module consisting of 3 convolutional layers with a residual connection, where conv2_x contains 3 residual modules, conv3_x contains 4, conv4_x contains 23 and conv5_x contains 3, and an average pooling layer, a fully connected layer and a Softmax layer are added after the conv5_x network layer; the architecture is shown in FIG. 3;
(4) a feature pyramid network is combined as the candidate region generator; the structure of Faster R-CNN combined with the feature pyramid network is shown in FIG. 4. The feature pyramid network adds sampling operations layer by layer and is connected laterally to the network layers of Faster R-CNN. Faster R-CNN reduces the dimensionality of the upper-layer feature map by adding a convolutional layer with a 1 × 1 convolution kernel, while the feature map of the lateral convolutional layer in the feature pyramid network is up-sampled to the same size; the two processed convolutional feature maps are superposed element by element and the result is used as the input of the next lateral layer. The feature maps of conv2_x, conv3_x and conv4_x are superposed element by element with the feature map of conv5_x after convolution, dimensionality reduction and sampling operations, and the feature pyramid network outputs a different prediction at each layer according to its feature map;
(5) approximate joint training is used for the region proposal network (candidate region network) and Fast R-CNN within Faster R-CNN: in each stochastic gradient descent iteration, the generated candidate regions are kept fixed while Fast R-CNN is trained, and the training loss from the region proposal network and the training loss of Fast R-CNN are combined in the shared layers through back-propagation computed in advance;
(6) the human body target boundary box is computed in parameterized form as t_x = (x - x_a)/w_a, t_y = (y - y_a)/h_a, t_w = log(w/w_a), t_h = log(h/h_a), and t_x* = (x* - x_a)/w_a, t_y* = (y* - y_a)/h_a, t_w* = log(w*/w_a), t_h* = log(h*/h_a), where x, y, w, h denote the abscissa and ordinate of the center point of the predicted boundary box and its width and height, x_a, y_a, w_a, h_a denote the abscissa and ordinate of the center point of the anchor box and its width and height, and x*, y*, w*, h* denote the abscissa and ordinate of the center point of the real boundary box and its width and height;
(7) the optimization goal of the target detection network model is to reduce the error of the predicted human target boundary box and category by minimizing the multi-task training objective function L_od({p_i}, {t_i}) = (1/N_cls) Σ_i L_c(p_i, p_i*) + λ (1/N_reg) Σ_i p_i* L_r(t_i, t_i*), where L_c(p, u) = -log p_u, L_r(t, t*) = R(t - t*), and R(x) is the smooth L1 loss function; i is the index of an anchor box, p_i is the probability that the i-th anchor box contains the target object, p_i* is the learning sample label (1 for a positive anchor, 0 otherwise), t_i is the vector of parameterized coordinates of the predicted human target boundary box, t_i* is the set of parameterized coordinates associated with an anchor box carrying a positive sample label, L_c is the log classification loss predicting the presence of a human target, λ is a hyper-parameter balancing the weight of the two terms, and N_cls and N_reg normalize the two terms; the output of the target detection network model is the position and size of the human target boundary box in each video instance;
(8) a dual-stream spatio-temporal network is used as the human body action recognition model, with the position and size of the human body target boundary box as input; the model uses two residual networks, a temporal stream and a spatial stream, as skeleton networks, whose internal architecture is shown in FIG. 2. The input-layer temporal stride of the temporal stream network is 2 and the channel number of its conv1 convolutional layer is 8, while the input-layer temporal stride of the spatial stream network is 16 and its channel number is 64; both streams add a sampling layer after the conv1 convolutional layer, and the feature map size of the conv2, conv3, conv4 and conv5 convolutional layers of the temporal stream network is 1/8 of the corresponding convolutional layers of the spatial stream network. In the dual-stream spatio-temporal network, for a batch of input data samples {x_1, x_2, x_3, …, x_m} fed into the skeleton network, with β and γ as the translation and scaling parameters, batch normalization computes the transformed y_i: the mean of each batch along the channel is μ = (1/m) Σ_{i=1}^{m} x_i, the variance of each batch along the channel is σ^2 = (1/m) Σ_{i=1}^{m} (x_i - μ)^2, the normalized value x̂_i = (x_i - μ)/sqrt(σ^2 + δ) is computed with δ a small constant, and the scaling and translation transformation gives y_i = γ·x̂_i + β;
(9) the first-layer filters of the temporal stream, originally three RGB filter channels, are adapted to run on stacks of horizontal and vertical optical flow, each stack containing 10 video frames; a cross-layer residual connection Y_l = f(X_l) + g(X_l, W_l) is used in the temporal and spatial stream residual networks respectively, and a cross-stream residual connection Y_l^S = f(X_l^S) + g(X_l^S + X_l^T, W_l^S) is deployed from the temporal stream to the spatial stream before its residual module, where f is the ReLU activation function, X_l is the input matrix of the l-th layer, Y_l is the output matrix of the l-th layer, g(·,·) is the non-linear residual mapping function weighted by the convolution filters W_l, X_l^S is the input of the l-th layer of the spatial stream, X_l^T is the input of the l-th layer of the temporal stream, W_l^S is the connection weight matrix of the l-th layer residual neurons of the spatial stream, and Y_l^S is the output of the l-th layer of the spatial stream;
(10) a residual network with frame rate α and channel width C is used as the spatial stream network to determine the static regions in the video, and a residual network with frame rate 8α and channel width C/8 is used as the temporal stream network to determine the dynamic regions in the video; the spatial stream network is called the slow channel and the temporal stream network is called the fast channel. The slow channel network uses a temporal stride as a hyper-parameter to control the number of video frames skipped per second: on a data set with a video frame rate of 30, features are extracted from about 2 frames per second, while the fast channel network extracts features from 15 frames per second. Data in the fast channel network is transmitted to the slow channel network through lateral connections; the network model converts the dimensions of the data that the fast channel network feeds into each lateral connection and then merges it into the slow channel network, and the dimension conversion is performed by a three-dimensional convolution applied at intervals of a certain number of frames. On the last layer of the fast and slow channel networks, the human body action recognition network model performs global average pooling to reduce the dimensionality of the data samples, fuses the results, feeds them into a fully connected classification layer, and gives a confidence score for each action through Softmax; the dual-stream spatio-temporal residual network architecture is shown in FIG. 5;
(11) in the learning process of the human body action recognition model, the goal is to reduce the error of human body action category prediction by minimizing the cross-entropy loss function of the Softmax classifier, L_af = -Σ_i y_i·log x_i, where x_i is the action confidence score of the i-th human body action category in the output vector of the Softmax classifier and y_i is the i-th element of the corresponding real action label vector;
(12) a multi-layer perceptron is used as the abnormal score learning model, with the category and confidence score of the human body action as input; the multiple-instance learning criterion max_{i∈B_a} h(V_a^i) > max_{i∈B_n} h(V_n^i) is used for training and for predicting the abnormal score of each video instance in the instance bags, where B_a is the abnormal video instance bag, B_n is the normal video instance bag, V_a^i and V_n^i are the corresponding abnormal and normal video instances, and h is the multi-layer perceptron model, whose framework is shown in FIG. 7 and which consists of three fully connected layers: the input layer contains 512 neurons, the hidden layer contains 32 neurons, and the final output layer contains 1 neuron that predicts the abnormal score of a given video instance, with the output value ranging between 0 and 1;
(13) the optimization goal of the abnormal score learning model is to maximize the distance between the two classes of samples, the abnormal video instance bag and the normal video instance bag, to facilitate classification by the multi-layer perceptron model, using the hinge-based ranking loss L_h = max(0, 1 - max_{i∈B_a} h(V_a^i) + max_{i∈B_n} h(V_n^i)); considering that in reality the probability and frequency of normal events are far greater than those of abnormal events, a sparsity constraint is added to the loss function, giving L_fil = L_h + ε·Σ_{i∈B_a} h(V_a^i), so that abnormal human body actions that occur less frequently than other normal events, such as falls and fighting, receive higher abnormal scores, where ε is a hyper-parameter adjusting the weight of the hinge ranking loss and the sparsity constraint;
(14) the target detection network model, the human body action recognition model and the abnormal score learning model are jointly trained and optimized with the objective L = λ_od·L_od + λ_af·L_af + λ_fil·L_fil, which accumulates the loss functions and balances their weights, where λ_od, λ_af and λ_fil are regularization hyper-parameters controlling each loss function.
The above description is only a part of the embodiments of the present invention, but the scope of the present invention is not limited thereto, and any changes or substitutions that can be easily conceived by those skilled in the art within the technical scope of the present invention are included in the scope of the present invention.
Claims (5)
1. A multi-stage human body abnormal motion detection method based on a residual error network is characterized by comprising the following steps:
(1) segmenting a video segment to be detected into m video instances with equal length to obtain segmented video instances;
(2) according to the video example segmented in the step (1), adopting a target detection network model based on a residual error network to obtain the position and the size of a human body target boundary frame appearing in each monitoring video example;
(3) based on the position and the size of the human body target boundary box in the step (2), calculating the category and the confidence coefficient of the human body action in the human body target boundary box in each monitoring video instance by using a human body action recognition model based on a double-flow space-time residual error network;
(4) according to the category and confidence of the human body action in step (3), an abnormal score for each monitoring video instance is given by an abnormal score learning model and weighted fusion is carried out, so that abnormal human body actions can be classified and identified.
2. The method according to claim 1, wherein step (1) is specifically realized as follows:
(11) if the applicable data set is a training and verification data set, the video segments containing abnormal events are divided into abnormal video instance bags, denoted B_a, according to the abnormal event labels in the data set, and the video segments not containing abnormal events are divided into normal video instance bags, denoted B_n;
(12) If the applicable data set is a test data set, dividing the video clip to be detected into m video instances with equal length;
(13) the resolution of the video instances obtained in step (11) or (12) is cropped or scaled to 228 × 228.
3. The method according to claim 1, wherein step (2) is specifically realized as follows:
(21) Faster R-CNN is used as the target detection network model, taking a video instance as input; the model uses a residual network as its skeleton (backbone) network. The main feature-extraction stages of the residual network, conv2_x, conv3_x, conv4_x and conv5_x, are composed of groups of residual modules, each module consisting of 3 convolutional layers with a residual connection, where conv2_x contains 3 residual modules, conv3_x contains 4, conv4_x contains 23 and conv5_x contains 3, and an average pooling layer, a fully connected layer and a Softmax layer are added after the conv5_x network layer;
(22) a feature pyramid network is combined as the candidate region generator: the dimensionality of the upper-layer feature map is reduced by adding a convolutional layer with a 1 × 1 convolution kernel, the feature map of the lateral convolutional layer is up-sampled to the same size, the two processed convolutional feature maps are superposed element by element, and the result is used as the input of the next lateral layer; the feature maps of conv2_x, conv3_x and conv4_x are superposed element by element with the feature map of conv5_x after convolution, dimensionality reduction and sampling operations, and each layer outputs a different prediction according to its feature map;
(23) approximate joint training is used for the region proposal network (candidate region network) and Fast R-CNN within Faster R-CNN: in each stochastic gradient descent iteration, the generated candidate regions are kept fixed while Fast R-CNN is trained, and the training loss from the region proposal network and the training loss of Fast R-CNN are combined in the shared layers through back-propagation computed in advance;
(24) the human body target boundary box is computed in parameterized form as t_x = (x - x_a)/w_a, t_y = (y - y_a)/h_a, t_w = log(w/w_a), t_h = log(h/h_a), and t_x* = (x* - x_a)/w_a, t_y* = (y* - y_a)/h_a, t_w* = log(w*/w_a), t_h* = log(h*/h_a), where x, y, w, h denote the abscissa and ordinate of the center point of the predicted boundary box and its width and height, x_a, y_a, w_a, h_a denote the abscissa and ordinate of the center point of the anchor box and its width and height, and x*, y*, w*, h* denote the abscissa and ordinate of the center point of the real boundary box and its width and height;
(25) the optimization goal of the target detection network model is to reduce the error of the predicted human target boundary box and category by minimizing the multi-task training objective function L_od({p_i}, {t_i}) = (1/N_cls) Σ_i L_c(p_i, p_i*) + λ (1/N_reg) Σ_i p_i* L_r(t_i, t_i*), where L_c(p, u) = -log p_u, L_r(t, t*) = R(t - t*), and R(x) is the smooth L1 loss function; i is the index of an anchor box, p_i is the probability that the i-th anchor box contains the target object, p_i* is the learning sample label (1 for a positive anchor, 0 otherwise), t_i is the vector of parameterized coordinates of the predicted human target boundary box, t_i* is the set of parameterized coordinates associated with an anchor box carrying a positive sample label, L_c is the log classification loss predicting the presence of a human target, λ is a hyper-parameter balancing the weight of the two terms, and N_cls and N_reg normalize the two terms; the output of the target detection network model is the position and size of the human target boundary box in each video instance.
4. The method according to claim 1, wherein step (3) is specifically realized as follows:
(31) a dual-stream spatio-temporal network is used as the human body action recognition model, with the position and size of the human body target boundary box as input; the model uses two residual networks, a temporal stream and a spatial stream, as skeleton networks. For a batch of input data samples {x_1, x_2, x_3, …, x_m}, with β and γ as the translation and scaling parameters, batch normalization computes the transformed y_i: the mean of each batch along the channel is μ = (1/m) Σ_{i=1}^{m} x_i, the variance of each batch along the channel is σ^2 = (1/m) Σ_{i=1}^{m} (x_i - μ)^2, the normalized value x̂_i = (x_i - μ)/sqrt(σ^2 + δ) is computed with δ a small constant, and the scaling and translation transformation gives y_i = γ·x̂_i + β;
(32) the first-layer filters of the temporal stream, originally three RGB filter channels, are adapted to run on stacks of horizontal and vertical optical flow, each stack containing 10 video frames; a cross-layer residual connection Y_l = f(X_l) + g(X_l, W_l) is used in the temporal and spatial stream residual networks respectively, and a cross-stream residual connection Y_l^S = f(X_l^S) + g(X_l^S + X_l^T, W_l^S) is deployed from the temporal stream to the spatial stream before its residual module, where f is the ReLU activation function, X_l is the input matrix of the l-th layer, Y_l is the output matrix of the l-th layer, g(·,·) is the non-linear residual mapping function weighted by the convolution filters W_l, X_l^S is the input of the l-th layer of the spatial stream, X_l^T is the input of the l-th layer of the temporal stream, W_l^S is the connection weight matrix of the l-th layer residual neurons of the spatial stream, and Y_l^S is the output of the l-th layer of the spatial stream;
(33) a residual network with frame rate α and channel width C is used as the spatial stream network to determine the static regions in the video, and a residual network with frame rate 8α and channel width C/8 is used as the temporal stream network to determine the dynamic regions in the video; the spatial stream network is called the slow channel and the temporal stream network is called the fast channel. The slow channel network uses a temporal stride as a hyper-parameter to control the number of video frames skipped per second: on a data set with a video frame rate of 30, features are extracted from about 2 frames per second, while the fast channel network extracts features from 15 frames per second. Data in the fast channel network is transmitted to the slow channel network through lateral connections; the network model converts the dimensions of the data that the fast channel network feeds into each lateral connection and then merges it into the slow channel network, and the dimension conversion is performed by a three-dimensional convolution applied at intervals of a certain number of frames. On the last layer of the fast and slow channel networks, the human body action recognition network model performs global average pooling to reduce the dimensionality of the data samples, fuses the results, feeds them into a fully connected classification layer, and gives a confidence score for each action through Softmax;
(34) the optimization goal of the human body action recognition model is to reduce the error of human body action category prediction by minimizing the cross-entropy loss function of the Softmax classifier, L_af = -Σ_i y_i·log x_i, where x_i is the action confidence score of the i-th human body action category in the output vector of the Softmax classifier and y_i is the i-th element of the corresponding real action label vector.
5. The method according to claim 1, wherein step (4) is specifically realized as follows:
(41) a multi-layer perceptron is used as the abnormal score learning model, with the category and confidence score of the human body action as input; the multiple-instance learning criterion max_{i∈B_a} h(V_a^i) > max_{i∈B_n} h(V_n^i) is used for training and for predicting the abnormal score of each video instance in the instance bags, where B_a is the abnormal video instance bag, B_n is the normal video instance bag, V_a^i and V_n^i are the corresponding abnormal and normal video instances, and h is the multi-layer perceptron model, which consists of three fully connected layers: the input layer contains 512 neurons, the hidden layer contains 32 neurons, and the final output layer contains 1 neuron that predicts the abnormal score of a given video instance, with the output value ranging between 0 and 1;
(42) the optimization goal of the abnormal score learning model is to maximize the distance between the two classes of samples, the abnormal video instance bag and the normal video instance bag, to facilitate classification by the multi-layer perceptron model, using the hinge-based ranking loss L_h = max(0, 1 - max_{i∈B_a} h(V_a^i) + max_{i∈B_n} h(V_n^i)); considering that in reality the probability and frequency of normal events are far greater than those of abnormal events, a sparsity constraint is added to the loss function, giving L_fil = L_h + ε·Σ_{i∈B_a} h(V_a^i), so that abnormal human body actions that occur less frequently than other normal events receive higher abnormal scores, where ε is a hyper-parameter adjusting the weight of the hinge ranking loss and the sparsity constraint;
(43) the target detection network model, the human body action recognition model and the abnormal score learning model are jointly trained and optimized with the objective L = λ_od·L_od + λ_af·L_af + λ_fil·L_fil, which accumulates the loss functions and balances their weights, where λ_od, λ_af and λ_fil are regularization hyper-parameters controlling each loss function.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111553555.5A CN114202803A (en) | 2021-12-17 | 2021-12-17 | Multi-stage human body abnormal action detection method based on residual error network |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111553555.5A CN114202803A (en) | 2021-12-17 | 2021-12-17 | Multi-stage human body abnormal action detection method based on residual error network |
Publications (1)
Publication Number | Publication Date |
---|---|
CN114202803A true CN114202803A (en) | 2022-03-18 |
Family
ID=80654998
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202111553555.5A Pending CN114202803A (en) | 2021-12-17 | 2021-12-17 | Multi-stage human body abnormal action detection method based on residual error network |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114202803A (en) |
Cited By (9)
Publication number | Priority date | Publication date | Title
---|---|---|---
CN114419528A (en) * | 2022-04-01 | 2022-04-29 | 浙江口碑网络技术有限公司 | Anomaly identification method and device, computer equipment and computer readable storage medium |
CN114860535A (en) * | 2022-04-18 | 2022-08-05 | 地平线征程(杭州)人工智能科技有限公司 | Data evaluation model generation method and device and abnormal data monitoring method and device |
CN114842558A (en) * | 2022-06-02 | 2022-08-02 | 沈阳航空航天大学 | Multi-relation GCNs framework action identification method based on edge perception |
CN115147921A (en) * | 2022-06-08 | 2022-10-04 | 南京信息技术研究院 | Key area target abnormal behavior detection and positioning method based on multi-domain information fusion |
CN115147921B (en) * | 2022-06-08 | 2024-04-30 | 南京信息技术研究院 | Multi-domain information fusion-based key region target abnormal behavior detection and positioning method |
CN116564460A (en) * | 2023-07-06 | 2023-08-08 | 四川省医学科学院·四川省人民医院 | Health behavior monitoring method and system for leukemia child patient |
CN116564460B (en) * | 2023-07-06 | 2023-09-12 | 四川省医学科学院·四川省人民医院 | Health behavior monitoring method and system for leukemia child patient |
CN118115924A (en) * | 2024-04-26 | 2024-05-31 | 南京航空航天大学 | Elevator car abnormal event detection method based on multi-frame key feature fusion |
CN118115924B (en) * | 2024-04-26 | 2024-07-05 | 南京航空航天大学 | Elevator car abnormal event detection method based on multi-frame key feature fusion |
Similar Documents
Publication | Title
---|---
CN114202803A (en) | Multi-stage human body abnormal action detection method based on residual error network
CN112861635A (en) | Fire and smoke real-time detection method based on deep learning
Zheng et al. | A review of remote sensing image object detection algorithms based on deep learning
CN113569766B (en) | Pedestrian abnormal behavior detection method for patrol of unmanned aerial vehicle
Yandouzi et al. | Investigation of combining deep learning object recognition with drones for forest fire detection and monitoring
CN116824335A (en) | YOLOv5 improved algorithm-based fire disaster early warning method and system
CN115294519A (en) | Abnormal event detection and early warning method based on lightweight network
Lin | Automatic recognition of image of abnormal situation in scenic spots based on Internet of things
CN117746264B (en) | Multitasking implementation method for unmanned aerial vehicle detection and road segmentation
Bhardwaj et al. | Machine Learning-Based Crowd Behavior Analysis and Forecasting
CN117495825A (en) | Method for detecting foreign matters on tower pole of transformer substation
Liu et al. | UAV imagery-based railroad station building inspection using hybrid learning architecture
Hao et al. | Human behavior analysis based on attention mechanism and LSTM neural network
Lestari et al. | Comparison of two deep learning methods for detecting fire
CN116740624A (en) | Concentrated crowd detection algorithm based on improved YOLO and fused attention mechanism
Wu et al. | Video Abnormal Event Detection Based on CNN and LSTM
CN112633162B (en) | Pedestrian rapid detection and tracking method suitable for expressway external field shielding condition
Zhang et al. | Multi-object crowd real-time tracking in dynamic environment based on neural network
Wang et al. | Abnormal event detection algorithm based on dual attention future frame prediction and gap fusion discrimination
Li et al. | Multi-scale feature extraction and fusion net: Research on UAVs image semantic segmentation technology
Pan et al. | An Improved Two-stream Inflated 3D ConvNet for Abnormal Behavior Detection.
Shine et al. | Comparative analysis of two stage and single stage detectors for anomaly detection
Ai et al. | Analysis of deep learning object detection methods
Wu et al. | An Improved SIOU-YoloV5 Algorithm Applied to Human Flow Detection
Padmaja et al. | Crowd abnormal behaviour detection using convolutional neural network and bidirectional LSTM
Legal Events
Date | Code | Title | Description
---|---|---|---
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |