
CN109598229B - Monitoring system and method based on action recognition - Google Patents

Monitoring system and method based on action recognition

Info

Publication number
CN109598229B
Authority
CN
China
Prior art keywords
gesture
action
video
model
vector
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201811453471.2A
Other languages
Chinese (zh)
Other versions
CN109598229A (en)
Inventor
李刚毅
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Individual
Original Assignee
Individual
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Individual filed Critical Individual
Priority to CN201811453471.2A priority Critical patent/CN109598229B/en
Publication of CN109598229A publication Critical patent/CN109598229A/en
Application granted granted Critical
Publication of CN109598229B publication Critical patent/CN109598229B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20 Movements or behaviour, e.g. gesture recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/20 Image preprocessing
    • G06V10/25 Determination of region of interest [ROI] or a volume of interest [VOI]

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Health & Medical Sciences (AREA)
  • Psychiatry (AREA)
  • Social Psychology (AREA)
  • Human Computer Interaction (AREA)
  • Image Analysis (AREA)

Abstract

The present disclosure relates to a monitoring system and method based on action recognition. The method comprises the following steps: identifying the limb position of a person in a monitored video frame by using a gesture estimation method, and carrying out human skeleton 2D modeling; classifying the 2D model of the human skeleton in the monitored video frame by utilizing a pre-trained gesture classification model; storing the gesture classification results of the continuous video frames into gesture vectors, and judging the action type according to a pre-trained action recognition model; and if the judged action type belongs to the monitored type, storing the video frame marked with the specific action and/or the video clip of the action into a memory and triggering an alarm.

Description

Monitoring system and method based on action recognition
Technical Field
The present disclosure relates to a monitoring system based on action recognition and a method thereof, and more particularly to a system and method that use gesture prediction, gesture recognition, and action recognition techniques to determine whether a person in a monitored video performs a specific action, automatically raise an alarm if a specific action is detected, and save the related video frames and video file clips for archiving and review.
Background
The actions of an object play a decisive role in judging its behavior. Whether the object is a human, an animal, or a machine, it must perform corresponding actions in order to achieve its intended aim.
Chinese patent application publication CN107992858A proposes a real-time three-dimensional gesture estimation system and method based on a single RGB frame, which detects and frames the hand region using a hand detector, recognizes the 2D positions of the hand joints using OpenPose, fits a 3D hand model to the 2D joint positions using nonlinear least-squares minimization, and recovers the hand pose. The method uses OpenPose to model the hand and recognize gestures. However, it is not suitable for detecting the joints of the other limbs of a person (e.g., wrist, elbow, shoulder, neck, hip, knee, ankle, fingers), nor is it configured with an efficient classification algorithm suitable for classifying other limbs. In addition, the method disclosed in CN107992858A mainly recognizes gestures within a single frame and is not suitable for recognizing actions across consecutive video frames.
Chinese patent application publication CN108427331A proposes a human-robot collaboration safety protection method and system, which identifies the coordinates of a robot using an RGB-D sensor, detects the coordinates of a person using an RGB-D sensor and OpenPose, and controls the speed of the robot by calculating the distance between the person and the robot. This method also uses OpenPose to model limb positions, but after modeling the human body it only performs dynamic speed control of the robot by judging position and distance; it does not judge the type of action through action recognition after modeling the human body.
Chinese patent application publication CN108416795A proposes a video action recognition method based on rank pooling fused with spatial features, which computes a set of visual feature vectors for the video frames, constructs a two-dimensional spatial pyramid model for the frames, and judges the action type after processing and classifying the visual feature vectors in subspaces. The method detects actions by multi-scale segmentation of the two-dimensional space of the video frame, so its classification is based on the visual features of the original frames. The technique does not separate independent human skeleton models or classify and judge the skeleton models (instead of the original video features), and is therefore not suitable for judging actions from a sequence of poses.
Therefore, there is a need for a system and method that can use gesture prediction, gesture recognition, and action recognition techniques to determine whether a person in a monitored video performs a specific action, automatically raise an alarm when such an action is detected, and save the relevant video frames and video file segments for archiving and review.
Disclosure of Invention
Therefore, the object of the present disclosure is to determine whether a person in a monitored video performs a specific action by using gesture prediction, gesture recognition, and action recognition techniques, to automatically raise an alarm when the specific action is detected, and to save the related video frames and video clips for archiving and review.
To achieve the above object, according to one aspect of the present disclosure, there is provided a monitoring method based on action recognition, including the steps of: a) identifying the limb position of a person in a monitored video frame by using a gesture estimation method, and carrying out human skeleton 2D modeling; b) classifying the 2D model of the human skeleton in the monitored video frame by utilizing a pre-trained gesture classification model; c) storing the gesture classification results of the continuous video frames into gesture vectors, and judging the action type according to a pre-trained action recognition model; and d) if the judged action type belongs to the monitored type, storing the video frames marked with the specific action and/or the video clip of the action into a memory and triggering an alarm.
Preferably, the step a) includes: a1) judging the main joint position coordinates of one or more persons in the video frame; and a2) performing human skeleton 2D modeling for each person using the main joint position coordinates and the association relationships between the joints.
Preferably, the step a2) further comprises performing human skeleton 2D modeling of the hands and/or facial features of each person in the video.
Preferably, the step b) includes: decomposing continuous limb actions to be recognized into discrete key gestures; carrying out key gesture labeling on the 2D modeling result of the human skeleton; and training a gesture classification model by using a convolutional neural network algorithm and the labeled human skeleton 2D modeling result.
Preferably, said step c) comprises: labeling a gesture vector set of a known action; and taking the marked gesture vector set as a training set to train the action recognition model.
Preferably, said step d) comprises at least one of the following steps: marking an object which makes a specific action in the original video and triggering an alarm; archiving the video frames marked with the specific action as evidence; and archiving the video clips marked with the specific action for certification.
Preferably, said step c) further comprises the step of: a hot zone (ROI, Region of Interest) comparison is used to determine the motion of the tracked object in the multi-person scene.
Preferably, the hot zone refers to a designated area of the monitoring video, and if no designated area exists, the hot zone is the whole monitoring picture area.
Preferably, the step c) further comprises: a tracker is added to each person in the video to monitor its actions and determine if the tracked object needs to continue to be tracked, and if it does not need to continue to be tracked, the tracker is deleted.
Preferably, it is determined whether or not tracking is required to be continued by determining whether or not the detected object is in at least one of the following states: the detected object reaches the designated area; the detected object arrives at the departure area; the detected object is in a static state for more than a certain time; and whether an instruction to stop continuing tracking of the object in the monitored area is received.
According to another aspect of the present disclosure, there is provided a monitoring system based on action recognition, comprising: a posture predicting section that recognizes a limb position of a person in the monitored video frame using a posture estimating method, and performs human skeleton 2D modeling based on the obtained position; a gesture classification section that classifies a 2D model of a human skeleton in a monitored video frame using a pre-trained gesture classification model; a pose management section that stores pose classification results in successive video frames into a pose vector; an action recognition section that judges an action type based on a pre-trained action recognition model; and an output section that stores a video frame marking a specific action and/or a video clip of the action into the memory and triggers an alarm when the judged action type belongs to the monitored type.
Preferably, the monitoring system further comprises: the gesture classification training part is used for marking key gestures of the obtained human skeleton 2D model, inputting the marked human skeleton 2D model as a training set into the convolutional neural network training classification model for training so as to obtain a gesture classification model; and the motion recognition training part takes the gesture vector generated by the gesture management part of the known motion video as a training set, and trains the gesture vector by adopting a multivariate classification algorithm to obtain a motion recognition model for classifying the motion of the gesture vector.
The disclosed method performs human skeleton 2D modeling of the persons in the monitored video using a human pose prediction method, classifies the poses using a pose classification method, records the pose sequence in a pose vector, and recognizes human actions using an action classification method, so that human action recognition can be performed in real time in automated production and unattended monitoring can be realized.
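As a rough illustration of how these stages fit together, the following Python sketch wires hypothetical components into the described pipeline; the class names (PosePredictor, PoseClassifier, PoseManager, ActionRecognizer), the method signatures, and the alarm object are assumptions made for illustration, not modules defined by this disclosure.

```python
# Hypothetical end-to-end sketch of the monitoring flow; every class and
# method name here is an assumption for illustration, not the patent's API.
import cv2  # OpenCV is assumed only as a convenient frame source

def monitor(video_path, pose_predictor, pose_classifier,
            pose_manager, action_recognizer, monitored_actions, alarm):
    """Run the skeleton -> key pose -> pose vector -> action pipeline over a video."""
    capture = cv2.VideoCapture(video_path)
    while True:
        ok, frame = capture.read()
        if not ok:                                        # video end or stream interruption
            break
        for skeleton in pose_predictor.predict(frame):    # one 2D skeleton model per person
            key_pose = pose_classifier.classify(skeleton) # key-pose label or None
            if key_pose is None:
                continue
            finished_vector = pose_manager.feed(key_pose) # returns a vector when a set ends
            if finished_vector is None:
                continue
            action = action_recognizer.recognize(finished_vector)
            if action in monitored_actions:
                alarm.trigger(action, frame)              # archive the frame/clip and alert
    capture.release()
```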
In addition, the method and system use a hot zone comparison method to recognize the actions of key persons within a fixed area in a multi-person scene, and use an object tracking method to recognize the actions of multiple moving persons in a multi-person scene, so that they can be applied to different application scenarios such as production command and environment monitoring.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the disclosure and together with the description, serve to explain the principles of the disclosure.
FIG. 1 is a schematic block diagram illustrating a motion recognition based monitoring system in accordance with one embodiment of the present disclosure;
FIG. 2 is a schematic block diagram illustrating a gesture recognition portion according to one embodiment of the present disclosure;
FIG. 3 is a flowchart illustrating the gesture management portion updating the gesture vector;
FIG. 4 is a detailed schematic block diagram of an action recognition portion according to one embodiment of the present disclosure;
FIG. 5 is a flowchart of the operation of a motion recognition based monitoring system according to one embodiment of the present disclosure;
Fig. 6 is a view showing the main joints of the human body;
fig. 7 is a view showing an association relationship between the respective joints;
FIG. 8 shows an example of several poses;
FIGS. 9a and 9b show examples of several poses, respectively;
FIG. 10 illustrates a designated hot zone in a video region; and
Fig. 11 shows a schematic diagram of object tracking in the case where the detected object is moving.
Detailed Description
Reference will now be made in detail to exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, the same numbers in different drawings refer to the same or similar elements unless otherwise indicated. The implementations described in the following exemplary embodiments do not represent all implementations consistent with the present disclosure. Rather, they are merely examples of apparatuses and methods consistent with some aspects of the present disclosure as detailed in the appended claims.
The terminology used in the present disclosure is for the purpose of describing particular embodiments only and is not intended to be limiting of the present disclosure. Unless defined otherwise, all other scientific and technical terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. As used in this disclosure and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used herein refers to and encompasses any or all possible combinations of one or more of the associated listed items.
It should be understood that although the terms first, second, third, etc. may be used in this disclosure to describe various information, this information should not be limited by these terms. These terms are only used to distinguish one type of information from another. For example, a first may also be referred to as a second, and vice versa, without departing from the scope of the present disclosure. The word "if" as used herein may be interpreted, depending on the context, as "when", "upon", or "in response to determining".
In order that those skilled in the art will better understand the present disclosure, the present disclosure will be described in further detail below with reference to the accompanying drawings and detailed description.
FIG. 1 is a schematic block diagram illustrating a motion recognition based monitoring system according to one embodiment of the present disclosure. As shown in fig. 1, the monitoring system includes a video acquisition section 110, a gesture recognition section 120, an action recognition section 130, and an output section 140.
The video capturing section 110 collects video data using a video capturing device such as a mobile phone or a camera, or over a network, and then converts the collected video data (video stream) into video frames, which are supplied to the gesture recognition section 120 for use.
The gesture recognition part 120 detects whether a person in the video frame is in a predefined gesture and, if so, establishes a human skeleton 2D model of each person from the detection result. If a decomposed gesture of any predefined action is found during gesture detection, the detection result is added to the gesture vector, and the gesture vector is passed to the action recognition part 130.
The action recognition section 130 determines whether the posture vector represents a monitored action. If it does, the action recognition section 130 outputs the video frames of the key gestures of the action and/or a video clip of the action to the output section 140.
The output section 140 outputs the key posture frame and/or the video clip corresponding to the action recognized by the action recognition section to a data storage device, a video display device, and/or a sound playing device (not shown).
Fig. 2 is a schematic block diagram illustrating the gesture recognition portion 120 according to one embodiment of the present disclosure. The gesture recognition section 120 includes a gesture prediction section 210, a gesture classification section 220, a gesture classification training section 230, and a gesture management section 240.
The pose prediction section 210 predicts the pose of the human body in the video frame. According to one embodiment of the present disclosure, the pose prediction section 210 determines the 2D coordinate positions of the key joints of the limbs of all persons in the video frame using the OpenPose technique, and then builds a human skeleton 2D model for each detected person according to a customized joint association relationship. For predefined human actions, the continuous limb movements need to be decomposed into discrete key poses (similar to the diagrams of broadcast calisthenics). Preferably, the pose prediction section 210 may further perform human skeleton 2D modeling of the hands of each person in the video and/or 2D modeling of the facial features of each person in the video.
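A minimal sketch of this step, assuming an OpenPose-style output of one (x, y, confidence) triple per joint; the joint list, the confidence cut-off, and the normalization are illustrative assumptions rather than the actual implementation described here.

```python
# Illustrative joint order; real OpenPose output layouts differ (assumption).
JOINTS = ["nose", "neck", "r_shoulder", "r_elbow", "r_wrist",
          "l_shoulder", "l_elbow", "l_wrist", "r_hip", "r_knee",
          "r_ankle", "l_hip", "l_knee", "l_ankle"]

def build_skeleton(keypoints, min_conf=0.1):
    """Turn one person's list of (x, y, confidence) triples into a named 2D skeleton.

    Low-confidence joints are dropped so that occluded limbs do not distort the model.
    """
    skeleton = {}
    for name, (x, y, conf) in zip(JOINTS, keypoints):
        if conf >= min_conf:
            skeleton[name] = (float(x), float(y))
    return skeleton

def normalize_skeleton(skeleton):
    """Translate to the neck and scale so that people of different sizes compare."""
    if "neck" not in skeleton:
        return skeleton
    nx, ny = skeleton["neck"]
    spread = max(max(abs(x - nx), abs(y - ny)) for x, y in skeleton.values()) or 1.0
    return {name: ((x - nx) / spread, (y - ny) / spread)
            for name, (x, y) in skeleton.items()}
```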
Although OpenPose techniques are employed for human body pose prediction according to one embodiment of the present disclosure, it should be understood that any other similar techniques may be employed for human body pose prediction.
The gesture recognition portion 120 has two modes of operation: a gesture training mode and a gesture recognition mode. In the gesture training mode, the gesture classification training section 230 performs key gesture labeling on the human skeleton 2D models obtained by the pose prediction section 210. The labeled human skeleton 2D models are input as a training set into a convolutional neural network to train a classification model. The trained model is then used as the gesture classification model in the recognition mode to classify human skeleton 2D models automatically.
In the gesture recognition mode, the gesture prediction section 210 passes the human skeleton 2D model to the gesture classification section 220, which classifies the gesture using the gesture classification model trained in the training mode. For any application scenario, an initial gesture should be defined in the gesture decomposition diagram of each action.
The gesture management section 240 maintains a pose vector P(c, s) for the current video, where c is a pose category, for example a common human pose (standard sitting, answering a phone call, lying on a desk) or an application-specific pose (arm extended, arm bent, fist clenched), and s is the number of times that pose category c has been consecutively detected.
Fig. 3 is a flowchart illustrating how the gesture management section 240 updates the pose vector. As shown in fig. 3, once the gesture classification section 220 detects a predefined pose in a video frame (310), the pose class is passed to the gesture management section 240. The gesture management section 240 first checks whether the pose vector P(c, s) is empty, that is, whether no pose record is currently held in the vector (320). If the pose vector is empty (320), the gesture management section 240 determines whether the detected pose is an initial pose (330). The initial pose is the first pose in the decomposition diagram of some action. For example, when the monitored person is in a standard sitting posture and performs a fist-clenching action by bending the arm, the first decomposed pose is stretching the arm; the arm position in this pose is clearly different from the arm resting naturally on the chair armrest in the standard sitting posture, so it can be regarded as the beginning of an action. If the detected pose is not an initial pose, the determination ends and the pose is discarded (370). For example, if the monitored pose is raising an arm and this decomposed pose is not the first pose of any known action, the subsequent motion can be determined not to be an action that needs to be monitored and can therefore be ignored. In other words, this pose, and the poses that may follow it, are not part of any pose set that needs to be monitored, so monitoring of them does not continue. Of course, if such a pose needs to be monitored in the future, it can also be treated as an initial pose and will no longer be ignored once it is brought into the monitoring range. If the detected pose is an initial pose, the gesture management section 240 stores it as the current pose in the pose vector (340) and increments the counter of the current pose by 1 (360). The current pose is the latest pose recorded in the pose vector. Since video typically runs at 30 frames per second, monitoring continuous video frame by frame causes the poses in many consecutive frames to be recognized as the same pose because the differences are small, so a counter is used to record the number of occurrences of the current pose. If the pose vector is not empty, the gesture management section 240 determines whether the detected pose is the current pose (i.e., the last detected key pose) (350). In other words, the initial pose marks the determination and recording of the start of an action, while the current pose is the most recently recorded key pose among the multiple decomposed poses (i.e., key poses) of an action. When an action has just begun to be recorded, the initial pose and the current pose are the same; once a pose different from the initial pose is detected during action recording, that new pose becomes the current pose, and the initial pose and the current pose then differ. If the detected pose is not the current pose, the gesture management section 240 stores it as the new current pose in the pose vector (340) and starts counting for this new current pose, that is, increments its counter by 1 (360). If the detected pose is the current pose, that is, if it differs only slightly from the immediately preceding key pose and is judged to be the same, the gesture management section 240 increments the counter of the current pose by 1 (360).
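The per-frame update described above can be sketched as follows; the data layout (a list of [category, count] pairs) and the class and method names are assumptions made for illustration.

```python
# Sketch of the Fig. 3 update logic for the pose vector P(c, s).
# Pose categories are plain integers here; the structure is an assumption.

class PoseManager:
    def __init__(self, initial_poses):
        self.initial_poses = set(initial_poses)  # first key pose of each known action
        self.vector = []                         # list of [category, count] pairs

    def update(self, pose):
        """Update P(c, s) with one detected key pose; return True if it was recorded."""
        if not self.vector:                      # pose vector is empty
            if pose not in self.initial_poses:   # not the start of any known action
                return False                     # discard the pose
            self.vector.append([pose, 1])        # store as current pose, counter = 1
            return True
        current, _ = self.vector[-1]
        if pose == current:                      # same as the last key pose
            self.vector[-1][1] += 1              # counter of the current pose + 1
        else:                                    # a new key pose of the action
            self.vector.append([pose, 1])
        return True
```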
The gesture management section 240 ends the vector update for a set of poses under any of the following conditions:
The current pose is an ending pose (the counterpart of the initial pose, i.e., the last of the multiple decomposed key poses of an action) and its number of consecutive occurrences exceeds a predefined threshold.
The counter of the current pose is not updated for N frames. This means that no further pose of the monitored person has been detected within the predetermined period of time.
The state of the current pose does not change for N frames. This means that although poses are still being detected, the person remains in the same pose state.
The system commands the pose vector update to end.
The video ends or the video stream is interrupted.
When the update of the pose vector is finished, the vector is sent to the subsequent module for processing, and the pose vector in the gesture management section 240 is reinitialized, ready to record the next set of poses.
After the pose vector update is finished, the pose vector is normalized, which includes filtering out poses whose number of consecutive detections is below a predefined threshold, so as to prevent misjudgments caused by occasional false detections. Computer-vision-based pose recognition may be affected by factors such as lighting, viewing angle, and occlusion, so a certain degree of misjudgment is possible. Since such misjudgments are usually sporadic, a threshold is set to reduce their influence on the overall classification: a pose is recorded only when the same pose is detected in a number of consecutive frames at or above the threshold; otherwise it is regarded as a sporadic false detection and is not recorded, thereby improving the accuracy of overall pose detection. After filtering, poses that occur multiple times in succession also need to be merged.
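A small sketch of this normalization step; the (category, count) pair representation is an assumption, and the comment at the end mirrors the worked example given later in the text.

```python
def normalize_pose_vector(vector, min_count=2):
    """Filter sporadic mis-detections, then merge repeated poses.

    vector is a list of (category, count) pairs, e.g. [(1, 2), (3, 1), (1, 2), (2, 2)].
    Entries whose count is below min_count are treated as false detections and dropped;
    consecutive entries with the same category are then merged.
    """
    filtered = [(c, s) for c, s in vector if s >= min_count]
    merged = []
    for c, s in filtered:
        if merged and merged[-1][0] == c:
            merged[-1] = (c, merged[-1][1] + s)   # merge consecutive identical poses
        else:
            merged.append((c, s))
    return merged

# Matches the worked example later in the text:
# normalize_pose_vector([(1, 2), (3, 1), (1, 2), (2, 2)]) -> [(1, 4), (2, 2)]
```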
Fig. 4 is a detailed schematic block diagram of the action recognition portion 130 shown in fig. 1, according to one embodiment of the present disclosure. As shown in fig. 4, the action recognition section 130 includes an action recognizer 410 and an action recognition training section 420.
The action recognition part also has two modes of operation: an action training mode and an action recognition mode. In the action training mode, sample training action videos are passed through the gesture management section 240, and the resulting pose vectors are input to the action recognition training section 420 as a training set. The action recognition training section 420 trains on the pose vectors using a multivariate classification algorithm to obtain a sample action recognition model. The trained model is then used by the action recognizer 410 as the reference for action recognition, to classify the actions of the pose vectors generated by the gesture management section 240.
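The text names a multivariate classification algorithm without specifying one; the sketch below assumes scikit-learn's logistic regression and a simple fixed-length flattening of the pose vector, both of which are illustrative choices only.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

MAX_POSES = 8            # assumed maximum number of key poses per action

def vector_to_features(pose_vector):
    """Flatten a pose vector [(c, s), ...] into a fixed-length feature row."""
    feats = np.zeros(MAX_POSES * 2)
    for i, (c, s) in enumerate(pose_vector[:MAX_POSES]):
        feats[2 * i] = c
        feats[2 * i + 1] = s
    return feats

def train_action_model(labelled_vectors):
    """labelled_vectors: list of (pose_vector, action_label) pairs."""
    X = np.stack([vector_to_features(v) for v, _ in labelled_vectors])
    y = np.array([label for _, label in labelled_vectors])
    model = LogisticRegression(max_iter=1000)   # one possible multivariate classifier
    model.fit(X, y)
    return model

def recognize(model, pose_vector):
    """Return the most likely action class and its confidence for one pose vector."""
    probs = model.predict_proba([vector_to_features(pose_vector)])[0]
    return int(np.argmax(probs)), float(probs.max())
```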
In the recognition mode, the gesture vector generated by the gesture management section 240 is input directly into the action recognizer 410. The action recognizer 410 determines the action type using the sample recognition model trained by the action recognition training section 420. Typically, the system pre-assigns some action categories as monitored action categories (also referred to as specific action categories). If the determined action category is a monitored category, the output section 140 stores the video frames and/or video clips marking the specific action into a memory (not shown). Preferably, the output section 140 may trigger an alarm at the same time.
Preferably, when the initial pose and the end pose of an action are the same (i.e., the action has only one key pose), whether the action is the corresponding action can be determined by checking whether the value of the key pose counter exceeds a predefined pose determination threshold, without requiring a pre-trained action classification model. For example, to determine whether a person is making a phone call, assuming that the action determination threshold of the pose counter is K, if the phone-call pose occurs in N consecutive frames with N >= K, it can be determined that the person is making a call.
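For this single-key-pose case the check reduces to a counter comparison, sketched here under the same (category, count) representation assumed in the earlier snippets:

```python
def single_pose_action_detected(pose_vector, pose_class, k):
    """E.g. 'making a phone call': the pose has been held for at least k consecutive frames."""
    return any(c == pose_class and s >= k for c, s in pose_vector)
```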
Fig. 5 is a flowchart of the operation of the motion recognition based monitoring system according to one embodiment of the present disclosure. As shown in fig. 5, in step S51, the monitoring system identifies the limb positions of the persons in the monitored video frame using a pose estimation method. The estimation method can be any existing method, such as OpenPose, for performing 2D modeling of human bones. The modeled objects include the hands, the facial features, and joints such as the wrists, elbows, shoulders, neck, hips, knees, ankles, and fingers. Here, human skeleton 2D modeling may be performed for each person using the main joint position coordinates of one or more persons in the video frame and the relationships between the joints.
Next, if a pose classification model has been trained in advance, then in step S52 the 2D model of the human skeleton in the monitored video frame is classified using the pre-trained pose classification model. If not, such a pose classification model is first trained in step S52 and then used to classify the 2D model of the human skeleton in the monitored video frame. The pose classification model is trained as follows: the continuous limb actions to be recognized are decomposed into discrete key poses, the human skeleton 2D modeling results are labeled with these key poses, and a pose classification model is trained using a convolutional neural network algorithm and the labeled human skeleton 2D modeling results.
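The text specifies a convolutional neural network over labeled skeleton 2D modeling results but no architecture; the sketch below assumes the skeletons are rendered as 64x64 grayscale images and uses a small PyTorch model as one possible realization.

```python
import torch
import torch.nn as nn

class PoseClassifierCNN(nn.Module):
    """Tiny CNN over 64x64 rendered skeleton images; the architecture is assumed."""
    def __init__(self, n_classes):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.head = nn.Linear(32 * 16 * 16, n_classes)

    def forward(self, x):                       # x: (batch, 1, 64, 64)
        return self.head(self.features(x).flatten(1))

def train_pose_classifier(images, labels, n_classes, epochs=10):
    """images: (N, 1, 64, 64) float tensor of rendered skeletons; labels: (N,) long tensor."""
    model = PoseClassifierCNN(n_classes)
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):                     # full-batch training for brevity
        optimizer.zero_grad()
        loss = loss_fn(model(images), labels)
        loss.backward()
        optimizer.step()
    return model
```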
Then, in step S53, a pose vector is established, and a labeled set of pose vectors of known actions is used as a training set to train the action recognition model, which determines whether the action type belongs to the monitored type. Specifically, in this step, once the gesture classification section 220 detects a trained sample gesture in a video frame, this gesture class is passed to the gesture management section 240. The gesture management section 240 first checks whether the pose vector is empty. If the pose vector is empty, the gesture management section 240 determines whether the detected gesture is an initial gesture. If it is not, the determination ends and the gesture is discarded. If it is the initial gesture, the gesture management section 240 stores it as the current gesture in the pose vector and increments the counter of the current gesture by 1. If the pose vector is not empty, the gesture management section 240 determines whether the detected gesture is the current gesture (i.e., the last detected key gesture). If it is not, the gesture management section 240 stores it as the new current gesture in the pose vector and increments its counter by 1. If it is, the gesture management section 240 increments the counter of the current gesture by 1.
The gesture management section 240 will end vector update for a set of gestures under the following conditions:
the current gesture is an end gesture and its number of successive occurrences exceeds a predefined threshold
The counter of the current pose is not updated in N frames
The state of the current pose is not changed in N frames
System command end gesture update
Video end or video stream interruption
Next, if the determined action type belongs to the monitored type, the video frame for marking the specific action and/or the video clip for the action is stored in the memory and an alarm is triggered in step S54.
It should be appreciated that in other embodiments of the present disclosure, if multiple people are present in the video, a hot zone (ROI, Region of Interest) comparison method or an object tracking method may be employed to distinguish and track objects.
If the monitored object is located in a fixed area, the hot zone comparison method is suitable. A hot zone is first specified in the video region; then the contour polygon (a rectangle by default) of each established human skeleton 2D model is drawn; and the overlap ratio between each contour region and the hot zone is computed. The model with the highest ratio is regarded as the detected object.
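A minimal sketch of this hot-zone comparison under the default rectangular contours, treating both the skeleton contour and the hot zone as axis-aligned (x1, y1, x2, y2) boxes; the function names are assumptions.

```python
def overlap_ratio(box, hot_zone):
    """Fraction of a skeleton's bounding box (x1, y1, x2, y2) that lies inside the hot zone."""
    ix1, iy1 = max(box[0], hot_zone[0]), max(box[1], hot_zone[1])
    ix2, iy2 = min(box[2], hot_zone[2]), min(box[3], hot_zone[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area = max(1e-9, (box[2] - box[0]) * (box[3] - box[1]))
    return inter / area

def pick_tracked_person(boxes, hot_zone):
    """The skeleton whose contour rectangle overlaps the hot zone most is taken as the target."""
    return max(range(len(boxes)), key=lambda i: overlap_ratio(boxes[i], hot_zone))
```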
If the detected object is moving, the object is tracked using an object tracking method, and the pose vector of each object is recorded separately. According to one embodiment of the present disclosure, KCF (Kernelized Correlation Filters), BOOSTING, MIL (Multiple Instance Learning), TLD (Tracking, Learning and Detection), GOTURN, or other object tracking algorithms may be employed to track objects in the video picture.
In other embodiments of the present disclosure, a tracker may be added to everyone in the video. It may be determined whether the tracked object needs to continue to be tracked. For example, whether or not tracking needs to be continued may be determined by determining whether or not the detected object is in at least one of the following states: the detected object reaches the designated area; the detected object arrives at the departure area; the detected object is in a static state for more than a certain time; and whether an instruction to stop continuing tracking of the object in the monitored area is received. If tracking does not need to be continued, the tracker is deleted.
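One possible realization of this per-person tracker lifecycle, assuming OpenCV's KCF tracker (available via the opencv-contrib package); the stop-condition callbacks and thresholds are hypothetical placeholders for the states listed above.

```python
import cv2  # the KCF tracker is assumed to come from opencv-contrib-python

class PersonTracker:
    def __init__(self, frame, bbox):
        self.tracker = cv2.TrackerKCF_create()
        self.tracker.init(frame, bbox)           # bbox = (x, y, w, h)
        self.still_frames = 0

    def step(self, frame, reached_stop_area, stop_requested, still_limit=90):
        """Advance one frame; return the new bbox, or None if this tracker should be deleted."""
        ok, bbox = self.tracker.update(frame)
        if not ok:
            return None                          # target lost
        self.still_frames = self.still_frames + 1 if self._is_still(bbox) else 0
        if reached_stop_area(bbox) or stop_requested() or self.still_frames > still_limit:
            return None                          # designated/departure area, stop command, or long stillness
        return bbox

    def _is_still(self, bbox, eps=2):
        prev = getattr(self, "_prev", bbox)
        self._prev = bbox
        return all(abs(a - b) <= eps for a, b in zip(prev, bbox))
```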
Example
The purpose of this example is to perform motion detection of a person in a video generated by shooting with a camera. The detection process is as follows.
1) Human bone 2D modeling is performed in the video region.
Fig. 6 is a view showing the main joints of the human body. As shown in fig. 6, the main joints of the human body (e.g., wrists, elbows, shoulders, neck, hips, knees, ankles, fingers, etc., shown as white dots on the body) are detected using an existing deep learning model for human key-point detection (e.g., OpenPose).
Fig. 7 is a view showing an association relationship between the respective joints. As shown in fig. 7, according to a predefined association relationship between joints (for example, association between right elbow and right wrist), a human skeleton is drawn and a model is normalized so that its output size is uniform.
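As a sketch of this drawing-and-normalization step, the snippet below renders a few assumed joint associations onto a fixed-size canvas with OpenCV; the edge list is only a partial illustration of the predefined association table.

```python
import cv2
import numpy as np

# A few illustrative joint associations (e.g. right elbow <-> right wrist);
# the full predefined association table is not reproduced here.
EDGES = [("neck", "r_shoulder"), ("r_shoulder", "r_elbow"), ("r_elbow", "r_wrist"),
         ("neck", "l_shoulder"), ("l_shoulder", "l_elbow"), ("l_elbow", "l_wrist")]

def render_skeleton(skeleton, size=64):
    """Draw a skeleton dict {joint: (x, y)} onto a size x size grayscale canvas."""
    canvas = np.zeros((size, size), dtype=np.uint8)
    pts = np.array(list(skeleton.values()), dtype=np.float32)
    mins, maxs = pts.min(axis=0), pts.max(axis=0)
    scale = (size - 1) / max(1.0, float((maxs - mins).max()))  # uniform output size

    def to_px(p):
        q = (np.asarray(p, dtype=np.float32) - mins) * scale
        return int(q[0]), int(q[1])

    for a, b in EDGES:
        if a in skeleton and b in skeleton:
            cv2.line(canvas, to_px(skeleton[a]), to_px(skeleton[b]), 255, 1)
    return canvas
```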
2) And then carrying out gesture prediction on the human skeleton 2D model established by each frame.
Fig. 8 shows an example of several poses. As shown in fig. 8, the pre-trained pose classification model is used to predict the pose of the human skeleton 2D model, and the prediction result with the prediction reliability greater than a predefined threshold (e.g., 50%) is written into the pose vector P (c, s), where c is the pose class and s is the number of times each pose class is continuously detected. In other words, the acquired gesture is compared with the gesture classification model, and the similarity with the gesture classification model is obtained. The prediction reliability is calculated by a gesture recognition algorithm and is used for quantifying the approximation degree of the predicted gesture and the sample gesture of the training gesture classification model. The prediction reliability threshold is an empirical value summarized in a test environment according to an actual application scene, and the value of the prediction reliability threshold can be configured according to the actual scene.
In the example shown in fig. 8, the prediction confidence of one gesture is 12%, which is below the predefined threshold, so that gesture is not recorded in the gesture vector. The pose vector for this set of poses is therefore:
P1=[(1,2),(2,2)]
3) Finally, after the update of the posture vector P (c, s) is finished, the posture vector is classified by using a pre-trained motion recognition model, and the motion type is determined. The vector update for a set of poses will end under the following conditions:
the current gesture is an end gesture and its number of successive occurrences exceeds a predefined threshold
The state of the current pose is not updated in N frames
The state of the current pose is not changed in N frames
System command end gesture update
Video end or video stream interruption
For example, if in pose vector P1 pose 1 appears in 2 consecutive frames and pose 2 then appears in 2 consecutive frames before the vector ends, and the action recognition model determines the confidence of action 1 to be 85%, then the action can be determined to be action 1. The detected poses 1 and 2 are both constituent poses of action 1.
P1= [ (1, 2), (2, 2) ]= > action 1 (0.85).
Fig. 9a and 9b show examples of several poses, respectively. It is noted that, as shown in figs. 9a and 9b, some erroneously detected gestures (e.g., gesture 3) may also occur between gesture 1 and gesture 2 in actual measurements. To prevent such information from affecting the action determination, the pose vector needs to be normalized before action detection is performed. The normalization process is as follows:
Set a threshold (e.g., 2) on the number of consecutive occurrences of a gesture, and filter out vector values whose number of consecutive occurrences is below the threshold.
As shown in fig. 9a, for P1 = [(1, 2), (3, 1), (2, 2)], gesture 3 occurs only once in succession and is therefore removed from the pose vector; the corrected vector is P1' = [(1, 2), (2, 2)].
After removing vector values below the threshold, consecutive vector values with the same gesture are merged.
As shown in fig. 9b, for P1 = [(1, 2), (3, 1), (1, 2), (2, 2)], after filtering out the vector value that occurs only once, the vector becomes:
P1'=[(1,2),(1,2),(2,2)],
It can be seen that gesture 1 occurs twice in succession in P1', so these entries are merged into:
P1”=[(1,4),(2,2)]。
If the action is found to be a predefined action after the action detection, an alarm is triggered and the action video clips are output or the decomposed video frames of the action are output one by one.
Preferably, if multiple people are present in the video, a hot zone (ROI, Region of Interest) comparison method or an object tracking method may be employed to distinguish and track objects.
If the monitored object is located in a fixed area, the hot zone comparison method is suitable. Fig. 10 shows a designated hot zone in a video region. As shown in fig. 10, a hot zone is first specified in the video area; then the contour polygon (a rectangle by default) of each established human skeleton 2D model is drawn; and the overlap ratio between each contour region and the hot zone is computed. The model with the highest ratio is regarded as the detected object.
If the detected object is moving, the object is tracked using an object tracking method, and the pose vector of each object is recorded separately. Fig. 11 shows a schematic diagram of object tracking in the case where the detected object is moving. As shown in fig. 11, according to one embodiment of the present disclosure, KCF (Kernelized Correlation Filters), BOOSTING, MIL (Multiple Instance Learning), TLD (Tracking, Learning and Detection), GOTURN, or other object tracking algorithms may be employed to track objects in the video picture.
The disclosed method performs human skeleton 2D modeling of the persons in the monitored video using a human pose prediction method, classifies the poses using a pose classification method, records the pose sequence in a pose vector, and recognizes human actions using an action classification method, so that human action recognition can be performed in real time in automated production and unattended monitoring can be realized.
In addition, the method and system use a hot zone comparison method to recognize the actions of key persons within a fixed area in a multi-person scene, and use an object tracking method to recognize the actions of multiple moving persons in a multi-person scene, so that they can be applied to different application scenarios such as production command and environment monitoring.
The present invention is not to be limited in scope by the specific embodiments described herein, which are intended as exemplary embodiments. Functionally equivalent products and methods are clearly within the scope of the invention as described herein.
While the basic principles of the present disclosure have been described above in connection with specific embodiments, it should be noted that all or any steps or components of the methods and apparatus of the present disclosure can be implemented in hardware, firmware, software, or combinations thereof in any computing device (including processors, storage media, etc.) or network of computing devices, as would be apparent to one of ordinary skill in the art upon reading the present disclosure.
Thus, the objects of the present disclosure may also be achieved by running a program or set of programs on any computing device. The computing device may be a well-known general purpose device. Thus, the objects of the present disclosure may also be achieved by simply providing a program product containing program code for implementing the method or apparatus. That is, such a program product also constitutes the present disclosure, and a storage medium storing such a program product also constitutes the present disclosure. It is apparent that the storage medium may be any known storage medium or any storage medium developed in the future.
It should also be noted that in the apparatus and methods of the present disclosure, it is apparent that the components or steps may be disassembled and/or assembled. Such decomposition and/or recombination should be considered equivalent to the present disclosure. The steps of executing the series of processes may naturally be executed in chronological order in the order described, but are not necessarily executed in chronological order. Some steps may be performed in parallel or independently of each other.
The above detailed description should not be taken as limiting the scope of the present disclosure. It will be apparent to those skilled in the art that various modifications, combinations, sub-combinations and alternatives can occur depending upon design requirements and other factors. Any modifications, equivalent substitutions and improvements made within the spirit and principles of the present disclosure are intended to be included within the scope of the present disclosure.

Claims (11)

1. A monitoring method based on action recognition, comprising the steps of:
a) Identifying the limb position of a person in a monitored video frame by using a gesture estimation method, and carrying out human skeleton 2D modeling;
b) Classifying the 2D model of the human skeleton in the monitored video frame by utilizing a pre-trained gesture classification model;
c) Storing the gesture classification result in the continuous video frame into gesture vectors P (c, s) of the current video, wherein c is gesture category, s is the number of times each gesture category c is continuously detected, judging the action type according to a pre-trained action recognition model, marking a gesture vector set of a known action, filtering vector values with the continuous occurrence number being smaller than a threshold value, merging continuous vector values with the same gesture after filtering and removing the vector values with the continuous occurrence number being smaller than the threshold value, and taking the marked gesture vector set as a training set training action recognition model; and
D) If the judged action type belongs to the monitored type, storing the video frame marked with the specific action and/or the video fragment of the action into a memory and triggering an alarm.
2. The monitoring method according to claim 1, wherein step a) includes:
a1 Judging the main joint position coordinates of one or more persons in the video frame; and
A2 2D modeling of human bones for each person using the relationship between the primary joint position coordinates and the joints.
3. The monitoring method according to claim 2, wherein step a2) further comprises
performing human skeleton 2D modeling of the hands and/or facial features of each person in the video.
4. The monitoring method according to claim 1, wherein step b) includes:
decomposing continuous limb actions to be recognized into discrete key gestures;
carrying out key gesture labeling on the 2D modeling result of the human skeleton; and
And training a gesture classification model by using a convolutional neural network algorithm and a labeled human skeleton 2D modeling result.
5. The monitoring method according to claim 1, wherein step d) includes at least one of the following steps:
marking an object which makes a specific action in the original video, and triggering an alarm;
archiving the video frames marked with the specific action as evidence; and
The video clip marked with the specific action is archived for certification.
6. The monitoring method according to claim 1, wherein the step c) further comprises the step of: and judging the action of the tracked object in the multi-person scene by adopting a hot zone comparison method.
7. The monitoring method of claim 6, wherein the hot zone is a designated area of a monitoring video or an entire monitoring picture area.
8. The monitoring method according to claim 1, wherein step c) further comprises:
Adding a tracker to each person in the video to monitor their actions, and
And judging whether the tracked object needs to be tracked continuously, and if the tracked object does not need to be tracked continuously, deleting the tracker.
9. The monitoring method according to claim 8, wherein whether or not tracking needs to be continued is determined by determining whether or not the detected object is in at least one of the following states:
the detected object reaches the designated area;
The detected object arrives at the departure area;
The detected object is in a static state for more than a certain time; and
Whether an instruction to stop continuing tracking of an object in the monitored area is received.
10. A monitoring system based on motion recognition, comprising:
A posture predicting section that recognizes a limb position of a person in the monitored video frame using a posture estimating method, and performs human skeleton 2D modeling based on the obtained position;
A gesture classification section that classifies a 2D model of a human skeleton in a monitored video frame using a pre-trained gesture classification model;
A gesture management section that stores the gesture classification results of successive video frames into a gesture vector P(c, s) of the current video, where c is a gesture category and s is the number of times each gesture category c is continuously detected, including labeling a gesture vector set of known actions, filtering out vector values whose number of consecutive occurrences is less than a threshold, merging consecutive vector values with the same gesture after filtering out and removing the vector values whose number of occurrences is less than the threshold, and taking the labeled gesture vector set as a training set to train the action recognition model;
an action recognition section that judges an action type based on a pre-trained action recognition model; and
And an output section that stores a video frame marking a specific action and/or a video clip of the action into the memory and triggers an alarm when the judged action type belongs to the monitored type.
11. The monitoring system of claim 10, further comprising:
The gesture classification training part is used for marking key gestures of the obtained human skeleton 2D model, inputting the marked human skeleton 2D model as a training set into the convolutional neural network training classification model for training so as to obtain a gesture classification model; and
And the motion recognition training part takes the gesture vector generated by the gesture management part of the known motion video as a training set, and trains the gesture vector by adopting a multivariate classification algorithm to obtain a motion recognition model for classifying the motion of the gesture vector.
CN201811453471.2A 2018-11-30 2018-11-30 Monitoring system and method based on action recognition Active CN109598229B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811453471.2A CN109598229B (en) 2018-11-30 2018-11-30 Monitoring system and method based on action recognition

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811453471.2A CN109598229B (en) 2018-11-30 2018-11-30 Monitoring system and method based on action recognition

Publications (2)

Publication Number Publication Date
CN109598229A CN109598229A (en) 2019-04-09
CN109598229B true CN109598229B (en) 2024-06-21

Family

ID=65960432

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811453471.2A Active CN109598229B (en) 2018-11-30 2018-11-30 Monitoring system and method based on action recognition

Country Status (1)

Country Link
CN (1) CN109598229B (en)

Families Citing this family (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110166650B (en) * 2019-04-29 2022-08-23 北京百度网讯科技有限公司 Video set generation method and device, computer equipment and readable medium
CN110580446A (en) * 2019-07-16 2019-12-17 上海交通大学 Behavior semantic subdivision understanding method, system, computer device and medium
CN110399690B (en) * 2019-07-31 2020-09-22 佳都新太科技股份有限公司 Subway station pedestrian simulation method and device, electronic equipment and storage medium
CN110490109B (en) * 2019-08-09 2022-03-25 郑州大学 Monocular vision-based online human body rehabilitation action recognition method
CN110852248A (en) * 2019-11-07 2020-02-28 江苏弘冉智能科技有限公司 Flammable and explosive area illegal equipment based on machine vision and action monitoring method
CN110969101A (en) * 2019-11-21 2020-04-07 浙江工业大学 Face detection and tracking method based on HOG and feature descriptor
CN111078093A (en) * 2019-12-20 2020-04-28 深圳创维-Rgb电子有限公司 Screen picture rotation control method and device, electronic product and storage medium
CN113033252B (en) * 2019-12-24 2024-06-28 株式会社理光 Gesture detection method, gesture detection device and computer-readable storage medium
CN113128298B (en) * 2019-12-30 2024-07-02 上海际链网络科技有限公司 Loading and unloading behavior analysis method and monitoring system
CN111476118B (en) * 2020-03-26 2021-03-30 长江大学 Animal behavior automatic identification method and device
CN112488073A (en) * 2020-12-21 2021-03-12 苏州科达特种视讯有限公司 Target detection method, system, device and storage medium

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104517097A (en) * 2014-09-24 2015-04-15 浙江大学 Kinect-based moving human body posture recognition method
CN108491754A (en) * 2018-02-02 2018-09-04 泉州装备制造研究所 A kind of dynamic representation based on skeleton character and matched Human bodys' response method

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8600166B2 (en) * 2009-11-06 2013-12-03 Sony Corporation Real time hand tracking, pose classification and interface control
EP2707834B1 (en) * 2011-05-13 2020-06-24 Vizrt Ag Silhouette-based pose estimation
CN105809144B (en) * 2016-03-24 2019-03-08 重庆邮电大学 A kind of gesture recognition system and method using movement cutting
CN106611157B (en) * 2016-11-17 2019-11-29 中国石油大学(华东) A kind of more people's gesture recognition methods detected based on light stream positioning and sliding window
CN106897670B (en) * 2017-01-19 2020-09-22 南京邮电大学 Express violence sorting identification method based on computer vision

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104517097A (en) * 2014-09-24 2015-04-15 浙江大学 Kinect-based moving human body posture recognition method
CN108491754A (en) * 2018-02-02 2018-09-04 泉州装备制造研究所 A kind of dynamic representation based on skeleton character and matched Human bodys' response method

Also Published As

Publication number Publication date
CN109598229A (en) 2019-04-09

Similar Documents

Publication Publication Date Title
CN109598229B (en) Monitoring system and method based on action recognition
US11783183B2 (en) Method and system for activity classification
CN111383421B (en) Privacy protection fall detection method and system
CN111666857B (en) Human behavior recognition method, device and storage medium based on environment semantic understanding
Wang et al. Fall detection based on dual-channel feature integration
CN110796051A (en) Real-time access behavior detection method and system based on container scene
Poonsri et al. Improvement of fall detection using consecutive-frame voting
CN113011211A (en) Smoking behavior detection method and device, electronic equipment and storage medium
Ponce-López et al. Multi-modal social signal analysis for predicting agreement in conversation settings
Hua et al. Falls prediction based on body keypoints and seq2seq architecture
CN116246299A (en) Low-head-group intelligent recognition system combining target detection and gesture recognition technology
CN113408435B (en) Security monitoring method, device, equipment and storage medium
CN111274854A (en) Human body action recognition method and vision enhancement processing system
Sugimoto et al. Robust rule-based method for human activity recognition
CN115546825A (en) Automatic monitoring method for safety inspection normalization
CN113408433B (en) Intelligent monitoring gesture recognition method, device, equipment and storage medium
Mukhtar et al. RETRACTED: Gait Analysis of Pedestrians with the Aim of Detecting Disabled People
Chanda et al. A new hand gesture recognition scheme for similarity measurement in a vision based barehanded approach
KR20210062375A (en) Method and Apparatus for detecting abnormal behavior based on psychological state
CN113408434B (en) Intelligent monitoring expression recognition method, device, equipment and storage medium
US20230206639A1 (en) Non-transitory computer-readable recording medium, information processing method, and information processing apparatus
Paulose et al. Recurrent neural network for human action recognition using star skeletonization
Kulkarni et al. Robust hand gesture recognition system using motion templates
US20230206640A1 (en) Non-transitory computer-readable recording medium, information processing method, and information processing apparatus
US20230206641A1 (en) Storage medium, information processing method, and information processing apparatus

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant