
CN114332927A - Classroom hand-raising behavior detection method, system, computer equipment and storage medium - Google Patents


Info

Publication number
CN114332927A
Authority
CN
China
Prior art keywords
hand
classroom
joint point
detected
model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111606911.5A
Other languages
Chinese (zh)
Inventor
张格格
陈曾平
王亮
王鲁平
刘浚乐
江伟弘
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Sun Yat Sen University
Original Assignee
Sun Yat Sen University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Sun Yat Sen University filed Critical Sun Yat Sen University
Priority to CN202111606911.5A priority Critical patent/CN114332927A/en
Publication of CN114332927A publication Critical patent/CN114332927A/en
Pending legal-status Critical Current

Landscapes

  • Image Analysis (AREA)

Abstract

The invention provides a classroom hand-lifting behavior detection method, a system, computer equipment and a storage medium. The classroom hand-lifting behavior detection method comprises: obtaining image data to be detected by performing frame-extraction processing on a classroom video to be detected; performing posture estimation on the image data to be detected by adopting a posture estimation model to obtain a posture estimation result in the form of an upper-body posture image comprising a plurality of human body joint points; determining a hand region to be recognized according to the elbow joint point coordinates and wrist joint point coordinates in the posture estimation result; and detecting the hand region to be recognized by adopting a gesture recognition model to obtain a hand-lifting behavior detection result. The method effectively solves the problem of low classroom hand-lifting detection accuracy caused in practical applications by factors such as human body occlusion, the small resolution of objects to be recognized and large brightness differences in the picture data, improves the accuracy of real-time classroom hand-lifting behavior detection, and thereby greatly improves the application value of the corresponding detection results.

Description

Classroom hand-raising behavior detection method, system, computer equipment and storage medium
Technical Field
The invention relates to the technical field of computer vision and posture estimation, in particular to a classroom hand-lifting behavior detection method and system based on posture estimation and gesture recognition, computer equipment and a storage medium.
Background
With the development of information technology in the field of education, automated and intelligent classroom monitoring has become a popular research direction: observing teacher-student interaction in real time and supervising and objectively evaluating each student's learning state facilitates teaching management and helps improve teaching quality in a targeted manner. Hand-raising is the most typical classroom behavior of students, the most intuitive expression of their active participation in classroom content and the most direct form of teacher-student interaction, so it can serve as an effective basis for evaluating both student learning states and teaching quality. In a real classroom scene, however, detecting raised hands faces many challenges: front-row students in densely seated classrooms occlude the students behind them, hands raised in the back half of the classroom have small resolution, and classroom lighting causes large brightness differences, all of which affect detection accuracy. How to improve the accuracy of classroom behavior detection, and thereby the application value of the corresponding detection results, has therefore received much attention from researchers.
Existing methods for classroom hand-raising detection mainly fall into two categories: 1) obtain student posture information from the image data and analyze from it whether the behavior is hand-raising; 2) treat hand-raising as a target detection problem and detect it directly with a target detection algorithm. Although both can detect and identify classroom hand-raising behavior to a certain extent, each has application defects: 1) when bottom-up posture estimation is applied to classroom scenes with many people, the joint points of different subjects are easily confused, so the posture estimation accuracy is low; 2) a target detection algorithm is a static method, and when applied to dynamic motion detection it only detects raised hands effectively for front-row students, performing poorly on the small-resolution hand-raising targets in the back rows.
Disclosure of Invention
The invention aims to provide, from the perspective of computer vision, a classroom hand-lifting behavior detection method, system, computer equipment and storage medium that adopt deep learning to perform posture estimation on human bodies in video streams of complex classroom environments, obtain the hand region to be recognized based on the posture estimation result, and then distinguish hand-lifting from non-hand-lifting gestures.
In order to achieve the above object, it is necessary to provide a classroom hand-lifting behavior detection method, system, computer device and storage medium in view of the above technical problems.
In a first aspect, an embodiment of the present invention provides a method for detecting a classroom hand-lifting behavior, where the method includes the following steps:
collecting classroom videos to be detected in real time, and obtaining image data to be detected according to the classroom videos to be detected;
inputting the image data to be detected into an attitude estimation model for attitude estimation to obtain a corresponding attitude estimation result; the posture estimation result is an upper half body posture graph comprising a plurality of human body joint points;
determining a hand region to be recognized according to the attitude estimation result;
and inputting the hand area to be recognized into a gesture recognition model to obtain a hand lifting behavior detection result.
Further, the step of obtaining image data to be detected according to the classroom video to be detected includes:
performing frame extraction processing on the classroom video to be detected according to a preset time interval to obtain candidate detection image data;
and screening out repeated data in the candidate detection image data to obtain the image data to be detected.
Further, the posture estimation model comprises a human body region detection model, a single posture estimation model and a posture redundancy suppression model which are connected in sequence;
the body region detection model comprises a YOLOv3 network; the single attitude estimation model comprises a spatial transformation network, a single attitude estimation network and an inverse spatial transformation network which are connected in sequence; the attitude redundancy suppression model includes a parameterized attitude non-maxima suppression network.
Further, the step of inputting the image data to be detected into a posture estimation model for posture estimation to obtain a corresponding posture estimation result includes:
inputting the image data to be detected into the human body region detection model for single person region detection to obtain a single person region;
inputting each single human body region into the single attitude estimation model for attitude estimation to obtain a corresponding candidate attitude estimation result;
and inputting the candidate attitude estimation result into the attitude redundancy suppression model to remove redundant attitude, and obtaining the attitude estimation result.
Further, the human body joint points include joint points corresponding to a top of the head, a neck, a left shoulder, a right shoulder, a left elbow, a right elbow, a left wrist, a right wrist, a left crotch, and a right crotch, respectively;
the step of determining the hand area to be recognized according to the posture estimation result comprises the following steps:
obtaining corresponding elbow joint point coordinates and wrist joint point coordinates according to the posture estimation result; the elbow joint point coordinates comprise left elbow joint point coordinates and right elbow joint point coordinates; the wrist joint point coordinates comprise left wrist joint point coordinates and right wrist joint point coordinates;
obtaining a corresponding hand candidate area according to the elbow joint point coordinates and the wrist joint point coordinates;
calculating the angle between the corresponding arm and the horizontal plane according to the elbow joint point coordinates and the wrist joint point coordinates of each hand candidate area;
and judging whether the angle between the arm and the horizontal plane is larger than a preset gesture angle, and if so, determining the hand candidate area as the hand area to be recognized.
Further, the step of obtaining a corresponding hand candidate region according to the elbow joint point coordinates and the wrist joint point coordinates includes:
respectively taking the coordinates of the left wrist joint point and the coordinates of the right wrist joint point as centers to obtain corresponding coordinates of a left elbow symmetric point and a right elbow symmetric point;
respectively obtaining a rectangular area taking a straight line between the left elbow joint point coordinate and the left elbow symmetric point and a straight line between the right elbow joint point coordinate and the right elbow symmetric point coordinate as diagonal lines, and taking the rectangular area as a corresponding hand candidate area.
Further, the gesture recognition model comprises a convolutional neural network model; the convolutional neural network model comprises 2 convolutional layers, 1 pooling layer, 1 convolutional layer, 1 pooling layer and 1 fully connected layer which are connected in sequence; the convolution kernel size of the convolutional layers is 3×3; the kernel size of the pooling layers is 2×2.
In a second aspect, an embodiment of the present invention provides a classroom hand-lifting behavior detection system, where the system includes:
the data acquisition module is used for acquiring the classroom video to be detected in real time and obtaining image data to be detected according to the classroom video to be detected;
the attitude estimation module is used for inputting the image data to be detected into an attitude estimation model for attitude estimation to obtain a corresponding attitude estimation result; the posture estimation result is an upper half body posture graph comprising a plurality of human body joint points;
the region screening module is used for determining a hand region to be identified according to the posture estimation result;
and the gesture recognition module is used for inputting the hand area to be recognized into a gesture recognition model to obtain a hand-lifting behavior detection result.
In a third aspect, an embodiment of the present invention further provides a computer device, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor implements the steps of the method when executing the computer program.
In a fourth aspect, the present invention further provides a computer-readable storage medium, on which a computer program is stored, where the computer program is executed by a processor to implement the steps of the above method.
The above application provides a classroom hand-lifting behavior detection method, system, computer equipment and storage medium. The method obtains image data to be detected from a classroom video to be detected collected in real time, performs posture estimation on the image data through a posture estimation model to obtain a posture estimation result in the form of an upper-body posture image comprising a plurality of human body joint points, determines the hand region to be recognized according to the elbow joint point coordinates and wrist joint point coordinates in that result, and then detects the hand region with a gesture recognition model to obtain the hand-lifting behavior detection result. Compared with the prior art, the method uses deep learning to perform effective posture estimation on human bodies in video streams of complex classroom environments, obtains hand candidate regions from the posture estimation result, screens them against a preset gesture angle to obtain the hand regions to be recognized, and performs hand-lifting versus non-hand-lifting gesture recognition on those regions. It thereby effectively solves the problem of low classroom hand-lifting detection accuracy caused in practical applications by factors such as human body occlusion, the small resolution of targets to be recognized and large brightness differences in the picture data, improves both the efficiency of real-time detection of classroom hand-lifting behavior and the accuracy of the detection results, and greatly improves the application value of the corresponding detection results.
Drawings
Fig. 1 is a schematic view of an application scenario of a classroom hand-lifting behavior detection method in an embodiment of the present invention;
fig. 2 is a schematic network structure diagram of a classroom hand-lifting behavior detection method in an embodiment of the present invention;
FIG. 3 is a schematic flow chart of a method for detecting a classroom hand-lifting behavior in an embodiment of the present invention;
FIG. 4 is a schematic structural diagram of an attitude estimation model in an embodiment of the invention;
FIG. 5 is a schematic structural diagram of a single-person attitude estimation model (stacked hourglass network) in an attitude estimation model according to an embodiment of the present invention;
FIG. 6 is a schematic diagram of a process of determining a hand region to be recognized based on pose estimation according to an embodiment of the present invention;
FIG. 7 is a schematic diagram of hand-lifting and non-hand-lifting gestures in a hand region to be recognized in an embodiment of the invention;
FIG. 8 is a schematic structural diagram of a gesture recognition model for detecting a hand lifting behavior of a hand region to be recognized according to an embodiment of the present invention;
fig. 9 is a schematic structural diagram of a classroom hand-lifting behavior detection system in an embodiment of the present invention;
fig. 10 is an internal structural view of a computer device in the embodiment of the present invention.
Detailed Description
In order to make the purpose, technical solution and advantages of the present invention clearer, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. The embodiments described below are only part of the embodiments of the present invention and are used to illustrate, not to limit, the scope of the present invention. All other embodiments obtained by a person skilled in the art from these embodiments without creative effort shall fall within the protection scope of the present invention.
The classroom hand-lifting behavior detection method provided by the invention can be applied to a terminal or a server as shown in fig. 1. The terminal can be, but is not limited to, a personal computer, notebook computer, smartphone, tablet computer or portable wearable device, and the server can be implemented as an independent server or a server cluster composed of multiple servers. Based on the classroom video to be detected collected in real time, the server can use the classroom hand-lifting behavior detection method provided by the invention, with the network architecture shown in fig. 2, to detect and identify classroom hand-lifting behavior in each video; the final hand-lifting behavior detection results are then applied to other learning tasks on the server or transmitted to the terminal for use by the terminal user.
In one embodiment, as shown in fig. 3, a classroom hand-lifting behavior detection method is provided, which includes the following steps:
s11, collecting classroom videos to be detected in real time, and obtaining image data to be detected according to the classroom videos to be detected; the classroom video to be detected is general specification monitoring video data which are collected in real time through a camera arranged in the front of a classroom and used for classroom hand-lifting behavior detection and analysis, and the corresponding cameras used for collecting video image data can be arranged according to the size of the classroom, for example, one camera is respectively arranged on the left side and the right side in the front of the classroom and used for collecting the video data. The video data acquired by the camera cannot be directly used, and corresponding frame images need to be extracted according to a time sequence for subsequent analysis, specifically, the step of obtaining the image data to be detected according to the classroom video to be detected includes:
performing frame extraction processing on the classroom video to be detected at a preset time interval to obtain candidate detection image data; the preset time interval can be determined according to the actual application requirements, for example, one frame can be extracted every 1 second from the classroom video data to be detected to obtain the candidate detection image data.
Screening out repeated data in the candidate detection image data to obtain the image data to be detected. In principle, the candidate detection image data obtained as above could be used directly for subsequent analysis, but to avoid the unnecessary computation and wasted time caused by repeated data, this embodiment preferably cleans and screens the extracted candidate detection image data, filtering out repeated and blurry frames to obtain the image data to be detected that is usable for subsequent model analysis. It should be noted that when constructing the data set for the model training required by the invention, in addition to cleaning and screening the image data, the collected large-scale picture data need to be annotated with LabelImg (a tool for labeling the position and name of objects in pictures for deep learning) in two ways: framing the raised-hand region as the hand-lifting gesture label, and marking each upper-body joint point with a 5×5 rectangular frame whose center is taken as the joint point coordinate. The labeled image data are then used as the training and test sets for the subsequent posture estimation model and gesture recognition model.
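As an illustration of the frame-extraction and duplicate-screening step above, the following is a minimal Python sketch using OpenCV; the 1-second interval matches the example above, while the down-sampled gray-difference test and its threshold are illustrative assumptions, not the patent's prescribed screening rule:

```python
import cv2
import numpy as np

def extract_frames(video_path, interval_s=1.0, diff_thresh=8.0):
    """Sample one frame per interval and drop near-duplicate frames."""
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 25.0   # fall back if FPS metadata is missing
    step = max(1, int(round(fps * interval_s)))
    kept, prev, idx = [], None, 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % step == 0:
            # coarse duplicate test: mean absolute difference of a small gray thumbnail
            gray = cv2.cvtColor(cv2.resize(frame, (64, 64)),
                                cv2.COLOR_BGR2GRAY).astype(np.float32)
            if prev is None or np.abs(gray - prev).mean() > diff_thresh:
                kept.append(frame)
                prev = gray
        idx += 1
    cap.release()
    return kept
```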
S12, inputting the image data to be detected into a posture estimation model for posture estimation to obtain a corresponding posture estimation result; the posture estimation result is an upper-body posture graph comprising a plurality of human body joint points. The posture estimation model comprises a human body region detection model, a single-person posture estimation model and a posture redundancy suppression model connected in sequence. In principle, any model capable of realizing the corresponding function can be adopted for each of the three, but to ensure the posture estimation effect this embodiment prefers the posture estimation model shown in fig. 4: the human body region detection model comprises a YOLOv3 network; the single-person posture estimation model comprises a spatial transformation network, a single-person posture estimation network and an inverse spatial transformation network connected in sequence; and the posture redundancy suppression model comprises a parameterized posture non-maximum suppression network. The YOLOv3 network identifies all single-person regions in the image data to be detected; the single-person posture estimation model, obtained by combining a spatial transformation network (STN), a single-person posture estimation network (SPPE) and an inverse spatial transformation network (SDTN), estimates the posture of each single-person region; and the parameterized posture non-maximum suppression network (PP-NMS, Parametric Pose NMS) performs redundancy suppression on the candidate results, yielding the final posture estimation result from which the hand region to be recognized is obtained.
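Before each stage is detailed, the overall top-down flow can be sketched as follows. This is a structural sketch only: all five callables (`detector`, `stn`, `sppe`, `sdtn`, `pose_nms`) are assumed stand-ins for the components described below, not names defined by the patent:

```python
def estimate_poses(image, detector, stn, sppe, sdtn, pose_nms):
    """Top-down pose estimation: detect people, estimate each pose, suppress duplicates."""
    candidates = []
    for box_crop, theta in detector(image):     # single-person regions (YOLOv3 stage)
        aligned = stn(box_crop, theta)          # high-quality region from an inexact box
        joints = sppe(aligned)                  # joint coordinates in the crop frame
        candidates.append(sdtn(joints, theta))  # map joints back to image coordinates
    return pose_nms(candidates)                 # remove redundant poses (PP-NMS stage)
```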
Specifically, the step of inputting the image data to be detected into a posture estimation model for posture estimation to obtain a corresponding posture estimation result includes:
inputting the image data to be detected into the human body region detection model for single-person region detection to obtain single-person human body regions. The human body region detection model preferably adopts the YOLOv3 model as described above, whose features are extracted by a DarkNet53 network with the fully connected layer removed (a 52-layer convolutional neural network). The corresponding model training process is as follows: the input image is divided into an S×S grid, and the grid cell in which the center of a target falls predicts that target's position and category; each grid cell predicts 3 bounding boxes, and each bounding box comprises 5 predicted values (x, y, w, h, confidence), where (x, y) are the center coordinates of the bounding box, w and h are its width and height normalized by the image size, and

$$\text{confidence} = \Pr(\text{Object}) \times \mathrm{IOU}^{\text{truth}}_{\text{pred}}$$

represents the confidence that the predicted box contains a target. To further improve the accuracy of small-target detection, the feature maps at the 3 scales predicted for each grid (13×13, 26×26 and 52×52) are fused by up-sampling and feature fusion to obtain all single-person human body regions in the image data to be detected;
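As a sketch of this detection stage, the following loads a pretrained COCO person detector and crops the single-person regions. The `torch.hub` entry point and the `results.xyxy` output format follow the ultralytics convention and are assumptions here; any detector returning (x1, y1, x2, y2, confidence, class) boxes could be substituted:

```python
import torch

# assumed hub entry; COCO class 0 is "person"
model = torch.hub.load('ultralytics/yolov3', 'yolov3', pretrained=True)

def detect_person_regions(image, conf_thresh=0.5):
    """Return one image crop per detected person."""
    results = model(image)
    regions = []
    for x1, y1, x2, y2, conf, cls in results.xyxy[0].tolist():
        if int(cls) == 0 and conf >= conf_thresh:
            regions.append(image[int(y1):int(y2), int(x1):int(x2)])
    return regions
```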
inputting each single-person human body region into the single-person posture estimation model for posture estimation to obtain a corresponding candidate posture estimation result. The single-person posture estimation model comprises, as above, a spatial transformation network, a single-person posture estimation network and an inverse spatial transformation network connected in sequence, and the corresponding model training process specifically includes:
(1) inputting each single-person human body region into the spatial transformation network (STN). The STN improves the quality of the single-person human body regions obtained from the human body region detection model, extracting a high-quality human body region frame from an inaccurate bounding box. The corresponding STN model is:

$$\begin{pmatrix} x_s \\ y_s \end{pmatrix} = \begin{bmatrix} \theta_1 & \theta_2 & \theta_3 \end{bmatrix} \begin{pmatrix} x_t \\ y_t \\ 1 \end{pmatrix}$$

where $\theta_1$, $\theta_2$ and $\theta_3$ are each two-dimensional spatial vectors, $(x_s, y_s)$ are the coordinates before the image transformation, and $(x_t, y_t)$ are the coordinates after the spatial transformation;
(2) inputting the extracted high-quality human body region frames into the single-person posture estimation network for single-person posture estimation. The single-person posture estimation network (SPPE) adopts the Stacked Hourglass algorithm. Considering that the best recognition accuracy for different human body joint points is not obtained on the same feature map (for example, arms are easily recognized on the layer-2 feature map, while the head is easily recognized on the layer-6 feature map), posture estimation is treated as a task that recognizes postures with multi-scale features, so a network structure that uses several feature maps simultaneously is designed: each order of the hourglass network is formed by connecting a pooling layer, three residual modules and an up-sampling layer in series, followed by a residual module, nested recursively. As shown in fig. 5, this embodiment preferably adopts a 4th-order hourglass network as the single-person posture estimation network, where C1, C2, C3, C4, C5, C6 and C7 all denote convolutional layers. C4 passes through one residual process to obtain C4a with the same feature map size; C7 is up-sampled to the same feature map size as C4a, and the two feature maps are superimposed and convolved to obtain C4b. Subsequently, C4b is up-sampled, superimposed with C3a (obtained from C3 through residual processing, with the same feature map size) and convolved to obtain C3b, and so on, until the feature map C1b, which retains the information of all layers and has the same size as the input image, is obtained. A 1×1 convolution then generates heatmaps representing joint point probabilities, from which the corresponding joint point coordinates are obtained; that is, the predicted joint point positions of the high-quality human body regions are extracted by the stacked hourglass network;
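The recursive pool-residual-upsample structure described above can be sketched in PyTorch as follows. This is a simplified sketch (one residual module per stage rather than the three in series named above; the channel widths and the 256-channel, 64×64 example are assumptions), intended only to show the order-n recursion and the skip connection at each scale:

```python
import torch
import torch.nn as nn

class Residual(nn.Module):
    """Bottleneck residual module used inside the hourglass."""
    def __init__(self, ch):
        super().__init__()
        self.body = nn.Sequential(
            nn.BatchNorm2d(ch), nn.ReLU(inplace=True), nn.Conv2d(ch, ch // 2, 1),
            nn.BatchNorm2d(ch // 2), nn.ReLU(inplace=True),
            nn.Conv2d(ch // 2, ch // 2, 3, padding=1),
            nn.BatchNorm2d(ch // 2), nn.ReLU(inplace=True), nn.Conv2d(ch // 2, ch, 1),
        )

    def forward(self, x):
        return x + self.body(x)

class Hourglass(nn.Module):
    """Order-n hourglass: pool -> residual -> recurse -> residual -> upsample, plus a skip."""
    def __init__(self, order, ch):
        super().__init__()
        self.skip = Residual(ch)
        self.down = nn.Sequential(nn.MaxPool2d(2), Residual(ch))
        self.inner = Hourglass(order - 1, ch) if order > 1 else Residual(ch)
        self.up = nn.Sequential(Residual(ch), nn.Upsample(scale_factor=2))

    def forward(self, x):
        return self.skip(x) + self.up(self.inner(self.down(x)))

# 4th-order hourglass followed by a 1x1 convolution producing 10 joint heatmaps
hg = Hourglass(order=4, ch=256)
to_heatmaps = nn.Conv2d(256, 10, kernel_size=1)
heatmaps = to_heatmaps(hg(torch.randn(1, 256, 64, 64)))  # (1, 10, 64, 64)
```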
(3) mapping the joint point coordinates predicted by the single-person posture estimation network back to the original image coordinate system to obtain the human posture estimation of the input picture. The network used is the inverse spatial transformation network (SDTN), whose structure is the reverse of the STN; the corresponding model is:

$$\begin{pmatrix} x_t \\ y_t \end{pmatrix} = \begin{bmatrix} \gamma_1 & \gamma_2 & \gamma_3 \end{bmatrix} \begin{pmatrix} x_s \\ y_s \\ 1 \end{pmatrix}$$

where $[\gamma_1\ \gamma_2] = [\theta_1\ \theta_2]^{-1}$ and $\gamma_3 = -1 \times [\gamma_1\ \gamma_2]\,\theta_3$. In this way the high-quality human body region extracted by the STN is mapped back to the original input, giving the corresponding posture estimation result;
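The STN warp and the SDTN parameter inversion can be illustrated with PyTorch's built-in affine grid sampling (`F.affine_grid` and `F.grid_sample`); the function names and tensor shapes here are illustrative. The inversion implements exactly the γ relations above: the linear part is inverted and the translation negated through it:

```python
import torch
import torch.nn.functional as F

def stn_warp(feature, theta, out_hw):
    """Apply a batch of 2x3 affine parameters theta (B, 2, 3) to crop/warp `feature`."""
    grid = F.affine_grid(theta, [feature.size(0), feature.size(1), *out_hw],
                         align_corners=False)
    return F.grid_sample(feature, grid, align_corners=False)

def sdtn_theta(theta):
    """[gamma1 gamma2] = [theta1 theta2]^-1 and gamma3 = -[gamma1 gamma2] * theta3."""
    lin = theta[:, :, :2]                       # (B, 2, 2) linear part [theta1 theta2]
    trans = theta[:, :, 2:]                     # (B, 2, 1) translation theta3
    lin_inv = torch.inverse(lin)
    gamma3 = -torch.bmm(lin_inv, trans)
    return torch.cat([lin_inv, gamma3], dim=2)  # (B, 2, 3) inverse transform
```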
In principle, the candidate posture estimation results obtained through the above steps can be used directly to obtain the subsequent hand region to be recognized, but in order to improve the recognition efficiency of the subsequent gesture recognition model, redundant postures are preferably removed from the candidate results first.
And inputting the candidate posture estimation results into the posture redundancy suppression model to remove redundant postures, obtaining the posture estimation result. The posture redundancy suppression model preferably adopts the parameterized posture non-maximum suppression (PP-NMS: Parametric Pose NMS) algorithm as described above. Suppose a candidate posture estimation result $P_i$ of a person has $m$ joints, written

$$P_i = \{\langle k_i^1, c_i^1 \rangle, \ldots, \langle k_i^m, c_i^m \rangle\}$$

where $k_i^j$ and $c_i^j$ respectively represent the coordinate position and the confidence score of the $j$-th part. When a plurality of candidate posture estimation results exist, NMS is needed to remove the redundant candidates: the posture with the maximum confidence is selected as the reference, region frames close to the reference are suppressed according to the redundancy elimination criterion, and this is repeated until every posture frame is unique. The corresponding redundancy elimination criterion is a posture distance metric measuring the similarity between postures:

$$f(P_i, P_j \mid \Lambda, \eta) = \mathbb{1}\left[ d(P_i, P_j \mid \Lambda, \lambda) \le \eta \right]$$

If the posture distance metric $d(\cdot)$ does not exceed the threshold $\eta$, the output of $f(\cdot)$ is 1, representing that the posture $P_i$ should be suppressed, i.e., $P_i$ is redundant with respect to the reference posture $P_j$. In the formula,

$$d(P_i, P_j \mid \Lambda) = K_{sim}(P_i, P_j \mid \sigma_1) + \lambda H_{sim}(P_i, P_j \mid \sigma_2)$$

$$K_{sim}(P_i, P_j \mid \sigma_1) = \begin{cases} \displaystyle\sum_n \tanh\frac{c_i^n}{\sigma_1} \cdot \tanh\frac{c_j^n}{\sigma_1}, & \text{if } k_j^n \text{ is within } \mathcal{B}(k_i^n) \\ 0, & \text{otherwise} \end{cases}$$

$$H_{sim}(P_i, P_j \mid \sigma_2) = \sum_n \exp\left[ -\frac{(k_i^n - k_j^n)^2}{\sigma_2} \right]$$

where $\mathcal{B}(k_i^n)$ denotes the box region centered at joint point $k_i^n$; $K_{sim}(P_i, P_j \mid \sigma_1)$ soft-counts the joints at matching positions between the two postures $P_i$ and $P_j$, each term approaching 1 when both joints have high confidence; $H_{sim}(P_i, P_j \mid \sigma_2)$ measures the spatial distance between the two postures $P_i$ and $P_j$; $\lambda$ represents a weight coefficient; and $\Lambda = \{\sigma_1, \sigma_2, \lambda\}$.
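A small NumPy sketch of this criterion follows; it implements $d$, $K_{sim}$, $H_{sim}$ and the indicator $f$ as stated above, but the parameter values ($\sigma_1$, $\sigma_2$, $\lambda$, the box size of $\mathcal{B}$, and the threshold $\eta$) are illustrative assumptions rather than the patent's tuned values:

```python
import numpy as np

def pose_distance(k_i, c_i, k_j, c_j, sigma1=0.3, sigma2=1.0, lam=1.0, box=10.0):
    """d(Pi, Pj | Lambda) = Ksim + lambda * Hsim.
    k: (m, 2) joint coordinates; c: (m,) joint confidence scores."""
    diff = k_i - k_j
    within = np.all(np.abs(diff) <= box, axis=1)   # is k_j^n inside box B(k_i^n)?
    ksim = np.where(within, np.tanh(c_i / sigma1) * np.tanh(c_j / sigma1), 0.0).sum()
    hsim = np.exp(-(diff ** 2).sum(axis=1) / sigma2).sum()
    return ksim + lam * hsim

def is_redundant(k_i, c_i, k_j, c_j, eta=2.0, **params):
    """f(Pi, Pj | Lambda, eta) = 1[d(Pi, Pj | Lambda, lambda) <= eta]."""
    return pose_distance(k_i, c_i, k_j, c_j, **params) <= eta
```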
In this embodiment, after the YOLOv3 model identifies the single-person human body regions in the image data to be detected, the spatial transformation network (STN), single-person posture estimation network (SPPE) and inverse spatial transformation network (SDTN) are adopted in turn to estimate the posture of each single-person region, and the parameterized posture non-maximum suppression (PP-NMS: Parametric Pose NMS) network suppresses redundant postures in the obtained results. This ensures the efficiency and accuracy of the posture estimation result and provides a guarantee for efficiently determining the hand region to be recognized in the subsequent steps.
S13, determining a hand area to be recognized according to the posture estimation result;
The posture estimation result is, as described above, an upper-body posture comprising a plurality of human body joint points; the corresponding human body joint points comprise 10 joint points respectively corresponding to the top of the head, the neck, the left shoulder, the right shoulder, the left elbow, the right elbow, the left wrist, the right wrist, the left crotch and the right crotch. In this embodiment, the hand region to be recognized is determined mainly from the left elbow, right elbow, left wrist and right wrist joint points in the posture estimation result, following the steps shown in fig. 6. Specifically, the step of determining the hand region to be recognized according to the posture estimation result includes:
obtaining the corresponding elbow joint point coordinates and wrist joint point coordinates according to the posture estimation result; the elbow joint point coordinates comprise left elbow joint point coordinates and right elbow joint point coordinates, and the wrist joint point coordinates comprise left wrist joint point coordinates and right wrist joint point coordinates; the elbow and wrist coordinates are used in left/right pairs (left elbow with left wrist, right elbow with right wrist);
obtaining a corresponding hand candidate area according to the elbow joint point coordinates and the wrist joint point coordinates; wherein, the step of obtaining the corresponding hand candidate area according to the elbow joint point coordinates and the wrist joint point coordinates comprises the following steps:
respectively taking the coordinates of the left wrist joint point and the coordinates of the right wrist joint point as centers to obtain corresponding coordinates of a left elbow symmetric point and a right elbow symmetric point;
respectively obtaining rectangular areas with the straight line between the left elbow joint point coordinate and the left elbow symmetric point and the straight line between the right elbow joint point coordinate and the right elbow symmetric point coordinate as diagonal lines, and taking the rectangular areas as corresponding hand candidate areas;
The hand candidate areas determined in this way contain information on the elbow, wrist, palm and so on, but this step merely finds all hand areas in the original image: the candidate areas include both raised-hand and non-raised-hand regions. The valid hand areas are therefore further screened with the following method, which exploits the characteristics of a normal hand-raising posture, avoiding unnecessary computational cost and time in the subsequent steps and ensuring gesture recognition efficiency:
calculating the angle between the corresponding arm and the horizontal plane according to the elbow joint point coordinates and the wrist joint point coordinates of each hand candidate area;
and judging whether the angle between the arm and the horizontal plane is larger than a preset gesture angle, and if so, determining the hand candidate area as the hand area to be recognized. The preset gesture angle can be understood as the angle between the arm and the horizontal plane when a hand is raised; data statistics show that the angle between the arm and the horizontal plane for an actual hand-raising gesture is generally greater than or equal to 35 degrees.
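The symmetric-point construction and the angle test can be sketched as follows; the point coordinates are assumed to be in image pixels with the y-axis pointing down, and the helper names are illustrative:

```python
import math

def hand_candidate_box(elbow, wrist):
    """Reflect the elbow through the wrist; the segment from the elbow to the
    reflected point is the diagonal of the hand candidate rectangle."""
    ex, ey = elbow
    wx, wy = wrist
    sx, sy = 2 * wx - ex, 2 * wy - ey   # elbow's symmetric point about the wrist
    return (min(ex, sx), min(ey, sy), max(ex, sx), max(ey, sy))

def arm_angle_deg(elbow, wrist):
    """Signed angle between the forearm (elbow -> wrist) and the horizontal plane;
    positive when the wrist is above the elbow (image y-axis points down)."""
    dx = wrist[0] - elbow[0]
    dy = elbow[1] - wrist[1]
    return math.degrees(math.atan2(dy, abs(dx)))

def is_raised(elbow, wrist, min_angle=35.0):
    return arm_angle_deg(elbow, wrist) >= min_angle
```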
And S14, inputting the hand area to be recognized into a gesture recognition model to obtain a hand-lifting behavior detection result. In order to accurately recognize the hand-lifting posture in the hand area to be recognized, this embodiment adopts a gesture recognition model based on a convolutional neural network, trained with hand-lifting postures as positive samples and non-hand-lifting postures such as resting the chin on a hand, touching the face and touching the head as negative samples. As shown in fig. 8, the convolutional neural network model preferably comprises 2 convolutional layers, 1 pooling layer, 1 convolutional layer, 1 pooling layer and 1 fully connected layer connected in sequence; the convolution kernel size of the convolutional layers is 3×3 and the kernel size of the pooling layers is 2×2. The gesture recognition model adopted in this embodiment is a concise binary classification network; applied to classroom hand-raising detection, where student gestures are relatively uniform, it further improves classification efficiency while maintaining classification accuracy.
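A PyTorch sketch of the described conv-conv-pool-conv-pool-FC classifier follows; the channel widths and the 64×64 input resolution are assumptions (the patent fixes only the layer order and the 3×3/2×2 kernel sizes):

```python
import torch
import torch.nn as nn

class HandGestureNet(nn.Module):
    """Binary raised-hand classifier: 2 conv, pool, conv, pool, fully connected."""
    def __init__(self, in_ch=3, num_classes=2):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(in_ch, 16, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(inplace=True),
            nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(inplace=True),
            nn.MaxPool2d(2),
        )
        self.classifier = nn.Linear(64 * 16 * 16, num_classes)  # assumes 64x64 input

    def forward(self, x):
        x = self.features(x)
        return self.classifier(torch.flatten(x, 1))

logits = HandGestureNet()(torch.randn(1, 3, 64, 64))  # (1, 2): raised vs. not raised
```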
In this embodiment of the application, frame extraction is performed on a classroom video to be detected collected in real time to obtain the image data to be detected. The human body region detection model (YOLOv3 network) then identifies the single-person human body regions; the single-person posture estimation model (STN, SPPE and SDTN networks) estimates the posture of each single-person region; and the posture redundancy suppression model (PP-NMS, Parametric Pose NMS network) suppresses redundancy in the posture estimation results, giving a posture estimation result in the form of an upper-body posture image comprising a plurality of human body joint points. A hand candidate region is determined from the elbow joint point coordinates and wrist joint point coordinates in that result, the hand region to be recognized is screened out using the preset gesture angle, and a gesture recognition model comprising a convolutional neural network then performs detection and recognition to obtain the corresponding hand-lifting behavior detection result. By combining posture estimation and gesture recognition through deep learning, the method effectively solves the problem of low classroom hand-lifting detection accuracy caused in practical applications by factors such as human body occlusion, the small resolution of targets to be recognized and large brightness differences in the picture data; it improves the accuracy of real-time classroom hand-lifting behavior detection while better guaranteeing the efficiency of obtaining the detection results, further improving the application value of the hand-lifting behavior detection results to a large extent.
It should be noted that although the steps in the above flowchart are displayed in the sequence indicated by the arrows, they are not necessarily executed in that sequence; unless explicitly stated otherwise, the steps may be performed in other orders.
In one embodiment, as shown in fig. 9, there is provided a classroom hand-holding behavior detection system, the system comprising:
the data acquisition module 1 is used for acquiring a classroom video to be detected in real time and obtaining image data to be detected according to the classroom video to be detected;
the attitude estimation module 2 is used for inputting the image data to be detected into an attitude estimation model for attitude estimation to obtain a corresponding attitude estimation result; the posture estimation result is an upper half body posture graph comprising a plurality of human body joint points;
the region screening module 3 is used for determining a hand region to be identified according to the posture estimation result;
and the gesture recognition module 4 is used for inputting the hand area to be recognized into a gesture recognition model to obtain a hand-lifting behavior detection result.
For specific limitations of the classroom hand-lifting behavior detection system, reference may be made to the above limitations of the classroom hand-lifting behavior detection method, which are not repeated here. All or part of the modules in the classroom hand-lifting behavior detection system can be realized by software, hardware or a combination thereof. The modules can be embedded in hardware form in, or independent of, a processor in the computer device, or stored in software form in a memory in the computer device, so that the processor can call and execute the operations corresponding to the modules.
Fig. 10 shows an internal structure diagram of a computer device in one embodiment, and the computer device may be specifically a terminal or a server. As shown in fig. 10, the computer apparatus includes a processor, a memory, a network interface, a display, and an input device, which are connected through a system bus. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device comprises a nonvolatile storage medium and an internal memory. The non-volatile storage medium stores an operating system and a computer program. The internal memory provides an environment for the operation of an operating system and computer programs in the non-volatile storage medium. The network interface of the computer device is used for communicating with an external terminal through a network connection. The computer program is executed by a processor to implement a classroom hand-lifting behavior detection method. The display screen of the computer equipment can be a liquid crystal display screen or an electronic ink display screen, and the input device of the computer equipment can be a touch layer covered on the display screen, a key, a track ball or a touch pad arranged on the shell of the computer equipment, an external keyboard, a touch pad or a mouse and the like.
It will be appreciated by those of ordinary skill in the art that the architecture shown in FIG. 10 is merely a block diagram of some of the structures associated with the present solution and is not intended to limit the computing devices to which the present solution may be applied, and that a particular computing device may include more or less components than those shown, or may combine certain components, or have a similar arrangement of components.
In one embodiment, a computer device is provided, comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the steps of the above method being performed when the computer program is executed by the processor.
In an embodiment, a computer-readable storage medium is provided, on which a computer program is stored, which computer program, when being executed by a processor, carries out the steps of the above-mentioned method.
To sum up, the classroom hand-lifting behavior detection method provided by the embodiments of the present invention adopts deep learning: it performs upper-body posture estimation on each human body region through a posture estimation model integrating a human body region detection model (YOLOv3 network), a single-person posture estimation model (STN, SPPE and SDTN networks) and a posture redundancy suppression model (PP-NMS, Parametric Pose NMS network); determines the valid hand region to be recognized based on the upper-body posture estimation result; and performs hand-lifting behavior detection on that hand region with a gesture recognition model. This effectively solves the problem of low classroom hand-lifting detection accuracy caused in practical applications by factors such as human body occlusion, the small resolution of objects to be recognized and large brightness differences in the picture data, improves the accuracy of real-time classroom hand-lifting behavior detection while better guaranteeing the efficiency of obtaining the detection results, and further improves the application value of the hand-lifting behavior detection results to a large extent.
The embodiments in this specification are described in a progressive manner, and all the same or similar parts of the embodiments are directly referred to each other, and each embodiment is described with emphasis on differences from other embodiments. In particular, for the system embodiment, since it is substantially similar to the method embodiment, the description is simple, and for the relevant points, reference may be made to the partial description of the method embodiment. It should be noted that, the technical features of the embodiments may be arbitrarily combined, and for the sake of brevity, all possible combinations of the technical features in the embodiments are not described, but should be considered as the scope of the present specification as long as there is no contradiction between the combinations of the technical features.
The above-mentioned embodiments only express some preferred embodiments of the present application, and the description thereof is more specific and detailed, but not construed as limiting the scope of the invention. It should be noted that, for those skilled in the art, various modifications and substitutions can be made without departing from the technical principle of the present invention, and these should be construed as the protection scope of the present application. Therefore, the protection scope of the present patent shall be subject to the protection scope of the claims.

Claims (10)

1. A classroom hand-raising behavior detection method is characterized by comprising the following steps:
collecting classroom videos to be detected in real time, and obtaining image data to be detected according to the classroom videos to be detected;
inputting the image data to be detected into an attitude estimation model for attitude estimation to obtain a corresponding attitude estimation result; the posture estimation result is an upper half body posture graph comprising a plurality of human body joint points;
determining a hand region to be recognized according to the attitude estimation result;
and inputting the hand area to be recognized into a gesture recognition model to obtain a hand lifting behavior detection result.
2. The method for detecting classroom hand-lifting behavior according to claim 1, wherein the step of obtaining image data to be detected based on the classroom video to be detected comprises:
performing frame extraction processing on the classroom video to be detected according to a preset time interval to obtain candidate detection image data;
and screening out repeated data in the candidate detection image data to obtain the image data to be detected.
3. The classroom hand-lifting behavior detection method of claim 1, wherein the pose estimation model comprises a body region detection model, a single pose estimation model, and a pose redundancy suppression model connected in sequence;
the body region detection model comprises a YOLOv3 network; the single attitude estimation model comprises a spatial transformation network, a single attitude estimation network and an inverse spatial transformation network which are connected in sequence; the attitude redundancy suppression model includes a parameterized attitude non-maxima suppression network.
4. The classroom hand-lifting behavior detection method of claim 3, wherein the step of inputting the image data to be detected into a pose estimation model for pose estimation to obtain a corresponding pose estimation result comprises:
inputting the image data to be detected into the human body region detection model for single person region detection to obtain a single person region;
inputting each single human body region into the single attitude estimation model for attitude estimation to obtain a corresponding candidate attitude estimation result;
and inputting the candidate attitude estimation result into the attitude redundancy suppression model to remove redundant attitude, and obtaining the attitude estimation result.
5. The classroom hand-lifting behavior detection method of claim 1, wherein the body joint points comprise joint points corresponding to a top of the head, a neck, a left shoulder, a right shoulder, a left elbow, a right elbow, a left wrist, a right wrist, a left crotch, and a right crotch, respectively;
the step of determining the hand area to be recognized according to the posture estimation result comprises the following steps:
obtaining corresponding elbow joint point coordinates and wrist joint point coordinates according to the posture estimation result; the elbow joint point coordinates comprise left elbow joint point coordinates and right elbow joint point coordinates; the wrist joint point coordinates comprise left wrist joint point coordinates and right wrist joint point coordinates;
obtaining a corresponding hand candidate area according to the elbow joint point coordinates and the wrist joint point coordinates;
calculating the angle between the corresponding arm and the horizontal plane according to the elbow joint point coordinates and the wrist joint point coordinates of each hand candidate area;
and judging whether the angle between the arm and the horizontal plane is larger than a preset gesture angle, and if so, determining the hand candidate area as the hand area to be recognized.
6. The classroom hand-lifting behavior detection method as claimed in claim 5, wherein said step of obtaining a corresponding hand candidate region based on the elbow joint point coordinates and wrist joint point coordinates comprises:
respectively taking the coordinates of the left wrist joint point and the coordinates of the right wrist joint point as centers to obtain corresponding coordinates of a left elbow symmetric point and a right elbow symmetric point;
respectively obtaining a rectangular area taking a straight line between the left elbow joint point coordinate and the left elbow symmetric point and a straight line between the right elbow joint point coordinate and the right elbow symmetric point coordinate as diagonal lines, and taking the rectangular area as a corresponding hand candidate area.
7. The classroom hand-lifting behavior detection method of claim 1, wherein the gesture recognition model comprises a convolutional neural network model; the convolutional neural network model comprises 2 convolutional layers, 1 pooling layer, 1 convolutional layer, 1 pooling layer and 1 fully connected layer which are connected in sequence; the convolution kernel size of the convolutional layers is 3×3; the kernel size of the pooling layers is 2×2.
8. A classroom hand-lifting behavior detection system, the system comprising:
the data acquisition module is used for acquiring the classroom video to be detected in real time and obtaining image data to be detected according to the classroom video to be detected;
the attitude estimation module is used for inputting the image data to be detected into an attitude estimation model for attitude estimation to obtain a corresponding attitude estimation result; the posture estimation result is an upper half body posture graph comprising a plurality of human body joint points;
the region screening module is used for determining a hand region to be identified according to the posture estimation result;
and the gesture recognition module is used for inputting the hand area to be recognized into a gesture recognition model to obtain a hand-lifting behavior detection result.
9. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the steps of the method of any of claims 1 to 7 are implemented when the computer program is executed by the processor.
10. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the method of any one of claims 1 to 7.
CN202111606911.5A 2021-12-23 2021-12-23 Classroom hand-raising behavior detection method, system, computer equipment and storage medium Pending CN114332927A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111606911.5A CN114332927A (en) 2021-12-23 2021-12-23 Classroom hand-raising behavior detection method, system, computer equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111606911.5A CN114332927A (en) 2021-12-23 2021-12-23 Classroom hand-raising behavior detection method, system, computer equipment and storage medium

Publications (1)

Publication Number Publication Date
CN114332927A 2022-04-12

Family

ID=81012543

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111606911.5A Pending CN114332927A (en) 2021-12-23 2021-12-23 Classroom hand-raising behavior detection method, system, computer equipment and storage medium

Country Status (1)

Country Link
CN (1) CN114332927A (en)


Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115240402A (en) * 2022-07-13 2022-10-25 北京拙河科技有限公司 Sightseeing vehicle scheduling method and system
CN115810163A (en) * 2022-11-17 2023-03-17 云启智慧科技有限公司 Teaching assessment method and system based on AI classroom behavior recognition
CN115810163B (en) * 2022-11-17 2023-09-05 云启智慧科技有限公司 Teaching evaluation method and system based on AI classroom behavior recognition

Similar Documents

Publication Publication Date Title
US11734851B2 (en) Face key point detection method and apparatus, storage medium, and electronic device
Li et al. A novel vision-based real-time method for evaluating postural risk factors associated with musculoskeletal disorders
CN110532984B (en) Key point detection method, gesture recognition method, device and system
WO2021103648A1 (en) Hand key point detection method, gesture recognition method, and related devices
CN108171133B (en) Dynamic gesture recognition method based on characteristic covariance matrix
CN106874826A (en) Face key point-tracking method and device
US20220262093A1 (en) Object detection method and system, and non-transitory computer-readable medium
CN114402369A (en) Human body posture recognition method and device, storage medium and electronic equipment
CN114332927A (en) Classroom hand-raising behavior detection method, system, computer equipment and storage medium
CN114332911A (en) Head posture detection method and device and computer equipment
CN112199994A (en) Method and device for detecting interaction between 3D hand and unknown object in RGB video in real time
CN117523659A (en) Skeleton-based multi-feature multi-stream real-time action recognition method, device and medium
Ling et al. Research on gesture recognition based on YOLOv5
CN113298005A (en) Visual perception-based student criminal behavior real-time monitoring method, device and equipment
CN116757524B (en) Teacher teaching quality evaluation method and device
Hasan et al. Gesture feature extraction for static gesture recognition
CN108108648A (en) A kind of new gesture recognition system device and method
Kourbane et al. Skeleton-aware multi-scale heatmap regression for 2D hand pose estimation
Soroni et al. Hand Gesture Based Virtual Blackboard Using Webcam
CN115019396A (en) Learning state monitoring method, device, equipment and medium
CN114663917A (en) Multi-view-angle-based multi-person three-dimensional human body pose estimation method and device
Lin Visual hand tracking and gesture analysis
JP2022531029A (en) Image recognition method, device and storage medium
CN111061367B (en) Method for realizing gesture mouse of self-service equipment
CN117831082B (en) Palm area detection method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination