CN111160288A - Gesture key point detection method and device, computer equipment and storage medium - Google Patents
Gesture key point detection method and device, computer equipment and storage medium
- Publication number
- CN111160288A (application number CN201911413461.0A)
- Authority: CN (China)
- Prior art keywords: key point, gesture, trained, regression model, model
- Prior art date: 2019-12-31
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G—PHYSICS
  - G06—COMPUTING; CALCULATING OR COUNTING
    - G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
      - G06V40/00—Recognition of biometric, human-related or animal-related patterns in image or video data
        - G06V40/20—Movements or behaviour, e.g. gesture recognition
          - G06V40/28—Recognition of hand or arm movements, e.g. recognition of deaf sign language
        - G06V40/10—Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
          - G06V40/107—Static hand or arm
    - G06F—ELECTRIC DIGITAL DATA PROCESSING
      - G06F18/00—Pattern recognition
        - G06F18/20—Analysing
          - G06F18/24—Classification techniques
            - G06F18/241—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    - G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
      - G06N3/00—Computing arrangements based on biological models
        - G06N3/02—Neural networks
          - G06N3/04—Architecture, e.g. interconnection topology
            - G06N3/045—Combinations of networks
          - G06N3/08—Learning methods
Abstract
The application relates to a gesture key point detection method and device, computer equipment and a storage medium. The method includes: acquiring an image to be detected, the image containing gesture features; extracting the gesture features from the image through a trained gesture detection model and outputting the region where the gesture features are located to obtain at least one target region; and extracting features of the key points in each target region through a trained key point regression model and outputting the position information of the key points of each target region, the trained gesture detection model and the trained key point regression model being obtained by independent training. The gesture detection model first locates the region where the gesture lies; the located region is then fed into the key point regression model, which directly regresses the key point positions for each target region. This reduces model complexity and thereby improves detection efficiency.
Description
Technical Field
The present application relates to the field of computer technologies, and in particular, to a gesture key point detection method and apparatus, a computer device, and a storage medium.
Background
Gesture key point recognition has become an important research direction. Traditional gesture detection only detects the position and type of a gesture, whereas key point recognition can identify the joint points of the hand and delineate the gesture region more finely, enabling more applications. Because gestures are highly varied and often heavily occluded, traditional methods struggle to recognize gesture key points and suffer from low precision and low speed.
Disclosure of Invention
In order to solve the technical problem, the application provides a gesture key point detection method, a gesture key point detection device, a computer device and a storage medium.
In a first aspect, the present application provides a gesture key point detection method, including:
acquiring an image to be detected, wherein the image to be detected contains gesture features;
extracting gesture features in the image to be detected through a trained gesture detection model, and outputting a region where the gesture features are located to obtain at least one target region;
and extracting features of the key points in each target region through the trained key point regression model, and outputting the position information of the key points of each target region, wherein the trained gesture detection model and the trained key point regression model are obtained by independent training.
In a second aspect, the present application provides a gesture key point detection device, including:
the data acquisition module is used for acquiring an image to be detected, and the image to be detected contains gesture features;
the area detection module is used for extracting gesture features in the image to be detected through the trained gesture detection model and outputting the area where the gesture features are located to obtain at least one target area;
and the key point detection module is used for extracting features of the key points in each target region through the trained key point regression model and outputting the position information of the key points of each target region, and the trained gesture detection model and the trained key point regression model are obtained through independent training.
A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the following steps when executing the computer program:
acquiring an image to be detected, wherein the image to be detected contains gesture features;
extracting gesture features in the image to be detected through a trained gesture detection model, and outputting a region where the gesture features are located to obtain at least one target region;
and extracting features of the key points in each target region through the trained key point regression model, and outputting the position information of the key points of each target region, wherein the trained gesture detection model and the trained key point regression model are obtained by independent training.
A computer-readable storage medium, on which a computer program is stored which, when executed by a processor, carries out the steps of:
acquiring an image to be detected, wherein the image to be detected contains gesture features;
extracting gesture features in the image to be detected through a trained gesture detection model, and outputting a region where the gesture features are located to obtain at least one target region;
and extracting features of the key points in each target region through the trained key point regression model, and outputting the position information of the key points of each target region, wherein the trained gesture detection model and the trained key point regression model are obtained by independent training.
According to the gesture key point detection method and device, the computer equipment and the storage medium, an image to be detected containing gesture features is acquired; the gesture features are extracted from the image through a trained gesture detection model, and the region where the gesture features are located is output to obtain at least one target region; features of the key points in each target region are then extracted through a trained key point regression model, and the position information of the key points of each target region is output, the two models being obtained by independent training. The gesture detection model first locates the region where the gesture lies; the located region is fed into the key point regression model, which directly regresses the key point positions for each target region, reducing model complexity and thereby improving detection efficiency.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the invention and together with the description, serve to explain the principles of the invention.
FIG. 1 is a diagram of an application environment of a gesture key point detection method in an embodiment;
FIG. 2 is a flowchart illustrating a method for detecting a gesture key point according to an embodiment;
FIG. 3 is a diagram of a gesture detection model in one embodiment;
FIG. 4 is a diagram illustrating detection results of gesture keypoints in an embodiment;
FIG. 5 is a diagram illustrating detection results of gesture keypoints in an embodiment;
FIG. 6 is a block diagram of a gesture key point detection apparatus in an embodiment;
FIG. 7 is a diagram illustrating an internal structure of a computer device according to an embodiment.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present application clearer, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application.
FIG. 1 is a diagram of an application environment of a gesture key point detection method in an embodiment. Referring to FIG. 1, the gesture key point detection method is applied to a gesture key point detection system. The gesture key point detection system includes a terminal 110 and a server 120. The terminal 110 or the server 120 acquires an image to be detected containing gesture features, extracts the gesture features from the image through a trained gesture detection model and outputs the region where the gesture features are located to obtain at least one target region, and then extracts features of the key points in each target region through a trained key point regression model and outputs the position information of the key points of each target region, the trained gesture detection model and the trained key point regression model being obtained by independent training.
The terminal 110 and the server 120 are connected through a network. The terminal 110 may specifically be a desktop terminal or a mobile terminal, and the mobile terminal may specifically be at least one of a mobile phone, a tablet computer, a notebook computer, and the like. The server 120 may be implemented as a stand-alone server or a server cluster composed of a plurality of servers.
As shown in FIG. 2, in one embodiment, a method of gesture keypoint detection is provided. The embodiment is mainly illustrated by applying the method to the terminal 110 (or the server 120) in fig. 1. Referring to fig. 2, the gesture key point detection method specifically includes the following steps:
step S201, an image to be detected is acquired.
In this embodiment, the image to be detected includes a gesture feature.
In particular, the gesture features refer to features for describing gestures, and different gestures are described by using different gesture features. The image to be detected refers to an image which is acquired by image acquisition equipment and contains gesture features, and the image to be detected can contain one or more gesture features corresponding to gestures.
Step S202, extracting gesture features in the image to be detected through the trained gesture detection model, and outputting the area where the gesture features are located to obtain at least one target area.
Specifically, the trained gesture detection model is a detection model obtained by training on a large number of images carrying gesture feature labels. The gesture detection model may adopt a common deep learning detection model, such as at least one of MobileNet, SSD (Single Shot MultiBox Detector), Faster R-CNN, YOLO, and the like. Each network layer of the deep learning detection model performs feature extraction on the image to be detected, and the extracted features are matched against the features stored in the trained gesture detection model to obtain the region where the features with the highest matching degree are located, that is, the target region. If the image to be detected contains multiple gestures, the target region corresponding to each gesture is output, yielding multiple target regions.
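As an illustrative aside, this detection stage can be sketched with torchvision's SSDLite detector, whose MobileNetV3 backbone plays the role of the MobileNet feature extractor described here. This is a hedged stand-in, not the patented model: the pretrained weights, class set, and score threshold below are assumptions, and a real system would fine-tune the detector on gesture data.

```python
import torch
from torchvision.models.detection import ssdlite320_mobilenet_v3_large

# SSDLite with a MobileNetV3 backbone, standing in for the MobileNet+SSD
# gesture detector described above (weights="DEFAULT" needs torchvision >= 0.13).
detector = ssdlite320_mobilenet_v3_large(weights="DEFAULT").eval()

def target_regions(image: torch.Tensor, score_thresh: float = 0.5) -> torch.Tensor:
    """image: (3, H, W) float tensor in [0, 1]; returns one box per confident detection."""
    with torch.no_grad():
        out = detector([image])[0]            # dict with "boxes", "labels", "scores"
    keep = out["scores"] >= score_thresh      # keep only the best-matching regions
    return out["boxes"][keep]                 # at least one target region per gesture
```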
Step S203, extracting features of the key points in each target region through the trained key point regression model, and outputting the position information of the key points of each target region.
In this embodiment, the trained gesture detection model and the trained keypoint regression model are trained separately.
Specifically, the trained key point regression model is a deep learning model obtained by training on a large number of images carrying key point position information. It comprises a feature extraction unit and a regression unit. The feature extraction unit extracts features from the target region and may use one or more network types among convolutional layers, pooling layers and fully connected layers. A convolutional layer is a network layer that performs convolution operations using convolution kernels, with different kernels extracting different features. A pooling layer is a network layer that performs a pooling operation using a pooling algorithm. The regression unit comprises a regression layer containing at least one fully connected layer; performing the regression operation with the regression layer yields the corresponding regression result, that is, the position information of each key point. The feature extraction unit produces a feature map for each target region, and the regression layer performs the regression operation on each feature map to obtain the key point position information of each target region. The trained key point regression model may be a MobileNet or a modified MobileNet network.
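To make this structure concrete, the following is a minimal PyTorch sketch of a feature extraction unit (standard convolution, depthwise convolution, global average pooling) followed by a regression unit with one fully connected layer. The layer sizes are illustrative assumptions, not the patented network.

```python
import torch
import torch.nn as nn

class KeypointRegressor(nn.Module):
    """Feature extraction unit + regression unit, as described above (sizes assumed)."""

    def __init__(self, num_keypoints: int = 21):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, stride=2, padding=1),    # convolutional layer
            nn.ReLU(inplace=True),
            nn.Conv2d(32, 32, kernel_size=3, padding=1, groups=32),  # depthwise convolution
            nn.Conv2d(32, 64, kernel_size=1),                        # pointwise convolution
            nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool2d(1),                                 # global average pooling layer
        )
        # Regression layer: at least one fully connected layer outputting
        # (x, y) for each key point, i.e. num_keypoints * 2 values.
        self.regressor = nn.Linear(64, num_keypoints * 2)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        f = self.features(x).flatten(1)  # pooled feature map -> feature vector
        return self.regressor(f)         # key point position information
```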
According to the gesture key point detection method, an image to be detected containing gesture features is acquired; the gesture features are extracted from the image through the trained gesture detection model, and the region where the gesture features are located is output to obtain at least one target region; features of the key points in each target region are extracted through the trained key point regression model, and the position information of the key points of each target region is output, the two models being obtained by independent training. The gesture detection model first locates the region where the gesture lies, the located region is input into the key point regression model, and the key point positions of each target region are regressed directly. Detecting key points with two separately trained networks improves detection efficiency while preserving detection accuracy.
In one embodiment, the trained key point regression model includes a convolutional layer, a pooling layer and a regression layer, and step S203 includes: inputting each target region to the convolutional layer, and performing a convolution operation on each target region through the convolution kernels in the convolutional layer to obtain a convolution feature map corresponding to each target region; inputting each convolution feature map to the pooling layer, and performing a pooling operation through the pooling layer to obtain a corresponding pooling feature map; and inputting each pooling feature map to the regression layer, and regressing the position information of the key points of each target region.
Specifically, the key point regression model comprises at least one convolutional layer, a pooling layer and a regression layer; the size and number of convolution kernels in each convolutional layer can be customized as required. The pooling operation of the pooling layer may use a global average pooling algorithm, a maximum pooling algorithm, and the like. The convolution kernels of each convolutional layer extract features from the target region to obtain the corresponding convolution feature map; the convolution feature map is input to the pooling layer, where a preset pooling algorithm extracts features again to obtain the corresponding pooling feature map; the pooling feature map output by the pooling layer is then input to the regression layer, which outputs the position information of the key points of each target region.
In one embodiment, generating the trained key point regression model comprises: acquiring a plurality of training images, the training images carrying position information of a plurality of key points corresponding to gestures; inputting each training image to the key point regression model, and outputting the predicted position information of each key point; calculating a loss value of the key point regression model according to the difference between the position information and the predicted position information of each key point; and when the loss value is within a preset loss value interval, obtaining the trained key point regression model.
Specifically, the training images are used to train the key point regression model, and each training image carries key point position information. A training image is input to the key point regression model, the model parameters extract features of the key points in the image, and the key point positions are regressed from the extracted features to obtain the key points and their predicted position information. The loss value of the key point regression model is then determined from the difference between the predicted key points with their predicted positions and the key points with the positions labelled in the training image. The degree of difference may be measured by the difference, the ratio, the squared difference, or the exponent or logarithm of the difference, and the like; the loss value may also be expressed by the recognition accuracy of the key points. The preset loss value interval is a predefined data interval, which may be an empirical interval or a precise interval obtained through calculation. When the loss value falls within the preset loss value interval, the key point regression model has converged, and the trained key point regression model is obtained.
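A hedged sketch of this procedure, assuming PyTorch, a data loader yielding images paired with their labelled key point coordinates, and illustrative choices for the optimizer and the preset loss value interval:

```python
import torch

def train(model, loader, max_epochs: int = 100, loss_interval=(0.0, 1e-3)):
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)  # assumed optimizer
    criterion = torch.nn.MSELoss(reduction="sum")              # squared-difference loss
    for epoch in range(max_epochs):
        total = 0.0
        for images, keypoints in loader:        # keypoints: (B, 42) labelled positions
            optimizer.zero_grad()
            pred = model(images)                # predicted position information
            loss = criterion(pred, keypoints)   # difference between labels and predictions
            loss.backward()
            optimizer.step()                    # update model parameters from the loss value
            total += loss.item()
        avg = total / max(len(loader), 1)
        if loss_interval[0] <= avg <= loss_interval[1]:  # loss in preset interval: converged
            return model                        # trained key point regression model
    return model                                # fall back after max_epochs
```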
In one embodiment, the key point regression model is an improved MobileNet model. The model has the advantages of high precision and high speed.
In one embodiment, when the loss value is not in the preset loss value interval, updating model parameters of the key point regression model according to the loss value to obtain an intermediate key point regression model; and inputting each training image to the intermediate key point regression model, and obtaining the trained key point regression model until the loss value of the intermediate key point regression model is within the preset loss value interval.
Specifically, when the loss value is not within the preset loss value interval, the key point regression model has not converged and must continue learning; the model parameters of the key point regression model are updated according to the loss value to obtain an intermediate key point regression model. Each training image is then input to the intermediate key point regression model: if its loss value falls within the preset loss value interval, the trained key point regression model is obtained; otherwise, the model parameters of the intermediate key point regression model are updated again according to its loss value, until the loss value of the parameter-updated intermediate model lies within the preset loss value interval, yielding the trained key point regression model.
In a specific embodiment, the gesture key point detection method includes:
training a gesture detection model:
multiple pictures of multiple types of gestures are collected, and the regions and the types of the gestures are marked manually.
The training data is used to train a MobileNet + SSD gesture detection model; the network structure of the gesture detection model is shown in FIG. 3, which includes MobileNet and SSD. MobileNet is a lightweight deep neural network designed for embedded devices such as mobile phones, and can effectively reduce the computational complexity of the network.
The SSD algorithm is used for detection. SSD is one of the main detection frameworks at present; it has a distinct speed advantage over Faster R-CNN and higher accuracy than YOLO. With the SSD algorithm, the rectangular box of the gesture region and the gesture category are obtained with a single pass over the input picture. SSD adds a feature-pyramid-based detection scheme, so gestures can be detected at multiple scales.
After the original input picture passes through MobileNet, features are extracted at different levels; the extracted features are input to the SSD target detection layers, and detection is performed at each level of the network. After the multistage network cascade, the coordinates and position information of the target detection boxes are output.
A non-maximum suppression algorithm is then applied to the output target detection boxes to remove overlapping boxes, yielding the box coordinates with the highest confidence and the classification result.
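For illustration, this suppression step can be written with torchvision's non-maximum suppression operator; the IoU threshold is an assumed value, not one fixed by the application:

```python
import torch
from torchvision.ops import nms

def suppress(boxes: torch.Tensor, scores: torch.Tensor, iou_thresh: float = 0.5):
    """boxes: (N, 4) as (x1, y1, x2, y2); scores: (N,) confidences."""
    keep = nms(boxes, scores, iou_thresh)  # drop boxes overlapping a higher-confidence box
    return boxes[keep], scores[keep]       # highest-confidence boxes survive
```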
Training a key point recognition model:
and marking a large number of pictures of the gesture key points, wherein the gesture part in the pictures is divided into 21 key points for marking. The gesture is divided into 21 key points, wherein each finger is 4 key points, and 1 key point is marked at the wrist. The trained model uses a simplified version of the deep learning network, and the model structure is shown in table 1. The model structure is only one of the specific structures, and modifications can be made to the structure, such as adding or subtracting convolutional layers and full link layers.
TABLE 1 Key points regression model
Here, Conv denotes a convolutional layer, Conv dw a depthwise convolutional layer, Avg Pool an average pooling layer, and FC a fully connected layer. A parameter a × b × c × d for a convolutional layer means the convolution kernel is a three-dimensional kernel of size a × b × c and there are d kernels. A parameter e × f × h dw for a depthwise convolution means a kernel size of e × f with h kernels, and Avg Pool 7 × 7 denotes the size of the pooling window. s1 denotes a stride of 1 and s2 a stride of 2.
The training target consists of the x and y coordinates of the 21 key points of each gesture, 42 values in total. A sum-of-squares error function is used as the model's loss function during training.
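For illustration only, the 42-value regression target and the sum-of-squares error can be written as follows, with random tensors standing in for real labels and model output:

```python
import torch

labelled = torch.rand(21, 2)          # 21 key points, each with an (x, y) coordinate
target = labelled.flatten()           # 42 regression target values, as described above
pred = torch.rand(42)                 # model output for one gesture region
loss = ((pred - target) ** 2).sum()   # sum-of-squares error loss
```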
The use stage is as follows:
An image containing a gesture is input to the gesture detection model, which outputs the target regions containing gesture features. Each target region is input to the trained key point regression model, which directly yields the 42 coordinate values of the 21 key points of each gesture. Detection results are shown in FIG. 4 and FIG. 5.
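Putting the two stages together, the following is a hedged end-to-end sketch of the use stage, reusing the `detector` and `KeypointRegressor` sketches above; the crop size and score threshold are assumptions:

```python
import torch
from torchvision.transforms.functional import resized_crop

regressor = KeypointRegressor(num_keypoints=21).eval()  # from the earlier sketch

def detect_keypoints(image: torch.Tensor, score_thresh: float = 0.5):
    """image: (3, H, W) float tensor in [0, 1]; returns (box, 21x2 keypoints) pairs."""
    results = []
    with torch.no_grad():
        det = detector([image])[0]                     # stage 1: locate gesture regions
        for box, score in zip(det["boxes"], det["scores"]):
            if score < score_thresh:
                continue
            x1, y1, x2, y2 = box.int().tolist()
            crop = resized_crop(image, y1, x1, y2 - y1, x2 - x1, [224, 224])
            coords = regressor(crop.unsqueeze(0))      # stage 2: 42 coordinate values
            results.append((box, coords.view(21, 2)))  # 21 key points per gesture
    return results
```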
A two-stage key point detection architecture is used: gesture detection is performed first, followed by key point detection. For the key point detection part, a simplified deep learning network directly regresses the gesture key points, avoiding complex pre-processing and post-processing. Gesture detection uses deep learning to quickly obtain the box of the gesture region. The simplified deep learning network of the gesture key point detection algorithm is both fast and accurate. Combining the gesture detection module with the key point detection module guarantees key point recognition precision while maintaining detection speed, achieving real-time detection.
In a specific embodiment, while a video is playing, captured gesture images of a user are received and input to the trained gesture detection model, which outputs the target region where the gesture is located; the target region is input into the trained key point regression model, which outputs the key point position information. Combined with a front-end graphics renderer, various special effects can then be superimposed to add video-based interaction.
In a specific embodiment, the key point detection method is applied to intelligent editing: gesture regions and their key points in a video are detected automatically, facilitating later operations such as adding gesture special effects.
FIG. 2 is a flowchart of a gesture key point detection method in an embodiment. It should be understood that, although the steps in the flowchart of FIG. 2 are shown in the order indicated by the arrows, they are not necessarily performed in that order. Unless explicitly stated otherwise herein, there is no strict ordering restriction on these steps, and they may be performed in other orders. Moreover, at least some of the steps in FIG. 2 may include multiple sub-steps or stages that are not necessarily performed at the same moment but may be performed at different times, and their order of execution is not necessarily sequential; they may be performed in turn or alternately with other steps or with at least part of the sub-steps or stages of other steps.
In one embodiment, as shown in fig. 6, there is provided a gesture keypoint detection apparatus 200, comprising:
the data acquisition module 201 is configured to acquire an image to be detected, where the image to be detected includes a gesture feature.
The region detection module 202 is configured to extract gesture features in the image to be detected through the trained gesture detection model, and output a region where the gesture features are located to obtain at least one target region.
The key point detection module 203 is configured to extract features of key points in each target region through the trained key point regression model, and output location information of the key points in each target region, where the trained gesture detection model and the trained key point regression model are obtained through separate training.
In an embodiment, the gesture key point detecting device 200 further includes:
a model generation module for generating a trained keypoint regression model, wherein the model generation module comprises:
and the data acquisition unit is used for acquiring a plurality of training images, and the training images carry a plurality of key point position information corresponding to the gestures.
And the prediction unit is used for inputting each training image to the key point regression model and outputting the predicted position information of each key point.
And the loss value calculation unit is used for calculating the loss value of the key point regression model according to the difference between the position information of each key point and the predicted position information.
And the model generating unit is used for obtaining the trained key point regression model when the loss value is in the preset loss value interval.
In one embodiment, the model generating unit is further configured to update a model parameter of the key point regression model according to the loss value when the loss value is not within the preset loss value interval, so as to obtain an intermediate key point regression model; and inputting each training image to the intermediate key point regression model, and obtaining the trained key point regression model until the loss value of the intermediate key point regression model is within the preset loss value interval.
In one embodiment, the keypoint detection module 203 is specifically configured to input each target region to the convolutional layer, perform convolution operation on each target region through a convolution kernel in the convolutional layer to obtain a convolution feature map corresponding to each target region, input each convolution feature map to the pooling layer, and perform pooling operation through the pooling layer to obtain a corresponding pooling feature map; inputting each pooling feature map to a regression layer, and regressing the position information of the key points of each target region, wherein the trained key point regression model comprises a convolution layer, a pooling layer and a regression layer.
FIG. 7 is a diagram illustrating an internal structure of a computer device in one embodiment. The computer device may specifically be the terminal 110 (or the server 120) in fig. 1. As shown in fig. 7, the computer apparatus includes a processor, a memory, a network interface, an input device, and a display screen connected via a system bus. Wherein the memory includes a non-volatile storage medium and an internal memory. The non-volatile storage medium of the computer device stores an operating system and may also store a computer program that, when executed by the processor, causes the processor to implement the gesture keypoint detection method. The internal memory may also store a computer program, and when the computer program is executed by the processor, the processor may execute the gesture key point detection method. The display screen of the computer equipment can be a liquid crystal display screen or an electronic ink display screen, and the input device of the computer equipment can be a touch layer covered on the display screen, a key, a track ball or a touch pad arranged on the shell of the computer equipment, an external keyboard, a touch pad or a mouse and the like.
Those skilled in the art will appreciate that the architecture shown in fig. 7 is merely a block diagram of some of the structures associated with the disclosed aspects and is not intended to limit the computing devices to which the disclosed aspects apply, as particular computing devices may include more or less components than those shown, or may combine certain components, or have a different arrangement of components.
In one embodiment, the gesture key point detection apparatus provided in the present application may be implemented in a form of a computer program, and the computer program may be run on a computer device as shown in fig. 7. The memory of the computer device may store various program modules constituting the gesture key point detection apparatus, such as the data acquisition module 201, the area detection module 202, and the key point detection module 203 shown in fig. 6. The program modules constitute computer programs that cause a processor to execute the steps of the gesture keypoint detection method of the embodiments of the present application described in the present specification.
For example, the computer device shown in FIG. 7 may, through the data acquisition module 201 of the gesture key point detection apparatus shown in FIG. 6, acquire an image to be detected containing gesture features. Through the region detection module 202, it may extract the gesture features from the image via the trained gesture detection model and output the region where the gesture features are located to obtain at least one target region. Through the key point detection module 203, it may extract features of the key points in each target region via the trained key point regression model and output the position information of the key points of each target region, the trained gesture detection model and the trained key point regression model being trained separately.
In one embodiment, a computer device is provided, comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, the processor implementing the following steps when executing the computer program: acquiring an image to be detected, wherein the image to be detected contains gesture features; extracting gesture features in the image to be detected through a trained gesture detection model, and outputting a region where the gesture features are located to obtain at least one target region; and extracting features of the key points in each target region through the trained key point regression model, and outputting the position information of the key points of each target region, wherein the trained gesture detection model and the trained key point regression model are obtained by independent training.
In one embodiment, generating a trained keypoint regression model comprises: acquiring a plurality of training images, wherein the training images carry a plurality of key point position information corresponding to the gestures; inputting each training image to the key point regression model, and outputting the predicted position information of each key point; calculating a loss value of the regression model of the key points according to the difference between the position information of each key point and the predicted position information; and when the loss value is within the preset loss value interval, obtaining the trained key point regression model.
In one embodiment, the processor, when executing the computer program, further performs the steps of: when the loss value is not in the preset loss value interval, updating the model parameters of the key point regression model according to the loss value to obtain an intermediate key point regression model; and inputting each training image to the intermediate key point regression model, and obtaining the trained key point regression model until the loss value of the intermediate key point regression model is within the preset loss value interval.
In one embodiment, the trained key point regression model includes a convolutional layer, a pooling layer and a regression layer, and extracting features of the key points in each target region through the trained key point regression model and outputting the position information of the key points of each target region includes: inputting each target region to the convolutional layer, and performing a convolution operation on each target region through a convolution kernel in the convolutional layer to obtain a convolution feature map corresponding to each target region; inputting each convolution feature map to the pooling layer, and performing a pooling operation through the pooling layer to obtain a corresponding pooling feature map; and inputting each pooling feature map to the regression layer, and regressing the position information of the key points of each target region.
In one embodiment, the regression layer includes at least one fully connected layer.
In one embodiment, a computer-readable storage medium is provided, having a computer program stored thereon which, when executed by a processor, performs the steps of: acquiring an image to be detected, wherein the image to be detected contains gesture features; extracting gesture features in the image to be detected through a trained gesture detection model, and outputting a region where the gesture features are located to obtain at least one target region; and extracting features of the key points in each target region through the trained key point regression model, and outputting the position information of the key points of each target region, wherein the trained gesture detection model and the trained key point regression model are obtained by independent training.
In one embodiment, the computer program when executed by the processor further performs the steps of: and generating a trained key point regression model. Wherein generating a trained keypoint regression model comprises: acquiring a plurality of training images, wherein the training images carry a plurality of key point position information corresponding to the gestures; inputting each training image to the key point regression model, and outputting the predicted position information of each key point; calculating a loss value of the regression model of the key points according to the difference between the position information of each key point and the predicted position information; and when the loss value is within the preset loss value interval, obtaining the trained key point regression model.
In one embodiment, the computer program when executed by the processor further performs the steps of: when the loss value is not in the preset loss value interval, updating the model parameters of the key point regression model according to the loss value to obtain an intermediate key point regression model; and inputting each training image to the intermediate key point regression model, and obtaining the trained key point regression model until the loss value of the intermediate key point regression model is within the preset loss value interval.
In one embodiment, the trained key point regression model includes a convolutional layer, a pooling layer and a regression layer, and extracting features of the key points in each target region through the trained key point regression model and outputting the position information of the key points of each target region includes: inputting each target region to the convolutional layer, and performing a convolution operation on each target region through a convolution kernel in the convolutional layer to obtain a convolution feature map corresponding to each target region; inputting each convolution feature map to the pooling layer, and performing a pooling operation through the pooling layer to obtain a corresponding pooling feature map; and inputting each pooling feature map to the regression layer, and regressing the position information of the key points of each target region.
In one embodiment, the regression layer includes at least one fully connected layer.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program, which can be stored in a non-volatile computer-readable storage medium and which, when executed, can include the processes of the method embodiments described above. Any reference to memory, storage, a database or another medium used in the embodiments provided herein may include non-volatile and/or volatile memory. Non-volatile memory can include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory can include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in many forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).
It is noted that, in this document, relational terms such as "first" and "second," and the like, may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.
The foregoing are merely exemplary embodiments of the present invention, which enable those skilled in the art to understand or practice the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.
Claims (10)
1. A gesture keypoint detection method, the method comprising:
acquiring an image to be detected, wherein the image to be detected comprises gesture features;
extracting gesture features in the image to be detected through a trained gesture detection model, and outputting a region where the gesture features are located to obtain at least one target region;
extracting features of the key points in each target region through a trained key point regression model, and outputting the position information of the key points of each target region, wherein the trained gesture detection model and the trained key point regression model are obtained through independent training.
2. The method of claim 1, wherein generating the trained keypoint regression model comprises:
acquiring a plurality of training images, wherein the training images carry a plurality of key point position information corresponding to gestures;
inputting each training image to a key point regression model, and outputting the predicted position information of each key point;
calculating a loss value of the key point regression model according to the difference between the position information of each key point and the predicted position information;
and when the loss value is within a preset loss value interval, obtaining the trained key point regression model.
3. The method of claim 2, further comprising:
when the loss value is not located in the preset loss value interval, updating model parameters of the key point regression model according to the loss value to obtain an intermediate key point regression model;
and inputting each training image to the intermediate key point regression model, and obtaining the trained key point regression model until the loss value of the intermediate key point regression model is within the preset loss value interval.
4. The method according to any one of claims 1 to 3, wherein the trained key point regression model comprises a convolutional layer, a pooling layer and a regression layer, and the extracting features of the key points in each of the target regions through the trained key point regression model and outputting the position information of the key points of each of the target regions comprises:
inputting each target region to the convolutional layer, and performing a convolution operation on each target region through a convolution kernel in the convolutional layer to obtain a convolution feature map corresponding to each target region;
inputting each convolution feature map to the pooling layer, and performing a pooling operation through the pooling layer to obtain a corresponding pooling feature map;
and inputting each pooling feature map to the regression layer, and regressing the position information of the key points of each target region.
5. The method of claim 4, wherein the regression layer comprises at least one fully connected layer.
6. A gesture keypoint detection apparatus, the apparatus comprising:
the data acquisition module is used for acquiring an image to be detected, wherein the image to be detected comprises gesture features;
the area detection module is used for extracting gesture features in the image to be detected through a trained gesture detection model and outputting an area where the gesture features are located to obtain at least one target area;
and the key point detection module is used for extracting features of the key points in each target region through a trained key point regression model and outputting the position information of the key points of each target region, and the trained gesture detection model and the trained key point regression model are obtained through independent training.
7. The apparatus of claim 6, wherein the apparatus comprises:
a model generation module for generating the trained keypoint regression model, wherein the model generation module comprises:
the data acquisition unit is used for acquiring a plurality of training images, and the training images carry a plurality of key point position information corresponding to the gestures;
the prediction unit is used for inputting each training image to a key point regression model and outputting the predicted position information of each key point;
a loss value calculation unit, configured to calculate a loss value of the keypoint regression model according to a difference between the position information of each keypoint and the predicted position information;
and the model generating unit is used for obtaining the trained key point regression model when the loss value is in a preset loss value interval.
8. The apparatus according to claim 6 or 7, wherein the keypoint detection module is specifically configured to input each of the target regions to the convolutional layer, perform a convolution operation on each of the target regions through a convolution kernel in the convolutional layer to obtain a convolution feature map corresponding to each of the target regions, input each of the convolution feature maps to a pooling layer, perform a pooling operation through the pooling layer to obtain a corresponding pooling feature map, input each of the pooling feature maps to a regression layer, and regress position information of keypoints in each of the target regions, where the trained keypoint regression model includes the convolutional layer, the pooling layer, and the regression layer.
9. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the steps of the method of any of claims 1 to 5 are implemented when the computer program is executed by the processor.
10. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the method of any one of claims 1 to 5.
Priority Applications (1)

| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN201911413461.0A | 2019-12-31 | 2019-12-31 | Gesture key point detection method and device, computer equipment and storage medium |

Applications Claiming Priority (1)

| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN201911413461.0A | 2019-12-31 | 2019-12-31 | Gesture key point detection method and device, computer equipment and storage medium |
Publications (1)

| Publication Number | Publication Date |
|---|---|
| CN111160288A (en) | 2020-05-15 |
Family

ID=70560114

Family Applications (1)

| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| CN201911413461.0A | Gesture key point detection method and device, computer equipment and storage medium | 2019-12-31 | 2019-12-31 |

Country Status (1)

| Country | Link |
|---|---|
| CN | CN111160288A (en) |
Application events

- 2019-12-31: Application CN201911413461.0A filed in CN; published as CN111160288A (en); status: Pending
Patent Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107590482A (en) * | 2017-09-29 | 2018-01-16 | 百度在线网络技术(北京)有限公司 | information generating method and device |
CN109359538A (en) * | 2018-09-14 | 2019-02-19 | 广州杰赛科技股份有限公司 | Training method, gesture identification method, device and the equipment of convolutional neural networks |
CN109446994A (en) * | 2018-10-30 | 2019-03-08 | 北京达佳互联信息技术有限公司 | Gesture critical point detection method, apparatus, electronic equipment and storage medium |
CN110222607A (en) * | 2019-05-24 | 2019-09-10 | 北京航空航天大学 | The method, apparatus and system of face critical point detection |
Cited By (12)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112383805A (en) * | 2020-11-16 | 2021-02-19 | 四川长虹电器股份有限公司 | Method for realizing man-machine interaction at television end based on human hand key points |
CN112507924A (en) * | 2020-12-16 | 2021-03-16 | 深圳荆虹科技有限公司 | 3D gesture recognition method, device and system |
CN112507924B (en) * | 2020-12-16 | 2024-04-09 | 深圳荆虹科技有限公司 | 3D gesture recognition method, device and system |
CN112784810A (en) * | 2021-02-08 | 2021-05-11 | 风变科技(深圳)有限公司 | Gesture recognition method and device, computer equipment and storage medium |
CN112784810B (en) * | 2021-02-08 | 2024-06-14 | 风变科技(深圳)有限公司 | Gesture recognition method, gesture recognition device, computer equipment and storage medium |
CN113065458A (en) * | 2021-03-29 | 2021-07-02 | 新疆爱华盈通信息技术有限公司 | Voting method and system based on gesture recognition and electronic device |
CN113065458B (en) * | 2021-03-29 | 2024-05-28 | 芯算一体(深圳)科技有限公司 | Voting method and system based on gesture recognition and electronic equipment |
WO2022247403A1 (en) * | 2021-05-24 | 2022-12-01 | 北京迈格威科技有限公司 | Keypoint detection method, electronic device, program, and storage medium |
CN113591615A (en) * | 2021-07-14 | 2021-11-02 | 广州敏视数码科技有限公司 | Multi-model-based driver smoking detection method |
CN113703581A (en) * | 2021-09-03 | 2021-11-26 | 广州朗国电子科技股份有限公司 | Window adjusting method based on gesture switching, electronic whiteboard and storage medium |
CN113505763A (en) * | 2021-09-09 | 2021-10-15 | 北京爱笔科技有限公司 | Key point detection method and device, electronic equipment and storage medium |
CN113505763B (en) * | 2021-09-09 | 2022-02-01 | 北京爱笔科技有限公司 | Key point detection method and device, electronic equipment and storage medium |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN111160288A (en) | Gesture key point detection method and device, computer equipment and storage medium | |
CN109961009B (en) | Pedestrian detection method, system, device and storage medium based on deep learning | |
US11798174B2 (en) | Method, device, equipment and storage medium for locating tracked targets | |
CN110738207B (en) | Character detection method for fusing character area edge information in character image | |
US9697416B2 (en) | Object detection using cascaded convolutional neural networks | |
US9542621B2 (en) | Spatial pyramid pooling networks for image processing | |
CN110610154A (en) | Behavior recognition method and apparatus, computer device, and storage medium | |
CN109671020A (en) | Image processing method, device, electronic equipment and computer storage medium | |
CN110956131B (en) | Single-target tracking method, device and system | |
CN111126339A (en) | Gesture recognition method and device, computer equipment and storage medium | |
US8149281B2 (en) | Electronic device and method for operating a presentation application file | |
JP7429307B2 (en) | Character string recognition method, device, equipment and medium based on computer vision | |
EP4030749A1 (en) | Image photographing method and apparatus | |
CN110210480B (en) | Character recognition method and device, electronic equipment and computer readable storage medium | |
CN113343982A (en) | Entity relationship extraction method, device and equipment for multi-modal feature fusion | |
CN109448018B (en) | Tracking target positioning method, device, equipment and storage medium | |
CN113343981A (en) | Visual feature enhanced character recognition method, device and equipment | |
CN115063473A (en) | Object height detection method and device, computer equipment and storage medium | |
WO2017202086A1 (en) | Image screening method and device | |
CN113592881B (en) | Picture designability segmentation method, device, computer equipment and storage medium | |
CN110210279B (en) | Target detection method, device and computer readable storage medium | |
CN112749576B (en) | Image recognition method and device, computing equipment and computer storage medium | |
CN112528978B (en) | Face key point detection method and device, electronic equipment and storage medium | |
CN110728172B (en) | Point cloud-based face key point detection method, device and system and storage medium | |
CN110807452A (en) | Prediction model construction method, device and system and bank card number identification method |
Legal Events

| Date | Code | Title | Description |
|---|---|---|---|
| | PB01 | Publication | |
| | SE01 | Entry into force of request for substantive examination | |
| | RJ01 | Rejection of invention patent application after publication | Application publication date: 20200515 |