
WO2018058419A1 - Two-dimensional image based human body joint point positioning model construction method, and positioning method - Google Patents

Two-dimensional image based human body joint point positioning model construction method, and positioning method

Info

Publication number
WO2018058419A1
WO2018058419A1 (PCT/CN2016/100763, CN2016100763W)
Authority
WO
WIPO (PCT)
Prior art keywords
component
human body
model
image
sample set
Prior art date
Application number
PCT/CN2016/100763
Other languages
French (fr)
Chinese (zh)
Inventor
黄凯奇
张俊格
付连锐
Original Assignee
中国科学院自动化研究所 (Institute of Automation, Chinese Academy of Sciences)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 中国科学院自动化研究所 (Institute of Automation, Chinese Academy of Sciences)
Priority to PCT/CN2016/100763
Publication of WO2018058419A1

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06T: IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00: Image analysis
    • G06T 7/20: Analysis of motion


Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Image Analysis (AREA)

Abstract

A method for constructing a human body joint point positioning model for two-dimensional images, and a positioning method based on the construction method. The construction method comprises: using color images annotated with human body joint point position coordinates and occlusion states to construct a human body part local feature training sample set and a human body part global configuration sample set (S100); constructing a deep convolutional neural network and training it with the human body part local feature training sample set to obtain a human body part local appearance model (S110); obtaining an occlusion relationship graph model from the human body part local appearance model and the human body part global configuration sample set (S120); and determining the human body part local appearance model and the occlusion relationship graph model as the two-dimensional image human body joint point positioning model (S130). The method solves the technical problem of how to accurately and robustly position human body joint points in a two-dimensional image.

Description

Two-dimensional image based human body joint point positioning model construction method, and positioning method
Technical Field
The present invention relates to the field of image processing and pattern recognition, and in particular to a method for constructing a human body joint point positioning model for two-dimensional images and to a positioning method based on that construction method.
Background Art
In fields such as video surveillance, sign language recognition, smart homes, human-computer interaction, augmented reality, image retrieval, and robotics, it is often necessary to estimate the position coordinates of each human body joint point from a two-dimensional image. Two-dimensional image human body joint point positioning plays a key role in these applications and carries great application value. In practice, the difficulties in locating human body joint points include large-scale deformation, viewpoint changes, occlusion, and complex backgrounds.
At present, two-dimensional image human body joint point positioning methods fall into two broad categories: joint point regression and part detection.
A two-dimensional image joint point regression method first uses a human body detector to determine the position and size of the region containing the person, then extracts image features within that region and predicts the coordinates of the human body joint points by regression. See Document 1 and Document 2 for related work.
Joint point regression methods are easy to implement, but they have two drawbacks. First, because they require the rectangular box produced by a human body detector as input, large body movements can cause the detector to produce false detections, which in turn makes the subsequent joint point regression fail. Second, because the positions of terminal joints such as the wrists and ankles vary greatly while those of joints such as the head and shoulders vary little, global regression over the image region tends to under-fit the terminal joint points and degrades their positioning accuracy. To mitigate the second drawback, Document 3 divides the human body into upper, middle, and lower regions and regresses the joint points of each region separately, but it ignores the first drawback.
A two-dimensional image human body part detection method first extracts local image features with a sliding-window scan and classifies them into parts, then uses a structural model to constrain the relative positions between parts, so that the optimal human body part configuration is detected and the region of each part and the position coordinates of the corresponding joint points are obtained. Part detection methods involve two key technologies: the local feature representation of the parts, and the structural modeling of the human body.
For the local feature representation of parts, existing methods mainly use hand-crafted features or learned features. Document 4 uses histograms of oriented gradients to describe the local features of parts, and Document 5 uses shape context features. Hand-crafted features require no training and are simple and fast, but their expressive power is weak and they are not robust to noise. Document 6 proposes extracting features from the local region of each part with a convolutional neural network, which strengthens the representation of part features under different poses and improves robustness to noise. However, Document 6 only considers the case where a part is not occluded, so its positioning accuracy for occluded joint points is poor.
For the structural modeling of the human body, the model structures in use include tree-structured models and loopy graph models. Most existing structural modeling methods adopt a tree-structured model; see Document 4 and Document 6. Although a tree-structured model is simple and allows fast inference, it has difficulty modeling complex occlusion relationships, especially self-occlusion. Compared with a tree-structured model, the main difference of a loopy graph model is that loops are introduced into the model structure; Document 7 and Document 8, for example, use loopy graph models. Although a loopy graph model improves the expressive power of the model and its robustness to occlusion, its inference complexity is high, which limits its application to human body structural modeling.
In view of this, the present invention is proposed.
The related documents mentioned above are listed below:
Document 1: Alexander Toshev and Christian Szegedy. DeepPose: Human pose estimation via deep neural networks. In IEEE Conference on Computer Vision and Pattern Recognition, pages 1653–1660, 2014;
Document 2: US7778446B2, FAST HUMAN POSE ESTIMATION USING APPEARANCE AND MOTION VIA MULTI-DIMENSIONAL BOOSTING REGRESSION;
Document 3: Vasileios Belagiannis, Christian Rupprecht, Gustavo Carneiro, and Nassir Navab. Robust optimization for deep regression. In International Conference on Computer Vision, pages 2830–2838, 2015;
Document 4: Y. Yang and D. Ramanan. Articulated pose estimation with flexible mixtures-of-parts. In IEEE Conference on Computer Vision and Pattern Recognition, pages 1385–1392, 2011;
Document 5: US7925081B2, SYSTEMS AND METHODS FOR HUMAN BODY POSE ESTIMATION;
Document 6: Xianjie Chen and Alan L. Yuille. Articulated pose estimation by a graphical model with image dependent pairwise relations. In Advances in Neural Information Processing Systems, pages 1736–1744, 2014;
Document 7: Mykhaylo Andriluka, Stefan Roth, and Bernt Schiele. Discriminative appearance models for pictorial structures. International Journal of Computer Vision, 99(3):259–280, 2012;
Document 8: Leonid Sigal and Michael J. Black. Measure locally, reason globally: Occlusion-sensitive articulated pose estimation. In IEEE Conference on Computer Vision and Pattern Recognition, pages 2041–2048, 2006.
Summary of the Invention
In order to solve the above problems in the prior art, namely the technical problem of how to accurately and robustly locate human body joint points in a two-dimensional image, a method for constructing a two-dimensional image human body joint point positioning model is provided. In addition, a positioning method based on this construction method is also provided.
In order to achieve the above object, in one aspect, the following technical solution is provided:
A method for constructing a two-dimensional image human body joint point positioning model, the construction method comprising:
constructing a human body part local feature training sample set and a human body part global configuration sample set from color images annotated with human body joint point position coordinates and occlusion states;
constructing a deep convolutional neural network, and training the deep convolutional neural network with the human body part local feature training sample set to obtain a human body part local appearance model;
obtaining an occlusion relationship graph model from the human body part local appearance model and the human body part global configuration sample set;
determining the human body part local appearance model and the occlusion relationship graph model as the two-dimensional image human body joint point positioning model.
Preferably, constructing the human body part local feature training sample set may specifically comprise:
calculating the relative position of each human body part with respect to its parent part;
clustering these relative positions over all the color images;
constructing the human body part local feature training sample set from the image region in which each human body part is located and the category obtained by the clustering.
Preferably, constructing the human body part global configuration sample set may specifically comprise:
determining the sample labels of the human body parts;
determining the image regions corresponding to all the human body parts;
forming the human body part global configuration sample set from the sample labels and the image regions.
Preferably, constructing the deep convolutional neural network may specifically comprise:
determining the basic units of the deep convolutional neural network as 5 convolutional layers and 3 fully connected layers;
using the image region in which a part is located as the input of the deep convolutional neural network.
Preferably, obtaining the occlusion relationship graph model from the human body part local appearance model and the human body part global configuration sample set may specifically comprise:
establishing a connection relationship with loops between the parts of the human body;
based on the connection relationship with loops between the parts of the human body, using the human body part local appearance model and a structured support vector machine, and applying the dual coordinate descent method on the human body part global configuration sample set, training to obtain the weight corresponding to the relative position between any two human body parts that have a constraint relationship and the appearance feature weight coefficient of each human body part, thereby obtaining the occlusion relationship graph model.
In order to achieve the above object, in another aspect, a two-dimensional image human body joint point positioning method based on the above construction method is also provided, the positioning method comprising:
acquiring an image to be detected;
extracting local appearance features of the image to be detected by using the human body part local appearance model;
based on the local appearance features of the image to be detected, using the occlusion relationship graph model and obtaining the optimal human body part configuration according to the following formula:
(xi*, yi*, oi*, ti*) = argmax ( Σ γij · Δij + Σ ωi · pi );
where xi denotes the abscissa of part i; yi denotes the ordinate of part i; oi denotes the occlusion state of part i; ti denotes the category of part i; part j is the parent part of part i; Δij denotes the relative position between parts i and j; γij denotes the weight corresponding to the relative position Δij; ωi denotes the appearance feature weight coefficient of part i; pi denotes the local appearance feature of part i; and i and j are positive integers;
determining the center position of each human body part region in the optimal human body part configuration as the joint point position of that human body part.
Preferably, extracting the local appearance features of the image to be detected by using the human body part local appearance model may specifically comprise:
dividing the image to be detected into a plurality of local image regions;
using each local image region as the input of the human body part local appearance model to obtain the local appearance features of the image to be detected.
Embodiments of the present invention provide a method for constructing a two-dimensional image human body joint point positioning model and a two-dimensional image human body joint point positioning method based on that construction method. The construction method may comprise: constructing a human body part local feature training sample set and a human body part global configuration sample set from color images annotated with human body joint point position coordinates and occlusion states; constructing a deep convolutional neural network and training it with the human body part local feature training sample set to obtain a human body part local appearance model; obtaining an occlusion relationship graph model from the human body part local appearance model and the human body part global configuration sample set; and determining the human body part local appearance model and the occlusion relationship graph model as the two-dimensional image human body joint point positioning model. The invention can thus model self-occlusion and occlusion by other objects at the same time, and can learn the occlusion relationships between human body parts as well as between parts and the background. By fusing deep convolutional neural network feature extraction with a graph model structure, the invention achieves robust positioning of human body joint points under large motion poses and partial occlusion. The model structure adopted by the invention can model not only the relationships between physically connected parts but also the spatial context relationships between left and right limb parts that are not directly connected, which further enhances robustness to occlusion. By tightly combining the human body part local appearance model with the graph structure model, the invention effectively overcomes the adverse effects of large movements and partial occlusion and improves the robustness of human body joint point positioning in two-dimensional images.
Brief Description of the Drawings
FIG. 1 is a schematic flowchart of a method for constructing a two-dimensional image human body joint point positioning model according to an embodiment of the present invention;
FIG. 2 is a schematic flowchart of constructing a human body part local feature training sample set according to an embodiment of the present invention;
FIG. 3 is a schematic flowchart of constructing a human body part global configuration sample set according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of a deep convolutional neural network constructed according to an embodiment of the present invention;
FIG. 5 is a schematic diagram of an occlusion relationship graph model according to an embodiment of the present invention;
FIG. 6 is a schematic flowchart of a two-dimensional image human body joint point positioning method according to an embodiment of the present invention.
Detailed Description
Preferred embodiments of the present invention are described below with reference to the accompanying drawings. Those skilled in the art should understand that these embodiments are only intended to explain the technical principles of the present invention and are not intended to limit its scope of protection.
The basic idea of the embodiments of the present invention is to model the occlusion relationships of human body parts both in the local feature representation of the parts and in the structural modeling of the human body.
In practical applications, prior art such as the application entitled "一种人体姿态估计方法" ("A human body pose estimation method"), application number 201510792096.4, discloses a similar human body joint point positioning algorithm: its inputs are a color image and a depth image, its local features are histogram-of-oriented-gradients features, and its structural model is a tree structure. However, that method cannot handle mutual occlusion between human body parts.
To this end, an embodiment of the present invention provides a method for constructing a two-dimensional image human body joint point positioning model. As shown in FIG. 1, the construction method may be implemented through steps S100 to S130, in which:
S100: Construct a human body part local feature training sample set and a human body part global configuration sample set from color images annotated with human body joint point position coordinates and occlusion states.
In some embodiments, as shown in FIG. 2, the process of constructing the human body part local feature training sample set may be implemented in the following preferred way:
S101: Calculate the relative position of each human body part with respect to its parent part.
S102: Cluster these relative positions over all the color images.
S103: Construct the human body part local feature training sample set from the image region in which each human body part is located and the category obtained by the clustering.
The process of constructing the part local feature training sample set is described in detail below with a preferred embodiment (a code sketch of these steps follows step c).
Step a: Calculate the relative position Δij of the i-th part with respect to its parent part j, where i and j are positive integers.
Step b: Cluster the relative positions Δij over all images using k-means.
In implementation, the number of clusters may be set to 13.
Step c: Construct the human body part local feature training sample set from the image region Ii in which the i-th part is located and the category ti obtained by clustering the i-th part (ti is the category of part i).
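As a concrete illustration of steps a to c, the Python sketch below clusters the parent-relative offsets of one part with k-means and pairs each part crop with its cluster index as the category ti. It is a minimal sketch rather than the patent's implementation: the annotation format, the parent map, and all function and variable names are assumptions; only the parent-relative offsets, the k-means clustering, and the choice of 13 clusters come from the text.

```python
# Minimal sketch of steps a-c; annotation format and names are illustrative assumptions.
import numpy as np
from sklearn.cluster import KMeans

NUM_TYPES = 13                       # number of clusters suggested in the text
PARENT = {1: 0, 2: 1, 3: 2}          # hypothetical parent map: part index -> parent part index

def build_local_feature_samples(annotations, crops, part_id):
    """annotations: list of dicts mapping part index -> (x, y) joint coordinates.
    crops: list of dicts mapping part index -> image patch around that joint.
    Returns (patch, category ti) pairs for one part over the whole training set."""
    parent = PARENT[part_id]
    # Step a: relative position of part i with respect to its parent part j.
    deltas = np.array([
        [ann[part_id][0] - ann[parent][0], ann[part_id][1] - ann[parent][1]]
        for ann in annotations
    ])
    # Step b: cluster the relative positions of all images with k-means.
    types = KMeans(n_clusters=NUM_TYPES, n_init=10, random_state=0).fit_predict(deltas)
    # Step c: pair the image region of part i with its clustered category ti.
    return [(crops[n][part_id], int(types[n])) for n in range(len(annotations))]
```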
In some embodiments, as shown in FIG. 3, the process of constructing the human body part global configuration sample set may be implemented in the following preferred way:
S105: Determine the sample labels of the human body parts.
S106: Determine the image regions corresponding to all the human body parts.
S107: Form the human body part global configuration sample set from the sample labels and the image regions.
The process of constructing the human body part global configuration sample set is described in detail below with a preferred embodiment (an illustrative data layout follows step f).
Step d: Determine the sample label of the i-th part as (xi, yi, oi, ti), where xi denotes the abscissa of part i; yi denotes the ordinate of part i; oi denotes the occlusion state of part i and takes the values 0, 1, and 2, where 0 means visible, 1 means occluded by another human body part, and 2 means occluded by the background; and ti denotes the category of part i.
Step e: Determine the image regions corresponding to all parts.
Step f: Form the human body part global configuration sample set from the sample labels and the image regions.
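For illustration, the sample label of steps d to f can be captured with a small container type. The class and field names below are assumptions; the (xi, yi, oi, ti) fields and the meaning of the occlusion states 0/1/2 follow the text.

```python
# Illustrative container for the (xi, yi, oi, ti) sample label of steps d-f.
from dataclasses import dataclass
from typing import List, Tuple
import numpy as np

@dataclass
class PartLabel:
    x: float          # xi: abscissa of part i
    y: float          # yi: ordinate of part i
    occlusion: int    # oi: 0 = visible, 1 = occluded by another body part, 2 = occluded by background
    category: int     # ti: clustered category of part i

# One global-configuration sample (step f): labels of all parts plus their image regions.
GlobalConfigurationSample = Tuple[List[PartLabel], List[np.ndarray]]
```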
S110: Construct a deep convolutional neural network and train it with the human body part local feature training sample set to obtain the human body part local appearance model.
Prior art (for example: Yoshua Bengio, Yann LeCun, Craig R. Nohl, Christopher J. C. Burges. LeRec: a NN/HMM hybrid for on-line handwriting recognition. Neural Computation 7(6):1289–1303, 1995) uses the LeNet network structure for training. The input of the LeNet structure is a grayscale image, and its basic units are 3 convolutional layers and 2 fully connected layers.
Embodiments of the present invention improve on this prior art. In some embodiments, constructing the deep convolutional neural network in this step may be implemented in the following preferred way: the basic units of the deep convolutional neural network are determined as 5 convolutional layers and 3 fully connected layers, and the image region in which a part is located (i.e., a color local region image) is used as the input of the network. Constructed in this way, the deep convolutional neural network outputs part-category probabilities, where the probability of a part category indicates the probability that the image region belongs to part i. FIG. 4 exemplarily shows a schematic diagram of the deep convolutional neural network constructed by an embodiment of the present invention.
In some embodiments, the training process in this step may include forward propagation and backward propagation. The forward pass applies convolution operations and matrix multiplications, layer by layer, to the color image region in which a part is located; the backward pass propagates the error between the prediction and the sample label back through the layers with gradient descent and updates the parameters of the fully connected layers and the convolutional layers.
In a specific implementation, to facilitate processing, the color local region image of a part may be scaled to 36×36 pixels as the input of the deep convolutional neural network.
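The network just described can be sketched as follows. Only the 5-convolutional-layer / 3-fully-connected-layer structure, the 36×36 color input, and the part-category probability output come from the text; the channel widths, kernel sizes, and pooling pattern are not specified in the patent and are chosen here purely for illustration, with PyTorch as one possible implementation.

```python
# Illustrative 5-conv + 3-FC network for 36x36 color part crops.
# Channel widths, kernel sizes and pooling are assumptions; only the layer
# counts, the 36x36 color input and the part-category probabilities are from the text.
import torch
import torch.nn as nn

class PartAppearanceNet(nn.Module):
    def __init__(self, num_categories: int):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),   # 36 -> 18
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),   # 18 -> 9
            nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),  # 9 -> 4
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(128 * 4 * 4, 512), nn.ReLU(),
            nn.Linear(512, 256), nn.ReLU(),
            nn.Linear(256, num_categories),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (N, 3, 36, 36) color part crops; returns per-crop part-category probabilities.
        return torch.softmax(self.classifier(self.features(x)), dim=1)
```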
The parameters of the human body part local appearance model in this step may be the parameters of the convolutional-layer and fully-connected-layer neurons of the deep convolutional neural network.
Since the deep convolutional neural network is a supervised learning algorithm, the human body part local appearance model is obtained by supervised learning from the training samples, so no manual intervention is required.
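A minimal supervised training loop over the local feature training sample set, reusing the PartAppearanceNet sketch above, might look as follows. The optimizer choice, learning rate, epoch count, and data-loading interface are assumptions; only the forward/backward scheme with gradient descent on 36×36 part crops and their category labels comes from the text.

```python
# Minimal supervised training loop for the local appearance model (illustrative only).
import torch
import torch.nn as nn

def train_appearance_model(net, loader, epochs: int = 10, lr: float = 0.01):
    """net: a PartAppearanceNet as sketched above (exposes .features and .classifier).
    loader yields (crops, labels): crops are (N, 3, 36, 36) scaled part regions,
    labels are the clustered part categories ti."""
    optimizer = torch.optim.SGD(net.parameters(), lr=lr)   # gradient descent as in the text
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        for crops, labels in loader:
            logits = net.classifier(net.features(crops))    # forward pass (pre-softmax logits)
            loss = loss_fn(logits, labels)                  # error w.r.t. the sample label
            optimizer.zero_grad()
            loss.backward()                                 # backward pass
            optimizer.step()                                # update conv and FC parameters
    return net
```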
Moreover, since the human body part local appearance model is realized with a deep convolutional neural network, it can make full use of a large number of training samples to fit highly varied appearance features, and it also makes the extracted part features more robust.
Fusing deep convolutional neural network feature extraction with the graph model structure enables robust positioning of human body joint points under large motion poses and partial occlusion.
S120: Obtain the occlusion relationship graph model from the human body part local appearance model and the human body part global configuration sample set.
In some embodiments, this step may specifically include:
S121: Establish a connection relationship with loops between the parts of the human body.
By setting the connections between the parts of the human body to a connection relationship with loops, both the occlusion relationships between human body parts and the occlusion relationships between human body parts and the background can be modeled.
S122: Based on the connection relationship with loops between the parts, use the human body part local appearance model and a structured support vector machine, and apply the dual coordinate descent method on the human body part global configuration sample set to train the weight corresponding to the relative position between any two human body parts that have a constraint relationship and the appearance feature weight coefficient of each part, thereby obtaining the occlusion relationship graph model.
FIG. 5 exemplarily shows a schematic diagram of the occlusion relationship graph model. The circles represent the 14 joint point parts of the human body, and the edges represent the connection relationships between the parts. Compared with the tree-structured model of the prior art (for example, Document 4), the connection relationships of the occlusion relationship graph model constructed in the embodiment of the present invention contain loops, i.e., it is a loopy graph model.
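For illustration, such a connection structure can be written down as an edge list over the 14 parts. The patent does not enumerate the edges of FIG. 5, so the part names and the concrete edge list below are purely hypothetical; they only show how adding left/right limb context edges to a kinematic tree produces the loops described above.

```python
# Hypothetical loopy connection structure over the 14 joint-point parts of FIG. 5.
# Part names and the concrete edge list are illustrative; only "14 parts" and
# "connections with loops" are taken from the text.
PARTS = [
    "head", "neck", "l_shoulder", "r_shoulder", "l_elbow", "r_elbow",
    "l_wrist", "r_wrist", "l_hip", "r_hip", "l_knee", "r_knee",
    "l_ankle", "r_ankle",
]
KINEMATIC_EDGES = [
    ("head", "neck"), ("neck", "l_shoulder"), ("neck", "r_shoulder"),
    ("l_shoulder", "l_elbow"), ("l_elbow", "l_wrist"),
    ("r_shoulder", "r_elbow"), ("r_elbow", "r_wrist"),
    ("neck", "l_hip"), ("neck", "r_hip"),
    ("l_hip", "l_knee"), ("l_knee", "l_ankle"),
    ("r_hip", "r_knee"), ("r_knee", "r_ankle"),
]
# Context edges between left/right limb parts that are not physically connected;
# adding them to the kinematic tree introduces loops.
CONTEXT_EDGES = [
    ("l_shoulder", "r_shoulder"), ("l_elbow", "r_elbow"), ("l_wrist", "r_wrist"),
    ("l_hip", "r_hip"), ("l_knee", "r_knee"), ("l_ankle", "r_ankle"),
]
EDGES = KINEMATIC_EDGES + CONTEXT_EDGES
```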
The process of obtaining the occlusion relationship graph model and its parameters is described below with a preferred embodiment.
The parameters of the graph structure model include the weight γij corresponding to the relative position Δij between parts i and j that have a constraint relationship, and the appearance feature weight coefficient ωi of part i. Using the structured support vector machine of Document 9 (Ioannis Tsochantaridis, Thorsten Joachims, Thomas Hofmann and Yasemin Altun (2005), Large Margin Methods for Structured and Interdependent Output Variables, JMLR, Vol. 6, pages 1453–1484), the structural model parameters γij and ωi are trained on the human body part global configuration sample set with the dual coordinate descent method described in that document. If the occlusion state oi of part i is 2, ωi is set to 0; in this case part i is occluded by the background.
The occlusion relationship graph model constructed in the embodiment of the present invention can both express occlusion relationships and retain an inference complexity close to that of a tree-structured model. Moreover, since it is obtained by supervised learning from the training samples, no manual intervention is required.
S130: Determine the human body part local appearance model and the occlusion relationship graph model as the two-dimensional image human body joint point positioning model.
On the basis of the above embodiments, an embodiment of the present invention further provides a two-dimensional image human body joint point positioning method. As shown in FIG. 6, the positioning method may be implemented through steps S140 to S170, in which:
S140: Acquire the image to be detected.
S150: Extract local appearance features of the image to be detected by using the human body part local appearance model.
Specifically, this step may include:
S151: Divide the image to be detected into local image regions.
S152: Use each local image region as the input of the human body part local appearance model to obtain the local appearance features of the image to be detected.
The process of extracting the local appearance features of the image to be detected is described below with a specific example:
The image to be detected is divided into local image regions, each local image region is scaled to 36×36 pixels, and the scaled image is fed into the human body part local appearance model (i.e., the trained deep convolutional neural network). After the 5 convolutional layers and 3 fully connected layers, the probability pi that the local image region looks like part i is obtained; a larger pi indicates that the local image region looks more like part i. The probability pi obtained in this embodiment can be used as a local appearance feature of the image to be detected for subsequent processing.
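A minimal sketch of this extraction step, reusing the PartAppearanceNet sketch above, is given below. The window and stride sizes are assumptions; the 36×36 rescaling and the per-part probabilities pi follow the text.

```python
# Illustrative dense extraction of the local appearance features pi (per-part probabilities).
import torch
import torch.nn.functional as F

def extract_local_appearance(image: torch.Tensor, net, window: int = 36, stride: int = 4):
    """image: (3, H, W) color image to be detected.
    Returns a tensor of shape (num_windows, num_categories) holding, for each local
    image region, the probability that it looks like each part category."""
    patches = []
    _, h, w = image.shape
    for top in range(0, h - window + 1, stride):
        for left in range(0, w - window + 1, stride):
            patch = image[:, top:top + window, left:left + window]
            # Scale every local region to 36x36 pixels before feeding the network.
            patch = F.interpolate(patch.unsqueeze(0), size=(36, 36),
                                  mode="bilinear", align_corners=False)
            patches.append(patch)
    with torch.no_grad():
        return net(torch.cat(patches, dim=0))   # pi for every window and part category
```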
S160: Based on the local appearance features of the image to be detected, use the occlusion relationship graph model and obtain the optimal human body part configuration according to the following formula:
(xi*, yi*, oi*, ti*) = argmax ( Σ γij · Δij + Σ ωi · pi )    (1)
where xi denotes the abscissa of part i; yi denotes the ordinate of part i; oi denotes the occlusion state of part i; ti denotes the category of part i; part j is the parent part of part i; Δij denotes the relative position between parts i and j; γij denotes the weight corresponding to the relative position Δij; ωi denotes the appearance feature weight coefficient of part i; pi denotes the local appearance feature of part i, for example the probability that a local image region looks like part i; and i and j are positive integers.
Formula (1) yields the predicted joint point position (xi*, yi*) at part i; (xi*, yi*) is the joint point of part i located by this embodiment.
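The score maximized in formula (1) can be written directly as a function of a candidate part configuration. The sketch below only evaluates that score for one candidate; the actual maximization over a loopy graph requires an approximate inference procedure that the patent does not spell out, so it is not reproduced here. The data structures, the two-dimensional form of the γij weights, and all names are illustrative assumptions.

```python
# Illustrative evaluation of the formula-(1) score for one candidate part configuration.
def configuration_score(config, parents, gamma, omega, appearance):
    """config: dict part -> (x, y, o, t) candidate state of each part.
    parents: dict part -> parent part (the edges carrying the relative-position terms).
    gamma:   dict (part, parent) -> (wx, wy) weights for the relative position Δij
             (a simplified two-dimensional form of γij).
    omega:   dict part -> appearance feature weight coefficient ωi.
    appearance: function (part, x, y, t) -> local appearance feature pi at that location."""
    score = 0.0
    for i, j in parents.items():
        xi, yi, _, _ = config[i]
        xj, yj, _, _ = config[j]
        dx, dy = xi - xj, yi - yj                      # Δij: relative position of i w.r.t. j
        wx, wy = gamma[(i, j)]
        score += wx * dx + wy * dy                     # Σ γij · Δij
    for i, (xi, yi, oi, ti) in config.items():
        wi = 0.0 if oi == 2 else omega[i]              # ωi is zeroed when occluded by background
        score += wi * appearance(i, xi, yi, ti)        # Σ ωi · pi
    return score
```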
S170: Determine the center position of each human body part region in the optimal human body part configuration as the joint point position of that human body part.
Although the operations of the method of the present invention are described in a specific order in the accompanying drawings, this does not require or imply that the operations must be performed in that specific order, or that all of the illustrated operations must be performed to achieve the desired result. Additionally or alternatively, some steps may be omitted, multiple steps may be combined into one step, and/or one step may be decomposed into multiple steps.
It should be understood that the number of any element in the drawings is illustrative rather than limiting, and that any naming is only for distinction and carries no limiting meaning.
The technical solutions of the present invention have thus been described with reference to the preferred embodiments shown in the drawings. However, those skilled in the art will readily appreciate that the scope of protection of the present invention is obviously not limited to these specific embodiments. Without departing from the principles of the present invention, those skilled in the art may make equivalent changes or substitutions to the relevant technical features, and the technical solutions resulting from such changes or substitutions will fall within the scope of protection of the present invention.

Claims (7)

  1. A method for constructing a two-dimensional image human body joint point positioning model, characterized in that the construction method comprises:
    constructing a human body component local feature training sample set and a human body component global configuration sample set by using color images in which the position coordinates and occlusion states of the human body joint points have been annotated;
    constructing a deep convolutional neural network, and training the deep convolutional neural network with the human body component local feature training sample set to obtain a human body component local appearance model;
    obtaining an occlusion relationship graph model by using the human body component local appearance model and the human body component global configuration sample set;
    determining the human body component local appearance model and the occlusion relationship graph model as the two-dimensional image human body joint point positioning model.
  2. The construction method according to claim 1, characterized in that constructing the human body component local feature training sample set specifically comprises:
    calculating the relative position of any one of the human body components with respect to its parent node;
    clustering the relative positions over all of the color images;
    constructing the human body component local feature training sample set by using the image regions in which the human body components are located and the categories obtained by the clustering.
  3. The construction method according to claim 1, characterized in that constructing the human body component global configuration sample set specifically comprises:
    determining sample labels of the human body components;
    determining the image regions corresponding to all of the human body components;
    composing the human body component global configuration sample set from the sample labels and the image regions.
  4. The construction method according to claim 2 or 3, characterized in that constructing the deep convolutional neural network specifically comprises:
    determining the basic units of the deep convolutional neural network as 5 convolutional layers and 3 fully connected layers;
    taking the image regions in which the components are located as the input of the deep convolutional neural network.
  5. The construction method according to claim 1, characterized in that obtaining the occlusion relationship graph model by using the human body component local appearance model and the human body component global configuration sample set specifically comprises:
    establishing connection relationships with loops between the components of the human body;
    based on the connection relationships with loops between the components of the human body, using the human body component local appearance model and a structured support vector machine, and applying a dual coordinate descent method on the human body component global configuration sample set, training to obtain the weights corresponding to the relative positions between any two of the human body components having a constraint relationship and the appearance-feature weight coefficient of any human body component, thereby obtaining the occlusion relationship graph model.
  6. A two-dimensional image human body joint point positioning method based on the construction method according to any one of claims 1, 2, 3 and 5, characterized in that the positioning method comprises:
    acquiring an image to be detected;
    extracting local appearance features of the image to be detected by using the human body component local appearance model;
    based on the local appearance features of the image to be detected, using the occlusion relationship graph model and obtaining the optimal human body component configuration according to the following formula:
    (x_i*, y_i*, o_i*, t_i*) = argmax( Σ_ij γ_ij · Δ_ij + Σ_i ω_i · p_i );
    where x_i denotes the abscissa of component i; y_i denotes the ordinate of component i; o_i denotes the occlusion state of component i; t_i denotes the type of component i; component j is the parent-node component of component i; Δ_ij denotes the relative position between components i and j; γ_ij denotes the weight corresponding to the relative position Δ_ij; ω_i denotes the appearance-feature weight coefficient of component i; p_i denotes the local appearance feature of component i; i and j are positive integers;
    determining the center position of each human body component region in the optimal human body component configuration as the joint point position of that component.
  7. The positioning method according to claim 6, characterized in that extracting the local appearance features of the image to be detected by using the human body component local appearance model specifically comprises:
    dividing the image to be detected into a plurality of local image regions;
    taking each of the local image regions as the input of the human body component local appearance model to obtain the local appearance features of the image to be detected.
PCT/CN2016/100763 2016-09-29 2016-09-29 Two-dimensional image based human body joint point positioning model construction method, and positioning method WO2018058419A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/CN2016/100763 WO2018058419A1 (en) 2016-09-29 2016-09-29 Two-dimensional image based human body joint point positioning model construction method, and positioning method

Publications (1)

Publication Number Publication Date
WO2018058419A1 true WO2018058419A1 (en) 2018-04-05

Family

ID=61763085

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2016/100763 WO2018058419A1 (en) 2016-09-29 2016-09-29 Two-dimensional image based human body joint point positioning model construction method, and positioning method

Country Status (1)

Country Link
WO (1) WO2018058419A1 (en)

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20080137956A1 (en) * 2006-12-06 2008-06-12 Honda Motor Co., Ltd. Fast Human Pose Estimation Using Appearance And Motion Via Multi-Dimensional Boosting Regression
US20090154796A1 (en) * 2007-12-12 2009-06-18 Fuji Xerox Co., Ltd. Systems and methods for human body pose estimation
CN103246884A (en) * 2013-05-22 2013-08-14 清华大学 Real-time human body action recognizing method and device based on depth image sequence
CN105069413A (en) * 2015-07-27 2015-11-18 电子科技大学 Human body gesture identification method based on depth convolution neural network
CN105117694A (en) * 2015-08-16 2015-12-02 北京航空航天大学 A single-picture human body posture estimation method utilizing rotation invariance characteristics
CN105389569A (en) * 2015-11-17 2016-03-09 北京工业大学 Human body posture estimation method

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
ALEXANDER TOSHEV ET AL.: "DeepPose: Human Pose Estimation via Deep Neural Networks", 2014 IEEE Conference on Computer Vision and Pattern Recognition, 23 June 2014, pages 1653-1660, XP032649134, DOI: 10.1109/CVPR.2014.214 *

Cited By (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109241853A (en) * 2018-08-10 2019-01-18 平安科技(深圳)有限公司 Pedestrian's method for collecting characteristics, device, computer equipment and storage medium
CN109241853B (en) * 2018-08-10 2023-11-24 平安科技(深圳)有限公司 Pedestrian characteristic acquisition method and device, computer equipment and storage medium
CN111291593B (en) * 2018-12-06 2023-04-18 成都品果科技有限公司 Method for detecting human body posture
CN111291593A (en) * 2018-12-06 2020-06-16 成都品果科技有限公司 Method for detecting human body posture
CN109712234B (en) * 2018-12-29 2023-04-07 北京卡路里信息技术有限公司 Three-dimensional human body model generation method, device, equipment and storage medium
CN109712234A (en) * 2018-12-29 2019-05-03 北京卡路里信息技术有限公司 Generation method, device, equipment and the storage medium of three-dimensional (3 D) manikin
CN110457999B (en) * 2019-06-27 2022-11-04 广东工业大学 A method for animal pose behavior estimation and mood recognition based on deep learning and SVM
CN110457999A (en) * 2019-06-27 2019-11-15 广东工业大学 A method for animal pose behavior estimation and mood recognition based on deep learning and SVM
CN113496176A (en) * 2020-04-07 2021-10-12 深圳爱根斯通科技有限公司 Motion recognition method and device and electronic equipment
CN113496176B (en) * 2020-04-07 2024-05-14 深圳爱根斯通科技有限公司 Action recognition method and device and electronic equipment
CN113012229A (en) * 2021-03-26 2021-06-22 北京华捷艾米科技有限公司 Method and device for positioning human body joint points
EP4276742A4 (en) * 2021-09-29 2024-04-24 NEC Corporation LEARNING DEVICE, ESTIMATION DEVICE, LEARNING METHOD, ESTIMATION METHOD AND PROGRAM
CN114926594A (en) * 2022-06-17 2022-08-19 东南大学 Single-view occluded human motion reconstruction method based on self-supervised spatiotemporal motion priors

Similar Documents

Publication Publication Date Title
WO2018058419A1 (en) Two-dimensional image based human body joint point positioning model construction method, and positioning method
CN109325398B (en) Human face attribute analysis method based on transfer learning
CN106548194B (en) Construction method and positioning method of two-dimensional image human joint point positioning model
Li et al. Robust visual tracking based on convolutional features with illumination and occlusion handing
CN106055091B (en) A Hand Pose Estimation Method Based on Depth Information and Correction Method
CN102075686B (en) Robust real-time on-line camera tracking method
CN105069413A (en) Human body gesture identification method based on depth convolution neural network
Chen et al. Learning a deep network with spherical part model for 3D hand pose estimation
CN104616319B (en) Multiple features selection method for tracking target based on support vector machines
Sedai et al. A Gaussian process guided particle filter for tracking 3D human pose in video
Raskin et al. Dimensionality reduction using a Gaussian process annealed particle filter for tracking and classification of articulated body motions
CN109003291A (en) Method for tracking target and device
CN103077535A (en) Target tracking method on basis of multitask combined sparse representation
Lee et al. Human pose tracking using multi-level structured models
CN114049541A (en) Visual scene recognition method based on structural information characteristic decoupling and knowledge migration
CN107330363B (en) Rapid internet billboard detection method
Ikram et al. Real time hand gesture recognition using leap motion controller based on cnn-svm architechture
Pateraki et al. Visual human-robot communication in social settings
CN106127806B (en) RGB-D video target tracking methods based on depth Boltzmann machine cross-module formula feature learning
Lin et al. Robot grasping based on object shape approximation and LightGBM
CN107798329A (en) Adaptive particle filter method for tracking target based on CNN
Yang et al. An efficient tracking system by orthogonalized templates
Kim et al. Human Activity Recognition as Time‐Series Analysis
Singh et al. Simultaneous tracking and action recognition for single actor human actions
Keskin et al. STARS: Sign tracking and recognition system using input–output HMMs

Legal Events

Code  Description
121   EP: the EPO has been informed by WIPO that EP was designated in this application (Ref document number: 16917173; Country of ref document: EP; Kind code of ref document: A1)
NENP  Non-entry into the national phase (Ref country code: DE)
122   EP: PCT application non-entry in European phase (Ref document number: 16917173; Country of ref document: EP; Kind code of ref document: A1)