CN113379930B - Immersive interaction method and device through human body graph and storage medium - Google Patents
Info
- Publication number
- CN113379930B (application CN202110571228.6A)
- Authority
- CN
- China
- Prior art keywords
- human body
- frame
- curve
- person
- heat map
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T19/00—Manipulating 3D models or images for computer graphics
- G06T19/006—Mixed reality
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T7/00—Image analysis
- G06T7/10—Segmentation; Edge detection
- G06T7/12—Edge-based segmentation
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T7/00—Image analysis
- G06T7/10—Segmentation; Edge detection
- G06T7/13—Edge detection
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02T—CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
- Y02T10/00—Road transport of goods or passengers
- Y02T10/10—Internal combustion engine [ICE] based vehicles
- Y02T10/40—Engine management systems
Landscapes
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Theoretical Computer Science (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Computer Graphics (AREA)
- Computer Hardware Design (AREA)
- General Engineering & Computer Science (AREA)
- Software Systems (AREA)
- Image Analysis (AREA)
Abstract
The invention discloses an immersive interaction method based on a human body curve graph. Addressing the technical shortcomings of the existing immersive interaction field, it provides a mode of interaction through the human body contour curve, which differs from human body segmentation technology as follows: human body segmentation identifies all regions of a person (including the trunk and so on), whereas this technique describes only the person's contour curve. Compared with segmentation, the contour curve has better continuity, curve-based action recognition achieves higher accuracy, and the interactive experience is better. Because the human body contour curve is simple and continuous, even a partially occluded body curve can be dynamically restored from the unoccluded portion and the curve of the previous frame. No additional hardware such as sensors is required, and the number of people engaged in immersive interaction is supported entirely by the capability of the algorithm, so the method reduces cost while breaking the limits of the applicable environment.
Description
Technical Field
The invention relates to the technical field of human-computer interaction, in particular to an immersive interaction method and device through a human body curve graph and a storage medium.
Background
Immersive interaction can be used in any scene or field where interaction is needed, for example: 3D fitting, virtual dressing, interactive vending machines, and motion-sensing games. Existing immersive interactive systems mainly achieve human-computer interaction in two ways. In the first, a camera acquires pictures in real time, an algorithm module analyzes each picture to obtain the key-point information of the people in it, and their limb actions are judged from that information. This works well in sparse environments, but the interacting person must not walk around or be occluded by others standing in front; otherwise, even with target-person positioning and tracking, the target's key-point information is very likely to be lost, action recognition becomes inaccurate, and effective interaction fails. In the second, sensors serve as the medium of human-computer interaction. Compared with the first method it is not limited by the presence of multiple people, and action recognition built on it is very accurate, but the additional sensors increase the cost of use and the applicable scenes are very limited; for large, open immersive scenes in particular it is clearly unsuitable.
Disclosure of Invention
According to the invention, people within the field of view can be captured with only one RGB camera; the specific actions they make are recognized, and the computer executes the instructions corresponding to those actions, so that people obtain an immersive scene experience.
The present invention is directed to solving at least one of the problems of the prior art. Therefore, the invention discloses an immersive interaction method through a human body graph, which obtains video information through an RGB camera and comprises the following steps:
step 1, video frame-extraction preprocessing: acquire the real-time video stream of the RGB camera and judge whether each extracted frame is a valid frame; when a frame is judged valid, apply histogram equalization to the valid frame to enhance local contrast, then obtain the matting (cutout) of each person through a pedestrian detection technique, and finally perform edge detection on each person in the valid frame with the Sobel operator and the Canny operator respectively, thereby obtaining two edge-detection result maps for each person;
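By way of illustration, the preprocessing of step 1 can be sketched as follows, assuming OpenCV and a hypothetical `detector` callable that stands in for the pedestrian detection technique (no specific detector is prescribed here):

```python
import cv2

def preprocess_frame(frame, detector):
    """Step-1 sketch: contrast enhancement, per-person matting, dual edge detection."""
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    eq = cv2.equalizeHist(gray)  # histogram equalization enhances local contrast

    results = []
    for (x, y, w, h) in detector(eq):  # hypothetical pedestrian detector
        person = eq[y:y + h, x:x + w]  # cutout (matting) of one person
        # Sobel gradient-magnitude edge map and Canny binary edge map
        sx = cv2.Sobel(person, cv2.CV_64F, 1, 0, ksize=3)
        sy = cv2.Sobel(person, cv2.CV_64F, 0, 1, ksize=3)
        sobel_edges = cv2.convertScaleAbs(cv2.magnitude(sx, sy))
        canny_edges = cv2.Canny(person, 100, 200)
        results.append((sobel_edges, canny_edges, (x, y, w, h)))
    return results
```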
step 2, depicting the human body curve: construct a human body curve generation model, take the two edge-detection result maps of each person from step 1 as the data input of the human body curve generation model, and obtain a contour curve graph of each person from it, the specific formula being:
P = G(φ(F_Sobel(x)), φ(F_Canny(x)))
wherein x denotes an input picture, F_Sobel and F_Canny denote the Sobel and Canny edge-detection results, φ is not the empty-set symbol but denotes a mapping function that maps the binarized pixel values of 0-255 into the range 0-1, and G denotes the curve-generation algorithm (a picture-generation model);
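A minimal sketch of this step, in which `G` is a placeholder for the trained curve generation model (only its inputs and output are specified above, so the callable itself is an assumption):

```python
import numpy as np

def phi(edge_map):
    """Map the binarized 0-255 edge-detection output into the range [0, 1]."""
    return edge_map.astype(np.float32) / 255.0

def draw_body_curve(G, sobel_edges, canny_edges):
    """P = G(phi(F_Sobel(x)), phi(F_Canny(x))): both normalized edge maps
    are fed to the curve generation model G (hypothetical callable)."""
    return G(phi(sobel_edges), phi(canny_edges))
```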
step 3, picture-style conversion and caching: convert each frame of the video into a heat map containing only the human body contour curves, put the per-frame heat map of each person into that person's own cache queue in a tracking manner, and set the queue length so that only the human body heat maps of the most recent preset number of frames are stored;
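The cache queue can be sketched with a bounded deque; the person (track) identifiers are assumed to come from the tracking scheme, and the 100-frame length reflects the default given in the embodiment below:

```python
from collections import deque

BUFFER_LEN = 100  # preset queue length: keep only the most recent frames

heatmap_buffers = {}  # person (track) id -> deque of per-frame heat maps

def cache_heatmap(person_id, heatmap):
    buf = heatmap_buffers.setdefault(person_id, deque(maxlen=BUFFER_LEN))
    buf.append(heatmap)  # the oldest heat map is dropped automatically
```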
and step 4, action recognition: take the heat maps of each person over the 60-100 consecutive frames obtained in step 3 as input to a graph convolutional neural network, the specific formula being:
f_out(v_ti) = Σ_{v_tj ∈ B(v_ti)} (1/Z_ti(v_tj)) · f_in(p(v_ti, v_tj)) · w(v_ti, v_tj)

wherein ti denotes time, v denotes the human body heat map of the consecutive frames, Z denotes the unified regularization applied to the different pictures of the consecutive frames, p denotes the point mapping performed between each pair of consecutive input pictures, w denotes the graph convolutional network parameters, and B denotes the neighborhood determined by the number of consecutive frames.
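By way of illustration, one graph-convolution step of this kind can be sketched as follows; this follows the standard spatial-temporal graph convolution form, and the tensor shapes and names are assumptions rather than details fixed by this disclosure:

```python
import torch

def graph_conv(f_in, A, W):
    """One graph-convolution step over contour-curve heat-map features.

    f_in: (N, C_in, T, V) features for N people, T consecutive frames and
    V graph nodes; A: (V, V) normalized adjacency, playing the role of the
    1/Z weighting over the neighbor set B; W: (C_in, C_out) parameters w.
    """
    # aggregate neighbors: out[..., u] = sum_v A[v, u] * f_in[..., v]
    agg = torch.einsum("nctv,vu->nctu", f_in, A)
    # apply the learned parameters
    return torch.einsum("nctv,cd->ndtv", agg, W)
```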
Further, acquiring the real-time video stream of the RGB camera and judging whether a frame is a valid frame further comprises: performing pixel-level error calculation between each extracted frame and the previous one; if the error is smaller than a first preset value, the current frame is judged to be an invalid frame, and if the error is larger than the first preset value, the current frame is judged to be a valid frame.
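A minimal sketch of this validity check, with the first preset value shown as an assumed threshold:

```python
import cv2
import numpy as np

FIRST_PRESET_VALUE = 8.0  # assumed threshold; tuned per deployment

def is_valid_frame(frame, prev_frame):
    """Pixel-level error between the current and previous extracted frame."""
    error = float(np.mean(cv2.absdiff(frame, prev_frame)))
    return error > FIRST_PRESET_VALUE
```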
Furthermore, the human body curve generation model is trained with a picture-generation technique, using the Sobel-operator result picture, the Canny-operator result picture and a hand-made human body curve graph of the same person as one group of input data.
Further, converting each frame of the video into a heat map containing only the human body contour curve further comprises: blacking out the entire background of the original full-scene image in the valid frame, and then placing the contour curve graph of step 2 at its original coordinate position.
Still further, step 3 further comprises: detecting whether occlusion exists across consecutive frames; if so, using the human body coordinate frame from the last unoccluded frame and the preceding motion-state information as the initial information of a Kalman filter to predict the human body coordinate frame under occlusion, and then using the prediction result as the initial value of the next Kalman-filter iteration, so that even while the person is occluded the coordinate frame keeps the motion state it had before occlusion, until the person leaves the occluding object and the overlap area between the predicted frame and the human detection frame exceeds 0.5. As the completion mechanism for the occluded human body curve, the curve features of the nearest unoccluded frame are used as a substitute and put into the cache queue, and the original occluded curve graph is discarded. If the occluder is also a person, a new dimension is created on the heat map and the occluded person's curve is transferred to the second-dimension heat map, so that the curve features of both the occluder and the occluded person are effectively retained and the spatial information is preserved.
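The prediction step can be sketched with OpenCV's Kalman filter, assuming a constant-velocity model over the coordinate frame (x, y, w, h); the state layout and noise covariances below are illustrative choices, not values from this disclosure:

```python
import cv2
import numpy as np

def make_box_kalman(x, y, w, h):
    """Kalman filter over (x, y, w, h, vx, vy): constant-velocity box model."""
    kf = cv2.KalmanFilter(6, 4)  # 6 state dimensions, 4 measured dimensions
    kf.transitionMatrix = np.eye(6, dtype=np.float32)
    kf.transitionMatrix[0, 4] = 1.0  # x += vx
    kf.transitionMatrix[1, 5] = 1.0  # y += vy
    kf.measurementMatrix = np.eye(4, 6, dtype=np.float32)
    kf.processNoiseCov = np.eye(6, dtype=np.float32) * 1e-2
    kf.measurementNoiseCov = np.eye(4, dtype=np.float32) * 1e-1
    kf.errorCovPost = np.eye(6, dtype=np.float32)
    kf.statePost = np.array([[x], [y], [w], [h], [0], [0]], dtype=np.float32)
    return kf

def track_box(kf, detection=None):
    """Predict the coordinate frame; correct only while a detection exists
    (person not occluded). Under occlusion the prediction itself becomes
    the initial value of the next iteration, preserving the motion state."""
    predicted = kf.predict()
    if detection is not None:
        kf.correct(np.array(detection, dtype=np.float32).reshape(4, 1))
    return predicted[:4].ravel()  # (x, y, w, h)
```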
Further, action recognition is performed by taking the heat maps of each person over the 60-100 consecutive frames obtained in step 3 as input; the number of frames is determined by the duration of the action, and the longer the action is judged to last, the more frames are input.
Furthermore, multi-person simultaneous action recognition is performed through a multi-threading mechanism, so as to complete the immersive interactive task for a specific action.
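A sketch of such a multi-threading mechanism, running one recognition call per tracked person; `recognizer` is a hypothetical callable wrapping the graph convolutional network, and `heatmap_buffers` is the per-person cache sketched earlier:

```python
from concurrent.futures import ThreadPoolExecutor

def recognize_all(recognizer, heatmap_buffers, min_frames=60):
    """Run action recognition for every tracked person in parallel."""
    ready = {pid: list(buf) for pid, buf in heatmap_buffers.items()
             if len(buf) >= min_frames}
    with ThreadPoolExecutor(max_workers=len(ready) or 1) as pool:
        futures = {pid: pool.submit(recognizer, frames)
                   for pid, frames in ready.items()}
    return {pid: fut.result() for pid, fut in futures.items()}
```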
The invention also discloses an immersive interaction device through the human body curve graph, which acquires video information through an RGB camera and comprises: a video frame-extraction preprocessing unit, which acquires the real-time video stream of the RGB camera, judges whether a frame is a valid frame, and, when the frame is judged valid, processes the valid frame with histogram equalization to enhance local contrast, then obtains the matting of each person with a pedestrian detection technique, and finally performs edge detection on each person in the valid frame with the Sobel operator and the Canny operator respectively, so as to obtain two edge-detection result maps for each person; the unit also constructs a human body curve generation model, takes the two edge-detection result maps of each person as the data input of the model, and obtains a contour curve graph of each person from it, the specific formula being: P = G(φ(F_Sobel(x)), φ(F_Canny(x))), where x denotes an input picture, φ is not the empty-set symbol but denotes a mapping function that maps the binarized pixel values of 0-255 into the range 0-1, and G denotes the curve-generation algorithm; a picture-style conversion and cache processing unit, which converts each frame of the video into a heat map containing only the human body contour curves, puts the per-frame heat map of each person into that person's own cache queue in a tracking manner, and sets the queue length so that only the human body heat maps of the most recent preset number of frames are stored; and an action recognition unit, which performs action recognition through a graph convolutional neural network by taking the acquired heat maps of each person over 60-100 consecutive frames as input, the specific formula being:
f_out(v_ti) = Σ_{v_tj ∈ B(v_ti)} (1/Z_ti(v_tj)) · f_in(p(v_ti, v_tj)) · w(v_ti, v_tj)

wherein ti denotes time, v denotes the human body heat map of the consecutive frames, Z denotes the unified regularization applied to the different pictures of the consecutive frames, p denotes the point mapping performed between each pair of consecutive input pictures, w denotes the graph convolutional network parameters, and B denotes the neighborhood determined by the number of consecutive frames.
The invention further discloses a computer device comprising a memory and a processor, the memory storing a computer program, the processor implementing the steps of the method when executing the computer program.
The invention further discloses a computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the steps of the above-mentioned method.
Compared with the prior art, and aiming at the technical defects of the existing immersive interaction field, the invention has the following beneficial effects. It provides a method of interaction through the human body contour curve, which differs from human body segmentation technology as follows: human body segmentation identifies all regions of a person (including trunk, head and limbs), whereas this technique describes only the person's contour curve, which offers better continuity, higher accuracy of curve-based action recognition and a better interactive experience. Because the human body contour curve is simple and continuous, even a partially occluded body curve can be dynamically restored from the unoccluded portion and the curve of the previous frame; no additional hardware such as sensors is needed, and the number of people engaged in immersive interaction is supported entirely by the capability of the algorithm, thereby breaking the limits of the applicable environment while reducing cost.
Drawings
The invention will be further understood from the following description in conjunction with the accompanying drawings. The components in the figures are not necessarily to scale, emphasis instead being placed upon illustrating the principles of the embodiments. In the drawings, like reference numerals designate corresponding parts throughout the different views.
FIG. 1 is a logic flow diagram of the present invention.
Detailed Description
Example one
The core of the invention lies in human body contour curve generation and curve-based action recognition, and the main technical scheme comprises the following steps: video frame-extraction preprocessing, human body curve drawing, picture-style conversion and cache processing, and action recognition.
Firstly, video frame-extraction preprocessing: acquire the real-time video stream of the RGB camera and perform pixel-level error calculation on each extracted frame; if the difference is small, the frame is an invalid frame and is skipped. If it is a valid frame, it is first processed with histogram equalization to enhance local contrast; secondly, a pedestrian detection technique is used to obtain the matting of each person; and finally the Sobel operator and the Canny operator are used to perform edge detection on each person respectively, so that each person yields two edge-detection result maps.
Secondly, human body curve drawing: this step first obtains a human body curve generation model, trained with a picture-generation technique using three pictures of the same person (the Sobel-operator result, the Canny-operator result and a hand-made human body curve graph) as one group of data. Once the model is available, the two result maps of each person from the previous step are used as the data input of the curve generation model, so as to obtain a contour curve graph of each person, the specific formula being:
P = G(φ(F_Sobel(x)), φ(F_Canny(x)))
wherein x denotes an input picture, φ is not the empty-set symbol but denotes a mapping function (mapping the binarized pixel values of 0-255 into the range 0-1), and G denotes the curve-generation algorithm.
Thirdly, picture-style conversion and caching: the entire background of the original full-scene image is first blacked out, and then the contour curve graph of each person generated in the previous step is placed at its original coordinate position. In this way each frame of the video is converted into a heat map containing only human body contour curves (the picture is one-dimensional and can be expanded later; it is called a heatmap), greatly reducing the influence of environmental noise on the algorithm. The per-frame heat map of each person is put into that person's own cache queue in a tracking manner; the queue length defaults to 100, i.e., only the human body heat maps of the most recent 100 frames are stored. This step also contains an occlusion mechanism. Because occlusion is gradual, only a small part of a person is occluded when occlusion begins; the human body coordinate frame at that moment and the preceding motion-state information are used as the initial information of a Kalman filter to predict the coordinate frame under occlusion, and the prediction result is then used as the initial value of the next Kalman-filter iteration, so that even while the person is occluded the coordinate frame keeps the motion state it had before occlusion, until the person leaves the occluding object and the overlap area between the predicted frame and the human detection frame exceeds 0.5. As the completion mechanism for the occluded human body curve, the curve features of the nearest unoccluded frame are used as a substitute and put into the cache queue, and the original occluded curve graph is discarded. If the occluder is also a person, a new dimension is created on the heat map and the occluded person's curve is transferred to the second-dimension heat map; processed in this way, the curve features of both the occluder and the occluded person are effectively retained and the spatial information is preserved.
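The person-on-person occlusion case can be sketched as stacking the occluded person's curve on a new dimension (channel) of the same heat map; the array shapes are illustrative assumptions:

```python
import numpy as np

def merge_occluded(heatmap, occluded_curve):
    """Keep both curves when the occluder is also a person: the occluded
    person's curve becomes a second channel, so the curve features of both
    people and their spatial positions are retained."""
    if heatmap.ndim == 2:                  # (H, W) -> (H, W, 1)
        heatmap = heatmap[..., np.newaxis]
    return np.concatenate([heatmap, occluded_curve[..., np.newaxis]], axis=-1)
```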
Fourthly, action recognition: the algorithm used in this step is a graph convolutional neural network; the heatmaps of each person over 60-100 consecutive frames obtained in the previous step (the number of input frames is determined by the duration of the action) are taken as input for action recognition, the specific formula being:
f_out(v_ti) = Σ_{v_tj ∈ B(v_ti)} (1/Z_ti(v_tj)) · f_in(p(v_ti, v_tj)) · w(v_ti, v_tj)

wherein ti denotes time, v denotes the human body heat map of the consecutive frames, Z denotes the unified regularization applied to the different pictures of the consecutive frames, p denotes the point mapping performed between each pair of consecutive input pictures, w denotes the graph convolutional network parameters, and B denotes the neighborhood determined by the number of consecutive frames.
Through the above steps and a multi-threading mechanism, action recognition can be performed on multiple people at the same time, thereby completing the immersive interactive task for a specific action.
It should also be noted that the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other like elements in a process, method, article, or apparatus comprising the element.
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
Although the invention has been described above with reference to various embodiments, it should be understood that many changes and modifications may be made without departing from its scope. The foregoing detailed description is therefore to be regarded as illustrative rather than limiting, and it is the following claims, including all equivalents, that are intended to define the spirit and scope of this invention. The above examples are to be construed as merely illustrative and not limitative of the remainder of the disclosure. After reading this description, the skilled person may make various changes or modifications to the invention, and such equivalent changes and modifications likewise fall within the scope of the invention defined by the claims.
Claims (9)
1. An immersive interaction method through a human body graph, wherein video information is acquired through an RGB camera, the method characterized by comprising the following steps:
step 1, video frame-extraction preprocessing: acquire the real-time video stream of the RGB camera and judge whether each extracted frame is a valid frame; when a frame is judged valid, apply histogram equalization to the valid frame to enhance local contrast, then obtain the matting of each person through a pedestrian detection technique, and finally perform edge detection on each person in the valid frame with the Sobel operator and the Canny operator respectively, thereby obtaining two edge-detection result maps for each person;
step 2, depicting the human body curve: construct a human body curve generation model, take the two edge-detection result maps of each person from step 1 as the data input of the human body curve generation model, and obtain a contour curve graph of each person from it, the specific formula being:
P = G(φ(F_Sobel(x)), φ(F_Canny(x)))
wherein x denotes an input picture, φ is not the empty-set symbol but denotes a mapping function that maps the binarized pixel values of 0-255 into the range 0-1, and G denotes the curve-generation algorithm;
step 3, picture-style conversion and caching: convert each frame of the video into a heat map containing only the human body contour curves, put the per-frame heat map of each person into that person's own cache queue in a tracking manner, and set the queue length so that only the human body heat maps of the most recent preset number of frames are stored, wherein step 3 further comprises: detecting whether occlusion exists across consecutive frames; if so, using the human body coordinate frame from the last unoccluded frame and the preceding motion-state information as the initial information of a Kalman filter to predict the human body coordinate frame under occlusion, and then using the prediction result as the initial value of the next Kalman-filter iteration, so that even while the person is occluded the coordinate frame keeps the motion state it had before occlusion, until the person leaves the occluding object and the overlap area between the predicted frame and the human detection frame exceeds 0.5; as the completion mechanism for the occluded human body curve, the curve features of the nearest unoccluded frame are used as a substitute and put into the cache queue, and the original occluded curve graph is discarded; if the occluder is also a person, a new dimension is created on the heat map and the occluded person's curve is transferred to the second-dimension heat map, so that the curve features of both the occluder and the occluded person are effectively retained and the spatial information is preserved;
and step 4, action recognition: take the heat maps of each person over the 60-100 consecutive frames obtained in step 3 as input to a graph convolutional neural network, the specific formula being:
f_out(v_ti) = Σ_{v_tj ∈ B(v_ti)} (1/Z_ti(v_tj)) · f_in(p(v_ti, v_tj)) · w(v_ti, v_tj)

wherein ti denotes time, v denotes the human body heat map of the consecutive frames, Z denotes the unified regularization applied to the different pictures of the consecutive frames, p denotes the point mapping performed between each pair of consecutive input pictures, w denotes the graph convolutional network parameters, and B denotes the neighborhood determined by the number of consecutive frames.
2. The immersive interaction method through a human body graph as claimed in claim 1, wherein acquiring the real-time video stream of the RGB camera and judging whether a frame is a valid frame further comprises: performing pixel-level error calculation between each extracted frame and the previous one; if the error is smaller than a first preset value, the current frame is judged to be an invalid frame, and if the error is larger than the first preset value, the current frame is judged to be a valid frame.
3. The immersive interaction method through a human body graph as claimed in claim 1, wherein the human body curve generation model is trained with a picture-generation technique, using the Sobel-operator result picture, the Canny-operator result picture and a hand-made human body curve graph of the same person as one group of input data.
4. The immersive interaction method through a human body graph as claimed in claim 1, wherein converting each frame of the video into a heat map containing only the human body contour curve further comprises: blacking out the entire background of the original full-scene image in the valid frame, and then placing the contour curve graph of step 2 at its original coordinate position.
5. The immersive interaction method through a human body graph as claimed in claim 1, wherein action recognition is performed by taking the heat maps of each person over the 60-100 consecutive frames obtained in step 3 as input, the number of frames being determined by the duration of the action; the longer the action is judged to last, the more frames are input.
6. The immersive interaction method through a human body graph as claimed in claim 1, wherein multi-person simultaneous action recognition is performed through a multi-threading mechanism, so as to complete the immersive interactive task for a specific action.
7. An immersive interaction device through a human body graph, wherein video information is acquired through an RGB camera, comprising: a video frame-extraction preprocessing unit, which acquires the real-time video stream of the RGB camera, judges whether a frame is a valid frame, and, when the frame is judged valid, processes the valid frame with histogram equalization to enhance local contrast, then obtains the matting of each person with a pedestrian detection technique, and finally performs edge detection on each person in the valid frame with the Sobel operator and the Canny operator respectively, so as to obtain two edge-detection result maps for each person; the unit also constructs a human body curve generation model, takes the two edge-detection result maps of each person as the data input of the model, and obtains a contour curve graph of each person from it, the specific formula being: P = G(φ(F_Sobel(x)), φ(F_Canny(x))), where x denotes an input picture, φ is not the empty-set symbol but denotes a mapping function that maps the binarized pixel values of 0-255 into the range 0-1, and G denotes the curve-generation algorithm; a picture-style conversion and cache processing unit, which converts each frame of the video into a heat map containing only the human body contour curves, puts the per-frame heat map of each person into that person's own cache queue in a tracking manner, and sets the queue length so that only the human body heat maps of the most recent preset number of frames are stored, wherein the unit further: detects whether occlusion exists across consecutive frames, and if so, uses the human body coordinate frame from the last unoccluded frame and the preceding motion-state information as the initial information of a Kalman filter to predict the human body coordinate frame under occlusion, and then uses the prediction result as the initial value of the next Kalman-filter iteration, so that even while the person is occluded the coordinate frame keeps the motion state it had before occlusion, until the person leaves the occluding object and the overlap area between the predicted frame and the human detection frame exceeds 0.5; as the completion mechanism for the occluded human body curve, the curve features of the nearest unoccluded frame are used as a substitute and put into the cache queue, and the original occluded curve graph is discarded; if the occluder is also a person, a new dimension is created on the heat map and the occluded person's curve is transferred to the second-dimension heat map, so that the curve features of both the occluder and the occluded person are effectively retained and the spatial information is preserved; and an action recognition unit, which performs action recognition through a graph convolutional neural network by taking the acquired heat maps of each person over 60-100 consecutive frames as input, the specific formula being:
f_out(v_ti) = Σ_{v_tj ∈ B(v_ti)} (1/Z_ti(v_tj)) · f_in(p(v_ti, v_tj)) · w(v_ti, v_tj)

wherein ti denotes time, v denotes the human body heat map of the consecutive frames, Z denotes the unified regularization applied to the different pictures of the consecutive frames, p denotes the point mapping performed between each pair of consecutive input pictures, w denotes the graph convolutional network parameters, and B denotes the neighborhood determined by the number of consecutive frames.
8. A computer device comprising a memory and a processor, the memory storing a computer program, characterized in that the processor realizes the steps of the method of any one of claims 1 to 6 when executing the computer program.
9. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the method of any one of claims 1 to 6.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110571228.6A CN113379930B (en) | 2021-05-25 | 2021-05-25 | Immersive interaction method and device through human body graph and storage medium |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110571228.6A CN113379930B (en) | 2021-05-25 | 2021-05-25 | Immersive interaction method and device through human body graph and storage medium |
Publications (2)
Publication Number | Publication Date |
---|---|
CN113379930A CN113379930A (en) | 2021-09-10 |
CN113379930B (en) | 2023-03-24
Family
ID=77571789
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110571228.6A Active CN113379930B (en) | 2021-05-25 | 2021-05-25 | Immersive interaction method and device through human body graph and storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113379930B (en) |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110610453A (en) * | 2019-09-02 | 2019-12-24 | Tencent Technology (Shenzhen) Co., Ltd. | Image processing method and device and computer readable storage medium |
CN112528966A (en) * | 2021-02-05 | 2021-03-19 | 华东交通大学 | A method, device and medium for intelligent monitoring and identification of the surrounding environment of a payer |
Family Cites Families (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102521582B (en) * | 2011-12-28 | 2013-09-25 | Zhejiang University | Human upper body detection and splitting method applied to low-contrast video |
US20190247650A1 (en) * | 2018-02-14 | 2019-08-15 | Bao Tran | Systems and methods for augmenting human muscle controls |
CN109758756B (en) * | 2019-02-28 | 2021-03-23 | China Institute of Sport Science, General Administration of Sport of China | Gymnastics video analysis method and system based on 3D camera |
CN111563446B (en) * | 2020-04-30 | 2021-09-03 | Zhengzhou University of Light Industry | Human-machine interaction safety early warning and control method based on digital twin |
- 2021-05-25: Application CN202110571228.6A filed; granted as patent CN113379930B (status: Active)
Patent Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110610453A (en) * | 2019-09-02 | 2019-12-24 | Tencent Technology (Shenzhen) Co., Ltd. | Image processing method and device and computer readable storage medium |
CN112528966A (en) * | 2021-02-05 | 2021-03-19 | 华东交通大学 | A method, device and medium for intelligent monitoring and identification of the surrounding environment of a payer |
Also Published As
Publication number | Publication date |
---|---|
CN113379930A (en) | 2021-09-10 |
Similar Documents
Publication | Title
---|---
JP4699564B2 (en) | Visual background extractor
US20170223234A1 (en) | Systems and methods for extracting foreground video
CN109472191B (en) | Pedestrian re-identification and tracking method based on space-time context
CN103279952B (en) | A kind of method for tracking target and device
CN112597941A (en) | Face recognition method and device and electronic equipment
CN110490073A (en) | Object detection method, device, equipment and storage medium
JP2012517647A (en) | Video matting based on foreground-background constraint propagation
CN111428664B (en) | Computer vision real-time multi-person gesture estimation method based on deep learning technology
CN112883940A (en) | Silent in-vivo detection method, silent in-vivo detection device, computer equipment and storage medium
CN110674824A (en) | Finger vein segmentation method and device based on R2U-Net and storage medium
CN112733823B (en) | Method and device for extracting key frame for gesture recognition and readable storage medium
CN112070035A (en) | Target tracking method and device based on video stream and storage medium
CN112927127A (en) | Video privacy data fuzzification method running on edge device
CN114144812A (en) | Method for generating palm/finger foreground mask
Ghanbari et al. | Contour-based video inpainting
CN112766028A (en) | Face fuzzy processing method and device, electronic equipment and storage medium
CN113379930B (en) | Immersive interaction method and device through human body graph and storage medium
CN118609061A (en) | Security inspection equipment control method, device, equipment and storage medium based on AI recognition
CN113657137A (en) | Data processing method and device, electronic equipment and storage medium
KR101994311B1 (en) | Pose recognition apparatus and method using the same
Lee et al. | Multisensor fusion-based object detection and tracking using active shape model
JP4750758B2 (en) | Attention area extraction method, attention area extraction device, computer program, and recording medium
CN113971671B (en) | Instance segmentation method, device, electronic device and storage medium
CN112085025B (en) | Object segmentation method, device and equipment
JP7253967B2 (en) | Object matching device, object matching system, object matching method, and computer program
Legal Events
Code | Title
---|---
PB01 | Publication
SE01 | Entry into force of request for substantive examination
GR01 | Patent grant