CN116894894A - Method, apparatus, device and storage medium for determining motion of avatar
- Publication number: CN116894894A
- Application number: CN202310728732.1A
- Authority: CN (China)
- Prior art keywords: target, key point, target image, image, point information
- Legal status: Granted (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G06T13/40: 3D [Three Dimensional] animation of characters, e.g. humans, animals or virtual beings
- G06V10/462: Salient features, e.g. scale invariant feature transforms [SIFT]
- G06V10/82: Arrangements for image or video recognition or understanding using pattern recognition or machine learning, using neural networks
- G06V40/20: Movements or behaviour, e.g. gesture recognition
Abstract
The disclosure provides a method, apparatus, device and storage medium for determining the action of an avatar. It relates to the technical field of artificial intelligence, in particular to the technical fields of computer vision, augmented reality, virtual reality, deep learning and the like, and can be applied to scenarios such as the metaverse and digital humans. The action determining method includes: for a target image in a video of a real figure, processing the target image to obtain initial key point information of the real figure corresponding to the target image; constructing a total objective function based on the initial key point information and the to-be-determined target key point information of the avatar; determining the target key point information based on the total objective function and a target constraint condition corresponding to the real figure; and determining the action of the avatar based on the target key point information. The present disclosure can improve the action accuracy of the avatar.
Description
Technical Field
The disclosure relates to the technical field of artificial intelligence, in particular to the technical fields of computer vision, augmented reality, virtual reality, deep learning and the like; it can be applied to scenarios such as the metaverse and digital humans, and in particular provides a method, apparatus, device and storage medium for determining the action of an avatar.
Background
Motion capture (also called dynamic capture) typically refers to transcribing the movements of a real actor (also referred to as the person behind the avatar) onto an avatar (e.g., a virtual person) in a three-dimensional (3D) game or animation.
Disclosure of Invention
The present disclosure provides an action determining method, apparatus, device and storage medium for an avatar.
According to an aspect of the present disclosure, there is provided a method for determining an action of an avatar, including: for a target image in a video of a real figure, processing the target image to obtain initial key point information of the real figure corresponding to the target image; constructing a total objective function based on the initial key point information and the to-be-determined target key point information of the avatar, wherein the total objective function characterizes error information between the initial key point information and the target key point information; determining the target key point information based on the total objective function and a target constraint condition corresponding to the real figure; and determining the action of the avatar based on the target key point information.
According to another aspect of the present disclosure, there is provided an apparatus for determining an action of an avatar, including: an acquisition module configured to, for a target image in a video of a real figure, process the target image to obtain initial key point information of the real figure corresponding to the target image; a construction module configured to construct a total objective function according to the initial key point information and the to-be-determined target key point information of the avatar, wherein the total objective function characterizes error information between the initial key point information and the target key point information; a solving module configured to determine the target key point information according to the total objective function and a target constraint condition corresponding to the real figure; and a determining module configured to determine the action of the avatar according to the target key point information.
According to another aspect of the present disclosure, there is provided an electronic device including: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of the above aspects.
According to another aspect of the present disclosure, there is provided a non-transitory computer readable storage medium storing computer instructions for causing the computer to perform the method according to any one of the above aspects.
According to another aspect of the present disclosure, there is provided a computer program product comprising a computer program which, when executed by a processor, implements a method according to any of the above aspects.
According to the technical solution of the present disclosure, the action accuracy of the avatar can be improved.
It should be understood that the description in this section is not intended to identify key or critical features of the embodiments of the disclosure, nor is it intended to be used to limit the scope of the disclosure. Other features of the present disclosure will become apparent from the following specification.
Drawings
The drawings are for a better understanding of the present solution and are not to be construed as limiting the present disclosure. Wherein:
FIG. 1 is a schematic diagram according to a first embodiment of the present disclosure;
fig. 2 is a schematic diagram of an application scenario provided according to an embodiment of the present disclosure;
fig. 3 is a schematic overall architecture of an action determining method of an avatar provided according to an embodiment of the present disclosure;
FIG. 4 is a schematic diagram according to a second embodiment of the present disclosure;
FIG. 5 is a schematic diagram according to a third embodiment of the present disclosure;
FIG. 6 is a schematic diagram according to a fourth embodiment of the present disclosure;
fig. 7 is a schematic view of an electronic device for implementing an action determining method of an avatar according to an embodiment of the present disclosure.
Detailed Description
Exemplary embodiments of the present disclosure are described below in conjunction with the accompanying drawings, which include various details of the embodiments of the present disclosure to facilitate understanding, and should be considered as merely exemplary. Accordingly, one of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
Taking a virtual person as an example: in the related art, a video of the real actor is typically processed to obtain 3D position coordinates of the actor's key points, the 3D position coordinates are input into a driving model, and the output is the action of the virtual person.
To improve the action accuracy of the virtual person, effort is usually spent on improving the accuracy of the driving model, that is, a large number of training samples are used to train the driving model.
However, increasing the data volume in this way is costly, and model performance plateaus once the data volume grows beyond a certain point.
In order to improve the motion accuracy of the avatar, the present disclosure provides the following embodiments.
Fig. 1 is a schematic diagram according to a first embodiment of the present disclosure. The present embodiment provides a method for determining an action of an avatar, the method including:
101. For a target image in a video of a real figure, process the target image to obtain initial key point information of the real figure corresponding to the target image.
102. Construct a total objective function based on the initial key point information and the to-be-determined target key point information of the avatar, wherein the total objective function characterizes error information between the initial key point information and the target key point information.
103. Determine the target key point information based on the total objective function and a target constraint condition corresponding to the real figure.
104. Determine the action of the avatar based on the target key point information.
The real figure is the real person whose motion drives the action of the avatar; it is typically a real actor, sometimes called the person behind the avatar.
The target image is an image in the video to be processed; each image (frame) in the video can serve as a target image.
For distinction, the key point information of the real figure is referred to as initial key point information, and the key point information of the avatar is referred to as target key point information.
A key point may also be referred to as a joint point. Taking a human body as an example, one or more body part points can be preset as key points.
The key point information may include position information of the key points and/or angle information of the key points. Taking a three-dimensional avatar as an example, the position information of a key point is its 3D position coordinates. The angle information of a key point may be the angle between the line connecting the key point to a set reference point and a set reference line; this angle may specifically be expressed as Euler angles, i.e., three-dimensional angle information.
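Purely as an illustration, the key point information described above could be held in a structure like the following; the field names and array shapes are assumptions for this sketch, not part of the disclosure:

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class KeypointInfo:
    """Key point information for one frame (hypothetical layout).

    position: 3D position coordinates of each key point, shape (num_joints, 3).
    angles:   Euler angles of each key point, shape (num_joints, 3),
              i.e. the three-dimensional angle information described above.
    """
    position: np.ndarray  # (num_joints, 3) 3D position coordinates
    angles: np.ndarray    # (num_joints, 3) Euler angles in radians
```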
Unlike the related art, where the action of the avatar is determined by a driving model, in this embodiment the target key point information of the avatar is solved directly, and the action of the avatar is then determined based on the target key point information.
The total objective function is constructed from the initial key point information, a known quantity, and the target key point information, an unknown quantity (the variable); the target key point information is solved by minimizing the total objective function under certain constraint conditions.
The total objective function characterizes error information between the initial key point information and the target key point information; it may, for example, be constructed based on the absolute value of the difference between the two.
Taking target key point angle information as an example: after the target key point angle information is determined, the action of the avatar can be determined using related techniques. For example, a correspondence between key point angle information and actions may be preconfigured, and the action of the avatar determined based on this correspondence and the target key point angle information.
In this embodiment, the target key point information is determined based on the total objective function and the constraint condition corresponding to the real figure, and the action of the avatar is determined from the target key point information. Since no driving model is required, this reduces cost and improves action accuracy compared with determining the action via a driving model.
In order to better understand the embodiments of the present disclosure, application scenarios to which the embodiments of the present disclosure may be applied are described.
Fig. 2 is a schematic diagram of an application scenario provided in an embodiment of the present disclosure. The scene comprises: user terminal 201 and server 202, user terminal 201 may include: personal computers (Personal Computer, PCs), cell phones, tablet computers, notebook computers, smart wearable devices, and the like. The server 202 may be a cloud server or a local server, and the user terminal 201 and the server 202 may communicate using a communication network, for example, a wired network and/or a wireless network.
The user terminal 201 may transmit a video of the real figure to the server 202, and the server 202 determines the action of the avatar based on the video. This process may be referred to as offline motion capture. Information the process needs to display, such as the finalized actions, may be displayed by the user terminal 201. It will be appreciated that, if the user terminal has the corresponding capability, the action of the avatar may also be determined locally at the user terminal based on the video.
Specifically, the target key point information of the virtual person can be determined based on the video of the real figure, a process that may be called global optimization; the action of the avatar is then determined based on the target key point information.
As shown in Fig. 3, the overall architecture for determining the action of an avatar based on a video may include: the video of the real figure is processed by the touchdown detection network 301 to determine the foot touchdown state of the real figure, and by the body perception network 302 to determine the initial key point information of the real figure. Global optimization 303 is performed according to the foot touchdown state and the initial key point information to obtain the target key point information; the action of the avatar may then be determined from the solved target key point information based on a predetermined correspondence between the avatar's key point information and actions. A minimal sketch of this pipeline follows.
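The sketch below assumes the two networks, the solver, and the action lookup are available as callables; all names are placeholders for this illustration rather than components defined by the disclosure:

```python
def determine_avatar_actions(frames, touchdown_net, body_net, solve_global, action_lookup):
    """Sketch of the Fig. 3 architecture: per-frame perception networks
    followed by global optimization over the whole video."""
    contacts = [touchdown_net(frame) for frame in frames]   # foot touchdown state (1/0) per frame
    initial_info = [body_net(frame) for frame in frames]    # initial key point information per frame
    target_info = solve_global(initial_info, contacts)      # global optimization (303)
    return [action_lookup(info) for info in target_info]    # key point info -> avatar action
```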
In combination with the above application scenario, the present disclosure further provides the following embodiments.
Fig. 4 is a schematic diagram of a second embodiment of the present disclosure, which provides a method for determining an action of an avatar, the method including:
401. Acquire a video of the real figure.
402. For a target image in the video, process the target image using a pre-trained body perception network to obtain the initial key point information corresponding to the target image.
The body perception network is a pre-trained deep neural network whose input is an image and whose output is the key point information of the real figure contained in the image.
The key point information of the real figure is referred to as the initial key point information.
There are usually multiple target images, and the initial key point information corresponding to each target image is obtained after processing each target image with the body perception network.
Taking a three-dimensional avatar as an example, the key point information may include: three-dimensional position coordinates and/or three-dimensional angle information of the key points.
In this embodiment, the initial key point information is obtained through the pre-trained body perception network; the strong performance of the body perception network thus improves the accuracy of the initial key point information and, in turn, the accuracy of the determined avatar action.
403. Process the target image using a pre-trained touchdown detection network to determine the foot touchdown state of the real figure in the target image.
The foot touchdown state indicates whether the foot of the real figure touches the ground; it can be represented by 1 or 0, where 1 denotes touchdown and 0 denotes no touchdown.
The touchdown detection network is likewise a pre-trained deep neural network whose input is an image and whose output is the foot touchdown state of the real figure contained in the image.
Both the body perception network and the touchdown detection network can be trained using related techniques.
In this embodiment, the foot touchdown state is obtained through the pre-trained touchdown detection network; the strong performance of the touchdown detection network thus improves the accuracy of the foot touchdown state and, in turn, the accuracy of the determined avatar action.
404. Construct a total objective function based on the initial key point information and the to-be-determined target key point information of the avatar.
The target key point information is the key point information of the avatar.
The target key point information is an unknown quantity, i.e., the quantity to be solved.
The initial key point information, a known quantity, may be obtained through the body perception network.
There are usually multiple target images. The initial key point information corresponding to each target image is obtained, a sub-objective function corresponding to each target image is constructed from its initial key point information, and the total objective function is then constructed from the sub-objective functions.
That is, the target image comprises a plurality of target images; for each target image, a sub-objective function corresponding to that target image is constructed based on the initial key point information corresponding to that target image and the target key point information; and the total objective function is constructed based on the sub-objective functions corresponding to the target images.
When constructing the total objective function from the sub-objective functions, the sub-objective functions may simply be added together.
In this embodiment, by first constructing a sub-objective function for each target image and then building the total objective function from them, the total objective function draws on the information of every target image, i.e., on the global information of the video. Global optimization can then be performed to obtain the target key point information, improving its accuracy and, in turn, the accuracy of the determined avatar action.
Each sub-objective function may be constructed from two parts: one based on the unknown target key point information together with the known initial key point information, and the other based on the known initial key point information alone.
That is, for each target image, perform: construct a first function based on the initial key point information corresponding to that target image and the target key point information; construct a second function based on the initial key point information corresponding to that target image; and construct the sub-objective function based on the first function and the second function.
Denoting the first function by F1, the second function by F2, and the sub-objective function by F3, the calculation formula is F3 = F1 + F2.
In this embodiment, the first function is constructed from the initial and target key point information, the second function from the initial key point information, and the sub-objective function from the first and second functions, so the optimization objective is built from different angles, improving the accuracy of the sub-objective function and hence the accuracy of the action.
Further, the second function F2 may be constructed from several error functions; for example, with a first error function G1, a second error function G2, and a third error function G3, F2 = G1 + G2 + G3.
Regarding the three error functions above: the initial key point information includes two-dimensional position coordinates of the key points of the real figure, three-dimensional position coordinates of the key points, and three-dimensional angle information. Constructing the second function based on the initial key point information corresponding to each target image then includes: projecting the two-dimensional position coordinates to obtain three-dimensional projection coordinates, and constructing the first error function from the three-dimensional projection coordinates and the three-dimensional position coordinates; determining the current velocity corresponding to the target image based on the three-dimensional angle information, and constructing the second error function from this current velocity and the previous velocity corresponding to the previous image of the target image; determining the current acceleration corresponding to the target image based on the three-dimensional angle information, and constructing the third error function from this current acceleration and the previous acceleration corresponding to the previous image of the target image; and constructing the second function based on the first, second, and third error functions.
In this embodiment, error functions of different dimensions are constructed from the initial key point information, and a second function containing multi-dimensional information is built from them. The total objective function therefore also contains multi-dimensional information, which improves the global optimization, the accuracy of the target key point information, and ultimately the action accuracy of the avatar.
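As a concrete illustration of the velocity and acceleration terms used by the second and third error functions, a finite-difference sketch follows; the frame rate, array layout, and function name are assumptions for illustration, not part of the disclosure:

```python
import numpy as np

def angle_derivatives(angles, fps=30.0):
    """Approximate key point velocity and acceleration from per-frame
    Euler angles of shape (num_frames, num_joints, 3) by finite differences."""
    dt = 1.0 / fps
    velocity = np.diff(angles, axis=0) / dt        # (num_frames - 1, num_joints, 3)
    acceleration = np.diff(velocity, axis=0) / dt  # (num_frames - 2, num_joints, 3)
    return velocity, acceleration
```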
Specifically, the total objective function may be calculated as:

F = Σ_{k=1..N} F_k

where F is the total objective function, F_k is the sub-objective function corresponding to the k-th target image, and N is the total number of target images.

F_k may be calculated as:

F_k = F1 + G1 + G2 + G3

where

F1 = |q_k - q̂_k|
G1 = |p_k^proj - p_k^3D|
G2 = |v_k - v_{k-1}|
G3 = |a_k - a_{k-1}|

Here, q_k is the angle information of the key points of the real figure in the k-th target image (the initial key point angle information), which may be obtained through the body perception network; q̂_k is the target key point angle information corresponding to the k-th target image, i.e., the unknown quantity to be solved; p_k^proj is the three-dimensional projection coordinates of the key points of the real figure in the k-th target image, which may be obtained from the two-dimensional position coordinates of those key points on the target image and the camera parameters; p_k^3D is the three-dimensional position coordinates of the key points of the real figure in the k-th target image (the initial key point coordinates), which may be obtained through the body perception network; v_k is the velocity of the key points of the real figure in the k-th target image (the initial key point velocity) and v_{k-1} is that of the (k-1)-th target image; a_k is the acceleration of the key points of the real figure in the k-th target image (the initial key point acceleration) and a_{k-1} is that of the (k-1)-th target image. The initial key point velocity is obtained by differentiating the initial key point angle information, and the initial key point acceleration is obtained by differentiating the initial key point velocity (i.e., by differentiating the angle information twice); |·| denotes the absolute value operation.
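For illustration only, the formulas above could be assembled as follows; the use of summed absolute values and the function names are assumptions of this sketch, and the disclosure does not prescribe a particular implementation:

```python
import numpy as np

def sub_objective(q_init, q_target, p_proj, p3d, v, v_prev, a, a_prev):
    """F_k = F1 + G1 + G2 + G3 for the k-th target image."""
    f1 = np.abs(q_init - q_target).sum()  # angle error vs. target key point angles
    g1 = np.abs(p_proj - p3d).sum()       # projected vs. perceived 3D coordinates
    g2 = np.abs(v - v_prev).sum()         # velocity change between consecutive frames
    g3 = np.abs(a - a_prev).sum()         # acceleration change between consecutive frames
    return f1 + g1 + g2 + g3

def total_objective(per_frame_terms):
    """F = sum of F_k over all N target images; per_frame_terms is a list of
    dicts holding the arguments of sub_objective for each frame."""
    return sum(sub_objective(**terms) for terms in per_frame_terms)
```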
405. Construct target constraint conditions corresponding to the real figure based on the foot touchdown state and the initial key point information.
A touchdown constraint condition may be constructed based on the foot touchdown state and the foot key point velocity of the real figure; a dynamics constraint condition may be constructed based on the initial key point information of the real figure; the touchdown constraint condition and the dynamics constraint condition together serve as the target constraint conditions.
That is, the target constraint conditions include: the touchdown constraint C1 and the dynamics constraint C2.
The expressions of C1 and C2 may be:

Touchdown constraint C1:

v_foot = 0, when contact = 1

where v_foot is the foot key point velocity of the real figure, which may be obtained by differentiating the foot key point position coordinates p_foot; the foot key point position coordinates may be obtained through the body perception network. contact = 1 indicates that the foot touchdown state is touchdown, which may be obtained through the touchdown detection network.

Dynamics constraint C2:

M·q̈ + B·q̇ + G = τ + Jᵀ·Fc

where M is the mass matrix; B is the centrifugal and Coriolis force matrix; G is the gravity matrix; τ is the joint torque; J is the Jacobian matrix; Fc is the applied external force; τ and Fc may be obtained from a dynamics network; and q is the initial key point angle information, with q̇ and q̈ its first and second time derivatives.
In this embodiment, the target constraint conditions include both a touchdown constraint and a dynamics constraint; constructing multiple kinds of constraints improves the accuracy of the target key point information and hence the accuracy of the determined avatar action.
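For illustration, the two constraints can be written as residual functions for a constrained solver; every argument name below is a placeholder for a quantity produced by the body perception, touchdown detection, or dynamics networks, not an identifier from the disclosure:

```python
import numpy as np

def touchdown_residual(v_foot, contact):
    """C1: foot key point velocity must be zero whenever contact == 1."""
    return contact * v_foot  # zero when airborne, or when a grounded foot is still

def dynamics_residual(M, q_ddot, B, q_dot, G, tau, J, Fc):
    """C2: M*q'' + B*q' + G = tau + J^T * Fc, written as a residual."""
    return M @ q_ddot + B @ q_dot + G - (tau + J.T @ Fc)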
406. Determine the target key point information based on the total objective function and the target constraint conditions.
After the total objective function and the target constraint conditions are constructed, the target key point information can be solved by minimizing the total objective function subject to the target constraint conditions.
For example, a nonlinear solver may be used to solve for the target key point information.
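The disclosure does not name a particular solver; as one hypothetical choice, SciPy's SLSQP method accepts equality constraints of exactly this form:

```python
from scipy.optimize import minimize

def solve_target_keypoints(f_total, q0, constraint_fns):
    """Minimize the total objective subject to equality constraints.

    f_total:        callable mapping a flat vector of target key point
                    angles (all frames) to the scalar total objective F.
    q0:             initial guess, e.g. the initial key point angles.
    constraint_fns: callables whose residuals must equal zero (C1, C2).
    """
    constraints = [{"type": "eq", "fun": fn} for fn in constraint_fns]
    result = minimize(f_total, q0, method="SLSQP", constraints=constraints)
    return result.x  # solved target key point information
```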
407. Determine the action of the avatar based on the target key point information.
For example, if the target key point information is the key point angle information of the avatar, a correspondence between key point angle information and actions may be preconfigured, and the action of the avatar determined based on this correspondence and the solved key point angle information of the avatar.
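A toy sketch of such a correspondence follows; the table contents and the nearest-neighbour lookup are invented for illustration, and a real system would use the avatar's full joint configuration:

```python
import numpy as np

# Hypothetical correspondence table: representative angle vectors -> action labels.
ACTION_TABLE = {
    "wave":  np.array([0.0, 1.2, 0.3]),
    "squat": np.array([1.1, 0.0, 0.9]),
}

def lookup_action(target_angles):
    """Pick the action whose reference angles are nearest to the solved
    target key point angles (nearest-neighbour lookup)."""
    return min(ACTION_TABLE,
               key=lambda name: np.linalg.norm(ACTION_TABLE[name] - target_angles))
```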
Fig. 5 is a schematic diagram according to a third embodiment of the present disclosure. This embodiment provides an apparatus for determining an action of an avatar; as shown in Fig. 5, the apparatus 500 includes: an acquisition module 501, a construction module 502, a solving module 503 and a determining module 504.
The acquisition module 501 is configured to, for a target image in a video of a real figure, process the target image to obtain initial key point information of the real figure corresponding to the target image. The construction module 502 is configured to construct a total objective function according to the initial key point information and the to-be-determined target key point information of the avatar, where the total objective function characterizes error information between the initial key point information and the target key point information. The solving module 503 is configured to determine the target key point information according to the total objective function and a target constraint condition corresponding to the real figure. The determining module 504 is configured to determine the action of the avatar according to the target key point information.
In this embodiment, the target key point information is determined based on the total objective function and the constraint condition corresponding to the real figure, and the action of the avatar is determined from it. Since no driving model is required, this reduces cost and improves action accuracy compared with determining the action via a driving model.
In some embodiments, the target image comprises a plurality of target images, and the construction module 502 is further configured to: for each target image of the plurality of target images, construct a sub-objective function corresponding to that target image based on the initial key point information corresponding to that target image and the target key point information; and construct the total objective function based on the sub-objective functions corresponding to the target images.
In this embodiment, the total objective function, built from per-image sub-objective functions, draws on the information of every target image, i.e., on the global information of the video; global optimization then yields more accurate target key point information and hence a more accurate avatar action.
In some embodiments, the construction module 502 is further configured to, for each target image: construct a first function based on the initial key point information corresponding to that target image and the target key point information; construct a second function based on the initial key point information corresponding to that target image; and construct the sub-objective function based on the first function and the second function.
In this embodiment, building the first function from the initial and target key point information, the second function from the initial key point information, and the sub-objective function from both means the optimization objective is constructed from different angles, improving the accuracy of the sub-objective function and hence the accuracy of the action.
In some embodiments, the initial key point information includes: two-dimensional position coordinates of the key points of the real figure, three-dimensional position coordinates of the key points of the real figure, and three-dimensional angle information.
The construction module 502 is further configured to: project the two-dimensional position coordinates to obtain three-dimensional projection coordinates, and construct a first error function from the three-dimensional projection coordinates and the three-dimensional position coordinates; determine the current velocity corresponding to the target image based on the three-dimensional angle information, and construct a second error function from this current velocity and the previous velocity corresponding to the previous image of the target image; determine the current acceleration corresponding to the target image based on the three-dimensional angle information, and construct a third error function from this current acceleration and the previous acceleration corresponding to the previous image of the target image; and construct the second function based on the first, second, and third error functions.
In this embodiment, error functions of different dimensions are constructed from the initial key point information and combined into a second function containing multi-dimensional information, so the total objective function also contains multi-dimensional information, which improves the global optimization, the accuracy of the target key point information, and ultimately the action accuracy of the avatar.
In some embodiments, the acquisition module 501 is further configured to: for the target image, process the target image using a pre-trained body perception network to obtain the initial key point information corresponding to the target image.
In this embodiment, the initial key point information is obtained through the pre-trained body perception network; its strong performance improves the accuracy of the initial key point information and, in turn, the accuracy of the determined avatar action.
Fig. 6 is a schematic diagram according to a fourth embodiment of the present disclosure. This embodiment provides an apparatus for determining an action of an avatar; as shown in Fig. 6, the apparatus 600 includes: an acquisition module 601, a construction module 602, a solving module 603 and a determining module 604.
For details of the acquisition module 601, the construction module 602, the solving module 603 and the determining module 604, reference may be made to the previous embodiment.
In some embodiments, the apparatus 600 may further include: a constraint module 605 configured to construct a touchdown constraint condition based on the foot touchdown state and the foot key point velocity of the real figure; construct a dynamics constraint condition based on the initial key point information of the real figure; and use the touchdown constraint condition and the dynamics constraint condition as the target constraint conditions.
In this embodiment, the target constraint conditions include both a touchdown constraint and a dynamics constraint; constructing multiple kinds of constraints improves the accuracy of the target key point information and hence the accuracy of the determined avatar action.
In some embodiments, the apparatus 600 may further include: a detection module 606 configured to process the target image using a pre-trained touchdown detection network to determine the foot touchdown state.
In this embodiment, the foot touchdown state is obtained through the pre-trained touchdown detection network; its strong performance improves the accuracy of the foot touchdown state and, in turn, the accuracy of the determined avatar action.
It is to be understood that in the embodiments of the disclosure, the same or similar content in different embodiments may be referred to each other.
It can be understood that "first", "second", etc. in the embodiments of the present disclosure are used only for distinction and do not indicate importance or temporal order.
In the technical solution of the present disclosure, the collection, storage, use, processing, transmission, provision, disclosure and other handling of users' personal information comply with relevant laws and regulations and do not violate public order and good customs.
According to embodiments of the present disclosure, the present disclosure also provides an electronic device, a readable storage medium and a computer program product.
Fig. 7 illustrates a schematic block diagram of an example electronic device 700 that may be used to implement embodiments of the present disclosure. The electronic device 700 is intended to represent various forms of digital computers, such as laptops, desktops, workstations, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile apparatuses, such as personal digital assistants, cellular telephones, smartphones, wearable devices, and other similar computing apparatuses. The components shown herein, their connections and relationships, and their functions, are meant to be exemplary only, and are not meant to limit implementations of the disclosure described and/or claimed herein.
As shown in fig. 7, the electronic device 700 includes a computing unit 701 that can perform various appropriate actions and processes according to a computer program stored in a read-only memory (ROM) 702 or a computer program loaded from a storage unit 708 into a random access memory (RAM) 703. In the RAM 703, various programs and data required for the operation of the electronic device 700 may also be stored. The computing unit 701, the ROM 702, and the RAM 703 are connected to each other through a bus 704. An input/output (I/O) interface 705 is also connected to the bus 704.
Various components in the electronic device 700 are connected to the I/O interface 705, including: an input unit 706 such as a keyboard, a mouse, etc.; an output unit 707 such as various types of displays, speakers, and the like; a storage unit 708 such as a magnetic disk, an optical disk, or the like; and a communication unit 709 such as a network card, modem, wireless communication transceiver, etc. The communication unit 709 allows the electronic device 700 to exchange information/data with other devices through a computer network, such as the internet, and/or various telecommunication networks.
The computing unit 701 may be a variety of general and/or special purpose processing components having processing and computing capabilities. Some examples of the computing unit 701 include, but are not limited to, a central processing unit (CPU), a graphics processing unit (GPU), various specialized artificial intelligence (AI) computing chips, various computing units running machine learning model algorithms, a digital signal processor (DSP), and any suitable processor, controller, microcontroller, etc. The computing unit 701 performs the methods and processes described above, for example the method for determining an action of an avatar. For example, in some embodiments, this method may be implemented as a computer software program tangibly embodied on a machine-readable medium, such as the storage unit 708. In some embodiments, part or all of the computer program may be loaded and/or installed onto the electronic device 700 via the ROM 702 and/or the communication unit 709. When the computer program is loaded into the RAM 703 and executed by the computing unit 701, one or more steps of the method described above may be performed. Alternatively, in other embodiments, the computing unit 701 may be configured to perform the method in any other suitable manner (e.g., by means of firmware).
Various implementations of the systems and techniques described above may be realized in digital electronic circuitry, integrated circuitry, field-programmable gate arrays (FPGAs), application-specific integrated circuits (ASICs), application-specific standard products (ASSPs), systems on chip (SOCs), complex programmable logic devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: implementation in one or more computer programs that may be executed and/or interpreted on a programmable system including at least one programmable processor, which may be a special-purpose or general-purpose programmable processor that can receive data and instructions from, and transmit data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for carrying out methods of the present disclosure may be written in any combination of one or more programming languages. These program code may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus such that the program code, when executed by the processor or controller, causes the functions/operations specified in the flowchart and/or block diagram to be implemented. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package, partly on the machine and partly on a remote machine or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. The machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and pointing device (e.g., a mouse or trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user may be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic input, speech input, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a background component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such background, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local Area Networks (LANs), wide Area Networks (WANs), and the internet.
The computer system may include a client and a server. The client and server are generally remote from each other and typically interact through a communication network; their relationship arises from computer programs running on the respective computers and having a client-server relationship to each other. The server can be a cloud server (also called a cloud computing server or cloud host), a host product in a cloud computing service system that overcomes the drawbacks of high management difficulty and weak service scalability in traditional physical hosts and VPS ("Virtual Private Server") services. The server may also be a server of a distributed system or a server combined with a blockchain.
It should be appreciated that steps may be reordered, added, or deleted using the various forms of flow shown above. For example, the steps recited in the present disclosure may be performed in parallel, sequentially, or in a different order, as long as the desired results of the technical solutions of the present disclosure are achieved; no limitation is imposed herein.
The above detailed description should not be taken as limiting the scope of the present disclosure. It will be apparent to those skilled in the art that various modifications, combinations, sub-combinations and alternatives are possible, depending on design requirements and other factors. Any modifications, equivalent substitutions and improvements made within the spirit and principles of the present disclosure are intended to be included within the scope of the present disclosure.
Claims (17)
1. A method for determining an action of an avatar, comprising:
for a target image in a video of a real figure, processing the target image to obtain initial key point information of the real figure corresponding to the target image;
constructing a total objective function based on the initial key point information and to-be-determined target key point information of the avatar, wherein the total objective function characterizes error information between the initial key point information and the target key point information;
determining the target key point information based on the total objective function and a target constraint condition corresponding to the real figure;
and determining the action of the avatar based on the target key point information.
2. The method of claim 1, wherein
the target image comprises a plurality of target images;
the constructing of the total objective function based on the initial key point information and the to-be-determined target key point information of the avatar comprises:
for each target image of the plurality of target images, constructing a sub-objective function corresponding to that target image based on the initial key point information corresponding to that target image and the target key point information;
and constructing the total objective function based on the sub-objective functions corresponding to the target images.
3. The method of claim 2, wherein the constructing, for each target image of the plurality of target images, of the sub-objective function corresponding to that target image based on the initial key point information corresponding to that target image and the target key point information comprises:
for each target image, constructing a first function based on the initial key point information corresponding to that target image and the target key point information;
for each target image, constructing a second function based on the initial key point information corresponding to that target image;
for each target image, constructing the sub-objective function based on the first function and the second function.
4. The method of claim 3, wherein
the initial key point information comprises: two-dimensional position coordinates of key points of the real figure, three-dimensional position coordinates of the key points of the real figure, and three-dimensional angle information;
the constructing of the second function based on the initial key point information corresponding to each target image comprises:
projecting the two-dimensional position coordinates to obtain three-dimensional projection coordinates, and constructing a first error function according to the three-dimensional projection coordinates and the three-dimensional position coordinates;
determining a current velocity corresponding to the target image based on the three-dimensional angle information, and constructing a second error function according to the current velocity and a previous velocity corresponding to a previous image of the target image;
determining a current acceleration corresponding to the target image based on the three-dimensional angle information, and constructing a third error function according to the current acceleration and a previous acceleration corresponding to the previous image of the target image;
and constructing the second function based on the first error function, the second error function, and the third error function.
5. The method of claim 1, further comprising:
constructing a touchdown constraint condition based on a foot touchdown state and a foot key point velocity of the real figure;
constructing a dynamics constraint condition based on the initial key point information of the real figure;
and using the touchdown constraint condition and the dynamics constraint condition as the target constraint condition.
6. The method of claim 5, further comprising:
processing the target image with a pre-trained touchdown detection network to determine the foot touchdown state.
7. The method according to any one of claims 1-6, wherein the processing, for the target image in the video of the real figure, of the target image to obtain the initial key point information of the real figure corresponding to the target image comprises:
for the target image, processing the target image with a pre-trained body perception network to obtain the initial key point information corresponding to the target image.
8. An apparatus for determining an action of an avatar, comprising:
an acquisition module, configured to, for a target image in a video of a real figure, process the target image to obtain initial key point information of the real figure corresponding to the target image;
a construction module, configured to construct a total objective function according to the initial key point information and to-be-determined target key point information of the avatar, wherein the total objective function characterizes error information between the initial key point information and the target key point information;
a solving module, configured to determine the target key point information according to the total objective function and a target constraint condition corresponding to the real figure;
and a determining module, configured to determine the action of the avatar according to the target key point information.
9. The apparatus of claim 8, wherein
the target image comprises a plurality of target images;
the construction module is further configured to:
for each target image of the plurality of target images, construct a sub-objective function corresponding to that target image based on the initial key point information corresponding to that target image and the target key point information;
and construct the total objective function based on the sub-objective functions corresponding to the target images.
10. The apparatus of claim 9, wherein the construction module is further configured to:
for each target image, construct a first function based on the initial key point information corresponding to that target image and the target key point information;
for each target image, construct a second function based on the initial key point information corresponding to that target image;
for each target image, construct the sub-objective function based on the first function and the second function.
11. The apparatus of claim 9, wherein
the initial key point information comprises: two-dimensional position coordinates of key points of the real figure, three-dimensional position coordinates of the key points of the real figure, and three-dimensional angle information;
the construction module is further configured to:
project the two-dimensional position coordinates to obtain three-dimensional projection coordinates, and construct a first error function according to the three-dimensional projection coordinates and the three-dimensional position coordinates;
determine a current velocity corresponding to the target image based on the three-dimensional angle information, and construct a second error function according to the current velocity and a previous velocity corresponding to a previous image of the target image;
determine a current acceleration corresponding to the target image based on the three-dimensional angle information, and construct a third error function according to the current acceleration and a previous acceleration corresponding to the previous image of the target image;
and construct the second function based on the first error function, the second error function, and the third error function.
12. The apparatus of claim 8, further comprising:
a constraint module, configured to construct a touchdown constraint condition based on a foot touchdown state and a foot key point velocity of the real figure; construct a dynamics constraint condition based on the initial key point information of the real figure; and use the touchdown constraint condition and the dynamics constraint condition as the target constraint condition.
13. The apparatus of claim 12, further comprising:
a detection module configured to process the target image with a pre-trained touchdown detection network to determine the foot touchdown state.
14. The apparatus of any one of claims 8-13, wherein the acquisition module is further configured to:
process the target image with a pre-trained body perception network to acquire the initial key point information corresponding to the target image.
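Taking claims 13 and 14 together, the acquisition stage could look like the sketch below; `body_net` and `touchdown_net` are stand-ins for the pre-trained body perception and touchdown detection networks, which the patent does not name or specify:

```python
import numpy as np
from dataclasses import dataclass

@dataclass
class InitialKeypoints:
    """Initial key point information as enumerated in claim 11."""
    x2d: np.ndarray    # (J, 2) two-dimensional position coordinates
    X3d: np.ndarray    # (J, 3) three-dimensional position coordinates
    theta: np.ndarray  # (J, 3) three-dimensional angle information

def acquire(frames, body_net, touchdown_net):
    """Run both pre-trained networks over every target image in the video;
    returns per-image initial key point information (claim 14) and per-image
    foot touchdown states (claim 13). Both networks are assumed callables,
    not APIs defined by the patent."""
    keypoints = [body_net(frame) for frame in frames]       # List[InitialKeypoints]
    touchdown = [touchdown_net(frame) for frame in frames]  # List[bool]
    return keypoints, touchdown
```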
15. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-7.
16. A non-transitory computer-readable storage medium storing computer instructions for causing a computer to perform the method of any one of claims 1-7.
17. A computer program product comprising a computer program which, when executed by a processor, implements the method of any one of claims 1-7.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310728732.1A CN116894894B (en) | 2023-06-19 | 2023-06-19 | Method, apparatus, device and storage medium for determining motion of avatar |
Publications (2)
Publication Number | Publication Date |
---|---|
CN116894894A (en) | 2023-10-17 |
CN116894894B CN116894894B (en) | 2024-08-27 |
Family
ID=88310122
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202310728732.1A Active CN116894894B (en) | 2023-06-19 | 2023-06-19 | Method, apparatus, device and storage medium for determining motion of avatar |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN116894894B (en) |
Patent Citations (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107038712A (en) * | 2016-12-14 | 2017-08-11 | 中国科学院沈阳自动化研究所 | A kind of woodpecker intra-articular irrigation method based on motion image sequence |
US20220358705A1 (en) * | 2020-02-18 | 2022-11-10 | Boe Technology Group Co., Ltd. | Method for generating animation figure, electronic device and storage medium |
CN113822097A (en) * | 2020-06-18 | 2021-12-21 | 北京达佳互联信息技术有限公司 | Single-view human body posture recognition method and device, electronic equipment and storage medium |
CN113420719A (en) * | 2021-07-20 | 2021-09-21 | 北京百度网讯科技有限公司 | Method and device for generating motion capture data, electronic equipment and storage medium |
CN115841534A (en) * | 2022-10-27 | 2023-03-24 | 阿里巴巴(中国)有限公司 | Method and device for controlling motion of virtual object |
CN115857676A (en) * | 2022-11-23 | 2023-03-28 | 上海哔哩哔哩科技有限公司 | Display method and system based on virtual image |
CN116092120A (en) * | 2022-12-30 | 2023-05-09 | 北京百度网讯科技有限公司 | Image-based action determining method and device, electronic equipment and storage medium |
CN116206370A (en) * | 2023-05-06 | 2023-06-02 | 北京百度网讯科技有限公司 | Driving information generation method, driving device, electronic equipment and storage medium |
Non-Patent Citations (1)
Title |
---|
Fan Qing et al., "Dynamic Hand Reconstruction Based on Correspondences of Multiple Key Points", 《图学学报》 (Journal of Graphics), vol. 41, no. 5, 30 November 2020 (2020-11-30) *
Also Published As
Publication number | Publication date |
---|---|
CN116894894B (en) | 2024-08-27 |
Similar Documents
Publication | Title |
---|---|
EP4033453A1 (en) | Training method and apparatus for target detection model, device and storage medium |
CN114186632B (en) | Method, device, equipment and storage medium for training key point detection model |
CN112785625B (en) | Target tracking method, device, electronic equipment and storage medium |
CN113378770B (en) | Gesture recognition method, device, equipment and storage medium |
CN113378712B (en) | Training method of object detection model, image detection method and device thereof |
CN113642431A (en) | Training method and device of target detection model, electronic equipment and storage medium |
US20140232748A1 (en) | Device, method and computer readable recording medium for operating the same |
CN115393488B (en) | Method and device for driving virtual character expression, electronic equipment and storage medium |
CN113362314A (en) | Medical image recognition method, recognition model training method and device |
CN111833391A (en) | Method and device for estimating image depth information |
CN111462179A (en) | Three-dimensional object tracking method and device and electronic equipment |
CN114627268A (en) | Visual map updating method and device, electronic equipment and medium |
CN115147831A (en) | Training method and device of three-dimensional target detection model |
CN114360047A (en) | Hand-lifting gesture recognition method and device, electronic equipment and storage medium |
CN116894894B (en) | Method, apparatus, device and storage medium for determining motion of avatar |
CN113705390A (en) | Positioning method, positioning device, electronic equipment and storage medium |
CN114674328B (en) | Map generation method, map generation device, electronic device, storage medium, and vehicle |
CN111538410A (en) | Method and device for determining target algorithm in VR scene and computing equipment |
CN112733879A (en) | Model distillation method and device for different scenes |
CN116228939B (en) | Digital person driving method, digital person driving device, electronic equipment and storage medium |
CN116448105B (en) | Pose updating method and device, electronic equipment and storage medium |
CN114332416B (en) | Image processing method, device, equipment and storage medium |
CN116229583B (en) | Driving information generation method, driving device, electronic equipment and storage medium |
CN116301361B (en) | Target selection method and device based on intelligent glasses and electronic equipment |
CN113033415B (en) | Data queue dynamic updating method and device, electronic equipment and storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||