
WO2023174182A1 - Rendering model training, video rendering method, apparatus, device, and storage medium - Google Patents

Rendering model training, video rendering method, apparatus, device, and storage medium

Info

Publication number
WO2023174182A1
WO2023174182A1 (application PCT/CN2023/080880, CN2023080880W)
Authority
WO
WIPO (PCT)
Prior art keywords
video
model
facial
rendering model
rendering
Prior art date
Application number
PCT/CN2023/080880
Other languages
English (en)
French (fr)
Inventor
张子阳
王耀园
李明磊
何炜华
张瑀涵
程捷
Original Assignee
Huawei Technologies Co., Ltd. (华为技术有限公司)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huawei Technologies Co., Ltd. (华为技术有限公司)
Priority to EP23769686.9A priority Critical patent/EP4394711A1/en
Publication of WO2023174182A1 publication Critical patent/WO2023174182A1/zh

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T15/003D [Three Dimensional] image rendering
    • G06T15/005General purpose rendering architectures
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16Human faces, e.g. facial parts, sketches or expressions
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T13/00Animation
    • G06T13/203D [Three Dimensional] animation
    • G06T13/403D [Three Dimensional] animation of characters, e.g. humans, animals or virtual beings
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/774Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2215/00Indexing scheme for image rendering
    • G06T2215/16Using real world measurements to influence rendering

Definitions

  • This application relates to the field of image processing technology, and in particular to a rendering model training method, a video rendering method, an apparatus, a device, and a storage medium.
  • Low-definition videos are videos with a definition lower than a threshold; high-definition videos are videos with a definition higher than the threshold.
  • The rendering model is trained by taking the low-definition video as a sample video input to the rendering model and using the corresponding high-definition video as a supervision video, so that the trained rendering model can render output whose definition matches that of the supervision video.
  • Because this training method requires recording a large number of high-definition and low-definition videos under strict recording requirements, samples are difficult to obtain; moreover, training the rendering model consumes substantial computing resources and takes a long time.
  • This application provides a rendering model training method, a video rendering method, an apparatus, a device, and a storage medium to solve the problems in the related art.
  • the technical solution is as follows:
  • In a first aspect, a rendering model training method is provided, comprising: acquiring a first video including the face of a target object; mapping the facial movements of the target object in the first video based on a three-dimensional facial model to obtain a second video including a three-dimensional face; and using the second video as the input of an initial rendering model and the first video as the output supervision of the initial rendering model, training the initial rendering model to obtain a target rendering model.
  • By introducing a three-dimensional facial model and using the second video, which includes the three-dimensional face generated from that model, as a training sample, the technical solution provided by this application avoids recording a large number of low-definition videos under strict recording requirements before training the rendering model, thereby reducing the computing resources and time consumed in training; in addition, introducing the three-dimensional facial model gives the trained target rendering model a higher generalization capability.
  • Mapping the facial movements of the target object in the first video based on a three-dimensional facial model to obtain a second video including a three-dimensional face includes: extracting the facial key points of the target object from each frame of the first video to obtain multiple groups of facial key points, where the number of groups equals the number of frames of the first video and each frame corresponds to one group of facial key points; fitting the three-dimensional facial model to each group of facial key points to obtain multiple three-dimensional facial images; and combining the multiple three-dimensional facial images, according to the correspondence between the three-dimensional facial images and the frames of the first video, to obtain the second video including the three-dimensional face.
  • Facial key points are highly stable. Even when the definition of the first video is low, the facial key points of the target object can still be reliably extracted from the first video, and the three-dimensional facial model can then be fitted to the extracted key points to obtain the second video, which improves the reliability of obtaining the second video.
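The frame-by-frame key-point grouping described above can be sketched as follows. This is a minimal illustration: `detect_landmarks`, the 68-point layout, and the placeholder frames are assumptions, not part of the patent.

```python
def detect_landmarks(frame):
    """Placeholder facial-landmark detector.

    A real system would use a trained detector (for example a neural
    network); here it returns fixed dummy coordinates so the pipeline
    is runnable.
    """
    n_landmarks = 68  # a common landmark count; the patent does not fix one
    return [(0.0, 0.0)] * n_landmarks

def extract_keypoint_groups(frames):
    """One group of facial key points per frame, matching the frame count."""
    return [detect_landmarks(frame) for frame in frames]

# Stand-ins for decoded video frames.
frames = ["frame0", "frame1", "frame2"]
groups = extract_keypoint_groups(frames)
assert len(groups) == len(frames)  # one group of key points per frame
```

The one-group-per-frame invariant is what later lets each fitted three-dimensional facial image be matched back to its source frame.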
  • Fitting the three-dimensional facial model to each group of facial key points to obtain multiple three-dimensional facial images includes: using a neural network to fit the three-dimensional facial model to each group of facial key points to obtain the plurality of three-dimensional facial images.
  • Using the second video as the input of the initial rendering model, using the first video as the output supervision of the initial rendering model, and training the initial rendering model to obtain the target rendering model includes: inputting the second video into the initial rendering model, which renders the second video to obtain a third video; calculating the similarity between each frame of the first video and the corresponding frame of the third video; and adjusting the parameters of the initial rendering model according to the similarity, using the initial rendering model with adjusted parameters as the target rendering model.
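A minimal sketch of this training loop, under loud assumptions: the "videos" are tiny arrays, the "rendering model" is a per-pixel affine map, and similarity is negative mean squared error (the patent does not specify a model architecture or similarity metric).

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins: "second video" (three-dimensional face renders) and
# "first video" (real footage used as output supervision), 4 frames of 8x8.
second_video = rng.random((4, 8, 8))
first_video = 0.7 * second_video + 0.2  # pretend ground truth

# A deliberately tiny "rendering model": per-pixel affine map y = a*x + b.
a, b = 1.0, 0.0
lr = 0.5

def frame_similarity(x, y):
    # Negative mean squared error as similarity; a real system might use
    # SSIM or a perceptual metric instead.
    return -np.mean((x - y) ** 2)

for step in range(500):
    rendered = a * second_video + b          # the "third video"
    err = rendered - first_video
    a -= lr * 2 * np.mean(err * second_video)  # gradient of MSE w.r.t. a
    b -= lr * 2 * np.mean(err)                 # gradient of MSE w.r.t. b

sims = [frame_similarity(a * s + b, f) for s, f in zip(second_video, first_video)]
assert all(s > -1e-4 for s in sims)  # every frame now close to supervision
```

The final per-frame similarity check mirrors the stopping condition described below: training stops once every generated frame is at least as similar to its supervision frame as a threshold requires.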
  • Adjusting the parameters of the initial rendering model according to the similarity and using the adjusted model as the target rendering model includes: adjusting, according to the similarity, the weight of a pre-training layer in the initial rendering model.
  • Using the initial rendering model with adjusted parameters as the target rendering model includes: in response to the similarity between each frame of the video generated by the adjusted initial rendering model and the corresponding frame of the first video being not less than a similarity threshold, using the adjusted initial rendering model as the target rendering model.
  • Obtaining the first video including the face of the target object includes: obtaining a fourth video including the target object; and cropping each frame of the fourth video, retaining the facial area of the target object in each frame, to obtain the first video.
  • In a second aspect, a video rendering method is provided, including: obtaining a to-be-rendered video including a target object; mapping the facial movements of the target object in the to-be-rendered video based on a three-dimensional facial model to obtain an intermediate video including a three-dimensional face; obtaining a target rendering model corresponding to the target object; and rendering the intermediate video based on the target rendering model to obtain a target video.
  • Even when the definition of the to-be-rendered video is low, the target video can still be rendered well through the target rendering model.
  • Obtaining a to-be-rendered video including a target object includes: obtaining a virtual object generation model established based on the target object; and generating the to-be-rendered video based on the virtual object generation model.
  • Generating the to-be-rendered video based on the virtual object generation model includes: obtaining text used to generate the to-be-rendered video; converting the text into speech of the target object, the content of the speech corresponding to the content of the text; obtaining at least one set of lip synchronization parameters based on the speech; inputting the at least one set of lip synchronization parameters into the virtual object generation model, which drives the face of the virtual object corresponding to the target object to make the corresponding movements based on those parameters, to obtain a virtual video corresponding to the at least one set of lip synchronization parameters; and rendering the virtual video to obtain the to-be-rendered video.
  • Rendering the intermediate video based on the target rendering model to obtain the target video includes: rendering each frame of the intermediate video based on the target rendering model to obtain as many rendered pictures as there are frames in the intermediate video; and combining the rendered pictures, according to the correspondence between the rendered pictures and the frames of the intermediate video, to obtain the target video.
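The order-preserving per-frame rendering and recombination can be sketched as below. The `model` callable and the string "frames" are stand-ins; a real model would map images to images.

```python
def render_video(model, intermediate_frames):
    """Render every frame, then recombine in the original frame order.

    The list comprehension keeps the one-to-one correspondence between
    rendered pictures and intermediate-video frames that the method
    relies on when recombining them into the target video.
    """
    rendered = [model(frame) for frame in intermediate_frames]
    assert len(rendered) == len(intermediate_frames)  # same frame count
    return rendered

# Stand-in "model" and "frames": uppercasing each string plays the role
# of rendering a frame.
target_video = render_video(str.upper, ["f0", "f1", "f2"])
```

Because frames are processed by index, recombination needs no extra bookkeeping: position i of the output always corresponds to frame i of the intermediate video.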
  • Mapping the facial movements of the target object in the to-be-rendered video based on a three-dimensional facial model to obtain an intermediate video including a three-dimensional face includes: cropping each frame of the to-be-rendered video, retaining the facial area of the target object in each frame, to obtain a facial video; and mapping the facial movements of the target object in the facial video based on the three-dimensional facial model to obtain the intermediate video. By first cropping the to-be-rendered video into a facial video and then mapping the facial movements of the target object in that facial video, the mapping can be completed with fewer computing resources.
  • In a third aspect, a rendering model training device is provided, including:
  • an acquisition module configured to acquire a first video including the face of the target object
  • a mapping module configured to map facial movements of the target object in the first video based on a three-dimensional facial model to obtain a second video including a three-dimensional face;
  • a training module configured to use the second video as the input of the initial rendering model, use the first video as the output supervision of the initial rendering model, train the initial rendering model, and obtain a target rendering model.
  • The mapping module is used to extract the facial key points of the target object from each frame of the first video to obtain multiple groups of facial key points, where the number of groups equals the number of frames of the first video and each frame corresponds to one group of facial key points; fit the three-dimensional facial model to each group of facial key points to obtain multiple three-dimensional facial pictures; and combine the plurality of three-dimensional facial pictures, according to the correspondence between the three-dimensional facial pictures and the frames of the first video, to obtain a second video including the three-dimensional face.
  • the mapping module is configured to use a neural network to fit the three-dimensional facial model to each group of facial key points to obtain the plurality of three-dimensional facial images.
  • The training module is used to input the second video into an initial rendering model and render the second video with the initial rendering model to obtain a third video; calculate the similarity between each frame of the first video and the corresponding frame of the third video; and adjust the parameters of the initial rendering model according to the similarity, using the adjusted initial rendering model as the target rendering model.
  • The training module is configured to adjust the weight of the pre-training layer in the initial rendering model according to the similarity and use the adjusted initial rendering model as the target rendering model, where the pre-training layer is at least one network layer of the initial rendering model, and the number of network layers it contains is less than the total number of network layers in the initial rendering model.
  • The training module is configured to, in response to the similarity between each frame of the video generated by the initial rendering model with adjusted parameters and the corresponding frame of the first video being not less than a similarity threshold, use the adjusted initial rendering model as the target rendering model.
  • The acquisition module is configured to acquire a fourth video including the target object, and to crop each frame of the fourth video, retaining the facial area of the target object in each frame, to obtain the first video.
  • In a fourth aspect, a video rendering device is provided, including:
  • the acquisition module is used to obtain the video to be rendered including the target object
  • a mapping module configured to map the facial movements of the target object in the video to be rendered based on a three-dimensional facial model to obtain an intermediate video including a three-dimensional face
  • the acquisition module is also used to acquire the target rendering model corresponding to the target object
  • a rendering module configured to render the intermediate video based on the target rendering model to obtain a target video.
  • the acquisition module is configured to acquire a virtual object generation model established based on the target object; and generate the video to be rendered based on the virtual object generation model.
  • The acquisition module is used to acquire text used to generate the to-be-rendered video; convert the text into speech of the target object, the content of the speech corresponding to the content of the text; obtain at least one set of lip synchronization parameters based on the speech; and input the at least one set of lip synchronization parameters into the virtual object generation model, which drives the face of the virtual object corresponding to the target object to make the corresponding movements based on those parameters, obtaining a virtual video corresponding to the at least one set of lip synchronization parameters; the virtual video is rendered to obtain the to-be-rendered video.
  • The rendering module is configured to render each frame of the intermediate video based on the target rendering model to obtain as many rendered pictures as there are frames in the intermediate video, and to combine the rendered pictures, according to the correspondence between the rendered pictures and the frames of the intermediate video, to obtain the target video.
  • The mapping module is configured to crop each frame of the to-be-rendered video, retaining the facial area of the target object in each frame, to obtain a facial video, and to map the facial movements of the target object in the facial video based on the three-dimensional facial model to obtain the intermediate video.
  • A communication device is provided, which includes: a transceiver, a memory, and a processor.
  • the transceiver, the memory and the processor communicate with each other through an internal connection path, the memory is used to store instructions, and the processor is used to execute the instructions stored in the memory to control the transceiver to receive signals and control the transceiver to send signals.
  • When the processor executes the instructions stored in the memory, the processor is caused to perform the method in the first aspect or any possible implementation of the first aspect, or the method in the second aspect or any possible implementation of the second aspect.
  • Another communication device is provided, which includes: a transceiver, a memory, and a processor.
  • the transceiver, the memory and the processor communicate with each other through an internal connection path, the memory is used to store instructions, and the processor is used to execute the instructions stored in the memory to control the transceiver to receive signals and control the transceiver to send signals.
  • When the processor executes the instructions stored in the memory, the processor is caused to perform the method in the first aspect or any possible implementation of the first aspect, or the method in the second aspect or any possible implementation of the second aspect.
  • There may be one or more processors and one or more memories.
  • the memory may be integrated with the processor, or the memory may be provided separately from the processor.
  • The memory can be a non-transitory memory, such as a read-only memory (ROM), and can be integrated on the same chip as the processor or placed on a different chip; this application does not limit the type of memory or the arrangement of the memory and the processor.
  • A communication system is provided, which includes the device in the above third aspect or any possible implementation of the third aspect, and the device in the fourth aspect or any possible implementation of the fourth aspect.
  • A computer program is provided, which includes computer program code; when the computer program code is run by a computer, the computer is caused to perform the methods in the above aspects.
  • A computer-readable storage medium is provided, which stores programs or instructions; when the programs or instructions are run on a computer, the methods in the above aspects are performed.
  • A chip is provided, including a processor for calling and running instructions stored in a memory, so that a communication device installed with the chip performs the methods in the above aspects.
  • Another chip is provided, including an input interface, an output interface, a processor, and a memory, where the input interface, the output interface, the processor, and the memory are connected through an internal connection path, and the processor is used to execute the code in the memory; when the code is executed, the processor performs the methods in the above aspects.
  • Figure 1 is a schematic diagram of an implementation environment provided by an embodiment of the present application.
  • Figure 2 is a flow chart of a rendering model training method provided by an embodiment of the present application.
  • Figure 3 is a schematic diagram before and after cropping a frame provided by an embodiment of the present application.
  • Figure 4 is a flow chart of a video rendering method provided by an embodiment of the present application.
  • Figure 5 is a schematic diagram of a rendering model training device provided by an embodiment of the present application.
  • Figure 6 is a schematic diagram of a video rendering device provided by an embodiment of the present application.
  • Figure 7 is a schematic structural diagram of an electronic device provided by an embodiment of the present application.
  • Figure 8 is a schematic structural diagram of a server provided by an embodiment of the present application.
  • The voice-driven virtual object is a newly emerging technology that has attracted widespread attention. Compared with real objects, virtual objects have advantages in terms of controllability, risk, and cost.
  • Speech-driven virtual object technology can be used to generate a corresponding virtual object video based on a piece of audio.
  • The process includes: inputting speech into a lip-synchronization-parameter generation model, which generates lip synchronization parameters based on the speech; and inputting the lip synchronization parameters into a virtual object generation model, which generates a virtual object video based on the lip synchronization parameters.
  • When the virtual object generation model generates the virtual object video based on the lip synchronization parameters, it drives the virtual object to make the corresponding facial movements according to those parameters, and a rendering model included in the virtual object generation model then renders the scene after the virtual object makes the facial movements, obtaining the virtual object video.
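The speech-to-video pipeline in this passage can be outlined with stub functions. Every name below is a hypothetical stand-in for a component the passage describes; none of them are real library APIs.

```python
def lip_sync_params_from_speech(speech):
    """Stand-in for the lip-synchronization-parameter generation model."""
    # A real model would infer one parameter set per audio window.
    return [{"mouth_open": 0.3}, {"mouth_open": 0.7}]

def drive_virtual_object(params):
    """Stand-in for the virtual object generation model: one facial pose
    per lip-synchronization parameter set."""
    return [f"pose(mouth_open={p['mouth_open']})" for p in params]

def render_frames(poses):
    """Stand-in for the rendering model inside the generation model."""
    return [f"frame[{pose}]" for pose in poses]

speech = b"\x00\x01"  # placeholder audio bytes
params = lip_sync_params_from_speech(speech)
virtual_video = render_frames(drive_virtual_object(params))
assert len(virtual_video) == len(params)  # one rendered frame per parameter set
```

The point of the sketch is the data flow, not the models: speech produces lip synchronization parameters, the parameters drive facial poses, and the rendering model turns the posed scene into video frames.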
  • The rendering capability of the rendering model directly affects the clarity of the virtual object video. Therefore, to make the virtual object video clearer, the related art uses a large number of high-definition videos recorded under strict recording requirements, together with the corresponding low-definition videos, as samples to train the rendering model so that it can render virtual object videos with higher definition.
  • However, it is difficult to obtain training samples in the related art, and training a rendering model requires substantial computing resources and a long training time.
  • Moreover, the rendering models trained in the related art also require large computing resources when applied, which is not conducive to terminal-side applications.
  • In view of this, embodiments of the present application provide a rendering model training method. This method avoids using a large number of low-definition videos recorded under strict recording requirements as training samples and reduces the computing resources and time required for training; after training, the rendering model also requires fewer computing resources when applied and can be deployed on the terminal side.
  • An embodiment of the present application also provides a video rendering method, in which the rendering model used to render the video is one trained with the rendering model training method provided by the embodiments of the present application; this method can complete rendering even when the definition of the input video is low.
  • the embodiment of the present application provides an implementation environment.
  • the implementation environment may include: a terminal 11 and a server 12 .
  • the terminal 11 can obtain required content, such as video, from the server 12 .
  • the terminal 11 is configured with a camera device, based on which the terminal 11 can obtain video.
  • the rendering model training method provided by the embodiment of the present application can be executed by the terminal 11, the server 12, or both the terminal 11 and the server 12. This is not limited by the embodiment of the present application.
  • the video rendering method provided by the embodiment of the present application can be executed by the terminal 11, or by the server 12, or by both the terminal 11 and the server 12. This is not limited by the embodiment of the present application.
  • the rendering model training method and the video rendering method provided by the embodiments of the present application can be executed by the same device or by different devices, which are not limited by the embodiments of the present application.
  • The terminal 11 can be any electronic product that can perform human-computer interaction with a user through one or more methods such as a keyboard, touch pad, touch screen, remote control, voice interaction, or handwriting device, for example a personal computer (PC), mobile phone, smartphone, personal digital assistant (PDA), wearable device, pocket PC (PPC), tablet computer, smart car, smart TV, smart speaker, intelligent voice interaction device, smart home appliance, or vehicle-mounted terminal.
  • the server 12 may be one server, a server cluster composed of multiple servers, or a cloud computing service center. The terminal 11 and the server 12 establish a communication connection through a wired or wireless network.
  • The above terminal 11 and server 12 are only examples. If other existing terminals or servers, or ones that may appear in the future, are applicable to this application, they should also be included in the protection scope of this application and are hereby incorporated by reference.
  • Embodiments of the present application provide a rendering model training method; the following description takes the application of this method to the terminal 11 as an example.
  • the rendering model training method provided by the embodiment of the present application may include the following steps 201 to 203.
  • Step 201 Obtain a first video including the face of the target object.
  • acquiring the first video including the face of the target object includes: capturing the face of the target object based on a camera configured at the terminal, thereby acquiring the first video including the face of the target object.
  • Obtaining the first video including the face of the target object may also include: obtaining a fourth video including the target object; and cropping each frame of the fourth video, retaining the facial area of the target object in each frame, to obtain the first video.
  • The first video may also be captured by a device other than the terminal and then transmitted to the terminal, thereby allowing the terminal to complete acquisition of the first video.
  • the fourth video may also be captured by a device other than the terminal and then transmitted to the terminal, thereby allowing the terminal to complete acquisition of the fourth video.
  • the fourth video is stored in the server, and the method of obtaining the fourth video may also be to directly obtain the fourth video from the server.
  • Cropping each frame of the fourth video and retaining the facial area of the target object in each frame to obtain the first video includes: determining, based on image recognition technology, the area corresponding to the face of the target object in each frame of the fourth video; cutting out, based on image segmentation technology, the area corresponding to the face of the target object from each frame to obtain as many facial images as there are frames in the fourth video; and combining all the facial images, according to the correspondence between each facial image and the frames of the fourth video, to obtain the first video.
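A minimal sketch of the crop-and-keep-face step follows. Face detection itself is out of scope here, so the bounding boxes are supplied directly; the (top, bottom, left, right) box format and the sizes are assumptions.

```python
import numpy as np

def crop_face_regions(frames, boxes):
    """Crop each frame to its face box, preserving frame order.

    `boxes` holds one (top, bottom, left, right) box per frame, as an
    image-recognition step might produce.
    """
    faces = []
    for frame, (t, b, l, r) in zip(frames, boxes):
        faces.append(frame[t:b, l:r])
    # Recombining these facial images in order yields the "first video".
    return faces

# Three dummy 100x100 RGB frames with the same face box in each.
frames = [np.full((100, 100, 3), i, dtype=np.uint8) for i in range(3)]
boxes = [(10, 60, 20, 70)] * 3
faces = crop_face_regions(frames, boxes)
assert faces[0].shape == (50, 50, 3)  # only the facial area is retained
```

Keeping the crop order intact is what allows the facial images to be reassembled into a video whose frames still correspond one-to-one with the fourth video.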
  • Exemplarily, a frame before and after cropping is shown in Figure 3: (1) in Figure 3 is a frame of the fourth video, and (2) in Figure 3 is the cropped facial image including the face of the target object. It should be noted that Figure 3 is only intended to help understand the embodiment of the present application, and the size of the outer frame lines of (1) and (2) in Figure 3 is unrelated to the size of the corresponding pictures.
  • the samples used to train the rendering model are related to the purpose of the rendering model. For example, if the rendering model is used to render a video containing a tiger, the video containing the tiger should be used as a training sample when training the rendering model, that is, the training of the rendering model is targeted.
  • the embodiment of the present application takes the rendering model for rendering a video including a target object as an example. That is, the video including the target object is used as a sample for training the rendering model.
  • The rendering model training method provided by the embodiment of the present application can also be applied to train rendering models for other purposes.
  • Step 202 Map the facial movements of the target object in the first video based on the three-dimensional facial model to obtain a second video including a three-dimensional face.
  • the three-dimensional facial model is a three-dimensional model including a facial pattern.
  • Before mapping the facial movements of the target object in the first video based on the three-dimensional facial model to obtain the second video including the three-dimensional face, the method further includes: obtaining the three-dimensional facial model.
  • the three-dimensional facial model is pre-stored in the memory of the terminal. At this time, obtaining the three-dimensional facial model includes: obtaining the three-dimensional facial model from the memory.
  • the three-dimensional facial model is pre-stored in the server, and obtaining the three-dimensional facial model includes: obtaining the three-dimensional facial model from the server based on the communication network.
  • The embodiments of this application do not limit which specific three-dimensional facial model is used.
  • For example, the three-dimensional facial model used in the embodiment of the present application may be a three-dimensional morphable face model (3DMM).
  • mapping the facial movements of the target object in the first video based on the three-dimensional facial model to obtain the second video including the three-dimensional face includes: extracting the facial key of the target object in each frame of the first video points to obtain multiple groups of facial key points.
  • the number of groups of facial key points is the same as the number of frames in the first video.
  • One frame corresponds to a group of facial key points.
• the three-dimensional facial model is fitted to each group of facial key points to obtain multiple three-dimensional facial pictures.
• extracting facial key points of the target object in each frame of the first video to obtain multiple sets of facial key points includes: determining the positions of the facial key points in each frame of the first video; and extracting the coordinates at the positions of the facial key points in each frame of the first video to obtain multiple groups of facial key points.
  • the number of groups of facial key points is the same as the number of frames in the first video.
• One frame corresponds to a group of facial key points.
• determining the positions of the facial key points in each frame of the first video includes: determining the positions of the facial key points in each frame of the first video based on a neural network used to determine key point positions. After the position of each facial key point is determined, the coordinates corresponding to that position can be determined directly, so that the coordinates of each facial key point are obtained; the coordinates at the positions of the facial key points in each frame of the first video are then extracted to obtain multiple sets of facial key points.
  • the facial key points of the target object include, but are not limited to, head key points, lip key points, eye key points, and eyebrow key points.
  • the embodiment of the present application does not impose a limit and can be set based on experience.
• facial key points are distributed in set areas. For example, 20 facial key points are distributed in the lip area, 6 key points in each of the left eye area and the right eye area, 6 key points in the head area, 17 key points in the cheek area, 5 key points in each of the left eyebrow area and the right eyebrow area, and 9 key points in the nose area.
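As a runnable illustration (not part of the embodiment), the per-region counts above can be organized as index ranges. The ranges below follow the widely used 68-point landmark convention, which matches the lip, eye, cheek, eyebrow, and nose counts described above; the exact indices, and the omission of separate head points, are assumptions made only for this sketch.

```python
# Hypothetical index ranges for a 68-point facial landmark layout.
FACE_REGIONS = {
    "cheek": range(0, 17),           # 17 points along the face contour
    "left_eyebrow": range(17, 22),   # 5 points
    "right_eyebrow": range(22, 27),  # 5 points
    "nose": range(27, 36),           # 9 points
    "left_eye": range(36, 42),       # 6 points
    "right_eye": range(42, 48),      # 6 points
    "lips": range(48, 68),           # 20 points
}

def total_keypoints(regions):
    """Number of facial key points covered by all regions together."""
    return sum(len(r) for r in regions.values())
```

A group of facial key points extracted from one frame would then be indexed by these ranges when fitting the three-dimensional facial model.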
• the facial key points have good stability: even when the definition of the first video is low, the facial key points of the target object can still be extracted well from the first video, and the extracted facial key points can then be fitted to the three-dimensional facial model to obtain the second video. That is to say, the method of generating the second video by extracting the facial key points of the target object in the first video and fitting them to the three-dimensional facial model is highly stable and less affected by the clarity of the first video; the second video can be generated well even when the definition of the first video is low.
  • the embodiments of this application do not limit how to fit the three-dimensional facial model to each set of facial key points.
  • the three-dimensional facial model can be fitted to each set of facial key points based on a neural network to obtain multiple three-dimensional facial images.
  • the neural network used to realize the fitting of the three-dimensional facial model and facial key points is a residual network (residual neural network, ResNet).
  • a three-dimensional facial model can also be fitted to each set of facial key points based on a Bayesian model.
  • the three-dimensional facial picture is a picture obtained by fitting the facial key points extracted from each frame in the first video to the three-dimensional facial model, therefore, one three-dimensional facial picture corresponds to one frame in the first video.
  • multiple three-dimensional facial images may be combined into a second video including a three-dimensional face according to the corresponding relationship between the three-dimensional facial images and each frame in the first video.
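The per-frame fitting and reassembly described above can be sketched as follows; the fitting function is a placeholder standing in for a real 3DMM fit (which the embodiment performs with a neural network such as ResNet), kept trivial only so the sketch is runnable.

```python
def fit_face_model(keypoints):
    # Placeholder for fitting a three-dimensional facial model (e.g., 3DMM)
    # to one group of facial key points; a real implementation would solve
    # for shape/expression coefficients. Here it just tags the input.
    return {"fitted_from": keypoints}

def build_second_video(per_frame_keypoints):
    """One three-dimensional facial picture per frame, assembled in the
    same order as the frames of the first video."""
    return [fit_face_model(kps) for kps in per_frame_keypoints]
```

Because the list comprehension preserves order, the i-th three-dimensional facial picture corresponds to the i-th frame of the first video, which is exactly the correspondence used to combine the pictures into the second video.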
  • Step 203 Use the second video as the input of the initial rendering model, use the first video as the output supervision of the initial rendering model, train the initial rendering model, and obtain the target rendering model.
• training the initial rendering model to obtain the target rendering model includes: inputting the second video into the initial rendering model, and rendering the second video by the initial rendering model to obtain a third video; calculating the similarity between each frame of the first video and each frame of the third video; and adjusting the parameters of the initial rendering model based on the similarity, using the initial rendering model after the parameters are adjusted as the target rendering model.
  • Using the second video generated by extracting facial key points to train the rendering model can make the trained rendering model have better generalization capabilities. Among them, the similarity represents the degree of similarity between the third video and the first video. The higher the similarity, the better the rendering capability of the initial rendering model.
  • the goal of training the initial rendering model is to make the third video outputted by the initial rendering model to render the second video as similar as possible to the first video.
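The training loop of step 203 can be sketched with a toy stand-in: here the "model" is a single gain parameter, rendering is element-wise scaling, and similarity is measured via mean squared error. All of these simplifications are assumptions made purely to keep the sketch runnable; they are not the rendering model of the embodiment.

```python
def mse(a, b):
    """Mean squared error between two equal-length frame sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def train_rendering_model(second_video, first_video, lr=0.1, steps=200):
    """Toy training loop: render the second video, compare the result with
    the first video (the supervision), and adjust the parameter so the
    rendered output becomes more similar to the first video."""
    gain = 0.0
    for _ in range(steps):
        # render the second video with the current model parameters
        third_video = [gain * x for x in second_video]
        # gradient of the MSE loss with respect to the gain parameter
        grad = sum(2 * (t - y) * x
                   for x, t, y in zip(second_video, third_video, first_video))
        grad /= len(second_video)
        gain -= lr * grad  # adjust parameters based on the (dis)similarity
    return gain
```

In the embodiment, the same structure holds with a neural rendering model in place of the gain and a per-frame similarity in place of MSE.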
  • the initial rendering model can be any rendering model, and the embodiment of the present application does not limit this.
• the rendering model can be a generative adversarial network (GAN) model, a convolutional neural network (CNN) model, or a U-net model.
• the initial rendering model includes a pre-training layer; the pre-training layer is at least one network layer in the initial rendering model, and the number of network layers included in the pre-training layer is less than the total number of network layers in the initial rendering model.
  • adjusting the parameters of the initial rendering model according to the similarity, and using the initial rendering model after adjusting the parameters as the target rendering model includes: adjusting the weight of the pre-training layer in the initial rendering model according to the similarity value, and using The initial rendering model after adjusting the weights is used as the target rendering model.
  • the pre-training layer includes a reference number of network layers from back to front among the network layers corresponding to the initial rendering model.
• the initial rendering model includes five network layers A, B, C, D, and E from front to back. If the reference number corresponding to the pre-training layer is 2, the network layers included in the pre-training layer are D and E; if the reference number corresponding to the pre-training layer is 3, the network layers included in the pre-training layer are C, D, and E. This embodiment does not limit the number of network layers included in the pre-training layer; it can be set based on experience or on the application scenario.
  • the initial rendering model is a CNN model
• the CNN model includes an input layer, multiple convolutional layers, a pooling layer, and a fully connected layer, where the pre-training layer includes some of the multiple convolutional layers, the pooling layer, and the fully connected layer. Since the pre-training layer is only part of the network layers of the rendering model, adjusting the weights of the pre-training layer involves fewer network layers and therefore requires fewer computing resources. Accordingly, the efficiency of training the rendering model is improved.
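Selecting the pre-training layer from back to front, as in the A–E example above, can be sketched as follows. Layer names here are plain labels; in a real framework the same selection would mark tensors as trainable or frozen.

```python
def select_pretraining_layers(layers, reference_number):
    """Pick the reference number of network layers counted from back to
    front, matching the A-E example in the description."""
    return layers[-reference_number:]

def trainable_flags(layers, reference_number):
    """Map each layer to whether its weights are adjusted during training:
    only the pre-training layers have their weights updated."""
    pre = set(select_pretraining_layers(layers, reference_number))
    return {layer: layer in pre for layer in layers}
```

With `reference_number=2`, only D and E would have their weights adjusted, while A, B, and C keep their original weights, which is what reduces the computing resources needed.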
• using the initial rendering model after adjusting the parameters as the target rendering model includes: in response to the similarity between each frame of the video generated by the initial rendering model after the parameters are adjusted and each frame of the first video being not less than a similarity threshold, using the initial rendering model after the parameters are adjusted as the target rendering model.
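The stopping criterion above — every generated frame at least as similar to its counterpart in the first video as the threshold — can be expressed as a one-line check. The similarity values and threshold below are placeholders; the embodiment does not fix a particular similarity measure or threshold value.

```python
def training_converged(per_frame_similarities, threshold):
    """True when the similarity of every frame to the corresponding frame
    of the first video is not less than the similarity threshold."""
    return all(s >= threshold for s in per_frame_similarities)
```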
• introducing a three-dimensional facial model when training the rendering model, and using the second video generated based on the three-dimensional facial model to train the rendering model, can reduce the number of samples required: usually only a few minutes of the second video and the corresponding first video are needed to complete the training of the rendering model.
• the definition of the video rendered by the rendering model is strongly related to the definition of the first video. Therefore, the first video is usually a video with higher definition, so that the rendering model trained using the first video and the second video can render videos with higher definition.
• an embodiment of the present application provides a video rendering method; the application of the method to the terminal 11 is taken as an example.
  • the video rendering method provided by the embodiment of the present application may include the following steps 401 to 403.
  • Step 401 Obtain the video to be rendered including the target object.
  • the video to be rendered is a virtual object video
  • the virtual object video is generated by a virtual object generation model established based on the target object. That is to say, the video to be rendered is generated by a virtual object generation model established based on the target object.
  • the video to be rendered can be generated at the terminal, or after it is generated by a device other than the terminal, the video to be rendered can be sent to the terminal by a device other than the terminal.
  • Embodiments of the present application take the terminal to generate a video to be rendered based on a virtual object generation model as an example.
• obtaining the video to be rendered including the target object includes: obtaining a virtual object generation model established based on the target object, and generating the video to be rendered based on the virtual object generation model.
• the embodiment of this application takes the video to be rendered as a virtual object video as an example for explanation, but this does not mean that the video to be rendered can only be a virtual object video; the video to be rendered can also be a video obtained by directly shooting the target object, or a video including the target object generated in another way.
• the virtual object generation model used in the embodiment of the present application can be a lip synchronization Wav2Lip model, or any other virtual object generation model, which is not limited in the embodiment of the present application.
  • the virtual object generation model established based on the target object is stored in the memory of the terminal, and obtaining the virtual object generation model established based on the target object includes: directly obtaining the virtual object generation model from the memory of the terminal.
  • the virtual object generation model is stored in the memory of the server, and obtaining the virtual object generation model established based on the target object includes: obtaining the virtual object generation model from the server.
  • the virtual object generation model established based on the target object is a virtual object generation model that has completed training. It can use lip synchronization parameter adjustment to drive the virtual object to make corresponding facial movements, and then generate a video including the target object.
  • generating a video to be rendered based on the virtual object generation model includes: obtaining text used to generate the video to be rendered; converting the text into the speech of the target object, and the content of the speech corresponds to the content of the text; based on The voice acquires at least one set of lip synchronization parameters; the at least one set of lip synchronization parameters is input into the virtual object generation model, and the virtual object generation model drives the face of the virtual object corresponding to the target object based on the at least one set of lip synchronization parameters Make a corresponding action to obtain a virtual video corresponding to the at least one set of lip synchronization parameters; render the virtual video to obtain a video to be rendered.
  • the virtual object generation model includes virtual objects corresponding to the target object.
  • each frame in the video to be rendered is a static picture, and in the static picture, the lips have a certain shape.
  • a set of lip synchronization parameters is used to drive the virtual object to generate a corresponding frame in the video to be rendered.
  • the text used to generate the video to be rendered is manually input into the terminal.
• when the terminal detects the input of the text, the acquisition of the text used to generate the video to be rendered is completed accordingly.
  • the text used to generate the video to be rendered is manually input into a device other than the terminal, and the other device sends the text to the terminal.
• obtaining the text used to generate the video to be rendered includes: receiving the text, sent by the other device, used to generate the video to be rendered.
• converting the text into the speech of the target object includes: converting the text into the speech of the target object based on any model capable of converting text into speech, where the model is trained based on audio of the target object, so that it has the ability to convert text into the speech of the target object.
  • any model capable of converting text into speech may be a Char2Wav model (a speech synthesis model).
• obtaining at least one set of lip synchronization parameters based on the speech includes: inputting the speech into a neural network model used to generate lip synchronization parameters based on audio, where the neural network model generates at least one group of lip synchronization parameters based on the speech. For example, the neural network model used to generate lip synchronization parameters based on audio may be a long short-term memory (LSTM) model.
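The text-to-video pipeline described in step 401 can be sketched end to end. Every stage below is a placeholder: the toy "speech" samples stand in for a Char2Wav-style TTS model, the fixed-size chunking stands in for the LSTM that emits lip synchronization parameters, and the frame dictionaries stand in for frames produced by the virtual object generation model.

```python
def text_to_speech(text):
    # Placeholder TTS stage; real systems would synthesize audio samples.
    return [ord(c) for c in text]

def speech_to_lip_params(speech, group_size=2):
    # Placeholder for the audio-to-lip-sync network: one group of lip
    # synchronization parameters per output frame.
    return [speech[i:i + group_size] for i in range(0, len(speech), group_size)]

def drive_virtual_object(lip_param_groups):
    # One generated frame per group of lip synchronization parameters,
    # mirroring the one-group-one-frame correspondence in the description.
    return [{"frame": i, "lips": g} for i, g in enumerate(lip_param_groups)]
```

Note how the one-to-one correspondence between groups of lip synchronization parameters and frames of the video to be rendered falls out of the enumeration.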
  • Step 402 Map the facial movements of the target object in the video to be rendered based on the three-dimensional facial model to obtain an intermediate video including the three-dimensional face.
  • the method further includes: obtaining the three-dimensional facial model. How to obtain the three-dimensional facial model has been described in step 202 of the embodiment of the rendering model training method corresponding to Figure 2, and will not be described again here.
• mapping the facial movements of the target object in the video to be rendered based on the three-dimensional facial model to obtain an intermediate video including the three-dimensional face includes: cropping each frame of the video to be rendered, retaining the facial area of the target object in each frame, to obtain a facial video; and mapping the facial movements of the target object in the facial video based on the three-dimensional facial model to obtain the intermediate video.
• the implementation of cropping each frame of the video to be rendered and retaining the facial area of the target object in each frame to obtain the facial video is similar to the way in which the fourth video is cropped to obtain the first video, and will not be described again here.
  • the implementation of mapping the face of the target object in the facial video based on the three-dimensional facial model to obtain the intermediate video is shown in step 202 of the rendering model training method corresponding to Figure 2, which will not be described again here.
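The cropping step can be sketched as below. Frames are represented as 2D lists of pixel values, and the facial area as a `(top, bottom, left, right)` box; both representations, and the assumption of a fixed box for all frames, are simplifications made for the sketch (a real system would detect the face per frame).

```python
def crop_face_region(frame, box):
    """Keep only the facial area of one frame.
    frame: 2D list of pixel values; box: (top, bottom, left, right)."""
    top, bottom, left, right = box
    return [row[left:right] for row in frame[top:bottom]]

def crop_video(frames, box):
    """Crop every frame, producing the facial video in the original order."""
    return [crop_face_region(f, box) for f in frames]
```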
  • Step 403 Obtain the target rendering model corresponding to the target object, render the intermediate video based on the target rendering model, and obtain the target video.
• the target rendering model is sent to the terminal after completing training in a device other than the terminal.
• obtaining the target rendering model corresponding to the target object includes: receiving the target rendering model sent by the other device.
  • the target rendering model is trained in the terminal and stored in the memory of the terminal.
  • obtaining the target rendering model corresponding to the target object includes: obtaining the target rendering model from the memory.
  • the target rendering model is a target rendering model trained based on the rendering model training method corresponding to Figure 2.
• the clarity of the video rendered by the target rendering model is determined by the clarity of the supervision samples used to train the target rendering model; therefore, the definition of the target video output after the intermediate video is rendered by the target rendering model is higher than the definition of the video to be rendered.
• rendering the intermediate video based on the target rendering model to obtain the target video includes: rendering each frame in the intermediate video based on the target rendering model to obtain the same number of rendered pictures as the number of frames of the intermediate video; and combining the rendered pictures according to the corresponding relationship between the rendered pictures and each frame in the intermediate video to obtain the target video.
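The frame-wise rendering and reassembly just described can be sketched in a few lines; the per-frame renderer is passed in as a function, standing in for the target rendering model (an assumption for the sketch).

```python
def render_video(intermediate_frames, render_frame):
    """Render each frame with the target rendering model stand-in and
    reassemble in the original order, so the target video has the same
    number of frames as the intermediate video."""
    rendered = [render_frame(f) for f in intermediate_frames]
    assert len(rendered) == len(intermediate_frames)
    return rendered
```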
  • the definition of the target video corresponds to the definition of the supervision video used to train the target rendering model.
  • the target video with higher definition can be obtained by rendering the to-be-rendered video through the target rendering model. That is to say, by combining a low-definition virtual object generation model with a target rendering model, the terminal can generate a high-definition target video while spending less computing resources.
  • the embodiments of the present application also provide a rendering model training device and a video rendering device.
  • this embodiment of the present application also provides a rendering model training device, which includes:
  • Acquisition module 501 configured to acquire a first video including the face of the target object
  • Mapping module 502 configured to map the facial movements of the target object in the first video based on the three-dimensional facial model to obtain a second video including a three-dimensional face;
  • the training module 503 is used to use the second video as the input of the initial rendering model, use the first video as the output supervision of the initial rendering model, train the initial rendering model, and obtain the target rendering model.
• the mapping module 502 is used to extract the facial key points of the target object in each frame of the first video to obtain multiple groups of facial key points, where the number of groups of facial key points is the same as the number of frames of the first video and one frame corresponds to a group of facial key points; fit the three-dimensional facial model to each group of facial key points to obtain multiple three-dimensional facial images; and combine the multiple three-dimensional facial images according to the corresponding relationship between the three-dimensional facial images and each frame of the first video to obtain the second video including the three-dimensional face.
• the mapping module 502 is used to fit the three-dimensional facial model to each group of facial key points using a neural network to obtain multiple three-dimensional facial images.
• the training module 503 is used to input the second video into the initial rendering model, and render the second video by the initial rendering model to obtain the third video; calculate the similarity between each frame of the first video and each frame of the third video; and adjust the parameters of the initial rendering model based on the similarity, using the initial rendering model after the parameters are adjusted as the target rendering model.
• the training module 503 is used to adjust the weight of the pre-training layer in the initial rendering model according to the similarity, and use the initial rendering model after the weights are adjusted as the target rendering model, where the pre-training layer is at least one network layer in the initial rendering model and the number of network layers included in the pre-training layer is less than the total number of network layers in the initial rendering model.
• the training module 503 is configured to, in response to the similarity between each frame of the video generated by the initial rendering model after the parameters are adjusted and each frame of the first video being not less than the similarity threshold, use the initial rendering model after the parameters are adjusted as the target rendering model.
  • the acquisition module 501 is used to acquire a fourth video including a target object; crop each frame of the fourth video to retain the face of the target object in each frame of the fourth video. area, get the first video.
  • this embodiment of the present application also provides a video rendering device, which includes:
  • Obtaining module 601 is used to obtain the video to be rendered including the target object
  • the mapping module 602 is used to map the facial movements of the target object in the video to be rendered based on the three-dimensional facial model to obtain an intermediate video including the three-dimensional face;
  • the acquisition module 601 is also used to acquire the target rendering model corresponding to the target object;
  • the rendering module 603 is used to render the intermediate video based on the target rendering model to obtain the target video.
  • the acquisition module 601 is used to acquire a virtual object generation model established based on the target object; and generate a video to be rendered based on the virtual object generation model.
• the acquisition module 601 is used to acquire the text used to generate the video to be rendered; convert the text into the speech of the target object, where the content of the speech corresponds to the content of the text; obtain at least one set of lip synchronization parameters based on the speech; input the at least one set of lip synchronization parameters into the virtual object generation model, where the virtual object generation model drives the face of the virtual object corresponding to the target object to make corresponding actions based on the at least one set of lip synchronization parameters, to obtain a virtual video corresponding to the at least one set of lip synchronization parameters; and render the virtual video to obtain the video to be rendered.
  • the rendering module 603 is used to render each frame of the intermediate video based on the target rendering model, and obtain the same number of rendered images as the number of frames of the intermediate video; according to the rendered images Corresponding to each frame in the intermediate video, the rendered images are combined to obtain the target video.
  • the mapping module 602 is used to crop each frame of the video to be rendered, retain the facial area of the target object in each frame of the video to be rendered, and obtain a facial video; based on the three-dimensional facial model The facial movements of the target object in the facial video are mapped to obtain the intermediate video.
  • Figure 7 shows a schematic structural diagram of an electronic device provided by an exemplary embodiment of the present application.
  • the electronic device shown is used to perform operations involved in the rendering model training method shown in FIG. 2 or the video rendering method shown in FIG. 4 .
  • the electronic device is, for example, a terminal, etc., and the electronic device can be implemented by a general bus architecture.
  • the electronic device includes at least one processor 701 , a memory 703 and at least one communication interface 704 .
  • the processor 701 is, for example, a general central processing unit (CPU), a digital signal processor (DSP), a network processor (NP), a graphics processor (Graphics Processing Unit, GPU), Neural network processors (neural-network processing units, NPU), data processing units (Data Processing Unit, DPU), microprocessors or one or more integrated circuits used to implement the solution of this application.
  • the processor 701 includes an application-specific integrated circuit (ASIC), a programmable logic device (PLD) or other programmable logic devices, transistor logic devices, hardware components, or any combination thereof.
  • PLD is, for example, a complex programmable logic device (CPLD), a field-programmable gate array (FPGA), a general array logic (GAL), or any combination thereof.
  • the processor can also be a combination that implements computing functions, such as a combination of one or more microprocessors, a combination of a DSP and a microprocessor, and so on.
  • the electronic device also includes a bus.
  • Buses are used to transfer information between components of electronic equipment.
  • the bus can be a peripheral component interconnect (PCI) bus or an extended industry standard architecture (EISA) bus, etc.
  • the bus can be divided into address bus, data bus, control bus, etc. For ease of presentation, only one line is used in Figure 7, but it does not mean that there is only one bus or one type of bus.
• the memory 703 is, for example, a read-only memory (ROM) or other type of static storage device that can store static information and instructions, a random access memory (RAM) or other type of dynamic storage device that can store information and instructions, an electrically erasable programmable read-only memory (EEPROM), a compact disc read-only memory (CD-ROM) or other optical disc storage, optical disc storage (including compressed optical discs, laser discs, optical discs, digital versatile discs, Blu-ray discs, etc.), magnetic disk storage media or other magnetic storage devices, or any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer, without limitation.
  • the memory 703 exists independently, for example, and is connected to the processor 701 through a bus. Memory 703 may also be integrated with processor 701.
  • the communication interface 704 uses any device such as a transceiver to communicate with other devices or a communication network.
  • the communication network can be Ethernet, a radio access network (RAN) or a wireless local area network (WLAN), etc.
  • the communication interface 704 may include a wired communication interface and may also include a wireless communication interface.
  • the communication interface 704 may be an Ethernet (Ethernet) interface, a Fast Ethernet (FE) interface, a Gigabit Ethernet (GE) interface, an asynchronous transfer mode (Asynchronous Transfer Mode, ATM) interface, or a wireless LAN (wireless local area networks, WLAN) interface, cellular network communication interface or a combination thereof.
  • the Ethernet interface can be an optical interface, an electrical interface, or a combination thereof.
  • the communication interface 704 can be used for electronic devices to communicate with other devices.
  • the processor 701 may include one or more CPUs, such as CPU0 and CPU1 as shown in FIG. 7 .
  • Each of these processors may be a single-CPU processor or a multi-CPU processor.
  • a processor here may refer to one or more devices, circuits, and/or processing cores for processing data (eg, computer program instructions).
• the electronic device may include multiple processors, such as the processor 701 and the processor 705 shown in Figure 7.
  • processors can be a single-core processor (single-CPU) or a multi-core processor (multi-CPU).
  • a processor here may refer to one or more devices, circuits, and/or processing cores for processing data (such as computer program instructions).
  • the electronic device may also include an output device and an input device.
  • Output devices communicate with processor 701 and can display information in a variety of ways.
  • the output device may be a liquid crystal display (LCD), a light emitting diode (LED) display device, a cathode ray tube (CRT) display device, or a projector (projector), etc.
  • Input devices communicate with processor 701 and can receive user input in a variety of ways.
  • the input device may be a mouse, a keyboard, a touch screen device or a sensing device, etc.
  • the memory 703 is used to store the program code 710 for executing the solution of the present application, and the processor 701 can execute the program code 710 stored in the memory 703. That is, the electronic device can implement the rendering model training method or the video rendering method provided by the method embodiment through the processor 701 and the program code 710 in the memory 703 .
  • Program code 710 may include one or more software modules.
  • the processor 701 itself can also store program codes or instructions for executing the solution of the present application.
  • Each step of the rendering model training method shown in Figure 2 or the video rendering method shown in Figure 4 is completed by instructions in the form of hardware integrated logic circuits or software in the processor of the electronic device.
  • the steps of the methods disclosed in conjunction with the embodiments of the present application can be directly implemented by a hardware processor for execution, or can be executed by a combination of hardware and software modules in the processor.
  • the software module can be located in random access memory, flash memory, read-only memory, programmable read-only memory or electrically erasable programmable memory, registers and other mature storage media in this field.
  • the storage medium is located in the memory, and the processor reads the information in the memory and completes the steps of the above method in combination with its hardware. To avoid repetition, the details will not be described here.
  • FIG. 8 is a schematic structural diagram of a server provided by an embodiment of the present application.
• the server may vary greatly due to different configurations or performance, and may include one or more processors (central processing units, CPU) 801 and one or more memories 802, where at least one computer program is stored in the one or more memories 802, and the at least one computer program is loaded and executed by the one or more processors 801 to enable the server to implement the methods provided by each of the above method embodiments.
  • the server can also have components such as wired or wireless network interfaces, keyboards, and input and output interfaces to facilitate input and output.
  • the server can also include other components for implementing device functions, which will not be described again here.
  • An embodiment of the present application also provides a communication device, which includes: a transceiver, a memory, and a processor.
  • the transceiver, the memory and the processor communicate with each other through an internal connection path, the memory is used to store instructions, and the processor is used to execute the instructions stored in the memory to control the transceiver to receive signals and control the transceiver to send signals.
  • the processor executes the instructions stored in the memory, the processor is caused to execute the rendering model training method or the video rendering method.
  • the processor can be a CPU, or another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, etc.
  • a general-purpose processor can be a microprocessor or any conventional processor. It is worth noting that the processor may be a processor that supports the advanced RISC machines (ARM) architecture.
  • the above-mentioned memory may include a read-only memory and a random access memory, and provide instructions and data to the processor.
  • Memory may also include non-volatile random access memory.
  • the memory can also store device type information.
  • the memory may be volatile memory or non-volatile memory, or may include both volatile and non-volatile memory.
  • non-volatile memory can be read-only memory (ROM), programmable read-only memory (PROM), erasable programmable read-only memory (EPROM), electrically erasable programmable read-only memory (EEPROM) or flash memory.
  • volatile memory can be random access memory (RAM), which is used as an external cache. By way of illustration, but not limitation, many forms of RAM are available, for example, static random access memory (SRAM), dynamic random access memory (DRAM), synchronous dynamic random access memory (SDRAM), double data rate synchronous dynamic random access memory (DDR SDRAM), enhanced synchronous dynamic random access memory (ESDRAM), synchlink dynamic random access memory (SLDRAM) and direct rambus random access memory (DR RAM).
  • Embodiments of the present application also provide a computer-readable storage medium. At least one instruction is stored in the storage medium, and the instruction is loaded and executed by a processor so that the computer implements any of the above-mentioned rendering model training methods or video rendering methods.
  • Embodiments of the present application also provide a computer program (product).
  • when the computer program is executed by a computer, it can cause the processor or the computer to execute the corresponding steps and/or processes in the above method embodiments.
  • Embodiments of the present application also provide a chip, including a processor, configured to call and run instructions stored in a memory, so that a communication device installed with the chip performs the rendering model training method or the video rendering method described above.
  • An embodiment of the present application also provides another chip, including: an input interface, an output interface, a processor, and a memory.
  • the input interface, the output interface, the processor and the memory are connected through an internal connection path; the processor is configured to execute the code in the memory, and when the code is executed, the processor is configured to execute any of the above-mentioned rendering model training methods or video rendering methods.
  • the computer program product includes one or more computer instructions.
  • the computer may be a general-purpose computer, a special-purpose computer, a computer network, or other programmable device.
  • the computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another; for example, the computer instructions may be transmitted from a website, computer, server or data center to another website, computer, server or data center through wired (such as coaxial cable, optical fiber or digital subscriber line) or wireless (such as infrared, radio or microwave) means.
  • the computer-readable storage medium may be any available medium that can be accessed by a computer or a data storage device such as a server, data center, etc. that contains one or more available media integrated.
  • the available media may be magnetic media (e.g., floppy disk, hard disk, tape), optical media (e.g., DVD), or semiconductor media (e.g., solid state disk), etc.
  • the program can be stored in a computer-readable storage medium.
  • the storage medium can be read-only memory, magnetic disk or optical disk, etc.
  • the computer program product includes one or more computer program instructions.
  • the methods of the embodiments of the present application may be described in the context of machine-executable instructions, such as those included in program modules executed in a device on a target real or virtual processor.
  • program modules include routines, programs, libraries, objects, classes, components, data structures, etc., which perform specific tasks or implement specific abstract data types.
  • the functionality of program modules may be combined or split between the described program modules.
  • Machine-executable instructions for program modules may be executed locally or in a distributed device. In a distributed device, program modules may be located in both local and remote storage media.
  • Computer program code for implementing the methods of the embodiments of the present application may be written in one or more programming languages. This computer program code may be provided to a processor of a general-purpose computer, a special-purpose computer or another programmable data processing device, so that when executed by the computer or the other programmable data processing device, the program code causes the functions/operations specified in the flowcharts and/or block diagrams to be implemented.
  • the program code may execute entirely on the computer, partly on the computer, as a stand-alone software package, partly on the computer and partly on a remote computer or entirely on the remote computer or server.
  • the computer program code or related data may be carried by any appropriate carrier, so that the device, device or processor can perform the various processes and operations described above.
  • Examples of carriers include signals, computer-readable media, and the like.
  • Examples of signals may include electrical, optical, radio, acoustic, or other forms of propagated signals, such as carrier waves, infrared signals, and the like.
  • a machine-readable medium may be any tangible medium that contains or stores a program for or in connection with an instruction execution system, apparatus, or device.
  • the machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium.
  • Machine-readable media may include, but are not limited to, electronic, magnetic, optical, electromagnetic, infrared or semiconductor systems, apparatuses or devices, or any suitable combination thereof. More specific examples of machine-readable storage media include an electrical connection with one or more wires, a laptop computer disk, a hard drive, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), an optical storage device, a magnetic storage device, or any suitable combination thereof.
  • the disclosed systems, devices and methods can be implemented in other ways.
  • the device embodiments described above are only illustrative.
  • the division of the modules is only a logical function division; in actual implementation, there may be other division methods. For example, multiple modules or components may be combined or integrated into another system, or some features may be ignored or not implemented.
  • the coupling or direct coupling or communication connection between each other shown or discussed may be indirect coupling or communication connection through some interfaces, devices or modules, or may be electrical, mechanical or other forms of connection.
  • the modules described as separate components may or may not be physically separated.
  • the components shown as modules may or may not be physical modules; that is, they may be located in one place or distributed across multiple network modules. Some or all of the modules can be selected according to actual needs to achieve the purpose of the embodiments of the present application.
  • each functional module in each embodiment of the present application can be integrated into one processing module, or each module can exist physically alone, or two or more modules can be integrated into one module.
  • the above integrated modules can be implemented in the form of hardware or software function modules.
  • if the integrated module is implemented in the form of a software function module and sold or used as an independent product, it can be stored in a computer-readable storage medium.
  • the technical solution of the present application essentially, or the part contributing to the prior art, or all or part of the technical solution, can be embodied in the form of a software product; the computer software product is stored in a storage medium and includes several instructions to cause a computer device (which can be a personal computer, a server, a network device, etc.) to execute all or part of the steps of the methods in the various embodiments of the present application.
  • the aforementioned storage media include: USB flash drive, removable hard disk, read-only memory (ROM), random access memory (RAM), magnetic disk, optical disk and other media that can store program code.
  • the words "first", "second" and so on are used to distinguish identical or similar items whose functions and effects are essentially the same. It should be understood that there is no logical or temporal dependency among "first", "second" and "nth", and no limitation on quantity or execution order. It should also be understood that, although the following description uses the terms first, second, etc. to describe various elements, these elements should not be limited by the terms; the terms are only used to distinguish one element from another. For example, a first image may be referred to as a second image, and similarly, a second image may be referred to as a first image, without departing from the scope of the various described examples. Both the first image and the second image may be images and, in some cases, may be separate and different images.
  • the size of the sequence numbers of the processes does not imply an execution order; the execution order of the processes should be determined by their functions and internal logic, and should not constitute any limitation on the implementation process of the embodiments of the present application.
  • determining B based on A does not mean determining B only based on A; B can also be determined based on A and/or other information.
  • references throughout this specification to "one embodiment", "an embodiment" or "a possible implementation" mean that specific features, structures or characteristics related to the embodiment or implementation are included in at least one embodiment of the present application. Therefore, "in one embodiment", "in an embodiment" or "a possible implementation" appearing in various places throughout this specification does not necessarily refer to the same embodiment. Furthermore, the particular features, structures or characteristics may be combined in any suitable manner in one or more embodiments.
  • the information (including but not limited to user equipment information, user personal information, etc.), data (including but not limited to data used for analysis, stored data, displayed data, etc.) and signals involved in this application are all authorized by the user or fully authorized by all parties, and the collection, use and processing of the relevant data must comply with the relevant laws, regulations and standards of the relevant countries and regions.
  • the first video including the face of the target subject involved in this application was obtained with full authorization.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • General Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Databases & Information Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • Human Computer Interaction (AREA)
  • Computer Graphics (AREA)
  • Processing Or Creating Images (AREA)

Abstract

This application discloses a rendering model training method, a video rendering method, apparatuses, a device and a storage medium, belonging to the field of image processing technology. The rendering model training method includes: obtaining a first video including the face of a target subject (201); mapping the facial movements of the target subject in the first video on the basis of a three-dimensional facial model to obtain a second video including a three-dimensional face (202); and taking the second video as the input of an initial rendering model and the first video as output supervision of the initial rendering model, and training the initial rendering model to obtain a target rendering model (203). By using the second video, generated on the basis of the three-dimensional facial model, as a sample for training the rendering model, the need to record a large number of low-definition videos under demanding recording requirements before training is avoided, thereby reducing the computing resources and time required for training the rendering model.

Description

渲染模型训练、视频的渲染方法、装置、设备和存储介质
This application claims priority to Chinese patent application No. 202210273192.8, filed on March 18, 2022 and entitled "Rendering model training and video rendering methods, apparatuses, device and storage medium", the entire contents of which are incorporated herein by reference.
Technical Field
This application relates to the field of image processing technology, and in particular to rendering model training and video rendering methods, apparatuses, a device and a storage medium.
Background
With the development of image processing technology, the definition of videos rendered by rendering models has become higher and higher. How to train a rendering model has therefore become the key to rendering videos with relatively high definition.
In the related art, in order for a rendering model to render high-definition videos, a large number of low-definition videos and high-definition videos need to be recorded under relatively demanding recording requirements, where a low-definition video is a video whose definition is below a threshold and a high-definition video is a video whose definition is above the threshold. The rendering model is then trained by taking the low-definition videos as sample videos input into the rendering model and taking the high-definition videos corresponding to the low-definition videos as supervision videos, so that the trained rendering model can render videos whose definition corresponds to that of the supervision videos.
Because the above training approach requires a large number of high-definition and low-definition videos to be recorded under relatively demanding recording requirements, not only is it difficult to obtain samples, but training the rendering model also requires considerable computing resources and a long training time.
发明内容
本申请提供了一种渲染模型训练、视频的渲染方法、装置、设备和存储介质,以解决相关技术提供的问题,技术方案如下:
第一方面,提供了一种渲染模型训练方法,所述方法包括:获取包括目标对象的面部的第一视频;基于三维面部模型对所述第一视频中的所述目标对象的面部动作进行映射,得到包括三维面部的第二视频;将所述第二视频作为初始渲染模型的输入,以所述第一视频作为所述初始渲染模型的输出监督,对所述初始渲染模型进行训练,得到目标渲染模型。
本申请提供的技术方案,通过引入三维面部模型,并利用基于三维面部模型生成的包括三维面部的第二视频作为训练渲染模型的样本,避免了在训练渲染模型前需要较高的录制要求下录制大量的低清晰度视频,进而减少了训练渲染模型时所使用的计算资源和时间,并且通过三维面部模型的引入能够使训练出的目标渲染模型具备较高的泛化能力。
在一种可能的实现方式中,所述基于三维面部模型对所述第一视频中的所述目标对象的面部动作进行映射,得到包括三维面部的第二视频,包括:提取所述第一视频的每一帧画面中的所述目标对象的面部关键点,得到多组面部关键点,所述面部关键点的组数与所述第一视频的帧数相同,一帧画面对应一组面部关键点;将所述三维面部模型与每组面部关键点进行拟合,得到多个三维面部画面;根据所述三维面部画面与所述第一视频的每一帧画面的对应关系,将所述多个三维面部画面进行组合,得到包括所述三维面部的第二视频。
面部关键点具有较好的稳定性,因而,在第一视频的清晰度较低时也能较好地从第一视频中提取到目标对象的面部关键点,进而根据提取到的目标对象的面部关键点与三维面部模型进行拟合,得到第二视频,提高了获取第二视频的可靠性。
在一种可能的实现方式中,所述将所述三维面部模型与每组面部关键点进行拟合,得到多个三维面部画面,包括:利用神经网络将所述三维面部模型与每组面部关键点进行拟合,得到所述多个三维面部画面。
在一种可能的实现方式中,所述将所述第二视频作为初始渲染模型的输入,以所述第一视频作为所述初始渲染模型的输出监督,对所述初始渲染模型进行训练,得到目标渲染模型,包括:将所述第二视频输入初始渲染模型,由所述初始渲染模型对所述第二视频进行渲染得到第三视频;计算所述第一视频和所述第三视频的每一帧画面之间的相似度;根据所述相似度调整所述初始渲染模型的参数,将调整参数后的初始渲染模型作为所述目标渲染模型。
利用通过提取面部关键点的方式生成的第二视频来训练渲染模型,可以使得训练完成的渲染模型具有更好泛化能力。
在一种可能的实现方式中,所述根据所述相似度调整所述初始渲染模型的参数,将调整参数后的初始渲染模型作为所述目标渲染模型,包括:根据所述相似度调整所述初始渲染模型中的预训练层的权重,将调整权重后的初始渲染模型作为所述目标渲染模型,所述预训练层为所述初始渲染模型中的至少一层网络,所述预训练层包括的网络层数少于所述初始渲染模型中的网络总层数。通过调整预训练层的参数,可以降低调整渲染模型时所需计算资源,使得渲染模型可以更快地完成训练,提高训练效率。
在一种可能的实现方式中,所述将调整参数后的初始渲染模型作为所述目标渲染模型,包括:响应于根据调整参数后的初始渲染模型生成的视频中的每一帧画面,与所述第一视频中的每一帧画面之间的相似度均不小于相似度阈值,将调整参数后的初始渲染模型作为所述目标渲染模型。
在一种可能的实现方式中,所述获取包括目标对象的面部的第一视频,包括:获取包括所述目标对象的第四视频;对所述第四视频的每一帧画面进行裁剪,保留所述第四视频的每一帧画面中所述目标对象的面部区域,得到所述第一视频。通过从包括目标对象的第四视频中裁剪出包括目标对象的面部的第一视频,并通过第一视频来体现目标对象的面部动作,可以在使用较少计算资源的情况下,完成对目标对象的面部关键点的提取。
第二方面,提供了一种视频的渲染方法,所述方法包括:获取包括目标对象的待渲染视频;基于三维面部模型对所述待渲染视频中目标对象的面部动作进行映射,得到包括三维面部的中间视频;获取与所述目标对象对应的目标渲染模型;基于所述目标渲染模型对所述中间视频进行渲染,得到目标视频。
通过引入三维面部模型来帮助对待渲染视频进行渲染,使得待渲染视频的清晰度在较低的情况下,依然可以通过目标渲染模型较好地渲染出目标视频。
在一种可能的实现方式中,所述获取包括目标对象的待渲染视频,包括:获取基于所述目标对象建立的虚拟对象生成模型;基于所述虚拟对象生成模型生成所述待渲染视频。
在一种可能的实现方式中,所述基于所述虚拟对象生成模型生成所述待渲染视频,包括: 获取用于生成所述待渲染视频的文本;将所述文本转化为所述目标对象的语音,所述语音的内容与所述文本的内容对应;基于所述语音获取至少一组音唇同步参数;将所述至少一组音唇同步参数输入所述虚拟对象生成模型,由所述虚拟对象生成模型基于所述至少一组音唇同步参数,驱动所述目标对象对应的虚拟对象的面部做出相应的动作,得到所述至少一组音唇同步参数对应的虚拟视频;对所述虚拟视频进行渲染,得到所述待渲染视频。
在一种可能的实现方式中,所述基于所述目标渲染模型对所述中间视频进行渲染,得到目标视频,包括:基于所述目标渲染模型对所述中间视频中每一帧画面进行渲染,得到与所述中间视频的帧数相同数量的渲染后的画面;根据渲染后的画面与所述中间视频中每一帧画面的对应关系,将渲染后的画面进行组合,得到所述目标视频。
在一种可能的实现方式中,所述基于三维面部模型对所述待渲染视频中目标对象的面部动作进行映射,得到包括三维面部的中间视频,包括:对所述待渲染视频的每一帧画面进行裁剪,保留所述待渲染视频的每一帧画面中所述目标对象的面部区域,得到面部视频;基于所述三维面部模型对所述面部视频中目标对象的面部动作进行映射,得到所述中间视频。通过对待渲染视频进行裁剪来得到面部视频,再基于三维面部模型对面部视频中目标对象的面部动作进行映射,能够在使用较少计算资源的情况下,完成对目标对象的面部动作的映射。
第三方面,提供了一种渲染模型训练装置,所述装置包括:
获取模块,用于获取包括目标对象的面部的第一视频;
映射模块,用于基于三维面部模型对所述第一视频中的所述目标对象的面部动作进行映射,得到包括三维面部的第二视频;
训练模块,用于将所述第二视频作为初始渲染模型的输入,以所述第一视频作为所述初始渲染模型的输出监督,对所述初始渲染模型进行训练,得到目标渲染模型。
在一种可能的实现方式中,所述映射模块,用于提取所述第一视频的每一帧画面中的所述目标对象的面部关键点,得到多组面部关键点,所述面部关键点的组数与所述第一视频的帧数相同,一帧画面对应一组面部关键点;将所述三维面部模型与每组面部关键点进行拟合,得到多个三维面部画面;根据所述三维面部画面与所述第一视频的每一帧画面的对应关系,将所述多个三维面部画面进行组合,得到包括所述三维面部的第二视频。
在一种可能的实现方式中,所述映射模块,用于利用神经网络将所述三维面部模型与每组面部关键点进行拟合,得到所述多个三维面部画面。
在一种可能的实现方式中,所述训练模块,用于将所述第二视频输入初始渲染模型,由所述初始渲染模型对所述第二视频进行渲染得到第三视频;计算所述第一视频和所述第三视频的每一帧画面之间的相似度;根据所述相似度调整所述初始渲染模型的参数,将调整参数后的初始渲染模型作为所述目标渲染模型。
在一种可能的实现方式中,所述训练模块,用于根据所述相似度调整所述初始渲染模型中的预训练层的权重,将调整权重后的初始渲染模型作为所述目标渲染模型,所述预训练层为所述初始渲染模型中的至少一层网络,所述预训练层包括的网络层数少于所述初始渲染模型中的网络总层数。
在一种可能的实现方式中,所述训练模块,用于响应于根据调整参数后的初始渲染模型生成的视频中的每一帧画面,与所述第一视频中的每一帧画面之间的相似度均不小于相似度阈值,将调整参数后的初始渲染模型作为所述目标渲染模型。
在一种可能的实现方式中,所述获取模块,用于获取包括所述目标对象的第四视频;对所述第四视频的每一帧画面进行裁剪,保留所述第四视频的每一帧画面中所述目标对象的面部区域,得到所述第一视频。
第四方面,提供了一种视频的渲染装置,所述装置包括:
获取模块,用于获取包括目标对象的待渲染视频;
映射模块,用于基于三维面部模型对所述待渲染视频中目标对象的面部动作进行映射,得到包括三维面部的中间视频;
所述获取模块,还用于获取与所述目标对象对应的目标渲染模型;
渲染模块,用于基于所述目标渲染模型对所述中间视频进行渲染,得到目标视频。
在一种可能的实现方式中,所述获取模块,用于获取基于所述目标对象建立的虚拟对象生成模型;基于所述虚拟对象生成模型生成所述待渲染视频。
在一种可能的实现方式中,所述获取模块,用于获取用于生成所述待渲染视频的文本;将所述文本转化为所述目标对象的语音,所述语音的内容与所述文本的内容对应;基于所述语音获取至少一组音唇同步参数;将所述至少一组音唇同步参数输入所述虚拟对象生成模型,由所述虚拟对象生成模型基于所述至少一组音唇同步参数,驱动所述目标对象对应的虚拟对象的面部做出相应的动作,得到所述至少一组音唇同步参数对应的虚拟视频;对所述虚拟视频进行渲染,得到所述待渲染视频。
在一种可能的实现方式中,所述渲染模块,用于基于所述目标渲染模型对所述中间视频中每一帧画面进行渲染,得到与所述中间视频的帧数相同数量的渲染后的画面;根据渲染后的画面与所述中间视频中每一帧画面的对应关系,将渲染后的画面进行组合,得到所述目标视频。
在一种可能的实现方式中,所述映射模块,用于对所述待渲染视频的每一帧画面进行裁剪,保留所述待渲染视频的每一帧画面中所述目标对象的面部区域,得到面部视频;基于所述三维面部模型对所述面部视频中目标对象的面部动作进行映射,得到所述中间视频。
第五方面,提供了一种通信装置,该装置包括:收发器、存储器和处理器。其中,该收发器、该存储器和该处理器通过内部连接通路互相通信,该存储器用于存储指令,该处理器用于执行该存储器存储的指令,以控制收发器接收信号,并控制收发器发送信号,并且当该处理器执行该存储器存储的指令时,使得该处理器执行第一方面或第一方面的任一种可能的实施方式中的方法,或者第二方面或第二方面的任一种可能的实施方式中的方法。
第六方面,提供了另一种通信装置,该装置包括:收发器、存储器和处理器。其中,该收发器、该存储器和该处理器通过内部连接通路互相通信,该存储器用于存储指令,该处理器用于执行该存储器存储的指令,以控制收发器接收信号,并控制收发器发送信号,并且当该处理器执行该存储器存储的指令时,使得该处理器执行第一方面或第一方面的任一种可能的实施方式中的方法,或者第二方面或第二方面的任一种可能的实施方式中的方法。
可选地,所述处理器为一个或多个,所述存储器为一个或多个。
可选地,所述存储器可以与所述处理器集成在一起,或者所述存储器与处理器分离设置。
在具体实现过程中,存储器可以为非瞬时性(non-transitory)存储器,例如只读存储器(read only memory,ROM),其可以与处理器集成在同一块芯片上,也可以分别设置在不同的芯片上,本申请对存储器的类型以及存储器与处理器的设置方式不做限定。
第七方面,提供了一种通信系统,该系统包括上述第三方面或第三方面的任一种可能实施方式中的装置以及第四方面或第四方面中的任一种可能实施方式中的装置。
第八方面,提供了一种计算机程序(产品),所述计算机程序(产品)包括:计算机程序代码,当所述计算机程序代码被计算机运行时,使得所述计算机执行上述各方面中的方法。
第九方面,提供了一种计算机可读存储介质,计算机可读存储介质存储程序或指令,当所述程序或指令在计算机上运行时,上述各方面中的方法被执行。
第十方面,提供了一种芯片,包括处理器,用于从存储器中调用并运行所述存储器中存储的指令,使得安装有所述芯片的通信设备执行上述各方面中的方法。
第十一方面,提供另一种芯片,包括:输入接口、输出接口、处理器和存储器,所述输入接口、输出接口、所述处理器以及所述存储器之间通过内部连接通路相连,所述处理器用于执行所述存储器中的代码,当所述代码被执行时,所述处理器用于执行上述各方面中的方法。
附图说明
图1为本申请实施例提供的一种实施环境的示意图;
图2为本申请实施例提供的一种渲染模型训练方法的流程图;
图3为本申请实施例提供的一种一帧画面裁剪前后的示意图;
图4为本申请实施例提供的一种视频的渲染方法的流程图;
图5为本申请实施例提供的一种渲染模型训练装置的示意图;
图6为本申请实施例提供的一种视频的渲染装置的示意图;
图7为本申请实施例提供的一种电子设备的结构示意图;
图8为本申请实施例提供的一种服务器的结构示意图。
具体实施方式
本申请的实施方式部分使用的术语仅用于对本申请的具体实施例进行解释,而非旨在限定本申请。
语音驱动虚拟对象是新兴起的一项技术,引起了人们的广泛关注。相比于真实对象而言,虚拟对象在可控性、风险性和成本方面更具有优势。利用语音驱动虚拟对象技术能够根据一段音频生成对应的一段虚拟对象视频,该过程包括:将语音输入生成音唇同步参数的模型,由生成音唇同步参数的模型根据该语音生成音唇同步参数;将音唇同步参数输入虚拟对象生成模型,由虚拟对象生成模型基于音唇同步参数生成虚拟对象视频。
其中,在由虚拟对象生成模型基于音唇同步参数生成虚拟对象视频时,由虚拟对象生成模型根据音唇同步参数驱动虚拟对象做出相应的面部动作,然后由虚拟对象生成模型中包括的渲染模型将该虚拟对象做出面部动作后的画面渲染出来,得到虚拟对象视频。
渲染模型渲染能力的优劣直接影响虚拟对象视频的清晰度,因而,为了使虚拟对象视频的清晰度更高,相关技术中,通过使用较高录制要求下录制的大量高清晰度视频与对应的低清晰度视频作为样本来训练渲染模型,以使渲染模型可以渲染出清晰度更高的虚拟对象视频。但是,相关技术中获取训练样本的难度较大,且训练渲染模型时所需的计算资源较多,训练时间较长。而且,相关技术中训练出的渲染模型在应用时也需要较大的计算资源,不利于在终端侧的应用。
对此,本申请实施例提供了一种渲染模型训练方法,该方法可以避免使用较高录制要求下录制大量的低清晰度视频作为样本来训练渲染模型,减少了训练渲染模型时所需的计算资源和时间。且在渲染模型训练完成后应用时所需的计算资源较少,能够适用于终端侧。本申请实施例还提供了一种视频的渲染方法,该方法中用于对视频进行渲染的渲染模型为,基于本申请实施例提供的渲染模型训练方法训练出的渲染模型,该方法能够在输入的视频清晰度较低的情况下完成对视频的渲染。
如图1所示,本申请实施例提供了一种实施环境。该实施环境可以包括:终端11和服务器12。
其中,终端11可以从服务器12中获取所需的内容,例如,视频。可选地,终端11配置有摄像装置,基于该摄像装置终端11可以获取到视频。
本申请实施例提供的渲染模型训练方法可以由终端11执行,也可以由服务器12执行,还可以由终端11和服务器12共同执行,本申请实施例对此不加以限定。本申请实施例提供的视频的渲染方法可以由终端11执行,也可以由服务器12执行,还可以由终端11和服务器12共同执行,本申请实施例对此不加以限定。此外,本申请实施例提供的渲染模型训练方法和视频的渲染方法可以由相同的设备执行,也可以由不同的设备执行,本申请实施例对此不加以限定。
在一种可能实现方式中,终端11可以是任何一种可与用户通过键盘、触摸板、触摸屏、遥控器、语音交互或手写设备等一种或多种方式进行人机交互的电子产品,例如PC(Personal Computer,个人计算机)、手机、智能手机、PDA(Personal Digital Assistant,个人数字助手)、可穿戴设备、PPC(Pocket PC,掌上电脑)、平板电脑、智能车机、智能电视、智能音箱、智能语音交互设备、智能家电、车载终端等。服务器12可以是一台服务器,也可以是由多台服务器组成的服务器集群,或者是一个云计算服务中心。终端11与服务器12通过有线或无线网络建立通信连接。
本领域技术人员应能理解上述终端11和服务器12仅为举例,其他现有的或今后可能出现的终端或服务器如可适用于本申请,也应包含在本申请保护范围以内,并在此以引用方式包含于此。
基于上述图1所示的实施环境,本申请实施例提供了一种渲染模型训练方法,以该方法应用于终端11为例。如图2所示,本申请实施例提供的渲染模型训练方法可以包括如下步骤201至步骤203。
步骤201,获取包括目标对象的面部的第一视频。
关于如何获取包括目标对象的面部的第一视频,本申请实施例不做限制。目标对象在进行说话、运动等行为时,面部会有相应的动作。而面部的动作可以被摄像装置记录下来,形成包含目标对象的面部的视频。因而,在示例性实施例中,获取包括目标对象的面部的第一视频,包括:基于配置在终端的摄像装置对目标对象的面部进行拍摄,从而获取到包括目标对象的面部的第一视频。在另一个示例性实施例中,获取包括目标对象的面部的第一视频,包括:获取包括目标对象的第四视频;对第四视频对应的每一帧画面进行裁剪,保留第四视频对应的每一帧画面中目标对象的面部区域,得到第一视频。
可选地,获取包括目标对象的面部的第一视频的方式,还可以是由除了终端外的其它设备拍摄到第一视频后将第一视频传输给终端,进而使得终端完成对第一视频的获取。
可选地,第四视频也可以是由除了终端之外的其他设备拍摄到第四视频后将第四视频传输给终端,进而使得终端完成对第四视频的获取。可选地,服务器中存储有第四视频,获取第四视频的方式还可以是直接从服务器中获取第四视频。
在示例性实施例中,对第四视频的每一帧画面进行裁剪,保留第四视频的每一帧画面中目标对象的面部区域,得到第一视频,包括:基于图像识别技术确定第四视频的每一帧画面中目标对象的面部对应的区域;基于图像分割技术从第四视频的每一帧画面中裁剪出目标对象的面部对应的区域,得到与第四视频的帧数相同数量的面部图像;依据每一个面部图像与第四视频中每一帧画面的对应关系,将全部的面部图像进行组合,得到第一视频。
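The cropping step described above (locate the face region in each frame, cut it out, and reassemble the face-only frames in their original order) can be sketched as follows. This is a minimal illustration assuming an upstream face detector has already produced one bounding box per frame; `crop_face_region` and `crop_video` are illustrative names, not identifiers from this application.

```python
def crop_face_region(frame, bbox):
    """Cut one frame (a 2D grid of pixel values) down to the face box.

    bbox is (top, bottom, left, right) with half-open row/column ranges,
    as a face detector might report; the detector itself is out of scope.
    """
    top, bottom, left, right = bbox
    return [row[left:right] for row in frame[top:bottom]]


def crop_video(frames, bboxes):
    """Crop every frame with its own box, keeping the original frame
    order so the face-only frames recombine into the first video."""
    return [crop_face_region(frame, bbox) for frame, bbox in zip(frames, bboxes)]
```

In practice the frames would be image arrays and the boxes would come from a face-detection model; the list-based version above only shows the per-frame bookkeeping.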
示例性地,一帧画面裁剪前后如图3所示,图3中的(1)为第四视频中的一帧画面,图3中的(2)为裁剪得到的包括目标对象的面部的一个面部图像。需要说明的是,图3仅用于帮助理解本申请实施例,图3中的(1)和(2)对应的外围框线的大小与对应的画面的大小无关。
需要说明的是,训练渲染模型所使用的样本与渲染模型的用途有关。例如,若渲染模型用于对包括老虎的视频进行渲染,则训练该渲染模型时应使用包括该老虎的视频作为训练样本,即渲染模型的训练是有针对性的。本申请实施例以渲染模型用于渲染包括目标对象的视频为例进行说明,即以包括目标对象的视频作为训练渲染模型的样本来进行说明。当渲染模型用于对其他类型的视频进行渲染时,同样可以应用本申请实施例提供的渲染模型训练方法对渲染模型进行训练。
步骤202,基于三维面部模型对第一视频中目标对象的面部动作进行映射,得到包括三维面部的第二视频。
示例性地,三维面部模型是包括面部样式的三维模型。在示例性实施例中,基于三维面部模型对第一视频中目标对象的面部动作进行映射,得到包括三维面部的第二视频之前,还包括:获取三维面部模型。在示例性实施例中,三维面部模型预先存储在终端的存储器中,此时,获取三维面部模型,包括:从存储器中获取三维面部模型。在另一个示例性实施例中,三维面部模型预先存储在服务器中,获取三维面部模型,包括:基于通信网络从服务器中获取三维面部模型。关于三维面部模型是哪一种具体的三维面部模型,本申请实施例不做限制。可选地,应用在本申请实施例中的三维面部模型可以是三维可变形面部模型(3 dimensions morphable face model,3DMM)。
在示例性实施例中,基于三维面部模型对第一视频中目标对象的面部动作进行映射,得到包括三维面部的第二视频,包括:提取第一视频的每一帧画面中目标对象的面部关键点,得到多组面部关键点,面部关键点的组数与第一视频的帧数相同,一帧画面对应一组面部关键点;将三维面部模型与每组面部关键点进行拟合,得到多个三维面部画面;根据三维面部画面与第一视频中每一帧画面的对应关系,将多个三维面部画面进行组合,得到包括三维面部的第二视频。
通过引入三维面部模型,并利用基于三维面部模型生成的第二视频作为训练渲染模型的样本,避免了在训练渲染模型前需要较高的录制要求下录制大量的低清晰度视频,进而减少了训练渲染模型时所使用的计算资源和时间。
在示例性实施例中,提取第一视频的每一帧画面中目标对象的面部关键点,得到多组面部关键点,包括:确定在第一视频的每一帧画面中面部关键点的位置;提取在第一视频的每一帧画面中面部关键点的位置处的坐标,得到多组面部关键点,面部关键点的组数与第一视频的帧数相同,一帧画面对应一组面部关键点。
在示例性实施例中,确定在第一视频的每一帧画面中面部关键点的位置,包括:基于用于确定关键位置的神经网络确定在第一视频的每一帧画面中面部关键点的位置。确定每一个面部关键点的位置后,可以直接根据每一个面部关键点的位置确定该位置对应的坐标,得到每一个面部关键点的坐标,进而提取在第一视频的每一帧画面中面部关键点的位置处的坐标,得到多组面部关键点。
在示例性实施例中,目标对象的面部关键点包括但不限于头部关键点、唇部关键点、眼部关键点和眉部关键点。关于每一组面部关键点中关键点的数量,本申请实施例不做限制,可根据经验设定。在示例性实施例中,面部关键点分布在设定的区域,例如,在唇部区域分布有20个面部关键点,在左眼区域和右眼区域各分部有6个关键点,在头部(脸颊)区域分布有17个关键点,在左眉区域和右眉区域各分布有5个关键点,在鼻部区域分布有9个关键点。由于面部关键点具有较好的稳定性,因而,在第一视频的清晰度较低时也能较好地从第一视频中提取到目标对象的面部关键点,进而根据提取到的目标对象的面部关键点与三维面部模型进行拟合,得到第二视频。也就是说,通过提取第一视频中目标对象的面部关键点与三维面部模型拟合的方式,来生成第二视频的方式稳定性强、受第一视频清晰度的影响小,在第一视频清晰度较低时同样可以较好地生成第二视频。
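The keypoint layout given above (20 lip points, 6 per eye, 17 head/cheek points, 5 per brow and 9 nose points) sums to 68, matching the widely used 68-point facial-landmark convention. A small sketch that encodes this layout and turns per-frame landmark lists into per-frame keypoint groups, one group per frame as described in the text; all names here are illustrative:

```python
# Region sizes taken from the text: 20 lip, 6 per eye, 17 head/cheek,
# 5 per brow and 9 nose keypoints.
FACE_REGIONS = {
    "lips": 20, "left_eye": 6, "right_eye": 6,
    "head": 17, "left_brow": 5, "right_brow": 5, "nose": 9,
}


def expected_keypoint_count():
    """Total keypoints per frame implied by the region layout above."""
    return sum(FACE_REGIONS.values())


def group_keypoints(per_frame_points):
    """One frame -> one group of keypoints; reject frames whose
    landmark count does not match the layout above."""
    groups = []
    for pts in per_frame_points:
        if len(pts) != expected_keypoint_count():
            raise ValueError("unexpected keypoint count")
        groups.append(list(pts))
    return groups
```

The number of groups then equals the number of frames, which is the correspondence the fitting step relies on.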
关于如何将三维面部模型与每一组面部关键点进行拟合,本申请实施例不做限制。可选地,可以基于神经网络将三维面部模型与每一组面部关键点进行拟合,得到多个三维面部画面。示例性地,用于实现三维面部模型与面部关键点拟合的神经网络为残差网络(residual neural network,ResNet)。可选地,还可以基于贝叶斯模型将三维面部模型与每一组面部关键点进行拟合。
由于三维面部画面是根据第一视频中每一帧画面提取到的面部关键点与三维面部模型进行拟合得到的画面,因而,一个三维面部画面对应第一视频中的一帧画面。在示例性实施例中,可以根据三维面部画面与第一视频中每一帧画面的对应关系,将多个三维面部画面组合成包括三维面部的第二视频。
步骤203,将第二视频作为初始渲染模型的输入,以第一视频作为初始渲染模型的输出监督,对初始渲染模型进行训练,得到目标渲染模型。
在示例性实施例中,将第二视频作为初始渲染模型的输入,以第一视频作为初始渲染模型的输出监督,对初始渲染模型进行训练,得到目标渲染模型,包括:将第二视频输入初始渲染模型,由初始渲染模型对第二视频进行渲染得到第三视频;计算第一视频和第三视频每一帧画面之间的相似度;根据该相似度调整初始渲染模型的参数,将调整参数后的初始渲染模型作为目标渲染模型。利用以提取面部关键点的方式生成的第二视频来训练渲染模型,可以使得训练完成的渲染模型具有更好的泛化能力。其中,相似度代表第三视频与第一视频的相似程度,相似度越高则代表初始渲染模型的渲染能力越好。而训练初始渲染模型的目标是,使得初始渲染模型对第二视频进行渲染输出的第三视频与第一视频尽可能的相似。
初始渲染模型可以是任意一种渲染模型,本申请实施例对此不做限制,可选地,初始渲染模型可以是生成对抗网络(generative adversarial network,GAN)模型,或者是卷积神经网络(convolutional neural networks,CNN)模型,或者是U-net模型。
在一些实施例中,初始渲染模型包括预训练层,预训练层为初始渲染模型中的至少一层网络,预训练层包括的网络层数少于初始渲染模型中的网络总层数。此种情况下,调整初始渲染模型的参数时,不必调整初始渲染模型的全部参数,对预训练层的权重进行调整即可。在示例性实施例中,根据该相似度调整初始渲染模型的参数,将调整参数后的初始渲染模型作为目标渲染模型,包括:根据相似度值调整初始渲染模型中的预训练层的权重,将调整权重后的初始渲染模型作为目标渲染模型。
在示例性实施例中,预训练层包括初始渲染模型对应的网络层中从后往前参考数量个网络层,示例性地,初始渲染模型包括从前到后的A、B、C、D、E共5层网络,若预训练层对应的参考数量为2,则预训练层包括的网络层为D、E;若预训练层对应的参考数量为3,则预训练层包括的网络层为C、D、E。关于预训练层包括的网络层数本申请实施例不做限制,可以根据经验设置,也可以根据应用场景进行设置。例如,初始渲染模型是CNN模型,且该CNN模型包括输入层、多个卷积层、池化层和全连接层,其中预训练层包括该多个卷积层中的部分卷积层以及池化层和全连接层。由于预训练层是渲染模型对应的网络层中的部分网络层,所以对预训练层的权重进行调整涉及的网络层数较少,进而调整预训练层的权重所需的计算资源也较少,相应地提高了训练渲染模型的效率。
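The pre-training-layer selection described above (for example, layers A to E with a reference count of 2 yields D and E; with a reference count of 3 it yields C, D and E) can be sketched as follows, with illustrative function names; only the selected layers' weights would be updated during training, which is what keeps the adjustment cheap.

```python
def select_pretrain_layers(layers, reference_count):
    """Return the last `reference_count` layers, counted from the back,
    as the pre-training layers whose weights are adjusted."""
    if reference_count >= len(layers):
        raise ValueError("pre-training layers must be fewer than all layers")
    return layers[-reference_count:]


def trainable_flags(layers, reference_count):
    """Map each layer name to whether its weights are updated; layers
    outside the pre-training set stay frozen."""
    pretrain = set(select_pretrain_layers(layers, reference_count))
    return {name: name in pretrain for name in layers}
```

In a framework such as PyTorch, the same idea would be expressed by setting `requires_grad` to False on the frozen layers' parameters.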
在调整初始渲染模型的参数时,需要确定调整参数后的初始渲染模型达到何种效果时才能作为目标渲染模型。关于初始渲染模型达到何种效果时才能作为目标渲染模型,本申请实施例不做限制,可选地,将调整参数后的初始渲染模型作为目标渲染模型,包括:响应于根据调整参数后的初始渲染模型生成的视频中的每一帧画面,与第一视频中的每一帧画面之间的相似度均不小于相似度阈值,将调整参数后的初始渲染模型作为目标渲染模型。
在训练渲染模型时引入三维面部模型,并利用基于三维面部模型生成的第二视频对渲染模型进行训练,可以降低训练渲染模型时所需的样本量,通常只需要几分钟的第二视频与对应的第一视频即可完成对渲染模型的训练。渲染模型渲染出的视频的清晰度与第一视频的清晰度强相关,因此,通常第一视频为清晰度较高的视频,以使利用第一视频和第二视频训练得到的渲染模型能够渲染出清晰度较高的视频。
基于上述图1所示的实施环境,本申请实施例提供了一种视频的渲染方法,以该方法应用于终端11为例。如图4所示,本申请实施例提供的视频的渲染方法可以包括如下步骤401至步骤403。
步骤401,获取包括目标对象的待渲染视频。
在示例性实施例中,待渲染视频为虚拟对象视频,虚拟对象视频由基于目标对象建立的虚拟对象生成模型生成,也就是说,待渲染视频由基于目标对象建立的虚拟对象生成模型生成。可选地,待渲染视频可以在终端生成,也可以在除了终端之外的其他设备生成后,由除了终端之外的其他设备将待渲染视频发往终端。本申请实施例以终端基于虚拟对象生成模型生成待渲染视频为例进行说明,此种情况下,获取包括目标对象的待渲染视频,包括:获取基于目标对象建立的虚拟对象生成模型;基于虚拟对象生成模型生成待渲染视频。
当然,本申请实施例以待渲染视频为虚拟对象视频为例进行说明,但并不表示待渲染视 频仅可以是虚拟对象视频,待渲染视频同样可以为直接对目标对象进行拍摄得到的视频,或者是其他方式生成的包括目标对象的视频。
可选地,本申请实施例中所使用的虚拟对象生成模型可以是嘴型同步wave2lip模型,还可以是其他任意一种虚拟对象生成模型,本申请实施例对此不做限制。
在示例性实施例中,基于目标对象建立的虚拟对象生成模型存储在终端的存储器中,获取基于目标对象建立的虚拟对象生成模型,包括:直接从终端的存储器中获取该虚拟对象生成模型。在另一个示例性实施例中,该虚拟对象生成模型存在服务器的存储器中,获取基于目标对象建立的虚拟对象生成模型,包括:从服务器中获取该虚拟对象生成模型。其中,基于目标对象建立的虚拟对象生成模型为完成训练的虚拟对象生成模型,能够利用音唇同步参数调驱动虚拟对象做出对应的面部动作,进而生成包括目标对象的视频。
在示例性实施例中,基于虚拟对象生成模型生成待渲染视频,包括:获取用于生成待渲染视频的文本;将该文本转化为目标对象的语音,该语音的内容与文本的内容对应;基于该语音获取至少一组音唇同步参数;将该至少一组音唇同步参数输入虚拟对象生成模型,由虚拟对象生成模型基于该至少一组音唇同步参数,驱动目标对象对应的虚拟对象的面部做出相应的动作,得到该至少一组音唇同步参数对应的虚拟视频;对虚拟视频进行渲染,得到待渲染视频。其中,虚拟对象生成模型包括目标对象对应的虚拟对象。通常,待渲染视频中的每一帧画面均为静态的画面,该静态的画面中,唇部对应有一定的形状。而一组音唇同步参数用于驱动虚拟对象生成待渲染视频中相应的一帧画面。
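The pipeline above chains four stages: text to speech, speech to lip-sync parameter groups, each parameter group driving the virtual subject to produce one frame, and rendering the resulting virtual video. A hedged sketch that injects each stage as a callable, since the concrete models (a Char2Wav-style TTS, an LSTM lip-sync model, a wav2lip-style avatar model) are interchangeable and none of their real APIs are assumed here:

```python
def generate_video_to_render(text, tts, to_lipsync_params, drive_avatar, render):
    """Chain the four stages of the text-to-video pipeline.

    tts:               text -> speech in the target subject's voice
    to_lipsync_params: speech -> at least one group of lip-sync parameters
    drive_avatar:      one parameter group -> one virtual-video frame
    render:            frames -> the video to be rendered
    """
    speech = tts(text)
    param_groups = to_lipsync_params(speech)
    if not param_groups:
        raise ValueError("no lip-sync parameters produced")
    # One group of lip-sync parameters drives one frame, as stated above.
    virtual_frames = [drive_avatar(p) for p in param_groups]
    return render(virtual_frames)
```

With real models plugged in, each callable would wrap the corresponding network; the sketch only fixes the data flow between stages.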
可选地,用于生成待渲染视频的文本由人工输入至终端中,当终端检测到该文本的输入后相应地完成了获取用于生成待渲染视频的文本。可选地,用于生成待渲染视频的文本由人工输入至除了终端之外的其他设备中,由其他设备将该文本发往终端,此时,获取用于生成待渲染视频的文本,包括:接收其他设备发送来的用于生成待渲染视频的文本。
在示例性实施例中,将该文本转化为目标对象的语音,包括:基于任一能够将文本转换为语音的模型将该文本转换为目标对象的语音,其中,该任一能够将文本转换为语音的模型基于目标对象的音频训练得到,以使该任一能够将文本转换为语音的模型具备将文本转换为目标对象的语音的能力。示例性地,该任一能够将文本转换为语音的模型可以是Char2Wav模型(一种语音合成模型)。
在示例性实施例中,基于该语音获取至少一组音唇同步参数,包括:将该语音输入用于根据音频生成音唇同步参数的神经网络模型,由该神经网络模型基于该语音生成至少一组音唇同步参数。示例性地,用于根据音频生成音唇同步参数的神经网络模型可以是长短期记忆网络(long short-term memory,LSTM)模型。
步骤402,基于三维面部模型对待渲染视频中目标对象的面部动作进行映射,得到包括三维面部的中间视频。
在基于三维面部模型对待渲染视频中目标对象的面部动作进行映射,得到包括三维面部的中间视频之前,还包括:获取三维面部模型。关于如何获取三维面部模型在图2对应的渲染模型训练方法的实施例步骤202中已经说明,在此不再赘述。
在示例性实施例中,基于三维面部模型对待渲染视频中目标对象的面部动作进行映射,得到包括三维面部的中间视频,包括:对待渲染视频的每一帧画面进行裁剪,保留待渲染视频的每一帧画面中目标对象的面部区域,得到面部视频;基于三维面部模型对面部视频中目 标对象的面部动作进行映射,得到中间视频。
其中,对待渲染视频的每一帧画面进行裁剪,保留待渲染视频的每一帧画面中目标对象的面部区域,得到面部视频的实现方式参见图2对应的渲染模型训练方法的步骤201中,对第四视频进行裁剪得到第一视频,在此不再赘述。基于三维面部模型对面部视频中目标对象的面部进行映射,得到中间视频的实现方式参见图2对应的渲染模型训练方法的步骤202,在此不再赘述。
步骤403,获取与目标对象对应的目标渲染模型,基于目标渲染模型对中间视频进行渲染,得到目标视频。
在示例性实施例中,目标渲染模型在除了终端之外的其他设备中完成训练后发往终端,此种情况下,获取与目标对象对应的目标渲染模型,包括:接收其他设备发送来的目标渲染模型。在另一个示例性实施例中,目标渲染模型在终端中训练完成且存储在终端的存储器中,此种情况下,获取与目标对象对应的目标渲染模型,包括:从存储器中获取目标渲染模型。示例性地,目标渲染模型为基于图2对应的渲染模型训练方法训练得到的目标渲染模型。
由于目标渲染模型渲染出的视频的清晰度,由训练出目标渲染模型的监督样本的清晰度决定,因而,在监督样本的清晰度高于待渲染视频的清晰度时,基于目标渲染模型对中间视频进行渲染后输出的目标视频的清晰度高于待渲染视频的清晰度。
在示例性实施例中,基于目标渲染模型对中间视频进行渲染,得到目标视频,包括:基于目标渲染模型对中间视频中的每一帧画面进行渲染,得到与中间视频的帧数相同数量的渲染后的画面;根据渲染后的画面与中间视频中每一帧画面的对应关系,将渲染后的画面进行组合,得到目标视频。其中,目标视频的清晰度与用于训练出目标渲染模型的监督视频的清晰度对应。
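The frame-by-frame rendering described above produces exactly as many rendered frames as the intermediate video has, then recombines them according to the input/output frame correspondence. A minimal sketch with an illustrative `render_frame` callable standing in for the target rendering model:

```python
def render_video(intermediate_frames, render_frame):
    """Render each frame of the intermediate video, then recombine the
    rendered frames by index so the output keeps the original order and
    has the same number of frames as the input."""
    rendered = {i: render_frame(f) for i, f in enumerate(intermediate_frames)}
    return [rendered[i] for i in range(len(intermediate_frames))]
```

The definition of the resulting target video is then determined entirely by what `render_frame` (the trained target rendering model) produces per frame.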
在虚拟对象生成模型生成的待渲染视频的清晰度较低,且目标渲染模型对应的监督样本的清晰度较高时,通过目标渲染模型对待渲染视频进行渲染,可以得到清晰度较高的目标视频。也就是说,通过将低清晰度的虚拟对象生成模型与目标渲染模型搭配使用,可以使终端在花费较少计算资源的情况下生成高清晰度的目标视频。
以上介绍了本申请实施例的渲染模型训练方法和视频的渲染方法,与上述方法对应,本申请实施例还提供了渲染模型训练装置和视频的渲染装置。如图5所示,本申请实施例还提供了一种渲染模型训练装置,该装置包括:
获取模块501,用于获取包括目标对象的面部的第一视频;
映射模块502,用于基于三维面部模型对第一视频中的目标对象的面部动作进行映射,得到包括三维面部的第二视频;
训练模块503,用于将第二视频作为初始渲染模型的输入,以第一视频作为初始渲染模型的输出监督,对初始渲染模型进行训练,得到目标渲染模型。
在一种可能的实现方式中,映射模块502,用于提取第一视频的每一帧画面中的目标对象的面部关键点,得到多组面部关键点,面部关键点的组数与第一视频的帧数相同,一帧画面对应一组面部关键点;将三维面部模型与每组面部关键点进行拟合,得到多个三维面部画面;根据三维面部画面与第一视频的每一帧画面的对应关系,将多个三维面部画面进行组合,得到包括三维面部的第二视频。
在一种可能的实现方式中,映射模块502,用于利用神经网络将三维面部模型与每组面部 关键点进行拟合,得到多个三维面部画面。
在一种可能的实现方式中,训练模块503,用于将第二视频输入初始渲染模型,由初始渲染模型对第二视频进行渲染得到第三视频;计算第一视频和第三视频的每一帧画面之间的相似度;根据相似度调整初始渲染模型的参数,将调整参数后的初始渲染模型作为目标渲染模型。
在一种可能的实现方式中,训练模块503,用于根据相似度调整初始渲染模型中的预训练层的权重,将调整权重后的初始渲染模型作为目标渲染模型,预训练层为初始渲染模型中的至少一层网络,预训练层包括的网络层数少于初始渲染模型中的网络总层数。
在一种可能的实现方式中,训练模块503,用于响应于根据调整参数后的初始渲染模型生成的视频中的每一帧画面,与第一视频中的每一帧画面之间的相似度均不小于相似度阈值,将调整参数后的初始渲染模型作为目标渲染模型。
在一种可能的实现方式中,获取模块501,用于获取包括目标对象的第四视频;对第四视频的每一帧画面进行裁剪,保留第四视频的每一帧画面中目标对象的面部区域,得到第一视频。
如图6所示,本申请实施例还提供了一种视频的渲染装置,装置包括:
获取模块601,用于获取包括目标对象的待渲染视频;
映射模块602,用于基于三维面部模型对待渲染视频中目标对象的面部动作进行映射,得到包括三维面部的中间视频;
获取模块601,还用于获取与目标对象对应的目标渲染模型;
渲染模块603,用于基于目标渲染模型对中间视频进行渲染,得到目标视频。
在一种可能的实现方式中,获取模块601,用于获取基于目标对象建立的虚拟对象生成模型;基于虚拟对象生成模型生成待渲染视频。
在一种可能的实现方式中,获取模块601,用于获取用于生成待渲染视频的文本;将文本转化为目标对象的语音,语音的内容与文本的内容对应;基于语音获取至少一组音唇同步参数;将至少一组音唇同步参数输入虚拟对象生成模型,由虚拟对象生成模型基于至少一组音唇同步参数,驱动目标对象对应的虚拟对象的面部做出相应的动作,得到至少一组音唇同步参数对应的虚拟视频;对虚拟视频进行渲染,得到待渲染视频。
在一种可能的实现方式中,渲染模块603,用于基于目标渲染模型对中间视频中每一帧画面进行渲染,得到与中间视频的帧数相同数量的渲染后的画面;根据渲染后的画面与中间视频中每一帧画面的对应关系,将渲染后的画面进行组合,得到目标视频。
在一种可能的实现方式中,映射模块602,用于对待渲染视频的每一帧画面进行裁剪,保留待渲染视频的每一帧画面中目标对象的面部区域,得到面部视频;基于三维面部模型对面部视频中目标对象的面部动作进行映射,得到中间视频。
应理解的是,上述图5和图6提供的装置在实现其功能时,仅以上述各功能模块的划分进行举例说明,实际应用中,可以根据需要而将上述功能分配由不同的功能模块完成,即将设备的内部结构划分成不同的功能模块,以完成以上描述的全部或者部分功能。另外,上述实施例提供的装置与方法实施例属于同一构思,其具体实现过程详见方法实施例,这里不再赘述。
参见图7,图7示出了本申请一个示例性实施例提供的电子设备的结构示意图。图7所 示的电子设备用于执行上述图2所示的渲染模型训练方法或图4所示的视频的渲染方法所涉及的操作。该电子设备例如是终端等,该电子设备可以由一般性的总线体系结构来实现。
如图7所示,电子设备包括至少一个处理器701、存储器703以及至少一个通信接口704。
处理器701例如是通用中央处理器(central processing unit,CPU)、数字信号处理器(digital signal processor,DSP)、网络处理器(network processer,NP)、图形处理器(Graphics Processing Unit,GPU)、神经网络处理器(neural-network processing units,NPU)、数据处理单元(Data Processing Unit,DPU)、微处理器或者一个或多个用于实现本申请方案的集成电路。例如,处理器701包括专用集成电路(application-specific integrated circuit,ASIC),可编程逻辑器件(programmable logic device,PLD)或者其他可编程逻辑器件、晶体管逻辑器件、硬件部件或者其任意组合。PLD例如是复杂可编程逻辑器件(complex programmable logic device,CPLD)、现场可编程逻辑门阵列(field-programmable gate array,FPGA)、通用阵列逻辑(generic array logic,GAL)或其任意组合。其可以实现或执行结合本发明实施例公开内容所描述的各种逻辑方框、模块和电路。处理器也可以是实现计算功能的组合,例如包括一个或多个微处理器组合,DSP和微处理器的组合等等。
可选的,电子设备还包括总线。总线用于在电子设备的各组件之间传送信息。总线可以是外设部件互连标准(peripheral component interconnect,简称PCI)总线或扩展工业标准结构(extended industry standard architecture,简称EISA)总线等。总线可以分为地址总线、数据总线、控制总线等。为便于表示,图7中仅用一条线表示,但并不表示仅有一根总线或一种类型的总线。
存储器703例如是只读存储器(read-only memory,ROM)或可存储静态信息和指令的其它类型的静态存储设备,又如是随机存取存储器(random access memory,RAM)或者可存储信息和指令的其它类型的动态存储设备,又如是电可擦可编程只读存储器(electrically erasable programmable read-only Memory,EEPROM)、只读光盘(compact disc read-only memory,CD-ROM)或其它光盘存储、光碟存储(包括压缩光碟、激光碟、光碟、数字通用光碟、蓝光光碟等)、磁盘存储介质或者其它磁存储设备,或者是能够用于携带或存储具有指令或数据结构形式的期望的程序代码并能够由计算机存取的任何其它介质,但不限于此。存储器703例如是独立存在,并通过总线与处理器701相连接。存储器703也可以和处理器701集成在一起。
通信接口704使用任何收发器一类的装置,用于与其它设备或通信网络通信,通信网络可以为以太网、无线接入网(RAN)或无线局域网(wireless local area networks,WLAN)等。通信接口704可以包括有线通信接口,还可以包括无线通信接口。示例性地,通信接口704可以为以太(Ethernet)接口、快速以太(Fast Ethernet,FE)接口、千兆以太(Gigabit Ethernet,GE)接口,异步传输模式(Asynchronous Transfer Mode,ATM)接口,无线局域网(wireless local area networks,WLAN)接口,蜂窝网络通信接口或其组合。以太网接口可以是光接口,电接口或其组合。在本申请实施例中,通信接口704可以用于电子设备与其他设备进行通信。
作为一种实施例,处理器701可以包括一个或多个CPU,如图7中所示的CPU0和CPU1。这些处理器中的每一个可以是一个单核(single-CPU)处理器,也可以是一个多核(multi-CPU)处理器。这里的处理器可以指一个或多个设备、电路、和/或用于处理数据(例如计算机程序指令)的处理核。
作为一种实施例,电子设备可以包括多个处理器,如图7中所示的处理器701和处理器 705。这些处理器中的每一个可以是一个单核处理器(single-CPU),也可以是一个多核处理器(multi-CPU)。这里的处理器可以指一个或多个设备、电路、和/或用于处理数据(如计算机程序指令)的处理核。
作为一种实施例,电子设备还可以包括输出设备和输入设备。输出设备和处理器701通信,可以以多种方式来显示信息。例如,输出设备可以是液晶显示器(liquid crystal display,LCD)、发光二级管(light emitting diode,LED)显示设备、阴极射线管(cathode ray tube,CRT)显示设备或投影仪(projector)等。输入设备和处理器701通信,可以以多种方式接收用户的输入。例如,输入设备可以是鼠标、键盘、触摸屏设备或传感设备等。
在一些实施例中,存储器703用于存储执行本申请方案的程序代码710,处理器701可以执行存储器703中存储的程序代码710。也即是,电子设备可以通过处理器701以及存储器703中的程序代码710,来实现方法实施例提供的渲染模型训练方法或视频的渲染方法。程序代码710中可以包括一个或多个软件模块。可选地,处理器701自身也可以存储执行本申请方案的程序代码或指令。
图2所示的渲染模型训练方法或图4所示的视频的渲染方法的各步骤通过电子设备的处理器中的硬件的集成逻辑电路或者软件形式的指令完成。结合本申请实施例所公开的方法的步骤可以直接体现为硬件处理器执行完成,或者用处理器中的硬件及软件模块组合执行完成。软件模块可以位于随机存储器,闪存、只读存储器,可编程只读存储器或者电可擦写可编程存储器、寄存器等本领域成熟的存储介质中。该存储介质位于存储器,处理器读取存储器中的信息,结合其硬件完成上述方法的步骤,为避免重复,这里不再详细描述。
图8是本申请实施例提供的一种服务器的结构示意图,该服务器可因配置或性能不同而产生比较大的差异,可以包括一个或多个处理器(central processing units,CPU)801和一个或多个存储器802,其中,该一个或多个存储器802中存储有至少一条计算机程序,该至少一条计算机程序由该一个或多个处理器801加载并执行,以使该服务器实现上述各个方法实施例提供的渲染模型训练方法或视频的渲染方法。当然,该服务器还可以具有有线或无线网络接口、键盘以及输入输出接口等部件,以便进行输入输出,该服务器还可以包括其他用于实现设备功能的部件,在此不做赘述。
An embodiment of this application further provides a communication apparatus, including a transceiver, a memory, and a processor. The transceiver, the memory, and the processor communicate with one another through an internal connection path. The memory is used to store instructions, and the processor is used to execute the instructions stored in the memory, to control the transceiver to receive signals and to control the transceiver to send signals. When the processor executes the instructions stored in the memory, the processor is caused to perform the rendering model training method or the video rendering method.
It should be understood that the above processor may be a CPU, or another general-purpose processor, a digital signal processor (DSP), an application specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. The general-purpose processor may be a microprocessor or any conventional processor. It is worth noting that the processor may be a processor supporting the advanced RISC machines (ARM) architecture.
Further, in an optional embodiment, the above memory may include a read-only memory and a random access memory, and provide instructions and data to the processor. The memory may further include a non-volatile random access memory. For example, the memory may also store information about the device type.
The memory may be a volatile memory or a non-volatile memory, or may include both volatile and non-volatile memory. The non-volatile memory may be a read-only memory (ROM), a programmable read-only memory (programmable ROM, PROM), an erasable programmable read-only memory (erasable PROM, EPROM), an electrically erasable programmable read-only memory (electrically EPROM, EEPROM), or a flash memory. The volatile memory may be a random access memory (RAM), which is used as an external cache. By way of example but not limitation, many forms of RAM are available, for example, static RAM (SRAM), dynamic random access memory (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), synchlink DRAM (SLDRAM), and direct rambus RAM (DR RAM).
An embodiment of this application further provides a computer-readable storage medium, where at least one instruction is stored in the storage medium, and the instruction is loaded and executed by a processor so that a computer implements the rendering model training method or the video rendering method according to any of the above.
An embodiment of this application further provides a computer program (product). When the computer program is executed by a computer, a processor or the computer can be caused to perform the corresponding steps and/or procedures in the above method embodiments.
An embodiment of this application further provides a chip, including a processor, configured to invoke from a memory and run the instructions stored in the memory, so that a communication device on which the chip is installed performs the rendering model training method or the video rendering method according to any of the above.
An embodiment of this application further provides another chip, including an input interface, an output interface, a processor, and a memory, where the input interface, the output interface, the processor, and the memory are connected through an internal connection path. The processor is configured to execute the code in the memory; when the code is executed, the processor is configured to perform the rendering model training method or the video rendering method according to any of the above.
In the above embodiments, implementation may be entirely or partly by software, hardware, firmware, or any combination thereof. When software is used for implementation, the implementation may be entirely or partly in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer program instructions are loaded and executed on a computer, the procedures or functions according to this application are entirely or partly generated. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable apparatus. The computer instructions may be stored in a computer-readable storage medium, or transmitted from one computer-readable storage medium to another; for example, the computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center in a wired manner (for example, coaxial cable, optical fiber, or digital subscriber line) or a wireless manner (for example, infrared, radio, or microwave). The computer-readable storage medium may be any usable medium accessible by a computer, or a data storage device such as a server or data center integrating one or more usable media. The usable medium may be a magnetic medium (for example, a floppy disk, a hard disk, or a magnetic tape), an optical medium (for example, a DVD), a semiconductor medium (for example, a solid state disk), or the like.
A person of ordinary skill in the art may be aware that the method steps and modules described in the embodiments disclosed herein can be implemented by software, hardware, firmware, or any combination thereof. To clearly illustrate the interchangeability of hardware and software, the steps and compositions of each embodiment have been generally described above in terms of functions. Whether these functions are executed in hardware or software depends on the specific application and design constraints of the technical solution. A person of ordinary skill in the art may use different methods to implement the described functions for each particular application, but such implementation shall not be regarded as going beyond the scope of this application.
A person of ordinary skill in the art may understand that all or some of the steps for implementing the above embodiments may be completed by hardware, or by a program instructing the relevant hardware. The program may be stored in a computer-readable storage medium, and the storage medium mentioned above may be a read-only memory, a magnetic disk, an optical disc, or the like.
When software is used for implementation, the implementation may be entirely or partly in the form of a computer program product. The computer program product includes one or more computer program instructions. By way of example, the methods of the embodiments of this application may be described in the context of machine-executable instructions, such as program modules executed in an apparatus on a real or virtual processor of a target. Generally, program modules include routines, programs, libraries, objects, classes, components, data structures, and the like, which perform specific tasks or implement specific abstract data structures. In various embodiments, the functions of the program modules may be combined or split between the described program modules. Machine-executable instructions for program modules may be executed locally or within a distributed device. In a distributed device, program modules may be located in both local and remote storage media.
Computer program code for implementing the methods of the embodiments of this application may be written in one or more programming languages. The computer program code may be provided to a processor of a general-purpose computer, a special-purpose computer, or another programmable data processing apparatus, so that when the program code is executed by the computer or the other programmable data processing apparatus, the functions/operations specified in the flowcharts and/or block diagrams are implemented. The program code may execute entirely on a computer, partly on a computer, as a stand-alone software package, partly on a computer and partly on a remote computer, or entirely on a remote computer or server.
In the context of the embodiments of this application, computer program code or related data may be carried by any suitable carrier, so that a device, apparatus, or processor can perform the various processing and operations described above. Examples of carriers include signals, computer-readable media, and the like.
Examples of signals may include electrical, optical, radio, acoustic, or other forms of propagated signals, such as carrier waves and infrared signals.
A machine-readable medium may be any tangible medium that contains or stores a program for or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination thereof. More detailed examples of a machine-readable storage medium include an electrical connection with one or more wires, a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical storage device, a magnetic storage device, or any suitable combination thereof.
A person skilled in the art can clearly understand that, for convenience and brevity of description, for the specific working processes of the systems, devices, and modules described above, reference may be made to the corresponding processes in the foregoing method embodiments; details are not described here again.
In the several embodiments provided in this application, it should be understood that the disclosed systems, devices, and methods may be implemented in other ways. For example, the device embodiments described above are merely illustrative. For example, the division into modules is merely a logical function division; in actual implementation there may be other ways of division. For example, multiple modules or components may be combined or integrated into another system, or some features may be ignored or not executed. In addition, the mutual couplings or direct couplings or communication connections shown or discussed may be indirect couplings or communication connections through some interfaces, devices, or modules, and may be electrical, mechanical, or in other forms.
The modules described as separate components may or may not be physically separated, and the components shown as modules may or may not be physical modules; that is, they may be located in one place or distributed across multiple network modules. Some or all of the modules may be selected according to actual needs to achieve the objectives of the solutions of the embodiments of this application.
In addition, the functional modules in the embodiments of this application may be integrated into one processing module, or each module may exist physically alone, or two or more modules may be integrated into one module. The integrated module may be implemented in the form of hardware or in the form of a software functional module.
If the integrated module is implemented in the form of a software functional module and sold or used as an independent product, it may be stored in a computer-readable storage medium. Based on such an understanding, the technical solutions of this application essentially, or the part contributing to the prior art, or all or part of the technical solutions, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or some of the steps of the methods in the embodiments of this application. The foregoing storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc.
The terms "first", "second", and the like in this application are used to distinguish identical or similar items whose roles and functions are basically the same. It should be understood that there is no logical or temporal dependency among "first", "second", and "nth", and neither quantity nor execution order is limited. It should also be understood that although the following description uses the terms first, second, and so on to describe various elements, these elements should not be limited by the terms. These terms are merely used to distinguish one element from another. For example, without departing from the scope of the various described examples, a first image may be referred to as a second image, and similarly, a second image may be referred to as a first image. Both the first image and the second image may be images, and in some cases may be separate and different images.
It should also be understood that, in the various embodiments of this application, the sequence numbers of the processes do not imply an execution order; the execution order of the processes should be determined by their functions and internal logic, and should not constitute any limitation on the implementation process of the embodiments of this application.
The term "at least one" in this application means one or more, and the term "multiple" in this application means two or more; for example, multiple second packets means two or more second packets. The terms "system" and "network" are often used interchangeably herein.
It should be understood that the terms used in the description of the various examples herein are merely for describing specific examples and are not intended to be limiting. As used in the description of the various examples and the appended claims, the singular forms "a", "an", and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise.
It should also be understood that the term "and/or" used herein refers to and covers any and all possible combinations of one or more of the associated listed items. The term "and/or" describes an association relationship between associated objects and indicates that three relationships may exist; for example, A and/or B may indicate three cases: A alone, both A and B, and B alone. In addition, the character "/" in this application generally indicates an "or" relationship between the associated objects before and after it.
It should also be understood that the term "include" (also "includes", "including", "comprises", and/or "comprising"), when used in this specification, specifies the presence of the stated features, integers, steps, operations, elements, and/or components, but does not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
It should also be understood that the terms "若" and "如果" may be interpreted to mean "when" or "upon" or "in response to determining" or "in response to detecting". Similarly, depending on the context, the phrase "if it is determined that..." or "if [a stated condition or event] is detected" may be interpreted to mean "upon determining..." or "in response to determining..." or "upon detecting [the stated condition or event]" or "in response to detecting [the stated condition or event]".
It should be understood that determining B based on A does not mean determining B based only on A; B may also be determined based on A and/or other information.
It should also be understood that references throughout the specification to "one embodiment", "an embodiment", or "a possible implementation" mean that a particular feature, structure, or characteristic related to the embodiment or implementation is included in at least one embodiment of this application. Therefore, the appearances of "in one embodiment", "in an embodiment", or "a possible implementation" in various places throughout the specification do not necessarily refer to the same embodiment. Furthermore, the particular features, structures, or characteristics may be combined in any suitable manner in one or more embodiments.
It should be noted that the information (including but not limited to user equipment information, user personal information, and the like), data (including but not limited to data used for analysis, stored data, displayed data, and the like), and signals involved in this application are all authorized by users or fully authorized by all parties, and the collection, use, and processing of the relevant data must comply with the relevant laws, regulations, and standards of the relevant countries and regions. For example, the first video including the face of the target object involved in this application is obtained with full authorization.

Claims (27)

  1. A rendering model training method, wherein the method comprises:
    obtaining a first video comprising a face of a target object;
    mapping, based on a three-dimensional facial model, facial actions of the target object in the first video, to obtain a second video comprising a three-dimensional face; and
    training an initial rendering model by using the second video as an input of the initial rendering model and the first video as output supervision of the initial rendering model, to obtain a target rendering model.
  2. The method according to claim 1, wherein the mapping, based on a three-dimensional facial model, facial actions of the target object in the first video to obtain a second video comprising a three-dimensional face comprises:
    extracting facial keypoints of the target object from each frame of the first video to obtain multiple groups of facial keypoints, wherein the number of groups of facial keypoints is the same as the number of frames of the first video, and one frame corresponds to one group of facial keypoints;
    fitting the three-dimensional facial model to each group of facial keypoints, to obtain multiple three-dimensional facial frames; and
    combining the multiple three-dimensional facial frames according to a correspondence between the three-dimensional facial frames and each frame of the first video, to obtain the second video comprising the three-dimensional face.
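Outside the legal scope of the claims, the frame-wise mapping of claim 2 can be illustrated with a minimal Python sketch. `extract_keypoints` and `fit_3d_face` are hypothetical placeholders for a real facial-landmark detector and 3D facial-model fitter (for example, a 68-point landmark model and a 3DMM), which the claims do not name:

```python
# Illustrative sketch of claim 2 (not part of the claims themselves).
# extract_keypoints and fit_3d_face are hypothetical stand-ins for a real
# landmark detector and 3D facial-model fitter.

def extract_keypoints(frame):
    # Placeholder: a real system would run a facial-landmark detector here
    # (e.g. a 68-point model); one group of keypoints per frame.
    return [(i, i + 1) for i in range(68)]

def fit_3d_face(face_model, keypoints):
    # Placeholder: fit the 3D facial model to one group of keypoints,
    # yielding one 3D facial frame.
    return {"model": face_model, "keypoints": keypoints}

def map_video_to_3d(face_model, frames):
    """Claim 2: one keypoint group per frame, one 3D facial frame per group,
    recombined in the original frame order to form the second video."""
    keypoint_groups = [extract_keypoints(f) for f in frames]
    assert len(keypoint_groups) == len(frames)  # group count == frame count
    return [fit_3d_face(face_model, kps) for kps in keypoint_groups]

second_video = map_video_to_3d("3dmm", ["frame0", "frame1", "frame2"])
```

The frame-to-group correspondence is preserved simply by keeping list order, which is what lets the 3D frames be recombined into a video of the same length.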
  3. The method according to claim 2, wherein the fitting the three-dimensional facial model to each group of facial keypoints to obtain multiple three-dimensional facial frames comprises:
    fitting the three-dimensional facial model to each group of facial keypoints by using a neural network, to obtain the multiple three-dimensional facial frames.
  4. The method according to any one of claims 1 to 3, wherein the training an initial rendering model by using the second video as an input of the initial rendering model and the first video as output supervision of the initial rendering model, to obtain a target rendering model comprises:
    inputting the second video into the initial rendering model, and rendering the second video by the initial rendering model to obtain a third video;
    computing a similarity between each frame of the first video and the corresponding frame of the third video; and
    adjusting parameters of the initial rendering model according to the similarity, and using the parameter-adjusted initial rendering model as the target rendering model.
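As a non-limiting sketch of the training step in claim 4: render the second video with the current parameters, score per-frame similarity against the first video, and adjust the parameters accordingly. The toy renderer, the similarity function, and the scalar update rule below are all assumptions for illustration; a real implementation would use an image metric such as SSIM or a perceptual loss, and gradient-based optimization:

```python
def frame_similarity(a, b):
    # Placeholder similarity in [0, 1]; a real system might use SSIM or a
    # perceptual metric between rendered and ground-truth frames.
    return 1.0 - abs(a - b)

def train_step(render, params, second_video, first_video, lr=0.1):
    """One update of claim 4: render the second video into a third video,
    score per-frame similarity against the first video, adjust parameters."""
    third_video = [render(frame, params) for frame in second_video]
    sims = [frame_similarity(x, y) for x, y in zip(third_video, first_video)]
    loss = 1.0 - sum(sims) / len(sims)  # low similarity -> high loss
    return params - lr * loss, loss     # toy scalar parameter update

# Toy renderer: frames are scalars, the single parameter is a brightness offset.
render = lambda frame, p: frame + p
params, loss = train_step(render, 0.5, [0.1, 0.2], [0.1, 0.2])
```

Here the offset 0.5 shifts every frame by 0.5, so each frame's similarity is 0.5, the loss is 0.5, and the offset is nudged toward zero (0.45 after one step).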
  5. The method according to claim 4, wherein the adjusting parameters of the initial rendering model according to the similarity and using the parameter-adjusted initial rendering model as the target rendering model comprises:
    adjusting, according to the similarity, weights of a pre-trained layer in the initial rendering model, and using the weight-adjusted initial rendering model as the target rendering model, wherein the pre-trained layer is at least one network layer in the initial rendering model, and the number of network layers included in the pre-trained layer is smaller than the total number of network layers in the initial rendering model.
  6. The method according to claim 4 or 5, wherein the using the parameter-adjusted initial rendering model as the target rendering model comprises:
    in response to the similarity between each frame of a video generated by the parameter-adjusted initial rendering model and each frame of the first video being not less than a similarity threshold, using the parameter-adjusted initial rendering model as the target rendering model.
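The stopping condition of claim 6 amounts to requiring every per-frame similarity to reach the threshold before the adjusted model is accepted as the target rendering model; it can be stated in one line (the 0.9 default is illustrative, not claimed):

```python
def converged(similarities, threshold=0.9):
    # Claim 6: the parameter-adjusted model becomes the target rendering
    # model only when every frame's similarity is not less than the threshold.
    return all(s >= threshold for s in similarities)

print(converged([0.95, 0.91]), converged([0.95, 0.85]))  # prints: True False
```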
  7. The method according to any one of claims 1 to 6, wherein the obtaining a first video comprising a face of a target object comprises:
    obtaining a fourth video comprising the target object; and
    cropping each frame of the fourth video, and retaining a facial region of the target object in each frame of the fourth video, to obtain the first video.
  8. A video rendering method, wherein the method comprises:
    obtaining a to-be-rendered video comprising a target object;
    mapping, based on a three-dimensional facial model, facial actions of the target object in the to-be-rendered video, to obtain an intermediate video comprising a three-dimensional face;
    obtaining a target rendering model corresponding to the target object; and
    rendering the intermediate video based on the target rendering model, to obtain a target video.
  9. The method according to claim 8, wherein the obtaining a to-be-rendered video comprising a target object comprises:
    obtaining a virtual object generation model established based on the target object; and
    generating the to-be-rendered video based on the virtual object generation model.
  10. The method according to claim 9, wherein the generating the to-be-rendered video based on the virtual object generation model comprises:
    obtaining text for generating the to-be-rendered video;
    converting the text into speech of the target object, wherein the content of the speech corresponds to the content of the text;
    obtaining at least one group of lip-sync parameters based on the speech;
    inputting the at least one group of lip-sync parameters into the virtual object generation model, and driving, by the virtual object generation model based on the at least one group of lip-sync parameters, the face of a virtual object corresponding to the target object to make corresponding actions, to obtain a virtual video corresponding to the at least one group of lip-sync parameters; and
    rendering the virtual video to obtain the to-be-rendered video.
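The text-driven pipeline of claim 10 can be sketched end to end. Every function below is a hypothetical stand-in (a real system would plug in a TTS engine, an audio-to-viseme model, and the avatar generator), and the one-parameter-group-per-audio-window split is an assumption for illustration:

```python
def text_to_speech(text):
    # Placeholder TTS: the speech content corresponds to the text content.
    return {"audio": text, "duration": len(text)}

def speech_to_lipsync(speech):
    # Placeholder: derive one lip-sync parameter group per audio window.
    return [{"mouth_open": i % 2} for i in range(speech["duration"])]

def drive_avatar(generator_model, lipsync_groups):
    # The virtual object generation model drives the avatar's face to make
    # the action encoded by each parameter group, one frame per group.
    return [f"{generator_model}:frame{i}" for i in range(len(lipsync_groups))]

def make_to_be_rendered_video(generator_model, text):
    """Claim 10: text -> speech -> lip-sync parameters -> virtual video,
    which is then rendered into the to-be-rendered video."""
    speech = text_to_speech(text)
    groups = speech_to_lipsync(speech)
    return drive_avatar(generator_model, groups)

frames = make_to_be_rendered_video("avatar", "hi")
```

The point of the structure is that each stage's output count fixes the next stage's: the number of lip-sync parameter groups determines how many facial actions, and hence how many frames, the virtual video contains.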
  11. The method according to any one of claims 8 to 10, wherein the rendering the intermediate video based on the target rendering model to obtain a target video comprises:
    rendering each frame of the intermediate video based on the target rendering model, to obtain rendered frames equal in number to the frames of the intermediate video; and
    combining the rendered frames according to a correspondence between the rendered frames and each frame of the intermediate video, to obtain the target video.
  12. The method according to any one of claims 8 to 11, wherein the mapping, based on a three-dimensional facial model, facial actions of the target object in the to-be-rendered video to obtain an intermediate video comprising a three-dimensional face comprises:
    cropping each frame of the to-be-rendered video, and retaining a facial region of the target object in each frame of the to-be-rendered video, to obtain a facial video; and
    mapping, based on the three-dimensional facial model, facial actions of the target object in the facial video, to obtain the intermediate video.
  13. A rendering model training apparatus, wherein the apparatus comprises:
    an obtaining module, configured to obtain a first video comprising a face of a target object;
    a mapping module, configured to map, based on a three-dimensional facial model, facial actions of the target object in the first video, to obtain a second video comprising a three-dimensional face; and
    a training module, configured to train an initial rendering model by using the second video as an input of the initial rendering model and the first video as output supervision of the initial rendering model, to obtain a target rendering model.
  14. The apparatus according to claim 13, wherein the mapping module is configured to: extract facial keypoints of the target object from each frame of the first video to obtain multiple groups of facial keypoints, wherein the number of groups of facial keypoints is the same as the number of frames of the first video, and one frame corresponds to one group of facial keypoints; fit the three-dimensional facial model to each group of facial keypoints, to obtain multiple three-dimensional facial frames; and combine the multiple three-dimensional facial frames according to a correspondence between the three-dimensional facial frames and each frame of the first video, to obtain the second video comprising the three-dimensional face.
  15. The apparatus according to claim 14, wherein the mapping module is configured to fit the three-dimensional facial model to each group of facial keypoints by using a neural network, to obtain the multiple three-dimensional facial frames.
  16. The apparatus according to any one of claims 13 to 15, wherein the training module is configured to: input the second video into the initial rendering model, and render the second video by the initial rendering model to obtain a third video; compute a similarity between each frame of the first video and the corresponding frame of the third video; and adjust parameters of the initial rendering model according to the similarity, and use the parameter-adjusted initial rendering model as the target rendering model.
  17. The apparatus according to claim 16, wherein the training module is configured to adjust, according to the similarity, weights of a pre-trained layer in the initial rendering model, and use the weight-adjusted initial rendering model as the target rendering model, wherein the pre-trained layer is at least one network layer in the initial rendering model, and the number of network layers included in the pre-trained layer is smaller than the total number of network layers in the initial rendering model.
  18. The apparatus according to claim 16 or 17, wherein the training module is configured to: in response to the similarity between each frame of a video generated by the parameter-adjusted initial rendering model and each frame of the first video being not less than a similarity threshold, use the parameter-adjusted initial rendering model as the target rendering model.
  19. The apparatus according to any one of claims 13 to 18, wherein the obtaining module is configured to: obtain a fourth video comprising the target object; and crop each frame of the fourth video and retain a facial region of the target object in each frame of the fourth video, to obtain the first video.
  20. A video rendering apparatus, wherein the apparatus comprises:
    an obtaining module, configured to obtain a to-be-rendered video comprising a target object;
    a mapping module, configured to map, based on a three-dimensional facial model, facial actions of the target object in the to-be-rendered video, to obtain an intermediate video comprising a three-dimensional face;
    the obtaining module being further configured to obtain a target rendering model corresponding to the target object; and
    a rendering module, configured to render the intermediate video based on the target rendering model, to obtain a target video.
  21. The apparatus according to claim 20, wherein the obtaining module is configured to: obtain a virtual object generation model established based on the target object; and generate the to-be-rendered video based on the virtual object generation model.
  22. The apparatus according to claim 21, wherein the obtaining module is configured to: obtain text for generating the to-be-rendered video; convert the text into speech of the target object, wherein the content of the speech corresponds to the content of the text; obtain at least one group of lip-sync parameters based on the speech; input the at least one group of lip-sync parameters into the virtual object generation model, and drive, by the virtual object generation model based on the at least one group of lip-sync parameters, the face of a virtual object corresponding to the target object to make corresponding actions, to obtain a virtual video corresponding to the at least one group of lip-sync parameters; and render the virtual video to obtain the to-be-rendered video.
  23. The apparatus according to any one of claims 20 to 22, wherein the rendering module is configured to: render each frame of the intermediate video based on the target rendering model, to obtain rendered frames equal in number to the frames of the intermediate video; and combine the rendered frames according to a correspondence between the rendered frames and each frame of the intermediate video, to obtain the target video.
  24. The apparatus according to any one of claims 20 to 23, wherein the mapping module is configured to: crop each frame of the to-be-rendered video and retain a facial region of the target object in each frame of the to-be-rendered video, to obtain a facial video; and map, based on the three-dimensional facial model, facial actions of the target object in the facial video, to obtain the intermediate video.
  25. A computer device, wherein the computer device comprises a memory and a processor; the memory stores at least one instruction, and the at least one instruction is loaded and executed by the processor, so that the computer device implements the rendering model training method according to any one of claims 1 to 7, or the video rendering method according to any one of claims 8 to 12.
  26. A computer-readable storage medium, wherein the computer-readable storage medium stores at least one instruction, and the instruction is loaded and executed by a processor to implement the rendering model training method according to any one of claims 1 to 7, or the video rendering method according to any one of claims 8 to 12.
  27. A computer program product, wherein the computer program product comprises a computer program/instructions, and the computer program/instructions are executed by a processor so that a computer implements the rendering model training method according to any one of claims 1 to 7, or the video rendering method according to any one of claims 8 to 12.
PCT/CN2023/080880 2022-03-18 2023-03-10 Rendering model training method and apparatus, video rendering method and apparatus, and device and storage medium WO2023174182A1 (zh)

Priority Applications (1)

Application Number Priority Date Filing Date Title
EP23769686.9A EP4394711A1 (en) 2022-03-18 2023-03-10 Rendering model training method and apparatus, video rendering method and apparatus, and device and storage medium

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202210273192.8A 2022-03-18 2022-03-18 Rendering model training method and apparatus, video rendering method and apparatus, and device and storage medium
CN202210273192.8 2022-03-18

Publications (1)

Publication Number Publication Date
WO2023174182A1 true WO2023174182A1 (zh) 2023-09-21

Family

ID=88022223

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2023/080880 Rendering model training method and apparatus, video rendering method and apparatus, and device and storage medium 2022-03-18 2023-03-10

Country Status (3)

Country Link
EP (1) EP4394711A1 (zh)
CN (1) CN116824016A (zh)
WO (1) WO2023174182A1 (zh)


Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117615115A (zh) * 2023-12-04 2024-02-27 广州开得联智能科技有限公司 Video image rendering method and apparatus, electronic device, and medium
CN118351274B (zh) * 2024-04-29 2024-10-29 广州蓝昊广告有限公司 Object model rendering method and apparatus, device, and storage medium

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20200013212A1 (en) * 2017-04-04 2020-01-09 Intel Corporation Facial image replacement using 3-dimensional modelling techniques
CN111783986A (zh) * 2020-07-02 2020-10-16 清华大学 Network training method and apparatus, and pose prediction method and apparatus
CN113822977A (zh) * 2021-06-28 2021-12-21 腾讯科技(深圳)有限公司 Image rendering method and apparatus, device, and storage medium


Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117827192A (zh) * 2023-12-26 2024-04-05 合肥锦上汇赢数字科技有限公司 Three-dimensional model generation system
CN117745902A (zh) * 2024-02-20 2024-03-22 卓世科技(海南)有限公司 Digital human generation method and apparatus for rehabilitation demonstration
CN117745902B (zh) * 2024-02-20 2024-04-26 卓世科技(海南)有限公司 Digital human generation method and apparatus for rehabilitation demonstration
CN117834949A (zh) * 2024-03-04 2024-04-05 清华大学 Real-time interactive pre-rendering method and apparatus based on edge intelligence
CN117834949B (zh) * 2024-03-04 2024-05-14 清华大学 Real-time interactive pre-rendering method and apparatus based on edge intelligence
CN117876550A (zh) * 2024-03-11 2024-04-12 国网电商科技有限公司 Virtual digital human rendering method, system and terminal device based on big data
CN117876550B (zh) * 2024-03-11 2024-05-14 国网电商科技有限公司 Virtual digital human rendering method, system and terminal device based on big data

Also Published As

Publication number Publication date
CN116824016A (zh) 2023-09-29
EP4394711A1 (en) 2024-07-03

Similar Documents

Publication Publication Date Title
WO2023174182A1 (zh) Rendering model training method and apparatus, video rendering method and apparatus, and device and storage medium
KR102469295B1 (ko) Video background removal using depth
US20190279410A1 (en) Electronic Messaging Utilizing Animatable 3D Models
US11887235B2 (en) Puppeteering remote avatar by facial expressions
US10846560B2 (en) GPU optimized and online single gaussian based skin likelihood estimation
CN111815666B (zh) Image processing method and apparatus, computer-readable storage medium, and electronic device
US10929982B2 (en) Face pose correction based on depth information
KR20220150410A (ko) Techniques for capturing and editing dynamic depth images
WO2020248950A1 (zh) Method for determining validity of facial feature, and electronic device
US11375244B2 (en) Dynamic video encoding and view adaptation in wireless computing environments
WO2021196648A1 (zh) Method and apparatus for driving interactive object, device, and storage medium
WO2023207379A1 (zh) Image processing method and apparatus, device, and storage medium
US11474776B2 (en) Display-based audio splitting in media environments
US11831973B2 (en) Camera setting adjustment based on event mapping
US20220309725A1 (en) Edge data network for providing three-dimensional character image to user equipment and method for operating the same
US20210328954A1 (en) Advanced Electronic Messaging Utilizing Animatable 3D Models
CN118159341A (zh) Image frame rendering method and related apparatus
US20240114170A1 (en) Feature reconstruction using neural networks for video streaming systems and applications
US20240087091A1 (en) Server device and network control method
WO2024077792A1 (zh) Video generation method and apparatus, device, and computer-readable storage medium
US20230400975A1 (en) Thermal management of an electronic device
EP4401039A1 (en) Image processing method and apparatus, and related device
US20240203060A1 (en) Effect display based on text recognition and room scaling within an augmented reality environment
CN116309017A (zh) Expression rendering processing method and apparatus, device, and storage medium

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 23769686

Country of ref document: EP

Kind code of ref document: A1

WWE Wipo information: entry into national phase

Ref document number: 2023769686

Country of ref document: EP

ENP Entry into the national phase

Ref document number: 2023769686

Country of ref document: EP

Effective date: 20240328