
CN114529785B - Model training method, video generating method and device, equipment and medium - Google Patents

Model training method, video generating method and device, equipment and medium

Info

Publication number
CN114529785B
Authority
CN
China
Prior art keywords
image
feature
face
model
processing
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202210166388.7A
Other languages
Chinese (zh)
Other versions
CN114529785A (en)
Inventor
魏舒
周超勇
刘玉宇
曾平安
赵记坤
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Technology Shenzhen Co Ltd
Original Assignee
Ping An Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Technology Shenzhen Co Ltd filed Critical Ping An Technology Shenzhen Co Ltd
Priority to CN202210166388.7A priority Critical patent/CN114529785B/en
Publication of CN114529785A publication Critical patent/CN114529785A/en
Application granted granted Critical
Publication of CN114529785B publication Critical patent/CN114529785B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/25Fusion techniques
    • G06F18/253Fusion techniques of extracted features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/088Non-supervised learning, e.g. competitive learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T13/00Animation
    • G06T13/203D [Three Dimensional] animation
    • G06T13/403D [Three Dimensional] animation of characters, e.g. humans, animals or virtual beings
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T17/00Three dimensional [3D] modelling, e.g. data description of 3D objects
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Software Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Health & Medical Sciences (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Evolutionary Biology (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Computer Graphics (AREA)
  • Geometry (AREA)
  • Image Processing (AREA)
  • Image Analysis (AREA)

Abstract

The embodiments provide a model training method, a video generation method, a device, equipment and a medium, belonging to the technical field of artificial intelligence. The method includes: acquiring a face image and performing feature extraction on it to obtain a first feature image; performing feature stitching on the first feature image and preset virtual face feature data to obtain a combined feature image, which ensures that different types of virtual face images can be trained; performing self-attention processing on the first feature image through a preset self-attention model to obtain a second feature image; performing feature extraction on the combined feature image and the second feature image to obtain a third feature image; and training a preset neural network model according to the third feature image to obtain a virtual face image generation model. By adding the self-attention model, the embodiments let the neural network model focus more on learning the key regions during training, which shortens the training time and improves the training efficiency of the model.

Description

Model training method, video generating method and device, equipment and medium
Technical Field
The present invention relates to the field of artificial intelligence technologies, and in particular, to a model training method, a video generating method, a device, equipment, and a medium.
Background
With the development of artificial intelligence technology, methods for driving virtual objects based on human faces are widely used in many fields. For example, a virtual object can interact with a user in a video through its facial image. At present, a clear and realistic virtual face image is generated mainly by extracting face features from a real face image and rendering the face image through a neural network model. However, most current neural network models are GAN-based and render the whole face, so the rendering range is too large, which affects the training efficiency of the models.
Disclosure of Invention
The main purpose of the disclosed embodiments is to provide a training method, a video generating method, a device, equipment and a medium for a model, which can improve the training efficiency of the model.
To achieve the above object, a first aspect of an embodiment of the present disclosure provides a model training method for training a virtual face image generation model, including:
Acquiring a face image, and performing feature extraction processing on the face image to obtain a first feature image;
performing feature stitching processing on the first feature image and preset virtual face feature data to obtain a combined feature image;
performing self-attention processing on the first feature image through a preset self-attention model to obtain a second feature image;
performing feature extraction processing on the combined feature image and the second feature image to obtain a third feature image;
training a preset neural network model according to the third feature image to obtain a virtual face image generation model.
In some embodiments, the acquiring the face image includes:
acquiring a real face video;
Acquiring a video frame image corresponding to each frame in the real face video;
extracting 3DMM features and a lower half face region of the video frame image;
and attaching the 3DMM features to the lower half face region to obtain the face image.
In some embodiments, the self-attention model comprises: a first neural network and a second neural network; the self-attention processing is performed on the first feature image through a preset self-attention model to obtain a second feature image, which comprises the following steps:
performing feature extraction processing on the first feature image through the first neural network to obtain a first feature matrix;
performing reinforcement treatment on the first characteristic image through the second neural network to obtain a second characteristic matrix;
multiplying the first feature matrix and the second feature matrix to obtain a third feature matrix;
performing convolution and spectrum normalization processing on the third feature matrix to obtain a fourth feature image;
and adding the pixels of the first characteristic image and the pixels of the fourth characteristic image to obtain the second characteristic image.
In some embodiments, the second neural network comprises a convolutional layer, a normalization layer, a pooling layer, and a classifier; the reinforcement processing is performed on the first feature image through the second neural network to obtain a second feature matrix, including:
carrying out convolution processing on the first characteristic image through the convolution layer to obtain a convolution matrix;
carrying out spectrum normalization processing on the convolution matrix through the normalization layer to obtain a normalization matrix;
carrying out maximum pooling treatment on the normalized matrix through the pooling layer to obtain a maximum pooling matrix;
multiplying the normalized matrix by the maximum pooling matrix to obtain a fourth feature matrix;
And classifying the fourth feature matrix through the classifier to obtain the second feature matrix.
In some embodiments, the neural network model includes a discriminator; training the preset neural network model according to the third feature image to obtain a virtual face image generation model includes the following steps:
calculating an image reality value of the third feature image through the discriminator;
calculating a loss function of the neural network model according to the image reality value to obtain a loss value;
and using the loss value for back propagation and adjusting model parameters of the neural network model to train the neural network model, so as to obtain the virtual face image generation model.
A second aspect of an embodiment of the present disclosure proposes a video generating method, configured to generate a target face video, including:
acquiring text data and virtual face characteristic data of a target virtual face;
inputting the text data and the virtual face characteristic data into a virtual face image generation model to perform image generation processing to obtain a plurality of continuous frame speaking images; wherein the virtual face image generation model is obtained by training according to the training method according to any one of the embodiments of the first aspect of the present application;
Performing image stitching processing on the plurality of continuous frame speaking images to obtain an initial face video;
performing voice conversion processing on the text data to obtain target voice;
and performing voice synthesis processing on the initial face video according to the target voice to obtain a target face video.
A third aspect of an embodiment of the present disclosure proposes a training device for training a virtual face image generation model, including:
a first feature extraction module: the method comprises the steps of acquiring a face image, and carrying out feature extraction processing on the face image to obtain a first feature image;
The first splicing module: the method comprises the steps of performing feature stitching processing on a first feature image and preset virtual face feature data to obtain a combined feature image;
Self-attention processing module: the self-attention processing method comprises the steps of carrying out self-attention processing on a first characteristic image through a preset self-attention model to obtain a second characteristic image;
And a second feature extraction module: the method comprises the steps of carrying out feature extraction processing on the combined feature image and the second feature image to obtain a third feature image;
model training module: and training a preset neural network model according to the third characteristic image to obtain a virtual face image generation model.
A fourth aspect of an embodiment of the present disclosure proposes a video generating apparatus for generating a target face video, including:
And a data acquisition module: the virtual face feature data are used for acquiring text data and virtual face feature data of a target virtual face;
an image generation module: the text data and the virtual face feature data are input into a virtual face image generation model to be subjected to image generation processing, so that a plurality of continuous frame speaking images are obtained; wherein the virtual face image generation model is trained according to the training method as set forth in any one of claims 1 to 5;
And a second splicing module: the method comprises the steps of performing image stitching processing on a plurality of continuous frame speaking images to obtain an initial face video;
And the voice conversion module is used for: the method comprises the steps of performing voice conversion processing on text data to obtain target voice;
and a voice synthesis module: and the method is used for carrying out voice synthesis processing on the initial video according to the target voice to obtain a target face video.
A fifth aspect of the disclosed embodiments proposes a computer device comprising a memory and a processor, wherein the memory has stored therein a program which, when executed by the processor, is adapted to carry out a method according to any of the embodiments of the first aspect of the application or a method according to any of the embodiments of the second aspect of the application.
A sixth aspect of the disclosed embodiments proposes a storage medium, which is a computer-readable storage medium, storing computer-executable instructions for causing a computer to perform a method according to any one of the embodiments of the first aspect of the present application or a method according to any one of the embodiments of the second aspect of the present application.
The training method, the video generating device, the equipment and the medium of the model provided by the embodiment of the disclosure are used for extracting the characteristics of the face image by acquiring the face image to obtain a first characteristic image; the first characteristic image and the preset virtual face characteristic data are subjected to characteristic stitching processing to obtain a combined characteristic image, so that training of different types of virtual face images can be ensured, and personalized requirements of users can be met; performing self-attention processing on the first characteristic image through a preset self-attention model to obtain a second characteristic image; the combined characteristic image and the second characteristic image are subjected to characteristic extraction processing to obtain a third characteristic image, and the rendering effect of the virtual face can be improved and the authenticity of the virtual face can be improved through carrying out characteristic extraction processing on the images for a plurality of times; training a preset neural network model according to the third characteristic image to obtain a virtual face image generation model; the virtual face image generation model is used for generating a virtual face image. According to the embodiment of the disclosure, the self-attention model is added, so that the neural network model can be more focused on the study of the key areas in the training process, the training time of the model is shortened, and the training efficiency of the model is improved.
Drawings
FIG. 1 is a flow chart of a training method for a model provided by an embodiment of the present disclosure;
fig. 2 is a flowchart of step S110 in fig. 1;
fig. 3 is a flowchart of step S130 in fig. 1;
Fig. 4 is a flowchart of step S132 in fig. 3;
fig. 5 is a flowchart of step S150 in fig. 1;
FIG. 6 is a schematic diagram of a neural network model and a self-attention model provided by an embodiment of the present disclosure;
FIG. 7 is a flow chart of a practical application of the self-attention model for self-attention processing according to an embodiment of the present disclosure;
Fig. 8 is a flowchart of a video generation method provided by an embodiment of the present disclosure;
FIG. 9 is a block diagram of a modular construction of a training device for a model provided by an embodiment of the present disclosure;
fig. 10 is a block diagram of a module structure of a video generating apparatus provided in an embodiment of the present disclosure;
fig. 11 is a schematic diagram of a hardware structure of a computer device according to an embodiment of the disclosure.
Detailed Description
The present application will be described in further detail with reference to the drawings and examples, in order to make the objects, technical solutions and advantages of the present application more apparent. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the application.
It should be noted that although functional block division is performed in a device diagram and a logic sequence is shown in a flowchart, in some cases, the steps shown or described may be performed in a different order than the block division in the device, or in the flowchart. The terms first, second and the like in the description and in the claims and in the above-described figures, are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs. The terminology used herein is for the purpose of describing embodiments of the application only and is not intended to be limiting of the application.
Furthermore, the described features, structures, or characteristics may be combined in any suitable manner in one or more embodiments. In the following description, numerous specific details are provided to give a thorough understanding of embodiments of the disclosure. One skilled in the relevant art will recognize, however, that the disclosed aspects may be practiced without one or more of the specific details, or with other methods, components, devices, steps, etc. In other instances, well-known methods, devices, implementations, or operations are not shown or described in detail to avoid obscuring aspects of the disclosure.
The block diagrams depicted in the figures are merely functional entities and do not necessarily correspond to physically separate entities. That is, the functional entities may be implemented in software, or in one or more hardware modules or integrated circuits, or in different networks and/or processor devices and/or microcontroller devices.
The flow diagrams depicted in the figures are exemplary only, and do not necessarily include all of the elements and operations/steps, nor must they be performed in the order described. For example, some operations/steps may be decomposed, and some operations/steps may be combined or partially combined, so that the order of actual execution may be changed according to actual situations.
First, several nouns involved in the present application are parsed:
Artificial Intelligence (AI): a new technical science that researches and develops theories, methods, technologies and application systems for simulating, extending and expanding human intelligence. Artificial intelligence is a branch of computer science that attempts to understand the nature of intelligence and to produce a new kind of intelligent machine that can react in a manner similar to human intelligence; research in this field includes robotics, language recognition, image recognition, natural language processing, and expert systems. Artificial intelligence can simulate the information processes of human consciousness and thinking. It is also a theory, method, technique, and application system that uses a digital computer or a digital-computer-controlled machine to simulate, extend, and expand human intelligence, sense the environment, acquire knowledge, and use knowledge to obtain optimal results.
Unet network: an image semantic segmentation network that enables a computer to segment an image according to its semantics and output the specified segmentation result.
Virtual Reality (VR), i.e., the combination of Virtual and Reality, is a computer simulation system that creates and experiences a Virtual world by using a computer to create a simulated environment into which a user is immersed. The virtual reality technology is to use real life data, combine electronic signals generated by computer technology with various output devices to convert them into phenomena that can be felt by people, and display them by three-dimensional models.
Virtual anchor (Virtual YouTuber): the virtual anchor is an anchor or customer service which uses an avatar to interact with a user in a video based on leading technologies such as voice, NLP, vision and the like.
Self-attention mechanism (Attention Mechanism): the attention mechanism may provide the neural network with the ability to concentrate on a subset of its inputs (or features), select a particular input, and apply to any type of input, regardless of its shape. In situations where computing power is limited, the attention mechanism is a resource allocation scheme that is the primary means of solving the information overload problem, allocating computing resources to more important tasks.
Region of interest (region of interest, ROI): in machine vision and image processing, a region to be processed is outlined from a processed image in a box, circle, ellipse, irregular polygon and the like, and is called a region of interest.
Morphable face model (3D Morphable Face Model, 3DMM): a statistical model of face shape and appearance. First, a high-precision instrument scans several groups of 3D face data and aligns them; PCA is then used to obtain lower-dimensional subspaces from the three-dimensional shape and color data. The "morphable" property means the PCA subspaces can be combined and deformed to transfer the characteristics of one face to another face or to generate a new face.
Baseline (baseline): a basic model against which an improved training model is compared, in order to judge whether the improvement is effective and to evaluate the effect of the new training model.
Encoder-Decoder (Encoder-Decoder): a common model framework in deep learning; many common applications are designed with an encoding-decoding framework. The Encoder and Decoder parts can handle any text, speech, image or video data, and various models can be designed based on the Encoder-Decoder structure.
Encoding (Encoder) converts the input sequence into a fixed-length vector; decoding (Decoder) converts the previously generated fixed vector into an output sequence. The input sequence can be text, speech, images or video; the output sequence can be text or images.
SA (Shuffle Attention) module: the module effectively combines two types of attention mechanisms using a Shuffle unit. Specifically, the SA first groups the channel dimensions into multiple sub-features, and then processes them in parallel. Then, for each sub-feature, the SA depicts the feature dependencies in the spatial and channel dimensions using a shuffle unit. All sub-features are then summarized together and the "channel shuffle" operator is used to enable information communication between the different sub-features.
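For readers unfamiliar with the "channel shuffle" operator mentioned above, the following is a minimal sketch of how such an operator is commonly implemented; PyTorch is assumed, and the group count is illustrative rather than taken from the patent:

```python
import torch

def channel_shuffle(x: torch.Tensor, groups: int) -> torch.Tensor:
    # Illustrative sketch of a "channel shuffle" operator (not taken from the patent itself).
    b, c, h, w = x.shape
    x = x.view(b, groups, c // groups, h, w)   # split the channels into groups
    x = x.transpose(1, 2).contiguous()         # interleave the groups
    return x.view(b, c, h, w)                  # flatten back to (B, C, H, W)

# Example: shuffle a 64-channel feature map across 4 groups.
feat = torch.randn(1, 64, 32, 32)
print(channel_shuffle(feat, groups=4).shape)   # torch.Size([1, 64, 32, 32])
```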
Softmax classifier: a generalization of the logistic regression classifier to multiple classes; it outputs the probability values of belonging to the different classes.
Activation functions (Activation functions): they play an important role in enabling an artificial neural network model to learn and understand very complex, nonlinear functions. They introduce nonlinear properties into the network: the weighted sum of the inputs is passed through a function, the activation function, which is introduced to increase the nonlinearity of the neural network model.
Embedding (embedding): embedding is a vector representation, i.e., representing an object, which may be a word, a commodity, a movie, etc., with a low-dimensional vector. The nature of this embedding vector is that objects corresponding to similar vectors have similar meanings; for example, embedding(Avengers) and embedding(Iron Man) will be very close, whereas embedding(Avengers) and embedding(gourmet food) will be far apart. Embedding is essentially a mapping from semantic space to vector space that preserves, as far as possible, the relationships of the original samples in semantic space; for example, two words that are semantically close are also located close together in vector space. Embedding can encode objects with low-dimensional vectors while preserving their meanings. It is often applied in machine learning: during model construction, objects are encoded into low-dimensional dense vectors which are then fed into a DNN, improving efficiency.
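As a small illustration of the embedding idea described above, the following sketch maps arbitrary IDs to low-dimensional dense vectors; PyTorch is assumed, and the vocabulary size, embedding width and example ID are made up for illustration:

```python
import torch
import torch.nn as nn

# Hypothetical example: map virtual-face (or word) IDs to low-dimensional dense vectors.
num_ids, dim = 1000, 16          # illustrative vocabulary size and embedding width
embedding = nn.Embedding(num_ids, dim)

face_id = torch.tensor([42])     # an arbitrary ID
vector = embedding(face_id)      # shape (1, 16): a dense representation of the ID
print(vector.shape)
```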
Max pooling (max-pooling): taking the point of maximum value within the local receptive field.
Generator: takes a vector as input and outputs a high-dimensional vector (which may be a picture, text, etc.); typically each dimension of the input vector represents some feature.
Discriminator (discriminator): also called the discrimination network. Its input is the object to be evaluated (i.e., the output produced by the generator), such as a picture or a piece of speech, and its output is a scalar that represents how good the input is; the larger the number, the more authentic the input. The relationship between the generator and the discriminator is: the generator generates an object and inputs it into the discriminator, which then judges whether the input is real data or machine-generated. If the discriminator is not deceived, the generator continues to evolve and outputs a second-generation output, which is again fed to the discriminator; meanwhile, the discriminator also evolves and places stricter requirements on the generator's output.
Text To Speech (TTS): with the support of a built-in chip and the design of a neural network, TTS intelligently converts text into a natural voice stream. TTS technology converts text files in real time. It is a type of speech synthesis application that converts documents stored in a computer, such as help documents or web pages, into natural speech output.
Lipschitz continuity condition: also referred to as the Lipschitz condition, it is a stronger smoothness condition than usual. Intuitively, a Lipschitz continuous function has a limited rate of change: the slope of a function satisfying the Lipschitz condition must be less than a real number called the Lipschitz constant (which depends on the function).
With the development of artificial intelligence technology, methods for driving virtual objects based on human faces are widely used in many fields. For example, a virtual object can interact with a user in a video through its facial image. At present, a clear and realistic virtual face image is generated mainly by extracting face features from a real face image and rendering the face image through a neural network model. However, most existing neural network models are GAN-based and render the whole face, even though in the actually generated virtual image only the lip effect changes while the upper half face and the background information remain unchanged in the final result. Rendering the whole face with a traditional neural network model therefore makes the rendering range excessively large, which affects the training efficiency of the model.
Based on the above, the embodiment of the disclosure provides a training method, a video generating method, a device, equipment and a medium for a model, which are used for extracting features of a face image to obtain a first feature image by acquiring the face image; the first characteristic image and the preset virtual face characteristic data are subjected to characteristic stitching processing to obtain a combined characteristic image, so that training of different types of virtual face images can be ensured, and personalized requirements of users can be met; performing self-attention processing on the first characteristic image through a preset self-attention model to obtain a second characteristic image; the combined characteristic image and the second characteristic image are subjected to characteristic extraction processing to obtain a third characteristic image, and the rendering effect of the virtual face can be improved and the authenticity of the virtual face can be improved through carrying out characteristic extraction processing on the images for a plurality of times; training a preset neural network model according to the third characteristic image to obtain a virtual face image generation model; the virtual face image generation model is used for generating a virtual face image. According to the embodiment of the disclosure, the self-attention model is added, so that the neural network model can be more focused on the study of the key areas in the training process, the training time of the model is shortened, and the training efficiency of the model is improved.
The embodiment of the present disclosure provides a training method, a video generating method and device, a computer device, and a storage medium for a model, and specifically, the training method for a model in the embodiment of the present disclosure is described first by describing the following embodiment.
The embodiments of the present application can acquire and process the related data based on artificial intelligence technology. Artificial Intelligence (AI) is the theory, method, technique, and application system that uses a digital computer or a digital-computer-controlled machine to simulate, extend, and expand human intelligence, sense the environment, acquire knowledge, and use knowledge to obtain optimal results.
Artificial intelligence infrastructure technologies generally include technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technologies, operation/interaction systems, mechatronics, and the like. The artificial intelligence software technology mainly comprises a computer vision technology, a robot technology, a biological recognition technology, a voice processing technology, a natural language processing technology, machine learning/deep learning and other directions.
The embodiment of the disclosure provides a training method of a model, which relates to the field of artificial intelligence and also relates to the field of virtual reality. The training method of the model provided by the embodiment of the disclosure can be applied to a terminal, a server and software running in the terminal or the server. In some embodiments, the terminal may be a smart phone, tablet, notebook, desktop, or smart watch, etc.; the server side can be configured as an independent physical server, a server cluster or a distributed system formed by a plurality of physical servers, and a cloud server for providing cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, CDNs, basic cloud computing services such as big data and artificial intelligent platforms and the like; the software may be an application or the like of a training method for realizing the model, but is not limited to the above form.
Embodiments of the present disclosure are operational with numerous general purpose or special purpose computer system environments or configurations. For example: personal computers, server computers, hand-held or portable devices, tablet devices, multiprocessor systems, microprocessor-based systems, set top boxes, programmable consumer electronics, network PCs, minicomputers, mainframe computers, distributed computing environments that include any of the above systems or devices, and the like. The application may be described in the general context of computer-executable instructions, such as program modules, being executed by a computer. Generally, program modules include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types. The application may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, program modules may be located in both local and remote computer storage media including memory storage devices.
Referring to fig. 1, a training method of a model according to an embodiment of the first aspect of the present disclosure includes, but is not limited to, steps S110 to S150.
Step S110, a face image is obtained, and feature extraction processing is carried out on the face image to obtain a first feature image;
Step S120, performing feature stitching processing on the first feature image and preset virtual face feature data to obtain a combined feature image;
Step S130, performing self-attention processing on the first feature image through a preset self-attention model to obtain a second feature image;
Step S140, performing feature extraction processing on the combined feature image and the second feature image to obtain a third feature image;
Step S150, training the preset neural network model according to the third feature image to obtain a virtual face image generation model.
In step S110 of some embodiments, real face data is acquired. Real face data refers to data rich in facial features or information, and it comes in various forms, including photos, videos, and so on, of a real face. It should be noted that there may be multiple face images of the real face, and the shooting angle, illumination, color, speaking mouth shape, expression, etc. of each face image may differ. The real face data is preliminarily processed to obtain the face image; because the acquired real face data varies widely, it must be preliminarily processed before training the virtual face image generation model, so as to obtain face images that meet the training conditions of the model. The face image is input into a preset neural network model. The neural network model includes an encoder network for feature extraction, which contains a plurality of convolution layers, and a decoder network, also used for feature extraction, which contains a plurality of deconvolution layers. After the face image is obtained, feature extraction processing is performed on it through the encoder network to obtain a first feature image. Specifically, with the Unet network as a baseline, the face image passes through the plurality of convolution layers of the encoder network to extract features at different sizes, yielding first feature images and corresponding first feature data at different sizes, where the first feature data include color features, texture features, shape features, spatial relationship features, face key point features, and the like of the first feature image.
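The following is a minimal sketch of such a Unet-style encoder stage stack, included only to illustrate how first feature images of different sizes could be collected; the channel widths, strides and layer count are assumptions rather than values given in the disclosure (PyTorch is used throughout these sketches):

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    """Unet-style encoder sketch: each stage halves the spatial size and
    returns feature maps of several sizes (the 'first feature images')."""
    def __init__(self, in_ch=3, widths=(64, 128, 256, 512)):  # widths are illustrative
        super().__init__()
        self.stages = nn.ModuleList()
        prev = in_ch
        for w in widths:
            self.stages.append(nn.Sequential(
                nn.Conv2d(prev, w, 3, stride=2, padding=1),
                nn.BatchNorm2d(w),
                nn.ReLU(inplace=True),
            ))
            prev = w

    def forward(self, x):
        feats = []
        for stage in self.stages:
            x = stage(x)
            feats.append(x)          # keep every scale for later skip connections
        return feats

feats = Encoder()(torch.randn(1, 3, 256, 256))
print([f.shape for f in feats])       # feature maps at four different sizes
```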
In step S120 of some embodiments, feature stitching is performed on the first feature image and preset virtual face feature data to obtain the combined feature image. The preset virtual face feature data may be the ID of the virtual face after feature extraction with a natural language processing technique, for example through embedding, so that the ID of each virtual face is unique and fewer features can represent more IDs, which are used to describe different avatar images. In practical applications, the first feature image and the virtual face feature data can be spliced through a fully connected layer of the neural network model to obtain the combined feature image.
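A minimal sketch of the stitching step is shown below. It represents the virtual face ID with an embedding and simply broadcasts and concatenates it onto the first feature image; the disclosure mentions splicing through a fully connected layer, so this is only one possible interpretation, and all shapes and sizes are illustrative assumptions:

```python
import torch
import torch.nn as nn

# Sketch (assumed shapes): stitch the first feature image with a virtual-face ID embedding.
id_embedding = nn.Embedding(num_embeddings=100, embedding_dim=64)   # 100 anchor IDs, illustrative

first_feature = torch.randn(1, 512, 16, 16)       # bottleneck feature from the encoder
face_id = torch.tensor([7])                       # ID of the target virtual face

id_vec = id_embedding(face_id)                                 # (1, 64)
id_map = id_vec[:, :, None, None].expand(-1, -1, 16, 16)       # broadcast to the spatial size
joint_feature = torch.cat([first_feature, id_map], dim=1)      # (1, 512 + 64, 16, 16)
print(joint_feature.shape)
```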
In step S130 of some embodiments, self-attention processing is performed on the first feature image through a preset self-attention model to obtain a second feature image. The self-attention model is disposed between the encoder network and the decoder network and is used to learn the parts of the image that need attention, such as the rendering of the lower half of the face and the generation of teeth, while reducing the learning of unchanged parts of the image, such as the background, hair, face contour and parts of the upper half of the face, thereby reducing the number of iterations and the training time.
It should be noted that the self-attention model may be an SA module, and each SA module may include one or more self-attention network layers. More SA layers may be added according to how the training data performs, or a plurality of SA modules may be provided in front of the encoder network and the decoder network according to actual requirements, which is not limited in this disclosure.
In step S140 of some embodiments, feature extraction processing is performed on the joint feature image and the second feature image through the decoder network, so as to obtain a third feature image. Specifically, the joint feature image and the second feature image are decoded through a decoder network, and are adjusted to a third feature image with the same size as the face image.
In step S150 of some embodiments, training is performed on the preset neural network model according to the third feature image to obtain a virtual face image generating model for generating a virtual face image, so that the virtual face image generated by the virtual face image generating model achieves a real and natural effect. The preset neural network comprises the encoder network and the decoder network mentioned in the above embodiment.
In some embodiments, as shown in fig. 2, step S110 specifically includes, but is not limited to, steps S111 to S114.
Step S111, obtaining a real face video;
Step S112, obtaining a video frame image corresponding to each frame in the real face video;
step S113, extracting 3DMM features and a lower half face area of a video frame image;
and step S114, attaching the 3DMM features to the lower half face area to obtain a face image.
In steps S111 to S112 of some embodiments, the real face video is obtained, and in practical application, the real face video may be obtained by recording a live host, or directly obtaining the video of the live host. After the real face video is acquired, extracting a video frame image corresponding to each frame in the real face video.
In step S113 of some embodiments, 3DMM features of the video frame image, that is, face key points, are extracted, and mainly include face contours, inner and outer boundary points of eyes, nose and mouth, etc., where the face key points can reflect facial features of each part of the face. It should be noted that, a person skilled in the art can set different face key points according to actual training requirements, which is not described herein.
In step S114 of some embodiments, the 3DMM feature is attached to the lower half face area, so as to obtain a face image, in other words, after the 3DMM feature is extracted from each frame of the real face video, the extracted 3DMM feature is attached to the lower half face of the corresponding frame of the real face video, so that the process of performing the preliminary processing on the real face data is completed. The 3DMM feature is attached to the lower half face, so that the lower half face and the mouth shape of the virtual face can be rendered better, the neural network model is made to pay more attention to the rendering effect of the lower half face, training time is shortened, and training efficiency of the model is improved.
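The following sketch outlines this preliminary processing for one video, assuming OpenCV for frame reading; the 3DMM/key-point extractor is a hypothetical placeholder, and blacking out the upper half of the frame is only a crude stand-in for the lower-half-face region described above:

```python
import cv2
import numpy as np

def extract_3dmm_landmarks(frame):
    """Hypothetical placeholder: a real 3DMM fitter / key-point detector would go here
    and return an (N, 2) array of pixel coordinates."""
    return np.zeros((0, 2))

def preprocess_video(path):
    """Sketch: for each frame, keep (roughly) the lower half of the face and draw the
    3DMM key points onto it to form a training face image."""
    cap = cv2.VideoCapture(path)
    images = []
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        h = frame.shape[0]
        face_image = frame.copy()
        face_image[: h // 2] = 0                       # crude stand-in for the lower-half-face region
        landmarks = extract_3dmm_landmarks(frame)      # 3DMM features of the frame
        for (x, y) in landmarks.astype(int):
            cv2.circle(face_image, (int(x), int(y)), 1, (0, 255, 0), -1)   # attach the key points
        images.append(face_image)
    cap.release()
    return images
```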
In some embodiments, the self-attention model includes: the first neural network and the second neural network, as shown in fig. 3, step S130 specifically includes, but is not limited to, step S131 to step S135.
Step S131, performing feature extraction processing on the first feature image through a first neural network to obtain a first feature matrix;
step S132, performing reinforcement processing on the first characteristic image through a second neural network to obtain a second characteristic matrix;
Step S133, multiplying the first feature matrix and the second feature matrix to obtain a third feature matrix;
Step S134, performing convolution and spectrum normalization processing on the third feature matrix to obtain a fourth feature image;
step S135, adding the pixels of the first feature image to the pixels of the fourth feature image to obtain the second feature image.
In step S131 of some embodiments, a first feature image is input to the self-attention model, and feature extraction processing is performed on the first feature image through a first neural network of the self-attention model, so as to obtain a first feature matrix. Specifically, the first neural network comprises a convolution layer, a spectrum normalization layer and a maximum pooling layer, the first characteristic image sequentially carries out convolution processing, spectrum normalization processing and maximum pooling operation through the convolution layer, the spectrum normalization layer and the pooling layer of the first neural network to obtain a new characteristic image, and a plurality of pixel points of the characteristic image can form a first characteristic matrix.
It should be noted that, the convolution processing mentioned in the embodiments of the present disclosure refers to a convolution operation of an image, or referred to as a kernel operation, which is a common means for performing image processing. The image convolution operation aims at realizing functions of blurring, sharpening, edge detection and the like through weighted summation operation by utilizing the spatial relation between the pixel point and the neighborhood pixels. The calculation process of image convolution is the process of weighting and summing the local pixel blocks of the image according to step sizes by a convolution kernel. The convolution kernel is essentially a set of weights of a fixed size, with the anchor point in the set typically centered.
The spectrum normalization mentioned in the embodiments of the present disclosure is applied to the weight matrix of each layer of the discriminator, i.e., to the spectral norm of the weight matrix. The spectrum normalization operation divides the network parameters of each layer by the spectral norm of that layer's parameter matrix so that they satisfy a preset constraint, namely the Lipschitz = 1 constraint.
The max pooling operation mentioned in the embodiments of the present disclosure is similar to the convolution process in that it essentially performs sampling: the pixel feature map is divided into a number of pools or blocks, and the maximum value of each pool or block is taken and combined.
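Putting these operations together, a minimal sketch of the first branch (a convolution with a spectrum-normalized weight followed by max pooling) might look as follows; the channel counts, kernel size and input size are assumptions, and PyTorch's spectral_norm is used because, as explained above, the normalization acts on the layer's weight matrix:

```python
import torch
import torch.nn as nn
from torch.nn.utils import spectral_norm

# Sketch of the first-branch stack: convolution whose weight is spectrum-normalized,
# followed by max pooling over local regions. All sizes are illustrative.
first_branch = nn.Sequential(
    spectral_norm(nn.Conv2d(256, 32, kernel_size=1)),    # convolution + spectrum normalization
    nn.MaxPool2d(kernel_size=2, stride=2),                # max pooling
)

x = torch.randn(1, 256, 32, 32)       # the first feature image (assumed size)
print(first_branch(x).shape)          # torch.Size([1, 32, 16, 16])
```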
In step S132 of some embodiments, the first feature image is input to the self-attention model, and the second neural network of the self-attention model performs reinforcement processing on it to obtain the second feature matrix. Through the reinforcement processing, the neural network model can concentrate on the regions that need attention, such as the lower half of the face and the lips, while the generation of the upper half face and background information is reduced. It should be noted that, since the upper half face and the background information are unchanged in the final effect of the generated virtual face image, only the lower half face, especially the lip part, needs to be rendered with emphasis; in this way a real and natural virtual face image can be obtained, and the best effect can be produced while improving model training efficiency.
In step S133 of some embodiments, the first feature matrix and the second feature matrix are multiplied to obtain a third feature matrix.
In step S134 of some embodiments, convolution and spectrum normalization processing are performed on the third feature matrix to obtain a matrix of pixel points from which the fourth feature image is formed.
In step S135 of some embodiments, the pixels of the first feature image and the pixels of the fourth feature image are added to obtain an output of the self-attention model, i.e. the second feature image. In practical application, the fourth feature image output by the self-attention model is spliced with the feature image of the deconvolution layer in the decoder network and then sent to the next convolution layer of the decoder network.
In some embodiments, as shown in fig. 4, step S132 specifically includes, but is not limited to, steps S1321 to S1325.
Step S1321, performing convolution processing on the first characteristic image through a convolution layer to obtain a convolution matrix;
step S1322, carrying out spectrum normalization processing on the convolution matrix through a normalization layer to obtain a normalization matrix;
step S1323, carrying out maximum pooling treatment on the normalized matrix through a pooling layer to obtain a maximum pooling matrix;
step S1324, multiplying the normalized matrix and the maximum pooling matrix to obtain a fourth feature matrix;
Step S1325, classifying the fourth feature matrix through a classifier to obtain a second feature matrix.
In step S1321 of some embodiments, the first feature image is input to a second neural network, wherein the second neural network also includes a convolution layer, a normalization layer, and a pooling layer, and further includes a classifier. Specifically, the convolution layer of the second neural network is used for carrying out convolution processing on the first characteristic image to obtain a convolution matrix.
In step S1322 of some embodiments, the convolution matrix is subjected to spectral normalization by a normalization layer of the second neural network, to obtain a normalized matrix.
In step S1323 of some embodiments, the normalized matrix is subjected to a maximum pooling process by a pooling layer of the second neural network, resulting in a maximum pooled matrix.
In step S1324 of some embodiments, the normalized matrix and the maximum pooling matrix are multiplied to obtain a fourth feature matrix.
In step S1325 of some embodiments, the fourth feature matrix is classified by a classifier of the second neural network to obtain a second feature matrix of self-attention, and in particular, the classifier of the second neural network may employ a Softmax classifier.
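A possible reading of steps S1321 to S1325 is sketched below. Because the normalized matrix and the max-pooled matrix have different spatial sizes, the "multiplication" in step S1324 is interpreted here as a matrix product over flattened spatial positions, which is the usual self-attention formulation; the channel counts, pooling size and input size are assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.nn.utils import spectral_norm

class SecondBranch(nn.Module):
    """Sketch of steps S1321-S1325 (shapes and channel counts are assumptions)."""
    def __init__(self, in_ch=256, mid_ch=32):
        super().__init__()
        self.conv = spectral_norm(nn.Conv2d(in_ch, mid_ch, kernel_size=1))  # S1321 + S1322
        self.pool = nn.MaxPool2d(kernel_size=2, stride=2)                    # S1323

    def forward(self, x):
        norm = self.conv(x)                 # convolution with spectrum-normalized weight
        pooled = self.pool(norm)            # max-pooled version of the normalized matrix
        # S1324: combine the normalized matrix with the max-pooled matrix
        # (interpreted as a matrix product over flattened spatial positions).
        fourth = norm.flatten(2).transpose(1, 2) @ pooled.flatten(2)   # (B, HW, H'W')
        # S1325: Softmax classification to obtain the second (self-attention) feature matrix.
        return F.softmax(fourth, dim=-1)

attn = SecondBranch()(torch.randn(1, 256, 32, 32))
print(attn.shape)    # torch.Size([1, 1024, 256])
```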
In some embodiments, as shown in fig. 5, step S150 specifically includes, but is not limited to, steps S151 to S153.
Step S151, calculating an image reality value of the third feature image through the discriminator;
Step S152, calculating a loss function of the neural network model according to the image reality value to obtain a loss value;
Step S153, using the loss value for back propagation and adjusting model parameters of the neural network model to train the neural network model, so as to obtain the virtual face image generation model.
In step S151 of some embodiments, the third feature image is input into a discriminator of the neural network model, and the discriminator judges whether the input image is a real image or an image generated by the neural network model, to obtain an image reality value.
In step S152 of some embodiments, a loss function of the neural network model is calculated according to the image reality value to obtain a loss value; specifically, the loss function corresponding to the discriminator and the loss function corresponding to the generator may be calculated to obtain the loss value. In practical applications, an L1 loss function corresponding to the lower-half-face mask can be added to the loss function of the neural network with an increased weight, which accelerates how quickly the neural network model learns to render the lip effect.
In step S153 of some embodiments, the loss value is used for back propagation, and the model parameters of the neural network model are adjusted to train it, so as to obtain the virtual face image generation model and make the virtual face images it generates more real and natural.
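A hedged sketch of one generator update corresponding to steps S151 to S153 is given below; the adversarial loss form, the L1 weight and the function and parameter names are assumptions rather than details fixed by the disclosure:

```python
import torch
import torch.nn.functional as F

def train_step(generator, discriminator, g_opt,
               face_image, target_image, lower_face_mask, lambda_l1=10.0):
    """One generator update sketched from steps S151-S153 (loss weights are assumptions)."""
    generated = generator(face_image)            # generated frame (third feature image)
    realness = discriminator(generated)          # image reality value (S151)

    # Adversarial loss plus an L1 term restricted to the lower-half-face mask,
    # weighted up so the lip region is learned faster (S152).
    adv_loss = F.binary_cross_entropy_with_logits(realness, torch.ones_like(realness))
    l1_loss = F.l1_loss(generated * lower_face_mask, target_image * lower_face_mask)
    loss = adv_loss + lambda_l1 * l1_loss

    g_opt.zero_grad()
    loss.backward()                              # use the loss for back propagation (S153)
    g_opt.step()
    return loss.item()
```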
In some embodiments, as shown in fig. 6, the neural network model includes an encoder network with a plurality of convolution layers and a decoder network with a plurality of deconvolution layers. The input image, that is, the face image after preliminary processing, undergoes multiple feature extraction steps through the convolution and deconvolution layers of the encoder and decoder networks. With the Unet network as a baseline, an SA module, i.e., a self-attention model, may be added in a middle layer so that the decoder network pays attention to the important parts of the encoder network and learns the lip effect with emphasis. It should be noted that more SA layers may be added depending on how the data performs; in the embodiment of the present disclosure, only one layer is added for convenience of description. In addition, an embedding is added between the encoder and the decoder to describe different anchor images.
Specifically, as shown in fig. 7, in the embodiment of the present disclosure, a feature map x obtained after one of the convolution layers in the encoder portion is input to the SA module, which divides the feature map x into three branches. In the first branch, the feature map x passes through the first neural network and undergoes convolution, spectrum normalization and max pooling in sequence to obtain the feature matrix of the first branch; in the second branch, x passes through the second neural network and undergoes convolution and spectrum normalization in sequence to obtain the feature matrix of the second branch; in the third branch, x passes through the second neural network and undergoes convolution, spectrum normalization and max pooling in sequence to obtain the feature matrix of the third branch. The feature matrix of the second branch is then multiplied by the feature matrix of the third branch and sent to a Softmax classifier to obtain the self-attention matrix. Next, the feature matrix of the first branch is multiplied by the self-attention matrix and then subjected to convolution and spectrum normalization to obtain a new feature map; the new feature map is added to the feature map x to obtain the output of the SA module, namely the target feature map. The target feature map output by the SA module is spliced with the feature map of a deconvolution layer in the decoder network and then sent to the next deconvolution layer for feature extraction processing.
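The three-branch flow described for fig. 7 can be sketched as a single self-attention block as follows; the channel-reduction factor, pooling size and input size are assumptions, and the splicing with the decoder's deconvolution features is omitted for brevity:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.nn.utils import spectral_norm

class SelfAttentionBlock(nn.Module):
    """Sketch of the three-branch self-attention flow described for fig. 7
    (channel reduction and pooling sizes are assumptions)."""
    def __init__(self, ch=256, reduce=8):
        super().__init__()
        mid = ch // reduce
        self.branch1 = spectral_norm(nn.Conv2d(ch, mid, 1))   # first branch: conv + SN (+ pooling below)
        self.branch2 = spectral_norm(nn.Conv2d(ch, mid, 1))   # second branch: conv + SN
        self.branch3 = spectral_norm(nn.Conv2d(ch, mid, 1))   # third branch: conv + SN (+ pooling below)
        self.pool = nn.MaxPool2d(2, 2)
        self.out = spectral_norm(nn.Conv2d(mid, ch, 1))       # final conv + SN before the residual add

    def forward(self, x):
        b, c, h, w = x.shape
        q = self.branch2(x).flatten(2).transpose(1, 2)                 # (B, HW, mid)
        k = self.pool(self.branch3(x)).flatten(2)                      # (B, mid, HW/4)
        v = self.pool(self.branch1(x)).flatten(2).transpose(1, 2)      # (B, HW/4, mid)
        attn = F.softmax(q @ k, dim=-1)                                # self-attention matrix (Softmax)
        y = (attn @ v).transpose(1, 2).view(b, -1, h, w)               # first branch weighted by attention
        return x + self.out(y)                                         # add the new feature map back to x

out = SelfAttentionBlock()(torch.randn(1, 256, 32, 32))
print(out.shape)    # torch.Size([1, 256, 32, 32])
```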
The training method, the video generating device, the equipment and the medium of the model provided by the embodiment of the disclosure are used for extracting the characteristics of the face image by acquiring the face image to obtain a first characteristic image; the first characteristic image and the preset virtual face characteristic data are subjected to characteristic stitching processing to obtain a combined characteristic image, so that training of different types of virtual face images can be ensured, and personalized requirements of users can be met; performing self-attention processing on the first characteristic image through a preset self-attention model to obtain a second characteristic image; the combined characteristic image and the second characteristic image are subjected to characteristic extraction processing to obtain a third characteristic image, and the rendering effect of the virtual face can be improved and the authenticity of the virtual face can be improved through carrying out characteristic extraction processing on the images for a plurality of times; training a preset neural network model according to the third characteristic image to obtain a virtual face image generation model; the virtual face image generation model is used for generating a virtual face image. According to the embodiment of the disclosure, the self-attention model is added, so that the neural network model can be more focused on the study of the key areas in the training process, the training time of the model is shortened, and the training efficiency of the model is improved.
Referring to fig. 8, the embodiment of the present disclosure further provides a video generating method for generating a virtual face video, including but not limited to steps S210 to S250.
Step S210, obtaining text data and virtual face feature data of a target virtual face;
Step S220, inputting text data and virtual face characteristic data into a virtual face image generation model to perform image generation processing, so as to obtain a plurality of continuous frame speaking images;
Step S230, performing image stitching processing on a plurality of continuous frame speaking images to obtain an initial face video;
Step S240, performing voice conversion processing on the text data to obtain target voice;
Step S250, performing voice synthesis processing on the initial face video according to the target voice to obtain a target face video.
In step S210 of some embodiments, text data and virtual face feature data of the target virtual face are acquired, where the text data refers to text content that needs to be spoken by the target virtual face, and the virtual face feature data is data that identifies different virtual faces, such as a face ID, a serial number, and the like.
In step S220 of some embodiments, the text data and the virtual face feature data are input into the virtual face image generation model to perform image generation processing, so as to obtain a plurality of continuous frame speaking images corresponding to the image of the target virtual face, where the continuous frame speaking images represent states of mouth shapes, expressions, and the like of the target virtual face under different situations, such as speaking. Wherein the virtual face image generation model is trained according to the training method according to any one of the embodiments of the first aspect of the disclosure;
In step S230 of some embodiments, image stitching is performed on the plurality of continuous frame speaking images to obtain an initial face video, where the initial face video is a video in which the virtual face speaks according to the content of the text data but has no sound. The initial face video is then processed according to the text data to obtain the target face video, which is a video in which the target virtual face speaks according to the content of the text data and contains sound.
In steps S240 and S250 of some embodiments, the text data is subjected to voice conversion processing, for example using TTS technology to convert the text data into speech, so as to obtain a target voice; the target voice and the initial face video are then synthesized together to obtain the target face video.
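As a non-authoritative illustration of steps S230 to S250, the sketch below stitches the continuous frame speaking images into a silent video with OpenCV and then muxes the synthesized target voice into it with ffmpeg. The use of OpenCV and ffmpeg, the frame rate, the file names and the function names are assumptions of this sketch; the TTS step itself is left as a hypothetical placeholder because the disclosure does not specify a particular engine.

import subprocess

import cv2


def frames_to_video(frames, out_path="initial_face.mp4", fps=25):
    # Step S230: image stitching of the continuous frame speaking images into the initial face video.
    h, w = frames[0].shape[:2]
    writer = cv2.VideoWriter(out_path, cv2.VideoWriter_fourcc(*"mp4v"), fps, (w, h))
    for frame in frames:
        writer.write(frame)
    writer.release()
    return out_path


def mux_audio(video_path, audio_path, out_path="target_face.mp4"):
    # Step S250: voice synthesis processing, i.e. combining the target voice with the initial face video.
    subprocess.run(
        ["ffmpeg", "-y", "-i", video_path, "-i", audio_path,
         "-c:v", "copy", "-c:a", "aac", "-shortest", out_path],
        check=True,
    )
    return out_path


# Example flow (speech.wav produced by any TTS engine in step S240):
# target = mux_audio(frames_to_video(speaking_frames), "speech.wav")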
The embodiment of the present disclosure also provides a training device for training a virtual face image generation model, as shown in fig. 9, which can implement the training method of the model described above. The device includes: a first feature extraction module 310, a first stitching module 320, a self-attention processing module 330, a second feature extraction module 340, and a model training module 350. The first feature extraction module 310 is configured to acquire a face image and perform feature extraction processing on the face image to obtain a first feature image; the first stitching module 320 is configured to perform feature stitching processing on the first feature image and preset virtual face feature data to obtain a combined feature image; the self-attention processing module 330 is configured to perform self-attention processing on the first feature image through a preset self-attention model to obtain a second feature image; the second feature extraction module 340 is configured to perform feature extraction processing on the combined feature image and the second feature image to obtain a third feature image; and the model training module 350 is configured to train a preset neural network model according to the third feature image to obtain a virtual face image generation model.
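As a hedged sketch only, the five modules above might be chained as follows, reusing the SelfAttentionBlock sketch given earlier in this document; the channel sizes, the representation of the virtual face feature data as an embedded face ID broadcast over extra feature channels, and the single-convolution extractors are illustrative assumptions rather than the implementation of the disclosure.

import torch
import torch.nn as nn

# Assumes the SelfAttentionBlock sketch shown earlier in this document is in scope.


class TrainingPipeline(nn.Module):
    def __init__(self, channels: int = 64, id_dim: int = 16, num_ids: int = 1000):
        super().__init__()
        self.first_extractor = nn.Conv2d(3, channels, kernel_size=3, padding=1)    # first feature extraction module 310
        self.id_embed = nn.Embedding(num_ids, id_dim)                              # preset virtual face feature data (e.g. a face ID)
        self.attention = SelfAttentionBlock(channels)                              # self-attention processing module 330
        self.second_extractor = nn.Conv2d(2 * channels + id_dim, channels,
                                          kernel_size=3, padding=1)                # second feature extraction module 340

    def forward(self, face_image: torch.Tensor, face_id: torch.Tensor) -> torch.Tensor:
        f1 = self.first_extractor(face_image)                          # first feature image
        b, _, h, w = f1.shape
        id_map = self.id_embed(face_id).view(b, -1, 1, 1).expand(-1, -1, h, w)
        combined = torch.cat([f1, id_map], dim=1)                      # first stitching module 320 -> combined feature image
        f2 = self.attention(f1)                                        # second feature image
        f3 = self.second_extractor(torch.cat([combined, f2], dim=1))   # third feature image
        return f3  # passed to the model training module 350 to train the neural network model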
The specific processing procedure of the training device of this embodiment of the present disclosure is the same as that of the training method of the model in the above embodiment, and is not described here again.
The embodiment of the present disclosure further provides a video generating device, configured to generate a target face video, as shown in fig. 10, which can implement the video generating method described above. The device includes: a data acquisition module 410, an image generation module 420, a second stitching module 430, a voice conversion module 440, and a voice synthesis module 450. The data acquisition module 410 is configured to acquire text data and virtual face feature data of a target virtual face; the image generation module 420 is configured to input the text data and the virtual face feature data into a virtual face image generation model for image generation processing to obtain a plurality of continuous frame speaking images, wherein the virtual face image generation model is trained according to the training method of any one of the embodiments of the first aspect of the present disclosure; the second stitching module 430 is configured to perform image stitching processing on the plurality of continuous frame speaking images to obtain an initial face video; the voice conversion module 440 is configured to perform voice conversion processing on the text data to obtain a target voice; and the voice synthesis module 450 is configured to perform voice synthesis processing on the initial face video according to the target voice to obtain the target face video.
The video generating apparatus of the embodiment of the present disclosure is configured to execute the video generating method of the above embodiment, and specific processing procedures thereof are the same as those of the video generating method of the above embodiment, and are not described herein in detail.
The disclosed embodiments also provide a computer device comprising:
at least one processor, and
A memory communicatively coupled to the at least one processor; wherein,
The memory stores instructions executable by the at least one processor, and the instructions, when executed by the at least one processor, cause the at least one processor to perform the method of the first aspect or of any one of the embodiments of the second aspect of the present application.
The hardware configuration of the computer device is described in detail below with reference to fig. 11. The computer device includes: processor 510, memory 520, input/output interface 530, communication interface 540, and bus 550.
The processor 510 may be implemented by a general-purpose CPU (Central Processing Unit), a microprocessor, an Application Specific Integrated Circuit (ASIC), or one or more integrated circuits, and is configured to execute related programs to implement the technical solutions provided by the embodiments of the present disclosure;
The memory 520 may be implemented in the form of a ROM (Read Only Memory), a static storage device, a dynamic storage device, or a RAM (Random Access Memory). The memory 520 may store an operating system and other application programs; when the technical solutions provided by the embodiments of the present disclosure are implemented through software or firmware, the relevant program codes are stored in the memory 520 and are called by the processor 510 to execute the training method of the model or the video generating method of the embodiments of the present disclosure;
An input/output interface 530 for implementing information input and output;
The communication interface 540 is configured to implement communication interaction between this device and other devices, and may implement communication in a wired manner (e.g., USB, network cable) or in a wireless manner (e.g., mobile network, Wi-Fi, Bluetooth); and
Bus 550, which carries information among the various components of the device (e.g., processor 510, memory 520, input/output interface 530, and communication interface 540);
Wherein processor 510, memory 520, input/output interface 530, and communication interface 540 enable a communication connection within the device between each other via bus 550.
The present disclosure also provides a storage medium that is a computer-readable storage medium storing computer-executable instructions for causing a computer to perform a training method of a model of an embodiment of the present disclosure or a video generation method of an embodiment of the present disclosure.
The memory, as a non-transitory computer readable storage medium, may be used to store non-transitory software programs as well as non-transitory computer executable programs. In addition, the memory may include high-speed random access memory, and may also include non-transitory memory, such as at least one magnetic disk storage device, flash memory device, or other non-transitory solid state storage device. In some embodiments, the memory optionally includes memory remotely located relative to the processor, the remote memory being connectable to the processor through a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The embodiments described in the embodiments of the present disclosure are for more clearly describing the technical solutions of the embodiments of the present disclosure, and do not constitute a limitation on the technical solutions provided by the embodiments of the present disclosure, and as those skilled in the art can know that, with the evolution of technology and the appearance of new application scenarios, the technical solutions provided by the embodiments of the present disclosure are equally applicable to similar technical problems.
It will be appreciated by those skilled in the art that the solutions shown in fig. 1, 2, 3, 4, 5 and 8 do not limit the embodiments of the present disclosure, and may include more or fewer steps than shown, may combine certain steps, or may use different steps.
The above described apparatus embodiments are merely illustrative, wherein the units illustrated as separate components may or may not be physically separate, i.e. may be located in one place, or may be distributed over a plurality of network elements. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
Those of ordinary skill in the art will appreciate that all or some of the steps of the methods, systems, functional modules/units in the devices disclosed above may be implemented as software, firmware, hardware, and suitable combinations thereof.
The terms "first," "second," "third," "fourth," and the like in the description of the application and in the above figures, if any, are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate such that the embodiments of the application described herein may be implemented in sequences other than those illustrated or otherwise described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
It should be understood that in the present application, "at least one (item)" means one or more, and "a plurality" means two or more. "And/or" is used to describe the association relationship of associated objects and indicates that three relationships may exist; for example, "A and/or B" may represent: only A exists, only B exists, or both A and B exist, where A and B may be singular or plural. The character "/" generally indicates that the associated objects before and after it are in an "or" relationship. "At least one of" or similar expressions refer to any combination of these items, including any combination of single items or plural items. For example, at least one of a, b, or c may represent: a, b, c, "a and b", "a and c", "b and c", or "a and b and c", where a, b, and c may be singular or plural.
In the several embodiments provided by the present application, it should be understood that the disclosed apparatus and method may be implemented in other manners. For example, the apparatus embodiments described above are merely illustrative, e.g., the division of the units is merely a logical function division, and there may be additional divisions when actually implemented, e.g., multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. Alternatively, the coupling or direct coupling or communication connection shown or discussed with each other may be an indirect coupling or communication connection via some interfaces, devices or units, which may be in electrical, mechanical or other form.
The units described as separate units may or may not be physically separate, and units shown as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, each functional unit in the embodiments of the present application may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units may be implemented in hardware or in software functional units.
The integrated units, if implemented in the form of software functional units and sold or used as stand-alone products, may be stored in a computer-readable storage medium. Based on such understanding, the technical solution of the present application, in essence or in the part contributing to the prior art, or all or part of the technical solution, may be embodied in the form of a software product, which is stored in a storage medium and includes multiple instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to perform all or part of the steps of the method according to the embodiments of the present application. The aforementioned storage medium includes: a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, an optical disk, or other various media capable of storing a program.
Preferred embodiments of the disclosed embodiments are described above with reference to the accompanying drawings, and thus do not limit the scope of the claims of the disclosed embodiments. Any modifications, equivalent substitutions and improvements made by those skilled in the art without departing from the scope and spirit of the embodiments of the present disclosure shall fall within the scope of the claims of the embodiments of the present disclosure.

Claims (9)

1. A method for training a model, the method being used for training a virtual face image generation model and comprising:
Acquiring a face image, and performing feature extraction processing on the face image to obtain a first feature image;
performing feature stitching processing on the first feature image and preset virtual face feature data to obtain a combined feature image;
performing self-attention processing on the first characteristic image through a preset self-attention model to obtain a second characteristic image;
Performing feature extraction processing on the combined feature image and the second feature image to obtain a third feature image;
training a preset neural network model according to the third feature image to obtain a virtual face image generation model;
wherein the acquiring of the face image comprises:
acquiring a real face video;
Acquiring a video frame image corresponding to each frame in the real face video;
extracting 3DMM features and a lower half face region of the video frame image;
And carrying out lamination treatment on the 3DMM features and the lower half face area to obtain the face image.
2. The training method of claim 1, wherein the self-attention model comprises a first neural network and a second neural network; the performing self-attention processing on the first feature image through a preset self-attention model to obtain a second feature image comprises:
performing feature extraction processing on the first feature image through the first neural network to obtain a first feature matrix;
performing reinforcement treatment on the first characteristic image through the second neural network to obtain a second characteristic matrix;
multiplying the first feature matrix and the second feature matrix to obtain a third feature matrix;
performing convolution and spectrum normalization processing on the third feature matrix to obtain a fourth feature image;
and adding the pixels of the first characteristic image and the pixels of the fourth characteristic image to obtain the second characteristic image.
3. The training method of claim 2, wherein the second neural network comprises a convolutional layer, a normalization layer, a pooling layer, and a classifier; the performing reinforcement processing on the first feature image through the second neural network to obtain a second feature matrix comprises:
carrying out convolution processing on the first characteristic image through the convolution layer to obtain a convolution matrix;
carrying out spectrum normalization processing on the convolution matrix through the normalization layer to obtain a normalization matrix;
carrying out maximum pooling treatment on the normalized matrix through the pooling layer to obtain a maximum pooling matrix;
multiplying the normalized matrix by the maximum pooling matrix to obtain a fourth feature matrix;
And classifying the fourth feature matrix through the classifier to obtain the second feature matrix.
4. The training method as claimed in any one of claims 1 to 3, wherein the neural network model comprises a discriminator; the training a preset neural network model according to the third feature image to obtain a virtual face image generation model comprises:
calculating an image true value of the third feature image through the discriminator;
calculating a loss function of the neural network model according to the image true value to obtain a loss value;
and taking the loss value as a back-propagation quantity, and adjusting model parameters of the neural network model to train the neural network model, so as to obtain the virtual face image generation model.
5. A video generation method, for generating a target face video, comprising:
acquiring text data and virtual face characteristic data of a target virtual face;
inputting the text data and the virtual face characteristic data into a virtual face image generation model to perform image generation processing to obtain a plurality of continuous frame speaking images; wherein the virtual face image generation model is trained according to the training method as set forth in any one of claims 1 to 4;
Performing image stitching processing on the plurality of continuous frame speaking images to obtain an initial face video;
performing voice conversion processing on the text data to obtain target voice;
and performing voice synthesis processing on the initial face video according to the target voice to obtain a target face video.
6. A training device for training a virtual face image generation model, comprising:
a first feature extraction module, configured to acquire a face image and perform feature extraction processing on the face image to obtain a first feature image;
a first stitching module, configured to perform feature stitching processing on the first feature image and preset virtual face feature data to obtain a combined feature image;
a self-attention processing module, configured to perform self-attention processing on the first feature image through a preset self-attention model to obtain a second feature image;
a second feature extraction module, configured to perform feature extraction processing on the combined feature image and the second feature image to obtain a third feature image; and
a model training module, configured to train a preset neural network model according to the third feature image to obtain a virtual face image generation model;
wherein the acquiring of the face image by the first feature extraction module comprises:
acquiring a real face video;
Acquiring a video frame image corresponding to each frame in the real face video;
extracting 3DMM features and a lower half face region of the video frame image;
And carrying out lamination treatment on the 3DMM features and the lower half face area to obtain the face image.
7. A video generating apparatus for generating a target face video, comprising:
a data acquisition module, configured to acquire text data and virtual face feature data of a target virtual face;
an image generation module, configured to input the text data and the virtual face feature data into a virtual face image generation model for image generation processing to obtain a plurality of continuous frame speaking images, wherein the virtual face image generation model is trained according to the training method as set forth in any one of claims 1 to 4;
a second stitching module, configured to perform image stitching processing on the plurality of continuous frame speaking images to obtain an initial face video;
a voice conversion module, configured to perform voice conversion processing on the text data to obtain a target voice; and
a voice synthesis module, configured to perform voice synthesis processing on the initial face video according to the target voice to obtain the target face video.
8. A computer device comprising a memory and a processor, wherein the memory stores a computer program, which when executed by the processor, is operable to perform:
The training method of any one of claims 1 to 4; or
The video generation method of claim 5.
9. A storage medium that is a computer-readable storage medium, wherein the computer-readable storage medium stores a computer program that, when executed by a computer, is operable to perform:
The training method of any one of claims 1 to 4; or
The video generation method of claim 5.
CN202210166388.7A 2022-02-22 2022-02-22 Model training method, video generating method and device, equipment and medium Active CN114529785B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210166388.7A CN114529785B (en) 2022-02-22 2022-02-22 Model training method, video generating method and device, equipment and medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210166388.7A CN114529785B (en) 2022-02-22 2022-02-22 Model training method, video generating method and device, equipment and medium

Publications (2)

Publication Number Publication Date
CN114529785A CN114529785A (en) 2022-05-24
CN114529785B true CN114529785B (en) 2024-06-28

Family

ID=81623941

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210166388.7A Active CN114529785B (en) 2022-02-22 2022-02-22 Model training method, video generating method and device, equipment and medium

Country Status (1)

Country Link
CN (1) CN114529785B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115035604B (en) 2022-08-10 2022-12-16 南京硅基智能科技有限公司 Method, model and training method for driving character mouth shape through audio
CN117593442B (en) * 2023-11-28 2024-05-03 拓元(广州)智慧科技有限公司 Portrait generation method based on multi-stage fine grain rendering

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109427105A (en) * 2017-08-24 2019-03-05 Tcl集团股份有限公司 The generation method and device of virtual video
CN111783647A (en) * 2020-06-30 2020-10-16 北京百度网讯科技有限公司 Training method of face fusion model, face fusion method, device and equipment

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10740593B1 (en) * 2019-01-31 2020-08-11 StradVision, Inc. Method for recognizing face using multiple patch combination based on deep neural network with fault tolerance and fluctuation robustness in extreme situation
CN111369428B (en) * 2020-03-09 2023-07-21 北京百度网讯科技有限公司 Virtual head portrait generation method and device
CN113192161B (en) * 2021-04-22 2022-10-18 清华珠三角研究院 Virtual human image video generation method, system, device and storage medium

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109427105A (en) * 2017-08-24 2019-03-05 Tcl集团股份有限公司 The generation method and device of virtual video
CN111783647A (en) * 2020-06-30 2020-10-16 北京百度网讯科技有限公司 Training method of face fusion model, face fusion method, device and equipment

Also Published As

Publication number Publication date
CN114529785A (en) 2022-05-24


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant