CN118800274A - Automatic AI digital human expression generating system based on voice drive - Google Patents
Automatic AI digital human expression generating system based on voice drive
- Publication number: CN118800274A
- Application number: CN202410973676.2A
- Authority: CN
- Country: China
- Prior art keywords: voice, expression, model, emotion, dimensional
- Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Abstract
The invention discloses a voice-driven automatic expression generation system for AI digital humans, which comprises a voice expression generation module, a facial expression database, an expression feature extraction module and a face three-dimensional reconstruction module. Speech synthesis parameters are controlled by manually formulated rules to generate speech with a specific expression, and a model is trained on a large amount of emotion-labeled speech data to learn the relation between speech and emotion and to generate expressive speech for new text. Driven by such expressive speech and supported by the facial expression database, the expression feature extraction module and the face three-dimensional reconstruction module, the digital human can be driven more smoothly, the amount of computation is reduced, and the digital human's expressions can be rendered in finer detail, which makes the system convenient to use.
Description
Technical Field
The invention relates to the technical field of digital human expression generation, and in particular to a voice-driven automatic expression generation system for AI digital humans.
Background
With the development of society, the AI industry has also grown, and digital humans are part of it. A digital human is a digital character created from a real person or designed from scratch, whose three-dimensional or two-dimensional image data is generated and converted by computer technology and stored and applied in the form of computer code; depending on the connected AI algorithms, knowledge graphs, driving systems and other capabilities, it partially or fully performs human behaviors such as conveying information, expressing emotion, interacting with others and solving problems.
However, current digital humans cannot smoothly generate expressions automatically when driven by speech carrying a specific expression; the computation involved is heavy, and the driven expressions cannot be rendered in fine detail. A voice-driven automatic expression generation system for AI digital humans is therefore proposed to solve these problems.
Disclosure of Invention
The invention aims to provide a voice-driven automatic expression generation system for AI digital humans so as to solve the problems raised in the background art.
In order to achieve the above purpose, the present invention provides the following technical solutions:
A voice-driven automatic expression generation system for AI digital humans comprises a voice expression generation module, a facial expression database, an expression feature extraction module and a face three-dimensional reconstruction module. In the voice expression generation module, speech synthesis technology is used to convey emotion or attitude through acoustic features such as pitch, speaking rate and rhythm. The facial expression database collects a large number of facial expressions and classifies them according to the semantics they convey, so that a set of images is obtained for each expression. The expression feature extraction module extracts facial features with the active shape model method and the active appearance model method. The face three-dimensional reconstruction module reconstructs the three-dimensional information of the model from images of a single viewpoint or of multiple viewpoints.
As a further scheme of the invention: the specific method for generating the voice expression in the voice expression generation module is to manually formulate rules that control the speech synthesis parameters and to generate speech with a specific expression, the rules being designed with emotional and semantic factors in mind; in addition, a model is trained on a large amount of emotion-labeled speech data to learn the relation between speech and emotion and to generate expressive speech for new text.
As a further scheme of the invention: the voice expression generation module comprises an emotion feature extraction unit and an emotion perception model unit. The emotion feature extraction unit uses speech signal analysis techniques to extract prosody, intonation, acoustic parameters and other emotion-related features from the speech, models and classifies the extracted features with statistical models and machine learning algorithms to recognize different emotion categories, and combines lexical and semantic analysis of the text to improve the accuracy of emotion recognition. The emotion perception model unit explores a multimodal emotion perception model that fuses audio, text and visual information to perceive the speaker's emotional state comprehensively; it builds the emotion perception model with advanced machine learning methods such as deep neural networks and variational autoencoders to improve recognition accuracy and generalization capability, and improves the model's adaptability to different speakers, emotional contexts and noise backgrounds by training and optimizing the model parameters.
As a further scheme of the invention: the active shape model method is implemented in two stages, training and searching. During training, n face image samples are manually annotated; for each face image, 68 feature points are used to fit the face shape model, and the positions of these feature points form the shape vector of that image. The training set of n shape vectors is normalized and aligned by solving for a transformation matrix, which removes the influence of external factors in the face images such as pose changes, differing angles and distance. Principal component analysis (PCA) is then applied to the aligned shape vectors of the training set, so that any training shape vector can be expressed by the mean shape vector together with the parameters obtained from PCA; local features are built for each feature point, and a new position is searched for each feature point in every iteration.
As a further scheme of the invention: the searching stage comprises a local texture model and a global statistical model, which realize local search and global constraint respectively; when certain feature points fall into a local extremum or deviate significantly during the local search, the global statistical model corrects them.
As a further scheme of the invention: the active appearance model method jointly analyzes the shape information and texture information of the face and builds a hybrid model; it is divided into modeling and feature matching. Modeling means building a hybrid model that carries both shape information and texture information. Feature matching means expressing an energy function as the mean square error between the hybrid model and the input image, updating the model parameters by computation to generate new feature point positions, and iterating this process until the final feature point positions are obtained.
As a further scheme of the invention: the face three-dimensional reconstruction module comprises a three-dimensional face reconstruction unit based on multi-view information, a three-dimensional face reconstruction unit based on a deformable statistical model, and a three-dimensional face reconstruction unit based on shape from shading.
As a further scheme of the invention: in the three-dimensional face reconstruction unit based on multi-view information, the camera-viewpoint recovery stage first uses computer vision techniques to estimate the camera parameters of each captured face image and recovers the three-dimensional coordinates of the facial feature points of the input face; the scattered-point interpolation stage then computes the three-dimensional coordinates of the remaining points from the estimated feature-point coordinates with an interpolation algorithm; finally, in the shape repositioning stage, with the camera viewpoint held fixed, additional correspondences between facial feature points and image coordinates are defined to improve the accuracy of the shape fit.
As a further scheme of the invention: in the three-dimensional face reconstruction unit based on the deformable model, once a new face image is given, the deformable model is matched and combined with the image; the corresponding model parameters are modified so that the model deforms until the difference between the model and the face image is minimized, while the texture is optimized and adjusted, thereby completing the face modeling.
As a still further aspect of the invention: in the three-dimensional face reconstruction unit based on shape from shading, the shading variations of the object surface in a single image or in multiple images are used to recover parameter values such as the relative height, surface normal direction, surface gradient and tilt of every point on the object surface, thereby reconstructing the object model.
Compared with the prior art, the invention has the beneficial effects that:
By providing the voice expression generation module, speech synthesis parameters are controlled through manually formulated rules to generate speech with a specific expression, and a model is trained on a large amount of emotion-labeled speech data to learn the relation between speech and emotion and to generate expressive speech for new text. Driven by such expressive speech and supported by the facial expression database, the expression feature extraction module and the face three-dimensional reconstruction module, the digital human can be driven more smoothly, the amount of computation is reduced, and the digital human's expressions can be rendered in finer detail, which makes the system convenient to use.
Drawings
Fig. 1 is a schematic structural diagram of an AI digital human automatic expression generating system based on voice driving in the present invention.
Fig. 2 is a schematic structural diagram of a speech expression generating module in the present invention.
Fig. 3 is a schematic structural diagram of a three-dimensional face reconstruction module according to the present invention.
Detailed Description
In one embodiment, as shown in figs. 1-3, a voice-driven automatic expression generation system for AI digital humans comprises a voice expression generation module, a facial expression database, an expression feature extraction module and a face three-dimensional reconstruction module. In the voice expression generation module, speech synthesis technology is used to convey emotion or attitude through acoustic features such as pitch, speaking rate and rhythm. The facial expression database collects a large number of facial expressions and classifies them according to the semantics they convey, so that a set of images is obtained for each expression. The expression feature extraction module extracts facial features with the active shape model method and the active appearance model method. The face three-dimensional reconstruction module reconstructs the three-dimensional information of the model from images of a single viewpoint or of multiple viewpoints.
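For orientation, the following Python sketch outlines how the four modules could fit together in code; all class and function names here (VoiceExpressionGenerator, FacialExpressionDatabase, and so on) are illustrative assumptions introduced for this example, not names taken from the invention.

```python
# Minimal structural sketch of the pipeline described above (hypothetical names).
from dataclasses import dataclass

@dataclass
class ExpressiveSpeech:
    waveform: list          # synthesized audio samples
    emotion: str            # emotion label carried by the speech

class VoiceExpressionGenerator:
    def synthesize(self, text: str, emotion: str) -> ExpressiveSpeech:
        # rule-based or learned control of synthesis parameters (see the later sketches)
        return ExpressiveSpeech(waveform=[], emotion=emotion)

class FacialExpressionDatabase:
    def lookup(self, emotion: str) -> list:
        # return the set of face images collected for this expression class
        return []

class ExpressionFeatureExtractor:
    def extract(self, face_image) -> list:
        # ASM/AAM-style landmark features (see the later sketches)
        return []

class Face3DReconstructor:
    def reconstruct(self, images: list):
        # multi-view / deformable-model / shape-from-shading reconstruction
        return None

def drive_digital_human(text: str, emotion: str):
    """Illustrative end-to-end flow: expressive speech -> expression features -> 3D face."""
    speech = VoiceExpressionGenerator().synthesize(text, emotion)
    exemplars = FacialExpressionDatabase().lookup(speech.emotion)
    features = [ExpressionFeatureExtractor().extract(img) for img in exemplars]
    mesh = Face3DReconstructor().reconstruct(exemplars)
    return speech, features, mesh
```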
The specific method for generating the voice expression in the voice expression generation module is to manually formulate rules that control the speech synthesis parameters and to generate speech with a specific expression, the rules being designed with emotional and semantic factors in mind; in addition, a model is trained on a large amount of emotion-labeled speech data to learn the relation between speech and emotion and to generate expressive speech for new text.
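As an illustration of the rule-based control of synthesis parameters, the following minimal Python sketch maps an emotion label to pitch, speaking-rate and energy settings; the specific emotions and scaling values are assumptions made for the example only, not values specified by the invention.

```python
# Each emotion maps to scaling factors for synthesis parameters such as pitch,
# speaking rate and energy. The table values are illustrative assumptions.
PROSODY_RULES = {
    # emotion: (pitch_scale, rate_scale, energy_scale)
    "happiness": (1.15, 1.10, 1.10),
    "sadness":   (0.90, 0.85, 0.80),
    "anger":     (1.10, 1.15, 1.30),
    "neutral":   (1.00, 1.00, 1.00),
}

def synthesis_parameters(emotion: str, base_pitch_hz: float = 200.0,
                         base_rate_wps: float = 3.0, base_energy: float = 1.0) -> dict:
    """Return speech-synthesis parameters adjusted for the requested emotion."""
    pitch_scale, rate_scale, energy_scale = PROSODY_RULES.get(emotion, PROSODY_RULES["neutral"])
    return {
        "pitch_hz": base_pitch_hz * pitch_scale,
        "rate_wps": base_rate_wps * rate_scale,
        "energy": base_energy * energy_scale,
    }

print(synthesis_parameters("anger"))   # pitch raised, rate faster, energy higher for anger
```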
The voice expression generation module comprises an emotion feature extraction unit and an emotion perception model unit. The emotion feature extraction unit uses speech signal analysis techniques to extract prosody, intonation, acoustic parameters and other emotion-related features from the speech, models and classifies the extracted features with statistical models and machine learning algorithms to recognize different emotion categories, and combines lexical and semantic analysis of the text to improve the accuracy of emotion recognition. The emotion perception model unit explores a multimodal emotion perception model that fuses audio, text and visual information to perceive the speaker's emotional state comprehensively; it builds the emotion perception model with advanced machine learning methods such as deep neural networks and variational autoencoders to improve recognition accuracy and generalization capability, and improves the model's adaptability to different speakers, emotional contexts and noise backgrounds by training and optimizing the model parameters.
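A minimal sketch of the emotion feature extraction and classification step is given below, assuming the librosa and scikit-learn libraries are available; the chosen features (pitch, energy, MFCC statistics) and the SVM classifier are one plausible realization of the statistical-model approach, not the only one contemplated.

```python
# Prosodic and acoustic features are extracted with librosa and classified
# with a conventional statistical/ML model (an SVM here).
import numpy as np
import librosa
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

def emotion_features(wav_path: str) -> np.ndarray:
    """Extract a fixed-length prosodic/acoustic feature vector from one utterance."""
    y, sr = librosa.load(wav_path, sr=16000)
    f0 = librosa.yin(y, fmin=60, fmax=400, sr=sr)        # pitch contour (intonation)
    rms = librosa.feature.rms(y=y)[0]                     # energy contour
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)    # spectral envelope
    return np.concatenate([
        [f0.mean(), f0.std(), rms.mean(), rms.std()],      # prosodic statistics
        mfcc.mean(axis=1), mfcc.std(axis=1),               # acoustic statistics
    ])

def train_emotion_classifier(wav_paths, labels):
    """Fit a statistical model (SVM) on emotion-labeled speech data."""
    X = np.vstack([emotion_features(p) for p in wav_paths])
    clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", probability=True))
    clf.fit(X, labels)
    return clf
```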
The facial expression database contains 4824 pictures of 67 subjects; each subject gazes straight ahead, to the left and to the right, and shows 8 facial expressions, namely anger, contempt, disgust, fear, happiness, sadness, surprise and neutral.
Each picture carries 40 attribute labels, including the expression-related labels smiling and non-smiling; the classification of smiling versus non-smiling expressions comprises the following steps: face key point detection, face correction and face cropping.
Face key point detection applies a detection tool to the facial expression image to obtain 68 facial key points.
Face correction straightens the face with an affine transformation: based on the key point coordinates obtained by face key point detection, the 37th and 46th points (the two outer eye corners) are joined by a line segment, and an affine transformation is applied along this segment to correct a tilted face.
Face cropping uses the coordinates of the leftmost, rightmost, topmost and bottommost of the 68 key points to frame a square at a certain ratio and crop the face; the cropped face is finally resized to 256 × 256.
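The correction and cropping steps could be realized roughly as in the following Python sketch, which assumes a 68-point landmark array has already been obtained from a key-point detector; the margin value and the choice of rotating about the landmark centroid are assumptions of the example. Indices 36 and 45 (0-based) correspond to the 37th and 46th points, i.e. the two outer eye corners.

```python
import cv2
import numpy as np

def align_and_crop(image: np.ndarray, landmarks: np.ndarray, out_size: int = 256,
                   margin: float = 0.2) -> np.ndarray:
    """Rotate the face so the eye corners are horizontal, then crop a square region."""
    left_eye, right_eye = landmarks[36], landmarks[45]
    angle = np.degrees(np.arctan2(right_eye[1] - left_eye[1],
                                  right_eye[0] - left_eye[0]))
    c = landmarks.mean(axis=0)
    center = (float(c[0]), float(c[1]))                   # rotate about the landmark centroid
    rot = cv2.getRotationMatrix2D(center, angle, 1.0)
    rotated = cv2.warpAffine(image, rot, (image.shape[1], image.shape[0]))

    # apply the same affine transform to the landmarks
    pts = cv2.transform(landmarks.reshape(-1, 1, 2).astype(np.float32), rot).reshape(-1, 2)
    x0, y0 = pts.min(axis=0)
    x1, y1 = pts.max(axis=0)
    side = max(x1 - x0, y1 - y0) * (1 + margin)           # square frame with a margin
    cx, cy = (x0 + x1) / 2, (y0 + y1) / 2
    x, y = int(cx - side / 2), int(cy - side / 2)
    crop = rotated[max(y, 0):int(y + side), max(x, 0):int(x + side)]
    return cv2.resize(crop, (out_size, out_size))          # final 256 x 256 face
```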
The active shape model method is implemented in two stages, training and searching. During training, n face image samples are manually annotated; for each face image, 68 feature points are used to fit the face shape model, and the positions of these feature points form the shape vector of that image. The training set of n shape vectors is normalized and aligned by solving for a transformation matrix, which removes the influence of external factors in the face images such as pose changes, differing angles and distance. Principal component analysis (PCA) is then applied to the aligned shape vectors of the training set, so that any training shape vector can be expressed by the mean shape vector together with the parameters obtained from PCA; local features are built for each feature point, and a new position is searched for each feature point in every iteration.
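A minimal sketch of the ASM training stage is shown below; the centring and scale normalisation stand in for a full Procrustes alignment, and the number of retained modes is an assumed example value.

```python
# n manually annotated shapes of 68 points each are aligned, then PCA yields
# a mean shape and principal modes; any training shape ~ mean + P @ b.
import numpy as np

def align_shapes(shapes: np.ndarray) -> np.ndarray:
    """shapes: (n, 68, 2) landmark sets -> aligned, flattened shape vectors (n, 136)."""
    aligned = shapes - shapes.mean(axis=1, keepdims=True)            # remove translation
    aligned = aligned / np.linalg.norm(aligned, axis=(1, 2), keepdims=True)  # remove scale
    return aligned.reshape(len(shapes), -1)

def train_shape_model(shapes: np.ndarray, n_modes: int = 10):
    X = align_shapes(shapes)
    mean = X.mean(axis=0)
    U, S, Vt = np.linalg.svd(X - mean, full_matrices=False)          # PCA via SVD
    P = Vt[:n_modes].T                                               # (136, n_modes) eigen-shapes
    var = (S[:n_modes] ** 2) / (len(X) - 1)                          # variance of each mode
    return mean, P, var

def project_shape(shape_vec: np.ndarray, mean: np.ndarray, P: np.ndarray):
    """Shape parameters b for one aligned shape vector; reconstruction = mean + P @ b."""
    b = P.T @ (shape_vec - mean)
    return b, mean + P @ b
```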
The searching stage comprises a local texture model and a global statistical model, which realize local search and global constraint respectively; when certain feature points fall into a local extremum or deviate significantly during the local search, the global statistical model corrects them.
The active appearance model method jointly analyzes the shape information and texture information of the face and builds a hybrid model; it is divided into modeling and feature matching. Modeling means building a hybrid model that carries both shape information and texture information. Feature matching means expressing an energy function as the mean square error between the hybrid model and the input image, updating the model parameters by computation to generate new feature point positions, and iterating this process until the final feature point positions are obtained.
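The feature-matching loop could be sketched as follows; the finite-difference gradient step is a simplified stand-in for the parameter-update computation (practical AAM fitting usually relies on a precomputed Jacobian or inverse-compositional update), and the `render` callable that synthesises an appearance from parameters is an assumption of the example.

```python
# Energy = mean square error between the model-synthesised appearance and the
# input image; parameters are updated iteratively until convergence.
import numpy as np

def fit_aam(render, params: np.ndarray, image: np.ndarray,
            lr: float = 0.1, iters: int = 50, eps: float = 1e-3) -> np.ndarray:
    def energy(p):
        return np.mean((render(p) - image) ** 2)      # MSE between model and input

    for _ in range(iters):
        grad = np.zeros_like(params)
        for i in range(len(params)):                  # finite-difference gradient
            step = np.zeros_like(params)
            step[i] = eps
            grad[i] = (energy(params + step) - energy(params - step)) / (2 * eps)
        params = params - lr * grad                   # update parameters; new feature positions follow
    return params
```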
The face three-dimensional reconstruction module comprises a three-dimensional face reconstruction unit based on multi-view information, a three-dimensional face reconstruction unit based on a deformable statistical model, and a three-dimensional face reconstruction unit based on shape from shading.
In the three-dimensional face reconstruction unit based on multi-view information, the camera-viewpoint recovery stage first uses computer vision techniques to estimate the camera parameters (position, orientation and focal length) of each captured face image and recovers the three-dimensional coordinates of the facial feature points of the input face; the scattered-point interpolation stage then computes the three-dimensional coordinates of the remaining points from the estimated feature-point coordinates with an interpolation algorithm; finally, in the shape repositioning stage, with the camera viewpoint held fixed, additional correspondences between facial feature points and image coordinates are defined to improve the accuracy of the shape fit.
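A minimal sketch of the triangulation and scattered-point interpolation steps is shown below, assuming two calibrated views whose 3×4 projection matrices and matched feature points come from the camera-parameter estimation step; OpenCV and SciPy are used for illustration, and the dense depth interpolation is one simple realization of the scattered-point interpolation stage.

```python
import numpy as np
import cv2
from scipy.interpolate import griddata

def triangulate_features(P1: np.ndarray, P2: np.ndarray,
                         pts1: np.ndarray, pts2: np.ndarray) -> np.ndarray:
    """P1, P2: 3x4 projection matrices; pts1, pts2: (N, 2) matched feature points."""
    X_h = cv2.triangulatePoints(P1, P2,
                                pts1.T.astype(np.float64),
                                pts2.T.astype(np.float64))
    return (X_h[:3] / X_h[3]).T                        # (N, 3) feature points in 3D

def densify_depth(pts_2d: np.ndarray, depths: np.ndarray,
                  width: int, height: int) -> np.ndarray:
    """Scattered-point interpolation of feature depths onto a full image grid."""
    grid_x, grid_y = np.meshgrid(np.arange(width), np.arange(height))
    return griddata(pts_2d, depths, (grid_x, grid_y), method="cubic")
```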
In the three-dimensional face reconstruction unit based on the deformable model, once a new face image is given, the deformable model is matched and combined with the image; the corresponding model parameters are modified so that the model deforms until the difference between the model and the face image is minimized, while the texture is optimized and adjusted, thereby completing the face modeling.
The deformable statistical model is a parameterized face deformation model built from the data in a face database; by controlling the parameters of the model, any desired face shape can be generated.
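A minimal sketch of fitting such a parameterized deformation model to observed landmarks is given below; the linear, ridge-regularised least-squares solve is a simplified stand-in for the iterative shape-and-texture optimisation described above, and the mean/basis arrays are assumed to come from a face database (e.g. a morphable-model style basis).

```python
# Face shape is expressed as mean + basis @ params; the parameters are chosen
# to minimise the difference between model landmarks and observed landmarks.
import numpy as np

def fit_morphable_model(mean: np.ndarray, basis: np.ndarray,
                        observed: np.ndarray, reg: float = 1e-3) -> np.ndarray:
    """
    mean:     (3m,)   mean landmark coordinates of the model
    basis:    (3m, k) principal deformation directions
    observed: (3m,)   observed landmark coordinates to fit
    Returns the k deformation parameters (ridge-regularised least squares).
    """
    A = basis.T @ basis + reg * np.eye(basis.shape[1])
    b = basis.T @ (observed - mean)
    return np.linalg.solve(A, b)

def synthesise_shape(mean: np.ndarray, basis: np.ndarray, params: np.ndarray) -> np.ndarray:
    """Generate a face shape from the parameterized deformation model."""
    return mean + basis @ params
```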
In the three-dimensional face reconstruction unit based on shape from shading, the shading variations of the object surface in a single image or in multiple images are used to recover parameter values such as the relative height, surface normal direction, surface gradient and tilt of every point on the object surface, thereby reconstructing the object model.
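One common multi-image realization of shading-based shape recovery is a photometric-stereo-style formulation; the sketch below assumes a Lambertian surface and known light directions, which are assumptions of the example rather than requirements stated above, and the gradient integration shown is a deliberately naive one.

```python
# Per-pixel surface normals are recovered from the shading in several images,
# converted to surface gradients, and integrated into a rough relative-height map.
import numpy as np

def normals_from_shading(images: np.ndarray, lights: np.ndarray) -> np.ndarray:
    """images: (k, H, W) intensities; lights: (k, 3) unit light directions."""
    k, H, W = images.shape
    I = images.reshape(k, -1)                              # (k, H*W)
    G = np.linalg.lstsq(lights, I, rcond=None)[0]          # (3, H*W) = albedo * normal
    albedo = np.linalg.norm(G, axis=0) + 1e-8
    return (G / albedo).reshape(3, H, W)                   # unit surface normals

def integrate_height(normals: np.ndarray) -> np.ndarray:
    """Convert normals to gradients p, q and integrate them into relative heights."""
    nx, ny, nz = normals
    p = -nx / (nz + 1e-8)                                  # dz/dx
    q = -ny / (nz + 1e-8)                                  # dz/dy
    # naive integration: accumulate q down the first column, then p along each row
    return np.cumsum(q[:, :1], axis=0) + np.cumsum(p, axis=1)
```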
According to the invention, speech synthesis parameters are controlled by manually formulated rules to generate speech with a specific expression, and a model is trained on a large amount of emotion-labeled speech data to learn the relation between speech and emotion and to generate expressive speech for new text. Driven by such expressive speech and supported by the facial expression database, the expression feature extraction module and the face three-dimensional reconstruction module, the digital human can be driven more smoothly, the amount of computation is reduced, and the digital human's expressions can be rendered in finer detail, which makes the system convenient to use.
The foregoing is only a preferred embodiment of the present invention, but the scope of protection of the present invention is not limited thereto; any equivalent substitution or modification made, within the technical scope disclosed by the present invention, by a person skilled in the art according to the technical solution of the present invention and its inventive concept shall be covered by the scope of protection of the present invention.
Claims (10)
1. A voice-driven automatic expression generation system for AI digital humans, characterized by comprising a voice expression generation module, a facial expression database, an expression feature extraction module and a face three-dimensional reconstruction module; in the voice expression generation module, speech synthesis technology is used to convey emotion or attitude through acoustic features such as pitch, speaking rate and rhythm; the facial expression database collects a large number of facial expressions and classifies them according to the semantics they convey, so that a set of images is obtained for each expression; the expression feature extraction module extracts facial features with the active shape model method and the active appearance model method; the face three-dimensional reconstruction module reconstructs the three-dimensional information of the model from images of a single viewpoint or of multiple viewpoints.
2. The voice-driven automatic expression generation system for AI digital humans according to claim 1, wherein the specific method for generating the voice expression in the voice expression generation module is to control the speech synthesis parameters by manually formulated rules, the rules being designed with emotional and semantic factors in mind; and a model is trained on a large amount of emotion-labeled speech data to learn the relation between speech and emotion and to generate expressive speech for new text.
3. The voice-driven automatic expression generation system for AI digital humans according to claim 2, wherein the voice expression generation module comprises an emotion feature extraction unit and an emotion perception model unit; the emotion feature extraction unit uses speech signal analysis techniques to extract prosody, intonation, acoustic parameters and other emotion-related features from the speech, models and classifies the extracted features with statistical models and machine learning algorithms to recognize different emotion categories, and combines lexical and semantic analysis of the text to improve the accuracy of emotion recognition; the emotion perception model unit explores a multimodal emotion perception model that fuses audio, text and visual information to perceive the speaker's emotional state comprehensively, builds the emotion perception model with deep neural networks and variational autoencoders to improve recognition accuracy and generalization capability, and improves the model's adaptability to different speakers, emotional contexts and noise backgrounds by training and optimizing the model parameters.
4. The voice-driven automatic expression generation system for AI digital humans according to claim 1, wherein the active shape model method is implemented in two stages, training and searching; during training, n face image samples are manually annotated, 68 feature points are used for each face image to fit the face shape model, and the positions of these feature points form the shape vector of the image; the training set of n shape vectors is normalized and aligned by solving for a transformation matrix, removing the influence of external factors in the face images such as pose changes, differing angles and distance; principal component analysis (PCA) is then applied to the aligned shape vectors of the training set, any training shape vector is expressed by the mean shape vector together with the parameters obtained from PCA, local features are built for each feature point, and a new position is searched for each feature point in every iteration.
5. The voice-driven automatic expression generation system for AI digital humans according to claim 4, wherein the searching comprises a local texture model and a global statistical model, which realize local search and global constraint respectively; when certain feature points fall into a local extremum or deviate significantly during the local search, the global statistical model corrects them.
6. The voice-driven automatic expression generation system for AI digital humans according to claim 1, wherein the active appearance model method jointly analyzes the shape information and texture information of the face, builds a hybrid model, and is divided into modeling and feature matching; modeling means building a hybrid model that carries both shape information and texture information; feature matching means expressing an energy function as the mean square error between the hybrid model and the input image, updating the model parameters by computation to generate new feature point positions, and iterating this process until the final feature point positions are obtained.
7. The voice-driven automatic expression generation system for AI digital humans according to claim 1, wherein the face three-dimensional reconstruction module comprises a three-dimensional face reconstruction unit based on multi-view information, a three-dimensional face reconstruction unit based on a deformable statistical model, and a three-dimensional face reconstruction unit based on shape from shading.
8. The voice-driven automatic expression generation system for AI digital humans according to claim 7, wherein, in the three-dimensional face reconstruction unit based on multi-view information, the camera parameters of each captured face image, including position, orientation and focal length, are first estimated by computer vision techniques, and the three-dimensional coordinates of the facial feature points of the input face are recovered; the three-dimensional coordinates of the remaining points are then computed from the estimated feature-point coordinates with an interpolation algorithm in the scattered-point interpolation stage; finally, in the shape repositioning stage, with the camera viewpoint held fixed, additional correspondences between facial feature points and image coordinates are defined to improve the accuracy of the shape fit.
9. The voice-driven automatic expression generation system for AI digital humans according to claim 7, wherein, in the three-dimensional face reconstruction unit based on the deformable model, once a new face image is given, the deformable model is matched and combined with the image, the corresponding model parameters are modified so that the model deforms until the difference between the model and the face image is minimized, and the texture is optimized and adjusted, thereby completing the face modeling.
10. The voice-driven automatic expression generation system for AI digital humans according to claim 7, wherein, in the three-dimensional face reconstruction unit based on shape from shading, the shading variations of the object surface in a single image or in multiple images are used to recover parameter values such as the relative height, surface normal direction, surface gradient and tilt of every point on the object surface, thereby reconstructing the object model.
Priority Applications (1)

| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202410973676.2A | 2024-07-19 | 2024-07-19 | Automatic AI digital human expression generating system based on voice drive |
Publications (1)

| Publication Number | Publication Date |
|---|---|
| CN118800274A | 2024-10-18 |
Family
- ID: 93034809

Family Applications (1)

| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| CN202410973676.2A | Automatic AI digital human expression generating system based on voice drive | 2024-07-19 | 2024-07-19 |

Country Status (1)

| Country | Link |
|---|---|
| CN | CN118800274A |
Legal Events

| Code | Title |
|---|---|
| PB01 | Publication |
| SE01 | Entry into force of request for substantive examination |