
CN117808934A - Data processing method and related equipment - Google Patents

Data processing method and related equipment

Info

Publication number
CN117808934A
CN117808934A (application CN202211202267.XA)
Authority
CN
China
Prior art keywords
style information
image sequence
style
information
semantic
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211202267.XA
Other languages
Chinese (zh)
Inventor
周世奇
许斌
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Huawei Technologies Co Ltd
Original Assignee
Huawei Technologies Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huawei Technologies Co Ltd filed Critical Huawei Technologies Co Ltd
Priority to CN202211202267.XA priority Critical patent/CN117808934A/en
Priority to PCT/CN2023/103012 priority patent/WO2024066549A1/en
Publication of CN117808934A publication Critical patent/CN117808934A/en
Pending legal-status Critical Current


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T13/00 Animation
    • G06T13/20 3D [Three Dimensional] animation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/0475 Generative networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/094 Adversarial learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T15/00 3D [Three Dimensional] image rendering
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T15/00 3D [Three Dimensional] image rendering
    • G06T15/005 General purpose rendering architectures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V10/806 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Molecular Biology (AREA)
  • Mathematical Physics (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computer Graphics (AREA)
  • Multimedia (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Processing Or Creating Images (AREA)

Abstract

The data processing method can be applied to scenarios such as animation style migration. The method includes the following steps: acquiring first style information; extracting action information of a first image sequence; and generating a second image sequence based on the first style information and the action information, where the second image sequence has the same action type as the first image sequence and carries the first style information. By acquiring the style information and the action information separately and generating the second image sequence from the first style information and the action information, stylized animation editing is achieved without changing the other characteristics of the original image sequence, and the style migration effect of the animation is improved.

Description

Data processing method and related equipment
Technical Field
The present disclosure relates to the field of computer technologies, and in particular, to a data processing method and related devices.
Background
With the development of the metaverse concept, the "virtual digital person" is regarded as the medium through which humans will enter the metaverse in the future, and has consequently attracted wide public attention. As driving technology matures, virtual digital persons will inevitably be widely applied in more practical scenarios that can be rendered, such as virtual customer service, virtual shopping guides, and virtual presenters.
Currently, there are two main approaches to driving a virtual digital person to imitate human behavior: purely manual modeling and motion-capture modeling. Purely manual modeling is used for hyper-realistic or celebrity virtual persons, but it has a long production cycle and high cost. Motion-capture modeling collects model data with external scanning equipment for driving; compared with purely manual modeling it costs considerably less time and money and is commonly used in entertainment industries such as film, television, and live streaming, but it requires the participation of real actors and cannot further improve production efficiency.
Therefore, how to realize migration of different styles between animation actions is a technical problem to be solved.
Disclosure of Invention
The embodiment of the application provides a data processing method and related equipment. The method is used for realizing stylized animation editing under the condition that other characteristics of the original image sequence are not changed, and improving the style migration effect of the animation.
The first aspect of the embodiments of the present application provides a data processing method, which can be applied to scenarios such as animation style migration. The method may be performed by a data processing apparatus or by a component of a data processing apparatus (e.g., a processor, a chip, or a chip system). The method includes the following steps: acquiring first style information; acquiring action information of a first image sequence; and generating a second image sequence based on the first style information and the action information, where the second image sequence has the same action type as the first image sequence and carries the first style information. Style information may be understood as a style description of an image sequence, where the style includes one or more of the following: limb/face contour, limb/face proportion, limb motion amplitude, emotion, personality, etc. The action type is used to describe the action of an image sequence, e.g., running, jumping, or walking. Action information may be understood as the vector used at a lower level to represent the action type. It will be appreciated that the action vectors corresponding to image sequences of the same action type may differ.
In the embodiments of the present application, the style information and the action information are acquired separately, and the second image sequence is generated based on the first style information and the action information. Stylized animation editing is thus achieved without changing the other characteristics of the original image sequence, and the style migration effect of the animation is improved.
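The following is a minimal sketch of this flow. It is only an illustration: the function signature, the use of PyTorch tensors, and concatenation as the fusion operation are assumptions based on the encoders and decoder described later in this application, not a disclosed implementation.

```python
import torch

def generate_second_sequence(first_seq: torch.Tensor,
                             first_style: torch.Tensor,
                             content_encoder: torch.nn.Module,
                             decoder: torch.nn.Module) -> torch.Tensor:
    """Keep the action type of `first_seq` while applying `first_style`.

    first_seq:   [T, D] per-frame pose/expression features of the first image sequence
    first_style: [S]    first style information (assumed: a style feature vector)
    """
    action_info = content_encoder(first_seq)                  # action information
    # fuse style and action information into a motion feature (concatenation assumed)
    motion_feature = torch.cat(
        [first_style.unsqueeze(0).expand(action_info.shape[0], -1), action_info],
        dim=-1)
    second_seq = decoder(motion_feature)                      # second image sequence
    return second_seq
```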
Optionally, in a possible implementation manner of the first aspect, before the step of obtaining the first style information, the method further includes: acquiring a third image sequence; acquiring first style information, including: the first style information is acquired based on the third image sequence.
In this possible implementation manner, the first style information is obtained through other third image sequences, so that the defect that a user is difficult to describe a certain type of style information can be overcome.
Optionally, in a possible implementation manner of the first aspect, the steps are as follows: acquiring first style information based on a third image sequence, including: extracting second style information of a third image sequence; the first style information is determined based on the second style information.
In the possible implementation manner, the style information of the third image sequence is directly used as the style information to be migrated to the first image sequence, so that the style of the generated second image sequence is similar to or the same as the style of the third image sequence, and the accurate migration of the style is met.
Optionally, in a possible implementation manner of the first aspect, the steps are as follows: determining the first style information based on the second style information includes: the second style information is used as the first style information.
In the possible implementation manner, the style information of the third image sequence is directly used as the style information to be migrated to the first image sequence, so that the style of the generated second image sequence is similar or identical to the style of the third image sequence, the defect that a user is difficult to describe certain style information is overcome, and the accurate migration of the style is met.
Optionally, in a possible implementation manner of the first aspect, the steps are as follows: determining the first style information based on the second style information includes: displaying a second semantic tag to the user, the second semantic tag being used to describe second style information; modifying the second semantic tags to first semantic tags based on a first operation of a user, the first semantic tags being used to describe first style information; first style information is determined based on the first semantic tags.
In this possible implementation, the user modifies the semantic tag through an operation on the basis of the third image sequence, so that the style information it describes matches the user's requirement, and the subsequently generated second image sequence can meet the user's style requirement for the image sequence. Put differently, the style information is shown explicitly with a tag, so the user can analyze it quantitatively and qualitatively and thus know clearly how to describe his or her requirement quantitatively. In addition, by analyzing user requirements and exploiting the advantage that video can cover virtually any style, any customized stylized digital human animation can be generated according to the embodiments of the present application.
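A minimal sketch of this tag-editing step, under the assumption (for illustration only) that each dimension of the explicit style vector is bound to one semantic tag:

```python
from typing import Dict
import torch

def apply_tag_edit(explicit_style: torch.Tensor,
                   tag_index: Dict[str, int],
                   tag: str,
                   new_weight: float) -> torch.Tensor:
    """Turn the second semantic tag into the first semantic tag by changing
    the weight of one named style component, e.g. weakening "excited"."""
    edited = explicit_style.clone()
    edited[tag_index[tag]] = new_weight      # e.g. "excited": 0.9 -> 0.6
    return edited

# usage sketch: first_style = apply_tag_edit(second_style, {"excited": 0}, "excited", 0.6)
```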
Optionally, in a possible implementation manner of the first aspect, the third image sequence is a two-dimensional animated image sequence, the second style information is two-dimensional style information, the first style information is three-dimensional style information, and the first image sequence and the second image sequence are three-dimensional animated image sequences.
In this possible implementation, the stock of 2D videos is large enough, and any style information of a 2D video can be migrated to the original 3D video to obtain the target 3D video.
Optionally, in a possible implementation manner of the first aspect, the steps further include: displaying a first interface to a user, wherein the first interface comprises a plurality of semantic tags, the semantic tags are used for describing different style information of different image sequences, and the semantic tags are in one-to-one correspondence with the style information; acquiring first style information, including: determining a first semantic tag from the plurality of semantic tags based on a second operation of the user; first style information is determined based on the first semantic tags.
In this possible implementation, extracting arbitrary styles from videos and generating the feature library can be done offline. The user only needs to upload the semantic tag of the required personalized style, and the style information corresponding to the tag is then automatically identified from the feature library.
Optionally, in a possible implementation manner of the first aspect, the steps are as follows: generating a second image sequence based on the first style information and the motion information, comprising: fusing the first style information and the action information to obtain a first motion characteristic; a second image sequence is acquired based on the first motion feature.
In this possible implementation, the first style information represented by the first semantic tag is fused with the action information of the original image sequence to obtain the first motion feature, and the second image sequence acquired based on the first motion feature therefore realizes style migration without changing the other features of the original image sequence.
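The fusion operator is not fixed by this application; the sketch below assumes an AdaIN-style modulation in which the style information predicts a per-channel scale and shift for the action information (simple concatenation, as in the earlier sketch, would also be a valid choice).

```python
import torch
import torch.nn as nn

class StyleFusion(nn.Module):
    """Fuse first style information with action information into the first motion feature."""
    def __init__(self, style_dim: int, action_dim: int):
        super().__init__()
        self.affine = nn.Linear(style_dim, 2 * action_dim)   # scale and shift from style

    def forward(self, style: torch.Tensor, action: torch.Tensor) -> torch.Tensor:
        # style: [S], action: [T, C] (one action feature per frame)
        scale, shift = self.affine(style).chunk(2, dim=-1)
        normalized = (action - action.mean(dim=0)) / (action.std(dim=0) + 1e-6)
        return normalized * (1 + scale) + shift              # first motion feature
```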
Optionally, in a possible implementation manner of the first aspect, the action information includes one or more of the following: facial expression sequence and limb image sequence.
In the possible implementation manner, the method can be applied to style migration of limb actions, style migration of facial expressions and the like, and is widely applicable to scenes.
Optionally, in a possible implementation manner of the first aspect, the steps further include: and rendering the second image sequence to the virtual object to obtain the animation.
In this possible implementation manner, the method may be applicable to style migration scenes from 2D animation to 2D animation, from 2D animation to 3D animation, or from 3D animation to 3D animation.
Optionally, in a possible implementation manner of the first aspect, the style information of the image sequence includes explicit style information and implicit style information, and the second semantic tag is specifically used to associate the explicit style information in the second style information.
In this possible implementation, by decomposing the style information into explicit and implicit parts, the explicit style information can be edited by the user, and the modified style information is then generated from the edited explicit style information and the implicit style information.
Optionally, in a possible implementation manner of the first aspect, the steps are as follows: extracting motion information of a first image sequence, comprising: inputting the first image sequence into a content encoder to obtain motion information; extracting second style information for a third image sequence, comprising: the third image sequence is input to a style encoder to obtain second style information.
Optionally, in a possible implementation manner of the first aspect, the steps further include: acquiring a first training image sequence and a second training image sequence, wherein the first training image sequence and the second training image sequence are different in motion characteristics, and the motion characteristics comprise motion information and/or style information; respectively inputting the first training image sequence into a style encoder and a content encoder to obtain first training style information and first training action information; respectively inputting the second training image sequence into a style encoder and a content encoder to obtain second training style information and second training action information; fusing the first training style information and the second training action information to obtain first training exercise characteristics; fusing the second training style information and the first training action information to obtain second training movement characteristics; inputting the first training motion feature into a decoder to obtain a first reconstructed image sequence; inputting the second training motion feature into a decoder to obtain a second reconstructed image sequence; training with the value of the first loss function being smaller than a first threshold value to obtain a trained style encoder, content encoder and decoder, wherein the first loss function comprises a style loss function and a content loss function, the style loss function is used for representing style differences between the first reconstructed image sequence and the first training image sequence and style differences between the second reconstructed image sequence and the second training image sequence, and the content loss function is used for representing content differences between the first reconstructed image sequence and the second training image sequence and content differences between the second reconstructed image sequence and the first training image sequence.
In this possible implementation manner, accuracy of style migration can be achieved through the training process described above.
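A sketch of one training step following the cross-reconstruction scheme described above; the use of mean squared error as the distance for the style and content losses, and the equal weighting of the two terms, are assumptions rather than values fixed by this application.

```python
import torch.nn.functional as F

def training_step(seq_a, seq_b, style_enc, content_enc, decoder, fuse):
    style_a, action_a = style_enc(seq_a), content_enc(seq_a)   # first training sequence
    style_b, action_b = style_enc(seq_b), content_enc(seq_b)   # second training sequence

    recon_1 = decoder(fuse(style_a, action_b))   # first training motion feature
    recon_2 = decoder(fuse(style_b, action_a))   # second training motion feature

    # style loss: each reconstruction should keep the style of its style source
    style_loss = (F.mse_loss(style_enc(recon_1), style_a) +
                  F.mse_loss(style_enc(recon_2), style_b))
    # content loss: each reconstruction should keep the action of its action source
    content_loss = (F.mse_loss(content_enc(recon_1), action_b) +
                    F.mse_loss(content_enc(recon_2), action_a))
    return style_loss + content_loss   # compared against the first threshold during training
```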
A second aspect of an embodiment of the present application provides a data processing apparatus. The data processing apparatus includes: the acquisition unit is used for acquiring the first style information; the acquisition unit is also used for acquiring action information of the first image sequence; and the generating unit is used for generating a second image sequence based on the first style information and the action information, wherein the second image sequence is the same as the first image sequence in action type, and the second image sequence has the first style information.
Optionally, in a possible implementation manner of the second aspect, the acquiring unit is further configured to acquire a third image sequence; and the acquisition unit is specifically used for acquiring the first style information based on the third image sequence.
Optionally, in a possible implementation manner of the second aspect, the acquiring unit is specifically configured to extract second style information of the third image sequence; and the acquisition unit is specifically used for determining the first style information based on the second style information.
Optionally, in a possible implementation manner of the second aspect, the acquiring unit is specifically configured to use the second style information as the first style information.
Optionally, in a possible implementation manner of the second aspect, the acquiring unit is specifically configured to display a second semantic tag to a user, where the second semantic tag is used to describe the second style information; the acquisition unit is specifically used for modifying the second semantic tag into a first semantic tag based on a first operation of a user, and the first semantic tag is used for describing the first style information; the acquisition unit is specifically used for determining the first style information based on the first semantic tag.
Optionally, in a possible implementation manner of the second aspect, the third image sequence is a two-dimensional animated image sequence, the second style information is two-dimensional style information, the first style information is three-dimensional style information, and the first image sequence and the second image sequence are three-dimensional animated image sequences.
Optionally, in a possible implementation manner of the second aspect, the data processing apparatus further includes: the display unit is used for displaying a first interface to a user, wherein the first interface comprises a plurality of semantic tags, the semantic tags are used for describing different style information of different image sequences, and the semantic tags are in one-to-one correspondence with the style information; the acquisition unit is specifically used for determining a first semantic tag from a plurality of semantic tags based on a second operation of a user; the acquisition unit is specifically used for determining the first style information based on the first semantic tag.
Optionally, in a possible implementation manner of the second aspect, the generating unit is specifically configured to fuse the first style information and the motion information to obtain a first motion feature; the generation unit is specifically configured to acquire the second image sequence based on the first motion feature.
Optionally, in a possible implementation manner of the second aspect, the action information includes one or more of the following: facial expression sequence and limb image sequence.
Optionally, in a possible implementation manner of the second aspect, the data processing apparatus further includes: and the rendering unit is used for rendering the second image sequence to the virtual object to obtain the animation.
A third aspect of the present application provides a data processing apparatus comprising: a processor coupled to a memory for storing a program or instructions which, when executed by the processor, cause the data processing apparatus to implement the method of the first aspect or any possible implementation of the first aspect.
A fourth aspect of the present application provides a computer readable medium having stored thereon a computer program or instructions which, when run on a computer, cause the computer to perform the method of the first aspect or any possible implementation of the first aspect.
A fifth aspect of the present application provides a computer program product which, when executed on a computer, causes the computer to perform the method of the first aspect or any of the possible implementations of the first aspect.
A sixth aspect of the embodiments of the present application provides a chip system comprising at least one processor for supporting a data processing apparatus to implement the functions as referred to in the first aspect or any one of the possible implementations of the first aspect.
In one possible design, the system on a chip may further include a memory to hold program instructions and data necessary for the data processing device. The chip system can be composed of chips, and can also comprise chips and other discrete devices. Optionally, the chip system further comprises an interface circuit providing program instructions and/or data to the at least one processor.
The technical effects of the second, third, fourth, fifth, and sixth aspects or any one of the possible implementation manners of the second, third, fourth, fifth, and sixth aspects may be referred to the technical effects of the first aspect or the different possible implementation manners of the first aspect, which are not described herein.
From the above technical scheme, the application has the following advantages: and obtaining through separation of the style information and the action information, and generating a second image sequence based on the first style information and the action information. The stylized animation editing is realized under the condition that other characteristics of the original image sequence are not changed, and the style migration effect of the animation is improved.
Drawings
FIG. 1 is a schematic diagram of an artificial intelligence main body framework according to an embodiment of the present invention;
fig. 2 is a schematic structural diagram of a system architecture according to an embodiment of the present application;
fig. 3A is a schematic view of a deployment scenario provided in an embodiment of the present application;
FIG. 3B is a schematic diagram of another deployment scenario provided in an embodiment of the present application;
FIG. 4 is a schematic flow chart of a data processing method according to an embodiment of the present disclosure;
FIG. 5A is a schematic diagram of decomposing style information into explicit features according to an embodiment of the present application;
fig. 5B is a schematic diagram of a training flow of a conversion module according to an embodiment of the present application;
FIG. 6A is a schematic diagram of another flow chart of a data processing method according to an embodiment of the present disclosure;
fig. 6B is a schematic flowchart of a user modifying a tag according to an embodiment of the present application;
FIG. 7 is a schematic diagram of a user interface displayed to a user by a data processing device according to an embodiment of the present application;
FIG. 8 is another schematic diagram of a data processing apparatus displaying a user interface to a user according to an embodiment of the present application;
FIG. 9 is another schematic diagram of a data processing apparatus displaying a user interface to a user according to an embodiment of the present application;
FIG. 10 is a schematic flow chart of a data processing method according to an embodiment of the present disclosure;
FIG. 11 is an exemplary diagram of a first image sequence provided in an embodiment of the present application;
FIG. 12 is an exemplary diagram of a third image sequence provided in an embodiment of the present application;
FIG. 13 is an exemplary diagram of a second image sequence provided by an embodiment of the present application;
FIG. 14 is a schematic flow chart of a data processing method according to an embodiment of the present disclosure;
FIG. 15 is a schematic flow chart of a data processing method according to an embodiment of the present disclosure;
FIG. 16 is a schematic diagram of a training process of an encoder and a decoder according to an embodiment of the present disclosure;
FIG. 17 is a flowchart illustrating an application of the method provided in the embodiment of the present application to a gesture style migration scenario;
fig. 18 is a flowchart illustrating an application of the method provided in the embodiment of the present application to an expression style migration scene;
FIG. 19 is a schematic diagram of a data processing apparatus according to an embodiment of the present application;
fig. 20 is another schematic structural diagram of a data processing apparatus according to an embodiment of the present application.
Detailed Description
The following description of the technical solutions according to the embodiments of the present invention will be given with reference to the accompanying drawings in the embodiments of the present invention, and it is apparent that the described embodiments are only some embodiments of the present invention, but not all embodiments. All other embodiments, which are derived by a person skilled in the art from the embodiments according to the invention without creative efforts, fall within the protection scope of the invention.
For ease of understanding, related terms and concepts primarily related to embodiments of the present application are described below.
1. Neural network
A neural network may be composed of neural units. A neural unit may be an arithmetic unit that takes x_s and an intercept b as inputs, and its output may be:

$$h_{W,b}(x) = f\left(W^{T}x\right) = f\left(\sum_{s=1}^{n} W_s x_s + b\right)$$

where s = 1, 2, ..., n, n is a natural number greater than 1, W_s is the weight of x_s, and b is the bias of the neural unit. f is the activation function of the neural unit, which introduces nonlinearity into the neural network so as to convert the input signal of the neural unit into an output signal. The output of the activation function may serve as the input of the next convolutional layer. The activation function may be a sigmoid function. A neural network is a network formed by connecting many such single neural units, i.e., the output of one neural unit may be the input of another. The input of each neural unit may be connected to a local receptive field of the previous layer to extract features of that local receptive field; the local receptive field may be an area composed of several neural units.
2. Loss function
In training a deep neural network, the output of the network is expected to be as close as possible to the value that is actually desired, so the weight vector of each layer can be updated according to the difference between the current network's predicted value and the actually desired target value (of course, there is usually an initialization process before the first update, i.e., parameters are preconfigured for each layer of the deep neural network). For example, if the network's predicted value is too high, the weight vectors are adjusted to make the prediction lower, and the adjustment continues until the neural network can predict the actually desired target value. Therefore, it is necessary to define in advance "how to compare the difference between the predicted value and the target value"; this is the role of the loss function or objective function, which are important equations for measuring the difference between the predicted value and the target value. Taking the loss function as an example, a higher output value (loss) of the loss function indicates a larger difference, so training the deep neural network becomes a process of reducing this loss as much as possible.
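For illustration only (this application does not fix a particular loss), a common choice is the mean squared error with gradient-descent weight updates:

$$L(\theta) = \frac{1}{N}\sum_{i=1}^{N}\left\lVert y_i - \hat{y}_i(\theta)\right\rVert^2,\qquad \theta \leftarrow \theta - \eta\,\nabla_{\theta}L(\theta)$$

where \(y_i\) is the desired target value, \(\hat{y}_i(\theta)\) is the predicted value, and \(\eta\) is the learning rate.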
3. Generative adversarial network
A generative adversarial network (GAN) is a deep learning model. It contains at least a generation network (Generator) and a discrimination network (Discriminator); the two neural networks learn by playing against each other so that the network produces better output. Both networks may be deep neural networks or convolutional neural networks. The basic principle of a GAN is as follows, taking a GAN that generates pictures as an example. Assume there are two networks, G (Generator) and D (Discriminator). G is the network that generates pictures: it takes a random sample from the latent space as input and generates a picture, denoted G(z). D is the discrimination network used to judge whether a picture is "real". Its input parameter is x, where x represents a picture that is either a real picture or the output of the generation network. The output D(x) represents the probability that x is a real picture: 1 means the picture is certainly real, and 0 means it cannot be real. During training of the generative adversarial network, the goal of the generation network G is to generate pictures that are as realistic as possible in order to deceive the discrimination network D, and its outputs need to imitate the real samples in the training set as closely as possible. The goal of the discrimination network D is to distinguish the pictures generated by G from real pictures as well as possible. The two networks oppose each other and continuously adjust their parameters; G and D thus form a dynamic "game", which is the "adversarial" part of "generative adversarial network". The final objective is to make the discrimination network unable to tell whether the output of the generation network is real. In the ideal end state of this game, G can generate pictures G(z) that look genuine, and D finds it difficult to determine whether the picture generated by G is real, i.e., D(G(z)) = 0.5. This yields an excellent generative model G that can be used to generate pictures.
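The game described above is usually written as the following minimax objective (a standard formulation, quoted here for clarity rather than taken from this application):

$$\min_{G}\max_{D}\;\mathbb{E}_{x\sim p_{\mathrm{data}}}\big[\log D(x)\big] + \mathbb{E}_{z\sim p_{z}}\big[\log\big(1 - D(G(z))\big)\big]$$

At the optimum of this game, D(G(z)) = 0.5, matching the ideal state described above.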
4. Animation
Animation refers to virtually authored video content, including animated video displayed on a 2D plane and 3D animated content displayed on 3D display devices such as augmented reality (AR), virtual reality (VR), and holographic displays. Its style is not limited to a cartoon style and may also be a realistic style, for example digital human animation, special-effects films, and the like.
5. Virtual digital person
A virtual digital person refers to a virtual character with a digitized appearance. Unlike robots with physical entities, a virtual digital person depends on a display device to exist, such as a mobile phone, a computer, or a smart large screen. A complete virtual digital person usually needs the following three capabilities:
First, it has a human appearance, with characteristics such as specific looks, gender, and personality.
Second, it has human behavior, with the ability to express itself through language, facial expressions, and limb actions.
Third, it has human thought, with the ability to recognize the external environment and communicate and interact with people.
6. Image sequence
An image sequence may be understood as a plurality of images in a temporal relationship, or as a sequence of images captured from a video. An image sequence may include a limb image sequence and/or a facial expression sequence, etc. It may refer to an image sequence of the whole-body limbs, an image sequence of part of the limbs (also called local limbs), the facial expression sequence of the character corresponding to the image sequence, and the like, which is not specifically limited here.
7. Style information
The style information in the embodiments of the present application may be a style feature vector obtained by passing the image sequence through a style encoder, an explicit vector within that style feature vector, or some of the features of the explicit vector, etc., which is not specifically limited here. The tag corresponding to the style information may be understood as a style description of the image sequence. For example, the style includes one or more of the following: limb/face contour, limb/face proportion, limb motion amplitude, emotion, personality, etc. The emotions may include happy, depressed, excited, and so on. The personality may include lively, kind, gentle, and the like.
8. Action information
The motion information according to the embodiments of the present application may be a feature vector obtained by the image sequence through the content encoder.
9. Action type
The action type is used to describe the action of an image sequence, i.e., the action depicted by the image sequence (e.g., running, jumping, squatting, walking, raising the head, lowering the head, closing the eyes, etc.). It will be appreciated that the action vectors corresponding to image sequences of the same action type may differ.
10. Semantic tags
A semantic tag is used to describe the style information of an image sequence; it can be understood as a stylized description of the image sequence.
Style information corresponds to semantic tags one-to-one, and the semantic tags of different style information may differ. A semantic tag can be understood as a description of style information that helps a user understand or edit the style of the image sequence.
Illustratively, the style information is a style feature vector of the sequence of images. The semantic tags are used for explicitly expressing the style feature vectors, and a user can determine the styles of the image sequences/videos (such as the emotion of a character, the character and the like expressed by the limb actions of the character in the videos) through the semantic tags so as to facilitate the operations such as style editing/migration and the like.
Currently, there are three main approaches to driving a virtual digital person to imitate human behavior: purely manual modeling, motion-capture modeling, and artificial-intelligence-driven modeling. Purely manual modeling is used for hyper-realistic or celebrity virtual persons, but it has a long production cycle and high cost. Motion-capture modeling collects model data with external scanning equipment for driving; compared with purely manual modeling it costs considerably less time and money and is commonly used in entertainment industries such as film, television, and live streaming, but it requires the participation of real actors and cannot further improve production efficiency. The artificial-intelligence-driven approach is based on algorithms and machine learning: provided enough data is collected, the machine can automatically generate a virtual digital person by analyzing a large number of photos/videos, extracting various data and information about a person, and driving the virtual digital person to imitate that person's behavior. In the artificial-intelligence-driven approach, migrating different styles between animation actions is often used to reduce the motion-capture and driving cost of virtual digital person actions.
The generation and editing of stylized human-body animation is an important topic in the field of computer animation. Migrating different styles between the same animations enables arbitrary stylization of an animation and reduces the cost of motion capture and driving, but several key problems remain to be solved. First, stylized animation editing requires the animation to carry a designated style while changing the other characteristics of the original animation as little as possible, so how to better decouple style information from animation action information is an important problem. Second, how to obtain style data at low cost: video is a large data source, but how to explicitly mark the semantic-tag features of styles in massive video data, so that a user can complete editing and style migration merely by describing a style semantically, is also an important problem.
Therefore, in view of the defect that existing virtual digital person animation driving methods cannot be stylized arbitrarily, the embodiments of the present application provide a limb-action driving scheme based on extracting styles from video and explicitly marking and editing style information, aiming to fill the gap of AI-based personalized animation driving for users in general entertainment scenarios; in addition, extracting styles from video can overcome the difficulty that users find it hard to describe a certain style.
Before describing the data processing method and the related device according to the embodiments of the present application with reference to the accompanying drawings, a system architecture provided by the embodiments of the present application is described.
Referring to fig. 1, an embodiment of the present invention provides a system architecture 100. As shown in the system architecture 100, the data acquisition device 160 is configured to acquire training data, where the training data in this embodiment of the present application includes: the first training image sequence and the second training image sequence. And stores the training data in database 130, training device 120 trains to obtain target model/rule 101 based on the training data maintained in database 130. How the training device 120 obtains the target model/rule 101 based on the training data, which target model/rule 101 can be used to implement the data processing method provided by the embodiments of the present application, will be described in more detail below. The object model/rule 101 in the embodiment of the present application may specifically include a style encoder, a content encoder, and a decoder. In practical applications, the training data maintained in the database 130 is not necessarily collected by the data collecting device 160, but may be received from other devices. It should be noted that the training device 120 is not necessarily completely based on the training data maintained by the database 130 to perform training of the target model/rule 101, and it is also possible to obtain the training data from the cloud or other places to perform model training, which should not be taken as a limitation of the embodiments of the present application.
The target model/rule 101 obtained by training according to the training device 120 may be applied to different systems or devices, such as the execution device 110 shown in fig. 1, where the execution device 110 may be a terminal, such as a mobile phone terminal, a tablet computer, a notebook computer, an AR/VR, a vehicle-mounted terminal, etc., and may also be a server or cloud terminal, etc. In fig. 1, the execution device 110 is configured with an I/O interface 112 for data interaction with an external device, and a user may input data to the I/O interface 112 through the client device 140, where the input data may include in an embodiment of the present application: a first image sequence and a first semantic tag; optionally, the input data may further include a first image sequence, a second image sequence, and the like. Of course, the input data may be a two-dimensional animation (for example, a two-dimensional animation is an animation to which the second image sequence belongs) and a three-dimensional animation (for example, a three-dimensional animation is an animation to which the first image sequence belongs). In addition, the input data may be input by a user, or may be uploaded by the user through the photographing device, or may be from a database, which is not limited herein.
The preprocessing module 113 is configured to perform preprocessing (e.g., conversion of two-dimensional features to three-dimensional features, etc.) according to input data (e.g., a first image sequence and a first semantic tag, or a first image sequence and a second image sequence, or a two-dimensional animation and a three-dimensional animation) received by the I/O interface 112.
In the preprocessing of the input data by the execution device 110, or in the processing performed by the calculation module 111 of the execution device 110 to extract the action information of the first image sequence and generate the second image sequence based on the action information and the first semantic tag, the execution device 110 may call the data, the code, etc. in the data storage system 150 for the corresponding processing, or may store the second image sequence, the instruction, etc. obtained by the corresponding processing in the data storage system 150.
Finally, the I/O interface 112 returns the processing result, such as the second image sequence obtained as described above, or the three-dimensional animation corresponding to the second image sequence to the client device 140, so as to provide the processing result to the user.
It should be noted that the training device 120 may generate, based on different training data, a corresponding target model/rule 101 for different targets or different tasks, where the corresponding target model/rule 101 may be used to achieve the targets or complete the tasks, thereby providing the user with the desired result.
In the case shown in FIG. 1, the user may manually give input data that may be manipulated through an interface provided by the I/O interface 112. In another case, the client device 140 may automatically send the input data to the I/O interface 112, and if the client device 140 is required to automatically send the input data requiring the user's authorization, the user may set the corresponding permissions in the client device 140. The user may view the results output by the execution device 110 at the client device 140, and the specific presentation may be in the form of a display, a sound, an action, or the like. The client device 140 may also be used as a data collection terminal to collect input data of the input I/O interface 112 and output results of the output I/O interface 112 as new sample data as shown in the figure, and store the new sample data in the database 130. Of course, instead of being collected by the client device 140, the I/O interface 112 may directly store the input data input to the I/O interface 112 and the output result output from the I/O interface 112 as new sample data into the database 130.
It should be noted that fig. 1 is only a schematic diagram of a system architecture provided by an embodiment of the present invention, and the positional relationship among devices, apparatuses, modules, etc. shown in the drawing is not limited in any way, for example, in fig. 1, the data storage system 150 is an external memory with respect to the execution device 110, and in other cases, the data storage system 150 may be disposed in the execution device 110.
As shown in fig. 1, the target model/rule 101 is trained according to the training device 120, where the target model/rule 101 may include a style encoder, a content encoder, a decoder, etc. in embodiments of the present application.
The following describes a chip hardware structure provided in the embodiments of the present application.
Fig. 2 is a chip hardware structure according to an embodiment of the present invention, where the chip includes a neural network processor 20. The chip may be provided in an execution device 110 as shown in fig. 1 for performing the calculation of the calculation module 111. The chip may also be provided in the training device 120 as shown in fig. 1 to complete the training work of the training device 120 and output the target model/rule 101.
The neural network processor 20 may be any processor suitable for large-scale exclusive-OR operation processing, such as a neural-network processing unit (NPU), a tensor processing unit (TPU), or a graphics processing unit (GPU). Taking the NPU as an example: the neural network processor 20 is mounted as a coprocessor on a host central processing unit (CPU), and the host CPU assigns tasks to it. The core part of the NPU is the arithmetic circuit 203; the controller 204 controls the arithmetic circuit 203 to fetch data from memory (the weight memory or the input memory) and perform operations.
In some implementations, the arithmetic circuitry 203 includes a plurality of processing units (PEs) internally. In some implementations, the operational circuitry 203 is a two-dimensional systolic array. The arithmetic circuitry 203 may also be a one-dimensional systolic array or other electronic circuitry capable of performing mathematical operations such as multiplication and addition. In some implementations, the operational circuitry 203 is a general purpose matrix processor.
For example, assume that there is an input matrix a, a weight matrix B, and an output matrix C. The arithmetic circuit fetches the data corresponding to the matrix B from the weight memory 202 and buffers the data on each PE in the arithmetic circuit. The arithmetic circuit takes the matrix a data from the input memory 201 and performs matrix operation with the matrix B, and the obtained partial result or final result of the matrix is stored in the accumulator 208.
The vector calculation unit 207 may further process the output of the operation circuit, such as vector multiplication, vector addition, exponential operation, logarithmic operation, magnitude comparison, and the like. For example, vector computation unit 207 may be used for network computation of non-convolutional/non-FC layers in a neural network, such as Pooling (Pooling), batch normalization (Batch Normalization), local response normalization (Local Response Normalization), and the like.
In some implementations, the vector computation unit 207 stores the vector of processed outputs to the unified buffer 206. For example, the vector calculation unit 207 may apply a nonlinear function to an output of the operation circuit 203, such as a vector of accumulated values, to generate an activation value. In some implementations, the vector calculation unit 207 generates normalized values, combined values, or both. In some implementations, the vector of processed outputs can be used as an activation input to the operational circuitry 203, for example for use in subsequent layers in a neural network.
The unified memory 206 is used for storing input data and output data.
The storage unit access controller (direct memory access controller, DMAC) 205 transfers input data in the external memory to the input memory 201 and/or the unified memory 206, stores weight data in the external memory into the weight memory 202, and stores data in the unified memory 206 into the external memory.
A bus interface unit (bus interface unit, BIU) 210 for interfacing between the main CPU, DMAC and the instruction fetch memory 209 via a bus.
An instruction fetch memory (instruction fetch buffer) 209 coupled to the controller 204 is configured to store instructions for use by the controller 204.
And the controller 204 is used for calling the instruction cached in the instruction memory 209 to realize the control of the working process of the operation accelerator.
Typically, the unified memory 206, the input memory 201, the weight memory 202, and the instruction fetch memory 209 are on-chip memories, and the external memory is a memory external to the NPU, which may be a double data rate synchronous dynamic random access memory (DDR SDRAM), a high bandwidth memory (HBM), or another readable and writable memory.
Several deployment scenarios provided by embodiments of the present application are described below. The random style editable 3D animation generation scheme provided by the embodiment of the application can be applied to 2B-end digital moderator scenes, 2C-end digital mate scenes, assistant software scenes and the like. There are a variety of specific deployment scenarios, as exemplified below.
In a deployment scenario provided by the embodiments of the present application, shown in FIG. 3A, a user uploads an animation video representing a target style on the client. The server extracts the target style from the video and returns the semantic tag of the style to the user. The user may then describe, edit, or select the style based on its semantic tag; for example, for a style whose semantic tag is "excited", the user may want the degree of excitement to be slightly weaker. After the client finishes editing and uploading the tag, the server receives the request and reduces the weight of the style's semantic tag according to the target so as to lower the degree of excitement, thereby editing the style information, generating a target animation that matches the user's tag, and returning it to the client for rendering and display.
In another deployment scenario provided by the embodiments of the present application, shown in FIG. 3B, compared with the deployment scenario of FIG. 3A, the server extracts arbitrary styles from videos offline and generates a feature library. The user only needs to upload the semantic tags of the desired personalized style, for example adding a little soft, subdued style to the excited style. After receiving the request, the server automatically identifies the style information corresponding to the "excited" and "soft" tags from the style information library, edits the features, generates style information matching the semantic tags of the target style, and completes rendering and display.
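A minimal sketch of this server-side lookup and blending step. The file paths and the linear blending of library styles are assumptions for illustration; the application only requires that uploaded semantic tags be matched to style information stored in the offline feature library.

```python
import torch

# assumed offline feature library: semantic tag -> style feature vector (hypothetical paths)
style_library = {
    "excited": torch.load("style_library/excited.pt"),
    "soft":    torch.load("style_library/soft.pt"),
}

def style_from_tags(tag_weights: dict) -> torch.Tensor:
    """Blend library styles according to user-supplied tag weights,
    e.g. {"excited": 0.8, "soft": 0.2}."""
    total = sum(tag_weights.values())
    blended = sum(w * style_library[tag] for tag, w in tag_weights.items())
    return blended / total
```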
It will be appreciated that the above two deployment scenarios are just examples, and that other types of deployment scenarios are also possible in practical applications, and are not limited herein.
The style related to the deployment scenario may be a two-dimensional style or a three-dimensional style. In other words, the method provided by the embodiment of the application can be applied to a scene of migrating to a two-dimensional image sequence in a two-dimensional style. The method can also be applied to scenes of migration of three-dimensional styles to three-dimensional image sequences. The method can also be applied to scenes of which the two-dimensional style is migrated to the three-dimensional image sequence, or scenes of which the three-dimensional style is migrated to the two-dimensional image sequence, and the like, and is not limited in particular.
The following describes in detail a data processing method provided in an embodiment of the present application with reference to the accompanying drawings.
Referring to fig. 4, in one embodiment of a data processing method provided in the embodiments of the present application, the method may be performed by a data processing device (terminal device/cloud server), or may be performed by a component (such as a processor, a chip, or a chip system) of the data processing device, where the method includes steps 401 to 403. The method can be applied to scenes of style migration among the animations of children education animations, short video animations, propaganda animations, variety animations, film and television previewing animations and the like.
Step 401, obtaining first style information.
In one possible implementation, the style information refers to a style feature vector of the image sequence, and then a third image sequence is input to the style encoder to obtain the second style information. The training process for the style encoder will be described later, and will not be expanded here.
In another possible implementation, the style information refers to an explicit vector or a part of the explicit vector in the style feature vector of the image sequence, and then the third image sequence is input to the style encoder to obtain the style feature vector. And splitting the style feature vector into an explicit vector and an implicit vector. In this case, the style information may be understood as an explicit expression of the style feature vector.
Or it is understood that the style information in the embodiment of the present application may be a style feature vector corresponding to the image sequence, or may be an explicit vector in the style feature vector corresponding to the image sequence. But also part of the features of the explicit vector in the corresponding style feature vector of the image sequence, etc. In other words, the latter case can be understood as the fact that style information can be decomposed into explicit vectors and implicit features. Of course, this decomposition is by way of example only, and style information may also be decomposed into explicit vectors, implicit features, and personalized features. The individuation feature is used for expressing individuation differences brought by the same style when different roles are deducted. The personalized features may also be associated with characters in the image sequence.
Alternatively, in the case where the style information is an explicit vector, the style feature vector first needs to be decomposed into an explicit vector and implicit features, and the explicit vector is used as the style information.
In this embodiment, the data processing device may obtain the first style information in a plurality of ways, which are described below.
First, first style information is obtained based on a third image sequence.
In this case, the data processing apparatus acquires the third image sequence first, and acquires the first style information based on the third image sequence. The data processing device may acquire the third image sequence in various manners, which may be a manner of receiving a transmission from another device, a manner of selecting from a database, a manner of acquiring by each sensor in the data processing device, a manner of uploading by a user, and the like, and is not limited in this embodiment.
The image sequence (for example, the first image sequence, the third image sequence, etc.) in the embodiment of the present application may be a two-dimensional image sequence, or may be a three-dimensional image sequence, etc., which is not limited herein specifically.
Alternatively, the third image sequence may be an image sequence extracted from a two-dimensional animation, in order to obtain style information for more style categories. For example, the third image sequence is extracted from the two-dimensional animation by a human body posture recognition method (e.g., OpenPose). The manner of acquiring the two-dimensional animation is not limited here; it may be captured and uploaded by the user, received from another device, selected from a database, or the like, and is not specifically limited herein.
The step of acquiring the first style information based on the third image sequence is divided into two cases according to whether there is a user operation, and is described below.
1. No user operation is required.
After the data processing device acquires the third image sequence, the second style information of the third image sequence may be directly extracted and used as the first style information. Alternatively, the second style information may be converted into preset style information, and so on.
The above decomposition may be performed by a trained neural network, or by searching a database for a plurality of image sequences expressing the same style and determining the explicit vector from those image sequences, which is not limited herein. Determining the explicit vector from a plurality of image sequences expressing the same style may specifically include: inputting the plurality of image sequences of the same style into the style encoder to obtain a plurality of style feature vectors, and using the common features of the plurality of style feature vectors as the style information. The non-common portion is the implicit feature, and so on, which is not limited herein.
Illustratively, a plurality of image sequences whose expressed style is "happy" are searched from a database, and the plurality of image sequences are respectively input into the style encoder to obtain a plurality of style feature vectors. If a common vector of the plurality of style feature vectors is determined, the style information of "happy" is that common vector, thereby establishing the correspondence between the explicit style information and the common vector.
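The following is a minimal illustrative sketch in Python (not part of the claimed method; the use of the mean as the "common feature", the vector dimensions, and the variable names are assumptions for illustration only) of splitting the style feature vectors of several same-style sequences into a common explicit vector and per-sequence implicit residuals:

import numpy as np

def split_explicit_implicit(style_vectors):
    # style_vectors: (K, D) array, one style feature vector per sequence
    # expressing the same style (e.g. "happy")
    explicit = style_vectors.mean(axis=0)      # common part -> explicit vector
    implicit = style_vectors - explicit        # per-sequence residual -> implicit features
    return explicit, implicit

# Hypothetical usage: K "happy" sequences encoded into D-dimensional style vectors
rng = np.random.default_rng(0)
happy_vectors = rng.normal(size=(5, 128))      # placeholder for style-encoder outputs
happy_explicit, happy_implicit = split_explicit_implicit(happy_vectors)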
Optionally, in the case that the style information is a part of the features of the explicit vector in the style feature vector corresponding to the image sequence, the explicit vector needs to be split. For example: explicit vector = W1 × style information 1 + W2 × style information 2 + ... + Wn × style information n.
Illustratively, the style information is the explicit vector of the style feature vector. As shown in fig. 5A, the style information may include: "calm -> excited", "single -> diverse", "gentle (yin) -> firm (yang)". Here, the two sides of "->" may refer to the two boundaries of one range. For example, "calm" to "excited" is a gradual progression of emotion; or it may be understood that the style information can further distinguish different weights/levels. For another example, the intensity range of happiness may include several levels such as satisfaction, euphoria, pleasure, happiness, and mania. In this example, the style information may also be "satisfaction -> mania".
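As an illustration of the weighted form above, the following sketch (tag names, dimensions, and the linear composition operator are assumptions; the embodiment does not prescribe a specific operator) composes an edited explicit vector from per-tag components and slider-style weights such as "calm -> excited" at 0.5:

import numpy as np

def compose_explicit_vector(tag_vectors, weights):
    # tag_vectors: dict tag -> (D,) component; weights: dict tag -> scalar in [0, 1]
    dim = len(next(iter(tag_vectors.values())))
    explicit = np.zeros(dim)
    for tag, w in weights.items():
        explicit += w * tag_vectors[tag]
    return explicit

# Hypothetical per-tag components and slider values
D = 128
rng = np.random.default_rng(1)
tag_vectors = {"calm_excited": rng.normal(size=D),
               "single_diverse": rng.normal(size=D),
               "gentle_firm": rng.normal(size=D)}
weights = {"calm_excited": 0.5, "single_diverse": 1.0, "gentle_firm": 0.3}
edited_explicit = compose_explicit_vector(tag_vectors, weights)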
Alternatively, the second style information may be converted into the first style information, where the second style information is two-dimensional style information. After the data processing device acquires the second style information, it may convert the second style information into the first style information through a conversion module, where the first style information is three-dimensional style information. This case is mainly applied to a scene in which the style information of a two-dimensional animation is migrated to a three-dimensional animation to change the style information of the three-dimensional animation.
The conversion module can be understood as a 2D-3D style conversion module. The module is trained on a large number of style-consistent 2D-3D pairs to obtain a nonlinear transformation used for embedding 2D stylized features into the 3D stylized feature space. The 2D style information (i.e., the second style information) extracted from a video can be projected into the 3D space by this nonlinear transformation and thereby converted into a 3D stylized feature (i.e., the first style information).
The training process of the conversion module described above may be as shown in fig. 5B. First, a 3D animation sequence is acquired, and 3D stylized features of the 3D animation sequence are extracted. Then, by orthographically projecting the 3D animation sequence, a 2D animation sequence conforming to the 3D animation sequence style and motion is generated, and 2D style information is extracted. Finally, the two pieces of style information are aligned to the same feature space through supervision, so that projection of the 2D style information to the 3D style information space is completed.
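A minimal training sketch of the conversion module is given below (in Python/PyTorch; the MLP architecture, feature dimensions, and MSE alignment loss are assumptions for illustration, since the embodiment only requires some nonlinear transformation aligned by supervision on paired features from a 3D sequence and its orthographic 2D projection):

import torch
import torch.nn as nn

class StyleConverter2Dto3D(nn.Module):
    def __init__(self, dim_2d=128, dim_3d=256, hidden=512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim_2d, hidden), nn.ReLU(),
            nn.Linear(hidden, dim_3d))
    def forward(self, f2d):
        return self.net(f2d)

converter = StyleConverter2Dto3D()
optimizer = torch.optim.Adam(converter.parameters(), lr=1e-4)

# feats_2d / feats_3d stand in for style features extracted from paired
# 2D projections and the original 3D animation sequences
feats_2d = torch.randn(32, 128)
feats_3d = torch.randn(32, 256)

for step in range(100):
    pred_3d = converter(feats_2d)
    loss = nn.functional.mse_loss(pred_3d, feats_3d)   # align to the same feature space
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()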
2. First style information is determined based on a first operation by a user and a third image sequence.
In this way, after the data processing device extracts the second style information of the third image sequence, a second semantic tag may be displayed to the user, the second semantic tag being used to explicitly describe the second style information. And modifying the second semantic tag to the first semantic tag based on the first operation of the user. First style information is determined based on the first semantic tags. The explanation of the semantic tags may refer to the description of the related terms described above, and will not be repeated here.
The second semantic tag may be understood as a style description of the third image sequence, the style comprising one or more of: limb/face contour, limb/face ratio, limb motion amplitude, emotion, personality, etc. Reference may be made specifically to the descriptions in the foregoing related terms, and no further description is provided herein.
This manner can be understood as follows: the data processing device converts the second style information vector of the image sequence into a second semantic tag that the user can understand, and the user processes the second semantic tag according to actual needs to obtain the first semantic tag. The data processing device converts the first semantic tag into the first style information, and then generates an image sequence that meets the user's requirements. The above processing includes at least one of: addition, deletion, modification, degree control (or understood as amplitude or level adjustment), and the like.
Optionally, the first operation includes adding, deleting, modifying, controlling the degree (or understood as amplitude, hierarchical adjustment), modifying the semantic tag weight, etc. as described above. Specifically, the data processing device may determine the first operation through an input manner of voice, text, and the like of the user, which is not limited herein.
This case can be applied to the scenario shown in fig. 3A described above. Take, as an example, the case where the data processing device is a cloud device and the third image sequence is sent by the terminal device. The flow in this case may be as shown in fig. 6 and includes steps 601 to 606.
In step 601, the terminal device sends a third image sequence to the cloud device.
The user can send the third image sequence to the cloud device through the terminal device. Correspondingly, the cloud device receives the third image sequence sent by the terminal device.
In step 602, the cloud device generates a second semantic tag for the third image sequence.
And after the cloud device acquires the third image sequence, acquiring second style information of the third image sequence. And converting the second style information into a second semantic tag.
Illustratively, taking the example that the style information is an explicit vector in the style feature vector, similar to the foregoing, a plurality of image sequences expressing "happy" may be found from the database, and the plurality of image sequences may be input to the style encoder to obtain a plurality of style feature vectors, respectively. If a common vector of a plurality of style feature vectors is determined, the style semantic tag of "happy" corresponds to the common vector (i.e., explicit vector) described above. And determining the corresponding relation between the semantic tag and the style information.
In step 603, the cloud device sends the second semantic tag to the terminal device.
After the cloud device acquires the second semantic tag, the cloud device sends the second semantic tag to the terminal device. Correspondingly, the terminal equipment receives a second semantic tag sent by the cloud equipment.
In step 604, the terminal device determines a first semantic tag based on the second semantic tag.
The processing is similar to the foregoing description of the case in which no user operation is required; here, determining the first style information based on the first operation of the user and the third image sequence is merely taken as an example.
And after the terminal equipment acquires the second semantic tags, displaying the second semantic tags to the user. And modifying the second semantic tag to the first semantic tag based on the first operation of the user.
Step 605, the terminal device sends a first semantic tag to the cloud device.
After the terminal equipment acquires the first semantic tag, the first semantic tag is sent to the cloud equipment. Correspondingly, the cloud device receives a first semantic tag sent by the terminal device.
In step 606, the cloud device determines first style information based on the first semantic tag.
After the cloud device obtains the first semantic tag, the first style information may be determined based on the first semantic tag.
Illustratively, fig. 6B is an example of a user modifying tags. The second semantic tags of the third image sequence are "emotion: excited, style: single". The user performs the following processing on the basis of the second semantic tags: deleting "excited" and keeping the emotion neutral; adjusting the action richness, changing single actions into diverse actions; and adding a gentle softness style. The natural language processing (Natural Language Processing, NLP) module in the data processing device can automatically identify and match the semantic tags of the style specified by the user, select the style information matched with the semantic tags, quantify the degree of a certain style specified by the user, and generate the edited style information after the semantic tags are fused with one another. The NLP module takes a text as input and outputs an analysis of the text (e.g., nouns, verbs, and keywords of the user's intention). The NLP module outputs the keywords expressing style in the text; for example, for the input "I want the target style to be half gentle (yin) and half firm (yang)", the NLP module can output the following keywords: gentle, firm, half each. That is, the words related to style in the descriptive text are parsed out. For another example, the user may convey the requirement by entering text or voice, for example describing the desired style as being as gentle as a mother's, from which the NLP module determines that the user wants to "add a gentle softness style" on the basis of the second semantic tags.
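The following is a deliberately simplified stand-in for the NLP module described above (plain keyword matching rather than a trained language model; the tag axes, sign conventions, and degree words are illustrative assumptions), showing how style keywords and their degree can be identified and quantified from a free-form request:

import re

STYLE_KEYWORDS = {"gentle": "gentle_firm-", "firm": "gentle_firm+",
                  "excited": "calm_excited+", "calm": "calm_excited-",
                  "diverse": "single_diverse+", "single": "single_diverse-"}
DEGREE_WORDS = {"a little": 0.3, "half": 0.5, "very": 0.9}

def parse_style_request(text):
    # Return {style axis: signed weight} extracted from a free-form request
    text = text.lower()
    degree = next((v for k, v in DEGREE_WORDS.items() if k in text), 1.0)
    edits = {}
    for word, axis in STYLE_KEYWORDS.items():
        if re.search(rf"\b{word}\b", text):
            sign = 1.0 if axis.endswith("+") else -1.0
            edits[axis[:-1]] = sign * degree
    return edits

print(parse_style_request("add a little gentle style"))   # {'gentle_firm': -0.3}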
Illustratively, take modifying the weight of a tag as an example. The data processing device displays the user interface shown in fig. 7 to the user. The user interface includes an animation preview interface and an editing interface. The style semantic tags (which may also be referred to as style tags) in the editing interface can be understood as the aforementioned second semantic tags. For example, the second semantic tags are "excited" and "single". The user may modify the second semantic tags through the editing interface. As shown in fig. 8, the user can drag "calm -> excited" from 1.0 to 0.5 by dragging the cursor 801, i.e., "excited" is removed and changed to neutral. The user can drag "single -> diverse" from 0.0 to 1.0 by dragging the cursor 802, i.e., "single" is modified into "diverse". In addition, the user can add a tag of the gentle softness style as shown in fig. 9 by clicking the add-tag control 803. Based on fig. 7 to fig. 9, the user modifies the second semantic tags (excited, single) into the first semantic tags (neutral, diverse, gentle).
In this way, by displaying semantic tags of style information, the user can edit according to the displayed tags. In practical use, for the style in an arbitrary video, it is often difficult to define the presented style accurately, and even more difficult to edit it accurately. In this embodiment, the style information is decomposed, the explicit features in the style information are semanticized, and the tagging of style information is thereby realized; semantic tags of any style specified by the user are identified, matched, and quantified to generate specific style information. In this way, both the feature tags returned for user editing in the deployment scheme shown in fig. 3A and the matching of semantic tags to the user's personalized style in the deployment scheme shown in fig. 3B become possible, so that the user's editing behavior becomes clearer.
It should be understood that the foregoing two ways of obtaining the first semantic tag based on the third image sequence are merely examples, and other ways are also possible in practical applications, and are not limited herein.
In this first case, style information may be extracted from the third image sequence/video to compensate for the user's difficulty in describing certain styles.
Second, the first style information is determined based on a second operation of the user with respect to the first interface.
In this manner, the data processing apparatus displays a first interface to the user, where the first interface includes a plurality of semantic tags. Each of the plurality of semantic tags is used to describe the style information of an image sequence explicitly. The data processing apparatus then determines a first semantic tag from the plurality of semantic tags based on a second operation of the user, and further determines the first style information according to the first semantic tag.
This case can be applied to the scenario shown in fig. 3B described above. Take the case where the data processing device is a cloud device as an example. The flow in this case may be as shown in fig. 10 and includes steps 1001 to 1005.
In step 1001, the cloud device generates a style information base and a plurality of semantic tags based on a plurality of image sequences.
The cloud device acquires a plurality of image sequences, obtains the common vectors of the style feature vectors corresponding to the image sequences, and extracts different semantic tags based on different common vectors, thereby obtaining a style information base containing a plurality of common vectors and a plurality of semantic tags.
In step 1002, the cloud device sends a plurality of semantic tags to the terminal device.
After the cloud device acquires the semantic tags, the semantic tags are sent to the terminal device. Correspondingly, the terminal equipment receives a plurality of semantic tags sent by the cloud equipment.
In step 1003, the terminal device determines the first semantic tag based on a second operation of the user with respect to the first interface.
After receiving the plurality of semantic tags sent by the cloud end device, the terminal device displays a first interface to a user, wherein the first interface comprises the plurality of semantic tags. The first semantic tag is determined based on a second operation of the first interface by the user. The second operation may specifically be a selection operation or the like.
In step 1004, the terminal device sends the first semantic tag to the cloud device.
After the terminal equipment determines the first semantic tag, the terminal equipment sends the first semantic tag to the cloud equipment. Correspondingly, the cloud device receives a first semantic tag sent by the terminal device.
In step 1005, the cloud device determines, based on the first semantic tag, first style information from the style information base.
After the cloud device receives the first semantic tag sent by the terminal device, the common vector corresponding to the first semantic tag is found from the style information base and used as the first style information.
This manner can also be understood as the data processing device displaying a plurality of semantic tags to the user, and the user directly selecting the required semantic tags from the plurality of semantic tags, or entering weights for the plurality of semantic tags in the first interface.
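A minimal sketch of this lookup is given below (the style information base is represented as a plain dictionary with assumed tags and dimensions; the weighted blending of several selected tags is one possible fusion choice, not the only one):

import numpy as np

rng = np.random.default_rng(2)
style_info_base = {          # semantic tag -> common (explicit) style vector
    "excited": rng.normal(size=128),
    "gentle":  rng.normal(size=128),
    "diverse": rng.normal(size=128),
}

def lookup_style(selected, base=style_info_base):
    # selected: dict tag -> weight chosen by the user on the first interface
    total = sum(selected.values())
    return sum(w / total * base[tag] for tag, w in selected.items())

first_style_info = lookup_style({"excited": 0.7, "gentle": 0.3})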
Third, the first style information is determined based on a third operation of the user.
In this manner, the data processing apparatus may directly receive a third operation of the user and determine the first semantic tag in response to the third operation.
The third operation may be voice, text, or the like, which is not specifically limited herein. For example, the user edits "add a gentle softness style" by voice. The data processing device may determine that the first semantic tag is "gentle softness" based on the voice input "add a gentle softness style".
Illustratively, take the case where the data processing device is a server, that is, the data processing device extracts arbitrary styles from videos offline and generates a feature library. The user only needs to upload the semantic tags of the required personalized style, for example adding a little gentle softness to an excited style. After receiving the request, the data processing device automatically identifies the style information corresponding to the "excited" and "gentle softness" tags from the style information base, edits the features, generates style information matched with the semantic tag of the target style, and completes rendering and display.
It will be appreciated that the above cases are only examples of obtaining the first style information, and other manners are possible in practical applications, and are not limited herein.
Step 402, motion information of a first image sequence is acquired.
The data processing device acquires a first image sequence. The first image sequence may be understood as an image sequence requiring replacement of style information.
Optionally, in migrating the 2D/3D animation style information to a 3D animated scene, the first image sequence is a three-dimensional image sequence. In migrating 2D/3D animation style information to a 2D animated scene, the first image sequence is a two-dimensional image sequence.
Alternatively, the first image sequence may be an image sequence extracted from a three-dimensional animation. For example, the first image sequence is extracted from the three-dimensional animation by a human gesture recognition method (e.g., OpenPose). The manner of acquiring the three-dimensional animation is not limited here; it may be captured and uploaded by the user, received from another device, selected from a database, or the like, and is not specifically limited herein.
Example 1: an example of the first image sequence is shown in fig. 11. The action content of the first image sequence is "walking".
After the data processing device acquires the first image sequence, action information of the first image sequence is extracted. The explanation of the action information may refer to the description of the related terms, and will not be repeated herein.
Optionally, the first image sequence is input to a content encoder to obtain the motion information. The training process for the content encoder will be described later, and will not be expanded here.
Step 403 generates a second image sequence based on the first style information and the motion information.
After the data processing device obtains the first semantic tags, first style information may be determined based on the first semantic tags. And generating a second image sequence based on the first style information and the motion information.
In one possible implementation, the first semantic tags are used to explicitly express the entire first style information. In this case, the first style information is determined directly based on the first semantic tag.
In another possible implementation, the first semantic tag is used to explicitly express the explicit vector in the first style information. In this case, the first semantic tag is first converted into the explicit vector, and then fused with the implicit features of the first image sequence to obtain the first style information.
Optionally, the data processing device fuses the first style information with the motion information to obtain the first motion feature. And acquiring a second image sequence based on the first motion feature.
The fusion algorithm used by the data processing device to fuse the first style information and the motion information to obtain the first motion feature may include: an adaptive instance normalization (Adaptive Instance Normalization, AdaIN) layer, a deep learning model, a statistical method, and the like.
Optionally, the data processing device inputs the first motion feature into a decoder to obtain the second image sequence. The training process for the decoder will be described later, and will not be expanded here.
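The following sketch illustrates an AdaIN-style fusion of the first style information with the motion information, followed by decoding (the layer sizes, the 1-D convolutional decoder, and the assumed joint layout of 21 joints × 3 coordinates per frame are illustrative assumptions, not values from this embodiment):

import torch
import torch.nn as nn

class AdaINFusion(nn.Module):
    def __init__(self, style_dim=256, channels=128):
        super().__init__()
        self.affine = nn.Linear(style_dim, 2 * channels)   # predicts per-channel gamma, beta
    def forward(self, content, style):
        # content: (B, C, T) motion features; style: (B, style_dim) style vector
        gamma, beta = self.affine(style).chunk(2, dim=1)
        mean = content.mean(dim=2, keepdim=True)
        std = content.std(dim=2, keepdim=True) + 1e-6
        normalized = (content - mean) / std
        return gamma.unsqueeze(2) * normalized + beta.unsqueeze(2)

fusion = AdaINFusion()
decoder = nn.Sequential(nn.Conv1d(128, 128, 3, padding=1), nn.ReLU(),
                        nn.Conv1d(128, 63, 3, padding=1))  # e.g. 21 joints x 3 coordinates

motion_feat = torch.randn(1, 128, 120)      # content features of the first image sequence
style_vec = torch.randn(1, 256)             # first style information
second_sequence = decoder(fusion(motion_feat, style_vec))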
Illustratively, take the case where the first semantic tag is acquired based on the third image sequence. The third image sequence is shown in fig. 12, and the first style information is "frustrated". Continuing with example 1 above, the second image sequence obtained in this step is shown in fig. 13: the second image sequence is a "frustrated" walk.
In this example, the flow of steps 401 to 403 may be as shown in fig. 14. The input includes a third image sequence (e.g., a 2D animated image sequence), a first image sequence (e.g., a 3D original animated image sequence), and the semantic tag of the user's personalized style (i.e., the first semantic tag). First, the 2D style information extraction module extracts the 2D stylized features of the third image sequence and converts them into 3D style information, while the semantic tags of the style are displayed and returned to the user for editing. Second, the user edits the semantic tags according to personalized requirements, and the NLP module parses the user's personalized requirements and inputs them, together with the 3D style information, into the style editing module to generate the edited style information vector (i.e., the first style information). Finally, the first image sequence is encoded to obtain a feature expression representing its content, the edited first style information is fused in, and the image sequence of the 3D target animation that conforms to the user's editing information (i.e., the second image sequence) is generated through decoding.
Optionally, after the data processing device acquires the second image sequence, the second image sequence is rendered to the virtual object to obtain the animation/video.
Alternatively, in the case where the second image sequence is a three-dimensional image sequence, the animation generated as described above is a 3D animation. In the case where the second image sequence is a two-dimensional image sequence, the animation generated as described above is a 2D animation.
In one possible implementation manner, the data processing method provided by the embodiment of the application is mainly applied to style migration scenes of an image sequence.
In another possible implementation manner, the data processing method provided by the embodiment of the application is mainly applied to an animation style migration scene.
In this embodiment of the present application, on the one hand, the style information and the action information are acquired separately, and the second image sequence is generated based on the first style information and the action information. Stylized animation editing is thereby realized without changing other characteristics of the original image sequence, and the style migration effect of the animation is improved. On the other hand, the style information is described through semantic tags; the style information is made explicit through the semantic tags, and the user realizes style migration by editing the semantic tags, so that a driving scheme for limb actions is realized. The user can analyze the style information quantitatively and qualitatively, and thus clearly knows how to describe his or her requirements quantitatively. In addition, by analyzing user requirements and drawing on massive videos, arbitrary styles can be covered, so that the embodiment of the application can generate arbitrarily customized stylized digital human animations. On another hand, the style information is extracted from the video to which the third image sequence belongs, which can compensate for the difficulty a user may have in describing certain types of style information. On yet another hand, the style information is explicitly displayed using tags, which makes the user's editing behavior clearer.
Another flowchart of the method provided by the embodiment of the present application may be shown in fig. 15. A second image sequence is obtained from the style reference animation, and its stylized feature is extracted to obtain a second stylized feature. The second stylized feature is then made explicit to obtain a display tag. The user edits the display tag to obtain a first stylized feature. The first stylized feature is migrated to the original animation to obtain the stylized animation. The content of the stylized animation is consistent with the original animation, and the style of the stylized animation is consistent with the style reference animation, so that stylized migration is realized.
The data processing method provided in the embodiment of the present application has been described above; the training process of the style encoder, the content encoder, and the decoder set forth in the embodiment shown in fig. 4 is described in detail below. On the training side, massive limb animation videos are used to construct a limb-animation stylized feature vector space that approximates the complete stylized feature vector space, which can satisfy the arbitrariness of the stylized features on the inference side.
Training process as shown in fig. 16, first, an image sequence 1 and an image sequence 2 are acquired. Wherein, image sequence 1 has style 1 and action 1. Image sequence 2 has style 2 and action 2. And secondly, respectively encoding the style and motion contents of the two input sequences by utilizing a style encoder and a motion content encoder so as to decouple the style information and the motion information. And then, the style information 1 and the action information 2 are fused through a fusion algorithm (for example AdaIN), and the style 1 action 2 is generated after decoding. And generates action 1 of style 2 by integrating the style information 2 and the action information 1. And finally, respectively supervising the reconstruction loss of the generated stylized animation on the style and the content by a discriminator, so that the finally generated stylized animation can have the maximum similarity with the target style on the premise of not losing the original motion content.
The above process can be understood as: and acquiring a first training image sequence and a second training image sequence, wherein the motion characteristics of the first training image sequence and the second training image sequence are different, and the motion characteristics comprise motion information and/or style information. Respectively inputting the first training image sequence into a style encoder and a content encoder to obtain first training style information and first training action information; and respectively inputting the second training image sequence into a style encoder and a content encoder to obtain second training style information and second training action information. Fusing the first training style information and the second training action information to obtain first training exercise characteristics; and fusing the second training style information and the first training action information to obtain second training exercise characteristics. Inputting the first training motion feature into a decoder to obtain a first reconstructed image sequence; the second training motion characteristic is input to a decoder to obtain a second reconstructed image sequence. Training with the value of the first loss function being smaller than a first threshold value to obtain a trained style encoder, content encoder and decoder, wherein the first loss function comprises a style loss function and a content loss function, the style loss function is used for representing style differences between the first reconstructed image sequence and the first training image sequence and style differences between the second reconstructed image sequence and the second training image sequence, and the content loss function is used for representing content differences between the first reconstructed image sequence and the second training image sequence and content differences between the second reconstructed image sequence and the first training image sequence.
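A compressed, runnable sketch of this training scheme is given below (the encoder, decoder, and discriminator architectures, the loss weights, and the omission of the discriminator's own update step are simplifications and assumptions; the sketch only illustrates the cross-combination and the style/content reconstruction supervision described above):

import torch
import torch.nn as nn

C, T, S = 63, 120, 256                      # pose dims per frame, frames, style dim
style_enc = nn.Sequential(nn.Flatten(), nn.Linear(C * T, S))
content_enc = nn.Sequential(nn.Conv1d(C, 128, 3, padding=1), nn.ReLU())
decoder = nn.Conv1d(128 + S, C, 3, padding=1)
disc = nn.Sequential(nn.Flatten(), nn.Linear(C * T, 1))    # style critic (its own update omitted)

def decode(style, content):
    style_map = style.unsqueeze(2).expand(-1, -1, content.shape[2])
    return decoder(torch.cat([content, style_map], dim=1))

params = (list(style_enc.parameters()) + list(content_enc.parameters())
          + list(decoder.parameters()))
opt = torch.optim.Adam(params, lr=1e-4)

seq1 = torch.randn(4, C, T)                 # stands in for style 1 / action 1
seq2 = torch.randn(4, C, T)                 # stands in for style 2 / action 2

for step in range(100):
    s1, s2 = style_enc(seq1), style_enc(seq2)
    c1, c2 = content_enc(seq1), content_enc(seq2)
    gen_12 = decode(s1, c2)                 # style 1, action 2
    gen_21 = decode(s2, c1)                 # style 2, action 1
    content_loss = (nn.functional.mse_loss(content_enc(gen_12), c2.detach())
                    + nn.functional.mse_loss(content_enc(gen_21), c1.detach()))
    style_loss = (nn.functional.mse_loss(style_enc(gen_12), s1.detach())
                  + nn.functional.mse_loss(style_enc(gen_21), s2.detach()))
    adv_loss = -(disc(gen_12).mean() + disc(gen_21).mean())   # fool the discriminator
    loss = content_loss + style_loss + 0.1 * adv_loss
    opt.zero_grad()
    loss.backward()
    opt.step()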
In this embodiment, with the style encoder, content encoder, and decoder obtained through training, 2D stylized features can be extracted from a video sequence and mapped into the 3D feature space to generate 3D style information with consistent semantics; the 3D style information can be expressed explicitly in semantic form and edited according to that semantic expression; a target style can be generated according to the user's expectations, with the corresponding style information generated algorithmically from the user's style semantic tags; and finally the generated 3D target features are migrated to the original animation sequence by the style migration module, thereby generating a virtual digital human animation sequence with the target style.
In addition, the third image sequence in the embodiment shown in fig. 4 includes one or more of the following: a facial expression sequence and a limb image sequence. For example, limb movements include global limbs, local limbs (e.g., gestures), and the like. In other words, the method provided by the embodiment of the application can also be applied to style migration of gestures, expressions, and the like. A voice-driven gesture is exemplified below. A scenario in which the method is applied to gesture style migration is shown in fig. 17.
By inputting a piece of text or voice data, the virtual digital person is driven to make gesture actions with known semantics and a rhythm consistent with the voice data. For the same piece of voice or text data, the gesture style differs from speaker to speaker, and even for the same person it varies with emotion; therefore, personalized customization and migration of styles is of great significance for enriching the diversity of gestures.
In the offline or training stage, massive 2D lecture videos are collected and a style information database is generated offline through the stylized feature extraction module, producing gesture style information that can cover almost any style. In the online use stage, the user specifies any personalized stylized tag; through analysis and quantitative representation of the user's tags, the offline-generated style database is consulted and fused to generate edited style information, and the motion sequence generated by the voice-driven gesture module is stylized into the target style.
A scene in which the method is applied to expression style migration is shown in fig. 18. This scene can also be understood as a digital human expression-base style editing and migration scene. Almost any expression style is obtained from a large number of expression videos and then migrated to the expression muscles of the digital person, driving the same digital person to make any expression style. The expression base is defined as a pre-determined coordinate set of a plurality of facial key points used to represent a certain neutral expression, and the original coefficients represent the parametric expression of a certain specific expression relative to the neutral expression, such as a smile or the degree of mouth opening relative to the neutral expression. The whole process of fig. 18 is therefore as follows: first, according to a person's expression and a preset expression base, the original coefficients corresponding to the expression are calculated through an expression network; the coefficients corresponding to various expressions in videos are acquired through the same set of expression bases, and the user controls the expression to be generated by editing the coefficients.
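The expression-base representation can be sketched as follows (the number of key points, the number of expression bases, and the linear blend are assumptions for illustration): a specific expression is the neutral key-point set plus coefficient-weighted expression bases, so editing the coefficients edits the generated expression.

import numpy as np

rng = np.random.default_rng(3)
neutral = rng.normal(size=(68, 3))           # 68 facial key points, 3D coordinates (placeholder)
bases = rng.normal(size=(52, 68, 3))         # 52 expression bases (e.g. smile, jaw open)

def apply_expression(coeffs, neutral=neutral, bases=bases):
    # coeffs: (52,) original coefficients of a specific expression
    return neutral + np.tensordot(coeffs, bases, axes=1)

coeffs = np.zeros(52)
coeffs[0] = 0.8                              # e.g. strengthen the "smile" component
stylized_face = apply_expression(coeffs)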
In this embodiment, on the one hand, the stylized features of gestures/expressions can be extracted from video sequences, which greatly enriches the style diversity; on the other hand, the gesture/expression styles extracted from videos are explicitly tagged, which facilitates describing the gesture/expression styles semantically with the user, and further enables the subsequent matching and fusion of tags and style information.
Having described the data processing method in the embodiment of the present application, the following describes the data processing apparatus in the embodiment of the present application, referring to fig. 19, one embodiment of the data processing apparatus in the embodiment of the present application includes:
an acquisition unit 1901 for acquiring first style information;
an acquisition unit 1901, configured to acquire motion information of the first image sequence;
the generating unit 1902 is configured to generate a second image sequence based on the first style information and the motion information, where the second image sequence is the same as the first image sequence in motion type, and the second image sequence has the first style information.
Optionally, the data processing apparatus may further include: a display unit 1903, configured to display a first interface to a user, where the first interface includes a plurality of semantic tags, the plurality of semantic tags are used to describe different style information of different image sequences, and the plurality of semantic tags are in one-to-one correspondence with the style information; the obtaining unit 1901 is specifically configured to determine a first semantic tag from the plurality of semantic tags based on a second operation of the user, and to convert the first semantic tag into the first style information.
Optionally, the data processing apparatus may further include: and a rendering unit 1904 for rendering the second image sequence to the virtual object to obtain an animation.
In this embodiment, the operations performed by the units in the data processing apparatus are similar to those described in the embodiments shown in fig. 1 to 18, and are not described here again.
In this embodiment, the acquisition unit 1901 acquires style information and motion information separately, and the generation unit 1902 generates a second image sequence based on the first style information and the motion information. The stylized animation editing is realized under the condition that other characteristics of the original image sequence are not changed, and the style migration effect of the animation is improved.
Referring to fig. 20, another data processing apparatus provided in the present application is a schematic structural diagram. The data processing device may include a processor 2001, a memory 2002, and a communication port 2003. The processor 2001, memory 2002 and communication port 2003 are interconnected by wires. Wherein program instructions and data are stored in memory 2002.
The memory 2002 stores therein program instructions and data corresponding to the steps executed by the data processing apparatus in the respective embodiments shown in fig. 1 to 18.
A processor 2001 for executing steps performed by the data processing device as described in any of the embodiments shown in the previous figures 1 to 18.
The communication port 2003 may be used for receiving and transmitting data for performing the steps associated with acquisition, transmission, reception in any of the embodiments shown in fig. 1-18 described above.
In one implementation, the data processing apparatus may include more or fewer components than those of FIG. 20, which is for exemplary purposes only and not limiting.
Embodiments of the present application also provide a computer-readable storage medium storing one or more computer-executable instructions that, when executed by a processor, perform a method as described in the foregoing embodiments for a possible implementation of a data processing apparatus.
Embodiments of the present application also provide a computer program product (or computer program) storing one or more computer-executable instructions that, when executed by a processor, perform the method described in the foregoing possible implementations of the data processing device.
The embodiment of the application also provides a chip system, which comprises at least one processor and is used for supporting the terminal equipment to realize the functions involved in the possible realization mode of the data processing equipment. Optionally, the chip system further comprises an interface circuit providing program instructions and/or data to the at least one processor. In one possible design, the system on a chip may further include a memory to hold the necessary program instructions and data for the terminal device. The chip system can be composed of chips, and can also comprise chips and other discrete devices.
It will be clear to those skilled in the art that, for convenience and brevity of description, specific working procedures of the above-described systems, apparatuses and units may refer to corresponding procedures in the foregoing method embodiments, which are not repeated herein.
In the several embodiments provided in this application, it should be understood that the disclosed systems, apparatuses, and methods may be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative, e.g., the division of the units is merely a logical function division, and there may be additional divisions when actually implemented, e.g., multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. Alternatively, the coupling or direct coupling or communication connection shown or discussed with each other may be an indirect coupling or communication connection via some interfaces, devices or units, which may be in electrical, mechanical or other form.
The units described as separate units may or may not be physically separate, and units shown as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, each functional unit in each embodiment of the present application may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units may be implemented in hardware or in software functional units.
The integrated units, if implemented in the form of software functional units and sold or used as stand-alone products, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present application may be embodied in essence or a part contributing to the prior art or all or part of the technical solution in the form of a software product stored in a storage medium, including several instructions to cause a computer device (which may be a personal computer, a server, or a network device, etc.) to perform all or part of the steps of the methods described in the embodiments of the present application. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a read-only memory (ROM), a random access memory (RAM, random access memory), a magnetic disk, or an optical disk, or other various media capable of storing program codes.

Claims (23)

1. A method of data processing, the method comprising:
acquiring first style information;
acquiring action information of a first image sequence;
generating a second image sequence based on the first style information and the action information, wherein the second image sequence is the same as the first image sequence in action type, and the second image sequence has the first style information.
2. The method of claim 1, wherein prior to the obtaining the first style information, the method further comprises:
acquiring a third image sequence;
the obtaining the first style information includes:
and acquiring the first style information based on the third image sequence.
3. The method of claim 2, wherein the obtaining the first style information based on the third image sequence comprises:
extracting second style information of the third image sequence;
the first style information is determined based on the second style information.
4. The method of claim 3, wherein the determining the first style information based on the second style information comprises:
and taking the second style information as the first style information.
5. The method of claim 3, wherein the determining the first style information based on the second style information comprises:
displaying a second semantic tag to the user, wherein the second semantic tag is used for describing the second style information;
modifying the second semantic tags to first semantic tags based on a first operation of the user, the first semantic tags being used to describe the first style information;
the first style information is determined based on the first semantic tag.
6. The method according to any one of claims 2 to 5, wherein the third image sequence is a two-dimensional animated image sequence, the second style information is two-dimensional style information, the first style information is three-dimensional style information, and the first image sequence and the second image sequence are three-dimensional animated image sequences.
7. The method according to claim 1, wherein the method further comprises:
displaying a first interface to a user, wherein the first interface comprises a plurality of semantic tags, the semantic tags are used for describing different style information of different image sequences, and the semantic tags are in one-to-one correspondence with the style information;
The obtaining the first style information includes:
determining a first semantic tag from the plurality of semantic tags based on a second operation of the user;
the first style information is determined based on the first semantic tag.
8. The method of any of claims 1 to 7, wherein the generating a second image sequence based on the first style information and the action information comprises:
fusing the first style information and the action information to obtain a first motion characteristic;
the second image sequence is acquired based on the first motion feature.
9. The method of any one of claims 1 to 8, wherein the action information comprises one or more of: facial expression sequence and limb image sequence.
10. The method according to any one of claims 1 to 9, further comprising:
and rendering the second image sequence to a virtual object to obtain animation.
11. A data processing apparatus, characterized in that the data processing apparatus comprises:
the acquisition unit is used for acquiring the first style information;
the acquisition unit is further used for acquiring action information of the first image sequence;
And the generating unit is used for generating a second image sequence based on the first style information and the action information, wherein the second image sequence is the same as the first image sequence in action type, and the second image sequence has the first style information.
12. The apparatus according to claim 11, wherein the acquisition unit is further configured to acquire a third image sequence;
the acquiring unit is specifically configured to acquire the first style information based on the third image sequence.
13. The device according to claim 12, wherein the acquisition unit is in particular configured to extract second style information of the third image sequence;
the acquiring unit is specifically configured to determine the first style information based on the second style information.
14. The device according to claim 13, wherein the obtaining unit is specifically configured to use the second style information as the first style information.
15. The device according to claim 13, wherein the obtaining unit is configured to display a second semantic tag to a user, the second semantic tag being configured to describe the second style information;
The acquiring unit is specifically configured to modify the second semantic tag into a first semantic tag based on a first operation of the user, where the first semantic tag is used to describe the first style information;
the acquiring unit is specifically configured to determine the first style information based on the first semantic tag.
16. The apparatus of any one of claims 12 to 15, wherein the third image sequence is a two-dimensional animated image sequence, the second style information is two-dimensional style information, the first style information is three-dimensional style information, and the first image sequence and the second image sequence are three-dimensional animated image sequences.
17. The apparatus of claim 11, wherein the data processing apparatus further comprises:
the display unit is used for displaying a first interface to a user, wherein the first interface comprises a plurality of semantic tags, the semantic tags are used for describing different style information of different image sequences, and the semantic tags are in one-to-one correspondence with the style information;
the acquiring unit is specifically configured to determine a first semantic tag from the plurality of semantic tags based on a second operation of the user;
The acquiring unit is specifically configured to determine the first style information based on the first semantic tag.
18. The device according to any one of claims 11 to 17, wherein the generating unit is configured to fuse the first style information with the motion information to obtain a first motion feature;
the generating unit is specifically configured to acquire the second image sequence based on the first motion feature.
19. The apparatus according to any one of claims 11 to 18, wherein the action information comprises one or more of: facial expression sequence and limb image sequence.
20. The apparatus according to any one of claims 11 to 19, wherein the data processing apparatus further comprises:
and the rendering unit is used for rendering the second image sequence to the virtual object to obtain the animation.
21. A data processing apparatus, comprising: a processor coupled to a memory for storing a program or instructions that, when executed by the processor, cause the data processing apparatus to perform the method of any of claims 1 to 10.
22. A computer storage medium comprising computer instructions which, when run on a data processing apparatus, cause the data processing apparatus to perform the method of any of claims 1 to 10.
23. A computer program product, characterized in that the computer program product, when run on a computer, causes the computer to perform the method of any of claims 1 to 10.
CN202211202267.XA 2022-09-29 2022-09-29 Data processing method and related equipment Pending CN117808934A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202211202267.XA CN117808934A (en) 2022-09-29 2022-09-29 Data processing method and related equipment
PCT/CN2023/103012 WO2024066549A1 (en) 2022-09-29 2023-06-28 Data processing method and related device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211202267.XA CN117808934A (en) 2022-09-29 2022-09-29 Data processing method and related equipment

Publications (1)

Publication Number Publication Date
CN117808934A true CN117808934A (en) 2024-04-02

Family

ID=90433987

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211202267.XA Pending CN117808934A (en) 2022-09-29 2022-09-29 Data processing method and related equipment

Country Status (2)

Country Link
CN (1) CN117808934A (en)
WO (1) WO2024066549A1 (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN118283201A (en) * 2024-06-03 2024-07-02 上海蜜度科技股份有限公司 Video synthesis method, system, storage medium and electronic equipment

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2018132855A (en) * 2017-02-14 2018-08-23 国立大学法人電気通信大学 Image style conversion apparatus, image style conversion method and image style conversion program
CN110909790A (en) * 2019-11-20 2020-03-24 Oppo广东移动通信有限公司 Image style migration method, device, terminal and storage medium
CN110956654B (en) * 2019-12-02 2023-09-19 Oppo广东移动通信有限公司 Image processing method, device, equipment and storage medium
CN111667399B (en) * 2020-05-14 2023-08-25 华为技术有限公司 Training method of style migration model, video style migration method and device
CN112164130B (en) * 2020-09-07 2024-04-23 北京电影学院 Video-animation style migration method based on depth countermeasure network
CN112967174B (en) * 2021-01-21 2024-02-09 北京达佳互联信息技术有限公司 Image generation model training, image generation method, image generation device and storage medium

Also Published As

Publication number Publication date
WO2024066549A1 (en) 2024-04-04

Similar Documents

Publication Publication Date Title
KR102503413B1 (en) Animation interaction method, device, equipment and storage medium
CN112215927B (en) Face video synthesis method, device, equipment and medium
CN108961369B (en) Method and device for generating 3D animation
US11741668B2 (en) Template based generation of 3D object meshes from 2D images
EP3889912B1 (en) Method and apparatus for generating video
CN115205949B (en) Image generation method and related device
WO2023284435A1 (en) Method and apparatus for generating animation
CN111414506B (en) Emotion processing method and device based on artificial intelligence, electronic equipment and storage medium
CN111598979A (en) Method, device and equipment for generating facial animation of virtual character and storage medium
CN115131849A (en) Image generation method and related device
CN116863003A (en) Video generation method, method and device for training video generation model
Soliman et al. Artificial intelligence powered Metaverse: analysis, challenges and future perspectives
CN116611496A (en) Text-to-image generation model optimization method, device, equipment and storage medium
CN114219892A (en) Intelligent driving method of three-dimensional model
WO2024066549A1 (en) Data processing method and related device
Cui et al. Virtual human: A comprehensive survey on academic and applications
CN117292031A (en) Training method and device for 3D virtual digital lip animation generation model
CN117453880A (en) Multi-mode data processing method and device, electronic equipment and storage medium
CN112330780A (en) Method and system for generating animation expression of target character
CN117011875A (en) Method, device, equipment, medium and program product for generating multimedia page
Sra et al. Deepspace: Mood-based image texture generation for virtual reality from music
CN117152843B (en) Digital person action control method and system
CN118155023B (en) Text graph and model training method and device, electronic equipment and storage medium
CN114666307B (en) Conference interaction method, conference interaction device, equipment and storage medium
CN118644596A (en) Face key point moving image generation method and related equipment

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication