
CN117409459A - Image generation method and related device - Google Patents


Info

Publication number
CN117409459A
Authority
CN
China
Prior art keywords
image
training
living
noise
image generation
Prior art date
Legal status
Pending
Application number
CN202311340128.8A
Other languages
Chinese (zh)
Inventor
张克越
胡澄洋
姚太平
丁守鸿
马利庄
Current Assignee
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Priority to CN202311340128.8A
Publication of CN117409459A
Legal status: Pending


Classifications

    • G06V 40/172: Human faces; classification, e.g. identification
    • G06N 3/045: Neural network architectures; combinations of networks
    • G06N 3/0475: Neural network architectures; generative networks
    • G06V 10/30: Image preprocessing; noise filtering
    • G06V 10/774: Machine learning; generating sets of training patterns, e.g. bagging or boosting
    • G06V 10/82: Image or video recognition using neural networks
    • G06V 40/168: Human faces; feature extraction; face representation
    • G06V 40/45: Spoof detection; detection of the body part being alive

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • General Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • Human Computer Interaction (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Image Analysis (AREA)

Abstract

The embodiment of the application discloses an image generation method and a related device, which can be applied to scenes such as artificial intelligence, intelligent transportation, and assisted driving. In the method, image reference information is acquired, and a target non-living face image is generated from the image reference information through an image generation model. Because the target non-living face image can be generated by the image generation model alone, from attack information indicating non-living attack features and image base information indicating base features, there is no need to recruit large numbers of real people, which speeds up the acquisition of large quantities of target non-living face images. Moreover, based on different image base information for different scenes, target non-living face images corresponding to each scene can be generated directly, improving the diversity of the acquired images. The present application can therefore rapidly generate a large number of diverse non-living face images to serve as training samples for a liveness detection model.

Description

Image generation method and related device
Technical Field
The present disclosure relates to the field of artificial intelligence, and in particular, to an image generation method and related apparatus.
Background
Face detection systems are now widely used in people's daily production and life, supporting services such as identity authentication and face-based payment. Face liveness detection is an important link in such systems: by detecting whether a recognized face is a living face, it safeguards the security of the face detection system and prevents attacks with non-living faces.
Face liveness detection is generally implemented with a liveness detection model, that is, the liveness detection model detects whether the face in a face image is a living face; the model is trained on a large number of training samples (typically images containing non-living faces). In the related art, large numbers of real people are usually recruited to participate in actual shooting to obtain training samples for the liveness detection model; however, this way of acquiring training samples is time-consuming and laborious, makes it difficult to obtain many samples quickly, and, because actual shooting scenes are limited, the acquired samples are often limited and insufficiently diverse, which affects the performance of the trained liveness detection model.
Disclosure of Invention
The embodiment of the application provides an image generation method and a related device, which can quickly generate a large number of diversified non-living face images.
A first aspect of the present application provides an image generation method, the method comprising:
acquiring image reference information; the image reference information comprises attack information and image basic information, wherein the attack information is used for indicating non-living attack characteristics in a non-living face image to be generated, and the image basic information is used for indicating basic characteristics in the non-living face image to be generated;
generating a target non-living face image according to the image reference information through an image generation model;
the image generation model is trained based on training reference information, wherein the training reference information comprises training image base information and training attack information; the training image base information is obtained by recognizing a training non-living face image and is used for indicating base features in the training non-living face image; the training attack information is used for indicating training non-living attack features in the training non-living face image, and the training non-living attack features are the key features based on which the training non-living face image is recognized as containing a non-living face.
A second aspect of the present application provides an image generating apparatus, the apparatus comprising:
The information acquisition module is used for acquiring image reference information; the image reference information comprises attack information and image basic information, wherein the attack information is used for indicating non-living attack characteristics in a non-living face image to be generated, and the image basic information is used for indicating basic characteristics in the non-living face image to be generated;
the image generation module is used for generating a target non-living body face image according to the image reference information through an image generation model;
the image generation model is trained based on training reference information, wherein the training reference information comprises training image base information and training attack information; the training image base information is obtained by recognizing a training non-living face image and is used for indicating base features in the training non-living face image; the training attack information is used for indicating training non-living attack features in the training non-living face image, and the training non-living attack features are the key features based on which the training non-living face image is recognized as containing a non-living face.
A third aspect of the present application provides a computer device comprising a processor and a memory:
The memory is used for storing a computer program;
the processor is configured to execute the steps of the image generation method according to the first aspect described above according to the computer program.
A fourth aspect of the present application provides a computer readable storage medium for storing a computer program for executing the steps of the image generation method of the first aspect described above.
A fifth aspect of the present application provides a computer program product or computer program comprising computer instructions stored in a computer readable storage medium. The processor of the computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions, so that the computer device performs the steps of the image generation method described in the first aspect.
From the above technical solutions, the embodiments of the present application have the following advantages:
in the embodiment of the application, image reference information comprising attack information and image base information is first acquired; the attack information is used for indicating non-living attack features in the non-living face image to be generated, and the image base information is used for indicating base features in the non-living face image to be generated. Then, a target non-living face image is generated from the acquired image reference information through an image generation model; the image generation model is trained based on training reference information, wherein the training reference information comprises training image base information and training attack information; the training image base information is obtained by recognizing the training non-living face image and is used for indicating base features in the training non-living face image; the training attack information is used for indicating training non-living attack features in the training non-living face image, where the training non-living attack features are the key features based on which the training non-living face image is recognized as containing a non-living face.
The target non-living face image generated in this way relies only on the image generation model, which produces it from the attack information indicating non-living attack features and the image base information indicating base features; no large-scale recruitment of real participants is required, so large quantities of target non-living face images can be acquired quickly. Meanwhile, based on different image base information and attack information for different scenes, target non-living face images corresponding to each scene can be generated, improving the diversity of the acquired images. The method and the device can therefore quickly generate a large number of diverse non-living face images. Training a liveness detection model with these target non-living face images as training samples can greatly improve the performance of the resulting model and thereby the accuracy of liveness detection for faces and the like.
Drawings
Fig. 1 is a scene structure diagram of an image generating method according to an embodiment of the present application;
fig. 2 is a flowchart of an image generating method according to an embodiment of the present application;
FIG. 3 is a flowchart of a training step of an image generation model according to an embodiment of the present application;
Fig. 4 is a schematic diagram of acquiring training reference information according to an embodiment of the present application;
fig. 5 is a schematic diagram of acquiring training noise data according to an embodiment of the present application;
fig. 6 is a schematic diagram of acquiring a reconstructed training non-living face image according to an embodiment of the present application;
fig. 7 is a schematic structural diagram of an image generating apparatus according to an embodiment of the present application;
fig. 8 is a schematic structural diagram of a terminal device provided in an embodiment of the present application;
fig. 9 is a schematic structural diagram of a server according to an embodiment of the present application.
Detailed Description
In order to make the present application solution better understood by those skilled in the art, the following description will clearly and completely describe the technical solution in the embodiments of the present application with reference to the accompanying drawings in the embodiments of the present application, and it is apparent that the described embodiments are only some embodiments of the present application, not all embodiments. All other embodiments, which can be made by one of ordinary skill in the art without undue burden from the present disclosure, are within the scope of the present disclosure.
The terms "first," "second," "third," "fourth" and the like in the description and in the claims of this application and in the above-described figures, if any, are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate such that embodiments of the present application described herein may be implemented in sequences other than those illustrated or otherwise described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
In the related art, face liveness detection is often implemented based on a liveness detection model, which requires a large number of training samples (e.g., images containing non-living faces) for training. Suppose the approach of recruiting large numbers of real people to participate in actual shooting is adopted to obtain images containing non-living faces as training samples for the liveness detection model. On the one hand, gathering real people is time-consuming and laborious, shooting images containing non-living faces is too slow, and it is difficult to obtain many training samples quickly; this lengthens the training time of the liveness detection model, which then cannot be trained within a short period. On the other hand, when the recruited real people participate in actual shooting, the shooting scenes are often limited, so the acquired training samples tend to be limited, which easily degrades the training effect of the liveness detection model.
In view of the above, an image generation method and a related apparatus are provided in the present application. In the method, image reference information is acquired; the image reference information comprises attack information and image base information, wherein the attack information is used for indicating non-living attack features in a non-living face image to be generated, and the image base information is used for indicating base features in the non-living face image to be generated; a target non-living face image is generated from the image reference information through an image generation model; the image generation model is trained based on training reference information, wherein the training reference information comprises training image base information and training attack information; the training image base information is obtained by recognizing the training non-living face image and is used for indicating base features in the training non-living face image; the training attack information is used for indicating training non-living attack features in the training non-living face image, where the training non-living attack features are the key features based on which the training non-living face image is recognized as containing a non-living face. Through the image generation model, the present application can quickly generate a large number of diverse target non-living face images from the image reference information, so that a liveness detection model trained on these images achieves a greatly improved detection effect.
The embodiment of the application provides an image generation method, which relates to the field of artificial intelligence. Artificial Intelligence (AI) is a theory, method, technology, and application system that uses digital computers, or machines controlled by digital computers, to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use knowledge to obtain optimal results. In other words, artificial intelligence is a comprehensive technology of computer science that attempts to understand the essence of intelligence and to produce a new kind of intelligent machine that can react in a way similar to human intelligence. Artificial intelligence studies the design principles and implementation methods of various intelligent machines, giving machines the functions of perception, reasoning, and decision-making.
Artificial intelligence technology is a comprehensive discipline covering a wide range of fields, involving both hardware-level and software-level technology. Basic artificial intelligence technologies generally include, for example, sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technology, pre-training model technology, operation/interaction systems, and mechatronics. The pre-training model, also called a large model or foundation model, can be fine-tuned and then widely applied to downstream tasks in all major directions of artificial intelligence. Artificial intelligence software technology mainly includes directions such as computer vision technology, speech processing technology, natural language processing technology, and machine learning/deep learning. The image generation method provided by the embodiment of the application mainly relates to the computer vision and machine learning directions of artificial intelligence technology.
Computer Vision (CV) is a science that studies how to make machines "see"; more specifically, it uses cameras and computers in place of human eyes to perform machine vision tasks such as recognition and measurement on targets, with further graphic processing so that the result becomes an image more suitable for human observation or for transmission to instruments for detection. As a scientific discipline, computer vision studies related theories and technologies in an attempt to build artificial intelligence systems that can acquire information from images or multidimensional data. Large-model technology has brought important innovation to the development of computer vision: pre-trained models in the vision field, such as Swin Transformer, ViT, V-MoE, and MAE, can be quickly and widely applied to specific downstream tasks through fine-tuning. Computer vision technology typically includes image processing, image recognition, image semantic understanding, image retrieval, OCR, video processing, video semantic understanding, video content/behavior recognition, three-dimensional object reconstruction, 3D technology, virtual reality, augmented reality, and simultaneous localization and mapping, as well as common biometric technologies such as face recognition and fingerprint recognition.
Machine Learning (ML) is a multi-field interdisciplinary subject involving probability theory, statistics, approximation theory, convex analysis, algorithmic complexity theory, and other disciplines. It specializes in studying how computers simulate or implement human learning behavior to acquire new knowledge or skills and to reorganize existing knowledge structures so as to continuously improve their own performance. Machine learning is the core of artificial intelligence and the fundamental way to make computers intelligent; it is applied throughout all areas of artificial intelligence. Machine learning and deep learning typically include techniques such as artificial neural networks, belief networks, reinforcement learning, transfer learning, inductive learning, and learning from demonstrations. The pre-training model is the latest development in deep learning and integrates the above techniques.
Next, an execution subject of the image generation method provided in the embodiment of the present application will be specifically described.
The execution subject of the image generation method provided in the embodiment of the present application may be a computer device with image processing capability, specifically a terminal device or a server. As examples, the terminal device may include, but is not limited to, a mobile phone, a desktop computer, a tablet computer, a notebook computer, a palmtop computer, an intelligent voice interaction device, a smart home appliance, a vehicle-mounted terminal, an aircraft, and the like. The server may be an independent physical server, or a server cluster or distributed system composed of multiple physical servers. In addition, the server may also be a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communications, middleware services, domain name services, security services, CDNs, and big data and artificial intelligence platforms. Referring specifically to fig. 1, fig. 1 illustrates an exemplary scene architecture diagram of the image generation method, which includes the above-described forms of terminal devices and servers.
In addition, the image generating method provided by the embodiment of the application may also be cooperatively executed by the terminal device and the server. The terminal and the server may be directly or indirectly connected through wired or wireless communication, which is not limited herein. Therefore, in the embodiment of the present application, the implementation main body for executing the technical solution of the present application is not limited.
Next, the image generation method provided in the embodiment of the present application will be specifically described with the server as an execution subject.
Referring to fig. 2, a flowchart of an image generating method according to an embodiment of the present application is shown. The image generation method shown in fig. 2 includes the steps of:
s201: image reference information is acquired.
In the embodiment of the application, the image reference information refers to related information to be referred to for generating the non-living face image, and the image reference information may be used to indicate image features in the non-living face image to be generated.
The manner of acquiring the image reference information is not limited in this application. For example, the image reference information may be input manually, so that the server obtains it directly; alternatively, the server may extract at least one piece of image feature information from a pre-constructed image feature information base, and the extracted image feature information, which may include attack information and base information, serves as the image reference information.
In one possible embodiment of the present application, the image reference information may include attack information and image base information. The attack information is used to indicate non-living attack features in the non-living face image to be generated. The image base information is used to indicate base features in the non-living face image to be generated. The image reference information is information in text form; for example, the attack information is text describing the attack features, and the image base information is text describing the base features.
A non-living attack feature refers to a disguise feature carried by a disguised living body when it attacks the liveness detection system. For example, the disguise may be a hat, glasses, a mask, or the like, which is not limited in this application. Base features refer to face features in the non-living face image, scene features indicated by the image background, and so on. For example, they may be the gender, angle, and number of the faces in the non-living face image, or a white wall serving as the image background, which is not limited in this application.
S202: and generating a target non-living human face image according to the image reference information through an image generation model.
The image reference information is input into the image generation model, so that the image generation model can extract the non-living attack features and the base features indicated by the image reference information and generate a target non-living face image that has those features and thus meets the requirements.
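As an illustration only, generating such an image from text-form reference information could be sketched with an off-the-shelf latent diffusion pipeline as below. The patent does not name a concrete library or checkpoint; the pipeline class, model identifier, and prompt strings are assumptions.

```python
# Hypothetical sketch: generate a non-living (spoof) face image from
# text-form image reference information with a latent diffusion pipeline.
# The checkpoint name and prompts are illustrative, not from the patent.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",   # assumed base model
    torch_dtype=torch.float16,
).to("cuda")

attack_info = "paper_glasses"                 # non-living attack feature
image_base_info = "1boy, front_face, solo"    # base features
prompt = f"{attack_info}, {image_base_info}"  # image reference information

image = pipe(prompt, num_inference_steps=50).images[0]
image.save("target_non_living_face.png")
```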
In an embodiment of the present application, the image generation model is trained based on training reference information. The training reference information may include training image base information and training attack information. The training image basic information is obtained by identifying the training non-living face image and is used for indicating basic characteristics in the training non-living face image. The training attack information is used to indicate training non-living attack features in the training non-living face image. The training non-living body attack feature is a key feature according to which the training non-living body face image is recognized when the non-living body face is included, that is, the training non-living body face image including the non-living body face can be recognized based on the training non-living body attack feature in the training non-living body face image.
In one possible embodiment of the present application, S202 may be specifically subdivided into the following steps:
A1: and carrying out coding processing on the image reference information to obtain the image reference text characteristics.
As an example, an encoder may be used to encode the image reference information to obtain the image reference text feature corresponding to the image reference information. The present application does not limit the encoding method.
For example, the encoder may be the text encoder of a Contrastive Language-Image Pre-training (CLIP) model, which is not limited in this application.
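Under that assumption, a minimal sketch of the encoding step with a CLIP text encoder might look as follows (the checkpoint name is illustrative):

```python
# Minimal sketch: encode text-form image reference information into
# image-reference text features with a CLIP text encoder.
from transformers import CLIPTokenizer, CLIPTextModel

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14")

reference_info = "paper_glasses, 1boy, front_face, solo"
tokens = tokenizer(reference_info, padding="max_length",
                   truncation=True, return_tensors="pt")
# One embedding per token; later used as the conditioning y in cross-attention.
text_features = text_encoder(**tokens).last_hidden_state
```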
A2: and generating a target non-living human face image according to the image reference text characteristics and the noise data through an image generation model.
In the embodiment of the application, the image generation model may be a noise estimation model: the trained image generation model has learned the interference that noise data introduces into the non-living face image to be generated, so that by introducing noise data, the image generation model can denoise it to generate the target non-living face image.
As an example, when the image generation model is a latent variable diffusion model, the noise data may be noise features randomly sampled from a normal distribution. When the image generation model is a common diffusion model, the noise data may be a noise image composed of randomly sampled noise points, or a noise image obtained by randomly adding noise points to any image. When the image generation model is a generator in a generative adversarial network, the noise data may be noise features randomly sampled from a normal distribution, a directly acquired noise-added image, or a noise image obtained by randomly adding noise points to any image; this is not limited in this application.
Therefore, the image generation model can directly generate the target non-living face image from the noise data and the image reference text features corresponding to the image reference information, so that a large number of training samples for the liveness detection model can be acquired in a short time.
In order to facilitate understanding, a process of generating a target non-living face image using the image generation model as a diffusion model will be specifically described. It should be noted that, the type of diffusion model is not limited in the present application, and may be a common diffusion model or a latent variable diffusion model.
In one possible embodiment of the present application, A2 may be subdivided into the following steps:
a1: and performing N times of denoising processing based on the noise data through the image generation model to obtain target image data.
In the embodiment of the present application, N is an integer greater than or equal to 1. That is, at least one denoising process is required for the noise data to obtain the target image data. The target image data refers to the final denoising result after denoising the noise data for N times, and the target image data does not contain noise any more.
a2: and determining the target non-living human face image according to the target image data.
It is to be understood that in the process of performing the denoising process for the noise data N times, the process of performing the denoising process for each time may be the same, and for ease of understanding, the process of performing the ith denoising process for the noise data will be described in detail below.
The ith denoising process may specifically include the following steps 1 to 2:
step 1: and determining the ith noise estimated data according to the denoising result of the ith-1 denoising process and the image reference text characteristics through an image generation model.
Based on the above description, the image reference text features are obtained by encoding the image reference information and can be used to characterize the text features of the acquired image reference information. In this application, the image reference text features are introduced into the image generation model so that the denoising results obtained while denoising the noise data gradually approach the image reference text features. That is, the image reference text features guide the image generation model to finally generate the target non-living face image, which contains the non-living attack features and the base features indicated by the image reference information, thereby meeting the image generation requirements.
In this embodiment of the present application, the ith noise estimation data refers to noise data estimated by the image generation model and to be removed when the ith denoising process is performed on the noise data.
When the i-th denoising process is performed on the noise data through the image generation model, the i-th noise estimation data is determined from the denoising result of the (i-1)-th denoising process (i.e., the result of the previous denoising pass) and the image reference text features.
Step 2: removing the i-th noise estimation data from the denoising result of the (i-1)-th denoising process to obtain the denoising result of the i-th denoising process.
Through the image generation model, the i-th noise estimation data is removed from the denoising result of the (i-1)-th denoising process, i.e., from the result of the previous denoising pass, to obtain the denoising result of the i-th denoising process.
Here, i in the i-th denoising process is an integer greater than or equal to 1 and less than or equal to N. When i equals 1, i.e., when the noise data is denoised for the first time, the denoising result of the (i-1)-th denoising process is the originally input noise data; when i equals N, i.e., when the noise data requires only one denoising pass or undergoes its final denoising pass, the denoising result of the i-th denoising process is the target image data.
It can be seen that, by performing N denoising passes on the noise data through the image generation model, the fully denoised noise data becomes the target image data, from which the image generation model can obtain a target non-living face image that meets the requirements.
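A schematic sketch of this N-pass loop follows. The `model` interface and the `alpha`/`sigma` coefficients stand in for a concrete network and noise schedule; they are assumptions, not the patent's specification.

```python
# Schematic sketch of N denoising passes: at pass i the model predicts the
# i-th noise estimation data from the previous result and the conditioning
# y, and that estimate is removed. Coefficients are placeholder assumptions.
import torch

def denoise(model, noise_data: torch.Tensor, y: torch.Tensor,
            alpha, sigma, N: int) -> torch.Tensor:
    x = noise_data  # "denoising result of the 0-th pass" is the input noise
    for i in range(1, N + 1):
        eps_i = model(x, y, step=i)            # i-th noise estimation data
        x = (x - sigma[i] * eps_i) / alpha[i]  # remove the estimated noise
    return x  # after the N-th pass this is the target image data
```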
In one possible embodiment of the present application, the image generation model may include a base diffusion structure and a bypass adaptive structure, and then step 1 may be further subdivided into the following steps:
step 11: and determining first sub-noise estimated data according to the denoising result of the ith-1 st denoising process, the image reference text characteristics and the number of turns corresponding to the current denoising process through a basic diffusion structure in the image generation model.
In this embodiment of the present application, the basic diffusion structure is a basic model structure of the diffusion model, and the first sub-noise estimation data to be removed in the i-th execution of the denoising process is determined based on the denoising result of the i-1-th denoising process, the image reference text feature, and the number of rounds corresponding to the current denoising process.
The diffusion model with the basic diffusion structure is a pre-trained model, comprises a large number of model parameters, and has good model performance. The model with rich feature space and strong generating capacity is trained on a large scale, and can be applied to general image generating tasks.
Step 12: determining the second sub-noise estimation data, through the bypass adaptive structure in the image generation model, from the denoising result of the (i-1)-th denoising process and the image reference text features.
In the embodiment of the application, the bypass adaptive structure is a structure added to the diffusion model for model fine-tuning, which can be performed through low-rank adaptation. Based on the denoising result of the (i-1)-th denoising process and the image reference text features, the second sub-noise estimation data to be removed in the i-th denoising pass can be determined.
Step 13: determining the i-th noise estimation data from the first sub-noise estimation data and the second sub-noise estimation data.
Based on the first sub-noise estimation data determined by the basic diffusion structure and the second sub-noise estimation data determined by the bypass adaptive structure, the total i-th noise estimation data to be removed in the i-th denoising pass can be estimated by combining the two.
Therefore, since the image generation model comprises a basic diffusion structure and a bypass adaptive structure, and the model parameters of the bypass adaptive structure are low-dimensional, the computation required for model training can be reduced and the training speed of the image generation model improved.
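A minimal sketch of how such a dual-path layer could combine the two estimates, assuming a LoRA-style linear layer (shapes and rank are illustrative):

```python
# Minimal sketch: a layer whose output sums the frozen base path W0
# (basic diffusion structure) and a low-rank bypass BA (bypass adaptive
# structure). Dimensions d, k and rank r are illustrative assumptions.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, d: int, k: int, r: int = 4):
        super().__init__()
        self.W0 = nn.Linear(k, d, bias=False)  # base diffusion weights
        self.A = nn.Linear(k, r, bias=False)   # down-projection to rank r
        self.B = nn.Linear(r, d, bias=False)   # up-projection back to d
        nn.init.zeros_(self.B.weight)          # bypass starts at zero
        self.W0.weight.requires_grad_(False)   # base path stays frozen

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        first_sub = self.W0(x)           # first sub-noise estimation data
        second_sub = self.B(self.A(x))   # second sub-noise estimation data
        return first_sub + second_sub    # combined i-th noise estimation
```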
The image generation model mentioned above may be a latent variable diffusion model or a common diffusion model, and the image generation method provided by the embodiment of the present application may be implemented through these models, and the type of the image generation model is not limited in this application.
It will be appreciated that different image generation models have respective principles of operation, i.e. the process of generating the target non-living face image may be different. Thus, in one possible embodiment of the present application, there are a plurality of possible implementations of step a1 and step A2 included in the above-mentioned step A2, and the following description will be given separately.
When the image generation model is a latent variable diffusion model, the first alternative implementation of step a1 is: performing N denoising passes based on noise features through the image generation model to obtain target image features; accordingly, the first alternative implementation of step a2 is: decoding the target image features to obtain the target non-living face image.
When the image generation model is a latent variable diffusion model, the noise data is a noise feature. The latent variable diffusion model performs N denoising passes on the noise feature to obtain the target image feature, and then decodes the target image feature to obtain the target non-living face image.
When the image generation model is a common diffusion model, the second alternative implementation of step a1 is: performing N denoising passes based on a noise image through the image generation model to obtain a target image; accordingly, the second alternative implementation of step a2 is: taking the target image as the target non-living face image.
When the image generation model is a common diffusion model, the noise data is a noise image. The common diffusion model performs N denoising passes on the noise image, and the target image obtained after denoising is the target non-living face image.
It should be noted that the implementations given above are merely exemplary and do not represent all implementations of the embodiments of the present application; when the type of the image generation model changes, the process of generating the target non-living face image may change accordingly. For the two alternative implementations, the terminal device can select either one to implement, which is not limited in this application.
Therefore, the image generation model in the application can be a latent variable diffusion model or a common diffusion model, and the image generation method can be applicable to different models and has higher universality.
In the embodiment of the present application, according to the above description, the image generation model to be trained may be a common diffusion model or a latent variable diffusion model. For easy understanding, in the following embodiments, an image generation model to be trained is taken as a latent variable diffusion model as an example, and a training process of the image generation model in the embodiments of the present application is specifically described with reference to the accompanying drawings.
Next, the training step of the image generation model provided in the embodiment of the present application is specifically described by continuing to use the server as an execution subject.
Referring to fig. 3, a flowchart of a training manner of an image generation model according to an embodiment of the present application is shown. As shown in fig. 3, the image generation model may be trained by:
s301: and executing T times of noise adding processing based on the training non-living face image to obtain training noise data, and recording noise adding data corresponding to each time of noise adding processing.
Wherein T is an integer of 1 or more. The trained image generation model is used for estimating noise data so as to remove noise according to the noise estimation result, so that the training non-living face image needs to be subjected to noise adding processing during training so as to enable the image generation model to be trained to learn how to remove noise.
Based on the description of the above embodiments, the training non-living face image is recognized to obtain training image base information, where the training image base information is used to indicate the base features in the training non-living face image (such as the number and gender of faces in the above example). Training attack information is added on the basis of the training image base information to form the training reference information. The training attack information is used to indicate training non-living attack features in the training non-living face image (such as disguise features like the hat, glasses, or mask in the examples above). It should be noted that the training non-living attack features indicated by the added training attack information are the key features based on which the training non-living face image is recognized as containing a non-living face.
The training non-living face image can be recognized by an automated tool to obtain the training image base information. For example, the automated tool may be the Bootstrapping Language-Image Pre-training (BLIP) multimodal model for unified understanding and generation, or a prompt-inversion plug-in such as deepbooru, which is not limited in this application.
As an example, referring to fig. 4, a schematic diagram of acquiring training reference information is provided in an embodiment of the present application. Referring to fig. 4, a training non-living face image is detected by a face detection technology, an area a where a non-living face is located is determined, and the area a is enlarged to obtain an area b including an image background of more non-living face images. Then, based on the region b, cutting and scaling are carried out on the training non-living face image, and the training non-living face image with the preset image size is obtained. Next, the training non-living face image is identified to obtain training image basic information, as shown in fig. 4, where the training image basic information may include: "1boy (1 boy), front_face, solo (alone)". Training attack information is then added on the basis of the training image base information, as shown in fig. 4, and the training attack information may include: "paper_glasses". Finally, training reference information "paper_glasses,1boy, front_face, solo" required for training the image generation model is obtained.
It should be noted that the area of the region a may be enlarged to a preset multiple to obtain the region b, for example, the preset multiple may be 1.7 times. The preset image size may be 512×512, or may not be scaled, which is not limited in this application.
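An illustrative sketch of this preprocessing follows; `detect_face` is a hypothetical detector, and applying the 1.7x multiple to the sides of the box is one plausible reading of enlarging the region:

```python
# Sketch of the preprocessing in fig. 4: detect the face box (region a),
# enlarge it by a preset multiple to include more background (region b),
# then crop and scale to the preset image size. Detector is hypothetical.
import cv2

def crop_training_face(img, detect_face, scale: float = 1.7,
                       out_size: int = 512):
    x, y, w, h = detect_face(img)                  # region a: (x, y, w, h)
    cx, cy = x + w / 2, y + h / 2
    w2, h2 = w * scale, h * scale                  # region b: enlarged box
    x0 = max(int(cx - w2 / 2), 0)
    y0 = max(int(cy - h2 / 2), 0)
    x1 = min(int(cx + w2 / 2), img.shape[1])
    y1 = min(int(cy + h2 / 2), img.shape[0])
    crop = img[y0:y1, x0:x1]                       # crop the training image
    return cv2.resize(crop, (out_size, out_size))  # preset image size
```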
In the embodiment of the application, when the image generation model to be trained is a latent variable diffusion model, the training non-living face image can be compressed into a latent space by using the encoder to obtain the training non-living face image feature, and then the T times of noise adding processing is performed on the training non-living face image feature, so that the obtained training noise feature is used as training noise data.
As an example, reference may be made to fig. 5, which is a schematic diagram of acquiring training noise data according to an embodiment of the present application. Referring to fig. 5, the training non-living face image x may be compressed into the latent representation space by the encoder ε to obtain the training non-living face image feature z, and then T rounds of noise addition are performed on z to obtain z_1, z_2, ..., z_{T-1}, z_T in sequence. It should be noted that the training noise data follows a normal distribution.
Specifically, the process of compressing the training non-living face image x into the latent representation space by the encoder ε can be expressed by the following formula 1:
z = ε(x) ∈ R^{h×w×c} (formula 1)
where the dimension of the training non-living face image x is x ∈ R^{H×W×3}, and R^{h×w×c} is the dimension after compression into the latent representation space. In the compression process, the downsampling factor may be calculated by the following formula 2:
f = H/h = W/w (formula 2)
where f is the downsampling factor, chosen as a power of 2.
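For illustration, the encode-then-noise step might be sketched as below. Closed-form noising with a cumulative schedule `alpha_bar` follows standard diffusion practice and is an assumption; the patent only states that T noise-addition passes are performed and that the training noise data follows a normal distribution.

```python
# Sketch: compress x into the latent space (formula 1), then obtain a
# noised latent z_t with an assumed cumulative schedule alpha_bar, as in
# common diffusion training; eps is the recorded noise-addition data.
import torch

def add_noise(encoder, x: torch.Tensor, alpha_bar: torch.Tensor, t: int):
    z0 = encoder(x)                      # z = eps(x): latent image feature
    eps = torch.randn_like(z0)           # training noise ~ N(0, 1)
    z_t = alpha_bar[t].sqrt() * z0 + (1 - alpha_bar[t]).sqrt() * eps
    return z_t, eps                      # noisy latent and the added noise
```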
In addition, when the image generation model to be trained is a common diffusion model, T times of noise adding processing can be directly performed on the training non-living face image, and the obtained training noise data is a training noise image.
S302: performing T denoising passes on the training noise data according to the training noise data and the training reference information using the image generation model to be trained, and recording the training noise estimation data corresponding to each denoising pass.
In the embodiment of the application, by introducing the training reference information, the features of the denoising results successively obtained by the latent variable diffusion model during the T denoising passes on the training noise data are guided to gradually approach the features of the training reference information, so that the features of the final denoising result conform to the features indicated by the training reference information.
It should be noted that the manner of introducing the training reference information is not limited in this application.
As an example, training reference information may be introduced through a cross-attention mechanism, and the training reference information may be first encoded by an encoder such as CLIP to obtain training image reference text features, which may be represented by the following formula 3:
y = CLIP(text) (formula 3)
Wherein y represents the reference text characteristics of the training image, and text represents the training reference information.
The introduction of the training image reference text features through the cross-attention mechanism can be expressed specifically by the following formula 4.1, where Q, K and V in formula 4.1 can be expressed by formula 4.2, formula 4.3 and formula 4.4 respectively:
Attention(Q, K, V) = softmax(QK^T/√d)·V (formula 4.1)
Q = W_Q^{(i)}·φ_i(z_t') (formula 4.2)
K = W_K^{(i)}·τ_θ(y) (formula 4.3)
V = W_V^{(i)}·τ_θ(y) (formula 4.4)
where Attention(Q, K, V) denotes the attention mechanism, and Q, K and V denote the Query, Key and Value terms respectively, the three key vectors in the attention mechanism. softmax refers to the normalization operation, d refers to the dimension, and z_t' refers to the result of performing T - t denoising passes on z_T. Introducing the training image reference text features through cross-attention requires data interaction between z_t' and y, so both must be mapped to the same dimension: φ_i(z_t') and τ_θ(y) map z_t' and y to the same dimension, and W_Q^{(i)}, W_K^{(i)} and W_V^{(i)} denote the respective projection matrices for Q, K and V.
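A minimal sketch of this conditioning cross-attention, with all dimensions as illustrative assumptions:

```python
# Sketch of formulas 4.1-4.4: latent features z_t' attend to the training
# image reference text features y. Dimensions are illustrative assumptions.
import math
import torch
import torch.nn as nn

class CrossAttention(nn.Module):
    def __init__(self, d_latent: int, d_text: int, d: int):
        super().__init__()
        self.W_q = nn.Linear(d_latent, d, bias=False)  # maps phi_i(z_t')
        self.W_k = nn.Linear(d_text, d, bias=False)    # maps tau_theta(y)
        self.W_v = nn.Linear(d_text, d, bias=False)    # maps tau_theta(y)

    def forward(self, z_t: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
        Q, K, V = self.W_q(z_t), self.W_k(y), self.W_v(y)
        scores = Q @ K.transpose(-2, -1) / math.sqrt(Q.shape[-1])
        return torch.softmax(scores, dim=-1) @ V       # formula 4.1
```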
S303: determining a loss function according to the noise-addition data and the training noise estimation data corresponding respectively to a noise-addition pass and the denoising pass that corresponds to it.
As an example, the loss function may be determined based on the noise-addition data corresponding to the T-th noise-addition pass, which turns z_{T-1} into z_T (this can be understood as the noise actually added), and the training noise estimation data corresponding to the first denoising pass performed by the image generation model to be trained, which turns z_T into z_{T-1}' (this can be understood as that same added noise as estimated by the latent variable diffusion model).
In the embodiment of the application, the latent variable diffusion model is a time-sequential denoising autoencoder, which can be expressed as ε_θ(z_t', y, t), t = 1, ..., T; the latent variable diffusion model can obtain the training noise estimation data corresponding to each denoising pass, and the loss function can be specifically expressed by the following formula 5:
L_LDM = E_{ε(x), ϵ~N(0,1), t}[ ||ϵ - ε_θ(z_t', y, t)||_2^2 ] (formula 5)
where L_LDM represents the loss function, and ϵ and ε_θ(z_t', y, t) respectively represent the noise-addition data and the training noise estimation data corresponding to a noise-addition pass and the denoising pass that corresponds to it. The subscript ε(x), ϵ~N(0,1), t of the expectation indicates that the added noise ϵ follows a normal distribution.
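A sketch of computing this loss for a single training step; the `model` call mirrors ε_θ(z_t', y, t) and its exact signature is an assumption:

```python
# Sketch of formula 5: mean squared error between the actually added
# noise and the model's noise estimate, conditioned on text features y.
import torch
import torch.nn.functional as F

def ldm_loss(model, z_t, eps_true, y, t):
    eps_pred = model(z_t, y, t)            # training noise estimation data
    return F.mse_loss(eps_pred, eps_true)  # ||eps - eps_theta(z_t', y, t)||^2
```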
S304: based on the loss function, an image generation model is trained.
Based on the loss function, model parameters of the image generation model are adjusted. And iteratively executing the training process until the preset training ending condition is met, and obtaining a trained image generation model. For example, the preset training ending condition may be that the training frequency of the image generation model reaches a preset frequency threshold; or the model performance of the image generation model reaches a preset requirement, for example, the difference value between the noise addition data and the training noise estimated data, which are respectively corresponding to the noise addition processing and the noise removal processing with corresponding relations, is smaller than a difference value threshold. The present application is not limited in this regard.
Therefore, by training a common diffusion model or a latent variable diffusion model, the training non-living attack features in the training non-living face image can be learned, improving the model performance of the image generation model for generating the target non-living face image.
It will be appreciated that in practical applications, it is very time consuming to fine tune the latent variable diffusion model directly, and therefore, the latent variable diffusion model may be fine tuned in a low-rank adaptive manner.
Thus, in one possible embodiment of the present application, a basic diffusion structure and a bypass adaptive structure may be included in the image generation model; the basic diffusion structure is used for determining the first training sub-noise estimation data in a denoising pass, the bypass adaptive structure is used for determining the second training sub-noise estimation data in a denoising pass, and the training noise estimation data corresponding to the denoising pass is determined from the first training sub-noise estimation data and the second training sub-noise estimation data. Accordingly, S304 may specifically be: based on the loss function, adjusting the model parameters of the bypass adaptive structure in the image generation model while keeping the model parameters of the basic diffusion structure unchanged. It should be emphasized that the bypass adaptive structure has far fewer model parameters than the basic diffusion structure.
As an example, the model parameters of the basic diffusion structure may be represented as a matrix W_0 of dimension d×k, and the model parameters of the bypass adaptive structure as ΔW; the model parameters of the image generation model may then be represented by the following formula 6:
W_0 + ΔW = W_0 + BA, B ∈ R^{d×r}, A ∈ R^{r×k} (formula 6)
where r is a rank much smaller than d and k; A is the rank-reducing (down-projection) matrix and B is the rank-raising (up-projection) matrix.
Accordingly, the training noise estimation data corresponding to a denoising pass may be represented by the following formula 7:
h = W_0·x + ΔW·x = W_0·x + BA·x (formula 7)
where W_0·x represents the first training sub-noise estimation data, ΔW·x represents the second training sub-noise estimation data, and h represents the training noise estimation data.
Therefore, the model parameters of the basic diffusion structure can be converted into the model parameters of the small-scale bypass self-adaptive structure, namely, the high-dimensional model parameters are converted into the corresponding low-dimensional model parameter representation, and the original model parameters are represented by the low-rank matrix. Therefore, in the process of training the image generation model comprising the basic diffusion structure and the bypass self-adaptive structure, model parameters of the large-scale basic diffusion structure can be kept unchanged, and only model parameters of the small-scale bypass self-adaptive structure are adjusted, so that the calculation amount is greatly reduced, the model training speed can be improved, and the trained image generation model can be ensured to be suitable for the generation task of the non-living face image in the application.
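Correspondingly, only the bypass parameters would enter the optimizer during fine-tuning; a brief sketch reusing the hypothetical LoRALinear layer from the earlier sketch (shapes and learning rate are illustrative):

```python
# Sketch: train only the low-rank bypass; the base weights stay frozen.
import torch

layer = LoRALinear(d=320, k=768, r=4)  # illustrative shapes
trainable = [p for p in layer.parameters() if p.requires_grad]
optimizer = torch.optim.AdamW(trainable, lr=1e-4)
# W0 (d*k parameters) is fixed; only A (r*k) and B (d*r) are updated,
# far fewer parameters since r is much smaller than d and k.
```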
Referring to fig. 6, a schematic diagram of obtaining a reconstructed training non-living face image according to an embodiment of the present application is shown. As shown in fig. 6, the text may be encoded by CLIP to obtain y, which is then introduced through a cross-attention mechanism; using the basic diffusion structure and the bypass adaptive structure of the image generation model, the first denoising process is performed on z_T to obtain z_{T-1}. The other denoising processes are the same as the first and are not repeated here; after T denoising processes starting from z_T, z' is obtained. The decoder may then be used to reconstruct the image from the latent representation space, obtaining the reconstructed training non-living face image.
As an example, the process of reconstructing an image using the decoder can be represented by the following Equation 8:

$\hat{x} = D(z')$ (Equation 8)

where $\hat{x}$ denotes the reconstructed training non-living face image and $D$ denotes the decoder.
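As illustration only, the T-step denoising and decoding flow of fig. 6 might be sketched as follows; `unet`, `decoder`, and `clip_encode` stand in for the basic-plus-bypass noise estimator, the decoder D, and the CLIP text encoder, and a standard DDPM update rule is assumed:

```python
import torch

@torch.no_grad()
def reconstruct(unet, decoder, clip_encode, text, z_T, T, betas):
    """Denoise z_T for T steps conditioned on CLIP text features, then decode (Equation 8)."""
    y = clip_encode(text)                        # text features y, injected via cross-attention inside unet
    alphas = 1.0 - betas
    alphas_bar = torch.cumprod(alphas, dim=0)
    z = z_T
    for t in reversed(range(T)):                 # each step removes the estimated noise, as in the first step
        eps = unet(z, t, y)                      # noise estimate from basic + bypass structures
        coef = (1.0 - alphas[t]) / (1.0 - alphas_bar[t]).sqrt()
        z = (z - coef * eps) / alphas[t].sqrt()  # DDPM posterior mean update
        if t > 0:
            z = z + betas[t].sqrt() * torch.randn_like(z)  # stochastic term for intermediate steps
    z_prime = z
    return decoder(z_prime)                      # x_hat = D(z')
```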
Furthermore, in addition to the ordinary diffusion model and the latent variable diffusion model mentioned above, the image generation model may also be a generator in a generative adversarial network (Generative Adversarial Nets, GAN).
Thus, in one possible embodiment of the present application, when the image generation model is a generator in a generative adversarial network, the generative adversarial network may be trained by:

B1: generating a training predicted image from the training reference information through the generator in the generative adversarial network.

In one possible embodiment of the present application, the input of the generator may further include noise data; that is, B1 may specifically be: generating a training predicted image from the training reference information and the noise data through the generator in the generative adversarial network. For example, the noise data may be a noise feature randomly sampled from a normal distribution; the noise data may also be a directly acquired noised image; or the noise data may be a noise image obtained by randomly adding noise points to an arbitrary image, which is not limited in this application.
B2: by generating a discriminant in the countermeasure network, a first probability that the training predictive image belongs to the real image is determined, and a second probability that the training non-living face image belongs to the real image is determined.
The task of the generator is to generate a training predictive image that is similar to the training of the non-living face image. The task of the arbiter is to determine whether a given image is a training predicted image (i.e., to determine a first probability that the training predicted image belongs to a real image) or a training non-living face image (i.e., to determine a second probability that the training non-living face image belongs to a real image).
B3: based on the first probability and the second probability, training generates a generator and a discriminant in the antagonism network.
If the first probability is larger than the second probability, the judgment error of the discriminator is indicated, and parameters of the discriminator need to be adjusted, so that the next judgment error is avoided; if the second probability is greater than the first probability, the judgment of the judging device is correct, and parameters of the generator need to be adjusted, so that the training predicted image generated by the generator is more in line with the training non-living face image. When the continuous training is carried out until the generator and the discriminator enter an equilibrium and harmony state, the training is completed, and a well-trained generated countermeasure network is obtained.
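For illustration only, one adversarial training step as described in B1–B3 might be sketched as follows, assuming hypothetical `generator` and `discriminator` modules and a hypothetical noise dimension; this is a minimal sketch, not the application's actual training code:

```python
import torch
import torch.nn.functional as F

def gan_step(generator, discriminator, g_opt, d_opt, ref_info, real_images):
    noise = torch.randn(real_images.size(0), 128)   # B1 variant: noise data from a normal distribution (dim 128 assumed)
    fake = generator(ref_info, noise)               # B1: training predicted image

    p_fake = discriminator(fake.detach())           # B2: first probability (predicted image judged real)
    p_real = discriminator(real_images)             # B2: second probability (training image judged real)

    # B3: adjust the discriminator so that p_real -> 1 and p_fake -> 0
    d_loss = F.binary_cross_entropy(p_real, torch.ones_like(p_real)) + \
             F.binary_cross_entropy(p_fake, torch.zeros_like(p_fake))
    d_opt.zero_grad(); d_loss.backward(); d_opt.step()

    # B3: adjust the generator so that the discriminator scores its output as real
    g_loss = F.binary_cross_entropy(discriminator(fake), torch.ones_like(p_fake))
    g_opt.zero_grad(); g_loss.backward(); g_opt.step()
```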
In this way, image generation models of different types can be trained to serve as the image generation model for generating the target non-living face image, so the method has good generality.
Based on the image generation method provided in the foregoing embodiments, the present application correspondingly provides an image generation apparatus. The image generation apparatus provided in the embodiments of the present application is described below from the perspective of functional modularization.
Referring to fig. 7, a schematic structural diagram of an image generating apparatus according to an embodiment of the present application is shown. As shown in fig. 7, the image generating apparatus 700 may specifically include:
an information acquisition module 710 for acquiring image reference information; the image reference information comprises attack information and image basic information, wherein the attack information is used for indicating non-living attack characteristics in a non-living face image to be generated, and the image basic information is used for indicating basic characteristics in the non-living face image to be generated;
an image generation module 720, configured to generate a target non-living face image according to the image reference information through an image generation model;
the image generation model is trained based on training reference information, where the training reference information includes training image basic information and training attack information; the training image basic information is obtained by recognizing the training non-living face image and is used to indicate basic features in the training non-living face image; the training attack information is used to indicate training non-living attack features in the training non-living face image, where the training non-living attack features are the key features based on which the training non-living face image is identified as comprising a non-living face.
As an embodiment, the image generating module 720 may specifically include:
the information coding unit is used for coding the image reference information to obtain image reference text characteristics;
and the image generation unit is used for generating a target non-living human face image according to the image reference text characteristics and the noise data through the image generation model.
As an embodiment, the image generating unit may be specifically configured to perform, by an image generating model, denoising processing N times based on noise data, to obtain target image data; determining a target non-living face image according to the target image data; n is an integer greater than or equal to 1;
wherein the i-th denoising process includes: determining, through the image generation model, the i-th noise estimate data according to the denoising result of the (i-1)-th denoising process and the image reference text feature; and removing the i-th noise estimate data from the denoising result of the (i-1)-th denoising process to obtain the denoising result of the i-th denoising process; i is an integer greater than or equal to 1 and less than or equal to N; when i is equal to 1, the denoising result of the (i-1)-th denoising process is the noise data, and when i is equal to N, the denoising result of the i-th denoising process is the target image data.
As an embodiment, the ith noise estimate data in the image generation unit may be obtained by the following sub-units:
a first data determination subunit, configured to determine, through the basic diffusion structure in the image generation model, first sub-noise estimate data according to the denoising result of the (i-1)-th denoising process, the image reference text feature, and the round number corresponding to the current denoising process;

a second data determination subunit, configured to determine, through the bypass adaptive structure in the image generation model, second sub-noise estimate data according to the denoising result of the (i-1)-th denoising process and the image reference text feature;
and the third data determination subunit is used for determining the ith noise estimated data according to the first sub-noise estimated data and the second sub-noise estimated data.
As an embodiment, the image generating unit may be specifically configured to perform, by using an image generating model, denoising processing for N times based on noise characteristics, to obtain target image characteristics; decoding the target image features to obtain a target non-living face image;
or,
the image generation unit is specifically configured to perform N times of denoising processing based on the noise image through the image generation model to obtain a target image; the target image is taken as a target non-living human face image.
As an embodiment, the image generation model in the image generation module 720 may be specifically obtained by training the following units:
the noise adding unit is used for executing T times of noise adding processing based on the training non-living face image to obtain training noise data, and recording noise adding data corresponding to each time of noise adding processing; t is an integer greater than or equal to 1;
a denoising unit, configured to perform, through the image generation model to be trained, T denoising processes on the training noise data according to the training noise data and the training reference information, and record the training noise estimate data corresponding to each denoising process;
the loss function determining unit is used for determining a loss function according to noise adding data and training noise estimated data which are respectively corresponding to the noise adding processing and the noise removing processing with the corresponding relation;
and a model training unit, configured to train the image generation model based on the loss function.
As an embodiment, the image generation model in the model training unit may include a basic diffusion structure and a bypass adaptive structure therein; the basic diffusion structure is used for determining first training sub-noise estimated data in denoising processing, the bypass self-adaptive structure is used for determining second training sub-noise estimated data in denoising processing, and the training noise estimated data corresponding to denoising processing is determined according to the first training sub-noise estimated data and the second training sub-noise estimated data; the model parameters of the bypass adaptive structure are less than the model parameters of the base diffusion structure;
Correspondingly, the model training unit can be specifically used for adjusting model parameters of the bypass self-adaptive structure in the image generation model based on the loss function, and keeping model parameters of the basic diffusion structure in the image generation model unchanged.
As an embodiment, the image generation model in the image generation module 720 may specifically be a generator in a generative adversarial network, and the generative adversarial network may be obtained through training by the following units:

a predicted image generation unit, configured to generate a training predicted image from the training reference information through the generator in the generative adversarial network;

a probability determination unit, configured to determine, through the discriminator in the generative adversarial network, a first probability that the training predicted image belongs to a real image and a second probability that the training non-living face image belongs to a real image;

and a training unit, configured to train the generator and the discriminator in the generative adversarial network based on the first probability and the second probability.
The embodiments of the present application further provide a computer device, which may be a terminal device or a server. The terminal device and the server provided by the embodiments of the present application are described below from the perspective of hardware implementation.
Referring to fig. 8, fig. 8 is a schematic structural diagram of a terminal device provided in an embodiment of the present application. As shown in fig. 8, for convenience of explanation, only the portions related to the embodiments of the present application are shown; for specific technical details not disclosed, please refer to the method portions of the embodiments of the present application. The terminal may be any terminal device including a mobile phone, a tablet computer, a personal digital assistant (PDA), a point-of-sale (POS) terminal, a vehicle-mounted computer, and the like; the following takes the terminal being a computer as an example:
fig. 8 is a block diagram showing a part of the structure of a computer related to a terminal provided in an embodiment of the present application. Referring to fig. 8, a computer includes: radio Frequency (RF) circuitry 1210, memory 1220, input unit 1230 (including touch panel 1231 and other input devices 1232), display unit 1240 (including display panel 1241), sensors 1250, audio circuitry 1260 (which may connect speaker 1261 and microphone 1262), wireless fidelity (wireless fidelity, wiFi) module 1270, processor 1280, and power supply 1290. Those skilled in the art will appreciate that the computer architecture shown in fig. 8 is not limiting and that more or fewer components than shown may be included, or that certain components may be combined, or that different arrangements of components may be utilized.
Memory 1220 may be used to store software programs and modules, and processor 1280 may execute the various functional applications and data processing of the computer by executing the software programs and modules stored in memory 1220. The memory 1220 may mainly include a storage program area that may store an operating system, application programs required for at least one function (such as a sound playing function, an image playing function, etc.), and a storage data area; the storage data area may store data created according to the use of the computer (such as audio data, phonebooks, etc.), and the like. In addition, memory 1220 may include high-speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other volatile solid-state storage device.
Processor 1280 is a control center of the computer and connects various parts of the entire computer using various interfaces and lines, performing various functions of the computer and processing data by running or executing software programs and/or modules stored in memory 1220, and invoking data stored in memory 1220. In the alternative, processor 1280 may include one or more processing units; preferably, the processor 1280 may integrate an application processor and a modem processor, wherein the application processor primarily handles operating systems, user interfaces, application programs, etc., and the modem processor primarily handles wireless communications. It will be appreciated that the modem processor described above may not be integrated into the processor 1280.
In the embodiment of the present application, the processor 1280 included in the terminal further has the following functions:
acquiring image reference information; the image reference information comprises attack information and image basic information, wherein the attack information is used for indicating non-living attack characteristics in a non-living face image to be generated, and the image basic information is used for indicating basic characteristics in the non-living face image to be generated;
generating a target non-living face image according to the image reference information through an image generation model;
the image generation model is trained based on training reference information, where the training reference information includes training image basic information and training attack information; the training image basic information is obtained by recognizing the training non-living face image and is used to indicate basic features in the training non-living face image; the training attack information is used to indicate training non-living attack features in the training non-living face image, and the training non-living attack features are the key features based on which the training non-living face image is identified as comprising a non-living face.
Optionally, the processor 1280 is further configured to perform steps of any implementation of the image generating method provided in the embodiments of the present application.
Referring to fig. 9, fig. 9 is a schematic structural diagram of a server 1300 according to an embodiment of the present application. The server 1300 may vary considerably in configuration or performance and may include one or more central processing units (central processing units, CPU) 1322 (e.g., one or more processors) and memory 1332, one or more storage media 1330 (e.g., one or more mass storage devices) storing applications 1342 or data 1344. Wherein the memory 1332 and storage medium 1330 may be transitory or persistent. The program stored on the storage medium 1330 may include one or more modules (not shown), each of which may include a series of instruction operations on a server. Further, the central processor 1322 may be configured to communicate with the storage medium 1330, and execute a series of instruction operations in the storage medium 1330 on the server 1300.
The server 1300 may also include one or more power supplies 1326, one or more wired or wireless network interfaces 1350, one or more input/output interfaces 1358, and/or one or more operating systems, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, and the like.
The steps performed by the server in the above embodiments may be based on the server structure shown in fig. 9.
The CPU 1322 is configured to perform the following steps:
acquiring image reference information; the image reference information comprises attack information and image basic information, wherein the attack information is used for indicating non-living attack characteristics in a non-living face image to be generated, and the image basic information is used for indicating basic characteristics in the non-living face image to be generated;
generating a target non-living face image according to the image reference information through an image generation model;
the image generation model is trained based on training reference information, where the training reference information includes training image basic information and training attack information; the training image basic information is obtained by recognizing the training non-living face image and is used to indicate basic features in the training non-living face image; the training attack information is used to indicate training non-living attack features in the training non-living face image, and the training non-living attack features are the key features based on which the training non-living face image is identified as comprising a non-living face.
Optionally, CPU1322 may also be configured to perform the steps of any implementation of the image generation methods provided by embodiments of the present application.
The present application also provides a computer-readable storage medium storing a computer program for executing any one of the implementations of the image generating method described in the foregoing embodiments.
Embodiments of the present application also provide a computer program product or computer program comprising computer instructions stored in a computer-readable storage medium. The processor of the computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions to cause the computer device to perform any one of the image generation methods described in the foregoing respective embodiments.
It will be clear to those skilled in the art that, for convenience and brevity of description, specific working procedures of the above-described systems, apparatuses and units may refer to corresponding procedures in the foregoing method embodiments, which are not repeated herein.
In the several embodiments provided in this application, it should be understood that the disclosed systems, apparatuses, and methods may be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative, e.g., the division of the units is merely a logical function division, and there may be additional divisions when actually implemented, e.g., multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. Alternatively, the coupling or direct coupling or communication connection shown or discussed with each other may be an indirect coupling or communication connection via some interfaces, devices or units, which may be in electrical, mechanical or other form.
The units described as separate units may or may not be physically separate, and units shown as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, each functional unit in each embodiment of the present application may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units may be implemented in hardware or in software functional units.
The integrated units, if implemented in the form of software functional units and sold or used as stand-alone products, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present application may be embodied in essence or a part contributing to the prior art or all or part of the technical solution in the form of a software product stored in a storage medium, including several instructions to cause a computer device (which may be a personal computer, a server, or a network device, etc.) to perform all or part of the steps of the methods described in the embodiments of the present application. And the aforementioned storage medium includes: a usb disk, a removable hard disk, a Read-Only Memory (ROM), a random access Memory (Random Access Memory, RAM), a magnetic disk, or an optical disk, or other various media in which a computer program can be stored.
It should be understood that in this application, "at least one" means one or more, and "a plurality" means two or more. "And/or" describes the association relationship of associated objects and indicates that three relationships may exist; for example, "A and/or B" may represent: only A, only B, or both A and B, where A and B may be singular or plural. The character "/" generally indicates that the associated objects are in an "or" relationship. "At least one of" and similar expressions mean any combination of the listed items, including any combination of single items or plural items. For example, at least one of a, b, or c may represent: a, b, c, "a and b", "a and c", "b and c", or "a and b and c", where a, b, and c may be single or plural.
The above embodiments are merely for illustrating the technical solution of the present application, and not for limiting the same; although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the corresponding technical solutions.

Claims (11)

1. An image generation method, the method comprising:
acquiring image reference information; the image reference information comprises attack information and image basic information, wherein the attack information is used for indicating non-living attack characteristics in a non-living face image to be generated, and the image basic information is used for indicating basic characteristics in the non-living face image to be generated;
generating a target non-living face image according to the image reference information through an image generation model;
the image generation model is trained based on training reference information, wherein the training reference information comprises training image basic information and training attack information; the training image basic information is obtained by recognizing a training non-living face image and is used for indicating basic features in the training non-living face image; the training attack information is used for indicating training non-living attack features in the training non-living face image, and the training non-living attack features are the key features based on which the training non-living face image is identified as comprising a non-living face.
2. The method of claim 1, wherein generating the target non-living face image from the image reference information by the image generation model comprises:
Coding the image reference information to obtain image reference text characteristics;
and generating the target non-living human face image according to the image reference text characteristics and the noise data through the image generation model.
3. The method of claim 2, wherein generating, by the image generation model, the target non-living face image from the image reference text feature and noise data, comprises:
performing N times of denoising processing based on the noise data through the image generation model to obtain target image data; determining the target non-living face image according to the target image data; the N is an integer greater than or equal to 1;
wherein the i-th denoising process comprises: determining, through the image generation model, i-th noise estimate data according to a denoising result of the (i-1)-th denoising process and the image reference text feature; and removing the i-th noise estimate data from the denoising result of the (i-1)-th denoising process to obtain a denoising result of the i-th denoising process; i is an integer greater than or equal to 1 and less than or equal to N; when i is equal to 1, the denoising result of the (i-1)-th denoising process is the noise data, and when i is equal to N, the denoising result of the i-th denoising process is the target image data.
4. The method according to claim 3, wherein the determining, through the image generation model, the i-th noise estimate data according to the denoising result of the (i-1)-th denoising process and the image reference text feature comprises:

determining, through a basic diffusion structure in the image generation model, first sub-noise estimate data according to the denoising result of the (i-1)-th denoising process, the image reference text feature, and the round number corresponding to the current denoising process;

determining, through a bypass adaptive structure in the image generation model, second sub-noise estimate data according to the denoising result of the (i-1)-th denoising process and the image reference text feature;
and determining the ith noise estimated data according to the first sub-noise estimated data and the second sub-noise estimated data.
5. The method according to claim 3, wherein the performing, through the image generation model, N denoising processes based on the noise data to obtain target image data, and determining the target non-living face image according to the target image data, comprises:
performing N times of denoising processing based on noise characteristics through the image generation model to obtain target image characteristics; decoding the target image features to obtain the target non-living face image;
Or,
performing N times of denoising processing based on the noise image through the image generation model to obtain a target image; and taking the target image as the target non-living human face image.
6. The method of claim 1, wherein the image generation model is trained by:
performing T times of noise adding processing based on the training non-living face image to obtain training noise data, and recording noise adding data corresponding to each time of noise adding processing; the T is an integer greater than or equal to 1;
performing T times of denoising processing on the training noise data according to the training noise data and the training reference information by using the image generation model to be trained, and recording training noise estimated data corresponding to each time of denoising processing;
determining a loss function according to noise adding data and training noise estimated data which correspond to the noise adding processing and the noise removing processing respectively and have corresponding relations;
the image generation model is trained based on the loss function.
7. The method of claim 6, wherein the image generation model includes a base diffusion structure and a bypass adaptation structure therein; the basic diffusion structure is used for determining first training sub-noise estimated data in the denoising process, the bypass self-adaptive structure is used for determining second training sub-noise estimated data in the denoising process, and the training noise estimated data corresponding to the denoising process is determined according to the first training sub-noise estimated data and the second training sub-noise estimated data; the model parameters of the bypass adaptive structure are less than the model parameters of the base diffusion structure;
The training the image generation model based on the loss function includes:
and based on the loss function, adjusting the model parameters of the bypass self-adaptive structure in the image generation model, and keeping the model parameters of the basic diffusion structure in the image generation model unchanged.
8. The method of claim 1, wherein the image generation model is a generator in a generative adversarial network, and the generative adversarial network is trained by:

generating a training predicted image from the training reference information through the generator in the generative adversarial network;

determining, through the discriminator in the generative adversarial network, a first probability that the training predicted image belongs to a real image and a second probability that the training non-living face image belongs to a real image;

training the generator and the discriminator in the generative adversarial network based on the first probability and the second probability.
9. An image generation apparatus, the apparatus comprising:
the information acquisition module is used for acquiring image reference information; the image reference information comprises attack information and image basic information, wherein the attack information is used for indicating non-living attack characteristics in a non-living face image to be generated, and the image basic information is used for indicating basic characteristics in the non-living face image to be generated;
The image generation module is used for generating a target non-living body face image according to the image reference information through an image generation model;
the image generation model is trained based on training reference information, wherein the training reference information comprises training image basic information and training attack information; the training image basic information is obtained by recognizing a training non-living face image and is used for indicating basic features in the training non-living face image; the training attack information is used for indicating training non-living attack features in the training non-living face image, and the training non-living attack features are the key features based on which the training non-living face image is identified as comprising a non-living face.
10. A computer device, the computer device comprising a processor and a memory;
the memory is used for storing a computer program;
the processor is configured to execute the image generation method according to any one of claims 1 to 8 according to the computer program.
11. A computer-readable storage medium storing a computer program for executing the image generation method according to any one of claims 1 to 8.
CN202311340128.8A 2023-10-16 2023-10-16 Image generation method and related device Pending CN117409459A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311340128.8A CN117409459A (en) 2023-10-16 2023-10-16 Image generation method and related device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311340128.8A CN117409459A (en) 2023-10-16 2023-10-16 Image generation method and related device

Publications (1)

Publication Number Publication Date
CN117409459A true CN117409459A (en) 2024-01-16

Family

ID=89493658

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311340128.8A Pending CN117409459A (en) 2023-10-16 2023-10-16 Image generation method and related device

Country Status (1)

Country Link
CN (1) CN117409459A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN118674812A (en) * 2024-07-31 2024-09-20 腾讯科技(深圳)有限公司 Image processing and model training method, device, equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication