CN112270644A

CN112270644A - Face super-resolution method based on spatial feature transformation and cross-scale feature integration

Info

Publication number: CN112270644A
Application number: CN202011124368.0A
Authority: CN
Inventors: 张凯兵; 庄诚; 李敏奇; 景军锋; 卢健; 刘薇; 陈小改
Original assignee: Xian Polytechnic University
Current assignee: Liaoning Houfa Xianzhi Technology Co ltd
Priority date: 2020-10-20
Filing date: 2020-10-20
Publication date: 2021-01-26
Anticipated expiration: 2040-10-20
Also published as: CN112270644B

Abstract

The invention discloses a face super-resolution method based on spatial feature transformation and cross-scale feature integration, which is implemented according to the following steps: preprocessing a face image to obtain a training set and a test set, and processing the preprocessed face image to generate a semantic segmentation probability map; constructing a generation confrontation network model for training; and sequentially inputting the face images in the training set into the constructed generation confrontation network model, setting parameters, training and achieving convergence, and inputting the face images in the test set into the trained generation confrontation network model to obtain the super-resolution reconstructed high-resolution images. The face super-resolution method based on spatial feature transformation and cross-scale feature integration solves the problem that texture details in a reconstructed face image are ignored in the existing method in the prior art.

Description

Face super-resolution method based on spatial feature transformation and cross-scale feature integration

Technical Field

The invention belongs to the technical field of face image recognition, and relates to a face super-resolution method based on spatial feature transformation and cross-scale feature integration.

Background

The existing tasks related to human faces, such as face recognition, face alignment, expression recognition, three-dimensional face reconstruction and the like, are all realized based on a clear high-resolution face data set, and the effect is obviously reduced when a low-resolution face image is faced. In addition, due to the inherent limitation of the conventional digital imaging device, the obtained face image is often subjected to a series of degradation processes such as optical blurring and undersampling, and finally, a clear image in visual sense is difficult to obtain. The image super-resolution technology is used as an effective image recovery means, and can effectively overcome the problem of low image resolution caused by the limitation of physical resolution of imaging equipment, optical blurring and the like.

The face super-resolution method is roughly divided into two types, namely a traditional method based on a classical machine learning algorithm and a deep learning method based on a convolutional neural network. Among them, the super-resolution method based on deep learning is gaining attention due to its superior reconstruction performance. However, most of the existing face image super-resolution algorithms only focus on the super-resolution reconstruction of a 'tiny face' with 16 × 16 pixels, namely, the super-resolution reconstruction is also called as a 'face phantom', and the reconstruction of a 'small face' such as a face image with 64 × 64 pixels, which is common in practical application, is ignored; therefore, the result images obtained by the methods can only meet the face detection task, but cannot keep identity consistency with the real face. In addition, the methods usually pursue high peak signal-to-noise ratio and structural similarity, and whether the texture details in the reconstructed face image meet the requirements of human eyes on visual perception quality is ignored.

Disclosure of Invention

The invention aims to provide a face super-resolution method based on spatial feature transformation and cross-scale feature integration, and solves the problem that the existing method in the prior art ignores the texture details in a reconstructed face image.

The technical scheme adopted by the invention is that the face super-resolution method based on spatial feature transformation and cross-scale feature integration is implemented according to the following steps:

step 1, randomly selecting N human face images from a human face data set, and then preprocessing the human face images to generate a training set and a test set;

step 2, adopting a face analysis pre-training model BisNet as a base network for generating a semantic segmentation probability map, and processing the face image preprocessed in the step 1 to generate the semantic segmentation probability map;

step 3, constructing a generated confrontation network model for training, wherein the generated confrontation network model comprises a semantic segmentation probability map intermediate condition generation module, a spatial feature transformation module, a cross-scale feature integration module and a fusion output module which are sequentially connected, a sub-pixel convolution layer sampled on an image is introduced into the cross-scale feature integration module, and a confrontation loss function and a perception loss function introduced into the generated confrontation network model;

step 4, sequentially inputting the face images in the training set obtained in the step 1 into a constructed generation confrontation network model, setting parameters, training and achieving convergence;

and 5, inputting the face images in the test set in the step 1 into the generated confrontation network model trained in the step 4 to obtain a super-resolution reconstructed high-resolution image.

The face data set in the step 1 is a CelebA-HQ face data set.

The preprocessing of the face images in the training set in the step 1 specifically comprises the following steps: adopting a bicubic interpolation algorithm to carry out down-sampling on the images in the training set, and outputting an interpolation image I with the size of 512 multiplied by 512^HRAs a trainingTraining and testing the set of target images, and then interpolating image I^HRDownsampling 4 times to 64 times 64 by adopting bicubic interpolation as training and testing input image I^LR(ii) a Then inputting the image I^LRAdopting double cubic interpolation up-sampling from 4 times to 256 times 256 as semantic segmentation network input image I^S。

The step 2 specifically comprises the following steps:

the face analysis pre-training model BisNet is used as a base network generated by a semantic segmentation probability map, and the output layer of the face analysis pre-training model BisNet is modified, and the method specifically comprises the following steps: adding a softmax function into an output layer of a face analysis pre-training model BisNet, and inputting the semantic segmentation network input image I obtained in the step 1^SInputting the semantic probability output result into a modified face analysis pre-training model BisNet, outputting the semantic probability output result into a pth file, namely a Pythrch model file, and obtaining a semantic segmentation probability map I^Seg。

The step 4 specifically comprises the following steps:

step 4.1, setting training parameters, inputting training and testing into image I^LRTarget image I of training set and test set^HRAnd semantic segmentation probability map I^SegLoading network input end, namely input end of semantic segmentation probability map intermediate condition generation module, and semantic segmentation probability map intermediate condition generation module inputting semantic segmentation probability map I^SegProcessing to generate a semantic information intermediate condition psi;

step 4.2, the semantic segmentation probability map intermediate condition generation module inputs training and testing to the image I^LRGenerating a feature map as a front layer feature map through a layer of convolution;

step 4.3, taking the intermediate condition psi of the front-layer feature map and the semantic information as the input of a spatial feature transformation module, and outputting a feature map F1 by the spatial feature transformation module;

step 4.4, inputting the output feature map F1 in the step 4.3 into the cross-scale integration module to obtain different scale features, then inputting the different scale features into the fusion output module to obtain a super-resolution image, and recording the super-resolution image as I^SR；

Step 45, converting the super-resolution image I^SRAnd corresponding interpolated image I^HRInput discriminator D_ηThe discrimination information is transmitted back to the generator for generation of the countermeasure network model, i.e. generator G_θ；

And 4.6, continuously iterating the steps 4.4-4.5 to minimize the sum of the confrontation loss and the perception loss, and then taking the corresponding parameters as the trained model parameters to obtain the trained generated confrontation network model.

The semantic segmentation probability map intermediate condition generation module comprises five convolutional layers which are sequentially connected, the number of input channels of the first convolutional layer is 19, the number of output channels is 128, the size of a convolutional kernel is 4 multiplied by 4, the convolutional step length is 4, and the negative nonzero slope of a modified linear unit is 0.1; the number of input channels of the second convolutional layer is 128, the number of output channels is 128, the size of a convolutional kernel is 1 multiplied by 1, the convolution step length is 4, and the negative nonzero slope of the modified linear unit is 0.1; the number of input channels of the third convolutional layer is 128, the number of output channels is 128, the size of a convolutional kernel is 1 multiplied by 1, the convolution step is 1, the negative non-zero slope of the modified linear unit is 0.1, the number of input channels of the fourth convolutional layer is 128, the number of output channels is 128, the size of the convolutional kernel is 1 multiplied by 1, and the convolution step is 1; finally, the number of input channels of one convolutional layer is 128, the number of output channels is 32, the size of a convolutional kernel is 1 multiplied by 1, the convolution step length is 1, and finally, an intermediate condition containing semantic information is output by one convolutional layer and is marked as psi;

the spatial feature transformation module is composed of 8 residual units with spatial feature transformation layers, and each residual unit is composed of a spatial feature transformation layer, a convolution layer and a nonlinear activation layer.

Step 4.4, inputting the output feature map F1 in step 4.3 into the cross-scale integration module, and obtaining different scale features specifically as follows:

in the cross-scale integration module, the dimension of the output feature map F1 is raised by 4 times through a convolution layer pair, and the feature map F2 is obtained by performing up-sampling on the output feature map F1 by 2 times through sub-pixel convolution; meanwhile, the output feature map F1 is amplified by 2 times through double cubic interpolation and then is fused with the feature map F2 on a channel to obtain a feature map F3_1, and the feature map is transmitted backwards; the feature map F2 is reduced by two times through convolution with a step length of 2, and then is fused with the feature map F1 on a channel, so that a feature map F3_2 is obtained and is transmitted backwards; f3_1 and F3_2 are respectively input into two residual error feature extraction modules, the output feature maps are respectively marked as a feature map F4_1 and a feature map F4_2, the feature map F4_1 respectively obtains a feature map F5_2 through direct output, performs down-sampling 2-fold output by using convolution with the step length of 2 to obtain a feature map F5_1, and performs up-sampling 2-fold output by using bicubic interpolation to obtain a feature map F5_ 3;

the feature map F4_1 is up-sampled by 2 times by using a second sub-pixel to output a feature map F5, and then the feature map F5 is directly output to obtain F6_3, down-sampled by 2 times by using convolution with the step length of 2 to obtain F6_2, and down-sampled by 4 times by using convolution with the step length of 4 to obtain F6_ 1;

f4_2 are directly output to obtain F7_1, bicubic interpolation is performed for 2 times to obtain F7_2, and bicubic interpolation is performed for 4 times to obtain F7_ 3; then, after feature fusion is carried out on F5_1, F6_1 and F7_1 which are small-scale, the feature fusion is input into a feature extraction module consisting of 4 residual blocks, and the feature graph is output and amplified by 4 times through an interpolation up-sampling module to output a feature graph F8_ 1; similarly, feature fusion is carried out on feature maps F5_2, F6_2 and F7_2 which are on the same scale, then the feature maps are input into a residual feature extraction module consisting of 4 residual blocks, and the output feature map is amplified by 2 times through an interpolation upsampling module to output F8_ 2; and F5_3, F6_3 and F7_3 with the same large scale are subjected to feature fusion and then input into a residual feature extraction module consisting of 4 residual blocks, and a feature map is output to directly output F8_ 3.

And 4.4, inputting the features with different scales into a fusion output module, and obtaining a reconstructed super-resolution result specifically as follows:

feature fusion is carried out on feature maps F8_1, F8_2 and F8_3 with different scales, and then two convolution layers are used for outputting step-by-step dimensionality reduction to obtain a reconstructed super-resolution image which is marked as I^SR。

The perceptual loss function of step 4.6 is:

the penalty function is:

L_D＝∑_ilog(1-D_η(G_θ(I^LR)))

wherein phi (I)^SR)，φ(I^HR) Representing the characteristic diagram G extracted after the result diagram and the target diagram are respectively subjected to the pre-trained Vgg network_θRepresentative of a generating network, D_ηRepresenting a discriminative network.

The invention has the beneficial effects that:

(1) the used spatial feature transformation layer can realize the reconstruction of a high-resolution image with a rich semantic region only by one-time forward transmission through converting the intermediate features of a single network.

(2) The reconstruction network uses semantic mapping to guide texture recovery for different regions in the high resolution domain, while using probability maps to capture fine texture details.

(3) The cross-scale feature integration module enables the texture features in transmission to be exchanged on each scale, and more effective feature representation is realized, so that the performance of the super-resolution reconstruction algorithm is further improved.

Drawings

FIG. 1 is a comparison graph of the results of example 1-1 in the face super-resolution method of the present invention with spatial feature transformation and cross-scale feature integration;

FIG. 2 is a comparison graph of the results of the face super-resolution method of the present invention in which spatial feature transformation and cross-scale feature integration are performed according to the embodiment 1-2.

Detailed Description

The present invention will be described in detail below with reference to the accompanying drawings and specific embodiments.

The face super-resolution method based on spatial feature transformation and cross-scale feature integration is implemented according to the following steps:

step 1, randomly selecting N human face images from a human face data set, and then preprocessing the human face images to generate a training set and a test set; the method specifically comprises the following steps: randomly selecting 1000 face pictures from CelebA-HQ face data setTaking the image as a training set and 100 images as a test set, adopting a bicubic interpolation algorithm to carry out down-sampling on high-resolution images in the training set, and outputting an interpolation image I with the size of 512 multiplied by 512^HRAs target images of a training set and a test set; downsampling 4 times to 64 x 64 as training and test input image I using also bicubic interpolation^LR(ii) a Then adding I^LRRe-interpolation up-sampling 4 times to 256 times 256 as semantic segmentation network input image I^S。

Step 2, adopting a face analysis pre-training model BisNet as a base network for generating a semantic segmentation probability map, and processing the face image preprocessed in the step 1 to generate the semantic segmentation probability map; the method specifically comprises the following steps:

the face analysis pre-training model BisNet is used as a base network generated by a semantic segmentation probability map, and the output layer of the face analysis pre-training model BisNet is modified, and the method specifically comprises the following steps: adding a softmax function into an output layer of a face analysis pre-training model BisNet, and inputting the semantic segmentation network input image I obtained in the step 1^SInputting the semantic probability output result into a modified face analysis pre-training model BisNet, outputting the semantic probability output result into a pth file, namely a Pythrch model file, and obtaining a semantic segmentation probability map I^Seg；

The step 4 specifically comprises the following steps:

step 4.1, setting training parameters, inputting training and testing into image I^LRTarget image I of training set and test set^HRAnd semantic segmentation probability map I^SegLoading network input end, namely input end of semantic segmentation probability map intermediate condition generation module, and semantic segmentation probability map intermediate condition generation module inputting semantic segmentation probability map I^SegProcessing is carried out to generate a semantic information intermediate condition psi, wherein a semantic segmentation probability map intermediate condition generation module comprises five convolutional layers which are sequentially connected, the number of input channels of the first convolutional layer is 19, the number of output channels is 128, the size of a convolutional kernel is 4 multiplied by 4, the convolution step length is 4, and the negative nonzero slope of a correction linear unit is 0.1; the number of input channels of the second convolutional layer is 128, the number of output channels is 128, the size of a convolutional kernel is 1 multiplied by 1, the convolution step length is 4, and the negative nonzero slope of the modified linear unit is 0.1; the number of input channels of the third convolutional layer is 128, the number of output channels is 128, the size of a convolutional kernel is 1 multiplied by 1, the convolution step is 1, the negative non-zero slope of the modified linear unit is 0.1, the number of input channels of the fourth convolutional layer is 128, the number of output channels is 128, the size of the convolutional kernel is 1 multiplied by 1, and the convolution step is 1; finally, the number of input channels of one convolutional layer is 128, the number of output channels is 32, the size of a convolutional kernel is 1 multiplied by 1, the convolution step length is 1, finally, an intermediate condition containing semantic information is output by one convolutional layer and is marked as psi, and the structural parameters of the module are shown in table 1;

step 4.3, using the intermediate condition psi of the front layer feature map and the semantic information as the input of a spatial feature transformation module, outputting a feature map F1 by the spatial feature transformation module, wherein the spatial feature transformation module consists of 8 residual error units with spatial feature transformation layers, and each residual error unit consists of a spatial feature transformation layer, a convolution layer and a nonlinear activation layer; each residual unit consists of a spatial characteristic transformation layer, a convolution layer and a nonlinear activation layer, and the structure is shown in table 2; the method comprises the following steps that a feature map of a layer above a spatial feature transformation layer and a semantic information intermediate condition psi are used as input, a pair of modulation parameters (gamma, beta) are generated through two groups of convolution inside, and affine transformation of the feature map on the space is achieved through multiplication and addition;

the mathematical description is as follows:

SFT(F|γ，β)＝γ⊙F+β

wherein F represents a feature map whose dimensions are consistent with those of γ and β, which is a dot product operation of elements at corresponding positions of the matrix.

Step 4.4, inputting the output feature map F1 in the step 4.3 into the cross-scale integration module to obtain different scale features, then inputting the different scale features into the fusion output module to obtain a super-resolution image, and recording the super-resolution image as I^SR(ii) a In the cross-scale integration module, the dimension of the output feature map F1 is raised by 4 times through a convolution layer pair, and the feature map F2 is obtained by performing up-sampling on the output feature map F1 by 2 times through sub-pixel convolution; meanwhile, the output feature map F1 is amplified by 2 times through double cubic interpolation and then is fused with the feature map F2 on a channel to obtain a feature map F3_1, and the feature map is transmitted backwards; the feature map F2 is reduced by two times through convolution with a step length of 2, and then is fused with the feature map F1 on a channel, so that a feature map F3_2 is obtained and is transmitted backwards; f3_1 and F3_2 are respectively input into two residual error feature extraction modules, the structure of each residual error block is shown in a table 3, output feature maps are respectively marked as a feature map F4_1 and a feature map F4_2, the feature map F4_1 is respectively output directly to obtain a feature map F5_2, the feature map F5_1 is obtained by performing down-sampling 2 times output by using convolution with the step length of 2, and the feature map F5_3 is obtained by performing up-sampling 2 times output by using bicubic interpolation;

f4_2 are directly output to obtain F7_1, bicubic interpolation is performed for 2 times to obtain F7_2, and bicubic interpolation is performed for 4 times to obtain F7_ 3; then, after feature fusion is carried out on F5_1, F6_1 and F7_1 which are small-scale, the feature fusion is input into a feature extraction module consisting of 4 residual blocks, and the feature graph is output and amplified by 4 times through an interpolation up-sampling module to output a feature graph F8_ 1; similarly, feature fusion is carried out on feature maps F5_2, F6_2 and F7_2 which are on the same scale, then the feature maps are input into a residual feature extraction module consisting of 4 residual blocks, and the output feature map is amplified by 2 times through an interpolation upsampling module to output F8_ 2; performing feature fusion on the large-scale F5_3, F6_3 and F7_3, inputting the fused features into a feature extraction module consisting of 4 residual blocks, and directly outputting a feature map F8_3, wherein the structure of the residual blocks is shown in Table 3;

feature fusion is carried out on feature maps F8_1, F8_2 and F8_3 with different scales, and then two convolution layers are used for outputting step-by-step dimensionality reduction to obtain a reconstructed super-resolution image which is marked as I^SR；

Step 4.5, the super-resolution image I^SRAnd corresponding interpolated image I^HRInput discriminator D_ηThe discrimination information is transmitted back to the generator for generation of the countermeasure network model, i.e. generator G_θ；

Step 4.6, continuously iterating steps 4.4-4.5 to minimize the sum of the confrontation loss and the perception loss, and then taking the corresponding parameters as the trained model parameters to obtain a trained generated confrontation network model, wherein the perception loss function is as follows:

the penalty function is:

L_D＝∑_ilog(1-D_η(G_θ(I^LR)))

The invention sets the training data volume of each step, namely bachsize, to 16, sets iteration 3000 rounds, sets the perception loss weight to 1, and sets the countermeasure loss weight to 10^-4(ii) a Starting training, and obtaining the training of the last round after the training is finishedAnd the obtained parameters are stored into a model file, and in the invention, after all training samples are traversed for 3000 rounds, the lumped loss is verified to be basically unchanged, which indicates that the training can be finished.

TABLE 1

Conv_1\|LeakyRelu	(19,128,4,4)\|LeakyRelu
		Conv_2\|LeakyRelu	(128，128，1，1)\|LeakyRelu
Conv_3\|LeakyRelu	(128，128，1，1)\|LeakyRelu
		Conv_4\|LeakyRelu	(128，128，1，1)\|LeakyRelu
Conv_out	(128,32,1,1)

TABLE 2

As shown in table 2, SFT is a spatial feature transform layer, Scale _ Conv0 and Scale _ Conv1 are two convolution layers, which can be learned to obtain a scaling parameter γ; shift _ Conv0 and Shift _ Conv1 are two convolutional layers that can be learned to obtain the Shift parameter β. The corresponding parameters in brackets respectively represent the number of input feature maps, the number of output feature maps, the size of convolution kernels and the size of step sizes of the layer from left to right.

TABLE 3

Conv	(64，64，3，1，1)
		Relu	\
Conv	(64，64，3，1，1)

As shown in table 3, the module is composed of a convolutional layer, an active layer, and a convolutional layer, and the corresponding parameters in the parentheses represent the number of input feature maps, the number of output feature maps, the size of convolutional kernels, and the step size of the layer, respectively, from left to right.

Examples

In order to generate a human face semantic segmentation probability map more conveniently and compare image details more easily, the human face semantic segmentation probability map generation method adopts a human face high-definition data set CelebA-HQ experimentally, and randomly selects a part of human face images from the human face semantic segmentation probability map to compare results under 4 times of amplification; in addition, in order to better quantify the image quality score and make the image quality score more fit with the sense of human eyes, the invention compares PSNR (peak signal-to-noise ratio) and SSIM (structural similarity), and also calculates the local block similarity and a perception index parameter based on the recommendation of Ma super et al. PSNR values, SSIM values, LPIPS values and PI values obtained by using the existing more advanced technologies such as MSRN (multi-scale residual error network), EDSR (improved deep residual error super resolution network) method, SRFBN (super resolution feedback network) method, SFTGAN (spatial feature transform network) method, ESRGAN (improved super resolution generation countermeasure network) method and the method of the present invention are respectively as follows:

by comparison, the method of the invention is superior to other comparison methods in subjective visual quality as shown in fig. 1 and 2 and objective evaluation indexes, and especially compared with more advanced ESRGAN (improved super-resolution generation countermeasure network), almost the same performance as the method is obtained, but the parameter number of the invention is only 4,604,262, while the parameter number of the ESRGAN (improved super-resolution generation countermeasure network) is 16,697,987.

Claims

1. The face super-resolution method based on spatial feature transformation and cross-scale feature integration is characterized by comprising the following steps:

step 3, constructing a generated confrontation network model for training, wherein the generated confrontation network model comprises a semantic segmentation probability map intermediate condition generation module, a spatial feature transformation module, a cross-scale feature integration module and a fusion output module which are sequentially connected, a sub-pixel convolution layer sampled on an image is introduced into the cross-scale feature integration module, and a confrontation loss function and a perception loss function are introduced into the generated confrontation network model;

2. The face super-resolution method based on spatial feature transformation and cross-scale feature integration according to claim 1, wherein the face data set in step 1 is a CelebA-HQ face data set.

3. The face super-resolution method based on spatial feature transformation and cross-scale feature integration according to claim 1, wherein the preprocessing of the face images in the training set in step 1 specifically comprises: adopting a bicubic interpolation algorithm to carry out down-sampling on the images in the training set, and outputting an interpolation image I with the size of 512 multiplied by 512^HRAs target images of the training set and the test set, and then interpolating the image I^HRDownsampling 4 times to 64 times 64 by adopting bicubic interpolation as training and testing input image I^LR(ii) a Then inputting the image I^LRAdopting double cubic interpolation up-sampling from 4 times to 256 times 256 as semantic segmentation network input image I^S。

4. The face super-resolution method based on spatial feature transformation and cross-scale feature integration according to claim 3, wherein the step 2 specifically comprises:

5. The face super-resolution method based on spatial feature transformation and cross-scale feature integration according to claim 4, wherein the step 4 specifically comprises:

step 4.1, setting training parameters, inputting training and testing into image I^LRTarget image I of training set and test set^HRAnd semantic segmentation probability map I^SegLoading network input end, namely input end of semantic segmentation probability map intermediate condition generation module, wherein the semantic segmentation probability map intermediate condition generation module inputs semantic segmentation probability map I^SegProcessing to generate a semantic information intermediate condition psi;

6. The face super-resolution method based on spatial feature transformation and cross-scale feature integration according to claim 5, wherein the semantic segmentation probability map intermediate condition generation module includes five convolutional layers connected in sequence, the number of input channels of the first convolutional layer is 19, the number of output channels is 128, the size of a convolutional kernel is 4 x 4, the convolutional step is 4, and the negative non-zero slope of the modified linear unit is 0.1; the number of input channels of the second convolutional layer is 128, the number of output channels is 128, the size of a convolutional kernel is 1 multiplied by 1, the convolution step length is 4, and the negative nonzero slope of the modified linear unit is 0.1; the number of input channels of the third convolutional layer is 128, the number of output channels is 128, the size of a convolutional kernel is 1 multiplied by 1, the convolution step is 1, the negative non-zero slope of the modified linear unit is 0.1, the number of input channels of the fourth convolutional layer is 128, the number of output channels is 128, the size of the convolutional kernel is 1 multiplied by 1, and the convolution step is 1; finally, the number of input channels of one convolutional layer is 128, the number of output channels is 32, the size of a convolutional kernel is 1 multiplied by 1, the convolution step length is 1, and finally, an intermediate condition containing semantic information is output by one convolutional layer and is marked as psi;

the spatial feature transformation module is composed of 8 residual error units with spatial feature transformation layers, and each residual error unit is composed of a spatial feature transformation layer, a convolution layer and a nonlinear activation layer.

7. The face super-resolution method based on spatial feature transformation and cross-scale feature integration according to claim 6, wherein the step 4.4 inputs the feature map F1 output in step 4.3 into the cross-scale integration module, and the obtaining of different scale features specifically comprises:

8. The face super-resolution method based on spatial feature transformation and cross-scale feature integration according to claim 7, wherein in step 4.4, features of different scales are input to the fusion output module, and the obtained super-resolution result after reconstruction specifically comprises:

9. The face super-resolution method based on spatial feature transformation and cross-scale feature integration according to claim 8, wherein the perceptual loss function in step 4.6 is:

the penalty function is:

L_D＝∑_ilog(1-D_η(G_θ(I^LR)))