Abstract
Despite recent advances in semantic manipulation using StyleGAN, semantic editing of real faces remains challenging. The gap between the W space and the W+ space demands an undesirable trade-off between reconstruction quality and editing quality. To solve this problem, we propose to expand the latent space by replacing the fully connected layers in StyleGAN's mapping network with attention-based transformers. This simple yet effective technique integrates the two spaces mentioned above into a single new latent space, called W++. Our modified StyleGAN maintains the state-of-the-art generation quality of the original StyleGAN with moderately better diversity. More importantly, the proposed W++ space achieves superior performance in both reconstruction quality and editing quality. Beyond these advantages, the W++ space supports existing inversion algorithms and editing methods with only negligible modifications, thanks to its structural similarity to the W/W+ space. Extensive experiments on the FFHQ dataset demonstrate that the proposed W++ space is clearly preferable to the previous W/W+ spaces for real face editing. The code is publicly available for research purposes at https://github.com/AnonSubm2021/TransStyleGAN.
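To make the architectural change described above concrete, the following is a minimal PyTorch sketch of a transformer-based mapping network that emits one mutually attending latent code per synthesis layer, in the spirit of the W++ design. All names, depths, and dimensions here are illustrative assumptions, not the authors' exact configuration.

```python
# A minimal sketch (not the authors' exact architecture): the fully
# connected mapping network is replaced by a transformer encoder that
# produces one latent token per synthesis layer, so the per-layer codes
# can attend to one another instead of being independent (W+) or tied (W).
import torch
import torch.nn as nn

class TransformerMapping(nn.Module):
    def __init__(self, z_dim=512, w_dim=512, n_styles=18, depth=4, heads=8):
        super().__init__()
        # Project z once, then expand it to one token per synthesis layer.
        self.proj = nn.Linear(z_dim, w_dim)
        # Learned per-layer positional embeddings distinguish the tokens.
        self.pos = nn.Parameter(torch.zeros(n_styles, w_dim))
        layer = nn.TransformerEncoderLayer(d_model=w_dim, nhead=heads,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)

    def forward(self, z):
        # z: (batch, z_dim) -> w_pp: (batch, n_styles, w_dim)
        tokens = self.proj(z).unsqueeze(1) + self.pos.unsqueeze(0)
        return self.encoder(tokens)  # one attending latent per style layer

w_pp = TransformerMapping()(torch.randn(4, 512))  # (4, 18, 512), a "W++" code
```

Because the output keeps the (n_styles, 512) shape of a W+ code, existing inversion and editing pipelines can consume it with only negligible changes, as noted above.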
Data availability
The datasets generated and analyzed during the current study are available in the repository, https://github.com/AnonSubm2021/TransStyleGAN.
Notes
Our work builds upon the PyTorch implementation of StyleGANv2 by rosinality, which is publicly available at https://github.com/rosinality/stylegan2-pytorch.
The code is publicly available at https://github.com/clovaai/generative-evaluation-prdc.
The best FID score reported for rosinality's PyTorch implementation at \(256 \times 256\) resolution is 4.5, while the best FID score we have achieved after multiple runs at the same resolution is 4.69. This performance gap does not affect our findings, however, because the training code is identical.
The code is publicly available at https://github.com/eladrich/pixel2style2pixel.
The code is publicly available at https://github.com/genforce/interfacegan.
The code is publicly available at https://github.com/siriusdemon/pytorch-DEX.
The code is publicly available at https://github.com/sicxu/Deep3DFaceRecon_pytorch.
The code is publicly available at https://github.com/Juyong/3DFace.
References
Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Commun. ACM 63(11), 139–144 (2020)
Karras, T., Laine, S., Aila, T.: A style-based generator architecture for generative adversarial networks. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 4401–4410 (2019)
Karras, T., Laine, S., Aittala, M., Hellsten, J., Lehtinen, J., Aila, T.: Analyzing and improving the image quality of stylegan. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 8110–8119 (2020)
Shen, Y., Gu, J., Tang, X., Zhou, B.: Interpreting the latent space of gans for semantic face editing. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9243–9252 (2020)
Shen, Y., Yang, C., Tang, X., Zhou, B.: Interfacegan: interpreting the disentangled face representation learned by gans. IEEE Trans. Pattern Anal. Mach. Intell. (2020)
Härkönen, E., Hertzmann, A., Lehtinen, J., Paris, S.: Ganspace: discovering interpretable gan controls. Adv. Neural Inf. Process. Syst. 33, 9841–9850 (2020)
Collins, E., Bala, R., Price, B., Susstrunk, S.: Editing in style: uncovering the local semantics of gans. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5771–5780 (2020)
Shoshan, A., Bhonker, N., Kviatkovsky, I., Medioni, G.: Gan-control: explicitly controllable gans. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 14083–14093 (2021)
Su, W., Ye, H., Chen, S.-Y., Gao, L., Fu, H.: Drawinginstyles: portrait image generation and editing with spatially conditioned stylegan. IEEE Trans. Vis. Comput. Graph. (2022)
Shi, Y., Yang, X., Wan, Y., Shen, X.: Semanticstylegan: learning compositional generative priors for controllable image synthesis and editing. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11254–11264 (2022)
Abdal, R., Qin, Y., Wonka, P.: Image2stylegan++: how to edit the embedded images? In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 8296–8305 (2020)
Zhu, J., Shen, Y., Zhao, D., Zhou, B.: In-domain gan inversion for real image editing. In: European Conference on Computer Vision, pp. 592–608. Springer (2020)
Abdal, R., Zhu, P., Mitra, N.J., Wonka, P.: Styleflow: attribute-conditioned exploration of stylegan-generated images using conditional continuous normalizing flows. ACM Trans. Graph. (ToG) 40(3), 1–21 (2021)
Tewari, A., Elgharib, M., Bernard, F., Seidel, H.-P., Pérez, P., Zollhöfer, M., Theobalt, C.: Pie: portrait image embedding for semantic control. ACM Trans. Graph. (TOG) 39(6), 1–14 (2020)
Hou, X., Zhang, X., Liang, H., Shen, L., Lai, Z., Wan, J.: Guidedstyle: attribute knowledge guided style manipulation for semantic face editing. Neural Netw. 145, 209–220 (2022)
Abdal, R., Qin, Y., Wonka, P.: Image2stylegan: how to embed images into the stylegan latent space? In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4432–4441 (2019)
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Adv. Neural Inf. Process. Syst. 30 (2017)
Richardson, E., Alaluf, Y., Patashnik, O., Nitzan, Y., Azar, Y., Shapiro, S., Cohen-Or, D.: Encoding in style: a stylegan encoder for image-to-image translation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2287–2296 (2021)
Mirza, M., Osindero, S.: Conditional generative adversarial nets. arXiv preprint arXiv:1411.1784 (2014)
Xia, W., Zhang, Y., Yang, Y., Xue, J.-H., Zhou, B., Yang, M.-H.: Gan inversion: a survey. IEEE Trans. Pattern Anal. Mach. Intell. (2022)
robertluxemburg: Git repository: stylegan2encoder (2020). https://github.com/robertluxemburg/stylegan2encoder
Tov, O., Alaluf, Y., Nitzan, Y., Patashnik, O., Cohen-Or, D.: Designing an encoder for stylegan image manipulation. ACM Trans. Graph. (TOG) 40(4), 1–14 (2021)
Alaluf, Y., Patashnik, O., Cohen-Or, D.: Restyle: a residual-based stylegan encoder via iterative refinement. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) (2021)
Roich, D., Mokady, R., Bermano, A.H., Cohen-Or, D.: Pivotal tuning for latent-based editing of real images. ACM Trans. Graph. (2021)
Alaluf, Y., Tov, O., Mokady, R., Gal, R., Bermano, A.: Hyperstyle: stylegan inversion with hypernetworks for real image editing. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 18511–18521 (2022)
Tewari, A., Elgharib, M., Bharaj, G., Bernard, F., Seidel, H.-P., Pérez, P., Zollhofer, M., Theobalt, C.: Stylerig: rigging stylegan for 3d control over portrait images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 6142–6151 (2020)
Ju, Y., Zhang, J., Mao, X., Xu, J.: Adaptive semantic attribute decoupling for precise face image editing. Vis. Comput. 37(9), 2907–2918 (2021)
Lin, C., Xiong, S., Lu, X.: Disentangled face editing via individual walk in personalized facial semantic field. Vis. Comput. 1–10 (2022)
Shen, Y., Zhou, B.: Closed-form factorization of latent semantics in gans. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1532–1540 (2021)
Zhu, P., Abdal, R., Qin, Y., Wonka, P.: Improved stylegan embedding: where are the good latents? arXiv preprint arXiv:2012.09036 (2020)
Liu, Y., Li, Q., Sun, Z., Tan, T.: Style intervention: How to achieve spatial disentanglement with style-based generators? arXiv preprint arXiv:2011.09699 (2020)
Wu, Z., Lischinski, D., Shechtman, E.: Stylespace analysis: disentangled controls for stylegan image generation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 12863–12872 (2021)
Zhang, B., Gu, S., Zhang, B., Bao, J., Chen, D., Wen, F., Wang, Y., Guo, B.: Styleswin: transformer-based gan for high-resolution image generation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11304–11314 (2022)
Xu, Y., Yin, Y., Jiang, L., Wu, Q., Zheng, C., Loy, C.C., Dai, B., Wu, W.: Transeditor: transformer-based dual-space gan for highly controllable facial editing. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 7683–7692 (2022)
Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N.: An image is worth 16x16 words: transformers for image recognition at scale. In: International Conference on Learning Representations (2021)
Heusel, M., Ramsauer, H., Unterthiner, T., Nessler, B., Hochreiter, S.: Gans trained by a two time-scale update rule converge to a local nash equilibrium. Adv. Neural Inf. Process. Syst. 30 (2017)
Kynkäänniemi, T., Karras, T., Laine, S., Lehtinen, J., Aila, T.: Improved precision and recall metric for assessing generative models. Adv. Neural Inf. Process. Syst. 32 (2019)
Naeem, M.F., Oh, S.J., Uh, Y., Choi, Y., Yoo, J.: Reliable fidelity and diversity metrics for generative models. In: International Conference on Machine Learning, pp. 7176–7185. PMLR (2020)
Zhang, R., Isola, P., Efros, A.A., Shechtman, E., Wang, O.: The unreasonable effectiveness of deep features as a perceptual metric. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 586–595 (2018)
Rothe, R., Timofte, R., Van Gool, L.: Dex: deep expectation of apparent age from a single image. In: IEEE International Conference on Computer Vision Workshops (ICCVW) (2015)
Deng, Y., Yang, J., Xu, S., Chen, D., Jia, Y., Tong, X.: Accurate 3d face reconstruction with weakly-supervised learning: from single image to image set. In: IEEE Computer Vision and Pattern Recognition Workshops (2019)
Paysan, P., Knothe, R., Amberg, B., Romdhani, S., Vetter, T.: A 3D face model for pose and illumination invariant face recognition. In: IEEE International Conference on Advanced Video and Signal Based Surveillance, pp. 296–301. IEEE, Genova (2009)
Guo, Y., Zhang, J., Cai, J., Jiang, B., Zheng, J.: Cnn-based real-time dense face reconstruction with inverse-rendered photo-realistic face images. IEEE Trans. Pattern Anal. Mach. Intell. 41(6), 1294–1307 (2019)
Acknowledgements
This research was partially supported by NSF grants IIS 1527200 and 1941613. We would like to express our gratitude to Jun Fu, Jiayi Liu, Shen Wang, Zhihang Li, and Jie Yang for their early feedback and discussions.
Ethics declarations
Conflict of interest
The authors declare that they do not have any commercial or associative interest that represents a conflict of interest in connection with the work submitted.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Appendices
Appendix A Additional results for age and gender manipulation using InterfaceGAN
Two more examples of age transformation are provided in Figs. 10 and 11, and Figs. 12 and 13 exhibit two extra examples of gender transitioning. As the editing distance between the latent code and the classification boundary increases, edited images in the W and W+ spaces deteriorate significantly. For gender manipulation in the W space in particular, the race attribute intertwines with the gender attribute, producing an unintended transition from White to Asian. Our proposed W++ space, on the contrary, consistently preserves untargeted attributes even under long-distance manipulation. A sketch of the linear editing operation underlying these comparisons follows.
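As a concrete illustration of the editing distance discussed above, here is a hedged sketch of InterfaceGAN-style linear editing: shifting a latent code along the unit normal of a pretrained attribute boundary. The names `edit_latent` and `boundary` are illustrative placeholders; the actual boundaries come from the interfacegan repository cited in the Notes.

```python
# A minimal sketch of InterfaceGAN-style linear editing, assuming a
# precomputed unit boundary normal; function and variable names here
# are illustrative, not taken from the cited codebase.
import torch

def edit_latent(w: torch.Tensor, boundary: torch.Tensor,
                distance: float) -> torch.Tensor:
    """Shift latent code w by `distance` along the attribute boundary normal.

    w:        (n_styles, 512) latent code in the W, W+, or W++ space
    boundary: (512,) unit normal of the separating hyperplane
    distance: signed editing distance; larger magnitudes strengthen the
              edit but, as Figs. 10-13 show, degrade W/W+ results
    """
    return w + distance * boundary.unsqueeze(0)

# Sweeping the distance reproduces the progressions shown in the figures:
# edits = [edit_latent(w, age_boundary, d) for d in (-3.0, -1.5, 0.0, 1.5, 3.0)]
```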
Appendix B Additional results for real image editing using the cGAN-based pipeline
Figures 14 and 15 show more comparisons of manipulating the smile attribute in real images using our proposed cGAN-based editing pipeline in different latent spaces. Edited images in the W or the W+ space exhibit either limited or imperceptible editing effects. In contrast, our proposed W++ space produces the most natural smile expressions and evidently outperforms both the W and the W+ space.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Li, H., Liu, J., Zhang, X. et al. Transforming the latent space of StyleGAN for real face editing. Vis Comput 40, 3553–3568 (2024). https://doi.org/10.1007/s00371-023-03051-1