

X2Face: A Network for Controlling Face Generation Using Images, Audio, and Pose Codes

Published: 08 September 2018

Abstract

The objective of this paper is a neural network model that controls the pose and expression of a given face, using another face or modality (e.g. audio). This model can then be used for lightweight, sophisticated video and image editing.
We make the following three contributions. First, we introduce a network, X2Face, that can control a source face (specified by one or more frames) using another face in a driving frame, producing a generated frame with the identity of the source frame but the pose and expression of the face in the driving frame. Second, we propose a method for training the network in a fully self-supervised manner using a large collection of video data. Third, we show that the generation process can be driven by other modalities, such as audio or pose codes, without any further training of the network.
The generation results for driving a face with another face are compared to state-of-the-art self-supervised/supervised methods. We show that our approach is more robust than other methods, as it makes fewer assumptions about the input data. We also show examples of using our framework for video face editing.



Published In

Computer Vision – ECCV 2018: 15th European Conference, Munich, Germany, September 8-14, 2018, Proceedings, Part XIII
Sep 2018, 843 pages
ISBN: 978-3-030-01260-1
DOI: 10.1007/978-3-030-01261-8
Publisher: Springer-Verlag, Berlin, Heidelberg


          Author Tags

          1. Pose Code
          2. Source Face
          3. Source Frame
          4. Head Pose Angles
          5. Leg Cycling


          Cited By

• (2024) GaussianTalker: Speaker-specific Talking Head Synthesis via 3D Gaussian Splatting. Proceedings of the 32nd ACM International Conference on Multimedia, pp. 3548-3557. DOI: 10.1145/3664647.3681675
• (2024) GaussianTalker: Real-Time Talking Head Synthesis with 3D Gaussian Splatting. Proceedings of the 32nd ACM International Conference on Multimedia, pp. 10985-10994. DOI: 10.1145/3664647.3681627
• (2024) AniTalker: Animate Vivid and Diverse Talking Faces through Identity-Decoupled Facial Motion Encoding. Proceedings of the 32nd ACM International Conference on Multimedia, pp. 6696-6705. DOI: 10.1145/3664647.3681198
• (2024) SketchAnim: Real-Time Sketch Animation Transfer from Videos. Proceedings of the ACM SIGGRAPH/Eurographics Symposium on Computer Animation, pp. 1-11. DOI: 10.1111/cgf.15176
• (2024) TalkingGaussian: Structure-Persistent 3D Talking Head Synthesis via Gaussian Splatting. Computer Vision – ECCV 2024, pp. 127-145. DOI: 10.1007/978-3-031-72684-2_8
• (2023) Learning motion refinement for unsupervised face animation. Proceedings of the 37th International Conference on Neural Information Processing Systems, pp. 70483-70496. DOI: 10.5555/3666122.3669210
• (2023) Dynamically masked discriminator for GANs. Proceedings of the 37th International Conference on Neural Information Processing Systems, pp. 23094-23114. DOI: 10.5555/3666122.3667124
• (2023) FaceComposer. Proceedings of the 37th International Conference on Neural Information Processing Systems, pp. 13467-13479. DOI: 10.5555/3666122.3666714
• (2023) Black-box Attack against Self-supervised Video Object Segmentation Models with Contrastive Loss. ACM Transactions on Multimedia Computing, Communications, and Applications 20(2), pp. 1-21. DOI: 10.1145/3617502
• (2023) AniPortraitGAN: Animatable 3D Portrait Generation from 2D Image Collections. SIGGRAPH Asia 2023 Conference Papers, pp. 1-9. DOI: 10.1145/3610548.3618164
