

X2Face: A Network for Controlling Face Generation Using Images, Audio, and Pose Codes

Published: 08 September 2018

Abstract

The objective of this paper is a neural network model that controls the pose and expression of a given face, using another face or modality (e.g. audio). This model can then be used for lightweight, sophisticated video and image editing.
We make the following three contributions. First, we introduce a network, X2Face, that can control a source face (specified by one or more frames) using another face in a driving frame, producing a generated frame with the identity of the source frame but the pose and expression of the face in the driving frame. Second, we propose a method for training the network in a fully self-supervised manner using a large collection of video data. Third, we show that the generation process can be driven by other modalities, such as audio or pose codes, without any further training of the network.
The generation results for driving a face with another face are compared to state-of-the-art self-supervised/supervised methods. We show that our approach is more robust than other methods, as it makes fewer assumptions about the input data. We also show examples of using our framework for video face editing.



Published In

Computer Vision – ECCV 2018: 15th European Conference, Munich, Germany, September 8-14, 2018, Proceedings, Part XIII
Sep 2018, 843 pages
ISBN: 978-3-030-01260-1
DOI: 10.1007/978-3-030-01261-8
Publisher: Springer-Verlag, Berlin, Heidelberg


          Author Tags

          1. Pose Code
          2. Source Face
          3. Source Frame
          4. Head Pose Angles
          5. Leg Cycling


          Cited By

• (2024) GaussianTalker: Speaker-specific Talking Head Synthesis via 3D Gaussian Splatting. Proceedings of the 32nd ACM International Conference on Multimedia, pp. 3548-3557. DOI: 10.1145/3664647.3681675
• (2024) GaussianTalker: Real-Time Talking Head Synthesis with 3D Gaussian Splatting. Proceedings of the 32nd ACM International Conference on Multimedia, pp. 10985-10994. DOI: 10.1145/3664647.3681627
• (2024) AniTalker: Animate Vivid and Diverse Talking Faces through Identity-Decoupled Facial Motion Encoding. Proceedings of the 32nd ACM International Conference on Multimedia, pp. 6696-6705. DOI: 10.1145/3664647.3681198
• (2024) SketchAnim: Real-Time Sketch Animation Transfer from Videos. Proceedings of the ACM SIGGRAPH/Eurographics Symposium on Computer Animation, pp. 1-11. DOI: 10.1111/cgf.15176
• (2024) TalkingGaussian: Structure-Persistent 3D Talking Head Synthesis via Gaussian Splatting. Computer Vision – ECCV 2024, pp. 127-145. DOI: 10.1007/978-3-031-72684-2_8
• (2023) Learning motion refinement for unsupervised face animation. Proceedings of the 37th International Conference on Neural Information Processing Systems, pp. 70483-70496. DOI: 10.5555/3666122.3669210
• (2023) Dynamically masked discriminator for GANs. Proceedings of the 37th International Conference on Neural Information Processing Systems, pp. 23094-23114. DOI: 10.5555/3666122.3667124
• (2023) FaceComposer. Proceedings of the 37th International Conference on Neural Information Processing Systems, pp. 13467-13479. DOI: 10.5555/3666122.3666714
• (2023) Black-box Attack against Self-supervised Video Object Segmentation Models with Contrastive Loss. ACM Transactions on Multimedia Computing, Communications, and Applications 20(2), pp. 1-21. DOI: 10.1145/3617502
• (2023) AniPortraitGAN: Animatable 3D Portrait Generation from 2D Image Collections. SIGGRAPH Asia 2023 Conference Papers, pp. 1-9. DOI: 10.1145/3610548.3618164
