Abstract
Generating a talking face video from a given audio clip and an arbitrary face image has many applications in areas such as special visual effects and human–computer interaction. It is a challenging task, as it requires disentangling semantic information from both the input audio clip and the face image and then synthesizing novel animated facial image sequences from the combined semantic features. The desired output video should maintain both visual realism and audio–lip motion consistency. To achieve these two objectives, we propose a coarse-to-fine tree-like architecture that synthesizes realistic talking face frames directly from audio clips. It is followed by a video-to-word regeneration module that translates the synthesized talking videos back to the word space, where they are enforced to align with the input audio. With multi-level facial landmark attention, the proposed audio-to-video-to-words framework generates fine-grained talking face videos that are not only synchronized with the input audio but also preserve visual details from the input face image. Multi-purpose discriminators are also adopted for adversarial learning to further improve both image fidelity and semantic consistency. Extensive experiments on the GRID and LRW datasets demonstrate the advantages of our framework over previous methods in terms of video quality and audio–video synchronization.
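To make the pipeline summarized in the abstract more concrete, the sketch below shows one way such an audio-to-video-to-words framework could be wired together: an audio encoder and an identity encoder feed a coarse-to-fine frame generator, and a video-to-word decoder maps the synthesized frames back to word logits so that a semantic alignment loss against the spoken words can be imposed. This is only a minimal illustrative sketch; all module names, layer sizes, the 64×64 resolution, and the vocabulary size are assumptions, and the paper's multi-level landmark attention and multi-purpose discriminators are not reproduced here.

```python
# Minimal, self-contained PyTorch sketch of the audio-to-video-to-words idea.
# Every module name, layer size, the 64x64 resolution, and the 52-word vocabulary
# are illustrative assumptions, not the authors' architecture.
import torch
import torch.nn as nn


class AudioEncoder(nn.Module):
    """Encodes a mel-spectrogram clip into one feature vector per video frame."""
    def __init__(self, feat_dim=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(80, 128, kernel_size=5, padding=2), nn.ReLU(),
            nn.Conv1d(128, feat_dim, kernel_size=5, padding=2), nn.ReLU(),
        )

    def forward(self, mel):              # mel: (B, 80, T)
        return self.net(mel)             # (B, feat_dim, T)


class IdentityEncoder(nn.Module):
    """Encodes the reference face image into a spatial identity feature map."""
    def __init__(self, feat_dim=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, feat_dim, 4, stride=2, padding=1), nn.ReLU(),
        )

    def forward(self, img):              # img: (B, 3, 64, 64)
        return self.net(img)             # (B, feat_dim, 16, 16)


class CoarseToFineGenerator(nn.Module):
    """Fuses audio and identity features, then upsamples coarse-to-fine to a frame."""
    def __init__(self, feat_dim=256):
        super().__init__()
        self.fuse = nn.Conv2d(feat_dim * 2, feat_dim, kernel_size=1)
        self.coarse = nn.Sequential(     # 16x16 -> 32x32
            nn.ConvTranspose2d(feat_dim, 128, 4, stride=2, padding=1), nn.ReLU())
        self.fine = nn.Sequential(       # 32x32 -> 64x64
            nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 3, 3, padding=1), nn.Tanh())

    def forward(self, audio_vec, id_feat):
        # Broadcast the per-frame audio vector over the spatial identity map.
        a = audio_vec[:, :, None, None].expand(-1, -1, *id_feat.shape[2:])
        x = self.fuse(torch.cat([a, id_feat], dim=1))
        return self.fine(self.coarse(x))                 # (B, 3, 64, 64)


class VideoToWordDecoder(nn.Module):
    """Maps synthesized frames back to word logits so an alignment loss can be applied."""
    def __init__(self, vocab_size=52, feat_dim=256):
        super().__init__()
        self.frame_enc = nn.Sequential(
            nn.Conv2d(3, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, feat_dim))
        self.rnn = nn.GRU(feat_dim, feat_dim, batch_first=True)
        self.cls = nn.Linear(feat_dim, vocab_size)

    def forward(self, frames):           # frames: (B, T, 3, 64, 64)
        b, t = frames.shape[:2]
        f = self.frame_enc(frames.flatten(0, 1)).view(b, t, -1)
        h, _ = self.rnn(f)
        return self.cls(h[:, -1])        # (B, vocab_size) word logits


# Toy forward pass: 5 audio frames and one reference face -> 5 frames -> word logits.
gen, a_enc, i_enc = CoarseToFineGenerator(), AudioEncoder(), IdentityEncoder()
mel, face = torch.randn(1, 80, 5), torch.randn(1, 3, 64, 64)
audio_seq, id_feat = a_enc(mel), i_enc(face)             # (1, 256, 5), (1, 256, 16, 16)
frames = torch.stack([gen(audio_seq[:, :, t], id_feat) for t in range(5)], dim=1)
word_logits = VideoToWordDecoder()(frames)               # (1, 52)
```

In training, the word logits would be compared against the ground-truth spoken words (e.g., with a cross-entropy loss) alongside the usual reconstruction and adversarial losses, which is what enforces the audio–video semantic consistency described above.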
Ethics declarations
Conflict of interest
To the best of our knowledge, the named authors have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Electronic supplementary material
Below are the links to the electronic supplementary material.
Supplementary material 1 (mov 2337 KB)
Supplementary material 2 (mov 120 KB)
Supplementary material 3 (mov 121 KB)
Supplementary material 4 (mov 120 KB)
Supplementary material 5 (mov 120 KB)
Supplementary material 6 (mov 119 KB)
About this article
Cite this article
Huang, X., Wang, M. & Gong, M. Fine-grained talking face generation with video reinterpretation. Vis Comput 37, 95–105 (2021). https://doi.org/10.1007/s00371-020-01982-7