DOI: 10.1007/978-3-030-58545-7_3
Article

Talking-Head Generation with Rhythmic Head Motion

Published: 23 August 2020

Abstract

When people deliver a speech, they naturally move their heads, and this rhythmic head motion conveys prosodic information. However, generating a lip-synced video with natural head movement is challenging. While remarkably successful, existing works either generate still talking-face videos or rely on landmarks or video frames as sparse or dense mapping guidance to generate head movements, which leads to unrealistic or uncontrollable video synthesis. To overcome these limitations, we propose a 3D-aware generative network along with a hybrid embedding module and a non-linear composition module. By explicitly modeling head motion and facial expressions (in our setting, facial expression means facial movement, e.g., blinks and lip and chin movements), carefully manipulating 3D animation, and dynamically embedding reference images, our approach produces controllable, photo-realistic, and temporally coherent talking-head videos with natural head movements. Experiments on several standard benchmarks demonstrate that our method achieves significantly better results than state-of-the-art methods in both quantitative and qualitative comparisons. The code is available at https://github.com/lelechen63/Talking-head-Generation-with-Rhythmic-Head-Motion.


Published In

Computer Vision – ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part IX
Aug 2020
860 pages
ISBN:978-3-030-58544-0
DOI:10.1007/978-3-030-58545-7

Publisher

Springer-Verlag

Berlin, Heidelberg

Qualifiers

  • Article

Cited By

  • (2024) Multimodal Fusion for Talking Face Generation Utilizing Speech-Related Facial Action Units. ACM Transactions on Multimedia Computing, Communications, and Applications 20:9 (1-24). DOI: 10.1145/3672565. Online publication date: 17-Jun-2024
  • (2024) FD2Talk: Towards Generalized Talking Head Generation with Facial Decoupled Diffusion Model. Proceedings of the 32nd ACM International Conference on Multimedia (3411-3420). DOI: 10.1145/3664647.3681238. Online publication date: 28-Oct-2024
  • (2024) Dynamic Mixed-Prototype Model for Incremental Deepfake Detection. Proceedings of the 32nd ACM International Conference on Multimedia (8129-8138). DOI: 10.1145/3664647.3680895. Online publication date: 28-Oct-2024
  • (2024) KMTalk: Speech-Driven 3D Facial Animation with Key Motion Embedding. Computer Vision – ECCV 2024 (236-253). DOI: 10.1007/978-3-031-72992-8_14. Online publication date: 29-Sep-2024
  • (2024) EDTalk: Efficient Disentanglement for Emotional Talking Head Synthesis. Computer Vision – ECCV 2024 (398-416). DOI: 10.1007/978-3-031-72658-3_23. Online publication date: 29-Sep-2024
  • (2024) Make Audio Solely Drive Lip in Talking Face Video Synthesis. Artificial Neural Networks and Machine Learning – ICANN 2024 (349-360). DOI: 10.1007/978-3-031-72338-4_24. Online publication date: 17-Sep-2024
  • (2023) Autoregressive GAN for Semantic Unconditional Head Motion Generation. ACM Transactions on Multimedia Computing, Communications, and Applications. DOI: 10.1145/3635154. Online publication date: 6-Dec-2023
  • (2023) Learning and Evaluating Human Preferences for Conversational Head Generation. Proceedings of the 31st ACM International Conference on Multimedia (9615-9619). DOI: 10.1145/3581783.3612831. Online publication date: 26-Oct-2023
  • (2023) MFR-Net: Multi-faceted Responsive Listening Head Generation via Denoising Diffusion Model. Proceedings of the 31st ACM International Conference on Multimedia (6734-6743). DOI: 10.1145/3581783.3612123. Online publication date: 26-Oct-2023
  • (2023) Talking Face Generation via Facial Anatomy. ACM Transactions on Multimedia Computing, Communications, and Applications 19:3 (1-19). DOI: 10.1145/3571746. Online publication date: 25-Feb-2023
