Abstract
Speech-driven 3D facial animation has attracted considerable research interest and is widely used in games and virtual reality. Most recent state-of-the-art methods employ Transformer-based architectures with strong sequence-modeling capability. However, the animations these methods produce are typically tied to a specific facial mesh, and the models cannot handle long audio inputs. To tackle these limitations, we leverage blendshapes to migrate the generated animations to multiple facial meshes and propose an overlapping chunking strategy that enables the model to support long audio inputs. We also design a data calibration approach that significantly improves the quality of the blendshape data and makes lip movements more natural. Experiments show that our method outperforms vertex-predicting methods and that the generated animation can be migrated to various meshes.
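For illustration, the following is a minimal sketch of the overlapping chunking idea described above, assuming a fixed chunk length and a linear cross-fade over the overlapping frames when merging per-chunk blendshape predictions; the function names, chunk sizes, and merging rule are illustrative assumptions, not the exact procedure used in the paper.

import numpy as np

def chunk_audio(wav, sr=16000, chunk_sec=8.0, overlap_sec=1.0):
    # Split a long waveform into fixed-length, overlapping chunks so that a
    # Transformer trained on short windows can process arbitrarily long audio.
    chunk = int(chunk_sec * sr)
    overlap = int(overlap_sec * sr)
    hop = chunk - overlap
    return [wav[s:s + chunk] for s in range(0, max(len(wav) - overlap, 1), hop)]

def merge_predictions(chunk_preds, overlap_frames):
    # Stitch per-chunk blendshape sequences (each of shape [T, n_blendshapes])
    # back together, cross-fading the overlapping frames to avoid visible seams.
    out = chunk_preds[0]
    fade = np.linspace(0.0, 1.0, overlap_frames)[:, None]
    for pred in chunk_preds[1:]:
        blended = (1.0 - fade) * out[-overlap_frames:] + fade * pred[:overlap_frames]
        out = np.concatenate([out[:-overlap_frames], blended, pred[overlap_frames:]], axis=0)
    return out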
Acknowledgement
This work was supported in part by the Shenzhen Technology Project (JCYJ20220531095810023), the National Natural Science Foundation of China (61976143, U21A20487), and the Guangdong-Hong Kong-Macao Joint Laboratory of Human-Machine Intelligence-Synergy Systems (2019B121205007).
Copyright information
© 2024 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.
About this paper
Cite this paper
Chen, J., Ma, X., Wang, L., Cheng, J. (2024). Blendshape-Based Migratable Speech-Driven 3D Facial Animation with Overlapping Chunking-Transformer. In: Liu, Q., et al. Pattern Recognition and Computer Vision. PRCV 2023. Lecture Notes in Computer Science, vol 14426. Springer, Singapore. https://doi.org/10.1007/978-981-99-8432-9_4
DOI: https://doi.org/10.1007/978-981-99-8432-9_4
Publisher Name: Springer, Singapore
Print ISBN: 978-981-99-8431-2
Online ISBN: 978-981-99-8432-9
eBook Packages: Computer Science, Computer Science (R0)