Abstract
Speech-driven 3D facial animation has attracted considerable research interest and is widely used in games and virtual reality. Most recent state-of-the-art methods employ Transformer-based architectures with strong sequence-modeling capability. However, the animations these methods produce are typically tied to a specific facial mesh, and the models cannot handle long audio inputs. To tackle these limitations, we leverage blendshapes to migrate the generated animations to multiple facial meshes and propose an overlapping chunking strategy that enables the model to support long audio inputs. We also design a data calibration approach that significantly improves the quality of the blendshape data and makes lip movements more natural. Experiments show that our method outperforms vertex-predicting methods and that the generated animation can be migrated to various meshes.
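For illustration, the following is a minimal sketch of the overlapping chunking idea described above, assuming a fixed chunk length and a linear cross-fade over the overlapping frames when merging per-chunk blendshape predictions; the function names, chunk sizes, and merging rule are illustrative assumptions, not the exact procedure used in the paper.

import numpy as np

def chunk_audio(wav, sr=16000, chunk_sec=8.0, overlap_sec=1.0):
    # Split a long waveform into fixed-length, overlapping chunks so that a
    # Transformer trained on short windows can process arbitrarily long audio.
    chunk = int(chunk_sec * sr)
    overlap = int(overlap_sec * sr)
    hop = chunk - overlap
    return [wav[s:s + chunk] for s in range(0, max(len(wav) - overlap, 1), hop)]

def merge_predictions(chunk_preds, overlap_frames):
    # Stitch per-chunk blendshape sequences (each of shape [T, n_blendshapes])
    # back together, cross-fading the overlapping frames to avoid visible seams.
    out = chunk_preds[0]
    fade = np.linspace(0.0, 1.0, overlap_frames)[:, None]
    for pred in chunk_preds[1:]:
        blended = (1.0 - fade) * out[-overlap_frames:] + fade * pred[:overlap_frames]
        out = np.concatenate([out[:-overlap_frames], blended, pred[overlap_frames:]], axis=0)
    return out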
Acknowledgement
This work was supported in part by the Shenzhen Technology Project (JCYJ20220531095810023), the National Natural Science Foundation of China (61976143, U21A20487), and the Guangdong-Hong Kong-Macao Joint Laboratory of Human-Machine Intelligence-Synergy Systems (2019B121205007).
Copyright information
© 2024 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.
About this paper
Cite this paper
Chen, J., Ma, X., Wang, L., Cheng, J. (2024). Blendshape-Based Migratable Speech-Driven 3D Facial Animation with Overlapping Chunking-Transformer. In: Liu, Q., et al. Pattern Recognition and Computer Vision. PRCV 2023. Lecture Notes in Computer Science, vol 14426. Springer, Singapore. https://doi.org/10.1007/978-981-99-8432-9_4
DOI: https://doi.org/10.1007/978-981-99-8432-9_4
Publisher Name: Springer, Singapore
Print ISBN: 978-981-99-8431-2
Online ISBN: 978-981-99-8432-9
eBook Packages: Computer Science, Computer Science (R0)