Song2Face: Synthesizing Singing Facial Animation from Audio

Published: 17 November 2020

Abstract

We present Song2Face, a deep neural network capable of producing singing facial animation from an input singing voice and a singer label. The architecture is built on our insight that, although facial expression while singing varies between individuals, the singing voice carries valuable information, such as pitch, breath, and vibrato, to which expressions may be attributed. Our network therefore consists of an encoder that extracts relevant vocal features from the audio, and a regression network, conditioned on a singer label, that predicts control parameters for facial animation. In contrast to prior audio-driven speech animation methods, which first map audio to text-level features, we show that vocal features can be learned directly from the singing voice without any explicit constraints. Our network produces movements for all parts of the face, as well as rotational movement of the head itself. Furthermore, stylistic differences in expression between singers are captured via the singer label, so the resulting animation's singing style can be manipulated at test time.
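
To make the encoder-plus-conditioned-regressor design in the abstract concrete, here is a minimal PyTorch-style sketch. Everything in it is an illustrative assumption rather than the authors' implementation: the class names (AudioEncoder, Song2FaceSketch), the mel-spectrogram input, and the dimensions (feat_dim=256, emb_dim=32, n_controls=52, an ARKit-like blendshape count plus head rotation) are all hypothetical.

```python
# Hypothetical sketch of a Song2Face-style pipeline: a vocal-feature
# encoder followed by a regressor conditioned on a learned singer
# embedding. All names, layer choices, and sizes are assumptions.
import torch
import torch.nn as nn

class AudioEncoder(nn.Module):
    """Extracts per-frame vocal features from an audio feature sequence."""
    def __init__(self, n_mels=80, feat_dim=256):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(n_mels, 128, kernel_size=5, padding=2), nn.ReLU(),
            nn.Conv1d(128, feat_dim, kernel_size=5, padding=2), nn.ReLU(),
        )

    def forward(self, mel):                # mel: (batch, n_mels, frames)
        return self.conv(mel)              # (batch, feat_dim, frames)

class Song2FaceSketch(nn.Module):
    """Maps (audio, singer id) to per-frame animation control parameters."""
    def __init__(self, n_singers=8, n_controls=52, feat_dim=256, emb_dim=32):
        super().__init__()
        self.encoder = AudioEncoder(feat_dim=feat_dim)
        self.singer_emb = nn.Embedding(n_singers, emb_dim)
        self.regressor = nn.Sequential(
            nn.Linear(feat_dim + emb_dim, 256), nn.ReLU(),
            nn.Linear(256, n_controls),    # e.g. blendshapes + head rotation
        )

    def forward(self, mel, singer_id):
        feats = self.encoder(mel).transpose(1, 2)   # (batch, frames, feat_dim)
        emb = self.singer_emb(singer_id)            # (batch, emb_dim)
        emb = emb.unsqueeze(1).expand(-1, feats.size(1), -1)
        return self.regressor(torch.cat([feats, emb], dim=-1))

# Usage: ~1 second of hypothetical mel frames, style of singer id 3.
model = Song2FaceSketch()
controls = model(torch.randn(1, 80, 120), torch.tensor([3]))
print(controls.shape)                      # torch.Size([1, 120, 52])
```

Because the singer label enters only through the embedding, swapping the id at test time changes the predicted style while the vocal features stay fixed, which matches the test-time style manipulation the abstract describes.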

Supplementary Material

MP4 File (3410700.3425435.mp4)
Presentation video




Published In

SA '20: SIGGRAPH Asia 2020 Technical Communications
December 2020, 56 pages
ISBN: 9781450380805
DOI: 10.1145/3410700

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 17 November 2020
DOI: 10.1145/3410700.3425435


Author Tags

  1. Facial Animation
  2. Machine Learning
  3. Singing Audio

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Funding Sources

  • JST ACCEL
  • JST-Mirai Program
  • JSPS KAKENHI

Conference

SA '20: SIGGRAPH Asia 2020
December 1–9, 2020
Virtual Event, Republic of Korea

Acceptance Rates

Overall acceptance rate: 178 of 869 submissions (20%)


Cited By

  • (2024) Speed-Aware Audio-Driven Speech Animation using Adaptive Windows. ACM Transactions on Graphics. https://doi.org/10.1145/3691341. Online publication date: 31 August 2024.
  • (2024) SingAvatar: High-fidelity Audio-driven Singing Avatar Synthesis. 2024 IEEE International Conference on Multimedia and Expo (ICME), 1–6. https://doi.org/10.1109/ICME57554.2024.10687925
  • (2024) 3D facial modeling, animation, and rendering for digital humans: A survey. Neurocomputing 598, 128168. https://doi.org/10.1016/j.neucom.2024.128168
  • (2024) UniTalker: Scaling up Audio-Driven 3D Facial Animation Through A Unified Model. Computer Vision – ECCV 2024, 204–221. https://doi.org/10.1007/978-3-031-72940-9_12
  • (2023) MusicFace: Music-driven expressive singing face synthesis. Computational Visual Media 10(1), 119–136. https://doi.org/10.1007/s41095-023-0343-7
  • (2022) VOCAL: Vowel and Consonant Layering for Expressive Animator-Centric Singing Animation. SIGGRAPH Asia 2022 Conference Papers, 1–9. https://doi.org/10.1145/3550469.3555408
