Song2Face: Synthesizing Singing Facial Animation from Audio

Published: 17 November 2020

Abstract

We present Song2Face, a deep neural network capable of producing singing facial animation from an input singing voice and a singer label. The architecture is built on our insight that, although facial expression while singing varies between individuals, the singing voice carries valuable information, such as pitch, breath, and vibrato, to which expressions may be attributed. Our network therefore consists of an encoder that extracts relevant vocal features from the audio, and a regression network, conditioned on a singer label, that predicts control parameters for facial animation. In contrast to prior audio-driven speech animation methods, which first map audio to text-level features, we show that vocal features can be learned directly from the singing voice without any explicit constraints. Our network produces movements for all parts of the face, as well as rotational movement of the head itself. Furthermore, stylistic differences in expression between singers are captured via the singer label, so the resulting animation's singing style can be manipulated at test time.
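
To make the encoder-plus-conditioned-regressor design in the abstract concrete, here is a minimal PyTorch-style sketch. Everything in it is an illustrative assumption rather than the authors' implementation: the class names (AudioEncoder, Song2FaceSketch), the mel-spectrogram input, and the dimensions (feat_dim=256, emb_dim=32, n_controls=52, an ARKit-like blendshape count plus head rotation) are all hypothetical.

```python
# Hypothetical sketch of a Song2Face-style pipeline: a vocal-feature
# encoder followed by a regressor conditioned on a learned singer
# embedding. All names, layer choices, and sizes are assumptions.
import torch
import torch.nn as nn

class AudioEncoder(nn.Module):
    """Extracts per-frame vocal features from an audio feature sequence."""
    def __init__(self, n_mels=80, feat_dim=256):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(n_mels, 128, kernel_size=5, padding=2), nn.ReLU(),
            nn.Conv1d(128, feat_dim, kernel_size=5, padding=2), nn.ReLU(),
        )

    def forward(self, mel):                # mel: (batch, n_mels, frames)
        return self.conv(mel)              # (batch, feat_dim, frames)

class Song2FaceSketch(nn.Module):
    """Maps (audio, singer id) to per-frame animation control parameters."""
    def __init__(self, n_singers=8, n_controls=52, feat_dim=256, emb_dim=32):
        super().__init__()
        self.encoder = AudioEncoder(feat_dim=feat_dim)
        self.singer_emb = nn.Embedding(n_singers, emb_dim)
        self.regressor = nn.Sequential(
            nn.Linear(feat_dim + emb_dim, 256), nn.ReLU(),
            nn.Linear(256, n_controls),    # e.g. blendshapes + head rotation
        )

    def forward(self, mel, singer_id):
        feats = self.encoder(mel).transpose(1, 2)   # (batch, frames, feat_dim)
        emb = self.singer_emb(singer_id)            # (batch, emb_dim)
        emb = emb.unsqueeze(1).expand(-1, feats.size(1), -1)
        return self.regressor(torch.cat([feats, emb], dim=-1))

# Usage: ~1 second of hypothetical mel frames, style of singer id 3.
model = Song2FaceSketch()
controls = model(torch.randn(1, 80, 120), torch.tensor([3]))
print(controls.shape)                      # torch.Size([1, 120, 52])
```

Because the singer label enters only through the embedding, swapping the id at test time changes the predicted style while the vocal features stay fixed, which matches the test-time style manipulation the abstract describes.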

Supplementary Material

MP4 File (3410700.3425435.mp4)
Presentation video




Published In

SA '20: SIGGRAPH Asia 2020 Technical Communications
December 2020, 56 pages
ISBN: 9781450380805
DOI: 10.1145/3410700

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 17 November 2020
DOI: 10.1145/3410700.3425435


Author Tags

  1. Facial Animation
  2. Machine Learning
  3. Singing Audio

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Funding Sources

  • JST ACCEL
  • JST-Mirai Program
  • JSPS KAKENHI

Conference

SA '20: SIGGRAPH Asia 2020
December 1–9, 2020
Virtual Event, Republic of Korea

Acceptance Rates

Overall acceptance rate: 178 of 869 submissions (20%)


Cited By

  • (2024) Speed-Aware Audio-Driven Speech Animation using Adaptive Windows. ACM Transactions on Graphics. https://doi.org/10.1145/3691341. Online publication date: 31 August 2024.
  • (2024) SingAvatar: High-fidelity Audio-driven Singing Avatar Synthesis. 2024 IEEE International Conference on Multimedia and Expo (ICME), 1–6. https://doi.org/10.1109/ICME57554.2024.10687925
  • (2024) 3D facial modeling, animation, and rendering for digital humans: A survey. Neurocomputing 598, 128168. https://doi.org/10.1016/j.neucom.2024.128168
  • (2024) UniTalker: Scaling up Audio-Driven 3D Facial Animation Through A Unified Model. Computer Vision – ECCV 2024, 204–221. https://doi.org/10.1007/978-3-031-72940-9_12
  • (2023) MusicFace: Music-driven expressive singing face synthesis. Computational Visual Media 10(1), 119–136. https://doi.org/10.1007/s41095-023-0343-7
  • (2022) VOCAL: Vowel and Consonant Layering for Expressive Animator-Centric Singing Animation. SIGGRAPH Asia 2022 Conference Papers, 1–9. https://doi.org/10.1145/3550469.3555408
