DOI: 10.1145/3462244.3479883
Research article

Audiovisual Speech Synthesis using Tacotron2

Published: 18 October 2021

Abstract

Audiovisual speech synthesis is the problem of synthesizing a talking face while maximizing the coherence of the acoustic and visual speech. To solve this problem, we propose AVTacotron2, an end-to-end text-to-audiovisual speech synthesizer based on the Tacotron2 architecture. AVTacotron2 converts a sequence of phonemes into a sequence of acoustic features and the corresponding controllers of a face model. The output acoustic features are passed through a WaveRNN model to reconstruct the speech waveform, and the waveform and predicted facial controllers are then used together to generate the video of the talking face. As a baseline, we use a modular system in which acoustic speech is synthesized from text using a traditional Tacotron2 model, and the reconstructed speech is then used to drive the controls of the face model through an independently trained audio-to-facial-animation neural network. We further condition both the end-to-end and modular approaches on emotion embeddings that encode the prosody required to generate emotional audiovisual speech. A comprehensive analysis shows that the end-to-end system synthesizes close to human-like audiovisual speech, achieving a mean opinion score (MOS) of 4.1, the same MOS obtained on ground truth generated from professionally recorded videos.
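To make the architecture concrete, the sketch below shows a minimal, simplified model in the spirit of AVTacotron2: a shared phoneme encoder, an emotion embedding, and a decoder with two output heads, one predicting acoustic (mel) features for the WaveRNN vocoder and one predicting face-model (blendshape) controllers. This is not the authors' implementation; all module choices, dimensions, and names (e.g., AVTacotron2Sketch, n_blendshapes=51) are illustrative assumptions, and the autoregressive, attention-based Tacotron2 decoder is replaced here by a plain LSTM.

```python
# Minimal sketch (not the paper's code): a Tacotron2-style encoder with an
# emotion embedding and two decoder heads, one for mel frames and one for
# face-model (blendshape) controllers. Sizes and names are assumptions.
import torch
import torch.nn as nn

class AVTacotron2Sketch(nn.Module):
    def __init__(self, n_phonemes=80, n_emotions=7, n_mels=80, n_blendshapes=51,
                 enc_dim=256, emo_dim=32, dec_dim=512):
        super().__init__()
        self.phoneme_emb = nn.Embedding(n_phonemes, enc_dim)
        self.encoder = nn.LSTM(enc_dim, enc_dim // 2, batch_first=True,
                               bidirectional=True)
        self.emotion_emb = nn.Embedding(n_emotions, emo_dim)
        # A real Tacotron2 decoder is autoregressive with location-sensitive
        # attention; a single LSTM stands in for it in this sketch.
        self.decoder = nn.LSTM(enc_dim + emo_dim, dec_dim, batch_first=True)
        self.mel_head = nn.Linear(dec_dim, n_mels)          # acoustic features
        self.face_head = nn.Linear(dec_dim, n_blendshapes)  # facial controllers

    def forward(self, phonemes, emotion_id):
        enc, _ = self.encoder(self.phoneme_emb(phonemes))       # (B, T, enc_dim)
        emo = self.emotion_emb(emotion_id)                      # (B, emo_dim)
        emo = emo.unsqueeze(1).expand(-1, enc.size(1), -1)      # repeat over time
        dec, _ = self.decoder(torch.cat([enc, emo], dim=-1))
        return self.mel_head(dec), self.face_head(dec)

model = AVTacotron2Sketch()
mels, faces = model(torch.randint(0, 80, (2, 30)), torch.tensor([0, 3]))
print(mels.shape, faces.shape)  # torch.Size([2, 30, 80]) torch.Size([2, 30, 51])
```

In the modular baseline described above, the face head would instead belong to a separate audio-to-facial-animation network trained independently to map synthesized speech features to the same face-model controllers.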

Supplementary Material

M4V File (p503-ICMI_1.m4v)
Supplemental video


Published In

ICMI '21: Proceedings of the 2021 International Conference on Multimodal Interaction
October 2021
876 pages
ISBN: 9781450384810
DOI: 10.1145/3462244
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 18 October 2021


Author Tags

  1. Audiovisual speech
  2. Tacotron2
  3. blendshape coefficients
  4. emotional speech synthesis
  5. speech synthesis

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Conference

ICMI '21: International Conference on Multimodal Interaction
October 18–22, 2021
Montréal, QC, Canada

Acceptance Rates

Overall acceptance rate: 453 of 1,080 submissions (42%)

Article Metrics

  • Downloads (Last 12 months): 36
  • Downloads (Last 6 weeks): 7
Reflects downloads up to 20 Nov 2024

Cited By

  • (2024) Text-to-speech and virtual reality agents in primary school classroom environments. Journal of Computer Assisted Learning, 40(6), 2964–2984. DOI: 10.1111/jcal.13046. Online publication date: 7-Aug-2024.
  • (2024) Enhancing Realism in 3D Facial Animation Using Conformer-Based Generation and Automated Post-Processing. ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 8341–8345. DOI: 10.1109/ICASSP48485.2024.10447526. Online publication date: 14-Apr-2024.
  • (2024) Inclusive Deaf Education Enabled by Artificial Intelligence: The Path to a Solution. International Journal of Artificial Intelligence in Education. DOI: 10.1007/s40593-024-00419-9. Online publication date: 24-Jul-2024.
  • (2024) Adapting Audiovisual Speech Synthesis to Estonian. Text, Speech, and Dialogue, 13–23. DOI: 10.1007/978-3-031-70566-3_2. Online publication date: 9-Sep-2024.
  • (2023) Generation of speech and facial animation with controllable articulatory effort for amusing conversational characters. Proceedings of the 23rd ACM International Conference on Intelligent Virtual Agents, 1–9. DOI: 10.1145/3570945.3607289. Online publication date: 19-Sep-2023.
  • (2023) On the Role of LIP Articulation in Visual Speech Perception. ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 1–5. DOI: 10.1109/ICASSP49357.2023.10096012. Online publication date: 4-Jun-2023.
