DOI: 10.1145/3462244.3479883
Research article

Audiovisual Speech Synthesis using Tacotron2

Published: 18 October 2021

Abstract

Audiovisual speech synthesis is the problem of synthesizing a talking face while maximizing the coherence of the acoustic and visual speech. To solve this problem, we propose AVTacotron2, an end-to-end text-to-audiovisual speech synthesizer based on the Tacotron2 architecture. AVTacotron2 converts a sequence of phonemes into a sequence of acoustic features and the corresponding controllers of a face model. The output acoustic features are passed through a WaveRNN model to reconstruct the speech waveform, and the waveform and predicted facial controllers are then used together to generate the video of the talking face. As a baseline, we use a modular system in which acoustic speech is synthesized from text using a traditional Tacotron2 model, and the reconstructed speech is then used to drive the controls of the face model through an independently trained audio-to-facial-animation neural network. We further condition both the end-to-end and modular approaches on emotion embeddings that encode the prosody required to generate emotional audiovisual speech. A comprehensive analysis shows that the end-to-end system synthesizes close to human-like audiovisual speech, achieving a mean opinion score (MOS) of 4.1, the same MOS obtained on ground truth generated from professionally recorded videos.
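To make the architecture concrete, the sketch below shows a minimal, simplified model in the spirit of AVTacotron2: a shared phoneme encoder, an emotion embedding, and a decoder with two output heads, one predicting acoustic (mel) features for the WaveRNN vocoder and one predicting face-model (blendshape) controllers. This is not the authors' implementation; all module choices, dimensions, and names (e.g., AVTacotron2Sketch, n_blendshapes=51) are illustrative assumptions, and the autoregressive, attention-based Tacotron2 decoder is replaced here by a plain LSTM.

```python
# Minimal sketch (not the paper's code): a Tacotron2-style encoder with an
# emotion embedding and two decoder heads, one for mel frames and one for
# face-model (blendshape) controllers. Sizes and names are assumptions.
import torch
import torch.nn as nn

class AVTacotron2Sketch(nn.Module):
    def __init__(self, n_phonemes=80, n_emotions=7, n_mels=80, n_blendshapes=51,
                 enc_dim=256, emo_dim=32, dec_dim=512):
        super().__init__()
        self.phoneme_emb = nn.Embedding(n_phonemes, enc_dim)
        self.encoder = nn.LSTM(enc_dim, enc_dim // 2, batch_first=True,
                               bidirectional=True)
        self.emotion_emb = nn.Embedding(n_emotions, emo_dim)
        # A real Tacotron2 decoder is autoregressive with location-sensitive
        # attention; a single LSTM stands in for it in this sketch.
        self.decoder = nn.LSTM(enc_dim + emo_dim, dec_dim, batch_first=True)
        self.mel_head = nn.Linear(dec_dim, n_mels)          # acoustic features
        self.face_head = nn.Linear(dec_dim, n_blendshapes)  # facial controllers

    def forward(self, phonemes, emotion_id):
        enc, _ = self.encoder(self.phoneme_emb(phonemes))       # (B, T, enc_dim)
        emo = self.emotion_emb(emotion_id)                      # (B, emo_dim)
        emo = emo.unsqueeze(1).expand(-1, enc.size(1), -1)      # repeat over time
        dec, _ = self.decoder(torch.cat([enc, emo], dim=-1))
        return self.mel_head(dec), self.face_head(dec)

model = AVTacotron2Sketch()
mels, faces = model(torch.randint(0, 80, (2, 30)), torch.tensor([0, 3]))
print(mels.shape, faces.shape)  # torch.Size([2, 30, 80]) torch.Size([2, 30, 51])
```

In the modular baseline described above, the face head would instead belong to a separate audio-to-facial-animation network trained independently to map synthesized speech features to the same face-model controllers.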

Supplementary Material

M4V File (p503-ICMI_1.m4v)
Supplemental video


Published In

ICMI '21: Proceedings of the 2021 International Conference on Multimodal Interaction
October 2021
876 pages
ISBN: 9781450384810
DOI: 10.1145/3462244
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 18 October 2021


Author Tags

  1. Audiovisual speech
  2. Tacotron2
  3. blendshape coefficients
  4. emotional speech synthesis
  5. speech synthesis

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Conference

ICMI '21: International Conference on Multimodal Interaction
October 18–22, 2021
Montréal, QC, Canada

Acceptance Rates

Overall acceptance rate: 453 of 1,080 submissions (42%)

Article Metrics

  • Downloads (Last 12 months): 36
  • Downloads (Last 6 weeks): 7
Reflects downloads up to 20 Nov 2024

Cited By

  • (2024) Text-to-speech and virtual reality agents in primary school classroom environments. Journal of Computer Assisted Learning, 40(6), 2964–2984. DOI: 10.1111/jcal.13046. Online publication date: 7-Aug-2024.
  • (2024) Enhancing Realism in 3D Facial Animation Using Conformer-Based Generation and Automated Post-Processing. ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 8341–8345. DOI: 10.1109/ICASSP48485.2024.10447526. Online publication date: 14-Apr-2024.
  • (2024) Inclusive Deaf Education Enabled by Artificial Intelligence: The Path to a Solution. International Journal of Artificial Intelligence in Education. DOI: 10.1007/s40593-024-00419-9. Online publication date: 24-Jul-2024.
  • (2024) Adapting Audiovisual Speech Synthesis to Estonian. Text, Speech, and Dialogue, 13–23. DOI: 10.1007/978-3-031-70566-3_2. Online publication date: 9-Sep-2024.
  • (2023) Generation of speech and facial animation with controllable articulatory effort for amusing conversational characters. Proceedings of the 23rd ACM International Conference on Intelligent Virtual Agents, 1–9. DOI: 10.1145/3570945.3607289. Online publication date: 19-Sep-2023.
  • (2023) On the Role of LIP Articulation in Visual Speech Perception. ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 1–5. DOI: 10.1109/ICASSP49357.2023.10096012. Online publication date: 4-Jun-2023.
