Abstract
Recently, a number of solutions have been proposed that improve the way an emotional dimension is added to speech synthesis. Combined with core neural text-to-speech architectures that reach high naturalness scores, these models can produce natural, human-like speech with clearly discernible emotions and can even model their intensities. To synthesize emotions successfully, the models are trained on hours of emotional data. In practice, however, collecting large amounts of emotional speech per speaker is often difficult and rather expensive. In this article, we investigate the minimal data requirements for expressive text-to-speech solutions to be applicable in practical scenarios and search for an optimal architecture for low-resource training. In particular, we vary the number of training speakers and the amount of data per emotion. We focus on the frequently occurring situation in which a large multi-speaker dataset of neutral recordings and a large single-speaker emotional dataset are available, but only little emotional data exists for the remaining speakers. On top of that, we study the effect of several architecture modifications and training procedures (namely, adversarial training and transfer learning from speaker verification) on the quality of the models as well as on their data requirements. Our results show that transfer learning can lower the data requirement from 15 min per speaker per emotion to just 2.5–7 min while causing no significant change in voice naturalness and yielding high emotion recognition rates. We also show how the data requirements differ from one emotion to another. A demo page illustrating the main findings of this work is available at: https://diparty.github.io/projects/tts/emo/nat.
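To make the two techniques named above more concrete, the sketch below is a minimal, hypothetical PyTorch illustration, not the authors' architecture: a FastSpeech-style acoustic backbone is conditioned on a frozen speaker-verification embedding (transfer learning from speaker verification), while a gradient-reversal classifier applied to the emotion embedding supplies the adversarial signal that discourages speaker information from leaking into it. All module names, dimensions, and the GRU backbone are illustrative assumptions.

# Hypothetical sketch only; module names, sizes, and backbone are assumptions.
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    # Identity in the forward pass; scaled, sign-flipped gradient in the backward pass.
    @staticmethod
    def forward(ctx, x, lambd):
        ctx.lambd = lambd
        return x.clone()

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lambd * grad_output, None

class LowResourceEmoTTS(nn.Module):
    def __init__(self, n_emotions, n_speakers, spk_emb_dim=256, hidden=256, n_mels=80):
        super().__init__()
        # Stand-ins for a FastSpeech 2-style encoder/decoder backbone.
        self.encoder = nn.GRU(hidden, hidden, batch_first=True)
        self.decoder = nn.GRU(hidden, hidden, batch_first=True)
        self.mel_head = nn.Linear(hidden, n_mels)
        # Speaker identity comes from a pre-trained speaker-verification model
        # (e.g. a GE2E d-vector); only this small projection is trained here.
        self.spk_proj = nn.Linear(spk_emb_dim, hidden)
        # Learned emotion embedding table (neutral, angry, happy, sad, ...).
        self.emo_table = nn.Embedding(n_emotions, hidden)
        # Adversarial speaker classifier on the emotion code: the reversed
        # gradient pushes the emotion embedding to be speaker-independent.
        self.spk_adv = nn.Linear(hidden, n_speakers)

    def forward(self, phone_emb, spk_dvector, emo_id, adv_lambda=1.0):
        # phone_emb: (B, T, hidden), spk_dvector: (B, spk_emb_dim), emo_id: (B,)
        enc, _ = self.encoder(phone_emb)
        emo = self.emo_table(emo_id)                              # (B, hidden)
        cond = enc + self.spk_proj(spk_dvector).unsqueeze(1) + emo.unsqueeze(1)
        dec, _ = self.decoder(cond)
        mel = self.mel_head(dec)                                  # (B, T, n_mels)
        spk_logits = self.spk_adv(GradReverse.apply(emo, adv_lambda))
        return mel, spk_logits

Training would then combine a reconstruction loss on mel with a cross-entropy loss on spk_logits against the true speaker label; because of the gradient reversal, minimizing that classification loss makes the emotion embedding less speaker-specific.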
Copyright information
© 2022 Springer Nature Switzerland AG
About this paper
Cite this paper
Nesterenko, A. et al. (2022). Low-Resource Emotional Speech Synthesis: Transfer Learning and Data Requirements. In: Prasanna, S.R.M., Karpov, A., Samudravijaya, K., Agrawal, S.S. (eds.) Speech and Computer. SPECOM 2022. Lecture Notes in Computer Science, vol. 13721. Springer, Cham. https://doi.org/10.1007/978-3-031-20980-2_43
DOI: https://doi.org/10.1007/978-3-031-20980-2_43
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-20979-6
Online ISBN: 978-3-031-20980-2
eBook Packages: Computer Science, Computer Science (R0)