Low-Resource Emotional Speech Synthesis: Transfer Learning and Data Requirements

  • Conference paper
  • First Online:
Speech and Computer (SPECOM 2022)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 13721)

Abstract

Recently, a number of solutions have been proposed that improve the way an emotional aspect is added to speech synthesis. Combined with core neural text-to-speech architectures that reach high naturalness scores, these models can produce natural, human-like speech with well-discernible emotions and can even model their intensities. To synthesize emotions successfully, such models are trained on hours of emotional data. In practice, however, collecting large amounts of emotional speech data per speaker is difficult and rather expensive. In this article, we investigate the minimal data requirements for expressive text-to-speech solutions to be applicable in practical scenarios, and we search for an optimal architecture for low-resource training. In particular, we vary the number of training speakers and the amount of data per emotion. We focus on the frequently occurring situation in which a large multi-speaker dataset of neutral recordings and a large single-speaker emotional dataset are available, but there is little emotional data for the remaining speakers. On top of that, we study the effect of several architecture modifications and training procedures (namely, adversarial training and transfer learning from speaker verification) on the quality of the models as well as on their data requirements. Our results show that transfer learning may lower the data requirements from 15 min per speaker per emotion to just 2.5–7 min, with no significant change in voice naturalness and with high emotion recognition rates. We also show how the data requirements differ from one emotion to another. A demo page illustrating the main findings of this work is available at: https://diparty.github.io/projects/tts/emo/nat.
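The two training procedures named above can be sketched briefly: transfer learning from speaker verification means the acoustic model is conditioned on a fixed embedding produced by a pretrained speaker-verification encoder, while adversarial training uses a gradient-reversal speaker classifier to discourage the rest of the model from re-learning speaker identity. The following PyTorch sketch is purely illustrative and is not the architecture evaluated in the paper; the module names, dimensions, and the simplified GRU encoder are our assumptions.

```python
# Minimal sketch (assumptions, not the paper's implementation):
# - a frozen speaker-verification embedding conditions the acoustic model,
# - a gradient-reversal adversary removes residual speaker information.
import torch
import torch.nn as nn
from torch.autograd import Function


class GradReverse(Function):
    """Identity on the forward pass; reverses and scales gradients on backward."""

    @staticmethod
    def forward(ctx, x, lambd):
        ctx.lambd = lambd
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lambd * grad_output, None


class EmotionalAcousticModel(nn.Module):
    def __init__(self, n_phonemes=100, d_model=256, spk_dim=256,
                 n_emotions=5, n_speakers=10, n_mels=80):
        super().__init__()
        self.phoneme_emb = nn.Embedding(n_phonemes, d_model)
        self.emotion_emb = nn.Embedding(n_emotions, d_model)
        # Project the (frozen) speaker-verification embedding into the model.
        self.spk_proj = nn.Linear(spk_dim, d_model)
        self.encoder = nn.GRU(d_model, d_model, batch_first=True)
        self.mel_head = nn.Linear(d_model, n_mels)
        # Adversarial speaker classifier: the reversed gradient pushes the
        # encoder to drop speaker identity, so identity is supplied only by
        # the external speaker embedding.
        self.spk_classifier = nn.Linear(d_model, n_speakers)

    def forward(self, phonemes, spk_embedding, emotion_id, grl_lambda=1.0):
        x = self.phoneme_emb(phonemes)                        # (B, T, d)
        x = x + self.spk_proj(spk_embedding).unsqueeze(1)     # speaker bias
        x = x + self.emotion_emb(emotion_id).unsqueeze(1)     # emotion bias
        h, _ = self.encoder(x)
        mel = self.mel_head(h)                                # (B, T, n_mels)
        spk_logits = self.spk_classifier(
            GradReverse.apply(h.mean(dim=1), grl_lambda))     # (B, n_speakers)
        return mel, spk_logits


# Usage: spk_embedding would come from a pretrained speaker-verification
# encoder (e.g. a GE2E-style d-vector model) that stays frozen during
# TTS fine-tuning on the small per-speaker emotional data.
model = EmotionalAcousticModel()
phonemes = torch.randint(0, 100, (2, 30))   # dummy phoneme ids
spk_embedding = torch.randn(2, 256)         # stand-in for a d-vector
emotion_id = torch.tensor([0, 3])
mel, spk_logits = model(phonemes, spk_embedding, emotion_id)
```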

Author information

Corresponding author

Correspondence to Yulia Matveeva.

Copyright information

© 2022 Springer Nature Switzerland AG

About this paper

Cite this paper

Nesterenko, A. et al. (2022). Low-Resource Emotional Speech Synthesis: Transfer Learning and Data Requirements. In: Prasanna, S.R.M., Karpov, A., Samudravijaya, K., Agrawal, S.S. (eds) Speech and Computer. SPECOM 2022. Lecture Notes in Computer Science, vol 13721. Springer, Cham. https://doi.org/10.1007/978-3-031-20980-2_43

  • DOI: https://doi.org/10.1007/978-3-031-20980-2_43

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-20979-6

  • Online ISBN: 978-3-031-20980-2

  • eBook Packages: Computer Science, Computer Science (R0)
