Abstract
Recently, a number of solutions have been proposed that improve the way an emotional dimension is added to speech synthesis. Combined with core neural text-to-speech architectures that reach high naturalness scores, these models can produce natural, human-like speech with clearly discernible emotions and can even model their intensities. To synthesize emotions successfully, the models are trained on hours of emotional data. In practice, however, collecting large amounts of emotional speech per speaker is often difficult and rather expensive. In this article, we investigate the minimal data requirements for expressive text-to-speech solutions to be applicable in practical scenarios and search for an optimal architecture for low-resource training. In particular, we vary the number of training speakers and the amount of data per emotion. We focus on the frequently occurring situation in which a large multi-speaker dataset of neutral recordings and a large single-speaker emotional dataset are available, but only little emotional data exists for the remaining speakers. On top of that, we study the effect of several architecture modifications and training procedures (namely, adversarial training and transfer learning from speaker verification) on the quality of the models as well as on their data requirements. Our results show that transfer learning can lower the data requirement from 15 min per speaker per emotion to just 2.5–7 min while causing no significant change in voice naturalness and yielding high emotion recognition rates. We also show how the data requirements differ from one emotion to another. A demo page illustrating the main findings of this work is available at: https://diparty.github.io/projects/tts/emo/nat.
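To make the two techniques named above more concrete, the sketch below is a minimal, hypothetical PyTorch illustration, not the authors' architecture: a FastSpeech-style acoustic backbone is conditioned on a frozen speaker-verification embedding (transfer learning from speaker verification), while a gradient-reversal classifier applied to the emotion embedding supplies the adversarial signal that discourages speaker information from leaking into it. All module names, dimensions, and the GRU backbone are illustrative assumptions.

# Hypothetical sketch only; module names, sizes, and backbone are assumptions.
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    # Identity in the forward pass; scaled, sign-flipped gradient in the backward pass.
    @staticmethod
    def forward(ctx, x, lambd):
        ctx.lambd = lambd
        return x.clone()

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lambd * grad_output, None

class LowResourceEmoTTS(nn.Module):
    def __init__(self, n_emotions, n_speakers, spk_emb_dim=256, hidden=256, n_mels=80):
        super().__init__()
        # Stand-ins for a FastSpeech 2-style encoder/decoder backbone.
        self.encoder = nn.GRU(hidden, hidden, batch_first=True)
        self.decoder = nn.GRU(hidden, hidden, batch_first=True)
        self.mel_head = nn.Linear(hidden, n_mels)
        # Speaker identity comes from a pre-trained speaker-verification model
        # (e.g. a GE2E d-vector); only this small projection is trained here.
        self.spk_proj = nn.Linear(spk_emb_dim, hidden)
        # Learned emotion embedding table (neutral, angry, happy, sad, ...).
        self.emo_table = nn.Embedding(n_emotions, hidden)
        # Adversarial speaker classifier on the emotion code: the reversed
        # gradient pushes the emotion embedding to be speaker-independent.
        self.spk_adv = nn.Linear(hidden, n_speakers)

    def forward(self, phone_emb, spk_dvector, emo_id, adv_lambda=1.0):
        # phone_emb: (B, T, hidden), spk_dvector: (B, spk_emb_dim), emo_id: (B,)
        enc, _ = self.encoder(phone_emb)
        emo = self.emo_table(emo_id)                              # (B, hidden)
        cond = enc + self.spk_proj(spk_dvector).unsqueeze(1) + emo.unsqueeze(1)
        dec, _ = self.decoder(cond)
        mel = self.mel_head(dec)                                  # (B, T, n_mels)
        spk_logits = self.spk_adv(GradReverse.apply(emo, adv_lambda))
        return mel, spk_logits

Training would then combine a reconstruction loss on mel with a cross-entropy loss on spk_logits against the true speaker label; because of the gradient reversal, minimizing that classification loss makes the emotion embedding less speaker-specific.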
Copyright information
© 2022 Springer Nature Switzerland AG
About this paper
Cite this paper
Nesterenko, A. et al. (2022). Low-Resource Emotional Speech Synthesis: Transfer Learning and Data Requirements. In: Prasanna, S.R.M., Karpov, A., Samudravijaya, K., Agrawal, S.S. (eds.) Speech and Computer. SPECOM 2022. Lecture Notes in Computer Science, vol. 13721. Springer, Cham. https://doi.org/10.1007/978-3-031-20980-2_43
DOI: https://doi.org/10.1007/978-3-031-20980-2_43
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-20979-6
Online ISBN: 978-3-031-20980-2
eBook Packages: Computer Science, Computer Science (R0)