Exploring Transfer Learning for Low Resource Emotional TTS

Noé Tits, Kevin El Haddad & Thierry Dutoit

Part of the book series: Advances in Intelligent Systems and Computing (AISC, volume 1037)

Included in the following conference series:

Proceedings of SAI Intelligent Systems Conference

2062 Accesses
18 Citations

Abstract

Over the last few years, spoken language technologies have improved considerably thanks to Deep Learning. However, Deep Learning-based algorithms require amounts of data that are often difficult and costly to gather. In particular, modeling the variability in speech across different speakers, styles, or emotions with little data remains challenging. In this paper, we investigate how to leverage fine-tuning of a pre-trained Deep Learning-based TTS model to synthesize speech from a small dataset of another speaker. We then investigate the possibility of adapting this model to obtain emotional TTS by fine-tuning the neutral TTS model on a small emotional dataset.

Acknowledgments

Noé Tits is funded through a PhD grant from the Fonds pour la Formation à la Recherche dans l’Industrie et l’Agriculture (FRIA), Belgium.

Author information

Authors and Affiliations

Numediart Institute, University of Mons, 7000, Mons, Belgium
Noé Tits, Kevin El Haddad & Thierry Dutoit

Corresponding author

Correspondence to Noé Tits.

Editor information

Editors and Affiliations

School of Computing, Computer Science Research Institute, Ulster University, Newtownabbey, UK
Yaxin Bi
The Science and Information (SAI) Organization, Bradford, West Yorkshire, UK
Rahul Bhatia
The Science and Information (SAI) Organization, Bradford, West Yorkshire, UK
Supriya Kapoor

About this paper

Cite this paper

Tits, N., El Haddad, K., Dutoit, T. (2020). Exploring Transfer Learning for Low Resource Emotional TTS. In: Bi, Y., Bhatia, R., Kapoor, S. (eds) Intelligent Systems and Applications. IntelliSys 2019. Advances in Intelligent Systems and Computing, vol 1037. Springer, Cham. https://doi.org/10.1007/978-3-030-29516-5_5

DOI: https://doi.org/10.1007/978-3-030-29516-5_5
Published: 24 August 2019
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-29515-8
Online ISBN: 978-3-030-29516-5
eBook Packages: Intelligent Technologies and Robotics (R0)
