Variational Autoencoder with Global- and Medium Timescale Auxiliaries for Emotion Recognition from Speech

  • Conference paper
  • First Online:
Artificial Neural Networks and Machine Learning – ICANN 2020 (ICANN 2020)

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 12396)


Abstract

Unsupervised learning builds on the idea of self-organization to find hidden patterns and features in data without the need for labels. Variational autoencoders (VAEs) are generative unsupervised learning models that create low-dimensional representations of the input data and learn by reconstructing the input from those representations. Recently, VAEs have been used to extract representations from audio data, which carry not only content-dependent information but also speaker-dependent information such as gender, health status, and speaker identity. VAEs with two timescale variables were subsequently introduced to disentangle these two kinds of information from each other. Our approach introduces a third, medium timescale into the VAE: instead of holding only a global and a local timescale variable, the model holds a global, a medium, and a local variable. We tested the model on three downstream applications: speaker identification, gender classification, and emotion recognition, where each hidden representation performed better on some tasks than the others. Speaker identity and gender were best captured by the global variable, while emotion was best extracted from the medium-timescale variable. Our model achieves excellent results, exceeding state-of-the-art models on speaker identification and emotion regression from audio.
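
To make the three-timescale idea concrete, the sketch below shows a VAE whose encoder produces a per-frame (local), a per-window (medium), and a per-utterance (global) latent variable, and whose decoder reconstructs a mel-spectrogram from all three. This is a minimal illustration only: the GRU encoder, mean-pooling over fixed windows, layer sizes, and loss weighting are assumptions for readability, not the authors' exact architecture (their original code is linked in the Notes below).

```python
# Minimal sketch of a VAE with local, medium, and global timescale latents.
# Input: mel-spectrograms of shape (batch, time, n_mels). All hyperparameters
# (hidden size, latent sizes, pooling window) are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


def reparameterize(mu, logvar):
    # z = mu + sigma * eps (reparameterization trick)
    return mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)


def kl_divergence(mu, logvar):
    # KL(q(z|x) || N(0, I)), summed over latent dims, averaged over the rest
    return (-0.5 * (1 + logvar - mu.pow(2) - logvar.exp())).sum(-1).mean()


class MultiTimescaleVAE(nn.Module):
    def __init__(self, n_mels=80, hidden=256, z_local=32, z_medium=32,
                 z_global=32, window=50):
        super().__init__()
        self.window = window  # frames pooled into one medium-timescale latent
        self.encoder = nn.GRU(n_mels, hidden, batch_first=True)
        # Separate heads produce (mu, logvar) for each timescale.
        self.local_head = nn.Linear(hidden, 2 * z_local)
        self.medium_head = nn.Linear(hidden, 2 * z_medium)
        self.global_head = nn.Linear(hidden, 2 * z_global)
        self.decoder = nn.GRU(z_local + z_medium + z_global, hidden,
                              batch_first=True)
        self.out = nn.Linear(hidden, n_mels)

    def forward(self, x):
        h, _ = self.encoder(x)                       # (B, T, hidden)
        B, T, _ = h.shape

        # Local latent: one per frame.
        mu_l, lv_l = self.local_head(h).chunk(2, dim=-1)
        z_l = reparameterize(mu_l, lv_l)             # (B, T, z_local)

        # Medium latent: mean-pool hidden states over fixed-length windows.
        n_win = (T + self.window - 1) // self.window
        h_pad = F.pad(h, (0, 0, 0, n_win * self.window - T))
        h_win = h_pad.view(B, n_win, self.window, -1).mean(2)
        mu_m, lv_m = self.medium_head(h_win).chunk(2, dim=-1)
        z_m = reparameterize(mu_m, lv_m)             # (B, n_win, z_medium)
        z_m_frames = z_m.repeat_interleave(self.window, dim=1)[:, :T]

        # Global latent: one per utterance, from the mean over all frames.
        mu_g, lv_g = self.global_head(h.mean(1)).chunk(2, dim=-1)
        z_g = reparameterize(mu_g, lv_g)             # (B, z_global)
        z_g_frames = z_g.unsqueeze(1).expand(-1, T, -1)

        # The decoder sees all three timescales at every frame.
        dec_in = torch.cat([z_l, z_m_frames, z_g_frames], dim=-1)
        dec_h, _ = self.decoder(dec_in)
        recon = self.out(dec_h)

        kl = (kl_divergence(mu_l, lv_l) + kl_divergence(mu_m, lv_m)
              + kl_divergence(mu_g, lv_g))
        return recon, kl


if __name__ == "__main__":
    model = MultiTimescaleVAE()
    mels = torch.randn(4, 120, 80)                   # dummy batch
    recon, kl = model(mels)
    loss = F.mse_loss(recon, mels) + 1e-3 * kl       # illustrative KL weight
    print(recon.shape, loss.item())
```

For downstream tasks, the pooled latents would be frozen and fed to small classifiers or regressors: in this sketch, z_g for speaker identity and gender, z_m for emotion, mirroring the split of timescales described in the abstract.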

Supported by Novatec Consulting GmbH, Dieselstrasse 18/1, D-70771 Leinfelden-Echterdingen, and by the German Research Foundation (DFG) under the Transregio Crossmodal Learning (TRR 169).


Notes

  1. https://www2.informatik.uni-hamburg.de/wtm/omgchallenges/omg_emotion2018_results2018.html.

  2. https://github.com/Hussam-Almotlak/voice_analysis.


Author information

Correspondence to Hussam Almotlak, Cornelius Weber, Leyuan Qu or Stefan Wermter.

Copyright information

© 2020 Springer Nature Switzerland AG

About this paper

Cite this paper

Almotlak, H., Weber, C., Qu, L., Wermter, S. (2020). Variational Autoencoder with Global- and Medium Timescale Auxiliaries for Emotion Recognition from Speech. In: Farkaš, I., Masulli, P., Wermter, S. (eds) Artificial Neural Networks and Machine Learning – ICANN 2020. ICANN 2020. Lecture Notes in Computer Science, vol 12396. Springer, Cham. https://doi.org/10.1007/978-3-030-61609-0_42

  • DOI: https://doi.org/10.1007/978-3-030-61609-0_42

  • Published:

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-61608-3

  • Online ISBN: 978-3-030-61609-0

  • eBook Packages: Computer Science, Computer Science (R0)
