Abstract
Automatic recognition of human emotions is of high importance in human-computer interaction (HCI) because of its applications in real-world tasks. Numerous studies have addressed emotion recognition using a variety of sensors, feature extraction methods, and classification techniques, with results reported for audio, vision, text, and biosensors. Although significant improvements have been achieved on acted emotion data, performance remains low because real data are scarce and available datasets are small. To address this problem, this study investigates data augmentation based on Generative Adversarial Networks (GANs). For classification, the Vision Transformer (ViT) is used; ViT was originally proposed for image classification and is adapted here to emotion recognition. The proposed methods were evaluated on the English IEMOCAP and the Japanese JTES speech corpora and showed significant improvements when data augmentation was applied.
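To make the pipeline described above more concrete, the following is a minimal sketch of how a ViT-style classifier can be applied to fixed-size mel-spectrograms treated as single-channel images. All module names, the patch size, model dimensions, and the four-class output are illustrative assumptions, not the authors' configuration, and the GAN-based augmentation step is omitted.

```python
# Illustrative sketch only: a small ViT-style encoder over mel-spectrogram
# "images" for speech emotion recognition. Sizes and hyperparameters are
# assumptions for demonstration, not the configuration used in the paper.
import torch
import torch.nn as nn


class SpectrogramViT(nn.Module):
    def __init__(self, n_mels=128, n_frames=128, patch=16,
                 dim=192, depth=4, heads=3, n_classes=4):
        super().__init__()
        assert n_mels % patch == 0 and n_frames % patch == 0
        n_patches = (n_mels // patch) * (n_frames // patch)
        # Patch embedding: split the spectrogram into non-overlapping
        # patch x patch tiles and project each tile to a `dim`-dim token.
        self.to_patches = nn.Conv2d(1, dim, kernel_size=patch, stride=patch)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, n_patches + 1, dim))
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                           dim_feedforward=4 * dim,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        self.head = nn.Linear(dim, n_classes)

    def forward(self, spec):               # spec: (B, 1, n_mels, n_frames)
        x = self.to_patches(spec)          # (B, dim, H', W')
        x = x.flatten(2).transpose(1, 2)   # (B, n_patches, dim)
        cls = self.cls_token.expand(x.size(0), -1, -1)
        x = torch.cat([cls, x], dim=1) + self.pos_embed
        x = self.encoder(x)
        return self.head(x[:, 0])          # classify from the [CLS] token


# Toy usage: a batch of two fixed-size mel-spectrograms, four emotion classes.
logits = SpectrogramViT()(torch.randn(2, 1, 128, 128))
print(logits.shape)  # torch.Size([2, 4])
```

In the setting described in the abstract, such a classifier would be trained on both the original utterances and additional GAN-generated spectrograms, which is where the data augmentation enters; that training loop is not shown here.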
Acknowledgments
This work was supported by Council for Science, Technology and Innovation, “Cross-ministerial Strategic Innovation Promotion Program (SIP), Big-data and AI-enabled Cyberspace Technologies” (funding agency: NEDO).
Copyright information
© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG
Cite this paper
Heracleous, P., Fukayama, S., Ogata, J., Mohammad, Y. (2022). Applying Generative Adversarial Networks and Vision Transformers in Speech Emotion Recognition. In: Kurosu, M., et al. HCI International 2022 - Late Breaking Papers. Multimodality in Advanced Interaction Environments. HCII 2022. Lecture Notes in Computer Science, vol 13519. Springer, Cham. https://doi.org/10.1007/978-3-031-17618-0_6
DOI: https://doi.org/10.1007/978-3-031-17618-0_6
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-17617-3
Online ISBN: 978-3-031-17618-0
eBook Packages: Computer Science, Computer Science (R0)