Abstract
Automatic recognition of human emotions is of high importance in human-computer interaction (HCI) because of its applications in real-world tasks. Numerous studies have addressed emotion recognition using a variety of sensors, feature extraction methods, and classification techniques, with results reported for audio, vision, text, and biosensors. Although significant improvements have been achieved on acted emotion data, performance remains low because real data are scarce and available datasets are small. To address this problem, this study investigates data augmentation based on Generative Adversarial Networks (GANs). For classification, the Vision Transformer (ViT) is used; ViT was originally proposed for image classification and is adapted here to emotion recognition. The proposed methods were evaluated on the English IEMOCAP and the Japanese JTES speech corpora and showed significant improvements when data augmentation was applied.
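To make the pipeline described above more concrete, the following is a minimal sketch of how a ViT-style classifier can be applied to fixed-size mel-spectrograms treated as single-channel images. All module names, the patch size, model dimensions, and the four-class output are illustrative assumptions, not the authors' configuration, and the GAN-based augmentation step is omitted.

```python
# Illustrative sketch only: a small ViT-style encoder over mel-spectrogram
# "images" for speech emotion recognition. Sizes and hyperparameters are
# assumptions for demonstration, not the configuration used in the paper.
import torch
import torch.nn as nn


class SpectrogramViT(nn.Module):
    def __init__(self, n_mels=128, n_frames=128, patch=16,
                 dim=192, depth=4, heads=3, n_classes=4):
        super().__init__()
        assert n_mels % patch == 0 and n_frames % patch == 0
        n_patches = (n_mels // patch) * (n_frames // patch)
        # Patch embedding: split the spectrogram into non-overlapping
        # patch x patch tiles and project each tile to a `dim`-dim token.
        self.to_patches = nn.Conv2d(1, dim, kernel_size=patch, stride=patch)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, n_patches + 1, dim))
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                           dim_feedforward=4 * dim,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        self.head = nn.Linear(dim, n_classes)

    def forward(self, spec):               # spec: (B, 1, n_mels, n_frames)
        x = self.to_patches(spec)          # (B, dim, H', W')
        x = x.flatten(2).transpose(1, 2)   # (B, n_patches, dim)
        cls = self.cls_token.expand(x.size(0), -1, -1)
        x = torch.cat([cls, x], dim=1) + self.pos_embed
        x = self.encoder(x)
        return self.head(x[:, 0])          # classify from the [CLS] token


# Toy usage: a batch of two fixed-size mel-spectrograms, four emotion classes.
logits = SpectrogramViT()(torch.randn(2, 1, 128, 128))
print(logits.shape)  # torch.Size([2, 4])
```

In the setting described in the abstract, such a classifier would be trained on both the original utterances and additional GAN-generated spectrograms, which is where the data augmentation enters; that training loop is not shown here.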
Acknowledgments
This work was supported by Council for Science, Technology and Innovation, “Cross-ministerial Strategic Innovation Promotion Program (SIP), Big-data and AI-enabled Cyberspace Technologies” (funding agency: NEDO).
Copyright information
© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG
Cite this paper
Heracleous, P., Fukayama, S., Ogata, J., Mohammad, Y. (2022). Applying Generative Adversarial Networks and Vision Transformers in Speech Emotion Recognition. In: Kurosu, M., et al. HCI International 2022 - Late Breaking Papers. Multimodality in Advanced Interaction Environments. HCII 2022. Lecture Notes in Computer Science, vol 13519. Springer, Cham. https://doi.org/10.1007/978-3-031-17618-0_6
DOI: https://doi.org/10.1007/978-3-031-17618-0_6
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-17617-3
Online ISBN: 978-3-031-17618-0
eBook Packages: Computer Science, Computer Science (R0)