Abstract
Recognizing and classifying emotions is important in fields such as medicine, transportation, and artificial intelligence. To address the poor emotion classification that results from inadequate fusion of multiple modalities, this paper proposes an audio–visual emotion recognition method based on a bilayer LSTM and a multi-head attention mechanism, evaluated on the RAVDESS dataset, that fuses voice and facial-expression features. The network learns MFCC features through a convolutional layer and facial features through a double-layer LSTM, fuses the features of the two modalities with a multi-head attention module, and finally convolves, pools, and splices the learned features for emotion recognition. In our experiments, the method achieves an accuracy of 82.42% on the public RAVDESS dataset under 5-fold cross-validation. Comparison with other methods shows that the proposed approach improves audio–visual emotion recognition.
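To make the pipeline concrete, the following is a minimal PyTorch sketch of the architecture outlined in the abstract. All dimensions and hyperparameters (40 MFCC coefficients, a 136-dimensional facial feature vector, 128 hidden units, 8 attention heads, 8 emotion classes) are illustrative assumptions rather than the paper's reported settings, and using the audio features as the attention query is one plausible reading of the fusion step.

```python
# Illustrative sketch only: layer sizes and the query/key/value roles in the
# attention fusion are assumptions, not the paper's exact configuration.
import torch
import torch.nn as nn

class AudioVisualFusionNet(nn.Module):
    def __init__(self, n_mfcc=40, face_dim=136, hidden=128, n_heads=8, n_classes=8):
        super().__init__()
        # Audio branch: 1-D convolution over the MFCC time axis
        self.audio_conv = nn.Sequential(
            nn.Conv1d(n_mfcc, hidden, kernel_size=5, padding=2),
            nn.ReLU(),
        )
        # Visual branch: two stacked LSTM layers over per-frame facial features
        self.face_lstm = nn.LSTM(face_dim, hidden, num_layers=2, batch_first=True)
        # Fusion: multi-head attention across the two modalities
        self.fusion = nn.MultiheadAttention(hidden, n_heads, batch_first=True)
        # Post-fusion convolution and pooling, then classification
        self.post_conv = nn.Conv1d(hidden, hidden, kernel_size=3, padding=1)
        self.pool = nn.AdaptiveAvgPool1d(1)
        self.classifier = nn.Linear(hidden * 2, n_classes)

    def forward(self, mfcc, faces):
        # mfcc:  (batch, n_mfcc, audio_frames)
        # faces: (batch, video_frames, face_dim)
        a = self.audio_conv(mfcc).transpose(1, 2)       # (B, T_a, hidden)
        v, _ = self.face_lstm(faces)                    # (B, T_v, hidden)
        fused, _ = self.fusion(a, v, v)                 # audio queries video
        fused = self.post_conv(fused.transpose(1, 2))   # (B, hidden, T_a)
        # "Splice": concatenate pooled fused features with pooled audio features
        pooled = torch.cat([self.pool(fused), self.pool(a.transpose(1, 2))], dim=1)
        return self.classifier(pooled.squeeze(-1))      # (B, n_classes)

model = AudioVisualFusionNet()
logits = model(torch.randn(2, 40, 120), torch.randn(2, 30, 136))
print(logits.shape)  # torch.Size([2, 8])
```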
Data availability
No datasets were generated or analyzed during the current study.
Author information
Contributions
Jin Zeyu was responsible for model design, model training, and drafting the manuscript. Zai Wenjiao revised the paper and provided guidance on its details. All authors have reviewed the manuscript.
Ethics declarations
Conflict of interest
The authors declare no conflict of interest.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Jin, Z., Zai, W. Audiovisual emotion recognition based on bi-layer LSTM and multi-head attention mechanism on RAVDESS dataset. J Supercomput 81, 31 (2025). https://doi.org/10.1007/s11227-024-06582-z