Abstract
Recognizing and classifying emotions is important in fields such as medicine, transportation, and artificial intelligence. To address the poor emotion classification that results from inadequate fusion of multiple modalities, this paper proposes an audio–visual emotion recognition method based on a bilayer LSTM and a multi-head attention mechanism, evaluated on the RAVDESS dataset, that fuses voice and facial-expression features. The network learns MFCC features through a convolutional layer and facial features through a double-layer LSTM, fuses the features of the two modalities with a multi-head attention module, and finally convolves, pools, and splices the learned features for emotion recognition. In our experiments, the method achieves an accuracy of 82.42% on the public RAVDESS dataset under 5-fold cross-validation. Comparison with other methods shows that the proposed approach improves audio–visual emotion recognition.
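To make the pipeline concrete, the following is a minimal PyTorch sketch of the architecture outlined in the abstract. All dimensions and hyperparameters (40 MFCC coefficients, a 136-dimensional facial feature vector, 128 hidden units, 8 attention heads, 8 emotion classes) are illustrative assumptions rather than the paper's reported settings, and using the audio features as the attention query is one plausible reading of the fusion step.

```python
# Illustrative sketch only: layer sizes and the query/key/value roles in the
# attention fusion are assumptions, not the paper's exact configuration.
import torch
import torch.nn as nn

class AudioVisualFusionNet(nn.Module):
    def __init__(self, n_mfcc=40, face_dim=136, hidden=128, n_heads=8, n_classes=8):
        super().__init__()
        # Audio branch: 1-D convolution over the MFCC time axis
        self.audio_conv = nn.Sequential(
            nn.Conv1d(n_mfcc, hidden, kernel_size=5, padding=2),
            nn.ReLU(),
        )
        # Visual branch: two stacked LSTM layers over per-frame facial features
        self.face_lstm = nn.LSTM(face_dim, hidden, num_layers=2, batch_first=True)
        # Fusion: multi-head attention across the two modalities
        self.fusion = nn.MultiheadAttention(hidden, n_heads, batch_first=True)
        # Post-fusion convolution and pooling, then classification
        self.post_conv = nn.Conv1d(hidden, hidden, kernel_size=3, padding=1)
        self.pool = nn.AdaptiveAvgPool1d(1)
        self.classifier = nn.Linear(hidden * 2, n_classes)

    def forward(self, mfcc, faces):
        # mfcc:  (batch, n_mfcc, audio_frames)
        # faces: (batch, video_frames, face_dim)
        a = self.audio_conv(mfcc).transpose(1, 2)       # (B, T_a, hidden)
        v, _ = self.face_lstm(faces)                    # (B, T_v, hidden)
        fused, _ = self.fusion(a, v, v)                 # audio queries video
        fused = self.post_conv(fused.transpose(1, 2))   # (B, hidden, T_a)
        # "Splice": concatenate pooled fused features with pooled audio features
        pooled = torch.cat([self.pool(fused), self.pool(a.transpose(1, 2))], dim=1)
        return self.classifier(pooled.squeeze(-1))      # (B, n_classes)

model = AudioVisualFusionNet()
logits = model(torch.randn(2, 40, 120), torch.randn(2, 30, 136))
print(logits.shape)  # torch.Size([2, 8])
```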
Data availability
No datasets were generated or analyzed during the current study.
Author information
Contributions
Jin Zeyu was responsible for model design, model training, and drafting the manuscript. Zai Wenjiao revised the paper and provided guidance on its details. All authors have reviewed the manuscript.
Ethics declarations
Conflict of interest
The authors declare no conflict of interest.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Jin, Z., Zai, W. Audiovisual emotion recognition based on bi-layer LSTM and multi-head attention mechanism on RAVDESS dataset. J Supercomput 81, 31 (2025). https://doi.org/10.1007/s11227-024-06582-z