Audiovisual emotion recognition based on bi-layer LSTM and multi-head attention mechanism on RAVDESS dataset

The Journal of Supercomputing

Abstract

Recognizing and classifying emotions is important in fields such as medicine, transportation, and artificial intelligence. To address the poor classification performance caused by inadequate fusion of multiple modalities, this paper proposes an audio–visual emotion recognition method based on a bi-layer LSTM and a multi-head attention mechanism, evaluated on the RAVDESS dataset, that fuses voice and facial-expression features. The network learns MFCC features through a convolutional layer and facial features through a two-layer LSTM, fuses the two modalities with a multi-head attention module, and finally convolves, pools, and concatenates the learned features for emotion recognition. Under 5-fold cross-validation, the method achieves an accuracy of 82.42% on the public RAVDESS dataset; comparison with other methods shows that the proposed approach improves audio–visual emotion recognition.
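The abstract describes a three-stage pipeline: a convolutional branch over MFCC features, a two-layer LSTM branch over facial features, multi-head attention fusion, and a final convolution/pooling/concatenation stage before classification. The following is a minimal PyTorch sketch of such a pipeline. Everything beyond what the abstract states is an assumption: the feature dimensions (40 MFCC coefficients, 136-dimensional facial landmark vectors), the layer widths, the head count, and the choice of the facial sequence as the attention query are illustrative, not the authors' configuration; only the eight output classes follow from RAVDESS.

```python
import torch
import torch.nn as nn


class BiLayerLSTMAttentionFusion(nn.Module):
    """Hypothetical sketch of the pipeline the abstract outlines; all
    layer sizes are illustrative assumptions, not the authors' settings."""

    def __init__(self, n_mfcc=40, face_dim=136, d_model=128, n_heads=4, n_classes=8):
        super().__init__()
        # Audio branch: 1-D convolution over the MFCC time axis.
        self.audio_conv = nn.Sequential(
            nn.Conv1d(n_mfcc, d_model, kernel_size=3, padding=1),
            nn.ReLU(),
        )
        # Visual branch: two stacked LSTM layers over per-frame facial features.
        self.face_lstm = nn.LSTM(face_dim, d_model, num_layers=2, batch_first=True)
        # Cross-modal fusion: the facial sequence queries the audio sequence.
        self.fusion = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # Post-fusion convolution and pooling, then classification.
        self.post_conv = nn.Conv1d(d_model, d_model, kernel_size=3, padding=1)
        self.pool = nn.AdaptiveAvgPool1d(1)
        self.classifier = nn.Linear(2 * d_model, n_classes)

    def forward(self, mfcc, faces):
        # mfcc: (batch, n_mfcc, audio_steps); faces: (batch, video_steps, face_dim)
        a = self.audio_conv(mfcc).transpose(1, 2)        # (B, T_audio, d_model)
        v, _ = self.face_lstm(faces)                     # (B, T_video, d_model)
        fused, _ = self.fusion(query=v, key=a, value=a)  # (B, T_video, d_model)
        f = self.post_conv(fused.transpose(1, 2))        # (B, d_model, T_video)
        f = self.pool(f).squeeze(-1)                     # (B, d_model)
        # Splice (concatenate) the fused feature with a pooled visual feature.
        v_pooled = v.mean(dim=1)                         # (B, d_model)
        return self.classifier(torch.cat([f, v_pooled], dim=-1))


model = BiLayerLSTMAttentionFusion()
logits = model(torch.randn(2, 40, 300), torch.randn(2, 90, 136))
print(logits.shape)  # torch.Size([2, 8])
```

Using one modality's sequence as the attention query and the other's as key/value is one common way to realize cross-modal multi-head attention fusion; the paper's exact wiring, dimensions, and training details may differ.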



Data availability

No datasets were generated or analyzed during the current study.


Author information

Contributions

Zeyu Jin designed and trained the model and drafted the structure of the paper. Wenjiao Zai revised the paper and provided detailed guidance. All authors reviewed the manuscript.

Corresponding author

Correspondence to Wenjiao Zai.

Ethics declarations

Conflict of interest

The authors declare no conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Jin, Z., Zai, W. Audiovisual emotion recognition based on bi-layer LSTM and multi-head attention mechanism on RAVDESS dataset. J Supercomput 81, 31 (2025). https://doi.org/10.1007/s11227-024-06582-z

