Abstract
Understanding human emotions is essential for a range of tasks, including interpersonal interaction, knowledge acquisition, and decision making. Recognizing emotions, particularly from speech, poses significant challenges due to linguistic variation, regional dialects, gender diversity, generational differences, and cultural diversity. Deep learning methods are promising for automating this task; however, previous approaches frequently rely on a single type of feature representation, limiting the efficacy of Speech Emotion Recognition (SER). To address these limitations, a comprehensive approach based on Shifted Window Transformers is proposed, which accounts for the many facets of emotional expression in speech and exploits diverse feature representations to improve SER performance. This paper presents a novel Shifted Window Transformer Emotion Network (SwinEmoNet) that incorporates shifted window attention mechanisms for efficient emotion classification. SwinEmoNet employs local window attention rather than the global attention mechanism of traditional transformer architectures, allowing the model to concentrate on salient information within small, localized regions of the input speech signal. The proposed SwinEmoNet architecture is evaluated on three distinct speech spectrogram representations. The effectiveness of the proposed SER method is assessed on the Berlin Emotional Database (EMODB) and the Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS), with emphasis on metrics such as accuracy, precision, recall, and F1-score. On the EMODB and RAVDESS datasets, SwinEmoNet achieves accuracies of 94.93% and 96.51%, respectively, significantly outperforming existing transformer models and current state-of-the-art methods.
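To illustrate the shifted window attention described above, the following minimal PyTorch sketch partitions a spectrogram patch grid into local windows, applies multi-head self-attention within each window, and shifts the windows on an alternate pass so that information can flow across window boundaries. This is not the authors' implementation; the window size, embedding dimension, and head count are illustrative assumptions only.

# Minimal sketch (assumed configuration, not the published SwinEmoNet code):
# shifted-window self-attention over a log-mel spectrogram patch grid.
import torch
import torch.nn as nn

class WindowAttention(nn.Module):
    """Multi-head self-attention restricted to non-overlapping local windows."""

    def __init__(self, dim: int, window: int, heads: int):
        super().__init__()
        self.window = window
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x: torch.Tensor, shift: bool = False) -> torch.Tensor:
        # x: (batch, freq, time, dim) patch embeddings of a spectrogram
        b, f, t, d = x.shape
        w = self.window
        if shift:  # shifted windows reconnect regions split by window borders
            x = torch.roll(x, shifts=(-w // 2, -w // 2), dims=(1, 2))
        # partition into w x w windows and attend within each window only
        xw = x.reshape(b, f // w, w, t // w, w, d).permute(0, 1, 3, 2, 4, 5)
        xw = xw.reshape(-1, w * w, d)
        out, _ = self.attn(xw, xw, xw)
        # merge windows back onto the (freq, time) grid
        out = out.reshape(b, f // w, t // w, w, w, d).permute(0, 1, 3, 2, 4, 5)
        out = out.reshape(b, f, t, d)
        if shift:
            out = torch.roll(out, shifts=(w // 2, w // 2), dims=(1, 2))
        return out

# Toy usage: 128 mel bins x 128 time frames, 32-dim patch embeddings, 8x8 windows
x = torch.randn(2, 128, 128, 32)
block = WindowAttention(dim=32, window=8, heads=4)
y = block(block(x, shift=False), shift=True)  # regular pass, then shifted pass
print(y.shape)  # torch.Size([2, 128, 128, 32])

Alternating regular and shifted window passes is what distinguishes this scheme from plain local attention: the shift lets neighbouring windows exchange information without incurring the quadratic cost of global attention over the full spectrogram.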
Data availability
The datasets generated during and/or analysed during the current study are available from the corresponding author on reasonable request.
Ethics declarations
Conflict of interest
The authors declare that they have no conflict of interest.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Ramesh, R., Prahaladhan, V.B., Nithish, P. et al. Speech emotion recognition using the novel SwinEmoNet (Shifted Window Transformer Emotion Network). Int J Speech Technol 27, 551–568 (2024). https://doi.org/10.1007/s10772-024-10123-7