Abstract
Understanding human emotions is essential for a range of tasks, including interpersonal interaction, knowledge acquisition, and decision making. Recognizing emotions, particularly from speech, poses significant challenges due to linguistic variation, regional dialects, gender diversity, generational differences, and cultural diversity. Deep learning methods are promising for automating this task; however, previous approaches frequently rely on a single type of feature representation, limiting the efficacy of Speech Emotion Recognition (SER). To address these limitations, a comprehensive approach based on Shifted Window Transformers is proposed, which accounts for the many facets of emotional expression in speech and exploits diverse feature representations to improve SER performance. This paper presents a novel Shifted Window Transformer Emotion Network (SwinEmoNet) that incorporates shifted window attention mechanisms for efficient emotion classification. SwinEmoNet employs local window attention rather than the global attention mechanism of traditional transformer architectures, allowing the model to concentrate on salient information within small, localized regions of the input speech signal. The proposed SwinEmoNet architecture is evaluated on three distinct speech spectrogram representations. The effectiveness of the proposed SER method is assessed on the Berlin Emotional Database (EMODB) and the Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS), with emphasis on metrics such as accuracy, precision, recall, and F1-score. On the EMODB and RAVDESS datasets, SwinEmoNet achieves accuracies of 94.93% and 96.51%, respectively, significantly outperforming existing transformer models and current state-of-the-art methods.
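To illustrate the shifted window attention described above, the following minimal PyTorch sketch partitions a spectrogram patch grid into local windows, applies multi-head self-attention within each window, and shifts the windows on an alternate pass so that information can flow across window boundaries. This is not the authors' implementation; the window size, embedding dimension, and head count are illustrative assumptions only.

# Minimal sketch (assumed configuration, not the published SwinEmoNet code):
# shifted-window self-attention over a log-mel spectrogram patch grid.
import torch
import torch.nn as nn

class WindowAttention(nn.Module):
    """Multi-head self-attention restricted to non-overlapping local windows."""

    def __init__(self, dim: int, window: int, heads: int):
        super().__init__()
        self.window = window
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x: torch.Tensor, shift: bool = False) -> torch.Tensor:
        # x: (batch, freq, time, dim) patch embeddings of a spectrogram
        b, f, t, d = x.shape
        w = self.window
        if shift:  # shifted windows reconnect regions split by window borders
            x = torch.roll(x, shifts=(-w // 2, -w // 2), dims=(1, 2))
        # partition into w x w windows and attend within each window only
        xw = x.reshape(b, f // w, w, t // w, w, d).permute(0, 1, 3, 2, 4, 5)
        xw = xw.reshape(-1, w * w, d)
        out, _ = self.attn(xw, xw, xw)
        # merge windows back onto the (freq, time) grid
        out = out.reshape(b, f // w, t // w, w, w, d).permute(0, 1, 3, 2, 4, 5)
        out = out.reshape(b, f, t, d)
        if shift:
            out = torch.roll(out, shifts=(w // 2, w // 2), dims=(1, 2))
        return out

# Toy usage: 128 mel bins x 128 time frames, 32-dim patch embeddings, 8x8 windows
x = torch.randn(2, 128, 128, 32)
block = WindowAttention(dim=32, window=8, heads=4)
y = block(block(x, shift=False), shift=True)  # regular pass, then shifted pass
print(y.shape)  # torch.Size([2, 128, 128, 32])

Alternating regular and shifted window passes is what distinguishes this scheme from plain local attention: the shift lets neighbouring windows exchange information without incurring the quadratic cost of global attention over the full spectrogram.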
Data availability
The datasets generated during and/or analysed during the current study are available from the corresponding author on reasonable request.
Ethics declarations
Conflict of interest
The authors declare that they have no conflict of interest.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Ramesh, R., Prahaladhan, V.B., Nithish, P. et al. Speech emotion recognition using the novel SwinEmoNet (Shifted Window Transformer Emotion Network). Int J Speech Technol 27, 551–568 (2024). https://doi.org/10.1007/s10772-024-10123-7