Speech Emotion Recognition Using Convolution Neural Networks and Multi-Head Convolutional Transformer
Figure 1. Proposed SER framework: two parallel CNNs with a Transformer encoder for feature extraction. The extracted features are fed to a dense layer with a log-Softmax classifier for emotional state prediction.
Figure 2. Feature extraction process from input speech signals to MFCC features.
Figure 3. Architecture of a single CNN with skip connections. The proposed model is composed of two such parallel CNNs.
Figure 4. Scaled dot-product structure with multiple attention heads.
Figure 5. Transformer encoder architecture with input and output feature dimensions.
Figure 6. Spectrograms after adding white noise at 10 dB, 15 dB, and 25 dB.
Figure 7. Spectrograms of various speech emotions.
Figure 8. Confusion matrices: (a) IEMOCAP dataset; (b) RAVDESS dataset.
Figure 9. CTENet performance (%): accuracy (Acc), precision (Prc), and F1 score on the RAVDESS and IEMOCAP datasets.
Figure 10. CTENet performance compared with SOTA models on the RAVDESS and IEMOCAP datasets.
Abstract
1. Introduction
2. Problems and Motivations
- Stacked parallel CNNs with multi-head self-attention layers are implemented. The channel dimensions of the filters and feature maps are reduced, allowing an expressive feature representation at a lower computational cost. Through multi-head self-attention, the network learns to predict the frequency distributions of speech emotions in accordance with the overall MFCC structure (an illustrative attention/classifier sketch follows this list).
- To exploit the spatial feature representation and classification capability of CNNs, the MFCCs are treated as grayscale images, where the width and height of the MFCC map correspond to the time and frequency scales, respectively. The pixel values indicate the speech signal intensity at each mel-frequency bin and time step (see the MFCC extraction sketch after this list).
- The dataset is augmented with additive white Gaussian noise (AWGN). Because creating genuinely new samples is very difficult, white noise is added to the speech signals to mask the random noise already present in the training data. This also generates pseudo-new training samples and counterbalances the noise impact inherent in the dataset (a noise-injection sketch is given after this list).
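The sketch below illustrates the MFCC-as-image idea using librosa; it is an assumption for illustration, not the authors' released code, and the 40 × 282 map size is borrowed from the layer dimension table later in the paper.

```python
import librosa
import numpy as np

def mfcc_image(path, sr=16000, n_mfcc=40, n_frames=282):
    """Return an MFCC map shaped like a grayscale image (n_mfcc x n_frames)."""
    y, _ = librosa.load(path, sr=sr)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)      # (n_mfcc, T)
    # Pad or truncate along the time axis to a fixed number of frames.
    if mfcc.shape[1] < n_frames:
        mfcc = np.pad(mfcc, ((0, 0), (0, n_frames - mfcc.shape[1])))
    else:
        mfcc = mfcc[:, :n_frames]
    # Min-max normalise so the values behave like grayscale pixel intensities.
    return ((mfcc - mfcc.min()) / (mfcc.max() - mfcc.min() + 1e-8)).astype(np.float32)
```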
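For the AWGN augmentation, a minimal sketch is shown next; the SNR levels (10, 15, and 25 dB) follow the spectrogram figure, while the function itself is an illustrative assumption rather than the authors' exact procedure.

```python
import numpy as np

def add_awgn(signal, snr_db):
    """Add white Gaussian noise to a 1-D speech signal at a target SNR in dB."""
    signal_power = np.mean(signal ** 2)
    noise_power = signal_power / (10.0 ** (snr_db / 10.0))
    noise = np.random.normal(0.0, np.sqrt(noise_power), size=signal.shape)
    return signal + noise

# Pseudo-new training samples at the SNR levels used for the spectrograms:
# augmented = [add_awgn(y, snr) for snr in (10, 15, 25)]
```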
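Finally, the snippet below applies multi-head self-attention (via a standard PyTorch Transformer encoder layer) to the embeddings produced by two parallel CNN branches and classifies with log-Softmax, as in Figure 1. The embedding size, head count, and the way the two branch outputs are stacked into a sequence are assumptions for illustration only.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

d_model, n_heads, n_emotions = 512, 8, 8      # assumed sizes, not taken from the paper
encoder = nn.TransformerEncoderLayer(d_model=d_model, nhead=n_heads,
                                     dim_feedforward=1024, batch_first=True)
classifier = nn.Linear(d_model, n_emotions)

# Hypothetical fusion: stack the two parallel CNN embeddings as a 2-token sequence.
branch_a = torch.randn(4, d_model)            # batch of 4 utterances, branch 1
branch_b = torch.randn(4, d_model)            # branch 2
tokens = torch.stack([branch_a, branch_b], dim=1)        # (batch, 2, d_model)

attended = encoder(tokens)                    # multi-head self-attention + feed-forward
log_probs = F.log_softmax(classifier(attended.mean(dim=1)), dim=-1)   # (batch, 8)
```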
3. Related SER Literature
4. CTENet SER System
4.1. Parallel CNN Framework
4.2. Transformer Encoder
5. Experimentation
5.1. Datasets
5.2. Model Training, Architecture, and Features
5.3. Baseline Models
6. Results and Discussion
Comparison with Existing Models
7. Conclusions and Recommendations
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Acknowledgments
Conflicts of Interest
References
- Liu, Z.T.; Xie, Q.; Wu, M.; Cao, W.H.; Mei, Y.; Mao, J.W. Speech emotion recognition based on an improved brain emotion learning model. Neurocomputing 2018, 309, 145–156. [Google Scholar] [CrossRef]
- Nwe, T.L.; Foo, S.W.; De Silva, L.C. Speech emotion recognition using hidden Markov models. Speech Commun. 2003, 41, 603–623. [Google Scholar] [CrossRef]
- Patel, P.; Chaudhari, A.; Kale, R.; Pund, M. Emotion recognition from speech with gaussian mixture models via boosted gmm. Int. J. Res. Sci. Eng. 2017, 3, 294–297. [Google Scholar]
- Chen, L.; Mao, X.; Xue, Y.; Cheng, L.L. Speech emotion recognition: Features and classification models. Digit. Signal Process. 2012, 22, 1154–1160. [Google Scholar] [CrossRef]
- Koolagudi, S.G.; Rao, K.S. Emotion recognition from speech: A review. Int. J. Speech Technol. 2012, 15, 99–117. [Google Scholar] [CrossRef]
- Akçay, M.B.; Oğuz, K. Speech emotion recognition: Emotional models, databases, features, preprocessing methods, supporting modalities, and classifiers. Speech Commun. 2020, 116, 56–76. [Google Scholar] [CrossRef]
- Latif, S.; Rana, R.; Khalifa, S.; Jurdak, R.; Qadir, J.; Schuller, B.W. Survey of deep representation learning for speech emotion recognition. IEEE Trans. Affect. Comput. 2021, 14, 1634–1654. [Google Scholar] [CrossRef]
- Fayek, H.M.; Lech, M.; Cavedon, L. Evaluating deep learning architectures for Speech Emotion Recognition. Neural Netw. 2017, 92, 60–68. [Google Scholar] [CrossRef]
- Tuncer, T.; Dogan, S.; Acharya, U.R. Automated accurate speech emotion recognition system using twine shuffle pattern and iterative neighborhood component analysis techniques. Knowl.-Based Syst. 2021, 211, 106547. [Google Scholar] [CrossRef]
- Singh, P.; Srivastava, R.; Rana, K.P.S.; Kumar, V. A multimodal hierarchical approach to speech emotion recognition from audio and text. Knowl.-Based Syst. 2021, 229, 107316. [Google Scholar] [CrossRef]
- Magdin, M.; Sulka, T.; Tomanová, J.; Vozár, M. Voice analysis using PRAAT software and classification of user emotional state. Int. J. Interact. Multimed. Artif. Intell. 2019, 5, 33–42. [Google Scholar] [CrossRef]
- Huddar, M.G.; Sannakki, S.S.; Rajpurohit, V.S. Attention-based Multi-modal Sentiment Analysis and Emotion Detection in Conversation using RNN. Int. J. Interact. Multimed. Artif. Intell. 2021, 6, 112–121. [Google Scholar] [CrossRef]
- Wang, K.; An, N.; Li, B.N.; Zhang, Y.; Li, L. Speech emotion recognition using Fourier parameters. IEEE Trans. Affect. Comput. 2015, 6, 69–75. [Google Scholar] [CrossRef]
- Mao, Q.; Dong, M.; Huang, Z.; Zhan, Y. Learning salient features for speech emotion recognition using convolutional neural networks. IEEE Trans. Multimed. 2014, 16, 2203–2213. [Google Scholar] [CrossRef]
- Ho, N.H.; Yang, H.J.; Kim, S.H.; Lee, G. Multimodal approach of speech emotion recognition using multi-level multi-head fusion attention-based recurrent neural network. IEEE Access 2020, 8, 61672–61686. [Google Scholar] [CrossRef]
- Saleem, N.; Gao, J.; Khattak, M.I.; Rauf, H.T.; Kadry, S.; Shafi, M. Deepresgru: Residual gated recurrent neural network-augmented kalman filtering for speech enhancement and recognition. Knowl.-Based Syst. 2022, 238, 107914. [Google Scholar] [CrossRef]
- Zhao, J.; Mao, X.; Chen, L. Speech emotion recognition using deep 1D & 2D CNN LSTM networks. Biomed. Signal Process. Control 2019, 47, 312–323. [Google Scholar]
- Xie, Y.; Liang, R.; Liang, Z.; Huang, C.; Zou, C.; Schuller, B. Speech emotion classification using attention-based LSTM. IEEE/ACM Trans. Audio Speech Lang. Process. 2019, 27, 1675–1685. [Google Scholar] [CrossRef]
- Wang, J.; Xue, M.; Culhane, R.; Diao, E.; Ding, J.; Tarokh, V. Speech emotion recognition with dual-sequence LSTM architecture. In Proceedings of the ICASSP 2020—2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Barcelona, Spain, 4–8 May 2020; pp. 6474–6478. [Google Scholar]
- Zhao, H.; Xiao, Y.; Zhang, Z. Robust semisupervised generative adversarial networks for speech emotion recognition via distribution smoothness. IEEE Access 2020, 8, 106889–106900. [Google Scholar] [CrossRef]
- Shilandari, A.; Marvi, H.; Khosravi, H.; Wang, W. Speech emotion recognition using data augmentation method by cycle-generative adversarial networks. Signal Image Video Process. 2022, 16, 1955–1962. [Google Scholar] [CrossRef]
- Yi, L.; Mak, M.W. Improving speech emotion recognition with adversarial data augmentation network. IEEE Trans. Neural Netw. Learn. Syst. 2020, 33, 172–184. [Google Scholar] [CrossRef] [PubMed]
- Huang, C.; Gong, W.; Fu, W.; Feng, D. A research of speech emotion recognition based on deep belief network and SVM. Math. Probl. Eng. 2014, 2014, 749604. [Google Scholar] [CrossRef]
- Huang, Y.; Tian, K.; Wu, A.; Zhang, G. Feature fusion methods research based on deep belief networks for speech emotion recognition under noise condition. J. Ambient. Intell. Humaniz. Comput. 2019, 14, 1787–1798. [Google Scholar] [CrossRef]
- Schuller, B.W. Speech emotion recognition: Two decades in a nutshell, benchmarks, and ongoing trends. Commun. ACM 2018, 61, 90–99. [Google Scholar] [CrossRef]
- Guo, L.; Wang, L.; Dang, J.; Liu, Z.; Guan, H. Exploration of complementary features for speech emotion recognition based on kernel extreme learning machine. IEEE Access 2019, 7, 75798–75809. [Google Scholar] [CrossRef]
- Han, K.; Yu, D.; Tashev, I. Speech emotion recognition using deep neural network and extreme learning machine. In Proceedings of the Interspeech, Singapore, 14–18 September 2014. [Google Scholar]
- Tiwari, U.; Soni, M.; Chakraborty, R.; Panda, A.; Kopparapu, S.K. Multi-conditioning and data augmentation using generative noise model for speech emotion recognition in noisy conditions. In Proceedings of the ICASSP 2020—2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Barcelona, Spain, 4–8 May 2020; pp. 7194–7198. [Google Scholar]
- Badshah, A.M.; Ahmad, J.; Rahim, N.; Baik, S.W. Speech emotion recognition from spectrograms with deep convolutional neural network. In Proceedings of the 2017 International Conference on Platform Technology and Service (PlatCon), Busan, Republic of Korea, 13–15 February 2017; pp. 1–5. [Google Scholar]
- Dong, Y.; Yang, X. Affect-salient event sequence modelling for continuous speech emotion recognition. Neurocomputing 2021, 458, 246–258. [Google Scholar] [CrossRef]
- Chen, Q.; Huang, G. A novel dual attention-based BLSTM with hybrid features in speech emotion recognition. Eng. Appl. Artif. Intell. 2021, 102, 104277. [Google Scholar] [CrossRef]
- Atila, O.; Şengür, A. Attention guided 3D CNN-LSTM model for accurate speech based emotion recognition. Appl. Acoust. 2021, 182, 108260. [Google Scholar] [CrossRef]
- Lambrecht, L.; Kreifelts, B.; Wildgruber, D. Gender differences in emotion recognition: Impact of sensory modality and emotional category. Cogn. Emot. 2014, 28, 452–469. [Google Scholar] [CrossRef]
- Fu, C.; Liu, C.; Ishi, C.T.; Ishiguro, H. Multi-modality emotion recognition model with GAT-based multi-head inter-modality attention. Sensors 2020, 20, 4894. [Google Scholar] [CrossRef]
- Liu, D.; Chen, L.; Wang, Z.; Diao, G. Speech expression multimodal emotion recognition based on deep belief network. J. Grid Comput. 2021, 19, 22. [Google Scholar] [CrossRef]
- Zhao, Z.; Li, Q.; Zhang, Z.; Cummins, N.; Wang, H.; Tao, J.; Schuller, B.W. Combining a parallel 2D CNN with a self-attention dilated residual network for CTC-based discrete speech emotion recognition. Neural Netw. 2021, 141, 52–60. [Google Scholar] [CrossRef]
- Gangamohan, P.; Kadiri, S.R.; Yegnanarayana, B. Analysis of emotional speech—A review. Towar. Robot. Soc. Believable Behaving Syst. 2016, 1, 205–238. [Google Scholar]
- Gobl, C.; Chasaide, A.N. The role of voice quality in communicating emotion, mood and attitude. Speech Commun. 2003, 40, 189–212. [Google Scholar] [CrossRef]
- Vlasenko, B.; Philippou-Hübner, D.; Prylipko, D.; Böck, R.; Siegert, I.; Wendemuth, A. Vowels formants analysis allows straightforward detection of high arousal emotions. In Proceedings of the 2011 IEEE International Conference on Multimedia and Expo, Barcelona, Spain, 11–15 July 2011; pp. 1–6. [Google Scholar]
- Lee, C.M.; Narayanan, S.S. Toward detecting emotions in spoken dialogs. IEEE Trans. Speech Audio Process. 2005, 13, 293–303. [Google Scholar]
- Schuller, B.; Rigoll, G. Timing levels in segment-based speech emotion recognition. In Proceedings of the INTERSPEECH 2006, Proceedings International Conference on Spoken Language Processing ICSLP, Pittsburgh, PA, USA, 17–21 September 2006. [Google Scholar]
- Lugger, M.; Yang, B. The relevance of voice quality features in speaker independent emotion recognition. In Proceedings of the 2007 IEEE International Conference on Acoustics, Speech and Signal Processing-ICASSP’07, Honolulu, HI, USA, 15–20 April 2007; Volume 4, p. IV-17. [Google Scholar]
- Mutlag, W.K.; Ali, S.K.; Aydam, Z.M.; Taher, B.H. Feature extraction methods: A review. J. Phys. Conf. Ser. 2020, 1591, 012028. [Google Scholar] [CrossRef]
- Cavalcante, R.C.; Minku, L.L.; Oliveira, A.L. Fedd: Feature extraction for explicit concept drift detection in time series. In Proceedings of the 2016 International Joint Conference on Neural Networks (IJCNN), Vancouver, BC, Canada, 24–29 July 2016; pp. 740–747. [Google Scholar]
- Phinyomark, A.; Quaine, F.; Charbonnier, S.; Serviere, C.; Tarpin-Bernard, F.; Laurillau, Y. Feature extraction of the first difference of EMG time series for EMG pattern recognition. Comput. Methods Programs Biomed. 2014, 117, 247–256. [Google Scholar] [CrossRef]
- Schneider, T.; Helwig, N.; Schütze, A. Automatic feature extraction and selection for classification of cyclical time series data. Tech. Mess. 2017, 84, 198–206. [Google Scholar] [CrossRef]
- Salau, A.O.; Jain, S. Feature extraction: A survey of the types, techniques, applications. In Proceedings of the 2019 International Conference on Signal Processing and Communication (ICSC), Noida, India, 7–9 March 2019; pp. 158–164. [Google Scholar]
- Salau, A.O.; Olowoyo, T.D.; Akinola, S.O. Accent classification of the three major Nigerian indigenous languages using 1D CNN LSTM network model. In Advances in Computational Intelligence Techniques; Springer: Singapore, 2020; pp. 1–16. [Google Scholar]
- Zamil, A.A.A.; Hasan, S.; Baki, S.M.J.; Adam, J.M.; Zaman, I. Emotion detection from speech signals using voting mechanism on classified frames. In Proceedings of the 2019 International Conference on Robotics, Electrical and Signal Processing Techniques (ICREST), Dhaka, Bangladesh, 10–12 January 2019; pp. 281–285. [Google Scholar]
- Bhavan, A.; Chauhan, P.; Shah, R.R. Bagged support vector machines for emotion recognition from speech. Knowl.-Based Syst. 2019, 184, 104886. [Google Scholar] [CrossRef]
- Huang, Z.; Dong, M.; Mao, Q.; Zhan, Y. Speech emotion recognition using CNN. In Proceedings of the 22nd ACM International Conference on Multimedia, Orlando, FL, USA, 3–7 November 2014; pp. 801–804. [Google Scholar]
- Latif, S.; Rana, R.; Younis, S.; Qadir, J.; Epps, J. Transfer learning for improving speech emotion classification accuracy. arXiv 2018, arXiv:1801.06353. [Google Scholar]
- Xie, B.; Sidulova, M.; Park, C.H. Robust multimodal emotion recognition from conversation with transformer-based crossmodality fusion. Sensors 2021, 21, 4913. [Google Scholar] [CrossRef]
- Ahmed, M.; Islam, S.; Islam, A.K.M.; Shatabda, S. An Ensemble 1D-CNN-LSTM-GRU Model with Data Augmentation for Speech Emotion Recognition. arXiv 2021, arXiv:2112.05666. [Google Scholar]
- Yu, Y.; Kim, Y.J. Attention-LSTM-attention model for speech emotion recognition and analysis of IEMOCAP database. Electronics 2020, 9, 713. [Google Scholar] [CrossRef]
- Ohi, A.Q.; Mridha, M.F.; Safir, F.B.; Hamid, M.A.; Monowar, M.M. Autoembedder: A semi-supervised DNN embedding system for clustering. Knowl.-Based Syst. 2020, 204, 106190. [Google Scholar] [CrossRef]
- Sajjad, M.; Kwon, S. Clustering-based speech emotion recognition by incorporating learned features and deep BiLSTM. IEEE Access 2020, 8, 79861–79875. [Google Scholar]
- Bertero, D.; Fung, P. A first look into a convolutional neural network for speech emotion detection. In Proceedings of the 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), New Orleans, LA, USA, 5–9 March 2017; pp. 5115–5119. [Google Scholar]
- Mekruksavanich, S.; Jitpattanakul, A.; Hnoohom, N. Negative emotion recognition using deep learning for Thai language. In Proceedings of the 2020 Joint International Conference on Digital Arts, Media and Technology with ECTI Northern Section Conference on Electrical, Electronics, Computer and Telecommunications Engineering (ECTI DAMT & NCON), Pattaya, Thailand, 11–14 March 2020; pp. 71–74. [Google Scholar]
- Anvarjon, T.; Kwon, S. Deep-net: A lightweight CNN-based speech emotion recognition system using deep frequency features. Sensors 2020, 20, 5212. [Google Scholar] [CrossRef]
- Zhang, S.; Zhang, S.; Huang, T.; Gao, W. Speech emotion recognition using deep convolutional neural network and discriminant temporal pyramid matching. IEEE Trans. Multimed. 2017, 20, 1576–1590. [Google Scholar] [CrossRef]
- Trigeorgis, G.; Ringeval, F.; Brueckner, R.; Marchi, E.; Nicolaou, M.A.; Schuller, B.; Zafeiriou, S. Adieu features? end-to-end speech emotion recognition using a deep convolutional recurrent network. In Proceedings of the 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Shanghai, China, 20–25 March 2016; pp. 5200–5204. [Google Scholar]
- Kwon, S. CLSTM: Deep feature-based speech emotion recognition using the hierarchical ConvLSTM network. Mathematics 2020, 8, 2133. [Google Scholar]
- Li, D.; Sun, L.; Xu, X.; Wang, Z.; Zhang, J.; Du, W. BLSTM and CNN Stacking Architecture for Speech Emotion Recognition. Neural Process. Lett. 2021, 53, 4097–4115. [Google Scholar] [CrossRef]
- Zhu, L.; Chen, L.; Zhao, D.; Zhou, J.; Zhang, W. Emotion recognition from Chinese speech for smart affective services using a combination of SVM and DBN. Sensors 2017, 17, 1694. [Google Scholar] [CrossRef]
- Kwon, S. A CNN-assisted enhanced audio signal processing for speech emotion recognition. Sensors 2019, 20, 183. [Google Scholar]
- Lieskovská, E.; Jakubec, M.; Jarina, R.; Chmulík, M. A review on speech emotion recognition using deep learning and attention mechanism. Electronics 2021, 10, 1163. [Google Scholar] [CrossRef]
- Kwon, S. Att-Net: Enhanced emotion recognition system using lightweight self-attention module. Appl. Soft Comput. 2021, 102, 107101. [Google Scholar]
- Chen, S.; Zhang, M.; Yang, X.; Zhao, Z.; Zou, T.; Sun, X. The impact of attention mechanisms on speech emotion recognition. Sensors 2021, 21, 7530. [Google Scholar] [CrossRef]
- Li, Y.; Zhao, T.; Kawahara, T. Improved End-to-End Speech Emotion Recognition Using Self Attention Mechanism and Multitask Learning. In Proceedings of the Interspeech 2019, Graz, Austria, 15–19 September 2019; pp. 2803–2807. [Google Scholar]
- Yenigalla, P.; Kumar, A.; Tripathi, S.; Singh, C.; Kar, S.; Vepa, J. Speech Emotion Recognition Using Spectrogram Phoneme Embedding. In Proceedings of the Interspeech 2018, Hyderabad, India, 2–6 September 2018; pp. 3688–3692. [Google Scholar]
- Sarma, M.; Ghahremani, P.; Povey, D.; Goel, N.K.; Sarma, K.K.; Dehak, N. Emotion Identification from Raw Speech Signals Using DNNs. In Proceedings of the Interspeech 2018, Hyderabad, India, 2–6 September 2018; pp. 3097–3101. [Google Scholar]
- Mirsamadi, S.; Barsoum, E.; Zhang, C. Automatic speech emotion recognition using recurrent neural networks with local attention. In Proceedings of the 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), New Orleans, LA, USA, 5–9 March 2017; pp. 2227–2231. [Google Scholar]
- Issa, D.; Demirci, M.F.; Yazici, A. Speech emotion recognition with deep convolutional neural networks. Biomed. Signal Process. Control 2020, 59, 101894. [Google Scholar] [CrossRef]
- Carta, S.; Corriga, A.; Ferreira, A.; Podda, A.S.; Recupero, D.R. A multi-layer and multi-ensemble stock trader using deep learning and deep reinforcement learning. Appl. Intell. 2021, 51, 889–905. [Google Scholar] [CrossRef]
- Zhang, J.; Xing, L.; Tan, Z.; Wang, H.; Wang, K. Multi-head attention fusion networks for multi-modal speech emotion recognition. Comput. Ind. Eng. 2022, 168, 108078. [Google Scholar] [CrossRef]
- Demilie, W.B.; Salau, A.O. Detection of fake news and hate speech for Ethiopian languages: A systematic review of the approaches. J. Big Data 2022, 9, 66. [Google Scholar] [CrossRef]
- Bautista, J.L.; Lee, Y.K.; Shin, H.S. Speech Emotion Recognition Based on Parallel CNN-Attention Networks with Multi-Fold Data Augmentation. Electronics 2022, 11, 3935. [Google Scholar] [CrossRef]
- Abeje, B.T.; Salau, A.O.; Ebabu, H.A.; Ayalew, A.M. Comparative Analysis of Deep Learning Models for Aspect Level Amharic News Sentiment Analysis. In Proceedings of the 2022 International Conference on Decision Aid Sciences and Applications (DASA), Chiangrai, Thailand, 23–25 March 2022; pp. 1628–1633. [Google Scholar]
- Kakuba, S.; Poulose, A.; Han, D.S. Deep Learning-Based Speech Emotion Recognition Using Multi-Level Fusion of Concurrent Features. IEEE Access 2022, 10, 125538–125551. [Google Scholar] [CrossRef]
- Tao, H.; Geng, L.; Shan, S.; Mai, J.; Fu, H. Multi-Stream Convolution-Recurrent Neural Networks Based on Attention Mechanism Fusion for Speech Emotion Recognition. Entropy 2022, 24, 1025. [Google Scholar] [CrossRef] [PubMed]
- Kwon, S. MLT-DNet: Speech emotion recognition using 1D dilated CNN based on multi-learning trick approach. Expert Syst. Appl. 2021, 167, 114177. [Google Scholar]
- Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. Adv. Neural Inf. Process. Syst. 2017, 30, 5998–6008. [Google Scholar]
- Livingstone, S.R.; Russo, F.A. The Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS): A dynamic, multimodal set of facial and vocal expressions in North American English. PLoS ONE 2018, 13, e0196391. [Google Scholar] [CrossRef]
- Busso, C.; Bulut, M.; Lee, C.C.; Kazemzadeh, A.; Mower, E.; Kim, S.; Narayanan, S.S. IEMOCAP: Interactive emotional dyadic motion capture database. Lang. Resour. Eval. 2008, 42, 335–359. [Google Scholar] [CrossRef]
- Zeng, Y.; Mao, H.; Peng, D.; Yi, Z. Spectrogram based multi-task audio classification. Multimed. Tools Appl. 2019, 78, 3705–3722. [Google Scholar] [CrossRef]
- Almadhor, A.; Irfan, R.; Gao, J.; Saleem, N.; Rauf, H.T.; Kadry, S. E2E-DASR: End-to-end deep learning-based dysarthric automatic speech recognition. Expert Syst. Appl. 2023, 222, 119797. [Google Scholar] [CrossRef]
Ref # | DNN Model | Model Input | Input Features | Accuracy |
---|---|---|---|---|
[74] | CNNAtt-Net | Spectrograms | Spatial Features | 80.00% |
[71] | SVM-DBN | MFCC | Prosodic + Spectral | 80.11% |
[63] | CNN-BLSTM | Spectrograms | Spatial + Temporal | 77.02% |
[56] | Bagged SVM | Spectrograms | Spectral Features | 75.79% |
[66] | Lightweight CNN | Spectrograms | Spectral Features | 75.01% |
[69] | ConvLSTM | Spectrograms | Spectral Features | 75.00% |
[60] | 1D-CNN-FCN | MFCC | Prosodic + Spectral | 72.19% |
[82] | 1D-CNN | MFCC + Chromagram + Spectrogram | Spatial Features | 71.61% |
[78] | TDNN-LSTM-Attn | Raw Spectra | Spectral Features | 70.11% |
[72] | DS-CNN | Raw Spectra | Spatial Features | 70.00% |
[55] | Voting LMT | Spectrograms | Spectral Features | 67.14% |
[73] | RNN-Attn | Raw Spectra | Acoustic + Temporal | 63.50% |
[70] | BLSTM-CNN | MFCC | Prosodic + Spectral | 57.84% |
Emotion | Audio Files | Contribution |
---|---|---|
Happiness | 192 | 13.33% |
Sadness | 192 | 13.33% |
Anger | 192 | 13.33% |
Calm | 192 | 13.33% |
Fear | 192 | 13.33% |
Neutral | 96 | 6.667% |
Disgust | 192 | 13.33% |
Surprise | 192 | 13.33% |
Emotion | Audio Files | Contribution |
---|---|---|
Happiness | 1636 | 24.33% |
Sadness | 1084 | 16.12% |
Anger | 1103 | 16.40% |
Calm | 1700 | 25.28% |
Fear | 1200 | 17.84% |
Layer | Input Dim | Padding | Padded Dim | Filter Size | Conv Output Dim | Maxpool, Stride | Pooled Output Dim
---|---|---|---|---|---|---|---
1 | (1 × 282 × 40) | 1 | (1 × 284 × 42) | (1 × 3 × 3) | (16 × 40 × 282) | (2 × 2), 2 | (16 × 20 × 141) |
2 | (16 × 20 × 141) | 1 | (16 × 22 × 143) | (16 × 3 × 3) | (32 × 20 × 141) | (4 × 4), 4 | (32 × 5 × 35) |
3 | (32 × 5 × 35) | 1 | (32 × 7 × 37) | (32 × 3 × 3) | (64 × 5 × 35) | (4 × 4), 4 | (64 × 1 × 8) |
Flatten: (64 × 1 × 8) → final convolutional embedding of length 512 (1 × 512).
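The dimensions in the table above can be traced with a short PyTorch sketch of one CNN branch. This is a reconstruction under assumptions: ReLU activations are inferred, the pooling strides match the listed pool sizes, and the skip connections shown in Figure 3 are omitted for brevity. It reproduces the 512-dimensional embedding per branch.

```python
import torch
import torch.nn as nn

class CNNBranch(nn.Module):
    """One of the two parallel CNN branches, following the layer table above."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(kernel_size=2, stride=2),      # -> (16, 141, 20)
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(kernel_size=4, stride=4),      # -> (32, 35, 5)
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(kernel_size=4, stride=4),      # -> (64, 8, 1)
        )

    def forward(self, x):                               # x: (batch, 1, 282, 40) MFCC map
        return self.features(x).flatten(1)              # -> (batch, 512)

print(CNNBranch()(torch.randn(2, 1, 282, 40)).shape)    # torch.Size([2, 512])
```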
Dataset | Happy | Sad | Angry | Calm | Fearful | Neutral | Disgust | Surprised | W.Acc | UW.Acc
---|---|---|---|---|---|---|---|---|---|---
RAVDESS | 90.1 | 73.2 | 82.2 | 98.1 | 72.3 | 40.2 | 82.1 | 87.1 | 79.3 | 78.2 |
IEMOCAP | 76.6 | 73.8 | 86.6 | 76.1 | 84.2 | - | - | - | 80.3 | 79.5 |
Emotion | Prc (%) | F1 (%)
---|---|---
Happy | 92 | 91
Sad | 49 | 74
Angry | 89 | 87
Calm | 93 | 94
Fearful | 82 | 79
Neutral | 45 | 75
Disgust | 90 | 87
Surprised | 90 | 88
Model accuracy (RAVDESS): 82.31%
Emotion | Prc (%) | F1 (%)
---|---|---
Happy | 78 | 81
Sad | 77 | 80
Angry | 71 | 86
Calm | 77 | 82
Fearful | 71 | 82
Model accuracy (IEMOCAP): 79.42%
Model Input | Database | Neural Architecture | Accuracy | Precision | F1 Score
---|---|---|---|---|---
MFCC Spectrum | RAVDESS | CTENet without MHAT | 72.10 | 73.34 | 80.40
MFCC Spectrum | IEMOCAP | CTENet without MHAT | 70.32 | 69.95 | 79.65
MFCC Spectrum | RAVDESS | CTENet with MHAT | 78.00 | 78.75 | 84.37
MFCC Spectrum | IEMOCAP | CTENet with MHAT | 79.00 | 74.80 | 82.20
Models | RAVDESS | IEMOCAP | Model Size |
---|---|---|---|
DSCNN [51] | 2400 s | 2640 s | 34.5 MB |
CB-SER [57] | 6250 s | 10,452 s | 125 MB |
AttNet [68] | 1900 s | 2100 s | 14.4 MB |
CTENet | 1600 s | 1900 s | 4.54 MB |
Ref# | Benchmarks | Input Features | RAVDESS Accuracy | RAVDESS Precision | RAVDESS F1 Score | IEMOCAP Accuracy | IEMOCAP Precision | IEMOCAP F1 Score
---|---|---|---|---|---|---|---|---
[56] | BE-SVM | Spectral Features | 75.69 | 74.00 | 73.34 | - | - | - |
[85] | GResNets | Spectral Features | 64.48 | 65.32 | 63.11 | - | - | - |
[86] | MLT-DNet | Spatial Features | - | - | - | 73.01 | 74.00 | 73.00 |
[57] | Deep-BLSTM | Spatial + Temporal | 77.02 | 76.00 | 77.00 | 72.50 | 73.00 | 72.00 |
[74] | 1D-CNN | Spectral Features | 71.61 | - | - | 64.30 | - | - |
[66] | DS-CNN | Spatial Features | 79.50 | 81.00 | 84.00 | 78.75 | 86.00 | 82.00 |
[60] | DeepNet | Spatial + Temporal | - | - | - | 77.00 | 76.00 | 76.00 |
[68] | Att-Net | Spatial Features | 80.00 | 81.00 | 80.00 | 78.00 | 78.00 | 78.00 |
Our | CTENet | Spatial + Temporal | 82.31 | 81.75 | 84.37 | 79.42 | 74.80 | 82.20 |