Abstract
Speaker-independent speech emotion recognition (SER) is a challenging task because of variations among speakers, such as gender, age, and other emotion-irrelevant factors, which can lead to large differences in the distribution of emotional features. To alleviate the adverse effects of these emotion-irrelevant factors, we propose an SER model that consists of a convolutional neural network (CNN), an attention-based bidirectional long short-term memory network (BLSTM), and multiple linear support vector machines (SVMs). The log Mel-spectrogram and its velocity (delta) and acceleration (double-delta) coefficients are adopted as the inputs of our model, since they provide sufficient information for feature learning. Several groups of speaker-independent SER experiments are performed on the Interactive Emotional Dyadic Motion Capture (IEMOCAP) database to improve the credibility of the results. Experimental results show that our method achieves an unweighted average recall of 61.50% and a weighted average recall of 62.31% for speaker-independent SER on the IEMOCAP database.
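As a concrete illustration of the input representation described above, the following is a minimal sketch that stacks the log Mel-spectrogram with its velocity (delta) and acceleration (double-delta) coefficients into a three-channel array suitable for a CNN front end. It assumes librosa is available; the sampling rate and frame parameters (n_fft, hop_length, n_mels) are illustrative assumptions, not values reported in the paper.

```python
import librosa
import numpy as np

def logmel_three_channel(path, sr=16000, n_fft=400, hop_length=160, n_mels=40):
    """Return a (n_mels, frames, 3) array: log-Mel, delta, double-delta."""
    y, _ = librosa.load(path, sr=sr)
    mel = librosa.feature.melspectrogram(
        y=y, sr=sr, n_fft=n_fft, hop_length=hop_length, n_mels=n_mels)
    log_mel = librosa.power_to_db(mel)                # static log-Mel channel
    delta = librosa.feature.delta(log_mel)            # velocity (delta) channel
    delta2 = librosa.feature.delta(log_mel, order=2)  # acceleration (double-delta) channel
    # Stack the three channels along the last axis for a 2-D CNN input.
    return np.stack([log_mel, delta, delta2], axis=-1)
```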
Acknowledgements
This work was supported in part by the National Natural Science Foundation of China under Grants 61403422, 61703375 and 61273102, the Hubei Provincial Natural Science Foundation of China under Grants 2018CFB447 and 2015CFA010, the Wuhan Science and Technology Project under Grant 2017010201010133, the 111 project under Grant B17040, and the Fundamental Research Funds for National University, China University of Geosciences (Wuhan) under Grant 1810491T07.
Copyright information
© 2019 Springer Nature Switzerland AG
About this paper
Cite this paper
Liu, Z.T., Xiao, P., Li, D.Y., Hao, M. (2019). Speaker-Independent Speech Emotion Recognition Based on CNN-BLSTM and Multiple SVMs. In: Yu, H., Liu, J., Liu, L., Ju, Z., Liu, Y., Zhou, D. (eds) Intelligent Robotics and Applications. ICIRA 2019. Lecture Notes in Computer Science, vol. 11742. Springer, Cham. https://doi.org/10.1007/978-3-030-27535-8_43
DOI: https://doi.org/10.1007/978-3-030-27535-8_43
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-27534-1
Online ISBN: 978-3-030-27535-8