Abstract
Emotion recognition (ER) has drawn the interest of many researchers in the field of human-computer interaction, as it is central to applications such as assisted living and personalized content suggestion. For ER-capable systems to be widely adopted in daily life, emotion recognition methods must work on data collected in an unobtrusive way. Among the possible data modalities for affective state analysis, which include video and biometrics, speech is considered the least intrusive and has therefore drawn the focus of many research efforts. In this chapter, we discuss methods for analyzing the non-linguistic component of vocalized speech for the purposes of ER. In particular, we propose a method for producing lower-dimensional representations of sound spectrograms that respect their temporal structure. Moreover, we explore possible methods for analyzing such representations, including shallow methods, recurrent neural networks, and attention mechanisms. Our models are evaluated on data taken from popular, public datasets for emotion analysis, with promising results.
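The core idea of the abstract — mapping a sound spectrogram to a lower-dimensional representation while preserving its temporal structure — can be sketched as follows. This is an illustrative reduction via frame-wise PCA, not the chapter's actual pipeline; the function name and all parameters (`nperseg`, `n_components`) are assumptions for the example.

```python
import numpy as np
from scipy.signal import spectrogram

def reduced_spectrogram(signal, fs, n_components=8):
    """Compute a log-magnitude spectrogram and project each time frame
    onto its top principal components. Frames are kept in temporal
    order, so the result is still a (shorter-vector) time series."""
    f, t, S = spectrogram(signal, fs=fs, nperseg=256, noverlap=128)
    X = np.log1p(S).T                 # shape: (frames, freq_bins)
    Xc = X - X.mean(axis=0)           # center each frequency bin
    # PCA via SVD of the centered frame matrix
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:n_components].T   # shape: (frames, n_components)

fs = 16000
sig = np.sin(2 * np.pi * 440 * np.arange(fs) / fs)  # 1 s synthetic tone
Z = reduced_spectrogram(sig, fs)
print(Z.shape)  # (frames, 8) — one 8-dim vector per time frame
```

A sequence model (e.g. an LSTM or an attention layer, as discussed in the chapter) can then consume `Z` frame by frame, which is what "respecting the temporal structure" buys over pooling the whole spectrogram into a single vector.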
© 2020 Springer Nature Switzerland AG
Pikramenos, G., Kechagias, K., Psallidas, T., Smyrnis, G., Spyrou, E., Perantonis, S. (2020). Dimensionality Reduction and Attention Mechanisms for Extracting Affective State from Sound Spectrograms. In: De Marsico, M., Sanniti di Baja, G., Fred, A. (eds) Pattern Recognition Applications and Methods. ICPRAM 2020. Lecture Notes in Computer Science(), vol 12594. Springer, Cham. https://doi.org/10.1007/978-3-030-66125-0_3
Print ISBN: 978-3-030-66124-3
Online ISBN: 978-3-030-66125-0