Abstract
Speaker emotion recognition is achieved through processing methods that include isolation of the speech signal and extraction of selected features for the final classification. Acoustically, speech processing techniques offer extremely valuable paralinguistic information, derived mainly from prosodic and spectral features. In some cases the process is assisted by speech recognition systems, which contribute linguistic information to the classification. Both frameworks address a very challenging problem, as emotional states have no clear-cut boundaries and often differ from person to person. This article surveys and classifies research papers investigating emotion recognition from the audio channel, based mostly on the features extracted and selected and on the classification methodology. Important topics are discussed, including the databases available for experimentation, appropriate feature extraction and selection methods, classifiers, and performance issues, with emphasis on research published in the last decade. The survey concludes with a discussion of open trends and directions for future research on this topic.
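The abstract describes the standard pipeline: frame-level feature extraction (prosodic and spectral descriptors) followed by utterance-level statistics and a classifier. As a minimal, hypothetical sketch of that pipeline (not drawn from any of the surveyed papers — all names, parameters, and the synthetic signals are assumptions), the following computes two simple descriptors per frame (short-time energy and zero-crossing rate), summarizes them per utterance, and classifies with a nearest-centroid rule:

```python
import numpy as np

def frame_features(signal, frame_len=400, hop=200):
    """Short-time energy and zero-crossing rate for each analysis frame."""
    feats = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = signal[start:start + frame_len]
        energy = float(np.mean(frame ** 2))                          # prosodic: intensity
        zcr = float(np.mean(np.abs(np.diff(np.sign(frame))) > 0))    # spectral proxy
        feats.append((energy, zcr))
    return np.array(feats)

def utterance_vector(signal):
    """Utterance-level statistics (here: means) over the frame features."""
    return frame_features(signal).mean(axis=0)

def nearest_centroid(x, centroids):
    """Assign the label of the closest class centroid in feature space."""
    labels = list(centroids)
    dists = [np.linalg.norm(x - centroids[k]) for k in labels]
    return labels[int(np.argmin(dists))]

# Synthetic 1-second utterances at 16 kHz: a quiet low-pitched "calm" signal
# and a loud, higher-pitched, noisier "excited" one.
rng = np.random.default_rng(0)
t = np.arange(16000) / 16000.0
calm = 0.2 * np.sin(2 * np.pi * 120 * t)
excited = 0.9 * np.sin(2 * np.pi * 400 * t) + 0.1 * rng.standard_normal(t.size)

centroids = {"calm": utterance_vector(calm), "excited": utterance_vector(excited)}
test = 0.85 * np.sin(2 * np.pi * 380 * t)
print(nearest_centroid(utterance_vector(test), centroids))
```

Real systems in the surveyed literature replace these toy descriptors with richer feature sets (pitch contours, formants, MFCCs) and the centroid rule with classifiers such as HMMs, GMMs, SVMs, or neural networks, but the extract-summarize-classify structure is the same.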
Anagnostopoulos, CN., Iliou, T. & Giannoukos, I. Features and classifiers for emotion recognition from speech: a survey from 2000 to 2011. Artif Intell Rev 43, 155–177 (2015). https://doi.org/10.1007/s10462-012-9368-5