Features and classifiers for emotion recognition from speech: a survey from 2000 to 2011

Published: 01 February 2015

Abstract

Speaker emotion recognition is achieved through processing methods that include isolation of the speech signal and extraction of selected features for the final classification. In terms of acoustics, speech processing techniques offer extremely valuable paralinguistic information, derived mainly from prosodic and spectral features. In some cases, the process is assisted by speech recognition systems, which contribute linguistic information to the classification. Both frameworks deal with a very challenging problem, as emotional states do not have clear-cut boundaries and often differ from person to person. In this article, research papers that investigate emotion recognition from audio channels are surveyed and classified, based mostly on the features they extract and select and on their classification methodology. Important topics are discussed, including the databases available for experimentation, appropriate feature extraction and selection methods, classifiers, and performance issues, with emphasis on research published in the last decade. The survey also discusses open issues and trends, along with directions for future research on this topic.




Published In

Artificial Intelligence Review, Volume 43, Issue 2
February 2015
155 pages

Publisher

Kluwer Academic Publishers

United States


Author Tags

  1. Classifiers
  2. Emotion recognition
  3. Speech features

Qualifiers

  • Article


Cited By

  • (2024) A review on speech emotion recognition. Neurocomputing 568:C. https://doi.org/10.1016/j.neucom.2023.127015. Online publication date: 1-Feb-2024.
  • (2024) MBCFNet. Knowledge-Based Systems 296:C. https://doi.org/10.1016/j.knosys.2024.111826. Online publication date: 19-Jul-2024.
  • (2024) Speech Emotion Recognition Using Generative Adversarial Network and Deep Convolutional Neural Network. Circuits, Systems, and Signal Processing 43:4 (2341-2384). https://doi.org/10.1007/s00034-023-02562-5. Online publication date: 1-Apr-2024.
  • (2023) WiFE: WiFi and Vision Based Unobtrusive Emotion Recognition via Gesture and Facial Expression. IEEE Transactions on Affective Computing 14:4 (2567-2581). https://doi.org/10.1109/TAFFC.2023.3285777. Online publication date: 1-Oct-2023.
  • (2023) EmoNet: A Transfer Learning Framework for Multi-Corpus Speech Emotion Recognition. IEEE Transactions on Affective Computing 14:2 (1472-1487). https://doi.org/10.1109/TAFFC.2021.3135152. Online publication date: 1-Apr-2023.
  • (2023) Speech emotion recognition approaches. Speech Communication 154:C. https://doi.org/10.1016/j.specom.2023.102974. Online publication date: 1-Oct-2023.
  • (2023) The amalgamation of wavelet packet information gain entropy tuned source and system parameters for improved speech emotion recognition. Speech Communication 149:C (11-28). https://doi.org/10.1016/j.specom.2023.03.007. Online publication date: 1-Apr-2023.
  • (2023) Trends in speech emotion recognition: a comprehensive survey. Multimedia Tools and Applications 82:19 (29307-29351). https://doi.org/10.1007/s11042-023-14656-y. Online publication date: 22-Feb-2023.
  • (2023) Affective social anthropomorphic intelligent system. Multimedia Tools and Applications 82:23 (35059-35090). https://doi.org/10.1007/s11042-023-14597-6. Online publication date: 7-Mar-2023.
  • (2023) CCTG-NET: Contextualized Convolutional Transformer-GRU Network for speech emotion recognition. International Journal of Speech Technology 26:4 (1099-1116). https://doi.org/10.1007/s10772-023-10080-7. Online publication date: 1-Dec-2023.
