Abstract
This paper presents a study of multimodal automatic emotion recognition during speech-based interaction. A database was constructed from recordings of people pronouncing a sentence while interacting with an agent through speech. Ten people, with gender equally represented, pronounced a sentence corresponding to a command while expressing eight different emotions; the speakers had several different native languages, including French, German, Greek and Italian. Facial expression, gesture and acoustic analysis of speech were used to extract features relevant to emotion. A system based on a Bayesian classifier was used for the automatic classification of unimodal, bimodal and multimodal data. After classifying each modality automatically, the modalities were combined in a multimodal approach, and two fusion strategies were compared: fusion at the feature level (before running the classifier) and fusion at the results level (combining the outputs of the classifiers for each modality). Fusing the multimodal data produced a large increase in recognition rates over the unimodal systems: the multimodal approach raised the recognition rate by more than 10% relative to the most successful unimodal system. Bimodal emotion recognition based on all pairs of modalities (i.e., ‘face-gesture’, ‘face-speech’ and ‘gesture-speech’) was also investigated; the results show that ‘gesture-speech’ is the best pairing, and that using all three modalities yields a further 3.3% classification improvement over the best bimodal result.
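The distinction between the two fusion strategies can be made concrete with a minimal sketch. The following code is not the authors' implementation: it uses a Gaussian naive Bayes classifier from scikit-learn as a stand-in for the paper's Bayesian classifier, and the feature dimensions, sample count and train/test split are invented for illustration.

```python
# Sketch of feature-level vs. results-level fusion for multimodal emotion
# recognition. All shapes and data below are illustrative assumptions.
import numpy as np
from sklearn.naive_bayes import GaussianNB

rng = np.random.default_rng(0)
n_samples, n_classes = 240, 8  # eight emotion classes, as in the paper

# Hypothetical per-modality feature matrices (face, gesture, speech).
X_face = rng.normal(size=(n_samples, 20))
X_gesture = rng.normal(size=(n_samples, 15))
X_speech = rng.normal(size=(n_samples, 30))
y = rng.integers(0, n_classes, size=n_samples)

train, test = np.arange(0, 180), np.arange(180, n_samples)

# Feature-level fusion: concatenate the modality features into one vector
# per sample, then train a single classifier.
X_all = np.hstack([X_face, X_gesture, X_speech])
feat_clf = GaussianNB().fit(X_all[train], y[train])
feat_pred = feat_clf.predict(X_all[test])

# Results-level fusion: train one classifier per modality and combine the
# per-class posteriors of the individual classifiers.
posteriors = []
for X in (X_face, X_gesture, X_speech):
    clf = GaussianNB().fit(X[train], y[train])
    posteriors.append(clf.predict_proba(X[test]))
dec_pred = np.mean(posteriors, axis=0).argmax(axis=1)

print("feature-level accuracy:", (feat_pred == y[test]).mean())
print("results-level accuracy:", (dec_pred == y[test]).mean())
```

Averaging the posteriors is only one simple combination rule for results-level fusion; weighted combinations or majority voting over the per-modality decisions are common alternatives.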
Cite this article
Kessous, L., Castellano, G. & Caridakis, G. Multimodal emotion recognition in speech-based interaction using facial expression, body gesture and acoustic analysis. J Multimodal User Interfaces 3, 33–48 (2010). https://doi.org/10.1007/s12193-009-0025-5