Abstract
This paper presents a study of multimodal automatic emotion recognition during speech-based interaction. A database was constructed from recordings of people pronouncing a sentence while interacting with an agent through speech. Ten people, with gender equally represented, pronounced a sentence corresponding to a command while expressing eight different emotions; the speakers had several different native languages, including French, German, Greek and Italian. Facial expression, gesture and acoustic analysis of speech were used to extract features relevant to emotion. A system based on a Bayesian classifier was used for the automatic classification of unimodal, bimodal and multimodal data. After classifying each modality automatically, the modalities were combined in a multimodal approach, and two fusion strategies were compared: fusion at the feature level (before running the classifier) and fusion at the results level (combining the outputs of the classifiers for each modality). Fusing the multimodal data produced a large increase in recognition rates over the unimodal systems: the multimodal approach raised the recognition rate by more than 10% relative to the most successful unimodal system. Bimodal emotion recognition based on all pairs of modalities (i.e., ‘face-gesture’, ‘face-speech’ and ‘gesture-speech’) was also investigated; the results show that ‘gesture-speech’ is the best pairing, and that using all three modalities yields a further 3.3% classification improvement over the best bimodal result.
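The distinction between the two fusion strategies can be made concrete with a minimal sketch. The following code is not the authors' implementation: it uses a Gaussian naive Bayes classifier from scikit-learn as a stand-in for the paper's Bayesian classifier, and the feature dimensions, sample count and train/test split are invented for illustration.

```python
# Sketch of feature-level vs. results-level fusion for multimodal emotion
# recognition. All shapes and data below are illustrative assumptions.
import numpy as np
from sklearn.naive_bayes import GaussianNB

rng = np.random.default_rng(0)
n_samples, n_classes = 240, 8  # eight emotion classes, as in the paper

# Hypothetical per-modality feature matrices (face, gesture, speech).
X_face = rng.normal(size=(n_samples, 20))
X_gesture = rng.normal(size=(n_samples, 15))
X_speech = rng.normal(size=(n_samples, 30))
y = rng.integers(0, n_classes, size=n_samples)

train, test = np.arange(0, 180), np.arange(180, n_samples)

# Feature-level fusion: concatenate the modality features into one vector
# per sample, then train a single classifier.
X_all = np.hstack([X_face, X_gesture, X_speech])
feat_clf = GaussianNB().fit(X_all[train], y[train])
feat_pred = feat_clf.predict(X_all[test])

# Results-level fusion: train one classifier per modality and combine the
# per-class posteriors of the individual classifiers.
posteriors = []
for X in (X_face, X_gesture, X_speech):
    clf = GaussianNB().fit(X[train], y[train])
    posteriors.append(clf.predict_proba(X[test]))
dec_pred = np.mean(posteriors, axis=0).argmax(axis=1)

print("feature-level accuracy:", (feat_pred == y[test]).mean())
print("results-level accuracy:", (dec_pred == y[test]).mean())
```

Averaging the posteriors is only one simple combination rule for results-level fusion; weighted combinations or majority voting over the per-modality decisions are common alternatives.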
Cite this article
Kessous, L., Castellano, G. & Caridakis, G. Multimodal emotion recognition in speech-based interaction using facial expression, body gesture and acoustic analysis. J Multimodal User Interfaces 3, 33–48 (2010). https://doi.org/10.1007/s12193-009-0025-5