Abstract
For successful human–machine interaction (HMI), not only the pure textual information but also the individual skills, preferences, and affective states of the user must be known. Therefore, as a starting point, the user's current affective state has to be recognised. In this work we investigated how additional knowledge about the user, for example age and gender, can be used to improve the recognition of affective state. Two methods from automatic speech recognition are used to incorporate age and gender differences into the recognition of affective state: speaker group-dependent (SGD) modelling and vocal tract length normalisation (VTLN). The investigations were performed on four corpora with acted and naturalistic affective speech. Different features and two classification methods (Gaussian mixture models (GMMs) and multi-layer perceptrons (MLPs)) were used. In addition, the effects of channel compensation and contextual characteristics were analysed. The results are compared with our own baseline results and with results reported in the literature. Two hypotheses were tested: first, that incorporating age information further improves speaker group-dependent modelling; second, that acoustic normalisation does not achieve the same improvement as speaker group-dependent modelling, because the age and gender of a speaker affect the way emotions are expressed.
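The first hypothesis rests on speaker group-dependent modelling: separate affect models are trained for each age/gender group, and a test utterance is scored only against the models of its own group. The sketch below illustrates this idea under stated assumptions; scikit-learn's GaussianMixture, the diagonal covariances, the group labels, and the data shapes are illustrative choices, not the authors' actual configuration.

```python
# Minimal sketch of speaker group-dependent (SGD) GMM classification.
# Assumptions (not from the article): scikit-learn's GaussianMixture,
# pre-extracted MFCC frame matrices, and known age/gender group labels.
import numpy as np
from sklearn.mixture import GaussianMixture


def train_sgd_gmms(train_data, n_components=8):
    """Train one GMM per (speaker group, emotion) pair.

    train_data: dict mapping (group, emotion) -> 2-D array of feature
    frames (e.g. MFCCs), shape (n_frames, n_features).
    Returns: dict mapping group -> {emotion: fitted GMM}.
    """
    models = {}
    for (group, emotion), frames in train_data.items():
        gmm = GaussianMixture(n_components=n_components,
                              covariance_type="diag",
                              random_state=0)
        gmm.fit(frames)
        models.setdefault(group, {})[emotion] = gmm
    return models


def classify_utterance(models, group, frames):
    """Score an utterance only against the GMMs of the speaker's own
    group and return the emotion with the highest mean log-likelihood."""
    scores = {emotion: gmm.score(frames)  # average log-likelihood per frame
              for emotion, gmm in models[group].items()}
    return max(scores, key=scores.get)


# Toy usage with random vectors standing in for MFCC frames.
rng = np.random.default_rng(0)
train = {("female_young", "neutral"): rng.normal(0.0, 1.0, (500, 13)),
         ("female_young", "anger"):   rng.normal(1.5, 1.0, (500, 13))}
sgd_models = train_sgd_gmms(train)
test_utterance = rng.normal(1.4, 1.0, (120, 13))
print(classify_utterance(sgd_models, "female_young", test_utterance))
```

In contrast to such model splitting, VTLN addresses the same speaker variability at the feature level, warping the frequency axis during feature extraction rather than training group-specific models.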
Acknowledgments
The work presented in this article was conducted within the Transregional Collaborative Research Centre SFB/TRR 62 “Companion-Technology for Cognitive Technical Systems” funded by the German Research Foundation (DFG). We also acknowledge the DFG for financing our computing cluster. Portions of the research in this article use the LAST MINUTE Corpus generated under the supervision of Professor Jörg Frommer and Professor Dietmar Rösner.