Abstract
Most speech emotion recognition studies consider only clean speech. In this study, statistics of joint spectro-temporal modulation features, extracted from an auditory perceptual model, are used to recognize the emotional status of speech under noisy conditions. Speech samples were taken from the Berlin Emotional Speech database and corrupted with white and babble noise at various SNR levels. A clean-train/noisy-test scenario was investigated to simulate practical conditions with unknown noise sources. Simulations reveal redundancy among the proposed spectro-temporal modulation features, so dimensionality reduction is also examined. Under noisy conditions, the proposed modulation features achieve higher emotion recognition rates than (1) conventional mel-frequency cepstral coefficients combined with prosodic features and (2) the official acoustic feature set of the INTERSPEECH 2009 Emotion Challenge. Adding the modulation features to the INTERSPEECH feature set increased recognition rates by approximately 7% under all tested SNR conditions (20–0 dB).
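To make the experimental setup concrete, below is a minimal Python sketch, assuming NumPy and SciPy, of two steps the abstract describes: mixing a noise recording into clean speech at a target SNR for the clean-train/noisy-test protocol, and pooling coarse statistics from a joint spectro-temporal modulation analysis of a spectrogram. The 2-D FFT here is only a crude stand-in for the auditory perceptual model used in the paper, and all function names, parameter values, and signals are illustrative rather than taken from the authors' implementation.

```python
# Illustrative sketch only: the 2-D FFT below approximates, but does not
# reproduce, the auditory spectro-temporal modulation model of the paper.
import numpy as np
from scipy.signal import stft

def add_noise_at_snr(clean, noise, snr_db):
    """Mix `noise` into `clean` so the mixture has the requested SNR (dB)."""
    reps = int(np.ceil(len(clean) / len(noise)))
    noise = np.tile(noise, reps)[:len(clean)]    # loop/trim noise to length
    p_clean = np.mean(clean ** 2)
    p_noise = np.mean(noise ** 2)
    # Choose a gain so that 10*log10(p_clean / (gain^2 * p_noise)) = snr_db.
    gain = np.sqrt(p_clean / (p_noise * 10.0 ** (snr_db / 10.0)))
    return clean + gain * noise

def modulation_statistics(x, fs, nperseg=400, hop=160):
    """Mean/std of a coarse joint spectro-temporal modulation energy map."""
    _, _, S = stft(x, fs=fs, nperseg=nperseg, noverlap=nperseg - hop)
    log_spec = np.log(np.abs(S) + 1e-8)          # log-magnitude spectrogram
    # 2-D FFT of the spectrogram: one axis indexes spectral modulation
    # ("scale"), the other temporal modulation ("rate").
    mod_energy = np.abs(np.fft.fft2(log_spec))
    return np.array([mod_energy.mean(), mod_energy.std()])

# Clean-train/noisy-test: train on clean utterances, then evaluate on
# copies corrupted at each SNR in the 20-0 dB range reported in the paper.
rng = np.random.default_rng(0)
clean = rng.standard_normal(16000)               # placeholder for speech
white = rng.standard_normal(16000)               # placeholder for noise
for snr_db in (20, 15, 10, 5, 0):                # illustrative SNR levels
    noisy = add_noise_at_snr(clean, white, snr_db)
    feats = modulation_statistics(noisy, fs=16000)
```

In the paper the pooled statistics feed a classifier trained on clean speech only, so robustness is measured entirely by how stable the features remain as the test SNR drops.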
Acknowledgments
This study was supported in part by the National Science Council, Taiwan, under Grant No. NSC 99-2220-E-009-056.
Cite this article
Chi, TS., Yeh, LY. & Hsu, CC. Robust emotion recognition by spectro-temporal modulation statistic features. J Ambient Intell Human Comput 3, 47–60 (2012). https://doi.org/10.1007/s12652-011-0088-5