Abstract
Most speech emotion recognition studies consider only clean speech. In this study, statistics of joint spectro-temporal modulation features, extracted from an auditory perceptual model, are used to recognize the emotional status of speech under noisy conditions. Speech samples were taken from the Berlin Emotional Speech database and corrupted with white and babble noise at various SNR levels. A clean-train/noisy-test scenario was investigated to simulate practical conditions with unknown noise sources. Simulations reveal redundancy among the proposed spectro-temporal modulation features, so dimensionality reduction is also examined. Under noisy conditions, the proposed modulation features achieve higher emotion recognition rates than (1) conventional mel-frequency cepstral coefficients combined with prosodic features and (2) the official acoustic feature set of the INTERSPEECH 2009 Emotion Challenge. Adding the modulation features to the INTERSPEECH feature set increased recognition rates by approximately 7% under all tested SNR conditions (20–0 dB).
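To make the experimental setup concrete, below is a minimal Python sketch, assuming NumPy and SciPy, of two steps the abstract describes: mixing a noise recording into clean speech at a target SNR for the clean-train/noisy-test protocol, and pooling coarse statistics from a joint spectro-temporal modulation analysis of a spectrogram. The 2-D FFT here is only a crude stand-in for the auditory perceptual model used in the paper, and all function names, parameter values, and signals are illustrative rather than taken from the authors' implementation.

```python
# Illustrative sketch only: the 2-D FFT below approximates, but does not
# reproduce, the auditory spectro-temporal modulation model of the paper.
import numpy as np
from scipy.signal import stft

def add_noise_at_snr(clean, noise, snr_db):
    """Mix `noise` into `clean` so the mixture has the requested SNR (dB)."""
    reps = int(np.ceil(len(clean) / len(noise)))
    noise = np.tile(noise, reps)[:len(clean)]    # loop/trim noise to length
    p_clean = np.mean(clean ** 2)
    p_noise = np.mean(noise ** 2)
    # Choose a gain so that 10*log10(p_clean / (gain^2 * p_noise)) = snr_db.
    gain = np.sqrt(p_clean / (p_noise * 10.0 ** (snr_db / 10.0)))
    return clean + gain * noise

def modulation_statistics(x, fs, nperseg=400, hop=160):
    """Mean/std of a coarse joint spectro-temporal modulation energy map."""
    _, _, S = stft(x, fs=fs, nperseg=nperseg, noverlap=nperseg - hop)
    log_spec = np.log(np.abs(S) + 1e-8)          # log-magnitude spectrogram
    # 2-D FFT of the spectrogram: one axis indexes spectral modulation
    # ("scale"), the other temporal modulation ("rate").
    mod_energy = np.abs(np.fft.fft2(log_spec))
    return np.array([mod_energy.mean(), mod_energy.std()])

# Clean-train/noisy-test: train on clean utterances, then evaluate on
# copies corrupted at each SNR in the 20-0 dB range reported in the paper.
rng = np.random.default_rng(0)
clean = rng.standard_normal(16000)               # placeholder for speech
white = rng.standard_normal(16000)               # placeholder for noise
for snr_db in (20, 15, 10, 5, 0):                # illustrative SNR levels
    noisy = add_noise_at_snr(clean, white, snr_db)
    feats = modulation_statistics(noisy, fs=16000)
```

In the paper the pooled statistics feed a classifier trained on clean speech only, so robustness is measured entirely by how stable the features remain as the test SNR drops.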
Acknowledgments
This study was supported in part by the National Science Council, Taiwan, under Grant No. NSC 99-2220-E-009-056.
Cite this article
Chi, TS., Yeh, LY. & Hsu, CC. Robust emotion recognition by spectro-temporal modulation statistic features. J Ambient Intell Human Comput 3, 47–60 (2012). https://doi.org/10.1007/s12652-011-0088-5