Significance of incorporating excitation source parameters for improved emotion recognition from speech and electroglottographic signals

Published in: International Journal of Speech Technology

Abstract

The work presented in this paper explores the effectiveness of incorporating excitation source parameters, namely the strength of excitation and the instantaneous fundamental frequency (\(F_0\)), for emotion recognition from speech and electroglottographic (EGG) signals. The strength of excitation (SoE) is an important parameter indicating the pressure with which the glottis closes at the glottal closure instants (GCIs). The SoE is computed by the popular zero frequency filtering (ZFF) method, which accurately estimates the glottal signal characteristics by attenuating or removing the high-frequency vocal tract interactions in speech. An impulse sequence constructed at the estimated GCIs is used to derive the instantaneous \(F_0\). The SoE and instantaneous \(F_0\) parameters are combined with conventional mel frequency cepstral coefficients (MFCC) to improve the recognition rates of distinct emotions (Anger, Happy and Sad) using Gaussian mixture models as the classifier. The performance of the proposed combination of SoE, instantaneous \(F_0\) and their dynamic features with the MFCC coefficients is evaluated on emotion utterances from the classical German full-blown emotion speech database (EmoDb, 4 emotions and neutral), which provides simultaneous speech and EGG signals, and from the Surrey Audio-Visual Expressed Emotion database (3 emotions and neutral), for both speaker-dependent and speaker-independent emotion recognition scenarios. To reinforce the effectiveness of the proposed features and to obtain better statistical consistency in the emotion analysis, a fairly large emotion speech database in Tamil, with 220 utterances per emotion and simultaneous EGG recordings, is used in addition to EmoDb. The effectiveness of SoE and instantaneous \(F_0\) in characterizing different emotions is further confirmed by the improved emotion recognition performance on the Tamil speech-EGG emotion database.
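To make the excitation source parameterization concrete, the sketch below illustrates one way the ZFF-based estimation of GCIs, SoE and instantaneous \(F_0\) described in the abstract could be implemented. It is a minimal illustration under stated assumptions, not the authors' implementation: the trend-removal window length, the number of mean-subtraction passes and all function names are assumptions.

```python
"""Minimal sketch of zero frequency filtering (ZFF) based epoch (GCI),
strength-of-excitation (SoE) and instantaneous F0 estimation.
Window length and the repeated trend removal are assumed settings."""
import numpy as np

def zff_epochs(x, fs, mean_window_ms=10.0):
    # Difference the signal to remove any DC offset or slow drift.
    x = np.append(x[0], np.diff(x))

    # Pass through two ideal zero-frequency resonators (double integration).
    y = np.cumsum(np.cumsum(x))

    # Remove the slowly growing trend by repeated local-mean subtraction
    # over a window comparable to the average pitch period (~10 ms here).
    n = int(round(mean_window_ms * 1e-3 * fs))
    kernel = np.ones(2 * n + 1) / (2 * n + 1)
    for _ in range(3):
        y = y - np.convolve(y, kernel, mode="same")

    # GCIs: negative-to-positive zero crossings of the ZFF signal.
    gci = np.where((y[:-1] < 0) & (y[1:] >= 0))[0]

    # SoE: slope of the ZFF signal around each zero crossing.
    soe = np.abs(y[gci + 1] - y[gci])

    # Instantaneous F0 from successive epoch intervals (one value per interval).
    f0 = fs / np.diff(gci) if len(gci) > 1 else np.array([])
    return gci, soe, f0
```

In a recognition setup such as the one described here, the per-epoch SoE and instantaneous \(F_0\) values would typically be interpolated to the MFCC frame rate before being appended to the cepstral features and modelled with one Gaussian mixture per emotion.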


Notes

  1. The term MFCC used throughout this paper denotes a 39-dimensional feature vector comprising 13 MFCCs along with their 13 velocity (\(\Delta\)) and 13 acceleration (\(\Delta\Delta\)) coefficients, as sketched below.
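As an illustration of this note, the fragment below assembles the 39-dimensional vector from 13 static MFCCs and their first- and second-order derivatives. The choice of librosa and the file name are assumptions; the paper does not specify a feature-extraction toolkit.

```python
"""Sketch of the 39-dimensional MFCC vector: 13 static + delta + delta-delta."""
import librosa
import numpy as np

y, sr = librosa.load("utterance.wav", sr=None)       # hypothetical input file
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)   # 13 static coefficients
delta = librosa.feature.delta(mfcc)                  # velocity coefficients
delta2 = librosa.feature.delta(mfcc, order=2)        # acceleration coefficients
features = np.vstack([mfcc, delta, delta2])          # shape: (39, n_frames)
```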


Acknowledgements

The work reported in this paper was funded by the completed DST-SERB project titled "Analysis, Processing and Synthesis of Emotions in Speech" (Ref. No. SB/FTP/ETA-370/2012).

Author information

Corresponding author

Correspondence to D. Govind.

About this article


Cite this article

Pravena, D., Govind, D. Significance of incorporating excitation source parameters for improved emotion recognition from speech and electroglottographic signals. Int J Speech Technol 20, 787–797 (2017). https://doi.org/10.1007/s10772-017-9445-x
