Abstract
Speech conveys information about the emotive state of the speaker. Emotions can be extracted by analyzing the speech signal in every segment of its utterance, and prosody is a critical indicator of emotion and stress in a speech signal. Nonetheless, emotion extraction is a difficult task because of the variability of the speech signal. A tonal language such as Assamese relies heavily on prosodic cues like intonation: the same expression conveys a different meaning with even a slight change in the tone of voice. Speakers produce prosody to encode the message, and listeners use it to interpret that message. Extracting prosodic features such as F0, articulation rate, duration, and intensity, and establishing the relationships among them, helps identify the significant differences between emotions. Variation in prosody across emotions also leads to variation in the glottal closure instants (GCIs). A GCI, or epoch, refers to the instant of significant excitation of the vocal tract caused by the sudden closure of the glottis. Emotions therefore influence not only the prosody but also the strength of excitation of a speech signal. This paper is an endeavor to understand the effect of emotions on prosody, how the prosodic features are correlated, and how emotion affects the strength of excitation (epoch). This is investigated by computing the prosodic features with the speech analysis tools PRAAT, Python, and Matlab. A statistical analysis of the computed prosodic features is then performed using the ANOVA test to establish the relationships among the prosodic features and to determine how they change across emotions. Finally, the excitation source signal is extracted and the variation of its GCIs for the same utterance at different positions is examined using Matlab.
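The GCI extraction described above was carried out in Matlab. As a language-neutral illustration only, the following is a minimal Python sketch of zero-frequency filtering, a widely used epoch-extraction technique, applied to a synthetic voiced signal. The function name, sampling rate, pitch value, window settings, and test signal are all illustrative assumptions, not the authors' implementation.

```python
import math

def zff_epochs(signal, fs, avg_pitch_hz=100.0):
    """Sketch of zero-frequency filtering (ZFF) for epoch/GCI detection."""
    # Difference the signal to remove any DC offset.
    x = [signal[i] - signal[i - 1] for i in range(1, len(signal))]
    # Pass twice through a zero-frequency resonator (cascaded integrators).
    y = x
    for _ in range(2):
        out = [0.0, 0.0]
        for i in range(2, len(y)):
            out.append(2.0 * out[i - 1] - out[i - 2] + y[i])
        y = out
    # Remove the growing polynomial trend by subtracting a local mean whose
    # window is roughly one average pitch period; repeat a few times.
    w = int(fs / avg_pitch_hz) // 2
    for _ in range(3):
        z = []
        for i in range(len(y)):
            lo, hi = max(0, i - w), min(len(y), i + w + 1)
            z.append(y[i] - sum(y[lo:hi]) / (hi - lo))
        y = z
    # Positive-going zero crossings of the filtered signal mark the epochs.
    return [i for i in range(1, len(y)) if y[i - 1] < 0.0 <= y[i]]

# Hypothetical voiced segment: impulses every 80 samples (100 Hz pitch at
# 8 kHz) exciting a damped resonance; purely illustrative.
fs, period, n = 8000, 80, 800
sig = [0.0] * n
for k in range(0, n, period):
    for i in range(k, min(k + 200, n)):
        t = (i - k) / fs
        sig[i] += math.exp(-300.0 * t) * math.sin(2.0 * math.pi * 700.0 * t)

epochs = zff_epochs(sig, fs)  # roughly one epoch per 80-sample pitch period
```

On this clean synthetic signal the zero crossings of the trend-removed output lock to the excitation instants, so approximately one epoch per pitch period is expected; real speech requires more careful windowing and polarity handling.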
A proper investigation of the emotive content of speech through prosodic and excitation source features will improve human–machine interface systems and advance speech emotion recognition for the Assamese language in real-time applications such as the treatment of mental illness.
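The ANOVA comparison described in the abstract can be illustrated with a small, self-contained sketch. The function below computes the one-way ANOVA F-statistic in plain Python; the per-emotion mean-F0 values are invented purely for illustration and are not the paper's data.

```python
def one_way_anova_f(groups):
    """Compute the one-way ANOVA F-statistic for a list of sample groups."""
    all_vals = [v for g in groups for v in g]
    n_total, k = len(all_vals), len(groups)
    grand_mean = sum(all_vals) / n_total
    # Between-group sum of squares: variation of group means about the grand mean.
    ss_between = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups)
    # Within-group sum of squares: variation of samples about their group mean.
    ss_within = sum(sum((v - sum(g) / len(g)) ** 2 for v in g) for g in groups)
    df_between, df_within = k - 1, n_total - k
    return (ss_between / df_between) / (ss_within / df_within)

# Hypothetical mean-F0 values (Hz) per emotion; illustrative numbers only.
f0_neutral = [120.0, 122.0, 118.0]
f0_anger = [180.0, 176.0, 184.0]
f0_sad = [100.0, 104.0, 96.0]

f_stat = one_way_anova_f([f0_neutral, f0_anger, f0_sad])
print(round(f_stat, 2))  # prints 433.33
```

A large F-statistic relative to the critical value for (2, 6) degrees of freedom indicates that F0 differs significantly across the emotions, which is the kind of conclusion the statistical analysis in the paper draws for each prosodic feature.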
Acknowledgements
The authors wish to thank The Assam Kaziranga University and the students of Dhing Govt. College for their cooperation in recording and collecting the speech data under different induced emotions. We also thank the laboratory assistants of The Assam Kaziranga University for their utmost cooperation.
Cite this article
Bharadwaj, S., Acharjee, P.B. Exploring human voice prosodic features and the interaction between the excitation signal and vocal tract for Assamese speech. Int J Speech Technol 26, 77–93 (2023). https://doi.org/10.1007/s10772-021-09946-5