Abstract
Various time-frequency representation methods have been investigated in the literature for the automatic detection of voice disorders. The Stockwell transform (S-Transform) provides good time-frequency localization and may therefore capture voice-disorder-related information in the speech signal efficiently. With this motivation, we investigated different variants of the S-Transform for the classification of voice disorders. This study proposes S-Transform based cepstral coefficients for voice disorder detection and assessment. The proposed features were compared with baseline features on the Saarbrücken Voice Database (SVD) and the HUPA database. For the detection task, the proposed features outperformed the baseline features, reaching classification accuracies of 80.2% on HUPA and 79.8% on SVD. The proposed features also outperformed the baseline features on the assessment task. Further, combining the S-Transform based cepstral coefficients with the baseline features improved the performance of the proposed systems by 8% for detection and 4% for assessment, which highlights the complementary nature of the explored features. We also analysed how effectively the S-Transform based spectral representation captures the acoustic characteristics of voice qualities such as breathy, harsh, creaky, and falsetto phonation. This representation was compared with other time-frequency methods, namely the short-time Fourier transform (STFT), zero-time windowing (ZTW), and single frequency filtering (SFF). The S-Transform captured the acoustic variations associated with the different voice qualities more effectively than these methods, which may be due to the better spectro-temporal resolution it offers.
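As an illustration of the kind of feature the abstract describes, the following is a minimal sketch of a discrete S-Transform followed by cepstral coefficients. It is not the authors' implementation: the function names, the choice of 13 coefficients, the use of a plain log-magnitude spectrum with a DCT, and the absence of any filter bank or S-Transform variant are all assumptions made for illustration. The transform follows the standard frequency-domain formulation, in which each frequency row is an inverse FFT of the shifted signal spectrum weighted by a Gaussian whose width scales with frequency.

```python
import numpy as np
from scipy.fft import dct


def stockwell_transform(x):
    """Discrete S-Transform of a real 1-D frame (rows: frequency bins 0..N/2)."""
    x = np.asarray(x, dtype=float)
    n = x.size
    spectrum = np.fft.fft(x)
    # Signed frequency offsets in FFT order: 0, 1, ..., -2, -1.
    m = np.fft.fftfreq(n) * n
    n_freqs = n // 2 + 1
    st = np.empty((n_freqs, n), dtype=complex)
    # The zero-frequency row is defined as the mean of the signal.
    st[0] = x.mean()
    for k in range(1, n_freqs):
        # Frequency-domain Gaussian window; it widens as k grows, which
        # gives a narrower time window at higher frequencies.
        gauss = np.exp(-2.0 * np.pi**2 * m**2 / k**2)
        # np.roll(spectrum, -k)[m] equals spectrum[(m + k) mod n].
        st[k] = np.fft.ifft(np.roll(spectrum, -k) * gauss)
    return st


def st_cepstral_coefficients(x, n_ceps=13, eps=1e-10):
    """Hypothetical cepstral features: DCT of the log S-Transform magnitude.

    Returns an array of shape (len(x), n_ceps), one vector per time sample.
    """
    magnitude = np.abs(stockwell_transform(x))
    log_spectrum = np.log(magnitude + eps)
    ceps = dct(log_spectrum, type=2, axis=0, norm="ortho")
    return ceps[:n_ceps].T


# Usage on a synthetic frame: a cosine centred on frequency bin 16.
t = np.arange(256)
frame = np.cos(2 * np.pi * 16 * t / 256)
st = stockwell_transform(frame)
features = st_cepstral_coefficients(frame)
print(st.shape, features.shape)
```

Each frequency row costs one inverse FFT, so this direct form is practical only for short frames. The per-sample feature vectors would typically be pooled or passed to a classifier, a step the abstract does not specify and this sketch leaves out.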
Data availability
The current study used the publicly available SVD and HUPA datasets for the analysis. The SVD dataset is freely available at http://www.stimmdatenbank.coli.uni-saarland.de/.
Notes
The SVD database is freely available at http://www.stimmdatenbank.coli.uni-saarland.de/.
Ethics declarations
Conflict of interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Barche, P., Gurugubelli, K. & Vuppala, A.K. Stockwell-Transform based feature representation for detection and assessment of voice disorders. Int J Speech Technol 27, 101–119 (2024). https://doi.org/10.1007/s10772-024-10085-w
DOI: https://doi.org/10.1007/s10772-024-10085-w