Abstract
Conventional automatic assessment of pathological speech usually follows two main steps: (1) extraction of pathology-specific features; (2) classification or regression on the extracted features. Given the great variety of speech and language disorders, feature design is never straightforward, yet it is crucial to assessment performance. This paper presents an end-to-end approach to automatic speech assessment for Cantonese-speaking People With Aphasia (PWA). The assessment is formulated as a binary classification task that discriminates PWA with high subjective assessment scores from those with low scores. A 2-layer Gated Recurrent Unit (GRU) model and a Convolutional Neural Network (CNN) model are applied to realize the end-to-end mapping from basic speech features to the classification outcome, so that the pathology-specific features used for assessment are learned implicitly by the neural network. The Class Activation Mapping (CAM) method is utilized to visualize how the learned features contribute to the assessment result. Experimental results show that the end-to-end approach achieves performance comparable to the conventional two-step approach on the classification task, and that the CNN model learns impairment-related features similar to the hand-crafted ones. The results also indicate that the CNN model outperforms the 2-layer GRU model on this specific task.
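As a rough illustration of the CAM visualization mentioned above: in a CNN whose final convolutional layer feeds a global average pooling layer and a linear classifier, the class activation map is the weighted sum of the last-layer feature maps, using that class's classifier weights. The sketch below is a minimal, dependency-free illustration of this weighted sum; the function name, toy shapes, and values are illustrative assumptions, not the paper's implementation.

```python
# Minimal sketch of Class Activation Mapping (CAM), assuming a CNN whose
# final conv layer produces C feature maps that are globally average-pooled
# and fed to a linear classifier. The map for a target class is
#   CAM[i][j] = sum_c w_c * F_c[i][j],
# where w_c is the classifier weight for channel c.

def class_activation_map(feature_maps, class_weights):
    """feature_maps: list of C maps, each an H x W nested list.
    class_weights: the C linear-layer weights of the target class.
    Returns the H x W class activation map."""
    H, W = len(feature_maps[0]), len(feature_maps[0][0])
    cam = [[0.0] * W for _ in range(H)]
    for w_c, fmap in zip(class_weights, feature_maps):
        for i in range(H):
            for j in range(W):
                cam[i][j] += w_c * fmap[i][j]
    return cam

# Toy example: 2 feature maps of size 2x2.
fmaps = [[[1.0, 0.0], [0.0, 1.0]],
         [[0.0, 2.0], [2.0, 0.0]]]
weights = [1.0, 0.5]
print(class_activation_map(fmaps, weights))  # [[1.0, 1.0], [1.0, 1.0]]
```

In practice the resulting map is upsampled to the input's time-frequency resolution, which is what lets one inspect which spectro-temporal regions drive the impairment classification.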
Acknowledgements
This research was partially supported by a GRF project grant (Ref: CUHK14227216) from the Hong Kong Research Grants Council, a Direct Grant from the CUHK Research Committee, the CUHK Research Sustainability Fund, and the CUHK Shenzhen Research Institute. The Cantonese AphasiaBank project was supported by a fund from the National Institutes of Health (project number: NIH-R01-DC010398).
Cite this article
Qin, Y., Wu, Y., Lee, T. et al. An End-to-End Approach to Automatic Speech Assessment for Cantonese-speaking People with Aphasia. J Sign Process Syst 92, 819–830 (2020). https://doi.org/10.1007/s11265-019-01511-3