Abstract
Acquiring speech data is a crucial step in developing speech recognition systems and related speech-based machine learning models. However, protecting speaker privacy is a growing concern that must be addressed. This study investigates voice conversion (VC) as a strategy for anonymizing the speech of individuals with dysarthria. We focus on training a variety of VC models using self-supervised speech representations, such as Wav2Vec and the multilingual variant of Wav2Vec2.0 (XLSR). The converted voices maintain a word error rate within 1% of the original recordings. The equal error rate (EER) increased substantially, from 1.52% to 41.18% on the LibriSpeech test set and from 3.75% to 42.19% on speakers from the VCTK corpus, indicating a considerable decrease in speaker verification performance. A similar trend is observed with dysarthric speech, where the EER rose from 16.45% to 43.46%. Additionally, our study includes classification experiments on dysarthric vs. healthy speech data to demonstrate that anonymized voices can still yield the speech features essential for distinguishing between healthy and pathological speech. The impact of voice conversion is analyzed across articulation, prosody, phonation, and phonology.
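The anonymization effect reported above is quantified by the equal error rate (EER): the operating point at which a speaker-verification system's false-acceptance rate (impostors accepted) equals its false-rejection rate (genuine speakers rejected); the higher the EER after conversion, the harder it is to re-identify the speaker. As a minimal illustrative sketch (the function name and toy trial scores below are hypothetical, not from the paper), the EER can be computed from verification scores like this:

```python
import numpy as np

def equal_error_rate(labels, scores):
    """Return the EER given binary trial labels (1 = same speaker,
    0 = impostor) and the verifier's similarity scores."""
    labels = np.asarray(labels)
    scores = np.asarray(scores)
    fars, frrs = [], []
    for t in np.sort(np.unique(scores)):
        accept = scores >= t                  # trials accepted at threshold t
        fars.append(np.mean(accept[labels == 0]))   # impostors wrongly accepted
        frrs.append(np.mean(~accept[labels == 1]))  # genuine trials wrongly rejected
    fars, frrs = np.array(fars), np.array(frrs)
    i = np.argmin(np.abs(fars - frrs))        # threshold where FAR and FRR meet
    return (fars[i] + frrs[i]) / 2

# Toy example: three genuine trials, three impostor trials
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.3, 0.6, 0.2, 0.1]
print(round(equal_error_rate(labels, scores), 3))  # → 0.333
```

On such a small toy set the EER is coarse; in practice it is estimated over many thousands of trial pairs, as in the LibriSpeech and VCTK evaluations above.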
Acknowledgement
This work was partially funded by the EVUK programme ("Next-generation AI for Integrated Diagnostics") of the Free State of Bavaria and by CODI at UdeA grant # PI2023-58010.
Copyright information
© 2024 The Author(s), under exclusive license to Springer Nature Switzerland AG
Cite this paper
Hernandez, A. et al. (2024). Anonymizing Dysarthric Speech: Investigating the Effects of Voice Conversion on Pathological Information Preservation. In: Nöth, E., Horák, A., Sojka, P. (eds) Text, Speech, and Dialogue. TSD 2024. Lecture Notes in Computer Science(), vol 15049. Springer, Cham. https://doi.org/10.1007/978-3-031-70566-3_14
Print ISBN: 978-3-031-70565-6
Online ISBN: 978-3-031-70566-3