Abstract
Background
Integration of speech into healthcare has intensified privacy concerns due to its potential as a non-invasive biomarker containing individual biometric information. In response, speaker anonymization aims to conceal personally identifiable information while retaining crucial linguistic content. However, the application of anonymization techniques to pathological speech, a critical area where privacy is especially vital, has not been extensively examined.
Methods
This study investigates anonymization’s impact on pathological speech across over 2700 speakers from multiple German institutions, focusing on privacy, pathological utility, and demographic fairness. We explore both deep-learning-based and signal processing-based anonymization methods.
Results
We document substantial privacy improvements across disorders, evidenced by equal error rate increases of up to 1933%, with minimal overall impact on utility. Specific disorders such as Dysarthria, Dysphonia, and Cleft Lip and Palate experience minimal utility changes, while Dysglossia shows slight improvements. Our findings underscore that the impact of anonymization varies substantially across disorders, necessitating disorder-specific anonymization strategies to optimally balance privacy with diagnostic utility. Additionally, our fairness analysis reveals consistent anonymization effects across most demographics.
Conclusions
This study demonstrates the effectiveness of anonymization in enhancing the privacy of pathological speech, while also highlighting the importance of customized, disorder-specific approaches and of robustness against inversion attacks.
Plain Language Summary
When someone’s way of speaking is disrupted due to health issues, making it hard for them to communicate clearly, it is described as pathological speech. Our study explores whether this type of speech can be modified to protect patient privacy without losing its ability to help diagnose health conditions. We evaluated automatic anonymization for over 2,700 speakers. The results show that these methods can substantially enhance privacy while still maintaining the usefulness of speech in medical diagnostics. This means we can keep speech data private whilst still being able to use it to identify health issues. However, our results show the effectiveness of these methods can vary depending on the specific condition being diagnosed. Our study provides a method that can help maintain patient privacy, whilst highlighting that further customized approaches will be required to ensure optimal privacy.
Introduction
The advent of speech as a pivotal component in digital technology has accentuated privacy concerns due to the inherent biometric information speech carries1,2,3, particularly highlighted in its role as a biomarker extensively utilized in healthcare applications4, such as Parkinson’s5,6 and Alzheimer’s7 diseases detection or speech therapy8, for its cost-effectiveness and non-invasive nature. However, the advent of deep learning necessitates an ever-increasing volume of speech data for algorithm training. Despite the daily influx of patients with speech or voice disorders at various institutions, leveraging this data for research is hampered by stringent privacy regulations. As speech data can reveal a plethora of personal information, there is an urgent need for privacy-preserving technologies in voice data usage. Consequently, the scope of data available for public research use remains narrowly limited. Addressing this issue, speaker anonymization9,10 emerges as a pivotal strategy, aiming to obscure personally identifiable information while preserving essential linguistic and speech characteristics11,12. This approach is particularly pertinent in the healthcare sector, where the accuracy and reproducibility of speech biomarkers are paramount for advancing medical diagnostics and treatments13. Therefore, finding ways to expand the pool of publicly available training data without breaching privacy norms is crucial for the progression of medical speech technology applications.
Privacy-preserving data processing has seen substantial growth, motivated by an increasing need for privacy protection. The VoicePrivacy 2020 and 2022 challenges10,14,15 have been pivotal, setting a framework for defining and exploring speaker anonymization as an essential element of voice biosignal processing. These initiatives have led to innovations in automatic anonymization methods, including deep-learning-based (DL-based) techniques, such as the extraction and replacement of speaker identity features (x-vectors)16,17, and signal modification methods, like anonymization using the McAdams coefficient18,19. For example, Mawalim et al.20 illustrated that phase vocoder-based time-scale modification with pitch shifting offers superior anonymization for healthy speech without sacrificing utility. Khamsehashari et al.21 developed a voice conversion approach utilizing Emphasized Channel Attention, Propagation, and Aggregation in a time delay neural network (ECAPA-TDNN) to embed speaker identities more effectively. Moreover, Meyer et al.22 highlighted the successful application of Generative Adversarial Networks in speaker anonymization, while Perero-Codosero et al.23 utilized autoencoders for this purpose. Furthermore, Srivastava et al.24 delved into design choices for pseudo-speaker selection and, in another study25, analyzed the privacy-utility tradeoffs in x-vector-based methods17.
Despite substantial advances, current research demonstrates a notable gap in the study of anonymization methods tailored to pathological speech, where patient privacy concerns are paramount. Tayebi Arasteh et al.13 have recently pointed out that the unique attributes of pathological speech make it more readily identifiable than its healthy counterpart, raising vital questions about the privacy-utility balance in its anonymization and the treatment of its biomarkers. While some research, such as the work by Hernandez et al.26, has delved into articulation, prosody, phonation, and phonology features in anonymized dysarthric speech to differentiate between healthy and pathological speech, and Zhu et al.27 have assessed anonymization’s impact on speech-based diagnostics for COVID-19, these efforts remain limited. They typically concentrate on specific speech or voice disorders, depend on small datasets, or consider pathological speech unrelated to speech or voice disorders, highlighting the need for more comprehensive analyses in this domain.
In response, our study conducts a comprehensive analysis of the impact of anonymization on pathological speech biomarkers, utilizing a large-scale dataset of over 2700 speakers from various institutions. This dataset includes a wide array of disorders such as Dysarthria28 (a motor speech disorder affecting muscle control), Dysglossia29 (a condition affecting speech by changes of the articulatory organs, e.g., due to oral cancer), Dysphonia30 (voice disorders affecting vocal fold vibration), and Cleft Lip and Palate (CLP)31,32,33,34 (a congenital split in the upper lip and roof of the mouth), thus providing a broad basis for generalizable insights into pathological speech anonymization. Additionally, we meticulously explore the balance between privacy enhancement and the utility of pathological speech data, including an examination of demographic fairness implications.
This study aims to investigate whether anonymization modifies the diagnostic markers within pathological speech while balancing privacy-utility and privacy-fairness considerations, suggesting the potential for applying automatic anonymization to pathological speech. Our findings reveal that although anonymization alters the diagnostic markers within pathological speech, it achieves advantageous balances in privacy-utility and privacy-fairness, thus underscoring the feasibility of employing automatic anonymization for pathological speech. Moreover, our analysis identifies substantial variability in the anonymization effects across different disorders, showcasing a complex interplay between anonymization and the specifics of pathological conditions.
Methods
Speech dataset
The dataset used in our research comprised a wide array of speech utterances from across Germany. It featured a median participant age of 17 years and a mean age of 30 years (±25 years standard deviation), covering ages from 3 to 95 years. Table 1 offers an overview of the dataset demographics, including voice and speech disorder distributions and gender breakdown.
Data collection
Data were collected from 2006 to 2019 during regular outpatient examinations at the University Hospital Erlangen and at over 20 different locations across Germany for the recording of control speakers. Every patient during a specialized consultation was invited to participate in the study. The study and the methods were performed in accordance with relevant guidelines and regulations and approved by the University Hospital Erlangen’s institutional review board with application number 3473 and respected the Declaration of Helsinki. Informed consent was obtained from all adult participants as well as from parents or legal guardians of the children. Patients and control speakers were informed about the study’s procedure and goals before consenting to participate and providing informed consent.
Recordings were made using a standardized procedure which included consistent settings, microphone setups, and speech tasks. Non-native speakers and patients whose speech was substantially disturbed by factors other than the targeted disorders were excluded. The Program for Evaluation and Analysis of all Kinds of Speech disorders (PEAKS)35, an open-source tool widely used in the German-speaking scientific community, was employed to document and manage the database. Recordings were captured at a 16 kHz sampling frequency and a 16-bit resolution, featuring subjects who are native German speakers, including various local dialects.
Speech features
The dataset covered different etiologies together with their main or prominent pathological speech features: Dysphonia refers to voice disorders involving phonation features; Dysglossia refers to articulation disorders involving mostly phonetic and sometimes phonation features; Dysarthria refers to motor speech disorders involving phonation, phonetic, and prosody features; and CLP refers to a speech and resonance disorder involving phonetic features, hyper- and hyponasality, and sometimes phonation features. Supplementary Table 1 provides an overview of the expected features for different clinical groups.
Exclusion criteria
The cohort employed in our study represents a meticulously curated subset of the dataset described in ref. 13, where it is delineated that our initial collection consisted of 216.88 h of recordings from n = 4121 subjects. To refine this dataset to a clean and unbiased selection, we adhered to all exclusion criteria mentioned in ref. 13, which encompassed data cleaning, ensuring speech quality and noise standards, and the elimination of multi-speaker utterances. Additional steps undertaken in this study include: (i) Acknowledging the distinct speech characteristics between adults and children13, we categorized the dataset into two primary subsets. Adults, defined as individuals over 20 years of age, were tasked with reading Der Nordwind und die Sonne, a phonetically rich German adaptation of Aesop’s fable The North Wind and the Sun35. This text comprises 108 words, 71 of which are unique. Conversely, children participated in the Psycholinguistische Analyse kindlicher Sprechstörungen (PLAKSS)36 test, which involved naming pictograms across slides, covering all German phonemes in various positions. Given the tendency of some children to describe pictograms with multiple words, and the occasional extra words between target words, recordings were automatically segmented at pauses exceeding 1s35. (ii) Adults’ subset focused on utterances characterized by Dysarthria28, Dysglossia29, and Dysphonia30, alongside healthy control samples. Utterances with ambiguous or mixed pathologies or those representing conditions with a scant number of data points were excluded. (iii) For children, the emphasis was placed on utterances from individuals with CLP conditions — the most prevalent cranial malformation characterized by an incomplete closure of the vocal tract31,32,34 — as well as from healthy controls.
Experimental design
Given the interdisciplinary nature of our study (Fig. 1), which incorporates a variety of criteria, establishing clear metrics, criteria, and statistical methods is crucial for foundational clarity. We detail our approaches to evaluating the effectiveness of anonymization, as well as assessing the utility and fairness of pathological speech data. Subsequently, we describe the anonymization techniques we utilized. Then, a breakdown of all different experiments performed in this study is given.
Evaluation criteria
Anonymization measure (privacy)
To evaluate the effectiveness of anonymization, we employ automatic speaker verification (ASV) techniques, aligning with established standards in the field, such as those set by the VoicePrivacy 2020 and 2022 challenges10,14,15. Utilizing a pretrained module, optimized on the LibriSpeech37 dataset, we fine-tune this system to recognize specific speakers. This involves exposing the module to random utterances, which may belong to the target speaker or an imposter from the dataset. We adopt the equal error rate (EER), a critical metric in ASV39, as our primary measure of anonymization38. The EER is the operating point at which the false acceptance (FA) rate matches the false rejection (FR) rate, determined here from cosine distance scores used for similarity measurement. It inversely correlates with verification ease: a lower EER indicates a more straightforward speaker verification process40. An increase in EER post-anonymization, relative to the baseline EER, therefore signifies enhanced anonymization efficacy, reflecting the system’s greater difficulty in distinguishing the speaker’s identity. EER values are reported as percentages throughout this paper.
EER calculation process
The data preprocessing followed established protocols13,41,42,43, starting with the removal of intervals with sound pressures below 30 dB. Voice activity detection44 was then applied to eliminate silent segments from the utterances, utilizing a 30 ms window length, a maximum silence threshold of 6 ms, and an 8 ms moving average window. The final feature set comprised 40-dimensional log-Mel-spectrograms, calculated with a 25 ms window length and 10 ms steps, employing a short time Fourier transform (STFT) of size 512.
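For illustration, a minimal sketch of the log-Mel feature computation using librosa follows; the helper name and library choice are ours, and the 30 dB pruning and voice activity detection steps described above are omitted here (the original pipeline of refs. 13,41,42,43 may differ in detail):

```python
import numpy as np
import librosa

def log_mel_features(waveform: np.ndarray, sr: int = 16000) -> np.ndarray:
    """40-dimensional log-Mel-spectrogram: 25 ms window, 10 ms step, 512-point STFT."""
    mel = librosa.feature.melspectrogram(
        y=waveform, sr=sr,
        n_fft=512,
        win_length=int(0.025 * sr),  # 25 ms analysis window (400 samples at 16 kHz)
        hop_length=int(0.010 * sr),  # 10 ms step (160 samples)
        n_mels=40,
    )
    return np.log(mel + 1e-6).T      # shape (frames, 40); small offset avoids log(0)
```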
An ASV network pretrained on the LibriSpeech37 dataset (360-clean subset) was employed. This network comprised three long short-term memory (LSTM)45 layers with 768 hidden nodes, followed by a linear projection layer, and was trained using the Generalized End-to-End (GE2E)42 loss with the Adam46 optimizer. For an in-depth discussion of the ASV network’s training and evaluation, readers are directed to refs. 13,41,42, as these details extend beyond the scope of our current study.
The EER calculation process involved comparing pairs of positive (same speaker) and negative (different speakers) utterances to verify speaker identity. Initial EER values for various dataset subsets were calculated and then compared to their anonymized counterparts to assess the impact of anonymization on speaker verifiability.
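A minimal sketch of how an EER can be derived from such trial scores is shown below; the threshold search without interpolation is an illustrative simplification, not the exact evaluation code:

```python
import numpy as np

def equal_error_rate(scores: np.ndarray, labels: np.ndarray) -> float:
    """EER from verification trials: the operating point where FAR equals FRR.

    scores: similarity scores (e.g., cosine) for the trial pairs.
    labels: 1 for positive (same-speaker) pairs, 0 for negative (different-speaker) pairs.
    """
    thresholds = np.unique(scores)
    far = np.array([(scores[labels == 0] >= t).mean() for t in thresholds])  # imposters accepted
    frr = np.array([(scores[labels == 1] < t).mean() for t in thresholds])   # targets rejected
    i = np.argmin(np.abs(far - frr))                                         # FAR ~= FRR
    return 100.0 * (far[i] + frr[i]) / 2  # percent, as reported in this paper
```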
Biomarker analysis measure (utility)
Pathological speech is characterized by a range of subjective and objective metrics, including articulation47, prosody48, and phonation26,49, among others. Recognizing the complexity and diverse nature of these biomarkers3,13, our study opted for a deep learning-based approach to biomarker assessment. This method does not focus on extracting specific features but allows the network to autonomously identify pertinent features distinguishing healthy speech from various voice and speech disorders. This approach facilitates the application of standard classification metrics such as area under the receiver operating characteristic curve (AUROC), accuracy, sensitivity, and specificity in evaluating the utility of pathological speech biomarkers and standard statistical analyses.
Classification process
During data preprocessing, drifting noise, if present, was removed by applying a forward-backward filter50. The final feature set comprised 80-dimensional log-Mel-spectrograms, employing an STFT of size 1024.
Due to the two-dimensional nature of log-Mel-spectrograms, we leveraged state-of-the-art pretrained convolutional networks designed for image classification to maximize feature extraction accuracy51,52. We specifically selected a ResNet3453 model pretrained on the large-scale ImageNet54 dataset, which contains over 14 million images across 1000 categories. Following the architecture proposed by He et al.53, the first layer is a (7 × 7) convolution producing 64 output channels, and a final linear layer reduces the (512 × 1) output feature vectors to 2. The sigmoid function was applied to transform output predictions into probabilities, and the network comprises around 21 million trainable parameters.
We employed a batch size of 8, selecting 8 utterances per speaker randomly for each batch. Given the varying lengths of log-Mel-filterbank energies, 180 frames (about 3 s) were chosen at random for inclusion in training. The network inputs were sized (8 × 3 × 80 × 180), reflecting the batch size, channel size (adjusted to 3 to match the pretrained network’s expectation, with log-Mel-spectrograms replicated three times), log-Mel-spectrograms size, and frame size. Training spanned 150 epochs, optimizing with the Adam46 optimizer and a learning rate of 5 × 10−5, utilizing binary weighted cross-entropy for loss calculation.
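The following PyTorch sketch illustrates this setup under stated assumptions: the torchvision backbone and BCEWithLogitsLoss (which fuses the sigmoid with the weighted binary cross-entropy for numerical stability) are our choices, the tensors are placeholders, and the data loading and 150-epoch loop are omitted:

```python
import torch
import torch.nn as nn
from torchvision.models import resnet34

# Backbone: ImageNet-pretrained ResNet34 with the final layer reduced to 2 outputs.
model = resnet34(weights="IMAGENET1K_V1")
model.fc = nn.Linear(512, 2)

optimizer = torch.optim.Adam(model.parameters(), lr=5e-5)
criterion = nn.BCEWithLogitsLoss(pos_weight=torch.ones(2))  # weights would reflect class imbalance

# One illustrative training step. Inputs are sized (8, 3, 80, 180): a batch of 8 utterances,
# the single log-Mel channel replicated 3x to match the pretrained network, 80 Mel bins,
# and 180 randomly chosen frames (~3 s).
x = torch.randn(8, 1, 80, 180).repeat(1, 3, 1, 1)                # placeholder spectrograms
y = nn.functional.one_hot(torch.randint(0, 2, (8,)), 2).float()  # placeholder labels
loss = criterion(model(x), y)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```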
Statistics and reproducibility
Statistical analysis was conducted using Python v3.9, leveraging the SciPy and NumPy libraries. For each disorder subset and specific experiment, speakers were randomly allocated to training (70%) and test (30%) groups. This random allocation was consistent across experiments to ensure that the same training and test subsets were used for comparing anonymized data with original data, facilitating paired analyses to account for random variations. The division aimed to prevent overlap between training and test data. To address potential imbalances in the dataset, particularly where there was a limited number of healthy controls (81 in the adult subset), we adjusted the patient-to-control ratio. In cases of Dysarthria and Dysglossia with ample patient data (n = 355 and n = 542, respectively), we capped patient speakers at twice the number of controls. In the children’s subset, which had more controls, we sampled controls up to 1.5 times the number of patients to maintain balance. The composition of the final training and test sets, ensuring a fair comparison between the two anonymization methods, was as follows: Training sets comprised n = 168 speakers (Dysarthria detection), n = 168 (Dysglossia detection), n = 110 (Dysphonia detection), and n = 887 (CLP detection). Corresponding test sets included n = 73 (Dysarthria detection), n = 73 (Dysglossia detection), n = 49 (Dysphonia detection), and n = 381 (CLP detection). AUROC was selected as the primary evaluation metric, with accuracy, sensitivity, and specificity serving as secondary metrics. Considering each speaker contributed multiple utterances, and to account for the random sampling of utterances in training and testing, each test phase was repeated 50 times to reduce potential random biases, with evaluations strictly paired for a consistent comparison between anonymized and non-anonymized data. Results are expressed as mean ± standard deviation. Statistical significance was determined using a two-tailed unpaired t-test, setting a significance level at p < 0.05. Pearson’s correlation coefficient was utilized to measure the correlation between EER and AUROC (i.e., the privacy-utility tradeoff).
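As an illustration of this statistical protocol, a small SciPy sketch with simulated stand-in values follows; all numbers are placeholders, not study results:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Stand-ins for the 50 repeated test evaluations described above.
auroc_original = rng.normal(0.970, 0.002, size=50)
auroc_anonymized = rng.normal(0.961, 0.002, size=50)

t, p = stats.ttest_ind(auroc_original, auroc_anonymized)  # two-tailed unpaired t-test
significant = p < 0.05                                    # significance level of 0.05

# Pearson correlation between privacy (EER) and utility (AUROC), here with made-up values:
eers = [30.0, 33.0, 36.0, 39.0]
aurocs = [0.97, 0.96, 0.96, 0.95]
r, p_r = stats.pearsonr(eers, aurocs)
```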
Anonymization method
Anonymization techniques usually fall into two primary categories: (i) DL-based synthesization methods and (ii) signal-level modification methods. To maintain generality, we considered both categories in our study.
DL-based synthesization method
These methods begin by converting the waveform into the frequency domain and segregating the speaker identity features from other acoustic features. Subsequently, specific modifications are applied solely to the speaker identity features, ensuring no other feature is altered. The modified frequency domain features, typically represented as Mel-spectrograms, are then re-synthesized back into the time domain using a synthesizer known as a vocoder in the text-to-speech55 context. This approach is termed DL-based because critical components, such as the vocoder and speaker identity extractor, are trained using DL-based methods.
The most prevalent applications of these methods are integrated with voice conversion56 algorithms, where after isolating speaker identity features, they are substituted with those of another speaker. This process effectively changes one’s voice to mimic another’s. However, for anonymization purposes, especially in scenarios involving over 2700 speakers as in our study, mapping each speaker to a specific target is impractical. Therefore, we opt for a simplified approach, focusing on altering only the pitch frequency of the speakers before re-synthesizing the signal with a vocoder.
The same data preprocessing procedure as for the classification network was utilized.
To prevent reconstruction of the original signals, the pitch shift magnitude was chosen at random for different speakers. However, these modifications were carefully applied to ensure audibility, gender preservation, minimal age alteration, and naturalness. Supplementary Fig. 1 details the proposed pitch shift magnitude, noise addition, and denoising processes.
To re-synthesize the speech waveforms from the modified Mel-spectrograms, the HiFi-GAN57 vocoder was utilized, a state-of-the-art voice synthesizer pre-trained on the LibriTTS58 corpus, a diverse corpus of healthy English speech encompassing 585 h of audio. This approach ensured that the resulting speech maintained high fidelity and naturalness.
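A hedged sketch of this pipeline is given below: the `vocoder` callable stands in for the pretrained HiFi-GAN, the ±4 semitone range is illustrative only, and the noise addition and denoising steps of Supplementary Fig. 1 are not reproduced here:

```python
import numpy as np
import librosa

def dl_anonymize(waveform: np.ndarray, sr: int, vocoder, rng: np.random.Generator) -> np.ndarray:
    """Simplified sketch of the DL-based pipeline: random pitch shift, log-Mel
    extraction, then re-synthesis with a pretrained neural vocoder.

    `vocoder` is assumed to map an 80-band log-Mel-spectrogram back to a waveform.
    """
    n_steps = rng.uniform(-4.0, 4.0)  # illustrative range; chosen per speaker in the paper
    shifted = librosa.effects.pitch_shift(y=waveform, sr=sr, n_steps=n_steps)
    mel = librosa.feature.melspectrogram(y=shifted, sr=sr, n_fft=1024, n_mels=80)
    return vocoder(np.log(mel + 1e-6))  # assumed log-Mel vocoder input
```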
Signal-level modification method
Signal-level anonymization techniques utilize signal processing to modify speech signals without the need for model training. A prominent method involves the McAdams coefficient, which alters speaker characteristics by adjusting the spectral formant positions via linear predictive coding (LPC) analysis. The technique analyzes speech frame by frame, extracts LPC coefficients, and raises the angles of the resulting complex poles to the power of the McAdams coefficient19. This adjustment shifts the formant positions, and thereby the speaker’s perceived identity, while real poles are left untouched. The method is effective for standard speech samples, targeting key spectral features for anonymity. The speech is then reconstructed with the adjusted features, ensuring both the anonymization of the speaker and the preservation of speech clarity. Our approach adopts and refines a variation of the anonymization method by Patino et al.18, initially introduced in the VoicePrivacy 2022 Challenge15. This adaptation is particularly advantageous for our application because it eliminates the necessity for mapping original speakers to target ones. A key aspect of this method is the utilization of the McAdams coefficient as an anonymization metric; within a specific range, a higher coefficient indicates a lower degree of anonymization, allowing for a customizable balance between speaker anonymity and speech intelligibility.
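A minimal frame-wise sketch of such a McAdams-based transformation is shown below; it is a simplification of the approach in refs. 18,19, without the overlap-add windowing of the challenge baseline, and the frame length and LPC order are illustrative defaults:

```python
import numpy as np
import librosa
from scipy.signal import lfilter

def mcadams_anonymize(waveform: np.ndarray, alpha: float,
                      frame_len: int = 320, lpc_order: int = 20) -> np.ndarray:
    """Raise the angles of complex LPC poles to the power `alpha` (the McAdams
    coefficient), shifting formants while keeping pole magnitudes, and hence
    filter stability, intact. Real poles are left untouched."""
    out = np.zeros_like(waveform)
    for start in range(0, len(waveform) - frame_len + 1, frame_len):
        frame = waveform[start:start + frame_len]
        a = librosa.lpc(frame, order=lpc_order)      # all-pole model, a[0] == 1
        residual = lfilter(a, [1.0], frame)          # inverse filtering
        poles = np.roots(a)
        angles = np.angle(poles)
        is_complex = np.abs(poles.imag) > 1e-8       # skip real poles
        new_angles = np.where(is_complex,
                              np.sign(angles) * np.abs(angles) ** alpha,
                              angles)
        a_new = np.real(np.poly(np.abs(poles) * np.exp(1j * new_angles)))
        out[start:start + frame_len] = lfilter([1.0], a_new, residual)
    return out
```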
Experiments breakdown
The breakdown of the experiments performed in this study is detailed below.
Performance evaluation of anonymization methods for pathological speech
The DL-based method is implemented as per Supplementary Fig. 1. For the McAdams coefficient method, given our speech dataset’s 16 kHz sampling rate, we opt for dynamic selection of the McAdams coefficient, randomly choosing a value between 0.75 and 0.90. This approach introduces additional randomness and complicates potential data reconstruction. We noted that coefficients above 0.90 minimally affect anonymization, aligning with the VoicePrivacy 2022 Challenge baseline, while those below 0.75 begin to degrade audio quality and naturalness.
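Continuing the hedged sketch from the previous subsection, the randomized variant might be invoked as follows; the input file path is hypothetical:

```python
import numpy as np
import librosa

waveform, sr = librosa.load("utterance.wav", sr=16000)  # hypothetical 16 kHz input
alpha = np.random.default_rng().uniform(0.75, 0.90)     # randomized coefficient
anonymized = mcadams_anonymize(waveform, alpha)         # sketch function defined above
```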
Exploring synthesizing effects
Notably, nasality changes are often observed in the lower frequencies, which could be particularly relevant in disorders like CLP, where pitch shifts might substantially alter pathological features and formant information. To assess the general impact of DL-based anonymization methods, we delve into a simplified scenario: speech signals are converted into log-Mel-spectrograms yet undergo no alterations prior to being synthesized through the HiFi-GAN vocoder. In an ideal setting, where the vocoder performs impeccably, this approach should theoretically leave both utility and privacy unaffected. Although this experiment does not directly contribute to anonymization, it provides valuable insights into how these processes might influence the pathological utility of speech data. This evaluation aids in distinguishing the appropriateness of DL-based versus signal-level based methods for the anonymization of pathological speech.
Privacy-utility tradeoff
We then examined the effects of varying privacy levels. Using the McAdams coefficient method, we systematically adjusted the coefficient from 0.5 to 1.0 (in increments of 0.1) and trained separate classification models for each value. This allowed for an in-depth analysis of the privacy-utility tradeoff in pathological speech and helped us understand the correlation between the two.
Privacy-fairness tradeoff
We evaluated the balance between privacy and fairness by analyzing demographic subgroups within our dataset. A fair classification network, in this context, is defined as one that maintains equal performance in detecting speech or voice disorders across all patient subgroups, both before and after anonymization59. To assess this, we not only compared AUROC performance and EER privacy metrics across different subgroups but also employed statistical parity difference (PtD)60 as a measure of demographic fairness. This metric represents the accuracy disparity between minority and majority classes, with ideal values being zero—indicating no discrimination. Positive values suggest a benefit to the minority class, whereas negative values indicate potential bias against these groups61. The demographic subgroups analyzed included gender (female and male) and age (adult and child), aiming to ensure equitable performance across these variables.
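Under the PtD definition used here, the computation reduces to an accuracy difference between subgroups; a small sketch with an illustrative example (one plausible reading of the metric cited above, ref. 60):

```python
def statistical_parity_difference(acc_minority: float, acc_majority: float) -> float:
    """Accuracy disparity between minority and majority subgroups.

    0 indicates no discrimination, positive values a benefit to the minority
    class, and negative values potential bias against it.
    """
    return acc_minority - acc_majority

# Example: 93% accuracy for the minority subgroup vs. 91% for the majority:
# statistical_parity_difference(0.93, 0.91) -> 0.02, slightly favoring the minority.
```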
Anonymization in diverse scenarios
Acknowledging Tayebi Arasteh et al.13’s findings that diversity in speakers and disorders substantially enhances the performance of ASV systems, making anonymization more challenging, we consolidated all patient data into a general patient set and all control data into a general control set. We undertook a task of detecting pathological speech across this combined dataset to evaluate the effectiveness of anonymization methods for large-scale pathological speech corpora. This comparison between original and anonymized speech data aimed to determine the feasibility of applying automatic anonymization to pathological speech in extensive datasets.
Broadening utility assessment
To ensure the robustness of our findings and mitigate task-specific biases, we undertook a multiclass classification challenge. Moving beyond the binary distinction between patients and healthy controls for each disorder, we categorized participants into one of five groups in a single analysis: healthy control, Dysarthria, Dysglossia, Dysphonia, and CLP. To maintain fairness, we included an equal number of speakers from each category, based on the smallest subgroup, Dysphonia, which comprised n = 78 speakers. Consequently, we selected 78 speakers from each of the six sampling subgroups (the four disorders plus adult and child controls), dividing them into a training set of n = 372 speakers (62 from each disorder, 62 adult controls, and 62 child controls) and a test set of n = 96 speakers (16 from each). This approach ensured each category was equally represented, with test speakers excluded from the training phase to guarantee they were unseen during the evaluation.
Exploring inversion methods in speech anonymization
In the final step of our investigation, we examined potential inversion risks within ASV systems. While membership inference attacks62 and countermeasures like differential privacy63,64 are well-discussed in the image processing domain59,61, their implications for speech data anonymization are less explored65,66. Specifically, we investigated how well the randomized McAdams coefficient method could stand against inversion attempts that aim to reverse the anonymization and identify speakers.
Considering a scenario where an external party is aware of the anonymization system’s specifics, including the McAdams coefficients, they might attempt to exploit this knowledge. This could involve training a counter ASV system tailored to recognize speakers despite their speech being anonymized. To test this, we utilized the same subset of the LibriSpeech37 dataset previously employed for our primary ASV system training, aiming for a straightforward comparison. This phase included training with both original and anonymized speech samples, using the randomized McAdams coefficient method, where anonymized versions were considered authentic utterances of the speakers. This setup helped us assess the feasibility of linking anonymized voices back to their original speakers, providing initial insights into our anonymization technique’s resistance to inversion efforts.
Generalization of the method beyond German language
The anonymization methods presented are not reliant on language-specific characteristics, demonstrating their adaptability across languages. We validated this generalization using the PC-GITA dataset67, which consists of speech recordings from n = 50 Parkinson’s Disease (PD) patients and n = 50 matched healthy controls (by age and gender), all native Spanish speakers from Colombia. The recordings were collected in accordance with a protocol designed to meet technical specifications and recommendations from experts in linguistics, phoniatry, and neurology. All recordings were made in controlled noise conditions within a soundproof booth, and the audio was sampled at 44.1 kHz with 16-bit resolution. None of the healthy controls exhibited symptoms of PD or any other neurological disorders.
The protocol for the PC-GITA dataset67 was approved by the Ethical Committee of the Research Institute in the Faculty of Medicine at the University of Antioquia in Medellín, Colombia (approval 19-63-673). All experiments were conducted in accordance with applicable national and international guidelines and regulations. Informed consent was obtained from all adult participants, as well as from the parents or legal guardians of the children involved. Our use of the PC-GITA dataset did not require separate ethical approval, as it is a restricted-access resource. Access was granted following our agreement to the dataset’s user terms.
For anonymization, we employed the McAdams Coefficient method, similar to that used with the German dataset. We utilized phonemic place of articulation features to distinguish between PD patients and healthy controls. A linear support vector regression machine68 was applied to predict the maximum phonation duration. The utility of the method was quantitatively assessed using Pearson’s r correlation coefficient between predicted and actual values, computed across PD patients and healthy controls.
Hardware
The hardware used in our experiments included Intel CPUs with 18 and 32 cores, 32 GB of RAM, and Nvidia GPUs such as the GeForce GTX 1080 Ti (11 GB), V100 (16 GB), RTX 6000 (24 GB), Quadro 5000 (32 GB), and Quadro 6000 (32 GB).
Reporting summary
Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.
Results
Impact of anonymization on pathological speech biomarkers
Table 2 presents the effects of anonymization on pathological speech biomarkers, illustrating a notable rise in EER following anonymization, signaling improved privacy measures. Figure 2 displays frequency spectra and power spectral densities (PSD) of both original and anonymized speech signals, offering a sample utterance from each disorder for illustrative purposes. We note that anonymization leads to a reduction in the signal’s power, with anonymized signals exhibiting lower PSD than their original counterparts. This observation suggests that anonymization not only obscures speaker identity but may also affect the acoustic properties of speech.
The anonymization performance varied by disorder type, with the randomized McAdams coefficient anonymization method outperforming the DL-based (i.e., the randomized pitch shift algorithm) anonymization method. Initially, EER values for disorders such as Dysarthria, Dysglossia, Dysphonia, and CLP were 1.80 ± 0.42%, 1.78 ± 0.43%, 2.19 ± 0.30%, and 7.01 ± 0.24%, respectively. After employing the randomized pitch shift algorithm, these values escalated to 30.72 ± 0.48%, 31.54 ± 0.87%, 41.02 ± 0.33%, and 38.73 ± 0.39%, showcasing increases of 1606%, 1672%, 1773%, and 452%, respectively. Similarly, the randomized McAdams coefficient method elevated EERs to 36.59 ± 0.64%, 34.26 ± 0.67%, 38.86 ± 0.35%, and 32.19 ± 0.46%, corresponding to increases of 1933%, 1825%, 1674%, and 359%, respectively.
Regarding utility, or the networks’ ability to detect disorders, the DL-based method notably compromised utility across all disorders. Conversely, the McAdams coefficient method resulted in only a modest AUROC decrease of 2.60% for Dysarthria post-anonymization (p = 5.5 × 10−27). Dysphonia and CLP experienced AUROC reductions of 0.75% (p = 3.4 × 10−13) and 0.07% (p = 0.14), respectively. Notably, for Dysglossia, anonymization via the McAdams coefficient method led to a significant increase of 1.11% in AUROC (p = 6.1 × 10−21).
Following these insights, we refined our examination of the DL-based method by omitting the pitch shifting process to scrutinize its synthesization phase’s impact. This phase involved reconstructing an input utterance solely with the vocoder module. The findings, detailed in Table 2, reveal a nuanced effect: while Dysglossia showed a slight improvement in AUROC (97.86 ± 0.33% vs. 97.73 ± 0.41%, p = 0.097), significant reductions were observed for other disorders without enhancing anonymization substantially. This indicates that utility is influenced not just by the identity-altering pitch shifting process but also by the synthesization phase itself.
Privacy-utility tradeoff results
Given the McAdams Coefficient method’s effectiveness over the DL-based approach, we further investigated this method exclusively for the remainder of our experiments. We adjusted the McAdams coefficient method to utilize fixed values, embarking on a series of controlled experiments. As depicted in Fig. 3, we observed that a linear increase in the coefficient value led to a logarithmic decrease in EER across all studied disorders, indicating that while the method supports customizable levels of anonymization, coefficient adjustments impact the effectiveness of anonymization in a logarithmic manner.
In terms of utility, adjusting anonymization levels (reflected by varying EER) had diverse effects on the network’s diagnostic capabilities across different disorders. Supplementary Table 2 elaborates on the AUROC, accuracy, sensitivity, and specificity metrics at different coefficient settings. Consistent with our earlier observations, Dysphonia showed a marked reduction in AUROC following anonymization (p = 1.0 × 10−31). Surprisingly, adjustments in anonymization levels did not produce a uniform trend in AUROC across all disorders, suggesting that the anonymization level’s impact on diagnostic accuracy is disorder-specific. The Pearson correlation coefficients between EER and AUROC values (−0.613 for Dysarthria, −0.112 for Dysglossia, 0.425 for Dysphonia, and −0.397 for CLP; with corresponding p-values of 0.19, 0.83, 0.40, and 0.44) were not statistically significant. Optimal privacy-utility balance was achieved with coefficients ranging from 0.75 to 0.9, corroborated by both EER and classification metrics. Subjective analysis of the anonymized speech waveforms affirmed that coefficients lower than 0.75 compromised audio quality, impacting not only pathological markers but also overall speech clarity.
Privacy-fairness tradeoff results
Table 3 presents our analysis on how anonymization affects the balance between privacy and fairness across different demographics. In terms of privacy, we observed a uniform increase in EER across gender subgroups, except in the case of Dysphonia detection. Here, female speakers initially had a significantly lower EER compared to males (0.22 ± 0.22% vs 2.28 ± 0.32%), but anonymization leveled the playing field, bringing both genders to similar privacy levels (35.09 ± 1.31% for females and 37.99 ± 0.35% for males). Regarding age subgroups, adults and children started with EERs of 1.25 ± 0.29% and 6.17 ± 0.24%, respectively. Post-anonymization, both groups converged to approximately 32% EER (32.26 ± 0.31% for adults and 32.08 ± 0.50% for children), indicating that anonymization effectively equalizes privacy across ages as well.
Regarding fairness in performance (Fig. 4), minor disparities were noted. Dysarthria detection showed a smaller AUROC reduction for females (around 1%) than for males (around 4%). This discrepancy is reflected in the statistical parity difference (PtD), which increased from 0.02 to 0.04 in favor of females post-anonymization. A similar pattern emerged in CLP detection.
For Dysglossia and Dysphonia, disparities were more pronounced, with PtD changes around 0.04 but in opposite directions. In Dysglossia, the network initially favored females (PtD = 0.03), shifting to slightly favor males (PtD = 0.01) post-anonymization. Conversely, in Dysphonia, the initial advantage for males (PtD = 0.02) switched to favor females (PtD = 0.02) after anonymization.
Age-related analysis showed nearly consistent performance post-anonymization. The initial advantage for children in general disorder detection (PtD = 0.03) slightly decreased to PtD = 0.02 but remained in favor of children.
Efficacy of automatic anonymization in large and diverse pathological datasets
Upon aggregating all patient and control data into comprehensive groups (n = 1333 for training and n = 576 for test), we noted a substantial increase in EER post-anonymization: from 2.96 ± 0.10% to 30.24 ± 0.33% for patients, from 5.20 ± 0.11% to 31.61 ± 0.13% for healthy controls, and from 4.02 ± 0.02% to 32.77 ± 0.05% for all data. This indicates comparable anonymization effectiveness across both patient and control groups, with EERs rising by approximately 27 percentage points despite the initial patient EER being roughly half that of the controls.
Regarding utility, the change was statistically significant (p = 1.2 × 10−61), but AUROC showed a minimal decrease of less than one percentage point, from 97.05 ± 0.16% to 96.07 ± 0.19% post-anonymization, suggesting a negligible impact on the ability to detect disorders. Figure 5 depicts these effects on utility, showcasing metrics such as AUROC, accuracy, sensitivity, and specificity, indicating that anonymization can substantially enhance privacy without substantially compromising diagnostic utility.
Multiclass classification performance
Figure 6 illustrates the outcomes of our multiclass classification experiment, posing a more complex challenge than prior binary classifications. In this setup, the model distinguishes whether an utterance is healthy or indicative of Dysarthria, Dysglossia, Dysphonia, or CLP. Drawing from the binary classification insights (see Fig. 3), we narrowed the McAdams coefficient to the [0.7, 1.0] range, where its effectiveness peaked, while halving the increment steps for greater precision. Overall, results echoed binary classification trends, with a notable exception: Dysphonia, previously showing lower AUROC scores for anonymized data, now demonstrated improved AUROC values.
A critical observation from this experiment is that while a strict monotonic relationship between privacy levels and utility remains elusive, specifying ranges for each disorder reveals potential for monotonicity. This insight underscores that the privacy-utility tradeoff is substantially influenced by the specific disorder in question, with the optimal balance being unique to each condition.
Inversion attack outcomes
In assessing the inversion attack’s outcomes, our experiment revealed EER values for disorders such as Dysarthria, Dysglossia, Dysphonia, and CLP at 1.64 ± 0.24%, 1.58 ± 0.29%, 1.63 ± 0.12%, and 5.86 ± 0.20%, respectively, using the inverse ASV system. After applying the randomized McAdams coefficient method for anonymization, these EERs rose to 7.08 ± 0.40%, 8.66 ± 0.63%, 10.48 ± 0.42%, and 12.00 ± 0.31%, respectively. Despite the substantial increase in EER post-anonymization, indicating a heightened level of privacy, the inverse ASV system substantially compromised the anonymization’s efficacy. For the primary ASV method compared to the inverse ASV system, the percentage increase in EER post-anonymization was noted as 1933% vs. 332% for Dysarthria, 1825% vs. 448% for Dysglossia, 1674% vs. 543% for Dysphonia, and 359% vs. 105% for CLP. This pattern was evident in both pathological and healthy speech, with EERs for healthy adults rising from 2.66 ± 0.39% to 8.26 ± 0.50%, and for children from 7.37 ± 0.16% to 14.90 ± 0.20% after anonymization, using the inverse ASV system. These results indicate that speech anonymization methods, whether applied to healthy or pathological speech, may not fully withstand inversion attacks. This underscores the necessity for ongoing research into strengthening the resilience of these methods, ensuring comprehensive privacy protection in speech data applications.
Generalization to other languages
Supplementary Fig. 2 presents the results of both utility and privacy assessments using the PC-GITA dataset67, which includes Spanish-speaking PD patients. The overall EER of the anonymized PD dataset is 34.00%, comparable to the results from the German dataset, where the EERs for Dysarthria, Dysglossia, Dysphonia, and CLP are 36.59%, 34.26%, 38.86%, and 32.19%, respectively. Similarly, a logarithmic decline in EER values is observed with linear increases in the McAdams coefficient, akin to the German dataset. Notably, the magnitude of EER changes increases with further adjustments to the coefficient, underscoring the necessity for disorder-specific configurations in anonymization methods. As in the German dataset, the anonymization process uniformly conceals the identities of both patients and healthy controls. Initially, the EER for controls is 2.00%, while for PD it is 3.78% (nearly twice as high). After anonymization using a random coefficient, the EERs for controls and PD converge to 34.9% and 34.0%, respectively, demonstrating that the initial disparity in identification difficulty is effectively neutralized.
Regarding utility, the original data exhibited a correlation coefficient of 0.71. Post-anonymization, this value decreases to 0.42 with a randomized coefficient and to 0.57 with a fixed coefficient set at 0.8. These findings indicate a reduction in some pathological biomarkers, yet the results align well with those from the German dataset. This alignment supports the generalizability of the proposed methods, while also highlighting the critical need for continued research into disorder-specific anonymization techniques.
Discussion
Our investigation into the impact of anonymization on pathological speech biomarkers across a dataset of over 2700 speakers revealed substantial privacy enhancements, especially utilizing the McAdams Coefficient method and DL-based approaches, as evidenced by increased EER. Despite this improvement in privacy, the effect on the utility for diagnosing specific speech and voice disorders varied, maintaining overall minimal influence on diagnostic accuracy. Notably, anonymization had a modest impact on Dysarthria, Dysphonia, and CLP, yet interestingly, it benefited Dysglossia. This advantage for Dysglossia could stem from its primary manifestation in articulatory changes affecting vowels and consonants, suggesting a lesser susceptibility to the anonymization process’s alterations.
Our research underscores that automatic anonymization consistently affects the utility of pathological speech data across different methodologies. Despite this, the tradeoff observed between the level of anonymization and the utility of pathological speech data leans towards a positive equilibrium, underscoring the effectiveness of modern anonymization techniques in managing pathological speech. This balance is particularly pronounced in conditions such as Dysarthria, Dysglossia, Dysphonia, and CLP, highlighting the practicality of these anonymization strategies in preserving the integrity of medical diagnostics while enhancing privacy.
We evaluated both DL-based and signal-level modification anonymization methods for their applicability to pathological speech. The signal-level modification methods, particularly the McAdams Coefficient method, emerged as generally superior to DL-based approaches in anonymizing pathological speech. The DL-based method we utilized, which aligns with the broader category of voice conversion methods, adopted a simplified strategy. Instead of converting speaker identity, we applied a randomized pitch shift followed by speech synthesization. This approach was specifically chosen to facilitate the anonymization of a vast array of 2742 unique speakers, aiming to enhance the availability of public datasets for the development of data-driven pathological speech analysis tools. This focus on a real-world application underscores the limited utility of voice conversion methods that merely alter a speaker’s identity to another in our context. Future investigations might delve into the nuances of more advanced voice conversion-based anonymization methods for pathological speech, examining their specific advantages and applications, while mindful that alterations in fundamental frequency (F0) and formants could obscure critical pathological speech characteristics necessary for in-depth analysis.
Our analysis of the privacy-utility tradeoff has yielded substantial insights. Contrary to common assumptions25, we found that the relationship between the level of anonymization and its utility is not strictly monotonic. Adjusting the McAdams coefficient linearly leads to logarithmic variations in EER, affecting the identification capabilities for various disorders in distinct ways. Importantly, an increase in EER does not universally diminish the utility of pathological speech data; rather, the impact varies by disorder. We observed that for each type of speech or voice disorder, there exists a specific anonymization level that offers a more favorable privacy-utility balance. This underscores the importance of pinpointing the optimal tradeoff point for each disorder to achieve a balanced enhancement of privacy while maintaining utility.
Despite limited exploration of anonymization in pathological speech, research on privacy-preserving pathological AI models within imaging has highlighted potential effects on demographic fairness61,69,70. Our study extended this investigation to the speech domain, evaluating the privacy-fairness tradeoff. We observed uniform anonymization efficacy across demographic subgroups, with minor diagnostic fairness discrepancies, especially notable in gender differences within Dysphonia detection. Initially, females had substantially lower EER than males, yet post-anonymization, both genders achieved similar privacy levels. A deeper look revealed an underrepresentation of females (n = 8) compared to males (n = 70) in this subgroup. The overall impact on diagnostic fairness was modest, typically under 4%, and varied by disorder. In Dysarthria and CLP, minimal effects occurred, whereas in Dysglossia and Dysphonia a shift in which gender subgroup was favored post-anonymization can be seen. Age subgroup analysis showed nearly uniform impacts, underscoring the nuanced influence of anonymization on demographic fairness.
Leveraging a diverse dataset, we assessed the anonymization’s effectiveness across a combined set of patient and control data. This approach aimed to challenge the anonymization methods with a broad spectrum of speech characteristics, in line with findings13 that diversity in speaker and disorder profiles can complicate anonymization. Our results affirm the feasibility of applying automatic anonymization to extensive pathological speech datasets, enhancing privacy with minimal impact on the clinical utility of speech data for diagnostics.
Determining a safe level of anonymity involves assessing the risk of identification within a dataset. Moreover, we acknowledge that absolute privacy, defined as zero risk, can only be achieved when no information is disclosed71. For the entire dataset of n = 2742 speakers, the original EER stood at 4.02 ± 0.02%. Assuming all speakers are published, and an individual from this dataset attempts to re-identify their recording, the breakdown is as follows: false acceptance (FA) = 2741 × 4.02% = 110.19 ≈ 110 (rounded), true acceptance (TA) = 1 × 95.98% = 0.960, false rejection (FR) = 1 × 4.02% = 0.040, and true rejection (TR) = 2741 × 95.98% = 2630.810. This scenario yields approximately 110 recordings requiring manual verification, translating to a 1:110 chance of accurate identification. Post-anonymization, with the EER increasing to 32.77 ± 0.05%, the likelihood adjusts to 1:898 (2741 × 32.77% = 898.23 ≈ 898), marking a substantial improvement in privacy. However, the practicality of manually sifting through such a large number of potential matches is constrained by computational limitations. Thus, this level of anonymization is deemed sufficient.
This effect becomes even more pronounced when we focus solely on publishing the patient subset (n = 1443) instead of the entire dataset. Initially, with an EER of 2.96 ± 0.10%, the chance of identification before anonymization stood at 1:43, a range feasibly manageable for manual checking. Post-anonymization, with the EER escalating to 30.24 ± 0.33%, this chance dramatically improves to 1:436, exemplifying an ideal enhancement in privacy. When focusing on a smaller dataset, such as the Dysphonia subset with n = 78 subjects, the original EER of 2.19 ± 0.30% increased to 38.86 ± 0.35% after anonymization. This change boosts the identification challenge from 1:2 in the original dataset to 1:30 post-anonymization, a nearly 15-fold increase in difficulty. Although this substantially improves anonymity, manually verifying 30 speakers is still manageable. Therefore, for optimal anonymity enhancement, it is advisable to publish data in larger quantities.
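These 1:N estimates all follow from a single expected-value formula; a short sketch reproducing the worked numbers above (the helper name is ours):

```python
def expected_match_candidates(n_speakers: int, eer_percent: float) -> float:
    """Expected false acceptances when one target is probed against the other
    n_speakers - 1 recordings at the EER operating point (the 1:N odds above)."""
    return (n_speakers - 1) * eer_percent / 100.0

expected_match_candidates(2742, 4.02)    # ~110 -> 1:110, full dataset, original
expected_match_candidates(2742, 32.77)   # ~898 -> 1:898, full dataset, anonymized
expected_match_candidates(1443, 2.96)    # ~43  -> 1:43,  patient subset, original
expected_match_candidates(1443, 30.24)   # ~436 -> 1:436, patient subset, anonymized
```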
Our study acknowledges certain limitations. First, while our dataset spans various German-speaking regions and exhibits considerable demographic diversity, it is limited to the German language. As part of a validation experiment, we utilized a PD dataset in Spanish. While some pathological biomarkers were slightly reduced, the results concerning the privacy-utility trade-off were generally consistent with those from our German dataset, indicating the potential generalizability of our proposed methods. However, the results also highlight the urgent need for further research into disorder-specific methods that can effectively address the limitations of automatic anonymization of pathological speech. Additionally, the limited availability of adult controls restricted our ability to completely balance age distributions across all adult sub-groups. Future studies should focus on acquiring more data from both healthy and patient adult populations to deepen and clarify comparative results. Second, our analysis is based on specific speech tests, suggesting a future exploration of more diverse speech transcripts could offer deeper insights. Third, our primary aim was to assess the utility of anonymization for identifying pathological speech biomarkers, not for general speech recognition, which is often evaluated using word recognition or error rates. Regarding privacy, recent studies suggest that the intelligibility of pathological speech, as measured by word recognition rates, does not directly correlate with the ease of speaker identification post-anonymization13. On the utility front, our focus was on the utility of anonymized speech in aiding the training of data-driven diagnostic models, where the automatic extraction of features by neural networks is crucial. Fourth, our exploration into inversion methods for anonymized speech was preliminary, a subject that has seen limited discussion in existing studies concerning non-pathological speech. This brief examination demonstrated that such methods could potentially reduce the effectiveness of anonymization, suggesting anonymized speech’s vulnerability to re-identification efforts. While some research72 advocates for the potential reversibility of anonymized speech for trusted entities, concerns arise regarding its compatibility with stringent privacy standards like General Data Protection Regulation (GDPR)73 when accessible to untrusted parties66. Our future work will delve into these inverse methodologies and their implications for the anonymization of pathological speech. Fifth, while we used disorder detection as the baseline for evaluating the utility, we recognize that this approach may be too general for assessing the complex impacts on the data. As such, future research should focus on the degree to which anonymized samples can be investigated with respect to detailed speech analytic measures, including prosodic, phonetic, phonation, and resonance features. This would offer a more granular understanding of the trade-offs involved in anonymization and its effects on the diagnostic quality of speech data. Additionally, we plan to conduct a thorough perceptual analysis of anonymized speech to evaluate its utility not only for machine-based analyses but also for human assessments, thus broadening the scope of utility evaluation. This investigation represents an initial step into this area, with further research necessary to fully understand the impacts on dialects, age, gender, and other variables.
In sum, our study demonstrates that anonymization can substantially increase patient privacy in pathological speech data without substantially compromising diagnostic utility or fairness, marking a pivotal step forward in the responsible use of speech data in healthcare. Further research is needed to refine these methods and explore their application across a broader range of disorders and languages, ensuring global applicability, fairness, and robustness against inversion attacks.
Data availability
The German speech dataset used in this study is internal data from patients at the University Hospital Erlangen and is not publicly available due to patient privacy regulations. Access requests can be directed to the corresponding author for on-site access at the University Hospital Erlangen in Erlangen, Germany. The PC-GITA67 dataset is a restricted-access resource. To gain access, users must agree to the dataset’s data protection requirements by submitting a request and signing a user agreement through the GITA Lab, University of Antioquia, Medellín, Colombia (rafael.orozco@udea.edu.co). The source data for Figs. 3, 4, 5 and 6 is available as Supplementary Data 1.
Code availability
To ensure transparency and facilitate further research, our entire source code is publicly accessible at https://doi.org/10.5281/zenodo.1280621374. This repository includes comprehensive details on training protocols, evaluation procedures, data preprocessing, and anonymization processes, promoting reproducibility within the research community. The codebase is implemented in Python v3.9 and employs the PyTorch v1.13 framework for all deep learning tasks.
References
Strimbu, K. & Tavel, J. A. What are biomarkers? Curr. Opin. HIV AIDS 5, 463–466 (2010).
Califf, R. M. Biomarker definitions and their applications. Exp. Biol. Med. (Maywood) 243, 213–221 (2018).
Ramanarayanan, V., Lammert, A. C., Rowe, H. P., Quatieri, T. F. & Green, J. R. Speech as a biomarker: opportunities, interpretability, and challenges. Perspect. ASHA SIGs 7, 276–283 (2022).
Rios-Urrego, C. D., Vásquez-Correa, J. C., Orozco-Arroyave, J. R. & Nöth, E. Is there any additional information in a neural network trained for pathological speech classification? in Text, Speech, and Dialogue (eds. Ekštein, K., Pártl, F. & Konopík, M.) vol. 12848 435–447 (Springer International Publishing, 2021).
Moro-Velazquez, L., Villalba, J. & Dehak, N. Using X-vectors to automatically detect Parkinson’s disease from speech. in ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 1155–1159 (IEEE, 2020). https://doi.org/10.1109/ICASSP40776.2020.9053770.
Tayebi Arasteh, S. et al. Federated learning for secure development of AI models for Parkinson’s disease detection using speech from different languages. in INTERSPEECH 2023 5003–5007 (Dublin, 2023). https://doi.org/10.21437/Interspeech.2023-2108.
Pappagari, R., Cho, J., Moro-Velázquez, L. & Dehak, N. Using state-of-the-art speaker recognition and natural language processing technologies to detect Alzheimer’s disease and assess its severity. in INTERSPEECH 2020 2177–2181 (ISCA, 2020). https://doi.org/10.21437/Interspeech.2020-2587.
Jamal, N., Shanta, S., Mahmud, F. & Sha’abani, M. Automatic speech recognition (ASR) based approach for speech therapy of aphasic patients: a review. in AIP Conference Proceedings 020028 (Johor, 2017). https://doi.org/10.1063/1.5002046.
Nautsch, A. et al. Preserving privacy in speaker and speech characterisation. Computer Speech Lang. 58, 441–480 (2019).
Tomashenko, N. et al. Introducing the VoicePrivacy Initiative. in INTERSPEECH 2020 1693–1697 (ISCA, 2020). https://doi.org/10.21437/Interspeech.2020-1333.
Qian, J. et al. Towards privacy-preserving speech data publishing. in IEEE INFOCOM 2018 - IEEE Conference on Computer Communications 1079–1087 (IEEE, 2018). https://doi.org/10.1109/INFOCOM.2018.8486250.
Lal Srivastava, B. M. et al. Evaluating voice conversion-based privacy protection against informed attackers. in ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2802–2806 (IEEE, 2020). https://doi.org/10.1109/ICASSP40776.2020.9053868.
Tayebi Arasteh, S. et al. The effect of speech pathology on automatic speaker verification: a large-scale study. Sci. Rep. 13, 20476 (2023).
Tomashenko, N. et al. The VoicePrivacy 2020 challenge: results and findings. Computer Speech Lang. 74, 101362 (2022).
Tomashenko, N. et al. The VoicePrivacy 2022 challenge evaluation plan. Preprint at http://arxiv.org/abs/2203.12468 (2022).
Fang, F. et al. Speaker anonymization using X-vector and neural waveform models. 10th ISCA Speech Synthesis Workshop (ISCA, 2019).
Snyder, D., Garcia-Romero, D., Sell, G., Povey, D. & Khudanpur, S. X-Vectors: Robust DNN embeddings for speaker recognition. in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 5329–5333 (IEEE, 2018). https://doi.org/10.1109/ICASSP.2018.8461375.
Patino, J., Tomashenko, N., Todisco, M., Nautsch, A. & Evans, N. Speaker anonymisation using the McAdams coefficient. Proc. Interspeech 1099–1103 (2021). https://doi.org/10.21437/Interspeech.2021-1070.
McAdams, S. E. Spectral fusion, spectral parsing and the formation of auditory images. (Ph.D. dissertation, Stanford University, 1984).
Mawalim, C. O., Okada, S. & Unoki, M. Speaker anonymization by pitch shifting based on time-scale modification. in 2nd Symposium on Security and Privacy in Speech Communication 35–42 (ISCA, 2022). https://doi.org/10.21437/SPSC.2022-7.
Khamsehashari, R. et al. Voice Privacy - leveraging multi-scale blocks with ECAPA-TDNN SE-Res2NeXt extension for speaker anonymization. in 2nd Symposium on Security and Privacy in Speech Communication 43–48 (ISCA, 2022). https://doi.org/10.21437/SPSC.2022-8.
Meyer, S. et al. Anonymizing speech with generative adversarial networks to preserve speaker privacy. in 2022 IEEE Spoken Language Technology Workshop (SLT) 912–919 (IEEE, Doha, Qatar, 2023). https://doi.org/10.1109/SLT54892.2023.10022601.
Perero-Codosero, J. M., Espinoza-Cuadros, F. M. & Hernández-Gómez, L. A. X-vector anonymization using autoencoders and adversarial training for preserving speech privacy. Computer Speech Lang. 74, 101351 (2022).
Srivastava, B. M. L. et al. Design choices for X-vector based speaker anonymization. in INTERSPEECH 2020 1713–1717 (ISCA, 2020). https://doi.org/10.21437/Interspeech.2020-2692.
Srivastava, B. M. L. et al. Privacy and utility of X-vector based speaker anonymization. IEEE/ACM Trans. Audio Speech Lang. Process. 30, 2383–2395 (2022).
Hernandez, A. et al. Self-supervised speech representations preserve speech characteristics while anonymizing voices. Preprint at http://arxiv.org/abs/2204.01677 (2022).
Zhu, Y., Imoussaïne-Aïkous, M., Côté-Lussier, C. & Falk, T. H. Investigating biases in COVID-19 diagnostic systems processed with automated speech anonymization algorithms. in 3rd Symposium on Security and Privacy in Speech Communication 46–54 (ISCA, 2023). https://doi.org/10.21437/SPSC.2023-8.
Hirose, H. Pathophysiology of motor speech disorders (Dysarthria). Folia Phoniatr. Logop. 38, 61–88 (1986).
Schröter-Morasch, H. & Ziegler, W. Rehabilitation of impaired speech function (dysarthria, dysglossia). GMS Curr. Top. Otorhinolaryngol. Head. Neck Surg. 4, Doc15 (2005).
Sama, A., Carding, P. N., Price, S., Kelly, P. & Wilson, J. A. The clinical features of functional dysphonia. Laryngoscope 111, 458–463 (2001).
Harding, A. & Grunwell, P. Characteristics of cleft palate speech. Int. J. Lang. Comm. Disord. 31, 331–357 (1996).
Millard, T. & Richman, L. C. Different cleft conditions, facial appearance, and speech: relationship to psychological variables. Cleft Palate-Craniofacial J. 38, 68–75 (2001).
Wantia, N. & Rettinger, G. The current understanding of cleft lip malformations. Facial Plast. Surg. 18, 147–154 (2002).
Maier, A., Nöth, E., Batliner, A., Nkenke, E. & Schuster, M. Fully automatic assessment of speech of children with cleft lip and palate. Informatica 30, 477–482 (2006).
Maier, A. et al. PEAKS—a system for the automatic evaluation of voice and speech disorders. Speech Commun. 51, 425–437 (2009).
Fox, A. V. PLAKSS: Psycholinguistische Analyse kindlicher Sprechstörungen. (Swets & Zeitlinger, Frankfurt a.M, Germany, 2002).
Panayotov, V., Chen, G., Povey, D. & Khudanpur, S. Librispeech: an ASR corpus based on public domain audio books. in 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 5206–5210 (IEEE, 2015). https://doi.org/10.1109/ICASSP.2015.7178964.
Hansen, J. H. L. & Hasan, T. Speaker recognition by machines and humans: a tutorial review. IEEE Signal Process. Mag. 32, 74–99 (2015).
Kinnunen, T. & Li, H. An overview of text-independent speaker recognition: From features to supervectors. Speech Commun. 52, 12–40 (2010).
Hashimoto, K., Yamagishi, J. & Echizen, I. Privacy-preserving sound to degrade automatic speaker verification performance. in 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 5500–5504 (IEEE, 2016). https://doi.org/10.1109/ICASSP.2016.7472729.
Arasteh, S. T. An empirical study on text-independent speaker verification based on the GE2E method. Preprint at http://arxiv.org/abs/2011.04896 (2022).
Wan, L., Wang, Q., Papir, A. & Moreno, I. L. Generalized end-to-end loss for speaker verification. in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 4879–4883 (IEEE, 2018). https://doi.org/10.1109/ICASSP.2018.8462665.
Prabhavalkar, R., Alvarez, R., Parada, C., Nakkiran, P. & Sainath, T. N. Automatic gain control and multi-style training for robust small-footprint keyword spotting with deep neural networks. in 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 4704–4708 (IEEE, 2015). https://doi.org/10.1109/ICASSP.2015.7178863.
Ramirez, J., Górriz, J. M. & Segura, J. C. Voice activity detection: fundamentals and speech recognition system robustness. in Robust Speech Recognition and Understanding (eds. Grimm, M. & Kroschel, K.) (I-Tech Education and Publishing, 2007). https://doi.org/10.5772/4740.
Hochreiter, S. & Schmidhuber, J. Long short-term memory. Neural Comput. 9, 1735–1780 (1997).
Kingma, D. P. & Ba, J. Adam: a method for stochastic optimization. in Proceedings of the 3rd International Conference on Learning Representations (ICLR) (San Diego, CA, USA, 2015).
Fey, M. E. Articulation and phonology: inextricable constructs in speech pathology. LSHSS 23, 225–232 (1992).
Peppé, S. J. E. Why is prosody in speech-language pathology so difficult? Int. J. Speech-Lang. Pathol. 11, 258–271 (2009).
Vásquez-Correa, J. C., Fritsch, J., Orozco-Arroyave, J. R., Nöth, E. & Magimai-Doss, M. On modeling glottal source information for phonation assessment in Parkinson’s disease. in INTERSPEECH 2021 26–30 (ISCA, 2021). https://doi.org/10.21437/Interspeech.2021-1084.
Gustafsson, F. Determining the initial states in forward-backward filtering. IEEE Trans. Signal Process. 44, 988–992 (1996).
Chlasta, K., Wołk, K. & Krejtz, I. Automated speech-based screening of depression using deep convolutional neural networks. Procedia Comput. Sci. 164, 618–628 (2019).
Muzammel, M., Salam, H., Hoffmann, Y., Chetouani, M. & Othmani, A. AudVowelConsNet: a phoneme-level based deep CNN architecture for clinical depression diagnosis. Mach. Learn. Appl. 2, 100005 (2020).
He, K., Zhang, X., Ren, S. & Sun, J. Deep residual learning for image recognition. in 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) 770–778 (IEEE, 2016). https://doi.org/10.1109/CVPR.2016.90.
Deng, J. et al. ImageNet: a large-scale hierarchical image database. in 2009 IEEE Conference on Computer Vision and Pattern Recognition 248–255 (IEEE, 2009). https://doi.org/10.1109/CVPR.2009.5206848.
Taylor, P. Text-to-Speech Synthesis (Cambridge University Press, 2009). https://doi.org/10.1017/CBO9780511816338.
Desai, S., Raghavendra, E. V., Yegnanarayana, B., Black, A. W. & Prahallad, K. Voice conversion using artificial neural networks. in 2009 IEEE International Conference on Acoustics, Speech and Signal Processing 3893–3896 (IEEE, 2009). https://doi.org/10.1109/ICASSP.2009.4960478.
Kong, J., Kim, J. & Bae, J. HiFi-GAN: Generative adversarial networks for efficient and high fidelity speech synthesis. in NIPS’20: Proceedings of the 34th International Conference on Neural Information Processing Systems Vol. 1428 17022–17033 (2020).
Zen, H. et al. LibriTTS: a corpus derived from LibriSpeech for text-to-speech. in INTERSPEECH 2019 1526–1530 (ISCA, 2019). https://doi.org/10.21437/Interspeech.2019-2441.
Tayebi Arasteh, S. et al. Securing collaborative medical AI by using differential privacy: domain transfer for classification of chest radiographs. Radiol. Artif. Intell. 6, e230212 (2024).
Calders, T. & Verwer, S. Three naive Bayes approaches for discrimination-free classification. Data Min. Knowl. Disc 21, 277–292 (2010).
Tayebi Arasteh, S. et al. Preserving fairness and diagnostic accuracy in private large-scale AI models for medical imaging. Commun. Med. 4, 46 (2024).
Shokri, R., Stronati, M., Song, C. & Shmatikov, V. Membership inference attacks against machine learning models. in 2017 IEEE Symposium on Security and Privacy (SP) 3–18 (IEEE, 2017). https://doi.org/10.1109/SP.2017.41.
Dwork, C. Differential privacy. in Automata, Languages and Programming (eds. Bugliesi, M., Preneel, B., Sassone, V. & Wegener, I.) vol. 4052 1–12 (Springer, 2006).
Abadi, M. et al. Deep learning with differential privacy. in Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security 308–318 (ACM, 2016). https://doi.org/10.1145/2976749.2978318.
Pathak, M. A. & Raj, B. Privacy-preserving speaker verification and identification using Gaussian mixture models. IEEE Trans. Audio Speech Lang. Process. 21, 397–406 (2013).
Champion, P., Thebaud, T., Le Lan, G., Larcher, A. & Jouvet, D. On the invertibility of a voice privacy system using embedding alignment. in 2021 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU) 191–197 (IEEE, 2021). https://doi.org/10.1109/ASRU51503.2021.9688159.
Orozco-Arroyave, J. R., Arias-Londoño, J. D., Vargas-Bonilla, J. F., González-Rátiva, M. C. & Noeth, E. New Spanish speech corpus database for the analysis of people suffering from Parkinson’s disease. in Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC’14) 342–347 (2014).
Drucker, H., Burges, C. J. C., Kaufman, L., Smola, A. & Vapnik, V. Support vector regression machines. in Advances in Neural Information Processing Systems 9 (NIPS 1996) (1996).
Ricci Lara, M. A., Echeveste, R. & Ferrante, E. Addressing fairness in artificial intelligence for medical imaging. Nat. Commun. 13, 4581 (2022).
Bagdasaryan, E., Poursaeed, O. & Shmatikov, V. Differential privacy has disparate impact on model accuracy. in Proceedings of the 33rd International Conference on Neural Information Processing Systems vol. 1387 15479–15488 (Curran Associates Inc., 2019).
Dwork, C. A firm foundation for private data analysis. Commun. ACM 54, 86–95 (2011).
Magariños, C. et al. Reversible speaker de-identification using pre-trained transformation functions. Computer Speech Lang. 46, 36–52 (2017).
European Parliament and Council. Regulation (EU) 2016/679 of the European Parliament and of the Council of 27 April 2016 on the Protection of Natural Persons with Regard to the Processing of Personal Data and on the Free Movement of Such Data, and Repealing Directive 95/46/Ec. (General Data Protection Regulation, 2016).
Tayebi Arasteh, S. Pathology Anonym. https://doi.org/10.5281/zenodo.12806213 (2024).
Acknowledgements
We acknowledge financial support by Deutsche Forschungsgemeinschaft (DFG) and Friedrich-Alexander-Universität Erlangen-Nürnberg within the funding program Open Access Publication Funding. This study was funded by Friedrich-Alexander-Universität Erlangen-Nürnberg, Medical Valley e.V., and Siemens Healthineers AG within the framework of d.hip campus.
Funding
Open Access funding enabled and organized by Projekt DEAL.
Author information
Contributions
The formal analysis was conducted by S.T.A., A.M., and S.H.Y. and the original draft was written by S.T.A. and corrected by M.S., E.N., A.M., and S.H.Y. The software was developed by S.T.A.; S.T.A., T.A.V., P.A.P.T., T.W., K.P., E.N., A.M., and S.H.Y. provided technical expertise; M.S. provided clinical expertise. The experiments were performed by S.T.A. Statistical analysis was performed by S.T.A. Datasets were provided by M.S., A.M., and S.H.Y.; S.T.A. and T.W. downloaded the datasets from the database. S.T.A. cleaned, organized, and pre-processed the data. E.N., A.M., and S.H.Y. supported the conception of the study and the experiments. S.T.A., A.M., and S.H.Y. designed the study. All authors read the manuscript, contributed to the editing, and agreed to the submission of this paper.
Ethics declarations
Competing interests
The authors declare no competing interests.
Peer review
Peer review information
Communications Medicine thanks Jose Gonzalez-Lopez and Yvonne Wren for their contribution to the peer review of this work. Peer reviewer reports are available.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.