Privacy Against Real-Time Speech Emotion Detection Via Acoustic Adversarial Evasion of Machine Learning
• EON (Emotion Obfuscating Noise):
– Universal spectral perturbations, generated by combining 𝑁 different tones, each with a different frequency, amplitude, and temporal variation.
– Mask the spectral attributes of speech that convey emotional information.
– Non-invasive: EONs can be played simultaneously with users' speech, ensuring real-time protection.
• EONs are generated using a high-level genetic programming (GP) approach
– Constraints:
• the surrogate SER classifier should no longer classify the emotion correctly
• the transcription service should still extract the correct text
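A minimal sketch of how an 𝑁-tone EON could be synthesized, assuming each tone is parameterized by a frequency, an amplitude, and a slow amplitude-modulation rate standing in for "temporal variation" (the parameterization and function names here are illustrative, not the authors' code):

```python
import numpy as np

def generate_eon(params, sr=16000, duration=1.0):
    """Synthesize an EON by summing N tones.

    `params` is a list of (frequency_hz, amplitude, mod_rate_hz) triples;
    the modulation rate gives each tone a slow amplitude envelope as a
    stand-in for the paper's per-tone temporal variation (an assumption).
    """
    t = np.linspace(0.0, duration, int(sr * duration), endpoint=False)
    eon = np.zeros_like(t)
    for freq, amp, mod_rate in params:
        envelope = 0.5 * (1.0 + np.sin(2 * np.pi * mod_rate * t))  # temporal variation
        eon += amp * envelope * np.sin(2 * np.pi * freq * t)
    return eon

# Example: an EON built from N = 3 tones
eon = generate_eon([(300, 0.1, 2.0), (1200, 0.05, 0.5), (2500, 0.02, 1.0)])
```

Mixing is then just sample-wise addition of `eon` to the speech signal at the chosen loudness.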
Methodology
• Fitness :
– ranks each individual in the population
– based upon the ability of an EON to mislead the surrogate SER classifier C*, and its ability to do so without
compromising the underlying audio’s transcription (see Equations 1-3).
• Selection
– selects a subset of individuals from the population to carry forward into the next generation, before crossover and mutation. Selection is performed using a tournament selection method [54] with a tournament size of 𝑛𝑆𝑒𝑙.
– this guarantees that at least the 𝑛𝑆𝑒𝑙 − 1 weakest individuals are eliminated from each generation.
• Crossover
– generates new individuals (i.e., offspring) from previously selected ones (i.e., parents) by exchanging the parents' genes: the parameters of two existing individuals are combined to create two new individuals, and thus new EONs.
• Mutation
– Used to prevent population stagnation
– New individuals are generated by randomly modifying selected individuals' EON-generation parameters. Mutation
introduces the greatest amount of variability in the population; a given individual can undergo any number of
changes, leading to significant improvement or degradation. It is performed by randomly shuffling EON parameters
(scaled to [0, 1]) with probability 𝑝𝑀𝑈𝑋.
• Final EON Selection
– After iterating for 𝐾 generations, the EON with the highest evasion success rate (ESR) when mixed with the validation
dataset is selected as the final EON.
• Hyperparameters
– 𝑛𝑆𝑒𝑙, 𝑝𝐶𝑋, and 𝑝𝑀𝑈𝑋 are hyperparameters, identified empirically through grid search.
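The selection, crossover, and mutation steps above can be sketched as a generic GP loop. The vector-of-parameters encoding, one-point crossover, and per-gene resampling for mutation are illustrative assumptions, not the paper's exact operators:

```python
import random

def tournament_select(population, fitness, n_sel, tournament_size):
    """Pick n_sel parents; each slot is won by the fittest of a random tournament."""
    return [max(random.sample(population, tournament_size), key=fitness)
            for _ in range(n_sel)]

def crossover(a, b):
    """One-point crossover over two parents' EON parameter vectors."""
    point = random.randrange(1, len(a))
    return a[:point] + b[point:], b[:point] + a[point:]

def mutate(ind, p_mux):
    """Resample each parameter (scaled to [0, 1]) with probability p_mux."""
    return [random.random() if random.random() < p_mux else g for g in ind]

def evolve(population, fitness, generations, n_sel, p_cx, p_mux, tournament_size=3):
    """Run the GP loop for K generations and return the fittest individual."""
    for _ in range(generations):
        parents = tournament_select(population, fitness, n_sel, tournament_size)
        offspring = []
        while len(offspring) < len(population):
            a, b = random.sample(parents, 2)
            if random.random() < p_cx:       # crossover with probability p_cx
                a, b = crossover(a, b)
            offspring.append(mutate(a, p_mux))
            offspring.append(mutate(b, p_mux))
        population = offspring[:len(population)]
    return max(population, key=fitness)
```

In DARE-GP the fitness callback would score each candidate EON against the surrogate SER classifier; here any function over the parameter vector (e.g. `sum`) works for experimentation.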
Results & Findings
Calculations
𝑓𝑖𝑡𝑛𝑒𝑠𝑠(𝑖𝑛𝑑) = 𝑑𝑒𝑐𝑒𝑝𝑡𝑖𝑜𝑛(𝑖𝑛𝑑) × 𝑡𝑟𝑎𝑛𝑠𝑐𝑟𝑖𝑝𝑡𝑖𝑜𝑛(𝑖𝑛𝑑)
where 𝑑𝑒𝑐𝑒𝑝𝑡𝑖𝑜𝑛(𝑖𝑛𝑑) rewards the EON's ability to make the surrogate classifier C* mispredict the emotion of each training sample 𝑥 (including a 𝑏𝑜𝑛𝑢𝑠(𝑖𝑛𝑑, 𝑥) term), and 𝑡𝑟𝑎𝑛𝑠𝑐𝑟𝑖𝑝𝑡𝑖𝑜𝑛(𝑖𝑛𝑑) rewards preserving the speech-to-text output of the mixed audio (full definitions in Equations 1–3 of the paper).
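A toy rendering of the multiplicative fitness, with `classify`, `transcribe`, and `mix` passed in as stand-ins for the surrogate C*, the transcription service, and audio mixing; the fraction-based deception and transcription scores below are illustrative simplifications, not the paper's Equations 1–3:

```python
def fitness(ind, samples, classify, transcribe, mix):
    """fitness(ind) = deception(ind) * transcription(ind).

    `samples` is a list of (audio, true_emotion, true_text) triples.
    Deception is the fraction of samples whose emotion the surrogate
    mispredicts after mixing; transcription is the fraction whose
    speech-to-text output is preserved (simplified stand-ins).
    """
    fooled = 0
    preserved = 0
    for audio, label, text in samples:
        mixed = mix(audio, ind)
        if classify(mixed) != label:    # surrogate no longer sees the true emotion
            fooled += 1
        if transcribe(mixed) == text:   # speech-to-text output is preserved
            preserved += 1
    n = len(samples)
    return (fooled / n) * (preserved / n)
```

The multiplicative form matters: an EON that fools the classifier on every sample but destroys every transcript scores zero, matching the constraint that transcription must survive.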
Datasets
• **RAVDESS Dataset**: 1,440 samples from 24 speakers, with an equal split between male and female actors. The spoken part includes two separate utterances by each speaker, demonstrating 8 different emotions. RAVDESS serves as the core resource for training and evaluating Emotion Obfuscating Noises (EONs) in the DARE-GP methodology.
• **TESS Dataset**: 1,800 audio samples generated by two actresses, covering five emotions: neutral, angry, happy, sad, and fearful. A subset of the TESS data is used for training and evaluating EONs, providing diverse emotional content.
• **EON Training and Evaluation**: RAVDESS is used to train "factory default" EONs that are not tailored to specific target users or environments. A portion of the TESS data provides additional training for any black-box classifiers that underperformed on the original TESS data, contributing to the adaptability and robustness of the evaluation.
• **Evaluation Metrics**: DARE-GP's success is measured with Evasion Success Rate (ESR) and False Label Independence, covering both the protection of emotional privacy and the utility of the modified audio. For black-box evaluation, EONs are trained on specific data subsets and assessed on dedicated evaluation datasets.
• **Role in Methodology**: Both datasets are central to training, tailoring, and evaluating EONs' ability to obfuscate emotional content and evade Speech Emotion Recognition (SER) classifiers, enabling validation for real-world deployment scenarios.
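The Evasion Success Rate mentioned above can be computed as follows under one common formulation (an assumption; the paper's exact definition may differ): among samples the SER classifier got right on clean audio, the fraction it gets wrong once the EON is mixed in.

```python
def evasion_success_rate(clean_preds, obf_preds, true_labels):
    """ESR: of the samples correctly classified on clean audio,
    the fraction misclassified after EON obfuscation."""
    correct, evaded = 0, 0
    for clean, obf, y in zip(clean_preds, obf_preds, true_labels):
        if clean == y:          # classifier was right on the clean sample
            correct += 1
            if obf != y:        # ...and wrong once the EON is mixed in
                evaded += 1
    return evaded / correct if correct else 0.0
```

Conditioning on clean-audio correctness ensures the metric credits the EON only for flips it caused, not for samples the classifier would have missed anyway.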
Discussions
• **Research Questions**: The experiments address whether the approach can deceive unseen black-box SER classifiers without compromising speech-to-text transcription, how it compares with state-of-the-art audio evasion techniques, and whether knowledgeable SER operators could defend against it. Real-world deployment scenarios involving off-the-shelf smart speakers, variable user locations, and SWaP (Size, Weight, and Power) constraints are also considered.
• **EON Generation Process**: "Factory default" generic EONs are digitally fine-tuned to users' speech over successive iterations/generations of the Genetic Programming (GP) approach. Final EON selection evaluates the fitness of candidate EONs at different loudness levels and picks the most suitable EON for the target environment.
• **Acoustic Evaluation Recordings**: EONs are pre-trained on a "canonical" dataset (RAVDESS) and subsequently fine-tuned with a target household's user data (TESS) to limit in-home training time. Collecting user recordings with smart speakers is challenging because the required APIs are not publicly available.
• **Evasion Success Metrics**: DARE-GP is evaluated using Evasion Success Rate (ESR) and False Label Independence, ensuring both the protection of emotional privacy and the utility of the modified audio samples.
• **Role of Datasets**: The RAVDESS and TESS datasets are central to training, tailoring, and evaluating EONs' effectiveness at obfuscating emotional content and evading Speech Emotion Recognition (SER) classifiers.
• **Real-World Deployment Considerations**: The discussion extends to deploying EONs in acoustic, real-world scenarios involving off-the-shelf smart speakers, variable user locations, and SWaP constraints, highlighting the practical implications of the methodology.
Strong Points
Limitations
Conclusion
Question & Answer
Components of a Sound Waveform
• 1. Frequency: This refers to how many times the wave repeats itself
per unit time, measured in Hertz (Hz). Higher frequencies
correspond to higher-pitched sounds, while lower frequencies
correspond to lower-pitched sounds. For instance, middle C on a
piano vibrates at approximately 261.6 Hz.
• 2. Amplitude: This refers to the height of the wave peaks, indicating
the sound's loudness or intensity. Larger amplitudes represent
louder sounds, while smaller amplitudes represent quieter sounds.
• 3. Timbre: This refers to the quality or "color" of the sound, which
distinguishes it from other sounds even at the same pitch and
loudness. Timbre is determined by the presence and relative
strengths of harmonics, which are additional frequencies related to
the fundamental frequency.
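The three components can be made concrete with a short numpy sketch: the FFT peak recovers the pitch (frequency), the waveform's peak magnitude its loudness (amplitude), and the harmonic mix its timbre. The harmonic strengths (1.0, 0.5, 0.25) are arbitrary illustrative choices:

```python
import numpy as np

sr = 8000                                   # sample rate (Hz)
t = np.linspace(0, 1.0, sr, endpoint=False)

# Middle C (~261.6 Hz) plus two harmonics; the relative harmonic
# strengths set the timbre of the resulting tone.
f0 = 261.6
wave = (1.00 * np.sin(2 * np.pi * 1 * f0 * t)    # fundamental: determines pitch
        + 0.50 * np.sin(2 * np.pi * 2 * f0 * t)  # 2nd harmonic
        + 0.25 * np.sin(2 * np.pi * 3 * f0 * t)) # 3rd harmonic

amplitude = np.max(np.abs(wave))            # loudness
spectrum = np.abs(np.fft.rfft(wave))
peak_hz = np.fft.rfftfreq(len(wave), 1 / sr)[np.argmax(spectrum)]  # pitch
```

Changing only the harmonic strengths leaves `peak_hz` (pitch) essentially unchanged while producing an audibly different timbre.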
About EON
• Universal spectral perturbations
• Generated by combining 𝑁 different tones, each
with a different frequency, amplitude, and
temporal variation.
• Mask the spectral attributes of speech that depict
emotional information
• EONs are non-invasive and can be played
simultaneously with users' speech, ensuring real-
time protection
Kernel SHAP (SHapley Additive exPlanations)