Who Are You (I Really Wanna Know)? Detecting Audio DeepFakes Through Vocal Tract Reconstruction

Logan Blue, Kevin Warren, Hadi Abdullah, Cassidy Gibson, Luis Vargas, Jessica O’Dell, Kevin Butler, and Patrick Traynor, University of Florida

https://www.usenix.org/conference/usenixsecurity22/presentation/blue

This paper is included in the Proceedings of the 31st USENIX Security Symposium.
August 10–12, 2022 • Boston, MA, USA
978-1-939133-31-1

Open access to the Proceedings of the 31st USENIX Security Symposium is sponsored by USENIX.
Who Are You (I Really Wanna Know)? Detecting Audio DeepFakes Through Vocal Tract Reconstruction

Logan Blue, Kevin Warren, Hadi Abdullah, Cassidy Gibson, Luis Vargas, Jessica O’Dell, Kevin Butler, Patrick Traynor

University of Florida, Gainesville, FL

Email: {bluel, kwarren9413, hadi10102, c.gibson, lfvargas14, odelljessica, butler, traynor}@ufl.edu

Abstract

Generative machine learning models have made convincing voice synthesis a reality. While such tools can be extremely useful in applications where people consent to their voices being cloned (e.g., patients losing the ability to speak, actors not wanting to have to redo dialog, etc.), they also allow for the creation of nonconsensual content known as deepfakes. This malicious audio is problematic not only because it can convincingly be used to impersonate arbitrary users, but because detecting deepfakes is challenging and generally requires knowledge of the specific deepfake generator. In this paper, we develop a new mechanism for detecting audio deepfakes using techniques from the field of articulatory phonetics. Specifically, we apply fluid dynamics to estimate the arrangement of the human vocal tract during speech generation and show that deepfakes often model impossible or highly unlikely anatomical arrangements. When parameterized to achieve 99.9% precision, our detection mechanism achieves a recall of 99.5%, correctly identifying all but one deepfake sample in our dataset. We then discuss the limitations of this approach, and how deepfake models fail to reproduce all aspects of speech equally. In so doing, we demonstrate that subtle but biologically constrained aspects of how humans generate speech are not captured by current models, and can therefore act as a powerful tool to detect audio deepfakes.

1 Introduction

The ability to generate synthetic human voices has long been a dream of scientists and engineers. Over the past 50 years, techniques have included comprehensive dictionaries of spoken words and formant synthesis models that create new sounds through the combination of frequencies. While such techniques have made important progress, their outputs are generally considered robotic-sounding and easily distinguishable from organic speech. Recent advances in generative machine-learning models have led to dramatic improvements in synthetic speech quality, with convincing voice reconstruction now available to groups including patients suffering from the loss of speech due to medical conditions and grieving family members of the recently deceased [1, 2].

While these speech models are a powerful and important enabler of communication, they also create significant problems for users who have not given their consent. Specifically, generative machine learning models now make it possible to create audio deepfakes, which allow an adversary to simulate a targeted individual speaking arbitrary phrases. While public individuals have long been impersonated, such tools make impersonation scalable, putting the general population at risk. Such attacks have reportedly been observed in the wild, including a company that allowed an attacker to instruct funds to be sent to them using generated audio of the victim company’s CEO’s voice [3]. In response, researchers have developed detection techniques using bispectral analysis (i.e., inconsistencies in the higher-order correlations in audio) [4, 5] and training machine learning models as discriminators [6]; however, both are highly dependent on the specific, previously observed generation techniques to be effective.

In this paper, we develop techniques to detect deepfake audio samples by relying solely on limitations of human speech that are the result of biological constraints. Specifically, we leverage research in articulatory phonetics to apply fluid dynamic models that estimate the arrangement of the human vocal tract during speech. Our analysis shows that deepfake audio samples are not fundamentally constrained in this fashion, resulting in vocal tract arrangements that are inconsistent with human anatomy. Our work demonstrates that this inconsistency is a reliable detector for deepfake audio samples. We make the following contributions:

• Identify inconsistent vocal tract behavior: Using a combination of fluid dynamics and articulatory phonetics, we identify the inconsistent behavior exhibited by deepfaked audio samples (e.g., unnatural vocal tract diameters). We develop a technique to estimate the vocal tract during speech to prove this phenomenon.

• Constructing a deepfake detector: After proving the existence of the phenomena, we construct a deepfake detector that is capable of detecting deepfake audio (Precision: 99.9%, Recall: 99.5%) from a large dataset we create using the Real-Time Voice Cloning generator [7]. Finally, we also demonstrate that entries from the ASVspoof2019 dataset are easily detectable in the pre-filtering portion of our mechanism due to a high word error rate in automatic transcription.

• Analysis of deepfake detector: We further analyze which vocal tract features and portions of speech cause the deepfakes to be detectable. From this analysis, we determine that on average our detector only requires a single sentence to detect a deepfake with a true positive rate (TPR) of 92.4%.

• Analysis of Potential Adaptive Adversaries: We conducted two large-scale experiments emulating both a naïve and an advanced adaptive adversary. Our experiments consist of training 28 different models and show that in the best case, an adaptive adversary faces greater than a 26x increase in training time, increasing the approximate time necessary to train the model to over 130 days.

We note that the lack of anatomical constraints is consistent across all deepfake techniques. Without modeling the anatomy or forcing the model to operate within these constraints, the likelihood that a model will learn a biologically appropriate representation of speech is near zero. Our detector, therefore, drastically reduces the number of possible models that can practically evade detection.

The paper is organized as follows: Section 2 provides context by discussing related work; Section 3 gives background on relevant topics used throughout this paper; Section 4 discusses our underlying hypothesis; Section 5 details our threat model; Section 6 explains our methodology and detection method; Section 7 describes our data collection and experimental design; Section 8 discusses the results of our experiments; Section 9 details the intricacies and consequences of our work; and Section 10 provides concluding remarks.

2 Related Work

Advances in Generative Adversarial Networks (GANs) have enabled the generation of synthetic “human-like” audio that is virtually indistinguishable from audio produced by a human speaker. In some cases, the high quality of GAN-generated audio has made it difficult to ascertain whether the audio heard (e.g., over a phone call) was organic [8]. This has enabled personalized services such as Google Assistant [9], Amazon Alexa [10], and Siri [11], which use GAN-generated audio to communicate with users. GANs can also be trained to impersonate a specific person’s audio; this kind of audio is known as a deepfake [12].

The dangerous applications of deepfake audio have spurred the need to automatically identify human audio samples from deepfakes. Some of the current work in this area has focused on identifying subtle spectral differences that are otherwise imperceptible to the human ear [4, 5]. In some cases, the deepfake audio will be played over a mechanical speaker, which will itself leak artifacts into the audio sample. These artifacts can be detected using a variety of techniques such as machine learning models [13, 14], additional hardware sensors [15], or spectral decomposition [16]. Researchers have also tried to detect these artifacts by using mobile devices. They use differences in the time-of-arrival in phoneme sequences, effectively turning the mobile devices into a Doppler Radar that could verify the audio source [17, 18]. These techniques fall within the category of liveness detection and have spawned major competitions such as the ASV Spoof Challenge [19]. However, these methods have certain limitations including the distance of the speaker from the recording microphone, accuracy, additional hardware requirements, and large training sets. Phonetics (the scientific study of speech sounds) is commonly used by language models for machine learning systems built for speech to text [20, 21] or speaker recognition [22]. Speech recognition and audio detection tools also use phonetics to increase their overall performance and capabilities [23, 24]. While articulatory phonetics is not commonly used in security, it has been used in past work, such as reconstructing encrypted VoIP calls by identifying phonemes [25].

Using concepts of articulatory phonetics, our work attempts to extract the physical characteristics of a speaker from a given audio sample; these characteristics would otherwise not be present in deepfake audio. Human or organic speech is created using a framework of muscles and ligaments around the vocal tract. The unique sound of each of our voices is directly tied to our anatomy [26]. This has enabled researchers to use voice samples of a speaker to extract the dimensions of their anatomical structures such as vocal tract length [27–31], age [32], or height [33, 34] of the speaker. These works attempt to derive an acoustical pipe configuration by modeling the human pharynx. This configuration can then be used as a proxy for the human anatomy to retrieve the physical characteristics of the speaker. Since deepfakes are generated using GANs, the physical dimensions are likely inconsistent. This inconsistency can be measured and can help differentiate between deepfake and human audio samples.

3 Background

3.1 Phonemes

Phonemes are the fundamental building blocks of speech. Each unique phoneme sound is a result of different configurations of the vocal tract components shown in Figure 1. Phonemes that comprise the English language are categorized into vowels, fricatives, stops, affricates, nasals, glides, and diphthongs (Table 1).

Figure 1: The vocal tract is composed of various components (palate, alveolar ridge, velum, oral cavity, lips, tongue, teeth, pharynx, glottis, and trachea) that act together to produce sounds. Distinct phonemes are articulated based on the path the air travels, which is determined by how the components are positioned.

Phoneme Type   Phoneme   Example
Vowel          /I/       ship
Fricative      /s/       sun
Stop           /g/       gate
Affricative    /tS/      church
Nasal          /n/       nice
Glide          /l/       lie
Diphthong      /eI/      wait

Table 1: English is composed of these seven categories of phonemes. Their pronunciation is dependent on the configuration of the various vocal tract components and the airflow that goes through it.

Figure 2: Deepfake generation has several stages to create a fake audio sample. The encoder generates an embedding of the speaker, the synthesizer creates a spectrogram for a targeted phrase using the speaker embedding, and the vocoder converts the spectrogram into the synthetic waveform.

Vowels (e.g., “/I/” in ship) are created using different arrangements of the tongue and jaw, which result in resonance chambers within the vocal tract. For a given vowel, these chambers produce frequencies known as formants whose relationship determines the actual sound. Vowels are the most commonly used phoneme type in the English language, making up approximately 38% of all phonemes [35]. Fricatives (e.g., “/s/” in sun) are generated by turbulent flow caused by a constriction in the airway, while stops (e.g., “/g/” in gate) are created by briefly halting and then quickly releasing the airflow in the vocal tract. Affricatives (e.g., “/tS/” in church) are a concatenation of a fricative with a stop. Nasals (e.g., “/n/” in nice) are created by forcing air through the nasal cavity and tend to be at a lower amplitude than the other phonemes. Glides (e.g., “/l/” in lie) act as a transition between different phonemes, and diphthongs (e.g., “/eI/” in wait) refer to the vowel sound that comes from the lips and tongue transitioning between two different vowel positions.

Phonemes alone do not encapsulate how humans speak. The transitions between two phonemes are also important for speech since it is a continuous process. Breaking speech down into pairs of phonemes (i.e., bigrams) preserves the individual information of each phoneme as well as the transitions between them. These bigrams generate a more accurate depiction of the vocal tract dynamics during the speech process.

3.2 Organic Speech

Human speech production results from the interactions between different anatomical components, such as the lungs, larynx (i.e., the vocal cords), and the articulators (e.g., the tongue, cheeks, lips), that work in conjunction to produce sound. The production of sound1 starts with the lungs forcing air through the vocal cords, which induces an acoustic resonance that contains the fundamental (lowest) frequency of a speaker’s voice. The resonating air then moves through the vocal cords and into the vocal tract (Figure 1). At this point, different configurations of the articulators (e.g., where the tongue is placed, how large the mouth is) shape the path for the air to flow, which creates constructive/destructive interference that produces the unique sounds of each phoneme.

1 This process is similar to how trumpets create a sound as air flows through various pipe configurations.

3.3 Deepfake Audio

Deepfakes are digitally produced speech samples that are intended to sound like a specific individual. Currently, deepfakes are produced via the use of machine learning (ML) algorithms. While there are numerous deepfake ML algorithms in existence, the overall framework the techniques are built on is similar. As shown in Figure 2, the framework is comprised of three stages: encoder, synthesizer, and vocoder.

Encoder: The encoder learns the unique representation of the speaker’s voice, known as the speaker embedding. These can be learned using a model architecture similar to that of speaker verification systems [36]. The embedding is derived from a short utterance using the target speaker’s voice. The accuracy of the embedding can be increased by giving the encoder more utterances, with diminishing returns. The output embedding from the encoder stage is passed as an input into the following synthesizer stage.

Synthesizer: A synthesizer generates a Mel Spectrogram



from a given text and the speaker embedding. A Mel Spectrogram is a spectrogram that has its frequencies scaled using the Mel scale, which is designed to model audio perception of the human ear. Some synthesizers can produce spectrograms solely from a sequence of characters or phonemes [37].

Vocoder: Lastly, the vocoder converts the Mel Spectrogram to retrieve the corresponding audio waveform. This newly generated audio waveform will ideally sound like a target individual uttering a specific sentence. A commonly used vocoder model is some variation of WaveNet [38], which uses a deep convolutional neural network that uses surrounding contextual information to generate its waveform.

Although the landscape of audio generation tools is ever-changing, these three stages are the foundational components of the generation pipeline. The uniqueness of each tool is derived mainly from the quality of models (one for each stage) and the exact design of their system architecture.
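To make the composition of these three stages concrete, the following is a minimal sketch of the encoder → synthesizer → vocoder flow described above; the class and method names are illustrative placeholders, not the API of RTVC or any other specific tool.

```python
class DeepfakePipeline:
    """Illustrative three-stage deepfake generation flow (hypothetical API)."""

    def __init__(self, encoder, synthesizer, vocoder):
        self.encoder = encoder          # target speaker audio -> speaker embedding
        self.synthesizer = synthesizer  # (text, embedding)    -> Mel spectrogram
        self.vocoder = vocoder          # Mel spectrogram      -> audio waveform

    def clone(self, reference_utterances, text):
        # 1. Learn the target speaker's embedding from short reference audio.
        embedding = self.encoder.embed(reference_utterances)
        # 2. Synthesize a Mel spectrogram of the attacker-chosen phrase.
        mel = self.synthesizer.synthesize(text, embedding)
        # 3. Convert the spectrogram into a playable waveform.
        return self.vocoder.to_waveform(mel)
```

The per-tool differences the text mentions correspond to swapping out the three components while keeping this overall interface.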
4 Hypothesis

Human-created speech is fundamentally bound to the anatomical structures that are used to generate it. Only certain arrangements of the vocal tract are physically possible for a speaker to create. The number of possible acoustic models that can accurately reflect both the anatomy and the acoustic waveform of a speaker is therefore limited. Alternatively, synthetic audio is not restricted by any physical structures during its generation. Therefore, an infinite set of acoustic models could have generated the synthetic audio. The details of this phenomenon will be discussed shortly in Section 6. It is highly improbable that models used to generate synthetic audio will mimic an acoustic model that is consistent with that of an organic speaker. As such, synthetic audio can be detected by modeling the acoustic behavior of a speaker’s vocal tract.

5 Security Model

Our security model consists of an adversary, a victim, and a defender. The goal of the adversary is to create a deepfake audio sample of the victim uttering a specific phrase. We assume a powerful adversary, one who has access to enough of the victim’s audio and enough computing power to generate a deepfake sample.

The adversary offers the defender either the deepfake or an organic audio sample. The defender is tasked with ascertaining whether the adversary-provided sample is deepfake or organic audio. If the defender makes the correct assessment, the adversary loses.

The defender does not have knowledge of, or audio data from, the victim the adversary will attempt to impersonate (i.e., no user-specific audio samples of the victim). The defender also has no knowledge of, or access to, the attacker’s audio generation algorithm. This is a stronger threat model than existing works in the area, which often use very large training data sets (on the order of thousands of audio samples) [6]. Lastly, we assume that the defender wants an explanation as to why their detection system flagged a sample as either deepfake or organic.

A practical example of this scenario is as follows. An adversary creates a deepfake of a local politician and releases it to the media to further some goal. The media is the defender in this scenario and must decide whether the adversary’s audio sample is authentic. Once the authenticity of the audio sample has been checked, the media can choose to either ignore or publish the audio. If the media publishes a synthetically generated audio sample, then the adversary has successfully victimized the politician by making them appear to have said something they did not. Additionally, any media outlet which publishes a synthetic audio sample could have its reputation damaged if it is later discovered that the audio sample was inauthentic. By leveraging our technique, the media outlet can avoid reporting inauthentic audio samples, thus preventing the loss of its reputation and the victimization of the politician.

6 Methodology

Our technique requires a training set of organic audio and a small set of deepfake audio samples generated by the deepfake algorithm.2 The process of determining the source of an audio sample (i.e., organic vs. deepfake) can then be broken down into two logical steps:

• Vocal Tract Feature Estimator: First, we construct a mathematical model of the speaker’s vocal tract based on the amplitudes of certain frequencies (commonly referred to as the frequency response) present in their voice during a specific pair of adjacent phonemes (i.e., a bigram). This model allows us to estimate the cross-sectional area of the vocal tract at various points along the speaker’s airway.

• Deepfake Audio Detector: Next, we aggregate the range of values found for each bigram-feature pair in our organic dataset. These values determine if the audio sample can be realistically produced by an organic speaker. This enables our system to discriminate between organic and deepfake samples. Additionally, we can isolate the bigram-feature pairs that best determine the source of an audio sample to create a set of ideal discriminators. These ideal discriminators allow us to optimally determine whether an unseen audio sample is a deepfake.

2 The training data is a general set that does not require samples of the specific victim that is being targeted. Additionally, the training data required for this technique is significantly less than the data required with prior ML-based techniques.



Figure 3: The sound produced by a phoneme is highly dependent on the structure of the vocal tract. Constriction made by tongue movement or jaw angle filters different frequencies. (The figure plots cross-sectional area against vocal tract position for the vowels in “who” (/hu/) and “has” (/hæz/), with the tongue, jaw, and lip positions marked A, B, and C.)
6.1 Reader Participation

Before we go further into the details of these two steps, we would like to help the reader develop a deeper intuition of phonemes and speech generation.

For speech, air must move from the lungs to the mouth while passing through various components of the vocal tract. To understand the intuition behind our technique, we invite the reader to speak out loud the words “who” (phonetically spelled “/hu/”) and “has” (phonetically spelled “/hæz/”) while paying close attention to how the mouth is positioned during the pronunciation of each vowel phoneme (i.e., “/u/” in “who” and “/æ/” in “has”).

Figure 3 shows how the components are arranged during the pronunciation of the vowel phonemes for each word mentioned above. Notice that during the pronunciation of the phoneme “/u/” in “who” the tongue compresses to the back of the mouth (i.e., away from the teeth) (A); at the same time, the lower jaw is held predominantly closed. The closed jaw position lifts the tongue so that it is closer to the roof of the mouth (B). Both of these movements create a specific pathway through which the air must flow as it leaves the mouth. Conversely, the vowel phoneme “/æ/” in “has” elongates the tongue into a more forward position (A) while the lower jaw distends, causing there to be more space between the tongue and the roof of the mouth. This tongue position results in a different path for the air to flow through, and thus creates a different sound. In addition to tongue and jaw movements, the position of the lips also differs for both phonemes. For “/u/”, the lips round to create a smaller, more circular opening (C). Alternatively, “/æ/” has the lips unrounded, leaving a larger, more elliptical opening. Just as with the tongue and jaw position, the shape of the lips also impacts the sound created.

One additional component that affects the sound of a phoneme is the other phonemes that are adjacent to it. For example, take the words “ball” (phonetically spelled “/bOl/”) and “thought” (phonetically spelled “/TOt/”). Both words contain the phoneme “/O/,” however the “/O/” in “thought” is affected by the adjacent phonemes differently than how the “/O/” in “ball” is. In particular, “thought” ends with the plosive “/t/” which requires a break in airflow, thus causing the speaker to abruptly end the “/O/” phoneme. In contrast, the “/O/” in “ball” is followed by the lateral approximant “/l/,” which does not require a break in airflow, leading the speaker to gradually transition between the two phonemes.

6.2 Vocal Tract Feature Estimator

Based on the intuition built in the previous subsection, our modeling technique needs to be able to extract the shape of the vocal tract present during the articulation of a specific bigram. To do this, we use a fluid dynamic concatenated tube model to estimate the speaker’s vocal tract that is similar to Rabiner et al.’s technique [27]. Before we go into the details of this model, it is important to discuss the assumptions the model makes.

• Lossless Model: Our model ignores energy losses that result from the fluid viscosity (i.e., the friction losses between molecules of the air), the elastic nature of the vocal tract (i.e., the cross-sectional area changing due to a change in internal pressure), and friction between the fluid and the walls of the vocal tract. Ignoring these energy losses will result in our model having acoustic dampening, causing the lower formant frequencies to increase in value3 and an increase in the bandwidth of all formant frequency spikes4. Additionally, we assume the walls of the vocal tract have an infinitely high acoustic impedance (i.e., sound can only exit the speaker from their mouth), which will result in our model missing trace amounts of low bass frequencies. Overall, these assumptions simplify the modeling process while decreasing the accuracy of our technique by a marginal amount, and are consistent with prior work [27].

3 This effect is mainly caused by the elastic nature of the vocal tract walls.
4 The viscosity and friction losses predominantly affect frequencies above 4 kHz [27].



• Unidirectional Traveling Waves: We assume that, within the model, we will only have traveling waves along the centerline of the tube. It stands to reason that this assumption is accurate enough for our model given the small diameter of our tubes (i.e., the vocal tract). This assumption should not affect our results since any error caused by this assumption will most likely occur in frequencies greater than 20 kHz (far above human speech). As we will discuss later in this section, our model is most accurate for lower frequencies and thus we only analyze frequencies beneath 5 kHz.5

• Vowel Limitation: The model used in this paper was only designed to accurately model vowel phonemes. Other phonemes are generated via fundamentally different and more difficult model mechanisms such as turbulent flow. Despite this, we apply the same model across all bigrams throughout this work for reasons discussed in Section 9.1.

5 It is worth noting that most information in human speech is found below 5 kHz. It is also the reason why cellular codecs, such as those used in GSM networks, filter out noise in higher frequencies [39].

Figure 4: The cross-sectional area of each pipe is calculated to characterize the speaker’s vocal tract. (The figure shows a series of concatenated tubes with cross-sectional areas A1 through A6 sharing a common length.)

Figure 5: In our model we must account for air waves being able to move in different directions within the vocal tract and anticipate how they interact with each other. (The schematic shows the positive and negative flows u0 and u1 being scaled by (1 + rk), (1 − rk), and ±rk at the junction of two pipes.)

Our concatenated tube model consists of a series of open pipe resonators that vary in diameter but share the same length. A simplified representation can be seen in Figure 4. To estimate the acoustics of an individual tube at a specific time during a bigram, we need to understand the behavior of pressure waves within the resonator. The simplest way to do this is to model the net volumetric flow rate of the fluid (i.e., the air in the vocal tract) within the resonator. We can model the acoustics of a resonator via the flow rate since the volumetric flow rate and the pressure (i.e., sound) within the resonator are directly related [27].

Modeling the interaction between two consecutive tubes is accomplished by balancing the volumetric inflows and outflows of the two tubes at their connection. Since the volumetric flow rate between two consecutive tubes must be equal, but the cross-sectional areas (and thus the volumes) may differ, there may exist a difference in fluid pressure between them. This pressure difference at the boundary results in a reflection coefficient, which affects the fluid flow rates between the two tubes. A schematic of the intersection between two tubes can be seen in Figure 5. Mathematically, the interactions between two consecutive pipes can be written as follows:

u_1^+ = u_0^+ (1 + r_k) + u_1^- (r_k)    (1)

u_0^- = u_1^- (1 - r_k) + u_0^+ (-r_k)    (2)

where u_0^+ and u_0^- are the forward and reverse volumetric flow rates in the left pipe, u_1^+ and u_1^- are the forward and reverse volumetric flow rates in the right pipe, and r_k is the reflection coefficient between the two consecutive pipes.

The reflection coefficient r_k can be expressed as follows:

r_k = (A_{k+1} - A_k) / (A_{k+1} + A_k)    (3)

where A_{k+1} is the cross-sectional area of the tube that is downstream (i.e., further from the pressure source) in the tube series and A_k is the cross-sectional area of the tube that is upstream (i.e., closer to the pressure source) in the tube series. It should be noted that r_k is mathematically bound between −1 and 1. This bounding represents the scenarios where either A_k or A_{k+1} is infinitely larger than the next pipe adjacent to it.

Between these three equations, we can fully describe a single intersection between two tubes. Our vocal tract model consists of various tubes with multiple intersections being concatenated to form a series. To model this, we need to expand these equations to incorporate additional tube segments and intersections. In particular, we need to incorporate N connected tubes with N − 1 intersections between them. The resulting differential equation is the transfer function of our N-segment tube series and, when simplified, is the following:

V(ω) = 0.5 (1 + r_G) [ ∏_{k=1}^{N} (1 + r_k) ] e^{-jωNL/C} / D(ω)    (4)

D(ω) = [1, -r_G] · M(r_1) · M(r_2) · ... · M(r_N) · [1, r_{Atm}]^T,
where M(r_k) = [[1, -r_k], [-r_k e^{-2jωL/C}, e^{-2jωL/C}]]    (5)

where r_G is the reflection coefficient at the glottis, r_1, ..., r_N are the reflection coefficients for every consecutive tube pair in the series, r_{Atm} is the reflection coefficient at the mouth, L is the length of each tube, C is the speed of sound (34,300 cm/s), j is the imaginary constant, and ω is the frequency of the waveform in rad/s. V(ω) is the volumetric flow rate at the lips during the pronunciation of a certain frequency, which is directly related to acoustic pressure (i.e., the amplitude of the voice at frequency ω). We separate out the denominator of Equation 4 into Equation 5 for increased readability.
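As a concrete illustration, a minimal numerical sketch of Equations 3–5 might look as follows. This is one reading of the equations, not the authors’ released code; the uniform-tube evaluation at the end is only a sanity check.

```python
import numpy as np

C = 34_300.0                 # speed of sound in cm/s, as used in the paper
FS = 16_000.0                # sampling rate in Hz
L = C / (2 * FS)             # tube length from Equation 6 (~1.07 cm)

def reflection_coefficients(areas):
    """Equation 3: r_k from consecutive cross-sectional areas (cm^2)."""
    a = np.asarray(areas, dtype=float)
    return (a[1:] - a[:-1]) / (a[1:] + a[:-1])

def transfer_function(r, freqs_hz, r_glottis=1.0, r_atm=1.0):
    """Equations 4 and 5: volumetric flow V(w) at the lips for each frequency."""
    r = np.asarray(r, dtype=float)
    n = len(r)
    v = np.empty(len(freqs_hz), dtype=complex)
    for i, f in enumerate(freqs_hz):
        w = 2 * np.pi * f
        z = np.exp(-2j * w * L / C)                    # per-tube delay term
        row = np.array([1.0, -r_glottis], dtype=complex)
        for rk in r:                                   # chain the 2x2 matrices of D(w)
            row = row @ np.array([[1.0, -rk],
                                  [-rk * z, z]], dtype=complex)
        d = row @ np.array([1.0, r_atm], dtype=complex)
        num = 0.5 * (1 + r_glottis) * np.prod(1 + r) * np.exp(-1j * w * n * L / C)
        v[i] = num / d
    return v

# Sanity check: a constant-diameter tract (all r_k = 0) over speech frequencies.
freqs = np.linspace(100, 5_000, 200)
response = np.abs(transfer_function(np.zeros(15), freqs))
```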



Figure 6: High-level overview of how the vocal tract feature estimator works. A speaker’s audio sample (a single window from a bigram) has its frequency data extracted using an FFT, the output of which is used as the target for our transfer function to reproduce. The transfer function is run twice over a range of frequencies ω_0, ..., ω_N. The first application of the transfer function uses the current reflection coefficients r_0, ..., r_N with a step size offset added to a single coefficient. The second application instead subtracts the step size offset from the same single coefficient. The estimated frequency response curve calculated for both series is subtracted from the target curve. Whichever reflection coefficient results in a lower area under the resulting curve will be selected for the next iteration. This process continues (applying the step size offset to all of the reflection coefficients) until the area under the subtracted curves approaches zero, indicating that we have found a reflection coefficient series that approximately replicates the original speaker’s vocal tract.

These equations together are a simplified representation of a system of 2N equations (N Equation 1’s and N Equation 2’s) that represents a series of N connected tube intersections (Figure 6). Since the volumetric flow rate through every tube within this series must be equal, we can simplify the 2N equations to Equations 4 and 5. We refer the reader to Rabiner et al.’s work for a full derivation of these equations [27].

It is important to note that this differential equation lacks a closed-form solution and thus, we must specify several boundary conditions before solving the equation. Specifically, we must fix the number of tubes used in the series (N) and the reflection coefficients at both the beginning (r_G) and end of the series (r_{Atm}). This helps to more closely bind our equation to the physical anatomy from which it is modeled.

We can determine the number of tubes necessary for our model by taking the average human vocal tract length (approximately 15.5 cm [40]) and dividing by the length of each tube. This length, L, can be determined by the following equation (the derivation of this equation can be found in Section 3.4.1 of Rabiner et al.’s work [27]):

L = TC / 2    (6)

where T is the period between samples in our audio recordings. In our study, all of our audio samples had a sampling rate of 16 kHz. This sampling rate was selected since it captures the most important frequencies for speech comprehension and is also the most commonly found sampling rate for voice-based systems [41]. By sampling at 16 kHz, our vocal tract model will be made up of 15 distinct pipe resonators.
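As a quick check of these values (using the paper’s C = 34,300 cm/s and 16 kHz sampling rate): L = TC/2 = (1/16,000 s × 34,300 cm/s) / 2 ≈ 1.07 cm, and 15.5 cm / 1.07 cm ≈ 14.5, which rounds to the 15 tube segments used above.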



Next, we can use our understanding of human anatomy to fix the first reflection coefficient in the series (r_G in Equation 5). This reflection coefficient represents the fluid reflection that occurs at the speaker’s glottis. During large portions of speech (e.g., during vowels) the glottis is actively being engaged. This means that the vocal folds are actively vibrating and thus preventing fluid flow in the reverse direction. With this in mind, we can set r_G to 1, symbolizing only fluid flow in the forward direction. Finally, the last reflection coefficient r_{Atm} represents the behavior of the flow at the opening of the mouth. Here, once again, we can expect to see predominantly positive flow. This is because, during speech, the vocal tract is raised to a higher than atmospheric pressure, preventing flow from moving from the atmosphere back into the vocal tract. We can, therefore, set the last reflection coefficient r_{Atm} equal to 1.

With these boundary conditions, we now have a solvable differential equation that describes the acoustic behavior of our concatenated tube model. Using this equation we are now able to accurately estimate the amplitude of a certain frequency ω during a bigram for a known speaker (that has a known r_0, ..., r_N series). However, in our case, we do not know the dimensions of the speaker’s vocal tract and cannot simply apply the transfer function. We do, however, have access to samples of the speaker’s voice. Thus, we can use these audio samples and our transfer function to estimate the speaker’s vocal tract during various articulations. The process of estimating a speaker’s vocal tract can be seen in Figure 6.

The estimation is done by running a segment of a speaker’s speech through a Fast Fourier Transform (FFT) to get the relative amplitudes of the frequencies that make up their voice. The resulting frequency response curve is effectively the output we would expect from the transfer function if we knew the speaker’s r_0, ..., r_N values. We can use the frequency response curve found with the FFT to check whether a certain r_0, ..., r_N series correctly matches our speaker. We can, therefore, find an accurate approximation of a speaker’s vocal tract by finding an r_0, ..., r_N series that accurately reproduces the speaker’s frequency response curve.

To avoid naïvely searching the entire r_0, ..., r_N space for a match, we can instead construct an error function that can be optimized with gradient descent to find a good solution. Since gradient descent searches for a local minimum, we subtract the outputs from our transfer function from the frequency response curve found using the FFT. The transfer function is initially run with all reflection coefficients r_0, ..., r_N set to zero. This is analogous to a constant diameter tube, which is a configuration achievable by the human vocal tract.6 We then integrate the resulting curve to find the overall error between the two curves. As the output of our transfer function approaches the frequency response curve, the area between the two curves will approach zero and result in a local minimum. The r_0, ..., r_N values used in the transfer function should approximate the speaker’s vocal tract during that bigram.

6 During the “/@/” phoneme the vocal tract roughly resembles a constant diameter tube.

Once we have found the optimal series of reflection coefficients, we can convert them into cross-sectional area estimates using Equation 3. This step requires us to make one last assumption about the vocal tract since there is one more cross-sectional area measurement than there are reflection coefficients (i.e., N − 1 tube intersections). To mitigate this, we set the cross-sectional area at the glottis to the average size of a human glottis of 3.7 cm² [40]. With this assumption, we can then calculate the cross-sectional area series a_0, ..., a_N that closely approximates the human vocal tract.
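A minimal sketch of this search and the final area conversion is shown below; `transfer_fn` stands in for an evaluation of Equations 4 and 5 (such as the earlier sketch), and the step size and iteration budget are illustrative assumptions rather than values reported by the authors.

```python
import numpy as np

def estimate_vocal_tract(target_mag, freqs_hz, transfer_fn,
                         n_tubes=15, step=0.05, iterations=100):
    """Fit reflection coefficients so the model's frequency response matches
    the FFT magnitude of one windowed bigram segment (see Figure 6)."""
    r = np.zeros(n_tubes)                      # start as a constant-diameter tube

    def error(candidate):
        model = np.abs(transfer_fn(candidate, freqs_hz))
        # area between the model curve and the target curve
        return np.trapz(np.abs(target_mag - model), freqs_hz)

    for _ in range(iterations):
        for k in range(n_tubes):               # perturb one coefficient at a time
            for trial_value in (r[k] + step, r[k] - step):
                if abs(trial_value) >= 1.0:    # r_k is bounded between -1 and 1
                    continue
                trial = r.copy()
                trial[k] = trial_value
                if error(trial) < error(r):
                    r = trial
    return r

def areas_from_reflections(r, glottis_area=3.7):
    """Invert Equation 3, anchoring the series at the average glottis area (cm^2)."""
    areas = [glottis_area]
    for rk in r:
        areas.append(areas[-1] * (1 + rk) / (1 - rk))
    return np.array(areas)
```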
6.3 Deepfake Audio Detector

Using the vocal tract estimator we can design a generalized detector for deepfake audio. Our detector has two phases. First, it extracts the acceptable organic ranges for bigram-feature pairs that describe organic speech. Next, the detector will use these acceptable organic ranges to classify whole samples of audio as either deepfake or organic. The associated code for this paper is available at https://github.com/blue-logan/who_are_you/.

6.3.1 Organic Range Extraction

The organic range extraction phase begins with the detector ingesting known organic audio samples. These audio samples also have associated metadata containing timestamps for both the words and individual phonemes that make up the sample. The phoneme metadata is then augmented to create the necessary bigram timing information. For this, we need to define which phonemes are considered to be adjacent to one another. We define two phonemes as being adjacent if they are both in the same word and occur one after the other. For example, the word cat (phonetically spelled “/kæt/”) contains two bigram pairs, “/k – æ/” and “/æ – t/”. We consider a bigram to begin at the start of the first phoneme and stop at the end of the second phoneme. The bigram timing information will later be associated with the features estimated from processing the audio.

Each bigram audio sample was divided using a sliding window of 565 samples with an overlap between windows of 115 samples. To find these values, we found the minimum and maximum duration of any bigram that existed in our feature extraction set (more detail on the feature extraction set is given in Section 7). We then selected sliding window parameters (565 samples per window with an overlap of 115 samples) that ensure that every bigram would have between three and seven windowed samplings taken from it. This ensured that we capture the temporal behavior of every bigram. Since speech is not discrete, each bigram captures the transitional behavior that exists as the speaker moves from the initial phoneme to the final phoneme. Windowing the audio allows us to examine individual stages of these transitions (e.g., beginning, middle, end). This forces deepfake audio samples to be generated correctly throughout the transition between phonemes in order to evade detection. Without windowing, it would be possible to evade detection by merely generating correct phonemes without a transition. Each windowed segment of audio is then passed through our vocal tract estimator and assigned a feature vector of 15 cross-sectional areas. The 15 different cross-sectional areas are estimates of the vocal tract at different points between the glottis and the opening of the oral cavity. Each windowed segment is then labeled with the word, bigram, and window index which corresponds to it.
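A sketch of this segmentation step is given below, using the window and overlap sizes stated above; the phoneme-timing tuple format is an assumption for illustration, not the metadata layout used by the authors.

```python
def word_bigrams(phonemes):
    """Pair adjacent phonemes within one word.
    `phonemes` is a list of (symbol, start_sample, end_sample) tuples, e.g.
    [("k", 0, 800), ("ae", 800, 2000), ("t", 2000, 2600)] for the word "cat"."""
    return [
        (phonemes[i][0] + "-" + phonemes[i + 1][0],  # bigram label, e.g. "k-ae"
         phonemes[i][1],                             # begins with the first phoneme
         phonemes[i + 1][2])                         # ends with the second phoneme
        for i in range(len(phonemes) - 1)
    ]

def bigram_windows(audio, start, end, width=565, overlap=115):
    """Slide a 565-sample window (115-sample overlap) across one bigram."""
    step = width - overlap
    segments = []
    pos = start
    while pos + width <= end:
        segments.append(audio[pos:pos + width])
        pos += step
    return segments
```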



Figure 7: The divergence of the cross-sectional area estimate distributions for each bigram can be used to help identify deepfake samples from organic ones. The plots show the distributions of the cross-sectional area estimates for the bigrams (a) “aI – k” as in “like” and (b) “aU – t” as in “out”. Because the distributions in (a) overlap, “aI – k” is a poor indicator of whether a sample is deepfake or not. In contrast, the distributions in (b) do not overlap at all. This means “aU – t” is a good indicator to differentiate between deepfake and organic audio samples. Our technique would then select a threshold value, such as 2.1 cm², that divided the two distributions. (Panels: (a) High Overlap, (b) No Overlap; x-axis: cross-sectional area estimate in cm².)

We are now able to determine which bigrams and features best differentiate organic and deepfake audio. This is done by finding divergences in the distributions of features in specific bigrams between deepfake and organic audio samples. In other words, we look for differences in the distribution of the cross-sectional area estimations found for organic and deepfake audio. The detection of a difference in the cross-sectional area distributions found for the two types of audio indicates that the deepfake audio is being created by an inorganic source. These divergences exist because the biological framework of the vocal tract limits organic speech, whereas GAN-generated audio is not limited in the same way. Thus, we are then able to identify deepfake audio samples from organic ones by looking for irregular (i.e., inorganic) cross-sectional area estimates. We, therefore, record the distributions of the cross-sectional area estimates extracted from the organic audio set to be used for future comparison. One distribution is recorded for each unique set of a bigram, window index, and vocal tract position.

6.3.2 Whole Sample Detection

The second phase of our detector is used to determine whether whole audio samples were GAN or organically generated. This phase begins similarly to the organic range extraction phase described in the previous section.

This phase begins by creating the necessary bigram timing information from the sample’s metadata. Next, it windows and evaluates the audio samples using the vocal tract estimator. Finally, it associates the estimated vocal tract features to specific bigrams and words just as in the ideal feature selection phase. At this point, our whole sample detection phase deviates from the organic range extraction phase.

Instead of calculating the cross-sectional area distributions for all bigram-feature pairs in the data, this phase checks whether every bigram-feature pair falls within the previously determined organic ranges. More specifically, we extract every bigram-feature pair from the sample that exists in both itself and the organic ranges (our set of organic ranges has no guarantee of containing all possible bigram-feature pairs). Next, each feature is compared against the maximum and minimum values found in the distribution of previously extracted organic samples. If the majority of the bigram-feature pairs found in the audio sample are outside the organic distributions, the audio sample is labeled as a deepfake.
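This whole-sample decision rule can be sketched as follows; the data layout (per-key value lists and per-key (min, max) organic ranges) is assumed for illustration.

```python
def classify_sample(sample_features, organic_ranges):
    """Label one audio sample by majority vote over its bigram-feature values.

    sample_features: dict mapping (bigram, window_index, tract_position)
        to a list of cross-sectional area estimates from the sample.
    organic_ranges: dict mapping the same keys to the (min, max) values
        observed in the known-organic training audio."""
    outside = inside = 0
    for key, values in sample_features.items():
        if key not in organic_ranges:      # only compare pairs seen in organic data
            continue
        low, high = organic_ranges[key]
        for value in values:
            if low <= value <= high:
                inside += 1
            else:
                outside += 1
    return "deepfake" if outside > inside else "organic"
```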
6.4 Detector Optimization

Although the previously described detector can differentiate between organic and deepfake audio samples, it is inefficient. Not all bigram-feature pairs act as effective discriminators since deepfake audio models might be accidentally learning the correct distribution for some bigram-feature values. This scenario is likely possible as these models do produce high-quality “human-like” audio. Thus, our detector is estimating large numbers of bigram-feature pairs that are unlikely to indicate the origin of the audio sample. To prevent this, we construct an ideal feature set that contains only bigram-feature pairs that act as strong indicators of the audio authenticity.

6.4.1 Ideal Feature Set Creation

To determine the ideal bigram-feature pairs that act as good discriminators, we initially follow the same procedures laid out in our organic range extraction phase. Namely, we extract the timing information from the sample’s metadata, window the audio, evaluate it using the vocal tract estimator, and construct an association between the vocal tract features and specific bigrams. We then plot the probability density function (PDF) (Figure 7) for each bigram-feature pair. The PDF represents the likelihood of the random variable, in this case the bigram-feature pair, having a certain value. If there is a large overlap between the PDF curves for organic and deepfake audio (Figure 7a), then that feature is a poor discriminator, which indicates that the model has learned the correct distribution of the bigram-feature pair. In contrast, if there is little-to-no overlap between the PDF curves (Figure 7b), then that bigram-feature pair is an ideal discriminator (i.e., it can be used to help differentiate between deepfake and organic audio).



Our set of ideal features consists of bigram-feature pairs that can differentiate between deepfake and organic audio samples with a precision-recall of at least 0.9. We determine these threshold values by a sweep through a range of potential values for each bigram-feature pair. This process continues until a threshold value (the threshold k, which is chosen on a per-feature basis) achieves the desired precision-recall values. This results in a bigram-feature-threshold triplet that we refer to as an ideal feature. Next, we weigh each bigram-feature pair to avoid outlying bigram-feature pairs that meet our requirements but only contain a relatively small number of samples. This weight is the number of samples used in our selection criteria calculations. We then filter our bigram-feature pairs using these weights so that only the pairs whose weight is equal to or greater than the average weight of the set are kept. The collection of all the resulting triplets is hereafter referred to as our ideal feature set.

It is worth noting that our thresholds singularly divide the PDF. That is, the thresholds in our ideal set will label a bigram-feature pair as organic only if it shares a certain relationship with its threshold (i.e., less than or greater than). Therefore it is possible to create two ideal feature sets, one where values below the thresholds are labeled as organic and one where values above the thresholds are labeled as organic. Unless otherwise stated, we will be referring to the ideal feature set as the set of thresholds where values less than the threshold are labeled as deepfake and values greater than or equal to the threshold are labeled as organic.
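The per-feature threshold sweep could be sketched roughly as follows; the candidate grid and tie-breaking are illustrative assumptions, while the ≥ 0.9 precision-recall requirement and the sample-count weight mirror the description above.

```python
import numpy as np

def pick_threshold(organic_values, deepfake_values, target=0.9):
    """Sweep candidate thresholds k for one bigram-feature pair. Returns
    (k, weight) if labeling values below k as deepfake reaches the target
    precision and recall, otherwise None."""
    organic = np.asarray(organic_values, dtype=float)
    fake = np.asarray(deepfake_values, dtype=float)
    low = min(organic.min(), fake.min())
    high = max(organic.max(), fake.max())
    for k in np.linspace(low, high, num=200):
        true_pos = np.sum(fake < k)          # deepfake values correctly flagged
        false_pos = np.sum(organic < k)      # organic values wrongly flagged
        if true_pos == 0:
            continue
        precision = true_pos / (true_pos + false_pos)
        recall = true_pos / len(fake)
        if precision >= target and recall >= target:
            weight = len(organic) + len(fake)   # samples behind this pair
            return k, weight
    return None
```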

6.4.2 Optimized Detector

Finally, we construct an optimized detector that only computes and analyzes bigram-features that have been shown to act as strong indicators (i.e., our ideal feature set). This detector will follow the same initial operation as our whole sample detector, decorating the audio data with its corresponding timing and phoneme data. However, unlike the whole sample detector, our optimized detector will only check the bigram-features that best indicate whether the audio sample is organic or deepfake. More specifically, we extract every feature from the sample that exists in both itself and the ideal feature set. For every one of these features, we compare the previously found threshold from the ideal feature set with the value found in the current sample. We count the number of times the values from the test audio samples cross the threshold. If more bigram-feature values cross the threshold than do not, we label the audio sample as a deepfake (i.e., majority voting).

7 Datasets

We now discuss the datasets that our technique was tested against as well as the process that was performed in generating deepfakes. For our transfer function, we use the TIMIT Acoustic-Phonetic Continuous Speech Corpus [42] as it is the standard in acoustic-phonetic studies and is manually verified by the National Institute of Standards and Technology (NIST).7,8

7 External factors such as face coverings during audio recording do not affect the features our technique needs to operate [43].
8 Additional experiments were conducted using the ASVspoof2019 dataset [44]. These can be found in Appendix A.

Organic Audio The TIMIT dataset is a speech corpus that is used in phonetic studies and is commonly used in the training of ML models for speech recognition systems [45]. Despite its age, the TIMIT dataset is widely used in research today with over 1,900 citations since 2015 according to Google Scholar. TIMIT provides documentation of the time alignments for the phonemes and words in each audio file, which is information that is essential for developing our modeling process. The TIMIT dataset is comprised of 630 speakers of 8 different American English dialects speaking phonetically balanced sentences [42]. Each speaker has 10 sentences recorded with a sampling rate of 16 kHz. For our experiments, we randomly sampled 300 speakers from the TIMIT dataset to have a similarly sized dataset as that of previous work [46].

For consistency, we also performed our time alignments using an open-source forced aligner called Gentle [47], which time-aligns the words of a transcription with the phonemes in an audio sample. Gentle is built on the Kaldi model [48], which is a toolkit frequently used for automatic speech recognition. We use Gentle since any audio samples outside the TIMIT dataset will not have the phoneme time-alignment information needed for extraction. By performing our time alignments on the TIMIT dataset, we can keep any error in alignments consistent across all samples (i.e., organic and deepfake).

Deepfake Audio We derived our own set of synthetic TIMIT audio samples using the open-source Real-Time-Voice-Cloning (RTVC) tool from Jemine [7, 49], the most widely used publicly available deepfake generation tool. RTVC is an implementation of Tacotron 2 by Liu et al., which uses Tacotron as the synthesizer and WaveNet as a vocoder [50]. For each of our 300 TIMIT speakers, we trained an RTVC model on a concatenation of all 10 TIMIT audio recordings (approximately 30 seconds). Each RTVC model was then used to create a deepfake version of every TIMIT sentence spoken by each speaker. In total, this creates 2,986 usable synthetic audio samples of our 300 original speakers. The 14 missing audio samples were too noisy for Gentle to process and were thus unable to be used in our experiments.

Additionally, we contacted several commercial companies with deepfake generation tools in an attempt to test our technique against other systems. Most of these companies never returned our requests to use their products in our research. The few companies that responded would only give us extremely limiting access to their product after purchase.


Their restrictions would have limited us to at most 5 different speakers compared to the 300 speakers present in the TIMIT deepfakes we generated. We, therefore, took the largest available of such datasets, published by Lyrebird [51], and evaluated it. We note that the generation of these samples is black box and represents a reasonable test against unknown models.

Figure 8: The bigrams found in the ideal feature set were not the most common bigrams found in speech. However, the bigrams in the ideal feature set still made up approximately 30.9% of bigrams in our dataset.

Feature Extraction and Evaluation Sets To evaluate and test our technique, we subdivided both the organic and deepfake TIMIT samples into a feature extraction set (51 speakers) and an evaluation set (249 speakers). The feature extraction set is used to determine the ideal bigram-feature pairs and their corresponding thresholds k using the ideal feature extractor outlined in Section 6.3.1. Conversely, the evaluation set is used to evaluate the efficacy of our technique. Both datasets contain all of the organic and deepfake audio samples for their respective speakers. Our security model (Section 5) assumes no knowledge of a speaker is known to the defender. As such, both sets were selected so that they did not share any speakers. This demonstrates that our technique is extracting useful features that are inherent to deepfake audio as a class, rather than features specific to the deepfake of an individual speaker. This captures a stronger threat model as we do not have any information about the speaker who will be impersonated.

The feature extraction set contains 1,020 audio files from 51 speakers, which contain a total of 702 unique bigrams. Of these, 510 audio files from 9 speakers are deepfake samples and 510 audio files are organic. The evaluation set consists of 4,966 audio files from 249 speakers, which contain 835 unique bigrams. Of these, 2,476 audio files are deepfake samples and 2,490 audio files are organic. It is important to note that our evaluation set is five times as large as our feature extraction set. We used a smaller feature extraction set and a larger evaluation set to showcase the efficiency of our technique. Traditional ML models require large datasets, orders of magnitude larger than what is used here, to learn from and capture data intricacies that improve the model’s generalization [6]. Generating large datasets of deepfakes can be difficult, inherently limiting the effectiveness of an ML-based detector. In contrast, our technique does not require a large dataset to learn from since we leverage the knowledge of human anatomy. As a result, our technique requires a significantly smaller dataset to learn from while still being able to generalize over a much larger evaluation set.

8 Evaluation

In this section, we discuss the performance of our deepfake detection technique and explain the results.

8.1 Detector Performance

We first need to find the ideal feature set using the process detailed in Section 6.3.1. The feature extraction dataset was used to find the set of ideal features, which consisted of 865 bigram-feature-threshold triples.

To evaluate the performance of our detector, we classified all the audio samples in the evaluation dataset. To do this, we concatenated all the sentences for each speaker together to form a single audio sample. We then ran each audio sample through our whole sample detection phase outlined in Section 6.3.2. Overall, we extracted and compared 12,525 bigram-feature pairs to the values found in our ideal feature set. Finally, our detector was able to achieve a 99.9% precision, a 99.5% recall, and a false-positive rate (FPR) of 2.5% using our ideal feature set.

8.2 Bigram Frequency Analysis

We now explain why the detector performed so well by analyzing the bigram results. The 865 bigram-feature pairs of the ideal feature set came from 253 distinct bigrams that had 3.4 features on average within the set. These bigrams make up approximately 30.9% of the 820 unique bigrams present in the TIMIT dataset we tested. Since TIMIT is a phonetically balanced dataset, it accurately represents the distribution of phonemes in spoken English. In Figure 8, we show the 30 most common bigrams in both the TIMIT dataset and our ideal feature set. While most of the bigrams in the ideal feature set are not in the top 30 bigrams, the total ideal feature set


still accounts for about 15.3% of the total bigram occurrences extracted from our evaluation set, implying that even though our ideal features are not the most common bigrams, they still account for a sizable portion of the speech. This makes selecting a phrase that does not contain multiple occurrences of the bigrams in our ideal feature set difficult for longer phrases, especially when considering most words are constructed from multiple bigrams. As such, an English sentence will likely contain bigrams that are included in our ideal feature set.

With this knowledge, we next explore the likelihood that our detector will misclassify a sentence. Figure 9 shows the PDF and histogram of the percentage of features labeled deepfake for every sentence in our dataset. The figure shows that most features evaluated in the deepfake samples are individually labeled as deepfakes, which informs us that classification is dependent on multiple features rather than a few prominent ones. This implies that an adversary’s model performance would need to increase by a considerable margin before the model could trivially beat our detection technique.

Figure 9: These distribution plots show the percentage of features classified as deepfakes per sentence. We can see that the majority of the time the decision to classify a sentence as a deepfake is not near the decision boundary of 0.5.

8.3 Fundamental Phenomena Confirmation

To observe the fundamental difference between deepfakes and organic audio that our detector is based around, we conducted an in-depth analysis of the incorrect behavior of the vocal tract estimates found for deepfake audio in a single phoneme (“/d – oU/”, pronounced doh). Figure 10 shows the estimated cross-sectional area for one of the bigrams from the ideal feature set. For comparison, we use a disjoint set of the TIMIT data to

having in a manner that is similar to the organically spoken data. The final segment of the figure (c) shows that the vocal tract estimates found for deepfake audio are approximately the size and shape of a drinking straw.

In addition to the bigram deep dive, we conducted a small-scale Principal Component Analysis (PCA) experiment, the results of which are visible in Figure 11. Our PCA experiment was conducted using all bigram pairs from the organic samples in the feature extraction set (labeled TIMIT Evaluation), the organic samples in the evaluation set (labeled TIMIT Testing), and the deepfake samples in the evaluation set (labeled Deepfake). We treated each feature vector as a point in a 15-dimensional space and then used PCA to reduce the data down to a single dimension that accounted for the most variance within the data. This single dimension accounts for approximately 48% of all the variance in the dataset. As shown in Figure 11, deepfake audio samples are much less variable than their organic counterparts, which demonstrates that the “drinking straw” vocal tract observed in the bigram deep dive is not an outlier, but rather more likely the norm.
account for a sizable portion of the speech. This makes select-
ing a phrase that does not contain multiple occurrences of the
bigrams in our ideal feature set difficult for longer phrases, es- 8.4 Transferability Experiments
pecially when considering most words are constructed from As discussed in Section 7, the availability of public deepfake
multiple bigrams. As such, an English sentence will likely datasets is limited, meaning that it was not possible to test
contain bigrams that are included in our ideal feature set. our technique against models at the same scale done using
With this knowledge, we next explore the likelihood that the RTVC tool. However, we were able to collect a limited
our detector will misclassify a sentence. Figure 9 shows the dataset from Lyrebird [51]. These nine synthetic audio sam-
PDF and histogram of the percentage of features labeled deep- ples of former presidents Obama and Trump were generated
fake for every sentence in our dataset. The figure shows that using tweeted messages and publicly released as marketing
most features evaluated in the deepfake samples are individu- material. While our dataset is small, the internal workings of
ally labeled as deepfakes, which informs us that classification the Lyrebird model are not public and thus we can perform
is dependent on multiple features rather than a few prominent a black box test. We collected eight true-negative sentence
ones. This implies that an adversary’s model performance samples from previous State of the Union addresses for both
would need to increase by a considerable margin before the speakers. In total, the synthetic audio samples contained 1,914
model could trivially beat our detection technique. bigram pairs representing 220 unique bigrams and the organic
audio samples contained 4,262 bigram pairs representing 270
8.3 Fundamental Phenomena Confirmation unique bigrams. The organic and synthetic audio samples had
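A minimal sketch of this PCA comparison is shown below; it assumes each bigram pair has already been reduced to its 15-dimensional feature vector, and the array names are placeholders rather than the exact data structures used in our pipeline.

```python
import numpy as np
from sklearn.decomposition import PCA

def first_component_spread(organic_vectors: np.ndarray,
                           deepfake_vectors: np.ndarray):
    """Project 15-D feature vectors onto the single direction of maximum variance
    (fit on all samples together) and compare the spread of each group."""
    all_vectors = np.vstack([organic_vectors, deepfake_vectors])
    pca = PCA(n_components=1)
    projected = pca.fit_transform(all_vectors).ravel()

    n_organic = len(organic_vectors)
    organic_proj, deepfake_proj = projected[:n_organic], projected[n_organic:]
    # explained_variance_ratio_[0] was roughly 0.48 in our experiment; a much
    # smaller deepfake standard deviation reproduces the effect seen in Figure 11.
    return pca.explained_variance_ratio_[0], organic_proj.std(), deepfake_proj.std()
```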
8.4 Transferability Experiments

As discussed in Section 7, the availability of public deepfake datasets is limited, meaning that it was not possible to test our technique against other models at the same scale as was done using the RTVC tool. However, we were able to collect a limited dataset from Lyrebird [51]. These nine synthetic audio samples of former Presidents Obama and Trump were generated using tweeted messages and publicly released as marketing material. While our dataset is small, the internal workings of the Lyrebird model are not public, and thus we can perform a black-box test. We collected eight true-negative sentence samples from previous State of the Union addresses for both speakers. In total, the synthetic audio samples contained 1,914 bigram pairs representing 220 unique bigrams, and the organic audio samples contained 4,262 bigram pairs representing 270 unique bigrams. The organic and synthetic audio samples had 136 bigrams in common.

We then evaluated these samples using the ideal features derived from our original dataset and found that they were unable to successfully detect the Lyrebird audio. Following this, we wanted to see if a different ideal feature set capable of detecting Lyrebird audio existed. We proceeded to use the organic and deepfaked audio samples of Presidents Trump and Obama to extract an ideal feature set using the process described in Section 6.3.1. After extraction, we found that every one of the 136 bigrams shared between the organic and synthetic audio sets would qualify as an ideal discriminator (i.e., an ideal feature). This means that while the ideal features from one deepfake model failed to transfer to another, our hypothesis that deepfake models fail to correctly mimic the acoustic behavior of the human vocal tract remains correct. Because of this, we were still able to detect every synthetic audio sample generated by Lyrebird by using an ideal feature set that was sensitive to the Lyrebird model.
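The ideal-feature selection step referenced here (Section 6.3.1) can be sketched as follows, under the simplifying assumption that a (bigram, feature) pair qualifies when every synthetic value falls outside the organic range observed during feature extraction; the exact selection criterion in our pipeline may differ.

```python
from collections import defaultdict

def extract_ideal_features(organic_pairs, synthetic_pairs):
    """organic_pairs / synthetic_pairs: iterables of (bigram, feature, value).
    Returns the (bigram, feature) keys that perfectly separate the two sets,
    along with the organic range recorded for each."""
    organic_values, synthetic_values = defaultdict(list), defaultdict(list)
    for bigram, feature, value in organic_pairs:
        organic_values[(bigram, feature)].append(value)
    for bigram, feature, value in synthetic_pairs:
        synthetic_values[(bigram, feature)].append(value)

    ideal = {}
    for key, org in organic_values.items():
        syn = synthetic_values.get(key)
        if not syn:
            continue  # a feature must appear in both sets to be usable
        low, high = min(org), max(org)
        if all(v < low or v > high for v in syn):
            ideal[key] = (low, high)
    return ideal
```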

[Figure 10 plot: box plots of the estimated cross-sectional area (cm²) across the 15 vocal tract dimensions for each dataset.]

Figure 10: a) The cross-sectional area estimates output by the transfer function for the bigram "/d – oU/," pronounced "doh." b) The approximate vocal tracts used to create each of the datasets. c) An anatomical approximation of a deepfaked model (bottom), which no longer represents a regular human vocal tract (top) and is instead approximately the dimensions of a drinking straw. This inconsistency is prevalent across more than 350 observed bigrams.
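The diameter comparison in panels (b) and (c) follows directly from the area estimates if each tube section is treated as roughly circular; a minimal sketch of that conversion:

```python
import math

def areas_to_diameters(areas_cm2):
    """Convert cross-sectional area estimates (cm^2) to approximate diameters (cm),
    assuming each tube section of the vocal tract model is roughly circular."""
    return [2.0 * math.sqrt(a / math.pi) for a in areas_cm2]

# A typical drinking straw is only about half a centimeter across, far narrower
# than the diameters recovered from organically produced speech.
```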

We believe that this indicates that the RTVC and Lyrebird deepfake generation models fail to mimic human acoustics in different ways. It therefore appears that the ideal features extracted from one deepfake generation model will not necessarily apply to other models. However, the lack of overlapping ideal features between models can potentially be circumvented by having a defender check all bigrams within an audio sample. This would allow a defender to practically check all possible ideal feature sets simultaneously. However, the thoroughness provided by checking all bigrams will result in a considerable increase in the processing time of the detector and would potentially require a different process for whole-sample detection than the one presented in Section 6.3.2. We leave further exploration of an all-bigram processing method to future work. To conclude, a defender who is concerned about detecting previously unknown deepfake generation models will likely not be able to benefit from the performance increases provided by the creation of an ideal feature set. Furthermore, they will likely need to rely on a metric other than majority voting when evaluating the set of all bigrams within the sample.

9 Discussion

9.1 Limitations

Acoustic Model  While our acoustic modeling can process all phonemes for a given speech sample, the pipe series is only anatomically correct for the vocal tract while the speaker is pronouncing a vowel. This means that our technique is less accurate when processing non-vowel phonemes. That being said, vowels make up 38% of all phonemes, meaning most bigrams should contain at least one vowel phoneme. Therefore, our use of bigrams also helps to minimize the number of samples that need to be processed.

Preprocessing  During the preprocessing stage of our pipeline, we use Gentle to automatically timestamp the audio files according to their words and phonemes. Gentle requires sample transcriptions, which we generate using the Google Speech API. Thus, the accuracy of the timestamps (and of the following stages of the pipeline) is directly tied to the accuracy of Gentle and the Google Speech API. While some phonemes are only a few milliseconds long, Gentle's precision is to the nearest hundredth of a second. This forces Gentle to overestimate the timestamps for short phonemes, which introduces rounding errors. The use of bigrams helps to mitigate this problem, since using pairs gives us more appropriate target lengths for Gentle's precision levels.

The noisiness of synthetically generated audio can also cause mistranscriptions in the Google Speech API. However, the mistranscriptions are usually phonetically similar to the correct ones. As a result, Gentle's timestamps will contain little error. This limits any major impact that a mistranscription could have on our results.
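To illustrate how these timestamps are consumed downstream, the sketch below pairs phoneme timestamps into bigram segments. The dictionary layout is an assumption made for the example rather than Gentle's exact output schema, and the vowel filter is an illustrative choice reflecting the accuracy discussion above.

```python
VOWELS = {"aa", "ae", "ah", "ao", "aw", "ay", "eh", "er",
          "ey", "ih", "iy", "ow", "oy", "uh", "uw"}  # ARPAbet-style labels

def bigram_segments(phones):
    """phones: ordered list of {"phone": str, "start": float, "end": float}.
    Returns (bigram_label, start, end) tuples for adjacent phoneme pairs,
    keeping only pairs that contain at least one vowel, where the pipe-series
    model is anatomically meaningful."""
    segments = []
    for first, second in zip(phones, phones[1:]):
        label = f"{first['phone']}-{second['phone']}"
        has_vowel = any(p["phone"].rstrip("0123456789") in VOWELS
                        for p in (first, second))
        if has_vowel:
            segments.append((label, first["start"], second["end"]))
    return segments
```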

Data Access  There are not many large, publicly available corpora of deepfake audio samples or generation tools. While we would have liked to test our technique against a larger variety of samples, this was not possible. Our dataset is limited to the data and tools that are currently publicly available. Startups in deepfake generation have proprietary algorithms that are either not available for purchase or are otherwise inaccessible for use in academic research. Despite these issues, our technique appears to be generalizable and does not target any specific deepfake audio generator, although additional research is needed to verify this.

Figure 11: The first PCA dimension shows that deepfake audio samples fundamentally lack the same amount of variability found in organic speech.

9.2 Deployment Considerations

Our optimized detector leverages a one-time preprocessing step to amortize the cost of processing individual audio samples later. We preprocess each bigram during the feature extraction phase to determine the acceptable organic ranges and the ideal features (e.g., a 60-bigram sample would take approximately 15 seconds to process). However, during the evaluation phase, which simulates our detection process, the preprocessing is performed only on a discriminative set of bigrams (i.e., the ideal features extracted during feature extraction). By greedily selecting the ideal features, the preprocessing time for similarly sized samples during evaluation is reduced by approximately an order of magnitude, allowing us to perform detection in real time. For example, the average sentence in the TIMIT dataset contains 25 bigrams over 2.7 seconds of audio. If our technique processed every bigram in the sentence, it would take approximately 6.25 seconds to complete. However, as discussed in Section 8.2, our ideal feature set makes up approximately 15.3% of bigram occurrences. This means that by only processing the ideal features, our technique could evaluate the 2.7 seconds of audio in approximately 1 second. Other techniques that use audio analysis (e.g., Pindr0p [52]) require on the order of 15 seconds to work and are therefore not usually performed in real time.
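The arithmetic behind this estimate can be made explicit. The sketch below assumes a uniform per-bigram cost derived from the feature extraction measurement above (60 bigrams in roughly 15 seconds) and the 15.3% ideal-feature coverage from Section 8.2.

```python
PER_BIGRAM_SECONDS = 15.0 / 60      # ~0.25 s per bigram, from the preprocessing measurement
IDEAL_FEATURE_FRACTION = 0.153      # share of bigram occurrences covered by ideal features

def estimated_processing_seconds(num_bigrams: int, ideal_only: bool = True) -> float:
    """Rough evaluation cost for one sentence."""
    bigrams = num_bigrams * (IDEAL_FEATURE_FRACTION if ideal_only else 1.0)
    return bigrams * PER_BIGRAM_SECONDS

# Average TIMIT sentence: 25 bigrams over ~2.7 s of audio.
print(estimated_processing_seconds(25, ideal_only=False))  # ~6.25 s when every bigram is processed
print(estimated_processing_seconds(25, ideal_only=True))   # ~1 s when only ideal features are processed
```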
9.3 Performance Trade-off

Our optimized technique has a 99.9% precision rate, resulting in a minor decrease in recall to 99.5%. A high precision rate ensures that deepfake audio is not accidentally labeled as organic by our system. This is specifically designed to protect the victim of a deepfake attack. It is far more dangerous for deepfake audio to be believed as real than the converse. For example, it is a greater threat to democracy if the population believes that a deepfake audio of a politician making incendiary remarks is real.

Furthermore, to achieve such a high precision rate, our detector must also sacrifice some false-positive performance. The achieved FPR of 2.5% could be seen as higher than ideal for automated systems that process hundreds or thousands of audio samples per day. However, we believe this trade-off is still better overall. If an organic sample is falsely identified as a deepfake, it is trivial for the original speaker to verify the authenticity of the sample if they choose. However, a malicious speaker could use the false positive as an opportunity to disassociate from a comment they had previously made. While ideally our technique could prevent this kind of disassociation, we believe that allowing an individual to disassociate from a previous comment is less of a security risk than allowing a deepfake to go undetected. For this reason, we accept our FPR of 2.5% as an acceptable value.

9.4 Advantages over other techniques

Our technique offers several distinct advantages over some techniques in the current literature. Researchers have explored the use of Deep Neural Networks (DNNs) for detecting deepfake audio [6]. These require large training datasets (thousands of audio samples), which is extremely limiting, as generating large amounts of deepfake audio data is not easy. If the training data is not large enough to capture the full distribution, the trained DNN will fail to generalize. As a result, the DNN will perform poorly on the test set. Our method only requires a few dozen audio samples and can generalize to a much larger evaluation set. Also, since DNNs are black boxes, they do not provide explanations for the predicted labels. On the other hand, our method leverages a deep understanding of human anatomy to explain the predicted labels.

9.5 Robustness against Adaptive Adversaries

Attacks and defenses are in a constant arms race. An attacker with knowledge of our technique may try to adapt their attack to defeat it. We explore two different approaches an adaptive adversary could use to evade our detector. The first approach follows the general best practice described by Tramèr et al. Ideally, a defense should only be attacked end-to-end (i.e., used as a loss function) if the entire defense is differentiable and an inspection of the technique reveals that the defense is not likely to fail due to the exploitation of one or two subcomponents. However, our technique is not end-to-end differentiable due to Eqns. 4 and 5. In line with Tramèr et al.'s work, however, an adversary can still attempt to evade our technique by identifying the critical subcomponent of the model that needs to be exploited. In the case of our technique, this subcomponent is the underlying vocal tract transfer function represented by Eqns. 4 and 5.

Figure 12: While attempting to overfit to a small dataset, the best models were only able to achieve loss values of approximately 10⁻¹. This is several orders of magnitude greater than what would be expected if the models were well suited to learning our technique's mapping. This indicates that these models struggle to mimic our technique and that significant domain-specific knowledge is needed to overcome our defense. Additional information on which model is represented by each line can be found in Table 2.

To exploit this segment, an adversary needs to learn the mapping our technique makes between the Fourier Transform of a bigram's audio and the 15 cross-sectional area estimates of the vocal tract. Simply put, an adversary would have to train an ML model to predict the cross-sectional areas of a speaker's vocal tract from the frequencies present in the audio sample. By minimizing the error between the model's predictions and the extracted vocal tract estimates, the model would attempt to learn a mapping similar to that of our technique. This trained model would then be used to generate adversarial audio, using a technique such as a Generative Adversarial Network, to evade detection.

To measure the difficulty of this task, we constructed a naïve adversary that uses out-of-the-box Deep Learning models to mimic our technique. Before training on a full dataset, we took an initial step and trained on a small sample set (i.e., 16 random audio samples) to try to overfit the model and reach near-zero loss (e.g., around 10⁻⁵ [53]). Doing so would suggest that the models are indeed learning some mapping between the frequencies in the audio and the 15-dimensional output, indicating that training on a larger dataset would generalize better. However, a non-near-zero loss suggests that, even with a small sample set, the model struggles to find a correct mapping. To further simplify the problem, we also attempted to train the models to output only the first cross-sectional area estimate, rather than all 15.

As shown in Figure 12, the loss (calculated as mean absolute error) of a wide range of models,⁹ detailed in Table 2, remained flat and did not converge to near zero after training for multiple epochs. Our best models had their loss values converge to between 10⁰ and 10⁻¹, putting them four to five orders of magnitude greater than standard practice [53]. Thus, we believe that a naïve adversary would be unsuccessful in evading our defense mechanism, and our results suggest that generalizing our technique using ML is non-trivial and requires significant domain-specific knowledge.

⁹While there are other architectures to consider (e.g., RNNs, LSTMs, and transformers), those architectures tend to rely on temporal dependencies. Since our input was the Fourier Transform of a bigram, which lacks temporal data, testing those models would not be appropriate.
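A minimal sketch of this overfitting probe is shown below. It assumes the inputs are fixed-length Fourier-transform magnitude vectors for each bigram and the targets are the 15 cross-sectional area estimates; the architecture and hyperparameters are placeholders rather than any specific model from Table 2.

```python
import torch
import torch.nn as nn

# 16 random (spectrum, vocal tract estimate) pairs, as in the overfitting probe.
N_SAMPLES, N_FFT_BINS, N_AREAS = 16, 512, 15
x = torch.randn(N_SAMPLES, N_FFT_BINS)   # stand-in for real bigram spectra
y = torch.rand(N_SAMPLES, N_AREAS)        # stand-in for real area estimates (cm^2)

model = nn.Sequential(nn.Linear(N_FFT_BINS, 256), nn.ReLU(),
                      nn.Linear(256, 256), nn.ReLU(),
                      nn.Linear(256, N_AREAS))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.L1Loss()                     # mean absolute error, as reported in Figure 12

for step in range(5000):
    optimizer.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    optimizer.step()

# A model that truly captured the mapping should drive this toward ~1e-5 on only
# 16 samples; the models we tested plateaued around 1e0 to 1e-1 instead.
print(f"final training loss: {loss.item():.2e}")
```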

The second adaptive adversary approach leverages part of the loss function during deepfake generator training. As discussed in Section 3.3, deepfake generators comprise three stages: the encoder, the synthesizer, and the vocoder. We chose to modify the vocoder training process, as it is the final step and is responsible for creating the life-like audio.

For this test, we modified the training process of the WaveRNN [54] vocoder model used by RTVC [7]. These modifications can be seen in Figure 13. The WaveRNN model's original loss function (b) is calculated on mini-batches of 32 single frames of training audio at a time. However, these single frames of audio are not full audio samples and therefore do not contain any of the full bigrams our detection method uses. Thus, during every mini-batch, we used the model to generate a single complete audio sample. This sample is then run through the transcription and phoneme aligner discussed in Section 7 (c & d). We then create the necessary metadata (e), which is passed along with the audio to the vocal tract estimator (f). The distance between the vocal tract estimates for the model's generated audio and a precalculated set of organic vocal tracts (h) is then computed (g) to find the relative loss incurred by the vocal tract analysis. This loss value is then added to the original WaveRNN loss. Training then proceeds as normal from there.

Figure 13: The modifications made to the WaveRNN model (a) to integrate our vocal tract estimation technique as a loss function. During every training batch, the model now generates a full audio sample, transcribes it to text (c), and aligns the necessary metadata (d, e) before estimating the vocal tract (f) and calculating the effective error this incurs (g) relative to a set of pre-calculated organic vocal tracts (h). This error is then combined with the original loss used by WaveRNN (b).

In addition, we made a major optimization to the training process. Early in the training phase, the model will not output transcribable audio samples. Thus, our technique cannot estimate the vocal tract of the audio. In this case, we simply set the component of the loss derived from the vocal tract estimation to zero and prevent any related calculations from occurring (d, e, f, or g). However, the model still needs to generate a complete audio sample to determine whether it is capable of creating transcribable audio.
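A simplified sketch of how the additional term is combined with WaveRNN's loss during a mini-batch is shown below. The generation, alignment, and vocal tract estimation helpers (steps (c) through (h) in Figure 13) are assumptions standing in for our actual components, and the weighting of the new term is illustrative.

```python
import torch

def training_step(model, batch, original_loss_fn, generate_full_sample,
                  estimate_vocal_tract, organic_vocal_tracts, vt_weight=1.0):
    """One mini-batch with the vocal tract penalty added to WaveRNN's loss.
    estimate_vocal_tract() is assumed to return None while the generated audio
    is not yet transcribable, in which case no penalty is applied."""
    # (b) Original WaveRNN loss over the mini-batch of single frames.
    loss = original_loss_fn(model, batch)

    # (a, c-f) Generate one complete utterance and estimate its vocal tract.
    audio = generate_full_sample(model)
    estimates = estimate_vocal_tract(audio)

    # (g, h) Average deviation from the precalculated organic vocal tracts.
    if estimates is not None:
        deviations = torch.stack([(estimates - organic).abs().mean()
                                  for organic in organic_vocal_tracts])
        loss = loss + vt_weight * deviations.mean()
    return loss
```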
All testing was conducted on a local desktop machine running a 6th-generation Intel i7-6770HQ with 32 GB of RAM and a GTX 1080 with 8 GB of VRAM. The WaveRNN model was constructed using PyTorch 1.9.1 and CUDA 11.4.

We used the WaveRNN model's default parameters while training on all 4,622 samples of the TIMIT corpus as a baseline. The model defaulted to a learning rate of 0.0001, 1,000,000 steps, and a batch size of 32, and would complete 6,994 epochs. This is similar to the pretrained model included in RTVC, which was trained for 1,159,000 steps at a batch size of 50. In this configuration, the unedited model was able to complete a batch every 0.50 seconds and an epoch every 71 seconds. The model would require approximately 5.75 days to train.

After the inclusion of our new loss function, we saw a considerable slowdown. During the training phases before the model produces transcribable audio, the model was able to process a single batch every 8.86 seconds and an epoch every 1,267 seconds (about 21 minutes), a slowdown of 17.8x. This slowdown was a result of the model having to generate whole audio samples at every epoch, regardless of audio transcription. In this state, the model would take a minimum of approximately 100 days to complete ∼90% of its training. This estimate does not include the additional time necessary for running the vocal tract estimator. We are likely to see a further slowdown at the end of model training, when the full vocal tract estimator needs to run. At this stage, we assumed that the adversary has already calculated the vocal tracts for a large number of existing organic audio samples to use as ground truth, to avoid having to do the vocal tract estimation twice per batch. For TIMIT, this process would take approximately 77 hours, but it only needs to be completed once. Once the vocal tract estimation loss function was fully engaged, the model was able to complete one batch every 31.52 seconds and one epoch every 1.25 hours. Assuming the model will only produce transcribable audio for the final 10% of its total epochs, the model would complete this phase of training in approximately 36.4 days. Overall, when the vocal tract estimation loss function is fully engaged, model training is slowed by approximately a factor of 63x. In total, training the WaveRNN model using our vocal tract estimator in this way would require approximately 130 days. Furthermore, since adversarial samples do not transfer across RNN-based models [41], this adversarial sample could not be used against any other iteration of our system. Therefore, an attacker would need to spend 130 days for every single audio sample, making this attack completely impractical.

10 Conclusion

Deepfake audio generators now enable attackers to impersonate any person of their choosing. Existing techniques to detect deepfake audio often require knowledge of the specific generator. In our work, we present a novel detection mechanism that is independent of any generator. Our method leverages knowledge of human anatomy, fluid dynamics, and the articulatory system to detect deepfake audio samples with a precision of 99.9% and a recall of 99.5%. In doing so, our work presents a unique lens through which to view the problem of deepfake detection – one that is explainable, generalizable, and free of the limitations of other ML-based approaches.

Acknowledgments

The authors thank our anonymous reviewers and our shepherd, Stjepan Picek, for their valuable comments and suggestions. This work was supported in part by the Office of Naval Research under grant number ONR-OTA N00014-21-1-2658. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the Office of Naval Research.

References

[1] T. Mills, H. T. Bunnell, and R. Patel, "Towards Personalized Speech Synthesis for Augmentative and Alternative Comm." Augmentative and Alternative Communication, 2014.
[2] J. M. Costello, "Message Banking, Voice Banking and Legacy Messages," Boston Children's Hospital - https://www.childrenshospital.org/~/media/centers-and-services/programs/a_e/augmentative-communication-program/messagebankdefinitionsandvocab201613.ashx?la=en, 2016.
[3] C. Stupp, "Fraudsters Used AI to Mimic CEO's Voice in Unusual Crime," Wall Street Journal, 2019.
[4] E. A. AlBadawy, S. Lyu, and H. Farid, "Detecting AI-Synthesized Speech Using Bispectral Analysis," in CVPR Workshops, 2019.
[5] H. Malik, "Securing Voice-driven Interfaces against Fake (Cloned) Audio Attacks," in IEEE Conference on Multimedia Information Processing and Retrieval (MIPR), 2019.
[6] M. Alzantot, Z. Wang, and M. B. Srivastava, "Deep Residual Neural Networks for Audio Spoofing Detection," arXiv:1907.00501, 2019.
[7] C. Jemine, "Real-Time Voice Cloning," https://github.com/CorentinJ/Real-Time-Voice-Cloning, 2019.
[8] J. Saunders, "Detecting Deep Fakes With Mice: Machines vs Biology," 2019.
[9] "Google Home," https://madeby.google.com/home/, 2017.
[10] "Amazon Alexa Line," https://www.amazon.com/Amazon-Echo-And-Alexa-Devices/b?ie=UTF8&node=9818047011, 2017.

[11] “Apple Siri,” https://www.apple.com/ios/siri/, 2017. [31] S. Flego, “Estimating vocal tract length by minimizing non-uniformity
[12] J. Lorenzo-Trueba, F. Fang, X. Wang, I. Echizen, J. Yamagishi, and of cross-sectional area,” in Proceedings of Meetings on Acoustics.
T. Kinnunen, “Can we steal your vocal identity from the Internet?: ASA, 2018.
Initial investigation of cloning Obama’s voice using GAN, WaveNet [32] S. Skoog Waller and M. Eriksson, “Vocal Age Disguise: The Role of
and low-quality found data,” arXiv:1803.00860, 2018. Fundamental Frequency and Speech Rate and its Perceived Effects,”
[13] Y. Wang, W. Cai, T. Gu, W. Shao, Y. Li, and Y. Yu, “Secure Your Frontiers in psychology, 2016.
Voice: An Oral Airflow-Based Continuous Liveness Detection for Voice [33] H. Cao, Y. Wang, and J. Kong, “Correlations between body heights
Assistants,” Proceedings of the ACM on Interactive, Mobile, Wearable and formant frequencies in young male speakers: a pilot study,” The
and Ubiquitous Technologies, 2019. 9th International Symposium on Chinese Spoken Language Processing,
[14] Q. Wang, X. Lin, M. Zhou, Y. Chen, C. Wang, Q. Li, and X. Luo, “Voice- 2014.
pop: A Pop Noise based Anti-spoofing System for Voice Authentication [34] J. H. Hansen, K. Williams, and H. Bořil, “Speaker height estimation
on Smartphones,” in IEEE Conference on Computer Communications, from speech: Fusing spectral regression and statistical acoustic models,”
2019. The Journal of the Acoustical Society of America, 2015.
[15] C. Wang, S. A. Anand, J. Liu, P. Walker, Y. Chen, and N. Saxena, [35] R. E. Hayden, “The relative frequency of phonemes in General-
“Defeating Hidden Audio Channel Attacks on Voice Assistants via American English,” Word, 1950.
Audio-Induced Surface Vibrations,” in Proceedings of the Annual Com-
puter Security Applications Conference, 2019. [36] L. Wan, Q. Wang, A. Papir, and I. L. Moreno, “Generalized End-to-End
Loss for Speaker Verification,” in IEEE International Conference on
[16] L. Blue, L. Vargas, and P. Traynor, “Hello, Is It Me You’re Looking Acoustics, Speech and Signal Processing (ICASSP), 2018.
For? Differentiating Between Human and Electronic Speakers for Voice
Interface Security,” in Proceedings of the ACM Conference on Security [37] Y. Wang, R. Skerry-Ryan, D. Stanton, Y. Wu, R. J. Weiss, N. Jaitly,
& Privacy in Wireless and Mobile Networks, 2018. Z. Yang, Y. Xiao, Z. Chen, S. Bengio et al., “Tacotron: Towards end-to-
end speech synthesis,” arXiv:1703.10135, 2017.
[17] L. Zhang, S. Tan, J. Yang, and Y. Chen, “VoiceLive: A Phoneme Local-
ization Based Liveness Detection for Voice Authentication on Smart- [38] A. v. d. Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals, A. Graves,
phones,” in Proceedings of the ACM SIGSAC Conference on Computer N. Kalchbrenner, A. Senior, and K. Kavukcuoglu, “Wavenet: A genera-
and Communications Security, 2016. tive model for raw audio,” arXiv:1609.03499, 2016.
[18] L. Zhang, S. Tan, and J. Yang, “Hearing Your Voice is Not Enough: An [39] B. Reaves, L. Blue, and P. Traynor, “Authloop: End-to-end crypto-
Articulatory Gesture Based Liveness Detection for Voice Authentica- graphic authentication for telephony over voice channels,” in USENIX
tion,” 2017. Security Symposium, 2016.
[19] M. Todisco, X. Wang, V. Vestman, M. Sahidullah, H. Delgado, [40] K. N. Stevens, Acoustic phonetics. MIT press, 2000.
A. Nautsch, J. Yamagishi, N. Evans, T. Kinnunen, and K. A. Lee,
[41] H. Abdullah, K. Warren, V. Bindschaedler, N. Papernot, and P. Traynor,
“Asvspoof 2019: Future horizons in spoofed and fake audio detection,”
“SoK: The Faults in our ASRs: An Overview of Attacks against Au-
arXiv:1904.05441, 2019.
tomatic Speech Recognition and Speaker Identification Systems,” in
[20] S. Naren, “Speech Recognition using DeepSpeech-2,” Last accessed in IEEE Symposium on Security and Privacy (S&P), 2021.
2019, https://github.com/SeanNaren/deepspeech.pytorch.
[42] J. S. Garofolo, L. F. Lamel, W. M. Fisher, J. G. Fiscus, and D. S. Pallett,
[21] “Google Cloud Speech-to-Text API,” Last accessed in 2019, https: “DARPA TIMIT acoustic-phonetic continous speech corpus CD-ROM.
//cloud.google.com/speech-to-text/. NIST speech disc 1-1.1,” NASA STI/Recon technical report n, 1993.
[22] “Azure Speaker Verification API,” Last accessed in 2019, available [43] D. D. Nguyen, P. McCabe, D. Thomas, A. Purcell, M. Doble,
at https://azure.microsoft.com/en-us/services/cognitive-servic/speaker- D. Novakovic, A. Chacon, and C. Madill, “Acoustic voice
recognition/. characteristics with and without wearing a facemask,” Scientific
[23] W. Chan, N. Jaitly, Q. V. Le, and O. Vinyals, “Listen, Attend and Spell: Reports, 2021. [Online]. Available: https://doi.org/10.1038/s41598-
A Neural Network for Large Vocabulary Conversational Speech Recog- 021-85130-8
nition,” in ICASSP, 2016. [44] J. Yamagishi, M. Todisco, M. Sahidullah, H. Delgado, X. Wang,
[24] H. Scheidl, S. Fiel, and R. Sablatnig, “Word Beam Search: A Connec- N. Evans, T. Kinnunen, K. A. Lee, V. Vestman, A. Nautsch et al.,
tionist Temporal Classification Decoding Algorithm,” in 2018 Interna- “ASVspoof 2019: The 3rd Automatic Speaker Verification Spoofing
tional Conference on Frontiers in Handwriting Recognition (ICFHR), and Countermeasures Challenge database,” 2019.
2018.
[45] J. Michálek and J. Vanek, “A Survey of Recent DNN Architectures on
[25] A. M. White, A. R. Matthews, K. Z. Snow, and F. Monrose, “Phonotac- the TIMIT Phone Recognition Task,” ArXiv, 2018.
tic Reconstruction of Encrypted VoIP Conversations: Hookt on Fon-iks,”
[46] N. Subramani and D. Rao, “Learning Efficient Representations for Fake
in IEEE Symposium on Security and Privacy, 2011.
Speech Detection,” Proceedings of the AAAI Conference on Artificial
[26] Z. Zhang, “Mechanics of human voice production and control,” The Intelligence, 2020.
Journal of the Acoustical Society of America, 2016.
[47] lowerquality, “Gentle Force Aligner,” 2018. [Online]. Available:
[27] L. Rabiner and R. Schafer, Digital Processing of Speech Signals. The https://github.com/lowerquality/gentle
Journal of the Acoustical Society of America, 1978.
[48] D. Povey, A. Ghoshal, G. Boulianne, L. Burget, O. Glembek, N. Goel,
[28] R. Kirlin, “A posteriori estimation of vocal tract length,” IEEE Trans- M. Hannemann, P. Motlicek, Y. Qian, P. Schwarz, J. Silovsky, G. Stem-
actions on Acoustics, Speech, and Signal Processing, 1978. mer, and K. Vesely, “The Kaldi Speech Recognition Toolkit,” in IEEE
[29] H. Wakita, “Normalization of vowels by vocal-tract length and its 2011 Workshop on Automatic Speech Recognition and Understanding,
application to vowel identification,” IEEE Transactions on Acoustics, 2011.
Speech, and Signal Processing, 1977. [49] Y. Jia, Y. Zhang, R. J. Weiss, Q. shan Wang, J. Shen, F. Ren, Z. Chen,
[30] A. C. Lammert, S. Narayanan, and C. R. Larson, “On Short-Time P. Nguyen, R. Pang, I. Lopez-Moreno, and Y. Wu, “Transfer Learning
Estimation of Vocal Tract Length from Formant Frequencies,” in PloS from Speaker Verification to Multispeaker Text-To-Speech Synthesis,”
one, 2015. ArXiv, 2018.

[50] Y. Liu and J. Zheng, "Es-Tacotron2: Multi-Task Tacotron 2 with Pre-Trained Estimated Network for Reducing the Over-Smoothness Problem," Information, 2019.
[51] "LyreBird," https://github.com/logant/Lyrebird, 2017.
[52] V. Balasubramaniyan, A. Poonawalla, M. Ahamad, M. Hunter, and P. Traynor, "PinDr0p: Using single-ended audio features to determine call provenance," 2010.
[53] F. Tramer, N. Carlini, W. Brendel, and A. Madry, "On Adaptive Attacks to Adversarial Example Defenses," arXiv:2002.08347, 2020.
[54] O. McCarthy, "WaveRNN," https://github.com/fatchord/WaveRNN, 2021.

A ASVSpoof Dataset
We explored the potential use of the ASVSpoof2019 dataset to eval-
uate our deepfake detection technique. The ASVspoof2019 dataset
contains a collection of synthetically modified audio samples, none
of which are actual deepfakes. Instead, these audio samples are used
for speaker verification tasks, such as voice authentication. While
our algorithm can still detect these audio samples, they should not
be used for evaluating deepfake detection algorithms. This dataset
was not designed for this task.
We ran the full dataset against our approach, which required over 1,400 hours of processing time. However, we noticed that such tests produced a very high word error rate (WER) of 0.45, meaning that nearly half of all words were transcribed incorrectly. Upon manual listening, we found that these audio samples sounded very robotic, resulting in poor transcriptions. The lower quality of the audio was therefore the source of these failures, and the resulting high WER served as an efficient filter. Further investigation revealed that, contrary to
popular belief, ASVSpoof2019 is not a deepfake audio dataset -
the maintainers of this challenge note that deepfake detection is
a separate challenge, and have identified it as such in their yet-to-
be-released 2021 dataset. Even though our preprocessing stage can
detect these audio samples as being abnormal due to the high WER,
they were never intended to be used for deepfake detection and
instead target the related problem of automatic speaker verification
(ASV) (e.g., authentication), hence the name of the challenge.
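For reference, a quality filter of the kind our preprocessing stage applies can be expressed in a few lines; this sketch uses the jiwer package to compute WER, and the 0.45 cutoff is an illustrative value taken from the observation above rather than a constant from our pipeline.

```python
import jiwer

WER_THRESHOLD = 0.45  # illustrative cutoff; samples this garbled are rejected up front

def passes_quality_check(reference_transcript: str, asr_transcript: str) -> bool:
    """Reject samples whose ASR transcription diverges too far from the expected
    text, as happened with the ASVspoof2019 audio."""
    return jiwer.wer(reference_transcript, asr_transcript) < WER_THRESHOLD
```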

B Phrases
TIMIT phrases that were converted into our deepfake dataset:
1. Cattle which died from them winter storms were referred to as the winter
2. The odor here was more powerful than that which surrounded the town aborigines. (si1077)
3. No, they could kill him just as easy right now. (si1691)
4. Yet it exists and has an objective reality which can be experienced and known. (si654)
5. I took her word for it, but is she really going with you? (sx395)

Table 2: Descriptions and high-level information about the models tested in our adaptive adversary experiments. The corresponding results for each model can be seen in Figure 12 by the line denoted in the Line Style column.
