MASTER THESIS
Bich Ngoc Do
Prague 2015
Acknowledgements
First and foremost, I would like to express my deep gratitude to my main supervisor, Filip Jurčı́ček, and my co-supervisor, Marco Wiering. Without their guidance and patience, I would never have finished this thesis.
I would like to thank Stanislav Veselý, whose help and kindness made my life in Prague much easier.
I greatly appreciate all the support from my coordinators at the University of Groningen, Gosse Bouma, and at Charles University in Prague, Markéta Lopatková and Vladislav Kuboň.
Big thanks to my dearest friend, Dat, for cooking meals for me when I had to work, and Minh, for encouraging me when I was depressed. Thanks to Tam, Chuong and all my friends, whose encouragement kept me working until the end.
Last but not least, I dedicate this work to my parents and my brother.
Prague, 20/11/2015
Ngoc
Title: Neural networks for automatic speaker, language, and sex identification
Author: Bich-Ngoc Do
Supervisor: Ing. Mgr. Filip Jurčı́ček, Ph.D., Institute of Formal and Applied
Linguistics, Charles University in Prague and Dr. Marco Wiering, Institute of
Artificial Intelligence and Cognitive Engineering, Faculty of Mathematics and
Natural Sciences, University of Groningen
Contents

1 Introduction
  1.1 Problem Definition
  1.2 Components of a Speaker Recognition System
  1.3 Thesis Outline

5 Experiments and Results
  5.1 Corpora for Speaker Identification Evaluation
    5.1.1 TIMIT and its derivatives
    5.1.2 Switchboard
    5.1.3 KING corpus
  5.2 Database Overview
  5.3 Reference Systems
  5.4 Experimental Framework Description
    5.4.1 Preprocessing
    5.4.2 Front-end
    5.4.3 Back-end
    5.4.4 Configuration file
  5.5 Experiments and Results
    5.5.1 Experiment 1: Performance on small size populations
    5.5.2 Experiment 2: Performance with regard to training duration
    5.5.3 Experiment 3: Performance on large populations
    5.5.4 Experiment 4: Sex identification
    5.5.5 Epilogue: Language identification

Bibliography
List of Figures
List of Tables
List of Abbreviations
Chapter 1
Introduction
There is no unique way to classify the subfields of speech processing, but in general it can be divided into a few main components: analysis, coding/synthesis and recognition [14]. Among these, the recognition area deals directly with the basic information that speech delivers, for instance its message in words (speech recognition), its language (language identification), and information about the speaker such as gender or emotion (speaker recognition) (see figure 1.1).
In other words, besides transmitting a message as other means of communication do, speech also reveals the identity of its speaker. Together with other biometrics such as face recognition, DNA and fingerprints, speaker recognition plays an important role in many fields, from forensics to security control. The first attempts in this field were made in the 1960s [20]; since then its approaches have ranged from simple template matching to advanced statistical modeling such as hidden Markov models and artificial neural networks.
In our work, we would like to apply one of the most effective statistical models available today, deep neural networks, to the speaker recognition problem. Hence, the aim of this thesis is to use deep neural network models to identify speakers, to show whether this approach is promising, and to assess its efficiency by comparing its results with other techniques. Our evaluation is conducted on the TIMIT corpus released in 1990.
Figure 1.2: Structures of (a) speaker identification and (b) speaker verification (adapted from [55])
then compared against the model of the claimant, i.e. the speaker whose identity the system knows. All speakers other than the claimant are called impostors. A verification system is trained using not only the claimant's signal but also data from other speakers, called background speakers. In the evaluation phase, the system compares the likelihood ratio ∆ (between the score of the claimant's model and that of the background speakers' model) with a threshold θ. If ∆ ≥ θ, the speaker is accepted; otherwise he or she is rejected. Since the system usually does not know the test speaker's identity in advance, this task is an open-set problem.
Speaker identification, on the other hand, determines who the speaker is among known voices registered in the system. Given an unknown speaker, the system must compare his or her voice to a set of available models, which makes this task a one-vs-all classification problem. Identification can be closed-set or open-set depending on its assumptions. If the test speaker is guaranteed to come from the set of registered speakers, the task is closed-set, and the system returns the ID of the most probable model. In the open-set case, the test speaker's identity may be unknown, and the system should reject the speaker in this situation.
Speaker detection is another subtask of speaker recognition, which aims at detecting one or more specific speakers in a stream of audio [4]. It can be viewed as a combination of segmentation together with speaker verification and/or identification. Depending on the specific situation, this problem can be formulated as an identification problem, a verification problem or both. For instance, one way to combine both tasks is to perform identification first, and then use the returned ID for the verification session.
Based on the restrictions placed on the texts spoken, speaker recognition can be further categorized as text-dependent and text-independent [54]. In text-dependent speaker recognition, all speakers say the same words or phrases during both the training and testing phases. This modality is more likely to be used in speaker verification than in other branches [4]. In text-independent speaker recognition, there is no constraint placed on the training and testing texts; it is therefore more flexible and can be used in all branches of speaker recognition.
In a speaker recognition system, a vector of features acquired in the previous step is compared against a set of speaker models. The identity of the test speaker is associated with the ID of the highest-scoring model. A speaker model is a statistical model that represents speaker-dependent information and can be used to predict new data. In general, any modeling technique can be used, but the most popular ones are clustering, hidden Markov models, artificial neural networks and Gaussian mixture models.
A speaker verification system has an extra impostor model, which accounts for the non-speaker probability. An impostor model can use any of the techniques used for speaker models, but there are two main approaches to impostor modeling [55]. The first approach is to use a cohort, also known as a likelihood set or background set, which is a set of background speaker models. The impostor likelihood is computed as a function of the match scores of all background speakers. The second approach uses a single model trained on a large number of speakers to represent general speech patterns. It is known as a general, world or universal background model.
Chapter 1 The current chapter provides general information about our research interest, speaker identification, and its related problems.
Chapter 2 This chapter reviews the theory of speech signal processing that forms the foundation for extracting speech features. Important topics are frequency analysis, short-term processing and the cepstrum.
Chapter 4 In this chapter, the method that inspires this project, deep neural networks, is inspected closely.
Chapter 5 This chapter presents the data used to evaluate our approach and the details of our experimental systems. Experimental results are compared with reference systems and analyzed.
Chapter 6 This chapter serves as a summary of our work and presents some future directions.
Chapter 2
Figure 2.1: Sampling a sinusoidal signal at different sampling rates; f - signal
frequency, fs - sampling frequency (adapted from [4])
From this point on, analog or continuous-time signals will be written with parentheses, such as x(t), while digital or discrete-time signals will be written with square brackets, such as x[n].
After sampling, acquired values of the signal must be converted into some
discrete set of values. This process is called quantization. In audio signals, the
quantization level is normally given as the number of bits needed to represent the
range of the signal. For example, values of a 16-bit signal may range from -32768
to 32767. Figure 2.2 illustrates an analog signal which is quantized at different
levels.
The processes of sampling and quantization cause a loss of information, thus introducing noise and errors into the output. While the sampling frequency needs to be high enough to reconstruct the original signal faithfully, in the case of quantization the main problem is a trade-off between output signal quality and size.
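As a rough illustration of these two steps, the following minimal Python sketch samples a sinusoid and quantizes it uniformly to a given bit depth; the function names and parameter values are illustrative only.

import numpy as np

def sample_sine(freq_hz, fs_hz, duration_s, amplitude=1.0):
    # sample the analog sinusoid x_a(t) = A*cos(2*pi*f*t) at rate fs_hz
    t = np.arange(0, duration_s, 1.0 / fs_hz)
    return amplitude * np.cos(2 * np.pi * freq_hz * t)

def quantize(x, n_bits):
    # uniformly quantize a signal in [-1, 1]; 16 bits gives values in [-32768, 32767]
    levels = 2 ** (n_bits - 1)
    return np.clip(np.round(x * levels), -levels, levels - 1).astype(np.int32)

x = sample_sine(freq_hz=440, fs_hz=16000, duration_s=0.02)
x16 = quantize(x, n_bits=16)   # fine quantization
x8 = quantize(x, n_bits=8)     # coarser quantization, larger error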
Figure 2.2: Quantized versions of an analog signal at different levels (adapted
from [10])
x_a(t + T) = x_a(t) \quad \forall t    (2.2)

Similarly, a digital signal x[n] is periodic with period N if and only if:

x[n + N] = x[n] \quad \forall n    (2.3)

In contrast, a signal that does not satisfy 2.2 (if it is analog) or 2.3 (if it is digital) is nonperiodic or aperiodic.
The frequency domain is another point of view to look at a signal besides the
time domain. A very famous example of the frequency domain is the experiment
of directing white light through a prism. Newton showed in his experiment [68]
that a prism could break white light up into a band of colors, or spectrum, and
Figure 2.3: An adult male voice saying [a:] sampled at 44100 Hz: (a) waveform, (b) spectrum limited to 1400 Hz, (c) spectrogram limited from 0 Hz to 8000 Hz
Figure 2.4: Periodic and aperiodic speech signals (adapted from [43]). The wave-
form of voiceless fricative [h] is aperiodic while the waveforms of three vowels are
periodic.
Figure 2.5: Illustration of Helmholtz's experiment (adapted from [24])
furthermore, these color rays could be reconstituted into white light using a second prism. Therefore, white light can be analyzed into color components. We
also know that each primary color corresponds to a range of frequencies. Hence,
decomposing white light into colors is a form of frequency analysis.
In digital processing, the sine wave or sinusoid is a very important type of signal:

x_a(t) = A \cos(\omega t + \phi), \quad -\infty < t < \infty    (2.4)

where A is the amplitude of the signal, ω is the angular frequency in radians per second, and φ is the phase in radians. The frequency f of the signal in hertz is related to the angular frequency by:

\omega = 2\pi f    (2.5)
Clearly, the sinusoid is periodic with period T = 1/f from equation 2.2. Its digital version has the form:

x[n] = A \cos(\omega n + \phi), \quad -\infty < n < \infty    (2.6)

However, from equation 2.3, x[n] is periodic with period N if and only if \omega = 2\pi k/N for some integer k, i.e. its normalized frequency f = \omega/2\pi is a rational number. Therefore, the digital signal in equation 2.6 is not periodic for all values of ω.
A sinusoid with a specific frequency is known in speech processing as a pure tone. In the 19th century, Helmholtz discovered the connection between pitches and frequencies using a tuning fork with a pen attached to one of its tines [67] (figure 2.5). While the tuning fork was vibrating at a specific pitch, the pen traced the waveform across a piece of paper. It turned out that each pure tone corresponds to a frequency.
Hence, frequency analysis of a speech signal can be seen as decomposing it into a sum of sinusoids. An example of speech signal decomposition is illustrated in figure 2.6. The process of transforming a signal from the time domain to the frequency domain is called frequency transformation.
A spectrum is a representation of sound in the frequency domain: it plots the amplitude at each corresponding frequency (see figure 2.3b). On the other hand, a spectrogram (see figure 2.3c) is a three-dimensional representation of spectral information. Conventionally, the horizontal axis displays time and the vertical axis displays frequency. The shade at each time-frequency point represents the amplitude level: the higher the amplitude, the darker (or hotter, if using colors) the shade. Spectrograms are an effective visual cue for studying the acoustics of speech.
[Figure 2.6: a speech signal decomposed into sinusoidal components]
The Fourier Transform (FT) of a continuous aperiodic signal x(t) is defined as:

X(\omega) = \int_{-\infty}^{\infty} x(t) e^{-j\omega t} dt    (2.9)

x(t) = \frac{1}{2\pi} \int_{-\infty}^{\infty} X(\omega) e^{j\omega t} d\omega    (2.10)

The Discrete Fourier Transform (DFT) of a discrete periodic signal x[n] with period N is defined as:

c_k = \frac{1}{N} \sum_{n=0}^{N-1} x[n] e^{-j2\pi kn/N}    (2.11)

x[n] = \sum_{k=0}^{N-1} c_k e^{j2\pi kn/N}    (2.12)
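For illustration, the following Python sketch evaluates equation 2.11 directly and checks it against numpy's FFT, which omits the 1/N factor; the helper name is ours.

import numpy as np

def dft_coefficients(x):
    # c_k = (1/N) * sum_n x[n] * exp(-j*2*pi*k*n/N), as in equation (2.11)
    N = len(x)
    n = np.arange(N)
    k = n.reshape(-1, 1)
    return (1.0 / N) * np.sum(x * np.exp(-2j * np.pi * k * n / N), axis=1)

x = np.cos(2 * np.pi * 3 * np.arange(16) / 16)   # 3 cycles over N = 16 samples
c = dft_coefficients(x)
# numpy's FFT has no 1/N factor, so divide by N to match equation (2.11)
assert np.allclose(c, np.fft.fft(x) / len(x))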
2.4.2 Spectrograms
The magnitude of a spectrogram is computed as:

S(\omega, t) = |X(\omega, t)|^2    (2.20)

There are two kinds of spectrograms: narrow-band and wide-band (figure 2.8). Wide-band spectrograms use a short window length (< 10 ms), which leads to filters with a wide bandwidth (> 200 Hz). In contrast, narrow-band spectrograms use a longer window (> 20 ms), which corresponds to a narrow bandwidth (< 100 Hz). The difference in window duration between the two types of spectrograms results in different time and frequency resolution: while wide-band spectrograms give good time resolution, e.g. for pitch periods, they are less useful for harmonics (i.e. component frequencies). Narrow-band spectrograms have better frequency resolution but smear periodic changes over time. In general, wide-band spectrograms are preferred in phonetic studies.
1. The convolution of f and g is defined as:

f[n] * g[n] = \sum_{k=-\infty}^{\infty} f[k] g[n-k]
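A minimal numpy sketch of how such short-time power spectra can be computed is given below; the window and hop lengths follow the wide-band/narrow-band guideline above, and all names and values are illustrative.

import numpy as np

def spectrogram(x, fs, win_ms, hop_ms):
    # short-time power spectrum using a Hanning window;
    # ~5 ms windows give a wide-band spectrogram, ~23 ms a narrow-band one
    win = int(fs * win_ms / 1000)
    hop = int(fs * hop_ms / 1000)
    w = np.hanning(win)
    frames = [x[i:i + win] * w for i in range(0, len(x) - win + 1, hop)]
    # rows: time frames, columns: frequency bins up to fs/2
    return np.abs(np.fft.rfft(frames, axis=1)) ** 2

fs = 16000
t = np.arange(0, 1.0, 1.0 / fs)
x = np.cos(2 * np.pi * 200 * t) + 0.5 * np.cos(2 * np.pi * 2000 * t)
wide = spectrogram(x, fs, win_ms=5, hop_ms=2)      # good time resolution
narrow = spectrogram(x, fs, win_ms=23, hop_ms=10)  # good frequency resolution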
Figure 2.8: Two types of spectrograms: (a) original sound wave, (b) wide-band spectrogram using 5 ms Hanning windows, (c) narrow-band spectrogram using 23 ms Hanning windows
Spectral domain Cepstral domain
Frequency Quefrency
Spectrum Cepstrum
Phase Saphe
Amplitude Gamnitude
Filter Lifter
Harmonic Rahmonic
Period Repiod
logarithms and inverse Fourier transforms as the definition of the cepstrum.
The definition of the complex cepstrum of a discrete signal is:

\hat{x}[n] = \frac{1}{2\pi} \int_{2\pi} \hat{X}(\omega) e^{j\omega n} d\omega    (2.22)
Chapter 3
Approaches in Speaker
Identification
Figure 3.1: Relationship between the frequency scale and the mel scale
based on auditory perception. MFCCs are based on the mel scale. A mel is a unit of "measure of perceived pitch or frequency of a tone" [14]. In 1940, Stevens and Volkmann [63] assigned 1000 mels to a 1000 Hz tone, and asked participants to change the frequency until they perceived a pitch change of some proportion relative to the reference tone. The threshold frequencies were marked, resulting in a mapping between the real frequency scale (in Hz) and the perceived frequency scale (in mels). A popular formula to convert from the frequency scale to the mel scale is:

f_{mel} = 1127 \ln\left(1 + \frac{f_{Hz}}{700}\right)    (3.5)

where f_{mel} is the frequency in mels and f_{Hz} is the normal frequency in Hz. This relationship is plotted in figure 3.1.
MFCCs are often computed using a filter bank of M filters (m = 0, 1, ..., M − 1), each one having a triangular shape and spaced uniformly on the mel scale (figure 3.2). Each filter is defined by:

H_m[k] = \begin{cases}
0 & k < f[m-1] \\
\frac{k - f[m-1]}{f[m] - f[m-1]} & f[m-1] < k \le f[m] \\
\frac{f[m+1] - k}{f[m+1] - f[m]} & f[m] \le k < f[m+1] \\
0 & k \ge f[m+1]
\end{cases}    (3.6)
Given the DFT of the input signal in equation 3.2 with N as the sampling size of the DFT, let us define f_min and f_max as the lowest and highest frequencies of the filter bank in Hz and F_s as the sampling frequency. M + 2 boundary points f[m] (m = −1, 0, ..., M) are uniformly spaced between f_min and f_max on the mel scale:

f[m] = \frac{N}{F_s} B^{-1}\left(B(f_{min}) + m \frac{B(f_{max}) - B(f_{min})}{M + 1}\right)    (3.7)

where B is the conversion from the frequency scale to the mel scale given in equation 3.5 and B^{-1} is its inverse:

f_{Hz} = 700\left(\exp\left(\frac{f_{mel}}{1127}\right) - 1\right)    (3.8)
Figure 3.2: A filter bank of 10 filters used in MFCC
\hat{x}[n] = \sum_{m=0}^{M-1} S[m] \cos\left(\frac{\pi n (m + \frac{1}{2})}{M}\right), \quad n = 0, 1, ..., M - 1    (3.10)
Typically, the number of filters M ranges from 20 to 40, and the number of kept coefficients is 13. Some studies reported that the performance of speech recognition and speaker identification systems peaked with 32-35 filters [65, 18]. Many speech recognition systems remove the zeroth coefficient from the MFCCs because it is the average power of the signal [4].
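The following Python sketch puts equations 3.5-3.8 and 3.10 together: it builds a triangular mel filter bank and applies a logarithm and a DCT to obtain cepstral coefficients. It is a simplified illustration, not the exact implementation used later in this thesis; all names and defaults are ours.

import numpy as np
from scipy.fftpack import dct

def hz_to_mel(f):
    return 1127.0 * np.log(1.0 + f / 700.0)        # equation (3.5)

def mel_to_hz(m):
    return 700.0 * (np.exp(m / 1127.0) - 1.0)      # equation (3.8)

def mel_filterbank(n_filters, n_fft, fs, f_min=0.0, f_max=None):
    # triangular filters H_m[k] spaced uniformly on the mel scale (equation 3.6)
    f_max = f_max if f_max is not None else fs / 2.0
    mel_points = np.linspace(hz_to_mel(f_min), hz_to_mel(f_max), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_points) / fs).astype(int)
    H = np.zeros((n_filters, n_fft // 2 + 1))
    for m in range(1, n_filters + 1):
        left, center, right = bins[m - 1], bins[m], bins[m + 1]
        for k in range(left, center):
            H[m - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):
            H[m - 1, k] = (right - k) / max(right - center, 1)
    return H

def mfcc(power_spectrum, filterbank, n_ceps=13):
    # log mel energies followed by a DCT; keep the first n_ceps coefficients
    mel_energies = power_spectrum @ filterbank.T
    return dct(np.log(mel_energies + 1e-10), type=2, norm='ortho', axis=-1)[..., :n_ceps]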
Figure 3.3: A filter bank of 10 filters used in LFCC
3.2 Speaker Modeling Techniques
Given a set of feature vectors, we wish to build a model for each speaker so that a vector from that speaker has a higher probability under the speaker's model than under any other model. In general, any learning method can be used, but in this section we focus on the most basic approaches in text-independent speaker identification.
d(U, R) = \frac{1}{|U|} \sum_{u_i \in U} \min_{r_j \in R} |u_i - r_j|^2 + \frac{1}{|R|} \sum_{r_j \in R} \min_{u_i \in U} |u_i - r_j|^2 - \frac{1}{|U|} \sum_{u_i \in U} \min_{u_j \in U, j \ne i} |u_i - u_j|^2 - \frac{1}{|R|} \sum_{r_i \in R} \min_{r_j \in R, j \ne i} |r_i - r_j|^2    (3.16)
Figure 3.4: A codebook in 2 dimensions. Input vectors are marked with x sym-
bols, codewords are marked with circles (adapted from [51]).
D = \frac{1}{N} \sum_{i=1}^{N} \min_{1 \le j \le M} d(x_i, z_j)    (3.19)
is minimized over all input vectors. Figure 3.4 illustrates a codebook in a two-dimensional space. K-means and LBG (Linde-Buzo-Gray) are two popular techniques for designing codebooks in VQ.
The K-means algorithm is described as follows [31]:
Step 4 Iteration. Repeat steps 2 and 3 until the difference between the new distortion and the previous one falls below a pre-defined threshold.
Step 1 Initialization. Set M = 1. Find the centroid of all data according to
equation 3.20.
z_j^+ = z_j + \epsilon
z_j^- = z_j - \epsilon
where \epsilon is a small perturbation vector. Set M = 2M.
D^l = \frac{1}{N} \sum_{i=1}^{N} \min_{1 \le j \le M} d(x_i, z_j^l)    (3.21)
The L average distortions are then compared, and the speaker's ID is decided by the minimum distortion:

l^* = \arg\min_{1 \le l \le L} D^l    (3.22)
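A compact Python sketch of this VQ scheme (K-means codebook training per equation 3.19 and the minimum-distortion decision of equation 3.22) might look as follows; it is illustrative only and omits the LBG splitting refinements.

import numpy as np

def train_codebook(X, M, n_iters=20, seed=0):
    # K-means codebook of M codewords; reduces the average distortion (equation 3.19)
    rng = np.random.default_rng(seed)
    Z = X[rng.choice(len(X), size=M, replace=False)].astype(float)
    for _ in range(n_iters):
        d = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)                 # nearest-codeword assignment
        for j in range(M):
            if np.any(labels == j):
                Z[j] = X[labels == j].mean(axis=0)   # centroid update
    return Z

def avg_distortion(X, Z):
    d = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(axis=2)
    return d.min(axis=1).mean()

# identification: pick the speaker whose codebook yields the minimum distortion (3.22)
# speaker_id = min(codebooks, key=lambda spk: avg_distortion(test_features, codebooks[spk]))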
a_{ij} = P(q_{t+1} = s_j \mid q_t = s_i)
• π = {π_i}: the initial state distribution, where:

π_i = P(q_1 = s_i)

b_j(k) = \sum_{m=1}^{M} c_{jm} b_{jm}(k)    (3.23)

b_{jm}(k) \sim N(o_k; \mu_{jm}, \Sigma_{jm})    (3.24)

where M is the number of Gaussian mixtures, μ_{jm} and Σ_{jm} are the mean and covariance matrix of the m-th mixture, and c_{jm} is the weight coefficient of the m-th mixture. The weights satisfy:

\sum_{m=1}^{M} c_{jm} = 1, \quad 1 \le j \le N    (3.25)
Figure 3.5: A left-to-right HMM model used in speaker identification (adapted
from [1]).
identification system builds an HMM for each speaker, and the model that yields the highest probability for a test sequence gives the final identification.
If VQ is used, a codebook corresponding to each speaker is generated first. By using codebooks, the domain of observation probabilities becomes discrete, and the system can use discrete HMMs. However, in some cases a codebook of a different speaker may be the nearest codebook to the test sequence, and recognition is then poor [46]. Continuous HMMs are able to solve this problem, and Matsui and Furui showed that continuous HMMs gave much better results than discrete HMMs.
In speaker identification, the most common types of HMM structure are ergodic HMMs (i.e., HMMs with full connections between states) and left-to-right HMMs (i.e., HMMs that only allow transitions in one direction or to the same state). A left-to-right HMM is illustrated in figure 3.5.
p(\vec{x} \mid \lambda) = \sum_{i=1}^{M} p_i b_i(\vec{x})    (3.27)

with

\sum_{i=1}^{M} p_i = 1
Each mixture component is a D-variate Gaussian density function:

b_i(\vec{x}) = \frac{1}{(2\pi)^{D/2} |\Sigma_i|^{1/2}} \exp\left(-\frac{1}{2} (\vec{x} - \vec{\mu}_i)^T \Sigma_i^{-1} (\vec{x} - \vec{\mu}_i)\right)    (3.28)

where \vec{\mu}_i is the mean vector and \Sigma_i is the covariance matrix.
A GMM is characterized by the mean vector, covariance matrix and weight of all its components. Thus, we represent it by the compact notation:

\lambda = (p_i, \vec{\mu}_i, \Sigma_i), \quad i = 1, 2, ..., M    (3.29)
In speaker identification, each speaker is characterized by a GMM with parameters λ. There are many different choices of covariance matrices [56]; for example, the model may use one covariance matrix per component, one covariance matrix for all components, or one covariance matrix for all components in a speaker model. The shape of the covariance matrices can be full or diagonal.
Given a set of training samples X, probably the most popular method to train a GMM is maximum likelihood (ML) estimation. The likelihood of a GMM is:

p(X \mid \lambda) = \prod_{t=1}^{T} p(\vec{x}_t \mid \lambda)    (3.30)
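As an illustration, the sketch below builds one diagonal-covariance GMM per speaker with scikit-learn (whose EM training maximizes the likelihood in equation 3.30) and identifies a test utterance by the highest average log-likelihood; the component count and all names are illustrative, not our exact setup.

import numpy as np
from sklearn.mixture import GaussianMixture

def train_speaker_gmms(features_by_speaker, n_components=32):
    # fit one diagonal-covariance GMM per speaker by EM (maximum likelihood)
    models = {}
    for speaker, X in features_by_speaker.items():   # X: (n_frames, n_dims) features
        gmm = GaussianMixture(n_components=n_components,
                              covariance_type='diag', max_iter=100)
        models[speaker] = gmm.fit(X)
    return models

def identify(models, X_test):
    # return the speaker whose model gives the highest average log-likelihood
    scores = {spk: gmm.score(X_test) for spk, gmm in models.items()}
    return max(scores, key=scores.get)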
Despite their power, GMMs still face some disadvantages [66]. Firstly, GMMs have a large number of parameters to train. This not only leads to expensive computation but also requires a sufficient amount of training data; the performance of a GMM is therefore unreliable if it is trained on a small dataset. Secondly, as a generative model, GMMs do not work well with unseen data, which easily yields low likelihood scores. Fortunately, these two problems can be overcome by speaker adaptation. The main idea of speaker adaptation is to build speaker-dependent systems by adapting (i.e. modifying) a speaker-independent system constructed from the data of all speakers. A GMM trained on all speaker identities is the universal background model (UBM), a concept discussed in section 1.2. The GMM-UBM is then adapted into each speaker's model using maximum a posteriori (MAP) adaptation [57].
[Figure: GMM supervector extraction: features are extracted from an utterance, a GMM-UBM is adapted by MAP adaptation, and the adapted mixture means m_1, ..., m_K are stacked into the supervector m]
M = s + Ux    (3.35)

where U is a factor loading matrix that defines a channel subspace and x are common channel factors having standard normal distributions. In summary:

M = m + Ux + Vy + Dz    (3.36)

In [13], based on an experiment showing that JFA channel factors also contain speaker information, a new single subspace was defined to model both channel and speaker variabilities. The new space was referred to as the total variability space, and the new speaker- and channel-dependent supervector was defined as:

M = m + Tw    (3.37)
The i-vector technique is considered an effective way to reduce high-dimensional input data to low-dimensional feature vectors. Today, i-vector systems are the state of the art in speaker recognition [33, 45].
Chapter 4
It has been more than 70 years since Warren McCulloch and Walter Pitts modeled the first artificial neural network (ANN) that mimicked the way brains work. These days, ANNs have become one of the most powerful tools in machine learning, and their effectiveness has been tested empirically in many real-world applications. In combination with the deep learning paradigm, ANNs have achieved state-of-the-art results in plenty of areas, especially in natural language processing and speech technology (see [60] for more details).
This chapter serves as a reference for the ideas and techniques we use directly in our speaker identification systems. First, an overview of ANNs and deep learning is presented; then we review some existing applications of ANNs in speaker identification.
y = ϕ(w · x + b) (4.1)
Common activation functions are sigmoid, tanh and rectified linear (ReL).
Figure 4.1: A perceptron
Sigmoid

\sigma(x) = \frac{1}{1 + e^{-x}}    (4.2)

Tanh

\tanh(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}}    (4.3)

ReL

f(x) = \max(0, x)    (4.4)
h^{(l)} = \varphi^{(l)}(W^{(l)} \cdot h^{(l-1)} + b^{(l)})    (4.5)

where h^{(l)} is the output vector of layer l, l = 1...L, and L is the number of layers in the network. h^{(0)} is the input of the network. W^{(l)}, b^{(l)} and \varphi^{(l)} are, in turn, the weight matrix, the bias vector and the activation function of layer l.
The role of activation functions in MLPs is very important, because they give MLPs the ability to compute nonlinear functions: if the outputs of the hidden layers were linear, the network output would be just a linear combination of the inputs, which is not very useful. In regression, the activation function used in the output layer is usually linear, while in classification with K classes it could be a sigmoid (K = 2) or softmax (K > 2) function. Choosing activation functions for hidden layers will be discussed further in section 4.5.
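A minimal numpy sketch of such a forward pass, with an assumed 13-50-10 layer layout and hypothetical parameter values, is given below.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))                 # equation (4.2)

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def mlp_forward(x, params):
    # forward pass h^(l) = phi(W^(l) h^(l-1) + b^(l)) through all layers;
    # params is a list of (W, b, phi) triples, the last phi being softmax
    h = x
    for W, b, phi in params:
        h = phi(W @ h + b)
    return h

rng = np.random.default_rng(0)
params = [(0.1 * rng.standard_normal((50, 13)), np.zeros(50), np.tanh),
          (0.1 * rng.standard_normal((10, 50)), np.zeros(10), softmax)]
y = mlp_forward(rng.standard_normal(13), params)    # a vector of 10 class probabilities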
Given a set of samples {(x^{(1)}, y^{(1)}), ..., (x^{(M)}, y^{(M)})} and an MLP with initial parameters θ (characterized by its weight matrices and bias vectors), we would like to train the MLP so that it learns the mapping given in our set. If we view the whole network as a function:

\hat{y} = F(x; \theta)    (4.6)

and define some loss function E(x, y, θ), then the goal of training our network becomes minimizing E(x, y, θ). Luckily, the gradient of E tells us the direction to move in order to increase E:

\nabla E(\theta) = \left(\frac{\partial E}{\partial \theta_1}, ..., \frac{\partial E}{\partial \theta_n}\right)    (4.7)

Since the gradient of E specifies the direction that increases E, at each step the parameters are updated proportionally to the negative of the gradient:

\theta_i \leftarrow \theta_i + \Delta\theta_i    (4.8)

where:

\Delta\theta_i = -\eta \frac{\partial E}{\partial \theta_i}    (4.9)

This training procedure is gradient descent, and η is a small positive training parameter called the learning rate.
In our systems, we employ two types of loss functions:
where m is the index of an arbitrary sample and K is the number of classes; y_k^{(m)} represents the k-th column (corresponding to the probability of class k) of vector y^{(m)}.
In conventional systems, the gradient components of the output layer can be computed directly, while they are harder to compute in lower layers. Normally, the gradient of a layer is calculated using the error propagated from the layer above it. Since errors are calculated in the reverse direction, this algorithm is known as backpropagation.
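The following sketch shows one gradient-descent step for a one-hidden-layer network with a softmax output and cross-entropy loss, with the gradients obtained by backpropagation; it is a simplified illustration, not our actual training code.

import numpy as np

def train_step(X, y_onehot, W1, b1, W2, b2, lr=0.01):
    # forward pass
    h = np.tanh(X @ W1 + b1)                      # (batch, hidden)
    z = h @ W2 + b2
    e = np.exp(z - z.max(axis=1, keepdims=True))
    p = e / e.sum(axis=1, keepdims=True)          # softmax probabilities
    # backward pass: errors flow from the output layer down to the input layer
    dz = (p - y_onehot) / len(X)                  # gradient at the softmax input
    dW2, db2 = h.T @ dz, dz.sum(axis=0)
    dh = dz @ W2.T * (1.0 - h ** 2)               # tanh derivative
    dW1, db1 = X.T @ dh, dh.sum(axis=0)
    # gradient descent update: theta <- theta - eta * dE/dtheta
    W1 -= lr * dW1; b1 -= lr * db1
    W2 -= lr * dW2; b2 -= lr * db2
    # cross-entropy loss for monitoring
    return -np.mean(np.sum(y_onehot * np.log(p + 1e-12), axis=1))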
Figure 4.2: A feedforward neural network with one hidden layer
Figure 4.3: A simple recurrent neural network
and bias vectors [W_in, W_h, W_out, b_in, b_out]. Given an input sequence x_1, x_2, ..., x_T, the output of the RNN is computed by applying, at each time step, the recurrence h_t = \varphi(W_{in} \cdot x_t + W_h \cdot h_{t-1} + b_{in}) followed by the output transformation y_t = \varphi(W_{out} \cdot h_t + b_{out}).
The simple RNN model is elegant, yet it only captures temporal relations in one direction. Bidirectional RNNs [61] were proposed to overcome this limitation. Instead of using two separate networks for the forward and backward directions, bidirectional RNNs split the old recurrent layer into two distinct layers, one for the positive time direction (forward layer) and one for the negative time direction (backward layer). The outputs of the forward states are not connected to the backward states and vice versa (figure 4.4).
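A minimal numpy sketch of the forward pass of such a bidirectional recurrent layer, with assumed weight names, is shown below; the forward and backward state sequences are computed independently and then combined.

import numpy as np

def birnn_forward(xs, Wf_in, Wf_h, bf, Wb_in, Wb_h, bb, phi=np.tanh):
    # forward (positive time) and backward (negative time) state sequences
    T, H = len(xs), Wf_h.shape[0]
    hf = np.zeros((T, H))
    hb = np.zeros((T, H))
    for t in range(T):                                 # forward layer, t = 0..T-1
        prev = hf[t - 1] if t > 0 else np.zeros(H)
        hf[t] = phi(Wf_in @ xs[t] + Wf_h @ prev + bf)
    for t in reversed(range(T)):                       # backward layer, t = T-1..0
        nxt = hb[t + 1] if t < T - 1 else np.zeros(H)
        hb[t] = phi(Wb_in @ xs[t] + Wb_h @ nxt + bb)
    return hf + hb                                     # combined layer output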
the convolution operation (see section 2.4.1) between the filters and the input. The inspiration for CNNs is said to come from the receptive field of a neuron, i.e. the sub-region of the visual field that the neuron is sensitive to.
There are several types of layers that make up a CNN:
Convolutional layer A convolutional layer consists of K filters. In general, its input has one or more feature maps; e.g., an RGB image has 3 channels: red, green and blue. The input is therefore a 3-dimensional matrix whose feature maps are considered the depth dimension. Each filter has a 3-dimensional shape as well, with its depth extending to the entire depth of the input (see figure 4.5). The output of the layer is K feature maps, each one computed as the convolution of the input with a filter k, plus its bias:

h_{ijk} = \varphi((W_k * x)_{ij} + b_k)    (4.14)

where i and j are the row and column indices, ϕ is the activation function of the layer and x is its input. Thus the output of a convolutional layer is also a 3-dimensional matrix, and its depth is defined by the number of filters.

Pooling layer A pooling layer is usually inserted between two successive convolutional layers in a CNN. It downsamples the input matrix, thus reducing the size of the representation and the number of parameters. The depth dimension remains the same. A pooling layer divides the input into (usually) non-overlapping rectangular regions, whose size is defined by the pool shape. Then, it outputs one value per region using the max, sum or average operator. If a pooling layer uses the max operator, it is called a max pooling layer. The pool size is normally set to (2, 2), as larger sizes may lose too much information.

Fully-connected layer One or more fully-connected layers may be placed at the end of a CNN, to refine the features learned by the convolutional layers, or to return class scores in classification.
The most common CNN architecture stacks convolutional layers and pooling layers in turn, and ends with fully-connected layers (e.g., LeNet [38]). It is worth noting that a convolutional layer can be substituted by a fully-connected layer whose weight matrix is mostly zero except at certain blocks, with the weights of those blocks being equal.
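The naive numpy sketch below illustrates a convolutional layer in the sense of equation 4.14 (implemented, as most libraries do, as cross-correlation) and a non-overlapping max pooling layer; shapes and names are illustrative.

import numpy as np

def conv_layer(x, W, b, phi=lambda v: np.maximum(0, v)):
    # x: (H, W_in, D) input; W: (K, kh, kw, D) filters spanning the full input depth
    H, Wi, D = x.shape
    K, kh, kw, _ = W.shape
    out = np.zeros((H - kh + 1, Wi - kw + 1, K))
    for k in range(K):
        for i in range(out.shape[0]):
            for j in range(out.shape[1]):
                out[i, j, k] = np.sum(x[i:i + kh, j:j + kw, :] * W[k]) + b[k]
    return phi(out)            # K feature maps, as in equation (4.14)

def max_pool(x, size=2):
    # non-overlapping max pooling over (size, size) regions; depth unchanged
    H, Wi, D = x.shape
    H2, W2 = H // size, Wi // size
    return x[:H2 * size, :W2 * size, :].reshape(H2, size, W2, size, D).max(axis=(1, 3))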
Figure 4.5: An illustration of 3-dimensional convolution (adapted from [38])
The vanishing gradient problem mainly occurs due to the calculation of local gradients. In the backpropagation algorithm, a local gradient is an aggregate sum over the gradients and weights of the layer above, multiplied by the derivative of the activation. Since parameters are usually initialized to small values, their gradients are less than 1; therefore the gradients of lower layers are smaller than those of higher layers and shrink toward zero more easily. The exploding gradient, on the other hand, normally happens in neural networks with long time dependencies, for instance RNNs, since the large number of factors used to compute local gradients is prone to explode. In practice, several factors affect the severity of the vanishing and exploding gradient problems, including the choice of activation function, the cost function and the network initialization [22].
A closer look at the role of activation functions can give us an intuitive understanding of these problems. A sigmoid is a monotonic function that maps its inputs to the range [0, 1] (figure 4.6). It was believed to be popular in the past because of the biological inspiration that neurons also follow a sigmoid activation function. A sigmoid function saturates at both tails, where its values remain mostly constant. Thus, the gradients at those points are zero, and this phenomenon is propagated to lower layers, which makes the network hardly learn anything. In consequence, we should pay attention at the initialization phase so that weights are small enough not to fall into the saturated regions.

Figure 4.6: Sigmoid and tanh function
The tanh function also has an S-shape like the sigmoid, except that it ranges from −1 to 1 instead of 0 to 1. Its characteristics are otherwise the same, but tanh is empirically recommended over the sigmoid because it is zero-centered. According to LeCun et al., weights should be normalized around 0 to avoid ineffective zigzag updates, which lead to slow convergence [39].
Of the three types of activation functions, ReL has the cheapest computation and does not suffer from the vanishing gradient along its activation units. Many studies have reported that ReL improves DNNs in comparison with other activation functions [42]. However, ReL can have a problem with the 0-gradient case, where a unit never activates during training. This issue may be alleviated by introducing a leaky version of ReL:

f(x) = \begin{cases} x & x > 0 \\ 0.01x & \text{otherwise} \end{cases}    (4.15)

A unit with ReL as its activation function is called a rectified linear unit (ReLU).
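The three activation functions and the leaky variant can be written compactly as follows; the derivative of the sigmoid illustrates the saturation discussed above. The snippet is illustrative only.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))              # saturates for large |x|

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)                         # near zero at the tails: vanishing gradient

def rel(x):
    return np.maximum(0.0, x)                    # equation (4.4)

def leaky_rel(x, alpha=0.01):
    return np.where(x > 0, x, alpha * x)         # equation (4.15)

x = np.array([-6.0, 0.0, 6.0])
print(sigmoid_grad(x))                           # largest at 0, tiny at -6 and 6
print(np.tanh(x), rel(x), leaky_rel(x))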
Chapter 5
Experiments and Results
Speaker model 5 speakers 10 speakers 20 speakers
FVSQ (128) 100% 98% 96%
TSVQ (64) 100% 94% 88%
MNTN (7 levels) 96% 98% 96%
MLP (16) 96% 90% 90%
ID3 86% 88% 79%
CART 80% 76% -
C4 92% 84% 73%
BAYES 92% 92% 83%
5.1.2 Switchboard
The Switchboard corpus is one of the largest public collections of telephone con-
versations. It contains data recorded in multiple sessions using different hand-
sets. Conversations were automatically collected under computer supervision
[23]. There are two Switchboard corpora, Switchboard-I and Switchboard-II.
Switchboard-I has about 2400 two-sided conversations from 534 participants in
the United States.
Due to its size, many researchers evaluate their systems on only a part of the Switchboard corpus. An important subset of Switchboard-I is SPIDRE (SPeaker IDentification REsearch), which was specially designed for closed- or open-set speaker identification and verification. SPIDRE includes 45 target speakers, 4 conversations per target and 100 calls from non-targets.
Gish and Schmidt achieved an identification accuracy of 92% on the SPIDRE 30-second test using robust scoring algorithms [21]. Besides, some systems were tested on a subset of 24 speakers of Switchboard (referred to as SWBDTEST in [21]), with accuracy higher than 90% [54, 21, 26] (table 5.2).
Speaker model Accuracy (5 second test) (%)
GMM-nv 94.5 ± 1.8
VQ-100 92.9 ± 2.0
GMM-gv 89.5 ± 2.4
VQ-50 90.7 ± 2.3
RBF 87.2 ± 2.6
TGMM 80.1 ± 3.1
GC 67.1 ± 3.7
Table 5.4: TIMIT distribution of speakers over dialects (reproduced from [71])
Sentence type    #Sentences    #Speakers/sentence    Total    #Sentences/speaker
Dialect (SA)     2             630                   1260     2
Compact (SX)     450           7                     3150     5
Diverse (SI)     1890          1                     1890     3
Total            2342                                6300     10
Table 5.5: The distribution of speech materials in TIMIT (reproduced from [71])
from that were used as features. Several techniques were compared in the speaker
identification task, including:
MLP An MLP with one hidden layer is constructed for each speaker. The input of the MLP is a feature vector, and the output is the label of that vector: 1 if it comes from the same speaker as the MLP, and 0 otherwise. In the identification phase, all test vectors of an utterance are passed through each MLP, and the outputs of each MLP are accumulated. The speaker corresponding to the MLP with the highest accumulated output is selected.

Decision tree All training data are used to train a binary decision tree for each speaker, with the same input and output scheme as in the MLP method. The classification probability of the decision tree is used to determine the target speaker. Pruning is applied after training to avoid overfitting. Various decision tree algorithms were considered, including C4, ID3, CART and a Bayesian decision tree.

Neural tree network A neural tree network has a tree structure as in decision trees, but each non-leaf node is a single-layer perceptron. In the enrollment phase, the single-layer perceptron at each node is trained to classify data into subsets. The architecture of a neural tree network is determined during training rather than pre-defined as in MLPs.
5.4.1 Preprocessing
The original TIMIT data are in the Sphere format, so they first need to be converted into the WAV format before use. Because each file in TIMIT is a clean single
Figure 5.1: The process to convert speech signals into MFCC and its derivatives
5.4.2 Front-end
We employed two different types of features in our framework: MFCCs (section
3.1.1) and LFCCs (section 3.1.2). The computation of MFCCs is described in
figure 5.1. LFCCs are acquired by the same process as MFCCs except that they
are warped by a linear frequency band rather than mel-frequency warping. Details
of each step are:
Pre-emphasis Pre-emphasis refers to the process of increasing the magnitude of higher frequencies with respect to lower frequencies. Since speech contains more energy in the low frequencies, pre-emphasis helps to flatten the signal and to remove some glottal effects from the vocal tract parameters. On the other hand, it may increase noise in the high frequency range. Perhaps the most frequently used form of pre-emphasis is the first-order differentiator (single-zero filter):

\tilde{x}[n] = x[n] - \alpha x[n-1]    (5.1)

where α usually ranges from 0.95 to 0.97. In our framework, we use α = 0.97.
Frame Blocking As we use a short-time analysis technique to process speech (section 2.4), in this step the speech signal is blocked into frames; each frame contains N samples and advances M samples from the previous frame (M < N). As a result, adjacent frames overlap by N − M samples. The signal is processed until all samples fall in one or more frames, and the last frame is padded with zeros to a length of exactly N samples. Typically, N corresponds to 20 to 30 ms of speech, and M is about half of N.
Windowing As frame blocking breaks the continuity at the beginning and the end of each frame, each frame is multiplied by a window function to reduce these discontinuities, providing smooth transitions between frames. A window function
Figure 5.2: Hamming and Hanning windows of length 64
After this step, we compute |X_m[k]|^2 for all frames, resulting in a short-time spectrum of the original signal.
Mel-frequency warping The spectrum of each frame is warped by a bank of B filters (equation 3.6) to obtain the mel-frequency spectrum:

S_m[b] = \sum_{k=0}^{N-1} |X_m[k]|^2 H_b[k], \quad b = 0, 1, ..., B - 1    (5.6)
DCT Finally, the mel cepstrum is acquired from the mel spectrum using the DCT:

\hat{x}_m[n] = \sum_{b=0}^{B-1} \ln(S_m[b]) \cos\left(\frac{\pi n (b + \frac{1}{2})}{B}\right), \quad n = 0, 1, ..., B - 1    (5.7)

In our framework, we keep the first K cepstral coefficients, excluding the zeroth one, as MFCCs.
Differential The MFCCs are often referred to as static features since they only contain information about their current frame. In order to capture temporal relations, the cepstral coefficients are augmented with their first and second order derivatives. The first order derivatives are called delta coefficients, and the second order ones delta-delta coefficients. Delta coefficients are computed from the cepstral coefficients as follows:

\Delta c_t = \frac{\sum_{\tau=1}^{D} \tau (c_{t+\tau} - c_{t-\tau})}{2 \sum_{\tau=1}^{D} \tau^2}    (5.9)

D is the size of the delta window and is normally chosen as 1 or 2. Consequently, delta-delta coefficients are computed as the derivatives of the delta coefficients. While delta coefficients provide information about the rate of speech, delta-delta coefficients disclose knowledge about its acceleration.
In practice, delta and delta-delta coefficients are appended to the cepstral coefficients to extend them with dynamic features. Moreover, the energy (the sum of spectral values in one frame) and its derivatives can also be incorporated as features. For example, a 38-dimensional MFCC vector was used as the feature for a speaker recognition system, which included 12 MFCCs, 12 delta MFCCs, 12 delta-delta MFCCs, the log energy, the log energy derivative (delta energy) and the second log energy derivative (delta-delta energy) [44].
Moreover, the feature vectors of multiple frames can be concatenated to form the input of the next step. A context of size C is added to the current frame by appending the features of its C neighbouring frames on each side. The parameters of our front-end are summarized in table 5.6.
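A possible numpy implementation of the delta computation in equation 5.9 and of context stacking is sketched below; handling edge frames by repetition is one common convention and an assumption of ours, not necessarily the one used in our framework.

import numpy as np

def delta(ceps, D=2):
    # delta coefficients per equation (5.9); ceps has shape (n_frames, n_ceps)
    padded = np.pad(ceps, ((D, D), (0, 0)), mode='edge')
    num = sum(tau * (padded[D + tau:len(ceps) + D + tau] -
                     padded[D - tau:len(ceps) + D - tau]) for tau in range(1, D + 1))
    return num / (2 * sum(tau ** 2 for tau in range(1, D + 1)))

def add_context(feats, C=3):
    # stack each frame with its C neighbours on each side (2C+1 frames in total)
    padded = np.pad(feats, ((C, C), (0, 0)), mode='edge')
    return np.hstack([padded[i:i + len(feats)] for i in range(2 * C + 1)])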
5.4.3 Back-end
Lying at the heart of our speaker identification framework is a DNN with a bidirectional recurrent layer, inspired by the model in [25]. Hannun et al.'s model was composed of 5 hidden layers, but in our model the number of hidden layers is
Parameter Meaning
frame length Number of samples in one frame
frame step Number of samples advanced between frames
window Type of window
no ffts Number of bins in DFT
no fbs Number of filters in mel/linear filterbank
min freq The minimum frequency in the filterbank
max freq The maximum frequency in the filterbank
no ceps Number of kept cepstral coefficients
preemphasis coefficient Pre-emphasis coefficient
lifter coefficient Lifter coefficient
type Type of features, e.g., MFCC, MFCC+∆, ...
context size Number of feature vectors at each side
to be added as context
left as a parameter. For all terms regarding ANNs, we refer the reader to section
4.1.
Let L be the number of layers in our model, where layer 0 is the input and
layer L is the output layer. Then, the bidirectional recurrent layer is placed at
position L − 2. Figure 5.3 illustrates the structure of our model. The input of
the DNN is a sequence of speech features of length T , x = x0 , x1 , ..., xT −1 , where
xt denotes the feature vector at frame t, t = 0...T − 1.
For the first L − 3 layers, the output of the l-th layer at time t is computed as:

h_t^{(l)} = \varphi(W^{(l)} \cdot h_t^{(l-1)} + b^{(l)})    (5.10)

where W^{(l)} and b^{(l)} are the weight matrix and the bias vector of layer l, and ϕ is the activation function, selected from sigmoid, tanh and ReL.
The recurrent layer is decomposed into two separate layers for the forward and the backward passes (figure 5.4):

h_t^{(f)} = \varphi(W^{(L-2)} \cdot h_t^{(l-1)} + W^{(f)} \cdot h_{t-1}^{(f)} + b^{(f)})    (5.11)

h_t^{(b)} = \varphi(W^{(L-2)} \cdot h_t^{(l-1)} + W^{(b)} \cdot h_{t+1}^{(b)} + b^{(b)})    (5.12)

Note that h_t^{(f)} must be computed in order t = 0, ..., T − 1, while h_t^{(b)} must be computed in reverse order t = T − 1, ..., 0. The output of this layer is simply the combination (sum) of the forward and the backward outputs:

h_t^{(L-2)} = h_t^{(f)} + h_t^{(b)}    (5.13)

and the output layer is a softmax layer in which each neuron's output predicts the probability that the corresponding speaker in the training set is the target:

h_t^{(L)} = \mathrm{softmax}(W^{(L)} \cdot h_t^{(L-1)} + b^{(L)})    (5.15)
Figure 5.3: The structure of our DNN model
where:

\mathrm{softmax}(z)_j = \frac{\exp(z_j)}{\sum_k \exp(z_k)}    (5.16)

Here z_j represents the j-th column of vector z.
Given a training dataset of S speakers, the DNN simply classifies the target speaker of frame t as the one that maximizes the conditional probability:

\hat{s}_t = \arg\max_{1 \le s \le S} P(s \mid x_t), \quad \text{where } P(s \mid x_t) = \hat{y}_{t,s} = h_{t,s}^{(L)}    (5.17)

The predicted speaker for the whole speech sequence x is determined by simple voting: the per-frame outputs are summed over all frames and normalized, and the speaker with the highest score is selected.
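One simple way to realize this voting is to average the per-frame posteriors over the utterance and take the arg max, as in the small sketch below (an illustration, not necessarily the exact rule used in our implementation).

import numpy as np

def predict_speaker(frame_posteriors):
    # frame_posteriors: (T, S) softmax outputs; average over frames, then arg max
    return int(np.argmax(frame_posteriors.mean(axis=0)))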
The DNN model is trained using the backpropagation algorithm to minimize the mean squared or cross entropy error. Parameter updates are performed in batches, where each batch contains about 500 frames. We implemented three different update methods:

Gradient descent

\theta_{t+1} = \theta_t - \eta \frac{\partial E}{\partial \theta_t}    (5.18)

Momentum [49]

v_{t+1} = \mu v_t - \eta \frac{\partial E}{\partial \theta_t}    (5.19)

\theta_{t+1} = \theta_t + v_{t+1}    (5.20)
In the Nesterov variant of momentum, the gradient is evaluated at the look-ahead point \theta_t + \mu v_t:

v_{t+1} = \mu v_t - \eta \frac{\partial E}{\partial (\theta_t + \mu v_t)}    (5.21)

\theta_{t+1} = \theta_t + v_{t+1}    (5.22)

RMSprop For RMSprop, k is the decay rate of the running average of squared gradients; its typical values are 0.9, 0.99 or 0.999.
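The configuration in section 5.4.4 selects an RMSprop update rule with decay_rate 0.99; assuming the commonly used form of RMSprop, in which the decay rate k maintains a running average of squared gradients, one step of the update can be sketched as follows.

import numpy as np

def rmsprop_update(theta, grad, cache, lr=1e-4, k=0.99, eps=1e-8):
    # keep a decaying average of squared gradients and scale the step by its square root
    cache = k * cache + (1.0 - k) * grad ** 2
    theta = theta - lr * grad / (np.sqrt(cache) + eps)
    return theta, cache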
To avoid overfitting, during training we use L2 regularization and/or dropout [62], which drops some neurons (i.e., sets their outputs to zero) with some probability p. Figure 5.5 illustrates the idea of dropout. Dropout is applied in all layers except the input and the recurrent layer.
For reference, all hyperparameters of the back-end are summarized in table 5.7.
Figure 5.5: The visualization of dropout (adapted from [62])
!!python/object:sre.train.Train
name: ’template’
system: !!python/object/new:sre.sresystem.SRESystem
kwds:
frontend: !!python/object:sre.frontend.Frontend
preemphasis_coefficient: 0.97
frame_length: 320
frame_overlap: 160
max_freq: 6000
min_freq: 300
no_fbs: 26
no_ffts: 512
window_func: !!python/name:numpy.lib.function_base.hanning ’’
no_ceps: 13
lifter_coefficient: 22
context_size: 0
type: ’mfcc’
backend: !!python/object/new:sre.backend.Backend
kwds:
no_hiddens: [50, 50, 50]
no_input: 13
no_output: 20
activation: !!python/name:sre.backend.rel ’’
algorithm: !!python/object/new:nn.training_algorithms.TrainingAlgorithm
kwds:
rate: 0.001
cost_function: !!python/name:nn.costs.cross_entropy_error ’’
update_rule: !!python/object/new:nn.updates.RMSprop
kwds:
decay_rate: 0.99
train_path: ’/home/timit/timit-train.list’
trial_path: ’/home/timit/timit-trial.list’
batch_size: 500
max_epochs: 1000
eval_epoch: 20
Figure 5.6: An example of initializing and training a speaker identification system. Here the front-end returns 13 MFCCs and the back-end has 3 hidden layers with 50 neurons at each level.
Front-end Description
MFCC23 23 MFCCs
LFCC23 23 LFCCs
MFCC∆38 19 MFCCs and 19 delta MFCCs
MFCC∆∆38 12 MFCCs, 12 delta MFCCs, 12 delta-delta MFCCs,
delta energy and delta-delta energy
require a validation set as the stopping condition, we decided to use 2 files as validation data. The number of files in the training data remains 5, since the amount of training data has a great influence on identification performance. As a result, for each speaker, 5 sentences are used as training data, 2 sentences as validation data and the remaining 3 sentences as 3 evaluation tests. The reason we use a validation dataset is that other types of stopping conditions (a pre-defined number of epochs, a minimum loss value) vary between different population sizes, making it hard to choose a good stopping value.
In the reference paper, it seems that the authors used only one evaluation set for each population size, but we do not know exactly which speakers and which files they used in each part. Therefore, in order to compare our results with their performance, for each population size both speaker selection and data division (training, validation, test) are performed randomly three times, resulting in three different evaluation sets. Each testing configuration is trained and evaluated on all three sets, and the reported accuracy is the mean over all test cases.
Each speech utterance is processed using a 20 ms Hanning window every 10 ms. The DFT spectrum is then warped by a mel/linear filterbank of 26 filters with a limiting frequency range of 300-3140 Hz. Delta and delta-delta cepstral coefficients are computed with a delta window of size 2.
We select four types of front-ends to extract features from the speech data; their details are described in table 5.8. In addition to the proposed RNN framework, we also implement an FNN back-end in order to compare their power in the speaker identification task. Their combinations result in 6 testing systems. All systems have a 3-hidden-layer structure and use ReL activation functions. Since the number of testing speakers affects the size of our neural network models, the size of the systems in each case is chosen according to table 5.9.
Each system is trained using RMSprop with cross entropy error, and the L2 regularization parameter is 0.001. During training, a dropout rate of 5% is applied. RNN-structured systems are trained with learning rate 10^{-4}, while the FNN systems are trained with learning rate 10^{-3}. The best system is chosen based on its accuracy on the validation set, and the training process stops if the accuracy does not improve for 20 epochs or the number of training cycles reaches 1000.
Experiment results with a single frame as input (context size 0) are summarized in table 5.10.
System 5 speakers 10 speakers 20 speakers
RNN+MFCC23 84.4% 82.2% 86.7%
RNN+LFCC23 95.5% 86.7% 85.6%
RNN+MFCC∆38 88.9% 90.0% 83.3%
RNN+MFCC∆∆38 91.1% 88.9% 83.9%
FNN+MFCC23 93.3% 94.4% 88.3%
FNN+MFCC∆∆38 95.5% 97.8% 90.5%
In comparison with the reference systems (table 5.1), all of our systems only yield higher accuracy than the reference decision tree algorithms ID3, C4 and CART. However, the poor results are due to the difference between validation accuracy and evaluation accuracy. Using only two files for validation, in all cases our systems easily achieve high accuracy on the validation set (above 95%), and in more than half of them they reach 100% accuracy on the validation data and training stops. Using more data for validation and evaluation may alleviate this problem. This is definitely a disadvantage of ANNs in comparison with other methods, since a part of the data (normally taken from the training set) is needed for validation.
Our best system is FNN+MFCC∆∆38. Interestingly, RNN-structured systems do not show any superiority over FNN-structured systems in this task. Although FNN systems have more free parameters than RNN ones, it only takes 45 minutes to train an FNN for 1000 epochs, compared with 9 hours for an RNN on the data of 20 speakers. In some cases, the accuracy in the 5-speaker test condition is lower than in the 10-speaker condition, since it is too easy to reach 100% accuracy on the validation dataset of 5 speakers. Otherwise, the accuracy drops when the number of speakers is 20.
We also investigate the influence of the context size on the identification results. Four systems are tested with context sizes 1, 2 and 3, corresponding to 3, 5 and 7 consecutive frames as input. The results are presented in table 5.11. For the two RNN-structured systems, the best results are achieved with longer contexts. One explanation for this phenomenon is that the longer the input vector, the more finely the parameters must be tuned to identify correctly. Therefore, it is not as easy as before to reach high accuracy on the validation set, so the gap between validation accuracy and test accuracy is reduced. However, there is an opposite trend for the two FNN-structured systems, as their performance decreases when the size of the context increases, especially when the number of speakers is 20. Since we keep the same configuration while increasing the input context, this may be a sign of overfitting, as our models achieve high accuracy on the validation set but do not generalize well.
System Context 5 speakers 10 speakers 20 speakers
RNN+MFCC23 0 84.4% 82.2% 86.7%
RNN+MFCC23 1 82.2% 92.2% 89.4%
RNN+MFCC23 2 86.7% 86.7% 90.6%
RNN+MFCC23 3 91.1% 92.2% 90.0%
RNN+MFCC∆∆38 0 91.1% 88.9% 83.9%
RNN+MFCC∆∆38 1 86.7% 86.7% 88.9%
RNN+MFCC∆∆38 2 88.9% 91.1% 90.0%
RNN+MFCC∆∆38 3 95.5% 96.7% 92.8%
FNN+MFCC23 0 93.3% 94.4% 88.3%
FNN+MFCC23 1 93.3% 94.5% 81.1%
FNN+MFCC23 2 91.1% 97.8% 76.1%
FNN+MFCC23 3 93.3% 93.3% 85.6%
FNN+MFCC∆∆38 0 95.5% 97.8% 90.5%
FNN+MFCC∆∆38 1 86.7% 98.9% 85.6%
FNN+MFCC∆∆38 2 97.8% 94.5% 85.6%
FNN+MFCC∆∆38 3 97.8% 97.8% 79.4%
Table 5.11: Identification accuracy with different input context sizes. The best
context size in each case is marked as bold.
described in section 5.5.1, and each test condition is repeated three times.
Of the four systems chosen for this experiment, the two RNN-structured systems use a context of size 3, and the two FNN-structured systems are evaluated without context. The performance of our four systems is summarized in figure 5.7a. With the same type of front-end, RNN systems tend to identify better than FNN systems. On the other hand, MFCCs with dynamic features (MFCC∆∆38) outperform plain MFCCs with both types of back-ends.
Moreover, the performance of our two best systems, RNN+MFCC∆∆38 and FNN+MFCC∆∆38, is compared with the two best systems in [19], full-search VQ and the modified neural tree network, under the same test conditions (figure 5.7b). There is only a small difference between full-search VQ, RNN+MFCC∆∆38 and FNN+MFCC∆∆38 when the number of training files is between 2 and 4. Otherwise, full-search VQ is still the most efficient system when using 1 file as training data, with about 72% accuracy. The gap between all systems narrows as more training files are used.
Figure 5.7: Identification accuracy of our systems and the two best systems from [19] (MNTN: modified neural tree network, FSVQ: full-search VQ) as a function of training duration
Figure 5.8: Identification accuracy as a function of population size
systems' accuracy drops sharply when the number of testing speakers is 60, which might be due to the data division. Otherwise, the results of this experiment agree with those of our second experiment (section 5.5.2): RNN systems yield higher accuracy than FNN systems, and using dynamic features improves system performance. The accuracy of FNN+MFCC23 decreases at the highest rate.
• RNN: the same type of RNN model as in our previous experiments. Its hidden layers have 20 perceptrons.
– A max pooling layer with shape (2, 2)
– An output layer with a softmax activation function
The input of the CNN model has shape (15, 38), obtained by stacking 15 frames of MFCC∆∆38.
Both systems are trained with the RMSprop algorithm with learning rate 10^{-4}. In the end, the RNN system achieves an accuracy of 97.9%, and the CNN system an accuracy of 98.3%. Both systems reach 100% accuracy on the validation set in fewer than 100 epochs.
Chapter 6
• With a GPU-supported scientific computing library like Theano, ANNs can be implemented effectively without spending too much effort (gradients are computed automatically rather than by manually propagating errors).
Cons:
• Exploring other types of DNNs: long short-term memory networks and deep belief networks are DNN structures that have more effective training schemes than plain RNNs.
Bibliography
[1] Sayed Jaafer Abdallah, Izzeldin Mohamed Osman, and Mohamed Elhafiz
Mustafa. Text-independent speaker identification using hidden Markov mod-
el. World of Computer Science and Information Technology Journal (WC-
SIT), 2(6), 2012.
[7] James Bergstra, Olivier Breuleux, Frédéric Bastien, Pascal Lamblin, Raz-
van Pascanu, Guillaume Desjardins, Joseph Turian, David Warde-Farley,
and Yoshua Bengio. Theano: a CPU and GPU math expression compiler.
In Proceedings of the Python for Scientific Computing Conference (SciPy),
2010. Oral Presentation.
[8] B. Bogert, M. Healy, and J. Tukey. The quefrency alanysis of time series for echoes: Cepstrum, pseudo-autocovariance, cross-cepstrum and saphe cracking. In Proceedings of the Symposium on Time Series Analysis, pages 209–243. John Wiley and Sons, Inc., 1963.
[9] J.P. Campbell, Jr. Speaker recognition: A tutorial. Proceedings of the IEEE,
85(9):1437–1462, 9 1997. ISSN 0018-9219. doi: 10.1109/5.628714.
[10] Paul Cuff. ELE 201: Information signals - course notes, 2015. URL http:
//www.princeton.edu/~cuff/ele201/kulkarni.html.
[11] George Cybenko. Approximation by superpositions of a sigmoidal function.
Mathematics of control, signals and systems, 2(4):303–314, 1989.
[13] Najim Dehak, Patrick Kenny, Réda Dehak, Pierre Dumouchel, and Pierre
Ouellet. Front-end factor analysis for speaker verification. Audio, Speech,
and Language Processing, IEEE Transactions on, 19(4):788–798, 2011.
[14] John R. Deller, Jr., John G. Proakis, and John H. Hansen. Discrete Time
Processing of Speech Signals. Prentice Hall PTR, Upper Saddle River, NJ,
USA, 1st edition, 1993. ISBN 0023283017.
[15] Li Deng and Dong Yu. Deep learning: Methods and applications. Now
Publisher Inc, 2014. ISBN 1601988141.
[18] Zheng Fang, Zhang Guoliang, and Song Zhanjiang. Comparison of different
implementations of MFCC. Journal of Computer Science and Technology,
16(6):582–589, 11 2001. ISSN 1000-9000.
[22] Xavier Glorot and Yoshua Bengio. Understanding the difficulty of training
deep feedforward neural networks. In International conference on artificial
intelligence and statistics, pages 249–256, 2010.
[25] Awni Hannun, Carl Case, Jared Casper, Bryan Catanzaro, Greg Diamos,
Erich Elsen, Ryan Prenger, Sanjeev Satheesh, Shubho Sengupta, Adam
Coates, and Andrew Y. Ng. Deepspeech: Scaling up end-to-end speech
recognition. arXiv preprint arXiv:1412.5567, 2014.
[26] A. L. Higgins, L. G. Bahler, and J. E. Porter. Voice identification using
nearest-neighbor distance measure. In Acoustics, Speech, and Signal Process-
ing, 1993. ICASSP-93., 1993 IEEE International Conference on, volume 2,
pages 375–378. IEEE, 1993.
[27] Geoffrey E. Hinton and Ruslan R. Salakhutdinov. Reducing the dimension-
ality of data with neural networks. Science, 313(5786):504–507, 2006.
[28] Geoffrey E. Hinton, Simon Osindero, and Yee-Whye Teh. A fast learning
algorithm for deep belief nets. Neural computation, 18(7):1527–1554, 2006.
[29] S. Hochreiter. Untersuchungen zu dynamischen neuronalen Netzen. Diploma
thesis, Institut f. Informatik, Technische Univ. Munich, 1991.
[30] Kurt Hornik. Approximation capabilities of multilayer feedforward networks.
Neural networks, 4(2):251–257, 1991.
[31] Xuedong Huang, Alex Acero, and Hsiao-Wuen Hon. Spoken Language Pro-
cessing: A Guide to Theory, Algorithm, and System Development. Prentice
Hall PTR, 1st edition, 2001. ISBN 0130226165.
[32] Charles Jankowski, Ashok Kalyanswamy, Sara Basson, and Judith Spitz.
NTIMIT: A phonetically balanced, continuous speech, telephone band-
width speech database. In Acoustics, Speech, and Signal Processing, 1990.
ICASSP-90., 1990 International Conference on, pages 109–112. IEEE, 1990.
[33] Ye Jiang, Kong-Aik Lee, Zhenmin Tang, Bin Ma, Anthony Larcher, and
Haizhou Li. PLDA modeling in i-vector and supervector space for speaker
verification. In INTERSPEECH, 2012.
[34] J. Kacur, R. Vargic, and P. Mulinka. Speaker identification by k-nearest
neighbors: Application of PCA and LDA prior to KNN. In Systems, Signals
and Image Processing (IWSSIP), 2011 18th International Conference on,
pages 1–4, 6 2011.
[35] Patrick Kenny, Gilles Boulianne, Pierre Ouellet, and Pierre Dumouchel.
Speaker and session variability in GMM-based speaker verification. Audio,
Speech, and Language Processing, IEEE Transactions on, 15(4):1448–1460,
2007.
[36] Yochai Konig, Larry Heck, Mitch Weintraub, and Kemal Sonmez. Nonlinear
discriminant feature extraction for robust text-independent speaker recog-
nition. In Proc. RLA2C, ESCA workshop on Speaker Recognition and its
Commercial and Forensic Applications, pages 72–75, 1998.
[37] Jacques Koreman, Dalei Wu, and Andrew C. Morris. Enhancing speaker
discrimination at the feature level. In Speaker Classification I, pages 260–
277. Springer, 2007.
[38] LISA lab. Convolutional neural networks (LeNet), 2015. URL http://
deeplearning.net/tutorial/lenet.html.
[39] Yann A. LeCun, Léon Bottou, Genevieve B. Orr, and Klaus-Robert Müller.
Efficient backprop. In Neural networks: Tricks of the trade, pages 9–48.
Springer, 2012.
[40] Howard Lei and Eduardo López Gonzalo. Mel, linear, and antimel frequency
cepstral coefficients in broad phonetic regions for telephone speaker recogni-
tion. In INTERSPEECH, pages 2323–2326, 2009.
[42] Andrew L. Maas, Awni Y. Hannun, and Andrew Y. Ng. Rectifier nonlinear-
ities improve neural network acoustic models. In Proc. ICML, volume 30,
2013.
[44] Jorge Martinez, Hector Perez, Enrique Escamilla, and Masahisa Mabo Suzu-
ki. Speaker recognition using Mel frequency cepstral coefficients (MFCC) and
vector quantization (VQ) techniques. In CONIELECOMP 2012, 22nd In-
ternational Conference on Electrical Communications and Computers, 2012.
[50] Alan B. Poritz. Linear predictive hidden Markov models and the speech sig-
nal. In Acoustics, Speech, and Signal Processing, IEEE International Con-
ference on ICASSP’82., volume 7, pages 1291–1294. IEEE, 1982.
[52] Thomas F. Quatieri. Discrete-Time Speech Signal Processing: Principles
and Practice. Prentice Hall, 2001.
[53] Lawrence R. Rabiner and Ronald W. Schafer. Introduction to Digital Speech
Processing. Now Publishers Inc, 2007.
[54] Douglas A. Reynolds. Speaker identification and verification using gaus-
sian mixture speaker models. The Lincoln Laboratory Journal, 8(2):173–192,
1995.
[55] Douglas A. Reynolds. Automatic speaker recognition: Current approaches
and future trends. Speaker Verification: From Research to Reality, pages
14–15, 2001.
[56] Douglas A. Reynolds and Richard C. Rose. Robust text-independent speaker
identification using Gaussian mixture speaker models. Speech and Audio
Processing, IEEE Transactions on, 3(1):72–83, 1995.
[57] Douglas A. Reynolds, Thomas F. Quatieri, and Robert B. Dunn. Speaker
verification using adapted gaussian mixture models. Digital signal processing,
10(1):19–41, 2000.
[58] Fred Richardson, Douglas Reynolds, and Najim Dehak. Deep neural network
approaches to speaker and language recognition. 2015.
[59] Laszlo Rudasi and Stephen A Zahorian. Text-independent talker identifi-
cation with neural networks. In Acoustics, Speech, and Signal Processing,
1991. ICASSP-91., 1991 International Conference on, pages 389–392. IEEE,
1991.
[60] Jürgen Schmidhuber. Deep learning in neural networks: An overview. Neural
Networks, 61:85–117, 2015.
[61] Mike Schuster and Kuldip K. Paliwal. Bidirectional recurrent neural net-
works. Signal Processing, IEEE Transactions on, 45(11):2673–2681, 1997.
[62] Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and
Ruslan Salakhutdinov. Dropout: A simple way to prevent neural networks
from overfitting. The Journal of Machine Learning Research, 15(1):1929–
1958, 2014.
[63] S. S. Stevens and J. Volkmann. The relation of pitch to frequency: A revised
scale. The American Journal of Psychology, 53(3):329–353, 1940.
[64] Naftali Z. Tisby. On the application of mixture AR hidden Markov models to
text independent speaker recognition. Signal Processing, IEEE Transactions
on, 39(3):563–570, 1991.
[65] Vibha Tiwari. MFCC and its applications in speaker recognition. Interna-
tional Journal on Emerging Technologies, 1(1):19–22, 2010.
[66] R. Togneri and D. Pullella. An overview of speaker identification: Accuracy
and robustness issues. Circuits and Systems Magazine, IEEE, 11(2):23–61,
2011. ISSN 1531-636X.
[67] James S. Walker and Gary W. Don. Mathematics and Music: Composition,
Perception, and Performance. Chapman and Hall/CRC, 2013.
[71] Victor Zue, Stephanie Seneff, and James Glass. Speech database develop-
ment at MIT: TIMIT and beyond. Speech Communication, 9(4):351–356,
1990.
List of Figures
4.1 A perceptron
4.2 A feedforward neural network with one hidden layer
4.3 A simple recurrent neural network
4.4 A bidirectional recurrent neural network unfolded in time
4.5 An illustration of 3-dimensional convolution (adapted from [38])
4.6 Sigmoid and tanh function
5.1 The process to convert speech signals into MFCC and its derivatives
5.2 Hamming and Hanning windows of length 64
5.3 The structure of our DNN model
5.4 A closer look at the recurrent layer
5.5 The visualization of dropout (adapted from [62])
5.6 An example of initializing and training a speaker identification system. Here the front-end returns 13 MFCCs and the back-end has 3 hidden layers with 50 neurons at each level.
5.7 Identification accuracy of our systems and two best systems from [19] (MNTN: modified neural tree network, FSVQ: full-search VQ) as a function of training duration
5.8 Identification accuracy as a function of population size
List of Tables
List of Abbreviations
EM Expectation maximization
FS Fourier Series
FT Fourier Transform
ML Maximum likelihood
UBM Universal background model
VQ Vector quantization