ALIZE Evaluation
February of 201
Under supervision of
Univ. Prof. Dr. Hermann Kaindl
and
Dr. Dominik Ertl
Abstract
Text-independent speaker verification is the computing task of verifying a user's claimed identity using only characteristics extracted from their voice, regardless of the spoken text. Nowadays, many speaker verification applications are implemented in software, and using these systems on embedded systems (PDAs, cell phones, integrated computers) multiplies their potential in security, automotive, or entertainment applications, among others. Comprehension of speaker verification requires knowledge of voice processing and a high mathematical level. Embedded system performance is not the same as that offered by a workstation, so in-depth knowledge of the target platform where the system will be implemented, and of the cross-compilation tools necessary to adapt the software to the new platform, is required, too. Execution time and memory requirements also have to be taken into account to get a good quality of speaker verification.
In this thesis we evaluate the performance and viability of a speaker verification software on an embedded system. We present a comprehensive study of the toolkit and the target embedded system. The verification system used in this thesis is the ALIZE / LIA_RAL toolkit. This software is able to recognize the identity of a client previously trained in a database, and works independently of the text spoken. We have tested the toolkit on a 32-bit RISC ARM architecture computer. We expect that the toolkit can be ported to comparable embedded systems with a reasonable effort.
The findings confirm that the speaker verification results on a workstation are comparable to those on an embedded system. However, time and memory requirements are not the same on both platforms. Taking these results into account, we propose an optimization of the speaker verification test to reduce resource requirements.
Resumen
Text-independent speaker verification is the task of validating a user's identity using only characteristics extracted from their voice, without taking the spoken text into account. Nowadays, a great deal of speaker verification software has been implemented to run on personal computers, but using these applications on embedded systems (smartphones, telephones, integrated computers) multiplies their potential in fields such as security, the automotive sector, or other entertainment applications. The theoretical comprehension of speaker verification systems requires knowledge of voice processing and a high level of algorithmic mathematics. The performance of these embedded systems is not the same as that offered by personal computers, so in-depth knowledge of the platform on which the application will be integrated is needed, as well as knowledge of the cross-compilation tools necessary to adapt the software to the new platform. Time and memory requirements must also be taken into account to guarantee a good verification quality.
In this project, the performance and viability of a speaker verification system integrated on an embedded system is evaluated. A comprehensive study of the software tools is presented, as well as of the target platform used. The verification system used in this project is the ALIZE / LIA_RAL toolkit. This software is able to recognize the identity of a previously trained client stored in a database, and works independently of the spoken text. The software has been tested on a test machine with a 32-bit RISC ARM processor, but the system could be ported to other systems without additional problems.
The findings of the project confirm that the verification results on an embedded system are similar to those obtained on the PC. However, the time and memory requirements are not the same on the two platforms. Taking these results into account, an optimization of the configuration parameters used in the testing process is proposed to considerably reduce the resources used.
Contents

List of Figures
List of Tables
1 Introduction
1.3 Working Plan
1.4 Overview
2 Background and State-of-the-Art
2.3 Embedded Systems
3 The Process of Speaker Verification
3.1 Overview
3.3 Training Phase
3.4 Testing Phase
4 ALIZE Library and LIA-RAL Toolkit
4.1 Overview
4.2 ALIZE Library
4.3 LIA_RAL Toolkit
5 Open Embedded as Operating System
6 Configuration of ALIZE / LIA_RAL
6.1 Shell-script Compilation
6.2 Configuration Parameters
6.3 Performance Optimization
7 Tests and Results
7.4 Results
8 Discussion
9 Conclusion and Outlook
9.1 Conclusion
9.2 Outlook
Bibliography
A SD Formatting code
List of Figures

1.1
2.1
2.2
2.3
2.4
3.1
3.2
3.3
4.1 The LIA_RAL package is based on the ALIZE low-level library, and contains all the methods used in this master thesis.
5.1
5.2
5.3
7.1 False Rejections (left) and False Acceptances (right) curves for 1024 Gaussians.
7.2 Equal Error Rate point in the 1024 GMM test.
7.3 Recognition performance for each GMM count on ELSDSR data set using 15 world iterations.
List of Tables

2.1
6.2
6.3
6.4
6.5
7.1
7.2
7.3 Feature extraction and signal processing execution times (in seconds).
Chapter 1
Introduction
In our everyday lives there are many types of interfaces for human-computer interaction, for instance graphical user interfaces, speech interfaces, etc. Among these types, speech input is regarded as a very powerful form because of its rich character. Beyond the content of the spoken input (the words), rich character also refers to the gender, attitude, emotion, health situation and identity of a speaker. Such information is very important for an efficient communicative interaction between humans and gains importance for human-computer interaction (HCI).
From the signal processing point of view, speech input can be characterized in terms of the
signal carrying message information. The waveform could be one of the representations
of speech, and this kind of signal is very useful in practical applications. We can get three
types of information from a speech signal: Speech text (transcription of the message),
language (English, German, etc.) and speaker identity (person who utters the message)
[7]. In Figure 1.1, we can see these three main types illustrated.
Due to its large number of applications, Automatic Speaker Verification (ASV) has become an attractive domain of study in the area of signal and information processing in the last decades. In combination with Automatic Speech Recognition (ASR), machines capable of understanding humans have been in the minds of scientists and researchers for years [4]. More recently, significant steps have been made in this direction, for example using Hidden Markov Models (HMM) for speech and speaker recognition [14, 17].
Currently, applications that incorporate ASV or ASR are used in many areas (robotics, telephony, entertainment). Electronic devices capable of speech recognition evolve towards miniaturization (mobile phones, laptops, microcontrollers, etc.), which can lead to a decrease in available device resources (processing power, memory, etc.). Potentially, this
1 Recognition rate refers to the quality of the SR process, i.e., the percentage of speakers that a recognition tool can recognize correctly.
can be more efficient than carrying keys or memorizing PIN codes. SR for surveillance may help one to filter a high amount of data, for example telephone or radio conversations. SR in forensics can help to prove the identity of a recorded voice and can help to convict a criminal or discharge an innocent person in court.
When using such applications, one might not want to carry a laptop or a desktop system, which makes an embedded device suited as an underlying platform. Accordingly, the objectives of this thesis are:
To study the historical perspectives and current research in the field of speaker verification software on embedded devices.
To choose a proper operating system and cross-compiling tools to port this software to an embedded device.
To present and improve source code of the ALIZE/ LIA_RAL software to get a
solution with a better performance.
To analyze the results after improving the software, and to compare them with
previous iterations, as well as with the results from a laptop.
To test the results on an embedded system and compare them with the results from the laptop.
To improve the source code, analyze the new results, and compare them again with the laptop results.
1.4 Overview
The material of this thesis is arranged into nine chapters as follows:
Chapter 1 introduces the work and gives a brief description of speaker verification.
Chapter 3 describes the four principal steps of the speaker verification process: feature extraction, training stage, testing stage and decision making.
Chapter 4 presents the tools ALIZE and LIA_RAL and describes them in detail.
Chapter 5 explains the embedded device used in this master thesis, its principal features and the cross-compilation process.
Chapter 7 describes the setup of the data used in testing and summarizes the results obtained before and after the improvements.
Chapter 8 discusses the speaker verification results that we gained from the embedded and laptop systems.
Chapter 2
Background and State-of-the-Art
In this section, we present background information and the state-of-the-art of speaker verification. First, we introduce the concept of biometrics. Then, we explain speech processing and speaker recognition, as well as the state-of-the-art of embedded systems. Finally, we discuss specific works on speaker verification.
surveillance are the most relevant applications where biometrics are used.
Table 2.1:

                     Cost       Ease of   Special Hardware   Low Maintenance
                     Effective  Use       Requirement        Costs
Fingerprints         Yes        Yes       Yes                No
Hand Geometry        No         Yes       Yes                No
Voice Verification   Yes        Yes       No                 Yes
Iris Scanning        No         No        Yes                No
Facial Recognition   Yes        Yes       Yes                No
age, mood, or environmental conditions are factors that can be extracted from a voice message. This extra data, which is independent of the message, can tell us about the credibility of the message, the authenticity of the speaker, or many other message properties.
Thanks to the advancements of electronic and software technology, two fields of research have received more attention in recent years due to the advances in computational speed: speech recognition and speaker recognition. The first is the art of obtaining the transcript of the message, i.e., what the speaker is saying. It should ignore age, voice, or the speaker's sex and has to focus on the message content. The second, speaker recognition, determines who is speaking, ignoring the message content. Now, age, dialect or voice tone have to be taken into account, because they can help to identify the speaker and they can help us with the recognition. In both cases, environmental factors, such as noise or the quality of the recordings, can adversely influence the outcome of the process and have to be taken into account.
There are two types of speaker recognition, depending on the delivered message, namely text-dependent and text-independent recognition. Speaker identification combines speech recognition with speaker recognition. A speaker identification system requires a speaker
1 http://nextdoornerd.blogspot.com/
to pronounce a given text, with which it will have carried out the training process. For security applications, a user can log in using a voice password. Text-independent systems are able to identify a trained speaker from a message different from the one used for training. Nowadays, most of the research focuses on speaker recognition for text-independent systems, which allows a much broader scope. Being text-independent allows one to perform SR in surveillance or access control applications that do not require a concrete statement of text.
Gaussian Mixture Models (GMM): Use a Gaussian sum distribution to model each
human.
An HMM is a statistical method which assumes that the system model is a Markov process of unknown parameters. The goal is to determine the unknown (hidden, hence the name) parameters of the chain from the observable parameters. The extracted parameters can be used to carry out subsequent analysis. It is very common to use HMMs in pattern recognition applications [17, 15].
HMMs are commonly used at the level of phonemes, creating one HMM per phoneme. Essentially, a state machine is created and it uses audio frames to determine the probability of the next state. In text-dependent systems, a concrete phoneme is expected. Here, a comparison process between a speech segment and a model is more precise. However, Rabiner et al. [16] showed that GMMs are more efficient than HMMs in text-independent systems. A GMM is a parametric probability density function represented as a weighted sum of Gaussian component densities [15]. GMMs will be explained in detail below.
In addition to the above techniques, there are others, such as neural networks or support vector machines, that are still under investigation in the field of speaker recognition, as shown in [19].
2 Any of the distinct units of sound that distinguish one word from another. (http://www.wordreference.com/denition/phoneme)
The Gaussian distribution is one of the most common probability density functions and models many natural, social or psychological events.
Given a speech segment X and a claimed speaker S, two hypotheses are considered:

H_0: X was uttered by the speaker S.
H_1: X was not uttered by the speaker S.

The decision is taken with the likelihood ratio

$$\frac{P(X|H_0)}{P(X|H_1)} \;\begin{cases} \geq \theta & \text{accept } H_0 \\ < \theta & \text{reject } H_0 \end{cases}$$

where P(X|H_i) is the likelihood of the hypothesis H_i given the voice segment X, and theta is the decision threshold for accepting or rejecting H_0.
3 The probability distribution of a random variable is a function that assigns to each event defined on the random variable the probability that the event occurs.
Theoretically, the threshold theta can be chosen to control the ratio between the probabilities of errors in the two possible directions of the decision. So, the main objective of a speaker recognition system is to use a given method to calculate both probabilities, P(X|H_0) and P(X|H_1).
A GMM models the likelihood P(X|H_i) as a finite sum of N distributions:

$$p(x|\lambda) = \sum_{i=1}^{N} w_i\, p_i(x)$$

where the p_i(x) are individual distributions (in the case of GMMs, Gaussian distributions) and the weights w_i determine how much influence each distribution has. Each Gaussian component is defined as
$$p_i(x) = \frac{1}{(2\pi)^{D/2}\,|\Sigma_i|^{1/2}} \exp\left\{ -\frac{1}{2}(x-\mu_i)'\,\Sigma_i^{-1}\,(x-\mu_i) \right\}$$

where D is the dimension of the feature vector, mu_i the mean vector and Sigma_i the covariance matrix of the i-th component. The constraint 0 <= p(x|lambda), p_i(x) <= 1 must hold, and the weights w_i must sum to 1 (e.g., 0.6 + 0.3 + 0.1 = 1), otherwise p(x|lambda) will not be a probability. It can easily be seen that a distribution with a large variance and a large weight will have a great influence on the mixture, as shown in Figure 2.3.
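As a purely illustrative toy example (the numbers are made up and one-dimensional, unlike the D-dimensional feature vectors used later), consider a two-component mixture with weights w_1 = 0.6 and w_2 = 0.4, components N(0, 1) and N(3, 2), evaluated at x = 0:

$$p(0|\lambda) = 0.6\,\mathcal{N}(0;\,0,\,1) + 0.4\,\mathcal{N}(0;\,3,\,2) \approx 0.6\cdot 0.399 + 0.4\cdot 0.030 \approx 0.25$$

The first, heavier component dominates the mixture value, which is exactly the effect described above.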
The application of GMMs to speaker verification requires at least two algorithms. First, some way of determining the means, variances and weights is needed, so that the GMM models a speaker's voice characteristics. Second, a method of evaluating audio segments using the model must be found, to verify the identity of the claimed speaker. Since the output of the GMM is a single probability, it has to be transformed into a yes or no, so as to accept or reject the hypothesis.
The model parameters are usually estimated with the Expectation Maximization (EM) algorithm [2], which iteratively re-estimates the model lambda so that the likelihood of the training data never decreases:

$$p(X|\lambda^{k+1}) \geq p(X|\lambda^{k})$$
ALIZE/LIA_RAL is open-source software, distributed under the GPL license, used specifically for speaker or speech recognition. This software was created in the Mistral project and is divided into the ALIZE library and the high-level LIA_RAL toolkit [12].
ALIZE contains all needed functions to use Gaussian mixtures and speech processing. The ALIZE project was created under supervision of the ELISA consortium. It aimed to share and update all SR research projects in France in order to create a strong and free common speaker recognition software. According to the documentation [12], the main objectives of ALIZE are:
4 N-dimensional vector of numerical features that denes an object. In speech processing, object is
referred to part of a recording, or speech segment.
5 http://mistral.univ-avignon.fr/index_en.html
6 http://elisa.ddl.ish-lyon.cnrs.fr/
the toolkit and both standard databases and protocols like NIST SRE
ones.
To facilitate the knowledge transfer between the academic labs and between
academic labs and companies.
LIA_RAL is a speaker verification toolkit that uses all low-level functions of the ALIZE library.
HTK (Hidden Markov Model Toolkit) is a well-known open-source toolkit used for speech recognition.
MARF (Modular Audio Recognition Framework) is another open-source recognition framework. Several applications have been created using MARF as the main library, for example: Math Testing Application,
7 http://www.itl.nist.gov/iad/mig/tests/sre/
8 The Viterbi algorithm is a dynamic programming algorithm for finding the most likely sequence of hidden states, called the Viterbi path, that results in a sequence of observed events, especially in the context of Markov information sources and hidden Markov models.
9 http://htk.eng.cam.ac.uk/
10 http://marf.sourceforge.net/
11 http://marf.sourceforge.net/#releases-dev
An embedded system is a computer system with one or more integrated devices on the same board, as seen in Figure 2.4. For example, an embedded system can be a PDA, a smart phone, or a built-in controller.
Nowadays, the industry is able to create physically smaller electronic systems that have the same technical features and functionality as larger systems had years ago. To achieve price competitiveness and small dimensions of hardware devices, personal computers or laptops can be replaced by embedded systems with the same performance. Since an embedded system is dedicated to concrete tasks, the electrical and size design can be adjusted to this particular function. Embedded systems often use a relatively small processor and a small memory to reduce costs. On the other hand, a failure in one element can mean repairing or replacing the full board.
We can find an embedded system in a lot of products:
In a factory, to control an assembly or production process: A machine that is responsible for a particular task contains numerous electronic and electrical systems for controlling motors or other mechanical devices, and all these factors have to be controlled by a microprocessor, which often comes with a human-computer interface. The embedded system in this case requires a lot of input and output connectors and robust features for continuous operation.
In aircraft radars: Processing a signal from an aircraft radar requires high computing power, small dimensions and low weight. Also, it has to be robust enough to work under extreme conditions, like high or low temperature, low atmospheric pressure, vibrations, etc.
In automotive technology: A car can include hundreds of microprocessors and microcontrollers that control ignition, transmission, power steering, the anti-lock braking system (ABS), traction, etc. All this functionality is integrated in embedded devices.
In this master thesis, we will use the ARM Cortex BeagleBoard embedded system. The BeagleBoard is an open-source, low-cost, low-power device based on the ARM Cortex-A8 chip. It is a project in development under the Creative Commons license, without official support except community forums or wiki pages.
A BeagleBoard has a lot of interfaces: USB, HDMI, Ethernet, MicroSD, JTAG or camera peripheral connections. Linux is the preferred operating system, so the board can be used for many applications. Apart from signal processing, embedded systems also allow one to design compact and cheap systems for many purposes. In this master thesis, we will port the software of a speaker verification system to an embedded system. This means the embedded system can do the same work as performed with a laptop: process the audio signal, train clients and test clients.
Speaker verification technology in embedded systems will face the following issues and challenges:
12 http://elinux.org/BeagleBoard
3. Computational power and memory are typically lower. Embedded systems are often designed to be low-cost boards, so the microprocessor power is not comparable with the power of a laptop. For example, with overclocking an ARM Cortex-A8 (the CPU in the BeagleBoard) we can get a 1 GHz clock frequency. Meanwhile, a common midrange CPU in a laptop or PC can easily work above 3 GHz. Moreover, the BeagleBoard is
A lot of signal processing research projects try to port specific software to embedded systems because such systems offer a lot of opportunities. In recent years, several projects have focused on audio processing on embedded systems. The following work is closely related to our thesis:
and discuss the individual algorithmic steps, the integer implementation issues and its
error analysis. The performance of the system is evaluated with data collected via iPaq
devices and the work discusses the impact of the model compression as well as integer
approximation on the accuracy.
13 https://gforge.ti.com/gf/project/tiesr/
Chapter 3
The Process of Speaker Verication
Before we explain the ALIZE / LIA_RAL software in more detail, we present the underlying theory of the speaker verification process. The steps presented in the following are commonly used in audio processing.
3.1 Overview
In this chapter, a review of the text-independent speaker verification process is presented. All verification or identification processes follow these steps:
1. Feature extraction is the first step in the Speaker Verification (SV) process. It consists of transforming a segment of speech data (wav, raw, etc.) into a parameter vector that is used in the next steps.
2. A background (world) model is trained using speech data from a large set of speakers.
3. In the training phase, this model and the input data of each client are combined using Maximum A Posteriori (MAP) adaptation.
4. In the testing phase we obtain a testing score of each client for each model. We perform the feature extraction of the input speech segment and compare it with the speaker database.
5. The last step in the process decides if the system should accept the speaker as a database member, or reject it. A threshold is chosen depending on the system purpose (low false acceptance ratio, or low false rejection ratio), and the system takes a decision by comparing the score obtained in the testing phase with this threshold.
Audio sampling is the first step of the data acquisition: the sound wave is converted into a sequence of discrete values. The term sampling refers to the reduction of a continuous signal to a discrete signal.
Labeling can be done, for example, with Hidden Markov Models. Basically, labeling means to assign a label or a name to every audio segment. In text-independent systems, it is interesting to process only the speech segments, because there is no available information in the silence segments. So it is usual to label speech segments and silence segments. Once the labeling is done, the next steps can be carried out only on speech segments, saving memory and CPU processing power.
Feature extraction. Normally, the linear audio samples gained from the sampling step are not enough for the characterization of a speaker. Noise or a change of voice leads to a different audio spectrum, so a more robust representation is needed. This step consists of extracting a real speaker feature vector from the previous sample values. These features are called Mel-Frequency Cepstral Coefficients (MFCC) and are coefficients for the representation of speech based on human hearing perception [18]. The extraction is improved by using the Mel scale, which approximates real human hearing perception.
First, the audio signal is sampled at 8 kHz as explained in the sampling step, labeled and then windowed. Normally, the signal is divided into segments of 25-30 ms, where the segments overlap each other by 15 ms. The next step is to switch to the frequency domain with the help of the Fast Fourier Transform (FFT). Then the signal is filtered using a filter bank of different frequencies to have a better resolution at low frequencies, which is comparable to the human hearing system. After the Mel filtering, we obtain one coefficient for each filter; so, for example, using a filter bank of 40, we obtain a vector of 40 coefficients for each frame. The logarithmically spaced frequency bands allow a better modeling of the human auditory response than a linear scale, so a log function and the Discrete Cosine Transform (DCT) are applied. This process allows more efficient data processing, for example, in audio compression.
1 http://en.wikipedia.org/wiki/Sampling_(signal_processing)#Speech_sampling
2 http://en.wikipedia.org/wiki/Mel_scale
3 http://en.wikipedia.org/wiki/Window_function#Spectral_analysis
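For reference, a commonly used analytic form of the Mel scale is given below; this is the textbook definition, and the exact variant implemented by the feature extraction tool may differ slightly:

$$m = 2595\,\log_{10}\left(1 + \frac{f}{700}\right)$$

where f is the frequency in Hz and m the corresponding value on the Mel scale.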
Feature Normalization. The effects of the recording place or channel can be prejudicial for our verification process. To avoid this problem, or to reduce the effects of this noise, a feature normalization is done. It basically consists of applying a feature update using the mean and the variance of the whole feature vector. The normalized feature x' is defined as follows:

$$x' = \frac{x - \mu}{\sigma}$$

where mu and sigma^2 are the mean and the variance of the whole feature vector, respectively. It is possible to normalize only with the mean, but it is more usual and more robust to perform it with mean and variance.
The usual approach is not to create a unique client model using only the client feature vectors, because normally we do not have enough data for each client, and the model can become heavily under-trained. Instead, a common solution is to train a general speaker model with utterances of a set of people, and then adapt this model to each user. The adaptation is done using the user data. This general model is called the Universal Background Model (UBM).
Training speaker
Like the EM algorithm, the MAP adaptation is a process of estimation in two steps. In the first step, the statistics of the training data for each UBM mixture are estimated. In the second step, these new statistics are combined with the statistical features of the UBM. Given the UBM and a set of training feature vectors X = {x_1, ..., x_T}, it is necessary to get a probabilistic alignment between the training vectors and the UBM mixtures. In Figure 3.2, a sketch of the alignment is shown. The alignment of a frame x_t with the i-th UBM mixture is

$$P(i|x_t) = \frac{w_i\, p_i(x_t)}{\sum_{k=1}^{M} w_k\, p_k(x_t)}$$
In the first step, using this value of P(i|x_t) and x_t, we compute the statistics for the weight, the mean and the variance, which we will update in the second step. They are defined as follows:

$$N_i = \sum_{t=1}^{T} P(i|x_t,\lambda), \qquad E_i(x) = \frac{1}{N_i}\sum_{t=1}^{T} P(i|x_t)\,x_t, \qquad E_i(x^2) = \frac{1}{N_i}\sum_{t=1}^{T} P(i|x_t)\,x_t^2$$
Finally, the MAP algorithm consists of updates of these parameters. The updated weight, mean and variance are defined as follows:

$$\hat{w}_i = \left[\alpha_i^{w}\,\frac{N_i}{T} + (1-\alpha_i^{w})\,w_i\right]\gamma, \qquad \hat{\mu}_i = \alpha_i^{\mu}\,E_i(x) + (1-\alpha_i^{\mu})\,\mu_i, \qquad \hat{\sigma}_i^2 = \alpha_i^{\sigma}\,E_i(x^2) + (1-\alpha_i^{\sigma})(\sigma_i^2+\mu_i^2) - \hat{\mu}_i^2$$

where gamma is a scale factor that ensures the new weights sum to 1, and the adaptation coefficients alpha_i^rho, rho in {w, mu, sigma}, are parameters that control the balance between the old and new weight, mean, and variance estimations. These coefficients are calculated using a relevance factor r:

$$\alpha_i^{\rho} = \frac{N_i}{N_i + r^{\rho}}$$

For example, using r = 0 we get alpha_i = 1, which means \hat{\mu}_i = E_i(x).
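As a small worked illustration (using the relevance factor r = 14 that appears in the TrainTarget configuration of Chapter 6; the occupation counts N_i are made-up numbers):

$$N_i = 100:\ \alpha_i = \frac{100}{100+14} \approx 0.88, \qquad N_i = 5:\ \alpha_i = \frac{5}{5+14} \approx 0.26$$

So mixtures that are well represented in the client data are adapted mostly from that data, while poorly represented mixtures stay close to the UBM.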
The testing score for a claimed speaker M is the ratio

$$\Lambda_M = \frac{p(H_M|x)}{p(H_{\bar{M}}|x)}$$

where H_M is the hypothesis that x was uttered by the speaker M, and H_{\bar{M}} is the hypothesis that x was uttered by any of the possible speakers in the world, except M. It is impossible to obtain all this information, but we can approximate this probability with the Universal Background Model, i.e., the denominator is replaced by the likelihood of x given the UBM.
Types of Output
We can group the SV output into four types: true acceptance (TA), false acceptance (FA), true rejection (TR), and false rejection (FR). TA occurs when a client or true speaker is accepted. FA occurs when an impostor is accepted; this is also called a false positive. TR occurs when an impostor is rejected. FR occurs when a true speaker is rejected and is called a false negative. These rates are measured as a percentage. Typically, only false rejections and false acceptances are used and visualized in a graph; below we show an example.
The Equal Error Rate (EER) is the standard numerical value to define the performance of a biometric verification system. It is defined as the point on the DET curve where the False Acceptance Rate and the False Rejection Rate have the same value. It is important to note that a system with an EER of 8% rejects 8% of the true speakers and lets in 8% of the impostors. One can choose his / her own threshold, taking into account the needs of the application.
Final Threshold
The final threshold is chosen after training, and can be set to the EER point as a starting point. Often, the threshold fixing is ignored, and instead the EER and the DET plot are used to show the overall system performance.
6 http://rs2007.limsi.fr/index.php/Constrained_MLLR_for_Speaker_Recognition
7 A false user who claims to log in to the SV system.
Chapter 4
ALIZE Library and LIA-RAL Toolkit
As explained in previous chapters, in this master thesis we want to analyze the performance of the ALIZE / LIA_RAL speaker verification library on an embedded system. In this chapter, and according to the documentation extracted from the official website (http://mistral.univ-avignon.fr/), we will describe the ALIZE and LIA_RAL speaker verification software [12] and we will explain its functionality relevant for this thesis in more detail. The toolkit has been designed to work with speech features, so the SPro4 tool has been used in this master thesis for this purpose. Its functionality and its most relevant features are explained in this chapter as well.
4.1 Overview
According to the author, ALIZE is defined as an open-source platform (distributed under the GPL license), and LIA_RAL is defined as a high-level toolkit, a set of tools to do all tasks required by a biometric authentication system. In Figure 4.1 we can see a diagram of the main components of the ALIZE and LIA_RAL distribution. The LIA_RAL package is based on ALIZE's low-level library and contains two task-specific sub-deliveries. The whole package can be built for any platform via the use of the GNU Autotools. In this chapter, we will explain each ALIZE and LIA_RAL function in relation to each stage described in Chapter 3. Neither ALIZE nor LIA_RAL support feature extraction, so we will use the SPro4 tool for this purpose. It will be explained in Section 4.4.
1 http://mistral.univ-avignon.fr/index.html
2 The Autotools consists of Autoconf, Automake, and Libtool toolkit to allow cross-compilation
Figure 4.1: The LIA_RAL package is based on the ALIZE low-level library, and contains all the methods used in this master thesis. Figure extracted from the Mistral project web page: http://mistral.univ-avignon.fr/index.html.
Statistical functions.
Gaussian distributions.
Gaussian mixtures.
ALIZE supports configuration files and command-line configuration. A user can define a configuration file to set some dynamic parameters, or define them on the command line.
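For illustration, a hedged sketch of such a call is shown below, following the invocation style used in Chapter 6; the configuration file name is an assumption, and the parameter name comes from the configuration tables in Section 6.2:

# Hypothetical example: most settings come from a configuration file, while one
# parameter (mixtureDistribCount) is overridden on the command line.
./LIA_RAL/LIA_SpkDet/TrainWorld/TrainWorld \
    --config cfg/trainWorld.cfg \
    --mixtureDistribCount 512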
The library contains two classes for handling files, including classes for handling feature vectors. The feature formats mostly used are HTK and SPro3/4. It is also possible to use list files, in order to concatenate feature files into a single stream.
Statistical functions: the library is able to generate histograms and also contains functions for the Expectation Maximization algorithm in general.
Gaussian distributions.
Gaussian mixtures: mainly targeted to Gaussian distributions, although the ALIZE library is also designed to support other kinds of distributions. In this master thesis, this feature will be used to read input data.
All text files in ALIZE are distributed into lines and fields. ALIZE contains a group of classes called Servers to handle the storage of named entities, i.e., all speaker models are stored in a mixture server, features are cached in a feature server, and labels are stored in a label server. This allows an automatic organization of files in memory. This automatic storage is essential for the data management, and lets ALIZE-dependent applications, such as the LIA_RAL toolkit, work in an optimized and organized way.
4 http://www.cs.williams.edu/javastructures/doc/structure/structure/FileStream.html
Normalize Features
The NormFeat program reads a feature file and normalizes it. It is possible to normalize the extracted features using mean / variance normalization or Gaussianization. The first mode simply accumulates mean and variance, and then normalizes the audio features using them. The second one performs feature warping [11] to get a Gaussian distribution from a histogram of feature vectors. In this master thesis, only the mean normalization will be used.
Energy Detector
The EnergyDetector tool reads a feature file and segments it based on speech energy. The program will discard segments with a lower energy level than a threshold. Using the simplest method, meanStd, we train a GMM on the energy component and we find the distribution with the largest weight, w_i.
The threshold is then computed from the mean and standard deviation (mu_i, sigma_i) of that distribution as theta = mu_i - alpha * sigma_i, where alpha is an empirical constant between 0 and 1 that can be defined. Normally (and also in this master thesis), only the mean is used to get the threshold (alpha = 0).
The world model is trained with the TrainWorld tool. The main purpose of this tool is to create a single GMM using a large amount of data. Firstly, the tool creates a distribution with equal means and variances, and it is adapted iteratively using the Expectation Maximization method. We can configure several parameters of this program. The most relevant are:
Feature selection: We can choose the amount of data used for creating the UBM. This selection is based on probability.
Training Target
The purpose of the TrainTarget tool is to adapt the background model to a speaker model, using feature vectors and the MAP criterion. As in the UBM training, not all feature vectors are necessary for doing the adaptation, so we can also fix the percentage of feature segments to use.
An input file for TrainTarget is a list with lines like:

SPEAKER1   speakers/spk1_001   speakers/spk1_002
SPEAKER2   speakers/spk2_001   speakers/spk2_002
SPEAKER3   speakers/spk3_001   speakers/spk3_002
SPEAKER4   speakers/spk4_001   speakers/spk4_002
In the first column the name of each speaker is declared. In the following columns, we define the utterances that we want to use to create the model. The more utterances we use, the better the model will be, because we will have more information from the speaker.
As explained in Section 3.3, we will use a relevance factor for the adaptation of the variables w_i, mu_i and sigma_i. In this master thesis, each parameter is obtained as a linear combination of its value in the world model and its value obtained by an EM algorithm on the data.
Testing is performed with the ComputeTest tool, which writes one line per comparison in an output file, for example:

F  spkFAML.MAP  1  FAML_Sr3  2.26765
F  spkFDHH.MAP  0  FAML_Sr3  1.7321
F  spkFEAB.MAP  0  FAML_Sr3  0.5291
F  spkFHRO.MAP  0  FAML_Sr3  2.1344
Here, the first column denotes F (female) or M (male), and the second and the fourth columns contain the name of the model and of the audio file, respectively. In the last column the score is shown. We can use the third column to discriminate positive scores (1) from negative scores (0).
The most important parameter that can be set is the number of top distributions to be used in the result. We can greatly reduce the computational time by lowering this number: if we evaluate, for example, only five of 512 or 1024 distributions, the performance can be improved. This will be a key point of improvement in the embedded software in this thesis.
Taking a Decision
The last step is the decision making. This process consists of comparing the likelihood resulting from the comparison between the claimed speaker model and the incoming speech signal with a decision threshold. If the likelihood is higher than the threshold, the claimed speaker will be accepted, otherwise rejected.
As commented in Subsection 3.5.3, a threshold should be defined according to our needs. This master thesis evaluates the ALIZE / LIA_RAL software, so the threshold was not fixed. We will use a collection of possible threshold values for getting the FR and FA curves, as well as the equal error rate (see Section 3.5).
Buer size: sets the input and output buer size. The smaller the input buer size,
the more disk access is needed and, therefore, the slower the program is.
0.20995
1.968982
1.715665
0.775145
(...)
3.858743
0.090583
1.39294
2.62302
0.08501
2.134615
0.938561
0.06644
2.455754
0.15941
0.489811
4.18761
0.085920
2.456415
0.806973
0.87814
In this master thesis, the SPro4 tool will be used with the following settings for all audio files and all tests:
Format: wave.
Shift: 10 ms.
Window: Hamming.
5 http://www.irisa.fr/metiss/guig/spro/
Chapter 5
Open Embedded as Operating System
The main goal of this master thesis is to adapt and configure a SV software, and run it on an embedded system. In this chapter, we describe the embedded system environment set-up and the necessary cross-compiling tools.
The BeagleBoard is presented as a single device and is based on the Texas Instruments OMAP3530 chip, as shown in Figure 5.1. The processor is an ARM Cortex-A8 core with a digital signal processing (DSP) core. The motivation of the project is to design a low-power, low-cost, compact and expandable computer. The project is in continuous development under the Creative Commons license, as specified on the first pages of the BeagleBoard user manual.
One needs at least one SD card to store Linux and its boot loaders. In the next section, this procedure will be explained.
1 http://BeagleBoard.org/static/BBSRM_latest.pdf
2 Booting is a process that starts operating systems when the user turns on a computer system.
29
Depending on what one wants to do with the board, the choice of operating system will vary. In this master thesis, we use the Angstrom distribution. This distribution was created using OpenEmbedded and requires little data storage. There are several pre-compiled Angstrom distributions, but it is also possible to create a personalized operating system image with an online builder. The builder, called Narcissus, allows one to choose the desired characteristics so that the image can be adjusted as needed. The Narcissus tool allows one to create virtual images for several embedded systems. One can adapt the OS (command-line interface, desktop environment, or an environment for PDA-style devices) and add additional packages depending on the needs. We choose a minimal configuration, without a graphical interface and no additional packages. This is necessary to optimize the processing time of our application. Note that
3 http://www.angstrom-distribution.org
4 http://www.openembedded.org/wiki/Main_Page
5 http://narcissus.angstrom-distribution.org/
our final speaker verification version is an offline version. If we want to verify online, an input microphone and audio drivers would be required.
In this master thesis, we use a 2 GB SD card. It is possible that other versions of the operating system require larger cards to function correctly. We are using a minimal version, so 2 GB is enough for our work.
First, the SD card has to be formatted and partitioned. The SD card is divided into two parts: the OS and the system files needed for the boot process. Formatting and partitioning have been done using a script. This script is presented in Annex A.
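As an orientation, a minimal sketch of the kind of commands such a script runs is given below; the device name /dev/sdX is a placeholder and the exact layout is an assumption, the script actually used is the one listed in Annex A:

# Assumes the card shows up as /dev/sdX and has already been partitioned
# (e.g. with fdisk) into a small FAT partition and a larger Linux partition.
DISK=/dev/sdX
# Partition 1: FAT boot partition for the boot loaders and the kernel image.
sudo mkfs.vfat -F 32 -n boot "${DISK}1"
# Partition 2: ext3 root file system that will hold the Angstrom image.
sudo mkfs.ext3 -L rootfs "${DISK}2"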
The LIA_RAL toolkit is built on top of the ALIZE library. The compilation and execution of these applications, however, is primarily intended for desktop computers or laptops. The porting process of the software to an embedded system requires a specific compilation according to the target system architecture. In this master thesis, we have to compile each program, as well as each library, to run on an ARM architecture machine. However, we use a pre-configured OS, and the Angstrom version presented in Section 5.2 contains compiled libraries for the ARM architecture. Another option is to compile directly on the BeagleBoard. This means that one copies the source code onto the SD card, and compiles it using the GCC compiler of the OS of the BeagleBoard.
In the following subsections, we will explain the process of cross-compiling used for each
of the three main applications of our master thesis, using the GNU Build System.
Source code may need to be built on many kinds of system: certain library functions are missing on some systems, and header files may have different names. One way to handle this is to write conditional code, with code blocks selected by means of preprocessor directives (#ifdef); but because of the wide variety of build environments this approach can become unmanageable. The GNU build system is designed to address this problem.
The GNU Build System, also known as the Autotools, is a toolkit developed by the GNU project. These tools are designed to help create portable source code packages for various Unix systems. The GNU build system is part of the GNU toolchain and is widely used to develop open-source software. Although the tools contained in the GNU build system are under the General Public License (GPL), there are no restrictions on creating proprietary software using this toolchain.
The GNU build system includes the GNU Autoconf, Automake and Libtool utilities. Other tools often used are GNU make, GNU gettext, pkg-config and the GNU Compiler Collection (GCC).
GNU Autoconf
Some Unix systems may have features that do not exist or do not work on other systems. Autoconf can detect the problem and find a way to fix it. The output of Autoconf is a script called configure. Autoheader is also included in Autoconf, and is the tool used to manage the C header files.
7 https://sourcery.mentor.com/GNUToolchain/doc7793/getting-started.pdf
8 http://en.wikipedia.org/wiki/GNU_General_Public_License
Autoconf processes the configure.in and configure.ac files. When the generated configure script is run, it also processes the Makefile.in file to produce an output Makefile. In Figure 5.3 we can see the file dependencies and the structure of the Autotools.
GNU Automake
Automake is used to create portable Makefiles that are later processed by the make tool. It uses Makefile.am as input file and transforms it into a Makefile.in, which the configure script then uses to generate the final Makefile.
GNU Libtool
Libtool can create static and dynamic libraries for various Unix OSs. Libtool abstracts the process of creating libraries and hides the differences between systems.
To cross-compile the software, the configure script has to be called with the appropriate options, where --build refers to the machine on which the compilation is done and --host to the target machine on which the programs will run; we can use both the build and host options. CXXFLAGS are the common flags that we use in a normal compilation process.
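A hedged sketch of such a call for the CodeSourcery ARM toolchain referenced above is shown below; the exact triplet names and flags depend on the toolchain that is actually installed:

# Cross-compiling configure call (sketch). The prefix arm-none-linux-gnueabi
# corresponds to the CodeSourcery ARM GNU/Linux toolchain; adjust --build to
# the actual development machine.
./configure \
    --build=i686-pc-linux-gnu \
    --host=arm-none-linux-gnueabi \
    CXX=arm-none-linux-gnueabi-g++ \
    CXXFLAGS="-O2"
make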
9 http://www.codesourcery.com/sgpp/lite/arm
Chapter 6
Configuration of ALIZE / LIA_RAL
The ALIZE / LIA_RAL package is used in many applications. In this chapter, we present a how-to for using it in a shell environment, the compilation process, and the configuration parameters used in this master thesis.
First, the extracted features are normalized with the NormFeat tool:

./LIA_RAL/LIA_SpkDet/NormFeat/NormFeat --config cfg/NormFeat.cfg --inputFeatureFilename ./lst/all.lst

The EnergyDetector tool is used for silence removal, labeling the speaker features before processing them:

./LIA_RAL/LIA_SpkDet/EnergyDetector/EnergyDetector --config cfg/EnergyDetector.cfg --inputFeatureFilename ./lst/all.lst

Next, one has to re-normalize using the label files. This is done with the NormFeat tool again, with another configuration file:

./LIA_RAL/LIA_SpkDet/NormFeat/NormFeat --config cfg/NormFeat_energy.cfg --inputFeatureFilename ./lst/all.lst
1 http://www.irisa.fr/metiss/guig/spro/spro-4.0.1/spro_4.html
Client enrollment
Speakers that will be accepted by the system have to be trained now. It is necessary to use a list file with a specific .ndx file. This file contains the name of the speaker model and the feature files used to create it. An example of a speaker10.ndx file is shown below.
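The following line sketches the format, following the list-file layout shown in Chapter 4; the utterance names are hypothetical:

speaker10   speakers/spk10_001   speakers/spk10_002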
The TrainTarget tool creates the speaker model using the background data and the speaker feature dataset specified in the .ndx file:

./LIA_RAL/LIA_SpkDet/TrainTarget/TrainTarget --config cfg/trainTarget.cfg --targetIdList ./lists/mixture.ndx --inputWorldFilename wld
To make it easier, the script trainSpeaker.sh performs the feature extraction, feature processing, .ndx file creation and client enrollment one after another.
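The test itself is run with the ComputeTest tool. A hedged sketch of the call is given below, following the invocation style of the commands above and using only parameters that appear in the configuration tables of Section 6.2; the file names are assumptions:

# Sketch of a ComputeTest invocation; cfg/computeTest.cfg and the .ndx file
# name are assumptions, the parameter names come from Section 6.2.
./LIA_RAL/LIA_SpkDet/ComputeTest/ComputeTest \
    --config cfg/computeTest.cfg \
    --ndxFilename ./lists/test.ndx \
    --inputWorldFilename wld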
The ComputeTest tool saves the results in an output file with the result of each comparison. For a simple comparison with only two enrolled clients and one testing sample, the output file looks as follows:

client1.MAP  1  sample1  5.42739
client2.MAP  0  sample1  2.67043
where the columns represent the enrolled client, the decision taken, the sample name and the score obtained, respectively. In the example, we are comparing a speaker sample with his enrolled model, and with another speaker model enrolled in the system. The decision taken will be 1 (accepted) or 0 (denied) depending on the threshold that we are using. Anyway, this factor will be irrelevant in this master thesis, as the score gives us all the information that we need to draw conclusions.
For a better understanding, we present a more complex example. Now, we will compare 7 different samples of the same speaker (named FAML) against the entire database, with 4 clients enrolled. The enrolled speakers are named FAML, MREM, FEAB and MKBP. The following output file is obtained:
spkFAML.MAP  1  FAML_Sa  5.42739
spkFAML.MAP  1  FAML_Sb  6.73677
spkFAML.MAP  1  FAML_Sc  6.21802
spkFAML.MAP  1  FAML_Sd  5.50474
spkFAML.MAP  1  FAML_Se  6.60182
spkFAML.MAP  1  FAML_Sf  5.81713
spkFAML.MAP  1  FAML_Sg  5.6254
spkMREM.MAP  0  FAML_Sa  2.80681
spkMREM.MAP  0  FAML_Sb  2.83226
spkMREM.MAP  0  FAML_Sc  1.74641
spkMREM.MAP  0  FAML_Sd  2.20437
spkMREM.MAP  0  FAML_Se  1.96162
spkMREM.MAP  0  FAML_Sf  1.75959
spkMREM.MAP  0  FAML_Sg  1.20058
spkFEAB.MAP  1  FAML_Sa  1.17351
spkFEAB.MAP  1  FAML_Sb  0.908691
spkFEAB.MAP  1  FAML_Sc  1.19148
spkFEAB.MAP  1  FAML_Sd  0.703976
spkFEAB.MAP  1  FAML_Se  0.462561
spkFEAB.MAP  1  FAML_Sf  0.448675
spkFEAB.MAP  1  FAML_Sg  0.789436
spkMKBP.MAP  0  FAML_Sa  1.24249
spkMKBP.MAP  0  FAML_Sb  1.0661
spkMKBP.MAP  0  FAML_Sc  0.393202
spkMKBP.MAP  0  FAML_Sd  0.483468
spkMKBP.MAP  0  FAML_Se  0.959548
spkMKBP.MAP  0  FAML_Sf  1.19945
spkMKBP.MAP  0  FAML_Sg  1.03321
The speaker detector obtains high scores for the right speaker, and low scores for the wrong speakers.
The script testSpeaker.sh launches the entire testing process. It includes feature extraction of the test audio files, feature processing of those files, and testing them against the client database.
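A hedged sketch of the kind of chaining such a script performs is shown below; the feature extraction call is omitted because the SPro4 options are configured as described in Section 4.4, and all file and configuration names are assumptions:

#!/bin/sh
# testSpeaker.sh (sketch): process the test features and score them against
# the enrolled client database. File names and config names are assumptions.
LIST=./lst/test.lst

./LIA_RAL/LIA_SpkDet/NormFeat/NormFeat             --config cfg/NormFeat.cfg        --inputFeatureFilename "$LIST"
./LIA_RAL/LIA_SpkDet/EnergyDetector/EnergyDetector --config cfg/EnergyDetector.cfg  --inputFeatureFilename "$LIST"
./LIA_RAL/LIA_SpkDet/NormFeat/NormFeat             --config cfg/NormFeat_energy.cfg --inputFeatureFilename "$LIST"
./LIA_RAL/LIA_SpkDet/ComputeTest/ComputeTest       --config cfg/computeTest.cfg     --ndxFilename ./lists/test.ndx --inputWorldFilename wld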
In the following tables, we present the description and possible values of the common parameters used in this master thesis. Later, the values of the specific parameters for each application are presented, too.
Parameter                  Description          Value
frameLength                                     0.01
                                                GD
                                                34
                           ... feature files.
loadFeatureFileFormat                           SPRO4
sampleRate                                      100
mixtureDistribCount                             64
                                                200
minLLk                                          -200
Parameter                   Description                        Value
saveFeatFileSpro3DataKind                                      FBANK
                            ... speech energy distribution.
varianceFlooring                                               0.5
varianceCeiling                                                10
                                                               16
featureServerMask                                              true
Parameter            Description                                                    Value
MAPAlgo              Specify the adaptation method to use: MAPOccDep or MAPConst.   MAPOccDep
MAPRegFactorMean     ... adaptation technique.                                      14
meanAdapt            ... adaptation.                                                true
alpha                ... adaptation technique.                                      0.75
nbTrainIt            Number of EM iterations.                                       20
nbTrainFinalIt                                                                      20
inputWorldFilename                                                                  wld
Parameter               Description                                      Value
inputWorldFilename      Name of saved world model.                       wld
ndxFilename             Name of .ndx file.                               not fixed
topDistribsCount        Number of Gaussians used to compute the score.   10
computeLLKTopDistribs                                                    COMPLETE
rate and performance on the BeagleBoard. Thus, depending on the results obtained in the next chapter, the most optimized configuration files will be shown in Annex B.
Chapter 7
Tests and Results
In this chapter, we present the results of the tests with ALIZE / LIA_RAL on the BeagleBoard. Apart from describing the test setup and presenting the SV quality performance of the software, the text below compares a laptop and an embedded platform, so the differences in performance measures and memory requirements will be presented, too.
1 http://www2.imm.dtu.dk/~lfen/elsdsr/
Because of the limited amount of client and impostor data (we have samples of just 23 people), all speakers are used as clients as well as impostors, so:
Number of clients: 23
Number of impostors: 23
Creating the universal background speech model requires a large amount of data, since it must include all possible speech data. The more data the UBM contains, the more robust the model will be. Typically, SV tests use very large databases to create these models. In this test, however, we will use all available training data to create the UBM, giving 161 utterances.
This database is quite small, and it is not enough to get successful and realistic speaker verification results. Nevertheless, the main purpose of this master thesis is not to obtain accurate SV results, but to compare the performance of the speaker verification system on both platforms, i.e., the differences between running it on a laptop and on an embedded system. We use the same source code on both platforms, so the Equal Error Rate and speaker detection quality should be exactly the same, as they are not memory and processor speed dependent.
7.4 Results
The two main factors that we need to measure are the execution time and the memory. The first one describes the difference between the laptop and the BeagleBoard performance. The second one shows the viability of using this software on the embedded system, since we have a limited amount of RAM. We will present these measures on both platforms with different parameter configurations.
EER optimization
First, we measure the Equal Error Rate optimization taking into account the acoustic features and the amount of data in our database. Before porting the ALIZE / LIA_RAL toolkit to the embedded system, one has to determine the number of Gaussian distributions in the mixture that optimizes the EER. As mentioned, the EER value is the intersection point between the False Rejection and False Acceptance curves, shown in Figure 7.1 and Figure 7.2.
Figure 7.1: False Rejections (left) and False Acceptances (right) curves for 1024 Gaussians. Data obtained from the development environment used in this master thesis.
Figure 7.2: Equal Error Rate point in the 1024 GMM test. The EER is the intersection point of false acceptances (red) and false rejections (black).
This intersection point can vary depending on a number of factors, so Table 7.1 presents the results in terms of accuracy for varying numbers of Gaussian mixtures and iterations. The results come from evaluating the database using the setup described in Section 7.2 and testing different combinations of mixture distributions and world iterations (worldits). This last parameter describes the number of times the client model is calculated against the UBM model; finally, the average model is obtained.
Mixtures    5 worldits   15 worldits   20 worldits
32 GMM      6,02 %       5,30 %        4,92 %
64 GMM      3,55 %       3,12 %        3,08 %
128 GMM     2,40 %       2,22 %        2,15 %
256 GMM     2,12 %       2,05 %        1,97 %
512 GMM     1,55 %       1,50 %        1,45 %
1024 GMM    1,46 %       1,43 %        1,42 %

Table 7.1: Accuracy (EER) for different numbers of Gaussian mixtures and world iterations.
Secondly, it is interesting to plot the DET curve according to our results. In Figure 7.3, results are shown for 64, 128 and 256 Gaussians.
Figure 7.3: Recognition performance for each GMM count on ELSDSR data set using 15
world iterations.
Memory requirements
The BeagleBoard OMAP3530 processor includes 256 MB of NAND memory in all versions. In these tests, revision 5 of the BeagleBoard is used, which also includes an additional 128 MB of DDR RAM. The laptop has 1 GB of RAM. Note that the feature extraction memory requirement is not included in the table below:
Memory type       Laptop         BeagleBoard    Laptop         BeagleBoard
Virtual Memory    2,10 (0,21%)   2,38 (1,85%)   3,24 (0,32%)   3,45 (2,69%)
                  1,31 (0,13%)   1,66 (1,29%)   2,32 (0,23%)   2,71 (2,10%)

Values are in MB; the percentage of the available RAM is given in parentheses.
The measures have been taken on a 1024-Gaussian testing system, using the entire ELSDSR database. The measurements have been performed for systems of 1024 and 512 Gaussians, and the results are quite similar: there is no strong relation between RAM consumption and the number of Gaussian mixtures used in the test. As we can see, the memory requirements are not going to be a major handicap, so in the following sections we will omit the memory measurements.
Time results
The time requirement of this speaker verification system is the most important part of the master thesis. The identification time depends on the number of feature vectors, the complexity of the speaker models and the number of speakers. We want to analyze whether it is possible to perform it in a real test, taking into account the computational cost that it represents. The time of the test has been divided into two parts: the first one measures the time of the processing phases that do not depend on the mixture distribution count used, i.e., the feature extraction, signal processing and normalization stages. The second one consists of evaluating the training and testing stages using different numbers of mixtures and world iterations.
It is important to understand that the total amount of time spent by our application will be the sum of the times of both stages. Using t_FE as the feature extraction time and t_SP as the signal processing time, we can define the total training and testing times as:

$$t_{total}^{train} = t_{FE} + t_{SP} + t_{train}, \qquad t_{total}^{test} = t_{FE} + t_{SP} + t_{test}$$
Firstly, we present the results of feature extraction and signal processing in Table 7.3. The feature extraction, performed by the SPro application, and the signal processing do not depend on the mixture count, because the speaker model that uses these distributions has not yet been created. We also take into account the length of the audio file: the longer the speech input file is, the more time the toolkit requires to process its features.
                Features extraction                 Signal processing
Audio length    Laptop   BeagleBoard   R            Laptop   BeagleBoard   R
3 sec           0,42     2,85          6,78         0,04     1,21          30
5 sec           0,80     3,51          4,38         0,06     1,74          29
8 sec           1,35     4,20          3,05         0,10     2,72          27
10 sec          1,95     4,90          2,51         0,14     3,64          26

Table 7.3: Feature extraction and signal processing execution times (in seconds). The ratio (R) is obtained from dividing the BeagleBoard time by the laptop time.
Table 7.4 presents the testing (t_test) and training (t_train) times.
It is important to mention that this measure does not depend on the number or length of the input audio files; it only uses feature files, which have a constant length. On the other hand, the number of mixtures has to be considered now. It is also important to take into account the world iterations used to create and test the models, because this is the most relevant factor among the configuration parameters.
The results in Table 7.4 show that the proposed SV system for embedded systems can be used in a real-life scenario, in terms of accuracy (as seen in Subsection 7.3.1) and computational performance.
                           Testing                              Training
Mixtures    Iterations     Laptop   BeagleBoard   R             Laptop   BeagleBoard   R
32 GMM      5 worldits     10,74    24,35         2,26          18       272           15,11
32 GMM      15 worldits    11,33    29,12         2,57          19       300           15,78
32 GMM      20 worldits    12,31    33,10         2,68          21       321           15,28
64 GMM      5 worldits     10,89    29,67         2,72          25       382           15,28
64 GMM      15 worldits    12,70    31,88         2,51          27       405           15
64 GMM      20 worldits    14,51    36,10         2,48          31       469           15,12
128 GMM     5 worldits     13,15    32,44         2,46          35       533           15,22
128 GMM     15 worldits    16,20    35,67         2,20          37       579           15,64
128 GMM     20 worldits    18,45    40,08         2,17          42       657           15,63
256 GMM     5 worldits     16,81    39,16         2,33          48       746           15,56
256 GMM     15 worldits    20,75    45,12         2,17          51       776           15,21
256 GMM     20 worldits    25,25    52,13         2,06          54       813           15,05
512 GMM     5 worldits     24,50    68,84         2,81          63       973           15,44
512 GMM     15 worldits    30,41    78,21         2,57          67       1075          16,06
512 GMM     20 worldits    32,20    85,90         2,66          71       1113          15,67
1024 GMM    5 worldits     30,15    82,91         2,75          80       1271          15,89
1024 GMM    15 worldits    37,12    90,59         2,44          84       1319          15,70
1024 GMM    20 worldits    44,15    101,66        2,30          89       1420          15,95

Table 7.4: Testing and training execution time comparison (in seconds).
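As a purely illustrative combination of Table 7.3 and Table 7.4 (assuming the tabulated times refer to a single run), the total testing time on the BeagleBoard for a 10-second utterance with 256 GMM and 15 world iterations would be roughly:

$$t_{total}^{test} = t_{FE} + t_{SP} + t_{test} \approx 4{,}90 + 3{,}64 + 45{,}12 \approx 53{,}7\ \text{s}$$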
Chapter 8
Discussion
In this chapter, a general discussion of the results obtained in the previous chapters is presented.
During the research, cross-compilation was the stage that was most difficult to get right. Several operating systems were tested on the BeagleBoard, and the minimal Angstrom system was the one that presented the best performance in terms of processing time.
Another point worth mentioning is the acquisition of the optimal parameters in the ALIZE / LIA_RAL configuration. From the literature about this speaker verification system, certain specific configurations can be disregarded on embedded systems due to excessive processing time. Therefore, an evaluation of different parameter combinations was done. Using the measurements obtained in Table 7.4, the configuration presented in the Annex shows the best optimization for the embedded system, ensuring the same verification results as on the laptop.
Due to the processing requirements, we can determine that it is more advisable to run only the testing stage on the embedded system and to carry out the training stage on a workstation, as has been done in this Master Thesis. Looking at Table 7.4, we can conclude that the time ratio between the laptop and the BeagleBoard grows considerably for the training process. For example, with 64 GMM and 15 world iterations, the laptop spent 27 seconds on the training stage, whereas the ARM processor of the BeagleBoard took 405 seconds, i.e. the training stage is about 15 times slower on the embedded system. If we work with databases for which training takes 3-4 minutes on the laptop, the process would take more than 45 minutes on the BeagleBoard, which is not useful in practice on embedded systems. The testing process, however, can be executed even with a large amount of data: considering a physical implementation of the system, its run time remains reasonable compared to the execution time on the laptop.
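The extrapolation follows directly from the measured training ratio of roughly 15:

    3 min x 15 ≈ 45 min        4 min x 15 ≈ 60 min

so any database whose training already takes a few minutes on the laptop becomes impractical to train on the BeagleBoard.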
Chapter 9
Conclusion and Outlook
In this chapter, we conclude our thesis and present an outlook on future work to advance this field of research.
9.1 Conclusion
In this thesis, the performance and viability of speaker verification software on an embedded system has been evaluated. We presented a comprehensive study of the ALIZE / LIA_RAL toolkit and of the BeagleBoard as the target embedded system. We have also performed tests that show the possibilities of this speaker verification toolkit on embedded systems. The evaluation has been carried out using the ELSDSR public database.
Our tests indicate that the ALIZE / LIA_RAL speaker verification system can be ported to the BeagleBoard and most probably to other ARM embedded systems. As a finding, the tests show that we obtain exactly the same Equal Error Rate on the BeagleBoard as on the laptop. However, the training time on the embedded system is increased by a factor of ~15, so we conclude that this system can be used only for testing purposes, as the time required for training is too long.
9.2 Outlook
This master thesis validates the correct functionality of the ALIZE / LIA_RAL toolkit
on embedded systems, almost for testing purposes. In order to obtain our own speaker
verication software, specic for embedded systems applications, the main goal will be:
Test the ALIZE / LIA_RAL Speaker Verication toolkit with a larger database.
50
Port this software to other ARM architecture, and compare the results with the
obtained using the Beagleboard.
Improve the ALIZE / LIA_RAL Speaker Verication source code omptimizing the
code for these embedded systems.
Appendix A
SD Formatting code
Script definition for formatting the SD card:
#!/bin/sh
# Formats an SD card with a FAT32 boot partition and an ext3 "Angstrom" root partition.
export LC_ALL=C

if [ $# -ne 1 ]; then
    echo "Usage: $0 <drive>"
    exit 1;
fi

DRIVE=$1

# Wipe the beginning of the card and read its size in bytes.
dd if=/dev/zero of=$DRIVE bs=1024 count=1024
SIZE=`fdisk -l $DRIVE | grep Disk | grep bytes | awk '{print $5}'`
echo DISK SIZE - $SIZE bytes
CYLINDERS=`echo $SIZE/255/63/512 | bc`
echo CYLINDERS - $CYLINDERS

# Create two partitions: a small bootable FAT32 (0x0C) partition and the rest for the root filesystem.
{
echo ,9,0x0C,*
echo ,,,-
} | sfdisk -D -H 255 -S 63 -C $CYLINDERS $DRIVE

sleep 1
kpartx -a ${DRIVE}

# Locate the partition device nodes (plain, pN or device-mapper naming).
DRIVE_NAME=`basename $DRIVE`
DEV_DIR=`dirname $DRIVE`

PARTITION1=${DRIVE}1
if [ ! -b ${PARTITION1} ]; then
    PARTITION1=${DRIVE}p1
fi
if [ ! -b ${PARTITION1} ]; then
    PARTITION1=$DEV_DIR/mapper/${DRIVE_NAME}p1
fi

PARTITION2=${DRIVE}2
if [ ! -b ${PARTITION2} ]; then
    PARTITION2=${DRIVE}p2
fi
if [ ! -b ${PARTITION2} ]; then
    PARTITION2=$DEV_DIR/mapper/${DRIVE_NAME}p2
fi

# Make the filesystems on both partitions.
if [ -b ${PARTITION1} ]; then
    umount ${PARTITION1}
    mkfs.vfat -F 32 -n "boot" ${PARTITION1}
else
    echo "Cannot find boot partition in /dev"
fi

if [ -b ${PARTITION2} ]; then
    umount ${PARTITION2}
    mke2fs -j -L "Angstrom" ${PARTITION2}
else
    echo "Cannot find rootfs partition in /dev"
fi
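For reference, a typical way to run the script would be as follows (assuming it is saved as mkcard.sh; /dev/sdb is only an example device node and must be replaced by the actual SD card device, since the script overwrites it):

    chmod +x mkcard.sh
    sudo ./mkcard.sh /dev/sdb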
Appendix B
Optimal ALIZE / LIA_RAL parameters
Configuration file for TrainTarget:

*** TrainTarget Configuration File ***
distribType                 GD
mixtureDistribCount         64
maxLLK                      200
minLLK                      -200
bigEndian                   false
saveMixtureFileFormat       XML
loadMixtureFileFormat       XML
loadFeatureFileFormat       SPRO4
featureServerBufferSize     ALL_FEATURES
loadMixtureFileExtension    .gmm
saveMixtureFileExtension    .gmm
loadFeatureFileExtension    .tmp.prm
featureFilesPath            data/ELSDSRdatabase/features/
mixtureFilesPath            data/ELSDSRdatabase/mixtures/
labelFilesPath              data/ELSDSRdatabase/labels/
lstPath                     data/ELSDSRdatabase/lists/
baggedFrameProbability      1
// mixtureServer
nbTrainFinalIt              1
nbTrainIt                   1
labelSelectedFrames         speech
useIdForSelectedFrame       false
normalizeModel              false
// targetIdList             data/ELSDSRdatabase/lists/trainFAML.lst
inputWorldFilename          wld
alpha                       0.75
MAPAlgo                     MAPOccDep
MAPRegFactorMean            14
featureServerMask           0-15,17-33
vectSize                    33
frameLength                 0.01
debug                       true
verbose                     false
meanAdapt                   true
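For illustration, a possible invocation of the TrainTarget executable with this file is sketched below. It assumes the usual LIA_RAL convention that the configuration file is passed with --config and that individual parameters (such as the commented-out targetIdList above) can be supplied on the command line; the path cfg/TrainTarget.cfg is hypothetical:

    TrainTarget --config cfg/TrainTarget.cfg \
        --targetIdList data/ELSDSRdatabase/lists/trainFAML.lst \
        --inputWorldFilename wld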
Configuration file for TrainWorld:

*** TrainWorld Configuration File ***
distribType                  GD
mixtureDistribCount          64
maxLLK                       200
minLLK                       -200
bigEndian                    false
saveMixtureFileFormat        XML
loadMixtureFileFormat        XML
loadFeatureFileFormat        SPRO4
featureServerBufferSize      ALL_FEATURES
loadMixtureFileExtension     .gmm
saveMixtureFileExtension     .gmm
loadFeatureFileExtension     .tmp.prm
featureFilesPath             data/ELSDSRdatabase/features/
mixtureFilesPath             data/ELSDSRdatabase/mixtures/
labelFilesPath               data/ELSDSRdatabase/labels/
lstPath                      data/ELSDSRdatabase/lists/
baggedFrameProbability       0.05
baggedFrameProbabilityInit   0.1
labelSelectedFrames          speech
addDefaultLabel              true
defaultLabel                 speech
normalizeModel               false
featureServerMask            0-15,17-33
vectSize                     33
frameLength                  0.01
// inputFeatureFilename      ../../../data/ELSDSRdatabase/lists/UBMlist.lst
fileInit                     false
// inputWorldFilename        wld
initVarianceCeiling          10
initVarianceFlooring         1
finalVarianceFlooring        0.5
finalVarianceCeiling         5
nbTrainIt                    20
nbTrainFinalIt               20
outputWorldFilename          wld
debug                        true
verbose                      true
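Analogously, a sketch of how TrainWorld might be launched with this configuration; the config path is again hypothetical, and the feature list corresponds to the commented-out inputFeatureFilename entry above:

    TrainWorld --config cfg/TrainWorld.cfg \
        --inputFeatureFilename ../../../data/ELSDSRdatabase/lists/UBMlist.lst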
Configuration file for ComputeTest:

*** ComputeTest Config File ***
distribType                  GD
loadMixtureFileExtension     .gmm
// saveMixtureFileExtension  .gmm
loadFeatureFileExtension     .tmp.prm
mixtureDistribCount          64
maxLLK                       200
minLLK                       -200
bigEndian                    false
saveMixtureFileFormat        XML
loadMixtureFileFormat        XML
loadFeatureFileFormat        SPRO4
featureServerBufferSize      ALL_FEATURES
featureFilesPath             data/ELSDSRdatabase/features/
mixtureFilesPath             data/ELSDSRdatabase/mixtures/
labelSelectedFrames          speech
labelFilesPath               data/ELSDSRdatabase/labels/
frameLength                  0.01
segmentalMode                segmentLLR
topDistribsCount             10
computeLLKWithTopDistribs    COMPLETE
// ndxFilename               data/ELSDSRdatabase/lists/aitor.lst
// worldModelName            wld
// outputFilename            data/ELSDSRdatabase/res/testing.res
gender                       F
debug                        true
verbose                      false
featureServerMask            0-15,17-33
vectSize                     33
inputWorldFilename           wld
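A possible ComputeTest invocation is sketched below, supplying the commented-out entries (ndxFilename, worldModelName, outputFilename) on the command line, as LIA_RAL allows; the config path is hypothetical:

    ComputeTest --config cfg/ComputeTest.cfg \
        --ndxFilename data/ELSDSRdatabase/lists/aitor.lst \
        --worldModelName wld \
        --outputFilename data/ELSDSRdatabase/res/testing.res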
Configuration file for NormFeat:

*** NormFeat Config File ***
mode                          norm
debug                         false
verbose                       true
bigEndian                     false
loadFeatureFileFormat         SPRO4
saveFeatureFileFormat         SPRO4
loadFeatureFileExtension      .tmp.prm
saveFeatureFileExtension      .norm.prm
featureServerBufferSize       ALL_FEATURES
sampleRate                    100
saveFeatureFileSPro3DataKind  FBANK
// inputFeatureFilename       data/ELSDSRdatabase/lists/testList.lst
labelSelectedFrames           speech
segmentalMode                 false
writeAllFeatures              false
labelFilesPath                data/ELSDSRdatabase/labels/
frameLength                   0.01
defaultLabel                  speech
addDefaultLabel               speech
vectSize                      34
featureServerMode             FEATURE_WRITABLE
featureFilesPath              data/ELSDSRdatabase/features/
featureServerMemAlloc         50000000
Configuration file for EnergyDetector:

*** EnergyDetector Config File ***
loadFeatureFileExtension      .enr.tmp.prm
// saveFeatureFileExtension   .norm.prm
minLLK                        -200
maxLLK                        200
bigEndian                     false
loadFeatureFileFormat         SPRO4
saveFeatureFileFormat         SPRO4
saveFeatureFileSPro3DataKind  FBANK
featureServerBufferSize       ALL_FEATURES
featureFilesPath              data/ELSDSRdatabase/features/
mixtureFilesPath              data/ELSDSRdatabase/mixtures/
lstPath                       data/ELSDSRdatabase/lists/
// inputFeatureFilename       testList.lst
labelOutputFrames             speech
labelSelectedFrames           male
addDefaultLabel               true
defaultLabel                  male
saveLabelFileExtension        .lbl
labelFilesPath                data/ELSDSRdatabase/labels/
frameLength                   0.01
writeAllFeatures              true
segmentalMode                 file
nbTrainIt                     8
varianceFlooring              0.5
varianceCeiling               10
alpha                         0.0
mixtureDistribCount           3
featureServerMask             16
vectSize                      1
baggedFrameProbabilityInit    1.0
debug                         true
verbose                       false
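Finally, the two front-end tools can be launched in the same way; the config paths are hypothetical, and the feature lists correspond to the commented-out inputFeatureFilename entries in the two files above:

    NormFeat --config cfg/NormFeat.cfg --inputFeatureFilename data/ELSDSRdatabase/lists/testList.lst
    EnergyDetector --config cfg/EnergyDetector.cfg --inputFeatureFilename testList.lst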