
Deep4SNet: Deep Learning For Fake Speech Classification


Expert Systems With Applications 184 (2021) 115465

Contents lists available at ScienceDirect

Expert Systems With Applications


journal homepage: www.elsevier.com/locate/eswa

Deep4SNet: deep learning for fake speech classification


Dora M. Ballesteros a, *, Yohanna Rodriguez-Ortega a, Diego Renza a, Gonzalo Arce b
a Universidad Militar Nueva Granada, Cra. 11 101-80, 110111 Bogotá, Colombia
b University of Delaware, 210 South College Ave., Newark, DE, 19716, USA

ARTICLE INFO

Keywords: Fake voice; Convolutional neural network; Imitation; Deep learning; Deep voice; Classification

ABSTRACT

Fake speech consists of voice recordings created either by artificial intelligence or by signal processing techniques. Among the methods for generating false voice recordings are Deep Voice and Imitation. In Deep Voice, the recordings sound slightly synthesized, whereas in Imitation, they sound natural. On the other hand, the task of detecting fake content is not trivial considering the large number of voice recordings that are transmitted over the Internet. In order to detect fake voice recordings obtained by Deep Voice and Imitation, we propose a solution based on a Convolutional Neural Network (CNN), using image augmentation and dropout. The proposed architecture was trained with 2092 histograms of both original and fake voice recordings and cross-validated with 864 histograms. 476 new histograms were used for external validation, and Precision (P) and Recall (R) were calculated. Detection of fake audios reached P = 0.997, R = 0.997 for Imitation-based recordings, and P = 0.985, R = 0.944 for Deep Voice-based recordings. The global accuracy was 0.985. According to the results, the proposed system is successful in detecting fake voice content.

1. Introduction

Speech is one of the most widely transmitted types of data in digital systems such as mobile phones and mobile applications. With speech, the receiver can know not only the content of the message (i.e. plain text), but also other characteristics like gender, rhythm and intonation, among others. However, these signals are easily manipulated for the purpose of deceiving the listener of the voice message. For example, one person may be trained to imitate the voice of another, and in this way, the listener may not know who is actually speaking.

Although nobody wants to be fooled by someone who imitates, for example, the voice of a famous person, the real problem with false/fake audio recordings lies in the forensic field. This is because speech signals can be used as evidence in legal proceedings (Zakariah, Khan, & Malik, 2018). Then, if the legal authority is not sure about the authenticity of the recording, wrong evidence could be used to establish the facts of a crime. However, the solution cannot be to discard all audio recordings as forensic evidence, but to classify these recordings as original and counterfeit.

In the literature there are several proposals for speaker verification systems, to confirm whether a person is who he or she really claims to be. Traditionally, this task has been approached by using GMM (Gaussian Mixture Model) and Universal Background Model (UBM) based models (Reynolds, 1995; Reynolds, Quatieri, & Dunn, 2000). However, artificial intelligence-based solutions using GA (Genetic Algorithm) (Loughran, Agapitos, Kattan, Brabazon, & O’Neill, 2017), ACO (Ant Colony Optimization) (Rashno, Ahadi, & Kelarestaghi, 2015), SVM (Support Vector Machine) (Zhao et al., 2008; Chao, Tsai, Wang, & Chang, 2008; Chao, 2014; Yaman & Pelecanos, 2013) or DL (Deep Learning) (Liu et al., 2015; Feng, Xiong, & Shi, 2017; Bunrit, Inkian, Kerdprasop, & Kerdprasop, 2019; Jati & Georgiou, 2019) have recently been proposed.

For instance, the authors in Zhao et al. (2008) proposed an SVM-based speaker verification system using GMM, obtaining EER (Equal Error Rate) values between 4.92% and 7.78% for the core test 2006 NIST speaker recognition evaluation test. In Chao et al. (2008), SVM is used to optimally separate the true voice recordings from the false voice recordings of the claimed person. Similarly, in Chao (2014) two core discriminant techniques were proposed, specifically KFD (Kernel Fisher Discriminant) and SVM for the speaker verification task; according to their results, the false alarm vs. miss probability curve is better in their proposal compared to the classic GMM-UBM approach. Authors in Yaman and Pelecanos (2013) decrease the computational cost of the polynomial kernel support vector machine (PK-SVM) through replacing

* Corresponding author.
E-mail addresses: dora.ballesteros@unimilitar.edu.co (D.M. Ballesteros), est.yohanna.rodrig@unimilitar.edu.co (Y. Rodriguez-Ortega), diego.renza@unimilitar.edu.co (D. Renza), arce@udel.edu (G. Arce).

https://doi.org/10.1016/j.eswa.2021.115465
Received 22 November 2019; Received in revised form 22 April 2021; Accepted 21 June 2021
Available online 29 June 2021
0957-4174/© 2021 Elsevier Ltd. All rights reserved.

the dot product between two utterances by two dot products between two i-vectors. With the purpose of reducing the redundancy of the high dimension feature vectors used in SVM-based speaker verification systems, ACO is applied as a feature selection approach, obtaining a 64% feature dimension reduction with an EER of 1.7% (Rashno et al., 2015).

On the other hand, the authors of Loughran et al. (2017) addressed the problem of unbalanced data (i.e. the number of examples of one class is significantly higher than the other) by applying GA in the adjustment of the cost function. For DL-based systems, proposals include, for example, deep features of the same speaker to form an objective vector (Liu et al., 2015), a nonlinear metric learning method to discriminate if two utterances belong to the same person (Feng et al., 2017), spectrograms of utterances as inputs for a Convolutional Neural Network (CNN)-based model (Bunrit et al., 2019), or a deep Siamese network to learn pairs of equal/different speakers from audio recordings. According to the reported results, DL-based speaker verification systems can work with EER values of 0.1% and an accuracy of 95%.

Moreover, the problem is further complicated by the fact that not only can erroneous evidence be created by faking another person’s voice, but also by using methods or models to create a fake voice from a real voice. Note that in this document, we use the term false when the recording is obtained by voice impersonation, and the term fake when the original voice recording is transformed by machine learning or signal processing techniques. For example, Deep Voice is a text-to-speech algorithm based on deep neural networks that can clone anyone’s voice (Arık et al., 2017); when such a method was introduced in 2017, it required the original voice recordings to last a few minutes to create the cloned voice, while currently it only needs a few seconds to create the fake recording. Although this technology has improved in the different versions of the algorithm (Ping et al., 2018), it still has a challenge related to the naturalness of the cloned voice. With few samples (less than 100), the cloned voice has some artifacts or synthetic sounds that may reveal that it is fake. Using public examples of cloned voices (e.g. https://audiodemos.github.io/ or https://r9y9.github.io/deepvoice3_pytorch/), it is possible to find a pattern to distinguish between original and fake recordings. Specifically, the spectrogram of cloned signals has been found to have a lower power/frequency ratio for normalized frequencies close to 0.5 than in the case of original voice signals (Fig. 1).

Similarly, VoCo is a text-to-speech system capable of producing fake voices (Jin, Mysore, Diverdi, Lu, & Finkelstein, 2017). In this case, as reported in Adobe’s MAX 2016 Conference, the system needs 20 min of original voice recordings to train the ML-based system. Until now, as far as we know, a commercial VoCo product does not yet exist. Another approach, Lyrebird, a division of AI-based Descript, is working on a private Beta of the Voice Double algorithm, which creates a fake voice that sounds like you, but with different plain text content (visit https://www.descript.com/lyrebird).

Unlike Deep Voice, VoCo and Voice Double, the method proposed by Ballesteros and Moreno in 2012 to create fake voice recordings uses signal processing techniques instead of training a machine learning model. It is a bio-inspired solution based on the behavior of the chameleon, which is able to adapt (i.e. imitate) its “color” to the surrounding environment (Ballesteros L & Moreno A, 2012a, 2012b). A voice recording can then mimic the accent, rhythm, tone, language, gender, and plain text of another voice recording through a mapping process. The number of fake recordings that can be obtained from a voice recording is extremely high and depends on its duration; the longer the voice recording, the more fake voices can be obtained. The main characteristic of the fake voices created with this method is that they sound and look like original voice recordings, as do their spectrograms (Figs. 2 and 3).

Because of the high similarity between original and fake signals obtained with the Imitation method, detecting its fake content is not a

Fig. 1. Spectrogram examples for one authentic voice signal and three signals generated using Deep Voice.


trivial task. It is extremely easy to deceive a listener (or legal authority) about the originality of its content, and a fake recording could be used as evidence within a legal process. To address this challenge, this paper makes the following contributions:

• We propose a solution based on Deep Learning, called Deep4SNet, to classify original and fake voice recordings obtained by Imitation. Deep4SNet is a text-independent classifier, which allows it to be used for a wide range of voice recordings.
• An analysis of a set of features is performed to identify whether the relationship between the feature and the label is strong or not, as a previous stage of CNN training.
• Our proposed solution also classifies fake voices obtained using other methods (e.g. Deep Voice). With the trained model and parameters provided at https://github.com/yohannarodriguez/Deep4SNet.git, anyone can use the proposed model as a tool to distinguish between original and fake recordings obtained not only by the Imitation-based method, but also with Deep Voice.

Fig. 2. Examples of time-domain signals generated using the Imitation-based approach.

The rest of this paper is organized as follows. Section 2 presents a brief overview of the Imitation-based method. Section 3 explains the difference between hand-crafted feature extraction and automatic feature extraction in ML-based and DL-based models. Section 4 shows the data creation, the hypotheses and their validation. Section 5 shows the architecture of Deep4SNet. Section 6 reports the results. Finally, in Section 7 the research is concluded.

2. An Overview of the imitation-based method

This method was proposed in 2012 by Ballesteros and Moreno as a solution to transmit secret content privately (Ballesteros L & Moreno A, 2012a, 2012b). The secret content imitates a non-confidential recording called the target signal, and then the output is transmitted along with a key that allows the receiver to retrieve the secret information. Suppose Alice wants to transmit a message to Bob, but she does not want her message to be heard by a third person, e.g. Eva (Fig. 4). Then, Alice uses a public message as a target, and her voice signal imitates that target. The output is a fake voice that sounds like the public message, but obtained from Alice’s voice. Using the key and the fake message, Bob can retrieve the secret content. Unlike cryptography, the transmitted message sounds natural and it does not create suspicion about its true nature.

Following the Imitation method, an original voice signal can be transformed into a fake voice in a few basic steps:

1. Change the domain of the signal by applying a time–frequency representation such as the Discrete Wavelet Transform (DWT).
2. Relocate the wavelet coefficients of the signal, following a mapping process between the original content and the target content. If there are m wavelet coefficients, there are m! available mapping options.
3. Save in a key the relationship between the location of the original wavelet coefficients and the new locations.
4. Return to the time domain, reconstructing the signal under the same conditions (e.g. decomposition levels, wavelet base) as in step (1).

Therefore, if Eva hears the fake signal, its content may differ completely from Alice’s original message in terms of accent, rhythm, tone, language and gender. In order to illustrate the performance of the Imitation-based method, some pairs of original/fake recordings are available on https://doi.org/10.17632/ytkv9w92t6.1, along with their corresponding keys and an algorithm to reverse the Imitation process. The pair of recordings (speaker5_1.wav and fake4_1.wav) sound essentially the same, but one of them is original and the other is not. The fake recording comes from a Spanish speaking male, even though the fake audio sounds like a Spanish speaking female.

3. Hand-crafted vs. Automatic feature extraction

According to the IBM Foundational Methodology for Data Science (Rollins, 2015), there are four steps related to data in any data analytics lifecycle, as follows: data requirements, data collection, data understanding and data preparation. For simplicity, in this text we will refer to the four blocks as a big block named “data creation”.

In our research we propose two approaches to the data creation step: hand-crafted feature extraction and data transformation. The former is for machine learning (ML)-based models and the latter for deep learning (DL)-based models. It should be noted that the data creation step for DL-based models does not include the feature extraction process, as this is done automatically within the model. In addition, we carry out the data transformation process within the data creation block because we intend to use a 2D-CNN architecture that is suitable for image-type inputs, rather than 1D-array-type inputs. There are works in the literature that


Fig. 3. Examples of spectrograms of signals generated using the Imitation-based method.

Fig. 4. General diagram for Alice’s communication with Bob using fake voices.
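The four Imitation steps of Section 2 (transform, relocate coefficients, store a key, inverse transform) can be illustrated with a minimal pure-Python sketch. It rests on several assumptions that are not taken from the paper: a one-level Haar transform stands in for the DWT, the mapping places the k-th smallest coefficient of the secret signal where the k-th smallest coefficient of the target sits, and the names `haar_dwt`, `haar_idwt`, `imitate` and `recover` are hypothetical. It shows only that a stored permutation key makes the relocation reversible; it is not the authors' implementation.

```python
import math

SQRT2 = math.sqrt(2.0)


def haar_dwt(x):
    """One-level Haar transform of an even-length list (assumed stand-in for the DWT)."""
    approx = [(x[i] + x[i + 1]) / SQRT2 for i in range(0, len(x), 2)]
    detail = [(x[i] - x[i + 1]) / SQRT2 for i in range(0, len(x), 2)]
    return approx + detail


def haar_idwt(coeffs):
    """Inverse of haar_dwt."""
    half = len(coeffs) // 2
    out = []
    for a, d in zip(coeffs[:half], coeffs[half:]):
        out += [(a + d) / SQRT2, (a - d) / SQRT2]
    return out


def imitate(secret, target):
    """Relocate the secret's coefficients to follow the target's ordering (assumed rank-based mapping)."""
    s, t = haar_dwt(secret), haar_dwt(target)
    s_order = sorted(range(len(s)), key=s.__getitem__)
    t_order = sorted(range(len(t)), key=t.__getitem__)
    fake_coeffs = [0.0] * len(s)
    key = [0] * len(s)
    for src, dst in zip(s_order, t_order):
        fake_coeffs[dst] = s[src]   # step 2: relocate the coefficient
        key[dst] = src              # step 3: remember where it came from
    return haar_idwt(fake_coeffs), key  # step 4: back to the time domain


def recover(fake, key):
    """Undo the relocation with the key and return the secret signal."""
    f = haar_dwt(fake)
    s = [0.0] * len(f)
    for dst, src in enumerate(key):
        s[src] = f[dst]
    return haar_idwt(s)


secret = [0.10, -0.40, 0.30, 0.00, 0.25, -0.20, 0.05, 0.50]
target = [math.sin(0.8 * n) for n in range(8)]
fake, key = imitate(secret, target)
restored = recover(fake, key)
```

In this sketch the key is simply a permutation of the m coefficient positions, which is one of the m! possible mappings mentioned in step 2.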

have transformed audios into images, e.g. spectrograms, as inputs of 2D-CNNs for audio classification tasks such as speech emotion recognition (Satt, Rozenberg, & Hoory, 2017; Yenigalla et al., 2018) or speaker recognition (Zeng, Mao, Peng, & Yi, 2019). Fig. 5 shows the difference between the two approaches.

In Section 4, the data creation block for the two approaches, for three possible datasets and three different hypotheses will be explained.

4. Data creation and hypothesis validation

For ML-based models we propose two candidate datasets, derived from two hypotheses. In addition, for DL-based models we propose one data transformation, related to the third hypothesis. However, each proposal has the characteristic of being text-independent, unlike others such as the spectrogram in which the behavior changes not only for the speaker but also for the plain text content. The description of each dataset and its hypothesis validation are presented in the following subsections.

4.1. Data creation

This section presents three proposals for data creation, where each proposal is based on a hypothesis.

4.1.1. Statistics-based features and Hypothesis 1

In the first approach, the statistics of the voice signal were considered, specifically the mean, standard deviation, the minimum of the normalized voice signal, and the maximum of the normalized voice signal. This implies that the voice signal is scaled by the maximum between the magnitude of its peak and valley.

These features are selected because they are text-independent. Typically, natural speech signals have an average around 0.0 and a standard deviation of 0.2, when the signal ranges between [−1, 1]. Therefore, if fake signals differ in terms of their statistics from the above


Fig. 5. Hand-crafted vs Automatic Feature Extraction in ML-based and DL-based models.
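As an illustration of the hand-crafted branch, the statistics-based features of Section 4.1.1 and the feature/label correlation used to test Hypothesis 1 can be sketched as follows. This is a minimal sketch under stated assumptions: all four statistics are computed on the signal after scaling by the larger of its peak and valley magnitudes, the population standard deviation is used, and the function names `stat_features` and `pearson` are hypothetical rather than taken from the authors' code.

```python
import math
import statistics


def stat_features(signal):
    """Mean, standard deviation, min and max of the signal scaled into [-1, 1]."""
    peak = max(abs(v) for v in signal)  # larger of |peak| and |valley|
    if peak == 0:
        raise ValueError("silent signal: cannot normalize")
    norm = [v / peak for v in signal]
    return {
        "mean": statistics.fmean(norm),
        "std": statistics.pstdev(norm),   # assumption: population std
        "min": min(norm),
        "max": max(norm),
    }


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long series."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    if sx == 0 or sy == 0:
        return 0.0  # a constant series carries no linear relationship
    return cov / (sx * sy)


def hypothesis_holds(feature_rows, labels, threshold=0.5):
    """True only if every feature correlates with the label at or above the threshold."""
    names = feature_rows[0].keys()
    return all(
        abs(pearson([row[n] for row in feature_rows], labels)) >= threshold
        for n in names
    )
```

Here `feature_rows` would be a list of `stat_features` outputs, one per recording, and `labels` the matching list of 1 (original) and 0 (fake) values; the 0.5 threshold is the one stated in the text.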

values, a machine learning model could classify the audio recordings as original and fake.

In terms of statistics, the following hypothesis is proposed:

Hypothesis 1. Natural speech signals have very similar statistics to each other, as do fake signals. Additionally, the statistic values between natural and fake signals are dissimilar.

Complementing the previous hypothesis, if the correlation coefficient between each characteristic (i.e., each statistical value) and the classifier label (1 for original, 0 for fake) is greater than or equal to 0.5 (in the range of 0 to 1), then Hypothesis 1 is true. Otherwise, it is false.

4.1.2. Entropy-based features and Hypothesis 2

It is well known that entropy can be used as a measure of data uncertainty (Robinson, 2008). The higher the level of uncertainty, the greater the value of entropy. This value depends on the distribution of the data, but not on the plain text of the message, and therefore entropy may help to identify fake content. For that reason, we propose the following hypothesis:

Hypothesis 2. Natural speech signals have very similar entropy values to each other, as do fake signals. Additionally, the entropy values between natural and fake signals are dissimilar.

In this case, if the correlation coefficient between each feature (i.e. entropy of a segment of the signal) and the classifier label (1 for original, 0 for fake) is at least 0.5 in magnitude (in the range 0 to 1), then Hypothesis 2 is true. Otherwise it is false.

4.1.3. Transforming voice recordings into histogram, and Hypothesis 3

The previous hypotheses have a disadvantage in terms of the limited number of features extracted from the speech signal. In the first hypothesis there are only 4 features (one for each selected statistic) and in the second hypothesis there are (n + 1) features (i.e. the number of signal segments used for comparison, plus the whole signal). Therefore, in the third hypothesis, we propose a new approach in which the features are not hand-crafted, but extracted using a DL-based model, and the input of the architecture is an image instead of a 1D array. Therefore, the classification problem is treated as a computer vision problem that can be addressed by using Convolutional Neural Networks (CNNs), which have demonstrated in the last decade superior performance in classification tasks over ML-based shallow models (Shin & Balasingham, 2017; Rodriguez-Ortega, Ballesteros, & Renza, 2021).

For this hypothesis we created a specific dataset (Ballesteros, Rodriguez, & Renza, 2020) which contains histograms of original speech signals, and fake signals created by Imitation and Deep Voice. These images may have more extensive information than the hand-crafted features based on entropy and statistics, e.g. width and slope of the curve in the lower, middle and upper part of the histogram, which may be useful for the current classification task.

Then, the third proposed hypothesis is as follows:

Hypothesis 3. Natural speech signals have very similar histogram shapes to each other, as do fake signals. Additionally, the histogram shapes between natural and fake signals are dissimilar.

Unlike Hypothesis 1 and Hypothesis 2, the correlation coefficient is not used to determine whether Hypothesis 3 is true; instead, the validation is carried out by visual inspection of the histograms of each label, since in this case the histogram does not correspond to the features of the model.

The advantage of using histograms instead of other types of transformation such as spectrograms is that the model is not dependent on the plaintext of the message, i.e. it can classify original and fake voice recordings for any plaintext message.

4.2. Validation of the hypothesis

The next step is to select the most appropriate dataset to feed the ML or DL-based model. Therefore, every hypothesis is validated, as presented below.

4.2.1. Validation of the Hypothesis 1

For this hypothesis validation, 100 original and 100 fake voice recordings were used. Each recording has a duration of 10 s and a sampling rate of 44100 Hz with 16-bit quantization. The statistics calculated for each signal were: mean, standard deviation (desv), normalized minimum (min) and normalized maximum (max).

The correlation coefficient between two series (e.g. mean vs. label, or desv vs. label) makes it possible to identify whether they are correlated or not, i.e. whether one of them (the label) depends on the other (the feature). This coefficient was obtained for each feature/label pair, through a correlation matrix, in which the last column is the one of interest (see Fig. 6).

According to Fig. 6, all correlation coefficients between the input feature and the label are very close to 0 and far from 1, so Hypothesis 1 is


not true. Therefore, they will not allow us to distinguish between the original and the fake voice recording, regardless of which machine learning method may be used.

Fig. 6. Validation results of Hypothesis 1. Correlation matrix for statistic vs. label. The lighter the color, the higher the correlation.

4.2.2. Validation of the Hypothesis 2

For a more complete evaluation of this hypothesis, the addition of noise to the signal was considered, since the stochastic behavior of the noise can change the entropy of the noisy signal. The objective is to evaluate the influence of noise on the correlation values between each feature and label.

The test protocol to validate Hypothesis 2 is as follows:

• Voice recordings: 360 original voice signals and 360 target signals. Each recording has a duration of 10 s and a sampling rate of 44100 Hz with 16-bit quantization.
• Features: 11 entropy values were used, one value for every second of signal duration, and one for the entire voice recording. These values are calculated in both the original and fake voice recordings.
• Dataset: four datasets are created from the original and target recordings. The difference between them lies in whether noise is added to the recordings or not. Table 1 shows how every dataset is composed. Each voice recording imitates a single target voice, and therefore, there are 360 fake recordings in each dataset.

As in the analysis of statistics-based features, the correlation matrix is calculated for every dataset (Fig. 7). In the current case, the column of interest is the first one.

Considering that not all the correlation coefficients between every feature and the label are higher than 0.5, Hypothesis 2 is not true.

Table 1
Content of each dataset (voice and target) according to the presence of noise.
Voice | Noisy voice | Target | Noisy target
Dataset 1 × ×
Dataset 2 × ×
Dataset 3 × ×
Dataset 4 × ×

4.2.3. Validation of the Hypothesis 3

Unlike the validation of the previous hypotheses, the correlation matrix between the features and the label is not used to determine whether Hypothesis 3 is true or not. The reason is that the dataset related to Hypothesis 3 does not correspond to the features, but to a data transformation from the voice recordings, using histograms. This image set is intended to be used in a DL-based model, where feature extraction is part of the model.

Then, the validation of Hypothesis 3 is performed by visual inspection of some histograms of original voice recordings and fake voice recordings.

Fig. 8a and Fig. 8c show examples of histograms of the original voice recordings, and their corresponding fake recordings (Fig. 8c, Fig. 8d) obtained with Imitation and Deep Voice, respectively. It is easy to identify the differences between them. For histograms of fake recordings, the number of occurrences around zero is significantly higher than in the histograms of the original voice recordings, which means that unnatural recordings have a greater number of zero crossings. Therefore, if a human can distinguish them, a machine can learn to do so. Consequently, Hypothesis 3 is expected to be true, and then its corresponding dataset is used to feed a DL-based model.

5. Proposed method

Due to the difficulty of identifying fake voice recordings created with the Imitation-based method, because they sound and look very much like the original voice signals (i.e., without artifacts or synthetic effects), the problem is treated as a computer vision task, in which the input is the histogram image rather than some hand-crafted features. The advantage of using CNNs in classification tasks is that they have demonstrated excellent performance for this type of image analysis.

Fig. 9 shows the proposed solution: it covers the training stage and the validation stage. In the training stage, the histograms of the training voices are separated into original and fake. They are transformed by image resizing, image scaling and horizontal flipping. Then, the CNN is trained with the new original/fake images. In the validation stage, the images are transformed by resizing and scaling, but not by horizontal flipping. They are then fed to the trained model, which classifies the histograms into original (class = 1) and fake (class = 0), and their results are compared with the true values.

In order to avoid overfitting, we apply the following strategies: first, to use horizontal flip as image augmentation in the pre-processing module; second, to add dropout into the CNN architecture. With the horizontal flip, the CNN is trained with a wider variety of histograms, which differ in the displacement of the central point of the graph (i.e. histograms shifted towards positive or negative values). With dropout as a regularization technique, some of the neurons are randomly ignored, so the neurons in the next layer learn without overfitting.

The structure of the proposed solution is explained as follows.

5.1. Pre-processing

In the training stage, we use the ImageDataGenerator module of Keras to perform three tasks: image resize, image scaling and horizontal flip. The images are adjusted to 150 × 150 × 3 pixels normalized in the range [0, 1]. The selected image augmentation is a horizontal flip, which corresponds to a mirror effect across the y-axis. It takes advantage of the imperfect symmetry between the left and right sides of the histogram. In the validation stage, horizontal flip is not considered in the pre-processing step.

5.2. Network architecture

In our custom architecture (Fig. 10), the number of convolutional and pooling layers is significantly lower than in other computer vision networks because, unlike the typical classification task, we do not need features from deeper layers to identify differently shaped objects, but rather features from shallow layers. Several similar works aimed at fake recognition have used 2D-CNN with a low number of convolutional layers, with satisfactory results (Zhuo, Tan, Zeng, & Lit, 2018; Rodriguez-Ortega et al., 2021; Goel, Kaur, & Bala, 2021).

In Fig. 10, f represents the size of the filter, s the size of the stride and p the size of the padding; CONV and POOL are convolutional and pooling operations. There are 3 layers of CONV + POOL followed by a flatten layer, a hidden layer and the output layer. The architecture works


Fig. 7. Validation examples of Hypothesis 2. Correlation matrix for entropy vs label. The lighter the color, the higher the correlation.
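The entropy features behind Fig. 7 (one value per second of signal plus one for the whole recording, i.e. 11 values for a 10 s recording) can be sketched as below. The paper does not state how the entropy is estimated, so this sketch assumes Shannon entropy of the amplitude histogram of a signal scaled to [−1, 1], with an arbitrary choice of 256 bins; the names `shannon_entropy` and `entropy_features` are hypothetical.

```python
import math
from collections import Counter


def shannon_entropy(samples, n_bins=256):
    """Shannon entropy in bits of the amplitude distribution (samples assumed in [-1, 1])."""
    bins = Counter(
        min(int((v + 1.0) / 2.0 * n_bins), n_bins - 1)  # map amplitude to a bin index
        for v in samples
    )
    total = len(samples)
    return -sum((c / total) * math.log2(c / total) for c in bins.values())


def entropy_features(samples, sample_rate, segment_seconds=1):
    """One entropy value per full segment, followed by the entropy of the whole signal."""
    seg = sample_rate * segment_seconds
    feats = [
        shannon_entropy(samples[i:i + seg])
        for i in range(0, len(samples) - seg + 1, seg)
    ]
    feats.append(shannon_entropy(samples))
    return feats
```

For a 10 s recording sampled at 44100 Hz, `entropy_features(samples, 44100)` would return the 11 values described in the test protocol; each column of such a feature table can then be correlated with the label in the same way as the statistics-based features.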

with dropout in the hidden layer to avoid overfitting. The output size of a convolutional block is $\lfloor \frac{n + 2p - f}{s} + 1 \rfloor \times \lfloor \frac{m + 2p - f}{s} + 1 \rfloor \times n_f$, where n corresponds to the number of rows, m to the number of columns of the input image, and $n_f$ is the number of filters. For CONV1, n = m = 150 and $n_f$ = 32; then, the output is $\lfloor \frac{150 + 0 - 3}{1} + 1 \rfloor \times \lfloor \frac{150 + 0 - 3}{1} + 1 \rfloor \times 32 = 148 \times 148 \times 32$.

The number of trainable parameters in each convolutional layer is equal to the number of weights of the filters including the bias. For the first convolutional layer, there are 32 RGB filters, each one with 3 × 3 × 3 weights and one bias per filter, for a total of 896 weights (i.e. 32 × (3 × 3 × 3 + 1)). Table 2 shows the summary of trainable parameters. It is emphasized that the pooling operation has no trainable parameters.

The selected loss function is binary crossentropy, $L(y, \hat{y})$, which is related to the dissimilarity in terms of entropy between two data sequences, in our case, the entropy of the known labels ($y_i$) and the entropy of the predicted labels ($\hat{y}_i$). This kind of loss function is very useful in binary classification problems. Mathematically, it is calculated as shown in Eq. (1).

$$L(y, \hat{y}) = -\frac{1}{N} \sum_{i=0}^{N} \left[ y_i \cdot \log(\hat{y}_i) + (1 - y_i) \cdot \log(1 - \hat{y}_i) \right] \quad (1)$$

For the optimizer, the selected method is RMSprop, which consists of scaling the learning rate dynamically by dividing it by the root of the (average) squared gradient of the mini-batch (Taqi, Awad, Al-Azzo, & Milanova, 2018). The activation functions for the convolutional and hidden layers are ReLU (Rectified Linear Unit), which offers a good compromise between performance and computational cost. The goal of ReLU is to discard negative values and allow positive values to pass, according to Eq. (2).

$$f(x) = \max(0, x) \quad (2)$$

Finally, the activation function of the last neuron is sigmoid, given by Eq. (3). For binary classification problems, this type of activation is widely recommended by the scientific community.

$$f(x) = \frac{1}{1 + e^{-x}} \quad (3)$$

6. Experimental results and analysis

The proposed model was developed using Python 3.0, TensorFlow and Keras running on GPU. The experiments are designed to evaluate the performance of the proposed solution with fake voices obtained from the Imitation-based method, as well as Deep Voice. This section encompasses the experimental setup, the evaluation metrics, the strategies to avoid overfitting and the final results.

6.1. Experimental setup

Training and validation dataset: the first step in creating the experimental dataset is to obtain original and fake recordings using the Imitation-based method and the Deep Voice algorithm. In the first case, 360 original voice recordings from 44 speakers and four languages were used. Some of these recordings are available in https://doi.org/10.17632/ytkv9w92t6.1. White noise with an SNR of 20 dB was added to these recordings, resulting in 360 noisy recordings. From the 720 original voice recordings, 720 fake recordings were calculated, one for each original voice recording. With original and fake recordings their 2880 histograms were obtained, 1440 from the original voice recordings, and the others from the fake recordings. In the case of Deep Voice, recordings of the Voice Cloning Experiment I published at https://audiodemos.github.io/ were selected to train the CNN model. A 16-bit re-quantization was applied so that these recordings have the same quantization as the recordings used in the training step. In total, there are 76 histograms


Fig. 8. Validation results of Hypothesis 3. Histograms obtained from speech signals (natural and fake). Available at Ballesteros et al. (2020).
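The output-size rule and the parameter count given for the first convolutional layer in Section 5.2 can be checked with a few lines of arithmetic. This is a minimal sketch; the function names `conv_output_shape` and `conv_trainable_params` are hypothetical, and only the CONV1 values stated in the text (150 × 150 × 3 input, 32 filters of size 3, stride 1, no padding) are used, since the settings of the remaining layers are not given in this part of the paper.

```python
def conv_output_shape(n, m, f, n_filters, s=1, p=0):
    """Rows, columns and channels produced by a convolutional block."""
    rows = (n + 2 * p - f) // s + 1   # floor((n + 2p - f) / s) + 1
    cols = (m + 2 * p - f) // s + 1   # floor((m + 2p - f) / s) + 1
    return rows, cols, n_filters


def conv_trainable_params(f, in_channels, n_filters):
    """Filter weights plus one bias per filter."""
    return n_filters * (f * f * in_channels + 1)


conv1_shape = conv_output_shape(n=150, m=150, f=3, n_filters=32)
conv1_params = conv_trainable_params(f=3, in_channels=3, n_filters=32)
print(conv1_shape, conv1_params)
```

With these inputs the sketch reproduces the values quoted in the text for CONV1: a 148 × 148 × 32 output and 896 trainable parameters.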

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \qquad (4)$$

$$\text{Precision} = \frac{TP}{TP + FP} \qquad (5)$$

$$\text{Recall} = \frac{TP}{TP + FN} \qquad (6)$$

The terms TP, TN, FP, and FN are explained in Table 3. Accuracy is the
number of correctly classified recordings divided by the total number of
recordings; precision is the number of recordings correctly classified as
original divided by the number of all recordings classified as original;
recall is the number of recordings correctly classified as original divided
by the total number of original recordings.
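
As a rough illustration of Eqs. (4)–(6), and of the binary cross-entropy loss of Eq. (1) that is tracked alongside them, the following Python sketch computes the metrics from confusion-matrix counts, with "original" as the positive class as in Table 3. The function names, the clipping constant `eps`, and the example counts are assumptions introduced here for illustration only; they are not taken from the authors' code or results. The loss is averaged over all samples supplied, which is the usual reading of the 1/N factor.

```python
import math


def classification_metrics(tp, tn, fp, fn):
    """Accuracy, precision and recall following Eqs. (4)-(6).

    Assumes tp + fp > 0 and tp + fn > 0 (no guard against division by zero).
    """
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return accuracy, precision, recall


def binary_cross_entropy(y_true, y_pred, eps=1e-7):
    """Mean binary cross-entropy in the spirit of Eq. (1).

    eps is an assumed clipping constant that keeps log() away from zero.
    """
    total = 0.0
    for y, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1.0 - eps)  # clip the predicted probability
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return -total / len(y_true)


# Hypothetical counts chosen only for illustration (not results from the paper).
acc, prec, rec = classification_metrics(tp=45, tn=40, fp=10, fn=5)
print(f"accuracy={acc:.3f} precision={prec:.3f} recall={rec:.3f}")

# Hypothetical labels (1 = original, 0 = fake) and predicted probabilities.
print(binary_cross_entropy([1, 0, 1], [0.9, 0.2, 0.6]))
```

Because precision and recall are defined for the positive class, the corresponding values for the fake class can be obtained from the same function by swapping the roles of the classes, that is, by passing TN as `tp` and FN as `fp` (and vice versa).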

Fig. 9. Overview of the proposed solution.

related to Deep Voice, 4 for original voice recordings and 72 for fake
recordings.

Joining the two sets of histograms gives 2956 images (2880 + 76), distributed
into training and validation sets, each containing original and fake
histograms. The dataset is balanced. Fig. 11 shows the distribution of the
histograms. The test set is explained in Section 6.4.

6.2. Evaluation metrics

The proposed CNN model was trained and validated using the following
metrics: accuracy, loss (Eq. (1)), precision and recall. Accuracy, precision
and recall are calculated using Eqs. (4)–(6).

6.3. Avoiding overfitting

The objective of applying image augmentation as well as dropout is to avoid
overfitting in the CNN model. In this section, performance graphs are shown
to illustrate the effect of the selected strategies.

Fig. 12 shows the impact of dropout on the performance of the classifier. In
the loss graphs, the breaking point between the descent and ascent of the
curve (i.e. the elbow effect) is more noticeable without dropout (left plot)
than with dropout (middle and right plots). For accuracy and recall, the
validation performance is almost constant after about three epochs,
regardless of whether the total number of epochs increases. For dropout
values of 0.2 and 0.3, the elbow effect decreases, and the validation curve
follows the training curve as the epochs increase.

However, dropout is not enough to combat overfitting, then,


horizontal flip (image augmentation) in the pre-processing step is
included. The results are shown in Fig. 13.

Comparing the results in Fig. 12 with those in Fig. 13, it is clear that
model performance is better when both horizontal flip and dropout are
included. Since the elbow effect in the loss graphs disappears for a dropout
of 0.2, this value is the one selected for the model, according to
Section 5.2.

Fig. 10. Architecture of the proposed CNN model.

Table 2
Summary of trainable parameters.
Layer    Trainable parameters
Conv1    896
Pool1    0
Conv2    9,248
Pool2    0
Conv3    18,496
Pool3    0
FC1      0
FC2      1,183,808
FC3      65
Total    1,212,513

Table 3
Description of terms of the evaluation metrics.
Term                   Description
TP (True Positive)     Original speech classified as original
TN (True Negative)     Fake speech classified as fake
FP (False Positive)    Fake speech classified as original
FN (False Negative)    Original speech classified as fake

Fig. 11. Distribution of the histogram dataset, available at https://data.mendeley.com/datasets/k47yd3m28w/1.

6.4. External test

The trained model, the trainable parameters, and the experiment for the
external test are posted at https://github.com/yohannarodriguez/Deep4SNet.git.
The dataset with the histograms is available at
https://doi.org/10.17632/k47yd3m28w.1.

For the method based on Imitation, we use 400 new recordings (not


previously used for training or testing), corresponding to 20 original
recordings and 380 fake recordings (Fig. 11). The original recordings were
obtained from The LJ Speech Dataset (available at
https://keithito.com/LJ-Speech-Dataset/). In the case of Deep Voice, 76
recordings were selected from the Voice Cloning Experiment II (available at
https://audiodemos.github.io/), corresponding to 4 original voice recordings
and 72 fake voice recordings. Table 4 shows the results; the positive class
corresponds to the label original and the negative class to the label fake.

According to the results shown in Table 4, recall is better for fake voice
recordings obtained by Imitation and Deep Voice than for original voice
recordings. This means that a fake voice recording is less likely to be
incorrectly labeled than an original voice recording. Similarly, precision
is better for fake voice recordings than for original voice recordings. So,
if a recording is labeled as fake, there is great confidence in the truth of
the label. In general, histograms are correctly classified 98.5% of the time.

Fig. 12. Dropout effect. Top to bottom: accuracy, loss, precision, and recall; left to right: without dropout, dropout = 0.2, dropout = 0.3.

7. Conclusion and future work

We have proposed a solution based on convolutional neural networks to
classify original and fake speech recordings obtained by Imitation and Deep
Voice. Three hypotheses about the input data used to predict whether a voice
recording is original or fake have been studied in this work. The first
hypothesis uses statistical values as features to train machine learning
models; according to the results, the label does not depend on these
features, as the correlation coefficient values are very close to zero. The
second hypothesis uses entropy-based features, but most of the correlation
coefficient values between each feature and the label are less than 0.5, and
therefore this hypothesis is rejected as well. The third hypothesis uses
histograms from voice recordings, as they carry more extensive information
than entropy itself. In a preliminary review of the histograms, the behavior
of the original voice recordings is different from that of fake voice
recordings. This hypothesis is selected in the final solution. In this case,
the classification task is treated as a computer vision problem and therefore
the classifier is based on a custom CNN. In order to avoid overfitting, two
strategies were applied: image augmentation and dropout. According to
experimental tests, the horizontal flip and a dropout equal to 0.2 are good
hyperparameter values for the current problem. The model has precision and
recall of 0.997 when it is used to classify fake voice recordings obtained
with the Imitation method. Also, when the model is used to identify voice
recordings obtained with Deep Voice, the precision is 0.985 and the recall is
0.944. The results showed that the proposed solution is successful in
identifying


fake voice recordings obtained by Imitation and Deep Voice.

Fig. 13. Image augmentation and dropout effect. Top to bottom: accuracy, loss, precision, and recall; left to right: dropout = 0.2, dropout = 0.3.

Table 4
Evaluation metrics.
                     Precision   Recall   Global accuracy (all classes)
Fake (Imitation)     0.997       0.997
Fake (Deep Voice)    0.985       0.944    0.985
Original             0.814       0.916

CRediT authorship contribution statement

Dora M. Ballesteros: Conceptualization, Methodology, Investigation, Writing - original draft, Funding acquisition. Yohanna Rodriguez-Ortega: Software, Data curation, Validation, Writing - review & editing. Diego Renza: Conceptualization, Formal analysis, Writing - review & editing, Funding acquisition. Gonzalo Arce: Supervision, Writing - review & editing.

Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgements

This work is supported by the “Universidad Militar Nueva Granada-Vicerrectoría de Investigaciones” (Grant IMP-ING-2936 of 2019-2021).

References

Arık, S.Ö., Chrzanowski, M., Coates, A., Diamos, G., Gibiansky, A., Kang, Y., Li, X., Miller, J., Ng, A., Raiman, J., et al. (2017). Deep voice: Real-time neural text-to-speech. In International Conference on Machine Learning (pp. 195–204). PMLR.
Ballesteros, D. M., Rodriguez, Y., & Renza, D. (2020). A dataset of histograms of original and fake voice recordings (h-voice). Data in brief, 29, Article 105331.
Ballesteros L, D. M., & Moreno A, J. M. (2012a). Highly transparent steganography model of speech signals using efficient wavelet masking. Expert Systems with Applications, 39, 9141–9149.
Ballesteros L, D. M., & Moreno A, J. M. (2012b). On the ability of adaptation of speech signals and data hiding. Expert Systems with Applications, 39, 12574–12579.
Bunrit, S., Inkian, T., Kerdprasop, N., & Kerdprasop, K. (2019). Text-independent speaker identification using deep learning model of convolution neural network. International Journal of Machine Learning and Computing, 9, 143–148.
Chao, Y.-H. (2014). Using lr-based discriminant kernel methods with applications to speaker verification. Speech Communication, 57, 76–86.
Chao, Y.-H., Tsai, W.-H., Wang, H.-M., & Chang, R.-C. (2008). Using kernel discriminant analysis to improve the characterization of the alternative hypothesis for speaker verification. IEEE transactions on audio, speech, and language processing, 16, 1675–1684.
Feng, Y., Xiong, Q., & Shi, W. (2017). Deep nonlinear metric learning for speaker verification in the i-vector space. IEICE Transactions on Information and Systems, 100, 215–219.
Goel, N., Kaur, S., & Bala, R. (2021). Dual branch convolutional neural network for copy move forgery detection. IET Image Processing, 15, 656–665.
Jati, A., & Georgiou, P. (2019). Neural predictive coding using convolutional neural networks toward unsupervised learning of speaker characteristics. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 27, 1577–1589.
Jin, Z., Mysore, G. J., Diverdi, S., Lu, J., & Finkelstein, A. (2017). Voco: Text-based insertion and replacement in audio narration. ACM Transactions on Graphics (TOG), 36, 1–13.
Liu, Y., Qian, Y., Chen, N., Fu, T., Zhang, Y., & Yu, K. (2015). Deep feature for text-dependent speaker verification. Speech Communication, 73, 1–13.
Loughran, R., Agapitos, A., Kattan, A., Brabazon, A., & O’Neill, M. (2017). Feature selection for speaker verification using genetic programming. Evolutionary Intelligence, 10, 1–21.
Ping, W., Peng, K., Gibiansky, A., Arik, S. O., Kannan, A., Narang, S., Raiman, J., & Miller, J. (2018). Deep voice 3: 2000-speaker neural text-to-speech. In Proc. ICLR (pp. 214–217).
Rashno, A., Ahadi, S. M., & Kelarestaghi, M. (2015). Text-independent speaker verification with ant colony optimization feature selection and support vector machine. In 2015 2nd International Conference on Pattern Recognition and Image Analysis (IPRIA) (pp. 1–5). IEEE.
Reynolds, D. A. (1995). Speaker identification and verification using gaussian mixture speaker models. Speech communication, 17, 91–108.
Reynolds, D. A., Quatieri, T. F., & Dunn, R. B. (2000). Speaker verification using adapted gaussian mixture models. Digital signal processing, 10, 19–41.
Robinson, D. W. (2008). Entropy and uncertainty. Entropy, 10, 493–506.
Rodriguez-Ortega, Y., Ballesteros, D. M., & Renza, D. (2021). Copy-move forgery detection (cmfd) using deep learning for image and video forensics. Journal of Imaging, 7, 59.
Rollins, J. (2015). Foundational methodology for data science. Whitepaper: Domino Data Lab Inc.
Satt, A., Rozenberg, S., & Hoory, R. (2017). Efficient emotion recognition from speech using deep learning on spectrograms. In Interspeech (pp. 1089–1093).
Shin, Y., & Balasingham, I. (2017). Comparison of hand-craft feature based svm and cnn based deep learning framework for automatic polyp classification. In 2017 39th annual international conference of the IEEE engineering in medicine and biology society (EMBC) (pp. 3277–3280). IEEE.
Taqi, A. M., Awad, A., Al-Azzo, F., & Milanova, M. (2018). The impact of multi-optimizers and data augmentation on tensorflow convolutional neural network performance. In 2018 IEEE Conference on Multimedia Information Processing and Retrieval (MIPR) (pp. 140–145). IEEE.
Yaman, S., & Pelecanos, J. (2013). Using polynomial kernel support vector machines for speaker verification. IEEE Signal Processing Letters, 20, 901–904.
Yenigalla, P., Kumar, A., Tripathi, S., Singh, C., Kar, S., & Vepa, J. (2018). Speech emotion recognition using spectrogram & phoneme embedding. In Interspeech (pp. 3688–3692).
Zakariah, M., Khan, M. K., & Malik, H. (2018). Digital multimedia audio forensics: past, present and future. Multimedia tools and applications, 77, 1009–1040.
Zeng, Y., Mao, H., Peng, D., & Yi, Z. (2019). Spectrogram based multi-task audio classification. Multimedia Tools and Applications, 78, 3705–3722.
Zhao, J., Dong, Y., Zhao, X., Yang, H., Lu, L., & Wang, H. (2008). Advances in svm-based system using gmm super vectors for text-independent speaker verification. Tsinghua Science and Technology, 13, 522–527.
Zhuo, L., Tan, S., Zeng, J., & Lit, B. (2018). Fake colorized image detection with channel-wise convolution based deep-learning framework. In 2018 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC) (pp. 733–736). IEEE.
