Article
A Near Real-Time Automatic Speaker Recognition
Architecture for Voice-Based User Interface
Parashar Dhakal 1 , Praveen Damacharla 2 , Ahmad Y. Javaid 1, * and Vijay Devabhaktuni 2
1 Electrical Engineering and Computer Science Department, the University of Toledo, Toledo, OH 43606, USA;
Parashar.Dhakal@Utoledo.edu
2 ECE Department, Purdue University Northwest, Hammond, IN 46323, USA; Ldamacha@pnw.edu (P.D.);
Vjdev@pnw.edu (V.D.)
* Correspondence: Ahmad.Javaid@Utoledo.edu; Tel.: +1-419-530-8260
Received: 26 January 2019; Accepted: 15 March 2019; Published: 19 March 2019
Abstract: In this paper, we present a novel pipelined near real-time speaker recognition architecture
that enhances the performance of speaker recognition by exploiting the advantages of hybrid feature
extraction techniques that contain the features of Gabor Filter (GF), Convolution Neural Networks
(CNN), and statistical parameters as a single matrix set. This architecture has been developed to
enable secure access to a voice-based user interface (UI) by enabling speaker-based authentication
and integration with an existing Natural Language Processing (NLP) system. Gaining secure access to
existing NLP systems also served as motivation. Initially, we identify challenges related to real-time
speaker recognition and highlight the recent research in the field. Further, we analyze the functional
requirements of a speaker recognition system and introduce the mechanisms that can address these
requirements through our novel architecture. Subsequently, the paper discusses the effect of different
techniques such as CNN, GF, and statistical parameters in feature extraction. For the classification,
standard classifiers such as Support Vector Machine (SVM), Random Forest (RF) and Deep Neural
Network (DNN) are investigated. To verify the validity and effectiveness of the proposed architecture,
we compared different parameters including accuracy, sensitivity, and specificity with the standard
AlexNet architecture.
Keywords: classifiers; convolution neural network; architecture; feature extraction; machine learning;
random forest; speaker recognition; voice interface
1. Introduction
Automatic speaker recognition is a challenging task as speakers may have different accents,
pronunciations, styles, word rates, and emotional states. Furthermore, the presence of environmental
noise makes the task even more challenging. Development of a functional speaker recognition
architecture has paved the way for many real-world applications such as speech forensics, robotics,
and home control systems [1]. In several areas, recorded voice has been traditionally used to collect
data through transcription. Healthcare is one such area where the voice is recorded for later reporting.
Moving further towards real-world applications in healthcare, researchers have now been focusing on
point-of-impact technologies. Statistically, the large number of errors made by medical first responders during transfer of care, leading to loss of human lives, is considered the primary reason for this shift [2]. It is notable that limited response time, lack of a standard technical language/protocol, and
the difference in knowledge of equipment among the caretakers, serve as primary sources of these
errors. It is expected that the use of a speaker recognition system integrated with an existing Natural
Language Processing (NLP) system for voice-activated decision making or synthetic assistance (SA),
might reduce such errors to some extent, and improve the voice-based UI while providing secure access
to these systems [3,4]. Such systems, involving the use of synthetic assistants (SA) or voice-activated
decision-making, would also require authentication through speaker recognition.
We propose the use of an in-house developed synthetic assistant for such applications to minimize
errors by augmenting a medic’s capabilities through assistance in decision-making. A personalized
voice assistant with custom-built skills would be a great example of an SA for such applications.
However, these commercial off the shelf (COTS) devices and the associated applications are available
to the public and may pose security and privacy challenges if access is not limited. One of the ways to
limit access can be a high-accuracy real-time speaker recognition system. Motivated by the applications
of real-time speaker recognition, we propose a generalized speaker recognition architecture that can
be used in near real-time without compromising accuracy. Furthermore, the proposed architecture
overcomes the issue of recognizing a single speaker when multiple non-overlapping speakers are
present in a noisy environment. The proposed architecture would also make the recognition simpler
and allow further integration with other existing NLP systems.
Classification algorithms and feature extraction techniques are integral components of a speaker
recognition system. Several algorithms for speaker recognition have been proposed over the years
with varying degrees of success. Conventional speaker recognition systems used a Gaussian mixture
model (GMM)-based hidden Markov models (HMMs) [5]. However, it was inefficient for modeling
data that lie on or near a non-linear manifold in the data space. To understand this concept, consider the example of modeling a set of points that lie close to the surface of a sphere. Using an appropriate model class, those points would require only a few parameters; on the contrary, modeling them would require a large number of diagonal or full-covariance Gaussians. Later on, with the development
of various machine learning (ML) algorithms, the research community shifted its focus to algorithms
such as support vector machine (SVM), random forest (RF), Linear Regression, K-Nearest Neighbors,
K-means, and most recently, deep neural network (DNN). Among these, we found that SVM, RF, and
DNN exhibit the best performance in recent works [6].
Researchers identified random forest as the best classifier among 179 classifiers considered in
their study [7]. However, another group of researchers claimed that these results lacked a held-out
test set and excluded trials with errors [8]. Furthermore, the statistical tests in [7] showed that RF did
not have higher percentage accuracy than SVM and artificial neural networks (ANNs). Therefore,
in our work, we explore SVM, RF, and DNN classifiers to determine which one would be the best
for speaker recognition. It is also noteworthy that choosing the correct classifier does not mean that
the speaker recognition problem is solved and care should be taken to ensure that an optimal set of
features are being extracted from the speech sample under consideration. Although features such as
pitch, zero-crossing rate (ZCR), short-time energy (STE), spectral centroid (SC), spectral roll-off (SR),
and spectral flux (SF) have been previously used in speaker recognition, they are more suitable for
background noise detection [9–11]. Similarly, Mel frequency cepstral coefficients (MFCC) have also
been used in speaker recognition, but they tend to give unreliable results in noisy environments [12].
This paper contributes to the area of speech processing by proposing a voice authentication
based access to an NLP system. Voice authentication is accomplished by developing a generic and
robust speaker recognition architecture that could be integrated with an existing cloud-based NLP.
The paper also gives insights into the essential features of speech signals that are vital to training
popular ML algorithms for speaker recognition. Insight into features was achieved through a detailed
performance analysis of different ML algorithms using the standard ELSDSR and in-house generated
dataset. The proposed novel architecture uses GF, CNN, and statistical parameters separately for
feature extraction. Also, we briefly discuss ML algorithm-based classifiers such as RF, SVM, and DNN
and present results related to their behavior when used with different datasets. Finally, we compare
our architecture performance to the standard AlexNet architecture [13].
The rest of the paper is organized into six sections. Section 2 discusses the most relevant related
work in speaker recognition. Section 3 presents our proposed high-speed pipelined architecture.
Section 4 details the pre-processing and feature extraction modules of the proposed architecture.
Section 5 presents a detailed discussion on ML-based algorithms that have been used in speaker
recognition applications. Section 6 presents our experimental results and related discussion. Finally,
we conclude the paper with comments on possible future work directions along with limitations.
2. Related Work
In this section, we discuss existing architectures including the authentication process, and
techniques that have been used for speaker recognition. After a detailed literature survey, we found
that very few architectures for speaker recognition or related areas have been proposed to date. We
found it difficult to ascertain whether one of the most recent attempts, a multilingual speech recognition architecture, can be used for speaker recognition [14]. On top of that, the authors also
did not mention if this architecture is compatible with cloud-based NLP applications. On a positive
note, this architecture is well-defined and is expected to be useful for real-time applications. Other
popular architectures in literature are simple in design and use old classification algorithms and feature
extraction techniques [1,15,16]. These architectures also lack real-time applicability.
One of the most recently proposed (2019) fully supervised speaker diarization frameworks [17] offers an “online system with offline quality.” This work proposes the use of unbounded
interleaved-state RNN (UIS-RNN) which is a trainable model and outperforms the spectral offline
clustering algorithm on the NIST SRE 2000 CALLHOME benchmark [18]. Another recent work
proposes a speaker verification framework using CNN with a focus on demonstrating the use of
effective pair selection for verification [19]. This work primarily leverages the protocol proposed by the authors of the VoxCeleb database [20]. In a related work, the authors published part 2 of that effort, in which VoxCeleb data is used for speaker recognition with CNN models; the paper examines network depth against performance and concludes that substantial improvement is achieved with greater network depth [21]. In another work, deep CNNs and LSTMs were used for ASR acoustic modeling [22], with accuracies ranging between 69–83% for various scenarios, using the TIMIT database [23]. A 2018 work also proposes a text-independent ASR using LSTM which
creates match and non-match pairs for ASR through learning the speaker as well as the background
sound [24]. Similar to [18], this model uses overlaps in signal windows during feature extraction to
capture temporal variations and suggests that LSTM fails to capture long dependencies in speaker
characteristics. In another CNN and Gaussian mixture model (GMM) based speaker recognition model, the authors used spectrograms to recognize speakers from short utterances, and the results presented are promising [25]. The recent LSTM-based works further conclude that LSTM has not been successful in capturing long dependencies among speaker characteristics while trying to capture temporal speech variation [18,26].
A few researchers have also focused on authenticating along with recognizing an individual’s voice
by comparing the collected voice template for training and testing [27,28]. However, architecture
performance was not reported for noisy or real-time environments. Similarly, another work focused on
building a secure real-time android-based speaker recognition system to identify the technician’s voice
in the laboratory [29]. In this work, the authors state that the authentication process is complicated
in noisy environments. Moreover, the authors did not report any performance-related results of the
proposed system. Similarly, different classification and feature extraction techniques have been used
in automatic speech recognition (ASR), which can be used for speaker recognition as well with some
improvements. DNN and CNN have both been successfully applied to ASR with cepstral coefficient
features as input to these networks. These researchers found CNN to perform better than DNN and
both of these to perform better than GMM/HMM [30,31]. However, some researchers have just applied
the CNN model for both feature extraction and classification [32]; and reported the error rate reduction
compared to DNN. Researchers have also used CNN as a feature extractor and linear classifiers as
a classifier and reported better performance compared to the CNN alone as a feature extractor and
classifier [33].
Similarly, another work employed Gabor filter (GF)-based features for ASR [34]. The generated
features were classified using different classifiers. This work claims that the GF-based feature extraction
method gave better recognition than those of MFCC, perceptual linear predictive (PLP) and LPC.
In the past, features generated from GF have been used as inputs to DNN to generate Gabor-DNN features and to CNN for improved speech recognition [12,35,36]. One such research incorporated GF into convolution filter kernels where a variety of Gabor features served as the feature maps of the convolution layer [12]. The authors achieved a better result compared to when Gabor features were used without CNN. It was then concluded that the CNN or DNN features alone are not enough and better speaker recognition is achieved when Gabor features are also incorporated. Similarly, in another recent work, the authors reported a method where specific weight kernels of a CNN are replaced with GF to reduce the training complexity of CNN [37]. This work claims that a better performance in terms of time and energy was achieved compared to CNN because the convolutional layers use the GF as fixed weight kernels that extract intrinsic features with regular trainable weight kernels. This implementation used the MNIST, FaceDet and TICH datasets. These findings motivated our research towards the development of a novel architecture design that uses GF, CNN, and statistical parameters separately for feature extraction.
3. High Speed Pipelined Architecture
The proposed high-speed pipelined architecture depicted in Figure 1 comprises the following blocks: pre-processing, feature extraction, feature selection, classification, trained model, and the speech signal controller block (SSCB). This architecture was designed with the aim to make it simple, highly accurate, and reliable in a noisy environment with the capability to recognize non-overlapping multiple speakers. It was also considered necessary for the architecture to be compatible with cloud-based applications such as popular NLP systems. The proposed architecture consists of different modules, and each of them has been discussed in detail in the paper. In the following paragraphs, we present details of the operation of the proposed architecture.
Figure 1. Proposed speaker recognition architecture.
Suppose that there are six individuals, A, B, C, D, E, and F, all of whom are talking and trying to access the NLP. Let us also assume that the system has been trained to detect only the user B. The proposed high-speed pipelined architecture will recognize only B as the correct person and send B’s voice to the NLP system. Later, if the system designer decides to give access to C as well, then the feature database will be updated with the features of C’s voice. Next, an updated trained model will be generated based on the new feature database. Now, using the generated trained model as a reference, the SSCB will start accepting and transferring the voice of both users B and C to the NLP system.
As shown in Figure 1, after the acquisition of a real-time speech signal, the pre-processing block performs noise reduction and file conversion to a suitable format. Next, the feature extraction block
extracts features from the provided voice sample including statistical features, Gabor features, and
CNN-generated features. After that, the feature selection block uses the univariate feature selection
technique, which selects the best features from the extracted features based on univariate statistical tests and sends them to the feature database.
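As an illustration of this step, below is a minimal sketch of univariate feature selection using scikit-learn's SelectKBest (Python and scikit-learn are the toolchain named elsewhere in the paper); the feature matrix sizes, labels, and the value of k are placeholders rather than the settings actually used in the architecture.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif

# Hypothetical feature matrix: one row per voice sample, one column per
# candidate feature (statistical, Gabor, and CNN-generated features stacked).
rng = np.random.default_rng(0)
X = rng.random((200, 56))            # 200 samples x 56 candidate features (placeholder sizes)
y = rng.integers(0, 2, 200)          # 1 = authorized speaker, 0 = other speakers

# Univariate selection: score each feature independently with an ANOVA F-test
# and keep the k highest-scoring features for the feature database.
selector = SelectKBest(score_func=f_classif, k=30)
X_selected = selector.fit_transform(X, y)

print(X_selected.shape)                          # (200, 30)
print(selector.get_support(indices=True))        # indices of the retained features
```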
Using the feature database, the classification block employs the best ML algorithm and builds a trained model. The proposed architecture uses several ML algorithms, and for each new user, the SSCB will train them and then select one of the algorithms mentioned above based on the highest accuracy. A separate set of experiments was performed (discussed in Section 6) to study the best classifier, where we compared three ML-based classifiers. These ML-based classifiers include SVM,
RF, and DNN. The classifier that gave the result with the highest accuracy within a given time was
selected as the best classifier to build a trained model. The trained model is then stored in the SSCB for
comparison with the real-time test voice signals.
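A minimal sketch of this classifier-selection step follows: each candidate is trained on the stored features and the one with the highest held-out accuracy is kept. The data, the train/test split, and the use of scikit-learn's MLPClassifier as a stand-in for the DNN are assumptions for illustration only.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.neural_network import MLPClassifier

# Hypothetical selected-feature matrix and binary speaker labels (placeholders).
rng = np.random.default_rng(0)
X = rng.random((200, 30))
y = rng.integers(0, 2, 200)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# The three candidate classifiers discussed in the paper; a small MLP stands in
# for the DNN here to keep the sketch self-contained.
candidates = {
    "SVM": SVC(kernel="rbf"),
    "RF": RandomForestClassifier(n_estimators=100, random_state=0),
    "DNN": MLPClassifier(hidden_layer_sizes=(50, 100, 25), max_iter=500, random_state=0),
}

# Train every candidate and keep the one with the highest held-out accuracy,
# mirroring the selection performed before the trained model is stored in the SSCB.
scores = {name: clf.fit(X_tr, y_tr).score(X_te, y_te) for name, clf in candidates.items()}
best_name = max(scores, key=scores.get)
print(scores, "-> selected:", best_name)
```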
Finally, the SSCB uses the extracted features from the test speech signal, tests them against the trained model, and decides whether to send the captured real-time speech signal to the NLP
system. This block will reject the captured real-time speech signal if it fails to pass the test against the
trained model. In essence, the SSCB acts as a final switch in the proposed architecture and recognizes
individuals for whom the classification algorithm has already been trained while samples from new
unauthorized users will not be recognized. This systematic approach in our proposed architecture
operation results in a robust speaker recognition process. In the subsequent sections, we will discuss
the submodules of the proposed architecture in more detail.
Therefore, with speakers of diverse nationalities, we aimed to check the robustness and accuracy of our architecture for various accents. Several large datasets used in previous publications are available to researchers in voice recognition and authentication, for example, the VoxCeleb database [20]. The rationale for not using large datasets such as VoxCeleb or Harvard-Haskins is that our model targets applications in sensitive environments. Our architecture should tell who specifically is in the system rather than who is not in the system. In real-world applications, users will submit only one or two voice samples for authentication on their personal assistant devices, so in most cases our architecture has to work with a very sparse and small dataset. The main aim of proposing this architecture was to provide secure user access to an existing NLP system. Based on our application domain, we wanted a limited dataset and therefore generated in-house data and looked for similar standard datasets. We found the ELSDSR data to be the most appropriate for our application. The following subsections detail the feature selection process and the features that were considered during the selection process.
The input parameter ‘σ’ represents the standard deviation of the Gaussian function, ‘λ’ represents
the wavelength of harmonic function, ‘θ’ is the orientation, ‘γ’ is the spatial aspect ratio with a constant
value of 0.5, and ‘ψ’ represents the phase shift of harmonic function. The spatial frequency bandwidth
is constant and is equal to 1. The output value is the weight of the filter at the (x, y) location. Here,
we used 2D GFs that are equally spaced in orientation (i.e., equal θ) to capture the maximum number
of characteristic textural features [43]. These 24 different GFs were tuned to four orientation values (θ = 0°, 45°, 90°, and 135°) and six frequencies (1/λ) (where λ = 2.82, 5.65, 11.31, 22.62, 45.25, 90.50), as shown in Figure 2.
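For concreteness, the standard real-valued 2D Gabor filter consistent with the parameters named above can be written as g(x, y; σ, λ, θ, γ, ψ) = exp(−(x′² + γ²y′²)/(2σ²)) · cos(2πx′/λ + ψ), with x′ = x cos θ + y sin θ and y′ = −x sin θ + y cos θ; the paper's own equation is not reproduced in this excerpt, so this form is an assumption. The sketch below builds the 24-filter bank with OpenCV; the kernel size is a placeholder, and σ is tied to λ only because the text fixes the spatial frequency bandwidth, which in the one-octave case gives σ ≈ 0.56λ.

```python
import numpy as np
import cv2  # OpenCV ships a standard 2D Gabor kernel generator

# Build the 24-filter Gabor bank described above: 4 orientations x 6 wavelengths,
# gamma fixed at 0.5 and psi assumed to be 0. Sigma is derived from lambda so
# that the spatial-frequency bandwidth stays constant at one octave
# (sigma ~= 0.56 * lambda); the 31x31 kernel size is a placeholder.
orientations = [0.0, np.pi / 4, np.pi / 2, 3 * np.pi / 4]      # theta = 0, 45, 90, 135 degrees
wavelengths = [2.82, 5.65, 11.31, 22.62, 45.25, 90.50]         # lambda values from the text

bank = []
for theta in orientations:
    for lambd in wavelengths:
        kernel = cv2.getGaborKernel(ksize=(31, 31), sigma=0.56 * lambd, theta=theta,
                                    lambd=lambd, gamma=0.5, psi=0.0, ktype=cv2.CV_32F)
        bank.append(kernel)

print(len(bank), bank[0].shape)   # 24 kernels, each a 31x31 array of filter weights
```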
Figure 2. Twenty-four Gabor Filters (GFs) of six frequencies and four orientations.
This CNN model was trained with both the standard dataset (ELSDSR) [45] and the collected 26 voice samples that were converted to images.
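The exact audio-to-image conversion is not reproduced in this excerpt; as one plausible realization, the sketch below turns a voice sample into a log-mel spectrogram image using librosa, with the file name, sampling rate, and mel-band count as placeholder assumptions.

```python
import numpy as np
import librosa
import matplotlib.pyplot as plt

# Load a hypothetical voice sample and convert it to a log-mel spectrogram,
# an image-like 2D representation a CNN can consume.
signal, sr = librosa.load("speaker_sample.wav", sr=16000)          # placeholder file
mel = librosa.feature.melspectrogram(y=signal, sr=sr, n_mels=128)
log_mel = librosa.power_to_db(mel, ref=np.max)

# Save the 2D array as a grayscale image for the CNN training pipeline.
plt.imsave("speaker_sample.png", log_mel, cmap="gray", origin="lower")
```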
• MFCCs are derived from a cepstral representation of the audio clip, which is the amplitude of the
resulting spectrum of the frequency bands that are equally spaced on the mel scale [12].
• ZCR of a speech signal is the rate of sign changes along the signal, i.e., the rate at which the signal crosses zero, and is a good indication of speech variability [10].
• STE is a good measure of the total energy in a short analysis frame of the voice signal and
calculated by windowing of the speech sequence and summation of energy in that short
window [10].
• SR is the frequency below which 85% of the total spectrum energy is contained and indicates
mean and the variance of the roll-off across timeframes in the texture window [10,11].
• SC of a speech signal is defined as the center of gravity of the frequency components in
the spectrum and is evaluated using Fourier transform [10,11]. Perceptually, it indicates the
“brightness” of a sound.
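As an illustration of the low-level descriptors listed above, the following is a minimal sketch using librosa and NumPy; the file name, frame length, and hop size are placeholder assumptions rather than the exact analysis settings used in the paper.

```python
import numpy as np
import librosa

# Load a hypothetical voice sample and compute frame-level descriptors
# corresponding to the list above: MFCC, ZCR, STE, SR (85% roll-off), and SC.
y, sr = librosa.load("speaker_sample.wav", sr=16000)
frame_length, hop_length = 2048, 512

mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)                                   # MFCCs
zcr = librosa.feature.zero_crossing_rate(y, frame_length=frame_length,
                                         hop_length=hop_length)                      # ZCR
frames = librosa.util.frame(y, frame_length=frame_length, hop_length=hop_length)
ste = np.sum(frames.astype(np.float64) ** 2, axis=0)                                 # short-time energy
rolloff = librosa.feature.spectral_rolloff(y=y, sr=sr, roll_percent=0.85)            # SR
centroid = librosa.feature.spectral_centroid(y=y, sr=sr)                             # SC ("brightness")

print(mfcc.shape, zcr.shape, ste.shape, rolloff.shape, centroid.shape)
```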
extraction and selection. Further, we used the asynchronous stochastic gradient descent method for
training the DNN. The architecture of the network was 5-layered with 56 input nodes, 3 hidden layers
with 50, 100 and 25 nodes, followed by two output nodes. The DNN classifier was implemented
using the TensorFlow library and Python. Table 3 summarizes the parameters used for DNN in our
proposed architecture.
Table 3. Parameters used for DNN in the proposed architecture.

Parameters | Specifications
Network type | Feed-forward backpropagation
Number of layers | Five layers with three hidden, one input and one output
Activation function | Rectified Linear Unit
Training algorithm | Gradient descent
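A minimal sketch of the five-layer network summarized above (56 input nodes, hidden layers of 50, 100, and 25 ReLU units, and two output nodes), written with TensorFlow's Keras API since the text states the DNN was implemented with TensorFlow and Python; the optimizer settings, training data, and epoch count are placeholders, and plain SGD stands in for the asynchronous stochastic gradient descent mentioned earlier.

```python
import numpy as np
import tensorflow as tf

# Feed-forward backpropagation network matching the layer sizes in Table 3:
# 56 input features -> 50 -> 100 -> 25 hidden ReLU units -> 2 output nodes.
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(56,)),
    tf.keras.layers.Dense(50, activation="relu"),
    tf.keras.layers.Dense(100, activation="relu"),
    tf.keras.layers.Dense(25, activation="relu"),
    tf.keras.layers.Dense(2, activation="softmax"),   # authorized vs. unauthorized
])

# Plain SGD stands in for the asynchronous variant described in the text;
# the learning rate, batch size, and epoch count are placeholder values.
model.compile(optimizer=tf.keras.optimizers.SGD(learning_rate=0.01),
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

X = np.random.rand(200, 56).astype("float32")   # hypothetical selected-feature matrix
y = np.random.randint(0, 2, size=200)
model.fit(X, y, epochs=5, batch_size=16, verbose=0)
```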
Out of the 26 speakers, 12 were from India, 6 from Nepal, 1 from the Kingdom of Saudi Arabia (KSA), 1 from China, and 6 from the USA. The experiments conducted on the collected samples from 26 speakers showed that only some features were capable of giving good results.
Apart from the features used in our architecture, we also extracted features such as MFCC, ZCR,
pitch, SC, SR, STE, and SF that have been popularly used for speaker recognition as discussed in
Section 4.5 [9–12]. All of these extracted features were investigated for speaker recognition using
different classifiers to see how they perform when compared with the features used in the proposed
framework. A comparison of the accuracy of different classifiers with different features is shown in
Table 5. It can be seen from these results that our proposed feature extraction method provided the
best accuracy for all the classifiers considered in the current work.
Table 5. Accuracy comparison of different feature extraction techniques with different machine learning-based algorithms.
Similarly, the feature selection technique was performed on each of the extracted features. This
technique helped us achieve higher accuracy while reducing training time. Table 6 reports the
improvement in accuracy of ML-based algorithms after the use of the proposed feature selection
technique. For this comparison, only a realistic dataset could be used: dataset 1 (ELSDSR) is not a reliable source for this specific evaluation since it was collected in a controlled environment without any background noise, whereas dataset 2 was collected in the real world with all background sounds and uncertainties, which some of the feature extraction methods require. Therefore, we only used dataset 2 for this specific purpose.
Table 8. Classifier accuracy, specificity and sensitivity for the collected 26 samples compared to pre-trained AlexNet [13].
6.3. Real-time Performance of the Architecture
With an aim to study the real-time performance of the proposed architecture with the best classifier, i.e., RF, we randomly chose five speakers out of the samples of the 26 speakers from both “known” and “unknown” classes. All these speakers were asked to speak, and then they were authenticated one by one using the proposed framework. From these five speakers, test samples lasting approximately 5–6 s were taken and tested against the trained model. The average processing time for the proposed architecture, i.e., the time to authenticate after the test sample is received, was found to be 0.3456 s. For any communication to be perceived as real-time, bidirectional delays must be less than 300 ms, whereas in our case it takes a little more than that. The time taken for individual samples and the average processing time are reported in Table 9.
Table 9. Processing time for test samples.
We concluded that the authentication time measured with only five speakers was a good enough representation of all the speakers in the dataset.
Real-time performance of the architecture enables seamless integration with the cloud-based
NLP system without compromising the performance of the NLP system. To examine the integration
capabilities of the developed speaker recognition architecture to the cloud-based NLPs, we chose three different NLPs: Amazon Alexa [56], Google Now [57], and Microsoft Cortana [58]. All of these
services use APIs that provide access to the respective NLP. The interface can be achieved by a far-field
or hands-free method of integration where a wake word is used to alert the system. In each case,
our developed application will authenticate the speech signal after the wake word. As illustrated in
the testing method of our architecture, feature extraction and selection along with a speech signal
controller would form a deployment package, which is integrated into the respective APIs. The speech signal controller would be built on a cloud server every time a new speaker’s voice is to be added.
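To make the deployment idea concrete, the sketch below shows how a speech signal controller could gate the post-wake-word audio before it reaches a cloud NLP; speech_signal_controller, recognize_speaker, and forward_to_nlp are hypothetical names for illustration and are not functions from the Alexa, Google, or Cortana SDKs, and the confidence threshold is a placeholder.

```python
from typing import Callable, Optional, Tuple

AUTHORIZED_SPEAKERS = {"B", "C"}   # speakers the current trained model accepts

def speech_signal_controller(audio: bytes,
                             recognize_speaker: Callable[[bytes], Tuple[str, float]],
                             forward_to_nlp: Callable[[bytes], str]) -> Optional[str]:
    """Forward the post-wake-word audio to the cloud NLP only if the speaker is authorized.

    `recognize_speaker` wraps feature extraction, selection, and the trained
    classifier; `forward_to_nlp` wraps the chosen NLP service API. Both are
    hypothetical callables standing in for the deployment package components.
    """
    speaker, confidence = recognize_speaker(audio)
    if speaker in AUTHORIZED_SPEAKERS and confidence >= 0.9:   # placeholder threshold
        return forward_to_nlp(audio)
    return None   # reject: speaker unknown or unauthorized

# Toy usage with stand-in callables:
response = speech_signal_controller(
    b"...post-wake-word audio...",
    recognize_speaker=lambda a: ("B", 0.97),
    forward_to_nlp=lambda a: "NLP response",
)
print(response)
```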
7. Conclusions
In this paper, we presented a novel high-speed pipelined architecture that supports real-time
speaker recognition, allowing NLP systems to accept only selected speech signals in near real-time.
Most commercially available NLP systems, even after training by a single user, are known to exhibit
high false positive rates. The proposed intelligent system attempts to solve this problem by detecting
the authorized user with high accuracy. Moreover, we exploit the advantages of GF, CNN, and
statistical parameters for feature extraction. Based on our experiments, we found the feature extraction
techniques mentioned above and RF to be optimal in speaker recognition for feature extraction and
classification, respectively. Our proposed architecture relies primarily on the feature extraction block
and classification block for achieving high accuracy. Results of rigorous testing with multiple datasets
reveal that RF performed better compared to the other two methods for speaker recognition. While
performance decreased in the order of RF-DNN-SVM, the time taken and complexity showed a
different trend of decrease in the order of DNN-SVM-RF.
Unlike other approaches, the proposed architecture establishes a mechanism to perform speaker
recognition in near real-time. Moreover, we assessed this architecture in terms of both %EER and a
DET curve in a collected database consisting of 26 different speakers, with different native languages,
and speaking the same language—English. Initially, we hypothesized that we need to train more data
points for each accent to be recognized by the model. However, we observed that the model performed
equally well regardless of the accent. We also made a comparative analysis of different algorithms
based on %EER and a DET curve that is more preferred in the case of the speaker verification system.
Finally, the results of the comparison between the proposed and AlexNet architectures show that the proposed architecture is capable of handling and recognizing the voices of multiple individuals without a significant impact on accuracy, while achieving better performance. This work could be extended in the future to identify the voices of multiple individuals talking at the same time, i.e., overlapping voices, possibly using different languages, in real time (<300 ms), and integrated into an NLP system.
Author Contributions: Conceptualization, P.D. (Praveen Damacharla); Data curation, P.D. (Parashar Dhakal);
Formal analysis, P.D. (Parashar Dhakal); Funding acquisition, V.D. and A.Y.J.; Investigation, P.D. (Praveen
Damacharla); Methodology, P.D. (Parashar Dhakal); Project administration, P.D. (Praveen Damacharla), A.Y.J.
and V.D.; Resources, A.Y.J.; Supervision, A.Y.J. and V.D.; Visualization, P.D. (Praveen Damacharla) and A.Y.J.;
Writing—original draft, P.D. (Parashar Dhakal); Writing—review & editing, P.D. (Praveen Damacharla), A.Y.J.
and V.D.
Funding: This study was funded by Round 1 Project Award “Improving Healthcare Training and Decision
Making Through LVC” from the Ohio Federal Research Jobs Commission (OFMJC) through Ohio Federal Research
Network (OFRN).
Acknowledgments: This work was partially supported by Department of Electrical Engineering and Computer
Science, the University of Toledo and Round 1 Project Award “Improving Healthcare Training and Decision
Making Through LVC” from the Ohio Federal Research Jobs Commission (OFMJC) through Ohio Federal Research
Network (OFRN). The authors are also thankful to Paul A. Hotmer Family CSTAR (Cyber Security and Teaming
Research) lab at the University of Toledo.
Conflicts of Interest: The authors declare no conflict of interest. The funders had no role in the design of the
study; in the collection, analyses, or interpretation of data; in the writing of the manuscript, or in the decision to
publish the results.
References
1. Das, T.; Nahar, K.M. A voice identification system using hidden Markov model. Indian J. Sci. Technol. 2016,
9, 4. [CrossRef]
2. Makary, M.A.; Daniel, M. Medical error—The third leading cause of death in the US. BMJ 2016, 353.
[CrossRef] [PubMed]
3. Damacharla, P.; Dhakal, P.; Stumbo, S.; Javaid, A.Y.; Ganapathy, S.; Malek, D.A.; Hodge, D.C.;
Devabhaktuni, V. Effects of voice-based synthetic assistant on performance of emergency care provider in
training. Int. J. Artif. Intell. Educ. 2018. [CrossRef]
4. Damacharla, P.; Javaid, A.Y.; Gallimore, J.J.; Devabhaktuni, V.K. Common metrics to benchmark
human-machine teams (HMT): A review. IEEE Access 2018, 6, 38637–38655. [CrossRef]
5. Hinton, G.; Deng, L.; Yu, D.; Dahl, G.; Mohamed, A.R.; Jaitly, N.; Senior, A.; Vanhoucke, V.; Nguyen, P.;
Kingsbury, B.; et al. Deep neural networks for acoustic modeling in speech recognition: The shared views of
four research groups. IEEE Signal. Process. Mag. 2012, 29, 82–97. [CrossRef]
6. Cutajar, M.; Gatt, E.; Grech, I.; Casha, O.; Micallef, J. Comparative study of automatic speech recognition
techniques. IET Signal. Process. 2013, 7, 25–46. [CrossRef]
7. Fernández-Delgado, M.; Cernadas, E.; Barro, S.; Amorim, D. Do we need hundreds of classifiers to solve real-world classification problems? J. Mach. Learn. Res. 2014, 15, 3133–3181.
8. Wainberg, M.; Alipanahi, B.; Frey, B.J. Are random forests truly the best classifiers? J. Mach. Learn. Res. 2016,
17, 3837–3841.
9. Liu, Z.; Huang, J.; Wang, Y.; Chen, T. Audio feature extraction and analysis for scene classification. J. VLSI
Signal. Process. Syst. 1997, 20, 61–79. [CrossRef]
10. Zahid, S.; Hussain, F.; Rashid, M.; Yousaf, M.H.; Habib, H.A. Optimized audio classification and segmentation
algorithm by using ensemble methods. Math. Probl. Eng. 2015, 2015, 209814. [CrossRef]
11. Lozano, H.; Hernandez, I.; Navas, E.; Gonzalez, F.; Idigoras, I. Household sound identification system for
people with hearing disabilities. In Proceedings of the Conference and Workshop on Assistive Technologies
for People with Vision and Hearing Impairments, Granada, Spain, 28–31 August 2007.
12. Chang, S.Y.; Morgan, N. Robust CNN-Based Speech Recognition with Gabor Filter Kernels. In Proceedings
of the Fifteenth Annual Conference of the International Speech Communication Association, Singapore,
14–18 September 2014.
13. Krizhevsky, A.; Sutskever, I.; Hinton, G.E. ImageNet classification with deep convolutional neural networks. Adv. Neural
Inf. Process. Syst. 2012, 60, 84–90. [CrossRef]
14. Gonzalez-Dominguez, J.; Eustis, D.; Lopez-Moreno, I.; Senior, A.; Beaufays, F.; Moreno, P.J. A real-time
end-to-end multilingual speech recognition architecture. IEEE J. Sel. Top. Signal Process. 2015, 9, 749–759.
[CrossRef]
15. Karpagavalli, S.; Chandra, E. A Review on Automatic speech recognition architecture and approaches. Int. J.
Signal. Process. Image Process. Pattern Recognit. 2016, 9, 393–404.
16. Goyal, S.; Batra, N. Issues and challenges of voice recognition in pervasive environment. Indian J. Sci. Technol.
2017, 10, 30. [CrossRef]
17. Zhang, A.; Wang, Q.; Zhu, Z.; Paisley, J.; Wang, C. Fully Supervised Speaker Diarization. arXiv preprint 2018.
Available online: https://arxiv.org/pdf/1810.04719.pdf (accessed on 18 March 2019).
18. Zhang, A.; Wang, Q.; Zhu, Z.; Paisley, J.; Wang, C. Fully supervised speaker diarization. In Proceedings of the
IEEE International Conference on Acoustics, Speech, and Signal. Processing, Brighton, UK, 12–17 May 2019.
19. Salehghaffari, H. Speaker Verification using Convolutional Neural Networks. arXiv 2018, arXiv:1803.05427.
20. Nagrani, A.; Chung, J.S.; Zisserman, A. VoxCeleb: A Large-Scale Speaker Identification Dataset. arXiv 2017,
arXiv:1706.08612.
21. Chung, J.S.; Nagrani, A.; Zisserman, A. VoxCeleb2: Deep Speaker Recognition. Presented at the Interspeech
2018, Hyderabad, India, 6 September 2018. Available online: http://dx.doi.org/10.21437/Interspeech.2018-
1929 (accessed on 16 February 2019).
22. Xiaoyu, L. Deep Convolutional and LSTM Neural Networks for Acoustic Modelling in Automatic Speech Recognition;
Pearson Education Inc.: Hoboken, NJ, USA, 2017; pp. 1–9.
23. Zue, V.; Seneff, S.; Glass, J. Speech database development at MIT: TIMIT and beyond. Speech Commun. 1990,
9, 351–356. [CrossRef]
24. Mobiny, A. Text-Independent Speaker Verification Using Long Short-Term Memory Networks. arXiv 2018,
arXiv:1805.00604.
25. Liu, Z.; Wu, Z.; Li, T.; Li, J.; Shen, C. GMM and CNN hybrid method for short utterance speaker recognition.
IEEE Trans. Ind. Inf. 2018, 14, 3244–3252. [CrossRef]
26. Selvaraj, S.S.P.; Konam, S. Deep Learning for Speaker Recognition. 2017. Available online: https://arxiv.org/
ftp/arxiv/papers/1708/1708.05682.pdf (accessed on 18 March 2019).
27. Rudrapal, D.; Das, S.; Debbarma, S.; Kar, N.; Debbarma, N. Voice recognition and authentication as a
proficient biometric tool and its application in online exam for PH people. Int. J. Comput. Appl. 2012, 39, 12.
28. Dhakal, P.; Damacharla, P.; Javaid, A.Y.; Devabhaktuni, V. Detection and Identification of Background
Sounds to Improvise Voice Interface in Critical Environments. In Proceedings of the 2018 IEEE International
Symposium on Signal. Processing and Information Technology (ISSPIT), Louisville, KY, USA, 6–8 December
2018; pp. 78–83. [CrossRef]
29. Nandish, M.; Balaji, M.C.; Shantala, C.P. An outdoor navigation with voice recognition security application
for visually impaired people. Int. J. Eng. Trends Technol. 2014, 10, 500–504.
30. Sainath, T.N.; Mohamed, A.R.; Kingsbury, B.; Ramabhadran, B. Deep Convolutional Neural Networks for
LVCSR. In Proceedings of the IEEE International Conference on acoustics, Speech and Signal Processing,
Vancouver, BC, Canada, 26–31 May 2013; Volume 2013, pp. 8614–8618.
31. Vesely, K.; Karafit, M.; Grzl, F. Convolutive Bottleneck Network Features for LVCSR. In Proceedings of
the IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU), Big Island, HI, USA,
11 December 2011; pp. 42–47.
32. Abdel-Hamid, O.; Mohamed, A.R.; Jiang, H.; Deng, L.; Penn, G.; Yu, D. Convolutional neural networks for
speech recognition. IEEE/ACM Trans. Audio Speech Lang. Process. 2014, 22, 1533–1545. [CrossRef]
33. Poria, S.; Cambria, E.; Gelbukh, A. Deep convolutional neural network textual features and multiple kernel
learning for utterance-level multimodal sentiment analysis. EMNLP 2015. [CrossRef]
34. Missaoui, I.; Zied, L. Gabor Filterbank Features for robust Speech Recognition. In Proceedings of the
International Conference on Image and Signal. Processing (ICISP), Cherbourg, France, 30 June–2 July 2014;
Springer International Publishing: Berlin, Germany, 2014; pp. 665–671.
35. Martinez, M.C.; Mallidi, S.H.; Meyer, B.T. On the relevance of auditory-based Gabor features for deep
learning in robust speech recognition. Comput. Speech Lang. 2017, 45, 21–38. [CrossRef]
36. Chang, S.Y.; Morgan, N. Informative Spectro-Temporal Bottleneck Features for Noise-Robust Speech
Recognition. In Proceedings of the Interspeech 14th Annual Conference of the International Speech
Communication Association, Lyon, France, 25–29 August 2013.
37. Sarwar, S.S.; Panda, P.; Roy, K. Gabor Filter Assisted Energy Efficient Fast Learning Convolutional Neural
Networks. In Proceedings of the 2017 IEEE/ACM International Symposium on Low Power Electronics and
Design (ISLPED), Taipei, Taiwan, 15 August 2017. [CrossRef]
38. Mahmoud, W.H.; Zhang, N. Software/Hardware Implementation of an Adaptive Noise Cancellation System.
In Proceedings of the 120th ASEE Annual Conference and Exposition, Atlanta, GA, USA, 23–26 June 2013;
pp. 23–26.
39. Wyse, L. Audio Spectrogram Representations for Processing with Convolutional Neural Networks.
In Proceedings of the IEEE International Conference on Deep Learning and Music, Anchorage, AK, USA,
18–19 May 2017; pp. 37–41.
40. Feng, L.; Hansen, L.K. A New Database for Speaker Recognition; IMM: Copenhagen, Denmark, 2005.
41. Malik, F.; Baharudin, B. Quantized Histogram Color Features Analysis for Image Retrieval Based on Median
and Laplacian Filters in DCT Domain. In Proceedings of the IEEE International Conference on Innovation
Management and Technology Research (ICIMTR), Malacca, Malaysia, 21–22 May 2012; Volume 2012.
42. Haghighat, M.; Zonouz, S.; Abdel-Mottaleb, M. CloudID: Trustworthy cloud-based and cross-enterprise
biometric identification. Exp. Syst. Appl. 2015, 42, 7905–7916. [CrossRef]
43. Jain, K.; Farrokhnia, F. Unsupervised Texture Segmentation Using Gabor Filters. In Proceedings of the IEEE
International Conference on Systems, Man and Cybernetics, Universal City, CA, USA, 4–7 November 1990.
44. Burkert, P.; Trier, F.; Afzal, M.Z.; Dengel, A.; Liwicki, M. Dexpression: A Deep Convolutional Neural Network
for Expression Recognition. arXiv 2015, arXiv:1509.05371.
45. Levi, G.; Hassner, T. Age and Gender Classification Using Convolutional Neural Networks. In Proceedings
of the IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Boston, MA,
USA, 7–12 June 2015.
46. Dieleman, S.; Schlüter, J.; Raffel, C.; Olson, E.; Sønderby, S.K.; Nouri, D.; Maturana, D.; Thoma, M.;
Battenberg, E.; Kelly, J.; et al. Lasagne: First release; Zenodo: Geneva, Switzerland, 2015; Volume 3.
47. Hinton, G.E.; Srivastava, N.; Krizhevsky, A.; Sutskever, I.; Salakhutdinov, R.R. Improving neural networks
by preventing co-adaptation of feature detectors. arXiv 2012, arXiv:1207.0580.
48. Hijazi, S.; Kumar, R.; Rowen, C. Using Convolutional Neural Networks for Image Recognition; Cadence Design
Systems Inc.: San Jose, CA, USA, 2015.
49. El-Naqa, Y.Y.; Wernick, M.N.; Galatsanos, N.P.; Nishikawa, R.M. A support vector machine approach for
detection of microcalcifications. IEEE Trans. Med. Imag. 2002, 21, 1552–1563. [CrossRef] [PubMed]
50. Hsu, W.; Chang, C.C.; Lin, C.J. A Practical Guide to Support Vector Classification; Technical Report; Department
of Computer Science and Information Engineering, National Taiwan University: Taipei, Taiwan, 2003;
pp. 1–16.
51. Liaw, A.; Wiener, M. Classification and Regression by Random Forest; The Newsletter of the R Project; The R
Foundation: Vienna, Austria, December 2002; Volume 2/3, pp. 18–22.
52. Breiman, L.; Friedman, J.; Stone, C.J.; Olshen, R.A. Classification and Regression Trees; CRC press: Boca Raton,
FL, USA, 1984.
53. Tang, Y. Deep learning using linear support vector machines. Presented at the Challenges in Representation
Learning Workshop (ICML), Atlanta, GA, USA, 2 June 2013. Available online: https://arxiv.org/pdf/1306.
0239.pdf (accessed on 16 February 2019).
54. Pedregosa, F.; et al. Scikit-learn: Machine learning in Python. J. Mach. Learn. Res. 2011, 12, 2825–2830.
55. NOVA, WGBH Science Unit Online; PBS: Washington, DC, USA, 1997; Volume 1, p. 2018.
56. Amazon, Alexa. 2018. Available online: Amazon.com (accessed on 18 March 2019).
57. Build Natural and Rich Conversational Experiences. 2018. Available online: DialogFlow.com (accessed on
18 March 2019).
58. Cortana Is Your Truly Personal Digital Assistant. 2018. Available online: Microsoft.com (accessed on
18 March 2019).
© 2019 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access
article distributed under the terms and conditions of the Creative Commons Attribution
(CC BY) license (http://creativecommons.org/licenses/by/4.0/).