Modern Speech Recognition Approaches with Case Studies
http://dx.doi.org/10.5772/2569
Edited by S. Ramakrishnan
Contributors
Chung-Hsien Wu, Chao-Hong Liu, R. Thangarajan, Aleem Mushtaq, Ronan Flynn, Edward
Jones, Santiago Omar Caballero Morales, Jozef Juhár, Peter Viszlay, Longbiao Wang, Kyohei
Odani, Atsuhiko Kai, Norihide Kitaoka, Seiichi Nakagawa, Masashi Nakayama, Shunsuke
Ishimitsu, Seiji Nakagawa, Alfredo Victor Mantilla Caeiros, Hector Manuel Pérez Meana, Komal
Arora, Ján Staš, Daniel Hládek, Jozef Juhár, Dia AbuZeina, Husni Al-Muhtaseb, Moustafa
Elshafei, Nelson Neto, Pedro Batista, Aldebaro Klautau
Published by InTech
Janeza Trdine 9, 51000 Rijeka, Croatia
Notice
Statements and opinions expressed in the chapters are those of the individual contributors and
not necessarily those of the editors or publisher. No responsibility is accepted for the accuracy
of information contained in the published chapters. The publisher assumes no responsibility for
any damage or injury to persons or property arising out of the use of any materials,
instructions, methods or ideas contained in the book.
Preface
Speech processing has become a vital area of research in engineering owing to the development of human-computer interaction (HCI). Among the various modes available in HCI, speech is a convenient and natural mode through which users can interact with machines, including computers. Hence, many scholars have been carrying out research on automatic speech recognition (ASR) to suit modern HCI techniques, which turns out to be a challenging task: natural spoken words cannot be easily recognized without pre-processing because of disturbances such as environmental noise, instrument noise, and inter- and intra-speaker speech variation.
This book focuses on speech recognition and its associated tasks, namely speech enhancement and modelling. It comprises thirteen chapters and is divided into three sections covering speech recognition, speech enhancement and speech modelling. Section 1, on speech recognition, consists of seven chapters, while Sections 2 and 3, on speech enhancement and speech modelling, have three chapters each to supplement Section 1.
The first chapter, by Chung-Hsien Wu and Chao-Hong Liu, provides techniques for speech recognition in adverse environments, such as noisy, disfluent and multilingual environments, using the Gaussian Mixture Model (GMM) and Support Vector Machine (SVM). The authors report extensive experiments using the English Across Taiwan (EAT) project database.
In chapter 6, Jozef Juhár and Peter Viszlay introduce linear feature transformations for speech recognition in Slovak. The authors use three popular dimensionality reduction techniques, namely Linear Discriminant Analysis (LDA), Two-dimensional LDA (2DLDA) and Principal Component Analysis (PCA). The chapter is well balanced in its mathematical, theoretical and experimental treatment, and the authors conclude by clearly stating which linear transformation suits each type of feature best.
Chapter 7 by Longbiao Wang, Kyohei Odani, Atsuhiko Kai, Norihide Kitaoka and
Seiichi Nakagawa elaborates on speech recognition in distant-talking environments. This chapter outlines blind dereverberation, and the authors use the LMS algorithm and its variants to estimate the power spectrum of the impulse
response. The authors conducted experiments for hands-free speech recognition in
both simulated and real reverberant environments and presented the results.
In chapter 9, Alfredo Victor Mantilla Caeiros and Hector Manuel Pérez Meana present techniques for esophageal speech enhancement. People who undergo surgery for throat cancer require rehabilitation in order to recover their voice. To accomplish this, the authors propose an esophageal speech enhancement technique using the wavelet transform and artificial neural networks.
Chapter 11, by Ján Staš, Daniel Hládek and Jozef Juhár, focuses on the development of a language model using grammatical features for the Slovak language. The authors give a detailed presentation of class-based language models built from a sparse training corpus utilizing grammatical features, together with extensive experimental studies.
In chapter 12, Dia AbuZeina, Husni Al-Muhtaseb and Moustafa Elshafei address pronunciation variation modeling using part-of-speech tagging for Arabic speech recognition, employing language models. The proposed method was investigated on a speaker-independent Modern Standard Arabic speech recognition system using the Carnegie Mellon University Sphinx speech recognition engine. The authors conclude that the proposed knowledge-based approach to modeling cross-word pronunciation variations yields an improvement.
The final chapter, by Nelson Neto, Pedro Batista and Aldebaro Klautau, highlights the use of the Internet as a collaborative network to develop speech science and technology for a language such as Brazilian Portuguese (BP). The authors present the required background on speech recognition and synthesis. They name the proposed framework VOICECONET, a comprehensive web-based platform built on open-source tools such as HTK, Julius and MARY.
I would like to express my sincere thanks to all the authors for their contribution and effort in bringing out this wonderful book. My gratitude and appreciation go to InTech, in particular Ms. Ana Nikolic and Mr. Dimitri Jelovcan, who drew together the authors to publish this book. I would also like to express my heartfelt thanks to the Management, Secretary, Director and Principal of my institute.
S. Ramakrishnan
Professor and Head
Department of Information Technology
Dr. Mahalingam College of Engineering and Technology
India
Section 1
Speech Recognition
Chapter 1
http://dx.doi.org/10.5772/47843
1. Introduction
Although state-of-the-art speech recognizers can achieve a very high recognition rate for clean speech, recognition performance generally degrades drastically in noisy environments. Noise-robust speech recognition has therefore become an important task for speech
recognition in adverse environments. Recent research on noise-robust speech recognition
mostly focused on two directions: (1) removing the noise from the corrupted noisy signal
in signal space or feature space - such as noise filtering: spectral subtraction (Boll 1979),
Wiener filtering (Macho et al. 2002) and RASTA filtering (Hermansky et al. 1994), and
speech or feature enhancement using model-based approach: SPLICE (Deng et al. 2003)
and stochastic vector mapping (Wu et al. 2002); (2) compensating the noise effect into
acoustic models in model space so that the training environment can match the test
environment - such as PMC (Wu et al. 2004) or multi-condition/multi-style training (Deng
et al. 2000). The noise filtering approaches require some prior information, such as the spectral characteristics of the noise, and their performance degrades when the noisy environment varies drastically or the noise is unknown. Furthermore, (Deng et al. 2000; Deng et al. 2003) have shown that the use of denoising or preprocessing is superior to retraining the recognizers under the matched noise conditions with no preprocessing.
Stochastic vector mapping (SVM) (Deng et al. 2003; Wu et al. 2002) and sequential noise
estimation (Benveniste et al. 1990; Deng et al. 2003; Gales et al. 1996) for noise normalization
have been proposed and achieved significant improvement in noisy speech recognition.
However, there still exist some drawbacks and limitations. First, the performance of sequential noise estimation decreases when the noisy environment varies drastically. Second, the environment mismatch between training data and test data still exists and results in performance degradation. Third, the maximum-likelihood-based stochastic vector mapping (SPLICE) requires annotation of the environment type and stereo training data. However, stereo data are not available for most noisy environments.
For recognition of disfluent speech, a number of cues can be observed when edit disfluency
occurs in the spontaneous speech. These cues can be detected from linguistic features,
acoustic features (Shriberg et al. 2000) and integrated knowledge sources (Bear et al. 1992).
(Shriberg et al. 2005) outlined phonetic consequences of disfluency to improve models for
disfluency processing in speech applications. Four types of disfluency based on intonation,
segment duration and pause duration were presented in (Savova et al. 2003). Soltau et al.
used a discriminatively trained full covariance Gaussian system for rich transcription
(Soltau et al. 2005). (Furui et al. 2005) presented the approaches to corpus collection, analysis
and annotation for conversational speech processing.
(Charniak et al. 2001) proposed an architecture for parsing the transcribed speech using an
edit word detector to remove edit words or fillers from the sentence string, and then a
standard statistical parser was used to parse the remaining words. The statistical parser and
the parameters estimated by boosting were employed to detect and correct the disfluency.
(Heeman et al. 1999) presented a statistical language model that is able to identify POS tags,
discourse markers, speech repairs and intonational phrases. A noisy channel model was
used to model the disfluency in (Johnson et al. 2004). (Snover et al. 2004) combined the
lexical information and rules generated from 33 rule templates for disfluency detection.
(Hain et al. 2005) presented the techniques in front-end processing, acoustic modeling,
language and pronunciation modeling for transcribing the conversational telephone speech
automatically. (Liu et al. 2005) compared the HMM, maximum entropy, and conditional
random fields for disfluency detection in detail.
In this chapter an approach to the detection and correction of the edit disfluency based on
the word order information is presented (Yeh et al. 2006). The first process attempts to
detect the interruption points (IPs) based on hypothesis testing. Acoustic features including
duration, pitch and energy features were adopted in the hypothesis testing. In order to circumvent the problems resulting from disfluency, especially edit disfluency, a reliable
handling language-related phenomena in edit disfluency, a cleanup language model
characterizing the structure of the cleanup sentences and an alignment model for aligning
words between deletable region and correction part are proposed for edit disfluency
detection and correction.
Robust Speech Recognition for Adverse Environments 5
Furthermore, multilinguality frequently occurs in speech content, and the ability of speech recognition systems to process speech in multiple languages has become increasingly desirable due to the trend of globalization. In general, there are different approaches to achieving multilingual speech recognition. One approach employs an external language identification (LID) system (Wu et al. 2006) to first identify the language of the input utterance; the corresponding monolingual system is then selected to perform the speech recognition (Waibel et al. 2000). The accuracy of the external LID system is the main factor affecting the overall system performance.
The organization of this chapter is as follows. Section 2 presents two approaches to cepstral
feature enhancement for noisy speech recognition using noise-normalized stochastic vector
mapping. Section 3 describes an approach to edit disfluency detection and correction for
rich transcription. In Section 4, fusion of acoustic and contextual analysis is described to
generate phonetic units for mixed-language or multilingual speech recognition. Finally the
conclusions are provided in the last section.
Figure 1. Diagram of training and test phases for noise-robust speech recognition
The SVM-based feature enhancement approach estimates the clean speech feature x from the noisy speech feature y through an environment-dependent mapping function F(\mathbf{y}; \Theta^{(e)}), where \Theta^{(e)} denotes the mapping function parameters and e denotes the corresponding environment of the noisy speech feature y.

Assuming that the training data of the noisy speech Y can be partitioned into N_E different noisy environments, the feature vectors of Y under an environment e can be modeled by a Gaussian mixture model (GMM) with N_k mixtures:

p(\mathbf{y} \mid e; \Lambda_e) = \sum_{k=1}^{N_k} p(k \mid e)\, p(\mathbf{y} \mid k, e) = \sum_{k=1}^{N_k} \omega_k^{e}\, \mathcal{N}(\mathbf{y}; \mu_k^{e}, R_k^{e})    (1)
where \Lambda_e represents the environment model. The clean speech feature x can be estimated using a stochastic vector mapping function which is defined as follows:

\hat{x} = F(\mathbf{y}; \Theta^{(e)}) = \mathbf{y} + \sum_{e=1}^{N_E} \sum_{k=1}^{N_k} p(k \mid \mathbf{y}, e)\, r_k^{(e)}    (2)
where the posterior probability p(k \mid \mathbf{y}, e) can be estimated using Bayes' theorem based on the environment model \Lambda_e as follows:

p(k \mid \mathbf{y}, e) = \frac{p(k \mid e)\, p(\mathbf{y} \mid k, e)}{\sum_{j=1}^{N_k} p(j \mid e)\, p(\mathbf{y} \mid j, e)}    (3)

and \Theta^{(e)} = \{ r_k^{(e)} \}_{k=1}^{N_k} denotes the mapping function parameters. Generally, \Theta^{(e)} is estimated from a set of training data using the maximum likelihood criterion.
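For concreteness, the posterior of Eq. (3) can be evaluated in the log domain for numerical stability. The following is a minimal sketch with invented toy parameters and diagonal covariances R_k^e, not the chapter's implementation:

```python
import numpy as np

def gmm_posteriors(y, weights, means, variances):
    """Posterior p(k | y, e) of Eq. (3) for one environment model,
    assuming diagonal covariances (an illustrative choice)."""
    # log N(y; mu_k, R_k) for each mixture k, computed in the log domain
    log_norm = -0.5 * (np.log(2 * np.pi * variances)
                       + (y - means) ** 2 / variances).sum(axis=1)
    log_joint = np.log(weights) + log_norm           # log p(k|e) p(y|k,e)
    log_joint -= log_joint.max()                     # numerical stability
    post = np.exp(log_joint)
    return post / post.sum()                         # normalize over k

# Toy example: 3 mixtures, 2-dimensional cepstral feature
weights = np.array([0.5, 0.3, 0.2])
means = np.array([[0.0, 0.0], [1.0, 1.0], [-1.0, 0.5]])
variances = np.ones((3, 2))
p = gmm_posteriors(np.array([0.9, 1.1]), weights, means, variances)
assert abs(p.sum() - 1.0) < 1e-9 and p.argmax() == 1
```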
For the estimation of the mapping function parameters \Theta^{(e)}, if stereo data, which contain a clean speech signal and the noisy speech signal corrupted from the identical clean speech, are available, the SPLICE-based approach can be directly adopted. However, stereo data are not easily available in real-life applications. In this chapter an MCE-based approach is proposed to overcome this limitation. Furthermore, the environment type of the noisy speech data is needed for training the environment model \Lambda^{(e)}. The noisy speech data are manually classified into N_E noisy environment types. This strategy assigns each noisy speech file to only one environment type and is very time consuming. Moreover, each noisy speech file may actually contain several segments with different types of noisy environment. Since the noisy speech annotation affects the purity of the training data for the environment model \Lambda^{(e)}, this section introduces a frame-based unsupervised noise clustering approach to construct a more precise categorization of the noisy speech.
\hat{x} - n = F(\mathbf{y} - n; \Theta^{(e)}) = (\mathbf{y} - n) + \sum_{e=1}^{N_E} \sum_{k=1}^{N_k} p(k \mid \mathbf{y} - n, e)\, r_k^{(e)}    (4)
The noise normalization process makes the environment model \Lambda_e more noise-tolerant. Obviously, the estimation algorithm for the noise feature vector n plays an important role in noise-normalized stochastic vector mapping.
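Eq. (4) then amounts to subtracting the noise estimate, applying the SPLICE-style correction in the normalized domain, and adding the noise back. A minimal sketch, in which the environment GMM and the correction vectors r_k^(e) are hypothetical placeholders:

```python
import numpy as np

def nn_svm_enhance(y, n, weights, means, variances, r):
    """Noise-normalized stochastic vector mapping, Eq. (4):
    x_hat - n = (y - n) + sum_k p(k | y - n, e) r_k^(e)."""
    z = y - n                                        # noise-normalized feature
    log_norm = -0.5 * (np.log(2 * np.pi * variances)
                       + (z - means) ** 2 / variances).sum(axis=1)
    log_joint = np.log(weights) + log_norm
    log_joint -= log_joint.max()
    post = np.exp(log_joint)
    post /= post.sum()                               # p(k | y - n, e), as in Eq. (3)
    return z + post @ r + n                          # x_hat

# Toy check: with zero correction vectors the feature is unchanged
y = np.array([1.0, 2.0]); n = np.array([0.2, 0.1])
weights = np.array([0.6, 0.4]); means = np.zeros((2, 2)); variances = np.ones((2, 2))
x_hat = nn_svm_enhance(y, n, weights, means, variances, np.zeros((2, 2)))
assert np.allclose(x_hat, y)
```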
In sequential noise estimation, only the noisy speech feature vector of the current frame is observed. Since the noise and clean speech feature vectors are both missing, the relation among clean speech, noise and noisy speech is required first. Then the sequential EM algorithm is introduced for online noise estimation based on this relation. In the meantime, prior models are involved to provide more information for noise estimation.
\mathbf{y} = h + \mathbf{x} + g(n - h - \mathbf{x}), \qquad g(z) = C \ln\left(I + \exp(C^{T} z)\right)    (5)
where C denotes the discrete cosine transform matrix. In order to linearize the nonlinear model, a first-order Taylor series expansion is used around two updated operating points n_0 and \mu_0^x, denoting the initial noise feature and the mean vector of the prior clean speech model, respectively. By ignoring the channel distortion effect, i.e., h = 0, Eq. (5) is then derived as:

\mathbf{y} \approx \mu_0^x + g(n_0 - \mu_0^x) + G(n_0 - \mu_0^x)(\mathbf{x} - \mu_0^x) + \left[I - G(n_0 - \mu_0^x)\right](n - n_0)    (6)

where G(z) = -C\, \mathrm{diag}\!\left(\frac{I}{I + \exp(C^{T} z)}\right) C^{T}.
p(n; \Lambda_n) = \sum_{d=1}^{N_d} w_d^{n}\, \mathcal{N}(n; \mu_d^{n}, \Sigma_d^{n}), \qquad p(\mathbf{x}; \Lambda_x) = \sum_{m=1}^{N_m} w_m^{x}\, \mathcal{N}(\mathbf{x}; \mu_m^{x}, \Sigma_m^{x})    (7)

where pre-training data for noisy and clean speech are required to train the model parameters of the two GMMs, \Lambda_n and \Lambda_x.
While the prior noisy speech model is needed in sequential noise estimation, the noisy
speech model parameters are derived according to the prior clean speech and noise models
using the approximated linear model around two operating points 0n and 0x as follows:
p(\mathbf{y}; \Lambda_y) = \sum_{m=1}^{N_m} \sum_{d=1}^{N_d} w_{m,d}^{y}\, \mathcal{N}(\mathbf{y}; \mu_{m,d}^{y}, \Sigma_{m,d}^{y})    (8)

\mu_{m,d}^{y} = \mu_0^x + g(\mu_0^n - \mu_0^x) + G(\mu_0^n - \mu_0^x)(\mu_m^{x} - \mu_0^x) + \left[I - G(\mu_0^n - \mu_0^x)\right](\mu_d^{n} - \mu_0^n)

\Sigma_{m,d}^{y} = G(\mu_0^n - \mu_0^x)\, \Sigma_m^{x}\, G^{T}(\mu_0^n - \mu_0^x) + \left[I - G(\mu_0^n - \mu_0^x)\right] \Sigma_d^{n} \left[I - G(\mu_0^n - \mu_0^x)\right]^{T}    (9)

\mu_0^n = E[\mu_d^{n}], \quad \mu_0^x = E[\mu_m^{x}], \quad w_{m,d}^{y} = w_m w_d
The noisy speech prior model will be employed to search the most similar clean speech
mixture component and noise mixture component in sequential noise estimation.
where \delta_{m',m} denotes the Kronecker delta function and \gamma(m,d) denotes the posterior probability:

\gamma_{t+1}(m,d) = E\left[\delta_{m_{t+1},m}\, \delta_{d_{t+1},d} \mid \mathbf{y}_1^{t+1}, n_1^{t}\right] = p(m, d \mid \mathbf{y}_{t+1}, n_t)
= \frac{p(\mathbf{y}_{t+1} \mid m, d, n_t)\, p(m, d \mid n_t)}{\sum_{m=1}^{N_m} \sum_{d=1}^{N_d} p(\mathbf{y}_{t+1} \mid m, d, n_t)\, p(m, d \mid n_t)}
= \frac{p(\mathbf{y}_{t+1} \mid m, d, n_t)\, w_m w_d}{\sum_{m=1}^{N_m} \sum_{d=1}^{N_d} p(\mathbf{y}_{t+1} \mid m, d, n_t)\, w_m w_d}    (12)

where the likelihood p(\mathbf{y}_{t+1} \mid m, d, n_t) can be approximated using the approximated linear model as:

p(\mathbf{y}_{t+1} \mid m, d, n_t) \approx \mathcal{N}\left(\mathbf{y}_{t+1}; \mu_{m,d}^{y}(n_t), \Sigma_{m,d}^{y}\right)

\mu_{m,d}^{y}(n_t) = \mu_0^x + g(n_0 - \mu_0^x) + G(n_0 - \mu_0^x)(\mu_m^{x} - \mu_0^x) + \left[I - G(n_0 - \mu_0^x)\right](n_t - n_0)    (13)

\Sigma_{m,d}^{y} = G(n_0 - \mu_0^x)\, \Sigma_m^{x}\, G^{T}(n_0 - \mu_0^x) + \left[I - G(n_0 - \mu_0^x)\right] \Sigma_d^{n} \left[I - G(n_0 - \mu_0^x)\right]^{T}
Also, a forgetting factor \varepsilon is employed to control the effect of the features of the preceding frames.
Q_{t+1}(n) = \sum_{\tau=1}^{t+1} \varepsilon^{\,t+1-\tau} \sum_{m=1}^{N_m} \sum_{d=1}^{N_d} \gamma_{\tau}(m,d)\, \ln p(\mathbf{y}_{\tau} \mid m, d, n) + \mathrm{Const}
= -\frac{1}{2} \sum_{\tau=1}^{t+1} \varepsilon^{\,t+1-\tau} \sum_{m=1}^{N_m} \sum_{d=1}^{N_d} \gamma_{\tau}(m,d) \left(\mathbf{y}_{\tau} - \mu_{m,d}^{y}(n)\right)^{T} \left(\Sigma_{m,d}^{y}\right)^{-1} \left(\mathbf{y}_{\tau} - \mu_{m,d}^{y}(n)\right) + \mathrm{Const}    (14)
= \varepsilon\, Q_t(n) + R_{t+1}(n)

R_{t+1}(n) = -\frac{1}{2} \sum_{m=1}^{N_m} \sum_{d=1}^{N_d} \gamma_{t+1}(m,d) \left(\mathbf{y}_{t+1} - \mu_{m,d}^{y}(n)\right)^{T} \left(\Sigma_{m,d}^{y}\right)^{-1} \left(\mathbf{y}_{t+1} - \mu_{m,d}^{y}(n)\right)
In the M-step, the iterative stochastic approximation is introduced to derive the solution.
Finally, sequential noise estimation is performed as follows:
n_{t+1} = n_t + K_{t+1}^{-1}\, s_{t+1}

K_{t+1} = -\left.\frac{\partial^2 Q_{t+1}(n)}{\partial n^2}\right|_{n = n_t} = \sum_{\tau=1}^{t+1} \varepsilon^{\,t+1-\tau} \sum_{m=1}^{N_m} \sum_{d=1}^{N_d} \gamma_{\tau}(m,d) \left[I - G(n_0 - \mu_0^x)\right]^{T} \left(\Sigma_{m,d}^{y}\right)^{-1} \left[I - G(n_0 - \mu_0^x)\right]    (15)

s_{t+1} = \left.\frac{\partial R_{t+1}(n)}{\partial n}\right|_{n = n_t}
The prior models are used to search the most similar noise or clean speech mixture
component. Given the two mixture components, the estimation of the posterior probability
( m , d) will be more accurate.
\hat{w}_d = \frac{\nu_d - 1 + \sum_{t=1}^{T} s_{d,t}}{\sum_{d=1}^{N_d} (\nu_d - 1) + \sum_{d=1}^{N_d} \sum_{t=1}^{T} s_{d,t}}

\hat{\mu}_d^{n} = \frac{\tau_d\, \vartheta_d + \sum_{t=1}^{T} s_{d,t}\, z_t}{\tau_d + \sum_{t=1}^{T} s_{d,t}}    (16)

\hat{\Sigma}_d^{n} = \frac{u_d + \tau_d \left(\hat{\mu}_d^{n} - \vartheta_d\right)\left(\hat{\mu}_d^{n} - \vartheta_d\right)^{T} + \sum_{t=1}^{T} s_{d,t} \left(z_t - \hat{\mu}_d^{n}\right)\left(z_t - \hat{\mu}_d^{n}\right)^{T}}{\alpha_d - p + \sum_{t=1}^{T} s_{d,t}}

where s_{d,t} = p(d \mid z_t; \Lambda_n) denotes the posterior probability of noise mixture d given the noise observation z_t,
and where the conjugate prior density of the mixture weights is the Dirichlet distribution with hyper-parameters \nu_d, and the joint conjugate prior density of the mean and variance parameters is the normal-Wishart density:

g(w_1, \ldots, w_{N_d} \mid \nu_1, \ldots, \nu_{N_d}) \propto \prod_{d=1}^{N_d} w_d^{\nu_d - 1}    (17)

g(\mu_d^{n}, \Sigma_d^{n} \mid \tau_d, \alpha_d, \vartheta_d, u_d) \propto \left|\Sigma_d^{n}\right|^{-(\alpha_d - p)/2} \exp\left(-\frac{\tau_d}{2} \left(\mu_d^{n} - \vartheta_d\right)^{T} \left(\Sigma_d^{n}\right)^{-1} \left(\mu_d^{n} - \vartheta_d\right)\right) \exp\left(-\frac{1}{2} \operatorname{tr}\left(u_d \left(\Sigma_d^{n}\right)^{-1}\right)\right)

where \tau_d > 0, \alpha_d > p - 1 and u_d > 0. After adaptation of the noise prior model, the noisy speech prior model \Lambda_y is then adapted using the clean speech prior model \Lambda_x and the newly adapted noise prior model \Lambda_n based on Eq. (8).
The parameters of the mapping function F(\mathbf{y}; \Theta^{(e)}) also need to be adapted in the adaptation phase. First, adaptation of the model parameter \Lambda^{(e)} is similar to that of the noise prior model. Second, the adaptation of \Theta^{(e)} = \{ r_k^{(e)} \}_{k=1}^{N_k} is an iterative procedure. Since \Theta^{(e)} is not a random variable and does not follow any conjugate prior density, a maximum likelihood (ML)-based adaptation, which is similar to the correction vector estimation of SPLICE, is employed:

r_k^{(e)} = \frac{\sum_{t} p(k \mid \mathbf{y}_t - n, e)\left(\hat{x}_t - \mathbf{y}_t\right)}{\sum_{t} p(k \mid \mathbf{y}_t - n, e)}    (18)

where the temporary clean speech estimates \hat{x}_t are obtained using the un-adapted noise-normalized stochastic mapping function in Eq. (4).
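Given posteriors and temporary clean speech estimates over a batch of frames, Eq. (18) reduces to a per-mixture weighted average of the residuals. A sketch with invented toy values; the posteriors p(k | y_t − n, e) are assumed precomputed as in Eq. (3):

```python
import numpy as np

def estimate_correction_vectors(Y, X_hat, posteriors):
    """ML re-estimation of the correction vectors r_k^(e) in Eq. (18).
    posteriors[t, k] stands for p(k | y_t - n, e), assumed precomputed."""
    resid = X_hat - Y                              # (x_hat_t - y_t), one row per frame
    num = posteriors.T @ resid                     # (K, dim) weighted residual sums
    den = posteriors.sum(axis=0)[:, None]          # (K, 1) posterior mass per mixture
    return num / den

# Toy example: 2 mixtures, 3 frames, 2-dimensional features
Y = np.array([[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]])
X_hat = Y + 0.5                                    # constant clean-minus-noisy offset
post = np.array([[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]])
r = estimate_correction_vectors(Y, X_hat, post)
assert np.allclose(r, 0.5)                         # each r_k recovers the offset
```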
0.04% improvement (different background noise types and channel characteristic to the
training data), the environment model adaptation can slightly reduce the mismatch between
the training data and test data.
Methods                              Training Mode    Set A   Set B   Set C   Overall
No Denoising                         Multi-condition  87.82   86.27   83.78   86.39
                                     Clean only       61.34   55.75   66.14   60.06
MCE                                  Multi-condition  92.92   89.15   90.09   90.85
                                     Clean only       87.82   85.34   83.77   86.02
SPLICE with Recursive-EM             Multi-condition  91.49   89.16   89.62   90.18
                                     Clean only       87.82   87.09   85.08   86.98
Proposed approach                    Multi-condition  91.42   89.18   89.85   90.21
(manual tag, no adaptation)          Clean only       87.84   86.77   85.23   86.89
Proposed approach                    Multi-condition  91.06   90.79   90.77   90.89
(auto-clustering, no adaptation)     Clean only       87.56   87.33   86.32   87.22
Proposed approach                    Multi-condition  91.07   90.90   90.81   90.95
(auto-clustering, with adaptation)   Clean only       87.55   87.44   86.38   87.27

Table 1. Experimental results (%) on AURORA2
2.5. Conclusions
In this section, two approaches to cepstral feature enhancement for noisy speech recognition using noise-normalized stochastic vector mapping were presented. The prior model was introduced for precise noise estimation, and environment model adaptation was then applied to reduce the environment mismatch between the training data and the test data. Experimental results demonstrate that the proposed approach can slightly outperform the SPLICE-based approach without stereo data on the AURORA2 database.
In spontaneous speech, acoustic features such as short pauses (silence and fillers), energy and pitch reset generally appear along with the occurrence of edit disfluency. Based on these
features, we can detect the possible IPs. Furthermore, since IPs generally appear at the
boundary of two successive words, we can exclude the unlikely IPs whose positions are
within a word. Besides, since the structural patterns between the deletable word sequence
and correction word sequence are very similar, the deletable word sequence in edit
disfluency is replaceable by the correction word sequence.
The speech signal is fed to both acoustic feature extraction module and speech recognition
engine in IP detection module. Information about durations of syllables and silence from
speech recognition is provided for acoustic feature extraction. Combined with side
information from speech recognition, duration-, pitch-, and energy-related features are
extracted and used to model the IPs using a Gaussian mixture model (GMM). Besides, in
order to perform hypothesis testing on IP detection, an anti-IP GMM is also constructed
based on the features extracted from the non-IP regions. The hypothesis testing verifies whether the posterior probability of the acoustic features of a syllable boundary is above a threshold and thereby determines whether the syllable boundary is an IP. Since an IP is an event that happens at an interword location, we can remove the detected IPs that do not appear at a word boundary.
Figure 2. The framework of transcription system for spontaneous speech with edit disfluencies
There are two processing stages in the edit disfluency correction module: cleanup and
alignment. As shown in Figure 4, cleanup process divides the word string into three parts:
deletable region (delreg), editing term, and correction according to the locations of potential
IPs detected by the IP detection module. The cleanup process is performed by shifting the correction part to replace the deletable region, forming a new cleanup transcription. The
edit disfluency correction module is composed of an n-gram language model and the
alignment model. The n-gram model regards the cleanup transcriptions as fluent utterances
and models their word order information. The alignment model finds the optimal
correspondence between deletable region and correction in edit disfluency.
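The cleanup operation itself is simple string surgery once the IP position and the extent of the deletable region are hypothesized. A toy sketch; the example sentence and indices are invented for illustration:

```python
def cleanup(words, t, n):
    """Remove the deletable region w_{t+1}..w_n so that the correction,
    which starts right after the IP at position n, replaces it."""
    return words[:t] + words[n:]

# "I want a flight to Boston uh to Denver" -> IP after "Boston";
# the deletable region "to Boston" and editing term "uh" are dropped
words = "I want a flight to Boston uh to Denver".split()
assert cleanup(words, 4, 7) == "I want a flight to Denver".split()
```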
We model the interruption detection problem as choosing between H_0, the hypothesis that no IP is embedded in the silence, and H_1, the hypothesis that an IP is embedded in the silence. The likelihood ratio test is employed to detect the potential IPs. The function L(Seq_{syllable\_silence}) is termed the likelihood ratio since it indicates, for each value of Seq_{syllable\_silence}, the likelihood of H_1 versus the likelihood of H_0.
L\left(Seq_{syllable\_silence}\right) = \frac{P\left(Seq_{syllable\_silence}; H_1\right)}{P\left(Seq_{syllable\_silence}; H_0\right)}    (19)

P\left(Seq_{syllable\_silence}; H_1\right) = P\left(Seq_{syllable\_silence} \mid E_{ip}\right) = P\left(Seq_{silence} \mid E_{ip}\right) P\left(Seq_{syllable} \mid E_{ip}\right)    (20)

and

P\left(Seq_{syllable\_silence}; H_0\right) = P\left(Seq_{syllable\_silence} \mid \bar{E}_{ip}\right) = P\left(Seq_{silence} \mid \bar{E}_{ip}\right) P\left(Seq_{syllable} \mid \bar{E}_{ip}\right)    (21)

where E_{ip} denotes that an IP is embedded in silence_k and \bar{E}_{ip} denotes that no IP appears in silence_k, that is,

P\left(Seq_{silence} \mid E_{ip}\right) = \frac{1}{\sqrt{2\pi}\, \sigma_{ip}} \exp\left(-\frac{\left(Seq_{silence} - \mu_{ip}\right)^2}{2 \sigma_{ip}^2}\right)    (22)

P\left(Seq_{silence} \mid \bar{E}_{ip}\right) = \frac{1}{\sqrt{2\pi}\, \sigma_{nip}} \exp\left(-\frac{\left(Seq_{silence} - \mu_{nip}\right)^2}{2 \sigma_{nip}^2}\right)    (23)
where \mu_{ip}, \mu_{nip}, \sigma_{ip}^2 and \sigma_{nip}^2 denote the means and variances of the silence durations containing and not containing an IP, respectively.
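The silence-duration term of the likelihood ratio in Eqs. (19), (22) and (23) can be sketched as follows; the means and variances below are illustrative values, not corpus estimates:

```python
import math

def silence_lr(duration, mu_ip, var_ip, mu_nip, var_nip):
    """Likelihood ratio using only the silence-duration term of
    Eqs. (22) and (23); a value above 1 favors the IP hypothesis H1."""
    def gauss(x, mu, var):
        return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)
    return gauss(duration, mu_ip, var_ip) / gauss(duration, mu_nip, var_nip)

# IP silences are assumed longer on average: a long pause favors H1
lr_long = silence_lr(0.50, mu_ip=0.45, var_ip=0.04, mu_nip=0.15, var_nip=0.01)
lr_short = silence_lr(0.12, mu_ip=0.45, var_ip=0.04, mu_nip=0.15, var_nip=0.01)
assert lr_long > 1.0 > lr_short
```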
nf_{duration}\left(syllable_{i,j}\right) = \frac{duration\left(syllable_{i,j}\right)}{\frac{1}{\sum_{i=1}^{|syllable|} n_i} \sum_{i=1}^{|syllable|} \sum_{j=1}^{n_i} duration\left(syllable_{i,j}\right)}    (24)

where syllable_{i,j} denotes the j-th sample of syllable i in the corpus, |syllable| denotes the number of distinct syllables, and n_i is the number of samples of syllable i in the corpus. Similarly, for energy and pitch, frame-based statistics are used to calculate the normalized features for each syllable.
Considering the result of speech recognition, the features are normalized to form the first-order features. To model the speaking rate and the variation in energy and pitch during the utterance, the 2nd-order features, called delta-duration, delta-energy and delta-pitch, are obtained from the forward difference of the 1st-order features. The following equation shows the estimation of delta-duration, which can also be applied to the estimation of delta-energy and delta-pitch.
\Delta nf_{duration}(i) = \begin{cases} nf_{duration}(i+1) - nf_{duration}(i), & \text{if } -w \le i \le w \\ 0, & \text{otherwise} \end{cases}    (25)
where w is half of the observation window size. In total, there are three kinds of features in two orders after feature extraction. We combine these features to form a vector with 24w - 6 components as the observation vector of the GMM. The acoustic features are denoted as the syllable-based observation sequence that corresponds to the potential IP, silence_k, by

O = \left[O_D, O_P, O_E\right] \in R^{dim}    (26)
where O_s \in R^{dim_s}, s \in \{D, P, E\}, represents the feature vector of a single kind and dim denotes the dimensionality of the combined observation vector, e.g.,

O_D = \left[nf_{duration}(-w+1), \ldots, nf_{duration}(-1), nf_{duration}(0), nf_{duration}(1), \ldots, nf_{duration}(w),\; \Delta nf_{duration}(-w+1), \ldots, \Delta nf_{duration}(w-1)\right]^{T}    (27)
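The windowing of Eqs. (25)-(27) can be sketched as follows. The exact dimensionality depends on the window layout (the text states 24w − 6 components); this sketch stacks a 2w-sample window and its forward differences per feature kind and is illustrative only:

```python
import numpy as np

def delta(nf):
    """Forward difference of Eq. (25): delta_i = nf_{i+1} - nf_i
    inside the observation window (boundary values are dropped)."""
    return nf[1:] - nf[:-1]

def observation_vector(nf_dur, nf_pitch, nf_energy):
    """Stack the 1st-order windows and their forward differences for the
    three feature kinds, mirroring the layout of Eqs. (26)-(27)."""
    parts = []
    for nf in (nf_dur, nf_pitch, nf_energy):
        parts.append(nf)          # 1st-order normalized features around the IP
        parts.append(delta(nf))   # 2nd-order (delta) features
    return np.concatenate(parts)

# Toy window of 2w normalized durations around a candidate IP, w = 2
w = 2
nf = np.arange(2 * w, dtype=float)     # stand-in for normalized feature values
o = observation_vector(nf, nf, nf)
assert o.size == 3 * (2 * w + 2 * w - 1)
```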
P\left(Seq_{syllable} \mid C_j\right) = P\left(O_t \mid \lambda_j\right) = \sum_{i=1}^{W} \alpha_i\, \mathcal{N}\left(O_t; \mu_i, \Sigma_i\right)    (28)

where C_j \in \{E_{ip}, \bar{E}_{ip}\} denotes the hypothesis for silence_k containing or not containing an IP, \lambda_j is the GMM for class C_j, and \alpha_i is a mixture weight which must satisfy the constraint \sum_{i=1}^{W} \alpha_i = 1, where W is the number of mixture components and \mathcal{N} is the Gaussian density function:

\mathcal{N}\left(O_t; \mu_i, \Sigma_i\right) = \frac{1}{(2\pi)^{dim/2} \left|\Sigma_i\right|^{1/2}} \exp\left(-\frac{1}{2}\left(O_t - \mu_i\right)^{T} \Sigma_i^{-1} \left(O_t - \mu_i\right)\right)    (29)
where \mu_i and \Sigma_i are the mean vector and covariance matrix of the i-th component, and O_t denotes the t-th observation in the training corpus. The parameters \{\alpha_i, \mu_i, \Sigma_i\}, i = 1, \ldots, W, can be estimated iteratively using the EM algorithm for mixture i:

\hat{\alpha}_i = \frac{1}{N} \sum_{t=1}^{N} P\left(i \mid O_t, \lambda\right)    (30)

\hat{\mu}_i = \frac{\sum_{t=1}^{N} P\left(i \mid O_t, \lambda\right) O_t}{\sum_{t=1}^{N} P\left(i \mid O_t, \lambda\right)}    (31)

\hat{\Sigma}_i = \frac{\sum_{t=1}^{N} P\left(i \mid O_t, \lambda\right) \left(O_t - \hat{\mu}_i\right)\left(O_t - \hat{\mu}_i\right)^{T}}{\sum_{t=1}^{N} P\left(i \mid O_t, \lambda\right)}    (32)
where P\left(i \mid O_t, \lambda\right) = \frac{\alpha_i P\left(O_t \mid i\right)}{\sum_{j=1}^{W} \alpha_j P\left(O_t \mid j\right)} and N denotes the total number of feature observations.
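Eqs. (30)-(32) are the standard EM re-estimation formulas for a GMM. A compact sketch using diagonal covariances (an illustrative simplification of the full-covariance form above), with synthetic data:

```python
import numpy as np

def em_step(O, alpha, mu, var):
    """One EM iteration of Eqs. (30)-(32) for a diagonal-covariance GMM."""
    # E-step: responsibilities P(i | O_t, lambda), in the log domain
    log_norm = -0.5 * (np.log(2 * np.pi * var)[None, :, :]
                       + (O[:, None, :] - mu[None, :, :]) ** 2
                       / var[None, :, :]).sum(-1)
    log_post = np.log(alpha)[None, :] + log_norm
    log_post -= log_post.max(axis=1, keepdims=True)
    post = np.exp(log_post)
    post /= post.sum(axis=1, keepdims=True)
    # M-step: Eqs. (30), (31), (32)
    Nk = post.sum(axis=0)                            # soft counts per mixture
    alpha_new = Nk / len(O)
    mu_new = (post.T @ O) / Nk[:, None]
    var_new = (post.T @ (O ** 2)) / Nk[:, None] - mu_new ** 2
    return alpha_new, mu_new, var_new

# Two well-separated synthetic clusters are recovered after a few iterations
rng = np.random.default_rng(0)
O = np.vstack([rng.normal(-3, 0.5, (200, 2)), rng.normal(3, 0.5, (200, 2))])
alpha, mu, var = np.array([0.5, 0.5]), np.array([[-1.0, -1.0], [1.0, 1.0]]), np.ones((2, 2))
for _ in range(10):
    alpha, mu, var = em_step(O, alpha, mu, var)
assert abs(alpha[0] - 0.5) < 0.05
assert abs(mu[0, 0] + 3) < 0.2 and abs(mu[1, 0] - 3) < 0.2
```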
W^{*} = \arg\max_{W, IP} P\left(W; IP\right)
= \arg\max_{W, IP} P\left(w_1, w_2, \ldots, w_t, w_{t+1}, \ldots, w_n, w_{n+1}, \ldots, w_{2n-t}, w_{2n-t+1}, \ldots, w_N; IP\right)
= \arg\max_{W, n, t} \left\{ P\left(w_1, \ldots, w_t, w_{n+1}, \ldots, w_{2n-t}, w_{2n-t+1}, \ldots, w_N\right)^{\lambda} \cdot P\left(w_{t+1}, \ldots, w_n \mid w_{n+1}, \ldots, w_{2n-t}, w_{2n-t+1}, \ldots, w_N\right)^{1-\lambda} \right\}    (33)
= \arg\max_{W, n, t} \left\{ \lambda \log P\left(w_1, \ldots, w_t, w_{n+1}, \ldots, w_{2n-t}, w_{2n-t+1}, \ldots, w_N\right) + (1-\lambda) \log P\left(w_{t+1}, \ldots, w_n \mid w_{n+1}, \ldots, w_{2n-t}, w_{2n-t+1}, \ldots, w_N\right) \right\}
where \lambda and 1 - \lambda are the combination weights for the cleanup language model and the alignment model, respectively. IP denotes the interruption point obtained from the IP detection module and n is the position of the potential IP. The cleanup language model is a class-based n-gram model with modified Kneser-Ney discounting for further smoothing:
P\left(w_1, \ldots, w_t, w_{n+1}, \ldots, w_N\right) = \prod_{i=1}^{t} P\left(w_i \mid Class\left(w_1^{i-1}\right)\right) \cdot P\left(w_{n+1} \mid Class\left(w_1^{t}\right)\right) \cdot \prod_{j=n+2}^{N} P\left(w_j \mid Class\left(w_{n+1}^{j-1}\right)\right)    (34)
where Class denotes the conversion function that translates a word sequence into a word-class sequence. In this section, we employ two word classes: the semantic class and the part-of-speech (POS) class. A semantic class, such as the synsets in WordNet (http://wordnet.princeton.edu/) or the concepts in the UMLS (http://www.nlm.nih.gov/research/umls/), contains the words that share a semantic property based on semantic relations such as hyponymy and hypernymy. POS classes, also called syntactic or grammatical categories, are defined by the role that a word plays in a sentence, such as noun, verb or adjective.
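The Class(·) conversion and class-based scoring of Eq. (34) can be illustrated with a toy lexicon; the POS tags and bigram probabilities below are invented, and a probability floor crudely stands in for the modified Kneser-Ney smoothing used in the chapter:

```python
import math

# Hypothetical POS lexicon and hand-set class-bigram probabilities
POS = {"cancel": "VERB", "a": "DET", "flight": "NOUN"}
CLASS_BIGRAM = {("VERB", "DET"): 0.4, ("DET", "NOUN"): 0.6}

def class_seq(words):
    """The Class(.) conversion: translate words into word classes."""
    return [POS[w] for w in words]

def class_bigram_logprob(words, floor=1e-4):
    """Score word order with class bigrams; unseen class pairs get a
    floor probability as a crude substitute for smoothing."""
    classes = class_seq(words)
    return sum(math.log(CLASS_BIGRAM.get(pair, floor))
               for pair in zip(classes, classes[1:]))

# A grammatical ordering scores higher than a scrambled one
assert class_bigram_logprob(["cancel", "a", "flight"]) > \
       class_bigram_logprob(["a", "cancel", "flight"])
```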
The other essential issue of the n-gram model for correcting edit disfluency is the order of the Markov model. The IP is the point at which the speaker breaks off the deletable region, while the correction consists of the portion of the utterance that repairs it and can be considered fluent. Removing part of the word string leads to a shorter string, and a higher probability is obtained for a shorter word string; as a result, short word strings will be favored. To deal with this problem, we can increase the order to constrain the perplexity and normalize the word length by aligning the deletable region and the correction.
3.4.2. Alignment model between the deletable region and the correction
In conversational speech, the structural pattern of a deletable region is usually similar to
that of the correction. Sometimes, the deletable region appears as a substring of the
correction. Accordingly, we can find the structural pattern in the starting point of the
correction which generally follows the IP. Then, we can take the potential IP as the center
and align the word string before and after it. Since the correction is used for replacing the
deletable region and ending the utterance, there exists a correspondence between the words
in the deletable region and the correction. We may, therefore, model the alignment
assuming the conditional probability of the correction given the possible deletable region.
According to this observation, class-based alignment is proposed to clean up edit disfluency.
The alignment model can be described as
where the fertility f(k) means the number of words in the correction corresponding to the word w_k in the deletable region, and k and l are the positions of the words w_k and w_l in the deletable region and the correction, respectively. m denotes the number of words in the deletable region. The alignment model for cleanup contains three parts: the fertility probability, the translation (corresponding) probability and the distortion probability. The fertility probability of word w_k is defined as
P\left(f(k) \mid Class\left(w_k\right)\right) = \frac{\sum_{w_i \in Class\left(w_k\right)} \delta\left(f_i = f(k)\right)}{\sum_{p=0}^{N} \sum_{w_j \in Class\left(w_k\right)} \delta\left(f_j = p\right)}    (36)
where \delta(\cdot) is an indicator function and N denotes the maximum value of the fertility. The translation or corresponding probability is measured according to (Wu et al. 1994):

P\left(Class\left(w_l\right) \mid Class\left(w_k\right)\right) = \frac{2 \cdot Depth\left(LCS\left(Class\left(w_l\right), Class\left(w_k\right)\right)\right)}{Depth\left(Class\left(w_l\right)\right) + Depth\left(Class\left(w_k\right)\right)}    (37)

where Depth(\cdot) denotes the depth of the word class in the hierarchy and LCS(\cdot) denotes the lowest common subsumer of the two word classes. The distortion probability P(l \mid k, m) is the mapping probability of the word positions between the deletable region and the correction.
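The similarity of Eq. (37) can be sketched on a toy "IS-A" hierarchy; the taxonomy below is invented for illustration:

```python
# Toy "IS-A" taxonomy: each class maps to its parent, root maps to None
PARENT = {"dog": "animal", "cat": "animal", "animal": "entity",
          "car": "artifact", "artifact": "entity", "entity": None}

def ancestors(c):
    """Chain of classes from c up to the root, including c itself."""
    chain = []
    while c is not None:
        chain.append(c)
        c = PARENT[c]
    return chain

def depth(c):
    return len(ancestors(c))            # the root has depth 1

def translation_prob(c1, c2):
    """2 * Depth(LCS(c1, c2)) / (Depth(c1) + Depth(c2)), Eq. (37)."""
    a2 = set(ancestors(c2))
    lcs = next(c for c in ancestors(c1) if c in a2)  # lowest common subsumer
    return 2 * depth(lcs) / (depth(c1) + depth(c2))

# Classes sharing a closer subsumer are more similar
assert translation_prob("dog", "cat") > translation_prob("dog", "car")
```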
The duration of the silences with an IP is larger than that of the silences without an IP. According to this result, we can estimate the posterior probability of silence duration using a GMM for IP detection. For hypothesis testing, an anti-IP GMM is also constructed.
The hypothesis testing, combined with the GMM with four mixture components using the syllable features, determines whether the silence contains an IP. The decision threshold should be determined to achieve a good trade-off. The overall IP error rate defined in RT'04F is the average number of missed IP detections and falsely detected IPs per reference IP:
Error_{IP} = \frac{n_{M}(IP) + n_{FA}(IP)}{n_{IP}}    (38)
Where nM IP and nFA IP denote the numbers of missed and false alarm IPs respectively.
nIP means the number of reference IPs. We can adjust the threshold for nM IP and
nFA IP .
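Eq. (38) amounts to a one-line computation once the miss, false alarm and reference counts are available from scoring; the counts below are made-up numbers for illustration:

```python
# IP error rate of Eq. (38): missed plus falsely detected IPs per reference IP.

def ip_error_rate(n_missed, n_false_alarm, n_reference):
    if n_reference <= 0:
        raise ValueError("the number of reference IPs must be positive")
    return (n_missed + n_false_alarm) / n_reference

# e.g. 5 misses and 12 false alarms against 100 reference IPs:
print(ip_error_rate(5, 12, 100))  # 0.17
```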
Since the goal of the IP detection module is to detect potential IPs, a false alarm is less serious than a miss; that is, we want a high recall rate without a large increase in the false alarm rate. The threshold was finally set to 0.25. Since an IP always appears at a word boundary, this constraint can be used to remove unlikely IPs.
For the class-based approach, part-of-speech (POS) and semantic classes are employed as the word classes. The semantic classes are obtained from HowNet (http://www.keenage.com/), which defines the “IS-A” relation as the primary feature. There are 26 POS classes and 30 semantic classes. In this way, words can be categorized according to their hypernyms or concepts, and every word can be mapped to its own semantic class.
The edit word detection (EWD) task is to detect the regions of the input speech containing words in the deletable regions. One of the primary metrics for edit disfluency correction is the edit word detection metric defined in RT’04F (Chen et al. 2002), which is similar to the IP detection metric shown in Eq. (38).
Due to the lack of structural information, the unigram does not obtain any improvement. The bigram provides a larger improvement when combined with POS class-based alignment than with semantic class-based alignment. The 3-gram combined with semantic class-based alignment outperforms the other combinations, because the stricter constraints of the 3-gram reduce the false alarm rate for edit word detection. We also tried a 4-gram to gain further improvement over the 3-gram, but the extra computation outweighed the slight gain, and the 4-gram statistics are too sparse compared with the 3-gram model. The best combination in the edit disfluency correction module is therefore the 3-gram with semantic classes.
From the analysis of the results shown in Table 2, the probabilities of the n-gram model are much smaller than those of the alignment model. Since the alignment can be taken as the penalty for edit words, we should balance the effects of the 3-gram and the semantic-class alignment using a log-linear combination weight λ. To optimize performance, λ is estimated empirically by minimizing the edit word errors.
Table 2. Results (%) of the linguistic module with equal weights λ = (1 − λ) = 0.5 for edit word detection on the REF and STT conditions
Robust Speech Recognition for Adverse Environments 25
4.1. Introduction
In multilingual speech recognition, it is very important to determine a global phone
inventory for different languages. When an authentic multilingual phone set is defined, the
acoustic models and pronunciation lexicon can be constructed (Chen et al. 2002). The
simplest approach to phone set definition is to combine the phone inventories of the different languages without sharing units across languages. The second is to map language-dependent phones onto a global multilingual phone inventory, based on phonetic knowledge; several global phone-based phonetic representations, such as the International Phonetic Alphabet (IPA) (Mathews 1979), the Speech Assessment Methods Phonetic Alphabet (Wells 1989) and Worldbet (Hieronymus 1993), are commonly used for this purpose. The third is to merge the language-dependent
phone models using a hierarchical phone clustering algorithm to obtain a compact
multilingual inventory. In this approach, the distance measure between acoustic models,
such as Bhattacharyya distance (Mak et al. 1996) and Kullback-Leibler (KL) divergence
(Goldberger et al. 2005), is employed to perform the bottom-up clustering. Finally, the
multilingual phone models are generated with the use of a phonetic top-down clustering
procedure (Young et al. 1994).
a_{k,l} = [ Σ_{i=1}^{I} P(x_i^l | λ_k) + Σ_{j=1}^{J} P(x_j^k | λ_l) ] / (I · J)    (39)

where I and J represent the numbers of training samples for the phone models λ_k and λ_l, respectively. The acoustic confusion matrix A = (a_{k,l})_{N×N} is obtained from the pairwise similarities between every two phone models, where N denotes the number of phone models.
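A hedged sketch of the confusion measure in Eq. (39): here each phone model is reduced to a single one-dimensional Gaussian rather than an HMM, and the cross-likelihoods of each model's training samples under the other model are accumulated. All model parameters and samples below are invented for illustration:

```python
import numpy as np

def gauss_pdf(x, mean, std):
    """Density of N(mean, std^2) evaluated elementwise on x."""
    return np.exp(-0.5 * ((x - mean) / std) ** 2) / (std * np.sqrt(2 * np.pi))

def confusion_matrix(models, samples):
    """models[k] = (mean, std); samples[k] = training samples of phone k."""
    n = len(models)
    a = np.zeros((n, n))
    for k in range(n):
        for l in range(n):
            i, j = len(samples[l]), len(samples[k])
            cross = (gauss_pdf(np.asarray(samples[l]), *models[k]).sum()
                     + gauss_pdf(np.asarray(samples[k]), *models[l]).sum())
            a[k, l] = cross / (i * j)   # normalization as printed in Eq. (39)
    return a

models = [(0.0, 1.0), (0.5, 1.0), (5.0, 1.0)]
samples = [[-0.1, 0.2, 0.0], [0.4, 0.6, 0.5], [4.9, 5.1, 5.0]]
A = confusion_matrix(models, samples)
# acoustically close models (0 and 1) are more confusable than distant ones
assert A[0, 1] > A[0, 2]
```

By construction the measure is symmetric, so A can be used directly as a pairwise similarity matrix.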
A phone is co-articulated with its context phones; this notion is derived from the observation of articulatory behavior. Based on this co-articulation behavior, if two phones share more common context, they are more similarly articulated.
The HAL model represents the multilingual triphones in a vector representation. Each dimension of a vector is a weight representing the strength of association between the target phone and a context phone. The weights are computed by applying an observation window of length ℓ over the corpus. All phones within the window are considered to be co-articulated with each other, and for any two phones at distance d within the window, the weight between them is defined as ℓ − d + 1. After moving the window one phone at a time over the sentence, the HAL space G = (g_{k,l})_{N×N} is constructed. The resultant HAL space is an N × N matrix, where N is the number of triphones.
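The HAL construction just described can be sketched as a sliding window that adds a weight of ℓ − d + 1 for every context phone at distance d before the target. The window length below is an assumption (the chapter's exact setting for Table 3 is not fully recoverable), so the resulting weights need not reproduce Table 3 cell for cell:

```python
from collections import defaultdict

# Minimal HAL-space sketch over a phone string: rows hold left-context
# weights (phones preceding the target); columns give the right context.

def build_hal(phones, ell=3):
    g = defaultdict(int)  # g[(target, context)] accumulates weights
    for i, target in enumerate(phones):
        for d in range(1, ell + 1):   # distances 1..ell to the left
            if i - d < 0:
                break
            g[(target, phones[i - d])] += ell - d + 1
    return dict(g)

# phone string of the mixed sentence used in Table 3
phones = "CH A @ I X I A B AE G D AE D".split()
hal = build_hal(phones, ell=3)
print(hal[("A", "CH")])  # CH immediately precedes the first A
print(hal[("A", "I")])   # I precedes the second A at distances 1 and 3
```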
Table 3 presents the HAL space for the example English-Mandarin mixed sentence “查一下<look up> (CH A @ I X I A) Baghdad (B AE G D AE D)”. For each phone in Table 3, the corresponding row vector represents its left contextual information, i.e. the weights of the phones preceding it, and the corresponding column vector represents its right contextual information. w_{k,l} indicates the k-th weight of the l-th triphone. The weights in the vectors are then re-estimated as

w_{k,l} ← w_{k,l} · log( N / N_l )    (40)
where N denotes the total number of phone vectors and N_l represents the number of vectors in which phone l has a nonzero dimension. After each dimension is re-weighted, the HAL space is transformed into a probabilistic framework, and each weight is redefined as

ŵ_{k,l} = w_{k,l} / Σ_{k′=1}^{N} w_{k′,l}    (41)

The symmetric HAL weights are then obtained as

g_{k,l} = ( ŵ_{k,l} + ŵ_{l,k} ) / 2,   1 ≤ k, l ≤ N
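The re-weighting and symmetrization steps of Eqs. (40)-(41) can be sketched with NumPy; reading "the number of vectors of phone l with nonzero dimension" as a column-wise document frequency is an assumption of this sketch:

```python
import numpy as np

def reweight_hal(w):
    """w: raw N x N HAL weight matrix (rows: targets, columns: contexts)."""
    n = w.shape[0]
    # N_l: number of rows in which column l is nonzero (assumed reading of
    # "the number of vectors of phone l with nonzero dimension")
    n_l = np.maximum((w > 0).sum(axis=0), 1)
    w = w * np.log(n / n_l)                    # Eq. (40): idf-style factor
    col_sums = np.maximum(w.sum(axis=0), 1e-12)
    w_hat = w / col_sums                       # Eq. (41): column normalization
    return (w_hat + w_hat.T) / 2.0             # symmetrized g_{k,l}

raw = np.array([[0.0, 3.0, 1.0],
                [2.0, 0.0, 0.0],
                [1.0, 1.0, 0.0]])
g = reweight_hal(raw)
print(np.allclose(g, g.T))  # True: G is symmetric by construction
```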
MDS reduces the data dimensionality into a low-dimensional space. The IPA-based phone alphabet for English and Mandarin contains 55 phones, which yields about 166,375 (55 × 55 × 55) possible triphones. As the number of target languages increases, the dimension of the confusion matrix becomes huge. Another purpose of multidimensional scaling is to project the elements of the matrix onto orthogonal axes, where the ranking distance relation between elements of the confusion matrix can be estimated. In contrast to the hierarchical clustering method (Mak et al. 1996), this section applies MDS to a global similarity measure over the multilingual triphones.
CH A @ I X B AE G D
CH
A 3 4 1
@ 2 3
I 1 2 4 3
X 1 2 3
B 3 2 1
AE 2 1 3 2 3
G 1 2 3
D 1 5 4
Table 3. Example of the multilingual sentence “查一下<look up> (CH A @ I X I A) Baghdad (B AE G D AE D)” in HAL space
In this section, the multidimensional scaling method, suitable for representing high-dimensional relations, is adopted to project the confusion characteristics of multilingual triphones onto a lower-dimensional space for similarity estimation. MDS is similar to principal component analysis (PCA); the difference is that MDS focuses on the distance relations between variables, whereas PCA focuses on the discriminative principal components of the variables. MDS is applied to estimate the similarity of pairwise triphones. The similarity matrix V = (v_{k,l})_{N×N} contains the pairwise similarities between every two multilingual triphones. The element in row k and column l of the similarity matrix is computed as
v_{k,l} = λ · a_{k,l} + (1 − λ) · g_{k,l}    (42)

where λ denotes the combination weight. The sum rule of data fusion is thus used to combine the acoustic likelihood (ACL) and contextual analysis (HAL) confusion matrices, as shown in Figure 5.
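The sum-rule fusion of the ACL and HAL matrices can be written in one line; the exact form v = λ·a + (1 − λ)·g is an assumption consistent with the combination weight λ mentioned in the text:

```python
import numpy as np

# Sum-rule fusion of the acoustic (ACL) and contextual (HAL) confusion
# matrices with a combination weight lam; matrices below are invented.

def fuse(acl, hal, lam=0.5):
    return lam * acl + (1.0 - lam) * hal

acl = np.array([[1.0, 0.2], [0.2, 1.0]])
hal = np.array([[1.0, 0.6], [0.6, 1.0]])
v = fuse(acl, hal, lam=0.5)
print(v[0, 1])  # 0.4
```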
MDS is then adopted to project the triphones onto the orthogonal axes where the ranking
distance relation between triphones can be estimated based on the similarity matrices of
triphones. The first step of MDS is to obtain the matrix

B = H S H    (43)

where H = I − (1/n)·1·1′ is the centering matrix, I denotes the identity matrix and 1 the vector of ones. The elements of B are computed as

b_{k,l} = s_{k,l} − s̄_k − s̄_l + s̄    (44)

where
s̄_k = Σ_{l=1}^{N} s_{k,l} / N    (45)

denotes the average similarity over the k-th row,

s̄_l = Σ_{k=1}^{N} s_{k,l} / N    (46)

denotes the average similarity over the l-th column, and

s̄ = Σ_{k=1}^{N} Σ_{l=1}^{N} s_{k,l} / N²    (47)

is the average similarity over all rows and columns of the similarity matrix. The
eigenvector analysis is applied to matrix B to obtain the axis of each triphone in a low
dimension. The singular value decomposition (SVD) is applied to solve the eigenvalue and
eigenvector problems. Afterwards, the first z nonzero eigenvalues are obtained in descending order, i.e. λ_1 ≥ λ_2 ≥ … ≥ λ_z > 0, with corresponding ordered eigenvectors u_1, …, u_z. Each triphone is then represented by a projected vector

Y = [ √λ_1 u_1, √λ_2 u_2, …, √λ_z u_z ]    (48)
The closeness between two projected triphone vectors is measured by the cosine similarity

C(Y_k, Y_l) = (Y_k · Y_l) / ( ‖Y_k‖ ‖Y_l‖ ) = Σ_{i=1}^{z} y_{k,i} y_{l,i} / [ √(Σ_{i=1}^{z} y_{k,i}²) · √(Σ_{i=1}^{z} y_{l,i}²) ]    (49)

where y_{k,i} and y_{l,i} are the elements of the triphone vectors Y_k and Y_l. The modified k-means (MKM) algorithm is applied to cluster all the triphones into a compact phonetic set; the convergence of the closeness measure is determined by a pre-set threshold.
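Eqs. (43)-(49) together amount to classical MDS followed by a cosine closeness. The sketch below uses a small invented similarity matrix for four "triphones"; a real system would start from the fused matrix V:

```python
import numpy as np

def mds_project(s, z=2):
    """Classical MDS (Eqs. 43-48): s is an N x N symmetric similarity matrix."""
    n = s.shape[0]
    h = np.eye(n) - np.ones((n, n)) / n          # centering matrix H
    b = h @ s @ h                                # B = HSH, Eq. (43)
    vals, vecs = np.linalg.eigh(b)               # eigh suits symmetric B
    order = np.argsort(vals)[::-1][:z]           # z largest eigenvalues
    vals, vecs = np.maximum(vals[order], 0.0), vecs[:, order]
    return vecs * np.sqrt(vals)                  # Y = [sqrt(l_i) u_i], Eq. (48)

def closeness(yk, yl):
    """Cosine closeness of Eq. (49) between two projected vectors."""
    return yk @ yl / (np.linalg.norm(yk) * np.linalg.norm(yl))

# invented similarity matrix: triphones 0/1 and 2/3 form two close pairs
s = np.array([[1.0, 0.9, 0.1, 0.1],
              [0.9, 1.0, 0.1, 0.1],
              [0.1, 0.1, 1.0, 0.9],
              [0.1, 0.1, 0.9, 1.0]])
y = mds_project(s, z=2)
# members of the same pair end up closer than members of different pairs
assert closeness(y[0], y[1]) > closeness(y[0], y[2])
```

The projected vectors y can then be fed to a k-means-style clustering with the closeness measure as the distance criterion.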
Figure 5. An illustration of the fusion of the acoustic likelihood (ACL) and contextual analysis (HAL) confusion matrices for the MDS process
4.3.2. Evaluation of the phone set generation based on acoustic and contextual analysis
In this section, the phone recognition rate was adopted for the evaluation of acoustic
modeling accuracy. Three classes of speech recognition errors, including insertion errors (
Ins ), deletion errors ( Del ) and substitution errors ( Sub ), were considered. This section
applied the fusion of acoustic and contextual analysis approaches to generating the
multilingual triphone set. Since the optimal clustering number of acoustic models was
unknown, several sets of HMMs were produced by varying the MKM convergence
threshold during multilingual triphone clustering. There are three different approaches
including acoustic likelihood (ACL), contextual analysis (HAL) and fusion of acoustic and
contextual analysis (FUN). It is evident that the proposed fusion method achieves better results than the individual ACL or HAL methods. Comparing acoustic analysis and contextual analysis, HAL achieves a higher recognition rate than ACL, which indicates that contextual analysis is more significant than acoustic analysis for clustering confusable multilingual phones. The curves show that phone accuracy first increases with the number of states and then decreases, due to confusable triphone definitions and the need for a large multilingual training corpus. The proposed multilingual phone generation approach achieves better performance than the ordinary multilingual triphone sets. In this section, the English and Mandarin triphone sets are defined based on an expansion of the IPA definition. The multilingual speech recognition system for English and Mandarin contains 924 context-dependent triphone models. The best phone recognition accuracy, 67.01%, was obtained with a HAL window size of 3; this setting is therefore applied in the following experiments.
4.3.3. Comparison of acoustic and language models for multilingual speech recognition
Table 5 shows the comparisons on different acoustic and language models for multilingual
speech recognition. For the comparison of monophone and triphone-based recognition,
different phone inventory definitions including direct combination of language-dependent
phones (MIX), language-dependent IPA phone definition (IPA), tree-based clustering
procedure (TRE) (Mak et al. 1996) and the proposed methods (FUN) were considered. The
phonetic units of Mandarin can be represented as 37 fundamental phones and English can
be represented as 39 fundamental phones. The phone set for the direct combination of
English and Mandarin is 78 phones with two silence models. The phone set for IPA
definition of English and Mandarin contains 55 phones.
Table 5. Comparison of acoustic and language models for multilingual speech recognition
32 Modern Speech Recognition Approaches with Case Studies
In order to evaluate the acoustic modeling performance, the experiments were conducted without a language model. Under this condition, the MIX approach achieved 32.58%, the IPA method 51.98%, the TRE method 65.32%, and the proposed approach 67.01% phone accuracy. In conclusion, multilingual speech recognition obtains the best performance using the FUN approach for context-dependent phone definition together with a language model.
                       Monolingual                     Multilingual
                       English word    English sent.   English and Mandarin mixed sent.
Training corpus        2496            3072            5884
Phone recognition
accuracy               76.25%          67.42%          67.01%
Table 6. Comparison of monolingual and multilingual speech recognition
4.4. Conclusions
In this section, the fusion of acoustic and contextual analysis is proposed to generate
phonetic units for mixed-language or multilingual speech recognition. The context-
dependent triphones are defined based on the IPA representation. Furthermore, the
confusing characteristics of multilingual phone sets are analyzed using acoustic and
contextual information. From the acoustic analysis, the acoustic likelihood confusing matrix
is constructed by the posterior probability of triphones. From the contextual analysis, the
hyperspace analog to language (HAL) approach is employed. Using the multidimensional
scaling and data fusion approaches, the combination matrix is built and each phone is
represented as a vector. Furthermore, the modified k-means algorithm is used to cluster the
multilingual triphones into a compact and robust phone set. Experimental results show that
the proposed approach gives encouraging results.
5. Conclusions
In this chapter, speech recognition techniques in adverse environments are presented. For
speech recognition in noisy environments, two approaches to cepstral feature enhancement
for noisy speech recognition using noise-normalized stochastic vector mapping are
described. Experimental results show that the proposed approach outperformed the
SPLICE-based approach without stereo data on AURORA2 database. For speech recognition
in disfluent environments, an approach to edit disfluency detection and correction for rich transcription is presented. The proposed approach, based on a two-stage process, aims to model the behavior of edit disfluencies and to clean up the disfluency. Experimental
results indicate that the IP detection mechanism is able to recall IPs by adjusting the
threshold in hypothesis testing. For speech recognition in multilingual environments, the
fusion of acoustic and contextual analysis is proposed to generate phonetic units for mixed-
language or multilingual speech recognition. The confusing characteristics of multilingual
phone sets are analyzed using acoustic and contextual information. The modified k-means
algorithm is used to cluster the multilingual triphones into a compact and robust phone set.
Experimental results show that the proposed approach improves recognition accuracy in
multilingual environments.
Author details
Chung-Hsien Wu* and Chao-Hong Liu
Department of Computer Science and Information Engineering, National Cheng Kung University,
Tainan, Taiwan, R.O.C.
Acknowledgement
This work was partially supported by NCKU Project of Promoting Academic Excellence &
Developing World Class Research Centers.
6. References
Bear, J., J. Dowding and E. Shriberg (1992). Integrating multiple knowledge sources for detection
and correction of repairs in human-computer dialog. Proc. of ACL. Newark, Deleware, USA,
Association for Computational Linguistics: 56-63.
* Corresponding Author
Benveniste, A., M. Métivier and P. Priouret (1990). Adaptive Algorithms and Stochastic
Approximations. Applications of Mathematics. New York, Springer. 22.
Boll, S. (1979). Suppression of acoustic noise in speech using spectral subtraction. IEEE
Transactions on Acoustics, Speech and Signal Processing, Vol. 27. No. 2. pp. 113-120.
Charniak, E. and M. Johnson (2001). Edit detection and parsing for transcribed speech. Proc. of
NAACL, Association for Computational Linguistics: 118-126.
Chen, Y. J., C. H. Wu, Y. H. Chiu and H. C. Liao (2002). Generation of robust phonetic set
and decision tree for Mandarin using chi-square testing. Speech Communication, Vol. 38.
No. 3-4. pp. 349-364.
Deng, L., A. Acero, M. Plumpe and X. Huang (2000). Large-vocabulary speech recognition
under adverse acoustic environments. Proc. ICSLP-2000, Beijing, China.
Deng, L., J. Droppo and A. Acero (2003). Recursive estimation of nonstationary noise using
iterative stochastic approximation for robust speech recognition. Speech and Audio
Processing, IEEE Transactions on, Vol. 11. No. 6. pp. 568-580.
Furui, S., M. Nakamura, T. Ichiba and K. Iwano (2005). Analysis and recognition of
spontaneous speech using Corpus of Spontaneous Japanese. Speech Communication, Vol.
47. No. 1-2. pp. 208-219.
Gales, M. J. F. and S. J. Young (1996). Robust continuous speech recognition using parallel
model combination. IEEE Transactions on Speech and Audio Processing, Vol. 4. No. 5. pp.
352-359.
Goldberger, J. and H. Aronowitz (2005). A distance measure between gmms based on the
unscented transform and its application to speaker recognition. Proc. of EUROSPEECH.
Lisbon, Portugal: 1985-1988.
Hain, T., P. C. Woodland, G. Evermann, M. J. F. Gales, X. Liu, G. L. Moore, D. Povey and L.
Wang (2005). Automatic transcription of conversational telephone speech. IEEE
Transactions on Speech and Audio Processing, Vol. 13. No. 6. pp. 1173-1185.
Heeman, P. A. and J. F. Allen (1999). Speech repairs, intonational phrases, and discourse
markers: modeling speakers' utterances in spoken dialogue. Computational Linguistics,
Vol. 25. No. 4. pp. 527-571.
Hermansky, H. and N. Morgan (1994). RASTA processing of speech. IEEE Transactions on
Speech and Audio Processing, Vol. 2. No. 4. pp. 578-589.
Hieronymus, J. L. (1993). ASCII phonetic symbols for the world’s languages: Worldbet.
Journal of the International Phonetic Association, Vol. 23.
Hsieh, C. H. and C. H. Wu (2008). Stochastic vector mapping-based feature enhancement
using prior-models and model adaptation for noisy speech recognition. Speech
Communication, Vol. 50. No. 6. pp. 467-475.
Huang, C. L. and C. H. Wu (2007). Generation of phonetic units for mixed-language speech
recognition based on acoustic and contextual analysis. IEEE Transactions on Computers,
Vol. 56. No. 9. pp. 1225-1233.
Johnson, M. and E. Charniak (2004). A TAG-based noisy channel model of speech repairs. Proc. of
ACL, Association for Computational Linguistics: 33-39.
Yeh, J. F. and C. H. Wu (2006). Edit disfluency detection and correction using a cleanup
language model and an alignment model. IEEE Transactions on Audio, Speech, and
Language Processing, Vol. 14. No. 5. pp. 1574-1583.
Young, S. J., J. Odell and P. Woodland (1994). Tree-based state tying for high accuracy acoustic
modelling. Proc. ARPA Human Language Technology Conference. Plainsboro, USA,
Association for Computational Linguistics: 307-312.
Chapter 2
Speech Recognition for Agglutinative Languages
R. Thangarajan
http://dx.doi.org/10.5772/50140
1. Introduction
Speech technology is a broader area comprising many applications like speech recognition,
Text to Speech (TTS) Synthesis, speaker identification and verification and language
identification. Different applications of speech technology impose different constraints on the
problem and these are tackled by different algorithms. In this chapter, the focus is on
automatically transcribing speech utterances into text of a given language, a process called Automatic Speech Recognition (ASR). Even after years of extensive research and development, ASR still remains a challenging field of research. In recent years, however, ASR technology has matured to a level where success rates are high in certain domains. A well-known example is human-computer interaction, where speech is used as an interface, with or without other pointing devices.
ASR is fundamentally a statistical problem. Its objective is to find the most likely sequence of
words, called hypothesis, for a given sequence of observations. The sequence of observations
involves acoustic feature vectors representing the speech utterance. The performance of an
ASR system can be measured by aligning the hypothesis with the reference text and by
counting errors like deletion, insertion and substitution of words in the hypothesis.
ASR is a subject involving signal processing and feature extraction, acoustics, information
theory, linguistics and computer science. Speech signal processing helps in extracting
relevant and discriminative information, which is called features, from speech signal in a
robust manner. Robustness involves spectral analysis used to characterize time varying
properties of speech signal and speech enhancement techniques for making features resilient
to noise. Acoustics provides the necessary understanding of the relationship between speech
utterances and the physiological processes in speech production and speech perception.
Information theory provides the necessary procedures for estimating parameters of
statistical models during training phase. Computer science plays a major role in ASR with
its implementation of efficient algorithms in software or hardware for decoding speech in
real-time.
a. Context variability
In any language, some words with different meanings have the same phonetic realization, and their usage depends on the context. There is even more context dependency at the phone level.
The acoustic realization of a phone is dependent on its neighboring phones. This is because
of the physiology of articulators involved in speech production.
b. Style variability
In isolated speech recognition with a small vocabulary, a user pauses between every word
while speaking. Thus it is easy to detect the boundary between words and decode them using
the silence context. This is not possible in continuous speech recognition. The speaking rate
affects the word recognition accuracy. That is, the higher the speaking rate, the higher the WER.
c. Speaker variability
Every speaker’s utterance is unique. The speech produced by a speaker depends on a number of factors, namely vocal tract physiology, age, sex, dialect, health, education, etc. For speaker-independent speech recognition, utterances from more than 500 speakers covering different age groups, sexes, educational backgrounds and dialects are typically needed to build a combined model. In contrast, a speaker-dependent or speaker-adapted system includes a user enrollment process, where a new user trains the system on his or her voice, for example for 30 minutes, before using it.
d. Environment variability
Many practical speech recognizers lack robustness against changes in the acoustic
environment. This has always been a major limitation of speech based interfaces used in
mobile communication devices. The acoustic environment variability is highly
unpredictable and it cannot be accounted for during training of models. A mismatch will
always occur between the trained speech models and test speech.
Speech Recognition for Agglutinative Languages 39
Generally, a hypothesis sentence is aligned with the correct reference sentence, and the numbers of insertions, substitutions and deletions are computed using maximum substring matching, implemented with dynamic programming. The WER is computed as shown in equation (1):

WER = ( S + D + I ) / N × 100    (1)

where S, D and I are the numbers of substituted, deleted and inserted words and N is the number of words in the reference. Other performance measures are speed and memory footprint. Speed is an important factor which quantifies the turnaround time of the system once the speech is uttered; it can be expressed as the real-time factor of equation (2):

RTF = processing time / utterance duration    (2)

Obviously, the time taken for processing should be shorter than the utterance duration for a quick response from the system. The memory footprint indicates the amount of memory required to load the model parameters.
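The scoring procedure above (dynamic-programming alignment, then Eq. (1)) can be sketched as a standard edit-distance computation:

```python
# WER via dynamic-programming alignment: d[i][j] is the minimum number of
# substitutions, deletions and insertions needed to align the first i
# reference words with the first j hypothesis words.

def wer(reference, hypothesis):
    r, h = reference.split(), hypothesis.split()
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i                       # all deletions
    for j in range(len(h) + 1):
        d[0][j] = j                       # all insertions
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = d[i - 1][j - 1] + (r[i - 1] != h[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return 100.0 * d[len(r)][len(h)] / len(r)

# one deleted word out of six reference words
print(round(wer("the cat sat on the mat", "the cat sat on mat"), 2))  # 16.67
```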
There are a number of well-known factors which affect the accuracy of an ASR system.
The prominent factors are those which include variations in context, speakers and noise in
the environment. Research in ASR is classified into different types depending on the
nature of the problems, like a small or a large vocabulary task, isolated or continuous
speech, speaker dependent or independent and robustness to environmental variations.
The state-of-the-art speech recognition systems can recognize spoken input accurately
with some constraints. The constraints can be speaker dependency, language dependency,
speaking style, task or environment. Therefore, building an automatic speech recognizer which can recognize the speech of different speakers, speaking in different languages, with a variety of accents, in any domain and in any acoustic environment, is still far from reality.
ASR for languages like English, French and Czech has matured considerably, and a lot of research and development has also been reported for oriental languages like Chinese and Japanese. In the Indian scenario, however, ASR is still in its nascent stage due to the inherent agglutinative nature of most of the official Indian languages. Agglutination refers to the extensive morphological
inflection in which one can find a one-to-one correspondence between affixes and syntactic
categories. This nature results in a large number of words in the dictionary which hinders
modeling and training of utterances, and also creates Out-Of-Vocabulary (OOV) words
when deployed.
2. Speech units
The objective of this chapter is to discuss a few methods to improve the accuracy of ASR
systems for agglutinative languages. The language presented here as a case study is Tamil
(ISO 639-3 tam). Tamil is a Dravidian language spoken predominantly in the state of
Tamilnadu in India and in Sri Lanka. It is the official language of the Indian state of
Tamilnadu and also has official status in Sri Lanka, Malaysia and Singapore. With more
than 77 million speakers, Tamil is one of the widely spoken languages in the world. Tamil
language has also been conferred the status of classical language by the government of
India.
Currently, there is growing interest among Indian researchers in building reliable ASR systems for Indian languages like Hindi, Telugu, Bengali and Tamil. Kumar et al (2004)
reported the implementation of a Large Vocabulary Continuous Speech Recognition
(LVCSR) system for Hindi. Many efforts have been put to build continuous speech
recognition systems for Tamil language with a limited and restricted vocabulary
(Nayeemulla Khan and Yegnanarayana 2001, Kumar and Foo Say Wei 2003, Saraswathi and
Geetha 2004, Plauche et al 2006). Despite repeated efforts, an LVCSR system for the aforesaid languages is yet to be realized at a significant level. Agglutination apart, there are other issues to be addressed, such as aspirated and unaspirated consonants, and retroflex consonants.
From the acoustics point of view, a phoneme is defined as the smallest segmental unit of sound used to distinguish meaning between utterances. A phoneme can be considered a group of slightly different sounds which are all perceived to have the same function by the speakers of a language or dialect. An example of a phoneme is the /k/ sound in the words kit and skill. It is customary to place phonemes between slashes in
transcriptions. However, the phoneme /k/ in each of these words is actually pronounced
differently i.e. it has different realizations. It is because the articulators which generate the
phoneme cannot move from one position to another instantaneously. Each of these different
realizations of the phoneme is called a phone or technically an allophone (in transcriptions,
a phone is placed inside a square bracket like [k]). A phone can also be defined as an
instance of a phoneme. In kit, [k] is aspirated, while in skill, [k] is unaspirated. Aspiration is a period of voicelessness after a stop closure and before the onset of voicing of the following vowel; it sounds like a puff of air after the [k] and before the vowel. An aspirated phone is represented as [kʰ]. In some languages, aspirated and unaspirated consonants are
treated as different phonemes. Hindi, for instance, has four realizations for [k] and they are
considered as different phonemes. Tamil does not discriminate them and treats them as
allophones.
b. Words
Next comes the representation in symbolic form. In linguistics, a word is defined as a sequence of morphemes, where the sequence is determined by the morphotactics. A morpheme is the smallest independent unit that carries meaning in a language; it can be the root word or any of the valid prefixes or suffixes. Therefore, what counts as a word is quite arbitrary and depends on the language in context. In agglutinative languages, a word may consist of a root along with its suffixes, a process known as inflectional morphology. Syntax deals with sentence formation using lexical units. Agglutinative languages exhibit inflectional morphology to a high extent; on the other hand, their syntactic structure is quite simple, which enables free word order in a sentence. The English language, which is not agglutinative, has simpler lexical morphology but a significantly more complex syntactic structure.
c. Syllables
A vowel forms the nucleus of the syllable while an optional consonant or consonant cluster
forms the onset and coda. In some syllables, the onset or the coda will be absent and the
syllables may start and/or end with a vowel.
There is no agreed-upon definition of syllable boundaries. Furthermore, there are some words, like meal, hour and tire, which can be viewed as containing one syllable or two (Ladefoged 1993).
A syllable is usually a larger unit than a phone, since it may encompass two or more phonemes, although in a few cases a syllable may consist of a single phoneme.
Syllables are often considered the phonological building blocks of words. Syllables have a
vital role in a language’s rhythm, prosody, poetic meter and stress.
The syllable, as a unit, inherently accounts for the severe contextual effects among its
phones as in the case of words. Already it has been observed that a syllable accounts for
pronunciation variation more systematically than a phone (Greenberg 1998). Moreover,
syllables are intuitive and more stable units than phones and their integrity is firmly based
on both the production and perception of speech. This is what sets a syllable apart from a
triphone. Several research works using syllable as a speech unit have been successfully
carried out for English and other oriental languages like Chinese and Japanese by
researchers across the world.
In the Japanese language, for instance, the number of distinct syllables is about 100, which is very small (Nakagawa et al 1999). In a language like English, however, syllables are large in number: some studies put them on the order of 30,000, of which the lexically attested syllables are on the order of 10,000. When there are so many syllables, it becomes difficult to train syllable models for ASR (Ganapathiraju et al 2001).
As a result, a large number of inflectional variants exist for each word. The use of suffixes is governed by morphotactic rules. Typically, a stem in Tamil may have the following structure.

For each stem, there are at least 2^7 = 128 inflected word forms, assuming only two options for each of seven affix slots. Actually, there may be more than two options per slot, but there may also be gaps. In contrast, English has at most 4 word forms for a verb, as in swim, swims, swam and swum, and 4 for a noun, as in man, man's, men and men's. Hence, for a lexical vocabulary of 1,000 stems, the actual Tamil word list of inflected forms will be of the order of 128,000.
Cases Suffixes
Accusative ஏ, ஐ
Instrumental ஆல்
Ablative இருந்து
Vocative ஏ
Selective ஆவது
Interrogative ஆ, ஓ
The verbs in Tamil take various forms: simple, transitive, intransitive, causative,
infinitive, imperative and reportive. Verbs are also formed from a stem and various suffix
patterns, some of which are shown in Table 2. Rajendran et al (2003) carried out a detailed
study of the computational morphology of verbal patterns in Tamil.
Adjectives and adverbs are generally obtained by attaching the suffixes ஆன and ஆக,
respectively, to noun forms. Tamil often uses a verb, an adjective or an adverb as the head of a
noun phrase. This process, called nominalization, produces a noun from another part of
speech through morphological inflection.
3.2. Morpho-phonology
Morpho-phonology, also known as sandhi, in which two consecutive words combine through
deletion, insertion or substitution of phonemes at word boundaries to form a new word, is
very common in Tamil. English exhibits morpho-phonology only to a limited extent. For
example, the negative prefix in- changes according to the first letter of the word it attaches to:
in + proper → improper
in + logical → illogical
in + rational → irrational
in + mature → immature
In the first example, there is an insertion of a consonant (ப்) between the two words. The
second example shows a substitution of the last consonant (ம்) of the first word by another
consonant (த்). In the third example, there is a deletion of the last vowel (உ, i.e. று = ற் +
உ) of the first word, and the consonant (ற்) merges with the incoming vowel (ஆ, i.e. ற் + ஆ =
றா) of the second word. As a result of morpho-phonology, two distinct words combine
and sound like a single word. In fact, morpho-phonology has evolved from the context
dependencies among the phonetic units at the boundary of adjacent words or morphemes.
‘உ’ removal rule: This rule states that when one morpheme ends in the vowel ‘உ’ and
the following morpheme starts with a vowel, then the ‘உ’ would be removed from the
combination.
‘வ்’ and ‘ய்’ addition rule: When one morpheme ends with a vowel from a particular
set of vowels and the morpheme it joins starts with a vowel, then ‘வ்’ or ‘ய்’ is added
to the end of the first morpheme.
Doubling rule: According to this rule, when one morpheme ends with either ‘ண்’, ‘ன்’,
‘ம்’ or ‘ய்’ and the next morpheme starts with a vowel then the ‘ண்’, ‘ன்’, ‘ம்’ or ‘ய்’ is
doubled.
Insertion of க், ச், ட் or ப்: This rule states that when one morpheme ends with a vowel
and the next morpheme begins with க், ச், ட் or ப் followed by a vowel, the
corresponding க், ச், ட் or ப் is doubled.
Apart from these rules, there are instances of morpho-phonology based on the context. For
example, பழங் கூடை (old basket) and பழக் கூடை (fruit basket). Therefore, modelling
morpho-phonology in Tamil is still a challenging issue for research.
Owing to the deficiency of text and annotated corpora in Tamil, building reliable
statistical language models is a difficult task. Even with the available corpus, the
agglutinative nature of the language compounds the problem, and statistical studies using
large corpora are still at a nascent stage. However, Rajendran (2006) has reviewed
various recent works on morphological analysis, morphological disambiguation,
shallow parsing, POS tagging and syntactic parsing in Tamil.
For instance, let w1 and w2 be two consecutive words in a text. The probability
P(w2 | w1) is the bi-gram that measures the correlation between the words w1 and w2.
This bi-gram measure is sufficient for modeling strings of words in a language where
inflectional morphology is low. In agglutinative languages like Tamil, however, a finer-grained
measure is warranted. This issue has been successfully resolved by Saraswathi and
Geetha (2007) with their enhanced morpheme-based language model. The size of the
vocabulary was reduced by decomposing the words into stems and endings, and these
sub-word units (morphemes) are stored in the vocabulary separately. The enhanced
morpheme-based language model is designed and trained on the decomposed corpus. A
Tamil text corpus is decomposed into stems and their associated suffixes using an existing
Tamil morphological analyzer (Anandan et al, 2002). The decomposition reduces the
number of distinct words by around 40% on two different corpora, namely News and
Politics. The stems and endings are marked with special characters, '#' for stems and '$'
for suffixes, so that they can be co-joined back after recognition.
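The marking scheme can be illustrated with a small sketch. The analyzer below is only a stand-in (the actual morphological analyzer of Anandan et al. (2002) is far more sophisticated), and the transliterated suffix list is hypothetical, used purely for illustration:

```python
# Hypothetical transliterated case suffixes; a real analyzer uses
# morpho-tactic rules, not a simple suffix list.
SUFFIXES = ["ai", "al", "ku", "il"]

def analyze(word):
    """Split a word into (stem, suffix); fall back to (word, None)."""
    for suf in SUFFIXES:
        if word.endswith(suf) and len(word) > len(suf):
            return word[: -len(suf)], suf
    return word, None

def decompose(word):
    """Mark the stem with '#' and the suffix with '$' for the LM corpus."""
    stem, suf = analyze(word)
    return [stem + "#"] + ([suf + "$"] if suf else [])

def rejoin(units):
    """Co-join the marked morphemes back into words after recognition."""
    words, current = [], ""
    for u in units:
        if u.endswith("#"):           # stem unit starts a new word
            if current:
                words.append(current)
            current = u[:-1]
        else:                          # suffix unit completes the word
            current += u[:-1]
            words.append(current)
            current = ""
    if current:
        words.append(current)
    return words

units = decompose("veettil")   # hypothetical inflected form: stem + "il"
print(units)                   # ['veett#', 'il$']
print(rejoin(units))           # ['veettil']
```

The point of the '#'/'$' markers is that the recognizer can emit morphemes as vocabulary items (shrinking the vocabulary) while the original word sequence remains recoverable.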
A general morpheme-based language model is one where the stems and suffixes are treated
as independent words; no distinction is made between a stem and a suffix. Figure 2
depicts the various probability measures involved in a morpheme-based language model.
The word W1 is split into the stem S1 and suffix E1, and the word W2 is split into the stem
S2 and suffix E2.
In this case, the prediction of the suffix E1 is based on S1, with which it is strongly
correlated, since a stem can take only a few of the seven possible suffixes. This information is
The perplexity and WER are obtained using a Tamil speech recognition system. Figure 3a
portrays the perplexity of the language models, while Figure 3b compares the WER of the
ASR system employing both language models.
The results confirm that the modified morpheme-based trigram language model with Katz
back-off smoothing achieves better perplexity and lower WER on the two Tamil corpora,
and that the proposed enhanced morpheme-based language model is much better suited
than word-based language models to agglutinative languages.
Figure 3. (a) Perplexity of the language models; (b) WER of the ASR system employing both language models.
Ganapathiraju et al (2001) report the first successful robust LVCSR system that used
syllable-level acoustic units on telephone-bandwidth spontaneous speech. The paper begins
with the conjecture that a syllable-based system would outperform existing triphone
systems, and concludes with experimental verification after comparing the performance of
a syllable-based system with that of word-internal and cross-word triphone systems on
publicly available databases, viz. Switchboard and Alphadigits. A number of experiments
involving syllables and CI phones, syllables and CD phones, and syllables, mono-syllabic
words and CD phones are reported in that paper. However, the system is deficient,
especially in the integration of syllable and phone models as mixed-word entries, because
mixing models of different lengths and contexts may yield only marginal improvements.
Syl = [C] V [C [C]] (3)
There are two types of prosodic syllables, namely Ner-acai and Nirai-acai. Ner-acai is
monosyllabic; it consists of either one short vowel or one long vowel, and may be open or
closed, i.e. ending in a vowel or in consonant(s) respectively. Nirai-acai is always disyllabic,
with an obligatorily short vowel in first position, while the second phoneme is unrestricted.
Like Ner-acai, Nirai-acai may also be of the open or closed type. The prosodic syllable
representation can take any of the eight patterns shown in Table 3. An uninflected Tamil
word may comprise one to four prosodic syllables.
PS = [SV] (SV | LV) [C [C]] (4)
(where SV denotes a short vowel, LV a long vowel, C a consonant, and square brackets mark optional elements)
This expression describes all the possible patterns of prosodic syllables. In other words, an
optional short vowel is followed obligatorily by either a short vowel or a long vowel, and
zero or one or two consonants.
Theoretically, the number of prosodic syllables is quite large (of the order of
3,674,160), since there are 90 (18 × 5) short-vowel units, 126 (18 × 7) long-vowel units and 18
consonants. The actual number, however, is much smaller owing to constraints such as
phono-tactics and morpho-tactics. Hence, it is essential to estimate the number of prosodic
syllables with the help of a corpus.
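The prosodic syllable patterns of equation (4) can be validated with a regular expression equivalent to the DFA mentioned in Table 4. The sketch below assumes a hypothetical class encoding ('v' = short vowel, 'V' = long vowel, 'c' = consonant) rather than raw Tamil graphemes; a real system would first map graphemes to these classes:

```python
import re

# Equation (4): an optional short vowel, then an obligatory short or long
# vowel, then zero, one or two consonants.
PROSODIC = re.compile(r"v?(v|V)c{0,2}")

def is_prosodic_syllable(pattern):
    """Check a class-encoded pattern against the prosodic syllable grammar."""
    return PROSODIC.fullmatch(pattern) is not None

# The eight canonical patterns (Ner-acai: v, V, vc, Vc; Nirai-acai: vv, vV,
# vvc, vVc) are all accepted:
patterns = ["v", "V", "vc", "Vc", "vv", "vV", "vvc", "vVc"]
print(all(is_prosodic_syllable(p) for p in patterns))  # True
print(is_prosodic_syllable("Vv"))  # False: a long vowel cannot come first
```

Running such a validator over the segmented corpus output is one way to obtain the "validated by the DFA" count reported in Table 4.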
After applying the algorithm to the text corpus, the frequency counts of the various
prosodic syllable patterns were gathered. The algorithm segmented 26,153 unique prosodic
syllables in the corpus. Since the text corpus used here was not clean, it contained many
abbreviations, digits and other foreign characters; therefore, prosodic syllable patterns with
a frequency of less than 10 were eliminated. It was then found that only 10,015 unique
prosodic syllables remained, as shown in Table 4.
Details Frequency
Documents 686
Sentences 455,504
Words 2,652,370
No. of unique prosodic syllables segmented by the algorithm 26,153
No. of unique prosodic syllables validated by the DFA 10,015
Table 4. Prosodic Syllables in CIIL Corpus
In order to keep the complexity low, it was preferable to model CI syllable units with
single-Gaussian continuous-density HMMs. The continuous speech was transformed into a
sequence of feature vectors, which was matched against the optimal concatenated HMM
sequence found using the Viterbi algorithm. The time stamps of segmented syllable
boundaries were obtained as a by-product of Viterbi decoding. The duration of the prosodic
syllables was found to vary from 290 ms to 315 ms: even though a prosodic syllable is either
monosyllabic or disyllabic, its duration was roughly equal to 300 ms on average. This
may be due to the vowel-duration reduction that occurs in non-initial syllables, as reported
by Asher and Keane (2005).
Based on these considerations, eight states per HMM were deemed adequate for the
experiment. Figure 4 shows the schematic block diagram of a syllable-based recognizer.
It was also observed that the prosodic syllable models produced a larger number of
substitution errors than insertions and deletions, whereas the word models produced a
majority of deletion errors. This comparison is shown in Figure 5. The majority of deletion
errors in the word models reflects the OOV rate due to morphological inflections; OOV
errors were significantly reduced in the syllable models. This demonstrates that syllables
are effective sub-word units (Thangarajan and Natarajan 2008b).
Figure 5. The Types of Word Errors in Word Models and Syllable Models
5. Summary
In this chapter, the nature of agglutinative languages was discussed with the Tamil
language as a case study. The inflectional morphology of Tamil was described in detail,
and the challenges it poses for ASR systems were highlighted. Two different approaches
used in ASR for agglutinative languages, the enhanced morpheme-based language model
and syllable-based models, were elaborated along with their results. Their merits and the
scope for further research were also discussed.
Author details
R. Thangarajan
Department of Computer Science and Engineering,
Kongu Engineering College, Perundurai, Erode, Tamilnadu, India
6. References
[1] Anandan P., Saravanan K., Parthasarathy R., and Geetha T.V., (2002), ‘Morphological
Analyzer for Tamil’, in the proceedings of ICON 2002, Chennai.
[2] Arden A. H. (1934), ‘A progressive grammar of common Tamil’ 4th edition, Christian
Literature Society, Madras, India, pp. 59.
[3] Arokianathan S. (1981), ‘Tamil clitics’, Dravidian Linguistics Association, Trivandrum,
India, pp. 5
[4] Asher R.E. and Keane E.L. (2005), ‘Diphthongs in colloquial Tamil’, (Hardcastle W.J.
and Mackenzie Beck J. eds.), pp. 141-171.
[5] Balasubramanian T. (1980), ‘Timing in Tamil’, Journal of Phonetics, Vol. 8, pp.449-
467.
[6] Fujimura O. (1975), ‘Syllable as a unit of Speech Recognition’, IEEE Transactions on
Acoustics, Speech and Signal Processing, Vol. ASSP-23, No. 1, pp. 82-87.
[7] Ganapathiraju A., Jonathan Hamaker, Joseph Picone, Mark Ordowski and George R.
Doddington (2001), ‘Syllable Based Large Vocabulary Continuous Speech
Recognition’, IEEE Transactions on Speech and Audio Processing, Vol. 9, No. 4, pp.
358-366.
[8] Greenberg S. (1998), ‘Speaking in Short Hand - A Syllable Centric Perspective for
Understanding Pronunciation Variation’, Proceedings of the ESCA Workshop on
Modeling Pronunciation Variation for Automatic Speech Recognition, Kekrade, pp. 47-
56.
[9] Harold F. Schiffman (2006), ‘A Reference Grammar of Spoken Tamil’, Cambridge
University Press (ISBN-10: 0521027527).
[10] Kumar C.S. and Foo Say Wei (2003), ‘A Bilingual Speech Recognition System for
English and Tamil’, Proceedings of Joint Conference of the Fourth International
Conference on Information, Communications and Signal Processing, 2003 and the
Fourth Pacific Rim Conference on Multimedia, Vol. 3, pp. 1641-1644.
[11] Kumar M., Rajput N. and Verma A. (2004), ‘A large-vocabulary continuous speech
recognition system for Hindi’, IBM Journal of Research and Development, Vol. 48,
No.5/6, pp. 703-715.
[12] Ladefoged, Peter. (1993), ‘A course in phonetics.’ 3rd edition, Fort Worth, TX: Harcourt,
Brace, and Jovanovich
[13] Marthandan, C.R. (1983), ‘Phonetics of casual Tamil’, Ph.D. Thesis, University of
London.
[14] Nakagawa S. and Hashimoto Y. (1988), ‘A method for continuous speech segmentation
using HMM’, presented at IEEE International Conference on Pattern Recognition.
[15] Nakagawa S., Hanai K., Yamamoto K. and Minematsu N. (1999), ‘Comparison of
syllable-based HMMs and triphone-based HMMs in Japanese speech recognition’,
Proceedings of International Workshop Automatic Speech Recognition and
Understanding, pp. 393-396.
[16] Nayeemulla Khan A. and Yegnanarayana B. (2001), ‘Development of Speech
Recognition System for Tamil for Small Restricted Task’, Proceedings of National
Conference on Communication, India.
[17] Plauche M., Udhyakummar N., Wooters C., Pal J. and Ramachadran D. (2006), ‘Speech
Recognition for Illiterate Access to Information and Technology’, Proceedings of First
International Conference on ICT and Development.
[18] Rajendran S, Viswanathan S and Ramesh Kumar (2003) ‘Computational Morphology of
Tamil Verbal Complex’, Language in India, Vol. 3:4
[19] Rajendran S (2004) ‘Strategies in the Formation of Compound Nouns in Tamil’,
Languages in India, Volume 4:6
[20] Rajendran S (2006) ‘Parsing In Tamil: Present State of Art’, Language in India, Vol. 6:8.
[21] Saraswathi S. and Geetha T.V. (2004), ‘Implementation of Tamil Speech Recognition
System Using Neural Networks’, Lecture Notes in Computer Science, Vol. 3285.
[22] Saraswathi S. and Geetha T. V. (2007), ‘Comparison of Performance of Enhanced
Morpheme-based language Model with Different Word-based Language Models for
Improving the Performance of Tamil Speech Recognition System’, ACM Transaction on
Asian Language Information Processing Vol. 6 No. 3, Article 9.
[23] Soundaraj F. (2000) ‘Accent in Tamil: Speech Research for Speech Technology’, In:
(Nagamma Reddy K. ed.), Speech Technology: Issues and implications in Indian
languages International School of Dravidian Linguistics, Thiruvananthapuram, pp. 246-
256.
[24] Thangarajan R., Natarajan A. M. and Selvam M. (2008a), ‘Word and Triphone based
Approaches in Continuous Speech Recognition for Tamil Language’, WSEAS
Transactions on Signal Processing, Issue 3, Vol.4, 2008, pp. 76-85.
[25] Thangarajan R. and Natarajan A. M. (2008b), ‘Syllable Based Continuous Speech
Recognition for Tamil Language’, South Asian Language Review (SALR), Vol. XVIII,
No. 1, pp. 71-85.
[26] Xuedong Huang, Alex Acero and Hsiao-Wuen Hon (2001), ‘Spoken Language
Processing - A Guide to Theory, Algorithm and System Development’, Prentice Hall
PTR (ISBN 0-13-022616-5).
Chapter 3
A Particle Filter Compensation Approach to Robust Speech Recognition
Aleem Mushtaq
http://dx.doi.org/10.5772/51532
1. Introduction
The speech production mechanism goes through various stages. First, a thought is
generated in the speaker's mind. The thought is put into a sequence of words, and these
words are converted into a speech signal using various articulators, including the facial
muscles, chest muscles and tongue. This signal is distorted by environmental factors such
as background noise, reverberation and channel distortions when it passes through a
microphone, a telephone channel, etc. The aim of automatic speech recognition (ASR)
systems is to reconstruct the spoken words from the speech signal. From an
information-theoretic perspective [1], we can treat everything between the speaker and the
machine as a distortion channel, as shown in Figure 1.
Here, W represents the spoken words and X is the speech signal. The problem of extracting
W from X can be viewed as finding the word sequence that most likely resulted in the
observed signal, as given in equation (1):

Ŵ = arg max_W p(X | W) (1)
As in any other machine learning or pattern recognition problem, the distribution p(X | W)
plays a fundamental role in the decoding process. This distribution is parametric, and its
parameters are estimated from the available training data. Modern ASR systems do well
when the environment of the speech signal being tested matches that of the training data,
because the parameter values then correspond well to the speech signal being decoded.
However, if the training and testing environments do not match well, the performance of
ASR systems degrades. Many schemes have been proposed to overcome this problem, but
humans still outperform these systems, especially in adverse conditions.
The approaches to overcoming this problem fall into two categories. One is to adapt the
parameters of p(X | W) so that they match the testing environment better; the other is to
choose features that are more robust to environment variations. The features can also be
transformed to make them better suited to the parameters of p(X | W) obtained from the
training data.
The features extracted from an available training speech corpus are used to estimate the
parameters of the acoustic models. An acoustic model for a particular speech unit, say a
phoneme or a word, gives the likelihood of observing that unit based on the features, as in
equation (1). The most commonly used structure for acoustic models in ASR systems is the
hidden Markov model (HMM), which captures the dynamics and variations of the speech
signal well. The test speech signal is then decoded using a Viterbi decoder.
In stage 1, the feature extraction process is improved so that the features are robust to
distortions. In stage 2, the features are modified to match the training environment better;
the mismatch in this stage is usually modeled by nuisance parameters, which are estimated
from the environment and the test data, and their effect is minimized according to some
optimality criterion. In stage 3, the acoustic models are improved to match the testing
environment better. One way to achieve this is multi-condition training, i.e. using data from
diverse environments to train the models. Another is to transform the models, where the
transformation matrix is obtained from the test environment.
Particle filters are powerful numerical mechanisms for sequential signal modeling and are
not constrained by the conventional linearity and Gaussianity requirements [7]. The particle
filter is a generalization of the Kalman filter [8] and is more flexible than the extended
Kalman filter [9], because the stage-by-stage linearization of the state space model is no
longer required [7]. One difficulty in using particle filters lies in obtaining a state space
model for speech, as consecutive speech features are usually highly correlated. Just as in
the Kalman filter and HMM frameworks, state transition is an integral part of particle
filter algorithms.
In contrast to previous particle filter attempts [4-6], the method described in this chapter
treats the speech signal as the state variable and the noise as the corrupting signal, and
attempts to estimate clean speech from noisy speech. We incorporate statistical
information available in acoustic models of clean speech, e.g., HMMs trained with
clean speech, as an alternative state transition model [10-11]. The similarity between HMMs
and particle filters can be seen from the fact that an observation probability density
function corresponding to each state of an HMM describes, in statistical terms, the
characteristics of the source generating a signal of interest if the source is in that particular
state, whereas in particle filters we try to estimate the probability distribution of the state the
system is in when it generates the observed signal of interest. Particle filters are suited for
feature compensation because the probability density of the state can be updated
dynamically on a sample-by-sample basis. On the other hand, state densities of the HMMs
are assumed independent of each other. Although they are good for speech inference
problems, HMMs do not adapt well in fast changing environments.
By establishing a close interaction of the particle filters and HMMs, the potentials of both
models can be harnessed in a joint framework to perform feature compensation for robust
speech recognition. We improve the recognition accuracy through compensation of noisy
speech, and we enhance the compensation process by utilizing information in the HMM
state transition and mixture component sequences obtained in the recognition process.
When state sequence information is available we found we can attain a 67% digit error
reduction from multi-condition training in the Aurora-2 connected digit recognition task. If
the missing parameters are estimated in operational situations, we observe only a 13%
error reduction in the current study. Moreover, by tracking the speech features,
compensation can be done using only partial information about noise and consequently
good recognition performance can be obtained despite potential distortion caused by non-
stationary noise within an utterance.
2. Tracking algorithms
Tracking is the problem of estimating the trajectory of an object in a space as it moves
through that space. The space could be an image plane captured directly from a camera or it
could be synthetically generated from a radar sweep. Generally, tracking schemes can be
applied to any system that can be represented by a time dynamical system which consists of
a state space model and an observation
x_t = f(x_{t-1}, w_t)
y_t = h(x_t, n_t) (2)
where n_t is the observation noise and w_t is the process noise, which represents the
model uncertainties in the state transition function f(·). What is available is an observation
y_t, which is a function of x_t. We are interested in finding a good estimate of the current
state given the observations up to the current time, i.e. f(x_t | y_t, y_{t-1}, ..., y_0). The
state space model f(·) represents the relation between states adjacent in time. The model in
equation (2) assumes that the state sequence is a one-step Markov process:
f(x_{t+1} | x_t, x_{t-1}, ..., x_0) = f(x_{t+1} | x_t) (3)
f(y_{t+1} | x_{t+1}, y_t, ..., y_0) = f(y_{t+1} | x_{t+1}) (4)
Tracking is a two-step process. The first step is to obtain the density of x_t given
observations up to time t - 1; this is called the prior density of x_t. Once it is available, we
can construct the posterior density upon availability of the observation y_t. The propagation
step is given in equation (5), and the update step is obtained using Bayes' theorem
(equation (6)):

f(x_t | y_{t-1}, ..., y_0) = ∫ f(x_t | x_{t-1}) f(x_{t-1} | y_{t-1}, ..., y_0) dx_{t-1} (5)

f(x_t | y_t, y_{t-1}, ..., y_0) = f(y_t | x_t, y_{t-1}, ..., y_0) f(x_t | y_{t-1}, ..., y_0) / f(y_t | y_{t-1}, ..., y_0) (6)
x_{t+1} = A_t x_t + w_t
y_t = C_t x_t + n_t (7)
where A_t and C_t are known as the state transition matrix and the observation matrix
respectively. The subscript t indicates that both can vary with time. Under the assumption
that the process noise and observation noise are Gaussian with zero mean and covariances
Q_t and R_t respectively, p(x_{t+1} | x_t) can be readily obtained:
mean(x_{t+1} | x_t) = E(A_t x_t + w_t) = A_t x_t
covariance(x_{t+1} | x_t) = E(w_t w_t^T) = Q_t (8)

and therefore

p(x_{t+1} | x_t) ~ N(A_t x_t, Q_t) (9)
To get the update step, we note that the distributions of x_{t+1} | y_t, ..., y_0 and y_{t+1} are
both Gaussian. For two random variables X and Y that are jointly Gaussian, the distribution
of one given the other, e.g. X | Y, is also Gaussian. Consequently,
x_{t+1} | y_{t+1}, y_t, ..., y_0 is a Gaussian distribution with the following mean and variance:

x̂_{t+1|t+1} = E[x_{t+1} | y_{t+1}, y_t, ..., y_0] = x̂_{t+1|t} + R_xy R_yy^{-1} (y_{t+1} - E[y_{t+1} | y_t, ..., y_0]) (13)
where
Similarly
Back substituting equation (14) and equation (15) in equation (13), we get
The covariance can also be obtained from the fact that the covariance of X | Y, for two
jointly Gaussian random variables, is given by

cov(X | Y) = R_xx - R_xy R_yy^{-1} R_yx (18)
The block diagram in Figure 4 shows the steps of a general recursive estimation algorithm,
starting from some initial state estimate. The block labeled 'Kalman filter' summarizes the
steps specific to the Kalman filter algorithm.
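For concreteness, the propagation and update steps can be sketched for a scalar state. This is a minimal illustration with made-up values, not the chapter's implementation; the gain k plays the role of the correction term R_xy R_yy^{-1} in equation (13):

```python
# One-dimensional Kalman filter sketch: state x_{t+1} = a*x_t + w_t with
# process variance q, observation y_t = c*x_t + n_t with variance r.
# All numerical values are illustrative.

def kalman_filter(ys, a=1.0, c=1.0, q=0.01, r=1.0, x0=0.0, p0=100.0):
    """Return the posterior state estimates; p0 is a diffuse initial variance."""
    x, p = x0, p0
    estimates = []
    for y in ys:
        # Propagation step: prior mean and variance, cf. equations (8)-(9)
        x_prior = a * x
        p_prior = a * p * a + q
        # Update step: gain times the innovation, cf. equation (13)
        k = p_prior * c / (c * p_prior * c + r)
        x = x_prior + k * (y - c * x_prior)
        p = (1.0 - k * c) * p_prior   # posterior variance, cf. equation (18)
        estimates.append(x)
    return estimates

# Noisy observations of a roughly constant state near 5.0
est = kalman_filter([4.8, 5.3, 4.9, 5.2, 5.1])
print(round(est[-1], 2))  # settles near 5.0
```

With a diffuse prior (large p0), the first update essentially trusts the first observation; subsequent updates average in new observations with shrinking gain.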
w_{k|k}^i = (1/C) · w_{k|k-1}^i · p(y_k | x_k^i)

w_{k|k-1}^i = Σ_{j=1}^{N_s} w_{k-1|k-1}^j · p(x_k^i | x_{k-1}^j) (21)
Here C is the normalizing constant that makes the total probability equal to one. The
assumption that the state can be represented by a finite number of points gives us the ability
to sample the whole state space. The weight w_k^i represents the probability of being in
state x^i when the observation at time k is y_k. In the grid-based method we construct the
discrete density at every time instant in two steps. First we estimate the weights at time k
without the current observation (w_{k|k-1}), and then update them when the observation
becomes available to obtain w_{k|k}. In the propagation step we take into account the
probabilities (weights) of all possible state values at time k - 1 to estimate the weights at
time k, as shown in Figure 5.
If the prior p(x_k | x_{k-1}) and the observation probability p(y_k | x_k) are available, the
grid-based method gives us the optimal solution for tracking the state of the system. If the
state of the system is not discrete, then we can obtain an approximate solution using this
method: we divide the continuous space into, say, N_s cells, and for each cell we compute
the prior and posterior in a way that takes into account the range of the whole cell, with
x̄_{k-1}^j the center of the j-th cell at time k - 1. The weight update in equation (21)
subsequently remains unchanged.
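Equation (21) can be sketched directly in code. The transition matrix and likelihood values below are toy numbers chosen only to show one propagate-and-update cycle over a discrete grid:

```python
# Grid-based method, one cycle of equation (21): propagate the weights
# through the transition matrix, reweight by the observation likelihood,
# then normalize by C.

def grid_step(weights, transition, likelihood):
    """weights[j]       : w_{k-1|k-1}^j
    transition[j][i] : p(x_k^i | x_{k-1}^j)
    likelihood[i]    : p(y_k | x_k^i)
    """
    ns = len(weights)
    # Propagation: w_{k|k-1}^i = sum_j w_{k-1|k-1}^j * p(x_k^i | x_{k-1}^j)
    predicted = [sum(weights[j] * transition[j][i] for j in range(ns))
                 for i in range(ns)]
    # Update: w_{k|k}^i ∝ w_{k|k-1}^i * p(y_k | x_k^i)
    updated = [predicted[i] * likelihood[i] for i in range(ns)]
    c = sum(updated)                 # normalizing constant C
    return [w / c for w in updated]

# Toy example with 3 grid cells and a sticky transition matrix
w0 = [1 / 3, 1 / 3, 1 / 3]
trans = [[0.8, 0.1, 0.1],
         [0.1, 0.8, 0.1],
         [0.1, 0.1, 0.8]]
lik = [0.1, 0.7, 0.2]                # observation strongly favours cell 1
w1 = grid_step(w0, trans, lik)
print(w1)                            # normalized; mass concentrates on cell 1
```

Starting from a uniform prior, the propagated weights stay uniform here, so the update simply renormalizes the likelihoods; with a non-uniform prior the transition structure would matter.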
X_1 ~ μ(x_1)
X_t | X_{t-1} = x_{t-1} ~ p(x_t | x_{t-1}) (23)
Y_t | X_t = x_t ~ p(y_t | x_t)
p(x_t | y_t, y_{t-1}, ..., y_1) ≈ Σ_{i=1}^{N_s} w_t^i δ(x_t - x_t^i) (24)
where x_t^i, for i = 1, ..., N_s, are the support points and w_t^i are the associated weights.
We thus have a discretized and weighted approximation of the posterior density without
the need for an analytical solution. Note the similarity with the grid-based method; there,
the support points for the discrete distribution were predefined and covered the whole
space. In the particle filter algorithm, the support points are determined based on the
concept of importance sampling,
in which, instead of drawing from π(·), we draw points from another distribution q(·) and
compute the weights using

w^i ∝ π(x^i) / q(x^i) (25)

where π(·) is the distribution of interest and q(·) is an importance density from which we
can draw samples. For the sequential case, the weight update can be computed recursively,

w_t^i ∝ w_{t-1}^i p(y_t | x_t^i) p(x_t^i | x_{t-1}^i) / q(x_t^i | x_{t-1}^i, y_t) (26)

The density q(·) propagates the samples to new positions at time t given the samples at
time t - 1, and is derived from the state transition model of the system.
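A minimal sketch of sequential importance sampling follows, assuming the state transition prior (here a Gaussian random walk) is used as the importance density q(·), in which case the weight update reduces to multiplying by the observation likelihood. All parameter values are illustrative:

```python
import math
import random

def gaussian_pdf(x, mean, var):
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def sis_step(particles, weights, y, trans_var=0.1, obs_var=0.5):
    """Propagate particles through q(.) and reweight by p(y_t | x_t^i)."""
    new_particles = [random.gauss(x, trans_var ** 0.5) for x in particles]
    new_weights = [w * gaussian_pdf(y, x, obs_var)
                   for w, x in zip(weights, new_particles)]
    total = sum(new_weights)               # normalize the weights
    return new_particles, [w / total for w in new_weights]

random.seed(0)
ns = 200
particles = [random.gauss(0.0, 1.0) for _ in range(ns)]
weights = [1.0 / ns] * ns
for y in [1.0, 1.1, 0.9, 1.0]:             # observations of a state near 1.0
    particles, weights = sis_step(particles, weights, y)

# Weighted-mean estimate of the state
estimate = sum(w * x for w, x in zip(weights, particles))
print(round(estimate, 2))
```

Without resampling, the weights of a plain SIS filter eventually degenerate onto a few particles; that is why the compensation scheme described later applies a resampling step at every stage.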
The additional side information needed for feature compensation is a set of nuisance
parameters Φ. Similar to stochastic matching [3], we can iteratively find Φ and then decode,
as shown in Figure 6:
The clean HMMs and the background noise information enable us to generate appropriate
samples from q(·) in equation (26). In our particle filter compensation (PFC)
implementation, the parameters Φ in equation (30) correspond to the correct HMM state
sequence and mixture component sequence. These sequences provide critical information
for density approximation in PFC. As shown in Figure 6, this can be done in two stages. We
first perform front-end compensation of the noisy speech; recognition is then done in the
second stage to generate the side information Φ so as to improve the compensation. This
process can be iterated, similar to what is done in maximum-likelihood stochastic matching
[3]. During compensation, the observed speech y is mapped to clean speech features x. For
this purpose, clean speech alone cannot be represented by a finite set of points, and
therefore HMMs by themselves cannot be used directly for tracking x. However, if an HMM
is available that adequately represents the speech segment under consideration for
compensation, along with an estimated state sequence s_1, s_2, ..., s_T corresponding to the
feature vectors in the segment, then we can generate the samples according to
p(x_t | x_{t-1}^i) ~ Σ_{k=1}^{K} c_{k,s_t} N(μ_{k,s_t}, Σ_{k,s_t}) (30)
A resampling step is applied at every stage, and the weight assigned to the i-th support
point of the distribution of the speech signal at time t is updated as

w_t^i ∝ p(y_t | x_t^i) (31)
The procedure for obtaining the HMMs and the state sequence will be described in detail
later. To obtain p(y | x), the distribution of the log spectra of the noise for each channel is
assumed Gaussian with mean μ_n and variance σ_n². Assuming there is additive noise only,
with no channel effects,
y = x + log(1 + e^{n-x}) (32)
We are interested in evaluating p(y | x), where x represents the clean speech and N is the
noise with density N(μ_n, σ_n²). Then

P[Y ≤ y | x] = P[x + log(1 + e^{N-x}) ≤ y | x]

p(y | x) = F′(u) = p_N(u) · e^{y-x} / (e^{y-x} - 1) (33)
where F(·) is the Gaussian cumulative distribution function with mean μ_n and variance
σ_n², and u = log(e^{y-x} - 1) + x. In the case of MFCC features, the nonlinear
transformation is [14]

y = x + D log(1 + e^{D^{-1}(n-x)}) (34)
Consequently,

p(y | x) = p_N(g^{-1}(y)) |J_{g^{-1}}(y)| (35)
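Equation (33) can be checked numerically. The sketch below implements the change-of-variables likelihood for a single log-spectral channel and verifies that it integrates to one over y; the parameter values are arbitrary:

```python
import math

# Likelihood of noisy log-spectral value y given clean value x, when
# y = x + log(1 + e^(n - x)) and the noise n is Gaussian N(mu_n, var_n).

def obs_likelihood(y, x, mu_n, var_n):
    """p(y | x) via the change of variables u = log(e^(y-x) - 1) + x."""
    if y <= x:                       # the additive-noise model implies y > x
        return 0.0
    eyx = math.exp(y - x)
    u = math.log(eyx - 1.0) + x
    gauss = (math.exp(-(u - mu_n) ** 2 / (2 * var_n))
             / math.sqrt(2 * math.pi * var_n))
    jacobian = eyx / (eyx - 1.0)     # du/dy, the factor in equation (33)
    return gauss * jacobian

# Sanity check: the density should integrate to ~1 over y > x
x, mu_n, var_n = 2.0, 1.0, 0.25
dy = 0.001
total = sum(obs_likelihood(x + i * dy, x, mu_n, var_n) * dy
            for i in range(1, 20000))
print(round(total, 2))  # ~1.0
```

This per-channel likelihood is exactly what the particle weights in the compensation scheme are multiplied by at each frame.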
x̂_t = Σ_{i=1}^{N_s} w_t^i x_t^i (36)
For each word in the utterance, we chose the N best models λ_1, λ_2, ..., λ_N from HMMs
trained using clean speech data. The models are combined together to obtain a single model
λ′ as follows,
where K is the number of Gaussian mixtures in each original HMM and N is the number of
different words W_1, W_2, ..., W_N in the N-best hypothesis. μ_{k,s} and Σ_{k,s} are the mean
and covariance of the k-th mixture in the s-th state of model λ_m. The mixture weights are
normalized by scaling them according to the likelihood of occurrence of the model from
which they come,
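The mixture-weight scaling can be sketched as follows. The model structure is heavily simplified (one state per model, hypothetical weights and posteriors), but it shows why the combined weights remain a valid mixture:

```python
# Combine the Gaussian mixtures of N-best word models into a single pool:
# each mixture weight is scaled by the posterior probability of the model
# it comes from, so the combined weights still sum to one.

def combine_mixture_weights(models, model_probs):
    """models: list of per-model mixture-weight lists c_k (each sums to 1).
    model_probs: p[W = m_l] for each model (sums to 1)."""
    combined = []
    for weights, p in zip(models, model_probs):
        combined.extend(w * p for w in weights)
    return combined

# Three 2-mixture word models with hypothetical N-best posteriors
models = [[0.5, 0.5], [0.7, 0.3], [0.4, 0.6]]
probs = [0.6, 0.3, 0.1]
c = combine_mixture_weights(models, probs)
print(len(c))     # 6 mixtures in the combined model
print(sum(c))     # sums to 1: still a valid mixture
```

Because the scaled weights remain a proper distribution, the number of particles drawn from each component is proportional both to its weight within its own model and to how likely that model is.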
The mixture weight is an important parameter because it determines the number of samples
that will be generated from the corresponding mixture. The state transition coefficients for
λ′ are computed as follows:
â_ij^(m) = Σ_{l=1}^{L} p[s_t = i, s_{t+1} = j | W_m = m_l] · p[W_m = m_l]

â_ij^(m) = Σ_{l=1}^{L} a_ij^(m_l) · p[W_m = m_l] (39)
s_t ~ a_{s_{t-1}, s_t}
ŝ_t = arg max_j (a_{s_{t-1}, j}) (40)
where a_ij comes from the state transition matrix for λ′. The mixture indices are
subsequently selected from among the mixtures corresponding to the chosen state.
3.1.3. Experiments
To investigate the properties of the proposed approach, we first assume that a decent
estimate of the state is available at each frame. Moreover, we assume that speech boundaries
are marked and therefore the silence and speech sections of the utterance are known. To
obtain this information, we use a set of digit HMMs (18 states, 3 Gaussian mixtures) trained
on clean speech represented by 23-channel mel-scale log spectral features. The speech
boundaries and state information for a particular noisy utterance are then obtained through
digit recognition performed on the corresponding clean speech utterance.
The speech boundary information is critical because the noise statistics have to be estimated
from the noisy section of the utterance. To get the HMM needed for particle filter
compensation, models λ_1, λ_2, …, λ_N are selected based on the N-best hypothesis list. For our
experiments, we set N = 3. We combine these models to get λ'_m for the m-th word in the
utterance. Best results are obtained if the correct word model is present in the pool of
models that contribute to λ'_m. Upon availability of this information, the compensation of the
noisy log spectral features is done using the sequential importance sampling. To see the
efficacy of the compensation process, we consider the noisy, clean and compensated filter
banks (channel 8) for a whole utterance, shown in Figure 7. The SNR for this particular
case is 5 dB. It is clear that the compensated feature matches well with the clean feature. It
should be noted however that such a good restoration of the clean speech signal from the
noisy signal is achievable only when a good estimate of the side information about the state
and mixture component sequences is available.
Figure 7. Filter-bank channel 8 of the noisy, underlying clean and compensated speech (SNR = 5 dB).
A Particle Filter Compensation Approach to Robust Speech Recognition 71
Assuming all such information is given (the ideal oracle case), recognition can be
performed on MFCC features (39 elements: 13 MFCCs plus their first and second time
derivatives) extracted from these compensated log spectral features. The HMMs used for
recognition are trained with noisy data that has been compensated in the same way as the
testing data. The performance compared to multi-condition (MC) and clean condition
training (Columns 5 and 6 in Table 1) is given in Column 2 of Table 1 (Adapted Model I). It
is clear that a very significant 67% digit error reduction is attained when the missing
information is made available.
In actual operational scenarios, when no side information is available, models
were chosen from the N-Best list while the states were computed using Viterbi decoding. Of
course, the states would correspond to only one model which might not be correct, and
there might be a significant mismatch between actual and computed states. Moreover the
misalignment of words also exacerbated the problem. The results for this case (Adapted
Model III as shown in Table 1 Column 4) were only marginally better than those obtained
with the multi-condition trained models. To see the effects of the improvements for the case
where the states are better aligned, we made use of whatever information we could get. The
boundaries of words were extracted from the N-Best list using exhaustive search and the
states for the words between these boundaries were assigned by splitting the digits into
equal-sized segments and assigning one state to each segment. This limited the damage
done by state misalignment, and it can be seen that a 13% digit error reduction from MC
training was observed (Adapted Model II in Table 1 Column 3).
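The equal-segment state assignment described above can be sketched as follows (frame indices and state counts are illustrative):

```python
def assign_states(start, end, n_states):
    """Assign HMM states to the frames of a word with known boundaries
    by splitting the frame range [start, end) into equal-sized segments
    and giving one state to each segment."""
    n_frames = end - start
    return [min(n_states - 1, (t * n_states) // n_frames)
            for t in range(n_frames)]

# A 6-frame word spread over 3 states
states = assign_states(0, 6, 3)   # [0, 0, 1, 1, 2, 2]
```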
    d(m, n) = ∫ g_m(x) log [g_m(x) / g_n(x)] dx + ∫ g_n(x) log [g_n(x) / g_m(x)] dx       (41)

            = Σ_i [ (σ_m²(i) + (μ_n(i) − μ_m(i))²) / σ_n²(i)
                  + (σ_n²(i) + (μ_n(i) − μ_m(i))²) / σ_m²(i) ]                            (42)
where μ(i) is the i-th element of the mean vector μ and σ²(i) is the i-th diagonal element
of the covariance matrix Σ. The parameters of the single Gaussian representing the cluster,
g_k(x) = N(x | μ_k, σ_k²), are computed as follows:
    μ_k(i) = (1/M_k) Σ_{m=1}^{M_k} E( x_m^(k)(i) ) = (1/M_k) Σ_{m=1}^{M_k} μ_m^(k)(i)      (43)

    σ_k²(i) = (1/M_k) Σ_{m=1}^{M_k} E( (x_m^(k)(i) − μ_k(i))² )
            = (1/M_k) [ Σ_{m=1}^{M_k} σ_m^(k)²(i) + Σ_{m=1}^{M_k} μ_m^(k)²(i) − M_k μ_k²(i) ]   (44)
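The divergence and the merged-cluster moments can be sketched in a few lines of Python; the code below follows the form of equations (42)-(44) and assumes equal component weights within a cluster:

```python
def sym_divergence(mu_m, var_m, mu_n, var_n):
    """Symmetric divergence between two diagonal Gaussians in the form
    of equation (42): each dimension contributes a variance ratio plus
    a squared mean difference, normalized by the other variance."""
    d = 0.0
    for i in range(len(mu_m)):
        diff2 = (mu_n[i] - mu_m[i]) ** 2
        d += (var_m[i] + diff2) / var_n[i] + (var_n[i] + diff2) / var_m[i]
    return d

def merge_cluster(mus, variances):
    """Mean and variance of the single Gaussian representing a cluster
    of M_k components (equations 43 and 44)."""
    M = len(mus)
    dim = len(mus[0])
    mu_k = [sum(mu[i] for mu in mus) / M for i in range(dim)]
    var_k = [sum(variances[m][i] + mus[m][i] ** 2 for m in range(M)) / M
             - mu_k[i] ** 2
             for i in range(dim)]
    return mu_k, var_k
```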
Alternatively, we can group the components at the state level using the following distance
measure [16]:
    d(n, m) = (1/S) Σ_{s=1}^{S} (1/P) Σ_{p=1}^{P} { log[ b_ms(μ_nsp) ] + log[ b_ns(μ_msp) ] }   (45)
where S is the total number of states in the cluster, P is the number of mixtures per state and
b(. ) is the observation probability. This method makes it easy to track the state level
composition of each cluster. In both cases, the clustering algorithm proceeds bottom-up,
merging the closest pair under the chosen distance measure until the specified number of
clusters is reached.
Once clustering is complete, it is important to pick the most suitable cluster for feature
compensation at each frame. The particle samples are then generated from the
representative density of the chosen cluster. Two methods can be explored. The first is
to decide the cluster based on the N-best transcripts obtained from recognition using
multi-condition trained models. Denote the states obtained from the N-best transcripts
for the noisy speech feature vector at time t as s_t^1, s_t^2, …, s_t^N. If state s_t^n is a member of
cluster k, we increment C(k) by one, where C(k) is a count of how many states from
the N-best list belong to cluster k. We choose the cluster based on argmax_k C(k) and
generate samples from it. If more than one cluster satisfies this criterion, we merge their
probability density functions. In the second method, we choose the cluster that
maximizes the likelihood of the MFCC vector y_t at time t belonging to that cluster, as
follows:
Clean clusters are necessary to track clean speech because we need to generate samples from
clean speech distributions. However, they are not the best choice for estimating equation
(46) because the observation is noisy and has a different distribution. The best candidate for
computing equation (46) is the multi-condition cluster set. It is constructed from multi-
condition HMMs that match more closely with noisy speech. A block diagram of the overall
compensation and recognition process is shown in Figure 9. We make inference about the
cluster to be used for the observation vector y_t using both the N-best transcripts and equation
(46) combined together. Samples at frame t are then generated using the pdf of the chosen
cluster. The weights of the samples are computed using equation (46) and compensated
features are obtained using equation (36). Once the compensated features are available for
the whole utterance, recognition is performed again using retrained HMMs with
compensated features.
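The first cluster-selection method (voting with the states from the N-best transcripts) can be sketched as follows; the state-to-cluster mapping below is a hypothetical example:

```python
from collections import Counter

def choose_clusters(nbest_states, state_to_cluster):
    """Vote-based cluster selection: each state hypothesized by the
    N-best transcripts at the current frame votes for the cluster that
    contains it. Ties return several clusters, whose densities would
    then be merged before sampling."""
    votes = Counter(state_to_cluster[s] for s in nbest_states)
    top = max(votes.values())
    return sorted(c for c, v in votes.items() if v == top)

# Hypothetical grouping of six HMM states into three clusters
mapping = {0: 0, 1: 0, 2: 1, 3: 1, 4: 2, 5: 2}
winners = choose_clusters([0, 1, 3], mapping)   # cluster 0 wins 2 votes to 1
```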
3.2.1. Experiments
To evaluate the proposed framework we experimented on the Aurora 2 connected digit
task. We extracted features (39 elements with 13 MFCCs and their first and second time
derivatives) from test speech as well as 23 channel filter-bank features thereby forming two
streams. One-best transcript was obtained from the MFCC stream using the multi-condition
trained HMMs. PFC is then applied to the filter-bank stream (stream two). We chose two
clusters, one based on 1-best and the other selected with equation (46). The multi-condition
clusters used in equation (46) were built from 23-channel filter-bank features so that the test features
from stream two can be directly used to evaluate the likelihood of the observations. For
results in these experiments, clusters were formed using method two, i.e., tracking the state-
wise composition of each cluster. The number of clusters and particles were varied to
evaluate the performance of the algorithm under different settings. From the compensated
filter-bank features of stream two, we extracted 39-element MFCC features. Final
recognition on these features was done using the retrained HMMs, i.e., HMMs trained on multi-condition
training data compensated in a similar fashion as described above.
The results for a fixed number of particles (100) are shown in Table 1. The number of
clusters was 20, 25 or 30. To set the specific number of clusters, HMM states were combined
and clustering was stopped when the specified number was reached. HMM sets for all
purposes were 18 states, with each state represented by 3 Gaussian mixtures. For the 11-
digit vocabulary, we have a total of approximately 180 states. With 20 clusters, for
example, this represents a 9-to-1 reduction in the number of information blocks to choose
from when plugging in the PF scheme.
It is interesting to note that best results were obtained for 25 clusters. Increasing the number
of clusters beyond 25 did not improve the accuracy. The larger the number of clusters, the
more specific the speech statistics contained in each cluster. Having more specific
information in each cluster is good for better compensation and recognition because the
particles can be placed more accurately. However, due to the large number of clusters to
choose from, it is difficult to pick the correct cluster for generation of particles. More errors
were made in the cluster selection process resulting in degradation in the overall
performance.
This is further illustrated in Figure 10. If the correct cluster is known, having a large
number of clusters, and consequently more specific information per cluster, will only
improve the performance. The results are for 20, 25 and 30 clusters. In the known
cluster case, one cluster is obtained using equation (46) and the second cluster is the
correct one. Correct cluster means the one that contains the state (obtained by doing
recognition on the clean version of the noisy utterance using clean HMMs) to which the
observation actually belongs. For the unknown cluster case, the clusters are obtained
using equation (46) and the 1-best transcript. It can readily be observed from the known cluster case
that if the choice of cluster is always correct, the recognition performance improves
drastically. Error rate was reduced by 54%, 59% and 61.4% for 20, 25 and 30 clusters,
respectively. Moreover, improvement faithfully follows the number of clusters used.
This was also corroborated by the fact that if the cluster is specific down to the HMM
state level, i.e., the exact HMM state sequence was assumed known and each state is a
separate cluster (total of approximately 180 clusters), the error rate was reduced by as
much as 67% [10].
For the results in Table 2, we fixed the number of clusters and varied the number of
particles. As we increased the number of particles, the accuracy of the algorithm improved
for Sets A and B combined, i.e., for additive noise. The error reduction is 17% over MC trained
models. Using a large number of particles implies more samples were utilized to construct
the predicted densities of the underlying clean speech features, which is now denser and
thus better approximated. Thus, a gradual improvement in the recognition results was
observed as the particles increased. In case of Set C, however, the performance was worse
when more particles were used. This is so because the underlying distribution is different
due to the distortions other than additive noise.
4. Conclusions
In this chapter, we proposed a particle filter compensation approach to robust speech
recognition, and showed that a tight coupling and sharing of information between HMMs
and particle filters has a strong potential to improve recognition performance in adverse
environments. We noted that the state and mixture sequences used for compensation with
particle filters must be accurately aligned with the actual HMM state sequences that
describe the underlying clean speech features. Although we have observed improved
performance with the current particle filter compensation implementation, there is still a
considerable performance gap between the oracle setup with correct side information and
what is achievable in this study with the missing side information estimated from noisy
speech. We further developed a scheme to merge statistically similar information in HMM
states, enabling us to find the right section of the HMMs in which to dynamically plug in
the particle filter algorithm. Results show that if we use information from HMMs that
matches the section of speech being compensated particularly well, significant error
reduction is possible compared to multi-condition HMMs.
Author details
Aleem Mushtaq
School of ECE, Georgia Institute of Technology, Atlanta, USA
5. References
[1] C.-H. Lee and Q. Huo, “On adaptive decision rules and decision parameter adaptation
for automatic speech recognition”, Proc. IEEE, vol. 88, pp. 1241-1269, 2000.
[2] S. Davis and P. Mermelstein, “Comparison of parametric representations for
monosyllabic word recognition in continuously spoken sentences,” IEEE Trans. Acoust.,
Speech, Signal Process., vol. 28, no. 4, pp. 357-366, 1980.
[3] A. Sankar and C.-H. Lee, “A maximum-likelihood approach to stochastic matching for
robust speech recognition,” IEEE Trans. Speech Audio Processing, vol. 4, pp.190-202,
May.1996.
[4] B. Raj, R. Singh, and R. Stern, “On tracking noise with linear dynamical system
models.” Proc. ICASSP, 2004.
[5] M. Fujimoto and S. Nakamura, “Particle Filter based non-stationary noise tracking for
robust speech recognition,” Proc. ICASSP, 2005.
[6] M. Fujimoto and S. Nakamura, “Sequential non-stationary noise tracking using particle
filtering with switching dynamical system,” Proc. ICASSP, 2006.
[7] M. S. Arulampalam, S. Maskell, N. Gordon, and T. Clapp, “A Tutorial on Particle
Filters for Online Nonlinear/Non-Gaussian Bayesian Tracking,” IEEE Trans. Signal
Proc., 2002.
[8] Robert Grover Brown and Patrick Y. C. Hwang. 1996. Introduction to Random Signals
and Applied Kalman Filtering, 3rd edition, Prentice Hall.
[9] Simon Haykin. 2009. Adaptive Filter Theory, 4th edition, Prentice Hall.
[10] A. Mushtaq, Y. Tsao and C.-H. Lee, “A Particle Filter Compensation Approach to
Robust Speech Recognition.” Proc. Interspeech, 2009.
[11] A. Mushtaq and C.-H. Lee, “An integrated approach to feature compensation
combining particle filters and Hidden Markov Model for robust speech recognition.”
Proc. ICASSP, 2012.
[12] Todd K. Moon and Wynn C. Stirling. 2007. Mathematical Methods and Algorithms for
Signal Processing, Pearson Education.
[13] A. Doucet and A. M. Johansen, “A tutorial on particle filtering and smoothing:
fifteen years later,” Tech. Rep., 2008. [Online]. Available:
http://www.cs.ubc.ca/~arnaud/doucet_johansen_tutorialPF.pdf
[14] A. Acero, L. Deng, T. Kristjansson, and J. Zhang, “HMM adaptation using vector Taylor
series for noisy speech recognition,” Proc. ICSLP, pp. 869-872, 2002.
[15] T. Watanabe, K. Shinoda, K. Takagi, and E. Yamada, “Speech recognition using tree-
structured probability density function,” in Proc. Int. Conf. Spoken Language Processing
’94, 1994, pp. 223-226.
[16] S. J. Young, J. J. Odell, and P. C. Woodland, “Tree-based state tying for high accuracy
acoustic modeling,” Proc. ARPA Human Language Technology Workshop, pp. 307–
312, 1994.
Chapter 4
Robust Distributed Speech Recognition Using Auditory Modelling
http://dx.doi.org/10.5772/49954
1. Introduction
The use of the Internet for accessing information has expanded dramatically over the past
few years, while the availability and use of mobile hand-held devices for communication
and Internet access has greatly increased in parallel. Industry has reacted to this trend for
information access by developing services and applications that can be accessed by users on
the move. These trends have highlighted a need for alternatives to the traditional methods
of user data input, such as keypad entry, which is difficult on small form-factor mobile
devices. One alternative is to make use of automatic speech recognition (ASR) systems that
act on speech input from the user. An ASR system has two main elements. The first element
is a front-end processor that extracts parameters, or features, that represent the speech
signal. These features are processed by a back-end classifier, which makes the decision as to
what has been spoken.
In a fully embedded ASR system [1], the feature extraction and the speech classification are
carried out on the mobile device. However, due to the computational complexity of high-
performance speech recognition systems, such an embedded architecture can be impractical
on mobile hand-held terminals due to limitations in processing and memory resources. On
the other hand, fully centralised (server-based) ASR systems have fewer computational
constraints, can be used to share the computational burden between mobile users, and can
also allow for the easy upgrade of speech recognition technologies and services that are
provided. However, in a centralised ASR system the recognition accuracy can be
compromised as a result of the speech signal being distorted by low bit-rate encoding at the
codec and a poor quality transmission channel [2, 3].
However, distributed speech recognition (DSR) systems generally operate in high levels of
background noise (particularly in
mobile environments). For mobile users in noisy environments (airports, cars, restaurants
etc.) the speech recognition accuracy can be reduced dramatically as a consequence of
additive background noise. A second source of error in DSR systems is the presence of
transmission errors in the form of random packet loss and packet burst loss during
transmission of speech features to the classifier. Packet loss can arise in wireless and packet
switched (IP) networks, both networks over which a DSR system would normally be
expected to operate. Packet loss, in particular packet burst loss, can have a serious impact on
recognition performance and needs to be considered in the design of a DSR system.
This chapter addresses the issue of robustness in DSR systems, with particular reference to
the problems of background noise and packet loss, which are significant bottlenecks in the
commercialisation of speech recognition products, particularly in mobile environments. The
layout of the chapter is as follows. Section 2 discusses the DSR architecture and standards in
more detail. This is followed by an overview of the auditory model used in this chapter as
an alternative front-end to those published in the DSR standards. The Aurora 2 database
and a description of the speech recognition system used are also discussed in Section 2.
Section 3 addresses the problem of robustness of speech recognition systems in the presence
of additive noise, in particular, by examining in detail the use of speech enhancement
techniques to reduce the effects of noise on the speech signal. The performance of a DSR
system in the presence of both additive background noise and packet loss is examined in
Section 4. The feature vectors produced by the auditory model are transmitted over a
channel that is subject to packet burst loss, and packet loss mitigation to compensate for
missing features is investigated. Conclusions are presented in Section 5.
packet burst loss still need to be taken into consideration. Such transmission errors can have
a significant impact on recognition accuracy.
Furthermore, it is well known that the presence of noise severely degrades the performance
of speech recognition systems, and much research has been devoted to the development of
techniques to alleviate this effect; this is particularly important in the context of DSR where
mobile clients are typically used in high-noise environments (though the same problem also
exists for local embedded, or centralised architectures in noisy conditions). One approach
that can be used to improve the robustness of ASR systems is to enhance the speech signal
itself before feature extraction. Speech enhancement can be particularly useful in cases
where a significant mismatch exists between training and testing conditions, such as where
a recognition system is trained with clean speech and then used in noisy conditions. A
significant amount of research has been carried out on speech enhancement, and a number
of approaches have been well documented in the literature [5]. There has also been much
interest in DSR in recent years, within the research community, and in international
standardisation bodies, in particular, the European Telecommunications Standards Institute
(ETSI) [6-9], which has developed a number of different recommendations for front-end
processors of different levels of complexity.
The ETSI basic front-end [6] was developed for implementation over circuit-switched
channels and this implementation is also considered in the other three standards. The
advanced front-end [7] produces superior performance to the basic front-end, and was
designed to increase robustness in background noise. The implementation of the front-ends
over packet-switched Internet Protocol (IP) networks has been specified in two documents
published by the Internet Engineering Task Force (IETF). The first of these [10] specifies the
real-time transport protocol (RTP) payload format for the basic front-end while the second
[11] specifies the RTP payload format for the advanced front-end.
The ETSI basic and advanced front-ends both implement MFCC-based parameterisation of
the speech signal. The stages involved in feature extraction based on MFCCs are shown in
Figure 1. The speech signal first undergoes pre-emphasis in order to compensate for the
unequal sensitivity of human hearing across frequency. Following pre-emphasis, a short-
term power spectrum is obtained by applying a fast Fourier transform (FFT) to a frame of
Hamming windowed speech. Critical band analysis is carried out using a bank of
overlapping, triangular shaped, bandpass filters, whose centre frequencies are equally
spaced on the mel scale. The FFT magnitude coefficients are grouped into the appropriate
critical bands and then weighted by the triangular filters. The energies in each band are
summed, creating a filter bank vector of spectral energies on the mel scale. The size of this
vector of spectral energies is equal to the number of triangular filters used. A non-linearity
in the form of a logarithm is applied to the energy vector. The final step is the application of
a discrete cosine transform (DCT) to generate the MFCCs.
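The stages above can be sketched as follows; the FFT size, filter count and triangular filter-bank construction here are illustrative simplifications, not the ETSI-standard implementations:

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mfcc_frame(frame, fs=8000, n_filters=23, n_ceps=13, pre=0.97, n_fft=256):
    """One frame of MFCC extraction: pre-emphasis, Hamming window,
    FFT power spectrum, triangular mel filter bank, log, DCT."""
    x = np.append(frame[0], frame[1:] - pre * frame[:-1])   # pre-emphasis
    x = x * np.hamming(len(x))
    power = np.abs(np.fft.rfft(x, n_fft)) ** 2              # power spectrum
    # Triangular filters with centres equally spaced on the mel scale
    mel_pts = np.linspace(0.0, hz_to_mel(fs / 2.0), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / fs).astype(int)
    fbank = np.zeros(n_filters)
    for m in range(1, n_filters + 1):
        left, centre, right = bins[m - 1], bins[m], bins[m + 1]
        for k in range(left, right):
            if k < centre and centre > left:
                w = (k - left) / (centre - left)
            elif k >= centre and right > centre:
                w = (right - k) / (right - centre)
            else:
                w = 0.0
            fbank[m - 1] += w * power[k]
    log_e = np.log(fbank + 1e-10)                           # log non-linearity
    # DCT-II of the log filter-bank energies gives the cepstral coefficients
    n = np.arange(n_filters)
    return np.array([np.sum(log_e * np.cos(np.pi * q * (2 * n + 1)
                                           / (2.0 * n_filters)))
                     for q in range(n_ceps)])
```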
In the ETSI DSR front-ends, speech, sampled at 8 kHz, is blocked into frames of 200 samples
with an overlap of 60%. A logarithmic frame energy measure is calculated for each frame
before any processing takes place. In the case of the basic front-end, pre-emphasis is carried
out using a filter coefficient equal to 0.97 while the advanced front-end uses a value of 0.9. A
Hamming window is used in both the ETSI basic and advanced front-ends prior to taking
an FFT. In the ETSI advanced front-end a power spectrum estimate is calculated before
performing the filter bank integration. This results in higher noise robustness when
compared with using a magnitude spectrum estimate as used in the ETSI basic front-end
[12]. The two front-ends both generate a feature vector consisting of 14 coefficients made up
of the frame log-energy measure (determined prior to pre-emphasis) and cepstral
coefficients C0 to C12. In order to introduce robustness against channel variations, the ETSI
advanced front-end carries out post-processing in the cepstral domain on coefficients C1 to
C12 in the form of blind equalisation [13].
Figure 1. MFCC-based feature extraction: speech input → pre-emphasis → spectral
analysis → mel-scale filter bank → log → discrete cosine transform → MFCC feature vector.
The steps involved in feature extraction in the Li et al. auditory model are shown in Figure 2.
Speech is sampled at 8 kHz and blocked into frames of 240 samples. Frame overlap is 66.7%
and a Hamming window is used prior to taking a FFT. An outer/middle ear transfer
function that models pressure gain in the outer and middle ears is applied to the spectrum
magnitude. After conversion of the spectrum to the Bark scale, the transfer function output
is processed by an auditory filter that is derived from psychophysical measurements of the
frequency response of the cochlea. A non-linearity in the form of a logarithm followed by a
DCT is applied to the filter outputs to generate the cepstral coefficients. The recognition
experiments use vectors that include energy and 12 cepstral coefficients (C1 to C12) along
with velocity and acceleration coefficients. This results in vectors with an overall dimension
equal to 39.
Robust Distributed Speech Recognition Using Auditory Modelling 83
The Aurora framework includes a set of standard test conditions for evaluation of front-end
processors. For the purpose of training the speech recogniser, two modes are defined. The
first mode is training on clean data and the second mode is multi-condition training on
noisy data. The same 8440 utterances, taken from the training part of the TIDigits, are used
for both modes. For the multi-condition training, the clean speech signals are used, as well
as speech with four different noise types (subway, babble, car and exhibition hall), added at
SNRs of 20 dB, 15 dB, 10 dB and 5 dB. There are three different test sets defined for
recognition testing, with the test utterances taken from the testing part of the TIDigits
database. Test Set A (28028 utterances) employs the same four noises as used for the multi-
condition training. Test Set B uses the same utterances as Test Set A but uses four different
noise types (restaurant, street, airport and train station). In both Test Sets A and B, the
frequency characteristic used in the filtering of the speech and noise is the same as that used
in the training sets, namely G.712. The frequency characteristic of the filter used in Test Set C
(14014 utterances) is the MIRS, and is different from that used in the training sets. Subway
and street noises are used in Test Set C.
The method used to measure the performance of a speech recognition system is dependent
on the type of utterance that is to be recognised, i.e. isolated word or continuous speech.
There are three error types associated with the recogniser in a continuous speech
recognition system:
Substitutions (S) – A word of the original sentence is recognised as a different word.
Deletions (D) – A word of the original sentence is missing from the recognised sentence.
Insertions (I) – A new word is inserted between two words of the original sentence.
The performance measure used throughout the work presented here, and also used in [16],
is the word accuracy as defined by (1):

    Word accuracy = [ (N − (S + D + I)) / N ] × 100%                               (1)
where N is the total number of evaluated words. The word accuracies for each of the Aurora
test sets presented throughout this chapter are calculated according to [16], which defines
the performance measure for a test set as the word accuracy averaged over all noises and
over all SNRs between 0 dB and 20 dB. The overall word accuracy for the two training
modes, clean training and multi-condition training, is calculated as the average over the
three test sets A, B and C.
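Equation (1) in code form; in practice S, D and I come from a dynamic-programming alignment of the recognised and reference word sequences:

```python
def word_accuracy(n_words, subs, dels, ins):
    """Word accuracy as in equation (1): reference words minus
    substitutions, deletions and insertions, as a percentage of the
    reference word count N."""
    return 100.0 * (n_words - (subs + dels + ins)) / n_words

acc = word_accuracy(1000, subs=30, dels=20, ins=10)   # 94.0
```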
3. Speech enhancement
Additive noise from interfering noise sources, and convolutional noise arising from
transmission channel characteristics both contribute to a degradation of performance in
automatic speech recognition systems. This section addresses the problem of robustness of
speech recognition systems in the first of these conditions, namely additive noise. As noted
previously, speech enhancement is one way in which the effects of noise on the speech
signal can be reduced. Enhancement of noisy speech signals is normally used to improve the
perception of the speech by human listeners; however, it may also have benefits in
enhancing robustness in ASR systems. Speech enhancement can be particularly useful in
cases where a significant mismatch exists between training and testing conditions, such as
where a recognition system is trained with clean speech and then used in noisy conditions,
as inclusion of speech enhancement can help to reduce the mismatch. This approach to
improving robustness is considered in this section.
In the speech recognition system described here, the input speech is pre-processed using an
algorithm for speech enhancement. A number of different methods for the enhancement of
speech, combined with the auditory front-end of Li et al. [14], are evaluated for the purpose
of robust connected digit recognition. The ETSI basic [6] and advanced [7] front-ends
proposed for distributed speech recognition are used as a baseline for comparison.
Two measures that can be used to perceptually evaluate speech are its quality and its
intelligibility [5]. Speech quality is a subjective measure and is dependent on the individual
preferences of listeners. It is a measure of how comfortable a listener is when listening to the
speech under evaluation. The intelligibility of the speech can be regarded as an objective
measure, and is calculated based on the number or percentage of words that can be correctly
recognised by listeners. The intelligibility and the quality of speech are not correlated [5],
and it is well known that improving one measure can have a detrimental effect on the
other. Speech enhancement algorithms give a trade-off between noise reduction and
signal distortion. A reduction in noise can lead to an improvement in the subjective quality
of the speech but a decrease in the measured speech intelligibility [5].
When using speech enhancement in an ASR system, the speech is enhanced before feature
extraction and recognition processing. The advantage of this is that there is no impact on the
computational complexity of the feature extraction or the recognition processes as the
enhancement is independent of both, and the speech enhancement can be implemented as
an add-on without significantly affecting existing parts of the system. However, every
speech enhancement process will introduce some form of signal distortion and it is
important that the impact of this distortion on the recognition process is minimised.
Ephraim and Malah [20] present a minimum mean-square error short-time spectral
amplitude (MMSE STSA) estimator. The estimator is based on modelling speech and noise
spectral components as statistically independent Gaussian random variables. The enhanced
speech is constructed using the MMSE STSA estimator combined with the original phase of
the noisy signal. Analysis is carried out in the frequency domain and the signal spectrum is
estimated using an FFT.
Westerlund et al. [22] present a speech enhancement technique in which the input signal is
first divided into a number of sub-bands. The signal in each sub-band is individually
multiplied by a gain factor in the time domain based on an estimate of the short term SNR in
each sub-band at every time instant. High SNR values indicate the presence of speech and
the sub-band signal is amplified. Low SNR values indicate the presence of noise only and
the sub-band signal remains unchanged.
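A sketch of this sub-band gain idea; the gain law, threshold and ceiling below are illustrative choices, not the values used in [22]:

```python
def subband_gains(band_powers, noise_powers, snr_thresh=2.0, max_gain=4.0):
    """Per-sub-band gains in the spirit of the Westerlund et al. scheme:
    bands whose short-term SNR estimate is high (speech present) are
    amplified, bands near the estimated noise floor are passed through
    unchanged."""
    gains = []
    for p, n in zip(band_powers, noise_powers):
        snr = p / max(n, 1e-12)
        if snr > snr_thresh:                  # speech-dominated band
            gains.append(min(max_gain, snr / snr_thresh))
        else:                                 # noise-dominated band
            gains.append(1.0)
    return gains

# Band 1 is well above the noise floor, band 2 sits at the floor
g = subband_gains([8.0, 1.0], [1.0, 1.0])   # [4.0, 1.0]
```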
Martin [23] presented an algorithm for the enhancement of noisy speech signals by means of
spectral subtraction, in particular through a method for estimation of the noise power on a
sub-band basis. Martin’s noise estimation method is based firstly on the independence of
speech and noise, and secondly on the observation that speech energy in an utterance falls
to a value close to or equal to zero for brief periods. Such periods of low speech energy
occur between words or syllables in an utterance and during speech pauses. The energy of
the signal during these periods reflects the noise power level. Martin’s minimum statistics
noise estimation method tracks the short-term power spectral density estimate of the noisy
speech signal in each frequency bin separately. The minimum power within a defined
window is used to estimate the noise floor level. The minimum tracking method requires a
bias compensation since the minimum power spectral density of the noisy signal is smaller
than the average value. In [24], Martin further developed the noise estimation algorithm by
using a time- and frequency-dependent smoothing parameter when calculating the
smoothed power spectral density. A method to calculate an appropriate time and frequency
dependent bias compensation is also described in [24] as part of the algorithm.
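A toy version of the minimum-statistics idea, heavily simplified relative to [23, 24]; the smoothing constant, window length and bias factor are illustrative, not the time- and frequency-dependent values of the full algorithm:

```python
def track_noise_floor(psd_frames, window=8, alpha=0.9, bias=1.5):
    """Minimum-statistics noise tracking sketch: recursively smooth the
    noisy power in each frequency bin, take the minimum over a sliding
    window of past smoothed values, and scale it up by a bias factor
    (the windowed minimum underestimates the mean noise power).

    psd_frames -- list of per-frame lists of bin powers
    Returns one noise estimate per frame and bin."""
    n_bins = len(psd_frames[0])
    smoothed = list(psd_frames[0])
    history = [[] for _ in range(n_bins)]
    estimates = []
    for frame in psd_frames:
        current = []
        for k in range(n_bins):
            smoothed[k] = alpha * smoothed[k] + (1.0 - alpha) * frame[k]
            history[k].append(smoothed[k])
            history[k] = history[k][-window:]        # sliding minimum window
            current.append(bias * min(history[k]))   # bias-compensated minimum
        estimates.append(current)
    return estimates

# Stationary unit-power noise: the estimate settles at bias * 1.0
est = track_noise_floor([[1.0]] * 20)
```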
Rangachari and Loizou [21] proposed an algorithm for the estimation of noise in highly non-
stationary environments. The noisy speech power spectrum is averaged using time- and
frequency-dependent smoothing factors, and this averaged value is then used to update the noise estimate.
Robust Distributed Speech Recognition Using Auditory Modelling 87
A technique for the removal of noise from degraded speech using two filtering stages was
proposed by Agarwal and Cheng [25]. The first filtering stage coarsely reduces the noise and
whitens any residual noise while the second stage attempts to remove the residual noise.
Filtering is based on the Wiener filter concept and filter optimisation is carried out in the
mel-frequency domain. The algorithm, described as a two-stage mel-warped Wiener filter
noise reduction scheme, is a major component of the ETSI advanced front-end standard for
DSR [7]. The implementation of noise reduction in the ETSI advanced front-end is
summarised in [12].
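The Wiener filter concept underlying this scheme can be illustrated per mel band as follows. This one-stage sketch with an assumed gain floor is not the two-stage ETSI algorithm itself, only the basic gain rule it builds on.

```python
import numpy as np

def wiener_gains(mel_power, mel_noise, floor=0.1):
    """Single-stage Wiener gain per mel band.  `mel_power` and
    `mel_noise` are per-frame mel filterbank energies of the noisy
    speech and of the noise estimate."""
    snr = np.maximum(mel_power - mel_noise, 0.0) / (mel_noise + 1e-12)
    h = snr / (1.0 + snr)              # Wiener gain H = SNR / (1 + SNR)
    return np.maximum(h, floor)        # gain flooring limits musical noise
```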
In the comparison of Li et al. (I) with the ETSI basic front-end, no post-processing
of the feature vectors was carried out. The recognition results using the Aurora 2 database for Li
et al. (I), for each speech enhancement algorithm, are given in Table 1 and the corresponding
results for the ETSI basic front-end (the baseline for this test) are given in Table 2.
The performance of Li et al. (II) was compared with the performance of the ETSI advanced
front-end. The ETSI advanced front-end includes an SNR-dependent waveform processing
block that is applied after noise reduction and before feature extraction. The purpose of this
block is to improve the noise robustness in the front-end of an ASR system by enhancing the
high-SNR portions and attenuating the low-SNR portions of the waveform in the time
domain, thus increasing the overall SNR of the noisy speech [26]. However, the evaluation here
is looking primarily at the effect of speech enhancement or noise reduction alone on the
connected digit recognition accuracy. Therefore, the waveform processing block in the ETSI
advanced front-end was disabled. In addition, the ETSI advanced front-end carries out post-processing
in the cepstral domain in the form of blind equalisation as described in [13]. To
ensure a closer match with the ETSI advanced front-end, the feature vectors produced by Li
et al. (II) undergo post-processing in the cepstral domain by means of cepstral mean
subtraction (CMS). The recognition results for Li et al. (II), for each speech enhancement
algorithm, are detailed in Table 3 and the recognition results for the ETSI advanced front-
end are detailed in Table 4.
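Cepstral mean subtraction itself is a simple per-utterance operation; a minimal sketch:

```python
import numpy as np

def cepstral_mean_subtraction(cepstra):
    """Remove the per-utterance mean of each cepstral coefficient,
    which suppresses stationary convolutional (channel) effects.
    `cepstra` has shape (n_frames, n_coeffs)."""
    return cepstra - cepstra.mean(axis=0, keepdims=True)
```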
Table 5 provides an overall view of the relative performance of the different speech
enhancement algorithms for each of the four front-end versions considered.
3.4. Discussion
Without speech enhancement, a comparison of Tables 1 and 2 shows that Li et al. (I)
exceeds the baseline ETSI front-end [6] by 2.08% overall. From Table 3 and Table 4, again
without speech enhancement applied, there is a difference in recognition accuracy of 0.73%
in favour of Li et al. (II) when compared with the ETSI advanced front-end [7].
90 Modern Speech Recognition Approaches with Case Studies
The other results in Tables 1 to 4 show that enhancement of the speech prior to feature
extraction significantly improves the overall recognition performance. This improvement in
recognition accuracy is observed for both the ETSI basic [6] and advanced [7] front-ends and
the front-end proposed by Li et al. [14]. A comparison of Table 1 with Table 2 shows that Li
et al. (I) outperforms the ETSI basic front-end for all of the speech enhancement techniques
evaluated. Furthermore, from Tables 3 and 4, it is seen that Li et al. (II) again outperforms
the ETSI advanced front-end for all speech enhancement methods except Westerlund et al.
[22], for which the overall recognition results are quite close.
For Li et al. (I), Li et al. (II), the ETSI basic front-end and the ETSI advanced front-end, the
best overall recognition accuracy is obtained for speech enhancement using the algorithm
proposed by Agarwal and Cheng [25]. The combination of auditory front-end and the two-
stage, mel-warped Wiener filter noise reduction scheme results in an overall recognition
accuracy approximately 6% better than that of the next-ranked front-end
and speech enhancement combination. After Agarwal and Cheng [25], the next best
performance across the board is obtained using Ephraim and Malah [20], and Westerlund et
al. [22]. This suggests that the choice of speech enhancement algorithm for best speech
recognition performance is somewhat independent of the choice of front-end (though this
would have to be validated by further testing with other front ends).
To simulate packet loss and error bursts, the 2-state Gilbert model is widely used. In [27-29],
a voice over IP (VoIP) channel is simulated using such a model. References [30, 31] simulate
IP channels and use a 2-state Gilbert model to simulate burst type packet loss on the
channel. Statistical models have also been used to simulate the physical properties of the
communication channel. The Gilbert model was found in [3] to be inadequate for simulating
a GSM channel and instead a two-fold stochastic model is used in which there are two
processes, namely shadowing and Rayleigh fading. This same model was used by [32],
again to model a GSM network. Reference [33] compares three models of packet loss and
examines their effectiveness at simulating different packet loss conditions. The models are a
2-state Markov chain, the Gilbert-Elliot model and a 3-state Markov chain. The 2-state
Markov chain in [33] uses State 1 to model a correctly received packet and State 2 to model a
lost packet. While the Gilbert-Elliot model is itself a 2-state Markov model, there is only a
probability of packet loss when in State 2. The three models in [33] are all validated for GSM
and wireless local area network (WLAN) channels. Results indicate that the 3-state Markov
model gives the best results overall and this model is used in the work described here; the
model is described in more detail later in this chapter.
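The 2-state chain with State 1 modelling a correctly received packet and State 2 a lost packet can be sketched as follows. Mapping the loss rate α and mean burst length β onto the transition probabilities is a standard derivation for a 2-state Markov chain, not a detail taken from [33].

```python
import random

def simulate_packet_loss(n_packets, alpha=0.1, beta=4.0, seed=1):
    """2-state Markov packet-loss channel (State 1 = received,
    State 2 = lost).  The mean burst length beta fixes the
    lost->received probability at 1/beta; the received->lost
    probability is then chosen so the stationary loss rate is alpha."""
    rng = random.Random(seed)
    q = 1.0 / beta                     # P(leave the loss state)
    p = alpha * q / (1.0 - alpha)      # P(enter the loss state)
    lost, state = [], 1
    for _ in range(n_packets):
        if state == 1:
            state = 2 if rng.random() < p else 1
        else:
            state = 1 if rng.random() < q else 2
        lost.append(state == 2)
    return lost
```

Because the sojourn time in State 2 is geometric, the simulated bursts have mean length β while the long-run fraction of lost packets converges to α.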
There are a number of techniques documented that are used within DSR systems for the
purpose of reducing transmission error degradation and so increasing the robustness of the
speech recognition. Error-robustness techniques are categorised in [4] under the headings
client-based error recovery, and server-based error concealment. Client-based techniques
include retransmission, interleaving and forward error correction (FEC). While
retransmission and FEC may result in recovering a large amount of transmission errors,
they have the disadvantage of requiring additional bandwidth and introducing additional
delays and computational overhead. Server-based methods include feature reconstruction,
by means of repetition or interpolation, and error correction in the ASR-decoding stage.
Reference [4] provides a survey of robustness issues related to network degradations and
presents a number of analyses and experiments with a focus on transmission error
robustness.
The work described in [34-38] is focused on burst-like packet loss and how to improve
speech recognition in the context of DSR. The importance of reducing the average burst
length of lost feature vectors rather than reducing the overall packet loss rate is central to the
work in these papers. By minimising the average burst length, the estimation of lost feature
vectors is more effective. Reference [34] compared three different interleaving mechanisms
(block, convolutional and decorrelated) and found that increasing the degree of interleaving
increases the speech recognition performance but that this comes with the cost of a higher
delay. It is further suggested in [38] that, for a DSR application, it is more beneficial to trade
delay for accuracy rather than trading bit-rate for accuracy as in forward error correction
schemes. Reference [35] combines block interleaving to reduce burst lengths on the client
side with packet loss compensation at the server side. Two compensation mechanisms are
examined: feature reconstruction by means of nearest neighbour repetition, interpolation
and maximum a-posteriori (MAP) estimation; and a decoder-based strategy using missing
feature theory. The results suggest that for packet loss compensation, the decoder-based
strategy is best. This is especially true in the presence of large bursts of losses as the
accuracy of reconstruction methods falls off rapidly as burst length increases. Interleaving,
feature estimation and decoder based strategies are combined in [36] in order to improve the
recognition performance in the presence of packet loss in DSR.
In this section, the 3-state model proposed in [33] is used to simulate packet loss and loss
bursts. To compensate for missing packets, two error-concealment methods are examined,
namely nearest neighbour repetition and interpolation. Error mitigation using interleaving
is also considered.
The transition probabilities of the model are derived from the packet loss rate α, the
average loss-burst length β, and the parameters N1 and N3:

q = 1 − 1/β (2)

Q1 = 1 − 1/N1 (3)

Q3 = 1 − 1/N3 (4)

p = [(1 − α)/(α(1 − q)) − 1/(1 − Q1)] / [1/(1 − Q3) − 1/(1 − Q1)] (5)
The authors in [33] suggest that an alternative to performing speech recognition tests using
simulated channels is to define a set of packet loss characteristics, thus enabling recognition
performance to be analysed across a range of different packet loss conditions. References
[34-37] define four channels with different characteristics in order to simulate packet loss.
These same four channels are used here to determine the effect of packet loss on speech
recognition performance. The parameter values for the four channels are detailed in Table 6.
These parameters result from work in [33] on IP and wireless networks. These are network
environments over which a DSR system would typically operate. In an IP network, packet
loss arises primarily from congestion at the routers within the network caused by high levels
of IP traffic. IP traffic is ‘bursty’ in nature, with the result that packet loss occurs in bursts.
Signal fading, where the signal strength at a
receiving device is attenuated significantly, is also a contributing factor to packet loss in a
wireless network. Long periods of fading in a wireless network can result in bursts of packet
loss. The authors in [33] measured the characteristics of an IP network and a WLAN, and the
results showed the packet loss rate (α) and the burst length (β) to be highly variable. At one
point or another, most channel conditions occurred, although not necessarily for long. Based
on the experimental measurements, a set of packet loss characteristics was defined in [33]
and these are used to analyse recognition performance for different network conditions. The
parameters in Table 6 are taken from this defined set of packet loss characteristics.
            α      β     N1     N3
Channel A   10%    4     37     1
Channel B   10%    20    181    1
Channel C   50%    4     5      1
Channel D   50%    20    21     1

Table 6. Parameter values for the four packet loss channels.
4.2.2.1. Nearest neighbour repetition
The ETSI advanced front-end [7] specifies that where missing feature vectors occur due to
transmission errors, they should be substituted with the nearest correctly received feature
vector in the receiver. If there are 2B consecutive missing feature vectors, the first B speech
vectors are substituted by a copy of the last good speech vector before the error, and the last
B speech vectors are substituted by a copy of the first good speech vector received after the
error. The speech vector includes the 12 static cepstral coefficients C1-C12, the zeroth cepstral
coefficient C0 and the log energy term, and all are replaced together. A disadvantage of this
method is that if B is large then long stationary periods can arise.
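The substitution rule can be sketched as follows, using None entries to mark missing vectors. The handling of odd-length gaps and of gaps at the stream edges is an assumption; the ETSI specification addresses the even-length case described above.

```python
def nn_repetition(vectors):
    """Nearest-neighbour repetition: in a run of 2B missing feature
    vectors, the first B are replaced by the last good vector before
    the gap and the last B by the first good vector after it."""
    out = list(vectors)
    n = len(out)
    i = 0
    while i < n:
        if out[i] is not None:
            i += 1
            continue
        j = i
        while j < n and out[j] is None:
            j += 1                     # [i, j) is a loss burst
        before = out[i - 1] if i > 0 else None
        after = out[j] if j < n else None
        mid = i + (j - i + 1) // 2     # first half (rounded up) gets `before`
        for k in range(i, j):
            if before is not None and (k < mid or after is None):
                out[k] = before
            else:
                out[k] = after
        i = j
    return out
```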
4.2.2.2. Interpolation
The disadvantage of stationary periods that arise with nearest neighbour repetition can be
alleviated somewhat by polynomial interpolation between the correctly received feature
vectors either side of a loss burst. Reference [34] found that non-linear interpolation using
cubic Hermite polynomials gives the best estimates for missing feature vectors. Equation (6)
is used to calculate the nth missing feature vector in a loss burst of length β packets, which is
equivalent to a loss burst length of 2β feature vectors if each packet contains two feature
vectors as defined by the ETSI advanced front-end [7]. The parameter n in (6) is the missing
feature vector index.
x̂b+n = a0 + a1(n/(2β + 1)) + a2(n/(2β + 1))² + a3(n/(2β + 1))³,  1 ≤ n ≤ 2β (6)

The coefficients a0, a1, a2 and a3 in (6) are determined from the two correctly received
feature vectors either side of the loss burst, xb and xb+2β+1, and their first derivatives, ẋb
and ẋb+2β+1. Equation (6) can be rewritten as

x̂b+n = xb(1 − 3t² + 2t³) + xb+2β+1(3t² − 2t³) + ẋb(t − 2t² + t³) + ẋb+2β+1(t³ − t²),  1 ≤ n ≤ 2β (7)

where t = n/(2β + 1). It was found in [34] that performance was better when the derivative
components in (7) are set to zero. These components are also set to zero for the work
presented in this chapter.
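With the derivative terms set to zero, the reconstruction reduces to a weighted sum of the two good vectors either side of the burst; a sketch:

```python
import numpy as np

def hermite_reconstruct(x_before, x_after, n_missing):
    """Reconstruct a burst of n_missing lost feature vectors by cubic
    Hermite interpolation between the last good vector before the burst
    and the first good one after it, with derivative terms set to zero
    as in [34].  For two vectors per packet, n_missing = 2*beta."""
    est = []
    for n in range(1, n_missing + 1):
        t = n / (n_missing + 1)
        h0 = 1 - 3 * t**2 + 2 * t**3      # weight for x_before
        h1 = 3 * t**2 - 2 * t**3          # weight for x_after
        est.append(h0 * np.asarray(x_before) + h1 * np.asarray(x_after))
    return est
```

The weights sum to one for every t, so the estimates move smoothly from the vector before the burst to the vector after it instead of holding stationary values.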
4.2.2.3. Interleaving
Research has shown that by minimising the average burst length of lost vectors the
estimation of lost feature vectors is more effective [34]. The aim of interleaving is to break a
long loss burst into smaller loss bursts by distributing them over time and so making it
appear that the errors are more randomly distributed.
In a DSR system, the interleaver on the client side takes a feature vector sequence Xi, where i
is the order index, and changes the order in which the vectors are transmitted over the
channel. The result is to generate a new vector sequence Yi that is related to Xi by
Yi = Xπ(i) (8)
where π(i) is the permutation function. On the server side, the operation is reversed by de-
interleaving the received vector sequence as follows:
Xi = Yπ⁻¹(i) (9)

where π(π⁻¹(i)) = i.
In order for the interleaver to carry out the reordering of the feature vectors, it is necessary
to buffer the vectors, which introduces a delay. On the server side, in order to carry out the
de-interleaving, buffering of the incoming feature vectors takes place and a second delay is
introduced. The sum of these two delays is known as the latency of the interleaving/de-
interleaving process.
For the work in this chapter, block interleaving is implemented. A block interleaver of
degree d changes the order of transmission of a d×d block of input vectors. An example of a
block interleaver of degree d = 4 and spread S = 4 is given in Figure 4.
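A minimal block interleaver sketch: each group of d×d vectors is written into a block row by row and transmitted column by column, so vectors that were adjacent end up d positions apart. The restriction to sequences whose length is a multiple of d² is a simplification.

```python
def block_interleave(vectors, d=4):
    """Block interleaver of degree d.  The permutation is its own
    inverse (a transpose), so applying the same function to the
    received sequence de-interleaves it."""
    assert len(vectors) % (d * d) == 0
    out = []
    for base in range(0, len(vectors), d * d):
        for c in range(d):             # read columns ...
            for r in range(d):         # ... of the row-written block
                out.append(vectors[base + r * d + c])
    return out
```

Since each side must buffer a full d×d block before it can reorder, the interleaving/de-interleaving latency grows with d², which is the delay-for-accuracy trade-off discussed above.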
4.3. Results
This section presents recognition results in the presence of both environmental noise
and packet loss. As a baseline for comparison, results are also presented for the ETSI
advanced front-end [7]. In all cases, training was carried out using clean data. The speech
enhancement algorithm of Agarwal and Cheng [25] was applied to both the (clean) training
speech and the (noisy) test speech. The ETSI advanced front-end includes an SNR-
dependent waveform processing block that is applied after noise reduction and before
feature extraction. The waveform processing block in the ETSI advanced front-end is also
implemented in the front-end of Li et al. in order to ensure a closer match between the two
front-ends. Feature vectors are extracted from the output of this waveform processing block.
A detailed description of the waveform processing block can be found in [26]. The ETSI
advanced front-end carries out post-processing in the cepstral domain in the form of blind
equalization as described by [13]. The feature vectors produced by Li et al. undergo post-
processing in the cepstral domain by means of cepstral mean subtraction (CMS). As defined
by the ETSI advanced front-end [7], each packet transmitted over the communication
channel carries two feature vectors.
The baseline recognition results for the two front-ends, without vector quantisation and
with no packet loss but with noise, are detailed in Table 7. The word accuracies in the
following tables are calculated as described in Section 2.4.
To allow for close comparison between the ETSI advanced front end and the front-end
proposed by Li et al., the VQ codebooks for Li et al. should be determined in the same
manner as the VQ codebooks for the ETSI advanced front-end. However, this was not
possible as the detail of how the ETSI advanced front-end VQ codebooks were calculated is
not publicly available at this time. Therefore, an implementation of the Generalized Lloyd
Algorithm (GLA), described by [39], was used to design the VQ codebooks for both the ETSI
advanced front-end and the front-end of Li et al. The recognition results for the two front-
ends using the VQ codebooks generated by the GLA implementation are detailed in Table 9.
The overall word accuracies in Table 9, with vector quantization, compare well with the
baseline accuracies, without vector quantization, in Table 7. There is also close correlation
between the recognition results in Table 8 and Table 9 for the ETSI advanced front-end,
indicating that the VQ codebooks generated by the GLA implementation used for this work
are a good substitute for the VQ codebooks provided by ETSI with the advanced front-end.
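A sketch of the GLA iteration is given below. Random-sample initialisation is an assumption made here for brevity; [39] initialises by a splitting procedure.

```python
import numpy as np

def generalized_lloyd(data, codebook_size=8, iters=20, seed=0):
    """Alternate between assigning each training vector to its nearest
    codeword and moving each codeword to the centroid of the vectors
    assigned to it.  `data` has shape (n_vectors, dim)."""
    rng = np.random.default_rng(seed)
    codebook = data[rng.choice(len(data), codebook_size, replace=False)]
    for _ in range(iters):
        # nearest-codeword assignment (squared Euclidean distortion)
        d2 = ((data[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # centroid update (keep the old codeword if a cell is empty)
        for k in range(codebook_size):
            members = data[labels == k]
            if len(members):
                codebook[k] = members.mean(axis=0)
    return codebook, labels
```

Each iteration can only reduce the average distortion, so the codebook converges to a local optimum of the training data.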
Packet loss (where each packet contains two feature vectors) is introduced on the
communication channel by using the packet loss model described in Section 4.2.1. The four
different packet loss channels investigated are defined in Table 6. Recognition tests in the
presence of packet loss were carried out for each of the following conditions: no error
mitigation, nearest neighbour repetition, Hermite interpolation, and interleaving combined
with Hermite interpolation.
Tests were first carried out for packet loss with no steps taken to recover the missing features
or to minimise the loss burst length. The test results for both front-ends when no speech
enhancement is used are given in Table 10, while recognition results with speech enhancement
are given in Table 11. A comparison of Table 10 with Table 11 illustrates the benefit of using
speech enhancement in improving recognition performance. Comparing Table 11 with Table 9
(no packet loss) it is seen that packet loss has a significant impact on the recognition results, in
particular for channels C and D where the probability of packet loss is 50%.
Two methods, nearest neighbour repetition and Hermite interpolation, are used to
reconstruct the feature vector stream as a result of missing features due to packet loss. Table
12 details the recognition results obtained when using nearest neighbour repetition while
Table 13 details the results obtained when Hermite interpolation is implemented (speech
enhancement is used in both cases). Both reconstruction methods show improvements in
recognition accuracy over no error mitigation for all four channels. In particular, with feature
reconstruction, channel C shows improvements in recognition accuracy greater than 55% for
both front-ends. Channel D also shows good improvement. Nearest neighbour repetition
gives a slightly higher performance compared to Hermite interpolation.
When interleaving is introduced, the receive side perceives that the average loss burst
length is reduced [37]. Table 14 shows the recognition results obtained when interleaving,
with an interleaving depth of 4, is used in conjunction with Hermite interpolation.
Comparing the results in Table 14 with the results in Table 13 it is seen that feature
reconstruction is improved when interleaving is employed.
4.4. Discussion
The results in Table 7 show that the front-end proposed by Li et al. [14], when combined with
the speech enhancement algorithm proposed by [25], reduces the overall word error rate of the
ETSI advanced front-end [7] by 8%. Looking at Table 9, the vector quantisation has a lesser
impact on the overall recognition performance of the ETSI advanced front-end compared with
the impact of vector quantisation on the Li et al. front-end. The Li et al. front-end, combined
with speech enhancement, still outperforms the ETSI advanced front-end in the presence of
vector quantisation although the improvement in overall word error rate is reduced from 8%
(without vector quantisation) to 5.3%. In the presence of packet loss, with no speech
enhancement and no packet loss compensation, the results in Table 10 show that the
front-end of Li et al. gives better overall recognition results than the ETSI advanced front-end.
The benefit of speech enhancement in the presence of packet loss, without any missing feature
reconstruction, can be seen by comparing Table 10 and Table 11. With speech enhancement
and no packet loss compensation, Table 11 shows that Li et al. outperforms the ETSI advanced
front-end for all four channels, and for all three test sets. Comparing Table 9 with Table 11, a
significant reduction in recognition performance is observed in the presence of packet loss, in
particular for channels C and D where the probability of packet loss is 50%. When nearest
neighbour repetition is used to reconstruct missing features, Table 12 shows that there is a
significant increase in recognition performance across all channels when compared to the
results presented in Table 11. Looking at Table 12, the recognition results for the two front-
ends under evaluation are similar across all channels and test sets. The front-end of Li et al.
performs marginally better overall than the ETSI advanced front-end for channels A, B and C;
however, for channel D, the overall recognition performance of the ETSI advanced front-end is
better than that of Li et al. A comparison of Table 13 with Table 12 shows a slight decrease in
recognition performance when Hermite interpolation is used to reconstruct the feature vector
stream instead of nearest neighbour repetition. With Hermite interpolation, Table 13 shows
that the front-end of Li et al. outperforms the ETSI advanced front-end for the packet loss
conditions of channels A and B; however, for channel C the reverse is the case. The overall
performance of both front-ends is the same for channel D. Interleaving the feature vectors
prior to transmission on the channel gives the perception on the receive side that the loss
bursts are shorter than they actually are. The advantage of interleaving can be seen by a
comparison of Table 14 with Table 13, where overall recognition results are improved for both
front-ends when interleaving is introduced. Looking at Table 14 it is seen that Li et al. gives the
better overall recognition performance for channels A and B while the ETSI advanced front-
end gives the better performance for channels C and D. The results indicate that, in the
presence of packet loss and environmental noise, the overall recognition performance of the
front-end of Li et al. is better than that of the ETSI advanced front-end for all channel
conditions when there are no packet loss mitigation techniques implemented. For each of the
error mitigation techniques used, Li et al. outperforms the ETSI advanced front-end for channel
conditions when the probability of packet loss is 10%. For packet loss probabilities of 50%, Li et
al. gives better results than the ETSI advanced front-end for short average burst lengths (4
packets) when nearest neighbour repetition is used. However, the ETSI advanced front-end
gives better recognition performance than Li et al. for the same channel conditions when
Hermite interpolation is used, with and without interleaving. When the average burst length is
increased to 20 packets and the probability of packet loss is 50%, the overall recognition
performance of the ETSI advanced front-end is better than that of the front-end of Li et al.
5. Conclusions
This chapter has examined the speech recognition performance of both a speech enhancement
algorithm combined with the auditory model front-end proposed by Li et al. [14], and the ETSI
advanced front-end [7], in the presence of both environmental noise and packet loss. A
number of speech enhancement techniques were first examined, including well-established
techniques such as Ephraim and Malah [20] and more recently-proposed techniques such as
Rangachari and Loizou [21]. Experiments using the Aurora connected-digit recognition
framework [16] found that the best performance was obtained using the method of Agarwal
and Cheng [25]. The test results also suggest that the choice of speech enhancement algorithm
for best speech recognition performance is largely independent of the choice of front-end.
Packet loss modelling using statistical modelling was also examined, and packet loss
mitigation was discussed. Following initial testing with no packet loss compensation, a
number of existing packet loss mitigation techniques were investigated, namely nearest
neighbour repetition and interpolation. Results show that the best recognition performance
was obtained using nearest neighbour repetition to reconstruct missing features. The
advantage of interleaving at the sender’s side to minimise the average burst length of lost
vectors was also demonstrated.
In summary, the experiments and results outlined in this chapter show the benefit of
combining speech enhancement and packet loss mitigation to combat both noise and packet
loss. Furthermore, the performance of the auditory model of Li et al. was generally shown to
be superior to that of the standard ETSI advanced front-end.
Author details
Ronan Flynn
School of Engineering, Athlone Institute of Technology, Athlone, Ireland
Edward Jones
College of Engineering and Informatics, National University of Ireland, Galway, Ireland
6. References
[1] V. Digalakis, L. Neumeyer and M. Perakakis, “Quantization of cepstral parameters for
speech communication over the world wide web”, IEEE Journal on Selected Areas in
Communications, vol. 17, pp. 82-90, Jan. 1999.
[2] D. Pearce, “Enabling new speech driven services for mobile devices: an overview of the
ETSI standards activities for distributed speech recognition front-ends”, in Proc. AVIOS
2000: The Speech Applications Conference, San Jose, CA, USA, May 2000.
[3] G. Gallardo-Antolín, C. Peláez-Moreno and F. Díaz-de-María, “Recognizing GSM digital
speech”, IEEE Trans. on Speech and Audio Processing, vol. 13, pp. 1186-1205, Nov. 2005.
[4] Z.H. Tan, P. Dalsgaard and B. Lindberg, “Automatic speech recognition over error-
prone wireless networks”, Speech Communication, vol. 47, pp. 220-242, 2005.
[5] Y. Ephraim and I. Cohen, “Recent advancements in speech enhancement”, in The
Electrical Engineering Handbook, 3rd ed., R.C. Dorf, Ed., Boca Raton, FL: CRC Press, 2006,
pp. 15-12 to 15-26.
[6] “Speech Processing, Transmission and Quality Aspects (STQ); Distributed speech
recognition; Front-end feature extraction algorithm; Compression algorithms”, in ETSI
ES 201 108, Ver. 1.1.3, Sept. 2003.
[7] “Speech Processing, Transmission and Quality Aspects (STQ); Distributed speech
recognition; Advanced front-end feature extraction algorithm; Compression
algorithms”, in ETSI ES 202 050, Ver. 1.1.5, Jan. 2007.
[8] “Speech Processing, Transmission and Quality Aspects (STQ); Distributed speech
recognition; Extended front-end feature extraction algorithm; Compression algorithms;
Back-end speech reconstruction algorithm”, in ETSI ES 202 211, Ver. 1.1.1, Nov. 2003.
[9] “Speech Processing, Transmission and Quality Aspects (STQ); Distributed speech
recognition; Extended advanced front-end feature extraction algorithm; Compression
algorithms; Back-end speech reconstruction algorithm”, in ETSI ES 202 212, Ver. 1.1.2,
Nov. 2005.
[10] “RTP Payload Format for European Telecommunications Standards Institute (ETSI)
European Standard ES 201 108 Distributed Speech Recognition Encoding”, in RFC 3557,
July 2003.
[11] “RTP Payload Formats for European Telecommunications Standards Institute (ETSI)
European Standard ES 202 050, ES 202 211, and ES 202 212 Distributed Speech Recognition
Encoding”, in RFC 4060, May 2005.
[12] D. Macho, L. Mauuary, B. Noe, Y. M. Cheng, D. Ealey, D. Jouvet, H. Kelleher, D. Perace,
and F. Saadoun, “Evaluation of a noise-robust DSR front-end on Aurora databases”, in
Proceedings of International Conference on Speech and Language Processing, Denver,
Colorado, USA, Sept. 2002, pp. 17-20.
[13] L. Mauuary, “Blind equalization in the cepstral domain for robust telephone based
speech recognition”, in Proc. EUSIPCO 98, Sept. 1998, pp. 359-363.
[14] Q. Li, F. K. Soong and O. Siohan, “A high-performance auditory feature for robust
speech recognition”, in Proc. of 6th International Conference on Spoken Language Processing
(ICSLP), Beijing, China, Oct. 2000, pp. 51-54.
[15] R. Flynn and E. Jones, “A comparative study of auditory-based front-ends for robust
speech recognition using the Aurora 2 database”, in Proc. IET Irish Signals and Systems
Conference, Dublin, Ireland, 28-30 June 2006, pp. 111-116.
[16] H. G. Hirsch and D. Pearce, “The Aurora experimental framework for the performance
evaluation of speech recognition systems under noisy conditions”, in Proc. ISCA ITRW
ASR-2000, Paris, France, Sept. 2000, pp. 181-188.
[17] H. Hermansky, “Perceptual linear prediction (PLP) analysis of speech”, J. Acoust. Soc.
Amer., vol. 87, pp. 1738-1752, 1990.
[18] J. Tchorz and B. Kollmeier, “A model of auditory perception as front end for automatic
speech recognition”, J. Acoust. Soc. Amer., vol. 106, pp. 2040-2050, 1999.
[19] HTK speech recognition toolkit. Available: http://htk.eng.cam.ac.uk/. Accessed March 2011.
[20] Y. Ephraim and D. Malah, “Speech enhancement using a minimum mean-square error
short-time spectral amplitude estimator”, IEEE Trans. on Acoustics, Speech and Signal
Processing, vol. 32, pp. 1109-1121, Dec. 1984.
[21] S. Rangachari and P. Loizou, “A noise-estimation algorithm for highly non-stationary
environments”, Speech Communication, vol. 48, pp. 220-231, 2006.
[22] N. Westerlund, M. Dahl and I. Claesson, “Speech enhancement for personal
communication using an adaptive gain equalizer”, Signal Processing, vol. 85, pp.
1089-1101, 2005.
[23] R. Martin, “Spectral subtraction based on minimum statistics”, in Proc. Eur. Signal
Processing Conference, 1994, pp. 1182-1185.
[24] R. Martin, “Noise power spectral density estimation based on optimal smoothing and
minimum statistics”, IEEE Trans. on Speech and Audio Processing, vol. 9, pp. 504-512, July 2001.
[25] A. Agarwal and Y.M. Cheng, “Two-stage mel-warped wiener filter for robust speech
recognition”, in Proceedings of Automatic Speech Recognition and Understanding Workshop,
Keystone, Colorado, USA, 1999, pp. 67-70.
[26] D. Macho and Y. M. Cheng, “SNR-dependent waveform processing for improving the
robustness of ASR front-end”, in Proc. of IEEE International Conference on Acoustics,
Speech and Signal Processing (ICASSP), May 2001, pp. 305-308.
[27] C. Peláez-Moreno, G. Gallardo-Antolín and F. Díaz-de-María, “Recognizing voice over
IP: A robust front-end for speech recognition on the world wide web”, IEEE Trans. on
Multimedia, vol. 3, pp. 209-218, June 2001.
[28] C. Peláez-Moreno, G. Gallardo-Antolín, D.F. Gómez-Cajas and F. Díaz-de-María, “A
comparison of front-ends for bitstream-based ASR over IP”, Signal Processing, vol. 86,
pp. 1502-1508, July 2006.
[29] J. Van Sciver, J. Z. Ma, F. Vanpoucke and H. Van Hamme, “Investigation of speech
recognition over IP channels”, in Proc. of IEEE International Conference on Acoustics,
Speech and Signal Processing (ICASSP), May 2002, pp. 3812-3815.
[30] D. Quercia, L. Docio-Fernandez, C. Garcia-Mateo, L. Farinetti and J. C. DeMartin,
“Performance analysis of distributed speech recognition over IP networks on the
AURORA database”, in Proc. of IEEE International Conference on Acoustics, Speech and
Signal Processing (ICASSP), May 2002, pp. 3820-3823.
[31] P. Mayorga, L. Besacier, R. Lamy and J. F. Serignat, “Audio packet loss over IP and
speech recognition”, in Proc. of IEEE Workshop on Automatic Speech Recognition and
Understanding (ASRU), Nov. 2003, pp. 607-612.
[32] J. Vicente-Peña, G. Gallardo-Antolín, C. Peláez-Moreno and F. Díaz-de-María, “Band-
pass filtering of the time sequences of spectral parameters for robust wireless speech
recognition”, Speech Communication, vol. 48, pp. 1379-1398, Oct. 2006.
[33] B.P. Milner and A.B. James, “An analysis of packet loss models for distributed speech
recognition”, in Proc. of 8th International Conference on Spoken Language Processing (ICSLP),
Jeju Island, Korea, Oct. 2004, pp. 1549-1552.
[34] A. B. James and B. P. Milner, “An analysis of interleavers for robust speech recognition
in burst-like packet loss”, in Proc. of IEEE International Conference on Acoustics, Speech and
Signal Processing (ICASSP), May 2004, pp. 853-856.
[35] A. B. James and B. P. Milner, “Towards improving the robustness of distributed speech recognition in packet loss”, in Proc. Second COST278 and ISCA Tutorial and Research Workshop (ITRW) on Robustness Issues in Conversational Interaction, University of East Anglia, U.K., Aug. 2004, paper 42.
[36] A. B. James and B. P. Milner, “Combining packet loss compensation methods for robust
distributed speech recognition”, in Proc. of Interspeech-2005, Lisbon, Portugal, Sept. 2005,
pp. 2857-2860.
[37] A. B. James and B. P. Milner, “Towards improving the robustness of distributed speech
recognition in packet loss”, Speech Communication, vol. 48, pp. 1402-1421, Nov. 2006.
[38] B. P. Milner and A. B. James, “Robust speech recognition over mobile and IP networks in burst-like packet loss”, IEEE Trans. on Audio, Speech and Language Processing, vol. 14, pp. 223-231, Jan. 2006.
[39] Y. Linde, A. Buzo and R. Gray, “An algorithm for vector quantizer design”, IEEE Trans.
on Communications, vol. COM-28, pp. 84-95, Jan. 1980.
Chapter 5

Improvement Techniques for Automatic Speech Recognition

http://dx.doi.org/10.5772/49969
1. Introduction
Research on spoken language technology has led to the development of Automatic Speech
Recognition (ASR), Text-To-Speech (TTS) synthesis, and dialogue systems. These systems
are now used for different applications such as in mobile telephones for voice dialing,
GPS navigation, information retrieval, dictation, translation, and assistance for handicapped
people.
To perform Automatic Speech Recognition (ASR) there are different techniques such as
Artificial Neural Networks (ANNs), Support Vector Machines (SVMs), Weighted Finite State
Transducers (WFSTs), and Hidden Markov Models (HMMs). For HMMs, algorithms such as Viterbi and Baum-Welch (Forward-Backward) adjust the model parameters and perform the decoding/recognition process itself. However, these algorithms do not guarantee optimal parameters, as the recognition process is stochastic.
The main challenges arise when the structures used to describe the stochastic process (e.g., a three-state left-to-right HMM) are not enough to model the acoustic features of the speech, and when the training data available to build robust ASR systems is sparse. In practice, both situations occur, which decreases ASR accuracy. Thus, research has focused on developing techniques to overcome these situations and thereby improve ASR performance. In the fields of heuristic optimization, data analysis, and finite state automata, diverse techniques have been proposed for this purpose. In this chapter, the theoretical bases and application details of these techniques are presented and discussed.
Initially, the optimization approach is reviewed in Section 2, where the application of heuristic methods such as Genetic Algorithms and Tabu Search for structure and parameter optimization of HMMs is presented. This continues in Section 3, where the application of WFSTs and discrete HMMs for statistical error modelling of an ASR system’s response is presented. This approach is proposed as a corrective method to improve ASR performance. Also, the use of a data analysis algorithm, Non-negative Matrix Factorization (NMF), is presented as a means to improve the information obtained from sparse data. Then, case studies where these techniques are applied are presented.
Figure 1. Elements of an HMM, with feature vectors as the observations of its states.
• The evaluation problem. Given the parameters of a model (λ), estimate the probability of
a particular observation sequence ( Pr (O |λ)).
• The learning problem. Given a sequence of observations ot from a training set,
estimate/adjust the transition (A) and the emission (B) probabilities of an HMM to
describe the data more accurately.
• The decoding problem. Given the parameters of the model, find the most likely sequence
of hidden states Q∗ = {q1 , q2 , ..., qn } that could have generated a given output sequence
O = {o1 , o2 , ..., ot }.
Standard algorithms such as Viterbi (for decoding) and Baum-Welch (learning) are widely
used for these problems [1]. However, heuristics such as Tabu Search (TS) and Genetic
Algorithms (GA) have been proposed to further improve the performance of HMMs and the
parameters estimated by Viterbi/Baum-Welch.
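As a concrete illustration of the decoding problem, the Viterbi algorithm can be sketched for a discrete-observation HMM; the two-state model below and its probabilities are hypothetical, chosen only so the example is self-contained.

```python
import numpy as np

def viterbi(A, B, pi, obs):
    """Most likely hidden-state sequence for a discrete-observation HMM.

    A:   (N, N) transition probabilities, A[i, j] = Pr(q_{t+1} = j | q_t = i)
    B:   (N, M) emission probabilities,   B[i, k] = Pr(o_t = k | q_t = i)
    pi:  (N,)   initial state distribution
    obs: sequence of observation symbol indices
    """
    N, T = A.shape[0], len(obs)
    with np.errstate(divide="ignore"):          # allow log(0) = -inf
        logA, logB, logpi = np.log(A), np.log(B), np.log(pi)
    delta = np.zeros((T, N))                    # best log-probability so far
    psi = np.zeros((T, N), dtype=int)           # back-pointers
    delta[0] = logpi + logB[:, obs[0]]
    for t in range(1, T):
        scores = delta[t - 1][:, None] + logA   # (from-state, to-state)
        psi[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) + logB[:, obs[t]]
    path = [int(delta[T - 1].argmax())]         # backtrack from the best end state
    for t in range(T - 1, 0, -1):
        path.append(int(psi[t][path[-1]]))
    return path[::-1]

# Hypothetical two-state model, for illustration only
A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.9, 0.1], [0.2, 0.8]])
pi = np.array([0.6, 0.4])
print(viterbi(A, B, pi, [0, 0, 1, 1]))  # [0, 0, 1, 1]
```

Working in the log domain avoids numerical underflow for long observation sequences.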
A GA is a search heuristic that mimics the process of natural evolution and generates useful solutions to optimization problems [2]. In Figure 2 the general diagram of a GA used for HMM optimization is shown, where the solutions for an optimization problem receive values based on their quality or “fitness”, which determines their opportunities for reproduction. It is expected that parent solutions of very good quality will produce offspring (by means of reproduction operators such as crossover or mutation) with similar or better characteristics, improving their fitness after some generations. Hence, fitness evaluation is a mechanism used to determine the confidence level of the optimized solutions to the problem [9]. In Figure 3 a general “chromosome” representation of an individual of the GA population, or parent solution, is shown. This array contains the elements of an HMM (presented in Figure 1), which can be coded into binary format to perform reproduction of solutions.
Figure 2. General diagram of a GA for HMM optimization: an initial population of X individuals is generated, and at each iteration the fitness of the offspring is evaluated.
Figure 3. Chromosome array of the elements of an HMM, where N defines the number of states, and M
the number of mixture components of the observation probabilities’ distributions.
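The loop of Figure 2 can be sketched in a few lines of code. The sketch below is illustrative only: it uses truncation selection for brevity, and the fitness function is a hypothetical stand-in (closeness of a candidate probability vector to a target distribution) rather than the HMM log-likelihood used in the works discussed here.

```python
import random

def evolve(fitness, dim, pop_size=10, generations=40, seed=1):
    """Minimal GA: keep the best half, refill with uniform crossover + mutation."""
    rng = random.Random(seed)
    pop = [[rng.uniform(0.0, 1.0) for _ in range(dim)] for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        parents = pop[: pop_size // 2]              # elitist truncation selection
        children = []
        while len(children) < pop_size - len(parents):
            p1, p2 = rng.sample(parents, 2)
            child = [a if rng.random() < 0.5 else b for a, b in zip(p1, p2)]
            i = rng.randrange(dim)                  # mutate one random gene
            child[i] = min(1.0, max(0.0, child[i] + rng.gauss(0.0, 0.1)))
            children.append(child)
        pop = parents + children
    return max(pop, key=fitness)

# Hypothetical fitness: closeness of a candidate vector to a target distribution
target = [0.7, 0.2, 0.1]
fitness = lambda x: -sum((a - b) ** 2 for a, b in zip(x, target))
best = evolve(fitness, dim=3)
```

Replacing the stand-in fitness with the log-likelihood of an HMM over training data, and the chromosome with the array of Figure 3, yields the scheme described in the works cited next.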
In [3], GA optimization was performed for the observation probabilities and transition states
for word HMMs, while in [8] finding an optimal structure was the objective. In [8], the
fitness of individuals (HMM structures) was measured considering (1) the log-likelihood sum
of the HMM calculated by the Viterbi algorithm over a training set of speech data, and (2)
a penalization based on the error rate obtained with the HMM when performing Viterbi
over a sub-set of test data. [9] used a GA to optimize the number of states in the HMM
and its parameters for web information extraction, obtaining important increases in precision
rates when compared with Baum-Welch training. Fitness was measured on the likelihood
( Pr (O |λ)) over a training set.
On the other hand, TS is a metaheuristic that can guide a heuristic algorithm from a local search to a global space to find solutions beyond local optimality [11]. It can avoid loops in the search process by restricting certain “moves” that would make the algorithm revisit a previously explored space of solutions. These moves are kept hidden or reserved (kept “Tabu”) in a temporary memory (a Tabu List), which can be updated with new moves or released according to different criteria. While in a GA the diversification of the search process is performed by the reproduction operators, in TS this is performed by “moves”, which consist of perturbations (changes) in the parameter values of an initial solution (e.g., observation or transition probabilities). These changes can be defined by a function, or by adding or subtracting small randomly generated quantities to the initial solution’s values.
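A minimal sketch of such a Tabu Search, with single-parameter perturbation moves and a Tabu List that forbids immediately undoing a move, could look as follows; the objective function is a hypothetical stand-in for the HMM likelihood.

```python
import random

def tabu_search(score, x0, steps=200, tabu_size=2, seed=3):
    """Minimal TS: perturb one parameter at a time; recent inverse moves are tabu."""
    rng = random.Random(seed)
    x, best = list(x0), list(x0)
    tabu = []                                        # Tabu List of forbidden moves
    for _ in range(steps):
        candidates = []
        for i in range(len(x)):
            for sign in (-1, 1):
                if (i, sign) in tabu:                # restricted ("Tabu") move
                    continue
                y = list(x)
                y[i] += sign * rng.uniform(0.01, 0.1)  # small random perturbation
                candidates.append(((i, sign), y))
        move, x = max(candidates, key=lambda c: score(c[1]))
        tabu.append((move[0], -move[1]))             # forbid undoing this move
        if len(tabu) > tabu_size:
            tabu.pop(0)                              # release the oldest move
        if score(x) > score(best):
            best = list(x)
    return best

# Hypothetical objective: a likelihood proxy peaking at (0.5, 0.3)
obj = lambda v: -((v[0] - 0.5) ** 2 + (v[1] - 0.3) ** 2)
best = tabu_search(obj, [0.0, 0.0])
```

Because the best candidate is accepted even when it worsens the score, the search can leave local optima while the Tabu List prevents it from cycling back.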
This approach was explored by [27] for HMM optimization. As in other studies, the log probability indicating the HMM likelihood over a training set was used to measure the fitness of a solution. The “moves” consisted of adding randomly generated values to each of the HMM’s parameters. In the next section, improvement techniques based on statistical error modelling of phoneme confusions are presented and discussed.
• Non-negative Matrix Factorization, NMF (Section 3.1.1). Application case in Section 4.2
(with Metamodels).
• Metamodels (Section 3.2). Application case in Section 4.1 (with GA), 4.2 (with NMF), and
4.3 (with WFSTs).
• Weighted Finite State Transducers, WFSTs (Section 3.3). Application case in Section 4.3
(with Metamodels).
A benefit of the confusion matrix is that it is easy to see if the system is confusing two classes (e.g., commonly mislabelling or classifying one as another).
Figure 4. Phoneme confusion-matrix of a normal speaker, with the stimulus (uttered) phonemes on one axis and the response (recognized) phonemes on the other, over the 45-phoneme inventory (vowels, consonants, and silence).
As P̃∗ is the system’s output, it might contain several errors. Based on the classification
performed by the aligner, these are identified as substitution (S), insertion (I), and deletion
(D) errors. Thus, the performance of ASR systems is measured based on these errors, and two
metrics are widely used for phoneme and word ASR performance:
Word_Accuracy (WAcc) = (N − D − S − I) / N,   Word_Error_Rate (WER) = 1 − WAcc   (1)
where N is the number of elements (words or phonemes) in the reference string (P). Thus, the
objective of the statistical modelling of the phoneme confusion-matrix is to estimate W from
P̃∗ . This can be accomplished by the following expression [22]:
W∗ = max_P ∏_{j=1}^{M} Pr(p_j) Pr(p̃∗_j | p_j)   (2)

¹ In this case, each class is a phoneme in the British-English BEEP pronunciation dictionary [13], which considers 45 phonemes (vowels and consonants).
where p j is the j’th phoneme in the postulated phoneme sequence P, and p̃∗j the j’th phoneme
in the decoded sequence P̃∗ (of length M). Equation 2 indicates that the most likely word
sequence is the sequence that is most likely given the observed phoneme sequence from a
speaker. The term Pr ( p̃∗j | p j ) represents the probability that the phoneme p̃∗j is recognized
when p j is uttered, and is obtained from a speaker’s confusion-matrix. This element is
integrated into the recognition process as presented in Figure 5.
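The S, D, and I counts used in Eq. 1 come from a string alignment between the reference and the decoded sequence. A minimal dynamic-programming sketch (the phoneme strings below are hypothetical, chosen only for illustration):

```python
def align_errors(ref, hyp):
    """Levenshtein alignment giving substitution (S), deletion (D),
    and insertion (I) counts between a reference and a hypothesis."""
    n, m = len(ref), len(hyp)
    # dp[i][j] = (cost, S, D, I) for aligning ref[:i] with hyp[:j]
    dp = [[None] * (m + 1) for _ in range(n + 1)]
    dp[0][0] = (0, 0, 0, 0)
    for i in range(1, n + 1):
        c = dp[i - 1][0]
        dp[i][0] = (c[0] + 1, c[1], c[2] + 1, c[3])      # leading deletions
    for j in range(1, m + 1):
        c = dp[0][j - 1]
        dp[0][j] = (c[0] + 1, c[1], c[2], c[3] + 1)      # leading insertions
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = int(ref[i - 1] != hyp[j - 1])
            a, b, c = dp[i - 1][j - 1], dp[i - 1][j], dp[i][j - 1]
            dp[i][j] = min(
                (a[0] + sub, a[1] + sub, a[2], a[3]),    # match / substitution
                (b[0] + 1, b[1], b[2] + 1, b[3]),        # deletion
                (c[0] + 1, c[1], c[2], c[3] + 1),        # insertion
            )
    _, S, D, I = dp[n][m]
    return S, D, I

# Hypothetical reference and decoded phoneme sequences
ref = "sh ih n iy".split()
hyp = "sh uh n".split()
S, D, I = align_errors(ref, hyp)
wacc = (len(ref) - D - S - I) / len(ref)
```

This is only an illustration of the alignment step; tools such as HTK provide their own scoring of S, D, and I.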
Figure 5. Integration of the confusion-matrix into the recognition process: a baseline ASR decodes training speech to estimate the confusion-matrix, which is then modelled by the post-processing techniques; for test speech, the baseline ASR’s output is corrected by these techniques, together with a word-language model, to produce the estimate W∗.
This information then can be modelled by post-processing techniques to improve the baseline
ASR’s output. Evaluation is performed when P̃∗ (which now is obtained from test speech)
is decoded by using the “trained” techniques into sequences of words W ∗ . The correction
process is done at the phonetic level, and by incorporating a word-language model a more accurate estimate of W is obtained. In Sections 3.2 and 3.3, the foundations of two post-processing techniques are explained.
3.2. Metamodels
In practice, it is too restrictive to use only the confusion-matrix to model Pr ( p̃∗j | p j ) as this
cannot model insertions well. Instead, a Hidden Markov model (HMM) can be constructed
for each of the phonemes in the phoneme inventory. These HMMs, termed as metamodels
[22, 24], can be best understood by comparison with a “standard” acoustic HMM: a standard
acoustic HMM estimates Pr (O | p j ), where O is a subsequence of the complete sequence
of observed acoustic vectors in the utterance, O, and p j is a postulated phoneme in P.
A metamodel estimates Pr ( P̃ | p j ), where P̃ is a subsequence of the complete sequence of
observed (decoded) phonemes in the utterance P̃.
The architecture of the metamodel of a phoneme is shown in Figure 7 [22, 24]. Each state of
a metamodel has a discrete probability distribution over the symbols for the set of phonemes,
plus an additional symbol labelled DELETION. The central state (2) of a metamodel for a
certain phoneme models correct decodings, substitutions and deletions of this phoneme made
by the phoneme recognizer. States 1 and 3 model (possibly multiple) insertions before and
after the phoneme. If the metamodel were used as a generator, the output phone sequence
produced could consist of, for example:
Figure 6. Iterative NMF estimation: a count matrix is factorized into non-negative matrices P and Q; if the estimates do not converge, the number of dimensions r is increased and NMF is repeated; otherwise the previous NMF estimates are saved and renormalized.
• a single phone which has the same label as the metamodel (a correct decoding) or a
different label (a substitution);
• a single phone labelled DELETION (a deletion);
• two or more phones (one or more insertions).
Figure 7. Architecture of the metamodel of a phoneme, with an insertion-before state (1), a central state (2), an insertion-after state (3), and non-emitting initial and final states.
In contrast, in a model without insertion states no insertions can be made. When used as a generator, such a model can produce only one possible phoneme sequence: a single phoneme which has the same label as the metamodel.
The discrete probability distributions of each metamodel can be refined by using embedded
re-estimation with the Baum-Welch Algorithm [10] over the { P, P̃∗ } pairs of all the
utterances. When performing speech recognition, the language model is used to compile
a “meta-recognizer” network, which is identical to the network used in a standard word
recognizer except that the nodes of the network are the appropriate metamodels rather than
the acoustic models used by the word recognizer. As shown in Figure 5, the output test
phoneme sequence P̃∗ is passed to the meta-recognizer to produce a set of word hypotheses.
Figure 8. Extended metamodel of a phoneme. The states Ak and Bj represent the k-th insertion-after, and
the j-th insertion-before, of a phoneme.
for j = 0 to J do
    a_B_{J+1,j} = (C(B_j) − C(B_{j+1})) / C(B_0) = ΔB_j / C(B_0)
end for
The above algorithm also gives the transition probability from the initial state to the central state when j = 0 (a_B_{J+1,0}). When a transition ends in an insertion state (B_j), the next transition must be to the preceding insertion state (B_{j−1}) because there is a dependency on their occurrences, so:

for j = 0 to J-1 do
    a_B_{j+1,j} = 1
end for
The modelling of the insertions “after” a phoneme is performed in a slightly different way, as the transition sequences change. As shown in Figure 9(b), the “insertion-context” is taken with reference to the central state (C), where state A_1 models the first-order insertions, state A_2 the second-order insertions, and so on until state A_K, where K is the length of the context. From left-to-right, state A_0 is the central state, and A_{K+1} is the final or end state (as state 2 and state 4 in Figure 7). C(A_0) = C(B_0) = C(C), and C(A_{K+1}) = 0, as the final state is non-emitting. When starting on the central state there must be a sequence for the next transitions, thus either going to the final state A_{K+1} or to the first-order insertion-after state A_1. If the above transition ended in A_1, then the next transition could be to the second-order insertion-after state A_2, or to the final state A_{K+1}, and so on. ΔA_k represents the number of elements or observations that move from an insertion-after state A_k to the final state A_{K+1}, and is expressed as:

ΔA_k = C(A_k) − C(A_{k+1})
The remaining observations, C(A_k) − ΔA_k, make the transition to state A_{k+1}, as the elements in A_{k+1} are dependent on the same number of elements in A_k. The transition probabilities a_A for each insertion-after state are computed as:
Figure 9. (a) Modelling of insertions “before” a phoneme: insertion-before states B_1 … B_J between the initial state and the central state, with transition probabilities a_B computed from the counts C(B_j) and ΔB_j. (b) Modelling of insertions “after” a phoneme: insertion-after states A_1 … A_K between the central state and the final state, with transition probabilities a_A computed from the counts C(A_k) and ΔA_k.
for k = 0 to K do
    a_A_{k,K+1} = (C(A_k) − C(A_{k+1})) / C(A_k) = ΔA_k / C(A_k)
end for

for k = 0 to K do
    a_A_{k,k+1} = C(A_{k+1}) / C(A_k)
end for
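These count-based formulas can be checked numerically. The sketch below computes the insertion-after transition probabilities from hypothetical counts C(A_k); note that the two outgoing probabilities of each state sum to one by construction.

```python
def insertion_after_transitions(counts):
    """Transition probabilities for the insertion-after states A_0..A_K.

    counts[k] = C(A_k), the number of observations reaching state A_k
    (C(A_0) is the central-state count; the final state is non-emitting).
    Returns (to_final, to_next): a_A_{k,K+1} and a_A_{k,k+1} for k = 0..K.
    """
    K = len(counts) - 1
    C = list(counts) + [0]                      # C(A_{K+1}) = 0 (final state)
    to_final = [(C[k] - C[k + 1]) / C[k] for k in range(K + 1)]  # ΔA_k / C(A_k)
    to_next = [C[k + 1] / C[k] for k in range(K + 1)]
    return to_final, to_next

# Hypothetical counts: 100 central-state occurrences, 20 first-order and
# 5 second-order insertions after the phoneme
to_final, to_next = insertion_after_transitions([100, 20, 5])
```

For these counts, a_A_{0,K+1} = 80/100 and a_A_{0,1} = 20/100, and the last insertion state always transitions to the final state.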
1. C, the confusion matrix transducer, which models the probabilities of phoneme insertions,
deletions and substitutions.
2. D, the dictionary transducer, which maps sequences of decoded phonemes from P̃∗ ◦ C
into words in the dictionary.
3. G, the language model transducer, which defines valid sequences of words from D.
Thus, the process of estimating the most probable sequence of words W ∗ given P̃∗ can be
expressed as:
W ∗ = τ ∗ ( P̃∗ ◦ C ◦ D ◦ G ) (6)
where τ ∗ denotes the operation of finding the most likely path through a transducer and
◦ denotes composition of transducers [25]. Details of each transducer are presented in the
following section.
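On a toy scale, the effect of the cascade in Equation 6 can be emulated without a WFST library by scoring each dictionary pronunciation against the decoded phoneme string with confusion-weighted edit costs (negative log probabilities), then picking the cheapest word. All probabilities and the two-word lexicon below are hypothetical, chosen only for illustration.

```python
import math

def weighted_edit_cost(hyp, ref, conf, ins_del=2.0):
    """Cost of mapping decoded phonemes `hyp` onto pronunciation `ref`,
    using -log confusion probabilities for substitutions and a flat
    insertion/deletion penalty (a stand-in for the C transducer weights)."""
    n, m = len(hyp), len(ref)
    dp = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dp[i][0] = i * ins_del
    for j in range(1, m + 1):
        dp[0][j] = j * ins_del
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # Pr(decoded | uttered), with a small floor for unseen confusions
            sub = -math.log(conf.get((ref[j - 1], hyp[i - 1]), 1e-3))
            dp[i][j] = min(dp[i - 1][j - 1] + sub,
                           dp[i - 1][j] + ins_del,    # extra decoded phoneme
                           dp[i][j - 1] + ins_del)    # missing phoneme
    return dp[n][m]

# Hypothetical confusion probabilities Pr(decoded | uttered)
conf = {("sh", "sh"): 0.8, ("ih", "ih"): 0.7, ("n", "n"): 0.6,
        ("ng", "n"): 0.3, ("uw", "ih"): 0.2}
lexicon = {"SHIN": ["sh", "ih", "n"], "SHOE": ["sh", "uw"]}
decoded = ["sh", "ih", "n"]
best = min(lexicon, key=lambda w: weighted_edit_cost(decoded, lexicon[w], conf))
```

A real implementation composes the C, D, and G transducers and finds the shortest path, as done with the FSM Library in the experiments below; this sketch only mirrors that scoring at word level.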
Consider Table 1, where the top row of phonemes represents the transcription of a word
sequence, and the bottom row the output from the speech recognizer. It can be seen that the
phoneme sequence /b aa/ is deleted after /ax/, and this can be represented in the transducer
as a multiple substitution/insertion: /ax/→/ax b aa/. Similarly the insertion of /ng dh/ after /ih/
is modelled as /ih ng dh/→/ih/. The probabilities of these multiple substitutions / insertions /
deletions are estimated by counting. In cases where a multiple insertion or deletion is made
of the form A→/B C/, the appropriate fraction of the unigram probability mass Pr(A→B)
is subtracted and given to the probability Pr(A→/B C/), and the same process is used for
insertions or deletions of higher order.
A fragment of the confusion-matrix transducer that represents the alignment of Table 1 is
presented in Figure 10. For computational convenience, the weight for each confusion in
the transducer is represented as −log Pr(p̃∗_j | p_j). In practice, an initial set of transducers is built directly from the speaker’s “unigram” confusion matrix, which is estimated using each transcription/output alignment pair available from that speaker; extra transducers that represent multiple substitutions/insertions/deletions are then added. The complete set of transducers is then determinized and minimized, as described in [25]. The result of these operations is a single transducer for the speaker, as shown in Figure 10.
Figure 10. Example of the Confusion-Matrix Transducer C for the alignment of Table 1.
Figure 11. Dictionary transducer D: individual entries, and minimized network of entries.
Figure 12. Language model transducer G, with word probabilities such as P(w|&lt;s&gt;) and P(w|w), and back-off arcs weighted by back-off factors B.
4. Case studies
4.1. Structure optimization of metamodels with GA
In this section, the use of a Genetic Algorithm (GA) to optimize the structure of a set of metamodels and further improve ASR performance is presented [29]. The experiments were performed with speakers from the British-English Wall Street Journal (WSJ) speech database [14], and the metamodels were built with the libraries of the HTK Toolkit [10] from the University of Cambridge.
Figure 13. Chromosome vector for metamodel optimization: the lengths of the insertion-before (J) and insertion-after (K) contexts for each phoneme, the grammar scale factor (s), and the phoneme confusion-matrix.
Because an insertion context exists for each phoneme, there are 2 × 47 = 94 insertion-context lengths in total. This information is arranged in a single vector of dimension 1 × 94, which is saved in genes 1-94 of the chromosome vector (see Figure 13). Gene 95 holds the grammar scale factor (s), which controls the influence of the language model over the recognition process [10]. In genes 96 to 2304, the elements of the confusion-matrix are placed. Hence, each chromosome vector represents all the parameters of the metamodels for all phonemes in the speech set.
The integer values for each set of genes are: K, J = 0 to 3; s = 0 to 10; phoneme confusion-matrix = 0 to 100 (the number of occurrences of each aligned pair {p̃∗_j, p_j} before being normalized as probabilities Pr(p̃∗_j | p_j)).
The general structure of the GA is the one presented in Figure 2. For this, the initial population
consists of 10 individuals, where the first element is the initial extended metamodel and
the remaining elements are randomly generated within the range of values specified above.
Fitness of each individual was measured on the word recognition accuracy (WAcc, Eq. 1)
achieved with the resulting metamodels on a training set.
4.1.2. Operators
• Selection: The selection method (i.e., how to choose the eligible parents for reproduction) was based on the Roulette Wheel and was implemented as follows:
  1. For each of the 10 best individuals in the population, compute its fitness value.
  2. Compute the selection probability for each individual x_i as p_i = f_i / Σ_{k=1}^{N} f_k, where N is the size of the population (sub-set of 10 individuals) and f_i the fitness value of the individual x_i.
  3. Compute the accumulated probability q_i for each individual as q_i = Σ_{j=1}^{i} p_j.
  4. Generate a uniform random number r ∈ [0, 1].
  5. If r < q_1, select the first individual (x_1); otherwise, select x_i such that q_{i−1} < r ≤ q_i.
  6. Repeat Steps 4 and 5 N times (until all N individuals are selected).
• Crossover: Uniform crossover was used for reproduction of parents chosen by the Roulette Wheel method. A template vector of dimension 1 × 2304 was used for this, where each of its elements received a random binary value (0, 1). Offspring 1 is produced by copying the corresponding genes from Parent 1 where the template vector has a value of 0, and copying the genes from Parent 2 where the template vector has a value of 1. Offspring 2 is obtained by the inverse procedure. Ten offspring are obtained by crossover from the 10 individuals in the initial population, which increases the size of the population to 20.
• Mutation: The mutation scheme consisted of randomly changing values in the best 5 individuals as follows:
• change 5 genes in the sub-vector that includes the length of the insertion contexts (genes
1 to 94);
• change the scale grammar factor (gene 95);
• change 94 genes in the sub-vector that represents the phoneme confusion-matrix (genes
96 to 2304).
In this way, a population of 25 individuals is obtained for the GA.
• Stop condition: The GA is repeated until convergence through generations is achieved. A limit of 15 iterations was set for the GA, as only minimal variations were observed after that point.
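The Roulette Wheel selection and uniform crossover operators described above can be sketched as follows; the toy chromosomes and random fitness values are hypothetical stand-ins for the metamodel parameters and the WAcc-based fitness.

```python
import random

rng = random.Random(7)

def roulette_select(pop, fitnesses):
    """Roulette Wheel selection: accumulated probabilities q_i,
    one random spin per selected parent."""
    total = sum(fitnesses)
    q, acc = [], 0.0
    for f in fitnesses:
        acc += f / total                  # accumulated probability q_i
        q.append(acc)
    selected = []
    for _ in range(len(pop)):
        r = rng.random()
        for i, qi in enumerate(q):        # first i with r <= q_i
            if r <= qi:
                break
        selected.append(pop[i])
    return selected

def uniform_crossover(p1, p2):
    """Random binary template: offspring 1 takes genes from parent 1 where
    the template is 0 and from parent 2 where it is 1; offspring 2 inverts this."""
    template = [rng.randint(0, 1) for _ in p1]
    o1 = [a if t == 0 else b for a, b, t in zip(p1, p2, template)]
    o2 = [b if t == 0 else a for a, b, t in zip(p1, p2, template)]
    return o1, o2

pop = [[rng.randint(0, 3) for _ in range(6)] for _ in range(10)]  # toy chromosomes
parents = roulette_select(pop, [rng.uniform(0.5, 1.0) for _ in pop])
o1, o2 = uniform_crossover(parents[0], parents[1])
```

Each gene of the two offspring comes from one of the two parents, so the gene pool is preserved while new combinations are explored.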
(three-state left-to-right HMMs, with eight Gaussian components per state). The front-end
used 12 MFCCs plus energy, delta, and acceleration coefficients.
The experiments were done with speech data from 10 speakers of the development set
si_dt of the same database. From each speaker, 74 sentences were selected, discarding
adaptation sentences that were the same for all speakers. From this set, 30 were selected
for testing-only purposes and 44 for confusion-matrix estimation (training of metamodels)
and fitness evaluation for the GA. The metamodels were trained with three different sets of
sentences: 5, 10, and 14 sentences, and fitness evaluation and metamodels re-estimation were
performed with 15, 30, and 44 sentences respectively. This formed the training schemes 5-15,
10-30, and 14-44 (e.g., if 5 sentences were used for metamodel training, fitness evaluation
of the GA and metamodel re-estimation were performed with 15 sentences). Word bigram
language models were estimated from the transcriptions of the si_dt speakers, and were the
same for all the systems (baseline and metamodels).
Figure 14(a) shows the fitness convergence of the GA when the scheme 14-44 was used. The baseline performance is around 78%, while the initial extended metamodel achieved a performance of 87.3%. The mean fitness of the populations starts at 85% and, as the individuals evolve, increases to up to 87.4%. The best individual from each generation achieves a fitness of up to 88.20%. For the metamodels trained with the schemes 5-15 and 10-30, the pattern of convergence was very similar.
Figure 14. (a) Fitness convergence of the GA across iterations (best individual and mean fitness of the population, compared with the baseline system and the initial extended metamodel). (b) Word recognition accuracy of the baseline system, the extended metamodels, and the GA-optimized extended metamodels for each training scheme.
Figure 14(b) shows the mean performance of the optimized metamodels on the testing set
when all training schemes were used. On average, the optimized metamodels showed an
increase of around 0.50% when compared with the initial extended metamodels. These gains
were statistically significant, as shown in Table 2. The matched pairs test described by [32] was used to test for statistically significant differences.
When the extended metamodels were optimized with a GA, a significant gain was achieved
with schemes 5-15 and 14-44. The use of the GA also reduced the size of the metamodels by
10% in all cases, thus eliminating unnecessary states and transitions. Finally, with a vocabulary of 3,000 words and 740 sentences (300 of them for testing), these results can be considered significant with reasonable confidence.
Table 2. Significance test for the initial and the optimized extended metamodels.
Table 3. Mean number of MLLR transformations across all test speakers using different sets of
adaptation data.
A word-bigram language model, estimated from the data of the si_tr speakers, was used to obtain P̃∗ and estimate CM_x with the unadapted baseline system. In order to keep these sequences independent from those of the test-set speakers, the word-bigram language model used to decode P̃∗ for CM_y and CM_{U_y} was estimated from the data of the selected
test-speakers of the si_dt set. In all cases, a grammar scale factor s of 10 was used. The
metamodels were tested using all the utterances available from the speakers Sy .
Figure 15 shows the performance of the metamodels using MLLR adapted acoustic models.
The metamodels trained with partial estimates improved over the adapted baseline. When
the NMF estimates were used, the accuracy of the metamodels improved when the training
data was small. As presented in Table 4, these improvements were statistically significant
when 5, 10, and 15 utterances were used for training/estimation.
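The NMF smoothing referred to above can be sketched with the standard multiplicative updates of Lee and Seung; the 3 × 3 count matrix below is hypothetical, and the low-rank product P·Q spreads probability mass into unseen (zero-count) confusions.

```python
import numpy as np

def nmf(V, r, iters=200, seed=0):
    """Multiplicative-update NMF: V is approximated by P @ Q with
    non-negative factors of inner dimension r."""
    rng = np.random.default_rng(seed)
    n, m = V.shape
    P = rng.random((n, r)) + 0.1          # positive initialization
    Q = rng.random((r, m)) + 0.1
    eps = 1e-9                            # avoid division by zero
    for _ in range(iters):
        Q *= (P.T @ V) / (P.T @ P @ Q + eps)
        P *= (V @ Q.T) / (P @ Q @ Q.T + eps)
    return P, Q

# Hypothetical sparse confusion-count matrix (rows: uttered, cols: decoded)
V = np.array([[8.0, 1.0, 0.0],
              [1.0, 6.0, 1.0],
              [0.0, 2.0, 7.0]])
P, Q = nmf(V, r=2)
approx = P @ Q    # smoothed estimate: zero counts receive small non-zero mass
```

After renormalizing each row of the smoothed matrix, the result can be used as the probability estimates Pr(p̃∗_j | p_j) of a metamodel; this sketch only illustrates the factorization step.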
Figure 15. Mean word recognition accuracy of the metamodels and the adapted baseline across all
test-speakers.
Table 4. Significance test for the metamodels trained with partial and NMF estimates.
4.3. Metamodels and WFSTs to improve ASR performance for disordered speech
Dysarthria is a motor speech disorder characterized by weakness, paralysis, or poor
coordination of the muscles responsible for speech. Although ASR systems have been
developed for disordered speech, factors such as low intelligibility and limited phonemic
repertoire decrease speech recognition accuracy, making conventional speaker adaptation
algorithms perform poorly on dysarthric speakers.
In the work presented by [22], rather than adapting the acoustic models, the errors made
by the speaker are modelled to attempt to correct them. For this, the metamodels and
WFSTs techniques are applied to correct the errors made at the phonetic level and make use
of a language model to find the best estimate of the correct word sequence. Experiments
performed with dysarthric speech of different levels of severity showed that both techniques
outperformed standard adaptation techniques.
• A small set of phonemes (in this case the phonemes /ua/, /uw/, /m/, /n/, /ng/, /r/,
and /sil/) dominates the speaker’s output speech.
• Some vowel sounds and the consonants /g/, /zh/, and /y/, are never recognized
correctly. This suggests that there are some phonemes that the speaker apparently cannot
enunciate at all, and for which he or she substitutes a different phoneme, often one of the
dominant phonemes mentioned above.
Figure 16. Phoneme confusion-matrix of a dysarthric speaker, with the stimulus (uttered) phonemes on one axis and the response (recognized) phonemes on the other.
These observations differ from the pattern of confusions seen in a normal speaker, as shown in Figure 4. The normal speaker’s confusion-matrix shows a clearer pattern of correct recognitions, and few confusions of vowels with consonants.
Most speaker adaptation algorithms are based on the principle that it is possible to apply a
set of transformations to the parameters of a set of acoustic models of an “average” voice to
move them closer to the voice of an individual (e.g., MLLR). Whilst this has been shown to be
successful for normal speakers, it may be less successful in cases where the phoneme uttered
is not the one that was intended but is substituted by a different phoneme or phonemes, as
often happens in dysarthric speech. In this situation, it is suggested that a more effective
approach is to combine a model of the substitutions likely to have been made by the speaker
with a language model to infer what was said. So rather than attempting to adapt the system,
we model the insertion, deletion, and substitution errors made by a speaker and attempt to
correct them.
Figure 17. Comparison of recognition performance: Human assessment (FDA), unadapted (BASE) and
adapted (MLLR_16) SI models.
adapted models. This gain in accuracy increases as the training/adaptation data is increased,
obtaining an improvement of almost 3% when all 34 sentences are used.
The matched pairs test described by [32] was used to test for significant differences between
the recognition accuracy using metamodels and the accuracy obtained with MLLR adaptation
when a certain number of sentences were available for metamodel training. The results with
the associated p-values are presented in Table 5.
In all cases, the metamodels improve on MLLR adaptation, with p-values below 0.05 and in several cases below 0.01. Note that the metamodels trained with only four sentences (META_04) decrease the number of word errors from 1174 (MLLR_04) to 1139.
Figure 18. Mean word recognition accuracy of the MLLR adapted baseline, the metamodels, and the
WFSTs across all dysarthric speakers.
The WFSTs were tested using the same conditions and speech data as the metamodels. The
FSM Library [25] from AT&T was used for the experiments with WFSTs. Figure 18 shows
clearly the gain in performance given by the WFSTs over both MLLR and the metamodels.
This gain is statistically significant at the 0.1 level for all cases except when 22 and 34 sentences
were used for training.
Table 5. Comparison of statistical significance of results over all dysarthric speakers using metamodels.
5. Conclusions
In this chapter, the details and applications of techniques to improve ASR performance were presented. These consisted of heuristic and statistical post-processing techniques that made use of a speaker's phoneme confusion matrix for error modelling, achieving correction at the phonetic level of an ASR system's output for both normal and disordered speech.
The first post-processing technique, termed metamodels, incorporated the information in the speaker's confusion matrix into the recognition process. The metamodels expanded the confusion-matrix modelling by incorporating information about the pattern of insertions associated with each phoneme. Deletions were modelled as a phoneme being substituted (or confused) by the "DELETION" symbol.
A metamodel’s architecture, for which there was an extended version, was suitable for
further optimization, and in Section 4.1 the application of a GA for this purpose was
presented. The improvements in word recognition accuracy obtained on normal speech after
optimization were statistically significant when compared with the previous performance of
the metamodels. This corroborated the findings of other works [8, 9], where GA was applied
to improve recognition performance by evolving the internal parameters of an HMM (state
transitions, observation probabilities).
Section 4.2 presented another application case, in which Non-negative Matrix Factorization (NMF) provided more accurate estimates of a speaker's phoneme confusion matrix. When incorporated into a set of metamodels, these estimates significantly improved the accuracy of an ASR system. This performance was also higher than that of the metamodels trained with the original partial data (without NMF estimates).
However, when tested with speakers with disordered speech, where there is a significant increase in substitution, deletion, and insertion errors, two important issues were identified:
• The metamodels were unable to model the specific phoneme sequences that were output in response to individual phoneme inputs. They were capable of outputting sequences, but the symbols (phonemes) in these sequences were conditionally independent, so specific sequences could not be modelled. This also made it impossible to model the deletion of sequences of phonemes, although the original and extended architectures ensure modelling of multiple insertions and single substitutions/deletions.
• Adding the "DELETION" symbol led to an increase in the size of the dictionary, because it could potentially substitute for each phoneme in the network of metamodels during the recognition process. Not adding this symbol led to the problem of a "Tee" model in the HTK package [10]. In such a case, a deletion is represented as a direct transition from the initial to the final state, thus allowing the phoneme to be "skipped". The decoding algorithm failed because it was possible to traverse the complete network of metamodels without absorbing a single input symbol.
Hence, for the modelling of deletions, a specific way to define a "missing" or "empty" observation was needed. As an alternative to solve these issues, the second post-processing technique, a network of Weighted Finite State Transducers (WFSTs), was presented. Section 4.3 presented a case of improving ASR performance for speakers with the speech disorder of dysarthria, in which both techniques, metamodels and WFSTs, were used. The main advantage of the WFSTs is the definition and use of the epsilon (ε) symbol, which in finite-state automata represents an "empty" observation. This allowed the modelling of multiple deletions, substitutions, and insertions. A general improvement in ASR for dysarthric speech was obtained with this technique when compared to the metamodels and the adapted baseline system.
In conclusion, the techniques presented in this chapter offer wide possibilities of application in the general field of speech recognition. Their robustness has been corroborated in case studies with normal and disordered speech, where higher performance was consistently achieved over unadapted and adapted baseline ASR systems. These techniques are also flexible enough to be optimized and integrated with other processes, as in [30] and [5].
Author details
Santiago Omar Caballero Morales
Technological University of the Mixteca, Mexico
6. References
[1] Jurafsky, D. & Martin, J.H. (2009). Speech and Language Processing, Pearson: Prentice Hall,
USA.
[2] Goldberg, D.E. (1989). Genetic Algorithms in Search, Optimization and Machine Learning,
Addison-Wesley Publishing Co., USA.
[3] Chan, C.W.; Kwong, S.; Man, K.F. & Tang, K.S. (2001). Optimization of HMM Topology
and its Model Parameters by Genetic Algorithms. Pattern Recognition, Vol. 34, pp.
509-522
[4] Hong, Q. Y. & Kwong, S. (2005). A Genetic Classification Method for Speaker
Recognition. Engineering Applications of Artificial Intelligence, Vol. 18, pp. 13-19
[5] Matsumasa, H.; Takiguchi, T.; Ariki, Y.; LI, I-C. & Nakabayash, T. (2009). Integration of
Metamodel and Acoustic Model for Dysarthric Speech Recognition. Journal of Multimedia
- JMM, Vol. 4, No. 4, pp. 254-261
[6] Lee, D.D. & Seung, H.S. (1999). Learning the parts of objects by non-negative matrix
factorization. Nature, Vol. 401, pp. 788-791
[7] Lee, D.D. & Seung, H.S. (2001). Algorithms for non-negative matrix factorization.
Advances in Neural Information Processing Systems, Vol. 13, pp. 556-562
[8] Takara, T.; Iha, Y. & Nagayama, I. (1998). Selection of the Optimal Structure of the
Continuous HMM using the Genetic Algorithm, Proc. of the International Conference on
Spoken Language Processing (ICSLP 98), Sydney, Australia.
[9] Xiao, J.; Zou, L. & Li, C. (2007). Optimization of Hidden Markov Model by a
Genetic Algorithm for Web Information Extraction, Proc. of the International Conference
on Intelligent Systems and Knowledge Engineering (ISKE 2007), Chengdu, China, ISBN:
978-90-78677-04-8
[10] Young, S. & Woodland, P. (2006). The HTK Book (for HTK Version 3.4), Cambridge
University Engineering Department, United Kingdom.
[11] Reeves, C.R. (1993). Modern Heuristic Techniques for Combinational Problems, John Wiley &
Sons Inc., USA.
[12] Bodenstab, N. & Fanty, M. (2007). Multi-pass pronunciation adaptation, Proc. of the
IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP) 2007,
Honolulu, Hawaii, Vol. 4, pp. 865–868, ISBN: 1-4244-0728-1
[13] Robinson, T.; Mitton, R.; Wilson, M.; Foote, J.; James, D. & Donovan, R. (1997).
British English Example Pronunciation Dictionary (BEEP) v1.0, University of Cambridge,
Department of Engineering, Machine Intelligence Laboratory, United Kingdom
[14] Robinson, T.; Fransen, J.; Pye, D.; Foote, J. & Renals, S. (1995). WSJCAM0: A British
English speech corpus for large vocabulary continuous speech recognition, Proc. of the
IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP) 1995,
Detroit, USA, Vol.1., pp. 81-84, ISBN: 0-7803-2431-5
[15] Hamilton, H. J. (2007). Notes of Computer Science 831: Knowledge Discovery in Databases,
University of Regina, Department of Computing Sciences, Canada.
[16] Rosen, K. & Yampolsky, S. (2000). Automatic speech recognition and a review of its
functioning with dysarthric speech. Augmentative and Alternative Communication, Vol. 16,
pp. 48-60
[17] Green, P.; Carmichael, J.; Hatzis, A.; Enderby, P.; Hawley, M.S. & Parker M.(2003).
Automatic Speech Recognition with Sparse Training Data for Dysarthric Speakers, Proc.
of the 8th European Conference on Speech Communication Technology (Eurospeech), Geneva,
Switzerland, pp. 1189-1192, ISSN: 1018-4074
[18] Torre, D.; Villarrubia, L.; Hernandez, L. & Elvira, J.M. (1997). Automatic Alternative
Transcription Generation and Vocabulary Selection for Flexible Word Recognizers, Proc.
of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP)
1997, Munich, Germany, Vol. 2, pp. 1463-1466
[19] Fosler-Lussier, E. (2008). Finite State Machines In Spoken Language Processing (FSM
Tutorial), Speech and Language Technology Laboratory, Ohio State University, USA.
[20] Levit, M.; Alshawi, H.; Gorin, A. & Nöth, E. (2003). Context-Sensitive Evaluation
and Correction of Phone Recognition Output, Proc. of the 8th European Conference on
Speech Communication Technology (Eurospeech), Geneva, Switzerland, pp. 925-928, ISSN:
1018-4074
[21] Bunnell, H.T.; Polikoff, J.B.; Menéndez-Pidal, X.; Peters, S.M. & Leonzio, J.E. (1996). The
Nemours Database of Dysarthric Speech, Proc. of the Fourth International Conference on
Spoken Language Processing (ICSLP 96), Philadelphia, USA.
[22] Caballero-Morales, S.O. & Cox, S.J. (2009). Modelling Errors in Automatic Speech
Recognition for Dysarthric Speakers. EURASIP Journal on Advances in Signal Processing,
pp. 1-14, ISSN: 1687-6172
[23] Li, Y.X.; Tan, C.L.; Ding, X. & Liu, C. (2004). Contextual post-processing based on the
confusion matrix in offline handwritten Chinese script recognition. Pattern Recognition,
Vol. 37, No. 9, pp. 1901-1912
[24] Cox, S.J. & Dasmahapatra, S. (2002). High level approaches to confidence estimation in
speech recognition. IEEE Transactions on Speech and Audio Processing, Vol. 10, No. 7, pp.
460-471, ISSN: 1063-6676
[25] Mohri, M.; Pereira, F. & Riley, M. (2002). Weighted finite state transducers in speech
recognition. Computer Speech and Language, Vol. 16, pp. 69-88, ISSN: 0885-2308
[26] Fosler-Lussier, E.; Amdal, I. & Kuo, H.-K.J. (2002). On the road to improved lexical
confusability metrics, ISCA Tutorial and Research Workshop on Pronunciation Modelling and
Lexicon Adaptation (PMLA-2002), Estes Park, Colorado, USA.
[27] Thatphithakkul, N. & Kanokphara, S. (2004). HMM Parameter Optimization using Tabu
Search, International Symposium on Communications and Information Technologies (ISCIT)
2004, Sapporo, Japan, pp. 904-908
[28] Cox, S. J. (2008). On Estimation of A Speaker’s Confusion Matrix from Sparse Data,
Proc. of the 9th Annual Conference of the International Speech Communication Association
(Interspeech 2008), Brisbane, Australia, pp. 2618-2621, ISSN: 1990-9772
[29] Caballero-Morales, S.O. (2011). Structure Optimization of Metamodels to Improve
Speech Recognition Accuracy, Proc. of the International Conference on Electronics
Communications and Computers (CONIELECOMP) 2011, pp. 125-130, ISBN:
978-1-4244-9557-3
[30] Caballero-Morales, S.O. & Cox, S.J. (2009). On the Estimation and the Use of
Confusion-Matrices for Improving ASR Accuracy, Proc. of the International Conference on
Spoken Language Processing (Interspeech 2009), pp. 1599-1602, ISSN: 1990-9772
[31] Caballero-Morales, S.O. & Cox, S.J. (2007). Modelling confusion matrices to improve
speech recognition accuracy, with an application to dysarthric speech, Proc. of the
International Conference on Spoken Language Processing (Interspeech 2007), pp. 1565-1568,
ISSN: 1990-9772
[32] Gillick, L. & Cox, S.J. (1989). Some statistical issues in the comparison of speech
recognition algorithms, Proc. of the IEEE International Conference on Acoustics, Speech,
and Signal Processing (ICASSP) 1989, Glasgow, United Kingdom, pp. 532-535, ISSN:
1520-6149
[33] Leggetter, C.J. & Woodland, P.C. (1995). Maximum likelihood linear regression for
speaker adaptation of continuous density hidden Markov models. Computer Speech and
Language, Vol. 9, No. 2, pp. 171-185, ISSN: 0885-2308
Chapter 6

Linear Feature Transformations in Slovak Phoneme-Based Continuous Speech Recognition

http://dx.doi.org/10.5772/48715
1. Introduction
The most common acoustic front-ends in automatic speech recognition (ASR) systems are based on the state-of-the-art Mel-Frequency Cepstral Coefficients (MFCCs). Practice shows that this general technique is a good choice for obtaining a satisfactory speech representation. Over the past few decades, researchers have made great efforts to develop and apply techniques that may improve the recognition performance of conventional MFCCs. In general, these methods were taken from mathematics and applied in many research areas such as face and speech recognition, high-dimensional data and signal processing, video and image coding, and many others. One such group of methods comprises linear transformations.
Linear feature transformations (also referred to as subspace learning or dimensionality reduction methods) are used to convert the original data set into an alternative and more compact set while retaining as much information as possible. They are also used to increase the robustness and the performance of the system. In speech recognition, the basic acoustic front-end based on MFCCs can be supplemented by some kind of linear feature transformation. The linear transformation is applied in the feature extraction step, so the whole feature extraction process consists of two stages: parameter extraction and feature transformation. The linear transformation is applied to a sequence of acoustic vectors obtained by some kind of preprocessing method. Usually, the spectral, log-spectral, Mel-filtered spectral or cepstral features are projected to a more relevant and more decorrelated subspace, which is directly used in acoustic modeling. A dimensionality reduction step is often performed during the transformation: only the relevant dimensions are retained after the transformation according to some optimization criterion. The dimensionality reduction step helps to address the problem known as the curse of dimensionality.
In practice, supervised and unsupervised subspace learning methods are used. The most popular data-driven unsupervised transformation used in ASR is Principal Component Analysis (PCA). Supervised methods, by contrast, require information about the class structure of the data, so appropriate class labels are necessary. A widely used supervised method is Linear Discriminant Analysis (LDA).
Numerous research works and publications have shown that the above-mentioned linear transformations can be successfully applied in ASR for multiple languages with different speech characteristics. The Slovak speech recognition research group follows this trend. In this work, we present a practical methodology, with the relevant theoretical principles, for the application of linear feature transformations in Slovak phoneme-based large vocabulary continuous speech recognition (LVCSR).
The main subject of this chapter is the application of LDA in Slovak ASR, but the core of most experiments is based on Two-dimensional LDA (2DLDA), which is an extension of LDA. Several context lengths of the basic vectors are used in the discriminant analysis, and different final dimensions of the transformation matrix are utilized. The classical procedures are supplemented by several of our modifications. The second part of the chapter is devoted to PCA and to our proposed method for PCA training from a limited amount of training data. The third part investigates the interaction of the above-mentioned PCA and 2DLDA applied in one recognition task. The closing part compares and evaluates all experiments and concludes the chapter by presenting the best achieved results.
This chapter is divided into several sections. Sections 2 and 3 describe LDA and 2DLDA as used in speech recognition. Section 4 surveys PCA and also presents the proposed partial-data trained PCA method. Section 5 presents the setup of the system for continuous phoneme-based speech recognition. Section 6 presents extensive experiments and evaluations of the methods used in different configurations. Finally, Section 7 concludes the chapter, and Section 8 outlines the future intentions of our research.
2. Linear discriminant analysis

The LDA transformation can be written as

y = W^T x,    (1)

where y is the output transformed feature set, W is the transformation matrix and x is the input feature set. The aim of LDA is to find the transformation matrix W with respect to some optimization criterion (information loss, class discrimination, etc.). It can be obtained by applying an eigendecomposition to the covariance matrices. The p best eigenvectors resulting from the decomposition are used to transform the feature vectors to a reduced representation:

y_i = W^T x_i ;    p < N.    (2)
Consider that the original data is partitioned into k classes as X = {Π_1, …, Π_k}, where the class Π_i contains n_i elements (feature vectors) from the i-th class. Notice that n = ∑_{i=1}^{k} n_i. The classes can be represented by class mean vectors

μ_i = (1/n_i) ∑_{x ∈ Π_i} x,    (3)

and by class covariance matrices

Σ_i = ∑_{x ∈ Π_i} (x − μ_i)(x − μ_i)^T,    (4)
which are defined to quantify the quality of the clusters. Since LDA in ASR is mostly used in a class-independent manner, we define the within-class covariance matrix as the sum of all class covariance matrices:

Σ_W = (1/n) ∑_{i=1}^{k} Σ_i = (1/n) ∑_{i=1}^{k} ∑_{x ∈ Π_i} (x − μ_i)(x − μ_i)^T.    (5)
To quantify the covariance between classes, the between-class covariance matrix is used. It is defined as:

Σ_B = (1/n) ∑_{i=1}^{k} (μ_i − μ)(μ_i − μ)^T,    (6)

where

μ = (1/n) ∑_{i=1}^{k} ∑_{x ∈ Π_i} x    (7)
is the global mean vector (computed disregarding the class label information). Note that the variable x in speech recognition represents a supervector created by concatenating acoustic vectors computed on successive speech frames. To build a supervector of J acoustic vectors (J is typically 3, 5, 7, 9 or 11 frames), the vector x_j at the current position j is spliced together with (J−1)/2 vectors on the left and right as

x = [ x[j − (J−1)/2] … x[j] … x[j + (J−1)/2] ].    (8)
It should be noted that in the case when the length of the supervector was greater than the number of classes (13 × J > k, where J ≥ 5 and k = 45), the between-class covariance matrix became singular or close to singular. This resulted in an eigendecomposition with a complex-valued transformation matrix, which was undesirable.
Therefore, for these cases we used a modified computation of Σ_B according to [7], as follows:

Σ̂_B = (1/n) ∑_{i=1}^{n} (x_i − μ)(x_i − μ)^T.    (9)

This way of computation can be interpreted as a finer estimation of Σ_B, because each training supervector contributes to the final estimate (more data points are used) in comparison with the estimation represented by Equation 6.
The given covariance matrices are used to formulate the optimization criterion for LDA, which tries to maximize the between-class scatter (covariance) over the within-class scatter (covariance). It can be shown that the covariance matrices resulting from the linear transformation W (in the p-dimensional space) become Σ_B^p = W^T Σ_B W and Σ_W^p = W^T Σ_W W. The objective function can be defined as

J(W) = |Σ_B^p| / |Σ_W^p| = |W^T Σ_B W| / |W^T Σ_W W|.    (10)
The maximization of J(W) leads to the eigenvalue problem

Σ_W^{−1} Σ_B v = λ v,    (11)

where v is a square matrix of eigenvectors and λ represents the eigenvalues. The solution can be obtained by applying an eigendecomposition to the matrix

Σ_W^{−1} Σ_B.    (12)
Section 5.3). Thus, the number of classes in the LDA-based experiments was identical to the number of phonemes and also to the number of trained monophone models. A potential disadvantage of the phone segmentation obtained from embedded training is the inaccuracy of the determined phone boundaries compared to the actual boundaries.
In 2DLDA, the right scatter matrices are defined as

S_w^R = ∑_{i=1}^{k} ∑_{X ∈ Π_i} (X − M_i) R R^T (X − M_i)^T,    (16)

S_b^R = ∑_{i=1}^{k} n_i (M_i − M) R R^T (M_i − M)^T.    (17)

For fixed R, L can then be computed by solving an optimization problem:

max_L trace( (L^T S_w^R L)^{−1} L^T S_b^R L ).    (18)
Analogously, the left scatter matrices are

S_w^L = ∑_{i=1}^{k} ∑_{X ∈ Π_i} (X − M_i)^T L L^T (X − M_i),    (21)

S_b^L = ∑_{i=1}^{k} n_i (M_i − M)^T L L^T (M_i − M).    (22)

In this way, with the obtained L, the optimal R can be computed by solving an optimization problem:
max_R trace( (R^T S_w^L R)^{−1} R^T S_b^L R ).    (23)

This problem can be solved as an eigenvalue problem:

S_w^L x = λ S_b^L x.    (24)
It should be noted that the sizes of the scatter matrices in 2DLDA are much smaller than those in LDA. Specifically, the size of S_w^R and S_b^R is r × r, and the size of S_w^L and S_b^L is c × c.
S_b^R ← ∑_{i=1}^{k} n_i (M_i − M) R_{j−1} R_{j−1}^T (M_i − M)^T;    (27)

6. Compute the first l1 eigenvectors {φ_l^L}_{l=1}^{l1} of (S_w^R)^{−1} S_b^R;
7. L_j ← [φ_1^L, …, φ_{l1}^L];
8.

S_w^L ← ∑_{i=1}^{k} ∑_{X ∈ Π_i} (X − M_i)^T L_j L_j^T (X − M_i),    (28)

S_b^L ← ∑_{i=1}^{k} n_i (M_i − M)^T L_j L_j^T (M_i − M);    (29)

9. Compute the first l2 eigenvectors {φ_l^R}_{l=1}^{l2} of (S_w^L)^{−1} S_b^L;
10. End for
11. L ← L_I, R ← R_I;
12. B_l ← L^T A_l R, for l = 1, …, n;
13. return (L, R, B_1, …, B_n).
The most time-consuming steps in the 2DLDA computation are lines 5, 8 and 13. The algorithm depends on the initial choice of R_0. In [19] it was shown that choosing the identity matrix as R_0 is recommended.
The covariance matrix of the mean-subtracted training vectors Φ_i = x_i − x̄ is computed as

C = (1/(M−1)) ∑_{i=1}^{M} Φ_i Φ_i^T = (1/(M−1)) ∑_{i=1}^{M} (x_i − x̄)(x_i − x̄)^T = (1/(M−1)) A A^T.    (33)
The principal components are determined by the K leading eigenvectors resulting from the decomposition. The dimensionality reduction step is performed by keeping only the eigenvectors corresponding to the K largest eigenvalues (K < n). These eigenvectors form the transformation matrix U_K with dimension n × K:

U_K = [u_1 u_2 … u_K],    (35)

where λ_1 > λ_2 > … > λ_n. Finally, the linear transformation R^n → R^K is computed according to Equation (1) as:

y_i = U_K^T Φ_i = U_K^T (x_i − x̄),    i ∈ ⟨1; M⟩,    (36)
where y_i represents the transformed feature vector. The value of K can be chosen as needed or according to the following comparative criterion:

(∑_{i=1}^{K} λ_i) / (∑_{i=1}^{n} λ_i) > T,    (37)

or

(∑_{i=1}^{K} λ_i) / trace(U) > T.    (39)
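The PCA procedure of Equations 33-37 can be sketched as follows (NumPy); the synthetic low-rank data is purely illustrative:

```python
import numpy as np

def pca(X, T=0.95):
    """PCA: eigendecompose the sample covariance (Equation 33) and keep
    the K leading eigenvectors whose eigenvalues explain at least a
    fraction T of the total variance (Equation 37)."""
    xbar = X.mean(axis=0)
    Phi = X - xbar                          # mean-subtracted data
    C = Phi.T @ Phi / (len(X) - 1)          # Equation 33
    vals, vecs = np.linalg.eigh(C)
    order = np.argsort(-vals)               # lambda_1 > lambda_2 > ...
    vals, vecs = vals[order], vecs[:, order]
    ratio = np.cumsum(vals) / vals.sum()
    K = int(np.argmax(ratio >= T)) + 1      # smallest K satisfying Equation 37
    UK = vecs[:, :K]                        # Equation 35
    Y = Phi @ UK                            # Equation 36, row-wise
    return UK, Y

rng = np.random.default_rng(2)
# 26-dimensional data lying (up to noise) in a 3-dimensional subspace.
X = rng.normal(size=(500, 3)) @ rng.normal(size=(3, 26))
UK, Y = pca(X, T=0.99)
```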
trained from a limited (reduced) amount of training data, while the performance is maintained or even improved. We call this procedure partial-data trained PCA.
Partial-data PCA training can be viewed as a kind of feature selection process. The main idea is to select the statistically significant data (feature vectors) from the whole amount of training data. There are two major processing stages. The first stage is data selection based on PCA, applied separately to all training feature vectors. Suitable vectors are concatenated into one training matrix, which is treated as the input for the main PCA. The second stage is the main PCA itself (see Section 4.1).
Suppose now that the same conditions apply as in Section 4.1. Then the selection process based on PCA (without the projection phase) can be described as follows. Each 26-dimensional LMFE (or 13-dimensional MFCC) feature vector x_i, i ∈ ⟨1; M⟩ (see Section 5.2), is reshaped into its matrix version X_i with dimension 2 × 13 (in the case of MFCC vectors, the 13-dimensional vector is extended with a zero coefficient in order to reshape it into a matrix with dimension 2 × 7). After mean subtraction, the covariance matrix is computed as:

C_i = (1/(k−1)) X_i X_i^T,    i ∈ ⟨1; M⟩;  k = 13 (for MFCC, k = 7).    (40)
In the next step, an eigendecomposition is performed on the covariance matrix C_i:

C_i w_ij = α_ij w_ij,    j = 1, 2,    (41)

which results, for each i, in a pair of eigenvectors w_i1, w_i2 and eigenvalues α_i1, α_i2, where

W_i = [w_i1 w_i2].    (42)
Note that the parameters w_i1, w_i2 and α_i1, α_i2 at each iteration i are updated with the new parameters resulting from a new eigendecomposition. For the PCA-based selection the eigenvectors w_i1, w_i2 are not used. On the other hand, the eigenvalues α_i1, α_i2 are the key elements, because the selection criterion is based exactly on them. Using these eigenvalues, the percentage proportion P_i is computed as:

P_i = α_i1 / ∑_{j=1}^{2} α_ij = α_i1 / (α_i1 + α_i2) = α_i1 / trace(C_i),    (43)
which determines the percentage of the variance explained by the first eigenvalue in the eigenspectrum. Further, it is necessary to choose a threshold T. It can be chosen from two different intervals. The first one is defined as T_1 ∈ (50; ≈65⟩ and the second one as T_2 ∈ ⟨≈85; 99.9⟩. Then the selection criterion can be based on the following logical expressions:

P_i ≤ T_1    (44)

for the first interval, or

P_i ≥ T_2    (45)

for the second interval. If the evaluation of the expression yields logical true, then the current feature vector is classified as statistically significant for PCA training. This vector is stored and
the selection continues with the next vector. In this way, the whole training corpus is processed. From the selected vectors a training matrix is composed, which is treated as the input for the main PCA described in Section 4.1. As mentioned in Section 4.1, there are M training vectors in the corpus. If the selected subset contains M̃ vectors (M̃ ≪ M), then Equation 32 can be modified as:

A = [φ_1 φ_2 … φ_M̃],    (46)

where φ_i is the mean-subtracted feature vector in the new training matrix. The subsequent mathematical computations are identical to Equations 33-36. The partial-data training procedure for LMFE feature vectors is illustrated in Figure 4.3; for MFCC-based partial-data PCA the figure would be analogous.
[Figure 4.3: vector reshaping (1 × 26 → 2 × 13), covariance matrix computation, eigendecomposition, and selection or dropping of each feature vector.]
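The selection stage of Equations 40-45 can be sketched as follows (NumPy); random data stands in for real LMFE vectors and the default threshold is illustrative:

```python
import numpy as np

def select_vectors(features, T=90.0, mode="high"):
    """Partial-data PCA selection: reshape each 26-dim LMFE vector to
    2 x 13, eigendecompose its 2 x 2 covariance (Equation 40) and keep
    the vector if the proportion P_i explained by the first eigenvalue
    (Equation 43, in percent) passes the threshold T (Equations 44-45)."""
    selected = []
    for x in features:
        X = x.reshape(2, 13)
        X = X - X.mean(axis=1, keepdims=True)   # mean subtraction
        C = X @ X.T / (13 - 1)                  # Equation 40, k = 13
        alpha = np.linalg.eigvalsh(C)           # ascending: [alpha_2, alpha_1]
        P = 100.0 * alpha[-1] / alpha.sum()     # Equation 43
        if (mode == "high" and P >= T) or (mode == "low" and P <= T):
            selected.append(x)
    return np.array(selected)

rng = np.random.default_rng(3)
lmfe = rng.normal(size=(10000, 26))
subset = select_vectors(lmfe, T=90.0, mode="high")
fraction = len(subset) / len(lmfe)              # subset size, Equation 47
```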
The new training matrix can be viewed as a radically reduced, more relevant representation of the training corpus. It has a nearly homoscedastic variance structure, because it contains only those feature vectors which have almost the same variance distribution. Feature vectors selected from the interval represented by threshold T_1 can be characterized as data clusters which have a very small variance distribution explained by the first eigenvalue along the direction of the corresponding first eigenvector. On the other hand, the feature vectors from the interval represented by threshold T_2 are clusters which have a large variance distribution along the first eigenvector. In both cases, the magnitude of the variance is determined by the first eigenvalue. The size of the selected partial data set depends on the value of T_1 or T_2. The size of the partial set can be expressed as a percentage as:

subset_size = (M̃ / M) × 100.    (47)
We found that the ratio has practical importance when

M̃ / M ∈ ⟨0.001; 0.15⟩,    (48)

so the selected subset contains at most 15% of the whole training data. For example, there are approximately 19 million training vectors in our corpus. According to Equation 48 it is sufficient to extract ≈ 19,000 vectors for partial-data training. However, as will be shown in Section 6.3.2, this does not apply to all cases. The time consumption and memory costs of the covariance matrix computation for the reduced data set are much
smaller than the costs of the covariance matrix computation for the whole corpus. In the case of partial-data training it is only necessary to allocate memory for the one feature vector currently being investigated and for the other data elements needed for the mathematical computations. These memory requirements are on the order of a few megabytes. In other words, the advantage of partial-data training is that it does not require loading the whole data matrix into main memory.
5.4. Evaluation

In order to evaluate the experiments, we chose accuracy as the evaluation parameter. Accuracies were computed as the ratio of the number of word matches (resulting from the recognizer) to the number of reference words [20]. In all experiments the accuracy is given as a percentage.
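For reference, a minimal sketch of an edit-distance based word accuracy of the kind reported by the HTK tools [20] is shown below: the hypothesis is aligned to the reference with minimum edit distance and scored as Acc = (N − S − D − I)/N. The example sentences are arbitrary.

```python
def word_accuracy(reference, hypothesis):
    """Word accuracy in percent: align with minimum edit distance and
    score Acc = (N - S - D - I) / N, where N is the reference length."""
    n, m = len(reference), len(hypothesis)
    # dp[i][j] = minimum substitutions + deletions + insertions needed
    # to turn the first i reference words into the first j hypothesis words.
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dp[i][0] = i
    for j in range(m + 1):
        dp[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if reference[i - 1] == hypothesis[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j - 1] + cost,   # substitution / match
                           dp[i - 1][j] + 1,          # deletion
                           dp[i][j - 1] + 1)          # insertion
    return 100.0 * (n - dp[n][m]) / n

ref = "the cat sat on the mat".split()
hyp = "the cat sat on mat".split()
acc = word_accuracy(ref, hyp)   # one deletion out of six reference words
```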
and (d) are purely symmetric. They were computed from supervectors constructed according to Figure 2 (b).

Figure 2. Different types of supervector composition; (a) composition by simple concatenation, (b) composition retaining the structure of the basic vector
6.1.3. Results

The experiments based on LDA can be divided into three categories according to the dimension of the LDA transformation matrix. The first category is represented by an LDA matrix with dimension 13 × 39. Thus, only the first 13 eigenvectors, corresponding to the 13 leading eigenvalues, were retained for the transformation. The final dimension of the features was expanded to 39 with Δ and ΔΔ coefficients. The second category is represented by an LDA matrix with dimension 19 × 39, so more LDA coefficients were used for the transformations. Note that the final dimension of the features was 38 (19 + Δ). The third category is represented by an LDA matrix with dimension 39 × 39; in this case the Δ and ΔΔ coefficients were not used. The difference between these three categories is that various numbers of dimensions and data-dependent or data-independent Δ and ΔΔ coefficients were used for acoustic modeling. The LDA coefficients of lower order (14-39) can be viewed as Δ and ΔΔ coefficients estimated in a data-dependent manner. The experimental results for LDA are given in Table 1. The results are analyzed separately for the mentioned categories.
1. The highest accuracies were achieved for 13 LDA coefficients expanded with Δ and ΔΔ coefficients and for J = 3. The maximum improvement compared to the MFCC model is +2.05% for 4 mixtures. Only for 1 mixture was no improvement achieved.
(a) Within-class scatter matrix computed from supervectors obtained by concatenation of the basic vectors; (b) within-class scatter matrix computed from supervectors obtained by preserving the structure of the basic vectors; (c) between-class scatter matrix computed from supervectors obtained by concatenation; (d) between-class scatter matrix computed from supervectors obtained by preserving the structure.

Figure 3. Within-class and between-class scatter matrices computed from supervectors of length 65 composed in different ways
2. In the case of the LDA matrix with dimension 19 × 39 the improvement is lower than in the previous case. The performance was improved only for 2, 4, 8 and 256 mixtures. It can also be seen that for 256 mixtures an improvement was achieved for a higher context length. Note that the acoustic models in this experiment have a smaller dimension than the reference model (38 < 39).
3. The results in the last case, when the dimension of the LDA matrix was 39 × 39, are not satisfactory. In all cases the performance decreased. We can conclude, however, that longer context lengths are suitable for higher dimensions of the transformation matrix (without Δ and ΔΔ).
(a) Between-class scatter matrix computed from supervectors constructed according to Figure 2 (a); (b) between-class scatter matrix computed from supervectors constructed according to Figure 2 (b).

Figure 4. Close-to-symmetric between-class scatter matrices computed according to Equation 6 for context length J = 5
Table 1. Accuracy levels (%) for conventional LDA with different numbers of retained dimensions (13, 19 and 39) compared to the baseline MFCC model

LDA reported in Section 6.1.3. The whole mathematical 2DLDA computation was performed according to Equations 13-25. The statistical estimations are similar to those in conventional LDA. The main difference is that two eigendecompositions have to be computed and there are two transformation matrices, L and R. 2DLDA does not deal with supervectors as LDA does, but with supermatrices, which are the basic data elements in 2DLDA (instead of vectors). These supermatrices were created from the basic cepstral vectors by coupling them together. As in LDA, we used 5 different sizes of supermatrices according to the number of contextual vectors (context size J). Thus, the sizes of the supermatrices were 13 × 3, 13 × 5, 13 × 7,
Linear Feature Transformations in Slovak Phoneme-Based Continuous Speech Recognition
13 × 9 and 13 × 11. Consequently, the class mean, global mean, within-class scatter matrix and between-class scatter matrix have corresponding sizes according to the current length of context. For example, when the context size J was set to 7, in the statistical estimation 7 cepstral vectors were coupled together to form a 13 × 7 supermatrix. The statistical estimators then have the following dimensions:
• class means Mi : 13 × 7,
• global mean M : 13 × 7,
• left within-class scatter matrix Sw^L : 7 × 7,
• left between-class scatter matrix Sb^L : 7 × 7,
• right within-class scatter matrix Sw^R : 13 × 13,
• right between-class scatter matrix Sb^R : 13 × 13,
• left transformation matrix L : 13 × 13,
• right transformation matrix R : 7 × 7.
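The listed shapes can be checked with a toy construction (synthetic data; for brevity the scatter-like matrices are computed around the global mean only, contracting over one side of the supermatrix):

```python
import numpy as np

rng = np.random.default_rng(1)
J, d = 7, 13                          # context size and cepstral dimension
frames = rng.normal(size=(100, d))    # toy cepstral vectors, one per row

# couple J consecutive vectors into 13 x 7 supermatrices
supermats = [frames[i:i + J].T for i in range(len(frames) - J + 1)]
M = np.mean(supermats, axis=0)        # global mean supermatrix, 13 x 7

# scatter-like matrices contracted over one side of (A - M)
SwL = sum((A - M).T @ (A - M) for A in supermats)  # 7 x 7  (left side)
SwR = sum((A - M) @ (A - M).T for A in supermats)  # 13 x 13 (right side)
print(M.shape, SwL.shape, SwR.shape)  # (13, 7) (7, 7) (13, 13)
```

Because these matrices are at most 13 × 13, the singularity problem of high-dimensional supervector scatter matrices does not arise.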
1. The first category is represented by a vector of dimension 13, which resulted from the transformation. The final dimension was 39 (13 2DLDA + Δ + ΔΔ coefficients). As can be seen from Table 2, this case resulted in the highest accuracies for 2DLDA with context length J = 3.
2. The second category is represented by a vector of dimension 19. The final dimension was 38 (19 2DLDA + Δ coefficients). Note that, for example, in the case of a transformed supermatrix with dimension 10 × 2, the last coefficient in the matrix-to-vector alignment was ignored to obtain a vector with dimension 19. From Table 2 it can be seen that 2DLDA at this dimension does not perform successfully. The performance of the base MFCC model was not improved.
3. Similar conclusions to the previous case apply to the third category. In these experiments the feature vector dimension was 39 (without Δ and ΔΔ coefficients). The maximum improvement achieved by 2DLDA was +2.01% for context length J = 3 and one iteration (I = 1).
Modern Speech Recognition Approaches with Case Studies
Table 2. Accuracy levels (%) for 2DLDA with different numbers of retained dimensions compared to the baseline MFCC model and conventional LDA
The full-data trained PCA was performed on a Linux machine with 32 GB of memory. The training data were loaded into memory sequentially by data blocks and then concatenated into one data matrix (see Equation 32). From this matrix, the covariance matrix was computed according to Equation 33. Then the integral parts of PCA were performed according to Equations 34–36. In the next step, acoustic modeling based on the PCA-transformed features was carried out. The evaluation results of the full-data trained PCA are listed in Table 5 for LMFE features and in Table 6 for MFCC features.
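The block-wise loading and covariance computation described above can be sketched as follows (synthetic blocks; Equations 32–36 refer to the chapter, while the code is only an illustrative stand-in):

```python
import numpy as np

rng = np.random.default_rng(2)
blocks = [rng.normal(size=(500, 13)) for _ in range(4)]  # data loaded block by block

# concatenate the blocks into one data matrix and compute its covariance
X = np.concatenate(blocks, axis=0)   # (2000, 13) data matrix
C = np.cov(X, rowvar=False)          # 13 x 13 covariance matrix

# PCA: eigendecomposition of the covariance matrix, eigenvalues descending
vals, vecs = np.linalg.eigh(C)
order = np.argsort(vals)[::-1]
vals, vecs = vals[order], vecs[:, order]
Y = (X - X.mean(axis=0)) @ vecs      # PCA-transformed features
print(C.shape, Y.shape)              # (13, 13) (2000, 13)
```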
Table 3. Parameters used for partial-data PCA models trained from LMFE
Table 4. Parameters used for partial-data PCA models trained from MFCC
One of the output parameters of the partial-data PCA is the optimal dimension d determined by Equation (37). It represents the number of principal components needed to transform the input data while retaining 95% of the global variance. Note that the threshold values T1 and T2 were determined empirically. The results of the partial-data PCA models are listed in Table 5 and Table 6 for LMFE and MFCC features, respectively. Note that the tables contain only the highest accuracies chosen from all models.
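The selection of the optimal dimension d, the smallest number of principal components retaining 95% of the global variance, can be sketched as follows (the function name and the toy eigenvalue spectrum are invented for illustration):

```python
import numpy as np

def optimal_dimension(eigenvalues, retain=0.95):
    """Smallest d such that the top-d eigenvalues cover `retain` of total variance."""
    vals = np.sort(eigenvalues)[::-1]           # descending eigenvalues
    ratio = np.cumsum(vals) / np.sum(vals)      # cumulative variance ratio
    return int(np.searchsorted(ratio, retain) + 1)

# toy spectrum: variance concentrated in the first few components
vals = np.array([50.0, 30.0, 10.0, 5.0, 3.0, 2.0])
print(optimal_dimension(vals))  # 4, since 50 + 30 + 10 + 5 = 95 out of 100
```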
From Table 5 we can conclude that for LMFE features the selected subsets of size 0.1% and 5% are not suitable for partial-data PCA training. In addition, an improvement over the full-data trained PCA was achieved only for 32–256 mixtures. The maximum absolute improvement of +0.43% was achieved for 64 mixtures.
Mixtures Acc. of full PCA Acc. of partial-data PCA Difference Threshold Part of DB
1 82.80% 82.06% −0.74% T2 = 89.2 15%
2 84.10% 83.88% −0.22% T2 = 89.2 15%
4 86.01% 85.93% −0.08% T1 = 63.0 10%
8 88.88% 88.21% −0.67% T2 = 89.2 15%
16 89.84% 89.82% −0.02% T2 = 91.0 10%
32 90.31% 90.72% +0.41% T2 = 91.0 10%
64 91.00% 91.43% +0.43% T2 = 91.0 10%
128 91.72% 91.91% +0.19% T2 = 96.1 1%
256 92.30% 92.60% +0.30% T2 = 89.2 15%
Table 5. Accuracy levels for LMFE-based full-data and partial-data trained PCA
With MFCC features used as the input for partial-data PCA training, the results are more satisfactory. From Table 6 it can be seen that an improvement was achieved for all mixtures. The maximum absolute improvement is +1.25% for 1 mixture. It should also be mentioned that for MFCC features the proposed method used smaller selected subsets (≈ 1%) than for LMFE features.
Mixtures Acc. of full PCA Acc. of partial-data PCA Difference Threshold Part of DB
1 82.35% 83.60% +1.25% T1 = 51.10 0.1%
2 84.24% 84.79% +0.55% T1 = 53.35 1%
4 85.94% 86.33% +0.39% T1 = 53.35 1%
8 87.83% 88.08% +0.25% T2 = 92.62 5%
16 89.14% 89.36% +0.22% T2 = 92.62 5%
32 90.19% 90.32% +0.13% T2 = 89.70 10%
64 90.90% 91.27% +0.37% T1 = 53.35 1%
128 91.20% 91.78% +0.58% T1 = 57.45 5%
256 91.76% 92.19% +0.43% T2 = 96.40 1%
Table 6. Accuracy levels for MFCC-based full-data and partial-data trained PCA
Table 8. Global comparison of partial experiments for all types of linear transformations
[Plots: (a) accuracy level (%) of the reference MFCC and transformed models versus the number of mixtures (1–256); (b) absolute improvement (%) of the transformed models versus the number of mixtures.]
Figure 5. Graphical global evaluation of all experiments compared to the reference MFCC model
• Principal Component Analysis can improve the performance of the MFCC-based acoustic model. Either LMFE or MFCC features can be used as the input for PCA.
• The proposed partial-data trained PCA achieves better results than full-data trained PCA. Higher improvements can be achieved when MFCC features are used as the input for partial-data PCA.
• The conventional Linear Discriminant Analysis leads to improvements for almost all mixtures, but a problem related to singularity of the between-class scatter matrix may occur for larger context lengths J.
• 2DLDA achieves improvements comparable to LDA (slightly smaller). On the other hand, it is much more stable than LDA and there is no singularity problem, because 2DLDA overcomes it implicitly (much smaller dimensions of scatter matrices).
• In the last step, we clearly demonstrated that the combination of PCA and 2DLDA (subspace learning) leads to further refinement and improvement compared to the performance of 2DLDA alone.
Acknowledgments
The research presented in this chapter was supported by the Ministry of Education under the research projects VEGA 1/0386/12 and MŠ SR 3928/2010–11 and by the Slovak Research and Development Agency under the research project APVV–0369–07.
Author details
Jozef Juhár and Peter Viszlay
Technical University of Košice, Slovakia
9. References
[1] Abbasian, H., Nasersharif, B., Akbari, A., Rahmani, M. & Moin, M. S. [2008]. Optimized
linear discriminant analysis for extracting robust speech features, Proc. of the 3rd
Intl. Symposium on Communications, Control and Signal Processing, St. Julians, pp. 819–824.
[17] Viszlay, P., Juhár, J. & Pleva, M. [2012]. Alternative phonetic class definition in linear
discriminant analysis of speech, Proc. of the 19th International Conference on Systems,
Signals and Image Processing, IWSSIP’12, Vienna, Austria. Accepted, to be published.
[18] Yang, J., Zhang, D., Frangi, A. F. & Yang, J.-Y. [2004]. Two-Dimensional PCA: A New Approach to Appearance-Based Face Representation and Recognition, IEEE Transactions on Pattern Analysis and Machine Intelligence 26: 131–137.
[19] Ye, J., Janardan, R. & Li, Q. [2005]. Two-dimensional linear discriminant analysis,
L. K. Saul, Y. Weiss and L. Bottou (Eds.): Advances in Neural Information Processing Systems
17: 1569–1576.
[20] Young, S., Evermann, G., Gales, M., Hain, T., Kershaw, D., Liu, X. A., Moore, G., Odell, J.,
Ollason, D., Povey, D., Valtchev, V. & Woodland, P. [2006]. The HTK Book (for HTK Version
3.4). First Published Dec. 1995.
Chapter 7

Dereverberation Based on Spectral Subtraction by Multi-Channel LMS Algorithm for Hands-Free Speech Recognition

http://dx.doi.org/10.5772/48430
1. Introduction
In a distant-talking environment, channel distortion drastically degrades speech recognition performance because of the mismatch between the training and testing environments. Current approaches to automatic speech recognition (ASR) robustness against reverberation and noise can be classified into speech signal processing [1, 4, 5, 14], robust feature extraction [10, 20], and model adaptation [3, 25].
In this chapter, we focus on speech signal processing in the distant-talking environment. Because both the speech signal and the reverberation are nonstationary signals, dereverberation, that is, obtaining clean speech from the convolution of nonstationary speech signals and impulse responses, is a very difficult task. Several studies have focused on mitigating this problem [8, 9, 11, 12]. [1] explored a speech dereverberation technique whose principle was the recovery of the envelope modulations of the original (anechoic) speech; they applied a technique originally developed to treat background noise [11] to the dereverberation problem. [7] proposed a novel approach for multimicrophone speech dereverberation based on the construction of the null subspace of the data matrix in the presence of colored noise, employing generalized singular-value decomposition or generalized eigenvalue decomposition of the respective correlation matrices. A reverberation compensation method for speaker recognition using spectral subtraction, in which the late reverberation is treated as additive noise, was proposed by [16, 17]. However, the drawback of this approach is that the optimum parameters for spectral subtraction are estimated empirically from a development dataset, and the late reverberation cannot be subtracted well since it is not modeled precisely. [18] proposed a novel dereverberation method utilizing multi-step forward linear prediction: they estimated the linear prediction coefficients in the time domain and suppressed the amplitude of late reflections through spectral subtraction in the spectral domain.
[Block diagram: multi-channel reverberant speech → DFT → late reverberation reduction based on spectral subtraction (using the estimated spectra of the impulse responses) → early reverberation normalization → IDFT.]
Figure 1. Schematic diagram of the blind dereverberation method.
    x[t] = h[t] ∗ s[t] + n[t],     (1)

where ∗ denotes the convolution operation. In this chapter, additive noise is ignored for simplicity, so Eq. (1) becomes x[t] = h[t] ∗ s[t].
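The distortion model of Eq. (1), with noise ignored, can be reproduced directly (toy signals; the decaying envelope on h is only meant to mimic a room response):

```python
import numpy as np

rng = np.random.default_rng(3)
s = rng.normal(size=1000)                                # clean speech (toy signal)
h = rng.normal(size=64) * np.exp(-np.arange(64) / 16.0)  # decaying impulse response

x = np.convolve(h, s)   # x[t] = h[t] * s[t], Eq. (1) without the noise term
print(x.shape)          # (1063,) = 1000 + 64 - 1
```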
To analyze the effect of the impulse response, h[t] can be separated into two parts, hearly[t] and hlate[t], as [16, 17]

    hearly[t] = h[t] for t < T, and 0 otherwise,
    hlate[t] = h[t + T] for t ≥ 0, and 0 otherwise,     (2)

where T is the length of the spectral analysis window, and h[t] = hearly[t] + δ(t − T) ∗ hlate[t].
δ() is the Dirac delta function (that is, a unit impulse function). Formula (1) can then be rewritten as

    x[t] = s[t] ∗ hearly[t] + s[t − T] ∗ hlate[t],     (3)
where the early effect is distortion within a frame (analysis window), and the late effect comes
from previous multiple frames.
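The split in Eq. (2) and the identity h[t] = hearly[t] + δ(t − T) ∗ hlate[t] can be verified numerically; the following sketch uses an invented decaying impulse response and a window length T chosen only for illustration:

```python
import numpy as np

T = 32                                   # analysis-window length (samples)
rng = np.random.default_rng(4)
h = rng.normal(size=128) * np.exp(-np.arange(128) / 40.0)

h_early = np.where(np.arange(len(h)) < T, h, 0.0)  # h[t] for t < T, else 0
h_late = h[T:]                                     # h[t + T] for t >= 0

# delta(t - T) * h_late simply shifts h_late right by T samples
shifted_late = np.concatenate([np.zeros(T), h_late])
assert np.allclose(h, h_early + shifted_late)      # h = h_early + delta(t-T)*h_late
```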
When the length of the impulse response is much shorter than the analysis window size T used for the short-time Fourier transform (STFT), the STFT of the distorted speech equals the STFT of the clean speech multiplied by the STFT of the impulse response h[t] (in this case, h[t] = hearly[t]). However, when the length of the impulse response is much longer than the analysis window size, the STFT of the distorted speech is usually approximated by

    X(f, ω) ≈ S(f, ω) ∗ H(ω)
            = S(f, ω)H(0, ω) + ∑_{d=1}^{D−1} S(f − d, ω)H(d, ω),     (4)

where the right-hand side of Eq. (4) cannot be subtracted straightforwardly, since the late reverberation is not modelled precisely.
In this chapter, we propose a dereverberation method based on spectral subtraction to estimate the STFT of the clean speech Ŝ(f, ω) based on Eq. (4); the spectrum of the impulse response used for the spectral subtraction is blindly estimated using the method described in Section 3. Assuming, for simplicity, that the phases of different frames are uncorrelated, the power spectrum of Eq. (4) can be approximated as

    |X(f, ω)|² ≈ |S(f, ω)|²|H(0, ω)|² + ∑_{d=1}^{D−1} |S(f − d, ω)|²|H(d, ω)|².     (6)
The estimated power spectrum of clean speech may not be very accurate due to the estimation error of the impulse response, especially of its early part. In addition, an unreliable estimated power spectrum of clean speech in a previous frame causes further estimation error in the current frame. In this chapter, the late reverberation is reduced based on the power SS, while the early reverberation is normalized by CMN at the feature extraction stage. A diagram of the proposed method is shown in Fig. 1. SS is used to prevent the estimated power spectrum obtained by reducing the late reverberation from becoming negative; the estimated power spectrum obtained by reducing the late reverberation then becomes |X̂(f, ω)|² = |Ŝ(f, ω)|²|Ĥ(0, ω)|², where |Ŝ(f, ω)|² is the power spectrum of the estimated clean speech and Ĥ(f, ω) is the estimated STFT of the impulse response. To estimate the power spectra of the impulse responses, we extend the multi-channel LMS algorithm for identifying impulse responses in the time domain [14] to the frequency domain in Section 3.2.
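As a rough illustration of the frame-wise late-reverberation reduction implied by Eq. (6), the following sketch subtracts the late-reverberation power estimated from previously recovered frames and floors the result. The input spectra, the flooring constant, and the function name are placeholders (the chapter itself reports α = 1 and β = 0.15 for the isolated word task):

```python
import numpy as np

def reduce_late_reverb(X_pow, H_pow, beta=0.15):
    """Power-SS dereverberation sketch based on Eq. (6).

    X_pow: (F, K) power spectra of observed frames.
    H_pow: (D, K) estimated power spectra of the impulse response,
           H_pow[0] being the early part.
    Returns the power spectra of the estimated clean speech.
    """
    F, K = X_pow.shape
    D = H_pow.shape[0]
    S_hat = np.zeros((F, K))
    for f in range(F):
        # late reverberation accumulated from previously estimated frames
        late = sum(S_hat[f - d] * H_pow[d] for d in range(1, min(D, f + 1)))
        est = (X_pow[f] - late) / H_pow[0]           # alpha = 1, no over-subtraction
        S_hat[f] = np.maximum(est, beta * X_pow[f])  # floor to avoid negative values
    return S_hat

rng = np.random.default_rng(5)
S = reduce_late_reverb(rng.random((20, 129)) + 0.1, rng.random((8, 129)) + 0.1)
print(S.shape, (S > 0).all())
```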
processing method is shown in Fig. 2. First, the spectrum of the additive noise is estimated and noise reduction is performed. Then the reverberation is suppressed using the estimated spectra of the impulse responses. When additive noise is present, the power spectrum of Eq. (6) becomes

    |X(f, ω)|² ≈ |S(f, ω)|²|H(0, ω)|² + ∑_{d=1}^{D−1} |S(f − d, ω)|²|H(d, ω)|² + |N(f, ω)|²,     (9)

where N(f, ω) is the spectrum of the noise n(t). To suppress the noise and reverberation
...
Before introducing the MCLMS algorithm for blind channel identification, we describe which SIMO systems are blindly identifiable. A multi-channel FIR (Finite Impulse Response) system can be blindly identified primarily because of the channel diversity. As an extreme counter-example, if all channels of a SIMO system are identical, the system reduces to a Single-Input Single-Output (SISO) system and becomes unidentifiable. In addition, the source signal needs to have sufficient modes to make the channels fully excited. The following two assumptions are made to guarantee an identifiable system:
In the following, these two conditions are assumed to hold, so that we deal with a blindly identifiable FIR SIMO system.
In the absence of additive noise, we can take advantage of the fact that

    xi ∗ hj = s ∗ hi ∗ hj = xj ∗ hi,   i, j = 1, 2, · · · , N, i ≠ j,     (12)

where xn(t) is the speech signal received from the n-th channel at time t and L is the number of taps of the impulse response. Multiplying Eq. (13) by xn(t) and taking the expectation yields

    R_{xi xi}(t + 1) hj(t) = R_{xi xj}(t + 1) hi(t),   i, j = 1, 2, · · · , N, i ≠ j,     (15)

    ∑_{i=1, i≠j}^{N} R_{xi xi}(t + 1) hj(t) = ∑_{i=1, i≠j}^{N} R_{xi xj}(t + 1) hi(t),   j = 1, 2, · · · , N.     (16)
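The cross-relation in Eq. (12), which underlies Eqs. (15) and (16), can be checked numerically for a toy noise-free two-channel system with invented filters:

```python
import numpy as np

rng = np.random.default_rng(6)
s = rng.normal(size=2000)    # source signal
h1 = rng.normal(size=32)     # channel 1 impulse response
h2 = rng.normal(size=32)     # channel 2 impulse response

x1 = np.convolve(s, h1)      # x_i = s * h_i
x2 = np.convolve(s, h2)

# cross-relation: x1 * h2 = s * h1 * h2 = x2 * h1 (Eq. 12)
lhs = np.convolve(x1, h2)
rhs = np.convolve(x2, h1)
assert np.allclose(lhs, rhs)
```

Commutativity and associativity of convolution make the identity exact here; with additive noise it holds only approximately, which is why the algorithm minimizes an error criterion instead.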
Over all channels, we then have a total of N equations. In matrix form, this set of equations is
written as:
Rx + (t + 1)h(t) = 0, (17)
where
² In mathematics, the integers a and b are said to be co-prime if they have no common factor other than 1 or, equivalently, if their greatest common divisor is 1.
    R_x+(t + 1) =
    ⎡ ∑_{n≠1} R_{xn xn}(t + 1)    −R_{x2 x1}(t + 1)          · · ·  −R_{xN x1}(t + 1)         ⎤
    ⎢ −R_{x1 x2}(t + 1)           ∑_{n≠2} R_{xn xn}(t + 1)   · · ·  −R_{xN x2}(t + 1)         ⎥
    ⎢          ...                          ...               ...             ...             ⎥
    ⎣ −R_{x1 xN}(t + 1)           −R_{x2 xN}(t + 1)          · · ·  ∑_{n≠N} R_{xn xn}(t + 1)  ⎦ ,  (18)

    R̃_x+(t + 1) =
    ⎡ ∑_{n≠1} R̃_{xn xn}(t + 1)    −R̃_{x2 x1}(t + 1)          · · ·  −R̃_{xN x1}(t + 1)         ⎤
    ⎢ −R̃_{x1 x2}(t + 1)           ∑_{n≠2} R̃_{xn xn}(t + 1)   · · ·  −R̃_{xN x2}(t + 1)         ⎥
    ⎢          ...                          ...               ...             ...             ⎥
    ⎣ −R̃_{x1 xN}(t + 1)           −R̃_{x2 xN}(t + 1)          · · ·  ∑_{n≠N} R̃_{xn xn}(t + 1)  ⎦ ,  (22)
where hn(t, l) is the l-th tap of the n-th impulse response at time t. If the SIMO system is blindly identifiable, the matrix R_x+ is rank-deficient by 1 (in the absence of noise) and the channel impulse responses can be uniquely determined.
When the estimate of the channel impulse responses deviates from the true value, an error vector at time t + 1 is produced by:
By minimizing the cost function J of Eq. (23), the impulse response is blindly derived. There are various methods to minimize the cost function J, for example, the constrained Multi-Channel LMS (MCLMS) algorithm, the constrained Multi-Channel Newton (MCN) algorithm and the Variable Step-Size Unconstrained MCLMS (VSS-UMCLMS) algorithm [12, 14]. Among these methods, VSS-UMCLMS achieves a good balance between complexity and convergence speed [14]. Moreover, VSS-UMCLMS is more practical and much easier to use, since the step size does not have to be specified in advance. Therefore, in this chapter, we apply the VSS-UMCLMS algorithm to identify the multi-channel impulse responses.
The cost function J(t + 1) at time t + 1 diminishes, and its gradient with respect to ĥ(t) can be approximated as

    ∇J(t + 1) ≈ 2 R̃_x+(t + 1) ĥ(t) / ‖ĥ(t)‖².     (24)
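A single schematic update step using the approximate gradient of Eq. (24) might look as follows. This is a didactic LMS-style sketch, not the full VSS-UMCLMS algorithm: the variable step-size rule is omitted, `R_tilde` is a placeholder for R̃_x+(t + 1), and the final normalization mimics the unit-norm constraint used to avoid the trivial zero solution:

```python
import numpy as np

def mclms_step(h_hat, R_tilde, mu=0.01):
    """One gradient step on J using grad J ~ 2 R~x+ h / ||h||^2 (cf. Eq. 24)."""
    grad = 2.0 * R_tilde @ h_hat / float(h_hat @ h_hat)
    h_new = h_hat - mu * grad
    return h_new / np.linalg.norm(h_new)   # keep a unit-norm estimate

rng = np.random.default_rng(7)
R = rng.normal(size=(64, 64))
R = R @ R.T                                # symmetric positive semidefinite stand-in
h = rng.normal(size=64)
h = h / np.linalg.norm(h)                  # initial unit-norm estimate
h = mclms_step(h, R)
print(np.isclose(np.linalg.norm(h), 1.0))  # True
```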
4. Experiments
4.1. Experimental setup
The proposed dereverberation method based on spectral subtraction is evaluated on an isolated word recognition task in a simulated reverberant environment, and on a large vocabulary continuous speech recognition task in both simulated and real reverberant environments.
(a) RWCP
(b) CENSREC-4
Figure 3. Illustration of microphone array.
database were used in the isolated word recognition task). The microphone array is illustrated in Fig. 3. A four-channel circular or linear microphone array was taken from a circular + linear microphone array (30 channels). The four-channel circular microphone array had a diameter of 30 cm, with the four microphones located at equal 90° intervals. The four microphones of the linear microphone array were located at 11.32 cm intervals. Impulse responses were measured at several positions 2 m from the microphone array. The sampling frequency was 48 kHz.
For clean speech, 20 male speakers each uttered 100 isolated words with a close microphone. The 100 isolated words were phonetically balanced common words selected from the Tohoku University and Panasonic isolated spoken word database [21]. The average duration of the utterances was about 0.6 s. The sampling frequency was 12 kHz. The impulse responses sampled at 48 kHz were downsampled to 12 kHz so that they could be convolved with clean speech. The frame length was 21.3 ms, and the frame shift was 8 ms with a 256-point Hamming window. Then, 116 Japanese speaker-independent syllable-based HMMs (strictly speaking, mora-unit HMMs [22]) were trained using 27,992 utterances read by 175 male speakers from the Japanese Newspaper Article Sentences (JNAS) corpus [15]. Each continuous-density HMM had five states, four of which had output probability density functions (pdfs). Each pdf consisted of four Gaussians with full-covariance matrices. The acoustic model was common to the baseline and proposed methods, and it was trained in a clean condition. The feature space comprised 10 mel-frequency cepstral coefficients. First- and second-order derivatives of the cepstra plus first and second derivatives of the power component were also included (32 feature parameters in total).
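The 32-dimensional feature construction (10 MFCCs, their first and second derivatives, and the first and second derivatives of power) can be sketched as follows; a simple numerical gradient stands in for the usual regression-based delta formula, and the data are synthetic:

```python
import numpy as np

def deltas(feats):
    """First-order differences along the time axis (simple derivative estimate)."""
    return np.gradient(feats, axis=0)

rng = np.random.default_rng(8)
mfcc = rng.normal(size=(100, 10))    # 10 MFCCs per frame (toy values)
power = rng.normal(size=(100, 1))    # log-power per frame (toy values)

d1, d2 = deltas(mfcc), deltas(deltas(mfcc))
p1, p2 = deltas(power), deltas(deltas(power))
feats = np.hstack([mfcc, d1, d2, p1, p2])  # 10 + 10 + 10 + 1 + 1 = 32
print(feats.shape)  # (100, 32)
```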
The number of reverberant windows D in Eq. (4) was set to eight, which was determined empirically. In general, the window size D is proportional to RT60. However, the window size D is also affected by the reverberation properties, for example, the ratio of the power of the late reverberation to the power of the early reverberation. In our preliminary experiment with partial test data, the proposed method with window sizes D = 2 to 16 significantly outperformed the baseline, and the window size D = 8 achieved the best result. Automatic estimation of the optimum window size D is left for future work. The length of the Hamming window for the discrete Fourier transformation was 256 points (21.3 ms), and the overlap rate was 1/2. An illustration of the analysis window is shown in Fig. 4. For the proposed dereverberation based on spectral subtraction, previous clean power spectra estimated with a skip window were used to estimate the current clean power spectrum³. The spectrum of the impulse response Ĥ(d, ω) is estimated using the corresponding utterance to be recognized, with an average duration of about 0.6 s. No special parameters such as over-subtraction parameters were used in spectral subtraction (α = 1), except that the
³ Eq. (27) is true when using a skip window, and the spectrum of the impulse response can be blindly estimated.
subtracted value was controlled so that it did not become negative (β = 0.15). The speech
recognition performance for clean isolated words was 96.0%.
[Figure 5. Room layout of the speaker positions and the microphone array on a table: elements numbered 1–9, with 2.5 cm microphone spacing and distances of 45 cm, 90 cm, and 180 cm indicated.]
Table 3 gives the conditions for speech recognition. The acoustic models were trained with the ASJ speech databases of phonetically balanced sentences (ASJ-PB) and the JNAS. In total, around 20K sentences (clean speech) uttered by 132 speakers were used for each gender. Table 4 gives the conditions for SS-based denoising and dereverberation. The parameters shown in Table 4 were determined empirically. For the SS-based dereverberation method without background noise, the parameter α was equal to α1 and β was equal to β1. The number of reverberant windows D was set to 6 (192 ms). An illustration of the analysis window is shown in Fig. 4. The open-source LVCSR decoder "Julius" [19], based on word trigrams and triphone context-dependent HMMs, was used.
[Plot: recognition performance versus the number of reverberation windows D (2–10).]
Figure 6. Effect of the number of reverberation windows D on power SS-based dereverberation for speech recognition.
beamforming [27] is performed for all methods in this chapter. The conventional CMN
combined with delay-and-sum beamforming was used as a baseline.
The power SS-based dereverberation method based on Eq. (7) improved speech recognition significantly compared with CMN for all severe reverberant conditions, because the proposed method compensated for both the late and the early reverberation. The proposed method achieved an average relative error reduction rate of 24.5% in relation to conventional CMN with beamforming.
Results in bold indicate the best result for each array.
Table 6. Detailed results (%) for different numbers of reverberation windows D and reverberant environments
Table 7. Channel numbers corresponding to Fig. 3(a) used for dereverberation and denoising (RWCP database)
[Plot: word accuracy rate (%) versus the number of channels (2, 4, 8) for CMN, CMN+BF, the proposed method, and the proposed method+BF.]
Figure 7. Effect of the number of channels on power SS-based dereverberation for speech recognition.
evaluated the power SS-based dereverberation method for LVCSR and analyzed the effect
factor (number of reverberation windows D in Eq. (7), channel number, and length of
utterance) for compensation parameter estimation based on power SS using RWCP database.
The word accuracy rate for LVCSR with clean speech was 92.6%.
The effect of the number of reverberation windows on speech recognition is shown in Fig. 6. The detailed results for different numbers of reverberation windows D and reverberant environments (that is, different reverberation times) are shown in Table 6. The results in Fig. 6 and Table 6 were obtained without delay-and-sum beamforming. The results show
that the optimal number of reverberation windows D depends on the reverberation time. The best average result over all reverberant speech was obtained when D equals 6. The speech recognition performance with the number of reverberation windows between 4 and 10 did not vary greatly and was significantly better than the baseline.
We analyzed the influence of the number of channels on parameter estimation and delay-and-sum beamforming. Besides four channels, two and eight channels were also used to estimate the compensation parameter and perform beamforming. The channel numbers corresponding to Fig. 3(a) shown in Table 7 were used. The results are shown in Fig. 7. The speech recognition performance of the SS-based dereverberation method without beamforming was hardly affected by the number of channels; that is, the compensation parameter estimation is robust to the number of channels. Combined with beamforming, the more channels that are used, the better the speech recognition performance.
Thus far, the whole utterance has been used to estimate the compensation parameter. The effect of the length of utterance used for parameter estimation was investigated, with the results shown in Fig. 8. The longer the utterance used, the better the speech recognition performance. Speech recognition did not deteriorate when the length of the utterance used for parameter estimation was greater than 1 s. The speech recognition performance of the SS-based dereverberation method is better than the baseline even if only 0.1 s of the utterance is used to estimate the compensation parameter.
We also compared the power SS-based dereverberation method on LVCSR in different simulated reverberant environments. The experimental results are shown in Fig. 9. Naturally, the speech recognition rate deteriorated as the reverberation time increased. Using the SS-based dereverberation method, the reduction in the speech recognition rate was smaller than with conventional CMN, especially for impulse responses with a long reverberation time. For the RWCP database, the SS-based dereverberation method achieved a relative word recognition error reduction rate of 19.2% compared with CMN with delay-and-sum beamforming. We also conducted an LVCSR experiment with SS-based dereverberation under different reverberant conditions (CENSREC-4), with reverberation times between 0.25 and 0.75 s and a distance between microphone and sound source of 0.5 m. A similar trend to the above results was observed. Therefore, the SS-based dereverberation method is robust to various reverberant
conditions for both isolated word recognition and LVCSR. The reason is that the SS-based dereverberation method can compensate for late reverberation through SS using an estimated power spectrum of the impulse response.

[Bar charts of word accuracy rate (%) versus reverberation time (0.30 s to 1.30 s, and the average) for CMN+BF and the proposed method+BF; labeled values of 50.3% and 38.5% appear in the charts.]
Figure 9. Word accuracy for LVCSR in different simulated reverberant environments: (a) RWCP database; (b) CENSREC-4 database.
(b) Results of GSS-based method in the simulated reverberant environment
In this section, reverberation and noise suppression using only 2 speech channels is described.
In both power SS-based and GSS-based dereverberation methods, speech signals from two
microphones were used to estimate blindly the compensation parameters for the power SS
and GSS (that is, the spectra of the channel impulse responses), and then reverberation was
suppressed by SS and the spectrum of dereverberant speech was inverted into a time domain.
Finally, delay-and-sum beamforming was performed on the two-channel dereverberant
speech.
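Delay-and-sum beamforming, applied here to the two-channel dereverberant speech, can be sketched with integer-sample delays (real implementations estimate and compensate fractional delays; the signals below are synthetic):

```python
import numpy as np

def delay_and_sum(channels, delays):
    """Align each channel by its integer delay (in samples) and average."""
    n = min(len(c) - d for c, d in zip(channels, delays))
    return sum(c[d:d + n] for c, d in zip(channels, delays)) / len(channels)

rng = np.random.default_rng(9)
s = rng.normal(size=1000)
ch1 = np.concatenate([np.zeros(3), s])   # channel 1: source delayed by 3 samples
ch2 = np.concatenate([np.zeros(7), s])   # channel 2: source delayed by 7 samples

out = delay_and_sum([ch1, ch2], delays=[3, 7])
assert np.allclose(out, s)               # coherent sum recovers the source
```

Aligning the channels makes the desired signal add coherently while uncorrelated noise averages down, which is why recognition improves as more channels are combined.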
The results of the power SS-based method and the GSS-based method without background noise are compared in Table 8. "Distorted speech #" in Table 8 corresponds to "array no." in Table 1. The speech recognition performance was drastically degraded under reverberant conditions because conventional CMN did not suppress the late reverberation. Delay-and-sum beamforming with CMN (41.91%) could not markedly improve the speech recognition performance because of the small number of microphones and the small distance between the microphone pair. In contrast, the power SS-based dereverberation using Eq. (7) markedly improved the speech recognition performance. The GSS-based dereverberation using Eq. (8) improved speech recognition performance significantly compared with the power SS-based dereverberation and CMN for all reverberant conditions. The GSS-based method achieved an average relative word error reduction rate of 31.4% compared to conventional CMN and 9.8% compared to the power SS-based method.
Table 9 shows the speech recognition results for the power SS and GSS-based denoising and dereverberation methods for the simulated noisy and reverberant speech. "Distorted speech #", "DN" and "DNR" in Table 9 denote the "array #" in Table 1, "denoising", and "denoising and dereverberation", respectively. The speech recognition performance of conventional CMN was drastically degraded owing to the noisy and reverberant conditions and the fact that CMN did not suppress the late reverberation. The power SS-based DN improved speech recognition performance significantly compared to CMN for all reverberant conditions. The GSS-based DN using Eq. (11), however, did not improve the speech recognition performance compared to the power SS-based DN. On the other hand, the power SS-based DNR achieved a marked improvement in the speech recognition performance compared with that of CMN. The GSS-based DNR using Eq. (10) improved speech recognition performance significantly compared to both the CMN method and the power SS-based DNR for almost all reverberant conditions.
(c) Results in the real noisy reverberant environment
Table 10 shows the speech recognition results for the real noisy reverberant speech under
the same conditions as the simulated noisy reverberant speech. The word accuracy rate for
close-talking speech recorded in a real environment was 88.3%. We investigated the best
channel combination in the real environment and the best speech recognition performance
was obtained when channels 6, 7, 8, and 9 described in Fig. 5 were used. Therefore,
this channel combination was used in this study. Power SS-based DN and GSS-based DN
achieved a smaller improvement in recognition performance compared with the simulated
noisy reverberant environment because the type of background noise in the real environment
was different from that in the simulated environment. On the other hand, the power
SS-based DNR markedly improved the speech recognition performance compared to CMN.
The GSS-based DNR improved speech recognition performance significantly compared to
both the CMN method and the power SS-based DNR for almost all speakers. The GSS-based
DNR achieved an average relative word error reduction rate of 39.1% and 11.5% compared
to conventional CMN and power SS-based DNR, respectively. These results show that our
proposed method is also effective in a real environment under the same denoising and
dereverberation conditions as the simulated noisy reverberant environment.
5. Conclusion
In this chapter, we proposed a blind dereverberation method based on spectral subtraction
for hands-free speech recognition. We treated the late reverberation as additive noise,
and a noise reduction technique based on spectral subtraction was applied to compensate for
the late reverberation. The early reverberation was normalized by CMN. The time-domain
MCLMS algorithm was extended to blindly estimate the spectrum of the impulse response
for spectral subtraction in the frequency domain. We evaluated our proposed methods on an
isolated word recognition task and an LVCSR task. The proposed spectral subtraction based on
multi-channel LMS significantly outperformed the conventional CMN. For the isolated word
recognition task, a relative error reduction rate of 24.5% in relation to the conventional CMN
was achieved. For the LVCSR task without background noise, the proposed method achieved
an average relative word error reduction rate of 31.5% compared to conventional CMN in
the simulated reverberant environment. We also presented a denoising and dereverberation
method based on spectral subtraction and evaluated it in both the simulated noisy reverberant
environment and the real noisy reverberant environment. The GSS-based method achieved an
average relative word error reduction rate of 39.1% and 11.5% compared to conventional CMN
and power SS-based method, respectively. These results show that our proposed method is
also effective in a real noisy reverberant environment.
In this chapter, we also investigated the factors affecting compensation parameter estimation
(the number of reverberation windows, the number of channels, and the length of the utterance). We reached
the following conclusions: 1) the speech recognition performance with the number of
reverberation windows between 4 and 10 did not vary greatly and was significantly better
than the baseline; 2) the compensation parameter estimation was robust to the number
of channels; and 3) degradation of speech recognition did not occur with the length of
utterance used for parameter estimation longer than 1 s. We also compared the SS-based
dereverberation method on LVCSR in different simulated reverberant environments. A
similar trend was observed.
Dereverberation Based on Spectral Subtraction by Multi-Channel LMS Algorithm for Hands-Free Speech Recognition
Author details
Longbiao Wang, Kyohei Odani and Atsuhiko Kai
Shizuoka University, Japan
Norihide Kitaoka
Nagoya University, Japan
Seiichi Nakagawa
Toyohashi University of Technology, Japan
6. References
[1] Avendano, C. & Hermansky, H. (1996). Study on the dereverberation of speech based on
temporal envelope filtering. Proceedings of ICSLP-1996, pp. 889-892, Philadelphia, USA,
October 1996.
[2] Chen, H., Cao, X., & Zhu, J. (2002). Convergence of stochastic-approximation-based
algorithms for blind channel identification. IEEE Trans. Information Theory, Vol. 48, 2002,
pp. 1214-1225.
[3] Couvreur, L. & Couvreur, C. (2004). Blind model selection for automatic speech
recognition in reverberant environments. Journal of VLSI Signal Processing, Vol. 36, No.
2-3, February/March 2004, pp. 189-203.
[4] Delcroix, M., Hikichi, T. & Miyoshi, M. (2006). On a blind speech dereverberation
algorithm using multi-channel linear prediction. IEICE Transactions on Fundamentals of
Electronics, Communications and Computer Sciences, Vol. E89-A, No. 10, October 2006, pp.
2837-2846.
[5] Delcroix, M., Hikichi, T. & Miyoshi, M. (2007). Precise dereverberation using
multi-channel linear prediction. IEEE Transactions on Audio, Speech, and Language
Processing, Vol. 15, No. 2, February 2007, pp. 430-440.
[6] Furui, S. (1981). Cepstral analysis technique for automatic speaker verification. IEEE
Transactions on Acoustics, Speech and Signal Processing, Vol. 29, No. 2, 1981, pp. 254-272.
[7] Gannot, S. & Moonen, M. (2003). Subspace methods for multimicrophone speech
dereverberation. EURASIP Journal on Applied Signal Processing, October 2003, pp.
1074-1090.
[8] Gillespie, B. W., Malvar, H. S. & Florencio, D. A. F. (2001). Speech dereverberation via
maximum-kurtosis subband adaptive filtering, Proceedings of ICASSP-2001, Vol. 6, pp.
3701-3704, Salt Lake City, USA, May 2001.
[9] Habets, E. A. P. (2004). Single-channel speech dereverberation based on spectral
subtraction, Proceedings of the 15th Annual Workshop on Circuits, Systems and Signal
Processing (ProRISC-2004), pp. 250-254, Veldhoven, Netherlands, November 2004.
[10] Hermansky, H. & Morgan, N. (1994). RASTA processing of speech. IEEE Transactions on
Speech and Audio Processing, Vol. 2, No. 4, October 1994, pp. 578-589.
[11] Hermansky, H., Wan, E. A. & Avendano, C. (1995). Speech enhancement based on
temporal processing, Proceedings of ICASSP-1995, pp. 405-408, Detroit, USA, May 1995.
[12] Huang, Y. & Benesty, J. (2002). Adaptive multichannel least mean square and Newton
algorithms for blind channel identification. Signal Processing, Vol. 82, No. 8, August 2002,
pp. 1127-1138.
[13] Huang, Y., Benesty, J. & Chen, J. (2005). Optimal step size of the adaptive multi-channel
LMS algorithm for blind SIMO identification. IEEE Signal Processing Letters, Vol. 12, No.
3, March 2005, pp. 173-176.
[14] Huang, Y., Benesty, J. & Chen, J. (2006). Acoustic MIMO Signal Processing,
Springer-Verlag, ISBN 978-3-540-37630-9, Berlin, Germany.
[15] Itou, K., Yamamoto, M., Takeda, K., Takezawa, T., Matsuoka, T., Kobayashi, T., Shikano,
K. & Itahashi, S. (1999). JNAS: Japanese speech corpus for large vocabulary continuous
speech recognition research. Journal of the Acoustical Society of Japan (E), Vol. 20, No. 3,
May 1999, pp. 199-206.
[16] Jin, Q., Pan, Y. & Schultz, T. (2006). Far-field speaker recognition, Proceedings of
ICASSP-2006, pp. 937-940, Toulouse, France, May 2006.
[17] Jin, Q., Schultz, T. & Waibel, A. (2007). Far-field speaker recognition. IEEE Transactions on
Audio, Speech, and Language Processing, Vol. 15, No. 7, September 2007, pp. 2023-2032.
[18] Kinoshita, K., Delcroix, M., Nakatani, T. & Miyoshi, M. (2009). Suppression of late
reverberation effect on speech signal using long-term multiple-step linear prediction.
IEEE Transactions on Audio, Speech, and Language Processing, Vol. 17, No. 4, May 2009, pp.
534-545.
[19] Lee, A., Kawahara, T. & Shikano, K. (2001). Julius—an open source real-time
large vocabulary recognition engine. Proceedings of European Conference on Speech
Communication and Technology, September 2001, pp. 1691-1694.
[20] Maganti, H. & Matassoni, M. (2010). An auditory modulation spectral feature
for reverberant speech recognition, Proceedings of INTERSPEECH-2010, pp. 570-573,
Makuhari, Japan, September 2010.
[21] Makino, S., Niyada, K., Mafune, Y. & Kido, K. (1992). Tohoku University and Panasonic
isolated spoken word database. Journal of the Acoustical Society of Japan, Vol. 48, No. 12,
December 1992, pp. 899-905 (in Japanese).
[22] Nakagawa, S., Hanai, K., Yamamoto, K. & Minematsu, N. (1999). Comparison of
syllable-based HMMs and triphone-based HMMs in Japanese speech recognition.
Proceedings of International Workshop on Automatic Speech Recognition and Understanding,
1999, pp. 393-396.
[23] Nakamura, S., Hiyane, K., Asano, F. & Nishiura, T. (2000). Acoustical sound database in
real environments for sound scene understanding and hands-free speech recognition,
Proceedings of LREC-2000, pp. 965-971.
[24] Nakayama, M., Nishiura, T., Denda, Y., Kitaoka, N., Yamamoto, K., Yamada, T.,
Tsuge, S., Miyajima, C., Fujimoto, M., Takiguchi, T., Tamura, S., Ogawa, T., Matsuda,
S., Kuroiwa, S., Takeda, K. & Nakamura, S. (2008). CENSREC-4: Development
of evaluation framework for distant-talking speech recognition under reverberant
environments, Proceedings of INTERSPEECH-2008, pp. 968-971, Brisbane, Australia,
September 2008.
[25] Raut, C., Nishimoto, T. & Sagayama, S. (2006). Adaptation for long convolutional
distortion by maximum likelihood based state filtering approach, Proceedings of
ICASSP-2006, pp. 1133-1136, Toulouse, France, May 2006.
[26] Sim, B. L., Tong, Y. C. & Chang, J. S. (1998). A parametric formulation of the generalized
spectral subtraction method. IEEE Transactions on Speech and Audio Processing, Vol. 6, No.
4, July 1998, pp. 328-337.
[27] Van Veen, B. & Buckley, K. (1988). Beamforming: A versatile approach to spatial
filtering. IEEE ASSP Magazine, Vol. 5, No. 2, April 1988, pp. 4-24.
Section 2
Speech Enhancement
Chapter 8
Improvement on Sound Quality of the Body Conducted Speech from Optical Fiber Bragg Grating Microphone
http://dx.doi.org/10.5772/47844
1. Introduction
Speech communication can be impaired by the wide range of noise conditions present in air.
Researchers in the field of speech applications have long investigated how to improve signal
extraction and recognition performance under such conditions. However, it is still not
possible to measure clean speech in environments with low Signal-to-Noise Ratios (SNR) of
about 0 dB or less (H. Hirsch and D. Pearce, 2000). Standard evaluation frameworks, such as
CENSREC (N. Kitaoka et al., 2006) and AURORA (H. Hirsch and D. Pearce, 2000), are typically
used to assess speech recognition performance in noisy environments; they have shown that
recognition rates fall to approximately 50-80% under the influence of noise, demonstrating
the difficulty of achieving high accuracy. Against this background, many signal extraction
and retrieval methods have been proposed in previous research. One approach to signal
extraction is body-conducted speech (BCS), which is little affected by airborne noise but
does not capture frequency components above about 2 kHz. Conventional retrieval methods
for improving the sound quality of body-conducted speech include the Modulation Transfer
Function (MTF), Linear Predictive Coefficients (LPC), direct filtering, and the use of a
throat microphone (T. Tamiya and T. Shimamura, 2006) (T. T. Vu et al., 2006) (Z. Liu et al.,
2004) (S. Dupont et al., 2004). However, these methods require normal speech, or parameters
measured simultaneously with the body-conducted speech, which cannot be obtained in noisy
environments. The authors have therefore been investigating the use of body-conducted
speech, commonly called bone-conducted speech, in which the signal is conducted through the
skin and bone of the human body (S. Ishimitsu, 2008) (M. Nakayama et al., 2011). As a
state-of-the-art application, the research field has expanded to speech communication
between a patient and an operator in a Magnetic Resonance Imaging (MRI) room, which is a
noisy sound environment with a strong magnetic field (A. Moelker et al., 2005). Conventional
microphones, such as accelerometers composed of magnetic materials, are not allowed in this
environment, which therefore requires a special microphone made of non-magnetic material.
For this environment, the authors proposed a speech communication system that uses a BCS
microphone based on an optical fiber Bragg grating (OFBG microphone) (M. Nakayama et al.,
2011). It is composed entirely of non-magnetic materials, is therefore suitable for the
environment, and should provide clear signals when combined with our retrieval method.
Previous research using an OFBG microphone demonstrated the effectiveness of signal
extraction in an MRI room, and its speech recognition performance was evaluated using an
acoustic model constructed from speaker-independent normal speech (M. Nakayama et al.,
2011). It was concluded that an OFBG microphone can produce a clear signal with improved
performance relative to that acoustic model. The original OFBG microphone signal enabled
conversation, but listeners experienced some strain because of its low sound quality. One
aim of this research is therefore to improve that quality with our retrieval method, which
combines differential acceleration and noise reduction.
In this chapter, we present experiments and discussions on body-conducted speech measured
with an accelerometer and an OFBG microphone, as a state-of-the-art topic in the field of
signal extraction under noisy environments. In particular, we evaluate the microphones,
examine signal retrieval with the proposed method, and apply the method to sentence-length
signals to estimate and recover their sound quality.
Figure 1. Accelerometer
acoustic models estimated by HTK with JNAS to evaluate closeness of signals when highest
recognition performance is achieved (S. Young et al., 2000) (K. Itou et al, 1999).
x_differential(i) is the differential acceleration signal calculated from each frame of a BCS.
Because of the low gain of its amplitude, it must be adjusted to a suitable level for
listening or further processing. Figure 7 shows the differential acceleration estimated from
Figure 6 using Formula (1), with the gain adjusted. The differential acceleration signal
appears to be composed of speech mixed with stationary noise, so we expected to be able to
remove the noise almost completely with a noise reduction method, because the signal has a
high SNR compared with the original signal. Consequently, we proposed a signal estimation
method using differential acceleration and a conventional noise reduction method
(M. Nakayama et al., 2011).
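Formula (1) itself is not reproduced in this excerpt; assuming the differential acceleration is a plain first-order difference taken frame by frame and then rescaled (the frame length and the gain rule below are illustrative assumptions), a minimal sketch is:

```python
import numpy as np

def differential_acceleration(bcs, frame_len=256, gain=None):
    """First-order difference of a body-conducted speech signal,
    computed frame by frame and rescaled to a usable level.

    The exact form of the chapter's Formula (1) is not shown here;
    a plain difference x[i] - x[i-1] per frame is assumed.
    """
    frames = [bcs[i:i + frame_len]
              for i in range(0, len(bcs) - frame_len + 1, frame_len)]
    # prepend=f[0] keeps each frame at its original length (first diff = 0).
    diff = np.concatenate([np.diff(f, prepend=f[0]) for f in frames])
    if gain is None:
        # Match the peak level of the original signal.
        peak = np.max(np.abs(diff))
        gain = np.max(np.abs(bcs)) / peak if peak > 0 else 1.0
    return gain * diff

# Toy BCS-like signal: a low-frequency tone plus a small offset.
t = np.arange(1024) / 8000.0
bcs = 0.5 * np.sin(2 * np.pi * 200 * t) + 0.1
x_diff = differential_acceleration(bcs, frame_len=256)
```

Differencing attenuates the low-frequency body-conduction rumble relative to the speech content, which is why the result needs a gain adjustment before listening or further processing.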
H_Estimate(ω) = H_Speech(ω) / ( H_Speech(ω) + H_Noise(ω) )        (2)
An estimated spectrum H_Estimate(ω) can be converted into a retrieval signal from the
differential acceleration signal. It is calculated from the speech spectrum H_Speech(ω) and
the noise spectrum H_Noise(ω). In particular, H_Speech(ω) is calculated from autocorrelation
functions and linear prediction coefficients using the Levinson-Durbin algorithm (J. Durbin,
1960), and H_Noise(ω) is then estimated using autocorrelation functions.
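A minimal sketch of this estimation, assuming biased autocorrelation estimates, an LPC order of 12, and an FFT length of 512 (illustrative choices, not the chapter's exact settings):

```python
import numpy as np

def levinson_durbin(r, order):
    """Solve for LPC coefficients a = [1, a1, ..., a_order]
    from autocorrelation values r[0..order]."""
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] + np.dot(a[1:i], r[i - 1:0:-1])
        k = -acc / err
        new_a = a.copy()
        new_a[1:i] = a[1:i] + k * a[i - 1:0:-1]  # reflection update
        new_a[i] = k
        a = new_a
        err *= (1.0 - k * k)
    return a, err

def lpc_magnitude_spectrum(signal, order=12, nfft=512):
    """All-pole (LPC) magnitude spectrum: sqrt(err) / |A(e^{jw})|."""
    r = np.correlate(signal, signal, mode="full")[len(signal) - 1:]
    a, err = levinson_durbin(r[:order + 1], order)
    A = np.fft.rfft(a, nfft)
    return np.sqrt(max(err, 1e-12)) / np.maximum(np.abs(A), 1e-12)

rng = np.random.default_rng(0)
n = 2048
t = np.arange(n) / 8000.0
speech_like = np.sin(2 * np.pi * 500 * t) + 0.01 * rng.standard_normal(n)
noise_like = 0.1 * rng.standard_normal(n)

h_speech = lpc_magnitude_spectrum(speech_like)
h_noise = lpc_magnitude_spectrum(noise_like)

# Wiener-style gain of Eq. (2): emphasizes frequencies where the
# speech model dominates the noise model.
h_estimate = h_speech / (h_speech + h_noise)
```

The gain `h_estimate` stays between 0 and 1 by construction, approaching 1 where the modeled speech spectrum dominates and 0 where the noise model dominates.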
4.3. Evaluations
Signal retrieval for a signal measured by an OFBG microphone is performed using the same
parameters as in the method above, because the propagation path of body-conducted speech
through the human body is not affected by either quiet or noisy environments. Figure 8 shows
a retrieval signal obtained from Figure 7 using a Wiener-filtering method, where the linear
prediction coefficients and autocorrelation functions are set to 1 and the frame width is
764 samples. These procedures were repeated five times on the signal to remove stationary
noise. In the retrieved signal, high-frequency components of 2 kHz and above were recovered
with these settings. The proposed method can therefore also be applied to obtain a clear
signal from body-conducted speech measured with an OFBG microphone in a noisy,
high-magnetic-field environment.
word differs from a speaker in a former section. Noise levels within the engine room, under
the two conditions of anchorage and cruising, were 93 and 98 dB SPL, respectively, and the
corresponding SNRs measured at the microphone were -20 and -25 dB. In this research, signals
recorded under the cruising condition were used to estimate retrieval signals.
A 22-year-old male uttered sentence A01 from the ATR503 sentence database; this sentence is
commonly used in speech recognition research and applications (M. Abe et al., 1991). The
sentence is composed of the following sub-word units (morae):
/a/ /ra/ /yu/ /ru/ /ge/ /N/ /ji/ /tsu/ /wo/ /su/ /be/ /te/ /ji/ /bu/ /N/ /no/ /ho/ /u/ /he/ /ne/ /ji/
/ma/ /ge/ /ta/ /no/ /da/
(a) Main engine of Oshima-maru (b) Signal recording in the engine room
Figures 10 and 11 show a speech and a body-conducted speech signal in sentence units,
measured by a conventional microphone and an accelerometer in a quiet room while a
22-year-old male uttered the sentence. Although the accelerometer was held with the fingers,
the sounds were measured clearly because it was pressed firmly against the upper lip with a
suitable pressure. Figure 12 shows the differential acceleration derived from Figure 11; it
becomes a clear signal with little noise because the BCS has a high SNR.
Figures 13 and 14 show a speech and a body-conducted speech signal in sentence units in the
noisy environment. The speech is completely swamped by the intense noise from the engine
and generators. On the other hand, the body-conducted speech in Figure 14 is only slightly
affected by the noise and can still be measured. Because the signal in Figure 14 has a low
SNR, the performance of signal retrieval from the differential acceleration in Figure 15 is
considered to be reduced. Figure 16 shows that signal retrieval from the differential
acceleration works well when the procedure is applied four times, since this is sufficient
to recover the frequency characteristics. As a result, it is concluded that the
body-conducted speech is as clear as possible without noise disturbance.
The performance of the signal retrieval method on sentences was then evaluated for two
microphones, an accelerometer and an OFBG microphone, and its effectiveness was confirmed by
time-frequency analysis and speech recognition. Against this background, we investigated the
estimation of clear body-conducted speech in sentence units from an OFBG microphone with our
signal retrieval method, which combines differential acceleration and noise reduction.
Applying the method to the measured signal recovered its sound quality, which was evaluated
using time-frequency analysis. The retrieval method can thus also be applied to a signal
measured by an OFBG microphone with the same settings, because its conduction path is not
affected by airborne noise. The signals were measured in quiet and noisy rooms, specifically
an engine room and an MRI room. Clear signals were obtained by employing the signal
retrieval method with the same settings used for the word units as a first step. To obtain a
clearer signal with the signal retrieval method, the pressure at which the microphone is
held is important, so that the original BCS has a high SNR.
As future work, the signal retrieval method needs to be extended for practical use, and the
algorithm needs further improvement.
Author details
Masashi Nakayama
Kagawa National College of Technology, Japan
National Institute of Advanced Industrial Science and Technology (AIST), Japan
Shunsuke Ishimitsu
Hiroshima City University, Japan
Seiji Nakagawa
National Institute of Advanced Industrial Science and Technology (AIST), Japan
Acknowledgement
The authors thank Mr. K. Oda, Mr. H. Nagoshi and his colleagues in Ishimitsu laboratory of
Hiroshima City University, members of the Living Informatics Research Group, Health
Research Institute, National Institute of Advanced Industrial Science and Technology (AIST)
for their support in the signal recording, and crew members of the training ship, Oshima-
maru, Oshima National College of Maritime Technology.
7. References
A. Lee, T. Kawahara, and K. Shikano (2001). Julius - an open source real-time large
vocabulary recognition engine, in Proceedings of European Conference on Speech
Communication and Technology (EUROSPEECH), pp. 1691-1694
A. Moelker, R. A. J. J. Maas, M. W. Vogel, M. Ouhlous, and P. M. T. Pattynama (2005).
Importance of bone-conducted sound transmission on patient hearing in the MR
scanner, Journal of Magnetic Resonance Imaging, Volume 22, Issue 1, pp.163-169
D. Li, and D. O’Shaughnessy (2003). Speech Processing: A Dynamic and Optimization-
Oriented Approach, Marcel Dekker Inc.
H. Hirsch, and D. Pearce (2000). The AURORA experimental framework for the
performance evaluation of speech recognition systems under noisy conditions, in
proceedings of ISCA ITRW ASR2000, pp.181-188
J. Durbin (1960). The Fitting of Time-Series Models, Review of the International Statistical
Institute, Vol.28 No.3, pp. 233-244
Z. Liu, Z. Zhang, A. Acero, J. Droppo, and X. Huang (2004). Direct Filtering for Air- and
Bone-Conductive Microphones, in proceedings of IEEE International Workshop on
Multimedia Signal Processing (MMSP’04), pp.363-366
Chapter 9
Esophageal Speech Enhancement Using a Feature Extraction Method Based on Wavelet Transform
http://dx.doi.org/10.5772/49943
1. Introduction
People who suffer from diseases such as throat cancer may require the surgical removal of
their larynx and vocal cords, and then need rehabilitation in order to reintegrate into
their individual, social, family and work activities. To accomplish this, different methods
have been suggested, such as esophageal speech, the use of tracheoesophageal prosthetics,
and the Artificial Larynx Transducer (ALT), also known as the “electronic larynx” [1, 2].
The ALT, which has the shape of a handheld device, introduces an excitation into the vocal
tract by applying a vibration against the external walls of the neck. The excitation is then
modulated by the movement of the oral cavity to produce the speech sound. This transducer
is attached to the speaker’s neck, and in some cases to the speaker’s cheeks. The ALT is
widely recommended by voice rehabilitation physicians because it is very easy to use, even
for new patients, although the voice produced by these transducers is unnatural and of low
quality, and it is further distorted by the background noise the ALT itself produces. Thus,
the ALT results in a considerable degradation of the quality and intelligibility of speech,
a problem for which an optimal solution has not yet been found [2].
Esophageal speech, on the other hand, is produced by the compression of the air contained in
the vocal tract, from the stomach to the mouth through the esophagus. This air is swallowed,
and it produces a vibration of the upper esophageal muscle as it passes through the
esophageal-larynx segment, producing the speech. The generated sound is similar to a burp:
the tone is commonly very low and the timbre is generally harsh. As in ALT-produced speech,
the voiced segments of esophageal speech are the most affected parts of the speech within a
word or phrase, resulting in unnatural speech. Thus, many efforts have been made to improve
its quality and intelligibility.
Several approaches have been proposed to improve the quality and intelligibility of
alaryngeal speech, esophageal as well as ALT produced speech [2, 3].
This chapter presents an alaryngeal speech enhancement system that uses several speech
recognition methods, such as voiced and unvoiced segment detection, a feature extraction
method, and pattern recognition algorithms.
4. Classifier: The parameter vector obtained in the feature extraction stage is supplied to a
classifier. The classification stage consists of neural networks, which identify the
voiced segments present in the segment under analysis.
5. Voice synthesis: The detected voiced segments are replaced by voiced segments of a
normal speaker and concatenated with the unvoiced and silent segments to produce the
restored speech.
6. Results: Finally, using objective and subjective evaluation methods, it is shown that the
proposed system provides a fairly good improvement in the quality and intelligibility
of alaryngeal speech signals.
2. Methods
Figure 1 shows a block diagram of the proposed system. It is based on the replacement of
voiced segments of alaryngeal speech by their equivalent normal speech voiced segments,
while keeping the unvoiced and silence segments without change. The main reason is that
the voiced segments have a more significant impact on the speech quality and intelligibility
than the unvoiced segments.
2.2. Preprocessing
The digital signal is low-pass filtered to reduce the background noise. This stage is
implemented by a 200th-order digital FIR filter with a cut-off frequency of 900 Hz. A common
practice in speech recognition is the use of a pre-emphasis filter to amplify the
higher-frequency components of the signal, emulating the additional sensitivity of the human
ear to high frequencies. Generally, a high-pass filter characterized by a slope of 20 dB per
decade is used [4].
A Hamming window is applied to the segmented signal so that the extreme samples of each
segment have less weight than the central samples. The window’s length is chosen to be
larger than the frame interval, preventing a loss of information that could take place
during the transitions from one frame to the next.
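A minimal framing sketch under assumed values (a 200-sample Hamming window with an 80-sample frame shift, i.e. 25 ms / 10 ms at 8 kHz); the chapter does not state its exact window and shift lengths:

```python
import numpy as np

def frame_signal(x, frame_shift=80, win_len=200):
    """Split x into overlapping Hamming-windowed frames.

    The window (win_len) is longer than the frame interval
    (frame_shift), so adjacent frames overlap and transitions
    between frames are not lost.
    """
    win = np.hamming(win_len)
    starts = range(0, len(x) - win_len + 1, frame_shift)
    return np.array([x[s:s + win_len] * win for s in starts])

frames = frame_signal(np.ones(800))
```

Because the Hamming window tapers toward its edges, the samples shared by two overlapping frames are emphasized in one frame where they are attenuated in the other.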
Several approaches have been proposed to detect the voiced segments of speech signals.
However, the use of a single decision criterion to determine whether a speech segment is
voiced or unvoiced is not enough; thus, most algorithms in the speech processing area combine
more than one criterion. The proposed speech restoration method combines the average energy,
the zero-crossing rate, and formant analysis of the speech signal for voiced/unvoiced
segment classification.
The formants are obtained from the roots of the polynomial generated by the linear
prediction coefficients (LPC) that represent the vocal tract filter. Once the formants,
whose frequencies are defined by the angles of the roots closest to the unit circle, are
obtained, they are ordered in ascending form and the first three formants are chosen as
parameters of the speech segment. These formants are then stored in the system so that they
can be employed to make the voiced/unvoiced decision. The amplitude at each formant
frequency can be obtained using the normalized Fast Fourier Transform (FFT).
To decide whether a segment is voiced or not, the formant amplitudes are normalized over
each 100-millisecond segment. The algorithm finds the maximum value of each formant among
the 10 values stored for each fragment, and each value is then divided by the estimated
maximum value, as shown in (1).
AF1 = [ AF1_1 / AF1_Max,  AF1_2 / AF1_Max,  ...,  AF1_10 / AF1_Max ]
AF2 = [ AF2_1 / AF2_Max,  AF2_2 / AF2_Max,  ...,  AF2_10 / AF2_Max ]        (1)
AF3 = [ AF3_1 / AF3_Max,  AF3_2 / AF3_Max,  ...,  AF3_10 / AF3_Max ]
The local normalization process is justified for esophageal speakers because of the loss of
energy as they speak. Once the normalized values are obtained, the decision is made using an
experimental threshold value equal to 0.25. This can be seen as a logic mask in the
algorithm: normalized values greater than 0.25 are set to one, and the rest are set to zero,
as shown in (2).
AFx_N = 0   if  AFx_N / AFx_Max < 0.25
                                                                            (2)
AFx_N = 1   if  AFx_N / AFx_Max >= 0.25
Next, a logical AND operation is applied across the three formant arrays using the values
obtained after the threshold operation. Only the segments in which all three formants have
values above 0.25 are considered to be voiced segments.
Finally, using the three criteria mentioned above, a window is applied to the original
signal which is equal to one if the segment is classified as voiced by the three methods,
and equal to zero otherwise, so that only the voiced segments of the original signal are
obtained.
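The normalization of Eq. (1), the 0.25 threshold of Eq. (2), and the AND across the three formants can be sketched as follows, assuming the amplitudes of the first three formants are collected in a 3×10 array per 100 ms segment (the example amplitudes are hypothetical):

```python
import numpy as np

def voiced_mask(formant_amps, threshold=0.25):
    """formant_amps: array of shape (3, 10) -- amplitudes of the first
    three formants over the 10 frames of a 100 ms segment.

    Applies the local normalization of Eq. (1), the 0.25 threshold
    of Eq. (2), and a logical AND across the three formants: a frame
    is voiced only if all three normalized amplitudes exceed 0.25.
    """
    amps = np.asarray(formant_amps, dtype=float)
    maxima = amps.max(axis=1, keepdims=True)        # per-formant maximum
    normalized = amps / np.maximum(maxima, 1e-12)   # Eq. (1)
    over = normalized > threshold                   # Eq. (2) logic mask
    return np.logical_and.reduce(over, axis=0)      # AND across formants

amps = np.array([
    [0.9, 0.8, 0.1, 0.7, 0.05, 0.9, 0.8, 0.2, 0.6, 0.1],
    [0.8, 0.9, 0.1, 0.6, 0.10, 0.8, 0.9, 0.1, 0.7, 0.1],
    [0.7, 0.7, 0.2, 0.8, 0.05, 0.7, 0.6, 0.1, 0.9, 0.1],
])
mask = voiced_mask(amps)
```

Multiplying the original signal by this mask (expanded to sample resolution) keeps only the frames classified as voiced, as described in the text.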
Frequencies Analysis, Mel Frequency Cepstral Coefficients (MFCC) among others [5,6]. This
section discusses these methods and proposes one based on Wavelet Transform.
ŝ_n = Σ_{k=1}^{p} a_k s_{n−k}        (3)
where a_k (1 ≤ k ≤ p) is a set of real constants known as predictor coefficients, which must
be calculated, and p is the predictor order. The problem of linear prediction consists of
finding the predictor coefficients a_k that minimize the error between the real value of the
function and the approximated function.
The developed algorithm takes each segment of 10 milliseconds and calculates its linear
prediction coefficients. The number of predictor coefficients is obtained by substituting the
sampling frequency value (fs) in (4).
p = 4 + fs / 1000 = 4 + 8000 / 1000 = 12        (4)
The minimal-error sequence can be interpreted as the output of the filter H(z) when it is
excited by the signal s_n; H(z) is usually known as the inverse filter. The approximated
transfer function can be obtained if the transfer function S(z) of the signal is assumed to
be modeled as an all-pole filter of the form (5).
Ŝ(z) = A / H(z) = A / ( 1 − Σ_{k=1}^{p} a_k z^{−k} )        (5)
The LPC coefficients correspond to the poles of Ŝ(z). Therefore, LPC analysis aims to
calculate the filter properties of the vocal tract that produces the voiced signal.
If the spectrum of a speech signal can be approximated by its poles alone, then the formants
can be obtained from the poles of Ŝ(z). The poles of Ŝ(z) can be calculated by setting the
denominator of (5) to zero and solving for its roots. The conversion to the s-plane is done
by substituting z by e^{s_k T}, where s_k is the pole in the s-plane. The resultant roots are
generally complex-conjugate pairs.
Formant frequencies are obtained from the roots of the polynomial generated by the linear
prediction coefficients. The formant frequency is defined by the angle of the root closest
to the unit circle. A root with an angle close to zero indicates the existence of a formant
near the origin, while a root whose angle is in close proximity to π indicates that the
formant is located near the maximum frequency, in this case 4000 Hz. Since the frequency
domain is symmetric with respect to the vertical axis, the roots located in the lower half
of the z-plane can be ignored.
Let r_p be a linear prediction coefficient root with real part σ_p and imaginary part ω_p:
r_p = σ_p + j ω_p        (6)
The roots (r) which are located in the superior semi plane near the unitary circle can be
obtained using (7).
The root angles can be obtained by using the arctangent function. The roots are then mapped
into the frequency domain using (8) to obtain the formants.
f_p = arctan( ω_p / σ_p ) · fs / (2π)        (8)
Once the formants are obtained, they are organized in ascending order, and the first three
are chosen as parameters of the speech segment.
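A sketch of this root-based formant extraction (Eqs. (5)-(8)); the minimum root radius used to discard weak resonances, and the toy all-pole model, are illustrative assumptions:

```python
import numpy as np

def formants_from_lpc(a, fs=8000, min_radius=0.9):
    """Estimate formant frequencies from LPC polynomial coefficients
    a = [1, a1, ..., ap] of the all-pole model 1 / A(z).

    Roots in the upper half of the z-plane near the unit circle are
    kept; each root angle is mapped to a frequency by
    f = angle * fs / (2*pi), and the formants are returned in
    ascending order (at most the first three).
    """
    roots = np.roots(a)
    roots = roots[(roots.imag > 0) & (np.abs(roots) > min_radius)]
    freqs = np.angle(roots) * fs / (2 * np.pi)
    return np.sort(freqs)[:3]

# Toy all-pole model with resonances near 700 Hz and 1800 Hz at fs = 8 kHz.
fs = 8000
poles = []
for f in (700.0, 1800.0):
    w = 2 * np.pi * f / fs
    poles += [0.98 * np.exp(1j * w), 0.98 * np.exp(-1j * w)]
a = np.real(np.poly(poles))  # denominator A(z) coefficients
formants = formants_from_lpc(a, fs)
```

Keeping only the upper half-plane roots implements the symmetry argument in the text: the conjugate roots in the lower half-plane carry no additional frequency information.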
c(t) = F⁻¹{ log S(ω) }        (9)
c(t) = F⁻¹{ log E(ω) + log H(ω) }        (10)
The above equation indicates that the cepstrum of a signal is the sum of the cepstra of the
excitation source and the vocal tract filter. The vocal tract information varies slowly and
appears in the first cepstral coefficients. For speech recognition applications, the vocal
tract information is more important than the excitation source. The cepstral coefficients
can be estimated from the LPC coefficients by applying the following recursion:
co ln 2
m 1
k
c m am m c k am k 1 m p (11)
k 1
m 1
k
cm c k am k mp
k 1 m
where $c_m$ is the $m$-th LPC-cepstral coefficient, $a_m$ is the $m$-th LPC coefficient and $m$ is the
cepstral index.
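The recursion in (11) can be sketched directly; this is a minimal illustration (the function name and the 1-based coefficient layout are conventions chosen here to match the equation indices):

```python
def lpc_to_cepstrum(a, n_ceps):
    """Compute LPC-cepstral coefficients c_1..c_{n_ceps} via the recursion (11).

    `a` holds the LPC coefficients with a[0] unused (a[1]..a[p]), so that the
    list indices match the subscripts in the text.
    """
    p = len(a) - 1
    c = [0.0] * (n_ceps + 1)
    for m in range(1, n_ceps + 1):
        acc = a[m] if m <= p else 0.0          # the a_m term applies only while m <= p
        for k in range(max(1, m - p), m):      # k runs over max(1, m-p) .. m-1
            acc += (k / m) * c[k] * a[m - k]
        c[m] = acc
    return c[1:]
```

With $p = 2$, the first coefficients reduce to $c_1 = a_1$ and $c_2 = a_2 + a_1^2/2$, which is a convenient hand check.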
Usually the number of cepstral coefficients is set equal to the number of LPC coefficients to
avoid introducing noise. A representation derived from the cepstral coefficients is the set of
Mel Frequency Cepstral Coefficients (MFCC), whose fundamental difference from the
cepstrum coefficients is that the frequency bands are positioned according to a logarithmic
scale, known as the mel scale, which approximates the frequency response of the human
auditory system more closely than the uniform resolution of the Fast Fourier Transform (FFT).
In the inner ear, the basilar membrane carries out a time-frequency decomposition of the
audible signal through a multiresolution analysis similar to that performed by a wavelet
transform. Thus to develop a feature extraction method that emulates the basilar membrane
operation, it should be able to carry out a similar frequency decomposition, as proposed in
the inner ear model developed by Zhang et al. [9]. In this model the dynamics of the basilar
membrane, which has a characteristic frequency equal to $f_c$, can be modeled by using a
gamma-tone filter, which consists of a gamma distribution multiplied by a pure tone of
frequency $f_c$. The shape parameter $\alpha$ of the gamma distribution is related to the filter order,
while the scale parameter $\theta$ is related to the period of occurrence of the events under analysis,
when they have a Poisson distribution. Thus the gamma-tone filter representing the impulse response of the
basilar membrane is given by (12)
Esophageal Speech Enhancement Using a Feature Extraction Method Based on Wavelet Transform 205
$$\psi(t) = \frac{1}{(\alpha-1)!}\,t^{\alpha-1}\,e^{-t/\theta}\cos\!\left(2\pi f_c t\right), \qquad t \ge 0 \qquad (12)$$
Equation (12) defines a family of gamma-tone filters characterized by $\alpha$ and $\theta$. Thus, to
emulate the basilar membrane behavior, it is necessary to look for the most suitable filter
bank which, according to the basilar membrane model given by Zhang [9], can be obtained
by setting $\theta = 1$ and $\alpha = 3$, since these values in (12) result in the best approximation to the
inner ear dynamics. From (12) we have:
$$\psi(t) = \frac{1}{2}\,t^{2}\,e^{-t}\cos(2\pi t), \qquad t \ge 0 \qquad (13)$$

$$\Psi(\omega) = \frac{1}{2}\left[\frac{1}{\left(1 + j(\omega - 2\pi)\right)^{3}} + \frac{1}{\left(1 + j(\omega + 2\pi)\right)^{3}}\right] \qquad (14)$$
It can be shown that $\psi(t)$ presents the expected attributes of a mother wavelet, since it
satisfies the admissibility condition given by (15):

$$\int_{-\infty}^{\infty}\frac{\left|\Psi(\omega)\right|^{2}}{\left|\omega\right|}\,d\omega < \infty \qquad (15)$$
This means that $\psi(t)$ can be used to analyze and then reconstruct a signal without loss of
information [10]. That is, the functions given by (13) constitute an unconditional basis in $L^2(R)$,
and then we can estimate the expansion coefficients of an audio signal $f(t)$ by using the scalar
product between $f(t)$ and the function $\psi(t)$ with translation $\tau$ and scaling factor $s$ as follows:

$$C(\tau, s) = \frac{1}{\sqrt{s}}\int_{0}^{\infty} f(t)\,\psi\!\left(\frac{t-\tau}{s}\right)dt \qquad (16)$$
A sampled version of (16) must be specified because discrete-time speech signals are to be
characterized. To this end, a sampling of the scale parameter $s$ based on the
psychoacoustic phenomenon known as critical bandwidths will be used [11].
The critical band theory models the basilar membrane operation as a filter bank in which
the bandwidth of each filter increases as its central frequency increases. This
requirement can be satisfied using the Bark frequency scale, a logarithmic scale in
which the frequency resolution of any section of the basilar membrane is exactly equal to
one Bark, regardless of its characteristic frequency. Because the Bark scale is derived from
a biological parameter, there is no exact expression for it; as a result, several different
approximations are available in the literature. Among them, the statistical fitting provided
by Schroeder et al. [11] appears to be a suitable choice. Using this approach,
the relation between the linear frequency $f$ in Hz and the Bark frequency $Z$ is given by
206 Modern Speech Recognition Approaches with Case Studies
$$Z = 7\ln\!\left(\frac{f}{650} + \sqrt{1 + \left(\frac{f}{650}\right)^{2}}\,\right) \qquad (17)$$
Using (17), the $j$-th scaling factor $s_j$, given by the inverse of the $j$-th central frequency in Hz
corresponding to the $j$-th band in the Bark frequency scale, becomes

$$s_j = \frac{e^{j/7}}{325\left(e^{2j/7} - 1\right)}, \qquad j = 1, 2, 3, \ldots \qquad (18)$$
The inclusion of the Bark frequency in the scaling factor estimation, as well as the relation
between (17) and the dynamics of the basilar membrane, allows a frequency decomposition
similar to that carried out by the human ear. Since the scaling factor given by (18) satisfies
the Littlewood-Paley condition (19),

$$\lim_{j\to\infty}\frac{s_{j+1}}{s_j} = \lim_{j\to\infty}\frac{e^{(j+1)/7}\left(e^{2j/7} - 1\right)}{e^{j/7}\left(e^{2(j+1)/7} - 1\right)} = e^{-1/7} < 1 \qquad (19)$$
there is no information loss during the sampling process. Finally, the number of subbands is
related to the sampling frequency $f_s$ as follows:

$$j_{\max} = \operatorname{int}\!\left[7\ln\!\left(\frac{f_s}{1300} + \sqrt{\left(\frac{f_s}{1300}\right)^{2} + 1}\,\right)\right] \qquad (20)$$
Therefore, for a sampling frequency equal to 8 kHz the number of subbands becomes 17.
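Equations (17), (18) and (20) are straightforward to compute; note that (20) is simply (17) evaluated at $f_s/2$ (since $(f_s/2)/650 = f_s/1300$). A minimal sketch, with function names chosen here for illustration:

```python
import math

def bark(f):
    """Bark frequency of eq. (17): Z = 7 * ln(f/650 + sqrt(1 + (f/650)^2))."""
    x = f / 650.0
    return 7.0 * math.log(x + math.sqrt(1.0 + x * x))

def scale_factor(j):
    """Scaling factor of eq. (18): the inverse of the j-th Bark-band centre frequency."""
    return math.exp(j / 7.0) / (325.0 * (math.exp(2.0 * j / 7.0) - 1.0))

def num_subbands(fs):
    """Number of subbands of eq. (20) for a sampling frequency fs in Hz."""
    x = fs / 1300.0
    return int(7.0 * math.log(x + math.sqrt(x * x + 1.0)))
```

Evaluating `num_subbands(8000)` reproduces the 17 subbands quoted in the text, and `1/scale_factor(j)` recovers the centre frequency $325\,(e^{j/7} - e^{-j/7})$.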
Finally, the translation axis is naturally sampled because the input data is a discrete-time
signal, and then the $j$-th decomposition signal can be estimated as follows:

$$C_j(m) = \sum_{n} f(n)\,\psi_j(n - m) \qquad (21)$$

where

$$\psi_j(n) = \frac{1}{2\sqrt{s_j}}\left(\frac{nT}{s_j}\right)^{2} e^{-nT/s_j}\cos\!\left(\frac{2\pi nT}{s_j}\right), \qquad n \ge 0 \qquad (22)$$
In (22) T denotes the sampling period. The expansion coefficients Cj obtained for each
subband are used to estimate the feature vector to be used during the training and
recognition tasks.
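A direct numeric sketch of (21) and (22), assuming a truncated support for the decaying wavelet; the function names, the 8 kHz default sampling period and the support length are illustrative choices, not the authors' parameters:

```python
import math

def psi_j(n, sj, T=1.0 / 8000.0):
    """Sampled, scaled wavelet of eq. (22), zero for n < 0."""
    if n < 0:
        return 0.0
    u = n * T / sj
    return 0.5 / math.sqrt(sj) * u * u * math.exp(-u) * math.cos(2.0 * math.pi * u)

def decompose(signal, sj, T=1.0 / 8000.0, support=64):
    """Eq. (21): C_j(m) = sum_n f(n) * psi_j(n - m).

    Only the lags 0..support-1 are summed, since psi_j decays exponentially.
    """
    out = []
    for m in range(len(signal)):
        acc = 0.0
        for d in range(support):
            if m + d < len(signal):
                acc += signal[m + d] * psi_j(d, sj, T)
        out.append(acc)
    return out
```

A unit impulse at sample $n_0$ makes the check easy: $C_j(m) = \psi_j(n_0 - m)$, so the decomposition simply reads out the sampled wavelet.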
Using (21), the feature vector used for voiced segment identification consists of the
following parameters:
a. The energy of the $m$-th speech signal frame, $x^2(n)$, where $1 \le n \le N$ and $N$ is the number
of samples in the $m$-th frame.
b. The energy contained in each one of the 17 wavelet decomposition levels of the $m$-th speech
frame, $C_j^2(m)$, where $1 \le j \le 17$.
c. The difference between the energies of the current and previous frames, given by (23):

$$d_x(m) = x^{2}(n - mN) - x^{2}\left(n - (m-1)N\right) \qquad (23)$$

d. The difference between the energy contained in each one of the 17 wavelet
decomposition levels of the current and previous frames, given by (24):

$$v_j = C_j^{2}(m) - C_j^{2}(m-1) \qquad (24)$$

where $m$ is the frame number. The feature vector derived using the proposed approach is then
obtained by concatenating these parameters.
The last eighteen members of the feature vector capture the spectral dynamics of the speech
signal by encoding the variation from the previous feature vector to the current one.
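The layout above (one frame energy, 17 band energies, then the 18 differences of equations 23 and 24) can be assembled as follows; this is a sketch that takes the energies as precomputed inputs, with a function name chosen for illustration:

```python
def feature_vector(energy, band_energies, prev_energy, prev_band_energies):
    """Assemble the 36-element feature vector: frame energy, 17 band energies,
    then the frame-energy difference (eq. 23) and the 17 band-energy
    differences (eq. 24) with respect to the previous frame."""
    assert len(band_energies) == 17 and len(prev_band_energies) == 17
    d_energy = energy - prev_energy                                       # eq. (23)
    d_bands = [c - p for c, p in zip(band_energies, prev_band_energies)]  # eq. (24)
    return [energy] + list(band_energies) + [d_energy] + d_bands
```

The first 18 entries describe the current frame and the last 18 carry the frame-to-frame dynamics, matching the description in the text.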
Figure 2. Pattern recognition stage. The first ANN identifies the vowel present in the segment and the
other five ANNs identify the consonant-vowel combination.
3. Results
Figure 3 shows the plots of mono-aural recordings of the Spanish words “abeja” (a), “adicto” (b)
and “cupo” (c), pronounced by an esophageal speaker with a sampling frequency of 8 kHz,
including the detected voiced segments. Figure 3 shows that a correct detection
is achieved using the combination of several features, in this case zero crossings, formant
analysis and average energy.
(a) (b)
(c)
Figure 3. Detected voiced/unvoiced segments of esophageal speech signal of Spanish words “abeja” (a),
“adicto” (b) and “cupo” (c).
Figure 4 shows the produced esophageal speech signal corresponding to the Spanish word
“cachucha” (cap) together with the restored signal obtained using the proposed system. The
corresponding spectrograms of both signals are shown in Figure 5.
To evaluate the performance of the proposed system, two different criteria were used:
the modified Bark spectral distortion (MBSD) and the mean opinion score (MOS). The Bark
spectrum L(f) reflects the ear’s nonlinear transformation of frequency and amplitude,
together with the important aspects of its frequency and spectral integration properties in
response to complex sounds. Using the Bark spectrum, an objective measure of the
distortion can be defined using the overall distortion as the mean Euclidean distance
between the spectral vectors of the normal speech, Ln(k,i), and the processed ones, Lp(k,i),
taken over successive frames as follows.
Figure 4. Waveforms trace corresponding to the Spanish word, “Cachucha”, (Cap). a) produced
Esophageal speech, b) restored speech.
Figure 5. Spectrograms trace corresponding to the Spanish word, “Cachucha” (Cap). a) Normal speech,
b) Produced Esophageal Speech, c) Restored speech.
$$\mathrm{MBSD} = \frac{\displaystyle\sum_{k=1}^{N}\sum_{i=1}^{M}\left(L_n(k,i) - L_p(k,i)\right)^{2}}{\displaystyle\sum_{k=1}^{N}\sum_{i=1}^{M} L_n^{2}(k,i)} \qquad (26)$$
where Ln(k,i) is the Bark spectrum of the kth segment of the original signal, Lp(k,i) is the Bark
spectrum of the processed signal and M is the number of critical bands. Figures 6 and 7
show the Bark spectral traces of the produced esophageal speech and the enhanced signals,
corresponding respectively to the Spanish words “hola” (hello) and “mochila” (bag). Here
the MBSD during voiced segments was equal to 0.2954 and 0.4213 for “hola” and “mochila”,
respectively, while during unvoiced segments the MBSD was 0.6815 and 0.7829 for “hola”
and “mochila”, respectively. The distortion decreases during the voiced periods, as suggested
by (26). Evaluation results using the Bark spectral distortion measure show that a good
enhancement can be achieved using the proposed method.
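The normalized distortion in (26) reduces to a few lines of code; a sketch, with the spectra taken as N-frame by M-band lists (the function name is an illustration, not the authors' implementation):

```python
def mbsd(Ln, Lp):
    """Eq. (26): squared-distance Bark spectral distortion between the Bark
    spectra of the reference speech Ln and the processed speech Lp,
    each given as a list of per-frame lists (N frames x M critical bands)."""
    num = sum((a - b) ** 2 for rn, rp in zip(Ln, Lp) for a, b in zip(rn, rp))
    den = sum(a ** 2 for rn in Ln for a in rn)
    return num / den
```

Identical spectra give a distortion of zero, and doubling every band of the processed spectrum gives exactly 1.0, which makes the normalization easy to verify by hand.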
Figure 6. Bark spectral trace of normal, Ln(k), and enhanced, Lp(k), speech signals of the Spanish word
“hola”.
Figure 7. Bark spectral trace of normal, Ln(k), and enhanced, Lp(k), speech signals of the Spanish word
“mochila”.
A subjective evaluation was also performed using the mean opinion score (MOS), in which
the proposed system was evaluated by 200 normal speakers and 200 alaryngeal ones
(Table 1 and Table 2) from the point of view of intelligibility and speech quality, where 5 is the
highest score and 1 is the lowest. In both cases the speech intelligibility and quality
evaluations without enhancement are shown for comparison. These evaluation results show
that the proposed system improves on the performance of [2], which reports a MOS of 2.91 when
the enhancement system is used and 2.3 without enhancement. These results also show that,
although the improvement is perceived by both the alaryngeal and normal speakers, the
improvement is larger in the opinion of the alaryngeal speakers. Thus the proposed system is
expected to have quite good acceptance among alaryngeal speakers, because it allows
synthesizing several kinds of male and female speech signals.
Finally, about 95% of the alaryngeal persons participating in the subjective evaluation preferred
the use of the proposed system during conversation. The subjective evaluation shows that quite
a good performance enhancement can be obtained using the proposed system.
The performance of the voiced segment classification stage was evaluated using 450
different alaryngeal voiced segments. The system failed to correctly classify 22 segments,
which represents a misclassification rate of about 5% using the neural network as the
identification method, while a misclassification rate of about 7% was obtained using Hidden
Markov Models (HMM). The comparison results are given in Table 3.
For evaluation purposes, the behavior of the proposed feature extraction method was compared
with the performance of several other wavelet functions. Comparison results are shown in
Table 4, which shows that the proposed method performs better than other wavelet-based
feature extraction methods.
4. Conclusions
This chapter proposed an alaryngeal speech restoration system, suitable for esophageal and
ALT produced speech, based on a pattern recognition approach in which the voiced segments
are replaced by equivalent segments of normal speech contained in a codebook. Evaluation
results show a correct detection of voiced segments, verified by comparing their
spectrograms with those of normal speech signals. Objective and subjective
evaluation results show that the proposed system provides a good improvement in the
intelligibility and quality of esophageal speech signals. These results show that the
proposed system is an attractive alternative for enhancing alaryngeal speech signals. This
chapter also presented a flexible structure that allows the proposed system to
enhance both esophageal and artificial larynx produced speech signals without further
modifications. The proposed system could be used to enhance alaryngeal speech in several
practical situations such as telephone and teleconference systems, thus improving the voice
quality and quality of life of alaryngeal people.
Author details
Alfredo Victor Mantilla Caeiros
Tecnológico de Monterrey, Campus Ciudad de Mexico, México
5. References
[1] H. L. Barney, F. E. Haworth and H. K. Dunn, (1959), “An experimental transistorized
artificial larynx”, Bell System Technical Journal, 38, 1337-1356.
[2] G. Aguilar, M. Nakano-Miyatake and H. Perez-Meana, (2005), “Alaryngeal Speech
Enhancement Using Pattern Recognition Techniques”, IEICE Trans. Inf. & Syst., Vol.
E88-D, No. 7, pp. 1618-1622.
[3] D. Cole, S. Sridharan and M. Geva, (1997), “Application of noise reduction techniques
for alaryngeal speech enhancement”, IEEE TENCON, Speech and Image Processing for
Computing and Telecommunications, pp. 491-494.
[4] H. David, et al., (2001), “Acoustics and Psychoacoustics”, Focal Press, Second Edition.
[5] L. Rabiner, B. Juang, (1993), “Fundamentals of Speech Recognition”, Prentice Hall,
Piscataway, USA.
[6] L. R. Rabiner, B. H. Juang and C. H. Lee, (1996), “An Overview of Automatic Speech
Recognition”, in Automatic Speech and Speaker Recognition: Advanced Topics, C. H.
Lee, F. K. Soong and K. K. Paliwal, editors, Kluwer Academic Publishers, pp. 1-30,
Norwell, MA.
[7] D. G. Childers, (2000), “Speech Processing and Synthesis Toolboxes”, John Wiley & Sons, Inc.
[8] A. Mantilla-Caeiros, M. Nakano-Miyatake, H. Perez-Meana, (2007), “A New Wavelet
Function for Audio and Speech Processing”, 50th MWSCAS, pp. 101-104.
[9] X. Zhang, M. Heinz, I. Bruce and L. Carney, (2001), “A phenomenological model for the
responses of auditory-nerve fibers: I. Nonlinear tuning with compression and
suppression”, Acoustical Society of America, vol. 109, No.2, pp 648-670.
[10] R. M. Rao, A. S. Bopardikar, (1998), “Wavelet Transforms: Introduction to Theory and
Applications”, Addison Wesley, New York.
[11] M. R. Schroeder, (1979) “Objective measure of certain speech signal degradations based
on masking properties of the human auditory perception”, Frontiers of Speech
Communication Research, Academic Press, New York.
Chapter 10

Cochlear Implant Stimulation Rates and Speech Perception

Komal Arora

http://dx.doi.org/10.5772/49992
1. Introduction
1.1. Cochlear implant system
For individuals with severe to profound hearing losses, due to disease or damage to the
inner ear, acoustic stimulation (via hearing aids) may not provide sufficient information for
adequate speech perception. In such cases direct electrical stimulation of the auditory nerve
by surgically implanted electrodes has been beneficial in restoring useful hearing. This
chapter will provide a general overview regarding sound perception through electrical
stimulation using multichannel cochlear implants.
A multiple-channel cochlear implant consists of: 1) a microphone, which picks up sounds
from the environment; 2) a sound processor, which converts the analog sounds into a
digitally coded signal; 3) a transmitting coil, which transmits this information to 4) the
receiver-stimulator, which decodes the radio frequency signals transmitted from the sound
processor into the electrical stimuli responsible for auditory nerve stimulation via 5) an
electrode array, which provides multiple sites of stimulation within the cochlea (figure 1).
Most of the current cochlear implant systems use intracochlear and extracochlear electrodes.
Three different modes of current stimulation have been used in cochlear implant systems:
monopolar, bipolar and common ground (figure 2a). In monopolar stimulation, current is
passed between one active intracochlear electrode and the extracochlear electrodes (which
provide the return current path) placed either as a ball electrode under the temporalis
muscle (MP1) or a plate electrode on the receiver casing (MP2) (figure 2b). When both of
these extracochlear electrodes act as return electrodes in parallel, it is called MP1+2
configuration. In bipolar stimulation, current flows between an active and a return electrode
within the cochlea; whereas in common ground stimulation, current flows from one
electrode within the cochlea to all other intracochlear electrodes.
A number of speech processing strategies are available in current CI systems. This presents a choice of
fitting parameters when setting up a system for individual patients. These parameters
include the rate of stimulation, number of channels to be activated, mode of stimulation,
electrical pulse width etc. These parameters along with the amplitude mapping of all
available electrodes define a “map” for an individual cochlear implantee. Amplitude
mapping involves measuring for each active electrode, the user’s threshold (T) level that is
the level at which he/ she can just hear the stimulus and the maximum comfortable (C) level,
that is, the level which produces a loud but comfortable sensation. These maps are loaded in
the client’s sound processor. An individual cochlear implantee’s speech perception
outcomes may differ based on the type of strategy he/she is using (Pasanisi et al., 2002;
Psarros et al., 2002; Skinner et al., 2002a, b; Plant et al., 2002).
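The amplitude-mapping step described above (placing each stimulus between the measured T and C levels) can be sketched as follows; this is a simplified linear interpolation for illustration only, since actual fittings apply a loudness growth function, and the function name and units are assumptions:

```python
def map_amplitude(level, t_level, c_level):
    """Map a normalized acoustic envelope level (0..1) onto the electrical
    dynamic range between the threshold (T) and maximum comfortable (C)
    current levels of one electrode. Linear interpolation sketch only."""
    level = min(max(level, 0.0), 1.0)          # clamp into the normalized range
    return t_level + level * (c_level - t_level)
```

Levels at or below zero map to T (just audible), and levels at or above one map to C (loud but comfortable), so the output never leaves the user's electrical dynamic range.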
The current Nucleus® cochlear implants employ filter bank strategies which analyse the
incoming signal using a bank of band-pass filters. The earliest filter bank strategy
implemented in Nucleus cochlear implants, called the Spectral Maxima Sound Processor (SMSP)
(McDermott et al., 1992), did not extract features from the speech waveform. Instead, the
incoming speech signal was sent to 16 band-pass filters with centre frequencies from 250 Hz
to 5400 Hz. The output from the six channels with the greatest amplitude (maxima) was
compressed to fit the patient’s electrical dynamic range. The resultant output was then sent
to the six selected electrodes at a rate of 250 pulses per second per channel (pps/ch). It is beyond the
scope of this chapter to provide detailed information on all available speech coding
strategies. A more comprehensive review of speech processing strategies is provided in
Loizou (1998). A brief review of the speech processing strategy most commonly used in
Nucleus devices is provided in section 1.2.3. Figure 3 shows the overall signal
flow in the most commonly used filterbank strategies: Continuous Interleaved Sampling
(CIS), Spectral Peak (SPEAK) and Advanced combination Encoders (ACE™).
Figure 3. Audio signal path used in current filterbank strategies (Swanson, 2007).
The AGC compression controls only the amplitude without distorting the temporal pattern of the
speech waveform. The last part of the front end processing is automatic sensitivity control
(ASC). Sensitivity refers to the effective gain of the sound processor and affects the
minimum acoustic signal strength required to produce stimulation. At a higher sensitivity
setting less signal strength is needed. On the other hand, at very low sensitivity settings,
higher sound pressure levels are needed to stimulate threshold or comfortable levels. The
sensitivity setting determines when the AGC will start acting and is aligned to C-level
stimulation. This is automated in the current Nucleus cochlear implant devices and is called
ASC. The front-end processing discussed so far is similar in SPEAK, CIS and ACE.
1.2.2. Filterbank
After the front end processing, the signal passes through a series of partially overlapping
band pass filters (each passing a different frequency range) where the signal is analyzed in
terms of frequency and amplitude. The filterbank splits the audio signal into a number of
frequency bands, simulating the auditory filter mechanism in normal hearing. The filter
bands in the ACE strategy in current Nucleus processors are spaced linearly from 188 to
1312 Hz and thereafter logarithmically up to 7938 Hz. Each filter band is allocated to one
intracochlear electrode in the implant system according to the tonotopic relationship
between frequency and place in the cochlea.
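The linear-then-logarithmic spacing described above can be sketched numerically; the split of 8 linear plus 14 logarithmic bands (22 in total) and the function name are assumptions made for illustration, not the device's exact band table:

```python
def ace_band_centres(n_linear=8, n_log=14):
    """Illustrative band-centre layout: centres spaced linearly from 188 to
    1312 Hz, then logarithmically (constant ratio) up to 7938 Hz."""
    linear = [188.0 + i * (1312.0 - 188.0) / (n_linear - 1) for i in range(n_linear)]
    ratio = (7938.0 / 1312.0) ** (1.0 / n_log)       # constant frequency ratio
    log_part = [1312.0 * ratio ** (i + 1) for i in range(n_log)]
    return linear + log_part
```

The resulting list starts at 188 Hz, crosses over at 1312 Hz, ends at 7938 Hz, and is strictly increasing, mirroring the spacing quoted in the text.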
Different speech coding strategies differ in the number of filter bands they use. For example,
the Nucleus implementation of the CIS strategy has a small number (4 to 12) of wide bands, and
SPEAK and ACE strategies have a large number (20 to 22) of narrow bands. The maximum
number of bands is determined by the number of electrodes that are available in the
particular implant system.
On the other hand, the ESPrit™ processors use analog switched capacitor filters. For smaller
behind the ear units such as ESPrit™ and ESPrit™ 3G processors (used in the experiments
discussed in this chapter, sections 3 and 4) switched capacitor filters were used because they
were most power efficient at that time. These switched capacitor filters sample low
frequencies at a rate of 19 kHz and high frequencies at a rate of 78 kHz. This filtered signal is
further rectified to extract the envelopes. In the ESPrit™ series of processors filter outputs
are measured using peak detectors. These peak detectors in ESPrit™ processors are matched
Cochlear Implant Stimulation Rates and Speech Perception 219
to the time response of the filters, as each filter in the filterbank has a different time response.
The Med-El implants use the Hilbert transform for measuring filter outputs, which yields an
amplitude response equal to the band-pass filter response.
After the low pass filtering in SPrint™ / Freedom™ and peak detection in ESPrit™
processors, the outputs are further analyzed for the amplitude maxima. Each analysis
window selects N maxima amplitudes (depending on the strategy) from filterbank outputs.
The rate at which a set of N amplitude maxima are selected is referred to as the update rate.
In Freedom™/ SPrint™ processors this rate is fixed at 760 Hz. However, in the ESPrit™
series of processors, it varies. For the high level sounds, the update rate is 4 kHz and for low
level sounds, it is 1 kHz. This is also the rate at which new information is generated by the
sound processor. In Med-El sound processors, using CIS strategy, new data is available from
the sound processor at the stimulation rate.
The following sections describe how sampling and selection is done for each strategy.
The Spectral Peak (SPEAK) strategy is a derivative of the SMSP strategy (McDermott et
al., 1992), in which the number of channels was increased from 16 to 20, with centre
frequencies ranging from 250 to 10,000 Hz. The frequency boundaries of the filters could be
varied. It also provided the flexibility to choose the number of maxima from one to ten, with an
average of six. The selected electrodes were stimulated at a fixed rate that varied between
180 and 300 pps/ch. The stimulation rate was varied depending upon the number of maxima
selected. In the case of limited spectral content, fewer maxima were selected and the stimulation
rate increased, which provided more temporal information and hence may have
compensated for reduced spectral cues. Similarly, when more maxima were selected, the
stimulation rate was reduced (Loizou, 1998). This strategy was implemented in the Spectra
sound processor (Seligman and McDermott, 1995) and was next incorporated in the
Nucleus 24™ series of sound processors. However, the SPrint™ and Freedom™ systems (in
Nucleus 24 series) used a fixed analysis rate of 250 Hz. Nucleus 24 also allowed for higher
rates which are covered in the description of ACE strategy (section 1.2.3.3). A typical SPEAK
strategy in the Nucleus devices consists of 250 pps/ch stimulation rate with the selection of
six or eight maxima out of 20 channels.
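The maxima-selection step shared by SMSP, SPEAK and ACE (picking the N largest-amplitude channels per analysis frame) can be sketched as follows; the function name and return format are illustrative assumptions:

```python
def select_maxima(envelopes, n_maxima=6):
    """Pick the n_maxima channels with the largest envelope amplitudes,
    returned as (channel, amplitude) pairs in tonotopic (channel) order."""
    ranked = sorted(enumerate(envelopes), key=lambda kv: kv[1], reverse=True)
    return sorted(ranked[:n_maxima])    # restore channel order for stimulation
```

Only the selected channels are stimulated in the frame; the rest of the filterbank output is discarded, which is what distinguishes these "n-of-m" strategies from CIS.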
The CIS strategy (Wilson et al., 1991) was developed for the Ineraid implant. In this strategy,
the filterbank has six frequency bands. The envelope of the waveform is estimated at the
output of each filter. These envelope signals are sampled at a fixed rate. The envelope
outputs are finally compressed to fit the dynamic range of electric hearing and then used to
modulate biphasic pulses. The filter output in terms of electrical pulses is delivered to six
fixed intracochlear electrodes. The key feature of the CIS strategy is the use of higher
stimulation rates to provide better representation of temporal information. The variations in
the pulse amplitude can track the rapid changes in the speech signal. This is possible due to
the shorter pulses with minimal delays/inter-pulse interval (Wilson et al., 1993). The
possible benefits of using high stimulation rates are discussed further in the Section 2. The
stimulation rate used in the CIS strategy is generally 800 pps/ch or higher. A modified
version (CIS+) of the CIS strategy is used these days, more typically in Med-El implants.
CIS+ uses a Hilbert Transformation (Stark and Tuteur, 1979) to represent the amplitude
envelope of the filter outputs. This transformation tracks the acoustic signal more closely for
a more accurate representation of its temporal dynamics compared to the other techniques
used to represent the amplitude envelope in other implant systems like “wave rectification”,
“low-pass filtering” or “fast Fourier transform” (Med-El, 2008).
The ACE™ strategy (Vandali et al., 2000) is similar to the SPEAK strategy except that it
combines the higher stimulation rate feature of the CIS strategy. Selection of electrodes is
similar to the SPEAK strategy; however it can also be programmed to stimulate fixed
electrode sites as in CIS. Thus this strategy attempts to provide the combined benefits of
good spectral and temporal information for speech. Figure 4 depicts the schematic diagram
of the ACE strategy. The filter bands in the ACE strategy are spaced linearly from 188 to
1312 Hz and thereafter logarithmically up to 7938 Hz.
The output of the filters is low-pass filtered with an envelope cut-off frequency between 200
and 400 Hz (in SPrint™ and Freedom™ processors). The ESPrit™ series of processors use
peak detectors (discussed in section 1.2.3) rather than low pass filtering. After the envelopes
from each of the band pass filters are extracted, a subset of filters with the highest
amplitudes is selected. These are called maxima, and their number can range from 1 to 20.
The stimulation rate in ACE can be selected from 250 to 3500 pps/ch.
Figure 5. Loudness growth function in a Nucleus cochlear implant system. The x-axis shows the input
level (dB SPL) and the y-axis shows the current level. The steeper loudness curve has smaller Q
(Swanson, 2007).
The selected envelope amplitudes are compressed according to the loudness growth
function. The parameter Q (steepness factor) controls the steepness of the
loudness growth curve (figure 5). Nucleus ESPrit™ processors operate on an input dynamic
range (IDR) of 30 dB; the IDR can be increased up to 50 dB in current Nucleus sound processors.
The biphasic electrical current pulses are then delivered non-simultaneously to the
corresponding electrodes at a fixed rate. The non-simultaneous presentation (one electrode
is stimulated at one time) of current pulses is used to avoid channel interaction (Wilson et
al., 1998). Stimulation rates available in current Nucleus devices range from 250
pps/ch to 3500 pps/ch. In the current study the Nucleus ESPrit™ 3G processor was used
which has a maximum stimulation rate of 2400 pps/ch. However, it is unlikely that these
higher stimulation rates provide any additional temporal information unless the processor
update rate is at least equivalent (760 Hz in SPrint™ and Freedom™ processors and 1000 -
4000 Hz in ESPrit™ processors). From the signal processing point of view, the ESPrit™
series of processors have the potential to add new temporal information with high rates
because of their higher update rate (1-4 kHz).
2. Electrical hearing
As discussed in the sections above, various speech coding strategies and choice of fitting
parameters are available in current CI systems. Studies have demonstrated that different
strategies and/or parameter choices can provide benefits to individual patients but there is
no clear method for determining these for a particular individual. The current literature in
this area shows a lack of consistency in outcomes, particularly, when the electrical
stimulation rate is varied. There could be some underlying physiological or psychological
correlates behind it. For example the outcomes may be related to the temporal processing
abilities of CI users. This section will review some of the existing literature pertaining to the
stimulation rate effects on the performance of the cochlear implant subjects.
Figure 6 shows the pulsatile waveforms for two stimulation rates, together with the
original speech envelope of channel 5 for the syllable /ti/. As seen in the 200 pps/ch
stimulation rate condition, pulses are spaced relatively far apart, so this sort of processing
may not be able to extract all of the important temporal fine structure of the original
waveform. When a higher pulse rate is used, the pulses are placed more closely and they
can carry the temporal fine structure more precisely (Loizou et al., 2000). From a signal
processing point of view this seems reasonable, however in practice, perceptual
performance of CI users is often not improved when using higher rates.
Figure 6. The pulsatile waveforms for channel 5 of the syllable /ti/ with stimulation rates of 200 pps/ ch
and 2000 pps/ch. The syllable /ti/ was band pass filtered into six channels and the output was rectified
and sampled at the rates indicated in this figure. The bottom panel shows the speech envelope of
channel 5 for syllable /ti/ (modified from Loizou et al., 2000).
When considering the appropriate rate to employ for coding of F0 temporal information,
Nyquist’s theorem states that the rate must be at least twice the highest frequency to be
represented. However, according to McKay et al (1994), the stimulation rate for CI systems
should be at least four times the highest frequency to be represented. This suggests that
rates of >1200 pps per channel are needed to effectively code the voice pitch range up to 300
Hz. On the other hand studies examining neural responses to electrical stimulation in
animals have shown that at rates above 800 pps/ch there is poorer phase locking and less
effective entrainment of neurons, due to refractory effects being more dominant (Parkins,
1989; Dynes & Delgutte, 1992). It is therefore simplistic to assume that a higher stimulation
rate alone will necessarily result in more effective transfer of temporal information in the
auditory system.
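The rate arithmetic in the preceding paragraph is worth making explicit; a trivial sketch (the function name and default factor are illustrative):

```python
def min_stimulation_rate(f0_max, factor=4):
    """Minimum per-channel rate (pps/ch) to convey temporal frequencies up
    to f0_max Hz: factor=2 is the Nyquist criterion, while factor=4 follows
    the McKay et al. (1994) suggestion cited in the text."""
    return factor * f0_max
```

For a 300 Hz voice-pitch ceiling, the Nyquist criterion gives 600 pps/ch, while the four-times rule gives the 1200 pps/ch figure quoted above.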
A number of studies have explored the effect of stimulation rate on speech perception in CI users.
Results for some of the previous studies using the continuous interleaved sampling (CIS)
speech coding strategy and the MED-El implant showed benefits for moderate and high
stimulation rates (Loizou et al, 2000; Kiefer et al, 2000; Verschuur, 2005; Nie et al, 2006).
However, other studies using the CIS strategy did not show a benefit for high rates (Plant et
al, 2002; Friesen et al, 2005). The comparison of these studies is complicated by the use of
different implant systems. Studies using the Nucleus devices with 22 intracochlear
electrodes and the ACE strategy did not show a conclusive benefit for higher rates (Vandali
et al, 2000; Holden et al, 2002; Weber et al, 2007; Plant et al, 2007). Again, there are some
limitations in these studies due to the specific hardware used. The higher stimulation rates
tested by Vandali et al (2000) and Holden et al (2002) probably did not add any extra
temporal information due to the limited analysis rate of 760 Hz employed in the SPrint™
processor used in those studies. Many of these studies reported large individual variability
among subjects. Although the recent study by Plant et al (2007) found no significant group
mean differences between higher rate and lower rate programs, five of the 15 subjects
obtained significantly better scores with higher rates (2400 pps/ch & 10 maxima, or 3500
pps/ch & 9 maxima) compared to lower rates (1200 pps/ch & 10 maxima, or 1200 pps/ch &
12 maxima) for speech tests conducted in quiet or noise. Only two subjects obtained
significant benefits in both tests using the higher set of rates, and the results were not
conclusive because significant learning effects were observed in the study. Likewise, in the
study by Weber et al (2007), group speech perception scores in quiet and noise did not
demonstrate a significant difference between stimulation rates of 500, 1200, and 3500 pps/ch
using the ACE strategy. However, some variability in individual scores was observed for six
of the 12 subjects for the sentences in noise test.
Reports on subjects’ preferences for particular stimulation rates with Nucleus devices have
favored low to moderate stimulation rates. In the study by Vandali et al. (2000), the 250 and
807 pps/ch rates were preferred over 1615 pps/ch. Similarly, Balkany et al. (2007) reported
preferences for the slower set of rates (500 to 1200 pps/ch, ACE strategy) in 37 of the 55
subjects, compared to the faster set of rates (1800 to 3500 pps/ch, ACE RE strategy). The
authors also reported that individual subjects' rate preferences tended towards the slower
rates within each of the two sets of stimulation rates. Similarly, in a clinical trial
conducted in North America and Europe by Cochlear Ltd (2007) on subject selection of
stimulation rate with the Nucleus Freedom system, there was a preference for stimulation
rates of 1200 pps/ch or lower. Speech perception test results also showed improved
performance with stimulation rates of 1200 pps/ch or lower compared to a higher set of rates
(1800, 2400, and 3500 pps/ch).
bandwidth of vowel formants, is encoded via place along the tonotopic axis of the cochlea.
Fine spectral structure is also encoded, such as the frequency of the fundamental (F0) and
lower-order harmonics of the fundamental for voiced sounds (Plomp, 1967; Houtsma, 1990).
Temporal properties of speech encoded in the auditory system comprise low frequency
envelope cues (<50 Hz) which provide information about phonetic features of speech, higher
frequency envelope information (>50 Hz), such as F0 periodicity information in auditory
filters in which vowel harmonics are not resolved, and most importantly fine temporal
structure (Rosen, 1992). The perceived quality or timbre of complex sounds is mostly
attributable to the spectral shape. For example, each vowel has specific formant frequencies,
and the patterning of these formant frequencies helps determine the vowel quality and
identity (Moore, 2003b).
The frequency coding in cochlear implants takes place in two ways: a) spectral information
presented via the distribution of energy on multiple electrodes along the cochlea, and b)
temporal information, mainly presented via the amplitude envelopes of the electrical
stimulation pulses. These two ways of coding, together with spectral shape coding, are
described in the following sub sections.
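The envelope coding of point (b) can be illustrated with a minimal sketch. Assuming a single analysis channel, the envelope is extracted by full-wave rectification followed by low-pass smoothing; the function name, one-pole filter, and parameter values here are illustrative stand-ins, not the processing of any particular implant system:

```python
import math

def extract_envelope(signal, fs, cutoff_hz):
    """Full-wave rectification followed by a one-pole low-pass filter."""
    alpha = 1.0 / (1.0 + fs / (2.0 * math.pi * cutoff_hz))
    state, env = 0.0, []
    for sample in signal:
        state += alpha * (abs(sample) - state)
        env.append(state)
    return env

fs = 16000                      # sampling rate (Hz), illustrative
fm, fc = 100.0, 2000.0          # a 100 Hz envelope on a 2 kHz carrier
x = [(1.0 + math.sin(2 * math.pi * fm * k / fs)) *
     math.sin(2 * math.pi * fc * k / fs) for k in range(fs)]

env = extract_envelope(x, fs, 400.0)
# The smoothed envelope follows the 100 Hz modulation, not the 2 kHz carrier.
```

In a real processor this step is applied per filterbank channel, and the resulting envelopes set the current levels of the stimulation pulses.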
temporal information to assist speech perception for cochlear implant users may be quite
important.
noise (Fu et al., 1998; Dorman et al., 1998). If this increased spectral resolution cannot be
obtained with current electrode technology, better coding of periodicity cues may provide
another avenue for improving performance for CI users.
Figure 8. The envelope and fine structure components of a filtered speech signal
(http://research.meei.harvard.edu/chimera/motivation.html).
Psychophysical studies of electrical stimulation in the human auditory system indicate that
temporal pitch information up to 300 Hz is probably available to CI users (Eddington et al.,
1978; Tong et al., 1979; Shannon, 1983; Moore & Carlyon, 2005). These studies used steady
pulse trains (with varied rates of stimulation) delivered to single electrode sites. For very low
pulse rates (<50 Hz), the signal is reported to be perceived as a buzz-like sound, and for rates
above 300 Hz little change in perceived pitch is reported. This ability varies across CI users,
with a few being able to discriminate rate increases up to 1000 Hz (Fearn and Wolfe, 2000;
Townshend et al., 1987).
level dependent phase changes on the basilar membrane. In implant systems, spectral shape,
like frequency, is coded by filtering the signal into several frequency bands and by the
relative magnitude of the electric signal across electrode channels. The coding of spectral
shape cannot be as precise as that in the normal ear due to the relatively small number of
effective channels in current cochlear implant systems. In addition, the detailed temporal
information relating to formant frequencies is not conveyed due to the inability to
effectively code temporal information above about 300 Hz. One approach to improving the
representation of the temporal envelope and higher frequency periodicity cues is to increase
the low pass filter frequency applied to the amplitude envelope and/or to use higher
stimulation rates. However, results so far do not conclusively show benefit for higher rates
(section 2.1). Furthermore, it is also not clear how effectively CI listeners can resolve these
temporal modulations.
In normal hearing and hearing impaired subjects, temporal resolution can be characterized
by the temporal modulation transfer function (TMTF), which relates the threshold for
detecting changes in the amplitude of a sound to the rapidity of those changes, i.e. the
modulation frequency (Bacon and Viemeister, 1985; Burns and Viemeister, 1976; Moore and
Glasberg, 2001). In this task, modulation detection can be measured for a series of modulation
frequencies. The stimuli used in these experiments are either amplitude modulated noise or
complex tones. To study temporal modulation independent from spectral resolution, spectral
cues are removed by using broadband noise as a carrier. This type of stimulus has waveform
envelope variations but no spectral cues. Complex tones are combinations of two or more
pure tones. A sinusoidally amplitude modulated (SAM) signal has components at fc − fm, fc,
and fc + fm, where fc is the carrier frequency and fm is the modulation frequency (figure 9b).
The components above and below the centre frequency are called sidebands. If fc (e.g. 1200,
1400, 1600 Hz) is an integer multiple of fm (200 Hz), it forms a harmonic structure; otherwise,
it is called an inharmonic waveform. In signal theory and acoustic literature, amplitude
modulated signals are described by the formula:

s(t) = A [1 + m sin(2π fm t)] sin(2π fc t)

where t is time and m is the modulation index (m = 1 means 100% modulation). Figure 9(a)
shows an example of an amplitude modulated signal. The pink color waveform shows the
depth of modulation. In acoustic hearing, m = MD (modulation depth), which is defined as:

MD = (peak − trough) / (peak + trough) (3)

where peak and trough refer to the peak and trough sound pressure levels (SPL) of the
modulation envelope.
m = 1 − 1/MD (4)
and where modulation depth (in cochlear implant stimulation) is defined by the peak over
trough (peak/trough) level in the envelope. The psychophysical measure tested in study 2
was modulation detection threshold (MDT) which refers to the depth of modulation
necessary to just allow discrimination between a modulated and an unmodulated waveform.
In this study, stimuli were presented through the cochlear implant sound processor.
Modulation depth (MD) was converted into modulation index (m) using Eq. (3) for all
analyses, because most of the studies that measured modulation detection in CI recipients
have used the modulation index (m) for analysis purposes.
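The relation in Eq. (4) between the peak-over-trough modulation depth used for electrical stimulation and the modulation index m can be sketched as follows; the function name and the example levels are illustrative, not values from the study:

```python
def modulation_index(peak, trough):
    """Convert a peak-over-trough modulation depth to modulation index, m = 1 - 1/MD."""
    md = peak / trough          # modulation depth as a peak/trough level ratio
    return 1.0 - 1.0 / md

# An unmodulated waveform (peak == trough) gives m = 0;
# a very deep modulation (large peak/trough ratio) approaches m = 1.
print(modulation_index(200.0, 200.0))   # 0.0
print(modulation_index(200.0, 100.0))   # 0.5
```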
Figure 9. a) An example of an amplitude modulated signal. The pink color waveforms show the depth
of modulation. b) The sinusoidally amplitude modulated signal with 2000 Hz carrier frequency (fc), 100
Hz modulation frequency (MF).
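The sideband structure of a SAM signal described above can be checked numerically. The sketch below generates one second of a SAM tone and evaluates a single-bin DFT (a hypothetical `goertzel_mag` helper) at the carrier and sideband frequencies; the sampling rate and tone parameters are illustrative:

```python
import cmath
import math

def goertzel_mag(x, fs, f):
    """Normalized magnitude of a single DFT bin at frequency f."""
    n = len(x)
    return abs(sum(x[k] * cmath.exp(-2j * math.pi * f * k / fs)
                   for k in range(n))) / n

fs = 16000                       # sampling rate (Hz)
fc, fm, m = 2000.0, 100.0, 1.0   # carrier, modulation frequency, modulation index
sam = [(1 + m * math.sin(2 * math.pi * fm * k / fs)) *
       math.sin(2 * math.pi * fc * k / fs) for k in range(fs)]

for f in (fc - fm, fc, fc + fm):
    print(f, round(goertzel_mag(sam, fs, f), 3))
# the carrier appears at ~0.5 and each sideband at ~(m/2) * 0.5 = 0.25
```

Expanding the product shows why: (1 + m sin(2π fm t)) sin(2π fc t) = sin(2π fc t) + (m/2)[cos(2π(fc − fm)t) − cos(2π(fc + fm)t)], so each sideband carries amplitude m/2.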
The limitation of sinusoidal carriers is that the modulation introduces sidebands
which can be heard as separate signals. Also, the results of modulation detection may be
influenced by "off-frequency listening". That is, if the carrier and sideband frequencies are
widely separated, the sounds can be heard through the auditory filters centered at the
carrier frequency or the sideband frequency, depending on the intensity of modulation
(Moore and Glasberg, 2001). However, in study 2 a sinusoidal carrier was used instead of
noise because subjects with cochlear hearing loss have reduced frequency selectivity
(Glasberg and Moore, 1986; Moore, 2003), leading to poor spectral resolution of the
sidebands. Thus, the TMTF in such cases is mainly influenced by temporal resolution over a
wide range of modulation frequencies (Moore and Glasberg, 2001). It is also difficult for CI
users to spectrally resolve the components of complex tones (Shannon, 1983). In addition, a
noise carrier would have its own temporal envelope, which can confound the results of
modulation detection (Moore and Glasberg, 2001).
3.1. Rationale
If low to moderate stimulation rates do indeed provide equivalent or better speech
perception, then recipients may also benefit from reductions in system power consumption
and processor/device size and complexity. So far, low to moderate rates have not been well
explored in Nucleus™ 24 implants with the ACE strategy, especially in the range of
250-900 pps/ch, in spite of the fact that this range of rates is often used clinically1 with
Nucleus devices (which are the most widely used devices worldwide among CI recipients).
The authors thus chose to examine rates of 275, 350, 500, and 900 pps/ch in this study,
addressing the following questions:
- Whether stimulation rate (between 275 and 900 pps/ch) has an effect on speech perception
in quiet and noise for the group of adult CI subjects.
- Whether the optimal rate varies among subjects.
- Whether there is a relation between the subjective preference measured with a
comparative questionnaire and the speech perception scores.
3.2. Method
Ten postlingually deaf adult subjects using the Nucleus™ 24 Contour™ implant and
ESPrit™ 3G sound processor participated in the study. Table 1 shows the demographic data
for the subjects. Low to moderate stimulation rates of 275, 350, 500 and 900 pulses-per-
second/channel (pps/ch) were evaluated.
1 As per the information available from Melbourne Cochlear Implant Clinic (RVEEH, University of Melbourne) and
Test material comprised CNC open-set monosyllabic words (Peterson & Lehiste, 1962)
presented in quiet and Speech Intelligibility Test (SIT) open-set sentences (Magner, 1972)
presented in four-talker babble noise. Four lists of CNC words were presented in each
session at a level of 60 dB SPL RMS. An adaptive procedure (similar to the procedure used
by Henshall and McKay, 2001) was used to measure speech reception threshold (SRT) for
the sentence test in noise. Four such SRT estimates were recorded in each session. All four
stimulation rate programs were balanced for loudness. A repeated ABCD experimental
design was employed.
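An adaptive SRT measurement of the kind described above can be sketched as a simple one-down/one-up staircase on signal-to-noise ratio. This is only a schematic stand-in, not the exact procedure of Henshall and McKay (2001): the step size, reversal rule, and the simulated listener below are all assumptions.

```python
def measure_srt(respond, start_snr=20.0, step=2.0, n_reversals=8):
    """One-down/one-up staircase: decrease SNR after a correct response,
    increase it after an incorrect one; the SRT estimate is the mean SNR
    at the later reversal points."""
    snr, direction, reversals = start_snr, -1, []
    while len(reversals) < n_reversals:
        correct = respond(snr)
        new_direction = -1 if correct else +1
        if new_direction != direction:        # the track changed direction
            reversals.append(snr)
            direction = new_direction
        snr += direction * step
    return sum(reversals[2:]) / len(reversals[2:])  # discard early reversals

# Simulated listener: repeats the sentence correctly only above 4 dB SNR.
def listener(snr, true_srt=4.0):
    return snr > true_srt

print(round(measure_srt(listener), 1))   # → 5.0
```

A one-down/one-up track converges on the 50%-correct point, which is how an SRT is conventionally defined.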
Take-home practice was provided with each stimulation rate. A comparative questionnaire
was provided to the CI subjects at the end of the repeated ABCD protocol. Subjects were
asked to compare all four rate programs for similar lengths of time over a period of two
weeks with a constant sensitivity setting for all stimulation rates.
Subject   Age   Cause of deafness   Duration of implant use (yr)   Everyday stimulation rate* and strategy
1         58    Hereditary          4                              900 pps/ch, ACE
2         67    Otosclerosis        5                              720 pps/ch, ACE
3         64    Unknown             5                              900 pps/ch, ACE
4         64    Unknown             5                              250 pps/ch, SPEAK
5         74    Unknown             4                              1200 pps/ch, ACE
6         75    Otosclerosis        8                              250 pps/ch, SPEAK
7         62    Unknown             8                              250 pps/ch, SPEAK
8         68    Unknown             6                              250 pps/ch, ACE
9         69    Unknown             4                              900 pps/ch, ACE
10        72    Unknown             5                              500 pps/ch, ACE
*Prior to commencement of the study
Table 1. Demographic data and everyday stimulation rates for the ten subjects.
The ACE strategy was used for all stimulation rates. For the 275 pps/ch case, the stimulation
rate was jittered in time by approximately 10%, which tends to lower the rate to
approximately 250 pps/ch. This was done to minimize the audibility of the constant
stimulation rate. It may have been beneficial if all other stimulation rates tested in this study
were also jittered (i.e., to avoid a possible confound). The number of maxima was eight for
all the conditions. Clinical default settings for pulse width, mode (MP1+2) and frequency to
electrode mapping were employed. The pulse width was increased in cases where current
level needed to exceed 255 CL units to achieve comfortable levels. The sound processor was
set at the client’s preferred sensitivity and held constant throughout the study.
Thresholds (T-levels) and Comfortable listening levels (C-levels) were measured for all
mapped electrodes and for each rate condition. T-levels were measured using a modified
Hughson-Westlake procedure with an ascending step size of 2 current levels (CLs) and a
descending step size of 4 CLs. C-levels were measured with an ascending technique that
slowly increases the levels from the baseline T-levels until the client reported that the sound
was loud but still comfortable. Loudness balancing was performed at C-levels as well as at
50% level of the dynamic range, using a sweep across four consecutive electrodes at a time.
Subjects were asked whether stimulation of all four electrodes sounded equally loud and if
not, T- and C-levels were adjusted as necessary. Speech-like "ICRA" noise (International
Collegium of Rehabilitative Audiology; Dreschler et al., 2001) was presented at 60 dB SPL
RMS for all programs to ensure that each was similar in loudness for conversational
speech. The comparison was conducted using a paired-comparison procedure, in which all
possible pairings of conditions were compared twice. Adjustments were made to C-levels if
necessary to achieve similar loudness across all rate programs.
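The T-level search can be sketched as below, using the ascending 2-CL and descending 4-CL steps given in the text; the stopping rule, starting level, and simulated recipient are illustrative assumptions, not the full clinical protocol:

```python
def find_t_level(hears, start_cl=100, up_step=2, down_step=4, runs=3):
    """Modified Hughson-Westlake sketch: ascend in 2-CL steps until the
    stimulus is heard, drop 4 CL below that level, and re-ascend; the
    T-level estimate is the level heard on the most ascending runs."""
    heard_levels = []
    level = start_cl
    for _ in range(runs):
        while not hears(level):
            level += up_step          # ascending in 2-CL steps
        heard_levels.append(level)    # first level heard on this run
        level -= down_step            # descend 4 CL and re-ascend
    return max(set(heard_levels), key=heard_levels.count)

# Simulated recipient who detects any stimulus at or above 137 CL.
print(find_t_level(lambda cl: cl >= 137))   # → 138
```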
3.3. Results
3.3.1. CNC words
Figure 10 shows percentage correct CNC word scores for the ten subjects for the four
stimulation rate programs. The scores were averaged across the two evaluation sessions.
Repeated measures two-way analysis of variance (ANOVA) for the group revealed no
significant differences across the four rate programs (F [3, 27] = 2.14; p= 0.118). Furthermore,
there was no significant main effect for session (F [3, 27] = 2.05; p= 0.186). The interaction
effect between rate and session was not significant (F [3, 27] = 2.30; p= 0.099).
In the individual analyses, subject 1 showed significantly better scores for the 500 and 900
pps/ch programs compared to the 350 pps/ch program. There was no significant difference
between the 500 and 900 pps/ch programs. Subject 8 showed best CNC scores with the 500
pps/ch program but the 900 pps/ch program showed poorer performance compared to all
other programs. Subject 10 showed significantly better scores with 500 pps/ch compared to
the 350 pps/ch program.
Figure 10. Individual patient’s percentage correct scores and group mean percentage correct scores for
CNC words in quiet. Statistically significant differences (post hoc Tukey test) are shown in the tables
presented below each bar graph (*p ≤ 0.05, **p ≤ 0.01, ***p ≤ 0.001). Each subject’s subjective preference
in quiet along with the degree of preference (1 - very similar, 2 - slightly better, 3 - moderately better, 4 -
much better) are shown below the chart.
Individual data analysis revealed a significant rate effect for the sentence test (p<0.001) in
eight out of ten subjects. All of these eight subjects showed improved performance with the
500 and/or 900 pps/ch rate programs. Subject 1 performed equally well with the 500 and 900
pps/ch stimulation rate programs. The performance was significantly better with both these
programs compared to the 275 and 350 pps/ch rate programs (p<0.05). Subject 2 showed
improved performance with the 900 pps/ch program. Pairwise multiple comparisons with
the Tukey test indicated significant differences between the mean SRT obtained with the 275
pps/ch program versus all other rate programs (p< 0.05), and also between the mean SRT
obtained with the 350 and 900 pps/ch programs (p= 0.025). No significant differences were
observed between the SRTs for the 350 and 500 pps/ch programs and the SRTs for the 500
and 900 pps/ch programs.
Figure 11. Individual patient’s mean speech reception threshold (SRT) and group mean SRT for SIT
sentences in competing noise. Statistically significant differences (post hoc Tukey test) are shown in the
tables presented below each bar graph (*p ≤ 0.05, **p ≤ 0.01, ***p ≤ 0.001). Each subject’s subjective
preference in noise along with the degree of preference (1 - very similar, 2 - slightly better, 3 -
moderately better, 4 - much better) are shown below the chart.
Subjects 5 and 6 also obtained their best SRTs with the 900 pps/ch compared to 350 and 500
pps/ch stimulation rates (p <0.05). Subject 9 showed improved performance with 900 pps/ch
compared to 275 pps/ch (p<0.001) and 350 pps/ch (p= 0.01) programs. This subject also
showed better SRT for 500 pps/ch compared to 275 pps/ch rate program (p= 0.032). Subjects
3, 4 and 8 performed best with the 500 pps/ch stimulation rate. For subject 3, the results for
500 pps/ch condition were significantly better than 275 pps/ch stimulation rate (p=0.001). For
subject 4 and subject 8, mean SRTs with 500 pps/ch stimulation rate were significantly better
than all other stimulation rates. Subjects 7 and 10 did not show any significant difference in
performance when tested for sentences in noise for all four stimulation rates.
Cochlear Implant Stimulation Rates and Speech Perception 235
[Figure 12 bar chart: mean helpfulness ratings (scale 1-5) for the 275, 350, 500, and 900 pps/ch rate programs across the listening situations Quiet, Noise, Media, Soft speech, and Overall.]
Figure 12. Group mean preference ratings of helpfulness for the four rate programs averaged across
four categories (listening in quiet, listening in noise, listening media devices & listening to soft speech)
and across 18 listening situations (overall). A rating of 1 represented "no help" and a rating of 5
represented "extremely helpful".
After providing helpfulness ratings, subjects were asked to indicate their first preferences in
quiet, noise and overall. Table 2 shows the number of subjects reporting their first
preferences in quiet, noise and overall for the four rate programs. Chi-square analysis
Subjects were asked to describe if their preferred rate program sounded “very similar”,
“slightly better”, “moderately better” or ”much better” than the other programs. As shown
in figure 10, five subjects reported their preferred program in quiet to be slightly better than
other programs, two reported them moderately better and the remaining three subjects
reported them as very similar to other programs. For speech in noise (figure 11), four
subjects rated their preferred program in noise as moderately better than other programs;
two subjects rated them much better than other programs; two subjects rated them slightly
better and the remaining two reported them as very similar to other programs.
There does not appear to be a close relationship between each subject’s subjective preference
and the rate program that provided best speech perception. Only two subjects (subject 1 and
8), who scored consistently better on a particular rate program in quiet and noise, chose that
program as the most preferred. However, only subject 1 showed consistency between
speech test outcomes and helpfulness ratings in quiet and noise. One subject showed
consistency between the rate program that provided best speech perception in noise and the
most preferred program in noise. Subject 9 scored best with 900 pps/ch rate in noise and
preferred this rate in noise. This subject rated 350, 500 and 900 pps/ch equally on helpfulness
rating
For two subjects (subjects 2 and 3) there was a partial agreement between the speech
perception scores in noise and the subjective preference. Subject 2 performed best with 900
pps/ch for speech perception in noise, but there was no significant difference in speech
performance in noise for 500 and 900 pps/ch. This subject preferred 500 pps/ch stimulation
rate in quiet and noise and the average rating of helpfulness in noise was also highest for
500 pps/ch rate. Subject 3 performed best with 500 pps/ch for sentence perception in noise
and preferred this program when listening in quiet. This subject preferred 350 pps/ch rate in
noise and overall and the average helpfulness rating in noise was highest with this rate
program.
Five subjects’ (subjects 4, 5, 6, 7 and 10) speech test outcomes did not agree with their
subjective preferences. However, the average helpfulness ratings were more or less similar
to the first preferences for these five subjects.
At the conclusion of the study, six of the ten subjects (subjects 2, 3, 4, 6, 7 and 10) continued
to use a different rate program compared to their everyday rate program (used prior to the
commencement of the study). One of these six subjects (subject 6) preferred to continue
with the rate program with the best sentence in noise perception score and the remaining
five subjects continued with the most preferred program (overall) based on the
questionnaire results.
          275 pps/ch   350 pps/ch   500 pps/ch   900 pps/ch   500 = 900 pps/ch   350 = 500 pps/ch
Quiet     0            1            5            2            1                  1
Noise     0            2            4            2            1                  1
Overall   0            2            4            2            1                  1
Table 2. The number of subjects reporting their first preferences in quiet, noise and
overall for the four rate programs.
3.4. Discussion
3.4.1. Speech perception in quiet and noise
The group averaged scores for monosyllables in quiet showed no significant effect of rate.
However, significantly better group results for sentence perception in noise were observed
for 500 and 900 pps/ch rates compared to 275 pps/ch stimulation rate and for 500 pps/ch
compared to 350 pps/ch rate. Individual data analysis showed improvements with
stimulation rates of 500 pps/ch or higher in eight out of ten subjects for sentence perception
in noise. Three out of these eight subjects showed benefit with 500 pps/ch and four subjects
showed benefit with 900 pps/ch rate. One subject showed improvement with both 500 and
900 pps/ch stimulation rates.
Four out of ten subjects were using a 250 pps/ch stimulation rate in their clinically fitted
processors before the commencement of the study. Two out of these four subjects showed
improved performance with 500 pps/ch, one improved with 900 pps/ch, and the remaining
subject showed no effect of rate on speech perception. This suggests that subjects had
enough time to become familiar with the higher rate conditions. The remaining six subjects
in the study had been using stimulation rates ranging between 500-1200 pps/ch prior to
commencement of the study. Four out of these six subjects (including the subject who
performed equally with 500 and 900 pps/ch) showed improvement with 900 pps/ch
stimulation rate. Better speech perception with 900 pps/ch rate could have been due to the
prolonged use of higher stimulation rates prior to commencement of the study.
The CNC test results are somewhat consistent with previous studies that used Nucleus
devices with the ACE strategy (Vandali et al., 2000; Holden et al., 2002; Plant et al., 2007 and
Weber et al., 2007). In these studies, monosyllabic word or consonant perception was not
affected by the increasing stimulation rates. Results in this study are also somewhat
consistent with a recent clinical trial by Cochlear Ltd. (2007) (Reference Note 1), which
showed no significant difference in the lower (500-1200 pps/ch) or higher set of rates (1800-
3500 pps/ch) for the subjects tested with CNC words.
Most of the previous studies that used Nucleus devices have reported variable rate effects
for CI individuals. Some of the subjects in these studies have shown improvement with
increased stimulation rates for some of the speech material. Therefore, these studies
emphasize the importance of optimizing stimulation rates for individual cochlear
implantees. Their results suggest that increasing stimulation rates could provide clinical
benefit to some of the cochlear implantees (Vandali et al., 2000; Holden et al., 2002; Plant et
al., 2002; Plant et al., 2007). The variable effect of increasing stimulation rates in these studies
is consistent with the results of this study, in that not all subjects preferred or showed
improved performance with the highest rate (900 pps/ch) tested. For instance, subject 4 and
subject 8 performed significantly better with the 500 pps/ch compared to the 900 pps/ch rate.
However, in the studies by Vandali et al. (2000) and Holden et al. (2002) the higher
stimulation rates probably did not add any extra temporal information due to the limited
update rate of the SPrint™ processor (760 Hz): stimulation rates below 760 Hz provide new
information in every cycle because the filter analysis rate is set equal to the stimulation rate,
whereas stimulation rates above 760 Hz are obtained by repeating stimulus frames. In
contrast to these studies, the current study used ESPrit™ 3G processors, which have an
update rate of 1 kHz for low level sounds and an update rate of 4 kHz for high level sounds.
Analysis of speech perception results across sessions in this study revealed no significant
effect of session for the group CNC scores in quiet. However, a significant effect of session
was observed for sentence perception in noise scores. Whilst a session effect was observed
(which may well have been due to task/rate program learning), scores for all four rate
programs showed similar effects of session. Thus given that a balanced design for
evaluation of rate was employed in this study, no one rate condition was advantaged by
learning within the study.
The individual subjective preference findings in this study initially appear at odds with the
results of speech perception outcomes in quiet, in which results for seven of the ten subjects
showed no significant effect of rate for monosyllables in quiet. However, six of these seven
subjects indicated that there was little difference between their preferred rate program and
the other rate programs (see figure 10). On the other hand, three of the five subjects (subjects
1, 2, 4, 8 and 9) who showed some consistency between the speech perception and the
subjective preferences in noise indicated that their preferences were reasonably strong and
that their preferred programs were moderately/much better than the other rate programs
(see figure 11). Two of the five subjects, in which inconsistencies between the speech
perception in noise and the subjective preference in noise were observed, indicated a weak
preference for their preferred rates. The other three, however, indicated a strong preference
for their preferred rate.
Although optimization of stimulation rate appears to be beneficial, time constraints will often
prevent clinicians from comparing speech perception outcomes with different stimulation
rates. An adaptive procedure called the genetic algorithm (Holland, 1975) may offer potential
for optimizing stimulation rate along with other parameters. This procedure, based on the
genetic principle of "survival of the fittest", guides the recipient through hundreds of
processor MAPs towards preferred programs in quiet and in noise. The MAPs vary in terms
of speech coding parameters such as stimulation rate, number of channels, and number of maxima.
genetic algorithm (GA) research in experienced CI recipients has not shown better outcomes
compared to standard MAPs programmed using default parameters (Wakefield et al., 2005;
Lineaweaver et al., 2006). It remains to be seen whether or not the GA algorithm provides
significant benefits for newly implanted subjects who are not biased by prolonged use of
default parameters such as a particular stimulation rate.
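The MAP-search loop described above can be sketched as a toy genetic algorithm. Everything below is an illustrative assumption: the fitness function stands in for a recipient's preference rating, and the parameter ranges, population size, and mutation rate are not from any published GA fitting study:

```python
import random

rng = random.Random(42)

RATES  = [250, 500, 900, 1200, 1800, 2400, 3500]   # candidate rates (pps/ch)
MAXIMA = list(range(6, 13))                        # candidate numbers of maxima

def fitness(map_):
    """Hypothetical stand-in for a recipient's rating of a MAP."""
    rate, maxima = map_
    return -abs(rate - 900) / 100.0 - abs(maxima - 8)

def evolve(generations=30, pop_size=12):
    pop = [(rng.choice(RATES), rng.choice(MAXIMA)) for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        survivors = pop[: pop_size // 2]            # "survival of the fittest"
        children = []
        while len(survivors) + len(children) < pop_size:
            a, b = rng.sample(survivors, 2)
            child = (a[0], b[1])                    # crossover of MAP parameters
            if rng.random() < 0.2:                  # occasional mutation
                child = (rng.choice(RATES), child[1])
            children.append(child)
        pop = survivors + children
    return max(pop, key=fitness)

best = evolve()   # the surviving (rate, maxima) MAP
```

In a real fitting, the fitness evaluation is the recipient's paired-comparison judgment rather than a formula, which is precisely why the procedure is attractive when speech testing every combination is impractical.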
The preference of the majority of subjects in this study for the 500 pps/ch rate in quiet, noise
and overall is somewhat consistent with the results of Balkany et al. (2007), in which 67% of
the subjects preferred the slower-rate strategy (ACE) over the faster-rate strategy (ACE RE),
with the majority of subjects preferring the slowest rate in each strategy (500 pps/ch in ACE
and 1800 pps/ch in ACE RE). However, in contrast to the present study, there was no
significant effect of rate on the speech perception outcomes in their study.
Subjects who had been using 250 pps/ch stimulation rate for their everyday use prior to
commencement of the present study showed improved performance with 500 or 900 pps/ch.
In light of this finding, it is recommended that CI recipients who have been using very low
stimulation rates should be mapped with either 500 or 900 pps/ch and given an opportunity
to trial the higher rate MAP in different listening environments over a number of weeks.
The findings of previous research (Weber et al., 2007; Cochlear Ltd. 2007, Reference Note 1;
Balkany et al., 2007) suggest that for the majority of subjects using Nucleus implants,
stimulation rates between 500 pps/ch and 1200 pps/ch should be tried. The present
investigation's findings are compatible, suggesting that clinicians should program Nucleus
recipients with rates of 500 pps/ch or 900 pps/ch. However, it needs to be remembered that
the present study’s conclusions are based on a limited number of subjects. Clinicians could
consider providing the 500 pps/ch rate as an initial option with the ACE strategy. This rate has
the advantage of offering increased battery life compared to the 900 pps/ch rate. Then, if time
permits, recipients could compare the 500 pps/ch rate to the 900 pps/ch rate. If for example, the
recipient prefers 900 pps/ch and test results in noise show better performance for 900 pps/ch
compared to 500 pps/ch, he/she could then be given the opportunity to try 1200 pps/ch.
4.1. Rationale
Modulation detection thresholds (MDTs) measured electrically have been found to be
closely related to subjects' speech perception ability with cochlear implants and auditory
brainstem implants (Cazals et al., 1994; Fu, 2002; Colletti and Shannon, 2005). In addition,
studies investigating the effect of relatively high and low stimulation rates on MDTs have
shown that MDTs are poorer at high stimulation rates (Galvin and Fu, 2005; Pfingst et al., 2007).
The study by Galvin and Fu (2005) showed that rate had a significant effect on the MDTs
with lower rates (250 pps/ch) having lower thresholds than the higher rates (2000 pps/ch).
Similarly, lower MDTs for 250 pps/ch compared to 4000 pps/ch stimulation rate were
observed in the study by Pfingst et al. (2007). These studies suggest that the response
properties of auditory neurons to electrical stimulation, along with the limitations imposed by
their refractory behavior, must be considered in CI systems (Wilson et al., 1997; Rubinstein et
al., 1998). Pfingst et al. (2007) also reported that the average MDTs for 250 pps/ch and 4000
pps/ch were lowest at the apical and the basal end of the electrode array respectively.
Across site variations in modulation detection found by Pfingst et al. (2007, 2008) suggest
that testing modulation detection only at one or two sites (as in the studies by Shannon,
1992; Busby et al., 1993; Cazals et al., 1994; Fu, 2002; Galvin and Fu, 2005) may not provide a
complete assessment of a CI recipient’s modulation sensitivity. In addition, in modern CI
sound processors, speech is not coded at one or two specific electrode sites. In current filter
bank strategies, many electrode sites are stimulated sequentially based on the amplitude
spectrum of the input waveform. In a typical ACE strategy, up to 8 to 10 electrodes are
Cochlear Implant Stimulation Rates and Speech Perception 241
selected based on the channels with the greatest amplitude. Thus, when measuring MDTs, it
may be more realistic to use speech-like signals, which assess modulation detection across
multiple electrodes.
In the current study, modulation sensitivity for different stimulation rates (275, 350, 500, and
900 pps/ch) was measured using acoustic stimuli that stimulated multiple electrodes, as in
the ACE strategy. Electrode place and intensity coding of the stimuli was representative of
vowel-like signals. It can be
argued that MDTs measured across multiple electrodes may be dominated by a few
electrodes due to across electrode variations in stimulus levels. However, in the current
study all subjects used MP1+2 mode of stimulation and thus the across electrode variations
in stimulation levels were small for all subjects. The depth and frequency of sinusoidal
amplitude modulation in the stimulus envelope of each channel were controlled in the
experiment. Given that CI subjects have been found to be most sensitive to modulations
between 50 and 100 Hz (Shannon, 1992; Busby et al., 1993), modulation frequencies
of 50 and 100 Hz were examined. Speech recognition has been found to correlate well with
MDTs averaged across various stimulation levels of the electrical dynamic range (Fu, 2002).
Therefore, this study presented stimuli at an acoustic level that produced electrical levels close
to the subjects’ most comfortable loudness (MCL) level of stimulation and at a softer acoustic
level of 20 dB below this. In previous CI literature, MDTs have been measured using
modulated electrical pulse trains; however, in the present study MDTs were measured using
acoustic vowel-like stimuli (referred to as Acoustic MDTs in this chapter). Thus for
comparison to previous literature, the Acoustic MDTs were transformed to their equivalent
current MDTs (referred to as Electric MDTs in this chapter). Acoustic MDTs were of interest
because when stimuli are presented acoustically, the differences among different stimulation
rate maps are taken into account as in real-life situations for CI users. It is likely that Acoustic
MDTs are affected by the subjects’ electrical dynamic ranges which can vary with rate of
stimulation. This study examined the influence of electrical dynamic range on Acoustic MDTs.
Specifically, the study examined:
- whether rates of stimulation (between 275 and 900 pps/ch) affect modulation detection
for vowel-like signals stimulating multiple electrodes; and
- whether modulation detection at different stimulation rates predicts speech perception
at these rates.
4.2. Method
Modulation detection thresholds were measured for the same 10 subjects who had
previously participated in study 1. A repeated ABCD experimental design for the four rate
conditions was employed. Evaluation order for rate conditions was balanced across subjects.
Four MDT data points were recorded for each rate condition at two modulation frequencies
and two stimulation levels (4 data points × 4 rates × 2 modulation frequencies × 2 levels) in
each phase of the repeated experimental design.
242 Modern Speech Recognition Approaches with Case Studies
A sinusoidally amplitude modulated (SAM) acoustic signal with a carrier frequency of 2 kHz was
presented to the audio input of a research processor. The research processor maps were
based on the map parameters used in study 1. Most strategy parameters (e.g. number of
maxima, pulse width, mode) were kept the same as those used in study 1. However, the
maps differed from conventional maps in that only one band-pass filter with a bandwidth of
1.5 to 2.5 kHz (center frequency = 2 kHz) was mapped to all the active electrodes in the map.
This was done so that all electrodes received the same temporal information for the test
stimuli. The electrical threshold and comfortable levels of stimulation for each electrode
were taken from the maps used in study 1.
The signal was used to modulate the envelope of electrical pulse trains interleaved across
eight electrode sites. The choice of which 8 electrodes were activated in the maps was based
on which electrodes were on average activated in conventional maps for the vowels /a/ and
/i/ spoken by a male Australian speaker. This was done by analyzing the spectrograms of
each vowel (four different tokens per vowel) and measuring the spectral magnitude at
frequencies which coincided with the center frequencies of the bands used in the
conventional maps. Two separate vowel maps, one for each vowel, with different sets of
fixed electrodes were created. The SAM acoustic stimuli, when presented through the
experimental map, thus provided vowel-like place coding and a SAM temporal envelope
code on each channel activated. In addition, the modulation frequency and depth could be
controlled systematically via the input SAM signal. For convenience this stimulus will be
referred to as a vowel-like SAM stimulus throughout this study.
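The SAM input signal described above can be sketched numerically. The sampling rate and duration below are illustrative assumptions; the chapter's stimuli were delivered through a research processor rather than generated this way:

```python
import numpy as np

def sam_stimulus(m, fc=2000.0, fm=100.0, fs=16000.0, dur=0.5):
    """Sinusoidally amplitude-modulated (SAM) tone:
    s(t) = [1 + m*sin(2*pi*fm*t)] * sin(2*pi*fc*t),
    with carrier fc = 2 kHz as in the experimental maps."""
    t = np.arange(int(dur * fs)) / fs
    envelope = 1.0 + m * np.sin(2 * np.pi * fm * t)
    return envelope * np.sin(2 * np.pi * fc * t)

def depth_db(m):
    """Modulation depth on the scale used for MDTs: 20 log10(m)."""
    return 20.0 * np.log10(m)
```

On this scale, a depth of m = 0.1 corresponds to -20 dB, near the bottom of the MDT axes in Figures 13 and 14.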
Modulation frequencies of 50 and 100 Hz were presented at an acoustic level that produced
electrical levels close to the subjects’ most comfortable level (MCL) of stimulation and at an
acoustic level 20 dB below this. Modulation depth was varied in a three-alternative
forced-choice (3AFC) task to obtain the threshold level at which the subject could
discriminate between the modulated and unmodulated waveforms for a particular
modulation frequency. A jitter of +/- 3 dB was
applied to minimize any loudness effects on measurement of MDTs.
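The chapter does not specify the adaptive rule used to vary modulation depth. Purely as an illustration, the sketch below assumes a common 2-down 1-up track (which converges near the 70.7%-correct point), with a deterministic listener standing in for real 3AFC responses:

```python
def run_staircase(true_threshold_db, start_db=-6.0, step_db=2.0, n_reversals=8):
    """2-down 1-up adaptive track on modulation depth (20 log10 m).
    The idealized listener is always correct when the depth exceeds its
    true threshold; real 3AFC data would also include guessing."""
    level, correct_run, direction = start_db, 0, 0
    reversals = []
    while len(reversals) < n_reversals:
        correct = level > true_threshold_db     # idealized response
        if correct:
            correct_run += 1
            if correct_run == 2:                # 2-down: shallower (harder)
                correct_run = 0
                if direction == +1:
                    reversals.append(level)     # track turned downward
                direction = -1
                level -= step_db
        else:                                   # 1-up: deeper (easier)
            correct_run = 0
            if direction == -1:
                reversals.append(level)         # track turned upward
            direction = +1
            level += step_db
    return sum(reversals[-6:]) / 6.0            # threshold estimate
```

With the deterministic listener the mean of the final reversals lands on the true threshold, which is a quick sanity check on the tracking logic.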
4.3. Results
4.3.1. Stimulation rate effect on electrical dynamic range
As stimulation rate increased from 275 to 900 pps/ch, mean electrical dynamic range (DR, averaged across the eight
most active electrodes that were selected as maxima when coding vowels /a/ and /i/)
increased from 40.5 to 51.7 CL (or from ~6.9 dB to 8.9 dB in current) for the vowel /a/ map
and from 37.6 to 47.8 CL (or from ~6.5 dB to 8.2 dB in current) for the vowel /i/ map. These
levels were obtained after all four rate programs were balanced for loudness.
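The conversion from clinical levels (CL) to decibels of current quoted above is consistent with a current step of roughly 2% per CL (about 0.172 dB per unit), a figure typical of Nucleus devices; the exact step size is an assumption here:

```python
import math

# Assumed current step of 2% per clinical level: 20*log10(1.02) ~ 0.172 dB/CL.
DB_PER_CL = 20.0 * math.log10(1.02)

def dr_cl_to_db(dr_cl):
    """Express an electrical dynamic range given in clinical levels in dB."""
    return dr_cl * DB_PER_CL
```

This reproduces the figures in the text to within about 0.1 dB: 40.5 CL ≈ 6.9-7.0 dB and 51.7 CL ≈ 8.9 dB for the /a/ map; 37.6 CL ≈ 6.5 dB and 47.8 CL ≈ 8.2 dB for the /i/ map.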
4.3.2. Effects of stimulation rate, modulation frequency and presentation level on MDTs
4.3.2.1. Acoustic MDTs
Figure 13 shows Acoustic MDTs for the two modulation frequencies and four stimulation
rate conditions measured at MCL in the leftmost panels (a, c, and e) and at MCL-20 dB in
the rightmost panels (b, d, and f). MDTs are shown separately for each vowel in the top and
middle panels (a and b for vowel /a/, c and d for vowel /i/) and averaged across both vowels
in the bottom panels (e and f).
[Panels (a)-(f): MDT (20 log m) for vowel /a/, vowel /i/, and both vowels combined, at MCL and MCL-20 dB, as a function of modulation frequency (50 and 100 Hz) for rates of 275, 350, 500, and 900 pps/ch.]
Figure 13. Acoustic MDTs, averaged across the subject group, measured at MCL and MCL-20 dB for
two modulation frequencies and four stimulation rates.
Repeated measures analysis of variance for Acoustic MDTs averaged across the two vowels
revealed a significant effect of rate (F [3, 27] = 3.6, p = 0.026). The Mauchly test of sphericity
showed that sphericity was violated for the rate effect. However, the effect remained
significant after the G-G (Greenhouse and Geisser, 1959) correction was applied to the
degrees of freedom and the p values.
Post-hoc comparisons for the effect of rate revealed significantly lower MDTs for 500 pps/ch
compared to 275 pps/ch rate. There were no significant effects of the other main factors,
“modulation frequency” and “level”, on MDTs. The interaction between rate and modulation
frequency was not significant, but there was a significant interaction between rate and level.
Post-hoc comparisons revealed significantly lower MDTs for 500 and 900 pps/ch rates
compared to 275 pps/ch rate at MCL, but no significant effect of rate at MCL-20 dB. MDTs at
MCL were significantly lower than those at MCL-20 dB for the rate of 500 pps/ch. The
interaction between modulation frequency and level was significant. MDTs for 50 Hz were
significantly lower than those for 100 Hz modulation at MCL-20 dB and MDTs at MCL were
significantly lower compared to those at MCL-20 dB for the modulation frequency of 100 Hz.
A similar pattern of results to those above was observed for separate analyses of each vowel.
Figure 14 shows Electric MDTs, averaged across the subject group, for the two modulation
frequencies and four stimulation rate programs measured at MCL and MCL-20 dB level.
Repeated measures three-way analysis of variance for Electric MDTs averaged across the
two vowels revealed a significant effect of rate (F [3, 27] = 3.54, p = 0.028), modulation
frequency (F [1, 27] = 6.66, p = 0.030), and level (F [1, 27] = 78.88, p < 0.001). The sphericity
assumption was violated for the effect of rate; however, the effect remained significant after
the G-G correction was applied.
Post-hoc comparisons for the rate effect revealed no significant comparisons between pairs
of stimulation rates. Post-hoc comparisons for the effect of modulation frequency revealed
significantly lower MDTs for 50 Hz compared to those for 100 Hz modulation frequency
and the post-hoc comparisons for the effect of level revealed significantly lower MDTs at
MCL compared to those at MCL-20 dB.
The interaction between rate and modulation frequency was not significant. The interaction
between rate and level was significant. At MCL-20 dB, MDTs for 900 pps/ch rate were
significantly poorer than all other stimulation rates whereas there was no significant effect
of rate at MCL. MDTs at MCL were significantly lower than those at MCL-20 dB for all
stimulation rates. The interaction effect between modulation frequency and level was also
significant. At MCL, there was no significant effect of modulation frequency on MDTs
whereas at MCL-20 dB, MDTs for 50 Hz were significantly lower than those at 100 Hz
modulation frequency. MDTs at MCL were significantly lower compared to those at MCL-20 dB
for both modulation frequencies (50 and 100 Hz). Again, similar patterns of results to
those above at MCL and MCL-20 dB were observed for separate analyses of each vowel.
[Panels (a)-(f): MDT (20 log m) for vowel /a/, vowel /i/, and both vowels combined, at MCL and MCL-20 dB, as a function of modulation frequency (50 and 100 Hz) for rates of 275, 350, 500, and 900 pps/ch.]
Figure 14. Electric MDTs, averaged across the subject group, measured at MCL and MCL-20 dB for two
modulation frequencies and four stimulation rates.
Analysis of covariance (ANCOVA) to assess the effect of MDTs averaged across both
stimulation levels (MCL and MCL-20 dB) on speech perception results (study 1) showed no
significant relationships between Acoustic or Electric MDTs and speech perception in quiet
and noise. Similar results were obtained in a separate analysis for the Acoustic and Electric
MDTs measured at MCL-20 dB.
Results of the ANCOVA showed that MDTs (averaged across 50 and 100 Hz modulation
frequencies) at different stimulation rates at MCL predicted sentence-in-noise outcomes
(SRTs) at these stimulation rates (F [1, 29] = 9.26, p = 0.005). Lower MDTs were associated
with lower SRTs. The ANCOVA results also revealed that the estimate for the average slope
for subjects was 0.35 (se = 0.11). There were no significant effects of Acoustic or Electric
MDTs measured at MCL on speech perception in quiet (CNC scores).
ANCOVA results to assess the effect of electrical dynamic range on SRTs showed a
significant relationship between electrical dynamic range and SRTs (F [1, 29] = 5.52, p =
0.026). The estimate for the average slope for subjects was -0.084 (se = 0.035).
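The ANCOVA model used here, a common slope for the covariate with subject-specific intercepts, can be illustrated with ordinary least squares on synthetic data. All values below are fabricated for illustration except the 0.35 slope reported above:

```python
import numpy as np

# Synthetic illustration only: 10 subjects x 4 rates, a common slope of 0.35
# (the value reported in the text) and subject-specific intercepts.
rng = np.random.default_rng(0)
n_subj, n_rates, true_slope = 10, 4, 0.35
mdt = rng.uniform(-22.0, -8.0, size=(n_subj, n_rates))   # MDTs in dB (fabricated)
intercepts = rng.normal(5.0, 2.0, size=n_subj)           # per-subject SRT offsets
srt = intercepts[:, None] + true_slope * mdt             # noiseless synthetic SRTs

# Design matrix: one dummy column per subject plus the MDT covariate.
X = np.zeros((n_subj * n_rates, n_subj + 1))
for i in range(n_subj):
    X[i * n_rates:(i + 1) * n_rates, i] = 1.0
X[:, -1] = mdt.ravel()

coef, *_ = np.linalg.lstsq(X, srt.ravel(), rcond=None)
slope = coef[-1]   # recovers the common slope on this noiseless data
```

The fitted slope is the analogue of the 0.35 (se = 0.11) estimate above: lower MDTs predict lower (better) SRTs across rates, after removing subject-level differences.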
4.4. Discussion
4.4.1. Effects of modulation frequency and presentation level on MDTs
Both Acoustic and Electric MDTs were significantly lower for 50 Hz compared to 100 Hz
modulation frequency at the lower presentation level (MCL-20 dB). These results are
somewhat consistent with the findings of previous studies in which a progressive increase
in modulation thresholds for modulation frequencies above approximately 100 Hz has been
reported (Shannon, 1992; Busby et al., 1993). The Electric MDTs (averaged across 50 and
100 Hz modulation frequencies) presented at MCL were either equivalent to, or better than,
those at MCL-20 dB for all rates examined. The Acoustic MDTs at MCL were significantly
better than those at MCL-20 dB for 500 pps/ch for MDTs averaged across 50 and 100 Hz.
These findings are compatible with those of previous studies (Shannon, 1992; Fu, 2002;
Galvin and Fu, 2005; Pfingst et al., 2007).
The differences across rates between the Acoustic and Electric MDTs can be partially
attributed to differences in the range of electrical current levels employed across each rate
map. Because the electrical dynamic range employed in maps increased with increasing
stimulation rate, a smaller depth of modulation in the acoustic input signal is required for
higher rate maps in order to produce the same range of modulation in electrical current
levels coded for each rate map. Thus, the finding that modulation sensitivity expressed in
acoustic levels was better at 900 pps/ch than at 350 and 275 pps/ch can be partly accounted
for by the increase in electrical dynamic range coded in the higher rate maps, at least for
MDTs measured at the higher presentation level.
At the lower presentation level, it is likely that the increased dynamic range at 900 pps/ch
rate could not compensate for the poorer Electric MDTs obtained at that rate.
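The argument above can be made concrete under a simplifying assumption: if the processor maps its acoustic input dynamic range (IDR; the 40 dB used below is hypothetical) linearly in dB onto the electrical dynamic range in CL, then a fixed acoustic modulation produces a deeper electrical modulation in a map with a larger DR:

```python
def electric_modulation_cl(acoustic_depth_db, dr_cl, idr_db=40.0):
    """Electrical modulation (in clinical levels) produced by an acoustic
    envelope modulation of the given depth, assuming a linear dB-to-CL map
    over an assumed 40 dB input dynamic range."""
    return acoustic_depth_db * dr_cl / idr_db

# The same 6 dB acoustic modulation is coded more deeply in the
# higher-rate /a/ map (DR = 51.7 CL) than in the lower-rate one (40.5 CL):
shallow = electric_modulation_cl(6.0, dr_cl=40.5)
deep = electric_modulation_cl(6.0, dr_cl=51.7)
```

Equivalently, the higher-rate map needs a smaller acoustic modulation depth to reach the same electrical modulation, which is why Acoustic MDTs can improve with rate even when Electric MDTs do not.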
It can be argued that differences in the absolute current levels at each rate examined (due to
different T and C levels for each rate program) might have affected the MDT results. This
effect could be more pronounced at MCL-20 dB level, because the effects of stimulation rate
on loudness summation are larger at lower stimulation levels (McKay and McDermott, 1998;
McKay et al., 2001) which is consistent with the reduction in T levels with increasing rate
noted in the current study. However, care was taken to loudness balance all rate programs
and thus absolute differences in current levels coded for each rate condition are unlikely to
have translated to substantial differences in loudness across rates.
The above outcomes are somewhat consistent with the findings of a study by Middlebrooks
(2008), in which cortical modulation sensitivity was measured for modulation frequencies
between 20 and 64 Hz in an animal model to quantify the effect of carrier/stimulation rate.
Results showed no significant effect of rate for MDTs between 254 and 508 pps/ch; however,
further increase in rates led to consistently increasing MDTs. The author suggested that at
254 and 508 pps/ch pulse rates there is maximum phase locking of the auditory nerves. As
the carrier pulse rates are increased further, there is a decrease in auditory nerve phase
locking and a subsequent loss of brainstem phase locking to the carrier. The author did,
however, indicate that the psychophysical data in humans and the cortical unit responses in
the animal model should be compared with caution, due to differences in species, type of
deafness, and the physical fit of the stimulation electrode arrays. Moreover, in animal
studies all stimulated neural units respond, whereas in human psychophysical experiments
perceptual decisions are made based on the activity of the most sensitive neurons.
There is also some psychophysical evidence (Galvin and Fu, 2005; Pfingst et al., 2007) to
support the findings of the current study, although these studies explored a different range
of stimulation rates and modulation frequencies. They did report better MDTs for lower
stimulation rates (250 pps/ch) compared to the higher stimulation rates (≥ 2000 pps/ch)
which is consistent with the poorer Electric MDTs found at 900 pps/ch compared to the
lower rates in the present study.
MDTs were measured at only two stimulation levels in the present study. These studies did not
report relationships between speech perception and MDTs at specific stimulation levels. In
addition, stimulation rate was not examined in these studies.
For modulation detection measured at MCL, significant effects of Acoustic MDTs and
electrical dynamic range (DR) on speech recognition in noise were found in the present
study. Acoustic MDTs were of interest, because for both speech and modulation tests, the
stimuli were presented through sound processor maps and thus the effects of electrical
stimulation level differences between maps with different dynamic ranges were taken into
account. Furthermore, a positive correlation between electrical DR and speech test results in
noise suggests that the increase in electrical DR with rate contributed to the increase in speech
test scores in noise with rate, at least for rates up to 500 pps/ch. These results were somewhat
consistent with the previous findings by Pfingst and Xu (2005) which showed that subjects
with larger mean dynamic range had better speech recognition in quiet and noise.
At the highest rate (900 pps/ch), speech perception results were equal to, or worse than,
those at 500 pps/ch, particularly for the speech in noise test. In addition, Electric MDTs were
poorest at 900 pps/ch compared to 500 pps/ch, which is consistent with findings of other
studies where poorer MDTs were observed for higher rates of stimulation (e.g., Galvin and
Fu, 2005; Pfingst et al., 2007). Thus, the benefits to speech perception obtained with an
increase in electrical DR may be offset by a reduction in modulation detection sensitivity
with increasing rate. The rate for which the effect of increased electrical DR is counteracted
by a decrease in modulation detection sensitivity is likely to vary between subjects, speech
material, and presentation levels. For the subjects in the current study, although a rate of 500
pps/ch was found to provide the best speech perception results, it is possible that some
other rate between 500 and 900 pps/ch, or perhaps even higher, may have provided better
speech perception results and/or a better correlation with the electrical DR and modulation
detection sensitivity.
At the lower presentation levels (MCL-20 dB), MDTs (Acoustic and Electric) did not
correlate with speech perception outcomes. Similar findings were observed by Chatterjee
and Peng (2008). In their study no significant correlation was obtained between MDTs
measured at soft levels (i.e., at 50% of the dynamic range) and speech intonation recognition
presented at comfortable levels.
Rates between 500 and 900 pps/ch, or even higher rates, were not examined in the present
study. Such rates may have provided benefits to speech perception above those obtained at
a rate of 500 pps/ch.
5. Conclusion
The above studies investigated the effect of slow and moderate stimulation rates on speech
perception and modulation detection in recipients of the Nucleus cochlear implant. Group
results for sentence perception in noise showed improved performance for 500 and/or 900
pps/ch stimulation rates but no significant rate effect was observed for monosyllabic
perception in quiet. Most subjects preferred the 500 pps/ch stimulation rate in noise.
However, a close relationship between each subject’s subjective preference and the rate
program that provided best speech perception was not observed.
The variable outcomes obtained in the reported studies on stimulation rate could be
influenced by factors such as audiological history, length of implant use, duration of
hearing loss, speech processing strategy employed, or the implant system used by the
implantee. Given the variability of these factors across subjects, it may be important to
optimize the stimulation rate for an individual cochlear implantee.
In study 2, cochlear implant subjects’ speech perception was compared with their
psychoacoustic temporal processing abilities. The aim was to find an objective method for
optimizing stimulation rates for cochlear implantees. This study uniquely used
multi-channel stimuli to measure modulation detection. Best Acoustic MDTs at MCL were
obtained at a rate of 500 pps/ch. MDTs at 900 pps/ch rate were slightly worse and poorest
results were observed at the lower rates. No significant effect of rate was observed for
Acoustic MDTs at MCL-20 dB. For Electric MDTs at MCL-20 dB, best MDTs were obtained
for rates of 500 pps/ch or lower, and poorest MDTs were obtained at 900 pps/ch. Acoustic
MDTs are a realistic measure of MDTs since they take into account the map differences in
dynamic range across stimulation rates. Acoustic MDTs at MCL at different stimulation
rates predicted sentence perception in noise at these rates.
Author details
Komal Arora
Department of Otolaryngology, The University of Melbourne, Australia
Acknowledgement
The author would like to express appreciation to the research subjects who participated in
studies one and two. The studies were supported by the University of Melbourne. The author
would like to thank Dr. Pam Dawson, Mr. Andrew Vandali and Prof. Richard Dowell for
their constant support and guidance.
Notes
1. Cochlear Ltd. 2007. Selecting stimulation rate with the Nucleus freedom system. White
paper prepared by Cochlear Ltd.
6. References
Arora, K., Dawson, P., Dowell, R. C., & Vandali, A. E. (2009). Electrical stimulation rate effects
on speech perception in cochlear implants. International Journal of Audiology, 48(8).
Arora, K., Vandali, A. E., Dawson, P., & Dowell, R. C. (2010). Effects of electrical stimulation
rate on modulation detection and speech recognition by cochlear implant users.
International Journal of Audiology, 50(2).
Assmann, P. F., & Summerfield, Q. (1990). Modeling the perception of concurrent vowels:
vowels with different fundamental frequencies. J Acoust Soc Am, 88(2), 680-697.
Bacon, S. P., & Viemeister, N. F. (1985). Temporal modulation transfer functions in normal-
hearing and hearing-impaired listeners. Audiology, 24(2), 117-134.
Balkany, T., Hodges, A., Menapace, C., Hazard, L., Driscoll, C., Gantz, B., et al. (2007). Nucleus
Freedom North American clinical trial. Otolaryngol Head Neck Surg, 136(5), 757-762.
Baskent, D., & Shannon, R. V. (2004). Frequency-place compression and expansion in
cochlear implant listeners. J Acoust Soc Am, 116(5), 3130-3140.
Brokx, J. P. L., & Nooteboom, S. G. (1982). Intonation and the perceptual separation of
simultaneous voices. Journal of Phonetics, 10, 23-36.
Burian, K. (1979). [Clinical observations in electric stimulation of the ear (author's transl)].
Arch Otorhinolaryngol, 223(1), 139-166.
Burns, E. M., & Viemeister, N. F. (1976). Nonspectral pitch. J Acoust Soc Am, 60(4), 863-869.
Busby, P. A., & Clark, G. M. (2000). Pitch estimation by early-deafened subjects using a
multiple-electrode cochlear implant. J Acoust Soc Am, 107(1), 547-558.
Busby, P. A., Tong, Y. C., & Clark, G. M. (1993). The perception of temporal modulations by
cochlear implant patients. J Acoust Soc Am, 94(1), 124-131.
Busby, P. A., Whitford, L. A., Blamey, P. J., Richardson, L. M., & Clark, G. M. (1994). Pitch
perception for different modes of stimulation using the cochlear multiple-electrode
prosthesis. J Acoust Soc Am, 95(5 Pt 1), 2658-2669.
Cazals, Y., Pelizzone, M., Saudan, O., & Boex, C. (1994). Low-pass filtering in amplitude
modulation detection associated with vowel and consonant identification in subjects
with cochlear implants. J Acoust Soc Am, 96(4), 2048-2054.
Chatterjee, M., & Peng, S. C. (2008). Processing F0 with cochlear implants: Modulation
frequency discrimination and speech intonation recognition. Hear Res, 235(1-2), 143-156.
Clark, G. M., Black, R., Forster, I. C., Patrick, J. F., & Tong, Y. C. (1978). Design criteria of a
multiple-electrode cochlear implant hearing prosthesis [43.66.Ts, 43.66.Sr]. J Acoust Soc
Am, 63(2), 631-633.
Cohen, L. T., Busby, P. A., Whitford, L. A., & Clark, G. M. (1996). Cochlear implant place
psychophysics 1. Pitch estimation with deeply inserted electrodes. Audiol Neurootol,
1(5), 265-277.
Cohen, N. L., & Waltzman, S. B. (1993). Partial insertion of the Nucleus multichannel
cochlear implant: Technique and results, Am. J. Otol, 14(4), 357–361.
Colletti, V., & Shannon, R. V. (2005). Open set speech perception with auditory brainstem
implant? Laryngoscope, 115(11), 1974-1978.
Cowan, B. (March, 2007). Historical and BioSafety Overview. Paper presented at the Cochlear
Implant Training Workshop, Bionic Ear Institute, Melbourne.
Donaldson, G. S., & Nelson, D. A. (2000). Place-pitch sensitivity and its relation to consonant
recognition by cochlear implant listeners using the MPEAK and SPEAK speech
processing strategies. J Acoust Soc Am, 107(3), 1645-1658.
Dorman, M. F., Loizou, P. C., Fitzke, J., & Tu, Z. (1998). The recognition of sentences in noise
by normal-hearing listeners using simulations of cochlear-implant signal processors
with 6-20 channels. J Acoust Soc Am, 104(6), 3583-3585.
Dynes, S. B., & Delgutte, B. (1992). Phase-locking of auditory-nerve discharges to sinusoidal
electric stimulation of the cochlea. Hear Res, 58(1), 79-90.
Eddington, D. K., Dobelle, W. H., Brackmann, D. E., Mladejovsky, M. G., & Parkin, J. L.
(1978). Auditory prostheses research with multiple channel intracochlear stimulation in
man. Ann Otol Rhinol Laryngol, 87(6 Pt 2), 1-39.
Fearn, R., & Wolfe, J. (2000). Relative importance of rate and place: experiments using pitch
scaling techniques with cochlear implants recipients. Ann Otol Rhinol Laryngol Suppl,
185, 51-53.
Fletcher, H. (1940). Auditory Patterns. Reviews of Modern Physics, 12, 47-65.
Friesen, L. M., Shannon, R. V., Baskent, D., & Wang, X. (2001). Speech recognition in noise as
a function of the number of spectral channels: comparison of acoustic hearing and
cochlear implants. J Acoust Soc Am, 110(2), 1150-1163.
Fu, Q. J. (2002). Temporal processing and speech recognition in cochlear implant users.
Neuroreport, 13(13), 1635-1639.
Fu, Q. J., Chinchilla, S., & Galvin, J. J. (2004). The role of spectral and temporal cues in voice
gender discrimination by normal-hearing listeners and cochlear implant users. J Assoc
Res Otolaryngol, 5(3), 253-260.
Fu, Q. J., & Shannon, R. V. (2000). Effect of stimulation rate on phoneme recognition by
nucleus-22 cochlear implant listeners. J Acoust Soc Am, 107(1), 589-597.
Fu, Q. J., Shannon, R. V., & Wang, X. (1998). Effects of noise and spectral resolution on vowel and
consonant recognition: acoustic and electric hearing. J Acoust Soc Am, 104(6), 3586-3596.
Galvin, J. J., 3rd, & Fu, Q. J. (2005). Effects of stimulation rate, mode and level on modulation
detection by cochlear implant users. J Assoc Res Otolaryngol, 6(3), 269-279.
Glasberg, B. R., & Moore, B. C. (1986). Auditory filter shapes in subjects with unilateral and
bilateral cochlear impairments. J Acoust Soc Am, 79(4), 1020-1033.
Holden, L. K., Skinner, M. W., Holden, T. A., & Demorest, M. E. (2002). Effects of stimulation rate
with the Nucleus 24 ACE speech coding strategy. Ear Hear, 23(5), 463-476.
Nelson, D. A., Van Tasell, D. J., Schroder, A. C., Soli, S., & Levine, S. (1995). Electrode
ranking of "place pitch" and speech recognition in electrical hearing. J Acoust Soc Am,
98(4), 1987-1999.
Nie, K., Barco, A., & Zeng, F. G. (2006). Spectral and temporal cues in cochlear implant
speech perception. Ear Hear, 27(2), 208-217.
Parkins, C. W. (1989). Temporal response patterns of auditory nerve fibers to electrical
stimulation in deafened squirrel monkeys. Hear Res, 41(2-3), 137-168.
Pasanisi, E., Bacciu, A., Vincenti, V., Guida, M., Berghenti, M. T., Barbot, A., et al. (2002).
Comparison of speech perception benefits with SPEAK and ACE coding strategies in
pediatric Nucleus CI24M cochlear implant recipients. Int J Pediatr Otorhinolaryngol,
64(2), 159-163.
Pfingst, B. E., Zwolan, T. A., & Holloway, L. A. (1997). Effects of stimulus configuration on
psychophysical operating levels and on speech recognition with cochlear implants. Hear
Res, 112(1-2), 247-260.
Pfingst, B. E., Xu, L., & Thompson, C. S. (2007). Effects of carrier pulse rate and stimulation site on
modulation detection by subjects with cochlear implants. J Acoust Soc Am, 121(4), 2236-2246.
Pialoux, P. (1976). [Cochlear implants]. Acta Otorhinolaryngol Belg, 30(6), 567-568.
Plant, K., Holden, L., Skinner, M., Arcaroli, J., Whitford, L., Law, M. A., et al. (2007). Clinical
evaluation of higher stimulation rates in the nucleus research platform 8 system. Ear
Hear, 28(3), 381-393.
Psarros, C. E., Plant, K. L., Lee, K., Decker, J. A., Whitford, L. A., & Cowan, R. S. (2002).
Conversion from the SPEAK to the ACE strategy in children using the nucleus 24
cochlear implant system: speech perception and speech production outcomes. Ear Hear,
23(1 Suppl), 18S-27S.
Rosen, S. (1992). Temporal information in speech: acoustic, auditory and linguistic aspects.
Philos Trans R Soc Lond B Biol Sci, 336(1278), 367-373.
Rubinstein, J. T., Abbas, P. J., & Miller, C. A. (1998). The neurophysiological effects of
simulated auditory prosthesis stimulation. Eighth Quarterly Progress Report,
N01-DC-6-2111.
Rubinstein, J. T., Wilson, B. S., Finley, C. C., & Abbas, P. J. (1999). Pseudospontaneous
activity: stochastic independence of auditory nerve fibers with electrical stimulation.
Hear Res, 127(1-2), 108-118.
Seligman, P., & McDermott, H. (1995). Architecture of the Spectra 22 speech processor. Ann
Otol Rhinol Laryngol Suppl, 166, 139-141.
Seligman, P. (March, 2007). Behind-The-Ear Speech Processors. Paper presented at Cochlear
Implant Training Workshop, Bionic Ear Institute, Melbourne.
Shannon, R. V. (1983). Multichannel electrical stimulation of the auditory nerve in man. I.
Basic psychophysics. Hear Res, 11(2), 157-189.
Shannon, R. V. (1992). Temporal modulation transfer functions in patients with cochlear
implants. J Acoust Soc Am, 91(4 Pt 1), 2156-2164.
Shannon, R. V., Zeng, F. G., Kamath, V., Wygonski, J., & Ekelid, M. (1995). Speech
recognition with primarily temporal cues. Science, 270(5234), 303-304.
Simmons, F. B. (1966). Electrical stimulation of the auditory nerve in man. Arch Otolaryngol,
84(1), 2-54.
Skinner, M. W., Arndt, P. L., & Staller, S. J. (2002a). Nucleus 24 advanced encoder
conversion study: performance versus preference. Ear Hear, 23(1 Suppl), 2S-17S.
Skinner, M. W., Holden, L. K., Whitford, L. A., Plant, K. L., Psarros, C., & Holden, T. A.
(2002b). Speech recognition with the nucleus 24 SPEAK, ACE, and CIS speech coding
strategies in newly implanted adults. Ear Hear, 23(3), 207-223.
Skinner, M. W., Ketten, D. R., Holden, L. K., Harding, G. W., Smith, P. G., Gates, G. A.,
Neely, J. G., Kletzker, G. R., Brunsden, B., & Blocker, B. (2002c). CT-derived
estimation of cochlear morphology and electrode array position in relation to word
recognition in Nucleus 22 recipients. J Assoc Res Otolaryngol, 3(3), 332-350.
Stark, H., & Tuteur, F. B. (1979). Modern Electrical Communications. Englewood Cliffs, NJ:
Prentice-Hall.
Tong, Y. C., Black, R. C., Clark, G. M., Forster, I. C., Millar, J. B., O'Loughlin, B. J., et al.
(1979). A preliminary report on a multiple-channel cochlear implant operation. J
Laryngol Otol, 93(7), 679-695.
Tong, Y. C., Clark, G. M., Blamey, P. J., Busby, P. A., & Dowell, R. C. (1982). Psychophysical
studies for two multiple-channel cochlear implant patients. J Acoust Soc Am, 71(1), 153-160.
Townshend, B., Cotter, N., Van Compernolle, D., & White, R. L. (1987). Pitch perception by
cochlear implant subjects. J Acoust Soc Am, 82(1), 106-115.
Vandali, A. E., Sucher, C., Tsang, D. J., McKay, C. M., Chew, J. W. D., & McDermott, H. J.
(2005). Pitch ranking ability of cochlear implant recipients: A comparison of sound-
processing strategies. J Acoust Soc Am, 117(5), 3126.
Vandali, A. E., Whitford, L. A., Plant, K. L., & Clark, G. M. (2000). Speech perception as a
function of electrical stimulation rate: using the Nucleus 24 cochlear implant system.
Ear Hear, 21(6), 608-624.
Verschuur, C. A. (2005). Effect of stimulation rate on speech perception in adult users of the
Med-El CIS speech processing strategy. Int J Audiol, 44(1), 58-63.
Viemeister, N. F. (1979). Temporal modulation transfer functions based upon modulation
thresholds. J Acoust Soc Am, 66(5), 1364-1380.
Weber, B. P., Lai, W. K., Dillier, N., von Wallenberg, E. L., Killian, M. J., Pesch, J., et al.
(2007). Performance and preference for ACE stimulation rates obtained with nucleus RP
8 and freedom system. Ear Hear, 28(2 Suppl), 46S-48S.
Wilson, B. S. (1991). Better speech recognition with cochlear implants. Nature, 352(6332), 236-238.
Wilson, B. S., Finley, C. C., Lawson, D. T., Wolford, R. D., & Zerbi, M. (1993). Design and
evaluation of a continuous interleaved sampling (CIS) processing strategy for
multichannel cochlear implants. J Rehabil Res Dev, 30(1), 110-116.
Wilson, B. S., Finley, C. C., Lawson, D. T., & Zerbi, M. (1997). Temporal representations with
cochlear implants. Am J Otol, 18(6 Suppl), S30-34.
Wilson, B. S., Rebscher, S., Zeng, F. G., Shannon, R. V., Loeb, G. E., Lawson, D. T., et al.
(1998). Design for an inexpensive but effective cochlear implant. Otolaryngol Head Neck
Surg, 118(2), 235-241.
Xu, L., Thompson, C. S., & Pfingst, B. E. (2005). Relative contributions of spectral and
temporal cues for phoneme recognition. J Acoust Soc Am, 117(5), 3255-3267.
Xu, L., Tsai, Y., & Pfingst, B. E. (2002). Features of stimulation affecting tonal-speech
perception: implications for cochlear prostheses. J Acoust Soc Am, 112(1), 247-258.
Section 3
Speech Modelling
Chapter 11

Incorporating Grammatical Features in the Modeling of the Slovak Language for Continuous Speech Recognition

http://dx.doi.org/10.5772/48506
1. Introduction
The task of creating a language model consists of assembling a sufficiently large training corpus containing typical documents and phrases from the target domain, collecting statistical data such as counts of word n-tuples (called n-grams) from this collection of prepared text data, and further processing the raw counts to deduce conditional probabilities of words based on the word history in the sentence. The resulting word tuples and their corresponding probabilities form the language model.
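The counting-and-estimation pipeline just described can be sketched in a few lines of Python. This is a toy illustration, not the chapter's actual tooling; the function names are ours:

```python
from collections import Counter

def ngram_counts(tokens, n):
    """Count n-grams (as tuples) in a token sequence."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def mle_prob(history, word, counts_n, counts_hist):
    """Maximum-likelihood estimate P(word | history) from raw counts."""
    denom = counts_hist[history]
    return counts_n[history + (word,)] / denom if denom else 0.0

corpus = "the cat sat on the mat the cat ran".split()
bigrams = ngram_counts(corpus, 2)    # C(w_{i-1}, w_i)
unigrams = ngram_counts(corpus, 1)   # C(w_{i-1})
p = mle_prob(("the",), "cat", bigrams, unigrams)   # C(the cat)/C(the) = 2/3
```

Note that `mle_prob(("the",), "dog", …)` returns 0.0, since the bigram never occurs in the corpus; that is exactly the zero-probability problem that smoothing, discussed below, must address.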
The major room for improving the precision of a language model lies in language model smoothing. The basic probability estimation method, maximum likelihood, which uses n-gram counts obtained directly from the training corpus, is often insufficient because it assigns zero probability to word n-grams not seen in the training corpus.
One possible way to update n-gram probabilities is to incorporate grammatical features obtained from the training corpus. Basic language modeling methods work only with sequences of words and do not take any language grammar into account; current language modeling techniques are based on statistics of word sequences in sentences, obtained from a training corpus. If information about the language grammar is to be included in the final language model, it has to be done in a way that is compatible with the statistical character of the basic language model. More precisely, this means proposing a method for extracting grammatical features from the text, compiling a statistical model based on these features, and finally using these probabilities to refine the probabilities of the basic, word-based language model.
The process of extracting grammatical information from the text means assigning one of a list of possible features to each word in each sentence of the training corpus, forming several word classes, where one word class consists of all words in the vocabulary of the speech recognition system that can be assigned the same grammatical feature. Statistics collected from these word classes then represent general grammatical features of the training text, which can then be used to improve the original word-based probabilities.
Language modeling of a highly inflectional language such as Slovak faces several problems:

• the vocabulary size is very high, as one word has many inflections and forms;
• the size of the necessary training text is very large – it is hard to capture all events;
• the number of necessary n-grams is very large.
The solutions presented in [25] are mostly based on the utilization of grammatical features and manipulation of the dictionary.
Each of these methods requires special preprocessing of the training corpus before a language model is produced. Every word in the training corpus is replaced by the corresponding item (lemma, word-form, or a sequence of morphemes), and a language model is constructed from the processed corpus. For a highly inflectional language, where the large dictionary makes estimation of the probabilities very difficult, language modeling using extracted grammatical features of words appears to be a beneficial way to improve the general accuracy of speech recognition.
there are 10^15 possible combinations. An amount of text containing all combinations possible with this dictionary is simply impossible to gather and process. In most cases, training corpora are much smaller, and as a consequence the number of extracted n-grams is also smaller. It is then possible that a trigram absent from the training corpus will have zero probability, even if that trigram combination is perfectly possible in the target language. To deal with this problem, a process of adjusting the calculated probabilities, called smoothing, is necessary. This operation moves part of the probability mass from the n-grams present in the training corpus to the n-grams that are not present and whose probabilities have to be calculated from the data that is available.
Usually, when an n-gram is missing from the language model, the required probability is calculated from the available lower-order n-grams using a back-off scheme [19]. For example, if a trigram is not available, bigram probabilities are used to estimate the trigram probability; by the same principle, if a bigram probability is not present, unigram probabilities are used to calculate it. This principle for a bigram language model is depicted in Fig. 1.
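The back-off lookup just described can be sketched as follows. This is a simplified illustration, not the full Katz method: the fixed weight `alpha` stands in for the properly normalized back-off weights that Katz smoothing computes:

```python
def backoff_prob(word, history, bigram_p, unigram_p, alpha=0.4):
    """Return the bigram probability if the bigram was seen;
    otherwise back off to a weighted unigram estimate.
    alpha is an illustrative constant; real Katz back-off
    computes per-history weights so probabilities sum to one."""
    if (history, word) in bigram_p:
        return bigram_p[(history, word)]
    return alpha * unigram_p.get(word, 0.0)

# Toy model tables (hypothetical values).
bigram_p = {("the", "cat"): 0.5}
unigram_p = {"cat": 0.2, "dog": 0.1}
backoff_prob("cat", "the", bigram_p, unigram_p)   # seen bigram -> 0.5
backoff_prob("dog", "the", bigram_p, unigram_p)   # backs off -> 0.4 * 0.1
```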
Often, the back-off scheme alone is not sufficient for efficient smoothing of the language model, and the n-gram probabilities have to be adjusted further. Common additional techniques are methods based on adjusting n-gram counts, such as Laplace smoothing, the Good-Turing method, and the Witten-Bell [5] or modified Kneser-Ney [21] algorithms. The problem with this approach is that these methods were designed for languages that are not very morphologically rich; as shown in [17, 24], this kind of smoothing does not bring the expected positive effect for highly inflectional languages with large vocabularies.
Another common approach for estimating a language model from sparse data is linear interpolation, also called Jelinek-Mercer smoothing [16]. This method allows multiple independent sources of knowledge to be combined into one, which is then used to compose the final language model. In the case of a trigram language model, this approach calculates the final probability as a linear combination of the unigram, bigram, and trigram maximum likelihood estimates. Linear interpolation is not the only method of combining multiple knowledge sources; other possible approaches are maximum entropy [1], log-linear interpolation [20], and generalized linear interpolation [15].
For a bigram model, a linear interpolation scheme utilizing bigrams and unigrams is depicted in Fig. 2. In this case, the final probability is calculated as a linear combination of the two estimates:

P = λP1 + (1 − λ)P2. (1)
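Eq. (1) is trivial to compute; the point is that an unseen bigram (P1 = 0) still receives probability mass from the unigram component. A minimal sketch with an illustrative weight:

```python
def interpolate(p_bigram, p_unigram, lam=0.8):
    """Eq. (1): P = lam * P1 + (1 - lam) * P2 (lam chosen for illustration)."""
    return lam * p_bigram + (1.0 - lam) * p_unigram

# An unseen bigram (P1 = 0) still gets mass from the unigram estimate.
p = interpolate(0.0, 0.05)   # 0.2 * 0.05 = 0.01
```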
P(wi|wi−1 . . . wi−n+1) = P(ci|ci−1 . . . ci−n+1) P(wi|ci), (2)

where P(ci|ci−1 . . . ci−n+1) is the probability of the class ci, to which the word wi belongs, based on the class history. In this equation, the probability of a word wi given its history of n − 1 words h = {wi−1 . . . wi−n+1} is calculated as the product of the class-history probability P(ci|ci−1 . . . ci−n+1) and the word-class probability P(wi|ci).
P(ci|ci−1 . . . ci−n+1) = C(ci−n+1 . . . ci) / C(ci−n+1 . . . ci−1), (3)

where C(ci−n+1 . . . ci) is the count of the sequence of classes in the training corpus and C(ci−n+1 . . . ci−1) is the count of the history of the class ci in the training corpus.
The word-class probability can be estimated as the fraction of the word count C(w) and the total class count C(c):

P(w|c) = C(w) / C(c). (4)
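Equations (2)–(4) can be illustrated with a short Python sketch over a toy tagged corpus (illustrative data and function names, not the chapter's tools):

```python
from collections import Counter

# Toy corpus of (word, class) pairs, as produced by a clustering function g(w).
tagged = [("the", "DET"), ("cat", "N"), ("sat", "V"),
          ("the", "DET"), ("dog", "N"), ("ran", "V")]

words = [w for w, _ in tagged]
classes = [c for _, c in tagged]

class_bigrams = Counter(zip(classes, classes[1:]))   # C(c_{i-1}, c_i)
class_unigrams = Counter(classes)                    # C(c)
word_class = Counter(tagged)                         # C(w, c)

def p_class(c, c_prev):
    """Eq. (3), bigram case: P(c_i | c_{i-1}) from class counts."""
    return class_bigrams[(c_prev, c)] / class_unigrams[c_prev]

def p_word_given_class(w, c):
    """Eq. (4): P(w | c) = C(w) / C(c) within the class."""
    return word_class[(w, c)] / class_unigrams[c]

def p_class_based(w, c, c_prev):
    """Eq. (2): P(w | history) = P(c | c_prev) * P(w | c)."""
    return p_class(c, c_prev) * p_word_given_class(w, c)
```

For example, `p_class_based("cat", "N", "DET")` multiplies P(N|DET) = 1 by P(cat|N) = 1/2, giving 0.5.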
A basic feature of class-based models is the lower number of independent parameters [4] in the resulting language model. A word-based n-gram language model stores a probability value for each n-gram, as well as back-off weights for the lower-order n-grams. In a class-based model, a whole set of words is reduced to a single class, and the model describes the statistical properties of that class. Another advantage is that the same classical smoothing methods presented above can be used for a class-based language model as well.
Words are assigned to classes by a word clustering function

g(w, h) → c, (5)

where the class c is from the set of all possible classes G, the word w is from the vocabulary V, and h is the surrounding context of the word w. This function can be defined in multiple ways, utilizing expert knowledge or using data-driven approaches for word-class induction, and can have various features.
If the word clustering function is generalized to include every possible word, the class-based language model equation can then be rewritten in terms of this function.
For this reason, a linear combination of a class-based and a word-based language model is proposed. It should be performed in such a way that the resulting language model mostly takes word-based n-grams into account when they are available, and falls back to the class-based n-grams that use grammatical features when no word-based n-gram exists, as shown in Fig. 3.
The first part of this language model can be created from the training corpus using classical language modeling methods. To create the class-based model, the training corpus has to be processed by the word clustering function, with every word replaced by its corresponding class; the class-based model is then built from this processed corpus. During this process, a word-class probability function has to be estimated, expressing the probability distribution of words within each class. The last step is to determine the interpolation parameter λ, which should be set to values close to (but lower than) 1.
There are several possible methods of segmenting words into such features. Basically, they can be divided into two groups: rule-based and statistics-based methods. In the following text, a rule-based method for identifying the stem or suffix of a word will be presented, and a statistics-based method for finding the word lemma or part of speech will be introduced.
The list of the most common suffixes has been created in the following steps:

1. a dictionary of the most common words in the language has been obtained;
2. from each word longer than 6 characters, a suffix of length 2, 3, or 4 characters has been extracted;
3. the number of occurrences of each extracted suffix has been calculated;
4. a threshold has been chosen, and suffixes with a count higher than the threshold have been added to the list of all suffixes.
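The four steps above can be sketched directly in Python. The corpus, threshold, and helper names here are illustrative, not those used in the chapter's experiments:

```python
from collections import Counter

def extract_suffixes(words, min_len=7, suffix_lens=(2, 3, 4), threshold=2):
    """Steps 1-4: count suffixes of length 2-4 taken from words longer
    than 6 characters, and keep those above a chosen threshold."""
    counts = Counter()
    for w in words:
        if len(w) >= min_len:
            for k in suffix_lens:
                counts[w[-k:]] += 1
    return {s for s, c in counts.items() if c >= threshold}

def split_word(word, suffixes):
    """Split a word into (stem, suffix) using the longest known suffix.
    Words that cannot be split form a class by themselves."""
    for k in (4, 3, 2):
        if word[-k:] in suffixes and len(word) > k:
            return word[:-k], word[-k:]
    return word, ""

words = ["running", "jumping", "swimming", "walking", "cat"]
suffixes = extract_suffixes(words)          # {"ng", "ing"}
split_word("running", suffixes)             # ("runn", "ing")
```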
Once the list of the most common suffixes is created, it is possible to easily identify the stem and the suffix of a word.
A disadvantage of this method is that it is statistically based, so it is not always precise, and some of the suffixes found might not be grammatically correct.
This suffix or stem assignment function can then be used as a word clustering function that assigns exactly one class to every word. Words with the same suffix or stem will then belong to the same class and, given the properties of the language, will share similar statistical features.
The lemma assignment task is very similar to the part-of-speech assignment task, and very similar methods can be used. The part-of-speech or lemma assignment function can be used as a word clustering function when forming a class-based language model. The problem with this approach is that one word can belong to more than one class at once, which can lower the precision of the language model.
gbest(W) = argmax(i) P(gi(W)|W),

where the best sequence of classes gbest(W) is chosen from all class sequences gi(W) that are possible for the word sequence W, according to the probability of occurrence of the class sequence given the word sequence W.
There are several problems with this equation. First, the number of possible sequences gi(W) is very high, and it is not computationally feasible to examine them individually. Second, there has to be a framework for expressing the probability of a sequence, P(gi(W)|W), and for calculating its maximum.
The hidden Markov model is defined as a quintuple (S, V, A, B, π): a set of states S, an output vocabulary V, a matrix A of state transition probabilities, a matrix B of output (observation) probabilities, and a vector π of initial state probabilities.
To construct a hidden Markov model for the task of POS tagging, all of these components should be calculated as precisely as possible. The most important part of the whole process is a manually prepared training corpus in which each word has a class assigned by hand. This preparation is very difficult and requires a lot of work by human annotators.
When an annotated corpus is available, estimating the main components of the hidden Markov model, the matrices A and B, is relatively easy. Again, the maximum likelihood method can be used:

A = P(ci|ci−1) = C(ci−1, ci) / C(ci−1) (9)

and

B = P(wi|ci) = C(wi, ci) / C(ci), (10)
where C(ci−1, ci) is the count of the pair of succeeding classes ci−1, ci and C(ci) is the count of the class ci in the training corpus. After the matrices A and B are prepared, the best sequence of classes for a given sequence of words can be calculated using a dynamic programming method, the Viterbi algorithm.
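A minimal Viterbi decoder over the sparse A and B matrices can be sketched as follows. The toy tag set and probabilities are illustrative; log probabilities are used to avoid underflow on long sentences:

```python
import math

def viterbi(words, states, A, B, pi):
    """Most probable class sequence for a word sequence, given
    transitions A = P(c_i|c_{i-1}), emissions B = P(w_i|c_i),
    and initial probabilities pi, all stored as sparse dicts."""
    def lg(p):
        return math.log(p) if p > 0 else float("-inf")

    V = [{c: lg(pi.get(c, 0)) + lg(B.get((words[0], c), 0)) for c in states}]
    back = [{}]
    for i in range(1, len(words)):
        V.append({})
        back.append({})
        for c in states:
            best_prev = max(states, key=lambda p: V[i - 1][p] + lg(A.get((p, c), 0)))
            V[i][c] = (V[i - 1][best_prev] + lg(A.get((best_prev, c), 0))
                       + lg(B.get((words[i], c), 0)))
            back[i][c] = best_prev
    last = max(states, key=lambda c: V[-1][c])
    path = [last]
    for i in range(len(words) - 1, 0, -1):
        path.append(back[i][path[-1]])
    return list(reversed(path))

states = ["DET", "N", "V"]
pi = {"DET": 1.0}
A = {("DET", "N"): 1.0, ("N", "V"): 1.0}
B = {("the", "DET"): 1.0, ("dog", "N"): 0.5, ("barks", "V"): 1.0}
viterbi(["the", "dog", "barks"], states, A, B, pi)   # ["DET", "N", "V"]
```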
As stated above, the Slovak language is characterized by rich morphology and a large vocabulary, which makes the task of POS tagging more difficult. During experiments it turned out that these basic methods are not sufficient, and additional modification of the matrices A and B is required.
For this purpose, a suffix-based smoothing method has been designed, similar (but not identical) to that in [2]. Here, an accuracy improvement can be achieved by calculating the suffix-based probability P(gsuff(wi)|ci). This probability estimate uses the same word clustering function for assigning words to classes as presented above. Again, the observation probability matrix is adjusted using a linear combination.
This operation helps to better estimate the probability of the word wi for the class ci, even if the pair wi, ci does not occur in the training corpus. The second component of the expression improves the probability estimate with counts of words that are similar to the word wi.
5. Experimental evaluation
Basically, language models can be evaluated in two ways. In extrinsic evaluation, the language model is tested in a simulated real-life environment and the performance of the whole automatic speech recognition system is observed; the recognition result is compared to the annotation of the testing set. The standard measure for extrinsic evaluation is the word error rate (WER), calculated as:
WER(W) = (CINS + CDEL + CSUB) / C(W), (12)
where CINS is the number of falsely inserted words, CDEL is the number of unrecognized (deleted) words, CSUB is the number of confused (substituted) words when the recognition result is compared to the word sequence W, and C(W) is the number of words in W.
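In practice the insertion, deletion, and substitution counts in Eq. (12) come from a word-level Levenshtein alignment of hypothesis against reference, which can be sketched as:

```python
def wer(reference, hypothesis):
    """Eq. (12): (insertions + deletions + substitutions) / len(reference),
    computed as a word-level Levenshtein (edit) distance."""
    r, h = reference.split(), hypothesis.split()
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i
    for j in range(len(h) + 1):
        d[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            cost = 0 if r[i - 1] == h[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution or match
    return d[len(r)][len(h)] / len(r)

wer("the cat sat", "the dog sat")   # one substitution -> 1/3
```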
The WER evaluates the real output of the whole automatic speech recognition system; it reflects the user experience and is affected by all components of the system. On the other hand, intrinsic evaluation is evaluation "that measures the quality of the model, independent of any application" [18]. For n-gram language models, the most common
evaluation metric is perplexity (PPL). "The perplexity can also be viewed as the weighted average branching factor of the language model. The branching factor of a language is the number of possible next words that can follow any word" [18]. As with the extrinsic method of evaluation, a testing corpus is required, and the resulting perplexity value is always tied to the corpus used. According to the previous definition, perplexity can be expressed by the equation:
PPL(W) = [ ∏(i=1..N) 1 / P(wi|hi) ]^(1/N), (13)
where P(wi|hi) is the probability returned by the tested language model for each word conditioned on its history, over the testing corpus of length N.
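Eq. (13) is usually computed in log space to avoid underflow on long test corpora; a minimal sketch:

```python
import math

def perplexity(probs):
    """Eq. (13): the N-th root of the product of reciprocal model
    probabilities P(w_i|h_i), computed in log space for stability."""
    n = len(probs)
    log_sum = sum(math.log(p) for p in probs)
    return math.exp(-log_sum / n)

perplexity([0.25, 0.25, 0.25, 0.25])   # uniform over 4 choices -> 4.0
```

A model that always spreads its mass uniformly over k equally likely next words has perplexity k, which matches the "branching factor" reading above.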
Compared to extrinsic methods of evaluation, this offers several advantages. Evaluation using perplexity is usually much faster and simpler, because only a testing corpus and a language model evaluation tool are necessary. This method also eliminates unwanted effects of other components of the automatic speech recognition system, such as the acoustic model or the phonetic transcription system.
For training a class-based model utilizing grammatical features, further processing of the
training corpus is required.
One of the goals of this study is to evaluate the usefulness of grammatical features for language modeling. The tests focus on the following grammatical features mentioned in the previous text:
• part-of-speech;
• word lemma;
• word suffix;
• word stem.
For this purpose, a set of tools implementing word clustering functions has been prepared. For POS and lemma, a statistical classifier based on the hidden Markov model has been designed. This classifier has been trained on data from [29] (presented in [14]). The statistical classifier's method is similar to [2], but it uses an additional back-off method based on the suffix extraction described in the previous section.
For identifying the suffix or the stem of a word, only the simple suffix subtraction method presented above has been used. Compared to the statistical classifier, this method is much simpler and faster. This kind of word clustering function is also more uniform, because one word can belong to only one class. A disadvantage of this approach is that it cannot identify a suffix or stem for short words; each word that cannot be split is considered a class by itself.
Word clustering by suffix identification is performed in two versions. Version 1 uses 625 suffixes compiled by hand; version 2 uses 7 578 statistically identified suffixes (as described above). For each grammatical feature, the whole training corpus has been processed, with every word assigned a class according to the word clustering function used.
To summarize, 7 training corpora have been created, one for each examined grammatical feature. The first training corpus was the baseline, and the other 6 corpora were created by processing the baseline corpus:
• the part-of-speech corpus, marked as POS1, has been created using our POS tagger, based on hidden Markov models;
• lemma-based corpus, marked as LEM1, has been created using our lemma tagger, based
on hidden Markov models;
• suffix-based corpus 1, marked as SUFF1, has been created using suffix extraction method
with 625 hand compiled suffixes;
• stem-based corpus 1, marked as STEM1, has been created using suffix extraction method
with the same suffixes;
• suffix-based corpus 2, marked as SUFF2, has been created using the suffix extraction method with 7 578 statistically obtained suffixes;
• stem-based corpus 2, marked as STEM2, obtained by the same method.
For the class-based models, according to Eq. 2, besides the class-based language model probability P(c|hc), the word-class probability P(w|c) is also required. Again, using maximum likelihood, this probability has been calculated as:
P(w|c) = C(w, g(w)) / C(g(w)), (14)
where C(w, g(w)) is the number of occurrences of the word w with the class c = g(w) and C(g(w)) is the number of words in the class g(w). The processed corpora are then used to create the class-based language models.
Using the prepared training corpora, the SRILM Toolkit [31] has been used to build trigram models with the baseline smoothing method.
The result of this step is 7 language models: one classical word-based model and 6 class-based models, one for each word clustering function.
For quick evaluation, the perplexity measure has been chosen. As an evaluation corpus, 500 000 sentences of held-out data from court-of-law adjudgements have been used. The results of the perplexity evaluation and the characteristics of the resulting language models are in Table 2.
The results, contrary to expectations, have shown that the perplexity of the class-based models constructed from the processed training corpora is always higher than the perplexity of the word-based model. Higher perplexity means that the language model does not fit the testing data as well. The word-based language model thus seems to be always better than the class-based model, even though the class-based model has some advantages. Still, class-based language models can be useful: thanks to the word clustering function, they provide extra information that is not included in the baseline model. The hypothesis is that in some special cases the class-based language model can give a better result than the word-based model. This extra information can be utilized through linear interpolation with the baseline model, so that the result contains both word-based and class-based n-grams.
The weight of the word-based model has been set to λ = 0.98, and the word-class probability calculated in the previous step has also been used.
The result of this process is again a class-based model. This new class-based model utilizing grammatical features contains two types of classes: word-based classes, each containing only one member (the word itself), and grammar-based classes, each containing all the words mapped to it by the word clustering function.
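A probability lookup in such an interpolated model can be sketched as follows. The Slovak-like example words, the suffix-style clustering function, and all probability values are invented for illustration:

```python
def combined_prob(word, history, word_ngrams, class_ngrams, g,
                  p_word_in_class, lam=0.98):
    """Interpolate a word-based bigram model with a class-based one:
    P = lam * P_word(w|h) + (1 - lam) * P_class(g(w)|g(h)) * P(w|g(w))."""
    p_w = word_ngrams.get((history, word), 0.0)
    p_c = (class_ngrams.get((g(history), g(word)), 0.0)
           * p_word_in_class.get(word, 0.0))
    return lam * p_w + (1.0 - lam) * p_c

# Toy suffix-style clustering function and model tables (hypothetical).
g = lambda w: w[-3:] if len(w) > 3 else w
word_ngrams = {("veľkého", "domu"): 0.2}
class_ngrams = {("ého", "omu"): 0.3}
p_word_in_class = {"domu": 0.5, "stromu": 0.4}

seen = combined_prob("domu", "veľkého", word_ngrams, class_ngrams,
                     g, p_word_in_class)     # 0.98*0.2 + 0.02*0.15 = 0.199
unseen = combined_prob("stromu", "veľkého", word_ngrams, class_ngrams,
                       g, p_word_in_class)   # 0.02 * 0.3 * 0.4 = 0.0024
```

The unseen word bigram still receives a small but nonzero probability through its suffix class, which is exactly the fall-back behaviour described above.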
This new set of interpolated language models has been evaluated for perplexity; the results are in Table 2. It is visible that after the interpolation, the perplexity of the interpolated models has decreased considerably. This fact confirms the hypothesis about the usability of the grammar-based language models.
The first conclusion from these experiments (see Table 2) is that classic word-based language models generally give better precision than class-based grammar models. The main advantage of the latter is their smoothing ability: estimating the probability of less frequent events using words that are grammatically similar. This advantage can be utilized using linear interpolation, where the final probability is calculated as a weighted sum of the word-based and class-based components. This helps to better distribute the probability mass in the language model: thanks to the grammar component, more probability is assigned to events (word sequences) not seen in the training corpus. The effect of the grammar component is visible in Table 2, where the simple suffix extraction method combined with linear interpolation helped to decrease the perplexity of the baseline language model by 25%.
The effect of the decreased perplexity has been evaluated in extrinsic tests: recognition of dictated legal texts. These tests, summarized in Table 4, show how the decreased perplexity affects the final precision of the recognition process. In the case of the suffix extraction method, a 2% relative WER reduction has been achieved. An interesting fact is that a change in perplexity did not always lead to a decrease in WER. From this it follows that the perplexity of a language model does not always express its quality for the task of automatic speech recognition, where the final performance is affected by more factors; perplexity can be used only as a kind of clue for the necessary next steps.
The next conclusion is that not every grammatical feature is useful for increasing the precision of speech recognition. Each test shows notable differences in the perplexities and word error rates of the created language models. A closer look at the results shows that features based more on the morphology of the word, such as the suffix or part of speech, perform better than those based more on the semantics of the word, such as the stem- or lemma-based features (compare to [23]). Also, comparing suffix extraction methods 1 and 2, the statistically obtained, higher number of classes yields better results than the handcrafted list of suffixes.
6. Conclusion
The presented approach has shown that the suffix-based extraction method, together with an interpolated class-based model, can bring a much smaller language model perplexity and a considerably lower WER in the automatic speech recognition system. Even though class-based models alone do not bring an important improvement in recognition accuracy, they can be used as a back-off scheme in connection with classical word-based language models, using linear interpolation.
Class-based language models utilizing grammatical features make it possible to:

• optimize the search network of the speech recognition system by putting some words into classes;
• incorporate new words into the speech recognition system without the need to re-train the language model;
• better estimate the probabilities of n-grams that did not occur in the training corpus.
Their disadvantages are:

• a relatively larger search network (it includes both words and word-classes);
• a more difficult language model training process.
Future work in this field should focus on further improving the usability of this type of language model. The first area, not addressed in this work, is the size of the language model. The size of the language model influences loading times, recognition speed, and the disk space used by a real-world speech recognition system. An effectively pruned language model should also bring better precision, because pruning removes n-grams that can be calculated from lower-order n-grams.
The second area that deserves more attention is the problem of language model adaptation. Thanks to the class-based nature of this type of language model, new words and phrases can be inserted into the dictionary by the user, and this feature should be investigated closely.
This work has introduced a methodology for building a language model for a highly inflective language such as Slovak. It should also be usable for similar languages with rich morphology, such as Polish or Czech. It brings better precision and the ability for the user to include new words in the language model without re-training it.
Acknowledgement
The research presented in this paper was supported by the Ministry of Education under
the research project MŠ SR 3928/2010-11 (50%) and Research and Development Operational
Program funded by the ERDF under the project ITMS-26220220141 (50%).
Author details
Ján Staš, Daniel Hládek and Jozef Juhár
Department of Electronics and Multimedia Communications
Technical University of Košice, Slovakia
7. References
[1] Berger, A., Pietra, V. & Pietra, S. [1996]. A maximum entropy approach to natural
language processing, Computational Linguistics 22(1): 71.
[2] Brants, T. [2000]. TnT: A statistical part-of-speech tagger, Proc. of the 6th Conference on
Applied Natural Language Processing, ANLC’00, Stroudsburg, PA, USA, pp. 224–231.
[3] Brill, E. [1995]. Transformation-based error-driven learning and natural language
processing: A case study in part-of-speech tagging, Computational Linguistics 21: 543–565.
[4] Brown, P., Pietra, V., deSouza, P., Lai, J. & Mercer, R. [1992]. Class-based n-gram models
of natural language, Computational Linguistics 18(4): 467–479.
[5] Chen, S. F. & Goodman, J. [1999]. An empirical study of smoothing techniques for
language modeling, Computer Speech & Language 13(4): 359–393.
[6] Creutz, M. & Lagus, K. [2007]. Unsupervised models for morpheme segmentation and
morphology learning, ACM Transactions on Speech and Language Processing 4(1).
[7] Darjaa, S., Cerňak, M., Beňuš, Š., Rusko, M., Sabo, R. & Trnka, M. [2011].
Rule-based triphone mapping for acoustic modeling in automatic speech recognition,
Springer-Verlag, LNAI 6836, pp. 268–275.
[8] Ghaoui, A., Yvon, F., Mokbel, C. & Chollet, G. [2005]. On the use of morphological
constraints in n-gram statistical language model, Proc. of the 9th European Conference on
Speech Communication and Technology.
[9] Goldsmith, J. [2001]. Unsupervised learning of the morphology of a natural language,
Computational Linguistics 27(2): 153–198.
[10] Graça, J. V., Ganchev, K., Coheur, L., Pereira, F. & Taskar, B. [2011]. Controlling
complexity in part-of-speech induction, Journal of Artificial Intelligence Research
41(1): 527–551.
[11] Halácsy, P., Kornai, A. & Oravecz, C. [2007]. HunPos - An open source trigram tagger,
Proc. of the 45th Annual Meeting of the ACL on Interactive Poster and Demonstration Sessions,
Stroudsburg, PA, USA, pp. 209–212.
[12] Hládek, D. & Staš, J. [2010]. Text mining and processing for corpora creation in Slovak
language, Journal of Computer Science and Control Systems 3(1): 65–68.
[13] Hládek, D., Staš, J. & Juhár, J. [2011]. A morphological tagger based on a learning
classifier system, Journal of Electrical and Electronics Engineering 4(1): 65–70.
[14] Horák, A., Gianitsová, L., Šimková, M., Šmotlák, M. & Garabík, R. [2004]. Slovak national
corpus, P. Sojka et al. (Eds.): Text, Speech and Dialogue, TSD’04, pp. 115–162.
[15] Hsu, B. J. [2007]. Generalized linear interpolation of language models, IEEE Workshop on
Automatic Speech Recognition Understanding, ASRU’2007, pp. 136–140.
[16] Jelinek, F. & Mercer, M. [1980]. Interpolated estimation of Markov source parameters
from sparse data, Pattern recognition in practice pp. 381–397.
[17] Juhár, J., Staš, J. & Hládek, D. [2012]. Recent progress in development of language
model for Slovak large vocabulary continuous speech recognition, Volosencu, C. (Ed.):
New Technologies - Trends, Innovations and Research . (to be published).
[18] Jurafsky, D. & Martin, J. H. [2009]. Speech and Language Processing: An Introduction
to Natural Language Processing, Computational Linguistics, and Speech Recognition (2nd
Edition), Prentice Hall, Pearson Education, New Jersey.
[19] Katz, S. [1987]. Estimation of probabilities from sparse data for the language model
component of a speech recognizer, IEEE Transactions on Acoustics, Speech and Signal
Processing 35(3): 400–401.
[20] Klakow, D. [1998]. Log-linear interpolation of language models, Proc. of the 5th
International Conference on Spoken Language Processing.
[21] Kneser, R. & Ney, H. [1995]. Improved backing-off for m-gram language modeling, Proc.
of ICASSP pp. 181–184.
[22] Maltese, G., Bravetti, P., Crépy, H., Grainger, B. J., Herzog, M. & Palou, F.
[2001]. Combining word-and class-based language models: A comparative study in
several languages using automatic and manual word-clustering techniques, Proc. of
EUROSPEECH, pp. 21–24.
[23] Nouza, J. & Drabkova, J. [2002]. Combining lexical and morphological knowledge in
language model for inflectional (czech) language, pp. 705–708.
[24] Nouza, J. & Nouza, T. [2004]. A voice dictation system for a million-word Czech
vocabulary, Proc. of ICCCT pp. 149–152.
[25] Nouza, J., Zdansky, J., Cerva, P. & Silovsky, J. [2010]. Challenges in speech processing of
Slavic languages (Case studies in speech recognition of Czech and Slovak), in A. E. et al.
(ed.), Development of Multimodal Interfaces: Active Listening and Synchrony, LNCS 5967,
Springer Verlag, Heidelberg, pp. 225–241.
[26] Pleva, M., Juhár, J. & Čižmár, A. [2007]. Slovak broadcast news speech corpus for
automatic speech recognition, Proc. of the 8th Intl. Conf. on Research in Telecomunication
Technology, RTT’07, Liptovský Ján, Slovak Republic, p. 4.
[27] Ratnaparkhi, A. [1996]. A maximum entropy model for part-of-speech tagging, Proc. of
Empirical Methods in Natural Language Processing, Philadelphia, USA, pp. 133–142.
[28] Rusko, M., Juhár, J., Trnka, M., Staš, J., Darjaa, S., Hládek, D., Cerňák, M., Papco, M.,
Sabo, R., Pleva, M., Ritomský, M. & Lojka, M. [2011]. Slovak automatic transcription and
dictation system for the judicial domain, Human Language Technologies as a Challenge for
Computer Science and Linguistics: 5th Language & Technology Conference pp. 365–369.
[29] SNK [2007]. Slovak national corpus.
URL: http://korpus.juls.savba.sk/
[30] Spoustová, D., Hajič, J., Votrubec, J., Krbec, P. & Květoň, P. [2007]. The best of
two worlds: Cooperation of statistical and rule-based taggers for Czech, Proc. of the
Workshop on Balto-Slavonic Natural Language Processing: Information Extraction and Enabling
Technologies, pp. 67–74.
276 Modern
20 Speech Recognition Approaches with Case Studies Will-be-set-by-IN-TECH
[31] Stolcke, A. [2002]. SRILM – an extensible language modeling toolkit, Proc. of ICSLP,
Denver, Colorado, pp. 901–904.
[32] Su, Y. [2011]. Bayesian class-based language models, Proc. of ICASSP, pp. 5564–5567.
[33] Vergyri, D., Kirchhoff, K., Duh, K. & Stolcke, A. [2004]. Morphology-based language
modeling for arabic speech recognition, Proc. of ICSLP, pp. 2245–2248.
[34] Votrubec, J. [2006]. Morphological tagging based on averaged perceptron, Proc. of
Contributed Papers, WDS’06, Prague, Czech Republic, pp. 191–195.
Chapter 12

Cross-Word Arabic Pronunciation Variation Modeling Using Part of Speech Tagging

http://dx.doi.org/10.5772/48645
1. Introduction
Speech recognition is often used as the front-end for many natural language processing
(NLP) applications. Some of these applications include machine translation, information
retrieval and extraction, voice dialing, call routing, speech synthesis/recognition, data entry,
dictation, control, etc. Thus, much research has been done to improve speech
recognition and the related NLP applications. However, speech recognition faces some
obstacles that should be considered. Pronunciation variation and the misrecognition of
small words are two major problems that reduce performance. The pronunciation
variation problem can be divided into two parts: within-word variations and cross-word
variations. Both types have been tackled by many researchers using different approaches.
For example, the cross-word problem can be addressed using phonological rules and/or
small-word merging. (AbuZeina et al., 2011a) used phonological rules to model cross-word
variations for Arabic. For English, (Saon & Padmanabhan, 2001) demonstrated that short
words are more frequently misrecognized, and they achieved a statistically significant
enhancement using a small-word merging approach.
An automatic speech recognition (ASR) system uses a decoder to perform the actual
recognition task. The decoder finds the most likely word sequence for a given utterance
using the Viterbi algorithm. The ASR decoding task can be seen as an alignment process
between the observed phonemes and the reference phonemes (the dictionary's phonemic
transcriptions). Intuitively, any alignment process is more accurate over long sequences
than over short ones. As such, we expect an enhancement if we merge words (short or
long). Therefore, a thorough investigation was performed on Arabic speech to discover
suitable merging cases. We found that Arabic speakers usually merge two consecutive
words in two cases: a noun followed by an adjective, and a preposition followed by a word.
Even though we believe that other cases exist in Arabic speech, we chose these two cases to
validate our proposed method. Among the ASR components,
the pronunciation dictionary and the language model were used to model our above
mentioned objective. This means that the acoustic models for the baseline and the enhanced
method are the same.
This research work is conducted for Modern Standard Arabic (MSA), so the work
necessarily contains many examples in Arabic. It is therefore appropriate to start by
providing a Romanization (Ryding, 2005) of the Arabic letters and diacritical marks.
Table 1 shows the Arabic–Roman letter mapping. The diacritics Fatha, Damma, and Kasra
are represented by a, u, and i, respectively.
To validate the proposed method, we used the Carnegie Mellon University (CMU) Sphinx
speech recognition engine. Our baseline system contains a pronunciation dictionary of
14,234 words drawn from a 5.4-hour pronunciation corpus of MSA broadcast news. For
tagging, we used the Arabic module of the Stanford tagger. Our results show that
part-of-speech (PoS) tagging is a promising approach for enhancing Arabic speech
recognition systems.
The rest of this chapter is organized as follows. Section 2 presents the problem statement.
Section 3 demonstrates the speech recognition components. In Section 4, we differentiate
between within-word and cross-word pronunciation variations followed by the Arabic
speech recognition in Section 5. The proposed method is presented in Section 6 and the
results in Section 7. The discussion is provided in Section 8. In Section 9, we highlight some
of the future directions. We conclude the work in Section 10.
2. Problem statement
Continuous speech is characterized by the merging of adjacent words, which does not occur
in isolated speech. Therefore, handling this phenomenon is a major requirement in
continuous speech recognition systems. Even though Hidden Markov Model (HMM) based
ASR decoders use triphones to alleviate the negative effects of the cross-word phenomenon,
more effort is still needed to model cross-word cases that cannot be handled by triphones
alone. In continuous ASR systems, the dictionary is usually initiated from the corpus
transcription words, i.e. each word is considered an independent entity. In this case,
cross-word merging in speech will reduce performance. Two main methods are usually
used to model the cross-word problem: phonological rules and small-word merging. Even
though these methods enhance performance, we believe that generating compound words
is also possible using PoS tagging.
Initially, there are two reasons why cross-word modeling is an effective method in speech
recognition systems. First, the speech recognition problem is an alignment process; hence,
long sequences are better than short ones, as demonstrated by (Saon & Padmanabhan,
2001). To illustrate the effect of the co-articulation phenomenon (the merging of words in
continuous speech), let us examine Figure 1 and Figure 2. Figure 1 shows the words to be
considered with no compound words, while Figure 2 shows the words with compound
words. In both figures the hypothesis words are represented by bold black lines. During
decoding, the ASR decoder investigates many words and hypotheses. Intuitively, the
decoder will choose one long word instead of two short words. The difference between the
two figures is the total number of words considered during the decoding process. Figure 2
shows that the total number of words over all hypotheses is less than in Figure 1 (Figure 1
contains 34 words while Figure 2 contains 18 words). Fewer total words during decoding
means fewer decoding options (i.e. less ambiguity), which is expected to enhance
performance.
Second, compounding words leads to a more robust language model. The compound
words represented in the language model provide better representations of word relations.
Therefore, an enhancement is expected, as the correct choice of a word increases the
probability of choosing its correct neighbors. The effect of compounding words was
investigated by (Saon & Padmanabhan, 2001). They mathematically demonstrated that
compound words enhance language model performance, thereby enhancing the overall
recognition output. They showed that compound words have the effect of incorporating a
trigram dependency into a bigram language model. In general, a compound word is more
likely to be correctly recognized than two separate words. Consequently, correct
recognition of a word might lead to another correct word through the enhanced N-gram
language model. In contrast, misrecognition of a word may lead to further misrecognitions
in the adjacent words, and so on.
For clarification, we present some cases that show short-word misrecognition and how a
long word is more likely to be recognized correctly. Table 2 shows three speech files that
were tested in the baseline and the enhanced system. Although it is early to show results,
we consider them worthwhile support for our motivating claim. In Table 2, it is clear that
the misrecognitions mainly occurred in the short words (the highlighted short words were
misrecognized in the baseline system).
In this chapter, the most noticeable performance-reducing factor for Arabic ASR,
cross-word pronunciation variation, is investigated. To enhance speech recognition
accuracy, a knowledge-based technique was utilized to model cross-word pronunciation
variation in two ASR components: the pronunciation dictionary and the language model.
The
Figure 1. Hypothesis words aligned against the observed phonemes (o1–o20), with no compound words.

Figure 2. Hypothesis words aligned against the observed phonemes (o1–o20), with compound words.
3. Speech recognition
Modern large-vocabulary, speaker-independent, continuous speech recognition systems
have three knowledge sources, also called linguistic databases: the acoustic models, the
language model, and the pronunciation dictionary (also called the lexicon). The acoustic
models are the HMMs of the phonemes and triphones (Hwang, 1993). The language model
provides the statistical representation of word sequences based on the transcription of the
text corpus. The dictionary serves as an intermediary between the
acoustic model and the language model. The dictionary contains the words available in the
language and the pronunciation of each word in terms of the phonemes available in the
acoustic models.
Figure 3 illustrates the sub-systems usually found in a typical ASR system. In addition to
the knowledge sources, an ASR system contains a Front-End module, which converts the
input sound into feature vectors usable by the rest of the system. Speech recognition
systems usually use feature vectors based on Mel Frequency Cepstral Coefficients (MFCCs)
(Rabiner & Juang, 2004).
Figure 3. A typical ASR system: the Front-End converts the spoken words into feature vectors, and the ASR decoder outputs the recognized word sequence (W1 W2 W3 …).
The following is a brief introduction to typical ASR system components. The reader can
find a more elaborate discussion in (Jurafsky and Martin, 2009).
3.1. Front-end
The purpose of this sub-system is to extract speech features, which play a crucial role in
speech recognition performance. Common speech features include Linear Predictive
Cepstral Coefficients (LPCC), MFCCs, and Perceptual Linear Predictive (PLP) coefficients.
The Sphinx engine used in this work is based on MFCCs.
The feature extraction stage aims to produce the spectral properties (feature vectors) of the
speech signal. The feature vector consists of 39 coefficients. A speech signal is divided into
overlapping short segments, each represented using MFCCs. Figure 4 shows the steps for
extracting the MFCCs of a speech signal (Rabiner & Juang, 2004). These steps are
summarized below.
Sampling and Quantization: Sampling and quantization are the two steps of analog-to-digital
conversion. The sampling rate is the number of samples taken per second; the sampling
rate used in this study is 16,000 samples per second. Quantization is the process of
representing real-valued numbers as integers. The analysis window is about 25.6 ms (410
samples), and consecutive frames are taken every 10 ms.
Preemphasis: This stage boosts the high-frequency part of the spectrum that was suppressed
during the sound production mechanism, making this information more available to the
acoustic model.
Discrete Fourier Transform: The goal of this step is to obtain the magnitude frequency
response of each frame. The output is a complex number representing the magnitude and
phase of the frequency component in the original signal.
Cross-Word Arabic Pronunciation Variation Modeling Using Part of Speech Tagging 283
Mel Filter Bank: A set of triangular filter banks is used to approximate the frequency
resolution of the human ear. The Mel frequency scale is linear up to 1000 Hz and logarithmic
thereafter. For 16 KHz sampling rate, Sphinx engine uses a set of 40 Mel filters.
Log of the Mel spectrum values: The range of the values generated by the Mel filter bank is
reduced by replacing each value by its natural logarithm. This is done to make the statistical
distribution of spectrum approximately Gaussian.
Inverse Discrete Fourier Transform: This transform compresses the spectral information into a
set of low-order coefficients called the Mel-cepstrum. Thirteen MFCC coefficients are used
as the basic feature vector, x_t(k), 0 ≤ k ≤ 12.
Deltas and Energy: For continuous models, the 13 MFCC parameters together with the
computed delta and delta-delta parameters are used as a single-stream 39-parameter
feature vector. For semi-continuous models, x(0) represents the log Mel spectrum energy
and is used separately to derive other feature parameters, in addition to the delta and
double-delta parameters. Figure 5 shows part of the feature vector of a speech file after the
feature extraction process; each column represents the basic 13 features of a
25.6-millisecond frame.
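The steps above can be sketched in numpy as follows. The 410-sample window and 10 ms frame shift follow the figures in the text; the 0.97 preemphasis factor, 512-point FFT, and the exact triangular filter-bank layout are common defaults assumed for illustration, not necessarily Sphinx's exact configuration.

```python
import numpy as np

def hz_to_mel(f):
    # Mel scale: roughly linear below 1000 Hz, logarithmic above.
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, sr):
    # Triangular filters evenly spaced on the mel scale.
    mels = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mels) / sr).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        left, center, right = bins[i - 1], bins[i], bins[i + 1]
        for k in range(left, center):
            fb[i - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):
            fb[i - 1, k] = (right - k) / max(right - center, 1)
    return fb

def mfcc(signal, sr=16000, frame_len=410, frame_shift=160,
         n_fft=512, n_filters=40, n_ceps=13):
    # 1. Preemphasis: boost the high frequencies.
    emphasized = np.append(signal[0], signal[1:] - 0.97 * signal[:-1])
    # 2. Framing into overlapping 25.6 ms windows, shifted by 10 ms.
    n_frames = 1 + (len(emphasized) - frame_len) // frame_shift
    frames = np.stack([emphasized[i * frame_shift: i * frame_shift + frame_len]
                       for i in range(n_frames)])
    frames *= np.hamming(frame_len)
    # 3. Magnitude spectrum via the DFT.
    spectrum = np.abs(np.fft.rfft(frames, n_fft))
    # 4.-5. Mel filter bank, then natural logarithm.
    fbank = np.log(spectrum @ mel_filterbank(n_filters, n_fft, sr).T + 1e-10)
    # 6. Inverse transform (DCT-II) compresses to 13 cepstral coefficients.
    n = np.arange(n_filters)
    dct = np.cos(np.pi * np.outer(np.arange(n_ceps), 2 * n + 1) / (2 * n_filters))
    return fbank @ dct.T        # shape: (n_frames, 13)
```

For one second of 16 kHz audio this yields 98 frames of 13 basic coefficients each; the deltas and delta-deltas described above would be appended to reach the 39-dimensional vector.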
The N-gram language model assigns a probability to a word sequence via the chain rule:

P(w1 … wn) = ∏_{k=1}^{n} p(wk | w1 … wk−1)
Where the word history is limited to a bigram (two consecutive words), trigram (three
consecutive words), 4-gram (four consecutive words), etc. For example, with a bigram
model, the probability of a three-word sequence is calculated as follows:
P(w1 w2 w3) = p(w3 | w2) p(w2 | w1) p(w1)
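The factorization above can be illustrated with a toy maximum-likelihood bigram model. The start token <s> and the helper names are illustrative conventions; real toolkits such as the CMU toolkit additionally apply smoothing for unseen n-grams.

```python
from collections import defaultdict

def train_bigram(sentences):
    # Maximum-likelihood bigram estimates from a toy corpus:
    # p(w_k | w_{k-1}) = count(w_{k-1}, w_k) / count(w_{k-1}).
    unigram, bigram = defaultdict(int), defaultdict(int)
    for sent in sentences:
        tokens = ["<s>"] + sent.split()
        for prev, cur in zip(tokens, tokens[1:]):
            unigram[prev] += 1
            bigram[(prev, cur)] += 1
    return lambda prev, cur: (bigram[(prev, cur)] / unigram[prev]
                              if unigram[prev] else 0.0)

def sentence_prob(p, sentence):
    # P(w1 .. wn) = prod_k p(w_k | w_{k-1}), with <s> as the start context.
    tokens = ["<s>"] + sentence.split()
    prob = 1.0
    for prev, cur in zip(tokens, tokens[1:]):
        prob *= p(prev, cur)
    return prob
```

For instance, training on the two sentences "a b c" and "a b d" gives p(c | b) = 0.5, so the probability of "a b c" is 1 × 1 × 0.5 = 0.5.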
The CMU statistical language modeling toolkit, described in (Clarkson & Rosenfeld, 1997),
has been used to generate our Arabic statistical language model. Figure 7 shows the steps
for creating and testing the language model.
Figure 7. Language model creation and testing: the text corpus is processed into word-frequency and N-gram tables, from which the language model is built; its perplexity is then calculated on a test text.
The CMU language modeling toolkit comes with a tool for evaluating the language model.
The evaluation measures perplexity as an indication of the quality (goodness) of the
language model. For more information on perplexity, please refer to Section 7.
There are two types of dictionaries: closed-vocabulary and open-vocabulary. In a
closed-vocabulary dictionary, all corpus transcription words are listed in the dictionary. In
contrast, an open-vocabulary dictionary may also contain words outside the corpus
transcription. Typically, the phoneme set used to represent dictionary words is manually
designed by language experts. However, when human expertise is not available, the
phoneme set can be selected using a data-driven approach.
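The role of the dictionary can be sketched with a toy closed-vocabulary lexicon; the entries below are hypothetical ARPAbet-style examples for illustration, not entries from the chapter's Arabic dictionary.

```python
# A toy closed-vocabulary pronunciation dictionary: every transcription word
# maps to its phoneme sequence in terms of the acoustic-model phoneme set.
lexicon = {
    "speech": ["S", "P", "IY", "CH"],
    "recognition": ["R", "EH", "K", "AH", "G", "N", "IH", "SH", "AH", "N"],
}

def phonemes(word):
    # The decoder consults the lexicon to expand a hypothesized word into
    # the phoneme (HMM) sequence that is aligned against the audio.
    return lexicon[word]
```

In this sense the dictionary is the bridge between the word-level language model and the phoneme-level acoustic models.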
The speech recognition problem is to transcribe the most likely spoken words given the
acoustic observations. If O = o1, o2, …, on is the acoustic observation and
W = w1, w2, …, wn is a word sequence, then:

Ŵ = argmax_W P(W) P(O|W)

Where Ŵ is the most probable word sequence for the spoken words, also called the
maximum a posteriori hypothesis. P(W) is the prior probability computed by the language
model, and P(O|W) is the observation probability computed using the acoustic model.
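The selection rule can be sketched as below. A real decoder applies it implicitly through Viterbi beam search over HMM states rather than enumerating whole hypotheses, and all names here are illustrative.

```python
def decode(hypotheses, lm_logprob, am_logprob):
    # W_hat = argmax_W P(W) P(O|W), computed in log space:
    # log P(W) comes from the language model and log P(O|W)
    # from the acoustic model; the sum replaces the product.
    return max(hypotheses, key=lambda w: lm_logprob(w) + am_logprob(w))
```

With toy log-probabilities, e.g. lm = {"short words": -2.0, "compound word": -1.0} and am = {"short words": -1.5, "compound word": -1.2}, the rule selects "compound word", whose combined score −2.2 beats −3.5.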
4. Pronunciation variation
The main goal of ASR is to enable people to communicate with machines more naturally
and effectively. But this ultimate dream faces many obstacles, such as different speaking
styles, which lead to the "pronunciation variation" phenomenon. This phenomenon
appears in the form of insertions, deletions, or substitutions of phoneme(s) relative to the
phonemic transcription in the pronunciation dictionary. (Benzeghiba et al., 2007) presented
the sources of speech variability: foreign and regional accents, speaker physiology,
spontaneous speech, rate of speech, children's speech, emotional state, noise, new words,
and more. Accordingly, handling these obstacles is a major requirement for better ASR
performance.
There are two types of pronunciation variations: cross-word variations and within-word
variations. A within-word variation causes alternative pronunciation(s) of the same word.
In contrast, a cross-word variation occurs in continuous speech, in which a sequence of
words forms a compound word that should be treated as one entity. Pronunciation
variation can be modeled using two approaches: knowledge-based and data-driven. The
knowledge-based approach depends on linguistic studies that lead to phonological rules,
which are applied to find the possible alternative variants. Data-driven methods, on the
other hand, depend solely on the pronunciation corpus to find the pronunciation variants
(direct data-driven) or transformation rules (indirect data-driven). In this chapter, we use
the knowledge-based approach to model the cross-word pronunciation variation problem.
As for the pros and cons of the two approaches, the knowledge-based approach is not
exhaustive: not all of the variations that occur in continuous speech have been described.
On the other hand, obtaining reliable information with data-driven methods is difficult.
Nevertheless, (Amdal & Fosler-Lussier, 2003) mention a growing interest in data-driven
methods over knowledge-based methods due to the lack of domain expertise. Figure 8
displays these two techniques and distinguishes between the types of variations and the
modeling techniques by a dashed line: the pronunciation variation types are above the
dashed line, whereas the modeling techniques are below it.
described an Arabic speech recognition system based on Sphinx 4. Three corpora were
developed, namely, the Holy Qura’an corpus of about 18.5 hours, the command and control
corpus of about 1.5 hours, and the Arabic digits corpus of less than 1 hour of speech. They
also proposed an automatic toolkit for building pronunciation dictionaries for the Holy
Qur’an and standard Arabic language. (Al-Otaibi, 2001) provided a single-speaker speech
dataset for MSA. He proposed a technique for labeling Arabic speech. Using the Hidden
Markov Model Toolkit (HTK), he reported a recognition rate for speaker-dependent ASR of
93.78%. (Afify et al., 2005) compared a grapheme-based recognition system with explicit
modeling of diacritics (short vowels). They found that diacritic modeling improves
recognition performance. (Satori et al., 2007) used CMU Sphinx tools for Arabic speech
recognition. They demonstrated the use of the tools for recognition of isolated Arabic digits.
They achieved a digits recognition accuracy of 86.66% for data recorded from six speakers.
(Alghamdi et al., 2009) developed an Arabic broadcast news transcription system. They
used a corpus of 7.0 h for training and 0.5 h for testing. The WER they obtained was 14.9%.
(Lamel et al., 2009) described the incremental improvements to a system for the automatic
transcription of broadcast data in Arabic, highlighting techniques developed to deal with
specificities (no diacritics, dialectal variants, and lexical variety) of the Arabic language.
(Billa et al., 2002) described the development of an audio indexing system for broadcast news
in Arabic. Key issues addressed in their work revolve around the three major components of
the audio indexing system: automatic speech recognition, speaker identification, and named
entity identification. (Soltau et al., 2007) reported advancements in the IBM system for
Arabic speech recognition as part of the continuous effort for the Global Autonomous
Language Exploitation (GALE) project. The system consisted of multiple stages that
incorporate both diacritized and non-diacritized Arabic speech model. The system also
incorporated a training corpus of 1,800 hours of unsupervised Arabic speech. (Azmi et al.,
2008) investigated the use of Arabic syllables in a speaker-independent speech recognition
system for Arabic spoken digits. The pronunciation corpus used for both training and
testing consisted of recordings from 44 Egyptian speakers. In a clean environment,
experiments showed that the recognition rate obtained using syllables outperformed those
obtained using monophones, triphones, and words by 2.68%, 1.19%, and 1.79%,
respectively. Over a noisy telephone channel, syllables outperformed monophones,
triphones, and words by 2.09%, 1.5%, and 0.9%, respectively. (Elmahdy et al., 2009) used acoustic
models trained with large MSA news broadcast speech corpus to work as multilingual or
multi-accent models to decode colloquial Arabic. (Khasawneh et al., 2004) compared the
polynomial classifier that was applied to isolated-word speaker-independent Arabic speech
and dynamic time warping (DTW) recognizer. They concluded that the polynomial classifier
produced better recognition performance and much faster testing response than the DTW
recognizer. (Shoaib et al., 2003) presented an approach to develop a robust Arabic speech
recognition system based on a hybrid set of speech features. The hybrid set consisted of
intensity contours and formant frequencies. (Alotaibi, 2004) reported achieving high-
performance Arabic digits recognition using recurrent networks. (Choi et al., 2008)
GALE project. They reported improved discriminative training, the use of subspace
Gaussian mixture models (SGMMs), neural-network acoustic features, variable frame rate
decoding, training data partitioning experiments, unpruned n-gram language models, and
neural-network-based language modeling (NNLMs). The achieved WER was
8.9% on the evaluation test set. (Kuo et al., 2010) studied various syntactic and
morphological context features incorporated in an NNLM for Arabic speech recognition.
6. The proposed method

A tag is a word property such as noun, pronoun, verb, adjective, adverb, preposition,
conjunction, or interjection. Each language has its own tags, which may differ from
language to language. In our method, we used the Arabic module of the Stanford tagger
(Stanford Log-linear Part-Of-Speech Tagger, 2011). This tagger has a total of 29 tags, of
which only 13 were used in our method, as listed in Table 3. As mentioned, we
focused on three kinds of tags: nouns, adjectives, and prepositions. In Table 3, DT is
shorthand for the determiner article (ال التعريف) that corresponds to "the" in English.
In this work, we use Noun-Adjective as shorthand for a compound word generated by
merging a noun and an adjective, and Preposition-Word as shorthand for a compound
word generated by merging a preposition with the subsequent word. The prepositions used
in our method are (من، الى، عن، على، في، حتى، منذ) (min, ’ila, ‘an, ‘ala, fy, Hata, mundhu).
Other prepositions were not included, as they are rarely used in MSA. Table 4 shows the
tagger output for a simple non-diacritized sentence.
Thus, the tagger output is used to generate compound words by searching for
Noun-Adjective and Preposition-Word sequences. Figure 9 shows two possible compound
words: (برنامجضخم) for the Noun-Adjective case and (فيالأردن) for the Preposition-Word
case, respectively. These two compound words are then represented in new sentences, as
illustrated in Figure 9. Therefore, the three sentences (the original and the two new ones)
will be used, with all other cases, to produce the enhanced language model and the
enhanced pronunciation dictionary.
Figure 10 shows the process of generating a compound word. It demonstrates that a noun
followed by an adjective is merged to produce one compound word. Similarly, a
preposition followed by a word is merged to form one compound word. It is noteworthy
that our method is independent of handling the pronunciation variations that may occur at
word junctures. That is, our method does not consider the phonological rules that could
apply between certain words.
The steps for modeling the cross-word phenomenon are described by the algorithm
(pseudocode) shown in Figure 11. In the figure, the Offline stage is implemented once
before decoding, while the Online stage is repeated after each decoding process.
Figure 10. A Noun-Adjective compound word generation: a noun and the following adjective in the word sequence W1 W2 W3 W4 W5 … are merged into one compound word (Arabic is read right to left).
Offline Stage
Using a PoS tagger, tag the transcription corpus
For all tagged sentences in the transcription file
For each two adjacent tags of the tagged sentence
If the adjacent tags form a noun followed by an adjective, or a preposition followed by a word
Generate the compound word
Represent the compound word in the transcription
End if
End for
End for
Based on the new transcription, build the enhanced dictionary
Based on the new transcription, build the enhanced language model
Online Stage
Switch the compound words back to their original separated words
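The offline stage above might be sketched as follows. The tag names (NN, JJ, IN) and the toy tokens are illustrative stand-ins for the Stanford Arabic tagger's actual output.

```python
def generate_compounds(tagged_sentences):
    # Offline-stage sketch: scan each PoS-tagged sentence for a noun followed
    # by an adjective, or a preposition followed by any word, and emit a new
    # sentence with that pair merged into one compound token.
    new_sentences = []
    for sent in tagged_sentences:                 # sent: list of (word, tag)
        for i in range(len(sent) - 1):
            (w1, t1), (w2, t2) = sent[i], sent[i + 1]
            if (t1 == "NN" and t2 == "JJ") or t1 == "IN":
                merged = ([w for w, _ in sent[:i]] + [w1 + w2] +
                          [w for w, _ in sent[i + 2:]])
                new_sentences.append(merged)
                break     # at most one compound word per added sentence
    return new_sentences
```

The `break` mirrors the property noted later in the chapter that each added sentence carries at most one compound word; the enhanced dictionary and language model would then be rebuilt from the original plus these new sentences.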
7. The results
The proposed method was investigated on a speaker-independent Modern Standard
Arabic speech recognition system using the Carnegie Mellon University Sphinx speech
recognition engine. Three metrics were used to measure the performance enhancement:
word error rate (WER), out-of-vocabulary rate (OOV), and perplexity (PP).
WER is a common metric for measuring ASR performance. It is computed using the
following formula:

WER = (S + D + I) / N

Where S, D, and I are the numbers of substituted, deleted, and inserted words,
respectively, and N is the total number of words in the reference transcription.
Word accuracy can also be derived from WER as Accuracy = 1 − WER. The perplexity is
computed as:

PP(W) = P(w1, w2, …, wN)^(−1/N)
Where PP is the perplexity, P is the probability of the word set to be tested W=w1, w2, … ,
wN, and N is the total number of words in the testing set.
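The two metrics can be sketched as follows, assuming the standard Levenshtein alignment for WER; the helper names are illustrative.

```python
import math

def wer(reference, hypothesis):
    # WER = (S + D + I) / N via Levenshtein alignment of the word sequences.
    r, h = reference.split(), hypothesis.split()
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i                      # deletions only
    for j in range(len(h) + 1):
        d[0][j] = j                      # insertions only
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = d[i - 1][j - 1] + (r[i - 1] != h[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(r)][len(h)] / len(r)

def perplexity(word_probs):
    # PP(W) = P(w1 .. wN)^(-1/N), computed from per-word probabilities
    # in log space to avoid underflow on long test sets.
    log_p = sum(math.log(p) for p in word_probs)
    return math.exp(-log_p / len(word_probs))
```

For example, wer("a b c d", "a x c") is 0.5 (one substitution plus one deletion over four reference words), and a model that assigns every word probability 0.25 has perplexity 4.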
Table 5 shows the enhancements for the different experiments. Since the enhanced method
(in the Noun-Adjective case) achieved a WER of 9.82%, which is outside the confidence
interval [11.53, 12.89], the achieved enhancement is statistically significant. The other cases
(Preposition-Word and Hybrid) similarly achieved significant improvements.

Table 5 shows that the highest accuracy is achieved in the Noun-Adjective case. The
reduction in accuracy in the hybrid case is due to the ambiguity introduced into the
language model. To clarify, our method depends on adding new sentences to the
transcription corpus used to build the language model; adding many sentences will
eventually bias the language model toward some n-grams (1-grams, 2-grams, and 3-grams)
at the expense of others.
The common way to evaluate an N-gram language model is perplexity. The perplexity of
the baseline is 34.08; the perplexities of the proposed cases' language models are displayed
in Table 6. The measurements were taken on the testing set, which contains 9,288 words.
The enhanced cases are clearly better, as their perplexities are lower. The reason for the low
perplexities overall is the specific domains of our corpus, i.e. economics and sports.
The OOV rate was also measured for the experiments. Our ASR system is based on a
closed vocabulary, so we assume there are no unknown words. The OOV rate was
calculated as the percentage of recognized words that do not belong to the testing set but
to the training set.
Table 7 shows some statistical information collected during experiments. The “Total
compound words” is the total number of Noun-Adjective cases found in the corpus
transcription. The “unique compound words” indicates the total number of Noun-Adjective
cases after removing duplicates. The last column, “compound words replaced” is the total
number of compound words that were replaced back to their original two disjoint words
after the decoding process and prior to the evaluation stage.
Although the Stanford Arabic tagger is claimed to be more than 96% accurate,
comprehensive manual verification and correction were performed on the tagger output. It
was reasonable to review the collected compound words, as our transcription corpus is
small (39,217 words); for large corpora, the accuracy of the tagger is crucial to the results.
Table 8 shows an error that occurred in the tagger output: the word وقال (waqala), for
example, should be tagged VBD instead of NN. The sentence to be tagged was:

هذا وقال رئيس لجنة الطاقة بمجلس النواب ورئيس الرابطة الروسية للغاز إن الاحتكارات الأوروبية
(hadha waqala ra’ysu lajnati ’lTaqa bimajlisi ’lnuwab wa ra’ysu ’lrabiTa ’lrwsiya llghaz
’ina ’l’iHtikarati ’liwrobiya)
Table 9 shows an illustrative example of the enhancement achieved by the enhanced
system. The baseline system missed the word من (min), while it appears in the enhanced
system; introducing a compound word in this sentence avoided the misrecognition that
occurred in the baseline system. The text of the tested speech file is:

في المرحلة السابعة والثلاثين من الدوري الإسباني لكرة القدم
(fy ’lmarHalati ’lsabi‘a wa ’lthalathyn mina ’ldawry ’l’sbany likurati ’lqadam)
According to the proposed algorithm, each sentence in the enhanced transcription corpus can have at most one compound word, since a sentence is added to the enhanced corpus as soon as a compound word is formed. Finally, after the decoding process, the results are scanned in order to decompose the compound words back into their original form (two separate words). This process is performed using a lookup table such as:
الكويتالدولي → الكويت الدولي ('lkuwaytldawly → 'lkuwayt 'ldawly)
فيمطار → في مطار (fymatari → fy matari)
8. Discussion
Table 10 compares the suggested cross-word modeling methods. It shows that the PoS tagging approach outperforms the other methods (i.e., the phonological rules and small-word merging) investigated on the same pronunciation corpus. The use of phonological rules was demonstrated in (AbuZeina et al. 2011a), while the small-word merging method was presented in (AbuZeina et al. 2011b). Even though PoS tagging appears to be better than the other methods, more research should be carried out before this conclusion can be drawn with confidence. The comparison in Table 10 is therefore subject to change as more cases are investigated for both techniques: cross-word variation was modeled using only two Arabic phonological rules, while only two compounding schemes were applied in the PoS tagging approach.
The recognition time was compared with that of the baseline system over the testing set of 1144 speech files. The specifications of the machine where we
298 Modern Speech Recognition Approaches with Case Studies
conducted the experiments were as follows: a desktop computer with a single 3.2 GHz processor and 2.0 GB of RAM. We found that the recognition time of the enhanced method is almost the same as that of the baseline system; that is, the proposed method is almost equal to the baseline in terms of time complexity.
9. Further research
As future work, we propose investigating more word-combination cases. In particular, we expect that construct phrases, Idafa (الإضافة), make good candidates. Examples include (سلسلة جبال, silsilt jibal), (مطار بيروت, maTaru bayrwt), and (مدينة القدس, madynatu 'lquds). Another suggested candidate is the Arabic "and" connective (واو العطف), as in (مواد أدبية ولغوية, mawad 'dabiyah wa lughawiyah) and (يتعلق بقضايا العراق والسودان, yata'allaqu biqaDaya 'l'iraqi wa 'lsudan). A hybrid system could also be investigated, since it is possible to use the different cross-word modeling approaches in a single ASR system. It is also worthwhile to investigate how to model the compound words in the language model. In our method, we create a new sentence for each compound word; instead, we suggest representing the compound word only with its neighbors. For example, our method uses two complete sentences to represent the compound words (برنامجضخم, barnamijDakhm) and (فيالأردن, fy'l'urdun):
أما في الأردن فقد تم وضع برنامجضخم لتطوير مدينة العقبة
'mma fy 'l'urdun faqad tamma waD'u barnamijDakhm litaTwyru madynati 'l'aqabati
أما فيالأردن فقد تم وضع برنامج ضخم لتطوير مدينة العقبة
'mma fy'l'urdun faqad tamma waD'u barnamij Dakhm litaTwyru madynati 'l'aqabati
We propose to add the compound word only with its adjacent words, as in:
وضع برنامجضخم لتطوير
waD'u barnamijDakhm litaTwyr
Comprehensive research should be conducted to find how to effectively represent compound words in the language model. In addition, we highly recommend further research on PoS tagging for Arabic.
10. Conclusion
The proposed knowledge-based approach to the cross-word pronunciation variation problem achieved a feasible improvement. Mainly, a PoS tagging approach was used to form compound words. The experimental results clearly showed that forming compound words from a noun and an adjective achieved better accuracy than merging a preposition with the following word. The significant enhancement we achieved comes not only from the cross-word pronunciation modeling in the dictionary, but also indirectly from the recalculated n-gram probabilities in the language model. We also conclude that the Viterbi algorithm works better with long words; speech recognition research should consider this fact when designing dictionaries. We found that merging words based on their types (tags) leads to a significant improvement in Arabic ASR, and that the proposed method outperforms the other cross-word methods, such as phonological rules and small-word merging.
Author details
Dia AbuZeina, Husni Al-Muhtaseb and Moustafa Elshafei
King Fahd University of Petroleum and Minerals, Dhahran, Saudi Arabia
Acknowledgement
The authors would like to thank King Fahd University of Petroleum and Minerals for
providing the facilities to write this chapter. We also thank King Abdulaziz City for Science
and Technology (KACST) for partially supporting this research work under Saudi Arabia
Government research grant NSTP # (08-INF100-4).
11. References
Abushariah, M. A.-A. M.; Ainon, R. N.; Zainuddin, R.; Elshafei, M. & Khalifa, O. O.
Arabic speaker-independent continuous automatic speech recognition based on a
phonetically rich and balanced speech corpus. Int. Arab J. Inf. Technol., 2012, 9, 84-93
AbuZeina D., Al-Khatib W., Elshafei M., “Small-Word Pronunciation Modeling for Arabic
Speech Recognition: A Data-Driven Approach”, Seventh Asian Information Retrieval
Societies Conference, Dubai, 2011b.
AbuZeina D., Al-Khatib W., Elshafei M., Al-Muhtaseb H., "Cross-word Arabic
pronunciation variation modeling for speech recognition" , International Journal of
Speech Technology , 2011a.
Afify M, Nguyen L, Xiang B, Abdou S, Makhoul J. Recent progress in Arabic broadcast news
transcription at BBN. In: Proceedings of INTERSPEECH. 2005, pp 1637–1640
Alghamdi M, Elshafei M, Almuhtasib H (2009) Arabic broadcast news transcription system.
Int J Speech Tech 10:183–195
Ali, M., Elshafei, M., Alghamdi M. , Almuhtaseb, H. , and Alnajjar, A., "Arabic Phonetic
Dictionaries for Speech Recognition". Journal of Information Technology Research,
Volume 2, Issue 4, 2009, pp. 67-80.
Alotaibi YA (2004) Spoken Arabic digits recognizer using recurrent neural networks. In:
Proceedings of the fourth IEEE international symposium on signal processing and
information technology, pp 195–199
Al-Otaibi F (2001) Speaker-dependent continuous Arabic speech recognition. M.Sc. thesis,
King Saud University
Amdal I, Fosler-Lussier E (2003) Pronunciation variation modeling in automatic speech
recognition. Telektronikk, 2.2003, pp 70–82.
Azmi M, Tolba H, Mahdy S, Fashal M (2008) Syllable-based automatic Arabic speech
recognition in noisy-telephone channel. In: WSEAS transactions on signal processing
proceedings, World Scientific and Engineering Academy and Society (WSEAS), vol 4,
issue 4, pp 211–220
Bahi H, Sellami M (2001) Combination of vector quantization and hidden Markov models
for Arabic speech recognition. ACS/IEEE international conference on computer systems
and applications, 2001
Benzeghiba M, De Mori R et al (2007) Automatic speech recognition and speech variability: a
review. Speech Commun 49(10–11):763–786.
Billa J, Noamany M et al (2002) Audio indexing of Arabic broadcast news. 2002 IEEE
international conference on acoustics, speech, and signal processing (ICASSP)
Bourouba H, Djemili R et al (2006) New hybrid system (supervised classifier/HMM) for
isolated Arabic speech recognition. 2nd Information and Communication Technologies,
2006. ICTTA’06
Choi F, Tsakalidis S et al (2008) Recent improvements in BBN’s English/Iraqi speech-to-speech
translation system. IEEE Spoken language technology workshop, 2008. SLT 2008
Clarkson P, Rosenfeld R (1997) Statistical language modeling using the CMU-Cambridge
toolkit. In: Proceedings of the 5th European conference on speech communication and
technology, Rhodes, Greece.
Elmahdy M, Gruhn R et al (2009) Modern standard Arabic based multilingual approach for
dialectal Arabic speech recognition. In: Eighth international symposium on natural
language processing, 2009. SNLP’09
Elmisery FA, Khalil AH et al (2003) A FPGA-based HMM for a discrete Arabic speech
recognition system. In: Proceedings of the 15th international conference on
microelectronics, 2003. ICM 2003
Emami A, Mangu L (2007) Empirical study of neural network language models for Arabic
speech recognition. IEEE workshop on automatic speech recognition and
understanding, 2007. ASRU
Essa EM, Tolba AS et al (2008) A comparison of combined classifier architectures for Arabic
speech recognition. International conference on computer engineering and systems,
2008. ICCES 2008
Cross-Word Arabic Pronunciation Variation Modeling Using Part of Speech Tagging 301
Owen Rambow, David Chiang, et al., Parsing Arabic Dialects, Final Report – Version 1,
January 18, 2006
http://old-site.clsp.jhu.edu/ws05/groups/arabic/documents/finalreport.pdf
Park J, Diehl F et al (2009) Training and adapting MLP features for Arabic speech
recognition. IEEE international conference on acoustics, speech and signal processing,
2009. ICASSP 2009
Plötz T (2005) Advanced stochastic protein sequence analysis, Ph.D. thesis, Bielefeld
University
Rabiner, L. R. and Juang, B. H., Statistical Methods for the Recognition and Understanding
of Speech, Encyclopedia of Language and Linguistics, 2004.
Ryding KC (2005) A reference grammar of modern standard Arabic (reference grammars).
Cambridge University Press, Cambridge.
Sagheer A, Tsuruta N et al (2005) Hyper column model vs. fast DCT for feature extraction in
visual Arabic speech recognition. In: Proceedings of the fifth IEEE international
symposium on signal processing and information technology, 2005
Saon G, Padmanabhan M (2001) Data-driven approach to designing compound words for
continuous speech recognition. IEEE Trans Speech Audio Process 9(4):327–332.
Saon G, Soltau H et al (2010) The IBM 2008 GALE Arabic speech transcription system. 2010
IEEE international conference on acoustics speech and signal processing (ICASSP)
Satori H, Harti M, Chenfour N (2007) Introduction to Arabic speech recognition using CMU
Sphinx system. Information and communication technologies international symposium
proceeding ICTIS07, 2007
Selouani S-A, Alotaibi YA (2011) Adaptation of foreign accented speakers in native Arabic
ASR systems. Appl Comput Informat 9(1):1–10
Shoaib M, Rasheed F, Akhtar J, Awais M, Masud S, Shamail S (2003) A novel approach to
increase the robustness of speaker independent Arabic speech recognition. 7th
international multi topic conference, 2003. INMIC 2003. 8–9 Dec 2003, pp 371–376
Singh, R., B. Raj, et al. (2002). "Automatic generation of subword units for speech recognition
systems." Speech and Audio Processing, IEEE Transactions on 10(2): 89-99.
Soltau H, Saon G et al (2007) The IBM 2006 Gale Arabic ASR system. IEEE international
conference on acoustics, speech and signal processing, 2007. ICASSP 2007
Stanford Log-linear Part-Of-Speech Tagger, 2011.
http://nlp.stanford.edu/software/tagger.shtml
Taha M, Helmy T et al (2007) Multi-agent based Arabic speech recognition. 2007 IEEE/WIC/
ACM international conferences on web intelligence and intelligent agent technology
workshops
The CMU Pronunciation Dictionary (2011), http://www.speech.cs.cmu.edu/cgi-bin/cmudict,
Accessed 1 September 2011.
Vergyri D, Kirchhoff K, Duh K, Stolcke A (2004) Morphology-based language modeling for
Arabic speech recognition. International conference on speech and language processing.
Jeju Island, pp 1252–1255
Xiang B, Nguyen K, Nguyen L, Schwartz R, Makhoul J (2006) Morphological decomposition
for Arabic broadcast news transcription. In: Proceedings of ICASSP, vol I. Toulouse, pp
1089–1092
Chapter 13
VOICECONET: A Collaborative Framework for Speech-Based Computer Accessibility with a Case Study for Brazilian Portuguese
http://dx.doi.org/10.5772/47835
1. Introduction
In recent years, the performance of personal computers has evolved with the production
of ever faster processors, a fact that enables the adoption of speech processing in
computer-assisted education. There are several speech technologies that are effective in
education, among which text-to-speech (TTS) and automatic speech recognition (ASR) are
the most prominent. TTS systems [45] are software modules that convert natural language
text into synthesized speech. ASR [18] can be seen as the TTS inverse process, in which the
digitized speech signal, captured for example via a microphone, is converted into text.
There is a large body of work on using ASR and TTS in educational tasks [14, 37]. All these
speech-enabled applications rely on engines, which are the software modules that execute
ASR or TTS. This work proposes a collaborative framework and associated techniques for constructing speech engines, and adopts accessibility as its major application. The network has an important social impact in decreasing the digital divide between speakers of commercially attractive languages and speakers of underrepresented ones.
The incorporation of computer technology in learning has generated multimedia systems that provide powerful "training tools", explored in computer-assisted learning [34]. Web-based learning has also become an important teaching and learning medium [52]. However, the financial cost of both computers and software is one of the main obstacles to computer-based learning, especially in developing countries like Brazil [16].
The situation is further complicated when it comes to people with special needs, including
visual, auditory, physical, speech, cognitive, and neurological disabilities. They encounter
serious difficulties in having access to this technology and hence to knowledge. For example,
according to the Brazilian Institute of Geography and Statistics (IBGE), 14.5% of the Brazilian population has some type of disability.
Table 1. Profile of Brazilian people with disabilities, based on data provided by IBGE [20] (the total population is 190,732,694).
For the effective use of speech-enabled applications in education and assistive systems, reasonably good engines must be available [30, 43], and the cost of software and equipment cannot be prohibitive.
This work presents some results of an ambitious project that aims to use the Internet as a collaborative network and to help academia and the software industry develop speech science and technology for any language, including Brazilian Portuguese (BP). The goal is to collect, develop, and deploy resources and software for speech processing using a collaborative framework called VOICECONET. The public data and scripts (or software recipes) make it possible to establish baseline systems and reproduce results across different sites [48]. The final products of this research are a large-vocabulary continuous speech recognition (LVCSR) system and a TTS system for BP.
The remainder of the chapter is organized as follows. Section 2 presents a description of ASR
and TTS systems. Section 3 describes the proposed collaborative framework, which aims at
easing the task of developing ASR and TTS engines to any language. Section 4 presents the
developed resources for BP such as speech databases and phonetic dictionary. The baseline
results are presented in Section 5. Finally, Section 6 summarizes our conclusions and discusses future work.
language model is composed of a grammar that restricts the acceptable sequences of words. The latter typically supports a vocabulary of more than 60 thousand words and demands more computation.
The conventional front end extracts segments (or frames) from the speech signal and converts each segment, at a constant frame rate (typically 100 Hz), into a vector x of dimension L (typically L = 39). It is assumed here that T frames are organized into an L × T matrix X, which represents a complete sentence. There are several alternatives for parameterizing the speech waveform. Although mel-frequency cepstral coefficient (MFCC) analysis is relatively old [10], it has proven effective and is used pervasively as the input to the ASR back end [18].
The language model of a dictation system provides the probability p(T ) of observing a
sentence T = [ w1 , . . . , w P ] of P words. Conceptually, the decoder aims at finding the sentence
T ∗ that maximizes a posterior probability as given by
T* = arg max_T p(T|X) = arg max_T [ p(X|T) p(T) ] / p(X),   (1)
where p(X|T) is given by the acoustic model. Because p(X) does not depend on T, the previous equation is equivalent to

T* = arg max_T p(X|T) p(T).   (2)

In practice, an empirical constant is used to weight the language model probability p(T) before combining it with the acoustic model probability p(X|T).
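A toy sketch of this weighted combination, in the log domain as decoders typically operate, is given below; the candidate list, its field names, and the weight value are illustrative assumptions rather than any real decoder's API.

```python
def best_sentence(candidates, lm_weight=10.0):
    """T* = argmax over candidates of log p(X|T) + lm_weight * log p(T).
    Each candidate is a dict carrying hypothetical log-probability fields."""
    return max(candidates,
               key=lambda c: c["acoustic_logp"] + lm_weight * c["lm_logp"])
```

Note how the weight changes which hypothesis wins: a strong language-model weight can override a small acoustic advantage.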
Due to the large number of possible sentences, Equation (2) cannot be calculated independently for each candidate sentence. Therefore, ASR systems use hierarchical data structures such as lexical trees, breaking sentences into words and words into basic units such as phones or triphones [18].
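A lexical tree over a phonetic dictionary can be sketched as a nested-dictionary prefix tree; the toy lexicon and the "#words" marker key below are assumptions for illustration, not the structure any particular decoder uses internally.

```python
def build_lexical_tree(lexicon):
    """Nested-dict prefix tree over phone sequences: words sharing phone
    prefixes share branches, which shrinks the decoder's search space."""
    root = {}
    for word, phones in lexicon.items():
        node = root
        for phone in phones:
            node = node.setdefault(phone, {})
        node.setdefault("#words", []).append(word)  # homophones share a leaf
    return root
```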
A phonetic dictionary (also known as lexical model) provides the mapping from words to
basic units and vice-versa. For improved performance, continuous HMMs are adopted, where
the output distribution of each state is modeled by a mixture of Gaussians, as depicted in
Figure 2. The typical HMM topology is “left-right”, in which the only valid transitions are
staying at the same state and moving to the next.
Figure 2. Pictorial representation of a left-right continuous HMM with three states s i , i = 1, 2, 3 and a
mixture of three Gaussians per state.
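The left-right constraint simplifies decoding: at each frame a path may only stay in its state or advance to the next one. A minimal Viterbi sketch under this topology is shown below; the frame log-likelihoods stand in for the Gaussian-mixture scores of Figure 2, and the argument layout is an assumption for illustration.

```python
import math

def viterbi_left_right(frame_logp, log_self, log_next):
    """Best-path log-likelihood through a left-right HMM: from state i the
    only transitions are the self-loop (log_self[i]) and the advance to
    i+1 (log_next[i]). frame_logp[t][i] is the log-likelihood of frame t
    under state i."""
    T, N = len(frame_logp), len(frame_logp[0])
    NEG = float("-inf")
    delta = [[NEG] * N for _ in range(T)]
    delta[0][0] = frame_logp[0][0]            # path must start in state 0
    for t in range(1, T):
        for i in range(N):
            stay = delta[t - 1][i] + log_self[i]
            move = delta[t - 1][i - 1] + log_next[i - 1] if i > 0 else NEG
            delta[t][i] = max(stay, move) + frame_logp[t][i]
    return delta[T - 1][N - 1]                # and end in the last state
```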
To reduce the computational cost of searching for T* (decoding), hypotheses are pruned, i.e., some sentences are discarded and Equation (2) is not calculated for them [11]. In summary, after all models are trained, an ASR system at the test stage uses the front end to convert the input signal into parameters and the decoder to search for the best sentence T*.
The acoustic and language models can be kept fixed during the test stage, but adapting one or both can lead to improved performance. For example, the topic can be estimated and a specific language model used; this is crucial for applications with a technical vocabulary, such as X-ray reporting by physicians [2]. Adaptation of the acoustic model is also important [25]. ASR systems that use speaker-independent models are convenient, but must be able to recognize any speaker with good accuracy. At the expense of requesting the user to read aloud some sentences, speaker adaptation techniques can tune the HMMs to the target speaker. Adaptation techniques can also be used to perform environmental compensation by reducing the mismatch due to channel or additive noise effects.
In this work, an ASR engine is considered to be composed of the decoder and all the resources required for its execution (language model, etc.). Similarly, a TTS engine consists of all software modules and associated resources, which will be briefly discussed in the sequel.
Figure 3. Functional diagram of a TTS system showing the front and back ends, responsible for text analysis and speech synthesis, respectively.
The front end is language dependent and performs text analysis to output information
coded in a way that is convenient to the back end. For example, the front end performs
text normalization, converting text containing symbols such as numbers and abbreviations into the equivalent written-out words. It also implements grapheme-to-phone conversion, syllabification, and syllable stress determination, which assign a phonetic transcription to each word and mark the text with prosodic information. The phonetic transcriptions and prosody information together compose the (intermediate) symbolic linguistic representation output by the front end.
The back end is typically language independent and includes the synthesizer, which is
the block that effectively generates sound. With respect to the technique adopted for the
back end, the main categories are the formant-based, concatenative and, more recently,
HMM-based [45]. Historically, TTS systems evolved from a knowledge-based paradigm to
a pragmatic data-driven approach.
PP = 2^(H_p(T)).   (5)
Lower cross-entropies and perplexities indicate less uncertainty in predicting the next word
and, for a given task (vocabulary size, etc.), typically indicate a better language model.
With respect to TTS, the quality of speech outputs can be measured via two factors:
intelligibility and pleasantness. Intelligibility can be divided into segmental intelligibility,
which indicates how accurately spoken sentences have been received by the user, and
comprehension, which measures how well the spoken sentences are understood. Segmental
intelligibility can be measured in a similar way to WER by comparing transcriptions and
reference messages. Comprehension can be measured by using questions or tasks which
require listeners to understand the meaning of the messages.
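Scoring "in a similar way to WER" reduces to a word-level edit distance between the reference and the transcription. A standard sketch (not the chapter's specific scoring tool):

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / reference length,
    computed via Levenshtein distance over words."""
    ref, hyp = reference.split(), hypothesis.split()
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,       # deletion
                          d[i][j - 1] + 1,       # insertion
                          d[i - 1][j - 1] + cost)  # substitution or match
    return d[len(ref)][len(hyp)] / len(ref)
```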
The pleasantness of speech can be measured by collecting a large number of user opinions and using, for example, the mean opinion score (MOS) protocol [27]. The MOS is generated by averaging the results of a set of subjective tests in which a number of listeners rate the audio quality of sentences read aloud by TTS software. It should be noted that intelligibility and pleasantness are related but not directly correlated. Both have a significant effect on user acceptance: unpleasant speech output can lead to poor satisfaction with an otherwise sophisticated system.
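The MOS averaging itself is simple; a small sketch with a hypothetical rating list, assuming the usual 1 (bad) to 5 (excellent) scale:

```python
def mean_opinion_score(ratings):
    """MOS: average of listener ratings on the 1 (bad) to 5 (excellent) scale."""
    if not all(1 <= r <= 5 for r in ratings):
        raise ValueError("ratings must be on the 1-5 scale")
    return sum(ratings) / len(ratings)
```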
There are free and mature software tools for building speech recognition and synthesis engines. HTK [55] is the most widespread toolkit for building and adapting acoustic models, while HTS [54] (a modified version of HTK) is used for building speech synthesis systems. Statistical language models can be created with SRILM [44] or HTK.
Festival [4] offers a general multi-lingual (currently English and Spanish) framework for
building speech synthesis systems. Another open source framework for TTS is the MARY
platform [36]. Currently, MARY supports the German, English and Tibetan languages.
systems. Software for training and testing statistical models is also made available. In the end, the whole community benefits from better ASR and TTS engines.
It is well known that collaborative networks are critically important to businesses and organizations, helping to establish a culture of innovation and to deliver operational excellence. Collaborative networks have been built on the basis of academic partnerships, including several universities and some companies specialized in developing and implementing software in the university domain [31]. Collaborative networks offer the possibility of reducing the duration and cost of developing and implementing systems that are vast, expensive, and hard to automate [1], such as an ASR or TTS engine.
There are already successful multi-language audio collaborative networks, such as Microsoft's YourSpeech project and the VoxForge free speech corpus and acoustic model repository. The YourSpeech platform [5] is based on a crowdsourcing approach that aims at collecting speech data, and provides two means by which users can donate their speech: a quiz game and a personalized TTS system. The VoxForge project [50] was set up to collect transcribed speech and create a free speech corpus; the collected audio files are compiled into acoustic models for use with open source speech recognition engines. This work complements these previous initiatives by focusing on documenting and releasing software and procedures for developing engines.
The proposed VOICECONET platform is comprehensive and based on the open source concept. It aims at easing the task of developing ASR and TTS engines for any language, including underrepresented ones. In summary, the framework has the following features:
• It relies on a Web-based platform organized for collecting resources and improving engine accuracy. Through this platform anyone can contribute waveform files, and the system rewards them with a speaker-dependent acoustic model. Users can also create a custom language model, depending on the task where the model is going to be used.
• The software adopted for building and evaluating engines (HTK, Julius, MARY, etc.) is freely available. A considerable amount of time is usually spent learning how to use these tools, so the network shares scripts to facilitate this process.
• The shared reports and academic papers use the concept of reproducible research [48], and
the results can be replicated in different sites.
In order to collect labeled desktop data, the system architecture is based on the client/server paradigm (see Figure 4). The client accesses the platform through a website [49] and is identified by a user ID. Once there, the user is invited to record random sentences, taken from a phonetically rich set [8], in order to guarantee complete phoneme coverage. The Java applet used for recording control was based on the related VoxForge application. This control enables the system to access the user's computer and record audio using any of the installed devices. When the user selects the submit option, the recording automatically stops and the wave files are streamed to the server. The user can submit the recorded sentences at any time.
When enough sentences have been recorded (at least 3 minutes of audio), the user may choose to generate his/her speaker-dependent acoustic model. At that moment, on the server, a script
is used to process the audio and build an adapted acoustic model using the HTK tools. The user may then download a file containing his/her speaker-dependent acoustic model. Note that the quality of the model improves with the number of recorded sentences.
Another feature allows a context-dependent language model to be created. To create the language model, it is necessary to input a lowercase text corpus, without punctuation, in a simplified Standard Generalized Markup Language format: there can be only one sentence per line, and each sentence is delimited by the tags <s> and </s>. A script then processes the text and builds a trigram ARPA-format language model using the SRILM tools. To improve the model, it is recommended that the user first convert the input texts so that every number, date, monetary amount, time, ordinal number, website address, and abbreviation is written out in full.
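The input preparation described above (lowercasing, removing punctuation, one <s>-delimited sentence per line) can be sketched as follows; the regular expression is an illustrative assumption, not the platform's actual normalizer, and written-out number expansion is omitted.

```python
import re

def prepare_corpus(sentences):
    """Lowercase, strip punctuation, and emit one <s>-delimited sentence
    per line, matching the language-model input format described above."""
    lines = []
    for sentence in sentences:
        cleaned = re.sub(r"[^\w\s]", " ", sentence.lower())
        cleaned = " ".join(cleaned.split())  # collapse repeated whitespace
        lines.append(f"<s> {cleaned} </s>")
    return "\n".join(lines)
```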
At the time of writing, VOICECONET had been online for one month for the BP language, and we have been receiving extremely positive feedback from users. More than 60 minutes of audio were collected from 15 speakers, and 5 language models were made available, adding up to 10 thousand sentences. Figure 5 shows the VOICECONET platform website.
It is pedagogical to use an underrepresented language to illustrate how the global community can build and keep improving resources for all languages. Using the VOICECONET platform, the following speech-related resources for BP are currently shared: two multiple-speaker audio corpora corresponding together to approximately 17 hours of audio, a phonetic dictionary with over 65 thousand words, a speaker-independent HTK-format acoustic model, and a trigram ARPA-format language model. All these BP resources are publicly available [15], and the next section describes how they were developed.
state-of-the-art LVCSR systems, since together they amount to approximately a dozen hours; for comparison, the Switchboard telephone conversations corpus for English alone has 240 hours of recorded speech [17]. There have been initiatives [35, 53] in academia for developing corpora, but they have not established a sustainable collaboration among researchers.
Regarding end-user applications, there are important freely available systems. In Brazil, for instance, Dosvox and Motrix [47] are very popular among blind and physically-impaired people, respectively. Dosvox includes its own speech synthesizer, besides offering the possibility of using other engines, but does not provide ASR support. The current version of Motrix supports ASR only in English, making use of an ASR engine distributed for free by Microsoft.
There are advantages in adopting a proprietary software development model, but this work advocates the open source model. The next sections describe the resources developed in order to design speech engines for BP.
• a simple procedure that inserts the non-terminal symbol # before and after each word;
• a stress phase that marks the stressed vowel of each word;
• the bulk of the system, which converts the graphemes (including the stressed-vowel mark) to 38 phones represented using the SAMPA phonetic alphabet [32].
Using the described G2P converter, a phonetic dictionary called UFPAdic was created, containing 65,532 words. These words were selected by choosing the most frequent ones in the CETENFolha corpus [7], a corpus based on the texts of the newspaper Folha de S. Paulo and compiled by NILC/São Carlos, Brazil. The G2P converter (executable file) and the UFPAdic are publicly available [15].
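Selecting the most frequent word types from a tokenized corpus, as done when building the 65k-word dictionary, can be sketched as follows (the toy corpus is an assumption; the real selection ran over CETENFolha):

```python
from collections import Counter

def select_vocabulary(corpus_tokens, size):
    """The `size` most frequent word types in a tokenized corpus."""
    return [word for word, _ in Counter(corpus_tokens).most_common(size)]
```

Each selected word would then be passed through the G2P converter to produce its dictionary entry.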
4.2. Syllabification
The developed G2P converter performs neither syllabification nor stressed-syllable identification, so these two tasks were implemented in Java software that is publicly available [15]. The algorithm used for syllabification is described in Silva et al. [38]. Its main idea is that every syllable has a vowel as its nucleus, which can be surrounded by consonants or semivowels. Hence, one should locate the vowels that compose the syllable nuclei and then attach the surrounding consonants and semivowels.
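The nucleus idea can be illustrated with a deliberately naive sketch: every vowel closes a syllable and preceding consonants attach to it. This toy version ignores exactly the hard cases (diphthongs, hiatus) that the stress-aware rules in Table 2 were created to handle, and the vowel inventory is an assumption, not the paper's rule set.

```python
# Assumed vowel inventory for the sketch (plain + accented BP vowels).
VOWELS = set("aeiouáéíóúâêôãõà")

def naive_syllabify(word):
    """Toy nucleus-based split: each vowel ends a syllable, with the
    consonants before it forming the onset; trailing consonants join
    the last syllable. Diphthongs are (wrongly) split into two nuclei."""
    syllables, onset = [], ""
    for ch in word.lower():
        if ch in VOWELS:
            syllables.append(onset + ch)
            onset = ""
        else:
            onset += ch
    if onset:  # trailing consonants (e.g. final "r") join the last syllable
        if syllables:
            syllables[-1] += onset
        else:
            syllables.append(onset)
    return syllables
```

On a word without diphthongs, such as "teólogo", even this crude rule recovers the correct division.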
The original syllabification rules [38], as previously mentioned, consider the kind and arrangement of graphemes to separate the syllables of a given word. However, some words are very difficult to syllabify correctly with only these two criteria, especially words containing diphthongs, because they would require many very specific and well-elaborated rules, each dealing with just a few examples. To overcome this difficulty, new linguistic rules, shown in Table 2, were proposed, each considering not just the graphemes themselves but also their stress. The first group deals with the falling diphthongs (the "vowel + glide" combination), while the second deals with diphthongs that vary with hiatus (the "glide + vowel" combination). Because diphthongs need this special treatment in their syllabification, it was defined that these rules must be evaluated before the 20 original ones.
The main motivation for analyzing these diphthongs comes from the existing divergences among scholars on the subject (such as the position of a glide inside a syllable in falling diphthongs, as explained in Bisol [3]). Another point supporting this focus is that vocalic segments, especially the ones with rising sonority, had produced many errors in the separations performed by the syllabification algorithm in a previous analysis.
In addition, the original rule 19 was updated to fix some errors present in its first version. If the analyzed vowel is not the first grapheme in the next syllable to be formed and is followed by another vowel that precedes a consonant, then the analyzed vowel must be separated from the following graphemes. This new version of rule 19 fixes some errors that were occurring in the syllabification of words like “teólogo”, for example (the correct division is “te-ó-lo-go” rather than “teó-lo-go”, as shown in Silva et al. [38]).
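The updated rule 19, as stated above, can be rendered as a predicate. The function and variable names below are ours, chosen for illustration:

```python
VOWELS = set("aeiouáéíóúâêôãõ")

def rule19_applies(word, i, syllable_so_far):
    """Hypothetical rendering of the updated rule 19: the vowel at position i
    is separated from the following graphemes if (a) it is not the first
    grapheme of the syllable being formed, (b) the next grapheme is a vowel,
    and (c) that vowel precedes a consonant. Illustration only."""
    w = word.lower()
    return (len(syllable_so_far) > 0       # (a) not syllable-initial
            and i + 2 < len(w)
            and w[i] in VOWELS
            and w[i + 1] in VOWELS         # (b) followed by another vowel
            and w[i + 2] not in VOWELS)    # (c) which precedes a consonant

# "teólogo": analyzing the "e" (i = 1) while forming the syllable "t" + "e",
# the rule fires, yielding the boundary "te-ó..." instead of "teó-...".
```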
Identifying the stressed syllable proved to be an easier task, benefiting from the fact that the developed G2P converter, despite not separating words into syllables, was already able to identify the stressed vowel. Given the result of the syllabification, it was then trivial to identify the syllable containing the stressed vowel.
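This lookup can be sketched as follows, assuming as inputs the syllable list and the character index of the stressed vowel in the word (an interface we assume for illustration):

```python
def stressed_syllable(syllables, stressed_vowel_index):
    """Return the index of the syllable containing the stressed vowel,
    given the vowel's character position in the whole word."""
    start = 0
    for k, syl in enumerate(syllables):
        if start <= stressed_vowel_index < start + len(syl):
            return k
        start += len(syl)
    raise ValueError("index outside the word")
```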
4.3. LapsStory
The LapsStory corpus is based on spoken books, or audiobooks. Since the audio files come with their respective transcriptions (the books themselves), a considerable reduction in human effort can be achieved.
The original audio files were manually segmented into smaller files, which were re-sampled from 44,100 Hz to 22,050 Hz with 16 bits per sample. Currently, the LapsStory corpus consists of 8 speakers, corresponding to 16 hours and 17 minutes of audio. Unfortunately, the LapsStory corpus cannot be completely released, in order to protect the copyright of some audiobooks. Therefore, only part of the LapsStory corpus, corresponding to 9 hours of audio, is publicly available [15].
It should be noted that the acoustic environment of audiobooks is very controlled, so the audio files have no audible noise and a high signal-to-noise ratio. Thus, when such files are used to train a system that will operate in a noisy environment, there is an acoustic mismatch problem. This difficulty was circumvented by the technique proposed in Silva et al.
VOICECONET: A Collaborative Framework for Speech-Based Computer Accessibility with a Case Study for Brazilian Portuguese
[41], which showed that speaker adaptation techniques can be used to combat such acoustic
mismatch.
4.4. LapsBenchmark
Another developed corpus is the LapsBenchmark, which aims to be a benchmark reference
for testing BP systems. The LapsBenchmark’s recordings were performed on computers using
common (cheap) desktop microphones and the acoustic environment was not controlled.
Currently, the LapsBenchmark corpus has data from 35 speakers with 20 sentences each,
which corresponds to 54 minutes of audio. It was used the phrases described in Cirigliano
et al. [8]. The used sampling rate was 22,050 Hz and each sample was represented with 16
bits. The LapsBenchmark speech database is publicly available [15].
With a valid BP front end in place, the HTS toolkit was used to create an HMM-based back end. To facilitate the procedure, MARY provides the VoiceImport tool, illustrated in Figure 6, which is an automatic HMM training routine. HMM training requires a labeled corpus with transcribed speech. The speech data available with the BP demo for HTS [26] was used, with a total of 221 files corresponding to approximately 20 minutes of audio. The word-level transcriptions were not found at the HTS site and, for the convenience of other users, were made available in electronic format at Fal [15].
After this stage, the TTS system for BP is already supported by the MARY platform. All the
developed resources and adopted procedures are publicly available [15].
5. Experimental results
This section presents the baseline results obtained with all the developed ASR and TTS resources for BP. The scripts and developed models were made publicly available [15]. All the experiments were executed on a computer with an Intel Core 2 Duo processor (E6420, 2.13 GHz) and 1 GB of RAM.
The goal of the first tests was to understand, in practice, how the main components of an LVCSR system behave: the phonetic dictionary, acoustic model, language model and decoder. The correlation between these components and the performance of the system is analyzed with respect to the real-time (xRT) factor as well as the word error rate (WER). Finally, the quality of the speech produced by the developed TTS system was compared with other synthesizers, including commercial software.
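For reference, the two metrics can be computed as follows: the WER is the word-level Levenshtein distance normalized by the reference length, and the xRT factor is simply processing time divided by audio duration.

```python
def wer(reference, hypothesis):
    """Word error rate: (substitutions + deletions + insertions) divided by
    the number of reference words, via word-level Levenshtein distance.
    The xRT factor, in turn, is processing_time / audio_duration."""
    ref, hyp = reference.split(), hypothesis.split()
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution / match
    return d[len(ref)][len(hyp)] / len(ref)
```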
described in Bisol [3], were considered as errors encountered during the syllabification process of these sources.

Sources                           Words  Errors  Error rates
Portuguese web dictionary          170      1      0.58%
Syllabification algorithm (old)    170     29     11.17%
Syllabification algorithm (new)    170      2      1.17%
The new version of the syllabification algorithm produced very few errors compared to its previous version. The remaining errors were detected in two words in the context of the falling diphthongs: “eufonia” and “ousadia”. The syllable divisions “eu-fo-nia” and “ou-sa-dia” are incorrect because each one contains two vowels that should be separated to compose the nuclei of two different syllables. The accepted syllabifications are “eu-fo-ni-a” and “ou-sa-di-a”.
Figure 7. Perplexity against the number of sentences (×1,000) used to train the language model, for the CETENFolha corpus and for CETENFolha combined with crawling.
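The perplexity plotted in Figure 7 is the standard quantity: the exponential of the average negative log-probability the language model assigns to each test token. A minimal sketch:

```python
import math

def perplexity(token_probs):
    """Perplexity of a language model over a test set, given the probability
    the model assigns to each token. Lower values indicate a better model."""
    n = len(token_probs)
    return math.exp(-sum(math.log(p) for p in token_probs) / n)
```

For instance, a model that assigns probability 1/4 to every token has perplexity exactly 4.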
digitized voice, transcribed at the level of words (orthography) and/or at the level of phones. The configuration used is detailed below:
The speaker independent acoustic model was initially trained using the LapsStory and West
Point corpora, which corresponds to 21.65 hours of audio, and the UFPAdic. After that, the
HTK software was used to adapt the acoustic model, using the maximum likelihood linear
regression (MLLR) and maximum a posteriori (MAP) techniques with the Spoltech corpus,
which corresponds to 4.3 hours of audio. This adaptation process was used to combat acoustic
mismatches and is described in Silva et al. [41]. Both MAP and MLLR were used in the
supervised training (offline) mode.
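Of the two adaptation techniques, MAP has a particularly compact update. The sketch below shows the standard one-dimensional mean update in the spirit of Lee & Gauvain [25], not the exact HTK implementation:

```python
def map_adapt_mean(prior_mean, tau, frames, gammas):
    """MAP adaptation of a single Gaussian mean (scalar sketch): the adapted
    mean interpolates between the prior (speaker-independent) mean and the
    occupation-weighted average of the adaptation frames. tau controls how
    strongly the prior is trusted."""
    num = tau * prior_mean + sum(g * x for g, x in zip(gammas, frames))
    den = tau + sum(gammas)
    return num / den
```

With little adaptation data (or large tau) the result stays close to the prior mean, which is why MAP degrades gracefully when only a few hours of speech, such as the 4.3 hours of Spoltech, are available.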
For decoding, the experiments in this work adopt Julius rev. 4.1.5 [24] and HDecode (part of HTK) [55]. The trigram language model previously trained with 1,534,980 sentences was used, and the output distributions of the HMMs were modeled with 14-component Gaussian mixtures. The LapsBenchmark corpus was used to evaluate the models.
Several tests were conducted in order to find the best decoding parameters for both HDecode and Julius. It was observed that the increases in the xRT factor and in recognition
accuracy can both be traced to the pruning applied by the decoder at the acoustic level. The pruning process is implemented at each time step by keeping a record of the best hypothesis overall and de-activating all hypotheses whose log probabilities fall more than a beam width below the best. Setting the beam width is thus a compromise between speed and avoiding search errors, as shown in Figure 8.
Figure 8. WER against xRT factor using the Julius and HDecode recognizers.
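The pruning step described above can be sketched directly. The data structure is hypothetical; real decoders organize hypotheses as tokens in a lattice:

```python
def prune(hypotheses, beam_width):
    """Beam pruning: keep only hypotheses whose log probability is within
    beam_width of the current best; all others are de-activated at this
    time step. Real decoders also cap the number of active tokens."""
    best = max(hypotheses.values())
    return {h: lp for h, lp in hypotheses.items() if lp >= best - beam_width}
```

A wider beam keeps more hypotheses alive (fewer search errors, higher xRT factor); a narrower one trades accuracy for speed, which is exactly the compromise in Figure 8.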
The tests were performed varying the beam width from 100 to 400 for HDecode and up to 15,000 for Julius, since beyond these values the WER stopped improving while the xRT factor increased significantly. It was observed that Julius can apply more aggressive pruning than HDecode without significantly increasing the xRT factor. On the other hand, Julius could not achieve the same WER obtained with HDecode. The best decoding parameters found for HDecode and Julius are described in Table 4 and Table 5, respectively. For convenience, the xRT factor was kept around one.
Table 4. Best decoding parameters for HDecode.

Parameter                     Value
Pruning beam width            220
Language model scale factor   20
Word insertion penalty        22
Word end beam width           100
Number of tokens per state    8
Acoustic scale factor         1.5
Figure 9. WER against the number of Gaussians used for training the acoustic model.
It was perceived that the computational cost added by increasing the number of Gaussians is compensated by an improvement in decoder performance. The WER with 14-component Gaussian mixtures is 18.24% and 27.33% for HDecode and Julius, respectively. Since the WER increased above 14 Gaussians, the following experiments adopt this number as the default.
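The cost being traded off is the per-frame likelihood computation, which grows linearly with the number of mixture components. A scalar sketch of the GMM log-likelihood, computed with the usual log-sum-exp trick:

```python
import math

def gmm_loglike(x, weights, means, variances):
    """Log-likelihood of a scalar observation under a Gaussian mixture.
    One log-density per component, combined via log-sum-exp, so the cost
    per frame is linear in the number of components."""
    logs = [math.log(w) - 0.5 * (math.log(2 * math.pi * v) + (x - m) ** 2 / v)
            for w, m, v in zip(weights, means, variances)]
    top = max(logs)  # subtract the max for numerical stability
    return top + math.log(sum(math.exp(l - top) for l in logs))
```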
Figure 10. WER and xRT factor observed on tests of speaker independent recognition.
Figure 11. WER and xRT factor observed on tests of speaker dependent recognition.
randomly played, and the listener had to rate its pleasantness and repeat what he/she heard. In total, ten people were interviewed, none of whom had formal education in the speech area. The results are shown in Figures 12 and 13.
In the pleasantness evaluation, the developed TTS system and Liane were considered slightly annoying, while Raquel was evaluated as a good synthesizer. It was observed that even the most developed systems are still less natural than the human voice. In the objective test, the TTS system (with a WER around 10%) outperformed Liane, while Raquel again achieved the best result.
6. Conclusions
This chapter advocates the potential of speech-enabled applications as a tool for increasing social inclusion and education. This is especially true for users with special needs. In order to have such applications for underrepresented languages, there is still a lot of work to be done. As discussed, one of the main problems is the lack of data for training and testing the systems, which are typically data-driven. So, in order to minimize cost, this work presented VOICECONET, a collaborative framework for building applications using speech recognition and synthesis technologies. Free tools and resources are made publicly available [15], including complete TTS and ASR systems for BP.
Future work includes expanding both the audio and text databases, aiming at reaching the performance obtained by ASR for English and Japanese, for example. Refining the phonetic dictionary is another important issue to be addressed, considering the dialectal variation that exists in Brazil. In parallel, the free TTS for BP will be improved, and prototypes will be developed for groups of people who are not usually the main target of the assistive technology industry and who require special attention. Finally, one important goal is to help groups that use the network to develop resources for languages other than BP.
Acknowledgements
This work was supported by Conselho Nacional de Desenvolvimento Científico e Tecnológico
(CNPq), Brazil, project no. 560020/2010-4.
Author details
Nelson Neto, Pedro Batista and Aldebaro Klautau
Federal University of Pará (UFPA), Signal Processing Laboratory (LaPS) – http://www.laps.ufpa.br,
Belém – PA – Brazil
7. References
[1] Agranoff, R. [2006]. Inside collaborative networks: Ten lessons for public managers,
Public Administration Review, 66 pp. 56–65.
[2] Antoniol, G., Fiutem, R., Flor, R. & Lazzari, G. [1993]. Radiological reporting based
on voice recognition, Human-computer interaction. Lecture Notes in Computer Science, 753
pp. 242–253.
[3] Bisol, L. [2005]. Introdução a Estudos de Fonologia do Português Brasileiro, Porto Alegre:
EDIPUCRS.
[4] Black, A., Taylor, P. & Caley, R. [1999]. The Festival Speech Synthesis System, The University
of Edinburgh, System Documentation, Edition 1.4.
[5] Calado, A., Freitas, J., Silva, P., Reis, B., Braga, D. & Dias, M. S. [2010]. Yourspeech:
Desktop speech data collection based on crowd sourcing in the internet, International
Conference on Computational Processing of Portuguese - Demos Session .
URL: http://pt.yourspeech.net
[6] Caseiro, D., Trancoso, I., Oliveira, L. & Viana, C. [2002]. Grapheme-to-phone using
finite-state transducers, IEEE Workshop on Speech Synthesis pp. 215–218.
[7] CET [2012]. CETENFolha Text Corpus. Visited in March.
URL: www.linguateca.pt/CETENFolha/
[8] Cirigliano, R., Monteiro, C., Barbosa, F., Resende, F., Couto, L. & Moraes, J. [2005].
Um conjunto de 1000 frases foneticamente balanceadas para o Português Brasileiro
obtido utilizando a abordagem de algoritmos genéticos, XXII Simpósio Brasileiro de
Telecomunicações pp. 544–549.
[9] Damper, R. [2001]. Data-Driven Methods in Speech Synthesis, Kluwer Academic.
[10] Davis, S. & Mermelstein, P. [1980]. Comparison of parametric representations for
monosyllabic word recognition in continuously spoken sentences, IEEE Transactions on
Acoustics, Speech, and Signal Processing, 28 (4): 357–366.
[11] Deshmukh, N., Ganapathiraju, A. & Picone, J. [1999]. Hierarchical search for
large-vocabulary conversational speech recognition, IEEE Signal Processing Magazine
pp. 84–107.
[12] Dic [2012]. Portuguese Web Dictionary. Visited in March.
URL: www.dicionarioweb.com.br
[13] Dutoit, T., Pagel, V., Pierret, N., Bataille, F. & Vrecken, O. [1996]. The MBROLA
project: Towards a set of high quality speech synthesizers free of use for non commercial
purposes, Proceedings of the 4th International Conference of Spoken Language Processing
pp. 1393–1396.
[14] Eskenazi, M. [2009]. An overview of spoken language technology for education, Speech
Communication, 51 (10): 832–844.
[15] Fal [2012]. Research Group FalaBrasil. Visited in March.
URL: www.laps.ufpa.br/falabrasil/index_en.php
[16] Fidalgo-Neto, A., Tornaghi, A., Meirelles, R., Berçot, F., Xavier, L., Castro, M. & Alves, L.
[2009]. The use of computers in Brazilian primary and secondary schools, Computers &
Education, 53 (3): 677–685.
[17] Godfrey, J., Holliman, E. & McDaniel, J. [1992]. SWITCHBOARD: Telephone speech
corpus for research and development, IEEE International Conference on Acoustics, Speech
and Signal Processing, 1 pp. 517–520.
[18] Huang, X., Acero, A. & Hon, H. [2001]. Spoken Language Processing, Prentice Hall.
[19] Huggins-Daines, D., Kumar, M., Chan, A., Black, A. W., Ravishankar, M. & Rudnicky,
A. [2006]. Pocketsphinx: A free, real-time continuous speech recognition system for
hand-held devices, Proceedings of ICASSP pp. 185–188.
[20] IBG [2010]. Census Demographic Profiles.
URL: www.ibge.gov.br/home/estatistica/populacao/censo2010/
[21] JSA [2012]. Java Speech API. Visited in March.
URL: java.sun.com/products/java-media/speech/
[22] Juang, B. H. & Rabiner, L. R. [1991]. Hidden Markov models for speech recognition,
Technometrics, 33 (3): 251–272.
[23] Kornai, A. [1999]. Extended Finite State Models of Language, Cambridge University
Press.
[24] Lee, A., Kawahara, T. & Shikano, K. [2001]. Julius - an open source real-time
large vocabulary recognition engine, Proc. European Conf. on Speech Communication and
Technology pp. 1691–1694.
[25] Lee, C. & Gauvain, J. [1993]. Speaker adaptation based on MAP estimation of HMM
parameters, IEEE International Conference on Acoustics, Speech and Signal Processing, 2
pp. 558–561.
[26] Maia, R., Zen, H., Tokuda, K., Kitamura, T. & Resende, F. [2006]. An HMM-based
Brazilian Portuguese speech synthesizer and its characteristics, Journal of Communication
and Information Systems, 21 pp. 58–71.
[27] Mea [1996]. P.800 - ITU: Methods for Subjective Determination of Transmission Quality.
URL: www.itu.int/rec/T-REC-P.800-199608-I/en
[28] MLD [2012]. Microsoft Development Center. Visited in March.
URL: www.microsoft.com/portugal/mldc/default.mspx
[46] Teixeira, A., Oliveira, C. & Moutinho, L. [2006]. On the use of machine
learning and syllable information in European Portuguese grapheme-phone conversion,
Computational Processing of the Portuguese Language, Springer, 3960 pp. 212–215.
[47] UFR [2012]. Accessibility Projects of NCE/UFRJ. Visited in March.
URL: http://intervox.nce.ufrj.br/
[48] Vandewalle, P., Kovacevic, J. & Vetterli, M. [2009]. Reproducible research in signal
processing - what, why, and how, IEEE Signal Processing Magazine, 26 pp. 37–47.
[49] Voi [2012]. VOICECONET. Visited in March.
URL: www.laps.ufpa.br/falabrasil/voiceconet/
[50] Vox [2012]. VoxForge.org. Visited in March.
URL: www.voxforge.org
[51] Walker, W., Lamere, P., Kwok, P., Raj, B., Singh, R., Gouvea, E., Wolf, P. & Woelfel,
J. [2004]. Sphinx-4: A Flexible Open Source Framework for Speech Recognition, Sun
Microsystems, TR-2004-139.
[52] Wang, T.-H. [2010]. Web-based dynamic assessment: Taking assessment as teaching
and learning strategy for improving students’ e-learning effectiveness, Computers &
Education, 54 (4): 1157–1166.
[53] Ynoguti, C. A. & Violaro, F. [2008]. A Brazilian Portuguese speech database, XXVI
Simpósio Brasileiro de Telecomunicações pp. 1–6.
[54] Yoshimura, T., Tokuda, K., Masuko, T., Kobayashi, T. & Kitamura, T. [1999].
Simultaneous modeling of spectrum, pitch and duration in HMM-based speech
synthesis, Proc. of EUROSPEECH, 5 pp. 2347–2350.
[55] Young, S., Ollason, D., Valtchev, V. & Woodland, P. [2006]. The HTK Book, Cambridge
University Engineering Department, Version 3.4.