
DEEP NEURAL NETWORKS FOR COCHANNEL SPEAKER IDENTIFICATION

Xiaojia Zhao1, Yuxuan Wang1 and DeLiang Wang1,2


1 Department of Computer Science and Engineering, The Ohio State University, Columbus, OH, USA
2 Center for Cognitive and Brain Sciences, The Ohio State University, Columbus, OH, USA
{zhaox, wangyuxu, dwang}@cse.ohio-state.edu

ABSTRACT The top speaker is then paired with the rest for expectation-
maximization (EM) based gain estimation. The output is the speaker
Speaker identification (SID) in cochannel speech, where two speak-
pair whose gain adapted model maximizes the likelihood of the test
ers are talking simultaneously over a single recording channel, is a
utterance. Their system achieves the average SID accuracy of better
challenging problem. Previous studies address this problem in the
than 98%. Li et al. take a very similar SID approach [9]. It adds a few
anechoic environment under the Gaussian mixture model (GMM)
framework. On the other hand, cochannel SID in reverberant condi- constraints to the generation of the short list. The top speaker model
tions has not been addressed. This paper studies cochannel SID in is directly combined with each of the rest and the combined models
both anechoic and reverberant conditions. We explore deep neural are used for SID directly without the EM step. The refined system
networks (DNNs) for cochannel SID and propose a DNN-based yields an accuracy greater than 99%. These two may be regarded as
recognition system. Evaluation results demonstrate the proposed the state-of-the-art cochannel SID methods.
DNN-based system outperforms the two state-of-the-art cochannel Due to the excellent performance of deep neural networks (DNNs)
SID systems in both anechoic and reverberant conditions and various in many tasks, researchers begin to study how to incorporate DNN in
target-to-interferer ratios. speaker recognition [2, 4, 17]. However, DNN has not been utilized
in cochannel SID to our knowledge. State-of-the-art cochannel SID
Index Terms— Cochannel speaker identification, reverberation,
performance is reported on the speech separation challenge (SSC)
deep neural network, Gaussian mixture model, target-to-interferer
corpus [7, 9]. This corpus [3], however, was tailored for robust
ratio
speech recognition rather than speaker recognition. The relative small
1. INTRODUCTION vocabulary and common words between training and testing reduce
the difficulty of the SID task [22]. In this study, we employ a speaker
To separate speech signals from multiple talkers, one can place mi-
recognition evaluation (SRE) dataset of the National Institute of
crophones at different locations and take advantage of the time and
Standards and Technology (NIST). We propose the first DNN-based
intensity differences of the recordings. The task, however, becomes
cochannel SID system working in both anechoic and reverberant
considerably more challenging with a single microphone. Cochannel
conditions. It trains a frame level multi-class DNN classifier that
speech is such a case where two speakers are recorded in a single
outputs the posterior probability of a frame being dominated by each
communication channel. Unlike a conversation, the speakers are not
speaker. Frame level decisions are integrated to make the final deci-
aware of each other, creating large amounts of overlapping speech.
sion.
Cochannel speech separation is a challenging problem. Supervised
The rest of the paper is organized as follows. In Sect. 2, we formu-
methods [13, 16] usually assume that the speaker identities are avail-
late the cochannel SID problem and describe the proposed system.
able in order to utilize the speaker models. Other work conducts
Sect. 3 describes the currently dominant GMM-based approach.
cochannel speaker identification (SID) as a front-end for separation,
Model training is discussed in Sect. 4, followed by evaluation and
or jointly with separation. Compared to cochannel speech recogni-
comparison in Sect. 5. We conclude this paper in Sect. 6.
tion, one advantage of cochannel SID is that it only needs a subset of
homogenous speech segments to infer speaker identities. Such seg-
ments are called usable speech [10]. How to group usable speech
across time into two streams is deemed as a sequential grouping
2. DNN-BASED COCHANNEL SID
problem. Shao and Wang jointly search all the grouping hypothesis We formulate cochannel SID as a discriminative learning problem,
and speaker candidates to get the optimal one [18, 19]. Mowlaee et where we directly learn a mapping from cochannel observations to
al. propose to treat cochannel SID and separation as an iterative pro- the corresponding speak identities. Specifically, we treat cochannel
cess [11]. Later they improve the performance by fusing adapted SID as a multi-class classification problem and employ DNN as the
GMM and Kullback-Leibler divergence scores [12]. Hershey et al. learning machine. To our knowledge, this is the first study of DNN-
get the best speech recognition performance thanks in part to excel- based cochannel SID.s
lent performance of cochannel SID and separation [7]. Their SID Figure 1 shows the schematic diagram of the proposed DNN-based
system first creates a short list of most probable speaker candidates. system. It trains a DNN using frame level features. The output layer
has the same number of nodes as speakers. Only the two nodes corre-
sponding to the underlying speakers have non-zero training labels.

This research was supported in part by an AFOSR grant (FA9550- During testing, the frame level output is aggregated across time to
12-1-0130). We would like to thank the Ohio Supercomputer Center generate the final output.
for providing computing resources.

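The paper does not spell out the aggregation rule used at test time. The sketch below assumes one simple reading: frame-level softmax posteriors are averaged over time and the two highest-scoring speakers are reported as the pair. The function and variable names are illustrative only, not taken from the paper.

```python
import numpy as np

def aggregate_and_decide(frame_posteriors):
    """Aggregate frame-level DNN posteriors into an utterance-level SID decision.

    frame_posteriors: array of shape (num_frames, num_speakers), each row a
    softmax output. Simple averaging over time is an assumption; the paper
    only states that frame-level outputs are aggregated across time.
    """
    utterance_scores = frame_posteriors.mean(axis=0)      # (num_speakers,)
    top_two = np.argsort(utterance_scores)[::-1][:2]      # two best-scoring speakers
    return tuple(sorted(top_two)), utterance_scores

# Toy usage with random posteriors for 200 frames and 100 speakers.
rng = np.random.default_rng(0)
logits = rng.normal(size=(200, 100))
posteriors = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
speaker_pair, scores = aggregate_and_decide(posteriors)
```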


[Figure 1. Schematic diagram of the proposed DNN-based cochannel SID system. The diagram shows the processing chain: input speech → frame level features → hidden layers → speaker labels, with frame score aggregation producing the output speaker IDs.]
We use frame level log-spectral features as input. To encode temporal context, we splice a window of 11 frames of features to train the DNN. The training target of the DNN is the true speaker identities. We use soft training labels where the two underlying speakers each have a probability of generating the current frame. The sum of their probabilities equals one, whereas the other speakers have zero probabilities. We compare frame level energy of two speakers and use their ratio for the soft labels. More specifically, we construct the ideal binary mask (IBM) [20], and frame level energy of each speaker is calculated from the mixture cochleagram according to the IBM.
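As a concrete illustration of this labeling scheme, here is a minimal sketch that derives the two non-zero label entries of a frame from the IBM and the mixture cochleagram. The array shapes, speaker indices, and the small flooring constant are assumptions, not details from the paper.

```python
import numpy as np

def soft_labels_from_ibm(cochleagram, ibm, target_idx, interferer_idx, num_speakers):
    """Build per-frame soft training labels from the IBM.

    cochleagram: (num_channels, num_frames) mixture energy.
    ibm: (num_channels, num_frames) binary mask, 1 where the target dominates.
    Returns labels of shape (num_frames, num_speakers) that sum to one per frame,
    with the mass split between the two underlying speakers by their energy ratio.
    """
    target_energy = (cochleagram * ibm).sum(axis=0)             # per-frame target energy
    interferer_energy = (cochleagram * (1.0 - ibm)).sum(axis=0)
    total = target_energy + interferer_energy + 1e-10           # avoid division by zero
    labels = np.zeros((cochleagram.shape[1], num_speakers))
    labels[:, target_idx] = target_energy / total
    labels[:, interferer_idx] = interferer_energy / total
    return labels
```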
The DNN employed in our study is a deep multilayer perceptron. The DNN uses three hidden layers, each having 1024 sigmoidal hidden units. The standard backpropagation algorithm coupled with dropout regularization (dropout rate 0.2) is used to train the network. No unsupervised pretraining is used, as we have sufficient labeled data. We use adaptive gradient descent along with a momentum term as the optimization technique. A momentum rate of 0.5 is used for the first 5 epochs, after which the rate increases to 0.9. We use a softmax output layer and cross-entropy as the loss function. The training data is discussed in Section 4.
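A minimal PyTorch sketch of this architecture follows. The per-frame feature dimension (64), the learning rate, and the use of plain SGD in place of the paper's adaptive gradient descent are assumptions; only the layer sizes, sigmoid units, dropout rate, momentum schedule, and softmax/cross-entropy output follow the description above.

```python
import torch
import torch.nn as nn

NUM_SPEAKERS = 100   # one output node per speaker
FEAT_DIM = 64        # assumed per-frame log-spectral dimension
CONTEXT = 11         # spliced window of frames

class CochannelSidDnn(nn.Module):
    """Three sigmoid hidden layers of 1024 units with dropout 0.2, as described above."""
    def __init__(self):
        super().__init__()
        dims = [CONTEXT * FEAT_DIM, 1024, 1024, 1024]
        layers = []
        for d_in, d_out in zip(dims[:-1], dims[1:]):
            layers += [nn.Linear(d_in, d_out), nn.Sigmoid(), nn.Dropout(0.2)]
        layers.append(nn.Linear(dims[-1], NUM_SPEAKERS))
        self.net = nn.Sequential(*layers)

    def forward(self, x):
        return self.net(x)   # unnormalized scores; softmax is folded into the loss

def soft_cross_entropy(logits, soft_targets):
    """Cross-entropy against soft speaker labels (two non-zero entries per frame)."""
    log_probs = torch.log_softmax(logits, dim=1)
    return -(soft_targets * log_probs).sum(dim=1).mean()

model = CochannelSidDnn()
# Stand-in optimizer: SGD with the stated momentum ramp (0.5 for the first 5 epochs, then 0.9).
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.5)
```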
3. GMM-BASED COCHANNEL SID

In this section, we present the currently dominant GMM-based cochannel SID framework. This introduction serves to contrast our DNN-based approach and to describe the algorithms used for later comparisons.
Given an observation O, the goal of cochannel SID is to get the two underlying speakers \hat{\theta}_a and \hat{\theta}_b that generate the observation. This can be formulated as searching for the speaker pair with the highest posterior probability:

\hat{\theta}_a, \hat{\theta}_b = \arg\max_{\theta_a,\theta_b} P(\theta_a, \theta_b \mid O)
                              = \arg\max_{\theta_a,\theta_b} \frac{p(O \mid \theta_a, \theta_b)\, P(\theta_a, \theta_b)}{p(O)}
                              = \arg\max_{\theta_a,\theta_b} p(O \mid \theta_a, \theta_b).   (1)

We apply the Bayes formula to convert the posterior probability to the likelihood of a joint distribution of two speakers, with the assumption that all speaker pairs are equally probable. p(O) is not dependent on speakers and can thus be dropped from the calculation. The question now becomes how to calculate likelihoods of a joint distribution. Shao and Wang have introduced a variable g into (1) to assign each speech segment to one of the two speaker sources [18, 19]. The derivation is shown as follows:

\hat{\theta}_a, \hat{\theta}_b = \arg\max_{\theta_a,\theta_b} p(O \mid \theta_a, \theta_b)
                              = \arg\max_{\theta_a,\theta_b} \sum_g p(O, g \mid \theta_a, \theta_b)
                              \approx \arg\max_{\theta_a,\theta_b} \big[ \max_g p(O, g \mid \theta_a, \theta_b) \big]
                              = \arg\max_{\theta_a,\theta_b} \prod_{X \in S} \max\big( p(X \mid \theta_a),\, p(X \mid \theta_b) \big).   (2)

Here X denotes a speech segment, S the set of all segments, and g an assignment vector of the same length as S. Each element of g is a binary label that assigns the corresponding segment to a speaker. The integration over all assignments is approximated as a max operation, assuming that the optimal assignment dominates the summation. By assuming that segments are independent, the problem reduces to finding the best assignment for each segment, and the likelihood of the utterance is the multiplication of segment likelihoods. The speaker pair with the highest likelihood is the SID output. The corresponding optimal assignment also gives a solution to the cochannel separation problem by organizing segments into two groups. In other words, this approach jointly performs cochannel SID and separation, so we name it joint SID & separation (JSS).
Li et al. have proposed a two stage algorithm that produces state-of-the-art performance in the SSC corpus [5, 9]. The first stage ranks speakers according to their posterior probabilities given the observation. The posterior probability of each speaker θ given X is calculated as follows:

P(\theta \mid X) = \frac{p(X \mid \theta)\, P(\theta)}{\sum_m p(X \mid \theta_m)\, P(\theta_m)}   (3)

where m is the speaker index. P(θ) and P(θ_m) are prior probabilities. Assuming that all the speakers are equally probable, the priors can be eliminated. Frame level posterior probabilities are aggregated across time to obtain utterance level probabilities. Speakers are ranked based on the aggregated scores. The top ten speakers are kept for the second stage, where the top speaker is combined with each of the remaining nine. The composite GMMs are used for standard speaker recognition to get the best speaker pair. We point out that the composition operates on a per frame basis.

Li et al.'s two stage algorithm is a fine-tuned version of Hershey et al.'s SID system [7]. Overall, the two systems yield the best performance in the SSC corpus, with Li et al.'s average performance around 1% higher.
4. MODEL TRAINING

In this study, we deal with both anechoic and reverberant test conditions. For the anechoic condition, we use anechoic data to train GMMs and DNNs. However, such models do not generalize well to reverberant conditions. Thus, we directly model speakers in the reverberant environments.

The degree of reverberation is typically indicated by reverberation time (T60), the time taken for a direct sound to attenuate by 60 dB [8]. Reverberation is modeled as a convolution between a room impulse response (RIR) and a direct sound signal. An RIR characterizes a specific reverberant environment and is determined by factors such as the geometry of the room and the locations of sound sources and receivers.

Assuming no knowledge of test reverberant conditions, we simulate N representative reverberant training conditions covering a plausible range of T60. Our previous study has shown that this technique has reasonable generalization [23]. We prepare training data in each of the N conditions. GMMs are trained using single speaker data, while DNNs are trained with cochannel data mixed at different target-to-interferer ratios (TIRs). Details are given in the next section.
5. EVALUATION AND COMPARISON

5.1. Experimental Setup

We randomly select 100 speakers from the 2008 NIST SRE dataset (short2 part of the training set). The telephone conversation excerpt of each speaker is roughly 5 minutes long. Large chunks of silence in the excerpt are removed. Then we divide the recording into 5 s pieces. Two pieces with the highest energy are used for tests in order to provide sufficient speech information. The rest is used for training. Overall each speaker has about 20 training utterances. More details of the evaluation corpus can be found in [23].

A Matlab implementation of the image method of Allen and Berkley is used to simulate room reverberation [1, 6]. We focus on the T60 range up to 1 s, which covers realistic reverberant conditions [8]. Three rooms are simulated to obtain 3 training T60's: 300, 600 and 900 ms. For each T60, we generate 5 RIRs by randomly positioning the source and receiver while keeping their distance fixed at 2 m. Each training utterance is convolved with the 5 RIRs of each room to create reverberant training data. Seven rooms are simulated to obtain 7 test T60's from 300 ms to 900 ms with a step size of 100 ms. We randomly generate 3 pairs of RIRs at each T60, where each pair provides one RIR for the target and one for the interferer. In total there are 21 pairs of test RIRs. Note that the RIRs are different between training and testing even when they are generated with the same T60.
DNNs are trained using cochannel training data. Instead of one DNN per speaker, we train a universal DNN for all the speakers. We include training data from every speaker pair for complete coverage. For anechoic conditions, we create 10 anechoic cochannel utterances per speaker pair at 3 TIRs (-5, 0 and 5 dB). In total, there are 4950 speaker pairs and 49500 cochannel training utterances per TIR. For reverberant conditions, we create 10 reverberant cochannel utterances at each of the 3 T60's and 3 TIRs. In total, there are 49500 cochannel training utterances per TIR and per T60.
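A rough sketch of how one reverberant cochannel training utterance could be assembled from this recipe, assuming the RIRs from the image method are already generated. The TIR is imposed here by scaling the interferer's energy, which is a common convention but an assumption on our part.

```python
import numpy as np
from scipy.signal import fftconvolve

def mix_cochannel(target, interferer, rir_target, rir_interferer, tir_db):
    """Create one reverberant cochannel utterance at a given TIR (in dB).

    target, interferer: clean single-speaker waveforms (1-D arrays).
    rir_*: room impulse responses from the image method (assumed precomputed).
    The interferer-scaling convention for imposing the TIR is an assumption;
    the paper does not spell it out.
    """
    t = fftconvolve(target, rir_target)
    i = fftconvolve(interferer, rir_interferer)
    n = min(len(t), len(i))
    t, i = t[:n], i[:n]
    # Choose the interferer gain so that 10*log10(E_target / E_interferer) = tir_db.
    gain = np.sqrt((t ** 2).sum() / ((i ** 2).sum() * 10 ** (tir_db / 10.0) + 1e-12))
    return t + gain * i
```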
For JSS, we extract 22-dimensional MFCC as speaker features. Speaker models are adapted from a 1024-component universal background model (UBM) trained by pooling training data from all the speakers [15]. For Li et al., we extract 64-dimensional log-spectral features for GMM training. Specifically, a 64-channel gammatone filterbank is employed as the front-end. The filter output is converted to a cochleagram [21]. We take the log operation on the cochleagram to get the features. For anechoic conditions, a 256-component GMM is trained for each speaker [14]. Another 256-component GMM is trained using the reverberant training data, obtained by convolving the anechoic training data with the RIRs at 3 T60's.
The cochannel test set covers all possible speaker pairs. For each pair, we create two anechoic utterances and two reverberant utterances at -5, 0 and 5 dB TIRs. In total, there are 9900 anechoic test utterances and 9900 reverberant test utterances per TIR. Each reverberant cochannel test utterance is created using a randomly selected RIR pair from the 21 RIR pair library.

5.2. Performance on the SSC Corpus

The state-of-the-art cochannel SID systems of Hershey et al. and Li et al. have reported performance on the SSC corpus. This corpus consists of 17000 training utterances from 34 speakers. Each training utterance is created following a fixed grammar: command, color, preposition, letter, number, and adverb. Each of the six positions has a small number of word choices. The cochannel test set of the SSC corpus comprises six TIRs from -9 dB to 6 dB. There are 600 test utterances for each TIR. Every test utterance is mixed from clean test utterances of two speakers. Note that the clean utterances follow the same grammar and share the same vocabulary as the training utterances.

We evaluate our proposed system on this dataset in order to make a direct comparison. Table I gives the SID results of the proposed system and competing systems. As can be seen, our implementation of Li et al.'s two stage system achieves the same average performance as their paper. The proposed DNN-based system yields the best results, although the performance gain is probably not significant. As the results are nearly perfect, there is not much room for improvement, and we can conclude that the proposed system works comparably well.

5.3. Performance on NIST SRE Dataset with 50 speakers

First we test on a subset of 50 speakers with 1225 speaker pairs, to be roughly comparable with the SSC corpus in terms of speaker number. We create two cochannel utterances for each pair at each of 3 TIRs: −5 dB, 0 dB and 5 dB. In total, there are 2450 test trials per TIR. The performance is given in Table II. As shown in the table, there is a substantial drop of performance compared to the SSC corpus, confirming that the SSC corpus is rather easy for cochannel SID evaluation. For this dataset, JSS outperforms Li et al. by an average of 4.3%. We also evaluate the DNN-based cochannel SID system, which further outperforms the best competing system by a large margin (almost 13%).

Next we test in the reverberant conditions, and the results are shown in Table III. As can be seen, the performances of all the methods degrade in the reverberant conditions. JSS drops by about 30%. Li et al.'s is slightly more robust, but still drops by more than 20%. In addition, the proposed DNN-based system continues to perform the best, outperforming JSS by more than 19% and Li et al.'s system by 14%.

5.4. Performance on NIST SRE Dataset with 100 speakers

The SID task becomes more challenging as the number of speakers (classes) increases. To quantify cochannel SID dependency on the number of speakers, we have performed cochannel SID evaluation by increasing the number of speakers from 50 to 100, quadrupling the number of speaker pairs to 4950. Results are given in Table IV. As in the previous results, the default DNN configuration (3 hidden layers with 1024 nodes each) outperforms the best competing system. With the increase in the number of speakers as well as in training data size, we have also explored a few different DNN configurations. As we increase the number of units from 1024 to 2048 for each hidden layer, the SID performance improves by around 4.5%. There is a slight improvement as we expand the number of hidden layers from 3 to 5 without changing the hidden layer size, for either 1024 or 2048 hidden units. Further enlargement of the DNN size is expected to improve the performance even more, but at the expense of substantially increased computational complexity.
Table I: SID accuracy (%) on SSC corpus.

Method −9 dB −6 dB −3 dB 0 dB 3 dB 6 dB Avg.

Reported Performance of Hershey et al. 96.5 98.1 98.2 99.0 99.1 98.4 98.2

Reported Performance of Li et al. 97.3 98.8 99.5 99.7 99.7 98.8 99.0

JSS 89.8 95.3 98.0 98.5 98.2 96.8 96.1

Li et al. 96.7 99.0 99.5 99.7 100.0 99.2 99.0

DNN 98.3 99.5 100 99.8 100 99.0 99.4

Table II: SID accuracy (%) on anechoic NIST SRE dataset with 50 speakers

Method −5 dB 0 dB 5 dB Avg.

JSS 82.24 83.51 80.12 81.96

Li et al. 77.02 79.96 75.84 77.61

DNN 94.12 96.90 92.69 94.57

Table III: SID accuracy (%) on reverberant NIST SRE dataset with 50 speakers

Method −5 dB 0 dB 5 dB Avg.

JSS 51.43 53.76 49.51 51.57

Li et al. 55.02 59.35 56.37 56.91

DNN 70.86 75.31 66.29 70.82

Table IV: SID accuracy (%) on reverberant NIST SRE dataset with 100 speakers

Method −5 dB 0 dB 5 dB Avg.

JSS 39.59 41.76 38.70 40.02

Li et al. 43.58 47.12 43.58 44.76

DNN (1024 by 3) 52.67 59.78 52.58 55.01

DNN (1024 by 5) 54.13 61.31 54.33 56.59

DNN (2048 by 3) 56.91 64.76 56.99 59.55

DNN (2048 by 5) 57.32 64.82 57.52 59.89

6. CONCLUDING REMARKS

This paper has a number of novel contributions. Our first contribution lies in the introduction of DNN for cochannel SID. Our proposed DNN system substantially outperforms the state-of-the-art SID methods, which are GMM-based. Secondly, we address cochannel SID in reverberant conditions, a topic that has not been studied before.

Since this is the first study applying DNN to cochannel SID, there is likely room for future improvement. For instance, training features and labels can be systematically examined, and the DNN architecture can be optimized. With the excellent performance of cochannel SID, we believe that the use of DNN represents a promising direction to pursue noise robust SID, reverberation robust SID, and speaker verification tasks.

7. RELATION TO PRIOR WORK

The work presented here has focused on the cochannel SID problem. Previous studies on this topic focus on GMM-based approaches in the anechoic condition. DNN has not been studied for this problem, and there is no previous work on cochannel SID in reverberant conditions. Our study addresses this problem in both anechoic and reverberant conditions by introducing a DNN-based approach.

8. REFERENCES

[1] J.B. Allen and D.A. Berkley, "Image method for efficiently simulating small-room acoustics," Journal of the Acoustical Society of America, vol. 65, pp. 943-950, 1979.

[2] K. Chen and A. Salman, "Learning speaker-specific characteristics with a deep neural architecture," IEEE Transactions on Neural Networks, vol. 22, no. 11, pp. 1744-1756, 2011.

[3] M. Cooke and T. Lee, "Speech separation and recognition competition," 2006 [Online]. Available: http://www.dcs.shef.ac.uk/~martin/SpeechSeparationChallenge.htm

[4] S. Garimella and H. Hermansky, "Factor analysis of auto-associative neural networks with application in speaker verification," IEEE Transactions on Neural Networks and Learning Systems, vol. 24, no. 4, pp. 522-528, 2013.

[5] Y. Guan and W. Liu, "A two-stage algorithm for multi-speaker identification system," in Proc. International Symposium on Chinese Spoken Language Processing, 2008, pp. 161-164.

[6] E.A.P. Habets, "Room impulse response generator," 2010 [Online]. Available: http://home.tiscali.nl/ehabets/rir_generator.html
[7] J. Hershey, S. Rennie, P. Olsen, and T. Kristjansson, "Super-human multi-talker speech recognition: A graphical model approach," Computer Speech & Language, vol. 24, pp. 45-66, 2010.

[8] H. Kuttruff, Room Acoustics. New York, NY: Spon, 2000.

[9] P. Li, Y. Guan, S. Wang, B. Xu and W. Liu, "Monaural speech separation based on MAXVQ and CASA for robust speech recognition," Computer Speech & Language, vol. 24, pp. 30-44, 2010.
[10] J. M. Lovekin, R. E. Yantorno, K. R. Krishnamachari, D. S.
Benincasa and S. J. Wenndt, “Developing usable speech
criteria for speaker identification,” in Proc. ICASSP, 2001,
pp. 421–424.
[11] P. Mowlaee, R. Saeidi, Z. Tan, M. Christensen, P. Fränti,
and S. Jensen, “Joint single-channel speech separation and
speaker identification,” in Proc. ICASSP, 2010, pp. 4430–
4433.
[12] P. Mowlaee, R. Saeidi, M. Christensen, Z. Tan, T. Kinnunen,
P. Fränti and S. Jensen, “A joint approach for single-channel
speaker identification and speech separation,” IEEE Trans-
actions on Audio, Speech and Language Processing, vol. 20,
no. 9, pp. 2586-2601, 2012.
[13] A. Reddy and B. Raj, “Soft mask methods for single-channel
speaker separation,” IEEE Transactions on Audio, Speech
and Language Processing, vol. 15, no. 6, pp. 1766-1776,
2007.
[14] D.A. Reynolds, "Speaker identification and verification
using Gaussian mixture speaker models," Speech
Communication, vol. 17, pp. 91-108, 1995.
[15] D.A. Reynolds, T.F. Quatieri, and R.B. Dunn, "Speaker
verification using adapted Gaussian mixture models,"
Digital Signal Processing, vol. 10, pp. 19-41, 2000.
[16] S. T. Roweis, "One microphone source separation," in Proc.
NIPS, 2000, pp. 793-799.
[17] M. Senoussaoui, N. Dehak, P. Kenny, R. Dehak and P.
Dumouchel, “First attempt of Boltzmann machines for
speaker verification,” in Proc. Odyssey, The Speaker and
Language Recognition Workshop, 2012.
[18] Y. Shao and D.L. Wang, “Co-channel speaker identification
using usable speech extraction based on multi-pitch track-
ing,” in Proc. ICASSP, 2003, pp. 205-208.
[19] Y. Shao and D.L. Wang, “Model-based sequential organiza-
tion in cochannel speech,” IEEE Transactions on Audio,
Speech and Language Processing, vol. 14, no. 1, pp. 289-
298, 2006.
[20] D.L. Wang, "On ideal binary mask as the computational goal
of auditory scene analysis," in Speech separation by humans
and machines, P. Divenyi, Ed. Norwell, MA: Kluwer
Academic, 2005, pp. 181-197.
[21] D.L. Wang and G.J. Brown, Eds., Computational Auditory Scene Analysis: Principles, Algorithms, and Applications. Hoboken, NJ: Wiley-IEEE, 2006.

[22] X. Zhao, Y. Shao, and D.L. Wang, "CASA-based robust speaker identification," IEEE Transactions on Audio, Speech and Language Processing, vol. 20, no. 5, pp. 1608-1616, 2012.

[23] X. Zhao, Y. Wang and D.L. Wang, "Robust speaker identification in noisy and reverberant conditions," IEEE/ACM Transactions on Audio, Speech and Language Processing, vol. 22, no. 4, pp. 836-845, 2014.
