Improving robustness of deep neural network acoustic models via speech separation and joint adaptive training

Published: 01 January 2015

Abstract

Although deep neural network (DNN) acoustic models are known to be inherently noise robust, especially with matched training and testing data, the use of speech separation as a frontend and for deriving alternative feature representations has been shown to improve performance in challenging environments. We first present a supervised speech separation system that significantly improves automatic speech recognition (ASR) performance in realistic noise conditions. The system performs separation via ratio time-frequency masking; the ideal ratio mask (IRM) is estimated using DNNs. We then propose a framework that unifies separation and acoustic modeling via joint adaptive training. Since the modules for acoustic modeling and speech separation are implemented using DNNs, unification is done by introducing additional hidden layers with fixed weights and appropriate network architecture. On the CHiME-2 medium-large vocabulary ASR task, and with log mel spectral features as input to the acoustic model, an independently trained ratio masking frontend improves word error rates by 10.9% (relative) compared to the noisy baseline. In comparison, the jointly trained system improves performance by 14.4%. We also experiment with alternative feature representations to augment the standard log mel features, like the noise and speech estimates obtained from the separation module, and the standard feature set used for IRM estimation. Our best system obtains a word error rate of 15.4% (absolute), an improvement of 4.6 percentage points over the next best result on this corpus.
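The ideal ratio mask used as the separation target above has a simple closed form when parallel clean speech and noise are available, which is what makes it usable as a supervised training target for a DNN. The following minimal NumPy sketch is for illustration only (it is not the authors' code; the mask exponent, the epsilon smoothing, and the additive toy mixture are assumptions): it computes an IRM from speech and noise magnitude spectrograms and applies it to a noisy spectrogram, i.e., the oracle version of what the DNN-based separation frontend estimates at test time.

import numpy as np

def ideal_ratio_mask(speech_mag, noise_mag, beta=0.5, eps=1e-8):
    # IRM(t, f) = (S(t, f)^2 / (S(t, f)^2 + N(t, f)^2))^beta.
    # beta = 0.5 is a commonly used exponent; the paper's exact definition
    # and feature domain are not reproduced here.
    s2 = speech_mag ** 2
    n2 = noise_mag ** 2
    return ((s2 + eps) / (s2 + n2 + 2.0 * eps)) ** beta

def apply_mask(noisy_mag, mask):
    # Point-wise ratio masking of the noisy magnitude spectrogram; the masked
    # output can feed resynthesis or log mel feature extraction for the ASR backend.
    return mask * noisy_mag

if __name__ == "__main__":
    # Toy (frames x frequency-channels) magnitudes standing in for real
    # time-frequency representations of premixed speech and noise.
    rng = np.random.default_rng(0)
    speech = np.abs(rng.normal(size=(100, 64)))
    noise = np.abs(rng.normal(size=(100, 64)))
    noisy = speech + noise                     # crude additive mixture for illustration
    irm = ideal_ratio_mask(speech, noise)      # supervised training target
    enhanced = apply_mask(noisy, irm)          # oracle-masked spectrogram
    print(irm.shape, enhanced.shape, float(irm.mean()))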

Published In

IEEE/ACM Transactions on Audio, Speech and Language Processing, Volume 23, Issue 1
January 2015
217 pages
ISSN:2329-9290
EISSN:2329-9304
  • Editor: Haizhou Li

Publisher

IEEE Press

Publication History

Published: 01 January 2015
Accepted: 11 November 2014
Revised: 14 October 2014
Received: 29 June 2014
Published in TASLP Volume 23, Issue 1

Author Tags

  1. CHiME-2
  2. joint training
  3. ratio masking
  4. robust ASR
  5. time-frequency masking

Qualifiers

  • Article

