Improving robustness of deep neural network acoustic models via speech separation and joint adaptive training

Published: 01 January 2015

Abstract

Although deep neural network (DNN) acoustic models are known to be inherently noise robust, especially with matched training and testing data, the use of speech separation as a frontend and for deriving alternative feature representations has been shown to improve performance in challenging environments. We first present a supervised speech separation system that significantly improves automatic speech recognition (ASR) performance in realistic noise conditions. The system performs separation via ratio time-frequency masking; the ideal ratio mask (IRM) is estimated using DNNs. We then propose a framework that unifies separation and acoustic modeling via joint adaptive training. Since the modules for acoustic modeling and speech separation are implemented using DNNs, unification is done by introducing additional hidden layers with fixed weights and appropriate network architecture. On the CHiME-2 medium-large vocabulary ASR task, and with log mel spectral features as input to the acoustic model, an independently trained ratio masking frontend improves word error rates by 10.9% (relative) compared to the noisy baseline. In comparison, the jointly trained system improves performance by 14.4%. We also experiment with alternative feature representations to augment the standard log mel features, like the noise and speech estimates obtained from the separation module, and the standard feature set used for IRM estimation. Our best system obtains a word error rate of 15.4% (absolute), an improvement of 4.6 percentage points over the next best result on this corpus.
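The ideal ratio mask used as the separation target above has a simple closed form when parallel clean speech and noise are available, which is what makes it usable as a supervised training target for a DNN. The following minimal NumPy sketch is for illustration only (it is not the authors' code; the mask exponent, the epsilon smoothing, and the additive toy mixture are assumptions): it computes an IRM from speech and noise magnitude spectrograms and applies it to a noisy spectrogram, i.e., the oracle version of what the DNN-based separation frontend estimates at test time.

import numpy as np

def ideal_ratio_mask(speech_mag, noise_mag, beta=0.5, eps=1e-8):
    # IRM(t, f) = (S(t, f)^2 / (S(t, f)^2 + N(t, f)^2))^beta.
    # beta = 0.5 is a commonly used exponent; the paper's exact definition
    # and feature domain are not reproduced here.
    s2 = speech_mag ** 2
    n2 = noise_mag ** 2
    return ((s2 + eps) / (s2 + n2 + 2.0 * eps)) ** beta

def apply_mask(noisy_mag, mask):
    # Point-wise ratio masking of the noisy magnitude spectrogram; the masked
    # output can feed resynthesis or log mel feature extraction for the ASR backend.
    return mask * noisy_mag

if __name__ == "__main__":
    # Toy (frames x frequency-channels) magnitudes standing in for real
    # time-frequency representations of premixed speech and noise.
    rng = np.random.default_rng(0)
    speech = np.abs(rng.normal(size=(100, 64)))
    noise = np.abs(rng.normal(size=(100, 64)))
    noisy = speech + noise                     # crude additive mixture for illustration
    irm = ideal_ratio_mask(speech, noise)      # supervised training target
    enhanced = apply_mask(noisy, irm)          # oracle-masked spectrogram
    print(irm.shape, enhanced.shape, float(irm.mean()))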

Published In

IEEE/ACM Transactions on Audio, Speech and Language Processing, Volume 23, Issue 1
January 2015
217 pages
ISSN:2329-9290
EISSN:2329-9304
  • Editor: Haizhou Li

Publisher

IEEE Press

Publication History

Published: 01 January 2015
Accepted: 11 November 2014
Revised: 14 October 2014
Received: 29 June 2014
Published in TASLP Volume 23, Issue 1

Author Tags

  1. CHiME-2
  2. joint training
  3. ratio masking
  4. robust ASR
  5. time-frequency masking

Qualifiers

  • Article

