Deep Learning for Environmentally Robust Speech Recognition: An Overview of Recent Developments

Published: 24 April 2018

Abstract

Eliminating the negative effects of non-stationary environmental noise is a long-standing research topic for automatic speech recognition, yet it remains an important challenge. Data-driven supervised approaches, especially those based on deep neural networks, have recently emerged as potential alternatives to traditional unsupervised approaches; given sufficient training data, they can alleviate the shortcomings of unsupervised methods in various real-life acoustic environments. In this light, we review recently developed, representative deep learning approaches for tackling non-stationary additive and convolutional degradation of speech, with the aim of providing guidelines for those involved in developing environmentally robust speech recognition systems. We separately discuss single- and multi-channel techniques developed for the front-end and back-end of speech recognition systems, as well as joint front-end and back-end training frameworks. Throughout, we weigh the pros and cons of these approaches and report their experimental results on benchmark databases. We expect this overview to facilitate the development of speech recognition systems that are robust in noisy acoustic environments.
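
To make the surveyed front-end techniques concrete, the following minimal sketch trains a small LSTM to estimate a time-frequency mask from noisy magnitude spectra, in the spirit of the single-channel mask-based enhancement methods the article reviews. It is an illustrative sketch in PyTorch, not code from the article: the architecture, the signal-approximation loss, and all hyperparameters (n_bins, hidden, the learning rate) are assumptions made here for demonstration.

import torch
import torch.nn as nn

class MaskEstimator(nn.Module):
    """LSTM that predicts a [0, 1] time-frequency mask from noisy magnitudes."""
    def __init__(self, n_bins: int = 257, hidden: int = 256):
        super().__init__()
        self.lstm = nn.LSTM(n_bins, hidden, num_layers=2, batch_first=True)
        self.proj = nn.Linear(hidden, n_bins)

    def forward(self, noisy_mag: torch.Tensor) -> torch.Tensor:
        # noisy_mag: (batch, frames, n_bins) magnitude spectrogram
        h, _ = self.lstm(noisy_mag)
        return torch.sigmoid(self.proj(h))  # mask values in [0, 1]

# One training step on dummy data; in practice, clean/noisy pairs come from
# a corpus with simulated or recorded additive noise (hypothetical setup).
model = MaskEstimator()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
noisy = torch.rand(8, 100, 257) + 1e-3        # dummy noisy magnitudes
clean = noisy * torch.rand(8, 100, 257)       # dummy "clean" magnitudes
mask = model(noisy)
# Signal-approximation loss: the masked noisy spectrum should match the clean one.
loss = nn.functional.mse_loss(mask * noisy, clean)
opt.zero_grad()
loss.backward()
opt.step()
# At inference, mask * noisy is recombined with the noisy phase and inverted
# (e.g., via an inverse STFT) to obtain the enhanced signal fed to the recognizer.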

Published In

ACM Transactions on Intelligent Systems and Technology, Volume 9, Issue 5 (Research Survey and Regular Papers), September 2018, 274 pages
ISSN: 2157-6904
EISSN: 2157-6912
DOI: 10.1145/3210369

Publisher

Association for Computing Machinery, New York, NY, United States

Publication History

Published: 24 April 2018
Accepted: 1 January 2018
Revised: 1 November 2017
Received: 1 July 2017
Published in TIST Volume 9, Issue 5

Author Tags

1. Robust speech recognition
2. deep learning
3. multi-channel speech recognition
4. neural networks
5. non-stationary noise

Qualifiers

• Research-article
• Research
• Refereed

Funding Sources

• Huawei Technologies Co. Ltd
