Deep Learning for Environmentally Robust Speech Recognition: An Overview of Recent Developments

Published: 24 April 2018

Abstract

Eliminating the negative effects of non-stationary environmental noise is a long-standing research topic for automatic speech recognition, yet it remains an important challenge. Data-driven supervised approaches, especially those based on deep neural networks, have recently emerged as potential alternatives to traditional unsupervised approaches; given sufficient training data, they can alleviate the shortcomings of unsupervised methods in various real-life acoustic environments. In this light, we review recently developed, representative deep learning approaches for tackling non-stationary additive and convolutional degradation of speech, with the aim of providing guidelines for those involved in developing environmentally robust speech recognition systems. We separately discuss single- and multi-channel techniques developed for the front-end and back-end of speech recognition systems, as well as joint front-end and back-end training frameworks. Throughout, we weigh the pros and cons of these approaches and report their experimental results on benchmark databases. We expect this overview to facilitate the development of speech recognition systems that are robust in noisy acoustic environments.
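
To make the surveyed front-end techniques concrete, the following minimal sketch trains a small LSTM to estimate a time-frequency mask from noisy magnitude spectra, in the spirit of the single-channel mask-based enhancement methods the article reviews. It is an illustrative sketch in PyTorch, not code from the article: the architecture, the signal-approximation loss, and all hyperparameters (n_bins, hidden, the learning rate) are assumptions made here for demonstration.

import torch
import torch.nn as nn

class MaskEstimator(nn.Module):
    """LSTM that predicts a [0, 1] time-frequency mask from noisy magnitudes."""
    def __init__(self, n_bins: int = 257, hidden: int = 256):
        super().__init__()
        self.lstm = nn.LSTM(n_bins, hidden, num_layers=2, batch_first=True)
        self.proj = nn.Linear(hidden, n_bins)

    def forward(self, noisy_mag: torch.Tensor) -> torch.Tensor:
        # noisy_mag: (batch, frames, n_bins) magnitude spectrogram
        h, _ = self.lstm(noisy_mag)
        return torch.sigmoid(self.proj(h))  # mask values in [0, 1]

# One training step on dummy data; in practice, clean/noisy pairs come from
# a corpus with simulated or recorded additive noise (hypothetical setup).
model = MaskEstimator()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
noisy = torch.rand(8, 100, 257) + 1e-3        # dummy noisy magnitudes
clean = noisy * torch.rand(8, 100, 257)       # dummy "clean" magnitudes
mask = model(noisy)
# Signal-approximation loss: the masked noisy spectrum should match the clean one.
loss = nn.functional.mse_loss(mask * noisy, clean)
opt.zero_grad()
loss.backward()
opt.step()
# At inference, mask * noisy is recombined with the noisy phase and inverted
# (e.g., via an inverse STFT) to obtain the enhanced signal fed to the recognizer.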

Published In

ACM Transactions on Intelligent Systems and Technology, Volume 9, Issue 5 (Research Survey and Regular Papers), September 2018, 274 pages
ISSN: 2157-6904
EISSN: 2157-6912
DOI: 10.1145/3210369

Publisher

Association for Computing Machinery, New York, NY, United States

Publication History

Published: 24 April 2018
Accepted: 1 January 2018
Revised: 1 November 2017
Received: 1 July 2017
Published in TIST Volume 9, Issue 5

Author Tags

1. Robust speech recognition
2. deep learning
3. multi-channel speech recognition
4. neural networks
5. non-stationary noise

Qualifiers

• Research-article
• Research
• Refereed

Funding Sources

• Huawei Technologies Co. Ltd
