
Multi-level region-of-interest CNNs for end to end speech recognition

  • Original Research
Journal of Ambient Intelligence and Humanized Computing

Abstract

Efficient and robust automatic speech recognition (ASR) systems are in high demand. Most ASR systems are fed cepstral features such as mel-frequency cepstral coefficients (MFCCs) or perceptual linear prediction (PLP) coefficients. However, some attempts have also been made to use simpler inputs, such as critical-band energies or spectrograms, with deep learning models. These approaches claim the ability to train directly on the raw signal. Such systems depend heavily on the discriminative power of the ConvNet layers to separate two phonemes with nearly similar pronunciations, yet they do not offer a high recognition rate. A main reason for the limited recognition rate is stride-based pooling, which sharply reduces output dimensionality (by at least 75%). To improve performance, region-based convolutional neural networks (R-CNNs) and Fast R-CNN were proposed, but their performance did not reach the expected level. Therefore, a new pooling technique, multilevel region-of-interest (RoI) pooling, is proposed, which pools multilevel information from multiple ConvNet layers. The resulting architecture is named the multilevel RoI convolutional neural network (MR-CNN). It is designed by simply placing RoI pooling layers after up to four of the coarsest ConvNet layers, enriching the extracted features with information from multiple levels of the network. Its performance is evaluated on the TIMIT and Wall Street Journal (WSJ) datasets for phoneme recognition. On raw speech, the model achieves phoneme error rates of 16.4% on TIMIT and 17.1% on WSJ, slightly better than the rates obtained with spectral features.
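The core idea in the abstract, pooling the same region of interest from several ConvNet layers of different depths and concatenating the results, can be sketched in a few lines. The following is a minimal, hypothetical NumPy illustration of 1-D multilevel RoI pooling, not the authors' implementation; the function names, the per-layer stride handling, and the assumption that each pooled region is at least as wide as the number of output bins are all ours.

```python
import numpy as np

def roi_pool_1d(feature_map, start, end, out_bins=4):
    """Max-pool a 1-D region of interest into a fixed number of bins.

    feature_map: (channels, width) array from one conv layer.
    Assumes the region [start, end) is at least out_bins frames wide.
    """
    region = feature_map[:, start:end]                       # (channels, region_width)
    edges = np.linspace(0, region.shape[1], out_bins + 1).astype(int)
    # Max over each bin gives a fixed-size output regardless of region width.
    return np.stack([region[:, edges[i]:edges[i + 1]].max(axis=1)
                     for i in range(out_bins)], axis=1)       # (channels, out_bins)

def multilevel_roi_pool(feature_maps, strides, roi, out_bins=4):
    """Pool the same RoI (given in input-frame coordinates) from several
    conv layers, rescaling it by each layer's cumulative stride, and
    concatenate the fixed-size outputs along the channel axis."""
    pooled = []
    for fmap, stride in zip(feature_maps, strides):
        s = roi[0] // stride
        e = max(s + 1, roi[1] // stride)                      # keep region non-empty
        pooled.append(roi_pool_1d(fmap, s, e, out_bins))
    return np.concatenate(pooled, axis=0)                     # (sum of channels, out_bins)
```

Because every layer contributes the same number of bins, the concatenated output has a fixed size and can feed a fully connected classifier, which is what lets the coarser layers add context without the sharp dimensionality loss of plain strided pooling.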





Author information


Corresponding author

Correspondence to Rajesh Kumar Aggarwal.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Additional information

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


About this article


Cite this article

Singhal, S., Passricha, V., Sharma, P. et al. Multi-level region-of-interest CNNs for end to end speech recognition. J Ambient Intell Human Comput 10, 4615–4624 (2019). https://doi.org/10.1007/s12652-018-1146-z


