
Psycho-acoustics inspired automatic speech recognition

Published: 01 July 2021

Abstract

Understanding the human spoken-language recognition process remains a distant scientific goal. Commercial automatic speech recognisers (ASRs) now achieve high performance at recognising clean speech, but their approaches bear little relation to human speech recognition: they typically model the phonetic structure of speech while neglecting the supra-segmental and syllabic features that are integral to human speech recognition. As a result, these ASRs perform poorly on spontaneous speech and require enormous effort to build phonetic and pronunciation models and to capture the large variability of human speech. This paper presents a novel ASR that addresses these issues and questions conventional ASR approaches. It uses alternative acoustic models and an exhaustive decoding algorithm to process speech at a syllabic temporal scale (100–250 ms) through a multi-temporal approach inspired by psycho-acoustic studies. A performance comparison on the recognition of spoken Italian numbers (from 0 to 1 million) demonstrates that our approach is cost-effective, outperforms standard phonetic models, and reaches state-of-the-art performance.
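
To make the 100–250 ms syllabic scale concrete, the sketch below chops a waveform into fixed syllable-scale windows. Only the 100–250 ms range is taken from the abstract; the sampling rate, window and hop sizes, and function name are illustrative assumptions, and the paper's actual multi-temporal segmentation is more elaborate than fixed windowing.

    # Illustrative only: fixed syllable-scale windowing of a 16 kHz waveform.
    # The 100-250 ms range comes from the abstract; everything else is assumed.
    import numpy as np

    def syllable_scale_windows(signal: np.ndarray, sr: int = 16000,
                               win_ms: int = 200, hop_ms: int = 100):
        """Yield overlapping windows whose length lies in the 100-250 ms range."""
        win, hop = int(sr * win_ms / 1000), int(sr * hop_ms / 1000)
        for start in range(0, max(len(signal) - win, 0) + 1, hop):
            yield signal[start:start + win]

    # Example: one second of noise -> nine 200 ms windows, hopped every 100 ms.
    windows = list(syllable_scale_windows(np.random.randn(16000)))
    print(len(windows), len(windows[0]))  # 9 3200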

Highlights

We propose a novel Automatic Speech Recognizer inspired by psycho-acoustic studies.
It embeds multiple-time-scale speech processing inspired by human speech processing.
Its optimal architecture uses an LSTM-based acoustic model (sketched below) and 1 h of training audio.
It has performance comparable to conventional ASRs on recognising numbers up to 1 million.
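
As a companion to the highlight above, here is a minimal, hypothetical sketch of an LSTM-based acoustic model that classifies one syllable-scale window of acoustic features at a time. Only the LSTM choice and the 100–250 ms window scale come from the paper; the feature type, layer sizes, class count, and every name are illustrative assumptions, not the authors' implementation.

    # Hypothetical sketch, not the paper's implementation: an LSTM acoustic
    # model that scores one syllable-scale window of features per call.
    import torch
    import torch.nn as nn

    class SyllableScaleLSTM(nn.Module):
        def __init__(self, n_features: int = 13, hidden: int = 128, n_classes: int = 40):
            super().__init__()
            # Bidirectional LSTM over the frames inside one window.
            self.lstm = nn.LSTM(n_features, hidden, num_layers=2,
                                batch_first=True, bidirectional=True)
            self.classifier = nn.Linear(2 * hidden, n_classes)

        def forward(self, windows: torch.Tensor) -> torch.Tensor:
            # windows: (batch, frames, n_features); with 10 ms frames, a
            # 100-250 ms window is roughly 10-25 frames.
            _, (h_n, _) = self.lstm(windows)
            # Concatenate the final forward and backward hidden states.
            summary = torch.cat([h_n[-2], h_n[-1]], dim=-1)
            return self.classifier(summary)  # per-window class scores

    # Example: a batch of 8 windows, each 20 frames of 13 MFCC-like coefficients.
    model = SyllableScaleLSTM()
    print(model(torch.randn(8, 20, 13)).shape)  # torch.Size([8, 40])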


Cited By

  • (2022) A High-resolution Global-scale Model for COVID-19 Infection Rate. ACM Transactions on Spatial Algorithms and Systems 8(3), 1–24, DOI: 10.1145/3494531. Online publication date: 28-Jan-2022.
  • (2022) Automatic detection of potentially ineffective verbal communication for training through simulation in neonatology. Education and Information Technologies 27(7), 9181–9203, DOI: 10.1007/s10639-022-11000-z. Online publication date: 1-Aug-2022.


Published In

Computers and Electrical Engineering, Volume 93, Issue C
Jul 2021
1045 pages

Publisher

Pergamon Press, Inc.

United States

Publication History

Published: 01 July 2021

Author Tags

  1. Automatic speech recognition
  2. Deep learning
  3. Long short-term memory
  4. Convolutional neural networks
  5. Factorial hidden Markov models
  6. Hidden Markov models
  7. Speech
  8. Psycho-acoustics
  9. Syllables

Qualifiers

  • Research-article

