article

Automatic speech recognition and speech variability: A review

Authors:

Stephane Dupont,

C. WellekensAuthors Info & Claims

Speech Communication, Volume 49, Issue 10-11

Pages 763 - 786

https://doi.org/10.1016/j.specom.2007.02.006

Published: 01 October 2007 Publication History

Abstract

Major progress is being recorded regularly on both the technology and exploitation of automatic speech recognition (ASR) and spoken language systems. However, there are still technological barriers to flexible solutions and user satisfaction under some circumstances. This is related to several factors, such as the sensitivity to the environment (background noise), or the weak representation of grammatical and semantic knowledge. Current research is also emphasizing deficiencies in dealing with variation naturally present in speech. For instance, the lack of robustness to foreign accents precludes the use by specific populations. Also, some applications, like directory assistance, particularly stress the core recognition technology due to the very high active vocabulary (application perplexity). There are actually many factors affecting the speech realization: regional, sociolinguistic, or related to the environment or the speaker herself. These create a wide range of variations that may not be modeled correctly (speaker, gender, speaking rate, vocal effort, regional accent, speaking style, non-stationarity, etc.), especially when resources for system training are scarce. This paper outlines current advances related to these topics.

References

[1]

Aalburg, S., Hoege, H., 2004. Foreign-accented speaker-independent speech recognition. In: Proceedings of ICSLP, Jeju Island, Korea, pp. 1465-1468.

[2]

Abdel-Haleem, Y.H., Renals, S., Lawrence, N.D., 2004. Acoustic space dimensionality selection and combination using the maximum entropy principle. In: Proceedings of ICASSP, Montreal, Canada, pp. 637-640.

[3]

Abrash, V., Sankar, A., Franco, H., Cohen, M., 1996. Acoustic adaptation using nonlinear transformations of HMM parameters. In: Proceedings of ICASSP, Atlanta, GA, pp. 729-732.

[4]

Achan, K., Roweis, S., Hertzmann, A., Frey, B., 2004. A segmental HMM for speech waveforms. Technical Report UTML Techical Report 2004-001, University of Toronto, Toronto, Canada.

[5]

Investigating syllabic structures and their variation in spontaneous french. Speech Communication. v46 i2. 119-139.

[6]

Pronunciation variants across system configuration, language and speaking style. Speech Communication. v29 i2. 83-98.

[7]

A new statistical approach for the automatic segmentation of continuous speech signals. IEEE Transactions on Acoustics, Speech and Signal Processing. v36 i1. 29-40.

[8]

Prosody-based automatic detection of annoyance and frustration in human-computer. In: Proceedings of ICSLP, Denver, Colorado. pp. 2037-2040.

[9]

Language accent classification in american english. Speech Communication. v18 i4. 353-367.

[10]

Atal, B., 1983. Efficient coding of LPC parameters by temporal decomposition. In: Proceedings of ICASSP, Boston, USA, pp. 81-84.

[11]

A pattern recognition approach to voiced-unvoiced-silence classification with applications to speech recognition. IEEE Transactions on Acoustics, Speech, and Signal Processing. v24 i3. 201-212.

[12]

Frequency domain linear prediction for temporal features. In: Proceedings of ASRU, St. Thomas, US Virgin Islands, USA. pp. 261-266.

[13]

Taking the hit: leaving some lexical competition to be resolved post-lexically. Language and Cognitive Processes. v15 i5-6. 731-737.

[14]

Barrault, L., de Mori, R., Gemello, R., Mana, F., Matrouf, D., 2005. Variability of automatic speech recognition systems using different features. In: Proceedings of Interspeech, Lisboa, Portugal, pp. 221-224.

[15]

Bartkova, K., 2003. Generating proper name pronunciation variants for automatic speech recognition, In: Proceedings of ICPhS. Barcelona, Spain.

[16]

Bartkova, K., Jouvet, D., 1999. Language based phone model combination for ASR adaptation to foreign accent. In: Proceedings of ICPhS, San Francisco, USA, pp. 1725-1728.

[17]

Bartkova, K., Jouvet, D., 2004. Multiple models for improved speech recognition for non-native speakers. In: Proceedings of SPECOM, Saint Petersburg, Russia.

[18]

Beattie, V., Edmondson, S., Miller, D., Patel, Y., Talvola, G., 1995. An integrated multidialect speech recognition system with optional speaker adaptation. In: Proceedings of Eurospeech, Madrid, Spain, pp. 1123-1126.

[19]

Beauford, J.Q., 1999. Compensating for variation in speaking rate, PhD thesis, Electrical Engineering, University of Pittsburgh.

[20]

Effects of disfluencies, predictability, and utterance position on word form variation in english conversation. The Journal of the Acoustical Society of America. v113 i2. 1001-1024.

[21]

Benitez, C., Burget, L., Chen, B., Dupont, S., Garudadri, H., Hermansky, H., Jain, P., Kajarekar, S., Sivadas, S., 2001. Robust ASR front-end using spectral based and discriminant features: experiments on the aurora task. In: Proceedings of Eurospeech, Aalborg, Denmark, pp. 429-432.

[22]

Adaptation to a speaker's voice in a speech recognition system based on synthetic phoneme references. Speech Communication. v10 i5-6. 453-461.

[23]

Bonaventura, P., Gallochio, F., Mari, J., Micca, G., 1998. Speech recognition methods for non-native pronunciation variants. In: Proceedings ISCA Workshop on modelling pronunciation variations for automatic speech recognition, Rolduc, Netherlands, pp. 17-23.

[24]

Bonaventura, P., Gallochio, F., Micca, G., 1997. Multilingual speech recognition for flexible vocabularies. In: Proceedings of Eurospeech, Rhodes, Greece, pp. 355-358.

[25]

Bou-Ghazale, S.E., Hansen, J.L.H., 1994. Duration and spectral based stress token generation for HMM speech recognition under stress. In: Proceedings of ICASSP, Adelaide, Australia, pp. 413-416.

[26]

Bou-Ghazale, S.E., Hansen, J.L.H., 1995. Improving recognition and synthesis of stressed speech via feature perturbation in a source generator framework. In: ECSA-NATO Proceedings Speech Under Stress Workshop, Lisbon, Portugal, pp. 45-48.

[27]

Bourlard, H., Dupont, D., 1997. Sub-band based speech recognition. In: Proceedings of ICASSP, Munich, Germany, pp. 1251-1254.

[28]

Bozkurt, B., Couvreur, L., 2005. On the use of phase information for speech recognition. In: Proceedings of Eusipco, Antalya, Turkey.

[29]

Zeros of z-transform representation with application to source-filter separation in speech. IEEE Signal Processing Letters. v12 i4. 344-347.

[30]

Brugnara, F., De Mori, R., Giuliani, D., Omologo, M., 1992. A family of parallel Hidden Markov Models. In: Proceedings of ICASSP, vol. 1. pp. 377-380.

[31]

Automatic recognition of spontaneous speech for access to multilingual oral history archives. IEEE Transactions on Speech and Audio Processing. v12 i4. 420-435.

[32]

Carey, M., Parris, E., Lloyd-Thomas, H., Bennett, S., 1996. Robust prosodic features for speaker identification. In: Proceedings of ICSLP, Philadelphia, Pennsylvania, USA, pp. 1800-1803.

[33]

Carlson, B., Clements, M., 1992. Speech recognition in noise using a projection-based likelihood measure for mixture density HMMs. In: Proceedings of ICASSP, San Francisco, CA, pp. 237-240.

[34]

Chase, L., 1997. Error-responsive feedback mechanisms for speech recognizers. PhD thesis, Carnegie Mellon University.

[35]

Chen, C.J., Gopinath, R.A., Monkowski, M.D., Picheny, M.A., Shen, K., 1997. New methods in continuous mandarin speech recognition. In: Proceedings of Eurospeech, pp. 1543-1546.

[36]

Chen, C.J., Li, H., Shen, L., Fu, G., 2001. Recognize tone languages using pitch information on the main vowel of each syllable. In: Proceedings of ICASSP, vol. 1. pp. 61-64.

[37]

Chen, Y., 1987. Cepstral domain stress compensation for robust speech recognition. In: Proceedings of ICASSP, Dallas, TX, pp. 717-720.

[38]

Chesta, C., Laface, P., Ravera, F., 1999. Connected digit recognition using short and long duration models. In: Proceedings of ICASSP, vol. 2. pp. 557-560.

[39]

Chesta, C., Siohan, O., Lee, C.-H., 1999. Maximum a posteriori linear regression for Hidden Markov Model adaptation. In: Proceedings of Eurospeech, Budapest, Hungary, pp. 211-214.

[40]

Chollet, G.F., Astier, A.B.P., Rossi, M., 1981. Evaluating the performance of speech recognizers at the acoustic-phonetic level. In: in Proceedings of ICASSP, Atlanta, USA, pp. 758-761.

[41]

Cincarek, T., Gruhn, R. Nakamura, S., 2004. Speech recognition for multiple non-native accent groups with speaker-group-dependent acoustic models. In: Proceedings of ICSLP, Jeju Island, Korea, pp. 1509-1512.

[42]

Entropy based algorithms for best basis selection. IEEE Transactions on Information Theory. v38 i2. 713-718.

[43]

Colibro, D., Fissore, L., Popovici, C., Vair, C., Laface, P., 2005. Learning pronunciation and formulation variants in continuous speech applications. In: Proceedings of ICASSP, Philadelphia, PA, pp. 1001-1004.

[44]

Describing the emotional states that are expressed in speech. Speech Communication Special Issue on Speech and Emotions. v40 i1-2. 5-32.

[45]

Different aspects of expert pronunciation quality ratings and their relation to scores produced by speech recognition algorithms. Speech Communication. v30 i2-3. 109-119.

[46]

Dalsgaard, P., Andersen, O., Barry, W., 1998. Cross-language merged speech units and their descriptive phonetic correlates. In: Proceedings of ICSLP, Sydney, Australia, pp. 482-485.

[47]

D'Arcy, S.M., Wong, L.P., Russell, M.J., 2004. Recognition of read and spontaneous children's speech using two new corpora. In: Proceedings of ICSLP, Jeju Island, Korea.

[48]

Das, S., Lubensky, D., Wu, C., 1999. Towards robust speech recognition in the telephony network environment - cellular and landline conditions. In: Proceedings of Eurospeech, Budapest, Hungary, pp. 1959-1962.

[49]

Das, S., Nix, D., Picheny, M., 1998. Improvements in children speech recognition performance. In: Proceedings of ICASSP, vol. 1. Seattle, USA, pp. 433-436.

[50]

Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences. IEEE Transactions on Acoustics, Speech and Signal Processing. v28. 357-366.

[51]

Evaluation of formant-like features on an automatic vowel classification task. The Journal of the Acoustical Society of America. v116 i3. 1781-1792.

[52]

Recognition of syllables in a tone language. Speech Communication. v33 i3. 241-254.

[53]

Demuynck, K., Garcia, O., Van Compernolle, D., 2004. Synthesizing speech from speech recognition parameters. In: Proceedings of ICSLP'04, Jeju Island, Korea.

[54]

Deng, Y., Mahajan, M., Acero, A., 2003. Estimating speech recognition error rate without acoustic test data. In: Proceedings of Eurospeech, Geneva, Switzerland, pp. 929-932.

[55]

Di Benedetto, M.-G., Liénard, J.-S., 1992. Extrinsic normalization of vowel formant values based on cardinal vowels mapping. In: Proceedings of ICSLP, Alberta, USA, pp. 579-582.

[56]

Disfluency in spontaneous speech (diss'05). 2005. Aix-en-Provence, France.

[57]

Doddington, G., 2003. Word alignment issues in ASR scoring. In: Proceedings of ASRU, US Virgin Islands, pp. 630-633.

[58]

Draxler, C., Burger, S., 1997. Identification of regional variants of high german from digit sequences in german telephone speech. In: Proceedings of Eurospeech, pp. 747-750.

[59]

Pattern Classification and Scene Analysis. Wiley, New York.

[60]

Dupont, S., Ris, C., Couvreur, L., Boite, J.-M., 2005. A study of implicit anf explicit modeling of coarticulation and pronunciation variation. In: Proceedings of Interspeech, Lisboa, Portugal, pp. 1353-1356.

[61]

Dupont, S., Ris, C., Deroo, O., Poitoux, S., 2005. Feature extraction and acoustic modeling: an approach for improved generalization across languages and accents. In: Proceedings of ASRU, San Juan, Puerto-Rico, pp. 29-34.

[62]

Eide, E., 2001. Distinctive features for use in automatic speech recognition. In: Proceedings of Eurospeech, Aalborg, Denmark, pp. 1613-1616.

[63]

Eide, E., Gish, H., 1996. A parametric approach to vocal tract length normalization. In: Proceedings of ICASSP, Atlanta, GA, pp. 346-348.

Digital Library

[64]

Eide, E., Gish, H., 1996. A parametric approach to vocal tract length normalization. In: Proceedings of ICASSP, Atlanta, GA, pp. 346-349.

Digital Library

[65]

Eide, E., Gish, H., Jeanrenaud, P., Mielke, A., 1995. Understanding and improving speech recognition performance through the use of diagnostic tools. In: Proceedings of ICASSP, Detroit, Michigan, pp. 221-224.

[66]

Xenophones: an investigation of phone set expansion in swedish and implications for speech recognition and speech synthesis. Speech Communication. v35 i1-2. 81-102.

[67]

Elenius, D., Blomberg, M., 2004. Comparing speech recognition for adults and children. In: Proceedings of FONETIK, Stockholm, Sweden, pp. 156-159.

[68]

Ellis, D., Singh, R., Sivadas, S., 2001. Tandem acoustic modeling in large-vocabulary recognition. In: Proceedings of ICASSP, Salt Lake City, USA, pp. 517-520.

[69]

ESCA Workshop on Modeling Pronunciation Variation for Automatic Speech Recognition, 1998.

[70]

Eskenazi, M., 1996. Detection of foreign speakers' pronunciation errors for second language training-preliminary results. In: Proceedings of ICSLP, Philadelphia, PA, pp. 1465-1468.

[71]

Kids: a database of children's speech. The Journal of the Acoustical Society of America. 2759

[72]

Eskenazi, M., Pelton, G., 2002. Pinpointing pronunciation errors in children speech: examining the role of the speech recognizer. In: Proceedings of the PMLA Workshop, Colorado, USA.

[73]

Falthauser, R., Pfau, T., Ruske, G., 2000. On-line speaking rate estimation using gaussian mixture models. In: Proceedings of ICASSP, Istanbul, Turkey, pp. 1355-1358.

[74]

Acoustic Theory of Speech Production. Mouton, The Hague.

[75]

Fiscus, J.G., 1997. A post-processing system to yield reduced word error rates: Recognizer Output Voting Error Reduction (ROVER). In: Proceedings of ASRU, pp. 347-354.

[76]

Fitt, S., 1995. The pronunciation of unfamiliar native and non-native town names. In: Proceedings of Eurospeech, Madrid, Spain, pp. 2227-2230.

[77]

Speech Analysis and Synthesis and Perception. Springer-Verlag, Berlin-Heidelberg-New York.

[78]

Interaction between the native and second language phonetic subsystems. Speech Communication. v40. 467-491.

[79]

A framework for predicting speech recognition errors. Speech Communication. v46 i2. 153-170.

[80]

Effects of speaking rate and word predictability on conversational pronunciations. Speech Communication. v29 i2-4. 137-158.

[81]

Combination of machine scores for automatic grading of pronunciation quality. Speech Communication. v30 i2-3. 121-130.

[82]

Fujinaga, K., Nakai, M., Shimodaira, H., Sagayama, S., 2001. Multiple-regression Hidden Markov Model. In: Proceedings of ICASSP, vol. 1. Salt Lake City, USA, pp. 513-516.

[83]

Introduction to Statistical Pattern Recognition. Academic Press, New York.

[84]

Fung, P., Liu, W.K., 1999. Fast accent identification and accented speech recognition. In: Proceedings of ICASSP, Phoenix, Arizona, USA, pp. 221-224.

[85]

Introduction to the special issue on spontaneous speech processing. IEEE Transactions on Speech and Audio Processing. v12 i4. 349-350.

[86]

Gales, M.J.F., 1998. Cluster adaptive training for speech recognition. In: Proceedings of ICSLP, Sydney, Australia, pp. 1783-1786.

[87]

Semi-tied covariance matrices for hidden markov models. IEEE Transactions on Speech and Audio Processing. v7. 272-281.

[88]

Gales, M.J.F., 2001. Acoustic factorization. In: Proceedings of ASRU, Madona di Campiglio, Italy.

[89]

Gales, M.J.F., 2001. Multiple-cluster adaptive training schemes. In: Proceedings of ICASSP, Salt Lake City, Utah, USA, pp. 361-364.

[90]

Gao, Y., Ramabhadran, B., Chen, J., Erdogan, H., Picheny, M., 2001. Innovative approaches for large vocabulary name recognition. In: Proceedings of ICASSP, Salt Lake City, Utah, pp. 333-336.

[91]

Garner, P., Holmes, W., 1998. On the robust incorporation of formant features into hidden markov models for automatic speech recognition. In: Proceedings of ICASSP, pp. 1-4.

[92]

Speaker identification and message identification in speech recognition. Phonetica. v9. 193-199.

[93]

Multiple resolution analysis for robust automatic speech recognition. Computer, Speech and Language. v20. 2-21.

[94]

Girardi, A., Shikano, K., Nakamura, S., 1998. Creating speaker independent HMM models for restricted database using straight-tempo morphing. In: Proceedings of ICSLP, Sydney, Australia, pp. 687-690.

[95]

Giuliani, D., Gerosa, M., 2003. Investigating recognition of children speech. In: Proceedings of ICASSP, Hong Kong, pp. 137-140.

[96]

Segmental minimum Bayes-risk decoding for automatic speech recognition. Transactions of IEEE Speech and Audio Processing. v12 i3. 234-249.

[97]

Gopinath, R.A., 1998. Maximum likelihood modeling with gaussian distributions for classification. In: Proceedings of ICASSP, Seattle, WA, pp. 661-664.

[98]

Generating non-native pronunciation variants for lexicon adaptation. Speech Communication. v42 i1. 109-123.

[99]

Graciarena, M., France, H., Zheng, J., Vergyri, D., Stolcke, A., 2004. Voicing feature integration in SRI's DECIPHER LVCSR system. In: Proceedings of ICASSP, Montreal, Canada, pp. 921-924.

[100]

Greenberg, S., Chang, S., 2000. Linguistic dissection of switchboard-corpus automatic speech recognition systems. In: Proceedings of ISCA Workshop on Automatic Speech Recognition: Challenges for the New Millenium, Paris, France.

[101]

Greenberg, S., Fosler-Lussier, E., 2000. The uninvited guest: information's role in guiding the production of spontaneous speech. In: in Proceedings of the Crest Workshop on Models of Speech Production: Motor Planning and Articulatory Modelling. Kloster Seeon, Germany.

[102]

Gupta, S.K., Soong, F., Haimi-Cohen, R., 1996. High-accuracy connected digit recognition for mobile applications. In: Proceedings of ICASSP, vol. 1, pp. 57-60.

[103]

Haeb-Umbach, R., Ney, H., 1992. Linear discriminant analysis for improved large vocabulary continuous speech recognition. In: Proceedings of ICASSP, San Francisco, CA, pp. 13-16.

[104]

Children speech recognition with application to interactive books and tutors. In: Proceedings of ASRU, St. Thomas, US Virgin Islands. pp. 186-191.

[105]

Implicit modelling of pronunciation variation in automatic speech recognition. Speech Communication. v46 i2. 171-188.

[106]

Hain, T., Woodland, P.C., 1999. Dynamic HMM selection for continuous speech recognition. In: Proceedings of Eurospeech, Budapest, Hungary, pp. 1327-1330.

[107]

Hansen, J.H.L., 1989. Evaluation of acoustic correlates of speech under stress for robust speech recognition. In: IEEE Proceedings 15th Northeast Bioengineering Conference, Boston, MA. Boston, Mass, pp. 31-32.

[108]

Hansen, J.H.L., 1993. Adaptive source generator compensation and enhancement for speech recognition in noisy stressful environments. In: Proceedings of ICASSP, Minneapolis, Minnesota, pp. 95-98.

[109]

A source generator framework for analysis of acoustic correlates of speech under stress. part i: pitch, duration, and intensity effects. The Journal of the Acoustical Society of America.

[110]

Analysis and compensation of speech under stress and noise for environmental robustness in speech recognition. Speech Communications, Special Issue on Speech Under Stress. v20 i2. 151-170.

[111]

Hanson, B.A., Applebaum, T., 1990. Robust speaker-independent word recognition using instantaneous, dynamic and acceleration features: experiments with Lombard and noisy speech. In: Proceedings of ICASSP, Albuquerque, New Mexico, pp. 857-860.

[112]

Noise robust speech parameterization using multiresolution feature extraction. IEEE Transactions on Speech and Audio Processing. v9 i8. 856-865.

[113]

Adaptive Filter Theory. Prentice-Hall Publishers, NJ, USA.

[114]

Communication Systems. third ed. John Wiley and Sons, New York, USA.

[115]

Pronunciation modeling using a finite-state transducer representation. Speech Communication. v46 i2. 189-203.

[116]

Fast model selection based speaker adaptation for nonnative speech. IEEE Transactions on Speech and Audio Processing. v11 i4. 298-307.

[117]

Hegde, R.M., Murthy, H.A., Gadde, V.R.R., 2004. Continuous speech recognition using joint features derived from the modified group delay function and MFCC. In: Proceedings of ICSLP, Jeju, Korea, pp. 905-908.

[118]

Hegde, R.M., Murthy, H.A., Rao, G.V.R., 2005. Speech processing using joint features derived from the modified group delay function. In: Proceedings of ICASSP, vol. I. Philadelphia, PA, pp. 541-544.

[119]

Perceptual linear predictive (PLP) analysis of speech. The Journal of the Acoustical Society of America. v87 i4. 1738-1752.

[120]

RASTA processing of speech. IEEE Transactions on Speech and Audio Processing. v2 i4. 578-589.

[121]

Hermansky, H., Sharma, S., 1998. TRAPS: classifiers of temporal patterns. In: Proceedings of ICSLP, Sydney, Australia, pp. 1003-1006.

[122]

Hetherington, L., 1995. New words: Effect on recognition performance and incorporation issues. In: Proceedings of Eurospeech, Madrid, Spain, pp. 1645-1648.

[123]

Prosodic and other cues to speech recognition failures. Speech Communication. v43 i1-2. 155-175.

[124]

Holmes, J.N., Holmes, W.J., Garner, P.N., 1997. Using formant frequencies in speech recognition. In: Proceedings of Eurospeech, Rhodes, Greece, pp. 2083-2086.

[125]

Combining frame and segment based models for large vocabulary continuous speech recognition. In: Proceedings of ASRU, Keystone, Colorado.

[126]

Huang, C., Chen, T., Li, S., Chang, E., Zhou, J., 2001. Analysis of speaker variability. In: Proceedings of Eurospeech, Aalborg, Denmark, pp. 1377-1380.

[127]

Huang, X., Lee, K., 1991. On speaker-independent, speaker-dependent, and speaker-adaptive speech recognition. In: Proceedings of ICASSP, Toronto, Canada, pp. 877-880.

[128]

Huank, H.C.-H., Seide, F., 2000. Pitch tracking and tone features for Mandarin speech recognition. In: Proceedings of ICASSP, vol. 3. pp. 1523-1526.

[129]

Humphries, J.J., Woodland, P.C., Pearce, D., 1996. Using accent-specific pronunciation modelling for robust speech recognition. In: Proceedings of ICSLP, Rhodes, Greece, pp. 2367-2370.

[130]

Spectral signal processing for ASR. In: Proceedings of ASRU, Keystone, Colorado.

[131]

Hunt, M.J., 2004. Speech recognition, syllabification and statistical phonetics. In: Proceedings of ICSLP, Jeju Island, Korea.

[132]

Hunt, M.J., Lefebvre, C., 1989. A comparison of several acoustic representations for speech recognition with degraded and undegraded speech. In: Proceedings of ICASSP, Glasgow, UK, pp. 262-265.

[133]

Iivonen, A., Harinen, K., Keinanen, L., Kirjavainen, J., Meister, E., Tuuri, L., 2003. Development of a multiparametric speaker profile for speaker recognition. In: Proceedings of ICPhS, Barcelona, Spain, pp. 695-698.

[134]

ISCA Tutorial and Research Workshop, 2002. Pronunciation Modeling and Lexicon Adaptation for Spoken Language Technology (PMLA-2002).

[135]

Word perception in fast speech: artificially time-compressed vs. naturally produced fast speech. Speech Communication. v42 i2. 155-173.

[136]

Acoustic feature selection using speech recognizers. In: Proceedings of ASRU, Keystone, Colorado.

[137]

On the use of bandpass liftering in speech recognition. IEEE Transactions on Acoustics, Speech and Signal Processing. v35. 947-953.

[138]

Jurafsky, D., Ward, W., Jianping, Z., Herold, K., Xiuyang, Y., Sen, Z., 2001. What kind of pronunciation variation is hard for triphones to model? In: Proceedings of ICASSP, Salt Lake City, Utah, pp. 577-580.

[139]

Kajarekar, S., Malayath, N., Hermansky, H., 1999. Analysis of sources of variability in speech. In: Proceedings of Eurospeech, Budapest, Hungary, pp. 343-346.

[140]

Analysis of speaker and channel variability in speech. In: Proceedings of ASRU, Keystone, Colorado.

[141]

Eigenvoice modeling with sparse training data. IEEE Transactions on Speech and Audio Processing. v13 i3. 345-354.

[142]

Köhler, J., 1996. Multilingual phonemes recognition exploiting acoustic-phonetic similarities of sounds. In: Proceedings of ICSLP, Philadelphia, PA, pp. 2195-2198.

[143]

Konig, Y., Morgan, N., 1992. GDNN: a gender-dependent neural network for continuous speech recognition. In: Proceedings of Int. Joint Conf. on Neural Networks, vol. 2. Baltimore, Maryland, pp. 332-337.

[144]

Rapid online adaptation using speaker space model evolution. Speech Communication. v42 i3-4. 467-478.

[145]

Robust speech recognition using the modulation spectrogram. Speech Communication. v25 i1-3. 117-132.

[146]

Kingsbury, B., Saon, G., Mangua, L., Padmanabhan, M., Sarikaya, R., 2002. Robust speech recognition in noisy environments: the 2001 IBM SPINE evaluation system. In: Proceedings of ICASSP, vol. I. Orlando, FL, pp. 53-56.

[147]

Kirchhoff, K., 1998. Combining articulatory and acoustic information for speech recognition in noise and reverberant environments. In: Proceedings of ICSLP, Sydney, Australia, pp. 891-894.

[148]

Kitaoka, N., Yamada, D., Nakagawa, S., 2002. Speaker independent speech recognition using features based on glottal sound source. In: Proceedings of ICSLP, Denver, USA, pp. 2125-2128.

[149]

Kleinschmidt, M., Gelbart, D., 2002. Improving word accuracy with gabor feature extraction. In: Proceedings of ICSLP, Denver, Colorado, pp. 25-28.

[150]

Korkmazskiy, F., Juang, B.-H., Soong, F., 1997. Generalized mixture of HMMs for continuous speech recognition. In: Proceedings of ICASSP, vol. 2. pp. 1443-1446.

[151]

Korkmazsky, F., Deviren, M., Fohr, D., Illina, I., 2004. Hidden factor dynamic bayesian networks for speech recognition. In: Proceedings of ICSLP, Jeju Island, Korea.

[152]

Kubala, F., Anastasakos, A., Makhoul, J., Nguyen, L., Schwartz, R., Zavaliagkos, E., 1994. Comparative experiments on large vocabulary speech recognition. In: Proceedings of ICASSP, Adelaide, Australia, pp. 561-564.

[153]

Heteroscedastic discriminant analysis and reduced rank HMMs for improved speech recognition. Speech Communication. v26 i4. 283-297.

[154]

An inverse signal approach to computing the envelope of a real valued signal. IEEE Signal Processing Letters. v5 i10. 256-259.

[155]

Model-based approach to envelope and positive instantaneous frequency estimation of signals with speech applications. The Journal of the Acoustical Society of America. v105 i3. 1912-1924.

[156]

Kumpf, K., King, R.W., 1996. Automatic accent classification of foreign accented australian english speech. In: Proceedings of ICSLP, Philadelphia, PA, pp. 1740-1743.

[157]

Kuwabara, H., 1997. Acoustic and perceptual properties of phonemes in continuous speech as a function of speaking rate. In: Proceedings of Eurospeech, Rhodes, Greece, pp. 1003-1006.

[158]

Information conveyed by vowels. The Journal of the Acoustical Society of America. v29. 98-104.

[159]

Lamel, L., Gauvain, J.-L., 2005. Alternate phone models for conversational speech. In: Proceedings of ICASSP, Philadelphia, Pennsylvania, pp. 1005-1008.

[160]

Principles of Phonetics. Cambridge University Press, Cambridge.

[161]

Lawson, A.D., Harris, D.M., Grieco, J.J., 2003. Effect of foreign accent on speech recognition in the NATO N-4 corpus. In: Proceedings of Eurospeech, Geneva, Switzerland, pp. 1505-1508.

[162]

A study on speaker adaptation of the parameters of continuous density Hidden Markov Models. IEEE Transactions Signal Processing. v39 i4. 806-813.

[163]

Lee, C.-H., Gauvain, J.-L., 1993. Speaker adaptation based on MAP estimation of HMM parameters. In: Proceedings of ICASSP, vol. 2. pp. 558-561.

[164]

Lee, L., Rose, R.C., 1996. Speaker normalization using efficient frequency warping procedures. In: Proceedings of ICASSP, vol. 1. Atlanta, Georgia, pp. 353-356.

[165]

Acoustics of children speech: developmental changes of temporal and spectral parameters. The Journal of the Acoustical Society of America. v105. 1455-1468.

[166]

Maximum likelihood linear regression for speaker adaptation of continuous density Hidden Markov Models. Computer, Speech and Language. v9 i2. 171-185.

[167]

Leonard, R.G., 1984. A database for speaker independent digit recognition. In: Proceedings of ICASSP, San Diego, US, pp. 328-331.

[168]

Lin, X., Simske, S., 2004. Phoneme-less hierarchical accent classification. In: Proceedings of Thirty-Eighth Asilomar Conference on Signals, Systems and Computers, vol. 2. Pacific Grove, CA, pp. 1801-1804.

[169]

Lincoln, M., Cox, S.J., Ringland, S., 1997. A fast method of speaker normalisation using formant estimation. In: Proceedings of Eurospeech, Rhodes, Greece, pp. 2095-2098.

[170]

Explaining phonetic variation: a sketch of the H& H theory. In: Hardcastle, W.J., Marchal, A. (Eds.), Speech Production and Speech Modelling, Kluwer Academic Publishers.

[171]

Speech recognition by machines and humans. Speech Communication. v22 i1. 1-15.

[172]

Lippmann, R.P., Martin, E.A., Paul, D.B., 1987. Multi-style training for robust isolated-word speech recognition. In: Proceedings of ICASSP, Dallas, TX, pp. 705-708.

[173]

Effects of phase on the perception of intervocalic stop consonants. Speech Communication. v22 i4. 403-417.

[174]

Liu, S., Doyle, S., Morris, A., Ehsani, F., 1998. The effect of fundamental frequency on mandarin speech recognition. In: Proceedings of ICSLP, vol. 6. Sydney, Australia, pp. 2647-2650.

[175]

Liu, W.K., Fung, P., 2000. MLLR-based accent model adaptation without accented data. In: Proceedings of ICSLP, vol. 3. Beijing, China, pp. 738-741.

[176]

Livescu, K., Glass, J., 2000. Lexical modeling of non-native speech for automatic speech recognition. In: Proceedings of ICASSP, vol. 3. Istanbul, Turkey, pp. 1683-1686.

[177]

Ljolje, A., 2002. Speech recognition using fundamental frequency and voicing in acoustic modeling. In: Proceedings of ICSLP, Denver, USA, pp. 2137-2140.

[178]

Llitjos, A.F., Black, A.W., 2001. Knowledge of language origin improves pronunciation accuracy of proper names. In: Proceedings of Eurospeech, Aalborg, Denmark.

[179]

Le signe de l'élévation de la voix. Ann. Maladies Oreille, Larynx, Nez, Pharynx. v37.

[180]

Linear dimensionality reduction via a heteroscedastic extension of LDA: The Chernoff criterion. IEEE Transactions Pattern Analysis and Machine Intelligence. v26 i6. 732-739.

[181]

Magimai-Doss, M., Stephenson, T.A., Ikbal, S., Bourlard, H., 2004. Modelling auxiliary features in tandem systems. In: Proceedings of ICSLP, Jeju Island, Korea.

[182]

Maison, B., 2003. Pronunciation modeling for names of foreign origin. In: Proceedings of ASRU, US Virgin Islands, pp. 429-434.

[183]

Mak, B., Hsiao, R., 2004. Improving eigenspace-based MLLR adaptation by kernel PCA. In: Proceedings of ICSLP, Jeju Island, Korea.

[184]

Makhoul, J., 1975. Linear prediction: a tutorial review. In: Proceedings of IEEE, vol. 63(4) pp. 561-580.

[185]

Finding consensus in speech recognition: Word-error minimization and other applications of confusion networks. Computer Speech and Language. v14 i4. 373-400.

[186]

A family of distortion measures based upon projection operation for robust speech recognition. IEEE Transactions on Acoustics, Speech and Signal Processing. v37. 1659-1671.

[187]

Long-term feature averaging for speaker recognition. IEEE Transactions on Acoustics, Speech and Signal Processing. v25. 330-337.

[188]

Markov, K., Nakamura, S., 2003. Hybrid HMM/BN LVCSR system integrating multiple acoustic features. In: Proceedings of ICASSP, vol. 1. pp. 840-843.

[189]

Martin, A., Mauuary, L., 2003. Voicing parameter and energy-based speech/non-speech detection for speech recognition in adverse conditions. In: Proceedings of Eurospeech, Geneva, Switzerland, pp. 3069-3072.

[190]

Martinez, F., Tapias, D., Alvarez, J., 1998. Towards speech rate independence in large vocabulary continuous speech recognition. In: Proceedings of ICASSP, Seattle, Washington, pp. 725-728.

[191]

Martinez, F., Tapias, D., Alvarez, J., Leon, P., 1997. Characteristics of slow, average and fast speech and their effects in large vocabulary continuous speech recognition. In: Proceedings of Eurospeech, Rhodes, Greece, pp. 469-472.

[192]

Matsuda, S., Jitsuhiro, T., Markov, K., Nakamura, S., 2004. Speech recognition system robust to noise and speaking styles. In: Proceedings of ICSLP, Jeju Island, Korea.

[193]

Mertins, A., Rademacher, J., 2005. Vocal tract length invariant features for automatic speech recognition. In: Proceedings of ASRU, Cancun, Mexico, pp. 308-312.

[194]

Messina, R., Jouvet, D., 2004. Context dependent long units for speech recognition. In: Proceedings of ICSLP, Jeju Island, Korea.

[195]

Milner,. B.P., 1996. Inclusion of temporal information into features for speech recognition. In: Proceedings of ICSLP, Philadelphia, PA, pp. 256-259.

[196]

Mirghafori, N., Fosler, E., Morgan, N., 1995. Fast speakers in large vocabulary continuous speech recognition: analysis & antidotes. In: Proceedings of Eurospeech, Madrid, Spain, pp. 491-494.

[197]

Mirghafori, N., Fosler, E., Morgan, N., 1996. Towards robustness to fast speech in ASR. In: Proceedings of ICASSP, Atlanta, Georgia, pp. 335-338.

[198]

Towards improving ASR robustness for PSN and GSM telephone applications. Speech Communication. v23 i1-2. 141-159.

[199]

Mokhtari, P., 1998. An acoustic-phonetic and articulatory study of speech-speaker dichotomy. PhD thesis, The University of New South Wales, Canberra, Australia.

[200]

Morgan, N., Chen, B., Zhu, Q., Stolcke, A., 2004. TRAPping conversational speech: extending TRAP/tandem approaches to conversational telephone speech recognition. In: Proceedings of ICASSP, vol. 1. Montreal, Canada, pp. 536-539.

[201]

Morgan, N., Fosler, E., Mirghafori, N., 1997. Speech recognition using on-line estimation of speaking rate. In: Proceedings of Eurospeech, vol. 4. Rhodes, Greece, pp. 2079-2082.

[202]

Morgan, N., Fosler-Lussier, E., 1998. Combining multiple estimators of speaking rate. In: Proceedings of ICASSP, Seattle, pp. 729-732.

[203]

Toward the simulation of emotion in synthetic speech: a review of the literature on human vocal emotion. The Journal of the Acoustical Society of America. v93 i2. 1097-1108.

[204]

Masaki Naito, Y.S., LiDeng, 1998. Speaker clustering for speech recognition using the parameters characterizing vocal-tract dimensions. In: Proceedings of ICASSP, Seattle, WA, pp. 1889-1893.

[205]

Nanjo, H., Kawahara, T., 2002. Speaking-rate dependent decoding and adaptation for spontaneous lecture speech recognition. In: Proceedings of ICASSP, vol. 1. Orlando, FL, pp. 725-728.

[206]

Language model and speaking rate adaptation for spontaneous presentation speech recognition. IEEE Transactions on Speech and Audio Processing. v12 i4. 391-400.

[207]

National Institute of Standards and Technology, 2001. SCLITE scoring software. ftp://jaguar.ncls.nist.gov/pub/sctk-1.2.tar.Z.

[208]

Nearey, T.M., 1978. Phonetic feature systems for vowels. Indiana University Linguistics Club, Bloomington, Indiana, USA.

[209]

Neti, C., Roukos, S., 1997. Phone-context specific gender-dependent acoustic-models for continuous speech recognition. In: Proceedings of ASRU, Santa Barbara, CA, pp. 192-198.

[210]

Automatic scoring of pronunciation quality. Speech Communication. v30 i2-3. 83-93.

[211]

Leonardo Neumeyer, Horacio Franco, Mitchel Weintraub, and Patti Price. 1996. Automatic text-independent pronunciation scoring of foreign language student speech. In: Proceedings of ICSLP, Philadelphia, PA, pp. 1457-1460.

[212]

Eigenvoices: a compact representation of speakers in a model space. Annales des Télécommunications. v55 i3-4.

[213]

Nguyen, P., Rigazio, L., Junqua, J.-C., 2003. Large corpus experiments for broadcast news recognition. In: Proceedings of Eurospeech, Geneva, Switzerland, pp. 1837-1840.

[214]

The Phonetic Bases of Speaker Recognition. Cambridge University Press, Cambridge.

[215]

Kamal Omar, M., Chen, K., Hasegawa-Johnson, M., Bradman, Y., 2002. An evaluation of using mutual information for selection of acoustic features representation of phonemes for speech recognition. In: Proceedings of ICSLP, Denver, CO, pp. 2129-2132.

[216]

Kamal Omar, M., Hasegawa-Johnson, M., 2002. Maximum mutual information based acoustic features representation of phonological features for speech recognition. In: Proceedings of ICASSP, vol. 1. Montreal, Canada, pp. 81-84.

[217]

Odell, J.J., Woodlandand, P.C., Valtchev, V., Young, S.J., 1994. Large vocabulary continuous speech recognition using HTK. In: Proceedings of ICASSP, vol. 2. Adelaide, Australia, pp. 125-128.

[218]

Ono, Y., Wakita, H., Zhao, Y., 1993. Speaker normalization using constrained spectra shifts in auditory filter domain. In: Proceedings of Eurospeech, Berlin, Germany, pp. 355-358.

[219]

Speech Communication - Human and Machine. Addison-Wesley.

[220]

O'Shaughnessy, D., Tolba, H., 1999. Towards a robust/fast continuous speech recognition system using a voiced-unvoiced decision. In: Proceedings of ICASSP, vol. 1. Phoenix, Arizona, pp. 413-416.

[221]

Padmanabhan, M., Bahl, L., Nahamoo, D., Picheny, M., 1996. Speaker clustering and transformation for speaker adaptation in large-vocabulary speech recognition systems. In: Proceedings of ICASSP, Atlanta, GA, pp. 701-704.

[222]

Maximum-likelihood nonlinear transformation for acoustic adaptation. IEEE Transactions on Speech and Audio Processing. v12 i6. 572-578.

[223]

Maximizing information content in feature extraction. IEEE Transactions on Speech and Audio Processing. v13 i4. 512-519.

[224]

Paliwal, K.K., Alsteris, L., 2003. Usefulness of phase spectrum in human speech perception. In: Proceedings of Eurospeech, Geneva, Switzerland, pp. 2117-2120.

[225]

Paliwal, K.K., Atal, B.S., 2003. Frequency-related representation of speech. In: Proceedings of Eurospeech, Geneva, Switzerland, pp. 65-68.

[226]

Paul, D.B., 1987. A speaker-stress resistant HMM isolated word recognizer. In: Proceedings of ICASSP, Dallas, Texas, pp. 713-716.

[227]

Paul, D.B., 1997. Extensions to phone-state decision-tree clustering: single tree and tagged clustering. In: Proceedings of ICASSP, vol. 2. Munich, Germany, pp. 1487-1490.

[228]

Peters, S.D., Stubley, P., Valin, J.-M., 1999. On the limits of speech recognition in noise. In: Proceedings of ICASSP'99. Phoenix, Arizona, pp. 365-368.

[229]

Control methods used in a study of the vowels. The Journal of the Acoustical Society of America. v24. 175-184.

[230]

Pfau, T., Ruske, G., 1998. Creating Hidden Markov Models for fast speech. In: Proceedings of ICSLP, Sydney, Australia.

[231]

Michael Pitz, 2005. Investigations on Linear Transformations for Speaker Adaptation and Normalization. PhD thesis, RWTH Aachen University.

[232]

Modeling of the glottal flow derivative waveform with application to speaker identification. IEEE Transactions on Speech and Audio Processing. v7 i5. 569-586.

[233]

Perceptual and physical space of vowel sounds. The Journal of the Acoustical Society of America. v46. 458-467.

[234]

Robust recognition of children speech. IEEE Transactions on Speech and Audio Processing. v11. 603-616.

[235]

Potamianos, G., Narayanan, S., Lee, S., 1997. Analysis of children speech: duration, pitch and formants. In: Proceedings of Eurospeech, Rhodes, Greece, pp. 473-476.

[236]

Potamianos, G., Narayanan, S., Lee, S., 1997. Automatic speech recognition for children. In: Proceedings of Eurospeech, Rhodes, Greece, pp. 2371-2374.

[237]

Toward the specification of speech. The Journal of the Acoustical Society of America. v22. 807-820.

[238]

Theory and practice of acoustic confusability. Computer Speech and Language. v16 i1. 131-164.

[239]

Comparison and combination of features in a hybrid HMM/MLP and a HMM/GMM speech recognition system. IEEE Transactions on Speech and Audio Processing. vSAP-13 i1. 14-22.

[240]

Fundamentals of speech recognition. Prentice Hall PTR, Englewoood Cliffs, NJ, USA.

[241]

Rabiner, L.R., Lee, C.H., Juang, B.H., Wilpon, J.G., 1989. HMM clustering for connected word recognition. In: Proceedings of ICASSP, vol. 1. Glasgow, Scotland, pp. 405-408.

[242]

Raux, A., 2004. Automated lexical adaptation and speaker clustering based on pronunciation habits for non-native speech recognition. In: Proceedings of ICSLP, Jeju Island, Korea.

[243]

Frequency spectrum deviation between speakers. Speech Communication. v2. 149-152.

[244]

Sakauchi, S., Yamaguchi, Y., Takahashi, S., Kobashikawa, S., 2004. Robust speech recognition based on HMM composition and modified Wiener filter. In: Proceedings of Interspeech. Jeju Island, Korea, pp. 2053-2056.

[245]

Saon, G., Padmanabhan, M., Gopinath, R., Chen, S., 2000. Maximum likelihood discriminant feature spaces. In: Proceedings of ICASSP, pp. 1129-1132.

[246]

Schaaf, T., Kemp, T., 1997. Confidence measures for spontaneous speech recognition. In: Proceedings of ICASSP, Munich, Germany, pp. 875-878.

[247]

Vocal communication of emotion: A review of research paradigms. Speech Communication Special Issue on Speech and Emotions. v40 i1-2. 227-256.

[248]

Schimmel, S., Atlas, L., 2005. Coherent envelope detection for modulation filtering of speech. In: Proceedings of ICASSP, vol. 1. Philadephia, USA, pp. 221-224.

[249]

Flat-spectrum speech. The Journal of the Acoustical Society of America. v79 i5. 1580-1583.

[250]

Schötz, S., 2001. A perceptual study of speaker age. In: Working paper 49, Lund University, Dept of Linguistic, pp. 136-139.

[251]

Schultz, T., Waibel, A., 1998. Language independent and language adaptive large vocabulary speech recognition. In: Proceedings of ICSLP, vol. 5. Sydney, Australia, pp. 1819-1822.

[252]

Schwartz, R., Barry, C., Chow, Y.-L., Deft, A., Feng, M.-W., Kimball, O., Kubala, F., Makhoul, J., Vandegrift, J., 1989. The BBN BYBLOS continuous speech recognition system. In: Proceedings of Speech and Natural Language Workshop. Philadelphia, Pennsylvania, pp. 21-23.

[253]

Selouani, S.-A., Tolba, H., O'Shaughnessy, D., 2002. Distinctive features, formants and cepstral coefficients to improve automatic speech recognition. In: Conference on Signal Processing, Pattern Recognition and Applications, IASTED. Crete, Greece, pp. 530-535.

[254]

Statistical modeling of phonological rules through linguistic hierarchies. Speech Communication. v46 i2. 204-216.

[255]

Shi, Y.Y., Liu, J., Liu, R.S., 2002. Discriminative HMM stream model for Mandarin digit string speech recognition. In: Proceedings of Int. Conf. on Signal Processing, vol. 1. Beijing, China, pp. 528-531.

[256]

Shinozaki, T., Furui, S., 2003. Hidden mode HMM using bayesian network for modeling speaking rate fluctuation. In: Proceedings of ASRU. US Virgin Islands, pp. 417-422.

[257]

Shinozaki, T., Furui, S., 2004. Spontaneous speech recognition using a massively parallel decoder. In: Proceedings of ICSLP, Jeju Island, Korea, pp. 1705-1708.

[258]

Shobaki, K., Hosom, J.-P., Cole, R., 2000. The OGI kids speech corpus and recognizers. In: Proceedings of ICSLP, Beijing, China, pp. 564-567.

[259]

Siegler, M.A., 1995. Measuring and compensating for the effects of speech rate in large vocabulary continuous speech recognition. PhD thesis, Carnegie Mellon University.

[260]

Siegler, M.A., Stern, R.M., 1995. On the effect of speech rate in large vocabulary speech recognition system. In: Proceedings of ICASSP, Detroit, Michigan, pp. 612-615.

[261]

Singer, H., Sagayama, S., 1992. Pitch dependent phone modelling for HMM based speech recognition. In: Proceedings of ICASSP, vol. 1. San Francisco, CA, pp. 273-276.

[262]

Slifka, J., Anderson, T.R., 1995. Speaker modification with LPC pole analysis. In: Proceedings of ICASSP, Detroit, MI, pp. 644-647.

[263]

Song, M.G., Jung, H.I., Shim, K.-J., Kim, H.S., 1998. Speech recognition in car noise environments using multiple models according to noise masking levels. In: Proceedings of ICSLP.

[264]

Sotillo, C., Bard, E.G., 1998. Is hypo-articulation lexically constrained?. In: Proceedings of SPoSS. Aix-en-Provence, pp. 109-112.

[265]

Human and machine consonant recognition. Speech Communication. v45 i4. 401-423.

[266]

Steeneken, H.J.M., van Velden, J.G., 1989. Objective and diagnostic assessment of (isolated) word recognizers. In: Proceedings of ICASSP, vol. 1. Glasgow, UK, pp. 540-543.

[267]

Stephenson, T.A., Bourlard, H., Bengio, S., Morris, A.C., 2000. Automatic speech recognition using dynamic Bayesian networks with both acoustic and articulatory variables. In: Proceedings of ICSLP, vol. 2. Beijing, China, pp. 951-954.

[268]

Speech recognition with auxiliary information. IEEE Transactions on Speech and Audio Processing. vSAP-12 i3. 189-203.

[269]

Stolcke, A., Grezl, F., Hwang, M.-Y., Morgan, N., Vergyri, D., 2006. Cross-domain and cross-language portability of acoustic features estimated by multilayer perceptrons. In: Proceedings of ICASSP, vol. 1. Toulouse, France, pp. 321-324.

[270]

Modeling pronunciation variation for ASR: a survey of the literature. Speech Communication. v29 i2-4. 225-246.

[271]

Sun, D.X., Deng, L., 1995. Analysis of acoustic-phonetic variations in fluent speech using Timit. In: Proceedings of ICASSP, Detroit, Michigan, pp. 201-204.

[272]

Suzuki, H., Zen, H., Nankaku, Y., Miyajima, C., Tokuda, K., Kitamura, T., 2003. Speech recognition using voice-characteristic-dependent acoustic models. In: Proceedings of ICASSP, vol. 1. Hong-Kong (canceled), pp. 740-743.

[273]

Svendsen, T., Paliwal, K.K., Harborg, E., Husoy, P.O., 1989. An improved sub-word based speech recognizer. In: Proceedings of ICASSP, Glasgow, UK, pp. 108-111.

[274]

Svendsen, T., Soong, F., 1987. On the automatic segmentation of speech signals. In: Proceedings of ICASSP, Dallas, Texas, pp. 77-80.

[275]

Teixeira, C., Trancoso, I., Serralheiro, A., 1996. Accent identification. In: Proceedings of ICSLP, vol. 3. Philadelphia, PA, pp. 1784-1787.

[276]

Thomson, D.L., Chengalvarayan, R., 1998. Use of periodicity and jitter as speech recognition feature. In: Proceedings of ICASSP, vol. 1. Seattle, WA, pp. 21-24.

[277]

Use of voicing features in HMM-based speech recognition. Speech Communication. v37 i3-4. 197-211.

[278]

Tibrewala, S., Hermansky, H., 1997. Sub-band based recognition of noisy speech. In: Proceedings of ICASSP, Munich Germany, pp. 1255-1258.

[279]

Tolba, H., Selouani, S.A., O'Shaughnessy, D., 2002. Auditory-based acoustic distinctive features and spectral cues for automatic speech recognition using a multi-stream paradigm. In: Proceedings of ICASSP, Orlando, FL, pp. 837-840.

[280]

Tolba, H., Selouani, S.A., O'Shaughnessy, D., 2003. Comparative experiments to evaluate the use of auditory-based acoustic distinctive features and formant cues for robust automatic speech recognition in low-snr car environments. In: Proceedings of Eurospeech, Geneva, Switzerland, pp. 3085-3088.

[281]

Tomlinson, M.J., Russell, M.J., Moore, R.K., Buckland, A.P., Fawley, M.A., 1997. Modelling asynchrony in speech using elementary single-signal decomposition. In: Proceedings of ICASSP, Munich Germany, pp. 1247-1250.

[282]

Townshend, B., Bernstein, J., Todic, O., Warren, E., 1998. Automatic text-independent pronunciation scoring of foreign language student speech. In: Proceedings of STiLL-1998, Stockholm, pp. 179-182.

[283]

Traunmüller, H., 1997. Perception of speaker sex, age and vocal effort. Technical Report, Institutionen för lingvistik, Stockholm Universitet.

[284]

Discriminative linear transforms for feature normalization and speaker adaptation in HMM estimation. IEEE Transactions on Speech and Audio Processing. v13 i3. 367-376.

[285]

Segmental eigenvoice with delicate eigenspace for improved speaker adaptation. IEEE Transactions on Speech and Audio Processing. v13 i3. 399-411.

[286]

Tuerk, A., Young, S., 1999. Modeling speaking rate using a between frame distance metric. In: Proceedings of Eurospeech, vol. 1. Budapest, Hungary, pp. 419-422.

[287]

Tuerk, C., Robinson, T., 1993. A new frequency shift function for reducing inter-speaker variance. In: Proceedings of Eurospeech, Berlin, Germany, pp. 351-354.

[288]

Combining active and semi-supervised learning for spoken language understanding. Speech Communication. v45 i2. 171-186.

[289]

Mel-cepstrum modulation spectrum (MCMS) features for robust ASR. In: Proceedings of ASRU, St. Thomas, US Virgin Islands. pp. 381-386.

[290]

Tyagi, V., Wellekens, C., 2005. Cepstrum representation of speech. In: Proceedings of ASRU, Cancun, Mexico.

[291]

Tyagi, V., Wellekens, C., Bourlard, H., 2005. On variable-scale piecewise stationary spectral analysis of speech signals for ASR. In: Proceedings of Interspeech, Lisbon, Portugal, pp. 209-212.

[292]

Multilingual speech recognition in seven languages. Speech Communication. v35 i1-2. 53-69.

[293]

Uebler, U., Boros, M., 1999. Recognition of non-native german speech with multilingual recognizers. In: Proceedings of Eurospeech, vol. 2. Budapest, Hungary, pp. 911-914.

[294]

Scale transform in speech analysis. IEEE Transactions on Speech and Audio Processing. v7 i1. 40-45.

[295]

Utsuro, T., Harada, T., Nishizaki, H., Nakagawa, S., 2002. A confidence measure based on agreement among multiple LVCSR models - correlation between pair of acoustic models and confidence. In: Proceedings of ICSLP, Denver, Colorado, pp. 701-704.

[296]

Recognizing speech of goats, wolves, sheep and ¿ non-natives. Speech Communication. v35 i1-2. 71-79.

[297]

VanCompernolle, D., Smolders, J., Jaspers, P., Hellemans, T., 1991. Speaker clustering for dialectic robustness in speaker independent speech recognition. In: Proceedings of Eurospeech, Genova, Italy, pp. 723-726.

[298]

Vaseghi, S.V., Harte, N., Miller, B., 1997. Multi resolution phonetic/segmental features and models for HMM-based speech recognition. In: Proceedings of ICASSP, Munich Germany, pp. 1263-1266.

[299]

Venkataraman, A., Stolcke, A., Wangal, W., Vergyri, D., Ramana Rao Gadde, V., Zheng, J., 2004. An efficient repair procedure for quick transcriptions. In: Proceedings of ICSLP, Jeju Island, Korea.

[300]

Normalization of vowels by vocal-tract length and its application to vowel identification. IEEE Transactions on Acoustics, Speech and Signal Processing. v25. 183-192.

[301]

Speaker normalization and adaptation using second-order connectionist networks. IEEE Transactions Neural Networks. v4 i1. 21-30.

[302]

Mitch Weintraub, Kelsey Taussig, Kate Hunicke-Smith, Amy Snodgrass. 1996. Effect of speaking style on LVCSR performance. In: Proceedings Addendum of ICSLP, Philadelphia, PA, USA.

[303]

Speaker adaptive modeling by vocal tract normalization. IEEE Transactions on Speech and Audio Processing. v10 i6. 415-426.

[304]

Weng, F., Bratt, H., Neumeyer, L., Stomcke, A., 1997. A study of multilingual speech recognition. In: Proceedings of Eurospeech, vol. 1. Rhodes, Greece, pp. 359-362.

[305]

Wesker, T., Meyer, B., Wagener, K., Anemüller, J., Mertins, A., Kollmeier, B., 2005. Oldenburg logatome speech corpus (OLLO) for speech recognition experiments with humans and machines. In: Proceedings of Interspeech. Lisboa, Portugal, pp. 1273-1276.

[306]

Westphal, M., 1997. The use of cepstral means in conversational speech recognition. In: Proceedings of Eurospeech, vol. 3. Rhodes, Greece, pp. 1143-1146.

[307]

Williams, D.A.G., 1999. Knowing what you don't know: Roles for confidence measures in automatic speech recognition. PhD thesis, University of Sheffield.

[308]

Wilpon, J.G., Jacobsen, C.N., 1996. A study of speech recognition for children and the elderly. In: Proceedings of ICASSP, vol. 1. Atlanta, Georgia, pp. 349-352.

[309]

Witt, S.M., Young, S.J., 1999. Off-line acoustic modelling of non-native accents. In: Proceedings of Eurospeech, vol. 3. Budapest, Hungary, pp. 1367-1370.

[310]

Phone-level pronunciation scoring and assessment for interactive language learning. Speech Communication. v30 i2-3. 95-108.

[311]

Wong, P.-F., Siu, M.-H., 2004. Decision tree based tone modeling for chinese speech recognition. In: Proceedings of ICASSP, vol. 1. Montreal, Canada, pp. 905-908.

[312]

Wrede, B., Fink, G.A., Sagerer, G., 2001. An investigation of modeling aspects for rate-dependent speech recognition. In: Proceedings of Eurospeech, Aalborg, Denmark.

[313]

Speaker adaptation using constrained transformation. IEEE Transactions on Speech and Audio Processing. v12 i2. 168-174.

[314]

Yang, W.-J., Lee, J.-C., Chang, Y.-C., Wang, H.-C., 1988. Hidden Markov Model for Mandarin lexical tone recognition. In: IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 36. pp. 988-992.

[315]

Zavaliagkos, G., Schwartz, R., McDonough, J., 1996. Maximum a posteriori adaptation for large scale HMM recognizers. In: Proceedings of ICASSP, Atlanta, Georgia, pp. 725-728.

[316]

Zhang, B., Matsoukas, S., 2005. Minimum phoneme error based heteroscedastic linear discriminant analysis for speech recognition. In: Proceedings of ICASSP, vol. 1. Philadelphia, PA, pp. 925-928.

[317]

Zhan, P., Waibel, A., 1997. Vocal tract length normalization for large vocabulary continuous speech recognition. Technical Report CMU-CS-97-148, School of Computer Science, Carnegie Mellon University, Pittsburgh, Pennsylvania.

[318]

Zhan, P., Westphal, M., 1997. Speaker normalization based on frequency warping. In: Proceedings of ICASSP, vol. 2. Munich, Germany, pp. 1039-1042.

[319]

Zhang, Y., Desilva, C.J.S., Togneri, A., Alder, M., Attikiouzel, Y., 1994. Speaker-independent isolated word recognition using multiple Hidden Markov Models. In: Proceedings IEE Vision, Image and Signal Processing, vol. 141(3). pp. 197-202.

[320]

Zheng, J., Franco, H., Stolcke, A., 2000. Rate of speech modeling for large vocabulary conversational speech recognition. In: Proceedings of ISCA tutorial and research workshop on automatic speech recognition: challenges for the new Millenium. Paris, France, pp. 145-149.

[321]

Zheng, J., Franco, H., Stolcke, A., 2004. Effective acoustic modeling for rate-of-speech variation in large vocabulary conversational speech recognition. In: Proceedings of ICSLP, Jeju Island, Korea, pp. 401-404.

[322]

Rapid discriminative acoustic model based on eigenspace mapping for fast speaker adaptation. IEEE Transactions on Speech and Audio Processing. v13 i4. 554-564.

[323]

Zhou, G., Deisher, M.E., Sharma, S., 2002. Causal analysis of speech recognition failure in adverse environments. In: Proceedings of ICASSP, vol. 4. Orlando, Florida, pp. 3816-1819.

[324]

Zhu, Q., Alwan, A., 2000. AM-demodualtion of speech spectra and its application to noise robust speech recognition. In: Proceedings of ICSLP, vol. 1. Beijing, China, pp. 341-344.

[325]

Zhu, D., Paliwal, K.K., 2004. Product of power spectrum and group delay function for speech recognition. In: Proceedings of ICASSP, pp. 125-128.

[326]

Zhu, Q., Chen, B., Morgan, N., Stolcke, A., 2004. On using MLP features in LVCSR. In: Proceedings of ICSLP, Jeju Island, Korea.

[327]

Zolnay, A., Schlüter, R., Ney, H., 2002. Robust speech recognition using a voiced-unvoiced feature. In: Proceedings of ICSLP, vol. 2. Denver, CO, pp. 1065-1068.

[328]

Zolnay, A., Schlüter, R., Ney, H., 2005. Acoustic feature combination for robust speech recognition. In: Proceedings of ICASSP, vol. I. Philadelphia, PA, pp. 457-460.

Cited By

Petsolari MIbrahim SSlovak P(2024)Socio-technical Imaginaries: Envisioning and Understanding AI Parenting Supports through Design FictionProceedings of the 2024 CHI Conference on Human Factors in Computing Systems10.1145/3613904.3642619(1-27)Online publication date: 11-May-2024
https://dl.acm.org/doi/10.1145/3613904.3642619
Shanthamallappa M(2024)Robust Speech Enhancement Using Dabauchies Wavelet Based Adaptive Wavelet Thresholding for the Development of Robust Automatic Speech Recognition: A Comprehensive ReviewWireless Personal Communications: An International Journal10.1007/s11277-024-11448-x137:4(2085-2119)Online publication date: 1-Aug-2024
https://dl.acm.org/doi/10.1007/s11277-024-11448-x
Shabber SBansal M(2024)Temporal feature-based approaches for enhancing phoneme boundary detection and masking in speechInternational Journal of Speech Technology10.1007/s10772-024-10117-527:2(425-436)Online publication date: 1-Jun-2024
https://dl.acm.org/doi/10.1007/s10772-024-10117-5
Show More Cited By

Index Terms

Automatic speech recognition and speech variability: A review
1. Computing methodologies
  1. Artificial intelligence
    1. Natural language processing
      1. Speech recognition
2. Hardware
  1. Communication hardware, interfaces and storage
    1. Signal processing systems

Recommendations

Automatic Speech Recognition Used for Intelligibility Assessment of Text-to-Speech Systems
Verbal and Nonverbal Features of Human-Human and Human-Machine Interaction

Speech intelligibility is the most important parameter in evaluation of speech quality. In the contribution, a new objective intelligibility assessment of general speech processing algorithms is proposed. It is based on automatic recognition methods ...
Syllable-based automatic arabic speech recognition in noisy-telephone channel

The performance of well-trained speech recognizers using high quality full bandwidth speech data is usually degraded when used in real world environments. In particular, telephone speech recognition is extremely difficult due to the limited bandwidth of ...
Acoustic model adaptation based on pronunciation variability analysis for non-native speech recognition

In this paper, pronunciation variability between native and non-native speakers is investigated, and a novel acoustic model adaptation method is proposed based on pronunciation variability analysis in order to improve the performance of a speech ...

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image Speech Communication

Speech Communication Volume 49, Issue 10-11

October, 2007

101 pages

ISSN:0167-6393

Issue’s Table of Contents

Copyright © Elsevier B.V. © 2007.

Publisher

Elsevier Science Publishers B. V.

Netherlands

Publication History

Published: 01 October 2007

Author Tags

Qualifiers

Article

Contributors

Other Metrics

View Article Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

73
Total Citations
View Citations
0
Total Downloads

Downloads (Last 12 months)0
Downloads (Last 6 weeks)0

Reflects downloads up to 05 Mar 2025

Other Metrics

View Author Metrics

Citations

Cited By

Petsolari MIbrahim SSlovak P(2024)Socio-technical Imaginaries: Envisioning and Understanding AI Parenting Supports through Design FictionProceedings of the 2024 CHI Conference on Human Factors in Computing Systems10.1145/3613904.3642619(1-27)Online publication date: 11-May-2024
https://dl.acm.org/doi/10.1145/3613904.3642619
Shanthamallappa M(2024)Robust Speech Enhancement Using Dabauchies Wavelet Based Adaptive Wavelet Thresholding for the Development of Robust Automatic Speech Recognition: A Comprehensive ReviewWireless Personal Communications: An International Journal10.1007/s11277-024-11448-x137:4(2085-2119)Online publication date: 1-Aug-2024
https://dl.acm.org/doi/10.1007/s11277-024-11448-x
Shabber SBansal M(2024)Temporal feature-based approaches for enhancing phoneme boundary detection and masking in speechInternational Journal of Speech Technology10.1007/s10772-024-10117-527:2(425-436)Online publication date: 1-Jun-2024
https://dl.acm.org/doi/10.1007/s10772-024-10117-5
Nguyen Doan THuynh SNguyen ALe APhan Thi Thuy AHuynh DNguyen B(2024)Vietnamese Automatic Speech Recognition for Financial Conversation DataIntelligent Information and Database Systems10.1007/978-981-97-4985-0_29(372-383)Online publication date: 15-Apr-2024
https://dl.acm.org/doi/10.1007/978-981-97-4985-0_29
Gündüz AKim YAli Yuksel KAl-Badrashiny MCastro Ferreira TSawaf H(2024)AutoMode-ASR: Learning to Select ASR Systems for Better Quality and CostSpeech and Computer10.1007/978-3-031-77961-9_7(92-103)Online publication date: 25-Nov-2024
https://dl.acm.org/doi/10.1007/978-3-031-77961-9_7
Pande AMishra DNachenahalli Bhuthegowda B(2024)NAO vs. Pepper: Speech Recognition Performance AssessmentHuman-Computer Interaction10.1007/978-3-031-60412-6_12(156-167)Online publication date: 29-Jun-2024
https://dl.acm.org/doi/10.1007/978-3-031-60412-6_12
Venkata Lakshmi SSujatha KJanet J(2023)A hybrid discriminant fuzzy DNN with enhanced modularity bat algorithm for speech recognitionJournal of Intelligent & Fuzzy Systems: Applications in Engineering and Technology10.3233/JIFS-21294544:3(4079-4091)Online publication date: 1-Jan-2023
https://dl.acm.org/doi/10.3233/JIFS-212945
Zhao RYu JZhao HNgai E(2023)Radio2TextProceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies10.1145/36108737:3(1-28)Online publication date: 27-Sep-2023
https://dl.acm.org/doi/10.1145/3610873
Heck MJeong JBecker C(2023)Evaluating the Potential of Caption Activation to Mitigate Confusion Inferred from Facial Gestures in Virtual MeetingsProceedings of the 25th International Conference on Multimodal Interaction10.1145/3577190.3614142(243-252)Online publication date: 9-Oct-2023
https://dl.acm.org/doi/10.1145/3577190.3614142
Kuhn KKersken VZimmermann G(2023)Accuracy of AI-generated Captions With Collaborative Manual Corrections in Real-TimeExtended Abstracts of the 2023 CHI Conference on Human Factors in Computing Systems10.1145/3544549.3585724(1-7)Online publication date: 19-Apr-2023
https://dl.acm.org/doi/10.1145/3544549.3585724
Show More Cited By

View Options

View options

Figures

Tables

Media

View Issue’s Table of Contents