
Speech Structure and Its Application to Robust Speech Processing

Published in New Generation Computing

Abstract

Speech communication consists of three steps: production, transmission, and hearing. Each step inevitably introduces acoustic distortions due to gender and age differences, microphone and room characteristics, and other factors. Despite these variations, listeners extract linguistic information from speech as easily as if the communication were unaffected by them. One hypothesis is that listeners modify their internal acoustic models whenever extralinguistic factors change. Another is that the linguistic information in speech can be represented separately from the extralinguistic factors. In this study, inspired by studies of humans and animals, a novel solution to the problem of these intrinsic variations is proposed. Speech structures invariant to the variations are derived as transform-invariant features, and their linguistic validity is discussed. Their robustness is demonstrated by applying them to automatic speech recognition and to pronunciation proficiency estimation. The paper also discusses the limitations of the current implementation and application of speech structures.
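The abstract does not spell out how the transform-invariant features are constructed; the sketch below only illustrates the general idea under one common formulation, in which each speech event is modelled as a Gaussian in some feature space and the "structure" is the matrix of pairwise Bhattacharyya distances (a member of the f-divergence family). Such a matrix is unchanged by any invertible affine transform of the feature space, which serves here as a simple stand-in for speaker-dependent distortions. All names, dimensions, and parameters in the code are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def bhattacharyya(mu1, cov1, mu2, cov2):
    """Bhattacharyya distance between two Gaussians (an f-divergence-based measure)."""
    cov = 0.5 * (cov1 + cov2)
    diff = mu2 - mu1
    maha = 0.125 * diff @ np.linalg.solve(cov, diff)
    logdet = 0.5 * np.log(np.linalg.det(cov) /
                          np.sqrt(np.linalg.det(cov1) * np.linalg.det(cov2)))
    return maha + logdet

def structure_matrix(events):
    """Pairwise-distance matrix over a set of (mean, covariance) speech events."""
    n = len(events)
    D = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            D[i, j] = D[j, i] = bhattacharyya(*events[i], *events[j])
    return D

# Toy data: a few hypothetical speech events, each a Gaussian in a 6-dim feature space.
rng = np.random.default_rng(0)
dim, n_events = 6, 5
events = []
for _ in range(n_events):
    mu = rng.normal(size=dim)
    a = rng.normal(size=(dim, dim))
    events.append((mu, a @ a.T + np.eye(dim)))  # symmetric positive-definite covariance

# An arbitrary invertible affine transform of the feature space, standing in for
# speaker-dependent distortion (e.g., a vocal-tract-length difference).
A = rng.normal(size=(dim, dim)) + dim * np.eye(dim)
b = rng.normal(size=dim)
warped = [(A @ mu + b, A @ cov @ A.T) for mu, cov in events]

D_orig, D_warp = structure_matrix(events), structure_matrix(warped)
print(np.allclose(D_orig, D_warp))  # True: the structure survives the distortion
```

Because every entry of the matrix is invariant, two utterances can be compared through their structures rather than their raw spectra, which is the intuition behind the recognition and proficiency-estimation applications mentioned above.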



Author information

Corresponding author

Correspondence to Nobuaki Minematsu.

About this article

Cite this article

Minematsu, N., Asakawa, S., Suzuki, M. et al. Speech Structure and Its Application to Robust Speech Processing. New Gener. Comput. 28, 299–319 (2010). https://doi.org/10.1007/s00354-009-0091-y
