Mandarin pronunciation modeling based on CASS corpus

Zheng Fang¹,
Song Zhanjiang¹,
Fung Pascale² &
…
William Byrne³

Abstract

Pronunciation variability is an important issue that must be addressed when developing practical automatic recognition systems for spontaneous speech. In this paper, the factors that may affect recognition performance are analyzed, including those specific to the Chinese language. By studying the INITIAL/FINAL (IF) characteristics of the Chinese language and developing the Bayesian equation, the paper proposes the concepts of generalized INITIAL/FINAL (GIF) and generalized syllable (GS), GIF modeling and IF-GIF modeling, and context-dependent pronunciation weighting, all based on a well phonetically transcribed seed database. With these methods, the Chinese syllable error rate (SER) is reduced by 6.3% and 4.2% compared with GIF modeling and IF modeling respectively when no language model, such as a syllable or word N-gram, is used. The methods also prove effective when more data without phonetic transcription are used to refine the acoustic model with the proposed iterative forced-alignment based transcribing (IFABT) method, achieving a 5.7% SER reduction.
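
As a rough illustration of the context-dependent pronunciation weighting mentioned in the abstract, the sketch below estimates by relative frequency how often a canonical unit (an IF) is realized as each surface form (a GIF) given its left context. The function name, the triple layout, and the toy alignments are hypothetical and are not taken from the paper; the weighting scheme the paper actually uses may differ.

```python
from collections import Counter, defaultdict


def estimate_pronunciation_weights(aligned):
    """Estimate P(surface | left_context, canonical) by relative frequency.

    `aligned` is an iterable of (left_context, canonical, surface) triples,
    as might be read off a phonetically transcribed seed corpus.
    This layout is an assumption made for illustration only.
    """
    counts = defaultdict(Counter)
    for left, canonical, surface in aligned:
        counts[(left, canonical)][surface] += 1

    weights = {}
    for key, surface_counts in counts.items():
        total = sum(surface_counts.values())
        # Normalize the counts so the weights for each key sum to 1.
        weights[key] = {s: c / total for s, c in surface_counts.items()}
    return weights


# Hypothetical toy alignments: canonical "zh" is sometimes realized as "z".
toy_alignments = [
    ("a", "zh", "zh"),
    ("a", "zh", "z"),
    ("a", "zh", "zh"),
    ("i", "zh", "zh"),
]

weights = estimate_pronunciation_weights(toy_alignments)
for key, dist in sorted(weights.items()):
    print(key, dist)
```

Conditioning on the left context only is a simplification chosen to keep the example short; a real system would also need smoothing for contexts with few observations.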

Author information

Authors and Affiliations

Center of Speech Technology, State Key Laboratory of Intelligent Technology and Systems, Department of Computer Science and Technology, Tsinghua University, 100084, Beijing, P.R. China
Zheng Fang & Song Zhanjiang
Department of Electrical and Electronic Engineering, Hong Kong University of Science and Technology, Hong Kong, P.R. China
Fung Pascale
Center for Language and Speech Processing, The Johns Hopkins University, USA
William Byrne

Corresponding author

Correspondence to Zheng Fang.

Additional information

This paper is a report on the project “Mandarin pronunciation modeling”, supported by the National Science Foundation of the USA under grant No. IIS-9820687 and carried out in the 2000 Summer Workshop on Language and Speech Processing, Center for Language and Speech Processing, Johns Hopkins University (http://www.clsp.jhu.edu/ws2000/), and on the further research that followed it. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the National Science Foundation or The Johns Hopkins University.

ZHENG Fang is an associate professor at Tsinghua University. He is director of the Center of Speech Technology, the State Key Lab of Intelligent Technology and Systems. Dr. Zheng received his B.S., M.S. and Ph.D. degrees from Tsinghua University in 1990, 1992 and 1997 respectively. Dr. Zheng has been working on speech recognition since 1988. His research interests include acoustic/language modeling, isolated/continuous speech recognition, keyword spotting, dictation, and language understanding. Dr. Zheng is an IEEE member, an ISCA (International Speech Communication Association) member, a member of the Editorial Committee of the Journal of Chinese Information Processing, a member of the Artificial Intelligence and Pattern Recognition Technical Commission of China Computer Federation, and a reviewer for several domestic and international journals and conferences. He served as co-chair of the Program Committee of the International Symposium on Chinese Spoken Language Processing (ISCSLP'2000) and as a member of the Program Committee of the International Conference on Spoken Language Processing (ICSLP'2000).

SONG Zhanjiang is a Ph.D. candidate in the Department of Computer Science and Technology, Tsinghua University. His research interests include acoustic modeling, search algorithms, continuous speech recognition, automatic pronunciation scoring, and pronunciation modeling. He received his B.S. degree in computer software in 1994 and his M.S. degree in computer application (majoring in computer networks) in 1997, both from the Department of Computer and System Sciences, Nankai University.

Pascale Fung is an assistant professor of electrical and electronic engineering at the Hong Kong University of Science and Technology (HKUST) and a founding faculty member of the Human Language Technology Center at HKUST. She is also a founder of Weniwen Technologies (http://www.weniwen.com), a company using natural language processing and information retrieval technologies for real-time applications. Dr. Fung received her Ph.D. and M.Sc. degrees in computer science from Columbia University, and holds a B.S. degree in electrical engineering from Worcester Polytechnic Institute, Mass. Dr. Fung was a researcher at Bell Laboratories, at BBN Systems & Technologies in Cambridge, Mass., at Kyoto University (Japan), and at the French National Scientific Research Center. Her research interests include automatic speech recognition, natural language processing, cross-lingual retrieval, and machine translation. As a fluent speaker of English, Mandarin, Shanghainese, Cantonese, French and Japanese, she is particularly interested in multilingual and cross-lingual topics. Dr. Fung has served on program committees and editorial boards of leading international conferences and journals including the HK Research Grants Council, Computational Linguistics, Machine Translation, the Association of Computational Linguistics (ACL), COLING, International Conference on Spoken Language Processing (ICSLP), ISCSLP, AMTA, NEMLAP, COMPTERM, PRICAL, etc. She is a committee member of the ACL SIGDAT, and was the conference chair of Empirical Methods on Natural Language Processing (EMNLP) in 1999. Most recently, she has been the team leader of the “Mandarin pronunciation modeling” group at The Johns Hopkins Summer Workshop on Speech and Language Technologies. She is a Senior Member of the Institute of Electrical and Electronics Engineers (IEEE).

William Byrne received the B.S. degree in electrical engineering from Cornell University, Ithaca, NY in 1982, and the Ph.D. degree in electrical engineering from the University of Maryland, College Park, MD in 1993. He has worked at Entropic Research Laboratory, Washington DC, and the National Institutes of Health, Bethesda, MD. He is currently a research associate professor in the Department of Electrical Engineering and the Center for Language and Speech Processing at the Johns Hopkins University, Baltimore, MD. His research interests are in all aspects of automatic speech recognition, including speaker adaptation, robust estimation, pronunciation modeling, and novel ASR decoding strategies.

About this article

Cite this article

Zheng, F., Song, Z., Fung, P. et al. Mandarin pronunciation modeling based on CASS corpus. J. of Comput. Sci. & Technol. 17, 249–263 (2002). https://doi.org/10.1007/BF02947304

Received: 19 December 2000
Revised: 25 July 2001
Issue Date: May 2002
DOI: https://doi.org/10.1007/BF02947304
