Abstract
Pronunciation variability is an important issue that must be addressed when developing practical automatic spontaneous speech recognition systems. In this paper, the factors that may affect recognition performance are analyzed, including those specific to the Chinese language. By studying the INITIAL/FINAL (IF) characteristics of Chinese and developing the Bayesian equation, the concepts of generalized INITIAL/FINAL (GIF) and generalized syllable (GS), GIF modeling and IF-GIF modeling, and context-dependent pronunciation weighting are proposed, based on a seed database with high-quality phonetic transcription. With these methods, the Chinese syllable error rate (SER) is reduced by 6.3% and 4.2% compared with GIF modeling and IF modeling, respectively, when no language model, such as a syllable or word N-gram, is used. The effectiveness of these methods is also demonstrated when additional data without phonetic transcription are used to refine the acoustic model through the proposed iterative forced-alignment based transcribing (IFABT) method, achieving a 5.7% SER reduction.
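As a rough illustration of the kind of Bayesian decomposition the abstract alludes to, the sketch below treats the surface-form unit as a hidden variable between the canonical unit and the acoustics. The symbols A (acoustic observation), b (canonical IF unit) and s (surface-form GIF unit) are assumptions introduced here for illustration only; they are not the paper's own notation, and the paper's actual equation may differ.

```latex
% Illustrative only: A, b, s are assumed symbols, not the paper's notation.
\hat{b} = \arg\max_{b} P(b \mid A) = \arg\max_{b} P(A \mid b)\, P(b),
\qquad
P(A \mid b) = \sum_{s} P(A \mid b, s)\, P(s \mid b)
```

Under this reading, P(s | b) would act as a pronunciation weight and P(A | b, s) as an acoustic model conditioned on both the canonical and the surface form. When no language model is used, P(b) can be taken as uniform.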
Additional information
This paper reports on the project “Mandarin pronunciation modeling”, supported by the U.S. National Science Foundation under grant No. IIS-9820687 and carried out at the 2000 Summer Workshop on Language and Speech Processing, Center for Language and Speech Processing, Johns Hopkins University (http://www.clsp.jhu.edu/ws2000/), and on the research that followed it. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the National Science Foundation or The Johns Hopkins University.
ZHENG Fang is an associate professor at Tsinghua University. He is director of the Center of Speech Technology, the State Key Lab of Intelligent Technology and Systems. Dr. Zheng received his B.S., M.S. and Ph.D. degrees from Tsinghua University in 1990, 1992 and 1997, respectively. Dr. Zheng has been working on speech recognition since 1988. His research interests include acoustic/language modeling, isolated/continuous speech recognition, keyword spotting, dictation, and language understanding. Dr. Zheng is an IEEE member, an ISCA (International Speech Communication Association) member, a member of the Editorial Committee of the Journal of Chinese Information Processing, a member of the Artificial Intelligence and Pattern Recognition Technical Commission of China Computer Federation, and a reviewer for several domestic and international journals and conferences. He served as co-chair of the Program Committee of the International Symposium on Chinese Spoken Language Processing (ISCSLP'2000) and as a member of the Program Committee of the International Conference on Spoken Language Processing (ICSLP'2000).
SONG Zhanjiang is a Ph.D. candidate in the Department of Computer Science and Technology, Tsinghua University. His research interests include acoustic modeling, search algorithms, continuous speech recognition, automatic pronunciation scoring, and pronunciation modeling. He received his B.S. degree in computer software in 1994 and his M.S. degree in computer application (majoring in computer networks) in 1997, both from the Department of Computer and System Sciences, Nankai University.
Pascale Fung is an assistant professor of electrical and electronic engineering at the Hong Kong University of Science and Technology (HKUST) and a founding faculty member of the Human Language Technology Center at HKUST. She is also a founder of Weniwen Technologies (http://www.weniwen.com), a company using natural language processing and information retrieval technologies for real-time applications. Dr. Fung received her Ph.D. and M.Sc. degrees in computer science from Columbia University, and holds a B.S. degree in electrical engineering from Worcester Polytechnic Institute, Mass. Dr. Fung was a researcher at Bell Laboratories, at BBN Systems & Technologies in Cambridge, Mass., at Kyoto University (Japan) and at the French National Scientific Research Center. Her research interests include automatic speech recognition, natural language processing, cross-lingual retrieval, and machine translation. As a fluent speaker of English, Mandarin, Shanghainese, Cantonese, French and Japanese, she is particularly interested in multilingual and cross-lingual topics. Dr. Fung has served on program committees and editorial boards of leading international conferences and journals including the HK Research Grants Council, Computational Linguistics, Machine Translation, the Association of Computational Linguistics (ACL), COLING, International Conference on Spoken Language Processing (ICSLP), ISCSLP, AMTA, NEMLAP, COMPTERM, PRICAL, etc. She is a committee member of the ACL SIGDAT, and was the conference chair of Empirical Methods on Natural Language Processing (EMNLP) in 1999. Most recently, she has been the team leader of the “Mandarin pronunciation modeling” group at The Johns Hopkins Summer Workshop on Speech and Language Technologies. She is a Senior Member of the Institute of Electrical and Electronics Engineers (IEEE).
William Byrne received the B.S. degree in electrical engineering from Cornell University, Ithaca, NY, in 1982, and the Ph.D. degree in electrical engineering from the University of Maryland, College Park, MD, in 1993. He has worked at Entropic Research Laboratory, Washington DC, and the National Institutes of Health, Bethesda, MD. He is currently a research associate professor in the Department of Electrical Engineering and the Center for Language and Speech Processing at the Johns Hopkins University, Baltimore, MD. His research interests are in all aspects of automatic speech recognition, including speaker adaptation, robust estimation, pronunciation modeling, and novel ASR decoding strategies.
Cite this article
Zheng, F., Song, Z., Fung, P. et al. Mandarin pronunciation modeling based on CASS corpus. J. of Comput. Sci. & Technol. 17, 249–263 (2002). https://doi.org/10.1007/BF02947304