Mandarin pronunciation modeling based on CASS corpus

  • Regular Papers
  • Journal of Computer Science and Technology

Abstract

Pronunciation variability is an important issue that must be addressed when developing practical automatic spontaneous speech recognition systems. In this paper, the factors that may affect recognition performance are analyzed, including those specific to the Chinese language. By studying the INITIAL/FINAL (IF) characteristics of the Chinese language and developing the Bayesian equation, the concepts of generalized INITIAL/FINAL (GIF) and generalized syllable (GS), GIF modeling and IF-GIF modeling, and context-dependent pronunciation weighting are proposed, based on a seed database with good phonetic transcription. Using these methods, the Chinese syllable error rate (SER) is reduced by 6.3% and 4.2% compared with GIF modeling and IF modeling respectively when no language model, such as a syllable or word N-gram, is used. The effectiveness of these methods is also demonstrated when additional data without phonetic transcription are used to refine the acoustic model with the proposed iterative forced-alignment based transcribing (IFABT) method, achieving a 5.7% SER reduction.
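
The abstract reports results as syllable error rate (SER) but does not spell out the scoring procedure. As a rough illustration only, the sketch below assumes the conventional definition used for speech recognition error rates: the minimum number of substitutions, deletions and insertions needed to turn the recognized syllable sequence into the reference, divided by the reference length. The function name `syllable_error_rate` and the example syllables are hypothetical and are not taken from the paper.

```python
def syllable_error_rate(reference, hypothesis):
    """Return (substitutions + deletions + insertions) / len(reference).

    Assumed, conventional definition; the paper's own scoring tool
    may differ in details such as tone handling.
    """
    n, m = len(reference), len(hypothesis)
    if n == 0:
        raise ValueError("reference must contain at least one syllable")
    # prev[j] holds the edit distance between reference[:i-1] and hypothesis[:j]
    prev = list(range(m + 1))
    for i in range(1, n + 1):
        curr = [i] + [0] * m
        for j in range(1, m + 1):
            cost = 0 if reference[i - 1] == hypothesis[j - 1] else 1
            curr[j] = min(
                prev[j] + 1,         # deletion of a reference syllable
                curr[j - 1] + 1,     # insertion of a spurious syllable
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[m] / n


# Hypothetical toneless pinyin syllables, one substitution out of four.
reference = ["qing", "hua", "da", "xue"]
hypothesis = ["qing", "hua", "ta", "xue"]
ser = syllable_error_rate(reference, hypothesis)
```

Note that the abstract does not state whether the quoted reductions are absolute or relative, so a function like this only yields the raw rate for one system; comparing two systems is a separate step.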

Author information

Corresponding author

Correspondence to Zheng Fang.

Additional information

This paper reports on the project “Mandarin pronunciation modeling”, supported by the National Science Foundation of the USA under grant No. IIS-9820687 and carried out in the 2000 Summer Workshop on Language and Speech Processing, Center for Language and Speech Processing, Johns Hopkins University (http://www.clsp.jhu.edu/ws2000/), and on the research that followed it. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the National Science Foundation or The Johns Hopkins University.

ZHENG Fang is an associate professor at Tsinghua University. He is director of the Center of Speech Technology, the State Key Lab of Intelligent Technology and Systems. Dr. Zheng received his B.S., M.S. and Ph.D. degrees from Tsinghua University in 1990, 1992 and 1997 respectively. Dr. Zheng has been working on speech recognition since 1988. His research interests include acoustic/language modeling, isolated/continuous speech recognition, keyword spotting, dictation and language understanding. Dr. Zheng is an IEEE member, an ISCA (International Speech Communication Association) member, a member of the Editorial Committee of the Journal of Chinese Information Processing, a member of the Artificial Intelligence and Pattern Recognition Technical Commission of China Computer Federation, and a reviewer for several domestic and international journals and conferences. He served as co-chair of the Program Committee of the International Symposium on Chinese Spoken Language Processing (ISCSLP'2000) and as a member of the Program Committee of the International Conference on Spoken Language Processing (ICSLP'2000).

SONG Zhanjiang is a Ph.D. candidate in the Department of Computer Science and Technology, Tsinghua University. His research interests include acoustic modeling, search algorithms, continuous speech recognition, automatic pronunciation scoring and pronunciation modeling. He received his B.S. degree in computer software in 1994 and his M.S. degree in computer application (majoring in computer networks) in 1997, both from the Department of Computer and System Sciences, Nankai University.

Pascale Fung is an assistant professor of electrical and electronic engineering at the Hong Kong University of Science and Technology (HKUST) and is a founding faculty member of the Human Language Technology Center at HKUST. She is also a founder of Weniwen Technologies (http://www.weniwen.com), a company using natural language processing and information retrieval technologies for real-time applications. Dr. Fung received her Ph.D. and M.Sc. degrees in computer science from Columbia University, and holds a B.S. degree in electrical engineering from Worcester Polytechnic Institute, Mass. Dr. Fung was a researcher at Bell Laboratories, BBN Systems & Technologies in Cambridge, Mass., Kyoto University (Japan) and the French National Scientific Research Center. Her research interests include automatic speech recognition, natural language processing, cross-lingual retrieval and machine translation. As a fluent speaker of English, Mandarin, Shanghainese, Cantonese, French and Japanese, she is particularly interested in multilingual and cross-lingual topics. Dr. Fung has served on program committees and editorial boards of leading international conferences and journals including the HK Research Grants Council, Computational Linguistics, Machine Translation, the Association for Computational Linguistics (ACL), COLING, International Conference on Spoken Language Processing (ICSLP), ISCSLP, AMTA, NEMLAP, COMPTERM, PRICAL, etc. She is a committee member of the ACL SIGDAT, and was the conference chair of Empirical Methods in Natural Language Processing (EMNLP) in 1999. Most recently, she has been the team leader of the “Mandarin pronunciation modeling” group at The Johns Hopkins Summer Workshop on Speech and Language Technologies. She is a Senior Member of the Institute of Electrical and Electronics Engineers (IEEE).

William Byrne received the B.S. degree in electrical engineering from Cornell University, Ithaca, NY in 1982, and the Ph.D. degree in electrical engineering from the University of Maryland, College Park, MD in 1993. He has worked at Entropic Research Laboratory, Washington DC, and the National Institutes of Health, Bethesda, MD. He is currently a research associate professor in the Department of Electrical Engineering and the Center for Language and Speech Processing at the Johns Hopkins University, Baltimore, MD. His research interests are in all aspects of automatic speech recognition, including speaker adaptation, robust estimation, pronunciation modeling, and novel ASR decoding strategies.

About this article

Cite this article

Zheng, F., Song, Z., Fung, P. et al. Mandarin pronunciation modeling based on CASS corpus. J. of Comput. Sci. & Technol. 17, 249–263 (2002). https://doi.org/10.1007/BF02947304

  • DOI: https://doi.org/10.1007/BF02947304
