Joint-sequence models for grapheme-to-phoneme conversion

Published: 01 May 2008

Abstract

Grapheme-to-phoneme conversion is the task of finding the pronunciation of a word given its written form. It has important applications in text-to-speech and speech recognition. Joint-sequence models are a simple and theoretically stringent probabilistic framework that is applicable to this problem. This article provides a self-contained and detailed description of this method. We present a novel estimation algorithm and demonstrate high accuracy on a variety of databases. Moreover, we study the impact of the maximum approximation in training and transcription, the interaction of model size parameters, n-best list generation, confidence measures, and phoneme-to-grapheme conversion. Our software implementation of the method proposed in this work is available under an Open Source license.
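To make the abstract's terms concrete, the following is a minimal sketch of transcription under the maximum approximation, assuming the simplest possible joint-sequence model: a unigram model over joint letter/phoneme units ("graphones"). The inventory `GRAPHONES`, its probabilities, and the function name `transcribe` are invented for illustration; they are not taken from the article or from its software, and a real model would be estimated from a pronunciation lexicon and condition on preceding units.

```python
import math

# Hypothetical toy graphone inventory: (letters, phonemes) -> probability.
# All values are made up for illustration, not estimated from any lexicon.
# Every graphone here covers at least one letter, so the search always advances.
GRAPHONES = {
    ("m", ("m",)): 0.20,
    ("i", ("I",)): 0.15,
    ("i", ("aI",)): 0.05,
    ("x", ("k", "s")): 0.10,
    ("ng", ("N",)): 0.10,
    ("n", ("n",)): 0.10,
    ("g", ("g",)): 0.10,
}


def transcribe(word, graphones=GRAPHONES):
    """Return (phonemes, log_prob) of the single best graphone sequence for `word`.

    Maximum approximation: only the most probable joint segmentation is kept,
    rather than summing over all segmentations that yield the same phonemes.
    """
    # best[i] = (log-probability, phonemes) of the best segmentation of word[:i]
    best = [(-math.inf, ())] * (len(word) + 1)
    best[0] = (0.0, ())
    for end in range(1, len(word) + 1):
        for (letters, phonemes), prob in graphones.items():
            start = end - len(letters)
            if start < 0 or word[start:end] != letters:
                continue
            score = best[start][0] + math.log(prob)
            if score > best[end][0]:
                best[end] = (score, best[start][1] + phonemes)
    log_prob, phonemes = best[-1]
    if log_prob == -math.inf:
        raise ValueError(f"no graphone segmentation covers {word!r}")
    return list(phonemes), log_prob


if __name__ == "__main__":
    phonemes, log_prob = transcribe("mixing")
    print(" ".join(phonemes), math.exp(log_prob))
```

With this toy inventory, the search prefers the two-letter unit for "ng" over the separate "n" and "g" units because the single unit yields a higher joint probability. Keeping only one hypothesis per letter position is what makes this a best-path search; an n-best list would require keeping several hypotheses per position instead.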

Published In

Speech Communication, Volume 50, Issue 5
May 2008
97 pages

Publisher

Elsevier Science Publishers B. V.

Netherlands

Author Tags

  1. Grapheme-to-phoneme
  2. Joint-sequence model
  3. Letter-to-sound
  4. Phonemic transcription
  5. Pronunciation modeling

Qualifiers

  • Article
