
Joint-sequence models for grapheme-to-phoneme conversion

Authors: Maximilian Bisani, Hermann Ney

Speech Communication, Volume 50, Issue 5

Pages 434 - 451

https://doi.org/10.1016/j.specom.2008.01.002

Published: 01 May 2008

Abstract

Grapheme-to-phoneme conversion is the task of finding the pronunciation of a word given its written form. It has important applications in text-to-speech and speech recognition. Joint-sequence models are a simple and theoretically stringent probabilistic framework that is applicable to this problem. This article provides a self-contained and detailed description of this method. We present a novel estimation algorithm and demonstrate high accuracy on a variety of databases. Moreover, we study the impact of the maximum approximation in training and transcription, the interaction of model size parameters, n-best list generation, confidence measures, and phoneme-to-grapheme conversion. Our software implementation of the method proposed in this work is available under an Open Source license.
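
To make the task concrete, here is a minimal sketch of transcription with a joint-sequence model, assuming a unigram model over joint grapheme-phoneme units (often called graphones in this line of work). The inventory, its probabilities, the phoneme symbols, and the function name `transcribe` are all made up for illustration. The model described in the article is richer than this and is estimated from data rather than set by hand. The search keeps only the single best segmentation of the word, which is one simple reading of the maximum approximation mentioned above.

```python
import math

# Hypothetical toy inventory: (grapheme chunk, phoneme chunk) -> probability.
# These values are invented for illustration, not estimated from any lexicon.
GRAPHONES = {
    ("m", "m"): 0.2,
    ("i", "I"): 0.2,
    ("i", "aI"): 0.1,
    ("x", "ks"): 0.1,
    ("ng", "N"): 0.1,
    ("n", "n"): 0.15,
    ("g", "g"): 0.15,
}


def transcribe(word, graphones=GRAPHONES):
    """Return (log-probability, phoneme tuple) of the best segmentation.

    Returns None if the word cannot be covered by the inventory.
    """
    # best[i] holds the best (log-prob, phonemes) covering word[:i].
    best = [None] * (len(word) + 1)
    best[0] = (0.0, ())
    for end in range(1, len(word) + 1):
        for (letters, phones), prob in graphones.items():
            start = end - len(letters)
            if start < 0 or best[start] is None:
                continue
            if word[start:end] != letters:
                continue
            candidate = (best[start][0] + math.log(prob),
                         best[start][1] + (phones,))
            # Maximum approximation: keep only the single best path.
            if best[end] is None or candidate[0] > best[end][0]:
                best[end] = candidate
    return best[len(word)]
```

With this toy table, `transcribe("mixing")` picks the two-letter unit for "ng" over the separate "n" and "g" units because its probability (0.1) exceeds their product (0.0225). Summing over all segmentations that yield the same pronunciation, instead of taking the maximum, would be the alternative that the maximum approximation avoids.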







Published In

Speech Communication, Volume 50, Issue 5

May 2008

97 pages

ISSN: 0167-6393

Copyright © 2008 Elsevier B.V.

Publisher

Elsevier Science Publishers B. V.

Netherlands


Qualifiers

Article

Article Metrics

Total Citations: 69
Total Downloads: 0
Downloads (Last 12 months): 0
Downloads (Last 6 weeks): 0

Reflects downloads up to 05 Mar 2025

Cited By

Williams S, Foulkes P, Hughes V (2024). Analysis of forced aligner performance on L2 English speech. Speech Communication 158:C. Online publication date: 1-Mar-2024.
https://dl.acm.org/doi/10.1016/j.specom.2024.103042
Liu J, Lei Y, Yao M, Liu Z, Lin G, Wang Z, Liu Y, Chen W (2023). The Tacotron2-based IPA-to-Speech speech synthesis system. Proceedings of the 2023 6th International Conference on Signal Processing and Machine Learning (70-75). Online publication date: 14-Jul-2023.
https://dl.acm.org/doi/10.1145/3614008.3614019
Ghosh K, Mandal S, Roy N (2023). Boosting Rule-Based Grapheme-to-Phoneme Conversion with Morphological Segmentation and Syllabification in Bengali. Speech and Computer (415-429). Online publication date: 29-Nov-2023.
https://dl.acm.org/doi/10.1007/978-3-031-48309-7_34
Laitonjam L, Singh S (2022). A Hybrid Machine Transliteration Model Based on Multi-source Encoder–Decoder Framework: English to Manipuri. SN Computer Science 3:2. Online publication date: 11-Jan-2022.
https://dl.acm.org/doi/10.1007/s42979-021-01005-9
Hajj M, Lenglet M, Perrotin O, Bailly G (2022). Comparing NLP Solutions for the Disambiguation of French Heterophonic Homographs for End-to-End TTS Systems. Speech and Computer (265-278). Online publication date: 14-Nov-2022.
https://dl.acm.org/doi/10.1007/978-3-031-20980-2_23
Liu J, Ren C, Luan Y, Li S, Xie T, Seals C, Speights Atkins M (2022). Transformer-Based Multilingual G2P Converter for E-Learning System. Artificial Intelligence in HCI (546-556). Online publication date: 26-Jun-2022.
https://dl.acm.org/doi/10.1007/978-3-031-05643-7_35
Long Y, Wei S, Lian J, Li Y (2021). Pronunciation augmentation for Mandarin-English code-switching speech recognition. EURASIP Journal on Audio, Speech, and Music Processing 2021:1. Online publication date: 30-Aug-2021.
https://dl.acm.org/doi/10.1186/s13636-021-00222-7
Wenger E, Bronckers M, Cianfarani C, Cryan J, Sha A, Zheng H, Zhao B, Kim Y, Kim J, Vigna G, Shi E (2021). "Hello, It's Me": Deep Learning-based Speech Synthesis Attacks in the Real World. Proceedings of the 2021 ACM SIGSAC Conference on Computer and Communications Security (235-251). Online publication date: 12-Nov-2021.
https://dl.acm.org/doi/10.1145/3460120.3484742
Nguyen T, Jatowt A, Coustaty M, Doucet A (2021). Survey of Post-OCR Processing Approaches. ACM Computing Surveys 54:6 (1-37). Online publication date: 13-Jul-2021.
https://dl.acm.org/doi/10.1145/3453476
Cherifi E, Guerti M (2021). Arabic grapheme-to-phoneme conversion based on joint multi-gram model. International Journal of Speech Technology 24:1 (173-182). Online publication date: 1-Mar-2021.
https://dl.acm.org/doi/10.1007/s10772-020-09779-8
