Unsupervised models for morpheme segmentation and morphology learning

Authors:

Mathias Creutz,

Krista Lagus

ACM Transactions on Speech and Language Processing (TSLP), Volume 4, Issue 1

Article No.: 3, Pages 1 - 34

https://doi.org/10.1145/1187415.1187418

Published: 02 February 2007

Abstract

We present a model family called Morfessor for the unsupervised induction of a simple morphology from raw text data. The model is formulated in a probabilistic maximum a posteriori framework. Morfessor can handle highly inflecting and compounding languages where words can consist of lengthy sequences of morphemes. A lexicon of word segments, called morphs, is induced from the data. The lexicon stores information about both the usage and form of the morphs. Several instances of the model are evaluated quantitatively in a morpheme segmentation task on different sized sets of Finnish as well as English data. Morfessor is shown to perform very well compared to a widely known benchmark algorithm, in particular on Finnish data.
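
The maximum a posteriori formulation mentioned in the abstract can be written generically as below. This is only the standard MAP decomposition via Bayes' rule, not the paper's own equations; the symbols $\mathcal{M}$ (the model, here understood as the morph lexicon plus its grammar) and $D$ (the raw training text) are notation assumed for illustration.

```latex
\hat{\mathcal{M}}
  = \arg\max_{\mathcal{M}} P(\mathcal{M} \mid D)
  = \arg\max_{\mathcal{M}} P(D \mid \mathcal{M}) \, P(\mathcal{M})
```

Taking negative logarithms turns the same objective into minimizing a two-part cost, $-\log P(\mathcal{M}) - \log P(D \mid \mathcal{M})$: the first term penalizes a large or complex lexicon, and the second penalizes a poor fit to the data.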

Index Terms

Unsupervised models for morpheme segmentation and morphology learning
1. Computing methodologies
  1. Artificial intelligence
    1. Natural language processing
  2. Machine learning
    1. Learning paradigms
2. Mathematics of computing
  1. Information theory

Published In

ACM Transactions on Speech and Language Processing Volume 4, Issue 1

January 2007

68 pages

ISSN:1550-4875

EISSN:1550-4883

DOI:10.1145/1187415

Copyright © 2007 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States
