Unsupervised models for morpheme segmentation and morphology learning

Published: 02 February 2007

Abstract

We present a model family called Morfessor for the unsupervised induction of a simple morphology from raw text data. The model is formulated in a probabilistic maximum a posteriori framework. Morfessor can handle highly inflecting and compounding languages where words can consist of lengthy sequences of morphemes. A lexicon of word segments, called morphs, is induced from the data. The lexicon stores information about both the usage and form of the morphs. Several instances of the model are evaluated quantitatively in a morpheme segmentation task on different sized sets of Finnish as well as English data. Morfessor is shown to perform very well compared to a widely known benchmark algorithm, in particular on Finnish data.
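
The maximum a posteriori idea in the abstract can be illustrated with a small sketch. It scores a segmentation by the sum of a lexicon cost, -log P(lexicon), and a corpus cost, -log P(corpus | lexicon), and prefers the segmentation with the lower total. The function name `map_cost`, the uniform letter prior, the unigram morph model, and the four Finnish example words are assumptions chosen for illustration; they are not the priors or the search procedure used by Morfessor.

```python
import math
from collections import Counter


def map_cost(segmented_words, alphabet_size=26):
    """Toy negative log posterior (in nats) for one segmentation of a corpus.

    segmented_words: list of words, each given as a list of morph strings.
    This is an illustrative stand-in, not the Morfessor model itself.
    """
    tokens = [morph for word in segmented_words for morph in word]
    counts = Counter(tokens)
    total = len(tokens)

    # -log P(lexicon): every distinct morph is spelled out letter by letter,
    # plus one end-of-morph symbol, under a uniform symbol distribution.
    lexicon_cost = sum(
        (len(morph) + 1) * math.log(alphabet_size + 1) for morph in counts
    )

    # -log P(corpus | lexicon): every morph token is generated with its
    # maximum-likelihood unigram probability.
    corpus_cost = -sum(c * math.log(c / total) for c in counts.values())

    return lexicon_cost + corpus_cost


# Hypothetical example data: four Finnish word forms, unsplit and split.
unsplit = [["talossa"], ["talossani"], ["autossa"], ["autossani"]]
split = [
    ["talo", "ssa"],
    ["talo", "ssa", "ni"],
    ["auto", "ssa"],
    ["auto", "ssa", "ni"],
]

print(f"unsplit cost: {map_cost(unsplit):.2f}")
print(f"split cost:   {map_cost(split):.2f}")
```

In this toy setting the split analysis costs less, because the shared morphs are stored once in the lexicon and reused across word forms. That trade-off between a compact lexicon and a compact encoding of the corpus is what a MAP formulation of this kind balances.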

Published In

ACM Transactions on Speech and Language Processing, Volume 4, Issue 1
January 2007
68 pages
ISSN:1550-4875
EISSN:1550-4883
DOI:10.1145/1187415

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. Efficient storage
  2. highly inflecting and compounding languages
  3. language independent methods
  4. maximum a posteriori (MAP) estimation
  5. morpheme lexicon and segmentation
  6. unsupervised learning

Qualifiers

  • Article
