Recovering Word Forms by Context for Morphologically Rich Languages

A. M. Alekseev¹ & S. I. Nikolenko¹,²

In this work, we focus on “sentence-level unlemmatization,” the task of generating a grammatical sentence from a lemmatized one; this task is usually easy for humans but may present problems for machine learning models. We treat this setting as a machine translation problem and, as a first attempt, apply a sequence-to-sequence model to the texts of Russian Wikipedia articles, quantitatively evaluate the effect of different training set sizes, and achieve a BLEU score of 67.3 using the largest training set available. We discuss preliminary results and the flaws of traditional machine translation evaluation methods for this task and suggest directions for future research.
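To make the task setup concrete, the following minimal sketch shows how a (lemmatized source, original target) training pair might be built for such a translation-style formulation. It is an illustration under stated assumptions, not the authors' pipeline: the `TOY_LEMMAS` dictionary is a hypothetical stand-in for a real morphological analyzer, and the helper names `lemmatize`, `make_pair`, and `copy_baseline_match` are invented here.

```python
# Hypothetical toy lemma dictionary standing in for a real morphological
# analyzer; the entries are illustrative only.
TOY_LEMMAS = {"мыла": "мыть", "раму": "рама", "кошки": "кошка", "спят": "спать"}


def lemmatize(tokens):
    """Map each token to its lemma; unknown tokens are left unchanged."""
    return [TOY_LEMMAS.get(token, token) for token in tokens]


def make_pair(sentence):
    """Build one training pair: lemmatized tokens (source), original tokens (target)."""
    target = sentence.lower().split()
    source = lemmatize(target)
    return source, target


def copy_baseline_match(source, target):
    """Fraction of positions where copying the lemma already gives the right form."""
    matches = sum(s == t for s, t in zip(source, target))
    return matches / len(target)


src, tgt = make_pair("Мама мыла раму")
print(" ".join(src), "->", " ".join(tgt))
print(copy_baseline_match(src, tgt))
```

In this toy pair, one of the three tokens is identical in the source and the target, so even a system that merely copies its input gets part of the sentence right. This may be one reason why overlap-based scores need careful interpretation in this setting; that reading is an assumption of this sketch rather than a result stated in the abstract.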

Author information

Authors and Affiliations

St.Petersburg Department of Steklov Mathematical Institute RAS, St. Petersburg, Russia
A. M. Alekseev & S. I. Nikolenko
St.Petersburg State University, St. Petersburg, Russia
S. I. Nikolenko

Corresponding author

Correspondence to A. M. Alekseev.

Additional information

Published in Zapiski Nauchnykh Seminarov POMI, Vol. 499, 2021, pp. 129–136.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Alekseev, A.M., Nikolenko, S.I. Recovering Word Forms by Context for Morphologically Rich Languages. J Math Sci 273, 527–532 (2023). https://doi.org/10.1007/s10958-023-06518-7

Received: 11 February 2019
Published: 22 June 2023
Issue Date: July 2023
DOI: https://doi.org/10.1007/s10958-023-06518-7