In this work, we focus on “sentence-level unlemmatization,” the task of generating a grammatical sentence from a lemmatized one; this task is usually easy for humans but may be difficult for machine learning models. We treat this setting as a machine translation problem and, as a first attempt, apply a sequence-to-sequence model to the texts of Russian Wikipedia articles, quantitatively evaluate the effect of training set size, and achieve a BLEU score of 67.3 using the largest training set available. We discuss preliminary results and the flaws of traditional machine translation evaluation methods for this task, and suggest directions for future research.
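The task setup can be illustrated with a minimal sketch. It is not the authors' pipeline: the lemma table `TOY_LEMMAS`, the helper names `lemmatize`, `make_pair`, and `ngram_precision`, and the example sentence are all hypothetical, and a real system would presumably obtain lemmas from a morphological analyzer. The sketch builds one (lemmatized source, original target) pair of the kind a sequence-to-sequence model could be trained on, and computes clipped n-gram precision, the building block of BLEU.

```python
from collections import Counter

# Hypothetical toy lemma table (an assumption for illustration only);
# a real pipeline would use a morphological analyzer instead.
TOY_LEMMAS = {
    "кошки": "кошка",
    "спят": "спать",
    "на": "на",
    "диване": "диван",
}


def lemmatize(sentence):
    """Map each token to its lemma; unknown tokens pass through unchanged."""
    return " ".join(TOY_LEMMAS.get(tok, tok) for tok in sentence.lower().split())


def make_pair(sentence):
    """Return (source, target) = (lemmatized sentence, lowercased original)."""
    return lemmatize(sentence), sentence.lower()


def ngram_precision(hypothesis, reference, n):
    """Clipped n-gram precision of hypothesis against a single reference."""
    hyp = hypothesis.split()
    ref = reference.split()
    hyp_ngrams = Counter(tuple(hyp[i:i + n]) for i in range(len(hyp) - n + 1))
    ref_ngrams = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
    total = sum(hyp_ngrams.values())
    if total == 0:
        return 0.0
    # Each hypothesis n-gram is credited at most as often as it occurs
    # in the reference.
    matched = sum(min(count, ref_ngrams[gram]) for gram, count in hyp_ngrams.items())
    return matched / total


source, target = make_pair("Кошки спят на диване")
print(source)                                # lemmatized model input
print(target)                                # sentence the model should produce
print(ngram_precision(source, target, 1))    # unigram overlap of the unchanged input
print(ngram_precision(source, target, 2))    # bigram overlap of the unchanged input
```

In this toy pair, only the uninflected preposition survives lemmatization unchanged, so leaving the input untouched earns some unigram credit but no bigram credit. Full BLEU additionally combines several n-gram orders and applies a brevity penalty, which this sketch omits.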
Additional information
Published in Zapiski Nauchnykh Seminarov POMI, Vol. 499, 2021, pp. 129–136.
About this article
Cite this article
Alekseev, A.M., Nikolenko, S.I. Recovering Word Forms by Context for Morphologically Rich Languages. J Math Sci 273, 527–532 (2023). https://doi.org/10.1007/s10958-023-06518-7
DOI: https://doi.org/10.1007/s10958-023-06518-7