In this work, we focus on “sentence-level unlemmatization,” the task of generating a grammatical sentence from a lemmatized one; this task is usually easy for humans but may be difficult for machine learning models. We treat it as a machine translation problem and, as a first attempt, apply a sequence-to-sequence model to the texts of Russian Wikipedia articles, quantitatively evaluate the effect of training set size, and achieve a BLEU score of 67.3 using the largest training set available. We discuss preliminary results and the flaws of traditional machine translation evaluation methods for this task, and suggest directions for future research.
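To make the translation framing concrete, the following minimal sketch shows how one (source, target) training pair could be formed: the lemmatized sentence is the source, and the original sentence is the target. The lemma table `TOY_LEMMAS` and the helpers `lemmatize` and `make_pair` are hypothetical names introduced only for illustration; they are not taken from the paper, and a real pipeline would presumably rely on a morphological analyzer rather than a hand-written dictionary.

```python
from typing import Tuple

# Hypothetical toy lemma table for illustration only; a real setup
# would use a proper morphological analyzer for Russian.
TOY_LEMMAS = {"мыла": "мыть", "раму": "рама"}


def lemmatize(sentence: str) -> str:
    """Replace each whitespace-separated token with its lemma, if known."""
    return " ".join(TOY_LEMMAS.get(token, token) for token in sentence.split())


def make_pair(sentence: str) -> Tuple[str, str]:
    """Build one training pair: (lemmatized source, original target)."""
    return lemmatize(sentence), sentence


source, target = make_pair("мама мыла раму")
print(source)  # input to the sequence-to-sequence model
print(target)  # sentence the model should reconstruct
```

Under this framing, a model sees only `source` and is trained to produce `target`, so any parallel corpus can be derived automatically from raw text.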
Published in Zapiski Nauchnykh Seminarov POMI, Vol. 499, 2021, pp. 129–136.
Alekseev, A.M., Nikolenko, S.I. Recovering Word Forms by Context for Morphologically Rich Languages. J Math Sci 273, 527–532 (2023).
