Abstract
The paper presents a system for joint morphosyntactic disambiguation and segmentation of Polish based on conditional random fields (CRFs). The system is coupled with Morfeusz, a morphosyntactic analyzer for Polish, which represents both morphosyntactic and segmentation ambiguities in the form of a directed acyclic graph (DAG). We rely on constrained linear-chain CRFs generalized to work directly on DAGs, which allows us to perform segmentation as a by-product of morphosyntactic disambiguation. This is in contrast with other existing taggers for Polish, which either neglect the problem of segmentation or rely on heuristics to perform it in a pre-processing stage. We evaluate our system on historical corpora of Polish, where segmentation ambiguities are more prominent than in contemporary Polish, and show that our system significantly outperforms several baseline segmentation methods.
Access this chapter
Tax calculation will be finalised at checkout
Purchases are for personal use only
Similar content being viewed by others
Notes
- 1.
Based on algorithms involving automatic extraction of rules.
- 2.
- 3.
By extension, this holds true also for ensemble taggers, e.g. PoliTa [8].
- 4.
Intuitively, \(f_k\) has a positive influence on the modeled probability if \(\theta _k > 0\), negative influence if \(\theta _k < 0\), and no influence whatsoever if \(\theta _k = 0\).
- 5.
With \(r_i = Y\) for out-of-vocabulary words.
- 6.
Note that these results abstract from the potential morphosyntactic analysis errors.
- 7.
Increasing all counts by 1 makes the probability of unseed segments equal to 1/2.
References
Acedański, S.: A morphosyntactic brill tagger for inflectional languages. In: Loftsson, H., Rögnvaldsson, E., Helgadóttir, S. (eds.) NLP 2010. LNCS (LNAI), vol. 6233, pp. 3–14. Springer, Heidelberg (2010). https://doi.org/10.1007/978-3-642-14770-8_3
Calzolari, N., et al., (eds.): Proceedings of the Ninth International Conference on Language Resources and Evaluation, LREC 2014. ELRA, Reykjavík, Iceland (2014). http://www.lrec-conf.org/proceedings/lrec2014/index.html
Chen, X., Qiu, X., Zhu, C., Liu, P., Huang, X.: Long short-term memory neural networks for Chinese word segmentation. In: Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pp. 1197–1206. ACL (2015). http://www.aclweb.org/anthology/D15-1141
Dębowski, L.: Trigram morphosyntactic tagger for Polish. In: Kłopotek, M.A., Wierzchoń, S.T., Trojanowski, K. (eds.) Intelligent Information Processing and Web Mining, pp. 409–413. Springer, Heidelberg (2004). https://doi.org/10.1007/978-3-540-39985-8_43
Kieraś, W., Komosińska, D., Modrzejewski, E., Woliński, M.: Morphosyntactic annotation of historical texts. The making of the baroque corpus of Polish. In: Ekštein, K., Matoušek, V. (eds.) TSD 2017. LNCS (LNAI), vol. 10415, pp. 308–316. Springer, Cham (2017). https://doi.org/10.1007/978-3-319-64206-2_35
Kieraś, W., Woliński, M.: Manually annotated corpus of Polish texts published between 1830 and 1918. In: Proceedings of the Ninth International Conference on Language Resources and Evaluation, LREC 2018. ELRA, Miyazaki, Japan (2018)
Kobyliński, Ł., Ogrodniczuk, M.: Results of the PolEval 2017 competition: part-of-speech tagging shared task. In: Vetulani and Paroubek [17], pp. 362–366
Kobyliński, Ł.: PoliTa: A multitagger for Polish. In: Calzolari et al. [2], pp. 2949–2954. http://www.lrec-conf.org/proceedings/lrec2014/index.html
Krasnowska-Kieraś, K.: Morphosyntactic disambiguation for Polish with bi-LSTM neural networks. In: Vetulani and Paroubek [17], pp. 367–371
Kudo, T., Yamamoto, K., Matsumoto, Y.: Applying conditional random fields to Japanese morphological analysis. In: Proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing (2004). http://www.aclweb.org/anthology/W04-3230
Peng, F., Feng, F., McCallum, A.: Chinese segmentation and new word detection using conditional random fields. In: COLING 2004: Proceedings of the 20th International Conference on Computational Linguistics (2004). http://www.aclweb.org/anthology/C04-1081
Piasecki, M., Wardyński, A.: Multiclassifier approach to tagging of Polish. In: Proceedings of the International Multiconference on ISSN, vol. 1896, p. 7094
Radziszewski, A.: A tiered CRF tagger for Polish. In: Bembenik, R., Skonieczny, L., Rybinski, H., Kryszkiewicz, M., Niezgodka, M. (eds.) Intelligent Tools for Building a Scientific Information Platform, pp. 215–230. Springer, Heidelberg (2013). https://doi.org/10.1007/978-3-642-35647-6_16
Radziszewski, A., Acedański, S.: Taggers gonna tag: an argument against evaluating disambiguation capacities of morphosyntactic taggers. In: Sojka, P., Horák, A., Kopeček, I., Pala, K. (eds.) TSD 2012. LNCS (LNAI), vol. 7499, pp. 81–87. Springer, Heidelberg (2012). https://doi.org/10.1007/978-3-642-32790-2_9
Radziszewski, A., Śniatowski, T.: Maca-a configurable tool to integrate Polish morphological data. In: Proceedings of the Second International Workshop on Free/Open-Source Rule-Based Machine Translation (2011)
Sutton, C., McCallum, A.: An introduction to conditional random fields. Found. Trends® Mach. Learn. 4(4), 267–373 (2012)
Vetulani, Z., Paroubek, P. (eds.): Proceedings of the 8th Language & Technology Conference: Human Language Technologies as a Challenge for Computer Science and Linguistics. Fundacja Uniwersytetu im. Adama Mickiewicza w Poznaniu, Poznań, Poland (2017)
Wainwright, M.J., Jordan, M.I., et al.: Graphical models, exponential families, and variational inference. Found. Trends® Mach. Learn. 1(1–2), 1–305 (2008)
Walentynowicz, W.: MorphoDiTa-based tagger for Polish language (2017), CLARIN-PL digital repository. http://hdl.handle.net/11321/425
Waszczuk, J.: Harnessing the CRF complexity with domain-specific constraints. The case of morphosyntactic tagging of a highly inflected language. In: Proceedings of COLING 2012, pp. 2789–2804 (2012). http://www.aclweb.org/anthology/C12-1170
Woliński, M.: Morfeusz reloaded. In: Calzolari et al. [2], pp. 1106–1111. http://www.lrec-conf.org/proceedings/lrec2014/index.html
Wróbel, K.: KRNNT: Polish recurrent neural network tagger. In: Vetulani, Paroubek [17], pp. 386–391
Acknowledgements
The work being reported was partially supported by a National Science Centre, Poland grant DEC-2014/15/B/HS2/03119.
Author information
Authors and Affiliations
Corresponding author
Editor information
Editors and Affiliations
Rights and permissions
Copyright information
© 2018 Springer Nature Switzerland AG
About this paper
Cite this paper
Waszczuk, J., Kieraś, W., Woliński, M. (2018). Morphosyntactic Disambiguation and Segmentation for Historical Polish with Graph-Based Conditional Random Fields. In: Sojka, P., Horák, A., Kopeček, I., Pala, K. (eds) Text, Speech, and Dialogue. TSD 2018. Lecture Notes in Computer Science(), vol 11107. Springer, Cham. https://doi.org/10.1007/978-3-030-00794-2_20
Download citation
DOI: https://doi.org/10.1007/978-3-030-00794-2_20
Published:
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-00793-5
Online ISBN: 978-3-030-00794-2
eBook Packages: Computer ScienceComputer Science (R0)