Abstract
This paper presents the challenges encountered and the solutions adopted in the lemmatization and part-of-speech (PoS) tagging of a corpus of Old Portuguese texts (up to 1525), paving the way for the automatic annotation of these medieval texts. A highly granular tagset, previously devised for Modern Portuguese, was adapted to this end. A large text (\(\sim\)155 thousand words) was manually annotated for PoS and lemmata and used to train an initial PoS-tagger model. When applied to two other texts, the resulting model attained 91.2% precision on a textual variant of the same text and 67.4% on a new, unseen text. A second model was then trained on the data from the previous three texts and applied to two further unseen texts, achieving a precision of 77.3% and 82.4%, respectively.
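As a rough illustration of how figures such as those above can be obtained, the sketch below scores a tagger's output against a manually annotated reference. It assumes that the reported precision is the share of tokens whose predicted tag matches the gold tag; the abstract does not spell out the exact definition. The function name and the toy tags are hypothetical and do not come from the tagset used in the paper.

```python
from typing import Sequence


def tagging_precision(gold: Sequence[str], predicted: Sequence[str]) -> float:
    """Return the share of tokens whose predicted tag equals the gold tag.

    Assumption: both sequences are aligned token by token.
    """
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must be token-aligned")
    if not gold:
        raise ValueError("cannot score an empty sequence")
    correct = sum(g == p for g, p in zip(gold, predicted))
    return correct / len(gold)


# Hypothetical toy data: placeholder tags, not the paper's tagset.
gold_tags = ["DET", "NOUN", "VERB", "PREP", "NOUN"]
model_tags = ["DET", "NOUN", "VERB", "PREP", "ADJ"]

score = tagging_precision(gold_tags, model_tags)
print(f"{score:.1%}")  # 4 of 5 tags match, so this prints 80.0%
```

The same comparison could be run on lemmata instead of tags, provided the two annotation layers share the same tokenization.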
Research for this paper was partially funded by public funds through Fundação para a Ciência e a Tecnologia: J. Baptista and F. Batista (INESC-ID Lisboa, proj. ref. UIDB/50021/2020), E. Cardeira (School of Arts and Humanities, Center of Linguistics, University of Lisbon, proj. ref. UIDP/00214/2020) and M.I. Bico through a Ph.D. grant (proj. ref. UI/BD/152806/2022).
Notes
- 1.
http://teitok.clul.ul.pt/teitok/cta/ (last access: 2022/09/05 01:11:19). All the remaining URLs in this paper were checked on this date.
- 2.
- 3.
- 4.
Biblioteca Nacional (Portugal), Alc.198, fls. 1r–155r.
- 5.
Biblioteca do Museu de Aveiro, ms. 1 [33/CD], fls. 48a–110b.
- 6.
Biblioteca Mun. Porto, Safe n. 527 (Cat. n. 683), ff. 196v–208v.
- 7.
Arq. Mun. Alfredo Pimenta (Guimarães), Ms. da Colegiada 793, fls. 211r–236r.
- 8.
Arq. Nac. Torre do Tombo. Fragm., Cx. 21, n.26 (Casa Forte). Lorvão, Livro 10, fl. 13r. Fragm., Cx. 21, n.23a (Casa Forte).
- 9.
Lisboa, Valentim Fernandes, [1496?]. Biblioteca Nacional (Portugal), Inc. 571.
- 10.
Biblioteca do Museu de Aveiro, ms. 1 [33/CD], fls. 48a-110b.
- 11.
Copyright information
© 2022 Springer Nature Switzerland AG
About this paper
Cite this paper
Bico, M.I., Baptista, J., Batista, F., Cardeira, E. (2022). Early Experiments on Automatic Annotation of Portuguese Medieval Texts. In: Silvello, G., et al. Linking Theory and Practice of Digital Libraries. TPDL 2022. Lecture Notes in Computer Science, vol 13541. Springer, Cham. https://doi.org/10.1007/978-3-031-16802-4_44
DOI: https://doi.org/10.1007/978-3-031-16802-4_44
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-16801-7
Online ISBN: 978-3-031-16802-4
eBook Packages: Computer Science, Computer Science (R0)