Training and evaluation of vector models for Galician

  • Project Notes
  • Published:
Language Resources and Evaluation

Abstract

This paper presents a large and systematic assessment of distributional models for Galician. To this end, we have first trained and evaluated static word embeddings (e.g., word2vec, GloVe), and then compared their performance with that of current contextualised representations generated by neural language models. To do so, we have compiled and processed a large corpus for Galician, and created four datasets for word analogies and concept categorisation based on standard resources for other languages. Using the aforementioned corpus, we have trained 760 static vector space models which vary in their input representations (e.g., adjacency-based versus dependency-based approaches), learning algorithms, size of the surrounding contexts, and number of vector dimensions. These models have been evaluated both intrinsically, using the newly created datasets, and on extrinsic tasks, namely POS-tagging, dependency parsing, and named entity recognition (NER). The results provide new insights into the performance of different vector models in Galician, and into the impact of several training parameters on each task. In general, fastText embeddings are the static representations with the best performance in the intrinsic evaluations and in named entity recognition, while syntax-based embeddings achieve the highest results in POS-tagging and dependency parsing, indicating that there is no significant correlation between performance in the intrinsic and extrinsic tasks. Finally, we have compared the performance of static vector representations with that of BERT-based word embeddings, whose fine-tuning obtains the best performance on named entity recognition. This comparison provides a comprehensive picture of the state of the art of vector models for Galician, and we release new transformer-based models for NER.
All the resources used in this research are freely available to the community, and the best models have been incorporated into SemantiGal, an online tool to explore vector representations for Galician.
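As an illustrative aside: the nearest-neighbour queries that an exploration tool such as SemantiGal runs over vector models like these reduce to cosine similarity. A minimal sketch with invented toy vectors (the Galician words are real, but the values are made up and are not the released embeddings):

```python
import math

# Hypothetical toy vectors (real models use hundreds of dimensions and
# are trained on a large corpus; these values are invented).
vectors = {
    "casa":   [0.9, 0.1, 0.0],   # "house"
    "fogar":  [0.8, 0.2, 0.1],   # "home"
    "correr": [0.0, 0.9, 0.4],   # "to run"
}

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def nearest(word):
    """Most similar word to `word` by cosine similarity."""
    return max((w for w in vectors if w != word),
               key=lambda w: cosine(vectors[word], vectors[w]))

print(nearest("casa"))  # → fogar (the semantically closest toy vector)
```

The same lookup, scaled to full vocabularies, underlies both the intrinsic similarity evaluations and the neighbourhood visualisations described in the paper.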

Figs. 1–3 (images omitted).

Notes

  1. In the case of the transformer architecture, a pretrained model is fine-tuned using labelled data rather than trained from scratch (Vaswani et al., 2017; Devlin et al., 2019).

  2. https://osf.io/s9xfd.

  3. https://github.com/marcospln/vector_models_evaluation

  4. A selection of the best models for each task is available at https://zenodo.org/record/6771303

  5. The results for each model and task are available at https://osf.io/6eajf.

  6. https://huggingface.co/marcosgg/bert-base-gl-SLI-NER.

  7. https://tec.citius.usc.es/demos-lingua/.

  8. Current language models based on deep neural networks are often evaluated intrinsically using perplexity. They can also be evaluated with other approaches, such as fine-tuning their weights on labelled data for a downstream task (Devlin et al., 2019), analysing their prediction probabilities on controlled datasets (Marvin & Linzen, 2018), or using their internal representations either to train probing classifiers (Conneau et al., 2018) or to assess their behaviour across the network (Garcia, 2021). More recently, the prompt-based strategy has emerged as a new method to assess the performance of generative models (Liu et al., 2023).
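As a concrete illustration of the perplexity metric mentioned in this note, here is a toy example using an add-one-smoothed bigram model over an invented five-token text (not a neural LM, but the quantity computed is the same):

```python
import math
from collections import Counter

# Toy illustration of perplexity: an add-one-smoothed bigram model over
# a five-token invented text. Neural LMs compute the same quantity from
# their predicted token probabilities.
train = "o can come o pan".split()
vocab_size = len(set(train))
bigrams = Counter(zip(train, train[1:]))
unigrams = Counter(train)

def prob(prev, word):
    """P(word | prev) with Laplace (add-one) smoothing."""
    return (bigrams[(prev, word)] + 1) / (unigrams[prev] + vocab_size)

def perplexity(tokens):
    """exp of the average negative log-probability per predicted token."""
    log_p = sum(math.log(prob(p, w)) for p, w in zip(tokens, tokens[1:]))
    return math.exp(-log_p / (len(tokens) - 1))

print(round(perplexity("o can come".split()), 3))  # ≈ 2.739
```

Lower perplexity means the model assigns higher probability to the held-out text; the same definition applies when the probabilities come from a transformer instead of smoothed counts.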

  9. It is debatable whether the linguistically similar varieties spoken in other regions of Spain are Galician, Portuguese, or independent languages (Gargallo Gil, 2007; Costas González, 2007).

  10. https://universaldependencies.org/format.html.

  11. https://academia.gal/dicionario.

  12. http://estraviz.org.

  13. https://ilg.usc.gal/ddd/.

  14. This should not be seen as a simplification of the datasets, as the target entries include concepts with a wide range of frequencies in Galician corpora.

  15. The splits used to train and evaluate our models are available in the following repository: https://github.com/marcospln/evaluation_splits_gl.

  16. https://universaldependencies.org/u/pos/all.html.

  17. Dubbed SLI_CTG_POS.1.0, available at https://github.com/xavier-gz/SLI_Galician_Corpora.

  18. To train the models we have used the original implementations provided by the authors of each tool.

  19. Even though fastText does not explicitly model morphology, the subwords tend to correspond to linguistic affixes (e.g., prefixes, suffixes, stems).
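To make this note concrete, the sketch below enumerates the character n-grams that fastText associates with a word; the boundary markers and the 3–6 length range follow the fastText defaults, and the example word is arbitrary:

```python
def char_ngrams(word, n_min=3, n_max=6):
    """Enumerate fastText-style subwords: all character n-grams of the
    word wrapped in boundary markers '<' and '>', plus the full word."""
    wrapped = f"<{word}>"
    grams = [wrapped[i:i + n]
             for n in range(n_min, n_max + 1)
             for i in range(len(wrapped) - n + 1)]
    return grams + [wrapped]

# Several of these n-grams coincide with linguistic units: '<ca' covers
# a word-initial sequence, 'sa>' a word-final one.
print(char_ngrams("casa", 3, 4))
```

A word's fastText vector is the sum of the vectors of these subwords, which is why unseen words still receive representations and why the n-grams often align with prefixes, suffixes, and stems.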

  20. We have also performed preliminary evaluations with the 100 dimensional word2vec embeddings provided by Zeman et al. (2017), with lower results than the official fastText models.

  21. https://fasttext.cc/docs/en/pretrained-vectors.html.

  22. https://fasttext.cc/docs/en/crawl-vectors.html

  23. https://github.com/huggingface/transformers.

  24. Recent studies explore the combination of contextualised word representations to obtain type-level word embeddings (Chronis & Erk, 2020; Vulic et al., 2020; Lenci et al., 2022).
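One simple recipe from the studies cited in this note is to build a type-level vector for a word by averaging its contextualised token vectors across occurrences. A sketch with invented stand-in values (real pipelines would take the vectors from a model such as BERT):

```python
# Hypothetical contextualised vectors for two occurrences of the same
# word (invented values standing in for BERT outputs).
occurrences = [
    [0.2, 0.8, 0.1],  # e.g., "banco" in context 1
    [0.4, 0.6, 0.3],  # e.g., "banco" in context 2
]

def type_vector(token_vectors):
    """Component-wise mean of a list of token vectors."""
    n = len(token_vectors)
    return [sum(dims) / n for dims in zip(*token_vectors)]

print(type_vector(occurrences))  # ≈ [0.3, 0.7, 0.2]
```

The cited works refine this with weighting or multi-prototype clustering, but the averaged vector is the usual baseline for comparing contextualised models against static embeddings on type-level tasks.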

  25. The results of the official fastText models and of the non-contextualised BERT approaches are discussed in Appendix D, as they perform worse than the new embeddings in most cases.

  26. http://glaros.dtc.umn.edu/gkhome/cluto/cluto/overview

  27. It is worth noting that the size of each relation in this dataset varies considerably, ranging from 12 questions for comparatives to 4,524 for world capitals, so that models with better performance on the large subsets are those with higher overall accuracies.
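Analogy questions such as the ones in this dataset are typically answered with the 3CosAdd method of Mikolov et al. (2013b): for a : a* :: b : ?, return the vocabulary word closest to b − a + a*, excluding the question words. A sketch with hand-picked toy vectors (the words and values are invented so the capital-of offset works out):

```python
import math

# Invented toy vectors; real evaluations use trained embeddings and
# thousands of questions.
vecs = {
    "Galicia":  [1.0, 0.0, 0.2],
    "Santiago": [1.0, 1.0, 0.2],
    "Portugal": [0.0, 0.1, 1.0],
    "Lisboa":   [0.0, 1.1, 1.0],
    "Vigo":     [0.9, 0.9, 0.3],   # distractor
}

def cos(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) *
                  math.sqrt(sum(b * b for b in v)))

def cosadd(a, a_star, b):
    """Answer a : a_star :: b : ? as argmax cos(x, b - a + a_star),
    excluding the three question words."""
    target = [vb - va + vs
              for va, vs, vb in zip(vecs[a], vecs[a_star], vecs[b])]
    return max((w for w in vecs if w not in {a, a_star, b}),
               key=lambda w: cos(vecs[w], target))

print(cosadd("Galicia", "Santiago", "Portugal"))  # → Lisboa
```

Accuracy on an analogy dataset is simply the fraction of questions for which this argmax returns the gold answer, which is why large relations dominate the overall score.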

  28. In this evaluation we have not used the subword information of the fastText embeddings, but only the word vectors learnt in the training corpus.

  29. We have also performed significance tests, whose results are reported in Appendix E.

  30. http://nilc.icmc.usp.br/nlpnet

  31. https://github.com/UppsalaNLP/uuparser

  32. It is worth noting that, due to the small amount of training data in TreeGal, we have not split it into training and development sets; instead, we ran the full 30 epochs without early stopping, using the whole original split for training. In this regard, Kann et al. (2019) compare various methods to train models for low-resource languages, finding significant differences depending on, among other factors, whether or not a development set was used.

  33. Although the datasets are standard, the results presented in the various papers are not directly comparable as both training data and procedures are slightly different. For instance, each paper used slightly different versions of the SLI_CTG_POS datasets, and while Vilares et al. (2021) reduce the TreeGal training data to create a development set, we use the whole training data without early stopping.

  34. It is worth mentioning that the results on both UD treebanks are better than those of the best models participating in the CoNLL 2018 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies (Zeman et al., 2018), although those systems used predicted instead of gold tokenisation: https://universaldependencies.org/conll18/results.html

  35. Note that these latter methods take advantage of additional data, such as training corpora from various treebanks, or combinations of several pretrained embeddings (see each paper for details) (Glavas & Vulic, 2021; Muller-Eberstein et al., 2021).

  36. With default values for the other parameters: https://github.com/achernodub/targer

  37. Instead of reporting the results of the BERT models evaluated by Vilares et al. (2021), we decided to train ad-hoc NER systems, as the splits used in both studies may be different.

  38. Both Bertinho models obtained better performance than mBERT in our experiments, while the NER results of these models presented in Vilares et al. (2021) were more variable.

  39. https://tec.citius.usc.es/demos-lingua/

  40. SemantiGal also includes a tool to explore predictive BERT models for Galician.

  41. The models for these languages have been trained on Wikipedia and mapped into a shared vector space with vecmap (Artetxe et al., 2018). This strategy obtained competitive results, as reported by Garcia et al. (2019).
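The mapping step mentioned in this note can be illustrated in isolation: in its supervised core, vecmap-style alignment solves an orthogonal Procrustes problem. A sketch with synthetic data (random vectors standing in for the two languages' embeddings; the full vecmap pipeline adds normalisation and iterative self-learning):

```python
import numpy as np

# Synthetic "source" embeddings and a "target" space that is an exact
# orthogonal rotation of them (invented data for illustration).
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))                 # source-language vectors
R_true, _ = np.linalg.qr(rng.normal(size=(4, 4)))
Y = X @ R_true                               # target-language vectors

# Procrustes: W = argmin ||X W - Y||_F with W orthogonal,
# solved via the SVD of X^T Y.
U, _, Vt = np.linalg.svd(X.T @ Y)
W = U @ Vt

print(np.allclose(X @ W, Y))                 # True: rotation recovered
```

With real embeddings the fit is only approximate, but the orthogonality constraint is what preserves monolingual distances while aligning the two spaces.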

  42. The dimensionality reduction is performed with the JavaScript implementation of t-SNE (van der Maaten & Hinton, 2008).
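A full t-SNE implementation is beyond the scope of a footnote, but the general idea of projecting embeddings to two dimensions for plotting can be sketched with PCA, a simpler linear reduction used here only as a stand-in (synthetic data):

```python
import numpy as np

# Project toy "word vectors" to 2-D for plotting. The visualiser uses
# t-SNE (non-linear); PCA shown here is a simpler linear stand-in.
rng = np.random.default_rng(1)
emb = rng.normal(size=(20, 10))      # 20 toy word vectors, 10 dims

centered = emb - emb.mean(axis=0)
# Top-2 principal directions = first right-singular vectors.
_, _, Vt = np.linalg.svd(centered, full_matrices=False)
coords_2d = centered @ Vt[:2].T      # one (x, y) point per word

print(coords_2d.shape)               # (20, 2)
```

t-SNE instead optimises the 2-D layout to preserve local neighbourhoods, which is why it is preferred for exploring semantic clusters in tools like SemantiGal.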

  43. https://github.com/xavier-gz/SLI_Galician_Corpora

References

  • Abadji, J., Ortiz Suarez, P., Romary, L., & Sagot, B. (2022). Towards a Cleaner Document-Oriented Multilingual Crawled Corpus. arXiv e-prints arXiv:2201.06642.

  • Agerri, R., Gómez Guinovart, X., Rigau, G., Solla Portela, M. A. (2018). Developing new linguistic resources and tools for the Galician language. In: Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), European Language Resources Association (ELRA), Miyazaki, Japan. https://aclanthology.org/L18-1367.

  • Agirre, E., Alfonseca, E., Hall, K., Kravalova, J., Paşca, M., & Soroa, A. (2009). A study on similarity and relatedness using distributional and WordNet-based approaches. In: Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics, Association for Computational Linguistics, Boulder, Colorado, pp 19–27, https://aclanthology.org/N09-1003.

  • Aina, L., Gulordava, K., & Boleda, G. (2019). Putting words in context: LSTM language models and lexical ambiguity. In: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics, Florence, Italy, pp 3342–3348, https://doi.org/10.18653/v1/P19-1324

  • Almuhareb, A. (2006). Attributes in lexical acquisition. PhD thesis, University of Essex.

  • Artetxe, M., Labaka, G., & Agirre, E. (2018). A robust self-learning method for fully unsupervised cross-lingual mappings of word embeddings. In: Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Association for Computational Linguistics, Melbourne, Australia, pp 789–798, https://doi.org/10.18653/v1/P18-1073.

  • Baldwin, T., Bannard, C., Tanaka, T., & Widdows, D. (2003). An empirical model of multiword expression decomposability. In: Proceedings of the ACL 2003 Workshop on Multiword Expressions: Analysis, Acquisition and Treatment, Association for Computational Linguistics, Sapporo, Japan, pp 89–96, https://doi.org/10.3115/1119282.1119294.

  • Bansal, M., Gimpel, K., & Livescu, K. (2014). Tailoring continuous word representations for dependency parsing. In: Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), Association for Computational Linguistics, Baltimore, Maryland, pp 809–815, https://doi.org/10.3115/v1/P14-2131.

  • Baroni, M., & Lenci, A. (2010). Distributional memory: A general framework for corpus-based semantics. Computational Linguistics,36(4), 673–721. https://doi.org/10.1162/coli_a_00016.

  • Baroni, M., & Lenci, A. (2011). How we BLESSed distributional semantic evaluation. In: Proceedings of the GEMS 2011 Workshop on GEometrical Models of Natural Language Semantics, Association for Computational Linguistics, Edinburgh, UK, pp 1–10.

  • Baroni, M., Evert, S., & Lenci, A. (2008). Bridging the gap between semantic theory and computational simulations. In: FOLLI (ed) Proceedings of the ESSLLI Workshop on Distributional Lexical Semantics, Hamburg.

  • Baroni, M., Murphy, B., Barbu, E., & Poesio, M. (2010). Strudel: A distributional semantic model based on property and types. Cognitive Science, 34(2), 222–254.

  • Baroni, M., Dinu, G., & Kruszewski, G. (2014). Don’t count, predict! a systematic comparison of context-counting vs. context-predicting semantic vectors. In: Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Association for Computational Linguistics, Baltimore, Maryland, pp 238–247, https://doi.org/10.3115/v1/P14-1023.

  • Batchkarov, M., Kober, T., Reffin, J., Weeds, J., & Weir, D. (2016). A critique of word similarity as a method for evaluating distributional semantic models. In: Proceedings of the 1st Workshop on Evaluating Vector-Space Representations for NLP, Association for Computational Linguistics, Berlin, Germany, pp 7–12, https://doi.org/10.18653/v1/W16-2502.

  • Bengio, Y., Ducharme, R., Vincent, P., & Jauvin, C. (2003). A neural probabilistic language model. Journal of Machine Learning Research., 3, 1137–1155.

  • Bentivogli, L., Bernardi, R., Marelli, M., Menini, S., Baroni, M., & Zamparelli, R. (2016). SICK through the SemEval glasses. Lesson learned from the evaluation of compositional distributional semantic models on full sentences through semantic relatedness and textual entailment. Language Resources and Evaluation, 50(1), 95–124.

  • Bojanowski, P., Grave, E., Joulin, A., & Mikolov, T. (2017). Enriching word vectors with subword information. Transactions of the Association for Computational Linguistics,5, 135–146. https://doi.org/10.1162/tacl_a_00051.

  • Boleda, G. (2020). Distributional semantics and linguistic theory. Annual Review of Linguistics., 6, 213–234.

  • Camacho-Collados, J., & Navigli, R. (2016). Find the word that does not belong: A framework for an intrinsic evaluation of word vector representations. In: Proceedings of the 1st Workshop on Evaluating Vector-Space Representations for NLP, Association for Computational Linguistics, Berlin, Germany, pp 43–50https://doi.org/10.18653/v1/W16-2508.

  • Chernodub, A., Oliynyk, O., Heidenreich, P., Bondarenko, A., Hagen, M., Biemann, C., & Panchenko, A. (2019). TARGER: Neural argument mining at your fingertips. In: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics: System Demonstrations, Association for Computational Linguistics, Florence, Italy, pp 195–200, https://doi.org/10.18653/v1/P19-3031.

  • Chiu, B., Korhonen, A., & Pyysalo, S. (2016). Intrinsic evaluation of word vectors fails to predict extrinsic performance. In: Proceedings of the 1st Workshop on Evaluating Vector-Space Representations for NLP, Association for Computational Linguistics, Berlin, Germany, pp 1–6, https://doi.org/10.18653/v1/W16-2501.

  • Chronis, G., & Erk, K. (2020). When is a bishop not like a rook? when it’s like a rabbi! multi-prototype BERT embeddings for estimating semantic relationships. In: Proceedings of the 24th Conference on Computational Natural Language Learning, Association for Computational Linguistics, Online, pp 227–244, https://doi.org/10.18653/v1/2020.conll-1.17.

  • Collobert, R., & Weston, J. (2008). A unified architecture for natural language processing: Deep neural networks with multitask learning. In: Proceedings of the 25th international conference on Machine learning, pp 160–167.

  • Collobert, R., Weston, J., Bottou, L., Karlen, M., Kavukcuoglu, K., & Kuksa, P. (2011). Natural Language Processing (Almost) From Scratch. Journal of Machine Learning Research, 12, 2493–2537.

  • Conneau, A., Kruszewski, G., Lample, G., Barrault, L., & Baroni, M. (2018). What you can cram into a single \$ &!#* vector: Probing sentence embeddings for linguistic properties. In: Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Association for Computational Linguistics, Melbourne, Australia, pp 2126–2136, https://doi.org/10.18653/v1/P18-1198.

  • Conneau, A., Khandelwal, K., Goyal, N., Chaudhary, V., Wenzek, G., Guzmán, F., Grave, E., Ott, M., Zettlemoyer, L., & Stoyanov, V. (2020). Unsupervised cross-lingual representation learning at scale. In: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics, Online, pp 8440–8451, https://doi.org/10.18653/v1/2020.acl-main.747.

  • Cordeiro, S., Villavicencio, A., Idiart, M., & Ramisch, C. (2019). Unsupervised compositionality prediction of nominal compounds. Computational Linguistics, 45(1), 1–57. https://doi.org/10.1162/coli_a_00341

  • Costas González, X. H. (2007). A Lingua Galega no Eo-Navia, Bierzo Occidental, As Portelas, Calabor e o Val do Ellas: Historia, Breve Caracterización e Situación Sociolingüística Actual, Cadernos de Lingua, vol Anexo 8. Real Academia Galega.

  • Devlin, J., Chang, M. W., Lee, K., & Toutanova, K. (2019). BERT: Pre-training of deep bidirectional transformers for language understanding. In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Association for Computational Linguistics, Minneapolis, Minnesota, pp 4171–4186https://doi.org/10.18653/v1/N19-1423.

  • de Dios-Flores, I., & Garcia, M. (2022). A computational psycholinguistic evaluation of the syntactic abilities of Galician BERT models at the interface of dependency resolution and training time. Procesamiento del Lenguaje Natural, 69, 15–26.

  • de Dios-Flores, I., Magariños, C., Vladu, A. I., Ortega, J. E., Pichel, J. R., García, M., Gamallo, P., Fernández Rei, E., Bugarín-Diz, A., González González, M., Barro, S., & Regueira, X.L. (2022). The nós project: Opening routes for the Galician language in the field of language technologies. In: Proceedings of the Workshop Towards Digital Language Equality within the 13th Language Resources and Evaluation Conference, European Language Resources Association, Marseille, France, pp 52–61.

  • Drozd, A., Gladkova, A., & Matsuoka, S. (2016). Word embeddings, analogies, and machine learning: Beyond king - man + woman = queen. In: Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers, The COLING 2016 Organizing Committee, Osaka, Japan, pp 3519–3530.

  • Erk, K., & Padó, S. (2008). A structured vector space model for word meaning in context. In: Proceedings of the 2008 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, Honolulu, Hawaii, pp 897–906.

  • Faruqui, M., Tsvetkov, Y., Rastogi, P., & Dyer, C. (2016). Problems with evaluation of word embeddings using word similarity tasks. In: Proceedings of the 1st Workshop on Evaluating Vector-Space Representations for NLP, Association for Computational Linguistics, Berlin, Germany, pp 30–35, https://doi.org/10.18653/v1/W16-2506.

  • Fernández Rei, F. (1991). Dialectoloxía da Lingua Galega. Vigo: Xerais de Galicia.

  • Firth, J. R. (1957). A synopsis of linguistic theory 1930-1955. Studies in Linguistic Analysis pp 1–32, reprinted in F.R. Palmer (ed.), Selected Papers of J.R. Firth 1952–1959. London: Longman (1968).

  • Fonseca, E. R., & Rosa, J. L. G. (2013). Mac-morpho revisited: Towards robust part-of-speech tagging. In: Proceedings of the 9th Brazilian Symposium in Information and Human Language Technology.

  • Freixeiro Mato, X. R. (2003). Gramática da Lingua Galega IV. Gramática do texto. Vigo: A Nosa Terra.

  • Gamallo, P. (2017). Comparing explicit and predictive distributional semantic models endowed with syntactic contexts. Language Resources and Evaluation, 51(3), 727–743.

  • Gamallo, P., Garcia, M., Sotelo, S., & Campos, J. R. P. (2014). Comparing Ranking-based and Naive Bayes Approaches to Language Detection on Tweets. Proceedings of TweetLID: Twitter Language Identification Workshop at SEPLN, 2014, pp. 12–16.

  • Gamallo, P., Garcia, M., Piñeiro, C., Martínez-Castaño, R., & Pichel, J. C. (2018). LinguaKit: A Big Data-based multilingual tool for linguistic analysis and information extraction. In: 2018 Fifth International Conference on Social Networks Analysis, Management and Security (SNAMS), IEEE, pp. 239–244.

  • Garcia, M. (2021). Exploring the representation of word meanings in context: A case study on homonymy and synonymy. In: Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), Association for Computational Linguistics, Online, pp 3625–3640, https://doi.org/10.18653/v1/2021.acl-long.281.

  • Garcia, M., & Crespo-Otero, A. (2022). A targeted assessment of the syntactic abilities of transformer models for Galician-Portuguese. In V. Pinheiro, P. Gamallo, R. Amaro, C. Scarton, F. Batista, D. Silva, C. Magro, & H. Pinto (Eds.), Computational Processing of the Portuguese Language (pp. 46–56). Cham: Springer International Publishing.

  • Garcia, M., & Gamallo, P. (2010). Análise morfossintáctica para português europeu e galego: Problemas, soluçoes e avaliaçao. Linguamática, 2(2), 59–67.

  • Garcia, M., Gómez-Rodríguez, C., & Alonso, M. A. (2018). New treebank or repurposed? On the feasibility of cross-lingual parsing of Romance languages with Universal Dependencies. Natural Language Engineering, 24(1), 91–122.

  • Garcia, M., García-Salido, M., & Alonso, M. A. (2019). Exploring cross-lingual word embeddings for the inference of bilingual dictionaries. In: Proceedings of TIAD-2019 Shared Task - Translation Inference Across Dictionaries, pp 32–41.

  • Garcia, M., Kramer Vieira, T., Scarton, C., Idiart, M., & Villavicencio, A. (2021). Probing for idiomaticity in vector space models. In: Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, Association for Computational Linguistics, Online, pp 3551–3564, https://doi.org/10.18653/v1/2021.eacl-main.310.

  • Garcia, M., Rodríguez, I., & Gamallo, P. (2022). SemantiGal: An online visualizer of vector representations for Galician. In: Proceedings of PROPOR 2022: International Conference on the Computational Processing of Portuguese. Demo Papers.

  • Gargallo Gil, J. E. (2007). Gallego-portugués, iberorromance. la ‘fala’ en su contexto románico peninsular. Limite Revista de Estudios Portugueses y de la Lusofonía, 1, 31–49.

  • Garí Soler, A., & Apidianaki, M. (2021). Let’s play mono-poly: BERT can reveal words’ polysemy level and partitionability into senses. Transactions of the Association for Computational Linguistics, 9, 825–844. https://doi.org/10.1162/tacl_a_00400

  • Gladkova, A., Drozd, A., & Matsuoka, S. (2016). Analogy-based detection of morphological and semantic relations with word embeddings: what works and what doesn’t. In: Proceedings of the NAACL Student Research Workshop, Association for Computational Linguistics, San Diego, California, pp 8–15, https://doi.org/10.18653/v1/N16-2002.

  • Glavaš, G., & Vulić, I. (2021). Climbing the tower of treebanks: Improving low-resource dependency parsing via hierarchical source selection. In: Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, Association for Computational Linguistics, Online, pp 4878–4888, https://doi.org/10.18653/v1/2021.findings-acl.431.

  • Glavaš, G., Litschko, R., Ruder, S., & Vulić, I. (2019). How to (properly) evaluate cross-lingual word embeddings: On strong baselines, comparative analyses, and some misconceptions. In: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics, Florence, Italy, pp 710–721, https://doi.org/10.18653/v1/P19-1070.

  • Grave, E., Bojanowski, P., Gupta, P., Joulin, A., & Mikolov, T. (2018). Learning word vectors for 157 languages. In: Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), European Language Resources Association (ELRA), Miyazaki, Japan, https://aclanthology.org/L18-1550.

  • Griffiths, T. L., Steyvers, M., & Tenenbaum, J. B. (2007). Topics in semantic representation. Psychological Review, 114(2), 211.

  • Gulordava, K., & Baroni, M. (2011). A distributional similarity approach to the detection of semantic change in the Google Books Ngram corpus. In: Proceedings of the GEMS 2011 workshop on geometrical models of natural language semantics, pp 67–71.

  • Harris, Z. S. (1954). Distributional structure. Word, 10(2–3), 146–162.

  • Hartmann, N., Fonseca, E., Shulby, C., Treviso, M., Silva, J., & Aluísio, S. (2017). Portuguese word embeddings: Evaluating on word analogies and natural language tasks. In: Proceedings of the 11th Brazilian Symposium in Information and Human Language Technology, Sociedade Brasileira de Computação, Uberlândia, Brazil, pp 122–131.

  • Huebner, P. A., & Willits, J. A. (2018). Structured semantic knowledge can emerge automatically from predicting word sequences in child-directed speech. Frontiers in Psychology, 9, 133.

  • Kann, K., Cho, K., & Bowman, S. R. (2019). Towards realistic practices in low-resource natural language processing: The development set. In: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Association for Computational Linguistics, Hong Kong, China, pp 3342–3349, https://doi.org/10.18653/v1/D19-1329.

  • Kim, Y., Chiu, Y. I., Hanaki, K., Hegde, D., & Petrov, S. (2014). Temporal analysis of language through neural language models. In: Proceedings of the ACL 2014 Workshop on Language Technologies and Computational Social Science, Association for Computational Linguistics, Baltimore, MD, USA, pp 61–65, https://doi.org/10.3115/v1/W14-2517.

  • Landauer, T. K., & Dumais, S. T. (1997). A solution to Plato’s problem: The latent semantic analysis theory of acquisition, induction, and representation of knowledge. Psychological Review, 104(2), 211.

  • Landauer, T. K., Laham, D., & Foltz, P. W. (1998). Learning human-like knowledge by singular value decomposition: A progress report. In: Advances in neural information processing systems, pp 45–51.

  • Lebret, R., & Collobert, R. (2015). Rehabilitation of count-based models for word vector representations. In: International Conference on Intelligent Text Processing and Computational Linguistics, Springer, pp 417–429.

  • Lenci, A., Sahlgren, M., Jeuniaux, P., Cuba Gyllensten, A., & Miliani, M. (2022). A comparative evaluation and analysis of three generations of Distributional Semantic Models. Language Resources and Evaluation, 56, 1269–1313. https://doi.org/10.1007/s10579-021-09575-z

  • Levy, O., & Goldberg, Y. (2014a). Dependency-based word embeddings. In: Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), Association for Computational Linguistics, Baltimore, Maryland, pp 302–308, https://doi.org/10.3115/v1/P14-2050.

  • Levy, O., & Goldberg, Y. (2014b). Linguistic regularities in sparse and explicit word representations. In: Proceedings of the Eighteenth Conference on Computational Natural Language Learning, Association for Computational Linguistics, Ann Arbor, Michigan, pp 171–180, https://doi.org/10.3115/v1/W14-1618.

  • Levy, O., Goldberg, Y., & Dagan, I. (2015). Improving distributional similarity with lessons learned from word embeddings. Transactions of the Association for Computational Linguistics,3, 211–225. https://doi.org/10.1162/tacl_a_00134.

  • de Lhoneux, M., Stymne, S., & Nivre, J. (2017). Arc-hybrid non-projective dependency parsing with a static-dynamic oracle. In: Proceedings of the The 15th International Conference on Parsing Technologies (IWPT)., Pisa, Italy.

  • Li, B., Liu, T., Zhao, Z., Tang, B., Drozd, A., Rogers, A., & Du, X. (2017). Investigating different syntactic context types and context representations for learning word embeddings. In: Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, Copenhagen, Denmark, pp 2421–2431, https://doi.org/10.18653/v1/D17-1257.

  • Lin, D. (1998). Automatic retrieval and clustering of similar words. In: COLING 1998 Volume 2: The 17th International Conference on Computational Linguistics.

  • Lin, D., & Pantel, P. (2001). Discovery of inference rules for question-answering. Natural Language Engineering, 7(4), 343–360.

  • Lindley Cintra, L. F., & Cunha, C. (1984). Nova Gramática do Português Contemporâneo. Lisbon: Livraria Sá da Costa.

  • Linzen, T. (2016). Issues in evaluating semantic spaces using word analogies. In: Proceedings of the 1st Workshop on Evaluating Vector-Space Representations for NLP, Association for Computational Linguistics, Berlin, Germany, pp 13–18, https://doi.org/10.18653/v1/W16-2503.

  • Lison, P., & Tiedemann, J. (2016). OpenSubtitles2016: Extracting large parallel corpora from movie and TV subtitles. In: Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC’16), European Language Resources Association (ELRA), Portorož, Slovenia, pp 923–929.

  • Liu, P., Yuan, W., Fu, J., Jiang, Z., Hayashi, H., & Neubig, G. (2023). Pre-train, prompt, and predict: A systematic survey of prompting methods in natural language processing. ACM Comput Surv. https://doi.org/10.1145/3560815

  • Lund, K. (1995). Semantic and associative priming in high-dimensional semantic space. In: Proceedings of the 17th Annual Conference of the Cognitive Science Society.

  • Lund, K., & Burgess, C. (1996). Producing high-dimensional semantic spaces from lexical co-occurrence. Behavior Research Methods, Instruments, & Computers, 28(2), 203–208.

  • Ma, X., & Hovy, E. (2016). End-to-end sequence labeling via bi-directional LSTM-CNNs-CRF. In: Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Association for Computational Linguistics, Berlin, Germany, pp 1064–1074, https://doi.org/10.18653/v1/P16-1101.

  • van der Maaten, L., & Hinton, G. (2008). Visualizing Data using t-SNE. Journal of Machine Learning Research, 9(86), 2579–2605.

  • Malvar, P., Pichel, J. R., Senra, Ó., Gamallo, P., & García, A. (2010). Vencendo a escassez de recursos computacionais. Carvalho: Tradutor automático estatístico inglês-galego a partir do corpus paralelo Europarl inglês-português. Linguamática, 2(2), 31–38.

  • Mandera, P., Keuleers, E., & Brysbaert, M. (2017). Explaining human performance in psycholinguistic tasks with models of semantic similarity based on prediction and counting: A review and empirical validation. Journal of Memory and Language, 92, 57–78.

  • Marvin, R., & Linzen, T. (2018). Targeted syntactic evaluation of language models. In: Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, Brussels, Belgium, pp 1192–1202, https://doi.org/10.18653/v1/D18-1151.

  • McDonald, S., & Ramscar, M. (2001). Testing the distributional hypothesis: The influence of context on judgements of semantic similarity. In: Proceedings of the Annual Meeting of the Cognitive Science Society, vol 23.

  • Melamud, O., McClosky, D., Patwardhan, S., & Bansal, M. (2016). The role of context types and dimensionality in learning word embeddings. In: Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Association for Computational Linguistics, San Diego, California, pp 1030–1040, https://doi.org/10.18653/v1/N16-1118.

  • Mikolov, T., Chen, K., Corrado, G., & Dean, J. (2013a). Efficient estimation of word representations in vector space. In: Workshop Proceedings of the International Conference on Learning Representations (ICLR) 2013, arXiv preprint arXiv:1301.3781.

  • Mikolov, T., Yih, W. t., & Zweig, G. (2013b). Linguistic regularities in continuous space word representations. In: Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Association for Computational Linguistics, Atlanta, Georgia, pp 746–751.

  • Mikolov, T., Grave, E., Bojanowski, P., Puhrsch, C., & Joulin, A. (2018). Advances in pre-training distributed word representations. In: Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), European Language Resources Association (ELRA), Miyazaki, Japan.

  • Miller, G. A. (1971). Empirical methods in the study of semantics. In: Semantics: An interdisciplinary reader in philosophy, linguistics, and psychology, pp 569–585.

  • Mira Mateus, M. H., Brito, A. M., Duarte, I., Hub Faria, I., Frota, S., Matos, G., Oliveira, F., Vigário, M., & Villalva, A. (2003). Gramática da Língua Portuguesa (6th ed.). Caminho.


  • Mitchell, J., & Lapata, M. (2010). Composition in distributional models of semantics. Cognitive Science, 34(8), 1388–1429.


  • Müller-Eberstein, M., van der Goot, R., & Plank, B. (2021). Genre as weak supervision for cross-lingual dependency parsing. In: Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, Online and Punta Cana, Dominican Republic, pp 4786–4802, https://doi.org/10.18653/v1/2021.emnlp-main.393.

  • Oliveira, H. G., Sousa, T., & Alves, A. (2020). TALES: Test set of Portuguese lexical-semantic relations for assessing word embeddings. In: Proceedings of the ECAI 2020 Workshop on Hybrid Intelligence for Natural Language Processing Tasks (HI4NLP), pp 41–47.

  • Ortega, J., de Dios-Flores, I., Pichel, J., & Gamallo, P. (2022a). A Neural Machine Translation System for Spanish to Galician through Portuguese Transliteration. In: 15th International Conference on Computational Processing of Portuguese (PROPOR 2022). Demo papers.

  • Ortega, J. E., de Dios-Flores, I., Pichel, J. R., & Gamallo, P. (2022b). Revisiting CCNet for quality measurements in Galician. In: International Conference on Computational Processing of the Portuguese Language, Springer, pp 407–412.

  • Padó, S., & Lapata, M. (2007). Dependency-based construction of semantic space models. Computational Linguistics, 33(2), 161–199. https://doi.org/10.1162/coli.2007.33.2.161


  • Pennington, J., Socher, R., & Manning, C. (2014). GloVe: Global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), Association for Computational Linguistics, Doha, Qatar, pp 1532–1543, https://doi.org/10.3115/v1/D14-1162.

  • Peters, M. E., Neumann, M., Iyyer, M., Gardner, M., Clark, C., Lee, K., & Zettlemoyer, L. (2018). Deep contextualized word representations. In: Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), Association for Computational Linguistics, New Orleans, Louisiana, pp 2227–2237, https://doi.org/10.18653/v1/N18-1202.

  • Qi, Y., Sachan, D., Felix, M., Padmanabhan, S., & Neubig, G. (2018). When and why are pre-trained word embeddings useful for neural machine translation? In: Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), Association for Computational Linguistics, New Orleans, Louisiana, pp 529–535, https://doi.org/10.18653/v1/N18-2084.

  • Querido, A., Carvalho, R., Rodrigues, J., Garcia, M., Silva, J., Correia, C., Rendeiro, N., Pereira, R., Campos, M., & Branco, A. (2017). Lx-lr4distsemeval: A collection of language resources for the evaluation of distributional semantic models of Portuguese. Revista da Associação Portuguesa de Linguística, 3, 265–283.


  • Rapp, R. (2003). Word sense discovery based on sense descriptor dissimilarity. In: Proceedings of the Ninth Machine Translation Summit, pp 315–322.

  • Řehůřek, R., & Sojka, P. (2010). Software Framework for Topic Modelling with Large Corpora. In: New Challenges for NLP Frameworks (NLPFrameworks 2010) at LREC 2010, ELRA, Valletta, Malta, pp 46–50, http://www.lrec-conf.org/proceedings/lrec2010/workshops/W10.pdf.

  • Rodrigues, J., Branco, A., Neale, S., & Silva, J. (2016). LX-DSemVectors: Distributional Semantics Models for Portuguese. In: International Conference on Computational Processing of the Portuguese Language, Springer, pp 259–270.

  • Rodríguez-Fernández, S., Espinosa-Anke, L., Carlini, R., & Wanner, L. (2016). Semantics-driven recognition of collocations using word embeddings. In: Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), Association for Computational Linguistics, Berlin, Germany, pp 499–505, https://doi.org/10.18653/v1/P16-2081.

  • Rogers, A., Kovaleva, O., & Rumshisky, A. (2020). A primer in BERTology: What we know about how BERT works. Transactions of the Association for Computational Linguistics, 8, 842–866. https://doi.org/10.1162/tacl_a_00349.

  • Rojo, G., López Martínez, M., Domínguez Noya, E., & Barcala, F. (2019). Corpus de adestramento do Etiquetador/Lematizador do Galego Actual (XIADA), versión 2.7. Centro Ramón Piñeiro para a investigación en humanidades.

  • Ruge, G. (1992). Experiments on linguistically-based term associations. Information Processing & Management, 28(3), 317–332.


  • Sagi, E., Kaufmann, S., & Clark, B. (2009). Semantic density analysis: Comparing word meaning across time and phonetic space. In: Proceedings of the Workshop on Geometrical Models of Natural Language Semantics, Association for Computational Linguistics, Athens, Greece, pp 104–111.

  • Sahlgren, M. (2008). The Distributional Hypothesis. Rivista di Linguistica (Italian Journal of Linguistics), 20(1), 33–53.


  • Samartim, R. (2012). Língua somos: A construção da ideia de língua e da identidade coletiva na Galiza (pré-) constitucional. In: Novas achegas ao estudo da cultura galega II: enfoques socio-históricos e lingüístico-literarios, pp 27–36.

  • Schnabel, T., Labutov, I., Mimno, D., & Joachims, T. (2015). Evaluation methods for unsupervised word embeddings. In: Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, Lisbon, Portugal, pp 298–307, https://doi.org/10.18653/v1/D15-1036.

  • Schütze, H. (1998). Automatic word sense discrimination. Computational Linguistics, 24(1), 97–123.


  • Smith, A., de Lhoneux, M., Stymne, S., & Nivre, J. (2018). An investigation of the interactions between pre-trained word embeddings, character models and POS tags in dependency parsing. In: Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, Brussels, Belgium, pp 2711–2720, https://doi.org/10.18653/v1/D18-1291.

  • Sousa, T., Gonçalo Oliveira, H., & Alves, A. (2020). Exploring Different Methods for Solving Analogies with Portuguese Word Embeddings. In: 9th Symposium on Languages, Applications and Technologies (SLATE 2020), Schloss Dagstuhl-Leibniz-Zentrum für Informatik.

  • Straka, M., & Straková, J. (2017). Tokenizing, POS Tagging, Lemmatizing and Parsing UD 2.0 with UDPipe. In: Proceedings of the CoNLL 2017 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies, Association for Computational Linguistics, Vancouver, Canada, pp 88–99, http://www.aclweb.org/anthology/K/K17/K17-3009.pdf.

  • Teyssier, P. (1987). História da Língua Portuguesa (3rd ed.). Lisbon: Livraria Sá da Costa.


  • Turney, P. D. (2006). Similarity of semantic relations. Computational Linguistics, 32(3), 379–416. https://doi.org/10.1162/coli.2006.32.3.379


  • Turney, P. D. (2012). Domain and function: A dual-space model of semantic relations and compositions. Journal of Artificial Intelligence Research, 44, 533–585.


  • Turney, P. D., & Littman, M. L. (2005). Corpus-based learning of analogies and semantic relations. Machine Learning, 60(1–3), 251–278.


  • Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L., & Polosukhin, I. (2017). Attention is all you need. In: 31st Conference on Neural Information Processing Systems (NIPS 2017), arXiv preprint arXiv:1706.03762.

  • Vilares, D., & Gómez-Rodríguez, C. (2018). Transition-based parsing with lighter feed-forward networks. In: Proceedings of the Second Workshop on Universal Dependencies (UDW 2018), Association for Computational Linguistics, Brussels, Belgium, pp 162–172, https://doi.org/10.18653/v1/W18-6019.

  • Vilares, D., Garcia, M., & Gómez-Rodríguez, C. (2021). Bertinho: Galician BERT representations. Procesamiento del Lenguaje Natural, 66, 13–26.


  • Vulić, I., Ponti, E. M., Litschko, R., Glavaš, G., & Korhonen, A. (2020). Probing pretrained language models for lexical semantics. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), Association for Computational Linguistics, Online, pp 7222–7240, https://doi.org/10.18653/v1/2020.emnlp-main.586.

  • Wang, B., Wang, A., Chen, F., Wang, Y., & Kuo, C. C. J. (2019). Evaluating word embedding models: Methods and experimental results. APSIPA Transactions on Signal and Information Processing, 8, 19.


  • Wenzek, G., Lachaux, M. A., Conneau, A., Chaudhary, V., Guzmán, F., Joulin, A., & Grave, E. (2020). CCNet: Extracting high quality monolingual datasets from web crawl data. In: Proceedings of the Twelfth Language Resources and Evaluation Conference, European Language Resources Association, Marseille, France, pp 4003–4012.

  • Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., Davison, J., Shleifer, S., von Platen, P., Ma, C., Jernite, Y., Plu, J., Xu, C., Le Scao, T., Gugger, S., Drame, M., Lhoest, Q., & Rush, A. (2020). Transformers: State-of-the-art natural language processing. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, Association for Computational Linguistics, Online, pp 38–45, https://doi.org/10.18653/v1/2020.emnlp-demos.6.

  • Xunta de Galicia. (2004). Plan Xeral de Normalización da Lingua Galega. Xunta de Galicia Consellería de Educación e Ordenación Universitaria, Dirección Xeral de Política Lingüística.

  • Zeman, D., Popel, M., Straka, M., Hajič, J., Nivre, J., Ginter, F., Luotolahti, J., Pyysalo, S., Petrov, S., Potthast, M., Tyers, F., Badmaeva, E., Gokirmak, M., Nedoluzhko, A., Cinková, S., Hajič jr, J., Hlaváčová, J., Kettnerová, V., Urešová, Z., Kanerva, J., Ojala, S., Missilä, A., Manning, C. D., Schuster, S., Reddy, S., Taji, D., Habash, N., Leung, H., de Marneffe, M. C., Sanguinetti, M., Simi, M., Kanayama, H., de Paiva, V., Droganova, K., Martínez Alonso, H., Çöltekin, Ç., Sulubacak, U., Uszkoreit, H., Macketanz, V., Burchardt, A., Harris, K., Marheinecke, K., Rehm, G., Kayadelen, T., Attia, M., Elkahky, A., Yu, Z., Pitler, E., Lertpradit, S., Mandl, M., Kirchner, J., Alcalde, H. F., Strnadová, J., Banerjee, E., Manurung, R., Stella, A., Shimada, A., Kwak, S., Mendonça, G., Lando, T., Nitisaroj, R., & Li, J. (2017). CoNLL 2017 shared task: Multilingual parsing from raw text to Universal Dependencies. In: Proceedings of the CoNLL 2017 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies, Association for Computational Linguistics, Vancouver, Canada, pp 1–19, https://doi.org/10.18653/v1/K17-3001.

  • Zeman, D., Hajič, J., Popel, M., Potthast, M., Straka, M., Ginter, F., Nivre, J., & Petrov, S. (2018). CoNLL 2018 shared task: Multilingual parsing from raw text to Universal Dependencies. In: Proceedings of the CoNLL 2018 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies, Association for Computational Linguistics, Brussels, Belgium, pp 1–21, https://doi.org/10.18653/v1/K18-2001.

  • Zhao, Y., & Karypis, G. (2003). Clustering in Life Sciences. In: Functional Genomics: Methods and Protocols, Humana Press, pp 183–218.


Acknowledgements

I would like to thank the anonymous reviewers for their valuable comments. This research was funded by the Galician Government (ERDF 2014-2020: Call ED431G 2019/04, and ED431F 2021/01), by MCIN/AEI/10.13039/501100011033 (grants with references PID2021-128811OA-I00 and TED2021-130295B-C33, the latter also funded by “European Union Next Generation EU/PRTR”), and by a Ramón y Cajal grant (RYC 2019-028473-I).

Author information

Correspondence to Marcos Garcia.

Ethics declarations

Conflict of interest

The author has no competing interests to declare that are relevant to the content of this article.


Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendix

Corpus

Each domain of the compiled corpus (Table 11) contains the following sources: Encyclopaedic (Wikipedia), literature (Wikibooks, XIADA corpus, free online books), subtitles (OpenSubtitles), newspapers (A Nosa Terra, Código Cero, Galicia Confidencial, Galiciaé, Galicia Hoxe, Vieiros, Praza Pública, and Xeración), blogs (mostly from the Blogomillo sphere), and technical (SLI CTG). Mixed data include texts from the SLI GalWeb corpus (from Adega, BNG, Praza Pública, Sermos Galiza, and Xunta), and from other blogs and personal pages in Galician. We crawled the web intermittently between 2009 and 2020, so some of the sources are newspapers that have since closed (e.g., Vieiros, in 2010) or discontinued blogs.

The data coming from the Blogomillo blogosphere is the noisiest (with large amounts of intrasentential Galician/Spanish code-switching), so it was filtered using Galician electronic dictionaries (Garcia & Gamallo, 2010) to discard sentences with a high rate of unknown words. Similarly, a language identification system (Gamallo et al., 2014) was applied to the data from bilingual newspapers to keep only the sentences in Galician. We used a conservative approach which favours better-quality sentences over longer ones.
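The dictionary-based filtering described above can be sketched as follows. This is a minimal illustration: the lexicon contents, the whitespace tokenisation, and the 0.8 threshold are hypothetical choices, as the paper does not state the exact rate used.

```python
def known_word_rate(sentence, lexicon):
    """Fraction of alphabetic tokens of a sentence found in the lexicon."""
    tokens = [t.lower() for t in sentence.split() if t.isalpha()]
    if not tokens:
        return 0.0
    return sum(t in lexicon for t in tokens) / len(tokens)


def filter_sentences(sentences, lexicon, min_rate=0.8):
    """Keep only sentences whose rate of known words reaches min_rate."""
    return [s for s in sentences if known_word_rate(s, lexicon) >= min_rate]
```

With a toy Galician lexicon, a sentence with mostly unknown (e.g., Spanish) word forms falls below the threshold and is discarded, while a fully recognised sentence is kept.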

Table 11 Number of tokens (with and without punctuation) of each domain of the training corpus

Galician GAT dataset (see Table 12)

Table 12 Analogy types of the Galician GAT dataset (GGAT) with examples and number of questions for each one (the words underlined are those to be discovered)

Intrinsic evaluation

The following plots (Figs. 4 and 5) display the results of the intrinsic evaluation of the static word embeddings (Sect. 6.1). In each figure, bars represent the average accuracies and standard deviations of all the models.

Fig. 4: Average accuracies and standard deviations of the models on the concept categorisation task versus window size (top, where dep2vec has a single result as it does not use this parameter) and dimensionality (bottom)

Fig. 5: Average accuracies and standard deviations of the models on the word analogies and concept categorisation tasks versus corpus configuration

1.1 Impact of window size and dimensionality on concept categorisation

1.2 Impact of the corpus preprocessing

Performance of available models

Here, we show the results of previously available models (the official fastText embeddings and non-contextualised vectors obtained from BERT models) on both the intrinsic and extrinsic evaluations (except for NER, already shown in Sect. 6.3).

1.1 Intrinsic evaluation

With respect to the official fastText vectors (see Sect. 5.1), the performance of the two models (Wiki and CC) varied across tasks: while the Wiki model, trained on less data, obtained better results on the word analogies task, the CC one achieved higher results on two of the three categorisation datasets (and on average). On the analogies task, the Wiki model outperformed the CC one (35.2 versus 25.4), although both fell below our model trained with equivalent parameters (45.4) and far below our best model (54; see Table 3). On the concept categorisation datasets the results were more variable: the Wiki model obtained better purity on the G-AP (66.4 vs 56.0), while the CC one achieved higher results on the G-Battig (82.9 vs 76.8) and on the G-ESSLLI (72.7 vs 65.9). Here, our equivalent fastText model performed worse than the CC one on the G-AP (49.0) and on the G-Battig (70.7), and equalled it on the G-ESSLLI dataset. In contrast, our best embeddings obtained notably better results on all three datasets (see Table 4).
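The analogy accuracies discussed here are conventionally computed with the vector-offset (3CosAdd) method of Mikolov et al. (2013b), which returns the vocabulary word closest in cosine terms to b − a + c, excluding the three query words. A minimal sketch, assuming vectors are plain lists of floats; the function and toy data are illustrative, not the paper's actual evaluation code:

```python
from math import sqrt


def _cos(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    return dot / (sqrt(sum(x * x for x in u)) * sqrt(sum(y * y for y in v)))


def solve_analogy(a, b, c, vectors):
    """3CosAdd: argmax over cos(w, b - a + c), excluding a, b, and c."""
    target = [vb - va + vc
              for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    candidates = (w for w in vectors if w not in (a, b, c))
    return max(candidates, key=lambda w: _cos(vectors[w], target))
```

On a toy space where "raíña" lies near "muller" − "home" + "rei", the method recovers the expected answer; accuracy over a dataset is then the fraction of questions answered correctly.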

Table 13 Results of the non-contextualised BERT-based embeddings on the analogy and categorisation datasets

Regarding the type-level representations obtained from BERT models, Table 13 includes the results on the four datasets as well as the macro-average performance on the concept categorisation task. When compared to the results of the static models (e.g., Tables 3 or 4), the initial word representations extracted from BERT are not competitive in most cases, so other methods exploiting the transformations of the networks are needed to obtain better type-level vectors from neural language models (Chronis & Erk, 2020; Lenci et al., 2022).

1.2 Extrinsic evaluation

The use of the previously available models on the downstream tasks had a smaller positive impact than our best word embeddings, and even a negative one in some cases. The results in Table 14 show that, for POS-tagging, the fastText embeddings were detrimental in both cases (with very similar results), obtaining lower accuracies than the baseline. The BERT-based embeddings yielded higher accuracy, but again not competitive with the models trained in this study. Regarding dependency parsing, both fastText models surpassed the baseline (on LAS and UAS), with the Wiki vectors achieving higher results (lower, however, than our equivalent model). In this task, the BERT models obtained, in general, better results than the baseline, but far from the best results shown in Tables 6 and 7, or from our word2vec model shown in Table 14.

Thus, the official fastText embeddings, which obtained results relatively similar to those of our equivalent models in the intrinsic evaluation, seem less adequate for downstream tasks such as those evaluated here. As for the BERT-based models, the extrinsic evaluation reinforces the idea that other strategies to obtain type-level vectors from neural language models should be explored.

Table 14 Results of the available models (fastText Wiki and CC, and vectors extracted from BERT and Bertinho (Bnho) base and small models) on the extrinsic tasks: POS-tagging, and dependency parsing (LAS and UAS)

Significance tests

To assess whether the best performing models in the extrinsic evaluations actually differ from the other systems, we applied paired significance tests between the best model in each evaluation and the other models. Following Vilares et al. (2021), we applied a paired t-test comparing the per-sentence accuracies of the best POS-tagging model (dep2vec with 500 dimensions) against the other models, obtaining \(p<0.01\) in every case. For dependency parsing, we followed Vilares and Gómez-Rodríguez (2018) and applied Bikel's randomised parsing evaluation comparator, which shuffles the scores of individual sentences between two given models. Here, the best model (dep2vec with 100 dimensions) was significantly better (\(p<0.01\)) than the other systems, except for word2vec (tokNP, CBOW, 500 dimensions, and a window of 1), with \(p=0.025\) (LAS) and \(p=0.024\) (UAS).
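A paired approximate randomisation test in the spirit of Bikel's comparator can be sketched as follows: the per-sentence scores of the two systems are repeatedly swapped at random, and the p-value estimates how often the shuffled difference is at least as large as the observed one. The function name, iteration count, and seed below are illustrative, not those of the actual comparator.

```python
import random


def randomisation_test(scores_a, scores_b, iterations=10000, seed=0):
    """Paired approximate randomisation test: p-value for the null
    hypothesis that systems A and B are interchangeable."""
    rng = random.Random(seed)
    observed = abs(sum(scores_a) - sum(scores_b))
    extreme = 0
    for _ in range(iterations):
        diff = 0.0
        for sa, sb in zip(scores_a, scores_b):
            if rng.random() < 0.5:  # randomly swap this sentence's scores
                sa, sb = sb, sa
            diff += sa - sb
        if abs(diff) >= observed:
            extreme += 1
    # add-one smoothing so the estimate is never exactly zero
    return (extreme + 1) / (iterations + 1)
```

Identical score lists yield a p-value of 1, while consistently divergent scores drive it towards zero, matching the intuition behind the per-sentence shuffling described above.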


About this article


Cite this article

Garcia, M. Training and evaluation of vector models for Galician. Lang Resources & Evaluation 58, 1419–1462 (2024). https://doi.org/10.1007/s10579-024-09740-0
