Brief report

Training and evaluation of vector models for Galician

Published: 04 June 2024

Abstract

This paper presents a large and systematic assessment of distributional models for Galician. To this end, we first trained and evaluated static word embeddings (e.g., word2vec, GloVe), and then compared their performance with that of current contextualised representations generated by neural language models. We began by compiling and processing a large corpus for Galician, and by creating four datasets for word analogies and concept categorisation based on standard resources for other languages. Using this corpus, we trained 760 static vector space models which vary in their input representations (e.g., adjacency-based versus dependency-based approaches), learning algorithms, size of the surrounding contexts, and number of vector dimensions. These models were evaluated both intrinsically, using the newly created datasets, and on extrinsic tasks, namely POS-tagging, dependency parsing, and named entity recognition. The results provide new insights into the performance of different vector models for Galician, and into the impact of several training parameters on each task. In general, fastText embeddings are the static representations with the best performance in the intrinsic evaluations and in named entity recognition, while syntax-based embeddings achieve the highest results in POS-tagging and dependency parsing, indicating that there is no significant correlation between performance on the intrinsic and extrinsic tasks. Finally, we compared the performance of static vector representations with that of BERT-based word embeddings, whose fine-tuning obtains the best results on named entity recognition. This comparison provides a comprehensive state-of-the-art overview of current models for Galician, and we release new transformer-based models for NER.
All the resources used in this research are freely available to the community, and the best models have been incorporated into SemantiGal, an online tool to explore vector representations for Galician.
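The word-analogy evaluation mentioned above is conventionally run with the vector-offset (3CosAdd) method: for an analogy a : b :: c : ?, the answer is the vocabulary word whose vector is closest (by cosine similarity) to b − a + c. A minimal sketch with hypothetical toy vectors (illustrative values only, not the paper's models or data; the actual embeddings have hundreds of dimensions and large vocabularies):

```python
import math

# Hypothetical 3-dimensional vectors standing in for trained Galician
# embeddings (Galician words: home "man", muller "woman", rei "king",
# raiña "queen", can "dog" as a distractor).
vectors = {
    "home":   [1.0, 0.2, 0.1],
    "muller": [1.0, 0.9, 0.1],
    "rei":    [0.9, 0.2, 0.8],
    "raiña":  [0.9, 0.9, 0.8],
    "can":    [0.1, 0.3, 0.05],
}

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def analogy(a, b, c, vectors):
    """Solve a : b :: c : ? by the vector-offset (3CosAdd) method,
    excluding the three query words from the candidate set."""
    target = [vb - va + vc
              for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    candidates = (w for w in vectors if w not in {a, b, c})
    return max(candidates, key=lambda w: cosine(vectors[w], target))

print(analogy("home", "muller", "rei", vectors))  # → raiña
```

Accuracy on an analogy dataset is then simply the fraction of quadruples for which the predicted word matches the gold answer.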

References

[1]
Abadji, J., Ortiz Suarez, P., Romary, L., & Sagot, B. (2022). Towards a Cleaner Document-Oriented Multilingual Crawled Corpus. arXiv e-prints arXiv:2201.06642.
[2]
Agerri, R., Gómez Guinovart, X., Rigau, G., Solla Portela, M. A. (2018). Developing new linguistic resources and tools for the Galician language. In: Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), European Language Resources Association (ELRA), Miyazaki, Japan. https://aclanthology.org/L18-1367.
[3]
Agirre, E., Alfonseca, E., Hall, K., Kravalova, J., Paşca, M., & Soroa, A. (2009). A study on similarity and relatedness using distributional and WordNet-based approaches. In: Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics, Association for Computational Linguistics, Boulder, Colorado, pp 19–27, https://aclanthology.org/N09-1003.
[4]
Aina, L., Gulordava, K., & Boleda, G. (2019). Putting words in context: LSTM language models and lexical ambiguity. In: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics, Florence, Italy, pp 3342–3348,
[5]
Almuhareb, A. (2006). Attributes in lexical acquisition. PhD thesis, University of Essex.
[6]
Artetxe, M., Labaka, G., & Agirre, E. (2018). A robust self-learning method for fully unsupervised cross-lingual mappings of word embeddings. In: Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Association for Computational Linguistics, Melbourne, Australia, pp 789–798,.
[7]
Baldwin, T., Bannard, C., Tanaka, T., & Widdows, D. (2003). An empirical model of multiword expression decomposability. In: Proceedings of the ACL 2003 Workshop on Multiword Expressions: Analysis, Acquisition and Treatment, Association for Computational Linguistics, Sapporo, Japan, pp 89–96,.
[8]
Bansal, M., Gimpel, K., & Livescu, K. (2014). Tailoring continuous word representations for dependency parsing. In: Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), Association for Computational Linguistics, Baltimore, Maryland, pp 809–815,.
[9]
Baroni, M., & Lenci, A. (2010). Distributional memory: A general framework for corpus-based semantics. Computational Linguistics,36(4), 673–721.
[10]
Baroni, M., & Lenci, A. (2011). How we BLESSed distributional semantic evaluation. In: Proceedings of the GEMS 2011 Workshop on GEometrical Models of Natural Language Semantics, Association for Computational Linguistics, Edinburgh, UK, pp 1–10.
[11]
Baroni, M., Evert, S., & Lenci, A. (2008). Bridging the gap between semantic theory and computational simulations. In: FOLLI (ed) Proceedings of the ESSLLI Workshop on Distributional Lexical Semantics, Hamburg.
[12]
Baroni M, Murphy B, Barbu E, and Poesio M Strudel: A distributional semantic model based on property and types Cognitive Science 2010 34 2 222-254
[13]
Baroni, M., Dinu, G., & Kruszewski, G. (2014). Don’t count, predict! a systematic comparison of context-counting vs. context-predicting semantic vectors. In: Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Association for Computational Linguistics, Baltimore, Maryland, pp 238–247,.
[14]
Batchkarov, M., Kober, T., Reffin, J., Weeds, J., & Weir, D. (2016). A critique of word similarity as a method for evaluating distributional semantic models. In: Proceedings of the 1st Workshop on Evaluating Vector-Space Representations for NLP, Association for Computational Linguistics, Berlin, Germany, pp 7–12,.
[15]
Bengio Y, Ducharme R, Vincent P, and Jauvin C A neural probabilistic language model Journal of Machine Learning Research. 2003 3 1137-1155
[16]
Bentivogli, L., Bernardi, R., Marelli, M., Menini, S., Baroni, M. & Zamparelli, R. (2016). SICK through the SemEval glasses. Lesson learned from the evaluation of compositional distributional semantic models on full sentences through semantic relatedness and textual entailment. Language Resources and Evaluation. 50(1):95–124.
[17]
Bojanowski, P., Grave, E., Joulin, A., & Mikolov, T. (2017). Enriching word vectors with subword information. Transactions of the Association for Computational Linguistics,5, 135–146.
[18]
Boleda G Distributional semantics and linguistic theory Annual Review of Linguistics. 2020 6 213-234
[19]
Camacho-Collados, J., & Navigli, R. (2016). Find the word that does not belong: A framework for an intrinsic evaluation of word vector representations. In: Proceedings of the 1st Workshop on Evaluating Vector-Space Representations for NLP, Association for Computational Linguistics, Berlin, Germany, pp 43–50.
[20]
Chernodub, A., Oliynyk, O., Heidenreich, P., Bondarenko, A., Hagen, M., Biemann, C., & Panchenko, A. (2019). TARGER: Neural argument mining at your fingertips. In: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics: System Demonstrations, Association for Computational Linguistics, Florence, Italy, pp 195–200,.
[21]
Chiu, B., Korhonen, A., & Pyysalo, S. (2016). Intrinsic evaluation of word vectors fails to predict extrinsic performance. In: Proceedings of the 1st Workshop on Evaluating Vector-Space Representations for NLP, Association for Computational Linguistics, Berlin, Germany, pp 1–6,.
[22]
Chronis, G., & Erk, K. (2020). When is a bishop not like a rook? when it’s like a rabbi! multi-prototype BERT embeddings for estimating semantic relationships. In: Proceedings of the 24th Conference on Computational Natural Language Learning, Association for Computational Linguistics, Online, pp 227–244,.
[23]
Collobert, R., & Weston, J. (2008). A unified architecture for natural language processing: Deep neural networks with multitask learning. In: Proceedings of the 25th international conference on Machine learning, pp 160–167.
[24]
Collobert R, Weston J, Bottou L, Karlen M, Kavukcuoglu K, and Kuksa P Natural Language Processing (Almost) From Scratch Journal of Machine Learning Research 2011 12 2493-2537
[25]
Conneau, A., Kruszewski, G., Lample, G., Barrault, L., & Baroni, M. (2018). What you can cram into a single \$ &!#* vector: Probing sentence embeddings for linguistic properties. In: Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Association for Computational Linguistics, Melbourne, Australia, pp 2126–2136,.
[26]
Conneau, A., Khandelwal, K., Goyal, N., Chaudhary, V., Wenzek, G., Guzmán, F., Grave, E., Ott, M., Zettlemoyer, L., & Stoyanov, V. (2020). Unsupervised cross-lingual representation learning at scale. In: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics, Online, pp 8440–8451,.
[27]
Cordeiro S, Villavicencio A, Idiart M, and Ramisch C Unsupervised compositionality prediction of nominal compounds Computational Linguistics 2019 45 1 1-57
[28]
Costas González, X. H. (2007). A Lingua Galega no Eo-Navia, Bierzo Occidental, As Portelas, Calabor e o Val do Ellas: Historia, Breve Caracterización e Situación Sociolingüística Actual, Cadernos de Lingua, vol Anexo 8. Real Academia Galega.
[29]
Devlin, J., Chang, M. W., Lee, K., & Toutanova, K. (2019). BERT: Pre-training of deep bidirectional transformers for language understanding. In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Association for Computational Linguistics, Minneapolis, Minnesota, pp 4171–4186.
[30]
de Dios-Flores I and Garcia M A computational psycholinguistic evaluation of the syntactic abilities of galician bert models at the interface of dependency resolution and training time Procesamiento del Lenguaje Natural 2022 69 15-26
[31]
de Dios-Flores, I., Magariños, C., Vladu, A. I., Ortega, J. E., Pichel, J. R., García, M., Gamallo, P., Fernández Rei, E., Bugarín-Diz, A., González González, M., Barro, S., & Regueira, X.L. (2022). The nós project: Opening routes for the Galician language in the field of language technologies. In: Proceedings of the Workshop Towards Digital Language Equality within the 13th Language Resources and Evaluation Conference, European Language Resources Association, Marseille, France, pp 52–61.
[32]
Drozd, A., Gladkova, A., & Matsuoka, S. (2016). Word embeddings, analogies, and machine learning: Beyond king - man + woman = queen. In: Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers, The COLING 2016 Organizing Committee, Osaka, Japan, pp 3519–3530.
[33]
Erk, K., & Padó, S. (2008). A structured vector space model for word meaning in context. In: Proceedings of the 2008 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, Honolulu, Hawaii, pp 897–906.
[34]
Faruqui, M., Tsvetkov, Y., Rastogi, P., & Dyer, C. (2016). Problems with evaluation of word embeddings using word similarity tasks. In: Proceedings of the 1st Workshop on Evaluating Vector-Space Representations for NLP, Association for Computational Linguistics, Berlin, Germany, pp 30–35,.
[35]
Fernández Rei F Dialectoloxía da Lingua 1991 Galega Vigo Xerais de Galicia
[36]
Firth, J. R. (1957). A synopsis of linguistic theory 1930-1955. Studies in Linguistic Analysis pp 1–32, reprinted in F.R. Palmer (ed.), Selected Papers of J.R. Firth 1952–1959. London: Longman (1968).
[37]
Fonseca, E. R., & Rosa, J. L. G. (2013). Mac-morpho revisited: Towards robust part-of-speech tagging. In: Proceedings of the 9th Brazilian Symposium in Information and Human Language Technology.
[38]
Freixeiro Mato XR Gramática da Lingua Galega IV 2003 A Nosa Terra, Vigo Gramática do texto
[39]
Gamallo P Comparing explicit and predictive distributional semantic models endowed with syntactic contexts Language Resources and Evaluation 2017 51 3 727-743
[40]
Gamallo, P., Garcia, M., Sotelo, S., & Campos, J. R. P. (2014). Comparing Ranking-based and Naive Bayes Approaches to Language Detection on Tweets. Proceedings of TweetLID: Twitter Language Identification Workshop at SEPLN, 2014, pp. 12–16.
[41]
Gamallo, P., Garcia, M., Piñeiro, C., Martínez-Castaño, R., & Pichel, J. C. (2018). LinguaKit: a Big Data-based multilingual tool for linguistic analysis and information extraction. 2018 Fifth International Conference on Social Networks Analysis (pp. 239–244). IEEE: Management and Security (SNAMS).
[42]
Garcia, M. (2021). Exploring the representation of word meanings in context: A case study on homonymy and synonymy. In: Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), Association for Computational Linguistics, Online, pp 3625–3640,.
[43]
Garcia M and Crespo-Otero A Pinheiro V, Gamallo P, Amaro R, Scarton C, Batista F, Silva D, Magro C, and Pinto H A targeted assessment of the syntactic abilities of transformer models for galician-portuguese Computational Processing of the Portuguese Language 2022 Cham Springer International Publishing 46-56
[44]
Garcia M and Gamallo P Análise morfossintáctica para português europeu e galego: Problemas, soluçoes e avaliaçao Linguamática 2010 2 2 59-67
[45]
Garcia M, Gómez-Rodríguez C, and Alonso MA New treebank or repurposed? on the feasibility of cross-lingual parsing of romance languages with universal dependencies Natural Language Engineering 2018 24 1 91-122
[46]
Garcia, M., García-Salido, M., & Alonso, M. A. (2019). Exploring cross-lingual word embeddings for the inference of bilingual dictionaries. In: Proceedings of TIAD-2019 Shared Task - Translation Inference Across Dictionaries, pp 32–41.
[47]
Garcia, M., Kramer Vieira, T., Scarton, C., Idiart, M., & Villavicencio, A. (2021). Probing for idiomaticity in vector space models. In: Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, Association for Computational Linguistics, Online, pp 3551–3564,.
[48]
Garcia, M., Rodríguez, I., & Gamallo, P. (2022). SemantiGal: An online visualizer of vector representations for Galician. In: Proceedings of PROPOR 2022: International Conference on the Computational Processing of Portuguese. Demo Papers.
[49]
Gargallo Gil JE Gallego-portugués, iberorromance. la ‘fala’ en su contexto románico peninsular Limite Revista de Estudios Portugueses y de la Lusofonía 2007 1 31-49
[50]
Garí Soler A and Apidianaki M Let’s play mono-poly: BERT can reveal words’ polysemy level and partitionability into senses Transactions of the Association for Computational Linguistics 2021 9 825-844
[51]
Gladkova, A., Drozd, A., & Matsuoka, S. (2016). Analogy-based detection of morphological and semantic relations with word embeddings: what works and what doesn’t. In: Proceedings of the NAACL Student Research Workshop, Association for Computational Linguistics, San Diego, California, pp 8–15,.
[52]
Glavaš, G., & Vulić, I. (2021). Climbing the tower of treebanks: Improving low-resource dependency parsing via hierarchical source selection. In: Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, Association for Computational Linguistics, Online, pp 4878–4888,.
[53]
Glavaš, G., Litschko, R., Ruder, S., & Vulić, I. (2019). How to (properly) evaluate cross-lingual word embeddings: On strong baselines, comparative analyses, and some misconceptions. In: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics, Florence, Italy, pp 710–721,.
[54]
Grave, E., Bojanowski, P., Gupta, P., Joulin, A., & Mikolov, T. (2018). Learning word vectors for 157 languages. In: Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), European Language Resources Association (ELRA), Miyazaki, Japan, https://aclanthology.org/L18-1550.
[55]
Griffiths TL, Steyvers M, and Tenenbaum JB Topics in semantic representation Psychological Review 2007 114 2 211
[56]
Gulordava, K., & Baroni, M. (2011). A distributional similarity approach to the detection of semantic change in the Google Books Ngram corpus. In: Proceedings of the GEMS 2011 workshop on geometrical models of natural language semantics, pp 67–71.
[57]
Harris, Z. S. (1954). Distributional structure. Word,10(2–3): 146–162.
[58]
Hartmann, N., Fonseca, E., Shulby, C., Treviso, M., Silva, J., & Aluísio, S. (2017). Portuguese word embeddings: Evaluating on word analogies and natural language tasks. In: Proceedings of the 11th Brazilian Symposium in Information and Human Language Technology, Sociedade Brasileira de Computação, Uberlândia, Brazil, pp 122–131.
[59]
Huebner PA and Willits JA Structured semantic knowledge can emerge automatically from predicting word sequences in child-directed speech Frontiers in Psychology 2018 9 133
[60]
Kann, K., Cho, K., & Bowman, S. R. (2019). Towards realistic practices in low-resource natural language processing: The development set. In: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Association for Computational Linguistics, Hong Kong, China, pp 3342–3349,.
[61]
Kim, Y., Chiu, Y. I., Hanaki, K., Hegde, D., & Petrov, S. (2014). Temporal analysis of language through neural language models. In: Proceedings of the ACL 2014 Workshop on Language Technologies and Computational Social Science, Association for Computational Linguistics, Baltimore, MD, USA, pp 61–65,.
[62]
Landauer TK and Dumais ST A solution to Plato’s problem: The latent semantic analysis theory of acquisition, induction, and representation of knowledge Psychological Review 1997 104 2 211
[63]
Landauer, T. K., Laham, D., & Foltz, P. W. (1998). Learning human-like knowledge by singular value decomposition: A progress report. In: Advances in neural information processing systems, pp 45–51.
[64]
Lebret, R., & Collobert, R. (2015). Rehabilitation of count-based models for word vector representations. In: International Conference on Intelligent Text Processing and Computational Linguistics, Springer, pp 417–429.
[65]
Lenci A, Sahlgren M, Jeuniaux P, Cuba Gyllensten A, and Miliani M A comparative evaluation and analysis of three generations of Distributional Semantic Models Language Resources and Evaluation 2022 56 1269-1313
[66]
Levy, O., & Goldberg, Y. (2014a). Dependency-based word embeddings. In: Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), Association for Computational Linguistics, Baltimore, Maryland, pp 302–308,.
[67]
Levy, O., & Goldberg, Y. (2014b). Linguistic regularities in sparse and explicit word representations. In: Proceedings of the Eighteenth Conference on Computational Natural Language Learning, Association for Computational Linguistics, Ann Arbor, Michigan, pp 171–180,.
[68]
Levy, O., Goldberg, Y., & Dagan, I. (2015). Improving distributional similarity with lessons learned from word embeddings. Transactions of the Association for Computational Linguistics,3, 211–225.
[69]
de Lhoneux, M., Stymne, S., & Nivre, J. (2017). Arc-hybrid non-projective dependency parsing with a static-dynamic oracle. In: Proceedings of the The 15th International Conference on Parsing Technologies (IWPT)., Pisa, Italy.
[70]
Li, B., Liu, T., Zhao, Z., Tang, B., Drozd, A., Rogers, A., & Du, X. (2017). Investigating different syntactic context types and context representations for learning word embeddings. In: Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, Copenhagen, Denmark, pp 2421–2431,.
[71]
Lin, D. (1998). Automatic retrieval and clustering of similar words. In: COLING 1998 Volume 2: The 17th International Conference on Computational Linguistics.
[72]
Lin D and Pantel P Discovery of inference rules for question-answering Natural Language Engineering 2001 7 4 343-360
[73]
Lindley Cintra LF and Cunha C Nova Gramática do Português Contemporâneo 1984 Lisbon Livraria Sá da Costa
[74]
Linzen, T. (2016). Issues in evaluating semantic spaces using word analogies. In: Proceedings of the 1st Workshop on Evaluating Vector-Space Representations for NLP, Association for Computational Linguistics, Berlin, Germany, pp 13–18,.
[75]
Lison, P., & Tiedemann, J. (2016). OpenSubtitles2016: Extracting large parallel corpora from movie and TV subtitles. In: Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC’16), European Language Resources Association (ELRA), Portorož, Slovenia, pp 923–929.
[76]
Liu P, Yuan W, Fu J, Jiang Z, Hayashi H, and Neubig G Pre-train, prompt, and predict: A systematic survey of prompting methods in natural language processing ACM Comput Surv 2023
[77]
Lund, K. (1995). Semantic and associative priming in high-dimensional semantic space. In: Proc. of the 17th Annual conferences of the Cognitive Science Society, 1995.
[78]
Lund K and Burgess C Producing high-dimensional semantic spaces from lexical co-occurrence Behavior research methods, instruments, & computers 1996 28 2 203-208
[79]
Ma, X., & Hovy, E. (2016). End-to-end sequence labeling via bi-directional LSTM-CNNs-CRF. In: Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Association for Computational Linguistics, Berlin, Germany, pp 1064–1074,.
[80]
van der Maaten L and Hinton G Visualizing Data using t-SNE Journal of Machine Learning Research 2008 9 86 2579-2605
[81]
Malvar P, Pichel JR, Senra Ó, Gamallo P, García A (2010) Vencendo a escassez de recursos computacionais. carvalho: Tradutor automático estatístico inglês-galego a partir do corpus paralelo europarl inglês-português. Linguamática. 2(2):31–38.
[82]
Mandera P, Keuleers E, and Brysbaert M Explaining human performance in psycholinguistic tasks with models of semantic similarity based on prediction and counting: A review and empirical validation Journal of Memory and Language 2017 92 57-78
[83]
Marvin, R., & Linzen, T. (2018). Targeted syntactic evaluation of language models. In: Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, Brussels, Belgium, pp 1192–1202,.
[84]
McDonald, S., & Ramscar, M. (2001). Testing the distributional hypothesis: The influence of context on judgements of semantic similarity. In: Proceedings of the Annual Meeting of the Cognitive Science Society, vol 23.
[85]
Melamud, O., McClosky, D., Patwardhan, S., & Bansal, M. (2016). The role of context types and dimensionality in learning word embeddings. In: Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Association for Computational Linguistics, San Diego, California, pp 1030–1040,.
[86]
Mikolov, T., Chen, K., Corrado, G., & Dean, J. (2013a). Efficient estimation of word representations in vector space. In: Workshop Proceedings of the International Conference on Learning Representations (ICLR) 2013, arXiv preprint arXiv:1301.3781.
[87]
Mikolov, T., Yih, W. t., & Zweig, G. (2013b). Linguistic regularities in continuous space word representations. In: Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Association for Computational Linguistics, Atlanta, Georgia, pp 746–751.
[88]
Mikolov, T., Grave, E., Bojanowski, P., Puhrsch, C., & Joulin, A. (2018). Advances in pre-training distributed word representations. In: Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), European Language Resources Association (ELRA), Miyazaki, Japan.
[89]
Miller, G. A. (1971). Empirical methods in the study of semantics. Semantics, an interdisciplinary reader in philosophy, linguistics, and psychology. pp 569–585.
[90]
Mira Mateus MH, Brito AM, Duarte I, Hub Faria I, Frota S, Matos G, Oliveira F, Vigário M, and Villalva A Gramática da Língua Portuguesa 2003 6 Caminho
[91]
Mitchell J and Lapata M Composition in distributional models of semantics Cognitive Science 2010 34 8 1388-1429
[92]
Müller-Eberstein, M., van der Goot, R., & Plank, B. (2021). Genre as weak supervision for cross-lingual dependency parsing. In: Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, Online and Punta Cana, Dominican Republic, pp 4786–4802,.
[93]
Oliveira, H. G., Sousa, T., & Alves, A. (2020). TALES: Test set of portuguese lexical-semantic relations for assessing word embeddings. In: Proceedings of the ECAI 2020 Workshop on Hybrid Intelligence for Natural Language Processing Tasks (HI4NLP), pp 41–47.
[94]
Ortega, J., de Dios-Flores, I., Pichel, J., & Gamallo, P. (2022a). A Neural Machine Translation System for Spanish to Galician through Portuguese Transliteration. In: 15th International Conference on Computational Processing of Portuguese (PROPOR 2022). Demo papers.
[95]
Ortega, J. E., de Dios-Flores, I., Pichel, J. R., & Gamallo, P. (2022b). Revisiting ccnet for quality measurements in galician. In: International Conference on Computational Processing of the Portuguese Language, Springer, pp 407–412.
[96]
Padó S and Lapata M Dependency-based construction of semantic space models Computational Linguistics 2007 33 2 161-199
[97]
Pennington, J., Socher, R., & Manning, C. (2014). GloVe: Global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), Association for Computational Linguistics, Doha, Qatar, pp 1532–1543,.
[98]
Peters, M. E., Neumann, M., Iyyer, M., Gardner, M., Clark, C., Lee, K., & Zettlemoyer, L. (2018). Deep contextualized word representations. In: Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), Association for Computational Linguistics, New Orleans, Louisiana, pp 2227–2237,.
[99]
Qi, Y., Sachan, D., Felix, M., Padmanabhan, S., & Neubig, G. (2018). When and why are pre-trained word embeddings useful for neural machine translation? In: Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), Association for Computational Linguistics, New Orleans, Louisiana, pp 529–535,.
[100]
Querido A, Carvalho R, Rodrigues J, Garcia M, Silva J, Correia C, Rendeiro N, Pereira R, Campos M, and Branco A Lx-lr4distsemeval: A collection of language resources for the evaluation of distributional semantic models of portuguese Revista da Associação Portuguesa de Linguística 2017 3 265-283
[101]
Rapp, R. (2003). Word sense discovery based on sense descriptor dissimilarity. In: Proceedings of the Ninth Machine Translation Summit, pp 315–322.
[102]
Řehůřek, R., & Sojka, P. (2010). Software Framework for Topic Modelling with Large Corpora. In: New Challenges for NLP Frameworks (NLPFrameworks 2010) at LREC 2010, ELRA, Valletta, Malta, pp 46–50, http://www.lrec-conf.org/proceedings/lrec2010/workshops/W10.pdf.
[103]
Rodrigues, J., Branco, A., Neale, S., & Silva, J. (2016). LX-DSemVectors: Distributional Semantics Models for Portuguese. In: International Conference on Computational Processing of the Portuguese Language, Springer, pp 259–270.
[104]
Rodríguez-Fernández, S., Espinosa-Anke, L., Carlini, R., & Wanner, L. (2016). Semantics-driven recognition of collocations using word embeddings. In: Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), Association for Computational Linguistics, Berlin, Germany, pp 499–505,.
[105]
Rogers, A., Kovaleva, O., & Rumshisky, A. (2020). A primer in BERTology: What we know about how BERT works. Transactions of the Association for Computational Linguistics,8, 842–866.
[106]
Rojo, G., López Martínez, M., Domínguez Noya, E., & Barcala, F. (2019). Corpus de adestramento do Etiquetador/Lematizador do Galego Actual (XIADA), versión 2.7. Centro Ramón Piñeiro para a investigación en humanidades.
[107]
Ruge G Experiments on linguistically-based term associations Information Processing & Management 1992 28 3 317-332
[108]
Sagi, E., Kaufmann, S., & Clark, B. (2009). Semantic density analysis: Comparing word meaning across time and phonetic space. In: Proceedings of the Workshop on Geometrical Models of Natural Language Semantics, Association for Computational Linguistics, Athens, Greece, pp 104–111.
[109]
Sahlgren M The Distributional Hypothesis Rivista di Linguistica (Italian Journal of Linguistics) 2008 20 1 33-53
[110]
Samartim, R. (2012). Língua somos: A construção da ideia de língua e da identidade coletiva na Galiza (pré-) constitucional. In: Novas achegas ao estudo da cultura galega II: enfoques socio-históricos e lingüístico-literarios, pp 27–36.
[111]
Schnabel, T., Labutov, I., Mimno, D., & Joachims, T. (2015). Evaluation methods for unsupervised word embeddings. In: Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, Lisbon, Portugal, pp 298–307,.
[112]
Schütze H Automatic word sense discrimination Computational Linguistics 1998 24 1 97-123
[113]
Smith, A., de Lhoneux, M., Stymne, S., & Nivre, J. (2018). An investigation of the interactions between pre-trained word embeddings, character models and POS tags in dependency parsing. In: Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, Brussels, Belgium, pp 2711–2720,.
[114]
Sousa, T., Gonçalo Oliveira, H., & Alves, A. (2020). Exploring Different Methods for Solving Analogies with Portuguese Word Embeddings. In: 9th Symposium on Languages, Applications and Technologies (SLATE 2020), Schloss Dagstuhl-Leibniz-Zentrum für Informatik.
[115]
Straka, M., & Straková, J. (2017). Tokenizing, POS Tagging, Lemmatizing and Parsing UD 2.0 with UDPipe. In: Proceedings of the CoNLL 2017 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies, Association for Computational Linguistics, Vancouver, Canada, pp 88–99, http://www.aclweb.org/anthology/K/K17/K17-3009.pdf.
[116]
Teyssier P História da Língua Portuguesa 1987 3 Lisbon Livraria Sá da Costa
[117]
Turney PD Similarity of semantic relations Computational Linguistics 2006 32 3 379-416
[118]
Turney PD Domain and function: A dual-space model of semantic relations and compositions Journal of Artificial Intelligence Research 2012 44 533-585
[119]
Turney PD and Littman ML Corpus-based learning of analogies and semantic relations Machine Learning 2005 60 1–3 251-278
[120]
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L., & Polosukhin, I, (2017), Attention Is All You Need. In: 31st Conference on Neural Information Processing Systems (NIPS 2017), arXiv preprint arXiv:1706.03762,
[121]
Vilares, D., & Gómez-Rodríguez, C. (2018). Transition-based parsing with lighter feed-forward networks. In: Proceedings of the Second Workshop on Universal Dependencies (UDW 2018), Association for Computational Linguistics, Brussels, Belgium, pp 162–172,.
[122]
Vilares D, Garcia M, and Gómez-Rodríguez C Bertinho: Galician bert representations Procesamiento del Lenguaje Natural 2021 66 13-26
[123] Vulić, I., Ponti, E. M., Litschko, R., Glavaš, G., & Korhonen, A. (2020). Probing pretrained language models for lexical semantics. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), Association for Computational Linguistics, Online, pp 7222–7240.
[124] Wang, B., Wang, A., Chen, F., Wang, Y., & Kuo, C. C. J. (2019). Evaluating word embedding models: Methods and experimental results. APSIPA Transactions on Signal and Information Processing, 8, 19.
[125] Wenzek, G., Lachaux, M. A., Conneau, A., Chaudhary, V., Guzmán, F., Joulin, A., & Grave, E. (2020). CCNet: Extracting high quality monolingual datasets from web crawl data. In: Proceedings of the Twelfth Language Resources and Evaluation Conference, European Language Resources Association, Marseille, France, pp 4003–4012.
[126] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., Davison, J., Shleifer, S., von Platen, P., Ma, C., Jernite, Y., Plu, J., Xu, C., Le Scao, T., Gugger, S., Drame, M., Lhoest, Q., & Rush, A. (2020). Transformers: State-of-the-art natural language processing. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, Association for Computational Linguistics, Online, pp 38–45.
[127] Xunta de Galicia. (2004). Plan Xeral de Normalización da Lingua Galega. Xunta de Galicia, Consellería de Educación e Ordenación Universitaria, Dirección Xeral de Política Lingüística.
[128] Zeman, D., Popel, M., Straka, M., Hajič, J., Nivre, J., Ginter, F., Luotolahti, J., Pyysalo, S., Petrov, S., Potthast, M., Tyers, F., Badmaeva, E., Gokirmak, M., Nedoluzhko, A., Cinková, S., Hajič jr, J., Hlaváčová, J., Kettnerová, V., Urešová, Z., Kanerva, J., Ojala, S., Missilä, A., Manning, C. D., Schuster, S., Reddy, S., Taji, D., Habash, N., Leung, H., de Marneffe, M. C., Sanguinetti, M., Simi, M., Kanayama, H., de Paiva, V., Droganova, K., Martínez Alonso, H., Çöltekin, Ç., Sulubacak, U., Uszkoreit, H., Macketanz, V., Burchardt, A., Harris, K., Marheinecke, K., Rehm, G., Kayadelen, T., Attia, M., Elkahky, A., Yu, Z., Pitler, E., Lertpradit, S., Mandl, M., Kirchner, J., Alcalde, H. F., Strnadová, J., Banerjee, E., Manurung, R., Stella, A., Shimada, A., Kwak, S., Mendonça, G., Lando, T., Nitisaroj, R., & Li, J. (2017). CoNLL 2017 shared task: Multilingual parsing from raw text to Universal Dependencies. In: Proceedings of the CoNLL 2017 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies, Association for Computational Linguistics, Vancouver, Canada, pp 1–19.
[129] Zeman, D., Hajič, J., Popel, M., Potthast, M., Straka, M., Ginter, F., Nivre, J., & Petrov, S. (2018). CoNLL 2018 shared task: Multilingual parsing from raw text to Universal Dependencies. In: Proceedings of the CoNLL 2018 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies, Association for Computational Linguistics, Brussels, Belgium, pp 1–21.
[130] Zhao, Y., & Karypis, G. (2003). Clustering in life sciences. In: Functional Genomics: Methods and Protocols, Humana Press, pp 183–218.

Information

Published In

Language Resources and Evaluation, Volume 58, Issue 4 (Dec 2024), 422 pages

Publisher

Springer-Verlag, Berlin, Heidelberg

Publication History

Published: 04 June 2024
Accepted: 05 April 2024

Author Tags

  1. Distributional semantics
  2. Word embeddings
  3. Galician
  4. Intrinsic evaluation
  5. Extrinsic evaluation

Qualifiers

  • Brief-report