Abstract
Deep neural language models that achieve state-of-the-art results on downstream natural language processing tasks have recently been trained for Portuguese. However, studies that systematically evaluate such models are still needed for several applications. In this paper, we evaluate the performance of deep neural language models against classical word embeddings on the semantic similarity tasks provided by the ASSIN dataset, for both Brazilian and European Portuguese. Our experiments indicate that the ELMo language model achieves better accuracy than any other publicly available pretrained model for Portuguese, and that performing vocabulary reduction on the dataset before training improves not only the standalone performance of ELMo but also its performance in combination with classical word embeddings. We also show that FastText skip-gram embeddings can perform significantly better on semantic similarity tasks than previous studies in this field indicated.
The source code for the experiments described in this paper has been published on GitHub at https://github.com/ruanchaves/elmo.
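The classical word-embedding baseline referred to above can be sketched as follows: represent each sentence as the average of its word vectors and score a sentence pair by cosine similarity. This is a minimal illustration, not the authors' implementation; the tiny toy vocabulary below is hypothetical, and real experiments would load pretrained Portuguese vectors (e.g. FastText skip-gram embeddings).

```python
import numpy as np

# Toy word vectors standing in for pretrained Portuguese embeddings.
toy_vectors = {
    "o":      np.array([0.1, 0.3, 0.2]),
    "gato":   np.array([0.9, 0.1, 0.4]),
    "dorme":  np.array([0.2, 0.8, 0.5]),
    "felino": np.array([0.8, 0.2, 0.5]),
}

def sentence_vector(tokens, vectors):
    # Average the vectors of in-vocabulary tokens; OOV tokens are skipped.
    found = [vectors[t] for t in tokens if t in vectors]
    return np.mean(found, axis=0)

def cosine(u, v):
    # Cosine similarity between two sentence vectors.
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

s1 = sentence_vector("o gato dorme".split(), toy_vectors)    # "the cat sleeps"
s2 = sentence_vector("o felino dorme".split(), toy_vectors)  # "the feline sleeps"
print(round(cosine(s1, s2), 3))
```

Contextual models such as ELMo replace the static lookup with token vectors computed from the whole sentence, but the pairwise scoring step stays the same.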
© 2020 Springer Nature Switzerland AG
Cite this paper
Rodrigues, R.C., Rodrigues, J., de Castro, P.V.Q., da Silva, N.F.F., Soares, A. (2020). Portuguese Language Models and Word Embeddings: Evaluating on Semantic Similarity Tasks. In: Quaresma, P., Vieira, R., Aluísio, S., Moniz, H., Batista, F., Gonçalves, T. (eds) Computational Processing of the Portuguese Language. PROPOR 2020. Lecture Notes in Computer Science(), vol 12037. Springer, Cham. https://doi.org/10.1007/978-3-030-41505-1_23
Print ISBN: 978-3-030-41504-4
Online ISBN: 978-3-030-41505-1