
Portuguese Language Models and Word Embeddings: Evaluating on Semantic Similarity Tasks

  • Conference paper
  • First Online:
Computational Processing of the Portuguese Language (PROPOR 2020)

Abstract

Deep neural language models that achieve state-of-the-art results on downstream natural language processing tasks have recently been trained for the Portuguese language. However, studies that systematically evaluate such models are still necessary for several applications. In this paper, we propose to evaluate the performance of deep neural language models on the semantic similarity tasks provided by the ASSIN dataset against classical word embeddings, both for Brazilian Portuguese and for European Portuguese. Our experiments indicate that the ELMo language model was able to achieve better accuracy than any other pretrained model that has been made publicly available for the Portuguese language, and that performing vocabulary reduction on the dataset before training not only improved the standalone performance of ELMo, but also improved its performance when combined with classical word embeddings. We also demonstrate that FastText skip-gram embeddings can perform significantly better on semantic similarity tasks than previously indicated by studies in this field.
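The classical-word-embedding baseline evaluated here typically represents each sentence as the average of its token vectors and scores similarity with cosine distance. The sketch below illustrates that idea with toy three-dimensional vectors standing in for pretrained embeddings; the token vectors, dimensionality, and averaging scheme are illustrative assumptions, not the paper's exact pipeline.

```python
import math

def avg_vector(tokens, embeddings, dim):
    """Average the word vectors of a sentence; tokens without a vector are skipped."""
    vecs = [embeddings[t] for t in tokens if t in embeddings]
    if not vecs:
        return [0.0] * dim
    return [sum(col) / len(vecs) for col in zip(*vecs)]

def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

# Toy embeddings standing in for pretrained FastText/GloVe vectors.
emb = {
    "cachorro": [0.9, 0.1, 0.0],
    "cão":      [0.8, 0.2, 0.1],
    "late":     [0.1, 0.9, 0.0],
    "corre":    [0.2, 0.1, 0.9],
}

s1 = avg_vector(["cachorro", "late"], emb, 3)   # "cachorro late"
s2 = avg_vector(["cão", "late"], emb, 3)        # near-paraphrase of s1
s3 = avg_vector(["cachorro", "corre"], emb, 3)  # different meaning

# The near-paraphrase pair scores higher than the unrelated pair.
print(cosine(s1, s2) > cosine(s1, s3))
```

With real pretrained vectors the same scoring function applies unchanged; contextual models such as ELMo differ in that the vector for each token depends on the whole sentence rather than being looked up from a static table.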

The source code for the experiments described in this paper has been published on GitHub at https://github.com/ruanchaves/elmo.
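The abstract reports that reducing the dataset's vocabulary before training improved ELMo's results. One common way to implement such a reduction is to map tokens below a frequency threshold to a single placeholder; the threshold value and the `<unk>` placeholder below are assumptions for illustration, not the procedure used in the paper (see the repository above for the actual code).

```python
from collections import Counter

def reduce_vocabulary(sentences, min_count=2, unk="<unk>"):
    """Replace tokens rarer than min_count with a single placeholder,
    shrinking the vocabulary the model has to handle."""
    counts = Counter(tok for sent in sentences for tok in sent)
    return [[tok if counts[tok] >= min_count else unk for tok in sent]
            for sent in sentences]

corpus = [
    ["o", "cachorro", "late"],
    ["o", "gato", "mia"],
    ["o", "cachorro", "corre"],
]

# 'gato', 'mia', 'late' and 'corre' each occur once, so they collapse to <unk>.
print(reduce_vocabulary(corpus)[1])
```

Collapsing rare tokens this way trades some lexical detail for denser statistics on the remaining vocabulary, which can matter on a dataset as small as ASSIN.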



Notes

  1. https://allennlp.org/elmo.


Author information


Corresponding author

Correspondence to Ruan Chaves Rodrigues.


Copyright information

© 2020 Springer Nature Switzerland AG

About this paper


Cite this paper

Rodrigues, R.C., Rodrigues, J., de Castro, P.V.Q., da Silva, N.F.F., Soares, A. (2020). Portuguese Language Models and Word Embeddings: Evaluating on Semantic Similarity Tasks. In: Quaresma, P., Vieira, R., Aluísio, S., Moniz, H., Batista, F., Gonçalves, T. (eds) Computational Processing of the Portuguese Language. PROPOR 2020. Lecture Notes in Computer Science(), vol 12037. Springer, Cham. https://doi.org/10.1007/978-3-030-41505-1_23


  • DOI: https://doi.org/10.1007/978-3-030-41505-1_23

  • Published:

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-41504-4

  • Online ISBN: 978-3-030-41505-1

  • eBook Packages: Computer Science (R0)
