Abstract
Verbal multiword expressions (VMWEs) such as to make ends meet require special attention in NLP and linguistic research, and annotated corpora are valuable resources for studying them. Corpora annotated with VMWEs in several languages, including Brazilian Portuguese, were made freely available in the PARSEME shared task. The goal of this paper is to describe and analyze this corpus in terms of the characteristics of annotated VMWEs in Brazilian Portuguese. First, we summarize and exemplify the criteria used to annotate VMWEs. Then, we analyze their frequency, average length, discontinuities and variability. We further discuss challenging constructions and borderline cases. We believe that this analysis can improve the annotated corpus and its results can be used to develop systems for automatic VMWE identification.
Access this chapter
Tax calculation will be finalised at checkout
Purchases are for personal use only
Similar content being viewed by others
Notes
- 1.
Editions 1.0 (2017) and 1.1 (2018): http://multiword.sourceforge.net/sharedtask2018.
- 2.
- 3.
Boldface indicates lexicalized components for all examples throughout this paper.
- 4.
A flexibility test verifies to what extent a change usually allowed by a language’s grammar also applies to the candidate to annotate.
- 5.
A word that does not co-occur with any other word outside the VMWE.
- 6.
- 7.
In number of intervening tokens.
- 8.
The normalized form of a VMWE is its sequence of lemmatized lexicalized components in lexicographic order, whereas its surface form is the textual sequence [8].
References
Baldwin, T., Kim, S.N.: Multiword expressions. In: Indurkhya, N., Damerau, F.J. (eds.) Handbook of Natural Language Processing, 2nd edn, pp. 267–292. CRC Press, Boca Raton (2010)
Bocorny Finatto, M.J., Scarton, C.E., Rocha, A., Aluísio, S.M.: Características do jornalismo popular: avaliação da inteligibilidade e auxílio à descrição do gênero. In: VIII Simpósio Brasileiro de Tecnologia da Informação e da Linguagem Humana, pp. 30–39. Sociedade Brasileira de Computação, Cuiabá, MT, Brazil (2011)
Constant, M., et al.: Multiword expression processing: a survey. Comput. Linguistics 43(4), 837–892 (2017). https://doi.org/10.1162/COLI_a_00302
Constant, M., Nivre, J.: A transition-based system for joint lexical and syntactic analysis. In: Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 161–171. Association for Computational Linguistics, August 2016. http://www.aclweb.org/anthology/P16-1016
Fotopoulou, A., Markantonatou, S., Giouli, V.: Encoding MWEs in a conceptual lexicon. In: Proceedings of the 10th Workshop on Multiword Expressions, MWE 2014, pp. 43–47. Association for Computational Linguistics (2014)
Nissim, M., Zaninello, A.: Modeling the internal variability of multiword expressions through a pattern-based method. ACM TSLP Special Issue MWEs 10(2) (2013)
Nivre, J., et al.: Universal dependencies v1: A multilingual treebank collection. In: Calzolari, N., et al. (eds.) Proceedings of the Tenth International Conference on Language Resources and Evaluation, LREC 2016, pp. 1659–1666. European Language Resources Association (ELRA), May 2016
Pasquer, C.: Expressions polylexicales verbales: étude de la variabilité en corpus. In: Actes de la 18e Rencontre des Étudiants Chercheurs en Informatique pour le Traitement Automatique des Langues (TALN-RÉCITAL 2017) (2017)
Riedl, M., Biemann, C.: Impact of MWE resources on multiword recognition. In: Proceedings of the 12th Workshop on Multiword Expressions, MWE 2016, pp. 107–111. Association for Computational Linguistics (2016). http://anthology.aclweb.org/W16-1816
Rosén, V., et al.: A survey of multiword expressions in treebanks. In: Proceedings of the 14th International Workshop on Treebanks & Linguistic Theories Conference, December 2015. https://hal.archives-ouvertes.fr/hal-01226001
Sag, I.A., Baldwin, T., Bond, F., Copestake, A., Flickinger, D.: Multiword expressions: a pain in the neck for NLP. In: Gelbukh, A. (ed.) CICLing 2002. LNCS, vol. 2276, pp. 1–15. Springer, Heidelberg (2002). https://doi.org/10.1007/3-540-45715-1_1
Sanches Duran, M., Scarton, C.E., Aluísio, S.M., Ramisch, C.: Identifying Pronominal Verbs: Towards Automatic Disambiguation of the Clitic ’se’ in Portuguese. In: Proceedings of the 9th Workshop on Multiword Expressions, pp. 93–100. Association for Computational Linguistics, Atlanta, June 2013. http://www.aclweb.org/anthology/W13-1014
Savary, A., Cordeiro, S.R.: Literal readings of multiword expressions: as scarce as hen’s teeth. In: Proceedings of the 16th Workshop on Treebanks and Linguistic Theories (TLT 2016), Prague, Czech Republic (2018)
Savary, A., Jacquemin, C.: Reducing information variation in text. In: Renals, S., Grefenstette, G. (eds.) Text- and Speech-Triggered Information Access. LNCS (LNAI), vol. 2705, pp. 145–181. Springer, Heidelberg (2003). https://doi.org/10.1007/978-3-540-45115-0_6
Savary, A., et al.: The PARSEME Shared Task on automatic identification of verbal multiword expressions. In: Proceedings of the 13th Workshop on Multiword Expressions, MWE 217, pp. 31–47. Association for Computational Linguistics (2017). https://doi.org/10.18653/v1/W17-1704, http://aclanthology.coli.uni-saarland.de/pdf/W/W17/W17-1704.pdf
Straka, M., Straková, J.: Tokenizing, pos tagging, lemmatizing and parsing ud 2.0 with UDPipe. In: Proceedings of the CoNLL 2017 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies, pp. 88–99. Association for Computational Linguistics, Vancouver, August 2017
Tutin, A.: Comparing morphological and syntactic variations of support verb constructions and verbal full phrasemes in French: a corpus based study. In: PARSEME COST Action. Relieving the Pain in the Neck in Natural Language Processing: 7th Final General Meeting, Dubrovnik, Croatia (2016)
van Gompel, M., van der Sloot, K., Reynaert, M., van den Bosch, A.: FoLiA in practice: the infrastructure of a linguistic annotation format, pp. 71–81 (2017). https://doi.org/10.5334/bbi.6
Zeman, D., et al.: Conll 2017 shared task: Multilingual parsing from raw text to universal dependencies. In: Proceedings of the CoNLL 2017 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies, pp. 1–19 Association for Computational Linguistics, Vancouver, Canada, August 2017. http://www.aclweb.org/anthology/K/K17/K17-3001.pdf
Acknowledgement
We would like to thank Helena Caseli for her participation as an annotator. We would also like to thank the PARSEME shared task organizers, especially Agata Savary and Veronika Vincze. This work was supported by the IC1207 PARSEME COST action (http://www.parseme.eu) and by the PARSEME-FR project (ANR-14-CERA-0001). (http://parsemefr.lif.univ-mrs.fr/)
Author information
Authors and Affiliations
Corresponding author
Editor information
Editors and Affiliations
Rights and permissions
Copyright information
© 2018 Springer Nature Switzerland AG
About this paper
Cite this paper
Ramisch, C., Ramisch, R., Zilio, L., Villavicencio, A., Cordeiro, S. (2018). A Corpus Study of Verbal Multiword Expressions in Brazilian Portuguese. In: Villavicencio, A., et al. Computational Processing of the Portuguese Language. PROPOR 2018. Lecture Notes in Computer Science(), vol 11122. Springer, Cham. https://doi.org/10.1007/978-3-319-99722-3_3
Download citation
DOI: https://doi.org/10.1007/978-3-319-99722-3_3
Published:
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-99721-6
Online ISBN: 978-3-319-99722-3
eBook Packages: Computer ScienceComputer Science (R0)