Abstract
Collocations in the sense of idiosyncratic binary lexical co-occurrences are one of the biggest challenges for any language learner. Even advanced learners make collocation mistakes: they literally translate collocation elements from their native tongue, create new words as collocation elements, choose a wrong subcategorization for one of the elements, etc. Therefore, automatic collocation error detection and correction is increasingly in demand. However, while state-of-the-art models predict with reasonable accuracy whether a given co-occurrence is a valid collocation, only a few of them manage to suggest appropriate corrections with an acceptable hit rate. Most often, a ranked list of correction options is offered, from which the learner then has to choose. This is clearly unsatisfactory. Our proposal focuses on this critical part of the problem in the context of the acquisition of Spanish as a second language. For collocation error detection, we use a frequency-based technique. To improve collocation error correction, we discuss three different metrics with respect to their capability to select the most appropriate correction of miscollocations found in our learner corpus.
Notes
In accordance with the terminology in Second Language Learning literature, we refer to the native tongue of the learner as ‘L1’ and to her second language as ‘L2’.
Liu et al. (2009) achieve a higher accuracy; however, they start from a manually compiled list of miscollocations rather than from an automatically retrieved list.
CEDEL2 is an L1 English–L2 Spanish learner corpus under construction by Cristóbal Lozano in the framework of a bigger corpus-oriented project directed by Amaya Mendikoetxea at the Universidad Autónoma de Madrid. Currently, CEDEL2 contains about 730,000 words of essays in Spanish on a predefined range of topics by native speakers of English and (to a smaller extent, for contrastive studies) by native speakers of Spanish. The topics include, among others, "How is the region where you live?", "How do your plans for the future look like?", "How did you spend your last holidays?", and "Analyze the major aspects of immigration". The level of Spanish of the authors of the essays ranges from “elementary” through “lower intermediate”, “intermediate”, and “advanced” to “very advanced”. Further information on CEDEL2 can be obtained from http://www.uam.es/proyectoinv/woslac/cedel2.htm and (Lozano 2009; Lozano and Mendikoetxea 2013).
Automatic classification of miscollocations encountered in essays with respect to this typology is another big challenge, which remains to be tackled.
This annotation schema is currently used to annotate a fragment of CEDEL2.
The details of the filtering stage and the size of the correction list on which they calculate the reported MRR are not explicitly discussed in (Wu et al. 2010). However, we can deduce both from experiments with the MUST Collocation checker (http://miscollocation.appspot.com), which is based on their proposal.
Roughly speaking, members of the same “collocation cluster” are values of the same lexical function in the sense of Mel’čuk (1995).
As rightly pointed out by one of the reviewers, apart from graphic similarity, phonetic similarity should also be considered. A large number of phonetic distance measures are available; see (Kessler 2005) for an in-depth discussion, starting with the implementation of Russell and Odell’s Soundex for English.
Obviously, this strategy harbors the danger of contextual feature occurrence sparseness if the learner uses a (mis)collocation in very idiosyncratic contexts.
In contrast to information retrieval-oriented search, we do not eliminate from the context the functional words (which are otherwise considered to be “stop words” that do not contribute to the quality of the search) since they are essential for our task.
As a matter of fact, proper names are poor features for our task. We plan to discard them in future experiments.
The suggestion *agenciar [una] cita as a possible candidate is due to the wrong PoS tagging of the bigram agencia cita ‘agency cites’, which is very common in a newspaper corpus such as ours.
This suggests that a corpus of newspaper material is not a well-balanced corpus for the purposes of collocation-oriented CALL.
However, it is a standard collocation in Argentinian Spanish.
The context feature metric thus considered the same “features” as the lexical context metrics, but interpreted them differently.
As already mentioned above, MUST is an implementation of (Wu et al. 2010).
Instead of grow up mind, we introduced grow mind since MUST does not process collocations with phrasal verbs.
Since MUST’s corrections for grow mind did not include any correct suggestion, we added cultivate mind (the correction suggested by Li) to the bag of suggestions for grow mind.
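The note on Wu et al. (2010) refers to the mean reciprocal rank (MRR) computed over a list of correction suggestions. As an illustration only, the following minimal sketch shows how MRR is conventionally calculated; it is not the evaluation code used in the paper or in MUST. The function names and the toy suggestion lists are hypothetical and invented for this example; they are not data from the experiments.

```python
from typing import Dict, Sequence


def reciprocal_rank(ranked: Sequence[str], gold: str) -> float:
    """Return 1/rank of the gold correction in the ranked list, or 0.0 if absent."""
    for position, candidate in enumerate(ranked, start=1):
        if candidate == gold:
            return 1.0 / position
    return 0.0


def mean_reciprocal_rank(
    suggestions: Dict[str, Sequence[str]], gold: Dict[str, str]
) -> float:
    """Average the reciprocal ranks over all miscollocations that have a gold correction."""
    if not gold:
        return 0.0
    total = sum(
        reciprocal_rank(suggestions.get(miscollocation, []), correction)
        for miscollocation, correction in gold.items()
    )
    return total / len(gold)


# Toy data, invented for illustration (NOT results from the paper).
toy_suggestions = {
    "grow mind": ["develop mind", "cultivate mind", "expand mind"],
    "make homework": ["do homework"],
    "say joke": ["speak joke", "talk joke"],
}
toy_gold = {
    "grow mind": "cultivate mind",   # gold correction at rank 2 -> 1/2
    "make homework": "do homework",  # gold correction at rank 1 -> 1
    "say joke": "tell joke",         # gold correction missing   -> 0
}

mrr = mean_reciprocal_rank(toy_suggestions, toy_gold)
print(f"MRR = {mrr:.3f}")
```

Because a missing correction contributes 0 to the average, the length of the suggestion list directly affects the score, which is why the size of the correction list matters when MRR values are compared across systems.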
References
Alonso Ramos, M., Wanner, L., Vázquez, N., Vincze, O., Mosqueira, E., & Prieto, S. (2010a). Tagging collocations for learners. In S. Granger & M. Paquot (Eds.), eLexicography in the 21st century: New challenges, new applications. Proceedings of eLex 2009, Cahiers du Cental, volume 7, Louvain-la-Neuve.
Alonso Ramos, M., Wanner, L., Vincze, O., Casamayor, G., Vázquez, N., Mosqueira, E., & Prieto, S. (2010b). Towards a motivated annotation schema of collocation errors in learner corpora. In Proceedings of LREC 2010, Malta.
Atwell, E. (1987). How to detect grammatical errors in a text without parsing it. In Proceedings of the EACL Conference (pp. 38–45). Copenhagen, Denmark.
Bouma, G. (2010). Collocation extraction beyond the independence assumption. In Proceedings of the ACL Conference, Short paper track, Uppsala.
Chang, Y. C., Chang, J. S., Chen, H. J., & Liou, H. C. (2008). An automatic collocation writing assistant for Taiwanese EFL learners. A case of corpus-based NLP technology. Computer Assisted Language Learning, 21(3), 283–299.
Chen, H. (2009). Microsoft ESL assistant and NTNU statistical grammar checker. Computational Linguistics and Chinese Language Processing, 14(2), 161–180.
Choueka, Y. (1988). Looking for needles in a haystack or locating interesting collocational expressions in large textual databases. In Proceedings of the RIAO (pp. 34–38).
Church, K. W., & Hanks, P. (1989). Word association norms, mutual information, and lexicography. In Proceedings of the 27th Annual Meeting of the ACL (pp. 76–83).
Cowie, A. P. (1994). Phraseology. In R. E. Asher & J. Simpson (Eds.), The encyclopedia of language and linguistics (Vol. 6, pp. 3168–3171). Oxford: Pergamon.
Dahlmeier, D., & Ng, H. T. (2011). Correcting semantic collocation errors with L1-induced paraphrases. In Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing (pp. 107–117). Edinburgh, Scotland.
Evert, S. (2007). Corpora and collocations. In A. Lüdeling & M. Kytö (Eds.), Corpus linguistics. An international handbook. Berlin: Mouton de Gruyter.
Evert, S., & Kermes, H. (2003). Experiments on candidate data for collocation extraction. In Companion Volume to the Proceedings of the 10th Conference of the EACL (pp. 83–86).
Futagi, Y., Deane, P., Chodorow, M., & Tetreault, J. (2008). A computational approach to detecting collocation errors in the writing of non-native speakers of English. Computer Assisted Language Learning, 21(1), 353–367.
Gamon, M., Leacock, C., Brockett, C., Dolan, W., Gao, J., & Belenko, D. (2009). Using statistical techniques and web search to correct ESL errors. CALICO Journal, 26(3), 491–511.
Gilquin, G. (2007). To err is not all. What corpus and elicitation can reveal about the use of collocations by learners. Zeitschrift für Anglistik und Amerikanistik, 55(3), 273–291.
Granger, S. (1998). Prefabricated patterns in advanced EFL writing: Collocations and formulae. In A. Cowie (Ed.), Phraseology: Theory, analysis and applications (pp. 145–160). Oxford: Oxford University Press.
Hausmann, F.-J. (1984). Wortschatzlernen ist Kollokationslernen. Zum Lehren und Lernen französischer Wortwendungen. Praxis des neusprachlichen Unterrichts, 31(4), 395–406.
Hausmann, F.-J. (1989). Le dictionnaire de collocations. In F.-J. Hausmann, P. Reichmann, H. E. Wiegand, & L. Zgusta (Eds.), Wörterbücher, dictionaries, dictionnaires. Ein internationales Handbuch. Berlin: De Gruyter.
Hermet, M., Désilets, A., & Szpakowicz, S. (2008). Using the web as a linguistic resource to automatically correct lexico-syntactic errors. In Proceedings of the LREC 2008 (pp. 54–57), Marrakech.
Howarth, P. (1998a). Phraseology and second language acquisition. Applied Linguistics, 19(1), 24–44.
Howarth, P. (1998b). The phraseology of learner’s academic writing. In A. P. Cowie (Ed.), Phraseology: Theory, analysis and applications (pp. 161–186). Oxford: Oxford University Press.
Kessler, B. (2005). Phonetic comparison algorithms. Transactions of the Philological Society, 103(2), 243–260.
Kilgarriff, A. (2006). Collocationality (and how to measure it). In Proceedings of the 12th EURALEX International Congress, Torino.
Klein, D., & Manning, C. D. (2003). Accurate unlexicalized parsing. In Proceedings of the ACL Conference (pp. 423–430).
Knight, K., & Chander, I. (1994). Automated postediting of documents. In Proceedings of the AAAI Conference (pp. 779–784). Seattle, WA.
Lesniewska, J. (2006). Collocations and second language use. Studia Linguistica Universitatis Iagellonicae Cracoviensis, 123, 95–105.
Lewis, M. (2000). Teaching collocation. Further developments in the lexical approach. London: LTP.
Li, C. C. (2005). A study of collocational error types in ESL/EFL college learners. Ph.D. thesis, Ming Chuan University College of Applied Languages, Department of Applied English.
Liu, A. L.-E., Wible, D., & Tsao, N.-L. (2009). Automated suggestions for miscollocations. In Proceedings of the NAACL HLT Workshop on Innovative Use of NLP for Building Educational Applications (pp. 47–50). Boulder, CO.
Lozano, C. (2009). CEDEL2: Corpus escrito del español L2. In C. M. Bretones Callejas (Ed.), Applied linguistics now: Understanding language and mind (pp. 197–212). Almería: Universidad de Almería.
Lozano, C., & Mendikoetxea, A. (2013). Learner corpora and second language acquisition: The design and collection of CEDEL2. In A. Díaz-Negrillo, N. Ballier, & P. Thompson (Eds.), Automatic treatment and analysis of learner corpus data. Amsterdam: Benjamins Academic Publishers.
Mel’čuk, I. A. (1995). Phrasemes in language and phraseology in linguistics. In M. Everaert, E.-J. van der Linden, A. Schenk, & R. Schreuder (Eds.), Idioms: Structural and psychological perspectives (pp. 167–232). Hillsdale: Lawrence Erlbaum Associates.
Meurers, D. (2013). Natural language processing and language learning. In C. A. Chapelle (Ed.), Encyclopedia of applied linguistics (pp. 1–13). Hoboken: Blackwell.
Nation, I. S. P. (2001). Learning vocabulary in another language. Cambridge: Cambridge University Press.
Nesselhauf, N. (2003). The use of collocations by advanced learners of English and some implications for teaching. Applied Linguistics, 24(2), 223–242.
Nesselhauf, N. (2005). Collocations in a learner corpus. Amsterdam: Benjamins Academic Publishers.
Pantel, P., & Lin, D. (2000). Word-for-word glossing with contextually similar words. In Proceedings of 4th NAACL Conference (pp. 78–85). Seattle.
Park, T., Lank, E., Poupart, P., & Terry, M. (2008). Is the sky pure today? AwkChecker: An assistive tool for detecting and correcting errors. In Proceedings of the 21st Annual ACM Symposium on User Interface Software and Technology (UIST ’08), New York.
Pecina, P. (2008). A machine learning approach to multiword expression extraction. In Proceedings of the LREC 2008 Workshop Towards a Shared Task for Multiword Expressions (MWE 2008) (pp. 54–57), Marrakech.
Shei, C. C., & Pain, H. (2000). An ESL writer’s collocation aid. Computer Assisted Language Learning, 13(2), 167–182.
Smadja, F. (1993). Retrieving collocations from text: X-Tract. Computational Linguistics, 19(1), 143–177.
Vossen, P. (1998). EuroWordNet: A multilingual database with lexical semantic networks. Dordrecht: Kluwer Academic Publishers.
Wanner, L., Bohnet, B., & Giereth, M. (2006). Making sense of collocations. Computer Speech and Language, 20(4), 609–624.
Wible, D., Kuo, C.-H., Tsao, N.-L., Liu, A. L.-E., & Lin, H.-L. (2003). Bootstrapping in a language learning environment. Journal of Computer Assisted Learning, 19(4), 90–102.
Wible, D., & Tsao, N. L. (2010). Stringnet as a computational resource for discovering and investigating linguistic constructions. In Proceedings of the NAACL-HLT Workshop on Extracting and Using Constructions in Computational Linguistics, Los Angeles.
Wu, J.-C., Chang, Y.-C., Mitamura, T., & Chang, J. S. (2010). Automatic collocation suggestion in academic writing. In Proceedings of the ACL Conference, Short paper track, Uppsala.
Yin, X., Gao, J., & Dolan, W. (2008). A web-based English proofing system for English as a second language users. In Proceedings of the 3rd International Joint Conference on Natural Language Processing (pp. 619–624). Hyderabad, India.
Acknowledgments
Many thanks to Amaya Mendikoetxea and Cristóbal Lozano for making the CEDEL2 corpus available to us and to the two anonymous reviewers for their insightful comments, which considerably improved the final version of the paper. Our experiments have been partially run on the Argo cluster of the Department of Communication and Information Technologies, UPF. We are grateful for this service and would like to thank especially Silvina Re and Iván Jiménez for their help. This work has been partially funded by the Spanish Ministry of Science and Innovation under the contract numbers FFI2008-06479-C02-01/02 and FFI2011-30219-CO2-01/02.
Cite this article
Ferraro, G., Nazar, R., Alonso Ramos, M. et al. Towards advanced collocation error correction in Spanish learner corpora. Lang Resources & Evaluation 48, 45–64 (2014). https://doi.org/10.1007/s10579-013-9242-3
DOI: https://doi.org/10.1007/s10579-013-9242-3