Abstract
The pervasiveness of social media in the present digital era has empowered ‘netizens’ to be more creative and interactive, and to generate content in free language forms that are often closer to spoken language and hence exhibit phenomena previously analysed mainly in speech. One such phenomenon is code-mixing, which occurs when multilingual persons switch freely between the languages they have in common. Code-mixing presents many new challenges for language processing, and this paper discusses some of them, taking as a starting point the problems of collecting and annotating three corpora of code-mixed Indian social media text: one corpus of English-Bengali Twitter messages and two corpora of English-Hindi Twitter and Facebook messages, respectively. We present statistics for these corpora, discuss part-of-speech tagging of the corpora using both a coarse-grained and a fine-grained tag set, and compare their complexity to that of several other code-mixed corpora using a Code-Mixing Index.
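The abstract does not define the Code-Mixing Index. As a rough illustration only, the sketch below assumes the utterance-level formulation usually attributed to the authors' earlier work, CMI = 100 × (1 − max(w_i) / (n − u)), where n is the number of tokens, u the number of language-independent tokens, and max(w_i) the token count of the dominant language. The function name, the tag labels and the default set of neutral tags are hypothetical and are not taken from the chapter.

```python
from collections import Counter


def code_mixing_index(lang_tags, neutral_tags=frozenset({"univ", "other"})):
    """Utterance-level Code-Mixing Index under the assumed formulation.

    lang_tags    -- one language label per token, e.g. ["en", "hi", "univ"]
    neutral_tags -- labels treated as language-independent (an assumption)
    """
    n = len(lang_tags)
    # Count tokens per language, ignoring language-independent tokens.
    counts = Counter(tag for tag in lang_tags if tag not in neutral_tags)
    u = n - sum(counts.values())
    if n == u:
        # No language-specific tokens at all: treat as not mixed.
        return 0.0
    dominant = max(counts.values())
    return 100.0 * (1.0 - dominant / (n - u))


if __name__ == "__main__":
    # Hypothetical tag sequence for a four-token message.
    print(code_mixing_index(["en", "en", "hi", "univ"]))
```

Under this formulation a monolingual utterance scores 0, and the score rises as the tokens are spread more evenly across languages.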
Acknowledgements
Thanks to the researchers who have made their datasets available: the organisers of the shared tasks on code-switching at EMNLP 2014 and on transliteration at FIRE 2014 and FIRE 2015, as well as Dong Nguyen and Seza Doğruöz (University of Twente and Tilburg University, The Netherlands, respectively), and Monojit Choudhury and Kalika Bali (both at Microsoft Research India). Thanks also to an anonymous reviewer for extensive and useful comments.
Copyright information
© 2018 Springer International Publishing AG, part of Springer Nature
About this paper
Cite this paper
Jamatia, A., Gambäck, B., Das, A. (2018). Collecting and Annotating Indian Social Media Code-Mixed Corpora. In: Gelbukh, A. (eds) Computational Linguistics and Intelligent Text Processing. CICLing 2016. Lecture Notes in Computer Science, vol 9624. Springer, Cham. https://doi.org/10.1007/978-3-319-75487-1_32
DOI: https://doi.org/10.1007/978-3-319-75487-1_32
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-75486-4
Online ISBN: 978-3-319-75487-1
eBook Packages: Computer Science, Computer Science (R0)