Collecting and Annotating Indian Social Media Code-Mixed Corpora

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 9624)

Included in the following conference series:

International Conference on Intelligent Text Processing and Computational Linguistics

Abstract

The pervasiveness of social media in the present digital era has empowered the ‘netizens’ to be more creative and interactive, and to generate content using free language forms that often are closer to spoken language and hence show phenomena previously mainly analysed in speech. One such phenomenon is code-mixing, which occurs when multilingual persons switch freely between the languages they have in common. Code-mixing presents many new challenges for language processing and the paper discusses some of them, taking as a starting point the problems of collecting and annotating three corpora of code-mixed Indian social media text: one corpus with English-Bengali Twitter messages and two corpora containing English-Hindi Twitter and Facebook messages, respectively. We present statistics of these corpora, discuss part-of-speech tagging of the corpora using both a coarse-grained and a fine-grained tag set, and compare their complexity to several other code-mixed corpora based on a Code-Mixing Index.
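The abstract does not define the Code-Mixing Index, so the following is only a minimal sketch of one utterance-level formulation used in the code-mixing literature: CMI = 100 × (1 − max(w_i) / (n − u)) when n > u, and 0 otherwise, where n is the number of tokens, u the number of language-independent tokens, and max(w_i) the token count of the dominant language. The function name, the tag labels, and the use of `None` for language-independent tokens are assumptions made for illustration; they are not taken from the paper.

```python
from collections import Counter
from typing import Iterable, Optional


def code_mixing_index(tags: Iterable[Optional[str]]) -> float:
    """Utterance-level Code-Mixing Index (hypothetical helper).

    `tags` holds one language label per token; None marks a token that is
    language independent (e.g. a hashtag, URL, or emoticon) -- an assumed
    convention for this sketch.
    """
    tags = list(tags)
    n = len(tags)                                   # all tokens
    lang_counts = Counter(t for t in tags if t is not None)
    u = n - sum(lang_counts.values())               # language-independent tokens
    if n == u:                                      # no language-specific tokens
        return 0.0
    dominant = max(lang_counts.values())            # tokens of the dominant language
    # Algebraically equal to 100 * (1 - dominant / (n - u)).
    return 100.0 * (n - u - dominant) / (n - u)


# Made-up example: three Hindi tokens, two English tokens, one neutral token.
example = ["hi", "hi", "en", "hi", None, "en"]
cmi = code_mixing_index(example)
```

Under this formulation a monolingual utterance scores 0, and the score rises as the non-dominant languages take a larger share of the language-specific tokens.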

Acknowledgements

Thanks to the different researchers who have made their datasets available: the organisers of the shared tasks on code-switching at EMNLP 2014 and in transliteration at FIRE 2014 and FIRE 2015, as well as Dong Nguyen and Seza Doğruöz (respectively University of Twente and Tilburg University, The Netherlands), and Monojit Choudhury and Kalika Bali (both at Microsoft Research India). Thanks also to an anonymous reviewer for extensive and useful comments.

Author information

Authors and Affiliations

National Institute of Technology, Agartala, Tripura, India
Anupam Jamatia
Norwegian University of Science and Technology, Trondheim, Norway
Björn Gambäck
Indian Institute of Information Technology, Sri City, Andhra Pradesh, India
Amitava Das

Corresponding author

Correspondence to Anupam Jamatia.

Editor information

Editors and Affiliations

CIC, Instituto Politécnico Nacional, Mexico City, Mexico
Alexander Gelbukh

Cite this paper

Jamatia, A., Gambäck, B., Das, A. (2018). Collecting and Annotating Indian Social Media Code-Mixed Corpora. In: Gelbukh, A. (eds) Computational Linguistics and Intelligent Text Processing. CICLing 2016. Lecture Notes in Computer Science, vol 9624. Springer, Cham. https://doi.org/10.1007/978-3-319-75487-1_32

DOI: https://doi.org/10.1007/978-3-319-75487-1_32
Published: 21 March 2018
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-75486-4
Online ISBN: 978-3-319-75487-1
eBook Packages: Computer Science, Computer Science (R0)
