2019
Leveraging Pretrained Word Embeddings for Part-of-Speech Tagging of Code Switching Data
Fahad AlGhamdi | Mona Diab
Proceedings of the Sixth Workshop on NLP for Similar Languages, Varieties and Dialects
Linguistic Code Switching (CS) is a phenomenon that occurs when multilingual speakers alternate between two or more languages or dialects within a single conversation. Processing CS data is especially challenging for intra-sentential switching, since state-of-the-art monolingual NLP technologies are geared toward processing one language at a time. In this paper, we address the problem of Part-of-Speech (POS) tagging in the context of linguistic code switching. We explore multiple neural network architectures to measure the impact of different pre-trained embedding methods on POS tagging of CS data. We investigate the landscape in four CS language pairs: Spanish-English, Hindi-English, Modern Standard Arabic-Egyptian Arabic dialect (MSA-EGY), and Modern Standard Arabic-Levantine Arabic dialect (MSA-LEV). Our results show that multilingual embeddings (e.g., MSA-EGY and MSA-LEV) help closely related languages (EGY/LEV) but add noise to languages that are distant (SPA/HIN). Finally, we show that our proposed models outperform state-of-the-art CS taggers for the MSA-EGY language pair.
2018
WASA: A Web Application for Sequence Annotation
Fahad AlGhamdi | Mona Diab
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)
Proceedings of the Third Workshop on Computational Approaches to Linguistic Code-Switching
Gustavo Aguilar | Fahad AlGhamdi | Victor Soto | Thamar Solorio | Mona Diab | Julia Hirschberg
Proceedings of the Third Workshop on Computational Approaches to Linguistic Code-Switching
Named Entity Recognition on Code-Switched Data: Overview of the CALCS 2018 Shared Task
Gustavo Aguilar | Fahad AlGhamdi | Victor Soto | Mona Diab | Julia Hirschberg | Thamar Solorio
Proceedings of the Third Workshop on Computational Approaches to Linguistic Code-Switching
In the third shared task of the Computational Approaches to Linguistic Code-Switching (CALCS) workshop, we focus on Named Entity Recognition (NER) on code-switched social-media data. We divide the shared task into two competitions based on the English-Spanish (ENG-SPA) and Modern Standard Arabic-Egyptian (MSA-EGY) language pairs. We use Twitter data and 9 entity types to establish a new dataset for code-switched NER benchmarks. In addition to the CS phenomenon, the diversity of the entities and the challenges of social media text make the task considerably harder. As a result, the best scores of the competitions are 63.76% and 71.61% for ENG-SPA and MSA-EGY, respectively. We present the scores of 9 participants and discuss the most common challenges among submissions.
2016
Overview for the Second Shared Task on Language Identification in Code-Switched Data
Giovanni Molina | Fahad AlGhamdi | Mahmoud Ghoneim | Abdelati Hawwari | Nicolas Rey-Villamizar | Mona Diab | Thamar Solorio
Proceedings of the Second Workshop on Computational Approaches to Code Switching
Part of Speech Tagging for Code Switched Data
Fahad AlGhamdi | Giovanni Molina | Mona Diab | Thamar Solorio | Abdelati Hawwari | Victor Soto | Julia Hirschberg
Proceedings of the Second Workshop on Computational Approaches to Code Switching
Creating a Large Multi-Layered Representational Repository of Linguistic Code Switched Arabic Data
Mona Diab | Mahmoud Ghoneim | Abdelati Hawwari | Fahad AlGhamdi | Nada AlMarwani | Mohamed Al-Badrashiny
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)
We present our effort to create a large multi-layered representational repository of linguistic code-switched Arabic data. The process involves developing clear annotation standards and guidelines, streamlining the annotation process, and implementing quality control measures. We used two main protocols for annotation: in-lab gold annotations and crowdsourced annotations. We developed a web-based annotation tool to facilitate the management of the annotation process. The current version of the repository contains a total of 886,252 tokens, each tagged with one of sixteen code-switching tags. The data exhibits code switching between Modern Standard Arabic and Egyptian Dialectal Arabic and represents three genres: tweets, commentaries, and discussion fora. The overall inter-annotator agreement is 93.1%.
2014
Overview for the First Shared Task on Language Identification in Code-Switched Data
Thamar Solorio | Elizabeth Blair | Suraj Maharjan | Steven Bethard | Mona Diab | Mahmoud Ghoneim | Abdelati Hawwari | Fahad AlGhamdi | Julia Hirschberg | Alison Chang | Pascale Fung
Proceedings of the First Workshop on Computational Approaches to Code Switching