DOI: 10.1145/2632188.2632207
poster

Automatic identification of Arabic dialects in social media

Published: 11 July 2014

Abstract

Modern Standard Arabic (MSA) is the formal language in most Arabic countries. Arabic Dialects (AD), the language of daily use, differ from MSA, especially in social media communication. However, most Arabic social media texts mix the two forms, with many variations between MSA and AD. This paper aims to bridge the gap between MSA and AD by providing a framework for AD classification using probabilistic models across social media datasets. We present a set of experiments using character n-gram Markov language models and Naive Bayes classifiers, with a detailed examination of which models perform best under different conditions in a social media context. Experimental results show that a Naive Bayes classifier based on a character bi-gram model can identify the 18 different Arabic dialects with an overall accuracy of 98%. This work is a first step towards the ultimate goal of a translation system from Arabic to English and French, within the ASMAT project.
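
The configuration the abstract reports as best, a Naive Bayes classifier over character bi-grams, could be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: the class name `BigramNaiveBayes`, the helper `char_bigrams`, the add-one (Laplace) smoothing, the space padding at text boundaries, and the two toy training phrases with their `EGY` and `LEV` labels are all assumptions made for the example. None of it comes from the paper's datasets or its 18-dialect setup.

```python
import math
from collections import Counter, defaultdict


def char_bigrams(text):
    """Return overlapping character bi-grams, padded with one space per side."""
    padded = f" {text.strip()} "
    return [padded[i:i + 2] for i in range(len(padded) - 1)]


class BigramNaiveBayes:
    """Toy multinomial Naive Bayes over character bi-grams (hypothetical helper)."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha                   # additive smoothing (assumed, not from the paper)
        self.counts = defaultdict(Counter)   # label -> bi-gram counts
        self.doc_counts = Counter()          # label -> number of training texts
        self.vocab = set()                   # all bi-grams seen in training

    def fit(self, texts, labels):
        for text, label in zip(texts, labels):
            grams = char_bigrams(text)
            self.counts[label].update(grams)
            self.doc_counts[label] += 1
            self.vocab.update(grams)
        return self

    def log_scores(self, text):
        """Return log P(label) + sum of log P(bi-gram | label) for each label."""
        total_docs = sum(self.doc_counts.values())
        vocab_size = len(self.vocab)
        grams = char_bigrams(text)
        scores = {}
        for label, gram_counts in self.counts.items():
            total = sum(gram_counts.values())
            score = math.log(self.doc_counts[label] / total_docs)
            for gram in grams:
                # Unseen bi-grams get the smoothed floor alpha / (total + alpha * V).
                score += math.log(
                    (gram_counts[gram] + self.alpha) / (total + self.alpha * vocab_size)
                )
            scores[label] = score
        return scores

    def predict(self, text):
        scores = self.log_scores(text)
        return max(scores, key=scores.get)


# Toy training data for illustration only: one Egyptian-style and one
# Levantine-style greeting. A real system would need a large labelled corpus.
train_texts = ["ازيك عامل ايه", "كيفك شو اخبارك"]
train_labels = ["EGY", "LEV"]

model = BigramNaiveBayes().fit(train_texts, train_labels)
prediction = model.predict("عامل ايه")
```

Working in log space avoids numerical underflow when many small bi-gram probabilities are multiplied together, and the smoothing keeps a single unseen bi-gram from zeroing out a label's score.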


    Published In

    SoMeRA '14: Proceedings of the first international workshop on Social media retrieval and analysis
    July 2014
    72 pages
    ISBN:9781450330220
    DOI:10.1145/2632188
    Program Chairs: Markus Schedl, Peter Knees, Jialie Shen

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. Arabic morphology
    2. dialect
    3. language variant identification
    4. markov model
    5. naive bayes

    Qualifiers

    • Poster

    Conference

    SIGIR '14

    Acceptance Rates

    SoMeRA '14 Paper Acceptance Rate 13 of 19 submissions, 68%;
    Overall Acceptance Rate 13 of 19 submissions, 68%
