CNN-BiLSTM Model for Arabic Dialect Identification

Part of the book series: Communications in Computer and Information Science (CCIS, volume 1864)

Included in the following conference series:

International Conference on Computational Collective Intelligence

Abstract

Dialectal Arabic is a term covering the Arabic dialects that result from linguistic interference between the Arabic language and local or neighboring languages. Automatic dialect identification is the task of recognizing the dialect in which a text is written. This task matters for several other NLP tasks, such as sentiment analysis, machine translation, and hate speech detection, where a priori knowledge of the dialect of an input text can be useful for the rest of the process. The general objective of this paper is to propose a new approach for identifying Arabic dialects in textual content. The first obstacle to this task is the lack of resources, and especially of balanced datasets. To address this issue, we started by building a balanced dataset by merging and filtering 7 datasets. We then proposed a new approach for Arabic dialect identification based on the combination of a CNN with a BiLSTM. To evaluate our model, we conducted a comparative study of several machine learning and deep learning models using a range of features. We treated these models as baselines and compared their results against those obtained by our model. The evaluation showed an improvement of about 2% in accuracy over the best baseline models.
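The abstract does not say how the seven datasets were merged, filtered, or balanced. As an illustration only, the sketch below assumes one common recipe: pool the corpora, drop empty and duplicate texts, then downsample every dialect to the size of the smallest class. The function name `build_balanced_dataset`, the `(text, dialect)` pair format, and the dialect labels are hypothetical and do not come from the paper.

```python
import random
from collections import defaultdict


def build_balanced_dataset(sources, seed=0):
    """Merge (text, dialect) corpora and downsample to equal class sizes.

    Assumed recipe, not the authors' procedure: exact-duplicate and
    empty-text filtering followed by random downsampling per dialect.
    """
    by_dialect = defaultdict(list)
    seen = set()
    for corpus in sources:
        for text, dialect in corpus:
            cleaned = " ".join(text.split())  # collapse runs of whitespace
            # Filtering step (assumed): skip empty texts and exact duplicates.
            if not cleaned or cleaned in seen:
                continue
            seen.add(cleaned)
            by_dialect[dialect].append(cleaned)

    if not by_dialect:
        return []

    # Every dialect is cut down to the size of the rarest one.
    target = min(len(texts) for texts in by_dialect.values())
    rng = random.Random(seed)  # fixed seed keeps the sample reproducible

    balanced = []
    for dialect in sorted(by_dialect):
        for text in rng.sample(by_dialect[dialect], target):
            balanced.append((text, dialect))
    rng.shuffle(balanced)
    return balanced


# Toy corpora with placeholder texts and hypothetical dialect labels.
corpus_a = [("aaa", "TUN"), ("bbb", "TUN"), ("ccc", "EGY"), ("aaa", "TUN")]
corpus_b = [("ddd", "TUN"), ("  ", "EGY"), ("eee", "LEV"), ("fff", "EGY")]
data = build_balanced_dataset([corpus_a, corpus_b])
```

Downsampling discards data from the larger classes; oversampling or class weighting are alternatives when the smallest dialect is very small.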

Author information

Authors and Affiliations

National School of Computer Science, ENSI’s RIADI Laboratory, Manouba, Tunisia
Malek Hedhli & Ferihane Kboubi

Corresponding author

Correspondence to Malek Hedhli.

Editor information

Editors and Affiliations

Wrocław University of Science and Technology, Wrocław, Poland
Ngoc Thanh Nguyen
Eötvös Loránd University, Budapest, Hungary
János Botzheim
Eötvös Loránd University, Budapest, Hungary
László Gulyás
Universidad Complutense de Madrid, Madrid, Spain
Manuel Nunez
Vrije Universiteit Amsterdam, Amsterdam, The Netherlands
Jan Treur
University of Münster, Münster, Germany
Gottfried Vossen
Wrocław University of Science and Technology, Wrocław, Poland
Adrianna Kozierkiewicz

Cite this paper

Hedhli, M., Kboubi, F. (2023). CNN-BiLSTM Model for Arabic Dialect Identification. In: Nguyen, N.T., et al. Advances in Computational Collective Intelligence. ICCCI 2023. Communications in Computer and Information Science, vol 1864. Springer, Cham. https://doi.org/10.1007/978-3-031-41774-0_17

DOI: https://doi.org/10.1007/978-3-031-41774-0_17
Published: 22 September 2023
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-41773-3
Online ISBN: 978-3-031-41774-0
eBook Packages: Computer Science, Computer Science (R0)
