Abstract
Dialectal Arabic is a term covering the Arabic dialects that result from linguistic interference between Arabic and local or neighboring languages. Automatic dialect identification is the task of recognizing the dialect in which a text is written. It matters for several other NLP tasks, such as sentiment analysis, machine translation, and hate speech detection, where a priori knowledge of the dialect of an input text can help the rest of the process. The general objective of this paper is to propose a new approach for identifying Arabic dialects in textual content. The first obstacle to progress on this task is the lack of resources, and especially of balanced datasets. To address this issue, we started by building a balanced dataset by merging and filtering 7 datasets. We then proposed a new approach for Arabic dialect identification based on the combination of a CNN with a BiLSTM. To evaluate our model, we conducted a comparative study of several machine learning and deep learning models using a range of features. We treated these models as baselines and compared their results against those obtained by our model. The evaluation showed an improvement of about 2% in accuracy over the best baseline models.
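The abstract does not say how the 7 datasets were filtered or balanced. As a minimal sketch of one common way to do it, assuming each corpus is a list of (text, dialect) pairs, the Python below merges the corpora, drops empty and duplicate texts, and downsamples every dialect to the size of the smallest one. The function name `merge_and_balance`, the toy corpora, and the downsampling strategy are all illustrative assumptions, not the authors' procedure.

```python
import random
from collections import defaultdict


def merge_and_balance(datasets, seed=0):
    """Merge (text, dialect) pairs and downsample each dialect equally.

    Hypothetical helper: the filtering rules (drop empty texts, drop
    exact duplicates) are assumptions made for illustration only.
    """
    by_label = defaultdict(list)
    seen = set()
    for dataset in datasets:
        for text, label in dataset:
            text = text.strip()
            # Skip empty texts and exact duplicates; the first label seen wins.
            if not text or text in seen:
                continue
            seen.add(text)
            by_label[label].append(text)

    if not by_label:
        return []

    # Every dialect is cut down to the size of the rarest dialect.
    target = min(len(texts) for texts in by_label.values())
    rng = random.Random(seed)
    balanced = []
    for label in sorted(by_label):
        for text in rng.sample(by_label[label], target):
            balanced.append((text, label))
    rng.shuffle(balanced)
    return balanced


if __name__ == "__main__":
    # Toy placeholder corpora; real inputs would be the merged source datasets.
    corpus_a = [("text a1", "TN"), ("text a2", "TN"), ("text a3", "TN"), ("text e1", "EG")]
    corpus_b = [("text a1", "TN"), ("text e2", "EG"), ("   ", "EG")]
    for text, label in merge_and_balance([corpus_a, corpus_b]):
        print(label, text)
```

Downsampling discards data from the larger classes; oversampling or class weighting are alternatives when the smallest dialect is very small.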
Copyright information
© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Hedhli, M., Kboubi, F. (2023). CNN-BiLSTM Model for Arabic Dialect Identification. In: Nguyen, N.T., et al. Advances in Computational Collective Intelligence. ICCCI 2023. Communications in Computer and Information Science, vol 1864. Springer, Cham. https://doi.org/10.1007/978-3-031-41774-0_17
DOI: https://doi.org/10.1007/978-3-031-41774-0_17
Published:
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-41773-3
Online ISBN: 978-3-031-41774-0
eBook Packages: Computer Science, Computer Science (R0)