CNN-BiLSTM Model for Arabic Dialect Identification

  • Conference paper
Advances in Computational Collective Intelligence (ICCCI 2023)

Abstract

Dialectal Arabic is a term that covers the Arabic dialects that result from linguistic interference between the Arabic language and local or neighboring languages. Automatic dialect identification is the task of automatically recognizing the dialect in which a text is written. This task is important for several other NLP tasks, such as sentiment analysis, machine translation, and hate speech detection, where a priori knowledge of the dialect of an input text can be useful for the rest of the process. The general objective of this paper is to propose a new approach for the identification of Arabic dialects in textual content. The first obstacle to the development of this task is the lack of resources, especially balanced datasets. To address this issue and achieve our goal, we started by building a balanced dataset by merging and filtering 7 datasets. We then proposed a new approach for Arabic dialect identification based on the combination of a CNN with a BiLSTM. To evaluate our model, we conducted a comparative study of several machine learning and deep learning models using a range of features. We treated these models as baselines and compared their results against those obtained by our model. The evaluation of our approach showed an improvement of about 2% in accuracy over the best baseline models.
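
The abstract says only that the balanced dataset was built by merging and filtering 7 datasets. It does not give the filtering criteria. As a rough illustration of what such a step can look like, the sketch below merges several lists of (text, dialect) pairs, drops empty strings and exact duplicates, and downsamples every dialect to the size of the smallest class. The function name `merge_and_balance`, the duplicate filter, the downsampling strategy, and the toy data are all assumptions made for this example and are not taken from the paper.

```python
import random
from collections import Counter, defaultdict


def merge_and_balance(datasets, seed=0):
    """Merge (text, dialect) datasets and downsample to equal class sizes.

    Hypothetical helper: the filtering and balancing rules here are
    assumptions, not the procedure used by the authors.
    """
    by_label = defaultdict(list)
    seen = set()
    for dataset in datasets:
        for text, label in dataset:
            text = text.strip()
            # Assumed filter: skip empty texts and exact duplicates.
            if not text or (text, label) in seen:
                continue
            seen.add((text, label))
            by_label[label].append(text)
    if not by_label:
        return []
    # Assumed balancing rule: keep as many examples per dialect
    # as the smallest dialect has.
    target = min(len(texts) for texts in by_label.values())
    rng = random.Random(seed)
    balanced = []
    for label in sorted(by_label):
        for text in rng.sample(by_label[label], target):
            balanced.append((text, label))
    rng.shuffle(balanced)
    return balanced


# Toy placeholder data standing in for two of the source datasets.
toy_a = [("tn 1", "TN"), ("tn 2", "TN"), ("eg 1", "EG")]
toy_b = [("tn 2", "TN"), ("tn 3", "TN"), ("eg 2", "EG"), ("  ", "EG")]

balanced = merge_and_balance([toy_a, toy_b])
counts = Counter(label for _, label in balanced)
print(dict(counts))  # each dialect ends up with 2 examples
```

Downsampling is only one way to balance classes. It discards data from the larger dialects, so a real pipeline might prefer a different strategy depending on how uneven the sources are.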

Notes

  1. https://github.com/tarekeldeeb/GloVe-Arabic.

  2. https://github.com/malek-hedhli/Arabic-Dialects-Dataset.

Author information

Corresponding author

Correspondence to Malek Hedhli.

Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Hedhli, M., Kboubi, F. (2023). CNN-BiLSTM Model for Arabic Dialect Identification. In: Nguyen, N.T., et al. Advances in Computational Collective Intelligence. ICCCI 2023. Communications in Computer and Information Science, vol 1864. Springer, Cham. https://doi.org/10.1007/978-3-031-41774-0_17

  • DOI: https://doi.org/10.1007/978-3-031-41774-0_17

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-41773-3

  • Online ISBN: 978-3-031-41774-0

  • eBook Packages: Computer Science, Computer Science (R0)
