research-article

An Arabic Probabilistic Parser Based on a Property Grammar

Authors:

Philippe Blache

ACM Transactions on Asian and Low-Resource Language Information Processing, Volume 22, Issue 10

Article No.: 237, Pages 1 - 25

https://doi.org/10.1145/3612921

Published: 13 October 2023

Abstract

The specificities of Arabic, such as agglutination, vocalization, and relatively free word order, remain major issues for parsing. To be robust, an Arabic parser should define different types of constraints. The Property Grammar (PG) formalism verifies the satisfiability of constraints directly on the units of the structure, by means of its properties (or relations). In this context, we propose a probabilistic parser built on syntactic properties, using a PG, and we evaluate the production rules in terms of different kinds of implicit information, in particular the syntactic properties. We tested our parser on the Arabic Treebank (ATB), using the CYK parsing algorithm, and obtained encouraging results. Our method also automates the implementation of most property types. Generalizing it to other languages or corpus domains (using treebanks) is a promising perspective. Combining it with pre-trained BERT models may also make our parser faster.
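For readers unfamiliar with probabilistic CYK parsing, the following is a minimal sketch of the general technique, not the parser described in the article. It assumes a grammar in Chomsky normal form; the rules, probabilities, English toy tokens, and the names `BINARY_RULES`, `LEXICAL_RULES`, `cyk_parse`, and `build_tree` are all hypothetical and chosen only for illustration. It does not model Property Grammar constraints or anything derived from the ATB.

```python
import math
from collections import defaultdict

# Hypothetical toy PCFG in Chomsky normal form. The rules and probabilities
# are invented for illustration; they do not come from the ATB or the article.
BINARY_RULES = {
    ("S", ("NP", "VP")): 1.0,
    ("VP", ("V", "NP")): 1.0,
    ("NP", ("DET", "N")): 0.6,
}
LEXICAL_RULES = {
    ("NP", "she"): 0.4,
    ("V", "reads"): 1.0,
    ("DET", "the"): 1.0,
    ("N", "book"): 1.0,
}


def build_tree(best, i, j, label):
    """Follow backpointers to rebuild the best tree for tokens[i:j]."""
    _, back = best[(i, j)][label]
    if isinstance(back, str):  # lexical entry: the backpointer is the token
        return (label, back)
    k, left, right = back
    return (label, build_tree(best, i, k, left), build_tree(best, k, j, right))


def cyk_parse(tokens, start="S"):
    """Return (probability, tree) of the most probable parse, or None."""
    n = len(tokens)
    # best[(i, j)][label] = (log probability, backpointer) for tokens[i:j]
    best = defaultdict(dict)

    # Width-1 spans: apply lexical rules.
    for i, tok in enumerate(tokens):
        for (label, word), p in LEXICAL_RULES.items():
            if word == tok:
                best[(i, i + 1)][label] = (math.log(p), tok)

    # Wider spans: try every split point and every binary rule,
    # keeping the highest-scoring analysis per label.
    for width in range(2, n + 1):
        for i in range(n - width + 1):
            j = i + width
            for k in range(i + 1, j):
                for (parent, (left, right)), p in BINARY_RULES.items():
                    if left in best[(i, k)] and right in best[(k, j)]:
                        score = (math.log(p)
                                 + best[(i, k)][left][0]
                                 + best[(k, j)][right][0])
                        current = best[(i, j)].get(parent)
                        if current is None or score > current[0]:
                            best[(i, j)][parent] = (score, (k, left, right))

    if start not in best[(0, n)]:
        return None
    return math.exp(best[(0, n)][start][0]), build_tree(best, 0, n, start)
```

Calling `cyk_parse("she reads the book".split())` returns the probability of the best parse together with a nested-tuple tree. In a property-based setting, the rule probabilities would presumably be informed by additional information such as satisfied properties, but how the article does this is not specified in the abstract.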

Index Terms

An Arabic Probabilistic Parser Based on a Property Grammar
1. Computing methodologies
  1. Artificial intelligence
    1. Natural language processing
      1. Language resources

Information & Contributors

Information

Published In

ACM Transactions on Asian and Low-Resource Language Information Processing Volume 22, Issue 10

October 2023

226 pages

ISSN:2375-4699

EISSN:2375-4702

DOI:10.1145/3627976

Editor:
Imed Zitouni
Google, USA

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 13 October 2023

Online AM: 19 September 2023

Accepted: 23 July 2023

Revised: 01 July 2023

Received: 25 October 2022

Published in TALLIP Volume 22, Issue 10
