research-article

An Arabic Probabilistic Parser Based on a Property Grammar

Published: 13 October 2023

Abstract

Arabic parsing still has to contend with language-specific difficulties such as agglutination, vocalization, and the relatively free word order of Arabic sentences. To be robust, such parsing should define different types of constraints. The Property Grammar (PG) formalism verifies the satisfiability of constraints directly on the units of a structure by means of its properties (or relations). In this context, we propose a probabilistic parser with syntactic properties, built on a PG, and we measure the production rules in terms of different kinds of implicit information, in particular the syntactic properties. We evaluated our parser on the ATB treebank, using the CYK parsing algorithm, and obtained encouraging results. Our method also automates the implementation of most property types. Generalizing it to other languages or corpus domains (using treebanks) is a promising direction. Combining it with pre-trained BERT models may also make our parser faster.
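
To make the role of CYK in a probabilistic parser concrete, here is a minimal sketch of probabilistic (Viterbi-style) CYK over a grammar in Chomsky normal form. It is not the parser described in the abstract: the grammar, its probabilities, the placeholder tokens, and the names `BINARY_RULES`, `LEXICAL_RULES`, `cyk_best_parse`, and `build_tree` are all assumptions made for illustration, and the sketch does not model Property Grammar constraints at all.

```python
from collections import defaultdict

# Hypothetical toy grammar in Chomsky normal form (CNF).
# Rules and probabilities are invented for illustration only;
# they are NOT taken from the paper or from the ATB treebank.
BINARY_RULES = {
    ("S", "V", "ARGS"): 0.6,   # verb-initial sentence
    ("S", "NP", "VP"): 0.4,    # subject-initial sentence
    ("ARGS", "NP", "NP"): 1.0,
    ("VP", "V", "NP"): 1.0,
}
LEXICAL_RULES = {
    ("V", "read"): 1.0,
    ("NP", "boy"): 0.5,
    ("NP", "book"): 0.5,
}


def build_tree(chart, i, j, label):
    """Follow backpointers to rebuild the best tree for `label` over [i, j)."""
    _, back = chart[(i, j)][label]
    if isinstance(back, str):          # lexical cell: backpointer is the token
        return (label, back)
    k, left, right = back              # binary cell: split point and children
    return (label,
            build_tree(chart, i, k, left),
            build_tree(chart, k, j, right))


def cyk_best_parse(tokens, start="S"):
    """Return (probability, tree) of the most probable parse, or None."""
    n = len(tokens)
    # chart[(i, j)][label] = (best probability, backpointer)
    chart = defaultdict(dict)

    # Width-1 spans: apply lexical rules.
    for i, tok in enumerate(tokens):
        for (label, word), p in LEXICAL_RULES.items():
            if word == tok:
                chart[(i, i + 1)][label] = (p, tok)

    # Wider spans: combine two adjacent sub-spans with a binary rule,
    # keeping only the highest-probability derivation per label.
    for width in range(2, n + 1):
        for i in range(0, n - width + 1):
            j = i + width
            for k in range(i + 1, j):
                for (parent, left, right), p in BINARY_RULES.items():
                    if left in chart[(i, k)] and right in chart[(k, j)]:
                        score = p * chart[(i, k)][left][0] * chart[(k, j)][right][0]
                        best = chart[(i, j)].get(parent, (0.0, None))[0]
                        if score > best:
                            chart[(i, j)][parent] = (score, (k, left, right))

    if start not in chart[(0, n)]:
        return None
    return chart[(0, n)][start][0], build_tree(chart, 0, n, start)


if __name__ == "__main__":
    print(cyk_best_parse(["read", "boy", "book"]))
```

In a property-based variant, the rule probabilities would presumably also reflect how well a candidate constituent satisfies the grammar's properties; how the paper does this is not stated in the abstract, so the sketch keeps plain rule probabilities.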


    Published In

    ACM Transactions on Asian and Low-Resource Language Information Processing  Volume 22, Issue 10
    October 2023
    226 pages
    ISSN:2375-4699
    EISSN:2375-4702
    DOI:10.1145/3627976

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 13 October 2023
    Online AM: 19 September 2023
    Accepted: 23 July 2023
    Revised: 01 July 2023
    Received: 25 October 2022
    Published in TALLIP Volume 22, Issue 10

    Author Tags

    1. Probabilistic parser
    2. property grammar formalism
    3. Arabic language
    4. lexicalized grammar
