Multi-Parallel Corpus of North Levantine Arabic
Mateusz Krubiński, Hashem Sellat, Shadi Saleh, Adam Pospíšil, Petr Zemánek, Pavel Pecina
Abstract
Low-resource Machine Translation (MT) is characterized by the scarce availability of training data and/or standardized evaluation benchmarks. In the context of Dialectal Arabic, recent works introduced several evaluation benchmarks covering both Modern Standard Arabic (MSA) and dialects, mapping, however, mostly to a single Indo-European language - English. In this work, we introduce a multi-lingual corpus consisting of 120,600 multi-parallel sentences in English, French, German, Greek, Spanish, and MSA selected from the OpenSubtitles corpus, which were manually translated into the North Levantine Arabic. By conducting a series of training and fine-tuning experiments, we explore how this novel resource can contribute to the research on Arabic MT.
- Anthology ID:
- 2023.arabicnlp-1.34
- Volume:
- Proceedings of ArabicNLP 2023
- Month:
- December
- Year:
- 2023
- Address:
- Singapore (Hybrid)
- Editors:
- Hassan Sawaf, Samhaa El-Beltagy, Wajdi Zaghouani, Walid Magdy, Ahmed Abdelali, Nadi Tomeh, Ibrahim Abu Farha, Nizar Habash, Salam Khalifa, Amr Keleg, Hatem Haddad, Imed Zitouni, Khalil Mrini, Rawan Almatham
- Venues:
- ArabicNLP | WS
- Publisher:
- Association for Computational Linguistics
- Pages:
- 411–417
- URL:
- https://aclanthology.org/2023.arabicnlp-1.34
- DOI:
- 10.18653/v1/2023.arabicnlp-1.34
- Cite (ACL):
- Mateusz Krubiński, Hashem Sellat, Shadi Saleh, Adam Pospíšil, Petr Zemánek, and Pavel Pecina. 2023. Multi-Parallel Corpus of North Levantine Arabic. In Proceedings of ArabicNLP 2023, pages 411–417, Singapore (Hybrid). Association for Computational Linguistics.
- Cite (Informal):
- Multi-Parallel Corpus of North Levantine Arabic (Krubiński et al., ArabicNLP-WS 2023)
- PDF:
- https://aclanthology.org/2023.arabicnlp-1.34.pdf
Export citation
@inproceedings{krubinski-etal-2023-multi,
    title = "Multi-Parallel Corpus of {N}orth {L}evantine {A}rabic",
    author = "Krubi{\'n}ski, Mateusz and
      Sellat, Hashem and
      Saleh, Shadi and
      Posp{\'\i}{\v{s}}il, Adam and
      Zem{\'a}nek, Petr and
      Pecina, Pavel",
    editor = "Sawaf, Hassan and
      El-Beltagy, Samhaa and
      Zaghouani, Wajdi and
      Magdy, Walid and
      Abdelali, Ahmed and
      Tomeh, Nadi and
      Abu Farha, Ibrahim and
      Habash, Nizar and
      Khalifa, Salam and
      Keleg, Amr and
      Haddad, Hatem and
      Zitouni, Imed and
      Mrini, Khalil and
      Almatham, Rawan",
    booktitle = "Proceedings of ArabicNLP 2023",
    month = dec,
    year = "2023",
    address = "Singapore (Hybrid)",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2023.arabicnlp-1.34",
    doi = "10.18653/v1/2023.arabicnlp-1.34",
    pages = "411--417",
    abstract = "Low-resource Machine Translation (MT) is characterized by the scarce availability of training data and/or standardized evaluation benchmarks. In the context of Dialectal Arabic, recent works introduced several evaluation benchmarks covering both Modern Standard Arabic (MSA) and dialects, mapping, however, mostly to a single Indo-European language - English. In this work, we introduce a multi-lingual corpus consisting of 120,600 multi-parallel sentences in English, French, German, Greek, Spanish, and MSA selected from the OpenSubtitles corpus, which were manually translated into the North Levantine Arabic. By conducting a series of training and fine-tuning experiments, we explore how this novel resource can contribute to the research on Arabic MT.",
}