How Low is Too Low? A Computational Perspective on Extremely Low-Resource Languages

Rachit Bansal, Himanshu Choudhary, Ravneet Punia, Niko Schenk, Émilie Pagé-Perron, Jacob Dahl

Abstract

Despite the recent advancements of attention-based deep learning architectures across a majority of Natural Language Processing tasks, their application remains limited in a low-resource setting because of a lack of pre-trained models for such languages. In this study, we make the first attempt to investigate the challenges of adapting these techniques to an extremely low-resource language – Sumerian cuneiform – one of the world’s oldest written languages attested from at least the beginning of the 3rd millennium BC. Specifically, we introduce the first cross-lingual information extraction pipeline for Sumerian, which includes part-of-speech tagging, named entity recognition, and machine translation. We introduce InterpretLR, an interpretability toolkit for low-resource NLP and use it alongside human evaluations to gauge the trained models. Notably, all our techniques and most components of our pipeline can be generalised to any low-resource language. We publicly release all our implementations including a novel data set with domain-specific pre-processing to promote further research in this domain.

Anthology ID: 2021.acl-srw.5
Volume: Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing: Student Research Workshop
Month: August
Year: 2021
Address: Online
Editors: Jad Kabbara, Haitao Lin, Amandalynne Paullada, Jannis Vamvas
Venues: ACL | IJCNLP
Publisher: Association for Computational Linguistics
Pages: 44–59
URL: https://aclanthology.org/2021.acl-srw.5
DOI: 10.18653/v1/2021.acl-srw.5
Cite (ACL): Rachit Bansal, Himanshu Choudhary, Ravneet Punia, Niko Schenk, Émilie Pagé-Perron, and Jacob Dahl. 2021. How Low is Too Low? A Computational Perspective on Extremely Low-Resource Languages. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing: Student Research Workshop, pages 44–59, Online. Association for Computational Linguistics.
Cite (Informal): How Low is Too Low? A Computational Perspective on Extremely Low-Resource Languages (Bansal et al., ACL-IJCNLP 2021)
PDF: https://aclanthology.org/2021.acl-srw.5.pdf
Optional supplementary material: 2021.acl-srw.5.OptionalSupplementaryMaterial.zip
Video: https://aclanthology.org/2021.acl-srw.5.mp4
Code: cdli-gh/Semi-Supervised-NMT-for-Sumerian-English
