2024
pdf
bib
abs
One Model is All You Need: ByT5-Sanskrit, a Unified Model for Sanskrit NLP Tasks
Sebastian Nehrdich
|
Oliver Hellwig
|
Kurt Keutzer
Findings of the Association for Computational Linguistics: EMNLP 2024
Morphologically rich languages are notoriously challenging to process for downstream NLP applications. This paper presents a new pretrained language model, ByT5-Sanskrit, designed for NLP applications involving the morphologically rich language Sanskrit. We evaluate ByT5-Sanskrit on established Sanskrit word segmentation tasks, where it outperforms previous data-driven approaches by a considerable margin and matches the performance of the current best lexicon-based model. It is easier to deploy and more robust to data not covered by external linguistic resources. It also achieves new state-of-the-art results in Vedic Sanskrit dependency parsing and OCR post-correction tasks. Additionally, based on the Digital Corpus of Sanskrit, we introduce a novel multitask dataset for the joint training of Sanskrit word segmentation, lemmatization, and morphosyntactic tagging tasks. We fine-tune ByT5-Sanskrit on this dataset, creating a versatile multitask model for various downstream Sanskrit applications. We have used this model in Sanskrit linguistic annotation projects, in information retrieval setups, and as a preprocessing step in a Sanskrit machine translation pipeline. We also show that our approach yields new best scores for lemmatization and dependency parsing of other morphologically rich languages. We thus demonstrate that byte-level pretrained language models can achieve excellent performance for morphologically rich languages, outperforming tokenizer-based models and presenting an important vector of exploration when constructing NLP pipelines for such languages.
2023
pdf
bib
abs
MITRA-zh: An efficient, open machine translation solution for Buddhist Chinese
Sebastian Nehrdich
|
Marcus Bingenheimer
|
Justin Brody
|
Kurt Keutzer
Proceedings of the Joint 3rd International Conference on Natural Language Processing for Digital Humanities and 8th International Workshop on Computational Linguistics for Uralic Languages
Buddhist Classical Chinese is a challenging low-resource language that has not yet received much dedicated attention in NLP research. Standard commercial machine translation software performs poorly on this idiom. In order to address this gap, we present a novel dataset of 209,454 bitext pairs for the training and 2.300 manually curated and corrected bitext pairs for the evaluation of machine translation models. We finetune a number of encoder-decoder models on this dataset and compare their performance against commercial models. We show that our best fine-tuned model outperforms the currently available commercial solutions by a considerable margin while being much more cost-efficient and faster in deployment. This is especially important for digital humanities, where large amounts of data need to be processed efficiently for corpus-level operations such as topic modeling or semantic search. We also show that the commercial chat system GPT4 is surprisingly strong on this task, at times reaching comparable performance to our finetuned model and clearly outperforming standard machine translation providers. We provide a limited case study where we examine the performance of selected different machine translation models on a number of Buddhist Chinese passages in order to demonstrate what level of quality these models reach at the moment.
2022
pdf
bib
abs
Accurate Dependency Parsing and Tagging of Latin
Sebastian Nehrdich
|
Oliver Hellwig
Proceedings of the Second Workshop on Language Technologies for Historical and Ancient Languages
Having access to high-quality grammatical annotations is important for downstream tasks in NLP as well as for corpus-based research. In this paper, we describe experiments with the Latin BERT word embeddings that were recently be made available by Bamman and Burns (2020). We show that these embeddings produce competitive results in the low-level task morpho-syntactic tagging. In addition, we describe a graph-based dependency parser that is trained with these embeddings and that clearly outperforms various baselines.
pdf
bib
abs
SansTib, a Sanskrit - Tibetan Parallel Corpus and Bilingual Sentence Embedding Model
Sebastian Nehrdich
Proceedings of the Thirteenth Language Resources and Evaluation Conference
This paper presents the development of SansTib, a Sanskrit - Classical Tibetan parallel corpus automatically aligned on sentence-level, and a bilingual sentence embedding model. The corpus has a size of about 317,289 sentence pairs and 14,420,771 tokens and thereby is a considerable improvement over previous resources for these two languages. The data is incorporated into the BuddhaNexus database to make it accessible to a larger audience. It also presents a gold evaluation dataset and assesses the quality of the automatic alignment.
2018
pdf
bib
abs
Sanskrit Word Segmentation Using Character-level Recurrent and Convolutional Neural Networks
Oliver Hellwig
|
Sebastian Nehrdich
Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing
The paper introduces end-to-end neural network models that tokenize Sanskrit by jointly splitting compounds and resolving phonetic merges (Sandhi). Tokenization of Sanskrit depends on local phonetic and distant semantic features that are incorporated using convolutional and recurrent elements. Contrary to most previous systems, our models do not require feature engineering or extern linguistic resources, but operate solely on parallel versions of raw and segmented text. The models discussed in this paper clearly improve over previous approaches to Sanskrit word segmentation. As they are language agnostic, we will demonstrate that they also outperform the state of the art for the related task of German compound splitting.