Data Collection Pipeline for Low-Resource Languages: A Case Study on Constructing a Tetun Text Corpus.

A Case Study on Constructing a Tetun Text Corpus - ACL Anthology

aclanthology.org › 2024.lrec-main.390

This paper proposes Labadain Crawler, a data collection pipeline tailored to automate and optimize the process of constructing textual corpora from the web.

[PDF] Data Collection Pipeline for Low-Resource Languages: A Case ...

www.semanticscholar.org › paper › Data...

This paper proposes Labadain Crawler, a data collection pipeline tailored to automate and optimize the process of constructing textual corpora from the web.

[PDF] Resource Languages: A Case Study on Constructing a Tetun Text Corpus

www.lrec-conf.org › media › slides

Labadain Crawler - A data collection pipeline that relies on three key components: An initial text in the target language. A tokenizer. A language ...

Gabriel de Jesus - Google Scholar

scholar.google.com › citations

Data Collection Pipeline for Low-Resource Languages: A Case Study on Constructing a Tetun Text Corpus. G de Jesus, SS Nunes. Proceedings of the 2024 Joint ...

tetun-lid - PyPI

pypi.org › project › tetun-lid

Tetun Language Identification (Tetun LID) model is a machine learning model that automatically identifies the language of a given text.

Automatic Creation of Text Corpora for Low-Resource Languages from ...

www.semanticscholar.org › paper › Auto...

Data Collection Pipeline for Low-Resource Languages: A Case Study on Constructing a Tetun Text Corpus · Gabriel de Jesus, S. Nunes. Linguistics, Computer Science.

Labadain-30k+: A Monolingual Tetun Document-Level Audited Dataset

rdm.inesctec.pt › dataset

Apr 3, 2024 · Tetun · de Jesus, G., & Nunes, S. (2024). Data Collection Pipeline for Low-Resource Languages: A Case Study on Constructing a Tetun Text Corpus.

tetun-tokenizer - PyPI

pypi.org › project › tetun-tokenizer

Tetun tokenizer is a Python package used to tokenize an input text into tokens. ... Data Collection Pipeline for Low-Resource Languages: A Case Study on ...

[PDF] arXiv:2405.19744v1 [cs.CL] 30 May 2024

arxiv.org › pdf

May 30, 2024 · In this paper, we propose a pipeline to excavate cross-lingual instruction-tuning samples from multilingual corpora by exploiting the better ...

GlotLID: Language Identification for Low-Resource Languages

www.researchgate.net › publication › 37...

Oct 31, 2023 · We analyze the unique challenges that low-resource LID poses: incorrect corpus metadata, leakage from high-resource languages, difficulty ...