Evaluating Domain-adapted Language Models for Governmental Text Classification Tasks in Portuguese
Abstract
Domain-adaptive pre-training (DAPT) is a technique in natural language processing (NLP) that tailors pre-trained language models to specific domains, enhancing their performance in real-world applications. In this paper, we evaluate the effectiveness of DAPT in governmental text classification tasks, exploring how different factors, such as target domain dataset, pre-trained model language composition, and dataset size, impact model performance. We systematically vary these factors, creating distinct domain-adapted models derived from BERTimbau and LaBSE. Our experimental results reveal that selecting appropriate target domain datasets and pre-training strategies can notably enhance the performance of language models in governmental tasks.
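To make the DAPT setup described above concrete, the following is a minimal sketch of continued masked-language-model pre-training of BERTimbau on an in-domain governmental corpus using the HuggingFace Transformers library; the corpus file name, hyperparameters, and output paths are illustrative assumptions, not the authors' actual configuration.

# Minimal DAPT sketch: continue MLM pre-training of BERTimbau on domain text.
# File names and hyperparameters below are illustrative assumptions.
from datasets import load_dataset
from transformers import (AutoModelForMaskedLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

base = "neuralmind/bert-base-portuguese-cased"  # BERTimbau base checkpoint
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForMaskedLM.from_pretrained(base)

# Hypothetical in-domain corpus: one governmental document per line.
corpus = load_dataset("text", data_files={"train": "gov_corpus.txt"})
tokenized = corpus.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=512),
    batched=True, remove_columns=["text"])

collator = DataCollatorForLanguageModeling(tokenizer, mlm_probability=0.15)
args = TrainingArguments(output_dir="bertimbau-dapt-gov",
                         per_device_train_batch_size=16,
                         num_train_epochs=3)  # illustrative values
Trainer(model=model, args=args, train_dataset=tokenized["train"],
        data_collator=collator).train()
# The adapted checkpoint can then be fine-tuned with
# AutoModelForSequenceClassification for the downstream classification tasks.

The same recipe applies to LaBSE by swapping the base checkpoint; the paper's comparison varies exactly these choices (base model, target domain corpus, and corpus size).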
Keywords:
natural language processing, pre-trained language models, domain-adaptive pre-training, text classification, governmental data
References
Brandão, M. A. et al. (2023). Impacto do pré-processamento e representação textual na classificação de documentos de licitações. In SBBD, pages 102–114. SBC.
Brandão, M. A. et al. (2024). PLUS: A Semi-automated Pipeline for Fraud Detection in Public Bids. Digital Government: Research and Practice, 5(1):1–16.
Constantino, K. et al. (2022). Segmentação e Classificação Semântica de Trechos de Diários Oficiais Usando Aprendizado Ativo. In SBBD, pages 304–316. SBC.
Feijó, D. V. and Moreira, V. P. (2020). Mono vs Multilingual Transformer-based Models: a Comparison across Several Language Tasks. CoRR, abs/2007.09757.
Feng, F. et al. (2022). Language-agnostic BERT Sentence Embedding. In ACL, pages 878–891. Association for Computational Linguistics.
Gururangan, S. et al. (2020). Don’t Stop Pretraining: Adapt Language Models to Domains and Tasks. In ACL, pages 8342–8360. Association for Computational Linguistics.
Hott, H. R. et al. (2023). Evaluating contextualized embeddings for topic modeling in public bidding domain. In BRACIS, volume 14197 of LNCS, pages 410–426. Springer.
Luz de Araujo, P. H., de Campos, T. E., Braz, F. A., and da Silva, N. C. (2020). VICTOR: a Dataset for Brazilian Legal Documents Classification. In LREC, pages 1449–1458. ELRA.
Luz de Araujo, P. H. et al. (2018). LeNER-Br: a Dataset for Named Entity Recognition in Brazilian Legal Text. In PROPOR, volume 11122 of LNCS, pages 313–323. Springer.
Oliveira, G. P. et al. (2022). Detecting Inconsistencies in Public Bids: An Automated and Data-based Approach. In WebMedia, pages 182–190. ACM.
Rodrigues, R. B. M. et al. (2022). PetroBERT: A Domain Adaptation Language Model for Oil and Gas Applications in Portuguese. In PROPOR, volume 13208 of LNCS, pages 101–109. Springer.
Schneider, E. T. R. et al. (2020). BioBERTpt - A Portuguese Neural Language Model for Clinical Named Entity Recognition. In ClinicalNLP@EMNLP, pages 65–72. Association for Computational Linguistics.
Silva, M. O. et al. (2022). LiPSet: Um Conjunto de Dados com Documentos Rotulados de Licitações Públicas. In DSW, pages 13–24. SBC.
Silva, M. O. et al. (2023). Análise de Sobrepreço em Itens de Licitações Públicas. In WCGE, pages 118–129. SBC.
Silva, M. O. and Moro, M. M. (2024). Evaluating Pre-training Strategies for Literary Named Entity Recognition in Portuguese. In PROPOR, pages 384–393. Association for Computational Linguistics.
Silva, N. F. F. et al. (2021). Evaluating Topic Models in Portuguese Political Comments About Bills from Brazil’s Chamber of Deputies. In BRACIS, volume 13074 of LNCS, pages 104–120. Springer.
Silveira, R. et al. (2021). Topic Modelling of Legal Documents via LEGAL-BERT. In Proceedings of the 1st International Workshop RELATED - Relations in the Legal Domain.
Silveira, R. et al. (2023). LegalBert-pt: A Pretrained Language Model for the Brazilian Portuguese Legal Domain. In BRACIS, volume 14197 of LNCS, pages 268–282. Springer.
Singhal, P., Walambe, R., Ramanna, S., and Kotecha, K. (2023). Domain Adaptation: Challenges, Methods, Datasets, and Applications. IEEE Access, 11:6973–7020.
Souza, F., Nogueira, R. F., and de Alencar Lotufo, R. (2020). BERTimbau: Pretrained BERT Models for Brazilian Portuguese. In BRACIS, volume 12319 of LNCS, pages 403–417. Springer.
Zhu, Q. et al. (2021). When does Further Pre-training MLM Help? An Empirical Study on Task-Oriented Dialog Pre-training. In Workshop on Insights from Negative Results in NLP, pages 54–61. Association for Computational Linguistics.
Published
14/10/2024
How to Cite
SILVA, Mariana O.; OLIVEIRA, Gabriel P.; COSTA, Lucas G. L.; PAPPA, Gisele L. Evaluating Domain-adapted Language Models for Governmental Text Classification Tasks in Portuguese. In: SIMPÓSIO BRASILEIRO DE BANCO DE DADOS (SBBD), 39., 2024, Florianópolis/SC. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2024. p. 247-259. ISSN 2763-8979. DOI: https://doi.org/10.5753/sbbd.2024.240508.