arXiv:2212.10422 (cs)

[Submitted on 20 Dec 2022 (v1), last revised 28 Jun 2023 (this version, v3)]

Title: Localising In-Domain Adaptation of Transformer-Based Biomedical Language Models

Authors: Tommaso Mario Buonocore, Claudio Crema, Alberto Redolfi, Riccardo Bellazzi, Enea Parimbelli

Abstract: In the era of digital healthcare, the huge volumes of textual information generated every day in hospitals constitute an essential but underused asset that could be exploited with task-specific, fine-tuned biomedical language representation models, improving patient care and management. For such specialized domains, previous research has shown that fine-tuning models stemming from broad-coverage checkpoints can largely benefit from additional training rounds over large-scale in-domain resources. However, these resources are often unreachable for less-resourced languages like Italian, preventing local medical institutions from employing in-domain adaptation. In order to reduce this gap, our work investigates two accessible approaches to derive biomedical language models in languages other than English, taking Italian as a concrete use case: one based on neural machine translation of English resources, favoring quantity over quality; the other based on a high-grade, narrow-scoped corpus natively written in Italian, thus preferring quality over quantity. Our study shows that data quantity is a harder constraint than data quality for biomedical adaptation, but the concatenation of high-quality data can improve model performance even when dealing with relatively size-limited corpora. The models published from our investigations have the potential to unlock important research opportunities for Italian hospitals and academia. Finally, the set of lessons learned from the study constitutes valuable insights towards a solution to build biomedical language models that are generalizable to other less-resourced languages and different domain settings.

Comments:	8 pages, 2 figures, 6 tables. Published in Journal of Biomedical Informatics
Subjects:	Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
ACM classes:	I.2.7; J.3
Cite as:	arXiv:2212.10422 [cs.CL]
	(or arXiv:2212.10422v3 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2212.10422
Journal reference:	Journal of Biomedical Informatics, Volume 144, 2023, 104431, ISSN 1532-0464
Related DOI:	https://doi.org/10.1016/j.jbi.2023.104431

Submission history

From: Tommaso Mario Buonocore
[v1] Tue, 20 Dec 2022 16:59:56 UTC (210 KB)
[v2] Thu, 22 Dec 2022 10:29:59 UTC (278 KB)
[v3] Wed, 28 Jun 2023 08:36:20 UTC (520 KB)
