Abstract
Transformer-based language models have led to the investigation of various embedding and modeling techniques for several downstream natural language processing tasks. Nevertheless, the comprehensive exploration of semantic information about the label from encoder and decoder components in these tasks is yet to be fully realized. In this paper, we propose LIT, an end-to-end pipeline architecture that integrates the transformer’s encoder-decoder mechanism with an additional label semantic to token classification tasks (i.e., historical named entity recognition (NER) and automatic term extraction (ATE)). Our findings demonstrate that LIT outperforms the benchmark in F1 with a maximal rise of 9.5% points in the historical NER task and 11.2% points in the ATE task for the gold standard excluding named entities.
Access this chapter
Tax calculation will be finalised at checkout
Purchases are for personal use only
Similar content being viewed by others
Notes
- 1.
- 2.
- 3.
All evaluations were performed with https://github.com/hipe-eval/HIPE-scorer.
References
Almazrouei, E., et al.: The Falcon series of language models: towards open Frontier models (2023)
Boroş, E., et al.: Alleviating digitization errors in named entity recognition for historical documents. In: Proceedings of the 24th Conference on Computational Natural Language Learning, pp. 431–441 (2020)
Crane, G., Jones, A.: The challenge of Virginia banks: an evaluation of named entity analysis in a 19th-century newspaper collection. In: Proceedings of the 6th ACM/IEEE-Cs Joint Conference on Digital Libraries, pp. 31–40 (2006)
Ehrmann, M., Hamdi, A., Pontes, E.L., Romanello, M., Doucet, A.: Named entity recognition and classification in historical documents: a survey. ACM Comput. Surv. 56(2), 1–47 (2023)
Ehrmann, M., Romanello, M., Clematide, S., Ströbel, P., Barman, R.: Language resources for historical newspapers: the impresso collection (2020)
Ehrmann, M., Romanello, M., Flückiger, A., Clematide, S.: Extended overview of CLEF HIPE 2020: named entity processing on historical newspapers. In: CLEF 2020 Working Notes. Conference and Labs of the Evaluation Forum, vol. 2696. CEUR-WS (2020)
Ehrmann, M., et al.: Extended overview of HIPE-2022: named entity recognition and linking in multilingual historical documents. In: CEUR Workshop Proceedings, pp. 1038–1063, No. 3180, CEUR-WS (2022)
Floridi, L., Chiriatti, M.: GPT-3: its nature, scope, limits, and consequences. Mind. Mach. 30, 681–694 (2020)
González-Gallardo, C.E., Boros, E., Girdhar, N., Hamdi, A., Moreno, J.G., Doucet, A.: Yes but.. can chatGPT identify entities in historical documents? In: 2023 ACM/IEEE Joint Conference on Digital Libraries (JCDL), pp. 184–189. IEEE (2023)
González-Gallardo, C.E., Tran, T.H.H., Girdhar, N., Boroş, E., Moreno, J.G., Doucet, A.: L3i++ at SemEval-2023 task 2: prompting for multilingual complex named entity recognition. In: Proceedings of the 17th International Workshop on Semantic Evaluation (SemEval 2023), pp. 807–814 (2023)
Hamdi, A., et al.: A multilingual dataset for named entity recognition, entity linking and stance detection in historical newspapers. In: Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 2328–2334 (2021)
Hazem, A., Bouhandi, M., Boudin, F., Daille, B.: TermEval 2020: TALN-LS2N system for automatic term extraction. In: Proceedings of the 6th International Workshop on Computational Terminology, pp. 95–100 (2020)
Ivačič, N., Tran, T.H.H., Koloski, B., Pollak, S., Purver, M.: Analysis of transfer learning for named entity recognition in south-Slavic languages. In: Proceedings of the 9th Workshop on Slavic Natural Language Processing 2023 (SlavicNLP 2023), pp. 106–112 (2023)
Karimi, A., Rossi, L., Prati, A.: Improving BERT performance for aspect-based sentiment analysis. arXiv preprint arXiv:2010.11731 (2020)
Kenton, J.D.M.W.C., Toutanova, L.K.: BERT: pre-training of deep bidirectional transformers for language understanding. In: Proceedings of NAACL-HLT, vol. 1, p. 2 (2019)
Koufakou, A., Pamungkas, E.W., Basile, V., Patti, V., et al.: HurtBERT: incorporating lexical features with BERT for the detection of abusive language. In: Proceedings of the Fourth Workshop on Online Abuse and Harms, pp. 34–43. Association for Computational Linguistics (2020)
Labusch, K., Neudecker, C.: Entity linking in multilingual newspapers and classical commentaries with BERT (2022)
Lang, C., Wachowiak, L., Heinisch, B., Gromann, D.: Transforming term extraction: transformer-based approaches to multilingual term extraction across domains. In: Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, pp. 3607–3620 (2021)
Li, Z., et al.: Label supervised llama finetuning. arXiv preprint arXiv:2310.01208 (2023)
Lin, Y., et al.: BertGCN: transductive text classification by combining GNN and BERT. In: Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, pp. 1456–1462 (2021)
Mutinda, J., Mwangi, W., Okeyo, G.: Sentiment analysis of text reviews using lexicon-enhanced BERT embedding (LeBERT) model with convolutional neural network. Appl. Sci. 13(3), 1445 (2023)
OpenAI: GPT-4 technical report (2023)
Raffel, C., et al.: Exploring the limits of transfer learning with a unified text-to-text transformer. J. Mach. Learning Res. 21(1), 5485–5551 (2020)
Rigouts Terryn, A., Hoste, V., Drouin, P., Lefever, E.: TermEval 2020: shared task on automatic term extraction using the annotated corpora for term extraction research (ACTER) dataset. In: 6th International Workshop on Computational Terminology (COMPUTERM 2020), pp. 85–94. European Language Resources Association (ELRA) (2020)
Rigouts Terryn, A., Hoste, V., Lefever, E.: In no uncertain terms: a dataset for monolingual and multilingual automatic term extraction from comparable corpora. Lang. Resour. Eval. 54(2), 385–418 (2020)
Ritze, D., Zirn, C., Greenstreet, C., Eckert, K., Ponzetto, S.P.: Named entities in court: the marinelives corpus. In: Language Resources and Technologies for Processing and Linking Historical Documents and Archives-Deploying Linked Open Data in Cultural Heritage–LRT4HDA Workshop Programme, p. 26 (2014)
Rosset, S., Grouin, C., Zweigenbaum, P.: Entités nommées structurées: guide d’annotation Quaero. LIMSI-Centre national de la recherche scientifique (2011)
Ryser, A., Nguyen, Q.A., Bodenmann, N., Chen, S.Y.: Exploring transformers for multilingual historical named entity recognition (2022)
Schweter, S., März, L., Schmid, K., Çano, E.: hmBERT: historical multilingual language models for named entity recognition (2022)
Touvron, H., et al.: LLaMA: open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023)
Hanh, T.T.H., Doucet, A., Sidere, N., Moreno, J.G., Pollak, S.: Named entity recognition architecture combining contextual and global features. In: Ke, H.-R., Lee, C.S., Sugiyama, K. (eds.) ICADL 2021. LNCS, vol. 13133, pp. 264–276. Springer, Cham (2021). https://doi.org/10.1007/978-3-030-91669-5_21
Tran, H.T.H., Martinc, M., Caporusso, J., Doucet, A., Pollak, S.: The recent advances in automatic term extraction: a survey. arXiv preprint arXiv:2301.06767 (2023)
Tran, H.T.H., Martinc, M., Doucet, A., Pollak, S.: Can cross-domain term extraction benefit from cross-lingual transfer? In: Pascal, P., Ienco, D. (eds.) DS 2022. LNCS, vol. 13601, pp. 363–378. Springer, Cham (2022). https://doi.org/10.1007/978-3-031-18840-4_26
Tran, H.T.H., Martinc, M., Pelicon, A., Doucet, A., Pollak, S.: Ensembling transformers for cross-domain automatic term extraction. In: Tseng, Y.H., Katsurai, M., Nguyen, H.N. (eds.) ICADL 2022. LNCS, vol. 13636, pp. 90–100. Springer, Cham (2022). https://doi.org/10.1007/978-3-031-21756-2_7
Tran, H.T.H., Martinc, M., Repar, A., Ljubešić, N., Doucet, A., Pollak, S.: Can cross-domain term extraction benefit from cross-lingual transfer and nested term labeling? Mach. Learn. 113(7), 4285–4314 (2024)
Tran, H., Martinc, M., Doucet, A., Pollak, S.: A transformer-based sequence-labeling approach to the slovenian cross-domain automatic term extraction. In: Slovenian conference on Language Technologies and Digital Humanities (2022)
Won, M., Murrieta-Flores, P., Martins, B.: Ensemble named entity recognition (NER): evaluating NER tools in the identification of place names in historical corpora. Front. Digit. Humanit. 5, 2 (2018)
Xu, Y., Li, M., Cui, L., Huang, S., Wei, F., Zhou, M.: LayoutLM: pre-training of text and layout for document image understanding. In: Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 1192–1200 (2020)
Yang, Y., Katiyar, A.: Simple and effective few-shot named entity recognition with structured nearest neighbor learning. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 6365–6375 (2020)
Yang, Z., Chen, H., Zhang, J., Ma, J., Chang, Y.: Attention-based multi-level feature fusion for named entity recognition. In: Proceedings of the Twenty-Ninth International Conference on International Joint Conferences on Artificial Intelligence, pp. 3594–3600 (2021)
Yao, L., Mao, C., Luo, Y.: KG-BERT: BERT for knowledge graph completion. arXiv preprint arXiv:1909.03193 (2019)
Yu, S., Su, J., Luo, D.: Improving BERT-based text classification with auxiliary sentence and domain knowledge. IEEE Access 7, 176600–176612 (2019)
Zhang, Z., Han, X., Liu, Z., Jiang, X., Sun, M., Liu, Q.: Ernie: enhanced language representation with informative entities. arXiv preprint arXiv:1905.07129 (2019)
Acknowledgments
This work has been supported by the ANNA (2019-1R40226), TERMITRAD (2020-2019-8510010), Pypa (AAPR2021-2021-12263410), and Actuadata (AAPR2022-2021-17014610) projects funded by the Nouvelle-Aquitaine Region (France).
Author information
Authors and Affiliations
Corresponding author
Editor information
Editors and Affiliations
Ethics declarations
Disclosure of Interests
The authors have no competing interests to declare relevant to this article’s content.
Rights and permissions
Copyright information
© 2024 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Sun, W., Tran, H.T.H., González-Gallardo, CE., Coustaty, M., Doucet, A. (2024). LIT: Label-Informed Transformers on Token-Based Classification. In: Antonacopoulos, A., et al. Linking Theory and Practice of Digital Libraries. TPDL 2024. Lecture Notes in Computer Science, vol 15177. Springer, Cham. https://doi.org/10.1007/978-3-031-72437-4_9
Download citation
DOI: https://doi.org/10.1007/978-3-031-72437-4_9
Published:
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-72436-7
Online ISBN: 978-3-031-72437-4
eBook Packages: Computer ScienceComputer Science (R0)