LIT: Label-Informed Transformers on Token-Based Classification

  • Conference paper
Linking Theory and Practice of Digital Libraries (TPDL 2024)

Abstract

Transformer-based language models have prompted the investigation of various embedding and modeling techniques for downstream natural language processing tasks. Nevertheless, the semantic information about labels available from the encoder and decoder components has yet to be fully explored in these tasks. In this paper, we propose LIT, an end-to-end pipeline architecture that integrates the transformer’s encoder-decoder mechanism with additional label semantics for token classification tasks, namely historical named entity recognition (NER) and automatic term extraction (ATE). Our findings demonstrate that LIT outperforms the benchmark in F1, with a maximum gain of 9.5 percentage points on the historical NER task and 11.2 percentage points on the ATE task for the gold standard excluding named entities.

Notes

  1. https://github.com/WenjunSUN1997/ner_tr.

  2. Historical NER: https://huggingface.co/dbmdz/bert-base-historic-multilingual-64k-td-cased; ATE: https://huggingface.co/bert-base-cased.

  3. All evaluations were performed with https://github.com/hipe-eval/HIPE-scorer.

Acknowledgments

This work has been supported by the ANNA (2019-1R40226), TERMITRAD (2020-2019-8510010), Pypa (AAPR2021-2021-12263410), and Actuadata (AAPR2022-2021-17014610) projects funded by the Nouvelle-Aquitaine Region (France).

Author information

Corresponding author

Correspondence to Wenjun Sun.

Ethics declarations

Disclosure of Interests

The authors have no competing interests to declare relevant to this article’s content.

Copyright information

© 2024 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Sun, W., Tran, H.T.H., González-Gallardo, CE., Coustaty, M., Doucet, A. (2024). LIT: Label-Informed Transformers on Token-Based Classification. In: Antonacopoulos, A., et al. Linking Theory and Practice of Digital Libraries. TPDL 2024. Lecture Notes in Computer Science, vol 15177. Springer, Cham. https://doi.org/10.1007/978-3-031-72437-4_9

  • DOI: https://doi.org/10.1007/978-3-031-72437-4_9

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-72436-7

  • Online ISBN: 978-3-031-72437-4

  • eBook Packages: Computer Science, Computer Science (R0)
