This is the official code for HistRED: A Historical Document-Level Relation Extraction Dataset (ACL 2023). All materials related to this paper can be found here.

ACL Anthology: The official proceedings publication.
Virtual-ACL 2023: You can view papers, posters, and presentation slides.
arXiv: The camera-ready version of the paper.

Note that this dataset is released under the CC BY-NC-ND 4.0 license. The same code (without the dataset) is available on GitHub.

from datasets import load_dataset

dataset = load_dataset("Soyoung/HistRED")

Dataset Example

Due to the complexity of the dataset, we replace the dataset preview with an example figure. In the figure, the text is translated into English for comprehension (*). The dataset itself does not include English translations; it contains only Korean and Hanja text. Also, only one relation is shown for readability.

Relation information includes

subject and object entities for Korean and Hanja (sbj_kor, sbj_han, obj_kor, obj_han),
a relation type (label),
and evidence sentence index(es) for each language (evidence_kor, evidence_han).
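
As a rough illustration of the fields listed above, the sketch below treats one relation as a plain Python dict and looks up its evidence sentences. Only the field names come from the list above. The dict layout, the helper `evidence_sentences`, the assumption that evidence indexes are 0-based positions in a list of sentences, and all placeholder values are hypothetical and may not match the dataset's actual schema.

```python
from typing import Any, Dict, List

# Hypothetical record: only the keys are taken from the dataset description.
# The values are placeholders, not real HistRED data.
relation: Dict[str, Any] = {
    "sbj_kor": "<subject in Korean>",
    "sbj_han": "<subject in Hanja>",
    "obj_kor": "<object in Korean>",
    "obj_han": "<object in Hanja>",
    "label": "<relation type>",
    "evidence_kor": [0, 2],
    "evidence_han": [1],
}


def evidence_sentences(sentences: List[str], indexes: List[int]) -> List[str]:
    """Return the sentences at the given indexes (assumed 0-based).

    Indexes that fall outside the sentence list are skipped.
    """
    return [sentences[i] for i in indexes if 0 <= i < len(sentences)]


# Placeholder sentence lists standing in for a document split into sentences.
korean_sentences = ["kor-sent-0", "kor-sent-1", "kor-sent-2"]
hanja_sentences = ["han-sent-0", "han-sent-1"]

kor_evidence = evidence_sentences(korean_sentences, relation["evidence_kor"])
han_evidence = evidence_sentences(hanja_sentences, relation["evidence_han"])
print(kor_evidence, han_evidence)
```

Because the Korean and Hanja texts are split into sentences separately, the two evidence lists are kept apart here rather than assumed to share indexes.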

Metadata contains additional information, such as which book the text is extracted from.

Corpus of HistRED: << Yeonhaengnok >>

In this dataset, we use Yeonhaengnok, a collection of records originally written in Hanja (classical Chinese writing) and later translated into Korean. Joseon, the last dynastic kingdom of Korea, lasted just over five centuries, from 1392 to 1897, and many aspects of Korean traditions and customs trace their roots back to this era. Numerous historical documents exist from the Joseon dynasty, including the Annals of Joseon Dynasty (AJD) and the Diaries of the Royal Secretariats (DRS). Note that the majority of Joseon's records were written in Hanja, an archaic form of Chinese writing that differs from modern Chinese, because the Korean language had not been standardized until much later.

In short, Yeonhaengnok is a travel diary from the Joseon period. In the past, traveling to other places, particularly to foreign countries, was rare. Therefore, intellectuals who traveled to Chung (also referred to as the Qing dynasty) meticulously documented their journeys, and Yeonhaengnok is a compilation of these accounts. Individuals from different generations recorded their business trips along similar routes from Joseon to Chung, focusing on the people, products, and events they encountered. The Institute for the Translation of Korean Classics (ITKC) has open-sourced the original texts and their translations for many historical documents, promoting active historical research. All documents were collected from an open-source database at https://db.itkc.or.kr/.

Properties

Our dataset contains (i) named entities, (ii) relations between the entities, and (iii) parallel relationships between Korean and Hanja texts.
dataset.py returns a processed dataset that can be easily applied to general NLP models.
- For the monolingual setting: KoreanDataset, HanjaDataset
- For the bilingual setting: JointDataset
ner_map.json and label_map.json are the mapping dictionaries from label classes to indexes.
Sequence level (SL) is a unit of sequence length for extracting self-contained sub-texts without losing context information for each relation in the text. Each folder SL-k indicates that SL is k.
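
Since the mapping files go from label classes to indexes, turning model predictions back into class names needs the inverse mapping. The following is a minimal sketch of that step. The JSON content and class names are placeholders invented for illustration, and the assumption that each file is a flat JSON object of string keys and integer values is not confirmed by the repository.

```python
import json
from typing import Dict, List

# Hypothetical stand-in for the content of label_map.json.
# The real file holds HistRED's relation classes; these names are placeholders.
label_map_text = '{"no_relation": 0, "example_relation_a": 1, "example_relation_b": 2}'

# Class name -> index, as described for the mapping files.
label_to_index: Dict[str, int] = json.loads(label_map_text)

# Index -> class name, built by inverting the mapping.
index_to_label: Dict[int, str] = {idx: name for name, idx in label_to_index.items()}


def decode(predicted_indexes: List[int]) -> List[str]:
    """Map predicted class indexes back to their class names."""
    return [index_to_label[i] for i in predicted_indexes]


print(decode([2, 0]))
```

With the real files, `json.loads(label_map_text)` would be replaced by `json.load` on an open file handle; the same inversion would apply to ner_map.json if it has the same flat layout.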

Dataset usages

A testbed for evaluating model performance under varying sequence lengths.
Relation extraction, especially on non-English or historical corpora.
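
To vary the sequence length in an experiment, one option is to iterate over the SL-k folders. The sketch below only builds the folder paths from the SL-k naming convention. The root directory `data`, the chosen levels, and the helper names `sl_folder` and `sweep` are assumptions for illustration; check the repository for which values of k actually exist.

```python
from pathlib import Path
from typing import Dict, Iterable


def sl_folder(root: Path, k: int) -> Path:
    """Return the folder path for sequence level k, following the SL-k naming."""
    return root / f"SL-{k}"


def sweep(root: Path, levels: Iterable[int]) -> Dict[int, Path]:
    """Map each requested sequence level to its folder path."""
    return {k: sl_folder(root, k) for k in levels}


# Hypothetical root directory and levels; neither is taken from the repository.
folders = sweep(Path("data"), [0, 1, 2])
for k, folder in folders.items():
    print(k, folder.name)
```

Each path can then be passed to whatever loading and evaluation routine is used, so that results are reported per sequence level.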

Citation

@inproceedings{yang-etal-2023-histred,
    title = "{H}ist{RED}: A Historical Document-Level Relation Extraction Dataset",
    author = "Yang, Soyoung  and
      Choi, Minseok  and
      Cho, Youngwoo  and
      Choo, Jaegul",
    booktitle = "Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
    month = jul,
    year = "2023",
    address = "Toronto, Canada",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2023.acl-long.180",
    pages = "3207--3224",
}