
CN115618022B - Low-resource relation extraction method based on data synthesis and two-stage self-training - Google Patents

Low-resource relation extraction method based on data synthesis and two-stage self-training

Info

Publication number
CN115618022B
CN115618022B (application CN202211630125.3A; published as CN115618022A)
Authority
CN
China
Prior art keywords
training
data
model
stage
label
Prior art date
Legal status
Active
Application number
CN202211630125.3A
Other languages
Chinese (zh)
Other versions
CN115618022A (en)
Inventor
张勇东
毛震东
陈伟东
宋彦
徐本峰
高杰
Current Assignee
University of Science and Technology of China USTC
Original Assignee
University of Science and Technology of China USTC
Priority date
Filing date
Publication date
Application filed by University of Science and Technology of China USTC filed Critical University of Science and Technology of China USTC
Priority to CN202211630125.3A priority Critical patent/CN115618022B/en
Publication of CN115618022A publication Critical patent/CN115618022A/en
Application granted granted Critical
Publication of CN115618022B publication Critical patent/CN115618022B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30: Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36: Creation of semantic tools, e.g. ontology or thesauri
    • G06F16/367: Ontology

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Animal Behavior & Ethology (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Machine Translation (AREA)

Abstract

The invention relates to the field of data synthesis and discloses a low-resource relation extraction method based on data synthesis and two-stage self-training, comprising a data synthesis method and a two-stage self-training framework. Data synthesis effectively alleviates the scarcity of labeled data and the high annotation cost in current relation extraction tasks. In each iteration, the two-stage self-training framework trains the model first on unlabeled generated data and then on labeled data, which promotes collaborative learning from both sources and effectively reduces the impact of noise in the generated data. The invention fits the low-resource conditions of real-world scenarios and makes more effective use of existing pre-trained language models.

Description

A low-resource relation extraction method based on data synthesis and two-stage self-training

Technical Field

The invention relates to the field of data synthesis, and in particular to a low-resource relation extraction method based on data synthesis and two-stage self-training.

Background Art

As an important technical underpinning of knowledge extraction and knowledge graph construction, relation extraction systems aim to mine the relations between entities in unstructured text and have become a research hotspot in natural language processing in recent years. Although neural network models, especially pre-trained language models, have achieved remarkable breakthroughs on relation extraction tasks, training these models requires a large amount of labeled data. In many real-world scenarios, however, obtaining high-quality labeled data is time-consuming and labor-intensive, so building a well-performing relation extraction system under limited resources and data has become a major challenge.

Distant supervision automatically labels relation extraction data by aligning entities in text with an existing knowledge base, and has been widely studied as an effective way to build large-scale relation extraction datasets. However, because relation schemas and text corpora differ, data labeled by distant supervision can deviate considerably from the downstream task, which limits further improvement of model performance. For example, owing to its dependence on existing knowledge bases, current distant supervision mostly uses Wikidata as the source of relation triples and Wikipedia as the corpus. This restricts the schema and text of the labeled data to everyday knowledge, whereas downstream tasks may involve specialized knowledge from other domains, such as semantic relations between names or chemical-protein interactions.

Considering that large-scale language models have demonstrated strong language generation capabilities in several domains such as news articles, product categories, and daily dialogue, the invention uses a large-scale language model to synthesize data for the relation extraction task, addressing both the scarcity of training data in low-resource scenarios and the domain gap of distant supervision. In addition, to make better use of the generated data, the invention proposes a two-stage self-learning framework.

Summary of the Invention

To solve the above technical problems, the invention provides a low-resource relation extraction method based on data synthesis and two-stage self-training. A self-learning framework usually iterates between assigning pseudo-labels to unlabeled data and learning from them, so that model performance keeps improving. In each iteration of the two-stage self-learning framework of the invention, however, the generated data are used for training in the first stage and the labeled data are used for training in the second stage. Because the labeled data are introduced late in the training process, this sequential training increases the model's attention to the labeled data. Furthermore, the invention formulates training on the generated data as a knowledge distillation process using soft pseudo-labels rather than assigning hard labels to them. Overall, the two-stage self-training framework exploits generated data to overcome the shortage of labeled data under low-resource conditions and further improves the performance of the knowledge extraction system.

To solve the above technical problems, the invention adopts the following technical solution:

A low-resource relation extraction method based on data synthesis and two-stage self-training comprises the following steps:

Step 1, data synthesis based on labeled training data: the training data are converted into linear natural-language sequences by inserting position markers into them; a data synthesis model based on a large-scale language model is built and fine-tuned on the training data; the data synthesis process is repeatedly executed with multinomial sampling until an unlabeled generated dataset $\mathcal{G}=\{g_j\}_{j=1}^{M}$ that meets preset conditions is obtained, where each generated instance $g_j=(x_j^g, s_j^g, o_j^g)$ contains position markers, $x_j^g$ is the word sequence of the generated instance, $s_j^g$ and $o_j^g$ are respectively its subject and object, and $M$ is the number of generated instances;

Step 2, two-stage self-learning: an auto-encoding language model η is trained on the training dataset $\mathcal{D}$, and the marker-augmented generated dataset $\mathcal{G}$ is then classified with the auto-encoding language model η to produce soft pseudo-labels: $\hat{y}_j = \eta\big(\tilde{x}_j^{g}\big)$;

Let $\hat{\mathcal{Y}}=\{\hat{y}_j\}_{j=1}^{M}$ be the soft pseudo-label set of $\mathcal{G}$, where $\hat{y}_j$ is a soft pseudo-label; multiple auto-encoding language models are trained with K different random seeds and serve as teacher models η, and the soft pseudo-label set produced by the k-th teacher model is denoted $\hat{\mathcal{Y}}^{(k)}$. A new auto-encoding language model is initialized as the student model θ, and a two-stage training strategy is applied to it: in the first training stage, distillation training is performed on the generated data with soft pseudo-labels, $(\mathcal{G}, \hat{\mathcal{Y}})$; the student model θ is thereby optimized into the student model θ′, and the distillation loss $\mathcal{L}_{\text{KD}}$ is computed as

$\mathcal{L}_{\text{KD}} = \sum_{j=1}^{M}\frac{1}{K}\sum_{k=1}^{K}\mathrm{KL}\Big(\hat{y}_j^{(k)}\ \Big\|\ f_{\theta}\big(\tilde{x}_j^{g}\big)\Big);$

where KL denotes the Kullback-Leibler divergence. In the second training stage, the student model θ′ is trained on the training dataset $\mathcal{D}$: $\theta'' = \arg\min_{\theta'}\sum_{i=1}^{N}\mathcal{L}_{\text{CE}}\big(f_{\theta'}(\tilde{x}_i),\ y_i\big)$, where $\mathcal{L}_{\text{CE}}$ is the standard cross-entropy loss and θ″ is the student model obtained after the second training stage. The next time the two-stage training strategy is executed, the student model θ″ serves as the teacher model η. The two-stage training strategy is repeated until every generated instance in the generated dataset $\mathcal{G}$ has been assigned a soft pseudo-label;

Step 3, relation extraction: a relation extraction model based on an auto-encoding language model is built; the generated data and the training data are collectively called training instances, and the relation label of a training instance is called its true label $\tilde{y}$; the training instances are fed into the relation extraction model, which is trained by computing the cross-entropy loss between the relation label $p$ predicted by the relation extraction model and the true label $\tilde{y}$ of the training instance.

Further, in Step 1, the fine-tuning loss of the data synthesis model is $\mathcal{L}_{\text{LM}} = -\sum_{l}\log P_{\text{LLM}}\big(\tilde{w}_l \mid \tilde{w}_{<l}\big)$; the data synthesis model is fine-tuned in the same way as it was pre-trained, where $P$ denotes a probability function, LLM denotes the large-scale language model, $\tilde{w}_l$ is a word of the marker-augmented word sequence $\tilde{x}$ of the training data, and $\tilde{w}_0$ is a special start token; after fine-tuning, a prompt token is added before $\tilde{w}_0$ to trigger generation.

Further, in Step 2, the two-stage training strategy is repeated T times; in each iteration, 1/T of the generated data is sampled from the generated dataset $\mathcal{G}$, and after T iterations all generated data in the generated dataset $\mathcal{G}$ have been assigned soft pseudo-labels.

Further, when position markers are inserted into the training data in Step 1:

$\tilde{x} = \{w_1, \dots, \langle S\rangle, w_{s_1}, \dots, w_{s_2}, \langle/S\rangle, \dots, \langle O\rangle, w_{o_1}, \dots, w_{o_2}, \langle/O\rangle, \dots, w_L\};$

where $\tilde{x}$ denotes the word sequence of the training data after the position markers are inserted, $\langle S\rangle$ and $\langle/S\rangle$ are the markers used to delimit the position of the subject $s$ in the training-data word sequence, and $\langle O\rangle$ and $\langle/O\rangle$ are the markers used to delimit the position of the object $o$.

Further, when the relation extraction model predicts the relation label in Step 3, the vector representations h at the corresponding positions of the word sequence of the training instance are concatenated for classification, yielding the relation label distribution $p$ predicted by the relation extraction model:

$p = \mathrm{softmax}\big(\mathrm{FFN}([h_{\langle S\rangle};\ h_{\langle O\rangle}])\big);$

where softmax is the activation function, FFN is a fully connected network, $h_{\langle S\rangle}$ is the vector representation of the subject, $h_{\langle O\rangle}$ is the vector representation of the object, and [ ; ] denotes the concatenation operation.

Compared with the prior art, the beneficial technical effects of the invention are as follows:

The invention proposes a low-resource relation extraction method based on data synthesis and two-stage self-training, comprising a data synthesis method and a two-stage self-training framework. Data synthesis effectively alleviates the scarcity of labeled data and the high annotation cost in current relation extraction tasks. In each iteration, the two-stage self-training framework trains the auto-encoding language model first on unlabeled generated data and then on labeled data; this promotes collaborative learning from both sources and effectively reduces the impact of noise in the generated data. The relation extraction system fits the low-resource conditions of real-world scenarios, makes more effective use of existing pre-trained language models and taps their potential, and has broad application prospects.

Brief Description of the Drawings

Fig. 1 is the overall model architecture diagram of the invention.

Detailed Description of the Embodiments

A preferred embodiment of the invention is described in detail below with reference to the accompanying drawings.

In the invention, let the training dataset be $\mathcal{D}=\{d_i\}_{i=1}^{N}$ with corresponding label set $\mathcal{Y}=\{y_i\}_{i=1}^{N}$, where $y_i$ is the relation label of training instance $d_i$ and $y_i \in \mathcal{R}$, $\mathcal{R}$ being the set of all annotated relation labels in the training dataset and $N$ being the number of training instances. Here $d_i=(x_i, s_i, o_i)$ is the $i$-th training instance, $x_i=\{w_1,\dots,w_L\}$ is its word sequence, $w_l$ denotes the $l$-th word of the sequence, and $L$ is the sequence length; $s_i$ and $o_i$ are the subject and object in the word sequence $x_i$, with $s_i=(s_1, s_2)$ indicating that the subject starts at position $s_1$ and ends at position $s_2$ of $x_i$, and $o_i=(o_1, o_2)$ indicating that the object starts at position $o_1$ and ends at position $o_2$.

The generated dataset is $\mathcal{G}=\{g_j\}_{j=1}^{M}$ with corresponding soft pseudo-label set $\hat{\mathcal{Y}}=\{\hat{y}_j\}_{j=1}^{M}$, where $\hat{y}_j$ is a soft pseudo-label; $g_j=(x_j^g, s_j^g, o_j^g)$, where $x_j^g$ is the word sequence of the generated instance and $s_j^g$, $o_j^g$ are respectively its subject and object; $M$ is the number of generated instances in the generated dataset.

The generated data and the training data are collectively called training instances, and the relation label of a training instance is called its true label $\tilde{y}$; that is, the soft pseudo-labels $\hat{y}$ and the relation labels $y$ are collectively called true labels $\tilde{y}$.

The goal of relation extraction is to learn a function $f$ that predicts the true label from a training instance: $\tilde{y} = f(x, s, o)$.
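For concreteness, the data structures above can be sketched as follows; this is an illustrative, non-limiting Python sketch, and the field and function names are assumptions rather than part of the invention:

```python
# Illustrative data structures for labeled, generated, and marker-augmented instances.
from dataclasses import dataclass
from typing import List, Optional, Tuple

@dataclass
class Instance:
    words: List[str]                          # word sequence x = {w_1, ..., w_L}
    subj_span: Tuple[int, int]                # subject s: (start, end) positions in x
    obj_span: Tuple[int, int]                 # object  o: (start, end) positions in x
    label: Optional[str] = None               # relation label y (None for generated data)
    soft_label: Optional[List[float]] = None  # soft pseudo-label over relation classes

def insert_markers(inst: Instance) -> List[str]:
    """Insert <S> ... </S> and <O> ... </O> around the subject and object (Formula 1)."""
    w = list(inst.words)
    (ss, se), (os_, oe) = inst.subj_span, inst.obj_span
    # Insert from the rightmost position first so earlier indices stay valid.
    for pos, tok in sorted([(se + 1, "</S>"), (ss, "<S>"), (oe + 1, "</O>"), (os_, "<O>")],
                           reverse=True):
        w.insert(pos, tok)
    return w
```

With `insert_markers`, an instance is linearized into the marker-augmented sequence $\tilde{x}$ used throughout the method.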

The relation extraction system proposed by the invention is shown in Fig. 1 and comprises the following three parts: 1. a relation extraction model; 2. a data synthesis model for low-resource scenarios; 3. a two-stage self-learning framework.

1. Relation extraction model

The backbone of the relation extraction model is a BERT-like auto-encoding language model rather than an autoregressive language model, because auto-encoding language models usually perform better on language-understanding tasks. Following previous research on extraction models, special position markers are inserted into the word sequence of each training instance fed into the relation extraction model to emphasize the positions of the subject and the object in the word sequence.

Formula 1: $\tilde{x} = \{w_1, \dots, \langle S\rangle, w_{s_1}, \dots, w_{s_2}, \langle/S\rangle, \dots, \langle O\rangle, w_{o_1}, \dots, w_{o_2}, \langle/O\rangle, \dots, w_L\};$

where $\tilde{x}$ denotes the training-instance word sequence after the special position markers are inserted, $\langle S\rangle$ and $\langle/S\rangle$ are the special markers used to delimit the position of the subject $s$, and $\langle O\rangle$ and $\langle/O\rangle$ are the special markers used to delimit the position of the object $o$.

After encoding by the relation extraction model, the vector representations h at the corresponding marker positions of the word sequence are concatenated for classification:

$p = \mathrm{softmax}\big(\mathrm{FFN}([h_{\langle S\rangle};\ h_{\langle O\rangle}])\big);$

where softmax is the activation function, FFN is a fully connected network, $h_{\langle S\rangle}$ is the vector representation of the subject, $h_{\langle O\rangle}$ is the vector representation of the object, [ ; ] denotes the concatenation operation, and $p$ is the relation-label distribution predicted by the relation extraction model. The relation extraction model is trained by computing the cross-entropy loss between $p$ and the true label $\tilde{y}$.
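As an illustrative, non-limiting sketch, assuming PyTorch and the Hugging Face transformers library, such a marker-based classifier might look as follows; the marker strings, backbone name, and class count are assumptions for illustration only:

```python
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

MARKERS = ["<S>", "</S>", "<O>", "</O>"]

class MarkerRelationClassifier(nn.Module):
    def __init__(self, model_name="bert-base-chinese", num_relations=10):
        super().__init__()
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        self.tokenizer.add_special_tokens({"additional_special_tokens": MARKERS})
        self.encoder = AutoModel.from_pretrained(model_name)
        self.encoder.resize_token_embeddings(len(self.tokenizer))
        hidden = self.encoder.config.hidden_size
        self.classifier = nn.Linear(2 * hidden, num_relations)  # minimal FFN over [h_<S>; h_<O>]

    def forward(self, marked_texts):
        enc = self.tokenizer(marked_texts, padding=True, truncation=True,
                             return_tensors="pt")
        h = self.encoder(**enc).last_hidden_state                # (batch, seq, hidden)
        s_id = self.tokenizer.convert_tokens_to_ids("<S>")
        o_id = self.tokenizer.convert_tokens_to_ids("<O>")
        s_pos = [row.tolist().index(s_id) for row in enc["input_ids"]]
        o_pos = [row.tolist().index(o_id) for row in enc["input_ids"]]
        rows = torch.arange(h.size(0))
        pooled = torch.cat([h[rows, s_pos], h[rows, o_pos]], dim=-1)  # [h_<S>; h_<O>]
        return self.classifier(pooled)  # logits; softmax/cross-entropy applied in the loss
```

During training, applying `nn.CrossEntropyLoss` to these logits corresponds to the cross-entropy objective between the predicted label distribution $p$ and the true label.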

The method for generating training instances is described below.

2. Data synthesis model for low-resource scenarios

This embodiment adopts a data synthesis model based on a large-scale language model (LLM). Given the powerful language generation capability of large-scale language models, the invention synthesizes the data required for the relation extraction task from the labeled data with a large-scale language model, addressing both the scarcity of training data in low-resource scenarios and the domain gap of distant supervision. A training instance of the relation extraction model has a specific structure: $d = (x, s, o)$, i.e., a relational fact is determined by a piece of text (the word sequence $x$) together with the subject $s$ and the object $o$ contained in $x$. The training data are first converted into linear natural-language sequences in a manner similar to Formula 1:

$\tilde{x} = \{w_1, \dots, \langle S\rangle, w_{s_1}, \dots, w_{s_2}, \langle/S\rangle, \dots, \langle O\rangle, w_{o_1}, \dots, w_{o_2}, \langle/O\rangle, \dots, w_L\};$

The data synthesis model can be based on any large-scale language model, for example GPT-2. The large-scale language model is then fine-tuned on the labeled training data in the same way as it was pre-trained; the fine-tuning loss $\mathcal{L}_{\text{LM}}$ of the data synthesis model is:

$\mathcal{L}_{\text{LM}} = -\sum_{l}\log P_{\text{LLM}}\big(\tilde{w}_l \mid \tilde{w}_{<l}\big);$

where $P$ denotes a probability function, LLM denotes the large-scale language model, $\tilde{w}_l$ is a word of the word sequence $\tilde{x}$, and the special start token <bos> (beginning of sentence) is prepended to the sequence as $\tilde{w}_0$. Note that the relation labels $y$ of the training data are ignored here, so this is an unconditional generation process. This removes the noise caused by label-semantics inconsistency and lets the data synthesis model model the text itself, which helps the framework learn from the unlabeled generated data. After fine-tuning, a prompt token is simply placed before the special start token <bos> to trigger generation, and the generation process is repeated with multinomial sampling until a generated dataset $\mathcal{G}$ that meets expectations is obtained.
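A minimal sketch of this synthesis step, assuming GPT-2 as the backbone and the Hugging Face transformers API; the token strings and hyperparameters are illustrative assumptions, and generation here is simply prompted with the start token:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
tok.add_special_tokens({"bos_token": "<bos>",
                        "additional_special_tokens": ["<S>", "</S>", "<O>", "</O>"]})
lm = AutoModelForCausalLM.from_pretrained("gpt2")
lm.resize_token_embeddings(len(tok))
optim = torch.optim.AdamW(lm.parameters(), lr=5e-5)

def finetune_step(marked_sentence: str) -> float:
    """One step of the standard LM objective on a linearized, marker-augmented instance."""
    ids = tok("<bos> " + marked_sentence, return_tensors="pt").input_ids
    loss = lm(ids, labels=ids).loss            # -sum_l log P_LLM(w_l | w_<l)
    loss.backward(); optim.step(); optim.zero_grad()
    return loss.item()

def synthesize(num_samples: int, max_len: int = 64):
    """Unconditional generation with multinomial sampling, prompted by the start token."""
    prompt = tok("<bos>", return_tensors="pt").input_ids
    out = lm.generate(prompt, do_sample=True, top_k=0, temperature=1.0,
                      max_length=max_len, num_return_sequences=num_samples,
                      pad_token_id=tok.eos_token_id)
    return [tok.decode(o, skip_special_tokens=False) for o in out]
```

Sampled sequences that contain a well-formed pair of subject and object markers can then be parsed back into generated instances for $\mathcal{G}$.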

3. Two-stage self-learning framework

Self-learning is a learning algorithm widely used in semi-supervised learning. Typically, to learn jointly from an unlabeled dataset and a labeled dataset, samples are drawn iteratively from the unlabeled dataset and assigned pseudo-labels, then merged with the labeled dataset to retrain the model. This naive merging design, however, rests on a strong assumption: the unlabeled dataset must have exactly the same distribution as the labeled dataset, which generated data cannot strictly satisfy.

To this end, the invention proposes a different two-stage self-learning framework: the model is trained separately, first on the unlabeled generated dataset and then on the labeled training dataset, rather than on their union. An auto-encoding language model (e.g., a BERT model) η is first trained on the labeled training dataset $\mathcal{D}$, and the marker-augmented unlabeled generated data are then classified with the auto-encoding language model η to produce soft pseudo-labels:

$\hat{y}_j = \eta\big(\tilde{x}_j^{g}\big);$

Let $\hat{\mathcal{Y}} = \{\hat{y}_j\}_{j=1}^{M}$ be the soft pseudo-label set of $\mathcal{G}$; the hat "^" indicates a soft pseudo-label, i.e., the distribution over label classes is kept rather than reduced further to its argmax. To further smooth out fluctuations in the pseudo-labels, multiple auto-encoding language models are trained with K different random seeds and used as teacher models η; the soft pseudo-label set produced by the k-th teacher model is denoted $\hat{\mathcal{Y}}^{(k)}$.

A soft pseudo-label is a soft form of pseudo-label; soft labels are the counterpart of hard labels. A hard label assigns a sample a discrete value such as 0 or 1 to mark it as a negative or positive example, whereas a soft label keeps the distribution over label classes and scores the sample in the range 0 to 1. Hard labels are simple to assign but not differentiable during optimization; soft labels are smoother and more expressive. The soft pseudo-labels in the invention strengthen the expressiveness of the labels assigned to the generated data.
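A two-line illustration of the distinction (the numbers are made up for illustration):

```python
soft_pseudo_label = [0.05, 0.80, 0.10, 0.05]                    # distribution over 4 relation classes
hard_label = soft_pseudo_label.index(max(soft_pseudo_label))    # argmax collapses it to class 1
```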

A new student model θ is then re-initialized and the two-stage training strategy is applied. In the first training stage, the generated data with soft pseudo-labels are used for training:

$\theta' = \arg\min_{\theta}\ \mathcal{L}_{\text{KD}}\big(\theta;\ \mathcal{G}, \hat{\mathcal{Y}}\big);$

This can be seen as a distillation process: with the help of the generated dataset $\mathcal{G}$, knowledge is transferred from the teacher model η to the student model θ, which is optimized into the student model θ′; the distillation loss $\mathcal{L}_{\text{KD}}$ is computed as:

$\mathcal{L}_{\text{KD}} = \sum_{j=1}^{M}\frac{1}{K}\sum_{k=1}^{K}\mathrm{KL}\Big(\hat{y}_j^{(k)}\ \Big\|\ f_{\theta}\big(\tilde{x}_j^{g}\big)\Big);$

where KL denotes the Kullback-Leibler divergence. Then, in the second training stage, the student model θ′ is trained on the labeled training dataset $\mathcal{D}$:

$\theta'' = \arg\min_{\theta'}\ \sum_{i=1}^{N}\mathcal{L}_{\text{CE}}\big(f_{\theta'}(\tilde{x}_i),\ y_i\big);$

Here $\mathcal{L}_{\text{CE}}$ is the standard cross-entropy loss, and θ″ is the student model obtained after the second training stage. In the next iteration, the student model θ″ serves as the teacher model η to re-annotate $\mathcal{G}$.

The whole process above is repeated T times. Following the standard self-learning setting, 1/T of the generated data is sampled from $\mathcal{G}$ in each iteration; after T iterations, every generated instance in $\mathcal{G}$ has been assigned a soft pseudo-label.
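A control-flow sketch of the two-stage self-training loop, under simplifying assumptions: `build_model()` constructs a fresh BERT-style classifier mapping a batch to class logits, `train_ce(model, loader)` is ordinary supervised cross-entropy training on the labeled set, and `generated_chunks` is the generated set $\mathcal{G}$ pre-split into T equal parts; these helpers are hypothetical stand-ins, not APIs of any particular library:

```python
import torch
import torch.nn.functional as F

def soft_labels(teachers, batch):
    """Average the teachers' class distributions to obtain soft pseudo-labels."""
    with torch.no_grad():
        probs = torch.stack([F.softmax(t(batch), dim=-1) for t in teachers])
    return probs.mean(dim=0)

def distill_step(student, optimizer, batch, targets):
    """Stage 1 update: KL divergence between the soft pseudo-labels and the student output."""
    log_p = F.log_softmax(student(batch), dim=-1)
    loss = F.kl_div(log_p, targets, reduction="batchmean")   # KL(targets || student)
    loss.backward(); optimizer.step(); optimizer.zero_grad()
    return loss.item()

def two_stage_self_training(build_model, train_ce, labeled_loader, generated_chunks, K=3, T=5):
    # K teachers trained on the labeled data D with different random seeds.
    teachers = [train_ce(build_model(), labeled_loader) for _ in range(K)]
    student = None
    for t in range(T):
        student = build_model()                          # fresh student for this round
        opt = torch.optim.AdamW(student.parameters(), lr=2e-5)
        for batch in generated_chunks[t]:                # stage 1: 1/T of the generated data G
            distill_step(student, opt, batch, soft_labels(teachers, batch))
        student = train_ce(student, labeled_loader)      # stage 2: labeled data D, cross-entropy
        teachers = [student]                             # the student becomes the next teacher η
    return student
```

The re-initialization of the student and the optimizer settings are illustrative choices; the essential points are the order of the two stages, soft pseudo-labels for the generated data, and the student re-annotating $\mathcal{G}$ as the next teacher.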

Through this two-stage self-learning scheme, the semantic gap between existing general-purpose knowledge bases and downstream task data can be bridged effectively, while the interference caused by noise in the generated data is also reduced.

It will be apparent to those skilled in the art that the invention is not limited to the details of the exemplary embodiments described above and can be embodied in other specific forms without departing from its spirit or essential characteristics. The embodiments should therefore be regarded in all respects as exemplary and non-restrictive; the scope of the invention is defined by the appended claims rather than by the foregoing description, and all changes falling within the meaning and range of equivalents of the claims are intended to be embraced therein. No reference sign in a claim shall be construed as limiting the claim concerned.

In addition, it should be understood that although this specification is described in terms of embodiments, not every embodiment contains only one independent technical solution; this manner of description is adopted merely for clarity. Those skilled in the art should take the specification as a whole, and the technical solutions of the embodiments may be combined appropriately to form other implementations that a person skilled in the art can understand.

Claims (5)

1. A low-resource relation extraction method based on data synthesis and two-stage self-training, comprising the following steps:

Step 1, data synthesis based on labeled training data: converting the training data into linear natural-language sequences by inserting position markers into them; building a data synthesis model based on a large-scale language model and fine-tuning the data synthesis model on the training data; repeatedly executing the data synthesis process with multinomial sampling until an unlabeled generated dataset $\mathcal{G}=\{g_j\}_{j=1}^{M}$ that meets preset conditions is obtained, wherein each generated instance $g_j=(x_j^g, s_j^g, o_j^g)$ contains position markers, $x_j^g$ is the word sequence of the generated instance, $s_j^g$ and $o_j^g$ are respectively its subject and object, and $M$ is the number of generated instances;

Step 2, two-stage self-learning: training an auto-encoding language model η on the training dataset $\mathcal{D}$, and then classifying the marker-augmented generated dataset $\mathcal{G}$ with the auto-encoding language model η to obtain soft pseudo-labels: $\hat{y}_j = \eta\big(\tilde{x}_j^{g}\big)$; letting $\hat{\mathcal{Y}}=\{\hat{y}_j\}_{j=1}^{M}$ be the soft pseudo-label set of $\mathcal{G}$, $\hat{y}_j$ being a soft pseudo-label; training multiple auto-encoding language models with K different random seeds as teacher models η, the soft pseudo-label set produced by the k-th teacher model being denoted $\hat{\mathcal{Y}}^{(k)}$; initializing a new auto-encoding language model as the student model θ and applying a two-stage training strategy to the student model θ: in the first training stage, performing distillation training on the generated data with soft pseudo-labels, $(\mathcal{G}, \hat{\mathcal{Y}})$; optimizing the student model θ into the student model θ′ and computing the distillation loss $\mathcal{L}_{\text{KD}}$:

$\mathcal{L}_{\text{KD}} = \sum_{j=1}^{M}\frac{1}{K}\sum_{k=1}^{K}\mathrm{KL}\Big(\hat{y}_j^{(k)}\ \Big\|\ f_{\theta}\big(\tilde{x}_j^{g}\big)\Big);$

wherein KL denotes the Kullback-Leibler divergence; in the second training stage, training the student model θ′ on the training dataset $\mathcal{D}$: $\theta'' = \arg\min_{\theta'}\sum_{i=1}^{N}\mathcal{L}_{\text{CE}}\big(f_{\theta'}(\tilde{x}_i),\ y_i\big)$, wherein $\mathcal{L}_{\text{CE}}$ is the standard cross-entropy loss, $\mathcal{Y}$ is the label set corresponding to $\mathcal{D}$, and θ″ is the student model obtained after the second training stage; in the next execution of the two-stage training strategy, using the student model θ″ as the teacher model η; repeating the two-stage training strategy until every generated instance in the generated dataset $\mathcal{G}$ has been assigned a soft pseudo-label;

Step 3, relation extraction: building a relation extraction model based on an auto-encoding language model; the generated data and the training data being collectively called training instances, and the relation label of a training instance being called its true label $\tilde{y}$; inputting the training instances into the relation extraction model and training the relation extraction model by computing the cross-entropy loss between the relation label $p$ predicted by the relation extraction model and the true label $\tilde{y}$ of the training instance.

2. The low-resource relation extraction method based on data synthesis and two-stage self-training according to claim 1, wherein in Step 1 the fine-tuning loss of the data synthesis model is $\mathcal{L}_{\text{LM}} = -\sum_{l}\log P_{\text{LLM}}\big(\tilde{w}_l \mid \tilde{w}_{<l}\big)$; the data synthesis model is fine-tuned in the same way as it was pre-trained, wherein $P$ denotes a probability function, LLM denotes the large-scale language model, $\tilde{w}_l$ is a word of the marker-augmented word sequence $\tilde{x}$ of the training data, and $\tilde{w}_0$ is a special start token; after fine-tuning, a prompt token is added before $\tilde{w}_0$ to trigger generation.

3. The low-resource relation extraction method based on data synthesis and two-stage self-training according to claim 1, wherein in Step 2 the two-stage training strategy is repeated T times; in each iteration, 1/T of the generated data is sampled from the generated dataset $\mathcal{G}$, and after T iterations all generated data in the generated dataset $\mathcal{G}$ have been assigned soft pseudo-labels.

4. The low-resource relation extraction method based on data synthesis and two-stage self-training according to claim 1, wherein when position markers are inserted into the training data in Step 1:

$\tilde{x} = \{w_1, \dots, \langle S\rangle, w_{s_1}, \dots, w_{s_2}, \langle/S\rangle, \dots, \langle O\rangle, w_{o_1}, \dots, w_{o_2}, \langle/O\rangle, \dots, w_L\};$

wherein $\tilde{x}$ denotes the word sequence of the training data after the position markers are inserted, $\langle S\rangle$ and $\langle/S\rangle$ are the markers used to delimit the position of the subject $s$ in the training-data word sequence, and $\langle O\rangle$ and $\langle/O\rangle$ are the markers used to delimit the position of the object $o$.

5. The low-resource relation extraction method based on data synthesis and two-stage self-training according to claim 1, wherein when the relation extraction model predicts the relation label in Step 3, the vector representations h at the corresponding positions of the word sequence of the training instance are concatenated for classification to obtain the relation label distribution $p$ predicted by the relation extraction model:

$p = \mathrm{softmax}\big(\mathrm{FFN}([h_{\langle S\rangle};\ h_{\langle O\rangle}])\big);$

wherein softmax is the activation function, FFN is a fully connected network, $h_{\langle S\rangle}$ is the vector representation of the subject, $h_{\langle O\rangle}$ is the vector representation of the object, and [ ; ] denotes the concatenation operation.
CN202211630125.3A 2022-12-19 2022-12-19 Low-resource relation extraction method based on data synthesis and two-stage self-training Active CN115618022B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211630125.3A CN115618022B (en) 2022-12-19 2022-12-19 Low-resource relation extraction method based on data synthesis and two-stage self-training

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211630125.3A CN115618022B (en) 2022-12-19 2022-12-19 Low-resource relation extraction method based on data synthesis and two-stage self-training

Publications (2)

Publication Number Publication Date
CN115618022A CN115618022A (en) 2023-01-17
CN115618022B true CN115618022B (en) 2023-04-28

Family

ID=84879772

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211630125.3A Active CN115618022B (en) 2022-12-19 2022-12-19 Low-resource relation extraction method based on data synthesis and two-stage self-training

Country Status (1)

Country Link
CN (1) CN115618022B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116415005B (en) * 2023-06-12 2023-08-18 中南大学 A Relation Extraction Method for Scholar Academic Network Construction
CN117174240B (en) * 2023-10-26 2024-02-09 中国科学技术大学 Medical image report generation method based on large model field migration

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113420548A (en) * 2021-06-24 2021-09-21 杭州电子科技大学 Entity extraction sampling method based on knowledge distillation and PU learning
WO2021184311A1 (en) * 2020-03-19 2021-09-23 中山大学 Method and apparatus for automatically generating inference questions and answers
CN114386371A (en) * 2022-03-25 2022-04-22 中国科学技术大学 Chinese spelling error correction method, system, device and storage medium
CN114528835A (en) * 2022-02-17 2022-05-24 杭州量知数据科技有限公司 Semi-supervised specialized term extraction method, medium and equipment based on interval discrimination
CN114912456A (en) * 2022-07-19 2022-08-16 北京惠每云科技有限公司 Medical entity relationship identification method and device and storage medium
CN115270797A (en) * 2022-09-23 2022-11-01 山东省计算中心(国家超级计算济南中心) A method and system for text entity extraction based on self-training semi-supervised learning

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11783008B2 (en) * 2020-11-06 2023-10-10 Adobe Inc. Machine-learning tool for generating segmentation and topic metadata for documents

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2021184311A1 (en) * 2020-03-19 2021-09-23 中山大学 Method and apparatus for automatically generating inference questions and answers
CN113420548A (en) * 2021-06-24 2021-09-21 杭州电子科技大学 Entity extraction sampling method based on knowledge distillation and PU learning
CN114528835A (en) * 2022-02-17 2022-05-24 杭州量知数据科技有限公司 Semi-supervised specialized term extraction method, medium and equipment based on interval discrimination
CN114386371A (en) * 2022-03-25 2022-04-22 中国科学技术大学 Chinese spelling error correction method, system, device and storage medium
CN114912456A (en) * 2022-07-19 2022-08-16 北京惠每云科技有限公司 Medical entity relationship identification method and device and storage medium
CN115270797A (en) * 2022-09-23 2022-11-01 山东省计算中心(国家超级计算济南中心) A method and system for text entity extraction based on self-training semi-supervised learning

Also Published As

Publication number Publication date
CN115618022A (en) 2023-01-17

Similar Documents

Publication Publication Date Title
CN115618022B (en) Low-resource relation extraction method based on data synthesis and two-stage self-training
CN106156003B (en) A kind of question sentence understanding method in question answering system
CN110597997B (en) Military scenario text event extraction corpus iterative construction method and device
CN110377686A (en) A kind of address information Feature Extraction Method based on deep neural network model
CN109359297B (en) Relationship extraction method and system
CN108563653A (en) A kind of construction method and system for knowledge acquirement model in knowledge mapping
CN109543183A (en) Multi-tag entity-relation combined extraction method based on deep neural network and mark strategy
CN110532355A (en) A kind of intention based on multi-task learning combines recognition methods with slot position
CN111125380B (en) Entity linking method based on RoBERTa and heuristic algorithm
US20230065468A1 (en) Apparatus and method for automatic generation and update of knowledge graph from multi-modal sources
CN110008467A (en) A kind of interdependent syntactic analysis method of Burmese based on transfer learning
CN111680169A (en) A data extraction method of electric power scientific and technological achievements based on BERT model technology
CN110276052A (en) An integrated method and device for automatic word segmentation and part-of-speech tagging in ancient Chinese
CN115759254A (en) Question-answering method, system and medium based on knowledge-enhanced generative language model
Du et al. Named entity recognition method with word position
CN115841119B (en) Emotion cause extraction method based on graph structure
CN113535897A (en) Fine-grained emotion analysis method based on syntactic relation and opinion word distribution
CN116303966A (en) Dialogue Act Recognition System Based on Prompt Learning
CN115273828A (en) Training method, device and electronic device for speech intent recognition model
CN110826325A (en) A language model pre-training method, system and electronic device based on adversarial training
CN112417125B (en) Open domain dialogue reply method and system based on deep reinforcement learning
CN114528400A (en) Unified low-sample relation extraction method and device based on multi-selection matching network
CN112015921B (en) Natural language processing method based on learning auxiliary knowledge graph
CN117575026A (en) Large model reasoning analysis method, system and product based on external knowledge enhancement
CN114579605B (en) Form question and answer data processing method, electronic device and computer storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant