
CN115618022B - Low-resource relation extraction method based on data synthesis and two-stage self-training - Google Patents

Low-resource relation extraction method based on data synthesis and two-stage self-training

Info

Publication number: CN115618022B (application CN202211630125.3A)
Authority: CN (China)
Prior art keywords: training, data, model, stage, self
Legal status: Active (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Other languages: Chinese (zh)
Other versions: CN115618022A
Inventors: 张勇东, 毛震东, 陈伟东, 宋彦, 徐本峰, 高杰
Current Assignee: University of Science and Technology of China USTC (the listed assignees may be inaccurate; Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list)
Original Assignee: University of Science and Technology of China USTC
Priority date: 2022-12-19 (the priority date is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed)
Filing date: 2022-12-19
Application filed by University of Science and Technology of China USTC
Priority to CN202211630125.3A
Publication of CN115618022A; application granted; publication of CN115618022B

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36Creation of semantic tools, e.g. ontology or thesauri
    • G06F16/367Ontology

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Animal Behavior & Ethology (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Machine Translation (AREA)

Abstract

The invention relates to the field of data synthesis and discloses a low-resource relation extraction method based on data synthesis and two-stage self-training. In each iteration, the two-stage self-training framework trains the model first on the unlabeled generated data and then on the labeled data, which both encourages the model to learn cooperatively from the unlabeled generated data and the labeled data and effectively reduces the influence of noise in the generated data. The method fits the low-resource conditions of real-world scenarios and makes more effective use of existing pre-trained language models.

Description

Low-resource relation extraction method based on data synthesis and two-stage self-training
Technical Field
The invention relates to the field of data synthesis, in particular to a low-resource relation extraction method based on data synthesis and two-stage self-training.
Background
Relation extraction, an important technical underpinning of knowledge extraction and knowledge graph construction, aims to mine the relations between entities in unstructured text and has become a research hotspot in natural language processing in recent years. Although neural network models, and in particular pre-trained language models, have made significant breakthroughs on the relation extraction task, training these models requires a large amount of annotated data. In many real-world scenarios, however, obtaining high-quality annotated data is time-consuming and labor-intensive, so building a well-performing relation extraction system under limited resources and data has become a significant challenge.
Distant supervision has been widely studied as an effective way to construct large-scale relation extraction datasets: by aligning entities in text with an existing knowledge base, it automatically annotates relation extraction data. However, because of differences in relation schemas and text corpora, distantly supervised annotations can differ greatly from the downstream task, which limits further gains in model performance. For example, owing to the reliance on existing knowledge bases, current distant supervision mostly uses Wikidata as the source of relation triples and Wikipedia as the distant-supervision corpus. This restricts the schema and text of the annotated data to everyday knowledge, whereas downstream tasks may involve specialized knowledge from other domains, such as semantic relations between person names or chemical-protein interactions.
Considering that large-scale language models have demonstrated strong generation capability in many domains, including news articles, product descriptions, and daily conversation, the invention uses a large-scale language model to synthesize data for the relation extraction task, addressing the scarcity of training data in low-resource scenarios and the domain gap of distant supervision. Furthermore, to make better use of the generated data, the invention proposes a two-stage self-learning framework.
Disclosure of Invention
To solve the above technical problems, the invention provides a low-resource relation extraction method based on data synthesis and two-stage self-training. A conventional self-learning framework iteratively labels unlabeled data with pseudo labels and learns from them to drive continual improvement of model performance. In each iteration of the two-stage self-learning framework of the invention, however, training is performed on the generated data in the first stage and on the annotated data in the second stage. Because the annotated data are introduced in the later stage of each round, this sequential training increases the model's attention to the annotated data. Furthermore, the invention formulates training on the generated data as a knowledge-distillation process using soft pseudo labels rather than assigning hard labels to them. Overall, the two-stage self-training framework exploits the generated data, alleviates the shortage of annotated data under low-resource conditions, and further improves the performance of the knowledge extraction system.
In order to solve the technical problems, the invention adopts the following technical scheme:
a low-resource relation extraction method based on data synthesis and two-stage self-training comprises the following steps:
Step one, data synthesis based on annotated training data: the training data are converted into linear natural-language sequences by inserting placeholders; a data synthesis model is built on a large-scale language model and fine-tuned on the training data; the data synthesis process is repeated with multinomial sampling until an unlabeled generated dataset G = {x^g_j}, j = 1, …, N_g, meeting the preset conditions is obtained, in which every generated instance carries placeholders; here x^g is the word sequence of a generated instance, s^g and o^g are its subject and object, and N_g is the number of generated instances.
Step two, two-stage self-learning: a self-encoding language model η is first trained on the training dataset D; the placeholder-augmented generated dataset G is then classified with the self-encoding language model η to assign soft pseudo labels, ŷ = η(x̃^g), where x̃^g denotes a placeholder-augmented generated word sequence. Let Ŷ = {ŷ_j}, j = 1, …, N_g, denote the soft pseudo-label set of G, where ŷ is a soft pseudo label. K self-encoding language models are trained with K different random seeds and recorded as teacher models η, the soft pseudo-label set produced by the k-th teacher model being denoted Ŷ^(k). A new self-encoding language model is initialized and recorded as the student model θ, and a two-stage training strategy is applied to θ. In the first-stage training, distillation training is performed on the generated data with their soft pseudo labels, θ′ = argmin_θ L_KD, whereby the student model θ is optimized into the student model θ′; the distillation loss is
L_KD = (1/K) Σ_{k=1..K} KL(ŷ^(k) ‖ θ(x̃^g))
where KL denotes the KL divergence. In the second-stage training, the student model θ′ is trained on the training dataset D:
θ″ = argmin_{θ′} L_CE(y, θ′(x̃))
where L_CE is the standard cross-entropy loss and θ″ is the student model obtained after the second-stage training iteration. The next time the two-stage training strategy is executed, the student model θ″ serves as the teacher model η. The two-stage training strategy is repeated until every generated instance in the generated dataset G carries a soft pseudo label.
Step three, relation extraction: a relation extraction model is built on a self-encoding language model; the generated data and the training data are collectively called training examples, and the relation labels of the training examples are called true labels y*. The training examples are fed into the relation extraction model, and the loss between the relation label p predicted by the relation extraction model and the true label y* of the training example is computed to train the relation extraction model.
Further, in step one, the fine-tuning loss of the data synthesis model is L_LM = −Σ_t log P_LLM(w̃_t | w̃_<t); the fine-tuning objective of the data synthesis model is the same as its pre-training objective, where P denotes the probability assigned by the large-scale language model LLM, w̃_t is a word of the placeholder-augmented word sequence x̃ in the training data, and <bos> is the special start symbol. After fine-tuning, generation is prompted simply by the marker <bos> placed at the start of the sequence.
Further, in step two, the two-stage training strategy is repeatedly executed T times; in each iteration, 1/T of the generated data are sampled from the generated dataset G, so that after T iterations all generated data in G carry soft pseudo labels.
Further, in step one, placeholders are added to the training data as
x̃ = (w_1, …, <S>, w_{b_s}, …, w_{e_s}, </S>, …, <O>, w_{b_o}, …, w_{e_o}, </O>, …, w_L)
where x̃ denotes the word sequence of the training data after the placeholders are inserted, <S> and </S> are the placeholders marking the position of the subject s in the training-data word sequence, and <O> and </O> are the placeholders marking the position of the object o.
Further, when the relation label predicted by the relation extraction model in step three is computed, the vector representations h at the corresponding positions of the word sequence of a training example are concatenated for classification, giving the predicted relation label p = softmax(FFN([h_s; h_o])), where softmax is the activation function, FFN is a fully connected network, h_s is the vector representation of the subject, h_o is the vector representation of the object, and [;] denotes the concatenation operation.
Compared with the prior art, the invention has the following beneficial technical effects:
The invention provides a low-resource relation extraction method based on data synthesis and two-stage self-training, comprising a data synthesis method and a two-stage self-training framework. The data synthesis effectively alleviates the shortage of annotated data and the high annotation cost in current relation extraction tasks. In each iteration, the two-stage self-training framework trains the self-encoding language model first on the unlabeled generated data and then on the labeled data, which both encourages the self-encoding language model to learn cooperatively from the unlabeled generated data and the labeled data and effectively reduces the influence of noise in the generated data. The relation extraction system fits the low-resource conditions of real-world scenarios, makes more effective use of existing pre-trained language models and explores their potential, and has broad application prospects.
Drawings
Figure 1 is a diagram of the overall model architecture of the present invention.
Detailed Description
A preferred embodiment of the present invention will be described in detail with reference to the accompanying drawings.
The invention assumes a training dataset D = {x_i}, i = 1, …, N, and a corresponding label set Y = {y_i}, i = 1, …, N, where y_i ∈ R is the relation label of training instance x_i, R is the set of all annotated relation labels in the training dataset, and N is the number of training instances in the training dataset. Here x_i is the word sequence of the i-th training instance, x_i = (w_1, …, w_L), where w_t denotes the t-th word of x_i and L is the length of the word sequence; s and o are the subject and object of x_i, the subject starting at position b_s and ending at position e_s of x_i, and the object starting at position b_o and ending at position e_o. The generated dataset G = {x^g_j}, j = 1, …, N_g, has a corresponding soft pseudo-label set Ŷ = {ŷ_j}, j = 1, …, N_g, where ŷ is a soft pseudo label, x^g is the word sequence of a generated instance, s^g and o^g are its subject and object, and N_g is the number of generated instances in the generated dataset.
Generated data x^g and training data x_i are collectively referred to as training examples x̃, and the relation label of a training example is called its true label y*; that is, soft pseudo labels ŷ and relation labels y are collectively referred to as true labels y*. The goal of relation extraction is to learn a function f that predicts the true label y* from a training example, f(x̃) → y*.
The relation extraction system provided by the invention is shown in Fig. 1 and comprises the following three parts: 1. a relation extraction model; 2. a data synthesis model for low-resource scenarios; 3. a two-stage self-learning framework.
1. Relation extraction model
The backbone of the relation extraction model is a BERT-like auto-encoding language model rather than an autoregressive language model, since auto-encoding language models generally perform better on language-understanding tasks. Following previous work on relation extraction models, the invention adds special placeholders to the word sequences of the training examples fed into the relation extraction model to emphasize the positions of the subject and object in the word sequence.
Equation one:
x̃ = (w_1, …, <S>, w_{b_s}, …, w_{e_s}, </S>, …, <O>, w_{b_o}, …, w_{e_o}, </O>, …, w_L)
where x̃ denotes the word sequence of a training example after the special placeholders are added, <S> and </S> are the special placeholders marking the position of the subject s, and <O> and </O> are the special placeholders marking the position of the object o.
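As a concrete illustration of equation one, the following Python sketch inserts the placeholders around the subject and object spans. The marker strings <S>, </S>, <O>, </O>, the function name, and the example sentence are illustrative assumptions, not values fixed by the patent.

```python
def add_entity_markers(words, subj_span, obj_span,
                       subj_tags=("<S>", "</S>"), obj_tags=("<O>", "</O>")):
    """Insert placeholder markers around the subject and object spans.

    words     : list of tokens w_1 .. w_L
    subj_span : (b_s, e_s) inclusive token indices of the subject
    obj_span  : (b_o, e_o) inclusive token indices of the object
    Returns the marked word sequence x-tilde as a list of tokens.
    """
    # Collect insertions as (position, marker); closing markers go after the span.
    inserts = [
        (subj_span[0], subj_tags[0]), (subj_span[1] + 1, subj_tags[1]),
        (obj_span[0],  obj_tags[0]),  (obj_span[1] + 1,  obj_tags[1]),
    ]
    marked = list(words)
    # Insert from the rightmost position so earlier indices stay valid.
    for pos, tag in sorted(inserts, key=lambda p: p[0], reverse=True):
        marked.insert(pos, tag)
    return marked

# Example: "Alice works for Acme Corp" with subject "Alice" and object "Acme Corp".
print(add_entity_markers("Alice works for Acme Corp".split(), (0, 0), (3, 4)))
# ['<S>', 'Alice', '</S>', 'works', 'for', '<O>', 'Acme', 'Corp', '</O>']
```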
After encoding by the relation extraction model, the vector representations h at the corresponding positions of the word sequence are concatenated for classification:
p = softmax(FFN([h_s; h_o]))
where softmax is the activation function, FFN is a fully connected network, h_s is the vector representation of the subject, h_o is the vector representation of the object, [;] denotes the concatenation operation, and p is the relation label distribution predicted by the relation extraction model. The loss between p and the true label y* is computed to train the relation extraction model.
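A minimal PyTorch sketch of this classification head follows. The class name, the dimension handling, and the assumption that h_s and h_o are taken from the encoder hidden states at the subject and object marker positions are illustrative, not the patent's reference implementation.

```python
import torch
import torch.nn as nn

class RelationClassifier(nn.Module):
    """Concatenate subject/object representations and classify the relation."""
    def __init__(self, hidden_size: int, num_relations: int):
        super().__init__()
        # Fully connected network FFN over the concatenation [h_s; h_o].
        self.ffn = nn.Linear(2 * hidden_size, num_relations)

    def forward(self, hidden_states, subj_idx, obj_idx):
        # hidden_states: (batch, seq_len, hidden) from the auto-encoding LM encoder.
        # subj_idx / obj_idx: (batch,) positions of the <S> and <O> markers.
        batch = torch.arange(hidden_states.size(0))
        h_s = hidden_states[batch, subj_idx]          # subject representation h_s
        h_o = hidden_states[batch, obj_idx]           # object representation h_o
        logits = self.ffn(torch.cat([h_s, h_o], dim=-1))
        return torch.softmax(logits, dim=-1)          # predicted relation distribution p
```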
The method of generating the training examples is described subsequently.
2. Data synthesis model for low-resource scenarios
This embodiment employs a data synthesis model based on a large-scale language model (LLM). Given the strong generation capability of large-scale language models, the invention synthesizes the data required by the relation extraction task with a large-scale language model conditioned on the annotated data, thereby addressing both the scarcity of training data in low-resource scenarios and the domain gap of distant supervision. A training instance of the relation extraction model has a specific structure: a relational fact is determined by a piece of text x (i.e., a word sequence) containing a subject s and an object o. The training data are therefore first converted into linear natural-language sequences in the manner of equation one:
x̃ = (w_1, …, <S>, w_{b_s}, …, w_{e_s}, </S>, …, <O>, w_{b_o}, …, w_{e_o}, </O>, …, w_L)
the data synthesis model may be based on any large-scale language model, such as GPT-2. Then, fine tuning is carried out on the large-scale language model based on the marked training data, the fine tuning mode is the same as the pre-training mode of the data synthesis model, and the fine tuning loss of the data synthesis model
Figure 692499DEST_PATH_IMAGE073
The method comprises the following steps:
Figure 87708DEST_PATH_IMAGE023
wherein
Figure 636501DEST_PATH_IMAGE024
Representing probability functions, LLM represents a large-scale language model,
Figure 662226DEST_PATH_IMAGE025
is a word sequence
Figure 929259DEST_PATH_IMAGE026
In a word, and special start symbol<bos>(i.e. beginning of sentence) as
Figure 65842DEST_PATH_IMAGE027
Is added before the sequence. Note that here the relationship labels in the training data are ignored
Figure 469142DEST_PATH_IMAGE062
This is therefore an unconditional generation process. This can eliminate noise due to tag-semantic inconsistencies and enable the data synthesis model to model itself, helping the data synthesis model learn from unlabeled generated data. After finishing the fine tuning, only the special start symbol is needed<bos>A marker is added before to prompt generation and the generation process is repeatedly performed using polynomial sampling until a generation dataset is obtained that meets expectations
Figure 728085DEST_PATH_IMAGE007
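The generation step might look as follows with the Hugging Face Transformers library, assuming a GPT-2 checkpoint that has already been fine-tuned on the placeholder-augmented sequences and that <bos> has been added to its vocabulary; the checkpoint path, batch size, and filtering rule are placeholders. Setting do_sample=True with top_k=0 corresponds to the multinomial sampling mentioned above.

```python
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

# Assumed checkpoint: GPT-2 fine-tuned on the placeholder-augmented sequences.
tokenizer = GPT2TokenizerFast.from_pretrained("path/to/finetuned-gpt2")
model = GPT2LMHeadModel.from_pretrained("path/to/finetuned-gpt2")

def synthesize(num_instances: int, bos: str = "<bos>", max_len: int = 64):
    """Repeatedly sample sequences from <bos> until num_instances are produced."""
    prompt = tokenizer(bos, return_tensors="pt").input_ids
    generated = []
    while len(generated) < num_instances:
        out = model.generate(
            prompt,
            do_sample=True,                  # multinomial sampling over the vocabulary
            top_k=0,                         # no truncation: sample the full distribution
            max_length=max_len,
            num_return_sequences=8,
            pad_token_id=tokenizer.eos_token_id,
        )
        for seq in out:
            text = tokenizer.decode(seq, skip_special_tokens=False)
            # Keep only sequences whose subject/object placeholders are well formed.
            if all(tag in text for tag in ("<S>", "</S>", "<O>", "</O>")):
                generated.append(text)
    return generated[:num_instances]
```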
3. Two-stage self-learning framework
Self-learning (self-training) is a learning algorithm widely used in semi-supervised learning. Typically, to learn jointly from an unlabeled dataset and a labeled dataset, one iteratively samples from the unlabeled dataset, assigns pseudo labels to the sampled data, and merges them with the labeled dataset to retrain the model. However, this naive merging rests on the strong assumption that the unlabeled dataset has exactly the same distribution as the labeled dataset, and the generated data do not strictly satisfy this assumption.
To this end, the invention proposes a different two-stage self-learning framework: the model is trained on the unlabeled generated dataset and the labeled training dataset separately and in turn, rather than on their union. First, a self-encoding language model (for example, a BERT model) η is trained on the labeled training dataset D; the unlabeled, placeholder-augmented generated data G are then classified with the self-encoding language model η to assign soft pseudo labels: ŷ = η(x̃^g), where x̃^g denotes a placeholder-augmented generated word sequence. Let Ŷ = {ŷ_j}, j = 1, …, N_g, denote the soft pseudo-label set of G. Note that the hat symbol here indicates a soft pseudo label, i.e., only the distribution over label categories is kept rather than further taking its argmax. To further reduce fluctuations in the pseudo labels, K different random seeds are used to train multiple self-encoding language models, denoted teacher models η, and the soft pseudo-label set produced by the k-th teacher model is denoted Ŷ^(k).
A soft pseudo label is the counterpart of a hard label. A hard label marks a sample with a discrete 0/1 value indicating a negative or positive example, whereas a soft label keeps the distribution over label categories and marks the sample with values between 0 and 1. Hard labels are simpler to assign but cannot be optimized differentiably; soft labels are smoother and more expressive. The soft pseudo labels used in the invention therefore strengthen the expressiveness of the labels of the generated data.
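A toy numerical illustration of the difference, with assumed class probabilities:

```python
import numpy as np

probs = np.array([0.10, 0.65, 0.25])     # model output over 3 relation classes
soft_label = probs                        # soft pseudo label: keep the full distribution
hard_label = np.eye(3)[probs.argmax()]    # hard label: one-hot via argmax -> [0., 1., 0.]
```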
A new student model θ is then reinitialized and the two-stage training strategy is applied. In the first-stage training, training is performed on the generated data with their soft pseudo labels: θ′ = argmin_θ L_KD. This can be viewed as a distillation process in which, with the help of the generated dataset G, knowledge is transferred from the teacher model η to the student model θ, optimizing it into the student model θ′. The distillation loss L_KD is computed as
L_KD = (1/K) Σ_{k=1..K} KL(ŷ^(k) ‖ θ(x̃^g))
where KL denotes the KL divergence. In the second-stage training, the student model θ′ is trained on the labeled training dataset D:
θ″ = argmin_{θ′} L_CE(y, θ′(x̃))
where L_CE is the standard cross-entropy loss and θ″ is the student model obtained after the second-stage training iteration. In the next iteration, the student model θ″ serves as the teacher model η and relabels G, and the whole process is repeated T times. Following the standard self-training setting, 1/T of the generated data is sampled from G in each iteration, so that after T iterations all generated data in G carry soft pseudo labels.
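The following condensed sketch shows one possible realization of the T-round loop under several simplifying assumptions: the K teacher outputs are averaged into one soft target, every model is assumed to map a batch of encoded examples to relation logits, a fresh student is copied from an initial model each round, and all names are illustrative rather than the patent's reference code.

```python
import copy
import torch
import torch.nn.functional as F

def two_stage_self_training(student_init, teachers, labeled_loader, generated_pool,
                            T: int, epochs: int = 1, lr: float = 2e-5):
    """Illustrative T-round two-stage self-training loop."""
    chunk = len(generated_pool) // T                   # use 1/T of the generated data per round
    student = student_init
    for t in range(T):
        batch_g = generated_pool[t * chunk:(t + 1) * chunk]

        # Soft pseudo labels from the teacher(s); the K teachers are averaged here for brevity.
        with torch.no_grad():
            teacher_logits = torch.stack([m(batch_g) for m in teachers]).mean(dim=0)
            soft = F.softmax(teacher_logits, dim=-1)

        student = copy.deepcopy(student_init)          # fresh student each round (assumption)
        opt = torch.optim.AdamW(student.parameters(), lr=lr)

        # Stage 1: distillation on generated data with soft pseudo labels (KL divergence).
        for _ in range(epochs):
            log_p = F.log_softmax(student(batch_g), dim=-1)
            loss_kd = F.kl_div(log_p, soft, reduction="batchmean")
            opt.zero_grad(); loss_kd.backward(); opt.step()

        # Stage 2: supervised training on the labeled dataset D (cross entropy).
        for _ in range(epochs):
            for x, y in labeled_loader:
                loss_ce = F.cross_entropy(student(x), y)
                opt.zero_grad(); loss_ce.backward(); opt.step()

        teachers = [student]                           # the trained student becomes the next teacher
    return student
```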
Through this two-stage self-learning scheme, the semantic gap between existing general-purpose knowledge bases and downstream task data can be bridged, while the interference caused by noise in the generated data is effectively reduced.
It will be evident to those skilled in the art that the invention is not limited to the details of the foregoing illustrative embodiments, and that the present invention may be embodied in other specific forms without departing from the spirit or essential characteristics thereof. The present embodiments are therefore to be considered in all respects as illustrative and not restrictive, the scope of the invention being indicated by the appended claims rather than by the foregoing description, and all changes which come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein.
Furthermore, it should be understood that although this specification is described in terms of embodiments, not every embodiment contains only a single independent technical solution. This manner of description is adopted merely for clarity; the specification should be taken as a whole, and the technical solutions of the embodiments may be combined appropriately to form other embodiments understandable to those skilled in the art.

Claims (5)

1. A low-resource relation extraction method based on data synthesis and two-stage self-training, comprising the following steps:
Step one, data synthesis based on annotated training data: the training data are converted into linear natural-language sequences by inserting placeholders; a data synthesis model is built on a large-scale language model and fine-tuned on the training data; the data synthesis process is repeated with multinomial sampling until an unlabeled generated dataset G = {x^g_j}, j = 1, …, N_g, meeting the preset conditions is obtained, in which every generated instance carries placeholders; here x^g is the word sequence of a generated instance, s^g and o^g are its subject and object, and N_g is the number of generated instances.
Step two, two-stage self-learning: a self-encoding language model η is first trained on the training dataset D; the placeholder-augmented generated dataset G is then classified with the self-encoding language model η to assign soft pseudo labels, ŷ = η(x̃^g), where x̃^g denotes a placeholder-augmented generated word sequence. Let Ŷ = {ŷ_j}, j = 1, …, N_g, denote the soft pseudo-label set of G, where ŷ is a soft pseudo label. K self-encoding language models are trained with K different random seeds and recorded as teacher models η, the soft pseudo-label set produced by the k-th teacher model being denoted Ŷ^(k). A new self-encoding language model is initialized and recorded as the student model θ, and a two-stage training strategy is applied to θ. In the first-stage training, distillation training is performed on the generated data with their soft pseudo labels, θ′ = argmin_θ L_KD, whereby the student model θ is optimized into the student model θ′; the distillation loss is L_KD = (1/K) Σ_{k=1..K} KL(ŷ^(k) ‖ θ(x̃^g)), where KL denotes the KL divergence. In the second-stage training, the student model θ′ is trained on the training dataset D: θ″ = argmin_{θ′} L_CE(y, θ′(x̃)), where L_CE is the standard cross-entropy loss, Y is the label set corresponding to D, and θ″ is the student model obtained after the second-stage training iteration. The next time the two-stage training strategy is executed, the student model θ″ serves as the teacher model η. The two-stage training strategy is repeated until every generated instance in the generated dataset G carries a soft pseudo label.
Step three, relation extraction: a relation extraction model is built on a self-encoding language model; the generated data and the training data are collectively called training examples, and the relation labels of the training examples are called true labels y*. The training examples are fed into the relation extraction model, and the loss between the relation label p predicted by the relation extraction model and the true label y* of the training example is computed to train the relation extraction model.
2. The low-resource relation extraction method based on data synthesis and two-stage self-training according to claim 1, wherein in step one the fine-tuning loss of the data synthesis model is L_LM = −Σ_t log P_LLM(w̃_t | w̃_<t); the fine-tuning objective of the data synthesis model is the same as its pre-training objective, where P denotes the probability assigned by the large-scale language model LLM, w̃_t is a word of the placeholder-augmented word sequence x̃ in the training data, and <bos> is the special start symbol; after fine-tuning, generation is prompted simply by the marker <bos> placed at the start of the sequence.
3. The low-resource relation extraction method based on data synthesis and two-stage self-training according to claim 1, wherein in step two the two-stage training strategy is repeatedly executed T times; in each iteration, 1/T of the generated data are sampled from the generated dataset G, so that after T iterations all generated data in G carry soft pseudo labels.
4. The low-resource relation extraction method based on data synthesis and two-stage self-training according to claim 1, wherein in step one placeholders are added to the training data as
x̃ = (w_1, …, <S>, w_{b_s}, …, w_{e_s}, </S>, …, <O>, w_{b_o}, …, w_{e_o}, </O>, …, w_L)
where x̃ denotes the word sequence of the training data after the placeholders are inserted, <S> and </S> are the placeholders marking the position of the subject s in the training-data word sequence, and <O> and </O> are the placeholders marking the position of the object o.
5. The low-resource relation extraction method based on data synthesis and two-stage self-training according to claim 1, wherein when the relation label predicted by the relation extraction model in step three is computed, the vector representations h at the corresponding positions of the word sequence of a training example are concatenated for classification, giving the predicted relation label p = softmax(FFN([h_s; h_o])), where softmax is the activation function, FFN is a fully connected network, h_s is the vector representation of the subject, h_o is the vector representation of the object, and [;] denotes the concatenation operation.
CN202211630125.3A 2022-12-19 2022-12-19 Low-resource relation extraction method based on data synthesis and two-stage self-training Active CN115618022B (en)

Priority Applications (1)

CN202211630125.3A (priority date 2022-12-19, filing date 2022-12-19): Low-resource relation extraction method based on data synthesis and two-stage self-training (CN115618022B)


Publications (2)

Publication Number Publication Date
CN115618022A CN115618022A (en) 2023-01-17
CN115618022B true CN115618022B (en) 2023-04-28

Family

ID=84879772

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211630125.3A Active CN115618022B (en) 2022-12-19 2022-12-19 Low-resource relation extraction method based on data synthesis and two-stage self-training

Country Status (1)

Country Link
CN (1) CN115618022B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116415005B (en) * 2023-06-12 2023-08-18 中南大学 Relationship extraction method for academic network construction of scholars
CN117174240B (en) * 2023-10-26 2024-02-09 中国科学技术大学 Medical image report generation method based on large model field migration

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113420548A (en) * 2021-06-24 2021-09-21 杭州电子科技大学 Entity extraction sampling method based on knowledge distillation and PU learning
WO2021184311A1 (en) * 2020-03-19 2021-09-23 中山大学 Method and apparatus for automatically generating inference questions and answers
CN114386371A (en) * 2022-03-25 2022-04-22 中国科学技术大学 Method, system, equipment and storage medium for correcting Chinese spelling error
CN114528835A (en) * 2022-02-17 2022-05-24 杭州量知数据科技有限公司 Semi-supervised specialized term extraction method, medium and equipment based on interval discrimination
CN114912456A (en) * 2022-07-19 2022-08-16 北京惠每云科技有限公司 Medical entity relationship identification method and device and storage medium
CN115270797A (en) * 2022-09-23 2022-11-01 山东省计算中心(国家超级计算济南中心) Text entity extraction method and system based on self-training semi-supervised learning

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11783008B2 (en) * 2020-11-06 2023-10-10 Adobe Inc. Machine-learning tool for generating segmentation and topic metadata for documents


Also Published As

Publication number Publication date
CN115618022A (en) 2023-01-17


Legal Events

Date Code Title Description
PB01: Publication
SE01: Entry into force of request for substantive examination
GR01: Patent grant