CN115186670B - Method and system for identifying domain named entities based on active learning - Google Patents
- Publication number
- CN115186670B (application CN202211092071.XA)
- Authority
- CN
- China
- Prior art keywords
- text
- field
- texts
- named entity
- recognized
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
- G06F40/295—Named entity recognition
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
- G06F16/353—Clustering; Classification into predefined classes
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- Data Mining & Analysis (AREA)
- Computational Linguistics (AREA)
- General Health & Medical Sciences (AREA)
- Life Sciences & Earth Sciences (AREA)
- Biomedical Technology (AREA)
- Biophysics (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Evolutionary Computation (AREA)
- Molecular Biology (AREA)
- Computing Systems (AREA)
- Mathematical Physics (AREA)
- Software Systems (AREA)
- Databases & Information Systems (AREA)
- Machine Translation (AREA)
Abstract
The invention relates to a method and system for domain named entity recognition based on active learning, in the technical field of domain named entity recognition. The texts in a general text set are clustered according to the distance between each text in the general text set and the text of the field to be recognized, obtaining a text set; each text in the text set, together with the text of the field to be recognized, forms an extended text set. Self-supervised learning is performed on a pre-training model according to the extended text set, obtaining a trained pre-training model and a text feature vector corresponding to the field to be recognized. A domain named entity recognition model is then constructed and trained with an active learning method according to the extended text set and the text feature vector corresponding to the field to be recognized, obtaining a trained domain named entity recognition model. The method can transfer general text features to specific domain tasks without requiring a large amount of labeled data.
Description
Technical Field
The invention relates to the technical field of named entity recognition, and in particular to a method and system for domain named entity recognition based on active learning.
Background
In recent years, deep learning based methods have dominated the field of named entity recognition. A named entity recognition method using deep learning can be decomposed into three parts, from input sequence to tag sequence. 1. Distributed representation of the input, i.e. the method of converting the input into vectors: each input word is mapped to a low-dimensional vector while preserving its semantic properties, using representations that include word vectors, character vectors, and hybrid vectors. 2. Context encoder, i.e. the model component that mines the contextual information of the text. Current mainstream methods include recurrent neural networks and their variants, gated recurrent units and long short-term memory networks, as well as Transformer-based language models that learn initial parameters through unsupervised tasks on unlabeled data and encode by combining contextual and static features. 3. Tag decoder, which maps the high-dimensional features output by the model to label classes; for example, a multi-layer perceptron with softmax can serve as the tag decoder, treating the mapping from output to label as a multi-classification task in which each output is mapped independently. However, when these methods are applied to domain text, the following problems mainly arise:
1. Domain text feature extraction is insufficient or relies on manually constructed domain features. Existing text feature extraction mainly depends on establishing a self-supervised task to learn the relations between texts and thereby obtain feature vectors that retain text semantic information. This class of self-supervised learning model, commonly referred to as a pre-trained model, is trained on large volumes of corpus data to obtain text vectors suitable for most downstream tasks. However, domain text usually differs from general text in subject, genre, style, and so on; relying only on text vectors obtained by pre-training on general text therefore limits task accuracy. For this reason, named entity recognition tasks on domain text often add domain features when extracting text feature vectors in order to reach higher model accuracy, but building such features by hand for each domain or each task is impractical. A method that can automatically migrate general text features to specific domain tasks is therefore needed.
2. Named entity recognition models based on deep learning depend on a large amount of labeled data, but manual labeling is costly. Deep learning based models are widely used in named entity recognition tasks because their depth allows complex features to be learned from data. The drawback is that these models contain a large number of parameters: to reach satisfactory accuracy, a large amount of labeled data must first be collected so the model can perform supervised learning, updating its parameters by gradient descent. In research settings there are many public labeled data sets available. In application environments, however, the named entity recognition task is not limited to entities such as people, places, and dates, and may be further refined according to the business scenario, so labeled data must be collected first, and manual labeling requires substantial manpower. A domain named entity recognition method that can migrate general text features to specific domain tasks without requiring a large amount of labeled data is therefore urgently needed.
Disclosure of Invention
The invention aims to provide a domain named entity recognition method and system based on active learning, which can transfer general text features to specific domain tasks and do not need a large amount of labeled data.
In order to achieve the purpose, the invention provides the following scheme:
a domain named entity recognition method based on active learning comprises the following steps:
acquiring a general text set and a text of the field to be recognized;
clustering the texts in the general text set according to the distance between the texts in the general text set and the texts in the field to be recognized to obtain a text set;
determining each text in the text set and the text of the field to be recognized as the text after the field to be recognized is expanded to form an expanded text set;
performing self-supervised learning on a pre-training model according to the extended text set to obtain a trained pre-training model and a text feature vector corresponding to the field to be recognized, wherein the pre-training model comprises a context encoder, a feedforward neural network and a softmax layer which are connected in sequence;
constructing a domain named entity recognition model; the domain named entity recognition model comprises a context encoder and a tag decoder which are sequentially connected; the context encoder is a context encoder in the trained pre-training model;
and training the domain named entity recognition model according to the extended text set and the text feature vector corresponding to the domain to be recognized by adopting an active learning method to obtain a trained domain named entity recognition model, wherein the trained domain named entity recognition model is used for performing domain named entity recognition on the text of the domain to be recognized.
Optionally, the clustering of the texts in the general text set according to the distance between each text in the general text set and the text of the field to be recognized to obtain a text set specifically includes:
determining a text vector of each text in the general text set and a text vector of the text of the field to be recognized;
clustering the text vectors of the texts in the general text set according to the distance between the text vector of each text in the general text set and the text vector of the text of the field to be recognized to obtain a text vector set;
and determining the text corresponding to each text vector in the text vector set as a text set.
Optionally, the determining the text vector of each text in the general text set and the text vector of the text in the field to be recognized specifically includes:
respectively performing word segmentation on each text in the general text set and the text in the field to be recognized to obtain a word segmentation set corresponding to each text;
and respectively inputting the word segmentation sets corresponding to the texts into an encoder to obtain the text vectors of the texts in the general text set and the text vectors of the texts in the field to be recognized.
Optionally, the determining of each text in the text set and the text of the field to be recognized as the expanded texts of the field to be recognized to form an expanded text set specifically includes:
inputting each text vector in the text vector set into a decoder respectively to obtain a text corresponding to each text vector;
and determining texts corresponding to the text vectors and the texts in the field to be recognized as the texts in the field to be recognized after being expanded to form an expanded text set.
Optionally, the training of the domain named entity recognition model according to the extended text set and the text feature vector corresponding to the domain to be recognized by using the active learning method to obtain the trained domain named entity recognition model specifically includes:
under the current iteration number, for any text in the extended text set, inputting the text feature vector corresponding to the field to be recognized and the text into the domain named entity recognition model to obtain a label sequence of the text and the prediction probability of each label for each word in the word segmentation set corresponding to the text; the label sequence of the text comprises the label corresponding to each word obtained after the text is segmented;
determining the information content of the text according to the prediction probabilities corresponding to all the prediction labels in the prediction label sequence of the text; the prediction label sequence of the text comprises the prediction label of each word in the word segmentation set corresponding to the text, and the prediction label of any word is the label with the maximum prediction probability among all labels for that word;
sorting all texts in the extended text set in a descending order according to the information content of each text in the extended text set;
selecting the first M texts to label the domain named entities to obtain labeled texts;
and training the domain named entity recognition model according to the labeled text to obtain a domain named entity recognition model under the next iteration number, determining the unlabeled text in the extended text set as the extended text set under the next iteration number, and entering the next iteration until an iteration stop condition is reached to obtain the trained domain named entity recognition model.
A domain named entity recognition system based on active learning, comprising:
the acquisition module is used for acquiring a general text set and a text of the field to be recognized;
the clustering module is used for clustering the texts in the general text set according to the distance between the texts in the general text set and the text of the field to be recognized to obtain a text set;
the expansion module is used for determining each text in the text set and the text of the field to be recognized as the text after the field to be recognized is expanded to form an expanded text set;
the pre-training module is used for performing self-supervised learning on a pre-training model according to the extended text set to obtain a trained pre-training model and a text feature vector corresponding to the field to be recognized, wherein the pre-training model comprises a context encoder, a feedforward neural network and a softmax layer which are connected in sequence;
the construction module is used for constructing a domain named entity recognition model; the domain named entity recognition model comprises a context coder and a tag decoder which are sequentially connected; the context encoder is a context encoder in the trained pre-training model;
and the training module is used for training the domain named entity recognition model according to the extended text set and the text feature vector corresponding to the field to be recognized by adopting an active learning method to obtain a trained domain named entity recognition model, and the trained domain named entity recognition model is used for recognizing the domain named entity of the text of the field to be recognized.
Optionally, the clustering module specifically includes:
the text vector calculation unit is used for determining the text vector of each text in the general text set and the text vector of the text of the field to be recognized;
the clustering unit is used for clustering the text vectors of the texts in the general text set according to the distance between the text vector of each text in the general text set and the text vector of the text of the field to be recognized to obtain a text vector set;
and the text set determining unit is used for determining the text corresponding to each text vector in the text vector set as a text set.
Optionally, the text vector calculation unit specifically includes:
the word segmentation subunit is used for respectively segmenting each text in the general text set and the text in the field to be recognized to obtain a word segmentation set corresponding to each text;
and the text vector calculation subunit is used for respectively inputting the word segmentation sets corresponding to the texts into an encoder to obtain the text vectors of the texts in the general text set and the text vectors of the texts in the field to be recognized.
Optionally, the expansion module specifically includes:
the decoding unit is used for respectively inputting each text vector in the text vector set into a decoder to obtain the text corresponding to each text vector;
and the expansion unit is used for determining the texts corresponding to the text vectors and the texts in the field to be recognized as the expanded texts in the field to be recognized to form an expanded text set.
Optionally, the training module specifically includes:
a probability determining unit, configured to, for any text in the extended text set, input the text feature vector corresponding to the field to be recognized and the text into the domain named entity recognition model under the current iteration number to obtain a label sequence of the text and the prediction probability of each label for each word in the word segmentation set corresponding to the text; the label sequence of the text comprises the label corresponding to each word obtained after the text is segmented;
the information content calculating unit is used for determining the information content of the text according to the prediction probabilities corresponding to all the prediction labels in the prediction label sequence of the text; the prediction label sequence of the text comprises the prediction label of each word in the word segmentation set corresponding to the text, and the prediction label of any word is the label with the maximum prediction probability among all labels for that word;
the sorting unit is used for sorting all texts in the extended text set in a descending order according to the information quantity of each text in the extended text set;
the marking unit is used for selecting the first M texts to mark the domain named entities to obtain marked texts;
and the training unit is used for training the domain named entity recognition model according to the labeled text to obtain the domain named entity recognition model under the next iteration number, determining the unlabeled text in the extended text set as the extended text set under the next iteration number, and entering the next iteration until an iteration stop condition is reached to obtain the trained domain named entity recognition model.
According to the specific embodiments provided by the invention, the invention discloses the following technical effects. The texts in the general text set are clustered according to the distance between each text in the general text set and the text of the field to be recognized, obtaining a text set; each text in the text set and the text of the field to be recognized are determined as the expanded texts of the field to be recognized, forming an extended text set. Self-supervised learning is performed on the pre-training model according to the extended text set, obtaining a trained pre-training model and a text feature vector corresponding to the field to be recognized. A domain named entity recognition model is constructed and trained with an active learning method according to the extended text set and the text feature vector corresponding to the field to be recognized, obtaining a trained domain named entity recognition model. Because the general texts are clustered according to their distance from the domain texts to obtain the expanded texts, general text features are transferred to specific domain tasks; and because the model is trained with an active learning method, part of the cost of manual labeling is replaced by the computing power of the model, so a model with higher precision can be obtained from as few labeled texts as possible.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings needed to be used in the embodiments will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art to obtain other drawings without inventive exercise.
Fig. 1 is a detailed flowchart of a domain-named entity recognition method based on active learning according to an embodiment of the present invention;
fig. 2 is a general flowchart of a domain-named entity recognition method based on active learning according to an embodiment of the present invention;
FIG. 3 is a block diagram of a pre-trained model;
FIG. 4 is a block diagram of a domain named entity recognition model;
FIG. 5 is a flow chart for training a model using active learning.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
In order to make the aforementioned objects, features and advantages of the present invention comprehensible, embodiments accompanied with figures are described in further detail below.
As shown in fig. 1, an embodiment of the present invention discloses a domain named entity identification method based on active learning, including:
step 101: and acquiring a general text set and a text of the field to be recognized.
Step 102: and clustering the texts in the general text set according to the distance between the texts in the general text set and the texts in the field to be identified to obtain a text set.
Step 103: and determining each text in the text set and the text of the field to be recognized as the text after the field to be recognized is expanded to form an expanded text set.
Step 104: self-supervised learning is carried out on a pre-training model according to the extended text set to obtain a trained pre-training model and a text feature vector corresponding to the field to be recognized, wherein the pre-training model comprises a context encoder, a feedforward neural network and a softmax layer which are connected in sequence.
Step 105: constructing a domain named entity recognition model; the domain named entity recognition model comprises a context coder and a tag decoder which are sequentially connected; the context encoder is a context encoder in the trained pre-training model.
Step 106: and training the domain named entity recognition model according to the extended text set and the text feature vector corresponding to the domain to be recognized by adopting an active learning method to obtain a trained domain named entity recognition model, wherein the trained domain named entity recognition model is used for performing domain named entity recognition on the text of the domain to be recognized.
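Step 104 can be illustrated with a minimal sketch. The patent specifies only the three-block structure (context encoder, feedforward neural network, softmax layer) and a self-supervised objective; everything concrete below — the toy embedding-lookup "encoder", the dimensions, and the masked-token task — is an assumption made for illustration, not the patent's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB, DIM = 50, 8

# Toy stand-ins for the three sequentially connected components.
embed = rng.normal(size=(VOCAB, DIM))   # "context encoder" (toy: embedding lookup)
W_ffn = rng.normal(size=(DIM, DIM))     # feedforward neural network
W_out = rng.normal(size=(DIM, VOCAB))   # projection feeding the softmax layer

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def forward(token_ids):
    """Predict a vocabulary distribution for every position of the sequence."""
    h = embed[token_ids]          # (seq_len, DIM): toy per-token encoding
    h = np.tanh(h @ W_ffn)        # feedforward layer
    return softmax(h @ W_out), h  # per-token probabilities and hidden features

MASK = 0
sentence = np.array([5, 12, 7, 30])
masked = sentence.copy()
masked[2] = MASK                  # self-supervised objective: recover token 7

probs, features = forward(masked)
# The averaged hidden states stand in for the "text feature vector"
# handed to the downstream domain NER model.
text_feature_vector = features.mean(axis=0)
```

In a real system the toy lookup would be a contextual encoder (e.g. a Transformer), and training would update the parameters so that `probs` at the masked position concentrates on the original token; here only the wiring of encoder, feedforward layer, and softmax output is shown.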
In practical application, the clustering of the texts in the general text set according to the distance between each text in the general text set and the text of the field to be recognized to obtain a text set specifically includes:
and determining a text vector of each text in the general text set and a text vector of the text of the field to be recognized.
And clustering the text vectors of the texts in the general text set according to the distance between the text vector of each text in the general text set and the text vector of the text of the field to be recognized to obtain a text vector set.
And determining the text corresponding to each text vector in the text vector set as a text set.
In practical application, the determining the text vector of each text in the general text set and the text vector of the text in the field to be recognized specifically includes:
and respectively performing word segmentation on each text in the general text set and the text in the field to be recognized to obtain a word segmentation set corresponding to each text.
And respectively inputting the word segmentation set corresponding to each text into an encoder to obtain a text vector of each text in the general text set and a text vector of the text of the field to be recognized.
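The segmentation, encoding, and distance-based selection steps above can be sketched as follows. This is an assumption-laden illustration: a deterministic toy embedding replaces the trained encoder, a median-distance threshold replaces the unspecified clustering rule, and all texts and words are invented examples.

```python
import numpy as np

def embed_text(tokens, dim=8):
    """Toy deterministic stand-in for the trained encoder: map each word
    to a vector and average over the word segmentation set."""
    vecs = [np.sin(np.arange(1, dim + 1) * sum(map(ord, t))) for t in tokens]
    return np.mean(vecs, axis=0)

# Word-segmented texts (the method first segments every text into words).
general_text_set = [["stock", "market", "rose"],
                    ["patient", "blood", "test"],
                    ["football", "match", "tonight"],
                    ["tumor", "biopsy", "result"]]
domain_text = ["patient", "tumor", "diagnosis"]

domain_vec = embed_text(domain_text)
distances = [np.linalg.norm(embed_text(t) - domain_vec) for t in general_text_set]

# One simple realization of the clustering rule: keep every general text
# whose vector lies within the median distance of the domain text vector.
THRESHOLD = float(np.median(distances))
text_set = [t for t, d in zip(general_text_set, distances) if d <= THRESHOLD]

# The extended text set is the selected texts plus the domain text itself.
extended_text_set = text_set + [domain_text]
```

The design point is that proximity in the encoder's vector space is used as a proxy for topical relatedness, so domain-relevant material can be pulled out of a large heterogeneous general corpus without any labels.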
In practical application, the determining of each text in the text set and the text of the field to be recognized as the expanded texts of the field to be recognized to form an expanded text set specifically includes:
and respectively inputting each text vector in the text vector set into a decoder to obtain a text corresponding to each text vector.
And determining texts corresponding to the text vectors and texts in the field to be recognized as texts after the field to be recognized is expanded to form an expanded text set.
In practical application, the training of the domain named entity recognition model by the active learning method according to the extended text set and the text feature vector corresponding to the domain to be recognized to obtain the trained domain named entity recognition model specifically includes:
Under the current iteration number, for any text in the extended text set, the text feature vector corresponding to the field to be recognized and the text are input into the domain named entity recognition model to obtain a label sequence of the text and the prediction probability of each label for each word in the word segmentation set corresponding to the text; the label sequence of the text comprises the label corresponding to each word obtained after the text is segmented. Specifically, the text feature vector corresponding to the field to be recognized and the word segmentation set corresponding to the text are input into the domain named entity recognition model, obtaining the label corresponding to each word and the prediction probability of each label for each word.
The information content of the text is determined according to the prediction probabilities corresponding to all the prediction labels in the prediction label sequence of the text; the prediction label sequence of the text comprises the prediction label of each word in the word segmentation set corresponding to the text, and the prediction label of any word is the label with the maximum prediction probability among all labels for that word.
And sequencing all texts in the extended text set in a descending order according to the information quantity of each text in the extended text set.
And selecting the first M texts to label the domain named entities to obtain the labeled texts.
And training the domain named entity recognition model according to the labeled text to obtain a domain named entity recognition model under the next iteration number, determining the unlabeled text in the extended text set as the extended text set under the next iteration number, and entering the next iteration until an iteration stop condition is reached to obtain the trained domain named entity recognition model.
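One round of the active learning loop above can be sketched as follows. The patent leaves the exact information-content measure open; the least-confidence sum used here is one common choice, the random "model" is a stand-in for the trained recognizer, and the tag set, texts, and M are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
N_TAGS = 5   # e.g. B-ENT, I-ENT, O, ... (hypothetical tag set)
M = 2        # number of texts sent to the annotator per round

def predict_tag_probs(text_len):
    """Stand-in for the trained domain NER model: a per-word probability
    distribution over tags. A real system would run the model here."""
    logits = rng.normal(size=(text_len, N_TAGS))
    e = np.exp(logits - logits.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def information_content(tag_probs):
    """Least-confidence score: low maximum probability on the predicted
    tags means the text carries more information for labeling."""
    return float(np.sum(1.0 - tag_probs.max(axis=1)))

extended_text_set = [["acme", "corp", "sued"],
                     ["fever", "and", "cough", "reported"],
                     ["new", "drug", "approved"],
                     ["merger", "talks", "stalled"]]

scores = [information_content(predict_tag_probs(len(t))) for t in extended_text_set]
order = np.argsort(scores)[::-1]                      # descending information content
to_label = [extended_text_set[i] for i in order[:M]]  # first M texts get labeled
next_pool = [extended_text_set[i] for i in order[M:]] # rest form next round's pool
```

After annotators label `to_label`, the model would be retrained on it, `next_pool` would become the extended text set for the next iteration, and the loop would repeat until the stop condition is reached; only the per-round scoring and selection are shown here.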
The embodiment has the following technical effects:
1. Because the prior art must be further refined according to the business scenario when training a named entity recognition model, labeled data must be collected first, and manual labeling requires substantial investment. Active learning is therefore proposed: part of the cost of manual labeling is replaced by the computing power of the model, and a high-precision model is obtained while labeling as few texts as possible.
2. The embodiment provides a text expansion method for named entity recognition in a specific field. Texts are converted into feature vectors, the distances between feature vectors are calculated, and the degree of correlation between a text and the field is thereby determined even when the corpora differ in source, subject, and style. According to this correlation, the corpus data for the specific field is expanded, realizing the function of retrieving text data of a specific direction from a large and heterogeneous general data set.
3. The embodiment determines a concrete definition and calculation method for the optimal text in the active learning strategy and constructs a model iteration method based on it: the information content of each text is determined by calculating the amount of information it carries, and texts are screened with information content as the measurement standard. This effectively reduces the number of training samples required for model training in named entity recognition, improves training efficiency, and reduces training cost.
The invention also provides a field named entity recognition system based on active learning aiming at the method, which comprises the following steps:
and the acquisition module is used for acquiring the general text set and the text of the field to be identified.
And the clustering module is used for clustering the texts in the universal text set according to the distance between the texts in the universal text set and the texts in the field to be identified to obtain a text set.
And the expansion module is used for determining each text in the text set and the text of the field to be recognized as the expanded text of the field to be recognized to form an expanded text set.
The pre-training module is used for performing self-supervised learning on a pre-training model according to the extended text set to obtain a trained pre-training model and a text feature vector corresponding to the field to be recognized, wherein the pre-training model comprises a context encoder, a feedforward neural network and a softmax layer which are connected in sequence.
The construction module is used for constructing a domain named entity recognition model; the domain named entity recognition model comprises a context encoder and a tag decoder which are sequentially connected; the context encoder is a context encoder in the trained pre-training model.
And the training module is used for training the domain named entity recognition model according to the extended text set and the text feature vector corresponding to the field to be recognized by adopting an active learning method to obtain a trained domain named entity recognition model, and the trained domain named entity recognition model is used for recognizing the domain named entity of the text of the field to be recognized.
As an optional implementation manner, the clustering module specifically includes:
and the text vector calculation unit is used for determining the text vector of each text in the general text set and the text vector of the text in the field to be recognized.
And the clustering unit is used for clustering the text vectors of the texts in the general text set according to the distance between the text vector of each text in the general text set and the text vector of the text of the field to be recognized to obtain a text vector set.
And the text set determining unit is used for determining the text corresponding to each text vector in the text vector set as a text set.
As an optional implementation manner, the text vector calculation unit specifically includes:
and the word segmentation subunit is used for respectively segmenting each text in the general text set and the text in the field to be recognized to obtain a word segmentation set corresponding to each text.
And the text vector calculation subunit is used for respectively inputting the word segmentation sets corresponding to the texts into an encoder to obtain the text vectors of the texts in the general text set and the text vector of the text of the field to be recognized.
As an optional implementation manner, the expansion module specifically includes:
and the encoding unit is used for respectively inputting each text vector in the text vector set into a decoder to obtain a text corresponding to each text vector.
And the expansion unit is used for determining the texts corresponding to the text vectors and the texts in the field to be recognized as the expanded texts in the field to be recognized to form an expanded text set.
As an optional implementation manner, the training module specifically includes:
a probability determining unit, configured to, for any text in the extended text set, input a text feature vector corresponding to a field to be identified and the text into the field named entity identification model under the current iteration number to obtain a tag sequence of the text and a prediction probability of each tag in the tag sequence under each word in a word segmentation set corresponding to the text; the label sequence of the text comprises labels corresponding to the participles of the text after the participles are participled.
The information quantity calculating unit is used for determining the information quantity of the text according to the prediction probabilities corresponding to all the prediction labels in the prediction label sequence of the text; the prediction label sequence of the text comprises prediction labels of all participles in a participle set corresponding to the text, and the prediction label of any participle is a label corresponding to the maximum prediction probability of all labels in the prediction probabilities under the participles.
And the sequencing unit is used for sequencing all texts in the extended text set in a descending order according to the information quantity of each text in the extended text set.
And the marking unit is used for selecting the first M texts to mark the domain named entities to obtain marked texts.
And the training unit is used for training the domain named entity recognition model according to the labeled text to obtain the domain named entity recognition model under the next iteration number, determining the text which is not labeled in the extended text set as the extended text set under the next iteration number, and entering the next iteration until an iteration stop condition is reached to obtain the trained domain named entity recognition model.
The invention provides a more specific embodiment for training a domain named entity recognition model, and the overall design of the domain named entity recognition model is shown in FIG. 2.
As shown in fig. 2, the domain-oriented named entity recognition method based on active learning mainly includes three parts: 1) text feature extraction based on domain-adaptive pre-training; 2) a named entity recognition model based on deep learning; 3) a text screening algorithm based on active learning. Together, the three parts in the framework realize text feature expression, named entity recognition, active-learning text screening and related functions.
In this embodiment, the number of pre-training texts is increased by corpus expansion of the domain text, and the domain text features are then extracted through pre-training. The pre-trained model is used to initialize the input vectors and the context encoder of the named entity recognition model. The named entity recognition model maps the text into vectors and then maps the vectors onto entity labels to obtain the prediction probability of each entity; these prediction probabilities are used to compute the text information amount in the active learning strategy. The active learning strategy selects the texts with the largest information amount for labeling. Finally, the labeled texts are used to update the model until the model converges or a satisfactory criterion is reached, at which point the process ends, yielding a named entity recognition model optimized for the domain text.
1) Extracting text features based on field adaptive pre-training:
First, each text x (the text of the field to be recognized and a plurality of general texts) is segmented into words. If the data set is English, word segmentation can be carried out according to spaces. If the data set is Chinese, the word segmentation tool jieba is used to segment the Chinese text, obtaining a word segmentation set corresponding to each text; words without any real meaning are then removed using a Chinese common stop-word list, finally generating a word list of length V. A text after segmentation is written as x = (x_1, x_2, …, x_n), where x_n denotes the nth word obtained by segmenting x. Each text is then represented as a vector of length V, in which the value of a word appearing in the text is the frequency of that word in the text, and the value of a word not appearing in the text (or without real meaning) is 0.
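As a purely illustrative sketch of this first step (the whitespace tokenizer stands in for jieba, and `build_vocab` / `text_to_freq_vector` are hypothetical helper names, not part of the invention):

```python
from collections import Counter

def build_vocab(texts, stop_words):
    # Collect every meaningful word across all texts into a word list
    # of length V, dropping stop words without any real meaning.
    vocab, seen = [], set()
    for text in texts:
        for w in text.split():  # jieba.lcut(text) would be used for Chinese
            if w not in stop_words and w not in seen:
                seen.add(w)
                vocab.append(w)
    return vocab

def text_to_freq_vector(text, vocab, stop_words):
    # The value of a word appearing in the text is its frequency there;
    # words not in the text contribute 0.
    counts = Counter(w for w in text.split() if w not in stop_words)
    return [counts.get(w, 0) for w in vocab]

texts = ["the model predicts entity labels",
         "the encoder maps text to vectors"]
stop = {"the", "to"}
vocab = build_vocab(texts, stop)                  # word list of length V
vec = text_to_freq_vector(texts[0], vocab, stop)  # frequency vector
```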
Secondly, the domain text is expanded. By constructing an encoder, a text can be encoded into a form that is easy to represent, and this form can be decoded back into the original text as losslessly as possible. The word segmentation set of each text from the first step is input into the encoder to obtain a high-dimensional vector containing the topic information of the text. This embodiment introduces noise parameters when constructing the encoder, so that the encoding of each text covers the whole encoding space: the encoding probability is highest near the original encoding and decreases the farther a point lies from the original encoding. Then, which of the general texts are domain-related texts is determined by distance measurement on the vectors output by the encoder. For the distance measurement, this scheme adopts the kNN algorithm: for each domain text, the k nearest general texts are found as related texts, with Euclidean distance as the metric. The general text vectors obtained by this clustering are input into a decoder, decoded back into the original texts, and added to the domain text, realizing the expansion of the domain text.
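The kNN screening over encoder outputs can be sketched as follows; this is a minimal illustration with toy 2-dimensional vectors, and the function names are assumptions:

```python
import math

def euclidean(u, v):
    # Euclidean distance, the metric adopted by the scheme.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def knn_related_texts(domain_vecs, general_vecs, k):
    # For each domain-text vector, take the k nearest general-text
    # vectors; the union over all domain texts is the related set.
    related = set()
    for dv in domain_vecs:
        ranked = sorted(range(len(general_vecs)),
                        key=lambda i: euclidean(dv, general_vecs[i]))
        related.update(ranked[:k])
    return sorted(related)

domain = [[0.0, 0.0]]
general = [[0.1, 0.0], [5.0, 5.0], [0.0, 0.2], [9.0, 1.0]]
idx = knn_related_texts(domain, general, k=2)  # indices of related general texts
```

The selected indices would then be mapped back to texts via the decoder and appended to the domain corpus.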
Thirdly, the domain text is pre-trained. The trained pre-training model is obtained by running a self-supervised learning task on the expanded domain text, and a text feature vector suitable for the domain is obtained during training. The self-supervised task adopts random masking: some words are randomly covered in the input corpus and predicted from their context. During training, a sentence is fed into the model several times for parameter learning. So that the downstream task can still see the covered original word with some probability, after the words to be covered are determined, 80% of them are directly replaced by [Mask], 10% are replaced by a random other word, and 10% keep the original word.
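The 80% / 10% / 10% covering rule can be sketched as below; `mask_tokens` is a hypothetical helper, and the 15% selection rate is an assumption commonly paired with this kind of masking task:

```python
import random

def mask_tokens(tokens, mask_prob=0.15, vocab=None, rng=None):
    # Randomly choose tokens to cover; of the chosen ones, 80% become
    # [Mask], 10% become a random other word, 10% keep the original.
    # Returns the corrupted sequence and the prediction targets
    # (None at positions that do not enter the loss).
    rng = rng or random.Random()
    vocab = vocab or tokens
    corrupted, targets = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            targets.append(tok)            # the model must predict this word
            r = rng.random()
            if r < 0.8:
                corrupted.append("[Mask]")
            elif r < 0.9:
                corrupted.append(rng.choice(vocab))
            else:
                corrupted.append(tok)      # kept, but still predicted
        else:
            targets.append(None)
            corrupted.append(tok)
    return corrupted, targets
```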
The input vector representation of the pre-trained model is the element-wise sum of three feature vectors: a word vector, a position vector, and a segmentation vector. For Chinese, the word vector is the vector representation of a character; for English, it is the vector representation of each token after WordPiece segmentation. The position vector encodes the position information of the word into a feature vector, compensating for the loss caused by abandoning the traditional RNN and CNN structures. The segmentation vector distinguishes two sentences, e.g., whether sentence B is the context of sentence A.
FIG. 3 illustrates the structure of the pre-training model. The lowest layer is the input vector representation, i.e., the element-wise sum of the three embedding features above; Trm denotes a Transformer unit; Ti denotes the feature vector of the i-th character; and the output layers are a feedforward neural network and a softmax layer. In the pre-training task, Ti is used to predict the original word, and the goal of the task is to reduce the cross entropy between Ti and the real word.
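The input representation — the element-wise sum of word, position and segmentation vectors — can be illustrated with toy 2-dimensional embedding tables (all values and names hypothetical):

```python
def input_representation(token_ids, segment_ids, word_emb, pos_emb, seg_emb):
    # Input vector = element-wise sum of word, position and
    # segmentation embeddings, one summed vector per token.
    return [
        [w + p + s for w, p, s in zip(word_emb[t], pos_emb[i], seg_emb[g])]
        for i, (t, g) in enumerate(zip(token_ids, segment_ids))
    ]

# toy 2-dimensional embedding tables (hypothetical values)
word_emb = {0: [1.0, 0.0], 1: [0.0, 1.0]}
pos_emb  = [[0.1, 0.1], [0.2, 0.2]]
seg_emb  = {0: [0.0, 0.5]}
vecs = input_representation([0, 1], [0, 0], word_emb, pos_emb, seg_emb)
```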
2) Named entity recognition
As shown in fig. 4, the domain named entity recognition model mainly includes two modules: a context encoder and a tag decoder.
The encoder still adopts the stacked Transformer structure of the pre-training model, and its initialization parameters are all derived from the pre-training model. The Transformer generates vectorized representations of input and output entirely through a self-attention mechanism, without depending on a traditional recurrent or convolutional network, and has the following advantages. First, the computation of one time slice does not depend on the previous time slice, which exploits the parallel capability of the model. Second, the distance between any two positions of the sequence is constant, alleviating the long-term dependence problem. Abandoning the traditional neural networks may cost the model some ability to capture local features, so the position information of characters is added to the input feature vector to compensate.
The main goal of the context encoder module is to map the input text into text feature vectors. Assume the original text is x = (x_1, x_2, …, x_n); the mapped feature vector can be represented as T = (T_1, T_2, …, T_n), where T_i denotes the feature vector of the i-th participle after the text x is segmented. As noted above, Token1 in FIGS. 3 and 4 is the 1st participle x_1 of x. The initial word vector is the text feature vector obtained by pre-training on the domain text, and it is fine-tuned in the downstream task together with the Transformer structure parameters. The text feature vector output by the module therefore contains not only the features of the text but also the relation between the text and the labels; when the text is recognized by the domain named entity recognition model, the initial word vector helps judge whether each participle of the text is related to a label. After the context encoder, the new feature vector T is generated as the input of the label decoder, and the final predicted label probabilities are obtained through the decoder.
The main goal of the tag decoder module is to map the text feature vector T onto the entity tag sequence, which can be regarded as a multi-classification task. Assume the tag sequence of text x = (x_1, …, x_n) is y = (y_1, …, y_n). If there are k labels in total, the text feature vector is reduced to k dimensions by a linear layer, and the prediction probability of each label under each participle is obtained through a softmax layer: p_i^j denotes the prediction probability of label j under participle x_i, and the predicted label of participle x_i is y_i* = argmax_j p_i^j. The model optimizes the network performance with a cross-entropy loss function.
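A minimal sketch of this decoding step — a linear layer to k dimensions followed by softmax, with the predicted label taken as the argmax — using a toy 2-tag, 2-dimensional example (weights and tag names hypothetical):

```python
import math

def softmax(logits):
    # Numerically stable softmax over one token's k logits.
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]

def decode_tags(features, W, b, tags):
    # Map each d-dim token feature to k tag logits via a linear layer,
    # apply softmax, and predict the tag with maximum probability.
    probs, predicted = [], []
    for f in features:
        logits = [sum(w * x for w, x in zip(W[j], f)) + b[j]
                  for j in range(len(tags))]
        p = softmax(logits)
        probs.append(p)
        predicted.append(tags[max(range(len(tags)), key=p.__getitem__)])
    return probs, predicted

tags = ["O", "B-ENT"]
W = [[1.0, 0.0], [0.0, 1.0]]       # hypothetical 2x2 weight matrix
b = [0.0, 0.0]
features = [[2.0, 0.0], [0.0, 3.0]]
probs, pred = decode_tags(features, W, b, tags)
```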
The network realizes that under the condition of considering text field characteristics, sequence characteristics and text-label mapping, the context characteristic vector of the text is extracted by adopting a network structure of a self-attention mechanism, the mapping relation between the context characteristic vector and the label sequence is mined by taking reduction of reality and prediction of label difference as targets, and thus the task of named entity identification is completed.
3) Text screening method based on active learning
The main objective of the screening method is to achieve the highest possible model accuracy with as little labeled data as possible. Through the above tag decoder, for each text x in the unlabeled set U, the predicted tag sequence y* = (y*_1, …, y*_n) and the probability P(y*_i | x) of each predicted label can be obtained; the information amount Info(x) of x is defined over this probability set.
Since the predicted tag probabilities output by the tag decoder are independent of each other, the log-probability of the whole sequence is log P(y* | x) = Σ_{i=1}^{n} log P(y*_i | x).
So that the strategy does not tend to select longer sequences, the above formula is normalized by the sequence length n: Info(x) = −(1/n) Σ_{i=1}^{n} log P(y*_i | x).
The best text defined by the strategy is the one maximizing this normalized negative log-probability, i.e. MNLP (Maximum Normalized Log-Probability). All texts are then sorted in descending order of Info(x); after each round of model training converges, a number of top-ranked texts are selected for labeling and used in the next round of learning.
After the text information amount is defined, the information amount of each text which is not marked can be measured, so that the best text is screened out for model updating. The updating mode of the model in the scheme is full updating, namely, all the labeled texts are used as a training set to carry out model training in the next round, and training is started from the best model stored in the previous round in an iterative manner. FIG. 5 illustrates a flow of model updates in an active learning process.
First, the model is trained on the labeled texts until convergence. The unlabeled texts are then divided into batches of size B, the predicted entity-label probabilities of each batch are obtained through the model, and the information amount of each text is computed. All texts are then sorted, the first M texts in descending order of information amount are selected for labeling and added to the labeled set, and the next round of training begins. This continues until a stopping criterion is reached. Finally, a named entity recognition model of satisfactory precision is obtained while labeling as few texts as possible.
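The update flow of FIG. 5 can be sketched as the loop below; the trainer, annotator and scoring callbacks are stand-ins supplied by the caller, not part of the invention:

```python
def active_learning_loop(unlabeled, train_and_predict, label, score,
                         batch_size, m, max_rounds):
    # Each round: retrain on all labeled texts (full update), score the
    # unlabeled texts in batches of B, label the top-M most informative,
    # and move them from the unlabeled pool to the labeled set.
    labeled = []
    for _ in range(max_rounds):
        if not unlabeled:
            break
        model = train_and_predict(labeled)
        scored = []
        for start in range(0, len(unlabeled), batch_size):
            batch = unlabeled[start:start + batch_size]
            scored.extend((score(model, t), t) for t in batch)
        scored.sort(key=lambda st: st[0], reverse=True)
        chosen = [t for _, t in scored[:m]]
        labeled.extend(label(t) for t in chosen)
        unlabeled = [t for t in unlabeled if t not in chosen]
    return labeled, unlabeled

# toy demonstration with hypothetical callbacks: longer texts score higher
labeled, remaining = active_learning_loop(
    ["aaa", "a", "aa"],
    train_and_predict=lambda labeled: None,  # stand-in model trainer
    label=lambda t: (t, "tag"),              # stand-in human annotator
    score=lambda model, t: len(t),           # stand-in information amount
    batch_size=2, m=1, max_rounds=2)
```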
The stopping criterion is determined according to the specific situation: for example, the F-score of the model reaches a satisfactory precision, or the cost of continued labeling far exceeds the resulting improvement in model precision. In practice, the model is generally evaluated with 25%, 50%, 75% and 100% of the training set; if the last 25% brings no obvious improvement in F-score, labeling stops.
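One possible reading of this stopping rule, with the target F-score and minimum-gain threshold as hypothetical parameters:

```python
def should_stop(f_scores, target=None, min_gain=0.01):
    # Stop when the latest F-score reaches a satisfactory target, or
    # when the most recent evaluation slice (e.g. the last 25% of the
    # training set) brings no obvious improvement over the previous one.
    if target is not None and f_scores[-1] >= target:
        return True
    return len(f_scores) >= 2 and f_scores[-1] - f_scores[-2] < min_gain
```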
The invention has the following technical effects:
the domain-oriented named entity recognition method based on active learning provided by the embodiment can extract text context features through domain text pre-training, realize entity prediction through a named entity recognition model, screen out the best text for marking according to an active learning strategy, and update the model.
The word vectors are used for evaluating the distance between the data in the universal text data set and the materials in the specific field, so that the automatic expansion of the data set in the specific field by using the multi-source universal data set is realized, and the automatic migration of the universal model to the model in the specific field is realized.
The invention provides an automatic feature migration method based on the difference between domain texts and general texts, used to extract and express domain text features. At the same time, a deep-learning named entity recognition algorithm establishes the context relation of text features and maps it onto entity labels to complete the named entity recognition task, and an active learning strategy selects the most informative texts for labeling. An active learning framework for domain-oriented named entity recognition is thus constructed, obtaining a high-precision model while labeling as few texts as possible. The invention solves the problem of automatically migrating general text features into a specific domain and fitting a deep model with as few annotated texts as possible.
The embodiments in the present description are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments are referred to each other. For the system disclosed by the embodiment, the description is relatively simple because the system corresponds to the method disclosed by the embodiment, and the relevant points can be referred to the method part for description.
The principle and the embodiment of the present invention are explained by applying specific examples, and the above description of the embodiments is only used to help understanding the method and the core idea of the present invention; meanwhile, for a person skilled in the art, according to the idea of the present invention, the specific embodiments and the application range may be changed. In view of the foregoing, the description is not to be taken in a limiting sense.
Claims (8)
1. A domain named entity recognition method based on active learning is characterized by comprising the following steps:
acquiring a general text set and a text of a field to be identified;
clustering the texts in the general text set according to the distance between the texts in the general text set and the texts in the field to be recognized to obtain a text set;
determining each text in the text set and the text of the field to be recognized as the text after the field to be recognized is expanded to form an expanded text set;
performing self-supervised learning on a pre-training model according to the extended text set to obtain a trained pre-training model and a text feature vector corresponding to the field to be recognized, wherein the pre-training model comprises a context encoder, a feedforward neural network and a softmax layer which are connected in sequence;
constructing a domain named entity recognition model; the domain named entity recognition model comprises a context coder and a tag decoder which are sequentially connected; the context encoder is a context encoder in the trained pre-training model;
training the domain named entity recognition model according to the extended text set and the text feature vector corresponding to the domain to be recognized by adopting an active learning method to obtain a trained domain named entity recognition model, wherein the trained domain named entity recognition model is used for performing domain named entity recognition on the text of the domain to be recognized;
the method for active learning is used for training the domain named entity recognition model according to the extended text set and the text feature vector corresponding to the domain to be recognized to obtain a trained domain named entity recognition model, and specifically comprises the following steps:
under the current iteration times, inputting a text feature vector corresponding to a field to be recognized and the text into the field named entity recognition model to obtain a label sequence of the text and the prediction probability of each label in the label sequence under each participle in a participle set corresponding to the text for any text in the extended text set; the label sequence of the text comprises labels corresponding to the participles of the text after the participles are participled;
determining the information content of the text according to the prediction probabilities corresponding to all the prediction labels in the prediction label sequence of the text; the prediction label sequence of the text comprises prediction labels of all participles in a participle set corresponding to the text, and the prediction label of any participle is a label corresponding to the maximum prediction probability of all labels in the prediction probabilities under the participles;
sorting all texts in the extended text set in a descending order according to the information content of each text in the extended text set;
selecting the first M texts to label the domain named entities to obtain the labeled texts;
and training the domain named entity recognition model according to the labeled text to obtain the domain named entity recognition model under the next iteration number, determining the unlabeled text in the expanded text set as the expanded text set under the next iteration number, and entering the next iteration until an iteration stop condition is reached to obtain the trained domain named entity recognition model.
2. The method for recognizing a domain named entity based on active learning according to claim 1, wherein the clustering of the texts in the generic text set according to the distance between the texts in the generic text set and the text of the domain to be recognized specifically comprises:
determining a text vector of each text in the general text set and a text vector of the text of the field to be recognized;
clustering the text vectors of the texts in the general text set according to the distance between the text vector of each text in the general text set and the text vector of the text of the field to be recognized, so as to obtain a text vector set;
and determining the text corresponding to each text vector in the text vector set as a text set.
3. The method as claimed in claim 2, wherein the determining the text vector of each text in the generic text set and the text vector of the text in the domain to be recognized specifically includes:
respectively performing word segmentation on each text in the general text set and the text in the field to be recognized to obtain a word segmentation set corresponding to each text;
and respectively inputting the word segmentation set corresponding to each text into an encoder to obtain a text vector of each text in the general text set and a text vector of the text in the field to be identified.
4. The method according to claim 3, wherein the determining of each text in the text set and the text of the field to be recognized as the text after the field to be recognized is expanded to form an expanded text set comprises:
respectively inputting each text vector in the text vector set into a decoder to obtain a text corresponding to each text vector;
and determining texts corresponding to the text vectors and the texts in the field to be recognized as the texts in the field to be recognized after being expanded to form an expanded text set.
5. A domain named entity recognition system based on active learning, comprising:
the acquisition module is used for acquiring a general text set and a text of a field to be identified;
the clustering module is used for clustering all texts in the universal text set according to the distance between each text in the universal text set and the text of the field to be identified to obtain a text set;
the expansion module is used for determining each text in the text set and the text of the field to be recognized as the text after the field to be recognized is expanded to form an expanded text set;
the pre-training module is used for carrying out self-supervised learning on a pre-training model according to the extended text set to obtain a trained pre-training model and a text feature vector corresponding to the field to be recognized, and the pre-training model comprises a context encoder, a feedforward neural network and a softmax layer which are sequentially connected;
the construction module is used for constructing a domain named entity recognition model; the domain named entity recognition model comprises a context encoder and a tag decoder which are sequentially connected; the context encoder is a context encoder in the trained pre-training model;
the training module is used for training the domain named entity recognition model according to the extended text set and the text feature vector corresponding to the field to be recognized by adopting an active learning method to obtain a trained domain named entity recognition model, and the trained domain named entity recognition model is used for recognizing the domain named entity of the text of the field to be recognized;
the training module specifically comprises:
a probability determining unit, configured to, for any text in the extended text set, input a text feature vector corresponding to a field to be identified and the text into the field named entity identification model under the current iteration number to obtain a tag sequence of the text and a prediction probability of each tag in the tag sequence under each word in a word segmentation set corresponding to the text; the label sequence of the text comprises labels corresponding to all participles after the text is participled;
the information quantity calculating unit is used for determining the information quantity of the text according to the prediction probabilities corresponding to all the prediction labels in the prediction label sequence of the text; the prediction label sequence of the text comprises prediction labels of all participles in a participle set corresponding to the text, and the prediction label of any participle is a label corresponding to the maximum prediction probability of all labels in the prediction probabilities under the participles;
the sorting unit is used for sorting all texts in the extended text set in a descending order according to the information quantity of each text in the extended text set;
the marking unit is used for selecting the first M texts to mark the domain named entities to obtain marked texts;
and the training unit is used for training the domain named entity recognition model according to the labeled text to obtain the domain named entity recognition model under the next iteration number, determining the unlabeled text in the extended text set as the extended text set under the next iteration number, and entering the next iteration until an iteration stop condition is reached to obtain the trained domain named entity recognition model.
6. The system according to claim 5, wherein the clustering module specifically comprises:
the text vector calculation unit is used for determining text vectors of all texts in the general text set and text vectors of texts in the field to be identified;
the clustering unit is used for clustering the text vectors of the texts in the general text set according to the distance between the text vector of each text in the general text set and the text vector of the text of the field to be identified, so as to obtain a text vector set;
and the text set determining unit is used for determining the text corresponding to each text vector in the text vector set as a text set.
7. The system according to claim 6, wherein the text vector calculation unit specifically comprises:
the word segmentation subunit is used for respectively segmenting each text in the general text set and the text in the field to be recognized to obtain a word segmentation set corresponding to each text;
and the text vector calculation subunit is used for respectively inputting the word segmentation sets corresponding to the texts into an encoder to obtain text vectors of the texts in the general text set and text vectors of the texts in the field to be identified.
8. The system of claim 7, wherein the expansion module specifically comprises:
the encoding unit is used for respectively inputting each text vector in the text vector set into a decoder to obtain a text corresponding to each text vector;
and the expansion unit is used for determining the texts corresponding to the text vectors and the texts in the field to be recognized as the expanded texts in the field to be recognized to form an expanded text set.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202211092071.XA CN115186670B (en) | 2022-09-08 | 2022-09-08 | Method and system for identifying domain named entities based on active learning |
Publications (2)
Publication Number | Publication Date |
---|---|
CN115186670A CN115186670A (en) | 2022-10-14 |
CN115186670B true CN115186670B (en) | 2023-01-03 |
Family
ID=83522463
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202211092071.XA Active CN115186670B (en) | 2022-09-08 | 2022-09-08 | Method and system for identifying domain named entities based on active learning |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN115186670B (en) |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116070700A (en) * | 2023-02-02 | 2023-05-05 | 北京交通大学 | Biomedical relation extraction method and system integrating iterative active learning |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112800766A (en) * | 2021-01-27 | 2021-05-14 | 华南理工大学 | Chinese medical entity identification and labeling method and system based on active learning |
CN113919358A (en) * | 2021-11-03 | 2022-01-11 | 厦门市美亚柏科信息股份有限公司 | Named entity identification method and system based on active learning |
CN114266254A (en) * | 2021-12-24 | 2022-04-01 | 上海德拓信息技术股份有限公司 | Text named entity recognition method and system |
Family Cites Families (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111553164A (en) * | 2020-04-29 | 2020-08-18 | 平安科技(深圳)有限公司 | Training method and device for named entity recognition model and computer equipment |
CN112214604A (en) * | 2020-11-04 | 2021-01-12 | 腾讯科技(深圳)有限公司 | Training method of text classification model, text classification method, device and equipment |
- 2022-09-08 CN CN202211092071.XA patent/CN115186670B/en active Active
Non-Patent Citations (2)
Title |
---|
Named entity recognition of emergency plans based on semi-supervised learning and CRF; Liu Tong et al.; Software Guide (软件导刊); 2020-12-31 (No. 03); pp. 41-44 *
Research on named entity recognition methods for specific domains; Zhang Lei; Computer and Modernization (计算机与现代化); 2018-03-15 (No. 03); pp. 64-68 *
Also Published As
Publication number | Publication date |
---|---|
CN115186670A (en) | 2022-10-14 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN109992782B (en) | Legal document named entity identification method and device and computer equipment | |
CN108984526B (en) | Document theme vector extraction method based on deep learning | |
CN112541355B (en) | Entity boundary type decoupling few-sample named entity recognition method and system | |
CN111694924A (en) | Event extraction method and system | |
CN112270379A (en) | Training method of classification model, sample classification method, device and equipment | |
WO2022198750A1 (en) | Semantic recognition method | |
CN108536754A (en) | Electronic health record entity relation extraction method based on BLSTM and attention mechanism | |
CN112183064B (en) | Text emotion reason recognition system based on multi-task joint learning | |
CN113591483A (en) | Document-level event argument extraction method based on sequence labeling | |
CN108416058A (en) | A kind of Relation extraction method based on the enhancing of Bi-LSTM input informations | |
CN111881677A (en) | Address matching algorithm based on deep learning model | |
CN113190656A (en) | Chinese named entity extraction method based on multi-label framework and fusion features | |
CN110555084A (en) | remote supervision relation classification method based on PCNN and multi-layer attention | |
CN111177402B (en) | Evaluation method, device, computer equipment and storage medium based on word segmentation processing | |
CN110580287A (en) | Emotion classification method based ON transfer learning and ON-LSTM | |
CN113961666B (en) | Keyword recognition method, apparatus, device, medium, and computer program product | |
CN111309918A (en) | Multi-label text classification method based on label relevance | |
CN113255366B (en) | Aspect-level text emotion analysis method based on heterogeneous graph neural network | |
CN116432655B (en) | Method and device for identifying named entities with few samples based on language knowledge learning | |
CN111145914A (en) | Method and device for determining lung cancer clinical disease library text entity | |
CN115831102A (en) | Speech recognition method and device based on pre-training feature representation and electronic equipment | |
CN112699685A (en) | Named entity recognition method based on label-guided word fusion | |
CN113204975A (en) | Sensitive character wind identification method based on remote supervision | |
CN115186670B (en) | Method and system for identifying domain named entities based on active learning | |
CN116595023A (en) | Address information updating method and device, electronic equipment and storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||