CN116244445B - Aviation text data labeling method and labeling system thereof - Google Patents
Aviation text data labeling method and labeling system thereof
- Publication number
- CN116244445B (application number CN202211706705.6A)
- Authority
- CN
- China
- Prior art keywords
- entity
- sample
- aviation
- text
- model
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
- G06F16/367 — Information retrieval; unstructured textual data; creation of semantic tools; ontology
- G06F16/35 — Information retrieval; unstructured textual data; clustering; classification
- G06F40/247 — Natural language analysis; lexical tools; thesauruses; synonyms
- G06F40/295 — Natural language analysis; recognition of textual entities; named entity recognition
- G06N3/049 — Neural networks; architecture; temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
- G06N3/08 — Neural networks; learning methods
- G06N3/084 — Neural networks; learning methods; backpropagation, e.g. using gradient descent
- Y02D10/00 — Energy efficient computing, e.g. low power processors, power management or thermal management
Abstract
The application relates to an aviation text data labeling method and a labeling system thereof, comprising the following steps: step 1, expanding samples of original aviation text data with a text enhancement algorithm based on the entity-core EODA, the samples of the original aviation text data and the expanded samples together forming the unlabeled data; step 2, screening target samples from the unlabeled data through sample screening based on an active learning model; and step 3, establishing an aviation text labeling model based on information extraction to realize labeling of arbitrary aviation text data. The application expands the number of samples through a data enhancement algorithm based on the entity-core EODA and distinguishes words through entity recognition. In the active learning model, an uncertainty sample query strategy and a version-space-reduction sample query strategy are combined to establish a sample query strategy based on the lowest word-level confidence. Under the active learning framework, experimental verification shows that the labeling efficiency is improved. The effective fusion of the algorithms and models raises the intelligence level of the labeling system.
Description
Technical Field
The application relates to the field of aviation text information extraction, in particular to an aviation text data labeling method and an aviation text data labeling system.
Background
In natural language processing, information extraction is a mature technology that plays a large role in real scenes such as data retrieval, knowledge graphs, and question-answering systems. However, the performance of information extraction depends to a great extent on the quality and scale of the labeled data, and open-source data can hardly meet the requirements of specific scenes. Realizing an efficient, high-quality, and automatic aviation text labeling system is therefore an important research direction in the field of information extraction.
At present, aviation text labeling mainly relies on practitioners manually labeling the original data, with labeling tools or systems used to improve labeling consistency and efficiency. Existing labeling systems at home and abroad fall into two main types: one type depends entirely on manual labeling, while the other type integrates a semi-supervised active learning algorithm into the labeling system so that data can be labeled semi-automatically, giving higher labeling efficiency than the former. Aiming at the problems of scarce labeled data and difficult data labeling in the aviation field, the present application realizes the labeling function based on information extraction technology.
Disclosure of Invention
In order to overcome the defects of the prior art, the application completes the expansion of the number of samples through a data enhancement algorithm based on the entity core EODA (Entity-Oriented Data Augmentation), and experimental comparison of multiple models on entity recognition and relation extraction tasks shows a favorable effect gain from this data enhancement work; the relation extraction part uses maximum entropy, minimum confidence, and boundary sampling strategies through the query strategy based on the lowest word-level confidence. Under the framework of the two types of active learning, experiments prove that the labeling efficiency is obviously improved. Through the effective fusion of the algorithms and techniques, the intelligence level of the labeling system is improved.
In order to achieve the above object, the solution adopted by the present application is: an aviation text data labeling method comprises the following steps:
step 1: expanding the sample of the original aviation text data based on the text enhancement algorithm of the entity core EODA to obtain an expanded sample, and forming unlabeled data by the sample of the original aviation text data and the expanded sample; the method comprises the following steps:
distinguishing non-entity words and entity words in a sample of original aviation text data by using an entity recognition model, and then respectively enhancing the non-entity words and the entity words; the entity recognition model uses a probabilistic graphical model as the named entity recognition model, and conditioned on the aviation text content X in the entity data set, the conditional probability distribution of the entity category Y is expressed as P(Y|X); in the undirected graph G = (V, E), the random variables Y_v obey the Markov property, and the conditional probability distribution P(Y|X) is called a conditional random field, as follows:

P(Y_v | X, Y_w, w ≠ v) = P(Y_v | X, Y_w, w ~ v);

wherein: P represents a conditional probability distribution; X represents the aviation text content in the entity data set; Y_v and Y_w represent the random variables corresponding to vertices v and w, respectively; w ~ v denotes that there is an edge connecting vertices v and w in the undirected graph G; w ≠ v denotes all vertices w other than v;
setting the aviation text content X and the entity category Y in the entity data set to have the same graph structure, the entity recognition task is realized through a linear-chain conditional random field; given an observation sequence (X_1, X_2, ..., X_n), the conditional probability of its state sequence is as follows:

P(y|x) = (1/Z(x)) exp( Σ_{i,k} λ_k t_k(y_{i-1}, y_i, x, i) + Σ_{i,l} μ_l s_l(y_i, x, i) );

wherein: P(y|x) represents the conditional probability of the state sequence; Z(x) represents the normalization factor; λ_k represents the weight coefficient of the transition feature function t_k; t_k(y_{i-1}, y_i, x, i) represents the transition feature function on edge E; μ_l represents the weight coefficient of the state feature function s_l; s_l(y_i, x, i) represents the state feature function at vertex V; y_{i-1} and y_i represent the tags corresponding to the inputs X_{i-1} and X_i, respectively; y represents the full tag sequence corresponding to the input X; the transition feature function t_k and the state feature function s_l are both position-dependent local feature functions;
the optimization objective of the entity recognition model is to maximize the likelihood probability, using the log-likelihood as follows:

log P(y*|X) = score(X, y*) - log Σ_{y ∈ Y(X)} exp( score(X, y) );

wherein: P(y*|X) represents the likelihood probability to be maximized; score(X, y) represents the score of the tag sequence y for the input X; Y(X) represents the set of candidate tag sequences for the input X; y* represents the true tag sequence corresponding to the input X;
step 2: screening a target sample from unlabeled data based on sample screening of an active learning model;
based on the active learning model, the uncertainty sample query strategy and the version-space-reduction sample query strategy are combined to establish a sample query strategy based on the lowest word-level confidence; the expression of this strategy is built from the following quantities: y_1, y_2, y_{m-1} and y_m, the 1st, 2nd, (m-1)-th and m-th tag sequences, respectively; m, the number of tag sequences; score(t), the score corresponding to time t; P_t, the score vector corresponding to time t; and p_1, p_2 and p_m, the 1st, 2nd and m-th score vector parameters, respectively;
screening target samples from unlabeled data according to a sample query strategy based on the lowest confidence of word level;
step 3: establishing an aviation text labeling model based on information extraction, and realizing any aviation text data labeling;
unlabeled data is obtained using step 1, and screened samples are obtained using step 2; the distinguishing difficulty of the screened samples is judged, and the judgment result is fed back to the entity recognition model and the active learning model, realizing iterative updating of the parameters of the sample query strategy expression based on the lowest word-level confidence in the entity recognition model and the active learning model; the process then returns to step 1 and continues in a loop until the iteration reaches a designated number of rounds or a target value, establishing the aviation text labeling model based on information extraction, which comprises a certain amount of labeled data sets together with the entity recognition model and the active learning model after parameter optimization; inputting new aviation text data into the aviation text labeling model based on information extraction then realizes labeling of arbitrary aviation text data.
In a preferred embodiment, the non-entity word part enhancement in step 1 specifically comprises four operations: synonym replacement, random insertion, random swap, and random deletion, specifically as follows: the synonym replacement randomly replaces tokens in the non-entity word fragments of the aviation text to be labeled with synonyms; for a candidate word needing synonym replacement, the synonym is selected from dictionary data processed in advance, or can be derived from a language model by searching for neighboring vocabulary representations in the embedding space of the word vectors, the candidate synonym obtained in this way then replacing the token at the original position in the text; the random insertion is used to prevent the model from over-fitting and to improve robustness, randomly inserting words into the non-entity word fragments of the aviation text to be labeled, the inserted words coming from the non-entity words in the sample's word segmentation result or from a Chinese stop-word list; the random swap randomly exchanges two words within a non-entity word fragment of the aviation text to be labeled; and the random deletion randomly deletes words from the non-entity word fragments of the aviation text to be labeled.
In another preferred embodiment, the entity word part enhancement in step 1 specifically comprises four operations: entity word replacement, word embedding replacement, phrase shift, and phrase generation, as follows: the entity word replacement works like same-label synonym replacement; when performing entity word replacement, candidate words are randomly taken from the labeled entity word list and substituted at the original position in the aviation text to be labeled; the word embedding replacement replaces the word embedding of an entity in a sample with a random vector with a certain probability, improving the model's ability to learn template slots for the vocabulary according to the context content; the phrase shift randomly re-splices the multiple sentences of the same sample at comma and period separators, enlarging the long-distance context information of the sample; the phrase generation randomly selects phrases containing at least one entity word and splices them into new samples, improving entity recognition performance on short aviation texts.
Further, the active learning model in step 2 is specifically: the active learning model comprises a learning engine and a selection engine, with the BERT-BiLSTM-CRF deep learning model used as the working reference model in both engines; the focus of the BERT model is not limited to the information before or after a word: the multi-layer bidirectional Transformer encoder in its structure removes the limitation of fusing context information in only one direction through the bidirectional self-attention mechanism, and when pre-trained with the model construction mode that combines the bidirectional Transformer structure with MLM, it can generate deep bidirectional language representations that fuse the context information; a bidirectional hidden state sequence is obtained through the BiLSTM layer, the posterior probability of the output sequence is obtained through the CRF layer, and this posterior probability is applied to the confidence calculation of unlabeled samples during querying to measure sample uncertainty.
Preferably, the uncertainty sample query strategy in step 2 includes:
minimum confidence strategy: for each sample, the active learning model predicts the sample's scores under all categories of the system, each score being a probability value between 0 and 1; ranking from high to low, the category with the highest score is taken as the sample's predicted category label, specifically as follows:

y* = argmax_y P_θ(y|x), x*_LC = argmin_x P_θ(y*|x);

wherein: y* represents the predicted category label of a sample; argmax represents taking the maximum value; P_θ(y*|x) represents the score of the sample; argmin represents taking the minimum value; x*_LC represents the selected lowest-confidence sample;
boundary sampling strategy: boundary sampling needs to select the samples most easily confused between two categories, i.e., those whose probability scores for the two categories are close in the model's prediction results; the boundary sampling strategy focuses on two targets, the maximum category score and the second-largest category score, and finally screens out of the batch the sample with the smallest difference between the two, specifically as follows:

x*_M = argmin_x ( P_θ(ŷ_1|x) - P_θ(ŷ_2|x) );

wherein: x*_M represents the sample with the smallest score difference; P_θ(ŷ_1|x) represents the maximum category score; P_θ(ŷ_2|x) represents the second-largest category score; and
maximum entropy strategy: the concept of entropy comes from information theory and is used to measure the uncertainty of a system; the larger the entropy value, the greater the model's uncertainty about the sample's category prediction, specifically as follows:

x*_H = argmax_x ( - Σ_i P_θ(y_i|x) log P_θ(y_i|x) );

wherein: x*_H represents the sample whose category prediction is most uncertain; P_θ(y_i|x) represents the sample's score for category y_i.
It is further preferable that the version-space-reduction sample query strategy in step 2 is: selecting a portion of samples from the unlabeled set such that, once selected and used for model training, they reduce the current version space, and finally choosing the instances on which the individual models disagree most.
The application provides an aviation text labeling system applying the above aviation text data labeling method, comprising a text enhancement algorithm module based on the entity core, an entity recognition algorithm module, and an active learning algorithm module, forming a labeling flow for the entity recognition task within the information extraction task; the organization architecture of the aviation text labeling system comprises a basic service layer, a business logic layer, and an application layer; data labeling of the original aviation text is realized through the aviation text labeling system, and the text enhancement algorithm module is used for enhancing the non-entity words and entity words and expanding the samples;
the entity recognition algorithm module is used for executing an entity recognition model to realize distinguishing non-entity words from entity words in a sample of original aviation text data;
the active learning algorithm module is used for executing an active learning model, combining an uncertainty sample query strategy and a version space reduction sample query strategy, establishing a sample query strategy based on the lowest confidence of word level, and realizing sample screening; judging the distinguishing degree of the screened samples;
inputting aviation text data into the entity recognition algorithm module to recognize non-entity words and entity words; inputting the recognition result into the text enhancement algorithm module, which enhances the non-entity words and entity words to obtain unlabeled data; inputting the unlabeled data into the active learning algorithm module for screening to obtain screened samples; and judging the distinguishing degree of the screened samples and feeding the judgment result back to the entity recognition algorithm module and the active learning algorithm module, realizing iterative updating of the parameters of the sample query strategy expression based on the lowest word-level confidence in the entity recognition model and the active learning model.
Compared with the prior art, the application has the beneficial effects that:
(1) Aiming at low-resource scenes, the application provides a data enhancement method based on the entity core EODA that expands the number of samples; experimental comparison of multiple models on named entity recognition and relation extraction tasks shows that it outperforms the effect gain brought by the traditional EDA method in data enhancement work;
(2) Aiming at multi-resource scenes, the relation extraction part uses maximum entropy, minimum confidence, and boundary sampling strategies through the query strategy based on the lowest word-level confidence;
(3) Under the framework of the two types of active learning, entity recognition and relation extraction tasks can be improved by more than 30%, and the model converges faster.
Drawings
FIG. 1 is a flow chart of the method for labeling aviation text data according to the present application;
FIG. 2 is a diagram of an active operation framework of the present application;
FIG. 3 is a frame diagram of the BERT-BiLSTM-CRF model of the present application;
FIG. 4 is a block diagram of one embodiment of an aeronautical text data labeling system of the present application;
FIG. 5 is a flow chart of an aeronautical text labeling system in accordance with a preferred embodiment of the present application.
Detailed Description
Hereinafter, embodiments of the present application will be described with reference to the drawings.
According to the embodiment of the application, the data enhancement method based on the entity core EODA and the query strategy based on the lowest word-level confidence are fused, the number of samples is expanded, and experimental comparison of multiple models on named entity recognition and relation extraction tasks shows that the labeling efficiency improves by more than 30% while the models converge faster. The overall labeling efficiency and the intelligence level of the labeling system are improved, better serving information-extraction-oriented aviation text data labeling. Fig. 1 is a flow chart of the aviation text data labeling method according to an embodiment of the application.
The embodiment of the application provides an aviation text data labeling method; Fig. 2 shows the active operation framework of the embodiment of the application. To demonstrate the applicability of the application, it is applied to an example comprising the following steps:
s1: expanding the sample of the original aviation text data based on the text enhancement algorithm of the entity core EODA to obtain an expanded sample, and forming unlabeled data by the sample of the original aviation text data and the expanded sample;
the expanded sample can be used for training, such as entity recognition or extracting a model, enhancing the model capacity, and can be used as a candidate sample for an active learning part to perform.
The non-entity word part enhancement operation specifically comprises: synonym replacement, random insertion, random swap, and random deletion, as follows.
The synonym replacement randomly replaces non-entity word fragments in the aviation text to be labeled with synonyms; the candidate word is selected from dictionary data processed in advance, or derived from a language model by searching for neighboring vocabulary representations in the embedding space of the word vectors; the candidate word obtained in this way then replaces the token at the original position in the text.
The random insertion is used to prevent the model from over-fitting and to improve robustness, randomly inserting words into non-entity word fragments of the aviation text to be labeled; the inserted words come from the non-entity words in the sample's word segmentation result or from a Chinese stop-word list.
The random swap randomly exchanges two words within a non-entity word fragment of the aviation text to be labeled.
The random deletion randomly deletes words from the non-entity word fragments of the aviation text to be labeled.
Examples of non-entity word enhancement are shown in Table 1 below:
Table 1: EODA aviation text data enhancement examples
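As an illustrative aid, the following minimal Python sketch shows one way the four non-entity word operations can be realized; the function and variable names are assumptions chosen for readability, and the synonym dictionary and stop-word vocabulary are supplied by the caller.

```python
import random

def synonym_replace(tokens, synonyms, n=1):
    # replace up to n tokens that have an entry in a prepared synonym dictionary
    out = tokens[:]
    candidates = [i for i, t in enumerate(out) if t in synonyms]
    for i in random.sample(candidates, min(n, len(candidates))):
        out[i] = random.choice(synonyms[out[i]])
    return out

def random_insert(tokens, vocab, n=1):
    # insert words drawn from the sample's non-entity words or a Chinese stop-word list
    out = tokens[:]
    for _ in range(n):
        out.insert(random.randrange(len(out) + 1), random.choice(vocab))
    return out

def random_swap(tokens, n=1):
    # exchange two randomly chosen tokens inside the non-entity fragment
    out = tokens[:]
    if len(out) < 2:
        return out
    for _ in range(n):
        i, j = random.sample(range(len(out)), 2)
        out[i], out[j] = out[j], out[i]
    return out

def random_delete(tokens, p=0.1):
    # delete tokens with probability p, keeping at least one token
    out = [t for t in tokens if random.random() > p]
    return out or [random.choice(tokens)]
```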
The entity word part enhancement operation specifically comprises four steps: entity word replacement, word embedding replacement, phrase shift, and phrase generation, as follows.
The entity word replacement is similar to same-label synonym replacement, but the vocabulary comes mainly from the entity word list of the labeled corpus; when performing entity word replacement, candidate words are randomly taken from the labeled entity word list and substituted at the original position of the aviation text to be labeled.
The word embedding replacement replaces the word embedding of an entity in a sample with a random vector with a certain probability, and is mainly used to improve the ability to learn template slots for the vocabulary according to the context content.
The phrase shift randomly re-splices the multiple sentences of the same sample at comma and period separators to enrich the long-distance context information of the sample.
The phrase generation randomly selects phrases containing at least one entity word and splices them with other samples to generate new samples, improving recognition performance on short aviation texts that lack context information.
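The entity word operations can be sketched in the same spirit; the names and data layout here (character-level tokens, a label-to-lexicon mapping) are assumptions of this sketch rather than details fixed by the application.

```python
import random
import re
import numpy as np

def entity_word_replace(tokens, spans, lexicon, p=0.3):
    # spans: (start, end, label) over character tokens; lexicon: label -> entity strings
    out, offset = tokens[:], 0
    for start, end, label in spans:
        if random.random() < p and lexicon.get(label):
            repl = list(random.choice(lexicon[label]))  # same-label candidate entity
            out[start + offset:end + offset] = repl
            offset += len(repl) - (end - start)
    return out

def embedding_replace(emb, entity_positions, p=0.15):
    # swap an entity token's embedding for a random vector with probability p
    out = emb.copy()
    for i in entity_positions:
        if np.random.rand() < p:
            out[i] = np.random.normal(0.0, 0.1, size=out.shape[1])
    return out

def phrase_shift(text):
    # re-splice the clauses of one sample at comma/period separators
    parts = [s for s in re.split("[,，。.]", text) if s]
    random.shuffle(parts)
    return "，".join(parts)

def phrase_generate(entity_clauses, k=2):
    # splice k clauses that each contain at least one entity word into a new sample
    return "，".join(random.sample(entity_clauses, min(k, len(entity_clauses))))
```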
In this embodiment, an entity recognition model is used to distinguish non-entity words from entity words in a sample of original aviation text data, where the entity recognition model is:
using a probabilistic graphical model as the named entity recognition model, conditioned on the aviation text content X in the entity data set, the conditional probability distribution of the entity category Y is denoted P(Y|X); in the undirected graph G = (V, E), the random variables Y_v obey the Markov property, and the conditional probability distribution P(Y|X) is called a conditional random field, as follows:

P(Y_v | X, Y_w, w ≠ v) = P(Y_v | X, Y_w, w ~ v);

wherein: P represents a conditional probability distribution; X represents the aviation text content in the entity data set; Y_v and Y_w represent the random variables corresponding to vertices v and w, respectively; w ~ v denotes that there is an edge connecting vertices v and w in the undirected graph G; w ≠ v denotes all vertices w other than v.
Setting the aviation text content X and the entity category Y in the entity data set to have the same graph structure, the entity recognition task is realized through a linear-chain conditional random field; given an observation sequence (X_1, X_2, ..., X_n), the conditional probability of its state sequence is as follows:

P(y|x) = (1/Z(x)) exp( Σ_{i,k} λ_k t_k(y_{i-1}, y_i, x, i) + Σ_{i,l} μ_l s_l(y_i, x, i) );

wherein: P(y|x) represents the conditional probability of the state sequence; Z(x) represents the normalization factor; λ_k represents the weight coefficient of the transition feature function t_k; t_k(y_{i-1}, y_i, x, i) represents the transition feature function on edge E; μ_l represents the weight coefficient of the state feature function s_l; s_l(y_i, x, i) represents the state feature function at vertex V; y_{i-1} and y_i represent the tags corresponding to the inputs X_{i-1} and X_i, respectively; y represents the full tag sequence corresponding to the input X; the transition feature function t_k and the state feature function s_l are both position-dependent local feature functions.
The optimization objective of the entity recognition model is to maximize the likelihood probability, using the log-likelihood as follows:

log P(y*|X) = score(X, y*) - log Σ_{y ∈ Y(X)} exp( score(X, y) );

wherein: P(y*|X) represents the likelihood probability to be maximized; score(X, y) represents the score of the tag sequence y for the input X; Y(X) represents the set of candidate tag sequences for the input X; y* represents the true tag sequence corresponding to the input X.
S2: screening a target sample from unlabeled data based on sample screening of an active learning model;
based on an active learning model, combining an uncertainty sample query strategy and a version space reduction sample query strategy, establishing a sample query strategy based on the lowest confidence of a word level, and screening a target sample from unlabeled data by using the sample query strategy based on the lowest confidence of the word level;
the active learning model specifically comprises the following steps:
The core of the active learning model is to construct a learning engine and a selection engine; the application uses the BERT-BiLSTM-CRF deep learning model as the working reference model in both engines. The focus of the BERT model is not limited to the information before or after a word: the multi-layer bidirectional Transformer encoder in its structure removes the limitation of fusing context information in only one direction through the bidirectional self-attention mechanism, and when pre-trained with the model construction mode that combines the bidirectional Transformer structure with MLM, it can generate deep bidirectional language representations that fuse the context information. A bidirectional hidden state sequence is obtained through the BiLSTM layer, the posterior probability of the output sequence is obtained through the CRF layer, and this posterior probability is applied to the confidence calculation of unlabeled samples during querying to measure sample uncertainty. FIG. 3 is a diagram of the BERT-BiLSTM-CRF model framework according to an embodiment of the present application.
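A compact sketch of this reference model is given below; it assumes the Hugging Face transformers and pytorch-crf packages and a Chinese BERT checkpoint, none of which are specified by the application itself.

```python
import torch.nn as nn
from transformers import BertModel
from torchcrf import CRF  # pytorch-crf package (an assumption of this sketch)

class BertBiLstmCrf(nn.Module):
    def __init__(self, num_tags, bert_name="bert-base-chinese", hidden=256):
        super().__init__()
        self.bert = BertModel.from_pretrained(bert_name)
        self.lstm = nn.LSTM(self.bert.config.hidden_size, hidden // 2,
                            batch_first=True, bidirectional=True)
        self.fc = nn.Linear(hidden, num_tags)
        self.crf = CRF(num_tags, batch_first=True)

    def forward(self, input_ids, attention_mask, tags=None):
        x = self.bert(input_ids, attention_mask=attention_mask).last_hidden_state
        x, _ = self.lstm(x)              # bidirectional hidden state sequence
        emissions = self.fc(x)           # per-token tag scores fed to the CRF layer
        mask = attention_mask.bool()
        if tags is not None:             # training: negative log-likelihood loss
            return -self.crf(emissions, tags, mask=mask)
        return self.crf.decode(emissions, mask=mask)  # inference: best tag paths
```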
The sample query strategy combines the uncertainty sample query strategy with the version-space-reduction sample query strategy; the uncertainty sample query strategy mainly comprises the following:
Minimum confidence strategy: for each sample, the model predicts the sample's scores under all categories of the system, each score being a probability value between 0 and 1; ranking from high to low, the category with the highest score is taken as the sample's predicted category label, specifically as follows:

y* = argmax_y P_θ(y|x), x*_LC = argmin_x P_θ(y*|x);

wherein: y* represents the predicted category label of a sample; argmax represents taking the maximum value; P_θ(y*|x) represents the score of the sample; argmin represents taking the minimum value; x*_LC represents the selected lowest-confidence sample.
Boundary sampling strategy: boundary sampling needs to select the samples most easily confused between two categories, i.e., those whose probability scores for the two categories are close in the model's prediction results; the boundary sampling strategy focuses on two targets, the maximum category score and the second-largest category score, and finally screens out of the batch the sample with the smallest difference between the two, specifically as follows:

x*_M = argmin_x ( P_θ(ŷ_1|x) - P_θ(ŷ_2|x) );

wherein: x*_M represents the sample with the smallest score difference; P_θ(ŷ_1|x) represents the maximum category score; P_θ(ŷ_2|x) represents the second-largest category score.
Maximum entropy strategy: the concept of entropy comes from information theory and is used to measure the uncertainty of a system; the larger the entropy value, the greater the model's uncertainty about the sample's category prediction, specifically as follows:

x*_H = argmax_x ( - Σ_i P_θ(y_i|x) log P_θ(y_i|x) );

wherein: x*_H represents the sample whose category prediction is most uncertain; P_θ(y_i|x) represents the sample's score for category y_i.
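The three uncertainty strategies reduce to a few lines over a matrix of model scores; the sketch below is illustrative, with `probs` assumed to hold one row of class scores in [0, 1] per candidate sample.

```python
import numpy as np

def least_confidence(probs):
    # pick the sample whose top-scoring class has the lowest score
    return int(np.argmin(probs.max(axis=1)))

def margin_sampling(probs):
    # pick the sample with the smallest gap between best and second-best class
    ordered = np.sort(probs, axis=1)
    return int(np.argmin(ordered[:, -1] - ordered[:, -2]))

def max_entropy(probs, eps=1e-12):
    # pick the sample whose predictive distribution has the highest entropy
    entropy = -(probs * np.log(probs + eps)).sum(axis=1)
    return int(np.argmax(entropy))
```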
The version-space-reduction sample query strategy sorts out a portion of samples from the unlabeled set, screening those that reduce the current version space the most after being used for model training, and finally selects the instances on which the individual models disagree most; the query-by-committee method is representative of this strategy.
The main working mechanism of the committee is: n reference models are trained with the labeled training set in the database, each working independently, and a voting committee is established; based on the Query-By-Committee (QBC) method, several models with the same structure are trained with the same training set, the models vote to select disputed samples, the disputed samples are labeled, the models are retrained, and the iteration repeats:

C = {θ^(1), ..., θ^(n)};

wherein: C represents the voting committee; θ^(1) represents the 1st reference model; θ^(n) represents the n-th reference model.
Each trained reference model votes on the unlabeled instances; the instances with large disputes and inconsistent decisions are selected for stricter labeling and finally added to the labeled training set for the next round of model learning.
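One common way to score committee disagreement is vote entropy; the sketch below is a hedged illustration of that idea, not a construction fixed by the application.

```python
import numpy as np
from collections import Counter

def vote_entropy(committee_preds):
    # committee_preds: (n_models, n_samples) labels voted by C = {theta(1..n)}
    n_models, n_samples = committee_preds.shape
    ent = np.zeros(n_samples)
    for j in range(n_samples):
        votes = Counter(committee_preds[:, j])
        p = np.array(list(votes.values()), dtype=float) / n_models
        ent[j] = -(p * np.log(p)).sum()
    return ent  # the most disputed sample is np.argmax(ent)
```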
A sample query strategy based on the lowest word-level confidence is established; its expression is built from the following quantities: y_1, y_2, y_{m-1} and y_m, the 1st, 2nd, (m-1)-th and m-th tag sequences, respectively; m, the number of tag sequences; score(t), the score corresponding to time t; P_t, the score vector corresponding to time t; and p_1, p_2 and p_m, the 1st, 2nd and m-th score vector parameters, respectively.
Samples are screened from the unlabeled data with the sample query strategy based on the lowest word-level confidence; the distinguishing degree among the screened samples is then judged, and the judgment result is fed back again to the entity recognition model and the active learning model, forming multiple iterations;
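A hedged illustration of one plausible word-level lowest-confidence score is sketched below: each sample is ranked by the confidence of its least confident token. This is an assumption for illustration; the exact scoring formula used by the application may differ.

```python
import numpy as np

def word_level_lowest_confidence(token_probs):
    # token_probs: per sample, an array of shape (seq_len, n_tags) of score vectors P_t
    scores = [probs.max(axis=1).min() for probs in token_probs]
    return np.argsort(scores)  # sample indices, lowest word-level confidence first
```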
s3: establishing an aviation text labeling model based on information extraction, and realizing any aviation text data labeling;
Based on the theory of step 1 and step 2, the aviation text labeling model based on information extraction is established, forming an iterative updating scheme. First, the entity recognition model is pre-trained, using the data set obtained in step 1 (other data sets may also be used) to realize coarse training of the entity recognition model. The data set obtained in step 1 is screened, then scored and sorted by the active learning model; samples that are difficult for the models to distinguish are judged against the labeled standard, the screened samples are evaluated, and the judgment results are fed back to the extraction model and the active learning model so as to iteratively update the model parameters. The final optimal effect is reached through repeated iteration, the result comprising a certain amount of labeled data sets together with the extraction model and the active learning model after parameter optimization.
And (3) establishing an aviation text labeling model based on information extraction based on the S1 and the S2 to form a set of labeling flow with a complete life cycle.
The aviation text labeling system based on information extraction comprises a data enhancement algorithm, an entity recognition algorithm, and an active learning algorithm; the system organization framework supported by these core algorithms is divided into three layers: a basic service layer, a business logic layer, and an application layer. A system architecture diagram of an embodiment of the application is shown in Fig. 4.
The text enhancement algorithm module is used for enhancing the non-entity words and the entity words and expanding the sample;
the entity recognition algorithm module is used for executing an entity recognition model to realize distinguishing non-entity words from entity words in a sample of original aviation text data;
the active learning algorithm module is used for executing an active learning model, combining an uncertainty sample query strategy and a version space reduction sample query strategy, establishing a sample query strategy based on the lowest confidence of word level, and realizing sample screening; judging the distinguishing degree of the screened samples;
inputting aviation text data into the entity recognition algorithm module to recognize non-entity words and entity words; inputting the recognition result into the text enhancement algorithm module, which enhances the non-entity words and entity words to obtain unlabeled data; inputting the unlabeled data into the active learning algorithm module for screening to obtain screened samples; and judging the distinguishing degree of the screened samples and feeding the judgment result back to the entity recognition algorithm module and the active learning algorithm module, realizing iterative updating of the parameters of the sample query strategy expression based on the lowest word-level confidence in the entity recognition model and the active learning model.
And finally, the data labeling of the original aviation text is realized through the aviation text labeling system. A flow chart of the aviation text labeling system according to a preferred embodiment of the application is shown in Fig. 5. In this embodiment, in combination with the prior art, the task categories of labeling items are classified as entity recognition or relation extraction; for entity recognition, the auxiliary labeling scheme provides two types based on active learning and data enhancement, giving users more choices. The system is used according to the following steps:
s31: the task category of the labeling item is selected, and the selection item is entity identification or relation extraction.
S32: uploading the knowledge system of the task; if entity recognition is selected, the entity types are set; if relation extraction is selected, the categories of the subject and object entities of each triplet and the relation indicator between them need to be set.
S33: selecting an auxiliary labeling scheme, taking entity identification as an example, providing two types of auxiliary labeling schemes based on active learning and data enhancement, and selecting according to actual scene requirements; training iteration rounds and sample selection strategy candidates are provided in the active learning scheme, and gain coefficients and word operation proportions are provided in the data enhancement scheme.
S34: the initialization of the labeling task is finished, the original aviation text data is uploaded, and the aviation text labeling system automatically completes the preprocessing, denoising and formatting of the corresponding task.
S35: the aviation text labeling system automatically distributes labeling tasks according to the size of the aviation text data, the back-end model synchronously monitors the number of labeled aviation text items, and an administrator starts model training according to the actual labeling scene to assist the labeling process.
S36: and (5) finishing marking the aviation text data, and deriving a marked aviation text data set.
The EODA method follows the principle that entity categories in the samples remain unchanged before and after enhancement and avoids damaging the semantics of the original samples as much as possible, performing text enhancement by introducing reasonable noise; this balances the differences in sample-category counts, completes sample expansion, and improves model performance in an effective, low-cost way. The entity recognition algorithm based on active learning learns a model from a small number of labeled instances as the initial training set; all unlabeled instances are randomly divided into several batches of query sets, an optimal batch of instances is selected from the current batch through the query strategy for stricter labeling, the labeled instances are fed to the model in the learning engine for training, and the updated model acts on the sample query again. This reciprocating iteration accelerates the convergence of the information extraction model at the same labeled-data scale and yields better performance.
In conclusion, the prediction results of this case prove that it achieves a very good effect.
(1) Aiming at low-resource scenes, the embodiment of the application provides a data enhancement method based on the entity core EODA, completes expansion of the number of samples, and performs experimental comparison of multiple models on named entity recognition and relation extraction tasks. Aiming at multi-resource scenes, the relation extraction part uses maximum entropy, minimum confidence, and boundary sampling strategies through the query strategy based on the lowest word-level confidence. Under the framework of the two types of active learning, entity recognition and relation extraction tasks can be improved by more than 30%, and the model converges faster.
(2) The embodiment of the application applies the ideas of data enhancement and active learning to an actual labeling system; through the effective fusion of algorithms and techniques, the overall labeling efficiency and the intelligence level of the labeling system are improved, better serving information extraction and aviation text data labeling.
The above examples are only illustrative of the preferred embodiments of the present application and are not intended to limit the scope of the present application, and various modifications and improvements made by those skilled in the art to the technical solution of the present application should fall within the scope of protection defined by the claims of the present application without departing from the spirit of the present application.
Claims (5)
1. The aviation text data labeling method is characterized by comprising the following steps of:
step 1: expanding the sample of the original aviation text data based on the text enhancement algorithm of the entity core EODA to obtain an expanded sample, and forming unlabeled data by the sample of the original aviation text data and the expanded sample; the method comprises the following steps:
distinguishing non-entity words and entity words in a sample of original aviation text data by using an entity recognition model, and then respectively enhancing the non-entity words and the entity words; the entity recognition model uses a probabilistic graphical model as the named entity recognition model, and conditioned on the aviation text content X in the entity data set, the conditional probability distribution of the entity category Y is expressed as P(Y|X); in the undirected graph G = (V, E), the random variables Y_v obey the Markov property, and the conditional probability distribution P(Y|X) is called a conditional random field, as follows:

P(Y_v | X, Y_w, w ≠ v) = P(Y_v | X, Y_w, w ~ v);

wherein: P represents a conditional probability distribution; X represents the aviation text content in the entity data set; Y_v and Y_w represent the random variables corresponding to vertices v and w, respectively; w ~ v denotes that there is an edge connecting vertices v and w in the undirected graph G; w ≠ v denotes all vertices w other than v;
setting the aviation text content X and the entity category Y in the entity data set to have the same graph structure, the entity recognition task is realized through a linear-chain conditional random field; given an observation sequence (X_1, X_2, ..., X_n), the conditional probability of its state sequence is as follows:

P(y|x) = (1/Z(x)) exp( Σ_{i,k} λ_k t_k(y_{i-1}, y_i, x, i) + Σ_{i,l} μ_l s_l(y_i, x, i) );

wherein: P(y|x) represents the conditional probability of the state sequence; Z(x) represents the normalization factor; λ_k represents the weight coefficient of the transition feature function t_k; t_k(y_{i-1}, y_i, x, i) represents the transition feature function on edge E; μ_l represents the weight coefficient of the state feature function s_l; s_l(y_i, x, i) represents the state feature function at vertex V; y_{i-1} and y_i represent the tags corresponding to the inputs X_{i-1} and X_i, respectively; y represents the full tag sequence corresponding to the input X; the transition feature function t_k and the state feature function s_l are both position-dependent local feature functions;
the optimization objective of the entity recognition model is to maximize the likelihood probability, using the log-likelihood as follows:

log P(y*|X) = score(X, y*) - log Σ_{y ∈ Y(X)} exp( score(X, y) );

wherein: P(y*|X) represents the likelihood probability to be maximized; score(X, y) represents the score of the tag sequence y for the input X; Y(X) represents the set of candidate tag sequences for the input X; y* represents the true tag sequence corresponding to the input X;
step 2: screening a target sample from unlabeled data based on sample screening of an active learning model;
based on the active learning model, combining the uncertainty sample query strategy and the version space reduction sample query strategy, establishing a sample query strategy based on the lowest confidence of word level,
the uncertainty sample query strategy includes:
minimum confidence strategy: for each sample, the active learning model predicts the sample's scores under all categories of the system, each score being a probability value between 0 and 1; ranking from high to low, the category with the highest score is taken as the sample's predicted category label, specifically as follows:

y* = argmax_y P_θ(y|x), x*_LC = argmin_x P_θ(y*|x);

wherein: y* represents the predicted category label of a sample; argmax represents taking the maximum value; P_θ(y*|x) represents the score of the sample; argmin represents taking the minimum value; x*_LC represents the selected lowest-confidence sample;
boundary sampling strategy: boundary sampling needs to select the samples most easily confused between two categories, i.e., those whose probability scores for the two categories are close in the model's prediction results; the boundary sampling strategy focuses on two targets, the maximum category score and the second-largest category score, and finally screens out of the batch the sample with the smallest difference between the two, specifically as follows:

x*_M = argmin_x ( P_θ(ŷ_1|x) - P_θ(ŷ_2|x) );

wherein: x*_M represents the sample with the smallest score difference; P_θ(ŷ_1|x) represents the maximum category score; P_θ(ŷ_2|x) represents the second-largest category score; and
maximum entropy strategy: the concept of entropy comes from information theory and is used to measure the uncertainty of a system; the larger the entropy value, the greater the model's uncertainty about the sample's category prediction, specifically as follows:

x*_H = argmax_x ( - Σ_i P_θ(y_i|x) log P_θ(y_i|x) );

wherein: x*_H represents the sample whose category prediction is most uncertain; P_θ(y_i|x) represents the sample's score for category y_i;
the version-space-reduction sample query strategy is: selecting a portion of samples from the unlabeled set such that, once selected and used for model training, they reduce the current version space, and finally choosing the instances on which the individual models disagree most;
the sample query strategy expression based on the lowest word-level confidence is built from the following quantities: y_1, y_2, y_{m-1} and y_m, the 1st, 2nd, (m-1)-th and m-th tag sequences, respectively; m, the number of tag sequences; score(t), the score corresponding to time t; P_t, the score vector corresponding to time t; and p_1, p_2 and p_m, the 1st, 2nd and m-th score vector parameters, respectively;
screening target samples from unlabeled data according to a sample query strategy based on the lowest confidence of word level;
step 3: establishing an aviation text labeling model based on information extraction, and realizing any aviation text data labeling;
obtaining unlabeled data by using step 1 and screened samples by using step 2; judging the distinguishing difficulty of the screened samples and feeding the judgment result back to the entity recognition model and the active learning model, realizing iterative updating of the parameters of the sample query strategy expression based on the lowest word-level confidence in the entity recognition model and the active learning model; returning to step 1 and continuing the operation in a loop until the iteration reaches a designated number of rounds or a target value, and establishing the aviation text labeling model based on information extraction, which comprises a certain amount of labeled data sets together with the entity recognition model and the active learning model after parameter optimization;
and inputting the new aviation text data into an aviation text labeling model based on information extraction, so as to label any aviation text data.
2. The method for labeling aviation text data according to claim 1, wherein the non-entity word part enhancement in step 1 specifically comprises four operations: synonym replacement, random insertion, random swap, and random deletion, specifically as follows:
the synonym replacement randomly replaces tokens in the non-entity word fragments of the aviation text to be labeled with synonyms; for a candidate word needing synonym replacement, the synonym is selected from dictionary data processed in advance, or can be derived from a language model by searching for neighboring vocabulary representations in the embedding space of the word vectors, the candidate synonym obtained in this way then replacing the token at the original position in the text;
the random insertion is used to prevent the model from over-fitting and to improve robustness, randomly inserting words into non-entity word fragments of the aviation text to be labeled, the inserted words coming from the non-entity words in the sample's word segmentation result or from a Chinese stop-word list;
the random swap randomly exchanges two words within a non-entity word fragment of the aviation text to be labeled;
and the random deletion randomly deletes words from the non-entity word fragments of the aviation text to be labeled.
3. The method for labeling aviation text data according to claim 1, wherein the entity word part enhancement in step 1 specifically comprises four operations: entity word replacement, word embedding replacement, phrase shift, and phrase generation, as follows:
the entity word replacement works like same-label synonym replacement; when performing entity word replacement, candidate words are randomly taken from the labeled entity word list and substituted at the original position of the aviation text to be labeled;
the word embedding replacement replaces the word embedding of an entity in a sample with a random vector with a certain probability, improving the model's ability to learn template slots for the vocabulary according to the context content;
the phrase shift randomly re-splices the multiple sentences of the same sample at comma and period separators, enlarging the long-distance context information of the sample;
the phrase generation randomly selects phrases containing at least one entity word and splices them to generate new samples, improving entity recognition performance on short aviation texts.
4. The method for labeling aviation text data according to claim 1, wherein the active learning model in step 2 is specifically:
the active learning model comprises a learning engine and a selection engine, with the BERT-BiLSTM-CRF deep learning model used as the working reference model in both engines; the focus of the BERT model is not limited to the information before or after a word: the multi-layer bidirectional Transformer encoder in its structure removes the limitation of fusing context information in only one direction through the bidirectional self-attention mechanism, and when pre-trained with the model construction mode that combines the bidirectional Transformer structure with MLM, it can generate deep bidirectional language representations that fuse the context information; a bidirectional hidden state sequence is obtained through the BiLSTM layer, the posterior probability of the output sequence is obtained through the CRF layer, and this posterior probability is applied to the confidence calculation of unlabeled samples during querying to measure sample uncertainty.
5. An aviation text labeling system applying the aviation text data labeling method according to one of claims 1-4, characterized by comprising a text enhancement algorithm module based on the entity core, an entity recognition algorithm module, and an active learning algorithm module, forming a labeling flow for the entity recognition task within the information extraction task; the organization architecture of the aviation text labeling system comprises a basic service layer, a business logic layer, and an application layer; data labeling of the original aviation text is realized through the aviation text labeling system, wherein:
the text enhancement algorithm module is used for enhancing the non-entity words and the entity words and expanding the sample;
the entity recognition algorithm module is used for executing an entity recognition model to realize distinguishing non-entity words from entity words in a sample of original aviation text data;
the active learning algorithm module is used for executing an active learning model, combining an uncertainty sample query strategy and a version space reduction sample query strategy, establishing a sample query strategy based on the lowest confidence of word level, and realizing sample screening; judging the distinguishing degree of the screened samples;
inputting aviation text data into an entity recognition algorithm module, recognizing non-entity words and entity words, inputting a recognition result into a text enhancement algorithm module, enhancing the non-entity words and the entity words to obtain unlabeled data, and inputting the unlabeled data into an active learning algorithm module for screening to obtain screening samples; judging the distinguishing degree of the screened samples, and feeding back the judging result to the entity recognition algorithm module and the active learning algorithm module to realize iterative updating of the sample query strategy expression parameters based on the minimum confidence coefficient of the word level in the entity recognition model and the active learning model.
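A hedged sketch of the screening score: word-level lowest confidence from the per-token tag posteriors, combined with a simple version-space-reduction proxy (disagreement among a committee of models). The weighting `beta` and the committee form are assumptions, not the patent's exact formulation.

```python
import numpy as np

def lowest_confidence(token_probs):
    """token_probs: (seq_len, num_tags) per-token posteriors from the CRF.
    Word-level lowest confidence: 1 minus the weakest best-tag probability."""
    return 1.0 - float(token_probs.max(axis=1).min())

def version_space_term(committee_paths):
    """Fraction of positions where committee members' decoded paths disagree,
    a proxy for how much labeling this sample would shrink the version space."""
    votes = np.asarray(committee_paths)  # shape: (n_models, seq_len)
    return float(np.mean(votes.min(axis=0) != votes.max(axis=0)))

def query_score(token_probs, committee_paths, beta=0.5):
    """Higher score = more informative; the top-k samples are screened."""
    return (beta * lowest_confidence(token_probs)
            + (1.0 - beta) * version_space_term(committee_paths))
```

In each iteration the unlabeled pool would be ranked by `query_score`, the top-k screened samples judged for distinguishability by annotators, and the judgment fed back to retrain the recognition model and update the query-strategy parameters, as the claim describes.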
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202211706705.6A CN116244445B (en) | 2022-12-29 | 2022-12-29 | Aviation text data labeling method and labeling system thereof |
Publications (2)
Publication Number | Publication Date |
---|---|
CN116244445A CN116244445A (en) | 2023-06-09 |
CN116244445B (en) | 2023-12-12
Family
ID=86626902
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202211706705.6A Active CN116244445B (en) | 2022-12-29 | 2022-12-29 | Aviation text data labeling method and labeling system thereof |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN116244445B (en) |
Families Citing this family (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116776884A (en) * | 2023-06-26 | 2023-09-19 | 中山大学 | Data enhancement method and system for medical named entity recognition |
CN117473096B (en) * | 2023-12-28 | 2024-03-15 | 江西师范大学 | Knowledge point labeling method fusing LATEX labels and model thereof |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113901825A (en) * | 2021-11-22 | 2022-01-07 | 东北大学 | Entity relation joint extraction method and system based on active deep learning |
CN114548102A (en) * | 2020-11-25 | 2022-05-27 | 株式会社理光 | Method and device for labeling sequence of entity text and computer readable storage medium |
CN115039140A (en) * | 2020-08-11 | 2022-09-09 | 辉达公司 | Enhanced object recognition using one or more neural networks |
WO2022222224A1 (en) * | 2021-04-19 | 2022-10-27 | 平安科技(深圳)有限公司 | Deep learning model-based data augmentation method and apparatus, device, and medium |
Family Cites Families (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US11138523B2 (en) * | 2016-07-27 | 2021-10-05 | International Business Machines Corporation | Greedy active learning for reducing labeled data imbalances |
AU2019392537A1 (en) * | 2018-12-03 | 2021-07-01 | Tempus Ai, Inc. | Clinical concept identification, extraction, and prediction system and related methods |
EP3903241A4 (en) * | 2018-12-24 | 2022-09-14 | Roam Analytics, Inc. | Constructing a knowledge graph employing multiple subgraphs and a linking layer including multiple linking nodes |
WO2021003391A1 (en) * | 2019-07-02 | 2021-01-07 | Insurance Services Office, Inc. | Machine learning systems and methods for evaluating sampling bias in deep active classification |
US11436448B2 (en) * | 2019-12-06 | 2022-09-06 | Palo Alto Research Center Incorporated | System and method for differentially private pool-based active learning |
Non-Patent Citations (2)
Title |
---|
LTP: A New Active Learning Strategy for CRF-Based Named Entity Recognition; Tong Zhang et al.; ResearchGate; 1-9 *
Research on Standardizing the Annotation of Equipment Text Corpus Data; Liu Jun et al.; Aviation Standardization & Quality (Issue 06); 38-44 *
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN107729468B (en) | answer extraction method and system based on deep learning | |
CN116244445B (en) | Aviation text data labeling method and labeling system thereof | |
CN110489523B (en) | Fine-grained emotion analysis method based on online shopping evaluation | |
CN113377897B (en) | Multi-language medical term standardization system and method based on deep adversarial learning | |
CN109684928B (en) | Chinese document identification method based on internet retrieval | |
CN110263325A (en) | Chinese automatic word-cut | |
CN113094502B (en) | Multi-granularity takeaway user comment emotion analysis method | |
CN107895000A (en) | A kind of cross-cutting semantic information retrieval method based on convolutional neural networks | |
CN101901213A (en) | Instance-based dynamic generalization coreference resolution method | |
CN113962228A (en) | Long document retrieval method based on semantic fusion of memory network | |
CN113095087B (en) | Chinese word sense disambiguation method based on graph convolution neural network | |
CN111460147A (en) | Title short text classification method based on semantic enhancement | |
CN114611491A (en) | Intelligent government affair public opinion analysis research method based on text mining technology | |
CN110134950A (en) | Text automatic proofreading method combining characters and words | |
CN112884087A (en) | Biological enhancer and identification method for type thereof | |
CN113420766B (en) | Low-resource language OCR method fusing language information | |
CN114048314B (en) | Natural language steganalysis method | |
CN115033753A (en) | Training corpus construction method, text processing method and device | |
CN114996455A (en) | News title short text classification method based on double knowledge maps | |
CN112579583B (en) | Evidence and statement combined extraction method for fact detection | |
CN117371534B (en) | Knowledge graph construction method and system based on BERT | |
CN113486143A (en) | User portrait generation method based on multi-level text representation and model fusion | |
CN118227790A (en) | Text classification method, system, equipment and medium based on multi-label association | |
CN117891948A (en) | Small sample news classification method based on internal knowledge extraction and contrast learning | |
CN111144134A (en) | Translation engine automatic evaluation system based on OpenKiwi |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||