WO2022222224A1 - Deep learning model-based data augmentation method and apparatus, device, and medium - Google Patents
- Publication number
- WO2022222224A1 (PCT/CN2021/096475)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- data
- original
- parameter list
- replacement
- model
- Prior art date
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
- G06F40/295—Named entity recognition
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/3331—Query processing
- G06F16/334—Query execution
- G06F16/3344—Query execution using natural language analysis
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/237—Lexical tools
- G06F40/247—Thesauruses; Synonyms
Definitions
- the present application relates to the field of artificial intelligence, and in particular, to a data enhancement method, apparatus, device and medium based on a deep learning model.
- NER: Named Entity Recognition
- the data enhancement model of the named entity recognition model mainly replaces the entity words in the training data through different data enhancement methods and parameters corresponding to the data enhancement methods.
- the entity words in the training data are subjected to synonym replacement, random insertion, random exchange of positions, and random deletions to increase the scale and diversity of the training data.
- the enhancement effect of the data augmentation model on the training data is inseparable from the model parameters, but the model parameters of the existing data augmentation models are determined by experience or the parameter optimization method of grid search, and the interaction with the named entity recognition model is low. As a result, the data augmentation model has a poor effect on the augmentation of the training data.
- the present application provides a data enhancement method, apparatus, device and medium based on a deep learning model, to solve the problem that the model parameters of existing data enhancement models are determined by experience or by grid-search parameter optimization, resulting in a poor enhancement effect of the data enhancement model on the training data.
- a data augmentation method based on a deep learning model including:
- Use the plurality of training sets to train a plurality of recognition models, and use the original test data as a test set to test the plurality of recognition models, so as to determine whether any of the plurality of recognition models satisfies the convergence condition.
- Data augmentation is performed on the original training data by using the target data augmentation parameter list to obtain a training set of a named entity recognition model.
- a data enhancement device based on a deep learning model comprising:
- an acquisition module used for acquiring the manually marked original training data and original test data, and acquiring the original parameter list, where the original parameter list is composed of the data enhancement method and the enhancement parameters corresponding to the data enhancement method;
- an initialization module for randomly initializing the enhancement parameters in the original parameter list according to the artificial fish swarm algorithm to obtain multiple optimized parameter lists;
- a conversion module configured to convert the original training data using each of the optimized parameter lists to obtain corresponding artificially constructed data, and mix the original training data with the corresponding artificially constructed data to obtain multiple training sets;
- a test module configured to train multiple recognition models using the multiple training sets, and to test the multiple recognition models using the original test data as a test set, so as to determine whether there is a model that satisfies the convergence condition among the multiple recognition models;
- an output module configured to output an optimization parameter list corresponding to the model that satisfies the convergence condition if there is a model that satisfies the convergence condition in the plurality of identification models, as a target data enhancement parameter list;
- An enhancement module configured to perform data enhancement on the original training data by using the target data enhancement parameter list to obtain a training set of a named entity recognition model.
- a computer device comprising a memory, a processor, and computer-readable instructions stored in the memory and executable on the processor, the processor implementing the following steps when executing the computer-readable instructions:
- Use the plurality of training sets to train a plurality of recognition models, and use the original test data as a test set to test the plurality of recognition models, so as to determine whether any of the plurality of recognition models satisfies the convergence condition.
- Data augmentation is performed on the original training data by using the target data augmentation parameter list to obtain a training set of a named entity recognition model.
- One or more readable storage media storing computer-readable instructions that, when executed by one or more processors, cause the one or more processors to perform the following steps:
- Use the plurality of training sets to train a plurality of recognition models, and use the original test data as a test set to test the plurality of recognition models, so as to determine whether any of the plurality of recognition models satisfies the convergence condition.
- Data augmentation is performed on the original training data by using the target data augmentation parameter list to obtain a training set of a named entity recognition model.
- an artificial fish swarm algorithm suitable for the coexistence of discrete values and continuous values is used to randomly initialize the enhancement parameters in the original parameter list, and the recognition effect of the recognition model is taken as the optimization target and integrated into the formulation of the data enhancement strategy. In this way, a data enhancement list with better effect is obtained more efficiently and at a small cost, thereby improving the enhancement effect of the data enhancement list on the data.
- FIG. 1 is a schematic diagram of an application environment of a data enhancement method based on a deep learning model in an embodiment of the present application
- FIG. 2 is a schematic flowchart of a data enhancement method based on a deep learning model in an embodiment of the present application
- FIG. 3 is another schematic flowchart of a data enhancement method based on a deep learning model in an embodiment of the present application
- FIG. 4 is an implementation flowchart of step S30 in FIG. 2;
- FIG. 5 is an implementation flowchart of step S33 in FIG. 4;
- FIG. 6 is another implementation flowchart of step S30 in FIG. 2;
- FIG. 7 is an implementation flowchart of step S50 in FIG. 2;
- FIG. 8 is an implementation flowchart of step S52 in FIG. 7;
- FIG. 9 is a schematic structural diagram of a data enhancement device based on a deep learning model in an embodiment of the present application.
- FIG. 10 is a schematic structural diagram of a computer device in an embodiment of the present application.
- the data enhancement method based on the deep learning model provided by the embodiment of the present application can be applied in the application environment as shown in FIG. 1 , in which the terminal device communicates with the server through the network.
- the server obtains the manually labeled original training data and original test data sent by the user through the terminal device, and obtains the original parameter list sent by the user through the terminal device.
- the original parameter list consists of the data enhancement method and the enhancement parameters corresponding to the data enhancement method.
- according to the artificial fish swarm algorithm, the enhancement parameters in the original parameter list are randomly initialized to obtain multiple optimized parameter lists; each optimized parameter list is used to transform the original training data to obtain the corresponding artificially constructed data, and the original training data is mixed with the corresponding artificially constructed data to obtain multiple training sets.
- the multiple training sets are used to train multiple recognition models, and the original test data is used as the test set to test the multiple recognition models, so as to determine whether there is a model that satisfies the convergence condition among them.
- if such a model exists, the optimized parameter list corresponding to the model that satisfies the convergence condition is output as the target data enhancement parameter list, and the target data enhancement parameter list is used to perform data enhancement on the original training data to obtain the training set of the named entity recognition model.
- the artificial fish swarm algorithm suitable for the coexistence of discrete values and continuous values is used to randomly initialize the enhancement parameters in the original parameter list, and the recognition effect of the recognition model is integrated into the formulation of the data enhancement strategy as the optimization objective, so that a data enhancement list with better effect is obtained at a small cost. This ensures the data diversity of the training set of the named entity recognition model and expands the scale of the training set, which in turn improves the recognition accuracy of the named entity recognition model and realizes artificial-intelligence-based enhancement of the training data for named entity recognition.
- the relevant data used or produced by the deep learning model-based data enhancement method is stored in the database of the server. The database in this embodiment is stored in a blockchain network and is used to store the data used and generated in implementing the deep learning model-based data enhancement method, such as the original training data, the original test data, the original parameter list, the artificially constructed data, the optimized parameter lists and the related data of the multiple recognition models.
- the blockchain referred to in this application is a new application mode of computer technology such as distributed data storage, point-to-point transmission, consensus mechanism, and encryption algorithm.
- a blockchain, essentially a decentralized database, is a series of data blocks linked by cryptographic methods.
- each data block contains a batch of network transaction information, which is used to verify the validity of the information (anti-counterfeiting) and to generate the next block.
- the blockchain can include the underlying platform of the blockchain, the platform product service layer, and the application service layer. Deploying the database on the blockchain can improve the security of data storage.
- the terminal device can be, but is not limited to, various personal computers, notebook computers, smart phones, tablet computers and portable wearable devices.
- the server can be implemented as an independent server or a server cluster composed of multiple servers.
- a data enhancement method based on a deep learning model is provided, and the method is applied to the server in FIG. 1 as an example for description, including the following steps:
- the original parameter list in this embodiment is a data enhancement model.
- the data enhancement model is composed of a data enhancement method and the enhancement parameters corresponding to the data enhancement method.
- the data enhancement performance of the data enhancement model depends on the data enhancement method in the model and the enhancement parameters corresponding to that method. Therefore, before using the data enhancement model, it is necessary to optimize the parameters of the existing data enhancement model to improve its enhancement performance on the training data, so as to ensure the quality of the subsequent training data.
- after obtaining the manually labeled original training data and original test data, and obtaining the original parameter list, the artificial fish swarm algorithm, which converges quickly and is suitable for the coexistence of discrete values and continuous values, is used as the framework to randomly initialize the enhancement parameters in the original parameter list, obtaining multiple optimized parameter lists.
- each optimization parameter list is used to transform the original training data to obtain corresponding artificially constructed data, and the original training data and the corresponding artificially constructed data are randomly shuffled together to obtain multiple training sets.
- for example, with L optimization parameter lists, each list is used to convert the original training data, yielding L pieces of artificially constructed data, each corresponding to one optimization parameter list.
- the original training data is then mixed with each piece of artificially constructed data to obtain L training sets.
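The construction of the L training sets described above can be sketched as follows. This is a minimal illustration only: the function name `build_training_sets`, the `augment` callback, and the representation of a sentence as a list of tokens are assumptions made for this sketch and are not taken from the application.

```python
import random
from typing import Callable, Dict, List, Sequence

Sentence = List[str]            # assumed representation: one sentence = list of tokens
ParamList = Dict[str, object]   # assumed representation of one optimization parameter list


def build_training_sets(
    original: Sequence[Sentence],
    param_lists: Sequence[ParamList],
    augment: Callable[[Sentence, ParamList, random.Random], Sentence],
    seed: int = 0,
) -> List[List[Sentence]]:
    """Return one shuffled training set per optimization parameter list."""
    rng = random.Random(seed)
    training_sets = []
    for params in param_lists:
        # Transform every original sentence with this parameter list.
        constructed = [augment(sentence, params, rng) for sentence in original]
        # Mix original and artificially constructed data, then shuffle.
        mixed = list(original) + constructed
        rng.shuffle(mixed)
        training_sets.append(mixed)
    return training_sets
```

With L parameter lists the function returns L training sets, each twice the size of the original data under the assumption that every sentence is augmented exactly once.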
- S40 Use multiple training sets to train to obtain multiple recognition models, and use the original test data as a test set to test the multiple recognition models.
- S50 Determine, according to the test result, whether there is a model that satisfies the convergence condition in the plurality of recognition models.
- after testing the multiple recognition models with the original test data as the test set, it is determined, according to the recognition effect of each recognition model on each entity word in the test set (that is, the test result), whether there is a model satisfying the convergence condition among the multiple recognition models.
- the multiple recognition models may be traditional entity recognition models.
- if there is a model satisfying the convergence condition among the multiple recognition models, the optimization parameter list corresponding to that model's training set is determined to be the data enhancement list that meets the data enhancement requirements, and the corresponding optimization parameter list is output as the target data enhancement parameter list.
- S70 Perform data enhancement on the original training data by using the target data enhancement parameter list to obtain a training set of the named entity recognition model.
- after outputting the corresponding optimization parameter list as the target data enhancement parameter list, the target data enhancement parameter list is used to perform data enhancement on the original training data, and the enhanced data is then randomly shuffled and mixed with the original training data to obtain the training set of the named entity recognition model.
- training on this mixed training set yields a relatively accurate named entity recognition model, thereby ensuring the recognition accuracy of the named entity recognition model.
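Steps S40 to S60 can be pictured with the short sketch below. The convergence condition is not spelled out in this passage, so the sketch assumes it is a score threshold on the test set (for example an F1 value); `train`, `evaluate`, and `select_target_params` are hypothetical names introduced only for illustration.

```python
from typing import Callable, Optional, Sequence, TypeVar

P = TypeVar("P")  # an optimization parameter list
D = TypeVar("D")  # a training set
M = TypeVar("M")  # a trained recognition model


def select_target_params(
    param_lists: Sequence[P],
    training_sets: Sequence[D],
    train: Callable[[D], M],
    evaluate: Callable[[M], float],
    threshold: float,
) -> Optional[P]:
    """Train one model per training set and return the parameter list of the
    best model that meets the assumed convergence threshold, or None."""
    best = None
    for params, train_set in zip(param_lists, training_sets):
        model = train(train_set)   # S40: train a recognition model
        score = evaluate(model)    # S40: test it on the original test data
        # S50: assumed convergence condition, score >= threshold
        if score >= threshold and (best is None or score > best[0]):
            best = (score, params)
    # S60: output the matching parameter list, if any
    return None if best is None else best[1]
```

Returning `None` signals that no model converged, which is the case the re-initialization steps handle.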
- the artificial fish swarm optimization algorithm is a particle swarm optimization algorithm, which regards particles as fish trying to reach the position with the highest food concentration in the waters, thereby improving their living conditions.
- the particles, that is, the artificial fish, are the enhancement parameters in the original parameter list that are randomly initialized.
- the food concentration is the cost function or loss function of the recognition model.
- the swimming process of the artificial fish during the algorithm's operation is the process in which the enhancement parameters in the original parameter list gradually approach the optimal position, making the cost function or loss function approach its lowest value.
- the discrete values ⁇ 1 to ⁇ 5 in the original parameter list and the discrete value p syn are combined to form a mixed continuous value and discrete value
- the original parameter list includes the data enhancement method and the corresponding enhancement parameters.
- the artificial fish swarm algorithm is used to iteratively optimize the enhancement parameters of the original parameter list to obtain an optimized parameter list. The original training data is processed with the optimized parameter list to obtain artificially constructed data, and the artificially constructed data is then mixed with the original training data to obtain a high-quality training set at a lower cost, which ensures the recognition accuracy of the named entity recognition model.
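The random initialization over mixed continuous and discrete parameters, and the food-seeking analogy, can be illustrated with the simplified sketch below. It implements only a "prey"-style move (try a nearby position, keep it if the food concentration improves) and omits the swarm and follow behaviors of a full artificial fish swarm algorithm. The search-space layout, the names `random_fish` and `prey_step`, the `visual` and `tries` settings, and the convention that food concentration equals the negated loss are all assumptions for this sketch.

```python
import random
from typing import Callable, Dict, List, Tuple, Union

# Assumed search space: a (low, high) tuple marks a continuous parameter,
# a list marks a discrete parameter with the given options.
Domain = Union[Tuple[float, float], List[object]]
Fish = Dict[str, object]


def random_fish(space: Dict[str, Domain], rng: random.Random) -> Fish:
    """Randomly initialize one parameter list (one artificial fish)."""
    fish: Fish = {}
    for name, domain in space.items():
        if isinstance(domain, tuple):
            low, high = domain
            fish[name] = rng.uniform(low, high)
        else:
            fish[name] = rng.choice(domain)
    return fish


def prey_step(
    fish: Fish,
    space: Dict[str, Domain],
    food: Callable[[Fish], float],
    rng: random.Random,
    visual: float = 0.2,
    tries: int = 5,
) -> Fish:
    """Try a few nearby positions; move only if food concentration improves."""
    current = food(fish)
    for _ in range(tries):
        candidate = dict(fish)
        for name, domain in space.items():
            if isinstance(domain, tuple):
                low, high = domain
                step = rng.uniform(-visual, visual) * (high - low)
                candidate[name] = min(high, max(low, fish[name] + step))
            elif rng.random() < visual:
                # Discrete value: occasionally jump to another option.
                candidate[name] = rng.choice(domain)
        if food(candidate) > current:
            return candidate
    return fish
```

In practice `food` would wrap training and testing a recognition model, which is expensive; here it is left as a callback so the search logic can be read on its own.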
- the manually labeled original training data and original test data are obtained, and the original parameter list is obtained.
- the original parameter list is composed of the data enhancement method and the enhancement parameters corresponding to the data enhancement method.
- according to the artificial fish swarm algorithm, the enhancement parameters in the original parameter list are randomly initialized to obtain multiple optimized parameter lists, each optimized parameter list is used to transform the original training data to obtain the corresponding artificially constructed data, and the original training data is mixed with the corresponding artificially constructed data to obtain multiple training sets.
- the optimization parameter list corresponding to the model that satisfies the convergence condition is output as the target data enhancement parameter list, and the original training data is enhanced by using the target data enhancement parameter list.
- an artificial fish swarm algorithm suitable for the coexistence of discrete values and continuous values is used to randomly initialize the enhancement parameters in the original parameter list, and the recognition effect of the recognition model is used as the optimization target and integrated into the formulation of the data enhancement strategy.
- this embodiment can support the expansion of the data enhancement method and obtain different data enhancement lists according to the needs of users, so that a larger amount of model training data is obtained, which further ensures the accuracy of the model.
- the data enhancement method includes a synonym replacement method. As shown in FIG. 3, after step S50, that is, after determining whether there is a model that satisfies the convergence condition in the multiple recognition models according to the test result, the method further includes the following steps:
- if no model among the multiple recognition models satisfies the convergence condition, the optimized parameter lists do not sufficiently augment the original training data.
- in that case, the enhancement parameters in the original parameter list are randomly initialized again according to the artificial fish swarm algorithm, multiple recognition models are trained according to the re-initialized optimized parameter lists, and the multiple recognition models are tested, until the target data enhancement parameter list is obtained.
- when the enhancement parameters in the original parameter list are randomly initialized again according to the artificial fish swarm algorithm, it is necessary to record and count the number of times the enhancement parameters in the original parameter list have been randomly initialized.
- S90 Determine whether the number of random initializations for the enhancement parameters of the original parameter list is less than a preset number of times.
- after determining whether the number of random initializations of the enhancement parameters of the original parameter list is less than the preset number of times: if it is less than the preset number of times and the target data enhancement parameter list has not yet been determined, steps S30-S70 are repeated. That is, multiple new recognition models are trained according to the re-initialized optimization parameter lists and tested to obtain the target data enhancement parameter list, and the training set of the named entity recognition model is obtained by performing data enhancement with the target data enhancement parameter list.
- in this embodiment, the enhancement parameters in the original parameter list are randomly initialized again according to the artificial fish swarm algorithm to obtain multiple re-initialized optimized parameter lists, and the re-initializations are counted. Whether the count is less than the preset number of times determines whether the enhancement parameters of the original parameter list are randomly initialized again.
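The bounded re-initialization loop can be sketched as below. The passage does not say what happens once the preset number of times is reached, so the sketch simply returns `None` in that case; this behavior, the threshold-style convergence check, and the names `search_with_retries`, `init_param_lists`, and `evaluate_params` are assumptions for illustration.

```python
from typing import Callable, List, Optional, Tuple, TypeVar

P = TypeVar("P")  # an optimization parameter list


def search_with_retries(
    init_param_lists: Callable[[int], List[P]],
    evaluate_params: Callable[[P], float],
    threshold: float,
    max_rounds: int,
) -> Tuple[Optional[P], int]:
    """Re-initialize candidate parameter lists until one converges or the
    preset number of rounds is used up. Returns (params or None, rounds used)."""
    for round_idx in range(max_rounds):
        # Random (re-)initialization of the candidate parameter lists.
        candidates = init_param_lists(round_idx)
        # Train and test one model per candidate (hidden inside evaluate_params).
        scored = [(evaluate_params(p), p) for p in candidates]
        score, params = max(scored, key=lambda pair: pair[0])
        if score >= threshold:
            return params, round_idx + 1
    return None, max_rounds
```

Counting rounds explicitly keeps the search cost bounded, since each evaluation involves training a recognition model.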
- the data enhancement method includes a synonym replacement method. As shown in FIG. 4 , in step S30, each optimization parameter list is used to convert the original training data, which specifically includes the following steps:
- S31 Determine enhancement parameters corresponding to the synonym replacement method in the optimization parameter list, where the enhancement parameters corresponding to the synonym replacement method include entity word category replacement probability and entity word replacement category.
- when the data enhancement methods in the optimization parameter list include a synonym replacement method, the enhancement parameters corresponding to the synonym replacement method are determined in the optimization parameter list; these enhancement parameters include the entity word category replacement probability and the entity word replacement category.
- S32 Acquire a preset synonym dictionary pre-built by the user according to requirements.
- in the preset synonym dictionary, entity words in the same entity category whose synonymous relationship is not prohibited are used as synonyms for each other.
- the preset synonym dictionary is pre-built by users according to their needs and includes entity words of different entity categories.
- entity words of the same entity category are used as synonyms for each other. In the preset synonym dictionary, the synonymous relationship between specific entity words is also prohibited, and entity words whose synonymous relationship is prohibited cannot be used as synonyms of each other.
- the scale of the entity words of the preset synonym dictionary is increased by relaxing the judgment conditions of synonyms, with entity words of the same entity category used as synonyms. That is, if word A in a sentence is replaced by word B and the new sentence is still reasonable in semantics and grammar, then word B and word A belong to the same entity category and word B is a synonym of word A. The entity words of the same category are collected to form the preset synonym dictionary.
- for example, in the sentence "Sun Wukong is pressed under the Five Elements Mountain", Sun Wukong can be replaced by other person names such as Tathagata Buddha and Bull Demon King, so Sun Wukong, Tathagata Buddha and Bull Demon King are synonyms of each other.
- the quality of the preset synonym dictionary is improved by prohibiting the synonymous relationship between specific words whose substitution would change the grammar or meaning of the sentence.
- for example, if the Five Elements Mountain in "Sun Wukong is pressed under the Five Elements Mountain" is replaced by the Yellow River, the sentence becomes "Sun Wukong is pressed under the Yellow River". Therefore, the synonymous relationship between the Five Elements Mountain and the Yellow River is prohibited in the preset synonym dictionary, and during synonym substitution they cannot replace each other.
- the above sentence, "Sun Wukong is pressed under the Five Elements Mountain", and the entity words Tathagata Buddha, Bull Demon King and the Yellow River are only exemplary. In other embodiments, other sentences and entity words can be used as examples.
- the synonyms in the preset synonym dictionary can exist in the form of Table 2. Table 2 includes four columns: the first column is the serial number, the second and third columns are two different words (word A and word B), and the fourth column is the replacement relationship between word A and word B. If word B can replace word A, word A and word B are synonyms of each other; if word B cannot replace word A, word A and word B are not synonyms of each other.
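A preset synonym dictionary with prohibited pairs could be modeled roughly as follows. The class `SynonymDictionary` and its methods are hypothetical; the entity words come from the example sentence in the text, and the category labels "person" and "place" are assumed for illustration.

```python
from collections import defaultdict
from typing import Dict, FrozenSet, List, Set


class SynonymDictionary:
    """Entity words of one category are synonyms unless the pair is prohibited."""

    def __init__(self) -> None:
        self._by_category: Dict[str, Set[str]] = defaultdict(set)
        self._category_of: Dict[str, str] = {}
        self._prohibited: Set[FrozenSet[str]] = set()

    def add(self, word: str, category: str) -> None:
        self._by_category[category].add(word)
        self._category_of[word] = category

    def prohibit(self, word_a: str, word_b: str) -> None:
        # The prohibition is symmetric, so store the unordered pair.
        self._prohibited.add(frozenset((word_a, word_b)))

    def synonyms(self, word: str) -> List[str]:
        category = self._category_of.get(word)
        if category is None:
            return []
        return sorted(
            other
            for other in self._by_category[category]
            if other != word and frozenset((word, other)) not in self._prohibited
        )


dictionary = SynonymDictionary()
for name in ["Sun Wukong", "Tathagata Buddha", "Bull Demon King"]:
    dictionary.add(name, "person")
for place in ["Five Elements Mountain", "Yellow River"]:
    dictionary.add(place, "place")
dictionary.prohibit("Five Elements Mountain", "Yellow River")
```

Storing prohibited pairs separately from category membership mirrors the fourth column of the table layout: membership gives the relaxed synonym rule, and the prohibited set records the exceptions.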
- Table 2: content of the preset synonym dictionary (the table body is not reproduced in this text).
- S33 Perform synonym replacement on entity words in the original training data according to the preset synonym dictionary, entity word category replacement probability, and entity word replacement category.
- the entity word category replacement probability is the probability with which entity words belonging to the entity word replacement category are replaced.
- in this embodiment, the enhancement parameters corresponding to the synonym replacement method are determined in the optimization parameter list, including the entity word category replacement probability and the entity word replacement category, and a preset synonym dictionary pre-built by the user according to requirements is obtained.
- in the preset synonym dictionary, the entity words in the same entity category whose synonymous relationship is not prohibited are regarded as synonyms of each other. The entity words in the original training data are replaced by synonyms according to the preset synonym dictionary, the entity word category replacement probability and the entity word replacement category, which refines the step of using each optimization parameter list to convert the original training data.
- by relaxing the judgment conditions of synonyms, the scale of the preset synonym dictionary is expanded and the diversity of the artificially constructed data is improved. Building the dictionary with prohibited synonymous relationships continuously improves the quality of the preset synonym dictionary, thereby ensuring the quality of the artificially constructed data.
- in step S33, that is, replacing the entity words in the original training data with synonyms according to the preset synonym dictionary, the entity word category replacement probability and the entity word replacement category, the method specifically includes the following steps:
- S331 Determine whether the category of each entity word in the original training data belongs to the entity word replacement category.
- after determining whether the category of each entity word in the original training data belongs to the entity word replacement category, if the category of an entity word in the original training data belongs to the entity word replacement category, the entity word needs to be replaced by synonyms, and all synonyms of the entity word in the preset synonym dictionary are acquired for subsequent replacement.
- S333 Determine whether the synonymous relationship between the entity word and the synonym of the entity word is prohibited.
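- if the synonymous relationship between the entity word and a synonym of the entity word is prohibited, the synonym is skipped, that is, the synonym is not used as a replacement word for the entity word.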
- for example, suppose the entity word replacement category includes three categories (person name, place name, and institution name) and the entity word category replacement probability is 30%.
- then each person's name in a sentence in the original training data has a 30% probability of being replaced with a synonym of that person's name in the preset synonym dictionary. If a synonym of the person's name in the preset synonym dictionary is prohibited from being synonymous with it, that synonym is skipped and another synonym is used to replace the person's name.
- in other embodiments, the entity word replacement category and the entity word category replacement probability can take other values.
- after determining whether the category of each entity word in the original training data belongs to the entity word replacement category, if the category of an entity word in the original training data does not belong to the entity word replacement category, the entity word does not need to be replaced by synonyms, and the other data enhancement methods in the optimization parameter list are performed.
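The synonym replacement flow of step S33 can be sketched as below: check the category, draw against the replacement probability, collect synonyms, skip prohibited ones, and replace. The function name `replace_synonyms`, the per-token label list with `"O"` for non-entity tokens, and the plain-dict synonym lookup are assumptions for this sketch, and each entity is treated as a single token for simplicity.

```python
import random
from typing import Dict, FrozenSet, List, Sequence, Set


def replace_synonyms(
    tokens: Sequence[str],
    labels: Sequence[str],
    synonyms: Dict[str, List[str]],
    prohibited: Set[FrozenSet[str]],
    replace_categories: Set[str],
    prob: float,
    rng: random.Random,
) -> List[str]:
    """Replace entity words of the selected categories with allowed synonyms."""
    out = []
    for token, label in zip(tokens, labels):
        # Only entity words of a replacement category are candidates, and
        # each is replaced with the given probability.
        if label in replace_categories and rng.random() < prob:
            # Collect synonyms, skipping any prohibited pair.
            candidates = [
                s for s in synonyms.get(token, [])
                if frozenset((token, s)) not in prohibited
            ]
            if candidates:
                token = rng.choice(candidates)
        out.append(token)
    return out


rng = random.Random(1)
tokens = ["Sun Wukong", "is", "pressed", "under", "Five Elements Mountain"]
labels = ["person", "O", "O", "O", "place"]
synonyms = {
    "Sun Wukong": ["Bull Demon King"],
    "Five Elements Mountain": ["Yellow River"],
}
prohibited = {frozenset(("Five Elements Mountain", "Yellow River"))}
augmented = replace_synonyms(
    tokens, labels, synonyms, prohibited, {"person", "place"}, 1.0, rng
)
```

With a probability of 1.0 the person name is always replaced, while the place name stays unchanged because its only synonym is prohibited.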
- The data enhancement method further includes a random replacement method, a random deletion method, a random exchange method, and a long sentence construction method. As shown in FIG. 6, after step S33, that is, after synonym replacement is performed on the entity words in the original training data, the method further specifically includes the following steps:
- S34 In the optimization parameter list, determine the random replacement probability of the random replacement method, and determine the random deletion probability of the random deletion method.
- The data enhancement method further includes a random replacement method and a random deletion method. The random replacement probability of the random replacement method and the random deletion probability of the random deletion method therefore need to be determined in the optimization parameter list, so that the original training data can be transformed according to the random replacement probability and the random deletion probability.
- S35 Determine the random exchange probability of the random exchange method, and determine the sentence length set by the method of constructing a long sentence.
- The data enhancement method further includes a random exchange method and a long sentence construction method. In the optimization parameter list, it is therefore also necessary to determine the random exchange probability of the random exchange method and the sentence length set by the long sentence construction method, so that the original training data can be transformed according to the random exchange probability and the set sentence length.
- S36 Perform entity word replacement for each sentence in the original training data according to the random replacement probability, and perform the same sentence entity word exchange for each sentence in the original training data according to the random exchange probability.
- The random replacement probability of the random replacement method is $p_2$, and the random exchange probability of the random exchange method is $p_3$.
- Each token of a sentence is replaced, with probability $p_2$, by another token from the dictionary, which can be the preset synonym dictionary.
- The rules for selecting tokens from the dictionary are: follow a uniform random distribution, and exclude the other tokens to be randomly replaced in the original training data.
- Within each sentence, the i-th token and the j-th token exchange positions with probability $p_3$.
- S37 Perform entity word deletion on each sentence in the original training data according to the random deletion probability to obtain processing data.
- After entity word replacement is performed on each sentence in the original training data according to the random replacement probability, and same-sentence entity word exchange is performed on each sentence according to the random exchange probability, entity word deletion is performed on each sentence in the original training data according to the random deletion probability to obtain the processing data.
- That is, each token of each sentence is replaced with any other token in the dictionary with probability $p_2$; then, in each sentence, the positions of the i-th token and the j-th token are exchanged with probability $p_3$; then each token of each sentence is deleted with probability $p_4$ (the random deletion probability) to obtain the processing data.
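The replace, exchange, and delete sequence of steps S36 and S37 can be sketched as follows. This is an illustrative Python sketch, not the application's implementation; the function names and the `seed` parameter are hypothetical, the positions i and j are assumed to be drawn uniformly at random, and the only exclusion applied when picking a replacement is the token being replaced. Entity labels, which a real NER training set would need to carry along with each token, are omitted for brevity.

```python
import random


def random_replace(tokens, vocab, p_replace, rng):
    """Replace each token, with probability p_replace, by another vocab token."""
    out = list(tokens)
    for k in range(len(out)):
        if rng.random() < p_replace:
            # Uniform choice over the dictionary, excluding the current token.
            candidates = [v for v in vocab if v != out[k]]
            if candidates:
                out[k] = rng.choice(candidates)
    return out


def random_swap(tokens, p_swap, rng):
    """With probability p_swap, exchange two distinct positions i and j."""
    out = list(tokens)
    if len(out) >= 2 and rng.random() < p_swap:
        i, j = rng.sample(range(len(out)), 2)
        out[i], out[j] = out[j], out[i]
    return out


def random_delete(tokens, p_delete, rng):
    """Delete each token independently with probability p_delete."""
    return [t for t in tokens if rng.random() >= p_delete]


def perturb(tokens, vocab, p_replace, p_swap, p_delete, seed=None):
    """Apply replacement, then exchange, then deletion, in the order of the text."""
    rng = random.Random(seed)
    out = random_replace(tokens, vocab, p_replace, rng)
    out = random_swap(out, p_swap, rng)
    return random_delete(out, p_delete, rng)
```

Keeping the three operations as separate functions makes it easy to switch any of them off by setting its probability to 0 in a candidate parameter list.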
- The sentence length set by the long sentence construction method is 100, and the sentence lengths of the sentences in the processing data are counted to obtain the 90th percentile of sentence length.
- Sentences whose length is less than or equal to the 90th percentile are spliced in pairs into longer spliced sentences (the order of the two sentences is random), and the part of each spliced sentence whose length exceeds 100 is deleted, so that the sentence length of each sentence in the processing data is within the set sentence length of 100.
- Setting the sentence length of the long sentence construction method to 100, and splicing in pairs the sentences whose length is less than or equal to the 90th percentile, are only exemplary illustrations.
- The sentence length set by the long sentence construction method can also be another value, and sentences selected by another percentile of sentence length can also be spliced in pairs, which will not be repeated here.
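The long sentence construction described above can be sketched as follows. This is a rough Python sketch under assumptions the text does not settle: the percentile is computed with the nearest-rank rule, sentences are lists of tokens, pairing is done by shuffling and taking adjacent pairs, and the function returns only the spliced sentences. The names `percentile_length` and `build_long_sentences` are hypothetical.

```python
import math
import random


def percentile_length(lengths, q):
    """Nearest-rank q-th percentile of a list of sentence lengths (assumed rule)."""
    ordered = sorted(lengths)
    rank = max(1, math.ceil(q * len(ordered) / 100))
    return ordered[rank - 1]


def build_long_sentences(sentences, max_len=100, q=90, seed=None):
    """Splice short sentences in pairs and truncate each result to max_len tokens."""
    rng = random.Random(seed)
    threshold = percentile_length([len(s) for s in sentences], q)
    # Only sentences at or below the percentile threshold are spliced.
    short = [s for s in sentences if len(s) <= threshold]
    # Shuffling makes both the pairing and the order within a pair random.
    rng.shuffle(short)
    spliced = []
    for first, second in zip(short[0::2], short[1::2]):
        spliced.append((first + second)[:max_len])
    return spliced
```

If the number of short sentences is odd, one sentence is left unpaired in this sketch; how the application handles that case is not stated.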
- The random replacement probability of the random replacement method and the random deletion probability of the random deletion method are determined in the optimization parameter list, and the random exchange probability of the random exchange method and the sentence length set by the long sentence construction method are determined. Entity word replacement is performed on each sentence in the original training data according to the random replacement probability, same-sentence entity word exchange is performed on each sentence according to the random exchange probability, entity word deletion is performed on each sentence according to the random deletion probability to obtain the processing data, and the sentences in the processing data are spliced so that the sentence length after splicing conforms to the set sentence length. This further refines the steps of converting the original training data with each optimization parameter list. Converting the original training data with multiple data enhancement methods further increases the diversity of the artificially constructed data and helps ensure the accuracy of the recognition model training set.
- the data enhancement method includes a synonym replacement method. As shown in FIG. 7 , in step S50, it is determined according to the test result whether there is a convergence model in the multiple recognition models, which specifically includes the following steps:
- S51 Determine the highest recognition score for each word in the test set in the multiple recognition models.
- the highest recognition score of the multiple recognition models for recognizing each word in the test set is determined.
- the recognition score of each word in the test set by the recognition model is determined by the following formula:
- $score_t$ is the recognition score of the recognition model for the t-th word in the test set;
- recall is the recall rate of the entity word;
- precision is the precision with which the recognition model recalls the entity word.
- the recognition scores of the three recognition models A, B, and C for the t-th word in the test set are 0.6, 0.8 and 0.9 respectively, then among the three recognition models A, B, and C, the highest recognition score for recognizing the t-th word in the test set is 0.9.
- the number of recognition models is 3, and the recognition scores for the t-th word in the test set are 0.6, 0.8, and 0.9, respectively, for exemplary illustration. In other embodiments, the number of recognition models may also be other numerical values. The recognition score for the t-th word in the test set may also be other numerical values, which will not be repeated here.
- If the highest recognition score satisfies the convergence condition, the corresponding recognition model is a convergence model, and the optimization parameter list corresponding to the convergence model can be used as the target data enhancement parameter list.
- After determining whether the highest recognition score satisfies the convergence condition, if the highest recognition score does not satisfy the convergence condition, the recognition effect of none of the recognition models meets the requirements. It is then determined that no convergence model satisfying the convergence condition exists among the multiple recognition models; the optimization parameter lists of this round are not usable, and the artificial fish swarm algorithm needs to be used for another iteration of optimization.
- If the highest recognition score satisfies the convergence condition, it is determined that the recognition model corresponding to the highest recognition score is the convergence model.
- The recognition effect of the recognition model on the test set is used as the concentration of the artificial fish swarm algorithm, that is, as the objective of parameter optimization for the data enhancement model, so that an effective data enhancement strategy is obtained at a small cost.
- the data enhancement method includes a synonym replacement method. As shown in FIG. 8 , in step S52, it is determined whether the highest recognition score satisfies the convergence condition, which specifically includes the following steps:
- S522 Determine the first highest recognition score for recognizing the t-th word in the test set among the multiple recognition models;
- S523 Determine the second highest recognition score for recognizing the (t-1)-th word in the test set among the multiple recognition models;
- S524 Subtract the second highest recognition score from the first highest recognition score to obtain the highest recognition score difference;
- S525 Determine whether the ratio of the highest recognition score difference to the second highest recognition score is less than the convergence parameter.
- $maxscore_t$ is the highest recognition score for the t-th word in the test set among the multiple recognition models, that is, the first highest recognition score;
- $maxscore_{t-1}$ is the highest recognition score for the (t-1)-th word in the test set among the multiple recognition models, that is, the second highest recognition score;
- $\varepsilon$ is a convergence parameter configured by the user (it can be 0.01).
- If the convergence value $(maxscore_t - maxscore_{t-1}) / maxscore_{t-1}$ is less than the convergence parameter $\varepsilon$, it is determined that the highest recognition score satisfies the convergence condition; if it is not less than the convergence parameter $\varepsilon$, it is determined that the highest recognition score does not satisfy the convergence condition.
- The convergence parameter configured by the user is determined; the first highest recognition score, for recognizing the t-th word in the test set among the multiple recognition models, is determined; the second highest recognition score, for recognizing the (t-1)-th word in the test set among the multiple recognition models, is determined; the second highest recognition score is subtracted from the first highest recognition score to obtain the highest recognition score difference; and it is determined whether the ratio of the highest recognition score difference to the second highest recognition score is less than the convergence parameter.
- If the ratio of the highest recognition score difference to the second highest recognition score is less than the convergence parameter, it is determined that the highest recognition score satisfies the convergence condition; if the ratio is not less than the convergence parameter, it is determined that the highest recognition score does not satisfy the convergence condition. This makes explicit the specific process of determining whether the highest recognition score satisfies the convergence condition, and provides a basis for judging whether the model has converged according to the highest recognition score.
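The convergence test of steps S522 to S525 amounts to a relative-difference check, sketched below in Python. The function names are hypothetical, and the guard against a zero previous score is an added assumption that the text does not discuss.

```python
def highest_score(scores):
    """Highest recognition score among the recognition models (step S51)."""
    return max(scores)


def satisfies_convergence(first_highest, second_highest, epsilon=0.01):
    """Return True when (first - second) / second is below epsilon.

    first_highest  -- highest recognition score for the t-th item
    second_highest -- highest recognition score for the (t-1)-th item
    epsilon        -- user-configured convergence parameter (0.01 in the example)
    """
    if second_highest == 0:
        # Assumption: with no usable baseline, treat the check as not converged.
        return False
    difference = first_highest - second_highest  # S524
    return difference / second_highest < epsilon  # S525
```

For the three models A, B, and C in the example above, `highest_score([0.6, 0.8, 0.9])` picks 0.9 as the value that would be fed into the check.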
- a data enhancement apparatus based on a deep learning model corresponds to the data enhancement method based on the deep learning model in the above-mentioned embodiment.
- the data enhancement device based on the deep learning model includes an acquisition module 901 , an initialization module 902 , a conversion module 903 , a test module 904 , an output module 905 and an enhancement module 906 .
- the detailed description of each functional module is as follows:
- Obtaining module 901 is used to obtain the original training data and original test data marked manually, and obtain the original parameter list, and the original parameter list is formed by the data enhancement method and the corresponding enhancement parameters of the data enhancement method;
- An initialization module 902 configured to randomly initialize the enhanced parameters in the original parameter list according to the artificial fish swarm algorithm to obtain multiple optimized parameter lists;
- the conversion module 903 is configured to convert the original training data by using each of the optimized parameter lists to obtain corresponding artificially constructed data, and mix the original training data with the corresponding artificially constructed data to obtain multiple training sets;
- the testing module 904 is configured to train separately on the multiple training sets to obtain multiple recognition models, and use the original test data as a test set to test the multiple recognition models, so as to determine whether a model satisfying the convergence condition exists among the multiple recognition models;
- the output module 905 is configured to output the optimization parameter list corresponding to the model satisfying the convergence condition as the target data enhancement parameter list if there is a model satisfying the convergence condition in the plurality of identification models;
- An enhancement module 906 configured to perform data enhancement on the original training data by using the target data enhancement parameter list to obtain a training set of a named entity recognition model.
- the data enhancement device based on the deep learning model further includes a loop module 907, after determining whether there is a model satisfying the convergence condition in the plurality of identification models, the loop module 907 is specifically used for:
- randomly initializing the enhancement parameters in the original parameter list again according to the artificial fish swarm algorithm, to obtain multiple optimization parameter lists after the re-initialization, and counting;
- the data enhancement method includes a synonym replacement method, and the conversion module 903 is specifically used for:
- performing synonym replacement on the entity words in the original training data according to the entity word category replacement probability and the entity word replacement category.
- conversion module 903 is specifically also used for:
- the data enhancement method also includes a random replacement method, a random deletion method, a random exchange method and a long sentence construction method, and the conversion module 903 is specifically also used for:
- test module 904 is specifically used for:
- the recognition model corresponding to the highest recognition score is the convergence model
- the convergence model that satisfies the convergence condition does not exist in the plurality of recognition models.
- test module 904 is specifically also used for:
- if the ratio of the highest recognition score difference to the second highest recognition score is not less than the convergence parameter, determining that the highest recognition score does not satisfy the convergence condition.
- Each module in the above-mentioned deep learning model-based data augmentation apparatus may be implemented in whole or in part by software, hardware, and combinations thereof.
- the above modules can be embedded in or independent of the processor in the computer device in the form of hardware, or stored in the memory in the computer device in the form of software, so that the processor can call and execute the operations corresponding to the above modules.
- a computer device is provided, and the computer device may be a server, and its internal structure diagram may be as shown in FIG. 10 .
- the computer device includes a processor, memory, a network interface, and a database connected by a system bus. Among them, the processor of the computer device is used to provide computing and control capabilities.
- the memory of the computer device includes a non-volatile storage medium and an internal memory.
- the non-volatile storage medium stores an operating system, computer readable instructions and a database.
- the internal memory provides an environment for the execution of the operating system and computer-readable instructions in the non-volatile storage medium.
- the database of the computer device is used to store data used or produced by the data enhancement method, such as the original training data, the original test data, the original parameter list, the artificially constructed data, the optimized parameter lists, and the multiple recognition models.
- the network interface of the computer device is used to communicate with an external terminal through a network connection.
- the computer-readable instructions when executed by a processor, implement a deep learning model-based data augmentation method.
- a computer device including a memory, a processor, and computer-readable instructions stored on the memory and executable on the processor; when the processor executes the computer-readable instructions, the steps of the above deep learning model-based data augmentation method are implemented.
- a computer-readable storage medium on which computer-readable instructions are stored, and when the computer-readable instructions are executed by a processor, implement the steps of the above-mentioned deep learning model-based data enhancement method.
- Nonvolatile memory may include read only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory.
- Volatile memory may include random access memory (RAM) or external cache memory.
- RAM is available in various forms such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM), etc.
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Artificial Intelligence (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Computational Linguistics (AREA)
- General Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Health & Medical Sciences (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Evolutionary Computation (AREA)
- Evolutionary Biology (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Bioinformatics & Computational Biology (AREA)
- Life Sciences & Earth Sciences (AREA)
- Databases & Information Systems (AREA)
- Image Analysis (AREA)
Abstract
A deep learning model-based data augmentation method and apparatus, a device, and a medium. The method comprises: obtaining manually labeled original training data and original test data, and obtaining an original parameter list (S10); randomly initializing, according to an artificial fish swarm algorithm, augmentation parameters in the original parameter list to obtain multiple optimization parameter lists (S20); using each optimization parameter list to transform the original training data to obtain corresponding artificially constructed data, and mixing the original training data with the corresponding artificially constructed data to obtain multiple training sets (S30); using the multiple training sets to carry out training separately to obtain multiple recognition models, and using the original test data as a test set to test the multiple recognition models (S40); determining, according to the test results, whether there is a model satisfying a convergence condition among the multiple recognition models (S50); if yes, outputting the optimization parameter list corresponding to the model satisfying the convergence condition as a target data augmentation parameter list (S60); and using the target data augmentation parameter list to perform data augmentation on the original training data to obtain a training set of a named entity recognition model (S70). The method uses the artificial fish swarm algorithm as the framework and integrates the model recognition effect as an optimization target into the formulation of the data augmentation strategy, thereby improving data augmentation effect on the data.
Description
This application claims priority to Chinese invention patent application No. 202110420110.3, filed with the China Patent Office on April 19, 2021 and entitled "Data Enhancement Method, Apparatus, Device and Medium Based on Deep Learning Model", the entire contents of which are incorporated herein by reference.
The present application relates to the field of artificial intelligence, and in particular to a data enhancement method, apparatus, device, and medium based on a deep learning model.
With the development of intelligent technology, demand for Named Entity Recognition (NER) tasks keeps growing in application areas of natural language processing such as question answering systems and machine translation systems. As a result, performing NER tasks with a named entity recognition model trained on entity data has become an increasingly common recognition approach. To improve the model's recognition rate for entities in the text to be recognized, one usually starts from one of two angles, enhancing the training data or enhancing the model algorithm, so as to improve the accuracy of the named entity recognition model.
The inventor found that, in the prior art, the data enhancement model of a named entity recognition model mainly replaces entity words in the training data by means of different data enhancement methods and their corresponding parameters, for example by performing synonym replacement, random insertion, random position exchange, and random deletion on the entity words with a certain probability, so as to increase the scale and diversity of the training data. The enhancement effect of a data enhancement model on the training data is closely tied to its model parameters, but the model parameters of existing data enhancement models are determined by experience or by grid search. Their interaction with the named entity recognition model is weak, so the enhancement effect on the training data is poor.
The present application provides a data enhancement method, apparatus, device, and medium based on a deep learning model, to solve the prior-art problem that the model parameters of a data enhancement model are determined by experience or by grid search, which results in a poor data enhancement effect.
A data enhancement method based on a deep learning model, comprising:
obtaining manually labeled original training data and original test data, and obtaining an original parameter list, where the original parameter list consists of data enhancement methods and the enhancement parameters corresponding to the data enhancement methods;
randomly initializing the enhancement parameters in the original parameter list according to an artificial fish swarm algorithm to obtain multiple optimization parameter lists;
converting the original training data by using each of the optimization parameter lists to obtain corresponding artificially constructed data, and mixing the original training data with the corresponding artificially constructed data to obtain multiple training sets;
training separately on the multiple training sets to obtain multiple recognition models, and testing the multiple recognition models with the original test data as a test set, to determine whether a model satisfying a convergence condition exists among the multiple recognition models;
if a model satisfying the convergence condition exists among the multiple recognition models, outputting the optimization parameter list corresponding to that model as a target data enhancement parameter list;
performing data enhancement on the original training data by using the target data enhancement parameter list to obtain a training set of a named entity recognition model.
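The overall search loop formed by these steps can be sketched as follows. This is a simplified Python sketch under explicit assumptions, not the claimed method: it only re-draws random parameter lists each round and omits the movement behaviours of a full artificial fish swarm algorithm; parameters are treated as continuous ranges only; and `augment`, `train`, `evaluate`, and `converged` are hypothetical callables supplied by the caller.

```python
import random


def search_enhancement_parameters(train_data, test_data, param_ranges,
                                  augment, train, evaluate, converged,
                                  n_lists=3, max_rounds=10, seed=None):
    """Return the parameter list of a converged model, or None if none is found.

    param_ranges -- {parameter name: (low, high)}, assumed continuous ranges
    augment      -- (train_data, params) -> artificially constructed data
    train        -- training_set -> recognition model
    evaluate     -- (model, test_data) -> recognition score
    converged    -- list of scores of this round -> bool (convergence condition)
    """
    rng = random.Random(seed)
    for _ in range(max_rounds):
        # Randomly initialize several optimization parameter lists.
        candidates = [
            {name: rng.uniform(low, high)
             for name, (low, high) in param_ranges.items()}
            for _ in range(n_lists)
        ]
        scored = []
        for params in candidates:
            # Mix original and constructed data, then train and test a model.
            training_set = list(train_data) + list(augment(train_data, params))
            model = train(training_set)
            scored.append((evaluate(model, test_data), params))
        # Check the convergence condition on this round's test results.
        if converged([score for score, _ in scored]):
            best_score, best_params = max(scored, key=lambda item: item[0])
            return best_params
    return None
```

The returned list would then be applied once more to the original training data to build the final training set of the named entity recognition model.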
A data enhancement apparatus based on a deep learning model, comprising:
an acquisition module, configured to obtain manually labeled original training data and original test data, and obtain an original parameter list, where the original parameter list consists of data enhancement methods and the enhancement parameters corresponding to the data enhancement methods;
an initialization module, configured to randomly initialize the enhancement parameters in the original parameter list according to an artificial fish swarm algorithm to obtain multiple optimization parameter lists;
a conversion module, configured to convert the original training data by using each of the optimization parameter lists to obtain corresponding artificially constructed data, and mix the original training data with the corresponding artificially constructed data to obtain multiple training sets;
a test module, configured to train separately on the multiple training sets to obtain multiple recognition models, and test the multiple recognition models with the original test data as a test set, to determine whether a model satisfying a convergence condition exists among the multiple recognition models;
an output module, configured to, if a model satisfying the convergence condition exists among the multiple recognition models, output the optimization parameter list corresponding to that model as a target data enhancement parameter list;
an enhancement module, configured to perform data enhancement on the original training data by using the target data enhancement parameter list to obtain a training set of a named entity recognition model.
A computer device, comprising a memory, a processor, and computer-readable instructions stored in the memory and executable on the processor, where the processor implements the following steps when executing the computer-readable instructions:
obtaining manually labeled original training data and original test data, and obtaining an original parameter list, where the original parameter list consists of data enhancement methods and the enhancement parameters corresponding to the data enhancement methods;
randomly initializing the enhancement parameters in the original parameter list according to an artificial fish swarm algorithm to obtain multiple optimization parameter lists;
converting the original training data by using each of the optimization parameter lists to obtain corresponding artificially constructed data, and mixing the original training data with the corresponding artificially constructed data to obtain multiple training sets;
training separately on the multiple training sets to obtain multiple recognition models, and testing the multiple recognition models with the original test data as a test set, to determine whether a model satisfying a convergence condition exists among the multiple recognition models;
if a model satisfying the convergence condition exists among the multiple recognition models, outputting the optimization parameter list corresponding to that model as a target data enhancement parameter list;
performing data enhancement on the original training data by using the target data enhancement parameter list to obtain a training set of a named entity recognition model.
One or more readable storage media storing computer-readable instructions that, when executed by one or more processors, cause the one or more processors to perform the following steps:
obtaining manually labeled original training data and original test data, and obtaining an original parameter list, where the original parameter list consists of data enhancement methods and the enhancement parameters corresponding to the data enhancement methods;
randomly initializing the enhancement parameters in the original parameter list according to an artificial fish swarm algorithm to obtain multiple optimization parameter lists;
converting the original training data by using each of the optimization parameter lists to obtain corresponding artificially constructed data, and mixing the original training data with the corresponding artificially constructed data to obtain multiple training sets;
training separately on the multiple training sets to obtain multiple recognition models, and testing the multiple recognition models with the original test data as a test set, to determine whether a model satisfying a convergence condition exists among the multiple recognition models;
if a model satisfying the convergence condition exists among the multiple recognition models, outputting the optimization parameter list corresponding to that model as a target data enhancement parameter list;
performing data enhancement on the original training data by using the target data enhancement parameter list to obtain a training set of a named entity recognition model.
技术效果technical effect
本申请中,采用了适合离散值和连续值共存情形的人工鱼群算法随机初始化原参数列表中的增强参数,并把识别模型的识别效果作为优化目标融合到了数据增强策略的制定中, 以较小的代价得到一个效果较好的数据增强列表,从而提高了数据增强列表对数据的数据增强效果。In this application, an artificial fish swarm algorithm suitable for the coexistence of discrete values and continuous values is used to randomly initialize the enhancement parameters in the original parameter list, and the recognition effect of the recognition model is taken as the optimization target and integrated into the formulation of the data enhancement strategy, so as to be more efficient. A data enhancement list with better effect is obtained at a small cost, thereby improving the data enhancement effect of the data enhancement list on the data.
The details of one or more embodiments of the present application are set forth in the accompanying drawings and the description below. Other features and advantages of the present application will become apparent from the description, the drawings, and the claims.
To illustrate the technical solutions of the embodiments of the present application more clearly, the drawings used in the description of the embodiments are briefly introduced below. The drawings described below are merely some embodiments of the present application; a person of ordinary skill in the art can obtain other drawings from them without creative effort.
FIG. 1 is a schematic diagram of an application environment of the deep learning model-based data augmentation method in an embodiment of the present application;
FIG. 2 is a schematic flowchart of the deep learning model-based data augmentation method in an embodiment of the present application;
FIG. 3 is another schematic flowchart of the deep learning model-based data augmentation method in an embodiment of the present application;
FIG. 4 is a flowchart of one implementation of step S30 in FIG. 2;
FIG. 5 is a flowchart of one implementation of step S33 in FIG. 4;
FIG. 6 is a flowchart of another implementation of step S30 in FIG. 2;
FIG. 7 is a flowchart of one implementation of step S50 in FIG. 2;
FIG. 8 is a flowchart of one implementation of step S52 in FIG. 7;
FIG. 9 is a schematic structural diagram of the deep learning model-based data augmentation apparatus in an embodiment of the present application;
FIG. 10 is a schematic structural diagram of a computer device in an embodiment of the present application.
The technical solutions in the embodiments of the present application are described clearly and completely below with reference to the accompanying drawings. The described embodiments are some, but not all, of the embodiments of the present application. All other embodiments obtained by a person of ordinary skill in the art on the basis of these embodiments without creative effort fall within the scope of protection of the present application.
The deep learning model-based data augmentation method provided in the embodiments of the present application can be applied in the application environment shown in FIG. 1, in which a terminal device communicates with a server over a network. The server obtains manually labeled original training data and original test data sent by a user through the terminal device, and obtains an original parameter list sent by the user through the terminal device, the original parameter list consisting of data augmentation methods and the augmentation parameters corresponding to those methods. The server randomly initializes the augmentation parameters in the original parameter list according to the artificial fish swarm algorithm to obtain a plurality of optimized parameter lists, transforms the original training data using each optimized parameter list to obtain corresponding artificially constructed data, and mixes the original training data with the corresponding artificially constructed data to obtain a plurality of training sets. The server then trains a plurality of recognition models respectively on the plurality of training sets and tests them using the original test data as a test set, to determine whether any of the recognition models satisfies a convergence condition. If such a model exists, the optimized parameter list corresponding to it is output as the target data augmentation parameter list, and the original training data is augmented using the target data augmentation parameter list to obtain a training set for a named entity recognition model. Because the artificial fish swarm algorithm, which is suited to situations where discrete and continuous values coexist, is used to randomly initialize the augmentation parameters in the original parameter list, and the recognition performance of the recognition models is incorporated as the optimization objective into the formulation of the data augmentation strategy, a well-performing data augmentation list is obtained at low cost. This ensures the data diversity of the training set of the named entity recognition model, enlarges the training set, and in turn improves the recognition accuracy of the named entity recognition model, thereby making training data augmentation and named entity recognition artificially intelligent.
The data used or produced by the deep learning model-based data augmentation method is stored in a database of the server. In this embodiment, the database is stored in a blockchain network and holds the data used and generated in implementing the method, such as the original training data, the original test data, the original parameter list, the artificially constructed data, the optimized parameter lists, and the plurality of recognition models. The blockchain referred to in the present application is a new application mode of computer technologies such as distributed data storage, point-to-point transmission, consensus mechanisms, and encryption algorithms. A blockchain is essentially a decentralized database: a chain of data blocks linked by cryptographic methods, in which each data block contains information on a batch of network transactions and is used to verify the validity of that information (anti-counterfeiting) and to generate the next block. A blockchain may include an underlying blockchain platform, a platform product service layer, and an application service layer. Deploying the database on a blockchain can improve the security of data storage.
The terminal device may be, but is not limited to, a personal computer, a notebook computer, a smartphone, a tablet computer, or a portable wearable device. The server may be implemented as an independent server or as a server cluster composed of multiple servers.
In an embodiment, as shown in FIG. 2, a deep learning model-based data augmentation method is provided. The method is described by taking its application to the server in FIG. 1 as an example, and includes the following steps:
S10: Obtain manually labeled original training data and original test data, and obtain an original parameter list, the original parameter list consisting of data augmentation methods and the augmentation parameters corresponding to those methods.
It can be understood that the original parameter list in this embodiment is a data augmentation model, which consists of data augmentation methods and the augmentation parameters corresponding to those methods. The augmentation performance of the data augmentation model depends on the data augmentation methods it contains and on their corresponding augmentation parameters. Therefore, before the data augmentation model is used, its parameters need to be optimized to improve its augmentation performance on the training data, thereby ensuring the recognition accuracy of the named entity recognition model subsequently trained on that data.
To optimize the parameters of an existing data augmentation model, the existing model must be obtained, that is, the original parameter list of the model is obtained, together with the manually labeled original training data and original test data.
S20: Randomly initialize the augmentation parameters in the original parameter list according to the artificial fish swarm algorithm, to obtain a plurality of optimized parameter lists.
After the manually labeled original training data and original test data and the original parameter list are obtained, the augmentation parameters in the original parameter list are randomly initialized within the framework of the artificial fish swarm algorithm, which converges quickly and is suited to situations where discrete and continuous values coexist, to obtain a plurality of optimized parameter lists.
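The random initialization in step S20 can be illustrated with a minimal sketch. It covers only the generation of an initial population of candidate parameter lists over a mixed discrete/continuous search space, not the subsequent swarm behaviors of the artificial fish swarm algorithm. The function name `random_init`, the parameter names `p_replace` and `n_copies`, and the convention that a tuple denotes a continuous range while a list denotes discrete choices are all assumptions made for illustration and do not come from the application.

```python
import random

def random_init(spec, population, seed=0):
    """Draw `population` candidate parameter lists from a mixed search space.

    `spec` maps a (hypothetical) parameter name to either a (low, high)
    tuple for a continuous value or a list of allowed discrete values.
    """
    rng = random.Random(seed)
    candidates = []
    for _ in range(population):
        params = {}
        for name, domain in spec.items():
            if isinstance(domain, tuple):
                # Continuous parameter: sample uniformly from [low, high].
                params[name] = rng.uniform(*domain)
            else:
                # Discrete parameter: pick one of the allowed values.
                params[name] = rng.choice(domain)
        candidates.append(params)
    return candidates

# Hypothetical search space: one continuous and one discrete parameter.
spec = {"p_replace": (0.0, 1.0), "n_copies": [1, 2, 3]}
lists = random_init(spec, population=4)
```

Each element of `lists` plays the role of one optimized parameter list (one artificial fish) in the steps that follow.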
S30: Transform the original training data using each optimized parameter list to obtain corresponding artificially constructed data, and mix the original training data with the corresponding artificially constructed data to obtain a plurality of training sets.
After the plurality of optimized parameter lists are obtained, the original training data is transformed using each optimized parameter list to obtain corresponding artificially constructed data, and the original training data and the corresponding artificially constructed data are randomly shuffled together, so that mixing yields a plurality of training sets.
For example, after L optimized parameter lists are obtained, transforming the original training data with each optimized parameter list yields L portions of artificially constructed data, each portion corresponding to one optimized parameter list. The original training data is then mixed with each portion of artificially constructed data separately, to obtain L training sets.
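The construction of L training sets from L optimized parameter lists can be sketched as follows. This is an illustrative sketch only: `build_training_sets` and the `transform` callback are hypothetical names, and the toy transform in the usage example (upper-casing a string) merely stands in for the actual data augmentation methods.

```python
import random

def build_training_sets(original, param_lists, transform, seed=0):
    """Return one shuffled training set per optimized parameter list."""
    rng = random.Random(seed)
    training_sets = []
    for params in param_lists:
        # Artificially constructed data for this parameter list.
        constructed = [transform(sample, params, rng) for sample in original]
        # Mix original and constructed data by random shuffling.
        mixed = list(original) + constructed
        rng.shuffle(mixed)
        training_sets.append(mixed)
    return training_sets

# Toy usage: three parameter lists give three training sets.
toy_sets = build_training_sets(
    ["a", "b"], [{}, {}, {}], lambda s, params, rng: s.upper()
)
```

Every training set contains all of the original samples plus one constructed sample per original sample, so in this sketch each set is twice the size of the original training data.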
S40: Train a plurality of recognition models respectively on the plurality of training sets, and test the plurality of recognition models using the original test data as a test set.
After the plurality of training sets are obtained, a recognition model is trained on each training set, giving a plurality of recognition models. The original test data is used as a test set to test these models, and the recognition performance (recognition score) of each model on the entity words in the test set is taken as the test result.
S50: Determine, according to the test results, whether any of the plurality of recognition models satisfies the convergence condition.
After the plurality of recognition models are tested using the original test data as the test set, whether any of them satisfies the convergence condition is determined according to the recognition performance of each model on the entity words in the test set, that is, the test results. The plurality of recognition models may be conventional entity recognition models.
S60: If a model satisfying the convergence condition exists among the plurality of recognition models, output the optimized parameter list corresponding to that model as the target data augmentation parameter list.
After it is determined whether any of the plurality of recognition models satisfies the convergence condition, if such a model exists, the recognition performance of at least one recognition model meets the user's requirements. Correspondingly, the training set used by that model meets the requirements, so the optimized parameter list corresponding to that training set is determined to be a data augmentation list that meets the data augmentation requirements, and it is output as the target data augmentation parameter list.
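The selection in steps S50 and S60 can be sketched as below. The form of the convergence condition is not fixed by the passage above, so this sketch assumes, purely for illustration, that a model converges when its recognition score on the test set reaches a threshold; `select_converged` and `threshold` are hypothetical names.

```python
def select_converged(param_lists, scores, threshold):
    """Return the parameter list of the best-scoring model that meets the
    assumed convergence condition (score >= threshold), or None."""
    best = None
    for params, score in zip(param_lists, scores):
        if score >= threshold and (best is None or score > best[1]):
            best = (params, score)
    return None if best is None else best[0]

# Toy usage: the second model is the best one above the threshold.
target = select_converged(["list_a", "list_b", "list_c"], [0.5, 0.9, 0.8], 0.75)
```

A return value of `None` corresponds to the case in which no recognition model satisfies the convergence condition.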
S70: Perform data augmentation on the original training data using the target data augmentation parameter list, to obtain a training set for the named entity recognition model.
After the corresponding optimized parameter list is output as the target data augmentation parameter list, the original training data is augmented using the target data augmentation parameter list, and the resulting augmented data is randomly shuffled with the original training data, so that mixing yields the training set of the named entity recognition model. A relatively accurate named entity recognition model can be obtained from this training set, which ensures the recognition accuracy of the named entity recognition model.
It should be understood that the artificial fish swarm algorithm is a kind of particle swarm optimization algorithm that treats each particle as a fish trying to reach the position with the highest food concentration in a body of water and thereby improve its own living conditions. In this embodiment, the particles, or artificial fish, are the randomly initialized augmentation parameters in the original parameter list; the food concentration is the cost function or loss function of the recognition model; and the swimming of the artificial fish while the algorithm runs is the process by which the augmentation parameters in the original parameter list gradually approach the optimal position, driving the cost function or loss function toward its minimum value.
An original parameter list consisting of data augmentation methods and their corresponding augmentation parameters may be as shown in Table 1:
Table 1
In this embodiment, as shown in Table 1, within the framework of the artificial fish swarm algorithm, the discrete values β_1 to β_5 and the discrete values p_syn in the original parameter list are combined to form an original parameter list that mixes continuous and discrete values. The original parameter list includes the data augmentation methods and their corresponding augmentation parameters. The artificial fish swarm algorithm is used to iteratively optimize the augmentation parameters of this original parameter list, yielding an optimized parameter list. The original training data is processed using the optimized parameter list to obtain artificially constructed data, which is then mixed with the original training data. A higher-quality training set is thus obtained at lower cost, which ensures the recognition accuracy of the named entity recognition model.
In this embodiment, manually labeled original training data and original test data are obtained, together with an original parameter list consisting of data augmentation methods and the augmentation parameters corresponding to those methods. The augmentation parameters in the original parameter list are randomly initialized according to the artificial fish swarm algorithm to obtain a plurality of optimized parameter lists. The original training data is transformed using each optimized parameter list to obtain corresponding artificially constructed data, and the original training data is mixed with the corresponding artificially constructed data to obtain a plurality of training sets. A plurality of recognition models are trained respectively on the plurality of training sets and tested using the original test data as a test set, to determine whether any of them satisfies the convergence condition. If such a model exists, the optimized parameter list corresponding to it is output as the target data augmentation parameter list, and the original training data is augmented using the target data augmentation parameter list to obtain the training set of the named entity recognition model. In the present application, the artificial fish swarm algorithm, which is suited to situations where discrete and continuous values coexist, is used to randomly initialize the augmentation parameters in the original parameter list, and the recognition performance of the recognition models is incorporated as the optimization objective into the formulation of the data augmentation strategy. A well-performing data augmentation list is obtained at low cost, which ensures the data diversity of the training set of the named entity recognition model, enlarges the training set, and in turn improves the recognition accuracy of the named entity recognition model.
In addition, because the augmentation parameters of each data augmentation method in the target data augmentation parameter list are obtained by automatic optimization, this embodiment can support extension of the data augmentation methods and obtain different data augmentation lists according to user needs. More model training data can thus be constructed, which further ensures the accuracy of the model.
In an embodiment, the data augmentation methods include a synonym replacement method. As shown in FIG. 3, after step S50, that is, after it is determined according to the test results whether any of the plurality of recognition models satisfies the convergence condition, the method further includes the following steps:
S80: If no model satisfying the convergence condition exists among the plurality of recognition models, randomly initialize the augmentation parameters in the original parameter list again according to the artificial fish swarm algorithm to obtain a plurality of re-initialized optimized parameter lists, and increment a count.
After it is determined whether any of the plurality of recognition models satisfies the convergence condition, if no such model exists, none of the recognition models meets the user's requirements, and the optimized parameter lists produced by the current round of random optimization do not augment the original training data sufficiently. In this case, the augmentation parameters in the original parameter list are randomly initialized again according to the artificial fish swarm algorithm, so that a plurality of recognition models can be trained from the re-initialized optimized parameter lists and tested, until the target data augmentation parameter list is obtained. Each time the augmentation parameters in the original parameter list are randomly initialized again, the number of repeated random initializations is recorded and counted.
S90: Determine whether the number of random initializations of the augmentation parameters of the original parameter list is less than a preset number.
S100: If the number of random initializations of the augmentation parameters of the original parameter list is not less than the preset number, stop randomly initializing the augmentation parameters in the original parameter list.
After it is determined whether the number of random initializations of the augmentation parameters of the original parameter list is less than the preset number, if the number is not less than the preset number, too many iterations have been performed, and random initialization of the augmentation parameters in the original parameter list is stopped to reduce the computational burden. At this point, the optimized parameter list corresponding to the model close to the convergence condition may be output as the target data augmentation parameter list, and the original training data is then augmented using the target data augmentation parameter list to obtain the training set of the named entity recognition model.
S110: If the number of random initializations of the augmentation parameters of the original parameter list is less than the preset number, repeat steps S30 to S70.
After it is determined whether the number of random initializations of the augmentation parameters of the original parameter list is less than the preset number, if the number is less than the preset number, the target data augmentation parameter list has not yet been determined, and steps S30 to S70 above are repeated. That is, a plurality of new recognition models are trained from the re-initialized optimized parameter lists and tested, to obtain the target data augmentation parameter list, which is then used to obtain the training set of the named entity recognition model.
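The control flow of steps S20 to S110 can be summarized in a short sketch. It is illustrative only and rests on several assumptions: `init_param_lists` and `evaluate` are hypothetical callbacks standing in for the random initialization and for the train-and-test procedure; the convergence condition is assumed to be a score threshold; and the "model close to the convergence condition" is interpreted here as the best-scoring model seen so far.

```python
def search_parameter_list(init_param_lists, evaluate, threshold, max_reinits):
    """Return (parameter list, converged flag) following steps S20 to S110."""
    reinit_count = 0
    param_lists = init_param_lists()              # S20: random initialization
    best_params, best_score = None, float("-inf")
    while True:
        for params in param_lists:                # S30 to S50: build set, train, test
            score = evaluate(params)
            if score > best_score:
                best_params, best_score = params, score
        if best_score >= threshold:               # S60: a model has converged
            return best_params, True
        param_lists = init_param_lists()          # S80: re-initialize and count
        reinit_count += 1
        if reinit_count >= max_reinits:           # S90 and S100: stop, output closest
            return best_params, False
        # S110: otherwise repeat with the re-initialized lists

# Toy usage: candidates are plain numbers and the score is the number / 10.
batches = iter([[1, 2], [3, 4], [5, 6]])
found, converged = search_parameter_list(
    lambda: next(batches), lambda p: p / 10, threshold=0.35, max_reinits=5
)
```

In the toy run the first batch does not reach the threshold, so one re-initialization occurs and the second batch supplies the converged candidate.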
In this embodiment, after it is determined whether any of the plurality of recognition models satisfies the convergence condition, if no such model exists, the augmentation parameters in the original parameter list are randomly initialized again according to the artificial fish swarm algorithm to obtain a plurality of re-initialized optimized parameter lists, and a count is incremented. It is then determined whether the number of random initializations of the augmentation parameters of the original parameter list is less than the preset number; if it is, steps S30 to S70 are repeated. This further specifies the operations to be performed when no recognition model converges. By applying the artificial fish swarm algorithm multiple times and optimizing the parameters of the original parameter list with the recognition performance of the plurality of recognition models as the objective, an optimized parameter list that satisfies the user is obtained. This ensures the quality of the parameters in the optimized parameter list and, in turn, the augmentation effect on the data.
In an embodiment, the data augmentation methods include a synonym replacement method. As shown in FIG. 4, transforming the original training data using each optimized parameter list in step S30 specifically includes the following steps:
S31: Determine, in the optimized parameter list, the augmentation parameters corresponding to the synonym replacement method, which include an entity word category replacement probability and an entity word replacement category.
In this embodiment, the data augmentation methods in the optimized parameter list include a synonym replacement method. The augmentation parameters corresponding to the synonym replacement method are determined in the optimized parameter list; they include the entity word category replacement probability and the entity word replacement category.
S32: Obtain a preset synonym dictionary built in advance by the user according to requirements. In the preset synonym dictionary, entity words of the same entity category whose synonymous relationship has not been prohibited are treated as synonyms of one another.
Before the original training data is transformed using the data augmentation methods and corresponding augmentation parameters in each optimized parameter list, a preset synonym dictionary must also be obtained as the source of replacements for the entity words in the original training data. The preset synonym dictionary is a dictionary built in advance by the user according to requirements and contains entity words of different entity categories. In the preset synonym dictionary, entity words of the same entity category are treated as synonyms of one another. The dictionary also prohibits the synonymous relationship between certain entity words; entity words whose synonymous relationship is prohibited cannot serve as synonyms of each other.
In this embodiment, the number of entity words in the preset synonym dictionary is increased by relaxing the criterion for synonymy: entity words of the same entity category are treated as synonyms. That is, if replacing word A in a sentence with word B yields a new sentence that is still semantically and grammatically reasonable, then word B and word A belong to the same entity category and word B is a synonym of word A. The entity words of each category are collected to form the preset synonym dictionary. For example, in the sentence "Sun Wukong was pinned under Five Elements Mountain", Sun Wukong can be replaced with personal names such as Tathagata Buddha or Bull Demon King, so Sun Wukong, Tathagata Buddha, and Bull Demon King are synonyms of one another.
In this embodiment, the quality of the preset synonym dictionary is improved by prohibiting the synonymous relationship between certain words. In everyday usage, some entity words belong to the same entity category, yet substituting one for the other as a synonym makes the sentence ill-formed. In such cases the synonymous relationship between the two must be prohibited, that is, the two entity words are not synonyms. For example, in the sentence "Sun Wukong was pinned under Five Elements Mountain", replacing Five Elements Mountain with the Yellow River gives "Sun Wukong was pinned under the Yellow River", which is an ill-formed sentence. The synonymous relationship between Five Elements Mountain and the Yellow River is therefore prohibited in the preset synonym dictionary, and the two cannot replace each other during synonym replacement.
本实施例中,上述以孙悟空被压在五行山下为句子,并以如来佛祖、牛魔王和黄河为实体词对同义词进行解释,仅为示例性说明,在其他实施例中,还可以以其他句子和实体词为例进行说明。In this embodiment, the above sentences are based on the Sun Wukong being pressed under the Five Elements Mountain, and the synonyms are explained by using Tathagata Buddha, Niu Demon King and the Yellow River as entity words, which are only exemplary descriptions. In other embodiments, other sentences can also be used. and entity words as an example.
The synonyms in the preset synonym dictionary may be stored in the form of Table 2. Table 2 has four columns: the first is the serial number, the second and third are two different words (word A and word B), and the fourth is the replacement relation between word A and word B. If word B can replace word A, then word A and word B are synonyms of each other; if word B cannot replace word A, they are not synonyms of each other. The content of the preset synonym dictionary is shown in Table 2 below:
Table 2
S33: Perform synonym replacement on the entity words in the original training data according to the preset synonym dictionary, the entity word category replacement probability, and the entity word replacement category.
After the preset synonym dictionary, the entity word category replacement probability, and the entity word replacement category are obtained, synonym replacement is performed on the entity words in the original training data according to them, yielding the data after synonym replacement. That data is then processed according to the other data augmentation methods and corresponding augmentation parameters in the optimization parameter list to obtain the artificially constructed data. The entity word category replacement probability is the probability with which an entity word replacement category is replaced. In the optimization parameter list, the replacement probabilities of the entity word categories are p_syn = [p_(syn,1), p_(syn,2), …, p_(syn,K)]. Based on the preset synonym dictionary, an entity word of category k in the original training data is replaced, with probability p_(syn,k), by one of its synonyms in the preset synonym dictionary.
In this embodiment, the augmentation parameters corresponding to the synonym replacement method, namely the entity word category replacement probability and the entity word replacement category, are determined in the optimization parameter list. A preset synonym dictionary built in advance by the user according to requirements is obtained, in which entity words of the same entity category whose synonym relation is not prohibited are treated as synonyms of each other. Synonym replacement is then performed on the entity words in the original training data according to the preset synonym dictionary, the entity word category replacement probability, and the entity word replacement category. This refines the step of converting the original training data with each optimization parameter list. Relaxing the criterion for entity word synonymy enlarges the preset synonym dictionary and increases the diversity of the artificially constructed data, while the prohibition of specific synonym relations continuously improves the quality of the preset synonym dictionary and thereby safeguards the quality of the artificially constructed data.
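The dictionary described above can be illustrated with a minimal Python sketch. It is not the patent's implementation: the function name `build_synonym_dictionary`, the category keys, and the representation of prohibited relations as unordered pairs are assumptions made for illustration.

```python
def build_synonym_dictionary(entities_by_category, prohibited_pairs):
    """Map each entity word to the same-category words allowed to replace it.

    entities_by_category: dict of category name -> list of entity words (assumed layout).
    prohibited_pairs: iterable of (word_a, word_b) whose synonym relation is prohibited.
    """
    # Store prohibited pairs as unordered sets so (a, b) and (b, a) both match.
    banned = {frozenset(pair) for pair in prohibited_pairs}
    synonyms = {}
    for words in entities_by_category.values():
        for word in words:
            # Relaxed criterion: every other word of the same category is a synonym,
            # unless the pair has been explicitly prohibited.
            synonyms[word] = [
                other for other in words
                if other != word and frozenset((word, other)) not in banned
            ]
    return synonyms


entities_by_category = {
    "person": ["Sun Wukong", "Tathagata Buddha", "Bull Demon King"],
    "place": ["Five Elements Mountain", "Yellow River"],
}
prohibited_pairs = [("Five Elements Mountain", "Yellow River")]
synonym_dict = build_synonym_dictionary(entities_by_category, prohibited_pairs)
```

With the example data, the three personal names remain mutual synonyms, while Five Elements Mountain ends up with no permitted replacement because its only same-category candidate is prohibited.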
In an embodiment, as shown in FIG. 5, step S33, that is, performing synonym replacement on the entity words in the original training data according to the preset synonym dictionary, the entity word category replacement probability, and the entity word replacement category, specifically includes the following steps:
S331: Determine whether the category of each entity word in the original training data belongs to the entity word replacement category.
After the preset synonym dictionary, the entity word category replacement probability, and the entity word replacement category are obtained, the category of each entity word in the original training data is determined in order to decide whether that entity word belongs to the entity word replacement category.
S332: If the category of an entity word in the original training data belongs to the entity word replacement category, look up the synonyms of the entity word in the preset synonym dictionary.
If the category of an entity word in the original training data belongs to the entity word replacement category, synonym replacement needs to be performed on that entity word, so all of its synonyms are looked up in the preset synonym dictionary for subsequent replacement.
S333: Determine whether the synonym relation between the entity word and each of its synonyms is prohibited.
After the synonyms of the entity word are looked up in the preset synonym dictionary, it is determined whether the synonym relation between the entity word and each synonym is prohibited.
S334: If the synonym relation between the entity word and a synonym is not prohibited, select, with the entity word category replacement probability, a synonym from the preset synonym dictionary as the replacement word, and replace the entity word with the replacement word.
That is, if the synonym relation between the entity word and its synonym is not prohibited, the entity word is replaced with the corresponding synonym with the entity word category replacement probability.
S335: If the synonym relation between the entity word and a synonym is prohibited, do not use that synonym as a replacement word for the entity word.
That is, if the synonym relation between the entity word and a synonym is prohibited, the synonym is skipped and is not used as a replacement word for the entity word.
For example, suppose the entity word replacement categories are person name, place name, and organization name, and the entity word category replacement probabilities are p_syn = [0.30, 0.60, 0.10]. Under the synonym replacement method, a person name in the original training data is then replaced with probability 0.30, a place name with probability 0.60, and an organization name with probability 0.10. If none of the synonyms of a person name in the preset synonym dictionary is prohibited, each person name in a sentence of the original training data has a 30% probability of being replaced with one of its synonyms in the preset synonym dictionary. If the synonym relation with a particular synonym is prohibited, that synonym is skipped and another synonym is used to replace the person name.
In this embodiment, the three entity word replacement categories (person name, place name, organization name) and the probabilities p_syn = [0.30, 0.60, 0.10] are only an illustration. In other embodiments, the entity word replacement categories and the entity word category replacement probabilities may be different.
S336: If the category of an entity word in the original training data does not belong to the entity word replacement category, do not perform synonym replacement.
If the category of an entity word in the original training data does not belong to the entity word replacement category, synonym replacement is not needed for that entity word, and the other data augmentation methods in the optimization parameter list can be executed.
In this embodiment, it is determined whether the category of each entity word in the original training data belongs to the entity word replacement category. If it does, the synonyms of the entity word are looked up in the preset synonym dictionary, and it is determined whether the synonym relation between the entity word and each synonym is prohibited. If the relation is not prohibited, the entity word is replaced with the corresponding synonym with the entity word category replacement probability. This specifies the steps of performing synonym replacement on the entity words in the original training data according to the preset synonym dictionary, the entity word category replacement probability, and the entity word replacement category, and provides the basis for obtaining the artificially constructed data.
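Steps S331 to S336 can be sketched in Python as follows. This is a hedged illustration, not the patent's code: the function name `replace_synonyms`, the parallel `tokens`/`labels` lists, and the assumption that `synonym_dict` already omits prohibited synonyms are all choices made here for brevity.

```python
import random


def replace_synonyms(tokens, labels, synonym_dict, replace_prob, rng=random):
    """Replace entity words by category-dependent probability.

    tokens: list of words; labels: entity category of each token, or None.
    synonym_dict: word -> list of permitted synonyms (prohibited ones already removed).
    replace_prob: category -> replacement probability, e.g. {"person": 0.30}.
    """
    out = []
    for token, label in zip(tokens, labels):
        # S331 / S336: a category outside the replacement categories gets probability 0.
        prob = replace_prob.get(label, 0.0)
        # S332 / S333 / S335: only synonyms whose relation is not prohibited are candidates.
        candidates = synonym_dict.get(token, [])
        # S334: replace with the category's probability, choosing one permitted synonym.
        if candidates and rng.random() < prob:
            out.append(rng.choice(candidates))
        else:
            out.append(token)
    return out


synonym_dict = {
    "Sun Wukong": ["Tathagata Buddha", "Bull Demon King"],
    "Five Elements Mountain": [],  # only candidate (Yellow River) is prohibited
}
tokens = ["Sun Wukong", "was", "pinned", "under", "Five Elements Mountain"]
labels = ["person", None, None, None, "place"]
p_syn = {"person": 0.30, "place": 0.60, "organization": 0.10}
augmented = replace_synonyms(tokens, labels, synonym_dict, p_syn, random.Random(0))
```

Passing a seeded `random.Random` instance keeps the augmentation reproducible, which is convenient when several optimization parameter lists are compared against each other.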
In an embodiment, the data augmentation methods further include a random replacement method, a random deletion method, a random swap method, and a long sentence construction method. As shown in FIG. 6, after step S33, that is, after synonym replacement is performed on the entity words in the original training data, the method further includes the following steps:
S34: In the optimization parameter list, determine the random replacement probability of the random replacement method and the random deletion probability of the random deletion method.
In this embodiment, the data augmentation methods further include the random replacement method and the random deletion method. The random replacement probability and the random deletion probability are determined in the optimization parameter list so that the original training data can be converted according to them.
S35: Determine the random swap probability of the random swap method and the sentence length set by the long sentence construction method.
In this embodiment, the data augmentation methods further include the random swap method and the long sentence construction method. The random swap probability and the sentence length set by the long sentence construction method are also determined in the optimization parameter list so that the original training data can be converted according to them.
S36: Perform entity word replacement on each sentence in the original training data according to the random replacement probability, and swap entity words within each sentence in the original training data according to the random swap probability.
After the random replacement probability and the random swap probability are determined, entity word replacement is performed on each sentence in the original training data with the random replacement probability, and entity words within the same sentence are swapped with the random swap probability.
For example, suppose that in the optimization parameter list the random replacement probability is β2 and the random swap probability is β3. Each token (entity word) of each sentence in the original training data then has probability β2 of being replaced with any other token from a dictionary (which may be the preset synonym dictionary). The token is selected from the dictionary uniformly at random, excluding the other tokens in the original training data that are to be randomly replaced. In addition, within each sentence of the original training data, the i-th token and the j-th token swap positions with probability β3.
S37: Perform entity word deletion on each sentence in the original training data according to the random deletion probability to obtain the processed data.
After entity word replacement and same-sentence entity word swapping are performed on each sentence in the original training data, entity words are deleted from each sentence according to the random deletion probability to obtain the processed data.
For example, in the original training data, each token of each sentence is first replaced with any other token in the dictionary with probability β2; then, in each sentence, the i-th token and the j-th token swap positions with probability β3; finally, each token of each sentence is deleted with probability β4, the random deletion probability, to obtain the processed data.
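The replace, swap, delete sequence in this example can be sketched in Python as below. The sketch makes several simplifying assumptions that go beyond the text: the candidate set for replacement excludes only the token itself, the partner index j for a swap is drawn uniformly for each position i, and entity labels (which in a real named entity recognition corpus must move together with their tokens) are left out. All function names are hypothetical.

```python
import random


def random_replace(tokens, vocab, beta2, rng):
    """Replace each token, with probability beta2, by a uniformly chosen different token."""
    out = []
    for tok in tokens:
        choices = [v for v in vocab if v != tok]
        if choices and rng.random() < beta2:
            out.append(rng.choice(choices))
        else:
            out.append(tok)
    return out


def random_swap(tokens, beta3, rng):
    """For each position i, with probability beta3, swap it with a random position j."""
    out = list(tokens)
    for i in range(len(out)):
        if len(out) > 1 and rng.random() < beta3:
            j = rng.randrange(len(out))
            out[i], out[j] = out[j], out[i]
    return out


def random_delete(tokens, beta4, rng):
    """Drop each token independently with probability beta4."""
    return [tok for tok in tokens if rng.random() >= beta4]


def augment(tokens, vocab, beta2, beta3, beta4, seed=None):
    """Apply replacement, then swap, then deletion, in the order given in the example."""
    rng = random.Random(seed)
    tokens = random_replace(tokens, vocab, beta2, rng)
    tokens = random_swap(tokens, beta3, rng)
    return random_delete(tokens, beta4, rng)


sentence = ["Sun Wukong", "was", "pinned", "under", "Five Elements Mountain"]
vocab = ["Sun Wukong", "Tathagata Buddha", "Bull Demon King", "Five Elements Mountain"]
processed = augment(sentence, vocab, beta2=0.1, beta3=0.1, beta4=0.05, seed=42)
```

Setting all three probabilities to 0 returns the sentence unchanged, which gives a simple sanity check before the probabilities are handed to the optimizer.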
S38: Splice the sentences in the processed data so that the sentence length after processing equals the set sentence length.
After the processed data is obtained, the sentences in the processed data are spliced so that the sentence length after processing equals the set sentence length.
For example, suppose the sentence length set by the long sentence construction method is 100. The length of every sentence in the processed data is counted and the 90th percentile of the sentence lengths is obtained. Sentences whose length is less than or equal to the 90th percentile are paired two by two and spliced into longer sentences (the order of the two sentences in a pair is random). The part of each spliced sentence beyond length 100 is then deleted, so that each sentence in the processed data conforms to the set sentence length of 100.
In this embodiment, the set sentence length of 100 and the pairing of sentences whose length is less than or equal to the 90th percentile are only an illustration. In other embodiments, the sentence length set by the long sentence construction method may be another value, and sentences may be selected for pairing by another percentile, which is not repeated here.
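The long sentence construction in this example can be sketched in Python as follows. The sketch assumes sentences are lists of tokens, uses `statistics.quantiles` with the inclusive method as one possible percentile definition, and returns only the spliced sentences (an unpaired leftover sentence and sentences above the percentile are not included in the output). The function name and parameters are hypothetical.

```python
import random
import statistics


def build_long_sentences(sentences, max_len=100, percentile=90, seed=None):
    """Pair up short sentences, concatenate each pair, and truncate to max_len."""
    rng = random.Random(seed)
    lengths = [len(s) for s in sentences]
    # quantiles(n=100) returns the 99 percentile cut points; index p - 1 is the p-th.
    threshold = statistics.quantiles(lengths, n=100, method="inclusive")[percentile - 1]
    short = [s for s in sentences if len(s) <= threshold]
    # Shuffling randomises both which sentences are paired and their order in the pair.
    rng.shuffle(short)
    spliced = []
    for first, second in zip(short[0::2], short[1::2]):
        # Delete the part of the spliced sentence that exceeds max_len.
        spliced.append((first + second)[:max_len])
    return spliced


corpus = [["a"] * 60, ["b"] * 70, ["c"] * 10, ["d"] * 20, ["e"] * 200]
long_sentences = build_long_sentences(corpus, max_len=100, percentile=90, seed=0)
```

In the toy corpus the 200-token sentence lies above the 90th percentile, so it is left out of the pairing and the four shorter sentences form two spliced sentences of at most 100 tokens each.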
In this embodiment, after synonym replacement is performed on the entity words in the original training data, the random replacement probability, the random deletion probability, the random swap probability, and the sentence length set by the long sentence construction method are determined in the optimization parameter list. Entity word replacement is performed on each sentence in the original training data according to the random replacement probability, entity words within each sentence are swapped according to the random swap probability, and entity words are deleted from each sentence according to the random deletion probability to obtain the processed data. The sentences in the processed data are then spliced so that the sentence length after processing equals the set sentence length. This further refines the step of converting the original training data with each optimization parameter list. Converting the original training data with several data augmentation methods further increases the diversity of the artificially constructed data and ensures the accuracy of the training set of the recognition model.
In an embodiment, the data augmentation methods include the synonym replacement method. As shown in FIG. 7, step S50, that is, determining according to the test results whether a convergent model exists among the multiple recognition models, specifically includes the following steps:
S51: Determine the highest recognition score among the multiple recognition models for recognizing each word in the test set.
After the multiple recognition models are tested with the original test data as the test set, the highest recognition score among the multiple recognition models for recognizing each word in the test set is determined.
The score of a recognition model for recognizing each word in the test set is determined by a formula in which score_t is the score of the recognition model for recognizing the t-th word in the test set, recall is the recall rate for entity words, and precision is the precision with which the recognition model recalls entity words.
For example, suppose there are three recognition models A, B, and C. After the three models are tested with the original test data as the test set, their recognition scores for the t-th word in the test set are 0.6, 0.8, and 0.9, respectively. The highest recognition score among the three models for recognizing the t-th word in the test set is therefore 0.9.
In this embodiment, the number of recognition models (3) and the recognition scores for the t-th word in the test set (0.6, 0.8, and 0.9) are only an illustration. In other embodiments, the number of recognition models and the recognition scores may take other values, which is not repeated here.
S52: Determine whether the highest recognition score satisfies the convergence condition.
After the highest recognition score among the multiple recognition models for recognizing each word in the test set is determined, it is determined whether that highest recognition score satisfies the convergence condition.
S53: If the highest recognition score satisfies the convergence condition, determine that a convergent model satisfying the convergence condition exists among the multiple recognition models; the recognition model corresponding to the highest recognition score is the convergent model.
If the highest recognition score satisfies the convergence condition, the recognition effect of an existing recognition model meets the requirement. It is then determined that a convergent model satisfying the convergence condition exists among the multiple recognition models, the recognition model corresponding to the highest recognition score is the convergent model, and the optimization parameter list corresponding to the convergent model can be used as the target data augmentation parameter list.
S54: If the highest recognition score does not satisfy the convergence condition, determine that no convergent model satisfying the convergence condition exists among the multiple recognition models.
If the highest recognition score does not satisfy the convergence condition, no recognition model has a recognition effect that meets the requirement. It is then determined that no convergent model satisfying the convergence condition exists among the multiple recognition models, the optimization parameter lists of the current round are not usable, and another round of iterative optimization with the artificial fish swarm algorithm is required.
In this embodiment, the highest recognition score among the multiple recognition models for recognizing each word in the test set is determined, and it is determined whether that score satisfies the convergence condition. If it does, a convergent model satisfying the convergence condition exists among the multiple recognition models, and the recognition model corresponding to the highest recognition score is the convergent model. If it does not, no convergent model exists among the multiple recognition models. This specifies the process of judging whether a convergent model exists among the multiple recognition models. The recognition effect of a recognition model on the test set serves as the concentration in the artificial fish swarm algorithm and as the objective for optimizing the parameters of the data augmentation model, so that an effective data augmentation strategy is obtained at a small cost.
In an embodiment, the data augmentation methods include the synonym replacement method. As shown in FIG. 8, step S52, that is, determining whether the highest recognition score satisfies the convergence condition, specifically includes the following steps:
S521: Determine the convergence parameter configured by the user.
S522: Determine the first highest recognition score, which is the highest score among the multiple recognition models for recognizing the t-th word in the test set;
S523: Determine the second highest recognition score, which is the highest score among the multiple recognition models for recognizing the (t-1)-th word in the test set;
S524: Subtract the second highest recognition score from the first highest recognition score to obtain the highest recognition score difference;
S525: Determine whether the ratio of the highest recognition score difference to the second highest recognition score is less than the convergence parameter;
S526: If the ratio of the highest recognition score difference to the second highest recognition score is less than the convergence parameter, determine that the highest recognition score satisfies the convergence condition;
S527: If the ratio of the highest recognition score difference to the second highest recognition score is not less than the convergence parameter, determine that the highest recognition score does not satisfy the convergence condition.
After the highest recognition score among the multiple recognition models for recognizing each word in the test set is determined, whether the highest recognition score of the multiple recognition models satisfies the convergence condition is determined by the following formula:
$$\frac{\mathrm{maxscore}_{t} - \mathrm{maxscore}_{t-1}}{\mathrm{maxscore}_{t-1}} < \alpha$$
where $\mathrm{maxscore}_{t}$ is the highest recognition score among the multiple recognition models for the t-th word in the test set, that is, the first highest recognition score; $\mathrm{maxscore}_{t-1}$ is the highest recognition score among the multiple recognition models for the (t-1)-th word in the test set, that is, the second highest recognition score; and $\alpha$ is the convergence parameter configured by the user (for example, 0.01).
In the above formula, the highest recognition score difference $\mathrm{maxscore}_{t} - \mathrm{maxscore}_{t-1}$ between the first highest recognition score and the second highest recognition score is divided by the second highest recognition score $\mathrm{maxscore}_{t-1}$ to obtain the convergence value. If the convergence value is less than the convergence parameter $\alpha$, the highest recognition score is determined to satisfy the convergence condition; if the convergence value is not less than the convergence parameter $\alpha$, the highest recognition score is determined not to satisfy the convergence condition.
In this embodiment, the convergence parameter configured by the user is determined, together with the first highest recognition score (the highest score among the multiple recognition models for the t-th word in the test set) and the second highest recognition score (the highest score for the (t-1)-th word in the test set). The second highest recognition score is subtracted from the first to obtain the highest recognition score difference, and it is determined whether the ratio of this difference to the second highest recognition score is less than the convergence parameter. If the ratio is less than the convergence parameter, the highest recognition score satisfies the convergence condition; otherwise it does not. This specifies the process of determining whether the highest recognition score satisfies the convergence condition and provides the basis for judging, from the highest recognition score, whether a model has converged.
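Steps S521 to S527, combined with the selection of the best-scoring model in S51 and S53, can be sketched in Python as follows. The function names and the dictionary of per-model scores are assumptions for illustration, and the guard against a zero denominator is an addition that the text does not discuss. As in the text, the signed difference is used, without an absolute value.

```python
def is_converged(max_score_t, max_score_prev, alpha=0.01):
    """Return True if (maxscore_t - maxscore_{t-1}) / maxscore_{t-1} < alpha."""
    if max_score_prev == 0:
        # Assumption: treat an undefined ratio as "not converged".
        return False
    return (max_score_t - max_score_prev) / max_score_prev < alpha


def select_convergent_model(scores_t, max_score_prev, alpha=0.01):
    """Return the name of the best-scoring model if it meets the condition, else None.

    scores_t: dict of model name -> recognition score for the t-th word.
    max_score_prev: highest recognition score for the (t-1)-th word.
    """
    best_model = max(scores_t, key=scores_t.get)  # S51: highest recognition score
    if is_converged(scores_t[best_model], max_score_prev, alpha):  # S52
        return best_model  # S53: this model is the convergent model
    return None  # S54: no convergent model; another optimization round is needed


scores_t = {"A": 0.6, "B": 0.8, "C": 0.9}
converged = select_convergent_model(scores_t, max_score_prev=0.895, alpha=0.01)
not_converged = select_convergent_model(scores_t, max_score_prev=0.8, alpha=0.01)
```

With the scores from the earlier example, a previous highest score of 0.895 gives a ratio of about 0.0056, below α = 0.01, so model C is returned; a previous highest score of 0.8 gives a ratio of 0.125 and no model is returned.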
It should be understood that the sequence numbers of the steps in the above embodiments do not imply an order of execution. The order in which the processes are executed is determined by their functions and internal logic, and the numbering does not limit the implementation of the embodiments of the present application in any way.
In one embodiment, a data augmentation apparatus based on a deep learning model is provided, which corresponds one-to-one to the data augmentation method based on a deep learning model in the above embodiments. As shown in FIG. 9, the apparatus includes an acquisition module 901, an initialization module 902, a conversion module 903, a test module 904, an output module 905 and an augmentation module 906. The functional modules are described in detail as follows:
The acquisition module 901 is configured to acquire manually annotated original training data and original test data, and to acquire an original parameter list, the original parameter list consisting of data augmentation methods and the augmentation parameters corresponding to those methods;
The initialization module 902 is configured to randomly initialize the augmentation parameters in the original parameter list according to an artificial fish swarm algorithm, to obtain multiple optimized parameter lists;
The conversion module 903 is configured to transform the original training data with each optimized parameter list to obtain corresponding artificially constructed data, and to mix the original training data with the corresponding artificially constructed data to obtain multiple training sets;
The test module 904 is configured to train multiple recognition models respectively on the multiple training sets, and to test the multiple recognition models using the original test data as a test set, so as to determine whether a model satisfying the convergence condition exists among the multiple recognition models;
The output module 905 is configured to, if a model satisfying the convergence condition exists among the multiple recognition models, output the optimized parameter list corresponding to that model as the target data augmentation parameter list;
The augmentation module 906 is configured to perform data augmentation on the original training data with the target data augmentation parameter list, to obtain a training set for a named entity recognition model.
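The work of the initialization module 902 and the conversion module 903 can be illustrated with a minimal Python sketch. It is not the applicant's implementation: the per-parameter `(low, high)` ranges, the function names `init_param_lists` and `build_training_sets`, and the `transform` callable are all assumptions made for illustration, and the sketch covers only the random initialization step, not the further behaviours of an artificial fish swarm algorithm.

```python
import random
from typing import Callable, Dict, List, Sequence, Tuple

ParamList = Dict[str, float]  # hypothetical: augmentation parameter name -> value


def init_param_lists(
    ranges: Dict[str, Tuple[float, float]], n: int, rng: random.Random
) -> List[ParamList]:
    """Draw n candidate parameter lists uniformly from assumed (low, high) ranges."""
    return [
        {name: rng.uniform(low, high) for name, (low, high) in ranges.items()}
        for _ in range(n)
    ]


def build_training_sets(
    train: Sequence[str],
    param_lists: Sequence[ParamList],
    transform: Callable[[Sequence[str], ParamList], List[str]],
) -> List[List[str]]:
    """Mix the original data with the data constructed under each parameter list."""
    return [list(train) + transform(train, params) for params in param_lists]
```

Each returned training set would then be used to train one recognition model, so that the models can be compared on the same original test data.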
Further, the data augmentation apparatus based on a deep learning model also includes a loop module 907. After it has been determined whether a model satisfying the convergence condition exists among the multiple recognition models, the loop module 907 is specifically configured to:
If no model satisfying the convergence condition exists among the multiple recognition models, randomly initialize the augmentation parameters in the original parameter list again according to the artificial fish swarm algorithm, to obtain multiple re-initialized optimized parameter lists, and count the initialization;
Determine whether the number of random initializations of the augmentation parameters in the original parameter list is less than a preset number;
If the number of random initializations is not less than the preset number, stop randomly initializing the augmentation parameters in the original parameter list;
If the number of random initializations is less than the preset number, train multiple new recognition models according to the re-initialized optimized parameter lists, and test the new recognition models to obtain the target data augmentation parameter list, and use the target data augmentation parameter list to obtain the training set of the named entity recognition model.
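The bounded re-initialization performed by the loop module 907 can be sketched as follows. This is only an illustration under stated assumptions: `initialize` and `evaluate` are hypothetical callables standing in for the initialization module and for the train-and-test step, `evaluate` is assumed to return `None` when no model converges, and the first initialization is assumed to count toward the preset number (the text does not say whether it does).

```python
from typing import Callable, List, Optional, TypeVar

P = TypeVar("P")


def search_target_params(
    initialize: Callable[[], List[P]],
    evaluate: Callable[[List[P]], Optional[P]],
    max_inits: int,
) -> Optional[P]:
    """Return the parameter list of a converged model, or None once the
    number of random initializations reaches max_inits."""
    param_lists = initialize()
    count = 1  # assumption: the first initialization is counted as well
    while True:
        best = evaluate(param_lists)  # train and test one model per list
        if best is not None:
            return best
        param_lists = initialize()  # no converged model: re-initialize
        count += 1
        if count >= max_inits:  # not less than the preset number: stop
            return None
```

The preset number simply caps the search cost; when the function returns `None`, the caller has to decide how to proceed, which the text leaves open.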
Further, the data augmentation methods include a synonym replacement method, and the conversion module 903 is specifically configured to:
Determine, in the optimized parameter list, the augmentation parameters corresponding to the synonym replacement method, which include an entity word category replacement probability and an entity word replacement category;
Acquire a preset synonym dictionary built in advance by the user according to requirements, in which entity words of the same entity category whose synonym relationship has not been prohibited are treated as synonyms of one another;
Perform synonym replacement on the entity words in the original training data according to the preset synonym dictionary, the entity word category replacement probability and the entity word replacement category.
Further, the conversion module 903 is also specifically configured to:
Determine whether the category of each entity word in the original training data belongs to the entity word replacement category;
If the category of an entity word in the original training data belongs to the entity word replacement category, look up the synonyms of that entity word in the preset synonym dictionary;
Determine whether the synonym relationship between the entity word and each of its synonyms has been prohibited;
If the synonym relationship between the entity word and a synonym has not been prohibited, select, with the entity word category replacement probability, one such synonym from the preset synonym dictionary as the replacement word, and replace the entity word with the replacement word.
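The synonym replacement steps above can be sketched in Python as follows. The data layout is an assumption made for illustration: a sentence is a list of `(word, category)` pairs with `"O"` marking non-entity tokens, the dictionary maps a word to its candidate synonyms, and prohibited pairs are stored as `frozenset`s. None of these names comes from the application.

```python
import random
from typing import Dict, FrozenSet, List, Set, Tuple

Token = Tuple[str, str]  # (word, entity category); "O" marks a non-entity token


def synonym_replace(
    sentence: List[Token],
    synonyms: Dict[str, List[str]],
    replace_categories: Set[str],
    replace_prob: float,
    forbidden: Set[FrozenSet[str]],
    rng: random.Random,
) -> List[Token]:
    """Replace entity words of the selected categories by an allowed synonym."""
    out: List[Token] = []
    for word, category in sentence:
        if category in replace_categories:
            # Keep only synonyms whose relationship to the word is not prohibited.
            candidates = [
                s for s in synonyms.get(word, [])
                if frozenset((word, s)) not in forbidden
            ]
            if candidates and rng.random() < replace_prob:
                word = rng.choice(candidates)
        out.append((word, category))  # the entity label is left unchanged
    return out
```

Because the replacement word comes from the same entity category, the original annotation can be reused for the constructed sentence without relabeling.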
Further, the data augmentation methods also include a random replacement method, a random deletion method, a random swap method and a long-sentence construction method, and the conversion module 903 is also specifically configured to:
Determine, in the optimized parameter list, the random replacement probability of the random replacement method and the random deletion probability of the random deletion method;
Determine the random swap probability of the random swap method and the sentence length set for the long-sentence construction method;
Perform entity word replacement on each sentence in the original training data according to the random replacement probability, and swap entity words within the same sentence for each sentence in the original training data according to the random swap probability;
Perform entity word deletion on each sentence in the original training data according to the random deletion probability, to obtain processed data;
Concatenate the sentences in the processed data so that the length of each resulting sentence equals the set sentence length.
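A minimal sketch of three of these operations (random deletion, same-sentence swap, and long-sentence construction) is given below; random replacement is omitted because it follows the same pattern as deletion. The `(word, category)` token layout with `"O"` for non-entity tokens, the function names, and the choice to cut concatenated text into chunks of exactly `target_len` tokens are assumptions, since the text does not specify how the set sentence length is reached.

```python
import random
from typing import List, Tuple

Token = Tuple[str, str]  # (word, entity category); "O" marks a non-entity token


def random_delete(sentence: List[Token], p_delete: float, rng: random.Random) -> List[Token]:
    """Drop each entity token with probability p_delete; keep non-entity tokens."""
    return [t for t in sentence if t[1] == "O" or rng.random() >= p_delete]


def random_swap(sentence: List[Token], p_swap: float, rng: random.Random) -> List[Token]:
    """With probability p_swap, exchange two entity tokens of the same sentence."""
    entity_idx = [i for i, (_, cat) in enumerate(sentence) if cat != "O"]
    out = list(sentence)
    if len(entity_idx) >= 2 and rng.random() < p_swap:
        i, j = rng.sample(entity_idx, 2)
        out[i], out[j] = out[j], out[i]
    return out


def build_long_sentences(sentences: List[List[Token]], target_len: int) -> List[List[Token]]:
    """Concatenate sentences and cut the stream into chunks of target_len tokens.

    Assumption: a shorter final chunk is kept rather than discarded.
    """
    result: List[List[Token]] = []
    current: List[Token] = []
    for sentence in sentences:
        current.extend(sentence)
        while len(current) >= target_len:
            result.append(current[:target_len])
            current = current[target_len:]
    if current:
        result.append(current)
    return result
```

Cutting at a fixed token count can split a multi-token entity across two chunks; a fuller implementation would probably cut only at sentence boundaries.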
Further, the test module 904 is specifically configured to:
Determine the highest recognition score obtained by the multiple recognition models when recognizing each word in the test set;
Determine whether the highest recognition score satisfies the convergence condition;
If the highest recognition score satisfies the convergence condition, determine that a converged model satisfying the convergence condition exists among the multiple recognition models, the recognition model corresponding to the highest recognition score being the converged model;
If the highest recognition score does not satisfy the convergence condition, determine that no converged model satisfying the convergence condition exists among the multiple recognition models.
Further, the test module 904 is also specifically configured to:
Determine the convergence parameter configured by the user;
Determine the first highest recognition score obtained by the multiple recognition models when recognizing the t-th word in the test set;
Determine the second highest recognition score obtained by the multiple recognition models when recognizing the (t-1)-th word in the test set;
Subtract the second highest recognition score from the first highest recognition score to obtain the highest-recognition-score difference;
Determine whether the ratio of the highest-recognition-score difference to the second highest recognition score is less than the convergence parameter;
If the ratio of the highest-recognition-score difference to the second highest recognition score is less than the convergence parameter, determine that the highest recognition score satisfies the convergence condition;
If the ratio of the highest-recognition-score difference to the second highest recognition score is not less than the convergence parameter, determine that the highest recognition score does not satisfy the convergence condition.
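The convergence test above reduces to a relative-change check between two consecutive highest scores. The sketch below follows the text literally: it uses the signed difference, since no absolute value is stated. The function name and the guard against a zero previous score are assumptions added for illustration.

```python
def satisfies_convergence(score_t: float, score_prev: float, epsilon: float) -> bool:
    """Check (score_t - score_prev) / score_prev < epsilon.

    score_t    -- highest recognition score for the t-th word
    score_prev -- highest recognition score for the (t-1)-th word
    epsilon    -- user-configured convergence parameter
    """
    if score_prev == 0:
        # Assumption: the ratio is undefined here, so report "not converged".
        return False
    return (score_t - score_prev) / score_prev < epsilon
```

For example, with a convergence parameter of 0.01, scores of 0.900 followed by 0.905 give a ratio of about 0.0056 and satisfy the condition, while 0.90 followed by 0.95 gives about 0.056 and does not.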
For the specific limitations of the data augmentation apparatus based on a deep learning model, reference may be made to the limitations of the data augmentation method based on a deep learning model set out above, which are not repeated here. Each module of the apparatus may be implemented in whole or in part by software, by hardware, or by a combination of the two. The modules may be embedded in, or independent of, a processor of a computer device in hardware form, or stored in a memory of the computer device in software form, so that the processor can call them and perform the operations corresponding to each module.
In one embodiment, a computer device is provided. The computer device may be a server, and its internal structure may be as shown in FIG. 10. The computer device includes a processor, a memory, a network interface and a database connected by a system bus. The processor provides computing and control capabilities. The memory includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, computer-readable instructions and the database. The internal memory provides an environment for running the operating system and the computer-readable instructions stored in the non-volatile storage medium. The database stores the data used or generated by the data augmentation method, such as the original training data, the original test data, the original parameter list, the artificially constructed data, the optimized parameter lists and the multiple recognition models. The network interface is used to communicate with external terminals over a network connection. When executed by the processor, the computer-readable instructions implement a data augmentation method based on a deep learning model.
In one embodiment, a computer device is provided, including a memory, a processor, and computer-readable instructions stored in the memory and executable on the processor. When the processor executes the computer-readable instructions, the steps of the above data augmentation method based on a deep learning model are implemented.
In one embodiment, a computer-readable storage medium is provided, on which computer-readable instructions are stored. When the computer-readable instructions are executed by a processor, the steps of the above data augmentation method based on a deep learning model are implemented.
Those of ordinary skill in the art will understand that all or part of the processes of the methods in the above embodiments can be carried out by computer-readable instructions that direct the relevant hardware. The computer-readable instructions may be stored in a non-volatile computer-readable storage medium and, when executed, may include the processes of the method embodiments described above. Any reference to memory, storage, a database or another medium in the embodiments provided in this application may include non-volatile and/or volatile memory. Non-volatile memory may include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM) or flash memory. Volatile memory may include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in many forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).
Those skilled in the art will clearly understand that, for convenience and brevity of description, only the above division into functional units and modules is used as an example. In practical applications, the above functions may be allocated to different functional units or modules as required; that is, the internal structure of the apparatus may be divided into different functional units or modules to perform all or part of the functions described above.
The above embodiments are intended only to illustrate the technical solutions of the present application, not to limit them. Although the present application has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that the technical solutions described in those embodiments may still be modified, or some of their technical features replaced by equivalents. Such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present application, and shall all fall within the scope of protection of the present application.
Claims (20)
- A data augmentation method based on a deep learning model, comprising:
  acquiring manually annotated original training data and original test data, and acquiring an original parameter list, the original parameter list consisting of data augmentation methods and the augmentation parameters corresponding to the data augmentation methods;
  randomly initializing the augmentation parameters in the original parameter list according to an artificial fish swarm algorithm to obtain a plurality of optimized parameter lists;
  transforming the original training data with each of the optimized parameter lists to obtain corresponding artificially constructed data, and mixing the original training data with the corresponding artificially constructed data to obtain a plurality of training sets;
  training a plurality of recognition models respectively on the plurality of training sets, and testing the plurality of recognition models using the original test data as a test set, to determine whether a model satisfying a convergence condition exists among the plurality of recognition models;
  if a model satisfying the convergence condition exists among the plurality of recognition models, outputting the optimized parameter list corresponding to the model satisfying the convergence condition as a target data augmentation parameter list;
  performing data augmentation on the original training data with the target data augmentation parameter list to obtain a training set for a named entity recognition model.
- The data augmentation method based on a deep learning model according to claim 1, wherein after the determining whether a model satisfying the convergence condition exists among the plurality of recognition models, the method further comprises:
  if no model satisfying the convergence condition exists among the plurality of recognition models, randomly initializing the augmentation parameters in the original parameter list again according to the artificial fish swarm algorithm to obtain a plurality of re-initialized optimized parameter lists, and counting the initialization;
  determining whether the number of random initializations of the augmentation parameters in the original parameter list is less than a preset number;
  if the number of random initializations is not less than the preset number, stopping the random initialization of the augmentation parameters in the original parameter list;
  if the number of random initializations is less than the preset number, training a plurality of new recognition models according to the re-initialized optimized parameter lists, testing the plurality of new recognition models to obtain the target data augmentation parameter list, and using the target data augmentation parameter list to obtain the training set of the named entity recognition model.
- The data augmentation method based on a deep learning model according to claim 1, wherein the data augmentation methods include a synonym replacement method, and the transforming the original training data with each of the optimized parameter lists comprises:
  determining, in the optimized parameter list, the augmentation parameters corresponding to the synonym replacement method, the augmentation parameters corresponding to the synonym replacement method including an entity word category replacement probability and an entity word replacement category;
  acquiring a preset synonym dictionary built in advance by a user according to requirements, wherein in the preset synonym dictionary, entity words of the same entity category whose synonym relationship has not been prohibited are treated as synonyms of one another;
  performing synonym replacement on the entity words in the original training data according to the preset synonym dictionary, the entity word category replacement probability and the entity word replacement category.
- The data augmentation method based on a deep learning model according to claim 3, wherein the performing synonym replacement on the entity words in the original training data according to the preset synonym dictionary, the entity word category replacement probability and the entity word replacement category comprises:
  determining whether the category of each entity word in the original training data belongs to the entity word replacement category;
  if the category of an entity word in the original training data belongs to the entity word replacement category, looking up the synonyms of the entity word in the preset synonym dictionary;
  determining whether the synonym relationship between the entity word and a synonym of the entity word has been prohibited;
  if the synonym relationship between the entity word and the synonym of the entity word has not been prohibited, selecting, with the entity word category replacement probability, one such synonym from the preset synonym dictionary as a replacement word, and replacing the entity word with the replacement word.
- The data augmentation method based on a deep learning model according to claim 4, wherein the data augmentation methods further include a random replacement method, a random deletion method, a random swap method and a long-sentence construction method, and after the synonym replacement of the entity words in the original training data, the method further comprises:
  determining, in the optimized parameter list, the random replacement probability of the random replacement method and the random deletion probability of the random deletion method;
  determining the random swap probability of the random swap method and the sentence length set for the long-sentence construction method;
  performing entity word replacement on each sentence in the original training data according to the random replacement probability, and swapping entity words within the same sentence for each sentence in the original training data according to the random swap probability;
  performing entity word deletion on each sentence in the original training data according to the random deletion probability to obtain processed data;
  concatenating the sentences in the processed data so that the length of each resulting sentence equals the set sentence length.
- The data augmentation method based on a deep learning model according to any one of claims 1-5, wherein the determining whether a converged model exists among the plurality of recognition models comprises:
  determining the highest recognition score obtained by the plurality of recognition models when recognizing each word in the test set;
  determining whether the highest recognition score satisfies the convergence condition;
  if the highest recognition score satisfies the convergence condition, determining that a converged model satisfying the convergence condition exists among the plurality of recognition models, the recognition model corresponding to the highest recognition score being the converged model;
  if the highest recognition score does not satisfy the convergence condition, determining that no converged model satisfying the convergence condition exists among the plurality of recognition models.
- The data augmentation method based on a deep learning model according to claim 6, wherein the determining whether the highest recognition score satisfies the convergence condition comprises:
  determining a convergence parameter configured by a user;
  determining a first highest recognition score obtained by the plurality of recognition models when recognizing the t-th word in the test set;
  determining a second highest recognition score obtained by the plurality of recognition models when recognizing the (t-1)-th word in the test set;
  subtracting the second highest recognition score from the first highest recognition score to obtain a highest-recognition-score difference;
  determining whether the ratio of the highest-recognition-score difference to the second highest recognition score is less than the convergence parameter;
  if the ratio of the highest-recognition-score difference to the second highest recognition score is less than the convergence parameter, determining that the highest recognition score satisfies the convergence condition;
  if the ratio of the highest-recognition-score difference to the second highest recognition score is not less than the convergence parameter, determining that the highest recognition score does not satisfy the convergence condition.
- A data augmentation apparatus based on a deep learning model, comprising:
  an acquisition module, configured to acquire manually annotated original training data and original test data, and to acquire an original parameter list, the original parameter list consisting of data augmentation methods and the augmentation parameters corresponding to the data augmentation methods;
  an initialization module, configured to randomly initialize the augmentation parameters in the original parameter list according to an artificial fish swarm algorithm to obtain a plurality of optimized parameter lists;
  a conversion module, configured to transform the original training data with each of the optimized parameter lists to obtain corresponding artificially constructed data, and to mix the original training data with the corresponding artificially constructed data to obtain a plurality of training sets;
  a test module, configured to train a plurality of recognition models respectively on the plurality of training sets, and to test the plurality of recognition models using the original test data as a test set, to determine whether a model satisfying a convergence condition exists among the plurality of recognition models;
  an output module, configured to, if a model satisfying the convergence condition exists among the plurality of recognition models, output the optimized parameter list corresponding to the model satisfying the convergence condition as a target data augmentation parameter list;
  an augmentation module, configured to perform data augmentation on the original training data with the target data augmentation parameter list to obtain a training set for a named entity recognition model.
- A computer device, comprising a memory, a processor, and computer-readable instructions stored in the memory and executable on the processor, wherein the processor implements the following steps when executing the computer-readable instructions:
  acquiring manually annotated original training data and original test data, and acquiring an original parameter list, the original parameter list consisting of data augmentation methods and the augmentation parameters corresponding to the data augmentation methods;
  randomly initializing the augmentation parameters in the original parameter list according to an artificial fish swarm algorithm to obtain a plurality of optimized parameter lists;
  transforming the original training data with each of the optimized parameter lists to obtain corresponding artificially constructed data, and mixing the original training data with the corresponding artificially constructed data to obtain a plurality of training sets;
  training a plurality of recognition models respectively on the plurality of training sets, and testing the plurality of recognition models using the original test data as a test set, to determine whether a model satisfying a convergence condition exists among the plurality of recognition models;
  if a model satisfying the convergence condition exists among the plurality of recognition models, outputting the optimized parameter list corresponding to the model satisfying the convergence condition as a target data augmentation parameter list;
  performing data augmentation on the original training data with the target data augmentation parameter list to obtain a training set for a named entity recognition model.
- The computer device according to claim 9, wherein after the determining whether a model satisfying the convergence condition exists among the plurality of recognition models, the processor further implements the following steps when executing the computer-readable instructions:
  if no model satisfying the convergence condition exists among the plurality of recognition models, randomly initializing the augmentation parameters in the original parameter list again according to the artificial fish swarm algorithm to obtain a plurality of re-initialized optimized parameter lists, and counting the initialization;
  determining whether the number of random initializations of the augmentation parameters in the original parameter list is less than a preset number;
  if the number of random initializations is not less than the preset number, stopping the random initialization of the augmentation parameters in the original parameter list;
  if the number of random initializations is less than the preset number, training a plurality of new recognition models according to the re-initialized optimized parameter lists, testing the plurality of new recognition models to obtain the target data augmentation parameter list, and using the target data augmentation parameter list to obtain the training set of the named entity recognition model.
- The computer device according to claim 9, wherein the data augmentation method comprises a synonym replacement method, and transforming the original training data using each of the optimized parameter lists comprises:
determining, in the optimized parameter list, the augmentation parameters corresponding to the synonym replacement method, the augmentation parameters corresponding to the synonym replacement method comprising an entity word category replacement probability and an entity word replacement category;
obtaining a preset synonym dictionary pre-built by a user according to requirements, wherein, in the preset synonym dictionary, entity words of the same entity category whose synonym relationship is not prohibited are treated as synonyms of one another;
performing synonym replacement on the entity words in the original training data according to the preset synonym dictionary, the entity word category replacement probability and the entity word replacement category.
- The computer device according to claim 11, wherein performing synonym replacement on the entity words in the original training data according to the preset synonym dictionary, the entity word category replacement probability and the entity word replacement category comprises:
determining whether the category of each entity word in the original training data belongs to the entity word replacement category;
if the category of an entity word in the original training data belongs to the entity word replacement category, looking up synonyms of the entity word in the preset synonym dictionary;
determining whether the synonym relationship between the entity word and a synonym of the entity word is prohibited;
if the synonym relationship between the entity word and the synonym of the entity word is not prohibited, selecting, with the entity word category replacement probability, one of the synonyms from the preset synonym dictionary as a replacement word, and replacing the entity word with the replacement word.
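The dictionary construction and replacement steps of the two claims above can be sketched as follows. This is an illustrative sketch under assumptions: the token representation `(word, category)`, the function names, and the use of unordered pairs for prohibited synonym relationships are all hypothetical, since the claims do not fix a data format.

```python
import random
from typing import Dict, FrozenSet, List, Optional, Set, Tuple

# (word, entity category), with None for non-entity words (assumed format)
Token = Tuple[str, Optional[str]]


def build_synonym_dict(
    words_by_category: Dict[str, List[str]],
    prohibited: Set[FrozenSet[str]],
) -> Dict[str, List[str]]:
    """Treat entity words of one category as mutual synonyms unless prohibited."""
    synonyms: Dict[str, List[str]] = {}
    for words in words_by_category.values():
        for word in words:
            synonyms[word] = [
                other for other in words
                if other != word and frozenset((word, other)) not in prohibited
            ]
    return synonyms


def replace_synonyms(
    sentence: List[Token],
    synonyms: Dict[str, List[str]],
    prohibited: Set[FrozenSet[str]],
    replace_categories: Set[str],
    replace_prob: float,
    rng: random.Random,
) -> List[Token]:
    """Replace entity words of the selected categories with probability replace_prob."""
    result: List[Token] = []
    for word, category in sentence:
        if category in replace_categories:
            # Re-check the prohibition at replacement time, as the claim describes.
            allowed = [
                s for s in synonyms.get(word, [])
                if frozenset((word, s)) not in prohibited
            ]
            if allowed and rng.random() < replace_prob:
                word = rng.choice(allowed)
        result.append((word, category))
    return result
```

Because the entity label travels with the token, a replaced word keeps its annotation, which is what makes the constructed sentence usable as labelled training data.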
- The computer device according to claim 12, wherein the data augmentation method further comprises a random replacement method, a random deletion method, a random exchange method and a long-sentence construction method, and, after the synonym replacement is performed on the entity words in the original training data, the processor further implements the following steps when executing the computer-readable instructions:
determining, in the optimized parameter list, the random replacement probability of the random replacement method and the random deletion probability of the random deletion method;
determining the random exchange probability of the random exchange method and the sentence length set by the long-sentence construction method;
performing entity word replacement on each sentence in the original training data according to the random replacement probability, and exchanging entity words within the same sentence for each sentence in the original training data according to the random exchange probability;
performing entity word deletion on each sentence in the original training data according to the random deletion probability to obtain processed data;
splicing the sentences in the processed data so that each sentence has the set sentence length once processing is complete.
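The four additional operations can be sketched as follows. The claim does not spell out their details, so this sketch makes labelled assumptions: random replacement draws a word of the same entity category from a hypothetical vocabulary, random exchange swaps one pair of entity words per sentence, and splicing concatenates tokens and cuts them into chunks of exactly the set length, dropping a shorter remainder.

```python
import random
from typing import Dict, List, Optional, Tuple

Token = Tuple[str, Optional[str]]  # (word, entity category or None); assumed format
Sentence = List[Token]


def random_replace(sent: Sentence, vocab: Dict[str, List[str]],
                   prob: float, rng: random.Random) -> Sentence:
    """With probability prob, replace an entity word by one of the same category."""
    out: Sentence = []
    for word, cat in sent:
        if cat is not None and vocab.get(cat) and rng.random() < prob:
            word = rng.choice(vocab[cat])
        out.append((word, cat))
    return out


def random_swap(sent: Sentence, prob: float, rng: random.Random) -> Sentence:
    """With probability prob, exchange two entity words within the same sentence."""
    out = list(sent)
    entity_idx = [i for i, (_, cat) in enumerate(out) if cat is not None]
    if len(entity_idx) >= 2 and rng.random() < prob:
        i, j = rng.sample(entity_idx, 2)
        out[i], out[j] = out[j], out[i]
    return out


def random_delete(sent: Sentence, prob: float, rng: random.Random) -> Sentence:
    """Delete each entity word with probability prob; keep non-entity words."""
    return [tok for tok in sent if tok[1] is None or rng.random() >= prob]


def splice_to_length(sents: List[Sentence], length: int) -> List[Sentence]:
    """Concatenate sentences and cut them into chunks of exactly `length` tokens.

    Assumption: a trailing remainder shorter than `length` is dropped.
    """
    if length <= 0:
        raise ValueError("length must be positive")
    spliced: List[Sentence] = []
    buffer: Sentence = []
    for sent in sents:
        buffer.extend(sent)
        while len(buffer) >= length:
            spliced.append(buffer[:length])
            buffer = buffer[length:]
    return spliced
```

A real pipeline would probably cut only at sentence or entity boundaries so that no multi-token entity is split; the fixed-length cut here is only the simplest way to satisfy "each sentence has the set sentence length".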
- The computer device according to any one of claims 9-13, wherein determining whether a converged model exists among the plurality of recognition models comprises:
determining the highest recognition score achieved by the plurality of recognition models when recognizing the words in the test set;
determining whether the highest recognition score satisfies the convergence condition;
if the highest recognition score satisfies the convergence condition, determining that a converged model satisfying the convergence condition exists among the plurality of recognition models, the recognition model corresponding to the highest recognition score being the converged model;
if the highest recognition score does not satisfy the convergence condition, determining that no converged model satisfying the convergence condition exists among the plurality of recognition models.
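The convergence check reduces to taking the best score and testing it once. The sketch below assumes the convergence condition is "score at or above a threshold"; the claims do not define the condition, so both the threshold form and the function name are hypothetical.

```python
from typing import Dict, Optional


def find_converged_model(scores: Dict[str, float], threshold: float) -> Optional[str]:
    """Return the name of the best-scoring model if it meets the threshold.

    `scores` maps a model name to its recognition score on the test set.
    The threshold-style convergence condition is an assumption.
    """
    if not scores:
        return None
    best_name = max(scores, key=lambda name: scores[name])
    return best_name if scores[best_name] >= threshold else None
```

Only the highest score is compared with the condition, so when it fails no other model needs to be examined.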
- One or more readable storage media storing computer-readable instructions, wherein the computer-readable instructions, when executed by one or more processors, cause the one or more processors to perform the following steps:
obtaining manually annotated original training data and original test data, and obtaining an original parameter list, the original parameter list consisting of a data augmentation method and the augmentation parameters corresponding to the data augmentation method;
randomly initializing the augmentation parameters in the original parameter list according to an artificial fish swarm algorithm to obtain a plurality of optimized parameter lists;
transforming the original training data using each of the optimized parameter lists to obtain corresponding artificially constructed data, and mixing the original training data with the corresponding artificially constructed data to obtain a plurality of training sets;
training a plurality of recognition models respectively on the plurality of training sets, and testing the plurality of recognition models using the original test data as a test set to determine whether a model satisfying a convergence condition exists among the plurality of recognition models;
if a model satisfying the convergence condition exists among the plurality of recognition models, outputting the optimized parameter list corresponding to that model as a target data augmentation parameter list;
performing data augmentation on the original training data using the target data augmentation parameter list to obtain a training set for a named entity recognition model.
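One search round of the steps above (initialize candidate parameter lists, build one mixed training set per candidate, train, and score) can be sketched as follows. This is a sketch under stated assumptions: the parameter names and ranges are hypothetical, `augment`, `train_model` and `score_model` are caller-supplied placeholders, and only the random initialization is shown, not the further behaviours of a full artificial fish swarm algorithm.

```python
import random
from typing import Any, Callable, Dict, List, Sequence, Tuple

ParamList = Dict[str, float]
Sentence = List[str]

# Hypothetical search ranges; the claims do not state concrete bounds.
PARAM_RANGES: Dict[str, Tuple[float, float]] = {
    "synonym_replace_prob": (0.0, 1.0),
    "random_replace_prob": (0.0, 1.0),
    "random_delete_prob": (0.0, 1.0),
    "random_swap_prob": (0.0, 1.0),
    "sentence_length": (10.0, 100.0),
}


def init_param_lists(n_candidates: int, rng: random.Random) -> List[ParamList]:
    """Draw candidate parameter lists uniformly at random within the ranges."""
    return [
        {name: rng.uniform(lo, hi) for name, (lo, hi) in PARAM_RANGES.items()}
        for _ in range(n_candidates)
    ]


def evaluate_candidates(
    train: Sequence[Sentence],
    test: Sequence[Sentence],
    candidates: Sequence[ParamList],
    augment: Callable[[Sequence[Sentence], ParamList], List[Sentence]],
    train_model: Callable[[List[Sentence]], Any],
    score_model: Callable[[Any, Sequence[Sentence]], float],
) -> List[Tuple[float, ParamList]]:
    """Train one recognition model per candidate and score it on the test set."""
    results: List[Tuple[float, ParamList]] = []
    for params in candidates:
        constructed = augment(train, params)            # artificially constructed data
        training_set = list(train) + list(constructed)  # mix with the original data
        model = train_model(training_set)
        results.append((score_model(model, test), params))
    return results
```

The returned `(score, parameter list)` pairs are what a convergence check would consume; the parameter list of a model that meets the condition becomes the target data augmentation parameter list.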
- The readable storage medium according to claim 15, wherein, after determining whether a model satisfying the convergence condition exists among the plurality of recognition models, the computer-readable instructions, when executed by one or more processors, cause the one or more processors to further perform the following steps:
if no model satisfying the convergence condition exists among the plurality of recognition models, randomly initializing the augmentation parameters in the original parameter list again according to the artificial fish swarm algorithm to obtain a plurality of randomly initialized optimized parameter lists, and counting the initialization;
determining whether the number of random initializations performed on the augmentation parameters in the original parameter list is less than a preset number;
if the number of random initializations is not less than the preset number, stopping the random initialization of the augmentation parameters in the original parameter list;
if the number of random initializations is less than the preset number, training a plurality of new recognition models according to the randomly initialized optimized parameter lists, testing the plurality of new recognition models to obtain the target data augmentation parameter list, and obtaining the training set of the named entity recognition model using the target data augmentation parameter list.
- The readable storage medium according to claim 15, wherein the data augmentation method comprises a synonym replacement method, and transforming the original training data using each of the optimized parameter lists comprises:
determining, in the optimized parameter list, the augmentation parameters corresponding to the synonym replacement method, the augmentation parameters corresponding to the synonym replacement method comprising an entity word category replacement probability and an entity word replacement category;
obtaining a preset synonym dictionary pre-built by a user according to requirements, wherein, in the preset synonym dictionary, entity words of the same entity category whose synonym relationship is not prohibited are treated as synonyms of one another;
performing synonym replacement on the entity words in the original training data according to the preset synonym dictionary, the entity word category replacement probability and the entity word replacement category.
- The readable storage medium according to claim 17, wherein performing synonym replacement on the entity words in the original training data according to the preset synonym dictionary, the entity word category replacement probability and the entity word replacement category comprises:
determining whether the category of each entity word in the original training data belongs to the entity word replacement category;
if the category of an entity word in the original training data belongs to the entity word replacement category, looking up synonyms of the entity word in the preset synonym dictionary;
determining whether the synonym relationship between the entity word and a synonym of the entity word is prohibited;
if the synonym relationship between the entity word and the synonym of the entity word is not prohibited, selecting, with the entity word category replacement probability, one of the synonyms from the preset synonym dictionary as a replacement word, and replacing the entity word with the replacement word.
- The readable storage medium according to claim 18, wherein the data augmentation method further comprises a random replacement method, a random deletion method, a random exchange method and a long-sentence construction method, and, after the synonym replacement is performed on the entity words in the original training data, the computer-readable instructions, when executed by one or more processors, cause the one or more processors to further perform the following steps:
determining, in the optimized parameter list, the random replacement probability of the random replacement method and the random deletion probability of the random deletion method;
determining the random exchange probability of the random exchange method and the sentence length set by the long-sentence construction method;
performing entity word replacement on each sentence in the original training data according to the random replacement probability, and exchanging entity words within the same sentence for each sentence in the original training data according to the random exchange probability;
performing entity word deletion on each sentence in the original training data according to the random deletion probability to obtain processed data;
splicing the sentences in the processed data so that each sentence has the set sentence length once processing is complete.
- The readable storage medium according to any one of claims 15-19, wherein determining whether a converged model exists among the plurality of recognition models comprises:
determining the highest recognition score achieved by the plurality of recognition models when recognizing the words in the test set;
determining whether the highest recognition score satisfies the convergence condition;
if the highest recognition score satisfies the convergence condition, determining that a converged model satisfying the convergence condition exists among the plurality of recognition models, the recognition model corresponding to the highest recognition score being the converged model;
if the highest recognition score does not satisfy the convergence condition, determining that no converged model satisfying the convergence condition exists among the plurality of recognition models.
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110420110.3A CN113158652B (en) | 2021-04-19 | 2021-04-19 | Data enhancement method, device, equipment and medium based on deep learning model |
CN202110420110.3 | 2021-04-19 |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2022222224A1 (en) | 2022-10-27 |
Family
ID=76868692
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
PCT/CN2021/096475 WO2022222224A1 (en) | 2021-04-19 | 2021-05-27 | Deep learning model-based data augmentation method and apparatus, device, and medium |
Country Status (2)
Country | Link |
---|---|
CN (1) | CN113158652B (en) |
WO (1) | WO2022222224A1 (en) |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116244445A (en) * | 2022-12-29 | 2023-06-09 | 中国航空综合技术研究所 | Aviation text data labeling method and labeling system thereof |
CN116451690A (en) * | 2023-03-21 | 2023-07-18 | 麦博(上海)健康科技有限公司 | Medical field named entity identification method |
CN116501979A (en) * | 2023-06-30 | 2023-07-28 | 北京水滴科技集团有限公司 | Information recommendation method, information recommendation device, computer equipment and computer readable storage medium |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116911305A (en) * | 2023-09-13 | 2023-10-20 | 中博信息技术研究院有限公司 | Chinese address recognition method based on fusion model |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110543906A (en) * | 2019-08-29 | 2019-12-06 | 彭礼烨 | Skin type automatic identification method based on data enhancement and Mask R-CNN model |
US20200226212A1 (en) * | 2019-01-15 | 2020-07-16 | International Business Machines Corporation | Adversarial Training Data Augmentation Data for Text Classifiers |
CN111738004A (en) * | 2020-06-16 | 2020-10-02 | 中国科学院计算技术研究所 | Training method of named entity recognition model and named entity recognition method |
CN111738007A (en) * | 2020-07-03 | 2020-10-02 | 北京邮电大学 | Chinese named entity identification data enhancement algorithm based on sequence generation countermeasure network |
CN111832294A (en) * | 2020-06-24 | 2020-10-27 | 平安科技(深圳)有限公司 | Method and device for selecting marking data, computer equipment and storage medium |
CN112257441A (en) * | 2020-09-15 | 2021-01-22 | 浙江大学 | Named entity identification enhancement method based on counterfactual generation |
Family Cites Families (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109145965A (en) * | 2018-08-02 | 2019-01-04 | 深圳辉煌耀强科技有限公司 | Cell recognition method and device based on random forest disaggregated model |
US11568307B2 (en) * | 2019-05-20 | 2023-01-31 | International Business Machines Corporation | Data augmentation for text-based AI applications |
CN110516835A (en) * | 2019-07-05 | 2019-11-29 | 电子科技大学 | A kind of Multi-variable Grey Model optimization method based on artificial fish-swarm algorithm |
2021
- 2021-04-19 CN CN202110420110.3A patent/CN113158652B/en active Active
- 2021-05-27 WO PCT/CN2021/096475 patent/WO2022222224A1/en active Application Filing
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116244445A (en) * | 2022-12-29 | 2023-06-09 | 中国航空综合技术研究所 | Aviation text data labeling method and labeling system thereof |
CN116244445B (en) * | 2022-12-29 | 2023-12-12 | 中国航空综合技术研究所 | Aviation text data labeling method and labeling system thereof |
CN116451690A (en) * | 2023-03-21 | 2023-07-18 | 麦博(上海)健康科技有限公司 | Medical field named entity identification method |
CN116501979A (en) * | 2023-06-30 | 2023-07-28 | 北京水滴科技集团有限公司 | Information recommendation method, information recommendation device, computer equipment and computer readable storage medium |
Also Published As
Publication number | Publication date |
---|---|
CN113158652A (en) | 2021-07-23 |
CN113158652B (en) | 2024-03-19 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
WO2022222224A1 (en) | Deep learning model-based data augmentation method and apparatus, device, and medium | |
WO2022007823A1 (en) | Text data processing method and device | |
CN112765312B (en) | Knowledge graph question-answering method and system based on graph neural network embedded matching | |
CN110990559B (en) | Method and device for classifying text, storage medium and processor | |
WO2020143320A1 (en) | Method and apparatus for acquiring word vectors of text, computer device, and storage medium | |
CN112115267A (en) | Training method, device and equipment of text classification model and storage medium | |
CN113536795B (en) | Method, system, electronic device and storage medium for entity relation extraction | |
CN109710921B (en) | Word similarity calculation method, device, computer equipment and storage medium | |
CN117194637A (en) | Multi-level visual evaluation report generation method and device based on large language model | |
CN112380837A (en) | Translation model-based similar sentence matching method, device, equipment and medium | |
CN116051388A (en) | Automatic photo editing via language request | |
WO2021164302A1 (en) | Sentence vector generation method, apparatus, device and storage medium | |
CN109117474A (en) | Calculation method, device and the storage medium of statement similarity | |
Cheng et al. | A hierarchical multimodal attention-based neural network for image captioning | |
CN112861543A (en) | Deep semantic matching method and system for matching research and development supply and demand description texts | |
CN112016311A (en) | Entity identification method, device, equipment and medium based on deep learning model | |
CN112307048A (en) | Semantic matching model training method, matching device, equipment and storage medium | |
CN113469338B (en) | Model training method, model training device, terminal device and storage medium | |
Cruz et al. | A resource for studying chatino verbal morphology | |
WO2022116444A1 (en) | Text classification method and apparatus, and computer device and medium | |
Matthews et al. | Generalized robust counterparts for constraints with bounded and unbounded uncertain parameters | |
WO2021237928A1 (en) | Training method and apparatus for text similarity recognition model, and related device | |
CN113220996A (en) | Scientific and technological service recommendation method, device, equipment and storage medium based on knowledge graph | |
CN112446205A (en) | Sentence distinguishing method, device, equipment and storage medium | |
CN116610795B (en) | Text retrieval method and device |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
 | 121 | EP: the EPO has been informed by WIPO that EP was designated in this application | Ref document number: 21937450; Country of ref document: EP; Kind code of ref document: A1 |
 | NENP | Non-entry into the national phase | Ref country code: DE |
 | 122 | EP: PCT application non-entry in European phase | Ref document number: 21937450; Country of ref document: EP; Kind code of ref document: A1 |