WO2023071745A1 - Information labeling method, model training method, electronic device and storage medium - Google Patents

Publication number
WO2023071745A1
Authority
WO
WIPO (PCT)
Prior art keywords
information
entity
labeling
model
relationship
Prior art date
Application number
PCT/CN2022/124185
Other languages
French (fr)
Chinese (zh)
Inventor
李春霞
周祥生
钟斌
屠要峰
徐进
Original Assignee
中兴通讯股份有限公司
Application filed by 中兴通讯股份有限公司
Publication of WO2023071745A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/334 Query execution
    • G06F16/3344 Query execution using natural language analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G06F40/221 Parsing markup language streams
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods

Definitions

  • The present application relates to the technical field of information processing, and in particular to an information labeling method, a model training method, an electronic device, and a storage medium.
  • CRF: Conditional Random Field
  • RNN: Recurrent Neural Network
  • LSTM: Long Short-Term Memory
  • Embodiments of the present application provide an information labeling method, a model training method, an electronic device, and a storage medium.
  • An embodiment of the present application provides an information labeling method, including: obtaining the information text to be processed; inputting the information text to be processed into an information labeling model to obtain a first entity, a second entity, and entity relationship information between the first entity and the second entity, wherein the first entity and the second entity are obtained by the information labeling model performing entity distinction on the information text to be processed, and the entity relationship information is obtained by the information labeling model performing relationship discrimination on the first entity and the second entity; and performing information labeling processing on the information text to be processed according to the first entity, the second entity, and the entity relationship information to obtain a target information text.
  • An embodiment of the present application also provides a model training method, including: obtaining a training sample, the training sample being a text with label information; inputting the training sample into an information labeling model to obtain an information labeling result of the training sample, wherein the information labeling result includes a first labeling entity, a second labeling entity, and relationship labeling information between the first labeling entity and the second labeling entity, the first labeling entity and the second labeling entity are obtained by the information labeling model performing entity distinction on the training sample, and the relationship labeling information is obtained by the information labeling model performing relationship discrimination on the first labeling entity and the second labeling entity; and updating the parameters of the information labeling model according to the information labeling result and the label information.
  • the embodiment of the present application also provides an electronic device, including: a memory, a processor, and a computer program stored on the memory and operable on the processor, and the processor implements the above when executing the computer program.
  • the embodiment of the present application also provides a computer-readable storage medium, which stores a processor-executable program, and when the processor-executable program is executed by the processor, it is used to implement the above-mentioned first aspect.
  • An embodiment of the present application further provides a computer program product, including a computer program or a computer instruction stored in a computer-readable storage medium. The processor of a computer device reads the computer program or the computer instruction from the computer-readable storage medium and executes it, so that the computer device executes the information labeling method described in the first aspect above or implements the model training method described in the second aspect above.
  • FIG. 1 is a flowchart of an information labeling method provided by an embodiment of the present application
  • Fig. 2 is a flowchart of the method for step S200 in Fig. 1;
  • Fig. 3 is a flowchart of the method for step S200 in Fig. 1;
  • Fig. 4 is a flowchart of the method for step S300 in Fig. 1;
  • Fig. 5 is a flowchart of the method for step S320 in Fig. 4;
  • FIG. 6 is a flow chart of an information labeling method provided by another embodiment of the present application.
  • FIG. 7 is a schematic diagram of a system architecture for executing a model training method provided by another embodiment of the present application.
  • Fig. 8 is a frame diagram of a pre-training module based on knowledge fusion provided by an embodiment of the present application.
  • FIG. 9 is a framework diagram of an entity and relationship automatic labeling module provided by an embodiment of the present application.
  • Fig. 10 is a frame diagram of an annotation result review module provided by an embodiment of the present application.
  • Fig. 11 is a frame diagram of an audit data management and incremental training module provided by an embodiment of the present application.
  • Fig. 12 is a flow chart of the audit data management and incremental training module provided by another embodiment of the present application.
  • Fig. 13 is a flowchart of a model training method provided by an embodiment of the present application.
  • Fig. 14 is a flowchart of the method for step S700 in Fig. 13;
  • Fig. 15 is a flowchart of the method for step S700 in Fig. 13;
  • Fig. 16 is a flowchart of the method for step S800 in Fig. 13;
  • Fig. 17 is a flowchart of an information labeling method provided by an example of the present application.
  • The present application provides an information labeling method, a model training method, an electronic device, and a storage medium. First, the information text to be processed is obtained. The information text to be processed is then input into the information labeling model to obtain the first entity, the second entity, and the entity relationship information between the first entity and the second entity, wherein the first entity and the second entity are obtained by the information labeling model performing entity distinction on the information text to be processed, and the entity relationship information is obtained by the information labeling model performing relationship discrimination on the first entity and the second entity. Information labeling processing is then performed on the information text to be processed according to the first entity, the second entity, and the entity relationship information to obtain the target information text.
  • In other words, the information text to be processed is processed by the information labeling model to obtain a target information text carrying labeling information. This process requires no manual participation and realizes automatic information labeling, thereby effectively reducing the cost of purely manual labeling.
  • FIG. 1 is a flowchart of an information labeling method provided by an embodiment of the present application, and the information labeling method can be applied to an information labeling device.
  • The information labeling method may include, but is not limited to, step S100, step S200 and step S300.
  • Step S100 Obtain the information text to be processed.
  • The information text to be processed may come from documents in the business field, database data, and the like.
  • The information text to be processed may contain various types of information.
  • For example, the information text to be processed may be paper information, news information, speech information, etc., which is not specifically limited in this embodiment.
  • Step S200 Input the information text to be processed into the information labeling model to obtain the first entity, the second entity, and the entity relationship information between the first entity and the second entity, wherein the first entity and the second entity are obtained by the information labeling model performing entity distinction on the information text to be processed, and the entity relationship information is obtained by the information labeling model performing relationship discrimination on the first entity and the second entity.
  • The information text obtained in step S100 is input into the information labeling model. Within the model, entity distinction is performed on the information text to be processed to obtain the first entity and the second entity, and relationship discrimination is then performed on the first entity and the second entity to obtain the entity relationship information, so that subsequent steps can use the first entity, the second entity, and the entity relationship information between them.
  • For example, entity distinction can be made between place nouns and organization nouns in the information text to be processed, in which case the first entity obtained is a place noun and the second entity is an organization noun. As another example, nouns and job nouns can be distinguished as entities, in which case the first entity obtained is a person noun and the second entity is an organization noun.
  • The entity relationship information likewise has many different corresponding implementations. For example, for a place noun and an organization noun, the obtained entity relationship information is information such as "at", "located in", and "set"; when the first entity is a person noun and the second entity is an organization noun, the obtained entity relationship information is information such as "employed at" and "work at", which is not specifically limited in this embodiment.
  • Step S300 Perform information labeling processing on the information text to be processed according to the first entity, the second entity, and the entity relationship information to obtain the target information text.
  • Since the first entity, the second entity, and the entity relationship information have been obtained in step S200, the target information text can be obtained by performing information labeling processing on the information text to be processed according to them, thereby realizing automatic labeling of information and effectively reducing the cost of purely manual labeling.
  • The information labeling processing performed on the information text to be processed according to the first entity, the second entity, and the entity relationship information may take different forms, which is not specifically limited in this embodiment.
  • For example, the information labeling processing can highlight the first entity, the second entity, and the entity relationship information to obtain the target information text; as another example, the information labeling processing can underline them to obtain the target information text.
  • FIG. 2 further illustrates step S200, which may include, but is not limited to, step S210 and step S220.
  • Step S210 The information labeling model performs word segmentation on the information text to be processed to obtain a plurality of pieces of first field information.
  • Word segmentation is performed on the information text to be processed to obtain a plurality of pieces of first field information, so that the first entity and the second entity can be obtained from the first field information in a subsequent step.
  • For example, word segmentation can be performed on the information text to be processed according to a set text length, which can be eight characters, ten characters, etc.; as another example, the information text to be processed can be segmented according to the spacing distance, so as to obtain a plurality of pieces of first field information.
  • Step S220 The information labeling model performs entity recognition processing on the plurality of pieces of first field information and identifies the first entity and the second entity among them.
  • Entity recognition processing is performed on the plurality of pieces of first field information obtained in step S210, and the first entity and the second entity are identified among them, so that in subsequent steps relationship discrimination can be performed on the first entity and the second entity to obtain the entity relationship information.
  • For example, entity recognition can be performed on place nouns and organization nouns in the first field information, in which case the first entity obtained is a place noun and the second entity is an organization noun; as another example, the first entity obtained may be a person noun and the second entity an organization noun.
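Steps S210 and S220 can be sketched as follows. This is a minimal illustration under stated assumptions, not the model described in the application: the application uses a trained information labeling model, whereas the sketch substitutes a hypothetical lookup table (`GAZETTEER`) for entity recognition. The function names and the sample sentence are also assumptions made for illustration.

```python
# Hypothetical lookup table standing in for the trained information labeling model.
GAZETTEER = {
    "Alice": "person noun",
    "ZTE": "organization noun",
    "Shenzhen": "place noun",
}


def segment_by_length(text: str, length: int = 8) -> list[str]:
    """Step S210, variant 1: split text into chunks of a set text length."""
    if length <= 0:
        raise ValueError("length must be positive")
    return [text[i:i + length] for i in range(0, len(text), length)]


def segment_by_spacing(text: str) -> list[str]:
    """Step S210, variant 2: split text on whitespace."""
    return text.split()


def recognize_entities(fields: list[str]) -> list[tuple[str, str]]:
    """Step S220: keep the fields that the lookup table recognizes as entities."""
    return [(field, GAZETTEER[field]) for field in fields if field in GAZETTEER]


fields = segment_by_spacing("Alice works at ZTE")
first_entity, second_entity = recognize_entities(fields)
```

Fixed-length segmentation can cut an entity in half, so the lookup in this sketch is only meaningful with the spacing-based variant; a trained model would operate on token sequences instead.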
  • FIG. 3 further illustrates step S200, which may include, but is not limited to, step S230 and step S240.
  • Step S230 The information labeling model performs category identification processing on the first entity and the second entity, and obtains first category information corresponding to the first entity and second category information corresponding to the second entity.
  • Category identification processing is performed on the first entity and the second entity obtained in step S220 to obtain the first category information corresponding to the first entity and the second category information corresponding to the second entity, so that relationship identification processing can subsequently be performed on the first category information and the second category information to obtain the entity relationship information.
  • The first category information obtained from the first entity and the second category information obtained from the second entity may likewise be of multiple different types, which is not specifically limited in this embodiment.
  • For example, the first category information obtained may be a city name and the second category information obtained a country name; as another example, when the first entity is a doctor, the first category information obtained is an occupational noun, and when the second entity is a noun such as a hospital or a school, the second category information obtained is a location noun.
  • Step S240 The information labeling model performs relationship recognition processing on the first category information and the second category information to obtain entity relationship information.
  • Relationship recognition processing is performed on the first category information and the second category information obtained in step S230 to obtain the entity relationship information, which facilitates the subsequent labeling of the information text to be processed.
  • The entity relationship information obtained by performing relationship recognition processing on the first category information and the second category information may likewise be of multiple different types, which is not specifically limited in this embodiment.
  • For example, the obtained entity relationship information may be information such as "belongs to" and "included in"; as another example, when the second category information is a place noun, the obtained entity relationship information is information such as "work at" and "employed at".
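Steps S230 and S240 can be sketched as two lookups: one from entity to category, one from a category pair to candidate relation labels. Both tables are hypothetical; in the application a trained model performs these steps. The pairing of "city name"/"country name" with "belongs to" is an assumption inferred from the examples, and the entity strings "Shenzhen" and "China" are illustrative only.

```python
# Hypothetical category table; a trained model would perform step S230.
CATEGORY_OF = {
    "Shenzhen": "city name",
    "China": "country name",
    "doctor": "occupational noun",
    "hospital": "location noun",
}

# Hypothetical mapping from (first category, second category) to relation labels.
RELATIONS = {
    ("city name", "country name"): ["belongs to", "included in"],
    ("occupational noun", "location noun"): ["work at"],
}


def identify_category(entity: str) -> str:
    """Step S230: return the category of an entity, or "unknown"."""
    return CATEGORY_OF.get(entity, "unknown")


def recognize_relation(first_entity: str, second_entity: str) -> list[str]:
    """Step S240: look up candidate relations from the two categories."""
    key = (identify_category(first_entity), identify_category(second_entity))
    return RELATIONS.get(key, [])
```

Keying the relation lookup on categories rather than on the entity strings mirrors the order of the two steps: the relation is decided from the category information, not from the surface text.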
  • FIG. 4 further illustrates step S300, which may include, but is not limited to, step S310 and step S320.
  • Step S310 Highlight the first entity and the second entity in the information text to be processed to obtain the first information text.
  • The first entity and the second entity in the information text to be processed are highlighted to obtain the first information text, from which the target information text is obtained in subsequent steps.
  • For example, the first entity and the second entity are highlighted to obtain the first information text; as another example, the first entity and the second entity are underlined to obtain the first information text.
  • Step S320 Mark the entity relationship information in the first information text to obtain the target information text, wherein the entity relationship information is used to form an association relationship between the first entity and the second entity in the target information text.
  • The entity relationship information can be marked in the first information text to obtain the target information text, wherein the entity relationship information is used to form an association relationship between the first entity and the second entity, thereby realizing automatic labeling of information and effectively reducing the cost of purely manual labeling.
  • FIG. 5 further illustrates step S320, which may include, but is not limited to, step S3210 and step S3220.
  • Step S3210 In the first information text, mark the first category information for the first entity, and mark the second category information for the second entity.
  • The first category information can be marked on the first entity, and the second category information can be marked on the second entity, so that the target information text can be obtained in subsequent steps.
  • Since the first entity and the second entity may have many different entity types, the first category information and the second category information will also correspond to multiple different types, which is not specifically limited in this embodiment; this has been described in the above embodiments and is not repeated here.
  • Step S3220 According to the first category information and the second category information, mark the entity relationship information in the first information text to obtain the target information text.
  • Since the first category information and the second category information are marked in step S3210, the entity relationship information can be marked in the first information text to obtain the target information text. Automatic labeling of information is thus realized, effectively reducing the cost of purely manual labeling.
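Steps S310, S3210 and S3220 can be sketched as a small text transformation. The bracket markup `[entity|category]` and the returned dictionary layout are assumptions made for illustration; the application only states that entities are highlighted or underlined and that categories and relations are marked.

```python
def annotate(text, first, second, relation):
    """first and second are (entity, category) pairs; relation is a label string."""
    marked = text
    for entity, category in (first, second):
        # Steps S310 and S3210: mark the entity and attach its category.
        marked = marked.replace(entity, f"[{entity}|{category}]", 1)
    # Step S3220: record the relation that associates the two entities.
    return {"text": marked, "relation": (first[0], relation, second[0])}


target = annotate(
    "Alice works at ZTE",
    ("Alice", "person noun"),
    ("ZTE", "organization noun"),
    "work at",
)
```

The sketch uses plain string replacement on the first occurrence, so repeated or overlapping entity mentions would need character offsets instead.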
  • FIG. 6 provides a flowchart of an information labeling method according to another embodiment.
  • The information labeling method may include, but is not limited to, step S400, step S410 and step S420.
  • Step S400 Proofread at least one of the first entity, the second entity, and the entity relationship information of the target information text to obtain a proofreading result.
  • Since the target information text is obtained in step S320, at least one of the first entity, the second entity, and the entity relationship information of the target information text is proofread to obtain a proofreading result, which is evaluated in subsequent steps.
  • The proofreading of at least one of the first entity, the second entity, and the entity relationship information of the target information text may be implemented in different ways, which is not specifically limited in this embodiment.
  • For example, manual proofreading can be performed on any one of the first entity, the second entity, and the entity relationship information to obtain the proofreading result; as another example, program proofreading can be performed on any one of them to obtain the proofreading result.
  • Step S410 When the proofreading result shows that at least one of the first entity, the second entity, and the entity relationship information contains error information, obtain correction information according to the error information.
  • Since the proofreading result is obtained in step S400, when the proofreading result shows that at least one of the first entity, the second entity, and the entity relationship information contains error information, correction information is obtained according to the error information, so that the information labeling model can be updated in subsequent steps.
  • Step S420 Update the information labeling model according to the correction information.
  • Since the correction information is obtained in step S410, the information labeling model is updated according to the correction information, which improves the efficiency and quality of labeling and reduces the cost of purely manual labeling.
  • Updating the information labeling model according to the correction information includes iterating the information labeling model, replacing the information labeling model, etc., which is not specifically limited in this embodiment.
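Steps S400 and S410 can be sketched as a field-by-field comparison between the model output and the reviewed values. The field names and the structure of the correction record are hypothetical; the application does not specify a data format, and step S420 (the model update itself) is not shown.

```python
def collect_corrections(predicted: dict, reviewed: dict) -> dict:
    """Steps S400/S410: compare model output with reviewed values, field by field."""
    return {
        field: {"error": predicted.get(field), "correction": value}
        for field, value in reviewed.items()
        if predicted.get(field) != value
    }


predicted = {"first_entity": "Alice", "second_entity": "ZTE", "relation": "located in"}
reviewed = {"first_entity": "Alice", "second_entity": "ZTE", "relation": "work at"}
corrections = collect_corrections(predicted, reviewed)
```

An empty result means the proofreading found no error information, in which case there is nothing to feed into the model update.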
  • FIG. 7 is a schematic diagram of a system architecture for executing a model training method provided by an embodiment of the present application.
  • The system architecture includes a general domain training data construction module 100, a pre-training model training module 200 based on knowledge fusion, an entity and relationship automatic labeling review module 300, a labeling result review module 400, and an audit data management and incremental training module 500.
  • The general-domain training data to be constructed is divided into two categories from the perspective of machine learning: unsupervised training data and supervised training data. Unsupervised data includes news and similar text.
  • Supervised data in this scenario specifically refers to data in which the entities and their relationships in the text are labeled. Entity types usually include person names, place names, institutions, times, etc., and inter-entity relationships include founder, location, works, etc.
  • The ratio between unsupervised data and supervised data is set to 8:2; it can also be set to other ratios, which is not specifically limited in this embodiment.
  • For example, the ratio between unsupervised data and supervised data can be set to 7:3; as another example, when the model receives less text information, the ratio between unsupervised data and supervised data can be set to 9:1.
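The ratios above translate into sample counts as in the following sketch. The function name `split_counts` and the total of 1000 samples are assumptions used only to make the arithmetic concrete.

```python
def split_counts(total: int, unsupervised_parts: int = 8, supervised_parts: int = 2) -> tuple[int, int]:
    """Return (unsupervised, supervised) sample counts for a given ratio."""
    parts = unsupervised_parts + supervised_parts
    unsupervised = total * unsupervised_parts // parts
    # Give the remainder to the supervised set so the counts always sum to total.
    return unsupervised, total - unsupervised


default_split = split_counts(1000)        # 8:2
alt_split = split_counts(1000, 7, 3)      # 7:3
sparse_split = split_counts(1000, 9, 1)   # 9:1
```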
  • After the knowledge fusion-based pre-training model training module 200 receives the training data, it trains, according to the training data, a neural network model whose representations are suitable for entity and relationship extraction tasks. The module introduces the information contained in the knowledge graph into the training of the model, and then sends the trained neural network model to the entity and relationship automatic labeling and review module 300.
  • The remaining configuration of the neural network is as follows: the activation function is tanh; the model weights and bias values are initialized with random numbers; gradient descent and backpropagation are used to solve the model parameters; and the crossover method is used. The model is trained for 5*60000 iterations with a learning rate of 0.00003. The purpose of dropout regularization is to reduce overfitting and enhance the generalization ability of the model.
  • FIG. 8 is a frame diagram of a pre-training module based on knowledge fusion provided by an embodiment of the present application.
  • The pre-training module 200 based on knowledge fusion includes the following subtasks: the MLM (Masked Language Model) task, the NSP (Next Sentence Prediction) task, the entity distinction task, the relationship discrimination task, and the entity extraction task based on reading comprehension. The MLM task performs word segmentation on the sentence and then selects 15% of all words; of the selected words, 80% are replaced by [MASK] tokens, 10% keep their original tokens, and 10% are replaced by random tokens.
  • After [MASK] is introduced, the sentence is used as model input. The model output is the word representation at the position of each [MASK] word, and the loss is calculated with cross entropy so that the word the model outputs for [MASK] matches the real word.
  • For the NSP task, the training data is generated by randomly extracting two consecutive sentences from the parallel corpus. In 50% of cases the two extracted sentences are retained; in the other cases the second sentence is randomly extracted from the corpus, and their relationship is NotNext. The NotNext relationship is a non-inheritance relationship, indicating that there is no relationship between the two sentences. The model output is then compared with the real expected output to train the model.
  • The entity discrimination task is to find the correct tail entity in the current document, given the head entity and the relationship. For example, a hospital and a doctor have an employment relationship, so the relationship "occupation" and the head entity "hospital" are spliced in front of the original document as a prompt. Under the framework of contrastive learning, distinguishing the correct tail entity can be transformed into pulling closer the entity representations of the head entity and the correct tail entity, while pushing apart the entity representations of the head entity and the other entities (negative samples) in the document.
  • The relationship discrimination task is to distinguish the similarity of two relationship representations in the semantic space. Multiple documents are randomly sampled, and multiple relationship representations are derived from each document; these relationships may involve only sentence-level reasoning or complex reasoning across sentences. Then, based on the contrastive learning framework, the different relationship representations are trained in the relationship space according to the distantly supervised labels. Each relationship representation consists of two entity representations in the document. Positive samples are relationship representations with the same distant supervision label, and negative samples are the opposite.
  • The entity extraction task based on reading comprehension is as follows: given a text sequence X of length n, extract each entity in it, where each entity has a corresponding entity type. For example, assuming that the set of all entity labels in a dataset is Y, then for each entity label y in it, such as location LOC, there is a question q(y) about it; this question can be a word, a sentence, etc. The model input is X and q(y), and the model is trained and optimized to predict each entity with the label y, so the task can handle the extraction of both ordinary entities and nested entities.
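The token selection rule of the MLM task (15% of words selected; of those, 80% replaced by [MASK], 10% kept, 10% replaced by a random token) can be sketched as below. The function name, the `vocab` argument and the use of Python's `random` module are assumptions; the application does not describe an implementation, and the loss computation is not shown.

```python
import random

MASK = "[MASK]"


def mask_tokens(tokens, vocab, select_rate=0.15, rng=None):
    """Return (masked_tokens, labels), where labels maps position -> original token."""
    rng = rng or random.Random()
    n_select = round(len(tokens) * select_rate)
    positions = sorted(rng.sample(range(len(tokens)), n_select))
    masked = list(tokens)
    labels = {}
    for pos in positions:
        labels[pos] = tokens[pos]  # the model is trained to recover this token
        roll = rng.random()
        if roll < 0.8:
            masked[pos] = MASK               # 80%: replace with [MASK]
        elif roll < 0.9:
            pass                             # 10%: keep the original token
        else:
            masked[pos] = rng.choice(vocab)  # 10%: replace with a random token
    return masked, labels
```

Only the selected positions contribute to the cross-entropy loss; all other positions are passed through unchanged and carry no label.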
  • FIG. 9 is a frame diagram of an entity and relationship automatic labeling module provided by an embodiment of the present application.
  • The entity and relationship automatic labeling module trains the neural network model to obtain the entity and relationship recognition model. The module also includes a REST (Representational State Transfer) service for external submission and interaction, and synthesizes the labeled dataset.
  • The dataset to be labeled is input to the entity and relationship automatic labeling module through the REST service. The dataset to be labeled is received as the input of the model; the model provides the ability to label data automatically, outputs the entities and relationships it recognizes in the data, and writes them into a new dataset. The original data text and the data labels are combined in the new dataset, and finally the labeling results are sent to the labeling result review module 400 for further checking.
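The synthesis of the new dataset, which combines each original text with its labels, can be sketched as follows. The record layout and `stub_label_fn` are hypothetical stand-ins for the entity and relationship recognition model; the REST layer itself is omitted, and the JSON string only shows what such a service might return.

```python
import json


def synthesize_labeled_dataset(texts, label_fn):
    """Combine each original text with the labels produced by label_fn."""
    return [{"text": text, "labels": label_fn(text)} for text in texts]


def stub_label_fn(text):
    # Hypothetical stand-in for the entity and relationship recognition model.
    if "Alice" in text and "ZTE" in text:
        return {"entities": ["Alice", "ZTE"], "relations": [["Alice", "work at", "ZTE"]]}
    return {"entities": [], "relations": []}


dataset = synthesize_labeled_dataset(["Alice works at ZTE", "Hello"], stub_label_fn)
payload = json.dumps(dataset, ensure_ascii=False)  # body a REST response could carry
```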
  • FIG. 10 is a frame diagram of an annotation result review module provided by an embodiment of the present application.
  • The labeling result review module 400 reads the automatically labeled dataset and displays it visually in the labeling tool interface, so that the automatic labeling results can be reviewed and proofread on the interface to obtain proofreading data; the proofreading data is then sent to the audit data management and incremental training module 500.
  • the review and proofreading operations in the labeling result review module 400 can be performed manually or by a machine, which is not limited in this embodiment.
  • Fig. 11 is a frame diagram of an audit data management and incremental training module provided by an embodiment of the present application
  • Fig. 12 is a flow chart of the audit data management and incremental training module provided by an embodiment of the present application.
  • The processing performed by the audit data management and incremental training module 500 includes, but is not limited to, step S500, step S510, step S520, step S530 and step S540.
  • Step S500 Obtain a dataset to be labeled.
  • Step S510 Input the dataset to be labeled into the entity and relationship automatic labeling module to perform automatic labeling service to obtain labeling results.
  • Step S520 Proofread the labeling results with the labeling tool to obtain proofreading data.
  • Step S530 Input the proofreading data into the model for incremental training to obtain a new model.
  • Step S540 Use the new model to replace the model in the original automatic labeling plug-in.
  • On the one hand, the audit data management and incremental training module 500 manages the proofread data; on the other hand, it feeds the proofread data into incremental training of the model, so that the labeling model is optimized through continuous iteration and the labeling system forms a closed loop of data and model.
  • Module 500 collects the data that has been proofread during review and incrementally trains the model on it. The newly trained model replaces the model in the original automatic labeling plug-in, so that the model is optimized continuously.
  • Incremental training is triggered by a timer or by data volume. Once the new model is generated, it can automatically replace the old model and is then used to provide the automatic labeling service externally.
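  • The trigger logic described above (train when enough proofread data has accumulated or when enough time has passed, then replace the old model) can be sketched as follows. The class name `IncrementalTrainer`, its thresholds and the version counter are assumptions made for illustration; the actual training call is replaced by a comment.

```python
from dataclasses import dataclass, field


@dataclass
class IncrementalTrainer:
    min_samples: int = 100            # hypothetical data-volume threshold
    max_interval_s: float = 86400.0   # hypothetical timing threshold (one day)
    last_trained_at: float = 0.0
    pending: list = field(default_factory=list)
    model_version: int = 1

    def add_proofread(self, sample):
        """Store one proofread sample for the next incremental run."""
        self.pending.append(sample)

    def should_train(self, now):
        """Trigger on data volume or on elapsed time, but never with no data."""
        if not self.pending:
            return False
        enough_data = len(self.pending) >= self.min_samples
        enough_time = now - self.last_trained_at >= self.max_interval_s
        return enough_data or enough_time

    def maybe_train(self, now):
        """Run incremental training if triggered and swap in the new model."""
        if not self.should_train(now):
            return False
        # A real system would fine-tune the model on self.pending here.
        self.model_version += 1       # the new model replaces the old one
        self.pending.clear()
        self.last_trained_at = now
        return True


trainer = IncrementalTrainer(min_samples=2, max_interval_s=10.0)
trainer.add_proofread({"text": "Alice joined Acme.", "labels": []})
first_attempt = trainer.maybe_train(now=5.0)    # one sample, 5 s elapsed
trainer.add_proofread({"text": "Bob left Acme.", "labels": []})
second_attempt = trainer.maybe_train(now=6.0)   # data-volume trigger fires
```

  • Passing the current time in as an argument, rather than reading a clock inside the class, keeps the trigger easy to test.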
  • FIG. 13 is a flowchart of a model training method provided by an embodiment of the present application, and the model training method can be applied to a model training device.
  • The model training method may include, but is not limited to, step S600, step S700 and step S800.
  • Step S600 Obtain training samples, which are texts with label information.
  • The text with label information refers to text that has been labeled manually in advance.
  • Entity types usually include person name, place name, organization, time, and so on; relationships between entities include founder, location, work, and so on.
  • Step S700 Input the training sample into the information labeling model to obtain the information labeling result of the training sample, wherein the information labeling result includes the first labeled entity, the second labeled entity, and the relationship labeling information between the first labeled entity and the second labeled entity; the first labeled entity and the second labeled entity are obtained by the information labeling model performing entity distinction on the training sample, and the relationship labeling information is obtained by the information labeling model performing relationship discrimination on the first labeled entity and the second labeled entity.
  • In this step, since the training sample is obtained in step S600, it is input into the information labeling model to obtain the information labeling result described above, so that the parameters of the information labeling model can be updated in subsequent steps according to the information labeling result and the label information.
  • It should be noted that the first labeled entity and the second labeled entity are obtained by the information labeling model performing entity distinction on the training samples, and this can be implemented in different ways, which is not specifically limited in this embodiment.
  • For example, entity distinction can be performed on location nouns and organization nouns in the training samples, in which case the first labeled entity is a location noun and the second labeled entity is an organization noun. For another example, entity distinction can be performed on person nouns and job nouns, in which case the first labeled entity is a person noun and the second labeled entity is an organization noun.
  • The relationship labeling information differs correspondingly. For example, when the first labeled entity is a location noun and the second labeled entity is an organization noun, the relationship labeling information is information such as "in", "located in" or "set at"; when the first labeled entity is a person noun and the second labeled entity is an organization noun, the relationship labeling information is information such as "worked at", which is not specifically limited in this embodiment.
  • Step S800 Update the parameters of the information labeling model according to the information labeling result and label information.
  • In this step, since the label information is obtained in step S600 and the information labeling result is obtained in step S700, the parameters of the information labeling model are updated according to the information labeling result and the label information.
  • For example, the parameters of the information labeling model can be replaced according to the information labeling result and the label information to obtain a new information labeling model; for another example, parameters can be added to the information labeling model according to the information labeling result and the label information, so that the newly labeled dataset is incorporated into the information labeling model.
  • FIG. 14 further describes step S700, which may include, but is not limited to, step S710 and step S720.
  • Step S710 The information labeling model performs word segmentation processing on the training sample to obtain a plurality of second field information.
  • In this step, word segmentation processing is performed on the training sample to obtain a plurality of second field information, so that the first labeled entity and the second labeled entity can be obtained from the second field information in a subsequent step.
  • This step has the same technical principle and technical effect as step S210 in the embodiment shown in FIG. 2 above. The two embodiments differ only in the object operated on: the embodiment shown in FIG. 2 operates on the information text to be processed, while this embodiment operates on the training sample with label information.
  • Step S720 The information labeling model performs entity recognition processing on the plurality of second field information, and identifies the first labeled entity and the second labeled entity in the multiple second field information.
  • In this step, entity recognition processing is performed on the plurality of second field information obtained in step S710, and the first labeled entity and the second labeled entity are identified among them, so that relationship discrimination can be performed on the first labeled entity and the second labeled entity in subsequent steps to obtain the information labeling result of the training sample.
  • This step has the same technical principle and technical effect as step S220 in the embodiment shown in FIG. 2 above. The two embodiments differ only in the object operated on: the embodiment shown in FIG. 2 operates on the information text to be processed, while this embodiment operates on the training sample with label information.
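  • Steps S710 and S720 can be illustrated with a toy pipeline. This is a sketch under stated assumptions, not the model's actual mechanism: word segmentation is approximated by a regular expression over word characters, and entity recognition by lookup in a hypothetical set `KNOWN_ENTITIES`, whereas the embodiment performs both inside the information labeling model.

```python
import re

# Hypothetical entity vocabulary standing in for the model's learned recognizer.
KNOWN_ENTITIES = {"Alice", "Acme", "Paris"}


def segment(sample_text):
    """S710-style step: split the training sample into fields (word tokens)."""
    return re.findall(r"\w+", sample_text)


def recognize_entities(fields):
    """S720-style step: keep the fields recognized as entities, in text order."""
    return [f for f in fields if f in KNOWN_ENTITIES]


fields = segment("Alice works at Acme.")
entities = recognize_entities(fields)
first_labeled_entity, second_labeled_entity = entities[0], entities[1]
```

  • Text in languages without whitespace word boundaries, such as Chinese, would need a dedicated segmenter in place of the regular expression.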
  • FIG. 15 further describes step S700, which may include, but is not limited to, step S730 and step S740.
  • Step S730 The information labeling model performs category identification processing on the first labeled entity and the second labeled entity to obtain the first labeling category information corresponding to the first labeled entity and the second labeling category information corresponding to the second labeled entity.
  • In this step, category identification processing is performed on the first labeled entity and the second labeled entity obtained in step S720 to obtain the first labeling category information corresponding to the first labeled entity and the second labeling category information corresponding to the second labeled entity, so that relationship identification processing can subsequently be performed on the first labeling category information and the second labeling category information to obtain the relationship labeling information.
  • This step has the same technical principle and technical effect as step S230 in the embodiment shown in FIG. 3 above. The two embodiments differ only in the object operated on: the embodiment shown in FIG. 3 operates on the information text to be processed, while this embodiment operates on the training sample with label information.
  • Step S740 The information labeling model performs relationship identification processing on the first labeling category information and the second labeling category information to obtain relationship labeling information.
  • In this step, relationship identification processing is performed on the first labeling category information and the second labeling category information obtained in step S730 to obtain the relationship labeling information, which facilitates the subsequent labeling of the training samples.
  • This step has the same technical principle and technical effect as step S240 in the embodiment shown in FIG. 3 above. The two embodiments differ only in the object operated on: the embodiment shown in FIG. 3 operates on the information text to be processed, while this embodiment operates on the training sample with label information.
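  • Steps S730 and S740 can be illustrated by mapping each entity to a category and each category pair to a relationship. The lookup tables `CATEGORY_OF` and `RELATION_FOR` are hypothetical and serve only to make the data flow concrete; in the embodiment both decisions are made by the information labeling model. The example relations follow the ones quoted in this description.

```python
# Hypothetical lookup tables standing in for the model's learned behaviour.
CATEGORY_OF = {"Alice": "person", "Acme": "organization", "Paris": "location"}
RELATION_FOR = {
    ("person", "organization"): "worked at",
    ("location", "organization"): "located in",
}


def identify_categories(first_entity, second_entity):
    """S730-style step: return the category of each labeled entity."""
    return CATEGORY_OF[first_entity], CATEGORY_OF[second_entity]


def identify_relation(first_category, second_category):
    """S740-style step: return the relation for a category pair, or None."""
    return RELATION_FOR.get((first_category, second_category))


categories = identify_categories("Alice", "Acme")
relation = identify_relation(*categories)
```

  • Returning None for an unknown category pair leaves the decision to the later proofreading stage instead of guessing a relation.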
  • FIG. 16 further describes step S800, which may include, but is not limited to, step S810 and step S820.
  • Step S810 Obtain the training loss value according to the information labeling result and label information.
  • In this step, since the information labeling result is obtained in step S700 and the label information is obtained in step S600, the training loss value is obtained according to the information labeling result and the label information, so that the information labeling model can be updated according to the training loss value in subsequent steps.
  • It should be noted that the training loss value can be obtained from the information labeling result and the label information in different ways, which is not specifically limited in this embodiment.
  • For example, a loss value is obtained when no similar information for the information labeling result is found in the label information; for another example, a loss value is obtained when the information labeling result does not match the label information, that is, when a labeling error occurs.
  • Step S820 Update the parameters of the information labeling model according to the loss value until the loss value satisfies the training stop condition.
  • In this step, since the loss value is obtained in step S810, the parameters of the information labeling model are updated according to the loss value until the loss value satisfies the training stop condition, which improves the efficiency and quality of labeling and reduces the cost of purely manual labeling.
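  • Steps S810 and S820 can be illustrated with a small training loop. The embodiment does not specify the loss function or the optimizer, so this sketch assumes binary cross-entropy and plain gradient descent on a one-feature logistic model; the names `train`, `stop_loss` and `max_steps` are hypothetical. The loop computes the loss from predictions and labels, updates the parameters, and stops once the loss meets the stop condition.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def loss_and_grads(w, b, xs, ys):
    """Mean binary cross-entropy between predictions and labels, with gradients."""
    n = len(xs)
    loss = grad_w = grad_b = 0.0
    for x, y in zip(xs, ys):
        p = sigmoid(w * x + b)
        loss += -(y * math.log(p) + (1 - y) * math.log(1 - p))
        grad_w += (p - y) * x
        grad_b += p - y
    return loss / n, grad_w / n, grad_b / n


def train(xs, ys, lr=0.5, stop_loss=0.1, max_steps=1000):
    """Update parameters until the loss satisfies the stop condition."""
    w = b = 0.0
    for step in range(max_steps):
        loss, grad_w, grad_b = loss_and_grads(w, b, xs, ys)
        if loss <= stop_loss:        # training stop condition
            break
        w -= lr * grad_w             # parameter update from the loss gradient
        b -= lr * grad_b
    return w, b, loss, step


# Toy data: negative feature values carry label 0, positive values label 1.
w, b, final_loss, steps = train([-2.0, -1.0, 1.0, 2.0], [0, 0, 1, 1])
```

  • A cap on the number of steps is included so that the loop also terminates when the stop condition cannot be reached.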
  • FIG. 17 is a processing flowchart of an information labeling method provided by an example.
  • the information labeling method includes the following steps:
  • Step S101 Prepare model training data.
  • Step S102 Proportionally divide the data.
  • Step S103 Annotate the entities and relationships in the text, and record the position information and label of each annotated element in the text.
  • Step S104 Construct the pre-trained language model based on knowledge fusion of this application to predict text entities and their relationships.
  • Step S105 Use the neural network model to predict and label the text entities and their relationships.
  • Step S106 Perform REST encapsulation on the model, and add HTTP request and response processing capabilities.
  • Step S107 Start the model service and provide automatic labeling capability externally.
  • Step S108 Start automatic labeling on the interface, read the data set in the background, and call the automatic labeling service.
  • Step S109 After the automatic labeling is completed, the labeling tool filters and loads automatically labeled entities and relationships according to the previously configured entity and relationship categories, and displays them on the interface.
  • Step S110 Proofread the labeled results, and modify the labeled entities and relationships.
  • Step S111 When saving after modification, the system records the proofreading data and stores it in the proofreading training data set, which includes the original text and labels.
  • Step S112 The incremental training trigger periodically judges whether the incremental training trigger condition is met.
  • Step S113 Take the data set as input, input the above model network structure, and perform incremental training.
  • Step S114 Update the model online by means of a grayscale (gradual) release, and put the new model online for automatic labeling.
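  • Steps S102 and S103 above can be illustrated with a short sketch. The 80/20 ratio, the record layout and the helper names `split_dataset` and `annotate` are assumptions made for illustration. Each annotated element stores its character position in the text together with its label, and relations refer to entities by index.

```python
import random


def split_dataset(samples, train_ratio=0.8, seed=0):
    """S102-style step: shuffle and divide the data proportionally."""
    shuffled = samples[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_ratio)
    return shuffled[:cut], shuffled[cut:]


def annotate(text, phrase, label):
    """S103-style step: record the position and label of one element."""
    start = text.index(phrase)  # raises ValueError if the phrase is absent
    return {"start": start, "end": start + len(phrase), "label": label}


text = "Alice works at Acme."
record = {
    "text": text,
    "entities": [
        annotate(text, "Alice", "person"),
        annotate(text, "Acme", "organization"),
    ],
    # head and tail are indexes into the entities list above
    "relations": [{"head": 0, "tail": 1, "label": "works_at"}],
}
train_set, dev_set = split_dataset([record] * 10)
```

  • Storing offsets rather than only the entity strings lets the labeling tool highlight the exact occurrence that was annotated.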
  • an embodiment of the present application also provides an electronic device, which includes: a memory, a processor, and a computer program stored in the memory and operable on the processor.
  • the processor and memory can be connected by a bus or other means.
  • memory can be used to store non-transitory software programs and non-transitory computer-executable programs.
  • the memory may include high-speed random access memory, and may also include non-transitory memory, such as at least one magnetic disk storage device, flash memory device, or other non-transitory solid-state storage devices.
  • the memory may include memory located remotely from the processor, which remote memory may be connected to the processor through a network. Examples of the aforementioned networks include, but are not limited to, the Internet, intranets, local area networks, mobile communication networks, and combinations thereof.
  • The non-transitory software programs and instructions required to implement the information labeling method or the model training method of the above embodiments are stored in the memory. When executed by the processor, they perform the information labeling method of the above embodiments, for example method steps S100 to S300 in FIG. 1, S210 and S220 in FIG. 2, S230 and S240 in FIG. 3, S310 and S320 in FIG. 4, S3210 and S3220 in FIG. 5, and S400 to S420 in FIG. 6; or they perform the model training method of the above embodiments, for example method steps S600 to S800 in FIG. 13, S710 and S720 in FIG. 14, S730 and S740 in FIG. 15, and S810 and S820 in FIG. 16.
  • In addition, an embodiment of the present application also provides a computer-readable storage medium that stores computer-executable instructions. When the computer-executable instructions are executed by a processor or a controller, for example by the processor in the above device embodiment, the processor performs the information labeling method of the above embodiments, for example method steps S100 to S300 in FIG. 1, S210 and S220 in FIG. 2, S230 and S240 in FIG. 3, S310 and S320 in FIG. 4, S3210 and S3220 in FIG. 5, and S400 to S420 in FIG. 6; or performs the model training method of the above embodiments, for example method steps S600 to S800 in FIG. 13, S710 and S720 in FIG. 14, S730 and S740 in FIG. 15, and S810 and S820 in FIG. 16.
  • The embodiment of the present application includes: obtaining the information text to be processed; inputting the information text to be processed into the information labeling model to obtain the first entity, the second entity, and the entity relationship information between the first entity and the second entity, wherein the first entity and the second entity are obtained by the information labeling model performing entity distinction on the information text to be processed, and the entity relationship information is obtained by the information labeling model performing relationship discrimination on the first entity and the second entity; and performing information labeling processing on the information text to be processed according to the first entity, the second entity and the entity relationship information to obtain the target information text.
  • According to the solution of the embodiment of the present application, the information text to be processed is obtained and input into the information labeling model to obtain the first entity, the second entity, and the entity relationship information between them, and information labeling processing is then performed on the information text to be processed to obtain the target information text. In other words, the information text to be processed is processed by the information labeling model to obtain a target information text carrying labeling information. This process requires no manual participation and labels the information automatically, which effectively reduces the cost of purely manual labeling.
  • Computer storage media include, but are not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, digital versatile disk (DVD) or other optical disk storage, magnetic cartridges, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium that can be used to store the desired information and can be accessed by a computer.
  • Communication media typically embody computer-readable instructions, data structures, program modules, or other data in a modulated data signal such as a carrier wave or other transport mechanism, and may include any information delivery media.

Abstract

The present application discloses an information labeling method, a model training method, an electronic device and a storage medium. The information labeling method comprises: acquiring an information text to be processed (S100); inputting said information text into an information labeling model to obtain a first entity and a second entity as well as entity relationship information between the first entity and the second entity, wherein the first entity and the second entity are obtained by the information labeling model performing entity distinguishing on said information text, and the entity relationship information is obtained by the information labeling model performing relationship discrimination on the first entity and the second entity (S200); and performing information labeling processing on said information text according to the first entity, the second entity and the entity relationship information to obtain a target information text (S300).

Description

Information labeling method, model training method, electronic device and storage medium
Cross-Reference to Related Applications
This application is based on Chinese patent application No. 202111241199.3, filed on October 25, 2021, and claims priority to that Chinese patent application, the entire content of which is hereby incorporated into this application by reference.
Technical Field
The present application relates to the technical field of information processing, and in particular to an information labeling method, a model training method, an electronic device and a storage medium.
Background
In recent years, artificial intelligence has gradually become a popular field of interest, and most enterprises now use artificial intelligence technology to improve customer service, raise enterprise efficiency and reduce operating costs. For artificial intelligence technology to be truly effective, a large amount of data is needed for model training in order to train high-quality models.
Important tasks in natural language processing include entity recognition and entity relationship mining. In some cases, a CRF (Conditional Random Field) model, an RNN (Recurrent Neural Network) model or an LSTM (Long Short-Term Memory) model is generally used to recognize named entities and their relationships. To produce these models, text data must first be labeled and the model then trained. High-quality labeled datasets are therefore particularly important for model training, but labeling data is tedious work that requires a large amount of manual effort, and the models produced by the above techniques have low accuracy and a low return on the effort invested.
Summary
The following is an overview of the subject matter described in detail herein. This overview is not intended to limit the scope of protection of the claims.
Embodiments of the present application provide an information labeling method, a model training method, an electronic device and a storage medium.
In a first aspect, an embodiment of the present application provides an information labeling method, including: obtaining an information text to be processed; inputting the information text to be processed into an information labeling model to obtain a first entity, a second entity, and entity relationship information between the first entity and the second entity, wherein the first entity and the second entity are obtained by the information labeling model performing entity distinction on the information text to be processed, and the entity relationship information is obtained by the information labeling model performing relationship discrimination on the first entity and the second entity; and performing information labeling processing on the information text to be processed according to the first entity, the second entity and the entity relationship information to obtain a target information text.
In a second aspect, an embodiment of the present application further provides a model training method, including: obtaining a training sample, the training sample being a text with label information; inputting the training sample into an information labeling model to obtain an information labeling result of the training sample, wherein the information labeling result includes a first labeled entity, a second labeled entity, and relationship labeling information between the first labeled entity and the second labeled entity, the first labeled entity and the second labeled entity are obtained by the information labeling model performing entity distinction on the training sample, and the relationship labeling information is obtained by the information labeling model performing relationship discrimination on the first labeled entity and the second labeled entity; and updating parameters of the information labeling model according to the information labeling result and the label information.
In a third aspect, an embodiment of the present application further provides an electronic device, including: a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the computer program, implements the information labeling method of the first aspect or the model training method of the second aspect.
In a fourth aspect, an embodiment of the present application further provides a computer-readable storage medium storing a processor-executable program which, when executed by a processor, implements the information labeling method of the first aspect or the model training method of the second aspect.
In a fifth aspect, an embodiment of the present application further provides a computer program product, including a computer program or computer instructions stored in a computer-readable storage medium; a processor of a computer device reads the computer program or the computer instructions from the computer-readable storage medium and executes them, so that the computer device performs the information labeling method of the first aspect or the model training method of the second aspect.
Other features and advantages of the present application will be set forth in the description that follows and will in part become apparent from the description or be understood by practicing the application. The objectives and other advantages of the application can be realized and obtained by the structures particularly pointed out in the description, the claims and the accompanying drawings.
Brief Description of the Drawings
FIG. 1 is a flowchart of an information labeling method provided by an embodiment of the present application;
FIG. 2 is a flowchart of the method of step S200 in FIG. 1;
FIG. 3 is a flowchart of the method of step S200 in FIG. 1;
FIG. 4 is a flowchart of the method of step S300 in FIG. 1;
FIG. 5 is a flowchart of the method of step S320 in FIG. 4;
FIG. 6 is a flowchart of an information labeling method provided by another embodiment of the present application;
FIG. 7 is a schematic diagram of a system architecture for executing a model training method provided by another embodiment of the present application;
FIG. 8 is a framework diagram of a pre-training module based on knowledge fusion provided by an embodiment of the present application;
FIG. 9 is a framework diagram of an entity and relationship automatic labeling module provided by an embodiment of the present application;
FIG. 10 is a framework diagram of a labeling result review module provided by an embodiment of the present application;
FIG. 11 is a framework diagram of an audit data management and incremental training module provided by an embodiment of the present application;
FIG. 12 is a flowchart of the audit data management and incremental training module provided by another embodiment of the present application;
FIG. 13 is a flowchart of a model training method provided by an embodiment of the present application;
FIG. 14 is a flowchart of the method of step S700 in FIG. 13;
FIG. 15 is a flowchart of the method of step S700 in FIG. 13;
FIG. 16 is a flowchart of the method of step S800 in FIG. 13;
FIG. 17 is a flowchart of an information labeling method provided by an example of the present application.
Detailed Description
In order to make the purpose, technical solution and advantages of the present application clearer, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described here are only used to explain the present application and are not intended to limit it.
It should be noted that although a logical order is shown in the flowcharts, in some cases the steps shown or described may be performed in an order different from that in the flowcharts. The terms "first", "second" and the like in the description, the claims and the above drawings are used to distinguish similar objects and do not necessarily describe a specific order or sequence.
The present application provides an information labeling method, a model training method, an electronic device and a storage medium. An information text to be processed is first obtained and then input into an information labeling model to obtain a first entity, a second entity, and entity relationship information between the first entity and the second entity, wherein the first entity and the second entity are obtained by the information labeling model performing entity distinction on the information text to be processed, and the entity relationship information is obtained by the information labeling model performing relationship discrimination on the first entity and the second entity. Information labeling processing is then performed on the information text to be processed according to the first entity, the second entity and the entity relationship information to obtain a target information text.
According to the solution of the embodiments of the present application, the information text to be processed is obtained and input into the information labeling model to obtain the first entity, the second entity, and the entity relationship information between them, and information labeling processing is then performed on the information text to be processed to obtain the target information text. In other words, the information text to be processed is processed by the information labeling model to obtain a target information text carrying labeling information. This process requires no manual participation and labels the information automatically, which effectively reduces the cost of purely manual labeling.
The embodiments of the present application are further described below with reference to the accompanying drawings.
As shown in FIG. 1, FIG. 1 is a flowchart of an information labeling method provided by an embodiment of the present application. The information labeling method can be applied to an information labeling apparatus and may include, but is not limited to, step S100, step S200, and step S300.
Step S100: Obtain an information text to be processed.
In this step, the data of the information text to be processed may come from documents in a business domain, database data, and the like.
It should be noted that the information text to be processed may be of many different information types. For example, it may be paper information, news information, speech-script information, and so on, which is not specifically limited in this embodiment.
Step S200: Input the information text to be processed into the information labeling model to obtain a first entity, a second entity, and entity relationship information between the first entity and the second entity, where the first entity and the second entity are obtained by the information labeling model performing entity distinction on the information text to be processed, and the entity relationship information is obtained by the information labeling model performing relationship discrimination on the first entity and the second entity.
In this step, the information text to be processed from step S100 is input into the information labeling model. Within the model, entity distinction is performed on the information text to be processed to obtain the first entity and the second entity, and relationship discrimination is then performed on the first entity and the second entity to obtain the entity relationship information, so that subsequent steps can use the first entity, the second entity, and the entity relationship information between them.
It should be noted that entity distinction on the information text to be processed may be implemented in different ways to obtain the first entity and the second entity, which is not specifically limited in this embodiment. For example, entity distinction may be performed on place nouns and organization nouns in the information text to be processed, in which case the first entity is a place noun and the second entity is an organization noun. As another example, entity distinction may be performed on person nouns and job-title nouns in the information text to be processed, in which case the first entity is a person noun and the second entity is an organization noun.
It should be noted that, because the first entity and the second entity have many different implementations and the entity relationship information is obtained by relationship discrimination on the first entity and the second entity, the entity relationship information likewise has many corresponding implementations. For example, when the first entity is a place noun and the second entity is an organization noun, the resulting entity relationship information is information such as "at", "located in", or "set up in". When the first entity is a person noun and the second entity is an organization noun, the resulting entity relationship information is information such as "employed at" or "works at". This is not specifically limited in this embodiment.
Step S300: Perform information labeling processing on the information text to be processed according to the first entity, the second entity, and the entity relationship information to obtain a target information text.
In this step, since the first entity, the second entity, and the entity relationship information have been obtained in step S200, information labeling processing can be performed on the information text to be processed according to them to obtain the target information text. This realizes automatic labeling of information and effectively reduces the cost of purely manual labeling.
It should be noted that the information labeling processing performed on the information text to be processed according to the first entity, the second entity, and the entity relationship information may be implemented in different ways, which is not specifically limited in this embodiment. For example, the information labeling processing may highlight the first entity, the second entity, and the entity relationship information to obtain the target information text. As another example, it may underline the first entity, the second entity, and the entity relationship information to obtain the target information text.
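The flow of steps S100 to S300 can be sketched as follows. This is only an illustrative outline, not the implementation described in the application: `toy_labeling_model` is a hypothetical stand-in for the information labeling model, and the bracket and angle-bracket markup is an assumed labeling format.

```python
from dataclasses import dataclass


@dataclass
class Extraction:
    first_entity: str
    second_entity: str
    relation: str


def toy_labeling_model(text: str) -> Extraction:
    """Hypothetical stand-in for the information labeling model (step S200).

    A real model would perform entity distinction and relationship
    discrimination; this stub only matches two hard-coded name lists.
    """
    persons = ["Alice"]
    organizations = ["City Hospital"]
    first = next(p for p in persons if p in text)
    second = next(o for o in organizations if o in text)
    return Extraction(first, second, "works at")


def label_text(text: str, ex: Extraction) -> str:
    """Step S300: mark both entities inline and append the relation."""
    marked = text.replace(ex.first_entity, f"[{ex.first_entity}]")
    marked = marked.replace(ex.second_entity, f"[{ex.second_entity}]")
    return f"{marked} <{ex.first_entity} | {ex.relation} | {ex.second_entity}>"


text = "Alice is a doctor at City Hospital."   # step S100
extraction = toy_labeling_model(text)          # step S200
target_text = label_text(text, extraction)     # step S300
print(target_text)
```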
In an embodiment, as shown in FIG. 2, which further illustrates step S200, step S200 may include, but is not limited to, step S210 and step S220.
Step S210: The information labeling model performs word segmentation on the information text to be processed to obtain a plurality of pieces of first field information.
In this step, word segmentation is performed on the information text to be processed to obtain a plurality of pieces of first field information, so that the first entity and the second entity can be obtained from the first field information in a subsequent step.
It should be noted that word segmentation on the information text to be processed may be implemented in different ways to obtain the plurality of pieces of first field information, which is not specifically limited in this embodiment. For example, the information text to be processed may be segmented according to a set text length, such as eight or ten characters, to obtain the plurality of pieces of first field information. As another example, the information text to be processed may be segmented according to a set distance between two entities to obtain the plurality of pieces of first field information.
Step S220: The information labeling model performs entity recognition on the plurality of pieces of first field information and identifies the first entity and the second entity among them.
In this step, entity recognition is performed on the plurality of pieces of first field information obtained in step S210, and the first entity and the second entity are identified among them, so that relationship discrimination can be performed on the first entity and the second entity in a subsequent step to obtain the entity relationship information.
It should be noted that entity recognition on the plurality of pieces of first field information may be implemented in different ways to identify the first entity and the second entity, which is not specifically limited in this embodiment. For example, entity recognition may be performed on place nouns and organization nouns in the first field information, in which case the first entity is a place noun and the second entity is an organization noun. As another example, entity recognition may be performed on person nouns and job-title nouns in the first field information, in which case the first entity is a person noun and the second entity is an organization noun.
In another embodiment, as shown in FIG. 3, which further illustrates step S200, step S200 may include, but is not limited to, step S230 and step S240.
Step S230: The information labeling model performs category recognition on the first entity and the second entity to obtain first category information corresponding to the first entity and second category information corresponding to the second entity.
In this step, category recognition is performed on the first entity and the second entity obtained in step S220 to obtain the first category information corresponding to the first entity and the second category information corresponding to the second entity, so that relationship recognition can later be performed on the first category information and the second category information to obtain the entity relationship information.
It should be noted that, because the first entity and the second entity may be of many different entity types, the first category information obtained from the first entity and the second category information obtained from the second entity are likewise of many corresponding types, which is not specifically limited in this embodiment. For example, when the first entity is a city noun, the first category information is a city name, and when the second entity is a country noun, the second category information is a country name. As another example, when the first entity is a noun such as "doctor" or "teacher", the first category information is an occupation noun, and when the second entity is a noun such as "hospital" or "school", the second category information is a place noun.
Step S240: The information labeling model performs relationship recognition on the first category information and the second category information to obtain the entity relationship information.
In this step, relationship recognition is performed on the first category information and the second category information obtained in step S230 to obtain the entity relationship information, which is then used to label the information text to be processed.
It should be noted that, because the first category information and the second category information are of many different information types, the entity relationship information obtained by relationship recognition on them is likewise of many corresponding types, which is not specifically limited in this embodiment. For example, when the first category information is a capital name and the second category information is a country name, the resulting entity relationship information is information such as "belongs to" or "is contained in". As another example, when the first category information is an occupation noun and the second category information is a place noun, the resulting entity relationship information is information such as "works at" or "employed at".
In an embodiment, as shown in FIG. 4, which further illustrates step S300, step S300 may include, but is not limited to, step S310 and step S320.
Step S310: Highlight the first entity and the second entity in the information text to be processed to obtain a first information text.
In this step, the first entity and the second entity in the information text to be processed are highlighted to obtain the first information text, from which the target information text is obtained in a subsequent step.
It should be noted that highlighting the first entity and the second entity in the information text to be processed may be implemented in different ways, which is not specifically limited in this embodiment. For example, the first entity and the second entity may be brightened to obtain the first information text. As another example, the first entity and the second entity may be underlined to obtain the first information text.
Step S320: Label the entity relationship information in the first information text to obtain the target information text, where the entity relationship information is used to associate the first entity with the second entity in the target information text.
In this step, since the first information text has been obtained in step S310, the entity relationship information can be labeled in the first information text to obtain the target information text, where the entity relationship information associates the first entity with the second entity in the target information text. This realizes automatic labeling of information and effectively reduces the cost of purely manual labeling.
In an embodiment, as shown in FIG. 5, which further illustrates step S320, step S320 may include, but is not limited to, step S3210 and step S3220.
Step S3210: In the first information text, label the first entity with the first category information and label the second entity with the second category information.
In this step, since the first information text has been obtained in step S310, the first entity can be labeled with the first category information and the second entity with the second category information, from which the target information text is obtained in a subsequent step.
It should be noted that, because the first entity and the second entity may be of many different entity types, the first category information and the second category information are likewise of many corresponding types, which is not specifically limited in this embodiment. This has been described in detail in the embodiments above and is not repeated here.
Step S3220: According to the first category information and the second category information, label the entity relationship information in the first information text to obtain the target information text.
In this step, since the first category information and the second category information have been obtained in step S3210, the entity relationship information can be labeled in the first information text to obtain the target information text. This realizes automatic labeling of information and effectively reduces the cost of purely manual labeling.
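The sequence of steps S310, S3210, and S3220 can be sketched as three small text transformations. The markup used here (double asterisks for highlighting, parentheses for category information, and a bracketed arrow for the relation) is an assumption made for illustration; the application leaves the concrete labeling form open.

```python
def highlight(text: str, entities: list[str]) -> str:
    """Step S310: highlight each entity to obtain the first information text."""
    for entity in entities:
        text = text.replace(entity, f"**{entity}**")
    return text


def tag_categories(text: str, categories: dict[str, str]) -> str:
    """Step S3210: attach category information to each highlighted entity."""
    for entity, category in categories.items():
        text = text.replace(f"**{entity}**", f"**{entity}**({category})")
    return text


def mark_relation(text: str, first: str, second: str, relation: str) -> str:
    """Step S3220: append the relation that links the two entities."""
    return f"{text} [{first} --{relation}--> {second}]"


source = "The doctor joined the hospital."
first_text = highlight(source, ["doctor", "hospital"])
tagged = tag_categories(first_text, {"doctor": "occupation", "hospital": "place"})
target = mark_relation(tagged, "doctor", "hospital", "works at")
print(target)
```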
As shown in FIG. 6, FIG. 6 is a flowchart of an information labeling method according to another embodiment. The information labeling method may include, but is not limited to, step S400, step S410, and step S420.
Step S400: Proofread at least one of the first entity, the second entity, and the entity relationship information of the target information text to obtain a proofreading result.
In this step, since the target information text has been obtained in step S320, at least one of the first entity, the second entity, and the entity relationship information of the target information text is proofread to obtain a proofreading result, which is evaluated in a subsequent step.
It should be noted that proofreading at least one of the first entity, the second entity, and the entity relationship information of the target information text may be implemented in different ways, which is not specifically limited in this embodiment. For example, any one of the first entity, the second entity, and the entity relationship information may be proofread manually to obtain the proofreading result. As another example, any one of them may be proofread by a program to obtain the proofreading result.
Step S410: When the proofreading result indicates that at least one of the first entity, the second entity, and the entity relationship information contains error information, obtain correction information according to the error information.
In this step, since the proofreading result has been obtained in step S400, when the proofreading result indicates that at least one of the first entity, the second entity, and the entity relationship information contains error information, correction information is obtained according to the error information and is used to update the information labeling model in a subsequent step.
Step S420: Update the information labeling model according to the correction information.
In this step, since the correction information has been obtained in step S410, the information labeling model is updated according to the correction information, which improves the efficiency and quality of labeling and reduces the cost of purely manual labeling.
It should be noted that updating the information labeling model according to the correction information includes iterating the information labeling model, replacing the information labeling model, and so on, which is not specifically limited in this embodiment.
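Steps S400 to S420 can be sketched as a field-by-field comparison followed by a model update. In this sketch, program proofreading is assumed to compare the model output with a trusted reference, and the "update" simply records the correction so that it overrides later predictions for the same text. `ToyLabelingModel` is hypothetical; a real system would retrain or replace the model instead.

```python
def proofread(predicted: dict[str, str], reference: dict[str, str]) -> dict[str, str]:
    """Step S400: return the fields whose predicted value is wrong (error information)."""
    return {k: v for k, v in predicted.items() if reference.get(k) != v}


def corrections_from(errors: dict[str, str], reference: dict[str, str]) -> dict[str, str]:
    """Step S410: derive correction information for every erroneous field."""
    return {k: reference[k] for k in errors}


class ToyLabelingModel:
    """Hypothetical model whose update step stores corrections per text."""

    def __init__(self) -> None:
        self.overrides: dict[str, dict[str, str]] = {}

    def update(self, text: str, correction: dict[str, str]) -> None:
        """Step S420: fold correction information back into the model."""
        self.overrides.setdefault(text, {}).update(correction)

    def apply(self, text: str, raw_prediction: dict[str, str]) -> dict[str, str]:
        return {**raw_prediction, **self.overrides.get(text, {})}


text = "The doctor joined the hospital."
predicted = {"first_entity": "doctor", "second_entity": "hospital", "relation": "located in"}
reference = {"first_entity": "doctor", "second_entity": "hospital", "relation": "works at"}

errors = proofread(predicted, reference)
correction = corrections_from(errors, reference)
model = ToyLabelingModel()
model.update(text, correction)
print(model.apply(text, predicted))
```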
As shown in FIG. 7, FIG. 7 is a schematic diagram of a system architecture for executing the model training method provided by an embodiment of the present application. In the example of FIG. 7, the system architecture includes a general-domain training data construction module 100, a knowledge-fusion-based pre-trained model training module 200, an entity and relationship automatic labeling and review module 300, a labeling result review module 400, and a review data management and incremental training module 500.
In an embodiment, the general-domain training data construction module 100 first constructs the training data needed to train the knowledge-fusion-based pre-trained language model and then sends the training data to the knowledge-fusion-based pre-trained model training module 200. From a machine learning perspective, the general-domain training data to be constructed fall into two broad classes: unsupervised training data and supervised training data. Unsupervised training data are plain text that requires no prior manual labeling, such as Chinese Wikipedia, Baidu Baike, and news. Supervised data in this scenario refers specifically to text in which entities and their relationships have been labeled. Entity types typically include person names, place names, institutions, times, and so on, and relationships between entities include "founder", "located in", "works", and so on.
It should be noted that in this embodiment the ratio of unsupervised data to supervised data is set to 8:2. Other ratios may also be used, which is not specifically limited in this embodiment. For example, when the model receives a large amount of text information, the ratio of unsupervised data to supervised data may be set to 7:3. As another example, when the model receives a small amount of text information, the ratio may be set to 9:1.
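As a small worked example of the ratios mentioned above, the following sketch computes how many unsupervised and supervised samples a training set of a given size would contain. The helper name `split_counts` and the total of 1000 samples are assumptions for illustration only.

```python
def split_counts(total: int, unsup_parts: int = 8, sup_parts: int = 2) -> tuple[int, int]:
    """Split `total` samples into (unsupervised, supervised) counts by ratio."""
    unsupervised = total * unsup_parts // (unsup_parts + sup_parts)
    return unsupervised, total - unsupervised


default_mix = split_counts(1000)        # 8:2 ratio used in this embodiment
large_text_mix = split_counts(1000, 7, 3)
small_text_mix = split_counts(1000, 9, 1)
print(default_mix, large_text_mix, small_text_mix)
```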
In an embodiment, after receiving the training data, the knowledge-fusion-based pre-trained model training module 200 uses the training data to train a neural network model that has representation capability and is suitable for entity and relation extraction tasks. The knowledge-fusion-based pre-trained model training module 200 introduces the information contained in a knowledge graph into the pre-trained model and then sends the trained neural network model to the entity and relationship automatic labeling and review module 300.
In an embodiment, the remaining configuration of the neural network is as follows: the activation function is tanh; the model weights and biases are initialized with random numbers; the model parameters are solved by gradient descent and backpropagation; the loss function is cross-entropy; dropout (a regularization method) is used to reduce the effect of overfitting, with a dropout parameter of 0.8; the model is trained for 5*60000 iterations; and the learning rate is 0.00003. The purpose of dropout regularization is to reduce overfitting and strengthen the generalization ability of the model.
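The configuration listed above can be collected in a small settings object, shown here with a direct definition of the cross-entropy loss for a single example. The field names are hypothetical, and the text does not say whether the dropout parameter of 0.8 is a keep probability or a drop probability, so the sketch stores it without interpreting it.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainingConfig:
    activation: str = "tanh"
    weight_init: str = "random"
    optimizer: str = "gradient_descent"   # parameters solved with backpropagation
    loss: str = "cross_entropy"
    dropout_param: float = 0.8            # meaning (keep vs. drop) not stated in the text
    iteration_blocks: int = 5
    iterations_per_block: int = 60000
    learning_rate: float = 0.00003

    @property
    def total_iterations(self) -> int:
        """5*60000 iterations, as stated in the embodiment."""
        return self.iteration_blocks * self.iterations_per_block


def cross_entropy(probabilities: list[float], target_index: int) -> float:
    """Cross-entropy loss for one example: -log p(target)."""
    return -math.log(probabilities[target_index])


config = TrainingConfig()
print(config.total_iterations, cross_entropy([0.5, 0.5], 0))
```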
Referring to FIG. 8, FIG. 8 is a framework diagram of the knowledge-fusion-based pre-training module provided by an embodiment of the present application.
In an embodiment, the framework of the knowledge-fusion-based pre-training module 200 contains the following subtasks: an MLM (Masked Language Model) task, an NSP (Next Sentence Prediction) task, an entity distinction task, a relationship discrimination task, and a reading-comprehension-based entity extraction task.
The MLM task segments each sentence into words and then selects 15% of all words. Of the selected words, 80% are replaced with the [MASK] token, 10% keep the original token, and 10% are replaced with a random token. The sequence with [MASK] introduced serves as the model input. The model outputs a word representation at each [MASK] position, and the loss is computed with cross-entropy so that the word the model predicts for each [MASK] matches the true word.
For the NSP task, the training data are generated by randomly drawing two consecutive sentences from a parallel corpus. For 50% of the pairs, the two drawn sentences are kept; they satisfy the IsNext relation, a succession relation indicating that the two sentences are connected. For the other 50%, the second sentence is drawn at random from the corpus; these pairs have the NotNext relation, a non-succession relation indicating that the two sentences are unrelated. The model output is then compared with the expected output to train the model.
The entity distinction task is, given a head entity and a relation, to find the correct tail entity in the current document. For example, a hospital and a doctor are linked by an employment relation, so the relation "occupation" and the head entity "hospital" are prepended to the original document as a prompt. Under this condition, the task of distinguishing the correct tail entity can be cast, within a contrastive learning framework, as pulling the entity representations of the head entity and the correct tail entity closer together while pushing the entity representation of the head entity away from those of the other entities in the document (negative samples).
The relationship discrimination task distinguishes how close two relation representations are in semantic space. Multiple documents are randomly sampled, and multiple relation representations are obtained from each document; these relations may involve only sentence-level reasoning or complex reasoning across sentences. Then, based on the contrastive learning framework, the different relation representations are trained in relation space according to distantly supervised labels. Each relation representation is composed of two entity representations in the document. Positive samples are relation representations with the same distant-supervision label, and negative samples are the opposite.
The reading-comprehension-based entity extraction task is, given a text sequence X of length n, to extract every entity in it, where each entity has a corresponding entity type. For example, suppose the set of all entity labels in the dataset is Y. Then for each entity label y in Y, such as the location label LOC, there is a question q(y) about it. The question may be a word, a sentence, and so on. The model input is X and q(y), and the model is trained and optimized to predict every entity with label y, so this task can handle the extraction of both ordinary entities and nested entities.
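The data preparation for the MLM and NSP subtasks can be sketched as follows. This is a minimal illustration using only the proportions stated above (15% selection with an 80/10/10 split, and a 50/50 split between IsNext and NotNext); the function names, the toy vocabulary, and the use of a seeded `random.Random` are assumptions, and a real pipeline would operate on tokenizer IDs rather than strings. Note that the random second sentence in the NotNext branch is drawn from the whole list, so it can occasionally coincide with the true next sentence.

```python
import random

MASK = "[MASK]"


def mask_tokens(tokens, vocab, rng, select_ratio=0.15):
    """MLM: select 15% of tokens; 80% -> [MASK], 10% unchanged, 10% random token.

    Returns the masked sequence and a {position: original token} label map.
    """
    n_select = max(1, round(len(tokens) * select_ratio))
    positions = sorted(rng.sample(range(len(tokens)), n_select))
    masked = list(tokens)
    labels = {}
    for pos in positions:
        labels[pos] = tokens[pos]
        roll = rng.random()
        if roll < 0.8:
            masked[pos] = MASK
        elif roll < 0.9:
            pass  # keep the original token
        else:
            masked[pos] = rng.choice(vocab)
    return masked, labels


def make_nsp_pair(sentences, index, rng):
    """NSP: keep the true next sentence half of the time, else draw a random one."""
    first = sentences[index]
    if rng.random() < 0.5:
        return first, sentences[index + 1], "IsNext"
    return first, rng.choice(sentences), "NotNext"


rng = random.Random(0)
tokens = [f"w{i}" for i in range(20)]
vocab = ["alpha", "beta", "gamma"]
masked, labels = mask_tokens(tokens, vocab, rng)

sentences = ["He is a doctor.", "He works at a hospital.", "It rained today."]
pair = make_nsp_pair(sentences, 0, rng)
print(masked, labels, pair)
```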
Referring to FIG. 9, FIG. 9 is a framework diagram of the entity and relationship automatic labeling module provided by an embodiment of the present application.
In an embodiment, the entity and relationship automatic labeling module trains the neural network model to obtain an entity and relationship recognition model. The module also includes a REST (Representational State Transfer) service for external interaction and a function for synthesizing the labeled dataset. First, the dataset to be labeled is submitted to the entity and relationship automatic labeling module through the REST service. The module receives the dataset to be labeled as the model input and uses the model to label the data automatically. After recognition, the model outputs the entities and relationships in the data, which are written to a new dataset that combines the original text with the data labels. Finally, the labeling results are sent to the labeling result review module 400 for checking.
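The dataset synthesis described above, in which each output record combines the original text with the model's labels, can be sketched as follows. The REST layer is omitted, `stub_model` is a hypothetical stand-in for the entity and relationship recognition model, and the record layout is an assumption.

```python
import json


def stub_model(text: str) -> dict:
    """Hypothetical recognizer: returns entities and one relation for a text."""
    if "doctor" in text and "hospital" in text:
        return {
            "entities": [
                {"text": "doctor", "type": "occupation"},
                {"text": "hospital", "type": "place"},
            ],
            "relations": [{"head": "doctor", "tail": "hospital", "type": "works at"}],
        }
    return {"entities": [], "relations": []}


def auto_label(dataset: list[str], model) -> list[dict]:
    """Build the new dataset: original text plus the labels the model produced."""
    return [{"text": text, "labels": model(text)} for text in dataset]


to_label = ["The doctor joined the hospital.", "It rained today."]
labeled = auto_label(to_label, stub_model)
payload = json.dumps(labeled, ensure_ascii=False)  # e.g. a body handed to the review module
print(payload)
```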
Referring to FIG. 10, FIG. 10 is a framework diagram of the labeling result review module provided by an embodiment of the present application.
In an embodiment, the labeling result review module 400 reads the automatically labeled dataset and displays it visually in the interface of a labeling tool, so that the automatic labeling results can be reviewed and proofread in the interface to obtain proofreading data. The proofreading data are then sent to the review data management and incremental training module 500.
It should be noted that the review and proofreading operations in the labeling result review module 400 may be performed manually or by a machine, which is not limited in this embodiment.
参考图11和图12,图11是本申请一个实施例提供的审核数据管理与增量训练模块的框架图,图12是本申请实施例提供的审核数据管理与增量训练模块的流程图。在图12的示例中,该审核数据管理与增量训练模块500包括但不限于有步骤S500、步骤S510、步骤S520、步骤S530和步骤S540。Referring to Fig. 11 and Fig. 12, Fig. 11 is a frame diagram of an audit data management and incremental training module provided by an embodiment of the present application, and Fig. 12 is a flow chart of the audit data management and incremental training module provided by an embodiment of the present application. In the example of FIG. 12 , the audit data management and incremental training module 500 includes but not limited to step S500 , step S510 , step S520 , step S530 and step S540 .
步骤S500:获取待标注数据集。Step S500: Obtain a dataset to be labeled.
步骤S510:将待标注数据集输入实体及关系自动标注模块进行自动标注服务得到标注结果。Step S510: Input the dataset to be labeled into the entity and relationship automatic labeling module to perform automatic labeling service to obtain labeling results.
步骤S520:根据标注工具对标注结果进行校对操作得到校对数据。Step S520: Proofreading the labeling results according to the labeling tool to obtain proofreading data.
步骤S530:将校对数据输入模型进行增量训练得到新模型。Step S530: Input the proofreading data into the model for incremental training to obtain a new model.
步骤S540:使用新模型替换原自动标注插件中的模型。Step S540: Use the new model to replace the model in the original automatic labeling plug-in.
In an embodiment, the audit data management and incremental training module 500 manages the proofread data and also feeds the proofread data into incremental training of the model, so that the labeling model is iteratively optimized and the data and model of the labeling system form a closed loop. Specifically, the module collects the data proofread during review, incrementally trains the model on that data, and replaces the model in the original automatic labeling plug-in with the newly trained model, so that the model keeps improving. The whole process requires no manual intervention: incremental training is triggered on a timer or by data volume, the old model is replaced automatically once the new model is generated, and the new model is then used to provide the automatic labeling service.
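As a minimal sketch of the trigger described above, assuming that "timing" means a maximum interval since the last training run and "data volume" means a minimum number of newly proofread samples, the decision logic could look like the following. The class name `IncrementalTrigger` and its thresholds are illustrative assumptions, not values taken from this application.

```python
import time
from dataclasses import dataclass, field


@dataclass
class IncrementalTrigger:
    """Decides when incremental training should start (hypothetical helper)."""
    min_samples: int = 100          # assumed data-volume threshold
    max_interval_s: float = 3600.0  # assumed timing threshold, in seconds
    last_run: float = field(default_factory=time.time)

    def should_train(self, pending_samples, now=None):
        """True if enough proofread data has accumulated or enough time has passed."""
        now = time.time() if now is None else now
        enough_data = pending_samples >= self.min_samples
        timed_out = pending_samples > 0 and now - self.last_run >= self.max_interval_s
        return enough_data or timed_out

    def mark_trained(self, now=None):
        """Record that a training run has just finished."""
        self.last_run = time.time() if now is None else now
```

A scheduler would poll `should_train` periodically and, after a successful run, call `mark_trained` and swap the new model into the labeling service.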
为了清楚的说明本申请实施例提供的模型训练方法的处理流程,下面以示例进行说明。In order to clearly illustrate the processing flow of the model training method provided in the embodiment of the present application, an example is used for description below.
As shown in FIG. 13, FIG. 13 is a flowchart of a model training method provided by an embodiment of the present application. The model training method can be applied to a model training device and may include, but is not limited to, step S600, step S700 and step S800.
步骤S600:获取训练样本,训练样本为带有标签信息的文本。Step S600: Obtain training samples, which are texts with label information.
本步骤中,带有标签信息的文本是指人工事先完成了标签信息的标注的文本,实体类型通常包含人名、地名、机构、时间等,实体间关系包含创始人、位于、作品等。In this step, the text with tag information refers to the text that has been marked manually in advance. The entity type usually includes person name, place name, organization, time, etc., and the relationship between entities includes founder, location, work, etc.
步骤S700:将训练样本输入到信息标注模型,得到训练样本的信息标注结果,其中,信息标注结果包括第一标注实体、第二标注实体以及第一标注实体与第二标注实体之间的关系标注信息,第一标注实体和第二标注实体由信息标注模型对训练样本进行实体区分而得到,关系标注信息由信息标注模型对第一标注实体和第二标注实体进行关系判别而得到。Step S700: Input the training sample into the information labeling model to obtain the information labeling result of the training sample, wherein the information labeling result includes the first labeling entity, the second labeling entity, and the relationship labeling between the first labeling entity and the second labeling entity Information, the first labeled entity and the second labeled entity are obtained by the information labeling model performing entity distinction on the training samples, and the relationship labeling information is obtained by the information labeling model performing relationship discrimination between the first labeled entity and the second labeled entity.
本步骤中,由于在步骤S600获取了训练样本,所以将训练样本输入到信息标注模型,得到训练样本的信息标注结果,其中,信息标注结果包括第一标注实体、第二标注实体以及第一标注实体与第二标注实体之间的关系标注信息,第一标注实体和第二标注实体由信息标注模型对训练样本进行实体区分而得到,关系标注信息由信息标注模型对第一标注实体和第二标注实体进行关系判别而得到,以便于后续步骤根据信息标注结果和标签信息对信息标注模型的参数进行更新。In this step, since the training samples are obtained in step S600, the training samples are input into the information labeling model to obtain the information labeling results of the training samples, wherein the information labeling results include the first labeling entity, the second labeling entity and the first labeling entity The relationship labeling information between the entity and the second labeling entity, the first labeling entity and the second labeling entity are obtained by distinguishing the training samples from the information labeling model, and the relationship labeling information is obtained by the information labeling model on the first labeling entity and the second labeling entity Annotated entities are obtained through relationship discrimination, so that the parameters of the information annotation model can be updated in the subsequent steps according to the information annotation results and label information.
需要说明的是,第一标注实体和第二标注实体由信息标注模型对训练样本进行实体区分而得到,可以有不同的实施方式,本实施例对此并不作具体限定。例如,可以对训练样本中的地点名词以及组织机构名词进行实体区分,此时,得到的第一标注实体为地点名词,第二标注实体为组织机构名词;又如,可以对训练样本中的人物名词和职务名词进行实体区分,此时,得到的第一标注实体为人物名词,第二标注实体为组织机构名词。It should be noted that the first labeled entity and the second labeled entity are obtained by distinguishing entities of training samples by an information labeling model, and there may be different implementation manners, which are not specifically limited in this embodiment. For example, entity distinction can be performed on location nouns and organization nouns in the training samples. At this time, the first labeled entity obtained is the location noun, and the second labeled entity is the organization noun; Nouns and job nouns are distinguished as entities. At this time, the first labeled entity is a person noun, and the second labeled entity is an organization noun.
It should be noted that, because the first labeled entity and the second labeled entity can be obtained in several ways and the relationship labeling information is obtained by relationship discrimination between the first labeled entity and the second labeled entity, the relationship labeling information likewise has several corresponding forms. For example, when the first labeled entity is a place noun and the second labeled entity is an organization noun, the relationship labeling information may be "in", "located in" or "set up in"; when the first labeled entity is a person noun and the second labeled entity is an organization noun, the relationship labeling information may be "employed by" or "works at". This embodiment does not specifically limit this.
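For illustration, one possible in-memory representation of an information labeling result (two labeled entities plus the relationship labeling information between them) is sketched below. The class names `LabeledEntity` and `LabelingResult` and the sample values are assumptions chosen to mirror the place/organization example above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LabeledEntity:
    text: str      # surface form found in the training sample
    category: str  # e.g. "location", "organization", "person"


@dataclass(frozen=True)
class LabelingResult:
    first: LabeledEntity   # first labeled entity
    second: LabeledEntity  # second labeled entity
    relation: str          # relationship labeling information


# First labeled entity is a place noun, second is an organization noun.
example = LabelingResult(
    first=LabeledEntity("Paris", "location"),
    second=LabeledEntity("Acme Corp", "organization"),
    relation="located in",
)
```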
步骤S800:根据信息标注结果和标签信息对信息标注模型的参数进行更新。Step S800: Update the parameters of the information labeling model according to the information labeling result and label information.
In this step, because the label information was obtained in step S600 and the information labeling result was obtained in step S700, the parameters of the information labeling model are updated according to the information labeling result and the label information.
需要说明的是,根据信息标注结果和标签信息对信息标注模型的参数进行更新,可以有不同的实施方式,本实施例对此并不作具体限定。例如,可以根据信息标注结果和标签信息对信息标注模型的参数进行替换,得到新的信息标注模型;又如,可以根据信息标注结果和标签信息对信息标注模型的参数进行添加,增加新的标注数据集在信息标注模型中。It should be noted that there may be different implementation manners for updating the parameters of the information labeling model according to the information labeling result and label information, which is not specifically limited in this embodiment. For example, the parameters of the information labeling model can be replaced according to the information labeling results and label information to obtain a new information labeling model; for another example, the parameters of the information labeling model can be added according to the information labeling results and label information to add new labels The dataset is in the information annotation model.
在一实施例中,如图14所示,图14是对步骤S700进行进一步的说明,该步骤S700可以包括但不限于有步骤S710和步骤S720。In an embodiment, as shown in FIG. 14 , which further illustrates step S700, which may include but not limited to step S710 and step S720.
步骤S710:信息标注模型对训练样本进行分词处理得到多个第二字段信息。Step S710: The information labeling model performs word segmentation processing on the training sample to obtain a plurality of second field information.
本步骤中,对训练样本进行分词处理得到多个第二字段信息,以便于后续步骤中根据第二字段信息获取第一标注实体和第二标注实体。In this step, word segmentation processing is performed on the training samples to obtain a plurality of second field information, so that the first labeled entity and the second labeled entity can be obtained according to the second field information in a subsequent step.
需要说明的是,本实施例中的步骤与上述如图2所示实施例的步骤S210,具有相同的技术原理以及相同的技术效果,两个实施例之间的区别在于操作对象不同,其中,上述如图2所示实施例的操作对象为待处理信息文本,而本实施例的操作对象为带有标签信息的训练样本。关于本实施例的技术原理以及技术效果,可以参照上述如图2所示实施例中的相关描述说明,为了避免内容重复冗余,此处不再赘述。It should be noted that the steps in this embodiment have the same technical principle and the same technical effect as the step S210 in the embodiment shown in FIG. 2 above, and the difference between the two embodiments is that the operation objects are different, wherein, The operation object of the above-mentioned embodiment shown in FIG. 2 is the information text to be processed, while the operation object of this embodiment is the training sample with label information. Regarding the technical principles and technical effects of this embodiment, reference may be made to the relevant descriptions in the above embodiment shown in FIG. 2 .
步骤S720:信息标注模型对多个第二字段信息进行实体识别处理,在多个第二字段信息中识别出第一标注实体和第二标注实体。Step S720: The information labeling model performs entity recognition processing on the plurality of second field information, and identifies the first labeled entity and the second labeled entity in the multiple second field information.
本步骤中,对步骤S710得到的多个第二字段信息进行实体识别处理,在多个第二字段信息中识别出第一标注实体和第二标注实体,以便于后续步骤中根据第一标注实体和第二标注实体进行关系判断得到训练样本的信息标注结果。In this step, entity recognition processing is performed on the plurality of second field information obtained in step S710, and the first marked entity and the second marked entity are identified in the multiple second field information, so that in subsequent steps, according to the first marked entity Perform relationship judgment with the second labeled entity to obtain information labeling results of the training samples.
需要说明的是,本实施例中的步骤与上述如图2所示实施例的步骤S220,具有相同的技术原理以及相同的技术效果,两个实施例之间的区别在于操作对象不同,其中,上述如图2所示实施例的操作对象为待处理信息文本,而本实施例的操作对象为带有标签信息的训练样本。关于本实施例的技术原理以及技术效果,可以参照上述如图2所示实施例中的相关描述说明,为了避免内容重复冗余,此处不再赘述。It should be noted that the steps in this embodiment have the same technical principle and the same technical effect as the step S220 in the embodiment shown in Figure 2 above, and the difference between the two embodiments is that the operation objects are different, wherein, The operation object of the above-mentioned embodiment shown in FIG. 2 is the information text to be processed, while the operation object of this embodiment is the training sample with label information. Regarding the technical principles and technical effects of this embodiment, reference may be made to the relevant descriptions in the above embodiment shown in FIG. 2 .
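Steps S710 and S720 can be pictured with the following simplified sketch. It replaces the model's word segmentation with a regular-expression split and the model's entity recognition with a lookup in a small hand-written set; both substitutions, as well as the names `segment`, `recognize_entities` and `KNOWN_ENTITIES`, are assumptions made purely for illustration.

```python
import re

# Hypothetical entity vocabulary standing in for the model's learned knowledge.
KNOWN_ENTITIES = {"Alice", "Acme", "Paris"}


def segment(text):
    """S710 (simplified): split the training sample into word-level fields."""
    return re.findall(r"\w+", text)


def recognize_entities(fields):
    """S720 (simplified): keep the fields that are recognized as entities."""
    return [f for f in fields if f in KNOWN_ENTITIES]


fields = segment("Alice works at Acme.")
first_entity, second_entity = recognize_entities(fields)[:2]
```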
在另一实施例中,如图15所示,图15是对步骤S700进行进一步的说明,该步骤S700可以包括但不限于有步骤S730和步骤S740。In another embodiment, as shown in FIG. 15 , which further illustrates step S700, which may include but not limited to step S730 and step S740.
Step S730: The information labeling model performs category identification processing on the first labeled entity and the second labeled entity to obtain first labeling category information corresponding to the first labeled entity and second labeling category information corresponding to the second labeled entity.
本步骤中,对步骤S720得到的第一标注实体和第二标注实体进行类别识别处理,得到第一标注实体对应的第一标注类别信息和第二标注实体对应的第二标注类别信息,以便于后面对第一标注类别信息和第二标注类别信息进行关系识别处理,得到关系标注信息。In this step, category identification processing is performed on the first labeled entity and the second labeled entity obtained in step S720 to obtain the first labeled category information corresponding to the first labeled entity and the second labeled category information corresponding to the second labeled entity, so as to facilitate Afterwards, a relationship identification process is performed on the first labeling category information and the second labeling category information to obtain relationship labeling information.
需要说明的是,本实施例中的步骤与上述如图3所示实施例的步骤S230,具有相同的技术原理以及相同的技术效果,两个实施例之间的区别在于操作对象不同,其中,上述如图3所示实施例的操作对象为待处理信息文本,而本实施例的操作对象为带有标签信息的训练样本。关于本实施例的技术原理以及技术效果,可以参照上述如图3所示实施例中的相关描述说明,为了避免内容重复冗余,此处不再赘述。It should be noted that the steps in this embodiment have the same technical principle and the same technical effect as the step S230 in the embodiment shown in Figure 3 above, and the difference between the two embodiments is that the operation objects are different, wherein, The operation object of the above-mentioned embodiment shown in FIG. 3 is the information text to be processed, while the operation object of this embodiment is the training sample with label information. Regarding the technical principles and technical effects of this embodiment, reference may be made to the relevant descriptions in the above-mentioned embodiment shown in FIG. 3 .
步骤S740:信息标注模型对第一标注类别信息和第二标注类别信息进行关系识别处理,得到关系标注信息。Step S740: The information labeling model performs relationship identification processing on the first labeling category information and the second labeling category information to obtain relationship labeling information.
In this step, relationship identification processing is performed on the first labeling category information and the second labeling category information obtained in step S730 to obtain the relationship labeling information, which facilitates the subsequent labeling of the training samples.
需要说明的是,本实施例中的步骤与上述如图3所示实施例的步骤S240,具有相同的技术原理以及相同的技术效果,两个实施例之间的区别在于操作对象不同,其中,上述如图3所示实施例的操作对象为待处理信息文本,而本实施例的操作对象为带有标签信息的训练样本。关于本实施例的技术原理以及技术效果,可以参照上述如图3所示实施例中的相关描述说明,为了避免内容重复冗余,此处不再赘述。It should be noted that the steps in this embodiment have the same technical principle and the same technical effect as the step S240 in the embodiment shown in FIG. 3 above, and the difference between the two embodiments is that the operation objects are different, wherein, The operation object of the above-mentioned embodiment shown in FIG. 3 is the information text to be processed, while the operation object of this embodiment is the training sample with label information. Regarding the technical principles and technical effects of this embodiment, reference may be made to the relevant descriptions in the above-mentioned embodiment shown in FIG. 3 .
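Steps S730 and S740 can be sketched in the same simplified manner: each labeled entity is first mapped to a category, and the pair of categories is then mapped to relationship labeling information. The lookup tables below are hypothetical stand-ins for what the trained model would predict.

```python
# Hypothetical category and relation tables; a trained model would predict these.
CATEGORY = {"Alice": "person", "Acme": "organization", "Paris": "location"}
RELATION_BY_CATEGORY = {
    ("location", "organization"): "located in",
    ("person", "organization"): "works at",
}


def classify(entity):
    """S730 (simplified): return the labeling category of an entity."""
    return CATEGORY.get(entity, "unknown")


def identify_relation(first, second):
    """S740 (simplified): derive the relation from the two categories."""
    pair = (classify(first), classify(second))
    return RELATION_BY_CATEGORY.get(pair)  # None when no relation is known
```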
在一实施例中,如图16所示,图16是对步骤S800进行进一步的说明,该步骤S800可以包括但不限于有步骤S810和步骤S820。In an embodiment, as shown in FIG. 16 , which further illustrates step S800, which may include but not limited to step S810 and step S820.
步骤S810:根据信息标注结果和标签信息得到训练的损失值。Step S810: Obtain the training loss value according to the information labeling result and label information.
本步骤中,由于步骤S700获得信息标注结果,步骤S600获得标签信息,所以根据信息标注结果和标签信息得到训练的损失值,以便于后续步骤根据训练损失值更新信息标注模型。In this step, since step S700 obtains the information labeling result and step S600 obtains label information, the training loss value is obtained according to the information labeling result and label information, so that the subsequent steps can update the information labeling model according to the training loss value.
需要说明的是,根据信息标注结果和标签信息得到训练的损失值,可以有不同的实施方式,本实施例对此并不作具体限定。例如,信息标注结果没有在信息标注模型的标签信息中找到类似信息,得到损失值;又如,信息标注结果与信息标注模型的标签信息不符,出现标注错误,得到损失值。It should be noted that the training loss value obtained according to the information labeling result and label information may be implemented in different manners, which is not specifically limited in this embodiment. For example, the information labeling result does not find similar information in the label information of the information labeling model, and the loss value is obtained; another example, the information labeling result does not match the label information of the information labeling model, and labeling errors occur, and the loss value is obtained.
步骤S820:根据损失值更新信息标注模型的参数,直至损失值满足训练停止条件。Step S820: Update the parameters of the information labeling model according to the loss value until the loss value satisfies the training stop condition.
本步骤中,由于步骤S810获得损失值,所以根据损失值更新信息标注模型的参数,直至损失值满足训练停止条件,从而提高了标注的效率和质量,降低了纯人工标注的成本。In this step, since the loss value is obtained in step S810, the parameters of the information labeling model are updated according to the loss value until the loss value satisfies the training stop condition, thereby improving the efficiency and quality of labeling and reducing the cost of pure manual labeling.
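The loop formed by steps S810 and S820 (compute a loss from the labeling result and the label information, then update the parameters until the loss satisfies a stop condition) can be illustrated with a toy example. The sketch below trains a one-feature logistic classifier with cross-entropy loss and plain gradient descent; the model, loss function, learning rate and threshold are all assumptions, since this application does not fix them.

```python
import math


def train(samples, lr=0.5, loss_threshold=0.1, max_epochs=10_000):
    """samples: list of (feature, label) pairs with label in {0, 1}."""
    w, b = 0.0, 0.0
    loss = float("inf")
    for _ in range(max_epochs):
        loss, grad_w, grad_b = 0.0, 0.0, 0.0
        for x, y in samples:
            p = 1.0 / (1.0 + math.exp(-(w * x + b)))  # model prediction
            # S810: cross-entropy between prediction and label
            loss -= y * math.log(p) + (1 - y) * math.log(1 - p)
            grad_w += (p - y) * x
            grad_b += p - y
        loss /= len(samples)
        if loss < loss_threshold:  # S820: training stop condition
            break
        # S820: update the parameters according to the loss gradient
        w -= lr * grad_w / len(samples)
        b -= lr * grad_b / len(samples)
    return w, b, loss


w, b, final_loss = train([(-2, 0), (-1, 0), (1, 1), (2, 1)])
```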
为了更加清楚的说明本申请实施例提供的信息标注方法的处理流程,下面以具体的示例进行说明。In order to more clearly illustrate the processing flow of the information labeling method provided by the embodiment of the present application, a specific example is used below for description.
如图17所示,图17是一个示例提供的信息标注方法的处理流程图。该信息标注方法包括以下步骤:As shown in FIG. 17 , FIG. 17 is a processing flowchart of an information labeling method provided by an example. The information labeling method includes the following steps:
示例一:Example one:
步骤S101:准备模型训练数据。Step S101: Prepare model training data.
步骤S102:对数据进行比例划分。Step S102: Proportionally divide the data.
步骤S103:标注文本中的实体及关系,记录每一个标注元素在文本中的位置信息及标注标签。Step S103: Annotate the entities and relationships in the text, and record the position information and label of each annotated element in the text.
步骤S104:构建本申请的基于知识融合的预训练语言模型预测文本实体及其关系。Step S104: Construct the pre-trained language model based on knowledge fusion of this application to predict text entities and their relationships.
步骤S105:采用神经网络模型对文本的实体及其关系进行预测标注。Step S105: Use the neural network model to predict and label the text entities and their relationships.
步骤S106:对模型进行REST封装,添加HTTP请求和响应处理能力。Step S106: Perform REST encapsulation on the model, and add HTTP request and response processing capabilities.
步骤S107:启动模型服务,对外提供自动标注能力。Step S107: Start the model service and provide automatic labeling capability externally.
步骤S108:在界面启动自动标注,后台读取数据集,调用自动标注服务。Step S108: Start automatic labeling on the interface, read the data set in the background, and call the automatic labeling service.
步骤S109:自动标注完成后,标注工具根据前述配置的实体及关系类别,筛选并加载自动标注的实体及关系,并在界面展示。Step S109: After the automatic labeling is completed, the labeling tool filters and loads automatically labeled entities and relationships according to the previously configured entity and relationship categories, and displays them on the interface.
步骤S110:对标注结果进行校对,修改标注的实体及关系。Step S110: Proofread the tagged results, and modify the tagged entities and relationships.
步骤S111:修改后保存时,系统记录校对的数据,存储到校对待训练数据集中,该数据集包括原始文本及标签。Step S111: When saving after modification, the system records the proofreading data and stores it in the proofreading training data set, which includes the original text and labels.
步骤S112:增量训练触发器定时判断是否达到增量训练触发条件。Step S112: The incremental training trigger periodically judges whether the incremental training trigger condition is met.
步骤S113:将数据集作为输入,输入上述模型网络结构,进行增量训练。Step S113: Take the data set as input, input the above model network structure, and perform incremental training.
步骤S114:采用灰度发布的方式在线更新模型,将新模型上线,用于自动标注。Step S114: Update the model online by means of gray scale publishing, and put the new model online for automatic labeling.
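Two of the data preparation steps listed above, the proportional split of step S102 and the position-aware labeling record of step S103, might be sketched as follows. The 80/20 ratio, the record fields and the function names `split_dataset` and `annotate_span` are illustrative assumptions.

```python
import random


def split_dataset(items, train_ratio=0.8, seed=0):
    """S102 (sketch): shuffle and split the data by the given ratio."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_ratio)
    return shuffled[:cut], shuffled[cut:]


def annotate_span(text, mention, label):
    """S103 (sketch): record the position and label of one labeled element."""
    start = text.find(mention)
    if start < 0:
        raise ValueError(f"{mention!r} does not occur in the text")
    return {"start": start, "end": start + len(mention),
            "label": label, "text": mention}


sentence = "Acme is located in Paris."
span = annotate_span(sentence, "Paris", "location")
train_set, eval_set = split_dataset(range(10))
```

Storing character offsets alongside the label lets the labeling tool highlight the element in the original text during proofreading.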
另外,本申请的一个实施例还提供了一种电子设备,该电子设备包括:存储器、处理器及存储在存储器上并可在处理器上运行的计算机程序。In addition, an embodiment of the present application also provides an electronic device, which includes: a memory, a processor, and a computer program stored in the memory and operable on the processor.
处理器和存储器可以通过总线或者其他方式连接。The processor and memory can be connected by a bus or other means.
存储器作为一种非暂态计算机可读存储介质,可用于存储非暂态软件程序以及非暂态性计算机可执行程序。此外,存储器可以包括高速随机存取存储器,还可以包括非暂态存储器,例如至少一个磁盘存储器件、闪存器件、或其他非暂态固态存储器件。在一些实施方式中,存储器可包括相对于处理器远程设置的存储器,这些远程存储器可以通过网络连接至该处理器。上述网络的实例包括但不限于互联网、企业内部网、局域网、移动通信网及其组合。As a non-transitory computer-readable storage medium, memory can be used to store non-transitory software programs and non-transitory computer-executable programs. In addition, the memory may include high-speed random access memory, and may also include non-transitory memory, such as at least one magnetic disk storage device, flash memory device, or other non-transitory solid-state storage devices. In some embodiments, the memory may include memory located remotely from the processor, which remote memory may be connected to the processor through a network. Examples of the aforementioned networks include, but are not limited to, the Internet, intranets, local area networks, mobile communication networks, and combinations thereof.
实现上述实施例的信息标注方法或者模型训练方法所需的非暂态软件程序以及指令存储在存储器中,当被处理器执行时,执行上述实施例中的信息标注方法,例如,执行以上描述的图1中的方法步骤S100至S300、图2中的方法步骤S210和S220、图3中的方法步骤S230和S240、图4中的方法步骤S310和S320、图5中的方法步骤S3210和S3220、图6中的方法步骤S400至S420,或者,执行上述实施例中的模型训练方法,例如,执行以上描述的图13中的方法步骤S600至S800、图14中的方法步骤S710和S720、图15中的方法步骤S730和S740、图16中的方法步骤S810和S820。The non-transitory software programs and instructions required to implement the information labeling method or model training method of the above-mentioned embodiments are stored in the memory, and when executed by the processor, the information labeling method in the above-mentioned embodiments is executed, for example, the above-described Method steps S100 to S300 in Fig. 1, method steps S210 and S220 in Fig. 2, method steps S230 and S240 in Fig. 3, method steps S310 and S320 in Fig. 4, method steps S3210 and S3220 in Fig. 5, Method steps S400 to S420 in FIG. 6, or, execute the model training method in the above-mentioned embodiment, for example, execute method steps S600 to S800 in FIG. 13 described above, method steps S710 and S720 in FIG. 14, FIG. 15 Method steps S730 and S740 in , method steps S810 and S820 in FIG. 16 .
以上所描述的装置实施例或者系统实施例仅仅是示意性的,其中作为分离部件说明的单元可以是或者也可以不是物理上分开的,即可以位于一个地方,或者也可以分布到多个网络单元上。可以根据实际的需要选择其中的部分或者全部模块来实现本实施例方案的目的。The above-described device embodiments or system embodiments are only illustrative, and the units described as separate components may or may not be physically separated, that is, they may be located in one place, or may be distributed to multiple network units superior. Part or all of the modules can be selected according to actual needs to achieve the purpose of the solution of this embodiment.
此外,本申请的一个实施例还提供了一种计算机可读存储介质,该计算机可读存储介质存储有计算机可执行指令,该计算机可执行指令被一个处理器或控制器执行,例如,被上述装置实施例中的一个处理器执行,可使得上述处理器执行上述实施例中的信息标注方法或者模型训练方法,例如,执行以上描述的图1中的方法步骤S100至S300、图2中的方法步骤S210和S220、图3中的方法步骤S230和S240、图4中的方法步骤S310和S320、图5中的方法步骤S3210和S3220、图6中的方法步骤S400至S420,或者,执行上述实施例中的模型训练方法,例如,执行以上描述的图13中的方法步骤S600至S800、图14中的方法步骤S710和S720、图15中的方法步骤S730和S740、图16中的方法步骤S810和S820。In addition, an embodiment of the present application also provides a computer-readable storage medium, the computer-readable storage medium stores computer-executable instructions, and the computer-executable instructions are executed by a processor or a controller, for example, by the above-mentioned Executed by a processor in the device embodiment, the above-mentioned processor can execute the information labeling method or the model training method in the above-mentioned embodiment, for example, execute the above-described method steps S100 to S300 in FIG. 1 and the method in FIG. 2 Steps S210 and S220, method steps S230 and S240 in FIG. 3, method steps S310 and S320 in FIG. 4, method steps S3210 and S3220 in FIG. 5, method steps S400 to S420 in FIG. The model training method in the example, for example, executes method steps S600 to S800 in Fig. 13 described above, method steps S710 and S720 in Fig. 14 , method steps S730 and S740 in Fig. 15 , method step S810 in Fig. 16 and S820.
The embodiments of the present application include: obtaining an information text to be processed; inputting the information text to be processed into an information labeling model to obtain a first entity, a second entity, and entity relationship information between the first entity and the second entity, where the first entity and the second entity are obtained by the information labeling model performing entity distinction on the information text to be processed, and the entity relationship information is obtained by the information labeling model performing relationship discrimination on the first entity and the second entity; and performing information labeling processing on the information text to be processed according to the first entity, the second entity and the entity relationship information to obtain a target information text. In other words, according to the solution of the embodiments of the present application, the information labeling model turns the information text to be processed into a target information text carrying labeling information. This process requires no manual participation and labels the information automatically, which effectively reduces the cost of purely manual labeling.
Those of ordinary skill in the art will understand that all or some of the steps and systems in the methods disclosed above may be implemented as software, firmware, hardware, or an appropriate combination thereof. Some or all of the physical components may be implemented as software executed by a processor, such as a central processing unit, digital signal processor or microprocessor, or as hardware, or as an integrated circuit, such as an application-specific integrated circuit. Such software may be distributed on computer-readable media, which may include computer storage media (or non-transitory media) and communication media (or transitory media). As is well known to those of ordinary skill in the art, the term computer storage media includes volatile and non-volatile, removable and non-removable media implemented in any method or technology for storing information (such as computer-readable instructions, data structures, program modules or other data). Computer storage media include, but are not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, digital versatile disc (DVD) or other optical disc storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium that can be used to store the desired information and can be accessed by a computer. In addition, as is well known to those of ordinary skill in the art, communication media typically contain computer-readable instructions, data structures, program modules or other data in a modulated data signal such as a carrier wave or other transport mechanism, and may include any information delivery media.
以上是对本申请的若干实施方式进行了说明,但本申请并不局限于上述实施方式,熟悉本领域的技术人员在不违背本申请精神的前提下还可作出种种的等同变形或替换,这些等同的变形或替换均包含在本申请权利要求所限定的范围内。The above is a description of several embodiments of the present application, but the present application is not limited to the above-mentioned embodiments. Those skilled in the art can also make various equivalent deformations or replacements without violating the spirit of the present application. Any modification or substitution is included within the scope defined by the claims of the present application.

Claims (13)

  1. 一种信息标注方法,包括:A method for labeling information, comprising:
    获取待处理信息文本;Get the message text to be processed;
    把所述待处理信息文本输入到信息标注模型,得到第一实体、第二实体以及所述第一实体与所述第二实体之间的实体关系信息,其中,所述第一实体和所述第二实体由所述信息标注模型对所述待处理信息文本进行实体区分而得到,所述实体关系信息由所述信息标注模型对所述第一实体和所述第二实体进行关系判别而得到;Input the information text to be processed into the information labeling model to obtain the first entity, the second entity and the entity relationship information between the first entity and the second entity, wherein the first entity and the The second entity is obtained by distinguishing entities of the information text to be processed by the information annotation model, and the entity relationship information is obtained by distinguishing the relationship between the first entity and the second entity by the information annotation model ;
    根据所述第一实体、所述第二实体和所述实体关系信息对所述待处理信息文本进行信息标注处理得到目标信息文本。performing information annotation processing on the information text to be processed according to the first entity, the second entity, and the entity relationship information to obtain a target information text.
  2. 根据权利要求1所述的信息标注方法,其中,所述第一实体和所述第二实体由所述信息标注模型对所述待处理信息文本进行实体区分而得到,包括:The information labeling method according to claim 1, wherein the first entity and the second entity are obtained by distinguishing entities of the information text to be processed by the information labeling model, including:
    所述信息标注模型对所述待处理信息文本进行分词处理得到多个第一字段信息;The information labeling model performs word segmentation processing on the information text to be processed to obtain a plurality of first field information;
    所述信息标注模型对所述多个第一字段信息进行实体识别处理,在所述多个字段信息中识别出所述第一实体和所述第二实体。The information labeling model performs entity recognition processing on the plurality of first field information, and identifies the first entity and the second entity in the plurality of field information.
  3. 根据权利要求1所述的信息标注方法,其中,所述实体关系信息由所述信息标注模型对所述第一实体和所述第二实体进行关系判别而得到,包括:The information labeling method according to claim 1, wherein the entity relationship information is obtained by discriminating the relationship between the first entity and the second entity by the information labeling model, including:
    所述信息标注模型对所述第一实体和所述第二实体进行类别识别处理,得到所述第一实体对应的第一类别信息和所述第二实体对应的第二类别信息;The information labeling model performs category recognition processing on the first entity and the second entity to obtain first category information corresponding to the first entity and second category information corresponding to the second entity;
    所述信息标注模型对所述第一类别信息和所述第二类别信息进行关系识别处理,得到所述实体关系信息。The information labeling model performs relationship recognition processing on the first category information and the second category information to obtain the entity relationship information.
  4. The information labeling method according to claim 3, wherein performing information labeling processing on the information text to be processed according to the first entity, the second entity, and the entity relationship information to obtain the target information text comprises:
    highlighting the first entity and the second entity in the information text to be processed to obtain a first information text; and
    labeling the entity relationship information in the first information text to obtain the target information text, wherein the entity relationship information is used to associate the first entity with the second entity in the target information text.
  5. The information labeling method according to claim 4, wherein labeling the entity relationship information in the first information text to obtain the target information text comprises:
    in the first information text, labeling the first entity with the first category information and labeling the second entity with the second category information; and
    labeling the entity relationship information in the first information text according to the first category information and the second category information, to obtain the target information text.
  6. The information labeling method according to claim 1, further comprising:
    proofreading at least one of the first entity, the second entity, and the entity relationship information in the target information text to obtain a proofreading result;
    when the proofreading result indicates that at least one of the first entity, the second entity, and the entity relationship information contains erroneous information, obtaining correction information according to the erroneous information; and
    updating the information labeling model according to the correction information.
  7. A model training method, comprising:
    acquiring a training sample, the training sample being text with label information;
    inputting the training sample into an information labeling model to obtain an information labeling result of the training sample, wherein the information labeling result comprises a first labeled entity, a second labeled entity, and relationship labeling information between the first labeled entity and the second labeled entity, the first labeled entity and the second labeled entity are obtained by the information labeling model performing entity distinction on the training sample, and the relationship labeling information is obtained by the information labeling model performing relationship discrimination on the first labeled entity and the second labeled entity; and
    updating parameters of the information labeling model according to the information labeling result and the label information.
  8. The model training method according to claim 7, wherein the first labeled entity and the second labeled entity being obtained by the information labeling model performing entity distinction on the training sample comprises:
    performing, by the information labeling model, word segmentation processing on the training sample to obtain a plurality of pieces of second field information; and
    performing, by the information labeling model, entity recognition processing on the plurality of pieces of second field information, and identifying the first labeled entity and the second labeled entity among the plurality of pieces of second field information.
  9. The model training method according to claim 7, wherein the relationship labeling information being obtained by the information labeling model performing relationship discrimination on the first labeled entity and the second labeled entity comprises:
    performing, by the information labeling model, category recognition processing on the first labeled entity and the second labeled entity to obtain first labeled category information corresponding to the first labeled entity and second labeled category information corresponding to the second labeled entity; and
    performing, by the information labeling model, relationship recognition processing on the first labeled category information and the second labeled category information to obtain the relationship labeling information.
  10. The model training method according to claim 7, wherein updating the parameters of the information labeling model according to the information labeling result and the label information comprises:
    obtaining a training loss value according to the information labeling result and the label information; and
    updating the parameters of the information labeling model according to the loss value until the loss value satisfies a training stop condition.
  11. An electronic device, comprising: a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the computer program, implements the information labeling method according to any one of claims 1 to 6, or implements the model training method according to any one of claims 7 to 10.
  12. A computer-readable storage medium storing a processor-executable program, wherein the processor-executable program, when executed by a processor, is used to implement the information labeling method according to any one of claims 1 to 6, or to implement the model training method according to any one of claims 7 to 10.
  13. A computer program product, comprising a computer program or computer instructions, wherein the computer program or the computer instructions are stored in a computer-readable storage medium, a processor of a computer device reads the computer program or the computer instructions from the computer-readable storage medium, and the processor executes the computer program or the computer instructions, causing the computer device to perform the information labeling method according to any one of claims 1 to 6, or to implement the model training method according to any one of claims 7 to 10.
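The labeling flow recited in claims 1 to 5 (word segmentation, entity recognition, category recognition, relationship recognition, then highlighting and labeling) can be illustrated with a minimal Python sketch. This is not the claimed implementation: the lookup tables `ENTITY_CATEGORIES` and `RELATION_BY_CATEGORIES`, the function names, and the bracket markup are all hypothetical stand-ins for a trained information labeling model and its output format.

```python
from dataclasses import dataclass

# Hypothetical lookup tables standing in for a trained information labeling model.
ENTITY_CATEGORIES = {"Alice": "PERSON", "Acme": "ORGANIZATION"}
RELATION_BY_CATEGORIES = {("PERSON", "ORGANIZATION"): "works_for"}


@dataclass
class LabelingResult:
    first_entity: str
    second_entity: str
    first_category: str
    second_category: str
    relation: str


def segment(text):
    """Word segmentation: split on whitespace and strip surrounding punctuation."""
    return [token.strip(".,;:!?") for token in text.split()]


def label(text):
    """Run segmentation, entity, category, and relationship recognition."""
    fields = segment(text)
    # Entity recognition: keep the fields that the toy lexicon knows about.
    entities = [field for field in fields if field in ENTITY_CATEGORIES]
    if len(entities) < 2:
        raise ValueError("at least two entities are required")
    first, second = entities[0], entities[1]
    # Category recognition for each entity.
    first_category = ENTITY_CATEGORIES[first]
    second_category = ENTITY_CATEGORIES[second]
    # Relationship recognition driven by the pair of categories.
    relation = RELATION_BY_CATEGORIES.get((first_category, second_category), "unknown")
    return LabelingResult(first, second, first_category, second_category, relation)


def annotate(text, result):
    """Highlight both entities with their categories, then append the relation."""
    marked = text
    for entity, category in (
        (result.first_entity, result.first_category),
        (result.second_entity, result.second_category),
    ):
        marked = marked.replace(entity, f"[{entity}|{category}]", 1)
    return f"{marked} <{result.first_entity} -{result.relation}-> {result.second_entity}>"


example_result = label("Alice joined Acme.")
example_target = annotate("Alice joined Acme.", example_result)
```

In this sketch the relationship is derived only from the two category labels, mirroring claim 3; a real model would presumably also use the surrounding context.
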
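The parameter update recited in claim 10 (compute a loss from the labeling result and the label information, then update the parameters until the loss satisfies a stop condition) can be sketched in the same hedged way. The sketch below swaps in a plain logistic-regression classifier trained by gradient descent; the feature vectors, learning rate, loss threshold, and step limit are all assumptions chosen for illustration and do not come from the patent.

```python
import math


def predict(weights, bias, features):
    """Probability that the relation holds, from a logistic model."""
    z = sum(w * x for w, x in zip(weights, features)) + bias
    return 1.0 / (1.0 + math.exp(-z))


def mean_loss(weights, bias, samples):
    """Mean binary cross-entropy between predictions and label information."""
    total = 0.0
    for features, label in samples:
        p = min(max(predict(weights, bias, features), 1e-12), 1.0 - 1e-12)
        total -= label * math.log(p) + (1 - label) * math.log(1.0 - p)
    return total / len(samples)


def train(samples, lr=0.5, loss_threshold=0.1, max_steps=10_000):
    """Update parameters until the loss meets the (assumed) stop condition."""
    weights = [0.0] * len(samples[0][0])
    bias = 0.0
    loss = mean_loss(weights, bias, samples)
    steps = 0
    while loss > loss_threshold and steps < max_steps:
        grad_w = [0.0] * len(weights)
        grad_b = 0.0
        for features, label in samples:
            error = predict(weights, bias, features) - label
            for i, x in enumerate(features):
                grad_w[i] += error * x
            grad_b += error
        n = len(samples)
        # Gradient descent step on the averaged gradient.
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
        loss = mean_loss(weights, bias, samples)
        steps += 1
    return weights, bias, loss, steps


# Hypothetical training samples: (feature vector, label); the label equals the first feature.
toy_samples = [([1.0, 0.0], 1), ([0.0, 1.0], 0), ([1.0, 1.0], 1), ([0.0, 0.0], 0)]
trained_weights, trained_bias, final_loss, used_steps = train(toy_samples)
```

The stop condition here is a loss threshold with a step cap as a safeguard; the claim leaves the exact condition open, so other criteria (for example a fixed number of epochs) would fit equally well.
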
PCT/CN2022/124185 2021-10-25 2022-10-09 Information labeling method, model training method, electronic device and storage medium WO2023071745A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202111241199.3A CN114003690A (en) 2021-10-25 2021-10-25 Information labeling method, model training method, electronic device and storage medium
CN202111241199.3 2021-10-25

Publications (1)

Publication Number Publication Date
WO2023071745A1 (en) 2023-05-04

Family

ID=79923813

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/124185 WO2023071745A1 (en) 2021-10-25 2022-10-09 Information labeling method, model training method, electronic device and storage medium

Country Status (2)

Country Link
CN (1) CN114003690A (en)
WO (1) WO2023071745A1 (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117034864A (en) * 2023-09-07 2023-11-10 广州市新谷电子科技有限公司 Visual labeling method, visual labeling device, computer equipment and storage medium
CN117172248A (en) * 2023-11-03 2023-12-05 翼方健数(北京)信息科技有限公司 Text data labeling method, system and medium
CN117236338A (en) * 2023-08-29 2023-12-15 北京工商大学 Named entity recognition model of dense entity text and training method thereof

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114003690A (en) * 2021-10-25 2022-02-01 南京中兴新软件有限责任公司 Information labeling method, model training method, electronic device and storage medium

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170032249A1 (en) * 2015-07-30 2017-02-02 Tata Consultancy Serivces Limited Automatic Entity Relationship (ER) Model Generation for Services as Software
CN111259669A (en) * 2018-11-30 2020-06-09 阿里巴巴集团控股有限公司 Information labeling method, information processing method and device
CN112069319A (en) * 2020-09-10 2020-12-11 杭州中奥科技有限公司 Text extraction method and device, computer equipment and readable storage medium
CN112163424A (en) * 2020-09-17 2021-01-01 中国建设银行股份有限公司 Data labeling method, device, equipment and medium
CN114003690A (en) * 2021-10-25 2022-02-01 南京中兴新软件有限责任公司 Information labeling method, model training method, electronic device and storage medium

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111737951B (en) * 2019-03-20 2022-10-14 北京大学 Text language incidence relation labeling method and device
CN113032469B (en) * 2019-12-24 2024-02-20 医渡云(北京)技术有限公司 Text structured model training and medical text structuring method and device
CN112347759A (en) * 2020-11-10 2021-02-09 华夏幸福产业投资有限公司 Method, device and equipment for extracting entity relationship and storage medium
CN112559770A (en) * 2020-12-15 2021-03-26 北京邮电大学 Text data relation extraction method, device and equipment and readable storage medium
CN112633001B (en) * 2020-12-28 2024-07-02 咪咕文化科技有限公司 Text named entity recognition method, device, electronic equipment and storage medium

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117236338A (en) * 2023-08-29 2023-12-15 北京工商大学 Named entity recognition model of dense entity text and training method thereof
CN117236338B (en) * 2023-08-29 2024-05-28 北京工商大学 Named entity recognition model of dense entity text and training method thereof
CN117034864A (en) * 2023-09-07 2023-11-10 广州市新谷电子科技有限公司 Visual labeling method, visual labeling device, computer equipment and storage medium
CN117034864B (en) * 2023-09-07 2024-05-10 广州市新谷电子科技有限公司 Visual labeling method, visual labeling device, computer equipment and storage medium
CN117172248A (en) * 2023-11-03 2023-12-05 翼方健数(北京)信息科技有限公司 Text data labeling method, system and medium
CN117172248B (en) * 2023-11-03 2024-01-30 翼方健数(北京)信息科技有限公司 Text data labeling method, system and medium

Also Published As

Publication number Publication date
CN114003690A (en) 2022-02-01

Similar Documents

Publication Publication Date Title
WO2023071745A1 (en) Information labeling method, model training method, electronic device and storage medium
US11734328B2 (en) Artificial intelligence based corpus enrichment for knowledge population and query response
US7426496B2 (en) Assisted form filling
CN109685056A (en) Obtain the method and device of document information
WO2023040493A1 (en) Event detection
CN112559734B (en) Brief report generating method, brief report generating device, electronic equipment and computer readable storage medium
CN113468887A (en) Student information relation extraction method and system based on boundary and segment classification
CN115952791A (en) Chapter-level event extraction method, device and equipment based on machine reading understanding and storage medium
CN116304023A (en) Method, system and storage medium for extracting bidding elements based on NLP technology
CN117952104A (en) Small sample triplet extraction method based on fusion of large model and knowledge graph
Li et al. A policy-based process mining framework: mining business policy texts for discovering process models
CN114528418B (en) Text processing method, system and storage medium
CN114818718A (en) Contract text recognition method and device
CN111488737A (en) Text recognition method, device and equipment
CN114356924A (en) Method and apparatus for extracting data from structured documents
CN112668335A (en) Method for identifying and extracting business license structured information by using named entity
CN115757325B (en) Intelligent conversion method and system for XES log
US12001779B2 (en) Method and system for automatically formulating an optimization problem using machine learning
CN115659989A (en) Web table abnormal data discovery method based on text semantic mapping relation
CN115130437A (en) Intelligent document filling method and device and storage medium
CN114186565A (en) User semantic analysis method in IT operation and maintenance service field
Arafat et al. Hydrating large-scale coronavirus pandemic tweets: A review of software for transportation research
US12093300B1 (en) Enhancing accuracy of entity matching inference using large language models
CN116991983B (en) Event extraction method and system for company information text
US20240143903A1 (en) System and method for machine learning architecture for electronic field autofill

Legal Events

Date Code Title Description
121 EP: the EPO has been informed by WIPO that EP was designated in this application (Ref document number: 22885626; Country of ref document: EP; Kind code of ref document: A1)
NENP Non-entry into the national phase (Ref country code: DE)
122 EP: PCT application non-entry in European phase (Ref document number: 22885626; Country of ref document: EP; Kind code of ref document: A1)