WO2018153316A1 - Method and apparatus for obtaining text extraction model - Google Patents
Method and apparatus for obtaining text extraction model
- Publication number
- WO2018153316A1 (PCT/CN2018/076605)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- text
- training
- extraction model
- corpus
- corpora
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/332—Query formulation
- G06F16/3329—Natural language query formulation
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
Definitions
- the present invention relates to the field of machine learning technology, and in particular, to a method and apparatus for acquiring a text extraction model.
- Machine learning technology refers to the technology that computers improve performance by summarizing data such as text or images, and is widely used in data mining, computer vision, natural language processing, and robotics.
- The text extraction model is usually acquired by using machine learning technology and is applied to a chat robot, so that the chat robot can extract, from the user's corpus, the text that expresses the user's needs and respond to that text.
- For example, the chat robot can provide a ticket service; for a corpus such as “I want to buy a train ticket”, the manually marked text is “train ticket”.
- the training text collection is completely obtained by manual annotation. Due to the large amount of corpus data required for obtaining the text extraction model and the low efficiency of manual annotation, the training process of the text extraction model consumes a lot of labor cost and time cost.
- Embodiments of the present disclosure provide a method and apparatus for acquiring a text extraction model, which can reduce labor cost and time cost.
- the technical solution is as follows:
- a method of obtaining a text extraction model comprising:
- the first text extraction model being obtained according to a manually labeled first training text set, the first training text set comprising a plurality of training corpora and a plurality of annotated texts of the plurality of training corpora;
- the second training text set including a plurality of first training corpora and a plurality of first target texts extracted from the plurality of first training corpora by the first text extraction model, each of the first target texts being a correct text extracted from the first training corpus;
- an apparatus for obtaining a text extraction model comprising:
- a model obtaining module configured to obtain a first text extraction model, where the first text extraction model is obtained according to a manually labeled first training text set, and the first training text set includes a plurality of training corpora and a plurality of annotated texts of the plurality of training corpora;
- a training text set acquisition module configured to acquire a second training text set if the extraction accuracy of the first text extraction model is lower than a preset threshold, where the second training text set includes a plurality of first training corpora and a plurality of first target texts extracted from the plurality of first training corpora by the first text extraction model, each first target text being a correct text extracted from the first training corpus;
- the model obtaining module is configured to acquire a second text extraction model according to the first training text set and the second training text set.
- an electronic device comprising a memory and a processor, the memory for storing instructions, the processor being configured to execute the instructions to perform the following steps:
- the first text extraction model being obtained according to a manually labeled first training text set, the first training text set comprising a plurality of training corpora and a plurality of annotated texts of the plurality of training corpora;
- the second training text set including a plurality of first training corpora and a plurality of first target texts extracted from the plurality of first training corpora by the first text extraction model, each of the first target texts being a correct text extracted from the first training corpus;
- In the embodiments of the present disclosure, a first text extraction model is acquired, and if the extraction accuracy of the first text extraction model is lower than a preset threshold, a second training text set is acquired, where the second training text set includes a plurality of first training corpora and a plurality of first target texts extracted from the plurality of first training corpora by the first text extraction model. The second training text set is thus obtained by using the acquired first text extraction model, without manual labeling. Further, a second text extraction model is obtained according to the first training text set and the second training text set, so that the process of acquiring the text extraction model tends to be automated. Since obtaining a training text set through the model is much more efficient than manual labeling, the acquisition method can greatly reduce labor cost and time cost.
- FIG. 1 is a schematic diagram of an implementation environment for acquiring a text extraction model according to an embodiment of the present disclosure
- FIG. 2 is a flowchart of a method for acquiring a text extraction model according to an embodiment of the present disclosure
- FIG. 3 is a flowchart of acquiring training text according to an embodiment of the present disclosure
- FIG. 5 is a block diagram of an apparatus for acquiring a text extraction model according to an embodiment of the present disclosure
- FIG. 6 is a block diagram of an apparatus for acquiring a text extraction model according to an embodiment of the present disclosure
- FIG. 7 is a block diagram of an apparatus 700 for acquiring a text extraction model according to an embodiment of the present disclosure.
- FIG. 1 is a schematic diagram of an implementation environment for acquiring a text extraction model according to an embodiment of the present disclosure.
- the implementation environment includes:
- At least one server 101, at least one chat bot 102, and at least one terminal 103 (e.g., a mobile terminal or a desktop computer).
- The server 101 is configured to acquire a first text extraction model and, if the extraction accuracy of the first text extraction model is lower than a preset threshold, to acquire a second training text set and acquire a second text extraction model according to the acquired training text sets.
- the acquired text extraction model is applied to the chatbot 102 or the terminal 103.
- the chat bot 102 is configured to acquire or update a text extraction model according to the control of the server 101, and provide various services such as a chat service to the user based on the control of the server 101.
- the smart chat application provided by the server 101 is installed on the terminal 103, and the text extraction model is acquired or updated according to the control of the server 101.
- the chat bot 102 can also be a server of the smart chat application, and can provide various services to the terminal 103.
- the chat bot 102 can run on the server 101 or on a plurality of customer service terminals provided by the server 101.
- the text extraction model described above can be stored on the device where the chat bot 102 is located, and is applied by the chat bot 102 when it is received.
- The server 101 can also be configured with at least one database, such as a chat database, a user database, a user authentication database, and the like.
- the chat database is used to store a conversation corpus between the user and the chat bot (or smart chat application), the conversation corpus may identify the time stamp of the conversation, or the service record of the conversation, and the like;
- the user database is used for storing User behavior data, such as logs and comments posted by the user, user's praise behavior and scoring behavior, etc.;
- the user authentication database is used to store the user's username and user password.
- FIG. 2 is a flowchart of a method for acquiring a text extraction model according to an embodiment of the present disclosure.
- the method can be applied to any device, and the device has at least a processor and a memory, and the set of training samples in the memory can be processed by the processor to obtain a text extraction model.
- the method specifically includes: 201: The server acquires a first text extraction model, and the first text extraction model is obtained according to the manually labeled first training text set.
- The first training text set is used to generate a text extraction model. The first training text set includes a plurality of training corpora and the correct texts obtained by manually marking the plurality of training corpora; a training corpus and the correct text marked from it constitute a pair of training texts.
- the embodiment of the present disclosure does not limit the form of the training corpus.
- the training corpus can be in the form of a single sentence, or in the form of a conversation.
- One or more correct texts may be marked from a training corpus, and they are generally related to the service provided by the chat bot (or smart chat application) to which the text extraction model is applied. For example, if the training corpus is "how to go to Hangzhou", the correct text marked can be "Hangzhou"; if the training corpus is "I want to buy a ticket to Tianjin", the correct texts marked can be "Tianjin" and "ticket".
- The server may obtain multiple training corpora from its own database or from the network, and the correct text is manually extracted from the multiple training corpora to obtain the first training text set. The server then trains on the first training text set, that is, it extracts features of each pair of training texts (e.g., context features), determines the values of the respective parameters of the initial extraction model according to the extracted features, and obtains a first text extraction model with known parameters.
- The initial extraction model may be, but is not limited to, a CRF (Conditional Random Field) model or an HMM (Hidden Markov Model).
- Some training corpora contain no text that can be marked, such as "what's wrong" and "why".
- The embodiments of the present disclosure do not limit the manner in which these training corpora are processed. For example, such a training corpus may be discarded directly without being marked; alternatively, a default label, such as "None", may be manually added to a training corpus to indicate that no text can be marked from it.
- The server may store the discarded training corpora, or the training corpora to which the default label was added, as reference corpora to be filtered out. After obtaining the initial training corpora, the server can filter out the initial training corpora that are the same as the reference corpora and obtain the filtered training corpora.
- Each parameter of the initial extraction model may be initialized, and during training, stochastic gradient descent with forward and backward propagation may be used to optimize each parameter of the text extraction model so as to minimize the error of the text extraction model.
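The reference-corpus filtering described above can be sketched as follows. This is a minimal illustration, assuming corpora are plain strings compared by exact equality; the function name `filter_training_corpora` and the sample data are hypothetical and not taken from the disclosure.

```python
from typing import Iterable, List

def filter_training_corpora(initial: Iterable[str], reference: Iterable[str]) -> List[str]:
    """Drop every initial training corpus that equals a stored reference corpus.

    Assumption: "the same" means exact string equality.
    """
    reference_set = set(reference)
    return [corpus for corpus in initial if corpus not in reference_set]

# Reference corpora: previously discarded or labeled "None" because no text could be marked.
reference_corpora = ["what's wrong", "why"]
initial_corpora = ["how to go to Hangzhou", "why", "I want to buy a ticket to Tianjin"]
filtered_corpora = filter_training_corpora(initial_corpora, reference_corpora)
```

A real system would probably normalize case and punctuation before comparing; that step is omitted here.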
- The number of training texts in the first training text set is much smaller than the number of training texts originally required for training. For example, if the number of training texts originally required is N, the number of training texts required by the embodiments of the present disclosure may be 50%*N.
- The server acquires a second training text set, where the second training text set includes a plurality of first training corpora and a plurality of first target texts extracted from the plurality of first training corpora by the first text extraction model.
- The text extracted by the first text extraction model may be correct or incorrect. To ensure that the text extraction model obtained according to the second training text set has higher extraction accuracy, in the embodiments of the present disclosure each first target text is the correct text that should be extracted from the first training corpus.
- The server determines the extraction accuracy of the first text extraction model and determines whether the extraction accuracy is lower than a preset threshold. If so, the second training text set is acquired; otherwise, the server determines that the first text extraction model is usable.
- The embodiments of the present disclosure do not limit the preset threshold. For example, the preset threshold is 80%.
- the server may continue to acquire the second training text collection.
- After the first training corpora are obtained, the server may directly take the text extracted by the first text extraction model as the first target text, or the extracted text may be manually confirmed; the specific process of obtaining the second training text set is described below.
- The server can determine the extraction accuracy using the following steps (1)-(3):
- (1) The server obtains a test text set, which includes a plurality of test corpora and a plurality of correct texts manually marked from the plurality of test corpora.
- The acquisition process of the test text set is the same as that of the first training text set, but the test text set is used to test the extraction accuracy of the first text extraction model.
- (2) For each test corpus of the plurality of test corpora, the server extracts the second text from the test corpus through the first text extraction model.
- The server inputs each test corpus into the first text extraction model and takes the text that the model outputs for that test corpus as the second text.
- (3) The server determines the ratio of the number of second texts that are the same as any correct text to the number of the plurality of correct texts as the extraction accuracy of the first text extraction model.
- The server may determine the number A of the plurality of correct texts (which also equals the number of the plurality of test corpora) and determine whether the second text corresponding to each test corpus is the same as the correct text corresponding to that test corpus; if they are the same, the second text is counted, and otherwise it is ignored. The server can then determine the number B of second texts that are the same as any correct text and take the ratio of B to A as the extraction accuracy of the first text extraction model.
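The accuracy computation in steps (1)-(3) can be sketched as below. This is an illustrative sketch only: `toy_extract` is a hypothetical stand-in for the first text extraction model (it simply returns the last word), and A is taken to be the number of test corpora, as the text above states.

```python
from typing import Callable, List, Set, Tuple

# Hypothetical representation: (test corpus, set of manually marked correct texts).
TestPair = Tuple[str, Set[str]]

def extraction_accuracy(extract: Callable[[str], str], test_set: List[TestPair]) -> float:
    """Return B / A: A is the number of test corpora, B the number of second
    texts that equal a correct text marked for the same test corpus."""
    if not test_set:
        raise ValueError("test set must not be empty")
    a = len(test_set)
    b = sum(1 for corpus, correct in test_set if extract(corpus) in correct)
    return b / a

def toy_extract(corpus: str) -> str:
    # Stand-in for the first text extraction model, not a real extractor.
    return corpus.split()[-1]

test_set = [
    ("how to go to Hangzhou", {"Hangzhou"}),
    ("I want to buy a ticket to Tianjin", {"Tianjin", "ticket"}),
    ("book a train ticket please", {"train ticket"}),
]
accuracy = extraction_accuracy(toy_extract, test_set)
needs_second_training_set = accuracy < 0.8  # 80% is the example preset threshold
```

Here two of the three second texts match, so the accuracy falls below the example threshold and a second training text set would be acquired.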
- The process of the server acquiring the second training text set may specifically be as follows: if the extraction accuracy of the first text extraction model is lower than a preset threshold, the server acquires a plurality of first training corpora; for each of the plurality of first training corpora, the server extracts the first text from the first training corpus through the first text extraction model; if the first text is correct, the first training corpus and the first text are used as a pair of training texts in the second training text set; if the first text is wrong, the first training corpus and the manually corrected text are used as a pair of training texts in the second training text set.
- The server may input each first training corpus into the first text extraction model and take the text output for that training corpus as the first text. The server then obtains judgment information manually added to the first text, where the judgment information indicates whether the first text is correct. If the obtained judgment information indicates that the first text is correct, the server may directly use the first training corpus and the first text as a pair of training texts in the second training text set; if the obtained judgment information indicates that the first text is wrong, the server may acquire the manually corrected text carried in the judgment information and use the first training corpus and the manually corrected text as a pair of training texts in the second training text set.
- When judging whether the first texts are correct, the human reviewer does not have to operate on every first text but may directly correct only the wrong first texts, so that the server obtains the manually corrected texts with their corresponding first training corpora and directly acquires the remaining, uncorrected first texts with their corresponding first training corpora.
- The first training corpora can be obtained from the network or from the server's own database. For example, to better understand user needs, the database can be the user database; or, to make the training corpora closer to the actual application environment of the text extraction model and improve the model's hit rate on user corpora at application time, the database can be the chat database or the like.
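The construction of the second training text set with manual confirmation can be sketched as follows. This is a simplified illustration under stated assumptions: the judgment information is modeled as a dictionary `corrections` that holds a corrected text only for the first texts a reviewer found wrong, and `toy_extract` is a hypothetical stand-in for the first text extraction model.

```python
from typing import Callable, Dict, List, Tuple

def build_second_training_set(
    extract: Callable[[str], str],
    first_training_corpora: List[str],
    corrections: Dict[str, str],
) -> List[Tuple[str, str]]:
    """Pair each first training corpus with its first target text.

    Assumption: a corpus missing from `corrections` means the reviewer accepted
    the extracted first text; otherwise the corrected text replaces it.
    """
    pairs = []
    for corpus in first_training_corpora:
        first_text = extract(corpus)
        target_text = corrections.get(corpus, first_text)
        pairs.append((corpus, target_text))
    return pairs

def toy_extract(corpus: str) -> str:
    # Stand-in for the first text extraction model: returns the last word.
    return corpus.split()[-1]

corpora = ["how to go to Hangzhou", "I want to buy a train ticket"]
corrections = {"I want to buy a train ticket": "train ticket"}  # reviewer fixed "ticket"
second_training_set = build_second_training_set(toy_extract, corpora, corrections)
```

Only the wrong extraction needs human input, which is where the labor saving described in the text comes from.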
- the server can adopt at least two acquisition methods:
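- Obtaining mode 1: If the extraction accuracy of the first text extraction model is lower than a preset threshold, the server acquires the dialog corpora within a preset time period from the chat database and uses the dialog corpora within the preset time period as the plurality of first training corpora.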
- the server may acquire the dialog corpus within the preset time period.
- the embodiment of the present disclosure does not specifically limit the preset time period.
- the preset time period may be the latest month.
- the preset time period may be matched with the time period for providing the service, and the dialog corpus is separately obtained for each time period.
- the service period is divided as follows: the ticket service period is daytime, and the ticket consultation service period is nighttime.
- the server may query the chat corpus in the chat database according to the preset time period, and use the plurality of query corpora as the plurality of first training corpora.
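Obtaining mode 1 can be sketched as a simple time-window query. This is a minimal sketch, assuming the chat database is an in-memory list of records with a timestamp; the `DialogRecord` type, the function name, and the 30-day approximation of "the latest month" are all hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List

@dataclass
class DialogRecord:
    """Hypothetical chat-database row: one dialog corpus and its timestamp."""
    text: str
    timestamp: datetime

def corpora_in_period(chat_db: List[DialogRecord], start: datetime, end: datetime) -> List[str]:
    """Return the dialog corpora whose timestamp falls in [start, end)."""
    return [record.text for record in chat_db if start <= record.timestamp < end]

chat_db = [
    DialogRecord("I want to buy a ticket to Tianjin", datetime(2018, 2, 1, 10, 0)),
    DialogRecord("how to go to Hangzhou", datetime(2017, 12, 1, 21, 30)),
]
period_end = datetime(2018, 2, 13)
period_start = period_end - timedelta(days=30)  # "the latest month", approximated as 30 days
first_training_corpora = corpora_in_period(chat_db, period_start, period_end)
```

The same function could be called once per service period (for example, daytime and nighttime windows) to obtain the corpora for each period separately.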
- Obtaining mode 2: If the extraction accuracy of the first text extraction model is lower than a preset threshold, the server filters out the dialog corpora of successful dialogues from the chat database and uses the dialog corpora of successful dialogues as the plurality of first training corpora. A dialog corpus of a successful dialogue refers to a dialog corpus in which the chat bot successfully provides a service for the user.
- the dialogue corpus of the dialogue success can be obtained as the first training corpus.
- The server can determine whether a dialog corpus belongs to a successful dialogue in at least three ways:
- Judgment mode 1: When a keyword indicating a successful dialogue is present in any dialog corpus, the server determines that dialog corpus as a dialog corpus of a successful dialogue.
- The embodiments of the present disclosure do not limit the keywords that indicate a successful dialogue. For example, considering that the user usually expresses gratitude when the conversation is successful, the keywords for a successful dialogue can be: good, thank you.
- the chat robot's reply may also include some keywords for successful dialogue when the conversation is successful, such as: no problem, no thanks.
- Judgment mode 2: When a keyword indicating a failed dialogue is present in any dialog corpus, the server filters out that dialog corpus and determines the remaining dialog corpora as dialog corpora of successful dialogues.
- the embodiment of the present disclosure does not limit keywords that fail in the dialogue.
- the keyword that the dialog fails may be: You are wrong, not the meaning.
- The chat robot's reply may also include some keywords indicating a failed dialogue, for example: don't mind, don't understand what you mean, please say it again.
- Judgment mode 3: When any dialog corpus has a corresponding service record, the server determines that dialog corpus as a dialog corpus of a successful dialogue.
- the dialogue corpus corresponding to the presence of the service record can be used as a dialog corpus for successful dialogue.
- Two or more of the above three judgment modes may also be combined, for example, both detecting keywords in a dialog corpus and detecting whether the dialog corpus has a corresponding service record. When either check indicates success, the server determines the dialog corpus as a dialog corpus of a successful dialogue, so as to avoid missing dialog corpora that have reference value and to improve data utilization.
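One possible combination of the three judgment modes is sketched below. The combination order is an assumption made for illustration (failure keywords veto first, then a success keyword or a service record is sufficient); the keyword tuples reuse the examples from the text, and plain lowercase substring matching is a simplification.

```python
from typing import Iterable

# Example keywords taken from the text; real lists would be tuned per service.
SUCCESS_KEYWORDS = ("good", "thank you", "no problem", "no thanks")
FAILURE_KEYWORDS = ("you are wrong", "not the meaning", "please say it again")

def contains_any(text: str, keywords: Iterable[str]) -> bool:
    lowered = text.lower()
    return any(keyword in lowered for keyword in keywords)

def is_successful_dialog(dialog_text: str, has_service_record: bool) -> bool:
    """Combine judgment modes 1-3 (hypothetical combination rule)."""
    if contains_any(dialog_text, FAILURE_KEYWORDS):   # mode 2: filter out failures
        return False
    if has_service_record:                            # mode 3: service record exists
        return True
    return contains_any(dialog_text, SUCCESS_KEYWORDS)  # mode 1: success keyword

ok_by_keyword = is_successful_dialog("A ticket to Tianjin, booked. Thank you", False)
vetoed = is_successful_dialog("You are wrong, I asked about trains", True)
ok_by_record = is_successful_dialog("book a ticket to Tianjin", True)
undecided = is_successful_dialog("how is the weather", False)
```

Substring matching is crude (for example, "good" also matches "goodbye"), so a production filter would likely use tokenization or phrase matching.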
- the server acquires a second text extraction model according to the first training text set and the second training text set.
- the server may re-train the two training text sets to obtain a second text extraction model.
- If the extraction accuracy of the second text extraction model is lower than the preset threshold, the server may continue to acquire a training text set and perform the next round of model training based on each acquired training text set, until the extraction accuracy of the second text extraction model obtained by training is not lower than the preset threshold. The training text set includes a plurality of second training corpora and a plurality of second target texts extracted from the plurality of second training corpora by the second text extraction model obtained by training.
- FIG. 4 is a flowchart of an iterative model provided by an embodiment of the present disclosure.
- The server may determine the extraction accuracy of the second text extraction model using the method for determining extraction accuracy in step 202. If the determined extraction accuracy is not lower than the preset threshold, the server determines that the second text extraction model is usable. If the determined extraction accuracy is lower than the preset threshold, the server continues to acquire a training text set, where the specific acquisition process is the same as that of the second training text set, and performs training based on the first training text set, the second training text set, and the newly acquired training text set, thereby obtaining a more accurate text extraction model. The server then re-determines the extraction accuracy of that text extraction model and, if it is still lower than the preset threshold, continues to acquire training text sets, until the extraction accuracy of the text extraction model obtained iteratively is not lower than the preset threshold.
- The text extraction model may be temporarily stored to wait for an instruction to apply it, or it may be applied directly, for example, by applying the text extraction model to the chat bot or by updating it to the smart chat application on the user's terminal.
- The dialog message is input to the second text extraction model obtained by training to obtain the semantic information of the dialog message; a reply message for the dialog message is determined from the reply message database according to that semantic information and returned to the questioning user. Because the second text extraction model extracts semantics with higher accuracy, a better dialogue effect is achieved and the intelligence of the chat bot is improved.
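The iterative procedure can be sketched as a loop that keeps adding training text sets until the accuracy reaches the threshold. This is a toy sketch under stated assumptions: `train`, `acquire`, and `evaluate` are caller-supplied callables, the stand-in "model" merely memorizes pairs, and `max_rounds` is a safety cap added here that the disclosure does not mention.

```python
from typing import Callable, List, Tuple

Pair = Tuple[str, str]        # (training corpus, target text)
Model = Callable[[str], str]  # maps a corpus to an extracted text

def train_until_accurate(
    train: Callable[[List[Pair]], Model],
    acquire: Callable[[Model], List[Pair]],
    evaluate: Callable[[Model], float],
    first_set: List[Pair],
    threshold: float = 0.8,
    max_rounds: int = 10,  # assumption: guard against endless iteration
) -> Tuple[Model, int]:
    """Retrain on all training text sets acquired so far until accuracy >= threshold."""
    pairs = list(first_set)
    model = train(pairs)
    rounds = 0
    while evaluate(model) < threshold and rounds < max_rounds:
        pairs.extend(acquire(model))  # new set built with the current model
        model = train(pairs)          # retrain on every set acquired so far
        rounds += 1
    return model, rounds

# Toy stand-ins: the "model" is a lookup table of memorized pairs.
def toy_train(pairs: List[Pair]) -> Model:
    table = dict(pairs)
    return lambda corpus: table.get(corpus, "")

TEST_SET = [("a", "1"), ("b", "2"), ("c", "3"), ("d", "4"), ("e", "5")]

def toy_evaluate(model: Model) -> float:
    return sum(model(c) == t for c, t in TEST_SET) / len(TEST_SET)

batches = iter([[("b", "2"), ("c", "3")], [("d", "4"), ("e", "5")]])

def toy_acquire(model: Model) -> List[Pair]:
    return next(batches, [])

final_model, rounds = train_until_accurate(toy_train, toy_acquire, toy_evaluate, [("a", "1")])
```

In this toy run the accuracy goes from 0.2 to 0.6 to 1.0, so two additional training text sets are acquired before the loop stops.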
- In the embodiments of the present disclosure, a first text extraction model is acquired, and if the extraction accuracy of the first text extraction model is lower than a preset threshold, a second training text set is acquired, where the second training text set includes a plurality of first training corpora and a plurality of first target texts extracted from the plurality of first training corpora by the first text extraction model. The second training text set is thus obtained by using the acquired first text extraction model, without manual labeling. Further, a second text extraction model is obtained according to the first training text set and the second training text set, so that the process of acquiring the text extraction model tends to be automated. Since obtaining a training text set through the model is much more efficient than manual labeling, the acquisition method can greatly reduce labor cost and time cost.
- a specific method for obtaining a second training text set is provided, by acquiring a first training corpus, and extracting a first text from the first training corpus by using the first text extraction model, and if the first text is correct, directly a training corpus and a first text as a pair of training texts in the second training text set, and if the first text is wrong, acquiring the artificially corrected text and the first training corpus as a pair of training texts in the second training text set Since the second training text set is obtained by the first text extraction model and manually confirmed, the accuracy of the second training text set is ensured while ensuring the acquisition efficiency of the second training text set.
- The dialog corpora within the preset time period may be obtained from the chat database, or, to give the first training corpora stronger reference value, the dialog corpora of successful dialogues may be obtained from the chat database.
- A specific method for determining extraction accuracy is provided: a test text set is acquired, a second text is extracted from each test corpus through the first text extraction model, the number of second texts identical to any correct text and the number of correct texts are determined, and the ratio of the former to the latter is taken as the extraction accuracy of the first text extraction model, thereby providing a specific method for testing whether the first text extraction model meets the standard.
- The extraction accuracy of the current text extraction model may also be determined. If it is lower than the preset threshold, training text sets continue to be acquired and training is performed based on the acquired training text sets until the extraction accuracy of the resulting text extraction model is not lower than the preset threshold, so that the acquired text extraction model is continuously optimized through iteration and a text extraction model with high extraction accuracy is finally obtained.
- FIG. 5 is a block diagram of an apparatus for acquiring a text extraction model according to an embodiment of the present disclosure.
- the device specifically includes:
- the model obtaining module 501 is configured to obtain a first text extraction model, where the first text extraction model is obtained according to the manually labeled first training text set, and the first training text set includes a plurality of training corpora and a plurality of annotated texts of the plurality of training corpora;
- the training text set obtaining module 502 is configured to obtain a second training text set if the extraction accuracy of the first text extraction model is lower than a preset threshold, where the second training text set includes a plurality of first training corpora and a plurality of first target texts extracted from the plurality of first training corpora by the first text extraction model, each of the first target texts being the correct text extracted from the first training corpus;
- the model obtaining module 501 is further configured to acquire the second text extraction model according to the first training text set and the second training text set.
- the training text collection obtaining module 502 is configured to:
- acquire the dialog corpora within a preset time period from the chat database and use the dialog corpora within the preset time period as the plurality of first training corpora, where the chat database is used for storing dialog corpora between the user and the chat bot.
- the training text collection obtaining module 502 is configured to:
- filter out the dialog corpora of successful dialogues from the chat database and use the dialog corpora of successful dialogues as the plurality of first training corpora, where the chat database is used to store dialog corpora between the user and the chat robot, and a dialog corpus of a successful dialogue refers to a dialog corpus in which the chat robot successfully provides a service for the user.
- the device further includes:
- the test text collection obtaining module 503 is configured to obtain a test text set, where the test text set includes a plurality of test corpora and a plurality of correct texts manually marked from the plurality of test corpora;
- the extracting module 504 is configured to, for each of the plurality of test corpora, extract a second text from the test corpus by using the first text extraction model;
- the determining module 505 is configured to determine, as the extraction accuracy of the first text extraction model, the ratio of the number of second texts that are identical to any of the correct texts to the number of the plurality of correct texts.
- the training text collection obtaining module 502 is further configured to continue to obtain a training text set if the extraction accuracy of the second text extraction model obtained by the current training is lower than the preset threshold;
- the model obtaining module 501 is further configured to perform the next training based on the training text sets already obtained, until the extraction accuracy of the second text extraction model obtained by training is not lower than the preset threshold, where the training text set includes a plurality of second training corpora and a plurality of second target texts extracted from the plurality of second training corpora by the second text extraction model obtained by training.
- when the apparatus for obtaining a text extraction model provided by the foregoing embodiment obtains the text extraction model, the division into the functional modules described above is used only as an example. In practice, the functions may be assigned to different functional modules as needed; that is, the internal structure of the apparatus may be divided into different functional modules to complete all or part of the functions described above.
- the apparatus for obtaining a text extraction model and the method for obtaining a text extraction model provided by the foregoing embodiments belong to the same concept; the specific implementation process is described in detail in the method embodiment and is not repeated here.
- FIG. 7 is a block diagram of an electronic device 700 according to an embodiment of the present disclosure.
- electronic device 700 includes a processing component 722 that further includes one or more processors, and memory resources represented by memory 732 for storing instructions executable by processing component 722, such as an application.
- An application stored in memory 732 can include one or more modules each corresponding to a set of instructions.
- the processing component 722 is configured to execute instructions to perform a method of obtaining a text extraction model:
- obtaining a first text extraction model, the first text extraction model being obtained according to a manually labeled first training text set, the first training text set comprising a plurality of training corpora and a plurality of annotated texts of the plurality of training corpora;
- if the extraction accuracy of the first text extraction model is lower than a preset threshold, obtaining a second training text set, where the second training text set includes a plurality of first training corpora and a plurality of first target texts extracted from the plurality of first training corpora by the first text extraction model, each of the first target texts being a correct text extracted from a first training corpus;
- obtaining a second text extraction model according to the first training text set and the second training text set.
- the processor is configured to execute the instructions to perform the steps of:
- a conversation message is input to the second text extraction model to obtain semantic information of the conversation message;
- the processor is configured to execute the instructions to perform the steps of:
- if the first text is correct, the first training corpus and the first text are used as a pair of training texts in the second training text set;
- if the first text is wrong, the first training corpus and the manually corrected text are used as a pair of training texts in the second training text set.
- the processor is configured to execute the instructions to perform the steps of:
- the conversation corpora within a preset time period are obtained from the chat database and used as the plurality of first training corpora, the chat database being used to store conversation corpora between users and the chatbot.
- the processor is configured to execute the instructions to perform the steps of:
- the conversation corpora of successful conversations are selected from the chat database and used as the plurality of first training corpora.
- the chat database is used to store conversation corpora between users and the chatbot, and a conversation corpus of a successful conversation is one in which the chatbot successfully provided a service to the user.
- the processor is configured to execute the instructions to perform the steps of:
- a test text set is obtained, the test text set including a plurality of test corpora and a plurality of correct texts manually marked from the plurality of test corpora;
- the ratio of the number of second texts that are identical to any of the correct texts to the number of the plurality of correct texts is determined as the extraction accuracy of the first text extraction model.
- the processor is configured to execute the instructions to perform the steps of:
- if the extraction accuracy of the second text extraction model obtained by the current training is lower than the preset threshold, a training text set continues to be obtained and the next training is performed based on the training text sets already obtained, until the extraction accuracy of the second text extraction model obtained by training is not lower than the preset threshold, where the training text set includes a plurality of second training corpora and a plurality of second target texts extracted from the plurality of second training corpora by the second text extraction model obtained by training.
- the electronic device 700 can also include a power supply component 726 configured to perform power management of the electronic device 700, a wired or wireless network interface 750 configured to connect the device 700 to the network, and an input/output (I/O) interface 758.
- the electronic device 700 may operate based on an operating system stored in the memory 732, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™ or the like.
- a computer readable storage medium is also provided, such as a memory comprising instructions executable by a processor in a terminal to perform the method of obtaining a text extraction model in the embodiments described below.
- the computer readable storage medium can be a ROM, a random access memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, or an optical data storage device.
- a person skilled in the art may understand that all or part of the steps of the above embodiments may be implemented by hardware, or by a program instructing the relevant hardware, and the program may be stored in a computer readable storage medium.
- the storage medium mentioned may be a read only memory, a magnetic disk, an optical disk, or the like.
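The iterative training described in the list above (keep obtaining training text sets and retraining until the extraction accuracy is no longer below the preset threshold) can be sketched roughly as follows. This is only an illustration: the callables `train`, `evaluate`, `fetch_corpora` and `review`, and the `max_rounds` safety limit, are hypothetical names introduced here and do not come from the disclosure.

```python
from typing import Callable, List, Tuple

Pair = Tuple[str, str]        # (training corpus, target text)
Model = Callable[[str], str]  # maps a corpus to the text extracted from it


def train_until_accurate(
    train: Callable[[List[Pair]], Model],
    evaluate: Callable[[Model], float],
    fetch_corpora: Callable[[], List[str]],
    review: Callable[[str, str], str],
    first_training_set: List[Pair],
    threshold: float,
    max_rounds: int = 10,  # assumption: a guard against endless retraining
) -> Model:
    """Retrain until accuracy is not lower than `threshold` or rounds run out."""
    all_pairs = list(first_training_set)
    model = train(all_pairs)
    for _ in range(max_rounds):
        if evaluate(model) >= threshold:
            break
        # Label fresh corpora with the current model. `review` returns the
        # extracted text unchanged when it is correct, or a corrected text.
        new_pairs = [(c, review(c, model(c))) for c in fetch_corpora()]
        all_pairs.extend(new_pairs)
        model = train(all_pairs)
    return model
```

Passing the training, evaluation and review steps in as callables keeps the loop independent of any particular model type, which matches the disclosure's statement that the model is not limited to a CRF or an HMM.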
Landscapes
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- Theoretical Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Computational Linguistics (AREA)
- General Engineering & Computer Science (AREA)
- Artificial Intelligence (AREA)
- Mathematical Physics (AREA)
- General Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Health & Medical Sciences (AREA)
- Human Computer Interaction (AREA)
- Data Mining & Analysis (AREA)
- Databases & Information Systems (AREA)
- Machine Translation (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
A method and apparatus for obtaining a text extraction model, which relate to the technical field of machine learning. The method comprises: obtaining a first text extraction model, the first text extraction model being obtained according to a manually-marked first training text collection; if the extraction accuracy of the first text extraction model is lower than a preset threshold, obtaining a second training text collection, the second training text collection comprising multiple first training corpora and multiple first target texts extracted from the multiple first training corpora by means of the first text extraction model; and obtaining a second text extraction model according to the first training text collection and the second training text collection. The second training text collection is obtained by means of the first text extraction model, so that the process of obtaining the text extraction model tends to be automated, and accordingly, labor costs and time costs are reduced.
Description
This application claims priority to Chinese Patent Application No. 2017101077875, entitled "Method and Apparatus for Obtaining a Text Extraction Model", filed with the Chinese National Intellectual Property Office on February 27, 2017, the entire contents of which are incorporated herein by reference.
The present invention relates to the field of machine learning technologies, and in particular, to a method and apparatus for obtaining a text extraction model.
Machine learning refers to techniques by which a computer improves its performance by generalizing from data such as text or images; it is widely applied in data mining, computer vision, natural language processing, robotics, and other fields. For example, to enable a chatbot to understand the meaning of natural language and thus interact with users, machine learning is commonly used to obtain a text extraction model, which is then applied to the chatbot so that the chatbot extracts, from the corpus of its conversation with the user, the text that expresses the user's need and responds to that text.
Generally, obtaining a text extraction model requires collecting a large number of corpora and manually labeling, in each corpus, the text that expresses the user's need. The corpora and their corresponding labeled texts are used as a training text set, and model training is performed on the training text set to obtain a text extraction model, which can be used to extract the text expressing the user's need from a corpus. The manually labeled text is generally related to the service provided by the chatbot. For example, if the chatbot provides a ticketing service and a corpus is "I want to buy a train ticket", the manually labeled text is "train ticket".
In the process of implementing the present invention, the inventors found that the prior art has at least the following problem:
The training text set is obtained entirely by manual labeling. Because the amount of corpus data required to obtain a text extraction model is huge and manual labeling is inefficient, training the text extraction model consumes substantial labor and time.
Summary of the invention
Embodiments of the present disclosure provide a method and apparatus for obtaining a text extraction model, which can reduce labor costs and time costs. The technical solution is as follows:
In one aspect, a method for obtaining a text extraction model is provided, the method comprising:
obtaining a first text extraction model, the first text extraction model being obtained according to a manually labeled first training text set, the first training text set comprising a plurality of training corpora and a plurality of labeled texts in the plurality of training corpora;
if the extraction accuracy of the first text extraction model is lower than a preset threshold, obtaining a second training text set, the second training text set comprising a plurality of first training corpora and a plurality of first target texts extracted from the plurality of first training corpora by the first text extraction model, each first target text being the correct text extracted from a first training corpus; and
obtaining a second text extraction model according to the first training text set and the second training text set.
In another aspect, an apparatus for obtaining a text extraction model is provided, the apparatus comprising:
a model obtaining module, configured to obtain a first text extraction model, the first text extraction model being obtained according to a manually labeled first training text set, the first training text set comprising a plurality of training corpora and a plurality of labeled texts in the plurality of training corpora;
a training text set obtaining module, configured to obtain a second training text set if the extraction accuracy of the first text extraction model is lower than a preset threshold, the second training text set comprising a plurality of first training corpora and a plurality of first target texts extracted from the plurality of first training corpora by the first text extraction model, each first target text being the correct text extracted from a first training corpus;
the model obtaining module being further configured to obtain a second text extraction model according to the first training text set and the second training text set.
In another aspect, an electronic device is provided, the electronic device comprising a memory and a processor, the memory being configured to store instructions and the processor being configured to execute the instructions to perform the steps of the following method for obtaining a text extraction model:
obtaining a first text extraction model, the first text extraction model being obtained according to a manually labeled first training text set, the first training text set comprising a plurality of training corpora and a plurality of labeled texts in the plurality of training corpora;
if the extraction accuracy of the first text extraction model is lower than a preset threshold, obtaining a second training text set, the second training text set comprising a plurality of first training corpora and a plurality of first target texts extracted from the plurality of first training corpora by the first text extraction model, each first target text being the correct text extracted from a first training corpus; and
obtaining a second text extraction model according to the first training text set and the second training text set.
In the embodiments of the present disclosure, a first text extraction model is obtained and, when its extraction accuracy is lower than a preset threshold, a second training text set is obtained that includes a plurality of first training corpora and a plurality of first target texts extracted from the plurality of first training corpora by the first text extraction model. The second training text set is thus obtained through the already obtained first text extraction model, without manual labeling. A second text extraction model is then obtained according to the first training text set and the second training text set, so that the process of obtaining a text extraction model tends to be automated. Because obtaining a training text set through a model is far more efficient than manual labeling, the method of the present invention can greatly reduce labor costs and time costs.
To describe the technical solutions in the embodiments of the present disclosure more clearly, the accompanying drawings required for describing the embodiments are briefly introduced below. Apparently, the accompanying drawings in the following description show only some embodiments of the present invention, and a person of ordinary skill in the art may derive other drawings from these drawings without creative effort.
FIG. 1 is a schematic diagram of an implementation environment for obtaining a text extraction model according to an embodiment of the present disclosure;
FIG. 2 is a flowchart of a method for obtaining a text extraction model according to an embodiment of the present disclosure;
FIG. 3 is a flowchart of obtaining training text according to an embodiment of the present disclosure;
FIG. 4 is a flowchart of obtaining an iterative model according to an embodiment of the present disclosure;
FIG. 5 is a block diagram of an apparatus for obtaining a text extraction model according to an embodiment of the present disclosure;
FIG. 6 is a block diagram of an apparatus for obtaining a text extraction model according to an embodiment of the present disclosure;
FIG. 7 is a block diagram of an apparatus 700 for obtaining a text extraction model according to an embodiment of the present disclosure.
To make the objectives, technical solutions, and advantages of the present invention clearer, the embodiments of the present invention are described in further detail below with reference to the accompanying drawings.
FIG. 1 is a schematic diagram of an implementation environment for obtaining a text extraction model according to an embodiment of the present disclosure. Referring to FIG. 1, the implementation environment includes:
At least one server 101, at least one chatbot 102, and at least one terminal 103 (for example, a mobile terminal or a desktop computer). The server 101 is configured to obtain a first text extraction model; if the extraction accuracy of the first text extraction model is lower than a preset threshold, obtain a second training text set; obtain a second text extraction model according to the training text sets already obtained; and apply the obtained text extraction model to the chatbot 102 or the terminal 103. The chatbot 102 is configured to obtain or update the text extraction model under the control of the server 101 and, under the control of the server 101, provide users with various services, such as a chat service. A smart chat application provided by the server 101 is installed on the terminal 103, which obtains or updates the text extraction model under the control of the server 101.
The chatbot 102 may also be the server side of the smart chat application and may provide various services for the terminal 103. The chatbot 102 may run on the server 101 or on a plurality of customer service terminals provided by the server 101. The text extraction model may be stored on the device where the chatbot 102 resides and is applied by the chatbot 102 when a conversation message is received.
In addition, the server 101 may be configured with at least one database, such as a chat database, a user database, and a user authentication database. The chat database is used to store conversation corpora between users and the chatbot (or the smart chat application); a conversation corpus may be tagged with data such as the timestamp of the conversation or the service record of the conversation. The user database is used to store user behavior data, such as logs and comments posted by users and users' like and rating behaviors. The user authentication database is used to store usernames and passwords.
FIG. 2 is a flowchart of a method for obtaining a text extraction model according to an embodiment of the present disclosure. Referring to FIG. 2, the method can be applied to any device that has at least a processor and a memory, where the processor processes the training sample set in the memory to obtain a text extraction model. The method specifically includes:
201. The server obtains a first text extraction model, the first text extraction model being obtained according to a manually labeled first training text set.
The first training text set is used to generate a text extraction model. It includes a plurality of training corpora and the correct texts obtained by manually labeling those training corpora; one training corpus and the correct text labeled from it form a pair of training texts. The embodiments of the present disclosure do not limit the form of a training corpus; for example, a training corpus may be a single sentence or a dialogue. One or more correct texts may be labeled from a single training corpus, and they are generally related to the service provided by the chatbot (or smart chat application) to which the text extraction model is applied. For example, if the training corpus is "How do I get to Hangzhou", the labeled correct text may be "Hangzhou"; if the training corpus is "I want to buy a plane ticket to Tianjin", the labeled correct texts may be "Tianjin" and "plane ticket".
In this step, the server may obtain a plurality of training corpora from its own database or from the network and obtain the correct texts manually labeled from those training corpora, thereby obtaining the first training text set. The server then trains on the first training text set; that is, it extracts features (for example, context features) of each pair of training texts and determines the value of each parameter of an initial extraction model according to the extracted features, yielding a first text extraction model with known parameters. The initial extraction model is not limited to a CRF (Conditional Random Field) model or an HMM (Hidden Markov Model).
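Sequence models such as a CRF or an HMM are usually trained on per-token labels, so each pair of training texts has to be turned into a labeled token sequence first. The disclosure does not say how this is done; the sketch below assumes whitespace tokenization and the common B/I/O tagging convention, and the function name `bio_tags` is hypothetical.

```python
from typing import List


def bio_tags(tokens: List[str], labeled_texts: List[str]) -> List[str]:
    """Tag each token as B (begins a labeled text), I (inside one) or O (outside)."""
    tags = ["O"] * len(tokens)
    for text in labeled_texts:
        span = text.split()
        n = len(span)
        if n == 0:
            continue
        for start in range(len(tokens) - n + 1):
            window_matches = tokens[start:start + n] == span
            window_is_free = all(t == "O" for t in tags[start:start + n])
            if window_matches and window_is_free:
                tags[start] = "B"
                tags[start + 1:start + n] = ["I"] * (n - 1)
    return tags


tokens = "I want to buy a plane ticket to Tianjin".split()
tags = bio_tags(tokens, ["Tianjin", "plane ticket"])
# tags is aligned with tokens: "plane" -> B, "ticket" -> I, "Tianjin" -> B
```

A sequence labeler would then be trained on `(tokens, tags)` pairs, with context features (such as neighboring tokens) computed per position.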
In fact, for some training corpora, such as "What's wrong" or "Why", no text can be labeled manually. The embodiments of the present disclosure do not limit how such training corpora are handled. For example, the training corpus may be discarded directly without being labeled. As another example, a default label, such as "none", may be manually added to every training corpus from which no text can be labeled; the default label marks such training corpora. Further, to facilitate subsequent manual labeling and improve its efficiency, the server may store the discarded training corpora, or the training corpora to which the default label was added, as reference corpora to be filtered. Later, after obtaining initial training corpora, the server may filter out the initial training corpora that are identical to a reference corpus to be filtered, obtaining the filtered training corpora.
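The filtering described above amounts to an exact-match exclusion against a stored set. A minimal sketch, assuming corpora are plain strings and the default label is the string "none" (both the representation and the function names are illustrative assumptions):

```python
from typing import Iterable, List, Set, Tuple

DEFAULT_LABEL = "none"  # assumed default label for corpora with no text to label


def collect_reference_corpora(labeled_pairs: Iterable[Tuple[str, str]]) -> Set[str]:
    """Collect corpora that carry the default label, for use as a filter."""
    return {corpus for corpus, label in labeled_pairs if label == DEFAULT_LABEL}


def filter_initial_corpora(initial: Iterable[str], reference: Set[str]) -> List[str]:
    """Drop initial training corpora identical to a stored reference corpus."""
    return [corpus for corpus in initial if corpus not in reference]


reference = collect_reference_corpora([
    ("What's wrong", DEFAULT_LABEL),
    ("Why", DEFAULT_LABEL),
    ("How do I get to Hangzhou", "Hangzhou"),
])
remaining = filter_initial_corpora(["Why", "I want to buy a train ticket"], reference)
```

Using a set makes each membership check constant time on average, which matters when the reference corpora grow over several labeling rounds.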
It should be noted that the parameters of the initial extraction model may be initialized before training, and that methods such as stochastic gradient descent and forward-backward propagation may be used during training to optimize the parameters of the text extraction model so as to reduce its error as much as possible.
It should also be noted that, to reduce the cost of manual labeling, the number of training texts in the first training text set is much smaller than the number originally required for training. For example, if the number of training texts originally required is N, the number required by the embodiments of the present disclosure may be 50%*N.
202. If the extraction accuracy of the first text extraction model is lower than a preset threshold, the server obtains a second training text set, the second training text set including a plurality of first training corpora and a plurality of first target texts extracted from the plurality of first training corpora by the first text extraction model.
The text extracted by the first text extraction model may be correct or incorrect. To ensure that the text extraction model obtained from the second training text set has higher extraction accuracy, a first target text in the embodiments of the present disclosure refers to the correct text that should be extracted from a first training corpus. In this step, the server determines the extraction accuracy of the first text extraction model and judges whether it is lower than the preset threshold; if so, the server obtains the second training text set; otherwise, the server determines that the first text extraction model is usable. The embodiments of the present disclosure do not limit the preset threshold, which may be, for example, 80%. In fact, even if the extraction accuracy of the first text extraction model is not lower than the preset threshold, the server may still obtain a second training text set in order to further improve the accuracy of the first text extraction model. In that case, after obtaining the first training corpora, the server may either directly use the texts extracted by the first text extraction model as the first target texts, or obtain them with manual confirmation by following the specific process for obtaining the second training text set described below.
The embodiments of the present disclosure do not limit the specific method for determining the extraction accuracy. For example, the server may determine it through the following steps (1) to (3):
(1) The server obtains a test text set, which includes a plurality of test corpora and a plurality of correct texts manually labeled from the plurality of test corpora.
The test text set is obtained in the same way as the first training text set, but it is used to test the extraction accuracy of the first text extraction model.
(2) For each of the plurality of test corpora, the server extracts a second text from the test corpus through the first text extraction model.
In step (2), the server inputs each test corpus into the first text extraction model and takes the text that the model outputs for that test corpus as the second text.
(3) The server determines, as the extraction accuracy of the first text extraction model, the ratio of the number of second texts that are identical to any correct text to the number of the plurality of correct texts.
In step (3), the server may determine the number A of the correct texts (which is equivalent to the number of test corpora) and determine, for each test corpus, whether the second text extracted from it is identical to the correct text labeled from it; if so, the second text is counted; otherwise, it is ignored. The server can thus determine the number B of second texts that are identical to a correct text and take the ratio of B to A as the extraction accuracy of the first text extraction model.
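Steps (1) to (3) reduce to an exact-match ratio B/A over the test text set. The sketch below assumes one correct text per test corpus and treats the model as a plain callable; `toy_extract` is a made-up keyword matcher standing in for the first text extraction model.

```python
from typing import Callable, List, Tuple


def extraction_accuracy(
    extract: Callable[[str], str],
    test_set: List[Tuple[str, str]],
) -> float:
    """Return B / A: exact matches over the number of labeled correct texts."""
    if not test_set:
        raise ValueError("test set must not be empty")
    a = len(test_set)  # A: number of correct texts (one per test corpus here)
    b = sum(1 for corpus, correct in test_set if extract(corpus) == correct)
    return b / a


def toy_extract(corpus: str) -> str:
    """Hypothetical stand-in model: return the first known keyword found."""
    for keyword in ("train ticket", "Hangzhou", "Tianjin"):
        if keyword in corpus:
            return keyword
    return ""


test_set = [
    ("I want to buy a train ticket", "train ticket"),
    ("How do I get to Hangzhou", "Hangzhou"),
    ("Book a flight to Tianjin", "Tianjin"),
    ("Is there a bus to Suzhou", "Suzhou"),
]
accuracy = extraction_accuracy(toy_extract, test_set)  # 3 of 4 match
needs_second_set = accuracy < 0.80                     # example threshold of 80%
```

With the example threshold of 80% mentioned for step 202, this toy model would fall below the threshold and trigger acquisition of a second training text set.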
该步骤202中,服务器获取第二训练文本集合的过程可以具体为:如果第一文本提取模型的提取准确度低于预设阈值,服务器获取多个第一训练语料;对于多个第一训练语料中的每个第一训练语料,服务器通过第一文本提取模型从第一训练语料中提取出第一文本;如果第一文本正确,将第一训练语料和第一文本作为第二训练文本集合中的一对训练文本;如果第一文本错误,将第一训练语料和人工修正的文本作为第二训练文本集合中的一对训练文本。In the step 202, the process of the server acquiring the second training text set may be specifically: if the extraction accuracy of the first text extraction model is lower than a preset threshold, the server acquires multiple first training corpora; and for the plurality of first training corpora Each of the first training corpora, the server extracts the first text from the first training corpus by the first text extraction model; if the first text is correct, the first training corpus and the first text are used as the second training text set a pair of training texts; if the first text is wrong, the first training corpus and the manually corrected text are used as a pair of training texts in the second training text set.
以上具体过程参见图3所示的获取训练文本的流程图,该具体过程中,服 务器可以将每个第一训练语料输入第一文本提取模型,并获取该训练语料对应输出的文本作为第一文本,进而,可以获取人工对该第一文本添加的判断信息,该判断信息用于指示第一文本是否正确,如果获取的判断信息指示第一文本正确,服务器可以直接将第一训练语料和第一文本作为第二训练文本集合中的一对训练文本;如果获取的判断信息指示第一文本错误,服务器可以获取判断信息中携带的人工修正的文本,并将第一训练语料和人工修正的文本作为第二训练文本集合中的一对文本。For the specific process, refer to the flowchart of acquiring the training text shown in FIG. 3. In the specific process, the server may input each first training corpus into the first text extraction model, and obtain the text corresponding to the output of the training corpus as the first text. And, the determining information added to the first text is manually obtained, where the determining information is used to indicate whether the first text is correct. If the obtained determining information indicates that the first text is correct, the server may directly directly compare the first training corpus with the first The text is used as a pair of training texts in the second training text set; if the obtained judgment information indicates the first text error, the server may acquire the artificially corrected text carried in the judgment information, and use the first training corpus and the manually corrected text as A pair of text in the second training text collection.
In fact, to make acquiring the text extraction model more efficient, a person judging whether the first texts are correct need not act on every first text. Instead, the person may correct only the wrong first texts, so that the server obtains the manually corrected texts with their corresponding first training corpora, and directly obtains the remaining, uncorrected first texts with their corresponding first training corpora.
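The pairing procedure described above can be sketched as follows. This is a minimal illustration rather than the disclosed implementation: `extract` is a hypothetical callable standing in for the first text extraction model, and `corrections` is a hypothetical mapping that holds manually corrected texts only for the corpora whose extraction was judged wrong.

```python
from typing import Callable, Dict, List, Tuple


def build_second_training_set(
    corpora: List[str],
    extract: Callable[[str], str],
    corrections: Dict[str, str],
) -> List[Tuple[str, str]]:
    """Pair each first training corpus with its target text.

    `extract` is a hypothetical stand-in for the first text extraction model.
    `corrections` (also hypothetical) maps a corpus to its manually corrected
    text and has entries only for corpora whose extraction was wrong.
    """
    pairs = []
    for corpus in corpora:
        first_text = extract(corpus)
        # An extraction with no manual correction is treated as correct.
        target = corrections.get(corpus, first_text)
        pairs.append((corpus, target))
    return pairs


def toy_extract(corpus: str) -> str:
    """Toy model for illustration only: return the last word."""
    return corpus.split()[-1]


second_set = build_second_training_set(
    ["book a ticket to Beijing", "weather in Shanghai tomorrow"],
    toy_extract,
    {"weather in Shanghai tomorrow": "Shanghai"},
)
```

Only the second corpus needed a manual correction here, which mirrors the shortcut of having a person touch just the wrong extractions.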
It should be noted that the embodiments of the present disclosure do not limit the manner of acquiring the first training corpora. For example, the server may acquire them from the network or from its own database. To understand user needs in more depth, the database may be a user database; alternatively, to bring the training corpora closer to the actual application environment of the text extraction model, and thereby improve the model's hit rate on user corpora in use, the database may be a chat database or the like. Taking acquisition of the first training corpora from a chat database as an example, the server may use at least the following two acquisition modes:
Acquisition mode 1: if the extraction accuracy of the first text extraction model is lower than the preset threshold, the server acquires the dialog corpora within a preset time period from the chat database and uses the dialog corpora within the preset time period as the plurality of first training corpora.
To acquire the first training corpora in a targeted manner, the server may acquire the dialog corpora within a preset time period. The embodiments of the present disclosure do not specifically limit the preset time period. For example, to bring the first training corpora closer to how current users express themselves, so that the resulting text extraction model extracts more accurately in use, the preset time period may be the most recent month. As another example, to make the first training corpora better fit the different services provided by different chatbots, and thereby improve the extraction accuracy of the text extraction model, the preset time period may match the period in which a service is provided, with dialog corpora acquired separately for each period and associated with a corresponding text extraction model. The service periods may be divided, for example, so that ticket sales are provided during the day and ticket consultation at night.
In acquisition mode 1, the server may query the chat database for the dialog corpora within the preset time period and use the dialog corpora found as the plurality of first training corpora.
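Acquisition mode 1 amounts to a time-window query, which the following sketch illustrates. The `DialogRecord` type, the in-memory list standing in for the chat database, the concrete dates, and the 08:00 to 18:00 daytime window are all assumptions made for this example.

```python
from datetime import datetime
from typing import List, NamedTuple


class DialogRecord(NamedTuple):
    """Hypothetical row of the chat database."""
    text: str
    timestamp: datetime


def corpora_in_period(
    records: List[DialogRecord], start: datetime, end: datetime
) -> List[str]:
    """Return the dialog corpora whose timestamp lies in [start, end)."""
    return [r.text for r in records if start <= r.timestamp < end]


records = [
    DialogRecord("I want a ticket for Friday", datetime(2018, 1, 10, 10, 0)),
    DialogRecord("Is the evening show sold out?", datetime(2018, 2, 5, 21, 30)),
    DialogRecord("Book two seats please", datetime(2018, 2, 10, 9, 15)),
]

# Preset period: an assumed one-month window ending on 2018-02-15.
recent = corpora_in_period(records, datetime(2018, 1, 15), datetime(2018, 2, 15))

# Service-matched variant: keep only daytime dialogs (assumed 08:00 to 18:00).
daytime = [r.text for r in records if 8 <= r.timestamp.hour < 18]
```

In a real deployment the filter would more likely be pushed into the database query itself; the list comprehension only shows the selection rule.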
Acquisition mode 2: if the extraction accuracy of the first text extraction model is lower than the preset threshold, the server selects the successful dialog corpora from the chat database and uses them as the plurality of first training corpora, where a successful dialog corpus is a dialog corpus in which the chatbot successfully provided a service to the user.
To make the first training corpora more useful as references, successful dialog corpora may be acquired as the first training corpora. A successful dialog corpus can be identified in several ways. For example, the server may use at least the following three determination modes:
Determination mode 1: when a keyword indicating a successful dialog is present in any dialog corpus, the server determines that dialog corpus to be a successful dialog corpus.
The embodiments of the present disclosure do not limit the keywords indicating a successful dialog. For example, since users usually express thanks when a dialog succeeds, such keywords may be "OK" and "thank you". As another example, the chatbot's replies in a successful dialog may also contain such keywords, for example "no problem" and "you're welcome".
Determination mode 2: when a keyword indicating a failed dialog is present in any dialog corpus, the server filters out that dialog corpus and determines the remaining dialog corpora to be successful dialog corpora.
The embodiments of the present disclosure do not limit the keywords indicating a failed dialog. For example, since a user may tell the chatbot that it has misunderstood when a dialog fails, such keywords may be "you are wrong" and "that is not what I meant". As another example, the chatbot's replies in a failed dialog may also contain keywords indicating a failed dialog, for example "please don't mind", "I did not understand what you meant" and "please say it again".
Determination mode 3: when any dialog corpus has a corresponding service record, the server determines that dialog corpus to be a successful dialog corpus.
When a dialog corpus has a corresponding service record, the dialog successfully provided the user with a service. A dialog corpus that has a corresponding service record can therefore be treated as a successful dialog corpus.
It should be noted that, when acquiring the first training corpora, two or more of the above three determination modes may be combined. For example, the server may check both the keywords of a corpus and whether the dialog has a corresponding service record: when a keyword indicating a successful dialog is present in any dialog corpus, the server determines that dialog corpus to be a successful dialog corpus, and when any dialog corpus has a corresponding service record, the server likewise determines that dialog corpus to be a successful dialog corpus. This avoids missing dialog corpora that have reference value and improves data utilization.
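The three determination modes and their combination can be sketched as simple predicates. The keyword lists below are illustrative English stand-ins, and the `Dialog` type with its `has_service_record` flag is a hypothetical representation of a dialog corpus and its service record lookup.

```python
from typing import List, NamedTuple, Sequence

# Illustrative keyword lists; a real deployment would choose its own.
SUCCESS_KEYWORDS = ("thank you", "no problem", "you're welcome")
FAILURE_KEYWORDS = ("you are wrong", "that is not what i meant", "please say it again")


class Dialog(NamedTuple):
    """Hypothetical dialog corpus with a precomputed service record flag."""
    text: str
    has_service_record: bool


def contains_any(text: str, keywords: Sequence[str]) -> bool:
    """Case-insensitive substring check against a keyword list."""
    lowered = text.lower()
    return any(keyword in lowered for keyword in keywords)


def successful_by_keyword(dialog: Dialog) -> bool:
    """Determination mode 1: a success keyword is present."""
    return contains_any(dialog.text, SUCCESS_KEYWORDS)


def not_failed(dialog: Dialog) -> bool:
    """Determination mode 2: no failure keyword is present."""
    return not contains_any(dialog.text, FAILURE_KEYWORDS)


def successful_by_record(dialog: Dialog) -> bool:
    """Determination mode 3: a corresponding service record exists."""
    return dialog.has_service_record


def select_successful(dialogs: List[Dialog]) -> List[Dialog]:
    """Combine modes 1 and 3: keep a dialog if either signal indicates success."""
    return [d for d in dialogs if successful_by_keyword(d) or successful_by_record(d)]


dialogs = [
    Dialog("User: two tickets please. Bot: done. User: thank you", False),
    Dialog("User: refund my order. Bot: refund issued.", True),
    Dialog("User: you are wrong, that is not what I meant", False),
]
selected = select_successful(dialogs)
```

Joining modes 1 and 3 with a logical OR keeps the second dialog, which contains no success keyword but does have a service record, so a useful corpus is not missed.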
203. The server acquires a second text extraction model according to the first training text set and the second training text set.
Based on the first training text set and the second training text set, the server may retrain on the two training text sets to obtain the second text extraction model.
In fact, for the second text extraction model obtained from one round of training, if its extraction accuracy is lower than the preset threshold, the server may continue to acquire a training text set and perform the next round of model training based on all training text sets acquired so far, until the extraction accuracy of the second text extraction model obtained by training is not lower than the preset threshold. The training text set includes a plurality of second training corpora and a plurality of second target texts extracted from the plurality of second training corpora by the second text extraction model obtained from the current round of training.
For example, FIG. 4 is a flowchart of iterating the model according to an embodiment of the present disclosure. Referring to FIG. 4, the server may determine the extraction accuracy of the second text extraction model by using the method for determining the extraction accuracy in step 202. If the determined extraction accuracy is not lower than the preset threshold, the server determines that the second text extraction model is usable. If the determined extraction accuracy is lower than the preset threshold, the server continues to acquire a training text set, in the same way as the second training text set is acquired, and trains on the first training text set, the second training text set and the newly acquired training text set to obtain a text extraction model with higher accuracy. The server then checks the extraction accuracy of that text extraction model again; if it is still lower than the preset threshold, the server continues to acquire training text sets, until the extraction accuracy of the text extraction model obtained through iteration is not lower than the preset threshold.
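The iteration described above can be sketched as a loop. The callables `train`, `evaluate` and `acquire_set` are hypothetical placeholders for model training, accuracy testing and training text set acquisition, and the `max_rounds` cap is an added safeguard that the disclosure does not mention.

```python
from typing import Any, Callable, List, Tuple

Pair = Tuple[str, str]
TrainingSet = List[Pair]


def iterate_until_accurate(
    first_set: TrainingSet,
    train: Callable[[List[TrainingSet]], Any],
    evaluate: Callable[[Any], float],
    acquire_set: Callable[[Any], TrainingSet],
    threshold: float,
    max_rounds: int = 10,  # assumption: safety cap, not part of the disclosure
) -> Tuple[Any, List[TrainingSet]]:
    """Retrain on all acquired sets until accuracy is not below the threshold."""
    sets = [first_set]
    model = train(sets)
    rounds = 0
    while evaluate(model) < threshold and rounds < max_rounds:
        # The current model produces the next training text set.
        sets.append(acquire_set(model))
        model = train(sets)
        rounds += 1
    return model, sets


# Toy stand-ins: the "model" is just the number of training pairs seen,
# and accuracy grows with that number.
def toy_train(sets: List[TrainingSet]) -> int:
    return sum(len(s) for s in sets)


def toy_evaluate(model: int) -> float:
    return min(1.0, model / 10)


def toy_acquire(model: int) -> TrainingSet:
    return [("corpus", "text")] * 3


final_model, all_sets = iterate_until_accurate(
    [("corpus", "text")] * 4, toy_train, toy_evaluate, toy_acquire, threshold=0.9
)
```

With these toy functions the loop runs two extra rounds: accuracy goes from 0.4 to 0.7 to 1.0, at which point the model is accepted.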
It should be noted that, after the server acquires the final text extraction model, it may either store the text extraction model temporarily and wait for an instruction to apply it, or apply the text extraction model directly, for example by applying it to a chatbot or by updating it to an intelligent chat application on the user's terminal.
Taking application on a chatbot as an example, when the chatbot's server receives a dialog message sent by any user, the server inputs the dialog message into the trained second text extraction model to obtain the semantic information of the dialog message. According to the semantic information of the dialog message, the server queries a reply message database for a reply message to the dialog message and returns it to the user who asked. Because the second text extraction model extracts semantics with relatively high accuracy, a better dialog effect can be achieved and the chatbot becomes more intelligent.
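The chatbot flow described above can be sketched as a two-step lookup. The `extract_semantics` callable stands in for the second text extraction model, the dictionary stands in for the reply message database, and the fallback reply is an assumption added so the sketch handles unknown semantics.

```python
from typing import Callable, Dict


def reply_to(
    message: str,
    extract_semantics: Callable[[str], str],
    reply_db: Dict[str, str],
    fallback: str = "Sorry, could you rephrase that?",  # assumption, not in the source
) -> str:
    """Extract the semantic information, then look up the reply message."""
    semantics = extract_semantics(message)
    return reply_db.get(semantics, fallback)


def toy_semantics(message: str) -> str:
    """Toy stand-in for the second text extraction model."""
    return "buy_ticket" if "ticket" in message.lower() else "unknown"


# Hypothetical reply message database keyed by semantic information.
reply_db = {"buy_ticket": "Which date would you like to travel?"}

answer = reply_to("I need a ticket to Beijing", toy_semantics, reply_db)
other = reply_to("Tell me a joke", toy_semantics, reply_db)
```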
In the embodiments of the present disclosure, a first text extraction model is acquired, and when the extraction accuracy of the first text extraction model is lower than the preset threshold, a second training text set is acquired. The second training text set includes a plurality of first training corpora and a plurality of first target texts extracted from the plurality of first training corpora by the first text extraction model, so the second training text set is obtained by using the already acquired first text extraction model, without manual labeling. Further, a second text extraction model is acquired according to the first training text set and the second training text set, so that the process of acquiring a text extraction model tends toward automation. Because acquiring a training text set through a model is far more efficient than manual labeling, the acquisition method of the present invention can greatly reduce labor and time costs.
In addition, a specific method for acquiring the second training text set is provided: the first training corpora are acquired, and a first text is extracted from each first training corpus by using the first text extraction model. If the first text is correct, the first training corpus and the first text are directly used as a pair of training texts in the second training text set; if the first text is wrong, the manually corrected text and the first training corpus are used as a pair of training texts in the second training text set. Because the second training text set is produced by the first text extraction model and confirmed manually, both the acquisition efficiency and the accuracy of the second training text set are ensured.
In addition, at least two specific methods for acquiring the first training corpora are provided. For example, to ensure that the dialog corpora are valid, the dialog corpora within a preset time period may be acquired from the chat database; alternatively, to make the first training corpora more useful as references, the successful dialog corpora in the chat database may be acquired.
In addition, a specific method for determining the extraction accuracy is provided: a test text set is acquired, a second text is extracted from each test corpus by using the first text extraction model, the number of second texts that are identical to any correct text and the number of the plurality of correct texts are determined, and the ratio of the former to the latter is determined as the extraction accuracy of the first text extraction model. This provides a specific way to test whether the first text extraction model meets the standard.
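The accuracy calculation above can be sketched in a few lines. This sketch assumes one manually labeled correct text per test corpus and counts an extraction as a match only when it equals the correct text for the same corpus, which is one reading of "identical to any correct text"; `extract` is again a hypothetical stand-in for the model.

```python
from typing import Callable, List, Tuple


def extraction_accuracy(
    test_set: List[Tuple[str, str]],
    extract: Callable[[str], str],
) -> float:
    """Ratio of matching extracted texts to the number of correct texts.

    `test_set` holds (test corpus, manually labeled correct text) pairs.
    """
    if not test_set:
        raise ValueError("the test text set must not be empty")
    matches = sum(1 for corpus, correct in test_set if extract(corpus) == correct)
    return matches / len(test_set)


def toy_extract(corpus: str) -> str:
    """Toy model for illustration only: return the last word."""
    return corpus.split()[-1]


test_set = [
    ("fly to Beijing", "Beijing"),
    ("hotel in Shanghai", "Shanghai"),
    ("train to Xi'an tomorrow", "Xi'an"),
]
accuracy = extraction_accuracy(test_set, toy_extract)
meets_standard = accuracy >= 0.9  # 0.9 is an assumed preset threshold
```

The toy model gets two of the three test corpora right, so it falls short of the assumed threshold and would trigger acquisition of the second training text set.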
In addition, after the second text extraction model is acquired, the extraction accuracy of the current text extraction model may be determined. If the extraction accuracy of the current text extraction model is lower than the preset threshold, training text sets continue to be acquired and training is performed based on all training text sets acquired so far, until the extraction accuracy of the text extraction model obtained by training is not lower than the preset threshold. The acquired text extraction model is thus optimized iteratively, finally yielding a text extraction model with relatively high extraction accuracy.
FIG. 5 is a block diagram of an apparatus for acquiring a text extraction model according to an embodiment of the present disclosure. Referring to FIG. 5, the apparatus includes:
a model acquisition module 501, configured to acquire a first text extraction model, where the first text extraction model is obtained according to a manually labeled first training text set, and the first training text set includes a plurality of training corpora and a plurality of labeled texts in the plurality of training corpora;
a training text set acquisition module 502, configured to acquire a second training text set if the extraction accuracy of the first text extraction model is lower than a preset threshold, where the second training text set includes a plurality of first training corpora and a plurality of first target texts extracted from the plurality of first training corpora by the first text extraction model, and each first target text is the correct text extracted from a first training corpus;
the model acquisition module 501 being further configured to acquire a second text extraction model according to the first training text set and the second training text set.
In the embodiments of the present disclosure, a first text extraction model is acquired, and when the extraction accuracy of the first text extraction model is lower than the preset threshold, a second training text set is acquired. The second training text set includes a plurality of first training corpora and a plurality of first target texts extracted from the plurality of first training corpora by the first text extraction model, so the second training text set is obtained by using the already acquired first text extraction model, without manual labeling. Further, a second text extraction model is acquired according to the first training text set and the second training text set, so that the process of acquiring a text extraction model tends toward automation. Because acquiring a training text set through a model is far more efficient than manual labeling, the acquisition method of the present invention can greatly reduce labor and time costs.
In a possible implementation, the training text set acquisition module 502 is configured to:
if the extraction accuracy of the first text extraction model is lower than the preset threshold, acquire a plurality of first training corpora; for each first training corpus of the plurality of first training corpora, extract a first text from the first training corpus by using the first text extraction model; if the first text is correct, use the first training corpus and the first text as a pair of training texts in the second training text set; if the first text is wrong, use the first training corpus and a manually corrected text as a pair of training texts in the second training text set.
In a possible implementation, the training text set acquisition module 502 is configured to:
if the extraction accuracy of the first text extraction model is lower than the preset threshold, acquire the dialog corpora within a preset time period from a chat database and use the dialog corpora within the preset time period as the plurality of first training corpora, where the chat database is used to store dialog corpora between users and a chatbot.
In a possible implementation, the training text set acquisition module 502 is configured to:
if the extraction accuracy of the first text extraction model is lower than the preset threshold, select the successful dialog corpora from a chat database and use them as the plurality of first training corpora, where the chat database is used to store dialog corpora between users and a chatbot, and a successful dialog corpus is a dialog corpus in which the chatbot successfully provided a service to the user.
In a possible implementation, based on the apparatus composition of FIG. 5 and referring to FIG. 6, the apparatus further includes:
a test text set acquisition module 503, configured to acquire a test text set, where the test text set includes a plurality of test corpora and a plurality of correct texts manually labeled from the plurality of test corpora;
an extraction module 504, configured to extract, for each test corpus of the plurality of test corpora, a second text from the test corpus by using the first text extraction model;
a determining module 505, configured to determine the ratio of the number of second texts that are identical to any correct text to the number of the plurality of correct texts as the extraction accuracy of the first text extraction model.
In a possible implementation, the training text set acquisition module 502 is further configured to continue to acquire a training text set if the extraction accuracy of the second text extraction model obtained from the current round of training is lower than the preset threshold;
the model acquisition module 501 is further configured to perform the next round of training based on all training text sets acquired so far, until the extraction accuracy of the second text extraction model obtained by training is not lower than the preset threshold, where the training text set includes a plurality of second training corpora and a plurality of second target texts extracted from the plurality of second training corpora by the second text extraction model obtained from the current round of training.
All of the above optional technical solutions may be combined in any manner to form optional embodiments of the present invention, which are not described one by one here.
It should be noted that, when the apparatus for acquiring a text extraction model provided by the above embodiment acquires a text extraction model, the division into the functional modules described above is only an example. In practice, the functions may be assigned to different functional modules as needed; that is, the internal structure of the apparatus may be divided into different functional modules to complete all or part of the functions described above. In addition, the apparatus for acquiring a text extraction model provided by the above embodiment belongs to the same concept as the method embodiments for acquiring a text extraction model. For its specific implementation, see the method embodiments; details are not repeated here.
FIG. 7 is a block diagram of an electronic device 700 according to an embodiment of the present disclosure. Referring to FIG. 7, the electronic device 700 includes a processing component 722, which further includes one or more processors, and memory resources represented by a memory 732 for storing instructions executable by the processing component 722, such as application programs. The application programs stored in the memory 732 may include one or more modules, each corresponding to a set of instructions. In addition, the processing component 722 is configured to execute the instructions to perform the following method for acquiring a text extraction model:
acquiring a first text extraction model, where the first text extraction model is obtained according to a manually labeled first training text set, and the first training text set includes a plurality of training corpora and a plurality of labeled texts in the plurality of training corpora;
if the extraction accuracy of the first text extraction model is lower than a preset threshold, acquiring a second training text set, where the second training text set includes a plurality of first training corpora and a plurality of first target texts extracted from the plurality of first training corpora by the first text extraction model, and each first target text is the correct text extracted from a first training corpus;
acquiring a second text extraction model according to the first training text set and the second training text set.
In a possible implementation, the processor is configured to execute the instructions to perform the following steps:
when a dialog message is received, inputting the dialog message into the second text extraction model to obtain semantic information of the dialog message;
querying a reply message to the dialog message according to the semantic information of the dialog message.
In a possible implementation, the processor is configured to execute the instructions to perform the following steps:
if the extraction accuracy of the first text extraction model is lower than the preset threshold, acquiring the plurality of first training corpora;
for each first training corpus of the plurality of first training corpora, extracting a first text from the first training corpus by using the first text extraction model;
if the first text is correct, using the first training corpus and the first text as a pair of training texts in the second training text set;
if the first text is wrong, using the first training corpus and a manually corrected text as a pair of training texts in the second training text set.
In a possible implementation, the processor is configured to execute the instructions to perform the following steps:
if the extraction accuracy of the first text extraction model is lower than the preset threshold, acquiring the dialog corpora within a preset time period from a chat database and using the dialog corpora within the preset time period as the plurality of first training corpora, where the chat database is used to store dialog corpora between users and a chatbot.
In a possible implementation, the processor is configured to execute the instructions to perform the following steps:
if the extraction accuracy of the first text extraction model is lower than the preset threshold, selecting the successful dialog corpora from a chat database and using them as the plurality of first training corpora, where the chat database is used to store dialog corpora between users and a chatbot, and a successful dialog corpus is a dialog corpus in which the chatbot successfully provided a service to the user.
In a possible implementation, the processor is configured to execute the instructions to perform the following steps:
acquiring a test text set, where the test text set includes a plurality of test corpora and a plurality of correct texts manually labeled from the plurality of test corpora;
for each test corpus of the plurality of test corpora, extracting a second text from the test corpus by using the first text extraction model;
determining the ratio of the number of second texts that are identical to any correct text to the number of the plurality of correct texts as the extraction accuracy of the first text extraction model.
In a possible implementation, the processor is configured to execute the instructions to perform the following steps:
if the extraction accuracy of the second text extraction model obtained from the current round of training is lower than the preset threshold, continuing to acquire a training text set and performing the next round of training based on all training text sets acquired so far, until the extraction accuracy of the second text extraction model obtained by training is not lower than the preset threshold, where the training text set includes a plurality of second training corpora and a plurality of second target texts extracted from the plurality of second training corpora by the second text extraction model obtained from the current round of training.
The electronic device 700 may also include a power supply component 726 configured to perform power management of the electronic device 700, a wired or wireless network interface 750 configured to connect the electronic device 700 to a network, and an input/output (I/O) interface 758. The electronic device 700 may operate based on an operating system stored in the memory 732, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, or the like.
In an exemplary embodiment, a computer-readable storage medium is also provided, for example a memory including instructions, where the instructions can be executed by a processor in a terminal to carry out the method for acquiring a text extraction model in the above embodiments. For example, the computer-readable storage medium may be a ROM, a random access memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, or the like.
A person of ordinary skill in the art will understand that all or some of the steps of the above embodiments may be implemented by hardware, or by a program instructing the relevant hardware. The program may be stored in a computer-readable storage medium, and the storage medium may be a read-only memory, a magnetic disk, an optical disc, or the like.
The above are merely preferred embodiments of the present invention and are not intended to limit it. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall fall within its scope of protection.
Claims (20)
- A method for obtaining a text extraction model, the method comprising: obtaining a first text extraction model, the first text extraction model being obtained from a manually labeled first training text set, the first training text set comprising a plurality of training corpora and a plurality of labeled texts in the plurality of training corpora; if the extraction accuracy of the first text extraction model is lower than a preset threshold, acquiring a second training text set, the second training text set comprising a plurality of first training corpora and a plurality of first target texts extracted from the plurality of first training corpora by the first text extraction model, each first target text being the correct text extracted from a first training corpus; and obtaining a second text extraction model according to the first training text set and the second training text set.
- The method according to claim 1, wherein after the second text extraction model is obtained according to the first training text set and the second training text set, the method further comprises: when a conversation message is received, inputting the conversation message into the second text extraction model to obtain semantic information of the conversation message; and querying a reply message for the conversation message according to the semantic information of the conversation message.
- The method according to claim 1, wherein acquiring the second training text set if the extraction accuracy of the first text extraction model is lower than the preset threshold comprises: if the extraction accuracy of the first text extraction model is lower than the preset threshold, acquiring the plurality of first training corpora; for each first training corpus of the plurality of first training corpora, extracting a first text from the first training corpus by means of the first text extraction model; if the first text is correct, using the first training corpus and the first text as a pair of training texts in the second training text set; and if the first text is incorrect, using the first training corpus and a manually corrected text as a pair of training texts in the second training text set.
- The method according to claim 3, wherein acquiring the plurality of first training corpora if the extraction accuracy of the first text extraction model is lower than the preset threshold comprises: if the extraction accuracy of the first text extraction model is lower than the preset threshold, obtaining the conversation corpora within a preset period from a chat database and using the conversation corpora within the preset period as the plurality of first training corpora, the chat database being used to store conversation corpora between users and a chatbot.
- The method according to claim 3, wherein acquiring the plurality of first training corpora if the extraction accuracy of the first text extraction model is lower than the preset threshold comprises: if the extraction accuracy of the first text extraction model is lower than the preset threshold, filtering the successful conversation corpora out of a chat database and using the successful conversation corpora as the plurality of first training corpora, the chat database being used to store conversation corpora between users and a chatbot, and a successful conversation corpus being one in which the chatbot successfully provided a service to the user.
- The method according to claim 1, wherein determining the extraction accuracy of the first text extraction model comprises: obtaining a test text set, the test text set comprising a plurality of test corpora and a plurality of correct texts manually labeled in the plurality of test corpora; for each test corpus of the plurality of test corpora, extracting a second text from the test corpus by means of the first text extraction model; and determining, as the extraction accuracy of the first text extraction model, the ratio of the number of second texts that are identical to any of the correct texts to the number of the plurality of correct texts.
- The method according to claim 1, wherein after the second text extraction model is obtained according to the first training text set and the second training text set, the method further comprises: if the extraction accuracy of the second text extraction model obtained in the current training round is lower than the preset threshold, continuing to acquire training text sets and performing the next training round based on all of the training text sets acquired so far, until the extraction accuracy of the resulting second text extraction model is not lower than the preset threshold, each such training text set comprising a plurality of second training corpora and a plurality of second target texts extracted from the plurality of second training corpora by the second text extraction model obtained in the current training round.
- An apparatus for obtaining a text extraction model, the apparatus comprising: a model obtaining module, configured to obtain a first text extraction model, the first text extraction model being obtained from a manually labeled first training text set, the first training text set comprising a plurality of training corpora and a plurality of labeled texts in the plurality of training corpora; and a training text set obtaining module, configured to acquire a second training text set if the extraction accuracy of the first text extraction model is lower than a preset threshold, the second training text set comprising a plurality of first training corpora and a plurality of first target texts extracted from the plurality of first training corpora by the first text extraction model, each first target text being the correct text extracted from a first training corpus; the model obtaining module being further configured to obtain a second text extraction model according to the first training text set and the second training text set.
- The apparatus according to claim 8, wherein the training text set obtaining module is configured to: if the extraction accuracy of the first text extraction model is lower than the preset threshold, acquire the plurality of first training corpora; for each first training corpus of the plurality of first training corpora, extract a first text from the first training corpus by means of the first text extraction model; if the first text is correct, use the first training corpus and the first text as a pair of training texts in the second training text set; and if the first text is incorrect, use the first training corpus and a manually corrected text as a pair of training texts in the second training text set.
- The apparatus according to claim 9, wherein the training text set obtaining module is configured to: if the extraction accuracy of the first text extraction model is lower than the preset threshold, obtain the conversation corpora within a preset period from a chat database and use the conversation corpora within the preset period as the plurality of first training corpora, the chat database being used to store conversation corpora between users and a chatbot.
- The apparatus according to claim 9, wherein the training text set obtaining module is configured to: if the extraction accuracy of the first text extraction model is lower than the preset threshold, filter the successful conversation corpora out of a chat database and use the successful conversation corpora as the plurality of first training corpora, the chat database being used to store conversation corpora between users and a chatbot, and a successful conversation corpus being one in which the chatbot successfully provided a service to the user.
- The apparatus according to claim 8, further comprising: a test text set obtaining module, configured to obtain a test text set, the test text set comprising a plurality of test corpora and a plurality of correct texts manually labeled in the plurality of test corpora; an extraction module, configured to, for each test corpus of the plurality of test corpora, extract a second text from the test corpus by means of the first text extraction model; and a determining module, configured to determine, as the extraction accuracy of the first text extraction model, the ratio of the number of second texts that are identical to any of the correct texts to the number of the plurality of correct texts.
- The apparatus according to claim 8, wherein the training text set obtaining module is further configured to continue to acquire training text sets if the extraction accuracy of the second text extraction model obtained in the current training round is lower than the preset threshold; and the model obtaining module is further configured to perform the next training round based on all of the training text sets acquired so far, until the extraction accuracy of the resulting second text extraction model is not lower than the preset threshold, each such training text set comprising a plurality of second training corpora and a plurality of second target texts extracted from the plurality of second training corpora by the second text extraction model obtained in the current training round.
- An electronic device, comprising a memory and a processor, the memory being configured to store instructions and the processor being configured to execute the instructions to perform the following steps of a method for obtaining a text extraction model: obtaining a first text extraction model, the first text extraction model being obtained from a manually labeled first training text set, the first training text set comprising a plurality of training corpora and a plurality of labeled texts in the plurality of training corpora; if the extraction accuracy of the first text extraction model is lower than a preset threshold, acquiring a second training text set, the second training text set comprising a plurality of first training corpora and a plurality of first target texts extracted from the plurality of first training corpora by the first text extraction model, each first target text being the correct text extracted from a first training corpus; and obtaining a second text extraction model according to the first training text set and the second training text set.
- The electronic device according to claim 14, wherein the processor is configured to execute the instructions to perform the following steps: when a conversation message is received, inputting the conversation message into the second text extraction model to obtain semantic information of the conversation message; and querying a reply message for the conversation message according to the semantic information of the conversation message.
- The electronic device according to claim 14, wherein the processor is configured to execute the instructions to perform the following steps: if the extraction accuracy of the first text extraction model is lower than the preset threshold, acquiring the plurality of first training corpora; for each first training corpus of the plurality of first training corpora, extracting a first text from the first training corpus by means of the first text extraction model; if the first text is correct, using the first training corpus and the first text as a pair of training texts in the second training text set; and if the first text is incorrect, using the first training corpus and a manually corrected text as a pair of training texts in the second training text set.
- The electronic device according to claim 16, wherein the processor is configured to execute the instructions to perform the following steps: if the extraction accuracy of the first text extraction model is lower than the preset threshold, obtaining the conversation corpora within a preset period from a chat database and using the conversation corpora within the preset period as the plurality of first training corpora, the chat database being used to store conversation corpora between users and a chatbot.
- The electronic device according to claim 16, wherein the processor is configured to execute the instructions to perform the following steps: if the extraction accuracy of the first text extraction model is lower than the preset threshold, filtering the successful conversation corpora out of a chat database and using the successful conversation corpora as the plurality of first training corpora, the chat database being used to store conversation corpora between users and a chatbot, and a successful conversation corpus being one in which the chatbot successfully provided a service to the user.
- The electronic device according to claim 14, wherein the processor is configured to execute the instructions to perform the following steps: obtaining a test text set, the test text set comprising a plurality of test corpora and a plurality of correct texts manually labeled in the plurality of test corpora; for each test corpus of the plurality of test corpora, extracting a second text from the test corpus by means of the first text extraction model; and determining, as the extraction accuracy of the first text extraction model, the ratio of the number of second texts that are identical to any of the correct texts to the number of the plurality of correct texts.
- The electronic device according to claim 14, wherein the processor is configured to execute the instructions to perform the following steps: if the extraction accuracy of the second text extraction model obtained in the current training round is lower than the preset threshold, continuing to acquire training text sets and performing the next training round based on all of the training text sets acquired so far, until the extraction accuracy of the resulting second text extraction model is not lower than the preset threshold, each such training text set comprising a plurality of second training corpora and a plurality of second target texts extracted from the plurality of second training corpora by the second text extraction model obtained in the current training round.
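Two steps recited in the claims above, measuring extraction accuracy against a manually labeled test text set and building the second training text set from model output plus manual corrections, can be illustrated with a short sketch. It rests on assumptions not stated in the claims: a model is treated as a function from a corpus to one extracted text, each test corpus is assumed to have exactly one correct text so the comparison is done per corpus, and `is_correct` and `correct_manually` are hypothetical stand-ins for the human review step.

```python
from typing import Callable, List, Tuple

Model = Callable[[str], str]  # maps a corpus to the text extracted from it


def extraction_accuracy(model: Model, test_set: List[Tuple[str, str]]) -> float:
    """Share of test corpora whose extracted text equals the manual label."""
    if not test_set:
        return 0.0
    hits = sum(1 for corpus, correct in test_set if model(corpus) == correct)
    return hits / len(test_set)


def build_second_training_set(
    model: Model,
    corpora: List[str],
    is_correct: Callable[[str, str], bool],       # hypothetical human check
    correct_manually: Callable[[str, str], str],  # hypothetical human fix
) -> List[Tuple[str, str]]:
    """Pair each corpus with the model's text if correct, else a corrected text."""
    pairs = []
    for corpus in corpora:
        extracted = model(corpus)
        if is_correct(corpus, extracted):
            pairs.append((corpus, extracted))
        else:
            pairs.append((corpus, correct_manually(corpus, extracted)))
    return pairs
```

Under these assumptions a reviewer only has to confirm or fix the model's output for each corpus instead of labeling it from scratch, which is the labor saving the claims aim at.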
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710107787.5A CN106909656B (en) | 2017-02-27 | 2017-02-27 | Obtain the method and device of Text Feature Extraction model |
CN201710107787.5 | 2017-02-27 |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2018153316A1 (en) | 2018-08-30 |
Family
ID=59209337
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/CN2018/076605 WO2018153316A1 (en) | 2017-02-27 | 2018-02-13 | Method and apparatus for obtaining text extraction model |
Country Status (2)
Country | Link |
---|---|
CN (1) | CN106909656B (en) |
WO (1) | WO2018153316A1 (en) |
Families Citing this family (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106909656B (en) * | 2017-02-27 | 2019-03-08 | 腾讯科技(深圳)有限公司 | Obtain the method and device of Text Feature Extraction model |
CN110245338A (en) * | 2018-03-09 | 2019-09-17 | 北京国双科技有限公司 | The bearing calibration of fact identification and device |
CN110472198B (en) * | 2018-05-10 | 2023-01-24 | 腾讯科技(深圳)有限公司 | Keyword determination method, text processing method and server |
CN110263322B (en) * | 2019-05-06 | 2023-09-05 | 平安科技(深圳)有限公司 | Audio corpus screening method and device for speech recognition and computer equipment |
CN110347786B (en) * | 2019-06-11 | 2021-01-05 | 深圳追一科技有限公司 | Semantic model tuning method and system |
CN110866100B (en) * | 2019-11-07 | 2022-08-23 | 北京声智科技有限公司 | Phonetics generalization method and device and electronic equipment |
CN112632284A (en) * | 2020-12-30 | 2021-04-09 | 上海明略人工智能(集团)有限公司 | Information extraction method and system for unlabeled text data set |
Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20060074634A1 (en) * | 2004-10-06 | 2006-04-06 | International Business Machines Corporation | Method and apparatus for fast semi-automatic semantic annotation |
CN102236639A (en) * | 2010-04-28 | 2011-11-09 | 三星电子株式会社 | System and method for updating language model |
US8818793B1 (en) * | 2002-12-24 | 2014-08-26 | At&T Intellectual Property Ii, L.P. | System and method of extracting clauses for spoken language understanding |
US20150325235A1 (en) * | 2014-05-07 | 2015-11-12 | Microsoft Corporation | Language Model Optimization For In-Domain Application |
CN105956179A (en) * | 2016-05-30 | 2016-09-21 | 上海智臻智能网络科技股份有限公司 | Data filtering method and apparatus |
CN106445908A (en) * | 2015-08-07 | 2017-02-22 | 阿里巴巴集团控股有限公司 | Text identification method and apparatus |
CN106909656A (en) * | 2017-02-27 | 2017-06-30 | 腾讯科技(深圳)有限公司 | Obtain the method and device of Text Feature Extraction model |
Family Cites Families (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101021838A (en) * | 2007-03-02 | 2007-08-22 | 华为技术有限公司 | Text handling method and system |
CN102033950A (en) * | 2010-12-23 | 2011-04-27 | 哈尔滨工业大学 | Construction method and identification method of automatic electronic product named entity identification system |
CN103593334B (en) * | 2012-08-15 | 2017-07-28 | 中国电信股份有限公司 | A kind of method and system for being used to judge emotional degree of text |
CN104317894B (en) * | 2014-10-23 | 2018-12-21 | 北京百度网讯科技有限公司 | The determination method and apparatus of sample mark |
CN104408093B (en) * | 2014-11-14 | 2018-01-26 | 中国科学院计算技术研究所 | A method and device for extracting news event elements |
CN106202177B (en) * | 2016-06-27 | 2017-12-15 | 腾讯科技(深圳)有限公司 | A kind of file classification method and device |
CN106407357B (en) * | 2016-09-07 | 2019-04-19 | 深圳市中易科技有限责任公司 | A kind of engineering method of text data rule model exploitation |
- 2017-02-27: CN application CN201710107787.5A, patent CN106909656B, status Active
- 2018-02-13: WO application PCT/CN2018/076605, publication WO2018153316A1, status Application Filing
Also Published As
Publication number | Publication date |
---|---|
CN106909656A (en) | 2017-06-30 |
CN106909656B (en) | 2019-03-08 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
 | 121 | EP: the EPO has been informed by WIPO that EP was designated in this application | Ref document number: 18758471; Country of ref document: EP; Kind code of ref document: A1 |
 | NENP | Non-entry into the national phase | Ref country code: DE |
 | 122 | EP: PCT application non-entry in European phase | Ref document number: 18758471; Country of ref document: EP; Kind code of ref document: A1 |