WO2018153316A1

WO2018153316A1 - Method and apparatus for obtaining text extraction model

Info

Publication number: WO2018153316A1
Application number: PCT/CN2018/076605
Authority: WO
Inventors: 陈益
Original assignee: 腾讯科技（深圳）有限公司
Priority date: 2017-02-27
Filing date: 2018-02-13
Publication date: 2018-08-30
Also published as: CN106909656A; CN106909656B

Abstract

A method and apparatus for obtaining a text extraction model, which relate to the technical field of machine learning. The method comprises: obtaining a first text extraction model, the first text extraction model being obtained according to a manually-marked first training text collection; if the extraction accuracy of the first text extraction model is lower than a preset threshold, obtaining a second training text collection, the second training text collection comprising multiple first training corpora and multiple first target texts extracted from the multiple first training corpora by means of the first text extraction model; and obtaining a second text extraction model according to the first training text collection and the second training text collection. The second training text collection is obtained by means of the first text extraction model, so that the process of obtaining the text extraction model tends to be automated, and accordingly, labor costs and time costs are reduced.

Description

Method and device for acquiring text extraction model

This application claims priority to Chinese Patent Application No. 2017101077875, entitled "Method and Apparatus for Obtaining a Text Extraction Model", filed with the Chinese National Intellectual Property Office on February 27, 2017, the entire contents of which are incorporated herein by reference.

Technical field

The present invention relates to the field of machine learning technology, and in particular, to a method and apparatus for acquiring a text extraction model.

Background technique

Machine learning refers to technology by which computers improve their performance by generalizing from data such as text or images. It is widely used in data mining, computer vision, natural language processing, and robotics. For example, to enable a chat robot to understand natural language and thereby interact with a user, a text extraction model is usually acquired by using machine learning technology and applied to the chat robot, so that the chat robot extracts, from the user's corpus, the text that expresses the user's needs and responds to that text.

Generally, acquiring a text extraction model requires obtaining a large number of corpora and manually marking, in each corpus, the text that expresses the user's needs. The corpora and the corresponding marked texts are used as a training text set, and a model is trained on this training text set to obtain a text extraction model that can extract text expressing the user's needs from a corpus. The manually marked text is generally related to the service provided by the chat robot. For example, if the chat robot provides a ticket service and a certain corpus is “I want to buy a train ticket”, the manually marked text is “train ticket”.

In the process of implementing the present invention, the inventors have found that the prior art has at least the following problems:

The training text collection is completely obtained by manual annotation. Due to the large amount of corpus data required for obtaining the text extraction model and the low efficiency of manual annotation, the training process of the text extraction model consumes a lot of labor cost and time cost.

Summary of the invention

Embodiments of the present disclosure provide a method and apparatus for acquiring a text extraction model, which can reduce labor cost and time cost. The technical solution is as follows:

In one aspect, a method of obtaining a text extraction model is provided, the method comprising:

Obtaining a first text extraction model, the first text extraction model being obtained according to a manually labeled first training text set, the first training text set comprising a plurality of training corpora and a plurality of annotated texts of the plurality of training corpora;

If the extraction accuracy of the first text extraction model is lower than a preset threshold, acquiring a second training text set, the second training text set comprising a plurality of first training corpora and a plurality of first target texts extracted from the plurality of first training corpora by the first text extraction model, each of the first target texts being a correct text extracted from the first training corpus; and

Acquiring a second text extraction model according to the first training text set and the second training text set.

In one aspect, an apparatus for obtaining a text extraction model is provided, the apparatus comprising:

A model obtaining module, configured to obtain a first text extraction model, where the first text extraction model is obtained according to a manually labeled first training text set, and the first training text set includes a plurality of training corpora and a plurality of annotated texts of the plurality of training corpora;

A training text set acquisition module, configured to acquire a second training text set if the extraction accuracy of the first text extraction model is lower than a preset threshold, where the second training text set includes a plurality of first training corpora and a plurality of first target texts extracted from the plurality of first training corpora by the first text extraction model, each first target text being a correct text extracted from the first training corpus;

The model obtaining module is further configured to acquire a second text extraction model according to the first training text set and the second training text set.

In one aspect, an electronic device is provided, the electronic device comprising a memory and a processor, the memory being configured to store instructions, and the processor being configured to execute the instructions to perform the steps of the method for obtaining a text extraction model described above.

In the embodiments of the present disclosure, a first text extraction model is acquired, and if the extraction accuracy of the first text extraction model is lower than a preset threshold, a second training text set is acquired, the second training text set including a plurality of first training corpora and a plurality of first target texts extracted from the plurality of first training corpora by the first text extraction model. The second training text set is thus obtained by using the acquired first text extraction model rather than by manual labeling. A second text extraction model is then obtained according to the first training text set and the second training text set, so that the process of acquiring the text extraction model tends to be automated. Because obtaining a training text set by means of a model is much more efficient than manual labeling, the acquisition method of the present disclosure can greatly reduce labor costs and time costs.

Brief description of the drawings

To illustrate the technical solutions in the embodiments of the present disclosure more clearly, the drawings used in the description of the embodiments are briefly described below. The drawings in the following description show only some embodiments of the present invention, and those of ordinary skill in the art may obtain other drawings from them without inventive effort.

FIG. 1 is a schematic diagram of an implementation environment for acquiring a text extraction model according to an embodiment of the present disclosure;

FIG. 2 is a flowchart of a method for acquiring a text extraction model according to an embodiment of the present disclosure;

FIG. 3 is a flowchart of acquiring training text according to an embodiment of the present disclosure;

FIG. 4 is a flowchart of acquiring a model iteratively according to an embodiment of the present disclosure;

FIG. 5 is a block diagram of an apparatus for acquiring a text extraction model according to an embodiment of the present disclosure;

FIG. 6 is a block diagram of an apparatus for acquiring a text extraction model according to an embodiment of the present disclosure;

FIG. 7 is a block diagram of an apparatus 700 for acquiring a text extraction model according to an embodiment of the present disclosure.

Detailed description

The embodiments of the present invention will be further described in detail below with reference to the accompanying drawings.

FIG. 1 is a schematic diagram of an implementation environment for acquiring a text extraction model according to an embodiment of the present disclosure. Referring to Figure 1, the implementation environment includes:

At least one server 101, at least one chat bot 102, and at least one terminal 103 (e.g., a mobile terminal or a desktop computer). The server 101 is configured to acquire a first text extraction model; if the extraction accuracy of the first text extraction model is lower than a preset threshold, to acquire a second training text set; and to acquire a second text extraction model according to the acquired training text sets. The acquired text extraction model is applied to the chat bot 102 or the terminal 103. The chat bot 102 is configured to acquire or update a text extraction model under the control of the server 101, and to provide various services, such as a chat service, to the user under the control of the server 101. The smart chat application provided by the server 101 is installed on the terminal 103, and the terminal 103 acquires or updates the text extraction model under the control of the server 101.

The chat bot 102 can also be a server of the smart chat application, and can provide various services to the terminal 103. The chat bot 102 can run on the server 101 or on a plurality of customer service terminals provided by the server 101. The text extraction model described above can be stored on the device where the chat bot 102 is located, and is applied by the chat bot 102 when it is received.

In addition, the server 101 may be configured with at least one database, such as a chat database, a user database, and a user authentication database. The chat database is used to store dialogue corpora between users and the chat bot (or smart chat application); a dialogue corpus may be tagged with the timestamp of the conversation, the service record of the conversation, and the like. The user database is used to store user behavior data, such as logs and comments posted by the user and the user's like and rating behavior. The user authentication database is used to store usernames and passwords.

FIG. 2 is a flowchart of a method for acquiring a text extraction model according to an embodiment of the present disclosure. Referring to FIG. 2, the method can be applied to any device that has at least a processor and a memory, where the processor processes the set of training samples in the memory to obtain a text extraction model. The method specifically includes:

201. The server acquires a first text extraction model, the first text extraction model being obtained according to a manually labeled first training text set.

The first training text set is used to generate a text extraction model. It includes a plurality of training corpora and the correct texts obtained by manually marking those training corpora; a training corpus and the correct text marked from it constitute a pair of training texts. The embodiment of the present disclosure does not limit the form of the training corpus. For example, a training corpus can be a single sentence or a conversation. Moreover, one or more correct texts may be marked from a training corpus, and they are generally related to the service provided by the chat bot (or smart chat application) to which the text extraction model is applied. For example, if the training corpus is "how to go to Hangzhou", the correct text marked can be "Hangzhou"; if the training corpus is "I want to buy a ticket to Tianjin", the correct texts marked can be "Tianjin" and "ticket".

In this step, the server may obtain multiple training corpora from its own database or from the network, and the correct texts are manually marked from the training corpora to obtain the first training text set. The server then trains on the first training text set; that is, it extracts features of each pair of training texts (e.g., context features), determines the values of the parameters of the initial extraction model according to the extracted features, and obtains a first text extraction model whose parameters are known. The initial extraction model may be, but is not limited to, a CRF (conditional random field) model or an HMM (hidden Markov model).

In practice, it may be impossible to mark any text from certain training corpora, such as "what's wrong" and "why". The embodiments of the present disclosure do not limit how such training corpora are processed. For example, the training corpus may simply be discarded without being marked. As another example, a default label such as “None” may be manually added to indicate that no text can be marked from the training corpus. Further, to simplify subsequent manual labeling and improve its efficiency, the server may store the discarded training corpora, or the training corpora to which the default label was added, as reference corpora to be filtered out. After obtaining initial training corpora, the server can filter out the initial training corpora that are the same as the reference corpora, and obtain the filtered training corpora.
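The filtering of initial training corpora against the stored reference corpora can be sketched as follows. This is only an illustration under stated assumptions: the function name `filter_training_corpora` is hypothetical, and the whitespace and case normalization is an added assumption, since the text only requires removing corpora that are the same as a reference corpus.

```python
def filter_training_corpora(initial_corpora, reference_corpora):
    """Drop initial corpora that match a stored reference corpus.

    Reference corpora are those from which no text could be marked
    (discarded, or given the default label "None").
    """
    # Assumption: trivial whitespace/case differences still count as "the same".
    blocked = {corpus.strip().lower() for corpus in reference_corpora}
    return [c for c in initial_corpora if c.strip().lower() not in blocked]


reference = ["what's wrong", "why"]
initial = ["I want to buy a ticket to Tianjin", "Why", "how to go to Hangzhou"]
filtered = filter_training_corpora(initial, reference)
```

Only the filtered corpora would then be passed on for manual labeling.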

It should be noted that, before the training process, each parameter of the initial extraction model may be initialized, and during training, stochastic gradient descent with forward and backward propagation may be used to optimize the parameters of the text extraction model so as to minimize its error.

In addition, to reduce the cost of manual labeling, the number of training texts in the first training text set is much smaller than the number originally required for training. For example, if the number of training texts originally required is N, the number required by embodiments of the present disclosure may be 50%*N.

202. If the extraction accuracy of the first text extraction model is lower than a preset threshold, the server acquires a second training text set, where the second training text set includes a plurality of first training corpora and a plurality of first target texts extracted from the plurality of first training corpora by the first text extraction model.

The text extracted by the first text extraction model may be correct or incorrect. To ensure that the text extraction model obtained according to the second training text set has higher extraction accuracy, the first target text in the embodiments of the present disclosure is the correct text that should be extracted from the first training corpus. In this step, the server determines the extraction accuracy of the first text extraction model and determines whether it is lower than a preset threshold. If so, the server acquires the second training text set; otherwise, the server determines that the first text extraction model is usable. The embodiments of the present disclosure do not limit the preset threshold; for example, the preset threshold may be 80%. In fact, even if the extraction accuracy of the first text extraction model is not lower than the preset threshold, the server may continue to acquire a second training text set in order to further improve the accuracy. In this case, after obtaining the first training corpora, the server may directly take the text extracted by the first text extraction model as the first target text, or may follow the process of obtaining the second training text set described below, in which the extracted text is manually confirmed.

The embodiments of the present disclosure do not limit the specific method of determining the extraction accuracy. For example, the server may determine it using the following steps (1)-(3):

(1) The server obtains a test text set, and the test text set includes a plurality of test corpora and a plurality of correct texts manually marked from the plurality of test corpora.

The acquisition process of the test text collection is the same as the acquisition process of the first training text collection, but the test text collection is used to test the extraction accuracy of the first text extraction model.

(2) For each test corpus of the plurality of test corpora, the server extracts the second text from the test corpus through the first text extraction model.

In step (2), the server inputs each test corpus into the first text extraction model and takes the text that the model outputs for that test corpus as the second text.

(3) The server determines, as the extraction accuracy of the first text extraction model, the ratio of the number of second texts that are the same as a correct text to the number of the plurality of correct texts.

In step (3), the server may determine the number A of the plurality of correct texts (which is also the number of the plurality of test corpora), and determine whether the second text corresponding to each test corpus is the same as the correct text corresponding to that test corpus; if they are the same, the second text is counted, and otherwise it is ignored. The server can then determine the number B of second texts that are the same as a correct text, and determine the ratio of B to A as the extraction accuracy of the first text extraction model.
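Steps (1)-(3) amount to computing the ratio B/A over a test text set. The following is a minimal sketch, assuming one correct text per test corpus; `extraction_accuracy` and `toy_extract` are hypothetical names, and `toy_extract` is only a stand-in for the first text extraction model (it returns the last word of the corpus).

```python
def extraction_accuracy(extract, test_set):
    """Return B / A for a list of (test corpus, correct text) pairs.

    A is the number of correct texts (one per test corpus); B is the number
    of extracted second texts that equal the corresponding correct text.
    """
    if not test_set:
        raise ValueError("test set must not be empty")
    matches = sum(1 for corpus, correct in test_set if extract(corpus) == correct)
    return matches / len(test_set)


def toy_extract(corpus):
    # Stand-in for the first text extraction model: returns the last word.
    return corpus.split()[-1]


test_set = [
    ("how to go to Hangzhou", "Hangzhou"),
    ("I want to buy a train ticket", "train ticket"),
    ("a ticket to Tianjin", "Tianjin"),
    ("fly to Beijing", "Beijing"),
]
accuracy = extraction_accuracy(toy_extract, test_set)  # 3 of 4 match
needs_more_training = accuracy < 0.8  # 80% is the example threshold from the text
```

With this toy data the stand-in model misses "train ticket", so the accuracy falls below the example threshold and a second training text set would be acquired.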

In the step 202, the process of the server acquiring the second training text set may be specifically: if the extraction accuracy of the first text extraction model is lower than a preset threshold, the server acquires multiple first training corpora; and for the plurality of first training corpora Each of the first training corpora, the server extracts the first text from the first training corpus by the first text extraction model; if the first text is correct, the first training corpus and the first text are used as the second training text set a pair of training texts; if the first text is wrong, the first training corpus and the manually corrected text are used as a pair of training texts in the second training text set.

For the specific process, refer to the flowchart of acquiring training text shown in FIG. 3. The server may input each first training corpus into the first text extraction model and obtain the text output for that training corpus as the first text. The server then obtains judgment information manually added to the first text, where the judgment information indicates whether the first text is correct. If the judgment information indicates that the first text is correct, the server may directly use the first training corpus and the first text as a pair of training texts in the second training text set. If the judgment information indicates that the first text is wrong, the server may obtain the manually corrected text carried in the judgment information, and use the first training corpus and the manually corrected text as a pair of training texts in the second training text set.

In fact, to improve the efficiency of obtaining the text extraction model, the annotator need not act on every first text when judging whether it is correct, but may directly correct only the wrong first texts. The server then obtains each manually corrected text with its corresponding first training corpus, and directly obtains the remaining, uncorrected first texts with their corresponding first training corpora.
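The construction of the second training text set can be sketched as follows, using the shortcut in which only wrong first texts are manually corrected. The names `build_second_training_set`, `corrections`, and `toy_extract` are hypothetical, and the dictionary of corrections is an assumed representation of the manual judgment information.

```python
def build_second_training_set(extract, first_training_corpora, corrections):
    """Pair each first training corpus with its first target text.

    `corrections` maps a corpus to its manually corrected text and holds
    entries only for corpora whose extracted first text was judged wrong.
    """
    second_training_set = []
    for corpus in first_training_corpora:
        first_text = extract(corpus)  # text extracted by the first model
        # Use the manual correction if there is one, else keep the first text.
        target_text = corrections.get(corpus, first_text)
        second_training_set.append((corpus, target_text))
    return second_training_set


def toy_extract(corpus):
    # Stand-in for the first text extraction model: returns the last word.
    return corpus.split()[-1]


corpora = ["how to go to Hangzhou", "I want to buy a train ticket"]
corrections = {"I want to buy a train ticket": "train ticket"}
second_set = build_second_training_set(toy_extract, corpora, corrections)
```

Here the first corpus keeps the model's own output, while the second takes the manually corrected text.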

It should be noted that the embodiments of the present disclosure do not limit the manner in which the first training corpora are obtained. For example, the server can obtain them from the network or from its own database. To better understand users' needs, the database can be the user database. To bring the training corpora closer to the actual application environment of the text extraction model, and thereby improve the model's hit rate on users' corpora when it is applied, the database can be the chat database or the like. Taking obtaining the first training corpora from the chat database as an example, the server can adopt at least two acquisition modes:

Acquisition mode 1. If the extraction accuracy of the first text extraction model is lower than the preset threshold, the server acquires the dialogue corpora within a preset time period from the chat database and uses them as the plurality of first training corpora.

To obtain the first training corpora in a targeted manner, the server may acquire the dialogue corpora within a preset time period. The embodiments of the present disclosure do not specifically limit the preset time period. For example, to bring the first training corpora closer to the way current users express themselves, so that the acquired text extraction model has higher extraction accuracy when applied, the preset time period may be the most recent month. As another example, to make the first training corpora better match the different services provided by different chat bots, and thereby improve the extraction accuracy of the text extraction model, the preset time period may match the period during which a service is provided, with the dialogue corpora obtained separately for each period and a corresponding text extraction model acquired for each. For instance, the service periods may be divided as follows: the ticket service period is daytime, and the ticket consultation service period is nighttime.

In acquisition mode 1, the server may query the chat database for the dialogue corpora within the preset time period, and use the dialogue corpora returned by the query as the plurality of first training corpora.
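Acquisition mode 1 can be illustrated with a small sketch in which the chat database is modelled as an in-memory list of (timestamp, text) records. That representation, the function name `corpora_in_period`, and the sample dates are assumptions for illustration; a real deployment would issue a database query instead.

```python
from datetime import datetime, timedelta


def corpora_in_period(chat_records, start, end):
    """Return the dialogue corpora whose timestamp lies in [start, end)."""
    return [text for timestamp, text in chat_records if start <= timestamp < end]


# Hypothetical chat records: (timestamp of the conversation, dialogue corpus).
records = [
    (datetime(2017, 2, 20, 9, 0), "I want to buy a train ticket"),
    (datetime(2016, 12, 1, 21, 0), "how to go to Hangzhou"),
    (datetime(2017, 2, 1, 14, 30), "a ticket to Tianjin"),
]

now = datetime(2017, 2, 27)
# "The most recent month" is approximated here as the last 30 days.
recent = corpora_in_period(records, now - timedelta(days=30), now)
```

A per-service variant would apply the same filter to time-of-day windows (for example daytime and nighttime) instead of calendar ranges.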

Acquisition mode 2. If the extraction accuracy of the first text extraction model is lower than a preset threshold, the server selects the dialogue corpora of successful dialogues from the chat database and uses them as the plurality of first training corpora. A dialogue corpus of a successful dialogue is a dialogue corpus in which the chat bot successfully provided a service to the user.

To make the first training corpora more valuable as references, the dialogue corpora of successful dialogues can be obtained as the first training corpora. There are many ways to determine which dialogue corpora are from successful dialogues. For example, the server can use at least the following three judgment modes:

Judgment mode 1. When a success keyword is present in any dialogue corpus, the server determines that the dialogue corpus is from a successful dialogue.

The embodiments of the present disclosure do not limit the success keywords. For example, since the user usually expresses gratitude when a dialogue succeeds, the success keywords can be: good, thank you. As another example, when a dialogue succeeds, the chat bot's reply may also include success keywords, such as: no problem, no thanks.

Judgment mode 2. When a failure keyword is present in any dialogue corpus, the server removes that dialogue corpus and determines that the remaining dialogue corpora are from successful dialogues.

The embodiments of the present disclosure do not limit the failure keywords. For example, since the user may point out that the chat bot misunderstood when a dialogue fails, the failure keywords can be: you are wrong, that is not what I meant. As another example, when a dialogue fails, the chat bot's reply may also include failure keywords, such as: don't mind, don't understand what you mean, please say it again.

Judgment mode 3. When any dialogue corpus has a corresponding service record, the server determines that the dialogue corpus is from a successful dialogue.

When a dialogue corpus has a corresponding service record, this indicates that a service was successfully provided to the user through the dialogue. Therefore, a dialogue corpus that has a corresponding service record can be treated as a dialogue corpus of a successful dialogue.

It should be noted that, when acquiring the first training corpora, two or more of the above three judgment modes may be combined. For example, keyword detection and service record detection may be combined: when a success keyword is present in any dialogue corpus, the server determines that the dialogue corpus is from a successful dialogue, and when any dialogue corpus has a corresponding service record, the server also determines that the dialogue corpus is from a successful dialogue. This avoids missing dialogue corpora that have reference value and improves data utilization.
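One possible combination of the three judgment modes is sketched below. The dictionary layout with `text` and `service_record` fields, the keyword lists, the record identifier, and the rule that a failure keyword overrides the other signals are all assumptions made for illustration; the text leaves the exact combination open.

```python
# Example keywords only; the disclosure does not limit the keyword lists.
SUCCESS_KEYWORDS = ("thank you", "no problem")
FAILURE_KEYWORDS = ("you are wrong", "please say it again")


def is_successful(dialogue):
    """Judge one dialogue corpus using keywords and its service record."""
    text = dialogue["text"].lower()
    # Judgment mode 2: a failure keyword rules the dialogue out.
    if any(keyword in text for keyword in FAILURE_KEYWORDS):
        return False
    # Judgment mode 3 or judgment mode 1: service record or success keyword.
    has_record = dialogue.get("service_record") is not None
    return has_record or any(keyword in text for keyword in SUCCESS_KEYWORDS)


dialogues = [
    {"text": "User: a train ticket. Bot: Booked. User: Thank you", "service_record": None},
    {"text": "User: a ticket to Tianjin. Bot: Done.", "service_record": "order-1"},
    {"text": "User: to Hangzhou. Bot: Please say it again", "service_record": None},
    {"text": "User: why", "service_record": None},
]
first_training_corpora = [d["text"] for d in dialogues if is_successful(d)]
```

Plain substring matching is used here for brevity; short keywords such as "good" would need word-boundary matching to avoid false hits.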

203. The server acquires a second text extraction model according to the first training text set and the second training text set.

Based on the first training text set and the second training text set, the server may retrain on the two training text sets to obtain a second text extraction model.

In fact, if the extraction accuracy of the second text extraction model obtained in a round of training is lower than the preset threshold, the server may continue to acquire a training text set and perform the next round of model training based on every training text set acquired so far, until the extraction accuracy of the second text extraction model obtained by training is not lower than the preset threshold. Each such training text set includes a plurality of second training corpora and a plurality of second target texts extracted from the plurality of second training corpora by the second text extraction model obtained in the preceding round of training.

For example, FIG. 4 is a flowchart of acquiring a model iteratively according to an embodiment of the present disclosure. Referring to FIG. 4, the server may determine the extraction accuracy of the second text extraction model by the method for determining extraction accuracy in step 202. If the determined extraction accuracy is not lower than the preset threshold, the server determines that the second text extraction model is usable. If the determined extraction accuracy is lower than the preset threshold, the server continues to acquire a training text set, by the same process used to acquire the second training text set, and trains on the first training text set, the second training text set, and the newly acquired training text set to obtain a more accurate text extraction model. The server then determines the extraction accuracy of that text extraction model again, and if it is still lower than the preset threshold, continues to acquire training text sets, until the extraction accuracy of the text extraction model obtained by this iteration is not lower than the preset threshold.
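The iteration of FIG. 4 can be sketched as a loop that keeps acquiring training text sets and retraining on all of them until the accuracy is no longer below the threshold. Everything here is a hypothetical illustration: the callables `train`, `evaluate`, and `next_training_set`, the toy "model" that merely memorizes pairs, and the `max_rounds` safety cap (which is not part of the described method).

```python
def iterate_until_accurate(train, evaluate, first_set, next_training_set,
                           threshold=0.8, max_rounds=10):
    """Retrain on all acquired training text sets until accuracy >= threshold."""
    training_sets = [first_set]
    model = train(training_sets)
    rounds = 0
    # max_rounds is an added safeguard against looping forever.
    while evaluate(model) < threshold and rounds < max_rounds:
        training_sets.append(next_training_set(model))
        model = train(training_sets)
        rounds += 1
    return model, rounds


def train(training_sets):
    # Toy "model": memorize every (corpus, text) pair seen so far.
    return {corpus: text for pairs in training_sets for corpus, text in pairs}


test_set = [("a", "A"), ("b", "B"), ("c", "C"), ("d", "D"), ("e", "E")]


def evaluate(model):
    return sum(model.get(c) == t for c, t in test_set) / len(test_set)


pool = iter([[("b", "B"), ("c", "C")], [("d", "D")], [("e", "E")]])


def next_training_set(model):
    # Stand-in for extraction by the current model plus manual confirmation.
    return next(pool)


model, rounds = iterate_until_accurate(train, evaluate, [("a", "A")], next_training_set)
```

In this toy run the accuracy rises from 0.2 to 0.6 to 0.8, so the loop stops after two additional rounds.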

It should be noted that after the server obtains the final text extraction model, it may store the text extraction model temporarily and wait for an instruction to apply it, or it may apply the text extraction model directly, for example by applying it to the chat bot or by updating it to the smart chat application on the user's terminal.

For example, taking application on the chat bot as an example, when the server of the chat bot receives a dialogue message sent by any user, it inputs the dialogue message into the second text extraction model obtained by training to obtain the semantic information of the dialogue message. It then determines, according to that semantic information, a reply message for the dialogue message from a reply message database, and returns the reply message to the user who asked. Because the second text extraction model extracts semantics more accurately, a better dialogue effect is achieved and the intelligence of the chat bot is improved.
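The application flow on the chat bot can be sketched as an extraction followed by a lookup. The reply message database is modelled as a plain dictionary keyed by the extracted text, and `reply_to`, `toy_extract`, the sample reply, and the fallback reply are hypothetical; the disclosure does not specify how the reply is selected.

```python
def reply_to(message, extract, reply_database, fallback="Please say it again."):
    """Extract the semantic information of a message and look up a reply."""
    semantic_info = extract(message)
    # Assumption: fall back to a generic reply when nothing matches.
    return reply_database.get(semantic_info, fallback)


def toy_extract(message):
    # Stand-in for the trained second text extraction model.
    return message.split()[-1]


reply_database = {"Hangzhou": "Here are the routes to Hangzhou."}
answer = reply_to("how to go to Hangzhou", toy_extract, reply_database)
unknown = reply_to("what's wrong", toy_extract, reply_database)
```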

In addition, a specific method for obtaining the second training text set is provided: a first training corpus is acquired, and a first text is extracted from it by the first text extraction model. If the first text is correct, the first training corpus and the first text are directly used as a pair of training texts in the second training text set; if the first text is wrong, the manually corrected text and the first training corpus are used as a pair of training texts in the second training text set. Because the second training text set is obtained by the first text extraction model and manually confirmed, its accuracy is ensured while the efficiency of acquiring it is maintained.

In addition, at least two specific methods for obtaining the first training corpora are provided. For example, to ensure the validity of the dialogue corpora, the dialogue corpora within a preset time period may be obtained from the chat database; or, to make the first training corpora more valuable as references, the dialogue corpora of successful dialogues may be obtained from the chat database.

In addition, a specific method for determining the extraction accuracy is provided: a test text set is obtained, a second text is extracted from each test corpus by the first text extraction model, the number of second texts that are the same as a correct text and the number of correct texts are determined, and the ratio of the former to the latter is taken as the extraction accuracy of the first text extraction model. This provides a specific way to test whether the first text extraction model meets the standard.

In addition, after the second text extraction model is acquired, the extraction accuracy of the current text extraction model may also be determined. If it is lower than the preset threshold, a further training text set is acquired and training is performed based on the acquired training text sets, until the extraction accuracy of the resulting text extraction model is not lower than the preset threshold. The text extraction model is thereby continuously optimized by iteration, so that a text extraction model with high extraction accuracy is finally obtained.

FIG. 5 is a block diagram of an apparatus for acquiring a text extraction model according to an embodiment of the present disclosure. Referring to FIG. 5, the device specifically includes:

The model obtaining module 501 is configured to obtain a first text extraction model, where the first text extraction model is obtained according to the manually labeled first training text set, where the first training text set includes a plurality of training corpora and the plurality of training corpora Multiple label texts;

The training text collection obtaining module 502 is configured to obtain a second training text set if the extraction accuracy of the first text extraction model is lower than a preset threshold, where the second training text set includes a plurality of first training corpora and a plurality of first target texts extracted from the plurality of first training corpora by the first text extraction model, each of the first target texts being the correct text extracted from a first training corpus;

The model obtaining module 501 is further configured to acquire the second text extraction model according to the first training text set and the second training text set.

In a possible implementation, the training text collection obtaining module 502 is configured to:

If the extraction accuracy of the first text extraction model is lower than the preset threshold, acquire a plurality of first training corpora; for each of the plurality of first training corpora, extract a first text from the first training corpus by the first text extraction model; if the first text is correct, use the first training corpus and the first text as a pair of training texts in the second training text set; if the first text is wrong, use the first training corpus and the manually corrected text as a pair of training texts in the second training text set.
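The construction of the second training text set can be sketched as follows. This is a minimal illustration: `is_correct` and `correct_manually` are hypothetical callables standing in for the human check and the manual correction, and the toy `gold` dictionary simulates a human reviewer.

```python
from typing import Callable, Dict, Iterable, List, Tuple


def build_second_training_set(
    first_corpora: Iterable[str],
    extract: Callable[[str], str],
    is_correct: Callable[[str, str], bool],
    correct_manually: Callable[[str, str], str],
) -> List[Tuple[str, str]]:
    """Pair every first training corpus with a correct target text."""
    pairs = []
    for corpus in first_corpora:
        first_text = extract(corpus)  # text extracted by the first model
        if is_correct(corpus, first_text):
            # A correct extraction is used directly, with no manual labeling.
            pairs.append((corpus, first_text))
        else:
            # A wrong extraction is replaced by the manually corrected text.
            pairs.append((corpus, correct_manually(corpus, first_text)))
    return pairs


# Toy stand-ins: the "model" keeps the last two words; "gold" plays the reviewer.
gold: Dict[str, str] = {
    "play some jazz": "some jazz",
    "what is the weather in Paris": "weather in Paris",
}
second_set = build_second_training_set(
    gold.keys(),
    extract=lambda corpus: " ".join(corpus.split()[-2:]),
    is_correct=lambda corpus, text: gold[corpus] == text,
    correct_manually=lambda corpus, text: gold[corpus],
)
```

Only the second corpus requires manual correction in this toy run, which reflects the intended saving: human effort is spent only on the extractions that the first model gets wrong.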

In a possible implementation, the training text collection obtaining module 502 is configured to: if the extraction accuracy of the first text extraction model is lower than the preset threshold, obtain the dialogue corpora within a preset time period from a chat database, and use the dialogue corpora within the preset time period as the plurality of first training corpora, where the chat database is used to store dialogue corpora between users and a chat robot.

In a possible implementation, the training text collection obtaining module 502 is configured to: if the extraction accuracy of the first text extraction model is lower than the preset threshold, filter out the dialogue corpora of successful dialogues from the chat database, and use the dialogue corpora of successful dialogues as the plurality of first training corpora, where the chat database is used to store dialogue corpora between users and the chat robot, and a dialogue corpus of a successful dialogue is a dialogue corpus in which the chat robot successfully provided a service to the user.
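The two ways of selecting first training corpora from the chat database can be sketched as below. The record layout (`DialogueRecord` with `timestamp` and `successful` fields) is an assumption made for illustration; the disclosure does not specify how the chat database stores dialogues or marks them as successful.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List


@dataclass
class DialogueRecord:
    corpus: str          # text of the dialogue between user and chat robot
    timestamp: datetime  # when the dialogue took place
    successful: bool     # True if the chat robot successfully served the user


def corpora_in_period(
    records: List[DialogueRecord], now: datetime, period: timedelta
) -> List[str]:
    """Dialogue corpora recorded within the preset time period ending at `now`."""
    earliest = now - period
    return [r.corpus for r in records if earliest <= r.timestamp <= now]


def corpora_of_successful_dialogues(records: List[DialogueRecord]) -> List[str]:
    """Dialogue corpora in which the chat robot successfully provided a service."""
    return [r.corpus for r in records if r.successful]


now = datetime(2018, 2, 13)
chat_database = [
    DialogueRecord("book a table for two", now - timedelta(days=1), True),
    DialogueRecord("tell me something", now - timedelta(days=2), False),
    DialogueRecord("order a taxi to the airport", now - timedelta(days=40), True),
]
recent = corpora_in_period(chat_database, now, timedelta(days=30))
successful = corpora_of_successful_dialogues(chat_database)
```

Either result list could then serve as the plurality of first training corpora.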

In a possible implementation, based on the device composition of FIG. 5, referring to FIG. 6, the device further includes:

The test text collection obtaining module 503 is configured to obtain a test text set, where the test text set includes a plurality of test corpora and a plurality of correct texts manually marked from the plurality of test corpora;

The extracting module 504 is configured to extract, from the test corpus, the second text by using the first text extraction model for each of the plurality of test corpora;

The determining module 505 is configured to determine the ratio of the number of second texts identical to any correct text to the number of the plurality of correct texts as the extraction accuracy of the first text extraction model.

In a possible implementation manner, the training text collection obtaining module 502 is further configured to continue to acquire the training text set if the extraction accuracy of the second text extraction model obtained by the current training is lower than the preset threshold;

The model obtaining module 501 is further configured to perform the next training based on the acquired sets of training texts until the extraction accuracy of the second text extraction model obtained by the training is not lower than the preset threshold, where the training text set includes a plurality of second training corpora and a plurality of second target texts extracted from the plurality of second training corpora by the second text extraction model obtained by the training.

All of the above optional technical solutions may be used in any combination to form an optional embodiment of the present invention, and will not be further described herein.

It should be noted that, when the apparatus for obtaining a text extraction model provided by the foregoing embodiment obtains the text extraction model, the division into the functional modules described above is only used as an example. In actual applications, the foregoing functions may be assigned to different functional modules as needed; that is, the internal structure of the device may be divided into different functional modules to complete all or part of the functions described above. In addition, the apparatus for acquiring the text extraction model and the method for obtaining the text extraction model provided by the foregoing embodiments belong to the same concept; the specific implementation process is described in detail in the method embodiment and is not repeated here.

FIG. 7 is a block diagram of an electronic device 700 according to an embodiment of the present disclosure. Referring to Figure 7, electronic device 700 includes a processing component 722 that further includes one or more processors, and memory resources represented by memory 732 for storing instructions executable by processing component 722, such as an application. An application stored in memory 732 can include one or more modules each corresponding to a set of instructions. Further, the processing component 722 is configured to execute instructions to perform a method of obtaining a text extraction model:

In one possible implementation, the processor is configured to execute the instructions to perform the steps of:

When a conversation message is received, the conversation message is input to the second text extraction model to obtain semantic information of the conversation message;

And querying the reply message of the conversation message according to the semantic information of the conversation message.

If the extraction accuracy of the first text extraction model is lower than the preset threshold, acquiring the plurality of first training corpora;

For each of the plurality of first training corpora, extracting the first text from the first training corpus by the first text extraction model;

If the first text is correct, the first training corpus and the first text are used as a pair of training texts in the second training text set;

If the first text is erroneous, the first training corpus and the manually corrected text are used as a pair of training texts in the second training text set.

If the extraction accuracy of the first text extraction model is lower than the preset threshold, obtaining the dialogue corpora within a preset time period from a chat database, and using the dialogue corpora within the preset time period as the plurality of first training corpora, the chat database being used to store dialogue corpora between users and a chat robot.

If the extraction accuracy of the first text extraction model is lower than the preset threshold, filtering out the dialogue corpora of successful dialogues from the chat database, and using the dialogue corpora of successful dialogues as the plurality of first training corpora, the chat database being used to store dialogue corpora between users and the chat robot, and a dialogue corpus of a successful dialogue being a dialogue corpus in which the chat robot successfully provided a service to the user.

Obtaining a test text set, the test text set including a plurality of test corpora and a plurality of correct texts manually marked from the plurality of test corpora;

For each of the plurality of test corpora, extracting the second text from the test corpus by the first text extraction model;

Determining the ratio of the number of second texts identical to any of the correct texts to the number of the plurality of correct texts as the extraction accuracy of the first text extraction model.

If the extraction accuracy of the second text extraction model obtained by the current training is lower than the preset threshold, continuing to acquire a training text set and performing the next training based on the acquired training text sets, until the extraction accuracy of the second text extraction model obtained by training is not lower than the preset threshold, the training text set including a plurality of second training corpora and a plurality of second target texts extracted from the plurality of second training corpora by the second text extraction model obtained by the current training.
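Among the steps listed above, the handling of a received conversation message (inputting it to the second text extraction model to obtain semantic information, then querying a reply) can be sketched as follows. The keyword-based `toy_extract`, the `reply_table` dictionary and the fallback reply are hypothetical; the disclosure does not specify how replies are stored or queried.

```python
from typing import Callable, Dict


def reply_to_message(
    message: str,
    extract: Callable[[str], str],
    reply_table: Dict[str, str],
    fallback: str,
) -> str:
    """Extract semantic information from the message, then look up a reply for it."""
    semantic_info = extract(message)
    return reply_table.get(semantic_info, fallback)


def toy_extract(message: str) -> str:
    """Hypothetical stand-in for the second text extraction model."""
    lowered = message.lower()
    for keyword in ("weather", "taxi"):
        if keyword in lowered:
            return keyword
    return ""


reply_table = {
    "weather": "Here is today's forecast.",
    "taxi": "A taxi is on its way.",
}
fallback = "Sorry, I did not understand that."
first_reply = reply_to_message("What is the weather like?", toy_extract, reply_table, fallback)
second_reply = reply_to_message("Tell me a joke", toy_extract, reply_table, fallback)
```

The fallback reply is a design choice of this sketch for messages whose semantic information has no matching entry.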

The electronic device 700 can also include a power supply component 726 configured to perform power management of the electronic device 700, a wired or wireless network interface 750 configured to connect the electronic device 700 to a network, and an input/output (I/O) interface 758. The electronic device 700 may operate based on an operating system stored in the memory 732, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™ or the like.

In an exemplary embodiment, there is also provided a computer readable storage medium, such as a memory comprising instructions executable by a processor in a terminal to perform the method of acquiring a text extraction model in the embodiments described above. For example, the computer readable storage medium can be a ROM, a random access memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, or an optical data storage device.

A person skilled in the art may understand that all or part of the steps of the above embodiments may be implemented by hardware, or by a program instructing related hardware, and the program may be stored in a computer readable storage medium. The storage medium mentioned may be a read only memory, a magnetic disk, an optical disk, or the like.

The above are only preferred embodiments of the present invention and are not intended to limit the present invention. Any modifications, equivalents, improvements, etc., which are within the spirit and scope of the present invention, should be included within the scope of protection of the present invention.

Claims

A method for obtaining a text extraction model, the method comprising:

Obtaining a first text extraction model, the first text extraction model being obtained according to a manually labeled first training text set, the first training text set comprising a plurality of training corpora and a plurality of annotated texts of the plurality of training corpora;

If the extraction accuracy of the first text extraction model is lower than a preset threshold, acquiring a second training text set, the second training text set comprising a plurality of first training corpora and a plurality of first target texts extracted from the plurality of first training corpora by the first text extraction model, each of the first target texts being a correct text extracted from a first training corpus;

And acquiring a second text extraction model according to the first training text set and the second training text set.
The method according to claim 1, wherein after obtaining the second text extraction model according to the first training text set and the second training text set, the method further comprises:

When the conversation message is received, the conversation message is input to the second text extraction model to obtain semantic information of the conversation message;

And querying the reply message of the conversation message according to the semantic information of the conversation message.
The method according to claim 1, wherein if the extraction accuracy of the first text extraction model is lower than a preset threshold, acquiring the second training text set comprises:

If the extraction accuracy of the first text extraction model is lower than the preset threshold, acquiring the plurality of first training corpora;

For each of the plurality of first training corpora, extracting the first text from the first training corpus by the first text extraction model;

If the first text is correct, the first training corpus and the first text are used as a pair of training texts in the second training text set;

If the first text is erroneous, the first training corpus and the manually corrected text are used as a pair of training texts in the second training text set.
The method according to claim 3, wherein if the extraction accuracy of the first text extraction model is lower than the preset threshold, acquiring the plurality of first training corpora comprises:

If the extraction accuracy of the first text extraction model is lower than the preset threshold, obtaining the dialogue corpora within a preset time period from a chat database, and using the dialogue corpora within the preset time period as the plurality of first training corpora, the chat database being used to store dialogue corpora between users and a chat robot.
The method according to claim 3, wherein if the extraction accuracy of the first text extraction model is lower than the preset threshold, acquiring the plurality of first training corpora comprises:

If the extraction accuracy of the first text extraction model is lower than the preset threshold, filtering out the dialogue corpora of successful dialogues from a chat database, and using the dialogue corpora of successful dialogues as the plurality of first training corpora, the chat database being used to store dialogue corpora between users and a chat robot, and a dialogue corpus of a successful dialogue being a dialogue corpus in which the chat robot successfully provided a service to the user.
The method according to claim 1, wherein the process of determining the extraction accuracy of the first text extraction model comprises:

Obtaining a test text set, the test text set including a plurality of test corpora and a plurality of correct texts manually marked from the plurality of test corpora;

For each of the plurality of test corpora, extracting the second text from the test corpus by the first text extraction model;

Determining the ratio of the number of second texts identical to any of the correct texts to the number of the plurality of correct texts as the extraction accuracy of the first text extraction model.
The method according to claim 1, wherein after the obtaining the second text extraction model according to the first training text set and the second training text set, the method further comprises:

If the extraction accuracy of the second text extraction model obtained by the current training is lower than the preset threshold, continuing to acquire a training text set and performing the next training based on the acquired training text sets, until the extraction accuracy of the second text extraction model obtained by training is not lower than the preset threshold, the training text set comprising a plurality of second training corpora and a plurality of second target texts extracted from the plurality of second training corpora by the second text extraction model obtained by the current training.
An apparatus for obtaining a text extraction model, the apparatus comprising:

a model obtaining module, configured to obtain a first text extraction model, where the first text extraction model is obtained according to a manually labeled first training text set, and the first training text set includes a plurality of training corpora and a plurality of annotated texts of the plurality of training corpora;

a training text collection obtaining module, configured to acquire a second training text set if the extraction accuracy of the first text extraction model is lower than a preset threshold, where the second training text set includes a plurality of first training corpora and a plurality of first target texts extracted from the plurality of first training corpora by the first text extraction model, each first target text being a correct text extracted from a first training corpus;

The model obtaining module is further configured to acquire a second text extraction model according to the first training text set and the second training text set.
The apparatus according to claim 8, wherein the training text set obtaining module is configured to:

If the extraction accuracy of the first text extraction model is lower than the preset threshold, acquiring the plurality of first training corpora;

For each of the plurality of first training corpora, extracting the first text from the first training corpus by the first text extraction model;

If the first text is correct, the first training corpus and the first text are used as a pair of training texts in the second training text set;

If the first text is erroneous, the first training corpus and the manually corrected text are used as a pair of training texts in the second training text set.
The apparatus according to claim 9, wherein the training text collection obtaining module is configured to:

If the extraction accuracy of the first text extraction model is lower than the preset threshold, obtain the dialogue corpora within a preset time period from a chat database, and use the dialogue corpora within the preset time period as the plurality of first training corpora, the chat database being used to store dialogue corpora between users and a chat robot.
The apparatus according to claim 9, wherein the training text collection obtaining module is configured to:

If the extraction accuracy of the first text extraction model is lower than the preset threshold, filter out the dialogue corpora of successful dialogues from a chat database, and use the dialogue corpora of successful dialogues as the plurality of first training corpora, the chat database being used to store dialogue corpora between users and a chat robot, and a dialogue corpus of a successful dialogue being a dialogue corpus in which the chat robot successfully provided a service to the user.
The device according to claim 8, wherein the device further comprises:

a test text collection obtaining module, configured to obtain a test text set, where the test text set includes a plurality of test corpora and a plurality of correct texts manually marked from the plurality of test corpora;

An extracting module, configured to extract, from the test corpus, a second text by using the first text extraction model for each of the plurality of test corpora;

And a determining module, configured to determine the ratio of the number of second texts identical to any correct text to the number of the plurality of correct texts as the extraction accuracy of the first text extraction model.
The device of claim 8 wherein:

The training text set obtaining module is further configured to continue to acquire the training text set if the extraction accuracy of the second text extraction model obtained by the current training is lower than the preset threshold;

The model obtaining module is further configured to perform the next training based on the acquired sets of training texts until the extraction accuracy of the second text extraction model obtained by the training is not lower than the preset threshold, where the training text set includes a plurality of second training corpora and a plurality of second target texts extracted from the plurality of second training corpora by the second text extraction model obtained by the training.
An electronic device, comprising: a processor and a memory for storing instructions, the processor being configured to execute the instructions to perform the following steps of a method for acquiring a text extraction model:

Obtaining a first text extraction model, the first text extraction model being obtained according to a manually labeled first training text set, the first training text set comprising a plurality of training corpora and a plurality of annotated texts of the plurality of training corpora;

If the extraction accuracy of the first text extraction model is lower than a preset threshold, acquiring a second training text set, the second training text set comprising a plurality of first training corpora and a plurality of first target texts extracted from the plurality of first training corpora by the first text extraction model, each of the first target texts being a correct text extracted from a first training corpus;

And acquiring a second text extraction model according to the first training text set and the second training text set.
The electronic device of claim 14, wherein the processor is configured to execute the instructions to perform the steps of:

When the conversation message is received, the conversation message is input to the second text extraction model to obtain semantic information of the conversation message;

And querying the reply message of the conversation message according to the semantic information of the conversation message.
The electronic device of claim 14, wherein the processor is configured to execute the instructions to perform the steps of:

If the extraction accuracy of the first text extraction model is lower than the preset threshold, acquiring the plurality of first training corpora;

For each of the plurality of first training corpora, extracting the first text from the first training corpus by the first text extraction model;

If the first text is correct, the first training corpus and the first text are used as a pair of training texts in the second training text set;

If the first text is erroneous, the first training corpus and the manually corrected text are used as a pair of training texts in the second training text set.
The electronic device of claim 16 wherein said processor is configured to execute said instructions to perform the steps of:

If the extraction accuracy of the first text extraction model is lower than the preset threshold, obtaining the dialogue corpora within a preset time period from a chat database, and using the dialogue corpora within the preset time period as the plurality of first training corpora, the chat database being used to store dialogue corpora between users and a chat robot.
The electronic device of claim 16 wherein said processor is configured to execute said instructions to perform the steps of:

If the extraction accuracy of the first text extraction model is lower than the preset threshold, filtering out the dialogue corpora of successful dialogues from a chat database, and using the dialogue corpora of successful dialogues as the plurality of first training corpora, the chat database being used to store dialogue corpora between users and a chat robot, and a dialogue corpus of a successful dialogue being a dialogue corpus in which the chat robot successfully provided a service to the user.
The electronic device of claim 14, wherein the processor is configured to execute the instructions to perform the steps of:

Obtaining a test text set, the test text set including a plurality of test corpora and a plurality of correct texts manually marked from the plurality of test corpora;

For each of the plurality of test corpora, extracting the second text from the test corpus by the first text extraction model;

Determining the ratio of the number of second texts identical to any of the correct texts to the number of the plurality of correct texts as the extraction accuracy of the first text extraction model.
The electronic device of claim 14, wherein the processor is configured to execute the instructions to perform the steps of:

If the extraction accuracy of the second text extraction model obtained by the current training is lower than the preset threshold, continuing to acquire a training text set and performing the next training based on the acquired training text sets, until the extraction accuracy of the second text extraction model obtained by training is not lower than the preset threshold, the training text set comprising a plurality of second training corpora and a plurality of second target texts extracted from the plurality of second training corpora by the second text extraction model obtained by the current training.