CN110609886A - Text analysis method and device - Google Patents
- Publication number
- CN110609886A CN110609886A CN201910881838.9A CN201910881838A CN110609886A CN 110609886 A CN110609886 A CN 110609886A CN 201910881838 A CN201910881838 A CN 201910881838A CN 110609886 A CN110609886 A CN 110609886A
- Authority
- CN
- China
- Prior art keywords
- word
- sample
- text
- question
- answer
- Prior art date
- Legal status (an assumption, not a legal conclusion)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/332—Query formulation
- G06F16/3329—Natural language query formulation or dialogue systems
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/3331—Query processing
- G06F16/334—Query execution
- G06F16/3344—Query execution using natural language analysis
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- Data Mining & Analysis (AREA)
- Databases & Information Systems (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Artificial Intelligence (AREA)
- Computational Linguistics (AREA)
- Mathematical Physics (AREA)
- Human Computer Interaction (AREA)
- Machine Translation (AREA)
Abstract
The application provides a text analysis method and device. The text analysis method comprises the following steps: acquiring a text to be analyzed, a question to be answered and candidate answers; embedding the word units in the text to be analyzed, the question to be answered and the candidate answers to generate a first word vector corresponding to each word unit; performing semantic annotation processing on the word units in the text to be analyzed, the question to be answered and the candidate answers to generate a second word vector corresponding to each word unit; generating a third word vector corresponding to each word unit based on its first word vector and second word vector; and inputting the third word vectors into a text analysis model for processing, and determining the answer to the question to be answered from among the candidate answers. The text analysis method and device can effectively improve the depth, flexibility and diversity of the extraction of text and question information during text analysis, and improve the accuracy of the answer to the question to be answered.
Description
Technical Field
The present application relates to the field of computer technologies, and in particular, to a text analysis method and apparatus, a text analysis model training method and apparatus, a computing device, and a computer-readable storage medium.
Background
Natural Language Processing (NLP) is an important direction in the fields of computer science and artificial intelligence; it studies theories and methods that enable effective communication between humans and computers in natural language. Broadly, the application scenarios of natural language processing involve the intelligent processing of language and text, including reading comprehension, question-answering dialogue, writing, translation, and the like. These application scenarios can be further subdivided into tasks such as recognizing words from a sequence of characters, recognizing phrases from a sequence of words, recognizing the predicate and other constituents of a sentence, recognizing the mood of a sentence, abstracting a summary from an entire article, and finding answers to questions from an entire article, i.e., reading comprehension and question answering.
For reading-comprehension and question-answering tasks, a bidirectional attention neural network model (BERT, Bidirectional Encoder Representations from Transformers) is usually selected for processing.
However, the BERT model processes Chinese text in units of single characters and lacks fine-grained word-level features, which limits the model's capability to extract text information and thereby affects its processing effect.
Disclosure of Invention
In view of this, embodiments of the present application provide a text analysis method and apparatus, a text analysis model training method and apparatus, a computing device, and a computer-readable storage medium, so as to solve technical defects in the prior art.
The embodiment of the application discloses a text analysis method, which comprises the following steps:
acquiring a text to be analyzed, a question to be answered and a candidate answer;
embedding the word units in the text to be analyzed, the question to be answered and the candidate answer to generate a first word vector corresponding to the word units;
performing semantic annotation processing on word units in the text to be analyzed, the question to be answered and the candidate answer to generate a second word vector corresponding to the word units;
generating a third word vector corresponding to the word unit based on the first word vector and the second word vector corresponding to the word unit;
and inputting the third word vector into a text analysis model for processing, and determining the answer to the question to be answered from among the candidate answers.
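The five steps above can be sketched end to end as a toy pipeline. Everything below is an illustrative assumption, not the claimed model: the 2-d "embedding" is derived from character codes, and the semantic-label vectors are a placeholder one-hot scheme.

```python
# Illustrative sketch of the five claimed steps; the helpers and toy vectors are
# assumptions for demonstration, not the actual BERT-based model of the claims.
def embed(tokens):
    # Step 2 (first word vector): toy 2-d embedding derived from character codes.
    return [[(ord(t) % 7) / 7.0, (ord(t) % 5) / 5.0] for t in tokens]

def srl_vectors(tokens):
    # Step 3 (second word vector): toy one-hot vectors over a 3-tag label set.
    tagset = ["B-A0", "I-A0", "0"]
    return [[1.0 if j == i % 3 else 0.0 for j in range(len(tagset))]
            for i, _ in enumerate(tokens)]

def fuse(first, second):
    # Step 4 (third word vector): concatenate the first and second word vectors.
    return [a + b for a, b in zip(first, second)]

tokens = list("浙江将全面实施")        # word units of a text to be analyzed
third = fuse(embed(tokens), srl_vectors(tokens))
```

Each fused vector carries both the embedding and the semantic-role information, which is what step 5 feeds into the text analysis model.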
Further, after the obtaining of the text to be analyzed, the question to be answered and the candidate answer, the method further includes:
splicing the text to be analyzed and the candidate answers to generate a text answer set;
the embedding processing of the word units in the text to be analyzed, the question to be answered and the candidate answer to generate the first word vector corresponding to the word units includes:
embedding the word units in the text answer set and the question to be answered to generate a first word vector corresponding to the word units;
performing semantic annotation processing on word units in the text to be analyzed, the question to be answered and the candidate answer to generate a second word vector corresponding to the word units, including:
and performing semantic annotation processing on word units in the text answer set and the question to be answered to generate a second word vector corresponding to the word units.
Further, after the obtaining of the text to be analyzed, the question to be answered and the candidate answer, the method further includes:
splicing the question to be answered and the candidate answers to generate a question answer set;
the embedding processing of the word units in the text to be analyzed, the question to be answered and the candidate answer to generate the first word vector corresponding to the word units includes:
embedding the word units in the question answer set and the text to be analyzed to generate a first word vector corresponding to the word units;
performing semantic annotation processing on word units in the text to be analyzed, the question to be answered and the candidate answer to generate a second word vector corresponding to the word units, including:
and performing semantic annotation processing on word units in the question answer set and the text to be analyzed to generate a second word vector corresponding to the word units.
Further, the semantic labeling processing is performed on word units in the text to be analyzed, the question to be answered, and the candidate answer, and a second word vector corresponding to the word units is generated, including:
semantically labeling the text to be analyzed, the question to be answered and the candidate answer to generate a semantic label corresponding to the word unit;
and generating a second word vector corresponding to the word unit based on the semantic label.
Further, the generating a second word vector corresponding to the word unit based on the semantic label includes:
and embedding the semantic tags to generate tag vectors, and taking the tag vectors as second word vectors of the word units.
Further, the generating a third word vector corresponding to the word unit based on the first word vector and the second word vector corresponding to the word unit includes:
and splicing the first word vector and the second word vector of the word unit in the text to be analyzed, the question to be answered and the candidate answer to generate a third word vector corresponding to the word unit.
Further, the inputting the third word vector into a text analysis model for processing, and determining an answer to the question to be answered from the candidate answers, includes:
inputting the third word vector into a text analysis model for feature extraction to generate a feature vector;
carrying out linear mapping and nonlinear transformation processing on the feature vectors in sequence to obtain the probability that the candidate answers are used as answers of the questions to be answered;
determining an answer based on a probability that the candidate answer is the answer to the question to be answered.
The application also provides a training method of the text analysis model, which comprises the following steps:
acquiring a training sample and a sample label, wherein the training sample comprises a sample text, a sample question and a sample candidate answer, and the sample label comprises a correct answer corresponding to the sample text and the sample question;
embedding the word units in the sample text, the sample question and the sample candidate answer to generate a first sample word vector corresponding to the word units;
performing semantic annotation processing on word units in the sample text, the sample question and the sample candidate answer to generate a second sample word vector corresponding to the word units;
generating a third sample word vector corresponding to the word unit based on the first sample word vector and the second sample word vector corresponding to the word unit;
inputting the third sample word vector into a text analysis model for processing, and determining a predicted answer of the sample question;
and comparing the predicted answer with the correct answer, and updating the text analysis model based on the comparison result of the predicted answer and the correct answer.
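The compare-and-update step can be illustrated with a minimal stand-in scorer; the dot-product model and the perceptron-style update below are assumptions for demonstration only, not the gradient-based update a real BERT model would use.

```python
# Hypothetical sketch of "compare the predicted answer with the correct answer,
# and update the model based on the comparison result".
def train_step(model_weights, sample_vectors, correct_idx, lr=0.1):
    # Forward pass: score each candidate answer (toy dot product with weights).
    scores = [sum(w * x for w, x in zip(model_weights, vec))
              for vec in sample_vectors]
    predicted_idx = max(range(len(scores)), key=scores.__getitem__)
    # Compare prediction with the sample label; update only on a mismatch.
    if predicted_idx != correct_idx:
        for i, x in enumerate(sample_vectors[correct_idx]):
            model_weights[i] += lr * x      # pull weights toward the correct answer
        for i, x in enumerate(sample_vectors[predicted_idx]):
            model_weights[i] -= lr * x      # push weights away from the wrong one
    return predicted_idx
```

After a mismatch triggers an update, the same sample is scored correctly on the next pass, which is the convergence behavior the training loop relies on.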
The present application further provides a text analysis apparatus, including:
the system comprises an acquisition module, a processing module and a processing module, wherein the acquisition module is configured to acquire a text to be analyzed, a question to be answered and a candidate answer;
the embedding module is configured to embed word units in the text to be analyzed, the question to be answered and the candidate answer to generate a first word vector corresponding to the word units;
the labeling module is configured to perform semantic labeling processing on word units in the text to be analyzed, the question to be answered and the candidate answer, and generate a second word vector corresponding to the word units;
a generating module configured to generate a third word vector corresponding to the word unit based on the first and second word vectors corresponding to the word unit;
a determining module configured to input the third word vector into a text analysis model for processing, and determine an answer of the question to be answered from the candidate answers.
The present application further provides a training device for a text analysis model, including:
a sample acquisition module configured to acquire a training sample including a sample text, a sample question, and a sample candidate answer, and a sample label including a correct answer corresponding to the sample text and the sample question;
the sample embedding module is configured to perform embedding processing on word units in the sample text, the sample question and the sample candidate answer, and generate a first sample word vector corresponding to the word units;
the sample labeling module is configured to perform semantic labeling processing on word units in the sample text, the sample question and the sample candidate answer, and generate a second sample word vector corresponding to the word units;
a sample generation module configured to generate a third sample word vector corresponding to the word unit based on the first and second sample word vectors corresponding to the word unit;
a sample determining module configured to input the third sample word vector into a text analysis model for processing, and determine a predicted answer of the sample question;
and the model updating module is configured to compare the predicted answer with the correct answer and update the text analysis model based on the comparison result of the predicted answer and the correct answer.
The present application further provides a computing device, comprising a memory, a processor, and computer instructions stored on the memory and executable on the processor, wherein the processor implements the steps of the text analysis method or the training method of the text analysis model when executing the instructions.
The present application also provides a computer readable storage medium storing computer instructions which, when executed by a processor, implement the steps of the above text analysis method or the training method of the text analysis model.
According to the text analysis method and device, the text to be analyzed, the question to be answered and the candidate answer are respectively subjected to embedding processing and semantic labeling processing, semantic role labeling information of fine granularity levels is integrated into a word vector, the comprehension degree of a text analysis model to the text and the question is deepened, the depth, flexibility and diversity of extraction of the text and the question information in the text analysis process are effectively improved, and the accuracy of the answer to the question is improved.
According to the training method and device for the text analysis model, the sample text, the sample question and the sample candidate answer are respectively subjected to embedding processing and semantic labeling processing, semantic role labeling information of fine granularity levels is integrated into a word vector, the information extraction capability of the text analysis model in reading understanding is enhanced, and the training effect of the model is effectively improved.
Drawings
FIG. 1 is a schematic block diagram of a computing device according to an embodiment of the present application;
FIG. 2 is a flow chart illustrating steps of a text analysis method according to an embodiment of the present application;
FIG. 3 is a flowchart illustrating steps of a text analysis method according to an embodiment of the present application;
FIG. 4 is a flowchart illustrating steps of a text analysis method according to an embodiment of the present application;
FIG. 5 is a flowchart illustrating steps of a method for training a text analysis model according to an embodiment of the present application;
fig. 6 is a schematic structural diagram of a text analysis device according to an embodiment of the present application;
fig. 7 is a schematic structural diagram of a training apparatus for a text analysis model according to an embodiment of the present application.
Detailed Description
In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present application. The present application can be implemented in many ways other than those described herein, and those skilled in the art can make similar modifications without departing from the spirit of the application; the application is therefore not limited to the specific implementations disclosed below.
The terminology used in the description of the one or more embodiments is for the purpose of describing the particular embodiments only and is not intended to be limiting of the description of the one or more embodiments. As used in one or more embodiments of the present specification and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used in one or more embodiments of the present specification refers to and encompasses any and all possible combinations of one or more of the associated listed items.
It will be understood that, although the terms first, second, etc. may be used herein in one or more embodiments to describe various information, the information should not be limited by these terms. These terms are only used to distinguish one type of information from another. For example, a first may also be referred to as a second and, similarly, a second may also be referred to as a first without departing from the scope of one or more embodiments of the present description. The word "if" as used herein may be interpreted as "when", "upon", or "in response to determining", depending on the context.
First, the terms used in one or more embodiments of the present application are explained.
Word unit (token): before any actual processing, the input text needs to be segmented into language units such as words, punctuation marks, numbers or letters, which are called word units. For an English text, a word unit may be a word, a punctuation mark, a number, etc.; for a Chinese text, the smallest word unit may be a single character, a punctuation mark, a number, etc.
Word embedding: mapping words from a high-dimensional space, whose dimension is the size of the vocabulary, into a continuous vector space of much lower dimension, so that each word or phrase is mapped to a vector over the real numbers.
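A word-embedding lookup can be sketched as a simple table; the table entries, the 3-dimensional vectors, and the zero-vector fallback below are all illustrative assumptions.

```python
# Toy word-embedding lookup: each word maps to a low-dimensional real vector.
# The table contents and the dimension (3) are assumptions for illustration.
embedding_table = {
    "apple": [0.12, -0.40, 0.93],
    "table": [0.55, 0.08, -0.21],
}
UNK = [0.0, 0.0, 0.0]  # fallback vector for out-of-vocabulary words

def embed_word(word):
    return embedding_table.get(word, UNK)

sentence = ["apple", "on", "table"]
vectors = [embed_word(w) for w in sentence]
```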
Semantic Role Labeling (SRL) is a form of shallow semantic analysis that aims to identify the semantic role components associated with the predicate of a sentence, including the agent, the patient, time, place, and the like.
One-hot encoding is an embedding method in which a language symbol is represented by a binary vector whose length equals the number of symbols in the dictionary. In the vector, only the position corresponding to the symbol has the value 1; all other positions are 0.
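The one-hot definition above can be written directly; the example dictionary of semantic labels is an assumption chosen to match the B-I-E scheme used later in this document.

```python
# One-hot embedding as defined above: vector length equals dictionary size,
# a single 1 at the symbol's position, 0 everywhere else.
def one_hot(symbol, dictionary):
    vec = [0] * len(dictionary)
    vec[dictionary.index(symbol)] = 1
    return vec

# Illustrative dictionary of semantic labels ("0" = no semantic role).
dictionary = ["B-A0", "I-A0", "E-A0", "ADV", "A1", "0"]
vec = one_hot("ADV", dictionary)
```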
BERT model: a bidirectional attention neural network model (Bidirectional Encoder Representations from Transformers). The BERT model can predict the current word from its left and right context, and the next sentence from the current sentence. The BERT model aims to obtain a semantic representation of text containing rich semantic information by training on a large-scale unlabeled corpus; the representation is then fine-tuned for a specific NLP task and finally applied to that task.
Linear mapping: a mapping from one vector space V to another vector space W that preserves vector addition and scalar multiplication; a linear transformation is a linear mapping of a linear space V to itself.
Nonlinear transformation: the original features are transformed nonlinearly to obtain new features; performing linear classification on the new features is equivalent to performing nonlinear classification in the original feature space.
Normalized exponential function (softmax function): it can "compress" a K-dimensional vector containing arbitrary real numbers into another K-dimensional real number vector such that each element ranges between (0, 1) and the sum of all elements is 1, which is often used to solve the multi-classification problem.
In the present application, a text analysis method and apparatus, a text analysis model training method and apparatus, a computing device, and a computer-readable storage medium are provided, which are described in detail in the following embodiments one by one.
Fig. 1 is a block diagram illustrating a configuration of a computing device 100 according to an embodiment of the present specification. The components of the computing device 100 include, but are not limited to, memory 110 and processor 120. The processor 120 is coupled to the memory 110 via a bus 130 and a database 150 is used to store data.
Computing device 100 also includes access device 140, which enables computing device 100 to communicate via one or more networks 160. Examples of such networks include the Public Switched Telephone Network (PSTN), a Local Area Network (LAN), a Wide Area Network (WAN), a Personal Area Network (PAN), or a combination of communication networks such as the Internet. Access device 140 may include one or more of any type of network interface (e.g., a Network Interface Card (NIC)), whether wired or wireless, such as an IEEE 802.11 Wireless Local Area Network (WLAN) interface, a Worldwide Interoperability for Microwave Access (WiMAX) interface, an Ethernet interface, a Universal Serial Bus (USB) interface, a cellular network interface, a Bluetooth interface, a Near Field Communication (NFC) interface, and so forth.
In one embodiment of the present description, the above-described components of computing device 100 and other components not shown in FIG. 1 may also be connected to each other, such as by a bus. It should be understood that the block diagram of the computing device architecture shown in FIG. 1 is for purposes of example only and is not limiting as to the scope of the description. Those skilled in the art may add or replace other components as desired.
Computing device 100 may be any type of stationary or mobile computing device, including a mobile computer or mobile computing device (e.g., tablet, personal digital assistant, laptop, notebook, netbook, etc.), a mobile phone (e.g., smartphone), a wearable computing device (e.g., smartwatch, smartglasses, etc.), or other type of mobile device, or a stationary computing device such as a desktop computer or PC. Computing device 100 may also be a mobile or stationary server.
Wherein the processor 120 may perform the steps of the method shown in fig. 2. Fig. 2 is a schematic flowchart illustrating a text analysis method according to an embodiment of the present application, including step S210 to step S250.
Step S210, obtaining a text to be analyzed, a question to be answered and a candidate answer.
Specifically, the text to be analyzed is a Chinese text, which may be a sentence, a passage, multiple passages, an entire article, and the like. The question to be answered is a question related to the content of the text to be analyzed. The candidate answers include the three options "positive answer", "negative answer", and "indeterminable", and may be any sentences capable of expressing these three meanings; this application is not limited in this respect.
For example, assume that the text to be analyzed includes "There is an apple on the table". If the question to be answered is "Is there an apple on the table?", the candidate answers may be "there is", "there is not", and "cannot be determined"; if the question to be answered is "What is on the table?", the candidate answers may be "there is an apple on the table", "there is nothing on the table", and "cannot be determined".
In practical application, after the text to be analyzed, the question to be answered and the candidate answer are obtained, the text to be analyzed, the question to be answered and the candidate answer are spliced, specifically, the candidate answer can be spliced to the text to be analyzed or spliced to the question to be answered, and then the following steps are continuously executed.
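The two splicing orders described above can be sketched as plain string concatenation; the `[SEP]` separator token is an assumption borrowed from common BERT input conventions, not something the text specifies.

```python
# Sketch of the two splicing variants: the candidate answer may be appended
# either to the text to be analyzed or to the question to be answered.
SEP = "[SEP]"  # assumed separator token between spliced segments

def splice_answer_to_text(text, answer):
    # Variant 1: yields the "text answer set" of the claims.
    return text + SEP + answer

def splice_answer_to_question(question, answer):
    # Variant 2: yields the "question answer set" of the claims.
    return question + SEP + answer

pair_a = splice_answer_to_text("There is an apple on the table.", "there is")
pair_b = splice_answer_to_question("Is there an apple on the table?", "there is")
```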
Step S220, carrying out embedding processing on word units in the text to be analyzed, the question to be answered and the candidate answer to generate a first word vector corresponding to the word units.
Specifically, word segmentation processing is carried out on a text to be analyzed, a question to be answered and a candidate answer in a word unit mode, a plurality of word units in the text to be analyzed, the question to be answered and the candidate answer are obtained, each word unit corresponds to one word or one punctuation mark, embedding processing is carried out on the word units, and a first word vector of each word unit is generated.
The embedding processing is carried out on the text to be analyzed, the question to be answered and the candidate answer, so that the information extraction depth and the information extraction richness of the text to be analyzed, the question to be answered and the candidate answer can be improved, and the analysis accuracy is enhanced.
Step S230, performing semantic annotation processing on word units in the text to be analyzed, the question to be answered, and the candidate answer, and generating a second word vector corresponding to the word units.
Furthermore, semantic annotation can be performed on the text to be analyzed, the question to be answered and the candidate answer, so as to generate a semantic label corresponding to the word unit; and generating a second word vector corresponding to the word unit based on the semantic label.
Specifically, the semantic label is a semantic role label (SRL). A Chinese natural language processing toolkit such as pyltp may be used to perform semantic role labeling on the text to be analyzed, the question to be answered, and the candidate answer respectively, generating a semantic role set for each sentence; the semantic role sets of the sentences in the text to be analyzed or the question to be answered are then connected to obtain the semantic role labeling of the whole text or question.
According to the semantic role set of each sentence in the text to be analyzed or the question to be answered, the semantic label of each word unit, i.e., each character and each punctuation mark, is marked in the B-I-E manner, and word units without a semantic role take "0" as their semantic label.
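The B-I-E expansion of a role label over a character span can be sketched as follows; the handling of single-character spans (plain "B-" tag) is an assumption, since the text does not spell out that case.

```python
# Expand semantic role spans into per-character B-I-E tags; characters outside
# any span get "0", matching the scheme described above.
def bie_tags(n_chars, spans):
    """spans: list of (start, end, role) with inclusive character indices."""
    tags = ["0"] * n_chars
    for start, end, role in spans:
        if start == end:
            tags[start] = "B-" + role      # single-character span (assumed convention)
        else:
            tags[start] = "B-" + role
            for i in range(start + 1, end):
                tags[i] = "I-" + role
            tags[end] = "E-" + role
    return tags

# "浙江" occupies character positions 0-1 with role A0; position 2 has no role.
tags = bie_tags(3, [(0, 1, "A0")])
```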
For example, assume that the content of the text to be analyzed includes: "Zhejiang will implement the examination reform examination plan completely", and the position marks of each word in the sentence are shown in table 1.
TABLE 1
The generated semantic role set is "3 A0(0,0) ADV(1,1) ADV(2,2) A1(4,7)", which constitutes the semantic role labels of the sentence, where the leading number indicates the position of the predicate, A0 denotes the agent, ADV denotes the default annotation type, and A1 denotes the patient.
Based on the semantic role set, the semantic role label of the word "Zhejiang" is A0, the labels of the words "will" and "completely" are ADV, and the labels of the words "examination", "reform", "examination" and "plan" are A1.
And marking each word unit by using a marking mode of B-I-E to obtain a word unit-semantic label corresponding table shown in the table 2.
TABLE 2
When a sentence contains multiple predicates and therefore multiple semantic role sets are generated, the semantic role set that leaves the fewest words without semantic role labels is selected as the semantic role labeling of the sentence.
For example, assume that the text to be analyzed includes "Beginning next year, Zhejiang will fully implement the examination reform trial scheme; foreign language and elective subjects are examined twice a year, and examinees determine their elective subjects autonomously." This sentence contains 3 verb predicates in total: "begin", "implement", and "determine". Semantically labeling the text to be analyzed generates three semantic role sets: semantic role set A "1 TMP(0,0)" from the verb "begin", semantic role set B "6 TMP(0,2) A0(3,3) ADV(4,4) ADV(5,5) A1(7,10)" from the verb "implement", and semantic role set C "23 A0(21,21) ADV(22,22) A1(24,26)" from the verb "determine". Here the numbers denote word positions in the sentence (starting from 0), TMP denotes time, A0 denotes the agent, ADV denotes the default annotation type, and A1 denotes the patient.
Among the above three semantic role sets, semantic role set B labels the most words, i.e., it leaves the fewest words in the sentence without semantic role labels; therefore semantic role set B is taken as the optimal semantic role set and used as the semantic role labeling of the sentence, with "0" as the label of words that have no semantic role.
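The selection rule above reduces to picking the role set that covers the most word positions. A minimal sketch, with each role set modeled as a list of inclusive (start, end) spans mirroring the sets A, B, and C from the example:

```python
# Choose, among multiple semantic role sets, the one leaving the fewest word
# units without a semantic role label (i.e., covering the most positions).
def labeled_count(role_set):
    return sum(end - start + 1 for start, end in role_set)

def best_role_set(role_sets):
    return max(role_sets, key=labeled_count)

set_a = [(0, 0)]                                    # from predicate "begin"
set_b = [(0, 2), (3, 3), (4, 4), (5, 5), (7, 10)]   # from predicate "implement"
set_c = [(21, 21), (22, 22), (24, 26)]              # from predicate "determine"
best = best_role_set([set_a, set_b, set_c])
```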
Further, the semantic tag may be embedded to generate a tag vector, and the tag vector may be used as a second word vector of the word unit.
Specifically, the embedding is performed in a one-hot manner. Taking the text to be analyzed, "Zhejiang will completely implement the examination reform examination plan", as an example, the semantic label of the word unit "Zhe" is "B-A0"; one-hot embedding of the label "B-A0" generates a tag vector, which is used as the second word vector of the word unit "Zhe". The other word units are handled analogously and are not described in detail here.
Semantic annotation processing is carried out on the text to be analyzed, the question to be answered and the candidate answer, so that semantic role information of each word unit in the sentence can be obtained, the model is favorable for deepening the understanding degree of the sentence, and the accuracy of reading and understanding the question and answer is improved.
Step S240, generating a third word vector corresponding to the word unit based on the first word vector and the second word vector corresponding to the word unit.
Further, the first word vector and the second word vector of the word unit in the text to be analyzed, the question to be answered and the candidate answer may be spliced to generate a third word vector corresponding to the word unit.
Taking the text to be analyzed as an example, assume that the first word vector of the word unit "Zhe" is a1 and its second word vector is a2; the third word vector a3 is then the concatenation of a1 and a2, i.e., a3 = [a1; a2].
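"Splicing" here means concatenation, not element-wise addition; the vector values below are illustrative assumptions.

```python
# The third word vector is the concatenation of the first and second vectors:
# it contains all components of a1 followed by all components of a2.
a1 = [0.3, -0.7, 0.1]        # illustrative first word vector (embedding)
a2 = [0, 1, 0, 0]            # illustrative second word vector (one-hot SRL tag)
a3 = a1 + a2                 # Python list '+' concatenates, giving [a1; a2]
```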
The semantic role information of each word unit in the text to be analyzed, the question to be answered and the candidate answer is fused with the word vector of each word unit, so that the extraction depth and richness of the text and the question information can be further improved, the semantic understanding depth can be improved by combining the semantic role information, and the accuracy of reading and understanding the question and answer can be improved.
Step S250, inputting the third word vector into a text analysis model for processing, and determining an answer to the question to be answered from the candidate answers.
Further, the third word vector may be input into a text analysis model for feature extraction, so as to generate a feature vector; carrying out linear mapping and nonlinear transformation processing on the feature vectors in sequence to obtain the probability that the candidate answers are used as answers of the questions to be answered; and determining an answer based on the probability that the candidate answer is the answer to the question to be answered.
Specifically, the text analysis model is a BERT model. When the third word vectors of the word units in the text to be analyzed, the question to be answered and the candidate answer are input into the text analysis model, they may be grouped into two word vector sequences in either of two ways: the third word vectors of the text to be analyzed may form one sequence while the third word vectors of the question to be answered and the candidate answer form the other sequence, or the third word vectors of the question to be answered may form one sequence while the third word vectors of the text to be analyzed and the candidate answer form the other sequence. The present application does not limit the grouping.
Among the candidate answers, "positive answer" may be encoded as "0", "negative answer" as "1", and "indeterminable" as "2". The feature vector output by the BERT model may be represented as a 1 × n matrix, and the weight matrix (parameter) in the BERT model may be represented as an n × 3 matrix, where "3" corresponds to the 3 candidate answer options. After this linear mapping, the resulting vector z is normalized by the softmax function, whose expression is as follows:

σ(z)_j = e^{z_j} / Σ_k e^{z_k}

where z_j is the j-th element of the vector before conversion, e is the exponential function, the denominator is the sum of the exponentials of all elements of the vector before conversion, and σ(z)_j is the result of the nonlinear conversion. The answer at the position corresponding to the maximum probability value in the output result is taken as the correct answer.
For example, the input vector [1, 2, 3, 4, 1, 2, 3] corresponds to the softmax output [0.024, 0.064, 0.175, 0.475, 0.024, 0.064, 0.175]. The element with the greatest weight in the output vector corresponds to the maximum value "4" in the input vector, so the answer at the position corresponding to "4" is taken as the correct answer.
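The linear mapping and softmax step can be sketched as follows; the weight values are hypothetical, and the softmax values reproduce the worked example above:

```python
import math

def softmax(z):
    """Normalize a vector of logits into probabilities."""
    exps = [math.exp(v) for v in z]
    total = sum(exps)
    return [v / total for v in exps]

def linear_map(features, weights):
    """Multiply a 1 x n feature row by an n x m weight matrix -> 1 x m logits."""
    n, m = len(weights), len(weights[0])
    return [sum(features[i] * weights[i][j] for i in range(n)) for j in range(m)]

probs = softmax([1, 2, 3, 4, 1, 2, 3])
print([round(p, 3) for p in probs])
# [0.024, 0.064, 0.175, 0.475, 0.024, 0.064, 0.175]
best = max(range(len(probs)), key=probs.__getitem__)
print(best)  # 3 -- the position of the maximum input value "4"
```

The position of the largest probability is then read off as the selected answer option, matching the argmax rule described in the text.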
According to the text analysis method provided by the embodiment, the text to be analyzed, the question to be answered and the candidate answer are subjected to semantic annotation processing, and fine-grained semantic annotation information is blended into the word vectors of the text to be analyzed, the question to be answered and the candidate answer, so that the information extraction capability and depth of a text analysis model in the reading and understanding process can be effectively improved, and the accuracy of the answer is improved.
According to the text analysis method provided by the embodiment, the candidate answers, the text to be analyzed and the question to be answered are input into the text analysis model together for processing, so that the accuracy of the text analysis model can be effectively improved.
As shown in fig. 3, a text analysis method includes steps S310 to S360.
And step S310, obtaining a text to be analyzed, a question to be answered and a candidate answer.
And S320, splicing the text to be analyzed and the candidate answers to generate a text answer set.
In practical application, after the candidate answers are spliced to the text to be analyzed, a text answer set is generated.
For example, assume that the text to be analyzed is "There is an apple on the table", the question to be answered is "Is there a grape on the table?", and the candidate answers include "There is", "There is not" and "Cannot be determined". After the candidate answers are spliced onto the text to be analyzed, the text answer set "There is an apple on the table. There is. There is not. Cannot be determined." is generated.
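The splicing of candidate answers onto the text can be sketched as simple string concatenation; the sentences below are illustrative placeholders:

```python
def build_text_answer_set(text, candidate_answers):
    """Append each candidate answer sentence to the text to be analyzed."""
    return text + " " + " ".join(candidate_answers)

text = "There is an apple on the table."
candidates = ["There is.", "There is not.", "Cannot be determined."]
print(build_text_answer_set(text, candidates))
# There is an apple on the table. There is. There is not. Cannot be determined.
```

The same helper applies to the variant in the next embodiment, where the candidate answers are spliced onto the question to be answered instead of the text.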
And step S330, carrying out embedding processing on the word units in the text answer set and the question to be answered, and generating a first word vector corresponding to the word units.
Step S340, carrying out semantic annotation processing on word units in the text answer set and the question to be answered, and generating a second word vector corresponding to the word units.
Step S350, generating a third word vector corresponding to the word unit based on the first word vector and the second word vector corresponding to the word unit.
Step S360, inputting the third word vector into a text analysis model for processing, and determining answers of the questions to be answered in the candidate answers.
The specific implementation process of the above steps can be referred to the above embodiments, and is not described herein again.
According to the text analysis method, the candidate answers and the text to be analyzed are spliced together, and then the subsequent embedding processing, semantic labeling processing and text analysis model processing are performed, so that the information extraction depth and the extraction richness of the text to be analyzed and the question to be answered in the reading and understanding process can be improved, and the accuracy of reading and understanding the question and answer is improved.
As shown in fig. 4, a text analysis method includes steps S410 to S460.
And step S410, obtaining a text to be analyzed, a question to be answered and a candidate answer.
And step S420, splicing the question to be answered and the candidate answer to generate a question answer set.
In practical application, after the candidate answers are spliced to the questions to be answered, a question answer set is generated.
For example, suppose the text to be analyzed is "There is an apple on the table", the question to be answered is "Is there a grape on the table?", and the candidate answers include "There is", "There is not" and "Cannot be determined". After the candidate answers are spliced onto the question to be answered, the question answer set "Is there a grape on the table? There is. There is not. Cannot be determined." is generated.
And step S430, embedding the word units in the question answer set and the text to be analyzed to generate a first word vector corresponding to the word units.
Step S440, performing semantic annotation processing on word units in the question answer set and the text to be analyzed, and generating second word vectors corresponding to the word units.
Step S450, generating a third word vector corresponding to the word unit based on the first word vector and the second word vector corresponding to the word unit.
Step S460, inputting the third word vector into a text analysis model for processing, and determining an answer to the question to be answered from the candidate answers.
The specific implementation process of the above steps can be referred to the above embodiments, and is not described herein again.
According to the text analysis method, the candidate answers and the questions to be answered are spliced together, and then subsequent embedding processing, semantic annotation processing and text analysis model processing are performed, so that the information extraction depth and the extraction richness of the texts to be analyzed and the questions to be answered in the reading and understanding process can be improved, and the accuracy of reading and understanding the questions and answers is improved.
As shown in fig. 5, a method for training a text analysis model includes steps S510 to S560.
Step S510, a training sample and a sample label are obtained, wherein the training sample comprises a sample text, a sample question and a sample candidate answer, and the sample label comprises a correct answer corresponding to the sample text and the sample question.
Specifically, the sample text is a chinese text, which may be a sentence, a word, multiple words, an article, and the like, the sample question is a question related to the content of the sample text, the sample candidate answers include three options of "positive answer", "negative answer", and "indeterminable", and may be various sentences capable of expressing the meanings of "positive answer", "negative answer", and "indeterminable", which is not limited in this application.
And step S520, carrying out embedding processing on the word units in the sample text, the sample question and the sample candidate answer to generate a first sample word vector corresponding to the word units.
Specifically, word segmentation is performed on the sample text, the sample question and the sample candidate answer in units of characters to obtain a plurality of word units, each corresponding to one character or one punctuation mark; embedding processing is then performed on the word units to generate a first sample word vector for each word unit.
The sample texts, the sample questions and the sample candidate answers are subjected to embedding processing, so that the information extraction depth and richness of the sample texts, the sample questions and the sample candidate answers can be improved, and the analysis accuracy is enhanced.
Step S530, performing semantic annotation processing on word units in the sample text, the sample question and the sample candidate answer to generate a second sample word vector corresponding to the word units.
Further, semantic annotation may be performed on the sample text, the sample question and the sample candidate answer to generate a semantic label corresponding to each word unit, and a second sample word vector corresponding to the word unit is then generated based on the semantic label.
Specifically, the semantic label is a semantic role label (SRL). Semantic role labeling may be performed on the sample text, the sample question and the sample candidate answer respectively by using a Chinese natural language processing toolkit (pyltp) to generate a semantic role label for each sentence, and the sentence-level semantic role labels are then connected to obtain the semantic role labels of the sample text, the sample question and the sample candidate answer.
According to the semantic role label of each sentence, the semantic label of each word unit, that is, each character and each punctuation mark, is marked in the B-I-E tagging scheme, where a word unit without a semantic role takes "0" as its semantic label.
Semantic annotation processing is carried out on the sample text, the sample question and the sample candidate answers, semantic role information of each word unit in the sentence can be obtained, the model is favorable for deepening the understanding degree of the sentence, and the accuracy of reading and understanding the question and answer is improved.
Step S540, generating a third sample word vector corresponding to the word unit based on the first sample word vector and the second sample word vector corresponding to the word unit.
Further, the first sample word vector and the second sample word vector of the word unit in the sample text, the sample question and the sample candidate answer may be spliced to generate a third sample word vector corresponding to the word unit.
The semantic role information of each word unit in the sample text, the sample question and the sample candidate answer is fused with the word vector of each word unit, so that the extraction depth and richness of the text and question information can be further improved, the semantic understanding depth can be improved by combining the semantic role information, and the accuracy of reading and understanding the question and answer can be improved.
And step S550, inputting the third sample word vector into a text analysis model for processing, and determining a predicted answer of the sample question.
Further, the third sample word vector may be input into a text analysis model for feature extraction, so as to generate a feature vector; carrying out linear mapping and nonlinear transformation processing on the feature vectors in sequence to obtain the probability that the sample candidate answers are used as answers of the sample questions; and determining a predicted answer based on a probability that the sample candidate answer is an answer to the sample question.
Specifically, the text analysis model is a BERT model, the sample text, the sample questions and the sample candidate answers are processed by the BERT model, the mutual dependency relationship between the text and the questions can be fully extracted, and the accuracy of reading and understanding the questions and answers is high.
And step S560, comparing the predicted answer with the correct answer, and updating the text analysis model based on the comparison result of the predicted answer and the correct answer.
Further, comparing the predicted answer with the correct answer, if the predicted answer is inconsistent with the correct answer, adjusting parameters of the text analysis model, updating the text analysis model, and continuing iterative training; if the predicted answer is consistent with the correct answer, the training is finished.
The present embodiment will be further described with reference to specific examples.
Assume that the sample text includes "During a meeting, a school leader proposed to support students in actively participating in extracurricular activities", the sample question is "Does the school support students in participating in extracurricular activities?", the sample candidate answers include "Support", "Do not support" and "Cannot be determined", and the correct answer is "Support".
Word segmentation is performed on the sample text, the spliced sample question and the sample candidate answers to obtain a plurality of word units. Taking the sample candidate answer "cannot be determined" (无法确定) as an example, word segmentation yields four character units, one for each of the characters "无", "法", "确" and "定"; the other texts can be processed by analogy and are not described again here.
Embedding processing is performed on each word unit in the sample text, the spliced sample question and the sample candidate answers to generate a first sample word vector corresponding to each word unit. Taking the sample text as an example, the first sample word vectors of its word units are m1 to m21.
Semantic annotation is performed on the sample text, the spliced sample question and the sample candidate answers. Taking the sample text as an example, semantic annotation generates three semantic role sets: a semantic role set D "4 A0(0,0) TMP(1,3)" generated based on the verb "propose", a semantic role set E "6 A1(7,7)" generated based on the verb "support", and a semantic role set F "9 A0(7,7) ADV(8,8) A1(10,11)" generated based on the verb "participate". Among the three sets, the semantic role sets D and F label the most words, that is, leave the fewest words in the sentence without a semantic role label, while the semantic role set E labels the fewest words, that is, leaves the most words unlabeled. Therefore, one semantic role set is selected from D and F as the optimal semantic role set, that is, the semantic role label of the sentence; here the semantic role set D is taken as the optimal set by way of example. In each set, the leading number indicates the position of the corresponding verb in the sentence (counting from 0), each pair of numbers in parentheses indicates the start and end positions of a role span, A0 denotes the agent, A1 denotes the patient, TMP denotes time, and ADV denotes an adverbial.
Semantic role labeling is performed on each word unit in the B-I-E tagging scheme based on the semantic role label of each sentence, generating a semantic label corresponding to each word unit, where word units without a semantic role take "0" as the semantic label. Taking the sample text as an example, tagging each word unit in the B-I-E scheme yields the semantic tag correspondence shown in Table 3; the other cases can be deduced by analogy and are not described again here.
TABLE 3
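The B-I-E tagging step can be sketched as expanding each role span into per-unit tags. The role names and span positions below reuse the semantic role set D from the example; the helper name and the single-unit convention (tagging a one-position span with B-) are illustrative assumptions:

```python
# Sketch of B-I-E tagging: each semantic role span (start, end) is expanded
# into B-<role> at its first unit, I-<role> in the middle, and E-<role> at
# its last unit; units covered by no span get the label "0".

def bie_tags(num_units, role_spans):
    tags = ["0"] * num_units
    for role, (start, end) in role_spans.items():
        if start == end:
            tags[start] = "B-" + role  # single-unit span
        else:
            tags[start] = "B-" + role
            for i in range(start + 1, end):
                tags[i] = "I-" + role
            tags[end] = "E-" + role
    return tags

# Semantic role set D from the example: A0 at position 0, TMP spanning 1..3
print(bie_tags(6, {"A0": (0, 0), "TMP": (1, 3)}))
# ['B-A0', 'B-TMP', 'I-TMP', 'E-TMP', '0', '0']
```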
One-hot embedding processing is performed on the semantic label of each word unit to generate a label vector for the semantic label, and the label vector is taken as the second sample word vector. Taking the sample text as an example, the second sample word vectors of its word units are n1 to n21.
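The one-hot embedding of semantic labels can be sketched as follows; the label vocabulary below is a hypothetical example, whereas in practice it would be collected from all labels appearing in the corpus:

```python
# One-hot encode a semantic label against a fixed label vocabulary,
# producing the label vector used as the second sample word vector.

def one_hot(label, vocab):
    vec = [0.0] * len(vocab)
    vec[vocab.index(label)] = 1.0
    return vec

vocab = ["0", "B-A0", "B-TMP", "I-TMP", "E-TMP"]
print(one_hot("B-TMP", vocab))  # [0.0, 0.0, 1.0, 0.0, 0.0]
print(one_hot("0", vocab))      # [1.0, 0.0, 0.0, 0.0, 0.0]
```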
The first sample word vector and the second sample word vector of each word unit are spliced to generate a third sample word vector corresponding to each word unit. Taking the sample text as an example, assuming the third sample word vectors of its word units are p1 to p21, p1 is the concatenation of m1 and n1, p2 is the concatenation of m2 and n2, and so on; this is not described again here.
The third sample word vectors of the word units of the sample text are input into the text analysis model as one word vector sequence, and, based on the splicing of the sample question and the sample candidate answers in the above steps, the third sample word vectors of the word units of the spliced sample question and sample candidate answers are input into the text analysis model as another word vector sequence. The model generates the probability of each sample candidate answer being the answer to the sample question; assuming the probability of "support" is 0.55, the probability of "do not support" is 0.25, and the probability of "cannot be determined" is 0.20, the sample candidate answer "support" is determined to be the predicted answer to the sample question.
And comparing the predicted answer of the sample question with the correct answer, performing iterative training on the text analysis model based on the comparison result, and finishing the training when the accuracy of the text analysis model reaches a target threshold value. The target threshold of the accuracy may be determined according to specific situations, and the present application is not limited thereto.
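The stopping criterion above can be sketched as a simple accuracy check; the threshold value 0.79 and the answer strings are illustrative placeholders, since the patent states that the target threshold is chosen case by case:

```python
# Training continues until the model's accuracy over an evaluation batch
# reaches the target threshold (the value here is an assumed placeholder).

def training_finished(predictions, correct_answers, target_accuracy=0.79):
    matches = sum(1 for p, c in zip(predictions, correct_answers) if p == c)
    accuracy = matches / len(correct_answers)
    return accuracy >= target_accuracy

preds = ["support", "support", "do not support", "cannot be determined"]
golds = ["support", "do not support", "do not support", "cannot be determined"]
print(training_finished(preds, golds))  # 3/4 = 0.75 < 0.79 -> False
```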
Multiple experiments show the following: without semantic annotation processing, and with only the text to be analyzed and the question to be answered input into the text analysis model, the accuracy of the model is about 77%; with semantic annotation processing but still only the text to be analyzed and the question to be answered input, the accuracy is about 78%; and with semantic annotation processing and the candidate answers input into the model together with the text to be analyzed and the question to be answered, the accuracy is about 79%.
Therefore, according to the training method of the text analysis model provided by the embodiment, by performing embedding processing and semantic labeling processing on the sample text, the sample question, the sample candidate answer and the correct answer, semantic role labeling information of a fine granularity level can be integrated into a word vector, the information extraction capability of the text analysis model in reading and understanding is enhanced, the training effect of the model is effectively improved, and the accuracy of the text analysis model is effectively improved.
According to the training method of the text analysis model provided by the embodiment, the sample candidate answers and the correct answers are input into the text analysis model along with the sample text and the sample questions to be processed, so that the training effect of the text analysis model and the accuracy of reading and understanding the question and answer of the text analysis model can be effectively improved.
As shown in fig. 6, a text analysis apparatus includes:
the obtaining module 601 is configured to obtain a text to be analyzed, a question to be answered, and a candidate answer.
An embedding module 602, configured to perform embedding processing on word units in the text to be analyzed, the question to be answered, and the candidate answer to generate a first word vector corresponding to the word units.
And the labeling module 603 is configured to perform semantic labeling processing on word units in the text to be analyzed, the question to be answered, and the candidate answer, and generate a second word vector corresponding to the word units.
A generating module 604 configured to generate a third word vector corresponding to the word unit based on the first and second word vectors corresponding to the word unit.
A determining module 605 configured to input the third word vector into a text analysis model for processing, and determine an answer of the question to be answered from the candidate answers.
Optionally, the text analysis apparatus further includes:
and the first splicing module is configured to splice the text to be analyzed and the candidate answers to generate a text answer set.
The embedding module 602, further configured to:
and embedding the word units in the text answer set and the question to be answered to generate a first word vector corresponding to the word units.
The annotation module 603 is further configured to:
and performing semantic annotation processing on word units in the text answer set and the question to be answered to generate a second word vector corresponding to the word units.
Optionally, the text analysis apparatus further includes:
and the second splicing module is configured to splice the question to be answered and the candidate answer to generate a question answer set.
The embedding module 602, further configured to:
and embedding the word units in the question answer set and the text to be analyzed to generate a first word vector corresponding to the word units.
The annotation module 603 is further configured to:
and performing semantic annotation processing on word units in the question answer set and the text to be analyzed to generate a second word vector corresponding to the word units.
Optionally, the annotation module 603 is further configured to:
and performing semantic annotation on the text to be analyzed, the question to be answered and the candidate answer to generate a semantic label corresponding to the word unit.
And generating a second word vector corresponding to the word unit based on the semantic label.
Optionally, the labeling module 603 is further configured to:
and embedding the semantic tags to generate tag vectors, and taking the tag vectors as second word vectors of the word units.
Optionally, the generating module 604 is further configured to:
and splicing the first word vector and the second word vector of the word unit in the text to be analyzed, the question to be answered and the candidate answer to generate a third word vector corresponding to the word unit.
Optionally, the determining module 605 is further configured to:
and inputting the third word vector into a text analysis model for feature extraction to generate a feature vector.
And sequentially carrying out linear mapping and nonlinear transformation processing on the feature vectors to obtain the probability that the candidate answers are used as answers of the questions to be answered.
Determining an answer based on a probability that the candidate answer is the answer to the question to be answered.
The text analysis device provided by the application carries out embedding processing and semantic labeling processing on the text to be analyzed, the question to be answered and the candidate answer respectively, can integrate fine-grained semantic role labeling information into a word vector, deepens the comprehension degree of a text analysis model to the text and the question, effectively improves the depth, flexibility and diversity of extraction of the text and the question information in the text analysis process, and improves the accuracy of the answer to the question to be answered.
As shown in fig. 7, an apparatus for training a text analysis model includes:
a sample obtaining module 701 configured to obtain a training sample and a sample label, wherein the training sample comprises a sample text, a sample question and a sample candidate answer, and the sample label comprises a correct answer corresponding to the sample text and the sample question.
A sample embedding module 702 configured to perform embedding processing on the word units in the sample text, the sample question and the sample candidate answer, and generate a first sample word vector corresponding to the word units.
A sample labeling module 703 configured to perform semantic labeling processing on word units in the sample text, the sample question, and the sample candidate answer, and generate a second sample word vector corresponding to the word units.
A sample generation module 704 configured to generate a third sample word vector corresponding to the word unit based on the first sample word vector and the second sample word vector corresponding to the word unit.
A sample determination module 705 configured to input the third sample word vector into a text analysis model for processing, and determine a predicted answer to the sample question.
A model update module 706 configured to compare the predicted answer and the correct answer and update the text analysis model based on a comparison result of the predicted answer and the correct answer.
Optionally, the training apparatus for the text analysis model further includes:
and the third splicing module is configured to splice the text to be analyzed and the candidate answers to generate a text answer set.
The sample embedding module 702 is further configured to:
and embedding the word units in the text answer set and the question to be answered to generate a first word vector corresponding to the word units.
The sample annotation module 703, further configured to:
and performing semantic annotation processing on word units in the text answer set and the question to be answered to generate a second word vector corresponding to the word units.
Optionally, the training apparatus for the text analysis model further includes:
and the fourth splicing module is configured to splice the question to be answered and the candidate answers to generate a question answer set.
The sample embedding module 702 is further configured to:
and embedding the word units in the question answer set and the text to be analyzed to generate a first word vector corresponding to the word units.
The sample annotation module 703, further configured to:
and performing semantic annotation processing on word units in the question answer set and the text to be analyzed to generate a second word vector corresponding to the word units.
Optionally, the sample annotation module 703 is further configured to:
and performing semantic annotation on the text to be analyzed, the question to be answered and the candidate answer to generate a semantic label corresponding to the word unit.
And generating a second word vector corresponding to the word unit based on the semantic label.
Optionally, the sample annotation module 703 is further configured to:
and embedding the semantic tags to generate tag vectors, and taking the tag vectors as second word vectors of the word units.
Optionally, the sample generation module 704 is further configured to:
and splicing the first word vector and the second word vector of the word unit in the text to be analyzed, the question to be answered and the candidate answer to generate a third word vector corresponding to the word unit.
Optionally, the sample determination module 705 is further configured to:
and inputting the third word vector into a text analysis model for feature extraction to generate a feature vector.
And sequentially carrying out linear mapping and nonlinear transformation processing on the feature vectors to obtain the probability that the candidate answers are used as answers of the questions to be answered.
Determining an answer based on a probability that the candidate answer is the answer to the question to be answered.
According to the training device of the text analysis model, the sample text, the sample question, the sample candidate answer and the correct answer are respectively subjected to embedding processing and semantic labeling processing, semantic role labeling information of fine granularity levels can be integrated into a word vector, the information extraction capability of the text analysis model in reading understanding is enhanced, and the training effect of the model is effectively improved.
An embodiment of the present application further provides a computing device, including a memory, a processor, and computer instructions stored on the memory and executable on the processor, where the processor executes the instructions to implement the following steps:
and acquiring a text to be analyzed, a question to be answered and a candidate answer.
And performing embedding processing on word units in the text to be analyzed, the question to be answered and the candidate answer to generate a first word vector corresponding to the word units.
And performing semantic annotation processing on word units in the text to be analyzed, the question to be answered and the candidate answer to generate a second word vector corresponding to the word units.
And generating a third word vector corresponding to the word unit based on the first word vector and the second word vector corresponding to the word unit.
And inputting the third word vector into a text analysis model for processing, and determining answers of the questions to be answered in the candidate answers.
An embodiment of the present application further provides a computer readable storage medium, which stores computer instructions, when executed by a processor, for implementing the steps of the text analysis method or the training method of the text analysis model as described above.
The above is an illustrative scheme of a computer-readable storage medium of the present embodiment. It should be noted that the technical solution of the storage medium belongs to the same concept as the technical solution of the text analysis method or the text analysis model training method, and details that are not described in detail in the technical solution of the storage medium can be referred to the description of the technical solution of the text analysis method or the text analysis model training method.
The computer instructions comprise computer program code which may be in the form of source code, object code, an executable file or some intermediate form, or the like. The computer-readable medium may include: any entity or device capable of carrying the computer program code, recording medium, usb disk, removable hard disk, magnetic disk, optical disk, computer Memory, Read-Only Memory (ROM), Random Access Memory (RAM), electrical carrier wave signals, telecommunications signals, software distribution medium, and the like. It should be noted that the computer readable medium may contain content that is subject to appropriate increase or decrease as required by legislation and patent practice in jurisdictions, for example, in some jurisdictions, computer readable media does not include electrical carrier signals and telecommunications signals as is required by legislation and patent practice.
It should be noted that, for the sake of simplicity, the above-mentioned method embodiments are described as a series of acts or combinations, but those skilled in the art should understand that the present application is not limited by the described order of acts, as some steps may be performed in other orders or simultaneously according to the present application. Further, those skilled in the art should also appreciate that the embodiments described in the specification are preferred embodiments and that the acts and modules referred to are not necessarily required in this application.
In the above embodiments, each embodiment is described with its own emphasis; for parts not detailed in a given embodiment, reference may be made to the related descriptions of the other embodiments.
The preferred embodiments of the present application disclosed above are intended only to aid in explaining the application. Alternative embodiments are not described exhaustively, and the application is not limited to the precise embodiments disclosed. Obviously, many modifications and variations are possible in light of the above teaching. The embodiments were chosen and described in order to best explain the principles of the application and its practical application, thereby enabling others skilled in the art to best understand and use it. The application is limited only by the claims and their full scope and equivalents.
Claims (12)
1. A method of text analysis, comprising:
acquiring a text to be analyzed, a question to be answered and a candidate answer;
embedding the word units in the text to be analyzed, the question to be answered and the candidate answer to generate a first word vector corresponding to the word units;
performing semantic annotation processing on word units in the text to be analyzed, the question to be answered and the candidate answer to generate a second word vector corresponding to the word units;
generating a third word vector corresponding to the word unit based on the first word vector and the second word vector corresponding to the word unit;
and inputting the third word vector into a text analysis model for processing, and determining an answer to the question to be answered from among the candidate answers.
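The three-vector construction recited in claim 1 can be sketched in a few lines. This is an illustrative sketch only, not the patented implementation: the toy vocabularies, vector dimensions, tag set, and the use of concatenation to combine the first and second word vectors are all assumptions introduced for the example.

```python
import numpy as np

# Toy lookup tables; the entries and sizes are illustrative assumptions.
WORD_DIM, TAG_DIM = 4, 2
rng = np.random.default_rng(0)
word_emb = {w: rng.normal(size=WORD_DIM) for w in ("rain", "wet", "dry")}
tag_emb = {t: rng.normal(size=TAG_DIM) for t in ("NOUN", "ADJ")}

def third_word_vectors(word_units, semantic_tags):
    """First vector from word embedding, second from the semantic tag,
    third by combining the two for each word unit (concatenation here)."""
    rows = []
    for word, tag in zip(word_units, semantic_tags):
        first = word_emb[word]   # embedding processing -> first word vector
        second = tag_emb[tag]    # semantic annotation  -> second word vector
        rows.append(np.concatenate([first, second]))  # third word vector
    return np.stack(rows)

vecs = third_word_vectors(["rain", "wet"], ["NOUN", "ADJ"])
print(vecs.shape)  # one (WORD_DIM + TAG_DIM)-dimensional vector per word unit
```

The resulting matrix of third word vectors is what would then be fed to the text analysis model.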
2. The text analysis method according to claim 1, wherein, after the acquiring of the text to be analyzed, the question to be answered and the candidate answer, the method further comprises:
splicing the text to be analyzed and the candidate answers to generate a text answer set;
the embedding processing of the word units in the text to be analyzed, the question to be answered and the candidate answer to generate the first word vector corresponding to the word units includes:
embedding the word units in the text answer set and the question to be answered to generate a first word vector corresponding to the word units;
performing semantic annotation processing on word units in the text to be analyzed, the question to be answered and the candidate answer to generate a second word vector corresponding to the word units, including:
and performing semantic annotation processing on word units in the text answer set and the question to be answered to generate a second word vector corresponding to the word units.
3. The text analysis method according to claim 1, wherein, after the acquiring of the text to be analyzed, the question to be answered and the candidate answer, the method further comprises:
splicing the question to be answered and the candidate answers to generate a question answer set;
the embedding processing of the word units in the text to be analyzed, the question to be answered and the candidate answer to generate the first word vector corresponding to the word units includes:
embedding the word units in the question answer set and the text to be analyzed to generate a first word vector corresponding to the word units;
performing semantic annotation processing on word units in the text to be analyzed, the question to be answered and the candidate answer to generate a second word vector corresponding to the word units, including:
and performing semantic annotation processing on word units in the question answer set and the text to be analyzed to generate a second word vector corresponding to the word units.
4. The text analysis method according to claim 1, wherein performing semantic annotation processing on word units in the text to be analyzed, the question to be answered, and the candidate answer to generate a second word vector corresponding to the word units comprises:
semantically labeling the text to be analyzed, the question to be answered and the candidate answer to generate a semantic label corresponding to the word unit;
and generating a second word vector corresponding to the word unit based on the semantic label.
5. The text analysis method of claim 4, wherein the generating a second word vector corresponding to the word unit based on the semantic tag comprises:
and embedding the semantic tags to generate tag vectors, and taking the tag vectors as second word vectors of the word units.
6. The method of claim 1, wherein generating a third word vector corresponding to the word unit based on the first and second word vectors corresponding to the word unit comprises:
and splicing the first word vector and the second word vector of the word unit in the text to be analyzed, the question to be answered and the candidate answer to generate a third word vector corresponding to the word unit.
7. The text analysis method of claim 1, wherein the inputting the third word vector into a text analysis model for processing, and determining an answer to the question to be answered from among the candidate answers, comprises:
inputting the third word vector into a text analysis model for feature extraction to generate a feature vector;
sequentially performing linear mapping and nonlinear transformation on the feature vector to obtain a probability that each candidate answer is the answer to the question to be answered;
and determining the answer based on the probability that each candidate answer is the answer to the question to be answered.
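The two-stage scoring in claim 7 can be sketched as follows. The feature dimensions, the weight values, and the choice of softmax as the nonlinear transformation are illustrative assumptions; the claim itself does not fix any of them.

```python
import numpy as np

def answer_probabilities(candidate_features, W, b):
    """Map each candidate's feature vector to a score (linear mapping),
    then apply a nonlinear transformation (softmax) to get probabilities."""
    scores = candidate_features @ W + b      # linear mapping
    exp = np.exp(scores - scores.max())      # numerically stable exponentials
    return exp / exp.sum()                   # nonlinear transformation

# Three candidate answers with hypothetical 2-dimensional feature vectors.
features = np.array([[1.0, 0.0], [0.0, 2.0], [0.5, 0.5]])
probs = answer_probabilities(features, W=np.array([1.0, 1.0]), b=0.0)
best = int(np.argmax(probs))  # index of the determined answer
```

The answer is then determined as the candidate with the highest probability, matching the last step of the claim.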
8. A method for training a text analysis model, comprising:
acquiring a training sample and a sample label, wherein the training sample comprises a sample text, a sample question and a sample candidate answer, and the sample label comprises a correct answer corresponding to the sample text and the sample question;
embedding the word units in the sample text, the sample question and the sample candidate answer to generate a first sample word vector corresponding to the word units;
performing semantic annotation processing on word units in the sample text, the sample question and the sample candidate answer to generate a second sample word vector corresponding to the word units;
generating a third sample word vector corresponding to the word unit based on the first sample word vector and the second sample word vector corresponding to the word unit;
inputting the third sample word vector into a text analysis model for processing, and determining a predicted answer of the sample question;
and comparing the predicted answer with the correct answer, and updating the text analysis model based on the comparison result of the predicted answer and the correct answer.
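A minimal sketch of the compare-and-update loop in claim 8, assuming a linear scorer over per-candidate sample features and a cross-entropy-style gradient update; these modeling choices are assumptions for illustration, not part of the claim.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def train_step(W, sample_features, correct_idx, lr=0.1):
    """Predict an answer, compare the prediction with the correct answer,
    and update the model parameters based on that comparison."""
    probs = softmax(sample_features @ W)   # predicted answer distribution
    grad = probs.copy()
    grad[correct_idx] -= 1.0               # prediction vs. correct answer
    return W - lr * sample_features.T @ grad

# Two sample candidate answers; candidate 0 carries the correct-answer label.
features = np.array([[1.0, 0.0], [0.0, 1.0]])
W = np.zeros(2)
for _ in range(50):
    W = train_step(W, features, correct_idx=0)
probs = softmax(features @ W)  # the updated model now favors candidate 0
```

Repeating the step shifts the parameters so the predicted answer converges toward the labeled correct answer.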
9. A text analysis apparatus, comprising:
an acquisition module configured to acquire a text to be analyzed, a question to be answered and a candidate answer;
an embedding module configured to perform embedding processing on word units in the text to be analyzed, the question to be answered and the candidate answer, and generate a first word vector corresponding to the word units;
a labeling module configured to perform semantic labeling processing on word units in the text to be analyzed, the question to be answered and the candidate answer, and generate a second word vector corresponding to the word units;
a generating module configured to generate a third word vector corresponding to the word unit based on the first and second word vectors corresponding to the word unit;
a determining module configured to input the third word vector into a text analysis model for processing and determine an answer to the question to be answered from among the candidate answers.
10. An apparatus for training a text analysis model, comprising:
a sample acquisition module configured to acquire a training sample including a sample text, a sample question, and a sample candidate answer, and a sample label including a correct answer corresponding to the sample text and the sample question;
a sample embedding module configured to perform embedding processing on word units in the sample text, the sample question and the sample candidate answer, and generate a first sample word vector corresponding to the word units;
a sample labeling module configured to perform semantic labeling processing on word units in the sample text, the sample question and the sample candidate answer, and generate a second sample word vector corresponding to the word units;
a sample generation module configured to generate a third sample word vector corresponding to the word unit based on the first and second sample word vectors corresponding to the word unit;
a sample determining module configured to input the third sample word vector into a text analysis model for processing, and determine a predicted answer of the sample question;
and a model updating module configured to compare the predicted answer with the correct answer and update the text analysis model based on the comparison result.
11. A computing device comprising a memory, a processor, and computer instructions stored on the memory and executable on the processor, wherein the processor, when executing the instructions, implements the steps of the method of any one of claims 1 to 7 or of claim 8.
12. A computer-readable storage medium storing computer instructions which, when executed by a processor, implement the steps of the method of any one of claims 1 to 7 or of claim 8.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910881838.9A CN110609886A (en) | 2019-09-18 | 2019-09-18 | Text analysis method and device |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910881838.9A CN110609886A (en) | 2019-09-18 | 2019-09-18 | Text analysis method and device |
Publications (1)
Publication Number | Publication Date |
---|---|
CN110609886A true CN110609886A (en) | 2019-12-24 |
Family
ID=68891577
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910881838.9A Pending CN110609886A (en) | 2019-09-18 | 2019-09-18 | Text analysis method and device |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110609886A (en) |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20160357855A1 (en) * | 2015-06-02 | 2016-12-08 | International Business Machines Corporation | Utilizing Word Embeddings for Term Matching in Question Answering Systems |
CN108415977A (en) * | 2018-02-09 | 2018-08-17 | 华南理工大学 | A generative machine reading comprehension method based on a deep neural network and reinforcement learning
CN109766423A (en) * | 2018-12-29 | 2019-05-17 | 上海智臻智能网络科技股份有限公司 | Neural-network-based question answering method and device, storage medium, and terminal
- 2019-09-18: CN application CN201910881838.9A filed; published as CN110609886A; status: Pending
Cited By (11)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111858879A (en) * | 2020-06-18 | 2020-10-30 | 达而观信息科技(上海)有限公司 | Question-answering method and system based on machine reading understanding, storage medium and computer equipment |
CN111858879B (en) * | 2020-06-18 | 2024-04-05 | 达观数据有限公司 | Question and answer method and system based on machine reading understanding, storage medium and computer equipment |
CN113806474A (en) * | 2020-08-24 | 2021-12-17 | 北京沃东天骏信息技术有限公司 | Data matching method and device, electronic equipment and storage medium |
CN113779205A (en) * | 2020-09-03 | 2021-12-10 | 北京沃东天骏信息技术有限公司 | Intelligent response method and device |
CN114138947A (en) * | 2020-09-03 | 2022-03-04 | 北京金山数字娱乐科技有限公司 | Text processing method and device |
CN113779205B (en) * | 2020-09-03 | 2024-05-24 | 北京沃东天骏信息技术有限公司 | Intelligent response method and device |
CN112613322A (en) * | 2020-12-17 | 2021-04-06 | 平安科技(深圳)有限公司 | Text processing method, device, equipment and storage medium |
CN114648022A (en) * | 2020-12-17 | 2022-06-21 | 北京金山数字娱乐科技有限公司 | Text analysis method and device |
CN112613322B (en) * | 2020-12-17 | 2023-10-24 | 平安科技(深圳)有限公司 | Text processing method, device, equipment and storage medium |
CN112732868A (en) * | 2020-12-30 | 2021-04-30 | 科大讯飞股份有限公司 | Answer analysis method for answers, electronic device and storage medium |
CN115081428A (en) * | 2022-07-22 | 2022-09-20 | 粤港澳大湾区数字经济研究院(福田) | Method for processing natural language, natural language processing model and equipment |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110609886A (en) | Text analysis method and device | |
CN110781663B (en) | Training method and device of text analysis model, text analysis method and device | |
CN110347802B (en) | Text analysis method and device | |
CN112364660B (en) | Corpus text processing method, corpus text processing device, computer equipment and storage medium | |
CN113536801A (en) | Reading understanding model training method and device and reading understanding method and device | |
CN111274829A (en) | Sequence labeling method using cross-language information | |
CN110929015A (en) | Multi-text analysis method and device | |
CN115204156A (en) | Keyword extraction method and device | |
CN110795934B (en) | Sentence analysis model training method and device and sentence analysis method and device | |
CN110852071B (en) | Knowledge point detection method, device, equipment and readable storage medium | |
CN110990556A (en) | Idiom recommendation method and device and idiom recommendation model training method and device | |
CN114077655A (en) | Method and device for training answer extraction model | |
CN113961686A (en) | Question-answer model training method and device, question-answer method and device | |
CN115617961A (en) | Question answering method and device | |
CN113362026A (en) | Text processing method and device | |
CN114138947A (en) | Text processing method and device | |
CN115757723A (en) | Text processing method and device | |
CN114579706B (en) | Automatic subjective question review method based on BERT neural network and multi-task learning | |
CN112800186B (en) | Reading understanding model training method and device and reading understanding method and device | |
CN112861827B (en) | Sign language translation method and system using single language material translation | |
CN115186812A (en) | Written language-based model training method and device | |
CN111782771B (en) | Text question solving method and device | |
Hao et al. | Multi-task learning based online dialogic instruction detection with pre-trained language models | |
CN115730607A (en) | Dialogue detection model training method and device | |
CN116029303A (en) | Language expression mode identification method, device, electronic equipment and storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||