
CN114580376A - Chinese abstract generation method based on constituency parsing - Google Patents

Chinese abstract generation method based on constituency parsing

Info

Publication number
CN114580376A
CN114580376A (application number CN202210252383.6A)
Authority
CN
China
Prior art keywords
sentence
text
decoder
constituent
serialization
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210252383.6A
Other languages
Chinese (zh)
Inventor
龙军
李浩然
刘磊
向一平
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Central South University
Original Assignee
Central South University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Central South University
Priority to CN202210252383.6A
Publication of CN114580376A
Legal status: Pending (Current)

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 - Handling natural language data
    • G06F 40/20 - Natural language analysis
    • G06F 40/205 - Parsing
    • G06F 40/211 - Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 - Handling natural language data
    • G06F 40/30 - Semantic analysis
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/04 - Architecture, e.g. interconnection topology
    • G06N 3/045 - Combinations of networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/04 - Architecture, e.g. interconnection topology
    • G06N 3/047 - Probabilistic or stochastic networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/08 - Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • General Engineering & Computer Science (AREA)
  • Biomedical Technology (AREA)
  • Evolutionary Computation (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Probability & Statistics with Applications (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a Chinese abstract generation method based on constituency parsing, which comprises the following steps: preprocessing a document to obtain a text sentence subset; obtaining the text semantic information encoding from the text sentence subset using a semantic extraction model; generating a constituency parse tree for each sentence in the subset, and converting each tree into a serialized constituency-structure encoding via a span-based method; feeding the text semantic information encoding and the serialized constituency-structure encoding jointly into an encoder for integrated encoding; and decoding the integrated encoding output by the encoder with a decoder to generate the text abstract. The original grammatical structure of the text can thus supervise the abstract generation process, improving the accuracy and readability of the generated abstract.

Description

Chinese abstract generation method based on constituency parsing
Technical Field
The invention relates to the technical field of information processing, and in particular to a Chinese abstract generation method based on constituency parsing.
Background
The National Natural Science Foundation receives applications for basic research and applied basic research, i.e., theoretical work that reveals the general laws, basic principles, and nature of natural phenomena. During fund application review, experts need to efficiently and accurately extract effective information from a large volume of application text and reach a verdict. Text summarization technology aims to automatically extract key information from large amounts of application text and can, to a certain extent, assist the expert review process. However, fund applications contain a great number of specialized scientific terms; existing text abstract models struggle to fully mine and understand the semantic and syntactic-structure information these terms carry, and the generated abstracts often omit key information, cover the source incompletely, and read ungrammatically.
Most application texts are long, and traditional sequence-based abstract generation models can neither compute in parallel nor attend to syntactic information, so abstracts generated from long texts tend to have an unclear theme and violate grammatical rules. The main problems faced in abstract generation are insufficient accuracy and abstracts that do not conform to human language habits. If the deep semantic information of the text can be acquired and syntactic-structure information introduced, so that the integrated encoding contains both text semantics and syntax, the generated abstract will better conform to grammatical rules and have a more salient theme.
Disclosure of Invention
Aiming at the above defects in the prior art, the invention provides a Chinese abstract generation method, apparatus and storage medium based on constituency parsing, intended to solve the problems that abstracts produced by existing text abstract generation methods have an unclear gist and low readability.
In order to achieve the above object, the present invention adopts the following technical solutions.
A Chinese abstract generation method based on constituency parsing comprises the following steps:
preprocessing a document to obtain a text sentence subset;
obtaining the text semantic information encoding from the text sentence subset using a semantic extraction model;
generating a constituency parse tree for each sentence in the text sentence subset, and converting each tree into a serialized constituency-structure encoding via a span-based method;
feeding the text semantic information encoding and the serialized constituency-structure encoding jointly into an encoder for integrated encoding;
and decoding the integrated encoding output by the encoder with a decoder to generate the text abstract.
Further, the semantic extraction model adopts a PEGASUS model.
Further, the constituency parse tree of each sentence is generated using Stanford CoreNLP.
Further, converting each sentence's constituency parse tree into a serialized constituency-structure encoding via the span-based method comprises:
for each sentence's constituency parse tree, recursively merging the two rightmost child nodes to convert the tree into a right binary tree;
representing the resulting right binary tree as a span table;
dividing the span table into n parts according to the right boundaries of its spans, where n is the sentence length; all left children in the binary tree, including the root node, are distributed over the n parts, and their right boundaries correspond one-to-one with the values in [1, n]; using each right boundary as the serialized index and the corresponding left boundary as the serialized value yields the serialized constituency-structure encoding after span-table linearization.
Further, the encoder adopts an attention-based semantic-structure encoder, which first fuses the text semantic information encoding and the serialized constituency-structure encoding, as shown in the following formula:

h̃ = GLU(W[h; d] + b)

where h̃ denotes the final hidden state of the encoder, d denotes the serialized constituency-structure encoding, h denotes the text semantic information encoding, GLU is the gated linear unit activation function, b denotes an offset, and W denotes a learnable parameter;
the attention medium of the encoder redistributes the attention of other words according to the decoder input at the current time t and generates a context semantic vector C which changes with the current wordtThe attention mechanism formula is as follows:
Figure BDA0003547290020000024
Figure BDA0003547290020000025
Figure BDA0003547290020000026
in the formula, at,iExpress attention weight, by et,iCalculating a score;
Figure BDA0003547290020000027
representing the i-th hidden state of the encoder, St-1Representing the hidden state at time t-1 at the decoder, n represents the sentence length,
Figure BDA0003547290020000028
Wh、Vheach represents a weight matrix.
Further, the decoder uses a unidirectional GRU network whose input is formed jointly from the output y_{t-1} of the decoder at the previous time t-1, the hidden state S_{t-1} of the decoder at time t-1, and the context semantic vector C_t at the current time t; the final hidden state h̃ of the encoder serves as the first input to the decoder. The unidirectional GRU network structure formulas are:

z_t = σ(W_z S_{t-1} + W_z C_t + W_z y_{t-1})
r_t = σ(W_r S_{t-1} + W_r C_t + W_r y_{t-1})
S̃_t = tanh(W_S̃ (r_t ⊙ S_{t-1}) + W_S̃ C_t + W_S̃ y_{t-1})
S_t = (1 − z_t) ⊙ S_{t-1} + z_t ⊙ S̃_t

where z_t and r_t denote the update gate and the reset gate respectively; S̃_t is a new candidate vector computed from C_t, y_{t-1} and the previous state S_{t-1}, containing the context information and y_{t-1}; σ and tanh denote activation functions; ⊙ denotes the element-wise product; W_z is the weight parameter of the update gate, W_r the weight parameter of the reset gate, and W_S̃ the weight parameter of S̃_t; S_t denotes the hidden state of the decoder at the current time t.

The position of the word in the vocabulary is then obtained through a softmax layer, as in the following formula:

P(y_t | y_1, y_2, ..., y_{t-1}, C_t) = softmax(S_t)

The hidden state S_t of the decoder is computed as shown below, where GRU denotes the gated recurrent unit decoder:

S_t = GRU(S_{t-1}, C_t, y_{t-1})

Finally, a search algorithm is adopted to generate the optimal output.
Further, the search algorithm adopts a beam search algorithm.
Advantageous effects
The invention provides a Chinese abstract generation method based on constituency parsing. First, a semantic extraction model extracts the text semantic information encoding, which contains the deep semantic information of the text. Meanwhile, the constituency parse tree of the text is obtained and converted, via the span-based method, into a serialized constituency-structure encoding; this encoding carries the syntactic structure of the text and can supervise the abstract generation process, so that the generated abstract better conforms to human language habits. The text semantic information encoding and the serialized constituency-structure encoding are then fed jointly into an attention-based encoder, whose main role is to fuse the two: an attention mechanism computes keyword semantic vectors, which are combined with the serialized constituency-structure encoding and propagated to the next term, so that the integrated encoding retains the semantic information while containing syntactic-structure information. Finally, a decoder decodes the integrated encoding output by the encoder, generating an abstract that better conforms to grammatical rules and has a more salient gist.
The invention realizes the long-text Chinese abstract generation task. It addresses the problems, caused by overly long texts, that the abstract's gist is unclear and the abstract does not conform to human language habits, and it can play an important auxiliary role in the review of application texts. The core significance of the scheme is that the original grammatical structure of the text can be extracted to supervise the abstract generation process, improving the accuracy and readability of the text abstract.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below are only some embodiments of the present invention; those skilled in the art can obtain other drawings from them without creative effort.
FIG. 1 is a flow chart of the Chinese abstract generation method based on constituency parsing according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of the process of generating the serialized constituency-structure encoding according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions, and advantages of the present invention clearer, the technical solutions of the present invention are described in detail below. Obviously, the described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art from the embodiments given herein without creative effort fall within the protection scope of the present invention.
As shown in FIG. 1, an embodiment of the present invention discloses a Chinese abstract generation method based on constituency parsing, comprising:
s1: and preprocessing the document to be subjected to abstract generation to obtain a text sentence subset.
Specifically, stop words are sequentially filtered out from each sentence in the document, and only words with specified parts of speech are reserved, so that a new text sentence subset is obtained.
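As a concrete illustration, the following is a minimal preprocessing sketch. It assumes jieba for Chinese segmentation and part-of-speech tagging (the patent does not name a tool), and the stop-word list and retained parts of speech are illustrative choices, not specified by the patent.

```python
# Minimal preprocessing sketch: sentence split, stop-word filtering,
# POS filtering. jieba is an assumed tool; STOP_WORDS and KEPT_POS
# are illustrative, not from the patent.
import jieba.posseg as pseg

STOP_WORDS = {"的", "了", "是", "在", "和", "与", "及"}   # assumed stop-word list
KEPT_POS = {"n", "nz", "v", "vn", "a"}                    # assumed parts of speech to keep

def preprocess(document: str) -> list[list[str]]:
    """Split a document into sentences, then keep only non-stop words
    with a specified part of speech, yielding the text sentence subset."""
    text = document.replace("！", "。").replace("？", "。")
    sentences = [s.strip() for s in text.split("。") if s.strip()]
    subset = []
    for sent in sentences:
        words = [w.word for w in pseg.cut(sent)
                 if w.word not in STOP_WORDS and w.flag in KEPT_POS]
        if words:
            subset.append(words)
    return subset
```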
S2: and based on the text sentence subset, obtaining text semantic information codes by using a semantic extraction model.
In implementation, the semantic extraction model adopts a PEGASUS model. The PEGASUS model is pre-trained on a large text corpus by using a new self-supervision target, and the generator is added into the pre-training structure of the Bert, so that the generation-type task can be pre-trained. As a pre-training target, the model uses both GSG (gap Sendences Generation) and MLM (masked Language model). Specifically, assuming that m sentences in the original text constitute a sentence set after text preprocessing, a pseudo data set is formed by considering 2/3 parts in the sentence set as the original text and 1/3 parts as abstract parts in the sentence set as a training set, and the pegsus model is pre-trained using the pseudo data set.
GSG (Gap Sentences Generation): key sentences are masked out of the document and generated from the remaining sentences, which helps the model understand the entire document and generate summary-like text.
MLM (Masked Language Model): following BERT, 15% of the tokens of the input text are selected; of these, (1) 80% are replaced with the [MASK] symbol; (2) 10% are replaced with random other tokens; and (3) 10% are left unchanged.
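For illustration, the text semantic information encoding can be obtained from a PEGASUS encoder through Hugging Face transformers roughly as follows; the checkpoint name is an assumption (the patent only specifies "a PEGASUS model"), and a Chinese checkpoint would be substituted in practice.

```python
# Sketch: extract encoder hidden states h_1..h_p as the text semantic
# information encoding. The checkpoint is an assumed placeholder.
import torch
from transformers import AutoTokenizer, PegasusModel

tokenizer = AutoTokenizer.from_pretrained("google/pegasus-large")  # assumed checkpoint
model = PegasusModel.from_pretrained("google/pegasus-large")

def semantic_encoding(text: str) -> torch.Tensor:
    inputs = tokenizer(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        enc = model.get_encoder()(**inputs)       # run the encoder only
    return enc.last_hidden_state                  # shape (1, p, hidden_dim)
```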
S3: and generating a component syntactic analysis structure tree of each sentence based on the text sentence subset, and converting the component syntactic analysis structure tree of each sentence into component syntactic structure serialization codes based on a span method.
A sentence's constituency parse tree divides the sentence into constituents, with larger constituents obtained by combining smaller ones. In this embodiment, Stanford CoreNLP is used to generate the constituency parse tree of each sentence. Stanford CoreNLP is a natural language processing toolkit that currently supports Arabic, Chinese, English, French, German, Spanish, and other languages, and integrates word segmentation, part-of-speech tagging, syntactic parsing, and other functions. Parsing the sentences with Stanford CoreNLP therefore yields a constituency parse tree for each sentence.
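One possible way to obtain the trees is nltk's client for a running Stanford CoreNLP server; the port and the assumption that the server was started with the Chinese models loaded are illustrative.

```python
# Sketch: constituency parsing via a CoreNLP server (assumed to be
# running locally with the Chinese models loaded).
from nltk.parse.corenlp import CoreNLPParser

parser = CoreNLPParser(url="http://localhost:9000")

def parse_sentence(tokens: list[str]):
    """Return the constituency parse tree (an nltk.Tree) of one sentence."""
    return next(parser.parse(tokens))

tree = parse_sentence(["我", "爱", "写", "代码", "。"])
tree.pretty_print()
```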
Directly building a model with grammatical information would require capturing and recording the structure of the entire constituency parse tree while distinguishing the content of any two different words, but an efficient method for doing so directly is lacking. This embodiment therefore uses an alternative: converting the constituency parse tree into a linearized structure-label sequence via a span-based method. FIG. 2 shows the linearization process for the example sentence "我爱写代码。" ("I love writing code.").
The original constituency parse tree is first obtained through Stanford CoreNLP, and then the two rightmost child nodes are recursively merged, converting the tree into a right binary tree. Next, the tree is represented as a span table and divided into five parts according to the right boundaries of its spans.
Definition: let W = (w_1, w_2, ..., w_n) be a sentence, and define (i, j) as the span from w_{i+1} to w_j, with 0 ≤ i ≤ j ≤ n. Given a sentence W and its constituency parse tree T, we call d = (d_1, d_2, ..., d_n) the linearization of T (i.e., the serialized constituency-structure encoding), where d_i ∈ {0, 1, ..., i−1} and (d_i, i) is the longest span in T that ends at i.
In the span table (c), the gray cells are all the left children of the right binary tree (b), including the root node, and the black cells are all the right children. The right boundaries of all left children do not repeat, and therefore correspond one-to-one with [1, n]; they can thus serve as the serialized indices, with the corresponding left boundaries as the serialized values. For example, in FIG. 2 the span (1, 4) of the span table represents the phrase "love writing code", and the value at index 4 of the serialized array d is d_4 = 1. Linearizing the span table in order yields the d_i sequence, which is the required serialized constituency-structure encoding.
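The binarization and linearization themselves can be sketched as follows, assuming parse trees are nltk.Tree objects; the helper names are illustrative, and the toy example mirrors the FIG. 2 sentence.

```python
# Sketch of the span-based linearization: right-binarize the tree, then
# record, for each left child (and the root), its left boundary at the
# index given by its right boundary.
from nltk import Tree

def right_binarize(tree):
    """Recursively merge the two rightmost children so that every node
    has at most two children (a right binary tree)."""
    if isinstance(tree, str):                     # leaf: a word
        return tree
    children = [right_binarize(c) for c in tree]
    while len(children) > 2:
        children = children[:-2] + [Tree("*", children[-2:])]
    return Tree(tree.label(), children)

def linearize(tree: Tree, n: int) -> list[int]:
    """Return d = (d_1, ..., d_n): d_i is the left boundary of the longest
    span ending at position i among left children (including the root)."""
    d = [0] * (n + 1)

    def visit(node, left, is_left):
        if isinstance(node, str):                 # a leaf spans one position
            right = left + 1
        else:
            right = left
            for k, child in enumerate(node):
                right = visit(child, right, k == 0)
        # A node is recorded after all its descendants, so an enclosing
        # (longer) span overwrites any nested span sharing its right
        # boundary -- exactly the "longest span ending at i" rule.
        if is_left:
            d[right] = left
        return right

    visit(tree, 0, True)                          # treat the root as a left child
    return d[1:]

# Toy example mirroring FIG. 2 ("我 爱 写 代码 。"):
t = Tree.fromstring("(IP (NP 我) (VP (VV 爱) (VP (VV 写) (NP 代码))) (PU 。))")
print(linearize(right_binarize(t), 5))            # [0, 1, 2, 1, 0]; note d_4 = 1
```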
S4: and inputting the text semantic information code and the constituent sentence method structure serialization code into an encoder together for integrated coding.
In this embodiment, the encoder adopts an attention-based semantic-structure encoder, a dual-encoder model. Encoder 1 uses the hidden states h_i (i = 1, 2, ..., p) from the text semantic information encoding produced by the generative pre-trained model PEGASUS. Encoder 2 uses a CNN (convolutional neural network) to extract textual grammar information and vectorize the serialized constituency parse-tree encoding. The feature information extracted by the CNN is fused with the PEGASUS hidden states to build the attention mechanism, fusing the text semantic information encoding with the span-based serialized constituency-structure encoding so that the integrated encoding contains the overall information of the text.
First, the text semantic information encoding and the serialized constituency-structure encoding are fused, as shown in the following formula:

h̃ = GLU(W[h; d] + b)

where h̃ denotes the final hidden state of the encoder, d denotes the serialized constituency-structure encoding, h denotes the hidden-state values of the PEGASUS model, GLU is the gated linear unit activation function, b denotes an offset, and W denotes a learnable parameter; b and W can be obtained through training.
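This fusion step can be sketched in PyTorch as follows, assuming the concatenation form GLU(W[h; d] + b) reconstructed above; F.glu halves the projected vector and gates one half with the sigmoid of the other.

```python
# Sketch of the GLU fusion of semantic states h and constituency
# encodings d; the concatenation form is an assumption.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SemanticStructureFusion(nn.Module):
    def __init__(self, sem_dim: int, syn_dim: int, out_dim: int):
        super().__init__()
        # project [h; d] to 2*out_dim so that GLU can split it in half
        self.proj = nn.Linear(sem_dim + syn_dim, 2 * out_dim)

    def forward(self, h: torch.Tensor, d: torch.Tensor) -> torch.Tensor:
        """h: (batch, n, sem_dim) semantic encoding; d: (batch, n, syn_dim)
        serialized constituency encoding; returns (batch, n, out_dim)."""
        return F.glu(self.proj(torch.cat([h, d], dim=-1)), dim=-1)
```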
and after fusion, obtaining a hidden state value containing the syntactic structure characteristics of the text. In the traditional Seq2Seq model, an encoder encodes the full text and generates a fixed context semantic vector C, and a decoder decodes C and outputs a final result. In the embodiment, the model adopts an attention mechanism to change the target data in a weighting manner, the attention medium of the encoder redistributes the attention of other words according to the input of the decoder at the current time t, and a context semantic vector C which changes constantly with the current word is generatedtThe attention mechanism formula is as follows:
Figure BDA0003547290020000054
Figure BDA0003547290020000061
Figure BDA0003547290020000062
in the formula, at,iExpress attention weight, by et,iCalculating a score;
Figure BDA0003547290020000063
representing the i-th hidden state of the encoder, St-1Representing the hidden state at a time t-1 at the decoder,
Figure BDA0003547290020000064
Wh、Vhall represent weight matrices and can be obtained by training. According to hidden state of encoder
Figure BDA0003547290020000065
And decoder hidden state St-1To calculate et,iScore according to et,iScore calculation attention weight at,iHiding each encoder state
Figure BDA0003547290020000066
And attention weight at,iMultiplying and then carrying out weighted summation to obtain a context semantic vector C of the current moment tt
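A PyTorch sketch of this additive attention, matching the formulas as reconstructed above (the decoder-state weight W_s is an assumed symbol), follows.

```python
# Sketch of the encoder attention: e_{t,i} = V_h^T tanh(W_h h~_i + W_s S_{t-1}),
# softmax over positions, then the weighted sum C_t.
import torch
import torch.nn as nn

class EncoderAttention(nn.Module):
    def __init__(self, hid_dim: int):
        super().__init__()
        self.W_h = nn.Linear(hid_dim, hid_dim, bias=False)
        self.W_s = nn.Linear(hid_dim, hid_dim, bias=False)
        self.V_h = nn.Linear(hid_dim, 1, bias=False)

    def forward(self, h_tilde: torch.Tensor, s_prev: torch.Tensor) -> torch.Tensor:
        """h_tilde: (batch, n, hid) encoder states; s_prev: (batch, hid)
        decoder state S_{t-1}; returns the context vector C_t: (batch, hid)."""
        e = self.V_h(torch.tanh(self.W_h(h_tilde) + self.W_s(s_prev).unsqueeze(1)))
        a = torch.softmax(e, dim=1)          # attention weights a_{t,i}
        return (a * h_tilde).sum(dim=1)      # C_t = sum_i a_{t,i} h~_i
```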
S5: and decoding the integrated codes transmitted by the encoder through a decoder to generate the text abstract.
In this embodiment, the decoder uses a unidirectional GRU network whose input is formed jointly from the output y_{t-1} of the decoder at the previous time t-1, the hidden state S_{t-1} of the decoder at time t-1, and the context semantic vector C_t at the current time t. The final hidden state h̃ of the encoder serves as the first input to the decoder. The unidirectional GRU network structure formulas are:

z_t = σ(W_z S_{t-1} + W_z C_t + W_z y_{t-1})
r_t = σ(W_r S_{t-1} + W_r C_t + W_r y_{t-1})
S̃_t = tanh(W_S̃ (r_t ⊙ S_{t-1}) + W_S̃ C_t + W_S̃ y_{t-1})
S_t = (1 − z_t) ⊙ S_{t-1} + z_t ⊙ S̃_t

where z_t and r_t denote the update gate and the reset gate respectively: the update gate decides which information to discard and which new information to add, and the reset gate decides the extent to which previous information is discarded. S̃_t is a new candidate vector computed from C_t, y_{t-1} and the previous state S_{t-1}, containing the context information and y_{t-1}. σ and tanh denote activation functions, and ⊙ denotes the element-wise product. W_z, W_r and W_S̃ are weight matrices obtained by training: W_z is the weight parameter of the update gate, W_r the weight parameter of the reset gate, and W_S̃ the weight parameter of S̃_t. S_t denotes the hidden state of the decoder at the current time t, and S̃_t its intermediate (candidate) state.
The output state at the current time is obtained through the decoder, and the position of the word in the vocabulary is obtained through a softmax layer, as in the formula:

P(y_t | y_1, y_2, ..., y_{t-1}, C_t) = softmax(S_t)
The hidden state S_t of the decoder is computed as shown below, where GRU denotes the gated recurrent unit decoder:

S_t = GRU(S_{t-1}, C_t, y_{t-1})
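One possible realization of this step uses nn.GRUCell with the concatenation [C_t; y_{t-1}] as input; this is a sketch under that assumption, not the patent's exact parameterization.

```python
# Sketch of one decoding step: S_t = GRU(S_{t-1}, C_t, y_{t-1}) followed
# by the softmax projection over the vocabulary.
import torch
import torch.nn as nn

class GRUDecoder(nn.Module):
    def __init__(self, emb_dim: int, hid_dim: int, vocab_size: int):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.cell = nn.GRUCell(hid_dim + emb_dim, hid_dim)
        self.out = nn.Linear(hid_dim, vocab_size)

    def step(self, y_prev, s_prev, c_t):
        """y_prev: (batch,) previous token ids; s_prev: (batch, hid) S_{t-1};
        c_t: (batch, hid) context C_t. Returns (log-probs, S_t)."""
        x = torch.cat([c_t, self.embed(y_prev)], dim=-1)
        s_t = self.cell(x, s_prev)                        # S_t
        log_p = torch.log_softmax(self.out(s_t), dim=-1)  # P(y_t | ..., C_t)
        return log_p, s_t
```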
and finally, generating an optimal solution by adopting a search algorithm. In the embodiment, the cluster searching algorithm is adopted to reduce the calculation complexity and improve the accuracy. The traditional decoder adopts a greedy search algorithm, namely, the maximum probability is selected in a probability matrix to generate a target word. The cluster searching algorithm is a heuristic graph searching algorithm, the cluster searching selects the first q maximum probabilities, and more candidate spaces are considered, so that a better generated result can be obtained. The present embodiment sets the width q of the bundle search to 10.
The invention realizes the long-text Chinese abstract generation task, addressing the problems that, in long-text summarization, an overly long text leaves the abstract's gist unclear and out of line with human language habits; it can play an important auxiliary role in the review of application texts. The core significance of the scheme is that the original grammatical structure of the text can be extracted to supervise the abstract generation process, improving abstract accuracy and readability. First, the generative pre-trained model PEGASUS, combining the GSG (Gap Sentences Generation) and MLM (Masked Language Model) objectives, extracts the text semantic information encoding, which contains the deep semantic information of the text. Meanwhile, the constituency parse tree of the text is obtained through Stanford CoreNLP, and the span-based method yields a serialized constituency-structure encoding that carries the text's syntactic structure and supervises the abstract generation process, making the generated abstract conform better to human language habits. The text semantic information encoding and the serialized constituency-structure encoding are then fed jointly into the attention-based encoder, which fuses the two: the attention mechanism computes keyword semantic vectors, combines them with the serialized constituency-structure encoding, and propagates them to the next term, so that the integrated encoding retains semantic information while containing syntactic-structure information. Finally, a beam-search-based unidirectional GRU network decodes the integrated encoding output by the encoder, improving accuracy while reducing computational complexity, and generates an abstract that better conforms to grammatical rules and has a more salient gist.
Any process or method description in a flow chart, or otherwise described herein, may be understood as representing a module, segment, or portion of code that comprises one or more executable instructions for implementing specific logical functions or steps of the process. Alternate implementations are included within the scope of the preferred embodiments of the present invention, in which functions may be executed out of the order shown or discussed, including substantially concurrently or in reverse order depending on the functionality involved, as would be understood by those reasonably skilled in the art.
Although embodiments of the present invention have been shown and described above, it is understood that the above embodiments are exemplary and should not be construed as limiting the present invention, and that variations, modifications, substitutions and alterations can be made to the above embodiments by those of ordinary skill in the art within the scope of the present invention.

Claims (7)

1. A Chinese abstract generation method based on constituency parsing, characterized by comprising the following steps:
preprocessing a document to obtain a text sentence subset;
obtaining the text semantic information encoding from the text sentence subset using a semantic extraction model;
generating a constituency parse tree for each sentence in the text sentence subset, and converting each tree into a serialized constituency-structure encoding via a span-based method;
feeding the text semantic information encoding and the serialized constituency-structure encoding jointly into an encoder for integrated encoding;
and decoding the integrated encoding output by the encoder with a decoder to generate the text abstract.
2. The Chinese abstract generation method based on constituency parsing of claim 1, wherein the semantic extraction model is a PEGASUS model.
3. The Chinese abstract generation method based on constituency parsing of claim 1, wherein the constituency parse tree of each sentence is generated using Stanford CoreNLP.
4. The Chinese abstract generation method based on constituency parsing of claim 1 or 3, wherein converting each sentence's constituency parse tree into a serialized constituency-structure encoding via the span-based method comprises:
for each sentence's constituency parse tree, recursively merging the two rightmost child nodes to convert the tree into a right binary tree;
representing the resulting right binary tree as a span table;
dividing the span table into n parts according to the right boundaries of its spans, where n is the sentence length; all left children in the binary tree, including the root node, are distributed over the n parts, and their right boundaries correspond one-to-one with the values in [1, n]; using each right boundary as the serialized index and the corresponding left boundary as the serialized value yields the serialized constituency-structure encoding after span-table linearization.
5. The Chinese abstract generation method based on constituency parsing of claim 1, wherein the encoder is an attention-based semantic-structure encoder that first fuses the text semantic information encoding and the serialized constituency-structure encoding, as shown in the following formula:

h̃ = GLU(W[h; d] + b)

where h̃ denotes the final hidden state of the encoder, d denotes the serialized constituency-structure encoding, h denotes the text semantic information encoding, GLU is the gated linear unit activation function, b denotes an offset, and W denotes a learnable parameter;

the attention mechanism of the encoder redistributes attention over the source words according to the decoder input at the current time t and generates a context semantic vector C_t that changes with the current word; the attention mechanism formulas are as follows:

e_{t,i} = V_h^T tanh(W_h h̃_i + W_s S_{t-1})
a_{t,i} = exp(e_{t,i}) / Σ_{j=1}^{n} exp(e_{t,j})
C_t = Σ_{i=1}^{n} a_{t,i} h̃_i

where a_{t,i} denotes the attention weight, computed from the score e_{t,i}; h̃_i denotes the i-th hidden state of the encoder; S_{t-1} denotes the hidden state of the decoder at time t-1; n denotes the sentence length; and W_s, W_h, V_h each denote a weight matrix.
6. The method of claim 5, wherein the decoder uses a unidirectional GRU network whose input is formed jointly from the output y_{t-1} of the decoder at the previous time t-1, the hidden state S_{t-1} of the decoder at time t-1, and the context semantic vector C_t at the current time t; the final hidden state h̃ of the encoder serves as the first input to the decoder; the unidirectional GRU network structure formulas are:

z_t = σ(W_z S_{t-1} + W_z C_t + W_z y_{t-1})
r_t = σ(W_r S_{t-1} + W_r C_t + W_r y_{t-1})
S̃_t = tanh(W_S̃ (r_t ⊙ S_{t-1}) + W_S̃ C_t + W_S̃ y_{t-1})
S_t = (1 − z_t) ⊙ S_{t-1} + z_t ⊙ S̃_t

where z_t and r_t denote the update gate and the reset gate respectively; S̃_t is a new candidate vector computed from C_t, y_{t-1} and the previous state S_{t-1}, containing the context information and y_{t-1}; σ and tanh denote activation functions; ⊙ denotes the element-wise product; W_z is the weight parameter of the update gate, W_r the weight parameter of the reset gate, and W_S̃ the weight parameter of S̃_t; S_t denotes the hidden state of the decoder at the current time t;

the position of the word in the vocabulary is obtained through a softmax layer, as in the following formula:

P(y_t | y_1, y_2, ..., y_{t-1}, C_t) = softmax(S_t)

the hidden state S_t of the decoder is as shown below, where GRU denotes the gated recurrent unit decoder:

S_t = GRU(S_{t-1}, C_t, y_{t-1})

and finally, a search algorithm is adopted to generate the optimal output.
7. The Chinese abstract generation method of claim 6, wherein the search algorithm is a beam search algorithm.
CN202210252383.6A 2022-03-15 2022-03-15 Chinese abstract generation method based on constituency parsing Pending CN114580376A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210252383.6A CN114580376A (en) Chinese abstract generation method based on constituency parsing

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210252383.6A CN114580376A (en) Chinese abstract generation method based on constituency parsing

Publications (1)

Publication Number Publication Date
CN114580376A 2022-06-03

Family

ID=81775880

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210252383.6A Pending CN114580376A (en) Chinese abstract generation method based on constituency parsing

Country Status (1)

Country Link
CN (1) CN114580376A (en)


Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117874222A (en) * 2024-03-13 2024-04-12 中国石油大学(华东) Abstract text defense method based on semantic consistency
CN117874222B (en) * 2024-03-13 2024-05-17 中国石油大学(华东) Abstract text defense method based on semantic consistency


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination