
CN114580376A - Chinese abstract generation method based on constituency parsing - Google Patents

Chinese abstract generation method based on constituency parsing

Info

Publication number
CN114580376A
CN114580376A (application number CN202210252383.6A)
Authority
CN
China
Prior art keywords
sentence
text
decoder
constituent
serialization
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210252383.6A
Other languages
Chinese (zh)
Inventor
龙军
李浩然
刘磊
向一平
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Central South University
Original Assignee
Central South University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Central South University
Priority to CN202210252383.6A
Publication of CN114580376A
Legal status: Pending (Current)

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 - Handling natural language data
    • G06F 40/20 - Natural language analysis
    • G06F 40/205 - Parsing
    • G06F 40/211 - Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 - Handling natural language data
    • G06F 40/30 - Semantic analysis
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/04 - Architecture, e.g. interconnection topology
    • G06N 3/045 - Combinations of networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/04 - Architecture, e.g. interconnection topology
    • G06N 3/047 - Probabilistic or stochastic networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/08 - Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • General Engineering & Computer Science (AREA)
  • Biomedical Technology (AREA)
  • Evolutionary Computation (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Probability & Statistics with Applications (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a Chinese abstract generation method based on constituency parsing, which comprises the following steps: preprocessing a document to obtain a text sentence subset; obtaining the text semantic information encoding from the text sentence subset using a semantic extraction model; generating a constituency parse tree for each sentence in the subset, and converting each tree into a serialized constituency-structure encoding via a span-based method; feeding the text semantic information encoding and the serialized constituency-structure encoding jointly into an encoder for integrated encoding; and decoding the integrated encoding output by the encoder with a decoder to generate the text abstract. The original grammatical structure of the text can thus supervise the abstract generation process, improving the accuracy and readability of the generated abstract.

Description

Chinese abstract generation method based on constituency parsing
Technical Field
The invention relates to the technical field of information processing, and in particular to a Chinese abstract generation method based on constituency parsing.
Background
The National Natural Science Foundation receives applications for basic research and applied basic research, i.e., theoretical work that reveals the general laws, basic principles, and nature of natural phenomena. During fund application review, experts need to efficiently and accurately extract effective information from a large volume of application text and reach a verdict. Text summarization technology aims to automatically extract key information from large amounts of application text and can, to a certain extent, assist the expert review process. However, fund applications contain a great number of specialized scientific terms; existing text abstract models struggle to fully mine and understand the semantic and syntactic-structure information these terms carry, and the generated abstracts often omit key information, cover the source incompletely, and read ungrammatically.
Most application texts are long, and traditional sequence-based abstract generation models can neither compute in parallel nor attend to syntactic information, so abstracts generated from long texts tend to have an unclear theme and violate grammatical rules. The main problems faced in abstract generation are insufficient accuracy and abstracts that do not conform to human language habits. If the deep semantic information of the text can be acquired and syntactic-structure information introduced, so that the integrated encoding contains both text semantics and syntax, the generated abstract will better conform to grammatical rules and have a more salient theme.
Disclosure of Invention
Aiming at the above defects in the prior art, the invention provides a Chinese abstract generation method, apparatus and storage medium based on constituency parsing, intended to solve the problems that abstracts produced by existing text abstract generation methods have an unclear gist and low readability.
In order to achieve the above object, the present invention adopts the following technical solutions.
A Chinese abstract generation method based on constituency parsing comprises the following steps:
preprocessing a document to obtain a text sentence subset;
obtaining the text semantic information encoding from the text sentence subset using a semantic extraction model;
generating a constituency parse tree for each sentence in the text sentence subset, and converting each tree into a serialized constituency-structure encoding via a span-based method;
feeding the text semantic information encoding and the serialized constituency-structure encoding jointly into an encoder for integrated encoding;
and decoding the integrated encoding output by the encoder with a decoder to generate the text abstract.
Further, the semantic extraction model adopts a PEGASUS model.
Further, the constituency parse tree of each sentence is generated using Stanford CoreNLP.
Further, converting each sentence's constituency parse tree into a serialized constituency-structure encoding via the span-based method comprises:
for each sentence's constituency parse tree, recursively merging the two rightmost child nodes to convert the tree into a right binary tree;
representing the resulting right binary tree as a span table;
dividing the span table into n parts according to the right boundaries of its spans, where n is the sentence length; all left children in the binary tree, including the root node, are distributed over the n parts, and their right boundaries correspond one-to-one with the values in [1, n]; using each right boundary as the serialized index and the corresponding left boundary as the serialized value yields the serialized constituency-structure encoding after span-table linearization.
Further, the encoder adopts an attention-based semantic-structure encoder, which first fuses the text semantic information encoding and the serialized constituency-structure encoding, as shown in the following formula:

h̃ = GLU(W[h; d] + b)

where h̃ denotes the final hidden state of the encoder, d denotes the serialized constituency-structure encoding, h denotes the text semantic information encoding, GLU is the gated linear unit activation function, b denotes an offset, and W denotes a learnable parameter;
the attention medium of the encoder redistributes the attention of other words according to the decoder input at the current time t and generates a context semantic vector C which changes with the current wordtThe attention mechanism formula is as follows:
Figure BDA0003547290020000024
Figure BDA0003547290020000025
Figure BDA0003547290020000026
in the formula, at,iExpress attention weight, by et,iCalculating a score;
Figure BDA0003547290020000027
representing the i-th hidden state of the encoder, St-1Representing the hidden state at time t-1 at the decoder, n represents the sentence length,
Figure BDA0003547290020000028
Wh、Vheach represents a weight matrix.
Further, the decoder uses a unidirectional GRU network whose input is formed jointly from the output y_{t-1} of the decoder at the previous time t-1, the hidden state S_{t-1} of the decoder at time t-1, and the context semantic vector C_t at the current time t; the final hidden state h̃ of the encoder serves as the first input to the decoder. The unidirectional GRU network structure formulas are:

z_t = σ(W_z S_{t-1} + W_z C_t + W_z y_{t-1})
r_t = σ(W_r S_{t-1} + W_r C_t + W_r y_{t-1})
S̃_t = tanh(W_S̃ (r_t ⊙ S_{t-1}) + W_S̃ C_t + W_S̃ y_{t-1})
S_t = (1 − z_t) ⊙ S_{t-1} + z_t ⊙ S̃_t

where z_t and r_t denote the update gate and the reset gate respectively; S̃_t is a new candidate vector computed from C_t, y_{t-1} and the previous state S_{t-1}, containing the context information and y_{t-1}; σ and tanh denote activation functions; ⊙ denotes the element-wise product; W_z is the weight parameter of the update gate, W_r the weight parameter of the reset gate, and W_S̃ the weight parameter of S̃_t; S_t denotes the hidden state of the decoder at the current time t.

The position of the word in the vocabulary is then obtained through a softmax layer, as in the following formula:

P(y_t | y_1, y_2, ..., y_{t-1}, C_t) = softmax(S_t)

The hidden state S_t of the decoder is computed as shown below, where GRU denotes the gated recurrent unit decoder:

S_t = GRU(S_{t-1}, C_t, y_{t-1})

Finally, a search algorithm is adopted to generate the optimal output.
Further, the search algorithm adopts a beam search algorithm.
Advantageous effects
The invention provides a Chinese abstract generation method based on constituency parsing. First, a semantic extraction model extracts the text semantic information encoding, which contains the deep semantic information of the text. Meanwhile, the constituency parse tree of the text is obtained and converted, via the span-based method, into a serialized constituency-structure encoding; this encoding carries the syntactic structure of the text and can supervise the abstract generation process, so that the generated abstract better conforms to human language habits. The text semantic information encoding and the serialized constituency-structure encoding are then fed jointly into an attention-based encoder, whose main role is to fuse the two: an attention mechanism computes keyword semantic vectors, which are combined with the serialized constituency-structure encoding and propagated to the next term, so that the integrated encoding retains the semantic information while containing syntactic-structure information. Finally, a decoder decodes the integrated encoding output by the encoder, generating an abstract that better conforms to grammatical rules and has a more salient gist.
The invention realizes the long-text Chinese abstract generation task. It addresses the problems, caused by overly long texts, that the abstract's gist is unclear and the abstract does not conform to human language habits, and it can play an important auxiliary role in the review of application texts. The core significance of the scheme is that the original grammatical structure of the text can be extracted to supervise the abstract generation process, improving the accuracy and readability of the text abstract.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below are only some embodiments of the present invention; those skilled in the art can obtain other drawings from them without creative effort.
FIG. 1 is a flow chart of the Chinese abstract generation method based on constituency parsing according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of the process of generating the serialized constituency-structure encoding according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions, and advantages of the present invention clearer, the technical solutions of the present invention are described in detail below. Obviously, the described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art from the embodiments given herein without creative effort fall within the protection scope of the present invention.
As shown in FIG. 1, an embodiment of the present invention discloses a Chinese abstract generation method based on constituency parsing, comprising:
s1: and preprocessing the document to be subjected to abstract generation to obtain a text sentence subset.
Specifically, stop words are sequentially filtered out from each sentence in the document, and only words with specified parts of speech are reserved, so that a new text sentence subset is obtained.
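As a concrete illustration, the following is a minimal preprocessing sketch. It assumes jieba for Chinese segmentation and part-of-speech tagging (the patent does not name a tool), and the stop-word list and retained parts of speech are illustrative choices, not specified by the patent.

```python
# Minimal preprocessing sketch: sentence split, stop-word filtering,
# POS filtering. jieba is an assumed tool; STOP_WORDS and KEPT_POS
# are illustrative, not from the patent.
import jieba.posseg as pseg

STOP_WORDS = {"的", "了", "是", "在", "和", "与", "及"}   # assumed stop-word list
KEPT_POS = {"n", "nz", "v", "vn", "a"}                    # assumed parts of speech to keep

def preprocess(document: str) -> list[list[str]]:
    """Split a document into sentences, then keep only non-stop words
    with a specified part of speech, yielding the text sentence subset."""
    text = document.replace("！", "。").replace("？", "。")
    sentences = [s.strip() for s in text.split("。") if s.strip()]
    subset = []
    for sent in sentences:
        words = [w.word for w in pseg.cut(sent)
                 if w.word not in STOP_WORDS and w.flag in KEPT_POS]
        if words:
            subset.append(words)
    return subset
```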
S2: and based on the text sentence subset, obtaining text semantic information codes by using a semantic extraction model.
In implementation, the semantic extraction model adopts a PEGASUS model. The PEGASUS model is pre-trained on a large text corpus by using a new self-supervision target, and the generator is added into the pre-training structure of the Bert, so that the generation-type task can be pre-trained. As a pre-training target, the model uses both GSG (gap Sendences Generation) and MLM (masked Language model). Specifically, assuming that m sentences in the original text constitute a sentence set after text preprocessing, a pseudo data set is formed by considering 2/3 parts in the sentence set as the original text and 1/3 parts as abstract parts in the sentence set as a training set, and the pegsus model is pre-trained using the pseudo data set.
GSG (Gap Sentences Generation): key sentences are masked out of the document and generated from the remaining sentences, which helps the model understand the entire document and generate summary-like text.
MLM (Masked Language Model): following BERT, 15% of the tokens of the input text are selected; of these, (1) 80% are replaced with the [MASK] symbol; (2) 10% are replaced with random other tokens; and (3) 10% are left unchanged.
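For illustration, the text semantic information encoding can be obtained from a PEGASUS encoder through Hugging Face transformers roughly as follows; the checkpoint name is an assumption (the patent only specifies "a PEGASUS model"), and a Chinese checkpoint would be substituted in practice.

```python
# Sketch: extract encoder hidden states h_1..h_p as the text semantic
# information encoding. The checkpoint is an assumed placeholder.
import torch
from transformers import AutoTokenizer, PegasusModel

tokenizer = AutoTokenizer.from_pretrained("google/pegasus-large")  # assumed checkpoint
model = PegasusModel.from_pretrained("google/pegasus-large")

def semantic_encoding(text: str) -> torch.Tensor:
    inputs = tokenizer(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        enc = model.get_encoder()(**inputs)       # run the encoder only
    return enc.last_hidden_state                  # shape (1, p, hidden_dim)
```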
S3: and generating a component syntactic analysis structure tree of each sentence based on the text sentence subset, and converting the component syntactic analysis structure tree of each sentence into component syntactic structure serialization codes based on a span method.
A sentence's constituency parse tree divides the sentence into constituents, with larger constituents obtained by combining smaller ones. In this embodiment, Stanford CoreNLP is used to generate the constituency parse tree of each sentence. Stanford CoreNLP is a natural language processing toolkit that currently supports Arabic, Chinese, English, French, German, Spanish, and other languages, and integrates word segmentation, part-of-speech tagging, syntactic parsing, and other functions. Parsing the sentences with Stanford CoreNLP therefore yields a constituency parse tree for each sentence.
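One possible way to obtain the trees is nltk's client for a running Stanford CoreNLP server; the port and the assumption that the server was started with the Chinese models loaded are illustrative.

```python
# Sketch: constituency parsing via a CoreNLP server (assumed to be
# running locally with the Chinese models loaded).
from nltk.parse.corenlp import CoreNLPParser

parser = CoreNLPParser(url="http://localhost:9000")

def parse_sentence(tokens: list[str]):
    """Return the constituency parse tree (an nltk.Tree) of one sentence."""
    return next(parser.parse(tokens))

tree = parse_sentence(["我", "爱", "写", "代码", "。"])
tree.pretty_print()
```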
Directly building a model with grammatical information would require capturing and recording the structure of the entire constituency parse tree while distinguishing the content of any two different words, but an efficient method for doing so directly is lacking. This embodiment therefore uses an alternative: converting the constituency parse tree into a linearized structure-label sequence via a span-based method. FIG. 2 shows the linearization process for the example sentence "我爱写代码。" ("I love writing code.").
The original constituency parse tree is first obtained through Stanford CoreNLP, and then the two rightmost child nodes are recursively merged, converting the tree into a right binary tree. Next, the tree is represented as a span table and divided into five parts according to the right boundaries of its spans.
Definition: let W = (w_1, w_2, ..., w_n) be a sentence, and define (i, j) as the span from w_{i+1} to w_j, with 0 ≤ i ≤ j ≤ n. Given a sentence W and its constituency parse tree T, we call d = (d_1, d_2, ..., d_n) the linearization of T (i.e., the serialized constituency-structure encoding), where d_i ∈ {0, 1, ..., i−1} and (d_i, i) is the longest span in T that ends at i.
In the span table (c), the gray cells are all the left children of the right binary tree (b), including the root node, and the black cells are all the right children. The right boundaries of all left children do not repeat, and therefore correspond one-to-one with [1, n]; they can thus serve as the serialized indices, with the corresponding left boundaries as the serialized values. For example, in FIG. 2 the span (1, 4) of the span table represents the phrase "love writing code", and the value at index 4 of the serialized array d is d_4 = 1. Linearizing the span table in order yields the d_i sequence, which is the required serialized constituency-structure encoding.
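The binarization and linearization themselves can be sketched as follows, assuming parse trees are nltk.Tree objects; the helper names are illustrative, and the toy example mirrors the FIG. 2 sentence.

```python
# Sketch of the span-based linearization: right-binarize the tree, then
# record, for each left child (and the root), its left boundary at the
# index given by its right boundary.
from nltk import Tree

def right_binarize(tree):
    """Recursively merge the two rightmost children so that every node
    has at most two children (a right binary tree)."""
    if isinstance(tree, str):                     # leaf: a word
        return tree
    children = [right_binarize(c) for c in tree]
    while len(children) > 2:
        children = children[:-2] + [Tree("*", children[-2:])]
    return Tree(tree.label(), children)

def linearize(tree: Tree, n: int) -> list[int]:
    """Return d = (d_1, ..., d_n): d_i is the left boundary of the longest
    span ending at position i among left children (including the root)."""
    d = [0] * (n + 1)

    def visit(node, left, is_left):
        if isinstance(node, str):                 # a leaf spans one position
            right = left + 1
        else:
            right = left
            for k, child in enumerate(node):
                right = visit(child, right, k == 0)
        # A node is recorded after all its descendants, so an enclosing
        # (longer) span overwrites any nested span sharing its right
        # boundary -- exactly the "longest span ending at i" rule.
        if is_left:
            d[right] = left
        return right

    visit(tree, 0, True)                          # treat the root as a left child
    return d[1:]

# Toy example mirroring FIG. 2 ("我 爱 写 代码 。"):
t = Tree.fromstring("(IP (NP 我) (VP (VV 爱) (VP (VV 写) (NP 代码))) (PU 。))")
print(linearize(right_binarize(t), 5))            # [0, 1, 2, 1, 0]; note d_4 = 1
```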
S4: and inputting the text semantic information code and the constituent sentence method structure serialization code into an encoder together for integrated coding.
In this embodiment, the encoder adopts an attention-based semantic-structure encoder, a dual-encoder model. Encoder 1 uses the hidden states h_i (i = 1, 2, ..., p) from the text semantic information encoding produced by the generative pre-trained model PEGASUS. Encoder 2 uses a CNN (convolutional neural network) to extract textual grammar information and vectorize the serialized constituency parse-tree encoding. The feature information extracted by the CNN is fused with the PEGASUS hidden states to build the attention mechanism, fusing the text semantic information encoding with the span-based serialized constituency-structure encoding so that the integrated encoding contains the overall information of the text.
First, the text semantic information encoding and the serialized constituency-structure encoding are fused, as shown in the following formula:

h̃ = GLU(W[h; d] + b)

where h̃ denotes the final hidden state of the encoder, d denotes the serialized constituency-structure encoding, h denotes the hidden-state values of the PEGASUS model, GLU is the gated linear unit activation function, b denotes an offset, and W denotes a learnable parameter; b and W can be obtained through training.
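This fusion step can be sketched in PyTorch as follows, assuming the concatenation form GLU(W[h; d] + b) reconstructed above; F.glu halves the projected vector and gates one half with the sigmoid of the other.

```python
# Sketch of the GLU fusion of semantic states h and constituency
# encodings d; the concatenation form is an assumption.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SemanticStructureFusion(nn.Module):
    def __init__(self, sem_dim: int, syn_dim: int, out_dim: int):
        super().__init__()
        # project [h; d] to 2*out_dim so that GLU can split it in half
        self.proj = nn.Linear(sem_dim + syn_dim, 2 * out_dim)

    def forward(self, h: torch.Tensor, d: torch.Tensor) -> torch.Tensor:
        """h: (batch, n, sem_dim) semantic encoding; d: (batch, n, syn_dim)
        serialized constituency encoding; returns (batch, n, out_dim)."""
        return F.glu(self.proj(torch.cat([h, d], dim=-1)), dim=-1)
```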
and after fusion, obtaining a hidden state value containing the syntactic structure characteristics of the text. In the traditional Seq2Seq model, an encoder encodes the full text and generates a fixed context semantic vector C, and a decoder decodes C and outputs a final result. In the embodiment, the model adopts an attention mechanism to change the target data in a weighting manner, the attention medium of the encoder redistributes the attention of other words according to the input of the decoder at the current time t, and a context semantic vector C which changes constantly with the current word is generatedtThe attention mechanism formula is as follows:
Figure BDA0003547290020000054
Figure BDA0003547290020000061
Figure BDA0003547290020000062
in the formula, at,iExpress attention weight, by et,iCalculating a score;
Figure BDA0003547290020000063
representing the i-th hidden state of the encoder, St-1Representing the hidden state at a time t-1 at the decoder,
Figure BDA0003547290020000064
Wh、Vhall represent weight matrices and can be obtained by training. According to hidden state of encoder
Figure BDA0003547290020000065
And decoder hidden state St-1To calculate et,iScore according to et,iScore calculation attention weight at,iHiding each encoder state
Figure BDA0003547290020000066
And attention weight at,iMultiplying and then carrying out weighted summation to obtain a context semantic vector C of the current moment tt
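A PyTorch sketch of this additive attention, matching the formulas as reconstructed above (the decoder-state weight W_s is an assumed symbol), follows.

```python
# Sketch of the encoder attention: e_{t,i} = V_h^T tanh(W_h h~_i + W_s S_{t-1}),
# softmax over positions, then the weighted sum C_t.
import torch
import torch.nn as nn

class EncoderAttention(nn.Module):
    def __init__(self, hid_dim: int):
        super().__init__()
        self.W_h = nn.Linear(hid_dim, hid_dim, bias=False)
        self.W_s = nn.Linear(hid_dim, hid_dim, bias=False)
        self.V_h = nn.Linear(hid_dim, 1, bias=False)

    def forward(self, h_tilde: torch.Tensor, s_prev: torch.Tensor) -> torch.Tensor:
        """h_tilde: (batch, n, hid) encoder states; s_prev: (batch, hid)
        decoder state S_{t-1}; returns the context vector C_t: (batch, hid)."""
        e = self.V_h(torch.tanh(self.W_h(h_tilde) + self.W_s(s_prev).unsqueeze(1)))
        a = torch.softmax(e, dim=1)          # attention weights a_{t,i}
        return (a * h_tilde).sum(dim=1)      # C_t = sum_i a_{t,i} h~_i
```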
S5: and decoding the integrated codes transmitted by the encoder through a decoder to generate the text abstract.
In this embodiment, the decoder uses a unidirectional GRU network whose input is formed jointly from the output y_{t-1} of the decoder at the previous time t-1, the hidden state S_{t-1} of the decoder at time t-1, and the context semantic vector C_t at the current time t. The final hidden state h̃ of the encoder serves as the first input to the decoder. The unidirectional GRU network structure formulas are:

z_t = σ(W_z S_{t-1} + W_z C_t + W_z y_{t-1})
r_t = σ(W_r S_{t-1} + W_r C_t + W_r y_{t-1})
S̃_t = tanh(W_S̃ (r_t ⊙ S_{t-1}) + W_S̃ C_t + W_S̃ y_{t-1})
S_t = (1 − z_t) ⊙ S_{t-1} + z_t ⊙ S̃_t

where z_t and r_t denote the update gate and the reset gate respectively: the update gate decides which information to discard and which new information to add, and the reset gate decides the extent to which previous information is discarded. S̃_t is a new candidate vector computed from C_t, y_{t-1} and the previous state S_{t-1}, containing the context information and y_{t-1}. σ and tanh denote activation functions, and ⊙ denotes the element-wise product. W_z, W_r and W_S̃ are weight matrices obtained by training: W_z is the weight parameter of the update gate, W_r the weight parameter of the reset gate, and W_S̃ the weight parameter of S̃_t. S_t denotes the hidden state of the decoder at the current time t, and S̃_t its intermediate (candidate) state.
The output state at the current time is obtained through the decoder, and the position of the word in the vocabulary is obtained through a softmax layer, as in the formula:

P(y_t | y_1, y_2, ..., y_{t-1}, C_t) = softmax(S_t)
The hidden state S_t of the decoder is computed as shown below, where GRU denotes the gated recurrent unit decoder:

S_t = GRU(S_{t-1}, C_t, y_{t-1})
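One possible realization of this step uses nn.GRUCell with the concatenation [C_t; y_{t-1}] as input; this is a sketch under that assumption, not the patent's exact parameterization.

```python
# Sketch of one decoding step: S_t = GRU(S_{t-1}, C_t, y_{t-1}) followed
# by the softmax projection over the vocabulary.
import torch
import torch.nn as nn

class GRUDecoder(nn.Module):
    def __init__(self, emb_dim: int, hid_dim: int, vocab_size: int):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.cell = nn.GRUCell(hid_dim + emb_dim, hid_dim)
        self.out = nn.Linear(hid_dim, vocab_size)

    def step(self, y_prev, s_prev, c_t):
        """y_prev: (batch,) previous token ids; s_prev: (batch, hid) S_{t-1};
        c_t: (batch, hid) context C_t. Returns (log-probs, S_t)."""
        x = torch.cat([c_t, self.embed(y_prev)], dim=-1)
        s_t = self.cell(x, s_prev)                        # S_t
        log_p = torch.log_softmax(self.out(s_t), dim=-1)  # P(y_t | ..., C_t)
        return log_p, s_t
```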
and finally, generating an optimal solution by adopting a search algorithm. In the embodiment, the cluster searching algorithm is adopted to reduce the calculation complexity and improve the accuracy. The traditional decoder adopts a greedy search algorithm, namely, the maximum probability is selected in a probability matrix to generate a target word. The cluster searching algorithm is a heuristic graph searching algorithm, the cluster searching selects the first q maximum probabilities, and more candidate spaces are considered, so that a better generated result can be obtained. The present embodiment sets the width q of the bundle search to 10.
The invention realizes the long-text Chinese abstract generation task, addressing the problems that, in long-text summarization, an overly long text leaves the abstract's gist unclear and out of line with human language habits; it can play an important auxiliary role in the review of application texts. The core significance of the scheme is that the original grammatical structure of the text can be extracted to supervise the abstract generation process, improving abstract accuracy and readability. First, the generative pre-trained model PEGASUS, combining the GSG (Gap Sentences Generation) and MLM (Masked Language Model) objectives, extracts the text semantic information encoding, which contains the deep semantic information of the text. Meanwhile, the constituency parse tree of the text is obtained through Stanford CoreNLP, and the span-based method yields a serialized constituency-structure encoding that carries the text's syntactic structure and supervises the abstract generation process, making the generated abstract conform better to human language habits. The text semantic information encoding and the serialized constituency-structure encoding are then fed jointly into the attention-based encoder, which fuses the two: the attention mechanism computes keyword semantic vectors, combines them with the serialized constituency-structure encoding, and propagates them to the next term, so that the integrated encoding retains semantic information while containing syntactic-structure information. Finally, a beam-search-based unidirectional GRU network decodes the integrated encoding output by the encoder, improving accuracy while reducing computational complexity, and generates an abstract that better conforms to grammatical rules and has a more salient gist.
Any process or method description in a flow chart, or otherwise described herein, may be understood as representing a module, segment, or portion of code that comprises one or more executable instructions for implementing specific logical functions or steps of the process. Alternate implementations are included within the scope of the preferred embodiments of the present invention, in which functions may be executed out of the order shown or discussed, including substantially concurrently or in reverse order depending on the functionality involved, as would be understood by those reasonably skilled in the art.
Although embodiments of the present invention have been shown and described above, it is understood that the above embodiments are exemplary and should not be construed as limiting the present invention, and that variations, modifications, substitutions and alterations can be made to the above embodiments by those of ordinary skill in the art within the scope of the present invention.

Claims (7)

1. A Chinese abstract generation method based on constituency parsing, characterized by comprising the following steps:
preprocessing a document to obtain a text sentence subset;
obtaining the text semantic information encoding from the text sentence subset using a semantic extraction model;
generating a constituency parse tree for each sentence in the text sentence subset, and converting each tree into a serialized constituency-structure encoding via a span-based method;
feeding the text semantic information encoding and the serialized constituency-structure encoding jointly into an encoder for integrated encoding;
and decoding the integrated encoding output by the encoder with a decoder to generate the text abstract.
2. The Chinese abstract generation method based on constituency parsing of claim 1, wherein the semantic extraction model is a PEGASUS model.
3. The Chinese abstract generation method based on constituency parsing of claim 1, wherein the constituency parse tree of each sentence is generated using Stanford CoreNLP.
4. The Chinese abstract generation method based on constituency parsing of claim 1 or 3, wherein converting each sentence's constituency parse tree into a serialized constituency-structure encoding via the span-based method comprises:
for each sentence's constituency parse tree, recursively merging the two rightmost child nodes to convert the tree into a right binary tree;
representing the resulting right binary tree as a span table;
dividing the span table into n parts according to the right boundaries of its spans, where n is the sentence length; all left children in the binary tree, including the root node, are distributed over the n parts, and their right boundaries correspond one-to-one with the values in [1, n]; using each right boundary as the serialized index and the corresponding left boundary as the serialized value yields the serialized constituency-structure encoding after span-table linearization.
5. The Chinese abstract generation method based on constituency parsing of claim 1, wherein the encoder is an attention-based semantic-structure encoder that first fuses the text semantic information encoding and the serialized constituency-structure encoding, as shown in the following formula:

h̃ = GLU(W[h; d] + b)

where h̃ denotes the final hidden state of the encoder, d denotes the serialized constituency-structure encoding, h denotes the text semantic information encoding, GLU is the gated linear unit activation function, b denotes an offset, and W denotes a learnable parameter;

the attention mechanism of the encoder redistributes attention over the source words according to the decoder input at the current time t and generates a context semantic vector C_t that changes with the current word; the attention mechanism formulas are as follows:

e_{t,i} = V_h^T tanh(W_h h̃_i + W_s S_{t-1})
a_{t,i} = exp(e_{t,i}) / Σ_{j=1}^{n} exp(e_{t,j})
C_t = Σ_{i=1}^{n} a_{t,i} h̃_i

where a_{t,i} denotes the attention weight, computed from the score e_{t,i}; h̃_i denotes the i-th hidden state of the encoder; S_{t-1} denotes the hidden state of the decoder at time t-1; n denotes the sentence length; and W_s, W_h, V_h each denote a weight matrix.
6. The method of claim 5, wherein the decoder uses a unidirectional GRU network whose input is formed jointly from the output y_{t-1} of the decoder at the previous time t-1, the hidden state S_{t-1} of the decoder at time t-1, and the context semantic vector C_t at the current time t; the final hidden state h̃ of the encoder serves as the first input to the decoder; the unidirectional GRU network structure formulas are:

z_t = σ(W_z S_{t-1} + W_z C_t + W_z y_{t-1})
r_t = σ(W_r S_{t-1} + W_r C_t + W_r y_{t-1})
S̃_t = tanh(W_S̃ (r_t ⊙ S_{t-1}) + W_S̃ C_t + W_S̃ y_{t-1})
S_t = (1 − z_t) ⊙ S_{t-1} + z_t ⊙ S̃_t

where z_t and r_t denote the update gate and the reset gate respectively; S̃_t is a new candidate vector computed from C_t, y_{t-1} and the previous state S_{t-1}, containing the context information and y_{t-1}; σ and tanh denote activation functions; ⊙ denotes the element-wise product; W_z is the weight parameter of the update gate, W_r the weight parameter of the reset gate, and W_S̃ the weight parameter of S̃_t; S_t denotes the hidden state of the decoder at the current time t;

the position of the word in the vocabulary is obtained through a softmax layer, as in the following formula:

P(y_t | y_1, y_2, ..., y_{t-1}, C_t) = softmax(S_t)

the hidden state S_t of the decoder is as shown below, where GRU denotes the gated recurrent unit decoder:

S_t = GRU(S_{t-1}, C_t, y_{t-1})

and finally, a search algorithm is adopted to generate the optimal output.
7. The Chinese abstract generation method of claim 6, wherein the search algorithm is a beam search algorithm.
CN202210252383.6A 2022-03-15 2022-03-15 Chinese abstract generation method based on constituency parsing Pending CN114580376A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210252383.6A CN114580376A (en) Chinese abstract generation method based on constituency parsing

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210252383.6A CN114580376A (en) Chinese abstract generation method based on constituency parsing

Publications (1)

Publication Number Publication Date
CN114580376A 2022-06-03

Family

ID=81775880

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210252383.6A Pending CN114580376A (en) Chinese abstract generation method based on constituency parsing

Country Status (1)

Country Link
CN (1) CN114580376A (en)


Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117874222A (en) * 2024-03-13 2024-04-12 中国石油大学(华东) Abstract text defense method based on semantic consistency
CN117874222B (en) * 2024-03-13 2024-05-17 中国石油大学(华东) Abstract text defense method based on semantic consistency


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination