
CN113688230B - Text abstract generation method and system - Google Patents

Text abstract generation method and system

Info

Publication number
CN113688230B
CN113688230B (application number CN202110824660.1A)
Authority
CN
China
Prior art keywords
text
abstract
texts
model
generated
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110824660.1A
Other languages
Chinese (zh)
Other versions
CN113688230A (en)
Inventor
张宇 (Zhang Yu)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Wuhan Zhongzhi Digital Technology Co ltd
Original Assignee
Wuhan Zhongzhi Digital Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Wuhan Zhongzhi Digital Technology Co ltd filed Critical Wuhan Zhongzhi Digital Technology Co ltd
Priority to CN202110824660.1A priority Critical patent/CN113688230B/en
Publication of CN113688230A publication Critical patent/CN113688230A/en
Application granted granted Critical
Publication of CN113688230B publication Critical patent/CN113688230B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F 16/34 Browsing; Visualisation therefor
    • G06F 16/345 Summarisation for human users
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F 16/35 Clustering; Classification
    • G06F 16/353 Clustering; Classification into predefined classes
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/23 Clustering techniques
    • G06F 18/232 Non-hierarchical techniques
    • G06F 18/2321 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • G06F 18/23213 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions with fixed number of clusters, e.g. K-means clustering
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/20 Natural language analysis
    • G06F 40/279 Recognition of textual entities
    • G06F 40/289 Phrasal analysis, e.g. finite state techniques or chunking

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Probability & Statistics with Applications (AREA)
  • Machine Translation (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

A method and system for generating text abstracts. The texts whose abstracts are to be generated are formed into a set and preprocessed; the preprocessed texts are clustered according to a preset first rule to obtain sets of texts of different categories; a text abstract generation model generates an abstract for each category of texts; and the generated text abstracts are merged and output. By introducing domain data and clustering multi-topic or multi-fact texts, the method alleviates the overflow of domain proper nouns and the poor abstracts that text generation models otherwise produce for texts describing multiple topics or facts.

Description

Text abstract generation method and system
Technical Field
The invention relates to the field of text generation, in particular to a method and a system for generating a text abstract.
Background
Text abstract generation technology efficiently produces condensed abstract texts, makes it easy to acquire useful information quickly, and has high commercial value. Existing text abstract generation methods are mainly extractive or abstractive. The former selects key sentences from the text to form its abstract; the generated abstract is highly readable but contains much redundant information. The latter imitates the human way of thinking through deep neural networks, so the generated abstract is more novel and concise. However, when a text generation model trained on an open domain is migrated to a specific domain, domain proper nouns overflow its vocabulary. In addition, a single text often describes multiple facts or topics, so enabling a text abstract generation system to produce readable and short abstracts for multi-topic or multi-fact texts remains a problem to be solved.
Disclosure of Invention
The present invention has been made in view of the above problems, and its object is to provide a method and system for generating a text abstract that overcomes, or at least partially solves, the above problems.
In order to solve the technical problems, the embodiment of the application discloses the following technical scheme:
a method of text summary generation, comprising:
S100, forming the texts whose abstracts are to be generated into a set, and preprocessing the texts;
S200, clustering the preprocessed texts according to a preset first rule to obtain sets of texts of different categories;
S300, performing abstract generation on the different categories of text sets by adopting a text abstract generation model;
S400, merging and outputting the generated text abstracts.
Further, in S100, preprocessing the text comprises performing a cleaning operation on the text to remove useless characters, at least including stop words and modal particles.
Further, in S200, clustering the preprocessed texts according to a preset first rule includes: first performing word segmentation on the texts to be clustered; then constructing a feature matrix of the texts to be clustered from the number of occurrences of each character segment in each text, and performing a feature extraction operation on this matrix to obtain a low-rank representation of the original feature matrix; and finally inputting the feature matrix into a K-Means clustering model and iterating until convergence to obtain a clustering result matrix.
Further, in S300, text abstracts are generated by category through a pre-trained domain language model and a text abstract generation model.
Further, the workflow of the pre-trained domain language model is as follows: a Mask language model is trained on domain data using the general language model BERT. In the Mask language model training process, before a text is input into the encoder of the BERT model, words in the text are randomly replaced with the mask token [MASK] at a certain proportion; the text is then encoded by BERT, the words replaced by [MASK] are predicted, and the cross-entropy loss between the predicted words and the original words is computed; this is repeated until the model converges, yielding the first pre-trained domain language model. On the basis of the first pre-trained language model, fine-tuning is performed on domain-annotated training data for downstream tasks to obtain the second pre-trained domain language model.
Further, the text abstract generation model consists of an encoding component and a decoding component, each of which can be split into a plurality of encoders and decoders; the encoding component is the pre-trained domain language model, and the parameters of the decoding component are obtained by training on the text abstract generation task.
Further, the workflow of the encoder and decoder in generating the abstract text is: the text whose abstract is to be generated is tokenized and converted to obtain the vector representation of the input text; the vector representation of the input text is fed into the encoding component of the text abstract model for encoding to obtain the semantic representation of the abstract text to be generated; and the semantic representation of the abstract text to be generated is fed into the decoding component for decoding to obtain the generated abstract text.
Further, feeding the vector representation of the input text into the encoding component of the text abstract model specifically comprises: the vector representation of the input text is encoded by a multi-head attention layer, a summation-and-normalization layer, a fully connected layer, and another summation-and-normalization layer in the encoder to obtain the first text encoded representation.
Further, feeding the semantic representation of the abstract text to be generated into the decoding component to obtain the generated abstract text specifically comprises: a vector representation is first computed for the decoding start token <START>, and this is then input into the decoding component together with the semantic representation of the abstract text to be generated; the decoding computation passes in turn through a masked multi-head attention layer, a summation-and-normalization layer, a fully connected layer, and another summation-and-normalization layer to obtain the first text decoded representation.
The invention also discloses a system for generating the text abstract, which comprises: the system comprises a text data preprocessing module, a text clustering module, a text abstract generating module and a text abstract merging output module; wherein:
The text data preprocessing module is used for forming texts of the abstract to be generated into a set and preprocessing the texts;
The text clustering module is used for clustering the preprocessed texts according to a preset first rule to obtain a set of texts with different categories;
The text abstract generating module is used for generating text abstracts by category from the sets of texts of different categories;
And the text abstract merging output module is used for merging and outputting the generated text abstracts.
The technical scheme provided by the embodiment of the invention has the beneficial effects that at least:
The invention discloses a method for generating a text abstract, which comprises forming the texts whose abstracts are to be generated into a set and preprocessing the texts; clustering the preprocessed texts according to a preset first rule to obtain sets of texts of different categories; generating text abstracts by category from the sets of texts of different categories; and merging and outputting the generated text abstracts. By introducing domain data and clustering multi-topic or multi-fact texts, the method overcomes the overflow of domain proper nouns and the unsatisfactory abstracts that text generation models otherwise produce for texts describing multiple topics or facts.
The technical scheme of the invention is further described in detail through the drawings and the embodiments.
Drawings
The accompanying drawings are included to provide a further understanding of the invention and are incorporated in and constitute a part of this specification, illustrate the invention and together with the embodiments of the invention, serve to explain the invention. In the drawings:
FIG. 1 is a flowchart of a method for generating a text abstract according to embodiment 1 of the invention;
FIG. 2 is a flowchart of generating the pre-trained language model in embodiment 1 of the present invention;
FIG. 3 is a structural diagram of the text abstract generation model in embodiment 2 of the present invention;
FIG. 4 is a structural diagram of the encoder and decoder in embodiment 2 of the present invention.
Detailed Description
Exemplary embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.
In order to solve the problems in the prior art, the embodiment of the invention provides a method and a system for generating a text abstract.
Example 1
A method of text abstract generation, as shown in fig. 1, comprising:
S100, forming the texts whose abstracts are to be generated into a set, and preprocessing the texts. Specifically, the text data preprocessing module mainly performs a cleaning operation on the texts in the input text set to remove useless characters, such as stop words and modal particles.
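As an illustration only, this cleaning step could look roughly like the following Python sketch; the stop-word/modal-particle set and the regular expression are assumptions for the sketch, since the patent does not fix them.

```python
import re

# Illustrative stop-word / modal-particle set; a real system would load a domain-specific list.
STOP_WORDS = {"的", "了", "吗", "呢", "啊", "哦"}

def clean_text(text: str) -> str:
    """Cleaning operation of S100: strip punctuation and other useless characters,
    then drop single-character stop words and modal particles."""
    text = re.sub(r"[^\w\u4e00-\u9fa5]+", "", text)              # keep word characters and CJK only
    return "".join(ch for ch in text if ch not in STOP_WORDS)    # remove stop/modal characters

def preprocess(texts):
    """Form the texts whose abstracts are to be generated into a collection and clean each one."""
    return [clean_text(t) for t in texts]
```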
S200, clustering the preprocessed texts according to a preset first rule to obtain sets of texts of different categories. The purpose of text clustering is to group together sentences that describe the same facts; K-Means is generally used for this. After word segmentation, each sentence is represented as a vector whose length is the dictionary size: each word's position in the dictionary indexes a feature, and the number of times the word occurs in the sentence is the feature value. A clustering algorithm is then applied to the set of sentence vectors to obtain the clustered texts.
In this embodiment, clustering the preprocessed texts according to the preset first rule includes: first performing word segmentation on the texts to be clustered; then constructing a feature matrix of the texts to be clustered from the number of occurrences of each character segment in each text, and performing a feature extraction operation on this matrix to obtain a low-rank representation of the original feature matrix; and finally inputting the feature matrix into a K-Means clustering model and iterating until convergence to obtain a clustering result matrix.
Specifically, for the word segmentation operation, the JIEBA word segmenter, assisted by a domain proper-noun dictionary, is used to segment the texts to be clustered, and n-gram features of the text are introduced. Here, n may be set according to actual requirements.
Taking the text 'resource library table creation and data access' as an example, the result from the JIEBA word segmenter is 'resource library, table, creation, data and access'; the 1-gram feature words are: 'resources, sources, libraries, tables, creation, data, access'; the 2-gram feature words are: 'resources, source libraries, library tables, table creation, number, data access, access'. The final word segmentation result is the combination of the JIEBA segmenter output, the 1-gram feature words and the 2-gram feature words, namely: 'resource library, table, create, and, data, access, resource, source, library, table, create, and, data, access, resource, source library, library table, table creation, create, and, data, access'.
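A minimal sketch of this segmentation-plus-n-gram step using the jieba library is given below; the user-dictionary path and the restriction to character 1- and 2-grams are illustrative assumptions, not details fixed by the patent.

```python
import jieba

# jieba.load_userdict("domain_terms.txt")  # hypothetical path to the domain proper-noun dictionary

def char_ngrams(text: str, n: int):
    """Character n-gram segments of the text, e.g. 2-grams of 'ABCD' -> ['AB', 'BC', 'CD']."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def segment_with_ngrams(text: str, ns=(1, 2)):
    """Combine the JIEBA tokens with character n-gram feature words, as in the example above."""
    tokens = jieba.lcut(text)
    for n in ns:
        tokens += char_ngrams(text, n)
    return tokens
```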
For the text vector representation, after word segmentation has been carried out on the texts to be clustered, a feature matrix of the texts to be clustered can be constructed from the number of occurrences of each character segment in each text. Each row of the feature matrix is the vector representation of one text to be clustered, and each column is one feature, namely the number of times a given dictionary word appears in each text. For example, the feature matrix X is an n×m matrix, where n is the number of texts to be clustered and m is the length of the dictionary; the value in the nth row and mth column is the number of times the mth dictionary word appears in the nth text.
Assume there are 2 texts to be clustered: text 1, 'resource library table creation and data access', and text 2, 'write population-related table cleaning scripts'. Word segmentation is performed on both texts (the tokens below are character segments of the original Chinese texts rendered literally in English). The segmentation result of text 1 is 'resource library, table, creation, and, data, access, resource, source, library, table, creation, and, data, access, resource, source library, library table, table creation, and, number, data, access'; the segmentation result of text 2 is 'write, population, mouth phase, correlation, table closing, table cleaning, washing, foot washing, script, write, population, correlation, table, washing, script, writing, person, mouth, phase, table closing, table, cleaning, foot washing, book'. The segmentation results of the two texts contain 42 tokens in total, namely: 'table creation, washing, data access, resource library, library table, creation, resource, cleaning, closing, building, source, population, composition, foot, creation, resource, correlation, table cleaning, data, and number, person, book, write, source library, mouth phase, table closing, table, access, phase, entry, build, mouth, library, access, and write, cleaning, foot washing, script, data, write'. The feature matrix X of the texts to be clustered can then be expressed as follows:
X=[[1,0,1,1,1,1,2,1,0,0,1,1,0,0,0,1,1,0,0,1,1,0,0,0,1,0,0,2,2,0,1,1,0,1,1,2,0,0,0,0,2,0],[0,1,0,0,0,0,0,0,1,1,0,0,2,1,1,0,0,2,1,0,0,1,1,1,0,1,1,2,0,1,0,0,1,0,0,0,1,2,1,2,0,2]].
where the first row of the feature matrix is the representation vector of text 1 and the second row is the representation vector of text 2.
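Continuing the previous sketch, the feature matrix X described above could be built with scikit-learn's CountVectorizer, reusing segment_with_ngrams as the analyzer; texts_to_cluster is a placeholder for the cleaned texts produced in S100.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Each column of X corresponds to one dictionary entry (a JIEBA token or a character
# n-gram), and X[i, j] is the number of times that entry occurs in text i.
vectorizer = CountVectorizer(analyzer=segment_with_ngrams)
X = vectorizer.fit_transform(texts_to_cluster)   # n texts x m dictionary entries (sparse matrix)
```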
For text feature extraction: once the feature matrix of the texts to be clustered has been obtained, it could be fed directly into a clustering model. In practice, however, the dictionary tends to be large, so the resulting text representation vectors are high-dimensional and sparse. Therefore, before clustering, a feature extraction operation can be performed on the feature matrix of the texts to be clustered to obtain a low-rank representation of the original feature matrix. This yields a more meaningful feature representation of the texts to be clustered, and reducing the feature dimension also speeds up the clustering computation.
The feature extraction and dimension reduction processing of the text feature matrix to be clustered can use a preset dimension reduction algorithm, such as NMF (Non-negative Matrix Factorization). For an n×m feature matrix, after dimension reduction by NMF, an n×r feature representation matrix S can be obtained, where r is far smaller than m.
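A brief sketch of this dimension-reduction step with scikit-learn's NMF follows; the reduced dimension r = 16 is an illustrative choice, not a value given in the patent.

```python
from sklearn.decomposition import NMF

r = 16                                     # illustrative reduced dimension, with r << m
nmf = NMF(n_components=r, init="nndsvda", max_iter=500)
S = nmf.fit_transform(X)                   # n x r low-rank representation of the n x m feature matrix
```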
For text clustering, the clustering algorithm operates on the feature matrix S obtained after text feature extraction; each row of S is the feature representation of one text to be clustered. The feature matrix S is input into a K-Means++ clustering model and iterated until convergence to obtain the clustering result matrix C, in which the value in the ith row and jth column represents the probability that the ith sample belongs to the jth category. The number of clusters is determined automatically by the elbow method (selecting, within a certain range of cluster counts, the number of clusters with the smallest distance from the samples to their cluster centers).
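The following sketch shows one common way to implement an elbow selection together with K-Means++ clustering in scikit-learn; the candidate range of cluster counts and the particular elbow heuristic (largest drop-off in the inertia curve) are assumptions made for the sketch.

```python
from sklearn.cluster import KMeans

def choose_k_by_elbow(S, k_min=2, k_max=10):
    """Fit K-Means for each candidate k and pick the k after which the within-cluster
    sum of squared distances (inertia) stops dropping sharply."""
    inertias = [KMeans(n_clusters=k, init="k-means++", n_init=10, random_state=0)
                .fit(S).inertia_ for k in range(k_min, k_max + 1)]
    drops = [inertias[i] - inertias[i + 1] for i in range(len(inertias) - 1)]
    elbow = max(range(1, len(drops)), key=lambda i: drops[i - 1] - drops[i])
    return k_min + elbow

k = choose_k_by_elbow(S)
labels = KMeans(n_clusters=k, init="k-means++", n_init=10, random_state=0).fit_predict(S)
# Texts sharing a label form one category; abstracts are then generated per category in S300.
```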
S300, generating text abstracts by category from the sets of texts of different categories.
In this embodiment, the text summaries are generated by category through a pre-training domain language model and a text summary generation model.
Specifically, for the pre-trained domain language model, unsupervised mask language model training and supervised domain task training are performed on domain data using the general language model BERT to obtain the domain language model.
A mask language model (Masked Language Model, MLM) is trained on the domain data using the general language model BERT, whose structure is a stack of 12 encoder layers. The training procedure of the domain pre-trained language model is shown in fig. 2: before a text is input into the encoder of the BERT model, words in the text are randomly replaced with the mask token [MASK] at a certain proportion; BERT encoding is then performed, the words replaced by [MASK] are predicted, and the cross-entropy loss between the predicted words and the original words is computed; this cycle is repeated until the model converges, yielding the first pre-trained domain language model. The structure of the first pre-trained language model is the same as that of the original BERT model, but its parameters have been learned iteratively.
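As a hedged illustration, such mask-language-model training could be run with the HuggingFace transformers library roughly as follows; domain_texts is a placeholder for the unlabeled domain corpus, and the checkpoint name, masking ratio, and learning rate are illustrative choices rather than values from the patent.

```python
import torch
from transformers import BertTokenizerFast, BertForMaskedLM, DataCollatorForLanguageModeling

tokenizer = BertTokenizerFast.from_pretrained("bert-base-chinese")
model = BertForMaskedLM.from_pretrained("bert-base-chinese")   # 12-layer BERT encoder with an MLM head
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True, mlm_probability=0.15)
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)

model.train()
for text in domain_texts:                                       # domain_texts: unlabeled domain corpus (placeholder)
    enc = tokenizer(text, truncation=True, max_length=512, return_tensors="pt")
    batch = collator([{k: v.squeeze(0) for k, v in enc.items()}])  # randomly replaces a proportion of tokens with [MASK]
    loss = model(**batch).loss                                  # cross-entropy between predicted and original words
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```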
Fine-tuning is then performed, on the basis of the first pre-trained language model, on domain-annotated training data for downstream tasks to obtain the second pre-trained domain language model. The downstream tasks are constructed according to domain requirements and generally include a named entity recognition task, a text classification task, and the like. The second pre-trained domain language model is the final pre-trained domain language model. The improvement over the general language model is that domain data are introduced for retraining on top of the general language model, and domain-oriented task training at different text granularities is introduced, for example named entity recognition and text classification.
For the text abstract generation model, its structure is shown in fig. 3: it consists of an encoding component and a decoding component, each of which can be split into a plurality of encoders and decoders. The encoding component is the pre-trained domain language model, the parameters of the decoding component are obtained by training on the text abstract generation task, and the structures of the encoder and decoder are shown in fig. 4.
The method forms training data from text-abstract pairs under a Seq2Seq framework and thereby realizes a system that automatically generates text abstracts. The text abstract model first encodes the input text whose abstract is to be generated through the encoding component to obtain an intermediate semantic vector; the intermediate semantic vector is then decoded by the decoding component to generate the corresponding abstract text. The specific steps are as follows:
a. The text whose abstract is to be generated is tokenized and converted to obtain the vector representation of the input text.
The input text is segmented by a tokenizer to obtain its word index list (token_ids) and multi-sentence separation identification list (segment_ids). A word embedding operation is then performed on token_ids to obtain the first vector representation of the input text, of size [batch_size, seq_length, embedding_size]. Next, a position-encoded representation of the text, also of size [batch_size, seq_length, embedding_size], is computed; finally, the first vector representation of the text is summed with the position-encoded representation to obtain the vector representation of the input text.
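A minimal PyTorch sketch of this input-representation step is shown below; the use of a learned position embedding is an assumption, since the patent does not state how the position encoding is computed.

```python
import torch
import torch.nn as nn

class InputRepresentation(nn.Module):
    """Word embedding of token_ids plus a learned position encoding, both of size embedding_size."""
    def __init__(self, vocab_size: int, max_len: int, embedding_size: int):
        super().__init__()
        self.token_emb = nn.Embedding(vocab_size, embedding_size)
        self.pos_emb = nn.Embedding(max_len, embedding_size)

    def forward(self, token_ids):                                # token_ids: [batch_size, seq_length]
        positions = torch.arange(token_ids.size(1), device=token_ids.device)
        # First vector representation of the text + position-encoded representation of the text.
        return self.token_emb(token_ids) + self.pos_emb(positions)[None, :, :]
        # result: [batch_size, seq_length, embedding_size]
```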
B. The vector representation of the input text is fed into the encoding component of the text abstract model for encoding to obtain the semantic representation of the abstract text to be generated.
The vector representation of the input text is encoded by a multi-head attention layer, a summation-and-normalization layer, a fully connected layer, and another summation-and-normalization layer in the encoder to obtain the first text encoded representation; the internal structure of the encoder is shown in fig. 4. Encoding proceeds through the L encoders in sequence, and the final, Lth text encoded representation is the semantic representation of the abstract text to be generated.
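One encoder block of this kind could be sketched in PyTorch as follows; the dimensions (768 hidden units, 12 heads, 3072 feed-forward units) are illustrative BERT-base values, not figures taken from the patent.

```python
import torch.nn as nn

class EncoderLayer(nn.Module):
    """One encoder block: multi-head attention -> summation and normalization ->
    fully connected layer -> summation and normalization."""
    def __init__(self, d_model: int = 768, n_heads: int = 12, d_ff: int = 3072):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x, pad_mask=None):                         # x: [batch_size, seq_length, d_model]
        attn_out, _ = self.attn(x, x, x, key_padding_mask=pad_mask)
        x = self.norm1(x + attn_out)
        return self.norm2(x + self.ffn(x))

# Passing the input vector representation through L such layers in sequence yields the
# semantic representation of the abstract text to be generated.
```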
C. The semantic representation of the abstract text to be generated is fed into the decoding component for decoding to obtain the generated abstract text.
A vector representation is first computed for the decoding start token <START>, and this is then input into the decoding component together with the semantic representation of the abstract text to be generated. The decoding computation passes in turn through a masked multi-head attention layer, a summation-and-normalization layer, a fully connected layer, and another summation-and-normalization layer to obtain the first text decoded representation; the internal structure of the decoder is shown in fig. 4. Decoding proceeds through the L decoders in sequence to obtain the final, Lth text decoded representation; a fully connected layer then yields the probability distribution over the dictionary at the current step t, and the word with the highest probability is taken as the output of step t. The output word representation of step t, together with the word representations of the preceding steps and the semantic representation of the abstract text to be generated, is fed back into the decoding component to predict the output of step t+1, until the output word is <EOS> or the preset text generation length is reached; the decoding process then terminates, and the output words obtained along the way are combined in order into the abstract of the current input text.
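The step-by-step greedy decoding loop could be sketched as below; model.encode, model.decode, tokenizer.start_id, and tokenizer.eos_id are hypothetical interfaces introduced only for this illustration, not APIs named by the patent or any specific library.

```python
import torch

@torch.no_grad()
def generate_abstract(model, tokenizer, text, max_len=128):
    """Greedy decoding sketch: start from <START>, repeatedly pick the highest-probability
    word, and stop at <EOS> or when the preset generation length is reached."""
    memory = model.encode(tokenizer.encode(text))        # semantic representation from the L encoders
    out_ids = [tokenizer.start_id]                       # decoding start token <START>
    for _ in range(max_len):
        logits = model.decode(out_ids, memory)           # shape: [len(out_ids), vocab_size]
        next_id = int(logits[-1].argmax())               # word with the highest probability at step t
        if next_id == tokenizer.eos_id:                  # <EOS> terminates decoding
            break
        out_ids.append(next_id)
    return tokenizer.decode(out_ids[1:])                 # combine the output words into the abstract
```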
S400, merging and outputting the generated text abstracts. After abstracts have been generated for the several small sentence sets, the results are merged and the final abstract text is output. Ideally the abstracts generated from different sentence sets differ from one another, so mutually exclusive sentences are selected and sentences with high similarity are filtered out before the final output. The merging can be done by direct manual screening or by automatic machine de-duplication.
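A simple sketch of the automatic de-duplication variant, using only the Python standard library, might look like this; the similarity measure and the 0.7 threshold are assumptions, as the patent does not specify them.

```python
from difflib import SequenceMatcher

def merge_abstracts(sentences, threshold=0.7):
    """Keep a sentence only if it is not too similar to one already kept,
    then concatenate the kept sentences into the final abstract output."""
    kept = []
    for sent in sentences:
        if all(SequenceMatcher(None, sent, prev).ratio() < threshold for prev in kept):
            kept.append(sent)
    return "".join(kept)
```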
The embodiment also discloses a system for generating the text abstract, which comprises the following steps: the system comprises a text data preprocessing module, a text clustering module, a text abstract generating module and a text abstract merging output module; wherein:
The text data preprocessing module is used for forming texts of the abstract to be generated into a set and preprocessing the texts;
The text clustering module is used for clustering the preprocessed texts according to a preset first rule to obtain a set of texts with different categories;
The text abstract generating module is used for generating text abstracts by category from the sets of texts of different categories;
And the text abstract merging output module is used for merging and outputting the generated text abstracts.
The specific workflow and structure of the text data preprocessing module, the text clustering module, the text abstract generating module and the text abstract merging output module are described in detail in the foregoing, and are not described in detail herein.
According to the method and system for generating text abstracts disclosed in this embodiment, the texts whose abstracts are to be generated are formed into a set and preprocessed; the preprocessed texts are clustered according to a preset first rule to obtain sets of texts of different categories; text abstracts are generated by category from the sets of texts of different categories; and the generated text abstracts are merged and output. By introducing domain data and clustering multi-topic or multi-fact texts, the method overcomes the overflow of domain proper nouns and the unsatisfactory abstracts that text generation models otherwise produce for texts describing multiple topics or facts.
It should be understood that the specific order or hierarchy of steps in the processes disclosed are examples of exemplary approaches. Based on design preferences, it is understood that the specific order or hierarchy of steps in the processes may be rearranged without departing from the scope of the present disclosure. The accompanying method claims present elements of the various steps in a sample order, and are not meant to be limited to the specific order or hierarchy presented.
In the foregoing detailed description, various features are grouped together in a single embodiment for the purpose of streamlining the disclosure. This method of disclosure is not to be interpreted as reflecting an intention that the claimed embodiments of the subject matter require more features than are expressly recited in each claim. Rather, as the following claims reflect, invention lies in less than all features of a single disclosed embodiment. Thus the following claims are hereby expressly incorporated into this detailed description, with each claim standing on its own as a separate preferred embodiment of this invention.
Those of skill would further appreciate that the various illustrative logical blocks, modules, circuits, and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both. To clearly illustrate this interchangeability of hardware and software, various illustrative components, blocks, modules, circuits, and steps have been described above generally in terms of their functionality. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the overall system. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present disclosure.
The steps of a method or algorithm described in connection with the embodiments disclosed herein may be embodied directly in hardware, in a software module executed by a processor, or in a combination of the two. A software module may reside in RAM memory, flash memory, ROM memory, EPROM memory, EEPROM memory, registers, hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art. An exemplary storage medium is coupled to the processor such that the processor can read information from, and write information to, the storage medium. In the alternative, the storage medium may be integral to the processor. The processor and the storage medium may reside in an ASIC. The ASIC may reside in a user terminal. The processor and the storage medium may reside as discrete components in a user terminal.
For a software implementation, the techniques described in this disclosure may be implemented with modules (e.g., procedures, functions, and so on) that perform the functions described herein. These software codes may be stored in memory units and executed by processors. The memory unit may be implemented within the processor or external to the processor, in which case it can be communicatively coupled to the processor via various means as is known in the art.
The foregoing description includes examples of one or more embodiments. It is, of course, not possible to describe every conceivable combination of components or methodologies for purposes of describing the aforementioned embodiments, but one of ordinary skill in the art may recognize that many further combinations and permutations of various embodiments are possible. Accordingly, the embodiments described herein are intended to embrace all such alterations, modifications and variations that fall within the scope of the appended claims. Furthermore, as used in the specification or claims, the term "including" is intended to be inclusive in a manner similar to the term "comprising", as interpreted when employed as a transitional word in a claim. Furthermore, any use of the term "or" in the specification or claims is intended to mean a "non-exclusive or".

Claims (3)

1. A method of text abstract generation, comprising:
S100, forming the texts whose abstracts are to be generated into a set, and preprocessing the texts;
S200, clustering the preprocessed texts according to a preset first rule to obtain sets of texts of different categories; in S200, clustering the preprocessed texts according to the preset first rule includes: first performing word segmentation on the texts to be clustered; then constructing a feature matrix of the texts to be clustered from the number of occurrences of each character segment in each text, and performing a feature extraction operation on this matrix to obtain a low-rank representation of the original feature matrix; and finally inputting the feature matrix into a K-Means clustering model and iterating until convergence to obtain a clustering result matrix;
S300, performing abstract generation on different types of text sets by adopting a text abstract generation model;
S400, merging and outputting the generated text abstracts;
In S300, text abstracts are generated by category through a pre-trained domain language model and a text abstract generation model; the workflow of the pre-trained domain language model is as follows: a Mask language model is trained on domain data using the general language model BERT; in the Mask language model training process, before a text is input into the encoder of the BERT model, words in the text are randomly replaced with the mask token [MASK] at a certain proportion, BERT encoding is then performed, the words replaced by [MASK] are predicted, and the cross-entropy loss between the predicted words and the original words is computed, repeating until the model converges to obtain the first pre-trained domain language model; on the basis of the first pre-trained language model, fine-tuning is performed on domain-annotated training data for downstream tasks to obtain the second pre-trained domain language model;
The text abstract generation model consists of an encoding component and a decoding component, each of which is split into a plurality of encoders and decoders; the encoding component is the pre-trained domain language model, and the parameters of the decoding component are obtained by training on the text abstract generation task; the workflow of the encoder and decoder in generating the abstract text is: the text whose abstract is to be generated is tokenized and converted to obtain the vector representation of the input text; the vector representation of the input text is fed into the encoding component of the text abstract model for encoding to obtain the semantic representation of the abstract text to be generated; the semantic representation of the abstract text to be generated is fed into the decoding component for decoding to obtain the generated abstract text; feeding the vector representation of the input text into the encoding component of the text abstract model comprises: the vector representation of the input text is encoded by a multi-head attention layer, a summation-and-normalization layer, a fully connected layer, and another summation-and-normalization layer in the encoder to obtain the first text encoded representation.
2. The method for generating a text abstract as claimed in claim 1, wherein in S100, preprocessing the text comprises performing a cleaning operation on the text to remove useless characters, at least including stop words and modal particles.
3. A system for generating a text abstract, using the method for generating a text abstract according to claim 1 or 2, comprising: a text data preprocessing module, a text clustering module, a text abstract generating module, and a text abstract merging output module; wherein:
The text data preprocessing module is used for forming texts of the abstract to be generated into a set and preprocessing the texts;
The text clustering module is used for clustering the preprocessed texts according to a preset first rule to obtain a set of texts with different categories;
The text abstract generating module is used for generating text abstracts by category from the sets of texts of different categories;
And the text abstract merging output module is used for merging and outputting the generated text abstracts.
CN202110824660.1A 2021-07-21 2021-07-21 Text abstract generation method and system Active CN113688230B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110824660.1A CN113688230B (en) 2021-07-21 2021-07-21 Text abstract generation method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110824660.1A CN113688230B (en) 2021-07-21 2021-07-21 Text abstract generation method and system

Publications (2)

Publication Number Publication Date
CN113688230A CN113688230A (en) 2021-11-23
CN113688230B true CN113688230B (en) 2024-07-26

Family

ID=78577596

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110824660.1A Active CN113688230B (en) 2021-07-21 2021-07-21 Text abstract generation method and system

Country Status (1)

Country Link
CN (1) CN113688230B (en)

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112417854A (en) * 2020-12-15 2021-02-26 北京信息科技大学 Chinese document abstraction type abstract method

Family Cites Families (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107861938B (en) * 2017-09-21 2020-09-25 北京三快在线科技有限公司 POI (Point of interest) file generation method and device and electronic equipment
CN108319668B (en) * 2018-01-23 2021-04-20 义语智能科技(上海)有限公司 Method and equipment for generating text abstract
EP3557451A1 (en) * 2018-04-19 2019-10-23 Siemens Aktiengesellschaft Method for determining output data for a plurality of text documents
CN108763211B (en) * 2018-05-23 2020-07-31 中国科学院自动化研究所 Automatic abstracting method and system fusing intrinsic knowledge
EP3620935A1 (en) * 2018-09-04 2020-03-11 Siemens Aktiengesellschaft System and method for natural language processing
US10810243B2 (en) * 2019-03-08 2020-10-20 Fuji Xerox Co., Ltd. System and method for generating abstractive summaries of interleaved texts
CN110597979B (en) * 2019-06-13 2023-06-23 中山大学 Self-attention-based generated text abstract method
CN110413768B (en) * 2019-08-06 2022-05-03 成都信息工程大学 Automatic generation method of article titles
CN110765264A (en) * 2019-10-16 2020-02-07 北京工业大学 Text abstract generation method for enhancing semantic relevance
CN110956021A (en) * 2019-11-14 2020-04-03 微民保险代理有限公司 Original article generation method, device, system and server
US11481418B2 (en) * 2020-01-02 2022-10-25 International Business Machines Corporation Natural question generation via reinforcement learning based graph-to-sequence model
CN111312356B (en) * 2020-01-17 2022-07-01 四川大学 Traditional Chinese medicine prescription generation method based on BERT and integration efficacy information
CN111966917B (en) * 2020-07-10 2022-05-03 电子科技大学 Event detection and summarization method based on pre-training language model
CN112562669B (en) * 2020-12-01 2024-01-12 浙江方正印务有限公司 Method and system for automatically abstracting intelligent digital newspaper and performing voice interaction chat
CN112765345A (en) * 2021-01-22 2021-05-07 重庆邮电大学 Text abstract automatic generation method and system fusing pre-training model

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112417854A (en) * 2020-12-15 2021-02-26 北京信息科技大学 Chinese document abstraction type abstract method

Also Published As

Publication number Publication date
CN113688230A (en) 2021-11-23

Similar Documents

Publication Publication Date Title
CN110348016B (en) Text abstract generation method based on sentence correlation attention mechanism
CN110413768B (en) Automatic generation method of article titles
CN112749274B (en) Chinese text classification method based on attention mechanism and interference word deletion
CN112395385B (en) Text generation method and device based on artificial intelligence, computer equipment and medium
CN111401084A (en) Method and device for machine translation and computer readable storage medium
CN112992125B (en) Voice recognition method and device, electronic equipment and readable storage medium
CN110569505A (en) text input method and device
CN114969304A (en) Case public opinion multi-document generation type abstract method based on element graph attention
CN111241820A (en) Bad phrase recognition method, device, electronic device, and storage medium
Moeng et al. Canonical and surface morphological segmentation for nguni languages
TWI734085B (en) Dialogue system using intention detection ensemble learning and method thereof
CN111985243A (en) Emotion model training method, emotion analysis device and storage medium
CN113961706A (en) Accurate text representation method based on neural network self-attention mechanism
CN113012822A (en) Medical question-answering system based on generating type dialogue technology
CN116049387A (en) Short text classification method, device and medium based on graph convolution
CN111814479B (en) Method and device for generating enterprise abbreviations and training model thereof
Cao Generating natural language descriptions from tables
CN114722832A (en) Abstract extraction method, device, equipment and storage medium
CN113688230B (en) Text abstract generation method and system
Serban et al. Text-based speaker identification for multi-participant opendomain dialogue systems
Zare et al. Deepnorm-a deep learning approach to text normalization
Yolchuyeva Novel NLP Methods for Improved Text-To-Speech Synthesis
CN111723584A (en) Punctuation prediction method based on consideration of domain information
CN117932487B (en) Risk classification model training and risk classification method and device
Sun et al. Text sentiment polarity classification method based on word embedding

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant