
CN113688230B - Text abstract generation method and system - Google Patents

Text abstract generation method and system

Info

Publication number
CN113688230B
CN113688230B (application number CN202110824660.1A)
Authority
CN
China
Prior art keywords
text
abstract
texts
model
generated
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110824660.1A
Other languages
Chinese (zh)
Other versions
CN113688230A (en)
Inventor
张宇 (Zhang Yu)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Wuhan Zhongzhi Digital Technology Co ltd
Original Assignee
Wuhan Zhongzhi Digital Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Wuhan Zhongzhi Digital Technology Co ltd filed Critical Wuhan Zhongzhi Digital Technology Co ltd
Priority to CN202110824660.1A priority Critical patent/CN113688230B/en
Publication of CN113688230A publication Critical patent/CN113688230A/en
Application granted granted Critical
Publication of CN113688230B publication Critical patent/CN113688230B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F 16/34 Browsing; Visualisation therefor
    • G06F 16/345 Summarisation for human users
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F 16/35 Clustering; Classification
    • G06F 16/353 Clustering; Classification into predefined classes
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/23 Clustering techniques
    • G06F 18/232 Non-hierarchical techniques
    • G06F 18/2321 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • G06F 18/23213 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions with fixed number of clusters, e.g. K-means clustering
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/20 Natural language analysis
    • G06F 40/279 Recognition of textual entities
    • G06F 40/289 Phrasal analysis, e.g. finite state techniques or chunking

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Probability & Statistics with Applications (AREA)
  • Machine Translation (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

A method and system for generating text abstracts. The texts whose abstracts are to be generated are formed into a set and preprocessed; the preprocessed texts are clustered according to a preset first rule to obtain sets of texts of different categories; a text abstract generation model generates an abstract for each category of texts; and the generated text abstracts are merged and output. By introducing domain data and clustering multi-topic or multi-fact texts, the method alleviates the overflow of domain proper nouns and the poor abstracts that text generation models otherwise produce for texts describing multiple topics or facts.

Description

Text abstract generation method and system
Technical Field
The invention relates to the field of text generation, in particular to a method and a system for generating a text abstract.
Background
Text abstract generation technology efficiently produces condensed abstract texts, makes it easy to acquire useful information quickly, and has high commercial value. Existing text abstract generation methods are mainly extractive or abstractive. The former selects key sentences from the text to form its abstract; the generated abstract is highly readable but contains much redundant information. The latter imitates the human way of thinking through deep neural networks, so the generated abstract is more novel and concise. However, when a text generation model trained on an open domain is migrated to a specific domain, domain proper nouns overflow its vocabulary. In addition, a single text often describes multiple facts or topics, so enabling a text abstract generation system to produce readable and short abstracts for multi-topic or multi-fact texts remains a problem to be solved.
Disclosure of Invention
The present invention has been made in view of the above problems, and its object is to provide a method and system for generating a text abstract that overcomes, or at least partially solves, the above problems.
In order to solve the technical problems, the embodiment of the application discloses the following technical scheme:
a method of text summary generation, comprising:
S100, forming the texts whose abstracts are to be generated into a set, and preprocessing the texts;
S200, clustering the preprocessed texts according to a preset first rule to obtain sets of texts of different categories;
S300, performing abstract generation on the different categories of text sets by adopting a text abstract generation model;
S400, merging and outputting the generated text abstracts.
Further, in S100, preprocessing the text comprises performing a cleaning operation on the text to remove useless characters, at least including stop words and modal particles.
Further, in S200, clustering the preprocessed texts according to a preset first rule includes: first performing word segmentation on the texts to be clustered; then constructing a feature matrix of the texts to be clustered from the number of occurrences of each character segment in each text, and performing a feature extraction operation on this matrix to obtain a low-rank representation of the original feature matrix; and finally inputting the feature matrix into a K-Means clustering model and iterating until convergence to obtain a clustering result matrix.
Further, in S300, text abstracts are generated by category through a pre-trained domain language model and a text abstract generation model.
Further, the workflow of the pre-trained domain language model is as follows: a Mask language model is trained on domain data using the general language model BERT. In the Mask language model training process, before a text is input into the encoder of the BERT model, words in the text are randomly replaced with the mask token [MASK] at a certain proportion; the text is then encoded by BERT, the words replaced by [MASK] are predicted, and the cross-entropy loss between the predicted words and the original words is computed; this is repeated until the model converges, yielding the first pre-trained domain language model. On the basis of the first pre-trained language model, fine-tuning is performed on domain-annotated training data for downstream tasks to obtain the second pre-trained domain language model.
Further, the text abstract generation model consists of an encoding component and a decoding component, each of which can be split into a plurality of encoders and decoders; the encoding component is the pre-trained domain language model, and the parameters of the decoding component are obtained by training on the text abstract generation task.
Further, the workflow of the encoder and decoder in generating the abstract text is: the text whose abstract is to be generated is tokenized and converted to obtain the vector representation of the input text; the vector representation of the input text is fed into the encoding component of the text abstract model for encoding to obtain the semantic representation of the abstract text to be generated; and the semantic representation of the abstract text to be generated is fed into the decoding component for decoding to obtain the generated abstract text.
Further, feeding the vector representation of the input text into the encoding component of the text abstract model specifically comprises: the vector representation of the input text is encoded by a multi-head attention layer, a summation-and-normalization layer, a fully connected layer, and another summation-and-normalization layer in the encoder to obtain the first text encoded representation.
Further, feeding the semantic representation of the abstract text to be generated into the decoding component to obtain the generated abstract text specifically comprises: a vector representation is first computed for the decoding start token <START>, and this is then input into the decoding component together with the semantic representation of the abstract text to be generated; the decoding computation passes in turn through a masked multi-head attention layer, a summation-and-normalization layer, a fully connected layer, and another summation-and-normalization layer to obtain the first text decoded representation.
The invention also discloses a system for generating the text abstract, which comprises: the system comprises a text data preprocessing module, a text clustering module, a text abstract generating module and a text abstract merging output module; wherein:
The text data preprocessing module is used for forming texts of the abstract to be generated into a set and preprocessing the texts;
The text clustering module is used for clustering the preprocessed texts according to a preset first rule to obtain a set of texts with different categories;
The text abstract generating module is used for generating text abstracts by category from the sets of texts of different categories;
And the text abstract merging output module is used for merging and outputting the generated text abstracts.
The technical scheme provided by the embodiment of the invention has the beneficial effects that at least:
The invention discloses a method for generating a text abstract, which comprises forming the texts whose abstracts are to be generated into a set and preprocessing the texts; clustering the preprocessed texts according to a preset first rule to obtain sets of texts of different categories; generating text abstracts by category from the sets of texts of different categories; and merging and outputting the generated text abstracts. By introducing domain data and clustering multi-topic or multi-fact texts, the method overcomes the overflow of domain proper nouns and the unsatisfactory abstracts that text generation models otherwise produce for texts describing multiple topics or facts.
The technical scheme of the invention is further described in detail through the drawings and the embodiments.
Drawings
The accompanying drawings are included to provide a further understanding of the invention and are incorporated in and constitute a part of this specification, illustrate the invention and together with the embodiments of the invention, serve to explain the invention. In the drawings:
FIG. 1 is a flowchart of a method for generating a text abstract according to embodiment 1 of the invention;
FIG. 2 is a flowchart of generating the pre-trained language model in embodiment 1 of the present invention;
FIG. 3 is a structural diagram of the text abstract generation model in embodiment 2 of the present invention;
FIG. 4 is a structural diagram of the encoder and decoder in embodiment 2 of the present invention.
Detailed Description
Exemplary embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.
In order to solve the problems in the prior art, the embodiment of the invention provides a method and a system for generating a text abstract.
Example 1
A method of text abstract generation, as shown in fig. 1, comprising:
S100, forming the texts whose abstracts are to be generated into a set, and preprocessing the texts. Specifically, the text data preprocessing module mainly performs a cleaning operation on the texts in the input text set to remove useless characters, such as stop words and modal particles.
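As an illustration only, this cleaning step could look roughly like the following Python sketch; the stop-word/modal-particle set and the regular expression are assumptions for the sketch, since the patent does not fix them.

```python
import re

# Illustrative stop-word / modal-particle set; a real system would load a domain-specific list.
STOP_WORDS = {"的", "了", "吗", "呢", "啊", "哦"}

def clean_text(text: str) -> str:
    """Cleaning operation of S100: strip punctuation and other useless characters,
    then drop single-character stop words and modal particles."""
    text = re.sub(r"[^\w\u4e00-\u9fa5]+", "", text)              # keep word characters and CJK only
    return "".join(ch for ch in text if ch not in STOP_WORDS)    # remove stop/modal characters

def preprocess(texts):
    """Form the texts whose abstracts are to be generated into a collection and clean each one."""
    return [clean_text(t) for t in texts]
```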
S200, clustering the preprocessed texts according to a preset first rule to obtain sets of texts of different categories. The purpose of text clustering is to group together sentences that describe the same facts; K-Means is generally used for this. After word segmentation, each sentence is represented as a vector whose length is the dictionary size: each word's position in the dictionary indexes a feature, and the number of times the word occurs in the sentence is the feature value. A clustering algorithm is then applied to the set of sentence vectors to obtain the clustered texts.
In this embodiment, clustering the preprocessed texts according to the preset first rule includes: first performing word segmentation on the texts to be clustered; then constructing a feature matrix of the texts to be clustered from the number of occurrences of each character segment in each text, and performing a feature extraction operation on this matrix to obtain a low-rank representation of the original feature matrix; and finally inputting the feature matrix into a K-Means clustering model and iterating until convergence to obtain a clustering result matrix.
Specifically, for the word segmentation operation, the JIEBA word segmenter, assisted by a domain proper-noun dictionary, is used to segment the texts to be clustered, and n-gram features of the text are introduced. Here, n may be set according to actual requirements.
Taking the text 'resource library table creation and data access' as an example, the result from the JIEBA word segmenter is 'resource library, table, creation, data and access'; the 1-gram feature words are: 'resources, sources, libraries, tables, creation, data, access'; the 2-gram feature words are: 'resources, source libraries, library tables, table creation, number, data access, access'. The final word segmentation result is the combination of the JIEBA segmenter output, the 1-gram feature words and the 2-gram feature words, namely: 'resource library, table, create, and, data, access, resource, source, library, table, create, and, data, access, resource, source library, library table, table creation, create, and, data, access'.
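A minimal sketch of this segmentation-plus-n-gram step using the jieba library is given below; the user-dictionary path and the restriction to character 1- and 2-grams are illustrative assumptions, not details fixed by the patent.

```python
import jieba

# jieba.load_userdict("domain_terms.txt")  # hypothetical path to the domain proper-noun dictionary

def char_ngrams(text: str, n: int):
    """Character n-gram segments of the text, e.g. 2-grams of 'ABCD' -> ['AB', 'BC', 'CD']."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def segment_with_ngrams(text: str, ns=(1, 2)):
    """Combine the JIEBA tokens with character n-gram feature words, as in the example above."""
    tokens = jieba.lcut(text)
    for n in ns:
        tokens += char_ngrams(text, n)
    return tokens
```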
For the text vector representation, after word segmentation has been carried out on the texts to be clustered, a feature matrix of the texts to be clustered can be constructed from the number of occurrences of each character segment in each text. Each row of the feature matrix is the vector representation of one text to be clustered, and each column is one feature, namely the number of times a given dictionary word appears in each text. For example, the feature matrix X is an n×m matrix, where n is the number of texts to be clustered and m is the length of the dictionary; the value in the nth row and mth column is the number of times the mth dictionary word appears in the nth text.
Assume there are 2 texts to be clustered: text 1, 'resource library table creation and data access', and text 2, 'write population-related table cleaning scripts'. Word segmentation is performed on both texts (the tokens below are character segments of the original Chinese texts rendered literally in English). The segmentation result of text 1 is 'resource library, table, creation, and, data, access, resource, source, library, table, creation, and, data, access, resource, source library, library table, table creation, and, number, data, access'; the segmentation result of text 2 is 'write, population, mouth phase, correlation, table closing, table cleaning, washing, foot washing, script, write, population, correlation, table, washing, script, writing, person, mouth, phase, table closing, table, cleaning, foot washing, book'. The segmentation results of the two texts contain 42 tokens in total, namely: 'table creation, washing, data access, resource library, library table, creation, resource, cleaning, closing, building, source, population, composition, foot, creation, resource, correlation, table cleaning, data, and number, person, book, write, source library, mouth phase, table closing, table, access, phase, entry, build, mouth, library, access, and write, cleaning, foot washing, script, data, write'. The feature matrix X of the texts to be clustered can then be expressed as follows:
X=[[1,0,1,1,1,1,2,1,0,0,1,1,0,0,0,1,1,0,0,1,1,0,0,0,1,0,0,2,2,0,1,1,0,1,1,2,0,0,0,0,2,0],[0,1,0,0,0,0,0,0,1,1,0,0,2,1,1,0,0,2,1,0,0,1,1,1,0,1,1,2,0,1,0,0,1,0,0,0,1,2,1,2,0,2]].
where the first row of the feature matrix is the representation vector of text 1 and the second row is the representation vector of text 2.
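Continuing the previous sketch, the feature matrix X described above could be built with scikit-learn's CountVectorizer, reusing segment_with_ngrams as the analyzer; texts_to_cluster is a placeholder for the cleaned texts produced in S100.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Each column of X corresponds to one dictionary entry (a JIEBA token or a character
# n-gram), and X[i, j] is the number of times that entry occurs in text i.
vectorizer = CountVectorizer(analyzer=segment_with_ngrams)
X = vectorizer.fit_transform(texts_to_cluster)   # n texts x m dictionary entries (sparse matrix)
```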
For text feature extraction: once the feature matrix of the texts to be clustered has been obtained, it could be fed directly into a clustering model. In practice, however, the dictionary tends to be large, so the resulting text representation vectors are high-dimensional and sparse. Therefore, before clustering, a feature extraction operation can be performed on the feature matrix of the texts to be clustered to obtain a low-rank representation of the original feature matrix. This yields a more meaningful feature representation of the texts to be clustered, and reducing the feature dimension also speeds up the clustering computation.
The feature extraction and dimension reduction processing of the text feature matrix to be clustered can use a preset dimension reduction algorithm, such as NMF (Non-negative Matrix Factorization). For an n×m feature matrix, after dimension reduction by NMF, an n×r feature representation matrix S can be obtained, where r is far smaller than m.
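A brief sketch of this dimension-reduction step with scikit-learn's NMF follows; the reduced dimension r = 16 is an illustrative choice, not a value given in the patent.

```python
from sklearn.decomposition import NMF

r = 16                                     # illustrative reduced dimension, with r << m
nmf = NMF(n_components=r, init="nndsvda", max_iter=500)
S = nmf.fit_transform(X)                   # n x r low-rank representation of the n x m feature matrix
```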
For text clustering, the clustering algorithm operates on the feature matrix S obtained after text feature extraction; each row of S is the feature representation of one text to be clustered. The feature matrix S is input into a K-Means++ clustering model and iterated until convergence to obtain the clustering result matrix C, in which the value in the ith row and jth column represents the probability that the ith sample belongs to the jth category. The number of clusters is determined automatically by the elbow method (selecting, within a certain range of cluster counts, the number of clusters with the smallest distance from the samples to their cluster centers).
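The following sketch shows one common way to implement an elbow selection together with K-Means++ clustering in scikit-learn; the candidate range of cluster counts and the particular elbow heuristic (largest drop-off in the inertia curve) are assumptions made for the sketch.

```python
from sklearn.cluster import KMeans

def choose_k_by_elbow(S, k_min=2, k_max=10):
    """Fit K-Means for each candidate k and pick the k after which the within-cluster
    sum of squared distances (inertia) stops dropping sharply."""
    inertias = [KMeans(n_clusters=k, init="k-means++", n_init=10, random_state=0)
                .fit(S).inertia_ for k in range(k_min, k_max + 1)]
    drops = [inertias[i] - inertias[i + 1] for i in range(len(inertias) - 1)]
    elbow = max(range(1, len(drops)), key=lambda i: drops[i - 1] - drops[i])
    return k_min + elbow

k = choose_k_by_elbow(S)
labels = KMeans(n_clusters=k, init="k-means++", n_init=10, random_state=0).fit_predict(S)
# Texts sharing a label form one category; abstracts are then generated per category in S300.
```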
S300, generating text abstracts by category from the sets of texts of different categories.
In this embodiment, the text summaries are generated by category through a pre-training domain language model and a text summary generation model.
Specifically, for the pre-trained domain language model, unsupervised mask language model training and supervised domain task training are performed on domain data using the general language model BERT to obtain the domain language model.
A mask language model (Masked Language Model, MLM) is trained on the domain data using the general language model BERT, whose structure is a stack of 12 encoder layers. The training procedure of the domain pre-trained language model is shown in fig. 2: before a text is input into the encoder of the BERT model, words in the text are randomly replaced with the mask token [MASK] at a certain proportion; BERT encoding is then performed, the words replaced by [MASK] are predicted, and the cross-entropy loss between the predicted words and the original words is computed; this cycle is repeated until the model converges, yielding the first pre-trained domain language model. The structure of the first pre-trained language model is the same as that of the original BERT model, but its parameters have been learned iteratively.
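As a hedged illustration, such mask-language-model training could be run with the HuggingFace transformers library roughly as follows; domain_texts is a placeholder for the unlabeled domain corpus, and the checkpoint name, masking ratio, and learning rate are illustrative choices rather than values from the patent.

```python
import torch
from transformers import BertTokenizerFast, BertForMaskedLM, DataCollatorForLanguageModeling

tokenizer = BertTokenizerFast.from_pretrained("bert-base-chinese")
model = BertForMaskedLM.from_pretrained("bert-base-chinese")   # 12-layer BERT encoder with an MLM head
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True, mlm_probability=0.15)
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)

model.train()
for text in domain_texts:                                       # domain_texts: unlabeled domain corpus (placeholder)
    enc = tokenizer(text, truncation=True, max_length=512, return_tensors="pt")
    batch = collator([{k: v.squeeze(0) for k, v in enc.items()}])  # randomly replaces a proportion of tokens with [MASK]
    loss = model(**batch).loss                                  # cross-entropy between predicted and original words
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```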
Fine-tuning is then performed, on the basis of the first pre-trained language model, on domain-annotated training data for downstream tasks to obtain the second pre-trained domain language model. The downstream tasks are constructed according to domain requirements and generally include a named entity recognition task, a text classification task, and the like. The second pre-trained domain language model is the final pre-trained domain language model. The improvement over the general language model is that domain data are introduced for retraining on top of the general language model, and domain-oriented task training at different text granularities is introduced, for example named entity recognition and text classification.
For the text abstract generation model, its structure is shown in fig. 3: it consists of an encoding component and a decoding component, each of which can be split into a plurality of encoders and decoders. The encoding component is the pre-trained domain language model, the parameters of the decoding component are obtained by training on the text abstract generation task, and the structures of the encoder and decoder are shown in fig. 4.
The method forms training data from text-abstract pairs under a Seq2Seq framework and thereby realizes a system that automatically generates text abstracts. The text abstract model first encodes the input text whose abstract is to be generated through the encoding component to obtain an intermediate semantic vector; the intermediate semantic vector is then decoded by the decoding component to generate the corresponding abstract text. The specific steps are as follows:
a. The text whose abstract is to be generated is tokenized and converted to obtain the vector representation of the input text.
The input text is segmented by a tokenizer to obtain its word index list (token_ids) and multi-sentence separation identification list (segment_ids). A word embedding operation is then performed on token_ids to obtain the first vector representation of the input text, of size [batch_size, seq_length, embedding_size]. Next, a position-encoded representation of the text, also of size [batch_size, seq_length, embedding_size], is computed; finally, the first vector representation of the text is summed with the position-encoded representation to obtain the vector representation of the input text.
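A minimal PyTorch sketch of this input-representation step is shown below; the use of a learned position embedding is an assumption, since the patent does not state how the position encoding is computed.

```python
import torch
import torch.nn as nn

class InputRepresentation(nn.Module):
    """Word embedding of token_ids plus a learned position encoding, both of size embedding_size."""
    def __init__(self, vocab_size: int, max_len: int, embedding_size: int):
        super().__init__()
        self.token_emb = nn.Embedding(vocab_size, embedding_size)
        self.pos_emb = nn.Embedding(max_len, embedding_size)

    def forward(self, token_ids):                                # token_ids: [batch_size, seq_length]
        positions = torch.arange(token_ids.size(1), device=token_ids.device)
        # First vector representation of the text + position-encoded representation of the text.
        return self.token_emb(token_ids) + self.pos_emb(positions)[None, :, :]
        # result: [batch_size, seq_length, embedding_size]
```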
B. The vector representation of the input text is fed into the encoding component of the text abstract model for encoding to obtain the semantic representation of the abstract text to be generated.
The vector representation of the input text is encoded by a multi-head attention layer, a summation-and-normalization layer, a fully connected layer, and another summation-and-normalization layer in the encoder to obtain the first text encoded representation; the internal structure of the encoder is shown in fig. 4. Encoding proceeds through the L encoders in sequence, and the final, Lth text encoded representation is the semantic representation of the abstract text to be generated.
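One encoder block of this kind could be sketched in PyTorch as follows; the dimensions (768 hidden units, 12 heads, 3072 feed-forward units) are illustrative BERT-base values, not figures taken from the patent.

```python
import torch.nn as nn

class EncoderLayer(nn.Module):
    """One encoder block: multi-head attention -> summation and normalization ->
    fully connected layer -> summation and normalization."""
    def __init__(self, d_model: int = 768, n_heads: int = 12, d_ff: int = 3072):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x, pad_mask=None):                         # x: [batch_size, seq_length, d_model]
        attn_out, _ = self.attn(x, x, x, key_padding_mask=pad_mask)
        x = self.norm1(x + attn_out)
        return self.norm2(x + self.ffn(x))

# Passing the input vector representation through L such layers in sequence yields the
# semantic representation of the abstract text to be generated.
```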
C. The semantic representation of the abstract text to be generated is fed into the decoding component for decoding to obtain the generated abstract text.
A vector representation is first computed for the decoding start token <START>, and this is then input into the decoding component together with the semantic representation of the abstract text to be generated. The decoding computation passes in turn through a masked multi-head attention layer, a summation-and-normalization layer, a fully connected layer, and another summation-and-normalization layer to obtain the first text decoded representation; the internal structure of the decoder is shown in fig. 4. Decoding proceeds through the L decoders in sequence to obtain the final, Lth text decoded representation; a fully connected layer then yields the probability distribution over the dictionary at the current step t, and the word with the highest probability is taken as the output of step t. The output word representation of step t, together with the word representations of the preceding steps and the semantic representation of the abstract text to be generated, is fed back into the decoding component to predict the output of step t+1, until the output word is <EOS> or the preset text generation length is reached; the decoding process then terminates, and the output words obtained along the way are combined in order into the abstract of the current input text.
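The step-by-step greedy decoding loop could be sketched as below; model.encode, model.decode, tokenizer.start_id, and tokenizer.eos_id are hypothetical interfaces introduced only for this illustration, not APIs named by the patent or any specific library.

```python
import torch

@torch.no_grad()
def generate_abstract(model, tokenizer, text, max_len=128):
    """Greedy decoding sketch: start from <START>, repeatedly pick the highest-probability
    word, and stop at <EOS> or when the preset generation length is reached."""
    memory = model.encode(tokenizer.encode(text))        # semantic representation from the L encoders
    out_ids = [tokenizer.start_id]                       # decoding start token <START>
    for _ in range(max_len):
        logits = model.decode(out_ids, memory)           # shape: [len(out_ids), vocab_size]
        next_id = int(logits[-1].argmax())               # word with the highest probability at step t
        if next_id == tokenizer.eos_id:                  # <EOS> terminates decoding
            break
        out_ids.append(next_id)
    return tokenizer.decode(out_ids[1:])                 # combine the output words into the abstract
```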
S400, merging and outputting the generated text abstracts. After abstracts have been generated for the several small sentence sets, the results are merged and the final abstract text is output. Ideally the abstracts generated from different sentence sets differ from one another, so mutually exclusive sentences are selected and sentences with high similarity are filtered out before the final output. The merging can be done by direct manual screening or by automatic machine de-duplication.
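A simple sketch of the automatic de-duplication variant, using only the Python standard library, might look like this; the similarity measure and the 0.7 threshold are assumptions, as the patent does not specify them.

```python
from difflib import SequenceMatcher

def merge_abstracts(sentences, threshold=0.7):
    """Keep a sentence only if it is not too similar to one already kept,
    then concatenate the kept sentences into the final abstract output."""
    kept = []
    for sent in sentences:
        if all(SequenceMatcher(None, sent, prev).ratio() < threshold for prev in kept):
            kept.append(sent)
    return "".join(kept)
```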
The embodiment also discloses a system for generating the text abstract, which comprises the following steps: the system comprises a text data preprocessing module, a text clustering module, a text abstract generating module and a text abstract merging output module; wherein:
The text data preprocessing module is used for forming texts of the abstract to be generated into a set and preprocessing the texts;
The text clustering module is used for clustering the preprocessed texts according to a preset first rule to obtain a set of texts with different categories;
The text abstract generating module is used for generating text abstracts by category from the sets of texts of different categories;
And the text abstract merging output module is used for merging and outputting the generated text abstracts.
The specific workflow and structure of the text data preprocessing module, the text clustering module, the text abstract generating module and the text abstract merging output module are described in detail in the foregoing, and are not described in detail herein.
According to the method and system for generating text abstracts disclosed in this embodiment, the texts whose abstracts are to be generated are formed into a set and preprocessed; the preprocessed texts are clustered according to a preset first rule to obtain sets of texts of different categories; text abstracts are generated by category from the sets of texts of different categories; and the generated text abstracts are merged and output. By introducing domain data and clustering multi-topic or multi-fact texts, the method overcomes the overflow of domain proper nouns and the unsatisfactory abstracts that text generation models otherwise produce for texts describing multiple topics or facts.
It should be understood that the specific order or hierarchy of steps in the processes disclosed are examples of exemplary approaches. Based on design preferences, it is understood that the specific order or hierarchy of steps in the processes may be rearranged without departing from the scope of the present disclosure. The accompanying method claims present elements of the various steps in a sample order, and are not meant to be limited to the specific order or hierarchy presented.
In the foregoing detailed description, various features are grouped together in a single embodiment for the purpose of streamlining the disclosure. This method of disclosure is not to be interpreted as reflecting an intention that the claimed embodiments of the subject matter require more features than are expressly recited in each claim. Rather, as the following claims reflect, invention lies in less than all features of a single disclosed embodiment. Thus the following claims are hereby expressly incorporated into this detailed description, with each claim standing on its own as a separate preferred embodiment of this invention.
Those of skill would further appreciate that the various illustrative logical blocks, modules, circuits, and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both. To clearly illustrate this interchangeability of hardware and software, various illustrative components, blocks, modules, circuits, and steps have been described above generally in terms of their functionality. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the overall system. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present disclosure.
The steps of a method or algorithm described in connection with the embodiments disclosed herein may be embodied directly in hardware, in a software module executed by a processor, or in a combination of the two. A software module may reside in RAM memory, flash memory, ROM memory, EPROM memory, EEPROM memory, registers, hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art. An exemplary storage medium is coupled to the processor such that the processor can read information from, and write information to, the storage medium. In the alternative, the storage medium may be integral to the processor. The processor and the storage medium may reside in an ASIC. The ASIC may reside in a user terminal. The processor and the storage medium may reside as discrete components in a user terminal.
For a software implementation, the techniques described in this disclosure may be implemented with modules (e.g., procedures, functions, and so on) that perform the functions described herein. These software codes may be stored in memory units and executed by processors. The memory unit may be implemented within the processor or external to the processor, in which case it can be communicatively coupled to the processor via various means as is known in the art.
The foregoing description includes examples of one or more embodiments. It is, of course, not possible to describe every conceivable combination of components or methodologies for purposes of describing the aforementioned embodiments, but one of ordinary skill in the art may recognize that many further combinations and permutations of various embodiments are possible. Accordingly, the embodiments described herein are intended to embrace all such alterations, modifications and variations that fall within the scope of the appended claims. Furthermore, as used in the specification or claims, the term "including" is intended to be inclusive in a manner similar to the term "comprising", as interpreted when employed as a transitional word in a claim. Furthermore, any use of the term "or" in the specification or claims is intended to mean a "non-exclusive or".

Claims (3)

1. A method of text abstract generation, comprising:
S100, forming the texts whose abstracts are to be generated into a set, and preprocessing the texts;
S200, clustering the preprocessed texts according to a preset first rule to obtain sets of texts of different categories; in S200, clustering the preprocessed texts according to the preset first rule includes: first performing word segmentation on the texts to be clustered; then constructing a feature matrix of the texts to be clustered from the number of occurrences of each character segment in each text, and performing a feature extraction operation on this matrix to obtain a low-rank representation of the original feature matrix; and finally inputting the feature matrix into a K-Means clustering model and iterating until convergence to obtain a clustering result matrix;
S300, performing abstract generation on different types of text sets by adopting a text abstract generation model;
S400, merging and outputting the generated text abstracts;
In S300, text abstracts are generated by category through a pre-trained domain language model and a text abstract generation model; the workflow of the pre-trained domain language model is as follows: a Mask language model is trained on domain data using the general language model BERT; in the Mask language model training process, before a text is input into the encoder of the BERT model, words in the text are randomly replaced with the mask token [MASK] at a certain proportion, BERT encoding is then performed, the words replaced by [MASK] are predicted, and the cross-entropy loss between the predicted words and the original words is computed, repeating until the model converges to obtain the first pre-trained domain language model; on the basis of the first pre-trained language model, fine-tuning is performed on domain-annotated training data for downstream tasks to obtain the second pre-trained domain language model;
The text abstract generation model consists of an encoding component and a decoding component, each of which is split into a plurality of encoders and decoders; the encoding component is the pre-trained domain language model, and the parameters of the decoding component are obtained by training on the text abstract generation task; the workflow of the encoder and decoder in generating the abstract text is: the text whose abstract is to be generated is tokenized and converted to obtain the vector representation of the input text; the vector representation of the input text is fed into the encoding component of the text abstract model for encoding to obtain the semantic representation of the abstract text to be generated; the semantic representation of the abstract text to be generated is fed into the decoding component for decoding to obtain the generated abstract text; feeding the vector representation of the input text into the encoding component of the text abstract model comprises: the vector representation of the input text is encoded by a multi-head attention layer, a summation-and-normalization layer, a fully connected layer, and another summation-and-normalization layer in the encoder to obtain the first text encoded representation.
2. The method for generating a text abstract as claimed in claim 1, wherein in S100, preprocessing the text comprises performing a cleaning operation on the text to remove useless characters, at least including stop words and modal particles.
3. A system for generating a text abstract, using the method for generating a text abstract according to claim 1 or 2, comprising: a text data preprocessing module, a text clustering module, a text abstract generating module, and a text abstract merging output module; wherein:
The text data preprocessing module is used for forming texts of the abstract to be generated into a set and preprocessing the texts;
The text clustering module is used for clustering the preprocessed texts according to a preset first rule to obtain a set of texts with different categories;
The text abstract generating module is used for generating text abstracts by category from the sets of texts of different categories;
And the text abstract merging output module is used for merging and outputting the generated text abstracts.
CN202110824660.1A 2021-07-21 2021-07-21 Text abstract generation method and system Active CN113688230B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110824660.1A CN113688230B (en) 2021-07-21 2021-07-21 Text abstract generation method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110824660.1A CN113688230B (en) 2021-07-21 2021-07-21 Text abstract generation method and system

Publications (2)

Publication Number Publication Date
CN113688230A CN113688230A (en) 2021-11-23
CN113688230B true CN113688230B (en) 2024-07-26

Family

ID=78577596

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110824660.1A Active CN113688230B (en) 2021-07-21 2021-07-21 Text abstract generation method and system

Country Status (1)

Country Link
CN (1) CN113688230B (en)

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112417854A (en) * 2020-12-15 2021-02-26 北京信息科技大学 Chinese document abstraction type abstract method

Family Cites Families (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107861938B (en) * 2017-09-21 2020-09-25 北京三快在线科技有限公司 POI (Point of interest) file generation method and device and electronic equipment
CN108319668B (en) * 2018-01-23 2021-04-20 义语智能科技(上海)有限公司 Method and equipment for generating text abstract
EP3557451A1 (en) * 2018-04-19 2019-10-23 Siemens Aktiengesellschaft Method for determining output data for a plurality of text documents
CN108763211B (en) * 2018-05-23 2020-07-31 中国科学院自动化研究所 Automatic abstracting method and system fusing intrinsic knowledge
EP3620935A1 (en) * 2018-09-04 2020-03-11 Siemens Aktiengesellschaft System and method for natural language processing
US10810243B2 (en) * 2019-03-08 2020-10-20 Fuji Xerox Co., Ltd. System and method for generating abstractive summaries of interleaved texts
CN110597979B (en) * 2019-06-13 2023-06-23 中山大学 Self-attention-based generated text abstract method
CN110413768B (en) * 2019-08-06 2022-05-03 成都信息工程大学 Automatic generation method of article titles
CN110765264A (en) * 2019-10-16 2020-02-07 北京工业大学 Text abstract generation method for enhancing semantic relevance
CN110956021A (en) * 2019-11-14 2020-04-03 微民保险代理有限公司 Original article generation method, device, system and server
US11481418B2 (en) * 2020-01-02 2022-10-25 International Business Machines Corporation Natural question generation via reinforcement learning based graph-to-sequence model
CN111312356B (en) * 2020-01-17 2022-07-01 四川大学 Traditional Chinese medicine prescription generation method based on BERT and integration efficacy information
CN111966917B (en) * 2020-07-10 2022-05-03 电子科技大学 Event detection and summarization method based on pre-training language model
CN112562669B (en) * 2020-12-01 2024-01-12 浙江方正印务有限公司 Method and system for automatically abstracting intelligent digital newspaper and performing voice interaction chat
CN112765345A (en) * 2021-01-22 2021-05-07 重庆邮电大学 Text abstract automatic generation method and system fusing pre-training model

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112417854A (en) * 2020-12-15 2021-02-26 北京信息科技大学 Chinese document abstraction type abstract method

Also Published As

Publication number Publication date
CN113688230A (en) 2021-11-23

Similar Documents

Publication Publication Date Title
CN110348016B (en) Text abstract generation method based on sentence correlation attention mechanism
CN110413768B (en) Automatic generation method of article titles
CN112749274B (en) Chinese text classification method based on attention mechanism and interference word deletion
CN112395385B (en) Text generation method and device based on artificial intelligence, computer equipment and medium
CN111401084A (en) Method and device for machine translation and computer readable storage medium
CN112992125B (en) Voice recognition method and device, electronic equipment and readable storage medium
CN110569505A (en) text input method and device
CN114969304A (en) Case public opinion multi-document generation type abstract method based on element graph attention
CN111241820A (en) Bad phrase recognition method, device, electronic device, and storage medium
Moeng et al. Canonical and surface morphological segmentation for nguni languages
TWI734085B (en) Dialogue system using intention detection ensemble learning and method thereof
CN111985243A (en) Emotion model training method, emotion analysis device and storage medium
CN113961706A (en) Accurate text representation method based on neural network self-attention mechanism
CN113012822A (en) Medical question-answering system based on generating type dialogue technology
CN116049387A (en) Short text classification method, device and medium based on graph convolution
CN111814479B (en) Method and device for generating enterprise abbreviations and training model thereof
Cao Generating natural language descriptions from tables
CN114722832A (en) Abstract extraction method, device, equipment and storage medium
CN113688230B (en) Text abstract generation method and system
Serban et al. Text-based speaker identification for multi-participant opendomain dialogue systems
Zare et al. Deepnorm-a deep learning approach to text normalization
Yolchuyeva Novel NLP Methods for Improved Text-To-Speech Synthesis
CN111723584A (en) Punctuation prediction method based on consideration of domain information
CN117932487B (en) Risk classification model training and risk classification method and device
Sun et al. Text sentiment polarity classification method based on word embedding

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant