
CN113535960A - Text classification method, device and equipment - Google Patents

Text classification method, device and equipment

Info

Publication number
CN113535960A
CN113535960A (application CN202110880080.4A)
Authority
CN
China
Prior art keywords
classification
text
word
vector
word vector
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110880080.4A
Other languages
Chinese (zh)
Inventor
莫周培
干志勤
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Industrial and Commercial Bank of China Ltd ICBC
Original Assignee
Industrial and Commercial Bank of China Ltd ICBC
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Industrial and Commercial Bank of China Ltd ICBC filed Critical Industrial and Commercial Bank of China Ltd ICBC
Priority to CN202110880080.4A priority Critical patent/CN113535960A/en
Publication of CN113535960A publication Critical patent/CN113535960A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35Clustering; Classification
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/25Fusion techniques
    • G06F18/254Fusion techniques of classification results, e.g. of results related to same input data
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/289Phrasal analysis, e.g. finite state techniques or chunking
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/044Recurrent networks, e.g. Hopfield networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/084Backpropagation, e.g. using gradient descent

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Mathematical Physics (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computing Systems (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Probability & Statistics with Applications (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Databases & Information Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The embodiments of this specification provide a text classification method, device and equipment, relating to the technical field of artificial intelligence. The method includes: acquiring a text to be classified; preprocessing the text to be classified to obtain a target word vector; determining a classification result corresponding to each classification model according to the target word vector and a plurality of classification models, the classification models being machine learning models for text classification pre-trained with TextCNN, GRU, RNN, RCNN and Attention-GRU respectively; and performing probability equal-weight fusion on the classification results corresponding to the classification models to obtain the category of the text to be classified. In the embodiments of this specification, the accuracy of the classification result can be effectively improved through multi-model fusion.

Description

Text classification method, device and equipment
Technical Field
The embodiment of the specification relates to the technical field of artificial intelligence, in particular to a text classification method, a text classification device and text classification equipment.
Background
Natural language processing has always been an important topic in the field of artificial intelligence. Natural language processing technology can support basic text processing tasks such as automatic document analysis, key information extraction, text classification and review, and intelligent text correction, and it is widely applied across industries; however, the complexity of human language also poses significant difficulties for natural language processing. Intelligent analysis of long texts is especially challenging: extracting key information from lengthy, wide-ranging texts with large and fragmented amounts of information has long been a difficult problem in the text field.
In the prior art, a text is generally segmented into words and its category is then identified with a single model. This approach works to some extent for short texts, but for lengthy, wide-ranging texts with a large amount of information it cannot effectively extract key information, deeply analyze the internal structure and semantic information of the text, and thereby determine the category of the text. Therefore, the technical solutions in the prior art cannot accurately classify lengthy texts.
In view of the above problems, no effective solution has been proposed.
Disclosure of Invention
The embodiment of the specification provides a text classification method, a text classification device and text classification equipment, and aims to solve the problem that a lengthy text cannot be accurately classified in the prior art.
An embodiment of the present specification provides a text classification method, including: acquiring a text to be classified; preprocessing the text to be classified to obtain a target word vector; determining a classification result corresponding to each classification model according to the target word vector and the plurality of classification models; the classification models are machine learning models which are obtained by pre-training textCNN, GRU, RNN, RCNN and Attention-GRU respectively and are used for text classification; and performing probability equal-weight fusion on the classification results corresponding to the classification models to obtain the classes of the texts to be classified.
An embodiment of the present specification further provides a text classification apparatus, including: the acquisition module is used for acquiring texts to be classified; the preprocessing module is used for preprocessing the text to be classified to obtain a target word vector; the determining module is used for determining a classification result corresponding to each classification model according to the target word vector and the plurality of classification models; the classification models are machine learning models which are obtained by pre-training textCNN, GRU, RNN, RCNN and Attention-GRU respectively and are used for text classification; and the fusion module is used for performing probability equal-weight fusion on the classification results corresponding to the classification models to obtain the classes of the texts to be classified.
Embodiments of the present specification further provide a text classification device, including a processor and a memory for storing processor-executable instructions, where the processor executes the instructions to implement the steps of any one of the method embodiments in the embodiments of the present specification.
The present specification embodiments also provide a computer readable storage medium having stored thereon computer instructions which, when executed, implement the steps of any one of the method embodiments of the specification embodiments.
The embodiments of this specification provide a text classification method in which an acquired text to be classified is preprocessed to obtain a target word vector, and a classification result corresponding to each classification model is determined according to the target word vector and a plurality of classification models, where the plurality of classification models may be machine learning models for text classification pre-trained with TextCNN, GRU, RNN, RCNN and Attention-GRU, respectively. Furthermore, because the classification results output by the classification models differ, probability equal-weight fusion can be performed on the classification results corresponding to the classification models in order to improve accuracy, so that a fused classification result is obtained and the category of the text to be classified can be determined. Multi-model fusion can effectively improve the accuracy of the classification result and effectively solves the problem that a single model cannot dig deep into complicated, lengthy texts to extract key information, thereby improving text classification accuracy.
Drawings
The accompanying drawings, which are included to provide a further understanding of the embodiments of the disclosure, are incorporated in and constitute a part of this specification, and are not intended to limit the embodiments of the disclosure. In the drawings:
FIG. 1 is a schematic diagram illustrating steps of a text classification method provided in an embodiment of the present disclosure;
fig. 2 is a schematic structural diagram of a text classification apparatus provided in an embodiment of the present specification;
fig. 3 is a schematic structural diagram of a text classification device provided in an embodiment of the present specification.
Detailed Description
The principles and spirit of the embodiments of the present specification will be described with reference to a number of exemplary embodiments. It should be understood that these embodiments are presented merely to enable those skilled in the art to better understand and to implement the embodiments of the present description, and are not intended to limit the scope of the embodiments of the present description in any way. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.
As will be appreciated by one skilled in the art, implementations of the embodiments of the present description may be embodied as a system, an apparatus, a method, or a computer program product. Therefore, the disclosure of the embodiments of the present specification can be embodied in the following forms: entirely hardware, entirely software (including firmware, resident software, micro-code, etc.), or a combination of hardware and software.
Although the flow described below includes operations that occur in a particular order, it should be appreciated that the processes may include more or fewer operations, which may be performed sequentially or in parallel (e.g., using parallel processors or a multi-threaded environment).
Referring to fig. 1, the present embodiment can provide a text classification method. The text classification method can be used for accurately determining the category of the text to be classified. The above text classification method may include the following steps.
S101: and acquiring the text to be classified.
In this embodiment, a text to be classified may be obtained, where the text to be classified may be a long text, a short text, or a complete text, and may be determined specifically according to an actual situation, and this is not limited in this description embodiment.
In this embodiment, acquiring the text to be classified may include: pulling the text to be classified from a preset database, or receiving a text to be classified input by a user. It is understood that other manners of obtaining the text to be classified may also be adopted, for example searching for the text to be classified in web pages according to certain search conditions; this can be determined according to the actual situation and is not limited by the embodiments of this specification. The preset database may be a database for storing texts acquired in real time.
S102: and preprocessing the text to be classified to obtain a target word vector.
In this embodiment, since the initially acquired text to be classified may have formatting problems, the text to be classified may be preprocessed to obtain the target word vector so that it can be better represented. The target word vector may cover a plurality of words.
In this embodiment, the preprocessing may include: data cleaning, truncation and completion, word segmentation, stop-word removal, and the like. Of course, the preprocessing method is not limited to the above examples, and other modifications are possible for those skilled in the art in light of the technical spirit of the embodiments of the present disclosure; as long as the achieved functions and effects are the same as or similar to the embodiments of the present disclosure, they are intended to be covered by the scope of the embodiments of the present disclosure.
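As an illustration of the preprocessing steps listed above, the following is a minimal sketch assuming jieba is used for Chinese word segmentation; the stop-word list, the regular expression and the helper names (clean_text, preprocess) are assumptions for illustration, not part of the patent. Truncation and completion are sketched separately further below.

```python
# Minimal preprocessing sketch: data cleaning, word segmentation, stop-word removal.
import re
import jieba

STOP_WORDS = {"的", "了", "是"}  # placeholder stop-word list

def clean_text(text):
    # Data cleaning: keep only Chinese characters, letters and digits.
    return re.sub(r"[^\u4e00-\u9fa5A-Za-z0-9]", "", text)

def preprocess(text):
    tokens = jieba.lcut(clean_text(text))             # word segmentation
    return [t for t in tokens if t not in STOP_WORDS]  # stop-word removal

print(preprocess("这是一个用于分类的示例文本。"))
```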
In this embodiment, the dimension of the target word vector may be preset, and the dimension may be a positive integer greater than 0, for example 100, 300 or 1000. The specific value can be determined according to the actual situation and is not limited by the embodiments of this specification.
S103: determining a classification result corresponding to each classification model according to the target word vector and the plurality of classification models; the classification models are machine learning models which are obtained by pre-training textCNN, GRU, RNN, RCNN and Attention-GRU respectively and are used for text classification.
In this embodiment, the classification result corresponding to each classification model may be determined according to the target word vector and a plurality of classification models, where the plurality of classification models may be machine learning models for text classification, which are pre-trained by TextCNN, GRU, RNN, RCNN, and Attention-GRU, respectively.
In this embodiment, the target word vector is input into each classification model, and each classification model can produce a corresponding classification result. The classification result may be a classification result vector, where each dimension of the vector is the probability that the target word vector belongs to one text category; the classification result vectors output by the different classification models may have the same dimension.
In this embodiment, TextCNN is a convolutional neural network for text classification; GRU (Gated Recurrent Unit) is a gated recurrent unit and is one type of RNN (Recurrent Neural Network); an RNN is a recurrent neural network that can process input sequences of arbitrary length using its internal memory; RCNN is a recurrent convolutional neural network, where CNN denotes a convolutional neural network; Attention-GRU is a gated recurrent unit based on the attention mechanism, and a deep learning model based on the attention mechanism can identify the importance of words in a text through trained word vectors so as to extract important features of the text.
In this embodiment, the main frameworks of the five deep learning models TextCNN, GRU, RNN, RCNN and Attention-GRU are similar. The basic idea is: after a word (or character) passes through an Embedding layer, structures such as CNNs and RNNs are used to extract local, global or context information, and a classifier consisting of two fully-connected layers performs the classification.
In this embodiment, the final output of the model is a vector with fixed dimension, and the dimension of the vector may be the number of preset text categories. For example: the text needs to be classified into literature, news, and sports, and the dimension of the corresponding classification result vector is 3. Of course, the method of determining the dimensions is not limited to the above examples, and other modifications may be made by those skilled in the art within the spirit of the embodiments of the present disclosure, but the functions and effects achieved by the embodiments of the present disclosure are all covered by the scope of the embodiments of the present disclosure.
In some embodiments, since different deep learning models have different learning abilities and different emphases, in order to combine the advantages of the different deep learning models and improve the accuracy of the final classification result, the F1-score obtained during training of each classification model may be used as the criterion for setting that model's weight, so that each classification model has a corresponding weight. The F1-score is an index for measuring the accuracy of a binary classification model; it takes both the precision and the recall of the model into account and can be regarded as a weighted average of the two, with a maximum value of 1 and a minimum value of 0.
In some embodiments, the weight may be a weight of the output data of the classification model, and may be configured at the output end of the classification model. In some embodiments, the data obtained by multiplying the classification result by the corresponding weight may be output as an output result, or the classification result may be directly output, which may be determined according to actual situations, and this is not limited in this specification.
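A small sketch of how the F1-scores from training could be turned into per-model weights as described above; the harmonic-mean formula is standard, while normalizing the scores so that the weights sum to 1 is an assumption for illustration (the patent only states that F1-score serves as the criterion for setting each model's weight), and the precision/recall numbers are made up.

```python
# Hedged sketch: derive per-model weights from validation F1-scores.
def f1_score(precision, recall):
    # F1 considers both precision and recall; it ranges from 0 to 1.
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Illustrative validation precision/recall per classification model.
val_metrics = {
    "TextCNN":       (0.90, 0.86),
    "GRU":           (0.88, 0.84),
    "RNN":           (0.85, 0.80),
    "RCNN":          (0.89, 0.87),
    "Attention-GRU": (0.91, 0.88),
}

f1 = {name: f1_score(p, r) for name, (p, r) in val_metrics.items()}
total = sum(f1.values())
weights = {name: score / total for name, score in f1.items()}  # assumed normalization
print(weights)
```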
S104: and performing probability equal-weight fusion on the classification results corresponding to the classification models to obtain the classes of the texts to be classified.
In this embodiment, since the classification results output by each classification model are different, in order to improve the accuracy of the classification results, probability equal-weight fusion can be performed on the classification results corresponding to each classification model, so that the fused classification results can be obtained, and the category of the text to be classified can be determined.
In this embodiment, probability equal-weight fusion may be performed in such a manner that the classification result output by each model is given the same weight by default, and the classification results output by the models are directly added to form the final classification result. It will of course be appreciated that other possible fusion manners may be used, for example: setting a corresponding weight for each classification model and performing a weighted summation of the classification results output by the models to obtain the fusion result. The specific manner can be determined according to the actual situation and is not limited by the embodiments of this specification.
In this embodiment, each dimension of the classification result vector is the probability that the target word vector belongs to one text category, so the category to which the text to be classified belongs can be determined from the fused classification result vector. For example: the category with the maximum probability in the fused classification result vector may be taken as the category of the text to be classified, or the categories corresponding to the top three probability values may be taken as the categories of the text to be classified; this can be determined according to the actual situation and is not limited by the embodiments of this specification.
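The following is a minimal sketch of the probability equal-weight fusion described above: each model outputs a probability vector over the same categories, the vectors are summed with equal weight, and the category with the maximum fused probability is reported. The category names and probability values are made up for illustration, and the weighted variant at the end shows the alternative fusion mentioned in the text.

```python
import numpy as np

categories = ["literature", "news", "sports"]

# Illustrative probability vectors output by the five classification models for one text.
model_outputs = np.array([
    [0.20, 0.50, 0.30],   # TextCNN
    [0.10, 0.60, 0.30],   # GRU
    [0.25, 0.40, 0.35],   # RNN
    [0.15, 0.55, 0.30],   # RCNN
    [0.10, 0.65, 0.25],   # Attention-GRU
])

# Probability equal-weight fusion: every model gets the same weight,
# so the classification result vectors are simply added.
fused = model_outputs.sum(axis=0)
print(categories[int(fused.argmax())])   # category with the highest fused probability

# Weighted variant: per-model weights (e.g. derived from F1-scores).
weights = np.array([0.19, 0.20, 0.18, 0.21, 0.22])
fused_weighted = (model_outputs * weights[:, None]).sum(axis=0)
```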
From the above description, it can be seen that the embodiments of this specification achieve the following technical effects: the acquired text to be classified is preprocessed to obtain a target word vector, and a classification result corresponding to each classification model is determined according to the target word vector and a plurality of classification models, where the plurality of classification models may be machine learning models for text classification pre-trained with TextCNN, GRU, RNN, RCNN and Attention-GRU, respectively. Furthermore, because the classification results output by the classification models differ, probability equal-weight fusion can be performed on the classification results corresponding to the classification models in order to improve accuracy, so that a fused classification result is obtained and the category of the text to be classified can be determined. Multi-model fusion can effectively improve the accuracy of the classification result and effectively solves the problem that a single model cannot dig deep into complicated, lengthy texts to extract key information, thereby improving text classification accuracy.
In one embodiment, the preprocessing the text to be classified to obtain a target word vector may include: and performing truncation and filling on the text to be classified to obtain a target text, and respectively obtaining a first word vector, a second word vector and a third word vector by using a first text representation model, a second text representation model and a third text representation model according to the target text. Further, the first word vector, the second word vector and the third word vector may be spliced to obtain the target word vector.
In this embodiment, since different texts may have different lengths, the text to be classified may be truncated or padded in order to standardize the length of the text input into the text representation models. A model generally requires input matrices of equal size, so before input the length of each numerically mapped text can be standardized: a reasonable length is chosen based on an analysis of the text length distribution, over-long texts are truncated, and texts that are too short are padded.
In this embodiment, the length may be set in advance, and its specific value may be determined according to the overall situation of the texts to be classified. For example: a length x is set; truncation directly removes the portion of a text exceeding length x, and a text shorter than x is padded with 0 up to length x.
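A small sketch of this truncate/pad rule, assuming the text has already been mapped to a sequence of numeric ids; the function name is illustrative.

```python
def pad_or_truncate(ids, x):
    # Truncation: drop everything beyond length x; padding: append 0 up to length x.
    if len(ids) >= x:
        return ids[:x]
    return ids + [0] * (x - len(ids))

print(pad_or_truncate([3, 7, 1, 9], 6))           # [3, 7, 1, 9, 0, 0]
print(pad_or_truncate([3, 7, 1, 9, 2, 8, 4], 6))  # [3, 7, 1, 9, 2, 8]
```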
In this embodiment, the word vector of the target text may be determined by using a text representation model, and the first word vector, the second word vector, and the third word vector may be obtained by using three different text representation models. The first word vector, the second word vector and the third word vector output by the three text representation models can be spliced in sequence, so that the target word vector is obtained.
In one embodiment, the first word vector, the second word vector, and the third word vector have the same vector dimension.
In this embodiment, the vector dimensions of the first, second and third word vectors may be positive integers greater than 0, such as 100, 155 or 300, which can be determined according to practical situations and are not limited by the embodiments in this specification.
In one embodiment, the first text representation model, the second text representation model and the third text representation model are respectively: word2vec, glove, fastText.
In this embodiment, word2vec is a group of related models for generating word vectors; these are shallow, two-layer neural networks trained to reconstruct the linguistic contexts of words. A trained word2vec model can map each word to a vector, which can be used to represent word-to-word relationships.
In this embodiment, the glove model (GloVe, Global Vectors) is an unsupervised learning model that obtains word vectors similar to those of word2vec. The technique differs, however: training is performed on an aggregated global word-word co-occurrence matrix, which yields a vector space with meaningful substructure.
In this embodiment, the fastText model is an extension and improvement of the word2vec model; fastText is a framework for learning word representations that can perform robust, fast and accurate text classification.
In this embodiment, other models may also be used to determine the word vector, such as ELMo, Gaussian Embedding, LDA or BERT; this can be determined according to the actual situation and is not limited by the embodiments of this specification.
In an embodiment, after the text to be classified is preprocessed to obtain the target word vector, the method may further include: according to the target text, obtaining a first character vector, a second character vector and a third character vector by using the first text representation model, the second text representation model and the third text representation model respectively, and splicing the first character vector, the second character vector and the third character vector to obtain a target character vector.
In this embodiment, text classification may also be performed on a character basis. Since word-based models generally perform noticeably better than character-based models, in some embodiments only the word-based classification models may be fused. However, the greater the difference between the inputs, the greater the gain from fusion: although a model trained on characters scores lower on its own, fusing it with models trained on words can bring a considerable improvement. Thus, in some embodiments, the word classification models and the character classification models may also be fused to determine the final classification result.
In this embodiment, the target character vector is determined in a manner similar to the target word vector, and the repeated description is omitted. For example: word2vec can produce a 100-dimensional vector for the word data and a 100-dimensional vector for the character data; GloVe can produce a 100-dimensional vector for the word data and a 100-dimensional vector for the character data; and fastText can produce a 100-dimensional vector for the word data and a 100-dimensional vector for the character data. The three 100-dimensional word vectors are spliced to obtain a 300-dimensional word vector, and the three 100-dimensional character vectors are spliced to obtain a 300-dimensional character vector.
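A sketch of the splicing described above, with placeholder lookup tables standing in for trained word2vec, GloVe and fastText models; the token, the random vectors and the table names are assumptions for illustration. The same splicing applies to character-level tokens to build the 300-dimensional character vector.

```python
import numpy as np

DIM = 100
rng = np.random.default_rng(0)

# Placeholder embedding tables (token -> 100-dimensional vector) for the three models.
word2vec_emb = {"股票": rng.normal(size=DIM)}
glove_emb    = {"股票": rng.normal(size=DIM)}
fasttext_emb = {"股票": rng.normal(size=DIM)}

def spliced_vector(token):
    # Concatenate the three 100-dimensional vectors in a fixed order -> 300 dimensions.
    return np.concatenate([word2vec_emb[token], glove_emb[token], fasttext_emb[token]])

print(spliced_vector("股票").shape)   # (300,)
```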
In one embodiment, determining a classification result corresponding to each classification model according to the target word vector and the plurality of classification models may include: according to the target word vector, obtaining a first classification result vector, a second classification result vector, a third classification result vector, a fourth classification result vector and a fifth classification result vector by using a first word classification model, a second word classification model, a third word classification model, a fourth word classification model and a fifth word classification model, where each dimension of a classification result vector represents the probability of belonging to one text category. Correspondingly, performing probability equal-weight fusion on the classification results corresponding to the classification models to obtain the category of the text to be classified may include: adding the first classification result vector, the second classification result vector, the third classification result vector, the fourth classification result vector and the fifth classification result vector to obtain a target classification result vector, and taking the categories corresponding to the top preset number of probability values in the target classification result vector as the categories of the text to be classified.
In this embodiment, each word classification model may obtain a corresponding classification result vector, each dimension data in the classification result vector represents a probability of belonging to one text classification, and a category to which a text belongs may be determined according to a probability value of belonging to each text classification in the classification result vector.
In this embodiment, the probability equal weight fusion may be performed in such a manner that the weight of the classification result output by each model is the same by default, and the classification results output by each model may be directly added to obtain the target classification result vector. It will of course be appreciated that other possible ways of fusion of the models may be used, for example: and setting corresponding weight for each classification model, and performing weighted summation on the classification result output by each model to obtain a fusion result. The specific situation can be determined according to actual situations, and the embodiment of the present specification does not limit the specific situation.
In this embodiment, each dimension of the target classification result vector represents the probability that the text to be classified belongs to one text category, and the probability values in the target classification result vector may be sorted in descending order. In some embodiments, the category with the highest probability value may be taken as the category of the text to be classified, or the categories corresponding to the top preset number of probability values may be taken as the categories of the text to be classified. The specific manner can be determined according to the actual situation and is not limited by the embodiments of this specification.
In this embodiment, the preset number may be a positive integer, for example 3 or 5, which can be determined according to the actual situation and is not limited by the embodiments of this specification.
In some embodiments, the word classification models and the character classification models may also be fused to determine the final classification result. According to the target character vector, a first character classification model, a second character classification model, a third character classification model, a fourth character classification model and a fifth character classification model are used to obtain corresponding classification result vectors: a sixth classification result vector, a seventh classification result vector, an eighth classification result vector, a ninth classification result vector and a tenth classification result vector. The first character classification model, the second character classification model, the third character classification model, the fourth character classification model and the fifth character classification model are machine learning models for text classification pre-trained with TextCNN, GRU, RNN, RCNN and Attention-GRU, respectively.
In this embodiment, the first classification result vector, the second classification result vector, the third classification result vector, the fourth classification result vector, the fifth classification result vector, the sixth classification result vector, the seventh classification result vector, the eighth classification result vector, the ninth classification result vector, and the tenth classification result vector may be added to obtain a fused classification result vector, and the fused classification result vector is used to determine the classification of the text to be classified.
In an embodiment, before determining a classification result corresponding to each classification model according to the target word vector and the plurality of classification models, the method may further include: acquiring an initial text data set; and preprocessing the initial text data set to obtain a word vector training sample set and a character vector training sample set. Further, a plurality of word classification models can be trained with TextCNN, GRU, RNN, RCNN and Attention-GRU based on the word vector training sample set, and a plurality of character classification models can be trained with TextCNN, GRU, RNN, RCNN and Attention-GRU based on the character vector training sample set.
In this embodiment, the word vector training sample set may include a plurality of word vector training samples, each of which may include a word vector and a classification label; the character vector training sample set may include a plurality of character vector training samples, each of which may include a character vector and a classification label. The classification label may be used to indicate the category to which a text belongs, and a classification label may contain one or more text categories, which can be determined according to the actual situation and is not limited by the embodiments of this specification.
In this embodiment, five different word classification models and five different character classification models can be obtained using the five model types. TextCNN may use more convolution kernels together with BatchNorm (which accelerates neural network training), apply an additional convolution on top of the original convolution, and use two fully-connected layers for classification. The RNN may adopt a Bi-LSTM (bidirectional long short-term memory network), and the outputs of all hidden units are first passed through K-MaxPooling before classification. RCNN concatenates (concat) the output of the Embedding layer. Here, K-MaxPooling generalizes the original Max Pooling Over Time: Max Pooling Over Time is the most common down-sampling operation in CNN models, where among the feature values extracted by a certain filter only the value with the highest score is kept by the pooling layer and all other feature values are discarded, i.e., only the strongest feature is retained and the weaker ones are dropped; K-MaxPooling instead keeps the K strongest values. The concat operation joins two or more arrays.
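The following is a hedged PyTorch sketch of K-MaxPooling over time as described above: instead of keeping only the single strongest value per filter, the K strongest values are kept. The tensor shapes and the value of k are illustrative, and sorting the kept values by magnitude rather than by position is a simplification.

```python
import torch

def k_max_pooling(x, k):
    # x: (batch, channels, time) feature maps, e.g. convolution or Bi-LSTM outputs.
    topk, _ = torch.topk(x, k, dim=2)    # keep the k strongest values along the time axis
    return topk.reshape(x.size(0), -1)   # flatten to (batch, channels * k)

features = torch.randn(8, 64, 50)        # batch of 8, 64 filters, 50 time steps
print(k_max_pooling(features, 3).shape)  # torch.Size([8, 192]); k=1 is Max Pooling Over Time
```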
In this embodiment, a preset training strategy may be used to train the models, and an appropriate training strategy can inhibit overfitting. In one embodiment, the training strategy may be as follows (a code sketch follows the steps below):
Step one: at the start of training, set the learning rate of the Embedding layer to 0 and the learning rate of the other layers to 1e-3, and use an Adam optimizer (initially the convolutional layers are randomly initialized, and the gradient of the Embedding layer obtained by back-propagation is influenced by the convolutional layers and is effectively noise). Adam is a first-order optimization algorithm that can replace traditional stochastic gradient descent and iteratively updates the weights of a neural network based on the training data;
Step two: after training 1-2 epochs, set the learning rate of the Embedding layer to 2e-4, where one epoch means training once on all samples in the training sample set, i.e., the epoch count is the number of times the whole training sample set has been passed through;
Step three: compute a score on the validation set every half epoch or every epoch (F1-score can be selected as the scoring criterion here);
Step four: determine whether the score has risen;
Step five: if the score has risen, save the model and record the save path;
Step six: if the score has fallen, load the previously saved model from its save path, halve the learning rate, and at the same time re-initialize the Adam optimizer to clear its momentum information.
In this embodiment, the same training strategy may be used for TextCNN, GRU, RNN, RCNN, Attention-GRU.
In one embodiment, preprocessing the initial text data set to obtain a word vector training sample set and a character vector training sample set may include: truncating and padding each text data in the initial text data set to obtain a target text data set; and performing text representation on each text data in the target text data set with a first text representation model, a second text representation model and a third text representation model respectively to obtain an initial word vector set and an initial character vector set, where the initial word vector set includes a plurality of groups of word vectors, each group containing the word vectors produced by the three text representation models, and the initial character vector set includes a plurality of groups of character vectors, each group containing the character vectors produced by the three text representation models. Further, each group of word vectors in the initial word vector set can be spliced to obtain the word vector training sample set, and each group of character vectors in the initial character vector set can be spliced to obtain the character vector training sample set.
In this embodiment, since different texts may have different lengths, each text data in the initial text data set may be truncated or padded in order to standardize the length of the text input into the text representation models. A model generally requires input matrices of equal size, so before input the length of each numerically mapped text can be standardized: a reasonable length is chosen based on an analysis of the text length distribution, over-long texts are truncated, and texts that are too short are padded.
In this embodiment, the length may be set in advance, and its specific value may be determined according to the overall situation of the text data in the initial text data set; it may be set to a length that covers 95% of the word and character data. For example: if the text lengths are (16, 14, 17, 16, 14, 15, 18, 15, 18, 19, 20), then covering 95% of the 11 texts means that about 10 texts should fall within the chosen length, so a length of 19 can be selected to meet the coverage requirement.
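A small sketch of choosing the standard length from the length distribution, assuming "95% coverage" is read as the 95th percentile of the observed text lengths; with the example lengths above, numpy's default interpolation gives 20, while picking 19 (as in the text) already leaves only one of the eleven texts over the limit.

```python
import numpy as np

lengths = np.array([16, 14, 17, 16, 14, 15, 18, 15, 18, 19, 20])
x = int(np.ceil(np.percentile(lengths, 95)))   # 95th-percentile length
coverage = float(np.mean(lengths <= x))        # fraction of texts within length x
print(x, coverage)                             # 20 1.0 for this sample
```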
In this embodiment, truncating and padding each text data in the initial text data set may include: setting the length to x, directly removing the portion of a text exceeding length x, and padding a text shorter than x with 0 up to length x, thereby ensuring that every text in the target text data set has length x.
In this embodiment, the three text representation models may be used to determine the word vector and the character vector of each text data in the target text data set. For example: word2vec can produce a 100-dimensional vector for the word data and a 100-dimensional vector for the character data; GloVe can produce a 100-dimensional vector for the word data and a 100-dimensional vector for the character data; and fastText can produce a 100-dimensional vector for the word data and a 100-dimensional vector for the character data. The three 100-dimensional word vectors are spliced to obtain a 300-dimensional word vector, and the three 100-dimensional character vectors are spliced to obtain a 300-dimensional character vector, thereby obtaining the word vector training sample set and the character vector training sample set. Of course, the way of determining the word vector training sample set and the character vector training sample set is not limited to the above examples, and other modifications are possible for those skilled in the art in light of the technical spirit of the embodiments of the present disclosure; as long as the achieved functions and effects are the same as or similar to the embodiments of the present disclosure, they are intended to be covered by the scope of the embodiments of the present disclosure.
Based on the same inventive concept, the embodiment of the present specification further provides a text classification device, as described in the following embodiments. Because the principle of the text classification device for solving the problems is similar to the text classification method, the implementation of the text classification device can refer to the implementation of the text classification method, and repeated details are not repeated. As used hereinafter, the term "unit" or "module" may be a combination of software and/or hardware that implements a predetermined function. Although the means described in the embodiments below are preferably implemented in software, an implementation in hardware, or a combination of software and hardware is also possible and contemplated. Fig. 2 is a block diagram of a structure of a text classification apparatus according to an embodiment of the present disclosure, and as shown in fig. 2, the text classification apparatus may include: the system comprises an acquisition module 201, a preprocessing module 202, a determination module 203 and a fusion module 204, and the structure is explained below.
The obtaining module 201 may be configured to obtain a text to be classified;
the preprocessing module 202 may be configured to preprocess the text to be classified to obtain a target word vector;
the determining module 203 may be configured to determine a classification result corresponding to each classification model according to the target word vector and the plurality of classification models; the classification models are machine learning models which are obtained by pre-training textCNN, GRU, RNN, RCNN and Attention-GRU respectively and are used for text classification;
the fusion module 204 may be configured to perform probability-equal-weight fusion on the classification results corresponding to the classification models to obtain categories of the texts to be classified.
The embodiment of the present specification further provides an electronic device, which may specifically refer to a schematic structural diagram of an electronic device based on the text classification method provided by the embodiment of the present specification, shown in fig. 3, where the electronic device may specifically include an input device 31, a processor 32, and a memory 33. The input device 31 may be specifically used for inputting a text to be classified. The processor 32 may be specifically configured to obtain a text to be classified; preprocessing the text to be classified to obtain a target word vector; determining a classification result corresponding to each classification model according to the target word vector and the plurality of classification models; the classification models are machine learning models which are obtained by pre-training textCNN, GRU, RNN, RCNN and Attention-GRU respectively and are used for text classification; and performing probability equal-weight fusion on the classification results corresponding to the classification models to obtain the classes of the texts to be classified. The memory 33 may be specifically configured to store parameters such as categories of texts to be classified.
In this embodiment, the input device may be one of the main apparatuses for information exchange between a user and a computer system. The input device may include a keyboard, a mouse, a camera, a scanner, a light pen, a handwriting input board, a voice input device, etc.; the input device is used to input raw data and a program for processing the data into the computer. The input device can also acquire and receive data transmitted by other modules, units and devices. The processor may be implemented in any suitable way. For example, the processor may take the form of, for example, a microprocessor or processor and a computer-readable medium that stores computer-readable program code (e.g., software or firmware) executable by the (micro) processor, logic gates, switches, an Application Specific Integrated Circuit (ASIC), a programmable logic controller, an embedded microcontroller, and so forth. The memory may in particular be a memory device used in modern information technology for storing information. The memory may include multiple levels, and in a digital system, the memory may be any memory as long as it can store binary data; in an integrated circuit, a circuit without a physical form and with a storage function is also called a memory, such as a RAM, a FIFO and the like; in the system, the storage device in physical form is also called a memory, such as a memory bank, a TF card and the like.
In this embodiment, the functions and effects specifically realized by the electronic device can be explained by comparing with other embodiments, and are not described herein again.
Embodiments of the present specification further provide a computer storage medium based on a text classification method, where the computer storage medium stores computer program instructions, and when the computer program instructions are executed, the computer storage medium may implement: acquiring a text to be classified; preprocessing the text to be classified to obtain a target word vector; determining a classification result corresponding to each classification model according to the target word vector and the plurality of classification models; the classification models are machine learning models which are obtained by pre-training textCNN, GRU, RNN, RCNN and Attention-GRU respectively and are used for text classification; and performing probability equal-weight fusion on the classification results corresponding to the classification models to obtain the classes of the texts to be classified.
In this embodiment, the storage medium includes, but is not limited to, a Random Access Memory (RAM), a Read-Only Memory (ROM), a Cache (Cache), a Hard Disk Drive (HDD), or a Memory Card (Memory Card). The memory may be used to store computer program instructions. The network communication unit may be an interface for performing network connection communication, which is set in accordance with a standard prescribed by a communication protocol.
In this embodiment, the functions and effects specifically realized by the program instructions stored in the computer storage medium can be explained by comparing with other embodiments, and are not described herein again.
It will be apparent to those skilled in the art that the modules or steps of the embodiments of the present specification described above may be implemented by a general purpose computing device, they may be centralized on a single computing device or distributed over a network of multiple computing devices, and alternatively, they may be implemented by program code executable by a computing device, such that they may be stored in a storage device and executed by a computing device, and in some cases, the steps shown or described may be performed in an order different from that described herein, or they may be separately fabricated into individual integrated circuit modules, or multiple ones of them may be fabricated into a single integrated circuit module. Thus, embodiments of the present description are not limited to any specific combination of hardware and software.
Although the embodiments herein provide the method steps as described in the above embodiments or flowcharts, more or fewer steps may be included in the method based on conventional or non-inventive efforts. In the case of steps where no causal relationship is logically necessary, the order of execution of the steps is not limited to that provided by the embodiments of the present description. When the method is executed in an actual device or end product, the method can be executed sequentially or in parallel according to the embodiment or the method shown in the figure (for example, in the environment of a parallel processor or a multi-thread processing).
It is to be understood that the above description is intended to be illustrative, and not restrictive. Many embodiments and many applications other than the examples provided will be apparent to those of skill in the art upon reading the above description. The scope of embodiments of the present specification should, therefore, be determined not with reference to the above description, but should instead be determined with reference to the appended claims, along with the full scope of equivalents to which such claims are entitled.
The above description is only a preferred embodiment of the embodiments of the present disclosure, and is not intended to limit the embodiments of the present disclosure, and it will be apparent to those skilled in the art that various modifications and variations can be made in the embodiments of the present disclosure. Any modification, equivalent replacement, improvement and the like made within the spirit and principle of the embodiments of the present disclosure should be included in the protection scope of the embodiments of the present disclosure.

Claims (11)

1. A method of text classification, comprising:
acquiring a text to be classified;
preprocessing the text to be classified to obtain a target word vector;
determining a classification result corresponding to each classification model according to the target word vector and the plurality of classification models; the classification models are machine learning models which are obtained by pre-training textCNN, GRU, RNN, RCNN and Attention-GRU respectively and are used for text classification;
and performing probability equal-weight fusion on the classification results corresponding to the classification models to obtain the classes of the texts to be classified.
2. The method of claim 1, wherein preprocessing the text to be classified to obtain a target word vector comprises:
truncating and filling the text to be classified to obtain a target text;
according to the target text, a first word vector, a second word vector and a third word vector are obtained by respectively utilizing a first text representation model, a second text representation model and a third text representation model;
and splicing the first word vector, the second word vector and the third word vector to obtain the target word vector.
3. The method of claim 2, wherein the first, second, and third word vectors have the same vector dimensions.
4. The method of claim 2, wherein the first, second and third text representation models are respectively: word2vec, glove, fastText.
5. The method according to claim 2, wherein after preprocessing the text to be classified to obtain a target word vector, further comprising:
according to the target text, a first character vector, a second character vector and a third character vector are obtained by respectively utilizing the first text representation model, the second text representation model and the third text representation model;
and splicing the first character vector, the second character vector and the third character vector to obtain a target character vector.
6. The method of claim 1, wherein determining a classification result corresponding to each classification model according to the target word vector and the plurality of classification models comprises:
according to the target word vector, obtaining a first classification result vector, a second classification result vector, a third classification result vector, a fourth classification result vector and a fifth classification result vector by utilizing a first word classification model, a second word classification model, a third word classification model, a fourth word classification model and a fifth word classification model; wherein each dimension data in the classification result vector represents the probability of belonging to a text classification;
correspondingly, performing probability equal-weight fusion on the classification results corresponding to the classification models to obtain the classes of the texts to be classified, including:
adding the first classification result vector, the second classification result vector, the third classification result vector, the fourth classification result vector and the fifth classification result vector to obtain a target classification result vector;
and taking the categories corresponding to the top preset number of probability values in the target classification result vector as the category of the text to be classified.
7. The method of claim 1, before determining a classification result corresponding to each classification model according to the target word vector and the plurality of classification models, further comprising:
acquiring an initial text data set;
preprocessing the initial text data set to obtain a word vector training sample set and a character vector training sample set;
training by using TextCNN, GRU, RNN, RCNN and Attention-GRU to obtain a plurality of word classification models based on the word vector training sample set;
and training by using TextCNN, GRU, RNN, RCNN and Attention-GRU to obtain a plurality of character classification models based on the character vector training sample set.
8. The method of claim 7, wherein preprocessing the initial text data set to obtain a word vector training sample set and a character vector training sample set comprises:
truncating and filling each text data in the initial text data set to obtain a target text data set;
respectively performing text representation on each text data in the target text data set by using a first text representation model, a second text representation model and a third text representation model to obtain an initial word vector set and an initial character vector set; the initial word vector set comprises a plurality of groups of word vectors, each group of word vectors comprising the word vectors corresponding to the three text representation models, and the initial character vector set comprises a plurality of groups of character vectors, each group of character vectors comprising the character vectors corresponding to the three text representation models;
splicing each group of word vectors in the initial word vector set to obtain a word vector training sample set;
and splicing each group of word vectors in the initial word vector set to obtain a word vector training sample set.
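The truncating and padding step of claim 8 brings every text to the same token length before text representation. A minimal sketch, assuming a whitespace tokenizer, a fixed length of 50 tokens and a literal "<pad>" token, all of which are illustrative assumptions:

```python
MAX_LEN = 50         # assumed fixed sequence length
PAD_TOKEN = "<pad>"  # assumed padding token

def truncate_or_pad(tokens, max_len=MAX_LEN, pad=PAD_TOKEN):
    """Truncate long token sequences and pad short ones to max_len."""
    if len(tokens) >= max_len:
        return tokens[:max_len]
    return tokens + [pad] * (max_len - len(tokens))

def build_target_text_data_set(texts):
    # Each raw text is tokenized (here: a naive whitespace split) and brought
    # to a uniform length before being fed to the representation models.
    return [truncate_or_pad(t.split()) for t in texts]

target = build_target_text_data_set(["a short text", "another text to classify"])
print(len(target[0]), len(target[1]))  # 50 50
```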
9. A text classification apparatus, comprising:
the acquisition module is used for acquiring texts to be classified;
the preprocessing module is used for preprocessing the text to be classified to obtain a target word vector;
the determining module is used for determining a classification result corresponding to each classification model according to the target word vector and the plurality of classification models; wherein the classification models are machine learning models for text classification obtained by pre-training TextCNN, GRU, RNN, RCNN and Attention-GRU respectively;
and the fusion module is used for performing probability equal-weight fusion on the classification results corresponding to the classification models to obtain the category of the text to be classified.
10. A text classification device comprising a processor and a memory for storing processor-executable instructions which, when executed by the processor, implement the steps of the method of any one of claims 1 to 8.
11. A computer-readable storage medium having stored thereon computer instructions which, when executed, implement the steps of the method of any one of claims 1 to 8.
CN202110880080.4A 2021-08-02 2021-08-02 Text classification method, device and equipment Pending CN113535960A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110880080.4A CN113535960A (en) 2021-08-02 2021-08-02 Text classification method, device and equipment

Publications (1)

Publication Number Publication Date
CN113535960A true CN113535960A (en) 2021-10-22

Family

ID=78090056

Country Status (1)

Country Link
CN (1) CN113535960A (en)

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109299262A (en) * 2018-10-09 2019-02-01 中山大学 A kind of text implication relation recognition methods for merging more granular informations
CN109376240A (en) * 2018-10-11 2019-02-22 平安科技(深圳)有限公司 A kind of text analyzing method and terminal
CN110134793A (en) * 2019-05-28 2019-08-16 电子科技大学 Text sentiment classification method
CN110209805A (en) * 2018-04-26 2019-09-06 腾讯科技(深圳)有限公司 File classification method, device, storage medium and computer equipment
CN110609897A (en) * 2019-08-12 2019-12-24 北京化工大学 Multi-category Chinese text classification method fusing global and local features
WO2020147393A1 (en) * 2019-01-17 2020-07-23 平安科技(深圳)有限公司 Convolutional neural network-based text classification method, and related device
CN113011533A (en) * 2021-04-30 2021-06-22 平安科技(深圳)有限公司 Text classification method and device, computer equipment and storage medium

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114118817A (en) * 2021-11-30 2022-03-01 济南农村商业银行股份有限公司 Bank sunshine loan-handling loan examination and dispatching method, device and system
CN114118817B (en) * 2021-11-30 2022-08-05 济南农村商业银行股份有限公司 Bank loan examination order dispatching method, device and system
CN116992033A (en) * 2023-09-25 2023-11-03 北京中关村科金技术有限公司 Text classification threshold determining method, text classification method and related device
CN116992033B (en) * 2023-09-25 2023-12-08 北京中关村科金技术有限公司 Text classification threshold determining method, text classification method and related device

Similar Documents

Publication Publication Date Title
CN113011533B (en) Text classification method, apparatus, computer device and storage medium
CN110298037B (en) Convolutional neural network matching text recognition method based on enhanced attention mechanism
CN111738003B (en) Named entity recognition model training method, named entity recognition method and medium
CN111427995B (en) Semantic matching method, device and storage medium based on internal countermeasure mechanism
CN111709242B (en) Chinese punctuation mark adding method based on named entity recognition
CN112749274B (en) Chinese text classification method based on attention mechanism and interference word deletion
CN113505200B (en) Sentence-level Chinese event detection method combined with document key information
CN109471944A (en) Training method, device and the readable storage medium storing program for executing of textual classification model
CN112507039A (en) Text understanding method based on external knowledge embedding
CN112819023A (en) Sample set acquisition method and device, computer equipment and storage medium
CN111522908A (en) Multi-label text classification method based on BiGRU and attention mechanism
CN112148831B (en) Image-text mixed retrieval method and device, storage medium and computer equipment
CN112149410A (en) Semantic recognition method and device, computer equipment and storage medium
CN115130538A (en) Training method of text classification model, text processing method, equipment and medium
CN113535960A (en) Text classification method, device and equipment
CN113051887A (en) Method, system and device for extracting announcement information elements
CN115952292B (en) Multi-label classification method, apparatus and computer readable medium
CN116304020A (en) Industrial text entity extraction method based on semantic source analysis and span characteristics
CN115269833A (en) Event information extraction method and system based on deep semantics and multitask learning
CN110852071A (en) Knowledge point detection method, device, equipment and readable storage medium
CN112765353B (en) Scientific research text-based biomedical subject classification method and device
CN117291192B (en) Government affair text semantic understanding analysis method and system
CN113486143A (en) User portrait generation method based on multi-level text representation and model fusion
CN117272142A (en) Log abnormality detection method and system and electronic equipment
CN110334204A (en) A kind of exercise similarity calculation recommended method based on user record

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination