
WO2022227207A1 - Text classification method, apparatus, computer device, and storage medium - Google Patents


Info

Publication number
WO2022227207A1
WO2022227207A1 (PCT/CN2021/097195)
Authority
WO
WIPO (PCT)
Prior art keywords
text
classification
training
target
model
Prior art date
Application number
PCT/CN2021/097195
Other languages
French (fr)
Chinese (zh)
Inventor
刘翔
谷坤
Original Assignee
平安科技(深圳)有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 平安科技(深圳)有限公司 (Ping An Technology (Shenzhen) Co., Ltd.)
Publication of WO2022227207A1 publication Critical patent/WO2022227207A1/en

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/205Parsing
    • G06F40/211Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/284Lexical analysis, e.g. tokenisation or collocates
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Definitions

  • the present application relates to the technical field of natural language processing, and in particular, to text classification methods, apparatuses, computer equipment and storage media.
  • a text classification method refers to determining a category for each document in a document collection according to a predefined topic category.
  • text classification technology has a wide range of applications in daily life, for example, the technical-field division of patent texts.
  • Compared with general texts, patent texts have a special structure, strong professionalism, and more domain-specific vocabulary, so a more targeted classification method needs to be adopted.
  • the patent text classification method belongs to the field of natural language processing, and generally includes steps such as data preprocessing, text feature representation, classifier selection and effect evaluation. Among them, text feature representation and classifier selection are the most important, which will directly affect the accuracy of classification results.
  • text classification methods based on traditional machine learning such as the TF-IDF text classification method
  • the TF-IDF text classification method measures the importance of words only by word frequency, and the resulting feature value sequences of documents are independent of each other and cannot reflect sequence information; it is also easily affected by skew in the dataset, for example, too many documents in a certain category will lead to underestimation of the IDF, and the usual remedy is to increase the category weight.
  • the intra- and inter-class distribution biases are not considered.
  • Text classification methods based on deep learning such as Facebook's open source FastText text classification method, Text-CNN text classification method, Text-RNN text classification method, etc.
  • TextCNN performs well in many tasks, but the biggest problem of CNN is its fixed filter_size field of view: on the one hand, it cannot model longer sequence information, and on the other hand, hyperparameter tuning of filter_size is very cumbersome.
  • the essence of CNN is to perform feature expression of text, while the recurrent neural network (RNN, Recurrent Neural Network) is more commonly used in natural language processing and can better express context information.
  • when CNN and RNN are used in text classification tasks, they have the disadvantage of not being intuitive enough, and their interpretability is poor, especially when analyzing bad cases.
  • the present application provides a text classification method, apparatus, computer equipment and storage medium.
  • a first aspect provides a text classification method, the method comprising:
  • the word segmentation result is input into the trained text classification model; the text classification model obtains the target character vector, target word vector and target position vector corresponding to the target text data based on the word segmentation result, and obtains the target classification label of the target text data based on the target character vector, the target word vector and the target position vector; wherein, the text classification model is a trained albert model.
  • a second aspect provides a text classification device, including:
  • the target text acquisition module is used to extract the target text data to be analyzed from the original text
  • a word segmentation module for preprocessing the target text data to obtain a word segmentation result of the target text data
  • the classification module is used to input the word segmentation result into a trained text classification model; the text classification model obtains the target character vector, target word vector and target position vector corresponding to the target text data based on the word segmentation result, and obtains the target classification label of the target text data based on these three vectors; wherein, the text classification model is a trained albert model.
  • a third aspect provides a computer device, including a memory and a processor, where computer-readable instructions are stored in the memory, and when the computer-readable instructions are executed by the processor, cause the processor to perform the following steps:
  • the word segmentation result is input into the trained text classification model; the text classification model obtains the target character vector, target word vector and target position vector corresponding to the target text data based on the word segmentation result, and obtains the target classification label of the target text data based on these three vectors; wherein, the text classification model is a trained albert model.
  • a fourth aspect provides a storage medium storing computer-readable instructions, which, when executed by one or more processors, cause the one or more processors to perform the following steps:
  • the word segmentation result is input into the trained text classification model; the text classification model obtains the target character vector, target word vector and target position vector corresponding to the target text data based on the word segmentation result, and obtains the target classification label of the target text data based on these three vectors; wherein, the text classification model is a trained albert model.
  • in the above text classification method, device, computer equipment and storage medium, first, the target text data to be analyzed is extracted from the original text; secondly, the target text data is preprocessed to obtain its word segmentation result; then, the word segmentation result is input into the trained text classification model, which obtains the target character vector, target word vector and target position vector corresponding to the target text data and, based on these three vectors, obtains the target classification label of the target text data.
  • the albert model is used to process the text data, and the obtained word vector sequence contains both the text information and the context information of the text data; it therefore integrates full-text semantic information, contains more comprehensive text information, and is more conducive to subsequent text classification, which helps to improve the accuracy and effect of classification.
  • Fig. 1 is the implementation environment diagram of the text classification method provided in one embodiment
  • FIG. 2 is a block diagram of the internal structure of a computer device in one embodiment
  • FIG. 3 is a flowchart of a text classification method in one embodiment
  • FIG. 4 is a structural block diagram of a text classification apparatus in one embodiment.
  • FIG. 1 is an implementation environment diagram of the text classification method provided in one embodiment. As shown in FIG. 1 , the implementation environment includes a computer device 110 and a terminal 120 .
  • the computer device 110 is a text classification server
  • the terminal 120 is a text acquisition device to be classified, and has a text classification result output interface.
  • when text classification is required, the text to be classified is obtained through the terminal 120 and classified through the computer device 110.
  • the terminal 120 and the computer device 110 may each be a smart phone, a tablet computer, a notebook computer, a desktop computer, etc., but are not limited thereto.
  • the computer device 110 and the terminal 120 may be connected through Bluetooth, USB (Universal Serial Bus) or other communication connection methods, which are not limited in this application.
  • FIG. 2 is a schematic diagram of the internal structure of a computer device in one embodiment.
  • the computer device includes a processor, a storage medium, a memory, and a network API interface connected through a system bus.
  • the storage medium of the computer device stores an operating system, a database and computer-readable instructions
  • the database can store a control information sequence
  • when the computer-readable instructions are executed by the processor, the processor implements a text classification method.
  • the processor of the computer device is used to provide computing and control capabilities and support the operation of the entire computer device.
  • the computer device may have computer readable instructions stored in the memory that, when executed by the processor, cause the processor to perform a text classification method.
  • the network API interface of the computer device is used to communicate with the terminal connection.
  • FIG. 2 is only a block diagram of a partial structure related to the solution of the present application and does not constitute a limitation on the computer equipment to which the solution is applied; a specific computer device may include more or fewer components than shown in the figure, combine certain components, or have a different arrangement of components.
  • the albert model: a language model released by Google in 2018 that trains deep bidirectional representations by jointly using bidirectional transformers in all layers.
  • the albert model combines the advantages of many natural language processing models, and has achieved better results in many natural language processing tasks.
  • the model input vector of the albert model is the sum of the vectors of a word vector (TokenEmbedding), a position vector (PositionEmbedding) and a sentence vector (SegmentEmbedding).
  • the word vector is the vectorized representation of the text
  • the position vector is used to represent the position of the word in the text
  • the sentence vector is used to represent the sequence of sentences in the text.
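As a minimal sketch of how the three embeddings above combine (the dimensions and values are toy assumptions, not real ALBERT weights), the model input for each token is simply the elementwise sum of its word, position and sentence vectors:

```python
def input_vector(token_emb, pos_emb, seg_emb):
    """Elementwise sum of TokenEmbedding, PositionEmbedding and
    SegmentEmbedding for one token position."""
    assert len(token_emb) == len(pos_emb) == len(seg_emb)
    return [t + p + s for t, p, s in zip(token_emb, pos_emb, seg_emb)]

# Toy 2-dimensional embeddings for a single token (illustrative values only).
vec = input_vector([1, 2], [10, 20], [100, 200])
print(vec)  # [111, 222]
```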
  • Pre-training: a process in which a neural network model learns common features by being trained on a large dataset.
  • the purpose of pre-training is to provide high-quality model parameters for subsequent neural network model training on a specific dataset.
  • the pre-training in this embodiment of the present application refers to the process of training the albert model by using unlabeled training text.
  • Fine-tuning: a process of further training a pretrained neural network model on a specific dataset.
  • the data volume of the data set used in the fine-tuning stage is smaller than that of the data set used in the pre-training stage, and the fine-tuning stage adopts a supervised learning method, that is, the training samples in the data set used in the fine-tuning stage contain annotation information.
  • the fine-tuning stage in the embodiment of the present application refers to training the albert model by using the training text containing the classification labels.
  • Natural language processing is an important direction in the fields of computer science and artificial intelligence. It studies theories and methods for effective communication between humans and computers using natural language. Natural language processing is a science that integrates linguistics, computer science and mathematics; research in this field involves natural language, the language people use daily, so it is closely related to the study of linguistics. Natural language processing technology usually includes text processing, semantic understanding, machine translation, robot question answering, knowledge graphs and other technologies.
  • a text classification method is proposed, and the text classification method can be applied to the above-mentioned computer device 110, and may specifically include the following steps:
  • Step 101 extracting target text data to be analyzed from the original text
  • Patent text classification belongs to the field of natural language processing, and generally includes steps such as data preprocessing, text feature representation, classifier selection, and effect evaluation. Among them, text feature representation and classifier selection are the most important, which will directly affect the accuracy of classification results.
  • the text data of the abstract, the claims and the title of the description in the patent text are extracted as the target text data.
  • Step 102 preprocessing the target text data to obtain the word segmentation result of the target text data
  • the purpose of preprocessing the target text data is to extract useful data from the original text data and to delete the noise data in the original text, so that text data irrelevant to the extraction purpose can be removed.
  • the above step 102 may include: performing at least one of stop word removal and deduplication on the target text data to obtain second text data, and performing a word segmentation operation on the second text data to obtain the word segmentation result.
  • duplicate data in the original text data is removed by deduplication, and other noise data is removed by deletion, so that the noise data in the original text data can be eliminated.
  • Stop words refer to certain characters or words that are automatically filtered out before or after processing natural language text, in order to save storage space and improve search efficiency in information retrieval.
  • removing stop words removes words in the natural language text that do not contribute to text features, such as punctuation, tone words, person words, meaningless garbled characters and spaces.
  • the method selected for removing stop words is stop word list filtering: the words in the text data are matched one by one against a constructed stop word list; if a match succeeds, the word is a stop word and is deleted.
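The stop word list filtering described above can be sketched in a few lines (the stop word list here is a made-up toy example, not a standard list):

```python
STOP_WORDS = {"the", "a", "of", ",", "."}  # hypothetical toy stop word list

def remove_stop_words(tokens):
    """Drop every token that matches an entry of the stop word list."""
    return [t for t in tokens if t not in STOP_WORDS]

print(remove_stop_words(["a", "method", "of", "text", "classification", "."]))
# ['method', 'text', 'classification']
```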
  • Word segmentation is a basic task in lexical analysis. Segmentation algorithms fall into two categories according to their core idea: dictionary-based segmentation, which first divides text data into words according to a dictionary and then finds the best combination of words; and character-based segmentation, which first divides the sentence into characters and then combines the characters into words, looking for the optimal segmentation strategy; the latter can also be cast as a sequence labeling problem.
  • the word segmentation algorithm used when performing word segmentation in this embodiment may include: a rule-based word segmentation method, an understanding-based word segmentation method, or a statistics-based word segmentation method.
  • the rule-based word segmentation method (such as the word segmentation method based on string matching) matches the Chinese character string to be analyzed against the entries of a "sufficiently large" dictionary according to a certain strategy; if an entry is found in the string, the match is successful (a word is recognized).
  • Commonly used rule-based word segmentation methods include: the forward maximum matching method (from left to right); the reverse maximum matching method (from right to left); and minimum segmentation (minimizing the number of words cut out of each sentence).
  • the forward maximum matching method separates a substring of limited length from the string and matches it against the words in the dictionary. If the match succeeds, the next round of matching is performed until the whole string is processed; otherwise one character is removed from the end of the substring and matching is retried, and so on.
  • the reverse maximum matching rule is similar to this forward maximum matching method.
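A minimal sketch of the forward maximum matching method above, using a toy dictionary over Latin characters in place of a Chinese one (an assumption for illustration):

```python
def forward_max_match(text, dictionary, max_len=4):
    """Scan left to right; at each position take the longest dictionary
    entry that matches, falling back to a single character."""
    result, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in dictionary:
                result.append(piece)  # single chars pass even if unknown
                i += size
                break
    return result

print(forward_max_match("abcd", {"ab", "abc", "cd"}))  # ['abc', 'd']
```

Reverse maximum matching would scan from the right end instead, removing characters from the front of the substring on a failed match.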
  • the word segmentation method based on comprehension is to achieve the effect of word recognition by letting the computer simulate the human understanding of the sentence.
  • the basic idea of comprehension-based word segmentation is to perform syntactic and semantic analysis at the same time as word segmentation, and to use syntactic and semantic information to deal with ambiguity.
  • Statistical word segmentation: formally, words are stable combinations of characters, so the more often adjacent characters co-occur in context, the more likely they are to form a word. The frequency or probability of adjacent co-occurrence therefore reflects the credibility of a word. By counting the frequency of adjacent co-occurring character combinations in the text data, their mutual information can be calculated; the mutual information reflects how tightly Chinese characters combine.
  • a statistical word segmentation system can use a basic segmentation dictionary for string-matching segmentation while using statistical methods to identify new words, i.e. combine string frequency statistics with string matching; this exploits both the speed and efficiency of dictionary matching and the advantages of dictionary-free segmentation, namely new-word recognition from context and automatic disambiguation.
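The statistical idea above, that adjacent co-occurrence reflects word credibility, can be sketched on a toy corpus; the mutual information here is the pointwise mutual information of an adjacent character pair, and all values are illustrative assumptions:

```python
import math
from collections import Counter

def pair_pmi(corpus, pair):
    """Pointwise mutual information of an adjacent character pair:
    log( P(xy) / (P(x) * P(y)) ). High values suggest a word candidate."""
    chars = Counter(corpus)
    pairs = Counter(corpus[i:i + 2] for i in range(len(corpus) - 1))
    p_xy = pairs[pair] / (len(corpus) - 1)
    p_x = chars[pair[0]] / len(corpus)
    p_y = chars[pair[1]] / len(corpus)
    return math.log(p_xy / (p_x * p_y))

# 'ab' co-occurs constantly in this toy corpus, so its PMI is positive.
print(pair_pmi("abababab", "ab") > 0)  # True
```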
  • the original text data is represented by a series of keywords, but data in this text form cannot be directly processed by the subsequent classification algorithm and must be converted into numerical form, so these keywords need to be converted into word vectors to obtain the text data to be classified in text-vector form.
  • Step 103 input the word segmentation result into the trained text classification model; the text classification model obtains the target character vector, target word vector and target position vector corresponding to the target text data based on the word segmentation result, and obtains the target classification label of the target text data based on these three vectors; wherein, the text classification model is a trained albert model.
  • the word segmentation results are input into the pre-trained text classification model.
  • the text classification model is a pre-trained albert model. In order to enable the albert model to learn not only the contextual relationships of the text but also the mapping relationship between pinyin and text, the embodiment of the present application uses vectors of three kinds when training the albert model, namely a character vector, a word vector and a position vector.
  • the word vector is obtained by converting the text with a word2vec (word to vector) model.
  • step 103 may include the following steps:
  • Step 1031 Acquire a word vector corresponding to the text data according to the part-of-speech and location information of the text data.
  • position information is added to the text data using position coding, and the text data with position information added is represented by an initial word vector; the part of speech of the text data is obtained and converted into a part-of-speech vector; the initial word vector and the part-of-speech vector are added to obtain the word vector corresponding to the text data.
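Step 1031 can be sketched as follows (the part-of-speech embedding table and all values are toy assumptions): the initial word vector carrying position information is added elementwise to a part-of-speech vector.

```python
POS_EMBEDDINGS = {"noun": [1, 0], "verb": [0, 1]}  # hypothetical toy table

def word_vector(initial_vec, pos_tag):
    """Add the part-of-speech vector to the initial (position-coded) vector."""
    pos_vec = POS_EMBEDDINGS[pos_tag]
    return [a + b for a, b in zip(initial_vec, pos_vec)]

print(word_vector([2, 3], "noun"))  # [3, 3]
```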
  • Step 1032 Input the word vector into the albert model for data processing to obtain a word matrix of the text data.
  • Step 1033 Acquire a word vector sequence of the text data according to the word matrix.
  • the word matrix is used to predict whether two sentences in the text data are consecutive, to predict the masked words in the two sentences and the part-of-speech features of the masked words; the part-of-speech features are normalized to obtain the word vector sequence of the text data.
  • the albert model used in this embodiment is a pre-trained model, so when processing text data, it is only necessary to input the text data into the pre-trained albert model to obtain its corresponding word vector sequence.
  • the classification category and quantity of the classifier are related to the classification task to be implemented by the text classification model, and the classifier may be a multi-class classifier (such as a softmax classifier).
  • the embodiment of the present application does not limit the specific type of the classifier.
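A softmax classifier head, as mentioned above, can be sketched in a few lines (the label names below are hypothetical examples, not from the patent):

```python
import math

def softmax(logits):
    """Numerically stable softmax: subtract the max before exponentiating."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def classify(logits, labels):
    """Return the label with the highest softmax probability."""
    probs = softmax(logits)
    return labels[probs.index(max(probs))]

labels = ["cloud computing", "image processing", "nlp"]  # hypothetical labels
print(classify([0.2, 1.5, 0.1], labels))  # image processing
```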
  • before the text data to be classified is extracted from the original text, the above-mentioned text classification method further includes:
  • Step 100a extracting keywords in the original text, and forming a keyword set
  • Step 100b determine the word frequency-inverse document frequency of the keyword set in the corpus of each category based on the TF-IDF model
  • the text features matching the keywords in the text features of the corpus of this category are determined, and the word frequency-inverse document frequency of the matched text features is used as the word frequency-inverse document frequency of the keyword.
  • using punctuation marks such as periods, question marks, exclamation marks and semicolons, the text in the corpus of a certain category is divided into several sentences, and the text features in each sentence are extracted.
  • a text feature library is established for each category. Under each category, count the frequency of each text feature.
  • the inverse document frequency of each text feature is counted, that is, the natural logarithm of the quotient of the total number of categories and the number of categories containing the text feature; under each category, the word frequency-inverse document frequency of each text feature is calculated separately.
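The per-category word frequency-inverse document frequency described above can be sketched as follows (toy corpora, an assumption for illustration), taking idf as the natural logarithm of total categories over categories containing the feature, as the text states:

```python
import math
from collections import Counter

def tf_idf_per_category(category_tokens):
    """category_tokens: {category: list of tokens}.
    Returns {category: {token: tf * ln(N_categories / df)}}."""
    n_cat = len(category_tokens)
    df = Counter()  # number of categories containing each feature
    for tokens in category_tokens.values():
        df.update(set(tokens))
    scores = {}
    for cat, tokens in category_tokens.items():
        tf = Counter(tokens)
        scores[cat] = {t: (c / len(tokens)) * math.log(n_cat / df[t])
                       for t, c in tf.items()}
    return scores

scores = tf_idf_per_category({"A": ["x", "x", "y"], "B": ["y", "z"]})
print(scores["A"]["y"])  # 0.0  ('y' appears in every category)
```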
  • Step 100c based on the word frequency-inverse document frequency of the keyword set of the original text in the corpus of each category, determine the confidence that the original text belongs to each category;
  • for each category, the following operations are respectively performed: determine the number of times the keywords appear in the corpus of the category; based on the number of occurrences, determine the class conditional probability of the original text relative to the category; and according to the class conditional probability of the original text relative to the category, determine the confidence that the text to be classified belongs to the category.
  • Step 100d according to the confidence that the original text belongs to each category, determine the first-level classification label of the original text
  • the category with the highest confidence level is used as the first-level classification label of the text to be classified.
  • Step 100e Match the first-level classification label with preset first-level classification label information, and determine whether to use a text classification model to perform text classification on the original text according to the matching result.
  • the target classification label obtained in the above steps 101 to 103 is the lowest-level classification label of the patent text.
  • the patent text has three levels of classification labels
  • there is only one first-level classification label
  • there are at least two second-level classification labels and at least two third-level classification labels. Therefore, in this step, the TF-IDF model is first used to determine the first-level classification label according to the keywords of the original file; if it does not match the preset first-level classification label of the patent file, the original file does not need to be labeled.
  • the initial classification label is a label that is manually set at a higher level than the lowest-level classification label.
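The gist of steps 100c to 100e can be sketched as a gating check (the confidences and preset label below are toy assumptions): the category with the highest confidence becomes the first-level label, and the albert-based classifier is run only when it matches the preset first-level label.

```python
def needs_model_classification(confidences, preset_label):
    """Pick the highest-confidence category as the first-level label and
    report whether it matches the preset first-level classification label."""
    predicted = max(confidences, key=confidences.get)
    return predicted == preset_label

confidences = {"G06F": 0.7, "G06N": 0.2, "H04L": 0.1}  # toy values
print(needs_model_classification(confidences, "G06F"))  # True
```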
  • the above-mentioned text classification method further includes pre-training the text classification model, which includes:
  • Step 1001 obtaining a first training sample set, the first training sample set includes a first training text, and the first training text includes a corresponding first classification label;
  • the first training sample set is a specific data set related to text classification, wherein the training text includes a corresponding classification label, the classification label can be manually labeled, and the classification label belongs to the classification result of the text classification model.
  • the classification labels include specific technical fields, such as cloud computing, image processing, and the like. The embodiment of the present application does not limit the specific content of the classification label.
  • Step 1002 based on the first training sample set, pre-training the albert model with the first classification label as the classification target to obtain an initial text classification model;
  • the above step 1002 may include:
  • the trained initial text classification model is verified based on the verification data, and the optimized initial text classification model is obtained according to the verification results.
  • the first training sample set is divided at a ratio of 9:1: 90% is used as the training set and 10% as the validation set. The model predicts on the validation set, and the model parameters are tuned appropriately according to the results to obtain the initial text classification model.
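The 9:1 split described above can be sketched as follows (the samples are toy placeholders, an assumption for illustration):

```python
import random

def split_train_validation(samples, train_ratio=0.9, seed=0):
    """Shuffle a copy of the first training sample set and split it 9:1."""
    shuffled = samples[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_ratio)
    return shuffled[:cut], shuffled[cut:]

samples = [("text %d" % i, i % 3) for i in range(100)]  # toy labeled texts
train, val = split_train_validation(samples)
print(len(train), len(val))  # 90 10
```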
  • Step 1003 judging whether the accuracy of the classification result of the initial text classification model is greater than a preset threshold
  • Step 1004 if the accuracy is greater than the preset threshold, take the initial text classification model as the final text classification model;
  • Step 1005 if it is not greater, correct the classification labels corresponding to the first training text, and iterate the initial text classification model based on the error-corrected first training sample set until the accuracy of the classification results of the initial text classification model is greater than the preset threshold.
  • in step 1005, iterating the initial text classification model based on the error-corrected first training sample set means that the initial text model can be optimized based on all or part of the error-corrected first training sample set; the specific number of iterations depends on whether the accuracy of the classification results of the fine-tuned initial text classification model exceeds the preset threshold.
  • judging whether the accuracy rate of the classification result of the initial text classification model is greater than a preset threshold value may include:
  • a second training sample set different from the first training sample set is used as the verification data for verifying the accuracy of the classification result of the initial text classification model.
  • the training data of the initial classification model is thus expanded, which alleviates the problem of low accuracy of the initial text classification model caused by errors in the original classification labels of the first training sample set.
  • step 1005 error correction is performed on the classification label corresponding to the first training text, which may include:
  • in view of the inaccuracy of the initial predictions of the initial text classification model, this embodiment iterates the model so that its predictions become more accurate.
  • the computer device uses a gradient descent or backpropagation algorithm to adjust the network parameters of the albert model according to the error between the prediction result and the classification label, until the error satisfies the convergence condition.
  • the data volume of the second training sample set used for fine-tuning is much smaller than the data volume of the first training sample set.
  • the albert model is fine-tuned.
  • the computer device uses the second character vector, the second word vector and the second position vector of the second training sample set as the input vectors of the albert model, obtains the text classification prediction results output by the albert model, then uses the classification labels corresponding to the second training text as supervision to fine-tune the albert model, and finally trains to obtain the text classification model.
  • a text classification apparatus is provided; the text classification apparatus can be integrated in the above-mentioned computer device 110 and may specifically include:
  • the target text acquisition module 411 is used to extract the target text data to be analyzed from the original text
  • the word segmentation module 412 is used to preprocess the target text data to obtain the word segmentation result of the target text data
  • the vector obtaining module 413 is used to obtain the target word vector, target position vector and target sentence vector corresponding to the target text data;
  • the classification module 414 is used to input the target character vector, the target word vector and the target position vector into the text classification model to obtain the target classification label output by the text classification model, wherein the text classification model is trained using the text classification model training method of any one of claims 1 to 4.
  • in one embodiment, a computer device includes a memory, a processor, and a computer program stored on the memory and executable on the processor, and the processor implements the following steps when executing the computer program: extracting the target text data to be analyzed from the original text; preprocessing the target text data to obtain the word segmentation result of the target text data; obtaining the target character vector, target word vector and target position vector corresponding to the target text data based on the word segmentation result; inputting the target character vector, the target word vector and the target position vector into the pre-trained text classification model to obtain the target classification label output by the text classification model, wherein the text classification model is a fine-tuned albert model.
  • before extracting the text data to be classified from the original text, the method further includes: extracting keywords from the original text based on the TF-IDF model to form a keyword set; determining a first-class classification label according to the keyword set; and matching the first-class classification label with the preset first-class classification label information, and determining, according to the matching result, whether to use the text classification model to classify the original text.
  • the original text is patent text data
  • extracting the text data to be classified from the original text includes: extracting the text data of the description abstract, claims and description title parts in the patent text as the text data to be classified.
  • inputting the word segmentation result into a pre-trained albert model to obtain a sequence of word vectors corresponding to the text data includes: obtaining word vectors corresponding to the text data according to the part of speech and position information of the text data; inputting the word vectors into the albert model for data processing to obtain the word matrix of the text data; and obtaining the word vector sequence of the text data according to the word matrix.
  • judging whether the accuracy rate of the classification result of the initial text classification model is greater than a preset threshold includes: acquiring a second training sample set, where the second training sample set includes the second training text; obtaining, based on the initial text classification model, the predicted classification label corresponding to the second training text in the second training sample set; and determining, according to the predicted classification label and the second classification label corresponding to the second training text, whether the accuracy of the classification result of the initial classification model is greater than the preset threshold, wherein the second classification label is manually annotated by the user.
  • pre-training the albert model with the first classification label as the classification target based on the first training sample set to obtain an initial text classification model includes: dividing the first training sample set into training data and verification data; inputting the training data into the initial text classification model to be trained for model training; and verifying the trained initial text classification model based on the verification data, and obtaining an optimized initial text classification model according to the verification results.
  • performing error correction on the classification label corresponding to the first training text includes:
  • a storage medium stores computer-readable instructions that, when executed by one or more processors, cause the one or more processors to perform the following steps: extracting the target text data to be analyzed from the original text; preprocessing the target text data to obtain the word segmentation result of the target text data; obtaining the target character vector, target word vector and target position vector corresponding to the target text data based on the word segmentation result; inputting the target character vector, the target word vector and the target position vector into the pre-trained text classification model to obtain the target classification label output by the text classification model, wherein the text classification model is a fine-tuned albert model.
  • the aforementioned storage medium may be a non-volatile storage medium, such as a magnetic disk, an optical disc, or a read-only memory (Read-Only Memory, ROM), or a volatile memory, such as a random access memory (Random Access Memory, RAM).
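The 9:1 division of the first training sample set and the accuracy-threshold iteration of steps 1003 to 1005 can be sketched as follows. This is a minimal Python illustration: the `split_samples` helper and the `train_and_evaluate` stand-in (whose accuracy simply rises as label noise shrinks) are hypothetical, not the patent's implementation, and the 0.9 threshold is an assumed value.

```python
import random

PRESET_THRESHOLD = 0.9  # hypothetical value; the patent does not fix a number

def split_samples(samples, ratio=0.9, seed=42):
    """Divide the first training sample set 9:1 into training and validation sets."""
    shuffled = samples[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * ratio)
    return shuffled[:cut], shuffled[cut:]

def accuracy(predictions, labels):
    """Share of predicted classification labels matching the reference labels."""
    return sum(p == y for p, y in zip(predictions, labels)) / len(labels)

samples = [("text %d" % i, i % 3) for i in range(100)]  # toy (text, label) pairs
train_set, val_set = split_samples(samples)
print(len(train_set), len(val_set))  # 90 10

def train_and_evaluate(label_noise):
    """Toy stand-in for retraining: accuracy rises as label errors shrink."""
    return 1.0 - label_noise

# Steps 1003-1005: correct labels and retrain until accuracy clears the threshold.
label_noise = 0.3
rounds = 0
while train_and_evaluate(label_noise) <= PRESET_THRESHOLD:
    label_noise /= 2  # error-correct part of the first training sample set, retrain
    rounds += 1
print(rounds, train_and_evaluate(label_noise))  # 2 0.925
```

The loop terminates as soon as validation accuracy exceeds the preset threshold, which is exactly the stopping condition stated in step 1005.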
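The TF-IDF keyword extraction mentioned above (forming a keyword set from the original text) can be illustrated with a self-contained sketch. The whitespace tokenization and the three-document toy corpus are assumptions for illustration; real Chinese patent text would first pass through the word segmentation step.

```python
import math
from collections import Counter

def tfidf_keywords(documents, doc_index, top_k=3):
    """Score the terms of one document by TF-IDF against a small corpus
    and return the top-k terms as the keyword set."""
    tokenized = [doc.split() for doc in documents]  # whitespace tokens (toy)
    n_docs = len(tokenized)
    df = Counter()  # document frequency of each term
    for tokens in tokenized:
        df.update(set(tokens))
    tokens = tokenized[doc_index]
    tf = Counter(tokens)
    scores = {
        term: (count / len(tokens)) * math.log(n_docs / df[term])
        for term, count in tf.items()
    }
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, _ in ranked[:top_k]]

docs = [
    "text classification with albert model",
    "patent text data preprocessing",
    "image segmentation with neural model",
]
print(tfidf_keywords(docs, 0))  # ['albert', 'classification', 'model']
```

Terms that appear in every document get an IDF of zero, which is the "word frequency only" weakness the Background section attributes to TF-IDF.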

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Machine Translation (AREA)

Abstract

A text classification method, an apparatus, a computer device, and a storage medium. The method comprises: extracting target text data to be analyzed from an original text (301); performing pre-processing on the target text data and obtaining a word segmentation result for the target text data (302); obtaining a target character vector, a target word vector, and a target position vector corresponding to the target text data on the basis of the word segmentation result (303); inputting the target character vector, the target word vector, and the target position vector into a pre-trained text classification model, and obtaining a target classification label output by the text classification model, wherein the text classification model is an ALBERT model that has undergone fine-tuning (304). The present method utilizes an ALBERT model to perform processing on text data, and text classification efficiency and accuracy are effectively increased.

Description

Text classification method, apparatus, computer device and storage medium
This application claims priority to the Chinese patent application with application number 202110482695.1, entitled "Text Classification Method, Apparatus, Computer Device and Storage Medium", filed with the State Intellectual Property Office on April 30, 2021, the entire contents of which are incorporated herein by reference.
Technical Field
The present application relates to the technical field of natural language processing, and in particular to text classification methods, apparatuses, computer devices and storage media.
Background
With the rapid development of network technology, massive information resources exist in the form of text. The inventor realized that how to classify these texts effectively and mine useful information from massive texts quickly, accurately and comprehensively has become one of the hot topics in natural language processing research. Text classification refers to determining a category for each document in a document collection according to predefined topic categories. Text classification technology has a wide range of applications in daily life, for example, the technical division of patent texts.
Compared with general texts, patent texts have a special structure, strong professionalism and more domain vocabulary, so a more targeted classification method is needed. Patent text classification belongs to the field of natural language processing and generally includes steps such as data preprocessing, text feature representation, classifier selection and effect evaluation, among which text feature representation and classifier selection are the most important and directly affect the accuracy of the classification results.
In the prior art, text classification methods based on traditional machine learning, such as the TF-IDF method, measure the importance of a word only by its term frequency when constructing the feature value sequence of a document; the words are independent of one another and cannot reflect sequence information. Such methods are also easily affected by dataset skew: if one category contains far more documents, the IDF will be underestimated, and the workaround is to increase the category weight. Intra-class and inter-class distribution biases are not considered (when TF-IDF is used for feature selection). Deep-learning-based text classification methods include Facebook's open-source FastText, Text-CNN, Text-RNN and so on. TextCNN performs well on many tasks, but the biggest problem of CNN is the fixed filter_size field of view: on the one hand it cannot model longer sequence information, and on the other hand hyperparameter tuning of filter_size is cumbersome. The essence of CNN is text feature representation, while the recurrent neural network (RNN), which can better express context information, is more commonly used in natural language processing. Although CNN and RNN achieve remarkable results in text classification tasks, both have the shortcoming of being unintuitive and poorly interpretable, which is felt especially keenly when analyzing bad cases.
SUMMARY OF THE INVENTION
The present application provides a text classification method, apparatus, computer device and storage medium.
A first aspect provides a text classification method, the method comprising:
extracting the target text data to be analyzed from the original text;
preprocessing the target text data to obtain a word segmentation result of the target text data;
inputting the word segmentation result into a trained text classification model, wherein the text classification model obtains the target character vector, target word vector and target position vector corresponding to the target text data based on the word segmentation result, and obtains the target classification label of the target text data based on the target character vector, the target word vector and the target position vector; wherein the text classification model is a trained albert model.
A second aspect provides a text classification apparatus, comprising:
a target text acquisition module, configured to extract the target text data to be analyzed from the original text;
a word segmentation module, configured to preprocess the target text data to obtain a word segmentation result of the target text data;
a classification module, configured to input the word segmentation result into a trained text classification model, wherein the text classification model obtains the target character vector, target word vector and target position vector corresponding to the target text data based on the word segmentation result, and obtains the target classification label of the target text data based on the target character vector, the target word vector and the target position vector; wherein the text classification model is a trained albert model.
A third aspect provides a computer device, including a memory and a processor, where computer-readable instructions are stored in the memory, and when the computer-readable instructions are executed by the processor, the processor is caused to perform the following steps:
extracting the target text data to be analyzed from the original text;
preprocessing the target text data to obtain a word segmentation result of the target text data;
inputting the word segmentation result into a trained text classification model, wherein the text classification model obtains the target character vector, target word vector and target position vector corresponding to the target text data based on the word segmentation result, and obtains the target classification label of the target text data based on the target character vector, the target word vector and the target position vector; wherein the text classification model is a trained albert model. A fourth aspect provides a storage medium storing computer-readable instructions, which, when executed by one or more processors, cause the one or more processors to perform the following steps:
extracting the target text data to be analyzed from the original text;
preprocessing the target text data to obtain a word segmentation result of the target text data;
inputting the word segmentation result into a trained text classification model, wherein the text classification model obtains the target character vector, target word vector and target position vector corresponding to the target text data based on the word segmentation result, and obtains the target classification label of the target text data based on the target character vector, the target word vector and the target position vector; wherein the text classification model is a trained albert model.
In the above text classification method, apparatus, computer device and storage medium, first, the target text data to be analyzed is extracted from the original text; second, the target text data is preprocessed to obtain a word segmentation result of the target text data; then, the word segmentation result is input into a trained text classification model, which obtains the target character vector, target word vector and target position vector corresponding to the target text data based on the word segmentation result, and obtains the target classification label of the target text data based on these vectors. Because the albert model is used to process the text data, the obtained word vector sequence contains both the textual information and the context information of the text data; it thus fuses the semantic information of the full text and contains more comprehensive text information, which is more conducive to subsequent text classification, thereby helping to improve the accuracy of text classification and the classification effect.
Brief Description of the Drawings
FIG. 1 is a diagram of an implementation environment of the text classification method provided in one embodiment;
FIG. 2 is a block diagram of the internal structure of a computer device in one embodiment;
FIG. 3 is a flowchart of a text classification method in one embodiment;
FIG. 4 is a structural block diagram of a text classification apparatus in one embodiment.
Detailed Description
In order to make the purpose, technical solutions and advantages of the present application clearer, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are only used to explain the present application and are not intended to limit it.
It will be understood that the terms "first", "second", etc. used in this application may be used herein to describe various elements, but these elements are not limited by these terms. These terms are only used to distinguish one element from another.
FIG. 1 is a diagram of an implementation environment of the text classification method provided in one embodiment. As shown in FIG. 1, the implementation environment includes a computer device 110 and a terminal 120.
The computer device 110 is a text classification server, and the terminal 120 is a device for acquiring the text to be classified, with an output interface for text classification results. When text classification is required, the text to be classified is obtained through the terminal 120 and classified by the computer device 110.
It should be noted that the terminal 120 and the computer device 110 may be a smart phone, a tablet computer, a notebook computer, a desktop computer, etc., but are not limited thereto. The computer device 110 and the terminal 120 may be connected through Bluetooth, USB (Universal Serial Bus) or other communication connection methods, which are not limited in this application.
FIG. 2 is a schematic diagram of the internal structure of a computer device in one embodiment. As shown in FIG. 2, the computer device includes a processor, a storage medium, a memory and a network API interface connected through a system bus. The storage medium of the computer device stores an operating system, a database and computer-readable instructions; the database can store a control information sequence, and when the computer-readable instructions are executed by the processor, the processor implements a text classification method. The processor of the computer device provides computing and control capabilities and supports the operation of the entire computer device. The memory of the computer device may store computer-readable instructions that, when executed by the processor, cause the processor to perform a text classification method. The network API interface of the computer device is used for connection and communication with the terminal. Those skilled in the art can understand that the structure shown in FIG. 2 is only a block diagram of a partial structure related to the solution of the present application and does not constitute a limitation on the computer device to which the solution of the present application is applied; a specific computer device may include more or fewer components than shown in the figure, combine certain components, or have a different arrangement of components.
For ease of understanding, the terms involved in the embodiments of the present application are first described below.
albert model: a language model released by Google that trains deep bidirectional representations by jointly conditioning bidirectional Transformers in all layers. The albert model combines the advantages of many natural language processing models and has achieved good results in many natural language processing tasks. In the related art, the model input vector of the albert model is the sum of a character vector (TokenEmbedding), a position vector (PositionEmbedding) and a sentence vector (SegmentEmbedding). The character vector is the vectorized representation of a character, the position vector represents the position of the character in the text, and the sentence vector represents the order of the sentences in the text.
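The input representation just described, the element-wise sum of character (token), position and sentence (segment) vectors, can be sketched with toy numbers. The table sizes, the sample token ids and the random lookup tables below are made-up values for illustration, not the model's real parameters:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, max_len, n_segments, dim = 100, 16, 2, 8

token_emb = rng.normal(size=(vocab_size, dim))  # TokenEmbedding table
pos_emb = rng.normal(size=(max_len, dim))       # PositionEmbedding table
seg_emb = rng.normal(size=(n_segments, dim))    # SegmentEmbedding table

token_ids = np.array([5, 17, 42, 9])    # toy tokenized input
segment_ids = np.array([0, 0, 1, 1])    # first / second sentence
positions = np.arange(len(token_ids))   # 0, 1, 2, 3

# The model input is the element-wise sum of the three embeddings.
input_vectors = token_emb[token_ids] + pos_emb[positions] + seg_emb[segment_ids]
print(input_vectors.shape)  # (4, 8)
```

Each row of `input_vectors` is one input position carrying identity, order and sentence-membership information at once.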
Pre-training: a process in which a neural network model is trained on a large dataset so that it learns the general features of the dataset. The purpose of pre-training is to provide good model parameters for subsequent training of the neural network model on a specific dataset. In the embodiments of the present application, pre-training refers to the process of training the albert model with unlabeled training text.
Fine-tuning: a process of further training a pre-trained neural network model on a specific dataset. Usually, the amount of data used in the fine-tuning stage is smaller than that used in the pre-training stage, and the fine-tuning stage adopts supervised learning, that is, the training samples used in the fine-tuning stage contain annotation information. In the embodiments of the present application, the fine-tuning stage refers to training the albert model with training text containing classification labels.
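Fine-tuning adjusts pretrained parameters by gradient descent until the error between prediction and label satisfies a convergence condition. A deliberately tiny one-parameter illustration of that loop follows; it is not the albert training code, and the learning rate and tolerance are arbitrary assumed values:

```python
def fine_tune(x, y, lr=0.1, eps=1e-6, max_steps=1000):
    """Adjust a single weight w until the squared error between the
    prediction w*x and the label y satisfies the convergence condition."""
    w = 0.0
    for step in range(max_steps):
        pred = w * x
        error = pred - y
        if error * error < eps:       # convergence condition
            return w, step
        w -= lr * 2 * error * x       # gradient of the squared error
    return w, max_steps

w, steps = fine_tune(x=2.0, y=3.0)
print(round(w, 3))  # 1.5
```

The same loop structure, with many parameters and batched gradients, underlies the supervised fine-tuning described above.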
Natural language processing is an important direction in the fields of computer science and artificial intelligence. It studies various theories and methods for effective communication between humans and computers in natural language. Natural language processing is a science that integrates linguistics, computer science and mathematics. Research in this field involves natural language, the language people use every day, so it is closely related to the study of linguistics. Natural language processing technology usually includes text processing, semantic understanding, machine translation, question answering, knowledge graphs and other technologies.
As shown in FIG. 3, in one embodiment, a text classification method is proposed. The text classification method can be applied to the above-mentioned computer device 110 and may specifically include the following steps:
Step 101: extract the target text data to be analyzed from the original text.
The original text may be a patent text. Patent texts have a special structure, strong professionalism and more domain vocabulary, and need a more targeted classification method. Patent text classification belongs to the field of natural language processing and generally includes steps such as data preprocessing, text feature representation, classifier selection and effect evaluation, among which text feature representation and classifier selection are the most important and directly affect the accuracy of the classification results.
In this embodiment, the text data of the description abstract, the claims and the description title in the patent text is extracted as the target text data.
Step 102: preprocess the target text data to obtain a word segmentation result of the target text data.
In this embodiment, the purpose of preprocessing the target text data is to extract the useful data in the original text data, or to delete the noise data in the original text, so that text data irrelevant to the extraction purpose can be removed from the original text data.
In some embodiments, the above step 102 may include: performing one of stop-word removal and deduplication on the target text data to obtain second text data, and performing a word segmentation operation on the second text data to obtain a word segmentation result.
When deleting noise data, duplicate data in the original text data is removed by deduplication, and noise data in the original text data is removed by deletion, so that the noise data in the original text data can be removed.
Stop words are words that are automatically filtered out before or after processing natural language text in information retrieval, in order to save storage space and improve search efficiency; such words are called stop words (Stop Words).
In this embodiment, removing stop words removes words in the natural language text that do not contribute to the text features, such as punctuation, modal particles, personal pronouns, meaningless garbled characters and spaces. The selected method of removing stop words is stop-word-table filtering: the words in the text data are matched one by one against a pre-built stop-word table, and if a match succeeds, the word is a stop word and is deleted.
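Stop-word-table filtering as just described can be sketched directly; the five-entry stop-word table here is a tiny illustrative sample, not a real stop-word list:

```python
STOP_WORDS = {"的", "了", "是", ",", "。"}  # tiny illustrative stop-word table

def remove_stop_words(tokens):
    """Keep only the tokens that do not match the stop-word table."""
    return [t for t in tokens if t not in STOP_WORDS]

tokens = ["文本", "分类", "是", "自然", "语言", "处理", "的", "任务", "。"]
print(remove_stop_words(tokens))
# ['文本', '分类', '自然', '语言', '处理', '任务']
```

In practice the table would come from a published stop-word list and be applied after word segmentation.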
In order to obtain the target text data in vector form, the second text data must first be segmented into words. Word segmentation is a basic task in lexical analysis. Word segmentation algorithms fall into two main categories according to their core idea: one is dictionary-based word segmentation, which first divides the text data into words according to a dictionary and then finds the best combination of words; the other is character-based word segmentation, that is, words are formed from characters: the sentence is first divided into individual characters, which are then combined into words to find the optimal segmentation strategy; this can also be formulated as a sequence labeling problem. The word segmentation algorithm used in this embodiment may include: a rule-based word segmentation method, an understanding-based word segmentation method, or a statistics-based word segmentation method.
其中,基于规则的分词方法(例如基于字符串匹配的分词方法)是按照一定的策略将待分析的汉字串与一个“充分大的”词典中的词条进行配,若在词典中找到某个字符串,则匹配成功(识别出一个词)。常用的基于规则的分词方法包括:正向最大匹配法(由左到右的方向);逆向最大匹配法(由右到左的方向);最少切分(使每一句中切出的词数最小)。正向最大匹配法是将一段字符串进行分隔,其中分隔的长度有限制,然后将分隔的子字符串与词典中的词进行匹配,如果匹配成 功则进行下一轮匹配,直到所有字符串处理完毕,否则将子字符串从末尾去除一个字,再进行匹配,如此反复。逆向最大匹配法则与此正向最大匹配法类似。Among them, the rule-based word segmentation method (such as the word segmentation method based on string matching) is to match the Chinese character string to be analyzed with the entry in a "sufficiently large" dictionary according to a certain strategy. string, the match is successful (a word is recognized). Commonly used rule-based word segmentation methods include: forward maximum matching method (from left to right direction); reverse maximum matching method (from right to left direction); minimum segmentation (minimize the number of words cut out in each sentence). ). The forward maximum matching method is to separate a string, where the length of the separation is limited, and then match the separated substring with the words in the dictionary. If the match is successful, the next round of matching is performed until all strings are processed. Completed, otherwise remove a word from the end of the substring, and then match, and so on. The reverse maximum matching rule is similar to this forward maximum matching method.
基于理解的分词方法是通过让计算机模拟人对句子的理解,达到识别词的效果。基于理解的分词方法的基本思想就是在分词的同时进行句法、语义分析,利用句法信息和语义信息来处理歧义现象。基于统计的分词方法:从形式上看,词是稳定的字的组合,因此在上下文中,相邻的字同时出现的次数越多,就越有可能构成一个词。从而字与字相邻共现的频率或概率能够较好的反映成词的可信度。通过对文本数据中相邻共现的各个字的组合的频度进行统计,计算它们的互现信息。互现信息体现了汉字之间结合关系的紧密程度,当紧密程度高于某一个阈值时,便可认为此字组可能构成了一个词。在实际应用时,统计分词系统可以使用一部基本的分词词典进行串匹配分词,同时使用统计方法识别一些新词,即将串频统计和串匹配结合起来,既发挥匹配分词切分速度快、效率高的特点,又利用了无词典分词结合上下文识别生词、自动消除歧义的优点。The comprehension-based word segmentation method achieves word recognition by letting the computer simulate human understanding of the sentence. Its basic idea is to perform syntactic and semantic analysis at the same time as word segmentation, and to use syntactic and semantic information to resolve ambiguity. Statistics-based word segmentation method: formally, words are stable combinations of characters, so in context, the more often adjacent characters co-occur, the more likely they are to form a word. The frequency or probability of adjacent co-occurrence therefore reflects the credibility of a word reasonably well. By counting the frequency of combinations of adjacent co-occurring characters in the text data, their mutual occurrence information is calculated. The mutual occurrence information reflects how tightly Chinese characters combine; when the tightness exceeds a certain threshold, the character group may be considered to constitute a word. In practical applications, a statistical word segmentation system can use a basic segmentation dictionary for string-matching segmentation while using statistical methods to identify new words, i.e., combining string frequency statistics with string matching; this exploits both the speed and efficiency of dictionary matching and the advantages of dictionary-free segmentation in recognizing new words from context and resolving ambiguity automatically.
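The mutual occurrence statistic described above can be illustrated with pointwise mutual information over adjacent characters. This is a hedged sketch; the application does not specify the exact statistic or the threshold, so the PMI formula and the toy corpus below are assumptions for illustration.

```python
import math
from collections import Counter

def char_pmi(corpus):
    """Pointwise mutual information for each adjacent character pair.
    A high PMI suggests the two characters tend to form a word."""
    chars = Counter()
    pairs = Counter()
    total_chars = 0
    total_pairs = 0
    for text in corpus:
        chars.update(text)
        total_chars += len(text)
        for a, b in zip(text, text[1:]):
            pairs[a + b] += 1
            total_pairs += 1
    pmi = {}
    for pair, n in pairs.items():
        p_pair = n / total_pairs
        p_a = chars[pair[0]] / total_chars
        p_b = chars[pair[1]] / total_chars
        pmi[pair] = math.log(p_pair / (p_a * p_b))
    return pmi

scores = char_pmi(["文本分类", "文本数据", "分类标签"])
# Pairs that recur across the corpus ("文本", "分类") score higher than chance pairs.
```

A real system would compare each pair's PMI against a tuned threshold to decide whether to merge the characters into a word.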
经过上述分词处理后,原始文本数据由一系列的关键词表示,但是这种文本形式的数据不能直接被后续的分类算法所处理,而应该转化为数值形式,因此需要对这些关键词进行词向量形式转化,以获取待分类的文本数据,其为文本向量的形式。After the above word segmentation, the original text data is represented by a series of keywords, but data in this textual form cannot be processed directly by the subsequent classification algorithm and should be converted into numerical form; these keywords therefore need to be converted into word-vector form to obtain the text data to be classified, which is in the form of text vectors.
步骤103、将分词结果输入至训练好的文本分类模型中,文本分类模型基于分词结果得到目标文本数据对应的目标字向量、目标词向量和目标位置向量,以及基于目标字向量、目标词向量和目标位置向量得到目标文本数据的目标分类标签;其中,文本分类模型为经过训练的albert模型。Step 103: Input the word segmentation result into the trained text classification model. Based on the word segmentation result, the text classification model obtains the target character vector, target word vector and target position vector corresponding to the target text data, and based on these vectors obtains the target classification label of the target text data; the text classification model is a trained albert model.
其中,将分词结果输入预先训练好的文本分类模型中,文本分类模型是预先训练好的albert模型。为了使albert模型既能够学习到文字的上下文关系,又能够学习得到拼音与文字之间的映射关系,本申请实施例中,训练albert模型时,使用到三个维度的向量,即字向量、词向量和位置向量。The word segmentation results are input into the pre-trained text classification model, which is a pre-trained albert model. In order for the albert model both to learn the contextual relationships of the text and to learn the mapping between pinyin and characters, in the embodiment of the present application three kinds of vectors are used when training the albert model: character vectors, word vectors and position vectors.
可选的,字向量采用词向量(word to vector,word2vec)模型对文字进行转化得到。Optionally, the character vectors are obtained by converting the text with a word-to-vector (word2vec) model.
在本实施例中,步骤103可以包括以下步骤:In this embodiment, step 103 may include the following steps:
步骤1031、根据文本数据的词性和位置信息,获取文本数据对应的词向量。在本实施例中,使用位置编码给文本数据加上位置信息,并使用初始词向量表示添加位置信息的文本数据;获取文本数据的词性,并将词性转换为词性向量;将初始词向量与词性向量相加,得到文本数据对应的词向量。Step 1031: Acquire the word vectors corresponding to the text data according to the part-of-speech and position information of the text data. In this embodiment, position information is added to the text data using position encoding, and the text data with the added position information is represented by initial word vectors; the part of speech of the text data is obtained and converted into a part-of-speech vector; the initial word vector and the part-of-speech vector are added to obtain the word vector corresponding to the text data.
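Step 1031 combines a token embedding, position information and a part-of-speech vector by addition. The sketch below uses a sinusoidal position encoding as one common choice; the lookup tables, the toy dimension of 8 and the function names are all hypothetical, since the application does not fix these details.

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 8  # toy embedding size; real models use hundreds of dimensions

def sinusoidal_position(pos, dim):
    """Standard sinusoidal position encoding for a single position."""
    enc = np.zeros(dim)
    for i in range(0, dim, 2):
        enc[i] = np.sin(pos / 10000 ** (i / dim))
        enc[i + 1] = np.cos(pos / 10000 ** (i / dim))
    return enc

# Hypothetical lookup tables for token and part-of-speech embeddings.
token_emb = {"文本": rng.normal(size=dim), "分类": rng.normal(size=dim)}
pos_tag_emb = {"n": rng.normal(size=dim), "v": rng.normal(size=dim)}

def input_vector(token, pos_tag, position):
    # Initial word vector with position information, plus the POS vector.
    return token_emb[token] + sinusoidal_position(position, dim) + pos_tag_emb[pos_tag]

vec = input_vector("文本", "n", 0)
```

The same token at a different position yields a different vector, which is exactly how the position information survives the addition.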
步骤1032、将词向量输入至albert模型中进行数据处理,得到文本数据的词矩阵。Step 1032: Input the word vector into the albert model for data processing to obtain a word matrix of the text data.
步骤1033、根据词矩阵,获取文本数据的字向量序列。在本实施例中,使用词矩阵,预测文本数据中两个语句是否为上下句、两个语句中掩盖词和掩盖词的词性特征,并对词性特征归一化处理,得到文本数据的字向量序列。Step 1033: Acquire the character vector sequence of the text data according to the word matrix. In this embodiment, the word matrix is used to predict whether two sentences in the text data are consecutive sentences, the masked words in the two sentences, and the part-of-speech features of the masked words; the part-of-speech features are normalized to obtain the character vector sequence of the text data.
应当理解的是,本实施例中使用的albert模型是经过预先训练得到的模型,因此在对文本数据进行处理时,只需要将文本数据输入至该预先训练的albert模型中即可获得其对应的字向量序列。It should be understood that the albert model used in this embodiment is a pre-trained model, so when processing text data, the text data only needs to be input into this pre-trained albert model to obtain its corresponding character vector sequence.
其中,为了使albert模型能够实现文本分类,需要在albert模型中设置分类器。可选的,该分类器的分类类别及数量与文本分类模型所需实现的分类任务相关,该分类器可以是多分类分类器(比如softmax分类器)。本申请实施例并不对分类器的具体类型进行限定。In order for the albert model to perform text classification, a classifier needs to be set in the albert model. Optionally, the classification categories and their number are related to the classification task the text classification model is to implement; the classifier may be a multi-class classifier (such as a softmax classifier). The embodiment of the present application does not limit the specific type of the classifier.
在一些实施例中,上述文本分类方法,在从原始文本中提取待分类的文本数据之前,还包括:In some embodiments, the above-mentioned text classification method, before extracting the text data to be classified from the original text, further includes:
步骤100a、提取原始文本中关键词,并构成关键词集;Step 100a, extracting keywords in the original text, and forming a keyword set;
步骤100b、基于TF-IDF模型确定关键词集在各个类别的语料库中的词频-逆文档频率;Step 100b, determine the word frequency-inverse document frequency of the keyword set in the corpus of each category based on the TF-IDF model;
具体地,确定该类别的语料库的文本特征中与关键词相匹配的文本特征,将相匹配的文本特征的词频-逆文档频率,作为该关键词的词频-逆文档频率。其中,根据句号、问号、感叹号和分号等标点符号,将某一类别的语料库中的文本分割为若干句子,提取每个句子中的文本特征。根据提取的文本特征,分别为各个类别建立文本特征库。分别在各个类别下,统计各个文本特征的频率。统计各文本特征的逆文档频率,即总类别数与包含该文本特征的类别数之商的自然对数值,并在各个类别下,分别计算各文本特征的词频-逆文档频率。Specifically, among the text features of the corpus of the category, the text features matching the keywords are determined, and the term frequency-inverse document frequency of the matched text features is used as the term frequency-inverse document frequency of the keywords. The text in the corpus of a category is split into sentences according to punctuation marks such as periods, question marks, exclamation marks and semicolons, and the text features in each sentence are extracted. Based on the extracted text features, a text feature library is built for each category, and the frequency of each text feature is counted under each category. The inverse document frequency of each text feature, i.e. the natural logarithm of the quotient of the total number of categories and the number of categories containing the feature, is counted, and the term frequency-inverse document frequency of each text feature is computed under each category.
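The per-category TF-IDF computation described above, with IDF taken as the natural logarithm of the total number of categories divided by the number of categories containing the feature, can be sketched as follows. The input format and the demo features are assumptions for illustration.

```python
import math
from collections import Counter

def tfidf_per_category(category_features):
    """category_features: {category: list of feature terms extracted from its corpus}.
    IDF follows the text: ln(total categories / categories containing the term)."""
    n_categories = len(category_features)
    df = Counter()  # in how many categories each feature appears
    for feats in category_features.values():
        df.update(set(feats))
    scores = {}
    for cat, feats in category_features.items():
        tf = Counter(feats)
        total = len(feats)
        scores[cat] = {t: (n / total) * math.log(n_categories / df[t])
                       for t, n in tf.items()}
    return scores

demo = {"cloud": ["虚拟机", "调度", "调度"], "image": ["卷积", "像素", "调度"]}
s = tfidf_per_category(demo)
# "调度" appears in both categories, so its IDF is ln(2/2) = 0 and its score is 0.
```

Features that appear in every category thus carry no discriminative weight, which is the point of the IDF term.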
步骤100c、基于原始文本的关键词集在各个类别的语料库中的词频-逆文档频率,确定原始文本属于各个类别的置信度;Step 100c, based on the word frequency-inverse document frequency of the keyword set of the original text in the corpus of each category, determine the confidence that the original text belongs to each category;
具体地,针对每个类别,分别进行以下操作:确定关键词在该类别的语料库中出现的次数;根据关键词在该类别的语料库中的词频-逆文档频率和关键词在该类别的语料中出现的次数,确定原始文本相对于该类别的类条件概率;根据原始文本相对于该类别的类条件概率,确定该待分类文本属于该类别的置信度。Specifically, the following operations are performed for each category: determine the number of times the keywords appear in the corpus of the category; determine the class conditional probability of the original text with respect to the category according to the term frequency-inverse document frequency of the keywords in that corpus and the number of times the keywords appear in it; and determine, from this class conditional probability, the confidence that the text to be classified belongs to the category.
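A naive-Bayes-style sketch of the confidence computation in step 100c. The exact form of the class conditional probability is not given in the application, so the Laplace smoothing, the TF-IDF weighting and the softmax normalization below are assumptions; only the inputs (keyword TF-IDF scores and occurrence counts per category) follow the text.

```python
import math

def category_confidence(keywords, category_stats):
    """category_stats: {category: {"tfidf": {kw: score}, "count": {kw: n}, "total": N}}.
    Returns a confidence distribution over categories for the given keywords."""
    log_scores = {}
    for cat, st in category_stats.items():
        score = 0.0
        for kw in keywords:
            n = st["count"].get(kw, 0)
            w = st["tfidf"].get(kw, 0.0)
            # Laplace-smoothed relative frequency, weighted by (1 + TF-IDF).
            score += (1 + w) * math.log((n + 1) / (st["total"] + 1))
        log_scores[cat] = score
    # Softmax over log-scores turns them into a confidence distribution.
    m = max(log_scores.values())
    exp = {c: math.exp(s - m) for c, s in log_scores.items()}
    z = sum(exp.values())
    return {c: v / z for c, v in exp.items()}
```

Step 100d then simply picks the category with the highest confidence.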
步骤100d、根据原始文本属于各个类别的置信度,确定原始文本的一级分类标签;Step 100d, according to the confidence that the original text belongs to each category, determine the first-level classification label of the original text;
具体地,将原始文本属于各个类别的置信度中,置信度最大的类别作为待分类文本的一级分类标签。Specifically, among the confidence levels of the original text belonging to each category, the category with the highest confidence level is used as the first-level classification label of the text to be classified.
步骤100e、将一级分类标签与预设的一级分类标签信息进行匹配,并根据匹配结果确定是否采用文本分类模型对原始文本进行文本分类。Step 100e: Match the first-level classification label with preset first-level classification label information, and determine whether to use a text classification model to perform text classification on the original text according to the matching result.
可以理解的是,在上述步骤101至104中得到的目标分类标签是专利文本最底层的分类标签,例如,专利文本有三级分类标签,一级分类标签只有一个分类,而二级分类标签有至少两个,三级分类标签至少有两个。因此,在该步骤中,首先通过TF-IDF模型根据原始文件的关键词确定一级分类标签,如果该专利文件的一级分类标签与预设的一级分类标签不匹配,则无需对原始文件进行标签分类;初始分类标签是人为设置的级别高于底层分类标签的标签。It can be understood that the target classification label obtained in the above steps 101 to 104 is the lowest-level classification label of the patent text. For example, a patent text may have three levels of classification labels: there is only one first-level classification label, while there are at least two second-level labels and at least two third-level labels. Therefore, in this step, the first-level classification label is first determined by the TF-IDF model according to the keywords of the original file; if the first-level classification label of the patent file does not match the preset first-level classification label, there is no need to perform label classification on the original file. The initial classification label is a manually set label at a higher level than the lowest-level classification label.
在一些实施例中,上述文本分类方法还包括:预训练文本分类模型,所述预训练文本分类模型的步骤包括:In some embodiments, the above text classification method further includes pre-training the text classification model, and the pre-training of the text classification model includes:
步骤1001、获取第一训练样本集,第一训练样本集中包含第一训练文本,且第一训练文本包含对应的第一分类标签;Step 1001, obtaining a first training sample set, the first training sample set includes a first training text, and the first training text includes a corresponding first classification label;
可选的,第一训练样本集是与文本分类相关的特定数据集,其中的训练文本包含对应的分类标签,该分类标签可以通过人工标注,且该分类标签属于文本分类模型的分类结果。在一个示意性的例子中,当文本分类模型用于对专利文本进行分类时,分类标签包括具体不同的技术领域,例如,云计算、图像处理等。本申请实施例并不对分类标签的具体内容进行限定。Optionally, the first training sample set is a specific data set related to text classification, wherein the training text includes a corresponding classification label, the classification label can be manually labeled, and the classification label belongs to the classification result of the text classification model. In an illustrative example, when a text classification model is used to classify patent texts, the classification labels include specific technical fields, such as cloud computing, image processing, and the like. The embodiment of the present application does not limit the specific content of the classification label.
步骤1002、基于第一训练样本集,以第一分类标签为分类目标预训练albert模型,得到初始文本分类模型;Step 1002, based on the first training sample set, pre-training the albert model with the first classification label as the classification target to obtain an initial text classification model;
上述步骤1002可以包括:The above step 1002 may include:
将第一训练样本集按照预设的比例分为训练数据和验证数据;Divide the first training sample set into training data and verification data according to a preset ratio;
将训练数据输入待训练的初始文本分类模型进行模型训练;Input the training data into the initial text classification model to be trained for model training;
基于验证数据对训练后的初始文本分类模型进行验证,并根据验证结果得到优化后的初始文本分类模型。The trained initial text classification model is verified based on the verification data, and the optimized initial text classification model is obtained according to the verification results.
在该步骤中,第一训练样本集按照9:1的比例分开,90%用作训练集,10%用作验证集。模型用90%数据训练后,生成预测模型,开始对10%样本做预测,根据结果对模型参数进行适当调优,以得到初始文本分类模型。In this step, the first training sample set is split at a ratio of 9:1, with 90% used as the training set and 10% as the validation set. After the model is trained on 90% of the data, a prediction model is generated and predictions are made on the remaining 10% of the samples; the model parameters are then tuned appropriately according to the results to obtain the initial text classification model.
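The 9:1 split of the first training sample set can be sketched as below; the shuffling and the fixed seed are implementation choices for reproducibility, not specified by the application.

```python
import random

def split_samples(samples, train_ratio=0.9, seed=42):
    """Shuffle labelled samples and split them into train / validation sets."""
    data = list(samples)
    random.Random(seed).shuffle(data)
    cut = int(len(data) * train_ratio)
    return data[:cut], data[cut:]

samples = [(f"text {i}", i % 3) for i in range(100)]  # (text, label) pairs
train, val = split_samples(samples)  # 90 training samples, 10 validation samples
```

In practice a stratified split (preserving the label distribution in both parts) is often preferred when some classification labels are rare.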
步骤1003、判断初始文本分类模型的分类结果的准确率是否大于预设阈值,Step 1003, judging whether the accuracy of the classification result of the initial text classification model is greater than a preset threshold,
步骤1004、如果大于,则以初始文本分类模型为最终的文本分类模型;Step 1004, if it is greater than, then take the initial text classification model as the final text classification model;
步骤1005、如果不大于,则对第一训练文本对应的分类标签进行纠错,并基于纠错后的第一训练样本集对初始文本分类模型进行迭代,直至初始文本分类模型的分类结果的准确率大于预设阈值。Step 1005: If not, correct the classification labels corresponding to the first training text, and iterate the initial text classification model based on the error-corrected first training sample set until the accuracy of the classification result of the initial text classification model is greater than the preset threshold.
可以理解的是,步骤1005中基于纠错后的第一训练样本集对初始文本分类模型进行迭代,也就是可以基于全部或部分纠错后的第一训练样本集对初始文本分类模型进行优化。至于具体迭代的次数,需要判断微调后的初始文本分类模型的分类结果的准确率是否大于预设阈值,如果大于则停止迭代,如果不大于,则继续对初始文本分类模型进行优化训练。It can be understood that in step 1005 the initial text classification model is iterated based on the error-corrected first training sample set, that is, the initial text classification model can be optimized based on all or part of the error-corrected first training sample set. As for the number of iterations, it is necessary to judge whether the accuracy of the classification result of the fine-tuned initial text classification model is greater than the preset threshold; if so, the iteration stops, otherwise the optimization training of the initial text classification model continues.
上述步骤1003中,判断初始文本分类模型的分类结果的准确率是否大于预设阈值,可以包括:In the above step 1003, judging whether the accuracy rate of the classification result of the initial text classification model is greater than a preset threshold value may include:
1003a、获取第二训练样本集,第二训练样本集中包含第二训练文本;1003a. Obtain a second training sample set, where the second training sample set includes a second training text;
1003b、基于初始文本分类模型,得到第二训练样本集中的第二训练文本对应的预测分类标签;1003b, based on the initial text classification model, obtain the predicted classification label corresponding to the second training text in the second training sample set;
1003c、根据预测分类标签和第二训练文本对应的第二分类标签,判断初始分类模型的分类结果的准确率是否大于预设阈值,其中,第二分类标签是通过用户人工标注的第二分类标签。1003c, according to the predicted classification label and the second classification label corresponding to the second training text, determine whether the accuracy rate of the classification result of the initial classification model is greater than a preset threshold, wherein the second classification label is the second classification label manually marked by the user .
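Step 1003c reduces to comparing validation accuracy against the preset threshold. A minimal sketch (the 0.9 default threshold is an assumed placeholder, not a value from the application):

```python
def classification_accuracy(predicted, gold):
    """Fraction of samples whose predicted label matches the manual label."""
    assert len(predicted) == len(gold)
    correct = sum(p == g for p, g in zip(predicted, gold))
    return correct / len(gold)

def model_is_good_enough(predicted, gold, threshold=0.9):
    # Step 1003c: compare accuracy on the second training sample set
    # against the preset threshold.
    return classification_accuracy(predicted, gold) > threshold
```

Here `predicted` would hold the labels from the initial text classification model and `gold` the manually annotated second classification labels.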
本实施例中,采用不同于第一训练样本集的第二训练样本集作为验证初始文本分类模型的分类结果准确率的验证数据,其一扩展了初始分类模型的训练数据,其二,避免了由于第一训练样本集的原始的分类标签的错误造成的初始文本分类模型的准确率低的问题。In this embodiment, a second training sample set different from the first training sample set is used as the verification data for checking the accuracy of the classification results of the initial text classification model. First, this expands the training data of the initial classification model; second, it avoids the problem of the initial text classification model having low accuracy due to errors in the original classification labels of the first training sample set.
上述步骤1005中,对第一训练文本对应的分类标签进行纠错,可以包括:In the above step 1005, error correction is performed on the classification label corresponding to the first training text, which may include:
1005a、对预测结果进行审核,得到预测正确的第一训练文本和预测错误的第一训练文本;1005a. Review the prediction results, and obtain the first training text that is correctly predicted and the first training text that is wrongly predicted;
1005b、将预测错误的第一训练文本进行人工标注,以将预测错误的第一训练文本的标签正确标注。1005b. Manually label the first training texts whose predictions were wrong, so that their labels are correctly annotated.
本实施例中,针对初始文本分类模型初期预测的不准确的情况,本实施例对模型进行迭代,使得模型预测的更加准确。In this embodiment, in view of the inaccuracy of the initial prediction of the initial text classification model, this embodiment iterates the model, so that the model prediction is more accurate.
在一些实施例中,计算机设备采用梯度下降或反向传播算法,根据预测结果与分类标签之间的误差对albert模型的网络参数进行调整,直至误差满足收敛条件。In some embodiments, the computer device uses a gradient descent or backpropagation algorithm to adjust the network parameters of the albert model according to the error between the prediction result and the classification label, until the error satisfies the convergence condition.
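A minimal stand-in for the gradient-descent adjustment with a convergence condition: a softmax classifier trained by full-batch gradient descent until the cross-entropy loss stops improving. This replaces the albert network with a toy linear model purely for illustration; the learning rate, tolerance and initialization are assumptions.

```python
import numpy as np

def finetune_classifier(X, y, n_classes, lr=0.1, tol=1e-3, max_epochs=500):
    """Softmax classifier trained by gradient descent; stops when the
    cross-entropy loss change falls below a convergence tolerance."""
    rng = np.random.default_rng(0)
    W = rng.normal(scale=0.01, size=(X.shape[1], n_classes))
    prev = np.inf
    for _ in range(max_epochs):
        logits = X @ W
        logits -= logits.max(axis=1, keepdims=True)  # numerical stability
        p = np.exp(logits)
        p /= p.sum(axis=1, keepdims=True)
        loss = -np.log(p[np.arange(len(y)), y] + 1e-12).mean()
        if prev - loss < tol:  # convergence condition: error stopped shrinking
            break
        prev = loss
        grad = p.copy()
        grad[np.arange(len(y)), y] -= 1  # d(loss)/d(logits)
        W -= lr * X.T @ grad / len(y)
    return W
```

In the actual method the same loop structure applies, except that the parameters being adjusted are the albert model's network weights and the gradients come from backpropagation.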
在一种可能的实施方式中,由于预训练的albert模型已经学习到了文字的上下文关系,因此进行微调时所采用的第二训练样本集的数据量远小于第一训练样本集的数据量。In a possible implementation manner, since the pre-trained albert model has learned the context of the text, the data volume of the second training sample set used for fine-tuning is much smaller than the data volume of the first training sample set.
与预训练过程类似的,为了使文本分类模型能够学习到文本分类与文字拼音之间的映射关系,将第二训练文本中文字的字向量、位置向量和句向量作为输入,对albert模型进行微调。Similar to the pre-training process, in order for the text classification model to learn the mapping between text classification and the pinyin of the text, the character vectors, position vectors and sentence vectors of the text in the second training text are used as input to fine-tune the albert model.
在一种可能的实施方式中,微调过程中,计算机设备将第二训练样本集的第二字向量、第二目标词向量和第二目标位置向量作为albert模型的输入向量,得到albert模型输出的文本分类预测结果,进而以第二训练文本对应的分类标签为监督,对albert模型进行微调,最终训练得到文本分类模型。In a possible implementation, during the fine-tuning process, the computer device uses the second word vector, the second target word vector and the second target position vector of the second training sample set as the input vector of the albert model, and obtains the output of the albert model. Text classification prediction results, and then use the classification labels corresponding to the second training text as supervision, fine-tune the albert model, and finally train to obtain a text classification model.
如图4所示,在一个实施例中,提供了一种文本分类装置,该文本分类装置可以集成于上述的计算机设备110中,具体可以包括:As shown in Figure 4, in one embodiment, a text classification apparatus is provided. The text classification apparatus can be integrated into the above-mentioned computer device 110, and may specifically include:
目标文本获取模块411,用于从原始文本中提取待分析的目标文本数据;The target text acquisition module 411 is used to extract the target text data to be analyzed from the original text;
分词模块412,用于对目标文本数据进行预处理得到目标文本数据的分词结果;The word segmentation module 412 is used to preprocess the target text data to obtain the word segmentation result of the target text data;
向量获取模块413,用于获取目标分类文本中文字对应的目标字向量、目标位置向量以及目标句向量;The vector obtaining module 413 is used to obtain the target word vector, target position vector and target sentence vector corresponding to the text in the target classification text;
分类模块414,用于将目标字向量、目标词向量和目标位置向量输入文本分类模型,得到文本分类模型输出的目标分类标签,其中,文本分类模型为采用权利要求1至4任一所述的文本分类模型的训练方法训练得到的文本分类模型。The classification module 414 is configured to input the target character vector, target word vector and target position vector into the text classification model to obtain the target classification label output by the text classification model, where the text classification model is obtained by training with the training method for a text classification model of any one of claims 1 to 4.
在一个实施例中,提出了一种计算机设备,计算机设备包括存储器、处理器及存储在存储器上并可在处理器上运行的计算机程序,处理器执行计算机程序时实现以下步骤:从原始文本中提取待分析的目标文本数据;对目标文本数据进行预处理得到目标文本数据的分词结果;基于分词结果得到目标文本数据对应的目标字向量、目标词向量和目标位置向量;将目标字向量、目标词向量和目标位置向量输入预先训练好的文本分类模型,得到文本分类模型输出的目标分类标签,其中,文本分类模型为经过微调的albert模型。In one embodiment, a computer device is provided. The computer device includes a memory, a processor, and a computer program stored on the memory and executable on the processor. When executing the computer program, the processor implements the following steps: extracting the target text data to be analyzed from the original text; preprocessing the target text data to obtain its word segmentation result; obtaining the target character vector, target word vector and target position vector corresponding to the target text data based on the word segmentation result; and inputting these vectors into the pre-trained text classification model to obtain the target classification label output by the model, where the text classification model is a fine-tuned albert model.
在一个实施例中,在从原始文本中提取待分类的文本数据之前,还包括:基于TF-IDF模型从原始文本中提取关键词,并构成关键词集;根据关键词集,确定原始文本的一级分类标签;将一级分类标签与预设的一级分类标签信息进行匹配,并根据匹配结果确定是否采用文本分类模型对原始文本进行文本分类的步骤。In one embodiment, before extracting the text data to be classified from the original text, the method further includes the steps of: extracting keywords from the original text based on the TF-IDF model to form a keyword set; determining the first-level classification label of the original text according to the keyword set; and matching the first-level classification label with preset first-level classification label information and determining, according to the matching result, whether to use the text classification model to classify the original text.
在一个实施例中,原始文本是专利文本数据,从原始文本中提取待分类的文本数据,包括:提取专利文本中的说明书摘要、权利要求书和说明书标题部分的文本数据作为待分类的文本数据。In one embodiment, the original text is patent text data, and extracting the text data to be classified from the original text includes: extracting the text data of the description abstract, claims and description title parts in the patent text as the text data to be classified .
在一个实施例中,将分词结果输入预先训练好的albert模型,得到文本数据对应的字向量序列,包括:根据文本数据的词性和位置信息,获取文本数据对应的词向量;将词向量输入至albert模型中进行数据处理,得到文本数据的词矩阵;根据词矩阵,获取文本数据的字向量序列。In one embodiment, inputting the word segmentation result into a pre-trained albert model to obtain a sequence of word vectors corresponding to the text data, including: obtaining word vectors corresponding to the text data according to part of speech and position information of the text data; inputting the word vectors to Data processing is performed in the albert model to obtain the word matrix of the text data; according to the word matrix, the word vector sequence of the text data is obtained.
在一个实施例中,判断初始文本分类模型的分类结果的准确率是否大于预设阈值,包括:获取第二训练样本集,第二训练样本集中包含第二训练文本;基于初始文本分类模型,得到第二训练样本集中的第二训练文本对应的预测分类标签;根据预测分类标签和第二训练文本对应的第二分类标签,判断初始分类模型的分类结果的准确率是否大于预设阈值,其中,第二分类标签是通过用户人工标注的第二分类标签。In one embodiment, judging whether the accuracy rate of the classification result of the initial text classification model is greater than a preset threshold includes: acquiring a second training sample set, where the second training sample set includes the second training text; based on the initial text classification model, obtaining The predicted classification label corresponding to the second training text in the second training sample set; according to the predicted classification label and the second classification label corresponding to the second training text, determine whether the accuracy of the classification result of the initial classification model is greater than the preset threshold, wherein, The second classification label is the second classification label manually annotated by the user.
在一个实施例中,基于第一训练样本集,以第一分类标签为分类目标预训练albert模型,得到初始文本分类模型,包括:将第一训练样本集按照预设的比例分为训练数据和验证数据;将训练数据输入待训练的初始文本分类模型进行模型训练;基于验证数据对训练后的初始文本分类模型进行验证,并根据验证结果得到优化后的初始文本分类模型。In one embodiment, pre-training the albert model with the first classification label as the classification target based on the first training sample set to obtain an initial text classification model includes: dividing the first training sample set into training data and Verify data; input the training data into the initial text classification model to be trained for model training; verify the trained initial text classification model based on the verification data, and obtain an optimized initial text classification model according to the verification results.
在一个实施例中,对第一训练文本对应的分类标签进行纠错,包括:In one embodiment, performing error correction on the classification label corresponding to the first training text includes:
对预测结果进行审核,得到预测正确的第一训练文本和预测错误的第一训练文本;Review the prediction results, and obtain the first training text that is correctly predicted and the first training text that is wrongly predicted;
将预测错误的第一训练文本进行人工标注,以将预测错误的第一训练文本的标签正确标注。Manually label the first training texts whose predictions were wrong, so that their labels are correctly annotated.
在一个实施例中,提出了一种存储有计算机可读指令的存储介质,该计算机可读指令被一个或多个处理器执行时,使得一个或多个处理器执行以下步骤:从原始文本中提取待分析的目标文本数据;对目标文本数据进行预处理得到目标文本数据的分词结果;基于分词结果得到目标文本数据对应的目标字向量、目标词向量和目标位置向量;将目标字向量、目标词向量和目标位置向量输入预先训练好的文本分类模型,得到文本分类模型输出的目标分类标签,其中,文本分类模型为经过微调的albert模型。In one embodiment, a storage medium storing computer-readable instructions is provided. When executed by one or more processors, the computer-readable instructions cause the one or more processors to perform the following steps: extracting the target text data to be analyzed from the original text; preprocessing the target text data to obtain its word segmentation result; obtaining the target character vector, target word vector and target position vector corresponding to the target text data based on the word segmentation result; and inputting these vectors into the pre-trained text classification model to obtain the target classification label output by the model, where the text classification model is a fine-tuned albert model.
本领域普通技术人员可以理解实现上述实施例方法中的全部或部分流程,是可以通过计算机程序来指令相关的硬件来完成,该计算机程序可存储于一计算机可读取存储介质中,该程序在执行时,可包括如上述各方法的实施例的流程。其中,前述的存储介质可为磁碟、光盘、只读存储记忆体(Read-Only Memory,ROM)等非易失性存储介质,或易失性存储器即随机存储记忆体(Random Access Memory,RAM)等。Those of ordinary skill in the art can understand that all or part of the processes in the methods of the above embodiments can be implemented by instructing relevant hardware through a computer program, which can be stored in a computer-readable storage medium; when executed, the program may include the processes of the embodiments of the above methods. The aforementioned storage medium may be a non-volatile storage medium such as a magnetic disk, an optical disc or a read-only memory (ROM), or a volatile memory such as a random access memory (RAM).
以上实施例的各技术特征可以进行任意的组合,为使描述简洁,未对上述实施例中的各个技术特征所有可能的组合都进行描述,然而,只要这些技术特征的组合不存在矛盾,都应当认为是本说明书记载的范围。The technical features of the above embodiments can be combined arbitrarily. For brevity, not all possible combinations of the technical features in the above embodiments are described; however, as long as there is no contradiction in the combination of these technical features, they should all be considered to be within the scope described in this specification.
以上所述实施例仅表达了本申请的几种实施方式,其描述较为具体和详细,但并不能因此而理解为对本申请专利范围的限制。应当指出的是,对于本领域的普通技术人员来说,在不脱离本申请构思的前提下,还可以做出若干变形和改进,这些都属于本申请的保护范围。因此,本申请专利的保护范围应以所附权利要求为准。The above-mentioned embodiments only represent several embodiments of the present application, and the descriptions thereof are relatively specific and detailed, but should not be construed as a limitation on the scope of the patent of the present application. It should be noted that, for those skilled in the art, without departing from the concept of the present application, several modifications and improvements can be made, which all belong to the protection scope of the present application. Therefore, the scope of protection of the patent of the present application shall be subject to the appended claims.

Claims (20)

  1. 一种文本分类方法,其中,所述方法包括:A text classification method, wherein the method comprises:
    从原始文本中提取待分析的目标文本数据;Extract the target text data to be analyzed from the original text;
    对所述目标文本数据进行预处理得到所述目标文本数据的分词结果;Preprocessing the target text data to obtain a word segmentation result of the target text data;
    将所述分词结果输入至训练好的文本分类模型中,所述文本分类模型基于所述分词结果得到所述目标文本数据对应的目标字向量、目标词向量和目标位置向量,以及基于所述目标字向量、所述目标词向量和所述目标位置向量得到所述目标文本数据的目标分类标签;其中,所述文本分类模型为经过训练的albert模型。The word segmentation result is input into the trained text classification model; based on the word segmentation result, the text classification model obtains the target character vector, target word vector and target position vector corresponding to the target text data, and based on the target character vector, the target word vector and the target position vector obtains the target classification label of the target text data; the text classification model is a trained albert model.
  2. 根据权利要求1所述的文本分类方法,其中,在从原始文本中提取待分类的文本数据之前,还包括:The text classification method according to claim 1, wherein before extracting the text data to be classified from the original text, it further comprises:
    提取所述原始文本中的关键词,并构成关键词集;Extracting keywords from the original text to form a keyword set;
    基于TF-IDF模型确定所述关键词集在各个类别的语料库中的词频-逆文档频率;Determine the word frequency-inverse document frequency of the keyword set in the corpus of each category based on the TF-IDF model;
    基于所述原始文本的关键词集在各个类别的语料库中的词频-逆文档频率,确定所述原始文本属于各个类别的置信度;Determine the confidence that the original text belongs to each category based on the word frequency-inverse document frequency of the keyword set of the original text in the corpus of each category;
    根据所述原始文本属于各个类别的置信度,确定所述原始文本的一级分类标签;Determine the first-level classification label of the original text according to the confidence that the original text belongs to each category;
    将所述一级分类标签与预设的一级分类标签信息进行匹配,并根据匹配结果确定是否采用所述文本分类模型对所述原始文本进行文本分类。Matching the first-level classification label with preset first-level classification label information, and determining whether to use the text classification model to perform text classification on the original text according to the matching result.
  3. 根据权利要求1所述的文本分类方法,其中,所述对所述文本数据进行预处理得到分词结果,包括:The text classification method according to claim 1, wherein the preprocessing of the text data to obtain a word segmentation result comprises:
    对所述目标文本数据进行去停用词、去重中的一种,得到第二文本数据,对所述第二文本数据进行分词操作,得到分词结果。One of removing stop words and removing duplicates is performed on the target text data to obtain second text data, and a word segmentation operation is performed on the second text data to obtain a word segmentation result.
  4. 根据权利要求1所述的文本分类方法,其中,所述方法还包括:训练所述文本分类模型,所述训练所述文本分类模型,包括:The text classification method according to claim 1, wherein the method further comprises: training the text classification model, and the training the text classification model comprises:
    获取第一训练样本集,所述第一训练样本集中包含第一训练文本,且所述第一训练文本包含对应的第一分类标签;obtaining a first training sample set, the first training sample set includes a first training text, and the first training text includes a corresponding first classification label;
    基于所述第一训练样本集,以所述第一分类标签为分类目标预训练albert模型,得到初始文本分类模型;Based on the first training sample set, the albert model is pre-trained with the first classification label as a classification target to obtain an initial text classification model;
    判断所述初始文本分类模型的分类结果的准确率是否大于预设阈值,judging whether the accuracy of the classification result of the initial text classification model is greater than a preset threshold,
    如果大于所述预设阈值,则以所述初始文本分类模型为最终的文本分类模型;If it is greater than the preset threshold, the initial text classification model is used as the final text classification model;
    如果不大于所述预设阈值,则对所述第一训练文本对应的分类标签进行纠错,并基于纠错后的第一训练样本集对所述初始文本分类模型进行迭代,直至所述初始文本分类模型的分类结果的准确率大于预设阈值。If it is not greater than the preset threshold, error correction is performed on the classification label corresponding to the first training text, and the initial text classification model is iterated based on the error-corrected first training sample set until the accuracy of the classification result of the initial text classification model is greater than the preset threshold.
5. 根据权利要求4所述的文本分类方法,其中,所述判断所述初始文本分类模型的分类结果的准确率是否大于预设阈值,包括:The text classification method according to claim 4, wherein the judging whether the accuracy of the classification result of the initial text classification model is greater than a preset threshold comprises:
    获取第二训练样本集,所述第二训练样本集中包含第二训练文本;obtaining a second training sample set, where the second training sample set includes a second training text;
    基于所述初始文本分类模型,得到所述第二训练样本集中的第二训练文本对应的预测分类标签;Based on the initial text classification model, obtain the predicted classification label corresponding to the second training text in the second training sample set;
根据所述预测分类标签和所述第二训练文本对应的第二分类标签,判断所述初始文本分类模型的分类结果的准确率是否大于预设阈值,其中,所述第二分类标签是通过用户人工标注的第二分类标签。According to the predicted classification label and the second classification label corresponding to the second training text, judge whether the accuracy of the classification result of the initial text classification model is greater than the preset threshold, wherein the second classification label is a second classification label manually annotated by a user.
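The accuracy check of claim 5 reduces to comparing predicted labels with the user-annotated second classification labels. A minimal sketch, with the 0.9 threshold assumed (the claims only specify a "preset threshold"):

```python
def accuracy(predicted_labels, annotated_labels):
    # Fraction of second training texts whose predicted classification
    # label matches the manually annotated second classification label.
    matches = sum(p == a for p, a in zip(predicted_labels, annotated_labels))
    return matches / len(annotated_labels)

def exceeds_threshold(predicted_labels, annotated_labels, threshold=0.9):
    # threshold is a hypothetical value for the preset threshold.
    return accuracy(predicted_labels, annotated_labels) > threshold

print(accuracy(["A", "B", "A", "C"], ["A", "B", "B", "C"]))  # → 0.75
```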
6. 根据权利要求4所述的文本分类方法,其中,所述基于所述第一训练样本集,以所述第一分类标签为分类目标预训练albert模型,得到初始文本分类模型,包括:The text classification method according to claim 4, wherein the pre-training of the albert model based on the first training sample set with the first classification label as the classification target to obtain the initial text classification model comprises:
    将所述第一训练样本集按照预设的比例分为训练数据和验证数据;Dividing the first training sample set into training data and verification data according to a preset ratio;
    将所述训练数据输入待训练的初始文本分类模型进行模型训练;Inputting the training data into the initial text classification model to be trained for model training;
    基于所述验证数据对训练后的所述初始文本分类模型进行验证,并根据验证结果得到优化后的初始文本分类模型。The trained initial text classification model is verified based on the verification data, and an optimized initial text classification model is obtained according to the verification result.
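The split in claim 6 divides the first training sample set by a preset ratio. A minimal sketch, assuming an 8:2 ratio (the claims leave the ratio unspecified); in practice scikit-learn's `train_test_split` would typically perform this step:

```python
import random

def split_samples(samples, train_ratio=0.8, seed=42):
    # train_ratio is a hypothetical preset ratio; seed fixes the shuffle
    # so the sketch is deterministic.
    shuffled = samples[:]                  # copy; leave the caller's list intact
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_ratio)
    return shuffled[:cut], shuffled[cut:]  # (training data, validation data)

train, val = split_samples(list(range(10)))
print(len(train), len(val))  # → 8 2
```

The validation portion is then used to verify the trained initial text classification model and select the optimized one.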
7. 根据权利要求4所述的文本分类方法,其中,所述对所述第一训练文本对应的分类标签进行纠错,包括:The text classification method according to claim 4, wherein the performing error correction on the classification label corresponding to the first training text comprises:
    对所述预测结果进行审核,得到预测正确的第一训练文本和预测错误的第一训练文本;Reviewing the prediction results, and obtaining the first training text that is correctly predicted and the first training text that is wrongly predicted;
将所述预测错误的第一训练文本进行人工标注,以将所述预测错误的第一训练文本的标签正确标注。Manually annotate the incorrectly predicted first training text, so that the label of the incorrectly predicted first training text is correctly assigned.
  8. 一种文本分类装置,其中,包括:A text classification device, comprising:
    目标文本获取模块,用于从原始文本中提取待分析的目标文本数据;The target text acquisition module is used to extract the target text data to be analyzed from the original text;
    分词模块,用于对所述目标文本数据进行预处理得到所述目标文本数据的分词结果;A word segmentation module for preprocessing the target text data to obtain a word segmentation result of the target text data;
分类模块,用于将所述分词结果输入至训练好的文本分类模型中,所述文本分类模型基于所述分词结果得到所述目标文本数据对应的目标字向量、目标词向量和目标位置向量以及基于所述目标字向量、所述目标词向量和所述目标位置向量得到所述目标文本数据的目标分类标签;其中,所述文本分类模型为经过训练的albert模型。The classification module is used to input the word segmentation result into a trained text classification model; the text classification model obtains the target character vector, target word vector and target position vector corresponding to the target text data based on the word segmentation result, and obtains the target classification label of the target text data based on the target character vector, the target word vector and the target position vector; wherein the text classification model is a trained albert model.
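The classification module of claim 8 combines a character vector, a word vector and a position vector before classifying. A toy sketch of that combination step, summing fixed-size vectors element-wise the way ALBERT-style models sum their input embeddings; the vectors here are illustrative, not the model's real embeddings:

```python
def combine(char_vec, word_vec, pos_vec):
    # Element-wise sum of the character, word and position vectors,
    # producing the fused representation the classifier consumes.
    return [c + w + p for c, w, p in zip(char_vec, word_vec, pos_vec)]

print(combine([1.0, 0.0], [0.5, 0.5], [0.0, 1.0]))  # → [1.5, 1.5]
```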
  9. 一种计算机设备,其中,包括存储器和处理器,所述存储器中存储有计算机可读指令,所述计算机可读指令被所述处理器执行时,使得所述处理器执行以下步骤:A computer device, comprising a memory and a processor, wherein computer-readable instructions are stored in the memory, and when the computer-readable instructions are executed by the processor, the processor is caused to perform the following steps:
    从原始文本中提取待分析的目标文本数据;Extract the target text data to be analyzed from the original text;
    对所述目标文本数据进行预处理得到所述目标文本数据的分词结果;Preprocessing the target text data to obtain a word segmentation result of the target text data;
将所述分词结果输入至训练好的文本分类模型中,所述文本分类模型基于所述分词结果得到所述目标文本数据对应的目标字向量、目标词向量和目标位置向量以及基于所述目标字向量、所述目标词向量和所述目标位置向量得到所述目标文本数据的目标分类标签;其中,所述文本分类模型为经过训练的albert模型。The word segmentation result is input into a trained text classification model; the text classification model obtains the target character vector, target word vector and target position vector corresponding to the target text data based on the word segmentation result, and obtains the target classification label of the target text data based on the target character vector, the target word vector and the target position vector; wherein the text classification model is a trained albert model.
  10. 根据权利要求9所述的计算机设备,其中,所述计算机可读指令被所述处理器执行时,还使得所述处理器执行以下步骤:The computer device of claim 9, wherein the computer-readable instructions, when executed by the processor, further cause the processor to perform the following steps:
提取所述原始文本中的关键词,并构成关键词集;Extract keywords from the original text to form a keyword set;
    基于TF-IDF模型确定所述关键词集在各个类别的语料库中的词频-逆文档频率;Determine the word frequency-inverse document frequency of the keyword set in the corpus of each category based on the TF-IDF model;
    基于所述原始文本的关键词集在各个类别的语料库中的词频-逆文档频率,确定所述原始文本属于各个类别的置信度;Determine the confidence that the original text belongs to each category based on the word frequency-inverse document frequency of the keyword set of the original text in the corpus of each category;
    根据所述原始文本属于各个类别的置信度,确定所述原始文本的一级分类标签;Determine the first-level classification label of the original text according to the confidence that the original text belongs to each category;
    将所述一级分类标签与预设的一级分类标签信息进行匹配,并根据匹配结果确定是否采用所述文本分类模型对所述原始文本进行文本分类。Matching the first-level classification label with preset first-level classification label information, and determining whether to use the text classification model to perform text classification on the original text according to the matching result.
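The TF-IDF routing steps above can be sketched end to end: score the keyword set against each category's corpus, take the best-scoring category as the first-level label, then only invoke the albert classifier when that label matches a preset set. The corpora, keywords and preset label set below are all hypothetical.

```python
import math

def tf_idf(keyword, category_docs, all_docs):
    # Term frequency of the keyword within one category's corpus
    # (documents are token lists).
    total = sum(len(doc) for doc in category_docs) or 1
    tf = sum(doc.count(keyword) for doc in category_docs) / total
    # Inverse document frequency over the corpora of all categories.
    df = sum(1 for doc in all_docs if keyword in doc)
    idf = math.log(len(all_docs) / (1 + df))
    return tf * idf

def first_level_label(keywords, corpora):
    # corpora: {category: [token-list, ...]}; confidence per category is
    # the summed TF-IDF of the keyword set in that category's corpus.
    all_docs = [d for docs in corpora.values() for d in docs]
    confidence = {
        cat: sum(tf_idf(kw, docs, all_docs) for kw in keywords)
        for cat, docs in corpora.items()
    }
    return max(confidence, key=confidence.get)

PRESET_LABELS = {"finance"}  # hypothetical preset first-level label set

def should_use_albert(keywords, corpora):
    # Match the first-level label against the preset labels to decide
    # whether the text classification model is applied.
    return first_level_label(keywords, corpora) in PRESET_LABELS

corpora = {
    "finance": [["股票", "基金"], ["基金", "收益"]],
    "sports":  [["足球", "比赛"], ["比赛", "进球"]],
}
print(first_level_label(["基金"], corpora))  # → finance
```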
  11. 根据权利要求9所述的计算机设备,其中,所述计算机可读指令被所述处理器执行时,还使得所述处理器执行以下步骤:The computer device of claim 9, wherein the computer-readable instructions, when executed by the processor, further cause the processor to perform the following steps:
    训练所述文本分类模型,所述训练所述文本分类模型,包括:Training the text classification model, the training of the text classification model includes:
    获取第一训练样本集,所述第一训练样本集中包含第一训练文本,且所述第一训练文本包含对应的第一分类标签;obtaining a first training sample set, the first training sample set includes a first training text, and the first training text includes a corresponding first classification label;
    基于所述第一训练样本集,以所述第一分类标签为分类目标预训练albert模型,得到初始文本分类模型;Based on the first training sample set, the albert model is pre-trained with the first classification label as a classification target to obtain an initial text classification model;
    判断所述初始文本分类模型的分类结果的准确率是否大于预设阈值,judging whether the accuracy of the classification result of the initial text classification model is greater than a preset threshold,
    如果大于所述预设阈值,则以所述初始文本分类模型为最终的文本分类模型;If it is greater than the preset threshold, the initial text classification model is used as the final text classification model;
如果不大于所述预设阈值,则对所述第一训练文本对应的分类标签进行纠错,并基于纠错后的第一训练样本集对所述初始文本分类模型进行迭代,直至所述初始文本分类模型的分类结果的准确率大于预设阈值。If it is not greater than the preset threshold, perform error correction on the classification label corresponding to the first training text, and iterate the initial text classification model based on the error-corrected first training sample set until the accuracy of the classification result of the initial text classification model is greater than the preset threshold.
  12. 根据权利要求11所述的计算机设备,其中,所述处理器执行所述判断所述初始文本分类模型的分类结果的准确率是否大于预设阈值时,具体包括:The computer device according to claim 11, wherein, when the processor performs the judging whether the accuracy of the classification result of the initial text classification model is greater than a preset threshold, the process specifically includes:
    获取第二训练样本集,所述第二训练样本集中包含第二训练文本;obtaining a second training sample set, where the second training sample set includes a second training text;
    基于所述初始文本分类模型,得到所述第二训练样本集中的第二训练文本对应的预测分类标签;Based on the initial text classification model, obtain the predicted classification label corresponding to the second training text in the second training sample set;
根据所述预测分类标签和所述第二训练文本对应的第二分类标签,判断所述初始文本分类模型的分类结果的准确率是否大于预设阈值,其中,所述第二分类标签是通过用户人工标注的第二分类标签。According to the predicted classification label and the second classification label corresponding to the second training text, judge whether the accuracy of the classification result of the initial text classification model is greater than the preset threshold, wherein the second classification label is a second classification label manually annotated by a user.
13. 根据权利要求11所述的计算机设备,其中,所述处理器执行所述基于所述第一训练样本集,以所述第一分类标签为分类目标预训练albert模型,得到初始文本分类模型时,具体包括:The computer device according to claim 11, wherein, when the processor performs the pre-training of the albert model based on the first training sample set with the first classification label as the classification target to obtain the initial text classification model, the process specifically includes:
    将所述第一训练样本集按照预设的比例分为训练数据和验证数据;dividing the first training sample set into training data and verification data according to a preset ratio;
    将所述训练数据输入待训练的初始文本分类模型进行模型训练;Inputting the training data into the initial text classification model to be trained for model training;
    基于所述验证数据对训练后的所述初始文本分类模型进行验证,并根据验证结果得到优化后的初始文本分类模型。The trained initial text classification model is verified based on the verification data, and an optimized initial text classification model is obtained according to the verification result.
  14. 根据权利要求11所述的计算机设备,其中,所述处理器执行所述对所述第一训练文本对应的分类标签进行纠错时,具体包括:The computer device according to claim 11, wherein, when the processor performs the error correction for the classification label corresponding to the first training text, it specifically includes:
    对所述预测结果进行审核,得到预测正确的第一训练文本和预测错误的第一训练文本;Reviewing the prediction results, and obtaining the first training text that is correctly predicted and the first training text that is wrongly predicted;
将所述预测错误的第一训练文本进行人工标注,以将所述预测错误的第一训练文本的标签正确标注。Manually annotate the incorrectly predicted first training text, so that the label of the incorrectly predicted first training text is correctly assigned.
  15. 一种存储有计算机可读指令的存储介质,其中,所述计算机可读指令被一个或多个处理器执行时,使得一个或多个处理器执行以下步骤:A storage medium storing computer-readable instructions, wherein the computer-readable instructions, when executed by one or more processors, cause the one or more processors to perform the following steps:
    从原始文本中提取待分析的目标文本数据;Extract the target text data to be analyzed from the original text;
    对所述目标文本数据进行预处理得到所述目标文本数据的分词结果;Preprocessing the target text data to obtain a word segmentation result of the target text data;
将所述分词结果输入至训练好的文本分类模型中,所述文本分类模型基于所述分词结果得到所述目标文本数据对应的目标字向量、目标词向量和目标位置向量以及基于所述目标字向量、所述目标词向量和所述目标位置向量得到所述目标文本数据的目标分类标签;其中,所述文本分类模型为经过训练的albert模型。The word segmentation result is input into a trained text classification model; the text classification model obtains the target character vector, target word vector and target position vector corresponding to the target text data based on the word segmentation result, and obtains the target classification label of the target text data based on the target character vector, the target word vector and the target position vector; wherein the text classification model is a trained albert model.
  16. 根据权利要求15所述的存储有计算机可读指令的存储介质,其中,所述计算机可读指令被一个或多个处理器执行时,还使得一个或多个处理器执行以下步骤:The storage medium storing computer-readable instructions according to claim 15, wherein, when executed by one or more processors, the computer-readable instructions further cause one or more processors to perform the following steps:
提取所述原始文本中的关键词,并构成关键词集;Extract keywords from the original text to form a keyword set;
    基于TF-IDF模型确定所述关键词集在各个类别的语料库中的词频-逆文档频率;Determine the word frequency-inverse document frequency of the keyword set in the corpus of each category based on the TF-IDF model;
    基于所述原始文本的关键词集在各个类别的语料库中的词频-逆文档频率,确定所述原始文本属于各个类别的置信度;Determine the confidence that the original text belongs to each category based on the word frequency-inverse document frequency of the keyword set of the original text in the corpus of each category;
    根据所述原始文本属于各个类别的置信度,确定所述原始文本的一级分类标签;Determine the first-level classification label of the original text according to the confidence that the original text belongs to each category;
    将所述一级分类标签与预设的一级分类标签信息进行匹配,并根据匹配结果确定是否采用所述文本分类模型对所述原始文本进行文本分类。Matching the first-level classification label with preset first-level classification label information, and determining whether to use the text classification model to perform text classification on the original text according to the matching result.
  17. 根据权利要求15所述的存储有计算机可读指令的存储介质,其中,所述计算机可读指令被一个或多个处理器执行时,还使得一个或多个处理器执行以下步骤:The storage medium storing computer-readable instructions according to claim 15, wherein, when executed by one or more processors, the computer-readable instructions further cause one or more processors to perform the following steps:
    训练所述文本分类模型,所述训练所述文本分类模型,包括:Training the text classification model, the training of the text classification model includes:
    获取第一训练样本集,所述第一训练样本集中包含第一训练文本,且所述第一训练文本包含对应的第一分类标签;obtaining a first training sample set, the first training sample set includes a first training text, and the first training text includes a corresponding first classification label;
    基于所述第一训练样本集,以所述第一分类标签为分类目标预训练albert模型,得到初始文本分类模型;Based on the first training sample set, the albert model is pre-trained with the first classification label as a classification target to obtain an initial text classification model;
    判断所述初始文本分类模型的分类结果的准确率是否大于预设阈值,judging whether the accuracy of the classification result of the initial text classification model is greater than a preset threshold,
    如果大于所述预设阈值,则以所述初始文本分类模型为最终的文本分类模型;If it is greater than the preset threshold, the initial text classification model is used as the final text classification model;
如果不大于所述预设阈值,则对所述第一训练文本对应的分类标签进行纠错,并基于纠错后的第一训练样本集对所述初始文本分类模型进行迭代,直至所述初始文本分类模型的分类结果的准确率大于预设阈值。If it is not greater than the preset threshold, perform error correction on the classification label corresponding to the first training text, and iterate the initial text classification model based on the error-corrected first training sample set until the accuracy of the classification result of the initial text classification model is greater than the preset threshold.
18. 根据权利要求17所述的存储有计算机可读指令的存储介质,其中,所述计算机可读指令被一个或多个处理器执行,使得一个或多个处理器执行所述判断所述初始文本分类模型的分类结果的准确率是否大于预设阈值时,具体包括:The storage medium storing computer-readable instructions according to claim 17, wherein, when the computer-readable instructions are executed by one or more processors to cause the one or more processors to perform the judging whether the accuracy of the classification result of the initial text classification model is greater than a preset threshold, the process specifically includes:
    获取第二训练样本集,所述第二训练样本集中包含第二训练文本;obtaining a second training sample set, where the second training sample set includes a second training text;
    基于所述初始文本分类模型,得到所述第二训练样本集中的第二训练文本对应的预测分类标签;Based on the initial text classification model, obtain the predicted classification label corresponding to the second training text in the second training sample set;
根据所述预测分类标签和所述第二训练文本对应的第二分类标签,判断所述初始文本分类模型的分类结果的准确率是否大于预设阈值,其中,所述第二分类标签是通过用户人工标注的第二分类标签。According to the predicted classification label and the second classification label corresponding to the second training text, judge whether the accuracy of the classification result of the initial text classification model is greater than the preset threshold, wherein the second classification label is a second classification label manually annotated by a user.
19. 根据权利要求17所述的存储有计算机可读指令的存储介质,其中,所述计算机可读指令被一个或多个处理器执行,使得一个或多个处理器执行所述基于所述第一训练样本集,以所述第一分类标签为分类目标预训练albert模型,得到初始文本分类模型时,具体包括:The storage medium storing computer-readable instructions according to claim 17, wherein, when the computer-readable instructions are executed by one or more processors to cause the one or more processors to perform the pre-training of the albert model based on the first training sample set with the first classification label as the classification target to obtain the initial text classification model, the process specifically includes:
    将所述第一训练样本集按照预设的比例分为训练数据和验证数据;dividing the first training sample set into training data and verification data according to a preset ratio;
    将所述训练数据输入待训练的初始文本分类模型进行模型训练;Inputting the training data into the initial text classification model to be trained for model training;
    基于所述验证数据对训练后的所述初始文本分类模型进行验证,并根据验证结果得到优化后的初始文本分类模型。The trained initial text classification model is verified based on the verification data, and an optimized initial text classification model is obtained according to the verification result.
20. 根据权利要求17所述的存储有计算机可读指令的存储介质,其中,所述计算机可读指令被一个或多个处理器执行,使得一个或多个处理器执行所述对所述第一训练文本对应的分类标签进行纠错时,具体包括:The storage medium storing computer-readable instructions according to claim 17, wherein, when the computer-readable instructions are executed by one or more processors to cause the one or more processors to perform the error correction on the classification label corresponding to the first training text, the process specifically includes:
    对所述预测结果进行审核,得到预测正确的第一训练文本和预测错误的第一训练文本;Reviewing the prediction results, and obtaining the first training text that is correctly predicted and the first training text that is wrongly predicted;
将所述预测错误的第一训练文本进行人工标注,以将所述预测错误的第一训练文本的标签正确标注。Manually annotate the incorrectly predicted first training text, so that the label of the incorrectly predicted first training text is correctly assigned.
PCT/CN2021/097195 2021-04-30 2021-05-31 Text classification method, apparatus, computer device, and storage medium WO2022227207A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202110482695.1 2021-04-30
CN202110482695.1A CN113011533B (en) 2021-04-30 2021-04-30 Text classification method, apparatus, computer device and storage medium

Publications (1)

Publication Number Publication Date
WO2022227207A1

Family

ID=76380485

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/097195 WO2022227207A1 (en) 2021-04-30 2021-05-31 Text classification method, apparatus, computer device, and storage medium

Country Status (2)

Country Link
CN (1) CN113011533B (en)
WO (1) WO2022227207A1 (en)

Cited By (22)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115545009A (en) * 2022-12-01 2022-12-30 中科雨辰科技有限公司 Data processing system for acquiring target text
CN115563289A (en) * 2022-12-06 2023-01-03 中信证券股份有限公司 Industry classification label generation method and device, electronic equipment and readable medium
CN115827875A (en) * 2023-01-09 2023-03-21 无锡容智技术有限公司 Text data processing terminal searching method
CN115994527A (en) * 2023-03-23 2023-04-21 广东聚智诚科技有限公司 Machine learning-based PPT automatic generation system
CN116204645A (en) * 2023-03-02 2023-06-02 北京数美时代科技有限公司 Intelligent text classification method, system, storage medium and electronic equipment
CN116205601A (en) * 2023-02-27 2023-06-02 开元数智工程咨询集团有限公司 Internet-based engineering list rechecking and data statistics method and system
CN116992034A (en) * 2023-09-26 2023-11-03 之江实验室 Intelligent event marking method, device and storage medium
CN117009534A (en) * 2023-10-07 2023-11-07 之江实验室 Text classification method, apparatus, computer device and storage medium
CN117034901A (en) * 2023-10-10 2023-11-10 北京睿企信息科技有限公司 Data statistics system based on text generation template
CN117252514A (en) * 2023-11-20 2023-12-19 中铁四局集团有限公司 Building material library data processing method based on deep learning and model training
CN117743573A (en) * 2023-12-11 2024-03-22 中国科学院文献情报中心 Corpus automatic labeling method and device, storage medium and electronic equipment
CN117743857A (en) * 2023-12-29 2024-03-22 北京海泰方圆科技股份有限公司 Text correction model training, text correction method, device, equipment and medium
CN117910479A (en) * 2024-03-19 2024-04-19 湖南蚁坊软件股份有限公司 Method, device, equipment and medium for judging aggregated news
CN117951007A (en) * 2024-01-09 2024-04-30 航天中认软件测评科技(北京)有限责任公司 Test case classification method based on theme
CN117992600A (en) * 2024-04-07 2024-05-07 之江实验室 Service execution method and device, storage medium and electronic equipment
CN118193743A (en) * 2024-05-20 2024-06-14 山东齐鲁壹点传媒有限公司 Multi-level text classification model based on pre-training model
CN118332091A (en) * 2024-06-06 2024-07-12 中电信数智科技有限公司 Ancient book knowledge base intelligent question-answering method, device and equipment based on large model technology
CN118503796A (en) * 2024-07-18 2024-08-16 北京睿企信息科技有限公司 Label system construction method, device, equipment and medium
CN118503399A (en) * 2024-07-18 2024-08-16 北京睿企信息科技有限公司 Standardized text acquisition method, device, equipment and medium
CN118503795A (en) * 2024-07-18 2024-08-16 北京睿企信息科技有限公司 Text label verification method, electronic equipment and storage medium
CN118535739A (en) * 2024-06-26 2024-08-23 上海建朗信息科技有限公司 Data classification method and system based on keyword weight matching
WO2024212648A1 (en) * 2023-04-14 2024-10-17 华为技术有限公司 Method for training classification model, and related apparatus

Families Citing this family (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113254657B (en) * 2021-07-07 2021-11-19 明品云(北京)数据科技有限公司 User data classification method and system
CN113486176B (en) * 2021-07-08 2022-11-04 桂林电子科技大学 News classification method based on secondary feature amplification
CN113283235B (en) * 2021-07-21 2021-11-19 明品云(北京)数据科技有限公司 User label prediction method and system
CN113535960A (en) * 2021-08-02 2021-10-22 中国工商银行股份有限公司 Text classification method, device and equipment
CN113627509B (en) * 2021-08-04 2024-05-10 口碑(上海)信息技术有限公司 Data classification method, device, computer equipment and computer readable storage medium
CN113609860B (en) * 2021-08-05 2023-09-19 湖南特能博世科技有限公司 Text segmentation method and device and computer equipment
CN113836892B (en) * 2021-09-08 2023-08-08 灵犀量子(北京)医疗科技有限公司 Sample size data extraction method and device, electronic equipment and storage medium
CN114706974A (en) * 2021-09-18 2022-07-05 北京墨丘科技有限公司 Technical problem information mining method and device and storage medium
CN114065772A (en) * 2021-11-19 2022-02-18 浙江百应科技有限公司 Business opportunity identification method and device based on Albert model and electronic equipment
CN114141248A (en) * 2021-11-24 2022-03-04 青岛海尔科技有限公司 Voice data processing method and device, electronic equipment and storage medium
CN114706961A (en) * 2022-01-20 2022-07-05 平安国际智慧城市科技股份有限公司 Target text recognition method, device and storage medium
CN114492661B (en) * 2022-02-14 2024-06-28 平安科技(深圳)有限公司 Text data classification method and device, computer equipment and storage medium
CN114936282B (en) * 2022-04-28 2024-06-11 北京中科闻歌科技股份有限公司 Financial risk cue determination method, device, equipment and medium
CN115861606B (en) * 2022-05-09 2023-09-08 北京中关村科金技术有限公司 Classification method, device and storage medium for long-tail distributed documents
CN115587185B (en) * 2022-11-25 2023-03-14 平安科技(深圳)有限公司 Text classification method and device, electronic equipment and storage medium
CN116975400B (en) * 2023-08-03 2024-05-24 星环信息科技(上海)股份有限公司 Data classification and classification method and device, electronic equipment and storage medium

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190034823A1 (en) * 2017-07-27 2019-01-31 Getgo, Inc. Real time learning of text classification models for fast and efficient labeling of training data and customization
CN109710770A (en) * 2019-01-31 2019-05-03 北京牡丹电子集团有限责任公司数字电视技术中心 A kind of file classification method and device based on transfer learning
CN110717039A (en) * 2019-09-17 2020-01-21 平安科技(深圳)有限公司 Text classification method and device, electronic equipment and computer-readable storage medium
CN111198948A (en) * 2020-01-08 2020-05-26 深圳前海微众银行股份有限公司 Text classification correction method, device and equipment and computer readable storage medium
CN112052331A (en) * 2019-06-06 2020-12-08 武汉Tcl集团工业研究院有限公司 Method and terminal for processing text information

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108334605B (en) * 2018-02-01 2020-06-16 腾讯科技(深圳)有限公司 Text classification method and device, computer equipment and storage medium
CN109508378B (en) * 2018-11-26 2023-07-14 平安科技(深圳)有限公司 Sample data processing method and device
CN110457471A (en) * 2019-07-15 2019-11-15 平安科技(深圳)有限公司 File classification method and device based on A-BiLSTM neural network
CN111078887B (en) * 2019-12-20 2022-04-29 厦门市美亚柏科信息股份有限公司 Text classification method and device
CN111125317A (en) * 2019-12-27 2020-05-08 携程计算机技术(上海)有限公司 Model training, classification, system, device and medium for conversational text classification

Cited By (33)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115545009A (en) * 2022-12-01 2022-12-30 中科雨辰科技有限公司 Data processing system for acquiring target text
CN115563289A (en) * 2022-12-06 2023-01-03 中信证券股份有限公司 Industry classification label generation method and device, electronic equipment and readable medium
CN115563289B (en) * 2022-12-06 2023-03-07 中信证券股份有限公司 Industry classification label generation method and device, electronic equipment and readable medium
CN115827875A (en) * 2023-01-09 2023-03-21 无锡容智技术有限公司 Text data processing terminal searching method
CN115827875B (en) * 2023-01-09 2023-04-25 无锡容智技术有限公司 Text data processing terminal searching method
CN116205601A (en) * 2023-02-27 2023-06-02 开元数智工程咨询集团有限公司 Internet-based engineering list rechecking and data statistics method and system
CN116205601B (en) * 2023-02-27 2024-04-05 开元数智工程咨询集团有限公司 Internet-based engineering list rechecking and data statistics method and system
CN116204645B (en) * 2023-03-02 2024-02-20 北京数美时代科技有限公司 Intelligent text classification method, system, storage medium and electronic equipment
CN116204645A (en) * 2023-03-02 2023-06-02 北京数美时代科技有限公司 Intelligent text classification method, system, storage medium and electronic equipment
CN115994527A (en) * 2023-03-23 2023-04-21 广东聚智诚科技有限公司 Machine learning-based PPT automatic generation system
WO2024212648A1 (en) * 2023-04-14 2024-10-17 华为技术有限公司 Method for training classification model, and related apparatus
CN116992034A (en) * 2023-09-26 2023-11-03 之江实验室 Intelligent event marking method, device and storage medium
CN116992034B (en) * 2023-09-26 2023-12-22 之江实验室 Intelligent event marking method, device and storage medium
CN117009534B (en) * 2023-10-07 2024-02-13 之江实验室 Text classification method, apparatus, computer device and storage medium
CN117009534A (en) * 2023-10-07 2023-11-07 之江实验室 Text classification method, apparatus, computer device and storage medium
CN117034901B (en) * 2023-10-10 2023-12-08 北京睿企信息科技有限公司 Data statistics system based on text generation template
CN117034901A (en) * 2023-10-10 2023-11-10 北京睿企信息科技有限公司 Data statistics system based on text generation template
CN117252514A (en) * 2023-11-20 2023-12-19 中铁四局集团有限公司 Building material library data processing method based on deep learning and model training
CN117252514B (en) * 2023-11-20 2024-01-30 中铁四局集团有限公司 Building material library data processing method based on deep learning and model training
CN117743573A (en) * 2023-12-11 2024-03-22 中国科学院文献情报中心 Corpus automatic labeling method and device, storage medium and electronic equipment
CN117743857A (en) * 2023-12-29 2024-03-22 北京海泰方圆科技股份有限公司 Text correction model training, text correction method, device, equipment and medium
CN117743857B (en) * 2023-12-29 2024-09-17 北京海泰方圆科技股份有限公司 Text correction model training, text correction method, device, equipment and medium
CN117951007A (en) * 2024-01-09 2024-04-30 航天中认软件测评科技(北京)有限责任公司 Test case classification method based on theme
CN117910479A (en) * 2024-03-19 2024-04-19 湖南蚁坊软件股份有限公司 Method, device, equipment and medium for judging aggregated news
CN117910479B (en) * 2024-03-19 2024-06-04 湖南蚁坊软件股份有限公司 Method, device, equipment and medium for judging aggregated news
CN117992600A (en) * 2024-04-07 2024-05-07 之江实验室 Service execution method and device, storage medium and electronic equipment
CN117992600B (en) * 2024-04-07 2024-06-11 之江实验室 Service execution method and device, storage medium and electronic equipment
CN118193743A (en) * 2024-05-20 2024-06-14 山东齐鲁壹点传媒有限公司 Multi-level text classification model based on pre-training model
CN118332091A (en) * 2024-06-06 2024-07-12 中电信数智科技有限公司 Ancient book knowledge base intelligent question-answering method, device and equipment based on large model technology
CN118535739A (en) * 2024-06-26 2024-08-23 上海建朗信息科技有限公司 Data classification method and system based on keyword weight matching
CN118503796A (en) * 2024-07-18 2024-08-16 北京睿企信息科技有限公司 Label system construction method, device, equipment and medium
CN118503399A (en) * 2024-07-18 2024-08-16 北京睿企信息科技有限公司 Standardized text acquisition method, device, equipment and medium
CN118503795A (en) * 2024-07-18 2024-08-16 北京睿企信息科技有限公司 Text label verification method, electronic equipment and storage medium

Also Published As

Publication number Publication date
CN113011533A (en) 2021-06-22
CN113011533B (en) 2023-10-24

Similar Documents

Publication Publication Date Title
WO2022227207A1 (en) Text classification method, apparatus, computer device, and storage medium
CN111444726B (en) Chinese semantic information extraction method and device based on long-short-term memory network of bidirectional lattice structure
US11580415B2 (en) Hierarchical multi-task term embedding learning for synonym prediction
CN111738003B (en) Named entity recognition model training method, named entity recognition method and medium
US12079586B2 (en) Linguistically rich cross-lingual text event embeddings
CN111709243B (en) Knowledge extraction method and device based on deep learning
WO2022241950A1 (en) Text summarization generation method and apparatus, and device and storage medium
EP3534272A1 (en) Natural language question answering systems
CN106599032B (en) Text event extraction method combining sparse coding and structure sensing machine
CN110321563B (en) Text emotion analysis method based on hybrid supervision model
CN111738004A (en) Training method of named entity recognition model and named entity recognition method
CN111414481A (en) Chinese semantic matching method based on pinyin and BERT embedding
CN112749274B (en) Chinese text classification method based on attention mechanism and interference word deletion
US11183175B2 (en) Systems and methods implementing data query language and utterance corpus implements for handling slot-filling and dialogue intent classification data in a machine learning task-oriented dialogue system
WO2023159758A1 (en) Data enhancement method and apparatus, electronic device, and storage medium
CN107180026B (en) Event phrase learning method and device based on word embedding semantic mapping
CN114547230B (en) Intelligent administrative law enforcement case information extraction and case identification method
Othman et al. Learning english and arabic question similarity with siamese neural networks in community question answering services
CN111985228A (en) Text keyword extraction method and device, computer equipment and storage medium
US20220156489A1 (en) Machine learning techniques for identifying logical sections in unstructured data
CN117251524A (en) Short text classification method based on multi-strategy fusion
CN114490937A (en) Comment analysis method and device based on semantic perception
Hua et al. A character-level method for text classification
CN113935308A (en) Method and system for automatically generating text abstract facing field of geoscience
CN118227790A (en) Text classification method, system, equipment and medium based on multi-label association

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21938677

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 21938677

Country of ref document: EP

Kind code of ref document: A1