
CN109284382B - Text classification method and computing device - Google Patents


Info

Publication number
CN109284382B
CN109284382B (granted publication of application CN201811158905.6A; published as CN109284382A)
Authority
CN
China
Prior art keywords
text information
feature
color value
game
areas
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201811158905.6A
Other languages
Chinese (zh)
Other versions
CN109284382A (en)
Inventor
徐乐乐 (Xu Lele)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Wuhan Douyu Network Technology Co Ltd
Original Assignee
Wuhan Douyu Network Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Wuhan Douyu Network Technology Co Ltd filed Critical Wuhan Douyu Network Technology Co Ltd
Priority to CN201811158905.6A priority Critical patent/CN109284382B/en
Publication of CN109284382A publication Critical patent/CN109284382A/en
Application granted granted Critical
Publication of CN109284382B publication Critical patent/CN109284382B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/243Classification techniques relating to the number of classes
    • G06F18/24323Tree-organised classifiers

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Image Analysis (AREA)

Abstract

The embodiment of the application discloses a text classification method and a computing device, which solve the problems of sample imbalance among classes and of feature screening, and can remarkably improve the text classification effect of a model. The method in the embodiment of the application comprises the following steps: acquiring text information of N color value regions and text information of M game partitions in a current scene, wherein N and M are integers greater than 0 and the absolute value of the difference between N and M is smaller than a preset threshold; selecting A pieces of text information from the text information of the N color value regions and the text information of the M game partitions; selecting at least two features from the first feature, the second feature and the third feature as candidate features; and, according to the candidate features and the feature selection formula, selecting the feature with the largest information gain to split the nodes of the decision tree and generate a random forest model.

Description

Text classification method and computing device
Technical Field
The present application relates to the field of big data, and in particular, to a text classification method and a computing device.
Background
In machine learning, a random forest is a classifier that contains multiple decision trees; its output class is the mode of the classes output by the individual trees. A random forest is in fact a special bagging method that uses decision trees as the base models. First, m training sets are generated by bootstrap sampling, and a decision tree is constructed for each training set. When searching for a feature to split on, the algorithm does not examine all features to maximize an index (such as information gain); instead, it randomly draws a subset of the features, finds the optimal one among the drawn features, and applies it to split the node. Because of this bagging (ensemble) idea, the random forest method amounts to sampling both the samples and the features.
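The bagging-and-voting scheme just described can be sketched as follows (a minimal illustration, not the patent's implementation; the threshold "trees" are stand-ins for real decision trees):

```python
import random
from collections import Counter

def bootstrap_sample(data, rng):
    """One bagging round: draw len(data) examples with replacement."""
    return [rng.choice(data) for _ in data]

def random_forest_predict(trees, x):
    """The forest's output class is the mode of the individual trees' classes."""
    votes = Counter(tree(x) for tree in trees)
    return votes.most_common(1)[0][0]

# Toy forest: three "trees" (here simple threshold rules) voting on a scalar input.
trees = [lambda x: int(x > 2), lambda x: int(x > 3), lambda x: int(x > 5)]
print(random_forest_predict(trees, 4))  # two of the three trees vote for class 1
```

In a real forest, each tree would be trained on its own bootstrap sample and would restrict each split to a random subset of features, as described above.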
However, two problems commonly arise when performing a text classification task based on a random forest algorithm: 1. sample imbalance among classes biases the classification result toward the classes with more samples; 2. the choice of features determines both the execution speed and the final effect of the algorithm.
Disclosure of Invention
The embodiment of the application provides a text classification method and a computing device, which solve the problems of sample imbalance among classes and of feature screening, and can remarkably improve the text classification effect of a model.
In view of the above, a first aspect of the present application provides a text classification method, which may include:
acquiring text information of N color value areas and text information of M game areas in a current scene, wherein N and M are integers larger than 0, and the absolute value of the difference value of N and M is smaller than a preset threshold value;
selecting A pieces of text information from the text information of the N color value areas and the text information of the M game areas, wherein each piece of text information in the A pieces of text information comprises a first feature, a second feature and a third feature, the first feature comprises a sentence length, the second feature comprises a maximum inverse text frequency index value of words in a sentence, and the third feature comprises a maximum word frequency value of words in the sentence;
selecting at least two features from the first feature, the second feature and the third feature as candidate features;
and selecting the characteristic with the largest information gain to split the nodes of the decision tree according to the candidate characteristic and the characteristic selection formula to generate a random forest model.
Optionally, in some embodiments of the present application, before obtaining the text information of the N color value regions and the text information of the M game partitions in the current scene, the method further includes:
acquiring original text information of X1 color value areas;
when the absolute value of the difference value between X1 and M is larger than the preset threshold, selecting text information of X2 color value areas from the original text information of the X1 color value areas;
calculating to obtain new text information of X3 color value areas according to the text information of the X2 color value areas and a sample sampling formula;
determining that the sum of the new text information of the X3 color value regions and the original text information of the X1 color value regions is the text information of the N color value regions.
Optionally, in some embodiments of the present application, the calculating new text information of X3 color value regions according to the text information of X2 color value regions and a sample sampling formula includes:
determining neighboring text information of X3 color value areas according to the text information and Euclidean distance of the X2 color value areas;
and calculating to obtain new text information of the X3 color value areas according to the neighboring text information of the X3 color value areas and the sample sampling formula.
Optionally, in some embodiments of the present application, before obtaining the text information of the N color value regions and the text information of the M game partitions in the current scene, the method further includes:
acquiring original text information of Y1 game partitions;
when the absolute value of the difference value between Y1 and M is larger than the preset threshold value, selecting the text information of Y2 game partitions from the original text information of the Y1 game partitions;
calculating to obtain new text information of Y3 game partitions according to the text information of the Y2 game partitions and a sample sampling formula;
determining that the sum of the new text information of the Y3 game partitions and the original text information of the Y1 game partitions is the text information of the M game partitions.
Optionally, in some embodiments of the present application, the calculating new text information of Y3 game partitions according to the text information of Y2 game partitions and a sample sampling formula includes:
determining adjacent text information of Y3 game partitions according to the text information and Euclidean distance of the Y2 game partitions;
and calculating to obtain new text information of the Y3 game partitions according to the adjacent text information of the Y3 game partitions and the sample sampling formula.
Optionally, in some embodiments of the present application, the feature selection formula is:
[formula shown as an image in the original]
wherein G(A) denotes the information gain of attribute A, Split(A) denotes the information split amount of attribute A, T(F) denotes the degree of association between attribute A and the other attributes, F denotes the set of attributes other than A, and the adjustment coefficient takes a value in (0, 1).
Optionally, in some embodiments of the present application, the sample sampling formula is:
s_i = x_i + τ * max(0.1, |x_ij − x_i|),
wherein s_i denotes the i-th new sample, x_i denotes any sample of a minority class, x_ij denotes the j-th neighboring sample of x_i with 0 ≤ j ≤ N, N denotes the number of randomly selected neighboring samples, and the adjustment coefficient τ takes a value in (0, 1).
A second aspect of the present application provides a computing device, which may include:
the first acquisition module is used for acquiring text information of N color value areas and text information of M game subareas in a current scene, wherein N and M are integers larger than 0, and the absolute value of the difference value of N and M is smaller than a preset threshold value;
a first selection module, configured to select A pieces of text information from the text information of the N color value regions and the text information of the M game partitions, where each piece of text information in the A pieces of text information includes a first feature, a second feature, and a third feature, the first feature includes a sentence length, the second feature includes a maximum inverse text frequency index value of a word in a sentence, and the third feature includes a maximum word frequency value of a word in a sentence;
a second selection module for selecting at least two features from the first feature, the second feature and the third feature as candidate features;
and the generation module is used for selecting the characteristic with the largest information gain to split the nodes of the decision tree according to the candidate characteristic and the characteristic selection formula so as to generate a random forest model.
Optionally, in some embodiments of the present application, the computing apparatus may further include:
the second acquisition module is used for acquiring the original text information of the X1 color value areas;
a third selecting module, configured to select text information of X2 color value regions from the original text information of the X1 color value regions when an absolute value of a difference between X1 and M is greater than the preset threshold;
the calculation module is used for calculating to obtain new text information of X3 color value areas according to the text information of the X2 color value areas and a sample sampling formula;
a determining module, configured to determine that a sum of the new text information of the X3 color value regions and the original text information of the X1 color value regions is the text information of the N color value regions.
Alternatively, in some embodiments of the present application,
the calculation module is specifically configured to determine, according to the text information and euclidean distances of the X2 color value regions, neighboring text information of X3 color value regions; and calculating to obtain new text information of the X3 color value areas according to the neighboring text information of the X3 color value areas and the sample sampling formula.
Alternatively, in some embodiments of the present application,
the second acquisition module is used for acquiring the original text information of Y1 game partitions;
a third selecting module, configured to select text information of Y2 game partitions from the original text information of the Y1 game partitions when an absolute value of a difference between Y1 and M is greater than the preset threshold;
the calculation module is used for calculating new text information of Y3 game partitions according to the text information of the Y2 game partitions and a sample sampling formula;
a determining module, configured to determine that a sum of the new text information of the Y3 game partitions and the original text information of the Y1 game partitions is the text information of the M game partitions.
Alternatively, in some embodiments of the present application,
the calculation module is specifically configured to determine neighboring text information of Y3 game partitions according to the text information and euclidean distances of the Y2 game partitions; and calculating to obtain new text information of the Y3 game partitions according to the adjacent text information of the Y3 game partitions and the sample sampling formula.
Optionally, in some embodiments of the present application, the feature selection formula is:
[formula shown as an image in the original]
wherein G(A) denotes the information gain of attribute A, Split(A) denotes the information split amount of attribute A, T(F) denotes the degree of association between attribute A and the other attributes, F denotes the set of attributes other than A, and the adjustment coefficient takes a value in (0, 1).
Optionally, in some embodiments of the present application, the sample sampling formula is:
s_i = x_i + τ * max(0.1, |x_ij − x_i|),
wherein s_i denotes the i-th new sample, x_i denotes any sample of a minority class, x_ij denotes the j-th neighboring sample of x_i with 0 ≤ j ≤ N, N denotes the number of randomly selected neighboring samples, and the adjustment coefficient τ takes a value in (0, 1).
In a third aspect, an embodiment of the present invention provides a computing apparatus, including a memory and a processor, where the processor is configured to implement the steps of the text classification method described in the foregoing first aspect when executing a computer program stored in the memory.
In a fourth aspect, an embodiment of the present invention provides a readable storage medium, on which a computer program is stored, where the computer program, when executed by a processor, implements the steps of the text classification method as described in the foregoing first aspect embodiment.
According to the technical scheme, the embodiment of the application has the following advantages:
in the embodiment of the application, the text information of N color value areas and the text information of M game subareas in the current scene are obtained, wherein N and M are integers larger than 0, and the absolute value of the difference value between N and M is smaller than a preset threshold value; selecting A pieces of text information from the text information of the N color value areas and the text information of the M game areas, wherein each piece of text information in the A pieces of text information comprises a first feature, a second feature and a third feature, the first feature comprises a sentence length, the second feature comprises a maximum inverse text frequency index value of words in a sentence, and the third feature comprises a maximum word frequency value of words in the sentence; selecting at least two features from the first feature, the second feature and the third feature as candidate features; and selecting the characteristic with the largest information gain to split the nodes of the decision tree according to the candidate characteristic and the characteristic selection formula to generate a random forest model. The method is used for solving the problem of sample imbalance among different classes and the problem of feature screening, and can remarkably improve the text classification effect of the model.
Drawings
To more clearly illustrate the technical solutions of the embodiments of the present application, the drawings needed in the description of the embodiments and of the prior art are briefly introduced below. Obviously, the drawings in the following description show only some embodiments of the present application, and other drawings can be derived from them.
FIG. 1 is a schematic diagram of an embodiment of a text classification method in an embodiment of the present application;
FIG. 2 is a schematic diagram of an embodiment of a computing device in an embodiment of the present application;
FIG. 3 is a schematic diagram of another embodiment of a computing device in an embodiment of the present application;
FIG. 4 is a schematic diagram of another embodiment of a computing device in an embodiment of the present application;
FIG. 5 is a schematic diagram of another embodiment of a computer-readable storage medium in an embodiment of the present application.
Detailed Description
The embodiment of the application provides a text classification method and a computing device, which solve the problems of sample imbalance among classes and of feature screening, and can remarkably improve the text classification effect of a model.
To enable a person skilled in the art to better understand the present application, the technical solutions in the embodiments of the present application are described below with reference to the drawings in the embodiments of the present application. Obviously, the described embodiments are only some, not all, of the embodiments of the present application; all other embodiments obtained on the basis of the embodiments in the present application shall fall within the protection scope of the present application.
In the following, a brief description of the terms referred to in the present application is given, as follows:
random Forest algorithm (RF), in machine learning, a Random Forest is a classifier that contains multiple decision trees and whose output class is determined by the mode of the class output by the individual trees.
Each tree was built according to the following algorithm:
(1) Let N denote the number of training cases (samples) and M the number of features.
(2) Input a feature number m, used to determine the decision at a node of a decision tree; m should be much smaller than M.
(3) Sample N times from the N training cases with replacement to form a training set (i.e. bootstrap sampling), and use the cases that were not drawn to estimate the prediction error (out-of-bag estimation).
(4) For each node, randomly select m features; the decision at each node of the decision tree is determined based on these features, and the optimal split is calculated from the m features.
(5) Each tree grows to its full extent; the pruning that may be employed after building an ordinary tree classifier is not applied.
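Step (4) — drawing m random features at a node and picking the best split by information gain — can be sketched as follows (an illustrative sketch; feature values are assumed boolean, which matches the thresholded features used later in the text):

```python
import math
import random

def entropy(labels):
    """Shannon entropy (base 2) of a list of class labels."""
    n = len(labels)
    counts = {}
    for y in labels:
        counts[y] = counts.get(y, 0) + 1
    return -sum(c / n * math.log2(c / n) for c in counts.values())

def best_split(X, y, m, rng):
    """Randomly draw m features and return (feature, gain) for the one
    whose boolean split yields the largest information gain."""
    base, best = entropy(y), None
    for f in rng.sample(range(len(X[0])), m):
        left = [yi for xi, yi in zip(X, y) if xi[f]]
        right = [yi for xi, yi in zip(X, y) if not xi[f]]
        if not left or not right:
            continue  # this feature does not split the node
        cond = (len(left) * entropy(left) + len(right) * entropy(right)) / len(y)
        if best is None or base - cond > best[1]:
            best = (f, base - cond)
    return best

# Feature 0 separates the classes perfectly, so it wins with gain 1.0.
X = [[True, False], [True, True], [False, False], [False, True]]
y = [1, 1, 0, 0]
print(best_split(X, y, 2, random.Random(0)))
```

A full tree builder would recurse on the two child node subsets until each node is pure, per step (5).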
When a text classification task based on a random forest algorithm is performed, two problems commonly arise: 1. sample imbalance among classes biases the classification result toward the classes with more samples; 2. the choice of features determines both the execution speed and the final effect of the algorithm.
The present invention therefore improves on these two problems. The technical solution of the present application is further described below by way of an embodiment; as shown in FIG. 1, which illustrates a text classification method in an embodiment of the present application, the method may include:
101. acquiring text information of N color value areas and text information of M game areas in the current scene, wherein N and M are integers larger than 0, and the absolute value of the difference value of N and M is smaller than a preset threshold value.
In this embodiment of the application, before the obtaining of the text information of the N color value regions and the text information of the M game partitions in the current scene, the method may further include:
(1) acquiring original text information of X1 color value areas; when the absolute value of the difference value between X1 and M is larger than the preset threshold, selecting text information of X2 color value areas from the original text information of the X1 color value areas; calculating to obtain new text information of X3 color value areas according to the text information of the X2 color value areas and a sample sampling formula; determining that the sum of the new text information of the X3 color value regions and the original text information of the X1 color value regions is the text information of the N color value regions.
Or,
(2) acquiring original text information of Y1 game partitions; when the absolute value of the difference value between Y1 and M is larger than the preset threshold value, selecting the text information of Y2 game partitions from the original text information of the Y1 game partitions; calculating to obtain new text information of Y3 game partitions according to the text information of the Y2 game partitions and a sample sampling formula; determining that the sum of the new text information of the Y3 game partitions and the original text information of the Y1 game partitions is the text information of the M game partitions.
Optionally, the calculating to obtain new text information of X3 color value regions according to the text information of the X2 color value regions and the sample sampling formula may include:
determining neighboring text information of X3 color value areas according to the text information and Euclidean distance of the X2 color value areas; and calculating to obtain new text information of the X3 color value areas according to the neighboring text information of the X3 color value areas and the sample sampling formula.
Optionally, the calculating new text information of Y3 game partitions according to the text information of Y2 game partitions and the sample sampling formula may include:
determining adjacent text information of Y3 game partitions according to the text information and Euclidean distance of the Y2 game partitions; and calculating to obtain new text information of the Y3 game partitions according to the adjacent text information of the Y3 game partitions and the sample sampling formula.
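The neighbor-determination step in these claims — finding a sample's nearest neighbors by Euclidean distance in the feature space — can be sketched as follows (the function name and the choice of k are illustrative):

```python
import math

def nearest_neighbors(target, samples, k=5):
    """Return the k samples closest to `target` by Euclidean distance."""
    def dist(a, b):
        return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))
    return sorted(samples, key=lambda s: dist(target, s))[:k]

# The two samples nearest to the origin are returned, nearest first.
print(nearest_neighbors([0.0, 0.0], [[1.0, 0.0], [3.0, 0.0], [2.0, 0.0]], k=2))
```

In the method described here, `samples` would be the minority-class vectors in the TF-IDF (or word2vec) space, and k = 5 matches the example that follows.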
For example, the computing device may extract the text information (which may also be referred to as the corpus) of the color value partition and of the game partition from the bullet-screen (danmaku) library of a live-streaming page, for example: color value partition: 100,000 entries; game partition: 20,000 entries.
First, all corpora are segmented with the jieba word segmenter, stop words are filtered out, and the words are mapped into a 4-dimensional word2vec space. The text information of the game partition is then supplemented: 10,000 entries are taken at random as original samples; for each original sample, its 5 neighboring samples in the TF-IDF vector space are found by Euclidean distance; and 5 new samples can then be generated by transforming these 5 neighboring samples with the sample sampling formula.
Wherein the sample sampling formula is:
s_i = x_i + τ * max(0.1, |x_ij − x_i|)   (formula one),
wherein s_i denotes the i-th new sample, x_i denotes any sample of a minority class, x_ij denotes the j-th neighboring sample of x_i with 0 ≤ j ≤ N, N denotes the number of randomly selected neighboring samples, and the adjustment coefficient τ takes a value in (0, 1). It should be noted that formula one is a sampling formula for the class with fewer samples; its purpose is to add N samples so that the classes become balanced.
Assume s1: "I like to see Miss" → [0.212, 0.356, 0.254, 0.684]; the 5 neighboring samples of s1 can then be found:
s11 = [0.102, 0.254, 0.102, 0.631], …, s15;
then, using formula one, a new sample is generated from s11:
s' = s1 + 0.6 * |s11 − s1|
   = [0.212, 0.356, 0.254, 0.684] + 0.6 * [0.11, 0.102, 0.152, 0.053]
   = [0.278, 0.4172, 0.3452, 0.7158]
The new sample s' is then mapped back to new text by word2vec: "Miss will look nice".
Similarly, the computing device can obtain new samples from the other neighboring samples, and the number of game-partition samples is expanded to 70,000, i.e. 50,000 new samples plus the 20,000 original samples.
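The sample sampling formula above, applied per vector dimension, might look like the sketch below (the elementwise reading of the max(0.1, ·) floor is our assumption; the worked example in the text appears to skip the floor for differences below 0.1):

```python
def synth_sample(x, x_neighbor, tau=0.6):
    """Formula one: s = x + tau * max(0.1, |x_neighbor - x|), per dimension.
    tau is the adjustment coefficient in (0, 1)."""
    return [xi + tau * max(0.1, abs(xn - xi)) for xi, xn in zip(x, x_neighbor)]

# A minority-class sample and one of its neighbors, as in the s1/s11 example.
s1 = [0.212, 0.356, 0.254, 0.684]
s11 = [0.102, 0.254, 0.102, 0.631]
print(synth_sample(s1, s11))
```

This is the SMOTE-like idea of interpolating toward neighbors to synthesize minority-class samples; with the floor applied, the last dimension here becomes 0.684 + 0.6 × 0.1 = 0.744 rather than the 0.7158 of the floorless example.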
102. Selecting A pieces of text information from the text information of the N color value areas and the text information of the M game areas, wherein each piece of text information in the A pieces of text information comprises a first feature, a second feature and a third feature, the first feature comprises a sentence length, the second feature comprises a maximum inverse text frequency index value of words in the sentence, and the third feature comprises a maximum word frequency value of words in the sentence.
Illustratively, each decision tree in the random forest takes as its training set 20,000 corpus entries selected from the entire corpus with replacement.
Each sample has 3 features:
Feature A: whether the sentence length is greater than 5;
Feature B: whether the maximum inverse document frequency (IDF) value of the words in the sentence is greater than 200;
Feature C: whether the maximum term frequency (TF) value of the words in the sentence is greater than 30.
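The three thresholded features might be computed as follows (a sketch under the assumption that corpus-level TF and IDF lookup tables are precomputed; the names and thresholds follow the example in the text):

```python
def extract_features(tokens, idf, tf):
    """Features A/B/C from the text: sentence length > 5, max IDF of the
    sentence's words > 200, and max TF of the sentence's words > 30."""
    feature_a = len(tokens) > 5
    feature_b = max(idf.get(w, 0.0) for w in tokens) > 200
    feature_c = max(tf.get(w, 0) for w in tokens) > 30
    return (feature_a, feature_b, feature_c)

idf = {"rare_word": 250.0}    # illustrative corpus statistics
tf = {"common_word": 40}
print(extract_features(["rare_word", "common_word"], idf, tf))
```

Each segmented bullet-screen sentence would be reduced to such a boolean triple before being handed to the decision trees.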
103. Selecting at least two features from the first feature, the second feature, and the third feature as candidate features.
Illustratively, t features (t < 3) are selected as the candidate features of the decision tree.
104. And selecting the characteristic with the largest information gain to split the nodes of the decision tree according to the candidate characteristic and the characteristic selection formula to generate a random forest model.
Wherein the feature selection formula is:
[formula shown as an image in the original]
wherein G(A) denotes the information gain of attribute A, Split(A) denotes the information split amount of attribute A, T(F) denotes the degree of association between attribute A and the other attributes, F denotes the set of attributes other than A, and the adjustment coefficient takes a value in (0, 1). Formula two is the criterion by which the decision tree selects a feature for a node: the feature with the largest information gain ratio becomes the node of the current round.
Illustratively, the computing device uses formula two to select, from the candidate features, the feature with the largest information gain ratio to split the nodes of the decision tree; the value of t remains unchanged throughout the growth of the random forest.
Let t = 2 and let the first selected features be A and B; the total number of samples N is 20000,
the number of game-partition samples Ng is 8000, and the number of color-value-partition samples Nf is 12000;
A+ = 9000, of which 6000 belong to the color value partition and 3000 to the game partition;
A− = 11000, of which 6000 belong to the color value partition and 5000 to the game partition;
B+ = 5000, of which 3000 belong to the color value partition and 2000 to the game partition;
B− = 15000, of which 9000 belong to the color value partition and 6000 to the game partition.
The computing device can therefore find the information gain according to formula three.
Information gain G(A): G(A) = E(S) − E(S|A)   (formula three),
wherein E(S) denotes the entropy of the set S (see the entropy formula of the decision tree) and E(S|A) denotes the entropy after division by feature A (see the conditional entropy formula of the decision tree). Formula three is the information gain formula, cited from the random-forest literature to supplement the description of formula two.
Illustratively, by formula three: G(A) = E(S) − E(S|A);
[calculation of E(S) shown as an image in the original; E(S) = 0.292]
[calculation of E(S|A) shown as an image in the original; E(S|A) = 0.286]
Therefore, the information gain: G(A) = 0.292 − 0.286 = 0.006.
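The entropy figures above (shown as images in the original) are consistent with logarithms taken in base 10, which can be checked directly from the example's counts (with these counts E(S|A) works out to about 0.289 rather than the printed 0.286, presumably a rounding difference in the original):

```python
import math

def entropy10(counts):
    """Class entropy with base-10 logarithms, as the printed E(S) = 0.292 implies."""
    n = sum(counts)
    return -sum(c / n * math.log10(c / n) for c in counts if c)

E_S = entropy10([12000, 8000])                          # color-value vs game samples
E_S_given_A = (9000 / 20000) * entropy10([6000, 3000]) \
            + (11000 / 20000) * entropy10([6000, 5000])
print(round(E_S, 3), round(E_S - E_S_given_A, 3))       # E(S) and the gain G(A)
```

The same helper reproduces the split of the sample counts by feature B using `entropy10([5000, 15000])`.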
The computing device may find the information split amount according to formula four.
Information split amount Split(A):
[formula shown as an image in the original]
wherein n is the total number of samples divided by feature A, and a_j is the number of samples in category j when divided by feature A. Formula four is the information split amount, cited from the random-forest literature to supplement the description of formula two.
Illustratively, by formula four, the information split amount Split(A):
[calculation of Split(A) shown as an image in the original]
The computing device can thus find the value of the attribute association degree according to formula five.
Attribute association degree T(F):
[formula shown as an image in the original]
wherein n is the number of attributes other than attribute A, and H(F_i) denotes the entropy value of the i-th such attribute. Formula five expresses the degree of association between attributes: the smaller the association between attribute A and the other attributes, the larger the information gain ratio of attribute A.
Illustratively, since H(A) = E(S|A) and H(B) = E(S|B) = 0.203,
[calculation of T(F) shown as an image in the original]
then, calculating according to formula two:
[calculation of Gen(A) shown as an image in the original]
Gen(B) = 0.107.
Since Gen(B) > Gen(A), the B feature should be chosen to split the node in this decision tree.
It should be noted that steps 102–104 are executed in a loop to ensure that each decision tree in the random forest splits to the maximum extent; no pruning is required, and finally the random forest model is generated.
In the embodiment of the application, the text information of N color value regions and the text information of M game partitions in the current scene is obtained, wherein N and M are integers greater than 0 and the absolute value of the difference between N and M is smaller than a preset threshold; A pieces of text information are selected from the text information of the N color value regions and the text information of the M game partitions, wherein each piece of text information in the A pieces of text information includes a first feature, a second feature and a third feature, the first feature includes a sentence length, the second feature includes a maximum inverse text frequency index value of the words in the sentence, and the third feature includes a maximum word frequency value of the words in the sentence; at least two features are selected from the first feature, the second feature and the third feature as candidate features; and, according to the candidate features and the feature selection formula, the feature with the largest information gain is selected to split the nodes of the decision tree and generate a random forest model. This solves the problems of sample imbalance among classes and of feature screening, and can remarkably improve the text classification effect of the model.
As shown in fig. 2, fig. 2 is a schematic view of an embodiment of a computing apparatus in an embodiment of the present application, and may include:
a first obtaining module 201, configured to obtain text information of N color value regions and text information of M game partitions in a current scene, where N and M are integers greater than 0, and an absolute value of a difference between N and M is smaller than a preset threshold;
a first selection module 202, configured to select A pieces of text information from the text information of the N color value regions and the text information of the M game partitions, where each piece of text information in the A pieces of text information includes a first feature, a second feature, and a third feature, the first feature includes a sentence length, the second feature includes a maximum inverse text frequency index value of a word in the sentence, and the third feature includes a maximum word frequency value of a word in the sentence;
a second selection module 203, configured to select at least two features from the first feature, the second feature, and the third feature as candidate features;
and the generating module 204 is configured to select a feature with the largest information gain to split nodes of the decision tree according to the candidate feature and the feature selection formula, and generate a random forest model.
Optionally, in some embodiments of the present application, as shown in fig. 3, fig. 3 is a schematic diagram of an embodiment of a computing device in an embodiment of the present application, and the computing device may further include:
a second obtaining module 205, configured to obtain original text information of X1 color value regions;
a third selecting module 206, configured to select text information of X2 color value regions from the original text information of X1 color value regions when an absolute value of a difference between X1 and M is greater than a preset threshold;
the calculating module 207 is configured to calculate new text information of X3 color value regions according to the text information of the X2 color value regions and the sample sampling formula;
and the determining module 208 is used for determining that the sum of the new text information of the X3 color value areas and the original text information of the X1 color value areas is the text information of the N color value areas.
Alternatively, in some embodiments of the present application,
the calculating module 207 is specifically configured to determine neighboring text information of X3 color value regions according to the text information of the X2 color value regions and the Euclidean distance, and to calculate new text information of the X3 color value regions according to the neighboring text information of the X3 color value regions and the sample sampling formula.
Alternatively, in some embodiments of the present application,
a second obtaining module 205, configured to obtain original text information of Y1 game partitions;
a third selecting module 206, configured to select text information of Y2 game partitions from the original text information of Y1 game partitions when the absolute value of the difference between Y1 and M is greater than a preset threshold;
the calculating module 207 is used for calculating new text information of Y3 game partitions according to the text information of the Y2 game partitions and a sample sampling formula;
and the determining module 208 is used for determining the sum of the new text information of the Y3 game partitions and the original text information of the Y1 game partitions as the text information of the M game partitions.
Alternatively, in some embodiments of the present application,
the calculation module 207 is specifically configured to determine neighboring text information of Y3 game partitions according to the text information of the Y2 game partitions and the Euclidean distance, and to calculate new text information of the Y3 game partitions according to the neighboring text information of the Y3 game partitions and the sample sampling formula.
Optionally, in some embodiments of the present application, the feature selection formula is:
[The feature selection formula is rendered as image BDA0001819574390000121 in the original and is not recoverable from the text.]
wherein G(A) represents the information gain of attribute A, Split(A) represents the information division component of attribute A, T(F) represents the degree of association between attribute A and the attributes other than A, and F represents the set of attributes other than A,
[An auxiliary expression is rendered as image BDA0001819574390000131 in the original.] The adjustment coefficient τ takes a value in (0, 1).
Optionally, in some embodiments of the present application, the sample sampling formula is:
si = xi + τ·max(0.1, |xij − xi|),
where si represents the ith new sample, xi represents any sample of the minority class, xij represents the jth neighbor of xi with 0 ≤ j ≤ N, N represents the number of randomly selected neighbor samples, and the adjustment coefficient τ takes a value in (0, 1).
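A sketch of the oversampling step, reading the formula as generating a new sample s_i from a minority-class sample x_i and one of its Euclidean-distance neighbors x_ij. Applying the formula per dimension, and the choice of toy data, are assumptions for illustration:

```python
import math
import random

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def synthesize(x_i, neighbors, tau):
    """s_i = x_i + tau * max(0.1, |x_ij - x_i|), applied per dimension
    with one randomly chosen neighbor x_ij."""
    x_ij = random.choice(neighbors)
    return tuple(v + tau * max(0.1, abs(w - v)) for v, w in zip(x_i, x_ij))

# Minority-class samples; pick the two nearest neighbors of x_i by Euclidean distance.
minority = [(1.0, 2.0), (1.2, 2.1), (0.9, 1.8), (5.0, 5.0)]
x_i = minority[0]
neighbors = sorted(minority[1:], key=lambda p: euclidean(p, x_i))[:2]

random.seed(0)
tau = 0.5  # adjustment coefficient in (0, 1)
s_i = synthesize(x_i, neighbors, tau)
```

Because max(0.1, ·) bounds the offset away from zero, every synthesized coordinate moves at least τ·0.1 away from the original sample, avoiding exact duplicates.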
As shown in fig. 4, an embodiment of the present invention provides a computing apparatus, which includes a memory 410, a processor 420, and a computer program 411 stored in the memory 410 and executable on the processor 420; when the processor 420 executes the computer program 411, the following steps may be implemented:
acquiring text information of N color value areas and text information of M game areas in a current scene, wherein N and M are integers larger than 0, and the absolute value of the difference value of N and M is smaller than a preset threshold value;
selecting A pieces of text information from the text information of the N color value areas and the text information of the M game areas, wherein each piece of text information in the A pieces of text information comprises a first feature, a second feature and a third feature, the first feature comprises a sentence length, the second feature comprises a maximum inverse text frequency index value of words in a sentence, and the third feature comprises a maximum word frequency value of words in the sentence;
selecting at least two features from the first feature, the second feature and the third feature as candidate features;
and selecting the characteristic with the largest information gain to split the nodes of the decision tree according to the candidate characteristic and the characteristic selection formula to generate a random forest model.
Optionally, in some embodiments of the present application, the processor 420, when executing the computer program 411, may further implement the following steps:
acquiring original text information of X1 color value areas;
when the absolute value of the difference value between X1 and M is larger than the preset threshold, selecting text information of X2 color value areas from the original text information of the X1 color value areas;
calculating to obtain new text information of X3 color value areas according to the text information of the X2 color value areas and a sample sampling formula;
determining that the sum of the new text information of the X3 color value regions and the original text information of the X1 color value regions is the text information of the N color value regions.
Optionally, in some embodiments of the present application, the processor 420, when executing the computer program 411, may further implement the following steps:
determining neighboring text information of X3 color value areas according to the text information and Euclidean distance of the X2 color value areas;
and calculating to obtain new text information of the X3 color value areas according to the neighboring text information of the X3 color value areas and the sample sampling formula.
Optionally, in some embodiments of the present application, the processor 420, when executing the computer program 411, may further implement the following steps:
acquiring original text information of Y1 game partitions;
when the absolute value of the difference value between Y1 and M is larger than the preset threshold value, selecting the text information of Y2 game partitions from the original text information of the Y1 game partitions;
calculating to obtain new text information of Y3 game partitions according to the text information of the Y2 game partitions and a sample sampling formula;
determining that the sum of the new text information of the Y3 game partitions and the original text information of the Y1 game partitions is the text information of the M game partitions.
Optionally, in some embodiments of the present application, the processor 420, when executing the computer program 411, may further implement the following steps:
determining adjacent text information of Y3 game partitions according to the text information and Euclidean distance of the Y2 game partitions;
and calculating to obtain new text information of the Y3 game partitions according to the adjacent text information of the Y3 game partitions and the sample sampling formula.
Optionally, in some embodiments of the present application, the feature selection formula is:
[The feature selection formula is rendered as image BDA0001819574390000141 in the original and is not recoverable from the text.]
wherein G(A) represents the information gain of attribute A, Split(A) represents the information division component of attribute A, T(F) represents the degree of association between attribute A and the attributes other than A, and F represents the set of attributes other than A,
[An auxiliary expression is rendered as image BDA0001819574390000142 in the original.] The adjustment coefficient τ takes a value in (0, 1).
Optionally, in some embodiments of the present application, the sample sampling formula is:
si = xi + τ·max(0.1, |xij − xi|),
where si represents the ith new sample, xi represents any sample of the minority class, xij represents the jth neighbor of xi with 0 ≤ j ≤ N, N represents the number of randomly selected neighbor samples, and the adjustment coefficient τ takes a value in (0, 1).
Referring to fig. 5, fig. 5 is a schematic diagram illustrating an embodiment of a computer-readable storage medium according to the present invention.
As shown in fig. 5, the present embodiment provides a computer-readable storage medium, on which a computer program 511 is stored, and the computer program 511, when executed by a processor, can implement the following steps:
acquiring text information of N color value areas and text information of M game areas in a current scene, wherein N and M are integers larger than 0, and the absolute value of the difference value of N and M is smaller than a preset threshold value;
selecting A pieces of text information from the text information of the N color value areas and the text information of the M game areas, wherein each piece of text information in the A pieces of text information comprises a first feature, a second feature and a third feature, the first feature comprises a sentence length, the second feature comprises a maximum inverse text frequency index value of words in a sentence, and the third feature comprises a maximum word frequency value of words in the sentence;
selecting at least two features from the first feature, the second feature and the third feature as candidate features;
and selecting the characteristic with the largest information gain to split the nodes of the decision tree according to the candidate characteristic and the characteristic selection formula to generate a random forest model.
Optionally, in some embodiments of the present application, the computer program 511, when executed by the processor, may further implement the following steps:
acquiring original text information of X1 color value areas;
when the absolute value of the difference value between X1 and M is larger than the preset threshold, selecting text information of X2 color value areas from the original text information of the X1 color value areas;
calculating to obtain new text information of X3 color value areas according to the text information of the X2 color value areas and a sample sampling formula;
determining that the sum of the new text information of the X3 color value regions and the original text information of the X1 color value regions is the text information of the N color value regions.
Optionally, in some embodiments of the present application, the computer program 511, when executed by the processor, may further implement the following steps:
determining neighboring text information of X3 color value areas according to the text information and Euclidean distance of the X2 color value areas;
and calculating to obtain new text information of the X3 color value areas according to the neighboring text information of the X3 color value areas and the sample sampling formula.
Optionally, in some embodiments of the present application, the computer program 511, when executed by the processor, may further implement the following steps:
acquiring original text information of Y1 game partitions;
when the absolute value of the difference value between Y1 and M is larger than the preset threshold value, selecting the text information of Y2 game partitions from the original text information of the Y1 game partitions;
calculating to obtain new text information of Y3 game partitions according to the text information of the Y2 game partitions and a sample sampling formula;
determining that the sum of the new text information of the Y3 game partitions and the original text information of the Y1 game partitions is the text information of the M game partitions.
Optionally, in some embodiments of the present application, the computer program 511, when executed by the processor, may further implement the following steps:
determining adjacent text information of Y3 game partitions according to the text information and Euclidean distance of the Y2 game partitions;
and calculating to obtain new text information of the Y3 game partitions according to the adjacent text information of the Y3 game partitions and the sample sampling formula.
It is clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the above-described systems, apparatuses and units may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
In the several embodiments provided in the present application, it should be understood that the disclosed system, apparatus and method may be implemented in other manners. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the units is only one logical division, and other divisions may be realized in practice, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present application may be substantially implemented or contributed to by the prior art, or all or part of the technical solution may be embodied in a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present application. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and other various media capable of storing program codes.
The above embodiments are only used for illustrating the technical solutions of the present application, and not for limiting the same; although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions in the embodiments of the present application.

Claims (9)

1. A method of text classification, comprising:
acquiring text information of N color value areas and text information of M game areas in a current scene, wherein N and M are integers larger than 0, and the absolute value of the difference value of N and M is smaller than a preset threshold value;
selecting A pieces of text information from the text information of the N color value areas and the text information of the M game areas, wherein each piece of text information in the A pieces of text information comprises a first feature, a second feature and a third feature, the first feature comprises a sentence length, the second feature comprises a maximum inverse text frequency index value of words in a sentence, and the third feature comprises a maximum word frequency value of words in the sentence;
selecting at least two features from the first feature, the second feature and the third feature as candidate features;
selecting a characteristic with the largest information gain to split the nodes of the decision tree according to the candidate characteristic and the characteristic selection formula to generate a random forest model;
the feature selection formula is as follows:
[The feature selection formula is rendered as image FDA0002805708050000011 in the original and is not recoverable from the text.]
wherein G(A) represents the information gain of attribute A, Split(A) represents the information division component of attribute A, T(F) represents the degree of association between attribute A and the attributes other than A, and F represents the set of attributes other than A,
[An auxiliary expression is rendered as image FDA0002805708050000012 in the original.] The adjustment coefficient τ takes a value in (0, 1).
2. The method of claim 1, wherein before obtaining the text information of the N color value regions and the text information of the M game partitions in the current scene, the method further comprises:
acquiring original text information of X1 color value areas;
when the absolute value of the difference value between X1 and M is larger than the preset threshold, selecting text information of X2 color value areas from the original text information of the X1 color value areas;
calculating to obtain new text information of X3 color value areas according to the text information of the X2 color value areas and a sample sampling formula;
determining that the sum of the new text information of the X3 color value regions and the original text information of the X1 color value regions is the text information of the N color value regions.
3. The method according to claim 2, wherein calculating new text information of X3 color value regions according to the text information of the X2 color value regions and a sample sampling formula comprises:
determining neighboring text information of X3 color value areas according to the text information and Euclidean distance of the X2 color value areas;
and calculating to obtain new text information of the X3 color value areas according to the neighboring text information of the X3 color value areas and the sample sampling formula.
4. The method of claim 1, wherein before obtaining the text information of the N color value regions and the text information of the M game partitions in the current scene, the method further comprises:
acquiring original text information of Y1 game partitions;
when the absolute value of the difference value between Y1 and M is larger than the preset threshold value, selecting the text information of Y2 game partitions from the original text information of the Y1 game partitions;
calculating to obtain new text information of Y3 game partitions according to the text information of the Y2 game partitions and a sample sampling formula;
determining that the sum of the new text information of the Y3 game partitions and the original text information of the Y1 game partitions is the text information of the M game partitions.
5. The method of claim 4, wherein calculating new text information of Y3 game partitions according to the text information of the Y2 game partitions and a sample sampling formula comprises:
determining adjacent text information of Y3 game partitions according to the text information and Euclidean distance of the Y2 game partitions;
and calculating to obtain new text information of the Y3 game partitions according to the adjacent text information of the Y3 game partitions and the sample sampling formula.
6. The method according to any of claims 2-5, wherein the sample sampling formula is:
si = xi + τ·max(0.1, |xij − xi|),
where si represents the ith new sample, xi represents any sample of the minority class, xij represents the jth neighbor of xi with 0 ≤ j ≤ N, N represents the number of randomly selected neighbor samples, and the adjustment coefficient τ takes a value in (0, 1).
7. A computing device, comprising:
the first acquisition module is used for acquiring text information of N color value areas and text information of M game subareas in a current scene, wherein N and M are integers larger than 0, and the absolute value of the difference value of N and M is smaller than a preset threshold value;
a first selection module, configured to select A pieces of text information from the text information of the N color value regions and the text information of the M game partitions, where each piece of text information in the A pieces of text information includes a first feature, a second feature, and a third feature, the first feature includes a sentence length, the second feature includes a maximum inverse text frequency index value of a word in the sentence, and the third feature includes a maximum word frequency value of a word in the sentence;
a second selection module for selecting at least two features from the first feature, the second feature and the third feature as candidate features;
the generation module is used for selecting a characteristic with the largest information gain to split nodes of the decision tree according to the candidate characteristic and the characteristic selection formula so as to generate a random forest model;
the feature selection formula is as follows:
[The feature selection formula is rendered as image FDA0002805708050000031 in the original and is not recoverable from the text.]
wherein G(A) represents the information gain of attribute A, Split(A) represents the information division component of attribute A, T(F) represents the degree of association between attribute A and the attributes other than A, and F represents the set of attributes other than A,
[An auxiliary expression is rendered as image FDA0002805708050000032 in the original.] The adjustment coefficient τ takes a value in (0, 1).
8. A computing device comprising a processor for implementing the steps of the text classification method according to any one of claims 1 to 6 when executing a computer program stored in a memory.
9. A readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the steps of the text classification method according to any one of claims 1 to 6.
CN201811158905.6A 2018-09-30 2018-09-30 Text classification method and computing device Active CN109284382B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811158905.6A CN109284382B (en) 2018-09-30 2018-09-30 Text classification method and computing device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811158905.6A CN109284382B (en) 2018-09-30 2018-09-30 Text classification method and computing device

Publications (2)

Publication Number Publication Date
CN109284382A CN109284382A (en) 2019-01-29
CN109284382B true CN109284382B (en) 2021-05-28

Family

ID=65182189

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811158905.6A Active CN109284382B (en) 2018-09-30 2018-09-30 Text classification method and computing device

Country Status (1)

Country Link
CN (1) CN109284382B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110390400B (en) * 2019-07-02 2023-07-14 北京三快在线科技有限公司 Feature generation method and device of computing model, electronic equipment and storage medium

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103473231A (en) * 2012-06-06 2013-12-25 深圳先进技术研究院 Classifier building method and system
CN107292186A (en) * 2016-03-31 2017-10-24 阿里巴巴集团控股有限公司 A kind of model training method and device based on random forest
CN107357895A (en) * 2017-01-05 2017-11-17 大连理工大学 A kind of processing method of the text representation based on bag of words

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102141978A (en) * 2010-02-02 2011-08-03 阿里巴巴集团控股有限公司 Method and system for classifying texts

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103473231A (en) * 2012-06-06 2013-12-25 深圳先进技术研究院 Classifier building method and system
CN107292186A (en) * 2016-03-31 2017-10-24 阿里巴巴集团控股有限公司 A kind of model training method and device based on random forest
CN107357895A (en) * 2017-01-05 2017-11-17 大连理工大学 A kind of processing method of the text representation based on bag of words

Also Published As

Publication number Publication date
CN109284382A (en) 2019-01-29

Similar Documents

Publication Publication Date Title
CN109325132A (en) Expertise recommended method, device, computer equipment and storage medium
CN110472043B (en) Clustering method and device for comment text
CN111767403A (en) Text classification method and device
KR101623860B1 (en) Method for calculating similarity between document elements
CN112231468B (en) Information generation method, device, electronic equipment and storage medium
CN113761105A (en) Text data processing method, device, equipment and medium
CN105631749A (en) User portrait calculation method based on statistical data
CN111198946A (en) Network news hotspot mining method and device
CN115470344A (en) Video barrage and comment theme fusion method based on text clustering
CN109284382B (en) Text classification method and computing device
CN109299463B (en) Emotion score calculation method and related equipment
JP6446987B2 (en) Video selection device, video selection method, video selection program, feature amount generation device, feature amount generation method, and feature amount generation program
CN111125547A (en) Knowledge community discovery method based on complex network
CN110162769B (en) Text theme output method and device, storage medium and electronic device
US20210312333A1 (en) Semantic relationship learning device, semantic relationship learning method, and storage medium storing semantic relationship learning program
CN111339778B (en) Text processing method, device, storage medium and processor
CN111556375B (en) Video barrage generation method and device, computer equipment and storage medium
CN113656451B (en) Data mining method, electronic device, and computer-readable storage medium
CN109657079A (en) A kind of Image Description Methods and terminal device
CN111753050B (en) Topic map-based comment generation
CN103324653A (en) Main point extraction device and main point extraction method
WO2014117296A1 (en) Generating a hint for a query
KR102117281B1 (en) Method for generating chatbot utterance using frequency table
Eichinger Reviews are gold!? on the link between item reviews and item preferences
CN112765329A (en) Method and system for discovering key nodes of social network

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant