
CN109284382B - Text classification method and computing device - Google Patents


Info

Publication number
CN109284382B
CN109284382B (granted publication of application CN201811158905.6A; published as CN109284382A)
Authority
CN
China
Prior art keywords
text information
feature
color value
game
areas
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201811158905.6A
Other languages
Chinese (zh)
Other versions
CN109284382A (en)
Inventor
徐乐乐 (Xu Lele)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Wuhan Douyu Network Technology Co Ltd
Original Assignee
Wuhan Douyu Network Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Wuhan Douyu Network Technology Co Ltd filed Critical Wuhan Douyu Network Technology Co Ltd
Priority to CN201811158905.6A priority Critical patent/CN109284382B/en
Publication of CN109284382A publication Critical patent/CN109284382A/en
Application granted granted Critical
Publication of CN109284382B publication Critical patent/CN109284382B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/243Classification techniques relating to the number of classes
    • G06F18/24323Tree-organised classifiers

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Image Analysis (AREA)

Abstract

The embodiment of the application discloses a text classification method and a computing device, which solve the problems of sample imbalance among classes and of feature screening, and can remarkably improve the text classification effect of a model. The method in the embodiment of the application comprises the following steps: acquiring text information of N color value regions and text information of M game partitions in a current scene, wherein N and M are integers greater than 0 and the absolute value of the difference between N and M is smaller than a preset threshold; selecting A pieces of text information from the text information of the N color value regions and the text information of the M game partitions; selecting at least two features from the first feature, the second feature and the third feature as candidate features; and, according to the candidate features and the feature selection formula, selecting the feature with the largest information gain to split the nodes of the decision tree and generate a random forest model.

Description

Text classification method and computing device
Technical Field
The present application relates to the field of big data, and in particular, to a text classification method and a computing device.
Background
In machine learning, a random forest is a classifier that contains multiple decision trees; its output class is the mode of the classes output by the individual trees. A random forest is in fact a special bagging method that uses decision trees as the base models. First, m training sets are generated by bootstrap sampling, and a decision tree is constructed for each training set. When searching for a feature to split on, the algorithm does not examine all features to maximize an index (such as information gain); instead, it randomly draws a subset of the features, finds the optimal one among the drawn features, and applies it to split the node. Because of this bagging (ensemble) idea, the random forest method amounts to sampling both the samples and the features.
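The bagging-and-voting scheme just described can be sketched as follows (a minimal illustration, not the patent's implementation; the threshold "trees" are stand-ins for real decision trees):

```python
import random
from collections import Counter

def bootstrap_sample(data, rng):
    """One bagging round: draw len(data) examples with replacement."""
    return [rng.choice(data) for _ in data]

def random_forest_predict(trees, x):
    """The forest's output class is the mode of the individual trees' classes."""
    votes = Counter(tree(x) for tree in trees)
    return votes.most_common(1)[0][0]

# Toy forest: three "trees" (here simple threshold rules) voting on a scalar input.
trees = [lambda x: int(x > 2), lambda x: int(x > 3), lambda x: int(x > 5)]
print(random_forest_predict(trees, 4))  # two of the three trees vote for class 1
```

In a real forest, each tree would be trained on its own bootstrap sample and would restrict each split to a random subset of features, as described above.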
However, two problems commonly arise when performing a text classification task based on a random forest algorithm: 1. sample imbalance among classes biases the classification result toward the classes with more samples; 2. the choice of features determines both the execution speed and the final effect of the algorithm.
Disclosure of Invention
The embodiment of the application provides a text classification method and a computing device, which solve the problems of sample imbalance among classes and of feature screening, and can remarkably improve the text classification effect of a model.
In view of the above, a first aspect of the present application provides a text classification method, which may include:
acquiring text information of N color value areas and text information of M game areas in a current scene, wherein N and M are integers larger than 0, and the absolute value of the difference value of N and M is smaller than a preset threshold value;
selecting A pieces of text information from the text information of the N color value areas and the text information of the M game areas, wherein each piece of text information in the A pieces of text information comprises a first feature, a second feature and a third feature, the first feature comprises a sentence length, the second feature comprises a maximum inverse text frequency index value of words in a sentence, and the third feature comprises a maximum word frequency value of words in the sentence;
selecting at least two features from the first feature, the second feature and the third feature as candidate features;
and selecting the characteristic with the largest information gain to split the nodes of the decision tree according to the candidate characteristic and the characteristic selection formula to generate a random forest model.
Optionally, in some embodiments of the present application, before obtaining the text information of the N color value regions and the text information of the M game partitions in the current scene, the method further includes:
acquiring original text information of X1 color value areas;
when the absolute value of the difference value between X1 and M is larger than the preset threshold, selecting text information of X2 color value areas from the original text information of the X1 color value areas;
calculating to obtain new text information of X3 color value areas according to the text information of the X2 color value areas and a sample sampling formula;
determining that the sum of the new text information of the X3 color value regions and the original text information of the X1 color value regions is the text information of the N color value regions.
Optionally, in some embodiments of the present application, the calculating new text information of X3 color value regions according to the text information of X2 color value regions and a sample sampling formula includes:
determining neighboring text information of X3 color value areas according to the text information and Euclidean distance of the X2 color value areas;
and calculating to obtain new text information of the X3 color value areas according to the neighboring text information of the X3 color value areas and the sample sampling formula.
Optionally, in some embodiments of the present application, before obtaining the text information of the N color value regions and the text information of the M game partitions in the current scene, the method further includes:
acquiring original text information of Y1 game partitions;
when the absolute value of the difference value between Y1 and M is larger than the preset threshold value, selecting the text information of Y2 game partitions from the original text information of the Y1 game partitions;
calculating to obtain new text information of Y3 game partitions according to the text information of the Y2 game partitions and a sample sampling formula;
determining that the sum of the new text information of the Y3 game partitions and the original text information of the Y1 game partitions is the text information of the M game partitions.
Optionally, in some embodiments of the present application, the calculating new text information of Y3 game partitions according to the text information of Y2 game partitions and a sample sampling formula includes:
determining adjacent text information of Y3 game partitions according to the text information and Euclidean distance of the Y2 game partitions;
and calculating to obtain new text information of the Y3 game partitions according to the adjacent text information of the Y3 game partitions and the sample sampling formula.
Optionally, in some embodiments of the present application, the feature selection formula is:
[formula shown as an image in the original]
wherein G(A) denotes the information gain of attribute A, Split(A) denotes the information split amount of attribute A, T(F) denotes the degree of association between attribute A and the other attributes, F denotes the set of attributes other than A, and the adjustment coefficient takes a value in (0, 1).
Optionally, in some embodiments of the present application, the sample sampling formula is:
s_i = x_i + τ * max(0.1, |x_ij − x_i|),
wherein s_i denotes the i-th new sample, x_i denotes any sample of a minority class, x_ij denotes the j-th neighboring sample of x_i with 0 ≤ j ≤ N, N denotes the number of randomly selected neighboring samples, and the adjustment coefficient τ takes a value in (0, 1).
A second aspect of the present application provides a computing device, which may include:
the first acquisition module is used for acquiring text information of N color value areas and text information of M game subareas in a current scene, wherein N and M are integers larger than 0, and the absolute value of the difference value of N and M is smaller than a preset threshold value;
a first selection module, configured to select A pieces of text information from the text information of the N color value regions and the text information of the M game partitions, where each piece of text information in the A pieces of text information includes a first feature, a second feature, and a third feature, the first feature includes a sentence length, the second feature includes a maximum inverse text frequency index value of a word in a sentence, and the third feature includes a maximum word frequency value of a word in a sentence;
a second selection module for selecting at least two features from the first feature, the second feature and the third feature as candidate features;
and the generation module is used for selecting the characteristic with the largest information gain to split the nodes of the decision tree according to the candidate characteristic and the characteristic selection formula so as to generate a random forest model.
Optionally, in some embodiments of the present application, the computing apparatus may further include:
the second acquisition module is used for acquiring the original text information of the X1 color value areas;
a third selecting module, configured to select text information of X2 color value regions from the original text information of the X1 color value regions when an absolute value of a difference between X1 and M is greater than the preset threshold;
the calculation module is used for calculating to obtain new text information of X3 color value areas according to the text information of the X2 color value areas and a sample sampling formula;
a determining module, configured to determine that a sum of the new text information of the X3 color value regions and the original text information of the X1 color value regions is the text information of the N color value regions.
Alternatively, in some embodiments of the present application,
the calculation module is specifically configured to determine, according to the text information and euclidean distances of the X2 color value regions, neighboring text information of X3 color value regions; and calculating to obtain new text information of the X3 color value areas according to the neighboring text information of the X3 color value areas and the sample sampling formula.
Alternatively, in some embodiments of the present application,
the second acquisition module is used for acquiring the original text information of Y1 game partitions;
a third selecting module, configured to select text information of Y2 game partitions from the original text information of the Y1 game partitions when an absolute value of a difference between Y1 and M is greater than the preset threshold;
the calculation module is used for calculating new text information of Y3 game partitions according to the text information of the Y2 game partitions and a sample sampling formula;
a determining module, configured to determine that a sum of the new text information of the Y3 game partitions and the original text information of the Y1 game partitions is the text information of the M game partitions.
Alternatively, in some embodiments of the present application,
the calculation module is specifically configured to determine neighboring text information of Y3 game partitions according to the text information and euclidean distances of the Y2 game partitions; and calculating to obtain new text information of the Y3 game partitions according to the adjacent text information of the Y3 game partitions and the sample sampling formula.
Optionally, in some embodiments of the present application, the feature selection formula is:
[formula shown as an image in the original]
wherein G(A) denotes the information gain of attribute A, Split(A) denotes the information split amount of attribute A, T(F) denotes the degree of association between attribute A and the other attributes, F denotes the set of attributes other than A, and the adjustment coefficient takes a value in (0, 1).
Optionally, in some embodiments of the present application, the sample sampling formula is:
s_i = x_i + τ * max(0.1, |x_ij − x_i|),
wherein s_i denotes the i-th new sample, x_i denotes any sample of a minority class, x_ij denotes the j-th neighboring sample of x_i with 0 ≤ j ≤ N, N denotes the number of randomly selected neighboring samples, and the adjustment coefficient τ takes a value in (0, 1).
In a third aspect, an embodiment of the present invention provides a computing apparatus, including a memory and a processor, where the processor is configured to implement the steps of the text classification method described in the foregoing first aspect when executing a computer program stored in the memory.
In a fourth aspect, an embodiment of the present invention provides a readable storage medium, on which a computer program is stored, where the computer program, when executed by a processor, implements the steps of the text classification method as described in the foregoing first aspect embodiment.
According to the technical scheme, the embodiment of the application has the following advantages:
in the embodiment of the application, the text information of N color value areas and the text information of M game subareas in the current scene are obtained, wherein N and M are integers larger than 0, and the absolute value of the difference value between N and M is smaller than a preset threshold value; selecting A pieces of text information from the text information of the N color value areas and the text information of the M game areas, wherein each piece of text information in the A pieces of text information comprises a first feature, a second feature and a third feature, the first feature comprises a sentence length, the second feature comprises a maximum inverse text frequency index value of words in a sentence, and the third feature comprises a maximum word frequency value of words in the sentence; selecting at least two features from the first feature, the second feature and the third feature as candidate features; and selecting the characteristic with the largest information gain to split the nodes of the decision tree according to the candidate characteristic and the characteristic selection formula to generate a random forest model. The method is used for solving the problem of sample imbalance among different classes and the problem of feature screening, and can remarkably improve the text classification effect of the model.
Drawings
To more clearly illustrate the technical solutions of the embodiments of the present application, the drawings needed in the description of the embodiments and of the prior art are briefly introduced below. Obviously, the drawings in the following description show only some embodiments of the present application, and other drawings can be derived from them.
FIG. 1 is a schematic diagram of an embodiment of a text classification method in an embodiment of the present application;
FIG. 2 is a schematic diagram of an embodiment of a computing device in an embodiment of the present application;
FIG. 3 is a schematic diagram of another embodiment of a computing device in an embodiment of the present application;
FIG. 4 is a schematic diagram of another embodiment of a computing device in an embodiment of the present application;
FIG. 5 is a schematic diagram of another embodiment of a computer-readable storage medium in an embodiment of the present application.
Detailed Description
The embodiment of the application provides a text classification method and a computing device, which solve the problems of sample imbalance among classes and of feature screening, and can remarkably improve the text classification effect of a model.
To enable a person skilled in the art to better understand the present application, the technical solutions in the embodiments of the present application are described below with reference to the drawings in the embodiments of the present application. Obviously, the described embodiments are only some, not all, of the embodiments of the present application; all other embodiments obtained on the basis of the embodiments in the present application shall fall within the protection scope of the present application.
In the following, a brief description of the terms referred to in the present application is given, as follows:
random Forest algorithm (RF), in machine learning, a Random Forest is a classifier that contains multiple decision trees and whose output class is determined by the mode of the class output by the individual trees.
Each tree was built according to the following algorithm:
(1) Let N denote the number of training cases (samples) and M the number of features.
(2) Input a feature number m, used to determine the decision at a node of a decision tree; m should be much smaller than M.
(3) Sample N times from the N training cases with replacement to form a training set (i.e. bootstrap sampling), and use the cases that were not drawn to estimate the prediction error (out-of-bag estimation).
(4) For each node, randomly select m features; the decision at each node of the decision tree is determined based on these features, and the optimal split is calculated from the m features.
(5) Each tree grows to its full extent; the pruning that may be employed after building an ordinary tree classifier is not applied.
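Step (4) — drawing m random features at a node and picking the best split by information gain — can be sketched as follows (an illustrative sketch; feature values are assumed boolean, which matches the thresholded features used later in the text):

```python
import math
import random

def entropy(labels):
    """Shannon entropy (base 2) of a list of class labels."""
    n = len(labels)
    counts = {}
    for y in labels:
        counts[y] = counts.get(y, 0) + 1
    return -sum(c / n * math.log2(c / n) for c in counts.values())

def best_split(X, y, m, rng):
    """Randomly draw m features and return (feature, gain) for the one
    whose boolean split yields the largest information gain."""
    base, best = entropy(y), None
    for f in rng.sample(range(len(X[0])), m):
        left = [yi for xi, yi in zip(X, y) if xi[f]]
        right = [yi for xi, yi in zip(X, y) if not xi[f]]
        if not left or not right:
            continue  # this feature does not split the node
        cond = (len(left) * entropy(left) + len(right) * entropy(right)) / len(y)
        if best is None or base - cond > best[1]:
            best = (f, base - cond)
    return best

# Feature 0 separates the classes perfectly, so it wins with gain 1.0.
X = [[True, False], [True, True], [False, False], [False, True]]
y = [1, 1, 0, 0]
print(best_split(X, y, 2, random.Random(0)))
```

A full tree builder would recurse on the two child node subsets until each node is pure, per step (5).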
When a text classification task based on a random forest algorithm is performed, two problems commonly arise: 1. sample imbalance among classes biases the classification result toward the classes with more samples; 2. the choice of features determines both the execution speed and the final effect of the algorithm.
The present invention therefore improves on these two problems. The technical solution of the present application is further described below by way of an embodiment; as shown in FIG. 1, which illustrates a text classification method in an embodiment of the present application, the method may include:
101. acquiring text information of N color value areas and text information of M game areas in the current scene, wherein N and M are integers larger than 0, and the absolute value of the difference value of N and M is smaller than a preset threshold value.
In this embodiment of the application, before the obtaining of the text information of the N color value regions and the text information of the M game partitions in the current scene, the method may further include:
(1) acquiring original text information of X1 color value areas; when the absolute value of the difference value between X1 and M is larger than the preset threshold, selecting text information of X2 color value areas from the original text information of the X1 color value areas; calculating to obtain new text information of X3 color value areas according to the text information of the X2 color value areas and a sample sampling formula; determining that the sum of the new text information of the X3 color value regions and the original text information of the X1 color value regions is the text information of the N color value regions.
Or,
(2) acquiring original text information of Y1 game partitions; when the absolute value of the difference value between Y1 and M is larger than the preset threshold value, selecting the text information of Y2 game partitions from the original text information of the Y1 game partitions; calculating to obtain new text information of Y3 game partitions according to the text information of the Y2 game partitions and a sample sampling formula; determining that the sum of the new text information of the Y3 game partitions and the original text information of the Y1 game partitions is the text information of the M game partitions.
Optionally, the calculating to obtain new text information of X3 color value regions according to the text information of the X2 color value regions and the sample sampling formula may include:
determining neighboring text information of X3 color value areas according to the text information and Euclidean distance of the X2 color value areas; and calculating to obtain new text information of the X3 color value areas according to the neighboring text information of the X3 color value areas and the sample sampling formula.
Optionally, the calculating new text information of Y3 game partitions according to the text information of Y2 game partitions and the sample sampling formula may include:
determining adjacent text information of Y3 game partitions according to the text information and Euclidean distance of the Y2 game partitions; and calculating to obtain new text information of the Y3 game partitions according to the adjacent text information of the Y3 game partitions and the sample sampling formula.
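The neighbor-determination step in these claims — finding a sample's nearest neighbors by Euclidean distance in the feature space — can be sketched as follows (the function name and the choice of k are illustrative):

```python
import math

def nearest_neighbors(target, samples, k=5):
    """Return the k samples closest to `target` by Euclidean distance."""
    def dist(a, b):
        return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))
    return sorted(samples, key=lambda s: dist(target, s))[:k]

# The two samples nearest to the origin are returned, nearest first.
print(nearest_neighbors([0.0, 0.0], [[1.0, 0.0], [3.0, 0.0], [2.0, 0.0]], k=2))
```

In the method described here, `samples` would be the minority-class vectors in the TF-IDF (or word2vec) space, and k = 5 matches the example that follows.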
For example, the computing device may extract the text information (which may also be referred to as the corpus) of the color value partition and of the game partition from the bullet-screen (danmaku) library of a live-streaming page, for example: color value partition: 100,000 entries; game partition: 20,000 entries.
First, all corpora are segmented with the jieba word segmenter, stop words are filtered out, and the words are mapped into a 4-dimensional word2vec space. The text information of the game partition is then supplemented: 10,000 entries are taken at random as original samples; for each original sample, its 5 neighboring samples in the TF-IDF vector space are found by Euclidean distance; and 5 new samples can then be generated by transforming these 5 neighboring samples with the sample sampling formula.
Wherein the sample sampling formula is:
s_i = x_i + τ * max(0.1, |x_ij − x_i|)   (formula one),
wherein s_i denotes the i-th new sample, x_i denotes any sample of a minority class, x_ij denotes the j-th neighboring sample of x_i with 0 ≤ j ≤ N, N denotes the number of randomly selected neighboring samples, and the adjustment coefficient τ takes a value in (0, 1). It should be noted that formula one is a sampling formula for the class with fewer samples; its purpose is to add N samples so that the classes become balanced.
Assume s1: "I like to see Miss" → [0.212, 0.356, 0.254, 0.684]; the 5 neighboring samples of s1 can then be found:
s11 = [0.102, 0.254, 0.102, 0.631], …, s15;
then, using formula one, a new sample is generated from s11:
s' = s1 + 0.6 * |s11 − s1|
   = [0.212, 0.356, 0.254, 0.684] + 0.6 * [0.11, 0.102, 0.152, 0.053]
   = [0.278, 0.4172, 0.3452, 0.7158]
The new sample s' is then mapped back to new text by word2vec: "Miss will look nice".
Similarly, the computing device can obtain new samples from the other neighboring samples, and the number of game-partition samples is expanded to 70,000, i.e. 50,000 new samples plus the 20,000 original samples.
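The sample sampling formula above, applied per vector dimension, might look like the sketch below (the elementwise reading of the max(0.1, ·) floor is our assumption; the worked example in the text appears to skip the floor for differences below 0.1):

```python
def synth_sample(x, x_neighbor, tau=0.6):
    """Formula one: s = x + tau * max(0.1, |x_neighbor - x|), per dimension.
    tau is the adjustment coefficient in (0, 1)."""
    return [xi + tau * max(0.1, abs(xn - xi)) for xi, xn in zip(x, x_neighbor)]

# A minority-class sample and one of its neighbors, as in the s1/s11 example.
s1 = [0.212, 0.356, 0.254, 0.684]
s11 = [0.102, 0.254, 0.102, 0.631]
print(synth_sample(s1, s11))
```

This is the SMOTE-like idea of interpolating toward neighbors to synthesize minority-class samples; with the floor applied, the last dimension here becomes 0.684 + 0.6 × 0.1 = 0.744 rather than the 0.7158 of the floorless example.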
102. Selecting A pieces of text information from the text information of the N color value areas and the text information of the M game areas, wherein each piece of text information in the A pieces of text information comprises a first feature, a second feature and a third feature, the first feature comprises a sentence length, the second feature comprises a maximum inverse text frequency index value of words in the sentence, and the third feature comprises a maximum word frequency value of words in the sentence.
Illustratively, each decision tree in the random forest takes as its training set 20,000 corpus entries selected from the entire corpus with replacement.
Each sample has 3 features:
Feature A: whether the sentence length is greater than 5;
Feature B: whether the maximum inverse document frequency (IDF) value of the words in the sentence is greater than 200;
Feature C: whether the maximum term frequency (TF) value of the words in the sentence is greater than 30.
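The three thresholded features might be computed as follows (a sketch under the assumption that corpus-level TF and IDF lookup tables are precomputed; the names and thresholds follow the example in the text):

```python
def extract_features(tokens, idf, tf):
    """Features A/B/C from the text: sentence length > 5, max IDF of the
    sentence's words > 200, and max TF of the sentence's words > 30."""
    feature_a = len(tokens) > 5
    feature_b = max(idf.get(w, 0.0) for w in tokens) > 200
    feature_c = max(tf.get(w, 0) for w in tokens) > 30
    return (feature_a, feature_b, feature_c)

idf = {"rare_word": 250.0}    # illustrative corpus statistics
tf = {"common_word": 40}
print(extract_features(["rare_word", "common_word"], idf, tf))
```

Each segmented bullet-screen sentence would be reduced to such a boolean triple before being handed to the decision trees.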
103. Selecting at least two features from the first feature, the second feature, and the third feature as candidate features.
Illustratively, t features (t < 3) are selected as the candidate features of the decision tree.
104. And selecting the characteristic with the largest information gain to split the nodes of the decision tree according to the candidate characteristic and the characteristic selection formula to generate a random forest model.
Wherein the feature selection formula is:
[formula shown as an image in the original]
wherein G(A) denotes the information gain of attribute A, Split(A) denotes the information split amount of attribute A, T(F) denotes the degree of association between attribute A and the other attributes, F denotes the set of attributes other than A, and the adjustment coefficient takes a value in (0, 1). Formula two is the criterion by which the decision tree selects a feature for a node: the feature with the largest information gain ratio becomes the node of the current round.
Illustratively, the computing device uses formula two to select, from the candidate features, the feature with the largest information gain ratio to split the nodes of the decision tree; the value of t remains unchanged throughout the growth of the random forest.
Let t = 2 and let the first selected features be A and B; the total number of samples N is 20000,
the number of game-partition samples Ng is 8000, and the number of color-value-partition samples Nf is 12000;
A+ = 9000, of which 6000 belong to the color value partition and 3000 to the game partition;
A− = 11000, of which 6000 belong to the color value partition and 5000 to the game partition;
B+ = 5000, of which 3000 belong to the color value partition and 2000 to the game partition;
B− = 15000, of which 9000 belong to the color value partition and 6000 to the game partition.
The computing device can therefore find the information gain according to formula three.
Information gain G(A): G(A) = E(S) − E(S|A)   (formula three),
wherein E(S) denotes the entropy of the set S (see the entropy formula of the decision tree) and E(S|A) denotes the entropy after division by feature A (see the conditional entropy formula of the decision tree). Formula three is the information gain formula, cited from the random-forest literature to supplement the description of formula two.
Illustratively, by formula three: G(A) = E(S) − E(S|A);
[calculation of E(S) shown as an image in the original; E(S) = 0.292]
[calculation of E(S|A) shown as an image in the original; E(S|A) = 0.286]
Therefore, the information gain: G(A) = 0.292 − 0.286 = 0.006.
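The entropy figures above (shown as images in the original) are consistent with logarithms taken in base 10, which can be checked directly from the example's counts (with these counts E(S|A) works out to about 0.289 rather than the printed 0.286, presumably a rounding difference in the original):

```python
import math

def entropy10(counts):
    """Class entropy with base-10 logarithms, as the printed E(S) = 0.292 implies."""
    n = sum(counts)
    return -sum(c / n * math.log10(c / n) for c in counts if c)

E_S = entropy10([12000, 8000])                          # color-value vs game samples
E_S_given_A = (9000 / 20000) * entropy10([6000, 3000]) \
            + (11000 / 20000) * entropy10([6000, 5000])
print(round(E_S, 3), round(E_S - E_S_given_A, 3))       # E(S) and the gain G(A)
```

The same helper reproduces the split of the sample counts by feature B using `entropy10([5000, 15000])`.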
The computing device may find the information split amount according to formula four.
Information split amount Split(A):
[formula shown as an image in the original]
wherein n is the total number of samples divided by feature A, and a_j is the number of samples in category j when divided by feature A. Formula four is the information split amount, cited from the random-forest literature to supplement the description of formula two.
Illustratively, by formula four, the information split amount Split(A):
[calculation of Split(A) shown as an image in the original]
The computing device can thus find the value of the attribute association degree according to formula five.
Attribute association degree T(F):
[formula shown as an image in the original]
wherein n is the number of attributes other than attribute A, and H(F_i) denotes the entropy value of the i-th such attribute. Formula five expresses the degree of association between attributes: the smaller the association between attribute A and the other attributes, the larger the information gain ratio of attribute A.
Illustratively, since H(A) = E(S|A) and H(B) = E(S|B) = 0.203,
[calculation of T(F) shown as an image in the original]
then, calculating according to formula two:
[calculation of Gen(A) shown as an image in the original]
Gen(B) = 0.107.
Since Gen(B) > Gen(A), the B feature should be chosen to split the node in this decision tree.
It should be noted that steps 102–104 are executed in a loop to ensure that each decision tree in the random forest splits to the maximum extent; no pruning is required, and finally the random forest model is generated.
In the embodiment of the application, the text information of N color value regions and the text information of M game partitions in the current scene is obtained, wherein N and M are integers greater than 0 and the absolute value of the difference between N and M is smaller than a preset threshold; A pieces of text information are selected from the text information of the N color value regions and the text information of the M game partitions, wherein each piece of text information in the A pieces of text information includes a first feature, a second feature and a third feature, the first feature includes a sentence length, the second feature includes a maximum inverse text frequency index value of the words in the sentence, and the third feature includes a maximum word frequency value of the words in the sentence; at least two features are selected from the first feature, the second feature and the third feature as candidate features; and, according to the candidate features and the feature selection formula, the feature with the largest information gain is selected to split the nodes of the decision tree and generate a random forest model. This solves the problems of sample imbalance among classes and of feature screening, and can remarkably improve the text classification effect of the model.
As shown in fig. 2, fig. 2 is a schematic view of an embodiment of a computing apparatus in an embodiment of the present application, and may include:
a first obtaining module 201, configured to obtain text information of N color value regions and text information of M game partitions in a current scene, where N and M are integers greater than 0, and an absolute value of a difference between N and M is smaller than a preset threshold;
a first selection module 202, configured to select A pieces of text information from the text information of the N color value regions and the text information of the M game partitions, where each piece of text information in the A pieces of text information includes a first feature, a second feature, and a third feature, the first feature includes a sentence length, the second feature includes a maximum inverse text frequency index value of a word in the sentence, and the third feature includes a maximum word frequency value of a word in the sentence;
a second selection module 203, configured to select at least two features from the first feature, the second feature, and the third feature as candidate features;
and the generating module 204 is configured to select a feature with the largest information gain to split nodes of the decision tree according to the candidate feature and the feature selection formula, and generate a random forest model.
Optionally, in some embodiments of the present application, as shown in fig. 3, fig. 3 is a schematic diagram of an embodiment of a computing device in an embodiment of the present application, and the computing device may further include:
a second obtaining module 205, configured to obtain original text information of X1 color value regions;
a third selecting module 206, configured to select text information of X2 color value regions from the original text information of X1 color value regions when an absolute value of a difference between X1 and M is greater than a preset threshold;
the calculating module 207 is configured to calculate new text information of X3 color value regions according to the text information of the X2 color value regions and the sample sampling formula;
and the determining module 208 is used for determining that the sum of the new text information of the X3 color value areas and the original text information of the X1 color value areas is the text information of the N color value areas.
Alternatively, in some embodiments of the present application,
the calculating module 207 is specifically configured to determine neighboring text information of X3 color value regions according to the text information of the X2 color value regions and the Euclidean distance, and to calculate new text information of the X3 color value regions according to the neighboring text information of the X3 color value regions and the sample sampling formula.
Alternatively, in some embodiments of the present application,
a second obtaining module 205, configured to obtain original text information of Y1 game partitions;
a third selecting module 206, configured to select text information of Y2 game partitions from the original text information of Y1 game partitions when the absolute value of the difference between Y1 and M is greater than a preset threshold;
the calculating module 207 is used for calculating new text information of Y3 game partitions according to the text information of the Y2 game partitions and a sample sampling formula;
and the determining module 208 is used for determining the sum of the new text information of the Y3 game partitions and the original text information of the Y1 game partitions as the text information of the M game partitions.
Alternatively, in some embodiments of the present application,
the calculation module 207 is specifically configured to determine neighboring text information of Y3 game partitions according to the text information of the Y2 game partitions and the Euclidean distance, and to calculate new text information of the Y3 game partitions according to the neighboring text information of the Y3 game partitions and the sample sampling formula.
Optionally, in some embodiments of the present application, the feature selection formula is:
[The feature selection formula is rendered as image BDA0001819574390000121 in the original and is not recoverable from the text.]
wherein G(A) represents the information gain of attribute A, Split(A) represents the information division component of attribute A, T(F) represents the degree of association between attribute A and the attributes other than A, and F represents the set of attributes other than A,
[An auxiliary expression is rendered as image BDA0001819574390000131 in the original.] The adjustment coefficient τ takes a value in (0, 1).
Optionally, in some embodiments of the present application, the sample sampling formula is:
si = xi + τ·max(0.1, |xij − xi|),
where si represents the ith new sample, xi represents any sample of the minority class, xij represents the jth neighbor of xi with 0 ≤ j ≤ N, N represents the number of randomly selected neighbor samples, and the adjustment coefficient τ takes a value in (0, 1).
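A sketch of the oversampling step, reading the formula as generating a new sample s_i from a minority-class sample x_i and one of its Euclidean-distance neighbors x_ij. Applying the formula per dimension, and the choice of toy data, are assumptions for illustration:

```python
import math
import random

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def synthesize(x_i, neighbors, tau):
    """s_i = x_i + tau * max(0.1, |x_ij - x_i|), applied per dimension
    with one randomly chosen neighbor x_ij."""
    x_ij = random.choice(neighbors)
    return tuple(v + tau * max(0.1, abs(w - v)) for v, w in zip(x_i, x_ij))

# Minority-class samples; pick the two nearest neighbors of x_i by Euclidean distance.
minority = [(1.0, 2.0), (1.2, 2.1), (0.9, 1.8), (5.0, 5.0)]
x_i = minority[0]
neighbors = sorted(minority[1:], key=lambda p: euclidean(p, x_i))[:2]

random.seed(0)
tau = 0.5  # adjustment coefficient in (0, 1)
s_i = synthesize(x_i, neighbors, tau)
```

Because max(0.1, ·) bounds the offset away from zero, every synthesized coordinate moves at least τ·0.1 away from the original sample, avoiding exact duplicates.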
As shown in fig. 4, an embodiment of the present invention provides a computing apparatus, which includes a memory 410, a processor 420, and a computer program 411 stored in the memory 410 and executable on the processor 420; when the processor 420 executes the computer program 411, the following steps may be implemented:
acquiring text information of N color value areas and text information of M game areas in a current scene, wherein N and M are integers larger than 0, and the absolute value of the difference value of N and M is smaller than a preset threshold value;
selecting A pieces of text information from the text information of the N color value areas and the text information of the M game areas, wherein each piece of text information in the A pieces of text information comprises a first feature, a second feature and a third feature, the first feature comprises a sentence length, the second feature comprises a maximum inverse text frequency index value of words in a sentence, and the third feature comprises a maximum word frequency value of words in the sentence;
selecting at least two features from the first feature, the second feature and the third feature as candidate features;
and selecting the characteristic with the largest information gain to split the nodes of the decision tree according to the candidate characteristic and the characteristic selection formula to generate a random forest model.
Optionally, in some embodiments of the present application, the processor 420, when executing the computer program 411, may further implement the following steps:
acquiring original text information of X1 color value areas;
when the absolute value of the difference value between X1 and M is larger than the preset threshold, selecting text information of X2 color value areas from the original text information of the X1 color value areas;
calculating to obtain new text information of X3 color value areas according to the text information of the X2 color value areas and a sample sampling formula;
determining that the sum of the new text information of the X3 color value regions and the original text information of the X1 color value regions is the text information of the N color value regions.
Optionally, in some embodiments of the present application, the processor 420, when executing the computer program 411, may further implement the following steps:
determining neighboring text information of X3 color value areas according to the text information and Euclidean distance of the X2 color value areas;
and calculating to obtain new text information of the X3 color value areas according to the neighboring text information of the X3 color value areas and the sample sampling formula.
Optionally, in some embodiments of the present application, the processor 420, when executing the computer program 411, may further implement the following steps:
acquiring original text information of Y1 game partitions;
when the absolute value of the difference value between Y1 and M is larger than the preset threshold value, selecting the text information of Y2 game partitions from the original text information of the Y1 game partitions;
calculating to obtain new text information of Y3 game partitions according to the text information of the Y2 game partitions and a sample sampling formula;
determining that the sum of the new text information of the Y3 game partitions and the original text information of the Y1 game partitions is the text information of the M game partitions.
Optionally, in some embodiments of the present application, the processor 420, when executing the computer program 411, may further implement the following steps:
determining adjacent text information of Y3 game partitions according to the text information and Euclidean distance of the Y2 game partitions;
and calculating to obtain new text information of the Y3 game partitions according to the adjacent text information of the Y3 game partitions and the sample sampling formula.
Optionally, in some embodiments of the present application, the feature selection formula is:
[The feature selection formula is rendered as image BDA0001819574390000141 in the original and is not recoverable from the text.]
wherein G(A) represents the information gain of attribute A, Split(A) represents the information division component of attribute A, T(F) represents the degree of association between attribute A and the attributes other than A, and F represents the set of attributes other than A,
[An auxiliary expression is rendered as image BDA0001819574390000142 in the original.] The adjustment coefficient τ takes a value in (0, 1).
Optionally, in some embodiments of the present application, the sample sampling formula is:
si = xi + τ·max(0.1, |xij − xi|),
where si represents the ith new sample, xi represents any sample of the minority class, xij represents the jth neighbor of xi with 0 ≤ j ≤ N, N represents the number of randomly selected neighbor samples, and the adjustment coefficient τ takes a value in (0, 1).
Referring to fig. 5, fig. 5 is a schematic diagram illustrating an embodiment of a computer-readable storage medium according to the present invention.
As shown in fig. 5, the present embodiment provides a computer-readable storage medium, on which a computer program 511 is stored, and the computer program 511, when executed by a processor, can implement the following steps:
acquiring text information of N color value areas and text information of M game areas in a current scene, wherein N and M are integers larger than 0, and the absolute value of the difference value of N and M is smaller than a preset threshold value;
selecting A pieces of text information from the text information of the N color value areas and the text information of the M game areas, wherein each piece of text information in the A pieces of text information comprises a first feature, a second feature and a third feature, the first feature comprises a sentence length, the second feature comprises a maximum inverse text frequency index value of words in a sentence, and the third feature comprises a maximum word frequency value of words in the sentence;
selecting at least two features from the first feature, the second feature and the third feature as candidate features;
and selecting the characteristic with the largest information gain to split the nodes of the decision tree according to the candidate characteristic and the characteristic selection formula to generate a random forest model.
Optionally, in some embodiments of the present application, the computer program 511, when executed by the processor, may further implement the following steps:
acquiring original text information of X1 color value areas;
when the absolute value of the difference value between X1 and M is larger than the preset threshold, selecting text information of X2 color value areas from the original text information of the X1 color value areas;
calculating to obtain new text information of X3 color value areas according to the text information of the X2 color value areas and a sample sampling formula;
determining that the sum of the new text information of the X3 color value regions and the original text information of the X1 color value regions is the text information of the N color value regions.
Optionally, in some embodiments of the present application, the computer program 511, when executed by the processor, may further implement the following steps:
determining neighboring text information of X3 color value areas according to the text information and Euclidean distance of the X2 color value areas;
and calculating to obtain new text information of the X3 color value areas according to the neighboring text information of the X3 color value areas and the sample sampling formula.
Optionally, in some embodiments of the present application, the computer program 511, when executed by the processor, may further implement the following steps:
acquiring original text information of Y1 game partitions;
when the absolute value of the difference value between Y1 and M is larger than the preset threshold value, selecting the text information of Y2 game partitions from the original text information of the Y1 game partitions;
calculating to obtain new text information of Y3 game partitions according to the text information of the Y2 game partitions and a sample sampling formula;
determining that the sum of the new text information of the Y3 game partitions and the original text information of the Y1 game partitions is the text information of the M game partitions.
Optionally, in some embodiments of the present application, the computer program 511, when executed by the processor, may further implement the following steps:
determining adjacent text information of Y3 game partitions according to the text information and Euclidean distance of the Y2 game partitions;
and calculating to obtain new text information of the Y3 game partitions according to the adjacent text information of the Y3 game partitions and the sample sampling formula.
It is clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the above-described systems, apparatuses and units may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
In the several embodiments provided in the present application, it should be understood that the disclosed system, apparatus and method may be implemented in other manners. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the units is only one logical division, and other divisions may be realized in practice, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present application may be substantially implemented or contributed to by the prior art, or all or part of the technical solution may be embodied in a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present application. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and other various media capable of storing program codes.
The above embodiments are only used for illustrating the technical solutions of the present application, and not for limiting the same; although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions in the embodiments of the present application.

Claims (9)

1. A method of text classification, comprising:
acquiring text information of N color value areas and text information of M game areas in a current scene, wherein N and M are integers larger than 0, and the absolute value of the difference value of N and M is smaller than a preset threshold value;
selecting A pieces of text information from the text information of the N color value areas and the text information of the M game areas, wherein each piece of text information in the A pieces of text information comprises a first feature, a second feature and a third feature, the first feature comprises a sentence length, the second feature comprises a maximum inverse text frequency index value of words in a sentence, and the third feature comprises a maximum word frequency value of words in the sentence;
selecting at least two features from the first feature, the second feature and the third feature as candidate features;
selecting a characteristic with the largest information gain to split the nodes of the decision tree according to the candidate characteristic and the characteristic selection formula to generate a random forest model;
the feature selection formula is as follows:
[The feature selection formula is rendered as image FDA0002805708050000011 in the original and is not recoverable from the text.]
wherein G(A) represents the information gain of attribute A, Split(A) represents the information division component of attribute A, T(F) represents the degree of association between attribute A and the attributes other than A, and F represents the set of attributes other than A,
[An auxiliary expression is rendered as image FDA0002805708050000012 in the original.] The adjustment coefficient τ takes a value in (0, 1).
2. The method of claim 1, wherein before obtaining the text information of the N color value regions and the text information of the M game partitions in the current scene, the method further comprises:
acquiring original text information of X1 color value areas;
when the absolute value of the difference value between X1 and M is larger than the preset threshold, selecting text information of X2 color value areas from the original text information of the X1 color value areas;
calculating to obtain new text information of X3 color value areas according to the text information of the X2 color value areas and a sample sampling formula;
determining that the sum of the new text information of the X3 color value regions and the original text information of the X1 color value regions is the text information of the N color value regions.
3. The method according to claim 2, wherein calculating new text information of X3 color value regions according to the text information of the X2 color value regions and a sample sampling formula comprises:
determining neighboring text information of X3 color value areas according to the text information and Euclidean distance of the X2 color value areas;
and calculating to obtain new text information of the X3 color value areas according to the neighboring text information of the X3 color value areas and the sample sampling formula.
4. The method of claim 1, wherein before obtaining the text information of the N color value regions and the text information of the M game partitions in the current scene, the method further comprises:
acquiring original text information of Y1 game partitions;
when the absolute value of the difference value between Y1 and M is larger than the preset threshold value, selecting the text information of Y2 game partitions from the original text information of the Y1 game partitions;
calculating to obtain new text information of Y3 game partitions according to the text information of the Y2 game partitions and a sample sampling formula;
determining that the sum of the new text information of the Y3 game partitions and the original text information of the Y1 game partitions is the text information of the M game partitions.
5. The method of claim 4, wherein calculating new text information of Y3 game partitions according to the text information of the Y2 game partitions and a sample sampling formula comprises:
determining adjacent text information of Y3 game partitions according to the text information and Euclidean distance of the Y2 game partitions;
and calculating to obtain new text information of the Y3 game partitions according to the adjacent text information of the Y3 game partitions and the sample sampling formula.
6. The method according to any of claims 2-5, wherein the sample sampling formula is:
si = xi + τ·max(0.1, |xij − xi|),
where si represents the ith new sample, xi represents any sample of the minority class, xij represents the jth neighbor of xi with 0 ≤ j ≤ N, N represents the number of randomly selected neighbor samples, and the adjustment coefficient τ takes a value in (0, 1).
7. A computing device, comprising:
the first acquisition module is used for acquiring text information of N color value areas and text information of M game subareas in a current scene, wherein N and M are integers larger than 0, and the absolute value of the difference value of N and M is smaller than a preset threshold value;
a first selection module, configured to select A pieces of text information from the text information of the N color value regions and the text information of the M game partitions, where each piece of text information in the A pieces of text information includes a first feature, a second feature, and a third feature, the first feature includes a sentence length, the second feature includes a maximum inverse text frequency index value of a word in the sentence, and the third feature includes a maximum word frequency value of a word in the sentence;
a second selection module for selecting at least two features from the first feature, the second feature and the third feature as candidate features;
the generation module is used for selecting a characteristic with the largest information gain to split nodes of the decision tree according to the candidate characteristic and the characteristic selection formula so as to generate a random forest model;
the feature selection formula is as follows:
[The feature selection formula is rendered as image FDA0002805708050000031 in the original and is not recoverable from the text.]
wherein G(A) represents the information gain of attribute A, Split(A) represents the information division component of attribute A, T(F) represents the degree of association between attribute A and the attributes other than A, and F represents the set of attributes other than A,
[An auxiliary expression is rendered as image FDA0002805708050000032 in the original.] The adjustment coefficient τ takes a value in (0, 1).
8. A computing device comprising a processor for implementing the steps of the text classification method according to any one of claims 1 to 6 when executing a computer program stored in a memory.
9. A readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the steps of the text classification method according to any one of claims 1 to 6.
CN201811158905.6A 2018-09-30 2018-09-30 Text classification method and computing device Active CN109284382B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811158905.6A CN109284382B (en) 2018-09-30 2018-09-30 Text classification method and computing device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811158905.6A CN109284382B (en) 2018-09-30 2018-09-30 Text classification method and computing device

Publications (2)

Publication Number Publication Date
CN109284382A CN109284382A (en) 2019-01-29
CN109284382B true CN109284382B (en) 2021-05-28

Family

ID=65182189

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811158905.6A Active CN109284382B (en) 2018-09-30 2018-09-30 Text classification method and computing device

Country Status (1)

Country Link
CN (1) CN109284382B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110390400B (en) * 2019-07-02 2023-07-14 北京三快在线科技有限公司 Feature generation method and device of computing model, electronic equipment and storage medium

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103473231A (en) * 2012-06-06 2013-12-25 深圳先进技术研究院 Classifier building method and system
CN107292186A (en) * 2016-03-31 2017-10-24 阿里巴巴集团控股有限公司 A kind of model training method and device based on random forest
CN107357895A (en) * 2017-01-05 2017-11-17 大连理工大学 A kind of processing method of the text representation based on bag of words

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102141978A (en) * 2010-02-02 2011-08-03 阿里巴巴集团控股有限公司 Method and system for classifying texts

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103473231A (en) * 2012-06-06 2013-12-25 深圳先进技术研究院 Classifier building method and system
CN107292186A (en) * 2016-03-31 2017-10-24 阿里巴巴集团控股有限公司 A kind of model training method and device based on random forest
CN107357895A (en) * 2017-01-05 2017-11-17 大连理工大学 A kind of processing method of the text representation based on bag of words

Also Published As

Publication number Publication date
CN109284382A (en) 2019-01-29

Similar Documents

Publication Publication Date Title
CN109325132A (en) Expertise recommended method, device, computer equipment and storage medium
CN110472043B (en) Clustering method and device for comment text
CN111767403A (en) Text classification method and device
KR101623860B1 (en) Method for calculating similarity between document elements
CN112231468B (en) Information generation method, device, electronic equipment and storage medium
CN113761105A (en) Text data processing method, device, equipment and medium
CN105631749A (en) User portrait calculation method based on statistical data
CN111198946A (en) Network news hotspot mining method and device
CN115470344A (en) Video barrage and comment theme fusion method based on text clustering
CN109284382B (en) Text classification method and computing device
CN109299463B (en) Emotion score calculation method and related equipment
JP6446987B2 (en) Video selection device, video selection method, video selection program, feature amount generation device, feature amount generation method, and feature amount generation program
CN111125547A (en) Knowledge community discovery method based on complex network
CN110162769B (en) Text theme output method and device, storage medium and electronic device
US20210312333A1 (en) Semantic relationship learning device, semantic relationship learning method, and storage medium storing semantic relationship learning program
CN111339778B (en) Text processing method, device, storage medium and processor
CN111556375B (en) Video barrage generation method and device, computer equipment and storage medium
CN113656451B (en) Data mining method, electronic device, and computer-readable storage medium
CN109657079A (en) A kind of Image Description Methods and terminal device
CN111753050B (en) Topic map-based comment generation
CN103324653A (en) Main point extraction device and main point extraction method
WO2014117296A1 (en) Generating a hint for a query
KR102117281B1 (en) Method for generating chatbot utterance using frequency table
Eichinger Reviews are gold!? on the link between item reviews and item preferences
CN112765329A (en) Method and system for discovering key nodes of social network

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant