CN108647325A - Text classification system capable of avoiding over-fitting - Google Patents
Text classification system capable of avoiding over-fitting
- Publication number
- CN108647325A CN108647325A CN201810447545.5A CN201810447545A CN108647325A CN 108647325 A CN108647325 A CN 108647325A CN 201810447545 A CN201810447545 A CN 201810447545A CN 108647325 A CN108647325 A CN 108647325A
- Authority
- CN
- China
- Prior art keywords
- training sample
- weight
- training
- iteration
- over-fitting
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/241—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
Landscapes
- Engineering & Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Theoretical Computer Science (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Bioinformatics & Computational Biology (AREA)
- Artificial Intelligence (AREA)
- Evolutionary Biology (AREA)
- Evolutionary Computation (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Life Sciences & Earth Sciences (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention relates to a text classification system capable of avoiding over-fitting, comprising an over-fitting rate judgment module and a text classification module: the over-fitting rate judgment module judges the severity of over-fitting, and the text classification module is responsible for classifying text. The method gives a quantitative description of the over-fitting problem through the parameter RO, the over-fitting rate, which describes whether over-fitting occurs and how severe it is, and it also proposes a new method for improving classification accuracy under over-fitting.
Description
Technical field
The invention relates to the field of machine learning, and in particular to a text classification system capable of avoiding over-fitting.
Background technology
Over-fitting is now a serious problem in machine learning. It arises because, in practice, most text classification methods use the vector space model: each document is treated as a bag of words and each word as a feature. This inevitably produces redundant feature vectors and a high-dimensional sparse matrix. Pursuing accuracy on the training set too aggressively increases the complexity of the model, which then fails to learn the underlying trend and instead learns only the non-predictive features of the training data. On data it has not seen, its expressive power deteriorates, and the model becomes over-fitted.
Invention content
In view of this, the invention provides a text classification system capable of avoiding over-fitting that solves, or partly solves, the above problems.
To achieve the above effect, the technical scheme is as follows: a text classification system capable of avoiding over-fitting, comprising:
an over-fitting rate judgment module and a text classification module;
The over-fitting rate judgment module judges the severity of over-fitting using the over-fitting rate RO, which is defined as follows:
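The published definition survives only as an image; a minimal reconstruction consistent with the sign behaviour described below (RO < 0 when the test error rises with u, RO = 0 when it is flat, RO > 0 when it falls) is:

$$RO = -\frac{\partial o}{\partial u}$$

with the training error z entering through the compensation-term discussion that follows.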
Here o denotes the test error, z the training error, and u the scale of the training set or the number of iterations. The sign of the over-fitting rate is determined by the change in the test error and is opposite to it. When the over-fitting rate is less than zero, the test error o increases as the training-set scale or the number of iterations grows, and the smaller the value of RO, the more severe the over-fitting; if the test error o must be reduced further, a compensation term is added to increase the training error z and thereby avoid over-fitting. When the over-fitting rate equals 0, the test error no longer changes as the training set grows. When the over-fitting rate is greater than zero, the test error decreases as the training scale increases; as long as the rate stays positive, over-fitting no longer occurs. The training scale here refers to the training sample set.
The text classification module is responsible for classifying text and works in two steps: the first is a preprocessing step; the second is the specific processing step.
In the preprocessing step, the input and output of the text classification module are defined. The input is the training sample set D = {(x_1, y_1), ..., (x_i, y_i), ..., (x_m, y_m)}, where D is the training sample set and the (x_i, y_i) are training samples; X = {x_1, ..., x_i, ..., x_m} are the training sample points and Y = {y_1, ..., y_i, ..., y_m} are the class labels of the training sample set; x is a training sample point, y is the class label of a training sample point, used after the classifier as the classification result of the sample point, and m is the number of training sample points. A training sample point and the class label with the same index correspond to each other, i.e. the classification result of the i-th training sample point x_i is y_i, i = 1, ..., m. The output of the text classification module is the classifier

$$H(x) = \arg\max_{k} \sum_{t=1}^{T} \beta^{(t)} I\!\left(h^{(t)}(x) = k\right)$$

where H(x) is the classification result of the classifier H for training sample point x; β^{(t)} denotes the weight of the weak classifier h at the t-th iteration, t indicates being in the t-th iteration, and β denotes the weight of the weak classifier h; the function I expresses a qualifying relation that restricts the term before it, outputting the value of that term only when the condition inside the brackets of I holds; h^{(t)}(x) denotes the classification result of weak classifier h for training sample point x at the t-th iteration; k is the hypothesised class label over which the classification results are compared; T, a positive integer, is the maximum number of iterations. The classifier adds up the weights of all T iterations and outputs the class k for which the sum is largest; this is the classification result of the classifier.
The specific processing step proceeds as follows:
1) Initialize the weight of every training sample to w_i^{(1)} = 1/m, where w_i^{(1)} is the weight of the i-th training sample x_i in the training sample set at the 1st iteration; i is a positive integer, the superscript (1) denotes the 1st iteration, and m is the number of training sample points.
2) Start iterating. The total number of iterations is set to T, a positive integer; the iteration index is t, a positive integer indicating the t-th iteration, t = 1, ..., T.
2.a) Take the parameter λ and the training sample set D as input and select a subset D′ of D; the number of training sample points to select is c = m × λ, where 0 < λ ≤ 1. The detailed process treats the training sample set D = {(x_1, y_1), ..., (x_m, y_m)} as the candidate set and selects one training sample point from the candidate set at a time until c points have been selected, c a positive integer. The selection steps are:
First, initialize n = 0; n is the selection counter, used to count the training samples already selected.
i. Assign each training sample (x_i, y_i) a selection weight q_i, i = 1, ..., m; the selection weights are the cumulative sums of the sample weights, q_i = q_{i-1} + w_i, with initial value q_0 = 0, where q is any selection weight, j is a positive integer, and w_j is the weight of the j-th training sample point x_j.
ii. Generate a random number p equal to rand(0, q_m), i.e. a number drawn at random between 0 and q_m and assigned to p. Use the selection weights q to partition an interval set; the intervals it contains are [q_0, q_1], [q_1, q_2], [q_2, q_3], ..., [q_i, q_{i+1}], ..., [q_{m-1}, q_m], and the interval [q_i, q_{i+1}] corresponds to training sample (x_{i+1}, y_{i+1}); q_i is the i-th selection weight.
iii. Determine which interval of the interval set the random number p falls in, take the selected training sample out of the training sample set D, and put it into the subset D′. Set two integers l and r, with l initially 0 and r initially equal to m − 1, corresponding respectively to the intervals [q_0, q_1] and [q_{m-1}, q_m] of the interval set, and maintaining q_l ≤ p and q_r > p. Loop 1 takes the midpoint U = ⌊(l + r)/2⌋; if q_U ≤ p, set l = U + 1, otherwise set r = U; then check whether l equals r, exiting loop 1 if they are equal and repeating loop 1 if they are not. Here q_l is the l-th selection weight and q_r the r-th selection weight.
On exiting loop 1, remove the interval [q_l, q_{l+1}] from the interval set, add (x_l, y_l) to the subset D′ of training samples, set n = n + 1, and return to step ii.
When the value of n equals c, the subset D′ of the training sample set D has been fully selected.
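As a sketch only (none of this code appears in the patent, and names such as select_subset are illustrative), the whole of step 2.a can be written with cumulative weights, a uniform draw, and a binary search:

```python
import bisect
import random

def select_subset(samples, weights, lam, rng=random.Random(0)):
    """Weighted sampling without replacement, as in step 2.a.

    samples: list of (x, y) pairs; weights: per-sample weights w_i;
    lam: fraction of the training set to select (0 < lam <= 1).
    """
    c = int(len(samples) * lam)            # number of points to select
    pool = list(range(len(samples)))       # indices whose intervals remain
    subset = []
    for _ in range(c):
        # q_i are cumulative sums of the remaining weights (q_0 = 0)
        q, total = [], 0.0
        for i in pool:
            total += weights[i]
            q.append(total)
        p = rng.uniform(0.0, total)        # p = rand(0, q_m)
        l = min(bisect.bisect_left(q, p), len(pool) - 1)  # loop 1 as bisect
        subset.append(samples[pool[l]])    # put (x_l, y_l) into D'
        pool.pop(l)                        # remove [q_l, q_{l+1}] from the set
    return subset
```

Recomputing the cumulative sums after each removal plays the role of removing the selected interval from the interval set, and the midpoint search of loop 1 is exactly what bisect performs.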
2.b) Take the subset D′ of the training sample set D and the weight w_i^{(t)} of the i-th training sample point x_i at the t-th iteration as input parameters and classify. The iterative classification is h^{(t)} = L(D′, w^{(t)}), where L denotes learning and training on the subset D′ of the training sample set D with the weights w_i^{(t)}, and h^{(t)} is the classification result of weak classifier h at the t-th iteration. The text classification module uses the ELM model as the learning model of the weak classifier h. The error e^{(t)} of the t-th iteration's classification is:
Here h^{(t)}(x_i) denotes the result of classifying training sample point x_i with weak classifier h at the t-th iteration; h^{(t)}(x_i) ≠ y_i means the classification result of the i-th training sample point x_i is not equal to its class label y_i, equivalently h^{(t)}(x_i) = ȳ_i, where ȳ_i denotes any classification result other than the class label y_i. For every sample point satisfying h^{(t)}(x_i) ≠ y_i (or h^{(t)}(x_i) = ȳ_i), its weight w_i^{(t)} at the t-th iteration is accumulated; the sum of these weights, taken as a proportion of the total weight, is calculated as the error of the t-th iteration. A sample point x_i that was not classified by the classifier h during round t is by default treated as ȳ_i, i.e. as not carrying any of the class labels y_i.
2.c) Compute the weight β^{(t)} of weak classifier h at the t-th iteration by the following formula:
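This formula is also published only as an image. Because β^{(t)} is later used as the exponent of e on misclassified points, it matches the multi-class AdaBoost (SAMME) weighting; a plausible reconstruction, with K denoting the number of classes (K is an assumption, not a symbol used in the text), is:

$$\beta^{(t)} = \ln\frac{1 - e^{(t)}}{e^{(t)}} + \ln(K - 1)$$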
In every iteration, the weight β^{(t)} of weak classifier h at the t-th iteration is computed by the above formula; β denotes the weight of the weak classifier h.
2.d) Readjust the weights of the training samples; the adjustment formula is as follows:
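A reconstruction of the readjustment formula, consistent with the description below (β^{(t)} taken as the exponent of e and the resulting factor multiplied with the current weight of each misclassified point):

$$w_i^{(t+1)} = w_i^{(t)} \cdot \exp\!\left(\beta^{(t)}\, I\!\left(h^{(t)}(x_i) \neq y_i\right)\right)$$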
According to the value of the weight β^{(t)} of weak classifier h at the t-th iteration from the previous step, the weights are adjusted. h^{(t)}(x_i) denotes the class label assigned to the i-th sample point x_i by weak classifier h at the t-th iteration; x_i denotes the i-th sample point and y_i the i-th class label. For each sample point satisfying h^{(t)}(x_i) ≠ y_i (or h^{(t)}(x_i) = ȳ_i), the weight β^{(t)} of the t-th iteration is used as the exponent of e, and the resulting factor is multiplied with w_i^{(t)} to give the weight for the next iteration; w_i^{(t)} is the weight of the i-th training sample at the t-th iteration and w_i^{(t+1)} the weight of the i-th training sample at the (t+1)-th iteration.
2.e) Standardize the next round's training sample weights w_i^{(t+1)}; the standardization formula is as follows:
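Reconstructed on the usual assumption that standardization rescales the weights to sum to one:

$$w_i^{(t+1)} \leftarrow \frac{w_i^{(t+1)}}{\sum_{j=1}^{m} w_j^{(t+1)}}$$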
Set t = t + 1; when t equals T, exit the iteration.
Finally, the combined classifier H(x) = argmax_k Σ_{t=1}^{T} β^{(t)} I(h^{(t)}(x) = k) is computed.
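The whole of the specific processing step can be condensed into a short sketch (again not code from the patent: train_elm and select_subset stand in for the ELM base learner and the step-2.a sampler, and the SAMME-style β^{(t)} with n_classes is the assumption noted above):

```python
import math

def adabelm_fit(samples, T, lam, n_classes, train_elm, select_subset):
    """Boosting loop of steps 1)-2.e); returns the weak classifiers and betas."""
    m = len(samples)
    w = [1.0 / m] * m                            # step 1: uniform initial weights
    classifiers, betas = [], []
    for t in range(T):                           # step 2: T iterations
        subset = select_subset(samples, w, lam)  # step 2.a: sample subset D'
        h = train_elm(subset, w)                 # step 2.b: ELM weak learner
        wrong = [i for i, (x, y) in enumerate(samples) if h(x) != y]
        err = sum(w[i] for i in wrong) / sum(w)  # weighted error e^(t)
        if err <= 0.0 or err >= 1.0:
            break                                # degenerate round: stop early
        beta = math.log((1 - err) / err) + math.log(n_classes - 1)  # step 2.c
        for i in wrong:                          # step 2.d: boost misclassified
            w[i] *= math.exp(beta)
        s = sum(w)                               # step 2.e: standardize weights
        w = [wi / s for wi in w]
        classifiers.append(h)
        betas.append(beta)
    return classifiers, betas

def adabelm_predict(x, classifiers, betas, n_classes):
    """H(x) = argmax_k sum_t beta^(t) * I(h^(t)(x) = k)."""
    votes = [0.0] * n_classes
    for h, beta in zip(classifiers, betas):
        votes[h(x)] += beta
    return max(range(n_classes), key=votes.__getitem__)
```

Here train_elm is assumed to return a callable mapping a sample point to a class index.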
The beneficial effects of the invention are as follows: the invention provides a text classification system capable of avoiding over-fitting; the method gives a quantitative description of the over-fitting problem through the parameter RO, the over-fitting rate, which describes whether over-fitting occurs and how severe it is, and it also proposes a new method for improving classification accuracy under over-fitting.
Specific embodiments
To make the technical problems to be solved, the technical solutions, and the advantages of the invention clearer, the invention is described in detail below with reference to embodiments. It should be noted that the specific embodiments described here only explain the invention and are not intended to limit it; any product that can realize the described functions is an equivalent replacement or improvement and falls within the protection scope of the invention. The specific method is as follows:
Embodiment 1: In general, text classification methods use the vector space model. The vector space model contains multiple bags of words, and each bag of words contains multiple words; a bag of words represents a document, and a word represents a feature. The vector space model leads to the generation of redundant vectors and produces a highly sparse matrix, and both the redundant vectors and the sparse matrix cause over-fitting.
The over-fitting problem can be interpreted as follows: raising the accuracy on the training set increases the complexity of the learning model, so that the learning model merely records the non-predictive features in the training data; when the learning model processes new data, it performs poorly.
Deep neural networks, as complex machine learning models containing very large numbers of parameters, easily fall into over-fitting.
The usual remedies for the over-fitting problem are feature dimensionality reduction, regularization, and randomly dropping neurons (dropout); until now, a quantitative description of the over-fitting problem has been lacking.
The invention provides such a quantitative description of the over-fitting problem, introducing the over-fitting rate parameter, which describes whether over-fitting occurs and its severity; the invention also designs a new algorithm, AdaBELM, to avoid the over-fitting problem and improve classification accuracy.
The over-fitting rate is defined as given above, where RO denotes the over-fitting rate, o the test error, z the training error, and u the scale of the training set or the number of iterations. The sign of the over-fitting rate is determined by the change in the test error and is opposite to it.
When the over-fitting rate is less than zero, the test error increases as the scale of the training set or the number of iterations grows; the smaller the value of the over-fitting rate, the more severe the over-fitting.
Under the premise of avoiding the over-fitting problem, a compensation term is added to increase the training error z, thereby reducing the test error o.
When the over-fitting rate is greater than zero, the test error decreases as the training scale increases.
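As a sketch, the over-fitting rate of this embodiment can be estimated numerically from a recorded test-error curve as the negative local slope with respect to the training scale or iteration count (this finite-difference form is an interpretation of the definition above, not code from the patent):

```python
def overfitting_rate(test_errors, scales):
    """Estimate RO as -(delta o / delta u) between consecutive measurements.

    test_errors: test error o recorded at each training scale or iteration count u
    scales:      the corresponding values of u
    Returns one RO estimate per interval; RO < 0 signals worsening over-fitting.
    """
    return [-(o2 - o1) / (u2 - u1)
            for o1, o2, u1, u2 in zip(test_errors, test_errors[1:],
                                      scales, scales[1:])]

# A test error that rises with the training scale yields RO < 0 (over-fitting):
print(overfitting_rate([0.10, 0.12, 0.15], [1000, 2000, 3000]))
```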
The AdaBoost algorithm was proposed by Freund and Schapire, building on the boosting question posed by Kearns and Valiant. The detailed process is as follows:
At the start, every sample in the training sample set P = {(x_1, y_1), ..., (x_m, y_m)} is given a weight of the same size; this is the initial weight distribution D_1. Weight assignment is repeated: the t-th weight distribution is D_t, a weak classifier h_t: X → Y is generated from the training sample set and the distribution D_t, and the error rate of the weak classifier h_t is also computed. Here X = {x_1, ..., x_m} are the sample points, Y = {y_1, ..., y_m} are the class labels of the training sample set with y_1, ..., y_m ∈ {0, 1, ..., C}, and m, a positive integer, is the number of sample points. The weights of samples that were not classified correctly are increased and the weights of correctly classified samples are decreased; the value of D_{t+1} for the next cycle is updated accordingly, and a new base classifier is generated from the training sample set and D_{t+1}. The above process repeats for T cycles in total, T a positive integer, and the final model is obtained by a weighted vote of the T weak learners. The purpose of the algorithm is to learn a series of weak classifiers, or base classifiers, from the training sample set and then combine them into one strong classifier. Described in steps, the procedure is:
Preprocessing defines the input and the output of the algorithm. The input of the algorithm is the training sample set D = {(x_1, y_1), ..., (x_m, y_m)} and the number of iterations T; the output of the algorithm is the classifier H(x) = argmax_k Σ_{t=1}^{T} β^{(t)} I(h^{(t)}(x) = k).
1) Initialize the weight of every training sample to w_i^{(1)} = 1/m; w_i^{(1)} is the weight of the i-th training sample (x_i, y_i) in the training sample set, each training sample being assigned the same weight; the superscript 1 denotes the initial training sample weights.
2) Carry out T iterations; let the index of the current iteration be t, t = 1, ..., T.
2.a) From the weight distribution D_t and the weights w_i^{(t)}, obtain the weak classifier h^{(t)} = L(D_t, w^{(t)}), where L denotes learning and training on the training sample set with the weights w_i^{(t)} and the weight distribution D_t, and h^{(t)} is the weak classifier; the weak classifier may, for instance, classify a training sample x as -1 or 1. The classification error e^{(t)} is the sum of the weights of the training samples the weak classifier misclassifies,

$$e^{(t)} = \sum_{i=1}^{m} w_i^{(t)}\, I\!\left(h^{(t)}(x_i) \neq y_i\right)$$

where the function I restricts which weights w_i^{(t)} are counted; h^{(t)}(x_i) denotes the class label assigned to x_i by the weak classifier h at the t-th iteration, x_i denotes the i-th sample point, and y_i the i-th class label.
2.b) Compute the weight β^{(t)} of the weak classifier; in round t of the iteration, β^{(t)} is computed from the error e^{(t)} by the same formula as in the specific processing step above.
2.c) Readjust the weights of the training samples. According to the value of the weight β^{(t)} of the weak classifier from the previous step, the weights are adjusted: h^{(t)}(x_i) denotes the class label assigned to x_i by weak classifier h at the t-th iteration, x_i denotes the i-th sample point, and y_i the i-th class label. For training samples whose classification result is incorrect, the weight β^{(t)} of the weak classifier is used as the exponent of e, and the factor e^{β^{(t)}} is multiplied with w_i^{(t)} to give the weight for the next iteration; w_i^{(t)} is the weight of the i-th training sample at the t-th iteration and w_i^{(t+1)} the weight of the i-th training sample at the (t+1)-th iteration.
Finally, standardize the next round's training sample weights w_i^{(t+1)} so that they sum to one, as in the specific processing step above.
Assembling the classifiers gives the new classifier H(x) = argmax_k Σ_{t=1}^{T} β^{(t)} I(h^{(t)}(x) = k): the value of the argument k that maximizes the function is sought.
Embodiment 2: the mechanism of the extreme learning machine (ELM) model.
ELM is a single-hidden-layer feedforward network that trains quickly on the training set; the network has only three layers: the input layer, the hidden layer, and the output layer. Suppose there are N arbitrary samples (x_i, t_i), where x_i = [x_{i1}, x_{i2}, ..., x_{in}]^T ∈ R^n and t_i = [t_{i1}, t_{i2}, ..., t_{im}]^T ∈ R^m. A single-hidden-layer neural network with L hidden nodes can be expressed as
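The expression is published as an image; the standard ELM form it describes, consistent with the symbols defined next, is:

$$\sum_{i=1}^{L} \beta_i\, g(W_i \cdot x_j + b_i) = o_j, \qquad j = 1, \ldots, N$$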
where g = g(x) is the activation function, W_i = [w_{i,1}, w_{i,2}, ..., w_{i,n}]^T is the input weight vector, β_i is the output weight, b_i is the bias of the i-th hidden unit, and W_i · x_j denotes the inner product of W_i and x_j.
The goal of single-hidden-layer network learning is to minimize the output error, which can be expressed as: there exist β_i, W_i, and b_i such that

$$\sum_{j=1}^{N} \lVert o_j - t_j \rVert = 0, \qquad \text{i.e.} \quad \sum_{i=1}^{L} \beta_i\, g(W_i \cdot x_j + b_i) = t_j, \quad j = 1, \ldots, N,$$

which can be written in matrix form as H β = T, where H is the output matrix of the hidden nodes, β is the output weight, and T is the desired output.
To train the single-hidden-layer network, we wish to obtain Ŵ_i, b̂_i, and β̂ such that

$$\lVert H(\hat{W}_i, \hat{b}_i)\,\hat{\beta} - T \rVert = \min_{W_i,\, b_i,\, \beta}\, \lVert H(W_i, b_i)\,\beta - T \rVert, \qquad i = 1, \ldots, L,$$

which is equivalent to minimizing the loss function
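The loss itself appears as an image; in the standard ELM formulation it is:

$$E = \sum_{j=1}^{N} \left( \sum_{i=1}^{L} \beta_i\, g(W_i \cdot x_j + b_i) - t_j \right)^{2}$$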
Some traditional algorithms based on gradient descent can be used to solve this problem, but gradient-based learning algorithms need to adjust all parameters during the iterations. In the ELM algorithm, once the input weights W_i and the hidden-layer biases b_i are determined at random, the output matrix H of the hidden layer is uniquely determined. Training the single-hidden-layer network is then converted into solving the linear system H β = T, and the output weight β is determined as β̂ = H⁺ T, where H⁺ is the Moore-Penrose generalized inverse of the matrix H. It can be proved that the norm of the solution β̂ so obtained is minimal and that the solution is unique.
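A minimal numpy sketch of this training procedure (random input weights and biases, hidden output matrix H, output weights via the Moore-Penrose pseudoinverse; the layer sizes and the sigmoid activation are illustrative choices, not prescribed by the patent):

```python
import numpy as np

def elm_train(X, T, L, rng=None):
    """X: (N, n) inputs; T: (N, m) targets; L: number of hidden nodes."""
    rng = rng or np.random.default_rng(0)
    n = X.shape[1]
    W = rng.standard_normal((L, n))              # random input weights W_i
    b = rng.standard_normal(L)                   # random hidden-layer biases b_i
    H = 1.0 / (1.0 + np.exp(-(X @ W.T + b)))     # hidden-layer output matrix H
    beta = np.linalg.pinv(H) @ T                 # beta = H+ T, minimum-norm solution
    return W, b, beta

def elm_predict(X, W, b, beta):
    H = 1.0 / (1.0 + np.exp(-(X @ W.T + b)))
    return H @ beta                              # network output H beta
```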
The invention avoids the over-fitting problem and obtains an effective text classification method: it uses the extreme learning machine as the base classifier, trains each base classifier on a training subset, and uses the parameter λ to control the scale of the training set, where 0 < λ ≤ 1. Specifically, when 0 < λ < 1, a subset of the training samples is selected as the training set of each base classifier; when λ = 1, the entire training set is used. As for the selection of the training subset: data that were not selected have their probability of being selected in the next round's cycle increased. The invention can thus not only handle big-data problems effectively but also avoid the over-fitting problem.
The method is divided into two steps: the first is a preprocessing step; the second is the specific processing step.
In the preprocessing step, the input and output of the method are defined. The input is the training sample set D = {(x_1, y_1), ..., (x_m, y_m)}, where D is the training sample set and (x_1, y_1), ..., (x_m, y_m) are training samples; X = {x_1, ..., x_m} are the sample points and Y = {y_1, ..., y_m} are the class labels of the training sample set, with y_1, ..., y_m ∈ {0, 1, ..., C}; m is the number of sample points and training samples, x is a sample point, and y is a class label, used after the classifier as the classification result of the sample point; a sample point and the class label with the same index correspond to each other. The output of the algorithm is the classifier

$$H(x) = \arg\max_{k} \sum_{t=1}^{T} \beta^{(t)} I\!\left(h^{(t)}(x) = k\right)$$

where H(x) is the classifier; β^{(t)} denotes the weight of the weak classifier at the t-th iteration, t indicates being in the t-th iteration, and β denotes the weight of the weak classifier; the function I expresses a qualifying relation restricting the term before it, outputting the value of that term only when the condition inside the brackets of I holds: for instance, β^{(t)} I(h^{(t)}(x) = k) outputs the weight β^{(t)} only for terms satisfying h^{(t)}(x) = k. h^{(t)}(x) denotes the classifier's classification result for sample point x at the t-th iteration, k is the hypothesised class label over which the classification results are compared, and T, a positive integer, is the maximum number of iterations. The classifier adds up the weights of all T iterations and outputs the class k for which the sum is largest; this is the classification result of the classifier.
The specific processing step proceeds as follows:
1) Initialize the weight of every training sample to w_i^{(1)} = 1/m; w_i^{(1)} is the weight of the i-th training sample in the training sample set, i is a positive integer, the superscript (1) denotes the 1st iteration, and m is the number of sample points.
2) Start iterating; the total number of iterations is set to T. The iteration index is t, a positive integer indicating the t-th round of iteration, t = 1, ..., T.
2.a) Take the parameter λ and the training sample set D as input and select a subset of D; the number of training sample points to select is c = m × λ, 0 < λ ≤ 1. The detailed process treats the training sample set D = {(x_1, y_1), ..., (x_m, y_m)} as the candidate set and selects one training sample point from the candidate set at a time until c points have been selected, c a positive integer. The selection steps are: initialize the integer n = 0; n is the selection counter, used to count the training samples already selected.
i. Assign each training sample point (x_i, y_i) a selection weight q_i, i = 1, ..., m; the selection weights are the cumulative sums of the sample weights, q_i = q_{i-1} + w_i, with initial value q_0 = 0.
ii. Generate a random number p equal to rand(0, q_m), i.e. a number drawn at random between 0 and q_m. Partition the interval set using the selection weights q_i; the intervals it contains are [q_0, q_1], [q_1, q_2], [q_2, q_3], ..., [q_i, q_{i+1}], ..., [q_{m-1}, q_m], and the interval [q_i, q_{i+1}] corresponds to training sample point (x_{i+1}, y_{i+1}). When the random number p falls in [q_i, q_{i+1}], that training sample point is selected, the interval [q_i, q_{i+1}] is removed from the interval set, n = n + 1, and the process returns to step ii to draw the next sample point.
iii. To determine which interval of the interval set the random number p belongs to and take the corresponding training sample out of the training sample set D into the subset D′, set two integers l and r, with l initially 0 and r initially equal to m − 1, corresponding respectively to the intervals [q_0, q_1] and [q_{m-1}, q_m] of the interval set, and maintaining q_l ≤ p and q_r > p. Loop 1 takes the midpoint U = ⌊(l + r)/2⌋; if q_U ≤ p, set l = U + 1, otherwise set r = U; check whether the values of l and r are equal, exiting loop 1 if they are equal and continuing loop 1 if they are not.
Then remove the interval [q_l, q_{l+1}] from the interval set, add (x_l, y_l) to the subset of training samples, set n = n + 1, and return to step ii.
When the value of n equals c, the subset of training samples has been fully selected.
2.b) Take the subset D′ of training samples and the weights w_i^{(t)} as input parameters and classify. The iterative classification is h^{(t)} = L(D′, w^{(t)}), where L denotes learning and training on the subset with the weights w_i^{(t)}, and h^{(t)} is the classification result of weak classifier h at the t-th iteration. The method uses the ELM model as the learning model of the weak classifier h. The error e^{(t)} of the t-th iteration's classification is computed by the same formula as in the specific processing step above.
Here h^{(t)}(x_i) denotes the result of classifying sample point x_i with weak classifier h at the t-th iteration; h^{(t)}(x_i) ≠ y_i means the classification result of the i-th sample point x_i is not equal to the corresponding class label y_i, equivalently h^{(t)}(x_i) = ȳ_i, where ȳ_i denotes any classification result other than the class label. For every sample point satisfying h^{(t)}(x_i) ≠ y_i (or h^{(t)}(x_i) = ȳ_i), its weight w_i^{(t)} at the t-th iteration is accumulated, and the sum of these weights, taken as a proportion of the total weight, is calculated as the error of the t-th iteration. A sample point x_i that was not classified by the classifier h during round t is by default treated as ȳ_i.
2.c) Compute the weight β^{(t)} of weak classifier h at the t-th iteration; in every round of iteration, β^{(t)} is computed by the same formula as in the specific processing step above; β denotes the weight of the weak classifier h.
2.d) Readjust the weights of the training samples. According to the value of the weight β^{(t)} of weak classifier h at the t-th iteration from the previous step, the weights are adjusted: h^{(t)}(x_i) denotes the class label assigned to the i-th sample point x_i by weak classifier h at the t-th iteration, x_i denotes the i-th sample point, and y_i the i-th class label. For each sample point satisfying h^{(t)}(x_i) ≠ y_i (or h^{(t)}(x_i) = ȳ_i), the weight β^{(t)} of the t-th iteration is used as the exponent of e, and the resulting factor is multiplied with w_i^{(t)} to give the weight of the training sample for the next iteration; w_i^{(t)} is the weight of the i-th training sample at the t-th iteration and w_i^{(t+1)} the weight of the i-th training sample at the (t+1)-th iteration.
2.e) Standardize the next round's training sample weights w_i^{(t+1)} by the same standardization formula as in the specific processing step above.
The beneficial effects of the invention are as follows: the invention provides a text classification system capable of avoiding over-fitting; the method gives a quantitative description of the over-fitting problem through the parameter RO, the over-fitting rate, which describes whether over-fitting occurs and how severe it is, and it also proposes a new method for improving classification accuracy under over-fitting.
Although preferred embodiments of the invention have been described, those skilled in the art, once they learn of the basic inventive concept, may make additional changes and modifications to these embodiments; the appended claims are therefore intended to be interpreted as including the preferred embodiments and all changes and modifications that fall within the scope of the invention.
Obviously, those skilled in the art can make various modifications and variations to the embodiments of the invention without departing from the spirit and scope of the embodiments of the invention. If these modifications and variations fall within the scope of the claims of the invention and their technical equivalents, the invention is intended to include them as well.
Claims (1)
1. A text classification system capable of avoiding over-fitting, characterized in that it comprises the following:
the text classification system comprises an over-fitting rate judgment module and a text classification module;
the over-fitting rate judgment module is used to judge the severity of over-fitting and provides the text classification module with the means of judging the over-fitting rate; the over-fitting rate judgment module judges using the over-fitting rate RO, defined as above, where o denotes the test error, z the training error, and u the scale of the training set or the number of iterations; the sign of the over-fitting rate is determined by the change in the test error and is opposite to it; when the over-fitting rate is less than zero, the test error o increases as the training-set scale or the number of iterations grows, and the smaller the value of the over-fitting rate RO, the more severe the over-fitting; if the test error o must be reduced further, a compensation term is added to increase the training error z and thereby avoid over-fitting; when the over-fitting rate equals 0, the test error no longer changes as the training set grows further;
when the over-fitting rate is greater than zero, the test error decreases as the training scale increases; while the over-fitting rate is positive, the over-fitting problem no longer occurs; the training set here is the training sample set;
the text classification module is responsible for classifying text and works in two steps: the first is a preprocessing step; the second is the specific processing step;
in the preprocessing step, the text classification module defines its input and output; the input of the text classification module is the training sample set D = {(x_1, y_1), ..., (x_i, y_i), ..., (x_m, y_m)}, where D is the training sample set and the (x_i, y_i) are training samples; X = {x_1, ..., x_i, ..., x_m} are the training sample points and Y = {y_1, ..., y_i, ..., y_m} are the class labels of the training sample set; x is a training sample point and y is the class label of a training sample point, used after the classifier as the classification result of the sample point; m is the number of training sample points, computed after the user inputs the training sample set; a training sample point and the class label with the same index correspond one to one, i.e. the classification result of the i-th training sample point x_i is y_i, i = 1, ..., m; the output of the text classification module is the classifier

$$H(x) = \arg\max_{k} \sum_{t=1}^{T} \beta^{(t)} I\!\left(h^{(t)}(x) = k\right)$$

where H(x) is the classification result of the classifier H for training sample point x; β^{(t)} denotes the weight of the weak classifier h at the t-th iteration, t indicates being in the t-th iteration, and β denotes the weight of the weak classifier h; the function I expresses a qualifying relation that restricts the term immediately before it, outputting the value of that term only when the condition inside the brackets of I holds; h^{(t)}(x) denotes the classification result of weak classifier h for training sample point x at the t-th iteration; k is the hypothesised class label over which the classification results are compared; T, a positive integer, is the maximum number of iterations; the classifier adds up the weights of all T iterations and outputs the class k for which the sum is largest, which is the classification result of the classifier;
in the specific processing step, the text classification module executes the following:
1) initialize the weight of every training sample point to w_i^{(1)} = 1/m; w_i^{(1)} is the weight of the i-th training sample point x_i in the training sample set at the 1st iteration; i is a positive integer denoting the index of the training sample point in the training sample set, the superscript (1) indicates that the training sample is at the 1st iteration, and m denotes the number of training sample points in the training sample set;
2) the training sample set begins the iteration; the total number of iterations is set to T, a positive integer; the iteration index is t, a positive integer indicating the t-th iteration, t = 1, ..., T;
2.a) the text classification module takes the parameter λ and the training sample set D as input and selects a subset D′ of D; the number of training sample points to select is c = m × λ, i.e. the subset D′ contains c training sample points, where 0 < λ ≤ 1; the detailed process treats the training sample set D = {(x_1, y_1), ..., (x_m, y_m)} as the candidate set and selects one training sample point from the candidate set at a time until c points have been selected, c a positive integer; the selection steps are:
first, initialize n = 0; n is the selection counter, used to count the training samples already selected;
i. the text classification module assigns each training sample (x_i, y_i) a selection weight q_i, i = 1, ..., m; the selection weights are the cumulative sums of the sample weights, q_i = q_{i-1} + w_i, with initial value q_0 = 0, where q is any selection weight, j is a positive integer, and w_j is the weight of the j-th training sample point x_j;
ii. the text classification module generates a random number p equal to rand(0, q_m), i.e. a number drawn at random between 0 and q_m and assigned to p; then the selection weights q are used to partition an interval set; the intervals it contains are [q_0, q_1], [q_1, q_2], [q_2, q_3], ..., [q_i, q_{i+1}], ..., [q_{m-1}, q_m], and the interval [q_i, q_{i+1}] corresponds to training sample (x_{i+1}, y_{i+1}); q_i is the i-th selection weight and q_m denotes the m-th selection weight;
iii. the text classification module determines which interval of the interval set the random number p belongs to, takes the selected training sample out of the training sample set D, and puts it into the subset D′ of the training sample set; two integers l and r are set, with l initially 0 and r initially equal to m − 1, corresponding respectively to the intervals [q_0, q_1] and [q_{m-1}, q_m] of the interval set, and maintaining q_l ≤ p and q_r > p; the text classification module runs loop 1, which takes the midpoint U = ⌊(l + r)/2⌋; if q_U ≤ p, set l = U + 1, otherwise set r = U; check whether l equals r, exiting loop 1 if they are equal and continuing loop 1 if they are not; here q_l is the l-th selection weight and q_r the r-th selection weight;
on exiting loop 1, the interval [q_l, q_{l+1}] is removed from the interval set, (x_l, y_l) is added to the subset D′ of training samples, n = n + 1, and the process returns to step ii;
when the value of n equals c, the subset D′ of the training sample set D has been fully selected;
2.b) the text classification module takes the subset D′ of the training sample set D and the weight w_i^{(t)} of the i-th training sample point x_i at the t-th iteration as input parameters and classifies; the iterative classification is h^{(t)} = L(D′, w^{(t)}), where L denotes learning and training on the subset D′ of the training sample set D with the weights w_i^{(t)}, and h^{(t)} is the classification result of weak classifier h at the t-th iteration; the text classification module uses the ELM model as the learning model of the weak classifier h; the error e^{(t)} of the t-th iteration's classification is:

$$e^{(t)} = \frac{\sum_{i=1}^{m} w_i^{(t)}\, I\!\left(h^{(t)}(x_i) \neq y_i\right)}{\sum_{i=1}^{m} w_i^{(t)}}$$
h^{(t)}(x_i) denotes the result of classifying training sample point x_i with weak classifier h at the t-th iteration; h^{(t)}(x_i) ≠ y_i means the classification result of the i-th training sample point x_i is not equal to the corresponding class label y_i, equivalently h^{(t)}(x_i) = ȳ_i, where ȳ_i denotes any classification result other than the class label; for every sample point satisfying h^{(t)}(x_i) ≠ y_i (or h^{(t)}(x_i) = ȳ_i), the weight w_i^{(t)} of the i-th sample point at the t-th iteration is accumulated, and e^{(t)} is the sum of these weights taken as a proportion of the total weight, calculated as the error of the t-th iteration; ȳ_i denotes any classification result that is not the i-th class label y_i; a sample point x_i that is taken as ȳ_i by default was not classified by the classifier h during round t;
2.c) the text classification module computes the weight β^{(t)} of weak classifier h at the t-th iteration; in every iteration, β^{(t)} is computed by the same formula as in the description above; β denotes the weight of the weak classifier h;
2.d) the text classification module readjusts the weights of the training samples; according to the value of the weight β^{(t)} of weak classifier h at the t-th iteration from the previous step, the value of w_i^{(t)} is adjusted; h^{(t)}(x_i) denotes the class label assigned to the i-th sample point x_i by weak classifier h at the t-th iteration, x_i denotes the i-th sample point, and y_i the i-th class label; for each sample point satisfying h^{(t)}(x_i) ≠ y_i (or h^{(t)}(x_i) = ȳ_i), the weight β^{(t)} of the t-th iteration is used as the exponent of e, and the resulting factor is multiplied with w_i^{(t)} to give the weight of the training sample for the next iteration, w_i^{(t+1)} = w_i^{(t)} · e^{β^{(t)}}; w_i^{(t)} is the weight of the i-th training sample at the t-th iteration and w_i^{(t+1)} the weight of the i-th training sample at the (t+1)-th iteration;
2.e) finally, the text classification module standardizes the next round's training sample weights w_i^{(t+1)}; the standardization divides each weight by the sum of all weights, w_i^{(t+1)} ← w_i^{(t+1)} / Σ_j w_j^{(t+1)};
t = t + 1; when t equals T, the iteration is exited;
finally, the classifier H(x) = argmax_k Σ_{t=1}^{T} β^{(t)} I(h^{(t)}(x) = k) is computed according to the above steps.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title
---|---|---|---
CN201810447545.5A | 2018-05-11 | 2018-05-11 | A kind of Text Classification System of avoidable over-fitting
Publications (1)
Publication Number | Publication Date |
---|---|
CN108647325A true CN108647325A (en) | 2018-10-12 |
Family
ID=63754547
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201810447545.5A Pending CN108647325A (en) | 2018-05-11 | 2018-05-11 | A kind of Text Classification System of avoidable over-fitting |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN108647325A (en) |
- 2018-05-11: Application CN201810447545.5A filed in China (CN108647325A, status Pending)
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109597888A (en) * | 2018-11-19 | 2019-04-09 | 北京百度网讯科技有限公司 | Establish the method, apparatus of text field identification model |
CN112836051A (en) * | 2021-02-19 | 2021-05-25 | 太极计算机股份有限公司 | Online self-learning court electronic file text classification method |
CN112836051B (en) * | 2021-02-19 | 2024-03-26 | 太极计算机股份有限公司 | Online self-learning court electronic file text classification method |
Legal Events
Date | Code | Title | Description
---|---|---|---
| PB01 | Publication |
| SE01 | Entry into force of request for substantive examination |
| RJ01 | Rejection of invention patent application after publication | Application publication date: 20181012