
CN108647325A - Text classification system capable of avoiding overfitting - Google Patents

Text classification system capable of avoiding overfitting

Info

Publication number
CN108647325A
CN108647325A
Authority
CN
China
Prior art keywords
training sample
weight
training
iteration
over-fitting
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201810447545.5A
Other languages
Chinese (zh)
Inventor
丰小月 (Feng Xiaoyue)
丰超 (Feng Chao)
时小虎 (Shi Xiaohu)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Jilin University
Original Assignee
Jilin University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Jilin University filed Critical Jilin University
Priority to CN201810447545.5A priority Critical patent/CN108647325A/en
Publication of CN108647325A publication Critical patent/CN108647325A/en
Pending legal-status Critical Current


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present invention relates to a text classification system capable of avoiding overfitting, comprising an overfitting-rate judgment module and a text classification module. The overfitting-rate judgment module is used to judge the severity of overfitting, and the text classification module is responsible for classifying text. The method of the present invention gives a quantitative description of the overfitting problem through the parameter RO, the rate of overfitting, which describes whether the overfitting problem occurs and how severe it is; the method also proposes a new approach aimed at improving classification accuracy under the overfitting problem.

Description

Text classification system capable of avoiding overfitting
Technical field
The present invention relates to the field of machine learning, and in particular to a text classification system capable of avoiding overfitting.
Background technology
Overfitting is currently a very serious problem in machine learning. One reason it arises is that, in actual use, most text classification methods use the vector space model: each document is treated as a bag of words, and each word serves as a feature. This inevitably produces redundant feature vectors and yields a high-dimensional sparse matrix. Moreover, pursuing accuracy on the training set too aggressively increases the complexity of the model, so that instead of learning the underlying trend it merely memorizes non-predictive features of the training data. On data it has not seen, its expressive power deteriorates; this is overfitting.
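By way of illustration, the following minimal Python sketch (the library choice and example documents are ours, not the patent's) shows how the vector space model turns even a handful of documents into a high-dimensional sparse matrix:

    from sklearn.feature_extraction.text import CountVectorizer

    docs = [
        "the cat sat on the mat",
        "overfitting hurts text classification",
        "bag of words models treat each word as a feature",
    ]

    # Each document becomes a bag-of-words vector; every distinct word is a feature.
    vectorizer = CountVectorizer()
    X = vectorizer.fit_transform(docs)  # SciPy sparse matrix, shape (n_docs, n_words)

    print(X.shape)                            # dimensionality grows with the vocabulary
    print(X.nnz / (X.shape[0] * X.shape[1]))  # low density: the matrix is sparse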
Invention content
In view of this, the present invention provides a text classification system capable of avoiding overfitting that solves, or partly solves, the above problems.
To achieve the above technical effect, the technical scheme of the present invention is: a text classification system capable of avoiding overfitting, comprising:
The text classification system comprises an overfitting-rate judgment module and a text classification module;
The overfitting-rate judgment module is used to judge the severity of overfitting, using the rate of overfitting RO for the judgment.
The rate of overfitting RO is defined as follows:

RO = -∂o/∂u

where o denotes the test error, z denotes the training error, and u denotes the scale of the training set or the number of iterations. The sign of the rate of overfitting is determined by the test error, and is opposite to the change in the test error. When the rate of overfitting is less than zero, the test error o increases as the scale of the training set or the number of iterations increases; the smaller the value of the rate of overfitting RO, the more severe the overfitting. If the test error o needs to be reduced further, a compensation term is added to increase the training error z and thereby avoid overfitting. When the rate of overfitting equals 0, the test error no longer changes as the training set grows;
A rate of overfitting greater than zero indicates that the test error decreases as the training scale increases; as long as the rate of overfitting remains positive, overfitting no longer occurs. The training set here refers to the training sample set;
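By way of illustration, a minimal Python sketch of how the overfitting-rate judgment module could estimate RO numerically from measured error curves; the finite-difference form and the function name are illustrative assumptions:

    import numpy as np

    def rate_of_overfitting(test_error, u):
        """Estimate RO = -d(test error)/du by finite differences, where u is
        the training-set scale or iteration count at each measurement."""
        test_error = np.asarray(test_error, dtype=float)
        u = np.asarray(u, dtype=float)
        return -np.diff(test_error) / np.diff(u)

    # Test error falls, then rises: RO turns negative once overfitting sets in.
    u = [100, 200, 300, 400, 500]
    o = [0.30, 0.25, 0.22, 0.24, 0.27]
    print(rate_of_overfitting(o, u))  # [ 0.0005  0.0003 -0.0002 -0.0003]

A negative RO thus flags the point at which further training data or iterations start to hurt test performance.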
The text classification module is responsible for classifying text and works in two steps: the first step, a preprocessing process, and the second step, the specific processing procedure;
In the preprocessing process, the input and output of the text classification module are defined. The input of the text classification module is the training sample set D = {(x_1, y_1), ..., (x_i, y_i), ..., (x_m, y_m)}, where D is the training sample set, (x_1, y_1), ..., (x_i, y_i), ..., (x_m, y_m) are the training samples, X = {x_1, ..., x_i, ..., x_m} are the training sample points, and Y = {y_1, ..., y_i, ..., y_m} are the class labels of the training sample set; x is a training sample point and y is the class label of a training sample point, serving as the classification result for the sample point after the classifier is applied; m is the number of training sample points. A training sample point and a class label with the same index correspond to each other, i.e., the classification result of the i-th training sample point x_i is y_i, i = 1, ..., m. The output of the text classification module is the classifier

H(x) = arg max_k Σ_{t=1}^{T} β^(t) · I(h^(t)(x) = k)

where H(x) is the classification result of classifier H on training sample point x; β^(t) denotes the weight of the weak classifier h in the t-th iteration; t indicates being in the t-th iteration; β denotes the weight of the weak classifier h; the function I expresses a qualifying relation that restricts the function appearing before it, outputting the value of the preceding function only when the condition inside its brackets is satisfied; h^(t)(x) denotes the result of the weak classifier h classifying training sample point x in the t-th iteration; k denotes a candidate class label over which the classification results are temporarily accumulated; and T denotes the maximum number of iterations, a positive integer. The arg max adds up the weights of the classifiers over all T iterations and chooses the class k that makes this sum reach its maximum, which gives the classification result of the classifier;
The specific processing procedure is as follows:
1) The weight of each training sample is initialized to w_i^(1) = 1/m; w_i^(1) is the weight of the i-th training sample x_i in the training sample set at the 1st iteration, i is a positive integer, the superscript (1) denotes the 1st iteration, and m denotes the number of training sample points;
2) The iteration starts; the total number of iterations is set to T, T a positive integer; the iteration parameter is t, a positive integer, indicating the t-th iteration, t = 1, ..., T;
2.a) Input the parameter λ and the training sample set D, and select a subset D' of D; the number of selected training sample points is c = m × λ, where 0 < λ ≤ 1. The detailed process is: take the training sample set D = {(x_1, y_1), ..., (x_m, y_m)} as the candidate set, and select one training sample point from the candidate set at a time, until c points have been selected, c a positive integer. The selection steps are (a code sketch of this selection follows the steps below):
First, initialize n = 0; n is the selection counter, used to count the number of training samples already selected, and is a positive integer;
i. Each training sample (x_i, y_i) is assigned a selection weight q_i, i = 1, ..., m, with q_i = w_1 + w_2 + ... + w_i; the initial value of the selection weights is q_0 = 0, where q is any selection weight, j is a positive integer, and w_j is the weight of the j-th training sample point x_j;
ii. A random number p is generated, equal to rand(0, q_m), i.e., a number between 0 and q_m is generated at random and assigned to p. The selection weights q partition an interval set; the intervals contained in the set are [q_0, q_1], [q_1, q_2], [q_2, q_3], ..., [q_i, q_{i+1}], ..., [q_{m-1}, q_m], and the interval [q_i, q_{i+1}] corresponds to training sample (x_{i+1}, y_{i+1}); q_i is the i-th selection weight;
iii. Determine which interval of the interval set the random number p belongs to, and recall the selected training sample from the training sample set D into the subset D' of the training sample set. Set two integers l and r; l is initially 0 and r is initially m − 1, corresponding respectively to the intervals [q_0, q_1] and [q_{m-1}, q_m] of the interval set, and satisfying q_l ≤ p and q_r > p. Loop 1 proceeds by taking the midpoint U = ⌊(l + r)/2⌋; if q_U ≤ p then l = U + 1, otherwise r = U; check whether l equals r; if they are equal, exit loop 1, and if not, continue loop 1; q_l is the l-th selection weight and q_r is the r-th selection weight;
When loop 1 is exited, the interval [q_l, q_{l+1}] is removed from the interval set, (x_l, y_l) is added to the subset D' of training samples, n = n + 1, and the procedure returns to step ii;
When the value of n equals c, the selection of the subset D' of training sample set D is finished;
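By way of illustration, a minimal Python sketch of this weighted subset selection under the cumulative-weight reading of q_i above; the helper name select_subset is ours, and recomputing the cumulative weights after each draw is an illustrative simplification of the interval-removal step:

    import random
    from bisect import bisect_right

    def select_subset(samples, weights, lam):
        """Select c = round(m * lam) samples without replacement, each draw
        proportional to the sample weights (roulette-wheel selection)."""
        m = len(samples)
        c = max(1, round(m * lam))
        pool = list(range(m))
        subset = []
        for _ in range(c):
            # q_i are the cumulative weights of the remaining candidates.
            q, total = [], 0.0
            for idx in pool:
                total += weights[idx]
                q.append(total)
            p = random.uniform(0.0, total)              # p = rand(0, q_m)
            j = min(bisect_right(q, p), len(pool) - 1)  # find p's interval
            subset.append(samples[pool[j]])
            pool.pop(j)                                 # remove the chosen interval
        return subset

Here bisect_right performs the same job as loop 1: a binary search over the sorted cumulative weights for the interval containing p.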
2.b) The subset D' of the training sample set D and the weight w_i^(t) of the i-th training sample point x_i at the t-th iteration are taken as input parameters for classification; the iterative classification is h^(t) = L(D', w^(t)), where L denotes learning training carried out on the subset D' of the training sample set D with the weight values w_i^(t), and h^(t) is the classification result of weak classifier h at the t-th iteration. The text classification module uses the ELM model as the learning training model of the weak classifier h. The error e^(t) of the t-th iteration's classification is:

e^(t) = [ Σ_{i: h^(t)(x_i) ≠ y_i or h^(t)(x_i) = ŷ} w_i^(t) ] / [ Σ_{i=1}^{m} w_i^(t) ]

where h^(t)(x_i) denotes the classification result obtained by applying weak classifier h to training sample point x_i at the t-th iteration; h^(t)(x_i) ≠ y_i means that the classification result of the i-th training sample point x_i does not equal its class label y_i; h^(t)(x_i) = ŷ means that the classification result of the i-th training sample point x_i equals some result other than its class label, ŷ being a classification result other than the class label. The formula accumulates the weights w_i^(t), at the t-th iteration, of all sample points satisfying h^(t)(x_i) ≠ y_i or h^(t)(x_i) = ŷ, and the proportion this sum takes of the total weight is computed as the error of the t-th iteration. ŷ denotes any classification result that is not the i-th class label y_i; if h^(t)(x_i) is empty, the sample point x_i is by default not classified by classifier h during the t-th round of iteration;
2.c) The weight β^(t) of weak classifier h at the t-th iteration is calculated as follows:

β^(t) = ln((1 − e^(t)) / e^(t))

In each iteration, the weight β^(t) of weak classifier h at the t-th iteration is computed by the above formula; β denotes the weight of the weak classifier h;
2.d) The weights of the training samples are readjusted; the adjustment formula is as follows:

w_i^(t+1) = w_i^(t) · e^{β^(t) · I(h^(t)(x_i) ≠ y_i or h^(t)(x_i) = ŷ)}

The adjustment is constrained by the value of the weight β^(t) of the weak classifier h of the previous step at the t-th iteration: h^(t)(x_i) denotes the class label assigned to the i-th sample point x_i by weak classifier h at the t-th iteration, x_i denotes the i-th sample point, and y_i denotes the i-th class label. For sample points satisfying h^(t)(x_i) ≠ y_i or h^(t)(x_i) = ŷ at the t-th iteration, the weight β^(t) is used as the exponent of e, and the result is multiplied by w_i^(t) to serve as the weight of the training sample for the next iteration; w_i^(t) is the weight of the i-th training sample at the t-th iteration, and w_i^(t+1) is the weight of the i-th training sample at the (t+1)-th iteration;
2.e) The training sample weights w_i^(t+1) for the next round are normalized; the normalization formula is as follows:

w_i^(t+1) ← w_i^(t+1) / Σ_{j=1}^{m} w_j^(t+1)

Then t = t + 1; the iteration is exited when t equals T;
Finally, the classifier H(x) = arg max_k Σ_{t=1}^{T} β^(t) · I(h^(t)(x) = k) is calculated.
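By way of illustration, a minimal Python sketch that puts steps 1) through 2.e) together; the base learner is a stand-in for the ELM weak classifier, and the error, β, and weight-update formulas follow the AdaBoost-style reading reconstructed above, so this is a sketch under those assumptions rather than the patent's exact procedure:

    import numpy as np

    def boost(X, y, train_weak, T=10, lam=0.8):
        """AdaBoost-style loop: weighted subset -> weak learner -> error e^(t),
        classifier weight beta^(t), sample re-weighting, normalization."""
        m = len(y)
        w = np.full(m, 1.0 / m)                 # step 1): w_i^(1) = 1/m
        learners, betas = [], []
        for _ in range(T):
            c = max(1, int(m * lam))            # step 2.a): subset of size m*lam
            idx = np.random.choice(m, size=c, replace=False, p=w / w.sum())
            h = train_weak(X[idx], y[idx])      # step 2.b): h maps X -> labels
            wrong = h(X) != y
            e = w[wrong].sum() / w.sum()        # weighted error e^(t)
            if e <= 0.0 or e >= 0.5:            # degenerate weak learner: stop
                break
            beta = np.log((1.0 - e) / e)        # step 2.c): beta^(t)
            w = w * np.exp(beta * wrong)        # step 2.d): boost wrong samples
            w = w / w.sum()                     # step 2.e): normalize
            learners.append(h)
            betas.append(beta)

        def H(Xq):                              # H(x) = arg max_k sum beta*I(h=k)
            classes = np.unique(y)
            votes = np.zeros((len(Xq), len(classes)))
            for h, beta in zip(learners, betas):
                votes[np.arange(len(Xq)), np.searchsorted(classes, h(Xq))] += beta
            return classes[np.argmax(votes, axis=1)]

        return H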
The useful result of the present invention is: in the text classification system capable of avoiding overfitting provided by the invention, the method gives a quantitative description of the overfitting problem, using the parameter RO, the rate of overfitting, to describe whether the overfitting problem occurs and how severe it is; and, aiming at improving classification accuracy under the overfitting problem, the method also proposes a new approach.
Specific implementation mode
In order to make the technical problems to be solved, the technical solutions, and the advantages of the present invention clearer, the present invention is described in detail below in conjunction with embodiments. It should be noted that the specific embodiments described herein only explain the present invention and are not intended to limit it; any product that can realize the described functions is an equivalent replacement or improvement and is included within the protection scope of the present invention. The specific methods are as follows:
Embodiment 1: In general, text classification methods use the vector space model. In the vector space model there are multiple bags of words, and a bag of words comprises multiple words; a bag of words represents a document, and a word represents a feature. The vector space model leads to the generation of redundant vectors and produces a highly sparse matrix, and both the redundant vectors and the highly sparse matrix can cause overfitting;
The overfitting problem can be construed as follows: pursuing accuracy on the training set raises the complexity of the learning model, so that the learning model merely records non-predictive features of the training data; when the learning model processes new data, it performs poorly;
A deep neural network, being a complicated machine learning model, contains an excessive number of parameters and is easily trapped in overfitting;
The general solutions to the overfitting problem are feature dimensionality reduction, regularization, and randomly dropping neurons (dropout); at present, a quantitative description of the overfitting problem is lacking;
The present invention provides a quantitative description that addresses the overfitting problem, in which the parameter rate of overfitting is proposed; the rate of overfitting is used to describe whether the overfitting problem occurs and its severity. The present invention also designs a new algorithm, AdaBELM, to avoid the overfitting problem and improve classification accuracy;
The rate of overfitting is defined as follows:

RO = -∂o/∂u

where RO denotes the rate of overfitting, o denotes the test error, z denotes the training error, and u denotes the scale of the training set or the number of iterations; the sign of the rate of overfitting is determined by the test error, and is opposite to the change in the test error;
When the rate of overfitting is less than zero, the test error also increases as the scale of the training set or the number of iterations increases; the smaller the value of the rate of overfitting, the more severe the overfitting;
Under the premise of avoiding the overfitting problem, the test error o is reduced by adding a compensation term to increase the training error z;
A rate of overfitting greater than zero indicates that the test error decreases as the training scale increases;
The AdaBoost algorithm was proposed by Freund and Schapire, building on the boosting question posed by Kearns and Valiant. The detailed process is as follows:
At the beginning, every sample in the training sample set P = {(x_1, y_1), ..., (x_m, y_m)} is endowed with a weight of the same size; these weights form the initial weight distribution D_1. The weight assignment is repeated: the t-th weight distribution is D_t, a weak classifier h_t: X → Y is generated from the training sample set and the distribution D_t, and the error rate of the weak classifier h_t is also calculated. Here X = {x_1, ..., x_m} are the sample points, Y = {y_1, ..., y_m} are the class labels of the training sample set, y_1, ..., y_m ∈ {0, 1, ..., C}, and m is the number of sample points, a positive integer. The weights of the samples that were not classified correctly are increased, and the weights of the correctly classified samples are reduced; the value of D_{t+1} for the next cycle is updated accordingly, and a new base learner is generated from the training sample set and D_{t+1}. The above process repeats for T cycles in total, T a positive integer, and the final model is obtained by a majority vote of the T weak learners. The purpose of the algorithm is to learn a series of weak classifiers, or base classifiers, from the training sample set, and then combine the weak classifiers, or base classifiers, into a strong classifier. Described in steps, the process is:
Preprocessing defines the input of the algorithm and the output of the algorithm. The input of the algorithm is the training sample set D = {(x_1, y_1), ..., (x_m, y_m)} and the number of iterations T; the output of the algorithm is the classifier H(x) = arg max_k Σ_{t=1}^{T} β^(t) · I(h^(t)(x) = k).
1) The weight of each training sample is initialized to w_i^(1) = 1/m; w_i^(1) is the weight of a training sample in the training sample set, and i denotes the i-th training sample (x_i, y_i) in the training sample set; every training sample is assigned the weight value 1/m, and the superscript (1) denotes the initial training-sample weights;
2) T iterations are carried out; the number of the current iteration is t, t = 1, ..., T
2.a) For the weight distribution D_t and w_i^(t), obtain the weak classifier h^(t) = L(D_t, w^(t)); L denotes learning training carried out on the training sample set with the weight values w_i^(t) and the weight distribution D_t; h^(t) is the weak classifier, which classifies a training sample X as −1 or 1. The classification error e^(t) is:

e^(t) = [ Σ_{i: h^(t)(x_i) ≠ y_i} w_i^(t) ] / [ Σ_{i=1}^{m} w_i^(t) ]

where the function I denotes a restriction on w_i^(t); h^(t)(x_i) denotes the class label assigned to x_i by weak classifier h at the t-th iteration, x_i denotes the i-th sample point, and y_i denotes the i-th class label; the weights corresponding to the training samples that the weak classifier classifies incorrectly are added up;
2.b) The weight β^(t) of the weak classifier is calculated as follows:

β^(t) = ln((1 − e^(t)) / e^(t))

In the t-th round of iteration, the weight β^(t) of the weak classifier is computed by the above formula;
2.c) The weights of the training samples are readjusted; the adjustment formula is as follows:

w_i^(t+1) = w_i^(t) · e^{β^(t) · I(h^(t)(x_i) ≠ y_i)}

The adjustment is constrained by the value of the weight β^(t) of the weak classifier from the previous step: h^(t)(x_i) denotes the class label assigned to x_i by the weak classifier h at the t-th iteration, x_i denotes the i-th sample point, and y_i denotes the i-th class label. For training samples whose classification result is incorrect, the weight of the weak classifier is used as the exponent of e, and the result is multiplied by w_i^(t) to give the weight for the next iteration; w_i^(t) is the weight of the i-th training sample at the t-th iteration, and w_i^(t+1) is the weight of the i-th training sample at the (t+1)-th iteration;
Finally, the training sample weights w_i^(t+1) for the next round are normalized; the normalization formula is as follows:

w_i^(t+1) ← w_i^(t+1) / Σ_{j=1}^{m} w_j^(t+1)
The classifiers are assembled to obtain the new classifier H(x) = arg max_k Σ_{t=1}^{T} β^(t) · I(h^(t)(x) = k), i.e., the value of the argument at which the function attains its maximum is sought.
Embodiment 2: the mechanism of the extreme learning machine (ELM) model;
ELM is a single-hidden-layer feedforward network that trains quickly on the training set; the network has only three layers: the input layer, the hidden layer, and the output layer. Suppose there are N arbitrary samples (x_i, t_i), where x_i = [x_{i1}, x_{i2}, ..., x_{in}]^T ∈ R^n and t_i = [t_{i1}, t_{i2}, ..., t_{im}]^T ∈ R^m. A single-hidden-layer neural network with L hidden nodes can be expressed as

Σ_{i=1}^{L} β_i · g(W_i · x_j + b_i) = o_j,   j = 1, ..., N

where g = g(x) is the activation function, W_i = [w_{i,1}, w_{i,2}, ..., w_{i,n}]^T is the input weight, β_i is the output weight, and b_i is the bias of the i-th hidden unit; W_i · x_j denotes the inner product of W_i and x_j.
The goal of single-hidden-layer network learning is to minimize the output error, which can be expressed as

Σ_{j=1}^{N} ||o_j − t_j|| = 0

i.e., there exist β_i, W_i and b_i such that

Σ_{i=1}^{L} β_i · g(W_i · x_j + b_i) = t_j,   j = 1, ..., N

This can be expressed in matrix form as

Hβ = T

where H is the output of the hidden nodes, β is the output weight, and T is the desired output.
In order to train the single-hidden-layer network, we want to obtain Ŵ_i, b̂_i and β̂ such that

||H(Ŵ_i, b̂_i)β̂ − T|| = min_{W,b,β} ||H(W_i, b_i)β − T||

where i = 1, ..., L; this is equivalent to minimizing the loss function

E = Σ_{j=1}^{N} || Σ_{i=1}^{L} β_i · g(W_i · x_j + b_i) − t_j ||²

Some traditional algorithms based on gradient descent can be used to solve this problem, but a gradient-based learning algorithm needs to adjust all parameters during the iteration. In the ELM algorithm, once the input weights W_i and the hidden-layer biases b_i are determined at random, the output matrix H of the hidden layer is uniquely determined. Training the single-hidden-layer network is then converted into solving the linear system Hβ = T, and the output weight β can be determined as

β̂ = H⁺T

where H⁺ is the Moore-Penrose generalized inverse of the matrix H; it can be proved that the norm of the solution β̂ obtained in this way is minimal and that the solution is unique.
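By way of illustration, a minimal NumPy sketch of this ELM training rule; the sigmoid activation and the function names are our choices:

    import numpy as np

    def elm_train(X, T_target, L, rng=np.random.default_rng(0)):
        """Train an ELM: random input weights and biases, hidden output H,
        then output weights beta = pinv(H) @ T (the Moore-Penrose solution)."""
        n = X.shape[1]
        W = rng.standard_normal((n, L))          # random input weights W_i
        b = rng.standard_normal(L)               # random hidden biases b_i
        H = 1.0 / (1.0 + np.exp(-(X @ W + b)))   # hidden output, g = sigmoid
        beta = np.linalg.pinv(H) @ T_target      # beta = H^+ T
        return W, b, beta

    def elm_predict(X, W, b, beta):
        H = 1.0 / (1.0 + np.exp(-(X @ W + b)))
        return H @ beta                          # scores; take argmax for labels

Because W and b are never revisited, training reduces to a single pseudo-inverse, which is what makes ELM fast compared with gradient-based training.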
The present invention can avoid the problem of overfitting and obtain a text classification method with good performance: the extreme learning machine is used as the base classifier, a training subset is used to train each base classifier, and the parameter λ is used to control the scale of the training set, where 0 < λ ≤ 1. Specifically, when 0 < λ < 1, a subset of the training samples is selected as the training set for each base classifier; when λ = 1, the entire training set is used. Regarding the selection of the training subset: for data not yet selected, the probability of being selected in the next round of the cycle is increased. The present invention can not only effectively handle big-data problems but can at the same time avoid the overfitting problem.
The method is divided into two steps: the first step, a preprocessing process, and the second step, the specific processing procedure;
In the preprocessing process, the input and output of the method are defined. The input is the training sample set D = {(x_1, y_1), ..., (x_m, y_m)}, where D is the training sample set, (x_1, y_1), ..., (x_m, y_m) are the training samples, X = {x_1, ..., x_m} are the sample points, Y = {y_1, ..., y_m} are the class labels of the training sample set, y_1, ..., y_m ∈ {0, 1, ..., C}, m is the number of sample points and training samples, x is a sample point, and y is a class label, serving as the classification result for the sample point after the classifier is applied; a sample point and a class label with the same index correspond to each other. The output of the algorithm is the classifier

H(x) = arg max_k Σ_{t=1}^{T} β^(t) · I(h^(t)(x) = k)

where H(x) is the classifier, β^(t) denotes the weight of the weak classifier in the t-th iteration, t indicates being in the t-th iteration, and β denotes the weight of the weak classifier; the function I expresses a qualifying relation that restricts the function appearing before it, outputting the value of the preceding function only when the condition inside its brackets is satisfied; for example, β^(t) · I(h^(t)(x) = k) denotes the output of the weight β^(t) for terms satisfying the formula h^(t)(x) = k; h^(t)(x) denotes the classification result of the classifier on sample point x at the t-th iteration, k denotes a candidate class label over which the classification results are temporarily accumulated, h denotes the classification handling result of the classifier on sample point x, and T denotes the maximum number of iterations, a positive integer. The arg max adds up the weights of the classifiers over all T iterations and chooses the class k that makes this sum reach its maximum, which gives the classification result of the classifier;
The specific processing procedure is as follows:
1) The weight of each training sample is initialized to w_i^(1) = 1/m; w_i^(1) is the weight of the i-th training sample in the training sample set, i is a positive integer, the superscript (1) denotes the 1st iteration, and m denotes the number of sample points;
2) The iteration starts; the total number of iterations is set to T; the iteration parameter is t, a positive integer, indicating the t-th round of iteration, t = 1, ..., T;
2.a) Input the parameter λ and the training sample set D and select a subset of D; the number of selected training sample points is c = m × λ, 0 < λ ≤ 1. The detailed process is: take the training sample set D = {(x_1, y_1), ..., (x_m, y_m)} as the candidate set and select one training sample point from the candidate set at a time, until c points have been selected, c a positive integer. The selection steps are: initialize the integer n = 0; n is the selection counter, used to count the number of training samples already selected;
i. Each training sample point (x_i, y_i) is assigned a selection weight q_i, i = 1, ..., m, with q_i = w_1 + w_2 + ... + w_i; the initial value of the selection weights is q_0 = 0;
ii. A random number p is generated, equal to rand(0, q_m), i.e., a number between 0 and q_m is generated at random. The selection weights q_i partition an interval set whose intervals are [q_0, q_1], [q_1, q_2], [q_2, q_3], ..., [q_i, q_{i+1}], ..., [q_{m-1}, q_m]; the interval [q_i, q_{i+1}] corresponds to training sample point (x_{i+1}, y_{i+1}). When the random number p falls within the interval [q_i, q_{i+1}], the corresponding training sample point is selected and the interval [q_i, q_{i+1}] is removed from the interval set, at which point n = n + 1, and the procedure returns to step ii;
iii. Determine which interval of the interval set the random number p belongs to, and recall the corresponding training sample from the training sample set D into the subset D' of the training sample set. Set two integers l and r; l is initially 0 and r is initially m − 1, corresponding respectively to the intervals [q_0, q_1] and [q_{m-1}, q_m] of the interval set, and satisfying q_l ≤ p and q_r > p. Loop 1 proceeds by taking the midpoint U = ⌊(l + r)/2⌋; if q_U ≤ p then l = U + 1, otherwise r = U; check whether l equals r; if they are equal, exit loop 1, and if not, continue loop 1;
The interval [q_l, q_{l+1}] is then removed from the interval set, (x_l, y_l) is added to the subset of training samples, n = n + 1, and the procedure returns to step ii;
When the value of n equals c, the selection of the training subset is finished;
2.b) The subset of training samples and the weights w_i^(t) are taken as input parameters for classification; the iterative classification is h^(t) = L(D', w^(t)), where L denotes learning training carried out on the training subset D' with the weight values w_i^(t), and h^(t) is the classification result of weak classifier h at the t-th iteration. The method uses the ELM model as the learning training model of the weak classifier h; the error e^(t) of the t-th iteration's classification is:

e^(t) = [ Σ_{i: h^(t)(x_i) ≠ y_i or h^(t)(x_i) = ŷ} w_i^(t) ] / [ Σ_{i=1}^{m} w_i^(t) ]

where h^(t)(x_i) denotes the classification result obtained by applying weak classifier h to sample point x_i at the t-th iteration; h^(t)(x_i) ≠ y_i means that the classification result of the i-th sample point x_i does not equal its class label y_i; h^(t)(x_i) = ŷ means that the classification result of the i-th sample point x_i equals some result other than its class label, ŷ being a classification result other than the class label. The formula accumulates the weights w_i^(t), at the t-th iteration, of all sample points satisfying h^(t)(x_i) ≠ y_i or h^(t)(x_i) = ŷ, and the proportion this sum takes of the total weight is computed as the error of the t-th iteration. ŷ denotes any classification result that is not the class label y_i; if h^(t)(x_i) is empty, the sample point x_i is by default not classified by classifier h during the t-th round of iteration;
2.c) The weight β^(t) of weak classifier h at the t-th iteration is calculated as follows:

β^(t) = ln((1 − e^(t)) / e^(t))

In each round of iteration, the weight β^(t) of weak classifier h at the t-th iteration is computed by the above formula; β denotes the weight of the weak classifier h;
2.d) The weights of the training samples are readjusted; the adjustment formula is as follows:

w_i^(t+1) = w_i^(t) · e^{β^(t) · I(h^(t)(x_i) ≠ y_i or h^(t)(x_i) = ŷ)}

The adjustment is constrained by the value of the weight β^(t) of the weak classifier h of the previous step at the t-th iteration: h^(t)(x_i) denotes the class label assigned to the i-th sample point x_i by weak classifier h at the t-th iteration, x_i denotes the i-th sample point, and y_i denotes the i-th class label. For sample points satisfying h^(t)(x_i) ≠ y_i or h^(t)(x_i) = ŷ at the t-th iteration, the weight β^(t) is used as the exponent of e, and the result is multiplied by w_i^(t) to serve as the weight of the training sample for the next iteration; w_i^(t) is the weight of the i-th training sample at the t-th iteration, and w_i^(t+1) is the weight of the i-th training sample at the (t+1)-th iteration;
2.e) The training sample weights w_i^(t+1) for the next round are normalized; the normalization formula is as follows:

w_i^(t+1) ← w_i^(t+1) / Σ_{j=1}^{m} w_j^(t+1)
Although preferred embodiments of the present invention have been described, once persons skilled in the art learn of the basic inventive concept, they may make additional changes and modifications to these embodiments. Therefore, the appended claims are intended to be interpreted as including the preferred embodiments and all changes and modifications that fall within the scope of the present invention.
Obviously, those skilled in the art can make various modifications and variations to the embodiments of the present invention without departing from the spirit and scope of the embodiments of the present invention. Thus, if these modifications and variations of the embodiments of the present invention fall within the scope of the claims of the present invention and their technical equivalents, the present invention is also intended to include these modifications and variations.

Claims (1)

1. A text classification system capable of avoiding overfitting, characterized by comprising the following:
The text classification system comprises an overfitting-rate judgment module and a text classification module;
The overfitting-rate judgment module is used to judge the severity of overfitting and can provide the text classification module with a method for judging the rate of overfitting; the overfitting-rate judgment module judges using the rate of overfitting RO, which is defined as follows:

RO = -∂o/∂u

where o denotes the test error, z denotes the training error, and u denotes the scale of the training set or the number of iterations; the sign of the rate of overfitting is determined by the test error and is opposite to the change in the test error. When the rate of overfitting is less than zero, the test error o also increases as the scale of the training set or the number of iterations increases; the smaller the value of the rate of overfitting RO, the more severe the overfitting. If the test error o needs to be reduced further, a compensation term is added to increase the training error z and thereby avoid overfitting. When the rate of overfitting equals 0 and the training set grows further, the test error no longer changes;
A rate of overfitting greater than zero indicates that the test error decreases as the training scale increases; as long as the rate of overfitting is positive, the overfitting problem no longer occurs; the training set here refers to the training sample set;
The text classification module is responsible for classifying text and works in two steps: the first step, a preprocessing process, and the second step, the specific processing procedure;
In the preprocessing process, the text classification module defines its input and output. The input of the text classification module is the training sample set D = {(x_1, y_1), ..., (x_i, y_i), ..., (x_m, y_m)}, where D is the training sample set, (x_1, y_1), ..., (x_i, y_i), ..., (x_m, y_m) are the training samples, X = {x_1, ..., x_i, ..., x_m} are the training sample points, and Y = {y_1, ..., y_i, ..., y_m} are the class labels of the training sample set; x is a training sample point and y is the class label of a training sample point, serving as the classification result for the sample point after the classifier is applied; m is the number of training sample points, calculated after the user inputs the training sample set. A training sample point and a class label with the same index correspond one to one, i.e., the classification result of the i-th training sample point x_i is y_i, i = 1, ..., m. The output of the text classification module is the classifier

H(x) = arg max_k Σ_{t=1}^{T} β^(t) · I(h^(t)(x) = k)

where H(x) is the classification result of classifier H on training sample point x, β^(t) denotes the weight of the weak classifier h in the t-th iteration, t indicates being in the t-th iteration, and β denotes the weight of the weak classifier h; the function I expresses a qualifying relation that restricts the function appearing immediately before it, outputting the value of that function only when the condition inside its brackets is satisfied; h^(t)(x) denotes the classification handling result of the weak classifier h on training sample point x at the t-th iteration, k denotes a candidate class label over which the classification results are temporarily accumulated, and T denotes the maximum number of iterations, a positive integer. The arg max adds up the weights of the classifiers over all T iterations and chooses the class k that makes this sum reach its maximum, which gives the classification result of the classifier;
In the specific processing procedure, the text classification module executes as follows:
1) The weight of every training sample point is initialized to w_i^(1) = 1/m; w_i^(1) is the weight of the i-th training sample point x_i in the training sample set at the 1st iteration, i is a positive integer denoting the index of the training sample point in the training sample set, the superscript (1) indicates that the training sample is at the 1st iteration, and m denotes the number of training sample points in the training sample set;
2) The iteration over the training sample set begins; the total number of iterations is set to T, T a positive integer; the iteration parameter is t, a positive integer, indicating the t-th iteration, t = 1, ..., T;
2.a) The text classification module inputs the parameter λ and the training sample set D and selects a subset D' of D; the number of selected training sample points is c = m × λ, i.e., the subset D' contains c training sample points, where 0 < λ ≤ 1. The detailed process is: take the training sample set D = {(x_1, y_1), ..., (x_m, y_m)} as the candidate set and select one training sample point from the candidate set at a time, until c points have been selected, c a positive integer. The selection steps are:
First, initialize n = 0; n is the selection counter, used to count the number of training samples already selected, and n is a positive integer;
i. The text classification module assigns each training sample (x_i, y_i) a selection weight q_i, i = 1, ..., m, where q_i = w_1 + w_2 + ... + w_i; the initial value of the selection weights is q_0 = 0, where q is any selection weight, j is a positive integer, and w_j is the weight of the j-th training sample point x_j;
ii. The text classification module generates a random number p equal to rand(0, q_m), i.e., a number between 0 and q_m is generated at random and assigned to p. Then the selection weights q partition an interval set; the intervals contained in the set are [q_0, q_1], [q_1, q_2], [q_2, q_3], ..., [q_i, q_{i+1}], ..., [q_{m-1}, q_m], and the interval [q_i, q_{i+1}] corresponds to training sample (x_{i+1}, y_{i+1}); q_i is the i-th selection weight and q_m denotes the m-th selection weight;
iii. The text classification module determines which interval of the interval set the random number p belongs to, and recalls the selected training sample from the training sample set D into the subset D' of the training sample set. Two integers l and r are set; l is initially 0 and r is initially m − 1, corresponding respectively to the intervals [q_0, q_1] and [q_{m-1}, q_m] of the interval set, and satisfying q_l ≤ p and q_r > p. The text classification module sets up loop 1: take the midpoint U = ⌊(l + r)/2⌋; if q_U ≤ p then l = U + 1, otherwise r = U; check whether l equals r; if they are equal, exit loop 1, and if not, continue loop 1; here q_l is the l-th selection weight and q_r is the r-th selection weight;
When loop 1 is exited, the interval [q_l, q_{l+1}] is removed from the interval set, (x_l, y_l) is added to the training subset D', n = n + 1, and the procedure returns to step ii;
When the value of n equals c, the selection of the subset D' of the training sample set D is entirely finished;
2.b) The text classification module takes the subset D' of the training sample set D and the weight w_i^(t) of the i-th training sample point x_i at the t-th iteration as input parameters for classification; the iterative classification is h^(t) = L(D', w^(t)), where L denotes learning training carried out on the subset D' of the training sample set D with the weight values w_i^(t), and h^(t) is the classification result of weak classifier h at the t-th iteration. The text classification module uses the ELM model as the learning training model of the weak classifier h, and the error e^(t) of the t-th iteration's classification is:

e^(t) = [ Σ_{i: h^(t)(x_i) ≠ y_i or h^(t)(x_i) = ŷ} w_i^(t) ] / [ Σ_{i=1}^{m} w_i^(t) ]

where h^(t)(x_i) denotes the classification result obtained by applying weak classifier h to training sample point x_i at the t-th iteration; h^(t)(x_i) ≠ y_i means that the classification result of the i-th training sample point x_i does not equal its class label y_i; h^(t)(x_i) = ŷ means that the classification result of the i-th training sample point x_i equals some result other than its class label, ŷ being a classification result other than the class label;
The calculation of e^(t) accumulates the weights w_i^(t), at the t-th iteration, of all sample points satisfying h^(t)(x_i) ≠ y_i or h^(t)(x_i) = ŷ, and the proportion this sum takes of the total weight is the error of the t-th iteration; ŷ denotes any classification result that is not the i-th class label y_i; if h^(t)(x_i) is empty, the sample point x_i is by default not classified by classifier h during the t-th round of iteration;
2.c) The text classification module calculates the weight β^(t) of weak classifier h at the t-th iteration as follows:

β^(t) = ln((1 − e^(t)) / e^(t))

In each iteration, the weight β^(t) of weak classifier h at the t-th iteration is computed by the above formula; β denotes the weight of the weak classifier h;
2.d) The text classification module readjusts the weights of the training samples; the adjustment formula is as follows:

w_i^(t+1) = w_i^(t) · e^{β^(t) · I(h^(t)(x_i) ≠ y_i or h^(t)(x_i) = ŷ)}

The value of w_i^(t+1) is constrained by the value of the weight β^(t) of the weak classifier h of the previous step at the t-th iteration: h^(t)(x_i) denotes the class label assigned to the i-th sample point x_i by weak classifier h at the t-th iteration, x_i denotes the i-th sample point, and y_i denotes the i-th class label. For sample points satisfying h^(t)(x_i) ≠ y_i or h^(t)(x_i) = ŷ at the t-th iteration, the weight β^(t) is used as the exponent of e, and the result is multiplied by w_i^(t) to serve as the weight of the training sample for the next iteration; w_i^(t) is the weight of the i-th training sample at the t-th iteration, and w_i^(t+1) is the weight of the i-th training sample at the (t+1)-th iteration;
2.e) Finally, the text classification module normalizes the training sample weights w_i^(t+1) for the next round; the normalization formula is as follows:

w_i^(t+1) ← w_i^(t+1) / Σ_{j=1}^{m} w_j^(t+1)

Then t = t + 1; the iteration is exited when t equals T;
Finally, the classifier H(x) = arg max_k Σ_{t=1}^{T} β^(t) · I(h^(t)(x) = k) is calculated according to the above steps.
CN201810447545.5A 2018-05-11 2018-05-11 Text classification system capable of avoiding overfitting Pending CN108647325A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810447545.5A CN108647325A (en) 2018-05-11 2018-05-11 Text classification system capable of avoiding overfitting

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810447545.5A CN108647325A (en) 2018-05-11 2018-05-11 Text classification system capable of avoiding overfitting

Publications (1)

Publication Number Publication Date
CN108647325A 2018-10-12

Family

ID=63754547

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810447545.5A Pending CN108647325A (en) 2018-05-11 2018-05-11 Text classification system capable of avoiding overfitting

Country Status (1)

Country Link
CN (1) CN108647325A (en)


Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109597888A * 2018-11-19 2019-04-09 北京百度网讯科技有限公司 Method and apparatus for establishing a text-domain recognition model
CN112836051A (en) * 2021-02-19 2021-05-25 太极计算机股份有限公司 Online self-learning court electronic file text classification method
CN112836051B (en) * 2021-02-19 2024-03-26 太极计算机股份有限公司 Online self-learning court electronic file text classification method

Similar Documents

Publication Publication Date Title
Sharma, Deep challenges associated with deep learning
Ye, Data mining: theories, algorithms, and examples
CN107861951A Method for identifying session topics in intelligent customer service
CN110033281A Method and device for switching from intelligent customer service to human customer service
CN109118013A Neural-network-based management data prediction method, readable storage medium, and forecasting system
CN103309953B Method for labeling and searching diversified pictures based on an ensemble of multiple RBFNN classifiers
CN108829763A Attribute prediction method for film-review website users based on deep neural networks
Luo et al., Sparse group restricted Boltzmann machines
CN106845717A Energy efficiency evaluation method based on a multi-model fusion strategy
CN106650920A Prediction model based on an optimized extreme learning machine (ELM)
Sun et al., Linguistic value soft set-based approach to multiple criteria group decision-making
CN107301604A Multi-model fusion estimation system
CN103605711A Construction method and device, and classification method and device, for a support vector machine
CN103400190A Integrated framework method for optimizing an extreme learning machine with a genetic algorithm
CN109934306A Multi-label attribute-value division method and device based on random walk
CN109886755A Communication user churn prediction method and system based on an evolutionary algorithm
Babu et al., A novel artificial neural network algorithm for prediction of natural gas prices using machine learning
CN108647325A Text classification system capable of avoiding overfitting
CN108875034A Chinese text classification method based on a hierarchical long short-term memory network
CN110457470A Text classification model learning method and device
CN107368611B Short text classification method
CN107305565A Information processing apparatus, information processing method, and information processing device
CN109447767A Commodity evaluation method and system applied to e-commerce
CN109657779A DNN-based model data processing method, data processing model, and electronic device
Qiao et al., SRS-DNN: a deep neural network with strengthening response sparsity

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication (application publication date: 20181012)