Summary of the invention
Embodiments of the present application provide a method for updating a classification model, to solve the problems of low training efficiency and untimely updates in existing approaches to updating classification models. Embodiments of the present application also provide a device for updating a classification model.
The present application provides a method for updating a classification model, the classification model comprising a predetermined quantity of original decision trees and being used for class prediction based on user behavior data in a network application, the method comprising:
obtaining incremental data within a predetermined time period from a server that provides the user behavior data, and extracting a training sample set therefrom;
generating a newly added quantity of decision trees according to the training sample set;
selecting the predetermined quantity of decision trees from the newly added decision trees and the original decision trees according to a historical classification accuracy and a current classification accuracy for the training sample set, to compose the updated classification model.
Optionally, the obtaining incremental data within a predetermined time period from a server that provides the user behavior data, and extracting a training sample set therefrom, comprises:
obtaining the incremental data within the predetermined time period from the server that provides the user behavior data, wherein each piece of incremental data comprises a group of original variables with corresponding values, and a class label;
preprocessing the incremental data so that it meets the classification model's requirements on training sample data;
performing the following operation on each piece of incremental data to generate the training sample set: extracting from the incremental data the values of the feature variables corresponding to the classification model, and composing a training sample from the extracted values and the class label.
Optionally, the preprocessing the incremental data comprises one of the following, or any combination thereof:
processing maximum and/or minimum values in the incremental data in a preset manner;
processing missing values in the incremental data in a preset manner;
performing corresponding format conversion according to the classification model's format requirements on sample data.
Optionally, the obtaining incremental data within a predetermined time period from a server that provides the user behavior data, and extracting a training sample set therefrom, further comprises:
extracting feature variables from the preprocessed incremental data and adding them to a feature variable set;
the feature variables corresponding to the classification model refer to the feature variables that are chosen from the feature variable set and correspond to the classification model.
Optionally, the generating a newly added quantity of decision trees according to the training sample set comprises:
generating the newly added quantity of decision trees using a random forest algorithm according to the training sample set.
Optionally, the generating the newly added quantity of decision trees using a random forest algorithm according to the training sample set comprises:
building a bootstrap sample set from the training sample set by sampling with replacement;
using the bootstrap sample set, generating a new decision tree by choosing, at each node, a feature variable according to a predetermined policy and splitting on the chosen feature variable; the choosing a feature variable according to a predetermined policy refers to choosing the optimal feature variable, according to the predetermined policy, from among randomly selected feature variables;
returning to the step of building a bootstrap sample set from the training sample set by sampling with replacement, and continuing execution until the newly added quantity of decision trees has been generated.
Optionally, the choosing the optimal feature variable according to the predetermined policy comprises: choosing the optimal feature variable according to information gain, according to information gain ratio, or according to the Gini index.
Optionally, the selecting the predetermined quantity of decision trees from the newly added decision trees and the original decision trees according to the historical classification accuracy and the current classification accuracy for the training sample set comprises:
calculating comprehensive classification accuracies of the newly added decision trees and the original decision trees according to the historical classification accuracy and the current classification accuracy for the training sample set;
ranking the newly added decision trees and the original decision trees by their comprehensive classification accuracies;
selecting the predetermined quantity of highest-ranked decision trees from the ranked decision trees.
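The ranking-and-selection step described above can be sketched as follows; the function name, tree identifiers, and accuracy scores are illustrative assumptions, not part of the application.

```python
# Illustrative sketch: rank all candidate trees (newly added and original) by
# their comprehensive classification accuracy and keep the top N trees to
# form the updated classification model.

def select_trees(tree_ids, comprehensive_accuracy, predetermined_quantity):
    """Sort tree ids by comprehensive accuracy, highest first, keep the top N."""
    ranked = sorted(tree_ids, key=lambda t: comprehensive_accuracy[t], reverse=True)
    return ranked[:predetermined_quantity]

scores = {"orig_1": 0.81, "orig_2": 0.76, "orig_3": 0.69,
          "new_1": 0.84, "new_2": 0.72}
kept = select_trees(list(scores), scores, 3)
print(kept)  # ['new_1', 'orig_1', 'orig_2']
```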
Optionally, the calculating comprehensive classification accuracies of the newly added decision trees and the original decision trees according to the historical classification accuracy and the current classification accuracy for the training sample set comprises:
calculating, according to the training sample set, the current classification accuracy of each newly added decision tree, and using it as that tree's comprehensive classification accuracy;
calculating, according to the training sample set, the current classification accuracy of each original decision tree;
calculating the comprehensive classification accuracy of each original decision tree according to its historical classification accuracy and current classification accuracy.
Optionally, the calculating the comprehensive classification accuracy of each original decision tree according to the historical classification accuracy and the current classification accuracy is implemented in the following way:
calculating the comprehensive classification accuracy of the original decision tree using a moving average method.
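As a minimal sketch of the moving-average idea, assuming an exponential moving average with a hypothetical smoothing factor `alpha` (the application itself does not fix a value):

```python
def comprehensive_accuracy(history_acc, current_acc, alpha=0.3):
    """Exponentially weighted blend of this round's classification accuracy
    with the historical comprehensive accuracy; alpha is a hypothetical
    weight controlling how quickly history is forgotten."""
    return alpha * current_acc + (1 - alpha) * history_acc

# With equal weighting, a history of 0.8 and a current round of 0.6 blend to 0.7.
print(round(comprehensive_accuracy(0.8, 0.6, alpha=0.5), 6))  # 0.7
```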
Optionally, the moving average method comprises: a weighted moving average method or an exponential moving average method.
Optionally, before performing the step of generating the newly added quantity of decision trees, the following operations are performed:
judging whether the classification model has already been created;
if not, using the predetermined quantity as the newly added quantity.
Optionally, after the step of selecting the predetermined quantity of decision trees from the newly added decision trees and the original decision trees to compose the updated classification model, the following operation is performed:
deleting the unselected decision trees.
Correspondingly, the present application also provides a device for updating a classification model, comprising:
a training sample set extraction unit, configured to obtain incremental data within a predetermined time period from a server that provides the user behavior data, and to extract a training sample set therefrom;
a decision tree generation unit, configured to generate a newly added quantity of decision trees according to the training sample set;
a decision tree selection unit, configured to select the predetermined quantity of decision trees from the newly added decision trees and the original decision trees according to a historical classification accuracy and a current classification accuracy for the training sample set, to compose the updated classification model.
Optionally, the training sample set extraction unit comprises:
an incremental data obtaining subunit, configured to obtain the incremental data within the predetermined time period from the server that provides the user behavior data, wherein each piece of incremental data comprises a group of original variables with corresponding values, and a class label;
a data preprocessing subunit, configured to preprocess the incremental data so that it meets the classification model's requirements on training sample data;
a feature extraction subunit, configured to perform the following operation on each piece of incremental data to generate the training sample set: extracting from the incremental data the values of the feature variables corresponding to the classification model, and composing a training sample from the extracted values and the class label.
Optionally, the data preprocessing subunit comprises at least one of the following subunits:
an extreme value processing subunit, configured to process maximum and/or minimum values in the incremental data in a preset manner;
a missing value processing subunit, configured to process missing values in the incremental data in a preset manner;
a format conversion subunit, configured to perform corresponding format conversion according to the classification model's format requirements on sample data.
Optionally, the training sample set extraction unit further comprises:
a feature variable extraction subunit, configured to extract feature variables from the preprocessed incremental data and add them to a feature variable set;
the feature extraction subunit is specifically configured to perform the following operation on each piece of incremental data to generate the training sample set: extracting from the incremental data the values of the feature variables that are chosen from the feature variable set and correspond to the classification model, and composing a training sample from the extracted values and the class label.
Optionally, the decision tree generation unit is specifically configured to generate the newly added quantity of decision trees using a random forest algorithm according to the training sample set.
Optionally, the decision tree generation unit comprises:
a loop control subunit, configured to invoke the following subunits to create decision trees while the quantity of generated decision trees is less than the newly added quantity;
a bootstrap sample set building subunit, configured to build a bootstrap sample set from the training sample set by sampling with replacement;
a decision tree generation execution subunit, configured to use the bootstrap sample set to generate a new decision tree by choosing, at each node, a feature variable according to a predetermined policy and splitting on the chosen feature variable; the choosing a feature variable according to a predetermined policy refers to choosing the optimal feature variable, according to the predetermined policy, from among randomly selected feature variables.
Optionally, the predetermined policy used by the decision tree generation execution subunit comprises: choosing the attribute according to information gain, choosing the attribute according to information gain ratio, or choosing the attribute according to the Gini index.
Optionally, the decision tree selection unit comprises:
a comprehensive indicator calculation subunit, configured to calculate comprehensive classification accuracies of the newly added decision trees and the original decision trees according to the historical classification accuracy and the current classification accuracy for the training sample set;
a ranking subunit, configured to rank the newly added decision trees and the original decision trees by their comprehensive classification accuracies;
a screening subunit, configured to select the predetermined quantity of highest-ranked decision trees from the ranked decision trees.
Optionally, the comprehensive indicator calculation subunit comprises:
a newly added decision tree comprehensive indicator calculation subunit, configured to calculate, according to the training sample set, the current classification accuracy of each newly added decision tree as its comprehensive classification accuracy;
an original decision tree current indicator calculation subunit, configured to calculate, according to the training sample set, the current classification accuracy of each original decision tree;
an original decision tree comprehensive indicator calculation subunit, configured to calculate the comprehensive classification accuracy of each original decision tree according to its historical classification accuracy and current classification accuracy.
Optionally, the original decision tree comprehensive indicator calculation subunit is specifically configured to calculate the comprehensive classification accuracy of the original decision tree using a moving average method.
Optionally, the moving average method used by the original decision tree comprehensive indicator calculation subunit comprises: a weighted moving average method or an exponential moving average method.
Optionally, the device further comprises:
a classification model creation judging unit, configured to judge, before triggering the operation of the decision tree generation unit, whether the classification model has already been created, and if not, to use the predetermined quantity as the newly added quantity.
Optionally, the device further comprises:
a decision tree deletion unit, configured to delete the unselected decision trees after the decision tree selection unit composes the updated classification model.
Compared with the prior art, the present application has the following advantages:
In the method for updating a classification model provided by the present application, a training sample set is extracted from the incremental data of a recent time period, a newly added quantity of decision trees is generated according to the training sample set, and the predetermined quantity of decision trees is selected from the newly added decision trees and the original decision trees according to the historical classification accuracy and the current classification accuracy, together composing the updated classification model. With this method, since training need not be performed on the full data set, and incremental data is instead used to incrementally update the original classification model, the classification model can be dynamically updated at whatever time granularity is required, for example daily or in near real time, thereby improving the efficiency of model training and enabling rapid response to the business. Furthermore, because the classification effect of a decision tree is evaluated not only on the current classification accuracy but also on the introduced historical classification accuracy, the comprehensive classification effect of each decision tree can be assessed from a global perspective, which smooths out short-term fluctuations in the data and ensures that the updated classification model maintains a relatively stable classification prediction effect.
Detailed description of the invention
Many specific details are set forth in the following description to facilitate a full understanding of the present application. However, the present application can be implemented in many ways other than those described herein, and those skilled in the art may make similar generalizations without departing from the spirit of the present application; therefore, the present application is not limited by the specific implementations disclosed below.
The present application provides a method for updating a classification model and a device for updating a classification model, each of which is described in detail in the embodiments below.
Please refer to Fig. 1, which is a flow chart of an embodiment of the method for updating a classification model of the present application. The method comprises the following steps:
Step 101: obtain incremental data within a predetermined time period from a server that provides the user behavior data, and extract a training sample set therefrom.
The method for updating a classification model provided by the present application updates the classification model based on incremental data, so that the classification model can be adjusted in a timely or near-real-time manner in response to changes in the sample data, keeping the classification model synchronized with the most recent sample data.
In an actual service application, after user behavior data is obtained, class prediction can first be performed, by scoring, with the classification model already deployed online, i.e. the classification model composed of the predetermined quantity of original decision trees: the class with the highest score (the class selected by the most decision trees) is taken as the predicted class, and a preset service operation is performed based on this predicted class, for example making recommendations by class or performing risk control by class. After a period of time, the true class of the user behavior data can usually be learned from the user's subsequent operations or from a comprehensive analysis by the system, and a corresponding class label is added to the user behavior data. After this service flow, a batch of user behavior data with class labels is generally available, and at this point the present technical solution can be implemented to dynamically update the classification model.
This step first obtains the incremental data within the predetermined time period from the server that provides the user behavior data. The predetermined time period refers to a time period preceding the current moment; its length can be configured according to specific needs, for example in units of days, hours, or even minutes, as long as the user behavior data within that time period is already retrievable and contains the actual class label information.
After the incremental data is obtained, the training sample set is finally obtained through processing procedures such as preprocessing the incremental data, extracting feature variables, and extracting feature variable values. The whole process comprises steps 101-1 to 101-4 described below, which are further described with reference to Fig. 2.
Step 101-1: obtain the incremental data within the predetermined time period from the server that provides the user behavior data, wherein each piece of incremental data comprises a group of original variables with corresponding values, and a class label.
The incremental data obtained in this step generally comprises multiple pieces of user behavior data, each of which comprises a group of original variables with corresponding values and a class label identifying the true class. Each piece of incremental data has a form similar to: (original variable 1, x1; original variable 2, x2; ... original variable n, xn : y), where xi represents the value corresponding to original variable i (an original variable is also called an attribute, and its corresponding value an attribute value), and y is the class label of this piece of user behavior data.
For example, in a specific example of this embodiment, in the risk control field of an Internet service platform, a classification model is used to predict whether a user transaction behavior carries risk. The original variables in the incremental data obtained in this step may include: personal attribute information such as user account and age, commodity information attribute values such as category, title, and price, and information such as transaction amount. The class labels comprise two kinds, black and white samples (corresponding to risky and risk-free, respectively).
Step 101-2: preprocess the incremental data so that it meets the classification model's requirements on training sample data.
This step preprocesses the obtained incremental data, so that decision trees can be generated in subsequent steps based on the training sample set extracted from the incremental data. The preprocessing can include maximum/minimum value processing, missing value processing, and format conversion, each of which is described below.
A maximum or minimum value generally refers to a value beyond the upper or lower limit of the normal reasonable value range, for example an indoor temperature of 100 degrees Celsius, which exceeds the maximum of the reasonable range. Such values may be produced by the system, or may arise from human misoperation. In specific implementations, such data can be processed in a preset manner, for example: if such data appears only in individual pieces of user behavior data, the corresponding user behavior data can be deleted directly from the incremental data; if such data appears more frequently in the incremental data, a processing manner can be adopted in which a mean value is calculated and used to replace the original maximum and/or minimum values.
A missing value generally refers to an original variable having no corresponding value, which may occur because the system did not collect the data, for example the user did not fill in a certain item in a web form and the corresponding data collection program did not write a default value for it. In this case, similarly to the processing of maximum/minimum values above, the user behavior data with the incomplete original variable value can be deleted; alternatively, the missing value can be replaced with the mean of the values of that original variable in the other user behavior data.
Format conversion is usually needed because of the diversity of measurement units or of data encoding methods: some original variable values in the collected incremental data must be converted into values that meet the classification model's requirements. For example, if the classification model requires a certain original variable value to be provided in degrees Celsius while the corresponding value in the incremental data is in degrees Fahrenheit, the corresponding data in the incremental data must undergo format conversion.
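The three preprocessing operations described above can be sketched roughly as follows; the value range, the mean-replacement policy, and the Fahrenheit-to-Celsius conversion are illustrative assumptions of the sketch, not requirements of the application.

```python
def fahrenheit_to_celsius(f):
    """Format conversion: unify a Fahrenheit reading to the Celsius unit
    assumed to be required by the classification model."""
    return (f - 32) * 5.0 / 9.0

def clean_variable(values, lo, hi):
    """Replace missing values (None) and out-of-range extremes with the mean
    of the remaining valid values, mirroring the mean-replacement option."""
    valid = [v for v in values if v is not None and lo <= v <= hi]
    mean = sum(valid) / len(valid)
    return [v if (v is not None and lo <= v <= hi) else mean for v in values]

# A 100-degree indoor temperature is treated as an extreme, None as missing.
print(clean_variable([20.0, 100.0, None, 22.0], lo=0.0, hi=40.0))  # [20.0, 21.0, 21.0, 22.0]
print(round(fahrenheit_to_celsius(212.0), 1))                      # 100.0
```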
Preprocessing the incremental data in the above manner ensures the completeness, validity, and numerical correctness of the incremental data, so that the training sample set extracted from it meets the to-be-updated classification model's requirements on training sample data, thereby ensuring that the updated classification model can achieve a good prediction effect.
Step 101-3: extract feature variables from the preprocessed incremental data and add them to the feature variable set.
Each piece of user behavior data generally comprises a large number of original variables. On the one hand, not every original variable is meaningful for class prediction; on the other hand, the original variables comprised in user behavior data may change, for example being gradually enriched. For ease of management, this step can extract from the incremental data the original variables that help characterize user behavior, i.e. the feature variables (also called feature attributes), and add the chosen feature variables to the feature variable set (also called the feature variable pool).
Step 101-4: extract from each piece of incremental data the values of the feature variables that are chosen from the feature variable set and correspond to the classification model, and compose a training sample from the extracted values and the class label, thereby obtaining the training sample set.
Since different classification models have different classification functions, the feature variables they use may also differ; this step therefore chooses from the feature variable set the feature variables corresponding to the classification model to be updated. Then, according to these feature variables, the corresponding feature variable values are extracted from each piece of incremental data and composed, together with the class label of that piece of incremental data, into a training sample of a form similar to: (x1, x2, ... xn : y), where xi represents a feature variable value of the sample and y represents the class label of the sample. Processing each piece of incremental data in turn in the above manner yields the training sample set.
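Assuming each preprocessed record is represented as a dict and the label is stored under a hypothetical `label` key, step 101-4 can be sketched as:

```python
def to_training_sample(record, feature_vars):
    """Pick the values of the chosen feature variables from one incremental
    record and pair them with its class label, giving (x1, ..., xn : y)."""
    return [record[v] for v in feature_vars], record["label"]

# Hypothetical record: only "age" and "amount" are chosen feature variables.
record = {"age": 34, "amount": 120.5, "category": 3, "label": 1}
print(to_training_sample(record, ["age", "amount"]))  # ([34, 120.5], 1)
```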
So far, through steps 101-1 to 101-4, the training sample set has been extracted from the incremental user behavior data. It should be noted that in the initial stage after the classification model goes online, for example the first three or six months, the feature variables are generally not yet complete; as business understanding deepens, performing step 101-3 gradually adds original variables that help class prediction to the feature variable set. As the original variables comprised in the user behavior data stabilize and the classification model matures, the feature variables in the feature variable set also enter a relatively stable stage. In this case, step 101-3 may also be skipped, and the feature variables corresponding to the classification model can instead be chosen directly from the already stable feature variable set to further generate the training sample set.
Step 102: generate a newly added quantity of decision trees according to the training sample set.
This step generates the newly added quantity of decision trees. The newly added quantity is generally smaller than the predetermined quantity of original decision trees comprised in the classification model; its specific value can be set as an empirical value in consideration of factors such as the specific application scenario of the classification model, the scale of the training sample set, or the quantity of original decision trees the classification model comprises. For example, in an Internet application performing risk control, the range of the newly added quantity can be set within 1/40 to 1/10 of the predetermined quantity; if the classification model comprises 200 to 400 decision trees, the quantity of newly added decision trees can be set to 10. The above is merely an example; specific implementations can be configured with comprehensive reference to various factors.
In addition, the classification model can also be verified with the training sample set, and the quantity of newly added decision trees determined according to the verification result. Specifically, the classification model can be used to classify each sample in the training sample set, with the ratio of correctly classified samples to the total sample quantity taken as the classification accuracy, and the quantity of newly added decision trees adjusted according to this classification accuracy. For example, when the classification accuracy exceeds a preset threshold, the existing classification model can still classify the current training sample data relatively accurately, so a relatively small newly added quantity can be set; otherwise, a relatively large newly added quantity can be set.
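The threshold rule just described might be sketched as follows; the threshold and the two candidate quantities are illustrative values only, not values fixed by the application.

```python
def newly_added_quantity(current_accuracy, threshold=0.9, small=5, large=20):
    """Fewer new trees when the existing model still classifies the new
    samples well; more when its accuracy has fallen below the threshold."""
    return small if current_accuracy >= threshold else large

print(newly_added_quantity(0.95))  # 5
print(newly_added_quantity(0.70))  # 20
```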
After the newly added quantity is determined, the newly added quantity of decision trees can be generated. As an optional implementation, a certain quantity of samples can be randomly chosen from the training sample set, a decision tree generated from the chosen samples using a conventional decision tree algorithm, and the above steps of choosing samples and generating a decision tree repeated until the newly added quantity of decision trees has been generated.
In order to improve the efficiency of generating the newly added quantity of decision trees, avoid overfitting, and improve noise resistance, this embodiment provides a preferred implementation that generates the newly added quantity of decision trees using a random forest algorithm, specifically comprising steps 102-1 to 102-3, which are further described with reference to Fig. 3.
Step 102-1: build a bootstrap sample set from the training sample set by sampling with replacement.
The bootstrap sampling method (also called bootstrapping or Bootstrap sampling) is a uniform sampling method with replacement. This step draws N samples, with replacement, from the training sample set comprising N samples; in the drawing process, some samples in the training sample set are never drawn while others may be drawn repeatedly, and the N samples finally drawn compose one bootstrap sample set.
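Step 102-1 amounts to the standard bootstrap draw sketched below; the fixed seed is only there to make the sketch reproducible.

```python
import random

def bootstrap_sample(samples, rng):
    """Draw len(samples) samples with replacement: some appear repeatedly,
    some never appear, and the result forms one bootstrap sample set."""
    n = len(samples)
    return [samples[rng.randrange(n)] for _ in range(n)]

train = list(range(10))
boot = bootstrap_sample(train, random.Random(42))
print(len(boot))                      # 10
print(all(s in train for s in boot))  # True
```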
Step 102-2: use the bootstrap sample set to generate a new decision tree by choosing, at each node, a feature variable according to a predetermined policy and splitting on the chosen feature variable.
Using the bootstrap sample set, a new decision tree is generated by node-by-node splitting; the key lies in the selection of the split attribute (i.e. the feature variable) at each node. Specifically, for samples comprising M feature variables, whenever a node of the decision tree needs to split, m feature variables (generally satisfying the condition m << M) are first selected at random from the M feature variables, then 1 optimal feature variable is chosen from the m selected feature variables according to the predetermined policy, and the node is split on this feature variable. This process is repeated at each node until a node can no longer split or all the samples it comprises belong to the same class, at which point the splitting process ends and a new decision tree has been created.
In specific implementations, the number of feature variables to select at random can be obtained by calculating the square root and rounding, for example: if each sample comprises M = 100 feature variables, then m = sqrt(M) = 10 feature variables can be randomly selected each time. Naturally, the number of feature variables to select at random can also be determined in other ways, as long as the condition m << M is satisfied.
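The m = sqrt(M) rule can be sketched as follows; the feature names are hypothetical.

```python
import math
import random

def random_feature_subset(all_features, rng):
    """Randomly select m = round(sqrt(M)) of the M feature variables,
    which satisfies m << M for reasonably large M."""
    m = int(round(math.sqrt(len(all_features))))
    return rng.sample(all_features, m)

features = [f"f{i}" for i in range(100)]  # M = 100 hypothetical features
subset = random_feature_subset(features, random.Random(0))
print(len(subset))  # 10
```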
As for choosing the optimal feature variable from the randomly selected feature variables, a preset strategy can be adopted, for example according to information gain, information gain ratio, or the Gini index. The process of choosing the optimal feature variable in the above three ways and splitting to generate a decision tree belongs to relatively mature prior art, and its detailed procedure is not further described here.
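As one concrete instance of the "predetermined policy", the sketch below chooses the split feature with the lowest weighted Gini impurity among the randomly selected candidates; the restriction to binary 0/1 features and the toy samples are assumptions of the sketch, not of the application.

```python
def gini(labels):
    """Gini impurity of a set of class labels (0 means pure)."""
    n = len(labels)
    counts = {}
    for y in labels:
        counts[y] = counts.get(y, 0) + 1
    return 1.0 - sum((c / n) ** 2 for c in counts.values())

def best_split_feature(samples, candidate_features):
    """Among the candidate features, pick the one whose binary split yields
    the lowest weighted Gini impurity."""
    best, best_score = None, float("inf")
    for f in candidate_features:
        left = [y for x, y in samples if x[f] == 0]
        right = [y for x, y in samples if x[f] == 1]
        if not left or not right:
            continue
        n = len(samples)
        score = len(left) / n * gini(left) + len(right) / n * gini(right)
        if score < best_score:
            best, best_score = f, score
    return best

samples = [({"a": 0, "b": 0}, 0), ({"a": 0, "b": 1}, 0),
           ({"a": 1, "b": 0}, 1), ({"a": 1, "b": 1}, 1)]
print(best_split_feature(samples, ["a", "b"]))  # a  (separates the classes perfectly)
```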
As can be seen from the above description, the randomness of the random forest algorithm is embodied in two respects: the training samples of each tree are random, and the split attribute of each node within a tree is also randomly selected. Guaranteed by these two random characteristics, the decision trees generated by the random forest algorithm generally have good noise resistance and do not overfit.
This step can also record relevant information about each newly generated decision tree, including a decision tree identifier, such as a decision tree id, and the generation time.
Step 102-3: judge whether the number of newly generated decision trees is less than the newly added quantity; if so, return to step 102-1.

Each time a decision tree is newly generated, the count of newly generated trees can be incremented and compared with the newly added quantity. If the count is still smaller, execution returns to step 102-1 to continue generating new decision trees; otherwise, the number of newly generated trees already meets the requirement and generation need not continue.
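The loop of steps 102-1 to 102-3 can be sketched as follows. The tree-growing function is passed in as a parameter and the trivial majority-class "tree" used in the example is purely a hypothetical stand-in; the embodiment does not prescribe any particular implementation:

```python
import random

def bootstrap_sample(data, rng):
    """Step 102-2: draw len(data) items with replacement from the training sample set."""
    return [rng.choice(data) for _ in data]

def grow_new_trees(data, k, grow_tree, seed=0):
    """Steps 102-1 to 102-3: keep growing trees until k new trees exist."""
    rng = random.Random(seed)
    trees = []
    while len(trees) < k:                # step 102-3: compare the count with the target k
        sample = bootstrap_sample(data, rng)
        trees.append(grow_tree(sample))  # step 102-2: one new tree per bootstrap sample
    return trees

# a trivial majority-class predictor standing in for real tree growth:
data = [([0.1], "a"), ([0.9], "b"), ([0.2], "a"), ([0.8], "b")]
majority = lambda sample: max({lab for _, lab in sample},
                              key=[lab for _, lab in sample].count)
print(len(grow_new_trees(data, 3, majority)))  # 3
```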
Step 103: according to the historical classification accuracy and the current classification accuracy for the training sample set, select the predetermined number of decision trees from the newly added decision trees and the original decision trees to form the updated classification model.
Step 102 generates the newly added number of decision trees; this step selects the predetermined number of decision trees, according to classification performance, from the newly added trees and the original trees already in the classification model. If classification performance were evaluated only according to the current classification accuracy (the accuracy obtained on this training sample set), the result would usually be only a local optimum. When the data exhibit comparatively large random fluctuations, such fluctuations are mostly transient and do not represent a long-term, global trend; if the classification performance were assessed, and the best-performing decision trees screened, only against the current training sample set, the resulting updated classification model might be inaccurate, and its class predictions for future user behavior data might likewise be inaccurate.
To avoid this situation, the present technical solution introduces the historical classification accuracy, i.e., the classification accuracies recorded during previous updates of the classification model. By combining the historical classification accuracy with the current classification accuracy, the classification performance of a decision tree can be reflected more objectively, from a global perspective.
In a specific implementation, different strategies or algorithms can be used, based on the historical and current classification accuracies, to complete the selection of this step. This embodiment provides an implementation that calculates a composite classification accuracy and screens the decision trees according to this index; it specifically includes steps 103-1 to 103-3, described further below with reference to Fig. 4.
Step 103-1: calculate the composite classification accuracy of the newly added decision trees and the original decision trees according to the historical classification accuracy and the current classification accuracy for the training sample set.

Different computational methods can be used to calculate the composite classification accuracy from the historical and current classification accuracies; for example, a user-defined function or formula could be used. Considering both implementation effect and the maturity of the algorithm, this embodiment uses a moving average method to calculate the composite classification accuracy. Calculating the composite classification accuracy of the newly added and original decision trees with a moving average method includes steps 103-1-1 to 103-1-3, further illustrated below with reference to Fig. 5.
Step 103-1-1: according to the training sample set, calculate the current classification accuracy of each newly added decision tree and use it as its composite classification accuracy.

This step calculates the current classification accuracy of each newly added decision tree against the acquired training sample set. Specifically, the current classification accuracy of a newly added decision tree can be calculated in any one of the following three ways:

1) classify each sample in the training sample set with the newly added decision tree, and take the ratio of the number of correct classifications to the total number of samples as the current classification accuracy;

2) if the newly added decision tree was generated with the random forest algorithm, a bootstrap sample set was used during its generation; the out-of-bag samples, i.e., the samples included in the training sample set but not included in the bootstrap sample set, can therefore be used to calculate the current classification accuracy with a method similar to 1);

3) if the newly added decision tree was generated with the random forest algorithm, the samples in the bootstrap sample set can also be used directly to calculate the current classification accuracy in a manner similar to 1).
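All three ways reduce to "correct classifications over total samples"; way 2) only changes which samples are scored. A sketch, with a stand-in predictor in place of a trained decision tree (the function names are hypothetical):

```python
def classification_accuracy(tree_predict, samples):
    """Way 1): ratio of correctly classified samples to the total sample count."""
    correct = sum(1 for features, label in samples if tree_predict(features) == label)
    return correct / len(samples)

def out_of_bag_accuracy(tree_predict, training_set, bootstrap_set):
    """Way 2): score only the samples in the training set but not in the bootstrap set."""
    oob = [s for s in training_set if s not in bootstrap_set]
    return classification_accuracy(tree_predict, oob)

samples = [((1,), "pos"), ((0,), "neg"), ((1,), "pos"), ((0,), "pos")]
by_first = lambda f: "pos" if f[0] == 1 else "neg"   # stand-in for a trained tree
print(classification_accuracy(by_first, samples))    # 0.75: 3 of 4 samples correct
```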
For a newly added decision tree there is usually no historical classification accuracy information, so its current classification accuracy can be used as its composite classification accuracy. However, if the classification model once contained a decision tree identical to the newly added one, and the historical classification accuracy of that tree is still retained, the composite classification accuracy of the newly added tree can also be calculated in a manner similar to step 103-1-3.
Step 103-1-2: according to the training sample set, calculate the current classification accuracy of each original decision tree.

Similar to way 1) described in step 103-1-1 above, this step can classify each sample in the training sample set with an original decision tree and take the ratio of the number of correct classifications to the total number of samples as the current classification accuracy of that original decision tree. Every original decision tree contained in the classification model is processed in this way, yielding the current classification accuracy of each original decision tree.
Step 103-1-3: calculate the composite classification accuracy of each original decision tree according to the historical classification accuracy and the current classification accuracy.

This embodiment uses a moving average method to calculate the composite classification accuracy of the original decision trees. A moving average method generally refers to calculating, over a time series, the average (weighted mean) of successive data items according to specific weight coefficients, thereby smoothing out random fluctuations in the data and reflecting the trend of the data more objectively.
The basic formula for calculating the composite classification accuracy of a decision tree with a moving average method is:

    MA = w1·p1 + w2·p2 + ... + wn·pn

where pi is a data item in the time series, i.e., a classification accuracy, including the current classification accuracy and the historical classification accuracies recorded during previous updates of the classification model; wi is the weight coefficient corresponding to that classification accuracy, the weight coefficients usually summing to 1; and n is the number of data items in the time series participating in the calculation.
The simplest moving average method is the simple moving average (Simple Moving Average, SMA): the unweighted arithmetic mean of the current classification accuracy and the historical classification accuracies is calculated, and this mean is taken as the composite classification accuracy. In this way, the current classification accuracy and each historical classification accuracy are given the same weight coefficient, 1/n, which smooths the short-term fluctuations of the data well and reflects the long-term trend of the data.
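The simple moving average over the accuracy series is a one-line computation; the series below is hypothetical, ordered from most recent (the current accuracy) to oldest:

```python
def simple_moving_average(accuracies):
    """SMA: unweighted mean of the current and historical classification accuracies."""
    return sum(accuracies) / len(accuracies)

# current accuracy 0.90, historical accuracies 0.80 and 0.70
print(simple_moving_average([0.90, 0.80, 0.70]))  # ~0.8: each item weighted 1/3
```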
Considering that, in practical applications, the classification accuracy data of different periods may contribute differently to evaluating the classification performance of a decision tree, with more distant accuracies having lower influence and more recent accuracies evaluating current performance more exactly, this embodiment also provides preferred implementations using the weighted moving average method or the exponential moving average method, in order to emphasize the influence of recent data while still smoothing the short-term fluctuations of the data.
The so-called weighted moving average (Weighted Moving Average, WMA) sets a different weight coefficient for each data item when calculating the weighted mean, in the following way: for n data items, the denominator of the weight coefficients can be set to A = n + (n-1) + (n-2) + ... + 2 + 1; the weight coefficient of the most recent data item (p1, e.g., the current classification accuracy) is set to n/A, that of the second most recent data item (p2) to (n-1)/A, and so on down to 1/A. Refer to Fig. 6, which shows the weight distribution of a weighted moving average with n = 15. The computing formula of the weighted moving average is:

    WMA = (n·p1 + (n-1)·p2 + ... + 2·p(n-1) + 1·pn) / A
The so-called exponential moving average (Exponential Moving Average, EMA), compared with the weighted moving average method above, is a moving average whose weight coefficients decrease exponentially. The weighted influence of each data item decreases exponentially with time: the most recent data carry the heaviest weight, while more distant data are still given some weight. Refer to Fig. 7, which shows the weight distribution of an exponential moving average with n = 20. In a specific implementation, the degree of weighting can be determined by a constant α, with α between 0 and 1; α can also be expressed in terms of the number n of data items participating in the calculation, e.g., α = 2/(n+1). The computing formula of the exponential moving average based on the constant α is:

    EMA_t = α·p_t + (1 - α)·EMA_(t-1)
As can be seen from the above description, the weight coefficients of the weighted and exponential moving average methods are determined mainly by the generation time of the data items: the more recent a data item, the larger its weight coefficient, and the more distant, the smaller. Applied to the present embodiment, a larger weight coefficient can be set for the current classification accuracy, with progressively smaller weight coefficients for the more distant historical classification accuracies. A composite classification accuracy calculated in this way can both reflect the classification performance of a decision tree from a global perspective and highlight its recent classification performance; a classification model formed from the decision trees screened by this index can therefore usually produce more accurate results in subsequent classification predictions.
It can also be seen from the above description that the difference between the weighted and exponential moving average methods is that, in the exponential moving average, the weighted influence of each data item decreases not linearly but exponentially. Therefore, for application scenarios where the data change rapidly (for example, when a website runs a promotional campaign), the exponential moving average method can generally be used, while for scenarios where the data are relatively smooth (for example, ordinary working days), the weighted moving average method can generally be used.
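The weighted and exponential moving averages described above can be sketched directly from their formulas. In the sketch below, `accuracies` is ordered from most recent (p1, the current accuracy) to oldest (pn); the default α = 2/(n+1) follows the example given earlier, and seeding the EMA with the oldest accuracy is an implementation assumption, not something the embodiment prescribes:

```python
def weighted_moving_average(accuracies):
    """WMA: weights n/A, (n-1)/A, ..., 1/A with A = n + (n-1) + ... + 1."""
    n = len(accuracies)
    a = n * (n + 1) // 2                     # A = n + (n-1) + ... + 2 + 1
    return sum((n - i) * p for i, p in enumerate(accuracies)) / a

def exponential_moving_average(accuracies, alpha=None):
    """EMA: EMA_t = alpha*p_t + (1 - alpha)*EMA_(t-1), iterated from oldest to newest."""
    alpha = 2 / (len(accuracies) + 1) if alpha is None else alpha
    ema = accuracies[-1]                     # seed with the oldest accuracy (assumption)
    for p in reversed(accuracies[:-1]):      # fold in ever more recent accuracies
        ema = alpha * p + (1 - alpha) * ema
    return ema

acc = [0.90, 0.80, 0.70]                     # most recent first
print(weighted_moving_average(acc))          # (3*0.9 + 2*0.8 + 1*0.7) / 6 ~ 0.833
```

Both averages weight the recent 0.90 more heavily than the SMA would, which is exactly the behavior motivated above.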
When this step is specifically implemented, the stored classification accuracy data can be searched, according to the identifier of an original decision tree such as its decision tree id, for the classification accuracies of that tree and their generation times within a certain period before this update (e.g., the last 7 days or 1 month); these data are generally recorded during previous updates of the classification model. Then, according to the concrete application demand or scenario, the retrieved classification accuracies and the current classification accuracy are used with one of the moving average methods above to calculate the composite classification accuracy of the tree. The above operations are performed for every original decision tree contained in the classification model, thereby obtaining the composite classification accuracy of each original decision tree.
So far, through steps 103-1-1 to 103-1-3, the composite classification accuracy of every newly added and original decision tree has been obtained on the basis of the current classification accuracies. So that complete historical classification accuracy data are available when the classification model is updated in the future, after steps 103-1-1 and 103-1-2 calculate the current classification accuracies of the newly added and original decision trees, these data, together with the calculation time, can be stored in the relevant information of the corresponding decision tree. That is, the relevant information of each decision tree includes not only the decision tree identifier and generation time, but may also include a time series of classification accuracies.
Step 103-2: sort the newly added decision trees and the original decision trees according to the composite classification accuracy.

This step sorts the newly added and original decision trees according to the composite classification accuracy calculated in step 103-1, i.e., arranges them in order of composite classification accuracy from high to low, so that trees with a higher composite classification accuracy are placed before trees with a lower one, in preparation for the screening of the subsequent step 103-3.
Step 103-3: select the predetermined number of decision trees ranked highest from the sorted decision trees.

For example, if the classification model contains T original decision trees and K decision trees were generated in step 102, then this step selects, from the T+K decision trees sorted by composite classification accuracy, the top-ranked T decision trees, i.e., those with the best composite classification accuracy, and assembles them into the updated classification model.
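The screening of steps 103-2 and 103-3 amounts to a sort-and-truncate, sketched below with hypothetical (id, composite accuracy) pairs:

```python
def select_updated_model(original, newly_added, t):
    """Keep the t trees with the highest composite classification accuracy
    out of the T original plus K newly added trees (steps 103-2 and 103-3)."""
    candidates = original + newly_added                 # T + K (id, accuracy) pairs
    ranked = sorted(candidates, key=lambda tree: tree[1], reverse=True)
    return ranked[:t]

original = [("old-1", 0.81), ("old-2", 0.76), ("old-3", 0.88)]   # T = 3
newly_added = [("new-1", 0.84), ("new-2", 0.74)]                 # K = 2
print([tid for tid, _ in select_updated_model(original, newly_added, 3)])
# ['old-3', 'new-1', 'old-1']: the two unselected trees can then be deleted
```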
So far, the screening of the decision trees has been completed through steps 103-1 to 103-3 and the updated classification model obtained. In a specific implementation, the unselected decision trees and their relevant information, including the decision tree identifier, generation time, and the information related to classification accuracy, can be deleted at this point. If a deleted tree is a newly added decision tree, deleting the information related to classification accuracy means deleting only the current classification accuracy and the calculation time; if a deleted tree is an original decision tree, then not only the current classification accuracy and calculation time, but also the previously recorded historical classification accuracies and their time information, can be deleted.
In addition, since the present technical solution provides a dynamic update mechanism by which the classification model can be continuously refined, the solution can also be applied when the classification model is first created (the process of building it from nothing can likewise be regarded as an update process). Specifically, before step 102 is performed, it is first judged whether a classification model has already been created; if not, the predetermined quantity is taken as the newly added quantity, the predetermined number of decision trees is created in step 102, the current classification accuracy of each newly added tree is calculated in step 103 for reference during subsequent updates of the classification model, and the newly added decision trees are assembled directly into the classification model. With this implementation, the creation and update processes of the classification model are unified, reducing manual involvement and facilitating maintenance and management.
In summary, the method for updating a classification model provided by this embodiment does not require training on the full data; instead, it performs an incremental update on the basis of the original classification model using incremental data. The classification model can therefore be dynamically updated at various time granularities as required, e.g., daily or in near real time, which improves the efficiency of model training and enables a quick response to the business. Furthermore, since the classification performance of a decision tree is evaluated not only according to the current classification accuracy but also with the introduced historical classification accuracy, the composite classification performance of a tree can be assessed from a global perspective, smoothing the short-term fluctuations of the data and ensuring that the updated classification model maintains a comparatively stable classification prediction performance.
The above embodiments provide a method for updating a classification model; correspondingly, the present application also provides an apparatus for updating a classification model. Refer to Fig. 8, which is a schematic diagram of an embodiment of an apparatus for updating a classification model according to the present application. Since the apparatus embodiment is substantially similar to the method embodiment, it is described relatively simply; for relevant parts, refer to the description of the method embodiment. The apparatus embodiment described below is merely schematic.
An apparatus for updating a classification model according to this embodiment includes: a training sample set extraction unit 801, configured to acquire the incremental data within a predetermined time period from the server providing the user behavior data and extract a training sample set therefrom; a decision tree generation unit 802, configured to generate the newly added number of decision trees according to the training sample set; and a decision tree selection unit 803, configured to select, according to the historical classification accuracy and the current classification accuracy for the training sample set, the predetermined number of decision trees from the newly added decision trees and the original decision trees to form the updated classification model.
Optionally, the training sample set extraction unit includes:

an incremental data acquisition subunit, configured to acquire the incremental data within the predetermined time period from the server providing the user behavior data, wherein each piece of incremental data contains a group of original variables with their corresponding values and a class label;

a data preprocessing subunit, configured to preprocess the incremental data so that it meets the classification model's requirements on training sample data;

a feature extraction subunit, configured to perform the following operation on each piece of incremental data to generate the training sample set: extract from the incremental data the values of the characteristic variables corresponding to the classification model, and form a training sample from the extracted values and the class label.
Optionally, the data preprocessing subunit includes at least one of the following subunits:

an extreme value processing subunit, configured to process maxima and/or minima in the incremental data in a preset manner;

a missing value processing subunit, configured to process missing values in the incremental data in a preset manner;

a format conversion subunit, configured to perform the corresponding format conversion according to the classification model's format requirements on sample data.
Optionally, the training sample set extraction unit further includes:

a characteristic variable extraction subunit, configured to extract characteristic variables from the preprocessed incremental data and add them to a characteristic variable set;

and the feature extraction subunit is specifically configured to perform the following operation on each piece of incremental data to generate the training sample set: extract from the incremental data the values of the characteristic variables chosen from the characteristic variable set and corresponding to the classification model, and form a training sample from the extracted values and the class label.
Optionally, the decision tree generation unit is specifically configured to generate the newly added number of decision trees according to the training sample set using a random forest algorithm.
Optionally, the decision tree generation unit includes:

a loop control subunit, configured to invoke the following subunits to create a decision tree whenever the number of generated decision trees is less than the newly added quantity;

a bootstrap sample set construction subunit, configured to build a bootstrap sample set from the training sample set by sampling with replacement;

a decision tree generation execution subunit, configured to generate a new decision tree from the bootstrap sample set by selecting a characteristic variable at each node according to a predetermined policy and splitting the node according to the selected variable, where selecting a characteristic variable according to a predetermined policy means choosing the optimal characteristic variable, according to the predetermined policy, from the randomly selected characteristic variables.
Optionally, the predetermined policy used by the decision tree generation execution subunit includes: choosing the attribute according to information gain, according to information gain ratio, or according to the Gini index.
Optionally, the decision tree selection unit includes:

a composite index calculation subunit, configured to calculate the composite classification accuracy of the newly added and original decision trees according to the historical classification accuracy and the current classification accuracy for the training sample set;

a sorting subunit, configured to sort the newly added and original decision trees according to the composite classification accuracy;

a screening subunit, configured to select the predetermined number of decision trees ranked highest from the sorted decision trees.
Optionally, the composite index calculation subunit includes:

a newly added decision tree composite index calculation subunit, configured to calculate, according to the training sample set, the current classification accuracy of a newly added decision tree as its composite classification accuracy;

an original decision tree current index calculation subunit, configured to calculate the current classification accuracy of an original decision tree according to the training sample set;

an original decision tree composite index calculation subunit, configured to calculate the composite classification accuracy of an original decision tree according to the historical classification accuracy and the current classification accuracy.
Optionally, the original decision tree composite index calculation subunit is specifically configured to calculate the composite classification accuracy of an original decision tree using a moving average method.
Optionally, the moving average method used by the original decision tree composite index calculation subunit includes: the weighted moving average method or the exponential moving average method.
Optionally, the apparatus further includes:

a classification model creation judging unit, configured to judge, before triggering the operation of the decision tree generation unit, whether the classification model has already been created, and if not, to take the predetermined quantity as the newly added quantity.
Optionally, the apparatus further includes:

a decision tree deletion unit, configured to delete the unselected decision trees after the decision tree selection unit forms the updated classification model.
Although the present application is disclosed above with preferred embodiments, they are not intended to limit the application. Any person skilled in the art can make possible variations and modifications without departing from the spirit and scope of the application; the protection scope of the application should therefore be defined by the scope of the claims of the application.
In a typical configuration, a computing device includes one or more processors (CPU), input/output interfaces, network interfaces, and memory.

The memory may include volatile memory in a computer-readable medium, such as random access memory (RAM), and/or non-volatile memory, such as read-only memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.
1. Computer-readable media include permanent and non-permanent, removable and non-removable media, and can implement information storage by any method or technology. The information can be computer-readable instructions, data structures, program modules, or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technologies, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape or magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. As defined herein, computer-readable media do not include transitory media, such as modulated data signals and carrier waves.
2. It will be understood by those skilled in the art that the embodiments of the present application can be provided as a method, a system, or a computer program product. Therefore, the application can take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Moreover, the application can take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, optical storage, etc.) containing computer-usable program code.