Summary of the invention
Embodiments of the present application provide a method for updating a classification model, to solve the problems of low training efficiency and untimely updates in existing approaches to updating classification models. Embodiments of the present application also provide a device for updating a classification model.
The present application provides a method for updating a classification model, the classification model comprising a predetermined quantity of original decision trees and being used for class prediction based on user behavior data in a network application, the method comprising:
obtaining incremental data within a predetermined time period from a server that provides the user behavior data, and extracting a training sample set therefrom;
generating a newly added quantity of decision trees according to the training sample set;
selecting the predetermined quantity of decision trees from the newly added decision trees and the original decision trees according to a historical classification accuracy and a current classification accuracy for the training sample set, to compose the updated classification model.
Optionally, the obtaining incremental data within a predetermined time period from a server that provides the user behavior data, and extracting a training sample set therefrom, comprises:
obtaining the incremental data within the predetermined time period from the server that provides the user behavior data, wherein each piece of incremental data comprises a group of original variables with corresponding values, and a class label;
preprocessing the incremental data so that it meets the classification model's requirements on training sample data;
performing the following operation on each piece of incremental data to generate the training sample set: extracting from the incremental data the values of the feature variables corresponding to the classification model, and composing a training sample from the extracted values and the class label.
Optionally, the preprocessing the incremental data comprises one of the following, or any combination thereof:
processing maximum and/or minimum values in the incremental data in a preset manner;
processing missing values in the incremental data in a preset manner;
performing corresponding format conversion according to the classification model's format requirements on sample data.
Optionally, the obtaining incremental data within a predetermined time period from a server that provides the user behavior data, and extracting a training sample set therefrom, further comprises:
extracting feature variables from the preprocessed incremental data and adding them to a feature variable set;
the feature variables corresponding to the classification model refer to the feature variables that are chosen from the feature variable set and correspond to the classification model.
Optionally, the generating a newly added quantity of decision trees according to the training sample set comprises:
generating the newly added quantity of decision trees using a random forest algorithm according to the training sample set.
Optionally, the generating the newly added quantity of decision trees using a random forest algorithm according to the training sample set comprises:
building a bootstrap sample set from the training sample set by sampling with replacement;
using the bootstrap sample set, generating a new decision tree by choosing, at each node, a feature variable according to a predetermined policy and splitting on the chosen feature variable; the choosing a feature variable according to a predetermined policy refers to choosing the optimal feature variable, according to the predetermined policy, from among randomly selected feature variables;
returning to the step of building a bootstrap sample set from the training sample set by sampling with replacement, and continuing execution until the newly added quantity of decision trees has been generated.
Optionally, the choosing the optimal feature variable according to the predetermined policy comprises: choosing the optimal feature variable according to information gain, according to information gain ratio, or according to the Gini index.
Optionally, the selecting the predetermined quantity of decision trees from the newly added decision trees and the original decision trees according to the historical classification accuracy and the current classification accuracy for the training sample set comprises:
calculating comprehensive classification accuracies of the newly added decision trees and the original decision trees according to the historical classification accuracy and the current classification accuracy for the training sample set;
ranking the newly added decision trees and the original decision trees by their comprehensive classification accuracies;
selecting the predetermined quantity of highest-ranked decision trees from the ranked decision trees.
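The ranking-and-selection step described above can be sketched as follows; the function name, tree identifiers, and accuracy scores are illustrative assumptions, not part of the application.

```python
# Illustrative sketch: rank all candidate trees (newly added and original) by
# their comprehensive classification accuracy and keep the top N trees to
# form the updated classification model.

def select_trees(tree_ids, comprehensive_accuracy, predetermined_quantity):
    """Sort tree ids by comprehensive accuracy, highest first, keep the top N."""
    ranked = sorted(tree_ids, key=lambda t: comprehensive_accuracy[t], reverse=True)
    return ranked[:predetermined_quantity]

scores = {"orig_1": 0.81, "orig_2": 0.76, "orig_3": 0.69,
          "new_1": 0.84, "new_2": 0.72}
kept = select_trees(list(scores), scores, 3)
print(kept)  # ['new_1', 'orig_1', 'orig_2']
```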
Optionally, the calculating comprehensive classification accuracies of the newly added decision trees and the original decision trees according to the historical classification accuracy and the current classification accuracy for the training sample set comprises:
calculating, according to the training sample set, the current classification accuracy of each newly added decision tree, and using it as that tree's comprehensive classification accuracy;
calculating, according to the training sample set, the current classification accuracy of each original decision tree;
calculating the comprehensive classification accuracy of each original decision tree according to its historical classification accuracy and current classification accuracy.
Optionally, the calculating the comprehensive classification accuracy of each original decision tree according to the historical classification accuracy and the current classification accuracy is implemented in the following way:
calculating the comprehensive classification accuracy of the original decision tree using a moving average method.
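As a minimal sketch of the moving-average idea, assuming an exponential moving average with a hypothetical smoothing factor `alpha` (the application itself does not fix a value):

```python
def comprehensive_accuracy(history_acc, current_acc, alpha=0.3):
    """Exponentially weighted blend of this round's classification accuracy
    with the historical comprehensive accuracy; alpha is a hypothetical
    weight controlling how quickly history is forgotten."""
    return alpha * current_acc + (1 - alpha) * history_acc

# With equal weighting, a history of 0.8 and a current round of 0.6 blend to 0.7.
print(round(comprehensive_accuracy(0.8, 0.6, alpha=0.5), 6))  # 0.7
```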
Optionally, the moving average method comprises: a weighted moving average method or an exponential moving average method.
Optionally, before performing the step of generating the newly added quantity of decision trees, the following operations are performed:
judging whether the classification model has already been created;
if not, using the predetermined quantity as the newly added quantity.
Optionally, after the step of selecting the predetermined quantity of decision trees from the newly added decision trees and the original decision trees to compose the updated classification model, the following operation is performed:
deleting the unselected decision trees.
Correspondingly, the present application also provides a device for updating a classification model, comprising:
a training sample set extraction unit, configured to obtain incremental data within a predetermined time period from a server that provides the user behavior data, and to extract a training sample set therefrom;
a decision tree generation unit, configured to generate a newly added quantity of decision trees according to the training sample set;
a decision tree selection unit, configured to select the predetermined quantity of decision trees from the newly added decision trees and the original decision trees according to a historical classification accuracy and a current classification accuracy for the training sample set, to compose the updated classification model.
Optionally, the training sample set extraction unit comprises:
an incremental data obtaining subunit, configured to obtain the incremental data within the predetermined time period from the server that provides the user behavior data, wherein each piece of incremental data comprises a group of original variables with corresponding values, and a class label;
a data preprocessing subunit, configured to preprocess the incremental data so that it meets the classification model's requirements on training sample data;
a feature extraction subunit, configured to perform the following operation on each piece of incremental data to generate the training sample set: extracting from the incremental data the values of the feature variables corresponding to the classification model, and composing a training sample from the extracted values and the class label.
Optionally, the data preprocessing subunit comprises at least one of the following subunits:
an extreme value processing subunit, configured to process maximum and/or minimum values in the incremental data in a preset manner;
a missing value processing subunit, configured to process missing values in the incremental data in a preset manner;
a format conversion subunit, configured to perform corresponding format conversion according to the classification model's format requirements on sample data.
Optionally, the training sample set extraction unit further comprises:
a feature variable extraction subunit, configured to extract feature variables from the preprocessed incremental data and add them to a feature variable set;
the feature extraction subunit is specifically configured to perform the following operation on each piece of incremental data to generate the training sample set: extracting from the incremental data the values of the feature variables that are chosen from the feature variable set and correspond to the classification model, and composing a training sample from the extracted values and the class label.
Optionally, the decision tree generation unit is specifically configured to generate the newly added quantity of decision trees using a random forest algorithm according to the training sample set.
Optionally, the decision tree generation unit comprises:
a loop control subunit, configured to invoke the following subunits to create decision trees while the quantity of generated decision trees is less than the newly added quantity;
a bootstrap sample set building subunit, configured to build a bootstrap sample set from the training sample set by sampling with replacement;
a decision tree generation execution subunit, configured to use the bootstrap sample set to generate a new decision tree by choosing, at each node, a feature variable according to a predetermined policy and splitting on the chosen feature variable; the choosing a feature variable according to a predetermined policy refers to choosing the optimal feature variable, according to the predetermined policy, from among randomly selected feature variables.
Optionally, the predetermined policy used by the decision tree generation execution subunit comprises: choosing the attribute according to information gain, choosing the attribute according to information gain ratio, or choosing the attribute according to the Gini index.
Optionally, the decision tree selection unit comprises:
a comprehensive indicator calculation subunit, configured to calculate comprehensive classification accuracies of the newly added decision trees and the original decision trees according to the historical classification accuracy and the current classification accuracy for the training sample set;
a ranking subunit, configured to rank the newly added decision trees and the original decision trees by their comprehensive classification accuracies;
a screening subunit, configured to select the predetermined quantity of highest-ranked decision trees from the ranked decision trees.
Optionally, the comprehensive indicator calculation subunit comprises:
a newly added decision tree comprehensive indicator calculation subunit, configured to calculate, according to the training sample set, the current classification accuracy of each newly added decision tree as its comprehensive classification accuracy;
an original decision tree current indicator calculation subunit, configured to calculate, according to the training sample set, the current classification accuracy of each original decision tree;
an original decision tree comprehensive indicator calculation subunit, configured to calculate the comprehensive classification accuracy of each original decision tree according to its historical classification accuracy and current classification accuracy.
Optionally, the original decision tree comprehensive indicator calculation subunit is specifically configured to calculate the comprehensive classification accuracy of the original decision tree using a moving average method.
Optionally, the moving average method used by the original decision tree comprehensive indicator calculation subunit comprises: a weighted moving average method or an exponential moving average method.
Optionally, the device further comprises:
a classification model creation judging unit, configured to judge, before triggering the operation of the decision tree generation unit, whether the classification model has already been created, and if not, to use the predetermined quantity as the newly added quantity.
Optionally, the device further comprises:
a decision tree deletion unit, configured to delete the unselected decision trees after the decision tree selection unit composes the updated classification model.
Compared with the prior art, the present application has the following advantages:
In the method for updating a classification model provided by the present application, a training sample set is extracted from the incremental data of a recent time period, a newly added quantity of decision trees is generated according to the training sample set, and the predetermined quantity of decision trees is selected from the newly added decision trees and the original decision trees according to the historical classification accuracy and the current classification accuracy, together composing the updated classification model. With this method, since training need not be performed on the full data set, and incremental data is instead used to incrementally update the original classification model, the classification model can be dynamically updated at whatever time granularity is required, for example daily or in near real time, thereby improving the efficiency of model training and enabling rapid response to the business. Furthermore, because the classification effect of a decision tree is evaluated not only on the current classification accuracy but also on the introduced historical classification accuracy, the comprehensive classification effect of each decision tree can be assessed from a global perspective, which smooths out short-term fluctuations in the data and ensures that the updated classification model maintains a relatively stable classification prediction effect.
Detailed description of the invention
Many specific details are set forth in the following description to facilitate a full understanding of the present application. However, the present application can be implemented in many ways other than those described herein, and those skilled in the art may make similar generalizations without departing from the spirit of the present application; therefore, the present application is not limited by the specific implementations disclosed below.
The present application provides a method for updating a classification model and a device for updating a classification model, each of which is described in detail in the embodiments below.
Please refer to Fig. 1, which is a flow chart of an embodiment of the method for updating a classification model of the present application. The method comprises the following steps:
Step 101: obtain incremental data within a predetermined time period from a server that provides the user behavior data, and extract a training sample set therefrom.
The method for updating a classification model provided by the present application updates the classification model based on incremental data, so that the classification model can be adjusted in a timely or near-real-time manner in response to changes in the sample data, keeping the classification model synchronized with the most recent sample data.
In an actual service application, after user behavior data is obtained, class prediction can first be performed, by scoring, with the classification model already deployed online, i.e. the classification model composed of the predetermined quantity of original decision trees: the class with the highest score (the class selected by the most decision trees) is taken as the predicted class, and a preset service operation is performed based on this predicted class, for example making recommendations by class or performing risk control by class. After a period of time, the true class of the user behavior data can usually be learned from the user's subsequent operations or from a comprehensive analysis by the system, and a corresponding class label is added to the user behavior data. After this service flow, a batch of user behavior data with class labels is generally available, and at this point the present technical solution can be implemented to dynamically update the classification model.
This step first obtains the incremental data within the predetermined time period from the server that provides the user behavior data. The predetermined time period refers to a time period preceding the current moment; its length can be configured according to specific needs, for example in units of days, hours, or even minutes, as long as the user behavior data within that time period is already retrievable and contains the actual class label information.
After the incremental data is obtained, the training sample set is finally obtained through processing procedures such as preprocessing the incremental data, extracting feature variables, and extracting feature variable values. The whole process comprises steps 101-1 to 101-4 described below, which are further described with reference to Fig. 2.
Step 101-1: obtain the incremental data within the predetermined time period from the server that provides the user behavior data, wherein each piece of incremental data comprises a group of original variables with corresponding values, and a class label.
The incremental data obtained in this step generally comprises multiple pieces of user behavior data, each of which comprises a group of original variables with corresponding values and a class label identifying the true class. Each piece of incremental data has a form similar to: (original variable 1, x1; original variable 2, x2; ... original variable n, xn : y), where xi represents the value corresponding to original variable i (an original variable is also called an attribute, and its corresponding value an attribute value), and y is the class label of this piece of user behavior data.
For example, in a specific example of this embodiment, in the risk control field of an Internet service platform, a classification model is used to predict whether a user transaction behavior carries risk. The original variables in the incremental data obtained in this step may include: personal attribute information such as user account and age, commodity information attribute values such as category, title, and price, and information such as transaction amount. The class labels comprise two kinds, black and white samples (corresponding to risky and risk-free, respectively).
Step 101-2: preprocess the incremental data so that it meets the classification model's requirements on training sample data.
This step preprocesses the obtained incremental data, so that decision trees can be generated in subsequent steps based on the training sample set extracted from the incremental data. The preprocessing can include maximum/minimum value processing, missing value processing, and format conversion, each of which is described below.
A maximum or minimum value generally refers to a value beyond the upper or lower limit of the normal reasonable value range, for example an indoor temperature of 100 degrees Celsius, which exceeds the maximum of the reasonable range. Such values may be produced by the system, or may arise from human misoperation. In specific implementations, such data can be processed in a preset manner, for example: if such data appears only in individual pieces of user behavior data, the corresponding user behavior data can be deleted directly from the incremental data; if such data appears more frequently in the incremental data, a processing manner can be adopted in which a mean value is calculated and used to replace the original maximum and/or minimum values.
A missing value generally refers to an original variable having no corresponding value, which may occur because the system did not collect the data, for example the user did not fill in a certain item in a web form and the corresponding data collection program did not write a default value for it. In this case, similarly to the processing of maximum/minimum values above, the user behavior data with the incomplete original variable value can be deleted; alternatively, the missing value can be replaced with the mean of the values of that original variable in the other user behavior data.
Format conversion is usually needed because of the diversity of measurement units or of data encoding methods: some original variable values in the collected incremental data must be converted into values that meet the classification model's requirements. For example, if the classification model requires a certain original variable value to be provided in degrees Celsius while the corresponding value in the incremental data is in degrees Fahrenheit, the corresponding data in the incremental data must undergo format conversion.
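The three preprocessing operations described above can be sketched roughly as follows; the value range, the mean-replacement policy, and the Fahrenheit-to-Celsius conversion are illustrative assumptions of the sketch, not requirements of the application.

```python
def fahrenheit_to_celsius(f):
    """Format conversion: unify a Fahrenheit reading to the Celsius unit
    assumed to be required by the classification model."""
    return (f - 32) * 5.0 / 9.0

def clean_variable(values, lo, hi):
    """Replace missing values (None) and out-of-range extremes with the mean
    of the remaining valid values, mirroring the mean-replacement option."""
    valid = [v for v in values if v is not None and lo <= v <= hi]
    mean = sum(valid) / len(valid)
    return [v if (v is not None and lo <= v <= hi) else mean for v in values]

# A 100-degree indoor temperature is treated as an extreme, None as missing.
print(clean_variable([20.0, 100.0, None, 22.0], lo=0.0, hi=40.0))  # [20.0, 21.0, 21.0, 22.0]
print(round(fahrenheit_to_celsius(212.0), 1))                      # 100.0
```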
Preprocessing the incremental data in the above manner ensures the completeness, validity, and numerical correctness of the incremental data, so that the training sample set extracted from it meets the to-be-updated classification model's requirements on training sample data, thereby ensuring that the updated classification model can achieve a good prediction effect.
Step 101-3: extract feature variables from the preprocessed incremental data and add them to the feature variable set.
Each piece of user behavior data generally comprises a large number of original variables. On the one hand, not every original variable is meaningful for class prediction; on the other hand, the original variables comprised in user behavior data may change, for example being gradually enriched. For ease of management, this step can extract from the incremental data the original variables that help characterize user behavior, i.e. the feature variables (also called feature attributes), and add the chosen feature variables to the feature variable set (also called the feature variable pool).
Step 101-4: extract from each piece of incremental data the values of the feature variables that are chosen from the feature variable set and correspond to the classification model, and compose a training sample from the extracted values and the class label, thereby obtaining the training sample set.
Since different classification models have different classification functions, the feature variables they use may also differ; this step therefore chooses from the feature variable set the feature variables corresponding to the classification model to be updated. Then, according to these feature variables, the corresponding feature variable values are extracted from each piece of incremental data and composed, together with the class label of that piece of incremental data, into a training sample of a form similar to: (x1, x2, ... xn : y), where xi represents a feature variable value of the sample and y represents the class label of the sample. Processing each piece of incremental data in turn in the above manner yields the training sample set.
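Assuming each preprocessed record is represented as a dict and the label is stored under a hypothetical `label` key, step 101-4 can be sketched as:

```python
def to_training_sample(record, feature_vars):
    """Pick the values of the chosen feature variables from one incremental
    record and pair them with its class label, giving (x1, ..., xn : y)."""
    return [record[v] for v in feature_vars], record["label"]

# Hypothetical record: only "age" and "amount" are chosen feature variables.
record = {"age": 34, "amount": 120.5, "category": 3, "label": 1}
print(to_training_sample(record, ["age", "amount"]))  # ([34, 120.5], 1)
```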
So far, through steps 101-1 to 101-4, the training sample set has been extracted from the incremental user behavior data. It should be noted that in the initial stage after the classification model goes online, for example the first three or six months, the feature variables are generally not yet complete; as business understanding deepens, performing step 101-3 gradually adds original variables that help class prediction to the feature variable set. As the original variables comprised in the user behavior data stabilize and the classification model matures, the feature variables in the feature variable set also enter a relatively stable stage. In this case, step 101-3 may also be skipped, and the feature variables corresponding to the classification model can instead be chosen directly from the already stable feature variable set to further generate the training sample set.
Step 102: generate a newly added quantity of decision trees according to the training sample set.
This step generates the newly added quantity of decision trees. The newly added quantity is generally smaller than the predetermined quantity of original decision trees comprised in the classification model; its specific value can be set as an empirical value in consideration of factors such as the specific application scenario of the classification model, the scale of the training sample set, or the quantity of original decision trees the classification model comprises. For example, in an Internet application performing risk control, the range of the newly added quantity can be set within 1/40 to 1/10 of the predetermined quantity; if the classification model comprises 200 to 400 decision trees, the quantity of newly added decision trees can be set to 10. The above is merely an example; specific implementations can be configured with comprehensive reference to various factors.
In addition, the classification model can also be verified with the training sample set, and the quantity of newly added decision trees determined according to the verification result. Specifically, the classification model can be used to classify each sample in the training sample set, with the ratio of correctly classified samples to the total sample quantity taken as the classification accuracy, and the quantity of newly added decision trees adjusted according to this classification accuracy. For example, when the classification accuracy exceeds a preset threshold, the existing classification model can still classify the current training sample data relatively accurately, so a relatively small newly added quantity can be set; otherwise, a relatively large newly added quantity can be set.
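The threshold rule just described might be sketched as follows; the threshold and the two candidate quantities are illustrative values only, not values fixed by the application.

```python
def newly_added_quantity(current_accuracy, threshold=0.9, small=5, large=20):
    """Fewer new trees when the existing model still classifies the new
    samples well; more when its accuracy has fallen below the threshold."""
    return small if current_accuracy >= threshold else large

print(newly_added_quantity(0.95))  # 5
print(newly_added_quantity(0.70))  # 20
```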
After the newly added quantity is determined, the newly added quantity of decision trees can be generated. As an optional implementation, a certain quantity of samples can be randomly chosen from the training sample set, a decision tree generated from the chosen samples using a conventional decision tree algorithm, and the above steps of choosing samples and generating a decision tree repeated until the newly added quantity of decision trees has been generated.
In order to improve the efficiency of generating the newly added quantity of decision trees, avoid overfitting, and improve noise resistance, this embodiment provides a preferred implementation that generates the newly added quantity of decision trees using a random forest algorithm, specifically comprising steps 102-1 to 102-3, which are further described with reference to Fig. 3.
Step 102-1: build a bootstrap sample set from the training sample set by sampling with replacement.
The bootstrap sampling method (also called bootstrapping or Bootstrap sampling) is a uniform sampling method with replacement. This step draws N samples, with replacement, from the training sample set comprising N samples; in the drawing process, some samples in the training sample set are never drawn while others may be drawn repeatedly, and the N samples finally drawn compose one bootstrap sample set.
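Step 102-1 amounts to the standard bootstrap draw sketched below; the fixed seed is only there to make the sketch reproducible.

```python
import random

def bootstrap_sample(samples, rng):
    """Draw len(samples) samples with replacement: some appear repeatedly,
    some never appear, and the result forms one bootstrap sample set."""
    n = len(samples)
    return [samples[rng.randrange(n)] for _ in range(n)]

train = list(range(10))
boot = bootstrap_sample(train, random.Random(42))
print(len(boot))                      # 10
print(all(s in train for s in boot))  # True
```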
Step 102-2: use the bootstrap sample set to generate a new decision tree by choosing, at each node, a feature variable according to a predetermined policy and splitting on the chosen feature variable.
Using the bootstrap sample set, a new decision tree is generated by node-by-node splitting; the key lies in the selection of the split attribute (i.e. the feature variable) at each node. Specifically, for samples comprising M feature variables, whenever a node of the decision tree needs to split, m feature variables (generally satisfying the condition m << M) are first selected at random from the M feature variables, then 1 optimal feature variable is chosen from the m selected feature variables according to the predetermined policy, and the node is split on this feature variable. This process is repeated at each node until a node can no longer split or all the samples it comprises belong to the same class, at which point the splitting process ends and a new decision tree has been created.
In specific implementations, the number of feature variables to select at random can be obtained by calculating the square root and rounding, for example: if each sample comprises M = 100 feature variables, then m = sqrt(M) = 10 feature variables can be randomly selected each time. Naturally, the number of feature variables to select at random can also be determined in other ways, as long as the condition m << M is satisfied.
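The m = sqrt(M) rule can be sketched as follows; the feature names are hypothetical.

```python
import math
import random

def random_feature_subset(all_features, rng):
    """Randomly select m = round(sqrt(M)) of the M feature variables,
    which satisfies m << M for reasonably large M."""
    m = int(round(math.sqrt(len(all_features))))
    return rng.sample(all_features, m)

features = [f"f{i}" for i in range(100)]  # M = 100 hypothetical features
subset = random_feature_subset(features, random.Random(0))
print(len(subset))  # 10
```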
As for choosing the optimal feature variable from the randomly selected feature variables, a preset strategy can be adopted, for example according to information gain, information gain ratio, or the Gini index. The process of choosing the optimal feature variable in the above three ways and splitting to generate a decision tree belongs to relatively mature prior art, and its detailed procedure is not further described here.
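As one concrete instance of the "predetermined policy", the sketch below chooses the split feature with the lowest weighted Gini impurity among the randomly selected candidates; the restriction to binary 0/1 features and the toy samples are assumptions of the sketch, not of the application.

```python
def gini(labels):
    """Gini impurity of a set of class labels (0 means pure)."""
    n = len(labels)
    counts = {}
    for y in labels:
        counts[y] = counts.get(y, 0) + 1
    return 1.0 - sum((c / n) ** 2 for c in counts.values())

def best_split_feature(samples, candidate_features):
    """Among the candidate features, pick the one whose binary split yields
    the lowest weighted Gini impurity."""
    best, best_score = None, float("inf")
    for f in candidate_features:
        left = [y for x, y in samples if x[f] == 0]
        right = [y for x, y in samples if x[f] == 1]
        if not left or not right:
            continue
        n = len(samples)
        score = len(left) / n * gini(left) + len(right) / n * gini(right)
        if score < best_score:
            best, best_score = f, score
    return best

samples = [({"a": 0, "b": 0}, 0), ({"a": 0, "b": 1}, 0),
           ({"a": 1, "b": 0}, 1), ({"a": 1, "b": 1}, 1)]
print(best_split_feature(samples, ["a", "b"]))  # a  (separates the classes perfectly)
```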
As can be seen from the above description, the randomness of the random forest algorithm is embodied in two respects: the training samples of each tree are random, and the split attribute of each node within a tree is also randomly selected. Guaranteed by these two random characteristics, the decision trees generated by the random forest algorithm generally have good noise resistance and do not overfit.
This step can also record relevant information about each newly generated decision tree, including a decision tree identifier, such as a decision tree id, and the generation time.
Step 102-3: judge whether the number of newly generated decision trees is less than the newly added quantity; if so, return to step 102-1.

Each time a decision tree is newly generated, the count of newly generated trees can be incremented and compared with the newly added quantity. If the count is still smaller, execution returns to step 102-1 to continue generating new decision trees; otherwise, the number of newly generated trees already meets the requirement and generation need not continue.
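The loop of steps 102-1 to 102-3 can be sketched as follows. The tree-growing function is passed in as a parameter and the trivial majority-class "tree" used in the example is purely a hypothetical stand-in; the embodiment does not prescribe any particular implementation:

```python
import random

def bootstrap_sample(data, rng):
    """Step 102-2: draw len(data) items with replacement from the training sample set."""
    return [rng.choice(data) for _ in data]

def grow_new_trees(data, k, grow_tree, seed=0):
    """Steps 102-1 to 102-3: keep growing trees until k new trees exist."""
    rng = random.Random(seed)
    trees = []
    while len(trees) < k:                # step 102-3: compare the count with the target k
        sample = bootstrap_sample(data, rng)
        trees.append(grow_tree(sample))  # step 102-2: one new tree per bootstrap sample
    return trees

# a trivial majority-class predictor standing in for real tree growth:
data = [([0.1], "a"), ([0.9], "b"), ([0.2], "a"), ([0.8], "b")]
majority = lambda sample: max({lab for _, lab in sample},
                              key=[lab for _, lab in sample].count)
print(len(grow_new_trees(data, 3, majority)))  # 3
```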
Step 103: according to the historical classification accuracy and the current classification accuracy for the training sample set, select the predetermined number of decision trees from the newly added decision trees and the original decision trees to form the updated classification model.
Step 102 generates the newly added number of decision trees; this step selects the predetermined number of decision trees, according to classification performance, from the newly added trees and the original trees already in the classification model. If classification performance were evaluated only according to the current classification accuracy (the accuracy obtained on this training sample set), the result would usually be only a local optimum. When the data exhibit comparatively large random fluctuations, such fluctuations are mostly transient and do not represent a long-term, global trend; if the classification performance were assessed, and the best-performing decision trees screened, only against the current training sample set, the resulting updated classification model might be inaccurate, and its class predictions for future user behavior data might likewise be inaccurate.
To avoid this situation, the present technical solution introduces the historical classification accuracy, i.e., the classification accuracies recorded during previous updates of the classification model. By combining the historical classification accuracy with the current classification accuracy, the classification performance of a decision tree can be reflected more objectively, from a global perspective.
In a specific implementation, different strategies or algorithms can be used, based on the historical and current classification accuracies, to complete the selection of this step. This embodiment provides an implementation that calculates a composite classification accuracy and screens the decision trees according to this index; it specifically includes steps 103-1 to 103-3, described further below with reference to Fig. 4.
Step 103-1: calculate the composite classification accuracy of the newly added decision trees and the original decision trees according to the historical classification accuracy and the current classification accuracy for the training sample set.

Different computational methods can be used to calculate the composite classification accuracy from the historical and current classification accuracies; for example, a user-defined function or formula could be used. Considering both implementation effect and the maturity of the algorithm, this embodiment uses a moving average method to calculate the composite classification accuracy. Calculating the composite classification accuracy of the newly added and original decision trees with a moving average method includes steps 103-1-1 to 103-1-3, further illustrated below with reference to Fig. 5.
Step 103-1-1: according to the training sample set, calculate the current classification accuracy of each newly added decision tree and use it as its composite classification accuracy.

This step calculates the current classification accuracy of each newly added decision tree against the acquired training sample set. Specifically, the current classification accuracy of a newly added decision tree can be calculated in any one of the following three ways:

1) classify each sample in the training sample set with the newly added decision tree, and take the ratio of the number of correct classifications to the total number of samples as the current classification accuracy;

2) if the newly added decision tree was generated with the random forest algorithm, a bootstrap sample set was used during its generation; the out-of-bag samples, i.e., the samples included in the training sample set but not included in the bootstrap sample set, can therefore be used to calculate the current classification accuracy with a method similar to 1);

3) if the newly added decision tree was generated with the random forest algorithm, the samples in the bootstrap sample set can also be used directly to calculate the current classification accuracy in a manner similar to 1).
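All three ways reduce to "correct classifications over total samples"; way 2) only changes which samples are scored. A sketch, with a stand-in predictor in place of a trained decision tree (the function names are hypothetical):

```python
def classification_accuracy(tree_predict, samples):
    """Way 1): ratio of correctly classified samples to the total sample count."""
    correct = sum(1 for features, label in samples if tree_predict(features) == label)
    return correct / len(samples)

def out_of_bag_accuracy(tree_predict, training_set, bootstrap_set):
    """Way 2): score only the samples in the training set but not in the bootstrap set."""
    oob = [s for s in training_set if s not in bootstrap_set]
    return classification_accuracy(tree_predict, oob)

samples = [((1,), "pos"), ((0,), "neg"), ((1,), "pos"), ((0,), "pos")]
by_first = lambda f: "pos" if f[0] == 1 else "neg"   # stand-in for a trained tree
print(classification_accuracy(by_first, samples))    # 0.75: 3 of 4 samples correct
```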
For a newly added decision tree there is usually no historical classification accuracy information, so its current classification accuracy can be used as its composite classification accuracy. However, if the classification model once contained a decision tree identical to the newly added one, and the historical classification accuracy of that tree is still retained, the composite classification accuracy of the newly added tree can also be calculated in a manner similar to step 103-1-3.
Step 103-1-2: according to the training sample set, calculate the current classification accuracy of each original decision tree.

Similar to way 1) described in step 103-1-1 above, this step can classify each sample in the training sample set with an original decision tree and take the ratio of the number of correct classifications to the total number of samples as the current classification accuracy of that original decision tree. Every original decision tree contained in the classification model is processed in this way, yielding the current classification accuracy of each original decision tree.
Step 103-1-3: calculate the composite classification accuracy of each original decision tree according to the historical classification accuracy and the current classification accuracy.

This embodiment uses a moving average method to calculate the composite classification accuracy of the original decision trees. A moving average method generally refers to calculating, over a time series, the average (weighted mean) of successive data items according to specific weight coefficients, thereby smoothing out random fluctuations in the data and reflecting the trend of the data more objectively.
The basic formula for calculating the composite classification accuracy of a decision tree with a moving average method is:

    MA = w1·p1 + w2·p2 + ... + wn·pn

where pi is a data item in the time series, i.e., a classification accuracy, including the current classification accuracy and the historical classification accuracies recorded during previous updates of the classification model; wi is the weight coefficient corresponding to that classification accuracy, the weight coefficients usually summing to 1; and n is the number of data items in the time series participating in the calculation.
The simplest moving average method is the simple moving average (Simple Moving Average, SMA): the unweighted arithmetic mean of the current classification accuracy and the historical classification accuracies is calculated, and this mean is taken as the composite classification accuracy. In this way, the current classification accuracy and each historical classification accuracy are given the same weight coefficient, 1/n, which smooths the short-term fluctuations of the data well and reflects the long-term trend of the data.
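The simple moving average over the accuracy series is a one-line computation; the series below is hypothetical, ordered from most recent (the current accuracy) to oldest:

```python
def simple_moving_average(accuracies):
    """SMA: unweighted mean of the current and historical classification accuracies."""
    return sum(accuracies) / len(accuracies)

# current accuracy 0.90, historical accuracies 0.80 and 0.70
print(simple_moving_average([0.90, 0.80, 0.70]))  # ~0.8: each item weighted 1/3
```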
Considering that, in practical applications, the classification accuracy data of different periods may contribute differently to evaluating the classification performance of a decision tree, with more distant accuracies having lower influence and more recent accuracies evaluating current performance more exactly, this embodiment also provides preferred implementations using the weighted moving average method or the exponential moving average method, in order to emphasize the influence of recent data while still smoothing the short-term fluctuations of the data.
The so-called weighted moving average (Weighted Moving Average, WMA) sets a different weight coefficient for each data item when calculating the weighted mean, in the following way: for n data items, the denominator of the weight coefficients can be set to A = n + (n-1) + (n-2) + ... + 2 + 1; the weight coefficient of the most recent data item (p1, e.g., the current classification accuracy) is set to n/A, that of the second most recent data item (p2) to (n-1)/A, and so on down to 1/A. Refer to Fig. 6, which shows the weight distribution of a weighted moving average with n = 15. The computing formula of the weighted moving average is:

    WMA = (n·p1 + (n-1)·p2 + ... + 2·p(n-1) + 1·pn) / A
The so-called exponential moving average (Exponential Moving Average, EMA), compared with the weighted moving average method above, is a moving average whose weight coefficients decrease exponentially. The weighted influence of each data item decreases exponentially with time: the most recent data carry the heaviest weight, while more distant data are still given some weight. Refer to Fig. 7, which shows the weight distribution of an exponential moving average with n = 20. In a specific implementation, the degree of weighting can be determined by a constant α, with α between 0 and 1; α can also be expressed in terms of the number n of data items participating in the calculation, e.g., α = 2/(n+1). The computing formula of the exponential moving average based on the constant α is:

    EMA_t = α·p_t + (1 - α)·EMA_(t-1)
As can be seen from the above description, the weight coefficients of the weighted and exponential moving average methods are determined mainly by the generation time of the data items: the more recent a data item, the larger its weight coefficient, and the more distant, the smaller. Applied to the present embodiment, a larger weight coefficient can be set for the current classification accuracy, with progressively smaller weight coefficients for the more distant historical classification accuracies. A composite classification accuracy calculated in this way can both reflect the classification performance of a decision tree from a global perspective and highlight its recent classification performance; a classification model formed from the decision trees screened by this index can therefore usually produce more accurate results in subsequent classification predictions.
It can also be seen from the above description that the difference between the weighted and exponential moving average methods is that, in the exponential moving average, the weighted influence of each data item decreases not linearly but exponentially. Therefore, for application scenarios where the data change rapidly (for example, when a website runs a promotional campaign), the exponential moving average method can generally be used, while for scenarios where the data are relatively smooth (for example, ordinary working days), the weighted moving average method can generally be used.
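The weighted and exponential moving averages described above can be sketched directly from their formulas. In the sketch below, `accuracies` is ordered from most recent (p1, the current accuracy) to oldest (pn); the default α = 2/(n+1) follows the example given earlier, and seeding the EMA with the oldest accuracy is an implementation assumption, not something the embodiment prescribes:

```python
def weighted_moving_average(accuracies):
    """WMA: weights n/A, (n-1)/A, ..., 1/A with A = n + (n-1) + ... + 1."""
    n = len(accuracies)
    a = n * (n + 1) // 2                     # A = n + (n-1) + ... + 2 + 1
    return sum((n - i) * p for i, p in enumerate(accuracies)) / a

def exponential_moving_average(accuracies, alpha=None):
    """EMA: EMA_t = alpha*p_t + (1 - alpha)*EMA_(t-1), iterated from oldest to newest."""
    alpha = 2 / (len(accuracies) + 1) if alpha is None else alpha
    ema = accuracies[-1]                     # seed with the oldest accuracy (assumption)
    for p in reversed(accuracies[:-1]):      # fold in ever more recent accuracies
        ema = alpha * p + (1 - alpha) * ema
    return ema

acc = [0.90, 0.80, 0.70]                     # most recent first
print(weighted_moving_average(acc))          # (3*0.9 + 2*0.8 + 1*0.7) / 6 ~ 0.833
```

Both averages weight the recent 0.90 more heavily than the SMA would, which is exactly the behavior motivated above.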
When this step is specifically implemented, the stored classification accuracy data can be searched, according to the identifier of an original decision tree such as its decision tree id, for the classification accuracies of that tree and their generation times within a certain period before this update (e.g., the last 7 days or 1 month); these data are generally recorded during previous updates of the classification model. Then, according to the concrete application demand or scenario, the retrieved classification accuracies and the current classification accuracy are used with one of the moving average methods above to calculate the composite classification accuracy of the tree. The above operations are performed for every original decision tree contained in the classification model, thereby obtaining the composite classification accuracy of each original decision tree.
So far, through steps 103-1-1 to 103-1-3, the composite classification accuracy of every newly added and original decision tree has been obtained on the basis of the current classification accuracies. So that complete historical classification accuracy data are available when the classification model is updated in the future, after steps 103-1-1 and 103-1-2 calculate the current classification accuracies of the newly added and original decision trees, these data, together with the calculation time, can be stored in the relevant information of the corresponding decision tree. That is, the relevant information of each decision tree includes not only the decision tree identifier and generation time, but may also include a time series of classification accuracies.
Step 103-2: sort the newly added decision trees and the original decision trees according to the composite classification accuracy.

This step sorts the newly added and original decision trees according to the composite classification accuracy calculated in step 103-1, i.e., arranges them in order of composite classification accuracy from high to low, so that trees with a higher composite classification accuracy are placed before trees with a lower one, in preparation for the screening of the subsequent step 103-3.
Step 103-3: select the predetermined number of decision trees ranked highest from the sorted decision trees.

For example, if the classification model contains T original decision trees and K decision trees were generated in step 102, then this step selects, from the T+K decision trees sorted by composite classification accuracy, the top-ranked T decision trees, i.e., those with the best composite classification accuracy, and assembles them into the updated classification model.
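The screening of steps 103-2 and 103-3 amounts to a sort-and-truncate, sketched below with hypothetical (id, composite accuracy) pairs:

```python
def select_updated_model(original, newly_added, t):
    """Keep the t trees with the highest composite classification accuracy
    out of the T original plus K newly added trees (steps 103-2 and 103-3)."""
    candidates = original + newly_added                 # T + K (id, accuracy) pairs
    ranked = sorted(candidates, key=lambda tree: tree[1], reverse=True)
    return ranked[:t]

original = [("old-1", 0.81), ("old-2", 0.76), ("old-3", 0.88)]   # T = 3
newly_added = [("new-1", 0.84), ("new-2", 0.74)]                 # K = 2
print([tid for tid, _ in select_updated_model(original, newly_added, 3)])
# ['old-3', 'new-1', 'old-1']: the two unselected trees can then be deleted
```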
So far, the screening of the decision trees has been completed through steps 103-1 to 103-3 and the updated classification model obtained. In a specific implementation, the unselected decision trees and their relevant information, including the decision tree identifier, generation time, and the information related to classification accuracy, can be deleted at this point. If a deleted tree is a newly added decision tree, deleting the information related to classification accuracy means deleting only the current classification accuracy and the calculation time; if a deleted tree is an original decision tree, then not only the current classification accuracy and calculation time, but also the previously recorded historical classification accuracies and their time information, can be deleted.
In addition, since the present technical solution provides a dynamic update mechanism by which the classification model can be continuously refined, the solution can also be applied when the classification model is first created (the process of building it from nothing can likewise be regarded as an update process). Specifically, before step 102 is performed, it is first judged whether a classification model has already been created; if not, the predetermined quantity is taken as the newly added quantity, the predetermined number of decision trees is created in step 102, the current classification accuracy of each newly added tree is calculated in step 103 for reference during subsequent updates of the classification model, and the newly added decision trees are assembled directly into the classification model. With this implementation, the creation and update processes of the classification model are unified, reducing manual involvement and facilitating maintenance and management.
In summary, the method for updating a classification model provided by this embodiment does not require training on the full data; instead, it performs an incremental update on the basis of the original classification model using incremental data. The classification model can therefore be dynamically updated at various time granularities as required, e.g., daily or in near real time, which improves the efficiency of model training and enables a quick response to the business. Furthermore, since the classification performance of a decision tree is evaluated not only according to the current classification accuracy but also with the introduced historical classification accuracy, the composite classification performance of a tree can be assessed from a global perspective, smoothing the short-term fluctuations of the data and ensuring that the updated classification model maintains a comparatively stable classification prediction performance.
The above embodiments provide a method for updating a classification model; correspondingly, the present application also provides an apparatus for updating a classification model. Refer to Fig. 8, which is a schematic diagram of an embodiment of an apparatus for updating a classification model according to the present application. Since the apparatus embodiment is substantially similar to the method embodiment, it is described relatively simply; for relevant parts, refer to the description of the method embodiment. The apparatus embodiment described below is merely schematic.
An apparatus for updating a classification model according to this embodiment includes: a training sample set extraction unit 801, configured to acquire the incremental data within a predetermined time period from the server providing the user behavior data and extract a training sample set therefrom; a decision tree generation unit 802, configured to generate the newly added number of decision trees according to the training sample set; and a decision tree selection unit 803, configured to select, according to the historical classification accuracy and the current classification accuracy for the training sample set, the predetermined number of decision trees from the newly added decision trees and the original decision trees to form the updated classification model.
Optionally, the training sample set extraction unit includes:

an incremental data acquisition subunit, configured to acquire the incremental data within the predetermined time period from the server providing the user behavior data, wherein each piece of incremental data contains a group of original variables with their corresponding values and a class label;

a data preprocessing subunit, configured to preprocess the incremental data so that it meets the classification model's requirements on training sample data;

a feature extraction subunit, configured to perform the following operation on each piece of incremental data to generate the training sample set: extract from the incremental data the values of the characteristic variables corresponding to the classification model, and form a training sample from the extracted values and the class label.
Optionally, the data preprocessing subunit includes at least one of the following subunits:

an extreme value processing subunit, configured to process maxima and/or minima in the incremental data in a preset manner;

a missing value processing subunit, configured to process missing values in the incremental data in a preset manner;

a format conversion subunit, configured to perform the corresponding format conversion according to the classification model's format requirements on sample data.
Optionally, the training sample set extraction unit further includes:

a characteristic variable extraction subunit, configured to extract characteristic variables from the preprocessed incremental data and add them to a characteristic variable set;

and the feature extraction subunit is specifically configured to perform the following operation on each piece of incremental data to generate the training sample set: extract from the incremental data the values of the characteristic variables chosen from the characteristic variable set and corresponding to the classification model, and form a training sample from the extracted values and the class label.
Optionally, the decision tree generation unit is specifically configured to generate the newly added number of decision trees according to the training sample set using a random forest algorithm.
Optionally, the decision tree generation unit includes:

a loop control subunit, configured to invoke the following subunits to create a decision tree whenever the number of generated decision trees is less than the newly added quantity;

a bootstrap sample set construction subunit, configured to build a bootstrap sample set from the training sample set by sampling with replacement;

a decision tree generation execution subunit, configured to generate a new decision tree from the bootstrap sample set by selecting a characteristic variable at each node according to a predetermined policy and splitting the node according to the selected variable, where selecting a characteristic variable according to a predetermined policy means choosing the optimal characteristic variable, according to the predetermined policy, from the randomly selected characteristic variables.
Optionally, the predetermined policy used by the decision tree generation execution subunit includes: choosing the attribute according to information gain, according to information gain ratio, or according to the Gini index.
Optionally, the decision tree selection unit includes:

a composite index calculation subunit, configured to calculate the composite classification accuracy of the newly added and original decision trees according to the historical classification accuracy and the current classification accuracy for the training sample set;

a sorting subunit, configured to sort the newly added and original decision trees according to the composite classification accuracy;

a screening subunit, configured to select the predetermined number of decision trees ranked highest from the sorted decision trees.
Optionally, the composite index calculation subunit includes:

a newly added decision tree composite index calculation subunit, configured to calculate, according to the training sample set, the current classification accuracy of a newly added decision tree as its composite classification accuracy;

an original decision tree current index calculation subunit, configured to calculate the current classification accuracy of an original decision tree according to the training sample set;

an original decision tree composite index calculation subunit, configured to calculate the composite classification accuracy of an original decision tree according to the historical classification accuracy and the current classification accuracy.
Optionally, the original decision tree composite index calculation subunit is specifically configured to calculate the composite classification accuracy of an original decision tree using a moving average method.
Optionally, the moving average method used by the original decision tree composite index calculation subunit includes: the weighted moving average method or the exponential moving average method.
Optionally, the apparatus further includes:

a classification model creation judging unit, configured to judge, before triggering the operation of the decision tree generation unit, whether the classification model has already been created, and if not, to take the predetermined quantity as the newly added quantity.
Optionally, the apparatus further includes:

a decision tree deletion unit, configured to delete the unselected decision trees after the decision tree selection unit forms the updated classification model.
Although the present application is disclosed above with preferred embodiments, they are not intended to limit the application. Any person skilled in the art can make possible variations and modifications without departing from the spirit and scope of the application; the protection scope of the application should therefore be defined by the scope of the claims of the application.
In a typical configuration, a computing device includes one or more processors (CPU), input/output interfaces, network interfaces, and memory.

The memory may include volatile memory in a computer-readable medium, such as random access memory (RAM), and/or non-volatile memory, such as read-only memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.
1. Computer-readable media include permanent and non-permanent, removable and non-removable media, and can implement information storage by any method or technology. The information can be computer-readable instructions, data structures, program modules, or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technologies, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape or magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. As defined herein, computer-readable media do not include transitory media, such as modulated data signals and carrier waves.
2. It will be understood by those skilled in the art that the embodiments of the present application can be provided as a method, a system, or a computer program product. Therefore, the application can take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Moreover, the application can take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, optical storage, etc.) containing computer-usable program code.