
CN110795603A - Prediction method and device based on tree model - Google Patents

Prediction method and device based on tree model

Info

Publication number
CN110795603A
Authority
CN
China
Prior art keywords
data
node
splitting
party
tree
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201911040223.XA
Other languages
Chinese (zh)
Other versions
CN110795603B (en)
Inventor
陈超超
王力
周俊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Alipay Hangzhou Information Technology Co Ltd
Original Assignee
Alipay Hangzhou Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Alipay Hangzhou Information Technology Co Ltd filed Critical Alipay Hangzhou Information Technology Co Ltd
Priority to CN201911040223.XA priority Critical patent/CN110795603B/en
Publication of CN110795603A publication Critical patent/CN110795603A/en
Priority to PCT/CN2020/108973 priority patent/WO2021082634A1/en
Application granted granted Critical
Publication of CN110795603B publication Critical patent/CN110795603B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/901 Indexing; Data structures therefor; Storage structures
    • G06F16/9027 Trees
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/60 Protecting data
    • G06F21/62 Protecting access to data via a platform, e.g. using keys or access control rules
    • G06F21/6218 Protecting access to data via a platform, e.g. using keys or access control rules to a system of files or objects, e.g. local or distributed file system or database
    • G06F21/6245 Protecting personal data, e.g. for financial or medical purposes
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00 Administration; Management
    • G06Q10/04 Forecasting or optimisation specially adapted for administrative or management purposes, e.g. linear programming or "cutting stock problem"

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Business, Economics & Management (AREA)
  • General Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Strategic Management (AREA)
  • Human Resources & Organizations (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • Economics (AREA)
  • General Health & Medical Sciences (AREA)
  • Bioethics (AREA)
  • Marketing (AREA)
  • Medical Informatics (AREA)
  • Game Theory and Decision Science (AREA)
  • Computer Security & Cryptography (AREA)
  • Computer Hardware Design (AREA)
  • Entrepreneurship & Innovation (AREA)
  • Development Economics (AREA)
  • Operations Research (AREA)
  • Quality & Reliability (AREA)
  • Tourism & Hospitality (AREA)
  • General Business, Economics & Management (AREA)
  • Data Mining & Analysis (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The embodiments of the present specification provide a method and device for building a privacy-preserving tree model, and a prediction method and device based on the tree model. The building method comprises the following steps: acquiring M groups of splitting results from the respective devices of at least two data parties, the M groups of splitting results corresponding to M features respectively; recording the data parties corresponding to the M groups of splitting results; calculating the splitting gain of each splitting result based on the label values of the N samples; acquiring the splitting result with the maximum splitting gain as the optimal splitting result; determining the data party corresponding to the optimal splitting result when its splitting gain is a positive value; when the corresponding data party is the second data party, sending the optimal splitting result to the device of the second data party and recording the correspondence between the first node and the second data party; and labeling the first node to indicate that no node data of the first node exists locally, and updating the tree structure of the first tree accordingly.

Description

Prediction method and device based on tree model
Technical Field
The embodiment of the specification relates to the technical field of machine learning, in particular to a tree model construction method and device and a prediction method and device based on a tree model.
Background
Machine learning that preserves data security is a technical problem of great current interest: multiple data parties each hold their own data and wish to train a machine learning model cooperatively, for joint use, while the data security of each party is ensured. Like linear/logistic regression models, tree models are widely used in industry as regression/classification models. The following problem is often faced in practice: multiple data parties hold their own data and want to use each other's data to jointly model a tree model, such as an Xgboost or GBDT model, but do not want to provide their respective data to each other. For example, suppose there are two data parties, A and B (the case of more data parties is similar), where each data party has its own features of a sample, $x^A$ and $x^B$, while one of the parties also has the label of the sample, y. For sample 1, A possesses the features $x_1^A$ of sample 1 and the label $y_1$, and B possesses the features $x_1^B$ of sample 1. In the prior art, if the features owned by A, the label $y_1$, and the features $x_1^B$ owned by B are used to jointly model the tree model, data owned by A will be exposed to B and data owned by B will be exposed to A.
Therefore, a more effective privacy-preserving tree model construction and use scheme is needed.
Disclosure of Invention
Embodiments of the present disclosure aim to provide a more effective privacy-preserving tree model construction and usage scheme to solve the deficiencies in the prior art.
To achieve the above object, an aspect of the present specification provides a tree model building method, where building data currently used for building the tree model is composed of data owned by devices of at least two data parties, the building data includes feature values of M features of N samples, the at least two data parties include a first data party and a second data party, the device of the first data party has tag values of the N samples, the tree model currently includes a first tree, the first tree currently includes a first node, and the method is performed by the device of the first data party with respect to the first node, and includes:
obtaining M groups of splitting results from respective devices of the at least two data parties, wherein the M groups of splitting results correspond to the M characteristics respectively, and each group of splitting results comprises a plurality of splitting results which are split based on characteristic values of respective corresponding characteristics of the N samples;
recording data sides corresponding to the M groups of splitting results;
respectively calculating the splitting gain of each splitting result in the M groups of splitting results based on the label values of the N samples;
acquiring a splitting result with the maximum splitting gain as an optimal splitting result;
determining a data party corresponding to the optimal splitting result based on the data parties respectively corresponding to the M groups of splitting results recorded locally under the condition that the splitting gain of the optimal splitting result is a positive value;
when the corresponding data party is the second data party, sending the optimal splitting result to equipment of the second data party, and recording the corresponding relation between the first node and the second data party;
the first node is labeled to indicate that there is no node data of the first node locally and the tree structure of the first tree is updated accordingly.
In one embodiment, each of the M sets of split results includes P split results corresponding to a respective feature, where P is determined based on the number of non-duplicate feature values of the respective feature included in the N samples.
In one embodiment, the method further comprises notifying the devices of the data parties other than the first data party and the second data party among the at least two data parties, to instruct those devices to perform said labeling on the first node and update the tree structure of the first tree accordingly.
In one embodiment, the first data party further has a feature value of a first feature of each of the N samples, and the method further includes, in a case where the corresponding data party is the first data party and the optimal splitting result is determined to be the splitting result corresponding to the first feature: determining the first feature as the feature corresponding to the first node, determining the splitting value corresponding to the optimal splitting result as the splitting value of the first feature at the first node, and notifying the devices of the data parties other than the first data party among the at least two data parties, to instruct those devices to perform said labeling on the first node and update the tree structure of the first tree accordingly.
In one embodiment, the method further includes, in a case that a splitting gain of the optimal splitting result is less than or equal to zero, calculating a score corresponding to the first node based on the label values of the N samples, and notifying devices of other data parties except the first data party in the at least two data parties to instruct the devices of the other data parties to perform the labeling on the first node.
In an embodiment, the first tree is a t-th tree in the tree model, wherein the calculating the splitting gain of each splitting result in the M sets of splitting results based on the respective label values of the N samples includes calculating the splitting gain of each splitting result in the M sets of splitting results based on the respective label values of the N samples and a t-1 th tree of the tree model obtained in advance.
In an embodiment, the determining the data party corresponding to the optimal splitting result includes, in a case that the optimal splitting result is determined to correspond to at least two data parties based on the data parties corresponding to the M groups of splitting results recorded locally, randomly determining one data party from the at least two data parties as the data party corresponding to the optimal splitting result.
In one embodiment, the tree model is any one of the following tree models: xgboost model, GBDT model, random forest.
Another aspect of the present specification provides a prediction method based on a tree model, where model data of the tree model is composed of data owned by respective devices of at least two data parties, where each device of the data party has a tree structure of the tree model, and the at least two data parties include a first data party and a second data party, where a device of the first data party records a correspondence relationship between each node in the tree model and the data party, and the method is performed by the device of the first data party, and includes:
receiving a model use request for a first object from a device of a model requestor;
for a first father node in a first tree in the tree model, determining a data party corresponding to the first father node based on the corresponding relation between each node in the tree model and the data party;
when the data party corresponding to the first parent node is determined to be the second data party, informing the device of the second data party to divide the first object in the first parent node;
the partitioning result of the first object in the first parent node is received from the device of the second data side to continue the prediction of the first object.
In one embodiment, the first data side device records a first feature and a corresponding feature splitting value corresponding to the first parent node, and the first data side device further has a feature value of the first feature of the first object, and the method further includes, in a case where the data side corresponding to the first parent node is determined to be the first data side, dividing the first object into one child node of the first parent node based on the feature splitting value of the first parent node and the feature value of the first feature of the first object.
In one embodiment, the device of the first data side further records scores of each leaf node in the tree model, the partition result partitions the first object into a first child node of the first parent node, the first child node is a leaf node, and the method further includes sending the score of the first child node as a prediction result to the device of the model requesting side.
Another aspect of the present specification provides a tree model building apparatus, where building data currently used for building the tree model is composed of data owned by devices of at least two data parties, the building data includes feature values of M features of N samples, the at least two data parties include a first data party and a second data party, the device of the first data party has tag values of the N samples, the tree model currently includes a first tree, the first tree currently includes a first node, and the apparatus is deployed in the device of the first data party with respect to the first node, and includes:
a first obtaining unit, configured to obtain M sets of splitting results from respective devices of the at least two data parties, where the M sets of splitting results correspond to the M features, respectively, and each set of splitting results includes a plurality of splitting results that are split based on feature values of respective corresponding features of the N samples;
the recording unit is configured to record data sides corresponding to the M groups of splitting results;
a first calculating unit configured to calculate a splitting gain of each splitting result in the M groups of splitting results based on a label value of each of the N samples;
a second acquisition unit configured to acquire a splitting result having a maximum splitting gain as an optimal splitting result;
a first determining unit, configured to, in a case that a splitting gain of the optimal splitting result is a positive value, determine a data party corresponding to the optimal splitting result based on data parties corresponding to the M groups of splitting results recorded locally;
a sending unit, configured to send the optimal splitting result to a device of the second data party and record a corresponding relationship between the first node and the second data party when the corresponding data party is the second data party;
and the marking unit is configured to mark the first node to indicate that no node data of the first node exists locally, and update the tree structure of the first tree correspondingly.
In one embodiment, the apparatus further includes a first notifying unit configured to notify devices of other data parties of the at least two data parties except the first data party and the second data party to instruct the devices of the other data parties to perform the labeling on the first node and update the tree structure of the first tree accordingly.
In one embodiment, the first data party further has a feature value of a first feature of each of the N samples, the apparatus further includes a second determining unit configured to, in a case where the corresponding data party is the first data party, and in a case where the optimal splitting result is determined to be a splitting result corresponding to a first feature, determine the first feature as the feature corresponding to the first node, and determine a splitting value corresponding to the optimal splitting result as the splitting value of the first feature of the first node, and a second notifying unit configured to notify devices of other data parties than the first data party among the at least two data parties, to instruct the devices of the other data parties to perform the labeling on the first node and update the tree structure of the first tree accordingly.
In one embodiment, the apparatus further includes a second calculating unit configured to calculate a score corresponding to the first node based on the label value of each of the N samples in a case where a splitting gain of the optimal splitting result is equal to or less than zero, and a third notifying unit configured to notify devices of other data parties than the first data party among the at least two data parties to instruct the devices of the other data parties to perform the labeling on the first node.
In an embodiment, the first tree is a t-th tree in the tree model, wherein the first calculating unit is further configured to calculate splitting gains of the splitting results in the M sets of splitting results respectively based on the label values of the N samples and t-1 th trees of the tree model acquired in advance.
In an embodiment, the M groups of split results include at least two split results that are received from respective devices of at least two data parties and coincide with each other, and the first determining unit is further configured to, in a case where the optimal split result is determined to correspond to at least two data parties based on the data parties corresponding to the M groups of split results recorded locally, randomly determine one data party from the at least two data parties as the data party corresponding to the optimal split result.
Another aspect of the present specification provides a prediction apparatus based on a tree model, where model data of the tree model is composed of data owned by respective devices of at least two data parties, where each device of the data party has a tree structure of the tree model, and the at least two data parties include a first data party and a second data party, where a correspondence between each node in the tree model and the data party is recorded in the device of the first data party, and the apparatus is deployed in the device of the first data party, and includes:
a first receiving unit configured to receive a model use request for a first object from a device of a model requester;
the determining unit is configured to determine, for a first parent node in a first tree in the tree model, a data party corresponding to the first parent node based on the correspondence between each node in the tree model and the data party;
the notification unit is configured to notify a device of a second data party to divide a first object in a first parent node when the data party corresponding to the first parent node is determined to be the second data party;
a second receiving unit configured to receive a division result of the first object in the first parent node from the device of the second data side to continue prediction of the first object.
In one embodiment, the first parent node corresponds to a first feature and a corresponding feature splitting value, and the first data side device further has a feature value of the first feature of the first object, and the apparatus further includes a dividing unit configured to, in a case where it is determined that the data side corresponding to the first parent node is the first data side, divide the first object into one child node of the first parent node based on the feature splitting value of the first parent node and the feature value of the first feature of the first object.
In an embodiment, the device of the first data side further records scores of each leaf node in the tree model, the partitioning result partitions the first object into a first child node of the first parent node, where the first child node is a leaf node, and the apparatus further includes a sending unit configured to send the score of the first child node as a prediction result to the device of the model requesting side.
Another aspect of the present specification provides a computer readable storage medium having a computer program stored thereon, which, when executed in a computer, causes the computer to perform any one of the above methods.
Another aspect of the present specification provides a computing device comprising a memory and a processor, wherein the memory stores executable code, and the processor implements any one of the above methods when executing the executable code.
According to the tree model construction scheme of the embodiments of the present specification, a tree model can be constructed jointly from the data of a plurality of data parties while each party's data is protected from being leaked to the other parties. This expands the volume of data available for constructing the tree model and improves the tree model's prediction accuracy, while the data security of each data party is preserved.
Drawings
The embodiments of the present specification may be made clearer by describing them with reference to the attached drawings:
FIG. 1 illustrates a schematic diagram of a privacy preserving tree model building and use system in accordance with an embodiment of the present description;
FIG. 2 illustrates a method of privacy preserving tree model construction according to an embodiment of the present description;
FIG. 3 schematically illustrates the model data owned by each of data parties A, B and C after the tree model building process shown in FIG. 2;
FIG. 4 illustrates a tree model based prediction method in accordance with an embodiment of the present description;
FIG. 5 illustrates a privacy preserving tree model building apparatus 500 in accordance with an embodiment of the present description;
fig. 6 illustrates a tree model based prediction apparatus 600 according to an embodiment of the present description.
Detailed Description
The embodiments of the present specification will be described below with reference to the accompanying drawings.
FIG. 1 illustrates a schematic diagram of a privacy-preserving tree model building and using system according to an embodiment of the present description. The system comprises at least two data parties; three are schematically shown in the figure: data party A, data party B and data party C. The three data parties jointly construct and use a tree model using their own data (shown as data 1 through data 3 in the figure). For example, data parties A, B and C are a bank, an insurance institution and a platform, respectively, which have a plurality of users in common (e.g., users a, b, c, d and e), where data party A (the bank) has each user's value of feature 1 (e.g., deposit balance), data party B (the insurance institution) has each user's value of feature 2 (e.g., insured amount), and data party C (the platform) has each user's value of feature 3 (e.g., consumption amount) together with each user's label value (e.g., whether a specific commodity was purchased). Data parties A, B and C can thus together construct a tree model based on their respective data, where one construction sample comprises the feature values and the label value of one user. Since the sample label values are needed when determining the split values at the nodes of the tree model, the process of constructing and using the model is led by the party holding the label values (i.e., data party C).
In the model construction process, for example for the root node of the model, data party A obtains 6 splitting results of the five users a, b, c, d, e based on each user's value of feature 1 and sends these 6 splitting results to data party C; data party B obtains 6 splitting results of the five users based on each user's value of feature 2 and sends them to data party C; data party C obtains 6 splitting results of the five users based on each user's value of feature 3, and calculates the gain of each of the 18 splitting results, thereby obtaining the optimal splitting result with the maximum gain. Then, if that gain is greater than 0 and the optimal splitting result is not one of the 6 splitting results obtained from data party C itself, data party C sends the optimal splitting result to the party corresponding to it (for example, party A), so that party A updates its own tree model: data party A records the feature corresponding to the node and the corresponding feature split value, and adds the node's two child nodes to the tree structure. Meanwhile, data party C marks the node in its own tree model as a null node and updates the tree structure accordingly.
After the tree model is constructed in this manner, each data party has the same tree structure, but each data party only holds the node data of its corresponding nodes. For example, if the root node corresponds to feature 1 of data party A, data party A has the node data of the root node (e.g., the feature corresponding to the node and the feature split value), while data party B and data party C merely mark the root node in their tree models as a null (dummy) node; that is, neither data party B nor data party C acquires the data of data party A. In prediction with the tree model, suppose the model requester wishes to predict a first user, where data party A has the first user's value of feature 1, data party B has the value of feature 2, and data party C has the value of feature 3. In this case the prediction process is led by data party C: the model requester sends a model request to data party C, and data party C predicts the first user starting from the root node. For example, where the root node is recorded as corresponding to data party A, data party C notifies data party A, based on that record, to divide the first user at the root node, so that data party A uses the first user's value of feature 1 and the split value of feature 1 at the root node to divide the first user into one child node of the root node. If that child node is a leaf node, data party A sends the division result to data party C, and data party C can send the score of the leaf node to the model requester as the model prediction value; if that child node is a node corresponding to data party C, data party C continues to divide the first user at the child node.
It is to be understood that the system illustrated in FIG. 1 in accordance with embodiments of the present description is intended to be illustrative, and not restrictive. For example, the sample is not limited to data for each user, and in the case where each data party is each shopping platform and has a common commodity, the sample may be data for a commodity that each shopping platform has in common. Hereinafter, a description will be given taking a data sample of each user as an example.
A privacy-preserving tree model construction scheme and a use scheme according to an embodiment of the present specification will be described in detail below.
Fig. 2 illustrates a tree model building method according to an embodiment of the present specification, in which building data currently used for building the tree model is composed of data owned by devices of at least two data parties, the building data includes feature values of M features of N samples, the at least two data parties include a first data party and a second data party, the device of the first data party has tag values of the N samples, the tree model currently includes a first tree, the first tree currently includes a first node, and the method is performed by the device of the first data party with respect to the first node, and includes:
step S202, obtaining M groups of splitting results from respective devices of the at least two data parties, wherein the M groups of splitting results correspond to the M characteristics respectively, and each group of splitting results comprises a plurality of splitting results which are split based on characteristic values of respective corresponding characteristics of the N samples;
step S204, recording data parties corresponding to the M groups of splitting results;
step S206, respectively calculating the splitting gains of the M groups of splitting results based on the label values of the N samples;
step S208, acquiring the splitting result with the maximum splitting gain as the optimal splitting result;
step S210, in the case that the splitting gain of the optimal splitting result is a positive value, determining a data party corresponding to the optimal splitting result based on the data parties corresponding to the M groups of splitting results recorded locally;
step S212, in the case that the corresponding data party is the second data party, sending the optimal splitting result to the equipment of the second data party, and recording the corresponding relation between the first node and the second data party;
step S214, labeling the first node to indicate that there is no node data of the first node locally, and updating the tree structure of the first tree accordingly.
First, in step S202, M sets of splitting results are obtained from respective devices of the at least two data parties, where the M sets of splitting results correspond to the M features, respectively, and each set of splitting results includes a plurality of splitting results that are split based on feature values of respective corresponding features of the N samples.
As described above, for example, data party A, data party B and data party C have common users a, b, c, d, e, where data party A has the values (x_{a1}, x_{b1}, x_{c1}, x_{d1}, x_{e1}) of feature 1 (f1) of the respective users, data party B has the values (x_{a2}, x_{b2}, x_{c2}, x_{d2}, x_{e2}) of feature 2 (f2), and data party C has the values (x_{a3}, x_{b3}, x_{c3}, x_{d3}, x_{e3}) of feature 3 (f3) together with the label values (y_a, y_b, y_c, y_d, y_e). Also as described above, assume data party A is a bank and f1 is the deposit balance, data party B is an insurance agency and f2 is the insured amount, and data party C is a platform and f3 is the consumption amount, with the label value indicating whether the user purchased a specific commodity. For example, suppose (x_{a1}, x_{b1}, x_{c1}, x_{d1}, x_{e1}) = (10, 15, 13, 12, 16), (x_{a2}, x_{b2}, x_{c2}, x_{d2}, x_{e2}) = (7, 9, 8, 6, 10), (x_{a3}, x_{b3}, x_{c3}, x_{d3}, x_{e3}) = (5, 3, 4, 2, 6), and (y_a, y_b, y_c, y_d, y_e) = (0, 1, 0, 0, 1). These data can be used to construct a tree model, for example an Xgboost model; it is understood that the solution of the embodiments of the present specification is not limited to the Xgboost model but also covers various tree models such as GBDT and random forests. The Xgboost model is used as the example hereinafter.
After tree model building starts, the data of all the users mentioned above is used to construct the root node of the 1st tree (for example, denoted node 1); that is, the splitting feature and the splitting value are determined at the root node, and all users are split based on them. It will be appreciated that the root node is described here as an example; the constructed node may be any node in any tree of the tree model. Unlike the root node, the user data used to construct a non-root node is not that of all users but only of the users classified to that node, and the features used to construct a non-root node exclude the features that already correspond to other nodes.
Among the above data parties A, B, C, data party C has the label values of the respective users; that is, data party C is the first data party, and the method shown in fig. 2 is therefore performed at a device of data party C, such as a server, a terminal, a computer, and the like. The device of data party C thus obtains M sets of splitting results from the devices of the data parties, where each set includes a plurality of splitting results of the N users obtained by splitting on the values of the corresponding feature, where M = 3 is the number of features and N = 5 is the number of samples, i.e., the number of users. Based on each user's value of a given feature, a plurality of splitting results corresponding to that feature for the 5 users can be obtained through different splitting algorithms; for example, the exact greedy algorithm yields N + 1 = 5 + 1 = 6 splitting results per feature. Here, if data party C itself holds the feature data of a certain feature for each user, "the respective data parties" also include data party C itself. That is, the device of data party C acquires 6 splitting results corresponding to feature f1 from the device of data party A, 6 splitting results corresponding to feature f2 from the device of data party B, and 6 splitting results corresponding to feature f3 from itself, so the device of data party C acquires three sets of splitting results in total, each set including 6 splitting results.
The splitting results corresponding to a certain feature are obtained based on the users' values of that feature. For example, the samples (here, the users) may be split using the exact greedy algorithm, which finds the optimal split by exhaustive enumeration. For feature f1, the users' f1 values are a:10, b:15, c:13, d:12, e:16, so the users may first be sorted by feature value from small to large, namely: adcbe. Six splitting results corresponding to f1 can then be obtained from this ranking: (0, adcbe), (a, dcbe), (ad, cbe), (adc, be), (adcb, e) and (adcbe, 0). It is understood that users' feature values may coincide; for example, if the feature is age, several users may have the same age. Suppose the f1 values of user a and user b are equal: when splitting on f1, users a and b are treated as one split unit (ab), arranged together with the other users in ascending order of f1, and the splitting proceeds as before. For example, if the users' f1 values are a:10, b:10, c:13, d:12, e:16, the users sorted in ascending order are: (ab)dce, so that five splitting results corresponding to f1 can be obtained from this ranking: (0, abdce), (ab, dce), (abd, ce), (abdc, e), (abdce, 0). That is, in this case only 5 splitting results corresponding to f1 are obtained for the five users.
For the features f2 and f3, the five users can be split in the same way. Assuming that no feature values coincide, the splitting results corresponding to the features shown in Table 1 are obtained:

f1: (0,adcbe) (a,dcbe) (ad,cbe) (adc,be) (adcb,e) (adcbe,0)
f2: (0,dacbe) (d,acbe) (da,cbe) (dac,be) (dacb,e) (dacbe,0)
f3: (0,dbcae) (d,bcae) (db,cae) (dbc,ae) (dbca,e) (dbcae,0)

TABLE 1
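As a concrete illustration of the exact greedy enumeration described above, including the handling of tied feature values, the following Python sketch reproduces the splitting results of Table 1; the function name and data layout are illustrative, not part of the patent.

```python
from collections import defaultdict

def enumerate_splits(feature_values):
    """Enumerate candidate splits for one feature with the exact greedy
    method described above: sort samples by feature value, group ties
    into one split unit, and cut at every boundary (including the two
    trivial cuts with an empty side)."""
    groups = defaultdict(list)           # samples sharing a value move as one unit
    for sample_id, value in feature_values.items():
        groups[value].append(sample_id)
    ordered = [groups[v] for v in sorted(groups)]

    splits = []
    for cut in range(len(ordered) + 1):  # len(ordered)+1 cuts in total
        left = tuple(s for grp in ordered[:cut] for s in grp)
        right = tuple(s for grp in ordered[cut:] for s in grp)
        splits.append((left, right))
    return splits

# Reproduces the f1 row of Table 1:
# [((), ('a','d','c','b','e')), (('a',), ('d','c','b','e')), ...]
print(enumerate_splits({"a": 10, "b": 15, "c": 13, "d": 12, "e": 16}))
```

With tied values, e.g. {"a": 10, "b": 10, "c": 13, "d": 12, "e": 16}, the same function yields the five results of the (ab)dce example above.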
In step S204, data sides corresponding to the M sets of splitting results are recorded.
For example, after data party A determines the 6 splitting results shown in Table 1 based on the f1 values of the respective users it owns, it sends only the 6 splitting results to data party C, without informing data party C of f1 itself, in order to protect its data from being leaked to data party C. For example, for the splitting result (a, dcbe), "a" is the user id of user a, divided from the current node 1 to its left child node (identified, for example, as node 2), and "dcbe" are the user ids of users d, c, b, e, divided from the current node to its right child node (identified, for example, as node 3). After receiving the 6 splitting results from data party A, data party C records that these 6 splitting results were received from data party A. Likewise, data party C records the 6 splitting results received from data party B as corresponding to data party B, and records the 6 splitting results obtained from its own data as corresponding to data party C itself, so that data party C records the 18 splitting results and the corresponding data parties as shown in Table 2:

Data party A: (0,adcbe) (a,dcbe) (ad,cbe) (adc,be) (adcb,e) (adcbe,0)
Data party B: (0,dacbe) (d,acbe) (da,cbe) (dac,be) (dacb,e) (dacbe,0)
Data party C: (0,dbcae) (d,bcae) (db,cae) (dbc,ae) (dbca,e) (dbcae,0)

TABLE 2
In step S206, based on the label values of the N samples, the splitting gain of each splitting result in the M sets of splitting results is calculated, respectively.
In the tree model, generally, for the currently constructed node 1, the splitting characteristic and the corresponding characteristic value of the node can be determined by calculating the splitting gain of each possible splitting result. Different tree models may have different formulas for calculating the splitting gain, but it is common that the splitting gain needs to be calculated based on the label values of the samples, that is, the splitting gain is consistent with the accuracy of the model prediction, and the larger the splitting gain, the higher the accuracy of the model prediction, that is, the smaller the error between the model prediction value and the corresponding label value.
For example, in the Xgboost model, the splitting gain $L_{sp}$ of one split in the t-th tree of the model is calculated by the following formula (1):

$$L_{sp}=\frac{1}{2}\left[\frac{\left(\sum_{i\in I_L}g_i\right)^2}{\sum_{i\in I_L}h_i+\lambda}+\frac{\left(\sum_{i\in I_R}g_i\right)^2}{\sum_{i\in I_R}h_i+\lambda}-\frac{\left(\sum_{i\in I}g_i\right)^2}{\sum_{i\in I}h_i+\lambda}\right]-\gamma\qquad(1)$$

where $\lambda$ and $\gamma$ are hyper-parameters, $I_L$ is the sample set of the left child node after the current node splits, $I_R$ is the sample set of the right child node after the current node splits, and $I$ is the sample set of the current node, and where $g_i$ and $h_i$ are calculated by the following equations (2) and (3), respectively:

$$g_i=\partial_{\hat{y}_i^{(t-1)}}\,l\left(y_i,\hat{y}_i^{(t-1)}\right)\qquad(2)$$

$$h_i=\partial^2_{\hat{y}_i^{(t-1)}}\,l\left(y_i,\hat{y}_i^{(t-1)}\right)\qquad(3)$$

where $l(\cdot)$ is the loss function of the prediction $\hat{y}_i^{(t-1)}$ given by the first t-1 trees in the model, and where for the 1st tree, i.e. t = 1, the initial prediction $\hat{y}_i^{(0)}$ may be determined randomly.
From equations (2) and (3) it follows that, once the first t-1 trees have been determined, $g_i$ and $h_i$ are deterministic for each user i in the t-th tree.
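Equations (2) and (3) can be made concrete under an assumed loss. The sketch below uses the squared-error loss l(y, yhat) = (y - yhat)^2, which is one possible choice and not mandated by the text; other losses (e.g., logistic loss) give different g_i and h_i.

```python
def grad_hess_squared_loss(y_true, y_pred_prev):
    """g_i and h_i of equations (2) and (3) under the assumed loss
    l(y, yhat) = (y - yhat)^2: the first derivative is 2*(yhat - y)
    and the second derivative is the constant 2."""
    g = {i: 2.0 * (y_pred_prev[i] - y_true[i]) for i in y_true}
    h = {i: 2.0 for i in y_true}
    return g, h

# Labels of the running example; for the 1st tree (t = 1) the initial
# prediction may be chosen freely, e.g. zero for every user:
y_true = {"a": 0.0, "b": 1.0, "c": 0.0, "d": 0.0, "e": 1.0}
g, h = grad_hess_squared_loss(y_true, {i: 0.0 for i in y_true})
```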
For example, for the splitting result (a, dcbe) in Table 2 above, the gain of the splitting result can be calculated according to formula (1) as shown in formula (4) below:

$$L_{sp}=\frac{1}{2}\left[\frac{g_a^2}{h_a+\lambda}+\frac{\left(g_d+g_c+g_b+g_e\right)^2}{h_d+h_c+h_b+h_e+\lambda}-\frac{\left(g_a+g_d+g_c+g_b+g_e\right)^2}{h_a+h_d+h_c+h_b+h_e+\lambda}\right]-\gamma\qquad(4)$$
among them, for the splitting result (0, abdce) corresponding to f1, since the eigenvalue of f1 of each user may be missing, for example, in the case where the eigenvalue of f1 of user a is missing, user a will not be included in the summation of the first term in formula (1), and therefore, the gain calculated by formula (1) may also be larger than 0.
By this method the splitting gain can be calculated for each splitting result in Table 2. As can be seen in Table 2, there are coincident splitting results, e.g., (ad, cbe) of data party A and (da, cbe) of data party B coincide; in that case the splitting gain need only be calculated once for them. In practice, since the number of samples and the number of features are both large, coincident splitting results occur only with small probability.
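Building on the g_i and h_i above, a minimal sketch of formula (1) and of the scan over the recorded splitting results (steps S206-S210) might look as follows; the default lambda and gamma values and all names are illustrative assumptions, not prescribed by the patent.

```python
def split_gain(split, g, h, lam=1.0, gamma=0.0):
    """Splitting gain of formula (1) for one (left, right) split; lam and
    gamma are the lambda/gamma hyper-parameters (illustrative defaults)."""
    def term(ids):
        G = sum(g[i] for i in ids)
        H = sum(h[i] for i in ids)
        return G * G / (H + lam)
    left, right = split
    return 0.5 * (term(left) + term(right) - term(left + right)) - gamma

def best_split(splits_by_party, g, h):
    """Scan every recorded (data party, split) pair, as in Table 2 and
    steps S206-S210, and return the split with the largest gain."""
    best = None
    for party, splits in splits_by_party.items():
        for split in splits:
            gain = split_gain(split, g, h)
            if best is None or gain > best[0]:
                best = (gain, party, split)
    return best  # (gain, corresponding data party, optimal split)
```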
As described above, the tree model according to the embodiments of the present disclosure is not limited to the Xgboost model; for example, the tree model may be the GBDT model, and the splitting gain may be calculated through the information gain, the Gini index, and the like, which are not described in detail herein.
In step S208, the splitting result having the largest splitting gain is acquired as the optimum splitting result.
After the splitting gains of the respective splitting results in Table 2 have been obtained as described above, the splitting result with the largest splitting gain may be taken as the optimal splitting result. Where Table 2 contains coincident splitting results, for example where the splitting results with the largest gain are (ad, cbe) of data party A and (da, cbe) of data party B, one of them may be randomly chosen as the optimal splitting result.
In step S210, in a case that the splitting gain of the optimal splitting result is a positive value, a data party corresponding to the optimal splitting result is determined based on data parties corresponding to the M groups of splitting results recorded locally.
When the splitting gain of the optimal splitting result is a positive value, this indicates that continuing to split the current node with the feature corresponding to the optimal splitting result yields a better prediction effect. Since, as described above, the correspondence between each splitting result and its data party is recorded as in Table 2, once the optimal splitting result in Table 2 is determined, the data party to which it corresponds can be determined from Table 2. For example, if the optimal splitting result among the 18 splitting results in Table 2 is (a, dcbe), Table 2 shows that the corresponding data party is data party A.
In the case that the splitting gain of the optimal splitting result is less than or equal to zero, continuing to split cannot bring further benefit, so the splitting of the current node is stopped, the current node is taken as a leaf node, and the score of the leaf node is calculated from the locally stored label values of the samples at that node; specifically, the mean of those label values is taken as the score of the leaf node. It can thus be seen that, since only data party C has the label values of the samples, the scores of all leaf nodes of the tree model reside at data party C, i.e., all leaf nodes correspond to data party C.
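A minimal sketch of the leaf-score computation just described, assuming the score is the plain mean of the label values at the leaf:

```python
def leaf_score(sample_ids, labels):
    """Score of a leaf node: the mean of the label values of the samples
    divided into it, computable only by the label holder (data party C)."""
    return sum(labels[i] for i in sample_ids) / len(sample_ids)

# E.g. a leaf holding users b and c with labels 1 and 0 scores 0.5:
print(leaf_score(["b", "c"], {"a": 0, "b": 1, "c": 0, "d": 0, "e": 1}))
```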
In step S212, in a case that the corresponding data party is the second data party, the optimal splitting result is sent to the device of the second data party, and the corresponding relationship between the first node and the second data party is recorded.
For example, as described above, if it is determined that the optimal splitting result (a, dcbe) corresponds to data party A (i.e., the second data party), that splitting result is sent to the device of data party A as the optimal splitting result, so that the device of data party A can learn the optimal splitting result.
After obtaining the optimal splitting result, the device of data party A can determine that the optimal splitting result corresponds to the data of its local feature f1; it therefore associates node 1 with feature f1 and determines the feature split value of node 1 from the split value corresponding to the optimal splitting result. The model data of the tree model can thus be updated in the device of data party A, that is, the feature corresponding to node 1 and the corresponding feature split value are recorded, and the tree structure is extended downward from node 1 by two child nodes (node 2 and node 3).
In step S214, the first node is labeled to indicate that there is no node data of the first node locally, and the tree structure of the first tree is updated accordingly.
Since node 1 corresponds to data party A, i.e., only data party A owns the node data of node 1 while data party C does not, the device of data party C may, for example, mark node 1 as a null node in its local tree model and extend the tree structure downward from node 1 by two child nodes (node 2 and node 3), storing the corresponding tree structure. It is to be understood that, in the embodiments of the present specification, node 1 is not limited to being labeled as a null node; it may be labeled with any predetermined identifier, as long as the identifier indicates that there is no node data of the first node locally.
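One possible way to represent what each party stores per node, as just described, is sketched below; the field names and the split value 11.0 are illustrative assumptions, not taken from the patent.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class TreeNode:
    """One node of the shared tree structure as a single party stores it.
    A party that does not own the node keeps only the null marker and
    the child links."""
    node_id: int
    is_null: bool = True                  # True: no node data held locally
    feature: Optional[str] = None         # split feature, if owned
    split_value: Optional[float] = None   # feature split value, if owned
    owner_hint: Optional[str] = None      # party C also records the owner
    left: Optional[int] = None            # child node ids
    right: Optional[int] = None

# Node 1 as stored by data party C (null node plus owner record) and by
# data party A (actual node data; the split value 11.0 is hypothetical):
node1_at_C = TreeNode(1, is_null=True, owner_hint="A", left=2, right=3)
node1_at_A = TreeNode(1, is_null=False, feature="f1", split_value=11.0,
                      left=2, right=3)
```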
In the above, model construction has been performed based on the data of the five users a, b, c, d, e owned by data parties A, B and C respectively, to derive the root node (node 1) of the Xgboost model; the node data of node 1 is owned by data party A, and node 1 has two child nodes, node 2 and node 3. Thereafter, the method shown in fig. 2 may be performed by the device of data party C with node 2 and node 3 respectively as the first node, to determine the tree structure and node data of each data party.
In one embodiment, the method of FIG. 2 is performed with the left child node of node 1 (node 2) as the first node. As described above, after the splitting of node 1, only user a is included in node 2; in this case node 2 cannot be split further, so the device of data party C labels node 2 as a leaf node and calculates the score of node 2 based on the label value of user a. Meanwhile, the device of data party C notifies the respective devices of data party A and data party B that node 2 is a leaf node, so that they label node 2 as a null node.
In one embodiment, the method of FIG. 2 is performed with the right child node of node 1 (node 3) as the first node. As described above, after the splitting of node 1, users d, c, b, e are included in node 3; in this case the number of users is N = 4 and the number of features available for splitting is 2 (f2, f3), so the device of data party C similarly acquires M × (N + 1) = 2 × 5 = 10 splitting results for users d, c, b, e, and performs steps S204 to S208 as above. Suppose, for example, that the data party corresponding to the optimal splitting result (dbc, e) is determined to be data party C itself. Data party C holds the value of feature f3 for each of users d, c, b, e, and based on the f3 values of users b, c, d, e given above (3, 4, 2, 6), five splitting manners can be obtained for these users: (0, dbce), (d, bce), (db, ce), (dbc, e) and (dbce, 0); the device of data party C can therefore determine that the optimal splitting result corresponds to feature f3. It is understood that when data party C has the value of only one feature for each user, the optimal splitting result can be directly attributed to that feature; when data party C has the values of a plurality of features for each user, the feature corresponding to the optimal splitting result can be found by comparing the optimal splitting result with the splitting results corresponding to the respective features as described above. Thus, in the device of data party C, f3 is determined as the feature corresponding to node 3, the feature split value of node 3 is determined from the split value corresponding to the optimal splitting result, and the tree structure of the tree model is extended downward from node 3 by two child nodes (node 4 and node 5). Thereafter, the device of data party C may notify the respective devices of data party A and data party B to mark node 3 in their local tree models as a null node and update the tree structure accordingly, i.e., extending node 3 downward by two child nodes (node 4 and node 5).
As described above, node 4 includes users d, b and c, and node 5 includes user e; similarly to node 2, node 5 is a leaf node, and the device of data party C can process it in the same way as node 2.
For node 4, the method shown in fig. 2 may be performed again with node 4 as the first node. For example, at node 4 it is determined that the splitting result (d, cb) corresponding to data party B is the optimal splitting result, with a gain value greater than 0. The device of data party C then, similarly to the above, sends the splitting result to the device of data party B, so that the device of data party B obtains the node data of node 4 (similarly to node 3) and extends the tree model downward to node 6 and node 7; meanwhile, the device of data party C labels node 4 in its local tree model as a null node and updates the tree structure, and notifies the device of data party A to label node 4 in its tree model as a null node and update the tree structure.
For the newly added node 6 and node 7: since only user d is included in node 6, it too is a leaf node and is not described further here. Users b and c are included in node 7, and since there are no other features with which to continue splitting, node 7 is also a leaf node. If there were other features available for splitting, for example if data party A had the value of a feature f4 for each user, the splitting gain of the splitting result (b, c) could be calculated; if that gain were greater than 0, node 7 could be further divided by data party A into two child nodes, and if it were less than or equal to 0, no further splitting would be needed and node 7 would be taken as a leaf node. The device of data party C calculates the mean of the label values of users b and c as the score of node 7; for example, with label values 1 and 0 for users b and c, the score of node 7 is 0.5.
FIG. 3 schematically shows the model data owned by each of data parties A, B, C after the tree model building process described above. As shown in fig. 3, data parties A, B, C share a common tree structure, namely 7 nodes (nodes 1-7) and the connection relationships among them. Data party A has only the node data of node 1 (split feature f1, split value s1), with nodes 2-7 all marked as null nodes. Data party B has only the node data of node 4 (split feature f2, split value s2). Data party C has the node data of node 3 (split feature f3, split value s3) and the scores of the individual leaf nodes (v1, v2, v3, v4). That is, since data party A has the users' f1 values and node 1 corresponds to f1, data party A corresponds to node 1; similarly data party B corresponds to node 4, and data party C corresponds to nodes 2, 3, 5, 6, 7. In addition, the "(A)" recorded at node 1 on data party C indicates that node 1 corresponds to data party A, and the "(B)" recorded at node 4 indicates that node 4 corresponds to data party B.
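Using the TreeNode sketch above, data party C's share of the FIG. 3 model could be represented as follows; the f3 split value is an illustrative stand-in for s3, while the leaf scores v1-v4 follow from the running example's labels.

```python
# Data party C's view of the FIG. 3 tree: it owns node 3 and all leaf
# scores, records owner hints for nodes 1 and 4, and keeps the remaining
# non-leaf nodes as null nodes. The f3 split value 5.0 stands in for s3.
nodes_at_C = {
    1: TreeNode(1, is_null=True, owner_hint="A", left=2, right=3),
    2: TreeNode(2, is_null=False),                       # leaf, score v1
    3: TreeNode(3, is_null=False, feature="f3", split_value=5.0,
                left=4, right=5),
    4: TreeNode(4, is_null=True, owner_hint="B", left=6, right=7),
    5: TreeNode(5, is_null=False),                       # leaf, score v2
    6: TreeNode(6, is_null=False),                       # leaf, score v3
    7: TreeNode(7, is_null=False),                       # leaf, score v4
}
# From the running example's labels: v1 = y_a = 0, v2 = y_e = 1,
# v3 = y_d = 0, v4 = mean(y_b, y_c) = 0.5.
leaf_scores = {2: 0.0, 5: 1.0, 6: 0.0, 7: 0.5}
```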
In the above example only one tree is constructed; however, in the case where data parties A, B, C have values of further features for each user, a second tree can be constructed for users a to e by the method shown in fig. 2 using those features, making the model's predictions more accurate.
After the tree model shown in FIG. 3, of which each of data parties A, B, C owns a portion, has been obtained, predictions for objects may be made via the tree model.
Fig. 4 illustrates a prediction method based on a tree model according to an embodiment of the present disclosure, where model data of the tree model is composed of data owned by respective devices of at least two data parties, where each device of the data parties has a tree structure of the tree model, and the at least two data parties include a first data party and a second data party, where a correspondence relationship between each node in the tree model and the data party is recorded in the device of the first data party, and the method is performed by the device of the first data party, and includes:
step S402, receiving a model use request for a first object from a device of a model requester;
step S404, for a first father node in a first tree in the tree model, determining a data party corresponding to the first father node based on the corresponding relation between each node in the tree model and the data party;
step S406, in a case that it is determined that the data party corresponding to the first parent node is the second data party, notifying a device of the second data party to divide the first object in the first parent node;
step S408, receiving a partitioning result of the first object in the first parent node from the device of the second data side, so as to continue the prediction of the first object.
First, in step S402, a model use request for a first object is received from a device of a model requester.
For example, the first object is a first user of data party C (i.e., the platform), and the model requester may be a manufacturer, marketer, etc. of a specific commodity who wishes to predict whether the first user will purchase that commodity; the device of the model requester therefore makes a model prediction request for the first user to the device of data party C.
In step S404, for a first parent node in a first tree in the tree model, a data party corresponding to the first parent node is determined based on a correspondence between each node in the tree model and the data party.
For example, for the constructed tree model shown in fig. 3, the device of data party C can learn that node 1 corresponds to data party A (the second data party) from the recorded correspondence between node 1 and the data party.
In step S406, in a case that it is determined that the data party corresponding to the first parent node is the second data party, notifying a device of the second data party to divide the first object in the first parent node.
That is, after the device of data party C determines that node 1 corresponds to data party A, the device of data party C notifies the device of data party A to perform the division of the first user at node 1. In the notification, the device of data party C may identify the first user by a user identifier, such as the first user's identity identifier or a user number agreed between data party A and data party C. After receiving the notification, the device of data party A, because node 1 corresponds to f1 and data party A has each user's f1 value, can locally obtain the f1 value of the first user and divide the first user into node 2 or node 3 based on the feature split value of f1 at node 1. After completing the division of the first user, the device of data party A sends the division result to the device of data party C.
In step S408, the partitioning result of the first object at the first parent node is received from the device of the second data party, so as to continue the prediction for the first object.
In one embodiment, the device of data party A partitions the first user into node 2, which is a leaf node, so that the device of data party C, after receiving the partitioning result, may send the score of node 2 to the device of the model requester as the predicted value for the first user.
In another embodiment, the device of data party A partitions the first user into node 3, whose node data is held by data party C. The device of data party C can then partition the first user further, dividing the first user into node 4 or node 5 based on the locally stored split value of f3 for node 3 and the first user's f3 value.
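Putting steps S402-S408 together, the following sketch shows how the device of data party C might drive the traversal of the tree of FIG. 3, partitioning locally where it holds the node data and delegating to the owning party otherwise. The request_remote_partition stub stands in for the notification and response of steps S406-S408, and all names and values are illustrative assumptions.

```python
# Hypothetical prediction loop on the device of data party C for the
# tree of FIG. 3: node 1 is owned by party A, node 3 by party C, and
# nodes 2, 4 and 5 are leaves whose scores party C has recorded.

children = {1: (2, 3), 3: (4, 5)}        # parent -> (left, right) children
node_owner = {1: "A", 3: "C"}            # node -> data party owning its data
leaf_scores = {2: 0.7, 4: 0.2, 5: 0.9}   # recorded leaf scores
local_splits = {3: ("f3", 5.0)}          # split data held locally by party C
local_features = {"user_001": {"f3": 6.3}}

def request_remote_partition(owner: str, node: int, user_id: str) -> int:
    """Stub for the network exchange: the owning party's device partitions
    the user at `node` and returns only the resulting child node id."""
    return 3  # assume party A reports that the user falls into node 3

def predict(user_id: str) -> float:
    node = 1
    while node not in leaf_scores:       # descend until a leaf is reached
        if node_owner[node] == "C":      # node data held locally: partition here
            feature, split = local_splits[node]
            left, right = children[node]
            node = left if local_features[user_id][feature] <= split else right
        else:                            # otherwise ask the owning party
            node = request_remote_partition(node_owner[node], node, user_id)
    return leaf_scores[node]             # the leaf score is the predicted value

print(predict("user_001"))  # 1 -> 3 (via party A), 6.3 > 5.0 -> node 5: 0.9
```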
Fig. 5 illustrates a privacy-preserving tree model building apparatus 500 according to an embodiment of the present specification. The building data currently used for building the tree model is composed of data owned by the devices of at least two data parties and includes the feature values of M features of each of N samples. The at least two data parties include a first data party and a second data party, and the device of the first data party has the label values of the N samples. The tree model currently includes a first tree, the first tree currently includes a first node, and the apparatus is deployed, with respect to the first node, in the device of the first data party and includes:
a first obtaining unit 501, configured to obtain M groups of splitting results from the respective devices of the at least two data parties, where the M groups of splitting results correspond to the M features respectively, and each group of splitting results includes a plurality of splitting results obtained by splitting based on the feature values of the corresponding feature of the N samples;
a recording unit 502, configured to record the data parties corresponding to the M groups of splitting results;
a first calculating unit 503, configured to calculate the splitting gain of each splitting result in the M groups of splitting results based on the respective label values of the N samples;
a second obtaining unit 504, configured to obtain the splitting result having the maximum splitting gain as the optimal splitting result;
a first determining unit 505, configured to, when the splitting gain of the optimal splitting result is positive, determine the data party corresponding to the optimal splitting result based on the locally recorded data parties corresponding to the M groups of splitting results;
a sending unit 506, configured to, when the corresponding data party is the second data party, send the optimal splitting result to the device of the second data party and record the correspondence between the first node and the second data party;
a labeling unit 507, configured to label the first node to indicate that there is no node data of the first node locally, and to update the tree structure of the first tree accordingly.
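As a rough illustration of how the first calculating unit 503, the second obtaining unit 504 and the first determining unit 505 cooperate, the sketch below scores each received splitting result and selects the best one, recording which party it came from. Variance reduction over the label values is used here only as a stand-in gain; the present specification does not fix a particular gain formula, and all identifiers are assumptions.

```python
# Hypothetical split selection on the device of the first data party.
# Each splitting result is a pair (left sample indices, right sample
# indices) received from some party; the labels are held locally.

labels = [1.0, 0.0, 1.0, 1.0, 0.0, 0.0]  # label values of the N samples

def variance(idx):
    """Sum of squared deviations of the labels of the given samples."""
    if not idx:
        return 0.0
    mean = sum(labels[i] for i in idx) / len(idx)
    return sum((labels[i] - mean) ** 2 for i in idx)

def split_gain(left, right):
    """Variance reduction achieved by splitting the parent into left/right."""
    return variance(left + right) - variance(left) - variance(right)

# Splitting results received from the devices of parties A and B, keyed
# by the originating party (the role of the recording unit 502).
split_results = {
    "A": [([0, 1, 2], [3, 4, 5])],
    "B": [([0, 2, 3], [1, 4, 5])],
}

party, (left, right), gain = max(
    ((p, s, split_gain(*s))
     for p, splits in split_results.items() for s in splits),
    key=lambda t: t[2],
)
if gain > 0:   # positive gain: adopt the split and notify the owning party
    print(f"optimal split from party {party}, gain {gain:.3f}")
else:          # otherwise the first node becomes a leaf with a score
    print("first node becomes a leaf")
```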
In one embodiment, the apparatus further comprises a first notifying unit 508, configured to notify the devices of the data parties other than the first data party and the second data party among the at least two data parties, to instruct those devices to perform the labeling on the first node and update the tree structure of the first tree accordingly.
In one embodiment, the first data party further has the feature value of a first feature of each of the N samples. The apparatus further includes a second determining unit 509, configured to, when the corresponding data party is the first data party and the optimal splitting result is determined to be a splitting result corresponding to the first feature, determine the first feature as the feature corresponding to the first node and determine the splitting value corresponding to the optimal splitting result as the splitting value of the first feature at the first node; and a second notifying unit 510, configured to notify the devices of the data parties other than the first data party among the at least two data parties, to instruct those devices to perform the labeling on the first node and update the tree structure of the first tree accordingly.
In one embodiment, the apparatus further includes a second calculating unit 511, configured to calculate the score corresponding to the first node based on the respective label values of the N samples when the splitting gain of the optimal splitting result is less than or equal to zero; and a third notifying unit 512, configured to notify the devices of the data parties other than the first data party among the at least two data parties, to instruct those devices to perform the labeling on the first node.
In one embodiment, the first tree is the t-th tree in the tree model, and the first calculating unit is further configured to calculate the splitting gain of each splitting result in the M groups of splitting results based on the respective label values of the N samples and the previously acquired first t-1 trees of the tree model.
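For intuition, in gradient-boosted variants such as XGBoost the splitting gain for the t-th tree is derived from first- and second-order gradients of the loss evaluated at the accumulated predictions of the first t-1 trees. The sketch below assumes a squared-error loss; the regularization constant lam is illustrative, and the complexity penalty of the full XGBoost objective is omitted.

```python
# Hypothetical XGBoost-style gain for a candidate split of the t-th tree.
# grads/hessians are per-sample derivatives of the loss evaluated at the
# accumulated predictions of the previously built t-1 trees.

lam = 1.0  # illustrative L2 regularization on leaf weights

def xgb_gain(grads, hessians, left_idx, right_idx):
    def score(idx):
        g = sum(grads[i] for i in idx)
        h = sum(hessians[i] for i in idx)
        return g * g / (h + lam)
    return 0.5 * (score(left_idx) + score(right_idx)
                  - score(left_idx + right_idx))

# For a squared-error loss: grad_i = pred_i - label_i and hess_i = 1,
# where pred_i is the prediction of trees 1..t-1 for sample i.
labels = [1.0, 0.0, 1.0, 0.0]
preds = [0.6, 0.4, 0.7, 0.3]
grads = [p - y for p, y in zip(preds, labels)]
hessians = [1.0] * len(labels)

print(xgb_gain(grads, hessians, [0, 2], [1, 3]))  # gain of split {0,2}|{1,3}
```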
In one embodiment, the M groups of splitting results include at least two coinciding splitting results received from the respective devices of at least two data parties, and the first determining unit is further configured to, when the optimal splitting result is determined to correspond to at least two data parties based on the locally recorded data parties corresponding to the M groups of splitting results, randomly determine one of those data parties as the data party corresponding to the optimal splitting result.
In one embodiment, the tree model is any one of the following tree models: an XGBoost model, a GBDT model, or a random forest.
Fig. 6 illustrates a prediction apparatus 600 based on a tree model according to an embodiment of the present specification. The model data of the tree model is composed of data owned by the respective devices of at least two data parties, each of which has the tree structure of the tree model in its device. The at least two data parties include a first data party and a second data party, and the correspondence between each node in the tree model and its data party is recorded in the device of the first data party. The apparatus is deployed in the device of the first data party and includes:
a first receiving unit 61 configured to receive a model use request for a first object from a device of a model requester;
a determining unit 62 configured to, for a first parent node in a first tree in the tree model, determine a data party corresponding to the first parent node based on a correspondence between each node in the tree model and the data party;
a notifying unit 63, configured to, when it is determined that the data party corresponding to the first parent node is the second data party, notify the device of the second data party to partition the first object at the first parent node;
a second receiving unit 64, configured to receive the partitioning result of the first object at the first parent node from the device of the second data party, so as to continue the prediction for the first object.
In one embodiment, a first feature corresponding to the first parent node and the corresponding feature split value are recorded in the device of the first data party, and the device of the first data party further has the feature value of the first feature of the first object. The apparatus further includes a partitioning unit 65, configured to, when it is determined that the data party corresponding to the first parent node is the first data party, partition the first object into one child node of the first parent node based on the feature split value of the first parent node and the feature value of the first feature of the first object.
In one embodiment, the score of each leaf node in the tree model is further recorded in the device of the first data party. The partitioning result partitions the first object into a first child node of the first parent node, the first child node being a leaf node, and the apparatus further includes a sending unit 66, configured to send the score of the first child node to the device of the model requester as the prediction result.
Another aspect of the present specification provides a computer readable storage medium having a computer program stored thereon, which, when executed in a computer, causes the computer to perform any one of the above methods.
Another aspect of the present specification provides a computing device comprising a memory and a processor, wherein the memory stores executable code, and the processor implements any one of the above methods when executing the executable code.
According to the tree model construction scheme of the present specification, a tree model can be constructed jointly from the data of a plurality of data parties while the data of each party is protected from being leaked to the other parties. This expands the amount of data available for constructing the tree model and increases its prediction accuracy, while preserving the data security of each data party.
It is to be understood that the terms "first," "second," and the like herein are used only to distinguish between similar concepts for descriptive purposes, and are not intended as limitations.
The embodiments in the present specification are described in a progressive manner, and the same and similar parts among the embodiments are referred to each other, and each embodiment focuses on the differences from the other embodiments. In particular, for the system embodiment, since it is substantially similar to the method embodiment, the description is simple, and for the relevant points, reference may be made to the partial description of the method embodiment.
The foregoing description has been directed to specific embodiments of this disclosure. Other embodiments are within the scope of the following claims. In some cases, the actions or steps recited in the claims may be performed in a different order than in the embodiments and still achieve desirable results. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some embodiments, multitasking and parallel processing may also be possible or may be advantageous.
It will be further appreciated by those of ordinary skill in the art that the units and algorithm steps of the examples described in connection with the embodiments disclosed herein may be implemented in electronic hardware, computer software, or a combination of the two, and that the components and steps of the examples have been described generally in terms of their functionality in the foregoing description in order to clearly illustrate the interchangeability of hardware and software. Whether these functions are performed in hardware or software depends on the particular application and the design constraints of the technical solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
The steps of a method or algorithm described in connection with the embodiments disclosed herein may be embodied in hardware, in a software module executed by a processor, or in a combination of the two. A software module may reside in random access memory (RAM), read-only memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, a hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art.
The above-mentioned embodiments are intended to illustrate the objects, technical solutions and advantages of the present invention in further detail, and it should be understood that the above-mentioned embodiments are merely exemplary embodiments of the present invention, and are not intended to limit the scope of the present invention, and any modifications, equivalent substitutions, improvements and the like made within the spirit and principle of the present invention should be included in the scope of the present invention.

Claims (24)

1. A tree model building method, wherein building data currently used for building the tree model is composed of data owned by devices of at least two data parties, the building data includes feature values of M features of each of N samples, the at least two data parties include a first data party and a second data party, the device of the first data party has label values of the N samples, the tree model currently includes a first tree, the first tree currently includes a first node, and the method is performed by the device of the first data party with respect to the first node and includes:
obtaining M groups of splitting results from respective devices of the at least two data parties, wherein the M groups of splitting results correspond to the M features respectively, and each group of splitting results includes a plurality of splitting results obtained by splitting based on feature values of the corresponding feature of the N samples;
recording the data parties corresponding to the M groups of splitting results;
respectively calculating the splitting gain of each splitting result in the M groups of splitting results based on the label values of the N samples;
acquiring the splitting result having the maximum splitting gain as the optimal splitting result;
in a case where the splitting gain of the optimal splitting result is a positive value, determining the data party corresponding to the optimal splitting result based on the locally recorded data parties corresponding to the M groups of splitting results;
when the corresponding data party is the second data party, sending the optimal splitting result to the device of the second data party, and recording the correspondence between the first node and the second data party; and
labeling the first node to indicate that there is no node data of the first node locally, and updating the tree structure of the first tree accordingly.
2. The method of claim 1, wherein each of the M groups of splitting results comprises P splitting results corresponding to the respective feature, wherein P is determined based on the number of non-duplicate feature values of the respective feature included in the N samples.
3. The method of claim 1, further comprising notifying devices of data parties other than the first data party and the second data party among the at least two data parties, to instruct the devices of the other data parties to perform said labeling on the first node and update the tree structure of the first tree accordingly.
4. The method of claim 1, wherein the first data party further has a feature value of a first feature of each of the N samples, and the method further comprises: in a case where the corresponding data party is the first data party and the optimal splitting result is determined to be the splitting result corresponding to the first feature, determining the first feature as the feature corresponding to the first node and determining the splitting value corresponding to the optimal splitting result as the splitting value of the first feature of the first node; and notifying devices of data parties other than the first data party among the at least two data parties, to instruct the devices of the other data parties to perform said labeling on the first node and update the tree structure of the first tree accordingly.
5. The method of claim 1, further comprising, in a case where the splitting gain of the optimal splitting result is less than or equal to zero, calculating a score corresponding to the first node based on the respective label values of the N samples, and notifying devices of data parties other than the first data party among the at least two data parties to instruct the devices of the other data parties to perform said labeling on the first node.
6. The method according to claim 1, wherein the first tree is the t-th tree in the tree model, and wherein respectively calculating the splitting gain of each splitting result in the M groups of splitting results based on the respective label values of the N samples comprises: respectively calculating the splitting gain of each splitting result in the M groups of splitting results based on the respective label values of the N samples and the previously acquired first t-1 trees of the tree model.
7. The method of claim 1, wherein the M groups of splitting results include at least two coinciding splitting results received from respective devices of at least two data parties, and determining the data party corresponding to the optimal splitting result based on the locally recorded data parties corresponding to the M groups of splitting results comprises: in a case where the optimal splitting result is determined to correspond to at least two data parties based on the locally recorded data parties corresponding to the M groups of splitting results, randomly determining one data party from the at least two data parties as the data party corresponding to the optimal splitting result.
8. The method of claim 1, wherein the tree model is any one of: an XGBoost model, a GBDT model, or a random forest.
9. A prediction method based on a tree model, wherein model data of the tree model is composed of data owned by respective devices of at least two data parties, the device of each data party has the tree structure of the tree model, the at least two data parties include a first data party and a second data party, the correspondence between each node in the tree model and the data party is recorded in the device of the first data party, and the method is performed by the device of the first data party and includes:
receiving a model use request for a first object from a device of a model requestor;
for a first parent node in a first tree in the tree model, determining the data party corresponding to the first parent node based on the correspondence between each node in the tree model and the data party;
when it is determined that the data party corresponding to the first parent node is the second data party, notifying the device of the second data party to partition the first object at the first parent node;
receiving the partitioning result of the first object at the first parent node from the device of the second data party, so as to continue the prediction of the first object.
10. The method of claim 9, wherein a first feature corresponding to the first parent node and a corresponding feature split value are recorded in the device of the first data party, the device of the first data party further has a feature value of the first feature of the first object, and the method further comprises, in a case where it is determined that the data party corresponding to the first parent node is the first data party, partitioning the first object into one child node of the first parent node based on the feature split value of the first parent node and the feature value of the first feature of the first object.
11. The method of claim 10, wherein the device of the first data party further records the score of each leaf node in the tree model, the partitioning result partitions the first object into a first child node of the first parent node, the first child node being a leaf node, and the method further comprises sending the score of the first child node to the device of the model requester as a prediction result.
12. A tree model building apparatus, wherein building data currently used for building the tree model is composed of data owned by respective devices of at least two data parties, the building data includes feature values of M features of each of N samples, the at least two data parties include a first data party and a second data party, the device of the first data party has label values of the N samples, the tree model currently includes a first tree, the first tree currently includes a first node, and the apparatus is deployed in the device of the first data party with respect to the first node and includes:
a first obtaining unit, configured to obtain M groups of splitting results from respective devices of the at least two data parties, wherein the M groups of splitting results correspond to the M features respectively, and each group of splitting results includes a plurality of splitting results obtained by splitting based on feature values of the corresponding feature of the N samples;
a recording unit, configured to record the data parties corresponding to the M groups of splitting results;
a first calculating unit, configured to calculate the splitting gain of each splitting result in the M groups of splitting results based on the respective label values of the N samples;
a second obtaining unit, configured to obtain the splitting result having the maximum splitting gain as the optimal splitting result;
a first determining unit, configured to, in a case where the splitting gain of the optimal splitting result is a positive value, determine the data party corresponding to the optimal splitting result based on the locally recorded data parties corresponding to the M groups of splitting results;
a sending unit, configured to, when the corresponding data party is the second data party, send the optimal splitting result to the device of the second data party and record the correspondence between the first node and the second data party; and
a labeling unit, configured to label the first node to indicate that there is no node data of the first node locally, and to update the tree structure of the first tree accordingly.
13. The apparatus of claim 12, wherein each of the M groups of splitting results comprises P splitting results corresponding to the respective feature, wherein P is determined based on the number of non-duplicate feature values of the respective feature included in the N samples.
14. The apparatus of claim 12, further comprising a first notifying unit, configured to notify devices of data parties other than the first data party and the second data party among the at least two data parties, to instruct the devices of the other data parties to perform the labeling on the first node and update the tree structure of the first tree accordingly.
15. The apparatus according to claim 12, wherein the first data party further has a feature value of a first feature of each of the N samples, and the apparatus further comprises: a second determining unit, configured to, in a case where the corresponding data party is the first data party and it is determined that the optimal splitting result is a splitting result corresponding to the first feature, determine the first feature as the feature corresponding to the first node and determine the splitting value corresponding to the optimal splitting result as the splitting value of the first feature of the first node; and a second notification unit, configured to notify devices of data parties other than the first data party among the at least two data parties, to instruct the devices of the other data parties to perform the labeling on the first node and update the tree structure of the first tree accordingly.
16. The apparatus according to claim 12, further comprising: a second calculating unit, configured to calculate a score corresponding to the first node based on the label values of the N samples in a case where the splitting gain of the optimal splitting result is less than or equal to zero; and a third notifying unit, configured to notify devices of data parties other than the first data party among the at least two data parties to instruct the devices of the other data parties to perform the labeling on the first node.
17. The apparatus according to claim 12, wherein the first tree is the t-th tree in the tree model, and wherein the first calculating unit is further configured to calculate the splitting gain of each splitting result in the M groups of splitting results based on the respective label values of the N samples and the previously acquired first t-1 trees of the tree model.
18. The apparatus according to claim 12, wherein the M groups of splitting results include at least two coinciding splitting results received from respective devices of at least two data parties, and the first determining unit is further configured to, in a case where the optimal splitting result is determined to correspond to at least two data parties based on the locally recorded data parties corresponding to the M groups of splitting results, randomly determine one data party from the at least two data parties as the data party corresponding to the optimal splitting result.
19. The apparatus of claim 12, wherein the tree model is any one of: an XGBoost model, a GBDT model, or a random forest.
20. A prediction apparatus based on a tree model, wherein model data of the tree model is composed of data owned by respective devices of at least two data parties, the device of each data party has the tree structure of the tree model, the at least two data parties include a first data party and a second data party, the correspondence between each node in the tree model and the data party is recorded in the device of the first data party, and the apparatus is deployed in the device of the first data party and includes:
a first receiving unit configured to receive a model use request for a first object from a device of a model requester;
a determining unit, configured to, for a first parent node in a first tree in the tree model, determine the data party corresponding to the first parent node based on the correspondence between each node in the tree model and the data party;
a notifying unit, configured to, when it is determined that the data party corresponding to the first parent node is the second data party, notify the device of the second data party to partition the first object at the first parent node;
a second receiving unit, configured to receive the partitioning result of the first object at the first parent node from the device of the second data party, so as to continue the prediction of the first object.
21. The apparatus according to claim 20, wherein a first feature corresponding to the first parent node and a corresponding feature split value are recorded in the device of the first data party, the device of the first data party further has a feature value of the first feature of the first object, and the apparatus further comprises a partitioning unit, configured to, in a case where it is determined that the data party corresponding to the first parent node is the first data party, partition the first object into one child node of the first parent node based on the feature split value of the first parent node and the feature value of the first feature of the first object.
22. The apparatus according to claim 21, wherein the device of the first data party further records the score of each leaf node in the tree model, the partitioning result partitions the first object into a first child node of the first parent node, the first child node being a leaf node, and the apparatus further comprises a transmitting unit, configured to send the score of the first child node to the device of the model requester as a prediction result.
23. A computer-readable storage medium, on which a computer program is stored which, when executed in a computer, causes the computer to carry out the method of any one of claims 1-11.
24. A computing device comprising a memory and a processor, wherein the memory has stored therein executable code that, when executed by the processor, performs the method of any of claims 1-11.