CN106611191A - Decision tree classifier construction method based on uncertain continuous attributes - Google Patents
- Publication number
- CN106611191A (application CN201610542924.3A)
- Authority
- CN
- China
- Prior art keywords
- attribute
- uncertain
- class
- probability
- node
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
Landscapes
- Engineering & Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Theoretical Computer Science (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Bioinformatics & Computational Biology (AREA)
- Artificial Intelligence (AREA)
- Evolutionary Biology (AREA)
- Evolutionary Computation (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Life Sciences & Earth Sciences (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention discloses a decision tree classifier construction method based on uncertain continuous attributes. The method comprises the following steps: classifying X samples that contain uncertain continuous data; sorting and merging the attribute values $S_{ij}$ of each uncertain continuous attribute $S_i$; operating on the attribute values $S_{ij}$ of $S_i$ class by class to obtain the probability sum $P(S_{ij})$; processing the classes to obtain the probability cardinality $P(S_{ij}, L_r)$ of each branch attribute value; creating a decision tree in which the split attribute $splitS_i$ is selected according to an objective function defined by the invention; and stopping tree growth when a stopping condition is met. The constructed decision tree better avoids the problem of information being biased toward large orders of magnitude; it provides classification and prediction for objects with uncertain continuous attributes; it has high classification accuracy; and it is better suited to practical data mining problems.
Description
Technical field
The present invention relates to machine learning, artificial intelligence and data mining, and in particular to a decision tree classifier construction method based on uncertain continuous attributes.
Background
Decision tree research is an important and active topic in data mining and machine learning. The algorithms it has produced, such as ID3, CART and C4.5, are widely used in practical problems and mainly address classification accuracy. With advances in science and technology, uncertain data has in recent years appeared frequently in real-world applications, including wireless sensor networks, radio-frequency identification and privacy protection. The defining characteristic of such data is that a value is not fixed: a data point is represented as a whole by a set of possible values, each occurring with a probability given by some probability distribution. Its growing prevalence has drawn wide attention to related research and driven its rapid development. Traditional data mining techniques, however, often ignore the uncertainty in the data, so the models they learn do not match the objective world. Uncertain data mining is therefore of great significance to the practical application of data mining. Attribute uncertainty covers both categorical attributes and continuous attributes. During training and testing, so that the model can be evaluated effectively and the performance of the classifier verified, the classes in the data set are all taken to be certain. The focus here is classification and prediction on uncertain continuous attributes. In addition, because of the limited precision of measuring instruments, collected data usually contains some error and is not entirely accurate. To improve classification accuracy on uncertain continuous attributes, the present invention proposes a decision tree classifier construction method based on uncertain continuous attributes.
Summary of the invention
To solve the problem of classification and prediction on uncertain continuous attributes, and to improve the accuracy of that classification and prediction, the present invention proposes a decision tree classifier construction method based on uncertain continuous attributes.
To solve the above problems, the present invention is realized through the following technical solution:
The decision tree classifier construction method based on uncertain continuous attributes comprises the following steps:
Step 1: Suppose the uncertain continuous attribute training set contains X samples and n attributes $(S_1, S_2, \ldots, S_n)$, and the split attribute $S_i$ corresponds to m classes L, where $L_r \in (L_1, L_2, \ldots, L_m)$, $i \in (1, 2, \ldots, n)$, $r \in (1, 2, \ldots, m)$ and $S_i \in (S_1, S_2, \ldots, S_n)$; the attribute values are continuous and uncertain.
Step 2: Sort and merge the attribute values $S_{ij}$ of each uncertain continuous attribute $S_i$. Then, class by class, operate on the attribute values $S_{ij}$ of $S_i$ to obtain the probability sum $P(S_{ij})$, and process the classes to obtain the probability cardinality $P(S_{ij}, L_r)$ of each branch attribute value.
Step 3: Create the root node G.
Step 4: If the training data set is empty, return node G marked as a failure.
Step 5: If all records in the training data set belong to the same class, mark node G with that class.
Step 6: If the candidate attribute set is empty, return G as a leaf node labeled with the most common class in the training data set.
Step 7: Because the continuous attributes are uncertain, select $splitS_i$ from the candidate attributes according to the objective function $f(S_i)$.
Step 8: Mark node G with attribute $splitS_i$.
Step 9: From the node, extend the branch satisfying $splitS = splitS_i$ and the sub-branches satisfying $splitS_i = splitS_{ij}$; stop growing the tree if either of the following two conditions holds.
9.1 Let $Y_i$ be the set of samples in the training data set with $splitS = splitS_i$. If $Y_i$ is empty, add a leaf node labeled with the most common class in the training data set.
9.2 All examples in this node belong to the same class.
Step 10: If neither 9.1 nor 9.2 applies, recursively apply Steps 7 to 9.
Step 11: Save the generated decision tree classifier for continuous uncertain attributes.
The beneficial effects of the present invention are:
1. The constructed decision tree better avoids the problem of information being biased toward large orders of magnitude.
2. It provides classification and prediction for objects with uncertain continuous attributes.
3. The constructed decision tree has high classification accuracy.
4. The constructed decision tree is better suited to practical data mining problems.
Brief description of the drawings
Fig. 1 is a flow chart of constructing the decision tree classifier based on uncertain continuous attributes.
Detailed description
To solve the problem of classification and prediction on uncertain continuous attributes, and to improve the accuracy of that classification and prediction, the present invention is described in detail with reference to Fig. 1. The specific implementation steps are as follows:
Step 1: Suppose the uncertain continuous attribute training set contains X samples and n attributes $(S_1, S_2, \ldots, S_n)$, and the split attribute $S_i$ corresponds to m classes L, where $L_r \in (L_1, L_2, \ldots, L_m)$, $i \in (1, 2, \ldots, n)$, $r \in (1, 2, \ldots, m)$ and $S_i \in (S_1, S_2, \ldots, S_n)$; the attribute values are uncertain.
Step 2: Sort and merge the attribute values $S_{ij}$ of each uncertain continuous attribute $S_i$. Then, class by class, operate on the attribute values $S_{ij}$ of $S_i$ to obtain the probability sum $P(S_{ij})$, and process the classes to obtain the probability cardinality $P(S_{ij}, L_r)$ of each branch attribute value. The specific procedure is as follows:
For an uncertain continuous attribute $S_i$, the value $P(S_{ij})$ is a probability vector, written:
$P(S_{ij}) \in (P(S_{i1}), P(S_{i2}), \ldots, P(S_{ik}))$, with $\sum_{j=1}^{k} P(S_{ij}) = 1$. The certain continuous attributes considered previously are therefore a special case of this representation, namely the case where one attribute value of $S_i$ has $P(S_{ij}) = 1$ and all other probabilities are 0.
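The special case described above can be illustrated with a minimal sketch. It assumes a probability vector is stored as a plain Python list; the helper names `is_probability_vector` and `certain_value_vector` are hypothetical and do not come from the patent.

```python
import math

def is_probability_vector(p, tol=1e-9):
    """True when every entry lies in [0, 1] and the entries sum to 1."""
    in_range = all(-tol <= x <= 1.0 + tol for x in p)
    return in_range and math.isclose(sum(p), 1.0, abs_tol=tol)

def certain_value_vector(k, j):
    """Vector for a certain (non-uncertain) value: all probability on index j."""
    return [1.0 if idx == j else 0.0 for idx in range(k)]

uncertain = [0.2, 0.5, 0.3]            # an uncertain value spread over k = 3 entries
certain = certain_value_vector(3, 1)   # a certain value as the special case

print(is_probability_vector(uncertain))
print(certain, is_probability_vector(certain))
```

Treating certain values as one-hot vectors means the same downstream code can handle certain and uncertain attributes without a separate branch.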
Each uncertain continuous attribute value has a maximum endpoint and a minimum endpoint on its interval. One purpose of this interval representation is to protect customer privacy. An uncertain continuous attribute is processed by sorting the key points that occur in ascending order and merging repeated points, so that the whole range of the attribute is split into multiple subintervals.
The interval of each sample may cover one or more subintervals.
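The sorting and merging step can be sketched as follows, assuming that the key points are the interval endpoints of the samples and that each uncertain value is given as a `(low, high)` tuple. The function names are hypothetical.

```python
def split_into_subintervals(intervals):
    """Sort all endpoints in ascending order, merge repeated points,
    and return the consecutive subintervals between them."""
    points = sorted({p for low, high in intervals for p in (low, high)})
    return list(zip(points[:-1], points[1:]))

def covered_subintervals(interval, subintervals):
    """Subintervals that lie inside the interval of one sample."""
    low, high = interval
    return [(a, b) for a, b in subintervals if low <= a and b <= high]

samples = [(1.0, 4.0), (2.0, 6.0), (4.0, 8.0)]
subs = split_into_subintervals(samples)
print(subs)
print(covered_subintervals((2.0, 6.0), subs))
```

Using a set before sorting is what merges the repeated points; here the endpoint 4.0 occurs twice in the input but only once among the key points.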
Suppose any one subinterval is $[\alpha, \beta]$; its probability distribution on the interval $[\alpha, \beta]$ is:
In the above formula, $f(x)$ is the probability distribution of the event. It is analyzed case by case and determined by the relevant expert; the probability distribution functions of the uncertain continuous attributes are generally the same.
The probability sum $P(S_{ij})$ of attribute $S_i$ on the interval $[\alpha, \beta]$ is:
In the above formula, $j \in (1, 2, \ldots, k)$, where k is the number of attribute values of attribute $S_i$.
The probability cardinality $P(S_{ij}, L_r)$ of class $L_r$ for attribute value $S_{ij}$ on the interval $[\alpha, \beta]$ is:
The above formula refers to the kinds of class corresponding to attribute value $S_{ij}$ and to the class $L_r$ among the classes corresponding to $S_{ij}$.
$P(S_{ij})$ is the probability of attribute value $S_{ij}$ taken by class, and there are m classes L in total.
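The formulas for this step are not preserved in the text, so the following is only an illustrative sketch under stated assumptions: $f(x)$ is taken to be uniform on each sample interval, the probability sum of a subinterval is the probability mass of all samples falling in it, and the probability cardinality for class $L_r$ is the same sum restricted to samples of that class. The function names are hypothetical.

```python
from collections import defaultdict

def uniform_mass(interval, sub):
    """Probability that a value uniform on `interval` falls inside `sub`.
    Assumes low < high; the uniform pdf is an assumption, not the patent's f(x)."""
    low, high = interval
    if high <= low:
        raise ValueError("interval must have positive length")
    a, b = sub
    overlap = max(0.0, min(high, b) - max(low, a))
    return overlap / (high - low)

def probability_cardinalities(samples, subintervals):
    """samples: list of ((low, high), class_label).
    Returns (total, per_class): summed mass per subinterval, overall and per class."""
    total = [0.0] * len(subintervals)
    per_class = defaultdict(lambda: [0.0] * len(subintervals))
    for interval, label in samples:
        for j, sub in enumerate(subintervals):
            mass = uniform_mass(interval, sub)
            total[j] += mass
            per_class[label][j] += mass
    return total, dict(per_class)

subintervals = [(1.0, 2.0), (2.0, 4.0), (4.0, 6.0), (6.0, 8.0)]
samples = [((1.0, 4.0), "A"), ((2.0, 6.0), "B"), ((4.0, 8.0), "B")]
total, per_class = probability_cardinalities(samples, subintervals)
print(total)
print(per_class)
```

Because every sample distributes a total mass of 1 over the subintervals, the entries of `total` add up to the number of samples, which is a quick sanity check for any alternative choice of distribution.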
Step 3: Create the root node G.
Step 4: If the training data set is empty, return node G marked as a failure.
Step 5: If all records in the training data set belong to the same class, mark node G with that class.
Step 6: If the candidate attribute set is empty, return G as a leaf node labeled with the most common class in the training data set.
Step 7: Because the continuous attributes are uncertain, select $splitS_i$ from the candidate attributes according to the following objective function $f(S_i)$. The specific calculation is as follows:
Objective function $f(S_i)$:
In the above formula, $L_r$, with $r \in (1, 2, \ldots, m)$, is a class; $P(S_{ij})$ is the probability sum of attribute $S_i$ from Step 2; $P(S_{ij}, L_r)$ is the probability cardinality of attribute value $S_{ij}$ of attribute $S_i$ with respect to class $L_r$ from Step 2; and j indexes the attribute values.
The attribute $splitS_i$ that makes the objective function $f(S_i)$ largest is the one used to mark node G.
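The selection rule in Step 7 is an argmax over the candidate attributes. Since the objective function itself is not preserved in the text, the sketch below plugs in a stand-in score, the negative cardinality-weighted class entropy, purely for illustration; it is not the patent's $f(S_i)$. All names are hypothetical.

```python
import math

def negative_weighted_entropy(per_class):
    """Stand-in score (NOT the patent's f(S_i)).
    per_class maps class label -> list of probability cardinalities,
    one entry per attribute value (subinterval)."""
    k = len(next(iter(per_class.values())))
    totals = [sum(card[j] for card in per_class.values()) for j in range(k)]
    grand = sum(totals)
    score = 0.0
    for j in range(k):
        if totals[j] == 0:
            continue
        entropy = 0.0
        for card in per_class.values():
            p = card[j] / totals[j]
            if p > 0:
                entropy -= p * math.log2(p)
        score -= (totals[j] / grand) * entropy
    return score

def select_split_attribute(candidates, objective):
    """Return the candidate attribute whose objective value is largest."""
    return max(sorted(candidates), key=lambda name: objective(candidates[name]))

candidates = {
    "S1": {"A": [2.0, 0.0], "B": [0.0, 2.0]},  # classes perfectly separated
    "S2": {"A": [1.0, 1.0], "B": [1.0, 1.0]},  # classes fully mixed
}
best = select_split_attribute(candidates, negative_weighted_entropy)
print(best)
```

Passing the objective as a parameter keeps the selection logic independent of the scoring formula, so the actual $f(S_i)$ can be substituted once it is available.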
Step 8: Mark node G with attribute $splitS_i$.
Step 9: From the node, extend the branch satisfying $splitS = splitS_i$ and the sub-branches satisfying $splitS_i = splitS_{ij}$; stop growing the tree if either of the following two conditions holds.
9.1 Let $Y_i$ be the set of samples in the training data set with $splitS = splitS_i$. If $Y_i$ is empty, add a leaf node labeled with the most common class in the training data set.
9.2 All examples in this node belong to the same class.
The leaf node of a branch is determined by first comparing against the training set and then by the magnitude of $P(S_{ij}, L_r)$, i.e.,
The leaf node can be determined by the above formula.
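One plausible reading of this rule, stated here as an assumption because the formula is not preserved, is that a leaf takes the class whose probability cardinality $P(S_{ij}, L_r)$ is largest for the branch. A minimal sketch with a hypothetical `leaf_label` helper:

```python
def leaf_label(class_cardinality):
    """Return the class with the largest probability cardinality.
    Ties are broken by taking the alphabetically smallest label."""
    return max(sorted(class_cardinality), key=class_cardinality.get)

print(leaf_label({"A": 0.4, "B": 1.6}))
print(leaf_label({"B": 1.0, "A": 1.0}))
```

Sorting the labels first makes the tie-break deterministic, which keeps repeated runs on the same data from producing different trees.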
Step 10: If neither 9.1 nor 9.2 applies, recursively apply Steps 7 to 9.
Step 11: Save the generated decision tree classifier for uncertain continuous data.
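The patent does not say how the classifier is saved. As one possible approach, assuming the tree is held as nested dictionaries and lists (a hypothetical layout), it can be written to and read back from a JSON file:

```python
import json
import os
import tempfile

def save_tree(tree, path):
    """Write the tree to a JSON file."""
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(tree, fh, indent=2)

def load_tree(path):
    """Read a tree previously written by save_tree."""
    with open(path, encoding="utf-8") as fh:
        return json.load(fh)

# Hypothetical tree layout: one split attribute with two interval branches.
tree = {
    "attribute": "S1",
    "branches": [
        {"interval": [1.0, 4.0], "leaf": "A"},
        {"interval": [4.0, 8.0], "leaf": "B"},
    ],
}

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "uncertain_tree.json")
    save_tree(tree, path)
    restored = load_tree(path)

print(restored == tree)
```

Intervals are stored as lists rather than tuples because JSON has no tuple type; a tuple would come back as a list and break the equality check.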
The pseudocode of the decision tree classifier construction method based on uncertain continuous attributes is as follows:
Input: the uncertain continuous data training sample set X.
Output: the decision tree classifier for uncertain continuous attributes.
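Since the pseudocode body is not preserved, Steps 3 to 10 can be sketched in Python as follows. This is an illustration under several assumptions, not the patent's algorithm: values are `(low, high)` intervals with `low < high` and a uniform distribution, samples carry fractional weights, and `stand_in_objective` (negative weighted class entropy) replaces the unavailable $f(S_i)$. All names are hypothetical.

```python
import math
from collections import defaultdict

def mass(interval, sub):
    """Share of a uniform distribution on `interval` that falls inside `sub`."""
    (low, high), (a, b) = interval, sub
    return max(0.0, min(high, b) - max(low, a)) / (high - low)

def subintervals(samples, attr):
    """Sorted, de-duplicated endpoints of `attr` as consecutive subintervals."""
    pts = sorted({p for s in samples for p in s["attrs"][attr]})
    return list(zip(pts[:-1], pts[1:]))

def class_weights(samples):
    weights = defaultdict(float)
    for s in samples:
        weights[s["label"]] += s["weight"]
    return weights

def majority(samples):
    weights = class_weights(samples)
    return max(sorted(weights), key=weights.get)

def partition(samples, attr):
    """Distribute each sample over the subintervals of `attr`, scaling its weight."""
    parts = []
    for sub in subintervals(samples, attr):
        subset = []
        for s in samples:
            m = mass(s["attrs"][attr], sub)
            if m > 0:
                subset.append({**s, "weight": s["weight"] * m})
        parts.append((sub, subset))
    return parts

def stand_in_objective(samples, attr):
    """Placeholder for f(S_i): minus the weighted class entropy after the split."""
    total = sum(s["weight"] for s in samples)
    score = 0.0
    for _, subset in partition(samples, attr):
        weights = class_weights(subset)
        size = sum(weights.values())
        for w in weights.values():
            if w > 0:
                score += (size / total) * (w / size) * math.log2(w / size)
    return score

def build_tree(samples, candidates, objective=stand_in_objective):
    if not samples:                                    # Step 4: empty training set
        return {"leaf": None, "failed": True}
    labels = {s["label"] for s in samples}
    if len(labels) == 1:                               # Step 5 / 9.2: single class
        return {"leaf": labels.pop()}
    if not candidates:                                 # Step 6: no attributes left
        return {"leaf": majority(samples)}
    best = max(sorted(candidates), key=lambda a: objective(samples, a))  # Step 7
    node = {"attribute": best, "branches": []}         # Step 8
    rest = [a for a in candidates if a != best]
    for sub, subset in partition(samples, best):       # Step 9
        if not subset:                                 # 9.1: empty branch
            child = {"leaf": majority(samples)}
        else:                                          # Step 10: recurse
            child = build_tree(subset, rest, objective)
        node["branches"].append({"interval": sub, "child": child})
    return node

samples = [
    {"attrs": {"temp": (1.0, 3.0)}, "label": "low", "weight": 1.0},
    {"attrs": {"temp": (2.0, 3.0)}, "label": "low", "weight": 1.0},
    {"attrs": {"temp": (5.0, 7.0)}, "label": "high", "weight": 1.0},
]
tree = build_tree(samples, ["temp"])
print(tree)
```

In the example, the subinterval (3.0, 5.0) receives no samples, so condition 9.1 applies and that branch becomes a leaf labeled with the majority class of the parent node.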
Claims (4)
1. A decision tree classifier construction method based on uncertain continuous attributes, relating to machine learning, artificial intelligence and data mining, characterized in that it comprises the following steps:
Step 1: Suppose the uncertain continuous attribute training set contains X samples and n attributes $(S_1, S_2, \ldots, S_n)$, and the split attribute $S_i$ corresponds to m classes L, where $L_r \in (L_1, L_2, \ldots, L_m)$, $i \in (1, 2, \ldots, n)$, $r \in (1, 2, \ldots, m)$ and $S_i \in (S_1, S_2, \ldots, S_n)$; the attribute values are continuous and uncertain.
Step 2: Sort and merge the attribute values $S_{ij}$ of each uncertain continuous attribute $S_i$. Then, class by class, operate on the attribute values $S_{ij}$ of $S_i$ to obtain the probability sum $P(S_{ij})$, and process the classes to obtain the probability cardinality $P(S_{ij}, L_r)$ of each branch attribute value.
Step 3: Create the root node G.
Step 4: If the training data set is empty, return node G marked as a failure.
Step 5: If all records in the training data set belong to the same class, mark node G with that class.
Step 6: If the candidate attribute set is empty, return G as a leaf node labeled with the most common class in the training data set.
Step 7: Because the continuous attributes are uncertain, select $splitS_i$ from the candidate attributes according to the objective function $f(S_i)$.
Step 8: Mark node G with attribute $splitS_i$.
Step 9: From the node, extend the branch satisfying $splitS = splitS_i$ and the sub-branches satisfying $splitS_i = splitS_{ij}$; stop growing the tree if either of the following two conditions holds.
9.1 Let $Y_i$ be the set of samples in the training data set with $splitS = splitS_i$. If $Y_i$ is empty, add a leaf node labeled with the most common class in the training data set.
9.2 All examples in this node belong to the same class.
Step 10: If neither 9.1 nor 9.2 applies, recursively apply Steps 7 to 9.
Step 11: Save the generated decision tree classifier for continuous uncertain attributes.
2. The decision tree classifier construction method based on uncertain continuous attributes according to claim 1, characterized in that the specific procedure of Step 2 is as follows:
Step 2: Sort and merge the attribute values $S_{ij}$ of each uncertain continuous attribute $S_i$. Then, class by class, operate on the attribute values $S_{ij}$ of $S_i$ to obtain the probability sum $P(S_{ij})$, and process the classes to obtain the probability cardinality $P(S_{ij}, L_r)$ of each branch attribute value. The specific procedure is as follows:
For an uncertain continuous attribute $S_i$, the value $P(S_{ij})$ is a probability vector, written
$P(S_{ij}) \in (P(S_{i1}), P(S_{i2}), \ldots, P(S_{ik}))$, with $\sum_{j=1}^{k} P(S_{ij}) = 1$. The certain continuous attributes considered previously are therefore a special case of this representation, namely the case where one attribute value of $S_i$ has $P(S_{ij}) = 1$ and all other probabilities are 0.
Each uncertain continuous attribute value has a maximum endpoint and a minimum endpoint on its interval. One purpose of this interval representation is to protect customer privacy. An uncertain continuous attribute is processed by sorting the key points that occur in ascending order and merging repeated points, so that the whole range of the attribute is split into multiple subintervals.
The interval of each sample may cover one or more subintervals.
Suppose any one subinterval is $[\alpha, \beta]$; its probability distribution on the interval $[\alpha, \beta]$ is:
In the above formula, $f(x)$ is the probability distribution of the event. It is analyzed case by case and determined by the relevant expert; the probability distribution functions of the uncertain continuous attributes are generally the same.
The probability sum $P(S_{ij})$ of attribute $S_i$ on the interval $[\alpha, \beta]$ is:
In the above formula, $j \in (1, 2, \ldots, k)$, where k is the number of attribute values of attribute $S_i$.
The probability cardinality $P(S_{ij}, L_r)$ of class $L_r$ for attribute value $S_{ij}$ on the interval $[\alpha, \beta]$ is:
The above formula refers to the kinds of class corresponding to attribute value $S_{ij}$ and to the class $L_r$ among the classes corresponding to $S_{ij}$.
$P(S_{ij})$ is the probability of attribute value $S_{ij}$ taken by class, and there are m classes L in total.
3. The decision tree classifier construction method based on uncertain continuous attributes according to claim 1, characterized in that the specific calculation in Step 7 is as follows:
Step 7: Because the continuous attributes are uncertain, select $splitS_i$ from the candidate attributes according to the following objective function $f(S_i)$. The specific calculation is as follows:
Objective function $f(S_i)$:
In the above formula, $L_r$, with $r \in (1, 2, \ldots, m)$, is a class; $P(S_{ij})$ is the probability sum of attribute $S_i$ from Step 2; $P(S_{ij}, L_r)$ is the probability cardinality of attribute value $S_{ij}$ of attribute $S_i$ with respect to class $L_r$ from Step 2; and j indexes the attribute values.
The attribute $splitS_i$ that makes the objective function $f(S_i)$ largest is the one used to mark node G.
4. The decision tree classifier construction method based on uncertain continuous attributes according to claim 1, characterized in that the specific procedure of Step 9 is as follows:
Step 9: From the node, extend the branch satisfying $splitS = splitS_i$ and the sub-branches satisfying $splitS_i = splitS_{ij}$; stop growing the tree if either of the following two conditions holds.
9.1 Let $Y_i$ be the set of samples in the training data set with $splitS = splitS_i$. If $Y_i$ is empty, add a leaf node labeled with the most common class in the training data set.
9.2 All examples in this node belong to the same class.
The leaf node of a branch is determined by first comparing against the training set and then by the magnitude of $P(S_{ij}, L_r)$, i.e.,
The leaf node can be determined by the above formula.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610542924.3A CN106611191A (en) | 2016-07-11 | 2016-07-11 | Decision tree classifier construction method based on uncertain continuous attributes |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610542924.3A CN106611191A (en) | 2016-07-11 | 2016-07-11 | Decision tree classifier construction method based on uncertain continuous attributes |
Publications (1)
Publication Number | Publication Date |
---|---|
CN106611191A true CN106611191A (en) | 2017-05-03 |
Family
ID=58615367
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201610542924.3A Pending CN106611191A (en) | 2016-07-11 | 2016-07-11 | Decision tree classifier construction method based on uncertain continuous attributes |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN106611191A (en) |
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2019223384A1 (en) * | 2018-05-21 | 2019-11-28 | 阿里巴巴集团控股有限公司 | Feature interpretation method and device for gbdt model |
CN111670445A (en) * | 2018-01-31 | 2020-09-15 | Asml荷兰有限公司 | Substrate marking method based on process parameters |
CN113032463A (en) * | 2021-04-01 | 2021-06-25 | 河南向量智能科技研究院有限公司 | Mining method of consensus data in product collaborative design block chain |
US20210406222A1 (en) * | 2017-08-30 | 2021-12-30 | Jpmorgan Chase Bank, N.A. | System and method for identifying business logic and data lineage with machine learning |
Cited By (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20210406222A1 (en) * | 2017-08-30 | 2021-12-30 | Jpmorgan Chase Bank, N.A. | System and method for identifying business logic and data lineage with machine learning |
US11860827B2 (en) * | 2017-08-30 | 2024-01-02 | Jpmorgan Chase Bank, N.A. | System and method for identifying business logic and data lineage with machine learning |
CN111670445A (en) * | 2018-01-31 | 2020-09-15 | Asml荷兰有限公司 | Substrate marking method based on process parameters |
CN111670445B (en) * | 2018-01-31 | 2024-03-22 | Asml荷兰有限公司 | Substrate marking method based on process parameters |
WO2019223384A1 (en) * | 2018-05-21 | 2019-11-28 | 阿里巴巴集团控股有限公司 | Feature interpretation method and device for gbdt model |
US11205129B2 (en) | 2018-05-21 | 2021-12-21 | Advanced New Technologies Co., Ltd. | GBDT model feature interpretation method and apparatus |
CN113032463A (en) * | 2021-04-01 | 2021-06-25 | 河南向量智能科技研究院有限公司 | Mining method of consensus data in product collaborative design block chain |
CN113032463B (en) * | 2021-04-01 | 2024-03-15 | 河南向量智能科技研究院有限公司 | Mining method for co-data in product collaborative design block chain |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN106228398A (en) | Specific user's digging system based on C4.5 decision Tree algorithms and method thereof | |
Suvanto et al. | High-resolution mapping of forest vulnerability to wind for disturbance-aware forestry | |
Subburayalu et al. | Disaggregation of component soil series on an Ohio County soil survey map using possibilistic decision trees | |
Danandeh Mehr | Drought classification using gradient boosting decision tree | |
CN106611191A (en) | Decision tree classifier construction method based on uncertain continuous attributes | |
CN109218223A (en) | A kind of robustness net flow assorted method and system based on Active Learning | |
CN111274301B (en) | Intelligent management method and system based on data assets | |
Coulibaly et al. | Rule-based machine learning for knowledge discovering in weather data | |
Kirichenko et al. | Machine learning classification of multifractional Brownian motion realizations | |
Ferreira et al. | The effect of time series distance functions on functional climate networks | |
CN112087316B (en) | Network anomaly root cause positioning method based on anomaly data analysis | |
Haribabu et al. | Prediction of flood by rainf all using MLP classifier of neural network model | |
Ghasemian et al. | Application of a novel hybrid machine learning algorithm in shallow landslide susceptibility mapping in a mountainous area | |
CN117349748A (en) | Active learning fault diagnosis method based on cloud edge cooperation | |
CN118233135A (en) | Network traffic anomaly detection method based on isolated forest algorithm | |
CN110716957B (en) | Intelligent mining and analyzing method for class case suspicious objects | |
Shaji et al. | Weather Prediction Using Machine Learning Algorithms | |
Viswambari et al. | Data mining techniques to predict weather: a survey | |
CN114897085A (en) | Clustering method based on closed subgraph link prediction and computer equipment | |
CN113283243B (en) | Entity and relationship combined extraction method | |
CN106611036A (en) | Improved multidimensional scaling heterogeneous cost-sensitive decision tree building method | |
CN111770053B (en) | Malicious program detection method based on improved clustering and self-similarity | |
CN112199287B (en) | Cross-project software defect prediction method based on enhanced hybrid expert model | |
Zhang et al. | A novel combinational forecasting model of dust storms based on rare classes classification algorithm | |
Czerwinski et al. | An application of fuzzy C-means, fuzzy cognitive maps, and fuzzy rules to forecasting first arrival date of avian spring migrants |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | PB01 | Publication | |
| | SE01 | Entry into force of request for substantive examination | |
| | WD01 | Invention patent application deemed withdrawn after publication | Application publication date: 20170503 |