CN106611191A - Decision tree classifier construction method based on uncertain continuous attributes

Info

Publication number
CN106611191A
CN106611191A
Authority
CN
China
Prior art keywords
attribute
uncertain
class
probability
node
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201610542924.3A
Other languages
Chinese (zh)
Inventor
金平艳
胡成华
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Sichuan Yonglian Information Technology Co Ltd
Original Assignee
Sichuan Yonglian Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Sichuan Yonglian Information Technology Co Ltd filed Critical Sichuan Yonglian Information Technology Co Ltd
Priority to CN201610542924.3A
Publication of CN106611191A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a decision tree classifier construction method based on uncertain continuous attributes. The method comprises the following steps: take a training set of X samples containing uncertain continuous data; merge and sort the attribute values Sij of each uncertain continuous attribute Si; compute, by class, the attribute-value statistics of the uncertain attributes Si and record them as the probability sum P(Sij); process the classes to obtain the probability cardinality P(Sij, Lr) of each branch attribute value; build a decision tree, selecting the split attribute splitSi according to the objective function proposed by the invention; and stop growing the tree when the stopping condition is met. The constructed decision tree better avoids the problem of the information measure being biased toward large orders of magnitude, realizes classification and prediction for objects with uncertain continuous attributes, achieves high classification accuracy, and is better suited to practical data mining problems.

Description

Decision tree classifier construction method based on uncertain continuous attributes
Technical field
The present invention relates to machine learning, artificial intelligence and data mining, and in particular to a decision tree classifier construction method based on uncertain continuous attributes.
Background art
Decision tree research is an important and active topic in data mining and machine learning. The algorithms proposed for it, such as ID3, CART and C4.5, are widely used in practical problems; such algorithms mainly address the question of classification accuracy. With the progress of science and technology, uncertain data has appeared frequently in real-world applications in recent years, in fields including wireless sensor networks, radio frequency identification and privacy protection. Its characteristic is that a data value is not fixed: a data point is represented, as a whole, by several possible values following a certain probability distribution, each value with a corresponding probability. Its increasingly frequent occurrence has driven related research to attract attention and develop rapidly. Traditional data mining techniques, however, often ignore the uncertainty in the data, so the models they learn do not agree with the objective world. Uncertain data mining technology is therefore of great practical significance to data mining. Attribute uncertainty involves both categorical attributes and continuous attributes; during training and testing, in order to evaluate the model effectively and verify the performance of the classifier, the class labels in the data set are not uncertain. This invention is mainly concerned with classification and prediction on uncertain continuous attributes. In addition, owing to the limited precision of measuring instruments, the collected data usually contain a certain amount of error and are not entirely accurate. In order to improve the classification accuracy on uncertain continuous attributes, the present invention proposes the decision tree classifier construction method based on uncertain continuous attributes.
Content of the invention
To solve the problems of classification and prediction for uncertain continuous attributes and to improve the accuracy of classification and prediction, the present invention proposes a decision tree classifier construction method based on uncertain continuous attributes.
To solve the above problems, the present invention is realized through the following technical solution:
The decision tree classifier construction method based on uncertain continuous attributes comprises the following steps:
Step 1: Suppose the uncertain continuous attribute training set contains X samples and the number of attributes is n, i.e. the attribute set is (S1, S2, ..., Sn). A split attribute Si corresponds to m classes L, where Lr ∈ (L1, L2, ..., Lm), i ∈ (1, 2, ..., n), r ∈ (1, 2, ..., m) and Si ∈ (S1, S2, ..., Sn); the attribute values are continuous and uncertain.
Step 2: Merge and sort the attribute values Sij of each uncertain continuous attribute Si, compute the attribute-value statistics of Si by class, record them as the probability sum P(Sij), and process the classes to obtain the probability cardinality P(Sij, Lr) of each branch attribute value.
Step 3: Create the root node G.
Step 4: If the training data set is empty, return node G and mark it as failure.
Step 5: If all records in the training data set belong to the same class, label node G with that class.
Step 6: If the candidate attribute set is empty, return G as a leaf node labeled with the most common class in the training data set.
Step 7: Because of the uncertainty of continuous attributes, select the split attribute splitSi from the candidate attributes according to the objective function defined below.
Step 8: Label node G with attribute splitSi.
Step 9: From node G, extend the branches that satisfy splitS = splitSi and the sub-branches that satisfy splitSi = splitSij; if either of the following two conditions is met, stop growing the tree.
9.1 Let Yi be the set of training samples satisfying splitS = splitSi; if Yi is empty, add a leaf node labeled with the most common class in the training data set.
9.2 All examples at this node belong to the same class.
Step 10: If neither condition 9.1 nor 9.2 holds, recursively apply steps 7 to 9.
Step 11: Save the generated decision tree classifier for uncertain continuous attributes.
The present invention has the following advantages:
1. The constructed decision tree better avoids the problem of the information measure being biased toward large orders of magnitude.
2. It realizes classification and prediction for objects with uncertain continuous attributes.
3. The constructed decision tree has high classification accuracy.
4. The constructed decision tree is better suited to practical data mining problems.
Description of the drawings
Fig. 1 is the construction flow chart of the decision tree classifier based on uncertain continuous attributes.
Specific embodiment
To solve the problems of classification and prediction for uncertain continuous attributes and to improve the accuracy of classification and prediction, the present invention is described in detail with reference to Fig. 1. The specific implementation steps are as follows:
Step 1: Suppose the uncertain continuous attribute training set contains X samples and the number of attributes is n, i.e. the attribute set is (S1, S2, ..., Sn). A split attribute Si corresponds to m classes L, where Lr ∈ (L1, L2, ..., Lm), i ∈ (1, 2, ..., n), r ∈ (1, 2, ..., m) and Si ∈ (S1, S2, ..., Sn); the attribute values are uncertain.
Step 2: Merge and sort the attribute values Sij of each uncertain continuous attribute Si, compute the attribute-value statistics of Si by class, record them as the probability sum P(Sij), and process the classes to obtain the probability cardinality P(Sij, Lr) of each branch attribute value. The concrete operation process is as follows:
For an uncertain continuous attribute Si, its value distribution P(Sij) is a probability vector, written as:
P(Sij) ∈ (P(Si1), P(Si2), ..., P(Sik)), with ∑_{j=1}^{k} P(Sij) = 1. A conventional (certain) continuous attribute can therefore be regarded as the special case in which, for attribute Si, one attribute value has P(Sij) = 1 and all the other probabilities are 0.
Each interval of an uncertain continuous attribute has a maximum endpoint and a minimum endpoint; this numerical representation also serves the purpose of protecting customer privacy. The processing procedure for an uncertain continuous attribute is to arrange the endpoints (key points) that occur in ascending order, merge duplicate points, and thereby split the whole range of the attribute into multiple subintervals.
The interval of each instance may cover one or more subintervals.
Assume any one subinterval is [α, β]; then the probability mass it carries on the interval [α, β] is:
P([α, β]) = ∫_α^β f(x) dx
In the above formula, f(x) is the probability density of the event in question, analyzed case by case and determined by the corresponding domain expert; the probability distribution function is generally the same for every uncertain continuous attribute.
The probability sum P(Sij) of attribute Si on the interval [α, β] is:
In the above formula, j ∈ (1, 2, ..., k), and k is the number of attribute values of attribute Si.
The probability cardinality P(Sij, Lr) of class Lr corresponding to attribute value Sij on the interval [α, β] is:
In the above formula, one quantity is the number of classes corresponding to attribute value Sij, and the other denotes that the class among those corresponding to attribute value Sij is Lr.
P(Sij) is the probability of attribute value Sij taken by class, and the classes L total m.
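As a concrete illustration of step 2, the following Python sketch merges the interval endpoints into subintervals and accumulates the two statistics. Because the exact formulas for P(Sij) and P(Sij, Lr) are not reproduced above, the sketch interprets P(Sij) as the total probability mass the samples place on subinterval j and P(Sij, Lr) as the part of that mass contributed by samples of class Lr, and it assumes a uniform density f(x) on each sample's interval; all function names are illustrative, not the patent's.

from collections import defaultdict

def subintervals(samples):
    """Sort the interval endpoints (key points) in ascending order, merge duplicates,
    and split the attribute's whole range into subintervals."""
    points = sorted({p for lo, hi, _ in samples for p in (lo, hi)})
    return list(zip(points[:-1], points[1:]))

def mass_on(lo, hi, a, b):
    """Probability mass that a uniform density on [lo, hi] places on [a, b] (assumed f(x))."""
    width = hi - lo
    if width <= 0:                       # degenerate interval: a certain value
        return 1.0 if a <= lo <= b else 0.0
    return max(0.0, min(hi, b) - max(lo, a)) / width

def probability_statistics(samples):
    """Return the subintervals, the probability sums P(Sij) and the per-class
    probability cardinalities P(Sij, Lr) under the interpretation described above."""
    subs = subintervals(samples)
    p_sum = [0.0] * len(subs)                      # P(Sij)
    p_card = [defaultdict(float) for _ in subs]    # P(Sij, Lr)
    for lo, hi, label in samples:
        for j, (a, b) in enumerate(subs):
            m = mass_on(lo, hi, a, b)
            p_sum[j] += m
            p_card[j][label] += m
    return subs, p_sum, p_card

# Example: three samples whose uncertain values are intervals, with class labels L1/L2.
samples = [(1.0, 3.0, "L1"), (2.0, 5.0, "L2"), (4.0, 6.0, "L1")]
subs, p_sum, p_card = probability_statistics(samples)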
Step 3: Create the root node G.
Step 4: If the training data set is empty, return node G and mark it as failure.
Step 5: If all records in the training data set belong to the same class, label node G with that class.
Step 6: If the candidate attribute set is empty, return G as a leaf node labeled with the most common class in the training data set.
Step 7: Because of the uncertainty of continuous attributes, select the split attribute splitSi from the candidate attributes according to the following objective function f(Si). The specific calculation process is as follows:
Objective function f(Si):
In the above formula, Lr, r ∈ (1, 2, ..., m), is the r-th class; P(Sij) is the probability sum of attribute Si computed in step 2; P(Sij, Lr) is the probability cardinality of attribute value Sij of attribute Si with respect to class Lr from step 2; and j indexes the attribute values.
The attribute splitSi that makes the objective function f(Si) larger is selected, and node G is then labeled with it.
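Since the objective function f(Si) itself is not reproduced in this text, the selection loop below uses a simple class-purity score computed from P(Sij) and P(Sij, Lr) purely as a stand-in placeholder; it is not the patent's formula, and stand_in_objective should be replaced by the actual f(Si).

def stand_in_objective(p_sum, p_card):
    """Placeholder for f(Si): average per-subinterval class purity, weighted by P(Sij).
    This is only a stand-in; substitute the patent's actual objective function."""
    total = sum(p_sum)
    if total == 0:
        return 0.0
    score = 0.0
    for pj, card in zip(p_sum, p_card):
        if pj > 0:
            score += (pj / total) * (max(card.values()) / pj)   # majority-class fraction
    return score

def select_split_attribute(stats_by_attribute):
    """Pick splitSi as the attribute whose objective value is largest.
    stats_by_attribute maps an attribute name to its (p_sum, p_card) pair."""
    return max(stats_by_attribute,
               key=lambda si: stand_in_objective(*stats_by_attribute[si]))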
Step 8: Label node G with attribute splitSi.
Step 9: From node G, extend the branches that satisfy splitS = splitSi and the sub-branches that satisfy splitSi = splitSij; if either of the following two conditions is met, stop growing the tree.
9.1 Let Yi be the set of training samples satisfying splitS = splitSi; if Yi is empty, add a leaf node labeled with the most common class in the training data set.
9.2 All examples at this node belong to the same class.
The label of a branch leaf node is determined by first comparing the training set and then by the magnitude of the P(Sij, Lr) values; the leaf node is determined from this comparison.
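One hedged reading of this rule, under the assumption that a branch's leaf simply takes the class with the largest probability cardinality on its subinterval, is the following small helper:

def leaf_label(p_card_j):
    """p_card_j maps each class label Lr to P(Sij, Lr) for one subinterval j;
    the leaf is labeled with the class of largest probability cardinality (assumed rule)."""
    return max(p_card_j, key=p_card_j.get)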
Step 10: If neither condition 9.1 nor 9.2 holds, recursively apply steps 7 to 9.
Step 11: Save the generated decision tree classifier for uncertain continuous data.
For the decision tree classifier construction method based on uncertain continuous attributes, the pseudocode is as follows:
Input: uncertain continuous data training sample set X.
Output: the decision tree classifier for uncertain continuous attributes.
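The pseudocode body itself does not appear in this text, so the following is only a minimal Python sketch of the recursion in steps 3 to 11. It assumes the helper functions probability_statistics() and select_split_attribute() from the earlier sketches, samples of the form ({attribute_name: (low, high)}, class_label), and an overlap rule for routing samples to branches; none of these details are specified by the patent.

from collections import Counter

def majority_class(samples):
    """Most common class label among ({attribute: (low, high)}, label) samples."""
    return Counter(label for _, label in samples).most_common(1)[0][0]

def attribute_column(samples, si):
    """Project samples onto attribute si as (low, high, label) triples."""
    return [(ivs[si][0], ivs[si][1], label) for ivs, label in samples]

def build_tree(samples, candidate_attributes):
    if not samples:                                    # step 4: empty training set
        return ("failure",)
    labels = {label for _, label in samples}
    if len(labels) == 1:                               # step 5: all records share one class
        return ("leaf", labels.pop())
    if not candidate_attributes:                       # step 6: no candidate attributes left
        return ("leaf", majority_class(samples))

    # step 7: choose splitSi by the objective function (stand-in version here)
    stats = {si: probability_statistics(attribute_column(samples, si))[1:]
             for si in candidate_attributes}
    split_si = select_split_attribute(stats)

    # steps 8-10: label the node, branch on the subintervals of splitSi, recurse
    subs, _, _ = probability_statistics(attribute_column(samples, split_si))
    remaining = [a for a in candidate_attributes if a != split_si]
    children = {}
    for a, b in subs:
        branch = [(ivs, label) for ivs, label in samples
                  if ivs[split_si][0] < b and ivs[split_si][1] > a]   # interval overlaps (a, b)
        if not branch:                                 # stop condition 9.1
            children[(a, b)] = ("leaf", majority_class(samples))
        else:
            # condition 9.2 (a single class) is caught by step 5 at the top of the recursion
            children[(a, b)] = build_tree(branch, remaining)
    return ("node", split_si, children)

Calling build_tree(training_samples, attribute_names) then yields a nested tuple that mirrors the tree described in steps 3 to 11 (step 11, saving the classifier, is simply serializing that structure).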

Claims (4)

1. A decision tree classifier construction method based on uncertain continuous attributes, relating to machine learning, artificial intelligence and data mining, and in particular to a decision tree classifier construction method based on uncertain continuous attributes, characterized by comprising the following steps:
Step 1: suppose the uncertain continuous attribute training set contains X samples and the number of attributes is n, i.e. the attribute set is (S1, S2, ..., Sn); a split attribute Si corresponds to m classes L, where Lr ∈ (L1, L2, ..., Lm), i ∈ (1, 2, ..., n), r ∈ (1, 2, ..., m) and Si ∈ (S1, S2, ..., Sn); the attribute values are continuous and uncertain
Step 2: merge and sort the attribute values Sij of each uncertain continuous attribute Si, compute the attribute-value statistics of Si by class, record them as the probability sum P(Sij), and process the classes to obtain the probability cardinality P(Sij, Lr) of each branch attribute value
Step 3: create the root node G
Step 4: if the training data set is empty, return node G and mark it as failure
Step 5: if all records in the training data set belong to the same class, label node G with that class
Step 6: if the candidate attribute set is empty, return G as a leaf node labeled with the most common class in the training data set
Step 7: because of the uncertainty of continuous attributes, select the split attribute splitSi from the candidate attributes according to the objective function f(Si)
Step 8: label node G with attribute splitSi
Step 9: from node G, extend the branches satisfying splitS = splitSi and the sub-branches satisfying splitSi = splitSij; if either of the following two conditions is met, stop growing the tree
9.1 let Yi be the set of training samples satisfying splitS = splitSi; if Yi is empty, add a leaf node labeled with the most common class in the training data set
9.2 all examples at this node belong to the same class
Step 10: if neither condition 9.1 nor 9.2 holds, recursively apply steps 7 to 9
Step 11: save the generated decision tree classifier for uncertain continuous attributes.
2. The decision tree classifier construction method based on uncertain continuous attributes according to claim 1, characterized in that the concrete process of step 2 is as follows:
Step 2: merge and sort the attribute values Sij of each uncertain continuous attribute Si, compute the attribute-value statistics of Si by class, record them as the probability sum P(Sij), and process the classes to obtain the probability cardinality P(Sij, Lr) of each branch attribute value; the concrete operation process is as follows:
for an uncertain continuous attribute Si, its value distribution P(Sij) is a probability vector, written as
P(Sij) ∈ (P(Si1), P(Si2), ..., P(Sik)), with ∑_{j=1}^{k} P(Sij) = 1, so a conventional (certain) continuous attribute can be regarded as the special case in which, for attribute Si, one attribute value has P(Sij) = 1 and all the other probabilities are 0
each interval of an uncertain continuous attribute has a maximum endpoint and a minimum endpoint, and this numerical representation also serves the purpose of protecting customer privacy; the processing procedure for an uncertain continuous attribute is to arrange the endpoints (key points) that occur in ascending order, merge duplicate points, and split the whole range of the attribute into multiple subintervals
the interval of each instance may cover one or more subintervals
assume any one subinterval is [α, β]; then the probability mass it carries on the interval [α, β] is P([α, β]) = ∫_α^β f(x) dx
in the above formula, f(x) is the probability density of the event in question, analyzed case by case and determined by the corresponding domain expert, and the probability distribution function is generally the same for every uncertain continuous attribute
the probability sum P(Sij) of attribute Si on the interval [α, β] is then computed, where j ∈ (1, 2, ..., k) and k is the number of attribute values of attribute Si
the probability cardinality P(Sij, Lr) of class Lr corresponding to attribute value Sij on the interval [α, β] is then computed, where one quantity in the formula is the number of classes corresponding to attribute value Sij and the other denotes that the class among those corresponding to attribute value Sij is Lr
P(Sij) is the probability of attribute value Sij taken by class, and the classes L total m.
3. The decision tree classifier construction method based on uncertain continuous attributes according to claim 1, characterized in that the concrete calculation process of step 7 is as follows:
Step 7: because of the uncertainty of continuous attributes, select the split attribute splitSi from the candidate attributes according to the objective function f(Si); the specific calculation process is as follows:
objective function f(Si):
in the above formula, Lr, r ∈ (1, 2, ..., m), is the r-th class, P(Sij) is the probability sum of attribute Si from step 2, P(Sij, Lr) is the probability cardinality of attribute value Sij of attribute Si with respect to class Lr from step 2, and j indexes the attribute values
the attribute splitSi that makes the objective function f(Si) larger is selected, and node G is then labeled with it.
4. The decision tree classifier construction method based on uncertain continuous attributes according to claim 1, characterized in that the specific process of step 9 is as follows:
Step 9: from node G, extend the branches satisfying splitS = splitSi and the sub-branches satisfying splitSi = splitSij; if either of the following two conditions is met, stop growing the tree
9.1 let Yi be the set of training samples satisfying splitS = splitSi; if Yi is empty, add a leaf node labeled with the most common class in the training data set
9.2 all examples at this node belong to the same class
the label of a branch leaf node is determined by first comparing the training set and then by the magnitude of the P(Sij, Lr) values, from which the leaf node can be determined.
CN201610542924.3A 2016-07-11 2016-07-11 Decision tree classifier construction method based on uncertain continuous attributes Pending CN106611191A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610542924.3A CN106611191A (en) 2016-07-11 2016-07-11 Decision tree classifier construction method based on uncertain continuous attributes

Publications (1)

Publication Number Publication Date
CN106611191A true CN106611191A (en) 2017-05-03

Family

ID=58615367

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610542924.3A Pending CN106611191A (en) 2016-07-11 2016-07-11 Decision tree classifier construction method based on uncertain continuous attributes

Country Status (1)

Country Link
CN (1) CN106611191A (en)

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20210406222A1 (en) * 2017-08-30 2021-12-30 Jpmorgan Chase Bank, N.A. System and method for identifying business logic and data lineage with machine learning
US11860827B2 (en) * 2017-08-30 2024-01-02 Jpmorgan Chase Bank, N.A. System and method for identifying business logic and data lineage with machine learning
CN111670445A (en) * 2018-01-31 2020-09-15 Asml荷兰有限公司 Substrate marking method based on process parameters
CN111670445B (en) * 2018-01-31 2024-03-22 Asml荷兰有限公司 Substrate marking method based on process parameters
WO2019223384A1 (en) * 2018-05-21 2019-11-28 阿里巴巴集团控股有限公司 Feature interpretation method and device for gbdt model
US11205129B2 (en) 2018-05-21 2021-12-21 Advanced New Technologies Co., Ltd. GBDT model feature interpretation method and apparatus
CN113032463A (en) * 2021-04-01 2021-06-25 河南向量智能科技研究院有限公司 Mining method of consensus data in product collaborative design block chain
CN113032463B (en) * 2021-04-01 2024-03-15 河南向量智能科技研究院有限公司 Mining method for co-data in product collaborative design block chain


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication
Application publication date: 20170503