
MLT Unit 3


UNIT-3

DECISION TREE LEARNING
& INSTANCE-BASED LEARNING
Decision Tree

• Decision Tree is a Supervised learning technique that can be used for both classification and
regression problems, but it is mostly preferred for solving classification problems.
• It is a tree-structured classifier, where internal nodes represent the features of a dataset, branches
represent the decision rules, and each leaf node represents the outcome.
• In a decision tree, there are two types of nodes: the Decision Node and the Leaf Node. Decision nodes
are used to make a decision and have multiple branches, whereas leaf nodes are the outputs of those
decisions and do not contain any further branches.
• The decisions or tests are performed on the basis of the features of the given dataset.
Decision Tree (Cont.)

• It is a graphical representation of all the possible solutions to a problem/decision based on
given conditions.
• It is called a decision tree because, similar to a tree, it starts with the root node, which expands into
further branches and constructs a tree-like structure.
• In order to build the tree, we use the CART algorithm, which stands for Classification and
Regression Tree (a small code sketch follows below).
• A decision tree simply asks a question and, based on the answer (Yes/No), further splits the tree
into subtrees.

• NOTE: A decision tree can handle categorical data (YES/NO) as well as numeric data.
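The slides name CART as the tree-building algorithm. Below is a minimal sketch of fitting such a tree, assuming scikit-learn is available (its DecisionTreeClassifier implements a CART-style learner); the tiny encoded dataset and feature names are made up purely for illustration.

# Minimal sketch: training a CART-style decision tree with scikit-learn.
# The toy weather-like dataset below is invented for illustration only.
from sklearn.tree import DecisionTreeClassifier, export_text

# Features: [outlook, humidity] encoded as integers, target: play (0 = No, 1 = Yes)
X = [[0, 0], [0, 1], [1, 0], [1, 1], [2, 0], [2, 1]]
y = [0, 1, 1, 1, 0, 1]

clf = DecisionTreeClassifier(criterion="entropy", max_depth=3)  # entropy gives ID3-style splits
clf.fit(X, y)

print(export_text(clf, feature_names=["outlook", "humidity"]))  # the learned if-else rules
print(clf.predict([[2, 1]]))                                    # classify a new, unseen instance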
INDUCTIVE BIAS IN DECISION TREE

• The inductive bias refers to the assumptions or constraints that shape how the decision
tree algorithm builds the tree and makes predictions.
• The Decision Tree algorithm has an inductive bias towards using hierarchical if-else rules
to represent the relationships between features and the target variable.
• It uses binary splitting to divide the data.
Inductive Inference
• Inductive inference refers to the process of drawing
general conclusions or making predictions based on
limited observations.
• We start with a set of specific observations or examples
and aim to derive general principles or rules that can be
applied to new, unseen situations.
• The goal is to make reliable predictions or generalizations
beyond the observed data.
ID3 ALGORITHM
• Iterative Dichotomiser 3, commonly known as ID3, was invented by Ross Quinlan.
• It is a classification algorithm that follows a greedy approach, building a decision tree by
selecting at each step the attribute that yields maximum Information Gain (IG), i.e. minimum Entropy (H).
• A decision tree is most effective if the problem characteristics look like the following:

1) Instances can be described by attribute-value pairs.

2) The target function is discrete-valued.
ID3 ALGORITHM (Cont.)

• Entropy is a measure of the homogeneity of a dataset.
• It tells us how impure/non-homogeneous an arbitrary dataset is.
• Given a collection of examples/dataset S, containing positive and negative examples of some target
concept, the entropy of S relative to this boolean classification is:

Entropy(S) = - p(+) * log2(p(+)) - p(-) * log2(p(-))

where p(+) and p(-) are the proportions of positive and negative examples in S.
ID3 ALGORITHM (cont)
• When we use a node in a decision tree to partition the training instances
into smaller subsets the entropy changes. Information gain is a measure
of this change in entropy.

• Information Gain = entropy(parent) – [average entropy(children)]
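A short sketch of these two definitions in Python may help; the split used in the usage example is the Humidity split from the Play Tennis dataset discussed next (3 yes / 4 no in one branch, 6 yes / 1 no in the other).

# Sketch of the entropy and information-gain formulas above,
# written for plain lists of class labels.
from collections import Counter
from math import log2

def entropy(labels):
    # H(S) = -sum over classes of p_i * log2(p_i)
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(parent, children):
    # Gain = entropy(parent) - weighted average entropy of the child subsets
    n = len(parent)
    return entropy(parent) - sum(len(g) / n * entropy(g) for g in children)

parent = ["yes"] * 9 + ["no"] * 5                          # 9 yes, 5 no
high   = ["yes"] * 3 + ["no"] * 4                          # e.g. Humidity = high
normal = ["yes"] * 6 + ["no"] * 1                          # e.g. Humidity = normal
print(round(entropy(parent), 3))                           # ~0.94
print(round(information_gain(parent, [high, normal]), 3))  # ~0.152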


Example: Decision Tree for Play Tennis
• Here, the dataset has two classes (yes and no), where 9 out of 14 examples are "yes" and 5 out of
14 are "no". The complete entropy of the dataset is:
H(S) = - p(yes) * log2(p(yes)) - p(no) * log2(p(no))
= - (9/14) * log2(9/14) - (5/14) * log2(5/14)
= 0.41 + 0.53 = 0.94
First Attribute - Outlook
Categorical values - sunny, overcast and rain
H(Outlook=sunny) = -(2/5)*log(2/5)-(3/5)*log(3/5) =0.971
H(Outlook=rain) = -(3/5)*log(3/5)-(2/5)*log(2/5) =0.971
H(Outlook=overcast) = -(4/4)*log(4/4)-0 = 0
Average Entropy Information for Outlook -
I(Outlook) = p(sunny) * H(Outlook=sunny) + p(rain) * H(Outlook=rain) + p(overcast) *
H(Outlook=overcast)
= (5/14)*0.971 + (5/14)*0.971 + (4/14)*0 = 0.693
Information Gain = H(S) - I(Outlook)
= 0.94 - 0.693 = 0.247
Example: Decision Tree for Play Tennis
Second Attribute - Temperature
Categorical values - hot, mild, cool
H(Temperature=hot) = -(2/4)*log(2/4)-(2/4)*log(2/4) = 1
H(Temperature=cool) = -(3/4)*log(3/4)-(1/4)*log(1/4) = 0.811
H(Temperature=mild) = -(4/6)*log(4/6)-(2/6)*log(2/6) = 0.9179

• Average Entropy Information for Temperature -


I(Temperature) = p(hot)*H(Temperature=hot) + p(mild)*H(Temperature=mild) +
p(cool)*H(Temperature=cool)
= (4/14)*1 + (6/14)*0.9179 + (4/14)*0.811
= 0.9108
• Information Gain = H(S) - I(Temperature)
= 0.94 - 0.9108
= 0.0292
Example: Decision Tree for Play Tennis

• Third Attribute - Humidity


Categorical values - high, normal
H(Humidity=high) = -(3/7)*log(3/7)-(4/7)*log(4/7) = 0.985
H(Humidity=normal) = -(6/7)*log(6/7)-(1/7)*log(1/7) = 0.591

• Average Entropy Information for Humidity -


• I(Humidity) = p(high)*H(Humidity=high) + p(normal)*H(Humidity=normal)
= (7/14)*0.985 + (7/14)*0.591
= 0.788
• Information Gain = H(S) - I(Humidity)
= 0.94 - 0.788
= 0.152
Example: Decision Tree for Play Tennis
• Fourth Attribute - Wind
Categorical values - weak, strong
H(Wind=weak) = -(6/8)*log(6/8)-(2/8)*log(2/8) = 0.811
H(Wind=strong) = -(3/6)*log(3/6)-(3/6)*log(3/6) = 1
• Average Entropy Information for Wind -
I(Wind) = p(weak)*H(Wind=weak) + p(strong)*H(Wind=strong)
= (8/14)*0.811 + (6/14)*1
= 0.892

• Information Gain = H(S) - I(Wind)


= 0.94 - 0.892
= 0.048
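The four gains above can also be recomputed programmatically. The sketch below assumes the standard 14-row Play Tennis dataset (Quinlan/Mitchell), since the table image from the original slide is not reproduced here.

# Recompute the information gain of each attribute on the Play Tennis data.
from collections import Counter, defaultdict
from math import log2

data = [  # (Outlook, Temperature, Humidity, Wind, PlayTennis)
    ("sunny", "hot", "high", "weak", "no"),          ("sunny", "hot", "high", "strong", "no"),
    ("overcast", "hot", "high", "weak", "yes"),      ("rain", "mild", "high", "weak", "yes"),
    ("rain", "cool", "normal", "weak", "yes"),       ("rain", "cool", "normal", "strong", "no"),
    ("overcast", "cool", "normal", "strong", "yes"), ("sunny", "mild", "high", "weak", "no"),
    ("sunny", "cool", "normal", "weak", "yes"),      ("rain", "mild", "normal", "weak", "yes"),
    ("sunny", "mild", "normal", "strong", "yes"),    ("overcast", "mild", "high", "strong", "yes"),
    ("overcast", "hot", "normal", "weak", "yes"),    ("rain", "mild", "high", "strong", "no"),
]

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def gain(rows, attr_index):
    labels = [r[-1] for r in rows]
    groups = defaultdict(list)
    for r in rows:
        groups[r[attr_index]].append(r[-1])           # partition labels by attribute value
    avg_child_entropy = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return entropy(labels) - avg_child_entropy

for i, name in enumerate(["Outlook", "Temperature", "Humidity", "Wind"]):
    print(name, round(gain(data, i), 3))              # ~0.247, 0.029, 0.152, 0.048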
Example: Decision Tree for Play Tennis

• Here, the attribute with the maximum information gain is Outlook, so it becomes the root of the
decision tree built so far.

• When Outlook == overcast, the subset is of a pure class (Yes).

Now, we have to repeat the same procedure for the rows with Outlook = Sunny and then for the
rows with Outlook = Rain.
Example: Decision Tree for Play Tennis
• Now, finding the best attribute for splitting the data with Outlook = Sunny (dataset rows = [1,
2, 8, 9, 11]).
The complete entropy of the Sunny subset is:
H(S) = - p(yes) * log2(p(yes)) - p(no) * log2(p(no))
= - (2/5) * log2(2/5) - (3/5) * log2(3/5)
= 0.971
DIFFERENCE BETWEEN ID3 AND CANDIDATE-ELIMINATION

• ID3
– Searches a complete hypothesis space incompletely
– Inductive bias is solely a consequence of the ordering of
hypotheses by its search strategy

• Candidate-Elimination
– Searches an incomplete hypothesis space completely
– Inductive bias is solely a consequence of the expressive
power of its hypothesis representation
ISSUES IN DECISION TREE

Practical issues in learning decision trees include

• Determining how deeply to grow the decision tree,


• Handling continuous attributes,
• Choosing an appropriate attribute selection measure,
• Handling training data with missing attribute values,
• Handling attributes with differing costs, and
• Improving computational efficiency.
INSTANCE BASED LEARNING
• The machine learning systems categorized as
instance-based learning are systems that learn the
training examples by heart and then generalize to new
instances based on some similarity measure.
• It is called instance-based because it builds its hypotheses from
the training instances.
• It is also known as memory-based learning or lazy learning.
• The time complexity of this approach depends on the size of the
training data. The worst-case time complexity is O(n), where n is
the number of training instances.
INSTANCE BASED LEARNING
• Some of the instance-based learning algorithms are :
1.K Nearest Neighbor (KNN)
2.Self-Organizing Map (SOM)
3.Learning Vector Quantization (LVQ)
4.Locally Weighted Learning (LWL)
INSTANCE BASED LEARNING (Cont'd)
• Advantages:
1. Instead of estimating the target function over the entire instance space, local approximations
can be made to the target function.
2. These algorithms can adapt easily to new data, which is collected as we
go.
• Disadvantages:
1. Classification costs are high.
2. A large amount of memory is required to store the data, and each query
involves building a local model from scratch.
K-NEAREST NEIGHBOR (KNN) LEARNING
• K-Nearest Neighbor is one of the simplest machine learning algorithms,
based on the supervised learning technique.
• The K-NN algorithm assumes similarity between the new case/data and the
available cases and puts the new case into the category that is most similar
to the available categories.
• The K-NN algorithm stores all the available data and classifies a new data
point based on similarity. This means that when new data appears, it can
easily be classified into a well-suited category using the K-NN algorithm.
• The K-NN algorithm can be used for regression as well as for classification,
but it is mostly used for classification problems.
K-NEAREST NEIGHBOR (KNN) LEARNING

• K-NN is a non-parametric algorithm, which means it does
not make any assumptions about the underlying data.
• It is also called a lazy learner algorithm because it does not
learn from the training set immediately; instead, it stores the
dataset and, at the time of classification, performs an action on
the dataset.
• At the training phase, the KNN algorithm just stores the dataset, and
when it gets new data, it classifies that data into the category
that is most similar to the new data.
KNN ALGORITHM
• The working of K-NN can be explained on the basis of the following steps
(a small code sketch follows the list):
• Step-1: Select the number K of neighbors.
• Step-2: Calculate the Euclidean distance from the new data point to the training points.
• Step-3: Take the K nearest neighbors as per the calculated Euclidean
distance.
• Step-4: Among these K neighbors, count the number of data points in
each category.
• Step-5: Assign the new data point to the category for which the number
of neighbors is maximum.
• Step-6: Our model is ready.
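The steps above translate almost directly into code. The following sketch assumes numeric feature vectors, Euclidean distance and a simple majority vote; it is illustrative, not a production implementation.

# Minimal K-NN classifier following Steps 1-5 above.
from collections import Counter
from math import dist  # Euclidean distance (Python 3.8+)

def knn_predict(train_X, train_y, query, k=3):
    # Steps 1-3: compute distances and keep the k nearest neighbors
    neighbors = sorted(zip(train_X, train_y), key=lambda p: dist(p[0], query))[:k]
    # Steps 4-5: count the classes among the neighbors and return the majority
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

# Usage with made-up 2-D points
train_X = [(1, 1), (2, 2), (6, 6), (7, 7)]
train_y = ["A", "A", "B", "B"]
print(knn_predict(train_X, train_y, query=(6, 5), k=3))   # -> "B"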
WORKING OF KNN
• Suppose we have a new data point and we need to put it in the
required category. Consider the below image:
WORKING OF KNN
• Firstly, we will choose the number of neighbors, so we will choose k = 5.
• Next, we will calculate the Euclidean distance between the data points. The
Euclidean distance is the distance between two points, which we have already
studied in geometry. For points (x1, y1) and (x2, y2) it can be calculated as:

Euclidean distance = sqrt((x2 - x1)^2 + (y2 - y1)^2)
WORKING OF KNN
• By calculating the Euclidean distance we get the nearest neighbors: three nearest
neighbors in category A and two nearest neighbors in category B. Hence, the new data
point is assigned to category A.
PROS AND CONS OF KNN

Advantages of KNN Algorithm:

• It is simple to implement.
• It is robust to noisy training data.
• It can be more effective if the training data is large.

Disadvantages of KNN Algorithm:

• The value of K always needs to be determined, which may be
complex at times.
• The computation cost is high because the distance between the new data
point and all the training samples must be calculated.
NUMERICAL ON KNN
• Perform the KNN classification algorithm on the following dataset and predict
the class for x (P1 = 3 and P2 = 7), where k = 3.

P1   P2   Class
7    7    False
7    4    False
3    4    True
1    4    True

Euclidean distance = sqrt((P1 - 3)^2 + (P2 - 7)^2)
• D(x, i)   = sqrt((7-3)^2 + (7-7)^2) = 4
• D(x, ii)  = sqrt((7-3)^2 + (4-7)^2) = 5
• D(x, iii) = sqrt((3-3)^2 + (4-7)^2) = 3
• D(x, iv)  = sqrt((1-3)^2 + (4-7)^2) = 3.6

We need to find the three nearest neighbours, that is, the distances with the lowest
values: 3, 3.6 and 4, whose classes are True, True and False respectively.

Thus, x (P1 = 3 and P2 = 7) belongs to class True.
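The worked example can be checked with a few lines of code. Note that the third data point, (3, 4, True), is an assumption reconstructed from the stated distance of 3 and the majority class True; the other rows are taken from the table above.

# Verify the KNN numerical above (the (3, 4, True) row is reconstructed).
from collections import Counter
from math import dist

points = [((7, 7), False), ((7, 4), False), ((3, 4), True), ((1, 4), True)]
query = (3, 7)
k = 3

nearest = sorted(points, key=lambda p: dist(p[0], query))[:k]
print([round(dist(p, query), 1) for p, _ in nearest])         # [3.0, 3.6, 4.0]
print(Counter(label for _, label in nearest).most_common(1))  # [(True, 2)] -> class True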


WHY KNN IS NON-PARAMETRIC?

• Non-parametric means not making any assumptions about the
underlying data distribution. Non-parametric methods do
not have a fixed number of parameters in the model.
• Similarly, in KNN the number of model parameters actually grows with the
training data set: you can imagine each training case as a
"parameter" in the model.
PROBLEMS IN LINEAR REGRESSION
• One of the problems with linear regression is that it tries to fit a
single straight line to your data once the model is created.

• Such behaviour might be okay when your data follows a
linear pattern and does not have much noise.

• However, when the data set is not linear, linear regression
tends to underfit the training data.
LOCALLY WEIGHTED REGRESSION

• Model-based methods, such as neural networks and the mixture
of Gaussians, use the data to build a parameterized model.
• After training, the model is used for predictions and the data are
generally discarded.
• In contrast, "memory-based" methods are non-parametric
approaches that explicitly retain the training data and use it each
time a prediction needs to be made.
• Locally weighted regression (LWR) is a memory-based method
that performs a regression around a point of interest using only
training data that are "local" to that point.
LOCALLY WEIGHTED REGRESSION

In locally weighted regression, points are weighted by proximity to the current x in question using
a kernel. A regression is then computed using the weighted points.
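A minimal sketch of this idea follows, assuming a one-dimensional input, a Gaussian kernel and a weighted least-squares line fit with NumPy; the bandwidth parameter tau and the toy data are illustrative choices, not part of the slides.

# Locally weighted (linear) regression: for each query point, fit a
# weighted least-squares line in which nearby training points dominate.
import numpy as np

def lwr_predict(x_train, y_train, x_query, tau=0.5):
    X = np.column_stack([np.ones_like(x_train), x_train])     # bias column + x
    w = np.exp(-((x_train - x_query) ** 2) / (2 * tau ** 2))  # Gaussian kernel weights
    W = np.diag(w)
    # Solve the weighted normal equations (X^T W X) theta = X^T W y
    theta = np.linalg.solve(X.T @ W @ X, X.T @ W @ y_train)
    return theta[0] + theta[1] * x_query

# Usage on noisy non-linear data (made up for illustration)
rng = np.random.default_rng(0)
x = np.linspace(0, 6, 60)
y = np.sin(x) + 0.1 * rng.standard_normal(x.size)
print(round(float(lwr_predict(x, y, 1.5)), 2))                # close to sin(1.5) ~ 1.0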
CASE BASED LEARNING

• Case-Based Reasoning (CBR) classifiers use a database of
problem solutions to solve new problems. They store the tuples or cases for
problem-solving as complex symbolic descriptions.
HOW CBR WORKS
• When a new case arises to classify, a case-based reasoner (CBR) will
first check if an identical training case exists.
• If one is found, then the accompanying solution to that case is returned.
• If no identical case is found, then the CBR will search for training cases
having components that are similar to those of the new case.
CASE BASED LEARNING
• Conceptually, these training cases may be considered as
neighbours of the new case.
• If cases are represented as graphs, this involves searching for
subgraphs that are similar to subgraphs within the new case.
• The CBR tries to combine the solutions of the neighbouring
training cases to propose a solution for the new case.
• If incompatibilities arise with the individual solutions, then
backtracking to search for other solutions may be necessary.
• The CBR may employ background knowledge and problem-
solving strategies to propose a feasible solution.
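A toy sketch of the retrieve-and-reuse idea described above, assuming cases are plain dictionaries and similarity is simply the number of matching attribute values (real CBR systems use much richer case representations and similarity measures).

# Toy case-based reasoner: return the solution of the most similar stored case.
case_base = [
    {"symptoms": {"screen": "blank", "power_led": "off"}, "solution": "check power cable"},
    {"symptoms": {"screen": "blank", "power_led": "on"},  "solution": "reseat video cable"},
]

def solve(new_symptoms):
    # If an identical case exists it scores highest; otherwise the most
    # similar neighbouring case is reused.
    def similarity(case):
        return sum(case["symptoms"].get(k) == v for k, v in new_symptoms.items())
    best = max(case_base, key=similarity)
    return best["solution"]

print(solve({"screen": "blank", "power_led": "on", "fan": "spinning"}))  # -> "reseat video cable"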
APPLICATIONS OF CBR
1. Problem resolution for customer-service help desks, where
cases describe product-related diagnostic problems.

2. It is also applied to areas such as engineering and law, where
cases are technical designs or legal rulings, respectively.

3. Medical education, where patient case histories and
treatments are used to help diagnose and treat new patients.
