Nothing Special   »   [go: up one dir, main page]

MLT Quantum

Download as pdf or txt
Download as pdf or txt
You are on page 1of 138

1

UNITT
Introduction

CONTENTS
Part-1 Learning, Types of Learning . - L to 1-7L

********************* 1-1L to 1-9L


Part-2
Well Defined Learning
Problems, Designing a
Learning System
Part-3 History of ML, Introduction. 1-9L to 1-24L
of Machine Learning Approaches
(Artificial Neural Network,
Custering, Reinforcement

Learning, Decision Tree Learning,


Bayesian Network, Support Vector
Machine, Genetic Algorithm)

1-24Lto 1-26L
Part-4 Issues in Machine Learning..
and Data Science Vs.
** *******

Machine Learning8

1-1L CS/IT-Sem-6)
1-2 L (CSSIT-Sem-5) Introduction

PART 1
Learning, Types of Learning.

Questions-Answers
Long Answer Type and Medium Answer Type Questions

Que11. Define the term learning What are the components of a


learning system ?

Answer
1. Learning refers to the change in a subject's behaviour to a given situation
that the
brought by repeated experiences in that situation, provided
the basis native
behaviour changes cannot be explained on
of response
tendencies, matriculation or temporary states of the subject.

2. Learning agent can be thought of as containing a performance element


that decides what actions to take and a learning element that modifies
the performance element so that it makes better decisions.

3. The design of a learning element is affected by three major issues


a Components of the performance element.
b. Feedback of components.
c. Representation of the components.
The important components of learning are:

Stimuli
examples Feedback
Learner
component

Environment
or teacheer Critic
Knowledge performance
base
evaluator

Kesponse
Performance
component Tasks
Fig. 1.1.1. General learning model.
1.Acquisition of new knowledge:
a. One component of learning is the acquisition of new knowledge
Machine Learning Techniques 1-3L(CS/IT-Sem-5)

b. Simple data acquisition is easy for computers, even though it is


difficult for people.
2 Problem solving :
The other component of learning is the problem solving that is required
for both to integrate into the system, new knowledge that is presented
to it and to deduce new information when required facts are not been
presented.
Que 1.2.Write down the performance measures for learning.

Answer
Following are the performance measures for learning are :

1. Generality:
a The most important performanece measure for learning methods is
the generality or scope of the method.
b. Generality is a measure of the case with which the method can be
adapted to different domains of application.
A completely general algorithm is one which is a fixed or self adjusting
configuration that can learn or adapt in any environment or

application domain.

2 Efficiency:
a. Theetticiency of a method is a measure of the average time required
to construct the target knowledge structures from some specified
initial structures.
b. Since this measure is often difficult to determine and is meaningless
without some standard comparison time, a relative etficiency index

be used
can
instead.
3. Robustness :
a. Robustness the ability of
learning
a
system to function with
unreliable feedback and with a variety of training examples, including
noisy ones.
b. A robust system must be able to build tentative structures which
are subjected to modification or withdrawal if later found to be
inconsistent with statistically sound structures.

4 Efficacy:
a. The efficacy of a system is a measure of the overall power of the

system. It is a combination of the factors generality, eficiency, and


robustne88.

5. Ease of implementation:
a. Ease of implementation relates to the complexity of the programs
and data structures, and the resources required to develop the
given learning system.
Introduction
1-4L (CS/IT-Sem-5)

metrics, this measure will often be


b. Lacking good complexity
somewhat subjective.

and unsupervised learning.


Que 13.Discuss supervised

Answer
Supervised learning:
associative learning, in which
known
Supervised learning is also
as
1. and matching
trained by providing it with input
the network is
output patterns.
vector with
the pairing of each input
Supervised training requires
desired output.
vector representing the
a target vector is
with the corresponding target
3. The input vector together
called training pair.
T'arget feature
Input feature Matehing

Neural
network XKH
/Weight/threshold
adjustment Error
vector
Supervised
learning
algorithm
Fig 13.1
to the network,
session an input vector is applied
During the training
vector,
and it results in an output
is compared with t target response
5. This response
the network
differs from the target
response,

6. Ifthe actual response


will generate an
error signal.
the adjustment that
is then used to calculate
7. This e r r o r signal the actual output
made in the synaptic weights so that
should be
matches the target output.
requires a supervisor
minimization in this kind of training
8. The error

or teacher.
an external teacher,
or
can be provided by
9. These input-output pairs network (self-supervised).
the neural
which contains
by the system to perform
non-linear
methoda a r e used
10. Supervised trainingc l a s s i f i c a t i o n networks, pattern association

mapping in pattern neural


networks.
networks and multilayer
Machine Learning Techniques 1-5L (CS/TT-Sem-5)
11. Supervised learning generatesaglobal model that maps input objects
to desired outputs.

12. In some cases, the map is implemented as a set of local models such
as in case-based reasoning or the nearest neighbour algorithm.
13. In order to solve problem of supervised learning following steps are
considered:

Determine the type of training examples.


ii Gathering a training set.
Determine the input feature representation of the learned
function.
learned funetion and
Determine the structure of the
corresponding learningg algorithm.
Complete the design.
Unsupervised learning:
1 It is a learning in which an output unit is trained to respond to
clusters of pattern within the input.
2 Unsupervised training is employed in self-organizing neural
networks.

3. This training does not require a teacher.


4 In this method of training, the input vectors of similar e s are
grouped without the use of training data to specify how a typical
member of each group looks or to which group a member belongs.

5. During training the neural network receives input patterns and


organizes these patterns into categories.
6. When new input pattern is applied, the neural network provides an

response indicating the class to which the input patten


output
belongs.
class cannot be found for the input pattern, a new class is
7. If a

generated.
8. Though unsupervised training does not require a teacher, it requires

certain guidelines to form groups.


9. Grouping can be done based on color, shape and any other property

of the object.
method of machine learning where a model is fit to
10. It is a

observations.

learning by the fact that there is


11. It is distinguished from supervised
no priori output.
12. In this, a data set of input objects is gathered.
set of random variables. It can be used in
13. It treats input objects as a

inference to produce conditional


conjunction with Bayesian
probabilities.
1-6L(CSIT-Sem-5) Introduction

14. Unsupervised learning is useful for data compression and clustering.


Vector describing state
of the environment

Learning
Environment system

Fig. 13.2. Block diagram of unsupervised learming.


15. In unsupervised learning, system is supposed to discover statistically
salient features of the input population.
16. Unlike the supervised learning paradigm, there is
not a priori set of
categories into which the patterns are to be classified; rather the
system must develop its own representation of the input stimuli.

Que 14. Describe briefly reinforcement learning ?

Answer
1. Reinforcement learning is the study of how artificial system can learn to
optimize their behaviour in the face of rewards and punishments.
2. Reinforcement learning algorithms have been developed that are closely
related to methods of dynamic programming which is a general approach
to optimal control.

3. Reinforcement learning phenomena have been observed in psychological


studies of animal behaviour, and in neurobiological investigations of
neuromodulation and addiction.

Primary
State (input) reintorcement
vector signal
Critic
Environment
Heuristic
reinforcement
Actions signal
Learning
system

Fig. 14.1. Block diagram of reinforcement learning.


The task of reinforcement learning is to use observed rewards to learn
an optimal policy for the environment.
5. An optimal policy is a policy that maximizes the expected total reward.
6. Without some feedback about what is good and what is bad, the agent
will have no grounds for deciding which move to make.
7. The ugents need to know that something god has happened when it
wins and that something bad has happened when it loses.
8 This kind of feedback is called a reward or reinforcenment.
Machine Learning Techniques
1-7L (CS/IT-Sem-5)
9 Reinforcement learning is very valuable in the field of robotics, where
the tasks to be
performed are frequently complex enough to dety
encoding as programs and no training data is available.
10. The robot's task consists of
finding out, through trial and error (or
success), which actions are good in a certain situation and which are
not.

l1. In many cases humans learn in a very similar way.


12. For example, when a child learns to walk, this usually
instruction, rather
happens without
simply through reinforcement.
13. Successful attempts at working are rewarded
by forward progress, and
unsuccessful attempts are penalized by often painful falls.
14. Positive and negative reinforcement also
are important factors
successful learning in school and in many sports.
15. In many complex domains, reinforcement learning is the only feasible
way to train a program to perform at high levels.

Que 15.What are the steps used to design a learning system ?

Answer
Steps used to design a learning system are:
1 Specify the learning task.
2. Chose a suitable set of training data to serve as the training experience.
3. Divide the training data into groups cla.sses and label
or
accordingly.
4 Determine the type of knowledge representation to be learned from the
training experience.
5. Choose a learner classifier that can generate general hypotheses from
the training data.
6. Apply the learner classifier to test data.
7. Compare the performance of the system with that of an expert human.

Learner

Environment
Experience Knowledge
Performance
element

Fig 1.5.1

PART-2
Well Defined Learning Problems, Designing a Learning System.
Introduction
1-8L (CS/IT-Sem-5)

Questions-Answers

Answer Type Questions


Long Answer Type and Medium

with
Que 16.Write short note on well defined learning problem
example.
Answer
Well defined learning problem:
E with respect to s o m e
is said to learn from experience
A computer program at tasks in T,
measure P, ifits performancee
class of tasks T' and performance
with experience E.
as m e a s u r e d by P, improves
Three features in learning problems:
1. The class of tasks (7)
(P)
2. The measure of performance to be improved

3. The source of experience (E)

For example:
1. A checkers learning problem:
a. Task (T): Playing checkers. won against
b. Performance measure (P): Percent of games
opponents. itself.
(E): Playing practice games against
T r a i n i n g experience
learning problem:
A handwriting recognition handwritten words within
a. Task (T) : Recognizing and classifying
images.
b. Performance measure (P): Percent of words correctly classified.
words with
(E): A database of handwritten
Training experience
given classifications.

3. A robot driving learning problem


highways using vision
four-lane
sensors.

a. Task (T): Driving on publie


distance travelled before an

b. Performance m e a s u r e (P): Average


human overseer).
error (as judged by
(E) : A sequence of images and steering
c. Training experience a human driver.
commands recorded while observing

problems role's in
Describe well defined learning
que 1.7.
machine learning.
Machine Learning Techniques 1-9L (CS/IT-Sem-5)

Answer
Well defined learning problems role's in machine learning :
L Learning to recogmize spoken words:
a Successful speech recognition systems employ machine learning in
some form.

b. For example, the SPHINX system learns speaker-specific strategies


for recognizing the primitive sounds (phonemes) and words from
the observed speech signal.
Neural network learning methods and methods for learning hidden
Markov models are effective for automatically customizing to
individual speakers, vocabularies, microphone characteristiCs,
background noise, etc.
2 Learning to drive an autonomous vehicle :
a Machine learning methods have been used to train computer
controlled vehicles to steer correctly when driving on a variety of
road types.
b. For example, the ALYINN system has used its learned strategies to
drive unassisted at 70 miles per hour for 90 miles on public highways
among other cars.
3 Learning to classify new astronomical struetures
a Machine learning methods have been applied to a variety of large
databases to learn general regularities implicit in the data.
b. For example, decision tree learning algorithms have been used by
to how celestial objects from the second
NASA learn to classity
Palomar Observatory Sky Survey.
c. This system is used to automatically classify all objects in the Sky
Survey, which consists of three terabytes of image data.
4 Learning to play world class backgammon :
a The most successful computer programs for playing games such as
backgammon are based on machine learning algorithms.
b.
For example, the world's top computer program for backgammon,
TD-GAMMON learned its strategy by playing over one million
practice games against itself.

PART-3
History of ML, Introduction of Machine Learning
Approaches -(Artificial Neural Network, Clustering, Reinforcement
Learning, Decision Tree Learning, Bayesian Network, Support
Vector Machine, Genetic Algorithm.
1-10 L (CS/IT-Sem-5) Introduction

Questions-Answers

Long Answer Type and Medium Answer Type Questions

Que 1.8.Describe briefly the history of machine learning.

Answer
A Early history of machine learning:
mathematician Walter
1. In 1943, neurophysiologist Warren McCulloch and
htts wrote a paper about neurons, and
how they work. They created a
an electrical circuit, and thus
the neural network
model of neurons using
was created.

program which could


2. In 1952, Arthur Samuel created the first computer
learn as it ran.

Frank Rosenblatt designed the first artificial neural network in 1958,


called Perceptron. The main goal of this was pattern and shape

recogniti0n.
In 1959, Bernard Widrow and Marcian Hoff
created two models of neural
4.
network. The first was called ADELINE, and it could detect binary
in a stream of bits, it could predict what the
next
patterns. For example,
and it could eliminate
one would be. The second was called MALDELINE,
echo on phone lines.
B. 1980s and 1990s:
network which had
1. In 1982, John Hopfield suggested creating a

actually work.
bidirectional lines, similar to how neurons

neural networks came in 1986, when


2 Use of back propagation in
researchers from the Stanford psychology department decided to extend
an algorithm created by Widrow and Hoff in 1962. This allowed multiple
known as 'slow
a neural network, creatingofwhat
are
to be
layers used in time.
learners, which long period
will learn over a

the IBM computer Deep Blue, which was a chess-playing


3. In 1997,
computer, beat the world chess champion.
digit recognition resulted
In 1998, research at AT&T Bell Laboratories
on

handwritten postcodes from the US Postal


in good accuracy in detecting
Service.
C. 21st Century:
businesses have realised that
1. Since the start of the 21st century, many
machine learning will increase calculation potential. This is why they
are researching more heavily in it, in order to stay ahead of the
competition.
Machine Learning 'Techniques
1-11L(CS/IT-Sem-5)
2. Some large projects include :

i GoogleBrain (2012)
i. AlexNet (2012)

ii. DeepFace (2014)


iv. DeepMind (2014)

v.OpenAI(2015)
vi. ResNet (2015)
vi. U-net (2015)

Que 1.9.Explain briefly the term machine learning.


Answer
1. Machine learning is an application of Artificial Intelligence (A) that
provides systems the ability to automatically learn and improve from
experience without being explicitly programmed.
Machine learning focuses on the development of
computer programs
that can access data.
3. The primary aim is to allow the computersto learn automatically without
human intervention or assistance and adjust actions accordingly.

4 Machine learning enables analysis of massive quantities of data.


5. It generally delivers faster and more accurate results in order to identify
profitable opportunities or dangerous risks.
6. Combining machine learning with AI and cognitive technologies can
make it even more effective in processing large volumes of information.

Que 1.10. What are the applications of machine learning?


Answer
Following are the applications of machine learning:
1. Image recognition :
a Image recognition is the process of identifying and detecting an
obyect or a feature in a digital image or video.
b. This is used in many applications like systems for factory automation,
toll booth monitoring, and security surveillance,

2 Speech recognition:
a. Speech Recognition (SR) is the translation of spoken words into
text.

b. It is also known as Automatic Speech Recognition (ASR), computer


speech recognition, or Speech To Text (STT).
1-12 L (CS/AT-Sem-5) Introduction

C. In speech recognition, a software application recognizes spoken


words.
3 Medical diagnosis:
a. MLprovides methods, techniques, and tools that can help in solving
diagnostic and prognostic problems in a variety of medical domains.
b. It is being used for the analysis of the importance of clinical
parameters and their combinations for prognosis.

Statistical arbitrage:
a. In finance, statistical arbitrage refers to automated trading
strategies that are typical of a short-term and involve a large number
of securities.
b. In such strategies, the user tries to implement a trading algorithm
for a set of securities on the basis of quantities such as historical
orrelations and general economie variables.
5. Learning associations: Learning association is the process for

discovering relations between variables in large data base.

6 Extraction :
a Information Extraction (IE) is another application of machine

learning.
b. It is the process of extracting structured information from
unstructured data.

Que 1.11. What are the advantages and disadvantages of machine

learning?

Answer
Advantages of machine learning are :

1 Easily identifies trends and patterns:


and discover
a. Machine learning can review large volumes of data
not be apparent to humans.
specific trends and patterns that would
For an e-commerce website like Flipkart, it serves to understand
b.
the browsing behavi0urs and purchase histories of its users to help
cater to the right products, deals, and reminders relevant to them.

C. It uses the results to reveal relevant advertisements to them.


2 No human intervention needed (automation): Machine learning
t.e., n o human intervention is needed.
does not require physical force

3. Continuous improvement
a. ML algorithms gain experience, they keep improving in accuracy

and eficiency.
As the amount of data keeps growing, algorithms learn to make
accurate predictions faster.
Machine Learning Techniques 1-13 L(CS/IT-Sem-5)

4 Handling multi-dimensional and multi-variety data:


a Machine learning algorithms are good at handling data that are
multi-dimensional and muti-variety, and they can do this in dynamie
or uncertain environments.
Disadvantages of machine learning are:
1. Data acquisition :
a. Machine learning requires massive data sets to train on, and these
should be inclusive/unbiased, and of good quality.
2 Time and resources:
a. ML needs enough time to let the
algorithms
learn and develop
enough to fulfill their purpose with a considerable amount of
accuracy and relevancy.
b. It also needs massive resources to fur
3 Interpretation of results:
a. To accurately interpret results generated by the algorithms. We
must carefully choose the algorithms for our purpose.
4 High error-susceptibility:
a. Machine learning is autonomous but highly susceptible to errors.
b. It takes time to recognize the source of the issue, and even longer
to correct it.

Que 1.12- What are the advantages and disadvantages


types of machine learning algorithm ?
of different
Answer
Advantages of supervised machine learning algorithm:
1 Classes represent the features on the
ground.
2. Training data is reusable unless features change.
Disadvantages of supervised machine learning algorithm:
1. Classes may not match spectral classes.
2. Varying consistency in classes.
3. Cost and time
involved in selecting training data.
are

Advantages of unsupervised machine learning algorithm


1. No previous knowledge of the image area is required.
2. The opportunity for human error is minimised.
3. It produces unique spectral classes.
4. Relatively easy and fast to carry out.
Introduction
1-14 L (CS/AT-Sem-5)

Disadvantages of unsupervised machine learning algorithm


:

1. The spectral classes do not necessarily represent the features on the

ground.
2. It does not consider spatial relationships in the data.

3. It can take time to interpret the spectral classes.


Advantages of semi-supervised machine learning algorithm:
1. It is easy to understand.

2 It reduces the amount of annotated data used.


3 It is stable, fast convergent
It is simple.
4
5. It has high efticiency.
Disadvantages of semi-supervised machine learning algorithm:
1. Iteration results are not stable.

2 It is not applicable to network level data.


3 It has low aceuracy
Advantages of reinforcement learning algorithm:
1 Reinforcement learning is used to solve complex problems that cannot
be solved by conventional techniques.

2 Thistechnique is preferred to achieve long-term results which are very


difficult to achieve.
3 This learning model is very similar to the learning of human beings.
Hence, it is close to achieving perfection.
Disadvantages of reinforcement learning algorithm:
L Too much reinforcement learning can lead to an overload of states
which can diminish the results.

2 Reinforcement learning is not preferable for solving simple problems.


3 Reinforcement learning needs a lot of data and a lot of computation.
4. The curse of dimensionality limits reinforcement learning for real
physical systems.

Que 1.13. Write short note on Artificial Neural Network (ANN).

Answer
1 Artificial Neural Networks (ANN) or neural networks are computational
algorithms thatintended to simulate the behaviour of biological systems
composed of neurons.
Machine Learning Techniques 1-16 L(CS/AT-Sem-5)

2. ANNs are computational models inspired by an animal's central nervous


sy'stems.
3. It is capable of machine learning as well as
pattern recognition.
4. A neural network is an oriented graph. It consists of nodes which in the
biological analogy represent neurons, connected by arcs.
5. It corresponds to dendrites and synapses. Each arc associated with a
weight at each node.
6. A neural network is a machine
learning algorithm based on the model
of a human neuron. The human brain consists of millions of neurons.
Tt sends and process
signals in the form of electrical and chemical signals.
8. These neurons are connected with a special structure known as
synapses.
Synapses allow neurons to pass signals.
9. An Artificial Neural Network is an information processing technique. It
works like the way human brain processes information.
10. ANN includes a large number of connected processing units that work
together to process information. They also generate meaningful results
from it.

Que 1.14. Write short note


on clustering.
Answer
1. Clustering is a division of data into groups of similar objects.
2. Each group or cluster consists of objects that are similar among themselves
and dissimilar to objects of other groups as shown in Fig. 1.14.1.

Fig. 1.14.1. Clusters


3. A cluster is a collection of data objects that are similar to one another
within the same cluster and are dissimilar to the object in the other
cluster.

4. Clusters may be described as connected regions of a multidinensional


space containing relatively high density points, separated from each
other by a region containing a relatively low density points.
5 From the machine learning perspective, clustering can be viewed as
unsupervised learning of concepts
6 Clustering analyzes data objects without help of known class label.
1-16 L (CS/AT-Sem-5)
Introduction
7. In clustering,
the class labels are not present in
training data simply
because they are not known to cluster the data
objects.
Hence, it is the type of unsupervised learning.
9. For this reason, clustering is a form of learning
by observation rather
than learning by examples.

10. There are certain situations where clustering is useful. These include :

a.
The collection classification
and
of training data be costly and
can
timeconsuming. Therefore it is difficult to collect a training data
set. A number of
training samples are not all labelled. Then it
large
1s Useful to train a supervised classi fier with a small portion of
training data and then use clustering procedures to tune the classifier
based on the large, unclassified dataset.
b. For data mining, it can be useful to search for grouping among the
data and then recognize the cluster.
The properties of feature vectors can change over time. Then,
supervised classification is not reasonable. Because the test feature
vectors may have completely different properties.
d The clustering can be useful when it is required to search for good
parametric families for the class conditional densities, in case of
supervised classification.

Que 1.15. | What are the applications of clustering?

Answer
Following are the applications of clustering:
1 Data reduction
a. In many cases, the amount of available data is very large and its
processing becomes complicated.
b. Cluster analysis can be used to group the data into a number of
clusters and then process each cluster as a single entity.

c. In this way, data compression is achieved.


Hypothesis generation:
a. In this case, cluster analysis is applied to a data set to infer hypothesis
that concerns about the nature of the data.
b. Clustering is used here to suggest hypothesis that must be verified
using other data sets.
3. Hypothesis testing: In this context, cluster analysis is used for the
verification of the validity of a specitfic hypothesis.
4. Prediction based on groups:
a. In this case, cluster analysis is applied to the available data set and
then the resulting clusters are characterized based on the
characteristics of the patterns by which they are formned.
Machine Learning Techniques 1-17L(CS/IT-Sem-5)
b. In this sequence, if an unknown pattern is given, we can determine
the cluster to which it is more likely to belong and characterize it
based on the characterization of the
respective clusteer.
Que 1.16. Differentiate between clustering and elassification.
Answer

S.No Clustering Classification


1. Clustering analyzes data objects In classification, data are
without known class label.
grouped by analyzing the data
objects whose class label is
known.
2. There is no
prior knowledge of There is some prior
the attributes of the data to form knowledge of the attributes of
clusters. each classification.
3. I t is done by grouping only the It is done by classifying output
input data because output is not based on the values of the
predefined. input data.
The number of clusters is not The number of classes is
known before clustering. These known before classification as
are identified after the there is predefined output
completion of clustering. based on input data.

Unknown class label Known class label Data Dgecta

6. It is considered as unsupervised t is considered as the


learning because there is no prior supervised learning because
knowledge of the class labels. class labels are known before.

9ue 1.17. What are the various clustering techniques ?


1-18 L (CS/AT-Sem-5)
Introduction

Answer
1. Clustering techniques are used for combining observed examples into
clusters or groups which satisfy two following main criteria :

a. Each group or cluster is homogeneous i.e., examples belong to the


same group are similar to each other.
Each should
group or cluster be different from other clusters ie.,
examples that belong to one cluster should be different from the
examples of the other clusters.
2Depending on the clustering techniques, clusters can be expressed in
different ways
a ldentified clusters nmay be exclusive, so that any example belongs to
only one cluster.
b. hey may be overlapping i.e., an example may belong to several
clusters.
c They may be probabilistic i.e., an example belongs to each cluster
with a certain probability.
d Clusters might have hierarchical structure.
Major classifications of clustering techniques are:

Clustering
Hierarchical Partitional

Divisive Agglomerative CentroidModel GraphicSpectral


[based|theoretic
Fig. L17.1 Types of clustering
a Once a criterion funetion has been selected, clustering becomes a
well-defined problem in discrete optimization. We find those
partitions of the set of samples that extremize the criterion function.
c. The sample set is finite, there are only a finite number of possible
partition
d The clustering problem can always be solved by exhaustive
enumeration.
L Hierarchical clustering:
This method works by grouping data object into a tree of
a
clusters
b. This method can be further classified depending on whether the
hierarchical decomposition is formed in bottom up (merging) or top
down (splitting) fashion.
Following are the two types of hierarchical clustering
Machine Learning 'Techniques 1-19 L (Cs/IT-Sem-5)

a.
Agglomerative hierarchical clustering: This bottom up strategy
starts by placing each object in its own cluster and then merges
these atomic clusters into larger and larger clusters, until all of the
objects are in a single cluster.
b. Divisive hierarchical
This
clustering
top down strategy does the reverse of agglomerative
strategy by starting with all objects in one cluster.
ii. It subdivides the cluster into smaller and smaller
pieces until
each object forms a cluster on its own.
2 Partitional clustering :
a. This method first creates an initial set of
number of partitions
where each partition
represents a cluster.
b. The clusters are formed to
optimize an objective partition criterion
such as a
dissimilarnty function based on distance so that the objects
within a cluster are similar whereas the objects of different clusters
are dissimilar.
Following are the types of partitioning methods:
Centroid based
a.
clustering
In this, it takes the input parameter and partitions a set of
object into a number of clusters so that resulting intracluster
similarity is high but the intercluster similarity is low.
Cluster similarity is measured in terms of the mean value of
the objects in the cluster, which can be viewed as the cluster's
centroid or center gravity.
of
b. Model-based clustering: This method hypothesizes a model for
each of the cluster and finds the best fit of the data to that model.

Que 1.18. Describe reinforcement learning.

Answer
1. Reinforcement learning is the study of how animals and artificial systems
can learm to optimize their behaviour in the face of rewards and
punishments.
2. Reinforcement learning algorithms related to methods of dynamie
programming which is a general approach to optimal control.
3. Reinforeement learning phenomena have been observed in psychological
studies of animal behaviour, and in neurobiological investigations of
neuromodulation and addiction.
The task of reinforcement learning is to use observed rewards to learn
an optimal policy for the environment. An optimal policy is a policy that
maximizes the expected total reward.
1-20 L (CS/IT-Sem-5) Introduction

Que 1.19. Explain decision tree in detail.

Answer
1. A decision tree is a flowehart structure in which each internal node
a test on a feature, each leaf node represents a class label
represents
and branches represent conjunctions of features that lead to those class

labels.
The paths from root to leaf represent classification rules.

Fig 1.19.1,illustrate the basic flow of decision tree for decision making
with labels (Rain(Yes), Rain(No).

Outlook

Sunny Overcast Rain

Wind
Humidity Yes

Normal Strong Weak


High
Yes
No
Fig. 1.19.1.

Decision tree is the predictive modelling


approach used in statistics, data
4 machine learning.
mining and
that identifies
constructed via an algorithmic approach
5. Decision trees are
conditions.
data set based on different
the ways to split a
method used
Decision trees are a non-parametric supervised learning
6.
classification and regression tasks.
for both
variable can
tree models where the target
7. Classification trees a r e the
values.
take a discrete set of
where the target variable can
the decision trees
trees are
8. Regression
take continuous set of values.

decision tree ?
What the steps used for making
Que 1.20.|
are

Anwer
decision tree are:
Steps used for making consideration for making
(dataset) which are taken into
1. Get list of rows at each node).
decision tree (recursively
Machine Learning Techniques 1-21 L (Cs/IT-Sem-5)
2. Calculate uncertainty of our dataset or Gini impurity or how much our
data is mixed up etc.
Generate list of all question which needs to be asked at that node.
Partition rows into True rows and False rows based on each question
asked.
5. Calculate infornmation gain based on Gini impurity and partition of data
from prev1ous step.

Update highest information gain based on each question asked.


Update question based on information gain (higher information gain).
8. Divide the node on
question. Repeat again from stepl until
we get pure
node (leaf nodes).

Que 121.What are the advantages and disadvantages of decision


tree method ?

Answer
Advantages of decision tree method are:
Decision trees are able to
12. Decision trees
generate understandable rules.
perform classification without requiring computation.
3. Decision trees are able to handle both continuous and categorical
variables.
4.Decision trees provide a clear indication for the fields that are important
for prediction or classification.
Disadvantages of decision tree method are:
1 Decision trees are less appropriate for estimation tasks where the goal
is to predict the value ofa continuous attribute
Decision trees are prone to errors in classification problems with many
class and relatively small number of trainingg examples.
Decision tree are computationally expensive to train. At each node,
each candidate splitting field must be sorted before its best split can be
found.

4 In decision tree algorithms, combinations of fields are used and a search


must be made for optimal combining weights. Pruning algorithms can
also be expensive since many candidate sub-trees must be formed and
compared.
Que 1.22.| Write short note on Bayesian belief networks.

Answer
1. Bayesian belief networks specify joint conditional probability distributions
2. They are also known as belief networks, Bayesian networks, or
probabilistic networks.
1-22 L (CS/AT-Sem-5) Introduction

3. A Belief Network allows class conditional independencies to be defined


between subsets of variables.
4. It provides a graphical model of causal relationship on which learning
can be performed

5. We can use a trained Bayesian network for classification.


6. There are two components that define a Bayesian belief network:

a. Directed acyelic graph:


i. Each node in a directed acyclic graph represents a random
variable.
i. These variable may be discrete or continuous valued.
i. These variables may correspond to the actual attribute given

in the data.
shows a
Directed acyclic graph representation: The following diagram
directed acyclic graph for six Boolean variables.

i The arc in the diagram allows representation of causal


knowledge.
i For example, lung cancer is influenced by a person's family
well whether or not the person is
history of lung cancer, as as

a smoker.

Smoker
(Pamily History)

LungCancer Emphysema

Dyspnea
(Positive Xray)
Positive X-ray is independent
ii. Itis worth noting that the variable or

of whether the patient has family


a history of lung cancer
that we know the patient
that the patient is a smoker, given
has lung cancer.

b. Conditional probability table :

the variable
table for the values of
The conditional probability combination ofthe values
(LC) showing each possible
LungCancer (FH), and Smoker (S) is as
follows:
parent nodes, FamilyHistory
ofits
FH,S FHS -PHS FHS
0.7 0.1
LC0.8 0.5
0.3 0.9
-LC0.2 0.5
Machine Learning Techniques 1-23 L (CS/AT-Sem-5)

Que 1.23. Write a short note on support vector machine.

Answer
1. A Support Vector Machine (SVM) is machine learning algorithm that
analyzes data for classification and regression analysis.
SVM is a supervised learning method that looks at data and sorts it into
one of two categories.

An SVM outputs a map of the sorted data with the margins between the
two as far apart as
possible
Applications ofSVM:
1. Text and hpertext classification
i. Image classification
i. Recognizing handwritten characters
iv. Biological sciences, including protein classification
Que 1.24. Explain genetic algorithm with flow chart.

Answer
Genetic algorithm (GA):
1. The genetic algorithm is a method for solving both constrained and
unconstrained optimization problems that is based on natural selection.
2 The genetic algorithm repeatedly modifies a population of individual
solutions.
3. At each step, the genetic algorithm selects individuals at random from
the current population to be parents and uses them to produce the
children for the next
generation.
4. Over successive generations, the population evolves toward an
optimal
solution.

Flowehart:The genetic algorithm uses three main types of rules at each


step to create the next generation from the current population :
a. Selection rule: Selection rules select the individuals, called
parents,
that contribute to the population at the next
generation.
b. Crossover rule: Crossover rules combine two parents to form children
for the next generation.
c. Mutation rule: Mutation rules apply random changes to individual
parents to form children.
1-24 L (CS/AT-Sem-5) Introduction

Start

Initialization

Initial population

Selection

New population

s Quit ?

NO

Crossover

Mutation

End

Fig. 1.24.1

LPART-4
Issues in Machine Learning and Data Science Vs. Machine Learning.

Questions-Answers

Long Answer Type and Medium Answer Type Questions

the issues related with machine


Que 1.25. Briefly explain
learning
Machine Learning Techniques
1-25 L(CS/IT-Sem-5)

Answer
Issues related with machine learning a r e :
1. Data quality:
a. It is essential to have
good quality data to produce quality MI.
algorithms and models.
b. To get
high-quality data,
must implement data evaluation,
we
integration, exploration, and governance techniques prior to
developing ML models.
c. Accuracy of ML is driven by the quality of the data.
2 Transparency:
a. It is diff+cult to make definitive statements on how well a model is
going to generalize in new environments.

Manpower:
a. Manpower means having data and being able to use it. This does
not introduce bias into the model.
b. There should be enough skill sets in the
organization for software
development and data collection.
Other
a The most common issue with ML is
people using it where it does
not belong.
b. Every time there is innovation in ML, we see overzealous
some new

engineers trying to use it where it's not really necessary.


c. This used to happen a lot with deep learning and neural networks.
d Traceability and reproduction of results are two main issues.

Que 1.26. What are the classes of problem in machine learning ?

Answer
Common classes of problem in machine learning:
1. Classification :
a. In classification data is labelled i.e., it is assigned a class, for example,
spam/non-spam or fraud/non-fraud.
b. The decision being modelled is to assign labels to new unlabelled
pieces of data.
c. This can be thought of as a discrimination problem, modelling the
differences or similarities between groups.
2 Regression
a. Regression data is labelled with a real value rather than a label.
b. The decision being modelled is what value to predict for new
unpredicted data.
1-26 L (CS/AT-Sem-5) Introduction

3 Clustering:
a In clustering data is not labelled, but can be divided into groups
on similarity and other measures of natural structure in the
based
data.
b. For example, organising pictures by faces without names, where
the human user has to assign names to groups, like iPhoto on the
Mac.
4 Rule extraction
a. In rule extraction, data is used as the basis for the extraction of
propositional rules.
b. These rules discover statistically supportable relationships between
attributes in the data.

between data science and machine


Que 127. Differentiate
learning
Answer

.No. Data science Machine learning


Data science is a concept used Machine learning is defined as

to tackle big data and he practice of using algorithms


data cleansing,
includes
preparation, to use data, learn from it and
then forecast future trends for
and analysis.
at
topic.
It includes various data It includes subset of Artificial
2
operations. Intelligence.
Data science works by Machine learning uses
efficient
sOurcing, cleaning, and programs that can use data
without being explicitly told to
processing data to extract
it for do so.
meaning out of
analytical purposes.
Amazon Lex, IBM Watson
SAS, Tableau, Apache, Spark,
Studio, Microsoft Azure ML
MATLAB arethe tools used
Studio are the tools used in ML.
in data science.
Data science deals with Machine learning uses statistical
.

structured and
unstructured
models.
data.
Fraud detection and Recommendation systems such
Spotily andFacial
he althcare analysis are as
Recognition
are examples of machine
examples of data science
learning.
2
UNIT
Regression and
Bayesian Learning

CONTENTS
Part-1
Regression, Linear Regression.
and Logistic Regression
2-2L to 2-4L
Part-2 Bayesian Learning, Bayes. .2-4L to 2-19L
***************

Theorem, Concept Learning


Naive
Bayes Optimal Classifier,
Bayes Classifier, Bayesian
Belief Networks, EM Algorithm

Part-3: Support Vector Machine, . . 2-201L to 2-241


Introduction, Types of Support
Vector Kernel (Linear Kernel
Polynomial Kernel, and Gaussian
Kernel), Hyperplane
Decision Surface), Properties
of SVM, and Issues in SVMI

2-1L CS/TT-Sem-5)
2-2L(CS/IT-Sem-5) Regression & Bayesian learning

LPART-1
Regression, Linear Regression and Logistic Regression.

Questions-Answers
Long Answer Type and Medium Answer Type Questions

Que 2.1.| Define the term regression with its type.

Answer
1.
Kegression is a statistical method used in finance, investing, and other
disciplines that attempts to determine the strength and character of the
relationship between one dependent variable (usually denoted by Y)
and a series of other variables (known as independent variables).
2. Regression helps investment and financial managers to value assets
and understand the relationships between variables, such as commodity
prices and the stocks of businesses dealing in those commodities.
There are two type of regression:
a. Simple linear regression: It uses one independent variable to
explain or predict the outcome of dependent variable Y.

Y =a
+bX+u
b. Mutiple linear regression: lt two
uses or more independent
variables to predict outcomes.
Y= a +b,X, +6_X, + b,X, + ... + bX, + u
Where:
Y=The variable we you are trying to predict (dependent variable).
X The variable that we are using to
=
predict Y (independent variable).
a = The intercept.
b =The slope.
u =The regression residual.

Que 2.2.Deseribe briefly linear regression.


Answer
1. Linear regresB1on i8 a
8upervised machine learning
the algorithm where
predicted output is continuous and has a constant slope.
2. It is used to prediet values within a continuous range, (for example
sales, price) rather than trying to classify them into categories (for
example :cat, dog).
Machine Learning T'echniques 2-3 L (CS/IT-Sem-5)

3. are the types


a.
Following
Simple regression :
of linear regression:
iSimple linear regTession uses traditional slope-intercept form to produce

accurate prediction, y mx +b
where, m and b are the variables,
=

represents our input data and y represents our prediction.


b Multivariable regression:
A multi-variable linear equation is given below, where w represents the
coeflicients, or weights:

. T h e variables x, y, z represent the attributes, or distinct pieces of


information that, we have about each observation.
P o r sales predictions, these attributes might include a company's
advertising spend on radio, TV, and newspapers.
Sales w, Radio + w, TV +
=

wgNewspapers
Que 2.3. Explain logistics regression.
Answer
Logistic regression is a supervised learning classification algorithm used
to predict the
probability of a target variable.
2. The nature of target or dependent variable is dichotomous, which means
there would be only two possible classes.
3 The dependent variable is binary in nature having data coded as either
1 (stands for success/yes) or 0 (stands for failure/no).
4. A logistic regression model predicts PY1)as a function
of X. It is
=
one
of the simplest ML
algorithms that can be used for various
classification
problems such as spam detection, diabetes prediction, cancer detection
etc.

Que 24.What are the types of logistics regression ?

Answer
Logistics regression can be divided into following types:
1. Binary (Binomial) Regression :

In this
a
classification,
types either 1 and 0.
a
dependent variable will have only two possible
b. For example, these variables may
or no, win or loss etc.
represent success or failure, yes
2 Multinomial regression
a. In this classification,
possible unordered types
dependentthevariable can
or
have three or more
types having no
quantitative
significance.
b. For example, these variables may
represent "Type A"
T'ype C"
or
"Type B" or
24L(CS/IT-Sem-5) Regression & Bayesian Learning

3. Ordinal regression :
a. In this classification, dependent variable can have three or more
possible ordered types or the types having a quantitative significance.
b. For example, these variables may represent "poor" or "good, "very
good", "Excellent" and each category can have the scores like 0, 1, 2, 3.

Que 2.5. Differentiate between linear regression and logistics


regression.

Answepr

S.NNo. Linear regression Logistics regression


1 Linear regression is a Logistic regression is a supervised
supervised regression model. classification model.

|In Linear regression, we In Logistic regression, we predict


predict the value by an integer the value by 1 or 0.
number.
3. No activation function is used. Activation function is used to
onvert ainear regression
equation to the logistic regression
equation.

Athreshold value is added. No threshold value is needed.

|Itis based on the least square The dependent variable consists


estimation. of only two categories.
6. Lanear regression is used to Logistic regression is used to
estimate the dependent calculate the probability of an
variable in case of a change in event.
independent variables.

1. |Lanear regression assumes theLogistic regression assumes the


normal o r gaussian binomial distribution of the
distribution of the dependent dependent variable.
variable.

PART-2

Bayesian Learning, Bayes Theorem, Concept Learning,


Bayes Optimal Classifier, Naive Bayes Classifier, Bayesian
Belief Netuworks, EM Algorithm.
Machine Learning Techniques 2-5 L(CS/IT-Sem-5)

Questions-Answers
Long Answer Type and Medium Answer Type Questions

Que 2.6. Explain Bayesian learning. Explain two category


classification.

Answer
Bayesian learning:
1. Bayesian learning is a fundamental statistical approach to the problem
ofpattern classification.
2. This approach is based on quantifying the tradeoffs between various
classification decisions using probability and costs that accompany such
decisions.
3. Because the decision problem is solved on the basis of probabilistic terms,
hence it is assumed that all the relevant probabilities are known.
4. For this we define the state of nature of the things present in the
particular pattern. We denote the state of nature by .
Twocategory classification:
1. Let o, o be the two classes of the patterns. It is assumed that the a
priori probabilities plo,) and plo,) are known.
2. Even if they are not known, they can easily be estimated from the
available training feature vectors.
3. IfN is total number of available training patterns and N. N, of themn
belong to o, and og, respectively then pto,) = N,/N and plo,) = NJN.
4 The conditional probability density functions plx | o), i = 1, 2 is also
assumed to be known which describes the distribution of the feature
vectors in each of the classes.

5. The feature vectors can take any value in the l-dimensional feature
space.
6. Density functions ptx |o) become probability and will be denoted by
plx| o) when the feature vectors can take only diserete values.
7. Consider the conditional probability,

Plo, lx) PMx|o,x)}plo,)


= (2.6.1)

where pr) is the probability density function ofx and for which we have

otr) = Zplz|,)plo,) (2.6.2)


2-6 L(CS/IT-Sem-5) Regression & Bayesian Learning

Now, the Baye's classification rule can be defined as


a.
plo, |x)>pl», lx) x is classified to o,
b.
lfpto, lx) <plog |x) xis classified to o, (2.6.3)
In the case of equality the pattern can be assigned to either of the two
classes. Using equation (2.6. 1), decision can equivalently be based on

the inequalities:
a. plx| 0,)plo,) > p(x|o,)plo
b. px| o) plo,) < p(x | o,)plo,) ..(2.6.4)
10. Here p(x) is not taken because it is same for all classes and it does not
affect the decision.
11. Further, if the priori probabilities are equal, i.e.,
a. plo =p(o,) = 1/2 then Eq. (2.6.4) becomes,

b.
plx| o) > plx| o2)
C. plx| o,) <p{x| 02)
12. For example, in Fig. 2.6.1, two equiprobable classes are presented which
shows the variations of plx | o,), i = 1, 2 as functions of x for the simple
case of a single feature (l = ).

13. The dotted line at x, is a threshold which partitions the space into two
regions, R, and R According to Baye'sdecisions rule, for all value ofx
in R, the classifier decides o, and for all values in R, it decides o
14. From the Fig. 2.6.1, it is obvious that the errors are unavoidable. There
1s a finite probability for an x to lie in the R, region and at the same time
to belong in class o. Then there is error in the decision.

Px|o) plx | 02)


px

Shade the part

Fig. 2.6.1. Bayesian classifier for the case of two equiprobable classes.
15. The total probability, P of committing a decision error for two
equiprobable classes is given by,
2-7L(CSIT-Sem-5)
Machine Learning Techniques

in Fig. 2.6.1
under the curves

which is equal to the total shaded area

the decision error for Bayesian


Explain how
Que 2.7.
classification can be minimized.

Answer classification
the
Bayesian classifier can be made optimal by minimizing
error probability.
away from
2. In Fig. 2.7.1, it is observed that when the threshold is moved increases.
a r e a under the curves always
the corresponding shaded
Xo to minimize the error.
3. Hence, we have to decrease this shaded area

for o, and R2 be the


be the region of the feature space
Let R,
corresponding region for o
to if x

5. Then an errorwill be occurred if, X


eR, although it belongs ,or
to
e
R, although it belongs o, 1.e.,

.(2.7.1)
P.= plxeRa , ,)+pxeR,, og)
be written as,
6. P. can

. =
plxeR2l)plo,) +plxeR, | «)plo)
..(2.7.2)
= Plo plx |o)dx+ pl®2) plx|@2)dx
R2 R
7. Using the Baye's rule,

= P] ploj lx)plxdx + J pl»g|x)plx)dx .(2.7.3)

2 R
and R, of the
The e r r o r will be minimized if the partitioning regions R,
feature space are chosen so that

R: plo, |x)> pl»,|x)


(2.7.4)
R2: pl»2lax) plo, |x) >

covers all the space, we have


9. Since the union of the regions R, R2

lx)ptx)dx =1 ..(2.7.5)
ploy lx)p/xdx+ ] plo
10.
2
Combining equation (2.7.3) and (2.7.5), we get,

(2.7.6)
P =plw,) (plojlx)-plw2|a)) plx\dx
11. Thus, the probability of error is minimized i f R, is the region of space in
which plo,|X) > p(o, |x). Then R, becomes region where the reverse is

true.
2-8L (CS/1T-Sem-5) egression & Bayesian Learning

12. In a classification task with M classes, o, , an unknown pattern,


represented by the feature vector x, is assigned to class o, if p(o,|x)>
p,Ix)v j*i.

Que 2.8. Consider the Bayesian classifier for the un iformly


distributed classes, where

xe la,, a,
Plxw,) =
muullion

*Elb,. 6,1
Plxlo) 6, -b
0 muullion
Show the classification results for some values for a and b
("muullion" means "otherwise").

Answer
Typical cases a r e presented in the Fig. 2.8.1.

Pa lu) Plxl)

(a) (b)
Pt ly) Pa|y,
ba-b
2

C) (d)
Pig. 28.1.

Define Bayes claseifier. Explain how classification is


Que2.9.
done by using Bayes classifier.

Answer
1. A Bayes classifier is a simple probabilistic classifier based on applying

Bayes theorem (from Bayesian statistics) with strong (Naive)


independence assumptions.
Machine Learning Techniques 2-9L (CS/IT-Sem-5)

2. A Naive Bayes classifier assumes that the presence (or absence) of a


particular feature of a class is unrelated to the presence (or absence) of
any other feature.
3 Depending on the precise nature of the probability model, Naive Bayes
classifiers can be trained very efficiently in a supervised learning
4 In many practical applications, parameter estimation for Naive Bayes
models uses the method of maximum likelihood; in other words, one
can work with the Naive Bayes model without believing in Bayesian
probability or using any Bayesian methods.
5. An advantage of the Naive Bayes classifier is that it requires a smal
amount ot training data to estimate the parameters (means and
variances of the variables) necessary for classification.
6. The perceptron bears a certain relationship to a classical patternn
classifier known as the Bayes classifier.
7. When the environment is Gaussian, the Bayes classifier reduces to a
linear classifier.
In the Bayes classifier, or Bayes hypothesis testing procedure, we
minimize the average risk, denoted by R. For a two-class problem,
represented by classes C1 and C2, the average risk is defined:

R
C Pa/C)ds+CP Pa/C,d

+CP
H2
Px/C,dx+CPPx/C d H
where the various terms are defined as follows:
P, = Prior probability that the observation vector x is drawn from
subspace H, with i = 1, 2, and P + P2=1

C, Cost of deciding in favour of class C, represented by subspace H,


when class C, is true, with t,J =1,2
P, /C) = Conditional probability density funetion of the random vector X

Fig. 2.9.1a) depicts a block diagram representation of the Bayes classifier.


The important points in this block diagram are two fold :
a The data processing in designing the Bayes classifier is confined
entirely to the computation of the likelihood ratio alx).
b. This computation is completely invariant to the values assigned to
the prior probabilities and involved in the decision-making process.
These quantities merely affect the values of the threshold x.
c. From a computational point of view, we find it more convenient to
work with logarithm of the likelihood ratio rather than the
likelihood ratio itself.
2-10 L(CS/AT-Sem-5) Regression & Bayesian Learning

[Assign r to class
Input vector
LikelihoodaAlx)Comparator
ratio
computer
ift (x) > E

Otherwise, assign
lit to class E2

(a)
[Assign x to class
Input vector Likelihood
ratio
og alar
Comparator
if log a (x)> log
computer Otherwise, assign
Lit to clas 52

(6) logT
Fig. 2.9.1. Two equivalent implementations of the Bayes classifier:
(a) Likelihood ratio test, (6) Log-likelihood ratio test

Que 2.10.1 Discuss Bayes classifier using some example in detail.

Answer
Bayes classifier: Refer Q. 2.9, Page 2-8L, Unit-2.

For
1.
example:
LetD be a training set of features and their associated class labels. Each
Teature is represented by an n-dimensional attribute vector
X=1, T2 ) depicting n measurements made on the feature from
n attributes, respectively A,Az.A
Suppose that there are m classes, C,. Cg Given a feature X, the
classifier will predict that X belongs to the class having the highest
posterior probability, conditioned on X. That is, classifier predicts that X
belongs to class C, if and only if,

pC,|X)>p(C,|X) for 1Sj S m, jai


Thus, we maximize p(C, |X). The class C, for which p(C,|X) is maximized
is called the maximum posterior hypothesis. By Bayes theorem,

PC,|X) =
AA |C)pG;
pX)
3. As p(X) is constant for all classes,
only PCX| C) PC; need to be
maximized. If the class prior probabilities are not known then it is
commonly assumed that the classes are cqually likely i.e.,
plC,) = plC,)=.. p{C,) and therefore pX|C)is maximized. Otherwise
P A T C ) p ) 1s maximized.

i Given data sets with many attributes, the computation of p(X|C)


will be extremely expensive.
ii. To reduce computation in evaluating pX|C,), the assumption of
class conditional
independence is mnde.
Machine Learning Techniques 2-11 L(CS/AT-Sem-5)

i. This presumes that the values of the attributes are conditionally


independent of one another, given the class label of the feature.

Thus, pX|c) =
||Pa,|C
R=

px,|C,) xp a2|C,)x.. plx,| C


*

iv. The probabilities plx, | C), plx,1C) plr, 1C) are easily estimated
from the training feature. Here x refers to the value of attribute
A, for each attribute, it is checked whether the attribute is
categorical or continuous valued.
For example, to compute p(X|C)) we consider,
a. IfA, is categorical then p(z,IC) is the number of feature of
class C in D having the value z, for A, divided by |C, D|, the
number of features of class C, in D.
b. IfA, is continuous valued then continuous valued attribute is
typically assumed to have a Gaussian distribution with a mean
Hand standard deviation a, defined by,

gx) = o 1_
so that pr,|C) =g{z,).
vi There is a need to compute the mean H and the standard deviation
a of the value of attribute A, for training set of class C, These
values are used to estimate plz,|CP
vii. For example, let X = (35, Rs. 40,000) where A, and A, are the
attributes age and income, respectively. Let the class label attribute
be buys-computer.
vii. The associated class label for X is yes (ie., buys-computer = yes).
Let's suppose that age has not been discretized and therefore exists
as a continuous valued attribute.

ix. Suppose that from the training set, we find that eustomer in Dwho
buy a computer are 38+ 12 years of age. In other words, for attribute
age and this class, we have u = 38 and a = 12.
5. In order to predict the classlabelofX.pUX|C)p(C)is evaluated for each
class C, The classifier predicts that the class label of X is the class C, if
and only if

plX|C) PC,)> pX|C) plC) for 1s jsm,j *i,


The predicted class label is the elass C, for which ptX|C) PC) is the
maximum.
Regression & Bayesian Learning
2-12L (CS/IT-Sem-5)

Que 2.11. Let blue, green, and red be


three classes of objects with
14.
P(blue) 1/4, P(green) 1/2, P(red)
=

prior probabilities given by and paper. Let the


Let there be three types of objects pencils, pens,
class-conditional probabilities of these objects
be given a s follows.
Use Bayes classifier to classify pencil, pen and paper.
l/2 P(paper/green)= /V6
P(penci/green) = I3 P(pen/green)= P(paper/blue) = 1/3
P(pencillblue) = 1/2 P(pen/blue) = 1/6
P(pen/red) = /3 P(paper/red) = /2
P(pencilred) = 1/6

Answer
As per Bayes rule:

=
Plpenci/ green) Pgreen)
Pgreenpencu (P(penci/green) Pgreen)P(red)P(pencil blue) +

P(blue)+ P(penci/ red)


1

3 2 b=0.5050
4
P(penci/ blue) P(blue)
Pblue/pencu P(pencil green) Hgreens P(penci/ blue)
P(blue) + P(pencil/ red) P(red)

2 =0.378
0.33
Ppenci/ red) P(red)
PMredpencil)= (P(penci/ red) Pred)+ P(pencil blue)
Pblue)+ P(pencil/ green) Plgreen)

6424 =0.126
.33 0.33.126

Since, Plgreen/pencil) has the highest value therefore pencil belongs to

class green.

Ppen green) P(green)


PMgreenpen P(pen/ green) Plgreen) + P(pen/ blue)
Pblue)+ P(pen/ red) Plred)

0.666
,1,1,1,10.375
Machine Learning Techniques 2-13 L (CSNT-Sem-5)

P(pen/ blue)P(blue)
Pblue/pen)=
P(pen/ green) Plgreen) +
Ptpen/ blue)
P(blue)+ P(pen/ red) Pred)

6 4 24 =
0.111
0.375 0.375

Pred/pen) = Ppen/red) P(red)


P(pen/ green) P(green) P(pen/ + blue)

Pblue)+ Ppen/ red) Ptred)


1
30.375
4120.375
=0.222
bUL
Since Plgreen/pen) has the highest value therefore, pen belongs to
class green.

Plpaper/ green) Plgreen)


Pgreen/paper Ppaper/ green) Plgreen) Pipaperr bIue +

Pblue) + P(paper/ red) Plred)

6 2

12 = 0.286
0.291
Plpaper/ blue) P(blue
Pblue/paper Plpaperl green) Pgreen) + Plpaper Diue
P(blue) + P(paper/ red) P(red)
1
12 = 0.286
0.291 0.291

Plpaper/ red) P(red)


Pred/paperP(paperl green) P(green)+ Plpaper/ blue)
P(blue)+ P(paper/ red) Plred)

1
2 4 8 0.429
0.291 0.291
value therefore, paper belongs to
Since, P(red/paper) has the highest
class red.

Naive Bayes classifier.


Que 2.12. Explain
2-14L(CS/AT-Sem-5) Regression & Bayesian Learning

Answer
Bayesian network model used
1. Naive Bayes model is the most common

in machine learning.
which is to be predicted and the
2. Here, the class variable C is the root
attribute variables X, are the leaves.
that the attributes are
The model is Naive because it
assumes
3.
other, given the cla8s.
conditionaly independent of each

0.9 * *

..
n * - -

***
0.8

0.7-

0.6
Decision 'ree
Naive Bayes ***
0.5

0.4
20 40 0 100
Training set size

Fig. 2.12.1. The learning curve for Naive Bayes learning


the parameters are:
4. Assuming Boolean variables,
0= PMC = true), 0,, = PRX, = true|C = true),

2 PX, = true | C= False)

5. Naive Bayes models can Bayesian networks in which each


be viewed as

andC has no parents.


X, has C as the sole parent
6. A Naive Bayes model gaussiEn PX,|C) is equivalent to a mixture
with matrices.
of gaussians with diagonal covariance
estimation in continuous
While mixtures of gaussians are used for density
mixed domains.
domains, Naive Bayes models used in discrete and
inference of marginal and
8. Naive Bayes models allow for very efficient
conditional distributions.
9. Naive Bayes learning has no dificulty with noisy data and can give
more appropriate probabilistic predictions.

Que 2.13. Consider a two-class (Tasty o r non-Tasty) problem with


the following training data. Use Naive Bayes classifier to classify
the pattern:
"Cook Asha, Health-Status Bad, Cuisine = Continental".
Machine Learning Techniques 2-15 L (CS/IT-Sem-5)

Cook Health-Status Cuisine Tasty


Asha Bad Indian Yes
Asha Good Continental Yes
Sita Bad Indian No
Sita Good Indian Yes
Usha Bad Indian es

Usha Bad Continental No


Sita Bad Continental No
Sita Good Continental Yes
Usha Good Indian Yes
Usha Good Continental No

Answer

Cook Health Cuisine


status
esN o (es No Yes No

Asha 0 Bad 2 3 Indian


Sita Good 4 1 Continental

Usha
Tasty
Yes No

Cook Health- Cuisine

status
Yes No Yes No Yes No
Asha
2/6 0Bad 2/6 3/4Indian 4/6 1/4
Sita 2/6 214 Good 4/6 1/4 Continental|2/63/4
Usha 2/6 2/4

T'asty
Yes No

6/10 4/10
Regression & Bayesian Learning
2-16 L (CS/AT-Sem-5)

Likelihoodof yes = =0.023

Likelihood of no = 0xx
4 4
x

10
=0

Therefore, the prediction is tasty.

Que 2.14. Explain EM algorithm with steps.


Answer
iterative way to
1. The Expectation-Maximization (EM) algorithm is an
when the
find maximum-likelihood estimates for model parameters
variables.
data is incomplete or has missing data points or has some hidden
missing data points and estimates a
2. EM chooses random values for the
new set of data.
used to estimate a better first
3. These new values are then recursively
missing points, until the values get fixed.
data, by filling up
These are the two basic steps of the EM algorithm

a Estimation Step:
i Initialize 4g, 2, and n, by random values, or by K means clustering
resuts.
results or by hierarchical clustering
i. Then for those given parameter values, estimate the value of
the latent variables (i.e., Yp
b. Maximization Step: Update the value of the parameters (t.e., Ha

and n,) calculated using ML method:


Initialize the mean H,
the covariance matrix 2, and

the mixing coefficients T

by random values, (or other values).

i Compute the n, values for all k.


i. Again estimate all the parameters using the current n, values.

function.
iv. Compute log-likelihood
v. Put some convergence criterion.
to some value
vi. If the log-likelihood value converges
(or if all the parameters converge to some values) then stop,

else return to Step 2.


Machine Learning Techniques 2-171.(CS/AT-Sem-5)

Que 2.15. Describe the usage, advantages and disadvantages of


EM algorithm.

Answer
Usage of EM algorithm:
1. It can be used to fill the missing data in a sample.
2. lt can be used as the basis of unsupervised learning of clusters.

3 ItMarkov used for


can beModel the purpose of estimating the parameters of Hidden
(HMM).

4. lt can be used for disecovering the values of latent variables.

Advantages of EM algorithm are:


1 l t is always guaranteed that likelihood will increase with each iteration.

2 The E-step and M-step are often pretty easy for many problems in terms
of implementation.

3 Solutions to the M-steps often exist in the closed form.


Disadvantages of EM algorithm are :
1. It has slow convergence.

2 It makes convergence to the local optima only.


3. It requires both the probabilities, forward and backwi.rd (numerical
optimization requires only forward probability).

Que 2.16.| Write a short note on Bayesian network.


OR
Explain Bayesian network by taking an example. How is the Bayesian
network powerful representation for uncertainty knowledge ?

Answer
A Bayesian network is a directed acyclic graph in which each node is
annotated with quantitative probability information.
2. The full specification is as follows:
i A set of random variables makes up the nodes of the network
variables may be discrete or continuous.

i. A set of directed links or arrows connects pairs of nodes. If there is


an arrow from x to node y, x is said to be parent
a
of y.
ii. Each node x, has a conditional probability distribution
Px,Iparentlz,)) that quantifies the effect of parents on the node.
iv. The graph has no directed eycles (and hence is directed
a
acyclic
graph or DAG).
2-18L (CS/AT-Sem-65) Regression & Bayesian Learning

3. A Bayesian network provides a complete description of the domain.


Every entry in the full joint probability distribution can be caleulated
from the information in the network.
4 Bayesian networks provide a concise way to represent conditional
independence relationships in the domain.
5. A Bayesian network is often exponentialy smaller than the full joint
distribution.
For example:
1. Suppose we want to determine the possibility of grass getting wet or dry
due to the occurrence of different seasons.
The weather has three states : Sunny, Cloudy, and Rainy. There are
two possibilities for the grass: Wet or Dry.
3. The sprinkler can be on or off. Ifit is rainy, the grass gets wet but ifit is
sunny, we can make grass wet by pouring water from a sprinkler.
Suppose that the grass is wet. This could be contributed by one of the
two reasons - Firstly, it is raining. Secondly, the sprinklers are turned
on.

5. Using the Baye's rule, we can deduce the most contributing factor
towards the wet grass.

Conditionj
Sprinkler Rain

Wet grass

Fig. 2.16.1.

Bayesian network possesses the following merits in uncertainty


knowledge representation:
1 Bayesian network conveniently handle incomplete data.
can

2 Bayesian network can learn the casual relation of variables. In data


analysis, casual relation is helpful for field knowledge understanding, it
can also easily lead to precise prediction even under much interference.

3 The combination of bayesian network and bayesian statistics can take


full advantage of field knowledge and information from data.
The combination of bayesian network and other models can effectively
avoid over-fitting problem.
2-19L(CS/IT-Sem-5)
Machine Learning Techniques

the role of prior probability


and posterior
Que 2.17.| Explain
classification.
probability in bayesian
Answer
Role of prior probability :

of the event
The prior probability is used
to compute the probability
1
before the collection of new data.
/ domain knowledge and is
2. It is used to capture our assumptions
independent of the data.
is assigned before any relevant
3. It is the unconditional probability that
evidence is taken into account.

Role of posterior probability:


of an event after
1. Posterior probability is used to compute the probability
collection of data.
domain knowledge and the
It is used to capture both the assumptions/
observed data.
pattern in
that is after the relevant evidence
3. It is the conditional probability assigned
or background is taken into account.

inference
method of handling approximate
Que 2.18. Explain the
in Bayesian networks.

Answer
inference
inference methods can be used when exact
1. Approximate times because the
network
methods lead to unacceptable computation
connected.
is very large or densely
inference:
2. Methods handling approximate
to generate
Simulation methods:
This method use the network
i. distribution and estimate
conditional probability
samples from the number of samples
conditional
of interest when the
probabilities
is sufficiently large.
inference task
This method express the
ii. Variational methods:
find upper and
numerical optimization problem and then
as a
of interest by solving a simplified
lower bounds of the probabilities
version of this optimization problem.
2-20 L (CS/AT-Sem-5) Regression & Bayesian Learning

PART 3
Support Vector Machine, Introduction, Types of Support
Vector Kernel - (Linear Kernel Polynomial Kernel, and Gaussian
Kernel), Hyperplane: (Decision Surface), Properties
of SVM, and Issues in SVM.

Questions-Answers

Long Answer Type and Medium Answer Type Questions

Que 2.19.| Write short note on support vector machine.


Answer
Refer Q. 1.23, Page 1-23L, Unit-1.

Que 2.20.| What are the types of support vector machine?


Answer

Following are the types of support vector machine :


1. Linear SVM: Linear SVM is used for linearly separable data, whieh
can be classified into two classes by using single
a
means if a
dataset
straight line, then such data is termed as linearly separable data, and
classifier is used called as Linear SVM classifier.
2 Non-linear SVM: Non-Linear SVM is used for non-linearly separated
data, which means if a dataset cannot be classified by using a straight
line, then such data is termed as non-linear data and classifier used is
called as Non-linear SVM classifier.

Que 2.21. What is polynomial kernel ? Explain polynomial kernel


using o n e dimensional and two dimensional.

Answer

1 The polynomial kernel is a kernel funetion used with Support Vector


Machines (SVMs) and other kernelized models, that represents the
Machine Learning Techniques 2-21 L(CS/AT-Sem-5)

similarity of vectors (training samples) in a feature space over polynomials


of the original variables, allowing learning of non-linear models.

Polynomial kernel function is given by the equation:


(a xb+ r "

where, a and b are two different data points that we need to classify.
r determines the coefficients of the polynomial
d determines the degree of the polynomial.
3. We perform the dot products of the data points, which gives us the high
dimensional coordinates for the data.

4When d=1, the polynomial kernel computes the relationship between


each pair of observations in 1-Dimension and these relationships help to
find the support vector classifier.
5. When d = 2, the polynomial kernel computes the 2-Dimensional
relationship between each pair of observations which help to find the
support vector classifier.

Que222. Describe Gaussian Kernel (Radial Basis Function).

Answer|
1. RBFkernel is a funetion whose value depends on the distance from the
origin or from some point.
2. Gaussian Kernel is of the following format:

K1,Xz, )=
exponent (=-7||X, -X, I)
I1X,- X, l = Euclidean distance between X, and X2

Using the distance in the original space we calculate the dot produet
(Similanty) of A, and
A
3. Following are the parameters used in Gaussain Kernel:
a. C:Inverse of the strength of regularization.
Behavior : As the value of'e' increases the model gets overfits.
As the value of 'c' decreases the model underfits.
b. y: Gamma (used only for RBF kernel)
Lehavior : As the value of y increases the model gets overfits.
As ine value of y decreases the model underfits.

gue 2.23Write short note on hyperplane (Decision surface).


2-22 L (CS/AT-Sem-5) Regression & Bayesian Learning

Answer

A hyperplane in an n-dimensional Euclidean space is a flat, n-1


of that space that divides the space into two
dimensional subset
disconnected parts.
2. Forexample let's assume a line to be one dimensional Euclidean space.
3. Now pick a point on the line, this point divides the line into two parts.
The line has 1 dimension, while the point has 0 dimensions. So a point is
a hyperplane of the line.
5. For two dimensions that the
we saw separating line was the hyperplane.
6. Similarly, for three dimensions a plane with two dimensions divides the
3d space nto two parts and thus act as a
hyperplane
Thus for a space of n dimensions we have a hyperplane of n-1 dimensions
separating it into two parts.

Que 224. What are the advantages and disadvantags of SVM ?

Answer

Advantages of SVM are:


. Guaranteed optimality : Owing to the nature of Convex Optimization,
the solution will always be global minimum, not a local minimum.
2 The abundance of implementations : We can access it conveniently.
3. SVM can be used for linearly separable as well non-linearly separable
as
data. Linearly separable data pases hard margin whereas non-linearly
separable data poses a soft margin.
SVMs provide compliance to the semi-supervised learning models. It
can be used in areas where the data is labeled as well as unlabeled. It
only requires a condition to the minimization problem which is known
as the transductive SVM.
5. Feature Mapping used to be quite a load on the computational complexity
of the overall training performance of the model. However, with the
help of Kernel Trick, SVM can carry out the feature mapping using the
simple dot product.

Disadvantages of SVM:
1. SVM does not give the best performance for handling text structures as
compared to other algorithms that are used in handling text data. This
leads to loss of sequentil information and thereby, leading to worse
performance.
Machine Learning Techniques 2-23 L (CS/NT-Sem-5)

2. SVM cannot return the probabilistie confidence value that is similar to


logistic regression. This does not provide much explanation as the
confidence is important in several
of prediction applications.
3. The choice of the kernel is perhaps the biggest limitation of the support
vector machine. Considering so many kernels present, it becomes difficult
to choose the right one for the data.

Que 2.25.
Explain the properties of SVM.
Answer

Following are the properties of SVM:


1. Flexibility in choosing a similarity function: Sparseness of
solution when dealing with large data sets only support vectors are used
to specify the separating hyperplane
2 Ability to handle large feature spaces: complexity does not depend
on the dimensionality of the feature
space
Overfitting can be controlled by soft margin approach: A simple
convex optimization problem which is guaranteed to converge to a single
global solution
What the parameters used in support vector
Que 2.26. are

classifier ?

Answer
Parameters used in support vector classifier are:
1. Kernel:
a. Kernel, is selected based on the type of data and also the type of
transformation.
b. By default, the kernel is Radial Basis Function Kernel (RBF).
2 Gamma 3

This parameter decides how far the influence of a single training


example reaches during transformation, which in turn affects how
tightly the decision boundaries end up surrounding points in the
input space.
b. Ifthere is a small value of gamma, points farther apart are considered
similar.

So, more points are grouped together and have smoother decision
boundaries (may be less accurate).
d. Larger values of gamma cause points to be closer together (may
cause uverfitting).
2-24 L (CS/AT-Sem-5) Regression & Bayesian Learning

3 The 'C' parameter


a This parameter controls the amount of regularization applied on
the data.
b. Large values of C mean low regularization which in turn causes
the training data to fit very well (may cause overfitting).
Lower values of C mean higher regularization which causes the
model to be more tolerant of errors (may lead to lower accuracy).
3 UNIT
Decision Tree Learning

CONTENTS
Part-1 Decision Tree Learning,
******************** 4 L to 3-6L
Decision Tree Learning
Algorithm, Inductive Bias,
Inductive Inference with
Decision Trees

Part-2 Bntropy and Information . 3-6L to 1 2 L


Theory, nformation Gain, D-3
Algorithm, Issues in Decision
Tree Learning

Part-3 : Instance-based Learning,. ********************.


..3-12L to 3-16L

Part-4: K-Nearest Neighbour. *********** 3-16L to I-20L


Learning, Locally Weighted
Regression, Radial Basis
Function Networks,

Part-5: Case-based Learning. . . ..3-20L to 3-27L

3-1L(CSIT-Sem-5)
3-2L (CS/IT-Sem-5) Decision Tree Learning

LPART- 11
Decision Tree Learning, Decision Tree Learning Algorithm,
Inductive Bias, Inductive Inference with Decision Trees

Questions-Answers
Long Answer Type and Medium Answer Type Questions

Que 3.1. Describe the basic terminology used in decision tree.

Answer
Basic terminology used in decision trees are:
1. Root node : It represents entire population or sample and this further
gets divided into two or more homogeneous sets.
2 Splitting: It is a process of dividing a node into two or more sub-nodes.
3 Decision node: When a sub-node splits into further sub-nodes, then it
is called decision node.
Root node
Branch/sub-tree
Splitting
Decision node Decision node

Terminal node Decision node|||Terminal nodel|Terminal node

Terminal node| Terminal node


4 Leaf/Terminal node: Nodes that do not split is called leaf or terminal
node.
5. Pruning: When we remove sub-nodes of a decision node, this process
is called pruning. This proceas is opposite to splitting process.
6 Branch/sub-tree: A sub section of entire tree is ealled branch or sub
tree.
7. Parent and ehild node: A node which is divided into sub-nodes is
called parent node of sub-nodes where as sub-nodes are the child of
parent node.
Decision Tree Learning
3-4L(CS/TT-Sem-5)
Outlook

Sunny Overcost Rain

Yes

Humidity Wind

Weak
High Normal Strong
No Yes No Yes

Fig. 3.3.1.

Que 34. Explain various decision tree learning algorithms.

Answer
Various decision tree learning algorithms are:
1. D3 (Iterative Dichotomiser3)
i D 3 is an algorithm used to generate a decision tree from a dataset.

To construct a decision tree, ID3 uses a top-down, greedy search


through the given sets, where each attribute at every tree node is
tested to select the attribute that is best for classification of a given
e.

i Therefore, the attribute with the highest information gain can be


selected as the test attribute of the current node.
iv. Inthis algorithm, small decision trees are preferred over the larger
ones. It isa heuristic algorithm because it does not construct the

smallest tree.
v. For building a decision tree model, ID3 only accepts categorical
ID3 when there is
attributes. Accurate results are not given by
noise and when it is serially implemented.
vi. Therefore data is preprocessed before constructing a decision tree.

vii. For constructing adecision tree informationgainis caleulated for


every attribute and attribute with the
highest information
each andcomes the root node. The rest possible values are denoted
gain
by arcs.
viü. All the outeome instances that are possible are examined whether
they belong to the same class or not. For the instances of the samme
class, a 8ingle name is used to denote the class otherwise the
instances are classified on the basis of splitting attribute.
3-5L(CSIT-Sem-5)
Machine Learning Techniques

2 C4.5:
i. C4.5 is an algorithm used to generate a decision tree. It is an extension
of ID3 algorithm.
for classification
ii. C4.5 generates decision trees which c a n be used
and therefore C4.5 is referred to a s statistical classifier.
with both
ii. It is better than the ID3 algorithm because it deals
continuous and discrete attributes and also with the missing values
and pruning trees after construction.
iv. C5.0 is the commercial successor of C4.5 because it 1s faster, memory

efficient and used for building smaller decision trees.


to the
v. C4.5 performs by default a tree pruning process. This leads
formation of smaller trees, simple rules and produces more intuitive

interpretations.
3 CART (Classification And Regression Trees):
i CART algorithm builds both classification and regression trees.

i. The classification tree is constructed by CART through binary


splitting of the attribute.
i Gini Index is used for selecting the splitting attribute.
iv. The CART is also used for regression analysis with the help of
regression tree.
can be used forecasting in a
V.The regression feature of CART
dependent variable given a set of predictor variable over a given

period of time.
vi CART has an average speed of processingg and supports both
continuous and nominal attribute data.

Que 35.What are the advantages and disadvantages of different


decision tree learning algorithm ?

Answer
Advantages of ID3 algorithm:
1. The training data is used to create understandable predietion rules.
2. It builds short and fast tree.
3. D3 searches the whole dataset to create the whole tree.
4. It finds the leaf nodes thus enabling the test data to be pruned and
reducing the number of tests.
The calculation time of ID3 is the linear function of the produet of the
characteristic number and node number.
Disadvantages of ID3 algorithm:
1. For a small sample, data may be overfitted or overclassified.
3-6 L (CS/IT-Sem-5) Decision Tree Learning

2. For making a decision, only one attribute is tested at an instant thus


consuming a lot of time.

3 Classifying the continuous data may prove to be expensive in terms of


as many trees have to be generated to see where to break
computation,
the continuous sequence.
4 Tt is overly sensitive to features when given a large number of input
values.

Advantages of C4.5 algorithm:


1 C4.5is easy to implement.
2. C4.5 builds models that can be easily interpreted.
3. t can handie both categorical and continuous values.
It can deal with noise and missing value attributes.
4
Disadvantages of C4.5 algorithmn:
1. A small variation in data can lead to different decision trees when using
C4.5.
2 For a small training set, C4.5 does not work very well.
Advantages of CART algorithm :
CART can handle missing values automatically using proxy
2 It uses combination of continuous/diserete variables.
splits
CART automatically performs variable selection.
CART can establish interactions among variables.
5. CART does not vary according to the monotonie transformation of
predictive variable.
Disadvantages of CART algorithm:
1. CART has unstable decision trees.
2. CART splits only by one variable.
3. It is non-parametric algorithm.

LPART-2
Entropy and Information Theory, Information Gain,
ID-3 Algorithm, Issues in Decision Tree Learning.

Questions-Answers
Long Answer Type and Medium Answer Type Questions

Que 36.Explain attribute selection measures used in decision


tree.
Machine Learning Techniques
3-7L(CS/T-Sem-5)

Answer
Attribute selection mensures used in decision tree are:
1. Entropy:
1 Entropy is a measure of uncertainty associated with a random
variable.
The entropy increases with the increase in uncertainty or
randomness and decreases with a decrease in uncertainty or
randomness.
ii. The value of entropy ranges from 0-1.

Entropy D)= 2L.-p log,(p)


where p, is the non-zero probability that an arbitrary tuple in D
belongs to class C and is estimated by IC, D|/|D|.
iv. A log function of base 2 is used because the entropy is encoded in
bits 0 and 1.
2 Information gain:
i ID3 uses information gain as its attribute selection measure.
i Information gain is the difference between the original information
gain requirement (i.e. based on the proportion of classes) and the
new requirement (i.e. obtained after the partitioning of A).

GainD,A) =
Entropy(D)- ntropy)
Where,
D: Agiven data partition
A: Attribute
V: Suppose we partition the tuples in D on some
attribute A having Vdistinct values

ii Dis split into V partition or subsets, ID,, Da.D) where D, contains


those tuples in D that have outcome a, of A.
iv. The attribute that has the highest information gain is chosen.

3 Gain ratio:
The information gain measure is biased towards tests with many
i
outcome8.
large number of
That is, it prefers to select attributes having a

values.
is
i As each is pure, the information gain by partitioning
partition
be used for classification.
maximal. But such partitioning cannot

attribute selection m e a s u r e which is an extension


to
iv. C4.5 uses this
the information gain.
3-8L(CSTT-Sem-5) Decision Tree Learning

Gain ratio differs from information gain, which measures the


information with respect to a classification that is acquired based
on some partitioning.
vi. Gain ratio applies kind of information gain using a split information
value defined as:

Splitinfoa =- A|DI D
v. The gain ratio is then defined as :

Gain ratio (A) = Gain (A)


SplitInfo (D)
vi. A splitting attribute is selected which is the attribute having the
maximum gain ratio.

Que 3.7.Explain applications of decision tree in various areas


of data mining.

Answer
The various decision tree applications in data mining are:
1. E-Commerce: It is used widely in the field of e-commerce, decision
tree helps to generate online catalog which is an important factor tor
the success of an e-commerce website.

2 Industry: Decision tree algorithm is useful for producing quality control


(faults identification) systems.

3. Intelligent vehicles : An important task for the development of


intelligent vehicles is to find the lane boundaries the of road.
4 Medicine:
a. Decision tree is an important technique for medical research and
practice. A decision tree is used for diagnostic of various diseases.
b. Decision tree is also used for hard sound
diagnosis.
5. Business: Decision trees find use in the field of businesswhere they
used for
are

(Customer
visualization probabilistic
of business models, used in CRMM
Relationship Management) and used for credit scoring for
credit card users and for
predicting loan risks in banks.
Que 38. Explain procedure of ID3 algorithm.
Answer
ID3 (Examples, Target Attribute, Attributes)
1. Create a Root node for the tree.
2. If all Examples are positive, return the
=+
single-node tree root, with label
Machine Learning Techniques 3-9L (CS/IT-Sem-5)

3. If all Examples are negative, return the single-node tree root, with Ilabel

4 If Attributes is empty, return the single-node tree root, with label =


most common value of target attribute in examples.
5. Othervise begin
a. A-the attribute from Attributes that best classifies Examples
b. The decision attribute for Root-A
C. For each possible value, V, of A,
i Add a new tree branch below root, corresponding to the test A
V,
i Let Example V, be the subset of Examples that have value V,
for A

i IfExample V,is empty


a. Then below this new branch add a leaf node with label
most common value
of'TargetAttribute in Examples
b. Else below this new branch add the sub-tree ID3 (Example
V,, TargetAttribute, Attributes-{A))
6. End
7. Return root.

gue 3.9. Explain induetive bias with inductive system.

Answer
Inductive bias:
1. Inductive bias refers to the restrictions that are imposed by the
assumptions made in the learning method.
2 Por example, assuming that the solution to the problem of road safety
can be expressed as a conjunction of a set of eight concepts.
3. This does not allow for more complex expressions that cannot be
expressed as a conjunction.
4. This inductive bias means that there are some potential solutions that
we cannot explore, and not contained within the version space we
examine.

5. Order to have an unbiased learner, the version space would have to


contain every possible hypothesis that could possibly be expressed.
6. The solution that the learner produced could be
never more general
than the complete set of training data.
7. In other words, it would be able to classify data that it had previously
seen (as the rote learner could) but would be unable to generalize in
order to classify new, unseen data.
Decision Tree Learning
3-10L (CSAT-Sem-5)
8. The inductive bias of the candidate elimination algorithm is that it is

only able to classify a new piece of data if all the hypotheses contained
within its version space give data the same classification.

9 Hence, the inductive bias does impose a limitation on the learning method.
Inductive system
Inductive system
Classification of
Candidate new instance or
Training exampi elimination do not know
algorithm
Newinstance
Using hypothesis
space H

Fig. 3.9.1.

Que 3.10 Explain inductive learning algorithm.

Answer
Inductive learning algorithm
Step 1: Divide the table T containing m examples into n sub-tables
(t1, t2,..tn). One table for each possible value of the class attribute (repeat
steps 2-8 for each sub-table).
Step2: Initialize the attribute combination countj = 1.

Step 3:For the sub-table on which work is going on, divide the attribute list
into distinct combinations, each combination withj distinct attributes.
Step 4: For each combination of attributes, count the number of occurrences
of attribute values that appear under the same combination of attributes in
unmarked rows of the sub-table under consideration, and at the same time,
not appears under the same combination of attributes of other sub-tables.
Call the first combination with the maximum number of occurrences the
max-combination MAX.
Step 5:If MAX = = null, increasej by 1 and go to Step 3.

Step 6: Mark all rows of the sub-table where working, in which the values
of MAX appear, as classified.
Step 7:Add a rule (IP attribute ="XYZ">THEN decision is YES/NO) to R
(rule set) whose left-hand side will have attribute names of the MAX with
their values separated by AND, and its right hand side contains the decision
attribute value associated with the sub-table.
Step8:If all rows are marked as classified, then move on to process another
sub-table and go to Step 2, else, go to Step 4. If no sub-tables are available,
exit with the set of rules obtained till then.
Machine Learning Techniques 3-11L (Cs/IT-Sem-5)

Que 3.11. Which learning algorithms are used in inductive bias?

Answer
Learning algorithm used in inductive bias are:
1. Rote-learner:
a.
Learning corresponds to storing each observed training example in
memory.
b. Subsequent instances classified by
are
looking them up in memory.
c. If the instance is found in memory, the stored classification is
returned.
d Otherwise, the system refuses to classify the new instance.
e Inductive bias: There is no inductive bias.
2 Candidate-elimination:
a. New instances are classified
only in the case where all members of
the current version
space agree on the classification.
b. Otherwise, the system refuses to classify the new, instance.
c.
Induetive bias: The target concept can be represented in its
hypothesis space.
3 FIND-S:
a. This
algorithm, finds the most specific hypothesis consistent with
the training examples.
b. It then uses this hypothesis to classify all subsequent instances.
c.
Inductive bias: The target concept can be
hypothesis space, and all instances are negativerepresented
in its
instances unless
the opposite is entailed by its other
knowledge.
Que 3.12. Discuss the issues related to the applications of decision
trees.

Answer
Issues related to the
applications of decision trees are:
1 Missing data:
d. When values have
gone unrecorded, or they might be too
expensive
obtain.
b. Two problems arise:
To classify an object that is missing from the test
i. To modify the information
attributes.
gain formula when examples have
unknown values for the attribute.
Decision Tree Learning
3-12L(CS/AT-Sem-5)
2 Multi-valued attributes:
a. When an attribute has many possible values, the information gain

gives an inappropriate indication of the attribute's


measure
usefulness.
b. In the extreme case, we could use an attribute that has a different

value for every example.


C. Then each subset of examples would be a singleton with a unique
classification, so the information gain measure would have its

highest value for this attribute, the attribute could be irrelevant or

useles
d One solution is to use the gain ratio.
Continuous and integer valued input attributes:

a. Height and weight have an infinite set of possible values.

b. Rather than generating infinitely many branches, decision tree


learning algorithms find the split point that gives the highest
information gain.
C. Eficient dynamie programming methods exist for finding good
part of real orld
Split points, but it 1s still the most expensive
decision tree learning applications.

4 Continuous-valued output attributes:


a f we are trying to predict a numerical value, such as the price of
a work of art, rather than discrete classifications, then we need a

regression tree.
b. Such a tree has a linear function of some subset of numerical

attributes, rather than a single value at each leaf.


c. The learning algorithm must decide when to stop splitting and

begin applying linear regression using


the remaining attributes.

PART-3
Instance-based Learning.

Questions-Answers

Long Answer Type and Medium Answer Type Questions

Que 3.13. Write short note on inatance-based learning.


Machine Learning Techniques 3-13 L (CS/AT-Sem-5)

Answer
1. Instance-Based Learning (TBL) is an extension of nearest neighbour or
K-NN classification algorithms.
TBL algorithms do not maintain a set of abstractions of model
created
rom the instances.
3. The K-NN algorithms have large space requirement.
4 They also extend it with a significance test to work with noisy instances,
since a lot of real-life datasets have training instances and K-NN
algorithms do not work well with noise.
5. Instance-based learning is based on the memorization of the dataset.
6. The number of
data.
parameters is unbounded and grows with the size of the
7. The classification is obtainedthrough memorized examples.
8. The cost of the learning process is 0, all the cost is in the computation of
the prediction.
9. This kind learning is also known as lazy learning.
Que 3.14. Explain instance-based learning representation.
Answer
Following are the instance based learning representation:
Instance-based representation (1):
1 The simplest form of learning is plain memorization.
2 This is a completely different way of representing the knowledge extracted
from a set
of instances:
just store the instances themselves and operate
by relating new instances whose class is unknown to existing ones
whose class is knowT.
3 Instead of creating rules, work directly from the examples themselves.
Instance-based representation (2):
Instance-based learning is lazy, deferring the real work as long as
possible.
In instance-based learning, each new instance is compared with existing
ones using a distance metric, and the closest existing instance is used to
assign the class to the new one. This is also called the nearest-neighbour
classification method.
3. Sometimes more than one nearest neighbour is used, and the majority
class of the closest k-nearest neighbours is assigned to the new instance.
This is termed the k-nearest neighbour method.

Instance-based representation (3):


1 When computing the distance between two examples, the standard
Euclidean distance may be used.
3-14 L (CS/AT-Sem-5) Decision Tree Learning

2. A distance of 0 is assigned if the values are identical, otherwise the


distance is 1.
3. Some attributes will be more important than others. We need some
kinds of attribute weighting. To get suitable attribute weights from the

training set is a key problem.


It may not be necessary, or desirable, to store all the training instances.

Instance-based representation (4) :


1. Generally some regions of attribute space are more stable with regard
to class than others, and just a few examples are needed inside stable

regions.
is that they do
2. An apparent drawback to instance-based representation
a r e learned.
not make explicit the structures that

b) (C)
a
Fig. 3.14.1.

Que 3.15. What are the performance dimensions used for instance

based learning algorithm ?

Answer
instance-based learning algorithm
Performance dimension used for

are:

1. Generality:
of a n
describe the representation
a This is the class of concepts that
algorithm.
concept whose boundary is a
b. IBL algorithms can pac-learn any
of closed hyper-curves of finite size.
union of a finite number

2 Accuracy: This concept describes the accuracy classification.of


3. Learning rate: increases during
classification accuracy
a. This is the speed at which
training.
of the learning
b. It is a more useful indicator of the performance
finite-sized training sets.
than accuracy for
algorithm
4 Incorporation costs:
descriptions with a

These are incurred while updating the concept


a.
single training instance.

include classification costs.


b. They
Machine Learning Techniques
3-15 L (CS/IT-Sem-5)
5. Storage requirement This is the size of the
:
IBL algorithms, which is defined as the concept description for
number of saved instances used
for classification decisions.

Que 3.16. What are the functions of instance-based learning 7


Answer
Functions of
1.
instance-based learning
Similarity function:
are:

a. This computes the similarity between


instances in the concept
a training instance i and the
description.
b. Similarities are numeric-valued.
2 Classification function:
a This receives the similarity function's results and the classification
performance records of the instances in the concept description.
b. It yields a classification for i.
3 Concept description updater:
a. This maintains records classification
on
performance and decides
which instances to include in the concept description.
b. Inputs include i, the similarity results, the classification results,
and a current concept description. It yields the modified concept
description.
Que 3.17.What are the advantages and disadvantages of instance
based learning?

Answer
Advantages of instance-based learning:
1. Learning is trivial.
2. Works efficiently.
3. Noise resistant.
Rich representation, arbitrary decision surfaces.

5. Easy to understand.
Disadvantages of instance-based learning:
1. Need lots of data.
2. Computational cost is high.
3. Restricted to x e R".
4. Implicit weights of attributes (need normalization).
5 Need large space for storage i.e., , require large memory.
6. Expensive application time.
3-16 L (CS/AT-Sem-5) Decision Tree Learning

PART-44
K-Nearest Neighbour Learning, Locally Weighted Regression,
Radial Basis Function Networks.

Questions-Answers
Long Answer Type and Medium Answer Type Questions

Deseribe K-Nearest Neighbour algorithm with steps.


Que 3.18.|

Answer
is used to decide the new instance
1. The KNN classification algorithm
should belong to which class.
2 When K=1, we have the nearest neighbour algorithm.
3. KNN classification is incremental.
training phase, all instances are
KNN classification does not have a

stored. Training uses indexing to find neighbours quickly.


to find K-nearest
KNN classification algorithm has
5. During testing, if we do exhaustive
neighbours of a new instance. This is time consuming

comparison.
the local neighborh0od to obtain
a prediction.
6. K-nearest neighbours use

Let p be an
Algorithm: Let m be the number of training data samples.
unknown point.
of data points array. This means
Store the training samples in an array
a tuple (x, y).
each element of this array represents
2. For i = 0 to m:

Calculate Euclidean distance d(arrlü), p).


obtained. Each of these distances
3 Make set S of K smallest distances
corresponds to an already classified data point.
4. Return the majority label among S.
and disadvantages of K-nearest
Que 3.19. What are the advantages
neighbour algorithm ?

Answer
Advantages of KNN algorithm
1. No training period:
a. KNN is called lazy learner (Instance-based learning).
Machine Learning 1Techniques 3-17L (Cs/IT-Sem-5)

b. It does not learn anything in the training period. It does not derive
any discriminative function fronm the training data.
c. In other words, there is no training period for it. It stores the
training dataset and learns from it only at the time of making real
time predictions

d This makes the KNN algorithm much faster than other algorithms
that require training for example, SVM, Linear Regression etc.
Since the KNN algorithm requires no training before making predictions,
new data can be added seamlessly which will not impact the accuracy of
the algorithm.
3. KNN is very easy to implement. There are only two parameters required
to implement KNN i.e., the value of K and the distance function (for
example, Euclidean).
Disadvantages of KNN:
1. Does not work well with large dataset: In large datasets, the cost of
calculating the distance between the new point and each existing points
is huge which degrades the performance of the algorithm.
Does not work well with high dimensions: The KNN algorithm
does not work well with high dimensional data because with large number
of dimensions, it becomes difficult for the algorithm to calculate the
distance in each dimension.
3 Need feature scaling: We need to do feature scaling (standardization
and normalization) before applying KNN algorithm to any dataset.
do not do so, KINN may
If we
generate wrong predictions.
Sensitive to noisy data, missing values and outliers : KNN is
sensitive to noise in the dataset. We need to manually represent missing
values and remove outliers.

Que 3.20.| Explain locally weighted regression.


Answer
1. Model-based methods, such as neural networks and the mixture of
Gaussians, use the data to build a parameterized model.
2. After training, the model is used for predictions and the data are generally
discarded.
3. In contrast, memory-based methods are
non-parametric approaches
that explicitly retain the training daa, and it each time
needs to be
use a
prediction
made.
4 Locally Weighted Regression (LWR) is a memory-based method that
performs a regression around a point using only training data that are
local to that point.
5. LWR was suitable for real-time control
by constructing an LWR-based
system that learned a difficult juggling task.
3-18 L (CS/AT-Sem-5) Decision Tree Learning

Fig. 3.20.1
6. The LOESS (Locally Estimated Scatterplot Smoothing) model performs
a linear regression on points in the data set, weighted by a kernel
centered at x.

for which the original LOESS


kernel shape is a design parameter
The
model uses a tricubic kernel:
h,(x) = h(x-x) = exp(-k{r - x,9,

where k is a smoothing parameter


8. For brevity, we will drop the argument x for h,x), and define n = 2h,
We can then write the estimated means and covariances as

expectations and
9. We use the data covariances to express the conditional
their estimated variances

Kernel too wide includes region


Kernel just right
to0 narrow excludes some of linear regionn
Kernel
Fig. 3.20.2.

Que 3.21. Explain Radial Basis Function (RBF).

Answer
A Radial Basis Function (RBP) is a function that assigns real value
a to
1.
value
each input from its domain (it is a real-value function), and the
it is a mneasure of
produced by the RBF is alway8 an absolute value i.e.,
distance and cannot be negative.
Machine Learning Techniques 3-19L (CS/AT-Sem-5)

2. Euclidean distance (the straight-line distance) between two points in


Euclidean space is used.
3. Radial basis functions are used to approximate functions, such as neural
networks acts as function approximators.
The following sum represents a radial basis function network:

yx)= 2u, (|| x- x, ID,


5. The radial basis functions act as activation functions.
6. The approximant y(x) is differentiable with respect to the weights which
are learned using iterative update methods common among neural
networks.

Que3.22 Explain the architecture of


network.
a radial basis funetion

Answer
1. Radial Basis Funetion (RBF) networks have three layers: an input
layer, a hidden layer with a non-linear RBF activation function and a
linear output layer.
The input can be modeled as a vector of real numbers r e R".
3. The output of the network is then a scalar function of the input vector,
: R R , and is given by
dx) = a , pl| x - e, |)

Output y

Linear weights

Radial basis
functions

Weights

Input x

Fig. 3.22.1. Architecture ofa radial basis funetion network. An input


vector is input to radial
x
usedas all basis functions, each with different
parameters. The output of the network is a linear combination of the
outputs from radial basis functions.
Decision Tree Learning
3-20 L (CS/AT-Sem-5)
center
where n is the number of neurons in the hidden layer, c, is the
of n e u r o n i in the linear output
vector for neuron i and a, is the weight
neuron.
center vector are
Functions that depend only on the distance from a

vector.
radially symmetric about that
5. In the basic form all inputs are connected to each hidden neuron.

Gaussian
6. The radial basis function is taken to be

P x-c, |) =exp|-BIlx - , IFJ


7. The Gaussian basis functions are local to the center vector in the sense

that

0
lim pll c , )
=

has only a small effect for input


changing parameters of one
neuron
1.e.,
far away from the center of that
neuron.
values that are

8. Given certain mild conditions on the shape of the activation function,


RBP networks are universal approximators on a compact subset of R".
n e u r o n s can
9. This means that an RBF network with enough hidden
bounded set with
continuous function on a closed,
approximate any
arbitrary precision.
a r e determined in a manner that optimizes
10. The parameters a, Ci, P, and f
between and the data.
the fit

PART-5]
Case-based Learning

Questions-Answers

Answer Type Questions


Long Answer Type and Medium

Que 828. Write short note on case-based learning algorithm.

Answer
1. Case-Based Learning (CBL) algorithms contain input sequence
an asa

which can be used


of training cases and an output concept description,
to generate predictions of goal feature values for subsequently presented
cases.
Machine Learning Techniques
3-21 L(CS/AT-Sem-5)
The primary
component of the
almost all CBL deseriptionconcept is case-base, but
algorithms maintain additional related information for
the purpose of generating accurate
for feature weights). predictions (for example, settings
3. Current CBL
algorithms assume that cases are deseribed
value
representation, where features are
using a
either predictor feature
features. or
goa
CBL algorithms are distinguished by their
processing behaviour.
Disadvantages of case-based learning algorithm:
1. They are computationally expensive because
similarities all training cases.
to they save and compute
They are intolerant of noise and irrelevant features.
3. They are sensitive to the choice of the
algorithm's similarity function.
4 There is no
simple way they can process symbolic valued feature values.
Que 3.24. What the
are
functions of case-based learning algorithm?
Answer
Functions of case-based learning algorithm are
1.
Pre-processor: This prepares the input for processing (for example,
normalizing the range of numeric-valued features to ensure that
are treated with they
equal importance by the similarity function, formatting
the raw
input into a set of cases).
2 Similarity:

a This function assesses the similarities of a


given case with the
previously stored cases in the concept
description.
b. Assessment involve
may explicit encoding and/or dynamic
computation.
CBL similarity functions find
between these extremes.
a
compromise along the continuum

Prediction : This function inputs the similarity assessments and


generates a prediction for the value of the given case's goal feature (ie.,
a classification when it
is symbolic-valued).
4 Memory updating : This updates the stored case-base, such
modifying or abstracting previously stored cases, forgetting cases
as by
presumed to be noisy, or
updating a feature's relevance weight setting.
Que 3.26. Describe case-based learning cycle with different
schemes of CBL
3-22 L (Cs/IT-Sem-5) Decision Tree Learning

Answer
Case-based learning algorithm processing stages are :
1. Caseretrieval : After the problem situation has been assessed, the
best matching case is searched in the case-base and an approximate
solution is retrieved.
2 Case adaptation: The retrieved solution is adapted to fit better in the
new problem.

Problem

Retrieve

Retain
Revise

Confirmed Proposed
solution solution
i g 3.36.1.The CBL cycle.

3 Solution evaluation:
solution can be evaluated either before the solution is
a The adapted
to the problem or after the solution has been applied.
applied
In any case, if the accomplished result
is not satisfactory, the
b.
or more cases should be
retrieved solution must be adapted again
retrieved.
verified as correct, the new
4 Case-base updating: If the solution
base.
was

may be added to the


case
case

Different scheme of the CBL working cycle are:

1. Retrieve the most similar case.

Reuse the case to attempt to solve the


current problem.
2.
Revise the proposed solution if necessary.
3.
Machine Learning Techniques 3-23 L (CS/TT-Sem-5)

4 Retain the new solution as a


part of a new case.

Que 3.26.| What are the benefits of CBL as a lazy problem solving
method ?

Answer
The benefits of CBL as a lazy Problem solving method are:
Ease
of knowledge elicitation:
a.
Lazy methods can utilise easily available case
instead of rules that
or problem instances
are difficult to extract.
b. So, classical knowledge engineering is replaced by case acquisition
and structuring.

2 Absence of problem-solving bias


a. Cases can be used for multiple
they are stored in a raw form.
problem-solving purposes, because
b. This in contrast to eager methods, which can be used merely for
the purpose for which the
knowledge has already been compiled.
3 Ineremental learning:

a. ACBL system can be put into operation with a minimal set solved
cases furnishing the case base.

b. The case base will be filled with new cases


increasing the system's
problem-solving ability.
C. Besides augmentation of the case base, new indexes and clusters
categories can be created and the existing ones can be changed.
d This in contrast requires a special training period whenever
informatics extraction (knowledge generalisation) is performed.
e. Hence, dynamic on-line adaptation a non-rigid environment i
possible.
4 Suitability for complex and not-fully formalised solution spaces:
CBLsystems can applied to an incomplete model of problem domain,
implementation involves both to identity relevant case features
and to furnish, possibly a partial case base, with proper cases.

b.Lazy approaches are appropriate for complex solution spaces than


eager approaches, which replace the presented data with
abstractions obtained by generalisation.
Decision 'Tree Learning
3-24 L(CS/AT-Sem-5)

problem solving:
5. Suitability for sequential learning
reinforcement
encountered
like these
a. Sequential tasks, the form of sequence
problems, benefit from the storage of history in
states or procedures.
of
by lazy approaches.
b. Such a storage is facilitated
& Ease of explanation :

based upon the similar1ty


CBL system can be justified
a. The results of a
problem to the retrieved
case.
of the current
to
precedent cases, it is also easier
CBL easily traceable to
b. are

analyse failures of the system.


fact that CBL
This is particularly due to the
7. Ease
of maintenance:
to many changes in the problem
domain and the
systems can adapt
relevant environment, merely by acquiring

the limitations of CBL ?


Que 3.27. What are

Answer
Limitations of CBL are:

bases:
1 Handling large case retrieval
requirements and time-consuming
a High memory/ storage case bases.
accompany CBL systems utilising large

order of both is linear with


the number of cases,
Although the costs and
these problems usually lead
to increased construction

reduced system pertormance.


less significant as the hardware components
c. These problems are

become faster and cheaper.

domains:
2 Dynamie problem
problem
a. CBL systems may have difficulties in handling dynamic the
in
unable to follow a shift way
domains, where they may be
strongly biased towards what
problems are solved, since they are

has already worked.

outdated case base.


b. This may result in an

3. Handling noisy data:


a. Parts of the problem situation may be irrelevant to the problem

itself.
Machine Learning Techniques 3-25 L(CS/IT-Sem-5)

b. Unsuccessful assessment of such noise present in a


problem
situation currently imposed on a CBL system may result in the
same problem being unnecessarily stored numerous times in the
case base because of the difference due to the noise.

C. In turn this
implies inefficient storage and retrieval
of cases.
4 Fully automatic operation:
a. In a CBL system, the problem domain is not fully covered.

b. Hence, some problem situations can occur for which the system
has no solution.

c . I n such situations, CBL systems expect input from the user.

Que 3.28. What are the applications of CBL ?

Answer
Applications of CBL:
1. Interpretation : It is a process of evaluating situations/ problems in
some context (Por example, HYPO for interpretation of patent laws
KICS for interpretation of building regulations, LISSA for interpretation
of non-destructive test measurements).
2 Classification: It is a process of explaining a number of encountered
symptoms (For example, CASEY for classification of auditory
impairments, CASCADE for classification of software failures, PAKARR
for causal classification of building defects, ISFER for classification of
facial expressions into user defined interpretation categories.
3 Design:Itis a process of satisfying a number of posed constraints (For
example, JULIAfor meal planning. CLAVIER for design of optimal
layouts of composite airplane parts, EADOCS for aircraft panels design).
4 Planning: It is a process of arranging a sequence of actions in time
For example, BOLERO for building
diagnostic plans for medical patients,
TOTLEC for manufacturing planning).
5. Advising: It is a process of resolving diagnosed problems (For example,
DECIDER for advising students, HOMER).

Que 3.29.| What are major paradigms of machine learning ?

Answer
Major paradigms of machine learning are:
1. Rote Learning:
a. There is one-to-one mapping from inputs to stored representation.

b. Learning by memorization.
3-26 L (Cs/AT-Sem-5) Decision Tree Learning

retrieval.
C. There is Association-based storage and
2 Induction : Machine learning use specific examples to reach general
conclusions.

3 Clustering: Clustering isa task of grouping


a set
of objects in suchto a

a r e similar to each other than


way that objects in the same group
those in other group.
different
Analogy: Determine correspondence between two
representations.
Discovery: Unsupervised i.e., specific goal not given.

6 Genetic algorithms:
a Genetic algorithms are stochastic search algorithms which act on a

population of possible solutions


b. They are probabilistic search methods means that the states which
the properties of the
they explore are not determined solely by
problems.
7. Reinforcement:
a. In reinforcement only feedback (positive or negative reward) given
at end of a sequence of steps.
b. Requires assigning reward to steps by solving the credit assignment
problem which steps should receive credit or blame for a final result.

Que 3.30.Briefly explain the inductive learning problem.

Answer
Inductive learning problem are :

1. Supervised versus unsupervised learning:


a. We want to learn an unknown function flx) =y, where x is an input
example and y is the desi o put.
b. Supervised learning implies we are given a set of (x, y) pairs by a

teacher.
c. Unsupervised learning means we are only given the xs.

d In either case, the goal is to estimatef


2 Concept learning:
a. Given a set of examples of some concept/class/category, determine
if a given example is an instance of the concept or not.
b. Ifit is an instance, we call it a positive example.
c. Ifit is not, it is called a negative example.
Machine Learning Techniques 3-27L (CS/IT-Sem-5)
3 Supervised concept learning.by induetion:
a. Given a
training set of positive and negative examples of a concept,
construct a description that will accurately classify whether future
examples are positive or negative.
b. That is, learn some good estimate of functionf given a training set
((x1, y1), (x2, y2), .., (xn, yn)) where eachy; is either + (positive) or
- (negative).
4
UNIT
Artificial Neural
Network and
Deep Learning

CONTENTS
4-2L to 4-11L
Part-1 Neural Network,...
Artificial
Perceptron's, Multilayer
Perceptron, Gradient Descent
and the Delta Rule

Part-2 : Multilayer Network, ************************ 4-11L to 4-19L


Derivation of Backk
Propagation Algorithm,
Generalization

Part-3 : Unsupervised Learning,... 4 - 1 9 L to 4-22L


SOM Algorithm and its Variants

Part-4 Deep Learning. Introduction, . . . 22L to 4-271L


Concept of Convolutional Neural
Network, Types of Layers,
(Convolutional Layers, Activation
Function, Pooling, Fully Connected)

Part-5 4-27L to 4-31L


:
Concept
(1D and
of Convolution.. .
2D) Layers,
Training of Network, Case
Study of CNN for eg on Diabetic
Retinopathy, Building a Smart
Speaker, Self Driving Car etc.

4-1L(CS/IT-Sem-5)
Artificial Neural Network & Deep Learning
4-2L (CSIT-Sem-5)

PART 1
Artifical Neural Network, Perceptron's Multilayer Perceptron,
Gradient Descent and the Delta Rule.

Questions-Answers
Long Answer Type and Medium Answer Type Questions

Que 4.1. Describe Artificial Neural Network (ANN with different


layers.
Answer
Artificial Neural Network: Refer . 1.13, Page 1-14L., Unit-1.
A neural network contains the following three layers
a. Input layer: The activity of the input units represents the raw
information that can feed into the network.
b. Hidden layer:
i Hidden layer is used to determine the activity of each hidden
unit.
ii The activities of the input units and the weights depend on the
connections between the input and the hidden units.
ii. There may be one or more hidden layers.
Output layer: The behaviour of the output units depends on the
activity of the hidden units and the weights between the hidden
and output units.

Que 4.2.What are the advantages and disadvantage of Artificial


Neural Network ?

Answer
Advantages of Artificial Neural Networks (ANN):
1. Problems in ANN are represented by attribute-value pairs.
2 ANNs are used for problems having the target function, output may be
discrete-valued, real-valued, or a vector of several real or discrete-valued
attributes.

3. ANNs learning methods are quite robust to noise in the training data.
The training examples may contain errors, which do not affect the final
tput.
4-3 L(CS/IT-Sem-6)
Machine Learning Techniques
function
evaluation of the learned target
4. It is used where the fast
required.
5. ANNs can bear long training times depending on factors such as the

of training examples
number of weights in the network, the number
considered, and the settings of various learning algorithm parameters.

Networks (ANN) :
Disadvantages of Artificial Neural
1. Hardware dependence:
with parallel processing
a. Artificial neural networks require processors

structure.
power, by their

b. For this reason, the realization of the equipment is dependent.


2 Unexplained funetioning of the network:
a. This is the most important problem of ANN.
does not give a clue as to
b. When ANN gives a probing solution, it
why and how.

This reduces trust in the network.


structure:
3 Assurance of proper network
a. There is no specific rule for determining the structure of artificial
neural networks.
b. The appropriate network structure is achieved through experience
and trial and error.

problem to the network:


4 The difficulty of showing the
ANNs can work with numerical information.
b. Problems have to be translated into numerical values before being
introduced to ANN.
the
The display mechanism to be determined
will directly influence
performance of the network.

d This is dependent on the user's ability.


is unknown:
5. The duration of the network
certain value of the error on the
a The network is reduced to a

sample means that the training has been completed.


b. This value does not give us optimum results.

Neural
Que 43.What are the characteristics of Artificial
Network ?

Answer
Characteristics of Artificial Neural Network are:

1. It is neurally implemented mathematical model.


2. It contains large number of interconnected processing elements called
neurons to do all the operations.
44L(CSIT-Sem-5) Artificial Neural Network & Deep Learning

Information stored in the neurons is basically the weighted linkage of


neurons.

4. The input signals arrive at the processing elements through connections


and connecting weights
5. It has theability to learn, recall and generalize from the given data by
suitable assignment and adjustment of
6
weights.
The collective behaviour of the neurons describes its computational
power, and no single neuron carries specific information.

Que 4.4.Explain the application areas ofartificial neural network.


Answer
Application areas of artificial neural network are:
1. Speech recognition:
a. Speech occupies a prominent role in human-human interaction.
b. Therefore, it is natural for people to expect speech interfaces with
computers.
In the present era, for communication with machines, humans stil
need sophisticated languages which are difficult to learn and use.

To ease this communication barrier, a simple solution could be


in spoken language that is possible for the machine
communication
to understand.
a

e. Hence, ANN is playing a major role in speech recognition.

Character recognition:
a It is a problem which falls under the general area of Pattern
Recognition.
b. Many neural networks have been developed for automatic
recognition of handwritten characters, either letters or digits.
3 Signature verification application:
a Signatures are useful ways to authorize and authenticate a person

in legal transactions.
b. Signature verification technique is a non-vision based technique.
For this application, the first approach is to extract the featureor
rather the geometrical feature set representing the signature.

With these feature sets, we have to train the neural networks


d
using an efficient neural network algorithm.
This trained neural network will classify the signature as being

genuine or forged under the


verification stage.
4. Human face recognition
a. It is one of the biometric methods to identify the given face
Machine Learning Techniques 4-5 L (CS/TT-Sem-5)

b. It is a typical task because of the characterization of "non-face"

images.
CHowever,if a neural network is well trained, then it can be divided
elasses namely images having faces and images that do not
into two
have faces.

gue 4.5. Explain different types of neuron connection with


architecture.
Answer
Different types of neuron connection are
1. Single-layer feed forward network:
a. In this type of network, we have only two layers i.e., input layer
and output layer but input layer does not count because no
computation is performed in this layer.
b. Output layer is formed when different weights are applied on input
nodes and the cumulative effect per node is taken.

After this the neurons collectively give the output layer to compute
the output signals.

Input layer Output layer


W11

12
2

nm

forward network:
2 Multilayer feed which is internal to the network and
a. This layer has hidden layer
external layer.
has no direct contact with the
hidden layers enables the network to be
b. Existence of one or more

computationally stronger.
feedback connections in which outputs of the model
There are no

are fed back into itself.


4-6L (CS/IT-Sem-5) Artificial Neural Network & Deep Learning

Tnput layer Hidden layer Output layer

nm

3 Single node with its own feedback


a. When outputs can be directed back as inputs to the same layer or
preceding layer nodes, then it results in feedback networks.
b. Recurrent networks arefeedback networks with closed loop.
Fig. 4.5.1 shows a single recurrent network having single neuron
with feedback to itself.

Input Output

Feedback
Pig. 4.5.1

4 Single-layer recurrent network :


a This network is single layer network with feedback connection in
which processing element's output can be directed back to itself or
to other processing element or both.
b. Recurrent neural network is a class of artificial neural network
where connections between nodes form a directed graph along a
sequence.
c. This allows it to exhibit dynamic temporal behaviour for a time
sequence. Unlike feed forward neural networks, RNNs can use
their internal state (memory) to process sequences of inputs.
Machine Learning Techniques 4-7 L(CS/TT-Sem-5)

11

22

m
5 Multilayer recurrent network:

a. In this type of network, processing


element output can be directed
to the processing element in the same layer and in the preceding

multilayer recurrent network.


layer forming a

of a with
b. They perform the same task for every element sequence,
the output being depended on the previous computations. Inputs
are not needed at each time step.
neural network is its
c . T h e main feature of a multilayer recurrent
hidden state, which captures information about a sequence.

*1 1

W22

Wn
m

benefits of artificial neural network.


Que 46.Discuss the
Answer

1. Artificial neural networks a r e flexible and adaptive.


2. Artificial neural networks a r e used in sequence
and pattern recognition
etc.
systems, data processing, robotics, modeling,
to internal
3. ANN acquires knowledge from their surroundings by adapting
which
and external parameters and they solve complex problems
are

difficult to manage.
48L(CSIT-Sem-5) Artificial Neural Network &
Deep Learningk
4. It generalizes knowledge to
produce adequate
tuations. responses to unknown
5. Artificial neural networks are flexible and have the
generalize and adapts to situations based on its ability to learn,
6 This function allows the findings.
network to
efficiently acquire knowledge by
learning. This is a distinct advantage over
that is inadequate when it
comes to
a
traditionally linear network
An
modelling non-linear data.
artificial neuron network is
traditional network. Without thecapable
of greater fault tolerance
than a
loss of stored data, the network is able
to
regenerate a fault in any of its
components.
& An artificial neuron network is based on adaptive learning.
Que 4.7.Write short note on
gradient descent.
Answer
1. Gradient descent is an optimization
function by iteratively algorithm used to minimize some
moving in the direction of steepest descent as
defined by the negative of the
gradient.
2 Agradient is the slope of a function, the degree of
with the amount of
change in another
change ofa parameter
parameter.
3.
Mathematically, it can be described as the partial derivatives of a set of
parameters with respect to its inputs. The more the
the slope. gradient, the steeper
4 Gradient Descent is a convex function.
5. Gradient Descent can be described as an iterative method which is used
to find the values of the
parameters of a function that minimizes the
cost function as much as
possible.
The parameters are initially defined a particular value and from that,
Gradient Descent run in an iterative fashion to find the
of the parameters, using calculus, to find the optimal values
minimum possible value of
the given cost funetion.

Que 48.Explain different types of gradient descent.


Answer
Different types of gradient descent are:
L Batch gradient descent:
a This is a type of gradient descent which
processes all the training
examples for each iteration of gradient descent.
When the number of training examples is
large, then bateh gradient
descent is computationally very expensive. So, it is not
Instead,
preferred.
we
prefer to use stochastic gradient descent or
mini-batch gradient descent.
Machine Learning T'echniques 4-9L (CS/IT-Sem-5)
-

2 Stochastic gradient descent:


a This is a type of gradient descent which processes single training
example per iteration.
b. Hence, the parameters are being updated even after one iteration
in which only a single example has been processed.
Hence, this is faster than batch gradient descent. When the number
of is
training examples large, even then it processes only one
example which can be additional overhead for the system as the
number ofiterations will be large.
3 Mini-batch gradient descent:
a. This is a mixture of both stochastic and batch gradient descent.
b. The training set is divided into multiple groups called batches.
C. Each batch has a number of training samples in it.
d At a time, a single batch is passed through the network whiech
computes the loss of every sample in the batch and uses their
average to update the parameters of the neural network.

Que 4.9. What are the advantages and disadvantages of batch


gradient descent ?

Answer
Advantages of batch gradient descent:
1. Less oscillations and noisy steps taken towards the global minima of the
loss function due to updating the parameters by computing the average
of all the training samples rather than the value of a single sample.
2 It can benefit from the vectorization which increases the speed of
processing al training samples together.
It produces a more stable gradient descent convergence and stable error

gradient than stochastic gradient descent.


4. It is computationally efficient as all computer resources are not being
used to process a single sample rather are being used for all training

samples.
Disadvantages of batch gradient descent:
a local minima and unlike
1 Sometimes a stable e r r o r gradient can lead to
stochastic gradient descent no noisy steps are there to help to get out of
the local minima.
2. The entire training set can be too large to process in the memory due to
which additional memory might be needed.
3. Depending on computer resources it can take too long for processing all
the training samples as a batch.

Que 4.10.What are the advantages and disadvantages of


stochastic gradient descent ?7
4-10 L (CS/IT-Sem-5) Artificial Neural Network & Deep Learning

Answer
Advantages of stochastic gradient descent:
1 . l t is easier to fit into memory due to a single training sample being
processed by the network.
2. t is computationally fast as only one sample is processed at a time.

3. For larger datasets it can converge faster as it causes updates to the


parameters more frequently.

4Due to frequent updates the steps taken towards the minima of the loss
function have oscillations which can help getting out of local minimums
oflocal loss function (in
theminimum). the computed position turns out to be the
case

Disadvantages of stochasticgradient descent:


1. Due to frequent updates the steps taken towards the minima are very
noisy. This can often lead the gradient descent into other directions.
2. Also, due to noisy steps it may take longer to achieve convergence to the
minima ofthe loss function.
3. Prequent updates are computationally expensive due to using all
resources for processing one training sample at a time.
4 It loses the advantage of vectorized operations as it deals with only a
single example at a time.

Que 4.11.| Explain delta rule. Explain generalized delta learning


rule (error backpropagation learning rule).
Answer
Delta rule:
1. The delta rule is specialized version of backpropagation's learning rule
that uses single layer neural networks.
2. It calculates the error between calculated output and sample output
data, and uses this to create a modification to the weights, thus
implementing a form of gradient descent.
Generalized delta learning rule (Error backpropagation learning):
In generalized delta learning rule (error backpropagation learning). We
the
are
given training set:
,y),, *,y*)
wherer and y e R,k =1, . K.

Step 1:n> 0, Ema 0 are chosen.


Step 2: Weights w are initialized at small random values, k = 1, and the
running error E is set to 0.
Machine Learning Techniques 4-11 L(CS/IT-Sem-5)

Step 3: Input a* is presented, x: =r*y:=y*, and output O is computed


as

O
1+ exp(- WO)
where O, is the output vector of the hidden layer

1+ exp(- W*)

Step 4:Weights ofthe output unit are updated


W: = W+ n o

where8 = ( y - O 0 ( 1 - 0)

Step 5: Weights of the hidden units are updated


l 1, L
u+nöw,0{1 -0,x, =

..
the present error
Step 6: Cumulative cycle error is computed by adding
to E
E:= E+ 1/2(y - 0)
k+ 1 and we continue the training by going
Step 7: Ifk <
Kthen k=
back to step 2, otherwise we go to step 8.
For E terminate the
Step 8: The training cycle is completed. <
Emax
then E: 0, k := 1l and we initiate a new
If E> Ema
=

training session.
training cycle by going
back to step 3.

PART-22
Multilayer Network, Derivation of Back Propagation Algorithm,
Generalization.

Questions-Answers

Answer Type Questions


Long Answer Type and Medium

Write short note backpropagation algorithm.


Que 4.12. on

Answer
in the training of feedforward
1. Backpropagation is an algorithm used
neural networks for supervised learning.
the gradient of the loss function
2. Backpropagation efficiently computes
with respect to the weights of the network for a single input-output

example.
for training multi-layer
3. This makes it feasible to use gradient methods we use descent
los8, gradient
networks, updating weights to minimize
descent.
or variants such as stochastie gradient
4-12 L (CS/IT-Sem-5)
Artificial Neural Network & Deep Learning
4. The back propagation algorithm works by computing the gradient of the
loss function with
respect each weight by the chain rule,
to
backwards one layer at a time from the last iteratin8
calculations of intermediate terms in the layer to avoid redundant
chain rule; this is an example
of dynamie programming
5. The term
the
backpropagation refers only to the
algorithm computing for
gradient, but it is often used loosely to refer to the entire
learning
algorithm, also including how the gradient is used, such as by stochastic
gradient descent.
6.
Backpropagation generalizes the gradient computation in the delta rule,
which is the single-layer version of backpropagation, and is in turn
generalized by automatic differentiation, where backpropagation is a
special case of reverse accumulation (reverse mode).

Que 4.13.Explain perceptron with single flow graph.


Answer
1 The perceptron is the simplest form of a neural network used for
classification of patterns said to be linearly
separable.
2 It consists of a single neuron with adjustable synaptic weights and bias.
3. The perceptron build around a single neuron is limited for performing
pattern classification with only two classes.
4. By expanding the output layer of perceptron to include more than one
neuron, more than two classes can be classified.

5. Suppose, a perceptron have synaptic weights denoted by w,, w2, Wg


6. The input applied to the denoted by x, X2
perceptron are . .

Im
7. The externally applied bias is denoted by b.

Bias b

V Output
Inputs Hand

limiter

*mO
Pig A131. Signal low greph of the pereptron,
8. From the model, we find that the hard limiter input or induced local field
of the neuron as

V-ux+b
4-13 L (CS/IT-Sem-5)
Machine Learning Techniques
9. The goal of the perceptron is to correctly classify the set of externally
applied input r , X2, . i n t o one of two classes G, and G
+1 then assign the
10. The decision rule for classification is that if outputy is
G, else y is -1 then
point represented by input x, X2 *'m to class
assign to class G
11. In Fig. 4.13.2, if a point (r.x,) lies below the boundary lines is assigned
to class G, and above the line is assigned to class G,. Decision boundary
is calculated as

W+wt2 +b =0
Decision boundary
WX+W2X2+ b=0
Glass G2 lass G

Fig. 4.13.2.
defined as :
12. There are two decision regions separated by a hyperplane

of the perceptron can be adapted


The synaptic weights w,, W2
on a n iteration basis.
iteration by
error-correction rule known as perceptron
13. For the adaption, an

algorithm is used.
convergence
and G, must be
14. For a perceptron to function properly, the two classes G,
linearly separable.
the pattern or set of inputs to be classified
15. Linearly separable means,
straight line.
must be separated by a

in n-dimensional space a r e linearly separable


16. Generalizing, a set of points
-1) dimensions that separates the sets.
there is a hyperplane of (n
if
Gclass

G2class 2 Class

(b) A pair of non-linearly


(a) A pair of linearly
separable patterns separable patterns
Fig. 4.13.3.
4-14 L (Cs/IT-Sem-5) Artificial Neural Network & Deep Learning

Que 4.14.| State and prove perceptron convergence theorem.

Answer
Statement: The Perceptron convergence theorem states that for any data
set which is linearly separable the Perceptron
learning rule is guaranteed to
find a solution in a finite number of steps.
Proof:
1. To derive theerror-correction learning algorithm for the perceptron.
2. The perceptron convergence theorem used the
synaptic weights w,, w
o f the perceptron can be adapted on an iteration by iteration
basis.
The bias b(n) is treated as a
synaptic weight driven by fixed input equal
to +11.
xn) l+ 1,
=
x, (n), xln), n)
Where denotes the iteration step in applying the
n
algorithm.
4. Correspondingly, we define the weight vector as
wn) [bn),
=
w,n), w(n). ,n)
Accordingly, the linear combiner output is written in the compact form:

vn) = 2w, n) x,(n) = w"n) xin)

The algorithm for adapting the weight vector is stated as:


If the nth member of input set x(n), is correctly classified into linearly
separable classes, by the weight vector w{n) (that is output is correct)
then no adjustment of weights are done.
wn + ) = wln)

i fw' x(n)>0 and x(n) belongs to class G.


wn + 1) = wln)

if w' x{n)<0 and x{n) belongs to class


Gg
2 0therwise, the weight vector of the perceptron is updated in accordance
with the rule:
w n + 1) = w{n) - q(n) xln)

ifw'n)x n)> 0 and x(n) belongs to class G


wn + 1)= wln) - ntr) xtn)

ifwn) xn) s0 and xn) belongs to class G


where nfn) is the learning-rate parameter for controlling the adjustment
applied to the weight vector at iteration n.
Also small a leads to slow learning and large a leads to fast learning. For
a constant a, the learning algorithm is termed as fixed increment
algorithm.
4-15 L (CS/IT-Sem-5)
Machine Learning Techniques

its arehitecture
Que 4.15. Explain multilayer perceptron with
and characteristics.

Answer
Multilayer
1.
perceptron
The perceptrons which are arranged in layers are called multilayer

perceptron. This model has three layers : an input layer, output layer
and hidden layer.
the linear transfer function used
For the perceptrons in the input layer,
and for the perceptron in the hidden layer and output layer, the sigmoidal
or squashed-S function is used.
the network ina forward direction.
3. The input signal propagates through
bias b{n) is treated
4. On alayer by layer basis, in the multilayer perceptron
as a synaptic weight driven by fixed input equal to +1.
xn) = +1, X, 7), xzn), .. X,n)

where n denotes the iteration step in applying the algorithm.

Correspondingly, we define the weight vector as :

wn) = [b{n), w,in), w,n).. n)


is written in the compact form
5. Accordingly, the linear combiner output

x(n)
2w, (n)x, (n)= w'(n)
x
V(n) =

i-0

the weight vector is stated a s :


The algorithm for adapting
is correctly classified into linearly
1 If the nth number of input set x(n), is correct)
vector w{n) (that is output
separable classes, by the weight
are done.
then no adjustment of weights
wn + 1) = w{n)

x{n) belongs to class G.


if w'r(n) >0 and
wn + 1)= w{n)
if w'x(n) s0 and x(n) belongs to class G,
in accordance
vector of the perceptron is updated
Otherwise, the weight
with the rule.

Architecture of multilayer perceptron:


with two
. Fig. 4.15.1 shows architectural graph of multilayer perceptron
hidden layer and an output layer
in a forward direction,
Signal flow through the network progresses
from the left to right and on a layer-by-layer basis.
identified in this network
3. Two kinds of signals are
a. Functional signals : Functional signal is an input signal and
the output end of the network
and emerges at
propagates forward
as an output signal.
4-16 L (CS/AT-Sem-5) Artificial Neural Network & Deep Learning
p Learning
b. Error signals: Error signal originates at an output neuron and
propagates backward through the network.

Input Output
signal signal

Output layer
Input layer First hidden Second hidden
layer layer
Fig. 4.15.1.
Multilayer perceptrons have been applied successfully to solve some
difficult and diverse problems by training them in a supervised manner
with highly popular algorithm known as the error backpropagation
algorithm.
Characteristics of multilayer perceptron:
1. In this model, each neuron in the network includes a non-linear
activation function (non-linearity is smooth). Most commonly used
non-linear function is defined by:

1+ exp(-v,)
where u, is the induced local field (i.e., the sum of all weights and bia5)
and y is the output of neuronj.
2. The network contains hidden neurons that are not a part of input or

of enabled network to
of
the network. Hidden layer neurons
output
learn complex tasks.

3. The network exhibits a high degree of connectivity.


effect the backpropagation
Que 4.16.| How tuning parameters
neural network?

Answer
network:
the backpropagation neural
Effect of tuning parameters of
1. Momentum factor:
the values
The momentum factor has a significant role in deciding
a.
of learning rate that will produce rapid
learning.
weights biases.
the size of change in
or
b. It determines
4-17L (CS/IT-Sem-6)
Machine Learning Techniques

If momentum factor is zero, the smoothening is minimum and the

entire weight adjustment comes from the newly ralculated change.


d. l f momentum factor is one, new adjustment is ignored and previóus

one is repeated.
Between 0 and 1 is a region where the weight adjustment is
e.
smoothened by an amount proportional to the momentum tactor.

f. momentum factor effectively increases the speed of learning


The
without leading to oscillations and filters out high frequency
variations of the error surface in the weight space.

2 Learning coefficient:
a. A formula to select learning coefficient is

n
1.5
N N , . . + N,)

Where N, is the number of patterns of type 1 and m is the number


of different pattern types.
b. The small value of learning coefficient less than 0.2 produces slower
but stable training.
C. The largest value of learning coefticient i.e., greater than 0.5, the
weights are changed drastically but this may cause optimum
combination of weights to be overshot resulting in oscillations about

the optimum.
d The optimum value of learning rate is 0.6 whieh produce fast

learning without leading to oscillations.


3. Sigmoidal gain:
of
a If sigmoidal function is selected, the input-output relationship
the neuron can be set as

...(4. 16.1)

where . is a scaling factor known as sigmoidal gain.


b. As the scaling factor increases, the input-output characteristic of
the analog neuron approaches that of the two state neuron or the

activation function approaches the (Satisifiability) function.


c. It also affects the backpropagation. To get graded output, as the
sigmoidal gain factor is increased, learning rate and momentum
factor have to be decreased in order to prevent oscillations.

4. Threshold value:
a. 0in eq. (4.16.1) is called as threshold value or the bias or the noise
factor.
4-18 L (CS/AT-Sem-5) Artificial Neural Network & Deep Learning

b. A neuron fires or generates an output if the weighted sum of the


input exceeds the threshold value.

c. One method is to simply assigna small value to it and not to change


it during training.
The other method is to initially cho0se some random values and
change them during training.

Que 4.17.| Discuss seleetion of various parameters in


Backpropagation Neural Network (BPN).

Answer
Selection of various parameters in BPN:
1. Number of hidden nodes:
a The guiding criterion is to select the minimum nodes in the first
and third layer, so that the memory demand for storing the weights
can be kept minimum.
b. The number of separable regions in the input space M, is a function

ofthe number of hidden nodes H in BPN and H M-1.=

c. When the number of hidden nodes is equal to the number of training


patterns, the learning could be fastest.
d n such cases, BPN simply remembers training patterns losing all
generalization capabilities.
e. Hence, as far as generalization is concerned, the number of hidden
nodes should be small compared to the number of training
patterns
with help of Vapnik Chervonenkis dimension (VCdim) of probability
theors
We can estimate the selection of number of hidden nodes for a
of
given number trainingwhere
equal to* 2 + l 2 *
patterns as number of weightswhich
7, and I, denote input and output
is

nodes and l, denote hidden nodes.


Assume the training samples T to be greater than VCdim. Now if
we accept the ratio 10:

10
T 1
10T
,+1)
Which yields the value for
I
2 Momentum coefficient a:
a. To reduce the training time we use the momentum factor because
it enhances the training process.
b. The influences of momentum on weight change is
Machine Learning T'echniques 4-19L (CS/IT-Sem-5)

OE
AW = -

n alAW"
OW
The momentum also overcomes the effect of local minima.

d. The use of momentum term will carry a weight change process


through one or local minima and get it into global minima.

(Weight change
without momentum)
W
[AW]®/ a[AW]"
la Wn
(Momentum term)
Fig. 4.17.1. Influence of momentum term on weight change.
3 Sigmoidal gain
a. When the weights become large and force the neuron to operate in
a region where sigmoidal function is very flat, a better method of
coping with network paralysis is to adjust the sigmoidal gain.
b. By decreasing this scaling factor, we effectively spread out sigmoidal
function on wide range so that training proceeds faster.
4 Local minima:
a One of the most practical solutions involves the introduction of a
shock which changes all weights by specific or random amounts.
b. If this fails, then the most practical solution is to rerandomize the
weights and start the training all over.

|PART-33]
Unsperuised Learning, SOM Algorithm and its Variants

Questions-Answers
Long Answer 1ype and Medium Answer Type Questions

Que 4.18. Write short note on unsupervised learning.

Answer
1. Unsupervised learning is the training of machine using information
that is neither classified nor labeled and allowing the algorithm to act on
that information without guidance.
& Deep Learning
4-20 L (CS/AT-Sem-5) Artificial Neural Network

to
2. Here the task of machine is to group
unsorted information according
of data.
Similarities, patterns and differences
without any prior training
no training
no teacher is provided that m e a n s
3. Unlike supervised learning,
will be given to the machine.
in unlabeled
4. Therefore machine is restricted to find the hidden structure
data by our-self.

of
unsupervised learning into two categories
Que 4.19.| Classify
algorithm.

Answer
learning algorithm into two categories
:
Classification of unsupervised
discover the
1. Clustering: A clustering problem is where we want to
inherent groupings in the data, such as grouping customers by

purchasing behavior.

problem is where we want


2 Association: An association rule learning
to discover rules that describe large portions of our data, such as people

that buy X also tend to buy Y.

?
Que 4.20.What are the applications of unsupervised learning

Answer
Following a r e the application of unsupervised learning:
the dataset into groups base
1. Unsupervised learning automatically split
on their similarities.

detection discover unusual data points in our dataset. It is


2. Anomaly can

useful for finding fraudulent transactions.


in
3. Association mining identifies sets of items which often occur together
our dataset.

4. Latent variable models are widely used for data preprocessing. Like
the number of features in a dataset or decomposing the dataset
reducing
into multiple components.

?
Que 4.21.What is Self-Organizing Map (SOM)
Answer
1. Self-Organizing Map (SOM) provides a data visualization technique which
helps to understand high dimensional data by reducing the dimensions
of data to a map.
2. SOM also represents clustering concept by grouping similar data together.
4-21 L(CS/AT-Sem-5)
Machine Learning Techniques
Feature Map (SOFM)
o r Self-Organizing
3. A self-Organizing Map (SOM) trained using
Network (ANN) that is
is a type of Artificial Neural
unsupervised learning to produce a low-dimensional (typically two-
of the training
dimensional), discretized representation of the input space
therefore a method to do dimensionality
called a map, and is
samples,
reduction.
as
other artificial neural networks
4. Self-organizing maps differ from
they apply competitive learning as opposed to error-correction learning
and in the sense that
(such as backpropagation with gradient descent),
neighborhood function to preserve the topological properties
they use a

of the input space.

Write the steps used in SOM algorithm.


Que 4.22.
Answer
Following are the steps used in SOM algorithm:
Each node's weights are initialized.
1
2. A vector is chosen at random from the set of training data.
a r e most like
3. Every node is examined to calculate which one's weights
known as the Best
the input vector. The winning node is commonly
Matching Unit (BMU).
calculated. The amount of
4. Then the neighbourhood of the BMU is
decreases over time.
neighbors
5. The winning weight is rewarded with becoming more like the sample
vector. The neighbours also become more like the sample vector. The
closer a node is to the BMU, the more its weights get altered and the
farther away the neighbor is from the BMU, the less it learns.

6. Repeat step 2 for N iterations.

Que 4.23. What are the basic processes used in SOM? Also explain
stages of SOM algorithm.

Answer
Basics processes used in SOM algorithm are
1. Initilization: All the connection weights are initialized with small
random values.

2 Competition : For each input pattern, the neurons compute their


respective values ofa discriminant function which provides the basis for
competition. The particular neuron with the smallest value of the
discriminant function is declared the winner.
-22 L (CS/TT-Sem-5) Artificial Neural Network & Deep Learning

3. Cooperation: The winning neuron determines the spatial location of


a topological neighbourhood of exeited neurons, thereby providing the
basis for cooperation among neighbouring neurons.

Adaptation: The excited neurons decrease their individual values of


the discriminant function in relation to the input pattern through suitable
adjustment of the associated connection weights, such that the response
of the winning neuron to the subsequent application of a similar input
pattern is enhanced.

Stages of SOM algorithm are:


1. Initialization: Choose random values for the initial weight vectors u
2 Sampling: Draw a sample training input vector x from the input space.
3 Matehing: Find the winning neuron (x) that has weight vector closest

to the input vector, i.e., the minimum value of d(x) = >(x, - w,

4 Updating: Apply the weight update equation


Du ht) T, n . ) (x, - u)

where T e() is a Gaussian neighbourhood and h{t) is the learning


rate.
5 Continuation : Keep returning to step 2 until the feature map stops
changing.

PART-4|
Deep Learning, Introduction, Concept of Convolutional Neural
Network, Types of Layers, (Convolutional Layers, Activation
Function, Pooling, Fully Connected)
Questions-Answers
Long Answer Type and Medium Answer Type Questions

Que 4.24.| What do you understand by deep learning ?

Answer
1. Deep learning is the subfield of artificial intelligence that focuses on
creating large neural network models that are capable of making
accurate data-driven decisions.
Machine Learning Techniques 4-23 L (CS/AT-Sem-5)

2. Deep learning is used where the data is complex and has large datasets.
3. Facebook uses deep learning to analyze text in online conversations.
search and machine
Google and Microsoft all use deep learning for image
translation.
All modern smart phones have deep learning systems running on them.
For example, deep learning is the standard technology for speech
recognition, and also for face detection on digital cameras.

5. In the healthcare sector, deep learning is used to process medical images


(A-rays, CT, and MRI scans) and diagn0se health conditions.

6. Deeplearning is also at the core of self-driving cars, where it is used for


localization and mapping, motion planning and steering, and environment
perception, as well a acking driver state.

Que 425. Describe different architecture of deep learning.

Answer
Different architecture of deep learning are:
1. Deep Neural Network: It is a neural network with a certain level of
complexity (having multiple hidden layers in between input and output
are capable of modeling and processing non-linear
layers). They
relationships.
2 Deep Belief Network (DBN):It is a class of Deep Neural Network. It
is multi-layer belief networks. Steps for performing DBN are:
a. Learn a layer of features from visible units using Contrastive
Divergence algorithm.
b. Treat activations of previously trained features as visible units and
then learn features of features.
c. Finaly, the whole DBN is trained when the learning for the final
hidden layer is achieved.
3. Recurrent (perform same task for every element of a sequence)
Neural Network : Allows for parallel and sequential computation.
Similar to the human brain (large feedback network of connected
neurons).They are able to remember important things about the input
they received and hence enable them to be more precise.

Que 4.26. What are the advantages, disadvantages and limitation


of deep learning ?
4-24 L (CSIT-Sem-5) Artificial Neural Network & Deep Learning

Answer

Advantages of deep learning:


1. Best in-class pertformance on
problems.
2 Reduces need for feature
engineering.
3. Eliminates unnecessary
costs.
4 Identifies defects easily that are difficult to detect.
Disadvantages of deep learning:
1. Large amount of data
required.
Computationally expensive to train.
3 No strong theoretical foundation.
Limitations of deep learning:
1. Learning through observations only.
2 The issue of biases.

gue 427. What are the various applications of deep learning?

Answer
Following are the application of deep learning
L Automatic text generation: Corpus of text is learned and from this
model new text is generated, word-by-word or
character-by-character.
Then this model is capable of learning how to spell, punctuate, form
sentences, or it may even capture the style.
2 Healthcare: Helps in diagnosing various diseases and treating it.
3 Automatic machine translation: Certain words, sentences or
phrases in one language is transformed into another language (DeepP
Learning is achieving top results in the areas of text, images).
Image recognition : Recognizes and identifies peoples and objects in
images as well as to understand content and context. This area is already
being used in Gaming, Retail, Tourism, etc.
5 Predicting earthquakes: Teaches a computer to perform viscoelastic
computations which are used in predicting earthquakes.

Que 4.28. Define convolutional networks.


4-25L (CS/IT-Sem-5)
Machine Learning Techniques

Answer
Convolutional networks also known as Convolutional Neural Networks

network for processing data


(CNNs) are specialized kind of neural
a

that has a known, grid-like topology.


Convolutional neural network indicates that the network employs
a
2.
mathematical operation called convolution.

3. Convolution is a specialized kind of linear operation.


convolution
4. Convolutional networks a r e simply neural networks that use
in place of general matrix multiplication in at least one of their layers.

CNNs, (ConvNets), are quite similar to regular neural networks.


from
They still made up of n e u r o n s with weights that can be learned
are
data. Each neuron receives some inputs and performs a dot product.
They still have a loss function on the last fully connected layer.
function a regular neural network
They can still use a non-linearity
and passes through a series of
receives input data as a single vector
hidden layers.

output layer

input layer
hidden layer 1 hidden layer 2

Fig. 4.28.1.A regular three-layer neural network


9. Every hidden layer consists of neurons, wherein every neuron is fully
connected to all the other neurons in the previous layer.

10. Within a single layer, each neuron is completely independent and they
do not share any connections.

11. The fully connected layer, (the output layer), contains class scores in the
case of an image classification problem. There are three main layers in
a simple ConvNet.
4-26 L (CSAT-Sem-5)
Artificial Neural Network & Deep Learning
Que 4.29.Write short note on convolutional layer.
Answer
1. Convolutional layers are the
neural networks.
major building blocks used in convolutional
2 A convolution is the simple application of a filter to an input that results
in an activation.

3. Repeated appliçation of the same filter to an


activations called a feature input results in a
map of
map, indicating the locations and
a detected feature in strength of
input, such as an image.
an

The innovation of convolutional neural networks is


the
automatically learn a large number of filters in parallel abilityto toa
training dataset under the constraints of a specific specific
problem, such as image classification. predictive modeling
5. The result is highly specific features that can be detected anywhere on
input images.

Que 430. Describe briefly activation function,


pooling and fully
connected layer.

Answer

Activation function:
1. An activation function is a function that is
added into an artificial neural
network in order to help the network learn
complex patterns in the
data.
2. When comparing with a neuron-based model that
is in our brains, the
activation function is at the end
deciding what is to be fired to the next
neuron.

That is exactly what an activation function does in an ANN as well.


It takes in the output signal from the previous cell and converts it into
some form that can be taken as input to the next cell.

Pooling layer:
1. A pooling layer is a new layer added after the convolutional
layer.
Specifically, after a non-linearity (for example ReLU) has been applied
to the feature maps output by a convolutional layer, for
layers in a model may look as follows : example, the

a. Input image
b. Convolutional layer
Machine Learning Techniques 4-27L (CS/IT-Sem-5)

C. Non-linearity
d Pooling layer
2. The addition of a pooling layer after the convolutional layer is a common
pattern used for ordering layers within a convolutional neural network
that may be repeated one or more times in a given model.

3. The pooling layer operates upon each feature map separately to create
a new set of the same number of pooled feature maps.
Fully connected layer:
1 Fuly connected layers are an essential component of Convolutional
Neural Networks (CNNs), which have been proven very successful in
recognizing and classifying images for computer vision.
2. The CNN process begins with convolution and pooling eaking iov
the image into features, and analyzing them independently.
3. The result of this process feeds into a fully connected neural network
structure that drives the final classification decision.

PART-5
Concept of Convolution (1D and 2D) Layers, Training of Network,
Case Study of CNN for eg on Diabetic Retinopathy, Building a
Smart Speaker, Self Deriving Car etc.

Questions-Answers
Long Answer Type and Medium Answer Type Questions

Que 4.31Explain 1D and 2D convolutional neural network.

Answer
1D convolutional neural network:
1. Convolutional Neural Network (CNN) models were developed for image
classification, in which the model accepts a two-dimensional input
representing an image's pixels and color channels, in a process called
feature learning.
4-28 L (CS/IT-Sem-5) Artificial Neural Network & Deep Learning

of data.
This process can be applied to one-dimensional sequences
2 same

3. The model extracts features from sequences


data and maps the internal
features of the sequence
features from a fixed-length
A 1D CNN is very effective for deriving
4 where the
segment of the overall dataset, where it is not so important
feature is located in the segment
Networks work well for
5. 1D Convolutional Neural
a Analysis of a time series of sensor data.
of signal data fixed-length period, for example, an
Analysis over a

audio recording.
Natural Language Processing (NLP), although Recurrent Neural
Networks which leverage Long Short Term Memory (LSTM) cells
are more promising than CNN as they take into account the
proximity of words to create trainable patterns.

2D convolutional neural network


1. In a 2D convolutional network, each pixel within the image is represented
by its x andy position as well as the depth, representing image channels
(red, green, and blue).

2 It moves over the images both horizontally and vertically.

Que 4.32. How we trained a network ? Explain.

Answer
Once a network has been structured for a particular application, that
network is ready to be trained.

2 To start this processthe initial weights are chosen randomly. Then, the
training, or learning begins.
3. There are two approaches to training:
a In supervised training, both the inputs and the outputs are provided.
The network then processes the inputs and compares its resulting
outputs against the desired outputs.
b. Errors are then propagated back through the system, causing the
system to adjust the weights which control the network. This
process occurs over and over as the weights are continually
tweaked.
c. The set of data which enables the training is called the "training
set." During the training of a network the same set of data is
processed many times as the connecticn weizhts are ever refined.
4-29 L (CS/IT-Sem-5)
Machine Lerrning Techniques
In
d The other type of training is called unsupervised training.
but not
unsupervised training. the network is provided with inputs
with desired outputs8.

The system itself must then decide


what features it will use to
e.
to as self-organization
the input data. This is often referred
group
or adaption.

diabetie retinopathy on the basis of deep


Que 4.33.| Describe
learning
Answer
c a u s e s of blindness in the
DiabeticRetinopathy (DR) is one of the major
western world. Increasing life expectancy., indulgent
lifestyles and other
is projected
contributing factors mean the number of people with diabetes
to continue rising.
for DR has been shown to be a

Regular screening of diabetic patients


of their care.
cost-effective and important aspect
to both
The accuracy and timing of this care is of significant importance
of treatment.
the cost and effectiveness
treatment of DR is available; making
4 Ifdcted early enough, effective
this a vital process.
n u m e r o u s features and
5. Classification of DR involves the weighting of
time consuming for clinicians.
the location of such features. This is highly
trained,
6. Computers are able to obtain much quicker classifications once
clinicians in real-time classification.
gTving the ability to aid
active of
7. The efficacy of automated grading for DR has been an area

conclusions.
research in computer imaging with encouraging
the features of DR using
8. Significant work has been done on detecting
machines and k-NN classifiers.
automated methods such a support vector
two class
The majority of these classif+cation techniques arc on
9
classification for DR or no DR.

artificial neural network how we recognize


Que 4.34.| Using
speaker.

Answer

With the technology advancements smart home sector, voice control


in
that can make a real difference in
and automation are key components
people s Iives.
4-30 L (CS/AT-Sem-5) Artificial Neural Network & Deep Learning

2. The voice recognition technology market continues to involve rapidly as


almost all smart home devices are providing speaker recognition
capability today.
3. However, most of them provide cloud-based solutions or use very deep
Neural Networks for speaker recognition task, which are not suitable
models t0 run on smart home devices.

4. Here. we compare relatively small Convolutional Neural Networks


(CNN) and evaluate effectiveness of speaker recognition using these
models on edge devices. In addition, we also apply transfer learning
technique to deal with a problem of limited training data.
5. By developing solution suitable for running inference locally on edge
devices, we eliminate the well-known cloud computing issues, such as
data privacy and network lateney, ete
6. The preliminary results proved that the chosen model adapts the benefit
of computer vision task by using CNN and spectrograms to perform
speaker classification with precision and recall - 84 % in time less than
60 ms on mobile device with Atom Cherry Trail processor.

Que 4.35. Artificial intelligence plays important role in self-driving


car explain.

Answer
1. The rapid development ofthe Internet economy and Artificial Intelligence
(AI) has promoted the progress of self-driving cars.

of self-driving c a r s are
2 The market demand and economic value
more and more enterprises and
increasingly prominent. At present,
scientific research institutions have invested in this field. Google, Tesla,

Apple, Nissan, Audi, General Motors, BMW, Ford, Honda, Toyota,


in the research and
Mercede8, and Volkswagen have participated
of self-driving cars.
development
leaders in self.
3. Google is Internet company, which is one of the
an
in artificial intelligence.
driving cars, based on its solid foundation
cars were tested o n the road.
So
4. In June 2015, two Google self-driving
accumulated more than 3.2 million km of
far, Google vehicles have
to the actual use.
tests, becoming the closest
in the field of self
5. Another company that has made great progress
first company to devote self-driving
driving cars is Tesla. Tesla was the
technology to production.
Machine Learning Techniques 4-31 L (CS/IT-Sem-5)

6. Followed by the Tesla models series, its "auto-pilot" technology has made
major breakthroughs in recent years.
7. Although the Tesla's autopilot technology is only regarded as Level 2
stage by the National Highway Traffic Safety Administration (NHTSA),
Tesla shows us that the car has basically realized automatic
driving
under certain conditions.
5
UNIT
Reinforcement
Learning and
Genetic Algorithmn

CONTENTS
D-ZL to 5-6L
Part-1 Introduction
**********meosnnanaasse.

t0 .
Reinforce ment Learning

6-6L to 5-9L
Part-2 Learning Task, Example ********

of Reinforcement
Learning in Practice

Part-3 : Learning Models for . . 5 - 9 L to 5-13L

Reintorcement (Markov Decision


Process, Q Learning, Q Learning
Function, Q Learning Algorithm),
Application of Reinforcement

LearningS
d-1JL to 5-15L
Part-4 Introduction to Deep. *********** v*sanenee
Q Learning

5-15L to 5-30L
Part-5 : Genetic Algorithm,..
Introduction, Components,
GA Cycle of Reproduction,
Crossover, Mutation,
Genetic Programming,
Models of Evolution and
Learning, Application.

5-1 L (CS/IT-Sem-5)
5-2L (CS/TT-Sem-5) Reinforcement Iearning & Genetic Algorithm

|PART 1
Tntroduction to Reinforcement Learning.

Questions-Answers
Long Answer Type and Medium Answer Type Questions

Que 5.1.Describe reinforcement learning.


Answer
1. Reinforcement learning is the study of how animals and artificial systems
can learn to optimize their behaviour in the face of rewards and
punishments.
2 Reinforcement learning algorithms related to methods of dynamic
programming which is a general approach to optimal control.
3. Reinforcement learning phenomena have been observed in psychological
studies of animal behaviour, and in neurobiological investigations of
neuromodulation and addiction.
The task of reinforcemeent learning is to use observed rewards to learn
an optima poliey for the environment. An optimal poliecy is a policy that
maximizes the expected total reward.

5. Without some feedback about what is good and what is bad, the agent
will have no grounds for deciding which move to make.
6. The agents needs to know that something good has happened when it
wins and that something bad has happened when it loses.

7. This kind of feedback is called a reward or reinforcement.

8. Reinforcement learning is valuable in the field of robotics, where the


tasks to be performed are frequently complex enough to defy encoding
as programs and no training data is available.

9. In many complex domains, reinforcement learning is the only feasible


way to train a program to perform at high levels.
Machine Learning Techniques 5-3 L (CS/IT-Sem-5)

Primary
State (input) reinforcement signal
vector
Environment Critic

Heuristic
reinforcement
Signal
Actions

Learning
system

Fig. 5.1.1. Block diagram of reinforement leaming.

Que 5.2. Differentiate between reinforcement and supervised


learning.
Answer

No. Reinforcement Supervised


earning learning
Reinforcement learning is allIn supervised
about
learning. the
making decisions decision is made on the initial
sequentially. In simple words| input or the input given at the
we
can say that the outputstart.
depends on the state of the
current nput and the next

input depends on the output


ofthe previous input.
2. In reinforcement learning Supervised learning decisions are
decision is dependent. So, we independent of each other so
give labels to sequences of labels are given to each decision.
dependent decisions.
3. Example: Chess game. Example:Object recognition.

9ue53 What is reinforcement learning ? Explain passive


reinforcement learning and active reinforcement learning.

Answer
Reinforcement learning : Refer Q. 5.1, Page 5-2L, Unit-5.
5-4L(CS/IT-Sem-5) Reinforcement Learning & Genetic Algorithm

Passive reinforcement learning


1. In passive learning, the agent's policy T is fixed. In state s, it always
executes the action t(s).

2. Its goal is simply to learn how good the policy is -that is, to learn the
utility function U(s).
3. Fig. 5.3.1 shows a policy for the world and the corresponding utilities.
4. In Fig. 5.3.1(a) the policy happens to be optimal with rewards of
Rs) = - 0.04 in the non-terminal states and no discounting.

5. Passive learning agent does not know the transition model TMs, a, s'),
which specifies the probability of reaching state s' from state s after
doing action a; nor does it know the reward function R(s) which specifies
the reward for each state.
6. The agent executes a set of trials in the environment using its policy T.

7. In each trial, the agent starts in state (1, 1) and experiences a sequence
of state transitions until it reaches one of the terminal states, (4, 2) or
(4, 3).
8. Its percepts supPply both the current state and the reward received in
that state. Typical trials might look like this.

(1, 1a(1.2)a(1.3) 1,2 00(1,3)o2,3 3,3 084,3


(1, 1 ao(1,2) (1,3) (2, 3)ao(8, 3)aou3,2)gu3,3)aou4, 3,
(1, 1)o(2. 1)ao(3, 1 3 ,2) 4,2)

0.812 0.868| 0.918

0.762
0.660-)
0.705 0.655 0.611|0.388
3 4
(a) b)

Fig. 5.3.1. (a) A policy n for the 4x3 world;


(b) The utilities of the states in the 4x3world, given policy

9. Each state percept is subscripted with the reward received. The objectis
to use the information about rewards to learn the expected utility U"ls)
associated with each non-terminal state s.

10. The utility is defined to be the expected sum of (discounted) rewards


obtained if policy 1t is followed:
Machine Learning Techniques 5-5L (CS/IT-Sem-5)

U)= E| ZrR{«,)| n,5-


where y is a discount factor, for the 4 x5 world we set y = 1.
Active reinforcement learning:
1. An active agent must decide what actions to take.
2. First, the agent will need to learn a complete model with outcome
probabilities for all actions, rather thanjust model for the fixed policy.
3. We need to take into account the fact that the agent has a choice of
actions.
The utilities it needs to learn are those defined by the optimal policy,
the Bellman equations
theyobey
US) = RS) + Ymax 2T(s, a, s')U(s')

5. These equations can be solved to obtain the utility function U using the
value iteration or policy iteration
algorithms.
6. A utility
function is optimal for the learned model, the
U agent can
extract an optimal action by one-step look-ahead to maximize the expected
utility.
7. Alternatively, if it uses policy iteration, the optimal policy is already
available, so it should simply execute the action the optimal policy
recommen

Que 5.4. What are the different types of reinforcement learning ?

Explain.
Answer

Types of reinforcement learning:


1 Positive reinforcement learning:
a Positive reinforcement learning 1s defined as when an event, occurs
due to a particular behaviour, increases the strength and the

frequency of the behaviour.

b. In other words, it has a positive effect on the behaviour.

of positive reinforcement learning


are:
C. Advantages
Maximizes performance.
for long period of time.
Sustain change a

learning:
d Disadvantages of positive reinforcement
lead to overload of states which
i. Too much reinforcement can

can diminish the results.


2. Negative reinforcement learning:
reinforcement is defined as strengthening of behaviour
a. Negative
avoided.
negative condition is stopped
or
because a
5-6L (CS/IT-Sem-5) Reinforcement Iearning & Genetic Algorithm

b. Advantages of negative reinforcement learning:


i Increases behaviour.
It provide defiance to minimum standard of performance.
C. Disadvantages of negative reinforcement learning:
It only provides enough to meet up the minimum
behaviour
Que 5.5. What are the elements of reinforcement learning ?

Answer

Elements of reinforcement learning:


1. Policy (7):
a. It defines the behaviour of the agent which action to take in a given
state to maximize the received reward in the long term.
b. It stimulus-response rules or associations.
c. It could be a simple lookup table or function, or need more extensive
computation (for example, search).
d It can be probabilistic.
2 Reward function (r):
a It defines the goal in a reinforcement learning problem, maps a
state or action to a scalar number, the reward (or reinforcement).
b. The RLagent's objective is to maximize the total reward it receives
in the long run.

It defines good and bad events.


d It cannot be altered by the agent but may inform change of policy.
e. It can be probabilistic (expected reward).
3 Value function (V):
a It defines the total amount of reward an agent can expect to
accumulate over the future, starting from that state.
b. A state may yield a low reward but have a high value (or the
opposite). For example, immediate pain/pleasure vs. long term
happiness.
4. Transition model (M):
a. It defines the transitions in the environment action a taken in the
states, will lead to state s
b. It can be probabilistic.

PART-2
Learning Task, Example of Reinforcement Learning in Practice
Machine Learning Techniques
5-7L(CS/IT-Sem-5)

Questions-Answers
Long Answer Type and Medium Answer Type Questions

Que 5.6. Describe briefly learning task used in machine learning.


Answer
1. A machine learning task is the
type of prediction or inference being
made, based on the problem or
question that is being asked, and the
available data.
2. For example, the classification task assigns data to categories, and the
clustering task groups data according to similarity.
Machine learning tasks rely on patterns in the data rather than being
explicitly programmed.
A supervised machine learning taskthat is used to predict which of two
classes (categories) an instance of data belongs to.
5. The input of a classification algorithm is a set of labeled
each label is
examples, where
an
integer of either 0 or 1.
6. The output of a binary classification algorithm is a classifier, which we
can use to predict the class of new unlabeled instances.
7. An unsupervised machine learning task that is used to group instances
of data into clusters that contain similar characteristics.
8. Clustering can also be used to identify relationships ina dataset that we
might not logically derive by browsing or simple observation.
9. The inputs and outputs of a clustering algorithm depend on the
methodology chosen.

Que 5.7.Explain different machine learning task.


Answer
Following are most common machine learning tasks
D a t a preprocessing: Before starting training the models, it is
important to prepare data appropriately. As part of data preprocessing
following is done :
a Data cleaning
b. Handling missing data
2 Exploratory data analysis : data is
Once preprocessed, the
next step
is to perform exploratory data analysis to understand data distribution
and relationship between /within the data.
5-8L(CSIT-Sem-5) Reinforcement Learning & Genetic Algorithm

3 Feature engineering I selection : Feature selection is one of the


critical tasks which would be used when building machine learning
models. Feature selection is important because selecting right features
would not only help build models of higher accuracy but also help achieve
models, reduce overtitung ete.
objectives related to building simpler
with estimation values
Regression: Regression tasks deal of numerical
(continuous variables). Some of the examples include estimation of

housing price, product price, stock price etc.


Classification: Classification task is related with predietinga category
of a data (discrete variables). Most common example is predicting
whether or not an email is spam or not, whether a person is suffering
from a particular disease or not, whether a transaction is traud or not,

etc.
Clustering: Clustering tasks are all about finding natural groupings of
data anda label associated with each of these groupings (clusters).
Some of the common example includes customer segmentation, product
features identification for produet roadmap.
7. Multivariate querying: Multivariate querying is about querying or
finding similar objects.
& Density estimation: Density estimation problems are related with
finding likelihood or frequency of objects.
9. Dimension reduction: Dimension reduction is the process of reducing
the number of random variables under consideration, and can be divided

into feature selection and feature extraction.


10. Model algorithm/ selection: Many atimes, there are multiple models
which are trained using different algorithms. One of the important task
1s to select most optimal models for deploying them in production.

11. Testing and matching: Testing and matching tasks relates to


comparing data sets.

Qque 5.8. xplain reinforeement learning with the help of an

example.

Answer
Reinforcement learning (RL) is learning concerned with how software
agents ought to take actions in an environment in order to maximize
the notion of cumulative reward.

2. The software agent is not told which actions to take, but instead must
discover which actions yield the most reward by trying them.
For example,
Consider the scenario of teaching new tricks to a cat:
1. As cat does not understand English or any other human language, we
cannot tell her directly what to do. Instead, we follow a different strategy.
Machine Learning Techniques 5-9L (CS/IT-Sem-5)

2. We emulate a situation, and the cat tries to respond in many different

ways. If the cat's response


is the desired way, we will give her fish.
3. Now whenever the cat is exposed to the same situation, the cat executes
similar action even more enthusiastically in expectation of getting
more reward (food).

4
That's like learning that cat gets from "what to do" from positive

experiences.
At the same time, the cat also learns what not do when faced with
5.
negative experiences.

Working of reinforcement learning:


1. In this case, the cat is an agent that is exposed to the environment (In
it is example of a state could
house). An be our cat sitting,
this case, your
and we use a specific word in for cat to walk.
2. Our agent reacts by performing an action transition from one "state" to
another "state."
3. For example, the cat goes from sitting to walking.
4. The reaction of an agent is an action, and the policy is a method of
selecting an action given a state in expectation of better outcomes
5. After the transition, they may get a reward or penalty in return.

PART3
Learning Models for Reinforcement (Markov Decision Process, Q
Learning, Q Learning Function, Q Learning Algorithm), Application
of Reinforcement Learning.

Questions-Answers

Long Answer Type and Medium Answer Type Questions

Que 5.9.Describeimportant term used in reinforcement learning


method.

Answer
Following are the terms used in reinforcement learning
Agent: It is an assumed entity which pertorms actions in an environment to
gain some reward.

i. Environment (e) : A scenario that an agent has to face.


. Reward (R) : An immediate return given to an agent when he or she
performs specific action or task.
5-10L (CSAT-Sem-5)
Reinforcement Learning & Genetic Algorithm

returned by the
iii. State (s) : State refers to the current situation
environment.

applies by the agent to decide the next


iv. Poliey (t): It is a strategy which
action based on the current state.

Value (V) : It is expected long-term i urn with discount, as compared

to the short-term reward.


vi. Value Funetion : It specifies the value of a state that is the total
amount of reward. It is an agent which should be expected beginning
from that state.
vii. Model of the environment: This mimics the behavior of the
It helps you to make inferences to be made and also
environment.
determine how the environment will behave.

vii. Model based methods: which


It is method for solving
a
model-based methods.
reinforcement
learning problems use

ix. Qvalue or action value Q):Q value is quite similar to value. The
only difference between the two is that it takes an additional parameter
as a current action.

Que 5.10. Explain approaches used to implement reinforcement


learning algorithm.
Answer
There are three approaches used implement a reinforcement learning algorithm:
L Value-Based:
a In a value-based reinforcement learning method, we should try to
maximize a value function V(s). In this method, the agent is expecting a
long-term return of the current states under policy n.
2 Policy-based:
In a policy-based RL method, we try to come up with such a poliey that
the action performed in every state helps you to gain maximum reward
in the future.
b. Two types of policy-based methods are:
i Deterministie: For any state, the same action is produced by the
policy
ii. Stochastie : Every action has a certain probability, which is
determined by the following equation stochastic policy:
na/s) = PlA = a/S = S
3. Model-Based:
a In this Reinforcement Learning method, we need to create a virtual
model for each environment.
b. The agent learns to perform in that specific environment.
Machine Learning Techniques 5-11 L(CS/AT-Sem-5)

Que 5.11 | Describe learning models of reinforcement learning.

Answer
1. Reinforcement learning is defined by a specific type of problem, and all
its solutions are classed as reinforcement
learning algorithms.
2. In the problem, an agent is supposed to decide the best action to select
based on his current state.
3. When this is
step repeated, the problem is known as a Markov Decision
Process.
4.
A Markov Decision Process (MDP) model contains
a. A State is a set of tokens that represent every state that the agent can
be in.
b. A Model(sometimes called Transition Model) gives an action's effect in
a state. In particular, TYS, a, S') defines a
transition T where being inn
state S and taking an action 'a' takes us to state S'
(S may be andS'
same).

c. An Action A is set of all


possible actions. A(s) defines the set of actions
that can be taken being in state S.
d A Reward isreal-valued reward function. R(s) indicates the reward for
a

simply being in the state S. R(S,a) indicates the reward for being in a
state S and taking an action 'a'. R(S,a,S) indicates the reward
for being
in a state S, taking action'a' and
an
ending up in a state S'.
e. A Policy is
solution a to the Markov Decision Process. A
policy is a
mapping from S to a. It indicates the action'a' to be taken while in state
S.

Que 5.12. What are the application of reinforcement learning and


why we use reinforcement learning ?

Answer
Following are the applications of reinforcement learning:
1L Robotics for industrial automation.
2. Business strategy planning.
3. Machine learning and data
4. It
processin8
helps us to creste training systems that provide custom instruction
and materials accrding to the requirement of students.
5. Aircraft control and robot motion control.
Following are the reasons for using reinforcement learning:
1. Ithelps us to find which situation needs an action.
2. Helps us to discover which action yields the highest reward over the
longer period.
Reinforcement Learning & Genetic Algorithm
5-12L (Cs/AT-Sem-5)

3. Reinforcement Learning also provides the learning agent with a reward

function.
4. It also allows us to figure out the best method for obtaining large rewards.

reinforcement learning ? What are the


Que 5.13. When not to use

challenges of reinforcement learning ?

Answer
We cannot apply reinforcement learning model is all the situation. Following
are the conditions when we should not use reinforcement learning model.
1. When we have enough data to solve the problem with a supervised
learning method.

When the action space is large reinforcement learning is computing

heavy and time-consuming

Challenges we will face while doing reinforcement learning are:

1. Feature/reward design which should be very involved.


2 Param ters may affect the speed of learning.
3. Realistic environments can have partial observability.
Too much reinforcement may lead to an overload of states which can
4
diminish the results.
5. Realistic environments can be non-stationary.

Que 5.14. Explain the term Q-learning.

Answer
1 Q-learning is a model-free reinforcement learning algorithm.
2. -learning is a values-based learning algorithm. Value based algorithms
Bellman
updates the value function based on a n equation (particularly
equation).
3. Whereas the other type, policy-based estimates the value function with
a greedy policy obtained from the last policy improvement.
-learning is an off-policy learner i.e., it learns the value of the optimal
policy independently of the agent's actions.
5. On the other hand, an on-policy learner learns the value of the policy
beingcarried by the agent, including the exploration steps and it will
out
find a policy that is optimal, taking into account the exploration inherent
in the policy.

Que 5.15. Describe -learning algorithm process.


Machine Learning Techniques
5-13 L (CS/IT-Sem-5)

Answer
Step 1: Initialize the Q-table: First the Q-table has to be built. There are
n columns, where n number of actions. There are m rows, where m
=
=

number of states.
In our
exampleGo left, Go right, Go up and Go down and m
n =

Start, Idle, =

Correct path, Wrong path and End. First, lets initialize the value at 0.
Step 2: Choose an action.
Step 3: Perform action: The combination
an
for an
of steps 2 and 3 is
undefined amount of time. These steps run until the time performed
stopped, or when the training loop stopped as defined in the code.
training is

a. First, an action (a) in the state (s) is chosen based on the


that, when the episode
Q-table. Note
initially starts, every Q-value should be 0.
b. Then, update the Q-values for being at the start and moving right using
the Bellman equation.
Step 4: Measure reward: Now we have taken an action and observed an
outcome and reward.
Step 5: Evaluate: We need to update the function Qs, a)
This process is repeated
again and again until the learning is stopped. In this
way the Q-table is been updated and the value functionQ is maximized. Here
the Q returns the
expected future reward of that action at that state.

PART-4
Introduction to Deep Q Learning8

Questions-Answers
Long Answer Type and Medium Answer Type Questions

Que 5.16.| Describe deep Q-learning.

Answer
1. In deep Q-learning, we use a neural network to approximate the Q-
value function.
2. The state is given as the input and the Q-value of all possible actionsi
generated as the output.
3. The comparison between -learning and deepQ-learning is illustrated
below
5-14 L (CS/AT-Sem-5) Reinforcement Learning & Genetic Algorithm

L State H
Q-value
Action

Qlearning
-value action 1
LStateF Q-value action 2

-value action N
Deep Q learning
Fig. 5.16.1
4. On a higher level, Deep Q learning works as such:
Gather and store samples in a replay buffer with current poliey.
i Random sample batches of experiences from the replay buffer
ii Use the sampled experiences to update the Q network.
IV. Repeat 1-3.

Que 5.17. What are the steps involved in deep Q-learning network ?

Answer
Steps involved in reinforcement learning using deep Q-learning networks

1. All the past experience is stored by the user in memory.


2 The next action is determined by the maximum output of the Q-network.
3. The loss function here is mean squared error of the predicted Q-value
and the target Q-value-". This is basically a regression problem.
4. However, we do not know the target or actual value here as we are
dealing with a reintorcement learning problem. Going back to the
Q-value update equation derived from the Bellman equation, we have:

QUs, A)Qs,, A) + ulR,. +7 max Q(s,.)-Qi8, A,))

Que 5.18. Write pseudocode for deep-learning.

Answer
Start with 4,s, a) for all s, a.

Get initial state s

For k =
1,2,... till convergence
Sample action a, get next state s'
Machine Learning 'Techniques 5-15 L (CS/IT-Sem-5)

[f s isterminal:
target = R(s, a, s')
Sample new initial state s'
else target Rs, a, s') + y max@,(s', a')

. 6 , -aV,8, pit, olQ, (s, a) -

targets')"1|la-,
S8

PART-S
Genetic Algorithm, Introduction, Components, GA Cycle of
Reproduction, Crossover, Mutation, Genetic Programming.
Models of Evolution and Learning, Application.

Questions-Answers
Long Answer Type and Medium Answer Type Questions

Que 5.19. Write short note on Genetic algorithm.


Answer
L.enetic algorithms are computerizedsearch and optimization algorithm
based on mechanics of natural genetics and natural selection.

2 These algorithms mimic the principle of natural genetics and natural


selection to construct search and optimization procedure.
3. Genetic algorithms eonvert the design space into genetic space. Design
space is a set of feasible solutions.
4. Genetic algorithms work with a coding of variables.
5. The advantage of working with a coding of variables space is that coding
discretizes the search space even though the function may be continuous.
6. Search space is the space for all possible feasible solutions of particular
problem.
Following are the benefits of Genetic algorithm:
a. They are robust.
b. They provide optimization over large space state.

C.They do not break on slight change in input or presence of noise.


8. Following are the application of Genetic algorithm:
a. Recurrent neural network
5-16 L (CS/AT-Sem-5) Reinforcement Learning & Genetic Algorithm

b. Mutation testing
c. Code breaking

d Filtering and signal processin8


e. Learning fuzzy rule base

Que 5.20. Write procedure ofGenetie algorithm with advantages


and disadvantages.
Answer

Procedure of Genetic algorithm:


1. Generate a set of individuals as the initial population.
2. Use genetic operators such as selection or crOss over.

Apply mutation o r digital r e v e r s e if necessary.


Evaluate the fitness function of the new population.
5. Use the fitness function for determining the best individuals and replace
predefined members from the original population.
6. Iterate steps 2-5 and terminate when some predefined population
threshold is met.

Advantages of genetic algorithm:


Genetic algorithms can be executed in parallel. Hence, genetic algorithms
are faster.

2 solving optimization problems.


It is useful for
Disadvantages of Genetic algorithm:
it depends the
1 Identification of the fitness function is difficult as on

problem.
2. The selection of suitable genetic operators is difficult.

Que 5.21.|Explain different phases of genetic algorithm.

Answer
Different phases of genetic algorithm are:
1. Initial population :
a The process begins with a set of individuals which is called a
population.
b. Each individual is a solution to the problem we want to solve.
c. An individual is characterized by a set of purameters (variables)
known as genes
d Genes are joined into a string to form a chromosome (solution).
e. Ina genetic algorithm, the set of genes ofan individual is represented
using a string.
Machine Learning Techniques 6-17L (CS/IT-Sem-5)

f. Usually, binary values are used (string of ls and Os).

A Gene

DD hromosome

Aoo|i|
Population

2 FA (Factor Analysis) fitness function


a. The fitness function determines how fit an individual is (the ability

ofallindividual to compete with other individual).


b. It gives a fitness score to each individual.
C. The probability that an individual will be selected for reproduction
is based on its fitness score.

3. Selection:
a. The idea of selection phase is to select the fittest individuals and let
them paSs their genes to the next generation.
b. Two pairs of individuals (parents) are selected based on their fitness

Scores.

fitness have more chance to be selected for


c. Individuals with high
reproduction.

4. Crossover:
a. Crossover is the most significant phase in a genetic algorithm.
is chosen at
b. For each pair of parents to be mated, a crossover
point
random from within the genes.

to be 3 as shown:
For example, consider the crossover point

A00o0
A2

Crossover point

d Offspring are created by exchanging the genes of parents among


crossover point is reached.
themselves until the
Reinforcement Learning & Genetic Algorithm
5-18 L (CS/TT-Sem-5)

AIo o 0 o

e. The new offspring are added to the population.

As 00

5 Mutation:
of their genes can be subjected
a When new offspring formed, some

mutation with a low random probability.


to a

in the bit string can be flipped.


b. This implies that some of the bits
Before mutation

As11o00
After mutation

and
Mutation occurs to maintain diversity within the population
C.
prevent premature convergence.

& Termination:
(does
terminates if the population has converged
a The algorithm different from the
which are significantiy
not produce offspring
previous generation).
provided set of
genetic algorithm has
a
it is said that the
b. Then
solutions to our problem.

flowchart of GA and explain the working


Draw
Que 5.22.
a

principle.

Answer
Unit-1.
Genetic algorithm : Refer Q. 1.24, Page 1-23L,
Working principle :
consider unconstrainedd
1. To illustrate the working principle of GA, we

optimizat10n problem.

consider the following maximization problem:


2 Let us

maximize f

xSX, SXfor i=1,2...N,


Machine Learning Techniques 5-19 L(CS/IT-Sem-5)

3. 1f we want to minimize fX), for flX) > 0, then we can write the objective
function as

1
maximize
1+fX)
It fx) < 0 instead of minimizing f\X), maximize ( X ) . Hence, both
maximization and minimization problems can be handled by GA.

Que 5.23. Write short notes on


procedures of GA.
Answer
1. Start: Generate random population of n chromosomes.
Fitness : Evaluate the fitness flx) of each chromosome x in the
population.
3 New population: Create a new population by repeating following
steps until the new population is complete.
. Selection: Select two parent chromosomes from a population
according to their fitness.
b. Crossover: With a crossover probability crossover the parents
to form new offspring (children). If no crossover was performed,
offspring is the exact copy of parents.
Mutation: With a mutation probability mutate new offspring at
each locus (position in chromosome).
Accepting: Place new offspring in the new population.
Replace : Use new generated population for a further run of the
algorithm.
5. Test:lf the end condition is satisfied, stop, and return the best solution
in current population.
6. Go to step 2

Que 5.24.What are the benefits of using GA ? What are its


limitations ?

Answer
Benefits of using GA:
1. It is easy to understand.
2. It is modular and separate from application.
3. It supports multi-objective optimization.

4 It is good for noisy environment.


Limitations of geneticalgorithm are:
1. The problem of identifying fitness function.
Reinforcement Learning & Genetic Algorithm
5-20 L (Cs/IT-Sem-5)

2. Definition of representation for the problem.


Premature convergence occurs.
3.
size of the
of choosing the various parameters like the
4. The problem selection method and its
population, mutation rate, c r o s s o v e r rate, the
strength.
5. Cannot use gradients.
information.
6. Cannot easily incorporate problem specific
7. Not good at identifying local optima.
8. No effective terminator.
functions.
9. Not effective for smooth unimodal

10. Needs to be coupled with a local search technique.

Que 5.25. Write short notes of genetie representations.

Answer
solutionshindividuals in
1. Genetic representation is a way of representing
evolutionary computation methods.

behavior, physical
2. Genetic representation can encode appearance,
qualities of individuals.

3. All the individuals of a population are represented by using binary


encoding, permutational encoding, encoding by tree
standard
4 Genetic algorithms use linear binary representations. The most

is an array of bits.
method
otf representation
of individual
5. These genetic representations a r e convenient because parts
are easily aligned due to their fixed size which makes simple crossover

operation.

Que 5.26. Give the detail of genetic representation (Encoding).


OR
Explain different types of encoding in genetic algorithm.

Answer
Genetic representations:
Pncoding:
Encoding is a process of representing individual genes.
b. The process can be performed using bits, numbers, trees, arrays,
lists or any other objects.
C.The encoding depends mainly on solving the problem.
2 Binary encoding:
a. Binary encoding is the most commonly used method of geneticC
representation because GA uses this type of encoding.
Machine Learning Techniques 6-21 L(CS/AT-Sem-5)

b. In binary encoding, every chromosome is a string of bits, 0 or 1.

Chromosome A| 101100101100101011100101
Chromosome B 111111100000110000011111
chromosomes.
C. Binary encoding gives many possible
3 Octal or Hexadecimal encoding:
a. The encoding is done using octal or hexadecimal numbers

Octal Hexadecimal
Chromosome
B2CAES
Chromosome A 54545345
Chromosome B 77406037 FEOCIF

4 Permutation encoding (real number encoding) :

a. Permutation encoding can be used in ordering problems, such as

Travelling Salesman Problem (TSP).


b. In permutation encoding, every chromosome is a string of numbers,
which represents number in a sequence.

Chromosome A|153264798
Chromosome B8 5 6 723149

5. Value encoding:
a Direct value encoding can be used in problems, where some

are used.
complicated values, such a s real numbers,
is a string of some values.
b. In value encoding, every chromosome
real numbers o r
Values c a n be anything connected to problem,
chars to some complicated objects.
2.4545
Chromosome A|1.2324 5.3243 0.4556 2.3293
Chromosome BABDJEIFJDHDIERJFDLDFLFEGT
Chromosome C((back), (back), (right), (forward), (left)
6 Tree encoding:
or expressions, for
a. Tree encoding is used for evolving programs
genetic programming.
In tree encoding, every chromosome is a tree of some objects,
such as functions or commands in programming language.
LISP is often used to this, because
Prográ. .ming language
represented in this form and can be easily
programs in it are

the and mutation can be done


parsed as a tree, so cross-over

relatively easily.
5-22 L (CS/AT-Sem-5) Reinforcement Learning & Genetic Algorithm

Chromosome A Chromosome B

Do_until

Step Wall
(do_until step wall)
+xU5 y)
Explain different methods of selection in genetie
Que 527.
in order to select a population for next generation.
algorithm
Answer
to c r o s s over a r e :
for parents
The various methods of selecting chromosomes
a Roulette-wheel selection:
i Roulette-wheel selection is the proportionate reproductive method
is selected from the mating pool with a probability
where a string
fitness.
proportional to the
is selected with a probability
i Thus, ith string in the population that string.
to where is the f+tness value for
proportional ,
fixed in Genetic Algorithm,
ii Since the population size is usually kept for the
selected
of each string being
the s u m of the probabilities
mating pool must be one.
selected string is
iv. The probability of the ith

P
size.
where'n'is the population
The average fitness is
.5.27.1)
F F /n
b Boltzmann selection:
Boltzmann selection uses the concept of simulated annealing.
L minimization or
is a method of functional
i Simulated annealing
maximization.
of slow cooling of molten metal
i This method simulates the process
in a minimization problem.
the minimum function value
to achieve temperature
simulated by controlling a
iv. The cooling phenomenon is temperature 7'has its
thermal equilibrium at a
8o that a 8ystem in
to
distributed probabilistically according
energy
(6.27.2)
PE)-expl-
where '*' is Boltzmann constant.
5-23 L (CS/IT-Sem-5)
Machine Learning Techniques
This expression suggests that a system at a high temperature has
almost uniform probability of being at any energy state, but at a low
temperature it has a small probability of being at a high energy

state.
vi. Therefore, by controlling the temperature T and assuming search
process follows Boltzmann probability distribution, the convergence
of the algorithm is controlled.
Tournament selection :
i. GA uses a strategy to select the individuals from population and
insert them into a mating pool.
ii. A sclection strategy in GA is a process that favours the selection of
better individuals in the population for the mating pool.
ii. There are two important issues in the evolution process of genetic
search.
1. Population diversity: Population diversity means that the
the already discovered good individuals are
genes from
exploited.
2 Selective pressure : Selective pressure is the degree to
which the better individuals are favoured.
The higher the selective pressure the better individuals are
favoured.
d Rank selection
i R a n k selection first ranks the population and takes every
chromosome, receives fitness from the ranking.
ii The worst will have fitness 1, the next 2, .., and the best will have
fitness N (N is the number of chromosomes in the population).
ii. The method can lead to slow convergence because the best
chromosome does not differ so much from the other.
e Steady-state selection:
The main idea of the selection is that bigger part of chromosome
should survive to next generation.
i GA works in the following way :
I n every generation a few chromosomes are selected for
creating new off springs.
2. Then, some chromosomes are removed and new offspring is
placed in that place.
3. The rest of population survives a new generation.
Que 5.28.| Differentiate between Roulette-wheel based on fitness
and Roulette-wheel based on rank with suitable example.
5-24 L(CS/AT-Sem-5) Reinforcement Learning & Genetic Algorithm

Answer
Differenee:

Roulette-wheel Roulette-wheel
No.
based on rank
based on fitness
1. Population is selected with a Probability of a population being
probability that is directly selected is based on its fitness

proportional to their fitness rank.


values.
It first sort individuals in the
It computes selection
probabilities according to population according to their
fitness and then computes
their fitness values but do
not sort the individual in the selection probabilities according
to their ranks rather than
population.
fitness values.

the individuals with


3 It gives a chance to all the selectsrank
Ithighest in the population.
individuals in the population
to be selected.
i8
Diversity in the population Diversity in the population
not preserved.
|1s preserved.

Example:
where all chrom0somes in the population
Roulette-wheel
L Imagine a to its fitness
chromosome has its place accordingly
are placed, each
function

Chromosomes 4

Chromosomes3- - Chromosomes l

Chromosomes 2

selection.
Fig. 5.28.1. Roulette- whecl
and pointer
When the wheel is spun,
the wheel will finally stop
2. chromosomes with bigger
the one of
attached to it will points to
fitness value.
fitnessS
roulette-wheel selection based
on

3. The different between


5.28.1 and Fig. 5.28.3.
and rank is shown in Fig.
Machine Learning Techniques 6-25 L (CS/IT-Sem-5)

Chromosomes3 Chromosomes 4

Chromosomes 2

Chromosomes 1
Fig. 6.38.3. Situation before ranking (graph offitnenaos)
Chromosomes 4
Chromosomes 3 - Chromosomes 1

Chromosomes 2
Fig 6.28.. Situation afer ranking (graph of order nurabers)

Que 8.99. Draw geneties eycle for genetie algorithm.

Answer
Generational cyele of GA:
Population
(Chromosomes) Decoded
Offsprings string
New
generation

Genetic Evaluation
Parents
operator (Fitness)

Mate
Manipulation
Reproduction
Selection
(Mating pool)
Fe 520.1. The GAeyele
Components of generational cycle in GA:
L Population (Chromosomes):A population is collection of individuals.
A population consists of a number of individuals being tested, the
phenotype parameters defining the individuals and some information
about search space.
2 Evaluation (Fitness) :A fitness function is a particular type of objective
function that quantifies the optimality of a solution (ie., a chromosome)
5-26 L (CS/AT-Sem-5) Reinforcement Learning & Genetic Algorithm

in a genetie algorithm so that particular chromosome may be ranked


against all the other chromosomes.
Selection : During each successive generation, a proportion of the
existing population is selected to breed a new generation. Individual
solutions are selected through a fitness-based process.
Generic operator: A genetic operator is an operator used in genetic
algorithm to guide the algorithm towards a solution to agiven problem.

Que 5.30. Why mutation is done in genetic algorithm 7 Explain


types of mutation.

Answer
Mutation is done in genetic algorithm because :
It maintains genetic diversity from one generation of a population of
genetic algorithm chromosomes to the next.
2. GA can give better solution of the problem by using mutation.
Types of mutation :
1 Bit string mutation: The mutation of bit strings occurs through bit
flips at random positions.
Example: 1010010

1010110
The probability of a mutation of a bit is 1/1, where is the length of the
binary vector. Thus, a mutation rate of 1 per mutation and individual
selected for mutation is reached.
2 Flipbit: This mutation operator takes the chosen genome and inverts
the bits (i.e., if the genome bit is 1, it is changed to 0 and vice versa).
a Boundary: This mutation operator replaces the genome with either
lower or upper bound randomly. This can be used for integer and float
genes.
Non-uniform: The probability that amount of mutation will go to 0
with the next generation is increased by using non-uniform mutation
operator. Itkeeps the population from stagnating in the early stages of
the evolution.
Uniform: This operator replaces the value of the chosen gene witha
selected between the user-specified upper and
uniform random value
lower bounds for that gene.
& Gaussian : This operator adds a unit Gaussian distributed random
value to the chosen gene. Ifit falls outside of the user-specified lower or
upper bounds for that gene, the new gene value is clipped.
Machine Learning Techniques 5-27 L(CS/AT-Sem-5)

Que 5.31.| What is the main function of crossover operation in


genetic algorithm ?

Answer
1. Crossover is the basic operator of genetic algorithm. Performance of
genetic algorithm depends on crossover operator.
2. Type of crossover operator used for a problem depends on the type of
encoding used.
3. The basic principle of crossover process is to exchange genetic material
of two parents beyond the crossover points.
Function of crossover
operation/operator in genetic algorithm:
1. The main function
of crossover operator is to introduce diversity in the
population.
2. Specific crossover made for a specific problem
of the genetic algorithm.
can
improve pertormance
3. Crossover combines parental solutions to form offspring with a hope
to produce better solutions.
4. Crossover operators are critical in ensuring good mixing of building
blocks
5. Crossover is used to maintain balance between
exploitation and
exploration. The exploitation and exploration techniques are
responsible for the performance of genetic algorithms. Exploitation
means to use the already
existing information to find out the betteer
solution and exploration is to investigate new and unknown solution
in exploration space.

Que 5.32. Discuss the different applications


of genetic algorithms.
Answer
Application of GA:
1. Optimization: Genetic Algorithms are most commonly used in
optimization problems wherein we have to maximize or minimize a
given objective function value under a given set of constraints.
2 Economics: GAs are also used to characterize various economic models
like the cobweb model, game theory equilibrium resolution, asset pricing.
etc.
3. Neural networks : GAs are also used to train neural networks,
particularly recurrent neural networks.
4. Parallelization: GAs also have very good parallel capabilities, and
prove to be very effective means in solving certain problems, and also
provide a good area for research.
Reinforcement Learning & Genetic Algorithm
5-28 L (CSAT-Sem-5)

5. Image processing: GAs are used for various digital image processing
(DIP) Lasks as well like dense pixel matching
6 Machine learning: Genetics based machine learning (GBML) is a
nice area in machine learning.
7. Robot trajectory generation: GAs have been used to plan the path
which a robot arm takes by moving from one point to another.

Que 5.33. Explain optimization of travelling salesman problem


and give suitable example too.
using genetie algorithm a

Answer
1. The TSP consist a number of cities, where each pair of cities has a

corresponding distance.

LStart
Set GA parameters
Generate initial random
population

Evaluate fitness of each


chromosome in the

population

Yes Are optimization


termination New population
criteria met
Best
No
chromosome
Parents selection for next
generation
End
Crossover of
parents chromosome

Mutation of
chromo80me
L
Fig. 5.33.1. Genetic algorithm procedure for TSP
Machine Learning T'echniques 5-29 L (Cs/AT-Sem-5)
2. The aim is to visit all the cities such that the total
distance travelled will
be minimized.
3. A solution, and therefore achromosome which represents that solution
to the TSP, can be given order, that is, a path, of the cities.
as an

4. The procedure for solving TSP can be viewed as a process flow given in
F8. 6.33..

5. The GA process starts by supplying important information such as location


of the city, number of generations,
of crossover and
maximum population size, probability
probability of mutation.
6. An initial random population of chromosomes is generated and the
fitness of each chromosome is evaluated.
7. The population is then transformed into a new population (the next
generation) using three genetic operators: selection, crossover and
mutation.
8. The selection operator is used to choose two parents from the current
generation in order to create a new child by crossover and/or mutation.
The
generation contains higher proportion of the characteristics
new a

possessed by the good members of the previous generation and in this


way good characteristics are spread over the population and mixed with
other good characteristics.

10. After each generation, a new set of chromosomes where the size is
equal to the initial population size is evolved.
11. This transformation process from one generation to the next continues
until the population converges to the optimal solution, which usually
occurs when a certain percentage of the population (for example $0 )
the best individual is taken
has the s a m e optimal chrom0some in which
as the optimal solution.

Que 5.34. Write short notes on convergence of genetic algorithm

Answer
1. Agenetic algorithm is usually said to converge when there is no significant
in the values of fitness of the population from
one
improvement
generation to the next.

2. One criterion for convergence may be such that when a fixed percentage
same, it can be
population matrix becomes the
of
columns and rows in
fixed percentage may be
assumed that convergence is attained. The
80% or 85%.
3. In genetic algorithras as we proceed with more generations, there may
the population fitness and the best
not be much improvement in
individual may not change for subsequent populations.
As the generation progresses, the population gets filled with more fit
individuals
individuals with only slight deviation from the fitness of best
Reinforcement Learning & Genetic Algorithm
5-30 L (CS/AT-Sem-5)

to the fitness of
so far found, and the average fitness
comes very close
the best individuals.
of view.

The convergence criteria can be explained from schema point


6. A schema is a similarity template describing a subset of strings with
a subset of all
A schema represents
Similarnties at
certain positions.
the same bits at certain string positions.
possible strings that have
associate a fitness
Since schema represents a robust of strings, we c a n
7.
the schema.
value with schema, i.e., the average fitness of
a

8. One can visualize GA's search for the optimal strings as a simultaneous
increases the number of their instances in
competition among schema
the population.

You might also like