SYLLABUS:
Introduction to AI - AI Applications - Problem solving agents – search
algorithms – uninformed search strategies – Heuristic search
strategies – Local search and optimization problems – adversarial
search – constraint satisfaction problems (CSP)
PART A
1. What is Artificial Intelligence? Or Define Artificial Intelligence.
Artificial intelligence is a wide-ranging branch of computer science
concerned with building smart machines capable of performing tasks
that typically require human intelligence.
Artificial Intelligence is the process of building intelligent machines from
vast volumes of data.
Systems learn from past data and experience and perform human-
like tasks.
It enhances the speed, precision, and effectiveness of human efforts.
Comparison of Data Science, AI, and ML:
Data Science: includes data operations based on user requirements. It is
mainly used in fraud detection, healthcare, BI analysis, and more.
Artificial Intelligence: includes predictive modeling to predict events
based on the previous and current data. Applications of AI include chat
bots, voice assistants, and weather prediction.
Machine Learning: is a subset of Artificial Intelligence. Online
recommendations, facial recognition, and NLP are a few examples of ML.
Types of Agent
Simple reflex agents respond directly to percepts.
Model-based reflex agents maintain internal state to track aspects of
the world that are not evident in the current percept.
Goal-based agents act to achieve their goals, and
Utility-based agents try to maximize their own expected “happiness.”
17. Define Problem Solving Agent. List the steps involved in Problem
Solving Agent.
Problem Solving Agent
Problem-solving agent is a goal-based agent that focuses on goals
using a group of algorithms and techniques to solve a well-defined
problem.
CS 3491 – ARTIFICIAL INTELLIGENCE AND MACHINE LEARNING UNIT 1
An agent may need to plan a sequence of actions that form a path
to a goal state.
Such an agent is called a problem-solving agent, and the
computational process it undertakes is called search.
Steps performed by Problem-solving agent
Goal Formulation
Problem Formulation
Search
Solution
Execution
19. Define search algorithms and list the types of search algorithm.
A search algorithm takes a search problem as input and returns a
solution.
Types of Search Algorithm
Uninformed Search Algorithms
o Depth First Search
o Depth Limited Search
o Breadth First Search
o Iterative Deepening Search
o Uniform Cost Search
o Bidirectional Search
Informed (Heuristic) Search Strategies
o Greedy Search
o A* Tree Search
20. Define Depth First Search. List the advantages and disadvantages
of DFS.
• Depth-first search (DFS) is an algorithm for traversing or searching
tree or graph data structures.
• The algorithm starts at the root node and explores as far as possible
along each branch before backtracking.
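The traversal described above can be sketched as a short program (an illustrative sketch, not from the notes; the adjacency-list representation and node names are assumed):

```python
def dfs(graph, start, goal):
    """Iterative depth-first search: explores as deep as possible before backtracking."""
    stack = [(start, [start])]          # LIFO stack drives the depth-first order
    visited = set()
    while stack:
        node, path = stack.pop()
        if node == goal:
            return path
        if node in visited:
            continue
        visited.add(node)
        # Reverse so the leftmost neighbour is expanded first.
        for neighbour in reversed(graph.get(node, [])):
            if neighbour not in visited:
                stack.append((neighbour, path + [neighbour]))
    return None

# Assumed example graph
graph = {'S': ['A', 'B'], 'A': ['C'], 'B': ['D'], 'C': [], 'D': ['G'], 'G': []}
path = dfs(graph, 'S', 'G')
```

Because the stack is last-in, first-out, the search commits to one branch until it runs out of successors, which is exactly the backtracking behaviour noted above.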
21. Which solution would DFS find to move from node S to node G if run
on the graph below?
Solution.
22. Define Depth Limited Search. List its advantages and disadvantages.
A depth-limited search algorithm is similar to depth-first search with a
predetermined limit.
Depth-limited search overcomes the drawback of infinite paths in
depth-first search.
Advantages:
Depth-limited search is Memory efficient.
Disadvantages:
Depth-limited search also has a disadvantage of incompleteness.
It may not be optimal if the problem has more than one solution.
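A minimal recursive sketch of depth-limited search, assuming an adjacency-list graph (illustrative, not from the notes). Returning None when the limit is reached is precisely what makes the algorithm incomplete:

```python
def dls(graph, node, goal, limit, path=None):
    """Depth-limited search: DFS that never descends below `limit` levels."""
    if path is None:
        path = [node]
    if node == goal:
        return path
    if limit == 0:
        return None                      # cutoff: solution may lie deeper
    for neighbour in graph.get(node, []):
        result = dls(graph, neighbour, goal, limit - 1, path + [neighbour])
        if result is not None:
            return result
    return None

# Assumed example graph: G lies at depth 3
graph = {'S': ['A', 'B'], 'A': ['C'], 'B': ['D'], 'C': [], 'D': ['G'], 'G': []}
```

With limit 3 the goal is found; with limit 2 the search is cut off and fails, illustrating the incompleteness noted above.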
23. Define Breadth-first search (BFS). List its advantages and disadvantages.
• Breadth-first search (BFS) is an algorithm for traversing or searching
tree or graph data structures.
• It starts at the tree root and explores all of the neighbor nodes at the
present depth prior to moving on to the nodes at the next depth level.
Advantages of Breadth-first Search:
If a solution is available, BFS will provide it.
If there are multiple answers to a problem, BFS will present the simplest
solution with the fewest steps.
Disadvantages of Breadth-first Search:
It necessitates a large amount of memory.
If the solution is located far from the root node, BFS will take a long time.
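The level-by-level expansion can be sketched as follows (adjacency-list representation assumed); the FIFO queue holding every frontier node is what makes BFS memory-hungry:

```python
from collections import deque

def bfs(graph, start, goal):
    """Breadth-first search: returns a shallowest path from start to goal."""
    queue = deque([(start, [start])])    # FIFO queue gives level-order expansion
    visited = {start}
    while queue:
        node, path = queue.popleft()
        if node == goal:
            return path
        for neighbour in graph.get(node, []):
            if neighbour not in visited:
                visited.add(neighbour)
                queue.append((neighbour, path + [neighbour]))
    return None

# Assumed example graph: two routes to G, one shorter than the other
graph = {'S': ['A', 'B'], 'A': ['G'], 'B': ['C'], 'C': ['G'], 'G': []}
```

BFS returns the two-step route S-A-G rather than the longer S-B-C-G, matching the "fewest steps" advantage above.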
24. Which solution would BFS find to move from node S to node G if run on
the graph below?
Solution.
26. Define Uniform Cost Search. List its advantages and disadvantages.
Uniform-cost search is a searching algorithm for traversing a weighted
tree or graph.
When a separate cost is provided for each edge, this algorithm is used.
The uniform-cost search's main purpose is to discover the shortest path
to the goal node with the lowest cumulative cost.
Cost of a node is defined as:
cost(node) = cumulative cost of all nodes from root
cost(root) = 0
Advantages of Uniform-cost Search:
The path with the lowest cost is chosen at each state.
UCS is complete if the branching factor is finite and every step cost is
positive (no zero-weight loops).
UCS is optimal provided that no edge has a negative cost.
Disadvantages of Uniform-cost Search:
It considers only the cost of the path, ignoring how close a node is to
the goal, so it may expand many unnecessary nodes.
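A sketch of uniform-cost search with a priority queue ordered by cumulative path cost (weighted adjacency lists and edge costs assumed for illustration):

```python
import heapq

def ucs(graph, start, goal):
    """Uniform-cost search on a weighted graph {node: [(neighbour, cost), ...]}.
    Returns (total_cost, path) for the cheapest path, or None."""
    frontier = [(0, start, [start])]     # priority queue keyed on path cost g
    explored = set()
    while frontier:
        cost, node, path = heapq.heappop(frontier)
        if node == goal:                 # goal test on pop guarantees optimality
            return cost, path
        if node in explored:
            continue
        explored.add(node)
        for neighbour, step in graph.get(node, []):
            if neighbour not in explored:
                heapq.heappush(frontier, (cost + step, neighbour, path + [neighbour]))
    return None

# Assumed example: S-A-G costs 11, S-B-G costs 6
graph = {'S': [('A', 1), ('B', 5)], 'A': [('G', 10)], 'B': [('G', 1)], 'G': []}
```

Even though A is reached first (cost 1), UCS returns the cheaper overall route through B, the lowest-cumulative-cost behaviour described above.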
27. Which solution would UCS find to move from node S to node G if run on
the graph below?
Solution.
29. Define informed (heuristic) search strategies and mention its types.
Here, the algorithms have information on the goal state, which helps
in more efficient searching.
This information is obtained by something called a heuristic.
Search Heuristics:
In an informed search, a heuristic is a function h(n) that estimates how
close a state is to the goal state.
h(n) = estimated cost of the cheapest path from the state at node n to
a goal state.
30. Find the path from S to G using greedy search. The heuristic value h
of each node is shown below the node's name.
Solution.
Starting from S, we can traverse to A(h=9) or D(h=5). We choose D,
as it has the lower heuristic cost. Now from D, we can move to
B(h=4) or E(h=3). We choose E with a lower heuristic cost. Finally,
from E, we go to G(h=0). This entire traversal is shown in the search
tree below, in blue.
Solution.
Starting from S, the algorithm computes g(x) + h(x) for all nodes in
the fringe at each step, choosing the node with the lowest sum.
The entire work is shown in the table below.
Solution
PART B
1. Define Artificial Intelligence and its types, goals, advantages and limitations
in detail.
ARTIFICIAL INTELLIGENCE:
1.1 Artificial Intelligence Definition
1.2 Types of Artificial Intelligence
1.2.1 Types of Artificial Intelligence-based on capabilities
1.2.2 Types of Artificial Intelligence-based on functionalities
1.3 Four possible goals to pursue in artificial intelligence:
1.3.1 Acting humanly: The Turing test approach
1.3.2 Thinking humanly: The cognitive modeling approach
1.3.3 Thinking rationally: The “laws of thought” approach
1.3.4 Acting rationally: The rational agent approach
1.4 Subfields of Artificial Intelligence
1.5 Advantages of Artificial Intelligence
1.6 Limitations of Artificial Intelligence
Cognitive Science
o Cognitive science is the interdisciplinary study of mind and
intelligence, embracing philosophy, psychology, artificial
intelligence, neuroscience, linguistics, and anthropology.
o Cognitive Computing:
The ultimate goal of cognitive computing is to imitate the human
thought process in a computer model.
Using self-learning algorithms, pattern recognition by neural
networks, and natural language processing, a computer can mimic
the human way of thinking.
Here, computerized models are deployed to simulate the human
cognition process.
o Computer Vision:
Computer vision works by allowing computers to see, recognize, and
process images, the same way as human vision does, and then it
provides an appropriate output.
The computer must understand what it sees, and then analyze it,
accordingly.
o Natural Language Processing:
Natural language processing means developing methods that help us
communicate with machines using natural human languages like
English.
Table 1.2 - Comparison of Data Science, Artificial Intelligence and Machine Learning:
1950:
Alan Turing publishes Computing Machinery and Intelligence. In the
paper, Turing proposes to answer the question 'Can machines
think?' and introduces the Turing Test to determine whether a computer
can demonstrate the same intelligence as a human. The value of the
Turing test has been debated ever since.
1956:
John McCarthy coins the term 'artificial intelligence' at the first-ever
AI conference at Dartmouth College. Later that year, Allen Newell,
J.C. Shaw, and Herbert Simon create the Logic Theorist, the first-ever
running AI software program.
1958:
John McCarthy defined the high-level language Lisp, which was to
become the dominant AI programming language for the next 30 years.
1959:
At IBM, Nathaniel Rochester and his colleagues produced some of the
first AI programs.
Herbert Gelernter (1959) constructed the Geometry Theorem Prover,
which was able to prove theorems that many students of mathematics
would find quite tricky.
1967:
Frank Rosenblatt builds the Mark 1 Perceptron, the first computer
based on a neural network that 'learned' through trial and error.
1980s:
Neural networks, which use a backpropagation algorithm to train
themselves, become widely used in AI applications.
1997:
IBM's Deep Blue beats then world chess champion Garry Kasparov, in
a chess match (and rematch).
2011:
IBM Watson beats champions Ken Jennings and Brad Rutter
at Jeopardy!
2015:
Baidu's Minwa supercomputer uses a special kind of deep neural
network called a convolutional neural network to identify and categorize
images with a higher rate of accuracy than the average human.
2016:
DeepMind's AlphaGo program, powered by a deep neural network,
beats Lee Sedol, the world champion Go player, in a five-game match.
APPLICATIONS OF AI:
4.1. AI Application in E-Commerce
4.2. AI Application in Education
4.3. AI Application in Lifestyle
4.4. AI Application in Navigation
4.5. AI Application in Robotics
4.6. AI Application in Human Resource
4.7. AI Application in Healthcare
4.8. AI Application in Agriculture
4.9. AI Application in Gaming
4.10. AI Application in Automobiles
4.11. AI Application in Social Media
4.12. AI Application in Marketing
4.13. AI Application in Chatbots
4.14. AI Application in Finance
4.1. AI Application in E-Commerce
Personalized Shopping
o Artificial Intelligence technology is used to create
recommendation engines.
AGENTS
5.1 Agents and Environments
5.2 Basic Terminologies
5.3 Steps in designing an agent
5.4 Types of Agent
Environments
The environment could be everything—the entire universe.
Rational Agent
A rational agent acts so as to maximize the expected value of the
performance measure, given the percept sequence it has seen so
far.
Task Environment
A task environment specification includes the performance
measure, the external environment, the actuators, and the
sensors.
Model-based reflex
Goal-based agents
Learning agents:
It allows the agent to operate in initially unknown environments and to become more
competent than its initial knowledge alone might allow. A learning agent can be divided into
four conceptual components, as shown in figure 1.10:
Further, if the depth limit is changed to 3 (levels 0-3), the search will
again expand the nodes from level 0 to level 3 and terminate
with the sequence A->B->D->F->E->H, where H is the desired goal node.
Example:
Which solution would UCS find to move from node S to node G if run
on the graph below?
Solution.
Search Heuristics:
In an informed search, a heuristic is a function h(n) that estimates how
close a state is to the goal state.
h(n) = estimated cost of the cheapest path from the state at node n to
a goal state.
Example:
Question. Find the path from S to G using greedy search. The heuristic
value h of each node is shown below the node's name.
Solution.
Starting from S, we can traverse to A(h=9) or D(h=5). We choose D,
as it has the lower heuristic cost. Now from D, we can move to
B(h=4) or E(h=3). We choose E with a lower heuristic cost. Finally,
from E, we go to G(h=0). This entire traversal is shown in the search
tree below, in blue.
Heuristic:
The following points should be noted with heuristics in A* search.
f(x) = g(x) + h(x)
h(x) is called the forward cost and is an estimate of the distance of
the current node from the goal node.
g(x) is called the backward cost and is the cumulative cost of a
node from the root node.
A* search is optimal only when for all nodes, the forward cost for a
node h(x) underestimates the actual cost h*(x) to reach the goal.
This property of A* heuristic is called admissibility.
Admissibility: 0 <= h(x) <= h*(x)
Solution.
Starting from S, the algorithm computes g(x) + h(x) for all nodes in the
fringe at each step, choosing the node with the lowest sum.
The entire work is shown in the table below.
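The g(x) + h(x) bookkeeping can be sketched as follows (the edge costs and h(S) are assumed for illustration; the other h values follow the greedy example above, and h is admissible for this graph):

```python
import heapq

def astar(graph, h, start, goal):
    """A* search: expands the frontier node with the lowest f(x) = g(x) + h(x).
    graph: {node: [(neighbour, step_cost), ...]}, h: {node: heuristic estimate}."""
    frontier = [(h[start], 0, start, [start])]   # entries: (f, g, node, path)
    best_g = {start: 0}
    while frontier:
        f, g, node, path = heapq.heappop(frontier)
        if node == goal:
            return g, path
        for neighbour, step in graph.get(node, []):
            g2 = g + step                        # backward cost so far
            if g2 < best_g.get(neighbour, float('inf')):
                best_g[neighbour] = g2
                heapq.heappush(frontier,
                               (g2 + h[neighbour], g2, neighbour, path + [neighbour]))
    return None

# Assumed edge costs; h values for A, D, E, G taken from the greedy example
graph = {'S': [('A', 1), ('D', 2)], 'A': [('G', 10)],
         'D': [('E', 3)], 'E': [('G', 4)], 'G': []}
h = {'S': 7, 'A': 9, 'D': 5, 'E': 3, 'G': 0}
```

Because h underestimates the true remaining cost everywhere (admissibility), the first time G is popped its path S-D-E-G is optimal.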
A* algorithm
Example:
Question. Use graph searches to find paths from S to G in the following
graph.
Solution
The local search algorithm explores the above landscape by finding the
following two points:
Global Minimum: If the elevation corresponds to the cost, then the task
is to find the lowest valley, which is known as Global Minimum.
The topographical regions shown in the Figure 1.19 can be defined as:
Global Maximum: It is the highest point on the hill, which is the goal
state.
Local Maximum: It is a peak that is higher than its neighbouring states
but lower than the global maximum.
Flat local maximum: It is the flat area over the hill where it has no
uphill or downhill. It is a saturated point of the hill.
Shoulder: It is also a flat area where the summit is possible.
Current state: It is the current position of the person.
ADVERSARIAL SEARCH
9.1 Adversarial Search in Artificial Intelligence
9.2 Elements of Game Playing search
9.3 Example:
9.3.1 Game Tree For Tic Tac Toe
9.4 Types Of Algorithms In Adversarial Search
9.4.1 Minimax Strategy / Minimax Algorithm
9.4.2 Alpha – Beta Pruning
By R.Manickavasagan,AP/CSE
9.3 Example:
The game tic-tac-toe has three possible outcomes: win, lose, or draw,
with values +1, -1, or 0.
For the game tree designed for tic-tac-toe, refer to Figure 1.23.
Here, the node represents the game state and edges represent the moves
taken by the players.
Game Tree
A game tree is where the nodes represent the states of the game and
edges represent the moves made by the players in the game.
There are two players, namely:
o MIN: tries to decrease MAX's chances of winning the game.
o MAX: tries to increase its own chances of winning the game.
MAX will choose that path which will increase its utility value and MIN
will choose the opposite path which could help it to minimize MAX’s
utility value.
Step 2:
Now, first find the utility values for the Maximizer; its initial value
is -∞.
o For node D: max(-1, -∞) => max(-1, 4) = 4
o For node E: max(2, -∞) => max(2, 6) = 6
o For node F: max(-3, -∞) => max(-3, -5) = -3
o For node G: max(0, -∞) => max(0, 7) = 7
Step 3:
In the next step, it is the minimizer's turn, so it will compare all node
values with +∞ and find the third-layer node values.
o For node B= min(4,6) = 4
o For node C= min (-3, 7) = -3
Step 4:
Now it is the Maximizer's turn; it will again choose the maximum of all
node values and find the value for the root node.
In this game tree there are only 4 layers, so the root node is reached
immediately, but in real games there will be more than 4 layers.
o For node A max(4, -3)= 4
That was the complete workflow of the minimax two player game.
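The four steps above can be condensed into a recursive sketch (the leaf values follow the worked example; the dictionary representation of the tree is assumed):

```python
def minimax(node, is_max, tree):
    """Plain minimax over a game tree given as {node: children};
    leaves are utility values that appear directly in the child lists."""
    children = tree.get(node)
    if children is None:                 # leaf reached: node is its own utility
        return node
    values = [minimax(child, not is_max, tree) for child in children]
    return max(values) if is_max else min(values)

# Tree from the worked example: D=[-1,4], E=[2,6], F=[-3,-5], G=[0,7]
tree = {'A': ['B', 'C'], 'B': ['D', 'E'], 'C': ['F', 'G'],
        'D': [-1, 4], 'E': [2, 6], 'F': [-3, -5], 'G': [0, 7]}
value = minimax('A', True, tree)
```

The recursion reproduces the hand computation: D=4, E=6, F=-3, G=7 at the MAX layer; B=4, C=-3 at the MIN layer; and 4 at the root A.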
Working of Alpha-Beta Pruning
Step 1:
In the first step, the Max player starts the first move from node A,
where α = -∞ and β = +∞. These values of alpha and beta are passed
down to node B, where again α = -∞ and β = +∞, and node B passes the
same values to its child D.
Step 2:
At node D, the value of α will be calculated, as it is Max's turn. The
value of α is compared first with 2 and then with 3; max(2, 3) = 3
becomes the value of α at node D, and the node value is also 3.
Step 3:
The algorithm now backtracks to node B, where the value of β changes,
as it is Min's turn. β = +∞ is compared with the available successor
value: min(∞, 3) = 3. Hence, at node B,
α = -∞ and β = 3.
Step 4:
At node E, Max will take its turn, and the value of alpha will change.
The current value of alpha is compared with 5, so max(-∞, 5) = 5;
hence at node E, α = 5 and β = 3, where α >= β, so the right successor
of E will be pruned, the algorithm will not traverse it, and the value
at node E will be 5.
Step 5:
In the next step, the algorithm again backtracks the tree, from node B
to node A. At node A, the value of alpha changes: the maximum
available value is 3, as max(-∞, 3) = 3, and β = +∞. These two values
are now passed to the right successor of A, which is node C.
At node C, α=3 and β= +∞, and the same values will be passed on to
node F.
Step 6:
At node F, the value of α is compared first with the left child, 0:
max(3, 0) = 3; and then with the right child, 1: max(3, 1) = 3. So α
remains 3, but the node value of F becomes
1.
Step 7:
Node F returns the node value 1 to node C. At C, α = 3 and β = +∞; here
the value of beta changes: it is compared with 1, so min(∞, 1) =
1. Now at C, α = 3 and β = 1, which again satisfies the condition α >= β,
so the next child of C, which is G, is pruned, and the algorithm does
not compute the entire subtree of G.
Step 8:
C now returns the value 1 to A, where the best value for A is max
(3, 1) = 3.
The final game tree shows the nodes that were computed and the nodes
that were never computed.
Hence the optimal value for the maximizer is 3 for this example.
The game tree is evaluated from the last level upward,
and the values are chosen accordingly.
The rule which will be followed is: “Explore nodes if necessary
otherwise prune the unnecessary nodes.”
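The eight steps above can be condensed into a recursive sketch (the leaf values follow the worked example; values for the pruned leaves are assumed, since pruning makes them irrelevant):

```python
def alphabeta(node, is_max, tree, alpha=float('-inf'), beta=float('inf')):
    """Minimax with alpha-beta pruning; a branch is cut when alpha >= beta."""
    children = tree.get(node)
    if children is None:                 # leaf reached: node is its own utility
        return node
    if is_max:
        value = float('-inf')
        for child in children:
            value = max(value, alphabeta(child, False, tree, alpha, beta))
            alpha = max(alpha, value)
            if alpha >= beta:
                break                    # beta cutoff: MIN will never allow this
        return value
    value = float('inf')
    for child in children:
        value = min(value, alphabeta(child, True, tree, alpha, beta))
        beta = min(beta, value)
        if alpha >= beta:
            break                        # alpha cutoff: MAX will never allow this
    return value

# Tree from the walkthrough: D=[2,3], E=[5,...], F=[0,1], G pruned entirely
tree = {'A': ['B', 'C'], 'B': ['D', 'E'], 'C': ['F', 'G'],
        'D': [2, 3], 'E': [5, 9], 'F': [0, 1], 'G': [7, 5]}
value = alphabeta('A', True, tree)
```

The recursion reproduces the walkthrough: E's right child and the whole subtree of G are skipped, and the root value is 3.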
There are many possible solutions to this problem, such as in figure 1.26
CS 3491 – ARTIFICIAL INTELLIGENCE AND MACHINE LEARNING UNIT 2
SYLLABUS:
Acting under uncertainty – Bayesian inference – Naïve Bayes
models. Probabilistic reasoning – Bayesian networks – exact
inference in BN – approximate inference in BN – causal networks.
PART A
1. Define uncertainty and list the causes of uncertainty.
Uncertainty:
The knowledge representation A→B means that if A is true then B is
true. But in a situation where we are not sure whether A is true or not,
we cannot express this statement; this situation is called uncertainty.
So, to represent uncertain knowledge, uncertain reasoning or
probabilistic reasoning is used.
Causes of uncertainty in the real world:
1. Information occurred from unreliable sources.
2. Experimental errors
3. Equipment faults
4. Temperature variation
5. Climate change.
7. In a class, 70% of the students like English and 40% of the
students like both English and mathematics. What percentage
of the students who like English also like
mathematics?
Solution:
Let A be the event that a student likes Mathematics,
and B be the event that a student likes English.
P(A|B) = P(A ∧ B) / P(B) = 0.40 / 0.70 = 0.57
Hence, 57% of the students who like English also like Mathematics.
11. Consider two events: A (it will rain tomorrow) and B (the sun will
shine tomorrow).
Use Bayes’ theorem to compute the posterior probability of each event
occurring, given the resulting weather conditions for today:
P(A|sunny) = P(sunny|A) * P(A) / P(sunny)
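Plugging hypothetical numbers into this formula (all three probabilities below are assumed for illustration, not given in the notes):

```python
def posterior(likelihood, prior, evidence):
    """Bayes' theorem: P(A|B) = P(B|A) * P(A) / P(B)."""
    return likelihood * prior / evidence

# Assumed illustrative numbers:
# P(A) = P(rain) = 0.1, P(sunny|A) = 0.2, P(sunny) = 0.6
p_rain_given_sunny = posterior(likelihood=0.2, prior=0.1, evidence=0.6)
```

The posterior combines the prior belief in rain with how well "sunny" fits that hypothesis, normalized by the overall probability of a sunny day.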
Local Semantics
PART B
Agents almost never have access to the whole truth about their environment.
Agents must, therefore, act under uncertainty.
The problem is that this rule is wrong. Not all patients with toothaches have
cavities; some of them have gum disease, an abscess, or one of several other
problems:
Unfortunately, in order to make the rule true, we have to add an almost unlimited
list of possible causes. We could try turning the rule into a causal rule:
But this rule is not right either; not all cavities cause pain. The only way to fix the
rule is to make it logically exhaustive: to augment the left-hand side with all the
qualifications required for a cavity to cause a toothache. Even then, for the
purposes of diagnosis, one must also take into account the possibility that the
patient might have a toothache and a cavity that are unconnected. Trying to use
first-order logic to cope with a domain like medical diagnosis thus fails for three
main reasons:
That is, we expect that out of all the situations that are indistinguishable
from the current situation as far as the agent's knowledge goes, the patient will
have a cavity in 80% of them. This belief could be derived from statistical data-80%
of the toothache patients seen so far have had cavities-or from some general rules,
or from a combination of evidence sources.
The 80% summarizes those cases in which all the factors needed for a cavity
to cause a toothache are present and other cases in which the patient has both
toothache and cavity but the two are unconnected. The missing 20% summarizes
all the other possible causes of toothache that we are too lazy or ignorant to confirm
or deny.
1.2.3 Probability:
Probability can be defined as a chance that an uncertain event
will occur.
The value of probability always remains between 0 and 1, which
represent the limits of total uncertainty and total certainty.
o 0 ≤ P(A) ≤ 1, where P(A) is the probability of an event A.
o P(A) = 0, indicates total uncertainty in an event A.
o P(A) =1, indicates total certainty in an event A.
Formula to find the probability of an uncertain event:
P(A) = Number of favourable outcomes / Total number of outcomes
1.2.4.1 Example:
In a class, 70% of the students like English and 40% of the students
like both English and mathematics. What percentage of the students
who like English also like mathematics?
Solution:
Let A be the event that a student likes Mathematics,
and B be the event that a student likes English.
P(A|B) = P(A ∧ B) / P(B) = 0.40 / 0.70 = 0.57
Hence, 57% of the students who like English also like Mathematics.
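The same arithmetic as a short sketch:

```python
# From the example: P(B) = 0.70 (likes English), P(A and B) = 0.40 (likes both)
p_english = 0.70
p_english_and_math = 0.40

# Conditional probability: P(A|B) = P(A and B) / P(B)
p_math_given_english = p_english_and_math / p_english   # ≈ 0.571, i.e. 57%
```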
2. Explain in detail about Bayesian inference and Naive Bayes Model or Naive
Bayes Theorem or Bayes Rule.
Global Semantics
Local Semantics
Markov Blanket
Each node is conditionally independent of all others given its
Markov blanket: parents + children + children’s parents
3.4 Example:
Harry installed a new burglar alarm at his home to detect burglary.
The alarm responds reliably to a burglary but also responds
to minor earthquakes. Harry has two neighbors, David and Sophia,
who have taken the responsibility to inform Harry at work when they
hear the alarm. David always calls Harry when he hears the alarm,
but sometimes he gets confused with the phone ringing and calls at
that time too. On the other hand, Sophia likes to listen to loud music,
so she sometimes misses hearing the alarm. Here we would like to
compute the probability of Burglary Alarm.
Problem:
Calculate the probability that alarm has sounded, but there is
neither a burglary, nor an earthquake occurred, and David and
Sophia both called the Harry.
Solution:
The Bayesian network for the above problem is given in figure 2.2.
The network structure shows that Burglary and Earthquake are
the parent nodes of Alarm and directly affect the probability
of the alarm going off, while David's and Sophia's calls depend on the
alarm probability.
Let's take the observed probability for the Burglary and earthquake
component:
P(B=True) = 0.002, which is the probability of burglary.
P(B=False)= 0.998, which is the probability of no burglary.
P(E=True) = 0.001, which is the probability of a minor earthquake.
P(E=False) = 0.999, which is the probability that an earthquake has not
occurred.
From the formula of joint distribution, the problem statement in the form of
probability distribution:
P(S, D, A, ¬B, ¬E) = P (S|A) *P (D|A)*P (A|¬B ^ ¬E) *P (¬B) *P (¬E).
= 0.75* 0.91* 0.001* 0.998*0.999
= 0.00068045.
Hence, a Bayesian network can answer any query about the domain
by using Joint distribution.
2. Biomonitoring:
a. This involves the use of indicators to quantify the concentration of
chemicals in the human body.
3. Information retrieval:
a. Bayesian networks assist in information retrieval for research,
which is a constant process of extracting information from
databases.
4. Image processing:
a. A form of signal processing, image processing uses mathematical
operations to convert images into digital format.
5. Gene regulatory network:
a. A Bayesian network is an algorithm that can be applied to gene
regulatory networks in order to make predictions about the effects
of genetic variations on cellular phenotypes.
b. Gene regulatory networks are a set of mathematical equations that
describe the interactions between genes, proteins, and metabolites.
c. They are used to study how genetic variations affect the
development of a cell or organism.
6. Turbo code:
a. Turbo codes are a type of error correction code capable of achieving
very high data rates and long distances between error correcting
nodes in a communications system.
b. They have been used in satellites, space probes, deep-space
missions, military communications systems, and civilian wireless
communication systems, including WiFi and 4G LTE cellular
telephone systems.
7. Document classification:
a. The main issue is to assign a document multiple classes. The task
can be achieved manually and algorithmically. Since manual effort
takes too much time, algorithmic documentation is done to
complete it quickly and effectively.
4. Explain in detail about Bayesian Inference and its type Exact Inference
with suitable example.
The basic task for any probabilistic inference system is to compute the
posterior probability distribution for a set of query variables, given some observed
event-that is, some assignment of values to a set of evidence variables. We will use
the following notation: X denotes the query variable; E denotes the set
of evidence variables E1, ..., Em, and e is a particular observed event; Y
denotes the nonevidence variables Y1, ..., Yn (sometimes called the
hidden variables). Thus, the complete set of variables is
X = {X} ∪ E ∪ Y. A typical query asks for the posterior probability
distribution P(X | e).
Inference by enumeration
Conditional probability can be computed by summing terms from the full joint
distribution. More specifically, a query P(X|e) can be answered using Equation,
which we repeat here for convenience:
Consider the query P(Burglary | JohnCalls = true, MaryCalls = true). The
hidden variables for this query are Earthquake and Alarm. From Equation (13.6),
using initial letters for the variables in order to shorten the expressions, we have
The structure of this computation is shown in the above figure. Using the numbers
from the figure, we obtain P(b | j, m) = α × 0.00059224. The corresponding
computation for ¬b yields α × 0.0014919; hence
P(B | j, m) = α ⟨0.00059224, 0.0014919⟩ ≈ ⟨0.284, 0.716⟩.
That is, the chance of a burglary, given calls from both neighbors, is about 28%.
The evaluation process for the expression in Equation is shown as an expression
tree in Figure.
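The normalization step hidden in the constant α can be sketched as follows (the unnormalized values come from the enumeration above):

```python
def normalize(values):
    """Apply the normalization constant α: scale entries so they sum to 1."""
    total = sum(values)
    return [v / total for v in values]

# Unnormalized enumeration results for b and ¬b from the text above
posterior = normalize([0.00059224, 0.0014919])   # ≈ [0.284, 0.716]
```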
If the number of parents of each node is bounded by a constant, then
the complexity will also be linear in the number of nodes. These results
hold for any ordering consistent with the topological ordering of the network.
A network with 2 nodes (fire icon and smoke icon) and 1 edge (arrow
pointing from fire to smoke).
This network can be both a Bayesian or causal network.
The key distinction, however, is when interpreting this network.
For a Bayesian network, we view the nodes as variables and the arrow
as a conditional probability, namely the probability of smoke given
information about fire.
When interpreting this as a causal network, we still view nodes as
variables, however, the arrow indicates a causal connection.
In this case, both interpretations are valid. However, if we were to flip the
edge direction, the causal network interpretation would be invalid, since
smoke does not cause fire.
1.The do-operator
The do-operator is a mathematical representation of a physical
intervention.
If the model starts with Z → X → Y, simulate an intervention in X by
deleting all the incoming arrows to X, and manually setting X to some
value x_0. Refer Figure 2.4 denotes the example of do-operator.
2: Confounding
A simple example of confounding is shown in the figure 2.5 below.
We begin with a random sampling process for a Bayes net that has no evidence
associated with it. The idea is to sample each variable in turn, in topological order.
The probability distribution from which the value is sampled is conditioned on the
values already assigned to the variable’s parents. (Because we sample in topological
order, the parents are guaranteed to have values already.) This algorithm is shown
in Figure. Applying it to the network with the ordering Cloudy, Sprinkler, Rain,
WetGrass, we might produce a random event as follows:
Let ˆP(X |e) be the estimated distribution that the algorithm returns; this
distribution is computed by normalizing NPS(X,e), the vector of sample counts for
each value of X where the sample agrees with the evidence e:
Consider the query P(Rain | Sprinkler = true, WetGrass = true) applied to the
network in the figure. The evidence variables Sprinkler and WetGrass are fixed to
their observed values, and the hidden variables Cloudy and Rain are initialized
randomly; let us say to true and false respectively. Thus, the initial state is
[true, true, false, true]. Now the following steps are executed repeatedly:
1. Cloudy is sampled, given the current values of its Markov blanket variables: in
this case, we sample from P(Cloudy | Sprinkler = true, Rain = false). Suppose the
result is Cloudy = false. Then the new current state is [false, true, false, true].
2. Rain is sampled, given the current values of its Markov blanket variables: in this
case, we sample from P(Rain | Cloudy = false, Sprinkler = true, WetGrass = true).
Suppose this yields Rain = true. The new current state is [false, true, true, true].
Each state visited during this process is a sample that contributes to the estimate
for the query variable Rain. If the process visits 20 states where Rain is true and 60
states where Rain is false, then the answer to the query is
NORMALIZE(⟨20, 60⟩) = ⟨0.25, 0.75⟩.
CS 3491 – ARTIFICIAL INTELLIGENCE AND MACHINE LEARNING UNIT 3
SYLLABUS:
Introduction to machine learning – Linear Regression Models: Least
squares, single & multiple variables, Bayesian linear regression,
gradient descent, Linear Classification Models: Discriminant function –
Probabilistic discriminative model - Logistic regression, Probabilistic
generative model – Naive Bayes, Maximum margin classifier – Support
vector machine, Decision Tree, Random forests
PART A
1. Define Machine Learning.
Arthur Samuel, an early American leader in the field of computer gaming
and artificial intelligence, coined the term “Machine Learning” in 1959
while at IBM.
He defined machine learning as “the field of study that gives computers
the ability to learn without being explicitly programmed”.
Machine learning is programming computers to optimize a performance
criterion using example data or past experience. The model may be
predictive to make predictions in the future, or descriptive to gain
knowledge from data.
9. Define Classification.
The Classification algorithm is a Supervised Learning technique that is
used to identify the category of new observations on the basis of training
data.
In Classification, a program learns from the given dataset or observations
and then classifies new observations into a number of classes or groups,
such as Yes or No, 0 or 1, Spam or Not Spam, or cat or dog. Classes are
also called targets, labels, or categories.
y = mx + c
where
m: slope
c: y-intercept
Y′ = bX + a
where
Y′ represents the predicted value;
X represents the known value;
b and a represent numbers calculated from the original correlation
analysis
P(H | E) = [P(E | H) · P(H)] / P(E)
where
o Posterior: the probability that an event H occurs given that another
event E has already occurred, i.e., P(H | E).
o Prior: the probability of event H before observing E, i.e., P(H).
o Likelihood: the probability of observing the evidence E given that H
holds, i.e., P(E | H).
o Multi-class Problems:
o If a classification problem has more than two outcomes, it is
called a multi-class classification problem.
o Example: classification of types of crops, classification of types
of music.
The cost is 0 if the predicted value and the actual value have the same
sign; otherwise, the loss value is calculated.
PART B
1.1.3 Examples
1.1.3.1 Handwriting recognition learning problem
Task T : Recognizing and classifying handwritten words
within images
Performance P : Percent of words correctly classified
Training experience E : A dataset of handwritten words with
given classifications
1.1.3.2 A robot driving learning problem
Task T : Driving on highways using vision sensors
Performance P : Average distance traveled before an error
Training experience E : A sequence of images and steering
commands recorded while observing a human driver
1.3.1 Classification:
The Classification algorithm is a Supervised Learning technique
that is used to identify the category of new observations on the
basis of training data.
In Classification, a program learns from the given dataset or
observations and then classifies new observations into a number of
classes or groups, such as Yes or No, 0 or 1, Spam or Not Spam, or
cat or dog. Classes are also called targets, labels, or categories.
1.3.2 Regression:
Regression is a supervised learning technique which helps in
finding the correlation between variables and enables us to
predict a continuous output variable based on one or more
predictor variables.
CS 3491 – ARTIFICIAL INTELLIGENCE AND MACHINE LEARNING UNIT 3
1.3.3 Clustering:
Clustering or cluster analysis is a machine learning technique,
which groups the unlabeled dataset.
It can be defined as "A way of grouping the data points into
different clusters, consisting of similar data points.
The objects with the possible similarities remain in a group that
has less or no similarities with another group."
y = mx + c
where
m: slope
c: y-intercept
Y′ = bX + a
where
Y′ represents the predicted value;
X represents the known value;
b and a represent numbers calculated from the original correlation
analysis
Step 6: Calculate the slope and the y-intercept using the formulas
# Calculating 'm' and 'c'
num = 0
denom = 0
for i in range(n):
    num += (X[i] - mean_x) * (Y[i] - mean_y)
    denom += (X[i] - mean_x) ** 2
m = num / denom
c = mean_y - (m * mean_x)

# Printing coefficients
print("Coefficients")
print(m, c)
Output:
Mathematical Approach:
Residual/Error = Actual values − Predicted values
Sum of Residuals/Errors = Sum(Actual − Predicted values)
Square of Sum of Residuals/Errors = (Sum(Actual − Predicted values))²
P(H | E) = [P(E | H) · P(H)] / P(E)
where
o Posterior: the probability that an event H occurs given that another
event E has already occurred, i.e., P(H | E).
o Prior: the probability of event H before observing E, i.e., P(H).
o Likelihood: the probability of observing the evidence E given that H
holds, i.e., P(E | H).
Program
from sklearn.datasets import load_boston
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score
from sklearn.linear_model import BayesianRidge

# The fitting code was missing from the notes; a minimal completion
# (the exact split used to produce the output below is an assumption):
X, y = load_boston(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.15,
                                                    random_state=42)
model = BayesianRidge().fit(X_train, y_train)
print("Test Set r2 score :", r2_score(y_test, model.predict(X_test)))
Output
Test Set r2 score : 0.7943355984883815
The goal is to minimize the cost as much as possible in order to find the
best fit line.
Learning Rate
A learning rate is used for each pair of input and output values. It is a
scalar factor by which the coefficients are updated in the direction that
minimizes the error.
The process is repeated until a minimum sum squared error is achieved or
no further improvement is possible.
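The update loop described above can be sketched for simple linear regression; the data and learning rate below are made up for illustration:

```python
# Gradient descent for simple linear regression (illustrative sketch;
# the data, learning rate, and iteration count are made up).
X = [1, 2, 3, 4, 5]
Y = [3, 5, 7, 9, 11]          # exactly y = 2x + 1
m, c = 0.0, 0.0
lr = 0.01                     # learning rate (scalar factor)
n = len(X)
for _ in range(5000):
    # Gradients of the mean squared error with respect to m and c
    grad_m = (-2 / n) * sum((Y[i] - (m * X[i] + c)) * X[i] for i in range(n))
    grad_c = (-2 / n) * sum((Y[i] - (m * X[i] + c)) for i in range(n))
    m -= lr * grad_m          # step in the direction that reduces the error
    c -= lr * grad_c
print(round(m, 2), round(c, 2))
```

The loop stops after a fixed number of iterations here; in practice it stops once the sum of squared errors no longer improves.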
In figure 3.7, the hyperplane H divides the feature space into two
half-spaces:
o Decision region R1 for class ω1
o Decision region R2 for class ω2.
The discriminant function g(x) gives an algebraic measure of the
distance from x to the hyperplane.
x can be expressed as
x = xp + r (w / ‖w‖)   (3.4)
where xp is the normal projection of x onto H, and r is the desired
algebraic distance, which is positive if x is on the positive side and
negative if x is on the negative side. Then, because g(xp) = 0 and
g(x) = wᵀx + w0,
g(x) = r ‖w‖   (3.5)
or
r = g(x) / ‖w‖   (3.6)
o If w0 = 0, then g(x) has the homogeneous form g(x) = wᵀx, and the
hyperplane passes through the origin.
In particular, the distance from the origin to H is w0 / ‖w‖; if w0 > 0
the origin lies on the positive side of H, and if w0 < 0 it lies on the
negative side (equations 3.7–3.9).
For the multi-class case, the posterior probability of class Ck is given by a
softmax transformation of a linear function of x:
P(Ck | x) = exp(ak) / Σj exp(aj),   where ak = wkᵀx + wk0
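A numerically stable sketch of that softmax transformation (the activation values below are made up for illustration):

```python
import math

def softmax_posteriors(a):
    """P(C_k | x) = exp(a_k) / sum_j exp(a_j) for linear activations a_k."""
    mx = max(a)                            # subtract max for numerical stability
    exps = [math.exp(v - mx) for v in a]
    s = sum(exps)
    return [e / s for e in exps]

# Example activations a_k = w_k^T x + w_k0 for three classes (made up)
probs = softmax_posteriors([2.0, 1.0, 0.1])
```

The outputs are non-negative and sum to 1, so they can be read as class posteriors.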
The cost is 0 if the predicted value and the actual value have the same
sign; otherwise, the loss value is calculated.
The objective of the regularization parameter is to balance the margin
maximization and loss.
After adding the regularization parameter, the cost function takes the form
J(w) = λ‖w‖² + (1/n) Σi max(0, 1 − yi (wᵀxi + b))
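This hinge loss plus regularization can be sketched directly. The exact cost formula in the notes' figure is not reproduced, so the code below uses one common form of the soft-margin objective (an assumption), with made-up data:

```python
def hinge_loss(y, f):
    """Cost is 0 when prediction f and label y (±1) agree with margin ≥ 1."""
    return max(0.0, 1.0 - y * f)

def svm_cost(w, b, X, Y, lam):
    """Regularized cost: lam * ||w||^2 + average hinge loss (one common
    form; the exact formula in the notes' figure is not shown)."""
    margin_term = lam * sum(wi * wi for wi in w)
    losses = [hinge_loss(y, sum(wi * xi for wi, xi in zip(w, x)) + b)
              for x, y in zip(X, Y)]
    return margin_term + sum(losses) / len(losses)
```

The regularization weight lam trades off margin maximization against the total loss, as described above.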
Different kernel functions can be specified for the decision function, and
it is possible to specify custom kernels.
6.6 Disadvantages
If the number of features is a lot bigger than the number of data points,
choosing kernel functions and regularization term is crucial.
SVMs don't directly provide probability estimates. Those are calculated
using an expensive five-fold cross-validation.
Works best on small sample sets because of its high training time.
6.7 Applications
SVMs can be used to solve various real-world problems:
SVMs are helpful in text and hypertext categorization.
Classification of images can also be performed using SVMs.
Classification of satellite data like SAR data using supervised SVM.
Hand-written characters can be recognized using SVM.
The SVM algorithm has been widely applied in the biological and other
sciences. They have been used to classify proteins with up to 90% of the
compounds classified correctly.
Example:
Suppose there is a candidate who has a job offer and wants to decide
whether he should accept the offer or Not. So, to solve this problem, the
decision tree starts with the root node (Salary attribute by ASM). The root
node splits further into the next decision node (distance from the office) and
one leaf node based on the corresponding labels. The next decision node
further gets split into one decision node (Cab facility) and one leaf node.
Finally, the decision node splits into two leaf nodes (Accepted offer and
Declined offer). Consider the diagram in Figure 3.14.
Decision Tree
7.7.1 Entropy:
Entropy is a metric to measure the impurity in a given
attribute.
Entropy is a measure of the randomness in the information
being processed.
The higher the entropy, the harder it is to draw any conclusions
from that information.
Flipping a coin is an example of an action that provides
information that is random.
Entropy can be calculated as:
Entropy(S) = − Σi pi log2(pi)
where pi is the proportion of examples belonging to class i.
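The entropy formula can be sketched in Python (the label lists below are made up):

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy(S) = -sum_i p_i * log2(p_i) over the class proportions p_i."""
    n = len(labels)
    counts = Counter(labels)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())
```

A fair coin flip (half "yes", half "no") gives the maximum entropy of 1 bit, while a pure set gives 0, matching the intuition above.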
7.7.6 Chi-Square
The acronym CHAID stands for Chi-squared Automatic Interaction
Detector.
It measures the statistical significance of the differences between the
sub-nodes and the parent node.
RANDOM FOREST
8.1 Random Forest
8.2 Steps in the working process of Random Forest
8.3 Need for Random Forest
8.4 Example:
8.5 Important Features of Random Forest
8.6 Applications of Random Forest
8.7 Advantages of Random Forest
8.8 Disadvantages of Random Forest
8.9 Difference between Decision Tree & Random Forest
In the above figure 3.16, the majority decision tree gives output as an
apple when compared to a banana, so the final output is taken as an
apple.
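The majority-vote step in Figure 3.16 can be sketched directly (the tree predictions below are made up to mirror the apple/banana example):

```python
from collections import Counter

def majority_vote(predictions):
    """Final Random Forest output = the class predicted by the most trees."""
    return Counter(predictions).most_common(1)[0][0]

# Hypothetical outputs from five decision trees, as in the figure's example
fruit = majority_vote(["apple", "banana", "apple", "apple", "banana"])
```

Since three of the five trees vote "apple", the forest's final output is "apple".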
UNIT IV
PART A
4. Explain Clustering.
Clustering is the act of organizing similar objects into groups within
a machine learning algorithm. Assigning related objects into clusters is
beneficial for AI models, and clustering has many uses in data science.
CS3491 AIML Unit-4
5. What is Cluster?
A cluster is a group of objects that belong to the same class. In other
words, similar objects are grouped in one cluster and dissimilar objects
are grouped in another cluster.
6. What is bagging?
Bagging, also known as Bootstrap aggregation, is an ensemble method that
works by training multiple models independently and combining them later
to produce a stronger model.
7. Define Boosting.
Boosting refers to a group of algorithms that utilize weighted averages to
turn weak learners into stronger learners.
Bagging aims to decrease variance, not bias; Boosting aims to decrease
bias, not variance.
PART - B
1. Give Short notes on combining multiple learners.
Combining multiple learners means building models composed of multiple
learners that complement each other, so that by combining them we attain
higher accuracy.
Rationale
In any application, we can use one of several learning algorithms, and
with certain algorithms, there are hyperparameters that affect the final
learner.
For example, in a classification setting, we can use a parametric
classifier or a multilayer perceptron, and, for example, with a multilayer
perceptron, we should also decide on the number of hidden units.
The No Free Lunch Theorem states that there is no single learning
algorithm that in any domain always induces the most accurate learner.
The usual approach is to try many and choose the one that performs the
best on a separate validation set.
Each learning algorithm dictates a certain model that comes with a set of
assumptions. This inductive bias leads to error if the assumptions do not
hold for the data.
The performance of a learner may be fine-tuned to get the highest
possible accuracy on a validation set, but this fine tuning is a complex
task and still there are instances on which even the best learner is not
accurate enough.
Data fusion is the process of fusing multiple records representing the
same real-world object into a single, consistent, and clean representation.
Fusion of data for improving prediction accuracy and reliability is an
important problem in machine learning.
Combining different models is done to improve the performance of deep
learning models.
Building a new model by combining requires less time, data, and
computational resources.
The most common method to combine models is by averaging multiple
models, where taking a weighted average improves the accuracy.
Different Algorithms
We can use different learning algorithms to train different base-learners.
Different algorithms make different assumptions about the data and lead to
different classifiers.
Figure 4.1 Base-learners are dj and their outputs are combined using f(·).
Voting
The simplest way to combine multiple classifiers is by voting, which
corresponds to taking a linear combination of the learners (see figure 4.1)
y = Σj wj dj,   where wj ≥ 0 and Σj wj = 1   (Eq. 2)
This is also known as ensembles and linear opinion pools. In the simplest case,
all learners are given equal weight and we have simple voting that corresponds
to taking an average.
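The linear combination above can be sketched as a small helper; the base-learner outputs and weights below are made up for illustration:

```python
def combine(outputs, weights=None):
    """Linear combination y = sum_j w_j * d_j of base-learner outputs.
    With equal weights (simple voting) this reduces to the plain average."""
    L = len(outputs)
    if weights is None:
        weights = [1.0 / L] * L       # simple voting: w_j = 1/L
    return sum(w * d for w, d in zip(weights, outputs))
```

For example, three learners outputting 0.2, 0.4, and 0.9 combine under simple voting to their average, 0.5.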
Table 4.2 Example of combination rules on three learners and three classes
yi = Σj wj dji   (Eq. 3)
Let us assume that dj are iid with expected value E[dj] and variance Var(dj);
then when we take a simple average with wj = 1/L, the expected value and
variance of the output are
E[y] = E[(1/L) Σj dj] = E[dj]
Var(y) = Var((1/L) Σj dj) = (1/L²) · L · Var(dj) = Var(dj) / L   (Eq. 4)
We see that the expected value does not change, so the bias does not
change.
But variance and therefore mean square error, decreases as the number
of independent voters, L, increases.
In the general case,
Var(y) = (1/L²) [Σj Var(dj) + 2 Σj Σi<j Cov(dj, di)]   (Eq. 5)
which implies that if learners are positively correlated, variance (and
error) increases.
We can thus view using different algorithms and input features as efforts
to decrease, if not completely eliminate, the positive correlation.
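The variance-reduction claim of Eq. 4 can be checked with a quick simulation; the learner outputs below are made-up i.i.d. Gaussian draws:

```python
import random
import statistics

# Averaging L i.i.d. learner outputs leaves the expected value unchanged
# but divides the variance by L (illustrative simulation, made-up data).
rng = random.Random(42)

def learner_output():
    return rng.gauss(0.0, 1.0)       # E[d_j] = 0, Var(d_j) = 1

def ensemble_output(L):
    return sum(learner_output() for _ in range(L)) / L

singles = [ensemble_output(1) for _ in range(20000)]
averaged = [ensemble_output(10) for _ in range(20000)]
print(statistics.pvariance(singles), statistics.pvariance(averaged))
```

The single-learner variance comes out near 1, and the 10-learner average near 1/10, while both means stay near 0.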
Error-Correcting Output Codes
The main classification task is defined in terms of a number of subtasks
that are implemented by the base learners.
The idea is that the original task of separating one class from all other
classes may be a difficult problem.
We want to define a set of simpler classification problems, each
specializing in one aspect of the task, and combining these simpler
classifiers, we get the final classifier.
Base-learners are binary classifiers having output −1/ + 1, and there is a
code matrix W of K × L whose K rows are the binary codes of classes in
terms of the L base-learners dj.
The code matrix W codes the classes in terms of the learners. In
one-per-class coding, L = K and W has +1 on the diagonal and −1 elsewhere.
The problem here is that if there is an error with one of the base-
learners, there may be a misclassification because the class code words
are so similar. So the approach in error correcting codes is to have L > K
and increase the Hamming distance between the code words.
With reasonable L, find W such that the hamming distance between rows
and columns are maximized.
ECOC can be written as a voting scheme where the entries of W, wij, are
considered as vote weights:
yi = Σj wij dj
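This decoding step can be sketched as follows, using a one-per-class code matrix for K = L = 3 (an illustrative choice, not a matrix from the notes):

```python
def ecoc_predict(outputs, code_matrix):
    """Assign the class whose code word best matches the base-learner
    outputs; score_i = sum_j w_ij * d_j (the voting form of ECOC)."""
    scores = [sum(w * d for w, d in zip(row, outputs)) for row in code_matrix]
    return scores.index(max(scores))

# K = 3 classes, L = 3 one-per-class code words (+1 on the diagonal)
W = [[+1, -1, -1],
     [-1, +1, -1],
     [-1, -1, +1]]
```

With one-per-class codes a single learner error can produce a tie, which is exactly why error-correcting codes take L > K to increase the Hamming distance between code words.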
Advantages:
1. Reduce overfitting of the model.
2. Handles higher dimensionality data very well.
3. Maintains accuracy for missing data
Disadvantage:
Since the final prediction is based on the mean prediction from the subset
trees, it won't give precise values for the classification and regression model.
(Table: Bagging vs. Boosting comparison)
o Now the base model is fitted on the first n−1 folds, and it makes
predictions for the nth fold.
o The prediction made in the above step is added to the x1_train list.
o Repeat steps 2 & 3 for the remaining folds, which gives an x1_train
array of size n.
o Now, the model is trained on all the n parts, which will make predictions
for the sample data.
o Add this prediction to the y1_test list.
o In the same way, we can find x2_train, y2_test, x3_train, and y3_test by
using Model 2 and 3 for training, respectively, to get Level 2 predictions.
o Now train the Meta model on level 1 prediction, where these predictions
will be used as features for the model (refer Figure 4.3).
o Finally, Meta learners can now be used to make a prediction on test data
in the stacking model.
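The out-of-fold (Level-1) prediction step above can be sketched with a trivial stand-in model (a mean predictor, purely illustrative; real stacking would use trained base learners):

```python
# Minimal sketch of the out-of-fold step in stacking. The "model" here
# just predicts the mean of its training labels, so the procedure stays
# self-contained; real base models would be fitted in its place.
def mean_model(train_y):
    mu = sum(train_y) / len(train_y)
    return lambda x: mu

def out_of_fold_predictions(X, y, n_folds, fit):
    n = len(X)
    fold = [i * n_folds // n for i in range(n)]   # contiguous fold labels
    preds = [None] * n
    for k in range(n_folds):
        train_y = [y[i] for i in range(n) if fold[i] != k]
        model = fit(train_y)                      # train on the other folds
        for i in range(n):
            if fold[i] == k:
                preds[i] = model(X[i])            # predict the held-out fold
    return preds
```

Each sample's Level-1 prediction comes from a model that never saw that sample, which is what keeps the meta-learner's training data honest.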
The algorithm takes the unlabeled dataset as input, divides the dataset into
k clusters, and repeats the process until it finds the best clusters. The
value of k should be predetermined in this algorithm.
Hence each cluster has data points with some commonalities, and each
cluster is away from the other clusters.
Step-2: Select K random points as centroids. (They may be points other
than those from the input dataset.)
Step-3: Assign each data point to their closest centroid, which will form the
predefined K clusters.
Step-4: Calculate the variance and place a new centroid of each cluster.
Step-5: Repeat the third step, reassigning each data point to the new
closest centroid of each cluster.
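The steps above can be sketched as a plain k-means loop. For determinism the sketch initializes centroids with the first k points rather than random ones (Step-2 in the notes uses random points), and the data are made up:

```python
def kmeans(points, k, n_iters=100):
    # Step 2: initial centroids (first k points here, for determinism;
    # the notes pick K random points instead)
    centroids = [list(p) for p in points[:k]]
    for _ in range(n_iters):
        # Step 3: assign each point to its closest centroid
        clusters = [[] for _ in range(k)]
        for p in points:
            dists = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids]
            clusters[dists.index(min(dists))].append(p)
        # Step 4: recompute each centroid as the mean of its cluster
        new = [[sum(col) / len(cl) for col in zip(*cl)] if cl else c
               for cl, c in zip(clusters, centroids)]
        if new == centroids:          # Step 5: stop when assignments settle
            break
        centroids = new
    return centroids
```

On two well-separated blobs of points, the loop converges to one centroid per blob after a couple of iterations.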
Suppose there are two categories, i.e., Category A and Category B, and we have
a new data point x1, so this data point will lie in which of these categories. To
solve this type of problem, we need a K-NN algorithm. With the help of K-NN,
we can easily identify the category or class of a particular dataset. Consider the
below diagram:
The K-NN working can be explained on the basis of the below algorithm:
Suppose we have a new data point and we need to put it in the required
category. Consider the below image:
Figure 4.7 Suppose we have a new data point and we need to put it in the
required category.
o As we can see in Figure 4.9, the 3 nearest neighbors are from category A;
hence this new data point must belong to category A.
Below are some points to remember while selecting the value of K in the K-NN
algorithm:
o There is no particular way to determine the best value for "K", so we need
to try some values to find the best out of them. The most preferred value
for K is 5.
o A very low value for K, such as K=1 or K=2, can be noisy and lead to the
effects of outliers in the model.
o Large values for K are good, but they may cause some difficulties.
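The K-NN classification just described can be sketched in a few lines; the training points below are made up to mirror the Category A / Category B example:

```python
from collections import Counter

def knn_predict(train, query, k=3):
    """Classify `query` by majority vote among its k nearest neighbours
    (squared Euclidean distance), as in the Category A / B example."""
    nearest = sorted(train, key=lambda item:
                     sum((a - b) ** 2 for a, b in zip(item[0], query)))
    top_labels = [label for _, label in nearest[:k]]
    return Counter(top_labels).most_common(1)[0][0]

# Made-up training data: two clusters of labelled points
train = [((1, 1), "A"), ((1, 2), "A"), ((2, 1), "A"),
         ((6, 6), "B"), ((6, 7), "B"), ((7, 6), "B")]
```

A query near the first cluster gets label "A" because all of its 3 nearest neighbours are Category A points.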
p(x) = Σi p(x | Gi) P(Gi)
where P(Gi) are the mixture proportions and p(x | Gi) are the component
densities.
For the mean and the precision (inverse covariance) matrix, we can use a
normal–Wishart prior.
Expectation-Maximization Algorithm
In k-means, clustering is the problem of finding codebook vectors that
minimize the total reconstruction error.
Here the approach is probabilistic and we look for the component density
parameters that maximize the likelihood of the sample.
Using the mixture model of the equation above, the log likelihood given the
sample X = {xᵗ}ₜ is
L(Φ | X) = log Πt p(xᵗ | Φ) = Σt log Σi p(xᵗ | Gi) P(Gi)
where Φ includes the priors P(Gi) and also the sufficient statistics of the
component densities p(xᵗ | Gi).
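The mixture log likelihood can be sketched for 1-D Gaussian components (an illustrative choice of component density; the data and parameters below are made up):

```python
import math

def gaussian(x, mu, var):
    """1-D Gaussian density N(x; mu, var)."""
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def mixture_log_likelihood(X, priors, mus, vars_):
    """L(Phi | X) = sum_t log sum_i p(x_t | G_i) P(G_i) for a 1-D
    Gaussian mixture (illustrative component density)."""
    return sum(math.log(sum(p * gaussian(x, m, v)
                            for p, m, v in zip(priors, mus, vars_)))
               for x in X)
```

EM searches for the parameters Φ (priors, means, variances) that maximize exactly this quantity; parameters that match the sample give a higher log likelihood than mismatched ones.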
A(x) = max(0, x)   (ReLU activation)
Hyperparameters: C = 0.3 and Alpha = 0.2