SYLLABUS:
Introduction to AI - AI Applications - Problem solving agents – search
algorithms – uninformed search strategies – Heuristic search
strategies – Local search and optimization problems – adversarial
search – constraint satisfaction problems (CSP)
PART A
1. What is Artificial Intelligence? Or Define Artificial Intelligence.
Artificial intelligence is a wide-ranging branch of computer science
concerned with building smart machines capable of performing tasks
that typically require human intelligence.
Artificial Intelligence is the process of building intelligent machines from
vast volumes of data.
Systems learn from past data and experience and perform human-
like tasks.
It enhances the speed, precision, and effectiveness of human efforts.
Comparison of Data Science, AI, and ML:
Data Science: includes data operations based on user requirements. It is
mainly used in fraud detection, healthcare, BI analysis, and more.
Artificial Intelligence: includes predictive modeling to predict events
based on the previous and current data. Applications of AI include chat
bots, voice assistants, and weather prediction.
Machine Learning: is a subset of Artificial Intelligence. Online
recommendations, facial recognition, and NLP are a few examples of ML.
Types of Agent
Simple reflex agents respond directly to percepts.
Model-based reflex agents maintain internal state to track aspects of
the world that are not evident in the current percept.
Goal-based agents act to achieve their goals, and
Utility-based agents try to maximize their own expected “happiness.”
17. Define Problem Solving Agent. List the steps involved in Problem
Solving Agent.
Problem Solving Agent
Problem-solving agent is a goal-based agent that focuses on goals
using a group of algorithms and techniques to solve a well-defined
problem.
CS 3491 – ARTIFICIAL INTELLIGENCE AND MACHINE LEARNING UNIT 1
An agent may need to plan a sequence of actions that form a path
to a goal state.
Such an agent is called a problem-solving agent, and the
computational process it undertakes is called search.
Steps performed by Problem-solving agent
Goal Formulation
Problem Formulation
Search
Solution
Execution
19. Define search algorithms and list the types of search algorithm.
A search algorithm takes a search problem as input and returns a
solution.
Types of Search Algorithm
Uninformed Search Algorithms
o Depth First Search
o Depth Limited Search
o Breadth First Search
o Iterative Deepening Search
o Uniform Cost Search
o Bidirectional Search
Informed (Heuristic) Search Strategies
o Greedy Search
o A* Tree Search
20. Define Depth First Search. List the advantages and disadvantages
of DFS.
• Depth-first search (DFS) is an algorithm for traversing or searching
tree or graph data structures.
• The algorithm starts at the root node and explores as far as possible
along each branch before backtracking.
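The traversal described above can be sketched as a short program (an illustrative sketch, not from the notes; the adjacency-list representation and node names are assumed):

```python
def dfs(graph, start, goal):
    """Iterative depth-first search: explores as deep as possible before backtracking."""
    stack = [(start, [start])]          # LIFO stack drives the depth-first order
    visited = set()
    while stack:
        node, path = stack.pop()
        if node == goal:
            return path
        if node in visited:
            continue
        visited.add(node)
        # Reverse so the leftmost neighbour is expanded first.
        for neighbour in reversed(graph.get(node, [])):
            if neighbour not in visited:
                stack.append((neighbour, path + [neighbour]))
    return None

# Assumed example graph
graph = {'S': ['A', 'B'], 'A': ['C'], 'B': ['D'], 'C': [], 'D': ['G'], 'G': []}
path = dfs(graph, 'S', 'G')
```

Because the stack is last-in, first-out, the search commits to one branch until it runs out of successors, which is exactly the backtracking behaviour noted above.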
21. Which solution would DFS find to move from node S to node G if run
on the graph below?
Solution.
22. Define Depth Limited Search. List its advantages and disadvantages.
A depth-limited search algorithm is similar to depth-first search with a
predetermined limit.
Depth-limited search overcomes the drawback of infinite paths in
depth-first search.
Advantages:
Depth-limited search is Memory efficient.
Disadvantages:
Depth-limited search also has a disadvantage of incompleteness.
It may not be optimal if the problem has more than one solution.
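A minimal recursive sketch of depth-limited search, assuming an adjacency-list graph (illustrative, not from the notes). Returning None when the limit is reached is precisely what makes the algorithm incomplete:

```python
def dls(graph, node, goal, limit, path=None):
    """Depth-limited search: DFS that never descends below `limit` levels."""
    if path is None:
        path = [node]
    if node == goal:
        return path
    if limit == 0:
        return None                      # cutoff: solution may lie deeper
    for neighbour in graph.get(node, []):
        result = dls(graph, neighbour, goal, limit - 1, path + [neighbour])
        if result is not None:
            return result
    return None

# Assumed example graph: G lies at depth 3
graph = {'S': ['A', 'B'], 'A': ['C'], 'B': ['D'], 'C': [], 'D': ['G'], 'G': []}
```

With limit 3 the goal is found; with limit 2 the search is cut off and fails, illustrating the incompleteness noted above.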
23. Define Breadth-first search (BFS). List its advantages and disadvantages.
• Breadth-first search (BFS) is an algorithm for traversing or searching
tree or graph data structures.
• It starts at the tree root and explores all of the neighbor nodes at the
present depth prior to moving on to the nodes at the next depth level.
Advantages of Breadth-first Search:
If a solution is available, BFS will provide it.
If there are multiple answers to a problem, BFS will present the simplest
solution with the fewest steps.
Disadvantages of Breadth-first Search:
It necessitates a large amount of memory.
If the solution is located far from the root node, BFS will take a long time.
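The level-by-level expansion can be sketched as follows (adjacency-list representation assumed); the FIFO queue holding every frontier node is what makes BFS memory-hungry:

```python
from collections import deque

def bfs(graph, start, goal):
    """Breadth-first search: returns a shallowest path from start to goal."""
    queue = deque([(start, [start])])    # FIFO queue gives level-order expansion
    visited = {start}
    while queue:
        node, path = queue.popleft()
        if node == goal:
            return path
        for neighbour in graph.get(node, []):
            if neighbour not in visited:
                visited.add(neighbour)
                queue.append((neighbour, path + [neighbour]))
    return None

# Assumed example graph: two routes to G, one shorter than the other
graph = {'S': ['A', 'B'], 'A': ['G'], 'B': ['C'], 'C': ['G'], 'G': []}
```

BFS returns the two-step route S-A-G rather than the longer S-B-C-G, matching the "fewest steps" advantage above.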
24. Which solution would BFS find to move from node S to node G if run on
the graph below?
Solution.
26. Define Uniform Cost Search. List its advantages and disadvantages.
Uniform-cost search is a searching algorithm for traversing a weighted
tree or graph.
When a separate cost is provided for each edge, this algorithm is used.
The uniform-cost search's main purpose is to discover the shortest path
to the goal node with the lowest cumulative cost.
Cost of a node is defined as:
cost(node) = cumulative cost of all nodes from root
cost(root) = 0
Advantages of Uniform-cost Search:
The path with the lowest cost is chosen at each state.
UCS is complete if the branching factor is finite and every step cost is
positive (no zero-weight loops).
UCS is optimal provided that no edge has a negative cost.
Disadvantages of Uniform-cost Search:
It considers only the cost of the path, ignoring how close a node is to
the goal, so it may expand many unnecessary nodes.
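A sketch of uniform-cost search with a priority queue ordered by cumulative path cost (weighted adjacency lists and edge costs assumed for illustration):

```python
import heapq

def ucs(graph, start, goal):
    """Uniform-cost search on a weighted graph {node: [(neighbour, cost), ...]}.
    Returns (total_cost, path) for the cheapest path, or None."""
    frontier = [(0, start, [start])]     # priority queue keyed on path cost g
    explored = set()
    while frontier:
        cost, node, path = heapq.heappop(frontier)
        if node == goal:                 # goal test on pop guarantees optimality
            return cost, path
        if node in explored:
            continue
        explored.add(node)
        for neighbour, step in graph.get(node, []):
            if neighbour not in explored:
                heapq.heappush(frontier, (cost + step, neighbour, path + [neighbour]))
    return None

# Assumed example: S-A-G costs 11, S-B-G costs 6
graph = {'S': [('A', 1), ('B', 5)], 'A': [('G', 10)], 'B': [('G', 1)], 'G': []}
```

Even though A is reached first (cost 1), UCS returns the cheaper overall route through B, the lowest-cumulative-cost behaviour described above.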
27. Which solution would UCS find to move from node S to node G if run on
the graph below?
Solution.
29. Define informed (heuristic) search strategies and mention its types.
Here, the algorithms have information on the goal state, which helps
in more efficient searching.
This information is obtained by something called a heuristic.
Search Heuristics:
In an informed search, a heuristic is a function h(n) that estimates how
close a state is to the goal state.
h(n) = estimated cost of the cheapest path from the state at node n to
a goal state.
30. Find the path from S to G using greedy search. The heuristic value h
of each node is shown below the node's name.
Solution.
Starting from S, we can traverse to A(h=9) or D(h=5). We choose D,
as it has the lower heuristic cost. Now from D, we can move to
B(h=4) or E(h=3). We choose E with a lower heuristic cost. Finally,
from E, we go to G(h=0). This entire traversal is shown in the search
tree below, in blue.
Solution.
Starting from S, the algorithm computes g(x) + h(x) for all nodes in
the fringe at each step, choosing the node with the lowest sum.
The entire work is shown in the table below.
Solution
PART B
1. Define Artificial Intelligence and its types, goals, advantages and limitations
in detail.
ARTIFICIAL INTELLIGENCE:
1.1 Artificial Intelligence Definition
1.2 Types of Artificial Intelligence
1.2.1 Types of Artificial Intelligence-based on capabilities
1.2.2 Types of Artificial Intelligence-based on functionalities
1.3 Four possible goals to pursue in artificial intelligence:
1.3.1 Acting humanly: The Turing test approach
1.3.2 Thinking humanly: The cognitive modeling approach
1.3.3 Thinking rationally: The “laws of thought” approach
1.3.4 Acting rationally: The rational agent approach
1.4 Subfields of Artificial Intelligence
1.5 Advantages of Artificial Intelligence
1.6 Limitations of Artificial Intelligence
Cognitive Science
o Cognitive science is the interdisciplinary study of mind and
intelligence, embracing philosophy, psychology, artificial
intelligence, neuroscience, linguistics, and anthropology.
o Cognitive Computing:
The ultimate goal of cognitive computing is to imitate the human
thought process in a computer model.
Using self-learning algorithms, pattern recognition by neural
networks, and natural language processing, a computer can mimic
the human way of thinking.
Here, computerized models are deployed to simulate the human
cognition process.
o Computer Vision:
Computer vision works by allowing computers to see, recognize, and
process images, the same way as human vision does, and then it
provides an appropriate output.
The computer must understand what it sees, and then analyze it,
accordingly.
o Natural Language Processing:
Natural language processing means developing methods that help us
communicate with machines using natural human languages like
English.
Table 1.2 - Comparison of Data Science, Artificial Intelligence and Machine Learning:
1950:
Alan Turing publishes Computing Machinery and Intelligence. In the
paper, Turing proposes to answer the question 'Can machines
think?' and introduces the Turing Test to determine whether a computer
can demonstrate the same intelligence as a human. The value of the
Turing test has been debated ever since.
1956:
John McCarthy coins the term 'artificial intelligence' at the first-ever
AI conference at Dartmouth College. Later that year, Allen Newell,
J.C. Shaw, and Herbert Simon create the Logic Theorist, the first-ever
running AI software program.
1958:
John McCarthy defined the high-level language Lisp, which was to
become the dominant AI programming language for the next 30 years.
1959:
At IBM, Nathaniel Rochester and his colleagues produced some of the
first AI programs.
Herbert Gelernter (1959) constructed the Geometry Theorem Prover,
which was able to prove theorems that many students of mathematics
would find quite tricky.
1967:
Frank Rosenblatt builds the Mark 1 Perceptron, the first computer
based on a neural network that 'learned' through trial and error.
1980s:
Neural networks, which use a backpropagation algorithm to train
themselves, become widely used in AI applications.
1997:
IBM's Deep Blue beats then world chess champion Garry Kasparov, in
a chess match (and rematch).
2011:
IBM Watson beats champions Ken Jennings and Brad Rutter
at Jeopardy!
2015:
Baidu's Minwa supercomputer uses a special kind of deep neural
network called a convolutional neural network to identify and categorize
images with a higher rate of accuracy than the average human.
2016:
DeepMind's AlphaGo program, powered by a deep neural network,
beats Lee Sedol, the world champion Go player, in a five-game match.
APPLICATIONS OF AI:
4.1. AI Application in E-Commerce
4.2. AI Application in Education
4.3. AI Application in Lifestyle
4.4. AI Application in Navigation
4.5. AI Application in Robotics
4.6. AI Application in Human Resource
4.7. AI Application in Healthcare
4.8. AI Application in Agriculture
4.9. AI Application in Gaming
4.10. AI Application in Automobiles
4.11. AI Application in Social Media
4.12. AI Application in Marketing
4.13. AI Application in Chatbots
4.14. AI Application in Finance
4.1. AI Application in E-Commerce
Personalized Shopping
o Artificial Intelligence technology is used to create
recommendation engines.
AGENTS
5.1 Agents and Environments
5.2 Basic Terminologies
5.3 Steps in designing an agent
5.4 Types of Agent
Environments
The environment could be everything—the entire universe.
Rational Agent
A rational agent acts so as to maximize the expected value of the
performance measure, given the percept sequence it has seen so
far.
Task Environment
A task environment specification includes the performance
measure, the external environment, the actuators, and the
sensors.
Model-based reflex
Goal-based agents
Learning agents:
It allows the agent to operate in initially unknown environments and to become more
competent than its initial knowledge alone might allow. A learning agent can be divided into
four conceptual components, as shown in figure 1.10:
Further, if the depth limit is changed to 3 (levels 0-3), the search will
again expand the nodes from level 0 to level 3 and terminate
with the sequence A->B->D->F->E->H, where H is the desired goal node.
Example:
Which solution would UCS find to move from node S to node G if run
on the graph below?
Solution.
Search Heuristics:
In an informed search, a heuristic is a function h(n) that estimates how
close a state is to the goal state.
h(n) = estimated cost of the cheapest path from the state at node n to
a goal state.
Example:
Question. Find the path from S to G using greedy search. The heuristic
value h of each node is shown below the node's name.
Solution.
Starting from S, we can traverse to A(h=9) or D(h=5). We choose D,
as it has the lower heuristic cost. Now from D, we can move to
B(h=4) or E(h=3). We choose E with a lower heuristic cost. Finally,
from E, we go to G(h=0). This entire traversal is shown in the search
tree below, in blue.
Heuristic:
The following points should be noted with heuristics in A* search.
f(x) = g(x) + h(x)
h(x) is called the forward cost and is an estimate of the distance of
the current node from the goal node.
g(x) is called the backward cost and is the cumulative cost of a
node from the root node.
A* search is optimal only when for all nodes, the forward cost for a
node h(x) underestimates the actual cost h*(x) to reach the goal.
This property of A* heuristic is called admissibility.
Admissibility: 0 <= h(x) <= h*(x)
Solution.
Starting from S, the algorithm computes g(x) + h(x) for all nodes in the
fringe at each step, choosing the node with the lowest sum.
The entire work is shown in the table below.
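The g(x) + h(x) bookkeeping can be sketched as follows (the edge costs and h(S) are assumed for illustration; the other h values follow the greedy example above, and h is admissible for this graph):

```python
import heapq

def astar(graph, h, start, goal):
    """A* search: expands the frontier node with the lowest f(x) = g(x) + h(x).
    graph: {node: [(neighbour, step_cost), ...]}, h: {node: heuristic estimate}."""
    frontier = [(h[start], 0, start, [start])]   # entries: (f, g, node, path)
    best_g = {start: 0}
    while frontier:
        f, g, node, path = heapq.heappop(frontier)
        if node == goal:
            return g, path
        for neighbour, step in graph.get(node, []):
            g2 = g + step                        # backward cost so far
            if g2 < best_g.get(neighbour, float('inf')):
                best_g[neighbour] = g2
                heapq.heappush(frontier,
                               (g2 + h[neighbour], g2, neighbour, path + [neighbour]))
    return None

# Assumed edge costs; h values for A, D, E, G taken from the greedy example
graph = {'S': [('A', 1), ('D', 2)], 'A': [('G', 10)],
         'D': [('E', 3)], 'E': [('G', 4)], 'G': []}
h = {'S': 7, 'A': 9, 'D': 5, 'E': 3, 'G': 0}
```

Because h underestimates the true remaining cost everywhere (admissibility), the first time G is popped its path S-D-E-G is optimal.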
A* algorithm
Example:
Question. Use graph searches to find paths from S to G in the following
graph.
Solution
The local search algorithm explores the above landscape by finding the
following two points:
Global Minimum: If the elevation corresponds to the cost, then the task
is to find the lowest valley, which is known as Global Minimum.
The topographical regions shown in the Figure 1.19 can be defined as:
Global Maximum: It is the highest point on the hill, which is the goal
state.
Local Maximum: It is a peak that is higher than its neighbouring states
but lower than the global maximum.
Flat local maximum: It is the flat area over the hill where it has no
uphill or downhill. It is a saturated point of the hill.
Shoulder: It is also a flat area where the summit is possible.
Current state: It is the current position of the person.
ADVERSARIAL SEARCH
9.1 Adversarial Search in Artificial Intelligence
9.2 Elements of Game Playing search
9.3 Example:
9.3.1 Game Tree For Tic Tac Toe
9.4 Types Of Algorithms In Adversarial Search
9.4.1 Minimax Strategy / Minimax Algorithm
9.4.2 Alpha – Beta Pruning
By R.Manickavasagan,AP/CSE
9.3 Example:
The game tic-tac-toe has three possible outcomes: win, lose, or draw,
with values +1, -1, or 0.
For the game tree designed for tic-tac-toe, refer to Figure 1.23.
Here, the node represents the game state and edges represent the moves
taken by the players.
Game Tree
A game tree is where the nodes represent the states of the game and
edges represent the moves made by the players in the game.
There are two players, namely:
o MIN: tries to decrease MAX's chances of winning the game.
o MAX: tries to increase its own chances of winning the game.
MAX will choose that path which will increase its utility value and MIN
will choose the opposite path which could help it to minimize MAX’s
utility value.
Step 2:
Now, first find the utility values for the Maximizer; its initial value
is -∞.
o For node D: max(-1, -∞) => max(-1, 4) = 4
o For node E: max(2, -∞) => max(2, 6) = 6
o For node F: max(-3, -∞) => max(-3, -5) = -3
o For node G: max(0, -∞) => max(0, 7) = 7
Step 3:
In the next step, it is the minimizer's turn, so it will compare all node
values with +∞ and find the third-layer node values.
o For node B= min(4,6) = 4
o For node C= min (-3, 7) = -3
Step 4:
Now it is the Maximizer's turn; it will again choose the maximum of all
node values and find the value for the root node.
In this game tree there are only 4 layers, so the root node is reached
immediately, but in real games there will be more than 4 layers.
o For node A max(4, -3)= 4
That was the complete workflow of the minimax two player game.
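The four steps above can be condensed into a recursive sketch (the leaf values follow the worked example; the dictionary representation of the tree is assumed):

```python
def minimax(node, is_max, tree):
    """Plain minimax over a game tree given as {node: children};
    leaves are utility values that appear directly in the child lists."""
    children = tree.get(node)
    if children is None:                 # leaf reached: node is its own utility
        return node
    values = [minimax(child, not is_max, tree) for child in children]
    return max(values) if is_max else min(values)

# Tree from the worked example: D=[-1,4], E=[2,6], F=[-3,-5], G=[0,7]
tree = {'A': ['B', 'C'], 'B': ['D', 'E'], 'C': ['F', 'G'],
        'D': [-1, 4], 'E': [2, 6], 'F': [-3, -5], 'G': [0, 7]}
value = minimax('A', True, tree)
```

The recursion reproduces the hand computation: D=4, E=6, F=-3, G=7 at the MAX layer; B=4, C=-3 at the MIN layer; and 4 at the root A.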
Working of Alpha-Beta Pruning
Step 1:
In the first step, the Max player starts the first move from node A,
where α = -∞ and β = +∞. These values of alpha and beta are passed
down to node B, where again α = -∞ and β = +∞, and node B passes the
same values to its child D.
Step 2:
At node D, the value of α will be calculated, as it is Max's turn. The
value of α is compared first with 2 and then with 3; max(2, 3) = 3
becomes the value of α at node D, and the node value is also 3.
Step 3:
The algorithm now backtracks to node B, where the value of β changes,
as it is Min's turn. β = +∞ is compared with the available successor
value: min(∞, 3) = 3. Hence, at node B,
α = -∞ and β = 3.
Step 4:
At node E, Max will take its turn, and the value of alpha will change.
The current value of alpha is compared with 5, so max(-∞, 5) = 5;
hence at node E, α = 5 and β = 3, where α >= β, so the right successor
of E will be pruned, the algorithm will not traverse it, and the value
at node E will be 5.
Step 5:
In the next step, the algorithm again backtracks the tree, from node B
to node A. At node A, the value of alpha changes: the maximum
available value is 3, as max(-∞, 3) = 3, and β = +∞. These two values
are now passed to the right successor of A, which is node C.
At node C, α=3 and β= +∞, and the same values will be passed on to
node F.
Step 6:
At node F, the value of α is compared first with the left child, 0:
max(3, 0) = 3; and then with the right child, 1: max(3, 1) = 3. So α
remains 3, but the node value of F becomes
1.
Step 7:
Node F returns the node value 1 to node C. At C, α = 3 and β = +∞; here
the value of beta changes: it is compared with 1, so min(∞, 1) =
1. Now at C, α = 3 and β = 1, which again satisfies the condition α >= β,
so the next child of C, which is G, is pruned, and the algorithm does
not compute the entire subtree of G.
Step 8:
C now returns the value 1 to A, where the best value for A is max
(3, 1) = 3.
The final game tree shows the nodes that were computed and the nodes
that were never computed.
Hence the optimal value for the maximizer is 3 for this example.
The game tree is evaluated from the last level upward,
and the values are chosen accordingly.
The rule which will be followed is: “Explore nodes if necessary
otherwise prune the unnecessary nodes.”
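The eight steps above can be condensed into a recursive sketch (the leaf values follow the worked example; values for the pruned leaves are assumed, since pruning makes them irrelevant):

```python
def alphabeta(node, is_max, tree, alpha=float('-inf'), beta=float('inf')):
    """Minimax with alpha-beta pruning; a branch is cut when alpha >= beta."""
    children = tree.get(node)
    if children is None:                 # leaf reached: node is its own utility
        return node
    if is_max:
        value = float('-inf')
        for child in children:
            value = max(value, alphabeta(child, False, tree, alpha, beta))
            alpha = max(alpha, value)
            if alpha >= beta:
                break                    # beta cutoff: MIN will never allow this
        return value
    value = float('inf')
    for child in children:
        value = min(value, alphabeta(child, True, tree, alpha, beta))
        beta = min(beta, value)
        if alpha >= beta:
            break                        # alpha cutoff: MAX will never allow this
    return value

# Tree from the walkthrough: D=[2,3], E=[5,...], F=[0,1], G pruned entirely
tree = {'A': ['B', 'C'], 'B': ['D', 'E'], 'C': ['F', 'G'],
        'D': [2, 3], 'E': [5, 9], 'F': [0, 1], 'G': [7, 5]}
value = alphabeta('A', True, tree)
```

The recursion reproduces the walkthrough: E's right child and the whole subtree of G are skipped, and the root value is 3.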
There are many possible solutions to this problem, such as in figure 1.26
CS 3491 – ARTIFICIAL INTELLIGENCE AND MACHINE LEARNING UNIT 2
SYLLABUS:
Acting under uncertainty – Bayesian inference – Naïve Bayes
models. Probabilistic reasoning – Bayesian networks – exact
inference in BN – approximate inference in BN – causal networks.
PART A
1. Define uncertainty and list the causes of uncertainty.
Uncertainty:
The knowledge representation A→B means that if A is true then B is
true. But in a situation where we are not sure whether A is true or not,
we cannot express this statement; this situation is called uncertainty.
So, to represent uncertain knowledge, uncertain reasoning or
probabilistic reasoning is used.
Causes of uncertainty in the real world:
1. Information occurred from unreliable sources.
2. Experimental errors
3. Equipment faults
4. Temperature variation
5. Climate change.
7. In a class, 70% of the students like English and 40% of the
students like both English and mathematics. What percentage
of the students who like English also like
mathematics?
Solution:
Let A be the event that a student likes Mathematics,
and B be the event that a student likes English.
P(A|B) = P(A ∧ B) / P(B) = 0.40 / 0.70 = 0.57
Hence, 57% of the students who like English also like Mathematics.
11. Consider two events: A (it will rain tomorrow) and B (the sun will
shine tomorrow).
Use Bayes’ theorem to compute the posterior probability of each event
occurring, given the resulting weather conditions for today:
P(A|sunny) = P(sunny|A) * P(A) / P(sunny)
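Plugging hypothetical numbers into this formula (all three probabilities below are assumed for illustration, not given in the notes):

```python
def posterior(likelihood, prior, evidence):
    """Bayes' theorem: P(A|B) = P(B|A) * P(A) / P(B)."""
    return likelihood * prior / evidence

# Assumed illustrative numbers:
# P(A) = P(rain) = 0.1, P(sunny|A) = 0.2, P(sunny) = 0.6
p_rain_given_sunny = posterior(likelihood=0.2, prior=0.1, evidence=0.6)
```

The posterior combines the prior belief in rain with how well "sunny" fits that hypothesis, normalized by the overall probability of a sunny day.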
Local Semantics
PART B
Agents almost never have access to the whole truth about their environment.
Agents must, therefore, act under uncertainty.
The problem is that this rule is wrong. Not all patients with toothaches have
cavities; some of them have gum disease, an abscess, or one of several other
problems:
Unfortunately, in order to make the rule true, we have to add an almost unlimited
list of possible causes. We could try turning the rule into a causal rule:
But this rule is not right either; not all cavities cause pain. The only way to fix the
rule is to make it logically exhaustive: to augment the left-hand side with all the
qualifications required for a cavity to cause a toothache. Even then, for the
purposes of diagnosis, one must also take into account the possibility that the
patient might have a toothache and a cavity that are unconnected. Trying to use
first-order logic to cope with a domain like medical diagnosis thus fails for three
main reasons:
That is, we expect that out of all the situations that are indistinguishable
from the current situation as far as the agent's knowledge goes, the patient will
have a cavity in 80% of them. This belief could be derived from statistical data-80%
of the toothache patients seen so far have had cavities-or from some general rules,
or from a combination of evidence sources.
The 80% summarizes those cases in which all the factors needed for a cavity
to cause a toothache are present and other cases in which the patient has both
toothache and cavity but the two are unconnected. The missing 20% summarizes
all the other possible causes of toothache that we are too lazy or ignorant to confirm
or deny.
1.2.3 Probability:
Probability can be defined as a chance that an uncertain event
will occur.
The value of probability always remains between 0 and 1, which
represent the limits of total uncertainty and total certainty.
o 0 ≤ P(A) ≤ 1, where P(A) is the probability of an event A.
o P(A) = 0, indicates total uncertainty in an event A.
o P(A) =1, indicates total certainty in an event A.
Formula to find the probability of an uncertain event:
P(A) = Number of favourable outcomes / Total number of outcomes
1.2.4.1 Example:
In a class, 70% of the students like English and 40% of the students
like both English and mathematics. What percentage of the students
who like English also like mathematics?
Solution:
Let A be the event that a student likes Mathematics,
and B be the event that a student likes English.
P(A|B) = P(A ∧ B) / P(B) = 0.40 / 0.70 = 0.57
Hence, 57% of the students who like English also like Mathematics.
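The same arithmetic as a short sketch:

```python
# From the example: P(B) = 0.70 (likes English), P(A and B) = 0.40 (likes both)
p_english = 0.70
p_english_and_math = 0.40

# Conditional probability: P(A|B) = P(A and B) / P(B)
p_math_given_english = p_english_and_math / p_english   # ≈ 0.571, i.e. 57%
```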
2. Explain in detail about Bayesian inference and Naive Bayes Model or Naive
Bayes Theorem or Bayes Rule.
Global Semantics
Local Semantics
Markov Blanket
Each node is conditionally independent of all others given its
Markov blanket: parents + children + children’s parents
3.4 Example:
Harry installed a new burglar alarm at his home to detect burglary.
The alarm responds reliably to a burglary but also responds
to minor earthquakes. Harry has two neighbors, David and Sophia,
who have taken the responsibility to inform Harry at work when they
hear the alarm. David always calls Harry when he hears the alarm,
but sometimes he gets confused with the phone ringing and calls at
that time too. On the other hand, Sophia likes to listen to loud music,
so she sometimes misses hearing the alarm. Here we would like to
compute the probability of Burglary Alarm.
Problem:
Calculate the probability that alarm has sounded, but there is
neither a burglary, nor an earthquake occurred, and David and
Sophia both called the Harry.
Solution:
The Bayesian network for the above problem is given in figure 2.2.
The network structure shows that Burglary and Earthquake are
the parent nodes of Alarm and directly affect the probability
of the alarm going off, while David's and Sophia's calls depend on the
alarm probability.
Let's take the observed probability for the Burglary and earthquake
component:
P(B=True) = 0.002, which is the probability of burglary.
P(B=False)= 0.998, which is the probability of no burglary.
P(E=True) = 0.001, which is the probability of a minor earthquake.
P(E=False) = 0.999, which is the probability that an earthquake has not
occurred.
From the formula of joint distribution, the problem statement in the form of
probability distribution:
P(S, D, A, ¬B, ¬E) = P (S|A) *P (D|A)*P (A|¬B ^ ¬E) *P (¬B) *P (¬E).
= 0.75* 0.91* 0.001* 0.998*0.999
= 0.00068045.
Hence, a Bayesian network can answer any query about the domain
by using Joint distribution.
2. Biomonitoring:
a. This involves the use of indicators to quantify the concentration of
chemicals in the human body.
3. Information retrieval:
a. Bayesian networks assist in information retrieval for research,
which is a constant process of extracting information from
databases.
4. Image processing:
a. A form of signal processing, image processing uses mathematical
operations to convert images into digital format.
5. Gene regulatory network:
a. A Bayesian network is an algorithm that can be applied to gene
regulatory networks in order to make predictions about the effects
of genetic variations on cellular phenotypes.
b. Gene regulatory networks are a set of mathematical equations that
describe the interactions between genes, proteins, and metabolites.
c. They are used to study how genetic variations affect the
development of a cell or organism.
6. Turbo code:
a. Turbo codes are a type of error correction code capable of achieving
very high data rates and long distances between error correcting
nodes in a communications system.
b. They have been used in satellites, space probes, deep-space
missions, military communications systems, and civilian wireless
communication systems, including WiFi and 4G LTE cellular
telephone systems.
7. Document classification:
a. The main issue is to assign a document multiple classes. The task
can be achieved manually and algorithmically. Since manual effort
takes too much time, algorithmic documentation is done to
complete it quickly and effectively.
4. Explain in detail about Bayesian Inference and its type Exact Inference
with suitable example.
The basic task for any probabilistic inference system is to compute the
posterior probability distribution for a set of query variables, given some observed
event-that is, some assignment of values to a set of evidence variables. We will use
the following notation: X denotes the query variable; E denotes the set
of evidence variables E1, ..., Em, and e is a particular observed event; Y
denotes the nonevidence variables Y1, ..., Yn (sometimes called the
hidden variables). Thus, the complete set of variables is
X = {X} ∪ E ∪ Y. A typical query asks for the posterior probability
distribution P(X | e).
Inference by enumeration
Conditional probability can be computed by summing terms from the full joint
distribution. More specifically, a query P(X|e) can be answered using Equation,
which we repeat here for convenience:
Consider the query P(Burglary | JohnCalls = true, MaryCalls = true). The
hidden variables for this query are Earthquake and Alarm. From Equation (13.6),
using initial letters for the variables in order to shorten the expressions, we have
The structure of this computation is shown in the above figure. Using the numbers
from the figure, we obtain P(b | j, m) = α × 0.00059224. The corresponding
computation for ¬b yields α × 0.0014919; hence
P(B | j, m) = α ⟨0.00059224, 0.0014919⟩ ≈ ⟨0.284, 0.716⟩.
That is, the chance of a burglary, given calls from both neighbors, is about 28%.
The evaluation process for the expression in Equation is shown as an expression
tree in Figure.
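The normalization step hidden in the constant α can be sketched as follows (the unnormalized values come from the enumeration above):

```python
def normalize(values):
    """Apply the normalization constant α: scale entries so they sum to 1."""
    total = sum(values)
    return [v / total for v in values]

# Unnormalized enumeration results for b and ¬b from the text above
posterior = normalize([0.00059224, 0.0014919])   # ≈ [0.284, 0.716]
```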
If the number of parents of each node is bounded by a constant, then
the complexity will also be linear in the number of nodes. These results
hold for any ordering consistent with the topological ordering of the network.
A network with 2 nodes (fire icon and smoke icon) and 1 edge (arrow
pointing from fire to smoke).
This network can be both a Bayesian or causal network.
The key distinction, however, is when interpreting this network.
For a Bayesian network, we view the nodes as variables and the arrow
as a conditional probability, namely the probability of smoke given
information about fire.
When interpreting this as a causal network, we still view nodes as
variables, however, the arrow indicates a causal connection.
In this case, both interpretations are valid. However, if we were to flip the
edge direction, the causal network interpretation would be invalid, since
smoke does not cause fire.
1.The do-operator
The do-operator is a mathematical representation of a physical
intervention.
If the model starts with Z → X → Y, simulate an intervention in X by
deleting all the incoming arrows to X, and manually setting X to some
value x_0. Refer Figure 2.4 denotes the example of do-operator.
2: Confounding
A simple example of confounding is shown in the figure 2.5 below.
We begin with a random sampling process for a Bayes net that has no evidence
associated with it. The idea is to sample each variable in turn, in topological order.
The probability distribution from which the value is sampled is conditioned on the
values already assigned to the variable’s parents. (Because we sample in topological
order, the parents are guaranteed to have values already.) This algorithm is shown
in Figure. Applying it to the network with the ordering Cloudy, Sprinkler, Rain,
WetGrass, we might produce a random event as follows:
Let ˆP(X |e) be the estimated distribution that the algorithm returns; this
distribution is computed by normalizing NPS(X,e), the vector of sample counts for
each value of X where the sample agrees with the evidence e:
Consider the query P(Rain | Sprinkler = true, WetGrass = true) applied to the
network in the figure. The evidence variables Sprinkler and WetGrass are fixed to
their observed values, and the hidden variables Cloudy and Rain are initialized
randomly; let us say to true and false respectively. Thus, the initial state is
[true, true, false, true]. Now the following steps are executed repeatedly:
1. Cloudy is sampled, given the current values of its Markov blanket variables: in
this case, we sample from P(Cloudy | Sprinkler = true, Rain = false). Suppose the
result is Cloudy = false. Then the new current state is [false, true, false, true].
2. Rain is sampled, given the current values of its Markov blanket variables: in this
case, we sample from P(Rain | Cloudy = false, Sprinkler = true, WetGrass = true).
Suppose this yields Rain = true. The new current state is [false, true, true, true].
Each state visited during this process is a sample that contributes to the estimate
for the query variable Rain. If the process visits 20 states where Rain is true and 60
states where Rain is false, then the answer to the query is
NORMALIZE(⟨20, 60⟩) = ⟨0.25, 0.75⟩.
CS 3491 – ARTIFICIAL INTELLIGENCE AND MACHINE LEARNING UNIT 3
SYLLABUS:
Introduction to machine learning – Linear Regression Models: Least
squares, single & multiple variables, Bayesian linear regression,
gradient descent, Linear Classification Models: Discriminant function –
Probabilistic discriminative model - Logistic regression, Probabilistic
generative model – Naive Bayes, Maximum margin classifier – Support
vector machine, Decision Tree, Random forests
PART A
1. Define Machine Learning.
Arthur Samuel, an early American leader in the field of computer gaming
and artificial intelligence, coined the term “Machine Learning” in 1959
while at IBM.
He defined machine learning as “the field of study that gives computers
the ability to learn without being explicitly programmed”.
Machine learning is programming computers to optimize a performance
criterion using example data or past experience. The model may be
predictive to make predictions in the future, or descriptive to gain
knowledge from data.
9. Define Classification.
The Classification algorithm is a Supervised Learning technique that is
used to identify the category of new observations on the basis of training
data.
In Classification, a program learns from the given dataset or observations
and then classifies new observations into a number of classes or groups,
such as Yes or No, 0 or 1, Spam or Not Spam, or cat or dog. Classes are
also called targets, labels, or categories.
y = mx + c
where
m: slope
c: y-intercept
Y′ = bX + a
where
Y′ represents the predicted value;
X represents the known value;
b and a represent numbers calculated from the original correlation
analysis
P(H | E) = [P(E | H) · P(H)] / P(E)
where
o Posterior: the probability that an event H occurs given that another
event E has already occurred, i.e., P(H | E).
o Prior: the probability of event H before observing E, i.e., P(H).
o Likelihood: the probability of observing the evidence E given that H
holds, i.e., P(E | H).
o Multi-class Problems:
o If a classification problem has more than two outcomes, it is
called a multi-class classification problem.
o Example: classification of types of crops, classification of types
of music.
The cost is 0 if the predicted value and the actual value have the same
sign; otherwise, the loss value is calculated.
PART B
1.1.3 Examples
1.1.3.1 Handwriting recognition learning problem
Task T : Recognizing and classifying handwritten words
within images
Performance P : Percent of words correctly classified
Training experience E : A dataset of handwritten words with
given classifications
1.1.3.2 A robot driving learning problem
Task T : Driving on highways using vision sensors
Performance P : Average distance traveled before an error
Training experience E : A sequence of images and steering
commands recorded while observing a human driver
1.3.1 Classification:
The Classification algorithm is a Supervised Learning technique
that is used to identify the category of new observations on the
basis of training data.
In Classification, a program learns from the given dataset or
observations and then classifies new observations into a number of
classes or groups, such as Yes or No, 0 or 1, Spam or Not Spam, or
cat or dog. Classes are also called targets, labels, or categories.
1.3.2 Regression:
Regression is a supervised learning technique which helps in
finding the correlation between variables and enables us to
predict a continuous output variable based on one or more
predictor variables.
CS 3491 – ARTIFICIAL INTELLIGENCE AND MACHINE LEARNING UNIT 3
1.3.3 Clustering:
Clustering or cluster analysis is a machine learning technique,
which groups the unlabeled dataset.
It can be defined as "A way of grouping the data points into
different clusters, consisting of similar data points.
The objects with the possible similarities remain in a group that
has less or no similarities with another group."
y = mx + c
where
m: slope
c: y-intercept
Y′ = bX + a
where
Y′ represents the predicted value;
X represents the known value;
b and a represent numbers calculated from the original correlation
analysis
Step 6: Calculate the slope and the y-intercept using the formulas
# Calculating 'm' and 'c'
num = 0
denom = 0
for i in range(n):
    num += (X[i] - mean_x) * (Y[i] - mean_y)
    denom += (X[i] - mean_x) ** 2
m = num / denom
c = mean_y - (m * mean_x)

# Printing coefficients
print("Coefficients")
print(m, c)
Output:
Mathematical Approach:
Residual/Error = Actual values − Predicted values
Sum of Residuals/Errors = Sum(Actual − Predicted values)
Square of Sum of Residuals/Errors = (Sum(Actual − Predicted values))²
P(H | E) = [P(E | H) · P(H)] / P(E)
where
o Posterior: the probability that an event H occurs given that another
event E has already occurred, i.e., P(H | E).
o Prior: the probability of event H before observing E, i.e., P(H).
o Likelihood: the probability of observing the evidence E given that H
holds, i.e., P(E | H).
Program
from sklearn.datasets import load_boston
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score
from sklearn.linear_model import BayesianRidge

# The fitting code was missing from the notes; a minimal completion
# (the exact split used to produce the output below is an assumption):
X, y = load_boston(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.15,
                                                    random_state=42)
model = BayesianRidge().fit(X_train, y_train)
print("Test Set r2 score :", r2_score(y_test, model.predict(X_test)))
Output
Test Set r2 score : 0.7943355984883815
The goal is to minimize the cost as much as possible in order to find the
best fit line.
Learning Rate
A learning rate is used for each pair of input and output values. It is a
scalar factor by which the coefficients are updated in the direction that
minimizes the error.
The process is repeated until a minimum sum squared error is achieved or
no further improvement is possible.
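The update loop described above can be sketched for simple linear regression; the data and learning rate below are made up for illustration:

```python
# Gradient descent for simple linear regression (illustrative sketch;
# the data, learning rate, and iteration count are made up).
X = [1, 2, 3, 4, 5]
Y = [3, 5, 7, 9, 11]          # exactly y = 2x + 1
m, c = 0.0, 0.0
lr = 0.01                     # learning rate (scalar factor)
n = len(X)
for _ in range(5000):
    # Gradients of the mean squared error with respect to m and c
    grad_m = (-2 / n) * sum((Y[i] - (m * X[i] + c)) * X[i] for i in range(n))
    grad_c = (-2 / n) * sum((Y[i] - (m * X[i] + c)) for i in range(n))
    m -= lr * grad_m          # step in the direction that reduces the error
    c -= lr * grad_c
print(round(m, 2), round(c, 2))
```

The loop stops after a fixed number of iterations here; in practice it stops once the sum of squared errors no longer improves.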
In figure 3.7, the hyperplane H divides the feature space into two
half-spaces:
o Decision region R1 for class ω1
o Decision region R2 for class ω2.
The discriminant function g(x) gives an algebraic measure of the
distance from x to the hyperplane.
x can be expressed as
x = xp + r (w / ‖w‖)   (3.4)
where xp is the normal projection of x onto H, and r is the desired
algebraic distance, which is positive if x is on the positive side and
negative if x is on the negative side. Then, because g(xp) = 0 and
g(x) = wᵀx + w0,
g(x) = r ‖w‖   (3.5)
or
r = g(x) / ‖w‖   (3.6)
o If w0 = 0, then g(x) has the homogeneous form g(x) = wᵀx, and the
hyperplane passes through the origin.
In particular, the distance from the origin to H is w0 / ‖w‖; if w0 > 0
the origin lies on the positive side of H, and if w0 < 0 it lies on the
negative side (equations 3.7–3.9).
For the multi-class case, the posterior probability of class Ck is given by a
softmax transformation of a linear function of x:
P(Ck | x) = exp(ak) / Σj exp(aj),   where ak = wkᵀx + wk0
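A numerically stable sketch of that softmax transformation (the activation values below are made up for illustration):

```python
import math

def softmax_posteriors(a):
    """P(C_k | x) = exp(a_k) / sum_j exp(a_j) for linear activations a_k."""
    mx = max(a)                            # subtract max for numerical stability
    exps = [math.exp(v - mx) for v in a]
    s = sum(exps)
    return [e / s for e in exps]

# Example activations a_k = w_k^T x + w_k0 for three classes (made up)
probs = softmax_posteriors([2.0, 1.0, 0.1])
```

The outputs are non-negative and sum to 1, so they can be read as class posteriors.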
The cost is 0 if the predicted value and the actual value have the same
sign; otherwise, the loss value is calculated.
The objective of the regularization parameter is to balance the margin
maximization and loss.
After adding the regularization parameter, the cost function takes the form
J(w) = λ‖w‖² + (1/n) Σi max(0, 1 − yi (wᵀxi + b))
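This hinge loss plus regularization can be sketched directly. The exact cost formula in the notes' figure is not reproduced, so the code below uses one common form of the soft-margin objective (an assumption), with made-up data:

```python
def hinge_loss(y, f):
    """Cost is 0 when prediction f and label y (±1) agree with margin ≥ 1."""
    return max(0.0, 1.0 - y * f)

def svm_cost(w, b, X, Y, lam):
    """Regularized cost: lam * ||w||^2 + average hinge loss (one common
    form; the exact formula in the notes' figure is not shown)."""
    margin_term = lam * sum(wi * wi for wi in w)
    losses = [hinge_loss(y, sum(wi * xi for wi, xi in zip(w, x)) + b)
              for x, y in zip(X, Y)]
    return margin_term + sum(losses) / len(losses)
```

The regularization weight lam trades off margin maximization against the total loss, as described above.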
Different kernel functions can be specified for the decision function, and
it is possible to specify custom kernels.
6.6 Disadvantages
If the number of features is a lot bigger than the number of data points,
choosing kernel functions and regularization term is crucial.
SVMs don't directly provide probability estimates. Those are calculated
using an expensive five-fold cross-validation.
Works best on small sample sets because of its high training time.
6.7 Applications
SVMs can be used to solve various real-world problems:
SVMs are helpful in text and hypertext categorization.
Classification of images can also be performed using SVMs.
Classification of satellite data like SAR data using supervised SVM.
Hand-written characters can be recognized using SVM.
The SVM algorithm has been widely applied in the biological and other
sciences. They have been used to classify proteins with up to 90% of the
compounds classified correctly.
Example:
Suppose there is a candidate who has a job offer and wants to decide
whether he should accept the offer or Not. So, to solve this problem, the
decision tree starts with the root node (Salary attribute by ASM). The root
node splits further into the next decision node (distance from the office) and
one leaf node based on the corresponding labels. The next decision node
further gets split into one decision node (Cab facility) and one leaf node.
Finally, the decision node splits into two leaf nodes (Accepted offer and
Declined offer). Consider the diagram in Figure 3.14.
Decision Tree
7.7.1 Entropy:
Entropy is a metric to measure the impurity in a given
attribute.
Entropy is a measure of the randomness in the information
being processed.
The higher the entropy, the harder it is to draw any conclusions
from that information.
Flipping a coin is an example of an action that provides
information that is random.
Entropy can be calculated as:
Entropy(S) = − Σi pi log2(pi)
where pi is the proportion of examples belonging to class i.
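The entropy formula can be sketched in Python (the label lists below are made up):

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy(S) = -sum_i p_i * log2(p_i) over the class proportions p_i."""
    n = len(labels)
    counts = Counter(labels)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())
```

A fair coin flip (half "yes", half "no") gives the maximum entropy of 1 bit, while a pure set gives 0, matching the intuition above.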
7.7.6 Chi-Square
The acronym CHAID stands for Chi-squared Automatic Interaction
Detector.
It measures the statistical significance of the differences between the
sub-nodes and the parent node.
RANDOM FOREST
8.1 Random Forest
8.2 Steps in the working process of Random Forest
8.3 Need for Random Forest
8.4 Example:
8.5 Important Features of Random Forest
8.6 Applications of Random Forest
8.7 Advantages of Random Forest
8.8 Disadvantages of Random Forest
8.9 Difference between Decision Tree & Random Forest
In the above figure 3.16, the majority decision tree gives output as an
apple when compared to a banana, so the final output is taken as an
apple.
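The majority-vote step in Figure 3.16 can be sketched directly (the tree predictions below are made up to mirror the apple/banana example):

```python
from collections import Counter

def majority_vote(predictions):
    """Final Random Forest output = the class predicted by the most trees."""
    return Counter(predictions).most_common(1)[0][0]

# Hypothetical outputs from five decision trees, as in the figure's example
fruit = majority_vote(["apple", "banana", "apple", "apple", "banana"])
```

Since three of the five trees vote "apple", the forest's final output is "apple".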
UNIT IV
PART A
4. Explain Clustering.
Clustering is the act of organizing similar objects into groups within
a machine learning algorithm. Assigning related objects into clusters is
beneficial for AI models, and clustering has many uses in data science.
CS3491 AIML Unit-4
5. What is Cluster?
A cluster is a group of objects that belong to the same class. In other
words, similar objects are grouped in one cluster and dissimilar objects
are grouped in another cluster.
6. What is bagging?
Bagging, also known as Bootstrap aggregation, is an ensemble method that
works by training multiple models independently and combining them later
to produce a stronger model.
7. Define Boosting.
Boosting refers to a group of algorithms that utilize weighted averages to
turn weak learners into stronger learners.
Bagging aims to decrease variance, not bias; Boosting aims to decrease
bias, not variance.
PART - B
1. Give Short notes on combining multiple learners.
Combining multiple learners means building models composed of multiple
learners that complement each other, so that by combining them we attain
higher accuracy.
Rationale
In any application, we can use one of several learning algorithms, and
with certain algorithms, there are hyperparameters that affect the final
learner.
For example, in a classification setting, we can use a parametric
classifier or a multilayer perceptron, and, for example, with a multilayer
perceptron, we should also decide on the number of hidden units.
The No Free Lunch Theorem states that there is no single learning
algorithm that in any domain always induces the most accurate learner.
The usual approach is to try many and choose the one that performs the
best on a separate validation set.
Each learning algorithm dictates a certain model that comes with a set of
assumptions. This inductive bias leads to error if the assumptions do not
hold for the data.
The performance of a learner may be fine-tuned to get the highest
possible accuracy on a validation set, but this fine tuning is a complex
task and still there are instances on which even the best learner is not
accurate enough.
Data fusion is the process of fusing multiple records representing the
same real-world object into a single, consistent, and clean representation.
Fusion of data for improving prediction accuracy and reliability is an
important problem in machine learning.
Combining different models is done to improve the performance of deep
learning models.
Building a new model by combining requires less time, data, and
computational resources.
The most common method to combine models is by averaging multiple
models, where taking a weighted average improves the accuracy.
Different Algorithms
We can use different learning algorithms to train different base-learners.
Different algorithms make different assumptions about the data and lead to
different classifiers.
Figure 4.1 Base-learners are dj and their outputs are combined using f(·).
Voting
The simplest way to combine multiple classifiers is by voting, which
corresponds to taking a linear combination of the learners (see figure 4.1)
y = Σj wj dj,   where wj ≥ 0 and Σj wj = 1   (Eq. 2)
This is also known as ensembles and linear opinion pools. In the simplest case,
all learners are given equal weight and we have simple voting that corresponds
to taking an average.
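The linear combination above can be sketched as a small helper; the base-learner outputs and weights below are made up for illustration:

```python
def combine(outputs, weights=None):
    """Linear combination y = sum_j w_j * d_j of base-learner outputs.
    With equal weights (simple voting) this reduces to the plain average."""
    L = len(outputs)
    if weights is None:
        weights = [1.0 / L] * L       # simple voting: w_j = 1/L
    return sum(w * d for w, d in zip(weights, outputs))
```

For example, three learners outputting 0.2, 0.4, and 0.9 combine under simple voting to their average, 0.5.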
Table 4.2 Example of combination rules on three learners and three classes
yi = Σj wj dji   (Eq. 3)
Let us assume that dj are iid with expected value E[dj] and variance Var(dj);
then when we take a simple average with wj = 1/L, the expected value and
variance of the output are
E[y] = E[(1/L) Σj dj] = E[dj]
Var(y) = Var((1/L) Σj dj) = (1/L²) · L · Var(dj) = Var(dj) / L   (Eq. 4)
We see that the expected value does not change, so the bias does not
change.
But variance and therefore mean square error, decreases as the number
of independent voters, L, increases.
In the general case,
Var(y) = (1/L²) [Σj Var(dj) + 2 Σj Σi<j Cov(dj, di)]   (Eq. 5)
which implies that if learners are positively correlated, variance (and
error) increases.
We can thus view using different algorithms and input features as efforts
to decrease, if not completely eliminate, the positive correlation.
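The variance-reduction claim of Eq. 4 can be checked with a quick simulation; the learner outputs below are made-up i.i.d. Gaussian draws:

```python
import random
import statistics

# Averaging L i.i.d. learner outputs leaves the expected value unchanged
# but divides the variance by L (illustrative simulation, made-up data).
rng = random.Random(42)

def learner_output():
    return rng.gauss(0.0, 1.0)       # E[d_j] = 0, Var(d_j) = 1

def ensemble_output(L):
    return sum(learner_output() for _ in range(L)) / L

singles = [ensemble_output(1) for _ in range(20000)]
averaged = [ensemble_output(10) for _ in range(20000)]
print(statistics.pvariance(singles), statistics.pvariance(averaged))
```

The single-learner variance comes out near 1, and the 10-learner average near 1/10, while both means stay near 0.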
Error-Correcting Output Codes
The main classification task is defined in terms of a number of subtasks
that are implemented by the base learners.
The idea is that the original task of separating one class from all other
classes may be a difficult problem.
We want to define a set of simpler classification problems, each
specializing in one aspect of the task, and combining these simpler
classifiers, we get the final classifier.
Base-learners are binary classifiers having output −1/ + 1, and there is a
code matrix W of K × L whose K rows are the binary codes of classes in
terms of the L base-learners dj.
The code matrix W codes the classes in terms of the learners. In
one-per-class coding, L = K and W has +1 on the diagonal and −1 elsewhere.
The problem here is that if there is an error with one of the base-
learners, there may be a misclassification because the class code words
are so similar. So the approach in error correcting codes is to have L > K
and increase the Hamming distance between the code words.
With reasonable L, find W such that the hamming distance between rows
and columns are maximized.
ECOC can be written as a voting scheme where the entries of W, wij, are
considered as vote weights:
yi = Σj wij dj
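This decoding step can be sketched as follows, using a one-per-class code matrix for K = L = 3 (an illustrative choice, not a matrix from the notes):

```python
def ecoc_predict(outputs, code_matrix):
    """Assign the class whose code word best matches the base-learner
    outputs; score_i = sum_j w_ij * d_j (the voting form of ECOC)."""
    scores = [sum(w * d for w, d in zip(row, outputs)) for row in code_matrix]
    return scores.index(max(scores))

# K = 3 classes, L = 3 one-per-class code words (+1 on the diagonal)
W = [[+1, -1, -1],
     [-1, +1, -1],
     [-1, -1, +1]]
```

With one-per-class codes a single learner error can produce a tie, which is exactly why error-correcting codes take L > K to increase the Hamming distance between code words.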
Advantages:
1. Reduce overfitting of the model.
2. Handles higher dimensionality data very well.
3. Maintains accuracy for missing data
Disadvantage:
Since the final prediction is based on the mean prediction from the subset
trees, it won't give precise values for the classification and regression model.
(Table: Bagging vs. Boosting comparison)
o Now the base model is fitted on the first n−1 folds, and it makes
predictions for the nth fold.
o The prediction made in the above step is added to the x1_train list.
o Repeat steps 2 & 3 for the remaining folds, which gives an x1_train
array of size n.
o Now, the model is trained on all the n parts, which will make predictions
for the sample data.
o Add this prediction to the y1_test list.
o In the same way, we can find x2_train, y2_test, x3_train, and y3_test by
using Model 2 and 3 for training, respectively, to get Level 2 predictions.
o Now train the Meta model on level 1 prediction, where these predictions
will be used as features for the model (refer Figure 4.3).
o Finally, Meta learners can now be used to make a prediction on test data
in the stacking model.
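The out-of-fold (Level-1) prediction step above can be sketched with a trivial stand-in model (a mean predictor, purely illustrative; real stacking would use trained base learners):

```python
# Minimal sketch of the out-of-fold step in stacking. The "model" here
# just predicts the mean of its training labels, so the procedure stays
# self-contained; real base models would be fitted in its place.
def mean_model(train_y):
    mu = sum(train_y) / len(train_y)
    return lambda x: mu

def out_of_fold_predictions(X, y, n_folds, fit):
    n = len(X)
    fold = [i * n_folds // n for i in range(n)]   # contiguous fold labels
    preds = [None] * n
    for k in range(n_folds):
        train_y = [y[i] for i in range(n) if fold[i] != k]
        model = fit(train_y)                      # train on the other folds
        for i in range(n):
            if fold[i] == k:
                preds[i] = model(X[i])            # predict the held-out fold
    return preds
```

Each sample's Level-1 prediction comes from a model that never saw that sample, which is what keeps the meta-learner's training data honest.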
The algorithm takes the unlabeled dataset as input, divides the dataset into
k clusters, and repeats the process until it finds the best clusters. The
value of k should be predetermined in this algorithm.
Hence each cluster has data points with some commonalities, and each
cluster is away from the other clusters.
Step-2: Select K random points as centroids. (They may be points other
than those from the input dataset.)
Step-3: Assign each data point to their closest centroid, which will form the
predefined K clusters.
Step-4: Calculate the variance and place a new centroid of each cluster.
Step-5: Repeat the third step, reassigning each data point to the new
closest centroid of each cluster.
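The steps above can be sketched as a plain k-means loop. For determinism the sketch initializes centroids with the first k points rather than random ones (Step-2 in the notes uses random points), and the data are made up:

```python
def kmeans(points, k, n_iters=100):
    # Step 2: initial centroids (first k points here, for determinism;
    # the notes pick K random points instead)
    centroids = [list(p) for p in points[:k]]
    for _ in range(n_iters):
        # Step 3: assign each point to its closest centroid
        clusters = [[] for _ in range(k)]
        for p in points:
            dists = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids]
            clusters[dists.index(min(dists))].append(p)
        # Step 4: recompute each centroid as the mean of its cluster
        new = [[sum(col) / len(cl) for col in zip(*cl)] if cl else c
               for cl, c in zip(clusters, centroids)]
        if new == centroids:          # Step 5: stop when assignments settle
            break
        centroids = new
    return centroids
```

On two well-separated blobs of points, the loop converges to one centroid per blob after a couple of iterations.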
Suppose there are two categories, i.e., Category A and Category B, and we have
a new data point x1, so this data point will lie in which of these categories. To
solve this type of problem, we need a K-NN algorithm. With the help of K-NN,
we can easily identify the category or class of a particular dataset. Consider the
below diagram:
The K-NN working can be explained on the basis of the below algorithm:
Suppose we have a new data point and we need to put it in the required
category. Consider the below image:
Figure 4.7 Suppose we have a new data point and we need to put it in the
required category.
o As we can see in Figure 4.9, the 3 nearest neighbors are from category A;
hence this new data point must belong to category A.
Below are some points to remember while selecting the value of K in the K-NN
algorithm:
o There is no particular way to determine the best value for "K", so we need
to try some values to find the best out of them. The most preferred value
for K is 5.
o A very low value for K, such as K=1 or K=2, can be noisy and lead to the
effects of outliers in the model.
o Large values for K are good, but they may cause some difficulties.
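The K-NN classification just described can be sketched in a few lines; the training points below are made up to mirror the Category A / Category B example:

```python
from collections import Counter

def knn_predict(train, query, k=3):
    """Classify `query` by majority vote among its k nearest neighbours
    (squared Euclidean distance), as in the Category A / B example."""
    nearest = sorted(train, key=lambda item:
                     sum((a - b) ** 2 for a, b in zip(item[0], query)))
    top_labels = [label for _, label in nearest[:k]]
    return Counter(top_labels).most_common(1)[0][0]

# Made-up training data: two clusters of labelled points
train = [((1, 1), "A"), ((1, 2), "A"), ((2, 1), "A"),
         ((6, 6), "B"), ((6, 7), "B"), ((7, 6), "B")]
```

A query near the first cluster gets label "A" because all of its 3 nearest neighbours are Category A points.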
p(x) = Σi p(x | Gi) P(Gi)
where P(Gi) are the mixture proportions and p(x | Gi) are the component
densities.
For the mean and the precision (inverse covariance) matrix, we can use a
normal–Wishart prior.
Expectation-Maximization Algorithm
In k-means, clustering is the problem of finding codebook vectors that
minimize the total reconstruction error.
Here the approach is probabilistic and we look for the component density
parameters that maximize the likelihood of the sample.
Using the mixture model of the equation above, the log likelihood given the
sample X = {xᵗ}ₜ is
L(Φ | X) = log Πt p(xᵗ | Φ) = Σt log Σi p(xᵗ | Gi) P(Gi)
where Φ includes the priors P(Gi) and also the sufficient statistics of the
component densities p(xᵗ | Gi).
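The mixture log likelihood can be sketched for 1-D Gaussian components (an illustrative choice of component density; the data and parameters below are made up):

```python
import math

def gaussian(x, mu, var):
    """1-D Gaussian density N(x; mu, var)."""
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def mixture_log_likelihood(X, priors, mus, vars_):
    """L(Phi | X) = sum_t log sum_i p(x_t | G_i) P(G_i) for a 1-D
    Gaussian mixture (illustrative component density)."""
    return sum(math.log(sum(p * gaussian(x, m, v)
                            for p, m, v in zip(priors, mus, vars_)))
               for x in X)
```

EM searches for the parameters Φ (priors, means, variances) that maximize exactly this quantity; parameters that match the sample give a higher log likelihood than mismatched ones.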
A(x) = max(0, x)   (ReLU activation)
Hyperparameters: C = 0.3 and Alpha = 0.2