
M4 - Clustering


MACHINE LEARNING TECHNIQUES

MCA301
Module 4 - Clustering Topics
1. Clustering Introduction
2. Similarity measures
3. Clustering criteria
4. Distance functions
5. k-Means Clustering
6. Hierarchical clustering
7. DBSCAN
8. Combining Multiple Learners
• Voting
• Bagging
• Boosting
Clustering in machine learning

Clustering or cluster analysis is a machine learning technique which groups an unlabelled dataset.

It can be defined as "a way of grouping the data points into different clusters, each consisting of similar data points."
How does Clustering Work in Machine Learning?

In clustering, we group an unlabeled data set; this is known as unsupervised learning.

➢ When we first group unlabeled data, we need to find groups of similar points.

➢ When we create a group, we need to understand the features of the dataset, i.e., which attributes make items similar.

➢ If we create a group based on only one or two features, it is easy to measure similarity.
Clustering is an unsupervised learning technique which aims at grouping a set of objects into clusters so that objects in the same cluster are as similar as possible, whereas objects in one cluster are as dissimilar as possible from objects in other clusters.

Cluster analysis aims to group a collection of patterns into clusters based on similarity. A typical clustering technique uses a similarity function for comparing various data items.
Distance-based Clustering and Similarity
Clustering using distance functions, called distance-based clustering, is a very popular technique for clustering objects and has given good results.

The clusters are formed in such a way that any two data objects within a cluster have a minimum distance value and any two data objects across different clusters have a maximum distance value.
Similarity is a quantity that reflects the strength of the relationship between two data items; it represents how similar two data patterns are.

Clustering is done based on a similarity measure to group similar data objects together.

In most applications, this similarity measure is based on distance functions such as Euclidean distance, Manhattan distance, Minkowski distance, cosine similarity, etc., which are used to group objects into clusters.
What are the criteria of good clustering?
▪ A good clustering method will produce high-quality clusters in which:
  – the intra-class (that is, intra-cluster) similarity is high;
  – the inter-class similarity is low.
▪ The quality of a clustering result also depends on both the similarity measure used by the method and its implementation.
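
As an illustration of these criteria, the silhouette coefficient scores how cohesive and well separated the clusters are. A minimal sketch, assuming scikit-learn and NumPy are available and using synthetic data:

```python
# Silhouette coefficient: ranges from -1 to +1; values near +1 mean high
# intra-cluster similarity and low inter-cluster similarity (a "good" clustering).
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
# Two well-separated synthetic groups of 2-D points
X = np.vstack([rng.normal(0, 0.5, (50, 2)),
               rng.normal(5, 0.5, (50, 2))])

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print("silhouette:", silhouette_score(X, labels))  # close to 1 for this data
```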
Categories of Clustering

Clustering algorithms can be broadly categorized into the following categories:

1. Partitional Clustering
2. Density-based Clustering
3. Hierarchical Clustering
1. Partitional Clustering
Partitional clustering is considered to be the most popular category of clustering algorithm. A partitional clustering algorithm divides the data points into "k" partitions, where each partition represents a cluster.

The partitioning is done based on a certain objective function. The clusters are formed such that the data objects within a cluster are "similar", and the data objects in different clusters are "dissimilar".

Partitional clustering methods are useful in applications where the number of clusters required is known in advance. K-means, PAM (Partitioning Around Medoids) and CLARA are a few of the partitional clustering algorithms.
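
A minimal k-means sketch, assuming scikit-learn is available; the data points are invented for illustration:

```python
# Partitional clustering with k-means: k must be chosen in advance.
import numpy as np
from sklearn.cluster import KMeans

X = np.array([[1.0, 2.0], [1.5, 1.8], [1.0, 0.6],
              [8.0, 8.0], [9.0, 11.0], [8.5, 9.0]])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=42).fit(X)
print(kmeans.labels_)           # cluster index assigned to each data point
print(kmeans.cluster_centers_)  # the k learned centroids
```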
2. Density Based Clustering

Density-based clustering algorithms can create arbitrarily shaped clusters.

In this kind of clustering approach, a cluster is considered to be a region in which the density of data objects exceeds a particular threshold value.

The DBSCAN algorithm is a famous example of the density-based clustering approach.
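
A minimal DBSCAN sketch, again assuming scikit-learn; eps and min_samples are illustrative values, not a recommendation:

```python
# Density-based clustering with DBSCAN: eps is the neighbourhood radius,
# min_samples the density threshold; low-density points are labelled -1 (noise).
import numpy as np
from sklearn.cluster import DBSCAN

X = np.array([[1.0, 1.0], [1.2, 1.1], [0.9, 1.0],
              [8.0, 8.0], [8.1, 7.9], [25.0, 80.0]])

labels = DBSCAN(eps=0.5, min_samples=2).fit_predict(X)
print(labels)  # e.g. [0 0 0 1 1 -1]: two dense clusters plus one noise point
```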
3. Hierarchical Clustering

Hierarchical clustering algorithms work to divide or merge a particular dataset into a sequence of nested partitions.

The hierarchy of these nested partitions can be of two types, viz., agglomerative, i.e., bottom-up, or divisive, i.e., top-down.

In the agglomerative method, clustering begins with each data object in its own cluster and continues to merge the closest pairs of clusters until all the data objects are grouped together in just one cluster. Divisive hierarchical clustering, on the other hand, starts with all data objects in a single cluster and keeps splitting larger clusters into smaller ones until all the data objects are split into unit clusters.

BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies) and CURE (Clustering Using REpresentatives) are examples of the hierarchical clustering approach.
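
A minimal agglomerative (bottom-up) sketch using SciPy's hierarchical clustering routines (an assumed dependency; the data is made up):

```python
# Agglomerative hierarchical clustering: linkage() records the sequence of merges
# (the hierarchy); fcluster() cuts the resulting tree into a flat set of clusters.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.array([[1, 2], [1, 4], [1, 0],
              [10, 2], [10, 4], [10, 0]])

Z = linkage(X, method="ward")                     # full merge history
labels = fcluster(Z, t=2, criterion="maxclust")   # cut the tree into 2 clusters
print(labels)                                     # e.g. [1 1 1 2 2 2]
```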
Two main considerations of similarity:

Similarity = 1 if X = Y (where X, Y are two objects)
Similarity = 0 if X ≠ Y

That's all about similarity; let's move on to the five most popular similarity/distance measures.
Euclidean distance is the most commonly used distance measure. In most cases, when people talk about distance, they refer to Euclidean distance; it is also known simply as "distance".
When data is dense or continuous, this is the best proximity measure.
The Euclidean distance between two points is the length of the straight-line path connecting them; the Pythagorean theorem gives this distance.
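
For two points p1 = (x1, y1) and p2 = (x2, y2), the Euclidean distance is √((x1 − x2)² + (y1 − y2)²). A quick sketch of the computation, assuming NumPy:

```python
# Euclidean (straight-line / Pythagorean) distance between two points.
import numpy as np

p1 = np.array([1.0, 2.0])
p2 = np.array([4.0, 6.0])

d = np.sqrt(np.sum((p1 - p2) ** 2))  # sqrt((1-4)^2 + (2-6)^2) = sqrt(25) = 5.0
print(d)                             # equivalently: np.linalg.norm(p1 - p2)
```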
What is Euclidean Distance?

In a raster (GIS) setting, Euclidean Distance gives the distance from each cell in the raster to the closest source.

Example of usage: What is the distance to the closest town?

Euclidean Direction gives the direction from each cell to the closest source.
Manhattan distance is a metric in which the distance between two points is calculated as the sum of the absolute differences of their Cartesian coordinates. Put simply, it is the total sum of the differences between the x-coordinates and the y-coordinates.

Suppose we have two points A and B. To find the Manhattan distance between them, we just sum up the absolute x-axis and y-axis variation, i.e., how far points A and B differ along the X-axis and the Y-axis.

More mathematically, the Manhattan distance is the distance between two points measured along axes at right angles.

In a plane with p1 at (x1, y1) and p2 at (x2, y2):

Manhattan distance = |x1 – x2| + |y1 – y2|

This Manhattan distance metric is also known as Manhattan length, rectilinear distance, L1 distance or L1 norm, city block distance, Minkowski's L1 distance, or taxi-cab metric.
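
A short sketch of the formula above, assuming NumPy; the two points are the same made-up pair used in the Euclidean example:

```python
# Manhattan (city block / L1) distance: sum of absolute coordinate differences.
import numpy as np

p1 = np.array([1.0, 2.0])
p2 = np.array([4.0, 6.0])

d = np.sum(np.abs(p1 - p2))  # |1 - 4| + |2 - 6| = 3 + 4 = 7.0
print(d)                     # the Euclidean distance for the same pair is 5.0
```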
Manhattan distance captures the distance between two points by aggregating the pairwise absolute difference in each variable, while Euclidean distance captures the same by aggregating the squared difference in each variable (and then taking the square root).
What is Manhattan distance used for?

We use Manhattan distance, also known as city block distance or taxicab geometry, when we need to calculate the distance between two data points along a grid-like path.
Euclidean distance is the shortest path between a source (s) and a destination (d), which is a straight line, whereas Manhattan distance is the sum of the horizontal and vertical segments travelled between s and d, where each segment is itself a straight line along one axis.
The Minkowski distance is a generalized metric form
of Euclidean distance and Manhattan distance.
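
Setting the order p = 1 recovers Manhattan distance and p = 2 recovers Euclidean distance. A small sketch, assuming SciPy:

```python
# Minkowski distance of order p generalises Manhattan (p=1) and Euclidean (p=2).
from scipy.spatial.distance import minkowski

a, b = [1.0, 2.0], [4.0, 6.0]
print(minkowski(a, b, p=1))  # 7.0 -> Manhattan distance
print(minkowski(a, b, p=2))  # 5.0 -> Euclidean distance
```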
Differences between Jaccard Similarity and Cosine Similarity:

Jaccard similarity takes only the unique set of words for each sentence / document, while cosine similarity takes the full length of the vectors into account.
(These vectors could be built from bag-of-words term frequency or tf-idf.)
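
A small sketch contrasting the two measures on a made-up pair of sentences, using only the Python standard library:

```python
# Jaccard works on the *sets* of unique words; cosine works on term-frequency
# vectors, so word counts and vector length influence the score.
import math
from collections import Counter

s1 = "machine learning groups similar data points"
s2 = "clustering groups similar data points together"
w1, w2 = s1.split(), s2.split()

# Jaccard similarity: |intersection| / |union| of the unique word sets
jaccard = len(set(w1) & set(w2)) / len(set(w1) | set(w2))

# Cosine similarity: dot product of term-frequency vectors over their norms
c1, c2 = Counter(w1), Counter(w2)
dot = sum(c1[t] * c2[t] for t in set(c1) | set(c2))
norm = math.sqrt(sum(v * v for v in c1.values())) * \
       math.sqrt(sum(v * v for v in c2.values()))
print(round(jaccard, 3), round(dot / norm, 3))  # 0.5 and ~0.667 for this pair
```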
