
M4 - Clustering


MACHINE LEARNING TECHNIQUES

MCA301
Module 4 - Clustering Topics
1. Clustering Introduction
2. Similarity measures
3. Clustering criteria
4. Distance functions
5. k-Means Clustering
6. Hierarchical clustering
7. DBSCAN
8. Combining Multiple Learners
• Voting
• Bagging
• Boosting
Clustering in machine learning

Clustering or cluster analysis is a machine learning technique which groups an unlabelled dataset.

It can be defined as "a way of grouping the data points into different clusters, each consisting of similar data points."
How does Clustering Work in Machine Learning?

In clustering, we group an unlabeled data set; this is known as unsupervised learning.

➢ When we first group unlabeled data, we need to find groups of similar points.

➢ When we create a group, we need to understand the features of the dataset, i.e., which attributes make items similar.

➢ If we create a group based on only one or two features, it is easy to measure similarity.
Clustering is an unsupervised learning technique which aims at grouping a set of objects into clusters so that objects in the same cluster are as similar as possible, whereas objects in one cluster are as dissimilar as possible from objects in other clusters.

Cluster analysis aims to group a collection of patterns into clusters based on similarity. A typical clustering technique uses a similarity function for comparing various data items.
Distance-based Clustering and Similarity
Clustering using distance functions, called distance-based clustering, is a very popular technique for clustering objects and has given good results.

The clusters are formed in such a way that any two data objects within a cluster have a minimum distance value and any two data objects across different clusters have a maximum distance value.
Similarity is a quantity that reflects the strength of the relationship between two data items; it represents how similar two data patterns are.

Clustering is done based on a similarity measure to group similar data objects together.

In most applications, this similarity measure is based on distance functions such as Euclidean distance, Manhattan distance, Minkowski distance, cosine similarity, etc., which are used to group objects into clusters.
What are the criteria of good clustering?
▪ A good clustering method will produce high-quality clusters in which:
  – the intra-class (that is, intra-cluster) similarity is high;
  – the inter-class similarity is low.
▪ The quality of a clustering result also depends on both the similarity measure used by the method and its implementation.
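
As an illustration of these criteria, the silhouette coefficient scores how cohesive and well separated the clusters are. A minimal sketch, assuming scikit-learn and NumPy are available and using synthetic data:

```python
# Silhouette coefficient: ranges from -1 to +1; values near +1 mean high
# intra-cluster similarity and low inter-cluster similarity (a "good" clustering).
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
# Two well-separated synthetic groups of 2-D points
X = np.vstack([rng.normal(0, 0.5, (50, 2)),
               rng.normal(5, 0.5, (50, 2))])

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print("silhouette:", silhouette_score(X, labels))  # close to 1 for this data
```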
Categories of Clustering

Clustering algorithms can be broadly categorized into the following categories:

1. Partitional Clustering
2. Density-based Clustering
3. Hierarchical Clustering
1. Partitional Clustering
Partitional clustering is considered to be the most popular category of clustering algorithm. A partitional clustering algorithm divides the data points into "k" partitions, where each partition represents a cluster.

The partitioning is done based on a certain objective function. The clusters are formed such that the data objects within a cluster are "similar", and the data objects in different clusters are "dissimilar".

Partitional clustering methods are useful in applications where the number of clusters required is known in advance. K-means, PAM (Partitioning Around Medoids) and CLARA are a few of the partitional clustering algorithms.
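
A minimal k-means sketch, assuming scikit-learn is available; the data points are invented for illustration:

```python
# Partitional clustering with k-means: k must be chosen in advance.
import numpy as np
from sklearn.cluster import KMeans

X = np.array([[1.0, 2.0], [1.5, 1.8], [1.0, 0.6],
              [8.0, 8.0], [9.0, 11.0], [8.5, 9.0]])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=42).fit(X)
print(kmeans.labels_)           # cluster index assigned to each data point
print(kmeans.cluster_centers_)  # the k learned centroids
```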
2. Density Based Clustering

Density-based clustering algorithms can create arbitrarily shaped clusters.

In this kind of clustering approach, a cluster is considered to be a region in which the density of data objects exceeds a particular threshold value.

The DBSCAN algorithm is a famous example of the density-based clustering approach.
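
A minimal DBSCAN sketch, again assuming scikit-learn; eps and min_samples are illustrative values, not a recommendation:

```python
# Density-based clustering with DBSCAN: eps is the neighbourhood radius,
# min_samples the density threshold; low-density points are labelled -1 (noise).
import numpy as np
from sklearn.cluster import DBSCAN

X = np.array([[1.0, 1.0], [1.2, 1.1], [0.9, 1.0],
              [8.0, 8.0], [8.1, 7.9], [25.0, 80.0]])

labels = DBSCAN(eps=0.5, min_samples=2).fit_predict(X)
print(labels)  # e.g. [0 0 0 1 1 -1]: two dense clusters plus one noise point
```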
3. Hierarchical Clustering

Hierarchical clustering algorithms work to divide or merge a particular dataset into a sequence of nested partitions.

The hierarchy of these nested partitions can be of two types, viz., agglomerative, i.e., bottom-up, or divisive, i.e., top-down.

In the agglomerative method, clustering begins with each data object in its own cluster and continues to merge the closest pairs of clusters until all the data objects are grouped together in just one cluster. Divisive hierarchical clustering, on the other hand, starts with all data objects in a single cluster and keeps splitting larger clusters into smaller ones until all the data objects are split into unit clusters.

BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies) and CURE (Clustering Using REpresentatives) are examples of the hierarchical clustering approach.
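
A minimal agglomerative (bottom-up) sketch using SciPy's hierarchical clustering routines (an assumed dependency; the data is made up):

```python
# Agglomerative hierarchical clustering: linkage() records the sequence of merges
# (the hierarchy); fcluster() cuts the resulting tree into a flat set of clusters.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.array([[1, 2], [1, 4], [1, 0],
              [10, 2], [10, 4], [10, 0]])

Z = linkage(X, method="ward")                     # full merge history
labels = fcluster(Z, t=2, criterion="maxclust")   # cut the tree into 2 clusters
print(labels)                                     # e.g. [1 1 1 2 2 2]
```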
Two main considerations of similarity:

Similarity = 1 if X = Y (where X, Y are two objects)
Similarity = 0 if X ≠ Y

That's all about similarity; let's move on to the five most popular similarity/distance measures.
Euclidean distance is the most commonly used distance measure. In most cases, when people talk about distance, they refer to Euclidean distance; it is also known simply as "distance".
When data is dense or continuous, this is the best proximity measure.
The Euclidean distance between two points is the length of the straight-line path connecting them; the Pythagorean theorem gives this distance.
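
For two points p1 = (x1, y1) and p2 = (x2, y2), the Euclidean distance is √((x1 − x2)² + (y1 − y2)²). A quick sketch of the computation, assuming NumPy:

```python
# Euclidean (straight-line / Pythagorean) distance between two points.
import numpy as np

p1 = np.array([1.0, 2.0])
p2 = np.array([4.0, 6.0])

d = np.sqrt(np.sum((p1 - p2) ** 2))  # sqrt((1-4)^2 + (2-6)^2) = sqrt(25) = 5.0
print(d)                             # equivalently: np.linalg.norm(p1 - p2)
```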
What is Euclidean Distance?

In a raster (GIS) setting, Euclidean Distance gives the distance from each cell in the raster to the closest source.

Example of usage: What is the distance to the closest town?

Euclidean Direction gives the direction from each cell to the closest source.
Manhattan distance is a metric in which the distance between two points is calculated as the sum of the absolute differences of their Cartesian coordinates. Put simply, it is the total sum of the differences between the x-coordinates and the y-coordinates.

Suppose we have two points A and B. To find the Manhattan distance between them, we just sum up the absolute x-axis and y-axis variation, i.e., how far points A and B differ along the X-axis and the Y-axis.

More mathematically, the Manhattan distance is the distance between two points measured along axes at right angles.

In a plane with p1 at (x1, y1) and p2 at (x2, y2):

Manhattan distance = |x1 – x2| + |y1 – y2|

This Manhattan distance metric is also known as Manhattan length, rectilinear distance, L1 distance or L1 norm, city block distance, Minkowski's L1 distance, or taxi-cab metric.
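
A short sketch of the formula above, assuming NumPy; the two points are the same made-up pair used in the Euclidean example:

```python
# Manhattan (city block / L1) distance: sum of absolute coordinate differences.
import numpy as np

p1 = np.array([1.0, 2.0])
p2 = np.array([4.0, 6.0])

d = np.sum(np.abs(p1 - p2))  # |1 - 4| + |2 - 6| = 3 + 4 = 7.0
print(d)                     # the Euclidean distance for the same pair is 5.0
```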
Manhattan distance captures the distance between two points by aggregating the pairwise absolute difference in each variable, while Euclidean distance captures the same by aggregating the squared difference in each variable (and then taking the square root).
What is Manhattan distance used for?

We use Manhattan distance, also known as city block distance or taxicab geometry, when we need to calculate the distance between two data points along a grid-like path.
Euclidean distance is the shortest path between a source (s) and a destination (d), which is a straight line, whereas Manhattan distance is the sum of the horizontal and vertical segments travelled between s and d, where each segment is itself a straight line along one axis.
The Minkowski distance is a generalized metric form
of Euclidean distance and Manhattan distance.
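
Setting the order p = 1 recovers Manhattan distance and p = 2 recovers Euclidean distance. A small sketch, assuming SciPy:

```python
# Minkowski distance of order p generalises Manhattan (p=1) and Euclidean (p=2).
from scipy.spatial.distance import minkowski

a, b = [1.0, 2.0], [4.0, 6.0]
print(minkowski(a, b, p=1))  # 7.0 -> Manhattan distance
print(minkowski(a, b, p=2))  # 5.0 -> Euclidean distance
```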
Differences between Jaccard Similarity and Cosine Similarity:

Jaccard similarity takes only the unique set of words for each sentence / document, while cosine similarity takes the full length of the vectors into account.
(These vectors could be built from bag-of-words term frequency or tf-idf.)
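
A small sketch contrasting the two measures on a made-up pair of sentences, using only the Python standard library:

```python
# Jaccard works on the *sets* of unique words; cosine works on term-frequency
# vectors, so word counts and vector length influence the score.
import math
from collections import Counter

s1 = "machine learning groups similar data points"
s2 = "clustering groups similar data points together"
w1, w2 = s1.split(), s2.split()

# Jaccard similarity: |intersection| / |union| of the unique word sets
jaccard = len(set(w1) & set(w2)) / len(set(w1) | set(w2))

# Cosine similarity: dot product of term-frequency vectors over their norms
c1, c2 = Counter(w1), Counter(w2)
dot = sum(c1[t] * c2[t] for t in set(c1) | set(c2))
norm = math.sqrt(sum(v * v for v in c1.values())) * \
       math.sqrt(sum(v * v for v in c2.values()))
print(round(jaccard, 3), round(dot / norm, 3))  # 0.5 and ~0.667 for this pair
```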
