Survey of clustering algorithms

Cluster analysis, or clustering, is the art of separating data points into dissimilar groups, where each group (cluster) contains similar objects. It is an unsupervised learning technique used in many fields, such as machine learning, data mining, and bioinformatics. The technique is implemented in a variety of ways; the best-known approaches are partitioning methods, hierarchical methods, density-based methods, grid-based methods, model-based methods, graph-based methods, and ensembles of clustering algorithms.

This article gives a brief description of representative algorithms under each category, organized by data type and method type; it does not aim to be a comprehensive account of all clustering algorithms.
Route map for the paper
Partitioning algorithms are described in section 2. Hierarchical algorithms are covered in section 3. Section 4 explains the density-based algorithms, and section 5 describes the grid-based methods. Model-based algorithms are covered in section 6, while recent advances in clustering techniques, such as ensembles of clustering algorithms, are described in section 7. The final section concludes this work.

2. Partitioning Methods
Partitioning algorithms divide the dataset into a fixed number of disjoint subsets (clusters) and iteratively refine this partition. The methods in this category include K-Means, the Farthest First Traversal k-center (FFT) algorithm, K-Medoids, CLARA, CLARANS, Fuzzy K-Means, K-Modes, Fuzzy K-Modes, Squeezer, K-Prototypes, COOLCAT, etc. A few of these methods are explained below.
Numerical data
Numerical data
The K-Means algorithm operates on numerical and binary data. The algorithm builds spherical clusters with means as centers. Its drawbacks are that the user has to specify the number of clusters to be formed as well as the stopping criterion. With a different ordering of the same data in the dataset, the algorithm may produce different output clusters. The algorithm cannot handle missing data or outliers (data items dissimilar from the rest of the data). The worst-case complexity of this algorithm is O(tkN), where t is the number of iterations, k is the number of clusters and N is the number of objects.
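
As a concrete illustration, a minimal K-Means run in Python with scikit-learn is sketched below; the synthetic data, the choice of k = 3 and the random seed are arbitrary and serve only the example.

import numpy as np
from sklearn.cluster import KMeans

# three well-separated synthetic groups, purely for illustration
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(50, 2)) for c in (0, 5, 10)])

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)  # the user supplies k
print(km.cluster_centers_)   # one mean per (spherical) cluster
print(km.labels_[:10])       # cluster index assigned to each object
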
The Farthest First Traversal k-center (FFT) algorithm works with the variance of the data points. The user only has to supply the value of 'k'. The complexity of this method is O(NK). Like K-Means, this method is unsuitable for noisy datasets and cannot handle outliers effectively, and it builds spherical clusters with means as centers.
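
A small sketch of the farthest-first idea in plain NumPy is given below (illustrative parameter values): centers are picked greedily, each one as far as possible from those already chosen, and objects are then assigned to their nearest center.

import numpy as np

def farthest_first_centers(X, k, seed=0):
    # start from a random object, then repeatedly take the object farthest
    # from all centers chosen so far
    rng = np.random.default_rng(seed)
    chosen = [int(rng.integers(len(X)))]
    dist = np.linalg.norm(X - X[chosen[0]], axis=1)
    for _ in range(k - 1):
        nxt = int(np.argmax(dist))
        chosen.append(nxt)
        dist = np.minimum(dist, np.linalg.norm(X - X[nxt], axis=1))
    return X[chosen]

X = np.random.default_rng(1).normal(size=(200, 2))
centers = farthest_first_centers(X, k=4)
labels = np.argmin(np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2), axis=1)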

K-Medoids, also known as Partitioning Around Medoids (PAM), first asks the user to enter the required number of clusters and then finds a representative medoid (center) for each cluster. This method is capable of handling outliers but is order-sensitive. The complexity of the algorithm is O(tNK). The algorithm builds spherical clusters with medoids as centers.

CLARA (Clustering Large Applications): this method improves the scalability of K-Medoids. It works on samples drawn from the dataset rather than on the entire dataset; the methodology otherwise remains the same. The complexity of the algorithm is O(Ks2 + K(N-K)), where s is the sample size. The algorithm builds arbitrarily shaped clusters with medoids as centers.

CLARANS (Clustering Large Applications Based Upon Randomized Search): this algorithm draws samples dynamically during each iteration rather than working with a fixed sample. It handles outliers effectively, with a complexity of O(N2). The algorithm builds arbitrarily shaped clusters with medoids as centers.

Fuzzy k-means is a soft clustering algorithm that allows each object to have a degree of membership in more than one cluster, using the mean as the cluster representative. The complexity of the algorithm is O(tNK). The algorithm builds arbitrarily shaped clusters with means as centers.
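
The alternating update at the heart of fuzzy k-means can be sketched in a few lines of NumPy; the fuzzifier m = 2 and the iteration count below are conventional but arbitrary choices for illustration.

import numpy as np

def fuzzy_k_means(X, k, m=2.0, iters=100, seed=0):
    rng = np.random.default_rng(seed)
    U = rng.random((len(X), k))
    U /= U.sum(axis=1, keepdims=True)             # soft memberships sum to 1 per object
    for _ in range(iters):
        Um = U ** m
        V = (Um.T @ X) / Um.sum(axis=0)[:, None]  # membership-weighted means
        D = np.linalg.norm(X[:, None, :] - V[None, :, :], axis=2) + 1e-12
        inv = D ** (-2.0 / (m - 1.0))
        U = inv / inv.sum(axis=1, keepdims=True)  # standard membership update
    return U, V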

Discrete
K-Modes: this method is similar to k-means but replaces the mean with the mode; the Hamming distance may also be used with this method. It has the merits and demerits of K-Means. The complexity of this method is O(TKN).
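
A minimal k-modes style sketch in Python is shown below, assuming the data is a 2-D array of categorical codes or strings; mode recomputation and the mismatch (Hamming) distance are the only ingredients.

import numpy as np
from collections import Counter

def k_modes(X, k, iters=20, seed=0):
    rng = np.random.default_rng(seed)
    modes = X[rng.choice(len(X), size=k, replace=False)].copy()
    for _ in range(iters):
        dist = (X[:, None, :] != modes[None, :, :]).sum(axis=2)   # Hamming distance to each mode
        labels = dist.argmin(axis=1)
        for j in range(k):
            members = X[labels == j]
            if len(members):
                # the most frequent value of each attribute becomes the new mode
                modes[j] = [Counter(col).most_common(1)[0][0] for col in members.T]
    return labels, modes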

Fuzzy k-modes
Fuzzy k-modes is an extension to fuzzy k-means. The complexity of this method is O(TKN). This method
suffers from the drawbacks of k-modes.
Squeezer: this algorithm scans the dataset exactly once and is an iterative version of k-modes. The algorithm follows a top-down approach. Squeezer is efficient, with a complexity of O(kN).

COOLCAT: the drawback of k-modes is its sensitivity to the starting clusters (initial cluster modes); this is overcome by using entropy as the measure of dissimilarity. The algorithm works by minimizing the entropy of the clusters at each stage. COOLCAT has a complexity of O(N2).

Mixed Discrete and Numerical


K-Prototypes: k-prototypes integrates k-means and k-modes in order to handle mixed numerical and categorical data. The complexity of this approach is O(TKN).
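
The mixed dissimilarity that k-prototypes combines can be sketched as below; the weight gamma, which balances the numeric and categorical parts, is a user choice and the value here is only illustrative.

import numpy as np

def mixed_dissimilarity(x_num, x_cat, proto_num, proto_cat, gamma=1.0):
    # squared Euclidean distance on the numeric attributes ...
    numeric_part = ((x_num - proto_num) ** 2).sum(axis=-1)
    # ... plus gamma times the number of categorical mismatches
    categorical_part = (x_cat != proto_cat).sum(axis=-1)
    return numeric_part + gamma * categorical_part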

HIERARCHICAL CLUSTERING
Hierarchical clustering algorithms arrange the data into groups in a tree fashion. Each node in the tree represents a cluster. This approach comprises two major divisions, the bottom-up (agglomerative) and the top-down (divisive) approaches. The edges are links (distances between two clusters), computed using a linkage criterion: single linkage, average linkage or complete linkage. Single linkage is the minimum distance between two clusters; it can suffer from the chaining problem. Average linkage is the mean distance over all pairs of points between the two clusters; it is more computationally expensive than the other two methods. Complete linkage uses the maximum distance between two clusters; it is appropriate for high-dimensional spaces but inappropriate for noisy datasets.

Hierarchical methods are popular in bioinformatics since clusters can be navigated at various levels of
granularity [15, 87, 118, 119]. Eisen et al. used average linkage for clustering genes by expression pattern
similarity, as assessed by Euclidean distance. For N genes, an N*N matrix is computed containing gene
pair similarities, and then scanned to find the most similar genes.
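
As a small illustration of this kind of workflow, the sketch below uses SciPy to run agglomerative clustering with average linkage over Euclidean distances; the data shape (30 genes by 4 conditions) and the cut into 3 clusters are arbitrary choices for the example.

import numpy as np
from scipy.spatial.distance import pdist
from scipy.cluster.hierarchy import linkage, fcluster

X = np.random.default_rng(0).normal(size=(30, 4))   # e.g. 30 genes x 4 conditions
D = pdist(X, metric='euclidean')                    # condensed pairwise distance matrix
Z = linkage(D, method='average')                    # 'single' or 'complete' also possible
labels = fcluster(Z, t=3, criterion='maxclust')     # cut the tree into 3 clusters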

Hierarchical methods are useful for representing protein sequence family relationships [67, 68,119].
Regulatory and network clustering methods often involve finding cliques [120, 121]. Visualization of
hierarchical clusters of cliques is called ‘power graphs’ [96]. Hierarchical methods are often slow, and merge or split decisions cannot be revisited once applied. Intra-cluster relations may be lost when clusters are merged at various levels. Next, we discuss hierarchical clustering of numerical and then discrete data.
Numerical
BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies) is an apt method for large databases. The algorithm works on the principle of a clustering feature tree (CF-tree) and scales linearly, with complexity O(N). It produces spherical clusters.
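
A minimal BIRCH run with scikit-learn is sketched below; the threshold (which bounds the radius of CF subclusters) and the final number of clusters are illustrative values.

import numpy as np
from sklearn.cluster import Birch

X = np.random.default_rng(0).normal(size=(1000, 2))
model = Birch(threshold=0.5, n_clusters=3).fit(X)   # CF-tree built in a single pass over the data
labels = model.labels_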

CURE (Clustering Using Representatives) is an improvement over BIRCH designed to produce arbitrarily shaped clusters. The algorithm handles outliers more effectively than BIRCH, while the complexities of BIRCH and CURE remain the same. CURE is also a single-scan algorithm and uses a user-specified shrinking factor. The method relies on the input number of clusters to be formed and the shrinking factor value, which seed the initial representatives that are then refined iteratively.

Spectral clustering originated in graph partitioning [122, 123, 88, 148]. A similarity (adjacency) matrix is built over the objects, and the clustering is obtained from the eigenvectors of the corresponding graph Laplacian, which summarize the similarity structure of the data.
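
A minimal example with scikit-learn is sketched below; the RBF affinity builds the similarity matrix internally, and a precomputed affinity matrix could be passed instead (parameter values are illustrative).

import numpy as np
from sklearn.cluster import SpectralClustering

X = np.random.default_rng(0).normal(size=(200, 2))
model = SpectralClustering(n_clusters=2, affinity='rbf', random_state=0).fit(X)
labels = model.labels_
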
Biclustering: clustering the rows and columns of the data matrix simultaneously (horizontally and vertically) is known as biclustering.
Discrete
Bottom up methodology (agglomerative algorithm)
ROCK
ROCK assumes a similarity measure between objects and defines a ‘link’ between two objects whose similarity exceeds a threshold. Initially, each object is assigned to a separate cluster. Then, clusters are merged repeatedly according to their closeness: the sum of the number of ‘links’ between all pairs of objects across the two clusters. ROCK has cubic complexity in N and is unsuitable for large datasets [4, 103].
Chameleon
Chameleon, applicable to mixed and all other types of data, addresses the drawbacks of CURE and ROCK. Chameleon considers the internal interconnectivity and closeness of the objects both between and within the two clusters to be merged, provided a similarity metric is specified [93]. Its complexity is O(N2).

LIMBO improves on the scalability of other hierarchical clustering algorithms. LIMBO builds on the
Information Bottleneck (IB) framework for quantifying the relevant information preserved when
clustering [94]. Its complexity is O(NlogN).

GRID-BASED CLUSTERING
Grid-based clustering algorithms first quantize the clustering space into a finite number of cells and then
perform the required operations on the quantized space. Some of the grid-based clustering algorithms are:
STatistical INformation Grid-based method – STING [36], WaveCluster [33], and CLustering In QUEst –
CLIQUE [1].
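
The quantization step that these methods share can be sketched with NumPy's multidimensional histogram; the grid resolution (10 cells per dimension) and the density threshold below are illustrative choices.

import numpy as np

X = np.random.default_rng(0).normal(size=(500, 2))
counts, edges = np.histogramdd(X, bins=10)   # per-cell object counts on a 10x10 grid
dense_cells = np.argwhere(counts > 5)        # cells exceeding a chosen density threshold
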
Numerical
STING is an integration of grid-based and hierarchical clustering, with complexity O(N), that partitions a dataset into rectangular cells arranged in a hierarchical manner. Each level stores statistics such as the count, mean and standard deviation; this structure facilitates top-down traversal.

CLIQUE [1] is another grid-based clustering algorithm that starts by finding all the dense areas in the
one-dimensional spaces corresponding to each attribute, and then generates the set of two-dimensional
cells that might possibly be dense by looking at dense one-dimensional cells. Generally, CLIQUE
generates the possible set of k-dimensional cells that might possibly be dense by looking at dense (k - 1)
dimensional cells. CLIQUE produces identical results irrespective of the order in which the input records
are presented. In addition, it generates cluster descriptions in the form of DNF expressions [1] for ease of
comprehension. Moreover, empirical evaluation shows that CLIQUE scales linearly with the number of
instances, and has good scalability as the number of attributes is increased. The complexity of this method
is O(N).

Unlike other clustering methods, WaveCluster [33] does not require users to give the number of clusters; it is applicable to low-dimensional spaces. It uses a wavelet transformation to transform the original feature space, resulting in a transformed space where the natural clusters in the data become distinguishable.
The complexity of this method is O(N).

DENSITY-BASED CLUSTERING
Density-based approaches use a local density criterion; clusters are subspaces in which the objects are
dense and are separated by subspaces of low density. Density-based methods are useful in bioinformatics
for finding the densest subspaces within networks, typically involving cliques [19, 65].
Advantages of many density-based algorithms include time efficiency and ability to find clusters of
arbitrary shapes. Some density-based algorithms take user-specified input parameters (though not the number of clusters k) that affect the resulting clustering. Some cannot identify clusters of varying densities. Some density-
based approaches are also grid-based, since a histogram is constructed by partitioning the dataset into a
number of non-overlapping regions.
Numerical
DBSCAN
DBSCAN regards clusters as dense regions of objects in space that are separated by regions of low
density. For each object of a cluster, the neighborhood of a given radius has to contain at least a minimum
number of objects (MinPts), where radius and MinPts are input parameters. Every object not contained in
any cluster is considered noise [98]. Its complexity is O(NlogN), if a spatial index is used; otherwise, it is
O(N2). Its main advantage is that it can discover clusters of arbitrary shapes. DBSCAN is resistant to
noise and provides a means of filtering for noise if desired. Its main drawback is the user-specified
parameter values. DBSCAN is not suitable for high-dimensional data; this leads to OPTICS.
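
A minimal DBSCAN run with scikit-learn is sketched below; eps (the neighborhood radius) and min_samples (MinPts) are the user-specified parameters, and the values here are illustrative only.

import numpy as np
from sklearn.cluster import DBSCAN

X = np.random.default_rng(0).normal(size=(300, 2))
model = DBSCAN(eps=0.5, min_samples=5).fit(X)
labels = model.labels_      # objects labelled -1 are treated as noise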

OPTICS
Both DBSCAN and OPTICS require parameters to be specified by the user that will affect the result.
However, OPTICS considers that different clusters could require different values. DBSCAN and OPTICS
have difficulty identifying clusters within clusters [4, 130]. OPTICS finds an ordering of the data that is
consistent with DBSCAN [99]. The complexity of this method is O(NlogN). For sequence clustering,
OPTICS was extended into SEQOPTICS to support users choosing parameters [10].
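
The scikit-learn implementation can be used for a quick sketch; min_samples below is illustrative, and the reachability values are what expose clusters at different density levels.

import numpy as np
from sklearn.cluster import OPTICS

X = np.random.default_rng(0).normal(size=(300, 2))
model = OPTICS(min_samples=5).fit(X)
order = model.ordering_              # DBSCAN-consistent ordering of the objects
reachability = model.reachability_   # reachability values used to extract clusters
labels = model.labels_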

DENCLUE
The main advantage of DENCLUE is its ability to find arbitrarily shaped clusters. DENCLUE differs from
other density-based approaches in that it pins density to a point in the attribute space instead of an object
[100]. Its complexity is O(N2). The drawback is that it has a large number of input parameters. The
influence of each object within its neighborhood is modeled using an influence function. The density
function is the sum of the influence functions of all objects. Clusters are determined by identifying local
maxima of the overall density function.
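
A sketch of the underlying density model is given below, assuming a Gaussian influence function with smoothing parameter sigma (both the kernel choice and sigma are illustrative); a full DENCLUE implementation would additionally hill-climb each object to a local maximum of this function.

import numpy as np

def overall_density(point, X, sigma=1.0):
    # sum of Gaussian influence functions contributed by every object in X,
    # evaluated at an arbitrary point of the attribute space
    sq_dist = ((X - point) ** 2).sum(axis=1)
    return np.exp(-sq_dist / (2.0 * sigma ** 2)).sum()
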
WaveCluster
WaveCluster uses a wavelet to transform the original data and find dense regions in the transformed space
[105]. A wavelet transform is useful because it can suppress weaker information and is thus effective for noise removal. This results in two benefits: with less information the process is faster, and clusters can be detected at varying levels of accuracy. Its complexity is O(N). However, it is only applicable to
low-dimensional datasets. Its input parameters are the number of grid cells for each dimension, the
wavelet transform, and the number of wavelet applications.
CLIQUE
CLIQUE improves on the scalability of other methods as its complexity is O(N). CLIQUE is useful for
clustering high-dimensional data and is insensitive to object ordering. User-specified parameters are the
grid size and a global density threshold for clusters. CLIQUE partitions the space into non-overlapping
rectangular units and identifies the dense units; a unit is dense if the fraction of total objects contained in
it exceeds the user-specified value [107]. CLIQUE considers only hyper-rectangular clusters and
projections parallel to the axes [4, 130].

Discrete
HIERDENC: Hierarchical Density-based Clustering
A challenge involved in applying density-based clustering to discrete datasets is that the ‘cube’ of
attribute values has no ordering defined. The HIERDENC algorithm for hierarchical density-based
clustering of discrete data offers a probabilistic basis for designing faster density-based discrete clustering
algorithms. The characteristics of HIERDENC include insensitivity to the order of object input, and
ability to handle outliers [97]. The complexity of this method is O(N).

MULIC: Multiple Layer Incremental Clustering


MULIC is a faster simplification of HIERDENC that focuses on the multi-layered structure of special
datasets. MULIC produces layered clusters, which differ in several ways from traditional hierarchical clustering. First, MULIC requires no user-specified parameters. MULIC clusters have a clear separation, not requiring a cut-off to obtain the clusters as in hierarchical clustering. MULIC does not merge clusters during clustering, so interesting local cluster structure is not lost; instead, any cluster mergers that may be desirable are done after the clustering of objects has finished. MULIC results in layered (or nested) clusters of
biomolecules, with each cluster corresponding to a collapsed edge. The complexity of this method is
O(N).

Projected (subspace) clustering


Projected clustering is motivated by high dimensional datasets, where clusters exist only in specific
attribute subsets [133]. Clusters are subspaces of high-dimensional datasets, determined by the subset of
attributes most relevant to each cluster. The values at the relevant attributes are distributed around some
specific values in the cluster, while objects of other clusters are less likely to have such values. The
drawback is that clustering depends on user parameters for determining the relevant attributes of each
cluster; such parameters are the number of clusters or the average number of dimensions for each cluster.
Projected clustering may distinguish the center of a cluster based on higher density or the relevant
attributes [107].
CACTUS
CACTUS uses a minimum size for the relevant attribute sets, and assumes that a cluster is identified by a
unique set of attribute values that seldom occur in other clusters [101]. This assumption may be unnatural
for clustering many real world datasets. CACTUS may return too many clusters [103]. CACTUS has
difficulty finding clusters within clusters [4, 130]. The algorithm scales well with dataset size.
STIRR
STIRR looks for relationships between all attribute values in a cluster [102]. Two sets of attribute values,
one with positive and another with negative weights, define two clusters. STIRR is sensitive to object
ordering and lacks a definite convergence. The notion of weights is non-intuitive and several parameters
are user-specified. The final detected clusters are often incomplete [103]. This is also a scalable
algorithm.
CLICK
CLICK creates a graph representation; vertices are discrete values and an edge is a co-occurrence of
values in an object. A cluster is a k-partite maximal clique such that most pairs of vertices are connected
by an edge. CLICK may return too many clusters or too many outliers [103]. This is also a scalable
algorithm.

CLOPE
CLOPE uses a heuristic of increasing the height to width ratio of the cluster histogram [104]. CLOPE is
fast and scalable to high-dimensional datasets. The accuracy of CLOPE’s results may suffer. The
complexity of the algorithm is O(kdN).

MODEL-BASED CLUSTERING
Model-based clustering assumes that objects match a model, which is often a statistical distribution. Then,
the process aims to cluster objects such that they match the distribution. The model may be user specified
as a parameter and the model may change during the process. In bioinformatics, model-based clustering
methods integrate background knowledge into gene expression, protein structures, and sequences [134,
135]. Building a model involves simplifying assumptions; if the user's assumptions are false, the results will be inaccurate. Another disadvantage of model-based clustering (especially neural networks) is slow
processing time on large datasets.
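
To make the general idea concrete, the sketch below fits a Gaussian mixture model with scikit-learn and reads clusters off the mixture components; this is a generic model-based example rather than any of the specific algorithms discussed in this section, and the number of components is an illustrative choice.

import numpy as np
from sklearn.mixture import GaussianMixture

X = np.random.default_rng(0).normal(size=(300, 2))
gmm = GaussianMixture(n_components=3, random_state=0).fit(X)  # each component acts as a cluster model
labels = gmm.predict(X)            # most likely component per object
posteriors = gmm.predict_proba(X)  # soft, model-based memberships
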
Numerical
Self-Organizing Maps
Self-Organizing Maps involve neural networks, which resemble processing that occurs in the brain. NNs
and SOM clustering involve several layers of units that pass information between one another in a weighted manner. Several units compete for an object; the unit that is closest to the object becomes the
winning unit. SOMs assume that the winning units will eventually learn the correct cluster structure. NNs
can model nonlinear relationships between attributes, and can handle dependent attributes. However, NNs
have a number of drawbacks. NNs can handle binary data, but discrete attributes are hard to handle.
Classifying a result into multiple clusters is done by setting arbitrary value thresholds for discriminating
clusters. NNs do not present an easily understandable model, being more of a ‘black box’ that delivers
results without explaining how the results were derived. The complexity of the method is O(N2).
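
A minimal SOM training loop is sketched below in plain NumPy; the grid size, learning rate, neighbourhood radius and iteration count are all illustrative choices, and after training each object is assigned to its closest unit.

import numpy as np

def train_som(X, grid=(5, 5), iters=1000, lr=0.5, radius=2.0, seed=0):
    rng = np.random.default_rng(seed)
    h, w = grid
    W = rng.normal(size=(h * w, X.shape[1]))                 # one weight vector per unit
    coords = np.array([(i, j) for i in range(h) for j in range(w)], dtype=float)
    for t in range(iters):
        x = X[rng.integers(len(X))]
        winner = np.argmin(((W - x) ** 2).sum(axis=1))       # competition: the closest unit wins
        d2 = ((coords - coords[winner]) ** 2).sum(axis=1)
        nb = np.exp(-d2 / (2.0 * radius ** 2))               # neighbourhood of the winning unit
        W += lr * (1.0 - t / iters) * nb[:, None] * (x - W)  # decaying learning rate
    return W

X = np.random.default_rng(1).normal(size=(400, 3))
W = train_som(X)
assignments = np.argmin(((X[:, None, :] - W[None, :, :]) ** 2).sum(axis=2), axis=1)
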
Discrete
COBWEB
COBWEB is a conceptual clustering method. COBWEB creates a hierarchical clustering in the form of a
classification tree. COBWEB integrates observations incrementally into an existing classification tree by
classifying the observation along a path of best matching nodes. A benefit of COBWEB is that it can
adjust the number of clusters in a partition, without the user specifying this input parameter. A drawback
is that it may assume correlated attributes are independent. A classification tree differs from a decision
tree. In a COBWEB classification tree each node refers to a concept, and contains the probability of the
concept and the probabilities of the attribute-value pairs, which apply to the objects classified under that
node. This is unlike decision trees, which label branches rather than nodes and use logical rather than
probabilistic descriptions. Sibling nodes at a classification tree level form a partition [108]. The
complexity of the method is O(Nd2).

Mixed Discrete and Numerical


BILCOM (Empirical Bayesian)
Model-based methods for gene expression clustering, such as Bi-level clustering of Mixed Discrete and
Numerical Biomedical Data (BILCOM) [109], often adopt an empirical Bayesian approach. Model-
based clustering can find arbitrary shaped gene expression clusters [16, 18, 47, 136–139, 149].
For protein sequence clustering, Brown et al. used mixture densities to estimate amino-acid preferences
within known subfamily clusters, achieving scalability and accuracy [134]. The complexity of the method
is O(N2).

AutoClass: AutoClass is a clustering algorithm for mixed data types, which uses a Bayesian method for
determining the optimal classes based on prior distributions [110]. AutoClass investigates different
numbers of clusters, which are not user specified. The output is a mixture of several likely answers.
Drawbacks include that users have to specify the model spaces to be searched, and a wrong model may
produce wrong results. AutoClass can be slow. AutoClass finds the most likely classifications of objects
in clusters, given a prior distribution for each attribute, symbolizing prior beliefs of the user. It changes
the classifications of objects in clusters and changes the means and variances of the attributes’
distributions in each cluster, until they stabilize. The complexity of the method is O(kd2Nt).

SVM Clustering
SVMs provide a method for supervised learning. SVMs were also adapted for clustering.
SVM-based clustering does not use prior knowledge of object classifications [111]. Initially, every object
in the dataset is randomly labelled and a binary SVM classifier trained. Then the lowest confidence
classifications, those objects with confidence values below some threshold, repeatedly have their labels switched to the other class label. The SVM is re-trained after each re-labeling of the lowest-confidence
objects. The repetition of the above process improves the classification accuracy and limits local minima
traps. The complexity of the method is O(N1.8).
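
A rough sketch of this iterative relabeling idea is given below, using scikit-learn's SVC; the flip fraction, kernel and iteration count are illustrative assumptions, not values prescribed by the original method.

import numpy as np
from sklearn.svm import SVC

def svm_cluster(X, iters=10, flip_fraction=0.1, seed=0):
    rng = np.random.default_rng(seed)
    y = rng.integers(0, 2, size=len(X))                # random initial binary labels
    for _ in range(iters):
        clf = SVC(kernel='rbf').fit(X, y)
        confidence = np.abs(clf.decision_function(X))  # distance from the separating margin
        n_flip = max(1, int(flip_fraction * len(X)))
        lowest = np.argsort(confidence)[:n_flip]       # least confident classifications
        y[lowest] = 1 - y[lowest]                      # switch them to the other class label
    return y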

GRAPH-BASED CLUSTERING
Graph-based clustering methods have been applied for protein complex prediction and to sequence networks.
These methods are sensitive to user-specified parameter values, and often slow; exceptions are
SEQOPTICS and MULIC, presented previously.

Molecular Complex Detection (MCODE)


MCODE is designed to detect subnetworks with many edges [19]. MCODE is sensitive to network
alterations, edge removal and addition, and threshold parameters. MCODE complex prediction had high
correctness, but few complexes were retrieved. The complexity of the method is O(Nd3).

Super Paramagnetic Clustering (SPC)


SPC is similar to entropy-based COOLCAT; when temperature or entropy increases, the system becomes
less stable and the clusters become smaller [70]. SPC is robust to edge removal, but sensitive to edge
addition. SPC is weak at complex predictions [65]. The complexity of the method is O(N2).

Restricted Neighborhood Search Clustering (RNSC)


RNSC is similar to ROCK and Chameleon, considering the number of edges within and between clusters
[112]. RNSC is relatively robust to parameter values and edge addition, but sensitive to edge removal. It
gives many mini-clusters of small sizes [65]. The complexity of the method is O(N2).

Markov Clustering (MCL)


MCL is similar to projected clustering, which often improves results on high dimensional datasets. It
simulates a flow, finding clusters as high-flow regions separated by no-flow boundaries [113]. MCL is
robust to network alterations, both edge removal and addition. An overall comparison showed MCL’s
superiority for finding complexes [65]. The complexity of the method is O(N3).
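
The core of Markov Clustering alternates two matrix operations, and a compact sketch in NumPy is given below; the expansion and inflation parameters are conventional defaults used here only for illustration.

import numpy as np

def mcl(adjacency, expansion=2, inflation=2.0, iters=100, tol=1e-6):
    M = adjacency.astype(float) + np.eye(len(adjacency))  # add self-loops
    M /= M.sum(axis=0, keepdims=True)                     # make columns stochastic
    for _ in range(iters):
        prev = M.copy()
        M = np.linalg.matrix_power(M, expansion)          # expansion: simulate flow
        M = M ** inflation                                # inflation: strengthen strong flows
        M /= M.sum(axis=0, keepdims=True)
        if np.abs(M - prev).max() < tol:                  # stop when the flow stabilises
            break
    # rows that retain mass act as attractors; their non-zero columns form clusters
    return [set(np.flatnonzero(row > tol)) for row in M if row.max() > tol]
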
Other sequence clustering
Tribe MCL clusters sequences into families using BLAST similarity
searches [113]. SPC was also applied to sequences, improving sequence clustering over Tribe MCL [70].
CD-HIT removes redundant sequences, satisfying the point proportion admissibility requirement [12].
The BAG algorithm uses graph theoretic properties to guide cluster splitting and reduce errors [142].
