Adv Data Anal Classif (2013) 7:125–145

DOI 10.1007/s11634-013-0129-3

REGULAR ARTICLE

Interpretable clustering using unsupervised binary trees

Ricardo Fraiman · Badih Ghattas · Marcela Svarc

Received: 31 January 2013 / Revised: 20 February 2013 / Accepted: 7 March 2013 /


Published online: 29 March 2013
© Springer-Verlag Berlin Heidelberg 2013

R. Fraiman
Universidad de San Andrés and Universidad de la República, Vito Dumas 284, 1644 Victoria, Buenos Aires, Argentina
e-mail: rfraiman@udesa.edu.ar

B. Ghattas
Département de Mathématiques, Université de la Méditerranée, Faculté des Sciences de Luminy, 163 Avenue de Luminy, 13288 Marseille Cedex 09, France
e-mail: ghattas@univmed.fr

M. Svarc (B)
Universidad de San Andrés and Conicet, Vito Dumas 284, 1644 Victoria, Buenos Aires, Argentina
e-mail: msvarc@udesa.edu.ar

Abstract We herein introduce a new method of interpretable clustering that uses


unsupervised binary trees. It is a three-stage procedure, the first stage of which entails
a series of recursive binary splits to reduce the heterogeneity of the data within the
new subsamples. During the second stage (pruning), consideration is given to whether
adjacent nodes can be aggregated. Finally, during the third stage (joining), similar
clusters are joined together, even if they do not share the same parent originally.
Consistency results are obtained, and the procedure is used on simulated and real data
sets.

Keywords Unsupervised classification · CART · Pattern recognition

Mathematics Subject Classification (2000) 62H30 · 68T10

1 Introduction

Clustering, a means of unsupervised classification, is a common technique for the statistical analysis of data and has applications in many fields, including medicine, marketing and economics, among other areas. The term “cluster analysis” (first used
by Tryon 1939) includes the use of any of a number of different algorithms and
methods for grouping similar data into a number of different categories. The grouping
is achieved in such a way that “the degree of association between data is at a maximum
if the data belong to the same group and at a minimum otherwise”.
Cluster analysis or clustering involves the assignment of a set of observations from
R p into subsets (called clusters), such that observations in the same cluster are similar
in “some sense”. The definitions are quite vague, in that there is no clear objective
function of the population that can be used to measure the performance of a clus-
tering procedure. Implicit in each clustering algorithm is an objective function that
varies from one method to another. It is important to note that although most clus-
tering procedures require the number of clusters to be known beforehand, generally
in practice it is not. In contrast, in supervised classification, not only is the number of groups known but so is the distribution of each population. We additionally have both a learning sample and a universal objective function, i.e. to minimise the number of misclassifications or, in terms of the population, to minimise the Bayes error.
Despite these differences, there are a number of similarities between supervised
and unsupervised classification. Specifically, there are many algorithms that share the
same spirit in both cases.
Algorithms for supervised and unsupervised classification (Hastie et al. 2003) can be either partitional or hierarchical. Partitional algorithms determine all the groups
at once. The most widely used and well-studied partitioning procedure for cluster
analysis is that of the k-means algorithm. Hierarchical algorithms successively identify
groups that split from or join groups that were established previously. These algorithms
can either be agglomerative (“bottom–up”) or divisive (“top–down”). Agglomerative
algorithms begin with each element as a separate group and merge them into suc-
cessively larger groups. In contrast, divisive algorithms begin with the whole set and
proceed to split it into successively smaller groups. Hierarchical algorithms create a
hierarchy of partitions that may be represented in a tree structure. The best known
hierarchical algorithm for supervised classification is CART (Breiman et al. 1984).
CART has a further property of interest. The partition tree is built using a few
binary conditions obtained from the original coordinates of the data. In most cases,
the interpretation of the results may be summarised in a tree that has a very simple
structure. Such a classification scheme is valuable not only for the rapid classification of new observations, but also because it can often yield a much simpler “model” for explaining why the observations are classified in a particular group, a property that is remarkably important in many applications. Moreover, it is important
to stress that the algorithm assumes no kind of parametric model for the underlying
distribution.
Recently, several new methods have been proposed for cluster analysis; see for instance García-Escudero et al. (2010) for a review with a focus on robust clustering procedures. Other recent proposals have been made by Peña and Prieto (2001, 2006), Fraley and Raftery (2002), Oh and Raftery (2007), Iyigun and Ben-Israel (2008) and Walther (2010), among others. However, only a few methods using decision trees to obtain clusters have previously been proposed. Liu et al. (2000) use decision trees
to partition the data space into clusters and empty (sparse) regions at different levels


of detail. Their method uses the idea of adding an artificial sample of size N that is
uniformly distributed over the space. With these N points added to the original data
set, the problem then becomes one of obtaining a partition of the space into dense and
sparse regions. Liu et al. (2000) treat this problem as a classification problem that uses
a new “purity” function that is adapted to the problem and is based on the relative
densities of the regions concerned.
Chavent et al. (1999) obtained a binary clustering tree that applies to a particular
variable and its binary transformation. They presented two alternative procedures. In
the first, the splitting variables are recursively selected using correspondence analysis,
and the resulting factorial scores lead to the binary transformation. In the second,
the candidate variables and their variable transformations are simultaneously selected
by a criterion of optimisation in which the resulting partitions are evaluated. Basak
and Krishnapuram (2005) proposed four different measures for selecting the most
appropriate characteristics for splitting the data at every node, and two algorithms
for partitioning the data at every decision node. For the specific case of categorical
data, Andreopoulos et al. (2007) introduced HIERDENC, which is an algorithm that
searches the dense subspaces on the “cube” distribution of the values presented in the
data at hand.
Our aim herein is to propose a simple clustering procedure that has the same appeal-
ing properties as CART. We introduce the hierarchical top–down method of CUBT
(Clustering using unsupervised binary trees), in which the clustering tree is based on
binary rules on the original variables, and this will help us to understand the structure
of the clustering.
There are three stages in our procedure. In the first, we grow a maximal tree by
applying a recursive partitioning algorithm. In the second, we prune the tree using a
criterion of minimal dissimilarity. In the final stage, we aggregate the leaves of the
tree that do not necessarily share the same direct ascendant.
We do not claim that the new method we introduce is always more efficient or better than others. For each particular problem, depending on the cluster structure, some methods behave better than others, and quite often the difference in efficiency is just a small (but important) improvement. This is also the case in supervised classification: CART is far from being universally the most efficient algorithm, yet it behaves quite well for a large class of classification problems. We will show throughout the paper that the main advantages of our clustering method (mostly shared with CART for classification problems) are the following:

(a) Flexibility The method is able to perform good clustering for a large family of cluster structures. As long as the true clusters can be separated by some partition built as the intersection of an arbitrarily large but finite number of half-spaces whose boundaries are orthogonal to the original coordinate system, the method will work properly.
(b) Interpretability The final partition is explained in terms of binary splits on the
original variables. This property is fundamental for many applications.
(c) Efficiency We get a good performance in terms of correct clustering allocation for a large family of cluster structures.


(d) Population version We provide a population version of the final partition, where the data are regarded as a random vector X ∈ R^p with unknown distribution P. We then show that the algorithm (the empirical version) converges a.s. to the population final partition. This kind of property is essential in statistics in order to understand, by looking at the population version, when and why a method will be adequate. This is briefly discussed in Sect. 2.3.

The remainder of the paper is organised as follows. In Sect. 2, we introduce some notation and describe the empirical and population versions of our method. The consistency of our method is established in Sect. 3. In Sect. 4, we compare our method with several well-known algorithms (k-means, MCLUST, DCLUST and DBSCAN) over simulated data sets with various complexity configurations.
A set of real data is analysed in Sect. 5, and our concluding remarks are made in
Sect. 6. All proofs are given in the Appendix.

2 Clustering à la CART

Before stating our definitions it is convenient to introduce some notation. We adopt the convention of writing (column) vectors in R^p := R^{p×1} in boldface. In the remainder of this paper X ∈ R^{p×1} will denote a random p-dimensional real vector with coordinates X(j), j = 1, …, p, defined on a rich enough probability space (Ω, A, P) such that E(‖X‖²) < ∞.
The data consist of n independent and identically distributed realisations of X, Ξ = {X_1, …, X_n}.
We will introduce together a population version (our probabilistic setup) of the algo-
rithm based on the random vector X and its empirical version based on the sample Ξ .
For the population version the space is R p , while for the empirical version the space
is Ξ .
Even though our procedure uses the same general approach as CART in many
respects, two main differences should be stressed. First, because we use unsupervised
classification, only information about the observations without labels is available. Thus
the splitting criterion cannot make use of the labels, as it does in CART. The second
essential difference is that the backward pruning algorithm used in CART is replaced by two algorithms: the first merges sibling nodes and the second aggregates nodes that do not share the same direct ascendant in the tree. We call these two algorithms pruning and joining.

2.1 Forward step: maximal tree construction

Because we are using a top–down procedure, we begin by assigning the whole space
to the root node. Let t ⊂ R p be a node and tˆ = Ξ ∩ t, the set of observations obtained
from the sample concerned. When clear from context, we will still use t instead of tˆ
to simplify the notations for the empirical version of the algorithm.
Let Xt be the restriction of X to the node t, i.e. Xt = X|{X ∈ t}, and αt the
probability of being in t, αt = P(X ∈ t). R(t) is then a heterogeneity measure of t


and is defined by,

R(t) = αt trace(Cov(Xt )), (1)

where Cov(X_t) is the covariance matrix of X_t. Thus, R(t) is an approximate measure


of the “mass concentration” of the random vector X at set t, weighted by the mass of
set t. In the empirical version of the algorithm αt and Cov(Xt ) are replaced by their
empirical versions (estimates), and R(t) is called the deviance. More precisely, if we denote by n_t the cardinality of the set t̂,

$$ n_t = \sum_{i=1}^{n} \mathbf{1}_{\{X_i \in t\}} $$

(where $\mathbf{1}_A$ stands for the indicator function of the set A), then the estimated probability is α̂_t = n_t/n, the estimate of E‖X_t − μ_t‖² is

$$ \frac{1}{n_t} \sum_{\{X_i \in t\}} \| X_i - \bar{X}_t \|^2 , $$

where X̄_t is the empirical mean of the observations in t, and the estimate of the deviance is

$$ \hat{R}(t) = \frac{1}{n} \sum_{\{X_i \in t\}} \| X_i - \bar{X}_t \|^2 . \qquad (2) $$
At each stage, every population terminal node t is considered to be split into two
subnodes, namely the left and right child, tl , tr , if it fulfills a certain condition. At the
beginning there is only one node, i.e. the root, which contains the whole space. The
splitting rule has the form x( j) ≤ a, where x( j) is the jth coordinate of the vector x and
a is a threshold level. Thus, tl = {x ∈ R p : x( j) ≤ a} and tr = {x ∈ R p : x( j) > a}.
The best split for t is defined as the couple ( j, a) ∈ {1, . . . , p}×R, (the first element
indicates the variable where the partition is defined and the second is the threshold
level) that maximises,

Δ(t, j, a) = R(t) − R(tl ) − R(tr ). (3)

It is easy to verify that Δ(t, j, a) ≥ 0 for all t, j, a, and this property is also verified
by all the splitting criteria proposed in CART.
Remark 1 (Uniqueness) As for CART, the maximum in (3) is not guaranteed to be
unique. If it is not, we use the following arbitrary rules. If the maximum is attained for
splits over different variables we choose the split over the variable j with the smallest
index. If the maximum is attained for several splits over the same variable j but with
different values for the threshold a, we will choose the split with the smallest value of a.
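A brute-force sketch of the best-split search in R may help fix ideas: it scans every coordinate j and every observed threshold a (all observed values but the largest), and the loop order together with the strict inequality implements the tie-breaking rules of Remark 1. The function names and the choice of candidate thresholds are ours, not the package's.

# Delta(t, j, a) = R(t) - R(t_l) - R(t_r), maximised over the coordinate j and
# the threshold a.  idx is a logical vector of length n marking the node t.
node_deviance <- function(X, idx, n = nrow(X)) {
  Xt <- X[idx, , drop = FALSE]
  if (nrow(Xt) == 0) return(0)
  sum(sweep(Xt, 2, colMeans(Xt))^2) / n
}

best_split <- function(X, idx = rep(TRUE, nrow(X)), n = nrow(X)) {
  best <- list(delta = -Inf, j = NA, a = NA)
  Rt   <- node_deviance(X, idx, n)
  for (j in seq_len(ncol(X))) {                 # smallest variable index wins ties
    vals <- sort(unique(X[idx, j]))
    if (length(vals) < 2) next
    for (a in vals[-length(vals)]) {            # smallest threshold wins ties
      left  <- idx & (X[, j] <= a)
      right <- idx & (X[, j] >  a)
      delta <- Rt - node_deviance(X, left, n) - node_deviance(X, right, n)
      if (delta > best$delta) best <- list(delta = delta, j = j, a = a)
    }
  }
  best
}

# Two well separated groups: the best split separates them
set.seed(1)
X <- rbind(matrix(rnorm(100, -2), ncol = 2), matrix(rnorm(100, 2), ncol = 2))
best_split(X)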
We fix two parameters τ, mindev ∈ (0, 1). Beginning from the root node, each new
terminal node is split recursively until one of the following stopping rules is satisfied:
(i) αt < τ ; (ii) The reduction in deviance is less than mindev × Δ(S , j0 , a0 ), where
S is the whole space, and ( j0 , a0 ) correspond to the best split found for the root
node. For the empirical version of the algorithm we replace αt by α̂t and Δ(t, j, a)


by Δ̂(t, j, a) = R̂(t) − R̂(t_l) − R̂(t_r), and denote by minsize = [τn] the minimum size of a node. minsize and mindev are tuning parameters that must be supplied by the user.
When the algorithm stops, a label is assigned to each leaf (terminal node), and we call the resulting tree the maximal tree. At this point, we have obtained a partition of the space and, in consequence, a partition of the data set, in which each leaf is associated with a cluster. Ideally, this tree has at least the same number of clusters as the population, although in practice it may have too many clusters, in which case an agglomerative stage must be applied, as in CART. It is important to note that if the number of clusters k is known, the number of leaves should be greater than or equal to k. Small values of mindev give rise to bigger trees with more leaves. Moreover, if the tree has the same number of leaves as the number of clusters, it is not necessary to run the subsequent stages of the algorithm.
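Putting the pieces together, the forward stage can be sketched as a recursion over nodes that stops when a node is too small or when the best deviance reduction falls below mindev times the reduction obtained at the root. The sketch below is only schematic and assumes the best_split function given above.

# Forward stage: recursively split nodes, stopping when a node has fewer than
# minsize observations or when the best deviance reduction is below
# mindev * Delta(S, j0, a0), the reduction achieved at the root.
grow_tree <- function(X, idx = rep(TRUE, nrow(X)), minsize = 10, mindev = 0.7,
                      root_delta = NULL) {
  split <- best_split(X, idx)
  if (is.null(root_delta)) root_delta <- split$delta     # reduction at the root node
  if (sum(idx) < minsize || !is.finite(split$delta) ||
      split$delta < mindev * root_delta) {
    return(list(leaf = TRUE, obs = which(idx)))          # terminal node (leaf)
  }
  left  <- idx & (X[, split$j] <= split$a)
  right <- idx & (X[, split$j] >  split$a)
  list(leaf = FALSE, j = split$j, a = split$a,
       left  = grow_tree(X, left,  minsize, mindev, root_delta),
       right = grow_tree(X, right, minsize, mindev, root_delta))
}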

2.2 Backward step: pruning and joining

In the next step, we successively use two algorithms to give the final conformation of
the grouping. The first prunes the tree and the second merges non-adjacent leaves (we
herein refer to this as “joining”). We now introduce our pruning criterion that we refer
to as “minimum dissimilarity pruning”.

2.2.1 Minimum dissimilarity pruning

Let tl and tr be a pair of terminal nodes that share the same direct ascendant t. We
define (in population terms) the random variables W_{l(r)} = D(X_{t_l}, sop(X_{t_r})) and W_{r(l)} = D(sop(X_{t_l}), X_{t_r}) (where sop(Z) stands for the support of the random variable Z), i.e. the Euclidean distance between the random elements of X_{t_l} and the support of X_{t_r}, and vice versa.
Observe that by definition sop(X_{t_l}) ⊂ t_l and sop(X_{t_r}) ⊂ t_r, and therefore they are disjoint with probability one, i.e. P(sop(X_{t_l}) ∩ sop(X_{t_r})) = 0. If there is more than one cluster in the subset t we may expect small quantiles of the random variables W_{r(l)} and W_{l(r)} to be relatively large, while if not, they will be close to zero. For each of them we define

$$ \Delta_{l(r)} = \int_0^{\delta} q_{\nu}\big(W_{l(r)}\big)\, d\nu, \qquad \Delta_{r(l)} = \int_0^{\delta} q_{\nu}\big(W_{r(l)}\big)\, d\nu, \qquad (4) $$

where $q_{\nu}(W_{l(r)})$ stands for the quantile function ($P(W_{l(r)} \le q_{\nu}) = \nu$) and δ ∈ (0, 1) is a proportion.


Finally we consider as a measure of dissimilarity between the sets tl and tr

Δlr = max{Δl(r ) , Δr (l) }.

For a fixed ε > 0, a node t is pruned whenever Δlr < ε, i.e. we replace tl and tr by
tl ∪ tr in the partition.
Observe that since Δ_{l(r)} and Δ_{r(l)} are averages of small quantiles of D(X_{t_l}, sop(X_{t_r})) and of D(X_{t_r}, sop(X_{t_l})) respectively, Δ_{lr} can be thought of as “a more resistant version” of the distance between the supports of the random vectors X_{t_l} and X_{t_r}.
For the empirical version of the algorithm we use the natural plug-in estimate of the
dissimilarity measure Δ. Let n_l (resp. n_r) be the size of t̂_l (resp. t̂_r). For every X_i ∈ t̂_l and X_j ∈ t̂_r, consider the sequences d̃_i = min_{x∈t̂_r} d(X_i, x) and d̃_j = min_{x∈t̂_l} d(x, X_j), and their ordered versions, denoted d_i and d_j. For δ ∈ [0, 1], let

$$ \bar{d}_l^{\delta} = \frac{1}{\delta n_l} \sum_{i=1}^{\delta n_l} d_i, \qquad \bar{d}_r^{\delta} = \frac{1}{\delta n_r} \sum_{j=1}^{\delta n_r} d_j. $$

We compute the empirical dissimilarity measure between t_l and t_r as

$$ d^{\delta}(l, r) = d^{\delta}(t_l, t_r) = \max\big(\bar{d}_l^{\delta}, \bar{d}_r^{\delta}\big), $$

and at each stage of the algorithm the leaves t_l and t_r are merged into the ascendant node t if d^δ(l, r) ≤ ε, where ε > 0.
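In R, this empirical dissimilarity can be sketched as follows (our transcription, not the package code; Xl and Xr hold the observations of the two leaves and delta is the proportion δ):

# d^delta(t_l, t_r): for each observation of one node, take its distance to the
# closest observation of the other node; average the delta*n smallest of these
# values on each side and return the larger of the two averages.
node_dissimilarity <- function(Xl, Xr, delta = 0.2) {
  cross <- as.matrix(dist(rbind(Xl, Xr)))[seq_len(nrow(Xl)),
                                          nrow(Xl) + seq_len(nrow(Xr)), drop = FALSE]
  d_l <- sort(apply(cross, 1, min))        # distances from points of t_l to the set t_r
  d_r <- sort(apply(cross, 2, min))        # distances from points of t_r to the set t_l
  k_l <- max(1, floor(delta * nrow(Xl)))
  k_r <- max(1, floor(delta * nrow(Xr)))
  max(mean(d_l[seq_len(k_l)]), mean(d_r[seq_len(k_r)]))
}

# Two well separated leaves give a large value; two halves of one cloud do not.
set.seed(2)
A <- matrix(rnorm(60), ncol = 2)
B <- matrix(rnorm(60, mean = 5), ncol = 2)
node_dissimilarity(A, B)
node_dissimilarity(A[1:15, ], A[16:30, ])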
For the minimum dissimilarity pruning two parameters must be supplied, δ and ε, the latter hereafter referred to as mindist. In population terms, it suffices that ε be smaller than the distance between the supports of two disjoint clusters; if this information is unknown the default can be zero and this step is skipped. This is not a problem, since if two nodes sharing the same ascendant are close they will eventually be collapsed in the third step, and the final cluster allocation would remain unchanged. The parameter δ is just a way to deal with the possible presence of outliers. When δ goes to zero the dissimilarity measure becomes the usual distance between sop(X_{t_l}) and sop(X_{t_r}).

2.2.2 Joining

The aim of the joining step is to aggregate nodes that do not share the same direct ascendant. The criterion used in the joining step is the same as the one used for pruning. The need for this step is illustrated in Fig. 1.
Here, all pairs of terminal nodes t_i and t_j are compared by computing d^δ(i, j). As in standard hierarchical clustering, pairs of terminal nodes are successively aggregated, starting from the pair i, j with minimum d^δ(i, j) value, thereby producing one less cluster at each step. We consider two different stopping rules for the joining procedure, which correspond to the cases where the number of clusters k is either known or unknown. We denote by m the number of leaves of the pruned tree. If k is known, the following step is repeated until m ≤ k:


Fig. 1 The lower group is split in the first stage and cannot be merged in the pruning stage, but will be when the joining step is used

– For each pair of values (i, j), 1 ≤ i < j ≤ m, let (ĩ, j̃) = argmin_{i,j} {d^δ(i, j)}. Replace t_ĩ and t_j̃ by their union t_ĩ ∪ t_j̃, put m = m − 1 and proceed.

If k is unknown,

– if d^δ_{ĩj̃} < η, replace t_ĩ and t_j̃ by their union t_ĩ ∪ t_j̃, where η > 0 is a given constant, and continue until this condition no longer applies.

In the first case, the stopping criterion is the number of clusters, while in the second case a threshold value η for d^δ(i, j) must be specified.
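When k is known, the joining stage is thus a plain agglomerative loop over the current leaves. A sketch (assuming the node_dissimilarity function given above, with leaves a list of matrices, one per terminal node of the pruned tree):

# Joining with k known: repeatedly merge the pair of leaves with the smallest
# d^delta until only k leaves remain.
join_leaves <- function(leaves, k, delta = 0.2) {
  while (length(leaves) > k) {
    m <- length(leaves); best <- c(Inf, NA, NA)
    for (i in 1:(m - 1)) for (j in (i + 1):m) {
      d <- node_dissimilarity(leaves[[i]], leaves[[j]], delta)
      if (d < best[1]) best <- c(d, i, j)
    }
    leaves[[best[2]]] <- rbind(leaves[[best[2]]], leaves[[best[3]]])  # union of the two leaves
    leaves[[best[3]]] <- NULL                                         # drop the absorbed leaf
  }
  leaves
}

When k is unknown, the same loop would instead stop as soon as the smallest pairwise dissimilarity exceeds the threshold η.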

2.3 CUBT and k-means

In this section we discuss informally the circumstances in which our procedure and the well-known k-means algorithm should each produce reasonably good results. We shall consider those cases where there are “nice groups” that are strictly separated. More precisely, let A_1, …, A_k be disjoint connected compact sets on R^p such that $A_i = \overline{A_i^{0}}$ for i = 1, …, k (where $S^{0}$ is the interior of the set S and $\overline{S}$ is the closure of the set S), and let {P_i : i = 1, …, k} be probability measures on R^p with supports {A_i : i = 1, …, k}.
A typical case may be obtained by defining a random vector X* with a density f and then considering the random vector X = X* | {f(X*) > ζ} for a positive level ζ, as in a number of hierarchical clustering procedures.
An admissible family for CUBT is a family of sets A_1, …, A_k for which there exists another family of disjoint sets B_1, …, B_k, each built up as the intersection of a finite number of half-spaces delimited by hyperplanes orthogonal to the coordinate axes, satisfying A_i ⊂ B_i.
In contrast, the k-means algorithm is defined through the vector of centers (c_1, …, c_k) that minimises

$$ E\Big( \min_{j=1,\ldots,k} \| X - c_j \| \Big). $$

Fig. 2 a CUBT cluster allocation. b k-means cluster allocation

Associated with each center c_j is the convex polyhedron S_j of all points in R^p that are closer to c_j than to any other center, called the Voronoi cell of c_j. The sets in the partition S_1, …, S_k are the population clusters for the k-means algorithm; they are therefore defined by exactly k hyperplanes in arbitrary position.
Then, an admissible family for the k-means algorithm will be a family of sets
A1 , . . . , Ak that can be separated by exactly k hyperplanes.
Although the hyperplanes can be in an arbitrary position, no more than k of them
can be used.
It is clear that in this sense CUBT is much more flexible than the k-means algorithm,
because the family of admissible sets is more general. For example, k-means will
necessarily fail to identify nested groups, while CUBT will not.
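This can be seen on a toy version of the concentric-ring configuration (model M3 below). The snippet uses the base R kmeans function and is only meant to illustrate the point:

# Two concentric rings, as in model M3: k-means with two centers separates the
# plane by a single straight boundary and therefore mixes the rings.
set.seed(3)
theta  <- runif(300, 0, 2 * pi)
radius <- c(runif(150, 50, 80), runif(150, 200, 230))
X      <- cbind(radius * cos(theta), radius * sin(theta))
truth  <- rep(1:2, each = 150)
km <- kmeans(X, centers = 2, nstart = 10)
table(truth, km$cluster)   # each k-means cluster contains points from both rings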
Another important difference between k-means and CUBT is that our proposal
is less sensitive to small changes in the parameters that define the partition. Effec-
tively, small changes in these will produce small changes in the partition. How-
ever, small changes in the centers (c_1, …, c_k) that define the k-means partition can produce significant changes in the associated partition as given by the Voronoi cells.
Figure 2 shows an example where CUBT has a good performance and k-means clearly does not work.

3 Consistency of CUBT

In this section we present some theoretical results on the consistency of our algorithm. We first prove an important property, namely the monotonicity of the
deviance with increasing tree size.
A simple equivalent characterisation of the function R(t) is given in the following
lemma.

Lemma 3.1 Let t_l and t_r be disjoint compact sets on R^p and μ_s = E(X_{t_s}), s = l, r. If t = t_l ∪ t_r we have

$$ R(t) = R(t_l) + R(t_r) + \frac{\alpha_{t_l}\,\alpha_{t_r}}{\alpha_t}\, \| \mu_l - \mu_r \|^2 . \qquad (5) $$


The proof is given in the Appendix.
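The same decomposition holds exactly for the empirical quantities of Sect. 2.1 (with α̂_t = n_t/n and the node means in place of μ_l, μ_r), which gives a quick numerical sanity check; a sketch:

# Empirical counterpart of eq. (5): with alpha_hat_t = n_t / n and node means in
# place of mu_l, mu_r, the decomposition holds exactly (up to rounding error).
set.seed(4)
X <- matrix(rnorm(400), ncol = 2)
n <- nrow(X)
Rhat <- function(Xt) sum(sweep(Xt, 2, colMeans(Xt))^2) / n
left <- X[, 1] <= 0
Xl <- X[left, , drop = FALSE]; Xr <- X[!left, , drop = FALSE]
al <- nrow(Xl) / n; ar <- nrow(Xr) / n; at <- al + ar   # here t is the whole sample, so at = 1
lhs <- Rhat(X)
rhs <- Rhat(Xl) + Rhat(Xr) + (al * ar / at) * sum((colMeans(Xl) - colMeans(Xr))^2)
all.equal(lhs, rhs)   # TRUE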


Remark 2 Monotonicity of the function R(.) and geometric interpretation.
Observe that Lemma 3.1 entails that, for all disjoint compact sets t_l, t_r and t = t_l ∪ t_r, the function R(·) is monotonic in the sense that R(t) ≥ R(t_l) + R(t_r). Moreover, R(t) will be close to R(t_l) + R(t_r) when the last term on the right-hand side of equation (5) is small. This will occur if one of the sets t_l and t_r has a very small fraction of the mass of t and/or if the centers of the two subsets t_l, t_r are very close together. In either case we do not want to split the set t.
The following results show the consistency of the empirical algorithm in relation
to its population version. We begin with the splitting algorithm and then follow this
by pruning and joining.
Theorem 3.1 Assume that the random vector X has distribution P and a density f that fulfils the condition that ‖x‖² f(x) is bounded. Let X_1, …, X_n be iid random vectors with the same distribution as X and denote by P_n the empirical distribution of the sample X_1, …, X_n.
Let {t_{1n}, …, t_{m_n n}} be the empirical binary partition obtained from the forward empirical algorithm, and {t_1, …, t_m} the population version. Then there exists n_0 such that for n > n_0, m_n = m, and each pair (i_{jn}, a_{jn}) ∈ {1, …, p} × R in the determination of the empirical partition converges a.s. to the corresponding pair (i_j, a_j) ∈ {1, …, p} × R for the population values. In particular, this implies that

$$ \lim_{n\to\infty} \sum_{i=1}^{m} P\,(t_{in}\, \Delta\, t_i) = 0, $$

where Δ stands for the symmetric difference.
The proof is given in the Appendix.
Theorem 3.2 Let {t*_{1n}, …, t*_{k_n n}} be the final empirical binary partition obtained after the forward and backward empirical algorithms have been applied, and {t*_1, …, t*_k} be the population version. Under the assumptions of Theorem 3.1 we ultimately have that k_n = k (k_n = k for all n if k is known), and

$$ \lim_{n\to\infty} \sum_{i=1}^{k} P\,\big(t^{*}_{in}\, \Delta\, t^{*}_i\big) = 0. $$
The proof is given in the Appendix.

4 Some experiments

We present in this section some comparison results based on four models used to
generate the synthetic datasets. For each model we compare several methods which
are briefly described. We give the details concerning how the parameters are tuned
for our method, followed by the results obtained in each configuration. We consider
separately the cases where the number of groups is known and where it is unknown.
All the experiments were run using the new R package “Cubt”, which is publicly available.

4.1 Simulated data sets

We herein consider four different models (M1–M4) with different numbers of groups and dimensions. For each model the clusters have equal numbers of observations.

Fig. 3 a Scatter plot corresponding to M1 for σ = 0.19. b Two-dimensional projection scatter plot
corresponding to M2 for σ = 0.19. c Scatter plots corresponding to M3

M1. Four groups in dimension 2 with 100 observations per group. The data are generated using N(μ_i, Σ), i = 1, …, 4, with centers (−1, 0), (1, 0), (0, −1), (0, 1) and covariance matrix Σ = σ²Id with σ = 0.11, 0.13, 0.15, 0.17, 0.19. Figure 3a gives an example of the data generated using model M1 with σ = 0.19; a sketch for simulating this model is given after the list.
M2. Ten groups in dimension 5 with 30 observations per group. The data are generated using N(μ_i, Σ), i = 1, …, 10. The means μ_i of the first five groups are the vectors of the canonical basis e_1, …, e_5 respectively, while the centers of the five remaining groups are μ_{5+i} = −e_i, i = 1, …, 5. In all cases, the covariance matrix is Σ = σ²Id, σ = 0.11, 0.13, 0.15, 0.17. Figure 3b gives an example of data generated using model M2 with σ = 0.19 projected on the first two dimensions.
M3. Two groups in dimension 2 with 150 observations per group. The data are uni-
formly generated on two concentric rings. The radius of the inner ring is between
50 and 80 and the radius of the outer ring is between 200 and 230. Figure 3c
gives an example of data generated using model M3.
M4. Three groups in dimension 50 with 25 observations per group. The data are gen-
erated using N (μi , Σ), i = 1, 2, 3. The mean of the first group is (0.1, . . . , 0.1),
the second group is centered at the origin and the last group has mean
(−0.1, …, −0.1). In all cases, the covariance matrix is Σ = σ²Id, σ = 0.03
or 0.05.
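For example, model M1 can be simulated in a few lines of R (a sketch of the data-generating mechanism, not the authors' simulation code):

# Model M1: four Gaussian groups in dimension 2, 100 observations per group,
# centers (-1,0), (1,0), (0,-1), (0,1) and covariance sigma^2 * Id.
simulate_m1 <- function(sigma = 0.19, n_per_group = 100) {
  centers <- rbind(c(-1, 0), c(1, 0), c(0, -1), c(0, 1))
  X <- do.call(rbind, lapply(1:4, function(g)
    cbind(rnorm(n_per_group, centers[g, 1], sigma),
          rnorm(n_per_group, centers[g, 2], sigma))))
  list(X = X, labels = rep(1:4, each = n_per_group))
}

set.seed(5)
m1 <- simulate_m1()
plot(m1$X, col = m1$labels, asp = 1)   # four clouds as in Fig. 3a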

4.2 Methods used for the comparisons

The following methods are used in the comparisons, taking into account whether or not they estimate the number of clusters:
– The k-means algorithm. It is well known that the performance of the k-means
algorithm depends strongly on the position of the centroids initially used to start the
algorithm. A number of different methods have been proposed to take account of
this effect (see Steinley and Brusco 2007). We herein follow the recommendations


made in this last reference, and consider ten random initialisations, keeping the one with the minimum within-cluster sum of squares given by $\sum_{i=1}^{n} \sum_{j=1}^{k} \| X_i - c_j \|^2\, \mathbf{1}_{\{X_i \in G_j\}}$, where G_j is the j-th group and c_j is the corresponding center. We refer to this version as k-means(10) (see the sketch after this list). The number of clusters must be given by the user.
– The probabilistic distance clustering method (DCLUST) introduced by Iyigun
and Ben-Israel (2008). It is based on the assumption that the probability of a point
belonging to a cluster is inversely proportional to the distance from the center of
the cluster. We considered the Euclidean distance. This procedure assumes that
the number of clusters is known.
– The model-based clustering procedure proposed by Fraley and Raftery (2002,
2009). The MCLUST algorithm assumes that the sample has been generated from
a mixture of G normal distributions with ellipsoidal covariance matrices of variable shape, volume and orientation, and estimates the parameters of each population using an EM algorithm; the most appropriate model is selected by means of the Bayesian Information Criterion. This procedure is versatile and can be applied whether or not the number of groups is stated; hence we report the results for both cases.
– The Density Based Spatial Clustering of Applications with Noise (DBSCAN, Ester
et al. 1996) algorithm which is suitable for discovering arbitrarily shaped clusters,
since clusters are defined as dense regions separated by low-density regions. This
algorithm estimates the number of clusters, as most density-based algorithms do; hence we compare its results with those obtained when the number of groups is unknown. The algorithm depends on two parameters: the minimum number of objects in a neighborhood of an object, for which our input was 5, and the neighborhood radius, for which the authors recommend not supplying a value if it is unknown.
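For reference, the k-means(10) variant described above corresponds to the nstart argument of the base R kmeans function; a sketch on toy data:

# k-means(10): ten random initialisations, keeping the solution with the
# smallest within-cluster sum of squares.
set.seed(6)
X <- rbind(matrix(rnorm(200, mean = -1, sd = 0.15), ncol = 2),
           matrix(rnorm(200, mean =  1, sd = 0.15), ncol = 2))
fit <- kmeans(X, centers = 2, nstart = 10)
fit$tot.withinss    # minimised over the ten random starts
table(fit$cluster)  # the two groups are recovered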

4.3 Tuning CUBT

We performed M = 100 replicates for each model. When k is given, we compare the
results with those obtained using k-means (k-M), k-means(10) [k-M(10)], MCLUST
and DCLUST. Otherwise, if k is unknown, we compare the results with DBSCAN and MCLUST. In order to apply CUBT we must fix the values of the parameters involved at each stage of the algorithm: for the maximal tree we used minsize = 5, 10 or 15 and mindev = 0.7 or 0.9; for the pruning stage, mindist = 0.3 or 0.5 and δ = 0.2, 0.4 or 0.6. For the cases where the number of clusters is stated, it is important to note that even though in almost every case the number of terminal nodes of the maximal tree is larger than the number of groups, there is no guarantee that this will happen. If it does not, we reduce mindev in decreasing order (0.6, 0.5, …) until the maximal tree has at least k terminal nodes.
For the cases where the number of clusters is not stated we consider the same
parameters as in the previous case. To determine the number of clusters we must
choose a threshold η for the joining stage. We consider the distances d^δ_{i,j} defined in Sect. 2.2.2 that correspond to the tree output by the pruning step, and fix η as a low quantile of the d^δ_{i,j}. Heuristically, low quantiles of d^δ_{i,j} correspond to terminal nodes


whose observations belong to the same clusters. The quantiles of d^δ_{i,j} that determine η chosen for M1 to M4 were 0.2, 0.08, 0.25 and 0.15, respectively.
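Concretely, once the pairwise dissimilarities between the leaves of the pruned tree are available, the threshold is just a low empirical quantile of those values; a toy sketch (the numbers below are illustrative only):

# Choose the joining threshold eta as a low quantile of the pairwise leaf
# dissimilarities d^delta produced by the pruning step (the 0.2 quantile, as used for M1).
d   <- c(0.04, 0.05, 0.07, 0.90, 1.10, 1.30, 1.40)   # toy d^delta values between leaves
eta <- quantile(d, probs = 0.2)
eta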

4.4 Criterion used for comparing the methods

In cluster analysis, there is no commonly accepted criterion to evaluate the performance of a clustering procedure, nor a lower bound that indicates the difficulty of the problem, as the Bayes error does in supervised classification. The most commonly used criterion is the sum of the dispersions within each cluster, which is particularly well suited to the k-means method. Model 3 suggests that it is not always a good criterion.
Since for synthetic datasets the generating model, and thus the original partition, is known, it is more convenient to use a dissimilarity measure between the clusters
obtained and the original partition. Several such dissimilarity measures exist and are
widely used in computer science, like the Rand index (Rand 1971) and the adjusted
Rand index (Hubert and Arabie 1985). However, we chose to use the “Matching Error”, which is equivalent to the misclassification error in supervised classification. This index, defined as a metric for comparing partitions [see for instance Melia (2012)], measures “the number of observations assigned to a group which is not the one that generated the observation” and is thus easy to interpret.
We denote the original clusters r = 1, . . . , R. Let y1 , . . . yn be the group label of
each observation, and ŷ1 , . . . ŷn the class label assigned by the clustering algorithm.
Let Σ be the set of permutations over {1, . . . , R}. The Matching Error, ME in what
follows, may then be expressed as:

$$ ME = \min_{\sigma \in \Sigma} \frac{1}{n} \sum_{i=1}^{n} \mathbf{1}_{\{ y_i \neq \sigma(\hat{y}_i) \}} . \qquad (6) $$

Here Σ is the set of all possible permutations of the predicted labels.
If the number of clusters is large, the assignment problem may be solved in polynomial time using bipartite matching and the Hungarian method (Papadimitriou and Steiglitz 1982).
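For a small number of clusters the Matching Error can be computed by enumerating all permutations directly; a sketch in R (assuming the predicted labels take values in the same set {1, …, R} as the true ones; for large R the Hungarian method mentioned above would be used instead):

# Matching Error of eq. (6): the misclassification rate minimised over all
# relabellings (permutations) of the predicted clusters; brute force, small R only.
matching_error <- function(y, yhat) {
  labels <- sort(unique(y))
  perms <- function(v) {                  # all permutations of a vector
    if (length(v) <= 1) return(list(v))
    out <- list()
    for (i in seq_along(v))
      for (p in perms(v[-i])) out[[length(out) + 1]] <- c(v[i], p)
    out
  }
  errs <- sapply(perms(labels), function(sigma) {
    relabelled <- sigma[match(yhat, labels)]   # apply the permutation to the predicted labels
    mean(y != relabelled)
  })
  min(errs)
}

matching_error(c(1, 1, 2, 2, 3, 3), c(2, 2, 1, 1, 3, 3))   # 0: a pure relabelling
matching_error(c(1, 1, 2, 2, 3, 3), c(1, 2, 2, 2, 3, 3))   # 1/6: one observation mismatched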

4.5 Results

First, we analyze the results of the simulation when the number of clusters is known.
Table 1 shows the results obtained in our simulations. Except for Model 3, we var-
ied the values of σ to reflect the degree of overlapping between classes. In order to
study the robustness of the method with respect to parameter selection we considered 36 different parameter configurations for our procedure. Since the results were practically the same, we present for each model the results for the best and the worst clustering allocation. We report the matching error obtained for each clustering procedure, namely CUBT for the best performance [CUBT (B)] and for the worst performance [CUBT (W)], k-means, k-means(10), MCLUST (if k is given)


Table 1 Simulation results for models M1 to M4

σ CUBT (W) CUBT (B) k-M k-M(10) MCLUST DCLUST

Model 1
0.11 0 0 0.12 0 0 0.04
0.13 7e−03 0 0.12 0 0 0.06
0.15 1e−03 1e−04 0.11 0 0 0.06
0.17 1e−03 2e−04 0.07 5e−05 3e−05 0.07
0.19 2e−03 3e−04 0.06 3e−04 2e−04 0.08
Model 2
0.11 0 0 0.11 0.01 0.04 0.09
0.13 0 0 0.10 0.01 0.05 0.11
0.15 4e−04 2e−04 0.10 0.01 0.06 0.10
0.17 0.004 0.002 0.08 0.01 0.07 0.11
0.19 0.05 0.04 0.07 0.01 0.08 0.11
Model 3
0 3e−04 0.47 0.47 0.25 0.48
Model 4
0.03 0 0 0.11 0 0.65 0.22
0.05 0.16 0.05 0.12 0 0.65 0.33

and DCLUST. As expected for the first two models k-means(10) and MCLUST have
a good performance, but both of them fail for M3. For the last model MCLUST per-
forms poorly in both cases, k-means has an acceptable performance in both cases
and k-means(10) always achieves a perfect cluster allocation. The results obtained
by CUBT using different values for the parameters are practically the same, and in
almost all cases the results for CUBT lie between those of k-means and k-means(10),
except for Model 3 where it has a better performance and for M4 when σ = 0.05
(where the matching error of CUBT(W) is larger than that of k-means).
If we compare the results with those of MCLUST we may say that for M1 they have a
similar performance and that for M2, M3 and M4 the performance of CUBT is better.
The performance of DCLUST is always similar to the performance of k-means, except
in the last model where the behavior of k-means is better.
Now we proceed to analyze the results of the simulation when k is unknown. In
Table 2 we report the number of times that the procedure chooses the correct number of
groups. The ME rate for CUBT and MCLUST, when the number of clusters is chosen
properly, is the same as in the previous analysis. CUBT has a very good performance
for M1, M3 and the first case of M4, and for the other situations the best performance
is very good but it strongly depends on the choices of the other parameters. However,
it is important to note that in those cases identifying correctly the clusters is a very
difficult problem and the other methods were not able to do it. For M2 with σ = 0.11, DBSCAN identifies ten groups in nine replicates (in those cases all the observations are classified correctly) and fewer groups in the remaining replicates; when the group overlap is larger it is not able to separate the groups: for σ = 0.13 it always


Table 2 Number of times that the different procedures choose the correct number of groups for M1 to M4

σ CUBT (W) CUBT (B) DBSCAN MCLUST

Model 1
0.11 81 85 92 99
0.13 75 84 81 100
0.15 76 87 74 100
0.17 79 92 43 100
0.19 85 88 38 100
Model 2
0.11 24 90 9 0
0.13 34 90 0 0
0.15 42 90 0 0
0.17 58 83 0 0
0.19 51 78 0 0
Model 3
94 98 76 0
Model 4
0.03 100 100 100 100
0.05 2 98 26 100

finds fewer than five groups, and in the remaining cases it always finds only one group. For M1, DBSCAN has a very good performance in identifying the number of groups, and in those cases the ME rate is zero, except for σ = 0.19 where it is 0.027. For M3 and M4, whenever it identifies the number of groups correctly it also allocates the observations to the right groups. MCLUST has an outstanding performance for M1, but it fails in all the other models. For M2 it always identifies one group, and for M3 it finds nine groups 97 times and eight clusters the remaining times. Finally, for M4, even though it always finds three groups, it fails in the allocation of the observations, which is consistent with the results found when the number of clusters was stated beforehand.

5 A real data example. European jobs

In the following example, the data set describes the percentage of workers employed
in different sectors of economic activity for a number of European countries in 1979
(this data set can be obtained on the website http://lib.stat.cmu.edu/DASL/Datafiles/
EuropeanJobs.html). The categories are: agriculture (A), mining (M), manufacturing
(MA), power supplies industries (P), construction (C), service industries (SI), finance
(F), social and personal services (S) and transportation and communication (T). It is
important to note that these data were collected during the Cold War. The aim is to
allocate the observations to different clusters, but the number of clusters is unknown.
We must therefore study the data structure for a range of different numbers of clusters.


Table 3 CUBT clustering structure for the European Jobs data using four groups

Group 1 Group 2 Group 3 Group 4

Turkey Greece Ireland Belgium
Poland Portugal Denmark
Romania Spain France
Yugoslavia Bulgaria W. Germany
Czechoslovakia E. Germany
Hungary Italy
USSR Luxembourg
Netherlands
United Kingdom
Austria
Finland
Norway
Sweden
Switzerland

Fig. 4 Tree structure using four groups (top) and five groups (bottom); the left-hand branch shows the smaller values of the variable that makes the partition

We first consider a four-group structure. In this case, a single variable (the percentage
employed in agriculture) determines the tree structure. The four groups are shown
in Table 3, and the corresponding tree is plotted on the top panel of Fig. 4. In the
tree, the highest value of A corresponds to Turkey, which is an outlier and forms a cluster on its own, which may be explained in terms of its social
and territorial proximity to Africa. The countries that make up Groups 2 and 3 are
those that were either under communist rule or those that were experiencing vary-
ing degrees of political upheaval; Spain, for example, was making adjustments after
the end of Franco’s regime. The countries of Group 2 were poorer than those of
Group 3. Finally, Group 4 had the lowest percentage of employment in agriculture,
and the countries in this group were the most developed and were not under commu-
nist rule, with the exception of East Germany. Using k-means we get the following clusters: Turkey and Ireland are each isolated in a group of their own; Greece, Portugal, Poland, Romania and Yugoslavia form another group; and the remaining countries form a fourth cluster.


Table 4 CUBT clustering with five groups

Group 1 Group 2 Group 3 Group 4 Group 5

Turkey Greece Ireland Belgium W. Germany


Poland Portugal Denmark E. Germany
Romania Spain France Switzerland
Yugoslavia Bulgaria Italy
Czechoslovakia Luxembourg
Hungary Netherlands
USSR United Kingdom
Austria
Finland
Norway
Sweden

If we use five clusters instead of four, the main difference is that Group 4 of the
original partition is divided into two subgroups (4 and 5), and the variables that explain
these partitions are the percentages employed in mining and agriculture. The other
groups of the original partition remain stable (Table 4).
If a five-cluster structure is considered via the k-means algorithm, Turkey and
Ireland are then each isolated in single groups, and Greece, Portugal, Poland, Romania
and Yugoslavia form another group, as in the four-cluster structure. Switzerland and
East and West Germany make up a new cluster.

6 Concluding remarks

We have herein presented a new clustering method called CUBT which shares some
ideas with the well known classification and regression trees, defining clusters in
terms of binary rules over the original variables. Like CART, our method may be
very attractive and useful in a number of practical applications. Because the tree
structure makes use of the original variables, it helps to determine those variables
that are important in the conformation of the clusters. Moreover, the tree allows the
classification of new observations. In our approach, a binary tree is obtained in three
stages. In the first stage, the sample is split into two sub-samples, thereby reducing
the heterogeneity of the data within the new sub-samples according to the objective
function R(·). The procedure is then applied recursively to each sub-sample. In the
second and third stages, the maximal tree obtained at the first stage is pruned using a
dissimilarity criterion, first applied to the adjacent nodes and then to all the terminal
nodes. The algorithm is simple and requires a reasonable computation time. There are
no restrictions on the dimension of the data used. Our method is consistent under a
set of quite general assumptions, and produces quite good results with the simulated
examples that we considered here, as well as for an example that used real data.


The algorithm depends on several parameters, and an optimal way to choose them is beyond the scope of this paper. We herein offer some advice on how to choose them in practice. The splitting stage depends on two parameters, mindev and minsize. The former, mindev ∈ (0, 1), represents the percentage of the deviance of the whole data set (R(S)) that the algorithm requires in order to split a group: if the reduction in deviance is more than mindev × Δ(S, j_0, a_0), the group is split into two subgroups. Our experience indicates that values between 0.7 and 0.9 give a sufficiently large partition. The latter, minsize, indicates the minimum cluster size that the user admits; if such information is available beforehand it can be provided and taken into account by the algorithm, otherwise the default value should be 1. The pruning step also includes two parameters, δ and ε.
In population terms, it suffices that ε be smaller than the distance between the supports of two disjoint clusters. If the user does not provide a value for this parameter, the default is set to zero, which corresponds to skipping the pruning step and going directly to the joining step. In that case one would probably obtain a larger tree, but the final clustering allocation would not change and the results given in Theorem 3.2 still hold. The parameter δ is just a way to deal with the possible presence of outliers, and δ = 0.2 can be used as a default. Finally, if the number of clusters is not given, the final stage requires the additional parameter η, whose choice is explained in Sect. 4.3. An R package named Cubt is available and can be supplied by the authors upon request. It is easy to use, and it provides some default parameters as well as the option for the user to choose them. In Sect. 4 we provide heuristic and helpful ideas for choosing these parameters.
A more robust version of our method could be developed by substituting the covariance functional Cov(X_t) in equation (1) with a more robust covariance functional robcov(X_t) (see e.g. Maronna et al. 2006, Chapter 6, for a review), and then proceeding in the same manner as described herein. However, a detailed investigation of this alternative approach lies beyond the scope of the present study.
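As a simple illustration of this idea (our sketch; the choice of the MVE estimator from the MASS package is ours, not the authors'):

# A robustified node heterogeneity: replace trace(Cov(X_t)) in eq. (1) by the
# trace of a robust covariance estimate (here cov.rob from the MASS package).
library(MASS)

robust_node_deviance <- function(X, idx, n = nrow(X)) {
  Xt <- X[idx, , drop = FALSE]
  if (nrow(Xt) <= ncol(Xt) + 1) return(0)        # too few points for a robust fit
  (nrow(Xt) / n) * sum(diag(cov.rob(Xt)$cov))    # alpha_t * trace(robcov(X_t))
}

set.seed(7)
X <- rbind(matrix(rnorm(200), ncol = 2), c(50, 50))   # 100 clean points plus one gross outlier
sum(diag(cov(X)))                            # the classical trace is inflated by the outlier
robust_node_deviance(X, rep(TRUE, nrow(X)))  # the robust version is not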
Even though we focused this paper on the treatment of interval data, the procedure can be extended to other settings as long as a covariance structure and a dissimilarity measure can be defined. For example, one could use Gini's definition of variance for categorical data (Gini 1912) and the mean character difference as a measure of dissimilarity. Other measures of dissimilarity for categorical data can be found in Gan et al. (2007). However, a deeper study of these settings is beyond the scope of this paper.

7 Appendix

7.1 Proof of Lemma 3.1

We first observe that, because t_l and t_r are disjoint, E(X_{t_l∪t_r}) = γμ_l + (1 − γ)μ_r, where γ = P(X ∈ t_l | X ∈ t_l ∪ t_r). For j = 1, …, p, write

$$ M_{2i}^{(j)} = \int x(j)^2\, dF_{t_i}(x), \qquad i = l, r, $$

where F_{t_i} stands for the distribution function of the vector X_{t_i}. It then follows that E(X_{t_l∪t_r}(j)²) = γ M_{2l}^{(j)} + (1 − γ) M_{2r}^{(j)}, and therefore that

$$ \operatorname{var}(X_{t_l\cup t_r}(j)) = \gamma\, \operatorname{var}(X_{t_l}(j)) + (1-\gamma)\, \operatorname{var}(X_{t_r}(j)) + \gamma(1-\gamma)\,(\mu_l(j) - \mu_r(j))^2. \qquad (7) $$


By summing the terms in j we get the desired result.

7.2 Proof of Theorem 3.1

Let T be the family of polygons in R p with faces orthogonal to the axes, and fix
i ∈ {1, . . . , p} and t ∈ T. For a ∈ R denote by tl = {x ∈ t : x(i) ≤ a} and tr = t \ tl .
We define r (t, i, a) = R(t) − R(tl ) − R(tr ) and rn (t, i, a) = Rn (t) − Rn (tl ) − Rn (tr ),
the corresponding empirical version.
We start by showing the uniform convergence

$$ \sup_{a\in\mathbb{R}}\ \sup_{t\in T} |r_n(t, i, a) - r(t, i, a)| \to 0 \quad \text{a.s.} \qquad (8) $$

By Lemma 3.1,

$$ \alpha_t\, r(t, i, a) = \alpha_{t_l}\, \alpha_{t_r}\, \| \mu_l(a) - \mu_r(a) \|^2 , \qquad (9) $$

where α_A = P(X ∈ A) and μ_j(a) = E(X_{t_j}), j = l, r. Then, the pairs (i_{jn}, a_{jn}) and (i_j, a_j) are the arguments that maximise the right-hand side of equation (9) with respect to the measures P_n and P respectively. We observe that the right-hand side of equation (9) equals

$$ \frac{\alpha_{t_r}}{\alpha_{t_l}} \Big\| \int_{t_l} x\, dP(x) \Big\|^2 + \frac{\alpha_{t_l}}{\alpha_{t_r}} \Big\| \int_{t_r} x\, dP(x) \Big\|^2 - 2 \Big\langle \int_{t_l} x\, dP(x),\ \int_{t_r} x\, dP(x) \Big\rangle . \qquad (10) $$
In order to prove Eq. (8) it is sufficient to show that:


1. $\sup_{a\in\mathbb{R}} \sup_{t\in T} |P_n(t_j) - P(t_j)| \to 0$ a.s., $j = l, r$;
2. $\sup_{a\in\mathbb{R}} \sup_{t\in T} \big| \int_{t_j} \|x\|^2\, dP_n(x) - \int_{t_j} \|x\|^2\, dP(x) \big| \to 0$ a.s., $j = l, r$;
3. $\sup_{a\in\mathbb{R}} \sup_{t\in T} \big| \int_{t_j} x(i)\, dP_n(x) - \int_{t_j} x(i)\, dP(x) \big| \to 0$ a.s., $j = l, r$, $i = 1, \ldots, p$.
Since T is a Vapnik–Chervonenkis class, we have that (1) holds. Now observe that the
conditions for uniform convergence over families of sets still hold if we are dealing
with signed finite measures. Therefore, if we consider the finite measure ‖x‖² dP(x) and the finite signed measure given by x(i) dP(x), we also have that (2) and (3) both
hold.
Since

$$ \lim_{a\to\infty} \alpha_{t_l}\alpha_{t_r} \| \mu_l(a) - \mu_r(a) \|^2 = \lim_{a\to-\infty} \alpha_{t_l}\alpha_{t_r} \| \mu_l(a) - \mu_r(a) \|^2 = 0, \qquad (11) $$

we have that

$$ \inf\{\operatorname{argmax}_{a\in\mathbb{R}}\, r_n(t, i, a)\} \to \inf\{\operatorname{argmax}_{a\in\mathbb{R}}\, r(t, i, a)\} $$


a.s.
In the first step of the algorithm, t = R^p and we obtain i_{n1} = i_1 for n large enough and a_{n1} → a_1 a.s. In the next step, we observe that the empirical procedure begins to work with t_{nl} and t_{nr}, while the population algorithm will do so with t_l and t_r.

$$ \sup_{a\in\mathbb{R}} |r_n(t_{nj}, i, a) - r(t_j, i, a)| \le \sup_{a\in\mathbb{R}} \sup_{t\in T} |r_n(t, i, a) - r(t, i, a)| + \sup_{a\in\mathbb{R}} |r(t_{nj}, i, a) - r(t_j, i, a)|, \qquad (12) $$

for j = l, r .
We already know that the first term on the right-hand side of equation (12) converges to zero almost surely. In order to show that the second term also converges to zero, it is sufficient to show that
1. $\sup_{a\in\mathbb{R}} |P(t_{nj}) - P(t_j)| \to 0$ a.s., $j = l, r$;
2. $\sup_{a\in\mathbb{R}} \big| \int_{t_j} \|x\|^2\, dP(x) - \int_{t_{nj}} \|x\|^2\, dP(x) \big| \to 0$ a.s., $j = l, r$;
3. $\sup_{a\in\mathbb{R}} \big| \int_{t_j} x(i)\, dP(x) - \int_{t_{nj}} x(i)\, dP(x) \big| \to 0$ a.s., $j = l, r$, $i = 1, \ldots, p$,
which follows from the assumption that ‖x‖² f(x) is bounded. This concludes the proof, since minsize/n → τ.

7.3 Proof of Theorem 3.2

We need to show that we have consistency in both steps of the backward algorithm.
(i) Convergence of the pruning step. Let {t*_{1n}, …, t*_{m_n n}} be the output of the forward algorithm. The pruning-step partition of the algorithm converges to the corresponding population version from
– the conclusions of Theorem 3.1;
– the fact that the random variables W_{l(r)}, W_{r(l)} and d̃_l, d̃_r are positive;
– the uniform convergence of the empirical quantile function to its population version;
– the Lebesgue dominated convergence theorem.
The proof of convergence of the joining step is mainly the same as that for (i).

References

Andreopoulos B, An A, Wang X (2007) Hierarchical density-based clustering of categorical data and a simplification. Adv Knowl Discov Data Min, pp 11–22
Basak J, Krishnapuram R (2005) Interpretable hierarchical clustering by constructing an unsupervised decision tree. IEEE Trans Knowl Data Eng 17:121–132
Blockeel H, De Raedt L, Ramon J (1998) Top–down induction of clustering trees. In: Morgan K (ed)
Proceedings of the 15th international conference on machine learning, pp 55–63
Breiman L, Friedman J, Stone CJ, Olshen RA (1984) Classification and regression trees. CRC, Boca Raton
Chavent M, Guinot C, Lechevallier Y, Tenenhaus M (1999) Méthodes Divisives de Classification et Segmentation Non Supervisée: Recherche d'une Typologie de la Peau Humaine Saine. Rev Statistique Appliquée XLVII:87–99


Ester M, Kriegel H-P, Sander J, Xu X (1996) A density-based algorithm for discovering clusters in large
spatial databases with noise. In: Proceedings of 2nd international conference on knowledge discovery
and data mining. Portland, OR, pp 226–231
Fraley C, Raftery AE (2002) Model-based clustering, discriminant analysis, and density estimation. J Am
Stat Assoc 97:611–631
Fraley C, Raftery AE (2009) MCLUST Version 3 for R: Normal Mixture Modeling and Model-based
Clustering, Technical Report No. 504, Department of Statistics, University of Washington
Gan G, Ma C, Wu J (2007) Clustering data: theory, algorithms and applications. Springer, Berlin
García-Escudero L, Gordaliza A, Matrán C, Mayo-Iscar A (2010) A review of robust clustering methods.
Adv Data Anal Classif 4(2):89–109
Gini CW (1912) Variability and mutability, contribution to the study of statistical distributions and relations.
Studi Economico-Giuridici della R. Universita de Cagliari
Hastie T, Tibshirani R, Friedman JH (2003) The elements of statistical learning. Springer, Berlin
Hubert L, Arabie P (1985) Comparing partitions. J Classif 2:193–218
Iyigun C, Ben-Israel A (2008) Probabilistic distance clustering adjusted for cluster size. Prob Eng Inf Sci
22:603–621
Liu B, Xia Y, Yu PS (2000) Clustering through decision tree construction. CIKM ’00. In: Proceedings of
the ninth international conference on information and knowledge management. ACM, New York, NY,
USA, pp 20–29
Maronna RA, Martin DR, Yohai VJ (2006) Robust statistics: theory and methods. Wiley Series in Probability
and Statistics, Wiley, New York
Melia M (2012) Local equivalences of distances between clusterings—a geometric perspective. Mach Learn
86(3):369–389
Oh MS, Raftery A (2007) Model-based clustering with dissimilarities: a Bayesian approach. J Comput Gr
Stat 16(3):559–585
Papadimitriou C, Steiglitz K (1982) Combinatorial optimization: algorithms and complexity. Prentice Hall,
Englewood Cliffs
Peña D, Prieto FJ (2001) Cluster identification using projections. J Am Stat Assoc 96(456):1433–1445
Peña D, Prieto FJ (2006) A projection method for robust estimation and clustering in large data set.
In: Zani S, Cerioli A, Riani M, Vichi M (eds) Data analysis, classification and the forward search.
Springer, Berlin
Rand WM (1971) Objective criteria for the evaluation of clustering methods. J Am Stat Assoc 66:846–850
Steinley D, Brusco M (2007) Initializing K-means batch clustering: a critical evaluation of several tech-
niques. J Classif 24:99–121
Tryon RC (1939) Cluster analysis. McGraw-Hill, Columbus
Walther G (2010) Optimal and fast detection of spatial clusters with scan statistics. Ann Stat 24(2):
1010–1033

