1. Introduction
The k-means problem has received much attention in the past several decades. The k-means problem consists of partitioning a set $P$ of points in $d$-dimensional space $\mathbb{R}^d$ into $k$ subsets $P_1, \dots, P_k$ such that $\sum_{j=1}^{k} \sum_{p \in P_j} \|p - c_j\|^2$ is minimized, where $c_j$ is the center of $P_j$ and $\|p - q\|$ is the Euclidean distance between two points $p$ and $q$. The k-means problem is one of the classical NP-hard problems and has received much attention in the literature [1,2,3].
For many applications, each cluster of the point set may have to satisfy additional constraints, such as chromatic clustering [4], r-capacity clustering [5], r-gather clustering [6], fault-tolerant clustering [7], uncertain data clustering [8], semi-supervised clustering [9], and l-diversity clustering [10]. Constrained clustering problems were studied by Ding and Xu, who presented the first unified framework in [11]. Given a point set $P \subset \mathbb{R}^d$, a positive integer $k$, and a list of constraints $\mathbb{C}$, the constrained k-means problem is to partition $P$ into $k$ clusters $P_1, \dots, P_k$ such that all constraints in $\mathbb{C}$ are satisfied and $\sum_{j=1}^{k} \sum_{p \in P_j} \|p - c_j\|^2$ is minimized, where $c_j$ denotes the centroid of $P_j$.
In recent years, considerable research has focused on the constrained k-means problem. Ding and Xu [11] showed the first polynomial-time approximation scheme, with running time $O(2^{\mathrm{poly}(k/\epsilon)} (\log n)^{k+1} n d)$, for the constrained k-means problem, and obtained a collection of size $O(2^{\mathrm{poly}(k/\epsilon)} (\log n)^{k+1})$ of candidate approximate centers. The existing fastest approximation schemes for the constrained k-means problem take $O(n d \cdot 2^{\tilde{O}(k/\epsilon)})$ time [12,13], a bound first shown by Bhattacharya, Jaiswal, and Kumar [12]. Their algorithm gives a collection of size $2^{\tilde{O}(k/\epsilon)}$ of candidate approximate centers. In this paper, we propose the uncertain constrained k-means problem, which supposes that all points are random variables with probabilistic distributions; it can be regarded as a generalization of the constrained k-means problem. We prove random sampling properties of the uncertain constrained k-means problem, which are fundamental for our proposed algorithm. By applying random sampling and mathematical induction, we propose a stochastic approximation algorithm with lower complexity for the uncertain constrained k-means problem.
This paper is organized as follows. Some basic notations are given in
Section 2.
Section 3 provides an overview of the new algorithm for the uncertain constrained
k-means problem. In
Section 4, we discuss the detailed algorithm for the uncertain constrained
k-means problem. In
Section 5, we investigate the correctness, success probability, and running time analysis of the algorithm.
Section 6 concludes this paper and gives possible directions for future research.
2. Preliminaries
Definition 1 (Uncertain constrained k-means problem). Given a set $\mathcal{X} = \{X_1, \dots, X_n\}$ of random variables in $\mathbb{R}^d$, the probability density function $f_{X_i}$ for every random variable $X_i \in \mathcal{X}$, a list of constraints $\mathbb{C}$, and a positive integer $k$, the uncertain constrained k-means problem is to partition $\mathcal{X}$ into $k$ clusters $\{\mathcal{X}_1, \dots, \mathcal{X}_k\}$, such that all constraints in $\mathbb{C}$ are satisfied and $\sum_{j=1}^{k} \sum_{X \in \mathcal{X}_j} E[\|X - cen(\mathcal{X}_j)\|^2]$ is minimized, where $cen(\mathcal{X}_j)$ denotes the centroid of $\mathcal{X}_j$.
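For intuition, the expected squared distance to a fixed center splits into a deterministic part and a variance part; this standard identity is our addition for clarity, not a statement from the paper:
$$E\big[\|X - c\|^2\big] = \|E[X] - c\|^2 + E\big[\|X - E[X]\|^2\big].$$
Consequently, for a fixed partition, the uncertain objective equals the deterministic k-means objective on the mean points $E[X]$ plus a variance term that does not depend on the chosen centers.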
Definition 2 ([13]). Let $\mathcal{X} = \{X_1, \dots, X_n\}$ be a set of random variables in $\mathbb{R}^d$, $f_{X_i}$ be the probability density function for every random variable $X_i \in \mathcal{X}$, and let $p$ be a point and $P$ be a set of points in $\mathbb{R}^d$. Define $E[X_i] = \int_{\mathbb{R}^d} x f_{X_i}(x) \, dx$, the mean point of $X_i$.
Define $\Delta(\mathcal{X}, p) = \sum_{X \in \mathcal{X}} E[\|X - p\|^2]$.
Define $\Delta(\mathcal{X}, P) = \sum_{X \in \mathcal{X}} \min_{p \in P} E[\|X - p\|^2]$.
Definition 3 ([13]). Let $\mathcal{X}$ be a set of random variables in $\mathbb{R}^d$, $f_{X_i}$ be the probability density function for every random variable $X_i \in \mathcal{X}$, and $\{\mathcal{X}_1, \dots, \mathcal{X}_k\}$ be a partition of $\mathcal{X}$. Define $cen(\mathcal{X}) = \frac{1}{|\mathcal{X}|} \sum_{X \in \mathcal{X}} E[X]$, the centroid of $\mathcal{X}$.
Define $\Delta(\mathcal{X}) = \Delta(\mathcal{X}, cen(\mathcal{X}))$.
Define $\Delta(\{\mathcal{X}_1, \dots, \mathcal{X}_k\}) = \sum_{j=1}^{k} \Delta(\mathcal{X}_j, cen(\mathcal{X}_j))$.
Define $\Delta_{opt}$ as the minimum of $\Delta(\{\mathcal{X}_1, \dots, \mathcal{X}_k\})$ over all partitions of $\mathcal{X}$ satisfying the constraints in $\mathbb{C}$.
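To make Definitions 2 and 3 concrete, here is a minimal sketch, assuming each uncertain point is given as a discrete distribution (a list of (location, probability) pairs); the helper names mean_point, cen, and delta are ours, not the paper's.

```python
import numpy as np

def mean_point(X):
    """E[X]: the mean point of a discrete random variable X,
    given as a list of (location, probability) pairs."""
    return sum(p * np.asarray(x, dtype=float) for x, p in X)

def cen(Xs):
    """Centroid of a set of random variables: the average of their mean points."""
    return sum(mean_point(X) for X in Xs) / len(Xs)

def delta(Xs, c):
    """Delta(X, c) = sum over X of E[||X - c||^2] for a fixed point c."""
    c = np.asarray(c, dtype=float)
    return sum(p * np.sum((np.asarray(x, dtype=float) - c) ** 2)
               for X in Xs for x, p in X)

# Example: three uncertain points in the plane.
Xs = [
    [((0.0, 0.0), 0.5), ((2.0, 0.0), 0.5)],  # E[X1] = (1, 0)
    [((0.0, 2.0), 1.0)],                     # a deterministic point
    [((4.0, 4.0), 0.5), ((4.0, 2.0), 0.5)],  # E[X3] = (4, 3)
]
c = cen(Xs)
print("centroid:", c, " cost Delta(X, cen):", delta(Xs, c))
```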
Lemma 1. For any point $p$ and a random variable set $\mathcal{X}$, $\Delta(\mathcal{X}, p) = \Delta(\mathcal{X}, cen(\mathcal{X})) + |\mathcal{X}| \cdot \|p - cen(\mathcal{X})\|^2$.
Proof. Let $f_X$ be the probability density function for every random variable $X \in \mathcal{X}$. Then,
$$\begin{aligned} \Delta(\mathcal{X}, p) &= \sum_{X \in \mathcal{X}} E[\|X - p\|^2] \\ &= \sum_{X \in \mathcal{X}} E[\|(X - cen(\mathcal{X})) + (cen(\mathcal{X}) - p)\|^2] \\ &= \sum_{X \in \mathcal{X}} E[\|X - cen(\mathcal{X})\|^2] + 2 \Big\langle \sum_{X \in \mathcal{X}} (E[X] - cen(\mathcal{X})), \, cen(\mathcal{X}) - p \Big\rangle + |\mathcal{X}| \cdot \|cen(\mathcal{X}) - p\|^2 \\ &= \Delta(\mathcal{X}, cen(\mathcal{X})) + |\mathcal{X}| \cdot \|p - cen(\mathcal{X})\|^2. \end{aligned}$$
The third equality follows from the fact that $\sum_{X \in \mathcal{X}} (E[X] - cen(\mathcal{X})) = 0$. □
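As a quick numeric sanity check of Lemma 1 (our own sketch, reusing mean_point, cen, delta, and Xs from the previous snippet):

```python
import numpy as np

# Lemma 1: Delta(X, p) = Delta(X, cen(X)) + |X| * ||p - cen(X)||^2.
p = np.array([5.0, -1.0])
c = cen(Xs)
lhs = delta(Xs, p)
rhs = delta(Xs, c) + len(Xs) * np.sum((p - c) ** 2)
assert abs(lhs - rhs) < 1e-9, (lhs, rhs)
print("Lemma 1 verified on this instance:", lhs)
```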
Lemma 2. Let $\mathcal{X}$ be a set of random variables in $\mathbb{R}^d$ and $f_X$ be the probability density function for every random variable $X \in \mathcal{X}$. Assume that $\mathcal{S}$ is a set of random variables obtained by sampling $M$ random variables from $\mathcal{X}$ uniformly and independently. For $\forall \delta \in (0, 1)$,
we have:
$$\Pr\Big[\Delta(\mathcal{X}, cen(\mathcal{S})) \le \Big(1 + \frac{1}{\delta M}\Big) \Delta(\mathcal{X}, cen(\mathcal{X}))\Big] \ge 1 - \delta,$$
where $cen(\mathcal{S}) = \frac{1}{M} \sum_{X \in \mathcal{S}} E[X]$. Proof. First, observe that
$$E\big[\|cen(\mathcal{S}) - cen(\mathcal{X})\|^2\big] = \frac{1}{M |\mathcal{X}|} \sum_{X \in \mathcal{X}} \|E[X] - cen(\mathcal{X})\|^2 \le \frac{1}{M |\mathcal{X}|} \Delta(\mathcal{X}, cen(\mathcal{X})),$$
where the expectation is taken over the random sample $\mathcal{S}$. Then apply the Markov inequality and Lemma 1 to obtain the following:
$$\Pr\Big[\Delta(\mathcal{X}, cen(\mathcal{S})) \le \Big(1 + \frac{1}{\delta M}\Big) \Delta(\mathcal{X}, cen(\mathcal{X}))\Big] = \Pr\Big[\|cen(\mathcal{S}) - cen(\mathcal{X})\|^2 \le \frac{\Delta(\mathcal{X}, cen(\mathcal{X}))}{\delta M |\mathcal{X}|}\Big] \ge 1 - \delta.$$
□
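The sampling property can be illustrated empirically; the following is our own sketch of the zero-variance case, where each random variable equals its mean point, so the cost reduces to the deterministic part.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, M, delta = 1000, 2, 10, 0.1
means = rng.normal(size=(n, d))              # mean points E[X_i]
cen_X = means.mean(axis=0)
opt = np.sum((means - cen_X) ** 2)           # Delta(X, cen(X))

trials, failures = 2000, 0
for _ in range(trials):
    S = means[rng.integers(0, n, size=M)]    # uniform, independent sample
    cost = np.sum((means - S.mean(axis=0)) ** 2)
    failures += cost > (1 + 1 / (delta * M)) * opt
print("empirical failure rate:", failures / trials, "(Lemma 2 allows up to", delta, ")")
```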
Lemma 3. Let $\mathcal{X}$ be a set of random variables in $\mathbb{R}^d$, $f_X$ be the probability density function for every random variable $X \in \mathcal{X}$, and $\mathcal{S}$ be an arbitrary subset of $\mathcal{X}$ with $\alpha |\mathcal{X}|$ random variables for some $\alpha \in (0, 1)$. Then $\|cen(\mathcal{S}) - cen(\mathcal{X})\|^2 \le \frac{1 - \alpha}{\alpha} \sigma^2$, where $\sigma^2 = \Delta(\mathcal{X}, cen(\mathcal{X})) / |\mathcal{X}|$.
Proof. Let $\bar{\mathcal{S}} = \mathcal{X} \setminus \mathcal{S}$. By Lemma 1, we have the following two equalities.
$$\Delta(\mathcal{S}, cen(\mathcal{X})) = \Delta(\mathcal{S}, cen(\mathcal{S})) + |\mathcal{S}| \cdot \|cen(\mathcal{X}) - cen(\mathcal{S})\|^2,$$
$$\Delta(\bar{\mathcal{S}}, cen(\mathcal{X})) = \Delta(\bar{\mathcal{S}}, cen(\bar{\mathcal{S}})) + |\bar{\mathcal{S}}| \cdot \|cen(\mathcal{X}) - cen(\bar{\mathcal{S}})\|^2.$$
Let $t = \|cen(\mathcal{S}) - cen(\mathcal{X})\|$. By the definition of the mean point, we have:
$$cen(\mathcal{X}) = \alpha \, cen(\mathcal{S}) + (1 - \alpha) \, cen(\bar{\mathcal{S}}).$$
Thus, the three points $cen(\mathcal{S})$, $cen(\mathcal{X})$, $cen(\bar{\mathcal{S}})$ are collinear, while $\|cen(\mathcal{X}) - cen(\mathcal{S})\| = t$ and $\|cen(\mathcal{X}) - cen(\bar{\mathcal{S}})\| = \frac{\alpha}{1 - \alpha} t$. Meanwhile, by the definition of $\Delta$, we have $\Delta(\mathcal{X}, cen(\mathcal{X})) = \Delta(\mathcal{S}, cen(\mathcal{X})) + \Delta(\bar{\mathcal{S}}, cen(\mathcal{X}))$. Combining the two equalities above, we have:
$$\Delta(\mathcal{X}, cen(\mathcal{X})) \ge |\mathcal{S}| \cdot t^2 + |\bar{\mathcal{S}}| \cdot \Big(\frac{\alpha}{1 - \alpha}\Big)^2 t^2 = \Big(\alpha + \frac{\alpha^2}{1 - \alpha}\Big) |\mathcal{X}| \cdot t^2 = \frac{\alpha}{1 - \alpha} |\mathcal{X}| \cdot t^2.$$
Thus, we have $t^2 \le \frac{1 - \alpha}{\alpha} \cdot \frac{\Delta(\mathcal{X}, cen(\mathcal{X}))}{|\mathcal{X}|}$, which means that $\|cen(\mathcal{S}) - cen(\mathcal{X})\|^2 \le \frac{1 - \alpha}{\alpha} \sigma^2$. □
Lemma 4 ([12]). For any points $x, y, z \in \mathbb{R}^d$ and any $\alpha > 0$, $\|x - y\|^2 \le (1 + \alpha) \|x - z\|^2 + (1 + \frac{1}{\alpha}) \|z - y\|^2$.
Theorem 1 ([14]). Let $Y_1, \dots, Y_s$ be $s$ independent 0/1 random variables, where $Y_i$ takes value 1 with a probability of at least $p$, for $1 \le i \le s$. Let $Y = \sum_{i=1}^{s} Y_i$. Then, for any $0 < \eta < 1$, $\Pr[Y > (1 - \eta) p s] > 1 - e^{-\eta^2 p s / 2}$.
3. Overview of Our Method
In this section, we introduce the main idea of our method for solving the uncertain constrained k-means problem.
Consider the optimal partition $\{\mathcal{X}_1^*, \dots, \mathcal{X}_k^*\}$ of $\mathcal{X}$, and assume without loss of generality that $\mathcal{X}_1^*$ is the largest cluster. Since $|\mathcal{X}_1^*| \ge |\mathcal{X}| / k$, if we could sample a set $\mathcal{S}$ of size $N$ from $\mathcal{X}$ uniformly and independently, then at least $M$ random variables in $\mathcal{S}$ are from $\mathcal{X}_1^*$ with a certain probability. All subsets of $\mathcal{S}$ of size $M$ could be enumerated to discover the approximate center of $\mathcal{X}_1^*$.
We assume that $C$ is the set including the approximate centers of the clusters found so far, say $\mathcal{X}_1^*, \dots, \mathcal{X}_j^*$. Let $\mathcal{X}_r^*$ be the largest remaining cluster, where $r > j$. The set $\mathcal{X}_r^*$ is divided into two parts: the near part $\mathcal{X}_r^{near}$ and the far part $\mathcal{X}_r^{far}$, where the random variables in $\mathcal{X}_r^{near}$ are close to $C$ and those in $\mathcal{X}_r^{far}$ are far from $C$. For each random variable $X$, let $near(X)$ be the nearest point (particular random variable) in $C$ to $X$. Let $\mathcal{X}_r^{cen} = \{near(X) : X \in \mathcal{X}_r^{near}\}$, and $\mathcal{X}_r' = \mathcal{X}_r^{cen} \cup \mathcal{X}_r^{far}$.
If most of the random variables of $\mathcal{X}_r^*$ are in $\mathcal{X}_r^{far}$, our idea is to use the center of $\mathcal{X}_r^{far}$ to approximate the center of $\mathcal{X}_r^*$; the center of $\mathcal{X}_r^{far}$ is found based on uniform sampling. If most of the random variables of $\mathcal{X}_r^*$ are in $\mathcal{X}_r^{near}$, our idea is to replace the center of $\mathcal{X}_r^*$ with the center of $\mathcal{X}_r'$. For seeking out the approximate center of $\mathcal{X}_r^{far}$, we should find a subset by uniformly sampling from $\mathcal{X}_r^{far}$. However, the set $\mathcal{X}_r^{far}$ is unknown. We apply a branching strategy to find a set $\mathcal{X}'$ such that $\mathcal{X}_r^{far} \subseteq \mathcal{X}'$ and $\mathcal{X}_r^{far}$ makes up a large fraction of $\mathcal{X}'$. Then, a random variable set is obtained by sampling random variables from $\mathcal{X}'$ independently and uniformly, and the random variables of $\mathcal{X}_r^{near}$ can be represented by copies of points in $C$. Based on these two parts, the approximate center of $\mathcal{X}_r^*$ could be obtained. Therefore, the algorithm presented in this paper outputs a collection of candidate sets containing approximate centers, with the collection size and running time bounds established in Section 5.
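The first-branch primitive described above can be sketched as follows (our illustration, with arbitrary sample sizes; `means` holds the mean points $E[X]$ of the input random variables):

```python
from itertools import combinations
import numpy as np

def candidate_centers(means, N, M, rng):
    """Sample N variables uniformly with replacement, then return the centroid
    of every M-subset of the sample as a candidate approximate center."""
    sample = means[rng.integers(0, len(means), size=N)]
    return [np.mean(sub, axis=0) for sub in combinations(sample, M)]

rng = np.random.default_rng(1)
means = rng.normal(size=(500, 2))          # mean points of 500 uncertain points
cands = candidate_centers(means, N=8, M=3, rng=rng)
print(len(cands), "candidate centers")     # C(8, 3) = 56
```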
4. Our Algorithm cMeans
Given an instance of the uncertain constrained k-means problem, $\{\mathcal{X}_1^*, \dots, \mathcal{X}_k^*\}$ denotes an optimal partition of $\mathcal{X}$. There are six parameters $(\epsilon, \mathcal{X}, g, k, C, U)$ in our algorithm cMeans, where $\epsilon$ is the approximation factor, $\mathcal{X}$ is the input random variable set, $g$ is the number of cluster centers that remain to be found, $k$ is the number of clusters, $C$ is the set of approximate cluster centers found so far, and $U$ is a collection of candidate sets including the approximate centers. Let $M$ be the size of the subsets of the sampling set and $N$ be the size of the sampling set. Without loss of generality, assume that the values of $M$ and $N$ are integers.
We use the branching strategy to seek out the approximate centers of the clusters in $\{\mathcal{X}_1^*, \dots, \mathcal{X}_k^*\}$. There are two branches in our algorithm cMeans, which can be seen in Figure 1. On one branch, a size-$N$ set $\mathcal{S}$ is obtained by sampling from $\mathcal{X}$ uniformly and independently; $\mathcal{S}'$ is constructed from $\mathcal{S}$ and $M$ copies of each point in $C$. Moreover, we consider each subset $\mathcal{T}$ of size $M$ of $\mathcal{S}'$; the centroid $c$ of $\mathcal{T}$ is computed to represent the approximate center of the current cluster, and our algorithm cMeans($\epsilon$, $\mathcal{X}$, $g - 1$, $k$, $C \cup \{c\}$, $U$) is used to obtain the remaining $g - 1$ cluster centers.
On the other branch, for each random variable $X \in \mathcal{X}$, we calculate the distance between $X$ and $C$ first. $H$ denotes the set of all distances of random variables in $\mathcal{X}$ to $C$, where $H$ is a multi-set. We then obtain the median value $m$ of all values in $H$, which is the $\lceil |\mathcal{X}| / 2 \rceil$-th element if all of the values in $H$ are sorted. In the second branch, $\mathcal{X}$ is divided into two parts, $\mathcal{X}^{near}$ and $\mathcal{X}^{far}$, based on $m$, such that for $X \in \mathcal{X}^{near}$, $d(X, C) \le m$, and for $X \in \mathcal{X}^{far}$, $d(X, C) > m$, where $|\mathcal{X}^{near}| = \lceil |\mathcal{X}| / 2 \rceil$ and $|\mathcal{X}^{far}| = \lfloor |\mathcal{X}| / 2 \rfloor$. Subroutine cMeans($\epsilon$, $\mathcal{X}^{far}$, $g$, $k$, $C$, $U$) is used to obtain the remaining $g$ cluster centers. Therefore, we present the specific algorithm for seeking out a collection of candidate sets in Algorithm 1.
Algorithm 1: cMeans($\epsilon$, $\mathcal{X}$, $g$, $k$, $C$, $U$)
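Since the pseudocode box did not survive extraction, the following is a hedged Python sketch of the recursive branching structure described above; the sample sizes, the stopping rule, and the handling of U are illustrative assumptions rather than the paper's exact steps.

```python
from itertools import combinations
import numpy as np

def c_means(means, g, k, C, U, N, M, rng):
    """Sketch of cMeans on mean points. g: centers still to find; C: centers
    found so far; U: output collection of candidate center sets."""
    if g == 0:
        U.append(list(C))
        return
    # Branch 1: uniform sample of size N, padded with M copies of each found
    # center; the centroid of every M-subset becomes the next candidate center.
    sample = list(means[rng.integers(0, len(means), size=N)])
    sample += [c for c in C for _ in range(M)]
    for idx in combinations(range(len(sample)), M):
        c = np.mean([sample[i] for i in idx], axis=0)
        c_means(means, g - 1, k, C + [c], U, N, M, rng)
    # Branch 2: discard the half of the variables nearest to C and recurse
    # with the same g (meaningful only once some center has been found).
    if C and len(means) > 1:
        dist = np.min([np.sum((means - c) ** 2, axis=1) for c in C], axis=0)
        c_means(means[np.argsort(dist)[len(means) // 2:]], g, k, C, U, N, M, rng)

U = []
pts = np.random.default_rng(3).normal(size=(16, 2))
c_means(pts, g=2, k=2, C=[], U=U, N=4, M=2, rng=np.random.default_rng(4))
print(len(U), "candidate center sets")
```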
5. Analysis of Our Algorithm
We investigate the success probability, correctness, and time complexity analysis of the algorithm in this section.
Lemma 5. There exists a candidate set, with a probability of at least , including the approximate center in U satisfying .
The following lemmas, Lemmas 6 to 16, are used to prove Lemma 5. We prove Lemma 5 via induction on j. For the base case, the claim can be obtained easily, and we prove the success probability first.
Lemma 6. In the process of finding in our algorithm cMeans, by sampling a set of $N$ random variables from $\mathcal{X}$ independently and uniformly, denoted by $\mathcal{S}$, the probability that at least $M$ random variables in $\mathcal{S}$ are from $\mathcal{X}_1^*$ is at least .
Proof. In our algorithm cMeans, we assume that $\mathcal{S} = \{Z_1, \dots, Z_N\}$, where $Z_i$ is the $i$-th sampled random variable. Let $Y_1, \dots, Y_N$ be the corresponding indicator variables of the elements in $\mathcal{S}$: if $Z_i \in \mathcal{X}_1^*$, then $Y_i = 1$; otherwise, $Y_i = 0$. It is easy to see that $\Pr[Y_i = 1] = |\mathcal{X}_1^*| / |\mathcal{X}| \ge 1/k$. Let $Y = \sum_{i=1}^{N} Y_i$. We obtain that $E[Y] \ge N / k$. Then, the claimed probability bound follows by applying Theorem 1 to $Y$. □
From Lemma 6, a set of size can be obtained, and the probability that all points in are from is at least . Let denote the centroid of , and . For , by Lemma 2, we conclude that holds with a probability of at least . Then, the probability that a subset of size of can be found such that holds is at least . Therefore, we conclude that Lemma 5 holds for .
Moreover, we assume that for , Lemma 5 holds with a probability of at least . Considering the case , we prove Lemma 5 by the following two cases: (1); (2).
5.1. Analysis for Case 1:
Since , most of the random variables of are in . Our idea is to replace the center of with the center of . Thus, we need to find the approximate center of and the bound distance . We divide the distance into the following three parts: , , and . We first study the distance between and .
Lemma 7. .
Proof. Since and , the proportion of in is at least . By Lemma 3, . □
Lemma 8. .
Proof. Since
, and
, we can obtain the following:
□
Lemma 9. .
Proof. Since
, by Lemma 1, we have
. Then,
□
Lemma 10. In the process of finding in our algorithm cMeans, for the set in step 5, a subset of size of can be obtained such that all random variables in are from . Let be the centroid of . Then, the inequality holds with a probability of at least .
Proof. For each point
,
copies of
p are added to
in step 9 of our algorithm cMeans
. Thus, a subset
of size
of
can be obtained such that all random variables in
are from
. Let
. Since
, by Lemma 2,
holds with a probability of at least
. Assume that
. Then,
□
Lemma 11. If satisfies , then .
Proof. Assume that
satisfies
. Then,
□
5.2. Analysis for Case 2:
Let , and denote the centroid of . Our idea is to replace the center of with the center of . But it is difficult to seek out the center of . Thus, we try to find an approximate center of .
Lemma 12. .
Lemma 13. .
Lemma 14. .
Lemma 15. In the process of finding in our algorithm cMeans, we assume that satisfies and . For the set in step 5, a subset of size of can be obtained such that all random variables in are from with a probability of . Let denote the centroid of . Then, the inequality holds with a probability of at least .
Proof. In our algorithm cMeans, we assume that
, where
. Let
be the corresponding random variables of elements in
. If , then ; otherwise, . It is easy to see that
by Lemma 12. Let
. We obtain that
, and
Then, the probability that at least
random variables in
are from
is at least
. Since
copies of each point in
are added, a subset
of size
of
can be obtained, and the probability that all random variables in
are from
is at least
. Let
denote the centroid of
and
. For
and
, by Lemma 2,
holds with a probability of at least
. Assume that
. Then,
□
Lemma 16. If satisfies , then .
Proof. Assume that
satisfies
. Then,
□
Lemma 17. Given an instance of the uncertain constrained k-means problem, where the size of $\mathcal{X}$ is $n$, for any $\epsilon \in (0, 1)$, we assume that by using our algorithm cMeans($\epsilon$, $\mathcal{X}$, $k$, $k$, $C$, $U$) ($C$ and $U$ are initialized as empty sets), a collection $U$ of candidate sets including approximate centers is obtained. If there exists a set in $U$ satisfying the inequality of Lemma 5, then it yields a $(1 + \epsilon)$-approximation for the uncertain constrained k-means problem.
Proof. Assume that
is a set in
U satisfying the stated inequality. Then,
□
5.3. Time Complexity Analysis
We analyze the time complexity for our algorithm in this section.
Lemma 18. The time complexity of our algorithm cMeans is .
Proof. Let , where and . By Stirling's formula, . In our algorithm cMeans, steps 5–9 have a running time of , step 11 has a running time of , and steps 13–16 have a running time of . Let $T(g, n)$ denote the time complexity of algorithm cMeans, where $g$ is the number of cluster centers and $n$ is the size of $\mathcal{X}$.
If , . When , . Because , . Therefore, , where .
For and , the recurrence for $T(g, n)$ could be obtained as follows:
Because
, two constants
and
with
and
could be obtained to arrive at the following recurrence.
Now we claim that
. If
, then
. If
, then
, and the claim holds. Suppose that if
, the claim holds for
, and if
, the claim holds for
. We need to prove that:
The above formula can be simplified as
, which holds for
. For
,
. □
Thus, we can obtain the following Theorem 2.
Theorem 2. Given an instance of the uncertain constrained k-means problem, where the size of $\mathcal{X}$ is $n$, for any $\epsilon \in (0, 1)$, by using our algorithm cMeans, a collection $U$ of candidate sets including approximate centers can be obtained, with the success probability given in Lemma 5, such that $U$ includes at least one candidate set of approximate centers that yields a $(1 + \epsilon)$-approximation for the uncertain constrained k-means problem; the time complexity of our algorithm is given in Lemma 18.
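The Stirling-formula step in the proof of Lemma 18 bounds a binomial coefficient; the following quick check of the standard estimate $\binom{N}{M} \le (eN/M)^M$ (our assumption about which bound is used) illustrates it numerically:

```python
from math import comb, e

# Standard Stirling-type estimate: C(N, M) <= (e * N / M)^M.
for N, M in [(20, 5), (100, 10), (1000, 25)]:
    assert comb(N, M) <= (e * N / M) ** M
    print(f"C({N},{M}) = {comb(N, M):.3e} <= {(e * N / M) ** M:.3e}")
```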
6. Conclusions
In this paper, we first defined the uncertain constrained k-means problem and then presented a stochastic approximation algorithm for the problem in detail. We proposed a general mathematical model of the uncertain constrained k-means problem and studied its random sampling properties, which are very important for dealing with the uncertain constrained k-means problem. By applying a random sampling technique, we obtained a $(1 + \epsilon)$-approximation algorithm for the problem. We then investigated the correctness, success probability, and time complexity of our algorithm cMeans. However, there still exists a large gap between the current algorithms for the uncertain constrained k-means problem and practical algorithms for the problem, as similarly noted in [13].
We will try to explore a more practical algorithm for the uncertain constrained k-means problem in the future. It is known that the 2-means problem is the smallest version of the k-means problem and remains NP-hard. Approximation schemes for the 2-means problem can be generalized to solve the k-means problem. Owing to the special structure of the uncertain constrained 2-means problem, we will study approximation schemes for it and use them to reduce the complexity of approximation schemes for the uncertain constrained k-means problem. Additionally, we will apply the proposed algorithm to some practical problems.