Abstract
We study the problem of preclustering a set B of imprecise points in \({\mathbb {R}}^d\): we wish to cluster the regions specifying the potential locations of the points such that, no matter where the points are located within their regions, the resulting clustering approximates the optimal clustering for those locations. We consider k-center, k-median, and k-means clustering, and obtain the following results. Let \(B := \{b_1,\ldots ,b_n\}\) be a collection of disjoint balls in \({\mathbb {R}}^d\), where each ball \(b_i\) specifies the possible locations of an input point \(p_i\). A partition \(\mathcal {C}\) of B into subsets is called an \((f(k),\alpha )\)-preclustering (with respect to the specific k-clustering variant under consideration) if (i) \(\mathcal {C}\) consists of f(k) preclusters, and (ii) for any realization P of the points \(p_i\) inside their respective balls, the cost of the clustering on P induced by \(\mathcal {C}\) is at most \(\alpha \) times the cost of an optimal k-clustering on P. We call f(k) the size of the preclustering and we call \(\alpha \) its approximation ratio. We prove that, even in \({\mathbb {R}}^1\), one may need at least \(3k-3\) preclusters to obtain a bounded approximation ratio—this holds for the k-center, the k-median, and the k-means problem—and we present a \((3k,1)\)-preclustering for the k-center problem in \({\mathbb {R}}^1\). We also present various preclusterings for balls in \({\mathbb {R}}^d\) with \(d\geqslant 2\), including a \((3k,\alpha )\)-preclustering with \(\alpha \approx 13.9\) for the k-center and the k-median problem, and \(\alpha \approx 193.9\) for the k-means problem.
Notes
The subscript \(\infty \) in \({\textsc {Opt}}_{\infty }\) refers to the fact that if \(d_i\) denotes the distance of point \(p_i\in P\) to its nearest center, then we are minimizing the norm of the vector \(\langle d_1,\ldots ,d_n\rangle \) in the \(\ell _{\infty }\)-metric. For k-median and k-means we are minimizing the norm in the \(\ell _1\)-metric and in the squared \(\ell _2\)-metric, respectively.
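As an illustration of the three objectives described in this note, the sketch below evaluates a given clustering as the \(\ell _{\infty }\)-, \(\ell _1\)-, and squared \(\ell _2\)-norm of the distance vector \(\langle d_1,\ldots ,d_n\rangle \). The function names are ours, not from the paper:

```python
import math

def dist_to_nearest(p, centers):
    """Distance d_i from point p to its nearest center."""
    return min(math.dist(p, c) for c in centers)

def cost_kcenter(points, centers):
    # k-center: l_infinity-norm of the distance vector <d_1, ..., d_n>
    return max(dist_to_nearest(p, centers) for p in points)

def cost_kmedian(points, centers):
    # k-median: l_1-norm of the distance vector
    return sum(dist_to_nearest(p, centers) for p in points)

def cost_kmeans(points, centers):
    # k-means: squared l_2-norm of the distance vector
    return sum(dist_to_nearest(p, centers) ** 2 for p in points)

pts = [(0, 0), (3, 4), (10, 0)]
ctrs = [(0, 0), (10, 0)]
# distance vector is <0, 5, 0>:
print(cost_kcenter(pts, ctrs))   # 5.0
print(cost_kmedian(pts, ctrs))   # 5.0
print(cost_kmeans(pts, ctrs))    # 25.0
```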
Mark de Berg was supported by the Netherlands Organisation for Scientific Research (NWO) under project NETWORKS (project no. 024.002.003).
Appendices
A Generalizations of Lemmas 3 and 4
We present the generalizations of Lemmas 3 and 4 that are used to prove Theorem 7. Note that the parameter p can take any value between 1 and n/k.
Lemma 13
For any B-instance P the preclustering \(\mathcal {C}:= \{ \{b_1\}, \ldots , \{b_{(p-1)k} \},B_1,\ldots ,B_{k} \}\) computed by the generalized algorithm satisfies:
(i) \(\mathcal {C}{{\text {-}}\textsc {Cost}}_{\infty }(P) \leqslant {\textsc {Opt}}_{\infty }(P,k) + 2 \cdot {{\,\mathrm{radius}\,}}(b_{(p-1)k+1})\);

(ii) \(\mathcal {C}{{\text {-}}\textsc {Cost}}_1(P) \leqslant {\textsc {Opt}}_1(P,k) + 2 \sum _{j=(p-1)k+1}^n {{\,\mathrm{radius}\,}}(b_j)\);

(iii) \(\sqrt{\mathcal {C}{{\text {-}}\textsc {Cost}}_2(P)} \leqslant \sqrt{{\textsc {Opt}}_2(P,k)} + 2\sqrt{\sum _{j=(p-1)k+1}^n {{\,\mathrm{radius}\,}}(b_j)^2}\).
Proof
We first prove part (i) of the lemma. Let P be any B-instance, let \(p_j\in P\) denote the point inside \(b_j\), and let \(c_j\) be the center of \(b_j\). Recall that \(P_i\subset P\) is the subset of points in the instance corresponding to the precluster \(B_i\). Define \(P_{\mathrm {small}}:= \{ p_{(p-1)k+1},\ldots ,p_n\}\) to be the set of points from P in the small balls, and define \(C_{\mathrm {small}}:= \{ c_{(p-1)k+1},\ldots ,c_n\}\). Note that \(P_{\mathrm {small}}= P_1\cup \cdots \cup P_k\) and that \(|p_j c_j| \leqslant {{\,\mathrm{radius}\,}}(b_j) \leqslant {{\,\mathrm{radius}\,}}(b_{(p-1)k+1})\) for all \(p_j\in P_{\mathrm {small}}\). We define the following sets of centroids:
Let \(Q := \{q_1,\ldots ,q_k\}\) be the set of centroids in an optimal k-center solution for the entire point set P. We have
$$\begin{aligned} \max _{p_j\in P_{\mathrm {small}}} \; \min _{q_i\in Q} |p_j q_i| \leqslant \max _{p_j\in P} \min _{q_i\in Q} |p_j q_i| = {\textsc {Opt}}_{\infty }(P,k). \end{aligned}$$
(8)

Let \(Q' := \{q'_1,\ldots ,q'_k\}\) be the set of centroids in the optimal k-center clustering on \(C_{\mathrm {small}}\) used in Step 2 of the algorithm. Thus
$$\begin{aligned} \max _{c_i\in C_{\mathrm {small}}} \min _{q'_j \in Q'} |c_i q'_j| = {\textsc {Opt}}_{\infty }(C_{\mathrm {small}},k) \leqslant \max _{c_i\in C_{\mathrm {small}}} \min _{q_j \in Q} |c_i q_j|. \end{aligned}$$
(9)

Let \(Q'' := \{q''_1,\ldots ,q''_k\}\), where \(q''_i\) is the optimal centroid for \(P_i\). Note that for all \(P_i\) we have
$$\begin{aligned} \max _{p_j\in P_i}|p_j q''_i| \leqslant \max _{p_j\in P_i}|p_j q'_i|. \end{aligned}$$
(10)
Since the total cost of the singleton preclusters is trivially zero, we have
$$\begin{aligned} \mathcal {C}{{\text {-}}\textsc {Cost}}_{\infty }(P) = \max _{1\leqslant i\leqslant k}\; \max _{p_j\in P_i}|p_j q''_i| \leqslant \max _{1\leqslant i\leqslant k}\; \max _{p_j\in P_i}|p_j q'_i| \leqslant {{\,\mathrm{radius}\,}}(b_{(p-1)k+1}) + {\textsc {Opt}}_{\infty }(C_{\mathrm {small}},k) \leqslant 2\cdot {{\,\mathrm{radius}\,}}(b_{(p-1)k+1}) + {\textsc {Opt}}_{\infty }(P,k), \end{aligned}$$
where the first inequality follows from (10), the second from the triangle inequality \(|p_j q'_i| \leqslant |p_j c_j| + |c_j q'_i|\) together with the equality in (9), and the third from the triangle inequality \(|c_j q| \leqslant |c_j p_j| + |p_j q|\) together with the inequality in (9) and with (8).
To prove part (ii) of the lemma, which deals with the k-median problem, note that Inequality (2) still holds, while Inequalities (3)–(5) hold if we replace the \(\max \)-operator by a summation. Part (ii) thus follows from a derivation similar to the one for part (i).
To prove part (iii), which deals with the k-means problem, we need to work with squared distances. Note that Inequality (2) still holds, while Inequalities (3)–(5) hold if we replace the \(\max \)-operator by a summation and all distance values by their squared values. For squared distances the triangle inequality does not hold; instead we use the triangle inequality for the \(\ell _2\)-norm, known as the Minkowski inequality. A computation similar to the one above now proves part (iii); we have
On the other hand
By adding up the above inequalities we conclude part (iii) of the lemma. \(\square \)
Lemma 14
Let \(r^p_d\) be the smallest possible radius of any ball that intersects p disjoint unit balls in \({\mathbb {R}}^d\). Then
(i) \({\textsc {Opt}}_{\infty }(P,k) \geqslant r^p_d \cdot {{\,\mathrm{radius}\,}}(b_{(p-1)k+1})\);

(ii) \({\textsc {Opt}}_1(P,k) \geqslant r^p_d \cdot \sum _{j=(p-1)k+1}^{n} {{\,\mathrm{radius}\,}}(b_j)\);

(iii) \({\textsc {Opt}}_2(P,k) \geqslant (r^p_d)^2 \cdot \sum _{j=(p-1)k+1}^{n} {{\,\mathrm{radius}\,}}(b_j)^2\).
Proof
For part (i) notice that by the pigeonhole principle an optimal clustering must have a cluster containing at least p points from \(\{p_1,\ldots ,p_{(p-1)k+1}\}\). The cost of this cluster is lower bounded by the radius of the smallest ball intersecting p disjoint balls of radius at least \({{\,\mathrm{radius}\,}}(b_{(p-1)k+1})\), which is in turn lower bounded by \(r^p_d \cdot {{\,\mathrm{radius}\,}}(b_{(p-1)k+1})\).
For part (ii) let \(P_1, P_2, \ldots , P_k\) be the clusters in an optimal k-median clustering on P, and let \(q_i\) be the centroid of \(P_i\) in this clustering. Let \(B_i\) be the set of balls corresponding to the points in \(P_i\). We claim that
$$\begin{aligned} \sum _{p_j\in P_i} |p_j q_i| \; \geqslant \; r^p_d \cdot \Big ( \sum _{b_j\in B_i} {{\,\mathrm{radius}\,}}(b_j) - s_i \Big ), \end{aligned}$$
(11)
where \(s_i\) is the sum of the radii of the \(p-1\) largest balls in \(B_i\). To show this, let \(b(q_i,r)\) be the ball of radius r centered at \(q_i\), let \(P_i(r) := \{ p_j\in P_i: b_j \cap b(q_i,r)\ne \emptyset \}\) be the set of points in \(P_i\) whose associated ball intersects \(b(q_i,r)\), and let \(B_i(r)\) be the corresponding set of balls. Since for sufficiently large r we have \(P_i = P_i(r)\), it suffices to show that for all \(r>0\) we have
$$\begin{aligned} \sum _{p_j\in P_i(r)} |p_j q_i| \; \geqslant \; r^p_d \cdot \Big ( \sum _{b_j\in B_i(r)} {{\,\mathrm{radius}\,}}(b_j) - s_i(r) \Big ), \end{aligned}$$
where \(s_i(r)\) is the sum of the radii of the \(p-1\) largest balls in \(B_i(r)\). To prove this, consider the inequality as r increases from \(r=0\) to \(r=\infty \). As long as \(|P_i(r)|\leqslant p-1\) the right-hand side is zero, so the inequality is trivially true. As we increase r further, \(b(q_i,r)\) starts intersecting more and more balls from \(B_i\). Consider what happens to the inequality when \(b(q_i,r)\) starts intersecting another ball \(b_\ell \in B_i\). Then \(p_\ell \) is added to \(P_i(r)\), so the left-hand side of the inequality increases by \(|p_\ell q_i|\), which is at least r. The right-hand side increases by at most \(r^p_d\) times the radius of the p-th largest ball in \(B_i(r)\). By definition of \(r^p_d\), if p disjoint balls intersect a ball of radius r, then the smallest of them has radius at most \(r/r^p_d\). Hence, the right-hand side increases by at most r and the inequality remains true.
Recall that \(b_1,\ldots ,b_{(p-1)k}\) are the \((p-1)k\) largest balls in B. Hence, summing Inequality (11) over all clusters \(P_1,\ldots ,P_k\) gives
$$\begin{aligned} {\textsc {Opt}}_1(P,k) = \sum _{i=1}^{k} \sum _{p_j\in P_i} |p_j q_i| \; \geqslant \; r^p_d \cdot \Big ( \sum _{j=1}^{n} {{\,\mathrm{radius}\,}}(b_j) - \sum _{i=1}^{k} s_i \Big ) \; \geqslant \; r^p_d \cdot \sum _{j=(p-1)k+1}^{n} {{\,\mathrm{radius}\,}}(b_j), \end{aligned}$$
since \(\sum _{i=1}^{k} s_i\) is a sum of \((p-1)k\) radii and hence at most \(\sum _{j=1}^{(p-1)k} {{\,\mathrm{radius}\,}}(b_j)\).
For part (iii) the same proof as (ii) works if we replace all distances with squared distances. \(\square \)
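The charging argument in the proof of part (ii) lower-bounds the summed distances from an arbitrary center by \(r^p_d\) times the total radius of all but the \(p-1\) largest balls in the cluster. The following randomized sanity check (ours, not from the paper) tests this bound in \({\mathbb {R}}^1\) for \(p=3\), under the assumption that \(r^3_1 = 1\) (an interval meeting three disjoint intervals of radius at least 1 must itself have radius at least 1):

```python
import random

def sanity_check(seed, rp_d=1.0, p=3):
    """Check: sum of |p_j - q| >= rp_d * (total radius minus the p-1 largest radii)."""
    rng = random.Random(seed)
    balls = []                  # disjoint intervals (center, radius), built left to right
    x = 0.0
    for _ in range(rng.randint(p, 12)):
        r = rng.uniform(0.1, 2.0)
        x += rng.uniform(0.01, 1.0) + r      # positive gap, then the new radius
        balls.append((x, r))
        x += r
    points = [rng.uniform(c - r, c + r) for c, r in balls]   # one point per ball
    q = rng.uniform(-1.0, x + 1.0)                           # arbitrary center
    lhs = sum(abs(pt - q) for pt in points)
    radii = sorted((r for _, r in balls), reverse=True)
    rhs = rp_d * sum(radii[p - 1:])          # drop the p-1 largest radii
    return lhs >= rhs - 1e-9

assert all(sanity_check(s) for s in range(500))
```

This only spot-checks the inequality on random instances, of course; the proof above establishes it in general.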
B Proof of Theorem 9 for the k-Means Problem
For the k-means problem, observe that it suffices to prove the lower bound for \(k=1\): for larger k we can simply copy the construction k times and put the copies sufficiently far from each other. Now for \(k=1\), let \(n=\frac{2^{2t+1}-2}{3}\) and for every \(1\leqslant i \leqslant t\) let \(B_i\) be the set of \(2^{2t-2i}\) tiny intervals all very close to the point \(2^{i-1}\) in \({\mathbb {R}}^1\). Also let \(B_{-i}\) be the set of \(2^{2t-2i}\) tiny intervals all very close to the point \(-2^{i-1}\). Consider B as the union of \(B_{-t},\ldots ,B_{-1},B_1,\ldots ,B_t\). It is not difficult to see that \({\textsc {Opt}}_2(B,1)=2t\cdot 2^{2t-2}\) (taking the origin as the center). On the other hand, suppose that we precluster B into f(1) preclusters, using \(q_1,q_2,\ldots ,q_{f(1)}\) as centers. For \(1 \leqslant i \leqslant t\), let \(I_i\) be an open interval of length \(2^{i-2}\) whose midpoint is \(2^{i-1}\), and let \(I_{-i}\) be an open interval of length \(2^{i-2}\) whose midpoint is \(-2^{i-1}\). Note that \(I_{-t},\ldots ,I_{-1},I_1,\ldots ,I_t\) are disjoint. Thus, at least \(2t-f(1)\) of the intervals \(I_{-t},\ldots ,I_{-1},I_1,\ldots ,I_t\) contain no center. But when \(I_i\) (similarly \(I_{-i}\)) contains no center, every point of \(B_i\) is at distance at least \(2^{i-3}\) from its nearest center, so the total cost for \(B_i\) (similarly \(B_{-i}\)) is at least \(2^{2t-2i}\cdot (2^{i-3})^2 = 2^{2t-6}\). Therefore, the total cost for any preclustering of B into f(1) preclusters is at least \((2t-f(1))\cdot 2^{2t-6}\). For an \((f(1),\varepsilon )\)-preclustering we should have
$$\begin{aligned} (2t-f(1))\cdot 2^{2t-6} \; \leqslant \; \varepsilon \cdot {\textsc {Opt}}_2(B,1) = \varepsilon \cdot 2t\cdot 2^{2t-2}, \end{aligned}$$
which implies \(f(1)=\varOmega (t)\).
One can easily generalize this example to \({\mathbb {R}}^d\) for any d by taking balls whose diameters are the intervals in B introduced above.
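The arithmetic in the construction above is easy to verify numerically. The sketch below (ours, for illustration) collapses each tiny interval to its midpoint, which is the limiting case as the intervals shrink:

```python
def build_points(t):
    """One realization of B with each tiny interval collapsed to its midpoint:
    2^(2t-2i) copies of +2^(i-1) and of -2^(i-1), for i = 1, ..., t."""
    pts = []
    for i in range(1, t + 1):
        pts += [2 ** (i - 1)] * 2 ** (2 * t - 2 * i)
        pts += [-(2 ** (i - 1))] * 2 ** (2 * t - 2 * i)
    return pts

t = 5
pts = build_points(t)
# n = (2^(2t+1) - 2) / 3
assert len(pts) == (2 ** (2 * t + 1) - 2) // 3
# 1-means cost with the origin as center: 2t * 2^(2t-2)
assert sum(p * p for p in pts) == 2 * t * 2 ** (2 * t - 2)
# if I_i contains no center, B_i alone costs at least 2^(2t-2i) * (2^(i-3))^2 = 2^(2t-6)
assert all(2 ** (2 * t - 2 * i) * (2 ** (i - 3)) ** 2 == 2 ** (2 * t - 6)
           for i in range(1, t + 1))
```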
Abam, M.A., de Berg, M., Farahzad, S. et al. Preclustering Algorithms for Imprecise Points. Algorithmica 84, 1467–1489 (2022). https://doi.org/10.1007/s00453-022-00929-9