Abstract
Sparse clustering divides samples into distinct groups while simultaneously reducing dimensionality in ultra-high dimensional unsupervised learning. To achieve sparse clustering that screens out noise variables, an adaptive sparse clustering framework based on sufficient variable screening, abbreviated as adaptive sufficient sparse clustering (ASSC), is developed by controlling false discovery. Without assuming any specific model, ASSC employs a composite hypothesis testing procedure that leverages conditional marginal correlations of variables across distinct groups, aiming to pinpoint sufficient variables via sparse clustering and to improve the performance of sparse clustering in ultra-high dimensions. To control the false discoveries of the hypothesis testing procedure at a predetermined level, ASSC provides an adaptive threshold for identifying conditional marginal dependency, which ensures the accuracy of clustering with high probability. Under mild conditions, the sufficient screening properties and sufficient clustering properties of ASSC are established for the variable screening procedure with false discovery control. Numerical studies on synthetic data and real datasets corroborate the performance and flexibility of ASSC and underscore its utility in unsupervised learning.
Data availability
Sequence datasets that support the findings of this study were downloaded from the following websites: 1. gene expression cancer RNA-Seq: https://archive.ics.uci.edu/dataset/401/gene+expression+cancer+rna+seq 2. lung CT image dataset: https://www.kaggle.com/datasets/kmader/finding-lungs-in-ct-data
References
Adeen, N., Abdulazeez, M., Zeebaree, D.: Systematic review of unsupervised genomic clustering algorithms techniques for high dimensional datasets. Technol. Rep. Kansai Univ. 62(3), 355–374 (2020)
Amiri, S., Clarke, B.S., Clarke, J.L.: Clustering categorical data via ensembling dissimilarity matrices. J. Comput. Graph. Stat. 27(1), 195–208 (2018)
Benati, S., García, S., Puerto, J.: Mixed integer linear programming and heuristic methods for feature selection in clustering. J. Operat. Res. Soc. 69(9), 1379–1395 (2018)
Benjamini, Y., Hochberg, Y.: Controlling the false discovery rate: a practical and powerful approach to multiple testing. J. Roy. Stat. Soc.: Ser. B (Methodol.) 57(1), 289–300 (1995)
Bishop, C.M.: Pattern Recognition and Machine Learning. Springer, New York (2006)
Costa, V., Aprile, M., Esposito, R., Ciccodicola, A.: RNA-Seq and human complex diseases: recent accomplishments and future perspectives. Eur. J. Hum. Genet. 21(2), 134–142 (2013)
Cui, H., Li, R., Zhong, W.: Model-free feature screening for ultrahigh dimensional discriminant analysis. J. Am. Stat. Assoc. 110(510), 630–641 (2015)
Chipman, H.A., Tibshirani, R.: Hybrid hierarchical clustering with applications to microarray data. Biostatistics 7, 286–301 (2005)
Chang, J., Tang, C.Y., Wu, Y.: Marginal empirical likelihood and sure independence feature screening. Ann. Stat. 41(4), 2123–2148 (2013)
Chang, X., Wang, Y., Li, R., Xu, Z.: Sparse k-means with \(\ell _\infty \)/\(\ell _0\) penalty for high-dimensional data clustering. Stat. Sin. 28(3), 1265–1284 (2018)
Donoho, D.L.: High-dimensional centrally symmetric polytopes with neighborliness proportional to dimension. Discrete Comput. Geomet. 35(4), 617–652 (2006)
Fan, J., Feng, Y., Song, R.: Nonparametric independence screening in sparse ultra-high-dimensional additive models. J. Am. Stat. Assoc. 106(494), 544–557 (2011)
Fan, J., Lv, J.: Sure independence screening for ultrahigh dimensional feature space. J. R. Stat. Soc. B Stat. Methodol. 70(5), 849–883 (2008)
Fan, J., Song, R.: Sure independence screening in generalized linear models with np-dimensionality. Ann. Stat. 38(6), 3567–3604 (2010)
Guo, J., Levina, E., Michailidis, G., Zhu, J.: Pairwise variable selection for high-dimensional model-based clustering. Biometrics 66(3), 793–804 (2010)
Guo, X., Ren, H., Zou, C., Li, R.: Threshold selection in feature screening for error rate control. J. Am. Stat. Assoc. 1–13 (2022)
Han, X.: Nonparametric screening under conditional strictly convex loss for ultrahigh dimensional sparse data. Ann. Stat. 47(4), 1995–2022 (2019)
Hastie, T., Tibshirani, R., Friedman, J.H.: The Elements of Statistical Learning: Data Mining, Inference, and Prediction, 2nd edn. Springer, New York (2009)
Hallquist, M.N., Wiley, J.F.: MplusAutomation: an R package for facilitating large-scale latent variable analyses in Mplus. Struct. Equ. Model. 25(4), 621–638 (2018)
He, X., Wang, L., Hong, H.G.: Quantile-adaptive model-free variable screening for high-dimensional heterogeneous data. Ann. Stat. 41(1), 342–369 (2013)
Hao, N., Zhang, H.H.: A note on high-dimensional linear regression with interactions. Am. Stat. 71(4), 291–297 (2017)
Lin, X., Guan, J., Chen, B., Zeng, Y.: Unsupervised feature selection via orthogonal basis clustering and local structure preserving. IEEE Trans. Neural Netw. Learn. Syst. 1–12 (2021)
Liu, W., Ke, Y., Liu, J., Li, R.: Model-free feature screening and FDR control with knockoff features. J. Am. Stat. Assoc. 117(537), 428–443 (2022)
Lee, H., Li, J.: Variable selection for clustering by separability based on ridgelines. J. Comput. Graph. Stat. 21(2), 315–336 (2012)
Liu, W., Li, R.: Variable selection and feature screening. In: Fuleky, P. (ed.) Macroeconomic Forecasting in the Era of Big Data, pp. 293–326. Springer, Cham (2020)
Lu, J., Lin, L.: Model-free conditional screening via conditional distance correlation. Stat. Pap. 61(1), 225–244 (2020)
Li, G., Peng, H., Zhang, J., Zhu, L.: Robust rank correlation based screening. Ann. Stat. 40(3), 1846–1877 (2012)
Lim, D.K., Rashid, N.U., Ibrahim, J.G.: Model-based feature selection and clustering of RNA-Seq data for unsupervised subtype discovery. Annals Appl. Stat. 15(1), 481–508 (2021)
Lin, L., Sun, J., Zhu, L.: Nonparametric feature screening. Comput. Stat. Data Anal. 67, 162–174 (2013)
Li, Z., Tang, J.: Unsupervised feature selection via nonnegative spectral analysis and redundancy control. IEEE Trans. Image Process. 24(12), 5343–5355 (2015)
Li, R., Zhong, W., Zhu, L.: Feature screening via distance correlation learning. J. Am. Stat. Assoc. 107(499), 1129–1139 (2012)
MacQueen, J., et al.: Some methods for classification and analysis of multivariate observations. In: Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, vol. 1, pp. 281–297. University of California Press, Oakland (1967)
Mohamed, I.B., Mirakhmedov, S.M.: Approximation by normal distribution for a sample sum in sampling without replacement from a finite population. Sankhya A 78, 188–220 (2016)
Maji, P., Pal, S.K.: Fuzzy-rough feature selection using f-information measures, pp. 117–159 (2012)
Pollard, D.: Convergence of Stochastic Processes. Springer, Berlin (2012)
Qiu, H., Chen, J., Yuan, Z.: Quantile correlation-based sufficient variable screening by controlling false discovery rate. Adv. Theory Simul. 7(5), 2301099 (2024)
Rosenberg, J.M., Beymer, P.N., Anderson, D.J., van Lissa, C.J., Schmidt, J.A.: tidyLPA: an R package to easily carry out latent profile analysis (LPA) using open-source or commercial software. J. Open Sour. Softw. 3(30), 978 (2018)
Seo, B., Lin, L., Li, J.: Block-wise variable selection for clustering via latent states of mixture models. J. Comput. Graph. Stat. 31(1), 138–150 (2022)
Shao, S., Tunc, C., Al-Shawi, A., Hariri, S.: Automated twitter author clustering with unsupervised learning for social media forensics. In: 2019 IEEE/ACS 16th International Conference on Computer Systems and Applications (AICCSA), pp. 1–8 (2019)
Tan, M., Tsang, I.W., Wang, L.: Minimax sparse logistic regression for very high-dimensional feature selection. IEEE Trans. Neural Netw. Learn. Syst. 24(10), 1609–1622 (2013)
Tang, W., Xie, J., Lin, Y., Tang, N.: Quantile correlation-based variable selection. J. Bus. Econ. Stat. 1–13 (2021)
Von Luxburg, U.: A tutorial on spectral clustering. Stat. Comput. 17(4), 395–416 (2007)
Wallace, M.L., Buysse, D.J., Germain, A., Hall, M.H., Iyengar, S.: Variable selection for skewed model-based clustering: Application to the identification of novel sleep phenotypes. J. Am. Stat. Assoc. 113(521), 95–110 (2018)
Wang, Y., Chang, X., Li, R., Xu, Z.: Sparse k-means with the \(\ell _q(0 \le q < 1)\) constraint for high-dimensional data clustering. In: 2013 IEEE 13th International Conference on Data Mining, pp. 797–806 (2013)
Wan, Q., Dingerdissen, H., Fan, Y., Gulzar, N., Pan, Y., Wu, T.-J., Yan, C., Zhang, H., Mazumder, R.: BioXpress: an integrated RNA-Seq-derived gene expression database for pan-cancer analysis. Database 2015, bav019 (2015)
Wang, H., Pang, G., Shen, C., Ma, C.: Unsupervised representation learning by predicting random distances. In: Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence, pp. 2950–2956 (2021)
Witten, D., Tibshirani, R.: A framework for feature selection in clustering. J. Am. Stat. Assoc. 105, 713–726 (2010)
Wang, B., Zhang, Y., Sun, W.W., Fang, Y.: Sparse convex clustering. J. Comput. Graph. Stat. 27(2), 393–403 (2018)
Xie, J., Lin, Y., Yan, X., Tang, N.: Category-adaptive variable screening for ultra-high dimensional heterogeneous categorical data. J. Am. Stat. Assoc. 115(530), 747–760 (2020)
Yang, M.-S., Benjamin, J.B.M.: Sparse possibilistic c-means clustering with lasso. Pattern Recogn. 138, 109348 (2023)
Yuan, Q., Chen, X., Ke, C., Yin, X.: Independence index sufficient variable screening for categorical responses. Comput. Stat. Data Anal. 174, 107530 (2022)
Yuan, Z., Chen, J., Qiu, H., Huang, Y.: Quantile-adaptive sufficient variable screening by controlling false discovery. Entropy 25(3), 524 (2023)
Yin, X., Hilafu, H.: Sequential sufficient dimension reduction for large \(p\), small \(n\) problems. J. R. Stat. Soc. B (Stat. Methodol.) 77(4), 879–892 (2015)
Yao, C., Liu, Y.-F., Jiang, B., Han, J., Han, J.: Lle score: a new filter-based unsupervised feature selection method based on nonlinear manifold embedding and its application to image recognition. IEEE Trans. Image Process. 26(11), 5257–5269 (2017)
Yu, H., Wang, Y., Zeng, D.: A general framework of nonparametric feature selection in high-dimensional data. Biometrics 79(2), 951–963 (2023)
Zhu, L.-P., Li, L., Li, R., Zhu, L.-X.: Model-free feature screening for ultrahigh-dimensional data. J. Am. Stat. Assoc. 106(496), 1464–1475 (2011)
Zhao, Y., Shrivastava, A.K., Tsui, K.L.: Regularized gaussian mixture model for high-dimensional clustering. IEEE Trans. Cyber. 49(10), 3677–3688 (2018)
Zhou, T., Zhu, L., Xu, C., Li, R.: Model-free forward screening via cumulative divergence. J. Am. Stat. Assoc. 115(531), 1393–1405 (2020)
Acknowledgements
Many thanks to the reviewers for their positive feedback, valuable comments, and constructive suggestions, which helped improve the quality of this article. Many thanks also to the editors for their great help and coordination in the publication of this paper.
Funding
This research was supported by the National Natural Science Foundation of China under Grant No.81671633 to J.C.
Author information
Contributions
Conceptualization, J.C.; Data curation, Z.Y.; Formal analysis, Z.Y.; Funding acquisition, J.C.; Investigation, Z.Y.; Methodology, Z.Y.; Project administration, H.Q. and H.W.; Supervision, J.C.; Visualization, H.Q. and H.W.; Validation, Y.H.; Writing-original draft, Z.Y.; Writing-review and editing, Z.Y. All authors have read and agreed to the published version of the manuscript.
Ethics declarations
Conflict of interest
The authors declare no conflict of interest.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Appendices
Main proof
1.1 Proof of Lemma 1
Proof
According to the definition of \(\mathcal {A}_k\) in Assumption (i), for any \(\varvec{x}_{\mathcal {A}_k^c} \in \mathbb {R}_{\mathcal {A}_k^c}\), we have
Since \(\varvec{X} = ( \varvec{X}_{\mathcal {A}_k}, \varvec{X}_{\mathcal {A}_k^{c}} )\), for any \(\varvec{x}_{\mathcal {A}_k^c} \in \mathbb {R}_{\mathcal {A}_k^c}\), multiplying both sides of Eq. (A1) by \(\textrm{Pr} (\varvec{X}_{\mathcal {A}_k^{c}} \le \varvec{x}_{\mathcal {A}_k^c} \mid \varvec{X}_{\mathcal {A}_k})\) yields
By the multiplication theorem of probability, it is clear that \( \textrm{I} ( \varvec{X} \in \mathcal {C}_k ) {\not \perp \!\!\! \perp } \mathcal {A}_k^{c} \mid \mathcal {A}_k. \) By invertibility, we have proved that
Thus, \(\mathcal {A}_k\) is the sufficient screening variables index set. When \(K = 2\), it is easy to find that \(\textrm{Pr} (\varvec{X} \in \mathcal {C}_2 | \varvec{X}) = 1 -\textrm{Pr} (\varvec{X} \in \mathcal {C}_1 | \varvec{X})\). Thus, \(\mathcal {A}_1 = \mathcal {A}_2\).
Note that \(\mathcal {A}_k^c = \{ 1 \le j \le p: \textrm{I} ( \varvec{X} \in \mathcal {C}_k ) {\perp \!\!\! \perp } X_j | \varvec{X}_{\mathcal {A}_k}\}\). This indicates that for any \(\varvec{x}_{\mathcal {A}_k^c}\) belonging to the support set \(\mathbb {R}_{\mathcal {A}_k^c}\), we have
Thus, we obtain that \(F_j(x\mid \varvec{X} \in \mathcal {C}_k) - F_j(x) = 0\) holds when \(j \notin \mathcal {A}_k\). \(\square \)
1.2 Proof of Proposition 1
Proof
Suppose Assumptions (i)–(ii) hold. If \(x_j\) is not sufficiently active for the kth cluster \(\mathcal {C}_k\), we have \(F_{j, k}\left( x\right) = \textrm{Pr} \left( x_j \le x | \varvec{X} \in \mathcal {C}_k \right) = \textrm{Pr} (x_j \le x) = F_{j}\left( x\right) \). By the definition in Eq. (3), we obtain
Hence, \(\upsilon _{j k} = 12 \cdot (n + 1) \cdot p_k / (1 - p_k) \cdot \tau _{j k}^2 = 0\). When \(x_j\) is not a sufficient variable for the partition \(\mathbb {C}\), it implies that \(\upsilon _{j k} = 0\) holds for all \(k = 1, \ldots , K\). Finally, we have \(\sum _{k=1}^{K} \upsilon _{j k} = 0\). \(\square \)
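For concreteness, these rank-based utilities can be computed directly from a clustered sample. The sketch below is illustrative only: it assumes \(\xi _{ij} = \sum _{k \ne i} \textrm{I}(X_{ij} \le X_{kj})\) as in the proof of Lemma 2, takes \(\hat{\tau }_{j k}\) to be the centered within-cluster mean of the normalized ranks \((\xi _{ij} + 1)/(n+1)\) (consistent with the definition of \(\hat{\zeta }_{j}(k)\) in the proof of Theorem 1), and uses hypothetical function names rather than the authors' implementation.

```python
import numpy as np

def cluster_utilities(X, labels, K):
    """Sketch of the screening utilities tau_hat[j, k] and
    upsilon_hat[j, k] = 12 (n+1) p_k / (1 - p_k) * tau_hat[j, k]**2.
    Assumes continuous columns (no ties) and labels in {0, ..., K-1}."""
    n, p = X.shape
    ranks = X.argsort(axis=0).argsort(axis=0) + 1  # 1-based column-wise ranks
    xi = n - ranks                                 # xi[i, j] = #{k != i: X[i,j] <= X[k,j]}
    tau = np.empty((p, K))
    upsilon = np.empty((p, K))
    for k in range(K):
        mask = labels == k
        p_k = mask.mean()                          # empirical mixing proportion
        # centered mean of normalized ranks within cluster k
        tau[:, k] = (xi[mask] + 1).mean(axis=0) / (n + 1) - 0.5
        upsilon[:, k] = 12 * (n + 1) * p_k / (1 - p_k) * tau[:, k] ** 2
    return tau, upsilon
```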
1.3 Proof of Lemma 2
Proof
Let \(\Omega = \{{0}, {1}, \ldots , {n-1}\}\), and let \(\Omega _k = \{ \xi _1, \xi _2, \ldots , \xi _e \}\) \((e=1, \ldots , n)\) be a sample of size e drawn with equal probability and without replacement from the population, where \(\xi _{i}\) \((i = 1, 2, \ldots , e)\) denotes the ith random draw and follows a discrete uniform distribution on \(\{0, \ldots , n-1\}\), that is, \(\xi _{i} \sim {\text {DiscreteU}}({0}, n-1)\). \(\xi _{i}\) has the following properties:
For all \(i_1 \ne i_2\), where \(i_1, i_2 \in \{1, 2, \ldots , n\}\), we get
For continuous variables subject to an arbitrary distribution \(\varvec{X}_j=\left\{ X_{1 j}, X_{2 j}, \ldots , X_{n j} \right\} \), let \(\xi _{ij} = \sum _{k \ne i}^{n} \textrm{I} ( X_{i j} \le X_{k j} )\). It is easy to see that the random variable \(\xi _{ij}\) \((i = 1, 2, \ldots , n)\) is a special case of the setting in Mohamed and Mirakhmedov (2016) with \(e = n\). According to the definitions of \(\hat{\tau }_{j k}\) and \(\xi _{i j}\), it clearly holds that \(\textrm{Pr}(\xi _i = r)= 1/n\) \((i = 1, 2, \ldots , n;\ r = 0, 1, \ldots , n-1)\) and \(\xi _{i_1} \ne \xi _{i_2}\) \((\forall i_1 \ne i_2)\). By the definition of \(\xi _{i j}\), for all \(j=1,2,\ldots , p\) and \(k=1,2,\ldots ,K\),
If \(\textrm{H}_{0, j, k}\) is true, the expectation and variance of \(\hat{\tau }_{j k}\) can be obtained as \(\mathbb {E}(\hat{\tau }_{j k}) = 0\) and \(\textrm{Var}(\hat{\tau }_{j k}) = \frac{1 - \hat{p}_k}{12 (n+1) \hat{p}_k}\).
Let \(\Omega _{n} = (\zeta _{1 n}, \ldots , \zeta _{n n} )\) be a random permutation of \(\{0, 1, \ldots ,n-1\}\), and let \(r=(r_1, \ldots , r_n)\) be a random vector independent of \(\zeta _{1 n}, \ldots , \zeta _{n n}\) satisfying \(\textrm{Pr}\{ r_1 = k_1, \ldots , r_{n} = k_{n} \} = 1 / (n!)\), where \(k=(k_1,\ldots ,k_{n})\) is a random permutation of \(1,\ldots , n\). Note that \(S_{n \hat{p}_k, n} = \zeta _{r_1 n} + \cdots + \zeta _{r_{n \hat{p}_k} n}\) represents the sum of \(n \hat{p}_k\) samples chosen at random without replacement from the population \(\Omega _{n}\). It can be expressed equivalently as \(S_{n \hat{p}_k, n} = \textrm{I} \{ \varvec{X}_1 \in \hat{\mathcal {C}}_k \} \zeta _{1 n} + \cdots + \textrm{I} \{ \varvec{X}_n \in \hat{\mathcal {C}}_k \} \zeta _{n n}\), where \(\textrm{I}(\cdot )\) denotes the indicator function. It follows that \(\sum _{i = 1}^{n} \textrm{I}\left( \varvec{X}_i \in \hat{\mathcal {C}}_k\right) \xi _{i j}\) and \(S_{n \hat{p}_k, n}\) have the same distribution. Denote
Let \(F_{n}(u)=\textrm{Pr}\left\{ S_{n \hat{p}_k, n}< u \sigma ^{*} \sqrt{n \hat{p}_k}+ n \hat{p}_k \tau \right\} \). Since \(S_{n \hat{p}_k, n}\) and \(\sum _{i = 1}^{n} \textrm{I} (\varvec{X}_i \in \hat{\mathcal {C}}_k) \xi _{i j}\) have the same distribution, for any j we have
Considering
where the continuous variable \(x_j\) satisfies \(\textrm{Pr}(x_{ij} = x_{kj}) = 0\). Define \(\xi _{ij}^{\prime } = \sum _{k \ne i}^{n} \textrm{I} (X_{i j} \ge X_{k j})\); then \(\xi _{ij}^{\prime }\) has the same distribution as \(S_{n \hat{p}_k, n}\), and consequently \(\xi _{ij}^{\prime }\) has the same distribution as \(\xi _{ij}\). In other words, \(\hat{\tau }_{j k}\) and \(-\hat{\tau }_{j k}\) have the same distribution, that is, \(F_{\hat{\tau }_{j k}}(u) = 1 - F_{\hat{\tau }_{j k}}(-u)\). If \(\lim \limits _{n \rightarrow \infty } \hat{p}_{k} (1 - \hat{p}_{k})>0\), then by Theorem 3.4 and Corollaries 3.4–3.6 of Mohamed and Mirakhmedov (2016), we obtain the following:
-
For all \(u = o(\sqrt{n})\ge 0\), we obtain that
where \(L_{n}(v)\) is an odd power series that, for all sufficiently large n, is majorized by a power series with coefficients not depending on n and convergent in some disc; moreover, \(L_{n}(v)\) converges uniformly in n for sufficiently small values of v, with \(L_{n}(0) = 0\).
-
For all \(u = o(n^{1/4}) \ge 0\), we obtain that
-
For all \(u = o(n^{1 / 6}) \ge 0\), we obtain that
-
For all u satisfies that \(0 \le u \le C \min \left( \frac{\sqrt{{n^{*}} \hat{q}} }{ \max |\hat{Y}_{m N}|},\frac{({n^{*}} \hat{q})^{1 / 6}}{ B_{3}^{1 / 3}}\right) \), we have
Based on the above cases, using Taylor’s expansion of Eqs. (A2)–(A5), the following conclusion can be obtained by comparing the orders of u and n: if \(\textrm{H}_{0, j, k} (j = 1, 2, \ldots , p; k = 1, 2, \ldots , K)\) is true, then
where \(\Phi (-u) = 1/(2p) = O(n^{\alpha })\) and \(\alpha \le 1/2\). In other words, \(\hat{\tau }_{j k}\) has the asymptotic normal distribution \(\mathcal {N} (0, {(1 - \hat{p}_k)}/{(12 (n+1) \hat{p}_k)} )\) for all \(j \in \{ 1, 2, \ldots , p \}\) and all \(k \in \{ 1, 2, \ldots , K \}\). \(\square \)
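A small simulation can corroborate this null law. The sketch below, under the same assumed form of \(\hat{\tau }_{j k}\) as above and with cluster labels drawn independently of the data (so that \(\textrm{H}_{0, j, k}\) holds), compares the empirical standard deviation of \(\hat{\tau }_{j k}\) with \(\sqrt{(1 - \hat{p}_k)/(12 (n+1) \hat{p}_k)}\).

```python
import numpy as np

rng = np.random.default_rng(0)
n, p_k, reps = 500, 0.3, 5000
m = int(n * p_k)                                    # cluster size under H0
sd_theory = np.sqrt((1 - p_k) / (12 * (n + 1) * p_k))

taus = np.empty(reps)
for r in range(reps):
    x = rng.standard_normal(n)                      # one noise variable
    members = rng.choice(n, size=m, replace=False)  # labels independent of x
    xi = n - (x.argsort().argsort() + 1)            # xi_i = #{k != i: x_i <= x_k}
    taus[r] = (xi[members] + 1).mean() / (n + 1) - 0.5

print(taus.std(), sd_theory)                        # the two should nearly coincide
```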
1.4 Proof of Corollary 1
Proof
By the definition of \(\hat{\upsilon }_{j k}\), one can easily see that \(\hat{\upsilon }_{j k}\) has the asymptotic distribution \(\chi ^2_1\), the chi-square distribution with one degree of freedom. Since \(\sum _{k=1}^{K}\hat{p}_k \hat{\tau }_{j k} = 0\), \(\hat{\upsilon }_{j}=\sum _{k=1}^{K}\hat{\upsilon }_{j k}\) has the asymptotic distribution \(\chi ^2_{K-1}\) with \(K-1\) degrees of freedom. \(\square \)
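Corollary 1 reduces screening to ordinary chi-square calibration. A minimal sketch, reusing the output of the hypothetical `cluster_utilities` above:

```python
from scipy import stats

def screening_pvalues(upsilon, K):
    """Asymptotic p-values for H0_j: variable j is inactive for the partition.
    upsilon is the (p, K) array of per-cluster utilities; by Corollary 1 the
    row sums are asymptotically chi-square with K - 1 degrees of freedom."""
    return stats.chi2.sf(upsilon.sum(axis=1), df=K - 1)
```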
1.5 Proof of Theorem 1
Proof
Let \(p_{k}=\textrm{Pr}\left( \varvec{X} \in \mathcal {C}_k\right) \), \(F_{j}(k, x)=\textrm{I}\left( X_{j} \le x, \varvec{X} \in \mathcal {C}_k\right) \) and \(\eta _{k}=\textrm{I}\left( \varvec{X} \in \mathcal {C}_k\right) \). For some \(j=1, \ldots , p\), let \(\{X_{i j}: i=1, \ldots , n\}\) be a random sample of \(X_j\). Write \(\eta _{k i}=\textrm{I}\left( \varvec{X}_i \in \mathcal {C}_k\right) \), \(f_{i j}(k, x)=\textrm{I}\left( X_{i j} \le x, \varvec{X}_i \in \mathcal {C}_k\right) \), \(\zeta _{j}(k)=\mathbb {E}_{x}\left\{ F_{j, k}(x)\right\} \), \(\tilde{\zeta }_{j}(k)=\frac{1}{n^{2}} \sum _{l=1}^{n} \sum _{i=1}^{n} \frac{\textrm{I}\left( X_{i j} \le X_{l j}, \varvec{X}_i \in \mathcal {C}_k\right) }{\hat{p}_{k}}\) and \(\hat{\zeta }_{j}(k)=\frac{1}{n(n+1)} \sum _{l=1}^{n} \sum _{i=1}^{n} \frac{\textrm{I}\left( X_{i j} \le X_{l j}, \varvec{X}_i \in \mathcal {C}_k\right) }{\hat{p}_{k}}\). Note that \(\eta _{k i}\) follows \(\textrm{Bernoulli}\left( p_{k}\right) \) with \(\mathbb {E}\left( \eta _{k i}\right) =p_{k}\), and \(\sum _{i=1}^{n} \eta _{k i}\) follows \(\textrm{Binomial}\left( n, p_{k}\right) \). By Bernstein’s inequality, for any k and j, we have
Since \(\left| \textrm{I}\left( X_{i j} \le x\right) \right| \le 1\), it can be proved straightforwardly by Hoeffding’s inequality and empirical process theory (Pollard 2012) that, for given \(k=1, \ldots , K\) and \(j=1, \ldots , p\), we get
By Bernstein’s inequality, empirical process theory (Pollard 2012), and
for given \(k=1, \ldots , K\) and \(j=1, \ldots , p\), we obtain
Note that
where \(\hat{F}_{j}(x)=n^{-1} \sum _{i=1}^{n} \textrm{I}\left( X_{i j} \le x\right) \) and \(F_{j}(x)=\textrm{Pr}\left( X_j \le x\right) \).
In terms of \(\textrm{I}_{1}\), from the inequality
and Eq. (A8), we have
In terms of \(\textrm{I}_{2}\), noting that
write
Thereby, from Eqs. (A6) and (A7), for any \(\epsilon >0\), we have that
where \(c_{4}=\frac{\left( c_{1} / 4 K\right) ^{2}}{2\left( p_{k}K+\epsilon c_{1} / 12 \right) } >0\) and the inequality above holds based on the following inequality
Under Condition C3, \(c_{4} n \epsilon ^{2} / K<2 n \epsilon ^{2}\). Therefore, we have
Let \(\hat{\zeta }_{j}(k)-\zeta _{j}(k) \equiv \frac{n}{n+1}\left\{ \tilde{\zeta }_{j}(k)-\zeta _{j}(k)\right\} +\Delta \), where \(\Delta =-\zeta _{j}(k) /(n+1)\). It follows from Condition C3 and Eqs. (A5) and (A9) that \(K=O\left( n^{\xi }\right) \) for \(\xi + 2\kappa < 1\). By letting \(\epsilon = c_{3}^{*} n^{-\kappa } = c_{3}^{*} n^{-1/(2\beta )}\) for \(0 \le \kappa < 1 / 4\) and \(c_{3}^{*}>0\), we have
where \(c_{4}\) is a constant. The last inequality above holds due to \(\Delta = O(1 / n)\). Let \(c_3 = 12 (n+1) \hat{p}_k / (1-\hat{p}_k)c_3^{*} \), by the definition of \(\hat{\upsilon }_{j k}\), we have
where we use the inequality \(|\hat{\tau }_{j, k}+\tau _{j, k}| \le |\hat{\tau }_{j, k}| + |\tau _{j, k}| \le 1/2 + 1/2 = 1\). \(\square \)
1.6 Proof of Theorem 2
Proof
Let \(c_{3} n^{-\kappa } = \rho _0 / 4\). It follows from Condition C2 and Eq. (A9) that
where \(c_{4}^{*}>0\) is a constant. The conditions \(K \log (n)= o\left( n \rho _{0}^{2}\right) \) and \(K \log (p)=o\left( n \rho _{0}^{2}\right) \) ensure that there exists some \(n_{0}>0\) such that for \(n>n_{0}\), \(p \le \exp \left( c_{4}^{*} n \rho _{0}^{2} / 2 K\right) \) and \(c_{4}^{*} n \rho _{0}^{2} / 2 K \ge 3 \log \{4(n+2)\}\). Consequently, we have that
By the Borel–Cantelli lemma,
almost surely. \(\square \)
1.7 Proof of Corollary 2
Proof
Suppose Assumptions (i)–(ii) and Conditions (C1)–(C3) hold. First, according to Theorem 1, for some constant \(c_4>0\), we have
Write \(c_5 = (\sum _{k=1}^{K}s_k) \cdot c_5^{*}\). Then, we obtain
Similarly, letting \(c_6 = s \cdot K \cdot c_6^{*}\), we have
\(\square \)
1.8 Proof of Lemma 3
Proof
Suppose Assumptions (i)–(ii) and Conditions (C1)–(C2) hold. For convenience, in this proof the sparse clustering utility \(\Upsilon \) is regarded as a function of a given partition \(\mathbb {C}= \{\mathcal {C}_1, \ldots , \mathcal {C}_K\}\); that is, denote the sparse clustering utilities of hypothesis tests (2) and (8) by \({\Upsilon }(\mathbb {C})\) and \({\Upsilon }^{*}(\mathbb {C})\), respectively. Correspondingly, write the sufficient variable screening utilities of the jth variable as \({\upsilon }_{j k}(\mathbb {C})\) and \({\upsilon }_{j}(\mathbb {C})\).
Let the true partition \(\mathbb {C}_0 = \left\{ \mathcal {C}_{0,1}, \ldots , \mathcal {C}_{0,K} \right\} \) separate the samples into K groups with the true mixture distributions \(\mathcal {F}_0 = \{F_{0,1}, \ldots , F_{0,K} \}\), where the mixing proportions satisfy \(p_{0k} = \textrm{Pr}(\mathcal {C}_{0, k} \mid \varvec{x})\). For any given partition \(\mathbb {C} = \left\{ \mathcal {C}_{1}, \ldots , \mathcal {C}_{K} \right\} \), write the corresponding mixture distributions as \(\mathcal {F} = \{F_{1}, \ldots , F_{K} \}\) with mixing proportions \(p_{k} = \textrm{Pr}(\varvec{X} \in \mathcal {C}_{k} )\). Denote \(\varvec{p} = (p_1, \ldots , p_K)^{\textsf{T}}\), \(\varvec{p}_0 = (p_{01}, \ldots , p_{0K})^{\textsf{T}}\) and \(\varvec{P}=[p_{kl}]_{K \times K}\), where the element \(p_{k l} = \textrm{Pr}(\varvec{X} \in \mathcal {C}_{k} \mid \varvec{X} \in \mathcal {C}_{0,l})\) of \(\varvec{P}\) represents the probability that the lth component distribution of the partition \(\mathbb {C}_0\) is assigned to the kth component distribution of the partition \(\mathbb {C}\), with \(\sum _{k=1}^{K} p_{k l} = 1\) for \(l = 1, \ldots , K\). Then for any \(\mathbb {C}\), we obtain \(\varvec{p} = \varvec{P} \cdot \varvec{p}_0\) and \(F(\varvec{x}) = \varvec{p} \cdot (F_{1}(\varvec{x}), \cdots , F_{K}(\varvec{x})) = \varvec{p}_0 \cdot (F_{01}(\varvec{x}), \cdots , F_{0K}(\varvec{x}))\). For any \(j \in \{1, \ldots , p\}\) and \(k \in \{1, \ldots , K\}\), the marginal utility satisfies
where \(p_{0 \bar{k}} = 1 - p_{0 k}\), \(p_{k \bar{k}} = 1 - p_{k k}\) and \(\tau _{j \bar{k}}(\mathbb {C}_0) = - p_k \cdot \tau _{j k}(\mathbb {C}_0) / (1-p_k)\). By the definition of \(\upsilon _{j k}\), we can obtain that
where \(\Delta _{p_{kk}} = p_{0 k} \cdot p_{k k} + p_{0 \bar{k}} \cdot p_{k \bar{k}}\). Since \((2 \cdot p_{k k} - 1)^2\) achieves its maximum if and only if \(p_{kk} = 0\) or \(p_{kk}= 1\), and \(\Delta _{p_{kk}}\) achieves its maximum if and only if \(p_{kk} = 0\) or \(p_{kk}= 1\), we have
Considering the label-free nature of clustering and the constraint that \(\sum _{k=1}^{K} p_{k l} = 1\) for any \(l=1, \ldots , K\), without loss of generality let \(p_{kk}=1\) and \(p_{k \bar{k}} = 0\). Then for any j, we conclude that
where equality holds only when the matrix \(\varvec{P}\) is the identity matrix. Thus, \(\varvec{P} = \varvec{I}_{K \times K}\) is the unique global optimal solution of \({\Upsilon _{d}}_j(\mathbb {C})\) when \(S_j = 1\). Similarly, for any j, we obtain
where equality holds only when the matrix \(\varvec{P}\) is the identity matrix, the same as for \(S_j(\mathbb {C})\). It indicates that \({\Upsilon }(\mathbb {C})\) and \({\Upsilon }^{*}(\mathbb {C})\) achieve the maximum values \({\Upsilon }(\mathbb {C}_0)\) and \({\Upsilon }^{*}(\mathbb {C}_0)\), respectively, if and only if \(\mathbb {C} = \mathbb {C}_0\), where
\(\square \)
1.9 Proof of Theorem 3
Proof
Suppose Assumptions (i)–(ii) and Conditions (C1)–(C3) hold. For constants \(c_9 = 2 \cdot c_9^{*}\) and \(c_{10} = 2 \cdot c_{10}^{*}\), the following inequalities are established by Corollary 2 and Lemma 3:
and
where \(c_{11}>0\) and \(c_{12}>0\) are some constants. As \(n \rightarrow \infty \), it can be written as
Denote \(\tilde{\mathbb {C}}_0 = \mathop {\arg \max }\limits _{\mathbb {C}} \sum _{j=1}^{p} \sum _{k=1}^{K} {S}_{j k} \cdot b_{j k}\) and \(\tilde{\mathbb {C}}_0^{*} = \mathop {\arg \max }\limits _{\mathbb {C}} \sum _{j=1}^{p} {S}_{j} \cdot b_{j}\). We can get that \(\liminf \limits _{n \rightarrow \infty } \tilde{\mathbb {C}}_0 = \liminf \limits _{n \rightarrow \infty } \tilde{\mathbb {C}}_0^{*}\). \(\square \)
1.10 Proof of Lemma 4
Proof
By Eq. (15), the component of the OVR BCSS-RV cost objective can be rewritten as
where we use the equation \(\bar{{\xi }}_{k j} = {n_k}^{-1} \sum _{i \in \mathcal {C}_k} \xi _{i j}\), \(\bar{{\xi }}_{k j}^{\prime } = (n - n_k)^{-1} \sum _{i \notin \mathcal {C}_k} \xi _{i j}\), and note \(\bar{{\xi }}_{j} = {n}^{-1} \sum _{i = 1}^{n} \xi _{i j} = (n + 1) / 2\). Hence, one can obtain
By Eq. (15), the component of the OVR BCSS-RV cost objective satisfies \(b_{j k} \propto (n-1)(n+1)^4 + \hat{\upsilon }_{j k}\) for any \(j = 1, \ldots , p\) and any \(k = 1, \ldots , K\). Hence, we also get that \(b_j = \sum _{k=1}^{K} b_{j k} \propto K (n-1)(n+1)^4 + \sum _{k=1}^{K} \hat{\upsilon }_{j k} = K (n-1)(n+1)^4 + \hat{\upsilon }_{j}\). Since \(b_{j_1} > b_{j_2}\) for any \(j_1 \in \hat{\mathcal {A}}\) and \(j_2 \notin \hat{\mathcal {A}}\), we obtain that \(\sum _{j=1}^{p} w_{j} b_{j}\) reaches its optimal value at \(w_j = S_j\) under \(\Vert \varvec{w}\Vert _{\infty } \le 1\) and \(\Vert \varvec{w}\Vert _{0} \le s\).
Similarly, the component of the OVR BCSS-RV cost objective can be rewritten as
where we use that \(b_j = \sum _{k=1}^{K} b_{j k}\) and \(\hat{\upsilon }_{j} = \sum _{k=1}^{K} \hat{\upsilon }_{j k} = \sum _{k=1}^{K} \frac{12 (n + 1) {\hat{p}}_k}{1 - {\hat{p}}_k} \hat{\tau }_{j k}^2\). Thus, we have
Therefore, we get that \(b_{j k} \propto (n-1)(n+1)^4 + \hat{\upsilon }_{j k}\) for any \(j = 1, \ldots , p\) and any \(k = 1, \ldots , K\). Since \(b_{j_1} > b_{j_2}\) for any \(j_1 \in \hat{\mathcal {A}_{k}}\) and \(j_2 \notin \hat{\mathcal {A}_{k}}\) with the same k, one can obtain that \(\sum _{j=1}^{p} w_{j k} b_{j k}\) reaches the optimal value at \(w_{j k} = S_{j k}\) under \(\Vert \varvec{w}_k\Vert _{\infty } \le 1\) and \(\Vert \varvec{w}_k\Vert _{0} \le s\) for all \(k = 1, \ldots , K\). \(\square \)
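In other words, the weight optimization in Lemma 4 reduces to placing unit weight on the s largest scores. A minimal sketch of that selection step (names illustrative):

```python
import numpy as np

def optimal_weights(b, s):
    """Maximize sum_j w_j * b_j subject to ||w||_inf <= 1 and ||w||_0 <= s:
    put weight 1 on the s largest entries of the score vector b."""
    w = np.zeros_like(b, dtype=float)
    w[np.argsort(b)[-s:]] = 1.0                    # indices of the s largest scores
    return w
```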
1.11 Proof of Theorem 4
Proof
Due to the equivalence of \(\hat{\upsilon }_{j k} \ge \rho \) and \(|\hat{\tau }_{j, k}| \ge \sqrt{\rho \frac{1 - \hat{p}_{k}}{12 (n+1) \hat{p}_{k}}}\), the proof of Theorem 4 can be carried out in terms of \(|\hat{\tau }_{j, k}|\). For \(k = 1, \ldots , K\) and \(j = 1, \ldots , p\), note that
Then, we have that if rejecting \(\textrm{H}_{0,j,k}\), then we get
Hence, we obtain
When \(\beta \in (2, + \infty )\), \(\rho \) is a constant and hence \(\rho = o(n^{1/4})\). We then obtain
By the definition of \(F_{|\hat{\tau }_{k}|}^{c}(x)\), it implies that
In the ultra-high dimensional setting, we have \(p = o\left( \exp \{n^{c} \} \right) \) for some \(c > 0\) and \(s_k = o(n)\). If \(c > 1/2\), according to Eqs. (A10) and (A11), we have
To conclude, \(F_{|\hat{\tau }_{k}|}^{c}(k) \rightarrow 1 - e^{-1}\) almost surely as \(n \rightarrow \infty \); in other words,
The number of variables screened into the adaptive false discovery set is governed by p independent Bernoulli trials; then, the expectation and variance of the number of false discoveries are
\(\square \)
1.12 Proof of Theorem 5
Proof
According to Condition C2, the definition of \(\hat{\rho }_0\) in Sect. 4.1, and the bound \(\max \limits _{j \in \mathcal {A}_{k}} |\hat{\upsilon }_{j, k} - \upsilon _{j, k} | \le c_{13}^{*} n^{-\kappa }\) from Theorem 1, we have
Therefore, we obtain that
holds for some constant \(c_{13}>0\). \(\square \)
1.13 Proof of Corollary 3
Proof
Suppose Assumptions (i)–(ii) and Conditions (C1)–(C3) hold. First, by Theorem 1, for the positive constant \(c_{15}^{*}\) satisfying \(c_{15}^{*} n^{-\kappa } = \hat{\rho }_{0}\), we get
holds for some constant \(c_{17} > 0\). Let \(c_{15} = (K+\sum _{k=1}^{K}s_k) \cdot c_{15}^{*}\); then, by Theorems 1 and 4, we have
where
Similarly, by Theorem 1, for the positive constant \(c_{16}^{*}\) satisfying \(c_{16}^{*} n^{-\kappa } = \hat{\rho }_{0}\), we get
holds for some constant \(c_{18} > 0\). Let \(c_{16} = (s + 1) \cdot K \cdot c_{16}^{*}\). By Theorems 1 and 4, we obtain
where
\(\square \)
1.14 Proof of Theorem 6
Proof
Suppose Assumptions (i)–(ii) and Conditions (C1)–(C3) hold. By Lemma 2 and Corollary 3, there exist positive constants \(c_{19,1}\) and \(c_{19,2}\) such that
and
hold for some positive constant \(c_{21}\). Let \(c_{19} = c_{19,1} + c_{19,2}\). Thus, we obtain that
Similarly, by Lemma 2 and Corollary 3, for the constant \(c_{20} = c_{20,1} + c_{20,2}\), there exist positive constants \(c_{20,1}\) and \(c_{20,2}\) such that
and
hold for some constant \(c_{22}>0\). Then, we get that
When \(n \rightarrow \infty \), it can be written as
Denote \(\hat{\mathbb {C}}_{0, \hat{\rho }_0} = \mathop {\arg \max }\limits _{\mathbb {C}} \sum _{j=1}^{p} \sum _{k=1}^{K} \hat{S}_{\hat{\rho }_0, j k} \cdot \hat{\upsilon }_{j k}\) and \(\hat{\mathbb {C}}_{0, \hat{\rho }_0}^{*} = \mathop {\arg \max }\limits _{\mathbb {C}} \sum _{j=1}^{p} \hat{S}_{\hat{\rho }_0, j} \cdot \hat{\upsilon }_{j}\). We have that
\(\square \)
1.15 Proof of Theorem 7
Proof
Similar to the proof of Theorem 5, we have
holds for some constant \(c_{7}>0\).
To prove \(\widehat{\textrm{FDR}}_{\hat{\rho }_k} \rightarrow \alpha \) in probability, under the assumption that \(q_k/p \rightarrow 1\) as \(p\rightarrow \infty \) and for any \(\rho > 0\), by Corollary 1 and Hoeffding’s inequality, it suffices to show that
in probability as \(n \rightarrow \infty \), where \(\epsilon > 0\) is a constant and \(S_{\chi ^2_1}(\rho ) = 1 - F_{\chi ^2_1}(\rho )\) is the survival function of the \(\chi ^2_1\) distribution. Thus,
Notice that \(\sum _{j \in \mathcal {A}_{k}} I({\hat{\upsilon }}_{j k} \ge \rho )\) is monotone in \(\rho \) and asymptotically converges to \(s_k\), and \(S_{\chi ^2_1}(\rho )\) is continuous and monotone. Then, there exists a unique constant \(0< \tilde{\rho }_k \le C n^{-\beta }\) such that
in probability as \(n \rightarrow \infty \). Therefore, according to the Eqs. (A12) and (A13), we obtain that
\(\square \)
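The plug-in estimator appearing in this proof suggests a direct computation of the data-driven threshold. The sketch below assumes \(\widehat{\textrm{FDR}}_{\rho } = p \, S_{\chi ^2_1}(\rho ) / \max \{1, \#\{j: \hat{\upsilon }_{j k} \ge \rho \}\}\) and scans the observed utilities as candidate thresholds; it is an illustration, not the paper's implementation.

```python
import numpy as np
from scipy import stats

def adaptive_threshold(upsilon_k, alpha):
    """Smallest observed utility rho whose estimated FDR is at most alpha."""
    p = len(upsilon_k)
    for rho in np.sort(upsilon_k):                 # ascending candidate thresholds
        n_selected = np.sum(upsilon_k >= rho)
        fdr_hat = p * stats.chi2.sf(rho, df=1) / max(1, n_selected)
        if fdr_hat <= alpha:
            return rho
    return np.inf                                  # nothing passes at level alpha
```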
1.16 Proof of Corollary 4
Proof
Suppose Assumptions (i)–(ii) and Conditions (C1)–(C3) hold. By Theorem 1, for the positive constant \(c_{24}^{*}\) satisfying \(c_{24}^{*} n^{-\kappa } = \max \{ \hat{\rho }_{\alpha , 1}, \ldots , \hat{\rho }_{\alpha , K} \}\),
holds for some constant \(c_{26} > 0\). Noting that
by Theorems 1, 4, and 7, letting \(c_{24} = c_{24}^{*} \sum _{k=1}^{K}s_k/(1-\alpha )\), we have
Similarly, by Theorem 1, for the constant \(c_{25}^{*} > 0\) satisfying \(c_{25}^{*} n^{-\kappa } = \hat{\rho }_{\alpha }\),
holds for some constant \(c_{27} > 0\). Let \(c_{25} = c_{25}^{*} Ks/(1-\alpha )\). By Theorems 1 and 4, we obtain
\(\square \)
1.17 Proof of Theorem 8
Proof
Suppose Assumptions (i)–(ii) and Conditions (C1)–(C3) hold. By Lemma 2 and Corollary 4, there exist positive constants \(c_{28,1}\) and \(c_{28,2}\) with \(c_{28}= c_{28,1} + c_{28,2}\) such that
hold for some positive constant \(c_{21}\). Thus, we obtain
Similarly, by Lemma 2 and Corollary 4, for the constant \(c_{29} = c_{29,1} + c_{29,2}\), there exist positive constants \(c_{29,1}\) and \(c_{29,2}\) such that
hold for some positive constant \(c_{31}\). Then, we get
When \(n \rightarrow \infty \), it can be written as
Denote \(\hat{\mathbb {C}}_{0, \alpha } = \mathop {\arg \max }\limits _{\mathbb {C}} \sum _{j=1}^{p} \sum _{k=1}^{K} \hat{S}_{\alpha , j k} \cdot \hat{\upsilon }_{j k}\) and \(\hat{\mathbb {C}}_{0, \alpha }^{*} = \mathop {\arg \max }\limits _{\mathbb {C}} \sum _{j=1}^{p} \hat{S}_{\alpha , j} \cdot \hat{\upsilon }_{j}\). We have that
\(\square \)
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Yuan, Z., Chen, J., Qiu, H. et al. Adaptive sufficient sparse clustering by controlling false discovery. Stat Comput 34, 193 (2024). https://doi.org/10.1007/s11222-024-10507-4
Keywords
- Unsupervised ultra-high dimensional data
- Sufficient variable screening
- Adaptive sufficient sparse clustering
- False discovery control
- Sufficient clustering property