Abstract
Motivated by open problems in applied and computational algebraic topology, we establish multivariate normal approximation theorems for three random vectors which arise organically in the study of random clique complexes. These are:
-
(1)
the vector of critical simplex counts attained by a lexicographical Morse matching,
-
(2)
the vector of simplex counts in the link of a fixed simplex, and
-
(3)
the vector of total simplex counts.
The first of these random vectors forms a cornerstone of modern homology algorithms, while the second one provides a natural generalisation for the notion of vertex degree, and the third one may be viewed from the perspective of U-statistics. To obtain distributional approximations for these random vectors, we extend the notion of dissociated sums to a multivariate setting and prove a new central limit theorem for such sums using Stein’s method.
Similar content being viewed by others
Avoid common mistakes on your manuscript.
1 Introduction
Methods from applied and computational algebraic topology have recently found substantial applications in the analysis of nonlinear and unstructured datasets (Ghrist 2008; Carlsson 2005). The modus operandi of topological data analysis is to first build a nested family of simplicial complexes around the elements of a dataset, and to then compute the associated persistent homology barcodes (Edelsbrunner and Harer 2010). Of central interest, when testing hypotheses under this paradigm, is the question of what homology groups to expect when the input data are randomly generated. Significant efforts have therefore been devoted to answering this question for various models of noise, giving rise to the field of stochastic topology (Kahle 2011; Bobrowski and Kahle 2018; Kahle 2009; Adler et al. 2014; Costa and Farber 2016). Our work here is a contribution to this area at the interface between probability theory and algebraic topology.
Distributional approximations provide a way of understanding random variables in cases where closed-form distributions cannot be easily obtained. This paper establishes the first multivariate normal approximations to three important counting problems in stochastic topology; as these approximations are based on Stein’s method, explicit bounds on the approximation errors are provided. Our starting point is the ubiquitous graph model \({{\textbf {G}}(n,p)}\); a graph G chosen from this model has as its vertex set \([n] = {\left\{ {1,2,\ldots ,n}\right\} }\), and each of its possible \(\left( {\begin{array}{c}n\\ 2\end{array}}\right) \) edges is included independently with probability \(p \in [0,1]\). A natural higher-order generalisation of \({{\textbf {G}}(n,p)}\) is furnished by the random clique complex model \({{\textbf {X}}(n,p)}\), whose constituent complexes \({\mathscr {L}}\) are constructed as follows. One first selects an underlying graph \({G \sim {\textbf {G}}(n,p)}\), and then deterministically fills out all k-cliques in G with \((k-1)\)-dimensional simplices for \(k \ge 3\). Higher connectivity is measured by the Betti numbers \(\beta _k({\mathscr {L}})\), which are ranks of rational homology groups \(\text {H}_k({\mathscr {L}};{\mathbb {Q}})\)—in particular, \(\beta _0({\mathscr {L}})\) equals the number of connected components of the underlying random graph G. In Kahle (2014), Kahle proved the following far-reaching generalisation of the Erdős-Rényi connectivity result: for each \(k \ge 1\) and \(\epsilon > 0\),
-
(1)
if
$$\begin{aligned} p \ge \left[ \left( \frac{k}{2}+1+\epsilon \right) \cdot \frac{\log (n)}{n}\right] ^{{1}/{(k+1)}}, \end{aligned}$$then \(\beta _k({\mathscr {L}}) = 0\) with high probability; and moreover,
-
(2)
if
$$\begin{aligned} \left[ \frac{k+1+\epsilon }{n}\right] ^{1/k} \le p \le \left[ \left( \frac{k}{2}+1-\epsilon \right) \cdot \frac{\log (n)}{n}\right] ^{1/(k+1)}, \end{aligned}$$then \(\beta _k({\mathscr {L}}) \ne 0\) with high probability.
Unlike Kahle’s result, we study \({{\textbf {X}}(n,p)}\) in the regime where p is a constant. In that case, we may have \({\beta _k({\textbf {X}}(n,p)) \ne 0}\) for multiple values of k that depend on n. Hence, studying the multivariate distribution of the Betti numbers is of interest in this regime. Since many results about Betti numbers in the univariate case are based on counting simplices, we hope that understanding the multivariate simplex counts will facilitate multivariate understanding of the Betti numbers. With this result in mind, we motivate and describe three random vectors pertaining to \({{\mathscr {L}}\sim {\textbf {X}}(n,p)}\); the normal approximation of these three random vectors will be our focus in this paper. All three are denoted \(T = (T_1,\ldots ,T_d)\) for an integer \(d > 0\). For the purposes of this introduction, we add a superscript (1), (2) or (3) to indicate the particular vector.
1.1 Random vector 1: critical simplex counts
The computation of Betti numbers \(\beta _k({\mathscr {L}})\) begins with the chain complex
Here \(\textrm{Ch}_k\) is a vector space whose dimension equals the number of k-simplices in \({\mathscr {L}}\), while \(d_k:\textrm{Ch}_k \rightarrow \textrm{Ch}_{k-1}\) is an incidence matrix encoding which \((k-1)\)-simplices lie in the boundary of a given k-simplex. These matrices satisfy the property that every successive composite \(d_{k+1} \circ d_k\) equals zero, and \(\beta _k({\mathscr {L}})\) is the dimension of the quotient vector space \(\ker d_k/\text {img }d_{k+1}\). In order to calculate \(\beta _k({\mathscr {L}})\), one is required to put the matrices \({\left\{ {d_k:\textrm{Ch}_k \rightarrow \textrm{Ch}_{k-1}}\right\} }\) in reduced echelon form, which is a straightforward task in principle. Unfortunately, Gaussian elimination on an \(m \times m\) matrix incurs an \(O(m^3)\) cost, which becomes prohibitive when facing simplicial complexes built around large data sets (Otter 2017). The standard remedy is to construct a much smaller chain complex which has the same homology groups, and by far the most fruitful mechanism for achieving such homology-preserving reductions is discrete Morse theory (Forman 2002; Mischaikow and Nanda 2013; Henselman-Petrusek and Ghrist 2016; Lampret 2019).
The key structure here is that of an acyclic partial matching, which pairs together certain adjacent simplices of \({\mathscr {L}}\); and the homology groups of \({\mathscr {L}}\) may be recovered from a chain complex whose vector spaces are spanned by unpaired, or critical, simplices. One naturally seeks an optimal acyclic partial matching on \({\mathscr {L}}\) which admits the fewest possible critical simplices. Unfortunately, the optimal matching problem is computationally intractable to solve (Joswig and Pfetsch 2006) even approximately (Bauer and Rathod 2019) for large \({\mathscr {L}}\). Our first random vector \(T^{(1)}\) is obtained by letting \(T^{(1)}_k\) equal the number of critical k-simplices for a specific type of acyclic partial matching on \({\mathscr {L}}\), called the lexicographical matching. Knowledge of this random vector serves to simultaneously quantify the benefit of using discrete Morse theoretic reductions on random simplicial complexes and to provide a robust null model by which to measure their efficacy on general (i.e., not necessarily random) simplicial complexes. This is the first time the asymptotic distribution of this random vector is studied.
1.2 Random vector 2: link simplex counts
The link of a simplex t in \({\mathscr {L}}\), denoted \(\textbf{lk}(t)\), consists of all simplices s for which the union \(s \cup t\) is also a simplex in \({\mathscr {L}}\) and the intersection \(s \cap t\) is empty. The link of t forms a simplicial complex in its own right; and if we restrict attention to the underlying random graph G, then the link of a vertex is precisely the collection of its neighbours. Therefore, the Betti numbers \(\beta _k(\textbf{lk}(t))\) generalise the degree distribution for vertices of random graphs in two different ways—one can study neighbourhoods of higher-dimensional simplices by increasing the dimension of t, and one can examine higher-order connectivity properties by increasing the homological dimension k. The second random vector \(T^{(2)}\) of interest to us here is obtained by letting \(T^{(2)}_k\) equal the number of k-simplices that would lie in the link of a fixed simplex t in \({\mathscr {L}}\), if t indeed was a simplex in the random complex. As far as we are aware, ours is the first work that studies this random vector. A different conditional distribution, which follows directly from results on subgraph counts in \({{\textbf {G}}(n,p)}\), has been studied before, see Remark 5.1.
There are compelling reasons to better understand the combinatorics and topology of such links from a probabilistic viewpoint. For instance, the fact that the link of a k-simplex in a triangulated n-manifold is always a triangulated sphere of dimension \((n-k-1)\) has been exploited to produce canonical stratifications of simplicial complexes into homology manifolds (Asai and Shah 2022; Nanda 2020). Knowledge of simplex counts (and hence, Betti numbers) of links would therefore form an essential first step in any systematic study involving canonical stratifications of random clique complexes.
1.3 Random vector 3: total simplex counts
The strategy employed in Kahle’s proof of the second assertion above involves first checking that the expected number of k-simplices in \({{\mathscr {L}}\sim {\textbf {X}}(n,p)}\) is much larger than the expected number of simplices of dimensions \(k \pm 1\) whenever p lies in the range indicated by (2). Therefore, one may combine the Morse inequalities with the linearity of expectation in order to guarantee that the expected \(\beta _k({\mathscr {L}})\) is nonzero—see Kahle (2014, Section 4) for details. To facilitate more refined analysis and estimates of this sort, the third random vector \(T^{(3)}\) we study in this paper is obtained by letting \(T^{(3)}_k\) equal the total number of k-dimensional simplices in \({\mathscr {L}}\).
Since \(T^{(3)}_k\) is precisely the number of \((k+1)\)-cliques in \({G \sim {\textbf {G}}(n,p)}\), this random vector falls within the purview of generalised U-statistics. We extend results from Janson and Nowicki (1991) to show not only distributional convergence asymptotically but a stronger result, detailing explicit non-asymptotic bounds on the approximation. Several interesting problems can be seen as special cases—these include classical U-statistics (Lee 1990; Korolyuk and Borovskich 2013), monochromatic subgraph counts of inhomogeneous random graphs with independent random vertex colours, and the number of overlapping patterns in a sequence of independent Bernoulli trials. To the best of our knowledge, this is the first multivariate normal approximation result with explicit bounds where the sizes of the subgraphs are permitted to increase with n.
1.4 Main results
The central contributions of this work are multivariate normal approximations for all three random vectors T described above. The approximation error is quantified for finite n in terms of both smooth test functions as well as convex set indicator test functions. As long as the bound with respect to either test function class vanishes asymptotically, it implies asymptotic convergence in distribution. However, the convexset indicator test function result is stronger and can be more useful in statistical applications: for example, when estimating confidence regions, which are usually taken to be convex sets.
We state a simplified version of our normal approximation results here and note that the full statements and proofs have been recorded as Theorems 4.5, 5.3, and Corollary 6.1. Note that the quantities below \(B_{5.3}\) and \(B_{6.1}\) are explicit, allowing to vary the parameters p and d.
To state the results, for a positive integer d we define a class of test functions \(h: {\mathbb {R}}^d \rightarrow {\mathbb {R}}\), as follows. We say \(h \in {\mathscr {H}}_d\) iff h is three times partially differentiable with third partial derivatives being Lipschitz and bounded. Moreover we denote by \({\mathscr {K}}\) the class of convex sets in \({\mathbb {R}}^d\).
Theorem 1.1
Let \(W^{(i)}\) be an appropriately scaled and centered version of random vector \(T^{(i)}\) for \(i = 1,2,3\) as described above. Let \(Z \sim \textrm{MVN}(0, \textrm{Id}_{d \times d})\) and \(\Sigma _i\) be the covariance matrix of \(W^{(i)}\) for each i. Let \(h \in {\mathscr {H}}_d.\)
-
(1)
There is a constant \(B_{1.1.1} >0\) independent of n and a natural number \(N_{1.1.1}\) such that for any \(n \ge N_{1.1.1}\) we have
$$\begin{aligned}\left| {\mathbb {E}}h(W^{(1)}) - {\mathbb {E}}h(\Sigma _1^{\frac{1}{2}}Z) \right| \le B_{1.1.1} \sup _{i, j, k \in [d]} \left\Vert \frac{\partial ^3 h}{\partial x_i \partial x_j \partial x_k}\right\Vert _{\infty } n^{-1}.\end{aligned}$$Also, there is a constant \(B_{1.1.2} >0\) independent of n and a natural number \(N_{1.1.2}\) such that for any \(n \ge N_{1.1.2}\) we have
$$\begin{aligned}\sup _{A \in {\mathscr {K}}}|{\mathbb {P}}(W^{(1)} \in A) -{\mathbb {P}}(\Sigma _1^{\frac{1}{2}}Z \in A)| \le B_{1.1.2}n^{-\frac{1}{4}}.\end{aligned}$$ -
(2)
There is a quantity \(B_{5.3}\) independent of n and defined explicitly such that
$$\begin{aligned} \left| {\mathbb {E}}h(W^{(2)}) - {\mathbb {E}}h(\Sigma _2^{\frac{1}{2}}Z) \right| \le \left| h \right| _3 B_{5.3} (n - |t|)^{-\frac{1}{2}}; \end{aligned}$$and
$$\begin{aligned}\sup _{A \in {\mathscr {K}}}|{\mathbb {P}}(W^{(2)} \in A) -{\mathbb {P}}(\Sigma _2^{\frac{1}{2}}Z \in A)| \le 2^{\frac{7}{2}} 3^{-\frac{3}{4}}d^{\frac{3}{16}}B_{5.3}^{\frac{1}{4}}(n - |t|)^{-\frac{1}{8}}.\end{aligned}$$ -
(3)
There is a quantity \(B_{6.1}\) independent of n and defined explicitly such that
$$\begin{aligned} \left| {\mathbb {E}}h(W^{(3)}) - {\mathbb {E}}h(\Sigma _3^{\frac{1}{2}}Z) \right| \le \left| h \right| _3 B_{6.1} n^{-1}; \end{aligned}$$and
$$\begin{aligned} \sup _{A \in {\mathscr {K}}}|{\mathbb {P}}(W^{(3)} \in A)-{\mathbb {P}}(\Sigma _3^{\frac{1}{2}}Z \in A)| \le 2^{\frac{7}{2}} 3^{-\frac{3}{4}}d^{\frac{3}{16}}B_{6.1}^{\frac{1}{4}}n^{-\frac{1}{4}}. \end{aligned}$$
En route to proving Theorem 1.1, we also establish the following properties, which are of direct interest in computational topology. Here we assume that \(p \in (0,1)\) and \(k \in \{1, 2, \ldots \}\) are constants.
-
(1)
The expected number of critical k-simplices is one order of n smaller that the expected total number of k-simplices; see Lemma 4.2.
-
(2)
The variance of the number of critical k-simplices is at least of the order \(n^{2k}\), as shown in Lemma 4.3. An upper bound of the same order can be proved similarly. The variance of the total number of k-simplices is also of the same order.
-
(3)
Knowing the expected value and the variance one can prove concentration results using different concentration inequalities, for example, Chebyshev’s inequality. This would show that not only the expected value of the number of critical simplices is smaller compared to all simplices but also that large deviations from the mean are unlikely, hence implying that the substantial improvement of one order of n is not only expected but also likely.
-
(4)
For counting critical simplices to high accuracy in probability, it is not necessary to check every simplex. Certain simplices have a very small chance of being critical, and can be safely ignored. The probability of this omission causing an error is vanishingly small asymptotically; see Proposition 4.4.
The main ingredient in establishing such results is often an abstract approximation theorem that can be applied to the random variables of interest. While there is no shortage of multivariate normal approximation theorems (Fang 2016; Raiíc 2004; Meckes 2009; Chen 2011), the existing ones are not sufficiently fine-grained for proving multivariate normal approximations to the random vectors studied here. We therefore return to the pioneering work of Barbour et al. (1989), who proved a univariate central limit theorem (CLT) for a decomposable sum of random variables using Stein’s method, treating the case of dissociated sums as a special case. Our approximation result (Theorem 3.2) forms an extension of their ideas to the multivariate setting.
1.5 Related work
There are different versions of distributional approximation results for subgraph counts in \({{\textbf {G}}(n,p)}\) that can be interpreted as simplex counts in \({{\textbf {X}}(n,p)}\). For example, a multivariate central limit theorem for centered subgraph counts in the more general setting of a random graph associated to a graphon can be found in Kaur and Röollin (2021). That proof is based on Stein’s method via a Stein coupling. Translating this result for uncentered subgraph counts would yield an approximation by a function of a multivariate normal. In Reinert and Röllin (2010), an exchangeable pair coupling led to Reinert and Rollin (2010, Proposition 2) which can be specialised to joint counts of edges and triangles; our approximation significantly generalises this result beyond the case where \(k \in {\left\{ {1,2}\right\} }\). Several univariate normal approximation theorems for subgraph counts are available; recent developments in this area include Privault and Serafin (2020), which uses Malliavin calculus together with Stein’s method, and Eichelsbacher and Rednoß (2023), which uses the Stein-Tikhomirov method. Stein’s method is used in Kahle and Meckes (2013) to show a CLT for the Betti numbers of \({{\textbf {X}}(n,p)}\) in a sparse regime; in Owada et al. (2021) limit theorems for Betti numbers and Euler characteristic are proven in a dynamical random simplicial complex model, also using Stein’s method.
Theorem 3.2 is not the first generalisation of the results in Barbour et al. (1989) to a multivariate setting, see for example (Fang 2016; Raiíc 2004). The key advantage of our approach is that it allows for bounds which are non-uniform in each component of the vector W. This is useful when, for example, the number of summands in each component are of different order or when the sizes of dependency neighbourhoods in each component are of different order. The applications considered here are precisely of this type, where the non-uniformity of the bounds is crucial. Moreover, we do not require the covariance matrix \(\Sigma \) to be invertible, and can therefore accommodate degenerate multivariate normal distributions.
1.6 Organisation
In Sect. 2 we recall concepts from the theory of simplicial complexes, which we later use. In Sect. 3 we state the theorems that serve as main tools in proving the CLTs. In order to maintain focus on the main results, we defer the proofs of this section until the end of the paper. In Sect. 4 we prove an approximation theorem for critical simplex counts of lexicographical matchings. Two technical computations required in this section have been consigned to the Appendix. In Sect. 5 we prove an approximation theorem for count variables of simplices that are in the link of a fixed simplex. In Sect. 6 we study simplex counts in the random clique complex and prove a CLT for this random variable. This CLT is a corollary of a multivariate normal approximation of generalised U-statistics, which might be of independent interest. Finally, in Sect. 7 we prove our main tools: the abstract approximation theorem (Theorem 3.2) as well as the approximation theorem for U-statistics (Theorem 3.9). We first use smooth test functions and then extend the results to convex set indicators using a smoothing technique from Gan et al. (2017).
1.7 Notation
Throughout this paper we use the following notation. Given positive integers n, m we write [m, n] for the set \({\left\{ {m,m+1,\ldots ,n}\right\} }\) and [n] for the set [1, n]. Given a set X we write \(\left| X \right| \) for its cardinality, \({\mathscr {P}}(X)\) for its powerset, and given a positive integer k we write \(C_k = \left\{ \,t \in {\mathscr {P}}([n]) \;\vert \; |t| = k\,\right\} \) for the collection of subsets of [n] which are of size k. For a function \(f: {\mathbb {R}}^d \rightarrow {\mathbb {R}}\) we write \(\partial _{ij}f = \frac{\partial ^2f}{\partial x_i \partial x_j}\) and \(\partial _{ijk}f = \frac{\partial ^3f}{\partial x_i \partial x_j \partial x_k}\). Also, we write \(\left| f \right| _k = \sup _{i_1, i_2, \ldots , i_k \in [d]} \left\Vert \partial _{i_1 i_2 \ldots i_k}f\right\Vert _{\infty }\) for any integer \(k \ge 1\), as long as the quantities exist. Here \(|| \cdot ||_\infty \) denotes the supremum norm while \(|| \cdot ||_2\) denotes the Euclidean norm. The notation \(\nabla \) denotes the gradient operator in \({\mathbb {R}}^d\). The notation \(\textrm{Id}_{d \times d}\) denotes the \(d \times d\) identity matrix. The vertex set of all graphs and simplicial complexes is assumed to be [n]. We also use Bachmann-Landau asymptotic notation: we say \(f(n) = O(g(n))\) iff \(\limsup _{n \rightarrow \infty } \frac{|f(n)|}{g(n)} < \infty \) and \(f(n) = \Omega (g(n))\) iff \(\liminf _{n \rightarrow \infty } \frac{f(n)}{g(n)} > 0\). The notation that \(f(n)= \omega (g(n))\) indicates that \(\lim _{n \rightarrow \infty } \frac{f(n)}{g(n)} = \infty \).
2 Simplicial complex preliminaries
2.1 First definitions
Firstly, we recall the notion of a simplicial complex (Spanier 1966, Ch 3.1); these provide higher-dimensional generalisations of a graph and constitute data structures of interest across algebraic topology in general as well as applied and computational topology in particular.
A simplicial complex \({\mathscr {L}}\) on a vertex set V is a set of nonempty subsets of V (i.e. \(\varnothing \notin {\mathscr {L}}\subseteq {\mathscr {P}}(V)\)) such that the following properties are satisfied:
-
(1)
for each \(v \in V\) the singleton \({\left\{ {v}\right\} }\) lies in \({\mathscr {L}}\), and
-
(2)
if \(t \in {\mathscr {L}}\) and \(s \subset t\) then \(s \in {\mathscr {L}}\).
The dimension of a simplicial complex \({\mathscr {L}}\) is \(\max _{s \in {\mathscr {L}}}|s| - 1\). Elements of a simplicial complex are called simplices. If s is a simplex, then its dimension is \(|s| - 1\). A simplex of dimension k can be called a k-simplex. Note that the notion of one-dimensional simplicial complex is equivalent to the notion of a graph, with the vertex set V and edges as subsets.
Given a graph \(G = (V, E)\) the clique complex \({\mathscr {X}}\) of G is a simplicial complex on V such that
Recall that \({{\textbf {G}}(n,p)}\) is a random graph on n vertices where each pair of vertices is connected with probability p, independently of any other pair. The \({{\textbf {X}}(n,p)}\) random simplicial complex is the clique complex of the \({{\textbf {G}}(n,p)}\) random graph, which is a random model studied in stochastic topology (Kahle 2009, 2014). Note that \(t \in {{\mathscr {X}}} \) if and only if the vertices of t span a clique in G. Thus, elements in \({{\textbf {X}}(n,p)}\) are cliques in \({{\textbf {G}}(n,p)}\).
2.2 Links
The link of a simplex t in a simplicial complex \({\mathscr {L}}\) is the subcomplex
Example 2.1
If we look at a graph as a one dimensional simplicial complex, then the vertices are sets of the form \({\left\{ {i}\right\} }\) and edges are sets of the form \({\left\{ {i,j}\right\} }\). For a vertex \(t = {\left\{ {v}\right\} }\), the edges of the form \(s = {\left\{ {v,u}\right\} }\) will not be in the link of t because \(t \cap s = \varnothing \) is not satisfied. If we pick \(s = {\left\{ {i,j}\right\} }\) and \(v \notin s\), then \(s \cup t \in {\mathscr {L}}\) is not satisfied. So there will be no edges in the link. However, if \(s = {\left\{ {u}\right\} }\) and u is a neighbour of v, then \(s \cup t \in {\mathscr {L}}\) and \(s \cap t = \varnothing \). Hence the link of a vertex will be precisely the other vertices that the vertex is connected to; the notion of the link generalises the idea of a neighbourhood in a graph.
Example 2.2
Now consider the simplicial complex depicted in Fig. 1: it has 8 vertices, 12 edges and 3 two-dimensional simplices that are shaded in grey. On the left hand side of the figure we see highlighted in blue the link of the vertex 1, which is highlighted in red. So \(\textbf{lk}({\left\{ {1}\right\} }) = {\left\{ {{\left\{ {2}\right\} }, {\left\{ {3}\right\} }, {\left\{ {5}\right\} }, {\left\{ {6}\right\} }, {\left\{ {8}\right\} }, {\left\{ {2,3}\right\} }, {\left\{ {2,8}\right\} }, {\left\{ {5,6}\right\} }}\right\} }\). On the right hand side of the figure we see highlighted in blue the link of the edge \({\left\{ {1,2}\right\} }\), which is highlighted in red. That is, \(\textbf{lk}({\left\{ {1,2}\right\} }) = {\left\{ {{\left\{ {3}\right\} }, {\left\{ {8}\right\} }}\right\} }\).
2.3 Discrete Morse theory
A partial matching on a simplicial complex \({\mathscr {L}}\) is a collection
such that every simplex appears in at most one pair of \(\Sigma \). A \(\mathbf {\Sigma }\)-path (of length \(k \ge 1)\) is a sequence of distinct simplices of \({\mathscr {L}}\) of the following form:
such that \((s_i, t_i) \in \Sigma \) and \(|t_i| - |s_{i+1}| = 1\) for all \(i \in [k]\). A \(\Sigma \)-path is called a gradient path if \(k=1\) or \(s_1\) is not a subset of \(t_k\). A partial matching \(\Sigma \) on \({\mathscr {L}}\) is called acyclic iff every \(\Sigma \)-path is a gradient path. Given a partial matching \(\Sigma \) on \({\mathscr {L}}\), we say that a simplex \(t \in {\mathscr {L}}\) is critical iff t does not appear in any pair of \(\Sigma \).
For a one-dimensional simplicial complex, viewed as a graph, a partial matching \(\Sigma \) is comprised of elements \(( v; \{u,v\})\) with v a vertex and \(\{u, v\}\) an edge. A \(\Sigma -\)path is then a sequence of distinct vertices and edges
where each consecutive pair of the form \((v_i,{\left\{ {v_i,v_{i+1}}\right\} })\) is constrained to lie in \(\Sigma \).
We refer the interested reader to Forman (2002) for an introduction to discrete Morse theory and to Mischaikow and Nanda (2013) for seeing how it is used to reduce computations in the persistent homology algorithm. In this work we aim to understand how much improvement one would likely get on a random input when using a specific type of acyclic partial matching, defined below.
Definition 2.3
Let \({\mathscr {L}}\) be a simplicial complex and assume that the vertices are ordered by \([n] = \{1,\ldots ,n\}\). For each simplex \(s \in {\mathscr {L}}\) define
Now consider the pairings
where \(i = \min I_{{\mathscr {L}}}(s)\) is the smallest element in the set \( I_{{\mathscr {L}}}(s)\), defined whenever \(I_{{\mathscr {L}}}(s) \ne \varnothing \). We call this the lexicographical matching.
Due to the \(\min I_{{\mathscr {L}}}(s)\) construction in the lexicographical matching, the indices are decreasing along any path and hence it will be a gradient path, showing that the lexicographical matching is indeed an acyclic partial matching on \({\mathscr {L}}\).
Example 2.4
Consider the simplicial complex \({\mathscr {L}}\) depicted in Fig. 2. The complex has 5 vertices, 6 edges and one two-dimensional simplex that is shaded in grey. The red arrows show the lexicographical matching on this simplicial complex: there is an arrow from a simplex s to t iff the pair (s, t) is part of the matching. More explicitly, the lexicographical matching on \({\mathscr {L}}\) is
Note that \({\left\{ {3,4}\right\} }\) cannot be matched because the set \(I_{{\mathscr {L}}}({\left\{ {3,4}\right\} })\) is empty. Also, in any lexicographical matching \({\left\{ {1}\right\} }\) is always critical as there are no vertices with a smaller label and hence the set \(I_{{\mathscr {L}}}({\left\{ {1}\right\} })\) is empty. So under this matching there are two critical simplices: \({\left\{ {1}\right\} }\) and \({\left\{ {3,4}\right\} }\), highlighted in blue in the figure. Hence, if we were computing the homology of this complex, considering only two simplices would be sufficient instead of all 12 which are in \({\mathscr {L}}\)—a significant improvement.
3 Probabilistic tools
In this section we introduce the approximation theorems that are used to study the random variables of interest. In order not to obscure our main results, the proofs are deferred to Sect. 7.
3.1 CLT for dissociated sums
Let n and d be positive integers. For each \(i \in [d] =:\{1, 2, \ldots , d\}\), we fix an index set \({\mathbb {I}}_i \subset [n] \times {\left\{ {i}\right\} }\) and consider the union of disjoint sets \({\mathbb {I}}:= \bigcup _{i \in [d]} {\mathbb {I}}_i\). Associate to each such \(s = (k,i) \in {\mathbb {I}}\) a real centered random variable \(X_s\) and form for each \(i \in [d]\) the sum
Consider the resulting random vector \(W = (W_1,\ldots ,W_d) \in {\mathbb {R}}^d\). The following notion is a natural multivariate generalisation of the dissociated sum from McGinley and Sibson (1975); see also Barbour et al. (1989).
Definition 3.1
We call W a vector of dissociated sums if for each \(s \in {\mathbb {I}}\) and \(j \in [d]\) there exists a dependency neighbourhood \({\mathbb {D}}_j(s) \subset {\mathbb {I}}_j\) satisfying three criteria:
-
(1)
the difference \(\left( W_j - \sum _{u \in {\mathbb {D}}_j(s)} X_u\right) \) is independent of \(X_s\);
-
(2)
for each \(t \in {\mathbb {I}}\), the quantity \(\left( W_j - \sum _{u \in {\mathbb {D}}_j(s)} X_u - \sum _{v \in {\mathbb {D}}_j(t) \setminus {\mathbb {D}}_j(s)} X_v\right) \) is independent of the pair \((X_s,X_t)\); and finally,
-
(3)
\(X_s\) and \(X_t\) are independent if \(t \not \in \bigcup _j {\mathbb {D}}_j(s)\).
Let W be a vector of dissociated sums as defined above. For each \(s\in {\mathbb {I}}\), by construction, the sets \({\mathbb {D}}_j(s), j \in [d]\) are disjoint (although for \(s \ne t\), the sets \({\mathbb {D}}_j(s)\) and \({\mathbb {D}}_j(t)\) may not be disjoint). We write \({\mathbb {D}}(s) = \bigcup _{j \in [d]}{\mathbb {D}}_j(s)\) for the disjoint union of these dependency neighbourhoods. With this preamble in place, we state the abstract approximation theorem that is the main ingredient in the proofs of our normal approximation results.
Theorem 3.2
Let \(h \in {\mathscr {H}}_d.\) Consider a standard d-dimensional Gaussian vector \(Z \sim \textrm{MVN}(0, \textrm{Id}_{d \times d})\). Assume that for all \(s \in {\mathbb {I}}\), we have \({\mathbb {E}}\left\{ X_s \right\} = 0\) and \({\mathbb {E}}\left| X_s^3 \right| < \infty .\) Then, for any vector of dissociated sums \(W \in {\mathbb {R}}^d\) with a positive semi-definite covariance matrix \(\Sigma \),
where \(B_{3.2} = B_{3.2.1} + B_{3.2.2}\) is the sum given by
The theorem above together with a smoothing technique will be used to prove the following approximation theorem in terms of convex set indicators.
Theorem 3.3
Consider a standard d-dimensional Gaussian vector \(Z \sim \textrm{MVN}(0, \textrm{Id}_{d \times d})\). For any centered vector of dissociated sums \(W \in {\mathbb {R}}^d\) with a positive semi-definite covariance matrix \(\Sigma \) and finite third absolute moments we have
where the quantity \(B_{3.2}\) as in Theorem 3.2.
The next result provides a simplification of Theorems 3.2 and 3.3 under the assumption that one uses bounds that are uniform in \(s, t, u \in {\mathbb {I}}\). Its proof follows immediately from writing the sum over \( \sum _{s \in {\mathbb {I}}} \sum _{t,u \in {\mathbb {D}}(s)} \) as the sum over \(\sum _{i \in [d]} \sum _{j \in [d]} \sum _{k \in [d]} \sum _{s \in {\mathbb {I}}_i} \sum _{t \in {\mathbb {D}}_j(s)} \sum _{u \in {\mathbb {D}}_k(s)}\).
Corollary 3.4
We have the following two bounds:
-
(1)
Under the assumptions of Theorem 3.2,
$$\begin{aligned} \left| {\mathbb {E}}h(W) - {\mathbb {E}}h(\Sigma ^{\frac{1}{2}}Z) \right| \le B_{3.4} \left| h \right| _3. \end{aligned}$$ -
(2)
Assuming the hypotheses of Theorem 3.3,
$$\begin{aligned} \sup _{A \in {\mathscr {K}}}|{\mathbb {P}}(W \in A)-{\mathbb {P}}(\Sigma ^{\frac{1}{2}}Z \in A)| \le 2^{\frac{7}{2}} 3^{-\frac{3}{4}}d^{\frac{3}{16}}B_{3.4}^{\frac{1}{4}}. \end{aligned}$$
Here \(B_{3.4}\) is a sum over \((i,j,k) \in [d]^3\) of the form
and \(\alpha _{ij}\) is the largest value attained by \(\left| {\mathbb {D}}_j(s) \right| \) over \(s \in {\mathbb {I}}_i\), and
as (s, t, u) range over \({\mathbb {I}}_i \times {\mathbb {I}}_j \times {\mathbb {I}}_k\).
In most of our applications, the variables \(X_s\) are centered and rescaled Bernoulli random variables. Hence, the following lemma is useful.
Lemma 3.5
Let \(\xi _1, \xi _2, \xi _3\) be Bernoulli random variables with expected values \(\mu _1, \mu _2, \mu _3\) respectively. Let \(c_1, c_2, c_3 > 0\) be any constants. Consider variables \(X_i :=c_i (\xi _i - \mu _i)\) for \(i = 1,2,3\). Then we have
Proof
Note that \(X_{3}\) can take two values: \(-c_3 \mu _3\) or \(c_3(1-\mu _3)\). As \(0 \le \mu _3 \le 1\), we have
Applying the Cauchy-Schwarz inequality and direct calculation of the second moments gives
which finishes the proof. \(\square \)
3.2 CLT for U-statistics
Here we consider generalised U-statistics, which were first introduced in Janson and Nowicki (1991). The result we derive could be of independent interest but, most importantly, the approximation theorem for simplex counts follows as a consequence. Let \(\{\xi _i\}_{1 \le i \le n}\) be a sequence of of independent random variables taking values in a measurable subset \({\mathscr {X}} \subseteq {\mathscr {U}}\) and let \(\{Y_{i,j}\}_{1 \le i < j \le n}\) be an array of of independent random variables taking values in a measurable subset \({\mathscr {Y}} \subseteq {\mathscr {U}}\) which is independent of \(\{\xi _i\}_{1 \le i \le n}\). We use the convention that \(Y_{i,j} = Y_{j,i}\) for any \(i < j\). For example, one can think of \(X_i\) as a random label of a vertex i in a random graph where \(Y_{i,j}\) is the indicator for the edge connecting i and j. Given a subset \(s \subseteq [n]\) of size m, write \(s = {\left\{ {s_1, s_2, \ldots , s_m}\right\} }\) such that \(s_1< s_2< \ldots < s_m\) and set \({\mathscr {X}}_s = (\xi _{s_1}, \xi _{s_2}, \ldots , \xi _{s_m})\) and \({\mathscr {Y}}_s = (Y_{s_1, s_2}, Y_{s_1, s_3}, \ldots Y_{s_{m-1},s_m})\). Recall that \(C_k\) denotes the set of subsets of [n] which are of size k.
Definition 3.6
Given \(1 \le k \le n\) and a measurable function \(f: {\mathscr {X}}^k \times {\mathscr {Y}}^{\left( {\begin{array}{c}k\\ 2\end{array}}\right) } \rightarrow {\mathbb {R}}\) define the associated generalised U-statistic by
Let \({\left\{ {k_i}\right\} }_{i \in [d]}\) be a collection of positive integers, each being at most n, and for each \(i \in [d]\) let \(f_i: {\mathscr {X}}^{k_i} \times {\mathscr {Y}}^{\left( {\begin{array}{c}k_i\\ 2\end{array}}\right) } \rightarrow {\mathbb {R}}\) be a measurable function. We are interested in the joint distribution of the variables \(S_{n,k_1}(f_1), S_{n,k_2}(f_2), \ldots S_{n,k_d}(f_d)\), which are assumed to have finite mean and variance.
Fix \(i \in [d]\). For \(s \in {\mathbb {I}}_i :=C_{k_i} \times {\left\{ {i}\right\} }\) define \(X_{s} = \sigma _i^{-1}\left( f_i({\mathscr {X}}_s, {\mathscr {Y}}_s) - \mu _{s}\right) \), where \(\mu _{s} = {\mathbb {E}}\left\{ f_i({\mathscr {X}}_s, {\mathscr {Y}}_s) \right\} \) and \(\sigma _i^{2} = {{\,\textrm{Var}\,}}(S_{n,k_i}(f_i))\). Now let \(W_i = \sum _{s \in {\mathbb {I}}_i} X_{s}\) be a random variable and write \(W = (W_1, W_2, \ldots W_d) \in {\mathbb {R}}^d\). By construction, \(W_i\) has mean 0 and variance 1.
Assumption 3.7
We assume that
-
(1)
For any \(i \in [d]\) there is some \(\alpha _i > 0\) such that for all \(s, t \in {\mathbb {I}}_i\), the variables \(f_i({\mathscr {X}}_s, {\mathscr {Y}}_s), f_i({\mathscr {X}}_t, {\mathscr {Y}}_t)\) are either independent or \({{\,\textrm{Cov}\,}}(f_i({\mathscr {X}}_s, {\mathscr {Y}}_s), f_i({\mathscr {X}}_t, {\mathscr {Y}}_t)) > \alpha _i\).
-
(2)
There is \(\beta \ge 0\) such that for any \(i,j,l \in [d]\) and any \(s \in {\mathbb {I}}_i, t \in {\mathbb {I}}_j, u \in {\mathbb {I}}_l\) we have
$$\begin{aligned}{\mathbb {E}}\left| \left\{ f_i({\mathscr {X}}_s, {\mathscr {Y}}_s) - \mu _{s} \right\} \left\{ f_j({\mathscr {X}}_t, {\mathscr {Y}}_t) - \mu _{t} \right\} \left\{ f_l({\mathscr {X}}_u, {\mathscr {Y}}_u) - \mu _{u} \right\} \right| \le \beta \end{aligned}$$as well as
$$\begin{aligned}{\mathbb {E}}\left| \left\{ f_i({\mathscr {X}}_s, {\mathscr {Y}}_s) - \mu _{s} \right\} \left\{ f_j({\mathscr {X}}_t, {\mathscr {Y}}_t) - \mu _{t} \right\} \right| {\mathbb {E}}\left| f_l({\mathscr {X}}_u, {\mathscr {Y}}_u) - \mu _{u} \right| \le \beta .\end{aligned}$$
The first assumption is not necessary but very convenient and we use it to derive a lower bound for the variance \(\sigma _i^2\). A normal approximation theorem can be proven in our framework when the assumption does not hold and a sufficiently large lower bound for the variance is acquired in a different way. Similarly, we use the second assumption to get a convenient bound on mixed moments. In order to maintain the generality and simplicity of the proofs, we work under Assumption 3.7.
We also consider the important special case that the functions in Definition 3.6 only depend on the second component, so that the sequence \({\left\{ {\xi _i}\right\} }_{i \in [n]}\) can be ignored. Hence, we add an additional assumption.
Assumption 3.8
We assume that the functions \(f_i\) only depend on the variables \({\left\{ {Y_{i,j}}\right\} }\) for \(1 \le i < j \le n\). That is, we can write \(f_i: {\mathscr {Y}}^{\left( {\begin{array}{c}k_i\\ 2\end{array}}\right) } \rightarrow {\mathbb {R}}\).
Such functions appear naturally, for example, when counting subgraphs in an inhomogeneous Bernoulli random graph. An example of such generalised U-statistic is simplex counts in \({{\textbf {X}}(n,p)}\) and is worked out in Sect. 6. We recall, for the purposes of the following result, that n is the number of variables in the sequence \(\{\xi _i\}_{1 \le i \le n}\) and \(\left( {\begin{array}{c}n\\ 2\end{array}}\right) \) is the number of variables in the sequence \(\{Y_{i,j}\}_{1 \le i < j \le n}\) from Definition 3.6.
Theorem 3.9
Let \(Z \sim \textrm{MVN}(0, \textrm{Id}_{d \times d})\) and let \(h \in {\mathscr {H}}_d\). Assume W with covariance matrix \(\Sigma \) satisfies Assumption 3.7. Then
and
Here,
and
If only Assumption 3.7 is satisfied, then \(\gamma = \frac{1}{2}\) and \(\delta = 1\). If additionally Assumption 3.8 is also satisfied, then \(\gamma = 1\) and \(\delta = 4\).
4 Critical simplex counts for lexicographical Morse matchings
Now we attend to our motivating problem, critical simplex counts. Consider the random simplicial complex \({{\textbf {X}}(n,p)}\). In this section we study the joint distribution of critical simplices in different dimensions with respect to the lexicographical matching on \({{\textbf {X}}(n,p)}\). We start with the following lemma, which is an immediate consequence of Definition 2.3, allowing us to write down the variables of interest in terms of the edge indicators.
Lemma 4.1
Let \({\mathscr {L}}\) be a simplicial complex endowed with the lexicographical acyclic partial matching, and consider a simplex \(t \in {\mathscr {L}}\) with minimal vertex \(i \in [n]\). Then, t is matched with
-
(1)
one of its co-faces if and only if there exists some \(j < i\) for which \(t \cup {\left\{ {j}\right\} } \in {\mathscr {L}}\); and,
-
(2)
one of its faces if and only for all \(j < i\) we have \((t \setminus {\left\{ {i}\right\} }) \cup {\left\{ {j}\right\} } \notin {\mathscr {L}}\).
For any pair of integers \(1 \le i < j \le n\) let \({Y_{i,j} :=\mathbb {1}\left( {\left\{ {i,j}\right\} } \in {\textbf {X}}(n,p)\right) }\) be the edge indicator. Fix \(s \in C_k\). Define the variables \(X_s^{+} = \mathbb {1}\) (s matches with its coface given it is a simplex) and \(X_s^{-} = \mathbb {1}\) (s matches with its face given it is a simplex). The events that the two variables indicate are disjoint. By Lemma 4.1 we can see that \(X_s^{+} = 1 - \prod _{a=1}^{\min (s) - 1}\left( 1 - \prod _{b \in s} Y_{a,b} \right) \) and \(X_s^{-} = \prod _{a=1}^{\min (s) - 1}\left( 1 - \prod _{b \in s_{-}} Y_{a,b}\right) \), where \(s_{-} :=s {\setminus } {\left\{ {\min (s)}\right\} }\). Hence,
Thus, the random variable of interest, counting the number of \((k-1)\)-simplices that are critical under the lexicographical matching, is
Note that this random variable does not fit into the framework of generalised U-statistics because the summands in \(T_k\) depend not only on the variables that are indexed by the subset s. Therefore, Theorem 3.9 cannot be applied here.
4.1 Mean and variance
Lemma 4.2
For any \(1 \le k \le n - 1\) we have:
Proof
Moreover,
\(\square \)
In this example, bounding the variance is not immediate. The proof of the following Lemma 4.3 are long (and not particularly insightful) calculations, which are deferred to the Appendix. In Lemma 4.3 the constant could have been made explicit at the expense of a lengthy calculation while an explicit expression for the variance is given in Lemma A.1.
Lemma 4.3
For a fixed integer \(1 \le k \le n-1\) and \(p \in (0,1)\) there is a constant \(C_{p,k} > 0\) independent of n and a natural number \(N_{p,k}\) such that for any \(n \ge N_{p,k}\):
Just knowing the expectation and the variance can already give us some information about the variable. For example, we obtain the following proposition. This proposition shows that considering only a subset of the simplices already gives a good approximation for the critical simplex counts.
Proposition 4.4
Fix \(k \in [n]\). Let \(K \le n - k\) and set the random variable:
If \(K =K(n)= \omega (\ln ^{1 + \epsilon }(n))\) for any \(\epsilon > 0\), then the variable \(T_{k+1} - T_{k+1}^K\) vanishes with high probability, provided that p and k stay constant.
Proof
A similar calculation to that for Lemma 4.2 shows that:
Using Markov’s inequality, we get:
which asymptotically vanishes as long as \(K = \omega (\ln ^{1+\epsilon }(n))\). \(\square \)
4.2 Approximation theorem
For \(i \in [d]\), recall a random variable counting i-simplices in \({{\textbf {X}}(n,p)}\) that are critical under the lexicographical matching, as given in (4.1). We write for the i-th index set \({\mathbb {I}}_i :=C_{i+1} \times {\left\{ {i}\right\} }\). For \(s = (\phi ,i) \in {\mathbb {I}}_i\) we write
and \(\sigma _i = \sqrt{{{\,\textrm{Var}\,}}(T_{i+1})}\). Let
Let \(W_i = \sum _{s \in {\mathbb {I}}_i} X_{s}\) and \(W = (W_1, W_2, \ldots , W_d) \in {\mathbb {R}}^d\). For bounds that asymptotically go to zero for this example, we use Theorems 3.2 and 3.3 directly: the uniform bounds from Corollary 3.4 are not fine enough here. We note that here, due to the requirement of criticality, two summands \(X_s\) and \(X_u\) become dependent as soon as the corresponding subsets share a vertex. This is in contrast to simplex counts, which required an overlap of at least two vertices.
Theorem 4.5
Let \(Z \sim \textrm{MVN}(0, \textrm{Id}_{d \times d})\) and \(\Sigma \) be the covariance matrix of W.
-
(1)
Let \(h \in {\mathscr {H}}_d\). Then there is a constant \(B_{4.5.1} >0\) independent of n and a natural number \(N_{4.5.1}\) such that for any \(n \ge N_{4.5.1}\) we have
$$\begin{aligned}\left| {\mathbb {E}}h(W) - {\mathbb {E}}h(\Sigma ^{\frac{1}{2}}Z) \right| \le B_{4.5.1} \left| h \right| _3 n^{-1}.\end{aligned}$$ -
(2)
There is a constant \(B_{4.5.2} >0\) independent of n and a natural number \(N_{4.5.2}\) such that for any \(n \ge N_{4.5.2}\) we have
$$\begin{aligned}\sup _{A \in {\mathscr {K}}}|{\mathbb {P}}(W \in A) -{\mathbb {P}}(\Sigma ^{\frac{1}{2}}Z \in A)| \le B_{4.5.2}n^{-\frac{1}{4}}.\end{aligned}$$
Proof
It is clear that W satisfies the conditions of Theorems 3.2 and 3.3 for any \(s = (\phi ,i) \in {\mathbb {I}}_i\) setting
We apply Theorems 3.2 and 3.3. For the bounds on the quantity \(B_{3.2}\) from Theorems 3.2 and 3.3 we use Lemma 3.5 and Lemma 4.3. We write C for an unspecified positive constant that does not depend on n. Also, we assume here that n is large enough for the bound in Lemma 4.3 to apply. Let \(\mu (i,a) = p^{\left( {\begin{array}{c}i+1\\ 2\end{array}}\right) }\left( (1-p^{i+1})^{a - 1}-(1-p^{i})^{a - 1} \right) .\) Then we have:
\(\square \)
Remark 4.6
The relevance of understanding the number of critical simplices in the context of applied and computational topology is as follows. We assume that \(p \in (0,1)\) and \(k \in \{1, 2, \ldots \} \) are constants.
-
(1)
As seen in Lemma 4.2, the expected number of critical k-simplices under the lexicographical matching is one power of n smaller than the total number of k-simplices in \({{\textbf {X}}(n,p)}\).
-
(2)
From Proposition 4.4, it is with high probability that in \({{\textbf {X}}(n,p)}\) all k-simplices \({s \in {\textbf {X}}(n,p)}\) with \(\min (s) = \omega (\ln ^{1+\epsilon }(n))\) for any fixed \(\epsilon > 0\) are not critical.
5 Simplex counts in links
Consider the random simplicial complex \({{\textbf {X}}(n,p)}\). For \(1 \le i < j \le n\) define the edge indicator \({Y_{i,j} :=\mathbb {1}\left( {\left\{ {i,j}\right\} } \in {\textbf {X}}(n,p)\right) }\). In this section we study the count of \((k-1)\)-simplices that would be in the link of a fixed subset \(t \subseteq [n]\) if the subset spanned a simplex in \({{\textbf {X}}(n,p)}\). Given that t is a simplex, the variable counts the number of \((k-1)\)-simplices in \(\textbf{lk}(t)\). Thus, the random variable of interest is
Note that the product \( \prod _{i \in s, j \in t} Y_{i,j} \) ensures that \(t \cup s\) is a simplex if t spans a simplex.
Remark 5.1
The random variable \(T^t_k\) does not fit into the framework of generalised U-statistics, because the summands depend not only on the variables that are indexed by the subset s and we do not sum over all subsets s but rather only the ones that do not intersect t. Hence, Theorem 3.9 does not apply here.
Moreover, note that given the number of vertices of the link of a simplex t, the conditional distribution of the link of t is again \({{\textbf {X}}(n,p)[n'][p]}\), where \(n'\) is a random variable equal to the number of vertices in the link. If we are interested in such a conditional distribution, the results proved later in Sect. 6 apply. However, in this section we study the number of simplices in the link of t given that t is a simplex rather than given the number of vertices of the link of t. Such a random variable behaves differently from the simplex counts in \({{\textbf {X}}(n,p)}\). For example, the summands of \(T_k^t\) have a different dependence structure compared to the summands of \(T_k\) (see Eq. 6.1 below). As a result, the approximation bounds are of different order.
It is natural to ask whether the results obtained in this section follow from those of Sect. 6 below. This might well be the case, but the answer is not straightforward. One could derive an approximation for the number of simplices in \(\textbf{lk}(t)\) given the number of vertices in the link; the variable \(T_k^t\) could then be approximated by a mixture, induced by the distribution of the number of vertices in the link (which is binomial). However, applying this approach naïvely yields bounds that do not converge to zero. While it is certainly possible that a different approach would succeed, we prefer to prove the approximation directly.
5.1 Mean and variance
It is easy to see that for any positive integer k and \(t \subseteq [n]\),
since there are \(\left( {\begin{array}{c}n - |t|\\ k+1\end{array}}\right) \) choices for \(s \in C_{k+1}\) such that \(s \cap t = \varnothing \). Next we derive a lower bound on the variance.
Lemma 5.2
For any fixed \(1 \le k \le n-1\) and \(t \subseteq [n]\) we have:
Proof
First let us calculate \({{\,\textrm{Cov}\,}}(T^t_{k+1}, T^t_{l+1})\). For fixed subsets \(s \in C_{k+1}\) and \(u \in C_{l+1}\) if \(|s \cap u| = 0\), then the corresponding variables \(\prod _{i \ne j \in s} Y_{i,j} \prod _{i \in s, j \in t} Y_{i,j}\) and \(\prod _{i \ne j \in u} Y_{i,j} \prod _{i \in u, j \in t} Y_{i,j}\) are independent and so have zero covariance.
For \(1 \le m \le l + 1\), the number of pairs of subsets \(s \in C_{k+1}\) and \(u \in C_{l+1}\) such that \(s \cap t = \varnothing = u \cap t\) and \(|s \cap u| = m\) is \(\left( {\begin{array}{c}n-|t|\\ k+1\end{array}}\right) \left( {\begin{array}{c}k+1\\ m\end{array}}\right) \left( {\begin{array}{c}n-|t| - k - 1\\ l + 1-m\end{array}}\right) \). Since each summand is non-negative, we lower bound by the \(m=1\) summand and get (with \(\left( {\begin{array}{c}1\\ 2\end{array}}\right) := 0\))
Taking \(l = k\) completes the proof. \(\square \)
5.2 Approximation theorem
For a multivariate normal approximation of counts given in Equation (5.1), we write \(\sigma _i = \sqrt{{{\,\textrm{Var}\,}}(T^t_{i+1})}\) and \(C_{i+1}^t = \left\{ \,\phi \in C_{i+1} \;\vert \; \phi \cap t = \varnothing \,\right\} \), as well as \({\mathbb {I}}_i :=C_{i+1}^t \times {\left\{ {i}\right\} }\). For \(s = (\phi ,i) \in {\mathbb {I}}_i\) define
It is clear that \({\mathbb {E}}\left\{ X_{s} \right\} = 0\). Let \(W^t_i = \sum _{s \in {\mathbb {I}}_i} X_{s}\) and \(W^t = (W^t_1, W^t_2, \ldots , W^t_d) \in {\mathbb {R}}^d\). Then we have the following approximation theorem.
Theorem 5.3
Let \(Z \sim \textrm{MVN}(0, \textrm{Id}_{d \times d})\) and \(\Sigma \) be the covariance matrix of \(W^t\).
-
(1)
Let \(h \in {\mathscr {H}}_d\). Then
$$\begin{aligned} \left| {\mathbb {E}}h(W^t) - {\mathbb {E}}h(\Sigma ^{\frac{1}{2}}Z) \right| \le \left| h \right| _3 B_{5.3} (n - |t|)^{-\frac{1}{2}}. \end{aligned}$$ -
(2)
Moreover,
$$\begin{aligned}\sup _{A \in {\mathscr {K}}}|{\mathbb {P}}(W^t \in A) -{\mathbb {P}}(\Sigma ^{\frac{1}{2}}Z \in A)| \le 2^{\frac{7}{2}} 3^{-\frac{3}{4}}d^{\frac{3}{16}}B_{5.3}^{\frac{1}{4}}(n - |t|)^{-\frac{1}{8}}.\end{aligned}$$
Here
Proof
It is clear that \(W^t\) satisfies the conditions of Corollary 3.4 with the dependency neighbourhood \({\mathbb {D}}_j(s) = \left\{ \,(\psi ,j) \in {\mathbb {I}}_j \;\vert \; |\phi \cap \psi | \ge 1\,\right\} \) for any \(s = (\phi ,i) \in {\mathbb {I}}_i\). So we aim to bound the quantity \(B_{3.4}\) from the corollary.
Given \(\phi \in C^t_{i+1}\) and \(m \le \min (i+1, j+1)\) there are \(\left( {\begin{array}{c}i+1\\ m\end{array}}\right) \left( {\begin{array}{c}n-|t| - i-1\\ j+1-m\end{array}}\right) \) subsets \(\psi \in C^t_{j+1}\) such that \(|\phi \cap \psi | = m\). Therefore, for any \(i,j \in [d]\) and \(s \in {\mathbb {I}}_i\) we have
giving a bound for \(\alpha _{ij}\). For a bound on \(\beta _{ijk}\), applying Lemma 3.5, for any \(i,j,k \in [d]\) and \(s \in {\mathbb {I}}_i, u \in {\mathbb {I}}_j, v \in {\mathbb {I}}_k\) we get
Now we apply Corollary 5.2 and get
Taking both sides of the inequality to the power of \(-\frac{1}{2}\) we get for any \(i \in [d]\)
Using Eqs. (5.2)–(5.5) to bound \(B_{3.4}\) from Corollary 3.4 we get:
\(\square \)
Remark 5.4
Recall that \({\mathbb {E}}\left\{ T^t_{k+1} \right\} = \left( {\begin{array}{c}n - |t|\\ k+1\end{array}}\right) p^{\left( {\begin{array}{c}k+1\\ 2\end{array}}\right) + |t|(k+1)}\). By Stirling’s approximation, if \(p \in (0,1)\) is a constant, then \(\max (k, |t|) = \Omega (\ln ^{1+\epsilon }(n))\) for any positive \(\epsilon \) forces the expectation to go to 0 asymptotically. Hence, by Markov’s inequality, with high probability there are no k-simplices in the link of t as long as \(\max (k, |t|)\) is of order \(\ln ^{1+\epsilon }(n)\) or larger for any \(\epsilon > 0\) for a constant p.
Recall that in Theorem 5.3 we count all simplices up to dimension d in the link of t. Note that if \(\max (d^2, d|t|) = O(\ln ^{1 - \epsilon }(n))\) for any \(\epsilon > 0\), then the bounds in Theorem 5.3 tend to 0 as n tends to infinity as long as \(p \in (0,1)\) stays constant. In particular, if d is a constant, Theorem 5.3 gives an approximation for all sizes of t for which the approximation is needed.
6 Simplex counts in \({{\textbf {X}}(n,p)}\)
In this section we apply Theorem 3.9 to approximate simplex counts. Consider \({G \sim {\textbf {G}}(n,p)}\). For \(1 \le x < y \le n\) let \(Y_{x,y} :=\mathbb {1}\left( x \sim y\right) \) be the edge indicator. In this section we are interested in the \((i+1)\)-clique count in \({{\textbf {G}}(n,p)}\) or, equivalently, the i-simplex count in \({{\textbf {X}}(n,p)}\), given by
Let \({\mathscr {Y}}^{i+1} = \{0,1\}^{i+1}\) and let \(f_i: {\mathscr {Y}}^{i+1} \rightarrow {\mathbb {R}}\) be the function
Then the associated generalised U-statistic \(S_{n, i + 1}(f_i)\) equals the \((i+1)\)-clique count \(T_{i+1}\), as given by Eq. (6.1). To apply Theorem 3.9 we need to center and rescale our variables. It is easy to see that \({\mathbb {E}}\{ f_i({\mathscr {Y}}_\phi )\} = p^{\left( {\begin{array}{c}i+1\\ 2\end{array}}\right) }\) if \(\phi \in C_{i+1}\). We let \(I_i :=C_{i+1} \times {\left\{ {i}\right\} }\) and for \(s = (\phi ,i) \in {\mathbb {I}}_i\) we define \(X_{s} :={\sigma ^{-1}} \left( f_i({\mathscr {Y}}_\phi ) - p^{\left( {\begin{array}{c}i+1\\ 2\end{array}}\right) } \right) \) and \(W_i = \sum _{s \in {\mathbb {I}}_i} X_{s}\). Now the vector of interest is \(W = (W_1, W_2, \ldots , W_d) \in {\mathbb {R}}^d\). This brings us to the next approximation theorem.
Corollary 6.1
Let \(Z \sim \textrm{MVN}(0, \textrm{Id}_{d \times d})\) and \(\Sigma \) be the covariance matrix of W.
-
(1)
Let \(h \in {\mathscr {H}}_d\). Then
$$\begin{aligned} \left| {\mathbb {E}}h(W) - {\mathbb {E}}h(\Sigma ^{\frac{1}{2}}Z) \right| \le \left| h \right| _3 B_{6.1} n^{-1}. \end{aligned}$$ -
(2)
Moreover,
$$\begin{aligned} \sup _{A \in {\mathscr {K}}}|{\mathbb {P}}(W \in A)-{\mathbb {P}}(\Sigma ^{\frac{1}{2}}Z \in A)| \le 2^{\frac{7}{2}} 3^{-\frac{3}{4}}d^{\frac{3}{16}}B_{6.1}^{\frac{1}{4}}n^{-\frac{1}{4}}. \end{aligned}$$
Here
Proof
Firstly, observe that for any \(\phi , \psi \in C_{i+1}\) for which \(| \phi \cap \psi | \le 1\) the covariance vanishes, while if \(| \phi \cap \psi | \ge 2\) the covariance is non-zero, and we have
For \(s = (\phi ,i) \in {\mathbb {I}}_i\) write \({\hat{X}}_{s} = f_i({\mathscr {Y}}_\phi ) - p^{\left( {\begin{array}{c}i+1\\ 2\end{array}}\right) }\). Then by Lemma 3.5 we get:
Since \( \left\{ p^{\left( {\begin{array}{c}i+1\\ 2\end{array}}\right) + \left( {\begin{array}{c}j+1\\ 2\end{array}}\right) }(1-p^{\left( {\begin{array}{c}i+1\\ 2\end{array}}\right) })(1-p^{\left( {\begin{array}{c}j+1\\ 2\end{array}}\right) }) \right\} ^{\frac{1}{2}} \le p(1-p^{\left( {\begin{array}{c}d+1\\ 2\end{array}}\right) })\), we see that Assumption 3.7 holds. Assumption 3.8 also holds and therefore we can apply Theorem 3.9 with \(k_i = i+1\), \(K_i = (2(i+1)^2 - 2(i+1))^{-\frac{1}{2}(i+1) + 1}\), \(\alpha _i = p^{2\left( {\begin{array}{c}i+1\\ 2\end{array}}\right) }(p^{-1} - 1)\), and \(\beta = p(1-p^{\left( {\begin{array}{c}d+1\\ 2\end{array}}\right) })\). Using the bounds \(K_i \le 1\) as well as \(2 \le k_i^{\min (k_i,k_j) + 1} \le d^{d+1}\), and \(\sqrt{\alpha _i} \ge p^{\left( {\begin{array}{c}d+1\\ 2\end{array}}\right) }\sqrt{p^{-1}-1}\) finishes the proof. \(\square \)
Remark 6.2
It is easy to show that with high probability there are no large cliques in \({{\textbf {G}}(n,p)}\) for \(p<1\) constant. To see this, the expectation of the number of k-cliques is \(\left( {\begin{array}{c}n\\ k\end{array}}\right) p^{\left( {\begin{array}{c}k\\ 2\end{array}}\right) }\). By Stirling’s approximation, \(k = \Omega (\ln ^{1+\epsilon }(n))\) for any positive \(\epsilon \) forces the expectation to go to 0 asymptotically. Hence, by Markov’s inequality, for any \(\epsilon > 0\), with high probability there are no cliques of order \(\ln ^{1+\epsilon }(n)\) or larger.
Recall that in Corollary 6.1 the size of the maximal clique we count is \(d + 1\). Note that if \(d = O(\ln ^{\frac{1}{2} - \epsilon }(n))\) for any \(\epsilon > 0\), then the bounds in Corollary 6.1 tend to 0 as n tends to infinity as long as \(p \in (0,1)\) stays constant. This value might seem quite small but in the light of there not being any cliques of order \(\ln ^{1+\epsilon }(n)\) with high probability, this is meaningfully large.
Remark 6.3
Note that in Corollary 6.1 we use a multivariate normal distribution with covariance \(\Sigma \), which is the covariance of W when n is finite and it differs from the limiting covariance, as mentioned in Reinert and Röllin (2010). To approximate W with the limiting distribution, in the spirit of Reinert and Rollin (2010, Proposition 3) one could proceed in two steps: use the existing theorems to approximate W with \(\Sigma Z\) and then approximate \(\Sigma Z\) with \(\Sigma _L Z\) where \(\Sigma _L\) is the limiting covariance, which is non-invertible, as observed in Janson and Nowicki (1991).
Remark 6.4
Corollary 6.1 generalises the result (Reinert and Röllin 2010, Proposition 2) beyond the case when \(d=2\) and we get a bound of the same order of n. Kaur and Roollin (2021, Theorem 3.1) considers centered subgraph counts in a random graph associated to a graphon. If we take the graphon to be constant, the associated random graph is just \({{\textbf {G}}(n,p)}\). Compared to Kaur and Roollin (2021, Theorem 3.1) we place weaker smoothness conditions on our test functions. However, we make use of the special structure of cliques whereas Kaur and Roollin (2021, Theorem 3.1) applies to any centered subgraph counts. Translating Kaur and Roollin (2021, Theorem 3.1) into a result for uncentered subgraph counts, as we provide here in the special case of clique counts, is not trivial for general d.
However, it should be possible to extend our results, using the same abstract approximation theorem, beyond the random clique complex to Linial-Meshulam random complexes (Linial and Meshulam 2006) or even more general multiparameter random complexes (Costa and Farber 2016). We shall consider this conjecture in future work.
7 A proof of the multivariate CLT for dissociated Sums and U-statistics
7.1 A proof of the multivariate CLT
Throughout this subsection, \(W \in {\mathbb {R}}^d\) is a vector of dissociated sums in the sense of Definition 3.1, with covariance matrix whose entries are \(\Sigma _{ij} = {{\,\textrm{Cov}\,}}(W_i, W_j)\) for \((i,j) \in [d]^2\). For each \(s \in {\mathbb {I}}\) we denote by \({\mathbb {D}}(s) \subset {\mathbb {I}}\) the disjoint union \(\bigcup _{j=1}^d {\mathbb {D}}_j(s)\). For each triple \((s,t,j) \in {\mathbb {I}}^2 \times [d]\) we write the set-difference \({\mathbb {D}}_j(t) {\setminus } {\mathbb {D}}_j(s)\) as \({\mathbb {D}}_j(t;s)\), with \({\mathbb {D}}(t;s) \subset {\mathbb {I}}\) denoting the disjoint union of such differences over \(j \in [d]\).
7.1.1 Smooth test functions
To prove Theorem 3.2 we use Stein’s method for multivariate normal distributions; for details see for example Chapter 12 in Chen (2011). Our proof of Theorem 3.2 is based on the Stein characterization of the multivariate normal distribution: \(Z \in {\mathbb {R}}^d\) is a multivariate normal \(\textrm{MVN}(0, \Sigma )\) if and only if the identity
holds for all twice continuously differentiable \(f: {\mathbb {R}}^{d} \rightarrow {\mathbb {R}}\) for which the expectation exists. In particular, we will use the following result based on Meckes (2009, Lemma 1 and Lemma 2). As Lemma 1 and Lemma 2 in Meckes (2009) are stated there only for infinitely differentiable test functions, we give the proof here for completeness.
Lemma 7.1
(Lemma 1 and Lemma 2 in Meckes (2009)) Fix \(n \ge 2\). Let \(h: {\mathbb {R}}^{d} \rightarrow {\mathbb {R}}\) be n times continuously differentiable with n-th partial derivatives being Lipschitz and \(Z \sim \textrm{MVN}(0, \textrm{Id}_{d \times d})\). Then, if \(\Sigma \in {\mathbb {R}}^{d \times d}\) is symmetric positive semidefinite, there exists a solution \(f: {\mathbb {R}}^{d} \rightarrow {\mathbb {R}}\) to the equation
such that f is n times continuously differentiable and we have for every \(k=1, \ldots , n\):
Proof
Let h be as in the assertion. It is shown in Lemma 2.1 in Chatterjee and Meckes (2008), which is based on a reformulation of Eq. (2.20) in Barbour (1990), that a solution of (7.2) for h is given by \(f(x) = f_h(x) = \int _0^1 \frac{1}{2t} {\mathbb {E}}\{ h( Z_{x,t} ) \} \, dt\), with \(Z_{x, t}=\sqrt{t} x+\sqrt{1-t} \Sigma ^{1 / 2} Z\). As h has n-th partial derivatives being Lipschitz and hence for differentiating f we can bring the derivative inside the integral, it is straightforward to see that the solution f is n times continuously differentiable.
The bound on \(\left| f \right| _k\) is a consequence of
for any \(i_1, i_2, \ldots , i_k\); see, for example, Equation (10) in Meckes (2009). Taking the sup-norm on both sides and bounding the right hand side of the equation gives
\(\square \)
Proof of Theorem 3.2
To prove Theorem 3.2, we replace w by W in Eq. (7.2) and take the expected value on both sides. As a result, we aim to bound the expression
where f is a solution to the Stein equation (7.2) for the test function h. Since the variables \({\left\{ {X_s \mid s \in {\mathbb {I}}}\right\} }\) are centered and as \(X_t\) is independent of \(X_s\) if \(t \not \in {\mathbb {D}}(s)\), for each \((i,j) \in [d]^2\) we have
We now use the decomposition of \(\Sigma _{ij}\) from (7.4) in the expression (7.3). For each pair \((s,j) \in {\mathbb {I}}\times [d]\) and \(t \in {\mathbb {D}}(s)\) we set \( {\mathbb {D}}_j(t;s) = {\mathbb {D}}_j(t) {\setminus } {\mathbb {D}}_j(s) \) and
By Definition 3.1, \(W_j^s \) is independent of \(X_s\), while \(W_j^{s,t}\) is independent of the pair \((X_s,X_t)\).
Next we decompose the r.h.s. of (7.3);
with
Here we recall that if \(s=(k,i)\) then \(|s|=i \in [d]\). \(\square \)
As with the vector of dissociated sums \(W \in {\mathbb {R}}^d\) itself, we can assemble these differences into random vectors. Thus, \(W^s \in {\mathbb {R}}^d\) is \((W^s_1,\ldots ,W^s_d)\), and similarly \(W^{s,t} = (W^{s,t}_1,\ldots W^{s,t}_d)\). In the next three claims, we provide bounds on \(R_i\) for \(i \in [3]\).
Claim 7.2
The absolute value of the expression \(R_1\) from (7.6) is bounded above by
Proof
Note that
For each \(s \in {\mathbb {I}}_{i}\), it follows from (7.5) that \(W = U^s + W^s\). Using the Lagrange form of the remainder term in Taylor’s theorem, we obtain
for some random \(\theta _s \in (0,1)\). Using this Taylor expansion in the expression for \(R_1\), we get the following four-term summand \(S_{i,s}\) for each \(i \in [d]\) and \(s \in {\mathbb {I}}_i\):
The second and fourth terms cancel each other. Recalling that \(X_s\) is centered by definition and independent of \(W^s\) by Definition 3.1, the first term also vanishes and
Recalling that \(\left\Vert \partial _{ijk}f\right\Vert _{\infty } \le \left| f \right| _3\) and that \(U^s_j = \sum _{t \in {\mathbb {D}}_j(s)}X_t\), we have:
as desired. \(\square \)
Claim 7.3
The absolute value of the expression \(R_2\) from (7.7) is bounded above by
Proof
Recalling that \(U^s_j = \sum _{t \in {\mathbb {D}}_j(s)} X_{t}\) and \({\mathbb {D}}(s) = \bigcup _{j=1}^d {\mathbb {D}}_j(s)\),
Fix \(s \in {\mathbb {I}}\) and \(t \in {\mathbb {D}}_j(s)\). Recall that by (7.5), \(W^s = W^{s,t} + V^{s,t}\). Using the Lagrange form of the remainder term in Taylor’s theorem, we obtain:
for some random \(\theta _{s,t} \in (0,1)\). Using this Taylor expansion in the expression for \(R_2\), we get the following three-term summand \(S_{s,t}\) for each pair \((s, t) \in {\mathbb {I}}\times {\mathbb {D}}_j(s)\):
Recalling that \(W^{s,t}\) is independent of the pair \((X_s, X_t)\) the first and the last terms cancel each other and only the sum over k is left:
Recalling that \(\left\Vert \partial _{ijk}f\right\Vert _{\infty } \le \left| f \right| _3\) and that \(V_k^{s,t} = \sum _{v \in {\mathbb {D}}_k(t;s)} X_v\) we have:
as required. \(\square \)
Claim 7.4
Proof
Fix \((s, t) \in {\mathbb {I}}\times {\mathbb {D}}_j(s)\). Recall that by (7.5), \(W^{s,t} = W - U^s - V^{s,t}\). Using the Lagrange form of the remainder term in Taylor’s theorem, we obtain
for some random \(\rho _{s,t} \in (0,1)\). Recalling that \(U_k^s = \sum _{t \in {\mathbb {D}}_k(s)} X_t\) and \(V_k^{s,t} =\sum _{u \in {\mathbb {D}}_j(t;s)} X_u\),
Recalling that \(\left\Vert \partial _{ijk}f\right\Vert _{\infty } \le \left| f \right| _3\) we bound:
as required. \(\square \)
Take any \(h \in {\mathscr {H}}_d\). Let \(f: {\mathbb {R}}^d \rightarrow {\mathbb {R}}\) be the associated solution from Lemma 7.1. Combining Claims 7.1–7.4 and using Lemma 7.1 we have:
7.1.2 Non-smooth test functions
Convex set indicator test functions provide a stronger distance between probability distributions. Also, from a non-asymptotic perspective, the distance might be more useful in statistical applications: for example, when estimating confidence regions, which are often convex sets. Here we follow (Kaur and Röollin 2021, Section 5.3) very closely to derive a bound on the convex set distance between a vector of dissociated sums \(W \in {\mathbb {R}}^d\) with covariance matrix \(\Sigma \) and a target multivariate normal distribution \(\Sigma ^{\frac{1}{2}}Z\), where \(Z \sim \textrm{MVN}(0, \textrm{Id}_{d \times d})\). The smoothing technique used here is introduced in Gan et al. (2017). However, a better (polylogarithmic) dependence on d could potentially be achieved using a recent result (Gaunt and Li 2023, Proposition 2.6), at the expense of larger constants. The recursive approach from Schulte and Yukich (2019), Kasprzak and Peccati (2022) usually yields better dependence on n; however, this requires the target normal distribution to have an invertible covariance matrix. Since this property does not always hold in our applications of interest, we do not use the recursive approach here.
Proof of Theorem 3.3
Fix \(A \in {\mathscr {K}}\), \(\epsilon > 0\) and define
where \(d(y, A)=\inf _{x \in A}\left\Vert x-y \right\Vert _2\) and \(B(y; \epsilon ) = \left\{ \,z \in {\mathbb {R}}^d \;\vert \; \left\Vert y-z\right\Vert _2 \le \epsilon \,\right\} \).
Let \( {\mathscr {H}}_{\epsilon , A} :={\left\{ {h_{\epsilon , A}: {\mathbb {R}}^{d} \rightarrow [0,1]; A \in {\mathscr {K}}}\right\} }\) be a class of functions such that \(h_{\epsilon , A}(x)=1\) for \(x \in A\) and 0 for \(x \notin A^{\epsilon }\). Then, by Bentkus (2003, Lemma 2.1) as well as inequalities (1.2) and (1.4) from Bentkus (2003), for any \(\epsilon > 0\) we have
Let \(f: {\mathbb {R}}^{d} \rightarrow {\mathbb {R}}\) be a bounded Lebesgue measurable function, and for \(\delta >0\) let
Set \(\delta =\frac{\epsilon }{16\sqrt{d}}\) and \(h_{\epsilon , A}=S_{\delta }^{4} I_{A^{\epsilon /4}}\), where \(I_{A^{\epsilon /4}}\) is the indicator function of the subset \(A^{\epsilon /4} \subseteq {\mathbb {R}}^d\). By Gan et al. (2017, Lemma 3.9) we have that \(h_{\epsilon , A}\) is bounded and is in \({\mathscr {H}}_d\).
Moreover, the following bounds hold:
Note that \(h_{\epsilon , A}=S_{\delta }^{4} I_{A^{\epsilon /4}} \in {\mathscr {H}}_{\epsilon , A}\) and hence (Bentkus 2003, Lemma 2.1) applies. Using this with Theorem 3.2 we get
Since this bound works for every \(\epsilon > 0\), we minimise it by using \(\epsilon = \left( \frac{3B_{3.2}}{4d^{\frac{1}{4}}}\right) ^{\frac{1}{4}}\). \(\square \)
7.2 A proof of the CLT for U-statistics
Proof of Theorem 3.9
Note that if \(s = (\phi ,i) \in {\mathbb {I}}_i\) and \(u = (\psi , j) \in {\mathbb {I}}_j\) are chosen such that \(\phi \cap \psi = \varnothing \), then the corresponding variables \(X_{s}\) and \(X_{u}\) are independent since \(f_i({\mathscr {X}}_s, {\mathscr {Y}}_s)\) and \(f_j({\mathscr {X}}_u, {\mathscr {Y}}_u)\) do not share any random variables from the sets \(\{\xi _i\}_{1 \le i \le n}\) and \(\{Y_{i,j}\}_{1 \le i < j \le n}\). Hence, if for any \(s = (\phi ,i) \in {\mathbb {I}}_i\) we set \({\mathbb {D}}_j(s) = \left\{ \,(\psi , j) \in {\mathbb {I}}_j \;\vert \; |\phi \cap \psi | \ge 1\,\right\} \), then W satisfies the assumptions of Corollary 3.4. It remains to bound the quantity \(B_{3.4}\).
First, to find \(\alpha _{ij}\) as in Corollary 3.4, given \(\phi \in C_{k_i}\) and if \(k_i, k_j \ge m\) then there are \(\left( {\begin{array}{c}k_i\\ m\end{array}}\right) \left( {\begin{array}{c}n-k_i\\ k_j-m\end{array}}\right) \) subsets \(\psi \in C_{k_j}\) such that \(|\phi \cap \psi | = m\). Therefore, we have for any \(i,j \in [d]\) and \(s \in {\mathbb {I}}_i\)
Note that
as well as
Using Assumption 3.7, for any \(i,j,l \in [d]\) and \(s \in {\mathbb {I}}_i, t \in {\mathbb {I}}_j, u \in {\mathbb {I}}_l\)
To take care of the variance terms, we lower bound the variance using Assumption 3.7;
Here the second-to-last inequality follows by taking only the term for \(m=1\). Now we take both sides of the inequality to the power of \(-\frac{1}{2}\) to get that for any \(i \in [d]\)
Using Eqs. (7.9)–(7.11) to bound the quantity \(B_{3.4}\) from Corollary 3.4 we get
Now further assume that Assumption 3.8 hold. The key difference in this case is that the dependency neighbourhoods become smaller: now the subsets need to overlap in at least 2 elements for the corresponding summands to share at least one variable \(Y_{i,j}\) and hence become dependent. This makes both the variance and the size of dependency neighbourhoods smaller. In the context of Theorem 3.2, the trade-off works out in our favour to give smaller bounds, as follows. For any \(s = (\phi ,i) \in {\mathbb {I}}_i\) we set \({\mathbb {D}}_j(s) = \left\{ \,(\psi ,j) \in {\mathbb {I}}_j \;\vert \; |\phi \cap \psi | \ge 2\,\right\} \), so that W, under the additional Assumption 3.8, satisfies the assumptions of Corollary 3.4.
Now Eq. (7.9) becomes
Equation (7.11) becomes
Using the adjusted bounds in Corollary 3.4 gives the result. \(\square \)
References
Adler, R.J., Bobrowski, O., Weinberger, S.: Crackle: the homology of noise. In: Discrete and Computational Geometry, pp. 680–704 (2014)
Asai, R., Shah, J.: Algorithmic canonical stratifications of simplicial complexes. J. Pure Appl. Algebra 226(9), 107051 (2022)
Barbour, A.D.: Stein’s method for diffusion approximations. Probab. Theory Relat. Fields 84(3), 297–322 (1990)
Barbour, A.D., Karoński, M., Ruciński, A.: A central limit theorem for decomposable random variables with applications to random graphs. J. Combin. Theory Ser. B 47(2), 125–145 (1989)
Bauer, U., Rathod, A.: Hardness of approximation for Morse matching. In: SODA ’19: Proceedings of the Thirtieth Annual ACM-SIAM Symposium on Discrete Algorithms, pp. 2663–2674 (2019)
Bentkus, V.: On the dependence of the Berry–Esseen bound on dimension. J. Stat. Plann. Inference 113(2), 385–402 (2003)
Bobrowski, O., Kahle, M.: Topology of random geometric complexes: a survey. J. Appl. Comput. Topol. 1(3), 331–364 (2018)
Carlsson, G., et al.: Persistence barcodes for shapes. Int. J. Shape Model. 11(02), 149–187 (2005)
Chatterjee, S., Meckes, E.: Multivariate normal approximation using exchangeable pairs. Alea 4, 257–283 (2008)
Chen, L.H.Y., Goldstein, L., Shao, Q.M.: Normal Approximation by Stein’s Method. Springer, Berlin (2011)
Costa, A., Farber, M.: Large random simplicial complexes I. J. Topol. Anal. 8(03), 399–429 (2016)
Edelsbrunner, H., Harer, J.: Computational Topology: An Introduction. American Mathematical Society, Providence (2010)
Eichelsbacher, P., Rednoß, B.: Kolmogorov bounds for decomposable random variables and subgraph counting by the Stein–Tikhomirov method. Bernoulli 29(3), 1821–1848 (2023)
Fang, X.: A multivariate CLT for bounded decomposable random vectors with the best known rate. J. Theor. Probab. 29(4), 1510–1523 (2016)
Forman, R.: A user’s guide to discrete Morse theor. Séminaire Lotharingien de Combinatoire 48, B48c (2002)
Gan, H.L., Röllin, A., Ross, N.: Dirichlet approximation of equilibrium distributions in Cannings models with mutation. Adv. Appl. Probab. 49(3), 927–959 (2017)
Gaunt, R.E., Li, S.: Bounding Kolmogorov distances through Wasserstein and related integral probability metrics. J. Math. Anal. Appl. 522, 126985 (2023)
Ghrist, R.: Barcodes: the persistent topology of data. Bull. Amer. Math. Soc. (N.S.) 45(1), 61–75 (2008)
Henselman-Petrusek, G., Ghrist, R.: Matroid filtrations and computational persistent homology (2016). arXiv:1606.00199
Janson, S., Nowicki, K.: The asymptotic distributions of generalized U-statistics with applications to random graphs. Probab. Theory Relat. Fields 90(3), 341–375 (1991)
Joswig, M., Pfetsch, M.E.: Computing optimal Morse matchings. SIAM J. Discrete Math. 20(1), 11–25 (2006)
Kahle, M.: Topology of random clique complexes. Discrete Math. 309(6), 1658–1671 (2009)
Kahle, M.: Random geometric complexes. Discrete Comput. Geom. 45(3), 553–573 (2011)
Kahle, M.: Sharp vanishing thresholds for cohomology of random flag complexes. Ann. Math. 25, 1085–1107 (2014)
Kahle, M., Meckes, E.: Limit theorems for Betti numbers of random simplicial complexes. Homol. Homotopy Appl. 15(1), 343–374 (2013)
Kasprzak, M.J., Peccati, G.: Vector-valued statistics of binomial processes: Berry–Esseen bounds in the convex distance (2022). arXiv preprint arXiv:2203.13137
Kaur, G., Röollin, A.: Higher-order fluctuations in dense random graph models. Electron. J. Probab. 26, 1–36 (2021)
Korolyuk, V.S., Borovskich, Y.V.: Theory of U-Statistics, vol. 273. Springer, Berlin (2013)
Lampret, L.: Chain complex reduction via fast digraph traversal (2019). arXiv:1903.00783
Lee, A.J.: U-Statistics: Theory and Practice. Taylor & Francis (1990)
Linial, N., Meshulam, R.: Homological connectivity of random 2-complexes. Combinatorica 26(4), 475–487 (2006)
McGinley, W.G., Sibson, R.: Dissociated random variables. In: Mathematical Proceedings of the Cambridge Philosophical Society, vol. 77, pp. 185–188. Cambridge University Press (1975)
Meckes, E.: On Stein’s method for multivariate normal approximation. In: High Dimensional Probability V: The Luminy Volume, pp. 153–178. Institute of Mathematical Statistics (2009)
Mischaikow, K., Nanda, V.: Morse theory for filtrations and efficient computation of persistent homology. Discrete Comput. Geom. 50(2), 330–353 (2013)
Nanda, V.: Local Cohomology and Stratification. Found. Comput. Math. 20, 195–222 (2020)
Otter, N., et al.: A roadmap for the computation of persistent homology. EPJ Data Sci. 6, 1–38 (2017)
Owada, T., Samorodnitsky, G., Thoppe, G.: Limit theorems for topological invariants of the dynamic multi-parameter simplicial complex. Stochast. Process. Appl. 138, 56–95 (2021)
Privault, N., Serafin, G.: Normal approximation for sums of weighted U-statistics—application to Kolmogorov bounds in random subgraph counting. Bernoulli 26(1), 587–615 (2020)
Raiíc, M.: A multivariate CLT for decomposable random vectors with finite second moments. J. Theor. Probab. 17(3), 573–603 (2004)
Reinert, G., Röllin, A.: Random subgraph counts and U-statistics: multivariate normal approximation via exchangeable pairs and embedding. J. Appl. Probab. 47(2), 378–393 (2010)
Schulte, M., Yukich, J.E.: Multivariate second order Poincaré inequalities for Poisson functionals. Electron. J. Probab. 24, 1–42 (2019)
Spanier, E.: Algebraic Topology. McGraw-Hill (1966)
Acknowledgements
TT acknowledges funding from EPSRC studentship 2275810. VN is supported by the EPSRC grant EP/R018472/1 and the US AFOSR grant FA9550-22-1-0462. GR is funded in part by the EPSRC grants EP/T018445/1 and EP/R018472/1. We would like to thank Xiao Fang, Matthew Kahle, Heather Harrington, Adrian Röllin, and Christina Goldschmidt for helpful discussions Moreover, we are grateful to two anonymous referees whose comments helped improve this paper.
Author information
Authors and Affiliations
Corresponding author
Ethics declarations
Conflicts of interest
On behalf of all authors, the corresponding author states that there is no conflict of interest.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Appendix A. Proof of Lemma 4.3
Appendix A. Proof of Lemma 4.3
We recall that \(T_k\), counting the number of \((k-1)\)-simplices that are critical under the lexicographical matching, is given by
Lemma A.1
For any integer \(1 \le k \le n - 1\) we have:
where
Here we have used the following notation:
Also, \( \sum _{i < j}^{n-k}\) stands for \( \sum _{i=1}^{n-k-1} \sum _{j= i+1}^{n-k}\).
Proof of Lemma A.1
For \(s \in C_{k+1}\) recall that \(s_{-} = s \setminus {\left\{ {\min (s)}\right\} }\). We write:
Then \(Z_s\) and \(Y_s\) are independent and \(T_{k+1} = \sum _{s \in C_{k+1}} Z_s Y_s\). Consider the variance:
For the first term in (A.1), writing \( {\mathbb {P}}(Z_sY_s=1)= \mu (i) :=p^{\left( {\begin{array}{c}k+1\\ 2\end{array}}\right) } ((1-p^{k+1})^{i-1} - (1-p^k)^{i-1})\) we see:
Now consider the covariance terms in (A.1), the expansion of the variance. Note that for any \(s, t \in C_{k+1}\) if \(s \cap t = \varnothing \), then the variables \(Z_s Y_s\) and \(Z_t Y_t\) can be written as functions of two disjoint sets of independent edge indicators and hence have zero covariance.
Fix \(s, t \in C_{k+1}\) and assume \(|s \cap t| = m\) where \(1 \le m \le k\). Note that because \(m \ne k + 1\), we have \(s \ne t\). There are \(2\left( {\begin{array}{c}k+1\\ 2\end{array}}\right) - \left( {\begin{array}{c}m\\ 2\end{array}}\right) \) distinct edges in s and t combined and hence \({\mathbb {P}}(Z_s Z_t = 1) = p^{2\left( {\begin{array}{c}k+1\\ 2\end{array}}\right) - \left( {\begin{array}{c}m\\ 2\end{array}}\right) }\). Also, \(Y_sY_t = Y_s^+Y_t^+ + Y_s^-Y_t^- - Y_s^+Y_t^- - Y_s^-Y_t^+\). For the rest of the proof when calculating probabilities we assume w.l.o.g. that \(\min (s) \le \min (t)\). Then we have for \(Y_s^+ Y_t^+\):
Fix \(i \in [\min (s)-1]\). Then with \(\lnot \) denoting the complement
Moreover, \(\prod _{i=1}^{\min (s) - 1}(1-\prod _{j \in s} Y_{i,j})(1-\prod _{j \in t} Y_{i,j})\) and \(\prod _{\begin{array}{c} i=\min (s)\\ i \notin s \end{array}}^{\min (t) - 1}(1 - \prod _{j \in t}Y_{i,j}) \) are independent of \(Z_s Z_t\).
Recall the notation \([a,b] = {\left\{ {a, a+1, \ldots , b}\right\} }\) for two positive integers \(a \le b\). Setting \(q_{s, t} :=|s \cap [\min (s), \min (t) - 1]|\),
This strategy of splitting the product \(Y_s^+Y_t^+\) into three products of independent variables, only one of which is dependent on \(Z_sZ_t\) works exactly in the same way for the variables \(Y_s^-Y_t^+\), \(Y_s^+Y_t^-\), \(Y_s^-Y_t^-\). We write \(i = \min (s)\), \(j = \min (t)\), and q instead of \(q_{s,t}\). Also, we set
Using the described strategy we get:
Now we are ready to calculate the covariance:
Next we consider the two covariance sums in (A.1) separately. First let us assume that \(\min (s) \ne \min (t)\). Given \(i,j \in [n-k]\), \(m \in [k]\), and \(q \in [\min (k+1, |j-i|)]\) define the set
as well as
and
Next we argue that
To see this, assume \(i < j\). Note that to pick a pair \((s,t) \in \Gamma ^+_{k+1}(i,j,m,q)\) with \( \min (s) = i\) and \( \min (t) = j\) we need to pick the \(2k-m\) vertices in \(s \cup t\). Firstly, we pick the vertices that are not included in \(s \cap [\min (s), \min (t) - 1] = s \cap [i, j - 1]\). Since \(\min (s) \in s \cap [\min (s), \min (t) - 1]\), this amounts to choosing \(2k - m - (q-1)\) vertices out of \(n-j\). Then we decide which of the vertices that we have just picked will lie in t. This means we further need to choose k out of \(2k+1-m-q\) vertices. Then we choose \(m-1\) out of k vertices of t to lie in \(s \cap t\) (under the assumption that we already have \(\min (t) \in s\)). Finally, we choose the set \(s \cap [\min (s), \min (t) - 1]\), which amounts to picking \(q-1\) vertices out of \(j - i +1\) possible choices. If any of the binomial coefficients are negative, we set them to 0. The case \(j < i\) is analogous.
An analogous argument shows that
Now using the covariance expression we have just derived, we get
Similarly, we calculate the remaining term in the expansion of the variance (A.1). We notice that if \(i=j\), then \(q=0\) and we have \(\Gamma _{k+1}(i,i,m,0) = \Gamma ^+_{k+1}(i,i,m,0)\). Hence, \(|\Gamma _{k+1}(i,i,m,0)| = \left( {\begin{array}{c}n-i\\ 2k+1-m\end{array}}\right) \left( {\begin{array}{c}2k+1-m\\ k\end{array}}\right) \left( {\begin{array}{c}k\\ m-1\end{array}}\right) \), and
\(\square \)
Proof of Lemma 4.3
Fix \(1 \le k \le n-1\) and \(p \in (0,1)\), and consider the variance. From Lemma A.1 we have \({{\,\textrm{Var}\,}}\{ T_{k+1} \} = 2p^{2\left( {\begin{array}{c}k+1\\ 2\end{array}}\right) }V_1 + 2p^{2\left( {\begin{array}{c}k+1\\ 2\end{array}}\right) }V_2 + p^{2\left( {\begin{array}{c}k+1\\ 2\end{array}}\right) }V_3 + p^{\left( {\begin{array}{c}k+1\\ 2\end{array}}\right) }V_4\). First we lower bound \(V_1\) and \(V_2\) by just the negative part of the sum:
Now using that \(\left( {\begin{array}{c}k\\ m\end{array}}\right) + \left( {\begin{array}{c}k\\ m-1\end{array}}\right) = \left( {\begin{array}{c}k+1\\ m\end{array}}\right) \) and \((1-p^{k+1}) \ge (1-p^{k-m})\) for \(m \ge 0\) it is easy to see that \(V_1 + V_2 \ge -4R_1 - 4R_2\), where
For \(V_3\) we lower bound by terms with \(m = 1\) and the negative parts of the other terms:
here we call the positive part of the lower bound \(R_4\) and the negative part \(R_3\). For \(V_4\) we use the trivial lower bound \(V_4 \ge 0\). Hence, we have:
Let us now upper bound \(R_1\):
Noting that \((1-p^{k+1})^{j-i}(1-p^{k+1}-p^k + p^{2k+1-m})^{i-1} \le (1-p^{k+1})^{j-1}\), we can bound \(R_2\) in an identical way:
Noting that \((1 -p^k -p^{k+1} + p^{2k+2-m})^{i-1} \le (1-p^{k+1})^{i-1}\) and \((1-p^{k+1})^{2i-2} \le (1-p^{k+1})^{i-1}\) we proceed to bound \(R_3\):
To lower bound \(R_4\) we just take the \(i = 2\) term:
Since \(R_1\), \(R_2\), \(R_3\) are all at most of the order \(n^{2k-1}\) and \(R_2\) is at least of the order \(n^{2k}\), we have that for any fixed \(k \ge 1\) and \(p \in (0,1)\) there exists a constant \(C_{p,k} > 0\) independent of n and a natural number \(N_{p,k}\) such that for any \(n \ge N_{p,k}\):
\(\square \)
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Temčinas, T., Nanda, V. & Reinert, G. Multivariate central limit theorems for random clique complexes. J Appl. and Comput. Topology 8, 1837–1880 (2024). https://doi.org/10.1007/s41468-023-00146-5
Received:
Revised:
Accepted:
Published:
Issue Date:
DOI: https://doi.org/10.1007/s41468-023-00146-5
Keywords
- Stein’s method
- Multivariate normal approximation
- Discrete Morse theory
- Random graphs
- Random Simplicial complexes