1 Introduction

1.1 Isometry-invariant distances between metric spaces

The Gromov–Hausdorff (GH) distance, proposed by Gromov in Gromov (1981) (see also Tuzhilin (2016)), measures how far two compact metric spaces are from being isometric to each other. Since its conception four decades ago, the GH distance has mainly been studied from a theoretical standpoint, as its computation poses an intractable combinatorial problem (Chazal et al. 2009; Mémoli 2007; Schmiedl 2017). Computing the GH distance can be thought of as an extension of the NP-hard Quadratic Bottleneck Assignment Problem (QBAP) that allows non-injective assignments in its search space. In particular, it was shown that approximating the distance up to a factor of \(\sqrt{N}\) in the general case (or up to a factor of 3 for ultrametric spaces or metric trees) cannot be done in polynomial time unless \(\text {P}=\text {NP}\) (Schmiedl 2017; Pankaj K et al. 2018). However, a polynomial-time algorithm for \(\sqrt{N}\)-approximation was given for unweighted metric trees in Pankaj K et al. (2018), and a superlinear \(\frac{5}{4}\)-approximation algorithm for subsets of \(\mathbb {R}^1\) was derived in Majhi et al. (2016). The GH distance was first considered for shape comparison in Mémoli and Sapiro (2004), and several of its computationally motivated relaxations have been presented since then.

Different variations of the Gromov–Wasserstein distance, a relaxation of the GH distance for metric measure spaces motivated by the optimal transport problem (Villani 2003), were proposed in Sturm (2006) and in Mémoli (2007), and further studied in e.g. Mémoli (2009), Mémoli (2011), Peyré et al. (2016). The Gromov–Wasserstein distances are based on a generalization of bijective assignments, which are less expressive than the assignments allowed in the GH distance. Similarly to the GH distance, computing the Gromov–Wasserstein distance also requires solving a non-convex optimization problem. Recently, semidefinite relaxations of both the GH and Gromov–Wasserstein distances were studied in Villar et al. (2016). While allowing polynomial-time approximation, these relaxations admit distance 0 between non-isometric objects, losing the desired property of being a metric. Another result of potential interest from Villar et al. (2016) is a feasible algorithm for an upper bound of the GH (and therefore the mGH) distance. In Lipman and Daubechies (2011), the authors define the conformal Wasserstein distance, inspired by the Gromov–Wasserstein distance. It is a metric on the isometry classes of Riemannian 2-manifolds that can be accurately approximated in polynomial time under some reasonable conditions.

In Mémoli (2012), Mémoli introduces the modified Gromov–Hausdorff (mGH) distance, another relaxation of the GH distance that preserves the property of being a metric on the isometry classes of compact metric spaces. Like the GH distance, the mGH distance is based on assignments between the metric spaces that are allowed to be non-injective. It turns out that the two distances are not equivalent in the general case (Oles and Vixie 2022), although they are topologically equivalent within precompact (in the GH-distance sense) families of compact metric spaces (Mémoli 2012).

Directly computing the mGH distance still poses an intractable combinatorial optimization problem of \(O(N^N)\) time complexity (as compared to \(O(N^{2N})\) for the standard GH distance). Our focus on the mGH distance in this paper is partially motivated by the so-called “structural theorem” (Theorem 5.1 in Mémoli (2012)), which allows for the decomposition of the computation into solving a sequence of polynomial-time problems.

1.2 Shape-based graph matching

Because graphs are ubiquitous in applications, the task of graph matching, i.e. measuring how different a pair of graphs are from each other, is extensively studied. Common approaches to exact graph matching are those based on graph shape, such as subgraph isomorphism and maximum common subgraph (Conte et al. 2004; Foggia et al. 2014). The shape-based approaches appear in many fields including neuroscience (Van Wijk et al. 2010), telecommunications (Shoubridge et al. 2002), and chemoinformatics (Raymond and Willett 2002; Vandewiele et al. 2012). While efficient heuristics for these approaches exist for special cases (e.g. planar graphs), applying them in the general case requires solving an NP-complete problem (Bunke 1997; Conte et al. 2004).

Recently, the Gromov–Hausdorff framework for graph matching was explored both theoretically (Aflalo et al. 2015) and in applications, e.g. in the fields of neuroscience (Lee et al. 2011, 2006; Hendrikson 2016), social sciences, and finance (Hendrikson 2016). Undirected graphs admit metric space representation using the geodesic (shortest path length) distances on their vertex sets. However, the high computational cost of computing the isometry-invariant distances impedes a more widespread application of this approach.

1.3 Our contribution

Our main contribution is a theoretical framework for producing polynomial-time lower bounds of the GH and mGH distances. Furthermore, we present an algorithm for estimating the mGH distance, built upon this framework. We implement the algorithm for metric representations of unweighted graphs, leveraging their properties to reduce the polynomial order in the algorithm’s time complexity. We demonstrate the performance of the algorithm on several datasets of real-world and synthesized graphs of up to several thousand vertices.

The rest of the paper is structured as follows. Section 2 briefly reviews Mémoli (2013) to formally define the GH and mGH distances, show their relation to each other, and state some of their properties. In Sects. 3 and 4 we discuss the ideas for establishing lower and upper bounds, respectively, of the mGH distance between finite metric spaces. In Sect. 5, we describe the algorithm for estimating the mGH distance, show that it has polynomial time complexity, and then discuss and present its implementation for the case of unweighted graphs. Computational examples from real-world and synthetic datasets are given in Sect. 6, and Sect. 7 summarizes our work. The Appendix contains pseudocode for the procedures and algorithms omitted from the main paper for brevity.

2 Background

When talking about a metric space given by a set X and a distance function \(d_X: X \times X \rightarrow \mathbb {R}\), we use the notation \((X, d_X)\) and its shorter version X interchangeably. We expect the distinction between a set X and a metric space X to be clear from the context.

2.1 Definition of the Gromov–Hausdorff distance

Given \((X, d_X), (Y, d_Y) \in \mathcal {M}\), where \(\mathcal {M}\) denotes the set of all compact metric spaces, the GH distance measures how far the two metric spaces are from being isometric. It considers any “sufficiently rich” third metric space \((Z, d_Z)\) that contains isometric copies of X and Y, measuring the Hausdorff distance (in Z) between these copies, and minimizes over the choice of the isometric copies and Z. Formally, the GH distance is defined as

$$\begin{aligned} d_{\mathcal{G}\mathcal{H}}(X, Y) \buildrel \textrm{def}\over =\inf _{Z, \phi _X, \phi _Y}d_{\mathcal {H}}^Z\big (\phi _X(X), \phi _Y(Y)\big ), \end{aligned}$$

where \(\phi _X : X \rightarrow Z\) and \(\phi _Y : Y \rightarrow Z\) are isometric embeddings of X and Y into Z, and \(d_{\mathcal {H}}^Z\) is the Hausdorff distance in Z:

$$\begin{aligned} d_{\mathcal {H}}^Z(S, T) \buildrel \textrm{def}\over =\max \Big \{\sup _{s \in S} \inf _{t \in T} d_Z(s, t),\sup _{t \in T} \inf _{s \in S} d_Z(s, t) \Big \}\quad \forall S, T \subseteq Z \end{aligned}$$

(see Fig. 1). Gromov has shown in Gromov (2007) that \(d_{\mathcal{G}\mathcal{H}}\) is a metric on the isometry classes of \(\mathcal {M}\), constituting what is called a Gromov–Hausdorff space.

Fig. 1 Illustration of the idea underlying the Gromov–Hausdorff distance

Although the above definition gives a conceptual understanding of the GH distance, it is not very helpful from a computational standpoint. The next subsection introduces a more practical characterization of the GH distance.

2.2 Characterization of the GH distance

For two sets X and Y, we say that relation \(R \subseteq X \times Y\) is a correspondence if for every \(x \in X\) there exists some \(y \in Y\) s.t. \((x, y) \in R\) and for every \(y \in Y\) there exists some \(x \in X\) s.t. \((x, y) \in R\). We denote the set of all correspondences between X and Y by \(\mathcal {R}(X, Y)\).

If R is a relation between metric spaces \((X, d_X)\) and \((Y, d_Y)\), its distortion is defined as the number

$$\begin{aligned} {{\,\textrm{dis}\,}}R \buildrel \textrm{def}\over =\sup _{(x, y), (x', y') \in R} \big |d_X(x, x') - d_Y(y, y')\big |. \end{aligned}$$

Notice that any mapping \(\varphi : X \rightarrow Y\) induces the relation \(R_\varphi \buildrel \textrm{def}\over =\big \{\big (x, \varphi (x)\big ): x \in X \big \}\), and we denote

$$\begin{aligned} {{\,\textrm{dis}\,}}\varphi \buildrel \textrm{def}\over ={{\,\textrm{dis}\,}}R_\varphi = \sup _{x, x' \in X} \big |d_X(x, x') - d_Y\big (\varphi (x), \varphi (x')\big )\big |. \end{aligned}$$

Similarly, any \(\psi : Y \rightarrow X\) induces the relation \(R_\psi \buildrel \textrm{def}\over =\big \{\big (\psi (y), y\big ): y \in Y \big \}\). If both \(\varphi : X \rightarrow Y\) and \(\psi : Y \rightarrow X\) are given, we can define the relation \(R_{\varphi , \psi } \buildrel \textrm{def}\over =R_\varphi \cup R_\psi \), and realize that it is actually a correspondence, \(R_{\varphi , \psi } \in \mathcal {R}(X, Y)\).

A useful result in Kalton and Ostrovskii (1999) identifies computing the GH distance with solving an optimization problem, either over the correspondences between X and Y or over the functions \(\varphi : X \rightarrow Y\) and \(\psi : Y \rightarrow X\):

$$\begin{aligned} d_{\mathcal{G}\mathcal{H}}(X, Y) = \frac{1}{2}\inf _{R \in \mathcal {R}(X, Y)} {{\,\textrm{dis}\,}}R = \frac{1}{2} \inf _{\varphi , \psi }{{\,\textrm{dis}\,}}R_{\varphi , \psi }. \end{aligned}$$

By definition, the distortion of any relation \(R \subseteq X \times Y\) is bounded by \({{\,\textrm{dis}\,}}R \le d_{\max }\), where \(d_{\max } \buildrel \textrm{def}\over =\max \{{{\,\textrm{diam}\,}}X, {{\,\textrm{diam}\,}}Y\}\). Combined with the characterization of the GH distance, this yields \(d_{\mathcal{G}\mathcal{H}}(X, Y) \le \frac{1}{2} d_{\max }\).

Let \(*\) denote the (compact) metric space that is comprised of exactly one point. For any correspondence \(R \in \mathcal {R}(X, *)\), \({{\,\textrm{dis}\,}}R = \sup _{x, x' \in X} \big |d_X(x, x') - 0\big | = {{\,\textrm{diam}\,}}X\). The above characterization implies that \(d_{\mathcal{G}\mathcal{H}}(X, *) = \frac{1}{2}{{\,\textrm{diam}\,}}X\), and, by an analogous argument, \(d_{\mathcal{G}\mathcal{H}}(Y, *) = \frac{1}{2}{{\,\textrm{diam}\,}}Y\). From the triangle inequality for the GH distance, \(d_{\mathcal{G}\mathcal{H}}(X, Y) \ge \big |d_{\mathcal{G}\mathcal{H}}(X, *) - d_{\mathcal{G}\mathcal{H}}(Y, *)\big | = \frac{1}{2}|{{\,\textrm{diam}\,}}X - {{\,\textrm{diam}\,}}Y|\).

2.3 Modifying the GH distance

Recall that for some \(\varphi : X \rightarrow Y\) and \(\psi : Y \rightarrow X\), correspondence \(R_{\varphi , \psi }\) is defined as \(R_{\varphi , \psi } = R_\varphi \cup R_\psi \). For any two elements in \(R_{\varphi , \psi }\), either both belong to \(R_\varphi \), or both belong to \(R_\psi \), or one of them belongs to \(R_\varphi \) while the other belongs to \(R_\psi \). It follows that

$$\begin{aligned} {{\,\textrm{dis}\,}}R_{\varphi , \psi } = \max \big \{{{\,\textrm{dis}\,}}R_\varphi , {{\,\textrm{dis}\,}}R_\psi , C_{\varphi , \psi }\big \}, \end{aligned}$$

where \(C_{\varphi , \psi } \buildrel \textrm{def}\over =\displaystyle \sup _{x \in X, y \in Y} \big |d_X\big (x, \psi (y)\big ) - d_Y\big (\varphi (x), y\big )\big |\).

Notice that the number \(C_{\varphi , \psi }\) acts as a coupling term between the choices of \(\varphi \) and \(\psi \) in the optimization problem

$$\begin{aligned} d_{\mathcal{G}\mathcal{H}}(X, Y) = \frac{1}{2}\inf _{\varphi , \psi }\max \big \{{{\,\textrm{dis}\,}}R_\varphi , {{\,\textrm{dis}\,}}R_\psi , C_{\varphi , \psi }\big \}, \end{aligned}$$

making its search space of size \(|X|^{|Y|}|Y|^{|X|}\). Discarding the coupling term \(C_{\varphi , \psi }\) yields the notion of the modified Gromov–Hausdorff distance

$$\begin{aligned} {\widehat{d}}_{\mathcal{G}\mathcal{H}}(X, Y) \buildrel \textrm{def}\over =\frac{1}{2}\inf _{\varphi , \psi } \max \big \{{{\,\textrm{dis}\,}}R_\varphi , {{\,\textrm{dis}\,}}R_\psi \big \} \le d_{\mathcal{G}\mathcal{H}}(X, Y). \end{aligned}$$

Computing \({\widehat{d}}_{\mathcal{G}\mathcal{H}}(X, Y)\) requires solving two decoupled optimization problems whose search spaces are of the size \(|X|^{|Y|}\) and \(|Y|^{|X|}\), respectively. An equivalent definition emphasizing this fact is given by

$$\begin{aligned} {\widehat{d}}_{\mathcal{G}\mathcal{H}}(X, Y) \buildrel \textrm{def}\over =\frac{1}{2}\max \big \{\inf _{\varphi } {{\,\textrm{dis}\,}}R_\varphi , \inf _{\psi } {{\,\textrm{dis}\,}}R_\psi \big \}. \end{aligned}$$

Similarly to \(d_{\mathcal{G}\mathcal{H}}\), \({\widehat{d}}_{\mathcal{G}\mathcal{H}}\) is a metric on the isometry classes of \(\mathcal {M}\). Moreover, \({\widehat{d}}_{\mathcal{G}\mathcal{H}}\) is topologically equivalent to \(d_{\mathcal{G}\mathcal{H}}\) within GH-precompact families of metric spaces (Mémoli 2012).
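For very small finite spaces, the decoupled formulation above can be evaluated directly by exhaustive search over all \(|Y|^{|X|}\) and \(|X|^{|Y|}\) mappings. The following Python sketch is illustrative only (its cost is exponential) and is not part of the algorithm developed below.

```python
import itertools

import numpy as np


def mapping_distortion(DX, DY, phi):
    """Distortion of phi: X -> Y, where phi is a tuple of indices into Y."""
    image = DY[np.ix_(phi, phi)]            # pairwise distances between the images
    return float(np.abs(DX - image).max())


def smallest_distortion(DX, DY):
    """min over all (not necessarily injective) phi: X -> Y of dis(phi)."""
    n, m = DX.shape[0], DY.shape[0]
    return min(mapping_distortion(DX, DY, phi)
               for phi in itertools.product(range(m), repeat=n))


def mgh_bruteforce(DX, DY):
    """Modified Gromov-Hausdorff distance between two tiny finite metric spaces."""
    return 0.5 * max(smallest_distortion(DX, DY), smallest_distortion(DY, DX))


DX = np.array([[0, 1, 2], [1, 0, 1], [2, 1, 0]], dtype=float)  # 3-point path
DY = np.array([[0, 1], [1, 0]], dtype=float)                   # 2-point space
print(mgh_bruteforce(DX, DY))  # 0.5, matching 1/2 |diam X - diam Y| here
```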

2.4 Curvature sets and the structural theorem

Let \(X \in \mathcal {M}\), and consider \((x_1, \ldots , x_n) \in X^{\times n}\), an n-tuple of points in X for some \(n \in \mathbb {N}\). The \(n \times n\) matrix containing their pairwise distances is called the distance sample induced by \((x_1, \ldots , x_n)\), and denoted by \(D^{(x_1, \ldots , x_n)} \buildrel \textrm{def}\over =\big (d_X(x_i, x_j)\big )_{i,j=1}^n\). A distance sample generalizes the notion of distance matrix of \(\{x_1, \ldots , x_n\}\) when \(x_1, \ldots , x_n\) are not necessarily distinct. Unlike a distance matrix, a distance sample may contain zeros off the main diagonal.

The n-th curvature set of X is then defined as the set of all \(n \times n\) distance samples of X, denoted

$$\begin{aligned} \mathcal {K}_n(X) \buildrel \textrm{def}\over =\left\{ D^{(x_1, \ldots , x_n)}: (x_1, \ldots , x_n) \in X^{\times n}\right\} . \end{aligned}$$

For example, \(\mathcal {K}_2(X)\) contains the same information as the entries of \(D^X\), the distance matrix of X; when X is a smooth planar curve, then its curvature at every point can be calculated from the information contained in \(\mathcal {K}_3(X)\) (Calabi et al. 1998), thus providing motivation for the name curvature sets.

Curvature sets contain information about the shape of a compact metric space in a permutation-invariant way. In particular, any \(X \in \mathcal {M}\) and \(Y \in \mathcal {M}\) are isometric if and only if \(\mathcal {K}_n(X) = \mathcal {K}_n(Y)\) for every \(n \in \mathbb {N}\) (3.27 in Gromov (2007)). To discriminate the shapes of X and Y, it is therefore reasonable to measure the difference between \(\mathcal {K}_n(X)\) and \(\mathcal {K}_n(Y)\) for various \(n \in \mathbb {N}\). Since both n-th curvature sets are subsets of the same space \(\mathbb {R}^{n \times n}\), the Hausdorff distance is a natural metric between them. We equip the set of \(n \times n\) matrices with the distance \(d_{l^\infty }(A, B) \buildrel \textrm{def}\over =\max _{i,j} \left| A_{i,j}-B_{i,j}\right| \), and define

$$\begin{aligned} d_{\mathcal {K}_n}(X, Y) \buildrel \textrm{def}\over =\frac{1}{2}d_{\mathcal {H}}^{\mathbb {R}^{n \times n}}\big (\mathcal {K}_n(X), \mathcal {K}_n(Y)\big ), \end{aligned}$$

where \(d_{\mathcal {H}}^{\mathbb {R}^{n \times n}}\) is the \(d_{l^\infty }\)-induced Hausdorff distance on \(\mathbb {R}^{n \times n}\).
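For intuition, the following brute-force sketch computes \(d_{\mathcal {K}_n}(X, Y)\) for small finite spaces directly from this definition; it is illustrative only, since it enumerates all \(|X|^n\) and \(|Y|^n\) tuples.

```python
import itertools

import numpy as np


def curvature_set(D, n):
    """All n x n distance samples of the finite metric space with distance matrix D."""
    points = range(D.shape[0])
    return [D[np.ix_(t, t)] for t in itertools.product(points, repeat=n)]


def d_Kn(DX, DY, n):
    """Half the l_inf-Hausdorff distance between the n-th curvature sets."""
    KX, KY = curvature_set(DX, n), curvature_set(DY, n)
    dist = lambda A, B: float(np.abs(A - B).max())        # d_{l^inf} on n x n matrices
    sup_X = max(min(dist(A, B) for B in KY) for A in KX)  # how far K_n(X) is from K_n(Y)
    sup_Y = max(min(dist(A, B) for A in KX) for B in KY)  # and vice versa
    return 0.5 * max(sup_X, sup_Y)


DX = np.array([[0, 1, 2], [1, 0, 1], [2, 1, 0]], dtype=float)
DY = np.array([[0, 1], [1, 0]], dtype=float)
print(d_Kn(DX, DY, 2), d_Kn(DX, DY, 3))  # each value is a lower bound for the mGH distance
```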

Remark 1

The choice of the distance \(d_{l^\infty }: \mathbb {R}^{n \times n} \times \mathbb {R}^{n \times n} \rightarrow \mathbb {R}\) complies with the notion of distortion of a mapping. If \(\varphi \) is a mapping from X to Y, where \(X = \{x_1, \ldots , x_{|X|}\}\) and Y are metric spaces, then

$$\begin{aligned} {{\,\textrm{dis}\,}}\varphi = d_{l^\infty }\left( D^X, D^{\left( \varphi (x_1), \ldots , \varphi (x_{|X|})\right) }\right) . \end{aligned}$$

The fact that \(\varphi \) can be non-injective provides intuition for the possibility of identical points in a tuple from the definition of a distance sample.

An important result, extensively relied upon in this paper, is the so-called “structural theorem” for the mGH (Mémoli 2012, 2013):

$$\begin{aligned} {\widehat{d}}_{\mathcal{G}\mathcal{H}}(X, Y) = \sup _{n \in \mathbb {N}}d_{\mathcal {K}_n}(X, Y). \end{aligned}$$

Notice that the bounds of the GH distance from the inequalities \(\frac{1}{2} |{{\,\textrm{diam}\,}}X - {{\,\textrm{diam}\,}}Y| \le d_{\mathcal{G}\mathcal{H}}(X, Y) \le \frac{1}{2}d_{\max }\) also hold for the mGH distance:

$$\begin{aligned} {\widehat{d}}_{\mathcal{G}\mathcal{H}}(X, Y)&\ge d_{\mathcal {K}_2}(X, Y) \\ &= \frac{1}{2}d_{\mathcal {H}}^\mathbb {R}\Big (\big \{d_X(x, x'): x, x' \in X\big \},\big \{d_Y(y, y'): y, y' \in Y\big \}\Big ) \\ &\ge \frac{1}{2}\max \bigg \{\inf _{x, x' \in X}\big |{{\,\textrm{diam}\,}}Y - d_X(x, x')\big |, \inf _{y, y' \in Y}\big |{{\,\textrm{diam}\,}}X - d_Y(y, y')\big |\bigg \} \\ &= \frac{1}{2}\left| {{\,\textrm{diam}\,}}X - {{\,\textrm{diam}\,}}Y\right| , \end{aligned}$$

while \({\widehat{d}}_{\mathcal{G}\mathcal{H}}(X, Y) \le d_{\mathcal{G}\mathcal{H}}(X, Y) \le \frac{1}{2}d_{\max }\) trivially follows from the definition of the mGH distance.

3 Lower bound for \({\widehat{d}}_{\mathcal{G}\mathcal{H}}(X, Y)\)

This section provides theoretical results and algorithms for an informative lower bound for the mGH distance between a pair of metric spaces X and Y. This and the following sections assume that the metric spaces are non-empty and finite, i.e. \(1 \le |X|, |Y| < \infty \). When talking about algorithmic time complexities, we denote the input size with \(N \buildrel \textrm{def}\over =\max \{|X|, |Y|\}\).

Remark 2

We notice that efficient estimation of either the GH or the mGH distance between finite metric spaces allows for its estimation between infinite compact metric spaces with desired accuracy. This is because for any \(X \in \mathcal {M}\) and \(\epsilon > 0\), there exists a finite \(X_\epsilon \subset X\) that is an \(\epsilon \)-net of X, which gives

$$\begin{aligned} {\widehat{d}}_{\mathcal{G}\mathcal{H}}(X, X_\epsilon ) \le d_{\mathcal{G}\mathcal{H}}(X, X_\epsilon ) \le d_{\mathcal {H}}(X, X_\epsilon ) \le \epsilon . \end{aligned}$$

Similarly, for any \(Y \in \mathcal {M}\) there exists its finite \(\epsilon \)-net \(Y_\epsilon \), and therefore

$$\begin{aligned} |d_{\mathcal{G}\mathcal{H}}(X, Y) - d_{\mathcal{G}\mathcal{H}}(X_\epsilon , Y_\epsilon )| \le d_{\mathcal{G}\mathcal{H}}(X, X_\epsilon ) + d_{\mathcal{G}\mathcal{H}}(Y, Y_\epsilon ) \le 2\epsilon . \end{aligned}$$

Feasible algorithms for lower bounds are important in e.g. classification tasks, where knowledge that a distance exceeds some threshold can make computing the actual distance unnecessary. In particular, if the mGH distance between metric representations of two graphs is \(> 0\), it immediately follows that the graphs are not isomorphic.

3.1 d-bounded matrices

Let A be a square matrix. We say that A is d-bounded for some \(d \in \mathbb {R}\) if every off-diagonal entry of A is \(\ge d\). Similarly, A is positive-bounded if its off-diagonal entries are positive. Naturally, any d-bounded matrix for \(d > 0\) is also positive-bounded.

Notice that a distance sample \(D^{(x_1, \ldots , x_n)}\) of X is d-bounded if and only if \(d_X(x_i, x_j) \ge d \quad \forall i \ne j\), and positive-bounded if and only if \(x_1, \ldots , x_n\) are distinct. By non-negativity of a metric, any distance sample is 0-bounded.

Claim 1

Let A and B be square matrices of the same size. If A is d-bounded for some \(d > 0\), and B is 0-bounded but not positive-bounded, then \(d_{l^\infty }(A, B) \ge d\).

Recall that a matrix B is a permutation similarity of a (same-sized) matrix A if \(B = PAP^{-1}\) for some permutation matrix P. Equivalently, B is obtained from A by permuting both its rows and its columns according to some permutation \(\pi \): \(B_{i,j} = A_{\pi (i),\pi (j)}\). Given \(n \in \mathbb {N}\), we will denote the set of permutation similarities of \(n \times n\) principal submatrices of A by \(\textrm{PSPS}_n(A)\).

Claim 2

A distance sample \(K \in \mathcal {K}_n(X)\) is positive-bounded if and only if it is a permutation similarity of a principal submatrix of \(D^X\), i.e. if and only if \(K \in \textrm{PSPS}_n(D^X)\). In particular, there are no positive-bounded distance samples in \(\mathcal {K}_n(X)\) if \(n > |X|\).

Theorem 1

Let \(K \in \mathcal {K}_n(X)\) be d-bounded for some \(d > 0\). If \(n > |Y|\), then \({\widehat{d}}_{\mathcal{G}\mathcal{H}}(X, Y) \ge \frac{d}{2}\).

Proof

Notice that \(L \in \mathcal {K}_n(Y)\) implies that L is 0-bounded (by non-negativity of a metric) and not positive-bounded (from Claim 2). Then

$$\begin{aligned} {\widehat{d}}_{\mathcal{G}\mathcal{H}}(X, Y)&\ge \frac{1}{2}d_{\mathcal {H}}^{\mathbb {R}^{n \times n}}\big (\mathcal {K}_n(X), \mathcal {K}_n(Y)\big ) \\ &\ge \frac{1}{2}\min _{L \in \mathcal {K}_n(Y)} d_{l^\infty }(K, L) \\ &\ge \frac{d}{2}.\hspace{4cm}\text {from Claim }1 \end{aligned}$$

\(\square \)

Remark 3

For the standard GH distance, its lower bound implied by Theorem 1 is a direct consequence of its relationship with the n-packing number established in Cho (1997).

3.2 Obtaining d-bounded distance samples of large size

In order to apply Theorem 1 for some \(d > 0\), one needs to verify the existence of a d-bounded distance sample from X that exceeds Y in size. Ideally, one wants to know M(X, d), the largest size of a d-bounded distance sample of X:

$$\begin{aligned} M(X, d) \buildrel \textrm{def}\over =\max \big \{n \in \mathbb {N}: \exists d\text {-bounded }K \in \mathcal {K}_n(X) \big \}. \end{aligned}$$

Equivalently, M(X, d) is the so-called d-packing number of X: the largest number of points one can sample from X such that they all are at least d away from each other. Finding M(X, d) is equivalent to finding the size of a maximum independent set of the graph \(G = \big (X, \{(x_i, x_j): d_X(x_i, x_j) < d\}\big )\). Unfortunately, this problem is known to be NP-hard (Kégl 2002), and we therefore require approximation techniques to search for a sufficiently large d-bounded distance sample of X.

We implement greedy algorithm FindLargeK (see Appendix A for the pseudocode) that, given the distance matrix of X and some \(d > 0\), finds in \(O(N^3)\) time a d-bounded distance sample \(K \in \mathcal {K}_{{\widetilde{M}}(X, d)}(X)\), where \({\widetilde{M}}(X, d)\) is an approximation of M(Xd). Informally, the algorithm iteratively removes rows (and same-index columns) from \(D^X\) until all off-diagonal entries of the resulting K are \(\ge d\). At each step, the algorithm chooses to remove a row that is least d-bounded, i.e. one with the highest (non-zero) count of off-diagonal entries \(< d\).

Removing a row from K also “collaterally” removes an entry \(< d\) from some of the remaining rows (due to removal of the corresponding column), potentially turning them into d-bounded. The counts of off-diagonal entries \(<d\) in these rows define the number of newly appeared d-bounded rows in the remaining matrix at each step, thus controlling the duration of FindLargeK. In particular, the maximum duration (and therefore the worst approximation \({\widetilde{M}}(X, d)\)) is attained when “collaterally” removed entries \(< d\) reside in the least d-bounded rows at each step, which minimizes the number of d-bounded rows obtained from the removal. Let w be the maximum count of off-diagonal entries \(<d\) in the rows of \(D^X\). Assuming the worst case scenario, each iteration of \(\textsc {FindLarge}K(D^X, d)\) reduces the number of rows with w off-diagonal entries \(<d\) by \(w+1\) (one row is removed and the counts for w others become \(w-1\)), or to zero if fewer than \(w+1\) of them remains. For the sake of simplicity, we will assume that \(w+1\) divides |X|. It follows that after \(\frac{|X|}{w+1}\) iterations the maximum count of off-diagonal entries \(<d\) in the \(\frac{w}{w+1}|X|\) rows of K is at most \(w-1\). More generally, if for some \(k \le w\) the current K is comprised by \(\big (1-\frac{k}{w+1}\big )|X|\) rows with at most \(w-k\) off-diagonal entries \(<d\) each, then after another

$$\begin{aligned} \frac{\big (1-\frac{k}{w+1}\big )|X|}{w-k+1} = \frac{|X|}{w+1} \end{aligned}$$

iterations the maximum count will be at most \(w-k-1\). By induction, K will contain no off-diagonal entries \(<d\) after at most \(w\frac{|X|}{w+1}\) iterations, which implies that its resulting size \({\widetilde{M}}(X, d)\) must be at least \(\frac{1}{w+1}|X|\). Because \(M(X, d) \le |X|\), the approximation ratio of \({\widetilde{M}}(X, d)\) is at least \(\frac{1}{w+1}\).

Notice that the obtained K does not need to be unique, because at any step there can be multiple least d-bounded rows. Choosing which one of them to remove allows selecting for some desired characteristics in the retained rows of K. In particular, Sect. 3.4 provides motivation to select for bigger entries in the resulting d-bounded distance sample K. To accommodate this, we choose to remove at each step a least d-bounded row with the smallest sum of off-diagonal entries \(\ge d\), so-called smallest least d-bounded row. Since uniqueness of a smallest least d-bounded row is not guaranteed either, we implemented procedure FindLeastBoundedRow (see Appendix B for the pseudocode) to find the index of the first such row in a matrix.
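The greedy step described above can be sketched as follows; this is a simplified illustration of FindLargeK and FindLeastBoundedRow operating on a NumPy distance matrix (the actual pseudocode is given in Appendices A and B).

```python
import numpy as np


def find_large_K(DX, d):
    """Greedy extraction of a d-bounded principal submatrix of DX (sketch of FindLargeK).

    Returns the indices of the retained points; the induced distance sample has
    all off-diagonal entries >= d.
    """
    keep = list(range(DX.shape[0]))
    while True:
        K = DX[np.ix_(keep, keep)]
        off_diag = ~np.eye(len(keep), dtype=bool)
        violating = (K < d) & off_diag             # off-diagonal entries smaller than d
        counts = violating.sum(axis=1)
        if counts.max() == 0:                      # K is d-bounded, done
            break
        least_bounded = counts == counts.max()     # least d-bounded rows
        sums = np.where(off_diag & ~violating, K, 0).sum(axis=1).astype(float)
        sums[~least_bounded] = np.inf              # tie-break only among least d-bounded rows
        keep.pop(int(sums.argmin()))               # drop the smallest least d-bounded row
    return keep
```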

3.3 Using permutation similarities of principal submatrices of \(D^Y\)

Theorem 1 establishes that \({\widehat{d}}_{\mathcal{G}\mathcal{H}}(X, Y) \ge \frac{d}{2}\) from the existence of some d-bounded distance sample of sufficiently large size. However, such distance samples may not exist in certain cases (e.g. when \(|X| = |Y|\)), rendering Theorem 1 inapplicable. The main result of this subsection, Theorem 2, complements Theorem 1 by allowing us to verify \({\widehat{d}}_{\mathcal{G}\mathcal{H}}(X, Y) \ge \frac{d}{2}\) in those cases.

Lemma 1

Let \(K \in \mathcal {K}_n(X)\) be d-bounded for some \(d > 0\), and let \(n \le |Y|\). If for some \(i \in \langle n \rangle \buildrel \textrm{def}\over =\{1, \ldots , n\}\)

$$\begin{aligned} \Vert \textrm{row}_{i}(K) - \textrm{row}_{i}(L)\Vert _\infty \ge d \quad \forall L \in \textrm{PSPS}_n(D^Y), \end{aligned}$$

then \(d_{l^\infty }\big (K, \mathcal {K}_n(Y)\big ) \ge d\).

Proof

Let \(L \in \mathcal {K}_n(Y)\). If L is not positive-bounded, \(d_{l^\infty }(K, L) \ge d\) follows from Claim 1 and the fact that any distance sample is 0-bounded. If L is positive-bounded, then \(L \in \textrm{PSPS}_n(D^Y)\) from Claim 2, and the premise entails

$$\begin{aligned} d_{l^\infty }(K, L) \ge \big \Vert \textrm{row}_{i}(K) - \textrm{row}_{i}(L)\big \Vert _\infty \ge d. \end{aligned}$$

Because \(d_{l^\infty }(K, L) \ge d\) for an arbitrary choice of \(L \in \mathcal {K}_n(Y)\), we have \(d_{l^\infty }\big (K, \mathcal {K}_n(Y)\big ) \ge d.\) \(\square \)

Remark 4

A naive approach to proving \(d_{l^\infty }(K, \mathcal {K}_n(Y)) \ge d\) is to show that \(d_{l^\infty }(K, L) \ge d\) for each \(L \in \mathcal {K}_n(Y)\), which comprises an instance of the NP-hard QBAP. Instead, the premise of Lemma 1 (for a particular i) can be checked by solving at most |Y| optimization problems of O(|Y|) time complexity each, as will be shown in the next subsection.

Theorem 2

Let \(K \in \mathcal {K}_n(X)\) be d-bounded for some \(d > 0\), and let \(n \le |Y|\). If for some \(i \in \langle n \rangle \)

$$\begin{aligned} \Vert \textrm{row}_{i}(K) - \textrm{row}_{i}(L)\Vert _\infty \ge d \quad \forall L \in \textrm{PSPS}_n(D^Y), \end{aligned}$$

then \({\widehat{d}}_{\mathcal{G}\mathcal{H}}(X, Y) \ge \frac{d}{2}\).

Proof

$$\begin{aligned} {\widehat{d}}_{\mathcal{G}\mathcal{H}}(X, Y)&\ge \frac{1}{2}d_{\mathcal {H}}^{\mathbb {R}^{n \times n}}\big (\mathcal {K}_n(X), \mathcal {K}_n(Y)\big ) \\ &\ge \frac{1}{2}d_{l^\infty }\big (K, \mathcal {K}_n(Y)\big ) \\ &\ge \frac{d}{2}. \hspace{4cm}\text {from Lemma }1 \end{aligned}$$

\(\square \)

3.4 Verifying \({\widehat{d}}_{\mathcal{G}\mathcal{H}}(X, Y) \ge \frac{d}{2}\)

Let \(d > 0\). To see if we can verify \({\widehat{d}}_{\mathcal{G}\mathcal{H}}(X, Y) \ge \frac{d}{2}\) from \(D^X\) and \(D^Y\), we start by calling FindLargeK to obtain a d-bounded distance sample \(K \in \mathcal {K}_n(X)\) for some \(n \ge \frac{1}{w+1}M(X, d)\). If \(n > |Y|\), then \({\widehat{d}}_{\mathcal{G}\mathcal{H}}(X, Y) \ge \frac{d}{2}\) follows immediately from Theorem 1. Otherwise, we want to obtain this lower bound from Theorem 2, which requires showing that some \(i \in \langle n \rangle \) satisfies

$$\begin{aligned} \Vert \textrm{row}_{i}(K) - \textrm{row}_{i}(L)\Vert _\infty \ge d \quad \forall L \in \textrm{PSPS}_n(D^Y). \end{aligned}$$

Let i be fixed. If \(L \in \textrm{PSPS}_n(D^Y)\), then all entries in \(\textrm{row}_{i}(L)\) come from one row of \(D^Y\), with \(L_{i,i} = 0\) being the diagonal element in that row. The choice of i thus induces a (disjoint) partition of \(\textrm{PSPS}_n(D^Y)\):

$$\begin{aligned} \textrm{PSPS}_n(D^Y) = \bigcup _{j=1}^{|Y|} \textrm{PSPS}_n^{i \leftarrow j}(D^Y), \end{aligned}$$

where \(\textrm{PSPS}_n^{i \leftarrow j}(D^Y)\) is the set of all permutation similarities of principal submatrices of \(D^Y\) whose i-th row is comprised of the entries in \(\textrm{row}_{j}(D^Y)\). Therefore, the condition

$$\begin{aligned} \Vert \textrm{row}_{i}(K) - \textrm{row}_{i}(L)\Vert _\infty \ge d \quad \forall L \in \textrm{PSPS}_n(D^Y) \end{aligned}$$

can be verified by showing that

$$\begin{aligned} \Vert \textrm{row}_{i}(K) - \textrm{row}_{i}(L)\Vert _\infty \ge d \quad \forall L \in \textrm{PSPS}_n^{i\leftarrow j}(D^Y), \forall j \in \langle |Y| \rangle . \end{aligned}$$

Let j, in addition to i, be fixed. Notice that any \(L \in \textrm{PSPS}_n^{i\leftarrow j}(D^Y)\) corresponds to an injective mapping \(f_L:\langle n \rangle \rightarrow \langle |Y| \rangle \) that defines the entries from \(\textrm{row}_{j}(D^Y)\) that populate \(\textrm{row}_{i}(L)\): \(L_{i, k} = D^Y_{j, f_L(k)}\). In particular, \(f_L(i) = j\), because \(L_{i,i} = D^Y_{j,j}\) for any \(L \in \textrm{PSPS}_n^{i\leftarrow j}(D^Y)\). Therefore, the condition

$$\begin{aligned} \Vert \textrm{row}_{i}(K) - \textrm{row}_{i}(L)\Vert _\infty \ge d \quad \forall L \in \textrm{PSPS}_n^{i\leftarrow j}(D^Y) \end{aligned}$$

is equivalent to the non-existence of an injective \(f_L:\langle n \rangle \rightarrow \langle |Y| \rangle \) such that \(|K_{i, k} - D^Y_{j, f_L(k)}| < d \quad \forall k \in \langle n \rangle \) and \(f_L(i) = j\). The decision about the existence of a feasible assignment \(f_L\) between the entries of \(\textrm{row}_{i}(K)\) and \(\textrm{row}_{j}(D^Y)\) is an instance of linear assignment feasibility problem. If such \(f_L\) exists, it can be constructed by iteratively pairing the smallest unassigned \(K_{i, k}\) to the smallest available \(D^Y_{j, h}\) s.t. \(|K_{i, k} - D^Y_{j, h}| < d\), that is, by setting \(f_L(k) = h\) (see Claim 3). It follows that each entry in \(\textrm{row}_{j}(D^Y)\) needs to be checked at most once, and hence solving the feasibility problem takes O(|Y|) time if the entries in both \(\textrm{row}_{i}(K)\) and \(\textrm{row}_{j}(D^Y)\) are sorted. Procedure SolveFeasibleAssignment (see Appendix C for the pseudocode) implements the solution for a pair of vectors, given their entries are arranged in ascending order. We notice that ignoring the actual order of the (off-diagonal) entries in \(\textrm{row}_{i}(K)\) and \(\textrm{row}_{j}(D^Y)\) reflects the fact that curvature sets are closed under permutations of the underlying tuples of points.

Claim 3

Let \(\textbf{v} \in \mathbb {R}^p, \textbf{u} \in \mathbb {R}^q\) be s.t. \(v_1 \le \ldots \le v_p, u_1 \le \ldots \le u_q\). If \(\textsc {SolveFeasibleAssignment}(\textbf{v}, \textbf{u}, d)=\textrm{FALSE}\), then no injective \(f: \langle p \rangle \rightarrow \langle q \rangle \) s.t. \(|v_t - u_{f(t)}| < d \quad \forall t \in \langle p \rangle \) exists.

Proof

Let t be the largest index for which \(v_t\) could be feasibly assigned to some available entry of \(\textbf{u}\) by \(\textsc {SolveFeasibleAssignment}(\textbf{v}, \textbf{u}, d)\). If such t does not exist, then \(|v_1 - u_l| \ge d \quad \forall l \in \langle q \rangle \) and trivially no feasible assignment between \(\textbf{v}\) and \(\textbf{u}\) can exist.

Denote the partially constructed assignment by \(f: \langle t \rangle \rightarrow \langle q \rangle \), and notice that it is monotone by construction. In particular, every \(u_l\) for \(l > f(t)\) must remain available for assignments. Moreover, \(v_{t+1}\) could not be feasibly assigned to any entry of \(\textbf{u}\) after \(u_{f(t)}\), which means that \(v_{t+1} \le u_l - d \quad \forall l > f(t)\).

Let \(r \le t\) be the smallest index such that \(f(r), f(r+1), \ldots , f(t)\) are a contiguous block, i.e. that \(f(r) = f(t) - (t - r + 1)\). Because \(u_{f(r)-1}\) remains available (or does not exist due to \(f(r)=1\)), \(v_r\) could not be feasibly assigned to any entry of \(\textbf{u}\) before \(u_{f(r)}\), which means that \(v_r \ge u_l + d \quad \forall l < f(r)\).

Because \(v_r \le \ldots \le v_t \le v_{t+1}\), none of these \(t-r+2\) entries can be feasibly assigned to the entries of \(\textbf{u}\) before \(u_{f(r)}\) or after \(u_{f(t)}\), meaning that they only have \(t-r+1\) feasible matches in \(\textbf{u}\). By the pigeonhole principle, no feasible assignment between the two vectors can exist. \(\square \)

Remark 5

Intuitively, either sufficiently small or sufficiently large entries in \(\textrm{row}_{i}(K)\) for some \(i \in \langle n \rangle \) can render a feasible assignment \(f_L\) impossible for every \(j \in \langle |Y| \rangle \), yielding \(\Vert \textrm{row}_{i}(K) - \textrm{row}_{i}(L)\Vert _\infty \ge d \quad \forall L \in \textrm{PSPS}_n(D^Y)\) and therefore \({\widehat{d}}_{\mathcal{G}\mathcal{H}}(X, Y) \ge \frac{d}{2}\) from Theorem 2. This provides the motivation behind considering the magnitude of the \(\ge d\) entries when choosing a row to remove at each step of FindLargeK. Recall that such row is chosen by the auxiliary procedure FindLeastBoundedRow, that selects for bigger entries in the resulting K. The approach allows for a tighter lower bound in the case when the entries in \(D^X\) are, generally speaking, bigger than those in \(D^Y\). The converse case is covered by similarly obtaining and handling a d-bounded distance sample of Y.

Calling SolveFeasibleAssignment\((\textrm{row}_{i}(K), \textrm{row}_{j}(D^Y), d)\) for every \(j \in \langle |Y| \rangle \) is sufficient to check whether a particular i satisfies \(\Vert \textrm{row}_{i}(K) - \textrm{row}_{i}(L)\Vert _\infty \ge d \quad \forall L \in \textrm{PSPS}_n(D^Y)\). Procedure CheckTheoremB (see Appendix D for the pseudocode) makes such check for each \(i \in \langle n \rangle \) to decide whether \({\widehat{d}}_{\mathcal{G}\mathcal{H}}(X, Y) \ge \frac{d}{2}\) follows from Theorem 2 for the d-bounded K. The procedure sorts the entries in the rows of K and \(D^Y\) prior to the checks, which takes \(O(N\log N)\) time for each of the O(N) rows. This allows solving each of the \(O(N^2)\) feasibility problems in O(N) time, making the time complexity of CheckTheoremB \(O(N^2\log N + N^3) = O(N^3)\).

Notice that both Theorems 1 and 2 regard a largest-size d-bounded distance sample of only one metric space, X. However, its counterpart for Y is equally likely to provide information for discriminating the two metric spaces. Making use of the symmetry of \({\widehat{d}}_{\mathcal{G}\mathcal{H}}\), we summarize the theoretical findings of this section under the \(O(N^3)\)-time procedure VerifyLowerBound (see Appendix E for the pseudocode), which attempts to prove \({\widehat{d}}_{\mathcal{G}\mathcal{H}}(X, Y) \ge \frac{d}{2}\).

3.5 Obtaining the lower bound

Procedure VerifyLowerBound is a decision algorithm that gives a “yes” or “no” answer to the question of whether a particular value can be proven to bound \({\widehat{d}}_{\mathcal{G}\mathcal{H}}(X, Y)\) from below. In order to obtain an informative lower bound, one wants to find the largest value for which the answer is “yes”. Since \({\widehat{d}}_{\mathcal{G}\mathcal{H}}(X, Y) \le \frac{1}{2}d_{\max }\), the answer must be “no” for any value above \(\frac{1}{2}d_{\max }\), and therefore it suffices to limit the scope to \((0, \frac{1}{2}d_{\max }]\). To avoid redundancy when checking the values from this interval, we consider the following result.

Claim 4

Let \(\Delta \) denote the set of absolute differences between the distances in X and Y,

$$\begin{aligned} \Delta \buildrel \textrm{def}\over =\{\left| d_X(x, x') - d_Y(y, y')\right| : x, x' \in X, y, y' \in Y\}, \end{aligned}$$

and let \(\{\delta _i\}_{i=1}^{|\Delta |}\) represent the sorting order of \(\Delta \), \(0 = \delta _1< \ldots < \delta _{|\Delta |} = d_{\max }\). If \(\delta _i< d_1 < d_2 \le \delta _{i+1}\) for some \(d_1, d_2\) and \(i \in \langle |\Delta |-1 \rangle \), then \(\textsc {VerifyLowerBound}(D^X, D^Y, d_1) = \textsc {VerifyLowerBound}(D^X, D^Y, d_2)\).

Proof

VerifyLowerBound considers the value of its argument d only through comparisons of the form “\(\delta < d\)” for some \(\delta \), that occur in FindLargeK and SolveFeasibleAssignment. Notice that the values of \(\delta \) compared with d in FindLargeK are the entries of \(D^X\) or \(D^Y\), and therefore belong to \(\Delta \) as

$$\begin{aligned} \{d_X(x, x'): x, x' \in X\}, \{d_Y(y, y'): y, y' \in Y\} \subseteq \Delta . \end{aligned}$$

The values of \(\delta \) compared with d in SolveFeasibleAssignment belong to \(\Delta \) by construction.

For any \(\delta \in \Delta \), \(\delta < d_1\) if and only if \(\delta < d_2\). This is because \(\delta < d_2\) implies \(\delta < d_1\) from \(\delta \notin [d_1, d_2)\), while \(\delta < d_1\) implies \(\delta < d_2\) trivially. It follows that both FindLargeK and SolveFeasibleAssignment yield identical outputs on \(d_1\) and \(d_2\) (and otherwise identical inputs), and hence so does VerifyLowerBound. \(\square \)

Claim 4 implies that the largest \(\delta \in \Delta \) s.t. \(\textsc {VerifyLowerBound}(D^X, D^Y, \delta ) = \textrm{TRUE}\) is the largest \(d \in \mathbb {R}\) s.t. \(\textsc {VerifyLowerBound}(D^X, D^Y, d) = \textrm{TRUE}\). We use this fact to implement the procedure FindLowerBound (see Appendix F for the pseudocode), that obtains a lower bound of \({\widehat{d}}_{\mathcal{G}\mathcal{H}}(X, Y)\) by calling \(\textsc {VerifyLowerBound}(D^X, D^Y, \delta )\) for each \(\delta \in \Delta \) from largest to smallest, and stops once the output is TRUE. Since \(|\Delta | = O(N^4)\) in the general case, the time complexity of FindLowerBound is \(O(N^7)\).
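A high-level sketch of this search is given below; verify_lower_bound stands in for the procedure of Appendix E and is passed as an argument, since its internals are not reproduced here.

```python
import numpy as np


def find_lower_bound(DX, DY, verify_lower_bound):
    """Sketch of FindLowerBound: try the candidate values of d from largest to smallest.

    `verify_lower_bound(DX, DY, d) -> bool` stands in for the procedure of Appendix E.
    """
    # the set Delta of absolute differences between distances, sorted ascending
    # (O(N^4) memory here; acceptable for a sketch)
    deltas = np.unique(np.abs(DX[:, :, None, None] - DY[None, None, :, :]))
    for d in deltas[::-1]:                    # descending order
        if d > 0 and verify_lower_bound(DX, DY, d):
            return d / 2
    return 0.0
```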

Remark 6

Using binary search on \(\Delta \) instead of traversing its values in descending order reduces the number of calls to VerifyLowerBound from \(O(N^4)\) to \(O(\log N)\), bringing the time complexity of FindLowerBound to \(O(N^3 \log N)\). We however notice that, given some \(d_1 < d_2\), \(\textsc {VerifyLowerBound}(D^X, D^Y, d_2) = \textrm{TRUE}\) does not guarantee \(\textsc {VerifyLowerBound}(D^X, D^Y, d_1) = \textrm{TRUE}\) due to the heuristic nature of FindLargeK (for example, it is possible to obtain \({\widetilde{M}}(X, d_1) \le |Y| < {\widetilde{M}}(X, d_2)\) even though, trivially, \(M(X, d_1) \ge M(X, d_2) \ge {\widetilde{M}}(X, d_2)\)). It follows that relying on the binary search in FindLowerBound can result in failing to find the largest \(\delta \in \Delta \) s.t. \(\textsc {VerifyLowerBound}(D^X, D^Y, \delta ) = \textrm{TRUE}\), and thus in reducing time complexity at the cost of potentially lower accuracy.

Remark 7

An increase in accuracy at the expense of time complexity can be achieved by modifying \(\textsc {FindLowerBound}(D^X, D^Y)\) and its subroutines so that Theorem 2 for the d-bounded distance samples tries to prove not only \({\widehat{d}}_{\mathcal{G}\mathcal{H}}(X, Y) \ge \frac{d}{2}\) but also \({\widehat{d}}_{\mathcal{G}\mathcal{H}}(X, Y) \ge \frac{d^*}{2}\) for some \(d^* < d\), \(d^* \in \Delta \). For every pair \(i \in \langle n \rangle , j \in \langle |Y| \rangle \), the modified \(\textsc {CheckTheoremB}(K, D^Y, d)\) finds the largest \(d^* \le d\) s.t. \(\textsc {SolveFeasibleAssignment}(\textrm{row}_{i}(K), \textrm{row}_{j}(D^Y), d^*)=\textrm{TRUE}\) (instead of checking the existence of feasible assignment for d only) and returns the largest \(d^*\) s.t. some i satisfies \(\Vert \textrm{row}_{i}(K) - \textrm{row}_{i}(L)\Vert _\infty \ge d^* \quad \forall L \in \textrm{PSPS}_n(D^Y)\). Accordingly, VerifyLowerBound propagates this value of \(d^*\) to FindLowerBound, which stores their maximum from the already processed part of \(\Delta \) (and proceeds only if this maximum is smaller than the current \(\delta _i\)). Because the existence of feasible assignment for \(d^*\) is monotonic in \(d^*\), using binary search on \(\Delta \) is sufficient to find the largest \(d^* \le d\) s.t. \(\textsc {SolveFeasibleAssignment}(\textrm{row}_{i}(K), \textrm{row}_{j}(D^Y), d^*)=\textrm{TRUE}\). It follows that the increase in time complexity of the modified CheckTheoremB (and therefore FindLowerBound) is only by a factor of \(O(\log N)\).

The modification would be able to provide better bounds by additionally trying smaller distance samples with larger entries in Theorem 2. In particular, it is guaranteed to prove the baseline lower bound \({\widehat{d}}_{\mathcal{G}\mathcal{H}}(X, Y) \ge \frac{1}{2}|{{\,\textrm{diam}\,}}X - {{\,\textrm{diam}\,}}Y|\). Assuming without loss of generality \({{\,\textrm{diam}\,}}X \ge {{\,\textrm{diam}\,}}Y\), the call to \(\textsc {FindLarge}\)K\((D^X, {{\,\textrm{diam}\,}}X)\) from \(\textsc {VerifyLowerBound}(D^X, D^Y, {{\,\textrm{diam}\,}}X)\) will yield K with non-zero entries that each are at least \(({{\,\textrm{diam}\,}}X - {{\,\textrm{diam}\,}}Y)\) away from all \(D^Y\) entries. Because the \(({{\,\textrm{diam}\,}}X)\)-bounded K is also \(({{\,\textrm{diam}\,}}X - {{\,\textrm{diam}\,}}Y)\)-bounded, the modified \(\textsc {CheckTheoremB}(K, D^Y, {{\,\textrm{diam}\,}}X)\) will output \(d^* \ge {{\,\textrm{diam}\,}}X - {{\,\textrm{diam}\,}}Y\), ensuring that the resulting lower bound is at least \(\frac{1}{2}|{{\,\textrm{diam}\,}}X - {{\,\textrm{diam}\,}}Y|\).

This is in contrast with the original FindLowerBound, in which the only easy guarantee for proving the baseline lower bound via Theorem 2 is the presence of \({{\,\textrm{diam}\,}}X\) values in K produced by \(\textsc {FindLarge}K(D^X, {{\,\textrm{diam}\,}}X - {{\,\textrm{diam}\,}}Y)\). However, because \(\textsc {FindLarge}K\) prioritizes the number of entries \(\ge {{\,\textrm{diam}\,}}X - {{\,\textrm{diam}\,}}Y\) in a row of \(D^X\) over their magnitude, and because a point with a distance \({{\,\textrm{diam}\,}}X\) can have arbitrarily many distances \(< {{\,\textrm{diam}\,}}X - {{\,\textrm{diam}\,}}Y\), establishing practical conditions for the output of FindLowerBound to be \(\ge \frac{1}{2}({{\,\textrm{diam}\,}}X - {{\,\textrm{diam}\,}}Y)\) is non-trivial.

4 Upper bound for \({\widehat{d}}_{\mathcal{G}\mathcal{H}}(X, Y)\)

4.1 Construction method

To obtain an upper bound of \({\widehat{d}}_{\mathcal{G}\mathcal{H}}(X, Y)\), we recall the definition

$$\begin{aligned} {\widehat{d}}_{\mathcal{G}\mathcal{H}}(X, Y) \buildrel \textrm{def}\over =\frac{1}{2}\max \{\inf _{\varphi } {{\,\textrm{dis}\,}}\varphi , \inf _{\psi } {{\,\textrm{dis}\,}}\psi \}, \end{aligned}$$

where \(\varphi \) and \(\psi \) are the mappings \(\varphi :X \rightarrow Y\) and \(\psi :Y \rightarrow X\) and the infimums are taken over the corresponding function spaces. It follows that \({\widehat{d}}_{\mathcal{G}\mathcal{H}}(X, Y) \le \frac{1}{2}\max \{{{\,\textrm{dis}\,}}\varphi , {{\,\textrm{dis}\,}}\psi \}\) for any particular choice of \(\varphi \) and \(\psi \). Considering the exponential size of function spaces, we rely on a small randomized sample of mappings to tighten the upper bound. To sample \(\varphi :X \rightarrow Y\), we use the construction method, a heuristic for solving quadratic assignment problems (Gilmore 1962; Burkard et al. 1998). The construction method iteratively maps each \(x \in X\) to some \(y \in Y\), chosen by a greedy algorithm to minimize \({{\,\textrm{dis}\,}}\varphi \) as described below.

Let \(X = \{x_1, \ldots , x_{|X|}\}\) and \(Y = \{y_1, \ldots , y_{|Y|}\}\). We randomly choose a permutation \(\pi \) of \(\langle |X| \rangle \) to represent the order in which the points in X are mapped. At step i we map \(x_{\pi (i)}\) by choosing \(y_{j_i}\) and setting \(\varphi (x_{\pi (i)}) \buildrel \textrm{def}\over =y_{j_i}\). We represent these choices by inductive construction \(R_\varphi ^{(i)} = R_\varphi ^{(i-1)} \cup \{(x_{\pi (i)}, y_{j_i})\}\) for \(i = 1, \ldots , |X|\), where \(R_\varphi ^{(0)} \buildrel \textrm{def}\over =\emptyset \). The particular choice of \(y_{j_i}\) at step i is made to minimize distortion of resultant \(R_\varphi ^{(i)}\):

$$\begin{aligned} y_{j_i} \in \mathop {\mathrm {arg\,min}}\limits _{y \in Y}\,{{\,\textrm{dis}\,}}\Big (R_\varphi ^{(i-1)} \cup \big \{(x_{\pi (i)}, y)\big \}\Big ). \end{aligned}$$

After carrying out all |X| steps, \(\varphi :X \rightarrow Y\) is given by the constructed relation \(R_\varphi \buildrel \textrm{def}\over =R_\varphi ^{(|X|)}\), and by definition, \({{\,\textrm{dis}\,}}\varphi = {{\,\textrm{dis}\,}}R_\varphi \).

Notice the possible ambiguity in the choice of \(y_{j_i}\) when \(y \in Y\) minimizing \({{\,\textrm{dis}\,}}\Big (R_\varphi ^{(i-1)} \cup \big \{(x_{\pi (i)}, y)\big \}\Big )\) is not unique. In particular, any \(y \in Y\) can be chosen as \(y_{j_1}\) at step 1, since \({{\,\textrm{dis}\,}}\big \{(x_{\pi (i)}, y_{j_1})\big \} = 0\) is invariant to the said choice. In the case of such ambiguity, our implementation simply decides to map \(x_{\pi (i)}\) to \(y_{j_i}\) of the smallest index \(j_i\). However, in applications one might want to modify this logic to leverage the knowledge of the relationship between the points from two metric spaces.

We formalize the above heuristic under a randomized, \(O(N^3)\)-time procedure SampleSmallDistortion (see Appendix G for the pseudocode) that samples a mapping between the two metric spaces with the intent of minimizing its distortion, and outputs this distortion. We then describe an algorithm FindUpperBound (see Appendix H for the pseudocode), that repeatedly calls SampleSmallDistortion to find \(\varphi ^*:X \rightarrow Y\) and \(\psi ^*:Y \rightarrow X\), the mappings of the smallest distortion among those sampled from \(X \rightarrow Y\) and \(Y \rightarrow X\), respectively, and finds an upper bound for \({\widehat{d}}_{\mathcal{G}\mathcal{H}}(X, Y)\) as \(\frac{1}{2}\max \{{{\,\textrm{dis}\,}}\varphi ^*, {{\,\textrm{dis}\,}}\psi ^*\}\). The time complexity of FindUpperBound is therefore \(O(sN^3)\), where s is the total number of sampled mappings.
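A compact sketch of one run of the construction method is given below (cf. SampleSmallDistortion, Appendix G), with ties broken by the smallest index as described above; FindUpperBound would keep the smallest distortion over several such runs in each direction and return half of the maximum of the two minima.

```python
import numpy as np


def sample_small_distortion(DX, DY, rng=None):
    """One randomized run of the construction method (sketch of SampleSmallDistortion)."""
    rng = np.random.default_rng() if rng is None else rng
    n = DX.shape[0]
    order = rng.permutation(n)                 # random order in which the points of X are mapped
    phi = np.zeros(n, dtype=int)               # phi[i] is the index in Y that x_i is mapped to
    dis = 0.0
    for step, i in enumerate(order):
        if step > 0:
            mapped = order[:step]              # points of X mapped so far
            # added distortion from mapping x_i to each candidate y in Y
            added = np.abs(DX[i, mapped][None, :] - DY[:, phi[mapped]]).max(axis=1)
            phi[i] = int(added.argmin())       # smallest index among the minimizers
            dis = max(dis, float(added.min()))
        # at step 0 any choice of y gives zero distortion; phi[i] stays at index 0
    return dis, phi
```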

4.2 Approximation ratio bounds

Unlike the lower bound estimation, our heuristic for the upper bound operates within the solution space of \({\widehat{d}}_{\mathcal{G}\mathcal{H}}\) formulation as a combinatorial minimization problem. This allows leveraging a result for bottleneck assignment problems bounding the ratio between the two extrema of the objective function (see e.g. Theorem 2 in Burkard and Fincke (1985)) to understand better the convergence of SampleSmallDistortion. Indeed, solving \(\inf _{\phi :X\rightarrow Y} {{\,\textrm{dis}\,}}\phi \) can be cast as a variant of the QBAP with the cost function \(c(i, j, k, h) \buildrel \textrm{def}\over =|D^X_{i,j} - D^Y_{k,h}|\) that allows non-injective assignments and hence has the search space of size \(|Y|^{|X|}\). Assuming \(|X| > 1\), the maximum of its objective function \({{\,\textrm{dis}\,}}\phi \) on the feasible region is \(d_{\max }\). We consider a scaled but otherwise identical minimization problem with the cost function \({\widetilde{c}}(i, j, k, h) \buildrel \textrm{def}\over =\frac{|D^X_{i,j} - D^Y_{k,h}|}{d_{\max }} \in [0, 1]\), whose objective function has the minimum of \(\frac{\inf _{\phi :X\rightarrow Y} {{\,\textrm{dis}\,}}\phi }{d_{\max }}\). The result by Burkard and Fincke (1985) treats the values of \({\widetilde{c}}\) as identically distributed random variables, and states that if |X| and |Y| satisfy

$$\begin{aligned} \mathbb {P}\left[ \frac{|D^X_{i,j} - D^Y_{k,h}|}{d_{\max }} < 1-\frac{\epsilon }{2}\right] \le 1-q \end{aligned}$$
(⋆)

for some \(\epsilon , q \in [0, 1]\), then the ratio between the maximum and the minimum of the objective function is probabilistically bounded by

$$\begin{aligned} \mathbb {P}\left[ \frac{d_{\max }}{\inf _{\phi :X\rightarrow Y} {{\,\textrm{dis}\,}}\phi } < 1 + \epsilon \right] \ge 1 - \exp (-q|X|^2 + |X| \log |Y|). \end{aligned}$$

Let b be the smallest distortion discovered by SampleSmallDistortion after a number of tries, and consider some \(r \ge 1\). If for \(\epsilon \buildrel \textrm{def}\over =\frac{rd_{\max }}{b}-1\) we can find q that satisfies both (⋆) and \(q > \frac{\log |Y|}{|X|}\), the above result allows us to bound the probability that the approximation ratio of b is worse than r:

$$\begin{aligned} \mathbb {P}\left[ \frac{b}{\inf _{\phi :X\rightarrow Y} {{\,\textrm{dis}\,}}\phi }> r\right]&= \mathbb {P}\left[ \frac{d_{\max }}{\inf _{\phi :X\rightarrow Y} {{\,\textrm{dis}\,}}\phi } > \frac{rd_{\max }}{b}\right] \\&= 1 - \mathbb {P}\left[ \frac{d_{\max }}{\inf _{\phi :X\rightarrow Y} {{\,\textrm{dis}\,}}\phi } < 1 + \epsilon \right] \\&\le \exp (-q|X|^2 + |X| \log |Y|). \end{aligned}$$

An analogous probabilistic bound applies to the minimization of distortion of \(\psi :Y \rightarrow X\). These bounds can be used to adaptively control the sample size s in FindUpperBound for achieving some target accuracy with the desired level of confidence.
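For given q, |X|, and |Y|, the right-hand side of this bound is straightforward to evaluate, e.g. with a small helper like the one sketched below (assuming q has been verified to satisfy (⋆) for the chosen \(\epsilon\)):

```python
import math


def ratio_tail_bound(q, size_X, size_Y):
    """Upper bound on P[approximation ratio worse than r], for a q that satisfies
    (star) with epsilon = r * d_max / b - 1 and q > log|Y| / |X|."""
    return math.exp(-q * size_X ** 2 + size_X * math.log(size_Y))
```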

5 Algorithm for estimating \({\widehat{d}}_{\mathcal{G}\mathcal{H}}(X, Y)\)

The algorithm for estimating the mGH distance between compact metric spaces X and Y (“the algorithm”) consists of the calls \(\textsc {FindLowerBound}(D^X, D^Y)\) and \(\textsc {FindUpperBound}(D^X, D^Y)\). Notice that \({\widehat{d}}_{\mathcal{G}\mathcal{H}}(X, Y)\) is found exactly whenever the outputs of the two procedures match. Time complexity of the algorithm is \(O(N^7)\) whenever s, the number of mappings sampled from \(X \rightarrow Y\) and \(Y \rightarrow X\) in FindUpperBound, is limited by \(O(N^4)\).

To obtain a more practical result, we now consider a special case of metric spaces induced by unweighted undirected graphs. We show that estimating the mGH distance between such metric spaces in many applications has time complexity \(O(N^3 \log N)\), and present our implementation of the algorithm in Python.

5.1 \({\widehat{d}}_{\mathcal{G}\mathcal{H}}\) between unweighted undirected graphs

Let \(G = (V_G, E_G)\) be a finite undirected graph. For every pair of vertices \(v, v' \in V_G\), we define \(d_G(v, v')\) as the shortest path length between v and \(v'\). If the weights of the edges in \(E_G\) are positive, the resulting function \(d_G : V_G \times V_G \rightarrow [0, \infty ]\) is a metric on \(V_G\). We say that the metric space \((V_G, d_G)\) is induced by graph G, and notice that its size \(|V_G|\) is the order of G. By convention, the shortest path length between vertices from different connected components of a graph is defined as \(\infty \), and therefore \((V_G, d_G)\) is compact if and only if graph G is connected.
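For instance, the distance matrix of such a metric space can be computed from the graph's adjacency matrix by breadth-first search; a sketch using SciPy (one possible choice of tooling, not necessarily the one used in our implementation):

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import shortest_path


def graph_distance_matrix(adjacency):
    """Distance matrix of the compact metric space induced by a connected
    unweighted undirected graph, given its adjacency matrix."""
    D = shortest_path(csr_matrix(adjacency), unweighted=True)   # BFS shortest path lengths
    if np.isinf(D).any():
        raise ValueError("the graph is disconnected, so the induced metric space is not compact")
    return D.astype(int)


A = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]])   # path graph on 3 vertices
print(graph_distance_matrix(A))                   # [[0 1 2] [1 0 1] [2 1 0]]
```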

For brevity, we will use notation \(G \buildrel \textrm{def}\over =(V_G, d_G)\), assuming that the distinction between graph G and metric space G induced by this graph is clear from the context. In particular, we refer to the mGH distance between compact metric spaces induced by undirected connected graphs G and H as \({\widehat{d}}_{\mathcal{G}\mathcal{H}}(G, H)\), or even call it “the mGH distance between graphs G and H”.

Let G and H be connected unweighted undirected graphs. Notice that \({\widehat{d}}_{\mathcal{G}\mathcal{H}}(G, H) = 0\) if and only if graphs G and H are isomorphic. We use the following result to reduce the computational cost of estimating \({\widehat{d}}_{\mathcal{G}\mathcal{H}}(G, H)\) as compared to that in the general case.

Claim 5

If G is a finite connected unweighted undirected graph, all entries in distance matrix \(D^G\) of the corresponding compact metric space are from \(\{0, 1, \ldots , {{\,\textrm{diam}\,}}G\}\).

Claim 5 implies that there are at most \(d_{\max } + 1\) distinct entries in any distance sample of either G or H, where \(d_{\max } \buildrel \textrm{def}\over =\max \{{{\,\textrm{diam}\,}}G, {{\,\textrm{diam}\,}}H\}\). Recall that SolveFeasibleAssignment traverses the two sorted vectors strictly left to right, in each vector keeping track only of its part after the last assigned entry. It follows that both vectors can be represented simply as counts of entries \(0, 1, \ldots , d_{\max }\) they contain, with traversing a vector corresponding to decrementing the leftmost non-zero count. In particular, assigning an entry in one vector to an entry in another is then equivalent to decrementing the corresponding counts in each of the vectors, which reflects the injectivity of the assignment being constructed. Notice that if \(v_i\) is assigned to \(u_j\) at some step and the remaining entries of the two vectors contain \(v_i\) and \(u_j\), respectively, SolveFeasibleAssignment will make an identical assignment at the next step. It follows that representing the vectors by counting their identical entries allows the procedure to make bulk assignments, which reduces the time complexity of SolveFeasibleAssignment from O(N) to \(O(d_{\max })\) and makes the complexity of CheckTheoremB \(O(N^2 d_{\max })\), where \(N \buildrel \textrm{def}\over =\max \{|V_G|, |V_H|\}\).

Remark 8

From the perspective of optimization theory, representing vectors as counts of their entries reformulates the linear assignment feasibility problem of SolveFeasibleAssignment as a transportation feasibility problem.
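As an illustration of this counting representation, the per-row check of Sect. 3.4 might be reorganized along the following lines, assuming integer distances and an integer d (as in the unweighted graph case) and each row encoded as counts of the values \(0, 1, \ldots , d_{\max }\) (e.g. via np.bincount).

```python
def solve_feasible_assignment_counts(cv, cu, d):
    """Greedy feasibility check with rows represented as counts of the values
    0, 1, ..., d_max (a bulk version of SolveFeasibleAssignment)."""
    cu = list(cu)                           # local copy, consumed as entries get assigned
    j = 0                                   # smallest value bin of u still available
    for a, need in enumerate(cv):           # values of v in ascending order
        while need > 0:
            while j < len(cu) and (cu[j] == 0 or j <= a - d):
                j += 1                      # bin j is exhausted or too small for value a
            if j == len(cu) or j >= a + d:
                return False                # no available value of u is within d of a
            take = min(need, cu[j])         # bulk-assign `take` entries of value a to value j
            cu[j] -= take
            need -= take
    return True
```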

Another implication of Claim 5 narrows the set of absolute differences between the distances in G and H to \(\Delta = \{0, 1, \ldots , d_{\max }\}\), reducing the time complexity of traversing its elements from \(O(N^4)\) to \(O(d_{\max })\). This bounds the time complexity of FindLowerBound by \(O(N^3 d_{\max })\). The complexity of the entire algorithm is therefore \(O(N^3 d_{\max })\) when the number of sampled mappings is \(s = O(d_{\max })\).

Network diameter often scales logarithmically with network size, e.g. in Erdős–Rényi random graph model (Albert and Barabási 2002) and Watts–Strogatz small-world model (Watts and Strogatz 1998), or even sublogarithmically, e.g. in the configuration model with i.i.d. degrees (van der Hofstad et al. 2007) and Barabási–Albert preferential attachment model (Cohen and Havlin 2003). This suggests the time complexity of the algorithm in applications to be \(O(N^3 \log N)\), deeming it practical for shape comparison for graphs of up to a moderate order.

5.2 Implementation

We have implemented the algorithm for estimating the mGH distance between unweighted graphs in Python 3.7 as part of the scikit-tda package (Saul and Tralie 2019) (https://github.com/scikit-tda). Our implementation takes adjacency matrices (optionally in sparse format) of unweighted undirected graphs as inputs. If an adjacency matrix corresponds to a disconnected graph, the algorithm approximates it with its largest connected component. The number of mappings to sample from \(X \rightarrow Y\) and \(Y \rightarrow X\) in FindUpperBound is parametrized as a function of |X| and |Y|.

6 Computational examples

This section demonstrates the performance of the algorithm on some real-world and synthetic networks. The real-world networks were sourced from the Enron email corpus (Klimt and Yang 2004), the cybersecurity dataset collected at Los Alamos National Laboratory (Kent 2016), and the functional magnetic resonance imaging dataset ABIDE I (Di Martino et al. 2014).

6.1 Methodology and tools

To estimate mGH distances between the graphs, we use our implementation of the algorithm from Sect. 5.2. We set the number of mappings from \(X \rightarrow Y\) and \(Y \rightarrow X\) to sample by the procedure FindUpperBound to and , respectively.

Given a pair of connected graphs, we approximate the mGH distance between them by \({\tilde{d}} \buildrel \textrm{def}\over =\frac{b_L + b_U}{2}\), where \(b_L\) and \(b_U\) are the lower and upper bounds produced by the algorithm. The relative error of the algorithm is estimated by

$$\begin{aligned} \eta \buildrel \textrm{def}\over ={\left\{ \begin{array}{ll} \frac{b_U - b_L}{2{\tilde{d}}} = \frac{b_U - b_L}{b_L + b_U}, & \hbox { if}\ b_U > 0\\ 0, & \hbox { if}\ b_U = 0 \end{array}\right. }, \end{aligned}$$

noting that the case of \(b_U = 0\) implies that the algorithm has found the mGH distance of 0 exactly. In addition, we compute the utility coefficient defined by

$$\begin{aligned} \upsilon \buildrel \textrm{def}\over ={\left\{ \begin{array}{ll} \frac{b_L - b'_L}{2{\tilde{d}}} = \frac{b_L - b'_L}{b_L + b_U}, & \hbox { if}\ b_U > 0\\ 0, & \hbox { if}\ b_U = 0 \end{array}\right. }, \end{aligned}$$

where \(b'_L \le b_L\) is the baseline lower bound: \(b'_L \buildrel \textrm{def}\over =\frac{1}{2}|{{\,\textrm{diam}\,}}X - {{\,\textrm{diam}\,}}Y| \le {\widehat{d}}_{\mathcal{G}\mathcal{H}}(X, Y)\) for \(X, Y \in \mathcal {M}\). The utility coefficient thus quantifies the tightening of the lower bound achieved by using Theorems 1 and 2.
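Both quantities translate directly into code; the following minimal helper uses b_l, b_u, and b_l_baseline for \(b_L\), \(b_U\), and \(b'_L\), respectively.

```python
def relative_error(b_l, b_u):
    """Relative error eta of the estimate (b_l + b_u) / 2."""
    return 0.0 if b_u == 0 else (b_u - b_l) / (b_l + b_u)

def utility_coefficient(b_l, b_u, b_l_baseline):
    """Utility coefficient upsilon: tightening of the lower bound over the baseline b_l_baseline."""
    return 0.0 if b_u == 0 else (b_l - b_l_baseline) / (b_l + b_u)
```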

Because the mappings selected by the construction method (see Sect. 4) may be suboptimal, the upper bound is not guaranteed to be computed accurately. From the definitions of the relative error \(\eta \) and the utility coefficient \(\upsilon \), a sufficiently loose upper bound can make \(\eta \) arbitrarily close to 1 and \(\upsilon \) arbitrarily small.

We measured \(\eta \) and \(\upsilon \) separately for each dataset. For the real-world data, we also used the approximated distances \({\tilde{d}}\) to identify graphs of outlying shapes and matched these graphs to events or features of importance in application domains, following the approach taken in e.g. (Bunke and Kraetzl 2004; Bunke et al. 2006; Koutra et al. 2013; Pincombe 2005). Unlike (Bunke and Kraetzl 2004; Koutra et al. 2013), and Pincombe (2005) that focus on local time outliers (under the assumption of similarity between graphs from consecutive time steps), we considered the outliers with respect to the entire time range (where applicable), similarly to Bunke et al. (2006).

To identify the outliers, we applied the Local Outlier Probability (LoOP) method (Kriegel et al. 2009) to the graphs using their approximated pairwise mGH distances. LoOP uses a local space approach to outlier detection and is robust with respect to the choice of parameters (Kriegel et al. 2009). The score that LoOP assigns to a data object is interpreted as the probability that the object is an outlier. We binarize the scores into an outlier/non-outlier classification using the commonly accepted confidence levels of 95%, 99%, and 99.9% as thresholds, chosen empirically for each dataset. We used the Python implementation of LoOP (Constantinou 2018) (version 0.2.1), which we modified to allow non-Euclidean distances between the objects. We ran LoOP with the locality and significance parameters set to \(k = 20\) and \(\lambda = 1\), respectively.
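The following sketch shows how LoOP could be driven by a precomputed mGH distance matrix. The version 0.2.1 used here required our own modification for that purpose; the distance_matrix and neighbor_matrix keyword arguments below are an assumption about newer PyNomaly releases and should be verified against the installed version.

```python
import numpy as np
from PyNomaly import loop  # Constantinou (2018)

def loop_outlier_probabilities(dist, k=20, extent=1):
    """LoOP scores from a precomputed (here, mGH) distance matrix.

    NOTE: the keyword arguments for passing precomputed distances are assumed to exist
    in a newer PyNomaly release; version 0.2.1 had to be modified for this purpose."""
    order = np.argsort(dist, axis=1)[:, 1:k + 1]       # indices of k nearest neighbors
    neighbor_dist = np.take_along_axis(dist, order, axis=1)
    model = loop.LocalOutlierProbability(
        distance_matrix=neighbor_dist, neighbor_matrix=order,
        n_neighbors=k, extent=extent,                  # extent plays the role of lambda
    ).fit()
    return model.local_outlier_probabilities
```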

The synthetic graphs were generated according to the Erdős–Rényi, Watts–Strogatz, and Barabási–Albert network models. We used the implementations of these models provided in Hagberg et al. (2008) (version 2.1).

All computations were performed on a single 2.70 GHz core of an Intel i7-7500U CPU.

6.2 Enron email corpus

The Enron email corpus (available at https://www.cs.cmu.edu/~./enron/) is a collection of email conversations among the employees, mostly senior management, of the Enron Corporation from October 1998 to March 2002. We used the latest version of the dataset, dated May 7, 2015, which contains roughly 500K emails from 150 employees.

Associating employees with graph vertices, we view the dataset as a dynamic network whose 174 instances reflect weekly corporate email exchange over the course of 3.5 years. An (unweighted) edge connects a pair of vertices in a network instance if the two employees mutually exchanged at least one email during that week. The time resolution of 1 week was suggested in Sulo et al. (2010) as providing an appropriate balance between noise reduction and information loss for the Enron dataset.
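A sketch of this construction is as follows; the column names of the email log are hypothetical, and the exact week boundaries may differ from the ones we used.

```python
import pandas as pd
import networkx as nx

def weekly_exchange_graphs(emails):
    """One unweighted graph per week from an email log.

    `emails` is assumed to be a DataFrame with columns 'sender', 'recipient' and a
    datetime column 'timestamp'; an edge connects two employees iff each of them
    emailed the other at least once within the same week."""
    emails = emails.assign(week=emails['timestamp'].dt.to_period('W'))
    graphs = {}
    for week, chunk in emails.groupby('week'):
        directed = set(zip(chunk['sender'], chunk['recipient']))
        g = nx.Graph()
        g.add_edges_from((a, b) for a, b in directed if a != b and (b, a) in directed)
        graphs[week] = g
    return graphs
```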

We expected major events related to the Enron scandal in the end of 2001 to cause abnormal patterns of weekly email exchange between the senior management, distorting the shape of the corresponding network instances. As a consequence, metric spaces generated by such network instances would be anomalously far from the rest with respect to the mGH distance.

Fig. 2 Outlier probabilities assigned to the weekly email exchange networks. Red indicates outlier probabilities \(> 0.99\), corresponding to the weeks of Sep 17, Oct 29, Nov 5, and Nov 26 in the year 2001

In preparation for the analysis, we discarded all empty network instances, corresponding to the weeks with no email exchange between the employees (all 28 such weeks occurred before May 1999). Each of the remaining 146 graphs was then replaced with its largest connected component. The distribution of the order of the resulting graphs had a mean of 68.2, a standard deviation of 99.8, and a maximum of 706.

We estimated the mGH distances in all 10,585 distinct pairs of the non-empty connected network instances. Average graph order and computing time per pair were distributed as \(68.2 \pm 70.3\) and \(0.93\textrm{s} \pm 3.91\textrm{s}\), respectively (where \(\mu \pm \sigma \) denotes a distribution with a mean of \(\mu \) and a standard deviation of \(\sigma \); no assumptions of normality are made, and we use the standard deviation solely as a measure of spread). The algorithm found exact mGH distances in 74.4% of the graph pairs, with the relative error \(\eta \) and the utility coefficient \(\upsilon \) distributed as \(0.057 \pm 0.118\) and \(0.043 \pm 0.085\), respectively. The ratio between the means of \(\upsilon \) and \(\eta \) (note that \(\frac{\upsilon }{\eta } + 1 = \frac{\upsilon + \eta }{\eta } = \frac{b_U - b'_L}{b_U - b_L}\)) implies that using Theorems 1 and 2 on average reduced the relative error by a factor of 1.75.

We ran LoOP on the network instances using their (approximated) pairwise mGH distances \({\tilde{d}}\). The resulting outlier probability assigned to each network instance (Fig. 2) thus measures the abnormality of its shape.

To see if the abnormal shape of email exchange corresponds to events of high importance from the Enron timeline, we applied the threshold of 0.99 to the outlier probabilities. Three out of four network instances that scored above the threshold correspond to the weeks of known important events in 2001, namely the weeks of Oct 29, Nov 5, and Nov 26 (each date is a Monday). As the closing stock price of Enron hit an all-time low on Friday, Oct 26, Enron’s chairman and CEO Kenneth Lay made multiple calls for help to Treasury Secretary Paul O’Neill and Commerce Secretary Donald Evans on Oct 28–29. Enron fired both its treasurer and in-house attorney on Nov 5, admitted to overstating its profits for the last five years by $600M on Nov 8, and agreed to be acquired by Dynegy Inc. for $9B on Nov 9. On Nov 28, Dynegy Inc. aborted the plan to buy Enron, and on Dec 2, Enron went bankrupt.

We conclude that the abnormal shape of email exchange networks tends to correspond to disturbances in their environment, and that the algorithm estimates the mGH distance accurately enough to capture it.

6.3 LANL cybersecurity dataset

The Los Alamos National Laboratory (LANL) cybersecurity dataset (available at https://csr.lanl.gov/data/cyber1/) represents 58 consecutive days of event data collected from LANL’s corporate computer network (Kent 2016). For our purposes, we considered the part of it containing records of authentication events, generated by roughly 11K users on 18K computers, and collected from individual Windows-based desktops and servers. During the 58-day data collection period, a “red team” penetration testing operation took place. As a consequence, a small subset of authentication events was labeled as red team compromise events, presenting well-defined bad behavior that differed from normal user and computer activity. The labeling is not guaranteed to be exhaustive, and authentication events corresponding to red team actions, but not labeled as such, are likely to be present in the data (Heard and Rubin-Delanchy 2016).

Each authentication event occurs between a pair of source and destination computers. Viewing the computers as graph vertices, we associated each user with a dynamic network whose instances reflect their daily authentication activity within the 58-day period. An (unweighted) edge connecting a pair of vertices in a network instance means that at least one authentication event by the user occurred between the two computers on that day. This user-based graph representation of the data aims to capture the patterns of user account misuse that are expected to occur during a cyberattack.

Our objective was to develop an unsupervised approach that can identify the red team activity associated with a user’s account. We expected that frequent compromise events within the course of one day should distort the shape of the corresponding network instance. As a consequence, metric spaces generated by such network instances would be anomalously far from the rest.

For the analysis, we selected the 20 users with the highest total number of associated red team events and randomly chose another 20 users from those unaffected by the red team activity (see Fig. 3). Each of the 40 corresponding dynamic networks initially comprised 58 instances. We discarded all empty network instances, corresponding to the days of inactivity of a user account, and replaced each of the remaining 1,997 graphs with its largest connected component. The distribution of the order of the resulting graphs had a mean of 32.7 and a standard deviation of 75.8. The largest graphs were associated with the red team-affected user U1653@DOM1: their order was distributed with a mean of 178.2, a standard deviation of 391.5, and a maximum of 2,343.

Fig. 3 Frequency of red team events in daily authentication activity of the selected users. Grey indicates days of no authentication activity by user. Dashed line separates the two groups of 20 users

Separately for each of the selected users, we estimated the mGH distances in all distinct pairs of the non-empty connected network instances associated with her account. Table 1 shows the average graph order in a pair and the performance metrics of the algorithm, aggregated over the two user subsets. We notice that using Theorems 1 and 2 reduced the relative error by a factor of 3.5 on average. In addition, the algorithm did not seem to perform worse on the larger graphs associated with the user U1653@DOM1.

Table 1 Performance of the algorithm on user-based daily authentication graphs
Fig. 4 Outlier probability assigned to user-based daily authentication graphs. Red indicates outlier probabilities \(> 0.999\). Grey indicates empty graphs (excluded from analysis). The dashed line separates the two groups of 20 users

Separately for each user, we ran LoOP on the associated network instances using their (approximated) pairwise mGH distances \({\tilde{d}}\). The resulting outlier probability assigned to each network instance (Fig. 4) thus measures the abnormality of its shape for the particular user.

To see if the days of high compromise activity can be identified from the abnormal shape of the corresponding network instances, we approached the identification as a binary classification task. A user's daily activity is considered compromised if it includes at least 30 red team events, and it is predicted as such if the outlier probability assigned to the corresponding network instance is \(>0.999\). The resulting confusion matrix of our shape-based binary classifier is shown in Table 2, establishing its accuracy, precision, and recall as 99.5%, 18.2%, and 100%, respectively.
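The classification rule and its evaluation amount to a few lines; the sketch below takes per user-day arrays of red team event counts and assigned outlier probabilities.

```python
import numpy as np

def evaluate_shape_classifier(red_team_counts, outlier_probs,
                              event_threshold=30, prob_threshold=0.999):
    """Accuracy, precision and recall of the shape-based compromise classifier."""
    y_true = np.asarray(red_team_counts) >= event_threshold   # compromised user-days
    y_pred = np.asarray(outlier_probs) > prob_threshold       # flagged as shape outliers
    tp = np.sum(y_true & y_pred)
    tn = np.sum(~y_true & ~y_pred)
    fp = np.sum(~y_true & y_pred)
    fn = np.sum(y_true & ~y_pred)
    accuracy = (tp + tn) / y_true.size
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return accuracy, precision, recall
```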

Table 2 Confusion matrix of the shape-based binary classifier

Even though recall is prioritized over precision when identifying intrusions, low precision can be impractical when using the classifier alone. However, high recall suggests that combining the shape-based classifier with another method can improve performance of the latter. For example, classifying daily activity as a compromise if and only if both methods agree on it is likely to increase the other method’s precision without sacrificing its recall.

We conclude that sufficiently frequent compromise behavior tends to distort the shape of networks representing user authentication activity, and that the algorithm estimates the mGH distance accurately enough to pick up the distortion. However, regular operations can cause similar distortion, and the shape of user authentication alone may be insufficient to accurately identify compromise behavior.

6.4 ABIDE I dataset

The Autism Brain Imaging Data Exchange I (ABIDE I, https://fcon_1000.projects.nitrc.org/indi/abide/abide_I.html) (Di Martino et al. 2014) is a resting state functional magnetic resonance imaging dataset, collected to improve understanding of the neural bases of autism. Besides the diagnostic group (autism or healthy control), the information collected from the study subjects also includes their sex, age, handedness category, IQ, and current medication status. We considered the preprocessed version of ABIDE I used in Kazeminejad and Sotero (2019), containing brain networks of 816 subjects, including 374 individuals with autism spectrum disorder and 442 healthy controls.

The brain network of an individual comprises 116 nodes, each corresponding to a region of interest (ROI) in the automatic anatomical labeling atlas of the brain (Tzourio-Mazoyer et al. 2002). The connectivity of the network represents correlations between the brain activity in pairs of ROIs, with the brain activity extracted as a time series for each region. Namely, an (unweighted) edge connects two nodes if Spearman’s rank correlation coefficient between the corresponding pair of time series is in the highest 20% of all such coefficients for distinct ROI pairs. This approach of thresholding connectivity to obtain an unweighted graph representation is commonly used for functional brain networks (Achard and Bullmore 2007; Supekar et al. 2008; Redcay et al. 2013; Rudie et al. 2013; Rubinov and Sporns 2010).
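A sketch of this thresholding step is given below; the preprocessing pipeline of Kazeminejad and Sotero (2019) may differ in details.

```python
import numpy as np
from scipy.stats import spearmanr

def brain_network_adjacency(roi_time_series, keep_fraction=0.2):
    """Unweighted adjacency matrix from ROI time series of shape (n_timepoints, n_rois).

    An edge connects two ROIs whenever their Spearman correlation lies in the top
    `keep_fraction` of the coefficients over all distinct ROI pairs."""
    corr, _ = spearmanr(roi_time_series)          # (n_rois, n_rois) correlation matrix
    iu = np.triu_indices(corr.shape[0], k=1)      # indices of distinct ROI pairs
    threshold = np.quantile(corr[iu], 1.0 - keep_fraction)
    adj = corr >= threshold
    np.fill_diagonal(adj, False)
    return adj.astype(int)
```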

Fig. 5 Outlier probability assigned to brain networks of study subjects. Blue indicates outlier probabilities \(> 0.95\), and red indicates outlier probabilities \(> 0.999\); the latter correspond to the subjects MaxMun_c_0051332, MaxMun_b_0051323, and MaxMun_c_0051335. The remaining outlier probabilities are 0

We replaced the brain network of each subject with its largest connected component, which did not introduce a significant change to the data: the distribution of the order of the resulting 816 graphs had a mean of about 116.0 and a standard deviation of 0.1.

We estimated the mGH distances in all 332,520 distinct pairs of the connected brain networks. Average graph order and computing time per pair were distributed as \(116.0 \pm 0.1\) and \(0.40\textrm{s} \pm 20.29\textrm{s}\), respectively. The algorithm found exact mGH distances in 78.6% of the graph pairs, with the relative error \(\eta \) and the utility coefficient \(\upsilon \) distributed as \(0.072 \pm 0.139\) and \(0.361 \pm 0.214\), respectively. We notice that using Theorems 1 and 2 reduced the relative error by a factor of 5 on average.

We ran LoOP on the brain networks using their (approximated) pairwise mGH distances \({\tilde{d}}\). The resulting outlier probability assigned to each brain network (Fig. 5) thus measures the abnormality of its shape.

To see if the abnormal shape of a brain network corresponds to certain features of the individual, we applied the thresholds of 0.999 and 0.95 to the outlier probabilities. However, we were unable to relate the outlying brain network shapes to outlying values of the available measurements, either for the 3 subjects who scored above the threshold of 0.999 or for the 54 subjects who scored above 0.95.

To see if the brain networks of subjects within the same diagnostic group tend to have similar shape, we performed a cluster analysis based on the mGH distances \({\tilde{d}}\). We used the |scipy| implementation of the hierarchical agglomerative clustering algorithm to split the 816 networks into two clusters (matching the number of diagnostic groups in the dataset). The smaller cluster was comprised of the same 3 subjects who scored above the outlier probability threshold of 0.999. Discarding them as outliers and rerunning the analysis resulted in two clusters of 51 and 762 subjects, respectively. The clusters did not show any correspondence with the diagnostic groups, thus providing no evidence that the within-group mGH distances are smaller than the inter-group ones. However, we notice a significant overlap between the 54 subjects with an outlier probability above 0.95 and the cluster of 51 individuals, with 47 people shared between the two groups. This implies that the abnormal brain networks tend to be closer to one another than to the “regular” brain networks, and suggests that the abnormality of brain network shape is influenced by features that are not included in the dataset.
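A minimal sketch of this clustering step on the precomputed mGH distance matrix follows; the linkage method shown is an assumption made for illustration, as the text does not fix one.

```python
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

def cluster_by_mgh(dist, n_clusters=2, method='complete'):
    """Agglomerative clustering of objects from their pairwise distance matrix."""
    condensed = squareform(dist, checks=False)   # condensed form expected by linkage
    z = linkage(condensed, method=method)
    return fcluster(z, t=n_clusters, criterion='maxclust')
```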

We conclude that the algorithm estimates the mGH distance between Spearman’s correlation-based functional brain networks with high accuracy. However, detected shape abnormalities do not seem to correspond to a conclusive pattern related to autism spectrum disorder identification.

6.5 Synthetic networks

To test the performance of the algorithm on synthetic networks, we generated 100 graphs for each of the Erdős–Rényi (random), Watts–Strogatz (small-world), and Barabási–Albert (scale-free) network models. The order n of each graph was selected uniformly at random between 10 and 200, and the other parameters of each model were based on n. In the Erdős–Rényi G(n, p) model, the probability p of an edge appearing between a pair of vertices was selected uniformly at random between \(\frac{0.5\log n}{n}\) and \(\frac{1.5\log n}{n}\). In the Watts–Strogatz G(n, 2k, p) model, k, half of the average vertex degree, was selected uniformly at random between 1 and \(\lfloor 0.5 \log ^2 n \rceil \), and the probability p of an edge being rewired was selected uniformly at random between \(\frac{0.5\log n}{n}\) and \(\frac{1.5\log n}{n}\). In the Barabási–Albert G(n, m) model, the number of edges m attached from each new node to the existing nodes was selected uniformly at random between 1 and \(\lfloor \log ^2 n \rceil \).
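The sampling scheme can be sketched with networkx (Hagberg et al. 2008) as follows, taking \(\lfloor \cdot \rceil \) to mean rounding to the nearest integer.

```python
import math
import random
import networkx as nx

def sample_synthetic_graph(model):
    """Draw one graph according to the sampling scheme described above (a sketch)."""
    n = random.randint(10, 200)
    log_n = math.log(n)
    if model == 'erdos_renyi':
        p = random.uniform(0.5 * log_n / n, 1.5 * log_n / n)
        return nx.erdos_renyi_graph(n, p)
    if model == 'watts_strogatz':
        k = random.randint(1, round(0.5 * log_n ** 2))
        p = random.uniform(0.5 * log_n / n, 1.5 * log_n / n)
        return nx.watts_strogatz_graph(n, 2 * k, p)   # networkx takes 2k neighbors directly
    if model == 'barabasi_albert':
        m = random.randint(1, round(log_n ** 2))
        return nx.barabasi_albert_graph(n, m)
    raise ValueError(f'unknown model: {model}')
```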

After generating the graphs, we replaced each of them with its largest connected component. For each of the three sets, we estimated the mGH distances in all distinct pairs of the 100 connected graphs therein. Table 3 shows the average graph order in a pair and the performance metrics of the algorithm, aggregated per dataset.

Table 3 Performance of the algorithm on the synthesized networks

We notice that the algorithm performs significantly worse on the Erdős–Rényi graphs. One possible explanation is that there are fewer identically connected vertices in random graphs than in those resembling real-world networks, which contributes to the combinatorial complexity of the search for distortion-minimizing mappings to obtain the upper bound. Recall from Sect. 6.1 that inaccurate computations of the upper bound alone can have a detrimental effect on both \(\eta \) and \(\upsilon \).

Another interesting observation is that Theorems 1 and 2 have smaller utility when applied to the Watts–Strogatz graphs. Recall that a Watts–Strogatz small-world graph is generated from a lattice ring with each node connected to its 2k neighbors (k for each side), by randomly rewiring a fraction (roughly, p) of its edges. For a small p, the rewired edges serve as shortcuts between the otherwise remote vertices and have a highly nonlinear effect on the diameter (Watts and Strogatz 1998). This allows for high variability in the diameters of generated graphs, thus contributing to the tightness of the baseline lower bounds \(b'_L \buildrel \textrm{def}\over =\frac{1}{2}|{{\,\textrm{diam}\,}}X - {{\,\textrm{diam}\,}}Y|\).

We conclude that the algorithm performs better on graphs with scale-free and small-world properties, which are observed in many real-world networks.

7 Conclusion

The main contribution of this work is a feasible method for finding a lower bound for the mGH (and therefore GH) distance between finite metric spaces. The approach, based on the introduced notion of d-bounded distance samples, yields a polynomial-time algorithm for estimating the mGH distance. The algorithm is implemented, for the case of compact metric spaces induced by unweighted graphs, as part of the Python |scikit-tda| library. It is also shown that, for unweighted graphs of order N whose diameter scales at most logarithmically with N, the algorithm has time complexity \(O(N^3 \log N)\).

To test the algorithm’s performance, we applied it to both real and synthesized networks. Among the synthesized networks we tested, the best performance was observed for graphs with scale-free and small-world properties. We have also found that the algorithm performed well on the real-world email exchange, computer, and brain networks. The mGH distance was used to successfully detect outlying shapes corresponding to events of significance. This suggests that the proposed algorithm may be useful for graph shape matching in various application domains.