1 Introduction

Evolutionary algorithms (EAs) are randomised search heuristics imitating principles of natural evolution. They maintain a population, a multi-set of candidate solutions, and aim to iteratively improve the quality (or fitness) of solutions in the population by creating new solutions via mutation and/or crossover of current population members, and then selecting promising solutions to form the next generation’s population [1]. EAs have proven to perform excellently in many domains, including—but not limited to—multi-objective optimisation, logistics, and transportation [2].

The classic focus of EAs is on pure (global) optimisation, i.e., finding a single high-performing solution to the problem at hand. In multi-modal optimisation [3], by contrast, the goal is to find all or at least most of multiple local and/or global optima. Such algorithms usually use different means of explicit diversity preservation to allow a more exploratory optimisation process and to prevent the population from converging into local optima, which often act as traps. Another related EA branch, first introduced by Ulrich and Thiele [4] in the continuous domain, is evolutionary diversity optimisation (EDO). In EDO the aim is to evolve a population of diverse solutions which are all high-performers (i.e., they all satisfy a minimum quality threshold which may be set a priori or adapted dynamically). Usually, diversity is measured by a domain- or even problem-specific diversity function. Recently, EDO has gained increasing interest in combinatorial optimisation. Even more recently, a new branch of evolutionary computation termed quality diversity (QD) emerged [5]. In the so-called MAP-Elites approach [6], the search space is partitioned into disjoint regions of interest termed cells. The goal is to find high-quality solutions for each cell, making diversity explicit rather than implicit as is done, e.g., in EDO. QD algorithms have major applications in the field of robotics [7] and only recently have been successfully applied in, e.g., TSP instance space coverage [8] or combinatorial optimisation for the knapsack problem [9] and the travelling thief problem [10].

Stemming from engineering applications, the theory of evolutionary algorithms was largely neglected in the early years. With some delay it developed into a mature field with many runtime results on pseudo-Boolean optimisation, many combinatorial optimisation problems [11], multi-modal optimisation [12], diversity preservation [13] and—rather recently—advances in multi-objective optimisation [14] and EDO [15]. However, at the time of writing—to the best of our knowledge—there is just one work on the runtime analysis of quality diversity algorithms. Nikfarjam et al. [10] analyse a QD algorithm for the knapsack problem. They show that the studied QD algorithm, operating on a suitable two-dimensional feature space, mimics the dynamic programming approach for the knapsack problem, and they prove upper bounds on the expected optimisation time. Apart from this, there is no theoretical work explaining the potential benefits of QD. Many questions remain open: when does QD perform well, and why? How does its performance compare with that of established algorithms? How can one design effective QD algorithms for interesting problem classes? Many of these questions can be addressed by runtime analysis.

We perform runtime analyses of a simple MAP-Elites quality diversity algorithm, termed QD, in the context of pseudo-Boolean optimisation and combinatorial optimisation. QD operates on the k-number-of-ones (k-NoO) feature space where \(k \in \{1, \ldots , n+1\}\) steers the granularity of the feature space, i.e., the number of cells. The ith cell of the map for \(i = 1, \ldots , (n+1)/k=:L\) describes the niche of solutions with the number of ones in the interval \([(i-1)k,ik-1]\).

We consider three performance measures: the time to find an optimal search point (optimisation time), the time to cover all cells (cover time) and the time to find an optimal search point within each cell (optimal-cover time).

In Sect. 3 we show that the cover time of QD operating on the k-NoO feature space on every pseudo-Boolean fitness function is \(O(n^2 \log n)\) for \(k=1\) and \(O(n/(\sqrt{k}4^k p_m^k))\) for larger k, where \(p_m = \Theta (1/n)\) is the mutation probability. For granularity \(k=1\) this immediately gives upper bounds of \(O(n^2 \log n)\) for all functions of unitation (where the fitness only depends on the number of ones) and all monotone functions [16, 17], demonstrating that QD can be powerful if the feature space is well aligned with relevant problem characteristics. We show that the aforementioned bounds are tight for OneMax and that the optimal-cover time on OneMax is of the same order as well.

Moreover, we give a general fitness-level upper bound for QD and consider its performance on classes of transformed functions on which the feature space is not necessarily well aligned with the problem. On all transformed OneMax functions the optimal-cover time is still of order \(O(n^2 \log n)\). Applying a worst-case transformation to Trap functions turns an expected optimisation time of \(O(n^2 \log n)\) into an exponential time. However, there is still a small benefit for QD as the worst possible running time is drastically smaller than that of a (1+1) EA.

In Sect. 4 we further study QD on the k-NoO feature space in the domain of monotone submodular function maximisation which is known to be NP-hard in general. QD finds a \((1-1/e)\)-approximation on any monotone submodular function with a single cardinality constraint r in expected time \(O(n^2(\log (n) + r))\), matching the upper bound shown by Friedrich and Neumann [18] for the multi-objective optimiser GSEMO. Both GSEMO and our QD operate similarly, but QD is simpler and more straightforward from an algorithmic perspective. This application indicates an interesting relationship between the definition of a feature space in the context of quality diversity optimisation and the fitness-function construction—encoding features of the encoded solutions as additional objectives—in multi-objective optimisation.

Last but not least, in Sect. 5 we focus on the minimum spanning forest (MSF) problem on an edge-weighted undirected graph with n nodes, m edges, and \(C(G) \in \{1, \ldots , n\}\) connected components (CCs); the minimum spanning tree (MST) problem is the special case of connected graphs with \(C(G)=1\). Solutions are encoded as bit-strings over \(\{0,1\}^m\), i.e., the feature space captures a feature of the phenotype of the encoded solutions. We first consider a two-dimensional feature space with cells for every feasible number of CCs and every possible number of selected edges, referred to as the CCE feature space. We show an upper bound of \(O(m^2(n-C(G))\max \{n-C(G), \log n\})\) for finding a minimum spanning forest on G. Then we observe that a simpler feature space is sufficient and leads to better performance. In the CC feature space we have cells for each feasible number of CCs. QD operating on this CC feature space finds a minimum spanning forest of any edge-weighted source graph in at most \(O((n-C(G)+1)m\max \{n-C(G)+1,\log n\})\) steps, provided all edge weights are bounded by \(2^{O(n)}\). As a consequence, an MST is found after \(O(n^2m)\) steps in expectation on any connected graph. Again, the algorithm works similarly to the bi-objective formulation studied by Neumann and Wegener [19], who used the number of connected components and the total weight of the encoded solution as two conflicting objectives to be minimised simultaneously, leading to the same time bound of \(O(n^2 m)\). QD is a simpler algorithm since it does not require the concept of Pareto-dominance to deal with the partial ordering of the bi-objective fitness vectors, and the runtime analysis is more straightforward as a consequence.

A preliminary version of this work with parts of the results has been published at a conference [20]. In this improved and extended manuscript we added the general fitness-level upper bound, results on transformations of fitness functions (Sect. 3.4) and the two-dimensional feature space CCE for computing minimum spanning forests.

2 The QD Algorithm

Algorithm 1: QD algorithm

The outline of the studied quality diversity algorithm QD is given in Algorithm 1. The algorithm follows the MAP-Elites scheme and operates on functions \(f :\{0,1\}^n \rightarrow {\mathbb {R}}\). First, QD initialises a map M which, over the course of optimisation, stores one solution for each cell of the feature space. The specific details will be discussed later; for now assume that M is a function that maps \(\{0,1\}^n\) to a feature space. In the beginning the map is empty. Next, an initial solution \(x \in \{0,1\}^n\) is sampled uniformly at random. This solution is stored alongside its objective function value in the map in cell M(x). After initialisation the evolutionary loop runs until a stopping condition is met. In each iteration a parent x is sampled uniformly at random from the set of already covered map cells. A single offspring y is generated from x by bit-wise mutation with mutation probability \(p_m\). If the cell at position M(y) is not yet covered, y is stored unconditionally. If M(y) is already populated, say with z, the fitness function decides which solution to keep: if f(y) is no worse than f(z), y replaces z in the respective map cell.
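The loop just described can be sketched in a few lines. This is an illustrative sketch, not the paper's pseudocode: the names `fitness` and `cell_of`, and the fixed evaluation `budget` standing in for the stopping condition, are placeholders, and maximisation is assumed.

```python
import random

def qd(fitness, cell_of, n, p_m, budget):
    """Sketch of the MAP-Elites-style QD loop of Algorithm 1.

    fitness: pseudo-Boolean function on bit-strings (tuples of 0/1), maximised.
    cell_of: the feature map M, returning a hashable cell index.
    """
    M = {}  # the map: cell index -> (solution, fitness); initially empty

    # sample the initial solution uniformly at random and store it
    x = tuple(random.randint(0, 1) for _ in range(n))
    M[cell_of(x)] = (x, fitness(x))

    for _ in range(budget):
        # select a parent uniformly at random from the covered cells
        parent, _ = random.choice(list(M.values()))
        # bit-wise mutation: flip each bit independently with probability p_m
        y = tuple(b ^ (random.random() < p_m) for b in parent)
        fy, c = fitness(y), cell_of(y)
        # store y if its cell is empty or y is no worse than the incumbent
        if c not in M or fy >= M[c][1]:
            M[c] = (y, fy)
    return M
```

On OneMax with the 1-NoO feature space both the fitness and the cell index equal the number of ones, so `qd(sum, sum, n, 1/n, budget)` suffices.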

We define the following goals and corresponding hitting times.

  • Optimisation: let \(T_{f, \text {OPT}}\) be the random number of function evaluations until QD finds a global optimum of f.

  • Coverage: let \(T_{f, \text {C}}\) be the random number of function evaluations until QD has found at least one solution for each map cell.

  • Optimal coverage: let \(T_{f, \text {COPT}}\) be the random number of function evaluations until an optimal solution is found for each map cell.

Note that the optimal-coverage time is the largest of all measures: \(T_{f, \text {C}} \le T_{f, \text {COPT}}\) and \(T_{f, \text {OPT}} \le T_{f, \text {COPT}}\).

We say that an event occurs with overwhelming probability if its probability is at least \(1-2^{-\Omega (n^\varepsilon )}\) for some constant \(\varepsilon > 0\). By \(H(x, y)\) we denote the Hamming distance (number of disagreeing bits) between two search points \(x, y \in \{0, 1\}^n\).

We will consider three different feature spaces, a generic feature space for pseudo-Boolean optimisation based on the number of ones, and two tailored feature spaces for computing minimum spanning forests. The latter two will be introduced later on, in Sect. 5.

Definition 1

The number-of-ones (NoO) feature space is defined as follows. Given a granularity parameter \(k \in \{1, \ldots , n+1\}\) with k dividing \(n+1\), the map M is of size \(L {:}{=}\frac{n+1}{k}\) and the i-th cell, \(i \in \{1, \ldots , L\}\), stores the best solution found so far amongst all solutions with a number of ones in \([(i-1)k, ik-1]\). We refer to this as the k-NoO feature space.

Note that \(k=1\) matches the arguably most natural feature space storing a best search point for every number of ones, that is, every level of the Boolean hypercube. It should be noted that QD can be implemented efficiently for the k-NoO feature space: the map can be represented via an array of length L. Given a solution \(x \in \{0,1\}^n\), we can calculate the cell index M(x) in constant time, as cell i corresponds to array position \(i-1 = \lfloor |x|_1/k\rfloor \).
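The lookup can be sketched as follows (illustrative; `noo_cell` is a hypothetical helper name, and the 0-based array position corresponds to cell \(i\) of Definition 1 shifted by one):

```python
def noo_cell(x, k):
    """0-based array position of the k-NoO cell containing bit-string x.

    Cell i of Definition 1 stores solutions with |x|_1 in [(i-1)k, ik-1],
    so x lands at array position floor(|x|_1 / k) = i - 1.
    """
    # sum(x) is O(n) here; in QD the number of ones can be maintained
    # incrementally under mutation, making the lookup constant time.
    return sum(x) // k
```

For example, with \(n=5\) and \(k=2\) there are \(L=3\) cells; a solution with three ones belongs to cell 2, i.e., array position 1.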

The number-of-ones feature space is a natural space for the Boolean hypercube and it is well aligned with many well-studied fitness functions. Functions of unitation, e.g., OneMax, Jump and Trap, only depend on the number of ones. These functions are defined in this way for simplicity and notational convenience. For some combinatorial optimisation problems, e.g., selection problems like Knapsack or Vertex Cover, as well as for optimising submodular functions, the number of ones is a natural property.

3 Pseudo-Boolean Optimisation

3.1 Upper Bounds for Cover Times

We first give an upper bound on the expected time to cover all cells. This upper bound applies to every pseudo-Boolean fitness function and relies on worst-case probabilities of jumping from one cell to an undiscovered neighbouring cell.

Theorem 1

Consider QD operating on the k-NoO feature space with granularity \(1 \le k \le n+1\), k dividing \(n+1\), with an arbitrary initial search point and mutation rate \(p_m = c/n\) for a constant \(c > 0\). For every fitness function, the expected cover time is \(O(n^2\log n)\) for \(k=1\) and \(O\left( Lp_m^{-k}/\left( {\begin{array}{c}2k-1\\ k\end{array}}\right) \right) = O(n/(\sqrt{k}4^kp_m^k))\) for \(k \ge 2\).

Note that, while Theorem 1 bounds the cover time irrespective of the fitness, the optimal-cover time depends on the fitness function. In particular, the cell \(\lceil L/2 \rceil \) contains all \(\Omega (2^n/\sqrt{n})\) search points with \(\lfloor n/2 \rfloor \) ones, and the fitness function in this cell could be deceptive or a needle-in-a-haystack function, implying exponential optimal-cover times. Hence the optimal-cover time cannot be bounded as in Theorem 1.

To prove Theorem 1, we show the following technical lemma.

Lemma 2

For all \(k \in {\mathbb {N}}\), \(k \ge 2\) and \(i \ge 3\),

$$\begin{aligned} \left( {\begin{array}{c}ik-1\\ k\end{array}}\right) \ge \frac{i^2}{3} \cdot \left( {\begin{array}{c}2k-1\\ k\end{array}}\right) . \end{aligned}$$

Proof

$$\begin{aligned} \frac{\left( {\begin{array}{c}ik-1\\ k\end{array}}\right) }{\left( {\begin{array}{c}2k-1\\ k\end{array}}\right) } =\;&\frac{(ik-1)!(k-1)!}{(ik-1-k)!(2k-1)!}\\ =\;&\frac{ik-1}{2k-1} \cdot \frac{ik-2}{2k-2} \cdots \frac{ik-k+1}{k+1} \cdot \frac{ik-k}{k}\\ \ge \;&\frac{ik-k+1}{k+1} \cdot \frac{ik-k}{k} = \frac{ik-k+1}{k+1} \cdot (i-1). \end{aligned}$$

Since the fraction is non-decreasing with k, its minimum is attained for \(k=2\) and we obtain the lower bound

$$\begin{aligned} \frac{2i-2+1}{2+1} \cdot (i-1) = \frac{(2i-1)(i-1)}{3} = \frac{2i^2 - 3i + 1}{3}. \end{aligned}$$

Using \(i \ge 3\), which implies \(i^2 \ge 3i\), we bound the numerator from below by \(2i^2 - 3i + 1 \ge i^2 + 1 \ge i^2\). This completes the proof. \(\square \)
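As a numerical sanity check (not part of the proof), Lemma 2 can be verified in exact integer arithmetic for small parameters:

```python
from math import comb

def lemma2_holds(i, k):
    # C(ik-1, k) >= (i^2 / 3) * C(2k-1, k), rearranged to avoid fractions
    return 3 * comb(i * k - 1, k) >= i * i * comb(2 * k - 1, k)

# verify the lemma for k = 2..12 and i = 3..40
assert all(lemma2_holds(i, k) for k in range(2, 13) for i in range(3, 41))
```

The case \(i = 3\), \(k = 2\) shows that the constant 3 is nearly tight: \(\binom{5}{2} = 10\) versus \((i^2/3)\binom{3}{2} = 9\).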

Proof of Theorem 1

We assume \(k \le (n+1)/2\) as otherwise the only k dividing \(n+1\) is \(k=n+1\), where we only have one cell and the cover time is trivial. After initialisation one cell i is covered and QD will maintain a solution \(x^{(i)}\) whose number of ones is in \([k(i-1), ki-1]\). Note that the precise search point can be replaced by another search point having the same property and a fitness value that is no worse. If \(i > 1\) then cell \(i-1\) can be covered by choosing the current search point \(x^{(i)}\) in cell i as a parent and creating an offspring whose number of ones is in \([k(i-2), k(i-1)-1]\). In the worst case, \(|x^{(i)}|_1 = ki-1\) and then selecting \(x^{(i)}\) as a parent and flipping exactly k one-bits in \(x^{(i)}\) will create an offspring in cell \(i-1\). Since the probability of selecting a parent from one out of at most L covered cells is always at least 1/L, the probability of this event is at least

$$\begin{aligned} s_i&\ge \frac{1}{L} \cdot \left( {\begin{array}{c}ki-1\\ k\end{array}}\right) \cdot p_m^k (1-p_m)^{n-k}. \end{aligned}$$

The expected time for covering all cells with an index of at most i is bounded by \(\sum _{j=2}^{i} \frac{1}{s_j} \le \sum _{j=2}^{L} \frac{1}{s_j}\). The same arguments apply symmetrically for reaching cells with a larger number of ones, assuming that in the worst case a search point in cell j has \(k(j-1)\) ones and swapping the roles of zeros and ones. Thus, the expected time to cover all cells is bounded by

$$\begin{aligned} 2\sum _{i=2}^{L} \frac{1}{s_i}&\le \frac{2L}{p_m^{k}(1-p_m)^{n-k}} \cdot \sum _{i=2}^{L} \frac{1}{\left( {\begin{array}{c}ki-1\\ k\end{array}}\right) }. \end{aligned}$$

Now for the summation in the last line we make a case distinction. For \(k=1\) the summation simplifies to

$$\begin{aligned} \sum _{i=2}^L \frac{1}{\left( {\begin{array}{c}i-1\\ 1\end{array}}\right) } = \sum _{i=2}^L \frac{1}{i-1} = \sum _{i=1}^{L-1} \frac{1}{i} = H(n), \end{aligned}$$

where H(n) denotes the n-th harmonic number, as \(k=1\) implies \(L-1= \frac{n+1}{k} -1 = n\). Since \(H(n) = \ln (n) + \Theta (1)\), along with \(p_m = \Theta (1/n)\) and \((1-p_m)^{n-k} = (1-c/n)^{n-k} = \Theta (1)\), this proves the claimed upper bound for \(k=1\).

For \(k \ge 2\), applying Lemma 2 to all summands with \(i \ge 3\),

$$\begin{aligned} \sum _{i=2}^{L} \frac{1}{\left( {\begin{array}{c}ki-1\\ k\end{array}}\right) } \le \;&\frac{1}{\left( {\begin{array}{c}2k-1\\ k\end{array}}\right) } \cdot \left( 1 + 3 \sum _{i=3}^L \frac{1}{i^2}\right) = \frac{O(1)}{\left( {\begin{array}{c}2k-1\\ k\end{array}}\right) } \end{aligned}$$

since \(\sum _{i=3}^L \frac{1}{i^2} \le \sum _{i=1}^\infty \frac{1}{i^2} = \frac{\pi ^2}{6} = O(1)\). Hence, along with \((1-p_m)^{n-k} = (1-c/n)^{n-k} = \Theta (1)\), the expected coverage time is \(O\left( Lp_m^{-k}/\left( {\begin{array}{c}2k-1\\ k\end{array}}\right) \right) \). Using well known results on the largest binomial coefficient (see, e.g., [21, Equation (1.4.18), Corollary 1.4.12]), \(\left( {\begin{array}{c}2k-1\\ k\end{array}}\right) = 2^{2k-1}/\Theta (\sqrt{k}) = \Theta (4^k/\sqrt{k})\), we obtain the simplified bound \(O(Lp_m^{-k} \cdot \sqrt{k}/4^k) = O(n/(\sqrt{k}4^kp_m^k))\). \(\square \)
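The estimate \(\left( {\begin{array}{c}2k-1\\ k\end{array}}\right) = \Theta (4^k/\sqrt{k})\) used in the last step can also be checked numerically (a quick sanity check, not part of the proof); by Stirling's formula the ratio approaches \(1/(2\sqrt{\pi }) \approx 0.282\):

```python
from math import comb, sqrt

# ratio of C(2k-1, k) to 4^k / sqrt(k); it stays within a constant band,
# approaching 1/(2*sqrt(pi)) ~ 0.282 from below as k grows
ratios = [comb(2 * k - 1, k) * sqrt(k) / 4**k for k in range(1, 80)]
assert all(0.24 < r < 0.29 for r in ratios)
```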

We remark that the term \(p_m^{-k}\) (along with a factor \((1-p_m)^{-n+k} = \Theta (1)\) that was absorbed in the asymptotic notation) reflects the expected time to perform a mutation flipping precisely k bits and the binomial coefficient \(\left( {\begin{array}{c}2k-1\\ k\end{array}}\right) \) reflects the number of possible choices for these k flipping bits. The term \(p_m^{-k}\) could possibly be reduced by using heavy-tailed mutations [22, 23] that increase the probability of flipping many bits. This is outside the scope of this work, though.

3.2 Functions of Unitation and Monotone Functions

Our general Theorem 1 guarantees good performance for QD if the feature space aligns favourably with the problem at hand. We give two such examples. Functions of unitation only depend on the number of ones in the bit string. Well-known examples include OneMax\((x) {:}{=}\sum _{i=1}^n x_i\), Jump [24], Cliff [25], Hurdle [26] and Trap (see Sect. 3.4), functions of wildly varying difficulty for evolutionary algorithms. For functions of unitation and the 1-NoO feature space (\(k=1\)), covering all cells is equivalent to covering all cells optimally, which in turn implies that a global optimum has been found. Hence Theorem 1 yields the following simple implication.

Corollary 3

The expected cover time, the expected optimal-cover time, and the expected time to find a global optimum of QD operating on the 1-NoO feature space, with mutation rate \(p_m = c/n\), \(c > 0\) constant, on any function of unitation are all \(O(n^2 \log n)\).

Note that for larger k, the optimal-cover time on functions of unitation can be larger than the cover time. If in cell 1 the all-zeros string is optimal and search points with \(k-1\) ones have the second-best fitness of that cell, a mutation flipping \(k-1\) specific bits is required to optimally cover this cell. Along with a factor of \(\Theta (L)\) for selecting a parent from cell 1, this yields an expected time of \(\Omega (L p_m^{-k+1})\) from a typical population covering cell 1 and \(\Omega (L)\) other cells. If k is large enough such that \(4^k/\sqrt{k} = \omega (1/p_m)\), this is larger than the upper bound from Theorem 1.

It should be pointed out that Corollary 3 only holds because the feature space aligns with the definition of functions of unitation. It is not robust in the sense that changing the encoding of functions of unitation by swapping the meaning of zeros and ones for selected bits immediately invalidates the proof. We will consider transformations of fitness functions in more detail in Sect. 3.4.

A second example is the class of monotone functions [16, 17, 27,28,29]. A pseudo-Boolean function is called monotone if every mutation flipping only 0-bits to 1 and not flipping any 1-bits strictly increases the fitness. This implies that \(1^n\) is the unique global optimum. The class of monotone functions includes all linear functions with positive weights. All monotone functions can be solved by randomised local search in expected time \(O(n \log n)\). But, surprisingly, there are monotone functions on which the (1+1) EA with mutation rate c/n, for a sufficiently large constant \(c > 0\), requires exponential time with high probability [16, 17]. The reason is that mutation probabilities larger than 2.2/n frequently lead to mutations flipping both zeros and ones (to which the monotonicity requirement does not apply) and to accepting search points with a smaller number of ones. Since QD on the 1-NoO feature space stores every increase in the number of ones, it runs in expected time \(O(n^2 \log n)\) for every mutation rate c/n.

Corollary 4

The expected time until QD operating on the 1-NoO feature space, with mutation rate \(p_m = c/n\), \(c > 0\) constant, finds a global optimum on any monotone function is \(O(n^2 \log n)\).

Even though every mutation that flips only 0-bits strictly increases the fitness, this property is not needed for Corollary 4 as empty cells are filled regardless of fitness. But it suggests that the result on monotone functions may be more robust to fitness transformations than our result on functions of unitation. Indeed, we will show in Sect. 3.4 that our upper bound on functions of unitation is not robust in the sense that exponential optimisation times occur with overwhelming probability for worst-case transformed functions of unitation.

3.3 Tight Bounds for OneMax

We also show that the upper bound from Theorem 1 is asymptotically tight for OneMax, for all values of k. For \(k \ge 2\) the reason is that, on OneMax, when cell 2 is reached, the best solution in cell 2 is one with \(2k-1\) ones. The best chance of reaching cell 1 from there is to pick M(2) as a parent and to flip k out of \(2k-1\) 1-bits. This shows that even for the simplest possible fitness function (with a unique optimum), the expected cover time grows exponentially with the granularity k (except for the trivial setting of \(k=n+1\), that is, having one cell only, for which QD equals the (1+1) EA and the expected optimal-cover time is \(O(n \log n)\)).

Theorem 5

Consider QD operating on the k-NoO feature space, with \(1 \le k \le (n+1)/2\) and mutation rate \(p_m = c/n\), \(c > 0\) constant, on OneMax. The expected cover time and the expected optimal-cover time are both in \(\Theta (n^2\log n)\) for \(k=1\) and \(\Theta \left( Lp_m^{-k}/\left( {\begin{array}{c}2k-1\\ k\end{array}}\right) \right) = \Theta (n/(\sqrt{k}4^kp_m^k))\) for \(k \ge 2\).

We first show a technical lemma on decaying jump probabilities.

Lemma 6

Let \(p_{i, j}\) denote the probability of a standard bit mutation with mutation probability \(p_m\) creating a mutant with j ones from a parent with i ones. For all \(0< p_m < 1\), all \(j < i\) and all \(k \in \{0, \ldots , j\}\),

$$\begin{aligned} p_{i, j-k} \le \left( \frac{ip_m}{1-p_m}\right) ^k \cdot p_{i, j}. \end{aligned}$$

Proof

For \(j < i\) and \(\ell \ge 0\) let \(p_{i, j, \ell }\) denote the probability of a standard bit mutation with mutation probability \(p_m\) creating a mutant with j ones from a parent with i ones by flipping \(i-j+\ell \) ones and flipping \(\ell \) zeros, that is,

$$\begin{aligned} p_{i, j, \ell } = \left( {\begin{array}{c}i\\ i-j+\ell \end{array}}\right) \left( {\begin{array}{c}n-i\\ \ell \end{array}}\right) p_m^{i-j+2\ell } (1-p_m)^{n-i+j-2\ell }. \end{aligned}$$

Note that \(p_{i, j, \ell } = 0\) if \(\ell > \min \{j, n-i\}\). We claim that for all \(i \in \{2, \ldots , n\}\), all \(d \in \{1, \ldots , i\}\) and all \(\ell \in {\mathbb {N}}_0\)

$$\begin{aligned} p_{i, i - d, \ell } \ge \frac{1-p_m}{ip_m} \cdot p_{i, i - d - 1, \ell }. \end{aligned}$$
(1)

This is trivial for \(p_{i, i - d - 1, \ell } = 0\), hence we assume \(p_{i, i - d - 1, \ell } > 0\) (which implies \(i - d - \ell \ge 1\)) and have

$$\begin{aligned} \frac{p_{i, i-d, \ell }}{p_{i, i-d-1, \ell }} =\;&\frac{\left( {\begin{array}{c}i\\ d+\ell \end{array}}\right) \left( {\begin{array}{c}n-i\\ \ell \end{array}}\right) p_m^{d+2\ell }(1-p_m)^{n-d-2\ell }}{\left( {\begin{array}{c}i\\ d+\ell +1\end{array}}\right) \left( {\begin{array}{c}n-i\\ \ell \end{array}}\right) p_m^{d+2\ell +1}(1-p_m)^{n-d-2\ell -1}} = \frac{d+\ell +1}{i-d-\ell } \cdot \frac{1-p_m}{p_m}. \end{aligned}$$

Since \(d + \ell \ge 1\) and \(i \ge 2\), we have \(\frac{d+\ell +1}{i-d-\ell } \ge \frac{2}{i-1} \ge \frac{1}{i}\) and this implies (1). Consequently,

$$\begin{aligned} p_{i, i-d} = \sum _{\ell =0}^{\min \{i-d,n-i\}} p_{i, i - d, \ell } \ge \sum _{\ell =0}^{\min \{i-d,n-i\}} \frac{1-p_m}{ip_m} \cdot p_{i, i - d - 1, \ell } = \frac{1-p_m}{ip_m} \cdot p_{i, i-d-1}. \end{aligned}$$

The statement now follows from applying (1) repeatedly and bounding the resulting factors as

$$\begin{aligned} \frac{1-p_m}{ip_m} \cdot \frac{1-p_m}{(i-1)p_m} \cdot \cdots \cdot \frac{1-p_m}{(i-k+1)p_m} \ge \left( \frac{1-p_m}{ip_m}\right) ^k. \end{aligned}$$

\(\square \)
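Lemma 6 can likewise be verified exactly for concrete parameters by computing the transition probabilities \(p_{i,j}\) via direct summation over \(\ell \) (a numerical sanity check under standard bit mutation, not part of the proof; `p_trans` is a hypothetical helper name):

```python
from math import comb

def p_trans(n, p, i, j):
    """Exact probability that standard bit mutation with rate p turns a
    parent with i ones into an offspring with j <= i ones, summing over
    the number l of flipped zeros (so i - j + l ones are flipped)."""
    return sum(
        comb(i, i - j + l) * comb(n - i, l)
        * p ** (i - j + 2 * l) * (1 - p) ** (n - i + j - 2 * l)
        for l in range(min(j, n - i) + 1)
    )

# check p_{i, j-k} <= (i p / (1-p))^k * p_{i, j} for all valid i, j, k
n, p = 15, 1 / 15
for i in range(2, n + 1):
    for j in range(1, i):
        for k in range(j + 1):
            lhs = p_trans(n, p, i, j - k)
            rhs = (i * p / (1 - p)) ** k * p_trans(n, p, i, j)
            assert lhs <= rhs * (1 + 1e-9)  # tolerance for float rounding
```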

The following lemma shows that, with probability \(\Omega (1)\), the map will be partially filled by the time the cells with lower indices are reached. This implies that the probability of selecting a parent from the cell with the smallest index is only \(\Theta (1/L)\), slowing down the exploration of cells with fewer ones.

Lemma 7

Consider the scenario of Theorem 5. There exists a constant \(\varepsilon > 0\) such that, with probability \(\Omega (1)\), QD reaches a state where at least \(\varepsilon L\) cells are covered and cells 1 to \(\varepsilon L\) are not yet covered.

Proof

If \(k=(n+1)/2\) then since k divides \(n+1\) we know that \(n+1\) is even and there are just two symmetric cells 1 and 2. By symmetry, the probability of initialising in cell 2 is 1/2 and then the claim follows. The same arguments apply to values \(k < (n+1)/2\) and \(k = \Omega (n)\) as then \(L = O(1)\) and covering one cell with a solution of at least n/2 ones is sufficient when choosing \(\varepsilon := 1/L\).

In the following, we assume \(k = o(n)\) and define \(X_t\) as the smallest number of ones in any search point seen so far. Note that, trivially, \(X_{t+1} \le X_t\). We claim that with probability \(1-o(1)\) the following statements hold, where \(\alpha {:}{=}\alpha (n) {:}{=}\min \{n/3, n/(4c)\}\): (1) the algorithm is initialised with \(X_0 \ge \alpha \); (2) whenever \(X_t\) is decreased, it is decreased by at most \(\alpha /4\); (3) while \(X_t \in [1, \alpha ]\), the lengths of steps decreasing \(X_t\) have an exponential decay (as will be made precise below). Claim (1) follows from Chernoff bounds since \(\alpha \le n/3\). Claim (2) follows since decreasing \(X_t\) by at least \(\alpha /4\) requires flipping \(\Omega (n)\) bits. This has probability at most \(\left( {\begin{array}{c}n\\ \alpha /4\end{array}}\right) p_m^{\alpha /4} \le 2^n \cdot (c/n)^{\alpha /4} = n^{-\Omega (n)}\), whereas accepted steps decreasing \(X_t\) have probability at least \(\Omega (n^{-k}/L) = n^{-o(n)}\). Along with a union bound over O(n) values of \(X_t\), the conditional probability of decreasing \(X_t\) by at least \(\alpha /4\) throughout the run is \(O(n) \cdot n^{-\Omega (n)}/n^{-o(n)} = n^{-\Omega (n)}\). It remains to prove (3).

Assuming \(X_t \le \alpha \), we consider the decrease of \(X_t\) in steps in which \(X_{t+1} < X_t\) and consider the difference \(D_t {:}{=} X_t - X_{t+1}\). Adopting the notation from Lemma 6, we have

$$\begin{aligned} \textrm{Pr}\left( D_t \ge \ell \mid X_{t+1} < X_t, X_t = i\right) =\;&\frac{\sum _{j=0}^{i-\ell } p_{i, j}}{\sum _{j=0}^{i-1} p_{i, j}} \le \frac{\sum _{j=0}^{i-\ell } p_{i, j}}{\sum _{j=\ell -1}^{i-1} p_{i, j}}. \end{aligned}$$

Recalling \(\alpha \le n/(4c)\), by Lemma 6 we have for all \(i \le \alpha \) and all \(\ell \ge 1\)

$$\begin{aligned} p_{i, j} \ge \left( \frac{1-p_m}{ip_m}\right) ^{\ell -1} \cdot p_{i, j-\ell +1} \ge \left( \frac{n}{2ci}\right) ^{\ell -1} \cdot p_{i, j-\ell +1} \ge 2^{\ell -1} p_{i, j-\ell +1} \end{aligned}$$

for n large enough such that \(1-p_m \ge 1/2\). Hence the above is at most

$$\begin{aligned} \;&\frac{\sum _{j=0}^{i-\ell } p_{i, j}}{2^{\ell -1} \sum _{j=\ell -1}^{i-1} p_{i, j-\ell +1}} = \frac{\sum _{j=0}^{i-\ell } p_{i, j}}{2^{\ell -1} \sum _{j=0}^{i-\ell } p_{i, j}} = 2^{-(\ell -1)}. \end{aligned}$$

Thus, \((D_t \mid X_{t+1} < X_t)\) is stochastically dominated by a geometric random variable with parameter 1/2. In other words, the progress towards \(0^n\) in steps in which \(X_t\) decreases is stochastically dominated by a sequence \(Z_0, Z_1, \ldots \) of geometric random variables with parameter 1/2. Now choose \(\varepsilon {:}{=}\alpha /(12Lk)\) and note that \(\varepsilon = \Theta (1)\).

Consider a phase of \(\alpha /12\) steps. The expected total decrease in \(\alpha /12\) steps is at most \(\alpha /6\). By Chernoff bounds for sums of geometric random variables (Theorem 10.32 in [21]), the probability of the total decrease being at least \(\alpha /4\) is at most \(e^{-\Omega (\alpha )} = e^{-\Omega (n)}\). Hence, with the converse probability \(1 - e^{-\Omega (n)}\) at least \(\alpha /12\) steps decreasing \(X_t\) are necessary to reduce \(X_t\) by a total of \(\alpha /4\). By claims (1) and (2), with high probability the first \(X_t\)-value of at most \(\alpha \) is at least \(3\alpha /4\), hence the final \(X_t\)-value is at least \(\alpha /2\).

Since \(\alpha /2 = 6\varepsilon L k\), the event \(X_t \ge \alpha /2\) implies that the first \(\varepsilon L\) cells have not been reached yet. Since at most k decreasing steps may concern the same cell, during \(\alpha /12\) steps decreasing \(X_t\), at least \(\alpha /(12k) = \varepsilon L\) cells are visited. Observing that the union bound of all failure probabilities is o(1) completes the proof. \(\square \)

Now we put everything together to prove Theorem 5.

Proof of Theorem 5

The upper bounds on the expected cover time follow from Theorem 1. For \(k=1\) the cover time equals the optimal-cover time. For \(k \ge 2\) we use a simple and crude argument: as long as not all cells are covered optimally, there is a cell i whose search point can be improved. An improvement happens with probability \(\Omega (1/(Ln))\) if the right parent is selected and an improving Hamming neighbour is created. The expected waiting time for an improvement is thus \(O(nL)\), and at most n such steps are sufficient to cover all cells optimally. This only adds a term \(O(n^2L) = O(n^3)\) to the upper bound, which is no larger than the claimed runtime bound.

It remains to show the claimed lower bounds for the expected cover time. By Lemma 7, with probability \(\Omega (1)\), QD reaches a state where at least \(\varepsilon L\) cells are covered and the first \(\varepsilon L\) cells are not yet covered. We work under the condition that this happens and note that this only incurs a constant factor in the lower bound.

We first show the statement for \(k \ge 2\) and \(k \le \varepsilon (n+1)/2\). We may assume that \(\varepsilon (n+1)/2 \le n/(16c)\) as otherwise we may simply choose a smaller value for \(\varepsilon \). This ensures that the exponential decay of jump lengths shown in the proof of Lemma 7 applies for all search points with at most \(4k \le 4\varepsilon (n+1)/2\) ones. Let S be the set of all search points with at most \(2k-1\) ones. As the first \(\varepsilon L\) cells are not covered yet, the minimum number of ones seen so far is at least \(\varepsilon k L = \varepsilon (n+1) \ge 2k\). Applying Lemma 6 as in the proof of Lemma 7 we get that, for all \(2k \le i \le 4\varepsilon (n+1)/2\), the probability that a search point with exactly \(2k-1\) ones is reached, when S is reached for the first time, is

$$\begin{aligned}&\textrm{Pr}\left( X_{t+1} = 2k-1 \mid X_t = i \wedge X_{t+1} \le 2k-1\right) \\&\quad = \frac{p_{i, 2k-1}}{\sum _{j=0}^{2k-1} p_{i, j}} \ge \frac{p_{i, 2k-1}}{\sum _{j=0}^{2k-1} 2^{-(2k-1-j)} p_{i, 2k-1}} = \frac{1}{\sum _{j=0}^{2k-1} 2^{-j}} \ge \frac{1}{2}. \end{aligned}$$

The same probability bound also holds for \(i > 4\varepsilon (n+1)/2\) as the conditional probability of jumping to S from a search point with more than \(4\varepsilon (n+1)/2\) ones is asymptotically smaller than the probability of jumping to S from cell 3. Since we are optimising OneMax, the algorithm will maintain a search point with exactly \(2k-1\) ones in cell 2 forever. To discover cell 1, the algorithm either needs to pick the search point from cell 2 as parent and jump to a search point of at most \(k-1\) ones, or pick a parent from some cell i, \(i \ge 3\), that has at least \((i-1)k\) ones and jump to the same target. The probability of any such event is at most

$$\begin{aligned} \frac{1}{\varepsilon L} \left( \sum _{j=0}^{k-1} p_{2k-1, j} + \sum _{i=3}^{L}\sum _{j=0}^{k-1} p_{(i-1)k, j}\right) . \end{aligned}$$

For all \(m \ge k-1\), since at least \(m-(k-1)\) bits have to flip, \(\sum _{j=0}^{k-1} p_{m, j} \le \left( {\begin{array}{c}m\\ m-(k-1)\end{array}}\right) p_m^{m-k+1} = \left( {\begin{array}{c}m\\ k-1\end{array}}\right) p_m^{m-k+1}\), thus this is at most

$$\begin{aligned} \le \;\frac{1}{\varepsilon L} \left( \left( {\begin{array}{c}2k-1\\ k-1\end{array}}\right) p_m^k + \sum _{i=3}^{L} \left( {\begin{array}{c}(i-1)k\\ k-1\end{array}}\right) p_m^{(i-2)k+1}\right) . \end{aligned}$$

Using that for \(k \ge 2\)

$$\begin{aligned} \left( {\begin{array}{c}(i-1)k\\ k-1\end{array}}\right) \le \frac{((i-1)k)^{k-1}}{(k-1)!} \le (i-1)^{k-1} \frac{k^{k-1}}{(k-1)!} \le (i-1)^{k-1} \left( {\begin{array}{c}2k-1\\ k-1\end{array}}\right) \end{aligned}$$

we get, for \(i \ge 4\),

$$\begin{aligned} \left( {\begin{array}{c}(i-1)k\\ k-1\end{array}}\right) p_m^{(i-2)k+1} \le \;&(i-1)^{k-1} \left( {\begin{array}{c}2k-1\\ k-1\end{array}}\right) p_m^{(i-2)k+1}\\ =\;&\left( {\begin{array}{c}2k-1\\ k-1\end{array}}\right) p_m^k \cdot \frac{p_m}{i-1} \left( (i-1)p_m^{(i-3)}\right) ^k\\ \le \;&\left( {\begin{array}{c}2k-1\\ k-1\end{array}}\right) p_m^k \cdot \frac{p_m}{i-1} \end{aligned}$$

as \((i-1)p_m^{i-3} \le 1\) for all \(i \ge 4\) and large enough n.

The summand for \(i=3\) is

$$\begin{aligned} \left( {\begin{array}{c}2k\\ k-1\end{array}}\right) p_m^{k+1} = \left( {\begin{array}{c}2k-1\\ k-1\end{array}}\right) \cdot \frac{2k}{k+1} p_m^{k+1} \le \left( {\begin{array}{c}2k-1\\ k-1\end{array}}\right) 2p_m^{k+1}. \end{aligned}$$

Together, the sought probability is bounded by

$$\begin{aligned}&\frac{1}{\varepsilon L} \left( \left( {\begin{array}{c}2k-1\\ k-1\end{array}}\right) p_m^k + 2p_m \sum _{i=3}^{L} \left( {\begin{array}{c}2k-1\\ k-1\end{array}}\right) p_m^{k}\right) \\&\quad = \frac{1}{\varepsilon L} \left( \left( {\begin{array}{c}2k-1\\ k-1\end{array}}\right) p_m^k \left( 1 + 2p_m L\right) \right) = O\left( \left( {\begin{array}{c}2k-1\\ k-1\end{array}}\right) \frac{p_m^{k}}{L}\right) . \end{aligned}$$

Taking the reciprocal yields a lower bound on the expected time for finding cell 1.

For \(\varepsilon (n+1)/2 < k \le (n+1)/2\) all cells have linear width. Since the probability of initialising with at least \(n/2 + \sqrt{n}\) ones is \(\Omega (1)\) by properties of the binomial distribution, it is easy to show that the algorithm will reach a search point with \(2k-1\) ones before reaching cell 1, with high probability. Then we apply the above arguments.

For \(k=1\) we argue differently as the time is no longer dominated by the expected waiting time to jump from cell 2 to cell 1. We assume that the event stated in Lemma 7 occurs and consider \(X_t\) as the minimum number of ones in any search point of the population at time t, \(P_t\). Let \(Q_t\) be the number of ones in the parent chosen in generation t. Define the drift

$$\begin{aligned} \Delta _i {:}{=}\text {E}\left( X_{t}-X_{t+1} \mid X_t = s, Q_{t} = X_t + i\right) \end{aligned}$$

under the condition that a parent is chosen whose number of ones exceeds \(X_t\) by i. Note that \(\Delta _0\) is bounded by the expected number of 1-bits that flip, \(\Delta _0 \le s \cdot p_m = \frac{cs}{n}\). We focus on \(X_t \le n^{1/5}\) and show that in this regime the drift is at most

$$\begin{aligned} \hspace{-0.2cm} \text {E}\left( X_t - X_{t+1} \mid X_t = s\right) = \frac{1}{|P_t|} \sum _{x \in P_t} \Delta _{|x|_1-s} \le \frac{1}{\varepsilon L} \sum _{i=0}^{n-s} \Delta _i = O(s/n^2) \end{aligned}$$
(2)

where the first equation follows from the law of total probability and the last equality remains to be proven. We will show that for all \(i \ge 1\), \(\Delta _i\) is asymptotically much smaller than \(\Delta _0\). We trivially have \(\Delta _i \le \textrm{Pr}\left( X_{t+1} < X_t \mid X_t, Q_t = X_t + i\right) \cdot (X_t+i)\) since the maximum possible decrease is \(X_t+i\). In order to decrease \(X_t\) by mutating a parent of \(X_t+i\) ones, it is necessary to flip at least \(i+1\) ones. For \(i \le X_t + n^{1/5}\), this event has probability

$$\begin{aligned} \left( {\begin{array}{c}X_t+i\\ i+1\end{array}}\right) p_m^{i+1} \le \left( {\begin{array}{c}2n^{1/5}\\ i+1\end{array}}\right) p_m^{i+1} \le (2n^{1/5}p_m)^{i+1} \le (2cn^{-4/5})^{i+1} \end{aligned}$$

and thus \(\Delta _i \le (2cn^{-4/5})^{i+1} \cdot 2n^{1/5} = O(n^{-7/5})\) for all \(i \ge 1\), hence \(\sum _{i=1}^{n^{1/5}-1} \Delta _i = O(n^{-6/5})\). For all \(i \ge X_t + n^{1/5}\) the above probability is at most

$$\begin{aligned} \left( {\begin{array}{c}X_t+i\\ i+1\end{array}}\right) p_m^{i+1} \le \frac{(X_t + i)^{i+1}}{i!} p_m^{i+1} \le \frac{n^{i+1}}{i!} p_m^{i+1} = \frac{c^{i+1}}{i!}. \end{aligned}$$

Since \(i! \ge (i/e)^i\), this term is \(n^{-\Omega (n^{1/5})}\) and \(\sum _{i=n^{1/5}}^{n-X_t} \Delta _i = n^{-\Omega (n^{1/5})}\). Along with \(L = n+1\), (2) follows.

Finally, we apply the multiplicative drift theorem for lower bounds, Theorem 2.2 in [30], to the process \(X_t\), using the following parameters: \(s_{\min } {:}{=}\log ^2 n\) and \(\beta {:}{=}1/\ln n\). The condition on the drift, \(\text {E}\left( X_t - X_{t+1} \mid X_t = s\right) \le \delta s\), is satisfied for \(\delta {:}{=}c'/n^2\), \(c'\) being a sufficiently small positive constant.

We still need to show that, for all \(s \ge s_{\min } = \log ^2 n\),

$$\begin{aligned} \textrm{Pr}\left( X_t - X_{t+1} \ge \beta s \mid X_t = s\right) \le \frac{\beta \delta }{\ln n} = \frac{c'}{n^2(\ln n)^2}. \end{aligned}$$

Since \(\beta s \ge \beta s_{\min } = \Omega (\log n)\), the above bound holds since the probability of flipping a logarithmic number of bits with \(p_m = c/n\) is superpolynomially small.

By the multiplicative drift theorem, the expected time for reaching a distance of at most \(s_{\min }\), and thus the time to reach cell 1, starting from a distance of at least \(\varepsilon L = \varepsilon n\), is at least

$$\begin{aligned} \frac{\ln (\varepsilon n) - \ln (s_{\min })}{\delta } \cdot \frac{1-\beta }{1+\beta } =\;&\frac{n^2(\ln (\varepsilon n) - 2\ln (\log n))}{c'} \cdot \frac{1-1/(\ln n)}{1+1/(\ln n)} = \Omega (n^2 \log n). \end{aligned}$$

\(\square \)

As an aside, we point out that Theorem 5 gives a lower bound on the expected time for GSEMO to cover the whole Pareto front of the bi-objective test function OneMinMax\((x) {:}{=}(\textsc {OneMax} (x), n - \textsc {OneMax} (x))\). To our knowledge, this is a novel (albeit not surprising) result; previously, a lower bound of \(\Omega (n^2 \log n)\) was only known for SEMO [31]. SEMO flips exactly one bit in each mutation, and previous work thereby avoided the complexity of analysing standard bit mutation.

Theorem 8

The expected time for QD, operating on the 1-NoO feature space, on OneMax to cover all \(n+1\) cells equals the expected time of GSEMO covering the Pareto front of OneMinMax.

Proof

Both algorithms keep one search point in every cell and create new search points by choosing a covered cell uniformly at random, applying standard bit mutation and keeping the result if it covers an uncovered cell. \(\square \)
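To make this correspondence concrete, the shared mechanism can be sketched in Python. This is a minimal illustration under our own naming (`qd_1noo`, `archive`), not the paper's pseudocode; it instantiates QD on the 1-NoO feature space for OneMax, where the archive maps each cell index (number of ones) to the best search point found for that cell.

```python
import random

random.seed(42)  # for reproducibility of this sketch

def onemax(x):
    return sum(x)

def qd_1noo(f, n, p_m=None, max_evals=200_000, rng=random):
    """MAP-Elites-style QD on the 1-NoO feature space: cell i stores the
    best search point found so far that has exactly i ones."""
    p_m = 1.0 / n if p_m is None else p_m
    x = [rng.randint(0, 1) for _ in range(n)]
    archive = {sum(x): x}  # cell index (number of ones) -> elite of that cell
    for _ in range(max_evals):
        parent = archive[rng.choice(list(archive))]  # covered cell chosen uniformly
        child = [1 - b if rng.random() < p_m else b for b in parent]  # standard bit mutation
        cell = sum(child)
        if cell not in archive or f(child) > f(archive[cell]):
            archive[cell] = child  # fill an empty cell or improve an elite
        if len(archive) == n + 1:  # all n + 1 cells covered
            break
    return archive

archive = qd_1noo(onemax, 10)
```

Since all search points in a 1-NoO cell share the same OneMax value, covering a cell here is the same as covering it optimally, which is exactly the situation exploited in the equivalence above.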

3.4 General Results and Transformations

We now take a closer look at the question of how the performance of QD is affected if the fitness function is not well aligned with the feature space.

Following [32], we define XOR-transformations as follows.

Definition 2

Given a bit string \(a \in \{0, 1\}^n\) and a fitness function \(f :\{0, 1\}^n \rightarrow {\mathbb {R}}\), we define \(f_a :\{0, 1\}^n \rightarrow {\mathbb {R}}\) by \(f_a(x) {:}{=}f(x \oplus a)\) where \(\oplus \) denotes the bit-wise exclusive OR operation.

This transformation implies that in all positions i where \(a_i = 1\), the operation \(x \oplus a\) inverts the bit value before evaluating the fitness. The special case \(a=0^n\) yields \(f_a = f\). If f has a single global optimum \(x^*\), it moves to \(x^* \oplus a\) when considering \(f_a\) instead.
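As a small illustration (function and variable names are ours), Definition 2 can be written as a higher-order function; the example shows the optimum of OneMax moving from the all-ones string to the complement of a.

```python
def xor_transform(f, a):
    """Return f_a with f_a(x) = f(x XOR a), as in Definition 2."""
    return lambda x: f([xi ^ ai for xi, ai in zip(x, a)])

def onemax(x):
    return sum(x)

a = [1, 0, 1, 1, 0, 0]
onemax_a = xor_transform(onemax, a)
# OneMax has its unique optimum at 1^n; OneMax_a has it at the complement of a,
# since x XOR a = 1^n exactly when x disagrees with a in every position.
opt = [1 - ai for ai in a]
```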

We now discuss generalisations of our previous results to classes of transformed fitness functions. For easy functions like OneMax, QD is still efficient because of elitism. We give a more general fitness-level upper bound based on the well-known fitness-level method, also known as the method of f-based partitions [33].

The idea is to partition the search space into sets that are strictly ordered with respect to the fitness of the contained individuals.

Definition 3

For two sets \(A, B \subseteq \{0, 1\}^n\) and fitness function f let \(A <_f B\) if \(f(a) < f(b)\) for all \(a \in A\) and all \(b \in B\). Consider a partition of the search space into non-empty sets \(A_1, \ldots , A_m\) such that \( A_1<_f A_2<_f \cdots <_f A_m \) and \(A_m\) only contains global optima. We say that \(QD \) is in \(A_i\) or on level i if the best individual created so far is in \(A_i\).

Owing to elitism, the algorithm can only increase its current fitness level. Assuming that for each non-optimal fitness level we have a lower bound on the probability of reaching some higher fitness level set, this gives an upper bound on the expected time to escape from the current level. Summing up all these expected waiting times yields an upper bound on the expected time for reaching the final level.

Theorem 9

(Fitness-level upper bound for QD) Consider QD operating on any feature space of L cells. Given a fitness-level partition \(A_1, \ldots , A_m\), let \(s_i\) be a lower bound on the probability of creating a new offspring in \(A_{i+1} \cup \cdots \cup A_m\), provided the best search point in \(QD \)’s map is in \(A_i\). Then the expected optimisation time of \(QD \) on f is at most

$$\begin{aligned} L \sum _{i=1}^{m-1} \frac{1}{s_i}. \end{aligned}$$
(3)

Proof

For every set \(A_i\), \(i < m\), the following holds. If the best search point in the map is currently in \(A_i\), the probability of selecting this cell is 1/L and the probability of creating a search point in a better fitness level is at least \(s_i\). If this happens, the set \(A_i\) is left for good. Thus, the expected time spent in \(A_i\) is bounded by the expectation of a geometric random variable with parameter \(s_i/L\), which is \(L/s_i\). Summing up these times for all non-optimal \(A_i\) implies the claim. \(\square \)

For instance, on OneMax we may choose \(A_i = \{x \mid \textsc {OneMax} (x) = i\}\) for \(i \in \{0, \ldots , n\}\) and \(s_i \ge (n-i)p_m(1-p_m)^{n-1}\) for all \(i \le n-1\), since a sufficient condition for increasing the fitness is to flip one of the \(n-i\) zero-bits and none of the other bits. This argument remains valid for all transformations of OneMax when replacing “zero-bits” with “bits that disagree with the global optimum”. For \(p_m = \Theta (1/n)\), we can use \((1-p_m)^{n-1} = \Theta (1)\) and

$$\begin{aligned} \sum _{i=0}^{n-1} \frac{1}{s_i} = \Theta \left( \frac{1}{p_m}\right) \sum _{i=0}^{n-1} \frac{1}{n-i} = \Theta \left( \frac{1}{p_m}\right) \sum _{i=1}^n \frac{1}{i} = O(n \log n), \end{aligned}$$

using the well-known fact that the n-th harmonic number is \(O(\log n)\). This yields the following.

Corollary 10

Consider QD with mutation rate \(p_m = c/n\), \(c > 0\) constant, operating on any feature space with L cells. For every bit string \(a \in \{0, 1\}^n\), the expected optimisation time of QD on the transformed function \(\textsc {OneMax} _a\) is \(O(Ln \log n)\). For the 1-NoO feature space this simplifies to \(O(n^2 \log n)\).
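As a numeric sanity check of this fitness-level calculation, the sum (3) for OneMax can be evaluated directly. This is a sketch under the assumptions c = 1 (so the mutation rate is 1/n) and n = 100; all names are ours.

```python
import math

n = 100
p_m = 1.0 / n
L = n + 1  # number of cells in the 1-NoO feature space
# s_i: flip one of the n - i zero-bits and no other bit (levels A_0, ..., A_n)
s = [(n - i) * p_m * (1 - p_m) ** (n - 1) for i in range(n)]
bound = L * sum(1.0 / s_i for s_i in s)  # the bound (3) from Theorem 9
# Closed form: L * n * H_n / (1 - 1/n)^(n-1), with H_n the n-th harmonic number
H_n = sum(1.0 / j for j in range(1, n + 1))
closed_form = L * n * H_n / (1 - p_m) ** (n - 1)
log_bound_ratio = bound / (L * n * math.log(n))
```

The ratio of the bound to L·n·ln(n) is a small constant (roughly e times the ratio of H_n to ln n), in line with the stated O(Ln log n) bound.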

This result concerns the optimisation time. The expected cover time on transformed OneMax functions using the 1-NoO feature space is still bounded by \(O(n^2 \log n)\) as well since Theorem 1 applies to all pseudo-Boolean fitness functions. However, while for standard OneMax a cover implies that all fitness values have been found and all cells are covered optimally (since, trivially, all search points in one cell have the same fitness), this does not necessarily hold for transformed OneMax functions. The following theorem shows that a time bound of \(O(n^2 \log n)\) still applies for the optimal-cover time of every transformed OneMax function.

Theorem 11

Consider QD with mutation rate \(p_m = c/n\), \(c > 0\) constant, operating on the 1-NoO feature space. For every \(a \in \{0, 1\}^n\) the expected optimal-cover time of QD on the transformed function \(\textsc {OneMax} _a\) is \(O(n^2 \log n)\).

Proof

In expected time \(O(n^2 \log n)\) all cells are covered by Theorem 1. This in particular implies that, numbering cells from 0 to n, the cells 0 and n, respectively, are covered optimally, since these cells only contain single search points (\(0^n\) and \(1^n\), respectively).

Now we estimate the expected remaining time for covering all other cells optimally. Let \(i^*\) denote the number of ones in the global optimum, \({\overline{a}}\). For \(0 \le j \le i^*\) let \(T_j\) denote the random time until all cells \(0, 1, \ldots , j\) are covered optimally. For \(0 \le j \le i^*-1\) we consider the algorithm at time \(T_j\) and estimate the expected time for covering cell \(j+1\) optimally. Note that the minimum Hamming distance between the search point \(x_j\) in cell j and the optimum is \(i^* - j\). As \(x_j\) is optimal for cell j, there must be \(i^*-j\) bits that are 1 in the global optimum and 0 in \(x_j\). A 1-bit flip inverting one of these bits creates an optimal solution in cell \(j+1\). The probability of this event is at least \(1/L \cdot (i^*-j)\cdot C/n\) for some constant \(C > 0\) that depends on \(p_m\). Consequently, the expected time for covering all cells \(0, \ldots , i^*\) optimally, starting with a full cover, is at most

$$\begin{aligned} \sum _{j=0}^{i^*-1} L \cdot \frac{n}{C(i^*-j)} = \frac{nL}{C} \sum _{j=1}^{i^*} \frac{1}{j} = O(n^2 \log n). \end{aligned}$$

A symmetric argument applies for the expected time to cover all cells \({n, n-1, \ldots , i^*+1}\) optimally, swapping the roles of zeros and ones. This shows an upper bound of \(O(n^2 \log n)\). \(\square \)

In summary, on easy functions (for which the fitness-level method yields good bounds), transformations are not that harmful for the expected optimisation time.

Now we give a more extreme example where transformations affect performance significantly.

Definition 4

The deceptive function Trap is defined as

$$\begin{aligned} \textsc {Trap}(x) {:}{=}{\left\{ \begin{array}{ll} n+1 &{} \text {if}\quad x = 0^n\\ \textsc {OneMax} (x) &{} \text {otherwise.} \end{array}\right. } \end{aligned}$$
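In code, the definition reads as follows (a direct transcription; function names are ours):

```python
def onemax(x):
    return sum(x)

def trap(x):
    """Deceptive Trap function of Definition 4: the unique global optimum is
    the all-zeros string; everywhere else the fitness equals OneMax."""
    return len(x) + 1 if all(b == 0 for b in x) else onemax(x)
```

Every search point other than the optimum rewards adding ones, so a hill-climber is led towards the all-ones string, which has maximal Hamming distance n to the optimum; this is what makes the function deceptive.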

It is well known [24] that the (1+1) EA with mutation rate \(p_m = 1/n\) requires time \(\Theta (n^n)\) in expectation on Trap and that this is the largest expected optimisation time possible for algorithms using standard bit mutation with mutation probability 1/n in each offspring creation. The reason is that in each offspring creation, the probability of turning the parent x into a global optimum \(x^*\) is

$$\begin{aligned} \left( p_m\right) ^{H(x, x^*)} \left( 1 - p_m\right) ^{n-H(x, x^*)} \le p_m^{n} \end{aligned}$$

where the inequality holds for all mutation probabilities \(p_m \le 1/2\) (as then \(p_m \le 1-p_m\)). Consequently, for every fitness function and every mutation probability \(0 < p_m \le 1/2\), the expected time for creating a global optimum is at most \((1/p_m)^n\).

Note that Trap is a function of unitation, hence Corollary 3 yields an upper bound of \(O(n^2 \log n)\) for QD operating on the 1-NoO feature space with respect to the expected optimisation time and even the expected optimal-cover time. This result breaks down when considering transformations of the Trap function.

Theorem 12

Consider QD with mutation rate \(p_m = c/n\), \(c > 0\) constant, operating on the 1-NoO feature space. There is a transformed Trap function for which the optimisation time of QD is in \(\Omega ((1/p_m)^{\lfloor n/2 \rfloor }n)\) with overwhelming probability and in expectation.

In the analysis we will consider the impact of a mutation operation and distinguish between flipped bits that decrease the Hamming distance to the optimum and those that increase it. This process can be modelled as an urn model with red and white balls, where the red balls correspond to bit flips approaching the optimum. If the number r of red balls among all N balls is small, the probability of drawing more red balls than white balls is bounded by r/N; when drawing exactly one ball, this is trivially the precise probability. The following lemma shows that this bound applies regardless of the number of balls drawn.

Lemma 13

Consider an urn model in which k balls are chosen uniformly at random, without replacement, from an urn containing \(N \in {\mathbb {N}}\) red or white balls, r of which are red. If \(r \le N/4\) then, for every value \(k \in \{0, \ldots , N\}\) the probability of drawing more red balls than white balls is at most r/N.

Proof

The probability of drawing more red than white balls equals the sum of the probabilities of drawing i red balls, for \(i \in \{\lfloor k/2\rfloor +1, \ldots , k\}\). Let \(p_i\) denote the probability of drawing exactly i red balls; then \(p_i = 0\) if \(i > r\), and for \(i \le r\),

$$\begin{aligned} p_i = \frac{\left( {\begin{array}{c}r\\ i\end{array}}\right) \left( {\begin{array}{c}N-r\\ k-i\end{array}}\right) }{\left( {\begin{array}{c}N\\ k\end{array}}\right) }=\;&\frac{r!(N-r)!k!(N-k)!}{i!(r-i)!(k-i)!(N-r-k+i)!N!}\\ =\;&\frac{r!(N-r)!(N-k)!}{(r-i)!(N-r-k+i)!N!}\left( {\begin{array}{c}k\\ i\end{array}}\right) \\ =\;&\frac{r!}{(r-i)!} \cdot \prod _{j=0}^{k-i-1}(N-r-j) \cdot \prod _{j=0}^{k-1}\frac{1}{N-j} \cdot \left( {\begin{array}{c}k\\ i\end{array}}\right) \\ =\;&\frac{r!}{(r-i)!} \cdot \prod _{j=0}^{k-i-1}(N-r-j) \cdot \prod _{j=0}^{i-1}\frac{1}{N-j} \cdot \prod _{j=0}^{k-i-1}\frac{1}{N-i-j} \cdot \left( {\begin{array}{c}k\\ i\end{array}}\right) . \end{aligned}$$

Using \(r \ge i\), the product of the first and the last \(\prod \) terms is at most 1 and the above simplifies to

$$\begin{aligned} \le \;&\frac{r!}{(r-i)!} \cdot \prod _{j=0}^{i-1}\frac{1}{N-j}\left( {\begin{array}{c}k\\ i\end{array}}\right) \\ =\;&\frac{r(r-1) \cdots (r-i+1)}{N(N-1) \cdots (N-i+1)} \left( {\begin{array}{c}k\\ i\end{array}}\right) \\ \le \;&\left( \frac{r}{N}\right) ^i \left( {\begin{array}{c}k\\ i\end{array}}\right) . \end{aligned}$$

As an aside, note that the latter, when multiplied with a factor of \((1-r/N)^{k-i}\), yields the probability of choosing i red balls with replacement, that is, a binomial distribution with parameters k and r/N. Now the sought probability is at most

$$\begin{aligned} \sum _{i=\lfloor k/2 \rfloor +1}^{k} p_i \le \sum _{i=\lfloor k/2 \rfloor +1}^{k} \left( \frac{r}{N}\right) ^i \left( {\begin{array}{c}k\\ i\end{array}}\right) =\;&\frac{r}{N}\sum _{i=\lfloor k/2 \rfloor +1}^{k} \left( \frac{r}{N}\right) ^{i-1} \left( {\begin{array}{c}k\\ i\end{array}}\right) \\ \le \;&\frac{r}{N} \sum _{i=\lfloor k/2 \rfloor +1}^{k} \left( \frac{1}{4}\right) ^{i-1} \cdot \left( {\begin{array}{c}k\\ i\end{array}}\right) \\ \le \;&\frac{r}{N} \cdot \left( \frac{1}{4}\right) ^{\lfloor k/2 \rfloor } \sum _{i=\lfloor k/2 \rfloor +1}^{k} \left( {\begin{array}{c}k\\ i\end{array}}\right) . \end{aligned}$$

Noting \(\sum _{i=0}^k \left( {\begin{array}{c}k\\ i\end{array}}\right) = 2^k\), the sum is precisely \(2^k/2\) if k is odd by symmetry of binomial coefficients and \(2^k/2 - \left( {\begin{array}{c}k\\ k/2\end{array}}\right) /2 \le 2^k/2\) if k is even, thus we get

$$\begin{aligned} \le \;&\frac{r}{N} \cdot \left( \frac{1}{4}\right) ^{\lfloor k/2 \rfloor } \cdot \frac{2^k}{2} \le \frac{r}{N} \end{aligned}$$

where the last inequality is obvious for even k and follows from \((1/4)^{(k-1)/2} = (1/2)^{k-1}\) for odd k. \(\square \)
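Lemma 13 can be cross-checked exactly for small parameters by summing the hypergeometric probabilities p_i directly. This is a verification sketch (names ours), not part of the proof; it relies on Python's `math.comb` returning zero when the lower index exceeds the upper one.

```python
from math import comb

def prob_more_red(N, r, k):
    """Exact probability of drawing more red than white balls when k balls are
    drawn uniformly at random, without replacement, from N balls, r of them red."""
    return sum(comb(r, i) * comb(N - r, k - i)
               for i in range(k // 2 + 1, k + 1)) / comb(N, k)

N, r = 40, 10  # r <= N/4, as required by Lemma 13
worst = max(prob_more_red(N, r, k) for k in range(N + 1))
```

For k = 1 the value is exactly r/N, matching the remark preceding the lemma, and the maximum over all k does not exceed r/N.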

Now we are able to give a proof of Theorem 12.

Proof of Theorem 12

Throughout this proof we assume that n is large enough for all inequalities that require a sufficiently large n to hold. We choose the transformation according to \(a {:}{=}1^{\lfloor n/2 \rfloor } 0^{\lceil n/2 \rceil }\), that is, the global optimum is a and the fitness of every other search point equals its Hamming distance to a. Note that in every non-empty cell the Hamming distance to the optimum can only increase, except for the cell containing search points with \(\lfloor n/2 \rfloor \) ones in the unlikely case that a is reached. However, empty cells will accept any search point, including those in which the Hamming distance to a has decreased compared to the current parent. Hence we need to show that QD does not evolve any search points that are too close to a.

To this end, denoting by \(M_t\) the map at time t we define the distance measure

$$\begin{aligned} D_t {:}{=}\min \left\{ 2n^{2/3},\; \min _{x \in M_t} H(x, a)\right\} . \end{aligned}$$

Note that \(D_t = 0\) if and only if a is found. Then we show that the following statements all hold with overwhelming probability:

  1. Initially, \(D_0 \ge 2n^{2/3}\).

  2. Whenever \(D_t\) is decreased, it decreases by at most \(n^{1/4}\).

  3. There are at most \(n^{5/12}\) steps decreasing \(D_t\) before the map is filled.

Together, if these very likely events all occur, the total decrease of \(D_t\) is bounded by \(n^{1/4} \cdot n^{5/12} = n^{2/3}\) and \(D_t\) is always at least \(n^{2/3}\) while the map is being filled. We assume in the following that this happens. Once the map is filled, as long as a is not found, QD behaves as on the transformed function \(\textsc {OneMax} _a\). We consider the random time \(T^*\) to find a or to cover each cell with the optimal \(\textsc {OneMax} _a\)-value. We know \(\text {E}\left( T^*\right) = O(n^2\log n)\) by Theorem 11, using that the unlikely event of finding a is an additional stopping event. During this time, every parent has Hamming distance at least \(D_t\) to a by definition of \(D_t\). Using \(D_t \ge n^{2/3}\), the probability of creating a is at most \(p_m^{n^{2/3}}\) and the probability of this happening during a random phase of length \(T^*\) is at most

$$\begin{aligned} \sum _{t=1}^{\infty } \textrm{Pr}\left( T^* = t\right) \cdot t \cdot p_m^{n^{2/3}} = \text {E}\left( T^*\right) \cdot p_m^{n^{2/3}} = O(n^2 \log n) \cdot p_m^{n^{2/3}}. \end{aligned}$$

After time \(T^*\), if a has not been found, every cell will have a search point with maximum Hamming distance to a. The maximum Hamming distance of any search point in the cell containing i ones to a is attained for the search point \(0^{n-i}1^i\). In case \(i \le \lceil n/2 \rceil \) the maximum Hamming distance is \(i+\lfloor n/2 \rfloor \) as all ones are placed in disjoint positions. In case \(i > \lceil n/2 \rceil \) the maximum Hamming distance is \(\lceil n/2 \rceil + n - i\) as all zeros are placed in disjoint positions. Together, the maximum Hamming distance is \(\lfloor n/2 \rfloor \) for \(i=0\), \(\lceil n/2 \rceil \) for \(i=n\) and at least \(\lfloor n/2 \rfloor + 1\) for all \(1 \le i \le n-1\). Hence, creating the optimum by mutation has probability at most \(2/L \cdot p_m^{\lfloor n/2 \rfloor } + (1-2/L) \cdot p_m^{\lfloor n/2 \rfloor +1} = O(p_m^{\lfloor n/2 \rfloor }/n)\). This implies the claimed time bound.
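The case distinction on the maximal Hamming distances per cell can be verified by brute force for small n. A check sketch (names ours), using the transformation string a from the proof:

```python
from itertools import product

n = 9
a = [1] * (n // 2) + [0] * (n - n // 2)  # a = 1^{floor(n/2)} 0^{ceil(n/2)}

def hamming(x, y):
    return sum(xi != yi for xi, yi in zip(x, y))

# Maximum Hamming distance to a within each cell i of the 1-NoO feature space
max_dist = {}
for x in product([0, 1], repeat=n):
    i = sum(x)
    max_dist[i] = max(max_dist.get(i, 0), hamming(x, a))
```

For n = 9 this confirms the claims: distance floor(n/2) = 4 in cell 0, ceil(n/2) = 5 in cell n, and at least floor(n/2) + 1 = 5 in every other cell.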

It remains to show the three claims from above. The first claim easily follows from applying Chernoff bounds to the number of bits in the initial search point that disagree with a. The expectation of this quantity is n/2, and the probability of it being less than \(2n^{2/3}\) is \(2^{-\Omega (n)}\). For the second claim, we use that decreasing \(D_t\) by d requires the algorithm to flip at least d bits in one standard bit mutation. The probability of flipping at least \(n^{1/4}\) bits is at most

$$\begin{aligned} \sum _{j=\lceil n^{1/4}\rceil }^n \left( {\begin{array}{c}n\\ j\end{array}}\right) p_m^j \le \sum _{j=\lceil n^{1/4}\rceil }^n \frac{n^j}{j!} \cdot \frac{c^j}{n^j} = \sum _{j=\lceil n^{1/4}\rceil }^n \frac{c^j}{j!} \le \sum _{j=\lceil n^{1/4}\rceil }^n \left( \frac{ce}{j}\right) ^j = 2^{-\Omega (n^{1/4})} \end{aligned}$$

using \(j! \ge (j/e)^j\) and properties of the geometric series.

The probability of this happening in the first \(T^*\) steps is still exponentially small, using the same arguments as above.

Finally, we show the third claim, assuming that the typical events from the first two statements occur. Note that \(D_t\) can only be decreased if the parent has Hamming distance at most \(D_t + n^{1/4} \le 2n^{2/3} + n^{1/4} \le 3n^{2/3}\) to a. Consequently, it suffices to consider cells with a number of ones in \(S {:}{=}\{|a|_1 - 3n^{2/3} + 1, \ldots , |a|_1 + 3n^{2/3} - 1\}\) as all other cells are guaranteed to have a larger Hamming distance to a. As mentioned before, it also suffices to consider steps where empty cells are filled as these are the only steps capable of decreasing \(D_t\). We consider such a step and apply the method of deferred decisions: we assume that we know the number of one-bits flipped in this step and the number of flipped zero-bits, but that we do not yet know their positions. This is a valid argument since only the number of flipped one-bits and zero-bits determines the cell the offspring belongs to. Now, assume that the new cell has been reached, creating an offspring y from flipping o one-bits and z zero-bits in its parent, x, with \(H(x, a) \le 3n^{2/3}\). The latter implies that at most \(3n^{2/3}\) one-bits in x disagree with a and at least \(|a|_1 - 3n^{2/3} - 3n^{2/3} \ge n/2 - 6n^{2/3} \ge n/3\) one-bits in x agree with a. A necessary condition for decreasing the Hamming distance to a is that amongst all flipped one-bits we have a surplus of bits that disagree with a, or that amongst all flipped zero-bits we have a surplus of bits that disagree with a. Modelling the choice of flipped bits as an urn model with at most \(3n^{2/3}\) red balls and at least n/3 balls in total (note that \(3n^{2/3} \le (n/3)/4\) for large n, so Lemma 13 applies), the probability of having a surplus of red balls representing flipped one-bits is at most \(3n^{2/3}/(n/3) = 9n^{-1/3}\), and the same holds for flipped zero-bits. A union bound over these two events yields a probability of at most \(18n^{-1/3}\).

Since these arguments are agnostic of the specific number of bits flipped, and these events are independent for different cells, the number of steps decreasing \(D_t\) is stochastically dominated by the number of successes in a sequence of \(|S| \le 6n^{2/3}\) independent Bernoulli trials with success probability \(18n^{-1/3}\). The expected number of steps decreasing \(D_t\), \(\text {E}\left( X\right) \), is thus at most \(108n^{1/3}\). We apply Chernoff bounds to the latter, choosing a parameter \(\delta \) such that \((1+\delta )\text {E}\left( X\right) = n^{5/12}\), which implies \(\delta = \Theta (n^{1/12})\). Since \(\delta \ge 1\), Chernoff bounds yield a probability bound of \(e^{-\Omega (\text {E}\left( X\right) \delta )} = e^{-\Omega (n^{1/3} \cdot n^{1/12})} = e^{-\Omega (n^{5/12})}\) of having more than \(n^{5/12}\) such steps.

This completes the proof of the third claim and the theorem. \(\square \)

Next, we show that the above lower bound is asymptotically tight.

Theorem 14

Consider QD with mutation rate \(p_m = c/n\), \(c > 0\) constant, operating on the 1-NoO feature space. For every pseudo-Boolean function \(f :\{0, 1\}^n \rightarrow {\mathbb {R}}\) the expected optimisation time of QD on f is at most

$$\begin{aligned} O(n^2 \log (n) + (1/p_m)^{H(\textrm{OPT}_f, \{0^n, 1^n\})}n) = O((1/p_m)^{\lfloor n/2 \rfloor }n), \end{aligned}$$

where \(\textrm{OPT}_f\) is the set of global optima of f and, for sets \(A, B \subseteq \{0, 1\}^n\), \(H(A, B) {:}{=}\min _{x \in A, y \in B}H(x, y)\) denotes the minimum Hamming distance between any search point \(x \in A\) and any search point \(y \in B\).

Proof

Within expected time \(O(n^2 \log n)\), QD has covered all cells, by Theorem 1. In particular, QD has found both strings \(0^n\) and \(1^n\). Let \(x^*\) be a global optimum of f with \(H(\{x^*\}, \{0^n, 1^n\}) = H(\textrm{OPT}_f, \{0^n, 1^n\}) {=}{:}h\). Then \(x^*\) can be created from flipping h bits in either \(0^n\) or \(1^n\). The probability of selecting a parent with Hamming distance at most h to \(x^*\) and mutating it into \(x^*\) is at least \(1/L \cdot p_m^h(1-p_m)^{n-h} = \Omega (p_m^{h}/n)\).

Taking the reciprocal yields an upper bound on the expected waiting time for finding a global optimum after covering all cells. The second bound follows since for any search point \(x^* \in \textrm{OPT}_f\) we have \(H(x^*, 0^n) + H(x^*, 1^n) = n\) as the left-hand side adds up the number of ones in \(x^*\) and the number of zeros in \(x^*\). This implies \(H(\textrm{OPT}_f, \{0^n, 1^n\}) \le \min \{H(x^*, 0^n), H(x^*, 1^n)\} \le \lfloor n/2 \rfloor \). \(\square \)

We conclude that QD with the 1-NoO feature space is efficient when there are global optima close to \(0^n\) or \(1^n\) in terms of Hamming distance. But it cannot avoid exponential optimisation times if all global optima have a sufficiently large Hamming distance from both \(0^n\) and \(1^n\). However, the worst case of \(\Theta ((1/p_m)^{n})\) can be lowered significantly, to \(\Theta ((1/p_m)^{\lfloor n/2 \rfloor }n)\). The latter is nearly the square root of the former and the upper bound is smaller by a factor of order \((1/p_m)^{\lceil n/2\rceil }/n\). Note, however, that in the worst case both bounds are exponential in n, limiting the usefulness of these bounds. This is why we now turn to combinatorial problems where we will show that QD can guarantee finding good solutions in expected polynomial time.

4 Monotone Submodular Function Optimisation

We now consider the NP-hard problem of submodular function maximisation. Submodular functions generalise many well-known NP-hard combinatorial optimisation problems, e.g., finding a maximum cut in a graph, and are thus of utmost relevance. Let \(\Omega \) be a finite ground set and let \(f: 2^{\Omega } \rightarrow {\mathbb {R}}\) be a set function. The function f is submodular if for all \(A, B \subseteq \Omega \)

$$\begin{aligned} f(A \cup B) + f(A \cap B) \le f(A) + f(B) \end{aligned}$$

and f is monotone if \(f(A) \le f(B)\) for all \(A \subseteq B \subseteq \Omega \). We assume that f is normalised, i.e., \(f(A) \ge 0 \,\forall A \subseteq \Omega \) and \({f(\emptyset ) = 0}\). Given a cardinality constraint r, the goal is to find a set \(\text {OPT}\subseteq \Omega \) with \(f(\text {OPT}) = \max _{A \subseteq \Omega , |A|\le r} f(A)\). Friedrich and Neumann [18, 34] were the first to study the performance of evolutionary algorithms on monotone submodular functions. In particular, they showed that GSEMO, using a bi-objective fitness function, obtains a \((1-1/e)\)-approximation on any monotone submodular function with a single uniform matroid constraint in expected time \(O(n^2 (\log (n) + r))\).
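A canonical example of a normalised monotone submodular function is the coverage function of a maximum-coverage instance. The following sketch (toy data and names ours) verifies all three defining properties exhaustively on a small ground set.

```python
from itertools import combinations

# Each ground-set element covers some items (a maximum-coverage instance).
covers = {0: {1, 2}, 1: {2, 3}, 2: {3, 4, 5}, 3: {1}}
Omega = set(covers)

def f(A):
    """Coverage function: number of distinct items covered by the sets in A."""
    return len(set().union(*(covers[i] for i in A))) if A else 0

subsets = [set(c) for size in range(len(Omega) + 1)
           for c in combinations(sorted(Omega), size)]
submodular = all(f(A | B) + f(A & B) <= f(A) + f(B)
                 for A in subsets for B in subsets)
monotone = all(f(A) <= f(B) for A in subsets for B in subsets if A <= B)
normalised = f(set()) == 0 and all(f(A) >= 0 for A in subsets)
```

In this toy instance, f(OPT) = 5 already for the cardinality constraint r = 2, witnessed by the set {0, 2}.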

Algorithm 2

Global SEMO algorithm (pseudocode omitted)

The authors consider the fitness function \(g(x):= (z(x), |x|_0)\) where \(z(x) = f(x)\) if the solution x is feasible (i.e., \(|x|_1 \le r\)) and \(z(x) = -1\) otherwise. They use the concept of Pareto-dominance and define \(g(x) \ge g(y)\) if and only if \(((z(x) \ge z(y)) \wedge (|x|_0 \ge |y|_0))\) holds; in this case x (Pareto-)dominates y. They study the expected runtime of GSEMO (see Algorithm 2) until a best-so-far solution \(x^{*} \in \text {arg max}_{x \in P} z(x)\) with \(f(x^{*}) \ge \alpha \cdot f(\text {OPT})\) is hit for the first time.
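The update rule just described can be sketched as follows. This is our own simplified rendering of GSEMO, not the paper's exact Algorithm 2; it uses the weak-dominance relation g(x) ≥ g(y) from above and encodes infeasibility via z(x) = -1.

```python
import random

random.seed(0)  # for reproducibility of this sketch

def gsemo(z, n, max_evals=50_000, rng=random):
    """GSEMO on g(x) = (z(x), |x|_0): mutate a uniformly chosen parent, insert
    the child if no population member weakly dominates it, and remove every
    member that the child weakly dominates."""
    def wdom(x, y):  # x weakly dominates y, i.e. g(x) >= g(y)
        return z(x) >= z(y) and x.count(0) >= y.count(0)
    pop = [[rng.randint(0, 1) for _ in range(n)]]
    for _ in range(max_evals):
        parent = rng.choice(pop)
        child = [1 - b if rng.random() < 1.0 / n else b for b in parent]
        if not any(wdom(x, child) for x in pop):
            pop = [x for x in pop if not wdom(child, x)] + [child]
    return pop

# Toy run: OneMax with uniform cardinality constraint r = 4 (z = -1 if infeasible).
r, n = 4, 10
def z(x):
    return sum(x) if sum(x) <= r else -1
best = max(gsemo(z, n), key=z)
```

The population never contains two solutions with the same number of zeros, so its size stays at most n + 1, mirroring the one-elite-per-cell map of QD on the 1-NoO feature space.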

In the following we show that QD with \(p_m {:}{=}1/n\) also achieves an approximation ratio of \((1-1/e)\) in polynomial expected time \(O(n^2(\log (n) + r))\) using the function f itself as the fitness function.

Theorem 15

The expected time until QD operating on the 1-NoO feature space has found a \((1-1/e)\)-approximate solution for a monotone submodular function with a uniform cardinality constraint r is \(O(n^2 (\log (n) + r))\).

The proof follows the proof of Theorem 2 in [18] and the proof of the approximation quality of the deterministic greedy algorithm by Nemhauser et al. [35]. This is because the working principles of QD and GSEMO match: while the fitness function for GSEMO cleverly encodes the cardinality constraint as a second objective and uses Pareto-dominance to decide which solutions to keep, the 1-NoO feature space plays this role for QD.

Proof of Theorem 15

By Theorem 1 the first cell of the map is populated with the solution \(0^n\) in \(O(n^2 \log n)\) steps in expectation. Since \(0^n\) is the only solution with n zeroes it will not be replaced.

More precisely, we show that once \(0^n\) is stored in the map, QD evolves solutions \(x_j\) for \(0 \le j \le r\) with approximation ratio

$$\begin{aligned} f(x_j) \ge \left( 1 - \left( 1 - \frac{1}{r}\right) ^j\right) \cdot f(\text {OPT}) \end{aligned}$$
(4)

in expected time \(O(n^2r)\). The proof is by induction over j. As a base case we note that once the all zeroes string is in the map, Eq. (4) holds for \(x_0 = 0^n\) since

$$\begin{aligned} 0 = f(x_0) \ge \left( 1 - \left( 1 - \frac{1}{r}\right) ^0\right) \cdot f(\text {OPT}) = 0. \end{aligned}$$

Now assume that Eq. (4) holds for every \(0 \le i \le j < r\); we show that it also holds for \(x_{j+1}\). To this end the algorithm needs to (i) select the \(x_j \in M\) with \(|x_j|_1 = j\) and (ii) add the element with the largest possible increase in f in a single mutation. Let \(\delta _{j+1}\) be the increase in fitness we obtain by this event. Then we get

$$\begin{aligned} f(\text {OPT}) \le f(x_j \cup \text {OPT}) \le f(x_j) + r \cdot \delta _{j+1} \end{aligned}$$

where the first inequality follows from monotonicity of f and the second by submodularity. Rearranging terms we obtain

$$\begin{aligned} \delta _{j+1} \ge \frac{1}{r} \cdot \left( f(\text {OPT}) - f(x_j)\right) . \end{aligned}$$

This leads to

$$\begin{aligned} f(x_{j+1})&\ge f(x_j) + \frac{1}{r} \cdot \left( f(\text {OPT}) - f(x_j)\right) \\&= \frac{f(\text {OPT})}{r} + f(x_j) \cdot \left( 1 - \frac{1}{r}\right) \\&\ge f(\text {OPT}) \cdot \left( \frac{1}{r} + \left( 1 - \left( 1 - \frac{1}{r}\right) ^j\right) \cdot \left( 1 - \frac{1}{r}\right) \right) \\&= f(\text {OPT}) \cdot \left( 1 - \left( 1 - \frac{1}{r}\right) ^{j+1}\right) . \end{aligned}$$

For \(j + 1 = r\) the latter simplifies to

$$\begin{aligned} f(\text {OPT}) \cdot \left( 1 - \left( 1 - \frac{1}{r}\right) ^{r}\right) \ge (1 - 1/e) \cdot f(\text {OPT}) \end{aligned}$$

proving the claimed approximation ratio. Note that the required selection plus mutation event happens with probability at least \(\frac{1}{n+1} \cdot \frac{1}{en}\), where the first factor is the probability to select the respective parent and the second factor is a lower bound on the probability to greedily add the element with the largest increase in fitness. Summing over all r iterations we obtain an expected runtime of \(\sum _{j=0}^{r-1} (n+1)ne = O(n^2r)\). Adding the initial time to hit \(0^n\) yields \(O(n^2\log n) + O(n^2r) = O(n^2(\log (n) + r))\), which completes the proof. \(\square \)

We see that QD achieves the same approximation ratio as GSEMO and it does so in the same expected runtime. This comes as little surprise, as the algorithms work similarly (population with at most \((n+1)\) non-dominated individuals versus a map with at most \((n+1)\) individuals) as pointed out before. However, QD is a much simpler algorithm. The fitness function is the set function itself and no bi-objective formulation is required (that role is played by the feature space). As a consequence, the algorithm does not need to perform dominance checks in every iteration. We are confident that many other results from [18] also hold for QD.
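To illustrate this simplicity, the following is a minimal sketch of the QD scheme on the 1-NoO feature space: one cell per number of ones, uniform parent selection over non-empty cells, standard bit mutation with \(p_m = 1/n\), and the best-so-far solution kept per cell. The toy fitness function, parameter values, and all names are illustrative assumptions, not the setup analysed in this paper:

```python
import random

def qd_noo(f, n, r, steps, seed=0):
    """QD sketch on the 1-NoO feature space; fitness is f itself."""
    rng = random.Random(seed)
    M = {0: (0,) * n}  # map: cell = number of ones -> best-so-far solution
    for _ in range(steps):
        parent = M[rng.choice(sorted(M))]   # uniform choice of a non-empty cell
        # standard bit mutation: flip each bit independently with prob. 1/n
        child = tuple(b ^ (rng.random() < 1 / n) for b in parent)
        cell = sum(child)
        if cell not in M or f(child) > f(M[cell]):
            M[cell] = child                 # keep the better solution per cell
    # answer: best solution respecting the cardinality constraint |x|_1 <= r
    feasible = [x for c, x in M.items() if c <= r]
    return max(feasible, key=f)

f = lambda x: sum(x)  # modular (hence submodular) toy fitness, for illustration
best = qd_noo(f, n=10, r=4, steps=5000)
print(sum(best))
```

No dominance checks appear anywhere; replacement within a cell is a plain fitness comparison, exactly as argued above.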

5 Minimum Spanning Forests

We now turn the focus to problem-tailored feature spaces and a well-known combinatorial optimisation problem. Consider an undirected graph \(G=(V,E)\) with \(n=|V|\) nodes, \(m = |E|\) edges, and a weight function \(w: E \rightarrow {\mathbb {N}}^{+}\) that maps each edge to a positive integer weight. Let \(C(G) \in \{1, \ldots , n\}\) be the number of connected components (CCs) of G. For a connected graph G (i.e., \(C(G) = 1\)) every acyclic sub-graph of G that contains all vertices is called a spanning tree (ST); let \({\mathcal {T}}\) be the set of all spanning trees of G. A spanning tree \(T^{*}\) with \(T^{*} = \text {arg min}_{T \in {\mathcal {T}}} \sum _{e \in T} w(e)\) is a minimum spanning tree (MST) of G. For an unconnected undirected graph G with \(C(G) > 1\) a spanning forest (SF) is a collection of C(G) STs (one for each connected component). A minimum spanning forest (MSF) corresponds to an SF composed of MSTs for every CC.

We encode solution candidates as bit-strings x from the search space \(S=\{0,1\}^m\) where \(x_i = 1\) means that the ith edge is in the solution and \(x_i=0\) otherwise given an arbitrary, but fixed order of the edges \(e_1, \ldots , e_m\). The fitness of solution candidate x is—overloading the function w—defined as

$$\begin{aligned} w(x) := \sum _{i=1}^{m} x_i \cdot w(e_i). \end{aligned}$$

Let cc(x) be the number of connected components of \(x \in S\).
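For illustration, cc(x) can be computed with a standard union-find structure; the bit-string encoding follows the definition above, while the function and variable names are our own:

```python
def cc(x, n, edges):
    """Number of connected components of the subgraph selected by the
    bit-string x (x[i] = 1 iff edge edges[i] is chosen), via union-find."""
    parent = list(range(n))
    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]  # path halving
            v = parent[v]
        return v
    comps = n                              # every node starts as its own CC
    for bit, (u, v) in zip(x, edges):
        if bit:
            ru, rv = find(u), find(v)
            if ru != rv:                   # edge merges two CCs
                parent[ru] = rv
                comps -= 1
    return comps

# path on 4 nodes: no edges chosen gives 4 CCs, all edges chosen gives 1
edges = [(0, 1), (1, 2), (2, 3)]
print(cc((0, 0, 0), 4, edges), cc((1, 1, 1), 4, edges))  # 4 1
```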

The MST problem has been studied for Randomised Local Search [19], ant colony optimisation [36], and simulated annealing [37, 38].

Neumann and Wegener [19] used a bi-objective formulation of the MST problem where the fitness \(f': S \rightarrow {\mathbb {R}}^2\) of an individual \(x \in S\) was defined by \(f'(x) = (cc(x),w(x))\), which is to be minimised in both objectives simultaneously. Since \(cc(x) \in \{1, \ldots , n\}\) the Pareto front has size n and thus the population size of SEMO and GSEMO is bounded from above by n. They showed that both SEMO and GSEMO, given an arbitrary initial search point, solve the bi-objective variant of the MST in expected time \(O(n^2m)\) if all edge weights are at most \(2^n\).

In the following we first consider QD operating on a two-dimensional feature space. The first dimension is defined by the number of CCs of a solution. Note that, when implementing the algorithm, one might want to map these values to a set \(\{1, \ldots , n-C(G)+1\}\) by applying some affine transformation, but for our proofs this is immaterial. The second dimension of M is defined by the number of edges used in the solutions. In consequence, the size of the map is \(L = (n-C(G)+1)(m+1)\). We term this the CCE feature space. We remark that some cells may remain empty forever since not all cells admit feasible solutions. In particular, \(cc(x) \ge n-m\) for every \(x \in S\) since introducing an edge can only reduce the number of connected components by at most 1. Thus, all cells corresponding to fewer than \(n-m\) connected components are always empty. We shall ignore empty cells as they will not play a role for parent selection, and we adapt our notation of cover times and optimal-cover times in the obvious way to exclude cells that admit no feasible solutions. The only (mild) effect is that L is an overestimation of the number of non-empty cells.

Theorem 16

Let \(G=(V,E)\) be an undirected graph with n nodes, m edges, edge weights \(w: E \rightarrow {\mathbb {N}}^{+}\), and \(C(G) \in \{1, \ldots , n\}\) connected components. QD operating on the CCE feature space finds an MSF of G in expected time \(O(m^2(n-C(G))(n-C(G) + \log m))\).

Proof

The analysis considers two phases. In the first phase the empty edge set \(0^m\) (with n connected components) needs to be added to the map. Once this event has happened, we proceed with the construction of an MSF in the second phase. As the case \(m = 0\) is trivial, we assume \(m > 0\).

Phase 1: Let \(x \in M\) be the solution with the minimum number of edges and—among such solutions—with the maximum number of CCs. Dropping an edge from x brings us closer to the empty edge set, and the mutant is accepted in M since it decreases the minimum number of edges stored in the map. Note that the number of CCs of the mutant may increase by exactly one (if the dropped edge is a bridge) or stay the same; in particular, it is non-decreasing. If x has \(1 \le i \le m\) edges such an event happens with probability at least

$$\begin{aligned} \left( \frac{1}{L}\right) \left( \frac{i}{m}\right) \left( 1 - \frac{1}{m}\right) ^{m-1} \ge \frac{i}{eLm} \ge \frac{i}{4em^2(n-C(G))}. \end{aligned}$$

Here we used that \(L = (n-C(G)+1)(m+1) \le 2(n-C(G)) \cdot 2m = 4(n-C(G))m\). Using artificial fitness levels, \(0^m\) is hit for the first time after at most

$$\begin{aligned} \sum _{i=1}^{m} \frac{4em^2(n-C(G))}{i} = O((n-C(G))m^2\log m) \end{aligned}$$

expected steps.

Phase 2: Recall Kruskal’s MST algorithm [39, 40]. It starts with the empty edge set and repeatedly adds the cheapest edge that does not introduce a cycle to the solution until an MSF is found; this step is executed \(n-C(G)\) times. Once \(0^m \in M\), QD on the CCE feature space can mimic this behaviour. Let \(x_t\) at time t denote the acyclic solution with the smallest number of CCs and minimum weight that is currently stored in the map. At the beginning of the process \(0^m\) fulfils these requirements. In order to locate the cheapest acyclic sub-graph with \(cc(x_t)-1\) CCs, QD needs to add the cheapest edge that reduces the number of CCs and does not yield a cycle (exactly as Kruskal’s algorithm proceeds in a deterministic manner). This event happens with probability at least \(1/(eLm) \ge \frac{1}{4em^2(n-C(G))}\). After \(n-C(G)\) such steps an MSF is found. Taking the reciprocal to obtain an upper bound on the expected waiting time for each such step, this phase ends after at most \(4e(n-C(G))^2m^2 = O((n-C(G))^2m^2)\) expected iterations.

Adding up the expected times of both phases yields an upper bound of

$$\begin{aligned}&O((n-C(G))m^2\log m) + O((n-C(G))^2m^2) \\&\quad = O(m^2(n-C(G))(n-C(G) + \log m)). \end{aligned}$$

\(\square \)
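The Kruskal-steps mimicked in Phase 2 can be sketched as follows; this is a standard textbook implementation returning the characteristic bit-string of an MSF, with illustrative names of our own choosing:

```python
def kruskal_msf(n, edges, weights):
    """Build an MSF by n - C(G) Kruskal-steps: repeatedly add the cheapest
    edge that does not close a cycle, i.e., that reduces the number of
    connected components by one."""
    parent = list(range(n))
    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]  # path halving
            v = parent[v]
        return v
    x = [0] * len(edges)                   # start from the empty edge set 0^m
    order = sorted(range(len(edges)), key=lambda i: weights[i])
    for i in order:                        # consider edges by increasing weight
        u, v = edges[i]
        ru, rv = find(u), find(v)
        if ru != rv:                       # edge reduces the number of CCs
            parent[ru] = rv
            x[i] = 1
    return x

edges = [(0, 1), (1, 2), (0, 2)]
weights = [1, 2, 3]
print(kruskal_msf(3, edges, weights))  # [1, 1, 0]: the triangle's MST
```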

It follows that on a connected graph an MST is found by QD on the CCE feature space after \(O(m^2n^2)\) function evaluations.

Corollary 17

Let \(G=(V,E)\) be an undirected, connected graph with n nodes, m edges and edge weights \(w: E \rightarrow {\mathbb {N}}^{+}\). QD operating on the CCE feature-space finds an MST of G in expected time \(O(m^2n^2)\).

Since \(L = \Theta (nm)\), however, the probability of successful mutations is rather low, yielding a runtime bound of \(O(n^6)\) on dense graphs with \(m = \Theta (n^2)\).

In the following we drop the second dimension of the map and derive an upper bound for QD with the feature space defined solely via the number of CCs. Thus, the map size in this setting is \(L = n - C(G) + 1\). For each number of CCs we store the best-so-far solution in the map. We refer to this as the CC feature space.

Theorem 18

Let \(G=(V,E)\) be an undirected graph with n nodes, m edges, edge weights \(w: E \rightarrow {\mathbb {N}}^{+}\), and \(C(G) \in \{1, \ldots , n\}\) connected components. QD operating on the CC feature space locates \(0^m\) (the empty edge set) in an expected time of \(O((n-C(G)+1)m\log (nw_{\max }))\) where \(w_{\max }\) is the maximum edge weight.

Proof

We follow well-established arguments for the analysis of the \((\mu + 1)\)-EA and GSEMO on MST. Let \(x_t\) denote a solution with minimum weight amongst all solutions stored by QD at time t. Note that \(w(x_t)\) is non-increasing over time. If there are k edges selected in \(x_t\), there are k distinct mutations flipping only a single selected edge, and not flipping any unselected edges. As all these mutations, when applied at the same time, decrease the weight from \(w(x_t)\) to 0, the expected decrease in weight, when applying one such mutation chosen uniformly at random, is \(w(x_t)/k\). The event we condition on has probability \(k \cdot 1/m \cdot (1-1/m)^{m-1} \ge k/(em)\). In addition, the probability of selecting \(x_t\) as parent is at least \(1/(n-C(G)+1)\). Together,

$$\begin{aligned} \text {E}\left( w(x_{t+1}) \mid x_t\right) \le w(x_t) \cdot \left( 1 - \frac{1}{em(n-C(G)+1)}\right) . \end{aligned}$$

The weight of the initial solution is \(w_0 \le n w_{\max }\). Applying the multiplicative drift theorem [41] yields an upper bound of \(O((n-C(G)+1)m \log (n w_{\max }))\). \(\square \)
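For reference, in one common formulation the multiplicative drift theorem states that if \(X_t \ge 0\) and \(\text {E}(X_{t+1} \mid X_t) \le (1-\delta ) X_t\), then the expected time T until \(X_t < 1\) is at most \((1 + \ln X_0)/\delta \). Applied with the drift established above this gives:

```latex
\[
  \text{E}(T) \le \frac{1 + \ln w_0}{\delta}
  \quad\text{with}\quad
  \delta = \frac{1}{em(n - C(G) + 1)}, \qquad w_0 \le n w_{\max},
\]
\[
  \text{hence}\quad \text{E}(T) = O\bigl((n - C(G) + 1)\, m \log(n w_{\max})\bigr).
\]
```

Since the edge weights are positive integers, \(X_t < 1\) implies \(w(x_t) = 0\), i.e., \(0^m \in M\).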

The following result follows for connected graphs where \(C(G) = 1\).

Corollary 19

On every connected graph \(G=(V,E)\) with n nodes, m edges and edge weights \(w: E \rightarrow {\mathbb {N}}^{+}\), QD operating on the CC feature space locates \(0^m\) (the empty edge set) in expected time of \(O(nm\log (nw_{\max }))\) where \(w_{\max } = \max _{e \in E} w(e)\) is the maximum edge weight.

The time bound matches the one for the expected time until the population of GSEMO contains \(0^m\) (see [19, Theorem 1]).

We remark that the preliminary version of this work published at GECCO 2023 [20] claims an upper bound of \(O(nm \log n)\) in its Theorem 6.1. This result matches the above statement when all weights are polynomial in n. Regrettably, the proof in [20] is flawed, as mutations adding edges while also removing other edges may increase the total number of edges. Hence, the argument used in [20] that QD may remove all edges via mutations flipping one bit is not sufficient. The authors thank Frank Neumann for pointing this out. To our knowledge, \(O(nm \log (n w_{\max }))\) is the best known upper bound on the expected time of the (1+1) EA and GSEMO for finding an empty selection of edges in the MST problem [19], along with a cruder bound of \(O(nm^2 \log n)\) due to Reichel and Skutella [42] that removes the dependence on the weights. We conjecture that a time bound of \(O(nm \log n)\) holds for all weights, but we currently do not have a proof and therefore have to leave this as an open problem. The following theorem bounds the expected time until an MSF of G is found. In addition, the proof shows that optimal solutions are found for all cells, i.e., QD finds minimum-weight edge selections that induce i connected components for all values of \(i = C(G), \ldots , n\). We therefore obtain a bound on the expected optimal-cover time. An MSF corresponds to an optimal solution of cell \(i = C(G)\).

Theorem 20

Let \(G=(V,E)\) be an undirected graph with n nodes, m edges, edge weights \(w: E \rightarrow {\mathbb {N}}^{+}\), and \(C(G) \in \{1, \ldots , n\}\) connected components. The expected optimal-cover time of QD operating on the CC feature space and starting with \(0^m \in M\) is \(O((n-C(G)+1)nm)\). In particular, QD finds an MSF of G in expected time \(O((n-C(G)+1)nm)\).

Proof

For \(i = C(G), \ldots , n\) let \(x_i^{*}\) be a minimum-weight edge selection inducing i connected components, i.e.,

$$\begin{aligned} cc(x_i^{*}) = i \quad \text {and} \quad w(x_i^{*}) = \min _{\begin{array}{c} x \in S,\\ cc(x)=i \end{array}} w(x). \end{aligned}$$

It holds that \(x_n^{*}=0^m\) since \(cc(0^m)=n\) and \(0^m\) is the empty edge set; all edge weights are positive and thus adding edges can only worsen the fitness. Clearly, \(x_{C(G)}^{*}\) is a minimum spanning forest of the input graph. Given \(x_i^{*}, i \in \{C(G)+1, \ldots , n\}\) we can reach \(x_{i-1}^{*}\) by injecting an edge of minimal weight into \(x_i^{*}\) which does reduce the number of CCs by one or, put differently, does not close a cycle. We call such a step a Kruskal-step as it is exactly the way Kruskal’s well-known MST algorithm builds an MST (see, e.g., [40]). Now, starting with \(x_n^{*}\) we can create the sequence \(x_{n-1}^{*}, x_{n-2}^{*}, \ldots , x_{C(G)+1}^{*}, x_{C(G)}^{*}\) by \(n-C(G)\) consecutive Kruskal-steps. Having \(x_i^{*} \in M\) the probability for a Kruskal-step is at least \(1/((n-C(G)+1)em)\) as we need to select the right cell and flip exactly one bit. In consequence, optimal solutions for all cells, including an MSF, are found after at most \((n-C(G))(n-C(G)+1)em = O((n-C(G)+1)^2m)\) steps in expectation. \(\square \)

Combining the results of Theorem 18 and Theorem 20 yields the following result.

Corollary 21

On every graph \(G=(V,E)\) with n nodes, m edges, edge weights \(w: E \rightarrow {\mathbb {N}}^{+}\) with \(w(e) \le 2^{O(n)}\) for all \(e \in E\), and \(C(G) \in \{1, \ldots , n\}\) connected components, QD operating on the CC feature space and starting with an arbitrary initial solution finds optimal solutions for all cells, including an MSF of G, in expected time \(O((n-C(G)+1)m\max \{n-C(G)+1,\log n\})\).

Corollary 22

On every connected graph \(G=(V,E)\) with n nodes, m edges and edge weights \(w: E \rightarrow {\mathbb {N}}^{+}\) with \(w(e) \le 2^{O(n)}\) for all \(e \in E\), QD operating on the CC feature space and starting with an arbitrary initial solution finds optimal solutions for all cells, including an MSF of G, in expected time \(O(n^2m)\).

For edge weights bounded from above by \(2^{O(n)}\) this matches the result by Neumann and Wegener for (G)SEMO [19].

6 Conclusion

We performed runtime analyses of a simple quality diversity algorithm termed QD in the context of pseudo-Boolean optimisation and combinatorial optimisation. For the number-of-ones feature space with granularity parameter k we showed an upper bound on the expected time until all L cells are covered, for different values of k. We gave upper bounds on the expected optimisation time on OneMax, functions of unitation and monotone functions, for which the feature space is aligned favourably with the structure of the problem.

Applying a transformation to such problems may diminish this favourable alignment. On easy functions like OneMax, a general fitness-level bound shows that performance is unaffected by transformations, at least asymptotically speaking. While Trap can be optimised in expected time \(O(n^2 \log n)\), after applying a worst-case transformation the expected time increases to \(\Theta ((1/p_m)^{\lfloor n/2 \rfloor }n)\), which is still better than the expected time of \((1/p_m)^{n}\) for a (1+1) EA by an exponential factor. We also saw that including two complementary search points like \(0^n\) and \(1^n\) can exponentially decrease the expected optimisation time on the hardest possible problems.

We showed that QD finds a \((1-1/e)\)-approximation for the task of monotone submodular function maximisation, given a single uniform cardinality constraint.

Finally, we considered the problem of finding a minimum spanning forest for edge-weighted graphs G with \(C(G) \in \{1, \ldots , n\}\) connected components. Here, a two-dimensional feature space spanned by the number of connected components and the number of edges used in the solutions yields a bound of \(O(m^2n^2)\) on connected graphs by imitating Kruskal’s well-known MST algorithm. However, we showed that QD using only the CC-based dimension yields a significantly better upper bound of \(O((n-C(G)+1)m\max \{n-C(G)+1,\log n\})\) for integer weights up to \(2^{O(n)}\). On connected graphs this runtime bound simplifies to \(O(n^2m)\). Interestingly, we find that the working principles of QD are similar to those of GSEMO while its implementation is simpler and there is no need to use the concept of Pareto-dominance.

We see this paper as a starting point for a deep dive into the theoretical study of QD algorithms. One important question for future work is to investigate the alignment between the feature space and the problem structure in more depth, to see how the efficiency of QD is affected if there is a mismatch between the two. Straightforward extensions of this work might consider more combinatorial optimisation problems, e.g., graph colouring problems, the study of different (multi-)dimensional feature spaces, and supplementary experiments. A first step in this direction was undertaken recently [43], where, in the context of monotone submodular function optimisation, the 1-NoO feature space was generalised by allowing users to add Boolean formulas to guide the optimisation process, leading to several overlaid problems being optimised in parallel. We also see different directions for methodological/algorithmic extensions, e.g., keeping multiple solutions per cell or combining classic population-based EAs with quality diversity approaches. Another promising research direction is identifying general relationships between QD and GSEMO.