1 Introduction

Search-based software testing (SBST) is an effective method for testing complex systems with large input spaces (Zeller 2017). SBST employs metaheuristics, such as evolutionary algorithms (Ben Abdessalem et al. 2018; Borg et al. 2021; Humeniuk et al. 2022; Klück et al. 2019; Moghadam et al. 2021, 2023; Riccio and Tonella 2020; Sorokin and Kerscher 2024), to identify failure-revealing test cases. It formulates testing as an optimization problem, capturing system safety requirements through multiple objectives that are often optimized using Pareto-based algorithms.

Pareto-based SBST techniques are generally assessed by their effectiveness in identifying failures, focusing on the quantity and diversity of the failures detected. However, we still do not know if Pareto-based SBST is capable of achieving adequate coverage across the space of failure-revealing test inputs. Ideally, we are interested in a testing method that can achieve high coverage over the space of failure-revealing test inputs. That is, it can identify failure-revealing test inputs distributed across the entire failure-inducing area of a search domain, rather than clustered in one specific sub-area.

The identification of such diverse failing test inputs can support the detection of underlying failure causes and conditions leading to failures (Ben Abdessalem et al. 2018; Jodat et al. 2024). Existing research on software testing views the identification of diverse failures as an important testing objective (Aghababaeyan et al. 2023; Feldt et al. 2016). For example, Aghababaeyan et al. (2023) argue that we are more likely to identify diverse failures of a system if we identify diverse test inputs.

Several Pareto-based SBST algorithms use heuristics to make the search more explorative with the original goal of escaping local optima. This approach is consistent with the broader focus within the evolutionary search community on balancing the trade-off between exploration and exploitation in search algorithms. By investigating various heuristics and hyperparameter tunings, researchers aim to adjust the behavior of algorithms across the spectrum of exploration and exploitation, which is crucial for enabling algorithms to avoid local optima (Črepinšek et al. 2013). Exploring the trade-off between exploration and exploitation has an important practical use case in software testing as well, since increasing exploration within often exploitative Pareto-based algorithms has been shown to lead to the discovery of more diverse failures. For example, Clune et al. (2005, 2008) propose making genetic operators in evolutionary algorithms more explorative by increasing mutation rates when the performance of the search shows marginal improvement. Multiple approaches combine Pareto-based optimization with novelty search (Mouret 2011; Riccio and Tonella 2020; Zohdinasab et al. 2023, 2021) and maximize the distance of candidate solutions to previously found test inputs to escape local optima in the fitness landscape. However, it has never been studied how well Pareto-based algorithms cover the failure-inducing regions within a search domain.

In this paper, we aim to study Pareto-based SBST techniques in terms of their capability to cover the failure-inducing test inputs within a search domain by addressing the following question:

RQ: Can Pareto-driven search-based testing effectively cover the failure-revealing test inputs within a given search domain?

We first present a theoretical argument explaining why Pareto-based SBST algorithms cannot achieve high coverage of failure-revealing tests within the search space. Our argument builds on a definition of an SBST problem characterized by the following two assumptions:

(A1): The problem involves optimizing multiple quantitative fitness functions simultaneously using the principle of Pareto optimality.

(A2): The test oracle function of the SBST problem specifies subsets of the objective space as the failure regions such that at least one such subset has the same dimensionality as the objective space, i.e., the dimensionality is equal to the number of fitness functions.

For example, consider an SBST problem with two real-valued fitness functions \(f_1\) and \(f_2\), and a test oracle function O. Suppose O indicates a test input v as a failure if the following constraint holds: \( (f_1(v) > 5) \vee (f_2(v) < 10)\). The test oracle O satisfies assumption A2 since it specifies a two-dimensional space as a failure region. Briefly, our argument indicates that any Pareto-based optimization algorithm for \(f_1\) and \(f_2\) ultimately identifies a Pareto front that intersects the two-dimensional failure region in only one dimension, which likely cannot ensure good coverage of the failure region defined by O. Other than assumptions A1 and A2, our argument does not depend on the number and definition of fitness functions, or on the heuristics used to approximate the optimal solution. We note that assumption A2 is not constraining, as test oracles for many SBST formulations applied to cyber-physical systems and Deep Learning (DL) systems conform to this assumption (Ben Abdessalem et al. 2016, 2018; Borg et al. 2021; Ebadi et al. 2021; Jahangirova et al. 2021), making our argument valid for these existing SBST formulations.

In addition to the theoretical argument, we investigate our research question empirically. We consider three Pareto-based algorithms that have been previously used for software testing: 1) NSGA-II (Deb et al. 2002), a genetic algorithm that has been widely applied for testing complex systems (Ben Abdessalem et al. 2016; Klück et al. 2019); 2) an extension of NSGA-II that implements the concepts of novelty search combined with a diversification operator (Riccio and Tonella 2020); and 3) the swarm-based Pareto-optimization algorithm OMOPSO (Sierra and Coello Coello 2005), which extends the well-known Particle Swarm Optimization algorithm to multi-objective problems. We apply these three algorithms to two case studies representing two SBST problems: an industrial automated driving system (ADS) case study, the Automated Valet Parking (AVP) system (Bosch 2023), which is integrated into the Prescan simulator (Siemens 2023), and a handwritten digit-recognition DL system (Lecun et al. 1998) that uses the MNIST dataset (LeCun and Cortes 2005) as input. We compare these three algorithms with a naïve random testing baseline, which is considered a standard baseline in the search-based software engineering community. In particular, we investigate whether the three algorithms can outperform naïve random testing in terms of covering the failure-inducing test inputs within a search domain. These algorithms or their close variants have already been compared with a random testing baseline regarding the quantity and diversity of failures, where failure diversity is measured, e.g., by calculating the average Euclidean distance between inputs (Ebadi et al. 2021) or by applying the Jaccard similarity metric (Humeniuk et al. 2023). However, there is no prior comparison with respect to the coverage of failure-inducing test inputs.

Although there are metrics for measuring diversity, none has been used to evaluate search algorithms by measuring how well they cover failure-inducing regions in the test input space. This is because measuring such coverage requires an approximation of the failure-inducing regions, which is a more difficult problem than finding individual failures. To measure the coverage of the failure-inducing areas of the input space, we propose a metric that we refer to as Coverage Inverted Distance (CID). We use this metric only to understand how different Pareto-based algorithms compare in terms of coverage; otherwise, as we detail below, computing this metric may not be feasible in general. To compute CID, we create a reference set that approximates the failure-inducing regions within a search domain. Given such a reference set and a solution set computed by a testing method, CID measures how well the solution set covers the reference set by computing the average distance from each point in the reference set to the closest point in the solution set. The smaller the CID value, the better the solution set covers the space of all failure-revealing test inputs. To build the reference set, we generate many inputs evenly across the search space (Nabhan et al. 2019) such that these input points approximate the search space as a uniform grid. Among all the generated inputs, those that fail represent the reference set approximating the failure-inducing areas of a search domain. While computing a reference set is feasible for our two case studies, it is likely infeasible for larger and more complex systems, as it requires running the system under test across numerous points in the search space. Hence, we do not propose CID as a general-purpose metric for evaluating test generation algorithms on arbitrary case studies.

The accuracy of the CID metric depends on how well the reference set approximates the actual failure-inducing regions. To mitigate potential threats related to the accuracy of CID in our study, we mathematically explain the relationship between CID’s accuracy and the resolution of reference sets and specify conditions under which CID’s error approaches zero. In addition, we present useful properties of CID, highlighting its robustness to the choice of reference set, regardless of the method used to generate the points in the input space from which the reference set is computed. Moreover, CID is independent of the size of the test set generated by a testing approach, enabling the comparability of CID values across test sets of different sizes.

Results

For each of our two case studies, we define three alternative test oracle functions, each specifying a different failure region. These functions represent increasingly strict failure criteria, allowing us to obtain experimental results that reflect varying difficulty levels in covering these regions in our case studies. Our results show that the three Pareto-based testing techniques fare no better than naïve random testing in covering the input space regions that include failure-revealing test inputs:

  • The results show that, for the MNIST case study, naïve random testing covers the failure-inducing regions defined by all the alternative test oracles more effectively than NSGA-II. For the AVP case study, naïve random testing surpasses NSGA-II for one test oracle, and for the other two test oracles, which specify smaller and more restricted failure regions, both methods perform equally. These results indicate that NSGA-II cannot outperform random search in terms of coverage of failure regions even when we vary the size of the failure-inducing regions.

  • Incorporating a diversity-focused fitness function and a re-population operator in NSGA-II improves, on average, the coverage difference between NSGA-II and random search by 52.1%. However, NSGA-II, even after being extended by these diversification mechanisms, still cannot outperform random search in covering the failure-inducing regions of our two case studies and for the three different test oracle functions of these case studies.

  • The swarm-based search method, OMOPSO, improves the coverage of failing test inputs compared to NSGA-II but, similar to NSGA-II, OMOPSO still does not perform better than random search.

The outline of the paper is as follows: In Section 2, we present the preliminaries. In Section 3, we present related work, followed by our argument in Section 4 regarding why Pareto-based algorithms cannot achieve high coverage of failure-revealing tests. In Section 5, we introduce a coverage metric to assess the coverage of failures in the input space. In Section 6, we compare, using our novel metric, Pareto-driven testing with random testing in terms of the coverage of failure-inducing test inputs. In Section 7, we discuss the most important threats to the validity of our results. In Section 8, we reflect on the lessons learned, and we conclude in Section 9. In the appendix, we provide proofs related to our metric.

2 SBST Problem Definition and Assumptions

In this section, we introduce the terminology and formal notation used for our argumentation in Section 4. We start with the definition of an SBST problem.

Definition 1

We define a search-based software testing (SBST) problem as a tuple \(P = (S, D, F, O, SO)\), where

  • S is the system under test.

  • \(D \subseteq \mathbb {R}^n\) is the search domain, where n is the dimension of the search space. An element \(x = (x_1, \ldots , x_n) \in D\) is called a test input.

  • F is the vector-valued fitness function defined as \(F: D \mapsto \mathbb {R}^m\), \(F(x) = (f_1(x),\ldots , f_m(x))\), where \(f_i\) is a scalar fitness function (or fitness function for short) which assigns to each test input a quantitative value, and \(\mathbb {R}^m\) is the objective space. A fitness function evaluates how fit a test input is. This captures the assumption A1 described in Section 1.

  • O is the test oracle function, \(O : \mathbb {R}^m \mapsto \lbrace 0,1 \rbrace \), which evaluates whether a test passes or fails. A test that fails is called failure-revealing. We assume that 1 indicates fail and 0 indicates pass.

  • The set of all failure-revealing test inputs is called the domain of interest (DOI). Note that DOI is a subset of the search space (i.e., \(DOI \subseteq D\)). The image of the domain of interest, F(DOI), is called the codomain of interest (COI). Note that \(COI \subseteq \mathbb {R}^m\), and for every \(y \in COI\), we have O(y) = 1.

  • The test oracle O is defined such that COI has the same dimensionality as \(\mathbb {R}^m\). This captures the assumption A2 described in Section 1.

  • SO is the search objective. A search objective refers to a specific goal or criterion that guides the search process in finding test inputs that satisfy certain properties of the SUT (e.g., failure of a system).

As defined above, our definition of SBST problems involves multiple fitness functions. Solving such SBST problems requires a Pareto-based optimization algorithm.

Definition 2

(Pareto-based Optimization) A Pareto-based optimization problem is defined as

$$\begin{aligned} \min _{x \in X} F(x) = (f_1(x), \ldots , f_m(x)) \end{aligned}$$
(1)

where \(f_i\) is an objective function and \(X \subseteq \mathbb {R}^n\) is called the feasible solution set, with n the dimensionality of the search domain. A solution x with \(F(x) = (v_1,\ldots , v_m)\) is said to dominate another solution \(x'\) with \(F(x') =(v_1',\ldots , v_m')\) if and only if x is better than \(x'\) in at least one fitness value and not worse in the remaining fitness values, i.e., \(\exists v_i. (v_i < v_i') \wedge \forall v_j.(v_j \le v_j')\). A solution x is called Pareto optimal if no solution exists that dominates x. The set of all Pareto optimal solutions is called the Pareto set (PS), while the image of the Pareto set, F(PS), constitutes the Pareto front (PF).
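
To make the dominance relation concrete, the following minimal Python sketch (our own illustration, not tied to any particular optimization library; function names are ours) checks dominance under minimization and extracts the non-dominated subset of a finite set of fitness vectors.

```python
from typing import Sequence

def dominates(v: Sequence[float], w: Sequence[float]) -> bool:
    """True if fitness vector v Pareto-dominates w (minimization):
    v is no worse than w in every objective and strictly better in at least one."""
    return all(a <= b for a, b in zip(v, w)) and any(a < b for a, b in zip(v, w))

def pareto_front(points):
    """Return the non-dominated subset of a finite set of fitness vectors."""
    return [p for p in points
            if not any(dominates(q, p) for q in points if q is not p)]

# Example: with two objectives, (1, 4) and (3, 2) are mutually non-dominated,
# while (3, 5) is dominated by both.
print(pareto_front([(1, 4), (3, 2), (3, 5)]))  # [(1, 4), (3, 2)]
```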

As discussed in Section 1, our argument in Section 4 builds on two assumptions, A1 and A2. Assumption A1 states that the problem involves optimizing multiple quantitative fitness functions simultaneously using the principle of Pareto optimality, and assumption A2 states that the test oracle O is defined such that COI has the same dimensionality as \(\mathbb {R}^m\). Both assumptions are included in Definition 1. Both of our case studies, AVP and MNIST, satisfy assumptions A1 and A2. Below, we illustrate how our AVP case study can be captured using our definition of an SBST problem, i.e., Definition 1. Our second case study, which concerns testing a digit classification system, is described in Section 6.1.

Example

Our first case study system is an AVP system (Bosch 2023). An AVP system is a feature added to a car (the ego vehicle) that parks the ego vehicle without human intervention in a predefined parking spot by following a precalculated trajectory. Our objective is to evaluate the AVP system to verify whether it meets the following safety requirement:

R =“When the ego vehicle’s velocity exceeds threshold \(th_1\), AVP must ensure a minimum distance of \(th_2\) from both static and dynamic objects, including other vehicles and pedestrians.”

The values for parameters \(th_1\) and \(th_2\) are determined by the system’s specific context and set by a domain expert. We execute AVP in a simulation environment, exposing the system to various situations known as scenarios (Ulbrich et al. 2015). In particular, we test AVP using a scenario where a pedestrian crosses the ego vehicle’s trajectory from the right of the ego vehicle’s lane. We parameterize this scenario so that we can generate different instances of this scenario to stress the AVP system. We define the SBST elements from Definition 1 for AVP as follows:

  • The search domain D of AVP is defined by three search variables: \(x_1\), the velocity of the ego vehicle; \(x_2\), the velocity of the pedestrian; and \(x_3\), the time when the pedestrian starts moving. Each search variable \(x_i\) has a designated range \(D_i\). Thus, the search space is defined as follows: \(D = D_1 \times D_2 \times D_3\), where \(D_1 = [0.3 m/s, 3 m/s]\), \(D_2 = [0.5 m/s,2 m/s]\) and \(D_3 = [0s,5s]\).

  • We define two fitness functions: \(f_1\) is the adapted distance between the ego vehicle and the pedestrian, and \(f_2\) is the velocity of the ego vehicle at the time of the minimal distance between the ego vehicle and the pedestrian. An adapted distance function combines the longitudinal and latitudinal distances between the ego vehicle and a pedestrian into a single number, assigning more optimal values to test cases where the pedestrian is located in front of the car. To compute a fitness value for a test input, AVP must be run in a simulation environment. The simulator provides outputs, including the position and velocity traces of both the ego vehicle and the pedestrian, from which the fitness values are calculated.

  • The test oracle function is defined as \(O(x) = f_1(x)< th_1 \wedge f_2(x) < th_2\), where x is a test input, and the parameters \(th_1\) and \(th_2\) are determined based on the threshold values in the AVP safety requirement R described earlier. The test oracle function evaluates whether or not a test input violates requirement R.

  • DOI is the set of all test inputs that lead to the violation of the safety requirement. That is, DOI is the set of all failure-revealing test inputs. COI, on the other hand, is the set of the fitness values of the test inputs in DOI. We note that, in general, we cannot compute DOI for a given search-based testing problem. As we discuss in Section 5.1, we propose a way to approximate DOI for our case studies. We refer to the approximation of DOI as a reference set.

  • The search objective SO is to cover the DOI as well as possible, i.e., to identify diverse test inputs leading to the violation of the safety requirement. When a Pareto-based optimization algorithm (Definition 2) is used for testing, the search objective is often defined as finding optimal fitness values. However, as motivated in Section 1, our objective in this paper and for our case studies is to effectively cover the DOI, i.e., the set of failure-revealing test inputs. A minimal code sketch of these elements follows below.
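
The sketch below is our own simplification of how these AVP elements translate into code: the threshold values and the sign conventions of the fitness values are placeholders, and the fitness computation itself is only stubbed out, since in the real setup it requires running AVP in the Prescan simulator.

```python
import random

# Search domain D = D1 x D2 x D3: ego velocity, pedestrian velocity, pedestrian start time
BOUNDS = [(0.3, 3.0), (0.5, 2.0), (0.0, 5.0)]

# Thresholds th1, th2 from requirement R; the concrete values are set by a
# domain expert and are placeholders here.
TH1, TH2 = 1.0, 1.0

def fitness(x):
    """Placeholder for F(x) = (f1(x), f2(x)): f1 is the adapted ego-pedestrian
    distance, f2 the ego velocity at the time of minimal distance, both
    extracted from simulation traces in the real setup."""
    raise NotImplementedError("requires a simulation run of the AVP system")

def oracle(f1, f2):
    """Test oracle O: returns 1 (fail) if the test violates requirement R."""
    return int(f1 < TH1 and f2 < TH2)

def random_test_input():
    """Draw one test input x = (x1, x2, x3) uniformly from D."""
    return tuple(random.uniform(lo, hi) for lo, hi in BOUNDS)
```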

3 Related Work

In this section, we outline two streams of related work. In the first part, we discuss SBST approaches for generating diverse test data, highlighting both Pareto-based and non-Pareto-based methods. In the second part, we discuss metrics used to assess the diversity and coverage of SBST testing techniques.

Table 1 Classification of the related SBST approaches based on their underlying algorithm, the metric they use to measure the diversity of the generated tests, and whether or not they rely on Pareto-based optimization

3.1 Diversified SBST Approaches

Table 1 provides an overview of SBST techniques aimed at generating diverse tests and failures. Below, we briefly discuss these approaches and contrast them with our work.

DeepJanus (Riccio and Tonella 2020) is an NSGA-II-based search approach designed to identify the boundaries of the failure-revealing regions in DL-based systems. In particular, it extends a Pareto-based genetic algorithm with concepts from novelty search (Lehman and Stanley 2011b; Mouret 2011). While Pareto-driven testing promises to identify interesting tests by optimizing one or multiple fitness functions, novelty search requires no fitness function: the identification of promising solutions is guided solely by the novelty of a solution. Novelty is quantified by a distance measure between the candidate and the solutions identified in previous iterations of the algorithm (the archive). In addition, DeepJanus implements a repopulation operator, which, in every generation, replaces a portion of the dominated individuals with newly generated candidates. The repopulation operator is meant to diversify the search and help it escape local optima. In our study, in order to diversify NSGA-II, we incorporate novelty search into NSGA-II by using the diversity-aware fitness function of DeepJanus, which measures the Euclidean distance between a test input and the archive of previously found tests. In addition, we use the repopulation operator from DeepJanus to increase the diversity of the solution set in our diversified NSGA-II.

Our results show that although incorporating a diversity-aware fitness function and the repopulation operator improves the average coverage difference between NSGA-II and random search, NSGA-II – even when augmented with these diversification measures – still cannot outperform random search in covering failure-revealing test inputs.

The approaches by Birchler et al. (2023) and Lu et al. (2021) also apply a diversified fitness function to identify more diverse failures; however, these studies are concerned with test case prioritization and not test case generation, which is our objective.

DeepHyperion (Zohdinasab et al. 2021) is a search-based testing approach adapting illumination search (Mouret and Clune 2015) for testing deep-learning systems. Illumination search is an optimization approach that utilizes human-interpretable feature maps to identify failure-revealing test inputs. The feature space represents important and human-interpretable characteristics of test inputs. While test inputs are not directly generated in the feature space, they can be mapped to the feature space and positioned in the feature map. Based on already evaluated test inputs in the feature map, new test inputs are selected for subsequent search iterations. Since our paper focuses on studying the coverage of Pareto-based optimization algorithms and DeepHyperion is not a Pareto-based algorithm, we do not include its underlying algorithm in our study. Further, the metrics proposed by Zohdinasab et al. (2021) do not assess the coverage of failure-inducing regions. They propose two metrics: one assesses the diversity of failures within the feature map, and the other evaluates the diversity of the covered cells of the feature map. Their metrics differ from ours for two main reasons: i) they are defined over the feature space rather than the input space, and metrics over the feature space cannot directly indicate how much of the input space is explored or covered; ii) the second metric by Zohdinasab et al. does not approximate the set of actual failures as we do through our reference set. Hence, their metrics cannot assess how well the identified failures represent the actual failures.

DeepAtash (Zohdinasab et al. 2023) is a Pareto-based testing approach for deep learning systems. Like DeepHyperion, DeepAtash utilizes feature maps, enabling targeted searches in specific cells within the feature map to identify diverse solutions. We do not include DeepAtash in our study because it is not designed to increase the diversity of identified failures; instead, it aims to target the identification of specific failures.

Moghadam et al. (2021) have presented multiple bio-inspired techniques for testing a DNN-based lane-keeping system. Their approaches are based on genetic algorithms (GA and NSGA-II), evolution strategies, and particle swarm optimization (PSO), and have been compared with the Frenetic, GABExploit, GABExplore, and Swat techniques. They assess the diversity of identified failures by computing the average of the maximum Levenshtein distance between road segments that trigger failures. Their results indicate that the presented approaches have comparable performance in terms of producing diverse test inputs. The study does not evaluate the coverage of the failures in the DNN’s input space, but only the diversity of the identified failures. In contrast, in our work, we focus on the coverage of identified failures. Among the algorithms studied by Moghadam et al., we evaluate the performance of OMOPSO, which is an extension of the particle swarm algorithm to multi-objective optimization problems and therefore a Pareto-based algorithm.

The closest work to ours is the study by Nabhan et al. (2019). Similar to our work, they evaluate the coverage of the failure-inducing regions of the input space by approximating the failure-inducing test input space (the DOI) using grid sampling. To assess how well a solution set of an algorithm covers the failures of a system, they calculate the portion of failure-revealing grid points that have a witness in the solution set. A grid point is said to have a witness in the solution set if the distance to the closest test input within the set does not exceed a fixed threshold. However, Nabhan et al.’s paper does not study Pareto-based testing techniques in terms of the coverage of the test input space, which is the primary focus of our work.

3.2 Diversity and Coverage Metrics for SBST

Several metrics are proposed to assess Pareto-based testing algorithms with respect to the objective space, which is the space of fitness/objective values (Li et al. 2022; Li and Yao 2019). In contrast, since in our paper we focus on the input space, we have included in Table 2 an overview of the approaches proposed to assess the capabilities of SBST techniques in terms of diversity and coverage when the focus is either on the input space or the feature space, i.e., a representation of the input data.

Below, we discuss these techniques and contrast the metric they use to measure diversity or coverage with our Coverage Inverted Distance (CID) metric.

Table 2 Classification of existing studies assessing test case diversity and coverage of SBST algorithms based on the following criteria: their underlying search algorithms, the metrics they use to measure diversity or coverage, whether the metric measures diversity, and whether the metric measures coverage

Feldt et al. (2016) have proposed a metric to evaluate the diversity of a test suite. The metric is based on the Normalized Compression Distance (NCD), a distance metric grounded in information theory that can be applied to any type of test data. However, this metric is not meant for assessing the coverage of the failure-revealing test input space (the DOI), but rather the diversity of test data. While diversity correlates with coverage, in our approach, we focus on assessing the coverage of failure-inducing regions with the goal of understanding the capabilities of Pareto-based algorithms in covering these regions.

Humeniuk et al. (2023, 2024) have evaluated the diversity of generated tests on a lane-keeping system case study using Pareto-optimization-based search (NSGA-II) compared to random search (RS). Using their metric based on Jaccard similarity, they have shown that, contrary to several prior studies, RS indeed generates more diverse tests than NSGA-II. While in our work, we study coverage instead of diversity, the findings of Humeniuk et al. are in line with our argument and our empirical results, demonstrating the weakness of Pareto-based optimization in covering failure-inducing regions of the search space.

Aghababaeyan et al. (2023) have argued that diverse test inputs most likely point to diverse faults and evaluated three different metrics in assessing the diversity of image data used for testing deep learning models. They identify the geometric diversity relation as the most expressive metric. Unlike our metric, the metrics proposed by Aghababaeyan et al. do not evaluate the coverage of failures (Aghababaeyan et al. 2023).

Biagiola et al. (2019) devised an SBST algorithm to maximize diversity in test inputs for generating test cases for testing web applications. However, the approach targets code coverage (branch/statement coverage) rather than the coverage of failures, which is why the proposed coverage evaluation technique is not applicable to our study.

Hungar (2020) has noted the importance of covering failure-revealing test input spaces for ADS and proposes a binary coverage criterion for the search space. The criterion provides an assessment as either covered or not covered and, therefore, cannot be used as a quantitative measure of DOI coverage.

Ramakrishna et al. (2022) define a diversity metric that is based on clustering of ADS testing results. This metric is defined by the number of clusters and includes a \(risk\ score\) for each test input or scenario. The approach requires the number of clusters to be provided as input to the clustering technique. However, the process for determining the number of clusters is unclear, as is the potential impact of this predetermined number of clusters on the diversity assessment results.

Marculescu et al. (2016) focus on the exploration of the objective space and compare exploration-based evolutionary search algorithms, such as novelty search and Illumination Search (IS), with Differential Evolution (DE), an optimization algorithm. Their results show that IS and DE explore larger and different regions of the objective space (the COI in our context) compared to solely optimization-based algorithms. Nevertheless, this study does not evaluate the coverage of the DOI, in particular when using Pareto-based search approaches, which is the focus of our work.

Feldt and Poulding (2017) have studied the feature space (Mouret and Clune 2015) coverage when using randomized, optimization-based, and hybrid testing approaches such as Nested Monte Carlo Tree Search (Browne et al. 2012). Their findings show that randomized testing exhibits similar coverage results compared to pure optimization-driven testing. However, their notion of coverage differs from ours. Their metric is similar to that of Zohdinasab et al. (2021), which is focused on the feature space rather than the input space and does not approximate the set of actual failures as we do through our reference set.

Ebadi et al. (2021) have investigated the diversity of generated test data when testing an ADS with a genetic algorithm compared to random search. By evaluating the average Euclidean distance of identified failing tests, their findings show that genetic search outperforms randomized search in producing diverse test inputs. However, unlike our study, they do not assess the coverage abilities of the testing techniques. Notably, their results deviate from the study by Humeniuk et al. (2023), which has shown that RS outperforms NSGA-II in terms of the diversity of test inputs.

Neelofar and Aleti (2024b, 2024a) have proposed a set of adequacy metrics to assess the diversity and coverage of test suites for testing DL-enabled systems. The metrics project test data from a multidimensional feature space into a two-dimensional space called the instance space and evaluate coverage and diversity using information-theoretic metrics. However, similar to the coverage metrics proposed by Feldt and Poulding (2017) and Zohdinasab et al. (2021), in these papers, coverage is not assessed by contrasting the identified failures with the actual failures, since these approaches do not approximate the set of actual failures as we do through our reference set. Without knowing the space of actual failures, we cannot provide any information about the coverage of the failure-inducing space. These existing coverage metrics only compare the space of identified failures with the entire feature or instance space. This information, while useful, does not indicate what portion of the actual failures is found by a testing algorithm. Our study, for the first time, attempts to address this question for Pareto-based optimization techniques.

4 Theoretical Argumentation

We claim that Pareto-optimal solutions computed by Pareto-based approaches do not necessarily lead to an adequate coverage of the DOI defined in Definition 1, i.e., the set of failure-revealing test inputs. Our argumentation builds on the definitions of the SBST problem (Definition 1) and Pareto-based optimization (Definition 2).

Fig. 1

Illustrating the limitation of Pareto-based optimization algorithms in covering failures: Figure a) illustrates how a Pareto front (PF) and a codomain of interest (COI) are located in the objective space. Figure b) shows how a Pareto set (PS) and a domain of interest (DOI) are located in the search space. The dimensionalities of PF and PS are, respectively, lower than the dimensionalities of COI and DOI. Hence, PF cannot effectively cover COI, and PS cannot effectively cover DOI

We use Figs. 1a and 1b to illustrate our argument. These figures, respectively, show the objective and search spaces for a Pareto-based optimization problem with two fitness functions, i.e., \(m=2\), and two input variables, i.e., \(n=2\). The grey area in Fig. 1a represents the failure region identified by the test oracle function O, i.e., the subset of the objective space encompassing failures. The portion of the grey area dominated by the true Pareto front (PF) and delineated with dashed lines in Fig. 1a represents the set of reachable failures, which is called the codomain of interest (COI) in Definition 1. Due to assumption A2 in Definition 1, we assume that COI has the same dimensionality as the objective space. That is, the number of dimensions of both COI and the objective space is m, which is two in our example. The PS line and the DOI area in Fig. 1b, respectively, show the mappings of PF and COI into the search space, i.e., the input space. Assuming that there is no redundancy in search variables and fitness functions, which is achievable through global sensitivity analysis (Matinnejad et al. 2014), the DOI has the same number of dimensions as the search space. That is, the number of dimensions of both DOI and the input space is n, which is two in our example. However, PF, computed within \(\mathbb {R}^m\), has at most \(m-1\) dimensions. Consequently, PF with at most \(m-1\) dimensions cannot effectively cover a COI with m dimensions. Similarly, PS, whose image is PF, has at most \(n-1\) dimensions. Therefore, PS with at most \(n-1\) dimensions cannot effectively cover a DOI with n dimensions.

In summary, using a Pareto-based optimization algorithm results in computing PS in the search space and PF in the objective space. However, PS cannot achieve good coverage of DOI, just as PF falls short in covering COI, as evidenced by our argument above and the illustrations in Figs. 1a and 1b. Therefore, a Pareto-based optimization algorithm is not suitable for achieving adequate DOI coverage.
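
The dimensionality gap can also be illustrated numerically on a toy two-objective problem of our own construction (not one of the case studies): the non-dominated subset of a large random sample concentrates on a thin, roughly one-dimensional band, while a failure region defined by a conjunction of thresholds retains full two-dimensional area.

```python
import random

random.seed(0)

# Toy problem: x = (x1, x2) in [0, 1]^2, two conflicting objectives to minimize.
def F(x):
    x1, x2 = x
    return (x1, (1.0 - x1) + 0.3 * x2)  # simple trade-off between f1 and f2

def dominates(v, w):
    return all(a <= b for a, b in zip(v, w)) and any(a < b for a, b in zip(v, w))

samples = [(random.random(), random.random()) for _ in range(1000)]
objs = [F(x) for x in samples]

front = [v for v in objs if not any(dominates(w, v) for w in objs if w is not v)]

# A toy "failure region" with the same dimensionality as the objective space.
failing = [v for v in objs if v[0] < 0.5 and v[1] < 0.9]

print(f"non-dominated points: {len(front)} of {len(objs)}")
print(f"points in the 2-D failure region: {len(failing)}")
# The front is a thin curve (a handful of points near f2 = 1 - f1), whereas the
# failure region covers a full-area patch of the objective space.
```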

5 Metric for DOI Coverage Assessment

To empirically assess the coverage of the actual failing test inputs of an SUT, we introduce the CID metric in this section. CID allows us to quantitatively evaluate the coverage of the DOI by failure-revealing test inputs. In Section 5.1, we introduce CID. In Section 5.2, we explore methods for approximating the set of failures, which serves as the reference set for CID, and discuss the accuracy and robustness of reference sets.

5.1 Metric Definition

In this section, we present Coverage Inverted Distance (CID), a metric specifically designed to assess how well the failures identified by a testing technique cover the actual set of failure-revealing test inputs of a system. That is, CID evaluates how well a given set of failures identified by a testing technique covers the DOI. Let Z be a finite set approximating the DOI for a given SBST problem, and let A be a finite set of solutions generated by a testing method for that SBST problem. In the remainder, we call Z a reference set and A a test set. Our metric CID, defined below, assesses how well the test set A approximates, i.e., covers, the reference set Z:

$$\begin{aligned} CID(A,Z) = \frac{1}{\left|Z \right|} \left( {\sum _{z \in Z} {d_z}^q} \right) ^{\nicefrac {1}{q}} \end{aligned}$$
(2)

where \(d_z = \left( \sum _{j=1}^{n} {|z_j - x_j|}^p \right) ^{\nicefrac {1}{p}}\) represents the distance imposed by the p-norm from a reference point \(z = (z_1, \ldots , z_n) \in Z\) to the nearest point \(x = (x_1, \ldots , x_n)\) from the test set A in the search space, while q represents the norm used for aggregating the distances (i.e., \(q=1\) for the generalized mean, or \(q=2\) for the power mean). If \(q=1\) is chosen, the metric results in the average distance from a reference point to the closest point in A. If, at the same time, \(p=2\) is chosen, then the distance between the points is the Euclidean distance. If no specific characteristics of the system under test are known, the Euclidean distance is the most practical option (Fuangkhon 2022). The lower the CID value, the better the test set represents the reference set. The CID value reaches zero if the test set achieves a complete representation of the reference set.

In our evaluation in Section 6, we use the Euclidean distance to compute the distance between the points in the reference set (Z) and the test points in the solution set (A), i.e., \(p=2\). Further, we use the generalized mean to compute CID values, i.e., \(q=1\).
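
A minimal sketch of the CID computation in (2) with \(p=2\) (Euclidean distance) and \(q=1\) (arithmetic mean), as used in our evaluation; the function name and the use of NumPy are our own choices.

```python
import numpy as np

def cid(test_set, reference_set, p=2, q=1):
    """Coverage Inverted Distance (Eq. 2): for every reference point, take the
    p-norm distance to its nearest point in the test set, then aggregate with q."""
    A = np.asarray(test_set, dtype=float)       # shape (|A|, n)
    Z = np.asarray(reference_set, dtype=float)  # shape (|Z|, n)
    # pairwise p-norm distances, shape (|Z|, |A|)
    dists = np.linalg.norm(Z[:, None, :] - A[None, :, :], ord=p, axis=-1)
    d_z = dists.min(axis=1)                     # nearest test point per reference point
    return (d_z ** q).sum() ** (1.0 / q) / len(Z)

# Lower is better: a test set spread over the reference set yields a smaller CID.
Z = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
print(cid(np.array([[0.5, 0.5]]), Z))  # ~0.707
print(cid(Z, Z))                       # 0.0
```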

While A is computed by a given testing method, the reference set Z should be computed independently from the testing method and should represent the DOI as accurately as possible. In our study, we approximate the reference set Z using a grid sampling approach. Specifically, we discretize the search domain by segmenting each search variable into a predetermined number of equal intervals. We then execute the system under test (e.g., the ADS) for each point in the discretized search domain to determine whether it is passing or failing. The reference set Z is then defined as the set of failing test inputs within the search space.
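
A sketch of this grid-sampling construction for a generic bounded search domain; the helper name and the synthetic oracle below are ours, and `is_failing` stands in for executing the system under test and applying the test oracle, which in our studies requires a simulator or classifier run.

```python
import itertools
import numpy as np

def grid_reference_set(bounds, samples_per_axis, is_failing):
    """Build the reference set Z: grid-sample the search domain and keep the
    failing points."""
    axes = [np.linspace(lo, hi, samples_per_axis) for lo, hi in bounds]
    grid = itertools.product(*axes)
    return np.array([x for x in grid if is_failing(np.array(x))])

def toy_oracle(x):
    # purely illustrative stand-in for "run the SUT, then apply O"
    return x[0] > 2.0 and x[2] < 1.0

bounds = [(0.3, 3.0), (0.5, 2.0), (0.0, 5.0)]   # AVP-like search domain
Z = grid_reference_set(bounds, samples_per_axis=10, is_failing=toy_oracle)
print(len(Z), "failing grid points form the reference set")
```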

We discuss CID and its adequacy using illustrative examples. Figure 2 illustrates three examples in which the DOI is represented as two separate regions, highlighted in pink, and the test set A and reference set Z are illustrated as solid red and empty pink circles, respectively. In Fig. 2(a), A includes a handful of points in each of the regions and provides low coverage of both. In Fig. 2(b), A contains several points exclusively within one region, providing good coverage of that region while not witnessing the other. In Fig. 2(c), A includes multiple points well-distributed across both regions, leading to good coverage of the entire DOI. The reference set is the same for all three examples and is generated by performing grid sampling with a step size of 0.1 for each axis interval. For the first and second examples, the CID values are 0.163 and 0.237, respectively, indicating poor coverage of the DOI by the corresponding test sets. For the third example, CID is 0.084, reflecting good DOI coverage.

5.2 Reference Set Generation and Properties of CID

To apply CID, we need to generate a reference set to approximate the actual set of failures exhibited by the SUT. We can use various sampling techniques to create test inputs that are uniformly scattered across the search space in order to generate this reference set. Note that when we sample points to generate the reference set, we generate values within the search space; sampling in this context means generating a set of points in the search space according to a sampling strategy. One such strategy is grid sampling, discussed in Section 5.1, where the selected test inputs are the nodes of a regular grid within the search space. Alternative sampling strategies are furthest point sampling (FPS) (Eldar et al. 1997) and Poisson disc sampling (Bridson 2007). FPS iteratively generates samples such that the minimal distance from the already sampled points to a new sample is maximized. Poisson disc sampling generates points in a given space while maintaining a user-defined minimal distance between sampled points.

We call a reference set generated by a uniform sampling strategy, such as the three strategies above, a uniform reference set. Consider a DOI containing several disconnected regions. We say a uniform reference set is optimal for a DOI if all the separate regions of the DOI are covered by the reference set. We can show that, for an optimal reference set, the error of CID decreases linearly as the maximal distance between adjacent points in the reference set decreases. As we increase the number of points sampled with a uniform sampling approach, CID’s error tends to zero. Proof sketches of the above statements are available in the appendix. Further, the CID metric has the following two important properties:

  1. CID is not impacted by the size of the test set. This is because, in the definition of CID in (2), \(d_z\) is computed as the distance from each reference point to the closest test point and averaged over the reference set. This property enables us to compare test sets with different sizes;

  2. CID values are robust when we use different reference sets as long as the reference sets are optimal and have equal sizes.

Note that generating a reference set for an SBST problem requires executing the system under test for each sampled point, which can be computationally expensive, particularly for complex systems where each execution may take several minutes. In this paper, we use CID to show the limitation of Pareto-based search-based testing for our DL-enabled case study systems, which have a manageable search domain size and execution time. The applications of CID are further discussed in Section 8.

Fig. 2

Illustrative examples showing the computation of CID: DOI is represented as two separate regions, highlighted in pink in the three examples. The test set, A, is presented as solid red points in all three examples. The reference set, Z, is represented by the non-filled pink circles. The example in (a): A barely covers DOI; the example in (b): A covers one region of DOI well; and the example in (c): A covers both regions of DOI well

6 Empirical Study

While Section 4 presents an argument addressing our research question introduced in Section 1, this section aims to answer our research question through empirical evidence. We assess the effectiveness of Pareto-based testing algorithms by comparing their ability to achieve high coverage of failure-revealing tests with that of baseline random testing. To ensure the validity and diversity of our empirical results, our experiments include alternative Pareto-based testing algorithms, case study systems, and varying definitions of test oracles for these systems.

In particular, we selected three optimization-based testing algorithms, or their variants, which are Pareto-based and have been previously applied to case studies similar to ours, as identified in our literature survey in Table 1. Specifically, we selected the genetic algorithm NSGA-II; DeepJanus, a hybrid genetic algorithm that combines concepts from novelty search (Mouret 2011) with evolutionary search; and OMOPSO (Sierra and Coello Coello 2005), a swarm optimization algorithm. Other algorithms have not been considered in our study as they are not Pareto-based and hence outside our scope. We present our empirical results using the following two sub-RQs:

RQ\(_1\): How does Pareto-driven search-based testing compare to baseline random testing in terms of covering failure-revealing tests? To answer this research question, we use our proposed CID metric to compare, in terms of DOI coverage, the test results obtained by the well-known Pareto-based algorithms NSGA-II and OMOPSO and by the naïve testing algorithm random search (RS). NSGA-II has been extensively used in test automation for ADS and DL-enabled systems (Ben Abdessalem et al. 2016, 2018; Klück et al. 2019; Riccio and Tonella 2020). Similarly, OMOPSO is a Pareto-based swarm optimization technique that extends the well-known Particle Swarm Optimization (PSO) algorithm to multi-objective optimization problems. PSO is a swarm intelligence algorithm that leverages concepts from collective social behavior to identify solutions with optimal fitness values. PSO has been widely applied in various domains, ranging from image segmentation in image processing to wireless communication optimization problems (Shami et al. 2022), and it has also been used for testing ADS (Moghadam et al. 2021). We apply NSGA-II, OMOPSO, and random testing to two case studies: the industrial Automated Valet Parking system introduced in Section 2, and an open-source, DL-based, handwritten digit classification system that is widely used for evaluating testing approaches in the literature (Riccio and Tonella 2020; Zohdinasab et al. 2023, 2021). We refer to the first case study as AVP and to the second as MNIST, since it uses digits from the MNIST dataset as test inputs. Both case studies have predefined test oracle functions for identifying failures. We extend the existing oracle functions by developing alternative definitions of test oracle functions, with some being stricter than others; stricter functions result in smaller DOI regions. We evaluate the DOI coverage performance of NSGA-II, OMOPSO, and RS for the alternative test oracle functions to determine if and how the size of failure regions affects the coverage of failure tests by NSGA-II and RS.

RQ\(_2\): How does state-of-the-art, diversity-focused, Pareto-driven search-based testing compare with random testing in terms of covering failure-revealing tests? Several Pareto-based SBST algorithms aim to achieve diversity and coverage of different failures by incorporating a diversity fitness function. In our study, we consider a diversified search algorithm similar to the technique proposed by Riccio and Tonella (2020), which combines a genetic algorithm with concepts from novelty search. Specifically, we extend NSGA-II with a repopulation operator as well as a diversity-focused fitness function derived from DeepJanus, which aims to maximize the Euclidean distance of a test input to previously found solutions. Similar to RQ\(_1\), we use the CID metric to compare the diversity-focused NSGA-II against RS with respect to DOI coverage.

6.1 Experimental Setup

In this section, we describe our experimental setup, including the details of our case study systems, the implementation of NSGA-II, diversified NSGA-II (denoted NSGA-II-D), OMOPSO, and RS, and the metrics used in our empirical analysis. An overview of the configuration of the search algorithms is given in Table 3. We note that we selected the parameters in this table after conducting preliminary experiments aimed at identifying optimal parameters for the three studied algorithms NSGA-II, NSGA-II-D, and OMOPSO.

Table 3 Hyperparameter configuration for the search algorithms NSGA-II, NSGA-II-D and OMOPSO for the case studies AVP and MNIST

SUT MNIST

The MNIST case study focuses on a classification task involving DL models. Specifically, we evaluate a DL model designed to classify handwritten digits and trained on the widely used MNIST dataset (LeCun and Cortes 2005). The SUT must accurately classify a digit within a 28x28 greyscale image, where the digit is depicted with white strokes on a black background. Our objective is to create new instances of MNIST digits that lead to a misclassification by the classifier (Fig. 4).

SUT AVP

The SUT of the AVP case study is the AVP system described in Section 2. AVP consists of a Lidar-based object detector, a static path planner for the generation of a driving path to a free parking spot, and a path follower. Further, it contains basic (longitudinal/latitudinal) vehicle dynamics.

NSGA-II and RS for MNIST

For the MNIST case study, similar to existing work (Riccio and Tonella 2020), we represent the input images as Bézier curves in the SVG format, where each curve is characterized by several start points, end points, and control points. The mutation of digits throughout the search is performed by choosing a start point, an end point, or a control point of the Bézier curve and applying a small tweak. The value of the tweak is chosen randomly from a predefined range. To select the range, we performed preliminary experiments to identify a range size such that tweaking a digit is less likely to generate an invalid digit. An invalid digit is one that cannot be recognized as a digit by a human (Riccio and Tonella 2023).
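
A simplified sketch of this control-point mutation; representing a digit as a flat list of (x, y) points and the tweak range used here are our own simplifications of the SVG/Bézier encoding of Riccio and Tonella (2020), so the details are illustrative only.

```python
import copy
import random

def mutate_digit(control_points, tweak_range=(0.5, 2.0)):
    """Pick one start/end/control point of the Bezier representation and
    displace it by a small random tweak in x and y (range is an assumption)."""
    mutant = copy.deepcopy(control_points)
    i = random.randrange(len(mutant))
    x, y = mutant[i]
    extent = random.uniform(*tweak_range)
    mutant[i] = (x + random.uniform(-extent, extent),
                 y + random.uniform(-extent, extent))
    return mutant

# Example: a made-up digit skeleton with four points in 28x28 image coordinates.
digit = [(4.0, 20.0), (10.0, 6.0), (18.0, 8.0), (23.0, 21.0)]
print(mutate_digit(digit))
```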

After mutation, the Bézier curve is converted back to an image of size 28 \(\times \) 28 and passed to the SUT for classification. For further technical details of how these transformations are performed, we refer the reader to the paper by Riccio and Tonella (2020). To generate the initial population, we select one seed image from the MNIST dataset and apply the same mutation discussed above. In addition, we ensure that the distance between the mutated digits and the original seed image is less than a threshold proposed in existing work (Riccio and Tonella 2020; Zohdinasab et al. 2021). This increases the likelihood of preserving the label of the initial seed for images in the initial population. As the system under test, we consider the deep convolutional neural network provided by Keras and used in existing studies (Riccio and Tonella 2020; Zohdinasab et al. 2021).

We use two fitness functions to increase the chances of generating images that are challenging for the digit classifier: \(f_1\), adopted from previous studies (Zohdinasab et al. 2021; Riccio and Tonella 2020), represents the difference between the confidence in the expected label and the highest confidence among the other labels. We want to minimize \(f_1\), since a lower \(f_1\) value indicates a stronger tendency to misclassify a digit. Using \(f_2\), we minimize the average brightness of generated digits so that they are more difficult for the digit classifier. For example, Fig. 4 shows seed images, representing the digit 5, from the MNIST dataset on the left. On the right side, the figure displays corresponding mutations of these images with reduced average brightness, hence posing a challenge for digit classifiers. The goal of fitness \(f_2\) is to increase the chances of generating images like those on the right side of Fig. 4. We calculate \(f_2\) by dividing the sum of the brightness values of the pixels with brightness values above zero by the number of these pixels. For instance, the \(f_2\) values for the digits in the first row of Fig. 4 are 0.82 and 0.77, respectively, and in the second row 0.79 and 0.66.
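
A sketch of the two MNIST fitness functions as described above; `model` would be the Keras digit classifier under test, and the [0, 1] scaling of pixel values is our assumption.

```python
import numpy as np

def f1(probabilities, expected_label):
    """Difference between the confidence in the expected label and the highest
    confidence among the other labels (to be minimized; negative => misclassified)."""
    others = np.delete(probabilities, expected_label)
    return float(probabilities[expected_label] - others.max())

def f2(image):
    """Average brightness over the non-black pixels of a [0, 1]-scaled 28x28 image
    (to be minimized: dimmer strokes are harder for the classifier)."""
    lit = image[image > 0]
    return float(lit.mean()) if lit.size else 0.0

# Usage (assuming a Keras model and a 28x28 image scaled to [0, 1]):
# probs = model.predict(image.reshape(1, 28, 28, 1))[0]
# print(f1(probs, expected_label=5), f2(image))
```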

OMOPSO for MNIST and AVP

For the configuration of the OMOPSO algorithm, we considered guidelines from existing works that have applied OMOPSO to different case studies (Shami et al. 2022; Shi and Eberhart 1999; Coello et al. 2004; Sierra and Coello Coello 2005). In particular, for MNIST and AVP, we set the crowding-distance archive size equal to the swarm size and used the same population size as for the search with NSGA-II and NSGA-II-D. Another component of the OMOPSO algorithm is the inertia weight parameter, which controls how explorative or exploitative the particles are. A high inertia weight results in greater changes in the positions (test inputs) of particles, potentially skipping interesting, failure-revealing regions, while a low value might hinder exploration and the detection of new failing tests. We compared the performance of the OMOPSO search using a static inertia weight with a time-dependent (dynamic) inertia weight (Shi and Eberhart 1999), where the inertia weight is reduced as the number of iterations increases. A dynamic inertia weight yielded better coverage results in preliminary experiments, which is why we chose to proceed with this configuration. In addition, we investigated the impact on the search results of different boundary handling techniques, i.e., absorbing and reflecting particles at the boundaries. In the first technique, particles are absorbed at the boundary: whenever they reach a position outside the search space, their positions are set to the boundary value and their velocity to 0. In contrast, a reflecting boundary handling strategy reflects the particle back into the search space by inverting its velocity component whenever it reaches a position outside the search space. For our study, we selected the reflecting technique, as with the former approach we observed that particles got stuck at the boundaries, exhibiting lower coverage scores.
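
The sketch below shows the two design choices discussed above, a linearly decreasing (dynamic) inertia weight and reflecting boundary handling, in isolation; the weight range and the per-dimension handling are our assumptions rather than the exact values of our configuration.

```python
def dynamic_inertia(iteration, max_iterations, w_start=0.9, w_end=0.4):
    """Linearly decrease the inertia weight over the search, shifting the swarm
    from exploration towards exploitation."""
    return w_start - (w_start - w_end) * iteration / max_iterations

def reflect(position, velocity, bounds):
    """Reflecting boundary handling: mirror a particle that left the search space
    back inside and invert the corresponding velocity component.
    (A single reflection; extreme overshoots would additionally need clamping.)"""
    new_pos, new_vel = list(position), list(velocity)
    for i, (lo, hi) in enumerate(bounds):
        if new_pos[i] < lo:
            new_pos[i] = lo + (lo - new_pos[i])
            new_vel[i] = -new_vel[i]
        elif new_pos[i] > hi:
            new_pos[i] = hi - (new_pos[i] - hi)
            new_vel[i] = -new_vel[i]
    return new_pos, new_vel
```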

Test Oracles for MNIST

We define three test oracle functions for the MNIST case study, listed from least to most strict; they correspond to a decreasing size of the DOI region for this case study. Let x be a test input. The three test oracle functions are defined as follows:

$$\begin{aligned} O_{Large} (x): f_1(x)< -0.5 \\ O_{Medium} (x): f_1(x)< -0.7 \\ O_{Small} (x): f_1(x) < -0.95 \end{aligned}$$

Specifically, according to \(O_{Large}\), a test input fails when the difference between the confidence in the expected label and the highest confidence among the other labels, i.e., the value of the fitness function \(f_1\), is less than \(-0.5\). Functions \(O_{Medium}\) and \(O_{Small}\), on the other hand, require this difference to be less than \(-0.7\) and \(-0.95\), respectively. Hence, \(O_{Medium}\) is stricter than \(O_{Large}\), and \(O_{Small}\) is the strictest.

We set the search budget for both NSGA-II and RS to 1,000 evaluations, as preliminary experiments have shown that the CID values for NSGA-II already stabilize before this budget is reached. We use a population size of 20 and set the number of generations to 50.

NSGA-II and RS for AVP

For the AVP case study, we test the AVP system on a driving scenario where an occluded pedestrian crosses the planned trajectory of the ego vehicle (Fig. 3). An occluded pedestrian is a pedestrian who is hidden from the ego vehicle’s view because they are totally or partially behind another static object. The search space and fitness functions for AVP are defined in Section 2 and have been provided by our industrial partner. To adjust the velocity of the ego vehicle, we use a scalar value ranging from 0.1 to 1, which the SUT uses internally to calculate the vehicle’s maximal velocity at each point in time.

To generate the initial population for NSGA-II, we use Latin Hypercube Sampling (McKay et al. 1979), a quasi-random sampling strategy, to initiate the search with a diverse population. We set the mutation rate to 1/3 and the crossover rate to 0.6, following existing guidelines (Arcuri and Fraser 2011). We set the search budget for both NSGA-II and RS to 2,000 evaluations. For NSGA-II, we use a population size of 40 and set the number of generations to 50, as preliminary runs have shown that more generations do not significantly improve the coverage score evaluated with CID.
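
A sketch of generating such an initial population with Latin Hypercube Sampling via SciPy's quasi-Monte Carlo module; the choice of SciPy, the seed, and the exact bounds (the ego-velocity scalar from the paragraph above plus the pedestrian ranges from Section 2) are our own assumptions.

```python
from scipy.stats import qmc

# AVP search variables: ego-velocity scaling, pedestrian velocity, pedestrian start time
l_bounds = [0.1, 0.5, 0.0]
u_bounds = [1.0, 2.0, 5.0]

sampler = qmc.LatinHypercube(d=3, seed=42)
unit_samples = sampler.random(n=40)                       # population size 40
initial_population = qmc.scale(unit_samples, l_bounds, u_bounds)
print(initial_population.shape)                           # (40, 3)
```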

Test Oracles for AVP

Similar to the MNIST case study, we define alternative test oracle functions for the AVP case study. Specifically, we use the following test oracles with increasing levels of restriction:

$$\begin{aligned} O_{Large} (x): f_1(x)< -0.7 \wedge f_2(x)< -1.\\ O_{Medium} (x): f_1(x)< -0.8 \wedge f_2(x)< -2.\\ O_{Small} (x): f_1(x)< -0.9 \wedge f_2(x) < -2. \end{aligned}$$

The oracle function \(O_{Medium}\) considers a test case as failing when the ego vehicle is closer to the pedestrian and has a higher speed compared to the oracle function \(O_{Large}\). The test oracle \(O_{Small}\) constrains the failure conditions even further by requiring, compared to the other two functions, the highest maximal speed of the ego vehicle at the time of the minimal distance to the pedestrian.

Diversified NSGA-II (NSGA-II-D)

To answer RQ2, we first extend NSGA-II with an additional fitness function that evaluates the Euclidean distance of a test input to the individuals found in previous generations (the archive). Second, we apply a repopulation operator, which, in each generation, replaces the most dominated individuals with randomly sampled candidates. By maximizing the value of the additional fitness function and by employing the repopulation operator, we want to make the search identify more diverse test cases and thereby achieve better coverage of failure-revealing tests. A similar approach has been used in DeepJanus by Riccio and Tonella (2020). We do not directly apply the DeepJanus algorithm in our experiments since its main goal is to identify the borders of failure-revealing regions. However, we note that the diversity fitness function and the repopulation operator we use in our study are similar to those used by DeepJanus. To configure the hyperparameters and the processing technique for individuals to be included in the archive, we performed preliminary experiments. The archive uses a threshold to determine whether a test input should be included: before a solution can be added to the archive, the distance of the candidate to the closest individual in the archive is computed, and the solution is only included if this distance exceeds the threshold, promoting the storage of diverse inputs in the archive. As suggested by Riccio and Tonella (2020), we selected the archive threshold in a use-case-specific manner, comparing the coverage performance when using the average versus the minimal Euclidean distance between failures, averaged over the results from all oracle definitions. Based on the results, we selected the minimal Euclidean distance between failing test inputs for both case studies, which was 0.29 for the AVP case study and 1.77 for MNIST. Regarding the processing of individuals to be included in the archive, we compared sequential processing, where each individual is independently added to the archive, with population-based processing. In the sequential approach, adding an individual from the population to the archive can affect the distance computation for another individual in the population. In contrast, in population-based processing, the distance to the archived individuals is first calculated for all individuals in the population, and all individuals whose distance exceeds the threshold are then added to the archive together. In our preliminary experiments, sequential processing outperformed population-based processing, which is why we process individuals sequentially in our study.
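
A sketch of the sequential archive update described above: a candidate is added only if its Euclidean distance to the nearest archived input exceeds the threshold, and each insertion immediately affects the check for the next candidate. The thresholds 0.29 (AVP) and 1.77 (MNIST) come from our preliminary experiments; everything else here is a simplification.

```python
import numpy as np

def update_archive_sequentially(archive, population, threshold):
    """Sequential archive processing: each individual is checked (and possibly
    added) one at a time, so earlier insertions influence later distance checks."""
    for candidate in population:
        c = np.asarray(candidate, dtype=float)
        if not archive:
            archive.append(c)
            continue
        nearest = min(np.linalg.norm(c - np.asarray(a)) for a in archive)
        if nearest > threshold:
            archive.append(c)
    return archive

# Example with the AVP threshold from our experiments:
archive = update_archive_sequentially(
    [], [[0.5, 1.0, 2.0], [0.6, 1.1, 2.1], [2.0, 1.5, 4.0]], threshold=0.29)
print(len(archive))  # the second input is too close to the first and is skipped
```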

Fig. 3 The test scenario for the AVP case study: a pedestrian is crossing the trajectory of a vehicle equipped with the automated valet parking system

Experiments

We repeat the execution of NSGA-II, OMOPSO and RS 10 times to account for their randomness. The AVP system is executed using the Prescan simulator (Siemens 2023) and configured with the default sampling rate of 100 Hz. For the MNIST case study, we execute NSGA-II, OMOPSO and RS with 20 different digits from the MNIST dataset depicting a 5 to ensure the generalization of our results. Before using a digit from the MNIST dataset for the study, we verify that it is correctly classified (Fig. 4).

Metrics

To compute CID, we need to develop a reference set, which approximates the actual set of failure-revealing test inputs, i.e., the DOI, for each case study. The reference set is computed independently from the solution set of a search approach (test set).

For the AVP case study, we use two methods to generate the reference set: grid sampling (GS) and furthest point sampling (FPS), as explained in Section 5.2. For GS, we consider three different resolutions, 10, 20 and 25 samples for each axis, generating in total 1,000, 8,000 and 15,625 samples. Executing 15,625 samples took about 43 hours. We did not consider resolutions higher than 25 samples per axis, since the CID value stabilizes once 25 samples per axis are reached. For FPS, we consider 1,000 and 8,000 samples; we did not consider 15,625 samples given the constrained time budget, because computing new samples in FPS is time-intensive, as our implementation of FPS is based on the generation of Voronoi cells. However, the relative comparison of the different algorithms remains the same for both resolutions, 1,000 and 8,000. Note that we cannot verify that our reference sets are optimal as defined in Section 5, since the sizes of some regions of the DOI may be smaller than the maximum distance between adjacent points in our reference sets. However, we can show the robustness and consistency of our results by considering different sampling methods, i.e., GS and FPS, and different sampling sizes, i.e., 1,000, 8,000, and 15,625.

For the MNIST case study, we use GS with a resolution of 10 samples per search dimension. Preliminary experiments demonstrate that using higher resolutions for GS or varying sampling methods does not significantly affect CID results for MNIST.
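To make the construction concrete, the sketch below shows how a grid-sampled reference set and a coverage score could be computed; the oracle and evaluate functions are placeholders, and the cid function is only an illustrative, IGD-style approximation of the metric — the authoritative definition of CID is the one given in Section 5.1.

```python
import itertools
import numpy as np

def grid_reference_set(lower, upper, resolution, oracle, evaluate):
    """Sketch: enumerate a grid (resolution samples per axis), execute each
    sample, and keep the failure-revealing ones as the reference set."""
    axes = [np.linspace(lo, hi, resolution) for lo, hi in zip(lower, upper)]
    grid = np.array(list(itertools.product(*axes)))   # e.g., 25**3 = 15,625 samples
    return np.array([x for x in grid if oracle(*evaluate(x))])

def cid(reference_set, test_set):
    """Illustrative coverage score (assumed IGD-style): mean distance from each
    reference failure to its nearest failure found by the algorithm."""
    if len(test_set) == 0:
        return float("inf")
    dists = [min(np.linalg.norm(r - t) for t in test_set) for r in reference_set]
    return float(np.mean(dists))
```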

Fig. 4 Example test inputs for the MNIST case study. Left column: original digits, correctly classified as 5. Right column: corresponding label-preserving digits generated by NSGA-II and labeled as 8 by the classifier under test. Classification certainty differences between the expected label and the highest prediction by the classifier for the mutated digits from the top right to the bottom right are: -0.51, -0.79, -0.69

Fig. 5 MNIST case study. The average and standard deviations of CID values obtained from 10 runs of RS, NSGA-II, NSGA-II-D and OMOPSO for the three test oracle functions \(O_{Large}\), \(O_{Medium}\), and \(O_{Small}\) of MNIST. The CID values are plotted after every 100 evaluations. The reference set is computed using 1,000 uniformly generated test inputs using grid sampling

Fig. 6 AVP case study. The average and standard deviations of CID values obtained from 10 runs of RS, NSGA-II, NSGA-II-D and OMOPSO for the three test oracle functions \(O_{Large}\), \(O_{Medium}\), and \(O_{Small}\) of AVP. The CID values are plotted after every 250 evaluations. The reference set is computed using 15,625 uniformly generated test inputs using grid sampling

6.2 Experimental Results

Figures 5 and 6 show the CID results obtained by applying RS, NSGA-II, NSGA-II-D and OMOPSO to the MNIST and AVP case studies. For each system, the CID results are computed with respect to the respective test oracle definitions, which correspond to different DOI sizes. The diagrams in Figs. 5 and 6 illustrate the average and standard deviations of the CID values derived from 10 runs of each algorithm after every 100 and 250 evaluations, respectively.

The CID values in the diagrams in Figs. 5 and 6 are calculated using reference sets generated by grid sampling for each case study and each test oracle definition. We use the same reference set for each combination of algorithm, case study, and test oracle definition. However, the number of failures in the reference set depends on the oracle definition.

We note that due to our interest in coverage, we choose the set A, i.e., the solution set, used to compute CID as the set of all failures found by each testing algorithm over all its iterations. Specifically, for each of the NSGA-II, NSGA-II-D, and OMOPSO algorithms, the set A is the set of all failures found over all generations and not just the failures in the last generation. Similarly, for random search, set A is the set of all failures found by the random search.

The AVP’s reference sets contain 15,625 samples, while the MNIST’s reference sets comprise 1,000 samples. In the remainder of this section, we discuss the results for RQ1 and RQ2 using the results in Figs. 5 and 6.

6.2.1 RQ1 Results

Comparing CID Results

The MNIST case study results in Fig. 5 show that RS consistently yields lower, i.e., better, average CID values and lower standard deviations than NSGA-II and OMOPSO over time across all test oracles. Moreover, although the CID values for NSGA-II, OMOPSO and RS decrease over time, the rate of reduction is more pronounced for RS than for NSGA-II and OMOPSO.

Further, the results in Fig. 5 show that while the CID values for RS, NSGA-II and OMOPSO are similar for the \(O_{Medium}\) and \(O_{Large}\) test oracles, they are higher for the \(O_{Small}\) test oracle. This indicates that covering the DOI regions for \(O_{Small}\), the strictest test oracle definition, poses a greater challenge for all algorithms compared to \(O_{Medium}\) and \(O_{Large}\). Finally, for all three cases, the CID values of RS, NSGA-II and OMOPSO stabilize in less than 1,000 evaluations.

Similarly, the results presented in Fig. 6 demonstrate that for AVP, the CID values achieved by RS consistently outperform those obtained by NSGA-II across all test oracles after at least 300 evaluations. Further, the average and the variation of the CID values obtained by RS decrease at a higher rate than those obtained by NSGA-II and OMOPSO. However, OMOPSO outperforms NSGA-II in CID values for the less constrained oracle definition \(O_{Large}\). For the \(O_{Medium}\) and \(O_{Small}\) oracles, there is no significant difference in CID values between OMOPSO and NSGA-II.

In each of the three diagrams, the CID values for RS, NSGA-II and OMOPSO stabilize within 2,000 evaluations, suggesting that running the algorithms longer is unlikely to change the comparison of the CID values.

In addition, two observations can be drawn from Fig. 6: (1) For both NSGA-II and random search, stricter oracle definitions, corresponding to smaller DOIs and fewer failures in a large search space, result in higher metric values, indicating that covering failures is more challenging when there are fewer failures. (2) The difference in CID values between NSGA-II and random search decreases with a stricter oracle function. This is expected; as DOIs become smaller and failures harder to detect, random search may become less effective in covering and exploring failures, behaving more similarly to NSGA-II and OMOPSO. Nevertheless, even for our strictest notion of a test oracle, NSGA-II and OMOPSO cannot outperform random search in covering the DOI for both case studies.

Table 4 (Reference Set Variation) The average and standard deviations of CID values obtained from 10 runs of RS, NSGA-II, NSGA-II-D, OMOPSO for the AVP system with the \(O_{Large}\), \(O_{Medium}\) and \(O_{Small}\) test oracles for different reference sets and sizes

Sampling Methods and Sampling Resolutions

As mentioned in Section 6.1, the CID results shown in Figs. 5 and 6 are derived from specific reference sets generated using grid sampling with a specific sampling resolution. One might wonder how altering the sampling method or resolution would affect the comparison of CID values between RS, NSGA-II and OMOPSO. Table 4 shows the average and standard deviation of CID values for the AVP system with the \(O_{Large}\) test oracle, calculated using the two sampling methods discussed in Section 5.2, grid sampling (GS) and furthest point sampling (FPS), with reference sets of 1,000 and 8,000 samples. As the table shows, even if we use reference sets obtained from different sampling methods and with different sample sizes, RS still fares considerably better than NSGA-II, NSGA-II-D and OMOPSO. This indicates the robustness of our results when different sampling techniques and different numbers of samples are used to generate the reference sets for computing CID values. Note that the results in Table 4 for 1,000 and 8,000 samples for both GS and FPS for \(O_{Large}\) are close, indicating that a reference set of size 1,000 already provides a good approximation of the DOI.

For AVP, the number of failures contained in the reference sets generated by grid sampling at the highest resolution (15,625 samples in total) is reported for all oracle definitions in Table 7 in Section 8.

Statistical Tests

We use the non-parametric pairwise Wilcoxon rank sum test and Vargha-Delaney's \(\hat{A}_{12}\) effect size to compare the CID results in Figs. 5 and 6 between random search and the Pareto-based approaches. The level of significance is set to 0.05. The results are presented in the first two rows of Table 5. We adopt the following standard classification for effect size values: an effect size e is small when \(0.56 \le e < 0.64\) or \(0.36 < e \le 0.44\), medium when \(0.64 \le e < 0.71\) or \(0.29 < e \le 0.36\), and large when \(e \ge 0.71\) or \(e \le 0.29\). Otherwise, the effect size is negligible.
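A minimal sketch of this analysis using SciPy is shown below; the Wilcoxon rank-sum test is computed via the equivalent Mann-Whitney U statistic, which also yields \(\hat{A}_{12}\), and the CID samples are hypothetical values, not our measured results.

```python
import numpy as np
from scipy.stats import mannwhitneyu

def vargha_delaney_a12(x, y):
    # A12 = P(X > Y) + 0.5 * P(X == Y), obtained from the U statistic of x.
    u, _ = mannwhitneyu(x, y, alternative="two-sided")
    return u / (len(x) * len(y))

def effect_size_label(e):
    # Classification used in the paper (symmetric around 0.5).
    if e >= 0.71 or e <= 0.29:
        return "large"
    if 0.64 <= e < 0.71 or 0.29 < e <= 0.36:
        return "medium"
    if 0.56 <= e < 0.64 or 0.36 < e <= 0.44:
        return "small"
    return "negligible"

# Hypothetical final CID values from 10 runs each of RS and NSGA-II.
rs = np.array([0.11, 0.12, 0.10, 0.13, 0.11, 0.12, 0.10, 0.11, 0.12, 0.13])
nsga2 = np.array([0.25, 0.31, 0.27, 0.29, 0.33, 0.26, 0.30, 0.28, 0.32, 0.27])
_, p_value = mannwhitneyu(rs, nsga2, alternative="two-sided")
a12 = vargha_delaney_a12(rs, nsga2)
print(p_value < 0.05, a12, effect_size_label(a12))
```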

The statistical test results show that for MNIST and AVP with \(O_{Large}\) and \(O_{Medium}\) test oracles, RS yields CID values that are significantly better than those generated by NSGA-II and OMOPSO with large effect sizes. However, for the \(O_{Small}\) test oracle of AVP, the difference between RS and NSGA-II is not statistically significant. This is consistent with the results in Fig. 6 for the AVP with the \(O_{Small}\) and the \(O_{Medium}\) test oracles.

Table 5 Statistical-test results for comparison between RS and Pareto-based approaches (Wilcoxon and Vargha-Delaney)
Table 6 Statistical-test results for Pareto-based approaches (Wilcoxon and Vargha-Delaney)

For the less constrained test oracles, \(O_{Large}\) and \(O_{Medium}\), RS significantly outperforms NSGA-II and OMOPSO in covering the failure-revealing tests, while for the more constrained \(O_{Small}\) test oracle the difference between the algorithms NSGA-II, OMOPSO and RS diminishes and neither covers the DOI better than the other.

For completeness, Table 6 also provides, in its second row, the statistical-test results for the comparison of CID results between NSGA-II and OMOPSO. We can see that for the AVP case study, the difference between NSGA-II and OMOPSO is statistically significant with a large effect size only for the oracle \(O_{Large}\), which is consistent with the results in Fig. 6. For MNIST, however, there is no statistical significance, which is in line with the results in Fig. 5.


6.2.2 RQ2 Results

The results in Figs. 5 and 6 show that the CID values obtained by NSGA-II-D are consistently better than those obtained by NSGA-II over time for both case studies and all test oracle definitions. However, compared to RS, NSGA-II-D achieves consistently worse CID values for all test oracle definitions of the MNIST case study and the \(O_{Large}\) test oracle of the AVP case study. For the \(O_{Medium}\) and \(O_{Small}\) test oracles of the AVP case study, RS outperforms NSGA-II-D after 500 and 1,500 evaluations, respectively.

Statistical Tests

The statistical-test results in Table 5 show that RS yields CID values that are significantly better than those generated by NSGA-II-D, with large effect sizes, for all test oracle definitions of MNIST and the \(O_{Large}\) test oracle of AVP. Similar to the comparison between RS and NSGA-II, the difference between RS and NSGA-II-D for the \(O_{Small}\) test oracle of AVP is not statistically significant: for \(O_{Small}\), RS does not outperform NSGA-II-D, nor does NSGA-II-D outperform RS. This indicates that even when NSGA-II is enhanced with a diversity-focused fitness function, it still does not outperform RS in terms of failure coverage.

The statistical test results comparing NSGA-II-D with the Pareto algorithms NSGA-II and OMOPSO indicate a statistically significant difference between OMOPSO and NSGA-II-D for the MNIST case study across all oracles. For the AVP case study, this difference is significant only for the \(O_{Large}\) oracle. The difference between NSGA-II-D and NSGA-II is statistically significant for the AVP case study only for the \(O_{Large}\) oracle, and for the MNIST case study for all oracles. These findings are consistent with the results presented in Figs. 6 and 5.


7 Threats to Validity

The most important threats to the validity of our experiments relate to internal, external and construct validity.

To mitigate internal validity risks, which refer to confounding factors, we used identical search spaces and search configurations for all the algorithms – RS, NSGA-II, NSGA-II-D and OMOPSO – in each of our case studies, AVP and MNIST. For MNIST, the objective is to create test inputs that challenge classifiers while remaining valid and preserving the digit's original label. "Valid" implies that the digits are still recognizable by humans, and "label preserving" ensures that the image's label remains unchanged after mutation; for instance, a mutated "5" should not transform into a "6". To address this, we have taken the following actions: (1) We ensure the validity and label-preservation of individuals in the initial population for our search algorithms by enforcing a distance threshold, as suggested by previous studies (Riccio and Tonella 2020), on the distances between the individuals and the initial seed. (2) We select images from the MNIST dataset as seeds only if they are correctly classified by the classifier under test. (3) We use a small displacement interval in the mutation operator to reduce the chances of altering the label or shape of digit images through perturbations. (4) Finally, we identify out-of-bound displacements in images during mutation and adjust the mutated images by reversing the displacement, bringing it back within the image boundary.

To mitigate the bias associated with choosing suboptimal hyperparameters for the experiments with OMOPSO and NSGA-II-D, we conducted preliminary experiments to identify optimal parameter configurations, as outlined in Section 6.1.

Construct validity threats relate to the inappropriate use of metrics. To compare the coverage of the failure-revealing domain of test inputs for the studied search approaches, we have applied the CID metric introduced in Section 5.1. To the best of our knowledge, there have been no other metrics in the literature that serve our purpose. We have outlined in Section 3 related metrics and have positioned our metric.

External validity is related to the generalizability of our results. We used two case studies from different domains: the AVP case study is an industrial ADAS, and the MNIST case study is a widely-used open-source system for classifying handwritten digits. The specification of AVP and the definition of its search space are provided by our industrial partner. Furthermore, we have used a widely-used, stable and high-fidelity simulator Prescan to simulate the ADAS. We apply search-based testing to the MNIST case study in a manner consistent with prior studies using the same case study (Zohdinasab et al. 2021; Riccio and Tonella 2020). Regarding the choice of the Pareto-based algorithms, in our experiments, we considered two widely-used and different types of Pareto-based techniques, NSGA-II and OMOPSO. The diversity fitness function and the repopulation operator, both employed to improve NSGA-II, are adopted from recent research aimed at introducing diversity into NSGA-II (Riccio and Tonella 2020).

Further, in Section 4 we provide an argument, based on the definition of Pareto-based optimization and independent of the system under test, for why Pareto-based search techniques are not suitable for covering failure-revealing test inputs. The argument holds under the assumptions defined in Section 2.

Replicability To foster replicability, the implementation of the CID metric, as well as the results of the evaluation and the conducted experimental runs with RS, OMOPSO, NSGA-II and the diversified search with NSGA-II, are available online (rep 2024). The implementation of the AVP case study could not be disclosed as it was developed by our industrial partner. The search approach and the analysis of the results have been implemented using the modular and open-source search-based testing framework OpenSBT (Sorokin et al. 2024).

8 Discussion

In this section, we offer further observations and discuss the applications and limitations of our proposed metric, CID. In addition, we compare NSGA-II, NSGA-II-D, OMOPSO and RS in terms of their effectiveness and efficiency in finding failure instances. This comparison aims to better position our study within the context of prior research and complements the study presented in Section 6, which is primarily focused on comparing these algorithms with respect to the coverage of failure-inducing regions, as opposed to their ability to identify individual failures.

Observation

Why do the Pareto-based approaches NSGA-II, NSGA-II-D and OMOPSO yield worse CID values with considerably larger variations compared to RS? To understand this, we plot the failure-revealing solutions obtained by NSGA-II, NSGA-II-D, OMOPSO and RS, along with the reference sets computed for RQ1 to calculate CID values. Recall that reference sets approximate the complete set of failure-revealing test inputs within the search space. Specifically, Fig. 7a shows the reference set for AVP with the test oracle \(O_{Large}\), computed based on grid sampling with 15,625 samples as part of our RQ1 experiments, and Figs. 7b to 7e, respectively, show the failing test inputs computed by one run of NSGA-II, RS, OMOPSO and NSGA-II-D for AVP with the \(O_{Large}\) test oracle. The figures contain all failures identified over all evaluations of the respective algorithm's run.

Fig. 7 Visualizing failure-revealing solutions in the search space for AVP with the \(O_{Large}\) test oracle function. Plot (a): failure-revealing tests identified using grid sampling with 15,625 samples; this set serves as a reference set. Plots (b) to (e) show failure-revealing tests found using one run of NSGA-II, random search, OMOPSO and NSGA-II-D, respectively. Although random search finds far fewer failure-revealing solutions than NSGA-II and NSGA-II-D, it covers the DOI in Plot (a) more effectively and yields a lower CID

By comparing the failure-revealing test inputs in Figs. 7b and 7c with those from the reference set shown in Fig. 7a, which contains 558 failures, one can see that although RS finds far fewer failure-revealing solutions than NSGA-II, 89 versus 429, RS covers the DOI in Plot (a) more effectively.

The failure-revealing tests identified by RS are more spread out and diverse, while those found by NSGA-II are grouped close together. Similarly, when comparing the failures found by NSGA-II and NSGA-II-D in Fig. 7e, one can see that while the failures found by NSGA-II-D are more scattered than those of NSGA-II, they are still clustered together and not as spread out as the solutions found by random search.

OMOPSO identifies only 57 failing tests, but, as shown in Fig. 7d, its failures are more spread out than those of NSGA-II and NSGA-II-D. However, we can see that larger regions of the test input space are left uncovered compared to random search. Similarly, we plot in Fig. 8a the reference set for the AVP case study with the \(O_{Small}\) test oracle, computed using the same sampling method and resolution as the one in Fig. 7a. As shown in this figure, only 62 failing tests are identified in the reference set for \(O_{Small}\), which is significantly fewer than the number of failures in the reference set for the \(O_{Large}\) oracle. Further, the failure-revealing solutions obtained by one run of NSGA-II and RS are shown in Figs. 8b and 8c, respectively.

RS identifies 10 failures, NSGA-II identifies 111 failures, NSGA-II-D identifies 69 failing tests, while OMOPSO finds 5 failures. The failing test inputs found in one run of OMOPSO and NSGA-II-D are shown in Figs. 8d and 8e. The figures illustrate that the NSGA-II and NSGA-II-D solutions are clustered closely, in contrast to the solutions obtained by random search, which are more dispersed and offer better coverage of the DOI approximation in Plot (a).
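The kind of visualization used in Figs. 7 and 8 can be reproduced with a few lines of matplotlib; the sketch below assumes the three-dimensional AVP search space, and the axis labels are placeholders rather than the actual input variable names.

```python
import matplotlib.pyplot as plt

def plot_failures(reference_set, found_by_algorithm, title):
    """Sketch of a Fig. 7/8-style plot: failure-revealing inputs in an assumed
    three-dimensional search space. Both inputs are numpy arrays of shape (n, 3)."""
    fig = plt.figure()
    ax = fig.add_subplot(projection="3d")
    ax.scatter(*reference_set.T, c="lightgray", label="reference set (DOI approx.)")
    ax.scatter(*found_by_algorithm.T, c="red", label="failures found by algorithm")
    # Placeholder axis names; the actual AVP input variables are defined by the study.
    ax.set_xlabel("input 1")
    ax.set_ylabel("input 2")
    ax.set_zlabel("input 3")
    ax.set_title(title)
    ax.legend()
    plt.show()
```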

Fig. 8 Visualizing failure-revealing solutions for AVP in the search space for the most constrained oracle function \(O_{Small}\). Plot (a) shows the reference set obtained to approximate the failures of the system under test. Plots (b) to (e) show failure-revealing tests found using one run of NSGA-II, random search, OMOPSO and NSGA-II-D, respectively. Although NSGA-II and NSGA-II-D identify more failure-revealing test inputs than RS, RS achieves a better coverage of the DOI than NSGA-II

Table 7 Number of failures identified in the reference sets for the MNIST and AVP case studies across different test oracle definitions

We have also provided in Table 7 the number of failing tests in the reference set for \(O_{Medium}\) for AVP, as well as the reference set sizes for MNIST. It can be observed that the number of failures identified for the most constrained oracle definition is relatively small compared to the total number of samples (below 1% for AVP and below 20% for MNIST), indicating that the testing problems are not trivial.

Clarification Regarding the Applicability of CID

The CID metric is not intended as a universal test coverage adequacy criterion applicable to various systems in practice. Instead, our metric acts as a quality indicator, assisting in the empirical assessment of testing approaches when an approximated reference set for the system under test is computable. Our proposed metric, CID, is explicitly designed to evaluate testing approaches during the development of an algorithm and is not meant for evaluating solutions in the application phase of the algorithm.

Further, the validity of CID values is tied to the accuracy of the reference set approximating the DOI. Demonstrating the robustness of reference sets may become challenging for large, multi-dimensional search spaces (e.g., more than three dimensions), where one system execution may take minutes to hours. To mitigate this, we can use surrogate models (Ben Abdessalem et al. 2016; Matinnejad et al. 2014; Nejati et al. 2023; Matinnejad et al. 2013) as approximations of the SUT, avoiding costly executions. Alternatively, we can evaluate a candidate search algorithm using CID, calculated for mathematical test problems in SBST benchmarks (Surjanovic and Bingham 2023). Although these mathematical problems may not represent real-world ADS, they allow us to assess and potentially refute an ineffective candidate search algorithm. By augmenting these mathematical problems with a failure concept, i.e., a test oracle function, we can test the capabilities of a candidate search algorithm with respect to DOI coverage.
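As an example of this idea, the sketch below augments the well-known Branin benchmark function with a hypothetical failure threshold, so that the ground-truth failure-revealing region can be enumerated cheaply and a candidate algorithm's coverage can then be scored with CID; the threshold and grid resolution are illustrative choices.

```python
import numpy as np

def branin(x1, x2):
    # Branin benchmark function (Surjanovic & Bingham test-function library).
    a, b, c = 1.0, 5.1 / (4 * np.pi**2), 5.0 / np.pi
    r, s, t = 6.0, 10.0, 1.0 / (8 * np.pi)
    return a * (x2 - b * x1**2 + c * x1 - r)**2 + s * (1 - t) * np.cos(x1) + s

# Hypothetical failure concept: treat low objective values as "failures".
def oracle(x1, x2, threshold=1.0):
    return branin(x1, x2) < threshold

# Cheap ground-truth DOI via dense grid sampling over the usual Branin domain.
x1s = np.linspace(-5, 10, 200)
x2s = np.linspace(0, 15, 200)
doi = np.array([(a, b) for a in x1s for b in x2s if oracle(a, b)])
print(len(doi), "failure-revealing inputs in the reference set")
```

The failures found by a candidate search algorithm on this augmented problem can then be compared against the enumerated reference set, without any costly system executions.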

Lesson Learned

Pareto-based approaches perform better than naïve random testing in efficiently identifying single failures, yet they do not surpass random search in covering the areas of the search space that reveal failures. The results from RQ1 and RQ2 show that NSGA-II and OMOPSO cannot outperform randomized testing in terms of DOI coverage. However, based on the results we obtained, we can confirm that NSGA-II and OMOPSO are on average better than random search if the purpose is to find individual failures fast.

In particular, Table 8 compares NSGA-II and random search with respect to the first iteration in which they find a failing test input for the AVP case study. As shown in the table, on average, NSGA-II finds the first failure in fewer than five iterations for all test oracles, while random search needs on average five or more iterations to find the first failure. The first iteration in which a search algorithm identifies a failure has been commonly used in the literature to evaluate the efficiency of testing algorithms (Klück et al. 2019). Our results confirm the superiority of NSGA-II over random search when we consider the efficiency of finding failures. We note that this result does not contradict our findings regarding the weakness of NSGA-II in covering failures, as covering the space of failure-inducing regions is a different objective from identifying several failure instances. The sketch below illustrates how this efficiency measure can be computed from a run's history.
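The following minimal sketch shows how the first failing iteration could be extracted from recorded run data; the history format and oracle argument are assumptions for illustration, not the actual logging format of our experiments.

```python
def first_failing_iteration(history, oracle):
    """Return the first iteration (1-based) of a run in which a failure was
    found, or None if no failure was revealed.

    `history` is assumed to be a list of iterations, each a list of
    (test_input, fitness_vector) pairs; `oracle` maps a fitness vector to
    True for failing tests.
    """
    for i, population in enumerate(history, start=1):
        if any(oracle(fitness) for _, fitness in population):
            return i
    return None
```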

In addition, the results in Table 8 show that the AVP case study is sufficiently difficult for meta-heuristic search algorithms, as NSGA-II finds more failures than RS and does so faster. Demonstrating that NSGA-II outperforms RS in its optimization goal, which in this case is finding failures, is a common sanity check, indicating that the underlying case study presents a sufficiently difficult problem to warrant the use of a meta-heuristic algorithm.

Table 8 AVP Case study

Root Cause Analysis

The motivation for our study is that the identification of diverse failing test inputs can support the localization of the root causes behind the identified failures. However, analyzing the underlying root causes of these failures is out of the scope of our study. As demonstrated by the works of Jodat et al. (2024) and Ben Abdessalem et al. (2018), decision trees or decision rules can be used to derive failure explanations from a set of failure-revealing test inputs. For example, the work by Ben Abdessalem et al. shows that failure-inducing test inputs can indicate that the reason for a failure lies in environmental factors, such as high road curvature, or in hardware limitations, meaning an additional camera or one with a greater field of view might be required to capture vulnerable road users in sharp curves. Regardless, identifying failures is a prerequisite for fault localization and repair. Therefore, before proceeding with fault localization and repair, we need testing techniques that can effectively identify as many diverse failure-revealing test inputs as possible, located in different areas of the search space.

Feasibility and Accuracy of Reference Sets

Regarding the computation of reference sets, we note that the higher the sampling rate, the higher the accuracy of the reference sets, but the more expensive it is to create those sets. We use three considerations to achieve a balance between the accuracy and feasibility of the reference sets’ computations: (i) converging CID values, (ii) total time budget for reference set computation, and (iii) the sensitivity of the SUT’s behavior to the sampling resolutions.

(i) We consider different sampling rates to see whether CID values stabilize and converge. Once we observe that the CID values stabilize at a specific sampling rate, we do not need to consider higher sampling rates since they are unlikely to significantly change the CID values. For instance, as shown in Table 4, the CID values stabilize for all oracle definitions at a sampling rate of 25.

(ii) We ensure that we have sufficient time to generate the reference sets. For instance, a sampling rate of 25 leads to the generation of 15,625 samples in the reference set of AVP with a three-dimensional test input space, which takes 86.8 hours. The number of samples increases exponentially with the number of input space dimensions. Hence, we consider capping the sampling rate based on our available time budget.

(iii) We consider the sensitivity of the system's behavior to the selected sampling resolution. For instance, for the pedestrian velocity, which is one of AVP's input variables, a change of less than 0.075 m/s is unlikely to impact the system's behavior. Our sampling rate of 25 already samples this variable at a finer-grained resolution than 0.075 m/s. Therefore, we consider this sampling rate sufficient for computing stable CID values, making a higher resolution unnecessary.

9 Conclusion and Future Work

In this paper, we studied the ability of Pareto-based Search-Based Software Testing (SBST) to cover failure-revealing test inputs. Based on a theoretical argument and an empirical evaluation, we have shown that Pareto-driven testing cannot achieve high coverage of failures.

We introduced a new metric, CID, to quantitatively evaluate the coverage of failure-inducing areas in a search domain and discussed its properties. In our empirical analysis using two case studies – an industrial automated valet parking system and a handwritten digit classification system – we demonstrated that the Pareto-based testing techniques NSGA-II and OMOPSO are less effective than random search in covering failure-revealing tests for 11 out of 12 alternative failure definitions across both studies; in the remaining case, all methods exhibited similar performance.

In addition, we demonstrated that augmenting NSGA-II with a diversified fitness function and a repopulation operator, adapted from a state-of-the-art testing approach, does improve its performance in covering failure-revealing tests in four out of six alternative failure oracle definitions. However, in all comparisons, it does not perform better than random testing in terms of failure coverage.

Our investigation highlights the limitations of Pareto-driven testing in covering failure-revealing tests, emphasizing the need for practitioners who use Pareto-based SBST approaches to be aware of these constraints. Our observations further confirm that Pareto-based testing is effective at identifying failures quickly, while covering failures requires other, innovative search strategies. In future work, we plan to combine randomized testing with machine learning models to improve failure space coverage. The application of the CID metric also confirms the importance of surrogate models and benchmark systems for evaluating SBST algorithms, thereby avoiding expensive system executions.