We start by describing the general experimental setup for numerical evaluations. A simulation experiment is performed with a total budget of
B observations to collect at
k distinct design points, with
\(n_i\) observations to obtain at design point
\({\bf x}_i\), for
\(i = 1, 2, \ldots , k\). The
k design points are chosen according to either a random design or a fixed design, to be specified for each example. Three budget allocation schemes, i.e., the equal allocation, unequal allocation 1, and unequal allocation 2, are considered. Specifically, the equal budget allocation scheme sets
\(n_i = \lceil B/k \rceil\), where
\(\lceil a\rceil\) denotes the least integer not less than
a. Assuming that the noise variance function is known, unequal allocation 1 sets
\(n_i = \lceil \tfrac{{\sf V}({\bf x}_i)}{\sum _{l=1}^{k}{\sf V}({\bf x}_l)}B\rceil\), while unequal allocation 2 sets
\(n_i = \lceil \tfrac{\sqrt {{\sf V}({\bf x}_i)}}{\sum _{l=1}^{k}\sqrt {{\sf V}({\bf x}_l)}}B\rceil\). We note that the goal of unequal allocation 1 is to equalize the variance of sample means across all design points. This approach has been explored in the literature on ranking and selection as well as multi-armed bandit problems [
9,
16,
19], aiming to alleviate the impact of heteroscedasticity. Unequal allocation 2 assigns the simulation budget in accordance with the scale function
\(\rho ({\bf x}): = \sqrt {{\sf V}({\bf x})}\) for
\({\bf x}\in \mathcal {X}\); it is proposed based on the optimal design density
\(h^*(\cdot)\) established for controlled nonparametric regression experiments. Specifically,
\(h^*(x) = \rho (x)/ \int _{\mathcal {X}} \rho (u) du\) achieves the asymptotic sharp minimax lower bound for the mean integrated squared error when
\(x \in \mathcal {X}= [0,1]\); see Reference [
17] and references therein for more details. We consider various experimental settings comprising various combinations of the total budget
B, the number of design points
k, the design-point sampling scheme to choose
k design points, and the budget allocation scheme adopted. The target confidence level is set to 0.95. SK is implemented by adopting two types of kernel functions, respectively, the Gaussian kernel and the Matérn kernel in Equation (
15) with smoothness parameter
\(\nu = 5/2+d/2\), where
d denotes the dimension of the input space
\(\mathcal {X}\).
6.1 M/M/1 Queue
Consider simulating an M/M/1 queue with service rate 1 and arrival rate
x with
\(x \in \mathcal {X}= [0.3, 0.9]\). In this example, the simulation output recorded on each replication at a given design point is the average number of customers in the system from time 0 to
T time units. It is well known from queueing theory that the steady-state mean number of customers in the queue is
\(f(x)=x/(1-x)\), which is the mean function we intend to estimate. The noise variance function is
\({\sf V}(x)\approx 2x(1+x)/(T(1-x)^4)\) for
T large [
50]. When simulating the M/M/1 queue at each design point, we initialize each replication in steady state and set the run length of each replication
T to 1,000. Besides the general experimental setup given at the beginning of Section
6, the specific setup for this example is as follows: The value of the grid constant
\(\tau\) is set to ensure that
\(\sqrt {\beta _k(\tau)}\sigma _{k,B}(x)\) dominates
\(\gamma _{k,B}(\tau)\) in Equation (
7). We consider using a total budget
\(B \in \lbrace 2,560,25,600\rbrace\) and varying the number of distinct design points
\(k\in \lbrace 16,32,128,512\rbrace\) when
\(B=2,560\) and
\(k\in \lbrace 32,128,512,2,048\rbrace\) when
\(B=25,600\). Given each choice of
k, we choose
k design points according to a uniform design or a grid design, where the former comprises a sample of
k independent and uniformly distributed points and the latter
k equispaced grid points in
\(\mathcal {X}\). The three budget allocation schemes mentioned at the beginning of Section
6 are applied given each combination of
B,
k, and the design-point sampling scheme. A grid of
\(N = 1,000\) equispaced points is selected from
\(\mathcal {X}\) for prediction.
Summary of results. Tables
1 and
2 show the
\(\widehat{\mbox{SCP}}\)’s achieved by the empirical uniform bound (
\(\mbox{CI}_{{\mbox{eu}}}\)), the nominal uniform bound (
\(\mbox{CI}_{{\mbox{u}}}\)), and the Bonferroni correction-based simultaneous confidence intervals (
\(\mbox{CI}_{{\mbox{b}}}\)) under different experimental settings when SK is implemented with the Gaussian and Matérn kernels. The following common observations are made across all experimental settings: First, given a fixed budget
B, the
\(\widehat{\mbox{SCP}}\)’s of
\(\mbox{CI}_{{\mbox{eu}}}\) and
\(\mbox{CI}_{{\mbox{u}}}\) are typically higher than the target level 0.95, while the
\(\widehat{\mbox{SCP}}\)’s of
\(\mbox{CI}_{{\mbox{b}}}\) are not close to 0.95, except when the number of design points
k is relatively small under the grid design. Second, the simultaneous coverage performance of
\(\mbox{CI}_{{\mbox{eu}}}\) remains satisfactory as the number of design points
k increases until it reaches a point where only a few replications are allocated to some design points, regardless of the kernel function and the design-point sampling scheme adopted; see, e.g., the
\(\widehat{\mbox{SCP}}\)’s of
\(\mbox{CI}_{{\mbox{eu}}}\) obtained under the equal allocation and unequal allocation 1 when
\(k=512\) and
\(B = 2,560\). However, the simultaneous coverage performance of
\(\mbox{CI}_{{\mbox{b}}}\) typically deteriorates as
k increases given a fixed budget
B, regardless of the kernel function adopted. This is evidenced by examining the
\(\widehat{\mbox{SCP}}\)’s of
\(\mbox{CI}_{{\mbox{b}}}\) obtained across all experimental settings shown in Tables
1 and
2. Third, the
\(\widehat{\mbox{SCP}}\)’s of all bounds increase with the budget
B given that the number of design points
k is fixed. Last, applying unequal budget allocation schemes typically help improve the
\(\widehat{\mbox{SCP}}\)’s of
\(\mbox{CI}_{{\mbox{b}}}\) when the number of design points
k is not too large relative to the budget
B given. Such an impact on the
\(\widehat{\mbox{SCP}}\)’s of
\(\mbox{CI}_{{\mbox{eu}}}\) and
\(\mbox{CI}_{{\mbox{u}}}\) is not as obvious, however.
We next examine the maximum half-widths
\(H_{\max }\) of
\(\mbox{CI}_{{\mbox{eu}}}\),
\(\mbox{CI}_{{\mbox{u}}}\), and
\(\mbox{CI}_{{\mbox{b}}}\) obtained under different experimental settings. Figure
1 summarizes the magnitudes of
\(H_{\max }\) (in a logarithmic scale) of
\(\mbox{CI}_{{\mbox{eu}}}\),
\(\mbox{CI}_{{\mbox{u}}}\), and
\(\mbox{CI}_{{\mbox{b}}}\) obtained under the grid design with a total budget of
\(B=2,560\) when SK is implemented with the Gaussian and Matérn kernels. The corresponding results obtained with
\(B=25,600\) and those obtained with
\(B \in \lbrace 2,560, 25,600\rbrace\) under the uniform design are shown in Figures
6–
8 in Appendix
D.1. We have the following observations from Figures
1 and
6: First, the
\(H_{\max }\)’s of
\(\mbox{CI}_{{\mbox{eu}}}\) and
\(\mbox{CI}_{{\mbox{u}}}\) are larger than those of
\(\mbox{CI}_{{\mbox{b}}}\) under each experimental setting considered, regardless of the kernel function adopted. Second, the magnitudes of
\(H_{\max }\) of
\(\mbox{CI}_{{\mbox{eu}}}\),
\(\mbox{CI}_{{\mbox{u}}}\), and
\(\mbox{CI}_{{\mbox{b}}}\) decrease with the number of design points
k when a large total budget
\(B=25,600\) is applied under all budget allocation schemes. However, this trend may not hold for a small total budget. Third, for
\(\mbox{CI}_{{\mbox{eu}}}\),
\(\mbox{CI}_{{\mbox{u}}}\), and
\(\mbox{CI}_{{\mbox{b}}}\), when a given budget
B is allocated to a fixed number of design points, the magnitudes of
\(H_{\max }\) produced under the equal allocation scheme are the highest. It is worth noting that, regarding all bounds, the two unequal budget allocation schemes lead to shorter half-widths than the equal budget allocation scheme without compromising the simultaneous coverage performance, as observed in Table
1. The magnitudes of
\(H_{\max }\) produced under the two unequal allocation schemes are similar, with unequal allocation 2 leading to higher
\(H_{\max }\) values than unequal allocation 1. Last, regarding the impact of the design-point sampling scheme, the uniform design tends to lead to more outliers of the
\(H_{\max }\) values compared to the grid design; see Figures
7 and
8 in Appendix
D.1 for details. Moreover, under the uniform design, the unequal allocation schemes yield shorter half-widths of all bounds compared to the equal allocation scheme, but the difference between the two unequal allocation schemes is slight.
We summarize the impacts of noise variance estimation, the design-point sampling scheme, and the choice of kernel function observed in the M/M/1 queueing example. First, comparisons based on the
\(\widehat{\mbox{SCP}}\)’s and
\(H_{\max }\)’s of
\(\mbox{CI}_{{\mbox{eu}}}\) and
\(\mbox{CI}_{{\mbox{u}}}\) indicate that the performance of the empirical uniform bound is very similar to that of the nominal one, except when the number of replications is too few at some design points, since using the variance estimates can incur additional variability in the SK model fitting. This corroborates the theoretical results provided in Section
5.2. Second, with other experimental settings held constant, the performance of all bounds under the grid design that is quasi-uniform typically outperforms that under the uniform design. This aligns with the theoretical results provided in Section
4. Last, with other experimental settings held constant, the performance of all bounds built with the Matérn kernel is comparable to that built with the Gaussian kernel for the M/M/1 queueing example. However, the magnitudes of
\(H_{\max }\) of all bounds built with the Matérn kernel tend to be higher than those with the Gaussian kernel; this is because a faster convergence rate of the predictive variance is achieved with the Gaussian kernel than with the Matérn kernel, corroborating Corollaries
1 and
2, along with the comments following the results. Moreover, when the number of replications at some design points is too small, using the Matérn kernel can result in fitting issues that render the SK model unstable and adversely affect simultaneous coverage performance; see the
\(\widehat{\mbox{SCP}}\)’s of
\(\mbox{CI}_{{\mbox{eu}}}\) and
\(\mbox{CI}_{{\mbox{b}}}\) obtained under unequal allocation 1 when
\(B=2,560\) and
\(k=512\) shown in Table
1 and the corresponding
\(H_{\max }\)’s shown in Figure
1(d) in this subsection and Figure
7(d) in Appendix
D.1.
6.2 A Dimension-flexible Example
Consider the following dimension-flexible example adapted from the one studied by Reference [
23], where
\(\mathcal {X}= [-1,1]^d\) and
d denotes the input-space dimension. The simulation output at design point
\({\bf x}=(x_1,x_2,\ldots ,x_d)^\top\) on the
jth replication is generated as
\(\mathcal {Y}_{j}({\bf x}) = f({\bf x}) + \varepsilon _j({\bf x}),\) where
\(\varepsilon _j({\bf x})\)’s are i.i.d. normally distributed with mean zero and variance
\({\sf V}({\bf x})\). The noise variance function is
\({\sf V}({\bf x}) = (2+\cos (\pi +\sum _{i=1}^d x_i/d))^2\) for any
\({\bf x}\in \mathcal {X}\). The mean function of interest is given by
\(f(x)=\sin (9x^2)\) if
\(d=1\) and
\(f({\bf x})=\sin (9x_1^2) + \sin ((\sum _{i=2}^d 3 x_i/(d-1))^2)\) if
\(d\gt 1\). We describe the specific setup for this example next and refer the reader to the beginning of Section
6 for the general experimental setup. We consider two problem cases corresponding to
\(d=1\) and
\(d=5\). For
\(d=1\), we vary the number of design points
\(k \in \lbrace 16, 64, 256, 512\rbrace\) when
\(B=2,560\) and
\(k\in \lbrace 16,64,256,512,2,048\rbrace\) when
\(B=10,240\). For
\(d=5\), we vary
\(k \in \lbrace 16, 64, 256, 512\rbrace\) when
\(B=2,560\) and
\(k \in \lbrace 64,256,512,1,024,2,048\rbrace\) when
\(B=10,240\). Given each choice of
k, a set of
k design points is chosen according to a uniform design or a maximin
Latin hypercube design (LHD). The three budget allocation schemes mentioned at the beginning of Section
6 are applied given each combination of
B,
k, and the design-point sampling scheme. On each macro-replication, we independently generate a set of
\(N = 2,500\) points from
\(\mathcal {X}\) according to a maximin LHD as the prediction points for performance evaluation.
Summary of results. We first look into the 1D case. Tables
3 and
4 present the
\(\widehat{\mbox{SCP}}\)’s of the empirical uniform bound (
\(\mbox{CI}_{{\mbox{eu}}}\)), the nominal uniform bound (
\(\mbox{CI}_{{\mbox{u}}}\)), and the Bonferroni correction-based simultaneous confidence intervals (
\(\mbox{CI}_{{\mbox{b}}}\)) under different experimental settings when SK is implemented with the Gaussian and Matérn kernels. Observations akin to those noted in the M/M/1 queueing example are summarized as follows: First, given a fixed budget
B, the
\(\widehat{\mbox{SCP}}\)’s of
\(\mbox{CI}_{{\mbox{eu}}}\) and
\(\mbox{CI}_{{\mbox{u}}}\) are typically higher than the target level 0.95. Moreover, the performance of
\(\mbox{CI}_{{\mbox{eu}}}\) is very similar to that of the nominal one in this 1D case, except when the number of replications is too few at some design points. Second, the
\(\widehat{\mbox{SCP}}\)’s of all bounds increase with the budget
B given that the number of design points
k is fixed. Third, applying unequal budget allocation schemes typically helps improve the
\(\widehat{\mbox{SCP}}\)’s of
\(\mbox{CI}_{{\mbox{b}}}\) when the number of design points
k is not too large relative to the budget
B given. Last, the
\(\widehat{\mbox{SCP}}\)’s of
\(\mbox{CI}_{{\mbox{b}}}\) deteriorate as the number of design points
k becomes large, regardless of the budget allocation schemes adopted. We highlight that this 1D case differs from the M/M/1 example in the following aspects: First, the
\(\widehat{\mbox{SCP}}\)’s of
\(\mbox{CI}_{{\mbox{b}}}\) are not significantly lower than those of
\(\mbox{CI}_{{\mbox{eu}}}\) and
\(\mbox{CI}_{{\mbox{u}}}\), so long as the number of design points
k is not too large relative to the budget
B given. Second, all bounds built with the Matérn kernel achieve comparable or higher
\(\widehat{\mbox{SCP}}\)’s compared to those built with the Gaussian kernel, especially when the number of design points
k is small.
We further delve into the maximum half-widths
\(H_{\max }\) of
\(\mbox{CI}_{{\mbox{eu}}}\),
\(\mbox{CI}_{{\mbox{u}}}\), and
\(\mbox{CI}_{{\mbox{b}}}\) obtained under different experimental settings. To save space, we concentrate on analyzing the results obtained under LHD with the equal allocation scheme. Similar observations apply to the results obtained under LHD with the two unequal budget allocation schemes and under the uniform design with all three budget allocations; see Figures
9 and
10 in Appendix
D.2.1 for details. Figures
2 and
3 summarize the magnitudes of
\(H_{\max }\) (in a logarithmic scale) of
\(\mbox{CI}_{{\mbox{eu}}}\),
\(\mbox{CI}_{{\mbox{u}}}\), and
\(\mbox{CI}_{{\mbox{b}}}\) obtained under LHD with a total budget of
\(B=2,560\) and 10,240. First, the
\(H_{\max }\)’s of
\(\mbox{CI}_{{\mbox{eu}}}\) and
\(\mbox{CI}_{{\mbox{u}}}\) are larger than those of
\(\mbox{CI}_{{\mbox{b}}}\) under each experimental setting considered, regardless of the kernel function adopted. Second, the magnitudes of
\(H_{\max }\) of
\(\mbox{CI}_{{\mbox{eu}}}\),
\(\mbox{CI}_{{\mbox{u}}}\), and
\(\mbox{CI}_{{\mbox{b}}}\) decrease with the number of design points
k when a large total budget
\(B=10,240\) is applied. However, this trend is less discernible with a small budget. Third, the magnitudes of
\(H_{\max }\) of
\(\mbox{CI}_{{\mbox{eu}}}\) and
\(\mbox{CI}_{{\mbox{u}}}\) are similar, except when
k becomes large relative to the given budget
B so using variance estimates based on a few replications can incur additional variability in SK model fitting, which impacts
\(\mbox{CI}_{{\mbox{eu}}}\); this observation holds regardless of the kernel function adopted. Fourth, applying unequal budget allocation schemes only slightly reduces the half-widths of all three bounds, as observed in Figures
9(a)–(d) and
10(a)–(d) in Appendix
D.2.1. Last, the magnitudes of
\(H_{\max }\) of all bounds built with the Matérn kernel tend to be higher than those with the Gaussian kernel, which explains the corresponding higher
\(\widehat{\mbox{SCP}}\)’s observed in Table
3. We also highlight some differences in the 1D case from the M/M/1 example. Given a fixed total budget
B, especially a small one, using a small number of design points
k can lead to unstable SK model fitting when the Gaussian kernel is adopted, resulting in highly variable
\(H_{\max }\) values for all bounds; see, e.g., the
\(H_{\max }\)’s obtained for
\(k=16\) in Figure
2(a). This is because the true mean function
\(f(\cdot)\) in this 1D case is nonlinear, and its trend changes more frequently than that of the M/M/1 example across the input space, due to its combination of the trigonometric function and higher-order polynomial terms. It is known that the Gaussian kernel is not flexible enough for modeling a function with such features, and using a small number of design points exacerbates this deficiency in capturing the true mean function
\(f(\cdot)\) in the 1D case.
We next investigate the 5D case. Compared to the 1D case, notice that the true mean function
\(f(\cdot)\) in this 5D case is more complex and changes more rapidly across its input space. Tables
5 and
6 present the
\(\widehat{\mbox{SCP}}\)’s of
\(\mbox{CI}_{{\mbox{eu}}}\),
\(\mbox{CI}_{{\mbox{u}}}\), and
\(\mbox{CI}_{{\mbox{b}}}\) obtained under various experimental settings. We observe that, similar to the 1D case, the performance of
\(\mbox{CI}_{{\mbox{b}}}\) built with the Gaussian kernel is notably weak, regardless of the budget size
B. In contrast,
\(\mbox{CI}_{{\mbox{b}}}\) built with the Matérn kernel delivers relatively robust performance under the same experimental settings. Furthermore, in this 5D case, an increase in the number of design points
k is associated with higher
\(\widehat{\mbox{SCP}}\)’s for
\(\mbox{CI}_{{\mbox{b}}}\) especially when the Matérn kernel is adopted, underscoring the need of using a dense design with more design points for capturing the function in high-dimensional problems and enhancing the reliability of the confidence bound built. Regarding
\(\mbox{CI}_{{\mbox{eu}}}\) and
\(\mbox{CI}_{{\mbox{u}}}\), the Matérn kernel also facilitates consistently strong simultaneous coverage performance across different experimental settings, highlighting its suitability for modeling highly nonlinear and rapidly changing response surfaces in high-dimensional problems. In contrast, due to the limitations of the Gaussian kernel in modeling the mean surface
\(f(\cdot)\) in this 5D case, the
\(\widehat{\mbox{SCP}}\)’s of
\(\mbox{CI}_{{\mbox{eu}}}\) and
\(\mbox{CI}_{{\mbox{u}}}\) suffer especially when the number of design points
k is small.
Figures
4 and
5 summarize the magnitudes of
\(H_{\max }\) (in a logarithmic scale) of
\(\mbox{CI}_{{\mbox{eu}}}\),
\(\mbox{CI}_{{\mbox{u}}}\), and
\(\mbox{CI}_{{\mbox{b}}}\), employing LHD with a total budget
\(B=2,560\) and 10,240. For simplicity, the discussion below focuses on the results obtained under LHD with the equal allocation scheme. Similar observations apply to the results obtained under LHD with the two unequal budget allocation schemes and under the uniform design with all three budget allocations, which are detailed in Figures
11 and
12 provided in Appendix
D.2.2. First and foremost, the variation exhibited by the
\(H_{\max }\) values of all bounds built with the two kernels demonstrates the notable difference between the two kernels in this 5D case as the number of design points
k increases, especially when the given budget
B is small. Specifically, the Gaussian kernel leads to increased instability in modeling as the number of design points
k grows, contrasting with the Matérn kernel, which maintains consistent modeling stability across the range of
k values considered; compare Figures
4(a) and (b). Furthermore, given a large budget
B, the
\(H_{\max }\)’s of all bounds built using the Gaussian kernel show a decreasing trend as the number of design points
k increases. However, such a trend seems absent when the Matérn kernel is adopted. This observation explains the corresponding low
\(\widehat{\mbox{SCP}}\)’s of
\(\mbox{CI}_{{\mbox{b}}}\) observed in Tables
5 and
6. An explanation is that the predictive variance (hence the confidence bound width) associated with the Gaussian kernel diminishes faster compared with the Matérn kernel, as indicated by Corollaries
1 and
2 and the comments immediately following them.
Finally, we summarize key insights obtained from the numerical evaluations. When applying SK for metamodeling purposes, selecting an appropriate kernel suitable for modeling the function of interest is crucial. The Gaussian kernel performs well in queueing applications like the M/M/1 example considered here [
53]. In contrast, only the Matérn kernel with smoothness parameter
\(\nu = 5/2+d/2\) was considered by Reference [
23] from where the dimension-flexible example originated, which also showed its competence in our evaluations. Therefore, choosing an appropriate kernel function is example-dependent and should take into account the features of the function of interest. The performance of Bonferroni correction-based simultaneous confidence intervals is sensitive to this choice, and they can be unreliable even when built by adopting an appropriate experimental design. In contrast, the uniform error bound is more robust, especially in high-dimensional problems. Second, fixed designs can outperform random designs in facilitating stability in SK model fitting and simultaneous coverage performance. Last, unequal allocation schemes can enhance the performance of confidence bounds, but their impact is limited.