Sequential Design with Mutual Information for Computer Experiments (MICE): Emulation of a Tsunami Model

Computer simulators can be computationally intensive to run over a large number of input values, as required for optimization and various uncertainty quantification tasks. The standard paradigm for the design and analysis of computer experiments is to employ Gaussian random fields to model computer simulators. Gaussian process models are trained on input-output data obtained from simulation runs at various input values. Following this approach, we propose a sequential design algorithm, MICE (Mutual Information for Computer Experiments), that adaptively selects the input values at which to run the computer simulator, in order to maximize the expected information gain (mutual information) over the input space. The superior computational efficiency of the MICE algorithm compared to other algorithms is demonstrated by test functions, and a tsunami simulator with overall gains of up to 20% in that case.


Introduction
Computer experiments have been widely replacing physical experiments as they are typically less resource-and time-consuming. A computer experiment is designed to learn about one or multiple outputs of interest of a computer model (also known as a simulator), using a number of runs of the simulator at different input parameter values, over a parameter space called the design space. However, advanced simulators themselves can be quite demanding in terms of computing resources and run time. Hence, in order to facilitate uncertainty quantification (UQ) for simulators, we want to run computer experiments for well chosen values of the various parameters. A natural approach is to evaluate the output of the simulator on a space-filling design, and then build a surrogate model from the collected input-output data. A surrogate model, or emulator, is a fast approximation of the simulator and is used to predict the model output for untried design points. The design should be chosen judiciously in order to obtain a good approximation of the simulator by the surrogate. The most common space-filling designs include Latin hypercube designs (LHD), maximin(Mm)and minimax(mM)-distance designs, multi-layer designs, and orthogonal arrays [2,23]. We denote by MmLHD a Mm-distance design within the class of LHDs, and mMLHD a mM-distance design within the class of LHDs.
Sacks et al. [25] fit a Gaussian Process (GP) models to data from computer experiments given a "small" budget. GP models are commonly used surrogate models for emulating expensive-toevaluate simulators in many fields of simulation [26], and for global optimization [18]. In recent years, Gaussian process approaches to UQ for simulators have become increasingly popular [4,27]. correlation length parameter for the i-th input dimension. Since this correlation function only depends on x and x ′ through the distance between them, it is stationary. When using the squaredexponential, one implicitly assumes that the process Y (x) has mean square derivatives of all orders, which is a quite strong assumption, especially when modeling complex physical processes. An alternative is to instead consider a more general set of correlation functions, the Matèrn family [16]: where ξ = (ℓ 1 , . . . , ℓ p , ν) T , Γ ν is the gamma function for ν, and J ν a modified Bessel function of order ν > 0. The parameter ν regulates the smoothness of the process; for instance, Gaussian random fields with Matèrn covariance are ⌊ν −1⌋ times mean-square differentiable [31]. The Matèrn correlation function has become increasingly popular because of its ability to model data of different degrees of smoothness. The squared-exponential correlation is a special case of a Matèrn correlation when ν goes to ∞. The Matèrn correlation with ν = 5 2 can be written in an explicit form, see [24]. For prediction the mean of the posterior distribution of Y (x), conditional on the known data (y j , x j ) m j=1 , can be used as a predictor for y(x) at any untried point. The known design D is (X D , y D ), where X D = (x j ) m j=1 , y D = (y 1 , y 2 , . . . , y m ) T , and y i = y(x i ). Given that the underlying function y(x) is a realization of Y (x)|β, θ, the joint distribution of [Y (x 1 ), Y (x 2 ), . . . , Y (x m )] T |β, θ is known to follow a multivariate normal distribution N m (Hβ, Σ θ ) where H = (h(x j )) m j=1 , and Σ θ is the m × m variance-covariance matrix whose (j, k)-th entry is given by Σ(x j , x k ; θ) = σ 2 K(x j , x k ; ξ) for x j , x k ∈ X D . The correlation matrix is K ξ = σ −2 Σ θ , and must be positive semidefinite. Next, y D = [y 1 , y 2 , . . . , y m ] T is assumed to be a realization of m-variate normal distribution [Y (x 1 ), Y (x 2 ), . . . , Y (x m )] T |β, σ 2 , ξ. An exact marginalization of β and σ 2 is possible using the non-informative prior p(β, σ 2 ) ∝ σ −2 . Then, as shown in [16], the predictive posterior distribution of Y (x)|y D , ξ follows a shifted Student's t-distribution on m − q degrees of freedom with meanŷ (x; ξ) = h T (x)β + k T ξ (x)K −1 ξ (y D − Hβ), and variance-covariancê where the m × 1 vector k(x; ξ) has entry j given by K(x, x j ; ξ) for x j ∈ X D , and γ(x; ξ) = Hβ). For practical computation, the inputs X D are scaled to lie in [0, 1] p , and the outputs y D are scaled to have zero mean and unit variance. The predictive (posterior) mean at x, given by Equation (2), is the best linear unbiased predictor (BLUP) [25], which interpolates the data D. The regularity of the correlation function K(x, x ′ ) for x = x ′ determines the regularity of the predictorŷ(x) [31], which means that the regularity of y(x) should be reflected in the choice of K(x, x ′ ). The predictive (posterior) variance at x, given by Equation (3) for x ′ = x, is an estimate of the MSPE (= E [y(x) −ŷ(x; ξ)] 2 ). Let us denote the predictive varianceV (x, x; ξ) byŝ 2 (x; ξ). The predictive varianceŝ 2 (x; ξ) can be used to produce confidence bands for the predictive mean y(x; ξ).

Estimation of uncertain parameters in the correlation function
So far, we have assumed that the correlation function K(·, ·; ξ) is known, which rarely is the case, particularly in the context of computer experiments. We briefly discusses approaches that deal with the situation where the form of K(·, ·; ξ) is assumed to be known, but where ξ is uncertain. Priors can be used to express one's uncertainty about the correlation parameters, followed by sampling of the posterior of those parameters [10], by, e.g., Markov chain Monte Carlo (MCMC). This is a fully Bayesian approach that can be very computationally expensive as the convergence to stationarity can be challenging. A somewhat less computationally demanding alternative is to find estimates of the parameters that best fit the GP to the known data, (X D , y D ), by using maximum likelihood estimation (MLE). MLE provides point estimates for ξ, denoted byξ, and those estimates are then plugged into the GP as if they were the true values. The vectorξ is the solution tô ξ = arg max ξ∈Ξ ln (L(ξ|y D )) = arg min where ln (L(ξ|y D )) is the marginal log-likelihood, and Ξ is the search domain of interest. Given that the underlying distribution is Gaussian, the following relationship holds: The estimated predictor,ŷ(x;ξ), is known as the estimated BLUP (EBLUP) [32]. A drawback with this approach is that the variance estimatorŝ 2 (x;ξ) tends to underestimate the MSPE, especially if the correlation is weak. This is because the estimator has been derived under the assumption that the correlation is fully known. In the following, we drop ξ's in our notation for brevity's sake, whenever there is no ambiguity.

On the numerical stability
In this section, we discuss a few practical implementations we have adopted: i) Nugget parameter for regularization: To fit the GP model with MLE we require the inverse and the determinant of the m × m correlation matrix K ξ for a large number of ξs. The Cholesky decomposition (K = LL T ) is typically used for matrix inversion, which costs O(m 3 ), and is hence computationally intensive for large m. L is here a lower triangular matrix. The Cholesky decomposition is known to suffer from numerical instabilities whenever the matrix K is near singular (due to ill-conditioning). The closer a matrix is to being singular, the larger its condition number is. Whenever the correlation matrix is ill-conditioned a non-zero nugget parameter can be used for regularization: K ξ is replaced in (2) and (3) by K ξ,τ 2 = K ξ + τ 2 I, where I is the m × m identity matrix. Typically τ 2 is chosen to be very small. K ξ,τ 2 becomes better conditioned as the size of the nugget increases. In practice, to prevent the correlation matrix from becoming ill-conditioned, the nugget should be chosen so that the condition number of K ξ,τ 2 does not exceed ∆ = 1/(10ε) [22], where ε is the machine epsilon for double precision floating point numbers. However, by introducing a non-zero nugget, we are sacrificing the interpolatory property of the predictive meanŷ(x), see Theorem 1. The nugget may also affect the integrated likelihood profile substantially [1]. Nevertheless, the use of a nugget to improve stability can be very beneficial [15]. Below is our result on the effect of the nugget on the predictive variance at any point in the design D.
Theorem 1 For a GP model with a constant mean, and with design D, the predictive variance given by (3) can be written aŝ for any x i ∈ X D , where τ 2 is a non-negative nugget parameter, and e i is the i-th unit vector.
According to Theorem 1, whenever τ 2 > 0 is added to the correlation matrix diagonal, the predictive variance, at a known design point, is σ 2 τ 2 with some variability proportional to σ 2 τ 4 . The magnitude of the last term in the round brackets tends to be much smaller than the other two; hence, the predictive variance at a known design point is typically smaller than σ 2 τ 2 . Moreover, as τ 2 increases, the second term approaches τ 2 , and the last term approaches τ 2 /M D , where M D is the number of points in the design D. This follows from the fact that with the increase of τ 2 , the inverse matrix reduces to (K + τ 2 I) −1 ≈ τ −2 I. Hence, for a very large τ 2 , following some simple calculations, we observe thatŝ 2 ii) QR decomposition: To improve numerical stability, we follow [21]: we replace K in Equation (2) and (3) with the Cholesky decomposition LL T . Then, the following transformations are made:H = L −1 H,k = L −1 k, andỹ D = L −1 y D . Also, as a precaution if D is large, the QR decomposition can be applied to decomposeH into QR T , where Q ∈ R m×p has orthonormal columns, and R ∈ R p×p is an upper triangular matrix. β can in this way be directly solved from the least square problem R T β = Q Tỹ D . Furthermore, the direct computation of the predictive variance-covariance, using Equation (3), could, due to unstable computations, produce negative values. Thus we improve the stability of Equation (3) as follows: iii) Arithmetic underflow: ln(det(K)) in (4) may suffer from floating-point underflow problems, that is, its value becomes so small that it cannot be represented in double floatingpoint precision. To mitigate the risk of this happening, we perform the following calculation [21]: ln(det(K)) = m ln(det(K) 1/m ) = m ln(( j L jj ) 2/m ), which uses that det(K) = det(L) det(L T ) = ( j L jj ) 2 (recall L is a triangular matrix).

Sequential design of computer experiments
A well-known (non-sequential) design strategy for prediction is maximum entropy sampling [29], where the goal is to identify the most informative subset of size M , determined using the entropy [9], from a finite candidate set X cand ⊂ X : whereȲ Here − log b (Ȳ ) is a measure of the information content ofȲ . If b = 2, ⌈H(Ȳ )⌉ is the number of bits needed to representȲ . When the underlying distribution is a multivariate normal with correlation matrix K, the entropy (in bits) is Ko et al. [19] showed that this optimization problem is NP-hard. In other words, this optimization problem quickly becomes computationally intractable when M grows large. Instead, a sequential approach can be adopted. Sequential design algorithms start with an empty or sparse design, and then add points into the existing design in a sequential manner until the design is complete. When a certain level of accuracy is desired within a limited time budget, a sequential algorithm offers the opportunity to stop the algorithm whenever a target accuracy level has been reached, which can result in significant computational savings. The two most widely used GP-based sequential design strategies are active learning-MacKay (ALM) and active learning-Cohn (ALC), see, e.g., [28].

ALM algorithm
ALM (Active learning-MacKay) is a greedy algorithm for maximum entropy sampling that selects x ∈ X cand that yields the largest improvement in entropy: at each step. The last identity follows from Equation (9). We use the subscript D to denote a quantity that depends on design D. At the end of each greedy step the new point, x * , is included in the existing design D, (X D , y D ). This procedure is repeated until the stop criterion is satisfied. A drawback is that ALM places too many points on the boundary of the design space, especially in the early stages of the selection process. This selection behavior is rather wasteful as boundary points generally are less informative than points in the interior [23]. Here, by boundary we mean the convex hull of the points in X cand . The ratio of boundary points grows rapidly with the dimension size, p. Given a regular grid with n p points, n p − (n − 2) p points are on the boundary. This means that the ratio of the number of points on the boundary to the total is (1 − (1 − 2 n ) p ). As an example, if p = 4 and n = 10 the ratio is approximately 0.59, and if p = 6 and n = 10, the ratio is approximately 0.74.

ALC algorithm
ALC (Active learning-Cohn) is a sequential algorithm that selects x ∈ X cand that yields the largest expected reduction in predictive variance over X : at each step. ALC solves (11) iteratively, and at the end of each iteration the point x * is added to the existing design D. In practice, the ALC criterion is often simply defined as the average reduction in predictive variance over a set of trial points spread out in X : where M r is the number of trial points. The Cholesky decomposition of K D∪{x} is performed for each x ∈ X cand , which leads to a time complexity of O(M cand M r m 3 ), which is computationally infeasible even for moderately large m. Fortunately, by first inverting K D using the Cholesky decomposition, which is in O(m 3 ), it is possible to compute K −1 D∪{x} for any x, in O(m 2 ), by exploiting the fact that K −1 D∪{x} can be expressed in terms of K −1 D (and k D (x)): . Furthermore, as shown in [14], by using Equation (13), it is possible to recast (12) into: which reduces the time complexity for a single ALC step from . ALC designs generally perform better than ALM designs in terms of MSPE for a given design size [14,28]. Moreover, ALC does not exhibit the tendency to select points on the boundary of X , contrary to ALM. On the other hand, ALM is occasionally preferred over ALC due to its relatively low computational complexity [4]; more on this in Section 5.

Mutual information for the design and analysis of computer experiments
Mutual information, which, like entropy, is a standard measure in classical information theory, has been shown to be useful for sensor network design [6,20], experimental design [17], and optimization [8].
In this work, we develop a sequential design algorithm for the design and analysis of computer experiments, called MICE (Mutual Information for Computer Experiments). MICE is based on the greedy mutual information (MI) algorithm proposed by Krause et al. [20], for optimal sensor placement. We begin this section with a brief background on mutual information. Let us consider two random vectorsȲ andȲ ′ with marginal pdfs pȲ (y) and pȲ ′ (y ′ ), and joint pdf pȲ ,Ȳ ′ (y, y ′ ). The mutual information between them, denoted by I(Ȳ ;Ȳ ′ ), is equivalent to the Kullback-Leibler divergence D KL (· ·) between pȲ ,Ȳ ′ and pȲ pȲ ′ [9]: with log(0)0 = 0. The mutual information betweenȲ andȲ ′ can be linked to entropy [9]: Caselton and Zidek [6] proposed a mutual information-based criterion for the design of spatial sampling networks, where the goal is to select from a candidate set X cand ⊆ X F the most informative subset of size M determined by the mutual information betweenȲ [X M ] andȲ [X M c ]: where This optimization problem is NP-hard [20].

A greedy mutual information (MI) criterion for prediction
Krause et al. [20] proposed a greedy algorithm for optimization problem (17), that is, maximize ) at each step, and showed that this criterion can be recast into a more simple form, i.e., for any x ∈ X cand , where D c = X F \X D , and X F is a discretization of the region of interest X ⊆ R 2 . The design D is the current design. Krause et al. [20] proved that this is a constantfactor approximation algorithm for optimization problem (17) under some mild conditions, and showed its efficiency for a sensor placement problem. In this case, X cand is the set of admissible sensor sites. The approximation is within (1 − 1/e) of the optimum, given that discretization X F is fine enough in X (i.e., a discretization level larger than 2M points, where M is the desired design size), and that the covariance function is fully known. The proof relies on the fact that the set Greedy algorithms are known to be quite efficient for submodular set functions. Furthermore, as we assume thatȲ [X D ] andȲ [X D c ] follow a multivariate normal distribution, we can rewrite the objective function (18) of the optimization problem into a computationally-efficient form, as follows: Thus, the goal is to maximize the ratio of the predictive variances using the design D, to that using the design D c \{x}. The greedy MI algorithm is given below: MI algorithm (Krause et al. [20]): Step 3. Let X D = X D ∪ {x * }, and X cand = X cand \{x * } Step 4. If design D is of size M , then stop; otherwise go to Step 1 Output: Observe that, in Step 2, a GP model is assigned to the design D, and for each candidate, x, a GP model is assigned to D c \{x}. Note that, in our case, we need to discretize the design space X ⊂ R p , where p may be large. Next, we compare MI to ALM and ALC for a simple example.

MI example
Let us consider a realization of a stationary Gaussian random field with zero mean on a 21 × 21 regular grid over [0, 1] 2 . The covariance is defined by σ 2 = 1 and the squared-exponential correlation function with ξ = (0.8, 0.5) T . The discretization X F is a regular subgrid of size 11 × 11 with equidistant points. We also make all points in X F available for selection, i.e., X cand = X F . The remaining 320 design points constitute the hold-off set used to calculate the prediction quality measured by the RMSPE (root mean squared prediction error). Figure 1 displays the prediction performance for ALM, ALC and MI, as averages over ten runs with different initial two-point designs, randomly picked from X cand . MI offers a prediction quality comparable to ALC. ALM is the worst. Indeed, at least 20 boundary points of the total 40 were selected no later than the 25-th iteration, for all the runs with ALM. The MI criterion is indeed a promising design criteria. However, we have identified some practical issues with the MI algorithm, which are presented below.

Practical issues with the MI algorithm
The results for the above example suggest that MI may be able to compete with ALM and ALC.
Krause et al. [20] arrived at the same conclusion for an optimal sensor placement application, and showed that MI has a lower run time than ALC (for their application with two design parameters). Unfortunately, the MI algorithm, in its existing form, struggles for the design and analysis of computer experiments. So far, the MI algorithm has been used for spatial data (p = 2) to select the most informative subset X D of size M from a set X cand of size M cand , where typically M cand ≫ k. Here X cand ⊆ X F , and X D c = X F \X D . The design of computer experiments is a more challenging design problem, e.g., the candidate points are typically selected by the algorithm to explore the design space X , and one often considers more than two parameters.
It is clear from observing the MI criterion (19) that D c plays a major role. For each x ∈ X cand , a GP is built on D c \{x}, which is an operation in O(M 3 D c ), where M D c is the number of points in D c . The run time of the MI algorithm thus grows exponentially fast with the size of D c . Hence, D c should preferably not be too large, but note that this also affects the allowed size of the candidate set, since X cand ⊆ X D c . In addition, the distribution of D c in X has a significant effect on the greedy MI criterion. D c should represent the complement of D in X , and should thus be chosen in such a way that F = D ∪ D c becomes a well-spaced discretization of X . For points close to each other in D c ,ŝ 2 D c \{x} (x) may become very small, and hence produce a high MI score regardless of their location in X . This issue can be observed in Figure 2. A high MI score is marked in red, a low score is in blue, an intermediate score is in yellow, and the black dots are the points of the design D. The detrimental effect an additional point to D c , in close proximity to points already in D c , may have on the MI scores, shows that the distribution of points in D c has a strong influence on the MI criterion. Hence, there is an apparent need to examine, and, if possible, improve the robustness of the MI criterion. For small p, we can often ensure that D c (D c ⊂ F ) is well-spaced by using a regular grid with equidistant points as the discretization F . For large p, however, regardless of sampling technique employed to sample F , it would be an immense task to ensure that the resulting distribution of points is evenly distributed without clustering. Even with sampling techniques that are space-filling, such as MmLHD, we are unable to guarantee consistency in the results with MI. See Figure 3 for the comparison between the MI scores for two D c designs with different distribution. The MI scores for the two cases are highly conflicting. The MI algorithm also assumes that the covariance is known, but this is rarely the case. MLE can be used to adaptively find point estimates for the uncertain correlation parameters. The choice of correlation parameters is crucial: a poor choice may lead to a poor GP model, see Figure 4. For MI-MLE, the guess ξ = (1, 1) T is used until the design D is of size 10, after that, MLE is used

MICE algorithm
In this section our new algorithm, called MICE, which overcomes the shortcomings of MI, is introduced. The MICE algorithm is presented below, followed by details on the individual steps of the algorithm.

MICE algorithm:
Require: y(x), X , GP model (h(·), K(·, ·; ξ)), τ 2 and τ 2 s , initial data (X D , y D ), M D c , M cand , desired design size M Step 1. MLE to obtain estimatesξ used in K(·, ·;ξ) Step 2. Fit GP model to data (X D , y D ) Step 3. Generate X D c with respect to X D , and then choose X cand (⊆ X D c ) Step 4. Solve x * = arg max Step 5. Evaluate y * = y(x * ) and let X D = X D ∪ {x * } and y D = y D ∪ {y * } Step 6. If design X D is of size M , then stop; otherwise go to step 1 Output: X D of size M with output data y D In step 1 MLE is used for estimating the uncertain correlation parameters. In MICE a fully Bayesian approach could also be used.
Step 2, 5, and 6, are typical steps performed by any GPbased sequential design algorithm. In step 3, we argue in favor of generating X D c as a LHD. The LHD can be selected from a set of LHDs. We want to select the LHD that has the large minimum distance to the current design X D , since then X D c becomes an approximation of the complement of X D in X . Clearly, as the dimension of the problem, p, grows, so does the required size of D c . One way to overcome this issue is to generate a new D c for each MI step to keep D c at a moderate size, to improve the design exploration. In step 4 our MICE criterion is used. In contrast to the MI criterion, we added a "large" nugget parameter τ 2 s is added to the diagonal elements of the correlation matrix in the GP assigned to D c \{x}. If the size of the nugget is large enough, the predictive variance is flattened as the GP model becomes much more relaxed at the points in D c , see Theorem 1. A moderately large value for τ 2 s is needed to achieve the desired effect; the default choice is τ 2 s = 1. In this way, one overcomes the issue of uneven distribution of points in D c , which we identified as a major limitation for the MI criterion. The new criterion is defined bŷ s 2 D (x;ξ, τ 2 )/ŝ 2 D c \{x} (x;ξ, τ 2 s ) (note the two nuggets τ 2 and τ 2 s ). Figure 5 and 6 show score values of the MICE with τ 2 s = 1 for a direct comparison with the correponding figures for MI ( shown in Figure 2 and 3, respectively).   Figure 6: Given a design D, of size 5, the score value of the MICE criterion using τ 2 s = 1 is shown for two MmLHD candidate sets of size 100.
Observe that MICE with τ 2 s = 1 performs as well as MI for the regular grid (the simple case), whereas for the case when D c has an uneven distribution of points (and clustering), MICE shows robustness to the design D c . The choice of τ 2 s is investigated further in Section 6. Note that the GP for the design D, in MICE, does not rely on τ 2 s . Although, if necessary, another nugget parameter can be introduced. For that GP we have introduced τ 2 = 10 −12 for the purpose of improving numerical stability. The estimatesξ for the GP over D c are the same as those for D.
The optimality results provided by Krause et al. [20] for the MI algorithm are for known ξ. Therefore, we generalize this result in Theorem 2, which covers the case of different nugget parameters for the GPs over D and D c , as well as the case of estimated correlation. This theorem enables us to derive an approximative bound of optimality for MICE as in [20].
Theorem 2 Let Y (x) be a GP model on a compact set X ⊂ R p with a continuous correlation function K(x, x ′ ) : X × X → R + 0 . Moreover, assume that the correlation matrices for D and D c have included the nuggets τ 2 and τ 2 s , respectively. We assume that the MICE algorithm use estimatesξ i , for greedy step i, that satisfy, for some constant α > 0, |K(x, x ′ ; ξ)−K(x, x ′ ;ξ i )| ≤ α. Then, for any ε > 0, and any finite number M , there exists a discretization F δ of mesh width δ > 0 such that MICE is guaranteed to select a design D M of M design points, where M ≤ 2|F δ |, for which where e is the base of the natural logarithm, and OP T is the value of the mutual information for the optimal design of size M .
Under the optimal conditions this upper bound guarantees a performance within 63% of the optimum. The term M ε is unavoidable, but small if the discretization F is fine enough. The term 2(ασ −1 τ −1 ) 2 M 4 (1 + M 3/2 ) 2 − M 7/2 |τ 2 s is related to the parameter uncertainty, and the term M 3 √ M |τ 2 s − τ 2 |/τ 2 s is the penalty for using τ 2 s instead of τ 2 for the GP over D c . Our extention of the approximative bound of optimality, to MICE, shows the effect of the nugget τ 2 s on the performance, and tells us that our default choice τ 2 s = 1 is not causing the algorithm to diverge too much from mutual information. Observe that the near-optimal performance can be shattered by a poor choice of values for the correlation parameters (with α large). To better understand the behavior of the MICE criterion for large D c , we also provide the following theorem: Theorem 3 Let Y (x) be a second-order stationary Gaussian process with constant mean on a compact subset X of R p with a Lipschitz-continuous correlation function. The expected MSPE, s 2 τ 2 (x), is given by (3) with a correlation matrix using a non-negative nugget parameter τ 2 . Then, for any ε > 0, there exists a regular grid X δ ⊂ X with grid spacing δ = 2ε/( √ pK L ) so that for any untried point x * ∈ X the predictive variance is bounded as where I is the identity matrix, and e n the n-th unit vector for member x n of X δ . Here K L is the Lipschitz constant forŝ 2 τ 2 (x) over X .
Theorem 3 shows that if design X D c is dense enough in X , and τ 2 s is small enough, MICE is equivalent to ALM, sinceŝ 2 τ 2 s (x * ) is within (ε + bτ 4 ) to the constant σ 2 τ 2 s for any x * , where ε > 0 can be made arbitrary small. The size of b decreases rapidly with τ 2 s . MICE behaves as ALM, since according to Theorem 3, if D c is dense enough so that ε > 0 can be made arbitrary small, and τ 2 s large enough so that (K + τ 2 I) −1 = τ −2 I (approximately), then 0 <ŝ 2 D c < ε. However, with our default value τ 2 s = 1, MICE is not expected to behave as ALM. MI, on the other hand, behaves as ALM whenever X D c is dense in X , and τ 2 is very small. Note that both the squared-exponential correlation function and Matérn with ν = 5/2 are continously differentiable [16], hence Lipschitz continuous. Figure 7 shows, for a two dimensional example, the designs obtained for the different sequential design algorithms. Observe that MICE with τ 2 s = 1 displays a centered and well-spaced design. ALM selects only boundary points, since the two points of the interior constitute the initial design. MI is the criterion that appears most reluctant to select boundary points. In MICE, we need to computeŝ 2 D c \{x} (x) for each x ∈ X cand , which typically requires the Cholesky decomposition of the (M D c − 1) × (M D c − 1) correlation matrix K D c \{x} . This adds up to a computational burden if D c is large, and we thus propose the following implementation that only requires a single Cholesky decomposition. First, invert the M D c × M D c correlation matrix K D c . Then exploit the partitioned inverse formula for matrices in block-form: the inverse of can be written as: for any x ∈ X D c as follows: given K −1 D c , we can obtain B, b 12 , b 21 and b, directly from (21), and then we find that K −1

Computational complexity
To assess the performance of a sequential design strategy one should not only look at the prediction accuracy as the number of design points increases, but also take computational complexity into account. The overall computational cost is the sum of the computational cost of of fitting the GP, generating the candidate set X cand , evaluating the design criterion over the candidate set, and finally evaluating the simulator for those design points included in the design D, that is, The estimation of the correlation parameters with MLE, is in O(pm 2 + M mle m ω ), where ω > 0 is related to the efficiency of the algorithm used for matrix inversion: for naïve Gaussian elimination ω = 3, and for Strassen's algorithm ω = log 2 (7). The term pm 2 is the number of operations needed to determine the distances between distinct pairs of points in X D (which is m(m − 1)/2). The second term M mle m ω is the time complexity for MLE, which is equal to the cost as inverting the correlation matrix K ξ , M mle times. M mle is the number of trial points explored to obtain the MLE estimatesξ. To train the GP, which entails the determination of the weights of the linear predictor, only matrix multiplications are required, of order O m 2 . The time complexity for evaluating the predictive mean at an untried point is O (pm), and for the predictive variance it is O pm 2 .
The computational complexity of the different algorithms are given in Table 1. For ALC, we used the formulation given by Equation (14), which is the one with lowest computational cost. The time complexity for a single ALM step is O pm 2 + M mle m ω + M cand pm , where M cand is the number of points in the candidate set. The total cost for ALM is O pM 3 where M D is the number of points in the final design. Here M r is the number of trial points used in ALC for averaging over the design space. To simplify the expressions given in Table 1  is the most expensive term for small to moderately large M D . Computational savings could be made if estimates for the correlation parameters only are updated for certain carefully selected iteration steps (e.g. every 10 or so, or adaptively), not at each iteration. It would give a much bigger relative advantage to MICE against ALC, with a lower T mle , see Figure 8. . MLE for estimation of the correlation parameters is applied whenever the current design is greater than 10, otherwise fixed. The candidate sets generated are LHDs, selected based on the maximin dinstance to the current design from 500 LHD candidates.
The cost to generate candidate sets depends greatly on the desired size, and the sampling technique used. For instance, minimax designs are more computationally intensive than maximin designs [2].

A numerical comparison
In this section, we present a numerical comparison between MICE, ALM and ALC. MmLHD and mMLHD, which are widely used, have been included to represent affordable non-sequential strategies. MmLHDs tend to cover the parameter space better than Mm-distance designs, which are not restricted to the class of LHDs, but at the expense of lower Mm-distance scores. Hence, MmLHD can be seen as a compromise between Mm-and mM-distance designs [2]. The computational budget is limited to 150 design points; we found this to be reasonable budget. The results presented will provide a good foundation to understand the different sequential design strategies. The metric of success is primarily the RMSPE against design size. The normalized RMSPE is calculated using a hold-off design set which includes 1000 pre-evaluated design points. The hold-off set is a LHD. The test functions have been carefully selected in order to achieve diversity in both input dimension size p and difficulty level. All results are presented as averages of ten runs. For the sequential strategies, what differs between the runs is the initial data of size 2, which is generated by mMLHD with 1000 reference points. Note that all strategies are compared using the same initial training data for the runs. For MmLHD and mMLHD, we simply generated ten different designs for each specific design size (50,75,100,120). The variability in the results is not always presented to increase visibility; it is however similar across the different strategies, see, e.g., the 95%-credible bands provided in Figure  13. We have also included the run time of the algorithms to express algorithm efficiency.
A stationary GP with a Matèrn covariance with ν = 5/2 is used. The size of the candidate set has a significant effect on the results, and therefore we decided to declare the number of candidate points in the method name, for instance we denoted by MICE-150 the MICE algorithm with M cand = 150. Because the algorithm cost of ALC is significantly higher than ALM and MICE, the size of the candidate set is set as M cand = 150. We let MICE and ALC have the same candidate set size, in order to ensure a fair comparison in terms of RMSPE. For ALM we use M cand = 1000 due to its low algorithm cost for point selection (T select ). The candidate sets generated are LHDs, selected based on the maximin criterion with respect to the current design (from a set of 500 LHD candidates).
The ALC algorithm is given by (14), with M r = M cand , as in [14,28]. MmLHD and mMLHD select a LHD from a pool of 1000 LHDs. The mM-distance is measured using 1000 reference points over X on a LHD. The optimiser employed for maximum likelihood estimation is a real-coded genetic algorithm [11] with settings that require 1024 calls to the log-likelihood. The values for the uncertain correlation parameters in the correlation function are fixed until the current design is of a specific size (20 if p > 4, else 10) to avoid unstable estimates. The nugget parameter for the GP model assigned to current design D is τ 2 = 10 −12 . For MICE, the smoothing nugget is typically τ 2 s = 1 but we also included results for a range of different choices of τ 2 s . For the comparison we also included τ 2 s = 10 −12 , which would be the value used in the greedy MI algorithm, since then τ 2 s = τ 2 .

Alan Genz's Oscillatory function
The "Oscillatory" function belongs to a family of test functions [13] proposed by Alan Genz for the study of quadrature methods. The function is y(x) = cos (c · x + 2πw) , x ∈ [0, 1] p . The vector c = (c 1 , c 2 , . . . , c p ) T determines the level of difficulty along the different coordinate directions of X ⊂ R p , and w is the displacement. To study the impact of dimension size p on the difficulty to predict untried points, c is constrained as As can be observed in Figure 9, the sequential strategies outperform the ones based on LHDs, as expected. MmLHD and mMLHD, even if well spaced, do not take into account that y(x) is anisotropic. The worst performing sequential design strategy presented is MICE-150 with τ 2 s = 10 −12 , which in fact is equal to the MICE algorithm a criterion equal to the MI criterion. The poor performance is evidently due to the MI criterion as explained in Section 4.3. In the 4-D case, ALM-1000, ALC-150, and MICE-150 (with the recommended nugget level τ 2 s = 1) produce similar result in terms of prediction error, but as shown in Figure 8, the run time for ALC is significantly higher than for ALM and MICE, which makes it the least attractive of them all if y(x) is relatively inexpensive to evaluate. Even if one assumes that ALC-50 would produce a similar performance as ALC-150, it is still quite expensive and therefore not competitive. Observe that τ 2 s = 1 performs the best.

Piston simulation function
Here we consider a 7-dimensional example given in [3], where the output describes the circular motion of a piston within a cylinder; it obeys the following equations: Here y(x) is the cycle time (s) which varies with seven input variables. The design space is given by . For the simulation, the input values have been normalized using the min-max ranges. The nonlinearity makes this deterministic computer experiment problem very challenging to emulate. In addition, we included MICE-300, which showed a slight improvement over MICE-150, see Figure 11. MICE with 300 candidate points is not that much more expensive than with 150; in fact, it is significantly cheaper computationally than ALC with 150. Again, the proposed algorithm MICE performs the best. For high-dimensional problems, ALM tends to be the worst, probably due to the high percentage of points on the boundary.

Application to a tsunami computer model
There is a pressing need for UQ in tsunami modeling in order to provide accurate risk maps or issue informative warnings. Sarri, Guillas and Dias [27] were the first to demonstrate that statistical emulators can be used for this purpose. Recently, Sraj et al. [30] studied the propagation of uncertainty in Manning's friction parameterization to the prediction of sea surface elevations, for the Tohoku 2011 tsunami event. They used a polynomial chaos (PC) expansion as the surrogate model of a low resolution tsunami simulator. However, for UQ, Bilionis and Zabaras [4] showed that GPs can outperform PC expansions, especially for small to moderate-sized designs that are only available for high resolution (e.g. 100-200m) tsunami simulations.
We consider here the problem of predicting the maximum free-surface elevation of a tsunami wave at the shoreline, for a wide range of untried conditions, following a subaerial landslide at an adjoining beach across a large body of shallow water. A tsunami wave simulator is used, whose cost is included in our study. We demonstrate the efficiency of the different sequential design methods for the design of a realistic computer experiments. This problem, is inspired by a benchmark problem, given at the Catalina 2004 workshop on long-wave runup models used in the validation of tsunami models. A landslide of seafloor sediments, initially at the beach, has a Gaussian shaped mass distribution, and generates tsunami waves that propagates towards the opposite shoreline across from the beach (see Figure 12). The sea-floor bathymetry is changing over time, and is used as input to the tsunami model simulator. The beach floor motion is described by the change in bathymetry of the sloping beach over time, h(x, t) = H(x) − h 0 (x, t), where H(x) = x tan β is the static uniformly sloping beach, and h 0 (x, t) = δ exp −(x −t) 2 is the perturbation with respect to H(x, t). Herex = 2 xµ 2 δ tan φ 1 ,t = g δ µt, δ is the maximum vertical slide thickness, µ is the ratio of the thickness and the slide length, and tan φ 1 is the beach slope. The free surface elevation is defined as z(x, t) = −h(x, t). It is assumed the initial water surface is undisturbed, that is, z(x, 0) = 0 for all x. The slope tan φ 2 of the beach at the opposite shoreline is chosen so that the distance between the shorelines is 2800 m. This is a shallow water problem, which means that tan φ 1 ≪ 1, and that the translating mass movement is thin (µ = δ/L ≪ 1).
We use the state-of-the-art numerical code VOLNA [12] to simulate all stages of this landslidegenerated tsunami event, based on nonlinear shallow water equations. We run VOLNA on a single GPU on the cluster Emerald. The bathymetry defined above is given only along one spatial coordinate, but in the code implementation of VOLNA a second spatial dimension (in this case, along the shoreline) is added to cover 10 meters of shoreline. The mesh is defined on [−5, 5]×[0, 3000] For MICE and ALC, M cand = 150, and for MICE τ 2 s = 1 is used. As before, ALM uses M cand = 1000 as it is relatively cheap computationally. The mean function is assumed to be a constant. The results are averages of ten runs. As before, the GPs have a constant mean, and use the Matèrn covariance ν = 5 2 . The normalized RMSPE and the maximum error are calculated using a hold-off set of size 500. As observed in Figure 13, MICE performs better than ALC, especially as the design increases in size. Observe that, for design size 120, the maximum error in sea surface elevation is less than 1 meter for waves up to almost 10 meters. Note that in the bottom right figure the total run time is given in logarithmic scale with base 10, and the computational savings that can be made by using MICE are thus around 10-20% (and much higher if MLE were not used at each iteration). A single run of VOLNA takes on average 850 seconds. The time consumed by the simulator is represented by a grey dashed line in Figure 13 (bottom left figure). For a fully realistic tsunami scenario we expect a substantially longer run time for the tsunami simulator, which will make the MICE algorithm even more computationally advantageous.

Conclusion
We have shown for a few test functions, and a realistic tsunami simulator, that MICE is able to outperform traditional sequential and non-sequential algorithms, such as ALC, ALM, and LHDs. It naturally extends MI from a spatial context to computer experiments, where several parameters are considered and the covariance structure of the GP is not fully known. MICE is particularly attractive in terms of the time complexity of the whole process (MLE to fit the GP surrogate, draws of candidates, selection of inputs amongst these candidates, and run of the simulator at those inputs). In addition, MICE may outperform even more other designs with less frequent MLE estimates, and this will be investigated in the future. Our theoretical results enhance our understanding of the effect of the nugget parameter on the estimation of the variance, a key ingredient in MICE, and also demonstrate near-optimality of our approach. We envision a large impact of our design method in several fields of research, such as realistic tsunami modeling. Finally, an area for further research could be to use the MICE algorithm in a non-stationary setting, e.g., in the treed GP form [14].