Multilevel Ensemble Transform Particle Filtering

This paper extends the Multilevel Monte Carlo variance reduction technique to nonlinear filtering. In particular, Multilevel Monte Carlo is applied to a certain variant of the particle filter, the Ensemble Transform Particle Filter. A key aspect is the use of optimal transport methods to re-establish correlation between coarse and fine ensembles after resampling; this controls the variance of the estimator. Numerical examples present a proof of concept of the effectiveness of the proposed method, demonstrating significant computational cost reductions (relative to the single-level ETPF counterpart) in the propagation of ensembles.


Introduction
Data assimilation is the process of incorporating observed data into model forecasts. In data assimilation, one is interested in computing statistics E_η[X] of solutions X to random dynamical systems with respect to a posterior measure η, given partial observations of the system. In particle filtering [6,3], this is done by using an empirical ensemble representing the posterior distribution η at any one time. The propagation in time of the members (particles) of this ensemble can be computationally expensive, especially in high dimensional systems.
Recently, the Multilevel Monte Carlo (MLMC) method has been developed for achieving significant cost reductions in Monte Carlo simulations [8]. It has been applied to areas such as Markov chain Monte Carlo [11] and quasi-Monte Carlo [9], yielding computational cost reductions over existing techniques, and to uncertainty quantification within PDEs [5]. The idea is to consider a hierarchy of discretized models, balancing numerical error in cheap/coarse models against Monte Carlo variance in expensive/fine models. It is desirable to adapt MLMC to sequential Monte Carlo methods such as particle filters, and some first steps have been taken in this direction. Firstly, [10] developed a multilevel Ensemble Kalman Filter (EnKF), using MLMC estimators to calculate the mean and covariance of the posterior, in the case where the underlying distributions are Gaussian and the model is linear. For non-Gaussian distributions and nonlinear models, however, the EnKF is biased. Secondly, [2] proposed a multilevel sequential Monte Carlo method for Bayesian inference problems, giving significant computational cost reductions over standard techniques. Our goal in this paper is to take a step further by applying MLMC to nonlinear filtering problems.
In general, MLMC works by computing statistics from pairs of coarser and finer correlated ensembles. For Monte Carlo simulation of SDEs, this correlation is achieved by using the same initial conditions and Brownian paths for each coarse/fine pair of ensemble members. The key challenge in applying MLMC to particle filtering is in maintaining this correlation after resampling. [7] suggested that correlating coarse and fine ensembles could be achieved by minimising the Wasserstein distance between the two ensembles. This can be formulated as an optimal transportation problem [17].
In this paper, we adapt the MLMC framework to the Ensemble Transform Particle Filter (ETPF) [16]. The ETPF is an efficient and effective nonlinear filter that uses optimal transportation transformations [19] instead of random resampling. In our Multilevel ETPF (MLETPF), the coupling between coarse and fine ensembles is also maintained using optimal transportation. The sole aim of introducing MLMC to the ETPF is to reduce the computational cost of the propagation of particles. This is only a benefit if the propagation cost dominates the optimal transportation cost; whilst direct solvers for optimal transportation problems with one-dimensional state space scale as O(N) with the ensemble size, solvers for problems in more than one dimension scale as O(N^3). To address this, a technique commonly used with the aforementioned EnKF, known as localisation, can be used to reduce this optimal transportation cost significantly [4]. Our proposed MLETPF can return significant reductions in the overall computational cost of the ETPF where the particle propagation cost dominates; it will also return significant reductions in cases where the optimal transportation cost dominates, provided the localised ETPF is used. This paper proceeds as follows. Section 2 provides background on the MLMC method, and Section 3 describes the basic particle filtering framework together with the ETPF scheme. The proposed Multilevel ETPF (MLETPF) method is then presented in Section 4, along with numerical examples to demonstrate the effectiveness of the method. Finally, Section 5 provides a summary and outlook.

The Multilevel Monte Carlo Method
The Multilevel Monte Carlo estimator can be viewed as a variance reduction technique for a standard Monte Carlo estimator. Suppose one wishes to compute an approximation of E_{η_L}[X_L], where X_L is a numerical approximation of a random variable X (with discretization accuracy parameter h_L ∝ M^{-L}); X_L has probability distribution η_L. The Multilevel Monte Carlo (MLMC) method introduced in [8] considered the case where X is the solution to a stochastic differential equation (SDE) at time T > 0; the discretized solutions X_L are obtained from a given numerical method with stepsize h_L. This paper will instead consider X to be a solution, at time T > 0, to a general random dynamical system, with stochastic forcing and/or random initial conditions (drawn from a distribution π_0). In the simplest case, let X^i_L, i = 1, ..., N, be i.i.d. samples distributed according to η_L, so that η_L is approximated by the empirical measure (1/N) Σ_{i=1}^N δ_{X^i_L}, where δ is the Dirac delta function. The standard Monte Carlo estimator for E_{η_L}[X_L] is then

X̂_L = (1/N) Σ_{i=1}^N X^i_L.  (2)

Using a telescoping sum of expectations,

E_{η_L}[X_L] = E_{η_0}[X_0] + Σ_{l=1}^L ( E_{η_l}[X_l] - E_{η_{l-1}}[X_{l-1}] ),  (3)

one can define the MLMC approximation to E_{η_L}[X_L] as a sum of independent Monte Carlo estimators, X̂^{ML}_L = Σ_{l=0}^L X̂_l, where

X̂_0 = (1/N_0) Σ_{i=1}^{N_0} X^i_0,   X̂_l = (1/N_l) Σ_{i=1}^{N_l} ( X^i_l - X^i_{l-1} ),  l > 0,

leading to

X̂^{ML}_L = (1/N_0) Σ_{i=1}^{N_0} X^i_0 + Σ_{l=1}^L (1/N_l) Σ_{i=1}^{N_l} ( X^i_l - X^i_{l-1} ).  (5)

We have E[X̂^{ML}_L] = Σ_{l=0}^L E[X̂_l] = E_{η_L}[X_L], and hence the MLMC estimator is an unbiased approximation of E_{η_L}[X_L]. We will call the estimators X̂_l, l > 0, 'multilevel difference estimators'. The important thing to note here is that the fine (level l) and coarse (level l - 1) samples in each difference estimator must be positively correlated for each i, in each of the multilevel difference estimators. This can be achieved by using the same random system input (e.g. initial conditions / stochastic forcing) for each i on both levels. On the other hand, the samples in different difference estimators must be uncorrelated. The discretization bias (away from E_η[X]) of the overall estimator is O(T h_L^α) [8], where α is the order of the global discretization bias (i.e. E_{η_L}[X_L] - E_η[X] = O(h_L^α)) of the numerical method used to simulate X_l, l ≥ 0.
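To make the coupling concrete, here is a minimal sketch of the coupled coarse/fine sampling and the telescoping-sum estimator. The toy linear SDE dX = -X dt + dW, the forward-Euler discretization, and all numerical parameters are illustrative choices, not from the paper; the point is that the coarse path reuses the fine path's Brownian increments, summed M at a time, which is what keeps each pair positively correlated.

```python
import numpy as np

def pair_sample(l, T=1.0, M=2, h0=0.5, x0=1.0, rng=None):
    """One coupled (fine, coarse) forward-Euler pair for the toy SDE
    dX = -X dt + dW, sharing Brownian increments between the two levels."""
    if rng is None:
        rng = np.random.default_rng()
    hf = h0 * M**-l
    nf = int(round(T / hf))
    dW = rng.normal(0.0, np.sqrt(hf), nf)   # fine-level Brownian increments
    xf = x0
    for k in range(nf):
        xf += -xf * hf + dW[k]
    if l == 0:
        return xf, 0.0                      # level 0 has no coarse partner
    hc = h0 * M**-(l - 1)
    xc = x0
    for k in range(nf // M):                # coarse path sums M fine increments
        xc += -xc * hc + dW[M*k:M*(k+1)].sum()
    return xf, xc

def mlmc_estimate(L, N):
    """Telescoping-sum estimator: sum over levels of mean(fine - coarse)."""
    rng = np.random.default_rng(0)
    total = 0.0
    for l in range(L + 1):
        diffs = [f - c for f, c in (pair_sample(l, rng=rng) for _ in range(N[l]))]
        total += np.mean(diffs)
    return total
```

Because the pairs share increments, the level differences are small and the estimator concentrates its samples on the cheap coarse levels.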
One notes from [8] that E_{η_L}[X_L] - E_η[X] ≈ c h_L^α, where c is a positive constant, and that the computed difference estimator on the finest level can therefore be used as an estimate for the overall discretization bias, X̂^{ML}_L - E_η[X]. As each estimator in (5) is independent of the others, the overall variance is given by the sum of the variances of the individual estimators. Given that there is a positive correlation between X_l and X_{l-1}, one can expect the sample variance of X_l - X_{l-1}, denoted by

V_l = Var[X_l] + Var[X_{l-1}] - 2 Cov(X_l, X_{l-1}),  (7)

to decay as l increases, so that V_l = O(h_l^β), β > 0. The covariance in the last term in (7) is taken over the joint probability distribution of η_l and η_{l-1}. One can then trade off variance in fine/expensive estimators against discretization error in coarse/cheap estimators with lower variance by setting a decreasing sequence of Monte Carlo estimator sample sizes, N_0 > N_1 > ... > N_L. The overall computational cost of the MLMC estimator is

C_MLMC = Σ_{l=0}^L N_l h_l^{-γ},  (9)

where h_l^{-γ} defines the computational cost of propagating one single sample (of X_l) through a discretized system. On the other hand, the cost of the standard Monte Carlo estimator in (2) is

C_MC = N h_L^{-γ}.  (10)

By setting the continuous variables N_l to their optimal values N_l ∝ (V_l h_l^γ)^{1/2} given in [8,5], as well as optimally choosing the finest level L, one can achieve a computational cost reduction relative to the standard Monte Carlo counterpart (2), with the same bound on the mean square error. Giles [8] proved the following result. Theorem 1. If the mean square error of the MLMC estimator is bounded by O(ε²), one can optimally choose L and N_l to allow the computational cost of the MLMC estimator to be bounded by

C_MLMC ≤ c_1 ε^{-2} if β > γ,   C_MLMC ≤ c_2 ε^{-2} (log ε)² if β = γ,   C_MLMC ≤ c_3 ε^{-2-(γ-β)/α} if β < γ,

where c_1, c_2, c_3, γ, β are positive constants, T is constant and α ≥ (1/2) min(γ, β). The MLMC approach in principle is very simple to implement and can be very effective as long as one can satisfy the two constraints, α ≥ (1/2) min(γ, β) and β > 0. More detail on this method can be found in a generalised explanation, and related theorems, in [5].
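The variance/cost trade-off can be made concrete with the optimal sample-size rule N_l ∝ (V_l h_l^γ)^{1/2}. The sketch below scales the N_l so the total estimator variance Σ_l V_l / N_l is about ε²/2; the per-level variances and step sizes passed in are hypothetical inputs, not values from the paper.

```python
import math

def optimal_sample_sizes(V, h, gamma, eps):
    """Optimal MLMC sample sizes: N_l proportional to sqrt(V_l * h_l^gamma),
    scaled so that the total estimator variance sum_l V_l / N_l <= eps^2 / 2."""
    costs = [hl**-gamma for hl in h]                   # cost per sample ~ h^-gamma
    s = sum(math.sqrt(v * c) for v, c in zip(V, costs))
    return [max(1, math.ceil(2 * eps**-2 * math.sqrt(v / c) * s))
            for v, c in zip(V, costs)]
```

With variances decaying like h_l^β for β > γ, most of the sampling budget lands on the coarsest levels, which is the source of the cost reduction in Theorem 1.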

Particle Filtering
This section will outline the standard particle filtering methodology. In this context, one is interested in computing statistics of a random process X_t, conditioned on observations of a single realisation of X, denoted X*, and referred to as the reference solution. The observations are random variables of the form

Y_{t_k} = H(X*_{t_k}) + φ,

at times t ∈ (t_1, ..., t_{N_y}), where H : R^d → R^y is an observation operator, and φ is a random variable representing measurement error. For simplicity, we choose φ ∼ N(0, R), where R is a y × y covariance matrix, and we will take H to be the identity (so y = d). We define X_{L,t_k} to be a numerical discretization of X_{t_k} with discretization accuracy parameter h_L. Our aim here is to sequentially approximate E_{η_{L,t_k}}[X_{L,t_k}], where η_{L,t_k} is now the posterior of X_{L,t_k} given the observations Y_{t_1,...,t_k}. Let p(y|x) be the likelihood function of y given x and q(x) be the prior of x. Then, for any k ∈ [1, N_y], using Importance Sampling [6], one can draw N i.i.d. samples from the empirical approximation of the prior of X_{L,t_k}, and define the normalised importance weight of sample i ∈ [1, N] to be

w̃_L(X^i_{L,t_k}) = w_L(X^i_{L,t_k}) / Σ_{j=1}^N w_L(X^j_{L,t_k}),  where  w_L(X^i_{L,t_k}) = p(Y_{t_k} | X^i_{L,t_k}) w̃_L(X^i_{L,t_{k-1}}).

Thus, the filter weights are defined iteratively, starting from w̃_L(X^i_{L,t_0}) = 1/N. As the observations are given by a Gaussian distribution, the likelihood is

p(y|x) ∝ exp( -(1/2) (y - x)^T R^{-1} (y - x) ).

Finally, an estimator for the expectation of X_{L,t_k} with respect to the posterior η_{L,t_k} is

X̄_{L,t_k} = Σ_{i=1}^N w̃_L(X^i_{L,t_k}) X^i_{L,t_k}.  (17)

This estimator, despite not being unbiased due to the normalised importance weights, is consistent with E_{η_{L,t_k}}[X_{L,t_k}]: as N → ∞, the estimator converges in probability to E_{η_{L,t_k}}[X_{L,t_k}].
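As a sketch of the weight update above, assuming the Gaussian likelihood and H = I (the log-space normalisation is a standard numerical safeguard, not part of the paper):

```python
import numpy as np

def update_weights(prev_w, particles, y, R_inv):
    """Sequential importance-weight update with a Gaussian likelihood, H = I:
    w_i proportional to prev_w_i * exp(-0.5 (y - x_i)^T R^-1 (y - x_i)),
    then normalise.  Shapes: particles (N, d), y (d,), R_inv (d, d)."""
    innov = y - particles                              # (N, d) innovations
    log_lik = -0.5 * np.einsum('nd,de,ne->n', innov, R_inv, innov)
    log_w = np.log(prev_w) + log_lik
    log_w -= log_w.max()                               # guard against underflow
    w = np.exp(log_w)
    return w / w.sum()
```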
Typically, importance weights become degenerate as k increases [3]. In this case, it is necessary to duplicate higher weighted particles whilst removing lower weighted particles; this is known as resampling. Resampling amounts to an unbiased transformation from the weighted ensemble {X^i_{l,t_k}, w̃_l(X^i_{l,t_k})}_{i=1,...,N} to an evenly weighted ensemble of resampled particles {X̃^i_{l,t_k}}_{i=1,...,N}. The scheme outlined above is known as the Sequential Importance Resampling (SIR) method; for more information see [6]. The Ensemble Transform Particle Filter, the subject of this paper, uses Optimal Transportation [19] to implement this transformation, which we describe next.

Ensemble Transform Particle Filters
ETPFs are a variant of linear ensemble transform filters (LETF) [16]. They present an alternative to the resampling step that takes place in the standard SIR methodology, replacing it with a linear transformation. The goal is to obtain a transformed set of evenly weighted particles, {X̃^i_{L,t_k}}_{i=1,...,N}, from the weighted set of particles {X^i_{L,t_k}}_{i=1,...,N}, with importance weights {w̃_L(X^i_{L,t_k})}_{i=1,...,N}, defining an empirical approximation to the posterior distribution η_{L,t_k}. This can be done with the linear transformation

X̃^j_{L,t_k} = Σ_{i=1}^N X^i_{L,t_k} P_{i,j},  (18)

for j = 1, ..., N, with non-negative entries P_{i,j} satisfying Σ_{i=1}^N P_{i,j} = 1. For the ETPF, one creates a coupling between the posterior η_{L,t_k} and X̃_{L,t_k}, denoted by the matrix T_{i,j}, of size N × N, with non-negative entries. The coupling defines the linear transformation matrix in (18) as P_{i,j} = N T_{i,j}. This coupling can be found by solving a linear transport problem, minimising the expected Euclidean distance between η_{L,t_k} and X̃_{L,t_k}, subject to the constraints

Σ_{j=1}^N T_{i,j} = w̃_L(X^i_{L,t_k}),   Σ_{i=1}^N T_{i,j} = 1/N,   T_{i,j} ≥ 0.

This is in fact equivalent to maximising the covariance between the two ensembles, since

E[ |X_{L,t_k} - X̃_{L,t_k}|² ] = E[ |X_{L,t_k}|² ] + E[ |X̃_{L,t_k}|² ] - 2 E[ ⟨X_{L,t_k}, X̃_{L,t_k}⟩ ],  (20)

and the two marginal terms are fixed by the constraints. In the univariate case, we define an optimal coupling matrix, T_{i,j}, as one which minimises the cost function

J(T) = Σ_{i,j=1}^N T_{i,j} |X^i_{L,t_k} - X^j_{L,t_k}|².

Theoretical analysis of the above transformation is given in [16]. Once the transformed particles in (18) are found, the posterior mean is now estimated by

X̄_{L,t_k} = (1/N) Σ_{j=1}^N X̃^j_{L,t_k} = (1/N) Σ_{j=1}^N Σ_{i=1}^N X^i_{L,t_k} N T_{i,j} = Σ_{i=1}^N w̃_L(X^i_{L,t_k}) X^i_{L,t_k}.  (23)

It is important to note that this linear transformation, which is deterministic, gives the same posterior-mean estimate as (17), and thus does not add the extra variance associated with random resampling; it is a consistent estimator of the posterior mean. In the univariate case, the matrix T_{i,j} can easily be found by an O(N) algorithm in [17]. This will become an important observation when discussing localisation in the next section.
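A minimal sketch of the one-dimensional solver idea: sort the particles, then sweep once through the sorted weights, filling N slots of mass 1/N each. This is an illustrative implementation of the monotone coupling, not the algorithm of [17] verbatim; apart from the initial sort the sweep is O(N), and the dense N × N matrix is built here only for clarity.

```python
import numpy as np

def univariate_etpf_transform(x, w):
    """Monotone coupling between the weighted ensemble (x_i, w_i) and N
    uniform-mass slots, followed by the deterministic transform
    x~_j = N * (T^T x)_j.  Returns the transformed particles and T."""
    N = len(x)
    order = np.argsort(x)            # monotone couplings are optimal in 1D
    xs, ws = x[order], w[order]
    T = np.zeros((N, N))
    i, mass = 0, ws[0]
    for j in range(N):               # fill each 1/N slot from the sorted mass
        need = 1.0 / N
        while need > 1e-12:
            take = min(need, mass)
            T[i, j] += take
            need -= take
            mass -= take
            if mass <= 1e-12:
                if i == N - 1:
                    break
                i += 1
                mass = ws[i]
    return N * T.T @ xs, T
```

The transform preserves the weighted mean exactly, which is the consistency property (17)/(23) rely on.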
The above constraints lead to a maximum of 2N - 1 non-zero elements in T_{i,j}, so the coupling matrix is very sparse, and the entire ensemble transformation can be carried out at an O(N) computational cost. This allows the ensemble transform to be performed at every assimilation step without its computational expense being of a higher order of magnitude than the propagation of the particles between assimilation steps; this observation is analysed further in the next section.
In the multivariate case, the same linear transport problem prevails; however, one is now required to minimise the cost function

J(T) = Σ_{i,j=1}^N T_{i,j} ||X^i_{L,t_k} - X^j_{L,t_k}||²,  (24)

whose minimum defines the Wasserstein distance between η_{L,t_k} and X̃_{L,t_k}. This cannot be solved at a computational cost of O(N) by the algorithm in [17]. Instead, one has to use more expensive direct solver algorithms, which typically have O(N^3) or O(N^3 log N) computational cost, such as the FastEMD algorithm [13]. For many systems this optimal transportation cost will therefore dominate the model costs, making the scheme inefficient. The model costs of the particle filter are defined to be the computational cost needed to propagate the N particles through the system in between assimilation steps, and are given in (10) (determined by a constant γ). Thankfully, a technique called localisation can alleviate this problem, and it also provides a pivotal change to the scheme when applying it to high dimensional systems.

Localisation
Localisation, a scheme frequently used in the EnKF for high dimensional systems, can also be applied to the ETPF [4].
In the simplest form, localisation applied to the ETPF reduces the computational cost of designing a multivariate coupling to d times the cost of designing a univariate coupling. Localisation allows one to construct an individual transformation of the form (18) for each of the d components of a multivariate X_{L,t_k}. A simple definition for the localisation matrix C [4], describing the spatial correlation structure of the ensemble {X^i_{L,t_k}}_{i=1,...,N}, is a taper whose entries C_{m,n} depend only on the periodic distance s_{m,n} and vanish when s_{m,n} exceeds a localisation radius r_loc,c. Here m, n = 1, ..., d are the indices of the spatial components of X_{L,t_k}, and s_{m,n} = min( |m - n - d|, |m - n|, |m - n + d| ), so that C_{m,n} takes possible spatial periodicity into account. One can now decompose the linear transport problem in (24) into d separate linear transport problems, finding a coupling matrix T_{i,j}(m) for each component m. The objective of each of these linear transport problems is to minimise the localised cost function

J_m(T(m)) = Σ_{i,j=1}^N T_{i,j}(m) Σ_{n=1}^d C_{m,n} |X^i_{L,t_k}(n) - X^j_{L,t_k}(n)|²,

subject to the same form of constraints as in the ensemble transform coupling, with the weights localised to component m as described below. Here, X^i_{L,t_k}(m) is the m-th component of X^i_{L,t_k}. Then, one can define the approximation of the marginal posterior mean for each m = 1, ..., d as

X̄_{L,t_k}(m) = (1/N) Σ_{j=1}^N X̃^j_{L,t_k}(m),

where the transformed components are given by

X̃^j_{L,t_k}(m) = Σ_{i=1}^N X^i_{L,t_k}(m) P_{i,j}(m),  (28)

and P_{i,j}(m) = N T_{i,j}(m). When r_loc,c = 0, the most computationally efficient scenario, one has the interesting case where d univariate linear transport problems need to be solved, transforming all components individually. One can simply use the univariate O(N) algorithm in [17], mentioned in the last section, for each linear transport problem, to get an overall O(dN) computational cost. In practice, when r_loc,c = 0, one can also reorder each of the transformed sets of components into the rank structure of the original ensemble. This preserves the copula structure [18] of the original ensemble.
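A sketch of such a taper matrix, using the periodic distance s_{m,n} and a simple linear cut-off; the linear taper shape is an illustrative choice, not necessarily the one used in [4].

```python
import numpy as np

def localisation_matrix(d, r_loc):
    """Taper C[m, n] depending only on the periodic component distance
    s = min(|m-n-d|, |m-n|, |m-n+d|), cut off beyond the radius r_loc."""
    idx = np.arange(d)
    diff = idx[:, None] - idx[None, :]
    s = np.minimum(np.abs(diff), np.minimum(np.abs(diff - d), np.abs(diff + d)))
    if r_loc == 0:
        return np.eye(d)              # r_loc = 0: components fully decoupled
    return np.clip(1.0 - s / (r_loc + 1.0), 0.0, None)
```

The r_loc = 0 branch corresponds to the case where the d transport problems decouple entirely into univariate ones.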
If the model costs of a system are less than those of the multivariate optimal transportation, r_loc,c = 0 is the only case in which the model costs can return to being the dominant cost in the ETPF estimator. Despite this, it is very reasonable to expect the model cost to dominate that of the optimal transportation in some systems, especially for high dimensional partial differential equations (i.e. where γ is high). Localisation does affect the performance of the ETPF by adding inconsistency into the posterior mean approximation, stemming from the fact that one is generating deterministic couplings T_{i,j}(m) that minimise different (simplified) cost functions to the full multivariate one in (24). Despite this, numerical experiments conducted in [4] find the localised ETPF to be effective even in the chaotic, highly nonlinear Lorenz-63 equations. Localisation is also needed in the likelihood evaluation of multivariate particles in the ETPF. Although only briefly covered here, as it is not critical to the aim of this paper, it is essential for the ETPF to be able to successfully filter high dimensional systems, due to the curse of dimensionality. Standard sequential Monte Carlo (SMC) methods fail to track high dimensional systems due to exponentially degenerate importance weights. While there have only been some suggested solutions to this problem for SMC methods, such as in [15], one can alter the above localisation scheme in the ETPF to solve this problem swiftly [4]. It is also needed to reduce the computational cost of likelihood evaluations when the dimension of the state space is greater than the sample size (d > N). For each component m = 1, ..., d of the particles {X^i_{L,t_k}}_{i=1,...,N}, one generates a separate importance weight given by

w_L(X^i_{L,t_k})(m) ∝ exp( -(1/2) (Y_{t_k} - X^i_{L,t_k})^T C̃_m R^{-1} (Y_{t_k} - X^i_{L,t_k}) ),  (31)

for i = 1, ..., N, where C̃_m is a diagonal d × d localisation matrix whose entries vanish for components at periodic distance greater than a radius r_loc,R from m; the value of r_loc,R can be chosen independently of r_loc,c.
These weights are then used in the constraints of the linear transport problems for each individual component transformation in (28). The two 'radii' of localisation, r_loc,c and r_loc,R, will henceforth be referred to as the particular settings of localisation used.

Multilevel Ensemble Transform Particle Filter (MLETPF)
The proposed multilevel ETPF framework is presented in this section. It creates an estimator consistent with the standard ETPF estimator in (17), for the same discretization accuracy level, L. The term 'single level' estimator will henceforth refer to the corresponding standard ETPF estimator, conditioned on the same observations, with the same discretization level L and variance as the proposed MLETPF estimator. The general premise of the MLETPF is to run L + 1 independent ETPF estimators, with N_l samples, forward in time (and space), in the coupled multilevel framework. We assume here that h_0 ≤ ∆t, where ∆t = t_{k+1} - t_k for all k ∈ [1, N_y - 1], so that all of the L + 1 estimators are conditioned on the same observations. This does mean that one has to set a bound on the frequency of the data assimilation, given the time-step h_0 of the coarsest level. When updating the weights of each particle in each estimator, the same method as in the ETPF holds for each of the L + 1 estimators. Thus, we define the MLETPF estimator of E_{η_{L,t_k}}[X_{L,t_k}] as

X̄^{ML}_{L,t_k} = Σ_{l=0}^L [ Σ_{i=1}^{N_l} w̃_l(X^i_{l,t_k}) X^i_{l,t_k} - Σ_{i=1}^{N_l} w̃_{l-1}(X^i_{l-1,t_k}) X^i_{l-1,t_k} ],  (33)

where the l = 0 term contains only the first sum, and the importance weights w̃_l(X^i_{l,t_k}) target η_{l,t_k}, the posterior of each d-dimensional discretization X_{l,t_k}, given the observations Y_{t_1,...,t_k}, k ∈ [1, N_y]. One notes that as the standard ETPF estimator for each level of discretization, l ≥ 0, is consistent with E_{η_{l,t_k}}[X_{l,t_k}], the above estimator is consistent with E_{η_{L,t_k}}[X_{L,t_k}], given the linearity of expectation shown in (3). Here, each of the particles from the fine and coarse ensembles in each of the multilevel difference estimators is positively correlated in between assimilation steps, as in the standard MLMC method. This correlation is required for the variance of each difference estimator to decay as l → ∞, as discussed in the opening section.
However, in the ETPF context, when one comes to transform the fine and coarse ensembles in each multilevel difference estimator, the two ensembles cannot be transformed independently of one another; a positive correlation needs to be imparted between them, ready for the next phase of particle propagation, especially if the transformations happen frequently. If the random input to the system is simply a random initial condition, in a system with no stochastic forcing, the particles from the fine and coarse ensembles will diverge instantly if they are not positively correlated after the ensemble transformations. In this paper, this positive correlation is achieved using a multilevel coupling step after the standard ensemble transform stage. This requires one to first carry out the ensemble transform (18) on the coarse and fine ensembles, {X^i_{l-1,t_k}}_{i=1,...,N_l} and {X^i_{l,t_k}}_{i=1,...,N_l} respectively, to obtain evenly weighted particles, {X̃^i_{l-1,t_k}}_{i=1,...,N_l} and {X̃^i_{l,t_k}}_{i=1,...,N_l}. If localisation is needed, one can implement it, with the required parameters, on both ensemble transforms as in the last section. It is very important that the same localisation settings are used on all estimators, so that the overall MLETPF estimator is consistent with the single level ETPF estimator with the same localisation settings. The key point is that the fine and coarse ensembles from each discretization level will then have the same systematic localisation bias as one another. This means, as with the discretization bias, that the localisation biases cancel each other out in the telescoping sum of estimators (33), leaving only the systematic localisation bias of the finest level, equal to that of the localised single level estimator. At this point, one notes that (33) becomes

X̄^{ML}_{L,t_k} = Σ_{l=0}^L (1/N_l) [ Σ_{i=1}^{N_l} X̃^i_{l,t_k} - Σ_{i=1}^{N_l} X̃^i_{l-1,t_k} ],  (34)

where again the l = 0 term contains only the first sum. Now one needs to positively couple the fine and coarse ensembles of transformed particles in each difference estimator above.
We propose to build another coupling between X̃_{l,t_k} and X̃_{l-1,t_k}, denoted by T^{F/C}_{i,j}, that minimises the cost function

J(T^{F/C}) = Σ_{i,j=1}^{N_l} T^{F/C}_{i,j} || X̃^i_{l,t_k} - X̃^j_{l-1,t_k} ||²,  (35)

with constraints

Σ_{j=1}^{N_l} T^{F/C}_{i,j} = 1/N_l,   Σ_{i=1}^{N_l} T^{F/C}_{i,j} = 1/N_l,   T^{F/C}_{i,j} ≥ 0.  (36)

This is an assignment problem, and in the multivariate case it can be solved by the Hungarian algorithm [12] with a computational cost comparable to the multivariate linear transport algorithms discussed previously, and so of the same order of magnitude as the corresponding ensemble transform stage in the standard ETPF method. In the univariate case, one can simply use the cheap algorithm in [17], exactly as in the ensemble transform stage. One notes that the above assignment problem returns a coupling with a single non-zero element, equal to 1/N_l, in each row and column, so that particles are simply reordered and not transformed; each ensemble therefore contains exactly the same particles after the coupling. The reordering can be seen as applying the permutation matrix P^{F/C}_{i,j} = N_l T^{F/C}_{i,j} to the coarse ensemble. Using a calculation similar to (23), one can check that this reordering is consistent with the corresponding term in (33): if T^F_{i,j} and T^C_{i,j} (i, j = 1, ..., N_l) are the coupling matrices used for the ensemble transform on the fine and coarse ensembles respectively, then reordering the coarse ensemble by the permutation P^{F/C} leaves its sample mean, and hence each term of (34), unchanged. The estimator in (34) can therefore be written as a sum of independent difference estimators over the coupled pairs, with level variances

V_l = Var[ X̃_{l,t_k} ] + Var[ X̃_{l-1,t_k} ] - 2 Cov( X̃_{l,t_k}, X̃_{l-1,t_k} ),

so that, since each difference estimator is independent of the others, the overall variance of (34) is Σ_{l=0}^L V_l / N_l. The above estimator is consistent with the single level (localised, with the same settings as used in the MLETPF) ETPF estimator; therefore, in the absence of localisation, it is also consistent with E_{η_{L,t_k}}[X_{L,t_k}].
The multilevel coupling T^{F/C}_{i,j} minimises the expected distance between the two transformed ensembles, which maximises the covariance between them via (20), and thus minimises V_l in each multilevel difference estimator; we also ensure that pairs of particles in the coarse and fine ensembles are positively coupled in between assimilation steps (by using the same random input). The aim is to make V_l decrease at the asymptotic rate O(h_l^β) (β > 0) required for the variance reduction of the multilevel framework, since the fine and coarse ensembles are then coupled both in between assimilation steps and during them. This will be demonstrated in numerical experiments later in the paper. Designing this coupling between the transformed ensembles X̃_{l-1,t_k} and X̃_{l,t_k} is the key to the proposed MLETPF method: it enforces correlation between the transformed ensembles whilst remaining consistent with the single level ETPF estimator. Most importantly, it also suits the ETPF method, since the coupling can be generated simply and cheaply under the r_loc,c = 0 localisation that can be used freely in the ETPF, as will be shown later.
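In the univariate case the assignment problem reduces to pairing the two transformed ensembles by rank; a sketch of this is:

```python
import numpy as np

def multilevel_couple_1d(fine, coarse):
    """Pair the two evenly weighted transformed ensembles by rank: the coarse
    particles are reordered (a permutation, i.e. an assignment) so that the
    j-th smallest coarse particle partners the j-th smallest fine particle."""
    perm = np.argsort(np.argsort(fine))     # rank of each fine particle
    return fine, np.sort(coarse)[perm]
```

Rank pairing is the monotone, and hence optimal, coupling in one dimension; it can only increase the fine/coarse covariance relative to an arbitrary pairing.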

Algorithm
In this section, we present an algorithm to implement the MLETPF in practice. In this paper, a pre-defined recurrence relation for the decay of N_l with increasing l will be set (N_{l+1} = f(N_l)), and these sample sizes will be kept fixed throughout the filtering process. The finest discretization level L > 0 will be kept arbitrary for now. The algorithm is now presented.

1. At time t_0, for each level l = 0, ..., L, draw N_l coupled pairs of samples {X^i_{l,t_0}, X^i_{l-1,t_0}}, i = 1, ..., N_l, from π_0, and set all importance weights to 1/N_l.
2. At each assimilation time t_k, calculate the importance weights of the fine and coarse ensembles in each estimator, carry out the ensemble transform (18) on each ensemble (with localisation if required), apply the multilevel coupling (35)-(36) to re-establish correlation between the transformed fine and coarse pairs, and evaluate the MLETPF estimator (34).
3. Propagate all samples forward according to the system dynamics until time t_{k+1}. If l > 0, the fine and coarse pairs of samples in each estimator must be coupled by using the same random input. Return to step 2.
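Putting the pieces together, here is a compact sketch of one assimilation cycle for a single difference estimator, using a univariate double-well model; the model, its parameters, and the inlined one-dimensional transform are illustrative stand-ins, not the paper's exact implementation.

```python
import numpy as np

def transform_1d(x, w):
    """Deterministic 1D ETPF transform via the monotone coupling:
    slot j receives mass 1/N from the sorted weighted particles."""
    N = len(x)
    order = np.argsort(x)
    xs, ws = x[order], w[order]
    out = np.zeros(N)
    i, mass = 0, ws[0]
    for j in range(N):
        need, acc = 1.0 / N, 0.0
        while need > 1e-12:
            take = min(need, mass)
            acc += take * xs[i]
            need -= take
            mass -= take
            if mass <= 1e-12:
                if i == N - 1:
                    break
                i += 1
                mass = ws[i]
        out[j] = N * acc
    return out

def mletpf_cycle(ens_f, ens_c, y, R, h, xi, n_steps, rng):
    """One cycle for a single difference estimator of the toy SDE
    dX = (X - X^3) dt + xi dW: coupled propagation (shared increments),
    Gaussian weighting, per-ensemble transform, rank-based coupling."""
    dW_prev = None
    for k in range(n_steps):                    # n_steps fine steps of size h
        dW = rng.normal(0.0, np.sqrt(h), len(ens_f))
        ens_f = ens_f + (ens_f - ens_f**3) * h + xi * dW
        if k % 2 == 1:                          # coarse step reuses 2 increments
            ens_c = ens_c + (ens_c - ens_c**3) * 2 * h + xi * (dW + dW_prev)
        dW_prev = dW
    w_f = np.exp(-0.5 * (y - ens_f)**2 / R)
    w_c = np.exp(-0.5 * (y - ens_c)**2 / R)
    tf = transform_1d(ens_f, w_f / w_f.sum())
    tc = transform_1d(ens_c, w_c / w_c.sum())
    perm = np.argsort(np.argsort(tf))           # multilevel coupling by rank
    return tf, np.sort(tc)[perm]
```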

Computational Cost of the MLETPF and ETPF
As previously noted, in the multidimensional case it is computationally expensive to generate the coupling matrices needed to couple the fine and coarse transformed particles. Localisation is used to reduce the computational cost of the ensemble transforms down to O(dN) when r_loc,c = 0 in the standard and multilevel ETPF methods, and it can also reduce the computational cost of generating the multilevel coupling T^{F/C}_{i,j}. When localisation is used with r_loc,c = 0, one can break the multivariate coupling T^{F/C} into d univariate couplings T^{F/C}(m), one for each component m = 1, ..., d, each minimising the univariate analogue of the cost function in (35) under the same constraints as in (36). Each of these couplings can again be found using the cheap O(N) univariate algorithm in [17]. One can then reorder/transform each component of the fine and coarse ensembles separately, using the same methodology as in the last section. This performs a similar role to resampling N_l particles from F^{-1}_{f,m}(u) and F^{-1}_{c,m}(u), where F^{-1}_{f,m} and F^{-1}_{c,m} are the marginal (empirical) inverse cumulative distribution functions of X̃_{l,t_k}(m) and X̃_{l-1,t_k}(m) respectively, and the same uniform variate u ∈ [0, 1] is used for each pair of the N_l samples. If localisation is carried out with r_loc,c > 0, the computational cost of the optimal transport in the standard ETPF, and thus of the multilevel coupling minimising the full multivariate cost function in (35) in the MLETPF, rises to around O(N^3). In this scenario, the model costs are likely to be dwarfed by these optimal transportation costs, which are fixed by definition, and there is no justification for implementing the multilevel framework, as it only aims to reduce model cost and not the optimal transportation expense. However, to consider the case where model cost does dominate that of the multivariate optimal transportation, the full multivariate coupling will be demonstrated in the numerical experiments at the conclusion of this paper.
This paper now considers the overall computational cost of the ETPF and MLETPF estimators when localised, with r_loc,c = 0. In this case, as explained above, one can reduce the computational expense of not only the ensemble transform stage, but also the multilevel coupling stage, in the MLETPF scheme, from a potential O(N_l^3) to O(dN_l) for each multilevel difference estimator. This is enough to guarantee that the model computational cost bounds for the standard MLMC method in Theorem 1 are of the same order of magnitude as those of the entire MLETPF, including the ensemble transform and coupling stages, since one can simply 'hide' the optimal transportation costs behind the particle propagation costs. This is stated in the following proposition.

Proposition 1. If ∆t ≥ h_0 is constant, and one can bound the computational cost of all N_y ensemble transform / multilevel coupling stages of the MLETPF by

C_ET ≤ c_1 N_y d Σ_{l=0}^L N_l,

where c_1 is a positive constant, then the total computational cost of the MLETPF is bounded by

C_MLETPF ≤ c_2 Σ_{l=0}^L N_l C_{l,d},

where C_{l,d} is the cost of propagating one particle on level l (dependent on d) and c_2 is a positive constant. Therefore C_MLETPF is of an equivalent order of magnitude to the model cost of the filter, and also to the computational cost bounds for the standard MLMC method given in Theorem 1, for varying β, γ > 0 and α ≥ (1/2) min(γ, β).

Proof. Let the total computational cost of the MLETPF be given by the sum of the model cost and the ensemble transform cost (including the localised likelihood evaluation in (31), which scales as O(N_l) due to the sparse diagonal d × d matrix C̃_m, with r_loc,R assumed constant, r_loc,R = O(1) << d). Bounding C_ET as in the claim and using the model cost in (9),

C_MLETPF ≤ c_3 Σ_{l=0}^L N_l C_{l,d} + c_4 d Σ_{l=0}^L N_l ≤ c_2 Σ_{l=0}^L N_l C_{l,d},

where c_4 = c_1 t_{N_y} / ∆t absorbs the (fixed) number of assimilation steps, c_3 is a positive constant, and we assume that C_{l,d} grows at least linearly with d, so that the second term is dominated by the first.
The same reasoning applies to the single level ETPF estimator, in which the ensemble transform computational cost can be constrained to O(dN), where N is the sample size of the estimator. One can then bound the overall computational cost of the single level estimator by the model cost, O(N C_d), where C_d is the cost of propagating one single particle, depends on γ, and grows at least linearly with d. Thus the computational cost reductions over the single level ETPF given by the MLMC framework in Theorem 1 extend to the MLETPF estimator.

Numerical Examples
Numerical examples of the MLETPF method, applied to classical data assimilation problems, are given in this section. Three problems will be studied: the multivariate, chaotic stochastic Lorenz-63 equations; the univariate but nonlinear double-well OU process; and the high dimensional stochastic Lorenz-96 equations. The algorithm above will be used to generate experimental MLETPF estimators (compared against the single level ETPF estimators) for the latter two of the three problems, at varying pre-defined levels of accuracy. This pre-defined level of accuracy, O(ε), determines L (the finest level, and hence the overall numerical discretization bias), the fixed sample sizes N_l for each estimator in the MLETPF method, and the fixed overall sample size in the corresponding single level ETPF estimator required to achieve this order of magnitude of error in both estimators. This follows from the standard Monte Carlo approximation error decomposition given by the Central Limit Theorem, as in [8]. One can then compare the computational cost, given as the number of operations, for both the single level and multilevel estimators, which should be in line with Theorem 1 given the same pre-defined error. The error is not bounded exactly, due to variations in the variance at each assimilation step, but a proof of concept from a practitioner's viewpoint can be established. The error in the estimators will be estimated by the time-averaged root mean square error (RMSE),

RMSE = ( (1/N_y) Σ_{k=1}^{N_y} || X̄_{t_k} - E_{η_{t_k}}[X_{t_k}] ||² )^{1/2},

where X̄_{t_k} denotes the estimator under study and η_{t_k} is the posterior distribution of X_{t_k} given the observations Y_{t_1,...,t_k}. An approximation of E_{η_{t_k}}[X_{t_k}] will be used in the RMSE calculations above, given by a standard ETPF estimator whose numerical discretization bias and sample size produce an error orders of magnitude smaller than any ε used in the following experiments.
Where localisation is used in the single level and multilevel estimators for the problems above, the estimators are inconsistent with E_{η_{L,t_k}}[X_{L,t_k}], but crucially consistent with each other, as the same localisation settings are used. The ETPF approximation of E_{η_{t_k}}[X_{t_k}] will use the same localisation settings, to compare the ETPF and MLETPF estimators like for like. The recurrence relation for N_l, l ≥ 1, given N_0 (dependent on ε), used in both numerical experiments will be set to N_{l+1} = N_l M^{-3/2}; the optimality of this choice depends on the relative values of β and γ [8], and it is only optimal for β > γ with (β + γ) = 3, which is the case in the numerical examples below. One notes that for an RMSE of O(ε), one requires N_0 = O(ε^{-2}) for the MLETPF, and N = O(ε^{-2}) for the single level ETPF. Also, the discretization bias of the multilevel and single level ETPF estimators, away from E_{η_{t_k}}[X_{t_k}], should be O(ε). Thus, for a numerical scheme with a global discretization bias of O(t_{N_y} h_L^α), one requires h_L^α = O(ε), and hence L ∝ log(ε^{-1}), for the sum over components of the biases of the multilevel and single level ETPF estimators to be O(ε). One could also use the maximum among all components' biases as a suitable measure here. The sample covariances of the independent ETPF estimators will also be measured by a sum among all components: Tr(V_l), the trace of the covariance matrix. The computational cost for these numerical examples is defined as the theoretical number of operations needed to compute each approximation, including the ensemble transform stage and the multilevel coupling for the MLETPF. This is computed simply by inserting a step-counter into the numerical implementation (functions etc.) in the Python code used for these experiments. Finally, M = 2 is used for each numerical example.
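A sketch of how ε determines the experimental parameters, assuming h_l = 2^{-4-l} and α = 1 as in the double-well example below; the default h_0 and the exact rounding rules are assumptions for illustration.

```python
import math

def mletpf_parameters(eps, M=2, h0=2**-4, alpha=1.0):
    """Pick the finest level L so the discretization bias h_L^alpha is O(eps),
    set N_0 = eps^-2, and apply the fixed recurrence N_{l+1} = N_l * M^(-3/2)."""
    L = max(1, math.ceil(math.log(h0 / eps**(1.0 / alpha)) / math.log(M)))
    N = [round(eps**-2)]
    for _ in range(L):
        N.append(max(2, math.ceil(N[-1] * M**-1.5)))
    return L, N
```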

Stochastic Lorenz-63 Equations
This simple 3-component chaotic nonlinear system in X = (x, y, z),

dx = σ(y − x) dt + φ dW,
dy = (x(ρ − z) − y) dt + φ dW,
dz = (xy − βz) dt + φ dW,

with ρ = 28, σ = 10, β = 8/3 and φ = 0.4, will be used to demonstrate the effect of the multivariate multilevel coupling, T^{F/C}_{i,j}, without localisation and thus with the cost function in (35). Here, the Brownian motion W is the same for each component, to retain the strong nonlinearity in the equations. Computational cost against accuracy comparisons with the standard ETPF method will not be investigated here: the model cost of this test problem is so low that the cost of the multilevel coupling and/or the ensemble transform stage dominates in both the ETPF and the MLETPF, making the model cost reductions of the multilevel framework unnoticeable. The MLETPF estimator with N_0 = 500 and L = 5, using the full multivariate cost function in (35) to find the multilevel couplings T^{F/C}_{i,j} in the aforementioned algorithm, is used to compute an approximation to E_{η_{L,t_k}}[X_{L,t_k}], with X as above, for k ∈ [1,128], with h_0 = M^{-7} = ∆t (for stability), and thus t_{N_y} = 1. Here, X_L is the solution to the above Lorenz-63 equations using the forward Euler numerical scheme. The observations are subject to a measurement error with R = 6I, where I is the 3 × 3 identity matrix, and weights are thus based on observations of all components x, y and z. Figure 1 shows the mean estimates of Tr(V_l) (l ∈ [1,5]) over all assimilation steps k ∈ [1, N_y]. The asymptotic decay of these estimates shows, importantly, that the multilevel coupling with the multivariate cost function successfully produces the variance decay Tr(V_l) = O(h_l^β), in this case with β ≈ 1 > 0.
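For concreteness, a forward Euler step of this system with the shared scalar Brownian increment might look as follows. This is a sketch, not the paper's implementation; the function name and interface are assumed.

```python
import numpy as np

def lorenz63_step(X, h, dW, rho=28.0, sigma=10.0, beta=8.0 / 3.0, phi=0.4):
    """One forward-Euler step of the stochastic Lorenz-63 system; the
    scalar increment dW is shared by all three components, as in the text."""
    x, y, z = X
    drift = np.array([sigma * (y - x),
                      x * (rho - z) - y,
                      x * y - beta * z])
    return X + h * drift + phi * dW  # dW broadcasts to all components

rng = np.random.default_rng(0)
h = 2.0 ** -7  # h_0 = M**-7 with M = 2
X = np.array([1.0, 1.0, 1.0])
for _ in range(128):  # t_{N_y} = 1 with Delta t = h_0
    X = lorenz63_step(X, h, np.sqrt(h) * rng.standard_normal())
```

On the coarse level of a coupled pair, the same step would be taken with step size Mh, consuming the sum of M consecutive fine increments.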

Double-well OU Process
This is a univariate test problem that demonstrates the cost effective, consistent MLETPF estimator of E_{η_{L,t_k}}[X_{L,t_k}], where X_{L,t_k} is a numerically discretized solution to the double-well, nonlinear OU process

dX_t = −V'(X_t) dt + ξ dW_t,

for k ∈ [1, N_y], where W_t is a standard Brownian motion and V(X) = (1/4)X^4 − (1/2)X^2. This example uses h_l = 2^{-4-l}, though this choice is arbitrary. The stochastic forcing is set to ξ = 0.5. The observations and assimilation times are given by R = 0.6 and t_{N_y} = 50, where ∆t = h_0 and so N_y = 800. The numerical discretizations of X_{t_k} are computed by the Euler-Maruyama method. A very accurate simulation (N_0 = 10000, L = 7) of the MLETPF estimator is run to demonstrate the mean asymptotic decay of V_l and X_{l,t_k} (l ∈ [1,7]) over all assimilation steps. These are shown in Figure 2. The values α = 1, β = 2 are as expected, given the Euler-Maruyama global discretization bias of O(h_l) and the additive noise in the OU process contributing to the variance. Figure 4 shows the computational cost against the accuracy (RMSE) for the MLETPF and single level ETPF estimators over varying values of ε. Here one sets N_0 = ε^{-2} for the MLETPF and N = ε^{-2} for the single level ETPF estimator. One can clearly see the expected orders of growth of the computational cost for the standard ETPF (O(ε^{-3}), as γ = 1 and α = 1) and the MLETPF (O(ε^{-2}), given that γ < β), as established in Theorem 1 for the pre-defined RMSE of O(ε).
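The variance decay β = 2 relies on the fine and coarse paths sharing Brownian increments during propagation. A minimal sketch of this coupled Euler-Maruyama propagation for the double-well process is given below (a hypothetical helper, not the paper's code; the coarse step consumes the sum of the fine increments):

```python
import numpy as np

def coupled_doublewell(N, l, h0=2.0 ** -4, M=2, xi=0.5, T=1.0, seed=0):
    """Propagate N coupled (fine, coarse) particle pairs of the double-well
    process dX = -(X**3 - X) dt + xi dW; the coarse level reuses the
    summed fine Brownian increments, keeping the pair correlated."""
    rng = np.random.default_rng(seed)
    hf = h0 * float(M) ** -l            # fine step, level l
    Xf = np.zeros(N)
    Xc = np.zeros(N)
    for _ in range(int(round(T / (hf * M)))):   # one coarse step per loop
        dW = np.sqrt(hf) * rng.standard_normal((M, N))
        for m in range(M):                       # M fine sub-steps
            Xf += -(Xf ** 3 - Xf) * hf + xi * dW[m]
        Xc += -(Xc ** 3 - Xc) * (hf * M) + xi * dW.sum(axis=0)
    return Xf, Xc

Xf, Xc = coupled_doublewell(1000, l=3)
```

Because the driving noise is identical, the sample variance of Xf − Xc is orders of magnitude smaller than the variance of either ensemble, which is precisely what the multilevel coupling must preserve after each resampling step.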

The Stochastic Lorenz-96 Equations
The final numerical test for the MLETPF method in this paper is the high dimensional (d = 40 in this case) stochastic Lorenz-96 equations, given by

dX_j = ((X_{j+1} − X_{j−2}) X_{j−1} − X_j + F) dt + σ dW_j,

with j ∈ [1, N_x], N_x = 40, N_x ∆x = 10 and thus ∆x = 0.25; the dW_j/dt are 40 i.i.d. Brownian motions. Here, F is a constant forcing (F = 8) and σ^2 = 0.4. Periodic boundary conditions are used, so that X(−1) = X(N_x). The observations are subject to a measurement error of R = 6I, where I is the 40 × 40 identity matrix, and the assimilation times are set by t_{N_y} = 100 with ∆t = h_0, meaning N_y = 1600 as h_l = 2^{-4-l} (again chosen arbitrarily). The Euler-Maruyama method is once again used to find X_{l,t_k}. In this numerical example, the ETPF and MLETPF estimators use r_{loc,R} = 1. The localisation setting r_{loc,c} = 0 is used for both the multilevel and single level ETPF estimators here, because the model cost, C_{l,d}, is simply equal to h_l^{-1} d, and thus much lower than the high cost of optimal transportation in multiple dimensions. Once again, a very accurate simulation (N_0 = 1000, L = 10) of the MLETPF was generated to demonstrate the mean asymptotic decays of ∑_{i=1}^{40} X_{l,t_k}(i) and Tr(V_l) (l ∈ [1, 10]) over all assimilation steps; these are shown in Figure 3. They follow the same expected values, α = 1 and β = 2, as in the last example.
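A sketch of one Euler-Maruyama step of this system, with the periodic indexing handled by np.roll, might read as follows (illustrative only, not the paper's implementation):

```python
import numpy as np

def lorenz96_drift(X, F=8.0):
    """Drift (X_{j+1} - X_{j-2}) X_{j-1} - X_j + F with periodic boundaries."""
    return (np.roll(X, -1) - np.roll(X, 2)) * np.roll(X, 1) - X + F

def lorenz96_em_step(X, h, dW, sigma2=0.4):
    """One Euler-Maruyama step driven by 40 i.i.d. Brownian increments dW."""
    return X + h * lorenz96_drift(X) + np.sqrt(sigma2) * dW

rng = np.random.default_rng(1)
h = 2.0 ** -7
X = 8.0 * np.ones(40)
X[0] += 0.01  # perturb off the unstable fixed point X_j = F
for _ in range(400):
    X = lorenz96_em_step(X, h, np.sqrt(h) * rng.standard_normal(40))
```

The same vectorised drift serves both levels of a coupled pair; only the step size and the (shared, summed) Brownian increments differ.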
Next, the stability of the MLETPF is considered. Since N_l is fixed and does not change to bound the error over time, one can compare the errors from the reference solution, X_{t_k}, with the observational errors, to study the stability of the MLETPF estimator over time and check that the errors do not increase. One expects that, for a successful particle filter, the errors from the estimator should be less than the observational errors and remain stable. Figure 6 shows this expected behaviour: it plots the cumulative time-averaged RMSE values from X_{t_k} of both the observations and the MLETPF estimator (using arbitrary values of N_0 = 1000 and L = 10). For k ∈ [1, N_y], these are defined to be

\sqrt{ \frac{1}{k} \sum_{j=1}^{k} \| Y_{t_j} − X_{t_j} \|^2 }

for the observations and

\sqrt{ \frac{1}{k} \sum_{j=1}^{k} \| \hat{X}_{L,t_j} − X_{t_j} \|^2 }

for the MLETPF estimator \hat{X}_{L,t_j}. To demonstrate the stability of the variance of the particle filter, cumulative time-averaged RMSE values for the second moments of \hat{X}_{L,t_k} and Y_{t_k} are shown in Figure 7. Finally, to compare the MLETPF with its standard counterpart for this set of equations, Figure 5 shows the computational cost against the accuracy (RMSE) for the standard ETPF and MLETPF estimators over varying values of ε. Once again, one sets N_0 = ε^{-2} for the MLETPF and N = ε^{-2} for the single level ETPF estimator. This reproduces the cost reductions achieved in the last example, as given in Theorem 1.
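The cumulative time-averaged RMSE diagnostic above can be computed with a short helper; this is a sketch and the function name is assumed:

```python
import numpy as np

def cumulative_rmse(est, ref):
    """Cumulative time-averaged RMSE: at step k, the root of the mean of
    the squared-norm errors over assimilation steps 1..k."""
    sq = np.sum((np.asarray(est) - np.asarray(ref)) ** 2, axis=-1)
    return np.sqrt(np.cumsum(sq) / np.arange(1, len(sq) + 1))

# Applied once with est = observations Y and once with est = the MLETPF
# estimates, each of shape (N_y, d), against the reference trajectory.
```

Plotting both curves on one axis gives exactly the stability comparison described: the estimator curve should settle below the observation curve and stay flat.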

Summary and outlook
This paper has demonstrated a proof of concept for the application of MLMC to nonlinear filtering. The Ensemble Transform Particle Filter (ETPF), coupled with localisation, allows one to simply and cheaply carry out a multilevel coupling between the fine and coarse ensembles in each independent Monte Carlo estimator in the MLMC framework. The coupling is designed to minimise the Wasserstein distance between the distributions of these transformed ensembles (in the standard ETPF methodology), as originally suggested in [7]. It has been shown through numerical experiments that one can restore positive correlation between fine and coarse ensembles that would otherwise be lost if they were transformed independently of one another. This in turn satisfies the necessary constraints on the sample variance of each independent multilevel estimator, allowing the proposed MLETPF method to reduce the computational cost of the propagation of particles in the localised ETPF method.
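In the univariate (or r_{loc,c} = 0, componentwise) setting, this Wasserstein-minimising coupling of two equally weighted ensembles reduces to pairing order statistics, which can be sketched as follows. This is illustrative only: the true MLETPF coupling acts on the transformed, weighted ensembles, and the function shown is hypothetical.

```python
import numpy as np

def sorted_coupling(fine, coarse):
    """Reorder the coarse ensemble so that each fine particle is paired
    with the coarse particle of the same rank: the 1D Wasserstein-optimal
    coupling of two uniform empirical measures. Marginals are unchanged."""
    ranks = np.argsort(np.argsort(fine))  # rank of each fine particle
    return np.sort(coarse)[ranks]

rng = np.random.default_rng(2)
f = rng.standard_normal(500)
c = rng.standard_normal(500)         # independent of f before coupling
c_star = sorted_coupling(f, c)       # same values as c, reordered
```

The reordered coarse ensemble has exactly the same empirical distribution as before but is now strongly positively correlated with the fine ensemble, which is what controls the variance of the level-difference estimators.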
In general, localisation with r_{loc,c} = 0 makes the computational cost of this coupling and of the ensemble transform in the MLETPF cheap enough that the multilevel framework returns overall computational cost reductions from the standard ETPF methods, which was the aim of the paper. Although this paper has only touched on the case of the very crude r_{loc,c} = 0 localisation, owing to test problems with small model cost, the method could also be applied to high dimensional systems where the model cost dominates that of the multivariate optimal transportation. One can do this without any crude constraints on r_{loc,c}, using the full multivariate coupling methodology presented in this paper and demonstrated numerically. However, whether the variance decay of V_l from such a multivariate coupling would hold, producing equally strong results, in the limit d ≫ N is unknown; the issue of how one would adjust this coupling for use alongside other values of r_{loc,c}, to reduce the dimensionality of these multivariate couplings, therefore remains to be explored. Iterative and approximate schemes for solving discrete optimal transportation problems have been an area of rapid research in the last few years [14], and they offer the chance to improve the multilevel coupling in the proposed method by reducing its computational cost. This could be done by trading off the optimality and computational cost of the coupling at each level l, e.g. more expensive / optimal couplings for greater l, where the sample sizes N_l are lower. The form of the coupling used in this paper is simple to implement and has the potential to be used in many applications, inside and outside of data assimilation, whenever one wishes to establish consistent correlation between two distributions for variance reduction.
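As one example of such an approximate scheme, an entropy-regularised (Sinkhorn-type) matrix-scaling iteration can cheapen each coupling at the price of a slightly diffuse transport plan. A minimal sketch follows; it is not the specific scheme of [14], and the regularisation and iteration counts are illustrative:

```python
import numpy as np

def sinkhorn_coupling(C, reg=0.05, n_iter=500):
    """Approximate the optimal coupling between two uniform N-point
    ensembles with cost matrix C by Sinkhorn's matrix-scaling iteration."""
    N = C.shape[0]
    mu = np.full(N, 1.0 / N)          # uniform marginals
    K = np.exp(-C / reg)              # Gibbs kernel
    u = np.ones(N)
    for _ in range(n_iter):
        v = mu / (K.T @ u)            # scale columns to hit marginal mu
        u = mu / (K @ v)              # scale rows to hit marginal mu
    return u[:, None] * K * v[None, :]  # approximate coupling matrix

x = np.linspace(0.0, 1.0, 20)
P = sinkhorn_coupling((x[:, None] - x[None, :]) ** 2)
```

The trade-off suggested in the text would then amount to loosening reg and n_iter on the cheap coarse levels while keeping near-optimal couplings where N_l is small.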
Considering an extension of the multidimensional example presented in this paper, one could also apply a spatial multilevel framework, setting the spatial resolution ∆X_l to be dependent on the level of discretization, as done in [5,1], to gain even more significant cost reductions.