Transport map accelerated adaptive importance sampling, and application to inverse problems arising from multiscale stochastic reaction networks

In many applications, Bayesian inverse problems can give rise to probability distributions which contain complexities due to the Hessian varying greatly across parameter space. This complexity often manifests itself as lower dimensional manifolds on which the likelihood function is invariant, or varies very little. This can be due to trying to infer unobservable parameters, or due to sloppiness in the model which is being used to describe the data. In such a situation, standard sampling methods for characterising the posterior distribution, which do not incorporate information about this structure, will be highly inefficient. In this paper, we seek to develop an approach to tackle this problem when using adaptive importance sampling methods, by using optimal transport maps to simplify posterior distributions which are concentrated on lower dimensional manifolds. This approach is applicable to a whole range of problems for which Markov chain Monte Carlo (MCMC) methods mix slowly. We demonstrate the approach by considering inverse problems arising from partially observed stochastic reaction networks. In particular, we consider systems which exhibit multiscale behaviour, but for which only the slow variables in the system are observable. We demonstrate that certain multiscale approximations lead to more consistent approximations of the posterior than others. The use of optimal transport maps stabilises the ensemble transport adaptive importance sampling (ETAIS) method, and allows for efficient sampling with smaller ensemble sizes.


Introduction
The importance of Markov chain Monte Carlo (MCMC) methods is becoming increasingly apparent, in a world replete with datasets which need to be combined with complex dynamical systems in order for us to be able to make progress in a range of scientific fields. Different MCMC methods have been designed with different challenges in mind, for example high dimensional targets [21], or large data sets [7]. New types of stochastic processes have been considered, such as piecewise deterministic Markov processes [8]. Other methods are able to make very large moves informed by the data [26,55], and others are able to exploit the geometry of the target [35]. Population Monte Carlo (pMC) methods use efficient importance sampling proposals informed by an ensemble of particles which have already learned about the structure of the target [12,13,17,23,24,42,43]. These methods are particularly useful in the context of low dimensional target distributions which have complex structure, for which traditional Metropolis-Hastings methods tend to struggle, for example when the target is multimodal or if the density is concentrated on a lower-dimensional manifold. Moreover, they are an excellent candidate for parallelisation, since the task of computing the likelihood for each particle in the ensemble can be distributed across a set of processors. This is of utmost importance in an age where increases in computational power are most likely to come from large-scale parallelisation as opposed to the large speed-ups in individual processors that we saw throughout the 20th century. In [18], we explored ideas from adaptive MCMC within adaptive importance samplers, in order to construct a method which could automatically adapt to optimise efficiency. Alongside this, we explored the use of state-of-the-art resamplers [49], and the multinomial transform (MT), a greedy approximation of the ensemble transform method, in order to improve the quality of the importance sampler proposal distribution.
Other approaches to challenging posterior structure have been proposed, including the work of Marzouk and Parno [46], who demonstrated how invertible transport maps can be constructed which map the target distribution close to an easily-explored reference density, such as a d-dimensional Gaussian. Since the map is invertible, Gaussian proposal densities on the reference space can be mapped to highly informed moves on the parameter space. A related idea is the use of anisotropic diffusion maps [52], in which a map which describes the manifold is constructed using principal component analysis. These ideas have also been applied to stochastic reaction networks [25,53]. In [15], these maps were used to propagate data-mining into unexplored regions which were predicted to be close to the lower-dimensional manifold. The idea is similar to that of the Lamperti or Girsanov transformations for SDEs [30,31], the aim being to map complex processes onto a standard one which is well understood, through a change of measure. In this paper, we particularly concern ourselves with the subject of inverse problems in stochastic reaction networks. This type of model is used in situations where ODE or diffusion [33] approximations of chemical reaction kinetics are not viable, usually because of the low concentration of one or more chemical species [32]. Biochemical networks typically fall into this category, because of the small volume of the reactors involved (a single cell). One challenge with networks arising from these applications is that they often have some reactions which are occurring many times on a timescale in which other reactions in the system are unlikely to occur. This multiscale behaviour presents two main challenges: observability in an experimental setting, and modelling, since multiscale behaviour can make the simulation of the continuous time Markov chain computationally intractable.
A range of numerical methods has been suggested to tackle this second problem, including tau-leaping [10,48], diffusion approximations [20,22,29], quasi-steady state approximations [9,27], and other averaging arguments [19]. Some of these approaches are more computationally intensive than others, but this can be mitigated with the use of analytical results about the invariant densities of certain types of fast subsystems [2,3,38]. We would not expect the fast processes in such a system to be experimentally observable. In such a situation, where only the slow variables in a system are identifiable, in an ideal world we would be able to integrate over the space of all possible trajectories of the fast processes. Since this is an insurmountable computational task, we might instead wish to approximate the likelihoods arising from such a set of observations by the use of an appropriate multiscale methodology. The partial observability of these systems leads to only partial observability of the unknown parameters in the system, for example the reaction rates of the fast reactions. The resulting multiscale approximations, and the approximations of the likelihoods arising from the inverse problem for such a network, often exhibit insensitivity to changes in certain directions of parameter space. This causes problems when aiming to sample from such a target, since the density is concentrated on a manifold whose tangential vectors point in directions of insensitivity. This sloppiness [5,37] of the likelihood is a challenging problem that appears in many applications in science and engineering [16]. In this paper, we aim to incorporate transport maps into an ensemble importance sampling setting, in order to develop stable algorithms which are capable of sampling from probability distributions with complex structures.
These maps are constructed using an importance sample of the posterior distribution, and are used to simplify the posterior distribution, mapping it as close as possible to a reference Gaussian measure. Through this map, straightforward Gaussian proposal densities are transformed into complex densities, which match the target density as well as possible. Without such a map, less well-informed proposal kernels mean that complex structures in the density can only be well represented with a large increase in the ensemble size, increasing computational cost. If computational resources do not allow for such an increase, this can lead to stability issues in the algorithm, since the proposal distribution in the importance sampler does not represent the target density well, leading to high variance in the weights and slow convergence. Better-informed proposal densities lead to more stable and faster-converging importance sampling schemes, and we will demonstrate this through examples arising from Bayesian inverse problems on multiscale stochastic reaction networks. In Section 2, we will briefly reintroduce the ensemble transport adaptive importance sampling (ETAIS) method. In Section 3 we show how an appropriate transport map can be constructed from importance samples which maps the posterior close to a reference Gaussian measure. In Section 4 we show how such a map can be incorporated into a sophisticated parallel MCMC infrastructure in order to accelerate mixing, and improve stability. In Section 5 we seek to show the advantages of this approach through the analysis of a challenging test problem. In Section 6 we consider how likelihoods can be approximated using multiscale methodologies in order to carry out inference for multiscale and/or partially observed stochastic reaction networks.
In Section 7 we present some numerical examples, which serve to demonstrate the increased efficiency of the described sampling methodologies, as well as investigating the posterior approximations discussed in the previous section. We conclude with a discussion in Section 8.

Ensemble Transport Adaptive Importance Sampling
Ensemble transport adaptive importance sampling (ETAIS) [18] is an adaptive importance sampling framework which uses an ensemble of particles and state-of-the-art resampling methods to construct a mixture proposal distribution which closely matches the target distribution. In importance sampling, we aim to sample from a target $\pi$, by sampling from a chosen proposal distribution $\chi$. Each sample $\theta^{(k)} \sim \chi$ is then weighted by $w_k = \pi(\theta^{(k)})/\chi(\theta^{(k)})$, to take account of the bias of sampling from a different distribution to $\pi$. Monte Carlo estimates using a sample of size $N$ of a function $f$ with respect to $\pi$ can then be made through the formula

$$\mathbb{E}_\pi[f] \approx \frac{1}{\bar w} \sum_{k=1}^N w_k f(\theta^{(k)}),$$

where $\bar w = \sum_{k=1}^N w_k$ is the total weight. This method works well when $\pi$ and $\chi$ are close, but can be excruciatingly slow when they are not. The idea behind adaptive importance sampling methods is to construct a good proposal distribution, either from the entire history of the algorithm up to the current point, or from the current state of a whole ensemble of $M$ particles in the system. In ETAIS, the proposal distribution $\chi$ is chosen to be an equally weighted mixture of proposal kernels. If $\{\theta^{(k,i)}\}_{i=1}^M$ is the current state of the ensemble, and we wish to use a mixture of the proposal density $q(\cdot\,;\cdot,\beta)$, then

$$\chi^{(k)}(\cdot) = \frac{1}{M} \sum_{i=1}^M q(\cdot\,;\theta^{(k,i)},\beta).$$

Usually the kernel $q(\cdot\,;x,\beta)$ is centred at, or has mean equal to, $x$, and its variance scales with $\beta$. Good values for the algorithmic parameter(s) $\beta$ can be found by optimising for the effective sample size of the importance sample that is produced (see [18,51] for more details). If the ensemble is large enough, and the chain has entered probabilistic stationarity, then the current state of the ensemble is a good rough discrete approximation of the target density, and in turn $\chi^{(k)}$ is close enough to $\pi$ to produce an efficient importance sample. It can be advantageous to use stratified sampling of the mixture in order to ensure that the sample made is as representative as possible of the target density, i.e. one proposal is drawn from each component $q(\cdot\,;\theta^{(k,i)},\beta)$ of the mixture,
for each $i = 1, 2, \ldots, M$. We now have a weighted sample, and it would be inadvisable to use an equally weighted mixture of proposal distributions from each of these points. Therefore, before starting the next iteration, the importance sample $\{(\theta^{(k)}, w^{(k)})\}$ is resampled to produce an equally weighted sample, ready for the next iteration of the algorithm. In ETAIS, a state-of-the-art resampler is used, which uses optimal transport methods to find the equally weighted discrete sample which best represents the statistics of the importance sample [49]. For larger ensemble sizes $M$ this can become expensive, in which case a greedy approximation of this algorithm, the multinomial transform (MT), can be implemented [18]. The output of the resampler is then denoted $\theta^{(k+1)}$, and the algorithm is ready for the next iteration. The ETAIS method is summarised in Algorithm 1, where the importance samples $\{(\theta^{(k)}, w^{(k)})\}$ are stored as the output.
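The iteration described above can be sketched in a few lines of code. This is a minimal stand-in, assuming a simple one-dimensional bimodal target and plain multinomial resampling in place of the optimal transport resampler of [49]; the target, ensemble size and scaling are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)

def target(theta):
    # Unnormalised bimodal density standing in for pi
    return np.exp(-0.5 * (theta - 2.0)**2) + np.exp(-0.5 * (theta + 2.0)**2)

def etais_step(ensemble, beta):
    M = len(ensemble)
    # Stratified sampling of the mixture: one Gaussian proposal per kernel
    proposals = ensemble + beta * rng.standard_normal(M)
    # Mixture proposal density chi(theta) = (1/M) sum_i q(theta; ensemble_i, beta)
    diffs = proposals[:, None] - ensemble[None, :]
    chi = np.mean(np.exp(-0.5 * (diffs / beta)**2) / (beta * np.sqrt(2*np.pi)), axis=1)
    weights = target(proposals) / chi
    # Multinomial resampling (a greedy stand-in for the ensemble transform resampler)
    probs = weights / weights.sum()
    resampled = proposals[rng.choice(M, size=M, p=probs)]
    return proposals, weights, resampled

M, beta = 100, 0.5
ensemble = rng.standard_normal(M)
samples, wts = [], []
for k in range(500):
    props, w, ensemble = etais_step(ensemble, beta)
    samples.append(props)
    wts.append(w)
samples, wts = np.concatenate(samples), np.concatenate(wts)
# Self-normalised importance sampling estimate of the target mean
mean_est = np.sum(wts * samples) / np.sum(wts)
```

Since the illustrative target is symmetric about the origin, the weighted mean estimate should be close to zero; the quality of the estimate reflects how closely the adapted mixture tracks the target.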
One problem with ETAIS and other similar methods becomes apparent if the target density is concentrated on highly curved manifolds. In this case, unless the proposal densities $q$ are informed by this local structure, the mixture proposal distribution may not approximate $\pi$ well without a very large ensemble size $M$, which can become prohibitively expensive. Some methods have been proposed [24] which use samples local to each particle to inform local covariance structure.
In this paper, we investigate the use of transport maps to learn local covariances across the whole of the domain, in order to stabilise adaptive importance sampling methods, and make these methods more applicable to a wider range of more challenging inference problems.

Construction of transport maps in importance sampling
In this Section, we describe the construction of transport maps which simplify complex posterior distributions in order to allow for improved sampling, in particular with methods based on importance sampling. In [28] the transport map was introduced to provide a transformation from the prior distribution to the posterior distribution, the idea being that one could draw a moderately sized sample from the prior distribution and use this sample to approximate a map onto the target space. Once this map was known to the desired accuracy, a larger sample from the prior could be used to investigate the posterior distribution. This methodology was adapted in [46] to form a new proposal method for MH algorithms. In this case, rather than transforming a sample from the prior into a sample from the target distribution, the map transforms a sample from the posterior onto a reference space. The reference density is chosen to allow efficient proposals using a simple proposal distribution, such as a Gaussian centred at the previous state. Proposed states can then be mapped back into a sample from the posterior by applying the inverse of the transport map.
Proposing new states in this way allows us to make large, informed moves, even when sampling from probability distributions with complex structure. If a sufficiently good map can be found, the push forward of the posterior density through the map may be close enough to the reference Gaussian that we can efficiently propose moves using a proposal distribution which is independent of the current state, for example sampling from the reference Gaussian itself, or a slightly more diffuse distribution.
In this Section we outline the methodology of [46] for approximately coupling the target measure, with density $\pi_\theta$, to the reference measure, with density $\pi_r$. We show how the map can be constructed using a weighted sample, and hence how we can incorporate the map into importance sampling schemes.
Definition 3.1 (Transport Map $T$) A transport map $T$ is a smooth function $T : \mathcal{X} \to \mathbb{R}^d$ such that the pullback of the reference density $\phi(\cdot)$ through $T$ is equal to the target density $\pi(\theta)$ for all $\theta \in \mathcal{X}$. The pullback is defined in terms of the determinant of the Jacobian of $T$,

$$T^\sharp \phi(\theta) = \phi(T(\theta))\, |\det J_T(\theta)|. \qquad (1)$$

Definition 3.2 (Target and Reference Space) The transport map pushes a sample from a target space $\mathcal{X}$, a subset of $\mathbb{R}^d$ equipped with the target density $\pi_\theta$, onto a reference space $\mathcal{R}$, again a subset of $\mathbb{R}^d$, equipped with the reference density $\pi_r$.
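In one dimension, Definition 3.1 can be checked directly: the linear map $T(\theta) = (\theta - \mu)/\sigma$ couples $N(\mu, \sigma^2)$ to the standard Gaussian, and the pullback $\phi(T(\theta))\,|J_T(\theta)|$ recovers the target density exactly. A minimal sketch (all names illustrative):

```python
import math

def phi(r):
    # Standard Gaussian reference density
    return math.exp(-0.5 * r * r) / math.sqrt(2.0 * math.pi)

def pullback(theta, T, dT):
    # T^sharp phi (theta) = phi(T(theta)) * |det J_T(theta)|, here with d = 1
    return phi(T(theta)) * abs(dT(theta))

mu, sigma = 1.5, 0.7
T = lambda th: (th - mu) / sigma   # exact map N(mu, sigma^2) -> N(0, 1)
dT = lambda th: 1.0 / sigma        # its (constant) Jacobian

theta = 2.0
target_pdf = math.exp(-0.5 * ((theta - mu) / sigma)**2) / (sigma * math.sqrt(2.0 * math.pi))
assert abs(pullback(theta, T, dT) - target_pdf) < 1e-12
```

For an approximate map the two densities would differ, and it is exactly this discrepancy that the optimisation problem of Section 3.1 minimises.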
Armed with such a map, independent samples of the target measure can be made by drawing $r \sim \phi$ and applying $T^{-1}$; the resulting samples have the pullback density $T^\sharp \phi$. Clearly the pullback only exists when $T$ is monotonic, i.e. has a positive definite Jacobian, and has continuous first derivatives. Not all maps satisfy these conditions, so we define a smaller space of maps, $\mathcal{T}^\uparrow \subset \mathcal{T}$, which contains all invertible maps. This space does not necessarily contain an exact coupling between target and reference space, and so we are motivated to formulate an optimisation problem to find the map $\tilde T \in \mathcal{T}^\uparrow$ which most closely maps the target density to the reference density.
As in previous work in [46], we can ensure invertibility if we restrict the map to be lower triangular, i.e. $\tilde T \in \mathcal{T}_\Delta \subset \mathcal{T}^\uparrow$. This lower triangular map has the form

$$\tilde T(\theta) = \begin{bmatrix} \tilde T^1(\theta_1) \\ \tilde T^2(\theta_1, \theta_2) \\ \vdots \\ \tilde T^d(\theta_1, \ldots, \theta_d) \end{bmatrix},$$

so that the $i$th component depends only on the first $i$ coordinates of $\theta$.

The optimisation problem
Our aim is now to find the lower triangular map $\tilde T \in \mathcal{T}_\Delta$ such that the difference between the target density and the pullback of the reference density is minimised. As in [46], we choose the cost function to be the Kullback-Leibler (KL) divergence between the posterior density $\pi$ and the pullback density $\tilde\pi = \tilde T^\sharp \pi_r$. This divergence has some convenient properties which we will exploit in the following derivation. The KL divergence is not a true metric, since it is not symmetric; however, it is commonly used to measure the distance between probability distributions due to its relatively simple form, and because it provides a bound for the square of the Hellinger distance [47], $D_{\mathrm{KL}}(p\,\|\,q) \ge D_H^2(p,q)$, which is a true distance metric on probability distributions. Given the form of the pullback in Equation (1), now taken through an approximate map $\tilde T$, the divergence becomes

$$D_{\mathrm{KL}}(\pi\,\|\,\tilde\pi) = \mathbb{E}_\pi\left[\log \pi(\theta) - \log \pi_r(\tilde T(\theta)) - \log |J_{\tilde T}(\theta)|\right].$$
We note that the posterior density is independent of $\tilde T$, and so it is not necessary for us to compute it when optimising this cost function. This expression is a complicated integral with respect to the target distribution, for which the normalisation constant is unknown; however, this is exactly the scenario for which we would turn to MCMC methods for a solution. To find the best coupling $\tilde T \in \mathcal{T}_\Delta$, we solve the optimisation problem

$$\tilde T = \operatorname*{arg\,min}_{T \in \mathcal{T}_\Delta} \mathbb{E}_\pi\left[-\log \pi_r(T(\theta)) - \log |J_T(\theta)|\right],$$

which has a unique solution since the cost function is convex. We also include a regularisation term, which is required for reasons which will become clear later. The optimisation problem now takes the form

$$\tilde T = \operatorname*{arg\,min}_{T \in \mathcal{T}_\Delta} \mathbb{E}_\pi\left[-\log \pi_r(T(\theta)) - \log |J_T(\theta)|\right] + \frac{\beta}{2}\left\| T - \mathrm{Id} \right\|^2. \qquad (2)$$

The parameter $\beta > 0$ does not need to be tuned, as experimentation has shown that the choice $\beta = 1$ is sufficient for most problems. The expectation can be approximated by using an MCMC approximation. The form of the penalisation term promotes maps which are closer to the identity, and so prevents overfitting when the quality or size of the current sample from the posterior is not sufficient.

The structure of the map
Before we continue with the derivation of the optimisation problem, we consider the structure of the map in more detail. The lower triangular structure of the map not only guarantees monotonicity, it also allows for efficient calculation of the pullback density, as well as of the inverse map $\tilde T^{-1}$. The Jacobian of $\tilde T$ is a lower triangular matrix, since $\partial_{\theta_n} \tilde T^k(\theta) = 0$ for all $n > k$. This lower triangular structure means that the determinant of the Jacobian is a product of the diagonal elements which, when we take logs, becomes

$$\log |J_{\tilde T}(\theta)| = \sum_{i=1}^d \log \partial_{\theta_i} \tilde T^i(\theta),$$

where we note that this term is separable in terms of the dimension $i$.
Inverting $\tilde T$ at a point $r$ is simplified by the lower triangular structure of the map. The map component $\tilde T^1(\theta)$ is a univariate polynomial in $\theta_1$, so we can find the inverse of this function by solving the equation $\tilde T^1(\theta_1) = r_1$. This inversion tells us the value of $\theta_1$, which means the next component is again a univariate polynomial, and we solve $\tilde T^2(\theta_2; \theta_1) = r_2$. We can then perform $d$ univariate root finding problems instead of a full $d$-dimensional non-linear solve. We require that the first derivatives of the map are continuous, which is easy to enforce by the choice of basis functions. Here we assume that the map will be built from a family of orthogonal polynomials, $\mathcal{P}(\theta)$, not necessarily orthogonal with respect to the target distribution. Each component of the map is defined as a multivariate polynomial expansion,

$$\tilde T^i(\theta) = \sum_{\mathbf{j} \in \mathcal{J}_i} \gamma_{i,\mathbf{j}}\, \psi_{\mathbf{j}}(\theta). \qquad (4)$$

The parameter $\gamma_i \in \mathbb{R}^{M_i}$ is a vector of coefficients. Each component of $\gamma_i$ corresponds to a basis function $\psi_{\mathbf{j}}$, indexed by the multi-index $\mathbf{j} \in \mathcal{J}_i \subset \mathbb{N}_0^i$. A multi-index defines a product of univariate polynomials in the $\theta_k$,

$$\psi_{\mathbf{j}}(\theta) = \prod_{k \le i} \varphi_{j_k}(\theta_k),$$

where $\varphi_{j_k}(\theta_k) \in \mathcal{P}(\theta_k)$. Since $\tilde T$ is lower triangular, a multi-index $\mathbf{j} \in \mathcal{J}_i$ only contains entries for univariate polynomials in $\theta_k$ for $k \le i$. The cardinalities of the multi-index sets, $M_i = \mathrm{card}(\mathcal{J}_i)$, give the number of unknowns in our optimisation problem, and so we would like to keep this number as small as possible. One option is to use polynomials of total order $p$, which is optimal in terms of the amount of information captured by the map about the target. The cardinality of $\mathcal{J}_i^{\mathrm{TO}}$ is $M_i = \binom{i+p}{p}$, which increases rapidly in $d$ and $p$, where $i = 1, \ldots, d$. Smaller optimisation problems can be produced by constructing subsets of $\mathcal{J}_i^{\mathrm{TO}}$. These index sets are discussed in [46]. Increased information with a slower increase in the number of map parameters can be achieved with the composition of maps discussed in [45].
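The sequential inversion described above amounts to $d$ one-dimensional root finds. The sketch below uses simple bisection, which is valid because each component is increasing in its final argument; the particular map components and bracketing interval are hypothetical, not taken from the paper:

```python
import numpy as np

def invert_lower_triangular(components, r, lo=-10.0, hi=10.0, tol=1e-12):
    """Invert a monotone lower triangular map at the point r, one dimension at a time.
    components[i](prev, t) evaluates the (i+1)th map component with theta_1..theta_i
    fixed to the already-recovered values in prev."""
    theta = []
    for i, Ti in enumerate(components):
        f = lambda t: Ti(theta, t) - r[i]
        a, b = lo, hi
        # Bisection on a monotone increasing function bracketing the root
        for _ in range(200):
            if b - a < tol:
                break
            m = 0.5 * (a + b)
            if f(a) * f(m) <= 0.0:
                b = m
            else:
                a = m
        theta.append(0.5 * (a + b))
    return np.array(theta)

# Illustrative monotone map: T^1 = t + t^3, T^2 = t + 0.5 * theta_1^2
T1 = lambda prev, t: t + t**3
T2 = lambda prev, t: t + 0.5 * prev[0]**2
r = np.array([T1([], 0.8), T2([0.8], -0.3)])
theta = invert_lower_triangular([T1, T2], r)
# theta should recover the original point [0.8, -0.3]
```

Each solve is one-dimensional because the earlier coordinates have already been fixed, which is exactly the saving over a full $d$-dimensional non-linear solve.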
Here we stick with polynomials of total order p since we work with low dimensional problems with the ETAIS algorithm.

Implementation of the optimisation problem
We now discuss how we can evaluate the cost function in Equation (2). In [46], this expectation is approximated using an MCMC estimator, such that

$$C(\tilde T) \approx \frac{1}{K} \sum_{k=1}^K \left[-\log \pi_r(\tilde T(\theta^{(k)})) - \log |J_{\tilde T}(\theta^{(k)})|\right].$$

Here we diverge from previous work, as we aim to build a map from samples from an importance sampling scheme. Such samples no longer carry equal weight, and as such the Monte Carlo estimator becomes

$$C(\tilde T) \approx \frac{1}{\bar w} \sum_{k=1}^K w_k \left[-\log \pi_r(\tilde T(\theta^{(k)})) - \log |J_{\tilde T}(\theta^{(k)})|\right], \qquad (6)$$

where the $w_k$ are the weights associated with each sample $\theta^{(k)}$, and $\bar w$ is the sum of all these weights. Optimisation of this cost function results in a map from $\pi$ to some reference density $\pi_r$. By choosing the reference density to be a Gaussian density, we can simplify this expression greatly. Substitution of the Gaussian density into Equation (6) leads to

$$C(\tilde T) \approx \frac{1}{\bar w} \sum_{k=1}^K w_k \sum_{i=1}^d \left[\frac{1}{2} \tilde T^i(\theta^{(k)})^2 - \log \partial_{\theta_i} \tilde T^i(\theta^{(k)})\right] + \text{const}.$$

Note that since we assume that the map is monotonic, the derivatives of each component are positive and so this functional is always finite. In practice it is infeasible to enforce this condition across the whole parameter space. We instead enforce this condition by ensuring that the derivatives are positive at each sample point. This means that when we sample away from these support points while in reference space, it is possible to enter a region of space where the map is not monotonic. We now return to the structure of the map components given in Equation (4). Since the basis functions are fixed, the optimisation problem in (2) is really over the map coefficients $\gamma = (\gamma_1, \ldots, \gamma_d)$, where $\gamma_i \in \mathbb{R}^{M_i}$. Note that $C(\tilde T)$ is the sum of $d$ expectations, and these expectations each only concern one dimension. Therefore we can rewrite (2) as $d$ separable optimisation problems.
The sum in Equation (4) is an inner product between the vector of map coefficients and the evaluations of the basis functions at a particular $\theta^{(k)}$. If we organise our basis evaluations into two matrices,

$$(F_i)_{k,\mathbf{j}} = \psi_{\mathbf{j}}(\theta^{(k)}), \qquad (G_i)_{k,\mathbf{j}} = \partial_{\theta_i} \psi_{\mathbf{j}}(\theta^{(k)}),$$

for all $\mathbf{j} \in \mathcal{J}_i^{\mathrm{TO}}$, and $k = 1, \ldots, K$, then we have that

$$C_i(\gamma_i) \approx \frac{1}{\bar w}\left[\frac{1}{2} \gamma_i^\top F_i^\top W F_i \gamma_i - w^\top \log(G_i \gamma_i)\right].$$

In this expression, the vector $w = [w_1, w_2, \ldots, w_K]^\top$ is the vector of the weights, $W$ is the diagonal matrix $W = \mathrm{diag}(w)$, and $\log(G_i \gamma_i)$ is to be evaluated element-wise. As more importance samples are made, new rows can be appended to the $F_i$ and $G_i$ matrices, and $F_i^\top W F_i$ can be efficiently updated via the addition of rank-1 matrices. The regularisation term can be approximated using Parseval's identity,

$$\left\|\tilde T - \mathrm{Id}\right\|^2 \approx \sum_{i=1}^d \|\gamma_i - \iota_i\|_2^2,$$

where $\iota_i$ is the vector of coefficients for the identity map. This is of course only true when the polynomial family $\mathcal{P}(\theta)$ is chosen to be orthonormal with respect to $\pi_\theta$; however, this approximation prevents the map from collapsing onto a Dirac when the expectation is badly approximated by a small number of samples. These simplifications result in the efficiently implementable, regularised optimisation problem for computing the map coefficients,

$$\gamma_i^* = \operatorname*{arg\,min}_{\gamma_i \in \mathbb{R}^{M_i}} \frac{1}{\bar w}\left[\frac{1}{2} \gamma_i^\top F_i^\top W F_i \gamma_i - w^\top \log(G_i \gamma_i)\right] + \frac{\beta}{2}\|\gamma_i - \iota_i\|_2^2. \qquad (10)$$

This optimisation problem can be efficiently solved using Newton iterations. It is suggested in [46] that this method usually converges in around 10-15 iterations, and we have seen no evidence that this is not a reasonable estimate. When calculating the map several times during a Monte Carlo run, using previous guesses of the optimal map to seed the Newton algorithm results in much faster convergence, usually taking only a couple of iterations to satisfy the stopping criteria. The Hessian takes the form

$$H_i = \frac{1}{\bar w}\left(F_i^\top W F_i + G_i^\top W\, \mathrm{diag}\!\left([G_i \gamma_i]^{-2}\right) G_i\right) + \beta I,$$

where $[G_i \gamma_i]^{-2}$ is to be taken element-wise, and $I$ is the $M_i \times M_i$ identity matrix.
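A minimal one-dimensional instance of the regularised problem above can be solved with damped Newton iterations as described. Here the basis $\{\theta, \theta^3\}$, the Gaussian stand-in sample, and all parameter values are illustrative assumptions, not the paper's:

```python
import numpy as np

rng = np.random.default_rng(1)

# Equally weighted sample standing in for an importance sample
theta = rng.normal(0.0, 2.0, size=2000)
w = np.ones_like(theta)
wbar = w.sum()

# One map component: T(theta) = g1*theta + g2*theta^3 (odd monomials)
F = np.column_stack([theta, theta**3])                   # basis evaluations
G = np.column_stack([np.ones_like(theta), 3 * theta**2]) # derivative evaluations
iota = np.array([1.0, 0.0])  # coefficients of the identity map
beta = 1.0

gamma = iota.copy()
for _ in range(50):
    Gg = G @ gamma
    grad = (F.T @ (w * (F @ gamma)) - G.T @ (w / Gg)) / wbar + beta * (gamma - iota)
    H = (F.T @ (w[:, None] * F)
         + G.T @ ((w / Gg**2)[:, None] * G)) / wbar + beta * np.eye(2)
    step = np.linalg.solve(H, grad)
    # Damp the step so the map stays monotone (G @ gamma > 0) at every sample point
    while np.any(G @ (gamma - step) <= 0.0):
        step *= 0.5
    gamma = gamma - step
    if np.linalg.norm(step) < 1e-10:
        break

pushed = F @ gamma  # push forward of the sample; should be roughly standard normal
```

The damping line enforces positivity of the derivatives at the sample points only, mirroring the practical monotonicity condition discussed above.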

Transport map usage in ETAIS and other importance sampling algorithms
Given importance samples from the target distribution, we have demonstrated how to construct an approximate transport map from the target measure to a reference measure. We now consider how to implement an ensemble importance sampling algorithm which uses these maps to propose new states. In [46] it was shown how approximate transport maps can be used to accelerate Metropolis-Hastings methods, with the map being periodically updated with the samples produced from the target measure; convergence of this adaptation is also shown in [46]. In this Section, we will show how, similarly, these maps can be used to construct highly efficient importance sampling schemes. In particular, we will show how we can use the transport map derived in Equation (10) to design a proposal scheme for the ETAIS algorithm. In this case we have a choice in how to proceed: we can propose new samples on reference space and resample on target space, or we can both propose and resample on reference space, mapping onto target space to output the samples. The first option allows us to reuse much of the framework from the standard ETAIS algorithm, and in the numerics later we see that this performs better than both the transport MH algorithm and the standard ETAIS algorithm. The second option requires some restructuring, but can result in improved performance from the resampler. The first option is given in Algorithm 2. We denote the ensembles of states in target space by $\{(\theta^{(k,i)}, w^{(k,i)})\}$; these pairs are the states which together form our sample from the target distribution. As in the standard version of the ETAIS algorithm, we use the deterministic mixture weights. The second option, Algorithm 3, is similar to the first, except on Line 8, where rather than resampling in target space we resample in reference space. Assuming a good transport map, the push forward of the target measure onto reference space has no strong correlations, and the Gaussian marginals are easy to approximate with fewer ensemble members. This means that the resampling step will be more efficient in moderately higher dimensions, which we discuss in Section 8.1.

Algorithm 2: ETAIS algorithm with adaptive transport map. Option 1.
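To illustrate the proposal mechanism of the first option, consider a one-dimensional example where an exact map is available: $T(\theta) = \log\theta$ couples a log-normal target to the standard Gaussian. A Gaussian random walk on reference space, pulled back through $T^{-1}$, gives the target-space proposal density $q(\theta'\,|\,\theta) = N(T(\theta');\, T(\theta), \beta^2)\,|J_T(\theta')|$. A sketch, with the map and all numbers purely illustrative:

```python
import math
import random

random.seed(3)

# Exact 1D transport map for a log-normal target: T(theta) = log(theta)
T = math.log
T_inv = math.exp
dT = lambda th: 1.0 / th  # |J_T(theta)|

def proposal_density(theta_new, theta_old, beta):
    # Gaussian RW on reference space, pulled back to target space:
    # q(theta'|theta) = N(T(theta'); T(theta), beta^2) * |J_T(theta')|
    r_old, r_new = T(theta_old), T(theta_new)
    gauss = math.exp(-0.5 * ((r_new - r_old) / beta)**2) / (beta * math.sqrt(2.0 * math.pi))
    return gauss * abs(dT(theta_new))

def propose(theta_old, beta):
    # Step on reference space, then map back to target space
    r_new = T(theta_old) + beta * random.gauss(0.0, 1.0)
    return T_inv(r_new)

beta = 0.5
theta_new = propose(1.0, beta)
q = proposal_density(theta_new, 1.0, beta)
# theta_new automatically respects the target support (theta > 0),
# and q is the density needed for the deterministic mixture weights
```

Because the proposal is a valid density on the target space, it integrates to one there, which can be checked numerically.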

Convergence of the transport proposal based MCMC algorithms
In this Section we study the convergence of the transport based proposal distributions which we have described in Section 4. We take as a test problem the ubiquitous Rosenbrock banana-shaped density, given in Equation (12); a contour plot of the target density is given in Figure 1. This problem is challenging to sample from, since it has a highly peaked and curved ridge, and it is often used as a test problem in the optimisation and MCMC communities.
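The paper's exact constants are fixed in Equation (12); as an illustration only, a commonly used Rosenbrock-type log-density, with hypothetical constants a and b, can be evaluated as follows:

```python
def log_rosenbrock(theta1, theta2, a=0.05, b=5.0):
    """Unnormalised log-density of a Rosenbrock-type banana.
    The constants a and b are illustrative; the paper's Equation (12) fixes its own."""
    return -((theta2 - theta1**2)**2) / a - ((1.0 - theta1)**2) / b

# The curved ridge follows theta2 = theta1^2; points on the ridge have
# much higher density than points a short distance off it
on_ridge = log_rosenbrock(1.0, 1.0)
off_ridge = log_rosenbrock(1.0, 2.0)
assert on_ridge > off_ridge
```

The small value of a relative to b is what produces the narrow, highly curved ridge that uninformed proposal kernels struggle to follow.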

Implementation details
Before looking at the performance of the algorithms, we demonstrate some properties of the transport maps we will be using. We draw 1 million samples from the density (12), and use this sample in the framework of Section 3 to build a transport map. We use this map to push forward the original sample onto the reference space, where we will be able to see how well the map has performed at converting the original sample to a standard Gaussian. We then pull the sample back onto target space using the inverse map, to check that our map is invertible and well behaved.

Algorithm 3: ETAIS algorithm with adaptive transport map. Option 2.
For this example, we use an index set of total order 3 with monomial basis functions. It is important that the total order is an odd number, since otherwise the map will not be surjective. This results in a lower triangular map whose first component is a cubic polynomial in $\theta_1$ (4 terms), and whose second component is a bivariate polynomial of total order 3 in $(\theta_1, \theta_2)$ (10 terms). Clearly, even with only basis functions of total order 3, we have a large number of unknowns in our optimisation problem, $\gamma \in \mathbb{R}^{14}$. If we were to increase the dimension of $\theta$ further, we would need to reduce the number of terms we include in the expansion by, for example, removing all of the "cross" terms. This reduces the quality of our map, but since we only require an approximate map, we can afford to reduce the accuracy.

Figure 2: The effect of the approximate transport map $\tilde T$ on a sample from the Rosenbrock target density as described in Equation (12).

Figure 2 shows the output of the approximate transport map. Even though we have truncated the infinite expansion in the monomial basis down to 4 and 10 terms in the respective dimensions, the push forward of the sample is still a unimodal distribution centred at the origin with standard deviation close to 1. As one moves out into the tails of the reference density, more non-Gaussian features are clearly visible. However, overall, the push forward of the target density does not look a challenging one to sample from, even with relatively simple MCMC methods such as Metropolis-Hastings random walk (MH-RW). The pullback from reference space, also shown in Figure 2, is an exact match of the original sample, since we have not perturbed the sample in reference space. This inversion is well defined in the sampling region, although not necessarily outside it [46].
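The coefficient count quoted above follows directly from the cardinality formula $M_i = \binom{i+p}{p}$ of Section 3.2; a quick check:

```python
from math import comb

d, p = 2, 3  # dimension and total polynomial order, as in this example
# Component i of a lower triangular map depends on theta_1..theta_i,
# so its total-order-p expansion has comb(i + p, p) terms
sizes = [comb(i + p, p) for i in range(1, d + 1)]
assert sizes == [4, 10]   # 4 terms in the first component, 10 in the second
assert sum(sizes) == 14   # gamma lies in R^14, matching the text
```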

Numerical results for convergence of transport map based algorithms on the Rosenbrock density
We first find the optimal scaling parameters for the individual algorithms. This is done, as in [18], by optimising for the effective sample size in the ETAIS algorithm, and by tuning the relative $L^2$ error in the MH algorithm. There is currently no guidance on the best way of tuning the MH algorithm with transport map proposals, although one might expect results similar to the standard MH results, especially if adaptation of the map is stopped after a given point. As in the ETAIS algorithm, optimising for the effective sample size might be the best option.

Table 1: Optimal scaling parameters for the transport map based algorithms applied to R1, optimising for the $L^2$ error and the effective sample size (ESS) for the ETAIS algorithms, and for the average acceptance rate for the MH algorithms.
The optimal scaling parameters are given in Table 1. Here we see that the effective sample size is much lower than in the one-dimensional examples with the ETAIS algorithms. However, in the Rosenbrock density (12) we are dealing with a much more complicated curved structure, as well as a very slowly decaying tail in $\theta_2$. From our experiments, we have observed that the standard ETAIS-RW required an ensemble size of M = 500 to overcome the problems in this density, whereas the transport map transforms the tails to be more like those of a Gaussian, which can be approximated well by a smaller ensemble size of M = 150. The convergence of the three algorithms is displayed in Figure 3. Figure 3(a) shows that the two variations of the transport based ETAIS algorithms converge at similar rates. The second version, which performs the resampling stage in reference space rather than target space, has a slightly higher ESS, and is more stable than option 1. This version also has a property that we can exploit in Section 8.1. Figure 3(b) compares the performance of RWMH, transport map-accelerated RWMH, ETAIS-RW and transport map-accelerated ETAIS-RW. The difference in performance between RWMH and its accelerated version is relatively small, in part because this version of the Rosenbrock density is not overtly pathological. The story is different for the ETAIS algorithms. The slowly decaying tails of the Rosenbrock density cause a problem for ETAIS, since these are not well represented in the proposal distributions, leading to high weight variance and poor convergence. A much larger ensemble would be required to ensure a stable ETAIS implementation for this problem, which leads to high overhead costs in the resampling step and in the evaluation of the importance weights. The transport map-accelerated ETAIS, on the other hand, is stabilised by the map, which transforms the proposal distributions so that they adequately approximate the slowly decaying tails of the target.
This stabilisation allows  us to benefit from the improved convergence of the ETAIS methodology, with a smaller ensemble size.

Multiscale Methods for Stochastic Chemical Reaction Networks
In this section, we discuss some recent advances in multiscale methods for stochastic reaction networks. Inverse problems arising in this area often lead to complex posterior distributions, which traditional MCMC methods can struggle to sample from. We will then go on to solve some related inverse problems in Section 7, using the transport map versions of the ETAIS algorithm, as described in Section 4. We consider chemical reaction networks of N_s chemical species {S_j}_{j=1}^{N_s}, with population numbers given by X(t) = [X_1(t), X_2(t), ..., X_{N_s}(t)] ∈ N_0^{N_s}, reacting in a small reactor through N_r different reaction channels. When the population numbers of one or more of the chemical species in the reactor are small, as is the case with chemical reactions occurring within living cells, the sporadic and discrete way in which reactions occur cannot be well modelled by deterministic continuous models, such as ODEs. In such a situation, we may wish to model the dynamics through a discrete stochastic model, such as a continuous time Markov chain. For each reaction R_j, for j = 1, 2, ..., N_r, there is a propensity (or hazard) function α_j(X(t)) which indicates how likely that reaction is to fire, defined by

α_j(X(t)) = lim_{dt→0} (1/dt) P(reaction R_j fires in the time interval [t, t + dt)).
If a system satisfies what is called mass action kinetics, then the form of the function α_j is determined, up to a rate constant, by the reactants involved in that reaction:

α_j(X(t)) = k_j ∏_{m=1}^{N_s} X_m(t)! / (X_m(t) − ν⁻_{j,m})!,        (13)

where ν⁻_{j,m} is the number of molecules of species S_m consumed by reaction R_j, ν_j is the stoichiometric vector for reaction R_j (whose mth component ν_{j,m} gives the net change in X_m), X_m is the mth component of the state vector X, and the k_j are rate constants. Following each reaction R_j there is an instantaneous change in the current state, as the reactants of the reaction are consumed, and the products produced. This is modelled by a change in the state vector X(t) = X(t⁻) + ν_j. The model can be represented as the following expression involving N_r different unit rate Poisson processes [4] Y_j:

X(t) = X(0) + Σ_{j=1}^{N_r} ν_j Y_j( ∫_0^t α_j(X(s)) ds ).        (14)

The master equation for this model is only solvable in certain situations, for example monomolecular networks [38], or for steady-state solutions for weakly reversible deficiency zero networks [2,3]. Trajectories for this system can be sampled exactly, for instance using the Gillespie SSA [32], or its variants [1,11,34]. However, if the system is stiff, i.e. there are some reactions which fire many times on a timescale over which others are unlikely to fire at all, then trajectories can become prohibitively expensive to simulate, since these methods simulate every single reaction event at the same cost. In such a system, one might employ multiscale methods in order to approximate trajectories at a lower cost than the exact algorithms. One common approach is to invoke the quasi-equilibrium approximation (QEA). This approximation assumes that the fast reactions enter quasi-equilibrium on a timescale which is negligible with respect to the timescale on which the slow dynamics of the system evolve. This amounts to taking the asymptotic limit in which the propensities of the "slow" reaction channels are equal to zero.
This allows us to sample approximate trajectories of the slow reactions without the need to compute many fast reaction events. This approach can work well when there is a large timescale gap between the fast and slow reactions in a system, but where the timescale gap is less pronounced, it can lead to significant errors [20]. Another approach is the constrained multiscale algorithm (CMA) [19,22], based in part on the equation-free approach to multiscale computations [29,41]. This approach also assumes quasi-equilibrium in the system, but takes into account the effect that the slow reactions can have on the invariant distribution of the fast variables in the system. For a more detailed description of this method, please refer to the literature [19]. Multiscale methods are not only of use when forward evaluations of the dynamics are costly, but can also be of use when we attempt to solve an inverse problem in which the fast variables are unobservable. One approach in this situation would be to integrate over all possible trajectories of the fast variables, but this would almost always be prohibitively expensive. Another approach is to use an appropriate multiscale approximation, so that the effective dynamics of the slow variables can be approximated without the need to consider the rapid fluctuations of the fast variables in the system.
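The exact sampling referenced above, the Gillespie SSA, repeatedly draws an exponential waiting time with the total propensity as its rate, then selects a reaction channel with probability proportional to its propensity. A minimal sketch (not one of the optimised variants cited):

```python
import numpy as np

def gillespie_ssa(x0, nu, propensities, t_max, rng=None):
    """Sample one exact trajectory of a stochastic reaction network.

    x0           -- initial state (length N_s)
    nu           -- stoichiometric vectors, shape (N_r, N_s)
    propensities -- function x -> array of N_r propensities alpha_j(x)
    t_max        -- final time
    Returns the lists of jump times and corresponding states.
    """
    rng = np.random.default_rng(rng)
    t, x = 0.0, np.array(x0, dtype=int)
    times, states = [t], [x.copy()]
    while True:
        alpha = propensities(x)
        alpha0 = alpha.sum()
        if alpha0 <= 0.0:                        # no reaction can fire
            break
        t += rng.exponential(1.0 / alpha0)       # waiting time ~ Exp(alpha0)
        if t > t_max:
            break
        j = rng.choice(len(alpha), p=alpha / alpha0)  # which reaction fires
        x = x + nu[j]                            # apply stoichiometric change
        times.append(t)
        states.append(x.copy())
    return times, states
```

For a stiff system this loop is exactly what becomes expensive: every fast reaction event costs one iteration, which is what motivates the multiscale approximations above.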

Likelihoods arising from stochastic reaction networks
Suppose that we are able to exactly observe the number of molecules of each chemical species in a system which satisfies mass action kinetics, and which can be well approximated by (14). Suppose that we wish to infer the value of the rate constants of each reaction from these observations. Even with no observational noise, since the dynamics of the system are stochastic, this still leads to a Bayesian inverse problem in which we can only infer a joint probability distribution on the reaction parameters. Suppose that the system is currently in state X(t_{j−1}). Two independent univariate random variables determine when the next reaction in the system occurs and what it is. The first is the jth waiting time τ_j to the next reaction, which is an exponential random variable,

τ_j ∼ Exp(α_0(X(t_{j−1}))),

where α_0(X(t)) = Σ_{i=1}^{N_r} α_i(X(t)) is the total propensity in the system. The second is a multinomial random variable r_j which dictates which reaction occurs at the jth reaction event, and which takes the value i ∈ {1, 2, ..., N_r} with probability α_i(X(t_{j−1})) / α_0(X(t_{j−1})).
As such, in order to compute the likelihood of a particular trajectory given a possible realisation of the reaction parameters, it is sufficient to have access to the total time spent in each state, and the frequency of each reaction which led to leaving each state. From this formulation, we see that the random variables (τ_j, r_j) depend only on the state X(t_{j−1}), and so are Markovian. This conditional independence means that we can group events together by the state the system was in when the event happened. We define two new random variables which depend on a state Y ∈ S. The first is the total time spent in state Y,

T(Y) = Σ_{j : X(t_{j−1}) = Y} τ_j.        (15)

This random variable, T(Y), is a sum of exponentially distributed random variables, each with rate α_0(Y; k), and hence follows a Gamma distribution,

T(Y) ∼ Gamma(n(Y), α_0(Y; k)),

where n(Y) is the number of visits the trajectory makes to state Y. Similarly, we can define the numbers of reactions of each type which occurred while the system was in state Y as

R(Y) = [R_1(Y), ..., R_{N_r}(Y)],   R_i(Y) = #{j : X(t_{j−1}) = Y, r_j = i}.        (16)

Here all of the random variables r_j associated with state Y follow the same multinomial distribution, and so R(Y) is multinomially distributed with n(Y) trials and event probabilities p_i(Y; k) = α_i(Y; k)/α_0(Y; k). The random variables defined in Equations (15) and (16) are sufficient statistics for the posterior distribution π(k|D). With these definitions, enumerating the states Y_i ∈ S, where K = |S| is the number of states in S, we collect these statistics into the structures T ∈ R^K, with T_i = T(Y_i), and R ∈ N_0^{K×N_r}, with R_{i,j} = R_j(Y_i). To construct the posterior distribution for the reaction rates, k, in the chemical system, we formulate the likelihood using these sufficient statistics. Due to the non-negativity of the reaction rates, we assign a Gamma prior distribution to each rate. Given the distributions in Equations (15) and (16) for our data, the likelihood of observing the data R and T is

π(R, T | k) ∝ ∏_{i=1}^{K} α_0(Y_i; k)^{n(Y_i)} exp(−α_0(Y_i; k) T_i) ∏_{j=1}^{N_r} p_j(Y_i; k)^{R_{i,j}},

where again Y_i ∈ S, and the propensities α_j and probabilities p_j depend on the reaction rates k = [k_1, k_2, ..., k_{N_r}] through the mass action kinetics given in (13). Our choice of Gamma prior distributions with hyper-parameters (a_i, b_i) results in a posterior distribution of the form

π(k | R, T) ∝ π(R, T | k) ∏_{i=1}^{N_r} k_i^{a_i − 1} exp(−b_i k_i).        (17)
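The likelihood above, together with the Gamma log-priors, can be evaluated directly from the sufficient statistics R and T. A sketch, in which `propensity` is a user-supplied function returning the vector of mass-action propensities (all names are ours):

```python
import numpy as np

def log_posterior(k, states, T, R, propensity, a, b):
    """Unnormalised log-posterior over rate constants k.

    states     -- the K distinct states Y_i visited by the trajectory
    T          -- T[i]: total time spent in state Y_i
    R          -- R[i][j]: number of firings of reaction R_j from Y_i
    propensity -- function (Y, k) -> vector of propensities alpha_j(Y; k)
    a, b       -- Gamma(a_j, b_j) prior hyper-parameters for each k_j
    """
    k = np.asarray(k, dtype=float)
    # Gamma log-priors (up to an additive constant)
    logp = np.sum((np.asarray(a) - 1.0) * np.log(k) - np.asarray(b) * k)
    for Yi, Ti, Ri in zip(states, T, R):
        alpha = propensity(Yi, k)
        Ri = np.asarray(Ri, dtype=float)
        fired = Ri > 0
        # each firing of R_j from Y_i contributes log alpha_j(Y_i; k);
        # the holding times contribute -alpha_0(Y_i; k) * T_i
        logp += np.sum(Ri[fired] * np.log(alpha[fired]))
        logp -= alpha.sum() * Ti
    return logp
```

Note that only the k-dependent terms are kept, which is all that any of the samplers discussed here require.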

Approximation of likelihoods in multiscale chemical networks
Suppose now that we only observe the slower variables in a multiscale chemical system of this type. Suppose that n_r < N_r is the number of slow reactions in the system, and that the slow reactions are given by {R_{s_1}, R_{s_2}, ..., R_{s_{n_r}}} ⊂ {R_1, R_2, ..., R_{N_r}}. The propensities of the slow reactions α_{s_i} may depend on the values of the fast variables, which are unknown. However, the invariant distribution of the fast variables conditioned on the slow variables in the system can be approximated using a multiscale method, such as the QEA or the CMA, as described briefly in Section 6. Once the approximations have been made, we arrive at approximate effective propensities ᾱ_{s_i}, which have been averaged over the computed invariant distributions of the fast variables. The approximate effective dynamics of the slow variable S(t) are then given by

S(t) = S(0) + Σ_{j=1}^{n_r} ν_{s_j} Y_j( ∫_0^t ᾱ_{s_j}(S(s)) ds ).        (18)

In turn, the approximate posterior distribution on the reaction parameters k is then given by

π̄(k | R, T) ∝ ∏_{i=1}^{K} ᾱ_0(Y_i; k)^{n(Y_i)} exp(−ᾱ_0(Y_i; k) T_i) ∏_{j=1}^{n_r} p̄_j(Y_i; k)^{R_{i,j}} ∏_{j=1}^{N_r} k_j^{a_j − 1} exp(−b_j k_j),        (19)

where ᾱ_0(Y_i; k) = Σ_{j=1}^{n_r} ᾱ_{s_j}(Y_i; k) is the total of the effective propensities, n_r is the number of slow reactions, p̄_j(Y_i; k) = ᾱ_{s_j}(Y_i; k)/ᾱ_0(Y_i; k) is the multiscale approximation of the expected proportion of each slow reaction at state Y_i with reaction rates k, and the data are limited to changes in the slow variable due to the occurrence of slow reactions.

Numerical Examples
In this section we look at two examples of chemical systems, to demonstrate the effectiveness of transport ETAIS algorithms for sampling from the challenging posterior distributions on the reaction parameters that arise.

A Multiscale Chemical System
First we consider a simple multiscale example involving two chemical species, S_1 and S_2:

R_1: ∅ → S_1,   R_2: S_1 → S_2,   R_3: S_2 → S_1,   R_4: S_2 → ∅.        (20)

Each arrow represents a reaction from a reactant to a product, with some rate constant k_i, and where mass action kinetics is assumed. The parameters k_i are non-negative, and k = [k_1, ..., k_4] ∈ R_+^4 = X. We denote the population count of species S_i by X_i ∈ N_0. We consider this system with the parameter choices given in (21), where V is the volume of the reactor. In this parameter regime the reactions R_2: S_1 → S_2 and R_3: S_2 → S_1 occur much more frequently than the other reactions. Notice that both chemical species are involved in fast reactions. However, the quantity S = X_1 + X_2 is conserved by both of the fast reactions, and as such, this is the slowly changing quantity in this system. We observe this system over the time period t ∈ [0, 500] with initial condition X_1(0) = X_2(0) = 0, but we assume that we are only able to observe the slow variable S(t) = X_1(t) + X_2(t).
The effective dynamics of S can be approximated by a system of only two reactions, production and degradation of S, where the propensities are shown as opposed to the rates. The effective propensity ᾱ_4(S) can be approximated through application of a multiscale approximation, as detailed in Section 6, and in more detail in [19]. These effective propensities can often be computed without the need for expensive simulations, in particular when the fast subsystem is deficiency zero, which is often the case [2,3]. For this very simple example, the dependence of these effective propensities on the original rate parameters can be understood explicitly. Under the QEA, the effective rate of degradation of the slow variable S is given by

ᾱ_4^QEA(S) = k_2 k_4 S / (k_2 + k_3).        (22)

Similarly, the analysis of the constrained system as discussed in [19] yields the effective propensity

ᾱ_4^CMA(S) = k_2 k_4 S / (k_2 + k_3 + k_4).        (23)

Our observations are uninformative about the individual reaction rates k_2, k_3, and k_4, as for each multiscale approximation there is a manifold M ⊂ X along which the effective propensity ᾱ_4 is invariant, leading to a highly ill-posed inverse problem. This type of problem is notoriously difficult to sample from using standard MH algorithms, as the chains quickly gravitate towards the manifold M, but exploration along M is then slow. We aim to recover three different posterior distributions. In the first, which is of the form (17), we assume that we can fully and exactly observe the whole state vector over the whole of the observation time window t ∈ [0, T]. In the second and third, the same data are used, but projected down to one-dimensional observations of the slow variable S = X_1 + X_2. In these last two cases, we approximate the posterior using (19), with QEA and CMA approximations of the effective dynamics respectively. In all three cases, we find the posterior distribution for k ∈ X = R_+^4. These four parameters are assigned Gamma prior distributions, with the corresponding hyper-parameters given in Table 2.
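The unidentifiability along the manifold M is easy to exhibit numerically. The sketch below encodes the two effective degradation propensities quoted above, ᾱ_4^QEA(S) = k_2 k_4 S/(k_2 + k_3) and ᾱ_4^CMA(S) = k_2 k_4 S/(k_2 + k_3 + k_4) (the linear dependence on S is our reading of the first-order degradation), and checks that two distinct rate vectors lying on the same QEA manifold are indistinguishable by the QEA likelihood:

```python
def alpha4_qea(k, S):
    """QEA effective degradation propensity of the slow variable S."""
    k1, k2, k3, k4 = k
    return k2 * k4 * S / (k2 + k3)

def alpha4_cma(k, S):
    """CMA effective degradation propensity of the slow variable S."""
    k1, k2, k3, k4 = k
    return k2 * k4 * S / (k2 + k3 + k4)

# two rate vectors with the same value of k2*k4/(k2 + k3):
ka = (1.0, 2.0, 2.0, 2.0)   # 2*2/(2+2) = 1
kb = (1.0, 1.0, 1.0, 2.0)   # 1*2/(1+1) = 1
```

Observations of S alone cannot distinguish ka from kb under the QEA, so the likelihood is invariant along this manifold; under the CMA the two rate vectors produce different effective propensities.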
These priors are the same for each of the three posterior distributions. Figure 4 shows how this posterior looks when we use the CMA to model the effective degradation propensity ᾱ_4. Although this posterior does not appear from these plots to be pathologically difficult to sample using MCMC, it most certainly is. The posterior density is concentrated close to a curved manifold. Since the manifold is curved, the joint marginals do not look unreasonable, when in fact the density has very different derivatives in different directions when evaluated close to this manifold.

Table 2: Hyper-parameters in the prior distributions for the multiscale problem described in Section 7.1.

Figure 4: The CMA approximation (19) of the posterior arising from observations of the slow variable S = X_1 + X_2 for system (20) with true rates given by (21).
We consider several different algorithms for sampling from these distributions. First we implement both the ETAIS and MH algorithms with a Gaussian proposal distribution. In the case of the ETAIS algorithm, this is a Gaussian mixture proposal distribution, and in the MH algorithm this is equivalent to a vanilla random walk. The proposal distribution uses a covariance matrix which has been constructed using the sample covariance of a sample produced by a standard MH-RW algorithm. This is necessary since these algorithms otherwise fail to converge in a reasonable time for this problem. We also compare the ETAIS and MH algorithms when using a transport map proposal distribution, as discussed in detail in Section 4. Figure 5 (top) shows how we define a bijective map, T̂, between parameter space X and a reference space R. When using the transport map proposal distribution, we prefix the proposal method with a T, so that MH-RW becomes MH-TRW, and ETAIS-RW becomes ETAIS-TRW.

Figure 5: Couplings between parameter space and reference space when using the transport map with (bottom) and without (top) a log preconditioner.

In practice we cannot ensure that the approximate map T̂ is uniquely invertible over
the whole of R, and so T̂ is not truly bijective. This leads to problems for our strictly positive state space, X ⊂ R_+^d, since proposals in R do not necessarily map back onto X. Therefore we also consider how these algorithms perform when the transport map is preconditioned by applying the logarithm to the parameters. We choose this log preconditioner since it converts our strictly positive parameter space, X, into an intermediate space Y ⊂ R^d. This allows us to define T̂ between two subsets of R^d, which means that all proposals made on the reference space are mapped to valid proposals on the true parameter space. It also reduces the work required of the map, which, since we use only a finite dimensional approximation, can improve performance and stability. As before, the proposal distributions are labelled with a T for transport map, and when using the intermediate space we prepend 'log' to the proposal method, e.g. MH-logRW and MH-logTRW, along with ETAIS-logRW and ETAIS-logTRW. Figure 5 (bottom) displays the composition of maps when we include this log preconditioner, leading to an intermediate space Y. The inclusion of this additional map means that we must again alter our importance weight definition to reflect the pullback from R through Y, where θ is a proposed new state, θ^{(i−1)} is the ensemble of states from the previous iteration, and J_{T̂∘log}(θ) is the Jacobian of the composition of the two maps. This Jacobian is straightforward to calculate via the chain rule,

|J_{T̂∘log}(θ)| = |J_{T̂}(log(θ))| |J_log(θ)|,

where the first determinant is computed as in (3), and the second is given by

|J_log(θ)| = ∏_{i=1}^d 1/θ_i.

For this problem, we continue to use monomials in each dimension in our transport map construction. We use polynomials of total order p = 3 as the basis functions, i.e.
the multi-index sets J_i = { j ∈ N_0^d : Σ_{k=1}^d j_k ≤ p and j_k = 0 for all k > i }, so that the ith component of the map depends only on the first i variables. We use the MT [51] resampler with an ensemble size of M = 500. This increase in ensemble size, in comparison with the previous example in Section 5, allows for the increase in parameter dimension. To measure the convergence of the sampling methods in this section, we compare the convergence of the mean of each parameter. We approximate E(k) using 2.4 billion samples from the MH-RW algorithm. We do this for each of the eight algorithms we have so far discussed. Convergence is shown only for the CMA approximation of the posterior (19), with effective rate for the degradation of S given by (23), but we expect very similar results for the other multiscale approximations of the posterior distribution discussed.
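The Jacobian of the composed map T̂ ∘ log described above factorises by the chain rule into the (triangular) Jacobian of T̂ and the diagonal Jacobian of the log map. A sketch of this computation, where `jac_T` is a hypothetical routine returning the Jacobian of the fitted map at a point in the intermediate space:

```python
import numpy as np

def log_det_jacobian(theta, jac_T):
    """log |J_{T∘log}(theta)| = log |J_T(log theta)| + log |J_log(theta)|."""
    theta = np.asarray(theta, dtype=float)
    y = np.log(theta)                    # point in the intermediate space Y
    JT = jac_T(y)
    # for a (lower) triangular map the determinant is the product of the diagonal
    log_det_T = np.sum(np.log(np.abs(np.diag(JT))))
    # the log map has diagonal Jacobian diag(1/theta_1, ..., 1/theta_d)
    log_det_log = -np.sum(np.log(theta))
    return log_det_T + log_det_log
```

Working with log-determinants rather than the determinants themselves avoids under- and overflow when the map components have very different scales.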
The optimal scaling parameters for the proposal distributions are given in Table 3. These are precomputed by running the algorithms for a range of parameter values. MH-RW based algorithms are tuned to an acceptance rate close to the optimal 23.4%, and ETAIS algorithms are tuned to maximise the effective sample size (ESS). The optimal parameter values for the MH algorithms are not particularly informative about their performance, since they relate to algorithms operating on different transformations. The ESS is highest for the logTRW variant of the ETAIS algorithm, as we would hope. We note here that, as outlined in [18], it is possible to construct adaptive versions of ETAIS which automatically tune these algorithmic parameters, but we compare against the static optimal parameter values for all algorithms to ensure fairness. Convergence of the eight algorithms for this example is shown in Figure 6. We first note the poor performance of the MH based algorithms, despite being preconditioned with a precomputed sample covariance. Only the MH-TRW is at all competitive with the ETAIS algorithms. During the simulation interval, the MH-TRW algorithm has not settled down to the expected O(1/√N) rate, which means that the estimate is still biased by the long burn-in time. As we have seen in previous examples, the burn-in time for the ETAIS algorithm is negligible. The ETAIS variants with RW and TRW proposals perform similarly on both sample spaces. When sampling without a preconditioner, the transport map is not quite as efficient, largely due to the difficulties discussed in the previous section, i.e. many proposals are made which map to negative reaction rates. Sampling with transport maps with a log preconditioner leads to more comparable Monte Carlo errors, with the logTRW being apparently slightly less stable. This proposal method becomes more stable as we increase either the ensemble size, or the number of iterations between updates of the transport map T̂.
Overall we see the smallest Monte Carlo errors for a given number of likelihood evaluations coming from the ETAIS-logTRW algorithm. We now look at the proposal distributions of the transport map accelerated algorithms. In Figure 7, we see the reference spaces found by (a) mapping the posterior through the map T̂, and (b) mapping the posterior through T̂ ∘ log. For the most part, each of these marginal distributions can be recognised as close to Gaussian. However, with the exception of P(r_2, r_3), we would not consider them to be close to a standard N(0, I) distribution. Before concluding that the transport map has not helped us to find a 'nicer' space on which to propose new values, we should note that the dimensions are now (1) largely uncorrelated, and (2) the variances in each dimension are much more similar than they are in Figure 4. In Figure 7 (b) we see that var(r_1) and var(r_4) are much smaller than var(r_2) and var(r_3). To combat this we have a number of choices. We might wish to learn the covariance structure of the push forward of the target onto the reference space, and incorporate this information into the covariances of the mixture components in an adaptive scheme. Another option is to increase the total order of our index set. For these numerics we have chosen p = 3, but we can obtain reference spaces which are closer to N(0, I_d) by choosing a larger p.

Comparison of the Constrained and QEA approaches
The convergence analysis in the previous section was performed for the constrained approximation of the posterior distribution. We now look at the differences between the CMA and QEA approximations of the posterior distribution arising in this problem. Recall that the approaches differed only in the form of the effective degradation propensity ᾱ_4:

ᾱ_4^QEA(S) = k_2 k_4 S / (k_2 + k_3),   ᾱ_4^CMA(S) = k_2 k_4 S / (k_2 + k_3 + k_4).

This difference in the denominator causes a shift in the manifold on which the posterior is concentrated, as can be seen in Figure 8, in which we plot the difference in the posterior densities π_CMA(k|R, T) − π_QEA(k|R, T).
Since the two posteriors have been approximated using an MCMC sample, there is a visible amount of Monte Carlo noise, particularly in the tails of the distributions. We can see that the differences in the marginals for k_1, k_2, and k_3 are relatively small, but the marginal for k_4 varies by a significant amount. In the QEA approximation, the effective rate of degradation of S is given by k̄_4^QEA = k_2 k_4/(k_2 + k_3), and as such, our observations should be informative about this quantity. Using the CMA, this effective rate is approximated by k̄_4^CMA = k_2 k_4/(k_2 + k_3 + k_4). To validate our inferences on k_2, k_3 and k_4, we would like to discover which model is most accurate and informative about these quantities, and which model gives us results which are most similar to those which can be obtained from the full model, with all reactions (including fast reactions) observed. A conventional way to compare two models within a Bayesian framework is to calculate Bayes factors [14]. The Bayes factor, B_{1,2}, between two models, M_1 and M_2, given data D, can be interpreted as a ratio of the normalisation constants of the posterior distributions given each model,

B_{1,2} = Z_1 / Z_2.

Under the ETAIS framework, it is straightforward to calculate these factors using the Monte Carlo estimator for the normalisation constants. From [50], each normalisation constant can be estimated by the average of the unnormalised importance weights produced by the sampler. The full model M_0 depends on all four parameters θ_0 = (k_1, k_2, k_3, k_4). The marginal densities for the data, evaluated at the observed data, given the models are displayed in Table 4. These Bayes factors provide evidence for our argument that the constrained approximation of the posterior is superior to the QEA approximation. We see that π_0, the density arising from the full model with all reactions observed, is peaked a distance away from π_2, the QEA approximation.
The CMA approximation π_1, in contrast, is peaked close to the peak of π_0, but there is more uncertainty here since we do not observe the fast reactions. In Figure 9 (b), the marginal densities for k̄_4^CMA for π_0 and π_1 are barely distinguishable, even with the vastly reduced data available in the posterior π_1 with the CMA approximation. The QEA approximation π_2 is again peaked far away from both of the other distributions. This demonstrates that the data regarding the slow variable S are only informative about k_1 and k̄_4^CMA, as predicted by the constrained multiscale approximation.
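The Bayes factors above are cheap to obtain within the importance sampling framework, since each normalisation constant is estimated by the average of the unnormalised importance weights. A sketch, working in log space for numerical stability (function names are ours):

```python
import numpy as np

def log_evidence(log_w):
    """Estimate log Z from the log unnormalised importance weights
    log w_i = log pi_unnorm(theta_i) - log q(theta_i), via Z ~ mean(w_i)."""
    log_w = np.asarray(log_w, dtype=float)
    m = log_w.max()
    return m + np.log(np.mean(np.exp(log_w - m)))  # stable log-mean-exp

def log_bayes_factor(log_w1, log_w2):
    """log B_{1,2} = log Z_1 - log Z_2 for models M_1 and M_2."""
    return log_evidence(log_w1) - log_evidence(log_w2)
```

The max-subtraction trick matters here: the unnormalised weights of a concentrated posterior can easily under- or overflow in double precision.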

Gene Regulatory Network
In this example, we look at a model of a gene regulatory network (GRN) mechanism [6,36,39] used by a cell to regulate the amount of a protein, P , present. Proteins can bind in pairs to form a dimer D. The dimer can bind to the gene G and convert it into a "switched off" state G , in which the production of mRNA molecules M is inhibited. The mRNA encodes for production of the protein P , and both P and M can degrade. The eight reactions are given in Equation (25).
We create trajectories of this system using the Gillespie SSA, with a fixed set of reaction rate parameters, over the time period t ∈ [0, 500]. T-cells can be manufactured which bind to proteins of a certain type and cause them to phosphoresce, allowing for approximate observations of the concentration levels of that protein. However, the T-cells will not necessarily differentiate between monomers and dimers of the protein.
We approximately replicate such a scenario, in a zero-observational-noise setting, by assuming that we can only observe T = P + 2D + 2G′, the total number of protein molecules in the system, alongside M, the concentration of mRNA. The number of dimers, and the status of the genetic switch, are assumed unobservable. In order to sample from the posterior distribution on the reaction parameters in the system given these observations, we can either integrate over all possible trajectories of the fast variables (which is computationally intractable), or, as in Section 7.1, we can use a multiscale method to approximate the posterior. The effective dynamics of T and M are then given by the reduced system, in which the propensities are displayed, as opposed to the rates, with approximations of the propensities ᾱ_5 and ᾱ_7 to be computed. We omit the derivation of these effective propensities, using the CMA methodology, for brevity, but interested readers can find more details in [51].

Target Distribution
As alluded to in the previous section, we aim to sample from the CMA approximation of the posterior distribution, as defined in Equation (19). This example is more involved than the previous one, since we have an eight-dimensional parameter space corresponding to the eight reaction rates in the system (25). Half of those reactions are unobservable, and their effect on the posterior is felt only through their effect on the effective propensities ᾱ_5 and ᾱ_7, and through observations of the frequencies of these reactions.
The priors for this problem were chosen to be fairly uninformative, and a list of the hyperparameters corresponding to each Gamma prior can be found in Table 5.
The one- and two-dimensional marginal distributions for the full posterior distribution are displayed in Figure 10. From these joint marginals, the posterior does not appear to contain many interesting correlation structures, although several dimensions have leptokurtic (heavy-tailed) marginals which are difficult for the standard ETAIS algorithm to sample from. However, as with the previous example, the density is concentrated close to a lower-dimensional manifold. The marginal densities also vary over very different scales, which could lead to problems for the algorithms which do not benefit from the transport map.

Implementation
We apply the eight RW and logRW proposal distributions, with and without transport map acceleration, which were discussed in Section 7.1, to this posterior. As with that example, the log preconditioner is required so that our transport map is a function T̂: R^d → R^d, rather than a map from R_+^d to R^d. The transport map setup has not changed from the previous example. However, due to the scaling of the weights, we have had to be more careful about which samples we use to build our map. We require that the objective function C(T), from Equation (7), is convex, which requires that the second derivative, given in Equation (11), is positive definite. This positive definiteness depends on all of the weights in our sample being strictly positive, which is not always possible on a computer, where very small positive numbers can be rounded down to absolute zero. For this reason we filter out of the optimisation problem any samples whose weight is a numeric zero. This does not affect the validity of the map, since a sample with weight zero would not contribute to the expectation. In all eight algorithms we adapt the proposal variances during the burn-in phase to avoid the need for a huge ensemble size to approximate the posterior with a mixture of isotropic Gaussian kernels. We use the MT resampler, this time with an ensemble size of M = 2500. The increase in ensemble size compensates for the increase in parameter dimension from four to eight. As in the previous chemical reaction example, we measure convergence of the algorithms only through the convergence of the mean. We use the sample Mahalanobis distance, with 'true' covariance and 'true' mean built from a sample of size 2.4 billion using the MH-RW algorithm.
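The zero-weight filtering step described above can be sketched as follows (names are ours):

```python
import numpy as np

def filter_zero_weights(samples, weights):
    """Drop ensemble members whose importance weight underflowed to a
    numeric zero before fitting the transport map. Such samples add
    nothing to the weighted expectation defining C(T), but they break
    the strict positivity needed for the optimisation to remain convex."""
    samples = np.asarray(samples, dtype=float)
    w = np.asarray(weights, dtype=float)
    keep = w > 0.0
    return samples[keep], w[keep] / w[keep].sum()
```

Renormalising the surviving weights keeps the weighted objective on the same scale regardless of how many samples are discarded.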

Convergence results
The scaling parameters used are selected by optimising the acceptance rate for the MH algorithms, and optimising the effective sample size for the ETAIS algorithms. The optimal parameters are given in Table 6. Here, for the Metropolis-Hastings variants, β*_% is the variance of the proposal, optimised for an acceptance rate of 50%. For the ETAIS variants, β*_ESS is the variance of each of the proposal kernels around each particle, optimised numerically for the maximal effective sample size (ESS). ESS/M denotes the ratio of the ESS to the number of particles, with values closer to one denoting a more efficient algorithm.

Table 6: Optimal scaling parameters and effective sample sizes for the MH variants (RW, logRW, TRW, log-TRW) and the ETAIS variants (RW, logRW, TRW, log-TRW).

From the effective sample sizes shown in Table 6 we can see the improvement to the efficiency of the algorithm both from transforming the parameter space into something closer to Gaussian, and additionally from utilising a log preconditioner. Due to the numerical cost of calculating the full relative L^2 errors for this posterior, we quantify the convergence speeds of these algorithms using the sample Mahalanobis distance between the sample mean and the 'true' mean. This convergence analysis is shown in Figure 11. We believe that an ensemble size of 2500 is over-cautious, despite the larger number of dimensions in this example, and that the ensemble size required to sample from these posteriors is reduced by the use of the transport map. The second obvious feature of these convergence plots is that the ETAIS algorithms outperform the MH algorithms by a large margin: roughly a reduction of two orders of magnitude in the relative error over the same number of samples. We note that in this example the ETAIS-TRW algorithm does not perform as well as the ETAIS-RW algorithm. This is due to the lack of a log preconditioner. The posterior has support only on the positive orthant, and without the preconditioner, the transport map has to work hard to enforce this. With a relatively low order approximation being used for the transport map, this makes it more likely that the map produced hinders sampling rather than improving it. This highlights the importance of using preconditioners with transport maps, in order to ensure absolute continuity of the posterior distribution with respect to the pullback of the proposal distribution on the reference space.
In contrast, when we look at the convergence of these two algorithms with the log preconditioner, the transport map version exhibits more consistent and stable convergence; some of the 32 repeats of the ETAIS-logRW algorithm produce rare large weights which hinder convergence. The large increase in the effective sample size observed between the ETAIS-logRW and ETAIS-logTRW algorithms is converted into an estimate which is twice as accurate after 1 million samples. This demonstrates our claim that the transport map can produce stable ETAIS implementations with smaller ensemble sizes than would otherwise be possible. A similar pattern is seen in the MH algorithms, where the MH-TRW performs the worst, and the MH-logTRW algorithm performs the best, although only marginally.

Discussion
In this paper, we have investigated the use of transport maps to accelerate and stabilise ensemble importance sampling schemes. We considered the ETAIS methodology here, but the approach is applicable to all such methods. The idea is powerful since it allows adaptive importance sampling schemes to be efficient and stable for smaller ensemble sizes. This in turn suggests that the improvements in performance that adaptive importance sampling schemes can deliver could be made applicable to a broader class of problems. In particular, we looked at the application of these methods to inverse problems arising from multiscale stochastic chemical networks, where the posterior distributions which arise can be highly concentrated along curved manifolds in parameter space. Through these results we also demonstrated the improvement in the accuracy of inference for multiscale problems when more sophisticated multiscale approximations are employed. Below we highlight two areas for future work.

Sampling in higher dimensions
One major issue with importance sampling-based algorithms is the curse of dimensionality. This curse is two-fold: as the dimension of the target distribution increases, a larger ensemble is often required in order to capture the additional complexity of the density that the higher dimension permits, and moreover the cost of the resampling step within the algorithm grows with the ensemble size; for example, the ETPF algorithm is O(n^2) in the ensemble size n. However, the simplification afforded by the transport map goes some way towards tackling this problem.
Here, we briefly discuss how transport maps could make moderately higher dimensional problems accessible to this family of methods. It is important to emphasise that these methods are unlikely to be suitable for anything other than low dimensional problems, but the idea we present here, which is the subject of future investigation, could make them applicable to problems with more than a handful of parameters. Armed with a high quality transport map, we can transform the target distribution to something close to a Gaussian reference measure, which is uncorrelated in all dimensions. This lack of correlation allows us to resample in each dimension separately. Resampling in a single dimension allows for optimisations in the resampling code, and also means that the resampler is not affected by the curse of dimensionality. In one dimension, the ETPF algorithm can be implemented very efficiently. As described in [49], when the state space is ordered, the coupling matrix has all of its non-zero entries in a staircase pattern. We exploit this structure in Algorithm 4, which is much faster than using the simplex algorithm to minimise the associated cost function, and faster than the MT algorithm [18], which is itself a greedy approximation to the output of the ETPF resampler. Moreover, the task of running this resampler can then be parallelised over d processors. This could be an area for further study.

[Algorithm 4: fast one-dimensional ETPF resampler; only a fragment of its pseudocode survives extraction.]
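The staircase structure of the one-dimensional coupling can be computed in a single pass over the sorted particles, moving mass into output particles of capacity 1/M and accumulating each output position as the M-weighted average of its donors. The following is a minimal sketch of this idea, not a transcription of the paper's Algorithm 4; the function name and numerical tolerance are our own.

```python
import numpy as np

def etpf_resample_1d(x, w):
    """One-dimensional ETPF-style resampler exploiting the staircase
    structure of the optimal coupling between sorted particles.

    x : particle positions, sorted in increasing order, shape (M,)
    w : normalised importance weights (summing to 1), shape (M,)
    Returns an equally weighted ensemble y, where each y_j is M times
    the mass-weighted sum of the input particles donating mass to it.
    """
    M = len(x)
    y = np.zeros(M)
    j = 0            # current output particle, each with capacity 1/M
    cap = 1.0 / M    # remaining capacity of output particle j
    for i in range(M):
        t = w[i]     # mass of input particle i still to be moved
        while t > 1e-12:
            s = min(t, cap)        # mass moved along this staircase step
            y[j] += M * s * x[i]   # accumulate the weighted average
            t -= s
            cap -= s
            if cap <= 1e-12:
                if j == M - 1:
                    break          # rounding remainder; nothing left to fill
                j += 1
                cap = 1.0 / M
    return y
```

Note that the transformed ensemble preserves the weighted mean of the input ensemble, since each unit of mass keeps its position in the averaging.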

Invertibility of the transport map
In this paper, we have considered an approach similar to that presented in [46], in which we aim to find an invertible map which approximates the true transport map between the reference measure and the target measure. The true map, however, is often itself not invertible. By embedding the map in higher dimensions [44,54,56], these issues of non-invertibility may be avoided, and higher quality maps may be found which better aid exploration and sampling of the whole state space.

Adaptivity of ensemble sizes
At initialisation, our transport map approximation is equal to the identity, and the algorithm is essentially the standard ETAIS algorithm. In order to build a good transport map approximation, we require a good sample from the posterior distribution; since our aim is to apply this methodology to challenging problems, this is itself a challenge. In the limit of the number of ensemble members going to infinity, this approach is consistent and stable, but large ensemble sizes come with large computational costs. We propose that the transport map acceleration could be used in conjunction with an adaptive ensemble size mechanism. First, we initialise the algorithm with a large ensemble size and the identity transport map. As we explore the density, the transport map quickly improves, making it possible to sample efficiently and stably with a smaller ensemble size. The problem then amounts to finding a reliable method for approximating the smallest stable ensemble size.
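As a purely illustrative sketch of such a mechanism (the rule, thresholds, and function name below are ours, not from the paper), one could shrink the ensemble whenever the relative effective sample size remains high over a window of recent iterations:

```python
def propose_ensemble_size(current_size, rel_ess_history, window=10,
                          threshold=0.5, shrink=0.8, floor=32):
    """Hypothetical ensemble-size rule: if the relative ESS (ESS divided
    by the ensemble size) exceeded `threshold` over each of the last
    `window` iterations, the sampler is judged stable and the ensemble
    shrinks by the factor `shrink`, never going below `floor`.
    All constants here are illustrative placeholders."""
    recent = rel_ess_history[-window:]
    if len(recent) == window and min(recent) > threshold:
        return max(floor, int(current_size * shrink))
    return current_size
```

A rule of this kind would be applied between importance sampling iterations, after the transport map has been refreshed, so that the ensemble contracts only once the transformed target is easy to sample.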