Ensemble Transport Adaptive Importance Sampling

Markov chain Monte Carlo methods are a powerful and commonly used family of numerical methods for sampling from complex probability distributions. As applications of these methods increase in size and complexity, the need for efficient methods grows. In this paper, we present a particle ensemble algorithm. At each iteration, an importance sampling proposal distribution is formed using an ensemble of particles. A stratified sample is taken from this distribution and weighted under the posterior; a state-of-the-art ensemble transport resampling method is then used to create an evenly weighted sample ready for the next iteration. We demonstrate that this ensemble transport adaptive importance sampling (ETAIS) method outperforms MCMC methods with equivalent proposal distributions for low-dimensional problems, and in fact shows better than linear improvements in convergence rates with respect to the number of ensemble members. We also introduce a new resampling strategy, multinomial transformation (MT), which, while not as accurate as the ensemble transport resampler, is substantially less costly for large ensemble sizes, and can be used in conjunction with ETAIS for complex problems. We also show how algorithmic parameters of the mixture proposal can be quickly tuned to optimise performance. In particular, we demonstrate this methodology's superior sampling for multimodal problems, such as those arising from inference for mixture models, and for problems with expensive likelihoods requiring the solution of a differential equation, for which speed-ups of orders of magnitude are demonstrated. Likelihood evaluations of the ensemble can be computed in a distributed manner, suggesting that this methodology is a good candidate for parallel Bayesian computations.

1. Introduction. Markov chain Monte Carlo (MCMC) methods are a powerful family of tools that allow us to sample from complex probability distributions. MCMC methods were first developed in the 70s [12], and with the development of faster, more powerful computers, have become ever more important in a wide range of fields in statistics, science and engineering. In particular, when considering Bayesian inverse problems, each MCMC step may involve the numerical solution of one or more PDEs. As many samples are usually required before the Monte Carlo error is reduced to acceptable levels, the application of MCMC methods to these types of problem remains frustratingly out of our grasp. Many advances have been made in the field of MCMC to design ever more sophisticated methods that propose moves more intelligently, leading to rapidly converging approximations. Function space versions of standard methods such as the random walk Metropolis-Hastings (RWMH) algorithm or the Metropolis-adjusted Langevin algorithm (MALA), whose convergence rates are independent of dimension, have been developed [6]. The hybrid (or Hamiltonian) Monte Carlo (HMC) method uses Hamiltonian dynamics in order to propose and accept moves to states far from the current position [26], and function space analogues of this have also been proposed [1]. Riemann manifold Monte Carlo methods exploit the Riemannian geometry of the parameter space, and are able to take advantage of the local structure of the target density to produce more efficient MCMC proposals [10]. This methodology has been successfully applied to MALA-type proposals and to methods which exploit even higher order gradient information [2]. These methods allow us to explore the posterior distribution more fully with fewer iterations. Simultaneously, great strides are continually being made in the development of computing hardware.
Moore's law, which predicted that the number of transistors that can fit onto a single microchip would double every two years, has largely held since the early 70s [17]. In recent times, it has become necessary to use parallel architectures in order for this trend to continue. The efficient exploitation of these architectures is key to solving many of the computational challenges that we currently face. As such, the development of efficient parallel MCMC algorithms is an important area for research. Since MCMC methods can be naively parallelised by simply running many independent chains in parallel, the focus needs to be on the development of methods which gain some added benefit through parallelism. One class of parallel MCMC methods uses multiple proposals, with only one of these proposals being accepted. Examples of this approach include multiple-try MCMC [16] and ensemble MCMC [18]. In [3], a general construction for the parallelisation of MCMC methods was presented, which demonstrated speed-ups of up to two orders of magnitude when compared with serial methods. In this paper, we present a framework for the parallelisation of importance sampling, which can be built around many of the current Metropolis-based methodologies in order to create an efficient proposal for the target from the current states of all of the chains in the ensemble. The idea is to consider the current state of each of a set of parallel chains as an ensemble, and to resample using a transformation based on optimal transport. Samplers based on optimal transport have also been considered in [7]. In Section 2 we introduce some mathematical preliminaries upon which we will later rely. In Section 3 we present the general framework of the PAIS algorithm. In Section 4 we consider adaptive versions of PAIS which automatically tune algorithmic parameters concerned with the proposal distributions.
In Section 5 we introduce the approximate multinomial resampling (AMR) algorithm, a less accurate but faster alternative to resamplers which solve the optimal transport problem exactly. In Section 6 we present some numerical examples, before a brief conclusion and discussion in Section 7.

2. Preliminaries.
In this section we introduce the preliminary topics and algorithms that will be referred to throughout the paper.
2.1. Bayesian inverse problems. In this paper, we focus on the use of MCMC methods for characterising posterior probability distributions arising from Bayesian inverse problems. We wish to learn about a particular unknown quantity x, of which we are able to make direct or indirect noisy observations. For now, we say only that x is a member of a Hilbert space X.
The parameter x is observed through the observation operator G : X → R^d. Since observations are never perfect, we assume that these measurements D are subject to additive Gaussian noise, so that

D = G(x) + ε,  ε ∼ N(0, Σ).   (2.1)

For example, if x are the rates of reactions in a chemical system, G might return the quantities of each chemical species at a particular time, or some summary of this information. These modelling assumptions allow us to construct the likelihood of observing the data D given the parameter x = x^*. Rearranging (2.1) and using the distribution of ε, we get

L(D | x^*) ∝ exp( −½ ‖D − G(x^*)‖_Σ² ),

where ‖y_1 − y_2‖_Σ = ‖Σ^{−1/2}(y_1 − y_2)‖ is the Mahalanobis distance between y_1 and y_2 ∈ R^d.
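As a concrete illustration, the Gaussian log-likelihood above can be evaluated numerically as follows (a minimal sketch under the stated additive-noise assumptions; the names `log_likelihood`, `G`, `D` and `Sigma` are illustrative placeholders, not the paper's code):

```python
import numpy as np

def log_likelihood(x, D, G, Sigma):
    """Gaussian log-likelihood log L(D | x), up to an additive constant,
    computed via the Mahalanobis distance ||D - G(x)||_Sigma."""
    r = D - G(x)                          # residual between data and model output
    return -0.5 * r @ np.linalg.solve(Sigma, r)
```

For a perfect match between data and model output the log-likelihood is maximal (zero up to the dropped constant), and it decreases quadratically in the Mahalanobis distance.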
As discussed in [5,27], in order for this inverse problem to be well-posed in the Bayesian sense, we require the posterior distribution, µ^D, to be absolutely continuous with respect to the prior, µ_0. A minimal-regularity prior can be chosen, informed by regularity results for the observation operator G. Given such a prior, the Radon-Nikodym derivative of the posterior measure µ^D with respect to the prior measure µ_0 is proportional to the likelihood:

(dµ^D/dµ_0)(x) ∝ exp( −½ ‖D − G(x)‖_Σ² ).

2.2. Particle filters and resamplers.
In several applications, data must be assimilated in an "online" fashion, with up-to-date observations of the studied system being made available on a regular basis. In these contexts, such as in weather forecasting or oceanography, data are incorporated using a filtering methodology. One popular filtering method is the particle filter, the first of which was dubbed the bootstrap filter [11]. In this method, a set of weighted particles is used to represent the posterior distribution. The positions of the particles are updated using the model dynamics. Then, when more observations are made available, the relative weights of the particles are updated to take account of these data, using Bayes' formula. Other filtering methods, such as the Kalman filter [14] and ensemble Kalman filter [8], have also been developed and are often used within the data assimilation community. One advantage of the particle filter is that there are convergence results for this method as the number of particles is increased. One downside is that the required ensemble size increases quickly with dimension, making it difficult to use in high-dimensional problems. Another downside is that the effective sample size decreases at each iteration, resulting in degeneration of the approximation of the posterior. One way to tackle this is to employ a resampling scheme. The aim of a successful resample is to take the unevenly weighted ensemble and return a new ensemble of particles with even weights which is highly correlated with the original samples. The ensemble transform particle filter (ETPF) proposed by Reich [20] makes use of optimal transportation as described in [28,29]. The transform takes a sample of weighted particles {y_i}_{i=1}^M from µ_Y and converts it into a sample of evenly weighted particles {x_i}_{i=1}^M from µ_X, by means of defining a coupling T^* between Y and X.
Given that a trivial coupling T_t always exists in the space of transference plans, Π(µ_X, µ_Y), we can find a coupling T^* which maximises the correlation between X and Y [4]. This coupling is the solution to a linear programming problem in M² variables with 2M − 1 constraints. Maximising the correlation ensures that the new sample is as much like the original sample as possible, with the additional property that the sample is evenly weighted. A Monte Carlo resampling algorithm can then be implemented: given a weighted ensemble, we solve the optimal transport problem to obtain the coupling described above, from which we can draw a new, evenly weighted sample. Rather than sampling, Reich suggests using the mean of each of the resulting distributions to produce a consistent estimator. Analysis of this method shows that as the ensemble size increases, the statistics of the evenly weighted sample approach those of the posterior distribution.
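The linear programme and the deterministic (mean) transform can be sketched in a few lines. This is an illustrative implementation assuming `scipy` is available; the function name `etpf_resample` and its interface are our own, and the quadratic cost stands in for maximising correlation:

```python
import numpy as np
from scipy.optimize import linprog

def etpf_resample(y, w):
    """Deterministic ensemble transform resampling (sketch of the ETPF).

    y : (M, d) array of weighted particles; w : (M,) normalised weights.
    Solves the optimal transport linear programme (M^2 variables,
    2M - 1 independent equality constraints), then maps each particle
    to the mean of its transport column, preserving the weighted mean.
    """
    M = len(w)
    # cost c_ij = ||y_i - y_j||^2; minimising it maximises correlation
    cost = ((y[:, None, :] - y[None, :, :]) ** 2).sum(axis=-1).ravel()
    A_eq = np.zeros((2 * M, M * M))
    for i in range(M):
        A_eq[i, i * M:(i + 1) * M] = 1.0   # row sums: sum_j t_ij = w_i
        A_eq[M + i, i::M] = 1.0            # column sums: sum_i t_ij = 1/M
    b_eq = np.concatenate([w, np.full(M, 1.0 / M)])
    res = linprog(cost, A_eq=A_eq, b_eq=b_eq, bounds=(0, None), method="highs")
    T = res.x.reshape(M, M)
    # new evenly weighted particles: x_j = M * sum_i T_ij y_i
    return M * T.T @ y
```

Because each new particle is a convex combination of the old ones with column weights summing to 1/M, the weighted mean of the input ensemble is reproduced exactly (up to solver tolerance).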
2.3. Deficiencies of Metropolis-type MCMC schemes. All MCMC methods are naively parallelisable. One can take a method and simply implement it simultaneously over a set of processors in an ensemble. The states of all of the ensemble members can be recorded, and in the time that it takes one MCMC chain to draw N samples, M ensemble members can draw NM samples. However, we argue that this is not an optimal scenario. First of all, unless we have a lot of information about the posterior, we will initialise the algorithm in the tails of the distribution. The samples that are made initially, as the algorithm finds its way to the mode(s) of the distribution, cannot be considered samples from the target distribution, and must be thrown away. This process is known as burn-in. In a naively parallelised scenario, each ensemble member must perform this process independently, and therefore mass parallelisation makes no inroads into cutting this cost. Moreover, many MCMC algorithms suffer from poor mixing, especially in multimodal systems. The number of samples that it takes for an MCMC trajectory to switch between modes can be large, and given that a large number of switches are required before we have a good idea of the relative probability densities of these different regions, it can be prohibitively expensive. Another aspect of Metropolis-type samplers is that information computed about a proposed state is simply lost if we choose to reject that proposal in the Metropolis step. An advantage of importance samplers is that no evaluations of G are ever wasted, since all samples are saved along with their relative weighting. Moreover, a naively parallelised MCMC scheme is exactly that: naive. Intuition suggests that we can gain some speed-up by sharing information across the ensemble members, and this is what we wish to demonstrate in this paper.
These deficiencies of the naive method of parallelising MCMC methods motivated the development of the Parallel Adaptive Importance Sampler (PAIS). In the next section we will introduce the method in its most general form.
3. The Parallel Adaptive Importance Sampler (PAIS). Importance sampling can be a very efficient method for sampling from a probability distribution. A proposal density is chosen, from which we can draw samples. Each sample is assigned a weight given by the ratio of the target density and the proposal density at that point. Importance samplers are efficient when the proposal density is concentrated in similar areas to the target density, and incredibly inefficient when this is not the case. The aim of the PAIS is to use an ensemble of states, each coming from an MCMC chain, to construct a proposal distribution which is as close as possible to the target density. If this ensemble is large enough, the distribution of states will be representative of the target density. The proposal distribution could be constructed in many different ways, but we choose to use a mixture distribution, made up of a sum of MCMC proposal distributions centred on each of the members of the ensemble. Once the proposal is constructed, we sample a new set of states from the proposal distribution, and each is assigned a weight given by the ratio of the target density and the proposal mixture density. Assuming that our proposal distribution is a good one, the variance of the weights will be small, and we will have many useful samples. Finally, we need to create a set of evenly weighted samples which best represents this set of weighted samples. This is achieved by implementing a resampling algorithm. Initially we will use the ETPF algorithm [20], although we will suggest an alternative strategy in Section 5. The output of the resampling algorithm gives us a set of evenly weighted samples that we believe represents the target distribution well, and from this point we can iterate the process once again. The algorithm is summarised in Table 3.1. We wish to sample states x ∈ X from a posterior probability distribution µ^D, where D represents the data which we wish to use to infer x.
Since we have M ensemble members, we represent the current states of all of the Markov chains as a vector X = [x_1, x_2, . . . , x_M]^T. We are also given a transition kernel ν(·, ·), which might come from an MCMC method, for example the random walk Metropolis-Hastings proposal density ν(·, x) ∼ N(x, β²), where β² ∈ R defines the variance of the proposal. Since the resampling does not give us a statistically identical sample to that which is inputted, we cannot assume that the samples X^(i) are samples from the posterior. Therefore, as with serial importance samplers, the weighted samples (Y^(i), W^(i)) are the samples from the posterior that we will analyse. The key is to choose a suitable transition kernel ν such that, if X^(i) is a good representative sample of the posterior, then the mixture density χ(·; X^(i)) is a good approximation of the posterior distribution. If this is the case, the newly proposed states Y^(i) will also be a good sample of the posterior, with low variance in the weights W^(i). In Section 6, we will demonstrate how the algorithm performs, primarily using RWMH proposals. We do not claim that this choice is optimal; it is simply chosen as an example to show that sharing information across ensemble members can improve on the original MCMC algorithm and lead to convergence in fewer evaluations of G. This is important since, if the inverse problem being tackled involves computing the likelihood from a very large data set, this could lead to a large saving in computational cost. We have observed that using more complex (and hence more expensive) kernels ν does not significantly improve the speed of convergence of the algorithm for simple examples [25]. Care needs to be taken when choosing the proposal distribution to ensure that the proposals are absolutely continuous with respect to the posterior distribution. In Section 6.3 we will consider an inverse problem with Gamma priors.
Since the posterior distribution in this context is heavy-tailed, if we use a lighter-tailed distribution for ν, such as a Gaussian kernel, then the importance weights in the tails will be unbounded, which will hamper convergence of the algorithm.
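One iteration of the loop described above can be sketched as follows. This is an illustrative sketch, not the paper's implementation: RWMH kernels form the mixture proposal, a stratified sample takes one draw per mixture component, and plain multinomial resampling stands in for the ETPF step for brevity; `pais_step` and its interface are our own names:

```python
import numpy as np

def pais_step(X, log_target, beta, rng):
    """One PAIS-style iteration (sketch): build a Gaussian RW mixture
    proposal from the ensemble X, draw a stratified sample, weight it
    under the target, and resample back to even weights."""
    M, d = X.shape
    Y = X + beta * rng.standard_normal((M, d))      # one draw per component
    # log mixture density chi(y; X) = (1/M) sum_i N(y; x_i, beta^2 I),
    # up to constants common to target and proposal
    sq = ((Y[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    log_comp = -0.5 * sq / beta ** 2 - d * np.log(beta)
    log_chi = np.logaddexp.reduce(log_comp, axis=1) - np.log(M)
    # self-normalised importance weights w = pi(y) / chi(y; X)
    logw = np.array([log_target(y) for y in Y]) - log_chi
    w = np.exp(logw - logw.max())
    w /= w.sum()
    # resampling step (a multinomial stand-in for the ETPF/AMR)
    X_new = Y[rng.choice(M, size=M, p=w)]
    return Y, w, X_new
```

The weighted pairs (Y, w) are what one would accumulate for posterior estimates; X_new seeds the next iteration's mixture.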
4. Automated tuning of Algorithm Parameters. Efficient selection of scaling parameters in MCMC algorithms is critical to achieving optimal mixing rates, and hence fast convergence to the target density. One aspect worthy of consideration with the PAIS is finding an appropriate proposal kernel ν such that the mixture distribution χ is a close approximation to the posterior density π. If the proposal distribution is too over-dispersed, then the algorithm will often propose states in the tails of the distribution, resulting in larger variance of the weights, and therefore slower convergence to the posterior distribution. Conversely, if the proposal distribution is under-dispersed, the proposals will be highly correlated with the previous states, the algorithm will take a long time to fully explore the parameter space, and, worse, states will regularly be proposed in the tails of the proposal distribution with very large weights. It is therefore necessary to find a proposal distribution which is slightly over-dispersed, to ensure the entire posterior is explored [9], but which is as close to the posterior as possible. Most commonly used MCMC proposals have a parametric dependence which allows the user to control their variance. For example, in the RWMH proposal y = x + βη, the parameter β controls how correlated the proposed state y is with the current state x. Therefore the proposal distributions can be tuned such that they are slightly over-dispersed, as described above. This tuning can take place during the burn-in phase of the algorithm. Algorithms which use this method to find optimal proposal distributions are known as adaptive MCMC algorithms, and have been shown to be convergent provided that they satisfy certain conditions [22,23]. Algorithms which use mixture proposals, e.g. PAIS, must tune the variance of the individual kernels within the proposal mixture.
This adaptivity during the burn-in has some added benefits over and above finding an optimal parameter regime for the algorithm. If the initial value of the proposal variances is chosen to be very large, then proposed moves will be made far and wide, expediting the early mode-finding stages of the algorithm. Adaptively reducing the proposal variances to an optimal value then allows us to explore each region efficiently. The fact that we have an ensemble of chains allows us to assess quickly and effectively what the optimal variance of the proposal distributions should be. The alternative to using adaptive procedures to tune the scaling parameters is to perform exploratory simulations to find the optimal regimes by trial and error. This can be very costly, and the optimal parameters cannot realistically be found to more than a couple of significant figures. It is therefore important that MCMC algorithms provide a feasible adaptive strategy. In many MCMC algorithms, such as the random walk Metropolis-Hastings (RWMH) algorithm, the optimal scaling parameter can be found by searching for the parameter value which gives an optimal acceptance rate; e.g. for near-Gaussian targets the optimal rates are 23.4% for RWMH and 57.4% for MALA [21]. Unlike Metropolis-Hastings algorithms, the PAIS algorithm does not accept or reject proposed values, so we need another method of measuring the optimality of β. Section 4.1 gives some possible methods for tuning β.
4.1.1. Determining the optimal scaling parameter using error analysis. MCMC algorithms can be assessed by comparing their approximation of the posterior to the analytic distribution, in cases where the posterior distribution can be computed using alternative methods. To assess this, a distance metric on distributions must be chosen. Examples are the relative error between the sample moments and the posterior's moments, or the relative L² error between the true density, π(x|D), and the constructed histogram.
The relative error in the m-th moment is given by

e_m = | (1/N) Σ_{i=1}^N x_i^m − E_π[x^m] | / | E_π[x^m] |,   (4.1)

where {x_i}_{i=1}^N is a sample of size N. The relative L² error, e, between the continuous density and the piecewise-constant normalised histogram of the samples can be given by considering the difference in mass between the two over a set of disjoint sets or "bins":

e = ( Σ_{i=1}^{N_b} v (B_i − π_i)² / Σ_{i=1}^{N_b} v π_i² )^{1/2},   (4.2)

where N_b is the number of bins, v is the volume of each bin, B_i is the value of the ith bin, and π_i is the average posterior density over the ith bin. This metric converges to the standard definition of the relative L² error as v → 0. These statistics cannot be used in general to find optimal values of β, since they require knowledge of the analytic solution, and the algorithm must be run for a long time to build up a sufficiently large sample. However, they can be used to assess the ability of other indicators to find the optimal proposal variances in a controlled setting. The following statistics can be used specifically for importance sampling algorithms.
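The two error metrics above can be computed as follows (a sketch; the function names, the default support and the bin count are illustrative choices, and the density is evaluated at bin midpoints rather than averaged over each bin):

```python
import numpy as np

def relative_moment_error(samples, exact_moment, m=1):
    """Relative error in the m-th raw moment of a 1d sample."""
    est = np.mean(samples ** m)
    return abs(est - exact_moment) / abs(exact_moment)

def relative_l2_error(samples, density, bins=50, support=(-5.0, 5.0)):
    """Relative L2 error between the normalised histogram of the samples
    and the target density, evaluated at bin midpoints."""
    B, edges = np.histogram(samples, bins=bins, range=support, density=True)
    mids = 0.5 * (edges[:-1] + edges[1:])
    v = edges[1] - edges[0]                  # bin volume
    pi = density(mids)
    return np.sqrt(np.sum(v * (B - pi) ** 2) / np.sum(v * pi ** 2))
```

With a large Gaussian sample and the standard normal density, both errors are small, and they shrink further as the sample size grows and the bin volume v decreases.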

4.1.2. The variance of the weights.
Importance samplers assign a weight to each sample they produce, based on the ratio of the posterior to the proposal at that point. Importance samplers are most efficient when the target is proportional to the proposal distribution. In this case the weights are all equal, and so the variance of the weights, var(w(y)), is zero. Hence, we would like to choose the value of β which minimises the variance of the weights,

β*_var = arg min_β var(w(y)).
In our experience, the mean of the estimator var(w(y)) is a smooth enough function of β that it can be used to tune the proposal variance during the burn-in phase of MCMC algorithms. However the variance of the estimator of the variance can be large, especially far away from the optimal value, so it can take a large number of iterations to calculate descent directions.
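For numerical stability this diagnostic is best computed from log-weights, with the weights rescaled to unit mean so that values are comparable across runs (a small illustrative utility; the name and scaling convention are our own):

```python
import numpy as np

def weight_variance(logw):
    """Variance of importance weights, rescaled to have unit mean;
    it is zero exactly when the proposal is proportional to the
    target at the sampled points."""
    w = np.exp(logw - np.max(logw))   # subtract max to avoid overflow
    w *= len(w) / w.sum()             # normalise to unit mean
    return w.var()
```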

4.1.3. The effective sample size.
The effective sample size, n_eff, can also be used to assess the efficiency of importance samplers. Ideally, in each iteration, we would like all M of our samples to provide us with new information about the posterior distribution. In practice, we cannot achieve a perfect effective sample size of M. The effective sample size can be defined in the following way:

n_eff = ( Σ_{i=1}^M w_i )² / Σ_{i=1}^M w_i² ≈ M E[w]² / E[w²] = M / ( 1 + var(w)/E[w]² ),

where the second two expressions hold as M → ∞. From the last expression we see that when the variance of the weights is zero, n_eff = M; this is our ideal scenario. Maximising the effective sample size is equivalent to minimising the variance of the weights. The statistic n_eff is easier to deal with than the variance of the weights, as it varies between 1 and M, as opposed to the variance, which can vary over many orders of magnitude. Therefore it is preferable as a means of tuning the scaling parameter.
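The first expression above translates directly into code (a one-line utility; the function name is our own):

```python
import numpy as np

def effective_sample_size(w):
    """Effective sample size n_eff = (sum w)^2 / sum w^2; equals M when
    all M weights are equal, and approaches 1 when one weight dominates."""
    w = np.asarray(w, dtype=float)
    return w.sum() ** 2 / (w ** 2).sum()
```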
In all of the numerics which follow, we use the effective sample size, calculated at each iteration and averaged across the entire run, to tune the scaling parameters. We do this because, when we tune this parameter on the fly, we use the single-iteration average to estimate optimality. The optimal scaling parameter found using the global optimum of the variance of the weights is also included for comparison. Note that for the majority of iterations the single-iteration value of n_eff is an overestimate of this statistic computed over a larger number of samples. Rare events which result in a large importance weight bring this statistic down, and care must be taken not to overfit the proposal variance to n_eff on an iteration-by-iteration basis. This can be accounted for by increasing the variance slightly once the adaptive algorithm has arrived at a value using single-iteration values of n_eff. The effective sample size also has another useful property. If we imagine the algorithm in the burn-in phase, for example, we have M ensemble members in the tail of a Gaussian curve searching for the area of high density. If the ensemble members are evenly spaced, then the particle closest to the mean will have an exponentially higher weight assigned to it. The effective sample size in this scenario will be close to 1. As the algorithm burns in, the ensemble populates the regions where the majority of the probability density lies, and the proposal distributions better represent the posterior distribution. This leads to smaller variance in the weights, and a bigger effective sample size. By this argument we can see that a rising n_eff signals the end of the burn-in period. Figure 4.1 shows that the effective sample size flattens out at the same time as the relative L² difference between the posterior distribution and the proposal distribution stabilises close to its minimum. Detecting this stationarity allows us to automatically determine the end of the burn-in phase of the algorithm.
If we are to use the effective sample size ratio as an indicator of how to tune the proposal variance, we need to look at how it behaves in different situations. Here we look at how the statistic behaves as we vary the ensemble size, M. Figure 4.2 shows results for the Gaussian posterior discussed in Section 6.1 when using the PAIS algorithm with RW proposals, each with variance β² ∈ R_{>0}. Figure 4.2 (a) shows that as the ensemble size increases, the scaling parameter which gives the optimal effective sample size ratio decreases. This is to be expected since, if we are trying to approximate the posterior with a mixture distribution of a small number of Gaussians, the optimal variance will naturally be larger so that all of the significant regions are covered by the proposal distribution. As the number increases, the optimal variances decrease so that finer details in the posterior can be better represented in the proposal. Figure 4.2 (b) shows that as the ensemble size increases, the efficiency of the sampler also increases.

4.2. Adaptive PAIS.
A popular approach for adaptive MCMC algorithms is to view the scaling parameter as a random variable which we can sample during the course of the MCMC iterations. However, this can be slow to converge to the optimal value, and we may need an uninformative prior for the scaling parameter. Alternatively, the parameter may be randomly sampled at various points during the evolution of the chain. This results in some iterations which make larger global moves in the state space between modes in the target distribution, and some which make local moves. Algorithms of this type do not converge to an optimal value of the scaling parameter. We choose to use a divide-and-conquer scheme which optimises the effective sample size (or any other diagnostic). Some more sophisticated examples are described in [23] and [13]. From here on in, we will use the random walk proposal

y = x_{i−1} + βξ,  ξ ∼ N(0, Σ),

where x_{i−1} is our current state, y is our proposed state, and Σ is a covariance operator, which could come from the prior distribution. In the numerics section we will refer to β as the scaling parameter.
Using an adaptive strategy, we calculate a sequence {β^(k)}_{k≥1} which converges approximately to the optimal scaling parameter β*, resulting in the optimal transition density χ for our MCMC algorithm. This optimal value will differ depending on the criterion we are optimising. We must choose some sequence of iterations, {n_k}_{k≥1}, at which to update β, and due to the constraints on adaptive MCMC algorithms [22,23], these n_k must grow exponentially further apart. This same adaptive approach can also be applied to the trivially parallelised MCMC algorithms to adaptively calculate their optimal scaling parameter β*.
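A divide-and-conquer optimisation of this kind can be sketched as a ternary search on a bracket, assuming the diagnostic (e.g. an estimate of the mean effective sample size) is unimodal in β; the function name and interface are illustrative:

```python
def tune_beta(obj, lo, hi, n_rounds=30):
    """Divide-and-conquer (ternary) search for the scaling parameter
    beta maximising a diagnostic obj on [lo, hi]; each round discards
    a third of the bracket."""
    for _ in range(n_rounds):
        m1 = lo + (hi - lo) / 3.0
        m2 = hi - (hi - lo) / 3.0
        if obj(m1) < obj(m2):
            lo = m1          # maximiser lies in [m1, hi]
        else:
            hi = m2          # maximiser lies in [lo, m2]
    return 0.5 * (lo + hi)
```

In an adaptive run, each comparison would be made from diagnostics averaged between the update iterations n_k, which are spaced exponentially further apart in line with the conditions of [22,23].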

5. Approximate Multinomial Resampling.
Although the ETPF is optimal in terms of preserving statistics of the sample, it can become quite costly as the number of ensemble members is increased. It is arguable that in the context of PAIS we do not require this degree of accuracy, and that a faster, approximate method of resampling could be employed. One approach would be to use the bootstrap resampler, which simply takes the M ensemble members' weights and constructs a multinomial distribution, from which M samples are drawn. This is essentially the cheapest resampling algorithm that one could construct. However, it too has some drawbacks. The algorithm is random, and as such it is possible for all of the ensemble members in a particular region not to be sampled. This could be particularly problematic when attempting to sample from a multimodal distribution, where it might take a long time to find one of the modes again. The bootstrap filter is also not guaranteed to preserve the mean of the weighted sample, unlike the ETPF. Ideally, we would like to use a resampling algorithm which is not prohibitively costly for moderately or large sized ensembles, which preserves the mean of the samples, and which makes it much harder for the new samples to forget a significant region in the density. This motivates the following algorithm, which we refer to as approximate multinomial resampling (AMR). Instead of sampling M times from an M-dimensional multinomial distribution, as is the case with the bootstrap algorithm, we sample once each from M different multinomials. Suppose that we have M samples y_n with weights w_n. The multinomial sampled from in the bootstrap filter has a vector of probabilities given by

w̄ = [ w_1 / Σ_{n=1}^M w_n , . . . , w_M / Σ_{n=1}^M w_n ],

with associated states y_n. We wish to find M probability vectors {p_1, p_2, . . . , p_M} ⊂ R^M_{≥0} such that (1/M) Σ_{i=1}^M p_i = w̄. The AMR is then given by one sample from each of the multinomials defined by the vectors p_i = [p_{i,1}, p_{i,2}, . . . , p_{i,M}], with associated states y_j.
Alternatively, as with the ETPF, a deterministic sample can be chosen by taking each new sample to be the mean of the corresponding multinomial distribution, i.e. each new sample x̄_i is given by

x̄_i = Σ_{j=1}^M p_{i,j} y_j.

The resulting sample has several properties which are advantageous in the context of being used with the PAIS algorithm. Firstly, we have effectively chopped up the multinomial distribution used in the bootstrap filter into M pieces, and we can guarantee that exactly one sample will be taken from each piece. This leads to a much smaller chance of losing entire modes in the density, if each of the sub-multinomials is picked in an appropriate fashion. Secondly, if we do not make a random sample from each multinomial with probability vector p_i, but instead take the mean of the multinomial to be the sample, this algorithm preserves the mean of the sample exactly. Lastly, as we will see shortly, this algorithm is significantly less computationally intensive than the ETPF. There are of course infinitely many ways in which one could split the original multinomial into M parts, some of which will be far from optimal. The method that we have chosen is loosely based on the idea of optimal transport. We seek out the states with the largest weights, and choose a cluster around each of these points based on the closest states in parameter space. This method is not optimal, since once most of the clusters have been selected the remaining states may be spread across the parameter space. To compare the resampling schemes, weighted ensembles were generated by sampling from N(1, 2). Importance weights were assigned based on a target distribution of N(2, 3). The statistics of the resampled outputs were compared with the original weighted samples. Figure 5.1 (a)-(c) shows how the relative errors in the first three moments of the samples change with ensemble size M for the three different samplers. As expected, the AMR lies somewhere between the high accuracy of the ETPF and the less accurate bootstrap resampling.
Note that only the error for the bootstrap multinomial sampler is presented for the first moment, since both the ETPF and the AMR preserve the mean of the original weighted samples up to machine precision. Figure 5.1 (d) shows how the computational cost, measured in seconds, scales with the ensemble size for the three different methods. These results demonstrate that the AMR behaves as we wish, and importantly ensures that exactly one sample of the output will lie in each region carrying 1/M of the total weight. We will use the AMR in the numerics in Section 6.3.3, where we have chosen to use a larger ensemble size. We do not claim that the AMR is the optimal choice within PAIS, but it does have favourable features, and it demonstrates how different choices of resampler can affect the speed and accuracy of the PAIS algorithm.
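The deterministic variant of the AMR can be sketched as follows. This is illustrative only: a simple ordering-based split of the multinomial into M equal-mass pieces stands in for the greedy weight-clustering described above, and the ordering by first coordinate is a crude one-dimensional proxy for proximity; it nevertheless preserves the weighted mean exactly:

```python
import numpy as np

def amr_resample(y, w):
    """Deterministic approximate multinomial resampling (sketch).

    Splits the bootstrap multinomial into M sub-multinomials of equal
    mass 1/M along a sorted-state ordering, then returns the mean of
    each piece, preserving the weighted mean of the input exactly.
    """
    M = len(w)
    order = np.argsort(y[:, 0])              # order states by first coordinate
    ws = w[order] / w[order].sum()           # normalised, sorted weights
    ys = y[order]
    edges = np.concatenate([[0.0], np.cumsum(ws)])
    edges[-1] = 1.0                          # guard against rounding
    x = np.zeros_like(y, dtype=float)
    for i in range(M):
        lo, hi = i / M, (i + 1) / M
        # mass of each atom falling inside the segment [lo, hi]
        overlap = np.minimum(edges[1:], hi) - np.maximum(edges[:-1], lo)
        p = np.clip(overlap, 0.0, None) * M  # probability vector p_i, sums to 1
        x[i] = p @ ys                        # mean of the i-th sub-multinomial
    return x
```

Because each atom's mass is distributed across the segments in proportions summing to its weight, the average of the M sub-multinomial means recovers the weighted mean of the original ensemble, matching the mean-preservation property claimed for the AMR.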
6.1. Sampling from a one dimensional Gaussian distribution. In this example we compare the naively parallelised RWMH algorithm with its PAIS variant, the PAIS-RW algorithm. The PAIS algorithm is implemented using the ETPF to perform the resampling step. We assess the performance of the PAIS algorithm using the relative L2 error defined in (4.2), as well as the relative error in the first moment (4.1).
Since we are comparing against naively parallelised MH algorithms, we also need to decide which statistics T(β) provide the best criteria for obtaining the optimal scaling parameters. In the examples which follow, we have optimised the naively parallelised RWMH algorithm using the optimal acceptance rate α = 0.5. This value differs from the theoretical asymptotic value of 0.234, which applies in higher dimensions, but this higher acceptance rate is commonly used for one dimensional Gaussian posteriors [24]. To find the optimal scaling parameter we minimise the statistic |N_acc(β)/N_total − α|, where N_acc(β) is the number of accepted moves and N_total is the total number of samples produced. For the PAIS algorithm, we maximise the effective sample size, as discussed in Section 4.1.3.
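The two tuning statistics can be written down directly. Note that the effective sample size below is the standard importance-sampling estimator (sum of weights squared over sum of squared weights); the paper's precise definition is given in its Section 4.1.3, so treat this as a sketch.

```python
import numpy as np

def acceptance_statistic(n_acc, n_total, target=0.5):
    """Distance of the empirical acceptance rate from the target rate.

    Minimised over a grid of scaling parameters beta to tune RWMH
    (target=0.5 is the one dimensional value used here; 0.234 is the
    usual high-dimensional asymptotic value)."""
    return abs(n_acc / n_total - target)

def effective_sample_size(weights):
    """Standard importance-sampling ESS estimate, (sum w)^2 / sum w^2.

    Maximised over beta to tune the PAIS proposal scaling."""
    w = np.asarray(weights, dtype=float)
    return w.sum() ** 2 / (w ** 2).sum()

# The sweep itself uses a log-spaced grid, e.g. as in Section 6.1.2:
# betas = np.logspace(np.log10(1e-5), np.log10(2.0), 32)
```

Equal weights give an ESS equal to the ensemble size, while a single dominating weight drives it down towards one, which is why maximising it encourages well-spread proposals.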
Here σ² and τ² are the variances of the observational noise and prior distributions respectively. In the numerics which follow, we choose τ² = 0.01 and σ² = 0.01, and we observe x_ref = 4 noisily, so that the data are given by D = x_ref + ε with ε ∼ N(0, σ²). These values result in a posterior density for which the vast majority of the mass lies out in the tails of the prior distribution. The Kullback-Leibler (KL) divergence, which gives us a measure of how different the prior and posterior are, is D_KL(µ^D || µ_0) = 4.67 for this problem. A KL divergence of zero indicates that the two distributions are identical almost everywhere.
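The KL divergence between two univariate Gaussians has a standard closed form, which is the kind of prior-to-posterior discrepancy measure quoted above (the function name is ours):

```python
import math

def kl_gauss(m1, v1, m0, v0):
    """D_KL( N(m1, v1) || N(m0, v0) ), with v1, v0 the variances.

    Closed form for univariate Gaussians:
    0.5 * ( log(v0/v1) + (v1 + (m1 - m0)^2)/v0 - 1 ).
    """
    return 0.5 * (math.log(v0 / v1) + (v1 + (m1 - m0) ** 2) / v0 - 1.0)
```

Identical distributions give a divergence of zero, and the divergence grows as the posterior mean moves into the tails of the prior.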

6.1.2. Numerical implementation.
In each of the following simulations, we perform three tasks. First we calculate the optimal value of β by optimising the statistics described in Section 4.1. We then run the algorithms with optimal parameters to calculate and compare the convergence rates. Finally, we implement the adaptive algorithms described in Section 4.2 and compare the convergence rates of these algorithms with the nonadaptive algorithms.
(1) Finding the optimal parameters: To find the optimal parameters we choose 32 values of β evenly spaced on a log scale in the interval [10^-5, 2]. We run the PAIS-RW and RWMH algorithms for one million iterations, each with an ensemble size of M = 50. We take 32 repeats of both algorithms and then use the geometric means of the sample statistics to find the optimal parameters.
(2) Measuring convergence of nonadaptive algorithms: We run the algorithms in Section 6.1 for one million iterations, again with M = 50. The simulations are repeated 32 times using the optimal parameters found in (1). The performance of the algorithms is judged by the convergence of the relative L2 error statistic in (4.2).
(3) Measuring convergence of adaptive algorithms: We run the adaptive algorithms under the same conditions as the nonadaptive algorithms, and again use the relative L2 error to compare efficiency. The adaptive algorithms are initialised with β^(1) = 1.

Figure 6.1 (a) shows the two values of β which are optimal according to the acceptance rate and relative L2 error criteria for the RWMH algorithm. The smaller estimate comes from the relative L2 error, and the larger from

Table 6.1. Optimal values of β summarised from Figure 6.1, with statistics calculated as described in Section 4.1. The values β*_L2 and β*_% are the optimal scaling parameters found by optimising the relative L2 error and the acceptance rate respectively. Similarly, β*_eff and β*_var(w(y)) optimise the effective sample size and the variance-of-the-weights statistics.
the acceptance rate. The results in Figure 6.1 are summarised in Table 6.1. Since in general we cannot calculate the relative L2 error, we must optimise the algorithm using the acceptance rate. From the relative L2 error curve we can see that the minimum is very wide, so despite the optimal values being very different there is not a large difference in the convergence rate. Figure 6.1 (b) shows the effective sample size ratio compared with the error analysis and the variance of the weights. The relative L2 error graph is noisy, but it is clear that the maximum in the effective sample size and the minimum in the variance of the weights are both close to the minimum in the relative L2 error. Because of this, we say that the estimate of the effective sample size found by averaging the statistic over each iteration is a good indicator of the optimal scaling parameter. In general this indicator overestimates the value of n_eff found by using the entire sample.

Figure 6.2 shows that the PAIS-RW algorithm converges to the posterior distribution significantly faster than the RWMH algorithm, in both the L2 error and the relative error in the moments. A description of the speed-up attained by this algorithm is given in Section 6.2.4. Both adaptive algorithms are run with initial values of β = 1. Figure 6.2 shows that after an initial burn-in period the APAIS-RW algorithm catches up to the PAIS-RW algorithm, and by the end of the simulation window it matches its performance. The ARWMH algorithm does not perform quite as well; this is possibly because the acceptance rate cost function is not particularly smooth at the optimal value, making it difficult to minimise.

6.1.5. Scaling of the PAIS algorithm with ensemble size. Throughout this example we have used an ensemble size of M = 50, but it is interesting to see how the PAIS algorithm scales when we increase the ensemble size, and whether there is some limit below which the algorithm fails.
We implement the problem in Section 6.1, using the RWMH and PAIS-RW algorithms with ensemble sizes in the interval M ∈ [1, 160]. Figure 6.3 was produced using the method for finding the optimal β described in Section 6.1.2 (1), then running 32 repeats at each ensemble size. The convergence rates are then found by regressing through the data. The graph is still very noisy, but it demonstrates that increasing the ensemble size continues to reduce the number of iterations required in comparison with naively parallelised MH. The decreasing trend indicates a superlinear improvement of PAIS with respect to ensemble size, in terms of the number of iterations required, supporting our belief that parallelism of MCMC should give us added value over and above that provided by naive parallelism. This decrease is linked to the increasing effective sample size.
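The regression used to extract convergence rates can be sketched as a least-squares fit in log-log space (the function name is ours; the paper's exact fitting window is not specified here):

```python
import numpy as np

def convergence_rate(iterations, errors):
    """Estimate the exponent r in error ~ C * iterations**r.

    A straight-line least-squares fit of log(error) against
    log(iterations); the slope is the empirical convergence rate.
    """
    slope, _intercept = np.polyfit(np.log(iterations), np.log(errors), 1)
    return slope
```

For Monte Carlo methods one expects a slope near -1/2; burn-in samples should be excluded before fitting, as described for the speed-up calculation in Section 6.2.4.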

6.2. Sampling from Bimodal Distributions.
In this section we investigate the behaviour of the PAIS algorithm when applied to bimodal problems. MH methods can struggle with multimodal problems, particularly where switches between the modes are rare, resulting in incorrectly proportioned modes in the histograms. This example demonstrates that the PAIS algorithm redistributes chains to new modes as they are found, so that we expect the number of chains in a mode to be approximately proportional to the probability mass in that mode. As a result, the algorithm avoids the reconstructed posteriors with disproportionate modes that are familiar from MH algorithms.
6.2.1. Target Distribution. We look at an 'easy' problem, B_1, which has a KL divergence of 0.880, and a 'harder' problem, B_2, which has a KL divergence of 3.647. Problem B_1 has two modes which are not too far apart. In B_2 we increase the distance between the two modes, which has the effect of increasing the expected number of iterations that it takes for an MCMC chain to jump between modes. These posteriors are shown in Figure 6.4.
The following setup is the same for both problems. We consider a non-linear observation operator G(x) = x², and assign the prior x ∼ µ_0 = N(0, τ² = 0.25). We assume that a noisy reading, D, is taken according to D = G(x_ref) + ε, where ε ∼ µ_ε = N(0, σ² = 0.1). This results in the non-Gaussian posterior

π(x | D) ∝ exp( -(D - x²)²/(2σ²) - x²/(2τ²) ),

which is bimodal since G(x) = G(-x). To create the 'easy' problem we set the true value G(x_ref) = 0.75, and the 'hard' problem is generated using G(x_ref) = 2. In the numerics which follow we draw noise from µ_ε to generate our data point.
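The unnormalised log-posterior can be written down directly from this setup (a sketch only; normalisation is omitted, and the parameter defaults are the values quoted above):

```python
def log_post(x, D, sigma2=0.1, tau2=0.25):
    """Unnormalised log-posterior for the quadratic observation
    operator G(x) = x**2 with Gaussian noise N(0, sigma2) and
    Gaussian prior N(0, tau2).

    Bimodal whenever D is well away from zero, since G(x) = G(-x).
    """
    return -(D - x ** 2) ** 2 / (2.0 * sigma2) - x ** 2 / (2.0 * tau2)
```

The symmetry x -> -x is exact, so both modes carry equal mass; what changes between B_1 and B_2 is their separation, and hence how rarely a single chain crosses the low-density region between them.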

6.2.2. Calculating Optimal Values β*. Calculating the optimal values of the scaling parameters for this problem is similar to the previous example: we check only the acceptance rate to find the optimal values for RWMH, and we use the effective sample size to find the optimal values for PAIS-RW. Table 6.2 gives the optimal values of β for both problems; the subscript on β refers to the criterion which has been optimised. It is relatively simple to find optimal scaling parameters for problem B_1; these values are given in Table 6.2 (left). However, problem B_2 is much harder, as transitions between the modes are extremely unlikely for the standard RWMH algorithm. This means that we need to consider the convergence on two levels: the algorithm's ability to find both modes, and whether it can sample them in the correct proportions.
To get correctly proportioned modes with the RWMH algorithm it is important that the chains can transition between the modes frequently, which means that β must be large. However, this leads to a lower acceptance rate, and so we sacrifice local convergence. For this reason, the RWMH algorithm is very slow to converge for problems of this type. We can capture these two regimes in RWMH by tuning β using the acceptance rate for local convergence, and using the L2 error for global convergence. Similarly, in PAIS-RW we can use the effective sample size for local convergence, and the L2 error for global convergence. From Table 6.2 (right) we see that there is a large difference between the optimal values of β for the two regimes in RWMH, which results in inefficient sampling. The PAIS-RW algorithm manages to sample both the local detail and the large scale behaviour with similar values of β*; a clear advantage of using this algorithm for this problem.

6.2.3. Convergence of RWMH vs PAIS-RW. As in the Gaussian example, we see a significant speed-up with the PAIS-RW algorithm for problem B_1. Figure 6.5 shows the adaptive and nonadaptive convergence rates. We can see that the adaptive algorithms compare closely with the respective nonadaptive algorithms, and the improvement PAIS offers remains significant. For B_2 the algorithms are run both with the globally optimal value of β* and with the locally optimal value of β*. Figure 6.6 (a) shows that the algorithms using the globally optimal β* converge at the desired rate, whereas the algorithms optimised using the acceptance rate and effective sample size initially converge faster but at some point forget the location of one of the modes, causing the convergence to flatten out. Using the PAIS algorithm we can minimise the impact of forgotten modes by constantly allowing the algorithm to search them out.
Since we have parallel chains, we can run PAIS-RW with the majority of chains using the locally optimal value of β, and one or two chains with a larger scaling parameter. These chains with larger scaling parameters act both as 'scouts' for new modes and as an aid to the overdispersion of the proposal distribution. Figure 6.6 (b) shows the results of using 49 chains with the locally optimal scaling parameter, and one chain with ten times the locally optimal scaling parameter. We see that modes are not forgotten, and the algorithm converges with the improvement we see from PAIS-RW in the other problems. Other methods of mode searching are described in [15]. The adaptive algorithm can just as easily be applied to the PAIS algorithm with the 'scout' chains described above. Since we need two equally sized sub-ensembles, we use two groups of 24 chains with the same proposal distributions, and each group also has a 'scout' chain with a scaling parameter ten times that of the rest of the group. Comparing the convergence of the adaptive algorithms against the nonadaptive algorithms in Figure 6.7 shows that the algorithms behave as expected. The adaptive RWMH algorithm tuned using the acceptance rate converges at the same rate as the L2-optimised algorithm until the scaling parameter gets small; from that point switches between the modes are rare, and the relative heights of the modes are decided by the arbitrary proportion of chains which happen to be in each mode. The adaptive PAIS-RW with scout chains, tuned using the effective sample size, converges at about the same rate as the locally optimised nonadaptive algorithm, also with scouts.

6.2.4. Calculating the Speed-Up in Convergence.
The graphs in the previous section clearly show that the PAIS-RW algorithm converges faster than the RWMH algorithm when both are parallelised with the same ensemble size. We can calculate the number of iterations required to achieve a particular tolerance level for each algorithm, and compare these to calculate a percentage saving. In Figure 6.8 we demonstrate our calculation of the savings. The constants c_1 and c_2 are found by regressing through the data with a fixed exponent of -1/2, excluding the initial data points where the graph has not finished burning in. A summary of the percentage of iterations required using the PAIS algorithm compared with the respective Metropolis-Hastings algorithms is given in Table 6.3.

6.2.5. A Useful Property of the PAIS Algorithm for Multimodal Distributions. The biggest issue for the Metropolis-Hastings algorithms when sampling from a posterior such as the one in B_2 is that it is unlikely that the correct ratio of chains will be maintained in each of the modes, and since there is no interaction between the chains, there is no way to remedy this problem. The PAIS algorithm tackles this problem with its resampling step. The algorithm uses its dynamic kernel to build up an approximation of the posterior at each iteration, and then compares this to the posterior distribution via the weight function. Any large discrepancy in the approximation will result in a large or small weight being assigned to the relevant chain, so that the chain will either pull other chains towards it or be drawn towards a chain with a larger weight. In this way, the algorithm allows chains to 'teleport' to regions of the posterior which are in need of more exploration. Figure 6.9 shows problem B_2 with initially one chain in the positive mode and 49 chains in the negative mode. It takes only a handful of iterations for the algorithm to balance the chains out into 25 chains in each mode.
The chains switch modes without having to climb the energy gradient in the middle.
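The speed-up calculation of Section 6.2.4 can be sketched as follows, assuming the error decays as c · n^(-1/2) with the exponent fixed (the function names are ours):

```python
import numpy as np

def fit_constant(iterations, errors):
    """Least-squares fit of c in error = c * n**(-1/2), exponent fixed.

    Fitting in log space: log(error) + 0.5*log(n) should be constant,
    so c is the exponential of its mean.  Burn-in points should be
    excluded from the inputs, as in Figure 6.8.
    """
    n = np.asarray(iterations, dtype=float)
    e = np.asarray(errors, dtype=float)
    return np.exp(np.mean(np.log(e) + 0.5 * np.log(n)))

def iterations_ratio(c_pais, c_mh):
    """Fraction of MH iterations PAIS needs for the same tolerance.

    Solving c * n**(-1/2) = tol for n gives n = (c/tol)**2, so the
    tolerance cancels in the ratio.
    """
    return (c_pais / c_mh) ** 2
```

Because the tolerance cancels, a single number summarises the saving: for instance, halving the constant c means only a quarter of the iterations are needed.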
6.3. Sampling from non-Gaussian bivariate distributions. In this section we apply the PAIS algorithm to a more complicated posterior distribution. The field of biochemical kinetics gives rise to multiscale stochastic problems which remain a challenge both theoretically and computationally. Biochemical reactions occur in single cells between a number of chemical populations, and the rates of these reactions can vary on vastly different timescales. It is these reaction rates which we are interested in characterising. It is often possible to isolate which reactions are occurring more frequently (the fast reactions) and which are occurring less frequently (the slow reactions). The quasi-steady-state assumption (QSSA) is the assumption that the fast reactions converge in distribution on a timescale which is negligible with respect to the rate of occurrence of the slow reactions. This assumption allows us to approximate the dynamics of the slowly changing quantities in the system by assuming that the fast quantities are in equilibrium with respect to the fast reactions in isolation. This kind of model reduction can be used to approximate the likelihood in an inverse problem where we wish to recover the reaction parameters of the system. Let us consider a simple example, the following chemical system involving two chemical species S_1 and S_2:

∅ →(k_1) S_1,   S_1 →(k_2) S_2,   S_2 →(k_3) S_1,   S_2 →(k_4) ∅.   (6.2)

Each arrow represents a reaction from a reactant to a product with some rate constant k_i, where the rates of the reactions are assumed to follow mass action kinetics. We denote the concentration of species S_i by X_i. We assume that we are in a parameter regime such that the reactions S_1 → S_2 and S_2 → S_1 occur much more frequently than the other reactions. Notice that both chemical species are involved in fast reactions. However, the quantity S = X_1 + X_2 is conserved by both of the fast reactions, and as such, this is the slowly changing quantity in this system.
The effective dynamics of S can be represented by

∅ →(k_1) S,   S →(k̄_4) ∅.   (6.3)

Here the new reaction rate k̄_4 is approximated through application of the QSSA to be k̄_4 = E[k_4 X_2 | S = s] / s. The value of E[k_4 X_2 | S = s] is approximated by finding the steady state of the ODE representing the fast subsystem of reactions,

dX_2/dt = k_2 (s - X_2) - k_3 X_2,

whose equilibrium X_2 = k_2 s / (k_2 + k_3) gives k̄_4 = k_4 k_2 / (k_2 + k_3). If we assume that we know the rate constants k_1 and k_4, and observe the system in (6.3), we indirectly observe the rates k_2 and k_3 through the effective rate k̄_4 of the degradation of S. Our observations are uninformative about these individual reaction rates, as there are surfaces in parameter space along which the effective rate k̄_4 is invariant, leading to a highly ill-posed inverse problem. Making the assumption that the errors in our observations of the value of S are Gamma distributed with some variance σ², this results in a long and thin posterior distribution on k_2 and k_3. This type of problem is notoriously difficult to sample from using standard MH algorithms, as the algorithms quickly find a point on the manifold on which k̄_4 is invariant, but exploration along its length is slow.
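Assuming the fast subsystem is the interconversion S_1 <-> S_2 with rates k_2 and k_3 (our reading of the scheme), the QSSA effective rate reduces to a one-line formula, which also makes the ill-posedness explicit:

```python
def qssa_effective_rate(k2, k3, k4):
    """Effective degradation rate of S = X1 + X2 under the QSSA.

    The fast subsystem's steady state given S = s solves
    k2*(s - X2) = k3*X2, so X2* = k2*s/(k2 + k3), and dividing
    E[k4*X2 | S = s] by s gives the effective rate constant
    k4_bar = k4*k2/(k2 + k3).
    """
    return k4 * k2 / (k2 + k3)
```

With the data-generating values used below (k_2 = 50, k_3 = 100, k_4 = 1) this gives k̄_4 = 1/3; any pair (k_2, k_3) with the same ratio k_2/(k_2 + k_3) yields the same effective rate, which is exactly the invariant surface that makes the inverse problem ill-posed.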
6.3.1. Target Distribution. We now formalise the posterior of interest. We look for a distribution over the parameters k = (k_2, k_3)^T, given that we know k_1 = 100 and k_4 = 1. We generate our data by making ten observations of the system at times t_i = 2, 4, ..., 20, here simulated by solving the full system, (6.2), governed by the differential equations

dX_1/dt = k_1 - k_2 X_1 + k_3 X_2,   dX_2/dt = k_2 X_1 - k_3 X_2 - k_4 X_2,

with initial conditions X_1(0) = X_2(0) = 0 and parameter values k = (50, 100)^T. We then add noise drawn from a Gamma distribution with variance σ² = 225, centred at each S(t_i) = X_1(t_i) + X_2(t_i). In our modelling we use the QSSA to simplify this system of differential equations into the one dimensional system based on (6.3), dS/dt = k_1 - k̄_4 S. The observation operator, G : (k, t) → S(t), maps from parameter space onto population space at time t. This means that for the ith observation we assume D_i ∼ Gamma(α_i, β_i), where α_i and β_i are the parameters required to give a distribution with mean G(k, t_i) and variance σ². We assign Gamma priors to k with mean α_0/β_0 = 75 and variance α_0/β_0² = 100 in both coordinates, resulting in a posterior proportional to the product of the Gamma likelihood terms and these priors.

6.3.2. Implementation. For this problem there is no analytic form for the normalisation constant, and numerical methods implemented in MATLAB and Mathematica have proven to be unreliable. To demonstrate the convergence of the algorithm, we first run a MH algorithm for much longer than we normally would (8 × 10^10 samples), and consider the resulting histogram to be a well-converged approximation of the true posterior distribution. We then perform our usual simulations with the PAIS and MH algorithms and compare the rate at which their histograms converge to this posterior over the same mesh. In this problem we modify our PAIS algorithm to use a mixture of Gamma distributions in the proposal distribution instead of a Gaussian mixture.
The PAIS-Gamma and MH-Gamma algorithms use a Gamma proposal distribution centred at the previous state, y ∼ Gamma(·; α*, β*), where α*/β* = x and α*/(β*)² = β².
This ensures that we only propose positive parameters. Having a heavier right tail in the proposal distribution also means that the posterior is absolutely continuous with respect to the proposal. When this is not the case, the weight function can tend to zero or infinity as k_i → ∞, which results in poor, very spiky approximations of the tails of the posterior distribution, leading to slow convergence. We now significantly increase the ensemble size, from M = 50 to M = 2500. This allows us to build a better approximation of the posterior distribution for our proposals. Following the discussion in Section 5, it is clear that to keep the runtime of the resampler negligible compared with the calculation of the posterior we must switch from the ETPF resampler to the AMR algorithm.
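Matching the first two moments gives the proposal parameters in shape/rate form (a sketch under the parameterisation above; the function names are ours, and note that NumPy's sampler takes shape and scale = 1/rate):

```python
import numpy as np

def gamma_proposal_params(x, beta):
    """Shape/rate parameters (alpha*, beta*) of a Gamma proposal with
    mean x (the current state) and variance beta**2 (the scaling
    parameter): mean = alpha/rate and variance = alpha/rate**2."""
    rate = x / beta ** 2
    shape = x ** 2 / beta ** 2
    return shape, rate

def sample_gamma_proposal(x, beta, rng):
    """Draw one proposal; support is (0, inf), so only positive
    parameters are ever proposed."""
    shape, rate = gamma_proposal_params(x, beta)
    # numpy parameterises the Gamma by shape and *scale* = 1/rate
    return rng.gamma(shape, 1.0 / rate)
```

For small β relative to x the proposal is nearly Gaussian around the current state, while its right tail remains heavier than Gaussian, which is the absolute-continuity property exploited above.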

6.3.3. Convergence of PAIS-Gamma and PAIS-LA vs MH with Gamma proposals.
In the numerics which follow we have used the AMR algorithm to resample, with an ensemble size of M = 2500. The method otherwise remains the same as in previous sections. We perform test runs to find the optimal scaling parameters for both Gamma proposals and MALA-type proposals. We then calculate the convergence rates of the algorithms by producing 50 million samples from the posterior with each algorithm, repeating the simulation 32 times.

Figure 6.10. Convergence of the PAIS-Gamma and PAIS-LA algorithms for the chemical system problem described in Section 6.3, with implementation as described in Sections 6.3.2 and 6.3.3. Resampling is performed using the AMR scheme.
Algorithm:  MH-Gamma   PAIS-Gamma   PAIS-LA
β*:         2.7e-0     9.4e-1       1.0e-0

Table 6.4. Optimal values of the scaling parameter. The MH algorithm is optimised using the acceptance rate, and the PAIS algorithms are optimised using the effective sample size.
In Figure 6.11 (a) we see that the two PAIS algorithms have similar optimal values of the scaling parameter, also shown in Table 6.4; however, the MALA-type proposal achieves a higher effective sample size with the same ensemble size. Convergence over the first 50 million samples is shown in Figure 6.11 (b), and demonstrates that the PAIS algorithms converge faster than the MH algorithm. The slightly unstable convergence of the PAIS-Gamma algorithm is the effect of spikes appearing in the tails of the posterior distribution; this is also what causes its slightly lower effective sample size.
7. Discussion and Conclusions. We have explored the application of parallelised MCMC algorithms to low dimensional inverse problems. We have demonstrated numerically that these algorithms converge faster than the analogous naively parallelised Metropolis-Hastings algorithms. Further experimentation with the Metropolis adjusted Langevin algorithm (MALA), preconditioned Crank-Nicolson (pCN), preconditioned Crank-Nicolson Langevin (pCNL) and Hamiltonian Monte Carlo (HMC) proposals has yielded similar results [25]. Importantly, we have compared the efficiency of our parallel scheme with a naive parallelisation of serial methods; thus our increase in efficiency is over and above an M-fold increase, where M is the number of ensemble members used. Our approach demonstrates a better-than-linear speed-up with the number of ensemble members. The PAIS has a number of favourable features, for example the algorithm's ability to redistribute, through the resampling step, the ensemble members to regions which require more exploration. This allows the method to be used to sample from complex multimodal distributions. Another strength of the PAIS is that it can be used with any MCMC proposal. There is a growing number of increasingly sophisticated MCMC algorithms (HMC, Riemann manifold MCMC, etc.) which could be incorporated into this framework, leading to even more efficient algorithms; this is another opportunity for future work. One limitation of the PAIS approach as described above is that a direct solver of the ETPF problem (such as FastEMD [19]) has computational cost O(M³ log M), where M is the number of particles in the ensemble. For this reason we introduced a cheaper, approximate resampler, the approximate multinomial resampler, which allows us to push the approach to much larger ensemble sizes. The PAIS framework is very flexible in terms of being able to use any combination of proposal distributions and resampling algorithms that one wishes.