Aﬃne invariant interacting Langevin dynamics for Bayesian inference

,


Introduction
In this paper, we propose an efficient sampling method for Bayesian inference which is based on first-order (overdamped) Langevin dynamics [32] and which satisfies the property of affine invariance [13].Here affine invariance of a computational method refers to the fact that a method is invariant under an affine change of coordinates.A classical example is provided by Newton's method, while standard gradient descent is not affine invariant.The importance of affine invariance as a general guiding principle for the design of Monte Carlo sampling methods was first highlighted in the pioneering contribution [13].
Langevin dynamics based sampling methods, on the other hand, have a long history in statistical physics [36] and computational statistics [35].An important step towards affine invariant Langevin sampling methods was taken through the introduction of Riemannian manifold Langevin Monte Carlo methods in [12] with the metric tensor given by the Fisher information matrix.However, the Fisher information matrix is typically not available in closed form and/or is difficult to approximate numerically.Instead, an alternative approach was put forward in the unpublished Master thesis [14], where an ensemble of Langevin samplers is combined to provide an empirical covariance matrix resulting in a preconditioned affine invariant MALA algorithm (see Remark 6 for more details).This methodology was put into the wider context of dynamics-based sampling methods in [21] with a focus on second-order Langevin dynamics.
An interesting link between ensembles of Langevin samplers and the ensemble Kalman filter [19,33], both relying on ensemble based empirical covariance matrices, has been established more recently in [11] leading to a nonlinear Fokker-Planck equation for the associated mean-field equations and an associated Kalman-Wasserstein gradient flow structure in the space of probability measures.Furthermore, if applied to a Bayesian inverse problem with additive Gaussian measurement errors and nonlinear forward map, a gradient-free Langevin dynamics formulation has been proposed [11] based on ideas previously exploited in the ensemble Kalman filter literature [10].
The present paper builds upon the unpublished note [29], which identifies a statistically consistent finite ensemble size implementation of the mean-field equations put forward in [11].More precisely, it could be established that the proposed interacting Langevin dynamics possesses the desired posterior target measure as an invariant measure provided an appropriate correction term is added, which is due to the multiplicative noise in the preconditioned Langevin system.The correction term vanishes in the mean-field limit.
In addition, we establish in this paper the affine invariance of our finite ensemble size evolution equations (with acronym ALDI) through a particular choice of the multiplicative noise term and study the ergodicity properties of ALDI.In particular, it can be shown that ALDI is ergodic if the ensemble size, N , and the dimension, D, of the underlying random variable satisfy N > D + 1 and the empirical covariance matrix is non-degenerate at initial time.We emphasise that ALDI is straightforward to implement, does not require inversion or other matrix factorisations of the empirical covariance matrices (which is important for high-dimensional problems) and is applicable to a wide range of sampling problems.
Finally, a gradient-free formulation of ALDI in the spirit of [11] is proposed for Bayesian inverse problems with additive Gaussian measurement errors.While the invariance of the posterior distribution is lost except for Gaussian likelihood functions, affine invariance is maintained.Numerical experiments are conducted for a PDE constrained Bayesian inference problem.The numerical results indicate in particular that it is entirely sufficient to implement ALDI with N = D + 2 particles; the minimum size required for ergodicity to hold.
The remainder of this paper is structured as follows.The subsequent Section 2 establishes the mathematical setting of the sampling problems considered in this paper and provides a unifying mathematical framework for ensemble-based Langevin dynamics equations.Given this framework, we formulate the key algorithmic requirements on the ensemble formulation proposed of this paper.We introduce the concept of affine invariance and prove affine invariance for the nonlinear Fokker-Planck equations put forward in [11].The novel ALDI method is put forward in Section 3 and its gradient-free variant in Section 4. The affine invariance, non-degeneracy and ergodicity of ALDI are proven in Section 5, where we also put our approach into the perspective of diffusion processes on Riemannian manifolds [12,23] and Wasserstein gradient flows [1,38].The importance of the correction term is demonstrated for a PDE constrained inverse problem [11].We also compare the performance of the gradient-based and gradient-free formulations of ALDI and find that they both lead to comparable numerical results with the gradient-free formulation however much cheaper to implement.We conclude the paper with a summary section.

Mathematical problem formulation
We consider the computational problem of producing samples from a random variable U with values in R D and given probability density function (PDF) where Φ : R D → R is an appropriate potential and a normalisation constant.
Example 1 (Bayesian inverse problems).The computational Bayesian inverse problem (BIP) of sampling a random variable U conditioned on an observation y ∈ R K , where the observations Y and the random variable U obey the forward model serves as the main motivation of this paper.Here, G : R D → R K denotes some nonlinear forward map and the mean zero R K -valued Gaussian random variable Ξ represents measurement errors with positive definite error covariance matrix R ∈ R K×K .We assume that Ξ and U ∼ π 0 are independent.Then, by Bayes' theorem, the distribution of the conditional random variable U |y is determined by with the least-squares misfit function and the normalisation constant If the prior PDF π 0 is Gaussian with mean m 0 ∈ R D and covariance matrix P 0 ∈ R D×D , then the posterior is absolutely continuous with respect to the Lebesgue measure on where Φ(u; y) We write Φ(u) for simplicity and ignore the dependence on the data y from now on.
The sampling methods considered in this paper are based on stochastic processes of N interacting particles , with the property that the marginal distributions in each of the particles U (i) t approximate π * as t → ∞.We treat the particle positions U (i) t as random vectors with values in R D , which we collect into the L-dimensional random vector with L = N • D, and consider gradient-based evolution equations of the form t ≥ 0, for some positive semi-definite matrix-valued A(u) ∈ R L×L and Γ(u) ∈ R L×M , W t standard M -dimensional Brownian motion, and V : R L → R a given potential.Here u denotes an element in R L of the form with all u (i) , i = 1, . . ., N , elements of R D .Furthermore, the Itô interpretation [32] of the multiplicative noise term in (10) is to be used.
The main contribution of this paper consists in developing a particular instance of (10) with the following three properties: is invariant under the dynamics.Furthermore, π * is ergodic in the sense that the joint law of the process converges towards π * as t → ∞, in an appropriate sense and under suitable conditions on the initialisation.
(ii) The equations (10) are invariant under affine transformations with M ∈ R D×D an invertible matrix and b ∈ R D a shift vector.A precise formulation of affine invariance is provided in Definition 2 below.See also [13,14,21].
(iii) The equations (10) are straightforward and computationally efficient to implement, that is, do not require the inversion or factorisation of D-dimensional matrices and/or higher-order derivatives of the potential V.
Definition 2 (affine invariance).Following [13,14,21], a formulation ( 10) is called invariant under affine transformations of the form (13), that is, if the resulting equations in the collective variable are given by dV with the potential V defined by where 1 N ∈ R N denotes a column vector of ones, A ⊗ B the Kronecker product of two matrices, and I N ×N the N -dimensional identity matrix.
Example 3 (Langevin dynamics).The classical example of ( 10) is provided by the scaled first-order (overdamped) Langevin dynamics where W (i) t , i = 1, . . ., N , denotes independent D-dimensional Brownian motion, C ∈ R D×D is a constant symmetric positive-definite matrix and C 1/2 denotes its positive-definite square root.In this case, the particles do not interact and and the potential V is given by We note that (18) satisfies items (i) and (iii) from above for any N ≥ 1 but not (ii), in general.As pointed out in [21], the failure of ( 18) to be affine invariant potentially leads to inefficient sampling when Φ is poorly scaled with respect to C.More specifically, in the case of Bayesian inverse problems with Gaussian posterior, this scenario occurs when C is vastly different from the target covariance.
Let π t denote the PDF of the ith particle U (i) t at time t ≥ 0 with evolution equation (18).Then these PDFs satisfy the Fokker-Planck equation with π t = π i t and the Kullback-Leibler divergence defined by An important generalisation of the linear Fokker-Planck equation (20) has been proposed in [11].It relies on making the matrix C dependent on the PDF π t itself; thus leading to a nonlinear generalisation of (20).More specifically, This choice of C is motivated by the gradient flow structure of the ensemble Kalman-Bucy filter, as first revealed in [4], and gives rise to the Kalman-Wasserstein gradient flow structure of the associated nonlinear Fokker-Planck equation (20) as introduced in [11].Contrary to the classical Fokker-Planck equation (20) with constant C, the nonlinear Fokker-Planck equation with C = C(π t ) is invariant under affine transformations (13).
Proof.We define the pushforward PDFs Then Here we have used that as well as and an analog statement for the divergence operator.Furthermore, the variational derivatives of the Kullback-Leibler divergences satisfy Building upon the affine invariance property of the nonlinear Fokker-Planck equation ( 20) with C = C(π t ) as in (22), we demonstrate in the following section how to obtain stochastic evolution equations of the form (10) which satisfy all three properties (i)-(iii) from above.

Affine invariant interacting Langevin dynamics
As noted in the previous section, the nonlinear Fokker-Planck evolution ( 20)-( 22) satisfies invariance of the target measure π * (property (i)) as well as affine invariance (property (ii)).In this section, we address (iii), that is, we present an interacting particle system of the form (10) which has (20) as its mean field limit while still maintaining properties (i) and (ii) for any finite number of particles.
In order to define our interacting particle system, let us first define the empirical covariance and mean that is, the particle-based estimators of the quantities defined in (22), and furthermore fix the (non-quadratic) square root For a moment, let us assume that u ∈ R L is such that C N (u) is invertible (we will comment on this assumption following Definition 5, see also Proposition 11) and choose the preconditioning matrix the potential and the diffusion matrix in (10), that is, M = N 2 .Using the identity which follows from Jacobi's formula for the derivative of determinants (see the Appendix for more details), we derive the following explicit form of the proposed interacting particle Langevin dynamics.

Definition 5 (ALDI method). The affine invariant Langevin dynamics (ALDI) is given by dU
for i = 1, . . ., N , where We emphasise that the square root C 1/2 N (u), as defined in (28), does not require computationally expensive matrix factorisations, and hence the formulation (33) satisfies the requirement (iii).Note that although defining V as in (30) necessitates N > D in order for the empirical covariance matrix C N (u) to be non-singular, the terms in (33) are well-defined also for N ≤ D. While a non-singular C N (u) is required generically for the ALDI method to sample from the invariant measure π * (see the discussion in Section 5.1), a smaller number of particles, N , is sometimes desirable in order to reduce the computational cost for high-dimensional BIPs.
If indeed N ≤ D, than C N (u) is singular and the dynamics of the interacting particle system (33) is restricted to the linear subspace spanned by the N initial particle positions Stochastic differential equations in the N 2 scalar coefficients m ij t can easily be derived from (33) using the ansatz (34).In other words, provided that the initial samples U (i) 0 are appropriately chosen, an implementation of (33) with N ≤ D can lead to a computationally efficient reduction of the BIP onto a lower dimensional linear subspace.The affine invariance of (33) holds regardless of the ensemble size and is discussed in Section 5.2 in more detail.
Remark 6 (previous work).The idea of an affine invariant Monte Carlo method based on Langevin dynamics using an ensemble of particles and its empirical covariance first appeared in the unpublished Master thesis [14].More specifically, the author proposes an affine invariant modification of the popular MALA algorithm [35,12], where each particle U (i) k , i = 1, . . ., N , is sequentially updated at time-step k using the proposal where h > 0 is the step-size, M k is an empirical covariance matrix based on a set of particles not including

and Ξ
(i) k is a D-dimensional Gaussian random variable with mean zero and covariance matrix I D×D .Independently of [14], a general time-continuous framework for affine invariant interacting particle formulations has been developed in [21] and affine invariant implementations of second-order Langevin dynamics using empirical covariance matrices are studied in detail.
More recently, ensemble preconditioned first-order Langevin dynamics has been revisited in [11] with an emphasis on its mean-field limit and its connection to the ensemble Kalman filter [10,19,33].In fact, (33) appeared first in [11] without the correction term and with C 1/2 N (u) being replaced by the symmetric matrix square root of the covariance matrix C N (u).The correction term (36) in (33) is, however, needed for π * to be an invariant distribution under the resulting interacting particle system (10) and first appeared in the unpublished note [29].The invariance of π * under (33) is proven in Section 5.1.
The correction term (36) vanishes as N → ∞ for D fixed which justifies the nonlinear Fokker-Planck equation (20) with C = C(π t ) in this mean-field limit.See [11] for more details.
We note that a general discussion on necessary correction terms for Langevin dynamics with multiplicative noise can, for example, be found in [35,12] from the perspective of Riemannian Brownian motion.We also note that general conditions on diffusion processes that guarantee invariance of a given target distribution have been investigated in [8, Section 2.2] and [24,21].
The specific form of the multiplicative noise term in (33) avoids matrix factorisations and ensures the affine invariance of ALDI as demonstrated in Section 5.2.

Approximate gradient-free sampling
A central idea put forward in [11] (see also [31]) in the context of BIPs described in Example 1 is to combine the preconditioned Langevin dynamics with gradient-free formulations of the ensemble Kalman filter.Recalling the forward map G from (3), the empirical crosscorrelation matrix D N (u) ∈ R D×K is defined via We now make the approximation C N ∇ u G ≈ D N , motivated by the fact that this is exact for affine forward maps, G(u) = Gu + c.For background on this approach, we refer to [10,Appendix A.1].In terms of the ALDI formulation (33) and the potential Φ(u), given by (8), we obtain: Definition 7 (gradient-free ALDI).Given a potential Φ(u) of the form (8), the gradientfree ALDI formulation is given by dU for i = 1, . . ., N , where W (i) t denote independent N -dimensional standard Brownian motions.
While the invariance of π * is lost under the gradient-free formulation (38) (except, of course, for affine forward operators), affine invariance of the equations of motions is maintained, see Section 5.2.Gradient-free formulations have been found to work well for unimodal posterior distributions in [11], but fail for multi-modal distributions as demonstrated in [34].A localised covariance formulation of ALDI has been proposed in [34] to overcome this limitation.Localised covariance matrices were already considered in [21]; but not in the context of gradient-free formulations.

Properties of ALDI
The aim of this section is to analyse some of the properties of the dynamics (33), in particular verifying the conditions (i) and (ii) outlined in Section 2. The key observation (crucially depending on the correction term D+1  2 log |C N (u)| to the potential V in (30)) is that the corresponding Fokker-Planck equation has the same mathematical structure as its counterpart (20) for the mean-field regime: Proposition 8 (linear Fokker-Planck equation).Let U t satisfy the stochastic evolution equations (33) and assume that the time-marginal PDF π t of U t is smooth.Then π t satisfies the linear Fokker-Planck equation Proof.The proof can be found in the Appendix.See also the technical report [29].
Note that the PDF π t in (39) is defined on the extended space R L , L = N • D, whereas π t in ( 20) is defined on R D .In contrast to (39), the mean field equation ( 20) is nonlinear since C, as defined in (22), depends on the solution π t itself.

Non-degeneracy and ergodicity
As a first result, we have that property (i) is satisfied for the extended target measure (12) on the joint state space R L .This follows directly from Proposition 8: Corollary 9 (invariance of the posterior measure).The extended target measure ( 12) is invariant for (33), that is, if U 0 ∼ π * , then U t ∼ π * for all t ≥ 0.
Note that π * is not the unique invariant measure for the dynamics (33).For instance, if U = (u (1) , . . ., u (N ) ) with u (1) = u (2) = . . .= u (N ) , then C N (U ) = 0 and u i − µ N (U ) = 0, and hence δ U (the Dirac measure centred at U ) is invariant.To ensure favourable ergodic properties, we need to prove that π * is also the unique invariant measure that is reachable by the dynamics from an appropriate set of initial conditions.First, we shall make the following assumption on the potential Φ: Assumption 10 (regularity and growth conditions on the potential Φ).Assume that Φ ∈ C 2 (R D ) ∩ L 1 (π * ).Furthermore, assume that there exists a compact set K ⊂ R D and constants c, C > 0 such that for all u ∈ R D \ K.
The bound (40c) is to be understood in the sense of quadratic forms.Assumption 10 is satisfied for target measures with Gaussian tails.Indeed, Φ = Φ 0 + Φ 1 is admissible, where Φ 0 (u) = 1 2 u • Su is quadratic (with S strictly positive definite), and ) is a smooth perturbation with compact support.We would like to emphasise that Assumption 10 can be relaxed with minimal effort, but we refrain from doing so for ease of exposition.
Due to the fact that C N (u) is not uniformly bounded from below on R L , the associated Fokker-Planck operator is not uniformly elliptic and standard ergodicity results are not applicable.However, we have the following non-degeneracy result.
Proposition 11 (non-degeneracy of the empirical covariance matrix).Let Assumption 10 be satisfied and assume that C N (U 0 ) is strictly positive definite.Then (33) admits a unique global strong solution, and C N (U t ) stays strictly positive definite for all t ≥ 0, almost surely.
Proof.The proof rests on the identity (32) so that (33) can be written in the form dU (i) with the potential V given by (30), making use of the repulsive effect of the term Details can be found in the Appendix.
With Proposition 11 in place, the proof of the following ergodicity result is relatively straightforward: Proposition 12 (ergodicity).Assume the conditions from Proposition 11, and furthermore that N > D +1.Then the dynamics is ergodic, that is, π t → π * as t → ∞ in total variation distance.
Proof.The proof can be found in the Appendix.
Remark 13.In the case when N ≤ D, ergodicity will not hold, since the dynamics is constrained to a subspace according to the discussion following Definition 5.In the case when N = D+1 one can show that the set E defined in (45) has two connected components.The dynamics will then be ergodic with respect to π * restricted to one of these, depending on the initial condition.This is acceptable from an algorithmic viewpoint, but we do not treat this case separately for simplicity.

Affine invariance
We show that (33) and its gradient-free variant (38) are affine-invariant, in the terminology introduced in [13,14] and summarised in Definition 2.
Proof.We follow the proof of Lemma 4. Since C N (u) = M C N (v)M T we also have Furthermore, , and an analogous statement holds for the divergence operator.Finally, equality (26) also holds for the Kullback-Leibler divergences over extended state space.Along the same lines, the affine invariance can also be checked directly at the level of the stochastic differential equations (33).In particular, it holds that C with G(v) = G(M u + b) and D N (v) the empirical covariance matrix between v and G(v).This implies the affine invariance of the gradient-free formulation (38).
Remark 15 (path-wise versus distributional affine invariance).Definition 2 is based on path-wise affine invariance at the level of the SDE (10).Path-wise invariance implies affine invariance of the associated time-marginal distributions π t , that is, affine invariance of the implied Fokker-Planck equation.The converse is not true, in general.

Geometric properties and gradient flow structure
In this section, we place the dynamics (33) in a geometric context, viewing (a suitable subset of) R L as a Riemannian manifold when equipped with an appropriate metric tensor.This approach has been pioneered in [12]; we also recommend the review paper [23].Leveraging this perspective, we show that the evolution induced by (33) on the set of smooth PDFs can be interpreted as a gradient flow in the sense of [16].In the limit as N → ∞ we formally recover the Kalman-Wasserstein geometry introduced in [11].We restrict our attention to the case N > D + 1 in this section, when the dynamics (33) is ergodic on the set according to the results in Section 5.1.Extending the framework to the case when N ≤ D + 1 is subject of ongoing work.We now turn E into an L-dimensional Riemannian manifold by introducing the metric tensor where A has been defined in (29).Furthermore, here and in the rest of this section, Einstein's summation convention is used.
Remark 16 (definition of metric tensor).To avoid notational clutter, we have suppressed the coordinates of the individual particles in the expression (46).Denoting the γ-th coordinate of the i-th particle by u (i,γ) and using a similar convention for A, we can write more explicitly (not using the Einstein convention for clarity) where the second identity rests on the block-diagonal structure of A.
In what follows, we will denote by dvol g the Riemannian volume, by ∇ g the Riemannian gradient, by (W g t ) t≥0 Riemannian Brownian motion and by d g the geodesic distance on (E, g).For more details, we refer to [15,20] and, in the context of computational statistics, to [23].Using these objects induced by g, both the SDE (33) and the corresponding Fokker-Planck equation (39) admit a compact formulation: Proposition 17 (Riemmanian interpretation of ALDI).Let π g denote the density of π with respect to the Riemannian volume, that is, π g dvol g = π du and π g * dvol g = π * du.Then the dynamics (33) can be written in the form and the Fokker-Planck equation (39) can be written in the form Remark 18.Note that the Kullback-Leibler divergence and its functional derivative depend on the measures but not on the respective densities, in contrast to the Onsager operator [25,28,30] Proof.Using the results from [23], in particular the equations ( 46)-( 47), the proof of the first statement reduces to verifying that where g IJ stands for the components of the inverse of g, and we have used the notation I = (i, γ) from Remark 16.This follows directly from the definition of g and the identity (71) giving rise to the drift correction (36).Indeed, together with the coordinate expressions for vector-valued functions f and scalar-valued V , the result follows by direct substitution.
To exhibit the gradient flow structure, we recall that the natural quadratic Wasserstein distance on (E, g) is given by where Π(µ, ν) denotes the set of probability measures on R L × R L with marginals µ and ν.It is well-known that the evolution (49) can be interpreted as gradient flow dynamics of the Kullback-Leibler divergence on the set of probability measures equipped with the distance (52), see for instance [38,Chapter 15] or [22].By the Benamou-Brenier formula [3], we have the representation where the constraining continuity equation in (53b) is to be interpreted in a weak form.In standard coordinates (using the definition (46) as well as the formulas (51)) we see that revealing a close similarity with the Kalman-Wasserstein distance (here denoted by W Kalman ) introduced in [11].Indeed, denoting by ⊗ N i=1 µ i and ⊗ N i=1 ν i the product measures on R L associated to µ, ν ∈ P(R D ), we formally expect that assuming that C N ≈ C(ρ) for sufficiently large N , where C(ρ) was defined in (22).A rigorous passage from W g to the Kalman-Wasserstein distance might be a rewarding direction for future research; we note that a similar analysis (relating the gradient flow structures associated to a finite particle system and its mean-field limit) has been carried out recently in [5].

Numerical experiment: A PDE constrained inverse problem
We consider the inverse problem of determining the periodic permeability field a(x) > 0 in the elliptic partial differential equation (PDE) from N y = 10 observed grid values j = 0, . . ., N y − 1, of the pressure field p for a given forcing f .Both p and f are assumed to be periodic over the domain Ω and the measurement errors η j in (57) are i.i.d.Gaussian with mean zero and variance σ R = 0.0001.A related 2-dimensional Darcy flow problem has been studied in [11].In this paper, we restrict the simulations to the 1-dimensional formulation (56) for computational simplicity.This infinite-dimensional problem is made finite-dimensional by introducing a computational grid with D = 50 grid points.Hence (56) gets replaced by the finite-difference approximation i = 1, . . ., D.Here h = 2π/D denotes the mesh size and p j ≈ p(x j ), etc.We also make use of the periodicity and set p D = p 0 as well as f D = f 0 .Since the permeability field should be non-negative, we set for i = 1, . . ., D. The forward problem is now given by the solution {p i } D−1 i=0 to (59) for given {f i } D−1 i=0 and {u i } D i=1 and its restriction to the observation grid {x j } Ny−1 j=0 .We denote this map by G(u), suppressing the dependence on the forcing given by The measurement error covariance matrix is given by R = σ R I Ny×Ny .This completes the description of our forward model (3).
The prior distribution on u is assumed to be Gaussian with mean zero and covariance matrix where ∆ h denotes the standard second-order finite-difference operator over Ω with meshsize h and periodic boundary conditions, that is, the operator defined by the left-hand side of (59) with a i±1/2 = 1.The observations (57) are generated numerically by solving (59) for the reference permeability field and setting j = 1, . . ., N y .From (63) one can read off the reference vector u † with entries i = 1, . . ., D.
We implemented the gradient-based ALDI formulation (33) as well as the gradientfree ALDI formulation (38) using the Euler-Maruyama method with step-size ∆t = 0.01 over a time interval t ∈ [0, 5].We also implemented both methods with and without the correction term (36).The ensemble sizes were taken as N = 25, 52, 100, 200.Except for the smallest ensemble size, all other choices resulted in non-singular empirical covariance matrices C N (U t ).
We compare the simulation results based on the estimation bias and the ensemble spread computed along numerical solutions for τ = 4 and T = 2.Each experiment was repeated ten times to reduce the impact of random effects.The results can be found in Tables 1  and 2, respectively.It can be seen that the correction term has a profound impact on both the bias as well as the ensemble spread for the smallest ensemble size N = 25.This effect is largely diminished for the largest ensemble size of N = 200.We also find that the gradient-free formulation yields results which are essentially indistinguishable from those including the exact gradient while being computationally much more efficient.Finally, the In order to provide a better insight into the impact of the correction term (36) on the final ensemble distributions we display results for M = 25 and M = 200 in Figures 1 and  2, respectively.
We conclude from this simple experiment that the correction term (36) is required for implementations of ALDI whenever the ensemble size is of the order of the dimension of the parameter space or less.The experiments also confirm that gradient-free implementations can offer a computationally attractive alternative to gradient-based implementations of ALDI.

Conclusions
We have proposed a finite ensemble size implementation of the Kalman-Wasserstein gradient flow formalism put forward in [11], which requires the inclusion of a correction term (36) due to the multiplicative nature of the noise in the Langevin equations (33) [29].In addition to sampling from the desired target distribution, it has also been demonstrated that the equations of motion are affine invariant.While ALDI can be used with N ≤ D ensemble members, effectively leading to a linear subspace sampling method, it has also been proven that N > D + 1 and a non-singular initial empirical covariance matrix C N (U 0 ) ensure that |C N (U t )| = 0 for all times and that the equations of motion (33) are ergodic with invariant measure π * .Further computational savings can be achieved through the gradient-free implementation (38) for BIPs as introduced in Example 1.The effectiveness of gradient-free affine invariant sampling methods has been demonstrated for a Darcy flow inversion problem.This example has also demonstrated the significance of the correction term for reducing estimation errors both for N < D as well as for N = O(D) implementations of the ALDI method (33).
A numerical issue which has not been studied in this paper is the choice of an efficient time-stepping method for ALDI.In particular, adaptive and semi-implicit time-stepping methods might be necessary whenever the initial distribution π 0 is not close to the target measure π * .This issue has been studied for the related continuous-time ensemble Kalman-Bucy filter in [2].Wer also reemphasise that multi-model target distributions might require localised empirical covariance matrices in (33) as first suggested in [21] and further explored in [34].
While this paper has focused on a computational implementation of the Kalman-Wasserstein gradient flows proposed in [11], we wish to point out that this gradient flow structure has also become the focus of theoretical studies.We mention in particular [7], which provides a rigorous mean field limit with rates in Wasserstein-2 for the linear case, and [6], which studies the decay for the mean field limit in Wasserstein-2 in the linear case using explicitly the dynamics of the covariance matrix.
Since L vanishes on constants, (75) is equivalent to LV ≤ γV + C for some constant C.Here and in the following, C denotes a generic constant that can change from line to line.Furthermore, by the growth condition on Φ there exists a constant C such that −2V C ≤ V Φ + C. Therefore, it is sufficient to show the bound LV ≤ C(1 + V Φ ).In the remainder of the proof, we achieve the latter bound term-wise for the contributions in (76).
First note that Indeed, again following the notation introduced in Remark 16, we have that To bound the second term in (76), we first notice the estimate again easily obtained from Assumption 10.The other contribution is Since the result is a constant, we clearly have the required estimate of the form LV ≤ C(1 + V Φ ).In conjunction with (80) and (81) the claim follows.
Proposition 11 now essentially follows from adapting [27,Theorem 2.1].Textbook accounts of similar arguments can be found in [17,Chapter 5] and [9,Chapter 2].For the convenience of the reader we provide a self-contained proof: Proof of Proposition 11.The potential V is bounded from below by the growth condition on Φ (see Assumption 10).We can therefore choose a constant c V such that V + := V + c V is nonnegative.Since C N (U 0 ) is assumed to be nondegenerate, there exists k 0 ∈ N such that V + (U 0 ) < k 0 .For k ≥ k 0 , let us define the sets and the stopping times The stopping times τ k are increasing in k, and so the limit for any t ≥ 0 and k ≥ k 0 .On the other hand, where the last estimate uses the fact that V + ≥ 0. Combining (86) and (87), we see that for every t ≥ 0 and k ≥ k 0 .It follows immediately that We now show how to construct a continuous path between arbitrary x, y ∈ E. By density of E, it is enough to find a path between x ∈ E and y ∈ E lying in connected neighbourhoods of x and y respectively.Since E is open, there exist open neighbourhoods U x , U y ⊂ E. It is then sufficient to find points in these neighbourhoods that can be connected by a continuous path.Denoting x = (x (1) , . . ., x (N ) ) and y = (y (1) , . . ., y (N ) ), we can choose a continuous path γ (1) : [0, 1] → R D with γ (1) (0) = x (1) and γ (1) (1) = y (1) , and set γ (1) (t) = (γ 1 (t), x (2) , . . ., x (N ) ).By (92), it is clear that γ (1) (t) ∈ E for all t ∈ [0, 1].By density of E we can perturb γ (1) (1) in order to ensure that γ (1) (1) ∈ E. We can now proceed iteratively to move the remaining particles using paths γ (2) , . . .γ (N ) and concatenate them, yielding the required total path.Note that the perturbation of the endpoints γ ( i) can be chosen arbitrarily small in order to ensure that the final point γ (N ) (1) is in U y .
Proof of Proposition 12. Since the diffusion matrix Γ(u)Γ(u) T is strictly positive definite on E, E is path-connected by Lemma 20, and the process (U t ) t≥0 admits an invariant measure with strictly positive Lebesgue-density by Corollary 9, the process is positively recurrent and irreducible by the result in [18].We also refer to [37, Section 2.2.2.1].The convergence in total variation distance then follows from [26,Theorem 6.1].

Figure 1 :Figure 2 :
Figure 1: Displayed are the initial (top row) and final (bottom row) ensembles of the permeability fields a(x) = exp(u(x)) for N = 25.The left column is from ALDI without correction term (36) while the right column includes the correction term.

Table 2 :
As in Table1, but reporting the results for the ensemble spread (67).results for ALDI indicate that it is entirely sufficient to implement it with N = D + 2 particles; the minimum size required for ergodicity to hold.