How to avoid the curse of dimensionality: scalability of particle filters with and without importance weights

Particle filters are a popular and flexible class of numerical algorithms to solve a large class of nonlinear filtering problems. However, standard particle filters with importance weights have been shown to require a sample size that increases exponentially with the dimension D of the state space in order to achieve a certain performance, which precludes their use in very high-dimensional filtering problems. Here, we focus on the dynamic aspect of this curse of dimensionality (COD) in continuous time filtering, which is caused by the degeneracy of importance weights over time. We show that the degeneracy occurs on a time-scale that decreases with increasing D. In order to soften the effects of weight degeneracy, most particle filters use particle resampling and improved proposal functions for the particle motion. We explain why neither of the two can prevent the COD in general. In order to address this fundamental problem, we investigate an existing filtering algorithm based on optimal feedback control that sidesteps the use of importance weights. We use numerical experiments to show that this Feedback Particle Filter (FPF) by Yang et al. (2013) does not exhibit a COD.

1. Introduction. Filtering problems are rarely exactly solvable with a finite amount of computational resources, requiring numerical techniques in order to approximately represent or sample from the filtering distribution. Particle filters, which have been first introduced by [20], have seen widespread use as general-purpose algorithms to solve nonlinear filtering problems (see [18,21,8] for a general survey). They use sequential importance sampling in order to calculate the filtering distribution. At each time-step, samples or particles are drawn from a proposal density (such as the prior transition probability) and then re-weighted according to the observations. Particle filters have very few restrictions with regards to the type of generative model that underlies the filtering problem; they can be applied to highly nonlinear and non-Gaussian models, which is an advantage compared to the well-known Kalman filters.
However, they suffer from a very severe problem that affects all importance sampling-based algorithms; their efficiency diminishes rapidly with increasing dimension of the state space. This 'curse of dimensionality' (COD) occurs because in highdimensional spaces the importance weights are more likely to be degenerate, i.e. only a few weights are significant and all others are very close to zero. There have been numerous studies on the COD in particle filtering [12,3,5,24], which will be described below in more detail.
In this paper, we first revisit the COD in particle filters with a focus on the dynamic aspect of the problem. We study how the time-scales of weight degeneracy and of the performance benefit due to resampling scale with dimension. We find that the time-scale of weight degeneracy is inversely proportional to the dimensionality and proportional to the logarithm of ensemble size. From this, we obtain an exponential scaling of the ensemble size that is required to obtain a fixed time-scale of weight degeneracy. We argue that because the resampling-induced benefits occur at a timescale that scales weakly with dimension, this exponential scaling is necessary in order to avoid the collapse of the particle filter.
Next, we show that the COD can be avoided by particle filters that do not employ importance weights. Such a filter must sample the posterior distribution directly, and the particle motions have to fully take into account the observations. An unweighted particle filter for the classical filtering problem, the Feedback Particle Filter (FPF) by [31], uses optimal feedback control in order to guide the particle motions according to the observations. As a result, the particles of the FPF are samples from the posterior distribution. Although the authors claim that the FPF method does not suffer from the COD, to our knowledge this has never been explicitly demonstrated. We fill this gap in the literature by demonstrating that the FPF does not suffer from the COD.
1.1. Existing literature on the COD in particle filtering. The COD of particle filters has been studied using simple back-of-the-envelope calculations [12,24] and by proving statements on the convergence of the maximum weight [3,5]. These works reduce the filtering problem to a single Bayesian update step. In doing so, the COD of particle filters is reduced to the general COD of importance sampling and the effects of proposal distributions or resampling on the COD are not explored in detail.
The effect of proposal distributions on particle filtering in high-dimensional problems has received some attention. In discrete time, particle filters with 'smart' proposal distributions (i.e. with improved laws that govern the motion of particles) have been shown to perform well in certain high-dimensional problems [26,27,1], leading some authors to conjecture that the COD could be avoided using carefully crafted proposal distributions. However, it has been argued [25] that the system considered in [1] (and similar systems) was effectively a low-dimensional system disguised as a high-dimensional one. A precise definition of a sequence of filtering problems of increasing dimension therefore requires a notion of 'effective dimension', which also plays a role in the arguments brought forward by [12] and [24]. Special properties of the model sequence, e.g. a low-dimensional dynamical system that is embedded in spaces of increasing dimensions, could potentially avoid the COD as a function of D, but not as a function of effective dimension. Improved proposals are usually designed to reduce weight degeneracy by minimizing the rate of change of the variance of the importance weights. But even the optimal proposal function is not able to completely prevent the weight degeneracy. Moreover, in continuous time it is not known whether an optimal proposal even exists.
Meanwhile, the existing literature offers only a cursory treatment of the effect of resampling on the dimensionality-dependent scaling of particle filters. While it is understood that the most widespread resampling algorithm, multinomial resampling, is suboptimal compared to branching (see [9], [6], [16], [7], and also [2], p.250), we cannot be certain about the existence of a resampling scheme that resolves the COD. The arguments against such a possibility, which are brought forward e.g. in [25] and [6], are mostly heuristic and can be summarized as follows: while resampling resets the particle weights to undo the effects of weight degeneracy, it does not improve the quality of the sample instantaneously. The potential improvement is temporary and comes from particle motions that originate from regions of high likelihood. In the next section we will look more closely at this mechanism and thereby support the explanation provided by [25].
The conjecture that the COD may be avoided by particle filters without impor-tance weights has been brought up in previous works. For example, the (stochastic) particle flow filter [13,15] holds the promise of avoiding the COD [14]. This filter implements an implicit Bayesian update by defining a flow field at each measurement: whenever a measurement is registered, a little loop in synthetic time is inserted, during which the particles evolve according to this flow field, and are thus propagated from the prediction to the posterior density. A similar approach, the continuous-discrete particle filter [29], also relies on inserting an artificial loop at the times of measurement arrival, but the particles are propagated according to an innovation-error based gain feedback structure. Since particle flow filters require a clear separation between prediction and update in their respective filtering algorithms, it is not clear how the results would transfer to continuous-time measurement models, where prediction and update have to be processed simultaneously.

Preliminaries.
We restrict our overall discussion to the classical nonlinear filtering problem in continuous time, where the state X t ∈ R D and the observation Y t ∈ R D are D-dimensional diffusion processes that are solutions to the Itô stochastic differential equations (SDE) where W t and V t are independent D-dimensional Brownian motions. The stochastic filtering problem is to find conditional expectations E ϕ(X t )|F Y t , where ϕ : R D → R is a measurable function and F Y t is the filtration generated by the observation process. Next, we will describe the two approximate filtering algorithms that will be studied and compared in this paper.
2.1. The Bootstrap Particle Filter (BPF). The bootstrap or vanilla particle filter uses importance sampling (IS) to approximate the conditional expectation as where B (i) t are independent Brownian motions and · denotes the standard scalar product on R D . In this formulation, the unnormalized importance weight M (i) t arises as a Radon-Nikodym derivative dP (i) /dP of the measure P (i) under which the observations are generated from the state Z (i) t to a reference measureP under which Y t is a Brownian motion.
It is well established [18] that the BPF suffers from weight degeneracy, i.e. the weights evolve to become less equal, with only a few significant weights and all other weights being negligibly small. As a result, the variance of the IS estimate in Eq. (3) grows and performance drops. It is useful to introduce an 'effective sample size' N eff that measures the number of samples that would be required to match the performance of IS with a MC sampler that samples directly from the filtering distribution. A useful approximation of effective sample size is given by the inverse of the sum of squared weights, (see [23] and references therein). As a measure that only depends on the importance weights and not on the locations of the samples, it is not a good predictor of performance when samples are clustered. In particular, the typical outcome of multinomial resampling from an impoverished sample is that there are only a few distinct particle positions and multiple particles at each of those positions. The new sample is still impoverished and has a large variance despite the fact that the above definition of N eff,t yields a value of N (after resampling, all weights are reset to a value of 1/N ). With this caveat in mind, we assume thatÑ eff,t in Eq. (6) gives an upper bound on the true value of N eff,t . Because of the weight degeneracy outlined above, the particle system has to be frequently resampled. Resampling is performed according to a schedule that is usually based either on regular resampling intervals or on the effective sample size. We use the convention that particle trajectories are right-continuous with left-side limits Z Prob (n 1 , n 2 , ..., The new position at t = t r is then set to the resampled position, i.e. Z tr and all weights are reset to 1/N ←m (i) tr . A discussion of various other forms of resampling can be found e.g. in [9], [6], [16], and [7].
2.2. A particle filter without importance weights: the Feedback Particle Filter (FPF). In [31], a particle filter was proposed that does not require importance weights, i.e. posterior expectations are approximated by unweighted averages, (note the distinction to the weighted average ϕ t in Eq. (3)). By definition, the time evolution of particles fully incorporates the information given by the history of observations. The particle system of the FPF evolves according to the Itô SDEs where K is a (D × D matrix-valued) function, and Ω is a D-dimensional vector-valued function with components given by The function K is the solution to an Euler-Lagrange boundary value problem (EL-BVP). The latter results from an optimal control problem: the function K is chosen such as to make the distribution of particle positions as close as possible to the posterior filtering distribution, and it ensures that the particle distribution converges to the true filtering distribution when K is chosen according to the filtering distribution.
The function K, also called gain function, is analogous to the Kalman gain of the Kalman-Bucy filter for the linear filtering problem. As such, K (and therefore Ω) introduces interactions between the particles, in contrast to the BPF, where particles evolve independently from each other between resampling times. Another interesting aspect of the FPF is the structure of the innovation term (Eq. (9), term in square brackets) that multiplies the gain function: the new observation dY t is compared to the arithmetic mean of the individual particle's estimate h(Z (i) t ) and the population estimateh t . The function Ω comes from the conversion from Stratonovich form to Itô form.
In practical implementations of the FPF, the main difficulty lies in the solution of the EL-BVP. In multiple dimensions (D > 1), naïve approaches to the EL-BVP turn out to be computationally expensive. This problem has recently been addressed in [30] using a Galerkin approximation of the gain function. We use the constant gain approximation from that paper, which corresponds to a Galerkin approximation with coordinate functions as basis functions. With this choice, the gain function is which is constant in z (but not in time) and corresponds to the sample covariance matrix between the particle positions and the observation function h. It can be shown that for a linear filtering problem, the constant gain approximation equals the Kalman gain in the limit of infinite number of particles [30]. In addition, under this approximation the FPF gives rise to almost the same particle dynamics as in the ensemble Kalman-Bucy filter (EnKFBF) [4], only differing slightly in the estimation of the covariance matrix from the observations.

Results.
3.1. Illustration of the curse of dimensionality for a linear problem. We consider the linear D-dimensional problem where the numerical factors are chosen in order to ensure a unit prior variance and time-constant and a (one-dimensional) mean squared error of the optimal filter E[(X i,t − X i,t ) 2 ] = 1/2 across all values of D. For the linear system in Eqs. (12,13), the bootstrap particle filter (BPF) is given by where i = 1, ..., N and Z We study the effect of the number of dimensions D on the time-scale on which the decay ofÑ eff,t occurs. In Fig. 1 we show numerical results for the trial-averagedÑ eff,t , illustrating the decay ofÑ eff,t as a function of time for N = 10 4 and D = 10, 20, ..., 50. The decay ofÑ eff,t critically limits the performance of the filter. We measure the performance by the mean squared error (MSE) of the particle estimate of the hidden state where the average is estimated numerically by averaging independent trials, as in the case of expected stopping time. In our numerical experiment, the initial value is MSE 0 = 1 due to the initialization of the particles according to the prior distribution. The MSE reaches an asymptotic value of two as the effective sample size goes to one. As we show in Fig. 1 bottom, for low values of D the MSE has a transient dip that is followed by a gradual increase towards the asymptotic value. The dip becomes shallower and the increase becomes faster as the dimension D increases. For very high D, the dip disappears and the MSE increases immediately.
In order to quantify the time-scale of weight decay, we define the following expected stopping time, the mean first-passage time ofÑ eff,t through n, We first give a rough analytical estimate of the dependence of T (D, N, n) on D. The time evolution of the normalized weight can be written as (19) dm for the linear model. From this we obtain which consists of scalar products of D-dimensional vectors. Each of the vector components is initially of order 1, independently of D. Even in the best of cases, i.e. if the particle positions are samples from the true posterior distribution, each of the vector components would be of the order of the true posterior standard deviation, which is equal to 1/ √ 2 independently of D. Thus the initial magnitude of change ofÑ eff,t , which is the inverse of the sum of the squared weights, is also proportional to D. From this, a rough estimate of the scaling is T (D, N, n) ∝ D −1 . In Fig. 1 we numerically estimated T by using trial averages, and we show the results as a function of D for different values of N . This confirms that the scaling is close to D −1 .
Next, we study the dependence of T (D, N, n) on N , for which we rely on a numerical investigation. The results are shown in Fig. 2 for n = 10 and D = 10, 20, ..., 50. We can see that as D increases, N has to increase exponentially in order to achieve a fixed T , or in other words T (D, N, n) ∝ log N .

The effect of resampling.
If the criterion to resample the particles isÑ eff,t = n crit , the rate at which resampling occurs is tied to the time-scale of weight degeneracy. The immediate implication is that with a fixed ensemble size N , resampling occurs more frequently in higher dimension, with resampling rate roughly proportional to D. After resampling,Ñ eff,t is reset to N , and since the resampled particles are located at positions where the likelihood of observations is high, the initial decay ofÑ eff,t is a bit slower than for an initialization from the prior. However, since resampling does not add any new information about the true state, it cannot lead to an immediate performance increase.
The benefit of resampling is that particles with vanishing weights are discarded, and all computational efforts are expended for particles that are in an interesting region of state space. It is therefore expected that resampling shows a delayed effect due to the diffusion of particles away from resampled positions. For example, in the extreme case where n crit ≈ 1, all particles will typically be resampled at the same location. The particles have to diffuse away from their initial position such that their empirical variance is of the same order of magnitude as the (true) posterior variance. The time that is needed to reach such a state is related to the inverse of the squared diffusion coefficient and therefore independent of dimension.
In order to quantify the delay between the resampling time and the time where the MSE has maximally decreased due to resampling, we introduce the measure τ MSE , which is defined as the inverse absolute value of the slope of a linear fit to the initial portion of the MSE time-course (after resampling). In the right panel of Fig. 3 we show τ MSE as a function of dimension D. The time-scale τ MSE decreases with D, but it tends to a constant value for very high D. This saturation is in sharp contrast to the time-scale of weight degeneracy, which continues to decay with D −1 . The consequence is that in high dimensions, the weight decays during an interval which is much shorter than the delay required for the resampling to have a beneficial effect. Resampling is therefore ineffective in high dimensions, and it does not remove the need for exponentially large ensembles.

Optimal proposals in continuous time.
In the literature on discrete time particle filters, optimal proposals for the particle motion have been shown to greatly reduce the required ensemble size. However, proposals for discrete-time filters are not directly applicable to the continuous-time case. For example, it is straightforward to show (c.f. Appendix A) that the optimal particle filter by [17] collapses to a bootstrap particle filter as the time discretization step goes to zero. Since the re-weighting of samples in the continuous-time particle filter depends on the mutual absolute continuity of the hidden state process and the particle process, the class of admissible particle motions is restricted. Two diffusion processes are mutually absolutely continuous if and only if they differ by a pure drift term. The SDE for the particle motion can therefore differ from the hidden state SDE by at most a drift term F t : where F -adapted D-dimensional process. The corresponding (unnormalized) weight will evolve as From this starting point, it could be interesting to formulate a stochastic control problem in order to choose the processes F (i) t such as to minimize the effects of weight degeneracy. There is some work in this direction [28] for continuous-discrete filters in the small observation noise regime, but we are not aware of any work on generalizing this to the continuously observed case (and for large observation noise).
It is also interesting to note that both the optimal proposal and the FPF have the aim of redirecting the particles based on the observations. However, the particle motion of an importance sampling-based particle filter is more heavily constrained in order for the importance weights to exist. Even the general form of Eq. (21) is less general than the motion of the particles in the FPF, which include an explicit dY t term. It remains an open problem to reconcile the two frameworks of particle filtering.
3.3. Dimensionality-dependent scaling of the BPF vs. FPF. We study the minimum ensemble size that is required in order to reach a certain performance (measured as mean squared error) where MSE is the expected time-averaged mean-squared error defined as for the BPF for the FPF.
In simulations, this integral is estimated using a Riemann sum over discrete time-steps, t 1 is chosen to be 5000 time units, for which we can drop the expectation because of the ergodicity of the process. Particles of the BPF are resampled wheneverÑ eff /N ≤ 0.1. For our particular linear toy model, the theoretical range of sensible values of ǫ is 0.5 < ǫ < 2, and we can set N 0.5 (D) = ∞ and N 2 (D) = 1 because the performance of the optimal filter is set to produce an MSE of 0.5 and both filters go back to the prior for N = 1, which yields an MSE of 2. However, the practical range of ǫ is severely restricted by the runtime of the simulations, especially for the BPF.
The results for ǫ = 1 and ǫ = 0.85 are displayed in Fig. 4 left. Already in 10 dimensions, the FPF starts out with a significant advantage, requiring only four particles vs. 13 for the BPF. For larger dimensions, the ensemble size increases rapidly, reaching a value of 421 by D=100 and 6434 by D=200. Simulations for D > 200 were too time-consuming to run. Meanwhile, the FPF requires only 15 particles for D = 100 and an estimated 25 particles for D = 200. We ran simulations of the FPF up to D = 1000, where it requires merely 111 particles.
A least-squares fit of the numerical data reveals an exponential scaling for the BPF and merely linear scaling for the FPF (see Fig. 4 caption). The FPF thus requires roughly one additional particle with every increase of the dimension by ten. In contrast, for the same increase in dimension, the BPF requires a factor of 1.3 more particles.
In terms of computation time for a fixed performance of MSE = 1, we find that the BPF requires a run time per time-step that scales exponentially with dimension (see Fig. 4 right). In contrast, the FPF shows a cubic scaling, which is expected because the gain function scales quadratically with the ensemble size and the latter scales linearly with dimension. Interestingly, the FPF requires more computation time for low dimensions, despite using fewer particles to achieve the same performance as the BPF. However, we did not heavily optimize our code for performance, and we expect that the runtime could be reduced to compete with the BPF in low dimensions using more careful programming.
4. Discussion. In this paper, we revisited the problem of curse of dimensionality (COD) in the standard particle filter. We considered the case of the classical filtering problem with a continuous time index. Even though the COD has been studied before, all the existing literature considers only one Bayesian update step and implies a discrete-time treatment. Here, we closed this gap by studying the full dynamic nature of the problem in continuous time.
The discrete-and continuous-time particle filters have some important differences. The class of possible proposal distributions is larger in discrete time, where it is only restricted in terms of practicality by virtue of tractability of the transition kernel. In discrete time, it has been shown that even the optimal proposal distribution does not avoid the COD [25]. In continuous time, the law of the particle motion has to be absolutely continuous with respect to the law of the hidden state. It is an open problem to show that there are nontrivial proposals that minimize the weight degeneracy.
There has been a general consensus that the problems of particle filters in highdimensional problems result from importance sampling. It has therefore been conjectured that a particle filter without importance weights could work efficiently in high dimensions. Such a filter, the Feedback Particle Filter (FPF), has recently been proposed by [31]. There have been other related approaches to filtering with unweighted particles, e.g. in [11] and [10], Chapter 23, the Ensemble Kalman (Bucy) Filter [19,4] and the Neural Particle Filter in [22]. Both the EnKBF and the Neural Particle Filter are mathematically similar to the FPF with constant gain approximation, and the latter differs only in terms of the structure of the innovation term. Despite the promise of unweighted particle filters, their efficiency in high dimensions has not been thoroughly demonstrated so far. This work complements the analyses in [30,22] by providing numerical evidence that the FPF requires only a polynomial ensemble size and computation time as a function of dimensionality and thereby avoids the COD. Future analytical work will hopefully shed light on the mechanisms that underly this finding.
Based on this important result, we want to draw the attention of researchers to the FPF and similar unweighted approaches. Particle filters without importance weights are promising algorithms for solving very high-dimensional problems. This opens up new perspectives in applied fields such as geophysics and meterology, especially numerical weather prediction, where data assimilation typically requires the solution of very high-dimensional filtering problems.
Appendix A. Continuum limit of the optimal proposal from [17]. In order to apply the optimal importance function that is given in [17], Example 3, we have to introduce a time-discretized version of our diffusion process: where ǫ k and η k are multivariate Gaussian random variable with mean zero and unit covariance matrix. For constant g(x) = G and linear h(x) = W x, this corresponds to Eqs. (9,10) in [17] with n x = n y = D, f (x) = x + f cont (x)dt, Σ v = dtGG ⊤ , C = W dt, and Σ w = dt1 D×D , where the increment Y k − Y k−1 was identified with y k of [17] in order to preserve the non-regressive form of y. Taking Eqs. (11,12) in [17] and expressing the mean m k and covariance matrix Σ of the conditional distribution x k |(x k−1 , y k ) in terms of the quantities of the continuous-time model, we obtain For small dt, the covariance matrix can be approximated as Σ ≈ dtGG ⊤ , and therefore we obtain (29) m k = x k−1 + f cont (x k−1 )dt + dtGG ⊤ W ⊤ y k .
The last term, when expressed in terms of the continuous-time process,