Model error estimation using the expectation maximization algorithm and a particle flow filter

. Model error covariances play a central role in the performance of data assimilation methods applied to nonlinear state-space models. However, these covariances are largely unknown in most of the applications. A misspecification of the model error covariance has a strong impact on the computation of the posterior probability density function, leading to unreliable estimations and even to a total failure of the assimilation procedure. In this work, we propose the combination of the expectation maximization (EM) algorithm with an efficient particle filter to estimate the model error covariance using a batch of observations. Based on the EM algorithm principles, the proposed method encompasses two stages: the expectation stage, in which a particle filter is used with the present updated value of the model error covariance as given to find the probability density function that maximizes the likelihood, followed by a maximization stage, in which the expectation under the probability density function found in the expectation step is maximized as a function of the elements of the model error covariance. This novel algorithm here presented combines the EM algorithm with a fixed point algorithm and does not require a particle smoother to approximate the posterior densities. We demonstrate that the new method accurately and efficiently solves the linear model problem. Furthermore, for the chaotic nonlinear Lorenz-96 model the method is stable even for observation error covariance 10 times larger than the estimated model error covariance matrix and also is successful in moderately large dimensional situations where the dimension of the estimated matrix is 40 \times  40.

1. Introduction. Several research areas in which the object of research is a complex system, such as the atmosphere, the ocean, and biological systems, require estimating the state of the system through partial observational information which is distributed in time. Surrogate models of the system, which represent approximately the time evolution of the variables, are included as a source of information in the state estimation. This has a twofold aim; it regularizes the state estimation at a time and propagates information of the system and its uncertainty to the subsequent times. This estimation problem was called data assimilation in geophysical sciences [8,18], and this terminology has subsequently been popularized in other Likelihood-based methods for model error covariance estimation may be broadly classified into two groups: maximum likelihood estimation and Bayesian inference. Maximum likelihood estimation methods have been applied assuming that covariance parameters are deterministic [5,17,35,36]. In this case, non--a priori information on the parameters is required. When the gradient of the likelihood can be obtained analytically, Newton--Raphson optimization methods may be applied to maximize the likelihood function. When this is not feasible, one of the most popular methods for maximum likelihood estimation is expectation maximization (EM) [10]. One of the reasons for its widespread use is that it is readily applicable. In particular, contrary to Newton--Raphson optimization methods, EM does not depend on tunable parameters. The second group is a Bayesian approach in which the model error covariance parameters are interpreted as stochastic and the prior distribution of the parameters needs to be given. In this case, some hypotheses on the correlations between state-space variables and parameters are required [28]. Some authors [22,33] assume that the uncertainty in the parameters does not affect the state density, while the authors of [29] use the marginalization of the hidden state for the parameter estimation in a hierarchical Bayesian framework.
In this article, we propose a new method based on a maximum likelihood approach to estimate the model error covariance matrix in state-space models. The article is organized as follows. In section 2 we formulate the problem, giving a brief overview of PFs and some details of a recently proposed particle flow filter to be used in the experiments before describing the proposed method to estimate the model error covariance matrix in subsection 2. 3. Further details about its derivation are given in Appendix A and Appendix B. The proposed method is tested on a simple autoregressive linear model and on the Lorenz-96 model with 8 and 40 dimensions. The experiments' design, results, and comparison with existing methods are shown in section 3, while in section 4 we present some conclusions.

Methodology.
2.1. Problem formulation. Consider a state-space model consisting of a nonlinear system and a nonlinear observation model, described by x k = \scrM (x k - 1 ) + \bfitbet k , (2.1) y k = \scrH (x k ) + \bfiteps k , (2.2) where x k \in \BbbR Nx , called the state vector, is a hidden or latent process, and \{ y k \} K k=1 is a time series of observations measured at times k = 1, . . . , K with y k \in \BbbR M . The maps \scrM (.) and \scrH (.) denote the nonlinear deterministic dynamical model and the (possibly nonlinear) observation operator, respectively. The state process is assumed to be a Markov process, whereas the conditional density of the observations y k depends only on the current state x k for k = 1, . . . , K. The stochastic term \bfitbet k accounts for the missing physics in the model and its numerical approximations, whereas \bfiteps k is the observation noise. In this work, we assume that observational and model uncertainty are additive and \bfitbet k \sim \scrN (0, Q k ) and \bfiteps k \sim \scrN (0, R k ), where Q k and R k belong to the subspace of positive definite matrices in \BbbR Nx\times Nx and \BbbR M \times M , respectively, representing the model error and observation error covariance matrices at time k. We denote by \bfitthe k \in \Theta the vector of parameters (Q k , R k ).
Since \scrM and \scrH are surrogate models which approximate the system evolution and the processes that relate the observations with the hidden state, the model error and observa-tional error covariances are expected to be largely unconstrained physically. Observational errors may be partially constrained from the knowledge of measurement errors, whereas representation errors, arising from the fact that models and observations often represent reality differently, are hard to determine in practice. Therefore, the parameters \bfitthe k of the state-space model are unknown and need to be inferred from the data as well as the hidden state x k . In principle, unknown parameters from the dynamical and observational models may also be included in \bfitthe k [24]; in this work we consider these parameters to be provided. In this parametric inference problem we need to assume an a priori underlying error distribution; the assumption \bfitbet k \sim N (0, Q k ) was then taken. The technique here proposed could be extended to \bfitbet k belonging to an exponential family, as done in [7].
We assume that the model error covariance Q k and the observation error covariance R k vary slowly within K-cycles (the temporal scale of covariance variations is longer than Kcycles), that is, Q k = Q, R k = R \forall k = 1, . . . , K, and propose a method based on a timebatch of observations to estimate the model error covariance Q for particle filters (PFs). The information provided by the observations along the K times, y 1:K , where the subindices 1 : K denote the set \{ y 1 , . . . , y K \} , is considered essential to regularize/constrain the N x \times N x unknowns from Q. The coupling between observations at different times is produced through the dynamical model \scrM .
The method presented here is based on maximum likelihood estimation: given a set of independent observations \{ y k , k = 1, . . . , K\} from a probability density function (pdf) represented by p(y 1:K ; \bfitthe), a nonlinear dynamical model \scrM , and an observation operator \scrH , we seek to maximize the likelihood of the observations as a function of the statistical parameters Q given an observation error covariance R in the presence of a hidden state x 0:K . The statistical parameters to be estimated are assumed to vary slowly. This is an essential and unavoidable assumption in the state-space framework because this framework usually considers a single realization of the observation at each time. Hence, a time range with a long set of observations, say K \geq 100, is required to constrain the statistical parameters in which the model error covariance is assumed to be slowly varying. Online maximum likelihood techniques (e.g., [1,7]) and hierarchical Bayesian inference [29] also need to assume slow variations in the statistical parameters.

Particle filters.
State-space models are generally used in sequential data assimilation to estimate or reconstruct the hidden state x k given the observations y 1:K . This can be done by computing the filter densities \{ p(x k | y 1:k )\} k=1:K or smoother densities \{ p(x k | y 1:s )\} k=1:K with s \geq k. Having prior knowledge of the initial state x 0 , that is, given a prior pdf p(x 0 ), the posterior pdf of a filter is the probability of the model state at time k, given all the available information up to time k. In a Markovian system with observations y k that are conditionally independent given the state, the filter densities can be computed recursively using Bayes's rule to obtain the posterior pdf where the prior pdf p(x k | y 1:k - 1 ; \bfitthe) is the forecast or prediction pdf, p(y k | x k ; \bfitthe) is the observation likelihood defined by the observation model \scrH and the distribution of the observation error \bfiteps k , and p(y k | y 1:k - 1 ; \bfitthe) is a normalizing factor. Note that we consider here the marginalized posterior pdf, in which the only state variable that is estimated is the current one, considering all the past and the current observations. When the dynamical and observational models are linear and their errors Gaussian, the filter densities are Gaussian and the state can be computed using Kalman recursive algorithms. However, for nonlinear models and/or nonlinear observational functions it is not possible to get a known distribution or a closed form for these filter pdfs, and they should be somehow approximated. Classical PFs [2,12,38] are based on sequential importance sampling and resampling algorithms and provide different methods to approximate these pdfs.
The basic idea behind a PF is to represent the posterior pdf p(x k | y 1:k ) by a set of N p particles \{ x (j) k \} j=1:Np with corresponding weights \{ w That is, at time k, p(x k | y 1:k ; \bfitthe) is approximated by where . = denotes approximation by an ensemble of particles and \delta (\cdot ) is the Dirac \delta function. Initially, a set of N p particles\{ x (j) 0 \} j=1:Np with corresponding weights \{ w Np \} j=1:Np is drawn from the prior pdf p(x 0 ). These particles are sequentially evolved in time using a forecasting, weighting, and resampling scheme to obtain \{ x (j) k \} j=1:Np and \{ w (j) k \} j=1:Np at each time step k. Different PFs were proposed depending on the resampling, forecasting, or weighting approaches taken [2,25,43].
2.2.1. Variational mapping particle filter. The variational mapping PF (VMPF) is a particle flow filter which is based on optimization and Monte Carlo sampling. At each cycle, a particle flow filter [4,9] drives the particle samples from the prior density towards a desired target density. In the VMPF, the particles are moved deterministically via a sequence of maps, based on the optimal transport principle. The maps seek to minimize the Kullback--Leibler divergence (KLD) between the target density, i.e., the posterior density p(x k | y 1:k ), and an intermediate density q(x k ), that is, the density represented through the sample, the set of particles, at a given cycle k. At the ith optimization iteration, the KLD is given by where q i (x k ) is represented via N p sample points, i.e., particles, x which is a small perturbation to the identity map (\epsilon is assumed to be small). This means The maps are assumed to be in a reproducing kernel Hilbert space (RKHS).
The optimal map that gives the steepest descent direction in the RKHS is shown to be given by where K is a kernel (assumed here to be Gaussian) [25]. Then the optimization is a sequence of (sufficiently smooth) mappings in which each particle (j) is moved along the steepest descent direction If observational and model errors are assumed to be Gaussian, the gradient of the logposterior density is where \bfitbet . They could be interpreted as weights of the forecast states which consider the distance between the particles to the point under consideration. A more detailed description of the VMPF is found in [25]. Recently, a generalization to embed also the observational operator in the RKHS was proposed in [26].
One of the main advantages of the VMPF is that it not only efficiently samples highdimensional state spaces with a limited number of particles but also does not suffer from sample impoverishment. Since (2.6) moves the particles toward regions of high observation likelihood, by construction the resulting sample has a large effective sample size close to the number of particles. Thus, this particle flow filter does not require resampling even for long recursions, as shown in [25].

Parameter estimation.
Let us assume that p(\cdot ; \bfitthe) is a parametric distribution, with \bfitthe \in \Theta the parameter space. Given a set of observations y 1:K = \{ y k , k = 1, . . . , K\} taken along a time interval of length K, a maximum likelihood estimation method aims at finding the value of \bfitthe that maximizes the (incomplete) likelihood of the observations, The likelihood function (2.9) can be interpreted as how probable the set of observations y 1:K would be for different choices of \bfitthe. An analytic form for the log-likelihood function is not achievable in practice, and the numerical evaluation of (2.10) may involve high-dimensional integrations, which is intractable. In some situations the optimization task can be accomplished by using numerical optimization routines like Newton--Raphson techniques to solve the nonlinear equations obtained by differentiating the log-likelihood function (2.10) [5,16,24]. However, even in these particular situations, other methods are preferable due to the difficulty of implementing optimization methods and tuning their parameters. Gradient optimization methods may not be stable numerically for certain sets of parameters.
The expectation maximization (EM) algorithm [10] is a widely used numerical method that aims at maximizing the log-likelihood of the observations as a function of the statistical parameters \bfitthe in the presence of a hidden state x 0:K in successive iterations without the need to evaluate the complete log-likelihood function.
Starting from an initial parameter \bfitthe 0 , the two steps of the EM algorithm at iteration s can be summarized as follows: \bullet Expectation Step (E-Step): Calculate the required densities to compute the intermediate function \scrG (\bfitthe s - 1 , \bfitthe) as in (2.12). \bullet Maximization Step (M-Step): find \bfitthe s = max \bfitthe\in \Theta \scrG (\bfitthe s - 1 , \bfitthe). The assumptions of a hidden Markov model and mutually independent observations in (2.1) and (2.2) allow us to express the joint pdf p(x 0:K , y 1:K ; \bfitthe) as To estimate the parameters of the state-space model using the EM algorithm, the expectation of this last pdf under the conditional (smoother) pdf p(x 0:K | y 1:K ; \bfitthe \prime ) must be computed in the E-Step of the algorithm. In the case of a linear Gaussian model this can be accomplished by means of a Kalman smoother [31]. This was further extended to the ensemble Kalman filter in [14,24]. When the dynamical or observational model is nonlinear and therefore the joint density non-Gaussian, the expectation in (2.11) may be intractable, and a different approach must be taken. A generally used approach is to approximate the expression in (2.11) by generating samples of the smoother pdf p(x 0:K | y 1:K ; \bfitthe \prime ) using a particle smoother [3,17,21,23]. However, the use of particle smoothers in data assimilation represents a computational challenge, since they not only tend to degenerate rapidly but also have a poor performance in moderate to high-dimensional spaces, particularly if the time sequence is long (large K).
The requirement of a particle smoother in the E-Step of the EM algorithm is due to the fact that the likelihood of the observations p(y 1:K ; \bfitthe) is usually obtained by marginalizing the joint pdf p(x 0:K , y 1:K ; \bfitthe) over the whole state x 0:K (cf. (2.9)).
Instead of using this last expression for the likelihood of the observations, and following the notation of [6], the likelihood of the observations (model evidence) can be decomposed as p(y 1:K ; \bfitthe) = \prod K k=1 p(y k | y 1:k - 1 ; \bfitthe), with the convention y 1:0 = \{ \emptyse \} . Marginalizing this last expression, we obtain p(y 1: and therefore we can rewrite the logarithm of the incomplete likelihood (2.9) as (2.15) l(\bfitthe) = log p(y 1: Using this last expression instead of the commonly used expression (2.10), the intermediate function \scrG (\bfitthe \prime , \bfitthe) of the EM algorithm can be written as as shown in Appendix A. Note that this last expression for \scrG (\bfitthe \prime , \bfitthe) is written in terms of filter and forecast pdfs, while smoother pdfs are no longer required. As the denominator of (2.16) does not depend on \bfitthe, maximizing \scrG (\bfitthe \prime , \bfitthe) with respect to \bfitthe is equivalent to maximizing So far \bfitthe denotes the set comprised by (Q, R), the model error and observation error covariance matrices, respectively. As there is a certain degree of knowledge about the instrument's noise and how observations are measured or obtained, R is usually determined empirically in practice by estimating these noises and the errors between the state and observation space. However, this is far from being an accurate representation of R, and during the last few years great effort has been dedicated to the study and estimation of the observation error covariance matrix, and many works have been published on these topics (see [20,32,36,37,39] and references therein). Some of these works focus on studying the most plausible structure for R given the nature of how observations were measured, taking into account correlations between observation errors, while some others just propose a fixed structure for R and a methodology to estimate it. What these different approaches have in common is that they assume that the model error covariance matrix is already given, or known. The model error covariance matrix Q is, perhaps, the most difficult one to estimate or determine, since it accounts for the model inaccuracies and deficiencies in representing the missing underlying physics, the errors in parameterizations, the unresolved and smaller scales, and the numerical schemes used. Some works have been devoted to the joint estimation of (Q, R), and an up to date and detailed review of these techniques is presented in [34].
The main purpose of this work is to provide a method to estimate Q, the covariance matrix of the model error \bfitbet. Assuming that R is known, replacing \bfitthe = Q and having in mind that by hypothesis the density p(y k | x k ; Q) is assumed to be independent of Q, starting from an initial guess Q 0 , the two steps of the EM algorithm at iteration s can be summarized as follows: \bullet Expectation Step (E-Step): Calculate the required densities to compute the intermediate function where \Theta is the space of positive definite matrices of order N x . Using a PF, the posterior pdf p(x k | y 1:k ; Q s - 1 ) can be approximated by a set of particles and their corresponding weights as in (2.4) with \bfitthe = Q s - 1 .
As we assume a Gaussian model noise \bfitbet \sim \scrN (0, Q) (see subsection 2.1), the transition density is given by \bigr\} . Therefore, the prediction density p(x k | y 1:k - 1 ; Q) can be approximated by k - 1 are the ith weight and particle, respectively, obtained by a PF at time step k -1.
Combining these approximated pdfs and computing the required expectations, the intermediate function \scrG (Q s - 1 , Q) given in (2.19) can now be written in terms of N p particles as Differentiating (2.21) with respect to Q, we can determine the root of \partial\scrG (\bfQ s - 1 ,\bfQ ) \partial\bfQ = 0 to obtain the maximum of the intermediate function \scrG (Q s - 1 , Q) at iteration s. By doing this, we obtain (Appendix B) We have to take into account that particles and weights indexed by j, that is, x (j) k and w j k,\bfQ s - 1 in (2.22), are computed in the expectation step using Q s - 1 as the model error covariance matrix. On the other hand, particles and weights indexed by i, that is, x (i) k - 1 , w i k - 1,\bfQ , should be computed with a filter that assumes Q as the model error covariance matrix. However, this Q is the matrix to be found in the maximization step.
Summarizing, the two steps of this EM algorithm using a PF at iteration s are as follows: \bullet E-Step: Given Q s - 1 , use a PF with a model error covariance Q s - 1 to calculate the weights and particles which are needed for the intermediate function \scrG (Q s - 1 , Q). \bullet M-Step: Solve (2.22). The M-Step at iteration s requires solving an implicit equation for the covariance matrix Q, where the inverse of Q is involved. In this work we propose to solve this equation using a fixed point algorithm in Q. This fixed point algorithm requires extra iterations at each iteration of the EM algorithm; however the conducted experiments showed that fewer than six iterations of the fixed point algorithm are enough to satisfy the required stopping criteria. The Banach fixed point theorem, or contractive mapping theorem, guarantees the existence (and uniqueness) of a fixed point of certain mappings (functions) defined on a complete metric space, as long as these mappings are contractive. It is not possible to show analytically that the function involved in our fixed point algorithm is contractive due to the nonlinearity in the dynamical model, and since we are not under the hypothesis of the Banach fixed point theorem we cannot guarantee that our algorithm converges to a fixed point. What we can guarantee, based on empirical evidence, is that the algorithm stops after a few iterations, satisfying a stopping criterion based on the Frobenius norm defined below.
The pseudocode of the proposed algorithm is presented in Algorithm 2.1. Within this pseudocode, PF (Q) indicates that the particles and corresponding weights are obtained by using a PF with Q as model error covariance matrix. The algorithm evaluates the fixed point function, (2.22), to obtain the new updated value of the parameters.
The stopping criterion for the fixed point algorithm (FP) is defined as either stop \mathrm{ (Q \mathrm{ \mathrm{ , , where | | \cdot | | F is the Frobenius norm, is smaller than a previously set threshold, or when the maximum number of fixed point iterations is reached. The algorithm in its current form does not assume any a priori knowledge of the structure of Q, and therefore the full matrix is estimated. That is, at each iteration of the EM algorithm the full covariance matrix Q is computed (M-Step) by means of a fixed point algorithm. The updated value of Q, at the mth iteration of the fixed point algorithm, is given by Q m = f \bfQ EM (Q m - 1 ), where Q m - 1 is the matrix computed in the previous iteration of the fixed point algorithm and f \bfQ EM (Q m - 1 ) is calculated using the formula for f \bfQ EM (Q m - 1 ) given in (2.22). Hence, the resulting Q m is a linear combination of the positive semidefinite matrices B j,i k multiplied by nonnegative real numbers (0 \leq w (j) k,\ast < 1, 0 < \psi j,i k , and S j.i > 0), and therefore the updated value Q m is also a positive semidefinite matrix.
In practice, these numerical calculations appear to be robust to roundoff errors so that it is not required to impose any further restriction to guarantee the positive semidefiniteness of Q. For higher-dimensional problems, symmetry may be assumed a priori, and the upper triangular matrix of the model error covariance may be obtained via sparse matrix multiplication algorithms.
Algorithm 2.1. EM algorithm to estimate Q using a PF and a fixed point algorithm.
Numerical experiments' design. In order to evaluate the capabilities and performance of the proposed methodology, numerical experiments were designed using two different dynamical models \scrM , a univariate linear Gaussian model and the Lorenz-96 model [19]. For each of these models we conducted twin experiments with different settings. We first generated a set of noisy observations using the dynamical model with known parameters. Then, using these synthetic observations and the same stochastic dynamical model, we estimated the model error covariance Q with the proposed algorithm and compared the results with those obtained with some classical methodologies. These experiments are useful to assess the convergence and performance of the proposed methodology.
3.1. Linear model. A one-dimensional linear Gaussian state-space model is defined as where \{ x k \} k=0:K , \{ y k \} k=1:K \in \BbbR , \beta k \sim \scrN (0, Q), \epsilon k \sim \scrN (0, R), \nu is the autoregressive coefficient, and Q, R are the error variances. The implementation of the EM algorithm to estimate the parameters of this linear Gaussian model using the Kalman filter and smoother was first discussed by Shumway and Stoffer in [31], whereas in [30] the same authors provide a more detailed derivation of this implementation. Discussions of the convergence of the EM algorithm for this model can be found in [11,40,42]. A set of one-dimensional experiments was conducted by generating K = 100 noisy observations y 1:K using the linear model (3.1) with known parameter values \nu = 0.8, Q true = I 1 , R = I 1 . Then, using the same model, these synthetic observations, and an initial guess Q 0 sampled from a uniform distribution \scrU [0.5, 1.5], the model error variance Q was estimated by four different algorithms: \bullet EM+VMPF: The algorithm here proposed, which is a version of the EM algorithm for a PF without the need of a particle smoother. The PF used is the variational mapping particle filter (VMPF) described in subsection 2.2.1. \bullet EM+SIR: The algorithm here proposed, coupled with a classical sampling importance resampling (SIR) filter [2]. \bullet EM+KF+KS: The classical EM algorithm based on the Gaussian assumption that requires a Kalman filter and smoother as presented in [30]. \bullet EM+EnKF+EnKS: A version of the EM algorithm in conjunction with the ensemble Kalman filter and an ensemble Kalman smoother as presented in [24] with N p = 50 ensemble members. The procedure was repeated independently 50 times for each algorithm in order to have an empirical distribution of the estimators. That means the``true"" state and a set of observations \{ y 1:K \} were generated independently for each experiment realization, and an independent initial guess Q 0 was sampled from a uniform distribution \scrU [0.5, 1.5]. Then the model error covariance Q was estimated. This procedure was repeated 50 times for each algorithm. With these results we can have an approximation of the empirical distribution of the estimators.
As the algorithm here proposed is suitable to be used with any PF, we tested the EM+PF algorithm with two different PFs, namely the classical SIR filter with N p = 1000 particles (for a detailed explanation see [2]) and the recently proposed PF based on optimal transportation and referred to as the variational mapping particle filter (VMPF) (subsection 2.2.1 and [25]) with N p = 20 particles. Figure 1. \\bfQ as function of the iteration number using the EM+VMPF (blue), EM+SIR (orange), EM+KF+KS (green), and EM+EnKF+EnKS (red) algorithms, for the linear model (subsection 3.1) with true parameter value \bfQ true = \bfI 1, \bfR = \bfI 1, \nu = 0.8, and K = 100. The 95\% confidence intervals were generated by running 50 independent repetitions of these estimation experiments and 25 EM iterations. For the EM+VMPF only N p = 20 particles were used, whereas for the EM+SIR and the EM+EnKF+EnKS Np = 1000 particles and Np = 50 ensemble members, respectively, were used. Figure 1 shows the mean and 95\% confidence intervals obtained by the 50 independent repetitions of the experiments for each algorithm. The EM+VMPF, EM+KF+KS, and EM+EnKF+EnKS algorithms show a similar behavior, stabilizing after 5 to 6 iterations of their respective EM versions; the EM+EnKF+EnKS one is the less biased (in terms of the mean value of the confidence interval). The 95\% confidence intervals for these three methods include the true value Q true = I 1 . With only N p = 20 particles the EM+VMPF algorithm results are very similar to the EM+KF+KS ones. On the other hand, for this simple linear model and using N p = 1000 particles, the results obtained by using the SIR filter are biased, and the 95\% confidence interval never reaches the true value of Q true = I 1 . In the experiments that follow, we only use EM+VMPF for the PF implementation of the proposed algorithm. Experiments coupling the EM algorithm with the SIR filter for high-dimensional state and observational spaces, including the Lorenz-96 system, are not feasible due to the large number of particles required [25].

Lorenz-96 model.
In this section we show results of twin experiments using the chaotic and nonlinear Lorenz-96 system as the dynamical model \scrM in (2.1). The Lorenz-96 is one of the most used toy models in data assimilation within the geoscience community due to its ability to mimic certain properties of the atmospheric predictability at a low computational cost.
It is defined by the ordinary differential equations where X n is the state variable of the model at variable (position) n, n = 1, . . . , N x , and the domain is assumed to be periodic, that is, X - 1 \equiv X N - 1 , X 0 \equiv X N , X N +1 \equiv X 1 . F is the forcing constant, and, as usual, here it is set as F = 8 to represent chaotic dynamics. We used a fourth-order Runge--Kutta scheme with a model time step of \delta t = 0.005 to integrate the Lorenz-96 equations (3.2). In the first set of experiments with the Lorenz-96 system, the number of variables is set to N x = 8, meaning that an 8\times 8 model error covariance Q has to be estimated. The observation error covariance R was chosen as a diagonal matrix R = \sigma 2 R I. We observed every grid point; in other words, the observation model \scrH is assumed to be the identity transformation. The observations are taken every \Delta t = 0.05, which represents 10 model time steps in all the experiments performed.
For these twin experiments, we simulated K = 500 noisy observations y 1:K using the Lorenz-96 model with known model error and observation error covariances Q true , R. Using the same model and these synthetic observations, the full model error covariance matrix Q was estimated by three different algorithms: \bullet EM+VMPF: The algorithm here proposed, which is a version of the EM algorithm for a PF without the need of a particle smoother. The PF used is the VMPF, as in the previous section. \bullet EM+VMPF+EnKS: A``hybrid"" method, consisting of the``classical"" batch EM, where the VMPF is used in the forward pass of the E-Step and a Gaussian smoother is used in the backwards pass. \bullet EM+EnKF+EnKS, as presented in [24].
In all these algorithms the number of particles/ensemble members was N p = 20, except when otherwise stated. The method here proposed is designed to be used with a PF, avoiding the need of a particle smoother due to the computational disadvantages they present. A fair competitor to analyze the performance of our methodology would be the classical EM using a PF and a particle smoother. As already mentioned, particle smoothers tend to degenerate rapidly and have a poor performance in moderate-dimensional spaces, and so we propose to use the EnKS smoother combined with the VMPF instead for comparison purposes.
The performance of the VMPF in the Lorenz-96 dynamical system has been thoroughly evaluated in [25]. In particular, that work showed that the effective sample size was close to the number of particles, and no resampling was required for the 40-variable Lorenz-96 system. The EM+EnKF+EnKS uses the perturbed observations EnKF, while the EnKS is based on the linear RTS smoother. Furthermore, since this work attempts to estimate the full model error covariance matrix, we did not use either localization or inflation in the experiments. However, we note that model error may be interpreted as additive inflation so that in principle model error estimation should account for underdispersion of the ensemble.
To assess the proposed methodology to estimate the model error covariance matrix, experiments with two different structures, usually assumed in practice, were proposed for the true model error covariance matrix Q true : (a) an isotropic noncorrelated covariance matrix, where Q true = \sigma 2 I, with I the identity matrix of order N x , and (b) an isotropic tridiagonal covariance matrix with diagonal values \sigma 2 d and both sub-/superdiagonal values \sigma 2 sd . In the first case we assume that model errors for different model variables are uncorrelated and have the same variance \sigma 2 d , whereas for the second case we assume an a priori spatial covariance structure with correlations between the first neighbors. We remark that this is the structure of the``true"" model error covariance used to generate the synthetic observations via (2.1) and (2.2). On the other hand, we do not assume any constraint on the structure of the covariance matrix in the estimation algorithm, with the exception of the positive semidefinite property (and hence symmetry) of the estimated matrix, which is satisfied by construction.
Following a similar procedure as for the linear model case (subsection 3.1), 50 independent realizations of this experiment were performed in order to show the estimator empirical distribution. The nonzero values of the initial guesses Q 0 (a N x \times N x matrix with the same structure of Q true ) and \sigma 2 d and \sigma sd were sampled from \scrU [0.05, 0.5] and \scrU [0.01, 0.15], respectively. In Figure 2 we show the empirical distribution of the estimator of Q (top panel) for true parameter value Q true = \sigma 2 I with \sigma 2 = 0.2, R = 0.5I, and K = 500 observations and the Frobenius norm \| Q true -\Q (s) \| F (bottom panel), for the algorithm proposed in this work, EM+VMPF (blue), the EM+VMPF+EnKS algorithm (yellow), and the EM+EnKF+EnKS algorithm (red) proposed by [24], which requires an ensemble Kalman filter and an ensemble Kalman smoother. Since the full covariance matrix is estimated, the Frobenius norm indicates both deviations from nonzero and zero values in the estimation. Thus, it is a good diagnostic of the overall structural errors in the estimation.
For ease of visualization, and just for plotting purposes, in this case of an isotropic uncorrelated covariance assumption Q = \sigma 2 I, at the sth iteration of the EM algorithm we kept [ \widehat \sigma 2 d ] s as the average of the diagonal values of \widehat Q (s) . As we repeated each experiment independently 50 times, we had a series of 50 values of [ \widehat \sigma 2 d ] s at the s iteration of the EM algorithm to con-struct a violin object that describes the empirical distribution of the corresponding estimator at each iteration. With only N p = 20 particles, the EM+VMPF and the EM+EnKF+EnKS methods provide good estimates of Q, stabilizing in about 10 iterations with the median value (white circle) around the true diagonal value (top panel of Figure 2). The EM+VMPF (blue) algorithm produces estimates less biased than the EM+EnKF+EnKS (red) algorithm despite the fact that it does not require a particle smoother (and therefore uses less observational information in the state estimates). Moreover, the violin objects show that the empirical distribution for the EM+VMPF estimates is symmetric and highly concentrated around the true value, whereas the EM+EnKF+EnKS estimator empirical distributions have a greater dispersion, are not symmetric, and have tails towards higher values of \sigma 2 d , meaning that in the 50 repetitions performed this method overestimates the value of \sigma 2 d . On the other hand, the EM+VMPF+EnKS estimator empirical distribution has a large dispersion and does not reach the true value \sigma 2 d . The bottom panel of Figure 2 shows the empirical distribution of the Frobenius norm \| Q true -\Q (s) \| F for the three methods. The EM+EnKF+EnKS algorithm (red), despite having a greater dispersion, has a slightly better performance than the other two.
To examine the sensitivity of the proposed algorithm to different values of the observation error covariance R a second set of experiments was performed. We assumed R = \sigma 2 R I with \sigma 2 R \in \{ 0.1, 0.5, 1.0, 2.0\} , and Q = 0.2I. For each setting we performed 30 independent realizations of the same experiment. The mean and 95\% confidence intervals for [ \widehat \sigma 2 d ] s for each value of \sigma 2 R are shown in Figure 3. The larger the R values are, the harder it is to estimate model error, likely because of sampling noise. The results for \sigma 2 R = 1.0 and 2.0 are noisier and demonstrate small biases, even after 25 iterations of the EM scheme. As clearly stated in [34] and references therein, the quality of reconstructed state vectors and estimation procedures when using variational or ensemble-based methods largely depends on the relative amplitudes between observation and model errors. In [44] the authors also mention that a small increment in the magnitude of the observation error R significantly affects the accuracy and behavior of the method they propose to estimate the model error Q iteratively in the observation space using the implicit equally weighted PF (IEWPF) [43]. Their method provides reasonable estimates of the diagonal values of Q true as long as the diagonal values of the observation error matrix R are relatively small (they assume a diagonal R), with \sigma 2 R = 0.2 being the largest one. The larger \sigma 2 R , the less accurate the estimation procedure. As shown in Figure 3, the estimation with EM+VMPF gives results with a good performance for observational variances ranging from 0.1 to 2.0. The estimation error increases with the increase of the observational variance; however, the estimation error is lower than 20\% in all the cases shown.
Experiments designed for the estimation of a tridiagonal isotropic model error covariance are shown in Figure 4. The diagonal value of Q true was set to \sigma 2 d = 0.2 and sub-/superdiagonal values were \sigma 2 sd = 0.05, as defined in [44], R = 0.5I, K = 500, and N p = 20. For this isotropic tridiagonal model error covariance assumption, at the sth iteration of the EM algorithm we computed \widehat \sigma 2 d as the average of the diagonal values of \Q s , and \widehat \sigma 2 sd as the average of the sub-/superdiagonal values of \Q s . The violin plots at the sth iteration of the EM algorithm were generated with \widehat \sigma 2 d , \widehat \sigma 2 sd obtained for each realization of the experiment.  As shown in Figure 4, the EM+VMPF(blue) and EM+EnKF+EnKS (red) methods converge rapidly to the true value \sigma 2 d (top panel), whereas the hybrid EM+VMPF+EnKS method does not capture the variances. The empirical distribution of the EM+VMPF estimates shows a median value (white circle) closer to the true value of \sigma 2 d but has a greater dispersion. However, as in the diagonal Q case, the EM+EnKF+EnKS proposal tends to overestimate \sigma 2 d . Both the EM+VMPF and EM+EnKF+EnKS methods show a similar performance when estimating the sub-/superdiagonal elements of Q, with EM+VMPF being more biased in terms of the median value. However, after 17 iterations the violin object obtained by the EM+VMPF proposal is completely contained within the violin object obtained by the EM+EnKF+EnKS algorithm. The EM+VMPF+EnKS method never reaches the covariances between neighboring variables, underestimating them. A different behavior is observed for the empirical distribution of the Frobenius norm. In this case, EM+EnKF+EnKS outperforms the EM+VMPF estimation procedure, which in turn outperforms the EM+VMPF+EnKS performance. Thus, EM+VMPF has a better performance when estimating variance and covariances between neighbor variables, but it does not perform so well when estimating the null long correlations due to its nonparametric statistics.
Regarding computational costs and times, EM+VMPF and EM+VMPF+EnKS needed 3.1 minutes and 1.2 minutes, respectively, to complete 20 iterations of their EM versions. The extra cost in the former one is because it needs to do the iterations of the fixed point algorithm at each EM iteration. On average 4 fixed point iterations are required even when we set the maximum number of iterations to 6; i.e., the convergence criterion is met in 4 iterations on average (see Algorithm 2.1). Therefore, the overall extra cost of the smoother-free EM in this experiment is about 3 times the classical batch EM using the same number of particles.
We also evaluated the performance of the proposed algorithm using less frequent observations. This means that longer model integrations are required, which in turn enhances nonlinearities, resulting in non-Gaussian forecast distributions. We used the 8-variable Lorenz-96 model assuming observations are available every \Delta t = 0.25. In all the previous experiments with the Lorenz-96 model, we assumed observations were every \Delta t = 0.05. The synthetic observations were generated with a true tridiagonal model error covariance matrix Q, with diagonal values \sigma 2 d = 0.5 and sub/superdiagonal values \sigma sd = 0.1, R = 0.5I 8 , K = 500. Following a procedure similar to that for previous experiments and settings, independent realizations of the current experiment were performed, and Q true was estimated using the EM+VMPF, EM+EnKF+EnKS, and EM+VMPF+EnKS algorithms with N p = 20 particles. The mean values of \\sigma 2 d and \\sigma sd at the sth iteration of the algorithm are shown in Figure 5. The EM+VMPF (blue solid line) algorithm recovers the structure of Q better than the EM+EnKF+EnKS (red dashed lines) and the EM+VMPF+EnKS (yellow dot-dashed lines) algorithms. The EM+VMPF algorithm outperforms the other two algorithms when estimating the variances \sigma 2 d , being EM+EnKF+EnKS and EM+VMPF+EnKS highly biased. The three methods show similar performances when estimating the sub/superdiagonal values of Q, \sigma sd .
The low performance of EM+EnKF+EnKS in this estimation problem appears to be associated to the Gaussianity assumption. The non-Gaussianity resulting from long assimilation cycles has a negative impact on the Gaussian-based algorithms, with the EM+EnKF+EnKS performance being the most affected. On the other hand, the smoother-free algorithm based  on a PF, EM+VMPF, can capture the non-Gaussian densities and gives a relatively good estimation. A similar result was found in [7] for an online EM algorithm. They show that importance-sampling online EM with a PF has a much better performance than a Gaussian smoother EM for long assimilation cycles.
A higher-dimensional experiment was also performed for the chaotic Lorenz-96 model with 40 dimensions and F = 8, where a 40 \times 40 model error covariance matrix has to be estimated. We prescribed an isotropic tridiagonal model error covariance matrix Q true , with diagonal values \sigma 2 d = 0.2 and both sub-/superdiagonal values equal to \sigma 2 sd = 0.05 to generate the synthetic observations. Following a procedure similar to that used in the previous experiments, we performed independent repetitions of this experiment and estimated Q using the EM+VMPF, the EM+VMPF+EnKS, and the EM+EnKF+EnKS algorithms. We used N p = 20 particles for the EM+VMPF smoother-free algorithm, whereas the EM+EnKF+EnKS and the EM+VMPF+EnKS algorithms required N p = 50 ensemble members/particles in order to avoid smoother divergence. We note that early experiments we conducted with the 40-variables Lorenz-96 model using N p =20 particles with both EM+EnKF+EnKS and EM+VMPF+EnKS showed divergence.
The observation error covariance was set to R = 0.5I, and the batch of observations is K = 500. At the sth iteration of the EM algorithm we computed \widehat \sigma 2 d as the average of the diagonal values of \Q s , \widehat \sigma 2 sd as the average of the sub-/superdiagonal values of \Q s and the Frobenius norm \| Q true -\Q (s) \| F . The results of these estimates at the sth iteration of the algorithm are shown in the top, middle, and bottom panels, respectively, of Figure 6. The performances of the EM+VMPF (blue) and the EM+EnKF+EnKS (red) algorithms are similar in terms of the estimation of the diagonal values of Q and Frobenius norm. When estimating the diagonal elements of Q both algorithms tend to stabilize after 10 iterations overestimating them, with the EM+VMPF algorithm being the less biased. EM+VMPF shows an excellent performance when estimating the sub/superdiagonal elements \sigma sd , while EM+EnKF+ EnKS slightly underestimates it.
The full covariance matrix is estimated at each iteration, and the Frobenius norm indicates both deviations from nonzero and zero values in the estimation. As shown in the bottom panel of Figure 6, the overall structural errors in the estimation of Q by the EM+VMPF and the EM+EnKF+EnKS algorithms are similar. On the other hand, EM+VMPF+EnKS (yellow) has a poor performance, underestimating \sigma 2 d and \sigma sd . We remark here that EM+VMPF uses N p =20 particles, while EM+EnKF+EnKS and EM+VMPF+EnKS require N p =50 particles.
Regarding computational times, the experiments took an average of 53.2 minutes and 93.9 minutes to complete 20 iterations for EM+VMPF and EM+VMPF+EnKS, respectively. All the experiments were run in Python on a laptop with an i5 processor. In this largerdimensional experiment the number of iterations in the fixed point algorithm is again about 4; however, due to the larger number of particles required for the smoother (50 particles), EM+VMPF presents a lower computational cost. The required number of particles is an important factor in practice for medium and large systems.
Despite the increase in the dimensionality, the experiments' results show that the smootherfree version of the EM algorithm for a PF provides unbiased results when estimating the diagonal and sub/superdiagonal values of Q, showing a better performance than the hybrid EM version for a PF combined with a Gaussian Rauch--Tung--Striebel (RTS) smoother (EM+VMPF+EnKS).

Conclusions.
In this work a novel method to estimate the model error uncertainty in dynamical systems is introduced and evaluated. It assumes that both the model and observation errors are additive and Gaussian with zero mean and covariance matrices Q and R, respectively. The methodology here presented is based on the maximization of a likelihood criterion using the principles of the EM algorithm and a particle filter. We aim at maximizing the complete likelihood of the observations by marginalizing this likelihood function. The resulting likelihood is expressed sequentially. By taking this approach, in the E-Step of the EM algorithm we only have to compute filtering densities, avoiding the need to compute smoothing densities, which are known to be one of the main drawbacks when using particle filters in data assimilation. The trade-off of avoiding the need of a particle smoother in the E-Step of the EM algorithm is the need to solve an implicit equation for Q in the M-Step of the EM algorithm. This problem was tackled by means of a fixed point algorithm, and despite the fact that an analytical proof of its convergence is not straightforward to obtain due to the nonlinearity of the model dynamics, empirical results show that it converges to a solution of this implicit equation.
The EM algorithm coupled with the VMPF presents, in general, an overall excellent performance. It gives very promising results in the experiments performed with a simple linear first order autoregressive system and a chaotic Lorenz-96 system with 8 and 40 variables. In the first case, results were compared with those obtained by different methods already proposed in the literature, showing a good performance in terms of bias and root mean squared error (RMSE). The new method is suitable for non-Gaussian posterior densities from nonlinear dynamical and observational models, unlike the Kalman filter/smoother and its ensemble variants.
The proposed smoother-free variant of the EM algorithm with the particle filter is compared with a classical EM algorithm that uses the particle filter and a Gaussian smoother. The experiments also present a comparison with the EM algorithm coupled with the ensemble Kalman filter. The Gaussian assumption in the Kalman filter leads to an analytical solution of the maximization step. In this sense, the combination of the EM with the Kalman filter could be considered as a gold standard for the proposed algorithm to achieve. The results of the experiments showed that the smoother-free algorithm coupled with the particle filter performance may be as good as the performance of the EM with the Kalman filter. In a more equally based comparison, the smoother-free variant of the EM algorithm gives a performance that is superior to that of the classical EM algorithm coupled with the particle filter and a Gaussian smoother.
In the case of the Lorenz-96 system, its performance was tested for different scenarios showing good convergence properties, even for less frequent observations. It is stable even for R = 10Q, although a small bias appears in the estimate. The new method outperforms the traditional EM algorithm with the EnKS [14] for a diagonal Q and for the diagonal of a tridiagonal Q. However, off-diagonal elements' estimates were always noisier than those using an EnKS. It also works in moderately large parameter estimation problems of dimension 40 \times 40 and of state space 40. In the case of less frequently observed data, longer model integrations are required, which result in non-Gaussian forecast distributions. The smootherfree EM algorithm coupled with the VMPF showed better performance in this scenario than methods based on the ensemble Kalman filter and smoothers.
In general we found that the extra computational cost of the smoother-free EM for the same number of particles is about 3--4 times the classical batch EM due to the fixed point algorithm iterations. However, in practice a smaller number of particles may be required if the smoother step is avoided, substantially diminishing the extra cost. At the current form, the algorithm here proposed would be expensive (in computational time and also in terms of memory requirements) for state dimensions larger than N x > 100 and prohibitive for N x > 1000, since the algorithm uses the explicit form of Q and its inverse is needed at the M-Step.
The computational cost of the algorithm here proposed is directly related to the number of iterations needed for convergence. All the experiments performed achieved convergence to a narrow neighborhood of the true value of Q in as few as 10--15 iterations of the EM algorithm. Each EM iteration requires the computation of K filtering densities computed by using a particle filter with N p particles, whereas the M-Step requires solving a fixed point algorithm. In our experiments, we set the maximum number of iterations for the fixed point algorithm to 6. In practice, an average of 4 fixed point iterations were enough to satisfy the stopping criterion. In turn, each of these fixed point iterations also require computing K filtering densities computed by using a particle filter with N p particles.
The numerical results obtained in the conducted experiments and these last considerations about computational costs suggest that the method works well for moderately large models, and in the examined situations even better and faster than classical schemes, but as previously mentioned, it would be too expensive in its current implementation for large-dimensional models.
Model error covariances are essential in particle flow filters. The conducted experiments show that these particle flow filters, in particular VMPF, can work with an adaptive model error without a priori information on this covariance, whereas in previous studies a fixed known model error covariance was used [25]. Model error covariances impact on the prior density and also on the kernel covariance in the VMPF. The overall excellent performance of the estimates may also be attributed to the strong sensitivity of VMPF performance to model error covariance. In this sense, there is a positive feedback between the model error covariance estimates of the EM algorithm and the state estimates of the filter. p(x k | y 1:k ; \theta \prime ) log \biggl( p(y k | x k ; \theta )p(x k | y 1:k - 1 ; \theta ) p(x k | y 1:k ; \theta \prime ) \biggr) dx k (A.5) p(x k | y 1:k ; \theta \prime ) log (p(y k | x k ; \theta )p(x k | y 1:k - 1 ; \theta )) dx k (A.6) p(x k | y 1:k ; \theta \prime ) log \bigl( p(x k | y 1:k ; \theta \prime ) \bigr) dx k . (A.7) Thus, Q that satisfies \partial \partial\bfQ \scrG (Q s - 1 , Q) = 0 is given by