Trajectory stratification of stochastic dynamics

We present a general mathematical framework for trajectory stratification for simulating rare events. Trajectory stratification involves decomposing trajectories of the underlying process into fragments limited to restricted regions of state space (strata), computing averages over the distributions of the trajectory fragments within the strata with minimal communication between them, and combining those averages with appropriate weights to yield averages with respect to the original underlying process. Our framework reveals the full generality and flexibility of trajectory stratification, and it illuminates a common mathematical structure shared by existing algorithms for sampling rare events. We demonstrate the power of the framework by defining strata in terms of both points in time and path-dependent variables for efficiently estimating averages that were not previously tractable.

I. INTRODUCTION
Computer simulation is a powerful tool for the study of physical processes. However, for many systems, the most interesting events are those that occur most infrequently and are therefore very difficult to observe by direct numerical integration of the equations governing the dynamics. For example, in chemistry, the conformational changes responsible for the function of many molecules and, in climate science, extreme events like severe droughts and violent hurricanes, occur on timescales orders of magnitude longer than the timestep for numerical integration. This basic observation has motivated the development of numerous techniques aimed at enhancing the sampling of rare events of interest without sacrificing statistical fidelity (see [1] for an account within the context of molecular simulation).
In this article, we depart from standard enhanced sampling approaches and develop a general mathematical and computational framework for the estimation of statistical averages involving rare trajectories of stochastic processes. Our approach can be viewed as a form of stratified sampling, long a cornerstone of experimental design in statistics (e.g., [2]). In stratified sampling, a population is divided into subgroups (strata), averages within those strata are computed separately, and then averages over the entire state space are assembled as weighted sums of the strata averages. Stratification also has a long history in computer simulations of condensed-phase systems as umbrella sampling (US) [1,3-6]. The key idea behind any stratified sampling strategy is that, when the strata are chosen appropriately, their statistics can be obtained accurately with relatively low effort and combined to estimate the average of interest with (much) less overall effort than directly sampling the stochastic process to the same statistical precision. Here we show that the trajectories of an arbitrary discrete-time Markov process (including many dynamics with memory, so long as they can be written as a suitable mapping) can also be stratified: they can be decomposed into fragments restricted to regions of trajectory space (strata), averages over the distributions of trajectory fragments within the strata can be computed with limited communication between them, and those averages can be combined in a weighted fashion to yield a very broad range of statistics that characterize the dynamics.
These basic features are at the core of the existing nonequilibrium umbrella sampling (NEUS) method [7-9], which forms the starting point for our development. NEUS was originally introduced to estimate stationary averages with respect to a given, possibly irreversible, stochastic process [7]. Starting in [8,9] it was observed that the general NEUS approach was applicable to certain dynamical averages as well. At its most basic level, NEUS relies on duplication of states in rarely visited regions of space and subsequent forward evolution of the duplicated states. In this way it is similar to a long list of so-called "trajectory splitting" techniques [10-18] that are also able to compute averages of dynamic quantities. Like NEUS, splitting techniques also often involve a decomposition of state space into regions. Unlike NEUS, however, in most splitting techniques bias is removed through the use of a separate weight factor for each individual sample (rather than for an entire region), and the computational effort expended in each region is not controlled exactly. What makes the NEUS method unique among splitting techniques is that it is also a trajectory stratification strategy.
Our goal in this article is to provide a clear and general mathematical framework for trajectory stratification that builds upon the NEUS method. In the process we clearly delineate the range of statistics that can be estimated by NEUS, including more general quantities than previously computed. Our analysis of the underlying mathematical structure of US [19,20] has already facilitated the derivation of a central limit theorem for US and a detailed understanding of its error properties. Here, our framework reveals unanticipated connections between the equilibrium and nonequilibrium US methods and places the nonequilibrium algorithm within the well-studied family of stochastic approximation methods [21]. The analysis leads to a practical scheme that departs dramatically from currently available alternatives. We demonstrate the use of trajectory stratification to compute a hitting time distribution as well as to compute the expectation of a path-dependent functional that gives the relative normalization constants for two arbitrary, user-specified unnormalized probability densities.

II. A UNIFIED FRAMEWORK
In this section we present a framework that reveals the unified structure underlying umbrella sampling in both the equilibrium and nonequilibrium case. In Section II A, we review the equilibrium approach [19,20] to introduce terminology and the central eigenproblem in a context where the analogies to traditional umbrella sampling descriptions [1,3-6] are readily apparent. In Section II B, we present the nonequilibrium version of the algorithm and show how this interpretation results in a flexible scheme for computing dynamic averages. As for its equilibrium counterpart, an eigenproblem lies at the core of the nonequilibrium method. This eigenproblem, however, involves a matrix that depends on the desired eigenvector, introducing the need for a self-consistent iteration. In Section III, we give a precise description of the fixed-point problem solved by this iteration and show that the algorithm is an example of a stochastic approximation strategy [21]. In Section IV we specialize our development to the context of steady-state averages that motivated the original development of NEUS [7].
A. Averages with respect to a specified density

Our presentation in this section follows [19]. We view umbrella sampling as a method to compute averages of the form

$$\int_{\mathbb{R}^d} f(x)\,\pi(dx), \tag{1}$$

where $\pi$ is a known probability distribution and $d$ is the dimension of the underlying system (e.g., the total number of position coordinates for all atoms in a molecular system). For example, $\pi$ might be the canonical distribution, $\pi(dx) \propto e^{-\beta V(x)}\,dx$, where $V$ is a potential energy function, $\beta$ is an inverse temperature, and $f$ might be 1 on some set $A$ and 0 elsewhere. In this case, $-\beta^{-1}\log \int f(x)\,\pi(dx)$ can be regarded as the free energy of the set $A$. Note that in our notation $\pi$ is a probability measure on $\mathbb{R}^d$ and $dx$ is an infinitesimal volume element in $\mathbb{R}^d$. If the distribution $\pi$ has a density function $p(x)$ then $\pi(A) = \int_{x \in A} p(x)\,dx$ and, in particular, $\pi(dx) = p(x)\,dx$. This more general notation is useful when we move to our description of the nonequilibrium umbrella sampling scheme. As an aid to the reader, we choose to introduce it in the simpler setting of this section.
Consistent with traditional implementations of US [1,4], we divide the computation of the average in (1) into a series of averages over local subsets of space. To describe that process mathematically we define a set of $n$ probability distributions

$$\pi_j(dx) = \frac{\psi_j(x)\,\pi(dx)}{\int \psi_j(y)\,\pi(dy)}, \qquad j = 1, \dots, n, \tag{2}$$

where the $\psi_j$ are non-negative, user-defined functions satisfying $\sum_{j=1}^{n} \psi_j(x) = 1$ for all $x$ (this last requirement is relaxed in [19]). For example, one might choose $\psi_j = 1_{A_j}$, where the $A_j$ are a collection of sets covering the space to be sampled, and, for any set $A_j$, the function $1_{A_j}(x)$ is 1 if $x \in A_j$ and 0 otherwise. Like this one, typical choices of $\psi_j$ concentrate (or restrict) probability under $\pi_j$ in smaller regions of space (relative to $\pi$ itself) with the goal of eliminating or reducing barriers to efficient sampling associated with $\pi$.
To see that averages with respect to $\pi$ can be reconstructed using averages with respect to the $\pi_j$, note that we can decompose the expectation in (1) as a weighted sum of expectations over the local distributions defined in (2). That is,

$$\int f(x)\,\pi(dx) = \sum_{j=1}^{n} z_j f_j, \tag{3}$$

where we have defined the quantities

$$z_j = \int \psi_j(x)\,\pi(dx) \tag{4}$$

and

$$f_j = \int f(x)\,\pi_j(dx). \tag{5}$$

Here $z_j$ is the statistical weight associated with each distribution $\pi_j$ and the $f_j$ are the averages of the observable $f$ against the $\pi_j$. From (3) we see that if we can sample from the $\pi_j$ and compute the $z_j$ then we can compute averages with respect to $\pi$. Since $\pi_j$ is known explicitly in this case, it can be sampled by standard means (e.g., Langevin dynamics or Metropolis Monte Carlo [1]).
Our key observation underpinning the equilibrium umbrella sampling method is that the $z_j$ themselves are functions of averages with respect to the local distributions $\pi_j$:

$$z_j = \sum_{i=1}^{n} z_i F_{ij}, \tag{6}$$

where

$$F_{ij} = \int \psi_j(x)\,\pi_i(dx). \tag{7}$$

The matrix $F$ is stochastic (i.e., has non-negative entries with rows that sum to 1) and (6), which is written in matrix-vector form as

$$z^T = z^T F, \tag{8}$$

is an eigenproblem that can be solved easily for the vector $z$.
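To make the linear-algebra step concrete, the following sketch (not part of the original presentation) shows one way to recover $z$ from a sampled estimate of $F$ and combine it with estimates of the $f_j$ via (3). The arrays `F_hat` and `f_hat` are hypothetical stand-ins for quantities obtained by sampling the $\pi_j$.

```python
import numpy as np

def us_average(F_hat, f_hat):
    """Combine umbrella-sampling estimates using (3) and the eigenproblem (8).

    F_hat : (n, n) sampled estimate of the stochastic matrix F in (7)
    f_hat : (n,)   sampled estimates of the local averages f_j in (5)
    """
    # z is the left eigenvector of F with eigenvalue 1 (z^T = z^T F),
    # normalized so that its entries sum to one.
    evals, evecs = np.linalg.eig(F_hat.T)
    z = np.real(evecs[:, np.argmin(np.abs(evals - 1.0))])
    z = z / z.sum()
    return z, float(z @ f_hat)

# Toy usage with made-up numbers (not data from the paper):
F_hat = np.array([[0.9, 0.1, 0.0],
                  [0.2, 0.7, 0.1],
                  [0.0, 0.3, 0.7]])
f_hat = np.array([0.0, 0.1, 1.0])
z, avg = us_average(F_hat, f_hat)
print(z, avg)
```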
We now have a stratification scheme for computing the target average in (1) by sampling from the distributions π j . Operationally, the main steps are as follows.
1. Assemble $F$ defined in (7) (or the alternative in Appendix A below) and the $f_j$ defined in (5) by sampling from the $\pi_j$ defined in (2).

2. Solve the eigenproblem (8) for $z$.

3. Compute the desired expectation via (3).
The efficiency of this equilibrium US scheme has been analyzed in detail elsewhere [19,20]. Roughly, the benefit of US is due to the facts that averages with respect to the π j are often sufficient to solve for all desired quantities, and one can choose ψ j so that averages with respect to the π j converge much more quickly than averages with respect to π itself. It is this basic philosophy that we extend in Section II B to the computation of dynamic averages.

B. Averages with respect to a given Markov process
The mathematical description of the nonequilibrium umbrella sampling scheme that follows reveals how the stratification strategy developed for the equilibrium case in Section II A can be extended to compute nearly arbitrary dynamic statistics. Our interest in this section is computing averages over trajectories of some specified Markov process, $X^{(t)}$. This process can be time-inhomogeneous, i.e., given the value of $X^{(t)}$, the distribution of $X^{(t+1)}$ can depend on the value of $t$. We compute averages of trajectories evolved up to a first exit time of the process, $(t, X^{(t)})$, from a user-specified set of times and positions, $D$; i.e., trajectories terminate when they first leave the set $D$. We consider averages over trajectories of $X^{(t)}$ run until the time

$$\tau = \min\left\{t > 0 : (t, X^{(t)}) \notin D\right\} \tag{9}$$

for a set $D \subset \mathbb{N} \times \mathbb{R}^d$. The averages are of the form

$$\mathbf{E}\left[\sum_{t=0}^{\tau-1} f(t, X^{(t)})\right]. \tag{10}$$

We note that the average in (10) is not completely general, in order to streamline the developments below. Without any modification, we can compute averages similar to (10) but with the argument $(t, X^{(t)})$ in the definitions of $\tau$ and $f$ replaced by $(t, X^{(t-1)}, X^{(t)})$. On the other hand, expectations with $(t, X^{(t)})$ replaced by $(t, X^{(t-m)}, \dots, X^{(t-1)}, X^{(t)})$ for $m \geq 2$ cannot be obtained immediately. These and many more general expectations can, however, be accommodated by applying the algorithm to an enlarged process (e.g., $(t, X^{(t-m)}, \dots, X^{(t-1)}, X^{(t)})$) at the cost of storing copies (as described below) of the enlarged process. For many expectations, this cost is quite manageable. Finally, we require that $\mathbf{E}[\tau] < \infty$. The limit $\tau \to \infty$ will be considered in Section IV. Below we show that expectations of time-dependent functions can be decomposed as a weighted sum of expectations computed over restricted subsets of the full space and, in turn, how the statistical weights can be computed as expectations over these subsets, mirroring the basic structure of the equilibrium scheme described in Section II A. However, as we discuss in Section III, the algorithm for computing these local expectations departs significantly from the equilibrium case because their form is not known a priori in the nonequilibrium setting.
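Before introducing the stratification machinery, it may help to see the quantity (10) estimated by brute force. The sketch below uses a toy biased random walk with a hypothetical stopping set $D$ (neither the process nor the sets come from this article); stratification targets exactly the regime in which such a direct estimator requires prohibitively many trajectories.

```python
import numpy as np

rng = np.random.default_rng(0)

def direct_estimate(n_traj=10_000, t_max=200):
    """Brute-force Monte Carlo estimate of E[ sum_{t=0}^{tau-1} f(t, X^(t)) ]
    for a toy biased random walk started at 0.  Here D = {(t, x): t < t_max,
    |x| < 10}, so tau is the first time the walk leaves D, and f(t, x) = 1{x >= 9}
    counts the (rare) time steps spent near the upper boundary."""
    total = 0.0
    for _ in range(n_traj):
        t, x = 0, 0
        while t < t_max and abs(x) < 10:              # (t, X^(t)) still in D
            total += 1.0 if x >= 9 else 0.0           # accumulate f(t, X^(t))
            x += rng.choice([-1, 1], p=[0.55, 0.45])  # drift away from +10
            t += 1
    return total / n_traj

print(direct_estimate())
```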

The index process
The US scheme in Section II A used the basis functions ψ j to stratify the sampling of the distribution π by decomposing averages with respect to π into averages with respect to the more easily sampled π j . To arrive at an analogous partitioning of state space for the nonequilibrium case, we first need to introduce an index process J (t) that takes values in {1, 2, . . . , n} and (roughly) labels the point (t, X (t) ) in time and space, N × R d . Our objective is to generate fragments of trajectories of X (t) consistent with specific values of J (t) thereby breaking the coupled process (X (t) , J (t) ) into separate regions corresponding to a given value of J (t) .
The idea of discretizing a process $X^{(t)}$ according to the value of some user-specified index process is not new in computational statistical mechanics. For example, in our notation, given a partition of state space $A_1, A_2, \dots, A_n$, the Milestoning procedure [22] and some Markov State Modeling procedures [23] correspond to an index process that marks the pairs of sets $(A_i, A_j)$ for $i \neq j$ between which $X^{(t)}$ last transitioned. In the Milestoning method, the pairs of sets are considered unordered, so that a transition from $A_j$ to $A_i$ immediately following a transition from $A_i$ to $A_j$ does not correspond to a change in $J^{(t)}$, and $J^{(t)}$ can assume $\binom{n}{2}$ distinct values. The original presentation of NEUS, on the other hand, corresponds to a process $J^{(t)}$ that marks the index of the set $A_j$ containing $X^{(t)}$. For accurate results, the Milestoning procedure requires that the index process $J^{(t)}$ itself be Markovian. Even under the best circumstances, that assumption is only expected to hold approximately. It is not required by the NEUS algorithm. Our presentation below reveals the full flexibility in the choice of $J^{(t)}$ within NEUS. That flexibility is essential in the generalized setting of this article.
In the developments below we require that $J^{(t)}$ is chosen so that the joint process $(X^{(t)}, J^{(t)})$ is Markovian. This assumption ensures that trajectories can be continued beyond a single transition event (before $\tau$) without additional information about the history of $X^{(t)}$ or $J^{(t)}$. We do not assume that $J^{(t)}$ alone is Markovian, and in general it will not be. Our assumption implies no practical restriction on the underlying Markov process $X^{(t)}$. When $X^{(t)}$ is non-Markovian, additional variables can often be appended to $X^{(t)}$ to yield a new Markov process to which the developments below can be applied. A version of this idea is applied in Section V C where we append a variable representing a nonequilibrium work to an underlying Markov process.

The eigenproblem
Given a specific choice of index process $J^{(t)}$, the nonequilibrium umbrella sampling algorithm stratifies trajectories of $X^{(t)}$ according to their corresponding values of $J^{(t)}$. That is, for each possible value of the index process, NEUS generates segments of trajectories of $X^{(t)}$ between the times that $J^{(t)}$ transitions to and from $J = i$. To make this idea more precise, we need to carefully describe the distribution sampled by these trajectory fragments:

$$\pi_j(t, dx) = \frac{\mathbf{P}\left[X^{(t)} \in dx,\ J^{(t)} = j,\ t < \tau\right]}{z_j}, \tag{11}$$

where

$$z_j = \sum_{t=0}^{\infty} \mathbf{P}\left[J^{(t)} = j,\ t < \tau\right]. \tag{12}$$

For each $j$, $\pi_j$ is the distribution of time and position pairs $(t, X^{(t)})$ conditioned on $J^{(t)} = j$ and $t < \tau$. We call the $\pi_j$ restricted distributions. We have reused the notations $\pi_j$ and $z_j$ from our account of the equilibrium umbrella sampling scheme to emphasize the analogous roles played by those objects in both sections. Note that here we are treating time as an additional random variable. Also note that in these definitions as well as in the formulas below, $\mathbf{P}$ and $\mathbf{E}$ represent probabilities and expectations with respect to the original, unbiased $X^{(t)}$ and $J^{(t)}$. We assume that $z_j > 0$ for all $j$ since we can remove the index $j$ from consideration if $z_j = 0$. The $z_j$ are all finite because $\sum_{j=1}^{n} z_j = \mathbf{E}[\tau]$, which we assume is finite.
Observe that

$$\mathbf{E}\left[\sum_{t=0}^{\tau-1} f(t, X^{(t)})\right] = \sum_{j=1}^{n} z_j f_j, \qquad f_j = \sum_{t=0}^{\infty} \int_{\mathbb{R}^d} f(t, x)\,\pi_j(t, dx). \tag{13}$$

Thus we have a decomposition of (10) analogous to the decomposition of (1) in (3). Also as in the equilibrium case, the $z_j$ can be computed from averages with respect to the $\pi_j$. To see this, observe that for any $t$ we can write

$$\mathbf{P}\left[J^{(t+1)} = j,\ t + 1 < \tau\right] = \sum_{i=1}^{n} \mathbf{P}\left[J^{(t+1)} = j,\ J^{(t)} = i,\ t + 1 < \tau\right]. \tag{14}$$

Summing this expression over $t$ we obtain

$$z_j = \sum_{i=1}^{n} \sum_{t=0}^{\infty} \mathbf{P}\left[J^{(t+1)} = j,\ J^{(t)} = i,\ t + 1 < \tau\right] + \mathbf{P}\left[J^{(0)} = j\right]. \tag{15}$$

These expressions are all bounded by $\mathbf{E}[\tau]$ and are therefore finite. Expression (15) can be rewritten as an affine eigenequation:

$$z^T = z^T G + a^T, \tag{16}$$

where $z$ is defined in (12),

$$G_{ij} = \frac{\sum_{t=0}^{\infty} \mathbf{P}\left[J^{(t+1)} = j,\ J^{(t)} = i,\ t + 1 < \tau\right]}{z_i}, \tag{17}$$

and

$$a_j = \mathbf{P}\left[J^{(0)} = j\right]. \tag{18}$$

Equation (16) is the analog of (8) in Section II A. Here, the matrix element $G_{ij}$ stores the expected number of transitions between the values of $J^{(t)}$, normalized by the expected number of time steps with $J = i$. Note that the matrix $G$ is substochastic; that is, it has non-negative entries and rows that sum to a number less than or equal to one.
To complete the analogy with the umbrella sampling schemes described in Section II A, we need to show that the elements of the matrix $G$ are expressible as expectations over the $\pi_j$. Indeed,

$$G_{ij} = \sum_{t=0}^{\infty} \int_{\mathbb{R}^d} \mathbf{P}_{t,x,i}\left[J^{(t+1)} = j,\ t + 1 < \tau\right]\,\pi_i(t, dx), \tag{20}$$

where $\mathbf{P}_{t,x,i}$ is used to denote probabilities with respect to $X$ initialized at time and position $(t, x)$ and conditioned on $J^{(t)} = i$ and $t < \tau$. Note that in writing (20) we have appealed to the Markovian assumption on $(X^{(t)}, J^{(t)})$. Had we instead assumed that $J^{(t)}$ alone was Markovian, we could have ignored the $x$ dependence in (20). Just as for the umbrella sampling algorithm described in Section II A, we arrive at a procedure for computing (10) via stratification:

1. Assemble $G$ defined in (17) and the $f_j$ defined in (13) by sampling from the $\pi_j$ defined in (11).

2. Solve the affine eigenequation (16) for $z$.

3. Compute $\sum_{j=1}^{n} z_j f_j$ as in (13).
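The following sketch (again, not from the original text) carries out steps 2 and 3 above once sampled estimates of $G$, $a$, and the $f_j$ are available; the inputs are hypothetical placeholders for quantities estimated from the restricted sampling.

```python
import numpy as np

def neus_average(G_hat, a, f_hat):
    """Solve the affine eigenequation (16), z^T = z^T G + a^T, and combine the
    weights with local averages as in (13).

    G_hat : (n, n) sampled estimate of the substochastic matrix G in (17)
    a     : (n,)   initial-condition vector defined in (18)
    f_hat : (n,)   sampled estimates of the local averages f_j in (13)
    """
    n = G_hat.shape[0]
    # z^T = z^T G + a^T  is equivalent to  (I - G^T) z = a.
    z = np.linalg.solve(np.eye(n) - G_hat.T, a)
    return z, float(z @ f_hat)

# Toy usage with made-up numbers (not data from the paper):
G_hat = np.array([[0.5, 0.4],
                  [0.3, 0.6]])
a = np.array([1.0, 0.0])
z, estimate = neus_average(G_hat, a, np.array([0.0, 2.0]))
print(z, estimate)   # the entries of z sum to E[tau] of the implied model
```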
Relative to the scheme in Section II A, sampling the restricted distributions $\pi_j$ requires a more complicated procedure. This is the subject of Section III. Rapid convergence of the scheme rests on the choice of $J^{(t)}$. Roughly, one should choose the index process so that the variations in estimates of the required averages with respect to the $\pi_j$ (e.g., estimates of the $G_{ij}$) are small. In practice, this requires that transitions between values of $J^{(t)}$ are frequent, which is the analog of selecting the biases in equilibrium US to limit the range of the free energy over each subset of state space (see [19,20]). We defer further specification of $J^{(t)}$ to Section V, where we describe this and other important implementation details in the context of particular applications.

III. A GENERAL NEUS FIXED-POINT ITERATION
In this section we present a detailed algorithm for computing (10) by the stratification approach outlined in Section II B. To accomplish this one must be able to generate samples from the restricted distributions π j (t, dx).
In NEUS, the restricted distributions are sampled by introducing a set of Markov processes $Y^{(r)}_j$ whose trajectories consist of sequential segments of the process $(t, X^{(t)})$ such that $J = j$. We use the time variable $r$ to distinguish between the time $t$ associated with the underlying process $(t, X^{(t)})$ and the time $r$ associated with the process $Y^{(r)}_j$. The trajectories of $Y^{(r)}_j$ are generated by integrating the underlying process $(X^{(t)}, J^{(t)})$ forward in time until a segment leaves the state $J = j$ or reaches the stopping time $\tau$, at which point the dynamics are stopped and restarted at a random time $s$ and position $y$ drawn from a distribution $\bar\pi_j(s, dy)$. For $Y^{(r)}_j$ to preserve the restricted distribution $\pi_j(t, dx)$, $\bar\pi_j(s, dy)$ must be the distribution of times $s$ and positions $y$ at which the process $(X^{(t)}, J^{(t)})$ enters the state $J = j$. This restarting procedure is illustrated in Figure 1 for a particular choice of index process. In general, the flux distributions $\bar\pi_j(s, dy)$ for which $Y^{(r)}_j$ preserves the restricted distribution $\pi_j(t, dx)$ are not known a priori and must be computed approximately. In Section III A, we define the flux distributions carefully and express them as a weighted sum of conditional flux distributions $\gamma_{ij}(s, dy)$, which are a set of time and position distributions conditioned on the process $(X^{(t)}, J^{(t)})$ switching from a specific $J = i$ to $J = j$. In order to construct an estimate of $\bar\pi_j(s, dy)$ from estimates of the $\gamma_{ij}(s, dy)$, it is necessary to compute the flux $z_i G_{ij}$ of the process $(X^{(t)}, J^{(t)})$ from each restriction with $J = i$, $i \neq j$. The elements of the desired matrix $G$ (and the corresponding weights $z$ computed from (16)) are themselves expectations over the restricted distributions $\pi_j(t, dx)$ and are estimated by integrating the processes $Y^{(r)}_j$. This relationship between the weights $z_j$ and the flux distributions $\bar\pi_j(s, dy)$ motivates a self-consistent iteration that yields successively better approximations of the flux distributions and the matrix $G$ simultaneously. In Section III B, we derive a fixed-point equation that expresses the relationship between the flux distributions and the transition matrix $G$ and that motivates this self-consistent iteration. In Section III C, we describe the complete NEUS algorithm in detail and interpret it as a stochastic approximation algorithm [21] for solving the fixed-point equation derived in Section III B.

A. The flux distributions
Before deriving the fixed-point problem and the corresponding stochastic approximation algorithm, we define the flux distributions $\bar\pi_j(s, dy)$ precisely. First, define the time of the $\ell$th change in the value of $J^{(t)}$ for a given realization of the coupled process $(X^{(t)}, J^{(t)})$ as

$$S^{(\ell)} = \min\left\{t > S^{(\ell-1)} : J^{(t)} \neq J^{(S^{(\ell-1)})}\right\}$$

for $\ell > 0$, and $S^{(0)} = 0$. Note that for $i \neq j$, the quantity $z_i G_{ij}$ is the net probability flux between the corresponding values of the process $J^{(t)}$ before time $\tau$. Now let $\bar\pi_j(s, dy)$ be the distribution of time and position pairs $(S^{(\ell)}, X^{(S^{(\ell)})})$ conditioned on $J^{(S^{(\ell)})} = j$. Specifically, define

$$\bar\pi_j(s, dy) = \frac{\sum_{\ell=0}^{\infty} \mathbf{P}\left[S^{(\ell)} = s,\ X^{(S^{(\ell)})} \in dy,\ J^{(S^{(\ell)})} = j,\ S^{(\ell)} < \tau\right]}{\sum_{\ell=0}^{\infty} \mathbf{P}\left[J^{(S^{(\ell)})} = j,\ S^{(\ell)} < \tau\right]}.$$

Instead of working directly with the flux distributions, we find it convenient to express both the fixed-point problem and the algorithm in terms of the probability distribution of time and position pairs $(t, X^{(t)})$ conditioned on observing a transition from $J = i$ to $J = j$ at time $t$, i.e., in terms of

$$\gamma_{ij}(s, dy) = \frac{\mathbf{P}\left[X^{(s)} \in dy,\ J^{(s)} = j,\ J^{(s-1)} = i,\ s < \tau\right]}{\sum_{t=1}^{\infty} \mathbf{P}\left[J^{(t)} = j,\ J^{(t-1)} = i,\ t < \tau\right]}, \tag{24}$$

which is defined only for $s > 0$. To simplify notation, we let $\gamma$ denote the set of all distributions $\gamma_{ij}$. The following simple but crucial identity relates $\gamma$ to the flux distributions $\bar\pi_j$:

$$v_j\,\bar\pi_j(s, dy) = \begin{cases} \sum_{i \neq j} z_i G_{ij}\,\gamma_{ij}(s, dy), & s > 0, \\ a_j\,\mathbf{P}\left[X^{(0)} \in dy \mid J^{(0)} = j\right], & s = 0. \end{cases} \tag{25}$$

The normalization $v_j$ is the total flux into $J = j$,

$$v_j = a_j + \sum_{i \neq j} z_i G_{ij}.$$

The $s > 0$ term is the contribution from transitions into state $J = j$ from the neighboring states $J = i$, and the $s = 0$ term accounts for the initial $t = 0$ contribution of the underlying process when $J = j$. We emphasize that both the fixed-point problem and the iteration that we define below could be expressed in terms of the flux distributions $\bar\pi_j$ instead of $\gamma$. We choose to express them in terms of $\gamma$ because the resulting formalism more naturally captures the implementation of the method used to generate our numerical results in Section V.

FIG. 1. Illustration of the stratification of a process $(X^{(t)}, J^{(t)})$ (black lines, left panel) via the scheme outlined in Section II B. The restricted distributions corresponding to each value of the index process $J^{(t)}$ are outlined as discrete regions of the $(t, X^{(t)})$ space (left panel, black dashed lines). In this depiction, the value of $J^{(t)}$ corresponds to the current cell containing $(t, X^{(t)})$ within a rectangular grid of time and position. Each of the restricted distributions $\pi_j(t, dx)$ is sampled by integrating a locally restricted dynamics $Y^{(r)}_j$.

B. The fixed-point problem
We now derive the fixed-point problem. Our goal is to find an expression of the form

$$G = \mathcal{G}(G, \gamma), \qquad \gamma = \Gamma(G, \gamma) \tag{26}$$

that characterizes the desired matrix $G$ and collection of probability measures $\gamma$ as the fixed point of a pair of maps $\mathcal{G}(\tilde G, \tilde\gamma)$ and $\Gamma(\tilde G, \tilde\gamma)$ that take as arguments approximations $\tilde G$ of $G$ and $\tilde\gamma$ of $\gamma$ and return, respectively, a new substochastic matrix and a new collection of probability measures.
To this end, we define functions mapping $\tilde G$ and $\tilde\gamma$ to approximations of the flux distributions $\bar\pi_j$ and the restricted distributions $\pi_j$. We denote these functions by the corresponding capital letters $\bar\Pi_j$ and $\Pi_j$. Based on (25) and (16), we define

$$\bar\Pi_j(s, dy; \tilde G, \tilde\gamma) \propto \begin{cases} \sum_{i \neq j} \tilde z_i \tilde G_{ij}\,\tilde\gamma_{ij}(s, dy), & s > 0, \\ a_j\,\mathbf{P}\left[X^{(0)} \in dy \mid J^{(0)} = j\right], & s = 0, \end{cases} \tag{27}$$

where $\tilde z$ solves the equation $\tilde z^T = \tilde z^T \tilde G + a^T$. To motivate the definition above, we observe that for the exact values $G$ and $\gamma$, $\bar\pi_j(s, dy) = \bar\Pi_j(s, dy; G, \gamma)$ by (25). Moreover, given $\tilde G$ and samples from $\tilde\gamma$, one can generate samples from $\bar\Pi_j(s, dy; \tilde G, \tilde\gamma)$; see Section III C. This is crucial in developing a practical algorithm to solve the fixed-point problem.
Definition (28) is motivated by the same two considerations as (27): First, $\Pi_j(s, dy; G, \gamma) = \pi_j(s, dy)$; this follows directly from the definition of $\Pi_j$, using that $\bar\Pi_j(s, dy; G, \gamma) = \bar\pi_j(s, dy)$. Second, the distribution $\Pi_j(t, dx; \tilde G, \tilde\gamma)$ is invariant for a process similar to the process $Y^{(r)}_j$ mentioned above, except with times and positions drawn from $\bar\Pi_j(s, dy; \tilde G, \tilde\gamma)$ instead of $\bar\pi_j(s, dy)$ on leaving the state $J = j$; see Section III C. Thus, given $\tilde G$ and samples from $\tilde\gamma$, one can generate samples from $\Pi_j(t, dx; \tilde G, \tilde\gamma)$.
At this point we are ready to define the functions $\mathcal{G}$ and $\Gamma$ appearing in (26) above. For a substochastic matrix $\tilde G$ and a collection of probability distributions $\tilde\gamma = \{\tilde\gamma_{ij}\}$, define the substochastic matrix

$$\mathcal{G}_{ij}(\tilde G, \tilde\gamma) = \sum_{t=0}^{\infty} \int_{\mathbb{R}^d} \mathbf{P}_{t,x,i}\left[J^{(t+1)} = j,\ t + 1 < \tau\right]\,\Pi_i(t, dx; \tilde G, \tilde\gamma) \tag{29}$$

and the collection of probability distributions

$$\Gamma_{ij}(s, dy; \tilde G, \tilde\gamma) \propto \sum_{t=0}^{\infty} \int_{\mathbb{R}^d} \mathbf{P}_{t,x,i}\left[X^{(t+1)} \in dy,\ J^{(t+1)} = j,\ t + 1 = s,\ s < \tau\right]\,\Pi_i(t, dx; \tilde G, \tilde\gamma). \tag{30}$$

Because $\Pi_j(G, \gamma) = \pi_j$, expressions (20) and (24) imply that $\mathcal{G}(G, \gamma) = G$ and $\Gamma_{ij}(G, \gamma) = \gamma_{ij}$, establishing our fixed-point relation (26).
Having fully specified the fixed-point problem, we can now consider iterative methods for its solution. One approach would be to fix some $\varepsilon \in (0, 1]$ and compute the deterministic fixed-point iteration

$$\tilde G(m+1) = (1 - \varepsilon)\,\tilde G(m) + \varepsilon\,\mathcal{G}(\tilde G(m), \tilde\gamma(m)), \qquad \tilde\gamma(m+1) = (1 - \varepsilon)\,\tilde\gamma(m) + \varepsilon\,\Gamma(\tilde G(m), \tilde\gamma(m)), \tag{31}$$

given initial guesses $\tilde G(0)$ and $\tilde\gamma(0)$ for $G$ and $\gamma$, respectively. One would choose $\varepsilon = 1$ in this deterministic iteration; we consider arbitrary $\varepsilon \in (0, 1]$ to motivate the stochastic approximation algorithm developed in Section III C. In practice, computing (31) requires computing averages with respect to $\Pi_j(\tilde G(m), \tilde\gamma(m))$ exactly, whereas we can only construct a process that samples $\Pi_j(\tilde G(m), \tilde\gamma(m))$. Therefore, in Section III C, we explain how to solve the fixed-point problem by a stochastic approximation algorithm similar to (31) but based on random approximations of $\mathcal{G}$ and $\Gamma$.

C. A stochastic approximation
In this section, we present the full NEUS algorithm and we interpret it as a stochastic approximation algorithm analogous to the deterministic fixed-point iteration (31). In NEUS, as in the fixed-point iteration, we generate a sequence of approximations $\tilde G(m)$ and $\tilde\gamma(m)$, converging to $G$ and $\gamma$, respectively. During the $m$th iteration of the NEUS algorithm, we update the current approximations $\tilde G(m)$ and $\tilde\gamma(m)$ based on statistics gathered from trajectories of a Markov process $Y^{(r)}_j(\tilde G(m), \tilde\gamma(m))$ whose law depends on $\tilde G(m)$ and $\tilde\gamma(m)$. Recall that $\tau$ is the first time at which the process $(t, X^{(t)})$ is not contained in the set $D \subset \mathbb{N} \times \mathbb{R}^d$. A single step in the evolution of $Y^{(r)}_j(\tilde G(m), \tilde\gamma(m))$ consists of advancing the underlying process $(X^{(t)}, J^{(t)})$ by one step; if the resulting time and position pair remains in the state $J = j$ with $t < \tau$, it becomes the new value of $Y^{(r+1)}_j(\tilde G(m), \tilde\gamma(m))$, and otherwise $Y^{(r+1)}_j(\tilde G(m), \tilde\gamma(m))$ is drawn from $\bar\Pi_j(s, dy; \tilde G(m), \tilde\gamma(m))$.
The process $Y^{(r)}_j(\tilde G, \tilde\gamma)$ preserves the distribution $\Pi_j(t, dx; \tilde G, \tilde\gamma)$ in (28); see Appendix B for a proof. For the following, we assume that $Y^{(r)}_j(\tilde G, \tilde\gamma)$ is ergodic with respect to $\Pi_j(t, dx; \tilde G, \tilde\gamma)$. The implementation details for drawing $Y^{(r+1)}_j(\tilde G(m), \tilde\gamma(m))$ from $\bar\Pi_j(s, dy; \tilde G(m), \tilde\gamma(m))$ will be discussed in more detail in Section V.
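The single-step rule for $Y^{(r)}_j$ can be sketched as follows. This is a structural illustration only: the helper functions (`advance`, `index_of`, `in_D`, `sample_flux`) are hypothetical stand-ins for the underlying dynamics, the index process, membership in $D$, and a draw from the current estimate $\bar\Pi_j(s, dy; \tilde G, \tilde\gamma)$.

```python
def step_restricted(state, j, advance, index_of, in_D, sample_flux):
    """One step of the restricted process Y_j^(r).

    state : current time-position pair (t, x)
    Returns (new_state, exit_point); exit_point is None while the trajectory
    remains in the stratum J = j, and otherwise records the point at which it
    left (so that it can be added to the appropriate list approximating gamma).
    """
    t, x = state
    t_new, x_new = t + 1, advance(t, x)
    if in_D(t_new, x_new) and index_of(t_new, x_new) == j:
        return (t_new, x_new), None          # still restricted to J = j
    exit_point = (t_new, x_new)              # left J = j or reached tau
    return sample_flux(j), exit_point        # restart from the flux estimate
```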
We now state the NEUS algorithm. To simplify the expressions below, we sometimes omit the iteration number $m$. The algorithm proceeds as follows:

1. Choose initial approximations $\tilde G(0)$ and $\tilde\gamma(0)$ of $G$ and $\gamma$, respectively. Fix the number of steps, $K$, taken by each of the processes $Y^{(r)}_j$ per iteration.

2. For each $j$, evolve the process $Y^{(r)}_j(\tilde G(m), \tilde\gamma(m))$ for $K$ time steps, starting from the initial condition $Y^{(0)}_j$. Label the output $Y^{(1)}_j, \dots, Y^{(K)}_j$.

3. Let $M_{ij}(m)$ be the number of $i$ to $j$ transitions of the index process observed while generating $Y^{(1)}_i, \dots, Y^{(K)}_i$.

4. Compute

$$\hat G_{ij}(m) = \frac{M_{ij}(m)}{K} \tag{32}$$

and

$$\hat\gamma_{ij}(m) = \frac{1}{L \wedge M_{ij}(m)} \sum_{k=1}^{L \wedge M_{ij}(m)} \delta_{(s_k, y_k)}, \tag{33}$$

where $L \wedge M_{ij}(m) = \min\{L, M_{ij}(m)\}$ and the $(s_k, y_k)$ are times and positions at which $i$ to $j$ transitions of the index process were observed. In Equation (33), $\delta_x$ represents the Dirac delta function centered at position $x$.

5. Replace the deterministic iteration (31) by the approximation

$$\tilde G(m+1) = (1 - \varepsilon_m)\,\tilde G(m) + \varepsilon_m\,\hat G(m) \tag{34}$$

and

$$\tilde\gamma(m+1) = (1 - \varepsilon_m)\,\tilde\gamma(m) + \varepsilon_m\,\hat\gamma(m), \tag{35}$$

where the sequence $\varepsilon_m \in (0, 1]$ satisfies

$$\sum_{m=0}^{\infty} \varepsilon_m = \infty \qquad \text{and} \qquad \sum_{m=0}^{\infty} \varepsilon_m^2 < \infty. \tag{36}$$

6. Update the estimates of the expectations $f_j$ using the samples $Y^{(1)}_j, \dots, Y^{(K)}_j$. Once the desired level of convergence has been reached, compute the weights $\tilde z$ from $\tilde z^T = \tilde z^T \tilde G + a^T$ and assemble the estimate $\sum_{j=1}^{n} \tilde z_j f_j$ of (10).

We now interpret NEUS as a stochastic approximation algorithm analogous to the deterministic fixed-point iteration (31). First, we observe that $\hat G(m)$ approximates $\mathcal{G}(\tilde G(m), \tilde\gamma(m))$ in the following sense. Suppose we were to compute a sequence $\hat G(n), \hat G(n+1), \dots, \hat G(n+k)$ as in NEUS, except holding the values of $\tilde G(n)$ and $\tilde\gamma(n)$ fixed and computing $\hat G(n+\ell)$ from a trajectory of $Y^{(r)}_i(\tilde G(n), \tilde\gamma(n))$. We would then have

$$\lim_{k \to \infty} \frac{1}{k} \sum_{\ell=1}^{k} \hat G_{ij}(n+\ell) = \mathcal{G}_{ij}(\tilde G(n), \tilde\gamma(n))$$

with probability one, since the left hand side is a trajectory average of $Y^{(r)}_i(\tilde G(n), \tilde\gamma(n))$, and assuming that $Y^{(r)}_i(\tilde G(n), \tilde\gamma(n))$ is ergodic for $\Pi_i(\tilde G(n), \tilde\gamma(n))$, its limit is $\mathcal{G}(\tilde G(n), \tilde\gamma(n))$ by (29). The distribution $\hat\gamma_{ij}(m)$ approximates $\Gamma_{ij}(\tilde G(m), \tilde\gamma(m))$ in a similar sense. Therefore, the NEUS iteration (34) is a version of the deterministic fixed-point iteration (31) but with a shrinking sequence $\varepsilon_m$ instead of a fixed $\varepsilon$ and with random approximations instead of the exact values of $\mathcal{G}$ and $\Gamma$. The conditions (36) on the sequence $\varepsilon_m$ are common to most stochastic approximation algorithms [21]; they are necessary to ensure convergence of the iteration when $\mathcal{G}$ and $\Gamma$ can only be approximated up to random errors.
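With the choice $\varepsilon_m = 1/(m+1)$ used later in the article, the update (34) reduces to keeping a running average of the per-iteration estimates. The following toy sketch (with synthetic noisy matrices in place of the sampled $\hat G(m)$) illustrates that behavior.

```python
import numpy as np

rng = np.random.default_rng(1)

def sa_update(G_tilde, G_hat, m):
    """One stochastic-approximation step, G~(m+1) = (1 - eps_m) G~(m) + eps_m G^(m),
    with eps_m = 1/(m+1); this makes G~(m+1) the average of G^(0), ..., G^(m)."""
    eps = 1.0 / (m + 1)
    return (1.0 - eps) * G_tilde + eps * G_hat

# Toy usage: averaging synthetic noisy estimates of a fixed 2x2 matrix.
G_true = np.array([[0.5, 0.4],
                   [0.3, 0.6]])
G_tilde = np.zeros_like(G_true)
for m in range(2000):
    G_hat = G_true + 0.05 * rng.standard_normal(G_true.shape)   # noisy estimate
    G_tilde = sa_update(G_tilde, G_hat, m)
print(G_tilde)   # close to G_true
```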
We remark that in practice the empirical measures $\tilde\gamma(m)$ are stored as lists of time and position pairs. The update in (35) allows the number of pairs stored in these lists to grow with each iteration. This can lead to impractical memory requirements for the method. We therefore limit the size of each list $\tilde\gamma_{ij}(m)$ to a fixed maximum value by implementing a selection step in which the points that have been stored for the most iterations are removed to make room for the points in the updates of $\tilde\gamma_{ij}(m)$ when this maximum is exceeded. Also, in our numerical experiments in Section V, we use $\varepsilon_m = 1/(m+1)$, in which case $\tilde G(m)$ is simply a running average of the estimates $\hat G(0), \dots, \hat G(m-1)$. This and other details of our implementation are explained in Section V.

IV. ERGODIC AVERAGES
In this section we consider the calculation of ergodic averages with respect to a general (not necessarily time-homogeneous) Markov process. The system might not be stationary, and an "ergodic average" should be understood here as an infinite time average. We will assume that these time averages converge, which, for our purposes, is operationally the same as assuming that the system is stationary and ergodic. We also describe the simplifications that occur when the target Markov process is time-homogeneous, as in the original NEUS algorithm. We illustrate how quantities such as reaction rates can be computed within that context. For a general Markov process we require that the time-averaged distribution

$$\lim_{T \to \infty} \frac{1}{T} \sum_{t=0}^{T-1} \mathbf{P}\left[X^{(t)} \in dx,\ J^{(t)} = j\right]$$

exists as a probability distribution on $\mathbb{R}^d \times \{1, 2, \dots, n\}$. This general ergodicity requirement allows processes $X^{(t)}$ with periodicities or time-dependent forcing.
Our goal is to compute ergodic averages of the form

$$\lim_{T \to \infty} \frac{1}{T} \sum_{t=0}^{T-1} \mathbf{E}\left[f(t, X^{(t)})\right].$$

To that end, in the notation of Section II B we fix a deterministic time horizon $\tau > 0$. If we divide both sides of (16) by $\tau$ and take the limit $\tau \to \infty$, we obtain the equation

$$z^T = z^T G,$$

where now

$$z_j = \lim_{\tau \to \infty} \frac{1}{\tau} \sum_{t=0}^{\tau-1} \mathbf{P}\left[J^{(t)} = j\right] \tag{40}$$

and

$$G_{ij} = \lim_{\tau \to \infty} \frac{\sum_{t=0}^{\tau-1} \mathbf{P}\left[J^{(t+1)} = j,\ J^{(t)} = i\right]}{\sum_{t=0}^{\tau-1} \mathbf{P}\left[J^{(t)} = i\right]}. \tag{41}$$

Note that the matrix $G$ is now stochastic and that $\sum_{j=1}^{n} z_j = 1$. We can rewrite the ergodic average of $f$ as

$$\lim_{T \to \infty} \frac{1}{T} \sum_{t=0}^{T-1} \mathbf{E}\left[f(t, X^{(t)})\right] = \sum_{j=1}^{n} z_j f_j, \qquad f_j = \lim_{\tau \to \infty} \sum_{t=0}^{\tau-1} \int_{\mathbb{R}^d} f(t, x)\,\pi_j(t, dx), \tag{42}$$

and we represent the large $\tau$ limit of the position marginal distribution of $\pi_j$ defined in (11) as

$$\pi_j(dx) = \lim_{\tau \to \infty} \sum_{t=0}^{\tau-1} \pi_j(t, dx).$$

These formulas indicate that the only modification of the algorithm in Section III that is required to compute a long-time average is to set $\tau = \infty$ in the definition of the processes $Y_i(\tilde G, \tilde\gamma)$, to set $a = 0$ in (27), and to let $\tilde z$ solve $\tilde z^T = \tilde z^T \tilde G$ with $\sum_{j=1}^{n} \tilde z_j = 1$. In other words, the algorithm seamlessly transitions from solving the initial value problem to solving the infinite time problem as $\tau$ becomes large. When the joint process $(X^{(t)}, J^{(t)})$ is time-homogeneous and stationary and our goal is to compute the average of a position dependent observable $f(x)$ with respect to the stationary distribution $\pi$ of $X^{(t)}$, the above relations can be further simplified. In this case, the limiting position marginal becomes $\pi_j(dx) = \mathbf{P}\left[X^{(0)} \in dx \mid J^{(0)} = j\right]$, $z_j$ defined in (40) becomes

$$z_j = \mathbf{P}\left[J^{(0)} = j\right],$$

the matrix $G$ in (41) can now be written

$$G_{ij} = \mathbf{P}\left[J^{(1)} = j \mid J^{(0)} = i\right],$$

and the vector $f_j$ defined in (42) becomes

$$f_j = \int_{\mathbb{R}^d} f(x)\,\pi_j(dx).$$

These simplifications lead to a version of the original NEUS method [7] that employs a direct method for solving for the weights similar to the scheme in [9]. In [9] and [8] the basic NEUS approach was extended to the estimation of transition rates between sets for a stationary Markov process. Implicit in this extension was the observation that any algorithm that can efficiently compute averages with respect to the stationary distribution of a time-homogeneous Markov process can be applied to computing dynamic averages more generally by an enlargement of the state space, i.e., by applying the scheme to computing stationary averages for a higher dimensional time-homogeneous Markov process. This idea is also central to Exact Milestoning [24], which extends the original Milestoning procedure [22] to compute steady-state averages with respect to a time-homogeneous Markov process and is very similar in structure to steady-state versions of NEUS. For an alternative implementation of NEUS inspired by Exact Milestoning but presented in the more general context of Sections II B and III, see Appendix C.

V. NUMERICAL EXAMPLES
Here we illustrate the flexibility of the generalized algorithm with respect to both the means of restricting the trajectories (the choice of the J (t) process) and the averages that can be calculated. Specifically, in Section V A we discuss our choice of the J (t) process. In Section V B we show how finite-time hitting probabilities can be calculated by discretizing the state space according to both time and space. In Section V C we show how free energies can be obtained by discretizing the state space according to time and the irreversible work.

A. One choice of the J (t) process
Rapid convergence of the scheme outlined in Section III rests on the choice of $J^{(t)}$. Perhaps the most intuitive choice is

$$J^{(t)} = j \quad \text{if } (t, X^{(t)}) \in A_j, \tag{44}$$

where the subsets $A_1, A_2, \dots, A_n$ partition $\mathbb{N} \times \mathbb{R}^d$. Indeed, earlier steady-state NEUS implementations [7-9, 25, 26] employed an analogous rule using a partition of the space variable (the time variable was not stored or partitioned). However, even with an optimal choice of the subsets $A_1, A_2, \dots, A_n$, (44) has an important disadvantage: in many situations, $X^{(t)}$ frequently recrosses the boundary between neighboring subsets $A_i$ and $A_j$, which slows convergence. Fortunately, there are many alternative choices of $J^{(t)}$ that approximate the choice in (44) while mitigating this issue. We give one simple and intuitive alternative, which we use in the numerical examples that follow. Let $\psi_j$ be a set of non-negative functions on $\mathbb{N} \times \mathbb{R}^d$ for which $\sum_{j=1}^{n} \psi_j = 1$. The $\psi_j$ are generalizations of the functions $1_{A_j}$ in that they serve to restrict trajectories to regions of state space. In practice, given a partition of space $A_1, A_2, \dots, A_n$, the $\psi_j$ can be chosen to be smoothed approximations of the functions $1_{A_j}$. Given a trajectory of $X^{(t)}$, the rule defining $J^{(t)}$ is as follows. Initially, choose $J^{(0)} \in \{1, 2, \dots, n\}$ with probabilities proportional to $\{\psi_1(0, X^{(0)}), \psi_2(0, X^{(0)}), \dots, \psi_n(0, X^{(0)})\}$. At later times $J^{(t)}$ evolves according to the rule:

1. If $\psi_{J^{(t-1)}}(t, X^{(t)}) > 0$ then $J^{(t)} = J^{(t-1)}$.

2. Otherwise, choose $J^{(t)} \in \{1, 2, \dots, n\}$ with probabilities proportional to $\{\psi_1(t, X^{(t)}), \dots, \psi_n(t, X^{(t)})\}$.
While transitions out of $J^{(t)} = i$ occur when $X^{(t)}$ leaves the support of $\psi_i$, transitions back into $J^{(t)} = i$ can only occur outside of the support of the previously occupied $\psi_j$. Thus, this transition rule allows one to separate in space the values of $X^{(t)}$ at which $J^{(t)}$ transitions away from $i$ from those where $J^{(t)}$ transitions into $i$, mitigating the recrossing issues mentioned above.
In our examples, we discretize time and only one additional "collective variable" (a dihedral angle in Section V B and the nonequilibrium work in Section V C). Here we denote the collective variable by $\phi$, and we discretize it within some interval of values $[a, b]$ (though it may take values outside this interval). In both examples $[a, b]$ is evenly discretized into a set of points $\{\phi_k\}_{k=0}^{m_\phi}$ for some integer $m_\phi$. Letting $\phi_j$ be any of the points in that discretization, we set

$$\psi_j(t, x) \propto 1_{[t^j_{\rm start},\, t^j_{\rm end})}(t)\,\max\left\{0,\ 1 - \frac{|\phi(x) - \phi_j|}{\Delta_\phi}\right\},$$

where $\Delta_\phi$ is some fixed value controlling the width of the support of $\psi_j$, and the indicator $1_{[a,b]}$ is used to modify the terminal functions (those centered at the endpoints of $[a, b]$) so that the $\psi_j$ still sum to 1 when $\phi$ falls outside $[a, b]$. Recall that the $\psi_j$ are required to sum to 1. We choose $t^j_{\rm start}$ and $t^j_{\rm end}$ to equally divide the interval $[0, \tau)$, where, in our examples, $\tau$ is a fixed time horizon. The function $\psi_j$ is largest when $t \in [t^j_{\rm start}, t^j_{\rm end})$ and $\phi(x) = \phi_j$. The supports of the various $\psi_j$ correspond to products of overlapping intervals in the $\phi$ variable, but non-overlapping intervals in time. The fact that $\psi_j$ depends on time is essential in our examples.
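As an illustration, the sketch below implements pyramid-shaped $\psi_j$ of the kind described above together with the transition rule of Section V A. All numerical values (interval, number of centers, time windows) are placeholders rather than the settings used in the examples.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical discretization: centers in [a, b] and non-overlapping time windows.
a, b, m_phi = -1.0, 1.0, 10
delta_phi = 2.0 * (b - a) / m_phi            # half-width of each pyramid's support
centers = np.linspace(a, b, m_phi + 1)
t_edges = np.linspace(0, 1000, 6, dtype=int) # 5 time windows on [0, tau)
n = (m_phi + 1) * (len(t_edges) - 1)         # total number of strata

def psi(j, t, phi):
    """Pyramid basis function: a triangular bump in the collective variable,
    supported on a single time window."""
    k, w = j % (m_phi + 1), j // (m_phi + 1)
    if not (t_edges[w] <= t < t_edges[w + 1]):
        return 0.0
    return max(0.0, 1.0 - abs(phi - centers[k]) / delta_phi)

def update_index(j_prev, t, phi):
    """Transition rule for J^(t): keep the previous index while psi_{J} remains
    positive; otherwise reassign proportionally to the psi_j at (t, phi)."""
    if j_prev is not None and psi(j_prev, t, phi) > 0.0:
        return j_prev
    weights = np.array([psi(j, t, phi) for j in range(n)])
    return int(rng.choice(n, p=weights / weights.sum()))

print(update_index(None, 0, 0.05))   # initial assignment at t = 0
```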

B. Finite-time hitting probability
In this section we compute the probability, $P_{BA}(\tau_{\max})$, of hitting a set $B$ before a separate set $A$ and before a fixed time $\tau_{\max} > 0$ given that the system starts at a point $X^{(0)}$. In the case where $X^{(0)}$ and $B$ are separated by a large free energy barrier while $X^{(0)}$ and $A$ are not, computing $P_{BA}(\tau_{\max})$ can be challenging since trajectories that contribute to $P_{BA}(\tau_{\max})$ are rare in direct simulations. To compute $P_{BA}(\tau_{\max})$ via the scheme in Section III C, we let the stopping time $\tau$ be the minimum of $\tau_{\max}$ and the first time, $t$, at which $X^{(t-1)}$ is in either $A$ or $B$, i.e., $\tau - 1 = \min\{\tau_A, \tau_B, \tau_{\max} - 1\}$, where $\tau_A$ and $\tau_B$ are the first times that $X^{(t)}$ enters the sets $A$ and $B$, respectively. Strictly speaking, to write $\tau$ in the form in (9), we need to replace $(t, X^{(t)})$ in that equation by $(t, X^{(t-1)}, X^{(t)})$. The set $D$ corresponding to our choice of $\tau$ is then $D = \{(t, x, y) : t < \tau_{\max},\ x \notin (A \cup B)\}$. As we have already mentioned, this can be done without further modification of the scheme. Then $f(t, X^{(t)})$ in (10) is

$$f(t, X^{(t-1)}, X^{(t)}) = 1_B(X^{(t)}), \tag{46}$$

so that the sum in (10) is 1 precisely when the trajectory enters $B$ before $A$ and before $\tau_{\max}$. The system that we simulate is the alanine dipeptide (CH$_3$-CONH-C$_\alpha$H(C$_\beta$H$_3$)-CONH-CH$_3$) in vacuum modeled by the CHARMM 22 force field [27]. We use the default Langevin integrator [28] implemented in LAMMPS [29], with a temperature of 310 K, a timestep of 1 fs, and a damping coefficient of 30 ps$^{-1}$. The SHAKE algorithm is used to constrain all bonds to hydrogens [30]. We consider the system to be in set $A$ if $-150^\circ < \phi < -100^\circ$ and in set $B$ if $30^\circ < \phi < 100^\circ$ (Figure 2). We discretize time into intervals of $t_{\rm end} - t_{\rm start} = 10^3$ time steps with a terminal time of $\tau_{\max} = 10^4$ time steps. We use the rule outlined in Section V A for the evolution of $J^{(t)}$. We generate the initial point $X^{(0)}$ by running an unbiased simulation at 310 K and choosing a single point $X^{(0)}$ between the sets $A$ and $B$. The vector $a$ defined in (18) is

$$a_j = \psi_j(0, X^{(0)}). \tag{47}$$

Note that the initial condition at $J^{(0)}$ can be drawn from an ensemble of configurations with minimal changes to the algorithm, but we restrict our attention to the initial condition consisting of a single point. To evaluate the performance of the algorithm in Section III C, we choose two points from our direct simulation, one at $\phi = -58.0^\circ$ and one at $\phi = -91.0^\circ$. The former is chosen to allow the NEUS results to be compared with results from unbiased direct simulations, while the latter provides a more challenging test because $P_{BA}$ becomes small when $X^{(0)}$ is close to $A$. We set $K = 1000$ and $L = 1$ and perform a total of $10^4$ iterations (about 5.9 µs of dynamics) of the scheme in Section III C for each starting point. Each step of the process $Y^{(r)}_j(\tilde G(m), \tilde\gamma(m))$ corresponds to 10 time steps of the physical model. The $\tilde\gamma_{ij}$ are represented as lists of time and position pairs with associated weights. We cap the maximum size of those lists at 250 entries. If $\tilde\gamma_{ij}$ reaches this maximum size, each new entry overwrites the oldest previous entry. Because the lists are empty at the start of the calculation, we restrict sampling in Step 2 of the algorithm in Section III C to regions with at least one stored entry point ("progressive initialization" in [26]). When required, a sample $(S, Y)$ is drawn from $\bar\Pi_j(s, dy; \tilde G(m), \tilde\gamma(m))$ as follows: with probability $a_j/[\tilde z_j(1 - \tilde G_{jj})]$, set $S = 0$ and select $Y$ from $\mathbf{P}[X^{(0)} \in dy \mid J^{(0)} = j]$; otherwise, select an index $I$ with probability proportional to the flux $\tilde z_I \tilde G_{Ij}$ and then select $(S, Y)$ from the list of weighted samples comprising $\tilde\gamma_{Ij}(m)$.
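The restart draw just described can be sketched as follows; the data structures are hypothetical (weighted lists standing in for the $\tilde\gamma_{ij}(m)$), and `sample_initial` stands in for a draw from $\mathbf{P}[X^{(0)} \in dy \mid J^{(0)} = j]$.

```python
import numpy as np

rng = np.random.default_rng(3)

def sample_flux(j, z, G, a, gamma_lists, sample_initial):
    """Draw (S, Y) from the estimate of the flux distribution Pi-bar_j.

    z, a        : current weight vector and initial-condition vector
    G           : current estimate of the matrix G
    gamma_lists : gamma_lists[i][j] is a list of (weight, (s, y)) pairs
                  approximating gamma_ij
    """
    n = len(z)
    if rng.random() < a[j] / (z[j] * (1.0 - G[j, j])):
        return 0, sample_initial(j)                     # initial-condition term
    flux = np.array([z[i] * G[i, j] if i != j else 0.0 for i in range(n)])
    i = rng.choice(n, p=flux / flux.sum())              # choose source stratum I
    weights, points = zip(*gamma_lists[i][j])
    k = rng.choice(len(points), p=np.array(weights) / np.sum(weights))
    return points[k]                                    # a stored (s, y) pair
```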
For each $j$ we compute $f_j = P^j_{BA} = M_{jB}/(mK)$, where $M_{jB}$ is the total number of transition events of $X^{(r)}_j$ into $B$ observed after $m$ iterations ($mK$ is the total computational effort in state $j$ after $m$ iterations). The estimate of $P_{BA}(\tau_{\max})$ after $m$ iterations is then computed as $P_{BA}(\tau_{\max}) = \sum_{j=1}^{n} P^j_{BA}\,\tilde z_j(m)$. To assess the efficiency of the trajectory stratification, we also estimate $P_{BA}(\tau_{\max})$ by integrating an ensemble of $N = 10^6$ unbiased dynamics trajectories for $\tau_{\max}$ time steps from the initial point $X^{(0)}$. In this case, $P_{BA}(\tau_{\max}) \approx N_B/N$, where $N_B$ is the number of trajectories that hit set $B$ before set $A$. To assess the accuracy of the NEUS result, we perform 10 independent NEUS calculations. In each NEUS simulation, we estimate the value of $P_{BA}$ as the average over the final 1000 iterations of each simulation and compute the mean of this estimate over the 10 independent NEUS simulations. We obtain $P_{BA}(\tau_{\max}) \approx 4.33 \times 10^{-4}$ from NEUS and $P_{BA}(\tau_{\max}) \approx 4.12 \times 10^{-4}$ from direct simulation for the starting point at $\phi = -58.0^\circ$ (Figure 3). In this case, the NEUS result is within the 95% confidence interval $[3.72 \times 10^{-4}, 4.52 \times 10^{-4}]$ for the direct simulation estimate given the number of samples (estimated as $\pm 1.96\sqrt{p(1-p)/N}$, where $p$ is the estimate of $P_{BA}$ from the direct simulation). We obtain $P_{BA}(\tau_{\max}) \approx 2.94 \times 10^{-8}$ from NEUS for the starting point at $\phi = -91.0^\circ$, consistent with the fact that none of the unbiased trajectories reached $B$ before $A$ in this case. From the same data (for either NEUS or direct simulation), one can easily assemble estimates of $P_{BA}(t)$ for any $t \leq \tau_{\max}$ by counting only those transitions into $B$ that occur before $t$ time steps. Up to a normalization, $P_{BA}(t)$ is the cumulative distribution function for the time that it takes $X^{(t)}$ to enter $B$ conditioned on not entering $A$. Estimates of this cumulative distribution function compiled from the NEUS and direct simulation data are plotted in Figure 4. The NEUS results show excellent agreement with the results from the direct simulation.

(Figure caption fragment) The free energy is computed from the method presented in Section II A as implemented in [19].
Spatiotemporal plots of the weights computed from the converged NEUS calculations and the direct simulations are shown in Figure 5. For both starting points, the stratification scheme is able to efficiently sample events with weights spanning 12 orders of magnitude. When X (0) is close to the boundary of set A, accurate estimation of the very small probability P BA (τ max ) depends sensitively on the ability to realize a set of very rare trajectories, ruling out the use of direct simulation.

C. Free Energy Differences via the Jarzynski Equation
In this section, we show how a specific choice of the $J^{(t)}$ process enables us to stratify a path-dependent variable, specifically, the accumulated work appearing in the Jarzynski equation [6,31]. For a statistical model defined by a density proportional to $\exp[-V(x)]$ (e.g., $V(x)$ is a potential function or a log-likelihood), the normalization constant is $Q = \int e^{-V(x)}\,dx$. In fields ranging from statistics to chemistry, a ratio of normalization constants is often used to compare models [32,33]. Subject to certain conditions [31,34], the Jarzynski equation relates the ratio of normalization constants to an average over paths of a time-dependent process, $X^{(t)}$:

$$\frac{Q_t}{Q_0} = \mathbf{E}\left[e^{-W^{(t)}}\right], \tag{48}$$

where $W^{(t)}$ is the work accumulated along the trajectory up to time $t$, and we refer to $\Delta F = -\log(Q_t/Q_0)$ as the free energy difference. For example, for a small time discretization parameter, $dt$, a suitable choice of dynamics is

$$X^{(t+1)} = X^{(t)} - \partial_x V(t, X^{(t)})\,dt + \sqrt{2\,dt}\,\xi_t, \tag{49}$$

where $\xi_t$ is a standard Gaussian random variable and the accumulated work is incremented at each step by the change in $V(t, x)$ arising from its explicit time dependence. Formula (48) suggests a numerical procedure for estimating free energy differences in which one simulates many trajectories of $X^{(t)}$, evaluates the work $W^{(t)}$ for each, and then uses this sample to compute the expectation on the right hand side of (48) approximately. This approach has been particularly useful in the context of single-molecule laboratory experiments [35,36]. A well-known weakness of this strategy in the fast-switching (small $t$) regime is that large statistical errors result from the fact that low-work trajectories contribute significantly to the expectation but are infrequently sampled [35,37-40].
The quantity that we seek to compute is the free energy difference between a particle in a double-well potential that is additionally harmonically restrained with spring constant $k = 20$ near $x = -1$ and a particle in the same potential restrained near $x = 1$. The model is adapted from the one presented in [32]. Setting $\tau = 1001$, for $t < \tau$ we define the time-dependent potential $V(t, x)$ by moving the center of the harmonic restraint from near $x = -1$ at $t = 0$ to near $x = 1$ at $t = \tau - 1$, with $dt = 0.001$. We show $V(0, x)$, $V(\tau - 1, x)$, and $V(x; k = 0)$ in Figure 6. The process $X^{(t)}$ evolves according to (49).
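For reference, a direct fast-switching estimate of $\Delta F$ for a model of this type can be written in a few lines. The potential below is a hypothetical stand-in with the qualitative features described above (a double well plus a moving harmonic restraint with $k = 20$); it is not the exact model of [32], and temperature factors are absorbed as in the text.

```python
import numpy as np

rng = np.random.default_rng(4)

dt, n_steps, k = 0.001, 1000, 20.0

def V(t, x):
    """Stand-in switching potential: double well plus a harmonic restraint whose
    center moves from -1 (t = 0) to +1 (t = n_steps)."""
    center = -1.0 + 2.0 * t * dt
    return x**4 - 2.0 * x**2 + 0.5 * k * (x - center) ** 2

def dVdx(t, x):
    center = -1.0 + 2.0 * t * dt
    return 4.0 * x**3 - 4.0 * x + k * (x - center)

def switching_work(x0):
    """Integrate overdamped dynamics of the form (49) while switching the
    potential; the work accumulates the energy change at fixed configuration."""
    x, w = x0, 0.0
    for t in range(n_steps):
        w += V(t + 1, x) - V(t, x)          # work increment from the switch
        x += -dVdx(t + 1, x) * dt + np.sqrt(2.0 * dt) * rng.standard_normal()
    return w

# Direct (unstratified) Jarzynski estimate, exp(-dF) ~ mean of exp(-W).
# Trajectories start near the minimum of V(0, x); in practice one would
# first equilibrate with respect to V(0, x).
works = np.array([switching_work(-1.0) for _ in range(2000)])
print(-np.log(np.mean(np.exp(-works))))
```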
The reader may be concerned that the expectation in (48) is not immediately of the general form in (10) suitable for an application of NEUS. We will apply NEUS as described in Section II B to the augmented process $Z^{(t)} = (X^{(t)}, W^{(t)})$. To compute the left hand side of (48) via NEUS, we compute the expectation in (10) with

$$f(t, X^{(t)}, W^{(t)}) = 1_{\{t = \tau - 1\}}\,e^{-W^{(t)}}.$$

The index process $J^{(t)}$ will mark transitions between regions of the time $t$ and accumulated work $W^{(t)}$ variables. We discretize the work space in overlapping subsets using the pyramid form in Section V A. We use 86 subsets with centers evenly spaced on the interval $[-25.0, 30.0]$ with a width of $\Delta_\phi = 0.6$. We discretize time into 5 discrete non-overlapping subsets every 200 time steps, for a total of 430 subsets. We again cap the maximum size of the list representation of $\{\tilde\gamma_{ij}\}$ at 250 entries, using the same scheme as in Section V B.
For both NEUS and direct simulations, we prepare an ensemble of 1000 starting states $X^{(0)}$ by performing an unbiased simulation with fixed potential $V(0, x)$ for $10^6$ steps, saving every 1000 steps. The direct fast-switching simulations start from each of these points and comprise 1000 steps of integration forward in time; each trajectory contributes equally to the left hand side of (48). For the NEUS simulations, the vector $a$ is constructed as in (47), and trajectories are initialized at $J^{(0)}$ by drawing uniformly from this ensemble. We set $K = 1000$ and $L = 1$, and we perform 250 iterations. Each step in $K$ corresponds to a single step of (49). As in Section V B, we sample only in the restricted distributions where there is at least one point stored in $\tilde\gamma$ from which to restart the dynamics.
The estimated $\Delta F$ produced from data generated in the last 50 iterations of NEUS is 5.87 (the units are chosen to absorb temperature factors above), which is in excellent agreement with the reference value of 5.94, in contrast to the estimate from direct simulation (Figure 7). The left panel of Figure 8 shows the weights along the time and work axes. In the right panel of Figure 8 we plot histogram approximations of the density $P_W(w)$ of $W^{(\tau-1)}$ along with the weighted density proportional to $P_W(w)\exp(-w)$. The separation of the peaks of these distributions highlights how NEUS is able to effectively sample the low-work tails that contribute significantly to the expectation in the Jarzynski relation in (48) but are rarely accessed by the switching procedure in the unbiased simulations.

FIG. 6. $V(0, x)$ (blue) and $V(\tau - 1, x)$ (green) for the switching process used to compute Jarzynski's equality. For reference, the potential with $k = 0$ (black) is also shown.

VI. CONCLUSIONS
We describe a trajectory stratification framework for the estimation of expectations with respect to arbitrary Markov processes. The basis for this framework is the nonequilibrium umbrella sampling (NEUS) method originally introduced to compute steady-state averages. Our development highlights the structural similarities between the nonequilibrium and equilibrium US algorithms and places the NEUS method within the general context of stochastic approximation. These connections have practical implications for further optimizing the procedure and point the way to a more in-depth convergence analysis that will be the subject of future work. Our development reveals that the basic trajectory stratification approach can be useful well beyond the estimation of stationary averages for time-homogeneous Markov processes. This flexibility is demonstrated in two examples, both involving an expectation over trajectories of finite duration. In the first example, we show that the probability of first hitting a set within a finite time can be efficiently computed via stratification even when the dynamics start close to a competing absorbing state. In our second example, we use NEUS to stratify a process according to a path-dependent variable, the accumulated work in a nonequilibrium process appearing in the Jarzynski equation. The result is a novel and effective scheme for estimating free energy differences by enhancing sampling of the tails of the accumulated work distribution.
Our general framework also suggests new and exciting applications of trajectory stratification. For example, with little modification, these methods can be applied to sequential data assimilation applications where the goal is to approximate averages with respect to the conditional distribution of a hidden signal X (t) given sequentially arriving observations (i.e., with respect to the posterior distribution). In high-dimensional settings (e.g., weather forecasting) the only practical alternatives are limited to providing information about only the mode of the posterior distribution (i.e., variational methods) or involve uncontrolled and often unjustified approximations (i.e., Kalman-type schemes). The approach that we present here opens the door to efficient data assimilation, machine learning, and, more generally, new forms of analysis of complex dynamics.
Appendix B

The expression in parentheses on the right side of the last equality is equal to one since we assume that $\mathbf{E}[\tau] < \infty$, which implies $\tau < \infty$ with probability one. Thus, for $s > 0$,

$$\sum_{t=0}^{\infty} \int_x Q(t, x; s, dy)\,\Pi_j(t, dx) \propto \Pi_j(s, dy).$$
A very similar argument shows that the same expression holds when $s = 0$. Thus, $\Pi_j$ is invariant for $Y^{(r)}_j$.

Appendix C: An alternative implementation
The Exact Milestoning method recently proposed for computing certain dynamic averages with respect to a stationary time-homogenous Markov process [24] has a structure very similar to the steady-state version of NEUS that we have described in Section IV. The differences in implementation between the two methods suggest an alternative implementation of NEUS in the more general context of Sections II B and III.
To describe this alternative implementation, it is convenient to work with a slightly different set of variables that reflects the interface-to-interface emphasis in Milestoning. Here we show how these new variables are equivalent to the variables that we used to describe NEUS. In particular, we define the substochastic matrix

$$H_{ij} = \frac{\sum_{\ell=0}^{\infty} \mathbf{P}\left[S^{(\ell+1)} < \tau,\ J^{(S^{(\ell+1)})} = j,\ J^{(S^{(\ell)})} = i\right]}{\sum_{\ell=0}^{\infty} \mathbf{P}\left[J^{(S^{(\ell)})} = i,\ S^{(\ell)} < \tau\right]} = \sum_{t=0}^{\infty} \int_{x \in \mathbb{R}^d} \mathbf{P}_{t,x,i}\left[\sigma(t) < \tau,\ J^{(\sigma(t))} = j\right]\,\bar\pi_i(t, dx), \tag{C1}$$

where $\sigma(t)$ denotes the first time after $t$ at which the index process changes value. The matrix $H$ can be expressed in terms of $G$ by

$$H_{ij} = \frac{G_{ij}}{1 - G_{ii}}, \qquad i \neq j. \tag{C2}$$

We also define the total flux into $J = i$,

$$v_i = a_i + \sum_{k \neq i} z_k G_{ki},$$

and we define the matrix of subprobability distributions (they do not integrate to 1)

$$\xi_{ij}(s, dy) = \sum_{t=0}^{\infty} \int_{x \in \mathbb{R}^d} \mathbf{P}_{t,x,i}\left[\sigma(t) = s,\ s < \tau,\ X^{(s)} \in dy,\ J^{(s)} = j\right]\,\bar\pi_i(t, dx),$$

which are related to the conditional flux distributions $\gamma_{ij}$ as defined in (24) by

$$\xi_{ij}(s, dy) = H_{ij}\,\gamma_{ij}(s, dy).$$

Note that $\sum_j \xi_{ij}$ is the (sub)distribution of transition points leaving $J = i$. We have already seen in (25) that

$$v_j\,\bar\pi_j(s, dy) = \begin{cases} \sum_i v_i\,\xi_{ij}(s, dy), & s > 0, \\ a_j\,\mathbf{P}\left[X^{(0)} \in dy \mid J^{(0)} = j\right], & s = 0. \end{cases}$$
We will construct an algorithm that finds the unnormalized distribution $\theta_j = v_j\,\bar\pi_j$.
Given the current iterate $\tilde\theta(m)$ with the vector of normalizations $\tilde v(m)$, the stochastic approximation algorithm corresponding to this fixed-point problem fixes a non-negative integer $N$ and, for each $i$, initiates $N$ independent trajectories of $(t, X^{(t)})$ from $\tilde\theta_i(m)/\tilde v_i(m)$, all of which are run until $\tau$ or the next transition of the index process. Suppose that $N_i$ of these $N$ trajectories end upon a transition of the index process (rather than upon reaching time $\tau$). Let $\{(J^i_\ell(m), T^i_\ell(m), X^i_\ell(m))\}_{\ell=1}^{N_i}$ be the corresponding values of the index process, time, and position upon those transition events. Then we can define the empirical subprobability measures

$$\hat\theta_j(m)(s, dy) = \sum_{i=1}^{n} \frac{\tilde v_i(m)}{N} \sum_{\ell=1}^{N_i} 1\left\{J^i_\ell(m) = j\right\}\,\delta_{(T^i_\ell(m),\, X^i_\ell(m))}(s, dy).$$

Note that $\mathbf{E}\left[\hat\theta_j(m)\right] = \Theta_j(\tilde\theta(m))$.
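The update above can be sketched structurally as follows. Because the precise expressions of the original are not reproduced here, this is only an illustration of the bookkeeping under the stated assumptions: a fixed number $N$ of fragments is launched from each current estimate $\tilde\theta_i(m)/\tilde v_i(m)$, and each fragment that ends in a transition carries weight $\tilde v_i(m)/N$ into the stratum it enters; the mixing of this empirical estimate with the previous iterate is omitted.

```python
import numpy as np

rng = np.random.default_rng(5)

def empirical_theta_update(theta, v, N, run_fragment):
    """Structural sketch of one empirical update of the unnormalized flux
    distributions theta_j = v_j * pi-bar_j.

    theta[i]     : list of (weight, (t, x)) pairs approximating theta_i
    v[i]         : total weight of theta[i]
    run_fragment : function (t, x) -> (j, s, y) for a transition into J = j at
                   time s and position y, or None if tau was reached first
    """
    n = len(theta)
    theta_hat = [[] for _ in range(n)]
    for i in range(n):
        weights = np.array([w for w, _ in theta[i]])
        for _ in range(N):
            idx = rng.choice(len(theta[i]), p=weights / weights.sum())
            t, x = theta[i][idx][1]              # a draw from theta_i / v_i
            result = run_fragment(t, x)
            if result is not None:
                j, s, y = result
                theta_hat[j].append((v[i] / N, (s, y)))
    v_hat = [sum(w for w, _ in theta_hat[j]) for j in range(n)]
    return theta_hat, v_hat
```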
An approximation of $\bar\pi_j$ is obtained at any time by normalizing $\tilde\theta_j(m)$.
Both the implementation described in this section and the one described in Section III store an empirical estimate of $\bar\pi$ (in Section III the estimate of $\gamma$ directly corresponds through (27) to an estimate of $\bar\pi$). The core difference between the two implementations is that in this section one generates a fixed number of trajectory fragments for each value of the index process, while in the implementation in Section III the number of trajectory fragments generated is random but the total computational expense for each value of the index process is fixed. This alternative iteration has the advantage that the update in (C3) is completely uncorrelated between iterations (conditioned on the current iterate). On the other hand, a potential disadvantage is that the random and varied computational effort for each value of the index process could make synchronization difficult in a parallel implementation.