ADM-CLE approach for detecting slow variables in continuous time Markov chains and dynamic data

A method for detecting intrinsic slow variables in high-dimensional stochastic chemical reaction networks is developed and analyzed. It combines anisotropic diffusion maps (ADM) with approximations based on the chemical Langevin equation (CLE). The resulting approach, called ADM-CLE, has the potential to be more efficient than the ADM method for a large class of chemical reaction systems, because it replaces the computationally most expensive step of ADM (running local short bursts of simulations) with an approximation based on the CLE. The ADM-CLE approach can be used to estimate the stationary distribution of the detected slow variable, without any a priori knowledge of it. If the conditional distribution of the fast variables can be obtained analytically, then the resulting ADM-CLE approach does not make any use of Monte Carlo simulations to estimate the distributions of either the slow or the fast variables.

1. Introduction. The time evolution of a complex chemical reaction network often occurs at different time scales, and the observer is interested in tracking the evolution of the slowly evolving quantities (i.e., of the so-called slow variables) as opposed to recording each and every single reaction that takes place in the system. Whenever a separation of scales exists, one has to simulate a large number of reactions in the system in order to capture the evolution of the slowly evolving variables. With this observation in mind, it becomes crucial to be able to detect and parametrize the underlying slow manifold corresponding to the slow variables intrinsic to the system. In the present paper, we introduce an unsupervised method for discovering the underlying hidden slow variables of chemical reaction networks, together with their stationary distributions, using the anisotropic diffusion map (ADM) framework [39].
The ADM is a special class of diffusion maps, which have gained tremendous popularity in machine learning and statistical analysis in recent years as a robust nonlinear dimensionality reduction technique [36,2,10,6]. Diffusion maps have been successfully used as a manifold learning tool: it is assumed that the high-dimensional data lies on a lower-dimensional manifold, and one tries to capture the underlying geometric structure of the data, a setup in which traditional linear dimensionality reduction techniques (such as Principal Component Analysis) have been shown to fail. In the diffusion map setup, one constructs or is given a sparse weighted connected graph (usually in the form of a weighted k-Nearest-Neighbor graph, with each node connected only to its k nearest or most similar neighbors), and uses it to build the associated combinatorial Laplacian L̃ = D − W, where W denotes the matrix of weights and D denotes a diagonal matrix with D_{ii} equal to the sum of all weights of node i. Next, one considers the generalized eigenvalue problem L̃x = λDx, whose solutions are related to the solutions of the eigenvalue problem Lx = λx, where L = D^{−1}W is a row-stochastic matrix often dubbed the random walk normalized Laplacian. Whenever the pair (λ, x) is an eigenvalue-eigenvector solution of Lx = λx, then (1 − λ, x) is a solution of L̃x = λDx. The (non-symmetric) matrix L can also be interpreted as the transition probability matrix of a Markov chain with state space given by the nodes of the graph, with entry L_{ij} denoting the one-step transition probability from node i to node j.
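A minimal numerical check of this relation between the two eigenvalue problems (with hypothetical edge weights) can be written as follows:

```python
import numpy as np

# Small weighted connected graph with hypothetical edge weights.
W = np.array([[0.0, 1.0, 0.5, 0.0],
              [1.0, 0.0, 1.0, 0.2],
              [0.5, 1.0, 0.0, 1.0],
              [0.0, 0.2, 1.0, 0.0]])

D = np.diag(W.sum(axis=1))       # diagonal degree matrix, D_ii = sum_j W_ij
L_comb = D - W                   # combinatorial Laplacian  L~ = D - W
L_rw = np.linalg.solve(D, W)     # random walk normalized Laplacian  L = D^{-1} W

# L is row-stochastic: every row sums to one.
assert np.allclose(L_rw.sum(axis=1), 1.0)

# Every eigenpair (lam, x) of L x = lam x gives the pair (1 - lam, x)
# of the generalized problem  L~ x = lam D x.
evals, evecs = np.linalg.eig(L_rw)
for lam, x in zip(evals, evecs.T):
    assert np.allclose(L_comb @ x, (1.0 - lam) * (D @ x))
```

Since the graph is connected, the largest eigenvalue of L is exactly 1, attained by the constant eigenvector.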
In the diffusion map framework, one exploits the property that the top nontrivial eigenvector of the graph Laplacian is piecewise constant on subsets of nodes in the domain that correspond to the same state of the underlying slow variable. We make this statement precise in Section 4, and further use the resulting classification in Section 5 to propose an unsupervised method for computing the stationary distribution of the hidden slow variable, without using any prior information on its structure. Since the top eigenvectors of the above Laplacian define the coarsest modes of variation in the data, and have a natural interpretation in terms of diffusion and random walks, they have been used in a very wide range of applications, including but not limited to partitioning [41,40], clustering and community detection [34,45,33], image segmentation [38], ranking [47,17], and data visualization and learning from data [7,36].
The main application area studied in this paper are stochastic models of chemical reaction networks. They are written in terms of stochastic simulation algorithms (SSAs) [21,22] which have been used to model a number of biological systems, including the phage λ lysis-lysogeny decision circuit [1], circadian rhythms [44], and the cell cycle [27]. The Gillespie SSA [21] is an exact stochastic method that simulates every chemical reaction, sampling from the solution of the corresponding chemical master equation (CME). To characterize the behavior of a chemical system, one needs to simulate a large number of reactions and realizations, which leads to very computationally intensive algorithms. For suitable classes of chemically reacting systems, one can sometimes use exact algorithms which are equivalent to the Gillespie SSA, but are less computationally intensive, such as the Gibson-Bruck SSA [19] and the Optimized Direct Method [5]. However, these methods also stochastically simulate the occurrence of every chemical reaction, which can be a computationally challenging task for systems with a very large number of species. One way to tackle this problem is to use parallel stochastic simulations [28]. In this work, we discuss an alternative approach which does not make use of parallel stochastic simulations, but at the same time, the proposed approach can also benefit from large processing power and parallel computing, as many steps of our proposed algorithms are highly parallelizable.
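For concreteness, the Gillespie direct method referred to above can be sketched in a few lines; the birth-death system used to exercise it below is a hypothetical toy example, not one of the systems studied in this paper:

```python
import numpy as np

def gillespie_ssa(x0, nu, propensities, t_end, rng):
    """Gillespie direct method: simulate every reaction event exactly."""
    t, x = 0.0, np.array(x0, dtype=float)
    times, states = [t], [x.copy()]
    while t < t_end:
        a = propensities(x)                   # propensity of each channel
        a0 = a.sum()
        if a0 <= 0.0:                         # no reaction can fire anymore
            break
        t += rng.exponential(1.0 / a0)        # exponential waiting time
        j = rng.choice(len(a), p=a / a0)      # sample which reaction fires
        x += nu[j]                            # apply stoichiometric change
        times.append(t)
        states.append(x.copy())
    return np.array(times), np.array(states)

# Hypothetical birth-death system: 0 -> X (rate k1*V), X -> 0 (rate k2*x).
k1, k2, V = 1.0, 0.1, 10.0
nu = np.array([[1.0], [-1.0]])                # one row per reaction channel
alpha = lambda x: np.array([k1 * V, k2 * x[0]])
times, states = gillespie_ssa([100], nu, alpha, 50.0, np.random.default_rng(0))
```

Every event is simulated individually, which is exactly the cost issue discussed above: the number of loop iterations grows with the total propensity and the time horizon.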
An alternative approach to treating the molecular populations as discrete random variables is to describe them in terms of their continuously changing concentrations, which can be done via the chemical Langevin equation (CLE), a stochastic differential equation that links the discrete stochastic simulation algorithm with the deterministic reaction rate equations [20]. Although such an approach can be less computationally expensive, it comes with the disadvantage that, for certain chemical systems, it can lead to negative populations [46]. In addition, note that none of the above approaches takes explicit advantage of a separation of scales if one exists, an observation which we will make use of in this paper, as detailed in Sections 4 and 5.
It is often the case that a modeller is not interested in every single reaction which takes place in the system, but only in the slowly evolving quantities. Certain systems possess multiple time scales, meaning that one has to simulate a large number of reactions to reveal the slow dynamics. Several algorithms for chemical networks with fast and slow variables have already been developed in the literature. The authors of [25] proposed to simulate the fast reactions using Langevin dynamics, and the slow reactions using the Gillespie algorithm. This approach requires both the time scale separation and a sufficiently large system volume; however, the latter constraint can be avoided by using probability densities of the fast species conditioned on the slow species, and estimating the effective propensity functions of the slow species [3,4,13,37,43]. An alternative approach to simulating the evolution of the slow variables while avoiding doing so for the fast variables is to estimate the probability distribution of the slow variables [18]. The key point in this approach is to use short bursts of appropriately initialized stochastic simulations to estimate the drift and diffusion coefficients for an approximating Fokker-Planck equation written in terms of the slow variables [15]. The success of this approach has already been demonstrated in a range of applications including materials science [24], cell motility [14], and social behavior of insects [42].
Ref. [8] introduces the conditional stochastic simulation algorithm (CSSA) that allows one to sample efficiently from the distribution of the fast variables conditioned on the slow ones, and to estimate the coefficients of the effective stochastic differential equation (SDE) on the fly via the proposed constrained multiscale algorithm (CMA). The CMA can be further modified by estimating the drift and diffusion coefficients in the form given by the CLE for the slow subsystem, which requires the estimation of effective propensity functions of slow reactions [9]. The main question we address in our present work builds on and combines two already existing ideas investigated in [8] and [39], and brings several computational and algorithmic improvements. The above-mentioned CSSA algorithm explicitly makes use of the knowledge of the slow variables (often unavailable in many real applications), a drawback we address as explained later in Section 4, where, driven by the top eigenvector of an appropriately constructed Laplacian, we discover the underlying slow variable. In doing so, we make use of the ADM framework [39], which modifies the traditional diffusion map approach to take into account the time-dependence of the data, i.e., the time stamp of each of the data points under consideration. By integrating local similarities at different scales, the ADM gives a global description of the data set.
The rest of this paper is organized as follows. In Section 2 we provide a mathematical framework for multiscale modeling of stochastic chemical reaction networks and detail the two chemical systems which we use to illustrate our approach. In Section 3 we introduce the ADM-CLE framework and highlight its differences from the approaches previously introduced in the literature. In Section 4 we propose a robust mapping from the observable space to the "dynamically meaningful" inaccessible space, which allows us to recover the hidden slow variables. In Section 5 we introduce a Markov-based approach for approximating the stationary distribution of the slow variable, and compare our results with another recently proposed approach. We conclude with a summary and discussion of future work in Section 6.
2. Problem formulation. A multi-scale modeling framework for stochastic chemical reaction networks can be formulated as follows. We consider a well-mixed system of ℓ chemical species, denoted by X_1, X_2, . . . , X_ℓ, that interact through m reaction channels R_1, R_2, . . . , R_m in a reactor of volume V. We denote the state of the system by X(t) = [X_1(t), X_2(t), . . . , X_ℓ(t)], where X_i(t), i = 1, 2, . . . , ℓ, represents the number of molecules of type X_i in the system at time t. With a slight abuse of notation, we interchangeably use X_i to denote the type i of the molecule. In certain scenarios, one may assume that the reactions can be classified as either fast or slow, depending on the time scale of occurrence [3]. As expected, the fast reactions occur many times on a timescale on which the slow reactions occur with very small probability. As defined in [3], the fast species, denoted by F, are those species whose population gets changed by a fast reaction. Slow species (denoted by S) are not changed by fast reactions. Since slow species include not only species from the set {X_1, X_2, . . . , X_ℓ}, but also those functions of them which are left unchanged by fast reactions, the fast and slow species together can be used as a basis for the state space of the system, whose dimension equals the number of linearly independent species.
For each reaction channel R_j, j = 1, 2, . . . , m, there exists a corresponding propensity function α_j ≡ α_j(x), such that α_j dt denotes the probability that, given X(t) = x, reaction R_j occurs within the infinitesimal time interval [t, t + dt). We denote by ν the stoichiometric matrix of size m × ℓ, with entry ν_{ji} denoting the change in the number of molecules of type X_i caused by one occurrence of reaction channel R_j. The continuous-time, discrete-in-space Markov chain can be further approximated by the CLE for a multivariate continuous Markov process [20]. Using time step ∆t, the Euler-Maruyama discretization of the CLE is given by

X_i(t + ∆t) = X_i(t) + \sum_{j=1}^{m} ν_{ji} α_j(X(t)) ∆t + \sum_{j=1}^{m} ν_{ji} N_j(t) \sqrt{α_j(X(t)) ∆t}, for all i = 1, 2, . . . , ℓ, (2.1)

where X_i, with another slight abuse of notation, denotes a real-valued approximation of the number of molecules of the i-th chemical species, i = 1, 2, . . . , ℓ. Here, N_j(t), j = 1, 2, . . . , m, denote a set of m independent normally distributed random variables with zero mean and unit variance.
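A single Euler-Maruyama step of (2.1) can be implemented directly from the stoichiometric matrix and the propensity functions; the two-species system used in the call below is a hypothetical example whose reactions conserve x_1 + x_2:

```python
import numpy as np

def cle_em_step(x, nu, propensities, dt, rng):
    """One Euler-Maruyama step of the CLE (2.1).

    x: real-valued state of length l; nu: m x l stoichiometric matrix with
    rows nu_j; propensities: map x -> (alpha_1(x), ..., alpha_m(x)).
    """
    a = np.maximum(propensities(x), 0.0)     # propensities must stay nonnegative
    xi = rng.standard_normal(len(a))         # N_j(t): iid standard normals
    drift = nu.T @ (a * dt)                  # sum_j nu_ji alpha_j(x) dt
    noise = nu.T @ (xi * np.sqrt(a * dt))    # sum_j nu_ji N_j sqrt(alpha_j dt)
    return x + drift + noise

# Hypothetical 2-species example: X1 -> X2 and X2 -> X1.
nu = np.array([[-1.0, 1.0],
               [1.0, -1.0]])
alpha = lambda x: np.array([2.0 * x[0], 2.0 * x[1]])
x = np.array([100.0, 100.0])
rng = np.random.default_rng(0)
for _ in range(1000):
    x = cle_em_step(x, nu, alpha, 1e-3, rng)
```

Because each row of ν sums to zero in this toy system, both the drift and the noise term leave x_1 + x_2 unchanged (up to floating-point rounding), mirroring the conservation relations that give rise to slow variables.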
2.1. Illustrative example CS-I. As the first illustrative example, we consider a simple two-dimensional chemical system, with the two chemical species denoted by X_1 and X_2 (i.e., ℓ = 2), which are subject to four reaction channels R_j, j = 1, 2, 3, 4 (i.e., m = 4), given by (2.2). Throughout the rest of this paper, we shall refer to the chemical system (2.2) as CS-I (i.e., "chemical system I"). We label by R_j the reaction corresponding to the reaction rate subscript k_j, j = 1, 2, 3, 4, and note that each reaction R_j has an associated propensity function α_j(t), given by the standard mass-action form [21], where V denotes the volume of the reactor. We consider the system with a fixed choice of dimensionless parameter values. We plot in Figure 2.1(a) the time evolution of the two species in system (2.2), together with the slow variable S = (X_1 + X_2)/2, starting from the initial conditions X_1(0) = X_2(0) = 100. As the figure shows, the system variables X_1 and X_2 change very frequently (thus we label them fast variables), while the newly defined variable S changes very infrequently and can be considered a slow variable. Following [26], for the chemical system (2.2), comprised only of monomolecular reactions, it is possible to compute the stationary distribution of the slow variable S analytically, since the joint stationary distribution of the two variables X_1 and X_2 is a multivariate Poisson distribution,

P(X_1 = n_1, X_2 = n_2) = (λ̄_1^{n_1} / n_1!) e^{−λ̄_1} (λ̄_2^{n_2} / n_2!) e^{−λ̄_2},

with parameters λ̄_1 and λ̄_2.

[Figure 2.1: (a) trajectory of X_1, X_2, and S = (X_1 + X_2)/2 for CS-I; (b) trajectory of X_1, X_2, and S/3 = (X_1 + 2X_2)/3 for CS-II.]

2.2. Illustrative example CS-II. The second example is taken from Ref. [8]. We shall refer to it as CS-II from now on. We consider the following system involving two molecular species X_1 and X_2, whose reactions R_1, R_2, . . . , R_6 have the propensity functions given by (2.7), where V denotes the system volume.
Figure 2.1(b) shows a simulated trajectory of this chemical system obtained using the Gillespie algorithm for the following dimensionless parameters [8]:

k_1 = 32, k_2 = 0.04 V, k_3 V = 1475, k_4 = 19.75, k_5 = 10 V, k_6 = 4000, (2.8)

where we use V = 8. Note that in this second example, reactions R_5 and R_6 occur on a much faster timescale than the other four reactions R_1, R_2, R_3 and R_4. A natural choice for the slow variable is S = X_1 + 2X_2, which is invariant with respect to all fast reactions [8], as we illustrate in Figure 2.1(b).

Main problem. Our end goal in the present paper is to propose an algorithm that efficiently and accurately estimates the stationary probability density of the hidden slow variable S, without any prior knowledge of it. The approach we propose builds on the anisotropic diffusion map (ADM) framework, to implicitly discover the mapping from the observable state space to the dynamically meaningful coordinates of the fast and slow variables, as previously introduced in [39], and on the CLE approximation (2.1).
3. ADM-CLE approach. Let us consider example CS-II, and assume that s = s(x_1, x_2) = x_1 + 2x_2 and f = f(x_1, x_2) = x_1 are the slowly and rapidly changing variables, respectively. Together they define a mapping g : (x_1, x_2) → (s, f) from the observable state variables x_1 and x_2 in the accessible space O to the "dynamically meaningful" (but in more complicated examples inaccessible) slow variable s and the fast variable f, both in space H. In other words, g maps (x_1, x_2) → (x_1 + 2x_2, x_1), and conversely, its inverse h = g^{−1} maps (s, f) → (f, (s − f)/2). The approach introduced in [39] exploits the local point clouds generated by many local bursts of simulations at each point (x_1, x_2) in the observable space O. Such observable local point clouds are the images under h of similar local point clouds in the inaccessible space H (at the corresponding coordinates (s, f) such that h(s, f) = (x_1, x_2)), which, due to the separation of scales between the fast and slow variables f and s, have the appearance of thin elongated ellipses. It is precisely this separation of scales that we leverage to build a sparse anisotropic graph Laplacian L in the observable space, and use it as an approximation of the isotropic graph Laplacian in the inaccessible space H. As we shall see, the top nontrivial eigenvector of L robustly indicates all pairs of original states (x_1, x_2) that correspond to the same value of the slow variable S = s (where s = x_1 + 2x_2 for CS-II). In other words, we discover on the fly the structure of the slow variable S, and further integrate this information into a Markov-based method for estimating its stationary distribution P(S = s), while also computing along the way an analytical expression for the conditional distribution of the fast variable given the slow variable, P(F = f | S = s).
Singer et al. [39] run many local bursts of simulations for a short time step δt starting at (x_1, x_2). Such trajectories end up at random locations forming a cloud of points in the observable plane O, approximately distributed as a bivariate normal distribution with 2 × 2 covariance matrix Σ. The shape of the resulting point cloud is an ellipse whose axes reflect the dynamics of the data points. In other words, when there is a separation of scales, the ellipses are thin and elongated, with the ratio between the axes of the ellipse given by the ratio of the two eigenvalues of Σ. The first eigenvector, corresponding to the larger eigenvalue λ_1, points in the direction of the fast dynamics on the line x_1 + 2x_2 = s, while the second one points in the direction of the slow dynamics. In particular, the ratio τ of the two eigenvalues, defined in (3.1), is a small parameter, i.e. 0 < τ ≪ 1. In general, we wish to piece together locally defined components into a globally consistent framework, a nontrivial task when the underlying unobservable slow variables (or the propensity functions of the system) are complicated nonlinear functions of the observable variables in O.
The construction of the ADM framework in [39] relates the anisotropic graph Laplacian in the observable space O with the isotropic graph Laplacian in the inaccessible space H. In that setup, each of the N data points x^{(i)}, i = 1, 2, . . . , N, lives in an ℓ-dimensional data space. For both CS-I and CS-II, the data is two-dimensional, thus ℓ = 2. For the former system, we consider each lattice point in the domain [50, 150] × [50, 150], hence there are N = 101² = 10,201 states, while for the latter one we consider the domain [1, 110] × [1, 110], i.e. N = 110² = 12,100. Throughout the paper, we will often refer to the N data points x^{(i)}, i = 1, . . . , N, as O-states of the chemical system. The ADM [39] then generates ensembles of short simulation bursts at each of the N points in the data set, computes the averaged position by statistically averaging over the many simulated trajectories, and obtains an estimate of the local 2 × 2 covariance matrix Σ^{(i)}. For each data point x^{(i)}, the inverse of Σ^{(i)} is computed, and a symmetric Σ-dependent squared distance between pairs of data points in the two-dimensional observable space R² (given by (3.3) below) is defined. The ADM framework then uses this dynamic distance measure to approximate the Laplacian on the underlying hidden slow manifold. We provide further details on the anisotropic diffusion map framework in Section 3.2. We now highlight the first difference between the approach taken in this paper and that in [39].
3.1. Replacing short simulation bursts by the CLE approximation. The local bursts of simulations initiated at each data point in order to estimate the local covariances may be computationally expensive. In this paper, we bypass these short bursts of simulations by using the approximation given by the CLE (2.1), which allows for a theoretical derivation of the local 2 × 2 covariance matrices. Using (2.1), we obtain

Σ^{(i)} = δt \sum_{j=1}^{m} α_j(x^{(i)}) ν_j^T ν_j, (3.2)

where ν_j denotes the j-th row of the stoichiometric matrix ν and δt is the length of the (hypothetical) simulation burst. Computing the eigen-decomposition of a local covariance matrix is analogous to performing Principal Component Analysis on the local cloud of points generated by the short simulation bursts. The advantage of (3.2) over the computational approach used in [39] is that Σ^{(i)} can be computed at each data point without running (computationally intensive) short bursts of simulations. The error of the CLE approximation depends on the values of the coordinates of the data point x^{(i)}, i.e. on the system volume V [23,12]. In the case of CS-I or CS-II, the most probable states contain about one hundred molecules of each chemical species, and the CLE approximation (3.2) is well justified.
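The CLE-based covariance (3.2) amounts to the matrix product δt ν^T diag(α(x)) ν, which is cheap to evaluate; the sketch below uses a hypothetical two-species system with two fast reactions (conserving x_1 + x_2) and one slow reaction, so that the predicted local cloud is a thin ellipse:

```python
import numpy as np

def cle_local_covariance(x, nu, propensities, dt):
    """Local CLE covariance (3.2): Sigma = dt * sum_j alpha_j(x) nu_j^T nu_j."""
    a = propensities(x)
    return dt * nu.T @ np.diag(a) @ nu   # equals dt * sum_j a_j outer(nu_j, nu_j)

# Hypothetical system: fast reactions X1 <-> X2 plus a slow production of X1.
nu = np.array([[1.0, -1.0],
               [-1.0, 1.0],
               [1.0, 0.0]])
alpha = lambda x: np.array([2.0 * x[0], 0.5 * x[1], 0.05])
Sigma = cle_local_covariance(np.array([100.0, 80.0]), nu, alpha, dt=0.01)

# PCA of the predicted local cloud: the eigenvalue ratio tau is small,
# reflecting the thin, elongated ellipse caused by the separation of scales.
evals, evecs = np.linalg.eigh(Sigma)     # ascending eigenvalues
tau = evals[0] / evals[1]
```

Here the fast reactions dominate the covariance, so the leading eigenvector points along the fast direction (−1, 1) and τ is several orders of magnitude below 1.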
3.2. Anisotropic diffusion kernels. The next task is the integration of all local principal components into a global framework, with the purpose of identifying the hidden slow variable. We estimate the distance (and hence the similarity measure) between the slow variables in the underlying inaccessible manifold using the anisotropic graph Laplacian [39]. We derive a symmetrized second order approximation of the (unknown) distances in the inaccessible space H, based on the Jacobian of the unknown mapping from the inaccessible to the observable space. The Σ-dependent squared distance between two O-states is given by

d²_Σ(x^{(i)}, x^{(j)}) = (1/2) (x^{(i)} − x^{(j)})^T [(Σ^{(i)})^{−1} + (Σ^{(j)})^{−1}] (x^{(i)} − x^{(j)}), (3.3)

and represents a second order approximation of the Euclidean distance in the inaccessible (s, τf)-space, where the last approximation is due to the fact that τ is a small parameter, see (3.1). Note that it is also possible to extend (3.4) to higher dimensions, as long as there exists a separation of scales between the set of slow variables and the set of fast variables [39]. Using the approximation (3.3)-(3.4) of the distance between states of the slow variable, we next construct (an approximation of) the Laplacian on the underlying hidden slow manifold, using a Gaussian kernel as the similarity measure between the slow variable states. We build an N × N similarity matrix W with entries

W_{ij} = exp(−d²_Σ(x^{(i)}, x^{(j)})/ε), (3.5)

where the single smoothing parameter ε (the kernel scale) has a two-fold interpretation. On the one hand, ε denotes the squared radius of the neighborhood used to infer local geometric information; in particular, W_{ij} is O(1) when s^{(i)} and s^{(j)} are within a ball of radius √ε, thus close on the underlying slow manifold, but it is exponentially small for states that are more than √ε apart. On the other hand, ε represents the discrete time step at which the underlying random walker jumps from one point to another. We refer the reader to [31] for a detailed survey of random walks on graphs and their applications.
We normalize W using the diagonal matrix D, with entries D_{ii} = Σ_j W_{ij}, to define the row-stochastic matrix

L = D^{−1} W.

Since L is a row-stochastic matrix, it has eigenvalue λ_0 = 1 with the trivial eigenvector Φ_0 = (1, 1, . . . , 1)^T. The remaining eigenvalues can be ordered as

1 = λ_0 > λ_1 ≥ λ_2 ≥ · · · ≥ λ_{N−1}.

We denote by Φ_i the corresponding eigenvectors, i.e. LΦ_i = λ_iΦ_i. The top d nontrivial eigenvectors of the random-walk anisotropic Laplacian L describe the geometry of the underlying d-dimensional manifold [16], i.e. the i-th data point x^{(i)} is represented by the diffusion map

x^{(i)} → (λ_1 Φ_1(i), λ_2 Φ_2(i), . . . , λ_d Φ_d(i)),

where Φ_j(i) denotes the i-th component of the eigenvector Φ_j. However, note that some of the considered eigenvectors can be higher harmonics of the same principal direction along the manifold; thus, in practice, one computes the correlation between the computed eigenvectors before selecting the above d eigenvectors chosen to parametrize the underlying manifold. For the two chemical systems considered in this paper, we show in the remainder of this section how the top (i.e., d = 1) nontrivial eigenvector of L can be used to successfully recover the underlying slow variable.
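Combining the kernel (3.5), the row-stochastic normalization, and the eigen-decomposition gives a minimal diffusion-map sketch; the one-dimensional data set with three well-separated clusters below is hypothetical, chosen so that the top nontrivial eigenvector is visibly near-constant on each cluster:

```python
import numpy as np

def diffusion_map(W, d=1):
    """Top-d diffusion map coordinates from a similarity matrix W."""
    L = W / W.sum(axis=1, keepdims=True)     # row-stochastic L = D^{-1} W
    evals, evecs = np.linalg.eig(L)
    order = np.argsort(-evals.real)          # lambda_0 = 1 comes first
    evals = evals.real[order]
    evecs = evecs.real[:, order]
    # Skip the trivial pair (lambda_0, Phi_0 = const); embed with the next d.
    return evals[1:d + 1] * evecs[:, 1:d + 1]

# Hypothetical 1-D data: 60 points clustered around three slow states.
rng = np.random.default_rng(1)
pts = np.concatenate([rng.normal(c, 0.05, 20) for c in (0.0, 1.0, 2.0)])
d2 = (pts[:, None] - pts[None, :]) ** 2
W = np.exp(-d2 / 0.1)                        # Gaussian kernel, eps = 0.1
psi = diffusion_map(W, d=1)[:, 0]
# psi is nearly constant on each cluster of slow states.
```

The within-cluster variation of psi is orders of magnitude smaller than the spread between the three cluster levels, which is exactly the near-piecewise-constant behavior exploited in Section 4.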
Using the stochasticity of L, we can interpret it as a random walk matrix on the weighted graph G = (V, E), where the set of nodes corresponds to the original observable states (x_1, x_2)^{(i)}, i = 1, 2, . . . , N (and implicitly to the states s^{(i)} of the slow variable), and there is an edge between nodes i and j if and only if W_{ij} > 0. The associated combinatorial Laplacian is given by L̃ = D − W. Before considering the top eigenvector of L for determining the underlying slow variable and estimating its stationary distribution, we propose to use a sparse graph Laplacian, which differs from the ADM method in [39], where the Laplacian matrix is associated to a complete weighted graph. Using a complete graph requires computing the Σ-dependent squared distance in equation (3.3) for every pair of nodes, i.e., O(N²) computations. In light of the approximation (3.4), a pair of points which are far away in the observable space (i.e., for which d² is large) corresponds to a pair of states of the slow variable which are also far away in the inaccessible space. Thus we do not have to perform such computations, because points far away in the unobservable space will have an exponentially small similarity W_{ij}, close to 0. The fact that the shape of the local point cloud is an ellipse provides some insight in this direction. Thus we build a sparse graph of pairwise measurements and no longer compute the Σ-dependent distance between all points of the data set, but only between a very small subset of the pairs. The spectrum of the covariance matrix Σ^{(i)}, in particular the ratio τ of its two eigenvalues given by (3.1), guides us in building, locally at each point, a sparse ellipse-like neighborhood graph.
For each observable state (x_1, x_2)^{(i)}, we build a local adjacency graph, denoted by G_i, in the shape of an ellipse whose major axis points in the direction of the fast dynamics and whose minor axis points in the direction of the slow dynamics. Figure 3.3 shows an example of such a local 1-hop neighborhood graph G_i, where the central node (x_1, x_2)^{(i)} is connected to all points contained within the boundary of an appropriately scaled ellipse centered at (x_1, x_2)^{(i)}. Finally, we define the sparse graph G of size N × N associated to the entire network as the union of all locally defined ellipse-like neighborhood graphs, G = ∪_{i=1}^{N} G_i. Note that the union graph G is still a simple graph, with no self-edges and no multiple edges connecting the same pair of nodes. We compute the distance d_Σ by (3.3) (and thus the similarity W_{ij}) between a pair of nodes (x_1, x_2)^{(i)} and (x_1, x_2)^{(j)} if and only if the corresponding edge (i, j) exists in G.
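One direct way to realize the ellipse neighborhoods is to threshold the Mahalanobis distance induced by each local covariance. The sketch below does this with a naive full scan for clarity (in practice one restricts the scan to a local window of the grid); the shared covariance and the ellipse scaling r2 are hypothetical choices:

```python
import numpy as np

def ellipse_neighborhood_graph(points, Sigmas, r2):
    """Union G of the local ellipse graphs G_i.

    Node i is joined to node j when (x_j - x_i)^T Sigma_i^{-1} (x_j - x_i) <= r2,
    i.e. when j falls inside the ellipse aligned with the fast direction at i.
    """
    N = len(points)
    A = np.zeros((N, N), dtype=bool)
    for i in range(N):
        diff = points - points[i]
        Si_inv = np.linalg.inv(Sigmas[i])
        m2 = np.einsum('nk,kl,nl->n', diff, Si_inv, diff)
        A[i] = m2 <= r2
    np.fill_diagonal(A, False)       # simple graph: no self edges
    return A | A.T                   # union over all G_i keeps the graph undirected

# Hypothetical setup: a 10 x 10 grid with one shared anisotropic covariance,
# elongated along the (1, 1) direction (the "fast" direction).
xs, ys = np.meshgrid(np.arange(10.0), np.arange(10.0))
points = np.column_stack([xs.ravel(), ys.ravel()])
Sigma = np.array([[1.0, 0.9],
                  [0.9, 1.0]])       # eigenvalues 1.9 and 0.1
Sigmas = np.repeat(Sigma[None, :, :], len(points), axis=0)
A = ellipse_neighborhood_graph(points, Sigmas, r2=4.0)
```

With this covariance, each interior grid point is linked only to its diagonal neighbors along the fast direction, so the resulting graph is very sparse compared to the complete graph on N² pairs.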
We plot in Figures 3.1(c) and 3.2(c) the histogram of the weighted degrees of the nodes in the weighted graph with weights W defined in (3.5), while Figures 3.1(d) and 3.2(d) show a scatterplot of the states of the system, where each state i is colored by its weighted degree, i.e., the sum of all its outgoing edge weights W_{ij}, j = 1, 2, . . . , N. Throughout the computational examples in this paper, the smoothing parameter ε appearing in (3.5) was set to ε = 0.1. In contrast to the approach in [39], which computes all O(N²) pairwise similarities, the sparsity of G (and thus of the associated graph Laplacian L) means that our approach only requires the computation of a much smaller number of distances, as low as linear in N, depending on the discretization of the domain, which makes it computationally feasible to solve problems with thousands or even tens of thousands of nodes.

4. A robust mapping from the observable space O to the "dynamically meaningful" inaccessible space H. As a first step towards partitioning the nodes of the original graph G and detecting the associated slow variable, we sort the entries of the top eigenvector Φ_1, which we then denote by Φ̃_1, with Φ̃_1(1) ≥ Φ̃_1(2) ≥ . . . ≥ Φ̃_1(N). This sorting process defines a permutation σ of the original index set i = 1, 2, . . . , N, so that Φ̃_1(i) = Φ_1(σ(i)). We consider the increments between two consecutive (sorted) values,

δ_i = Φ̃_1(i) − Φ̃_1(i + 1), i = 1, 2, . . . , N − 1. (4.1)

[Figure 3.3: the local neighborhood graph G_i at a given node (x_1, x_2)^{(i)}; the shape is an ellipse whose axis ratio is given by the ratio τ of the eigenvalues of the local covariance matrix, see (3.1); the corresponding eigenvectors are used to calculate the orientation of the ellipse.]

Next, we sort the vector of such increments, denote its entries by δ̃_1 ≥ δ̃_2 ≥ . . . ≥ δ̃_{N−1}, and show in Figure 4.1(a) (resp. Figure 4.2(a)) the top 300 (resp. top 420) largest such increments δ̃_i for illustrative example CS-I (resp. CS-II). Note that this already gives us an idea about the number of distinct slow states in the system, a set which we denote by S. Ideally, the difference Φ_1(i) − Φ_1(j) in the entries of the top eigenvector corresponding to two observable states (x_1, x_2)^{(i)} and (x_1, x_2)^{(j)} that belong to the same slow state s (i.e., x_1 + 2x_2 = s for illustrative example CS-II) should be zero or close to zero, in which case we expect that only approximately |S| of the N − 1 increments δ_i are significantly larger than zero, while the remaining majority are zero or close to zero.
In Figures 4.1(b) and 4.2(b) we highlight the correlation between the entries of the top nontrivial eigenvector Φ_1 and the corresponding slow variable S. In Figures 4.1(c) and 4.2(c), we zoom in on a subset of states to make the point that the eigenvector Φ_1 is almost constant on the O-states that correspond to the same value of the slow variable. The plots in Figures 3.1(b) and 3.2(b) show a coloring of the networks generated by the two chemical systems CS-I and CS-II, based on the first nontrivial eigenvector of the associated sparse Laplacian L. Note that the eigenvector looks almost piecewise constant along the lines that trace the evolution of the fast variable for a given value of the slow variable (S = (X_1 + X_2)/2 for CS-I and S = X_1 + 2X_2 for CS-II), yet nowhere along the way have we input this information into the method. In the next step we use this top eigenvector to identify all nodes of the graph (original states of the chemical system) that correspond to the same value of the underlying slow variable: all nodes whose corresponding eigenvector entries fall within an appropriately chosen interval (which we shall refer to as a bin) will be labeled as belonging to the same value of the slow variable S. In other words, we seek a partition of the observable states in O, i.e., of the nodes of G, such that all original states (x_1, x_2)^{(i)} with the same value of the corresponding slow variable s((x_1, x_2)^{(i)}) end up in the same bin. Our goal is to find a partition P = {P_1, P_2, . . . , P_k} of O such that

s((x_1, x_2)^{(i)}) = q_j, for all (x_1, x_2)^{(i)} ∈ P_j, (4.2)

where k denotes the number of distinct values q_j, j = 1, 2, . . . , k, of the slow variable S. As an example, in the case of CS-I given by (2.2), the partition set P_j = {(1, 99), (2, 98), . . . , (99, 1)} collects all nodes in the graph for which the value of the associated slow variable is the constant q_j = 50.
The key observation we exploit here is that the top eigenvector of the Laplacian matrix is almost piecewise constant on the bins that partition O, since the nodes of G that correspond to the same value of the slow variable have a very high pairwise similarity, with W_{ij} very close to 1. One may also interpret the above problem as a clustering problem, where the similarity between pairs of points is given by (3.5), and is such that nodes that belong to the same bin have a much higher similarity than nodes that belong to two different bins, an effect due to the strong separation of scales. In the case of illustrative example CS-I, the clusters correspond to lines in the two-dimensional plane such that (x_1 + x_2)/2 = c, for a constant c. We point the interested reader to the work [32], where the top eigenvectors of the random walk Laplacian are used for clustering. While in practice one uses the top several eigenvectors as the reduced eigenspace in which clustering is subsequently performed, in our case the top eigenvector alone suffices to capture the many different clusters (i.e., bins), a fact we attribute to the strong separation of scales exhibited by the illustrative chemical systems CS-I and CS-II. If several eigenvectors were considered, then one could use a clustering algorithm, such as k-means or spectral clustering [38,45], to obtain the partitioning (4.2). However, a simpler method has been successfully used for the 1-dimensional eigenspace in both examples we considered, described as follows. Recall the sorted vector of increments δ̃_1 ≥ δ̃_2 ≥ . . . ≥ δ̃_{N−1} defined in equation (4.1), and consider the set of the k − 1 largest such increments, {δ̃_1, δ̃_2, . . . , δ̃_{k−1}}. Next, from the sorted eigenvector Φ̃_1, we extract the positions of the entries whose associated increment (with respect to the right-next neighboring index) belongs to {δ̃_1, δ̃_2, . . . , δ̃_{k−1}}.
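For illustration, the gap-based binning just described can be sketched as follows, using a synthetic near-piecewise-constant eigenvector in place of Φ_1:

```python
import numpy as np

def bin_by_top_eigenvector(phi1, k):
    """Partition states into k bins by the largest gaps in the sorted phi1.

    Sort phi1 in decreasing order, form the increments delta_i between
    consecutive sorted entries (4.1), and cut at the k - 1 largest ones.
    """
    order = np.argsort(-phi1)                      # permutation sigma
    phi_sorted = phi1[order]
    delta = phi_sorted[:-1] - phi_sorted[1:]       # increments, all >= 0
    cuts = set(np.argsort(-delta)[:k - 1].tolist())  # k-1 largest gap positions
    bins = np.zeros(len(phi1), dtype=int)
    label = 0
    for pos, idx in enumerate(order):
        bins[idx] = label
        if pos in cuts:            # a large gap follows this sorted entry
            label += 1
    return bins

# Hypothetical eigenvector: nearly constant on each of 3 slow states.
rng = np.random.default_rng(2)
phi1 = np.concatenate([v + rng.normal(0, 1e-3, 10) for v in (0.9, 0.0, -0.9)])
bins = bin_by_top_eigenvector(phi1, k=3)
```

Because the within-state noise (1e-3) is tiny compared to the gaps between the three eigenvector levels, the two largest increments fall exactly at the state boundaries and the recovered bins coincide with the ground truth grouping.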
In other words, we compute the positions of the k − 1 cut points that split the sorted eigenvector into k bins. To illustrate the correctness of our proposed technique, we compute the Jaccard index between each proposed partition set P̂ j , j = 1, 2, . . . , k and each ground truth partition set P i , i = 1, 2, . . . , |S|:

J ij = |P i ∩ P̂ j | / |P i ∪ P̂ j |.   (4.4)

Note that the recovered partition is well defined only up to a global sign, since −Φ 1 is also an eigenvector of L. Finally, we compute the maximum weight matching (using, for example, the Hungarian method [29]) in the bipartite graph with node set P ∪ P̂ and edges across the two sets weighted by the matrix J in (4.4). In Figure 4.3(d) we plot the Jaccard index of the matched partitions. For the first chemical system CS-I, note that the algorithm perfectly recovers the ground truth partition. In Figure 4.4 we present the outcome of the binning algorithm for the illustrative example CS-II, which is no longer satisfactory by itself and requires further improvement. Though the bin cardinalities in the initial solution visually resemble the ground truth, numerous mistakes are being made. To illustrate this, for a given partition P, we compute the following measure of continuity of the recovered bin cardinalities

Θ P = Σ_{j=1}^{k−1} (|P j | − |P j+1 |)^2.   (4.5)

In other words, Θ P captures the squared differences in the cardinalities of consecutive bins. For the chemical system CS-II, the ground truth yields a score Θ = 108, while the eigenvector-recovered solution yields Θ = 7206, indicating already that numerous misclassifications are being made, without even computing the Jaccard similarity matrix (4.4) between the two partitions. To address this, we introduce in the next subsection a heuristic denoising technique followed by a truncation of the domain, which altogether lead to a better partitioning of O into groups of states that correspond to the same slow variable.
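The validation step above can be sketched as follows. For compactness we replace the Hungarian method [29] by a brute-force maximum-weight matching over permutations, which is equivalent for the tiny partitions of this toy example; all names and the toy partitions are illustrative.

```python
# Sketch of the validation step: the Jaccard matrix J between the ground
# truth partition P and a recovered partition P_hat, followed by a
# brute-force maximum-weight matching (a stand-in for the Hungarian method).
from itertools import permutations

def jaccard(a, b):
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

def match_partitions(P, P_hat):
    """Return the Jaccard score of each ground-truth set under the
    maximum-weight matching between P and P_hat (assumes equal sizes)."""
    J = [[jaccard(p, q) for q in P_hat] for p in P]
    best = max(permutations(range(len(P_hat))),
               key=lambda perm: sum(J[i][perm[i]] for i in range(len(P))))
    return [J[i][best[i]] for i in range(len(P))]

P = [{1, 2, 3}, {4, 5}, {6}]          # toy ground-truth bins
P_hat = [{4, 5}, {6}, {1, 2}]         # recovered bins, permuted and noisy
scores = match_partitions(P, P_hat)
```

A perfect recovery yields matched scores of 1 for every bin; here the noisy third bin scores 2/3.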

A bin denoising scheme.
While the eigenvector-based partition procedure detailed above yields accurate results for the CS-I in (2.2), this procedure alone is not sufficient for obtaining a satisfactory partition for the more complex CS-II considered in (2.7), as illustrated by the high corresponding Θ-score (Θ = 7206) shown in Figure 4.4(b). In Figure 4.4(c) we zoom into some of the recovered bins, showing that the eigenvector-based reconstruction splits some of the inner bins, which explains the high associated Θ-score. In other words, states which in the ground truth solution correspond to the same value of the slow variable are divided into two adjacent bins and mistaken for two distinct states of the slow variable.
To solve this issue, we propose a bin-denoising heuristic that robustly assigns data points to their respective bins. The intuition is that the continuity of the eigenfunctions of the Laplacian should be reflected in the continuity of the histogram of state counts in bins corresponding to adjacent intervals. We detail in Algorithm 1 an iterative heuristic procedure which, at each step, merges the pair of adjacent bins for which the resulting Θ-score is minimized across all possible pairs of adjacent bins that can be merged. We show in Figure 4.5(a) the resulting bin cardinalities after the bin-merging heuristic and after truncating at the boundary of the slow variable. Note that the new denoised partition yields Θ = 103, and the number of bins (states of the slow variable) decreases from |S| = 328 to |S| = 314. Furthermore, in Figure 4.5(b) we compute the Jaccard similarity matrix between the ground truth and the newly obtained partition, showing in Figure 4.5(d) that we almost perfectly recover the structure of the ground truth bins.
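The greedy merging step can be sketched as below. This is our reading of Algorithm 1, operating on bin cardinalities only; the stopping rule (stop once no merge lowers Θ) is an assumption on our part.

```python
# Sketch of the bin-denoising heuristic: repeatedly merge the pair of
# adjacent bins whose merge yields the smallest Theta-score, stopping once
# no merge decreases the score (our assumed stopping rule).

def theta(cards):
    """Theta-score of a list of bin cardinalities, as in (4.5)."""
    return sum((cards[j] - cards[j + 1]) ** 2 for j in range(len(cards) - 1))

def denoise(cards):
    cards = list(cards)
    while len(cards) > 1:
        # adjacent pair whose merge minimizes the resulting Theta-score
        best = min(range(len(cards) - 1),
                   key=lambda j: theta(cards[:j] + [cards[j] + cards[j + 1]]
                                       + cards[j + 2:]))
        merged = cards[:best] + [cards[best] + cards[best + 1]] + cards[best + 2:]
        if theta(merged) >= theta(cards):
            break           # no merge improves continuity; stop
        cards = merged
    return cards
```

For example, a ground-truth bin of size 4 erroneously split into two bins of size 2 (cardinalities `[4, 2, 2, 4]`, Θ = 8) is repaired to `[4, 4, 4]` with Θ = 0 in one merge.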

5. A Markov approach for computing the stationary distribution of the slow variable. In this section, we focus on the final step of the ADM-CLE approach: estimating the stationary distribution of the slow variable, without any prior knowledge of what the slow variable actually is. One of the ingredients needed along the way is an estimate of the conditional distribution P(F |S = s) of the fast variable F given a value s of the slow variable S, which we compute via two approaches. The first approach uses the Conditional Stochastic Simulation Algorithm (CSSA) [8], given in Algorithm 2, which samples from the distribution of the fast variable conditioned on the slow variable. The second approach is entirely analytic and free of any stochastic simulations, and amounts to analytically solving the CME for each set in the partition P = {P 1 , P 2 , . . . , P k }. We then compare our results to the Constrained Multiscale Algorithm (CMA) introduced in [8], which approximates the effective dynamics of the slow variable as an SDE, after estimating the effective drift and diffusion using the CSSA (Algorithm 2).

5.1. A stochastic simulation algorithm for estimating the conditional probability (CSSA). Our next task is to estimate the conditional distribution P(F |S = s) of the fast variable F given a value s of the slow variable S. One possible approach relies on the CSSA algorithm, which is also used to globally integrate the effective dynamics of the slow variable. One iteration of the CSSA is given in Algorithm 2. Ideally, one repeats steps 1–6 of Algorithm 2 and samples values of F until the distribution P(F |S = s) converges. In practice, we run Algorithm 2 until L c changes of the slow variable S occur. This computation is carried out for each value in the range of the slow variable S = {s 1 , s 2 , . . . , s |S| }.
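A stripped-down sketch in the spirit of this conditional sampling is given below. For CS-II we simulate only the S-preserving fast channels R 5 : (x 1 , x 2 ) → (x 1 − 2, x 2 + 1) and R 6 : (x 1 , x 2 ) → (x 1 + 2, x 2 − 1), so S = x 1 + 2x 2 stays fixed by construction; the full CSSA (Algorithm 2 of [8]) additionally handles the slow channels. The rate constants `k5`, `k6` are hypothetical placeholders, and we sample the embedded jump chain only (a faithful SSA would also weight samples by the exponential holding times).

```python
# Illustrative conditional sampling for CS-II: only the fast channels fire,
# so the slow variable S = x1 + 2*x2 is conserved along the trajectory.
import random

def sample_fast_conditional(x1, x2, k5=1.0, k6=1.0, n_steps=2000, seed=0):
    rng = random.Random(seed)
    s = x1 + 2 * x2
    samples = []
    for _ in range(n_steps):
        a5 = k5 * x1 * (x1 - 1)   # propensity of R5
        a6 = k6 * x2              # propensity of R6
        a0 = a5 + a6
        if a0 == 0:
            break
        if rng.random() * a0 < a5:
            x1, x2 = x1 - 2, x2 + 1   # fire R5
        else:
            x1, x2 = x1 + 2, x2 - 1   # fire R6
        assert x1 + 2 * x2 == s       # slow variable is conserved
        samples.append(x1)
    return samples

samples = sample_fast_conditional(4, 3)   # condition on S = 10
```

The empirical histogram of `samples` then approximates P(F = x 1 | S = 10) up to the holding-time correction.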

5.2. An analytical derivation of the conditional distribution. An alternative approach, which we follow in this paper, relies on an analytical computation of the conditional distribution P(F |S = s), thus eliminating the need for any expensive stochastic simulations. We illustrate in Figure 5.1 the transition diagrams for the two chemical systems we consider in this paper. For chemical system CS-II, the system can transition from a given state (x 1 , x 2 ) to four adjacent distinct O-states: to (x 1 − 2, x 2 + 1) via channel R 5 , to (x 1 + 2, x 2 − 1) via channel R 6 , to (x 1 − 1, x 2 ) via channels R 2 and R 4 , and finally to (x 1 + 1, x 2 ) via channels R 1 and R 3 . However, in terms of the underlying slow variable S, the system can transition only to the two adjacent states S = s − 1 (via channels R 2 and R 4 ) and S = s + 1 (via channels R 1 and R 3 ), or remain at the current state S = s via channels R 5 and R 6 . Considering the subsystem of fast reactions of CS-II and conditioning on the line s = x 1 + 2x 2 , the stationary CME takes the form

0 = k 5 (x 1 + 2)(x 1 + 1) P(X 1 = x 1 + 2, X 2 = x 2 − 1) + k 6 (x 2 + 1) P(X 1 = x 1 − 2, X 2 = x 2 + 1) − [k 5 x 1 (x 1 − 1) + k 6 x 2 ] P(X 1 = x 1 , X 2 = x 2 ).

Solving the corresponding detailed-balance recursion, the conditional distribution for CS-II is, for 0 ≤ x 1 ≤ s with (s − x 1 ) an even number, given by

P(F = x 1 |S = s) = C (k 5 /k 6 )^{(s−x 1 )/2} s! / ( x 1 ! ((s − x 1 )/2)! ).

Here, C is the normalization constant and P(F = x 1 |S = s) = 0 if (s − x 1 ) is an odd number.

Table 5.1. Illustrative example CS-II. The set of all ground states (x 1 , x 2 ) of the system corresponding to the slow variable S = X 1 + 2X 2 = 7. We denote by (x 1 ′, x 2 ′) the states reachable from (x 1 , x 2 ) in one transition step, and by S ′ the associated slow variable S ′ = X 1 ′ + 2X 2 ′. R j denotes the reaction channel that takes the chemical system from state (x 1 , x 2 ) to (x 1 ′, x 2 ′), with corresponding propensity α j . We highlight in bold the subset of states via which the system can transition in one step from the slow variable S = 7 to S = 8.
A similar argument can be used for CS-I. Writing down the stationary CME of the fast subsystem of CS-I, conditioned on the line s = (x 1 + x 2 )/2, and solving the resulting recursion yields the conditional distribution for CS-I for 0 ≤ x 1 ≤ 2s, where C is again the normalization constant.
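The analytic route above can be cross-checked numerically: restricted to the line x 1 + 2x 2 = s, the fast subsystem of CS-II is a birth-death chain in x 1 (steps of 2), so its stationary law follows from detailed balance. The sketch below builds the distribution directly from the propensities of R 5 and R 6 ; the rate constants are hypothetical placeholders.

```python
# Numerical cross-check of the conditional distribution of CS-II:
# detailed balance on the birth-death chain x1 -> x1 +/- 2 on the line
# x1 + 2*x2 = s, with R6 moving up and R5 moving down.

def conditional_dist(s, k5=1.0, k6=1.0):
    """Return {x1: P(F = x1 | S = s)} for states with (s - x1) even."""
    states = [x1 for x1 in range(s + 1) if (s - x1) % 2 == 0]
    p = {states[0]: 1.0}
    for x1 in states[:-1]:
        up = k6 * (s - x1) / 2            # R6 propensity: x1 -> x1 + 2
        down = k5 * (x1 + 2) * (x1 + 1)   # R5 propensity: x1 + 2 -> x1
        p[x1 + 2] = p[x1] * up / down     # detailed balance
    z = sum(p.values())                   # normalization constant C^{-1}
    return {x1: v / z for x1, v in p.items()}

p = conditional_dist(6)   # conditional law of F = X1 given S = 6
```

Only the parity-compatible states 0, 2, 4, 6 receive positive mass, matching the statement that P(F = x 1 |S = s) = 0 when (s − x 1 ) is odd.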

5.3. Aggregated transition rates and a Markov chain on the state of slow variables. In the final step of the ADM-CLE approach, we set up a Markov chain on the state space of slow variables with the end goal of estimating the stationary distribution of the slow variable. As illustrated in Figure 5.1(b), the system CS-II can transition from a given state S = s to two adjacent states S = s − 1 (via reaction channels R 2 and R 4 ) and S = s + 1 (via channels R 1 and R 3 ), or it can remain at the current state S = s, via channels R 5 and R 6 . Consider now the set P s = {x (i) = (x 1 , x 2 ) (i) | x 1 + 2x 2 = s}, illustrated as the middle bin in Figure 5.2. To compute the transition rate between two adjacent bins P s and P s+1 , one has to aggregate over all possible ways of getting from an observable state in bin P s to an observable state in bin P s+1 . We compute Θ 1 (s) , the aggregated transition rate from state P s to state P s+1 over all pairs of states (x (i) , x (j) ) with x (i) ∈ P s and x (j) ∈ P s+1 , by

Θ 1 (s) = Σ_{x (i) ∈ P s } Σ_{x (j) ∈ P s+1 } Σ_k P(x (i) | S = s) α k (x (i) ) Q(x (i) , x (j) , R k ),

where Q(x (i) , x (j) , R k ) denotes the indicator function of whether one can transition from the O-state x (i) to the O-state x (j) via reaction R k . We define similarly the aggregated transition rate Θ 2 (s) with which the chemical system transitions from the slow variable state P s to P s−1 . To evaluate the conditional probabilities P(x (i) | S = s) entering these rates, one can either rely on the CSSA algorithm to sample from the conditional distribution of fast variables given values of the slow variable, as shown in Section 5.1, or use the analytic formulation which can be derived for both CS-I and CS-II, see Section 5.2.
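The aggregation above amounts to a conditional expectation of propensities, which can be sketched generically as follows. Here `up_prop` and `down_prop` are illustrative stand-ins for the summed propensities of the channels that raise (R 1 , R 3 ) or lower (R 2 , R 4 ) the slow variable; the toy conditional distribution is not taken from the paper.

```python
# Sketch of the rate aggregation: given conditional weights P(x | S = s)
# over the states in bin P_s, the aggregated rates Theta_1 and Theta_2 are
# conditional expectations of the up- and down-moving propensities.

def aggregated_rates(cond, up_prop, down_prop):
    """cond: {state: P(state | S = s)}; returns (Theta_1, Theta_2)."""
    theta1 = sum(p * up_prop(x) for x, p in cond.items())
    theta2 = sum(p * down_prop(x) for x, p in cond.items())
    return theta1, theta2

cond = {0: 0.5, 2: 0.5}   # toy conditional distribution on a bin
t1, t2 = aggregated_rates(cond, lambda x: 1.0 + x, lambda x: 2.0)
```

The indicator Q is implicit here: each propensity function only counts the channels that actually move the system into the adjacent bin.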
Finally, we compute the solution to the stationary CME associated to the system in Figure 5.2, which can be written as

0 = Θ 1 (s−1) π(s − 1) + Θ 2 (s+1) π(s + 1) − (Θ 1 (s) + Θ 2 (s) ) π(s),

where π(s) ≈ P(S = s) denotes the stationary probability that S = s. Assuming that π(s) = 0 for all s ∉ S and using no-flux boundary conditions, we arrive at a linear system. The eigenvector of the resulting matrix, with associated eigenvalue λ = 0, yields an approximate solution of the stationary CME, which we plot in blue in Figure 5.4. Our result is visually indistinguishable from the exact solution (plotted as the red solid line).
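Because the slow-variable chain is a birth-death chain with no-flux boundaries, its stationary law can also be obtained directly from the detailed-balance recursion π(s + 1) = π(s) Θ 1 (s) / Θ 2 (s+1), rather than from the λ = 0 eigenvector. A sketch with toy rates (not the paper's Θ values):

```python
# Stationary law of a birth-death chain on the slow variable, via the
# detailed-balance recursion; theta1(s) moves up, theta2(s) moves down.

def stationary_pi(theta1, theta2, s_values):
    pi = {s_values[0]: 1.0}
    for s in s_values[:-1]:
        pi[s + 1] = pi[s] * theta1(s) / theta2(s + 1)
    z = sum(pi.values())            # normalize
    return {s: v / z for s, v in pi.items()}

# toy rates: constant birth, linear death -> truncated Poisson(3) profile
pi = stationary_pi(lambda s: 3.0, lambda s: float(s), range(0, 10))
```

With these toy rates the result is a truncated Poisson(3) distribution, so the recursion can be checked against the known formula 3^s/s!.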

5.4. A comparison with the Constrained Multiscale Algorithm (CMA). We compare the approach introduced in the previous Section 5.3 with the CMA method proposed in [8]. We compare the results of the two methods with the ground truth, and record the error defined as

Error(π, P) = Σ_{s∈S} |π(s) − P(S = s)|,   (5.4)

where P(S = s) denotes the ground truth probability distribution of the slow variable, and π denotes the estimated solution, obtained by either the CMA or the ADM-CLE. As Table 5.2 shows, the ADM-CLE approach yields lower errors compared to the CMA algorithm, even when we run the latter with the parameter L c as large as 20,000. Note that for the chemical system CS-I, the ground truth probability distribution of the slow variable P(S = s) can be easily computed using the multivariate Poisson distribution, as discussed in Section 2.1. For the second chemical system CS-II, we take as ground truth the solution obtained by solving the associated CME of the full model in a large (truncated) domain.
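The comparison metric, as we read (5.4), is an ℓ1 distance between the two distributions over the slow states; a minimal sketch with toy distributions:

```python
# The l1 error (5.4) between an estimated distribution pi and the ground
# truth p_true, both given as {slow state: probability} dictionaries.

def l1_error(pi, p_true):
    return sum(abs(pi.get(s, 0.0) - p) for s, p in p_true.items())

err = l1_error({0: 0.5, 1: 0.5}, {0: 0.6, 1: 0.4})
```

States missing from the estimate contribute their full ground-truth mass to the error, which matches summing over all s ∈ S.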
Table 5.2. The distance (as measured by the error in (5.4)) between the estimated and the ground truth probability distributions of the slow variable, for the CMA algorithm which runs the CSSA algorithm for each value of the slow variable S = s until L c changes of the slow variable occur (columns L c = 100, 500, 2,000, 5,000, 10,000, 20,000). The rightmost column shows the recovery errors for our proposed Markov-based approach (ADM-CLE).

In Table 5.2 we show numerical results that highlight the accuracy improvement of the ADM-CLE approach compared to the CMA approach of [8]. For the latter method, we run the CSSA algorithm [8] for each value of the slow variable, until L c changes of the slow variable occur. As expected, the accuracy of the CMA algorithm improves as L c increases, at the cost of additional computational running time. In comparison, our stochastic-simulation-free approach yields significantly more accurate results, with errors at least one order of magnitude lower than the CMA method with L c = 20,000.

6. Summary and discussion. In this paper we have introduced an ADM-CLE approach for detecting intrinsic slow variables in high-dimensional dynamic data generated by stochastic dynamical systems. In the original ADM framework, the local bursts of simulations initiated at each data point to estimate the local covariances are computationally expensive, a shortcoming we avoid by using an approximation based on the CLE. A second innovation that further improves the computational performance relates to the underlying similarity graph, the starting point for the diffusion map approach.
By exploiting the spectrum of each local covariance matrix, we build at each point in the data set a sparse ellipsoid-like neighborhood; the resulting sparse similarity graph requires the computation of a much smaller number of pairwise distances, which makes the ADM-CLE approach scalable to networks with thousands or even tens of thousands of nodes. For the two illustrative examples considered in this paper, the sizes of the resulting graphs are N = 10,201 for CS-I and N = 12,100 for CS-II, respectively. Had these been complete graphs, the number of weighted edges would exceed 50 million, while in our computations the number of edges is approximately 2.9 million for CS-I and 3.9 million for CS-II.
We have proposed a spectral-based method for inferring the slow variable present within the chemical system without any prior knowledge of its structure, and a Markov-based approach for estimating its stationary distribution. We augment the proposed algorithmic approach with numerical simulations confirming that, for some systems, the ADM-CLE approach compares favorably with the CMA for estimating the stationary distribution of such slow variables. The ADM-CLE approach can also be applied to systems where the slow variable takes only a small number of values. The CMA, as introduced in [8], is more suitable for systems where the slow variable(s) can take many different values, because the CMA uses an underlying SDE approximation for the behaviour of the slow variables. One option to overcome this limitation is to estimate effective propensity functions of the slow subsystem [9]. An open question is how to extend the ADM-CLE to systems where the range of the X i variables is very large. In the ADM-CLE approach applied to CS-I or CS-II, we associate a state (i.e., a node in the initial graph) to each possible pair of states (x 1 , x 2 ), an approach no longer feasible whenever the range of the variables is large. To bypass this problem, one could change the discretization of the state space and modify accordingly the Markov chain based approach (Figure 5.2) used in the ADM-CLE.
The ADM-CLE couples a method for finding slow variables (ADM) with an approach to compute the stationary distribution of a multiscale chemical reaction network. Chemical systems depend on a number of parameters (e.g. kinetic rate constants) and an open question is to extend the ADM approach to situations where one (or more) parameters are varied, i.e. to perform bifurcation analysis of multiscale stochastic chemical systems [30].
Finally, we point out recent work of Dsilva et al. [11], who also rely on the ADM framework to discover nonlinear intrinsic variables in high-dimensional data in the context of multiscale simulations, with the task of merging different simulation ensembles, or partial observations of such ensembles, in a globally consistent manner. Their work is motivated by the fact that often one is not merely interested in extracting the hidden (slow) variables from the underlying low-dimensional manifold, given partial observations x (i) , i = 1, 2, . . . , N as in the ADM setting, but also in extending high-dimensional functions defined on a set of points lying in a low-dimensional manifold. Their proposed approach relies on the so-called Laplacian Pyramids [35], a multiscale algorithm for extending a high-dimensional function defined on a set of points in the space of intrinsic variables to a second set of points not in the data set, by using Laplacian kernels of decreasing bandwidths (the ε parameter in (3.5)).