Automated adjoints of coupled PDE-ODE systems

Mathematical models that couple partial differential equations (PDEs) and spatially distributed ordinary differential equations (ODEs) arise in biology, medicine, chemistry and many other fields. In this paper we discuss an extension to the FEniCS finite element software for expressing and efficiently solving such coupled systems. Given an ODE described using an augmentation of the Unified Form Language (UFL) and a discretisation described by an arbitrary Butcher tableau, efficient code is automatically generated for the parallel solution of the ODE. The high-level description of the solution algorithm also facilitates the automatic derivation of the adjoint and tangent linearization of coupled PDE-ODE solvers. We demonstrate the capabilities of the approach on examples from cardiac electrophysiology and mitochondrial swelling.


Introduction.
In this paper we discuss solvers for systems involving spatially dependent ordinary differential equations (ODEs) of the following form: given an initial condition y 0 (x), find y = y(x, t) such that y t (x, t) = f (y, x, t), y(x, t 0 ) = y 0 (x), (1) for all x in a point set X ⊂ Ω ⊆ R d , where d is the spatial dimension.The subscript t refers to differentiation in time.The right-hand side f cannot depend on spatial derivatives of y; the ODE is decoupled at different points.Problems of this form often arise when discretizing time-dependent mathematical models that couple PDEs with spatially distributed systems of ODEs via operator splitting.Examples of application areas include cardiac electrophysiology in general [20,27] and cardiac ion channel modeling in particular [11], mitochondrial swelling [8], groundwater flow and contamination [30], pulmonary gas transport [6,31], and plasma-enhanced chemical vapor deposition [12].We also discuss the automated derivation of the adjoint and tangent linearization of such models: these can be used to identify the sensitivity of the solution to model parameters, solve inverse problems for unknown parameters, and characterize the stability of trajectories.
For concreteness, we present two biological examples of coupled PDE-ODE systems where problems of the form (1) arise.We return to numerical results for these examples in Section 6.
Example 1 (The bidomain equations).As our first motivating example, we will consider the bidomain equations for the propagation of an electrical signal in a nondeforming domain Ω [27]: find the transmembrane potential v = v(x, t), the extracellular potential u e (x, t) and additional state variables s = s(x, t) such that for t ∈ (0, T ]: in Ω, (2b) In (2), I ion is a given nonlinear function describing ionic currents and F defines a system of nonlinear functions, while M i and M e are the intracellular and extracellular conductivity tensors, respectively, and I s is a given stimulus current.The function F cannot depend on spatial derivatives of v and s; it defines a pointwise system of ODEs.The specific form of I ion and F are typically prescribed by a given cardiac cell model and may vary greatly in complexity: from involving a single state variable s such as the FitzHugh-Nagumo model [11] to models with e.g.41 state variables such as the model of O'Hara et al. [21].The system (2) is closed with appropriate initial and boundary conditions.After the application of operator splitting, the ODE step is decoupled from the PDE step and is a system of the form of (1).
Example 2 (Mitochondrial swelling).As a second example, we consider a model proposed in [8] to describe the swelling of mitochondria.Mitochondrial swelling plays a key role in the process of programmed cell death ( apoptosis) and thus for the life cycle of cells.Mathematically, we consider the following model [8, p. 26]: find the calcium concentration u = u(x, t) and the densities of mitochondria states N i = N i (x, t) for i = 1, 2, 3 such that in Ω, (3a) where d 1 ≥ is a diffusion coefficient, d 2 ≥ 0 is a feedback parameter, q ≥ 2 ∈ N determines the nonlinearity of the diffusion-type operator, and g and f are prescribed functions.The functions N 1 , N 2 and N 3 describe the densities of unswollen, swelling and completely swollen mitochondria respectively.Equation (3a) is a spatially-coupled PDE, while equations (3b)-(3d) define pointwise ODEs.The system is closed with initial conditions and Dirichlet boundary conditions for u.Again, after operator splitting, a subproblem of the form (1) results.
Over the last decade, there has been a growing interest in computational frameworks for the rapid development of numerical solvers for PDEs such as the FEniCS Project [16], the Firedrake Project [23], and Feel++ [22].The rapid development of solvers is achieved by offering high-level abstractions for the expression of such problems.Each of these projects provides a domain-specific language for specifying finite element variational formulations of PDEs and associated software for their efficient solution.The methods presented in this paper are implemented in the the FEniCS Project and dolfin-adjoint [10]; dolfin-adjoint automatically derives discrete adjoint and tangent linear models from a FEniCS forward model.These allow for the efficient computation of functional gradients and Hessian-vector products, and are essential ingredients in stability analyses, parameter identification and inverse problems.
However, these systems do not currently efficiently extend to the kind of coupled PDE-ODE systems arising in computational biology.While monolithic finite element discretizations of coupled PDE-ODE systems such as (2) or (3) may easily be specified and solved using the current software features in FEniCS, this approach will involve an unreasonably large computational expense, as the monolithic system is highly nonlinear and the solver cannot exploit the fact that the ODE is spatially decoupled.
Operator splitting is the method of choice for such coupled PDE-ODE systems [27], and is implemented as standard in many hand-written codes, see e.g.[19,29] in the context of cardiac electrophysiology.Operator splitting decouples a PDE-ODE system into a spatially-coupled system of PDEs and a spatially-decoupled collection of ODE systems.This collection of ODE systems then typically takes the form (1), which can be solved using well-known temporal discretization methods.Heretofore it has not been possible to specify spatially-decoupled ODE systems in high-level PDE frameworks such as the FEniCS or Firedrake projects.
This work addresses the gap in available abstractions, algorithms and software for arbitrary PDE-ODE systems.We introduce high-level domain-specific language constructs for specifying collections of ODE systems and for specifying multistage ODE schemes via Butcher tableaux.This has several major benefits.By automatically generating the solver from a high-level description of the problem, practitioners can flexibly explore a range of models; this is especially important in biological problems where the model itself is uncertain.Another advantage is that a high-level description facilitates the automated derivation of the associated tangent linear and adjoint models.This represents a significant saving in the time taken to investigate questions of biological interest.
The main new contributions of this paper are (i) the extension of the FEniCS finite element system to enable efficient the large-scale forward solution of coupled, time-dependent PDE-ODE systems via operator splitting, and (ii) the extension of dolfin-adjoint to automatically derive and solve the associated tangent linear and adjoint models, enabling efficient automated computation of functional gradients for use in e.g.optimization or adjoint-based sensitivity analysis.
This paper is organized as follows.In Section 2, we describe the general operator splitting setting, the resulting separate PDE and ODE systems, and natural discretizations of these.We continue in Section 3 by briefly describing the well-established ODE schemes that we consider in this work: the multistage and Rush-Larsen families.We discuss the adjoint and tangent linear discretizations of a general operator splitting scheme and derive the adjoint and tangent linear models for the multistage and Rush-Larsen schemes in Section 4. Key features of our new implementation are described in Section 5. We present numerical results for two different application examples in Section 6, and use these to evaluate the performance of our implementation.In Section 7, we provide some concluding remarks and discuss current limitations and possible future extensions.

Operator splitting.
A classical approach to the solution of the bidomain system (2) [27] and similar systems is to apply operator splitting.This approximately solves the full system of equations by alternating between the solution of a system of PDEs and a system of ODEs defined over each time interval.The advantage of this approach is that it decouples the solution of the typically highly nonlinear ODEs from the spatially coupled PDEs at each time step.We will consider this general setting, but use (2) as a concrete example.
Applying a variable order operator split to (2), the resulting scheme reads: given initial conditions v 0 , s 0 , an order parameter θ ∈ [0, 1], and time points {t 0 , . . ., t N } with an associated timestep κ n = t n+1 −t n , then for each time step n = 0, 1, . . ., N −1: 1. Compute v * and s * by solving the ODE system over Ω × [t n , t n + θκ n ] with initial conditions v n , s n .2. Compute v † and u n+1 e by solving the PDE system with initial conditions v † and s * .The split scheme relies on the repeated solution of a nonlinear system of ODEs and (in this case) a linear system of PDEs.The main advantage of the approach is that it allows for the separate discretization and solution of the ODEs and PDEs, with the respective solutions as feedback into the other system.For θ = 1/2, the resulting Strang splitting scheme is second-order accurate; for other values of θ the resulting scheme is first-order accurate.
2.2.Discretization of the separate PDE and ODE systems.Suppose now that the relevant PDE system (e.g. ( 5)) is discretized in space by a finite element method defined over a mesh T h of the domain Ω, and in time by some suitable temporal discretization.Efficient solution algorithms for such discretizations are wellestablished (such as [16]) and will not be detailed further here.
At each iteration the solution of the PDE system relies on the solution of the ODE system (e.g. ( 4)), or at least the ODE solution evaluated at some finite set of points X = {x i } |X| i=1 in space.For instance, the set of points X may be taken as the nodal locations of the finite element degrees of freedom, or the quadrature points of the mesh.As a consequence of this and the spatial locality of ODEs, a natural approach to discretizing the system of ODEs is to step the ODE system forward in time at this set of points X.Typically, |X| is very large and the efficient repeated solution of these systems of ODEs is key.This is the setting that we focus on next.
3. Solution schemes for the ODE systems.The initial value problem (1) decouples in space, and henceforth we consider its solution for a fixed x.With a minor abuse of notation, let y = y(x) and f (y, t) = f (y, x, t) so that (1) reads as the classical ODE problem: for t ∈ [T 0 , T 1 ].There exists a wide variety of solution schemes for (6) including but not limited to multistage, multistep, and IMEX schemes, see e.g.[7].In this work, we focus on two classes of schemes that are widely used in computational biology: multistage schemes and so-called Rush-Larsen schemes.The Rush-Larsen schemes are commonly used in cardiac electrophysiology in general and for discretizations of the bidomain equations (2) in particular.These two classes of schemes are detailed below.Keeping the general iterative operator splitting setting in mind, we present the schemes on a single time-step [T 0 , T 1 ] with κ = T 1 − T 0 for brevity of notation.
3.1.Multistage schemes.An s-stage multistage scheme for ( 6) is defined by a set of coefficients a ij , b i and c i for i, j = 1, . . ., s, commonly listed in a so-called Butcher tableau; for more details, see e.g.[7].Given y 0 at T 0 , the scheme finds the stage variables k i for i = 1, . . ., s satisfying (7) and subsequently sets the solution y 1 at T 1 via ( 8) Note that (7) defines a system of (non-linear) equations to solve for the stage variables k i (i = 1, . . ., s) if a ij = 0 for any j ≥ i.Our implementation demands that a ij = 0 for j > i, i.e. does not allow the computation of earlier stages to depend on the values of later stages.
3.2.The Rush-Larsen scheme and its generalization.Recall that y = {y i } m i=1 and f (y) = {f i (y)} m i=1 .The original Rush-Larsen scheme employs an exponential integration scheme for all linear terms for each component f i (for i = 1, 2, . . ., M ), and a forward Euler step for all non-linear terms [25].Let J i = ∂fi ∂yi (y 0 , T 0 ) be the i'th diagonal component of the Jacobian at time T 0 for i = 1, . . ., M , and assume that J i is non-zero.For brevity, denote f i (y 0 , T 0 ) = f 0 i .The Rush-Larsen scheme then computes y 1 i at T 1 by ( 9) For stiff systems of ODEs, the forward Euler step in (9) can become unstable for large κ.This motivates a generalized version of the Rush-Larsen scheme [26].In this generalization, the exponential integration step is used for all components of y reducing the scheme to (10) Both ( 9) and ( 10) are first order accurate in time [25,26].
Both the original and the generalized Rush-Larsen schemes can be developed into second order schemes by repeated use as follows.The solution y is computed in the first step and is used in f and its linearization J to compute y 1 in the second step.Let f 2 ), and let 2 ) be the diagonal component of the Jacobian at time T 1 2 for i = 1, . . ., M .More precisely, the second order version of ( 9) and ( 10) is then given by y for the original Rush-Larsen scheme and for the generalized Rush-Larsen scheme.Here, (12b) is simplified compared to the scheme in [26]: instead of evaluating f and J at ȳ(i) (cf.eq. ( 7) in [26]) we use y

Adjoints and tangent linearizations of coupled PDE-ODE systems.
Our aim is to automatically derive the adjoint and tangent linear equations of operator splitting schemes in a manner that allows for efficient solution of the resulting systems.We proceed as follows: we first state the adjoint and tangent linear system for general problems, then consider the special case of mixed PDE-ODE systems and finally discuss the specific adjoint and tangent linear versions of multistage and Rush-Larson schemes.
We begin by considering a general system of discretized equations in the form: The adjoint equation of ( 13) associated with a real-valued functional of interest J : where the superscript * denotes the adjoint operator.The associated tangent linear equation with respect to some auxiliary control parameter m is given by: find the tangent linear solution ẏ such that We now turn our attention to the case where F is a coupled PDE-ODE system.For a guiding example, we consider again the bidomain equations ( 4)-( 5) and, without loss of generality restrict ourselves to a single iteration (N = 1).For brevity, we denote the ODE operator (4) as O and the PDE operator (5) as P .We can then rewrite the scheme in the form (13): (Solve bidomain ODEs for final state) (16e) The first pair of equations in ( 16) set the initial conditions, the following equations represents a collection of nonlinear ODEs, the third equation a system of PDEs, and the last equation again a collection of ODEs.
From ( 14), the adjoint system for ( 16) derives as: For brevity, the left hand side matrix combines 2 × 2 blocks, and the functional of interest is assumed to only depend on the final state variables v 1 , s 1 .The adjoint system ( 17) is a linear coupled PDE-ODE system with upper-triangular block structure, which can efficiently by solved by backwards substitution.The last block row represents a linear adjoint ODE system, preceded by a adjoint PDE system, and preceded by another adjoint ODE system, and finalised by a variable assignment in the first block row which represents the adjoint version of setting the initial conditions.
The derivation of adjoint and tangent linear equations for finite element discretizations of PDEs, such as the third block row in (17), is well-established, see e.g.[10].However, the automated derivation of adjoint and tangent linear systems for the multistage and Rush-Larsen discretizations of the generic ODE system (6), such as the second and fourth block row in (17), is less so, and these derivations are presented below.
4.1.Adjoints and tangent linearizations of multistage schemes.Consider a multistage discretization of (6) as described in Section 3.1 with a given Butcher tableau a ij , b j , c i , i, j = 1, . . ., s. Taking s = 3 for illustrative purposes, we can write ( 7)- (8) in form (13) as: with Assuming that the functional of interest J is independent of the internal stage values k i , we can derive the adjoint problem as: , where ȳ0 is the terminal condition for the adjoint solution at T 1 .
Generalizing to s stages, we see that we first solve for the adjoint stage values and then compute the adjoint solution at time T 0 via Following a similar calculation, the tangent linearization of the multistage scheme with s stages is: given y 0 at T 0 compute ki for i = 1, . . ., s and then ẏ1 at T 1 via Adjoints and tangent linearizations of Rush-Larsen schemes.We now turn to consider adjoints and tangent linearizations of the (generalized) Rush-Larsen schemes.We here derive the equations for the first order generalized Rush-Larsen scheme given by ( 10): for each component i = 1, . . ., M of the state variable.We can write (10) in form (13) as where y 0 is the given initial condition at T 0 and L(y (y 0 ,T0) − 1).
The adjoint system of the first order generalised Rush-Larsen scheme is then given by (20) ȳ0 ȳ1 − where ȳ0 is a given terminal condition for the adjoint at T 1 .Similarly, the tangent linear system (15) with respect to an auxiliary parameter m is given by where ẏ0 is a given initial condition at T 0 .The adjoint and tangent linear equations for the other Rush-Larsen schemes are derived following the same steps, and we therefore omit the details here.

FEniCS and Dolfin-adjoint abstractions and algorithms. The FEniCS
Project defines a collection of software components targeting the automated solution of differential equations via finite element methods [16].The components include the Unified Form Language (UFL) [3], the FEniCS Form Compiler (FFC) [17] and the finite element library DOLFIN [18].The separate dolfin-adjoint project and software automatically derives the discrete adjoint and tangent linear models from a forward model written in the Python interface to DOLFIN [10].This section presents our extensions of the FEniCS form language and the FEniCS form compiler, and other new FEniCS and dolfin-adjoint software features targeting coupled PDE-ODE systems in general and collections of systems of ODE in particular.
5.1.Variational formulation of collections of ODE systems.Consider a spatial domain Ω ⊂ R d tessellated by a mesh T h , and a collection of general initial problems as defined by (1) over a set of points X defined relative to T h .For x i ∈ X, denote by δ xi the Dirac delta function centered at for all continuous, compactly supported functions f .We will also write Drawing inspiration from our context of finite element variational formulations, we can then write (1) as: find y(•, t) such that (23) y t (t), ψ i X = f (y, t), ψ i X for all (basis) functions (or distributions) ψ i such that ψ i (x i ) = 1 and ψ j (x i ) = 0 for j = i.The remainder of this section describes the extensions of the FEniCS and Dolfinadjoint systems to allow for in particular abstract representation and efficient forward and reverse solution of systems of the form (23).

Extending UFL with vertex integrals.
UFL is an expressive domainspecific language for abstractly representing (finite element) variational formulations of differential equations.In particular, the language defines syntax for integration over various domains.Consider a mesh T h of geometric dimension d with cells {T }, interior facets {F i } and boundary facets {F} b .Interior facets are defined as the d − 1 dimensional intersections between two cells and thus for any interior facet F i we can write F i = T + ∩ T − .UFL defines the sum of integrals over cells, sum of integrals over boundary facets and sum of integrals over interior facets by the dx, ds, and dS measures, respectively.For example, the following linear variational form defined in terms of a piecewise polynomial u defined relative over T h : is naturally expressed in UFL as: To allow for point evaluation over the vertices of a mesh (cf.( 22)), we have introduced a new vertex integral (or vertex measure in UFL terms) type with default instantiation dP: In agreement with (22), the vertex integral is defined by: (24) f where V (T h ) is the set of all vertices of the mesh T h .Vertex measures restricted to a subset of vertices can be defined as for all other UFL measure types.This basic extension of UFL crucially allows for the abstract specification of collections of ODEs such as (23) in a manner that is consistent and compatible with abstract specification of finite element variational formulations of PDEs.It also allows for the native specification of e.g. point sources in finite element formulations of PDEs in FEniCS.

Vertex integrals in the FEniCS Form Compiler. The FEniCS Form
Compiler FFC generates specialized C++ code [2] from the symbolic UFL representation of variational forms and finite element spaces [17].To accommodate the new vertex integral type, we have extended the UFC interface with a class vertex integral defining the interface for the tabulation of the element matrix corresponding to the evaluation of an expression at a given vertex, listed below: The FFC code generation pipeline has been correspondingly extended to allow for the generation of optimized code from the UFL representation of variational forms involving vertex integrals.This allows for subsequent automated assembly of variational forms involving single vertex integrals (as illustrated above) and also in combination with other (cell, interior facet, exterior facet) integrals.

DOLFIN features for solving collections of ODE systems.
5.4.1.Assembly of vertex integrals.We have extended DOLFIN with support for automated assembly of variational forms that include vertex integrals.The support is currently limited to forms defined over test and trial spaces with vertexbased degrees of freedom only.The assembly algorithm follows the standard finite element assembly pattern by iterating over the vertices of the mesh, computing the map from local to global degrees of freedom, evaluating the local element tensor based on generated code, and adding the contributions to the global tensor.The vertex integral assembler runs natively in parallel via MPI.

Specification and generation of multistage schemes.
Since version 2016.1,DOLFIN supports the specification of multistage schemes of the form ( 7)-( 8) for the solution of collections of ODE systems of the form (1), via their Butcher tableaus.The DOLFIN class ButcherMultiStageScheme takes as input the right hand side expression f , the solution y, the Butcher tableau specified via a, b and c, and secondary variables such as a time variable, the (integer) order of the scheme.From this specification, DOLFIN automatically generates a variational formulation of each of the separate stages in the multistage scheme, with each stage in accordance with ( 7)- (8).DOLFIN can also automatically generate the variational formulations corresponding to the adjoint scheme (18) and to the tangent linear scheme (19) on demand.A set of common multistage schemes are predefined including Crank-Nicolson, explicit Euler, implicit Euler, 4 th -order explicit Runge-Kutta, ESDIRK3, ESDIRK4 [7,15].
Similar features have also been implemented for easy specification of Rush-Larsen schemes cf. ( 9)-( 12) and the automated generation of the corresponding adjoint and tangent linear schemes cf.e.g. ( 20) and ( 21) through the class RushLarsenScheme.Both ButcherMultiStageScheme and RushLarsenScheme subclass the MultiStageScheme class.

PointIntegralSolvers.
To allow for the efficient solution of the multistage schemes for collections of systems of ODEs, we have introduced a targeted PointIntegralSolver class in DOLFIN 2016.1.This solver class takes a MultiStageScheme as input and its main functionality is to compute the solutions over a single time step.
The solver iterates over all vertices of the mesh and solves for the relevant (stage and/or final) variables at each vertex.The solution algorithm for each stage depends on whether the stage is explicit or implicit.For implicit stages, a custom Newton solver is invoked that allows for Jacobian reuse across stages, across vertices, and/or across time steps on demand.As the resulting linear systems are typically small and dense, a direct (LU) algorithm is used for the inner solves.For explicit stages, a simple vector update is performed.
The point integral solver runs natively in parallel via MPI.For a cell-partitioned mesh distributed between N processes, each cell is owned by one process and that process performs the solve for each vertex on the cell (once).As little communication is required between processes, the total solve is expected to scale linearly in N .
5.5.Extensions to dolfin-adjoint.Dolfin-adjoint has been extended to support the new features including automatically deriving and computing adjoint and tangent linear solutions for PointIntervalSolver.The symbolic derivation of the variational formulation for the adjoint and tangent linear equations for the MultiStageScheme are based on (18), (19) for the Butcher tableau defined schemes and ( 20), (21) and its analogies for the Rush-Larsen schemes.
The overall result is that operator splitting algorithms as described in Section 2.1 can be fully specified and efficiently solved within FEniCS.Moreover, the corresponding adjoint and tangent linear models and first and second order functional derivatives may be efficiently computed using dolfin-adjoint with only a few lines of additional code.
6. Applications.In this section, we demonstrate the applicability and performance of our implementation by considering the two examples presented in the introduction, originating from the computational modelling of mitochondrial swelling and cardiac electrophysiology respectively.The complete supplementary code is openly available, see [9].
The initial and final solutions are presented in Figure 1 while the L 2 -gradient of the objective functional J given by (28) with respect to the initial calcium concentration u 0 is presented in Figure 2.
To verify that the computed gradient is correct, we performed a Taylor test at a given spatial and temporal resolution with a given perturbation seed.The results are listed in Table 1 and demonstrate the expected orders of convergence, indicating a correctly computed gradient.
In a multi-stage scheme, a number of intermediate stage solutions are computed during the solution process.To avoid excessive memory usage, dolfin-adjoint does not store these stage solutions.Thus, in order to compute the adjoint solution, the stage solutions are recomputed for each time-step.As a consequence, the minimal ratio of adjoint runtime to forward runtime for multistage ODE solves using this strategy is approximately 2. For a linear PDE solve, the optimal ratio of adjoint run time to forward run time, assuming all linear solves take the same amount of time, is 1.For a nonlinear PDE solve via a Newton iteration, the optimal ratio of adjoint run time to forward run time, assuming all linear solves take the same amount of time, is 1/N where N is the number of average Newton iterations in the forward solve.These optimal ratios for linear and nonlinear PDE solves have observed for typical dolfin-adjoint usage [10].For this example, two Newton iterations were required on  average to solve the PDEs to within a tolerance of 10 −10 both for N x = 40 and 80, and assembly dominated the PDE solve runtime.Thus for this example with two multistage ODE solves and a nonlinear PDE solve in each timestep, the optimal ratio of adjoint runtime to forward runtime is expected to be in the range 0.5 − 2, and closer to the upper bound than the lower bound depending on the distribution of computational cost between the PDEs and ODEs.Experimentally observed timings for this example are listed in Tables 2-3 for N x = 40, 80, 160.From Table 2, we observe that the adjoint-to-forward runtime ratio for the total solve is in the range 1.38 − 1.95 and decreasing with increasing problem size.The corresponding gradient-to-forward runtime is in the range 1.44 − 2.28, again decreasing with increasing problem size as expected.From Table 3, we observe that adjoint-to-forward runtime ratio for the ODE solves is in the range 2.35 − 2.41 for this set of problem sizes, also decreasing with increasing problem size.We also note that the ODE solve runtime, both for the forward and adjoint solves, appears to scale linearly with the problem size M = N 2 x as anticipated and as is optimal.The ODE solves account for 37 − 40% of the total forward runtime for this example.
We also conducted numerical experiments varying the multistage scheme used in  28) for T = 35 κ n = 0.5 with the ESDIRK4 multistage scheme.
the ODE solves, including the 1-stage Backward Euler (BDF1), the 2-stage Crank-Nicolson (CN2), the 3-stage ESDIRK3 scheme in addition to the previously reported 4-stage ESDIRK4 scheme.The results (using N x = 40) are reported in Table 4.We observe that the adjoint-to-forward runtime ratio ranges from 2.04 (BDF1) to 2.41 (ESDIRK3, ESDIRK4) and the general trend is that this ratio increases with the number of stages as expected, but stabilizes around 2.3 − 2.4.

Application: cardiac electrophysiology (2D).
In this example, we consider the bidomain equations ( 2) over a two-dimensional rectangular domain Ω = [0, 50] × [0, 50] (mm) with coordinates (x 0 , x 1 ).We will consider a set of different cardiac cell models of increasing complexity: a reparametrized FitzHugh-Nagumo (FHN) model with 1 ODE state variable [11], the Beeler-Reuter (BR) model with 7 ODE state variables [5], the ten Tusscher and Panfilov (TTP) epicardial cell model with 18 ODE state variables [28], and the Grandi et al (GPB) cell model with 38 ODE state variables [13].The description of each of these cell models is available via the CellML repository, and our implementation of the ionic current I ion and F arising in (2) was automatically generated from the CellML models for the BR, TTP and GPB models.We refer to the supplementary code [9] for the precise description of the models including our choice of FitzHugh-Nagumo parameters.
Our parameter setup is otherwise as follows.We let χ = 140 (mm −1 ), C m = 0.01 (µF/mm 2 ), and let M i = diag(g if , g if ) and M e = diag(g ef , g es ) where g ef = 0.625/(χC m ), g es = 0.236/(χC m ) and g if = 0.174/(χC m ).We let I s = 0 and set the for v.The other state variables are initialized to the default initial conditions given by the respective CellML models.We use a second-order Strang splitting scheme (i.e.θ = 0.5 for ( 4)-( 5)), a Crank-Nicolson discretization in time and continuous piecewise linear finite elements in space for the PDE system, and different Rush-Larsen type discretizations (RL1, GRL1, RL2, GRL2) for the time discretization of the ODE systems1 .The coupled linear systems that arise at each PDE step (5) were solved with a block-preconditioned GMRES scheme with a relative tolerance of 10 −10 .The complete details of the solver are detailed in the supplementary code [9].
We take κ n = 0.1 (ms), and consider a uniform mesh of the computational domain with N x ×N x ×2 triangles.The coarsest mesh considered used N x = 160.For this case, simulations converged using any of the cell models and schemes, and gave qualitatively correct results, except the Grandi cell model discretized by the RL1 scheme for which the numerical solution scheme failed to converge.
We are interested in computing the gradient with respect to the extracellular fiber conductivity g ef of the following objective functional (32) for a equidistributed set of time points t i .The model initial condition and solution for the transmembrane potential at T = 100, together with the L 2 -gradient of the objective functional with respect to the fiber conductivity are illustrated in Figure 3.
The following numerical experiments were performed on the Abel supercomputer at the University of Oslo.All runtime experiments were repeated three times, of which the minimum timing values are reported here.6.2.1.Verifying the correctness of the discrete gradient.To verify that the computed functional gradient is correct, we performed a Taylor test in a random  5 and demonstrate the expected orders of convergence without a computed gradient (order 1) and when using the computed discrete gradient (order 2) in the Taylor expansion.We also performed similar experiments for the other cell models (BR, TTP, and GPB) and other Rush-Larsen solution schemes (RL1, GRL2, RL2).We used the same settings as in Table 5, except for the Grandi cell model with any other Rush-Larsen scheme than GRL1, where the time step had to be reduced to 0.01 in order for the forward solver to converge.We obtained the expected convergence order for all combination of cell models and schemes tested.These results indicate that the automatically derived and computed adjoints and gradient are correct for all the Rush-Larsen schemes implemented.
for decreasing step sizes s and a random direction δ with functional given by (32).
Computations performed with T = 10.0, κ n = 0.1 and N x = 160, the FHN cell model and the GRL1 scheme.We observe that the remainders converge at first order without gradient information and at second order with gradient information, as expected.
6.2.2.Adjoint runtime performance.As the bidomain PDEs ( 5) are linear, a rough theoretical estimate of the optimal PDE adjoint to PDE forward runtime ratio is 1, assuming that the solution times in the adjoint and forward PDE solves are equal and dominated by the finite element assembly and linear solvers.Similarly, the ODEs are solved with explicit Rush-Larsen schemes with a theoretically optimal adjoint-to-forward ratio of 1 as well.
In Table 6, we list the total runtime, and the runtime components corresponding to only the ODE solves, only the PDE solves and an intermediate variable update (merge) step, for both the forward and the adjoint solution computation for a test case with T = 10.0, κ n = 0.1, N x = 160, the GRL1 scheme and the four different cardiac cell models (FHN, BR, TTP, GPB).
From Table 6, for the forward solves, we observe that the total runtime increases with cell model complexity.The forward PDE step runtime is essentially independent of the cell model, while the forward ODE step runtime increases with the cell model complexity, as expected: the ODE runtime corresponds to approximately 10% of the PDE runtime for the simplest FitzHugh-Nagumo model, and approximately 114% of the PDE runtime for the most complex Grandi model.The runtime of the forward merge step is insignificant in comparison to the ODE and PDE steps, but increases with cell model complexity.
We observe that also the adjoint PDE runtimes are comparable for all cell models.The resulting adjoint-to-forward PDE step ratios are in the range 0.79 − 0.82.The decrease in compute time for the adjoint PDE solves compared to the forward PDE solves are attributable to the Krylov solvers: fewer Krylov solver iterations were necessary for convergence for the adjoint solves compared to the forward solves.For the ODE solves, we observe that the adjoint-to-forward ratio for the ODE solves is in the range 1.88 − 2.58 for the different cell models.Additionally, we observe that the runtime of the adjoint merge step is significantly larger than the corresponding time for the forward merge step and with an increasing adjoint-to-forward ratio with increasing cell model complexity 3.76 − 5.40.We suggest that the reason for the increased and increasing adjoint runtime is that the adjoint merge step involves additional assembly of variational forms of complexity comparable to that of the cell model.
Overall, we observe that the adjoint-to-forward total runtime ratios are in the range 1.15 − 1.66 for the different cell models considered, with increasing adjoint-to-  6: Adjoint-to-forward runtime performance for a cardiac electrophysiology 2D test case.Breakdown of forward and adjoint runtimes and their ratio for different cell models.Each row shows total runtime (s), runtime for the ODE, PDE and merge steps (s), and percentage of the total runtime not accounted for by these three steps (Other).The computations were performed with T = 10, κ n = 0.1, N = 160, GRL1 as the ODE scheme and (32) as the adjoint functional.
forward ratio with increasing cell model complexity.Since the PDE runtime dominates the total runtime for the simpler cell models, the total adjoint-to-forward ratio is closer to 1 for these models.On the other hand, for the more complicated cell models, the ODE runtimes dominate, and so we observe adjoint-to-forward ratios closer to 2 for these.
In the last column of Table 6, we observe that the combined ODE and PDE solve time contributes to between 97% (for FitzHugh-Nagumo) and 86% (for Grandi) of the total forward runtime for the cell models considered.The remaining forward runtime cost is primarily caused by the initialization routines such as loading the computational mesh, and creating the required function spaces.For the corresponding adjoint runtimes, the combined ODE and PDE solve times contribute to between 86% (for FitzHugh-Nagumo) and 85% (for Grandi) of the total run time, and the percentage contribution decreases with the complexity of the cell model.The remaining runtime cost can primarily be attributed to recording the forward states during the adjoint solve (which is included in the reported total run time but not in the ODE or PDE runtimes here).We also timed the computation of the discrete gradient and noted that the cost of additionally computing the gradient was negligible (less than 0.1% of the adjoint solve runtime).6.2.3.Forward and adjoint parallel scalability.In this section, we discuss the weak and strong parallel scalability of the forward and adjoint ODE solvers and the scalability of the point integral solvers in particular applied to this 2D electrophysiology test case.We focus on the scalability of the point integral solver particular as this is the primary new feature described in this paper.Previous studies have discussed the parallel scalability of FEniCS in general [1,10,24].
We ran a series of weak scaling experiments on Abel, hosted by the Norwegian national computing infrastructure (NOTUR), for the previously considered series of cell models (FHN, BR, TTP, GPB) on an increasing number of CPUs (1,4,16,64) and with an increasing mesh resolution.We timed the total runtime of the point integral solves, both for the forward solves and the adjoint solves.For each cell model and resolution, we ran three experiments and extracted the minimal run times.The results are listed in Table 7.For both the forward and adjoint run times, we computed the parallel efficiency (PE) as the ratio of the 1-CPU runtime to the n-CPU run time for n = 4, 16, 64, also listed in Table 7.The optimal PE would be 1.The last column of the Table lists the computed the adjoint-to-forward ratio for these solves.
For the forward runtimes, we observe that the parallel efficiencies range from 0.66 to 0.95 for all cell models and resolutions, with efficiencies higher than 0.8 for all but the least complex cell model.In general, we observe that the parallel efficiency increases with increasing cell model complexity and that it decreases moderately with the number of CPUs.Some of the loss of parallel efficiency (from 1) is attributable to only a near-perfect distribution of mesh vertices across CPUs; we observed approximately 10% variation between CPUs in the number of local vertices.
For the adjoint runtimes, we observe parallel efficiencies ranging from 0.75 (FHN on 64 cores) to 0.95, again with efficiencies higher than 0.8 for all but the least complex cell model.In general, we note the same trends as for the forward run times: increasing parallel efficiency with increasing cell model complexity and moderately decreasing parallel efficiency with increasing number of CPUs.We emphasize that we thus achieve the same parallel efficiency for the adjoint point integral solves as for the forward point integral solves.
Comparing the forward and adjoint runtimes, we see that the adjoint-to-forward ratio for the point integral solvers range from 1.19 to 1.56.Comparing these numbers with the adjoint-to-forward ratios of the overall ODE solves as discussed in the previous, we observe that the adjoint-to-forward point integral solver performance is better.The adjoint-to-forward ratio is moderately increasing with cell model complexity, but is stable with respect to the number of CPUs.This latter point again illustrates that the adjoint point integral solvers demonstrate the same parallel efficiency as the forward point integral solves.
We also ran a series of strong scaling experiments on Abel for the previously considered series of cell models (FHN, BR, TTP, GPB) on an increasing number of CPUs (1,4,16,64) and with a fixed mesh resolution.We timed the total runtime of the point integral solver steps for both the forward and adjoint solves.Again, for each cell model and resolution, we ran three experiments and extracted the minimal run times.The results are listed in Table 8.For both the forward and adjoint run times, we computed the (strong scaling) parallel efficiency (PE) as the ratio of the 1-CPU runtime to n-times the n-CPU runtime for n = 4, 16, 64, also listed in Table 8.The optimal (strong scaling) PE would be 1.The last column of the Table lists the computed the adjoint-to-forward ratio for these solves.
We observe that the parallel efficiencies are comparable to the results from the weak scaling test case with efficiencies in the range from 0.60 to 0.94.We note that the adjoint strong parallel efficiency is comparable to the forward strong parallel efficiency.The adjoint-to-forward ratios for this strong scaling test are also in the same range as for the weak scaling test.

Application: biventricular cardiac electrophysiology (3D).
In this example, we aim to compute the sensitivity of the squared L 2 -norm of the transmembrane potential with respect to its initial condition, in order to illustrate the features  7: Weak scaling of the point integral solver for a cardiac electrophysiology 2D test case.Forward and adjoint point integral solver runtimes for increasing N x and simultaneously increasing number of cores and for different cell models.Each row shows number of CPUs, the mesh resolution, the total point integral solver runtime (s) for the forward solves, the parallel efficiency (PE) for the point integral forward solves, the total point integral solver runtime (s) for the adjoint solves, the parallel efficiency for the point integral adjoint solves, and the adjoint-to-forward runtime ratio.The computations were performed with T = 1.0, κ n = 0.1 and GRL1 as the ODE scheme and (32) as the adjoint functional.
discussed in this paper for a more complex test case.This example also aims to illustrate how different representations may seamlessly be used for the control variables; this is useful for instance when computing sensitivities with respect to spatially constant, regionally defined or highly resolved control variables.
To solve the coupled equations ( 33) and ( 4), we use a second-order Strang splitting scheme (θ = 0. Forward and adjoint point integral solver runtimes for a fixed (fine) mesh resolution N x = 1280 with an increasing number of CPUs, for different cell models (FHN, BR, TTP, GPB).Each row show the number of CPUs, the total point integral solver runtime for the forward solves, the corresponding parallel efficiency (PE), the total point integral solver runtime for the adjoint solves, the corresponding parallel efficiency (PE) and the adjoint-to-forward runtime ratio.The computations were performed with T = 1.0, κ n = 0.1 and GRL1 as the ODE scheme and (32) as the adjoint functional.
linear finite elements in space for the resulting PDEs, and the first-order generalized Rush-Larsen (GRL1) scheme for the resulting ODEs.We take κ n = 0.05 (ms).The moderately coarse mesh has mesh cell diameters in the range [0.61,We observe that as we increase α, the gradient with respect to the top initial condition c 0 goes from large and positive to small and negative, while the gradient with respect to the bottom initial condition c 1 varies from moderately negative to close-to-zero and rapidly to large and positive.The rapid changes in the gradients illustrate the nonlinear nature of the problem at hand.
The sensitivity of the objective functional J defined by (35) with respect to the initial condition v 0 ∈ V is illustrated in Figure 5 (right).We observe that the gradient in the upper part of the domain is large and negative, that the gradient in a boundary zone is large and positive and the gradient in the lower part of the domain is close to zero.This indicates that infinitesimal increases/decreases in the upper part of the domain will lead to a large decrease/increase in the functional value, infinitesimal increases/decreases in the boundary zone domain will lead to a large increase/decrease in the functional value, and that infinitesimal increases/decreases in the lower part of the domain will have relatively small effect on the functional value.
7. Concluding remarks.We have presented a high-level framework that allows for the specification of coupled PDE-ODE systems and their efficient forward and adjoint solution with operator splitting schemes.We have illustrated the features of the framework with a series of examples originating from cell modelling and computational cardiac electrophysiology.Our numerical results indicate adjoint runtime performance near optimal (measured in terms of adjoint-to-forward runtimes), and parallel efficiency indices of 84 − 94% for the more realistic cardiac cell models.
The main advantages of the framework are the speed of developing solvers for new systems, and the ability to automatically derive their adjoints from the highlevel specification.This allows for flexible exploration of models when the relevant governing equations are uncertain, and for the identification of unknown parameters via the solution of inverse problems as demonstrated e.g. in [14].Both of these capabilities are of significant importance in numerous areas of scientific computing, and in particular in computational medicine, physiology and biology.Future work will focus on further improving the parallel scalability of the solvers and its application to problems in personalized medicine.

Fig. 1 :
Fig. 1: Initial conditions and solutions t = T = 35 for the mitochondrial swelling example with N x = 40.The scale is common for A and E and ranges from 0 (blue) to 108.3 (yellow).The scale is common for B-D and F-G and ranges from 0 (blue) to 1 (yellow).

Fig. 2 :
Fig.2: Sensitivity of J, the total amount of completely swollen mitochondria at T , L 2 -gradient of J (left) with respect to the initial condition u 0 (right).

Fig. 3 :
Fig. 3: (a) Transmembrane potential v at t = 0 ms and (b) at t = 100 ms using the FHN model, computed with κ n = 0.1 and N x = 160.(c) L 2 -gradient of the functional with respect to the extracellular fiber conductivity g ef .(d) Timeplot of transmembrane potential v at x = y = 25 mm.
This class defines the interface for the tabulation of // / an expression evaluated at exactly one point .

Table 2 :
Run times to compute the forward solution t F and to compute the adjoint t A and the functional gradient t G of the functional defined by (28) for the mitochondria example with increasing spatial resolution for T = 35 κ n = 0.5 with the ESDIRK4 multistage scheme.

Table 3 :
Run times for solving the ODE systems when computing the forward solution t F,O and when computing the adjoint t A,O of the functional defined by (

Table 4 :
Run times computing the forward solution t F , for solving the ODE systems when computing the forward solution t F,O and when computing the adjoint t A,O of the functional defined by (28) for T = 35 κ n = 0.5, N x = 40 for some common multistage schemes.

Table 5 :
Taylor convergence results for the cardiac electrophysiology 2D test case.Taylor remainders R 0

Table 8 :
5), a Crank-Nicolson discretization in time and continuous piecewise Model No. cores Forward PIS PE Adjoint PIS PE Ratio Strong scaling of the point integral solver for a cardiac electrophysiology 2D test case.