TSFC: a structure-preserving form compiler

A form compiler takes a high-level description of the weak form of partial differential equations and produces low-level code that carries out the finite element assembly. In this paper we present the Two-Stage Form Compiler (TSFC), a new form compiler with the main motivation to maintain the structure of the input expression as long as possible. This facilitates the application of optimizations at the highest possible level of abstraction. TSFC features a novel, structure-preserving method for separating the contributions of a form to the subblocks of the local tensor in discontinuous Galerkin problems. This enables us to preserve the tensor structure of expressions longer through the compilation process than other form compilers. This is also achieved in part by a two-stage approach that cleanly separates the lowering of finite element constructs to tensor algebra in the first stage, from the scheduling of those tensor operations in the second stage. TSFC also efficiently traverses complicated expressions, and experimental evaluation demonstrates good compile-time performance even for highly complex forms.

1. Introduction. The development of state-of-the-art finite element simulation software has become an especially demanding endeavour. Advanced finite element techniques are particularly complicated, the efficient utilization of modern computers requires onerous effort, and the application domain itself is often a source of remarkable complexity. Automatic code generation techniques have been employed by various software projects to decouple and manage these complexities. Examples include Analysa [5], GetDP [10], Sundance [29], FreeFem++ [14], the FEniCS Project [27,2], and Firedrake [35]. The finite element method is particularly suited to code generation approaches since the entire mathematical specification of a finite element problem can be expressed at a high level of abstraction. The specification of the weak form of the equations, along with the discrete function space from which each variable is to be drawn, is sufficient to characterise the problem completely. The Unified Form Language (UFL) [3] is a domain specific language for the finite element method, and provides the input language for FEniCS and Firedrake. UFL captures the weak form and discretised function spaces, thereby providing the information that the code generation system requires.
The solution of a finite element problem requires three distinct operations. First, each integral in the problem (that is, each multilinear form) must be evaluated on each cell or facet of the mesh. Second, the facet and cell components are assembled into global matrices and/or vectors as appropriate. Finally, these global objects are used by a linear solver to compute the answer to the system. Time varying and nonlinear PDEs are solved by composing these operations under the control of a time integration scheme and nonlinear solver respectively.
A form compiler takes a multilinear form expressed in UFL and produces lowlevel code that assembles the form on a single cell or facet of the mesh. The current production form compiler in FEniCS is the FEniCS Form Compiler (FFC) [19,28] while the SyFi Form Compiler (SFC) [4] was an earlier alternative. The UFL Analyser and Compiler System (UFLACS) has been developed as a replacement for the previous FFC implementations, and has recently become default. In this paper we present the Two-Stage Form Compiler (TSFC), a new form compiler in use in the Firedrake project. The main motivation in TSFC is to maintain the structure of the input expression as long as possible to facilitate the application of optimizations at the highest possible level of abstraction.
The contributions of TSFC, and therefore of this paper, are a number of algorithmic advances over the state of the art form compilers currently available. By efficiently traversing the input UFL expressions, the compile time of TSFC scales much better than legacy FFC as form complexity increases and in the cases where the transformation from the reference cell to the mesh is non-affine. TSFC preserves the tensor structure of expressions for longer through the compilation process than either FFC or UFLACS. In particular, this is achieved by a new approach to isolating the components of a weak form which must be inserted into different rows and columns of the output tensor. Finally, TSFC cleanly separates the lowering of finite element constructs to tensor algebra on the one hand, from the scheduling of those tensor operations on the other. This facilitates the optimisation of this schedule either in situ or by a downstream tool.
The rest of this work is arranged as follows. In the remainder of this section we review some concepts necessary for understanding form compilers, and then give an overview of other current form compilers. In section 2 we emphasize the importance of preserving structure, and propose a novel, structure-preserving method for an issue that arises in discontinuous Galerkin problems. We compare this with the approach adopted by UFLACS, and consider their compatibility with our existing optimization technology. Then we discuss the first compilation stage in section 3, briefly describing the preprocessing in UFL, then explaining the translation of the preprocessed UFL form to the intermediate tensor algebra language. Section 4 then gives a detailed description of this intermediate representation along with its translation to low-level C code. Section 5 discusses optimizations employed at various stages of the form compilation pipeline. Experimental evaluation is presented in section 6, and section 7 finally concludes the paper. Where κ and f are prescribed spatial functions that parametrize the forms a and L, and x represents the coordinate field that characterizes the domain Ω. Following the UFL terminology, we refer to v and u as arguments, and to x, κ, and f as coefficients. A form must be linear in its arguments, but not necessarily in its coefficients.
Let V andV be suitable finite-dimensional spaces of functions on Ω and let their bases be A is the bilinear form a assembled as a matrix, and b is the linear form L assembled as a vector.
UFL is a domain-specific language for the description of the weak form and discretized function spaces of finite element problems. UFL is embedded in Python. Listing 1 defines a and L in UFL, lines 10 and 11 correspond to (1.2) and (1.3) respectively. Line 1 defines an abstract mesh of triangular cells in a 2-dimensional space with coordinates attached to each vertex (equivalent to a first-order Lagrange space). Setting dim=3 would create a 2-dimensional surface in a 3-dimensional space. Setting the degree of the coordinate element higher than one would allow the triangular cells to be curved. Users of FEniCS and Firedrake typically create a concrete mesh object which extends the UFL class with mesh data that define a tesselation of a concrete physical domain. Line 2 sets up the function space V =V with P 2 elements on the mesh; we seek a solution in this space. Listing 1 specifies that κ and f are also from V .

Listing 1 Stationary heat equation in UFL
1 mesh = Mesh ( VectorElement ( " Lagrange " , triangle , 1 , dim =2)) 2 V = FunctionSpace ( mesh , FiniteElement ( " Lagrange " , triangle , 2)) 3 4 u = TrialFunction ( V ) The UFL specification serves as the input of a form compiler. Form compilation can be understood as the partial evaluation of a multilinear form given a basis for each of its arguments. For example, the bilinear form a is compiled to the function (x, κ) → A, and the linear form L to the function (x, f ) → b. Form compilers typically assume that the domain consists of a single cell; this makes the argument bases independent of actual physical domain up to a space transformation which is encapsulated in the coordinate coefficient x. This does not limit the applicability of form compilation since, given a tessellation T of the domain Ω, one can evaluate forms locally on each cell and assemble the contributions into the global tensors efficiently a new expression, implementing, for example, a simplification or an optimization, or translating the expression to a different language. A compiler may use several intermediate languages between the source and target language to decompose the translation process into several simpler passes, and especially to facilitate optimizations at the right level of abstraction.
It is common practice in form compilers to change the integral from physical to reference space. This enables, for example, the precomputation of basis functions at quadrature points. We use this transformation as an example for a compiler pass. Let us consider the Laplace operator Let x :K → K be the mapping of coordinates from the reference cell to the physical cell, and let the Jacobian of this mapping be J = ∇x. For a P 2 element u(x(X)) = u(X), and likewise for v. The gradient transforms as ∇ x = J −T ∇, and the integral needs scaling as dx = |J| dX. Thus we have Figure 1 presents the input UFL form as an expression tree. By locally applying the above substitutions to the arguments, gradients and the integral, we obtain the expression graph in Figure 2. As one can see, J −T is referenced twice, and J is referenced twice directly, or three times indirectly. A later pass may replace J with the evaluation of ∇x. It is therefore better to think of expressions as directed acyclic graphs (DAGs) rather than trees. The presence of two references (arrows) to J in place of replicating the J node make this a directed acyclic graph (DAG) rather than a tree.
we follow their description. Consider, for example, the bilinear form where [[v]] denotes the jump in the function value of v across the facet S. The jump is expressed as where v + and v − are the function values as seen from the two cells K + and K − incident to S, respectively (see Figure 3). While UFL provides utility functions such as jump and avg for convenience, its only additional primitive is restriction: quantities to be evaluated on interior facets must be restricted to either side of the facet. The two sides of a facet are arbitrarily marked as + and −. Let {φ + i } n i=1 be the local finite element basis on K + , and similarly let {φ − i } n i=1 be the local finite element basis on K − . Since interior facet integrals involve both cells, Figure 3: Two cells K + and K − sharing a common facet S.
i=1 be a local basis on K = K + ∪ K − by the following construction: This construction makes the local tensor A ij = a(φ i , φ j ) a block tensor : FFC, UFLACS and TSFC construct the interior facet tensor A by computing its subtensors. The subtensor A (+,+) can be seen as the evaluation of a corresponding form a (+,+) , and the same applies for the other subtensors. Section 2 presents a novel algorithm implemented in TSFC for the separation of the multilinear form into several forms corresponding to the subtensors of the local tensor.
1.4. Other contemporary form compilers. FFC contains several compilers for weak forms-they are called "representations". The quadrature representation [34] generates code that performs runtime quadrature, akin to what one would manually write. The tensor representation [19,20] computes the local tensor as the contraction of a precomputable reference tensor with an element-dependent geometry tensor. This allows the elimination of the quadrature loop, which especially benefits higher-order schemes. However, due to the nature of this representation, the size of the reference tensor grows exponentially with the number of coefficients in the form, rendering this representation inefficient or even infeasible for a wide variety of problems. On the other hand, the quadrature representation suffers from an implementation limitation that expressions are seen as a tree rather than a DAG. This can dramatically increase the perceived size of the input expression (see Figure 4), which is detrimental to form compilation speed and generated code size. In case of non-affine geometries as well as for "complicated forms" -such as various formulations of hyperelastic modelsthe effect can be prohibitive.
UFLACS, which was integrated as another FFC representation, and has recently become the default implementation for most cases, has been developed to overcome this limitation. However, to keep the internal data structures simple, UFLACS is eager to lower high-level constructs early in the form compilation pipeline. The high-level structure expressed in the input form, if preserved longer, could facilitate optimisations that are otherwise difficult to obtain. Section 2 provides further insight.
2. Preserving structure. Optimizations are best expressed at a specific level of abstraction. If the representation is too high level, the transformation may not be expressible at all, and if the representation is too low level, difficult analyses may be needed to recognize the optimization opportunity. It is therefore desirable to postpone representation lowering operations until the appropriate optimizations have been applied. There are two principal ways in which UFLACS lowers structure early, but TSFC does not: lowering tensor expressions to scalar expressions and argument factorization.
Argument factorization rewrites the integrand as a sum of products such that each product contains a term without arguments and for each argument a component of (a derivative of) that argument. This is always possible since the form is multilinear in its arguments. Lowering to scalars is a practical requirement: distributivity rules in vector and tensor algebra are numerous, and completeness of the implementation is necessary for correctness in all cases. The sum-of-product form makes it easy to separate the form into parts that correspond to the subtensors of the local tensor in interior facet integrals (see subsection 1.3). We also note that UFLACS applies argument factorization irrespective of the integral type. In some cases it turns out to be an effective optimization as argument-independent expressions are exposed and precomputed outside the inner loops. However, this approach is in general suboptimal. The transformation may require the expansion of products involving one or more arguments which affects an expression in two major ways: a) by increasing the operation count (which is usually only partially counteracted by the subsequent loop hoisting); and b) by destroying the common subexpressions that naturally arise when a problem is specified in a tensor algebra language.
To demonstrate what UFLACS does, let us consider the following interior facet integral as an example: where [[u]] n is the product of the jump of u with the facet normal n, and {∇v} is the average of ∇v. The equivalent integrand with restrictions is n is of unit length, and always pointing out of the cell to which it is restricted. UFLACS first lowers the expression to scalars, in this case by expanding the dot product. Assuming a 2-dimensional space, this step yields Argument factorization is achieved through the application of distributivity rules: At this point one can easily separate the form based on argument restrictions: In contrast, TSFC separates the form into parts that correspond to the subtensors without applying argument factorization. This is achieved by simply exploiting the linearity of the form in its arguments. Since each instance of an argument is either positively or negatively restricted in the form, the following equality holds: where f is the form interpreted as a function of two variables, the positively restricted and the negatively restricted argument. Equation (2.4) means that splitting the form based on its contributions to the subtensors of the interior facet tensor is as simple as substituting an appropriately shaped zero for any argument that has the "wrong" restriction. This approach also naturally generalizes to any number of arguments.
Applying this approach to (2.2) directly yields: Equations (2.3) and (2.5) are equivalent, but TSFC's approach preserves the tensor structure of the original expression, and does not require rearrangement. In general, the preserved structure can be exploited, for example, by COFFEE [31,30], an optimizer for finite element loop nests. COFFEE explores a transformation space with the goal of minimizing the operation count of a kernel, and the sequence of transformations that UFLACS carries out is difficult to undo. This renders inaccessible to COFFEE a part of the transformation space, potentially eliminating the most efficient configurations.
Preserving the tensor structure of the UFL form presents further difficulties. Previous form compilers lower the form expression to scalars, and therefore only require code generation for the evaluation of finite element objects and geometric terms, while the expression itself is simply transcribed since the target language (typically C or C++) also supports scalar expressions. With tensor form expressions, code generation is also required for the translation of operations on tensor valued operands and for tensor contractions. Furthermore, the tensor nature of UFL expressions has two representations (see subsection 3.1). To tackle this complexity, we split compilation into two stages, hence the name: Two-Stage Form Compiler (TSFC). A tensor algebra language (GEM) is used as an intermediate representation, and the first stage lowers finite element objects and geometric terms to this language (section 3), while the second stage generates code to evaluate this tensor algebra expression (section 4). This architecture is sketched in Figure 5. We also expect our approach to be beneficial in the future for generating efficient code for higher-order elements by exploiting structure inside the elements and utilizing FInAT [18], a smarter library of finite elements.
3. First stage (UFL → GEM). GEM is a tensor algebra language very similar to UFL in its tensor algebra features, although it contains nothing specific to the finite element method. This approach works very well, because the evaluation of finite element objects and geometric terms is expressible using only tensor algebra.
3.1. Tensor algebra and index notation in UFL. The tensor nature of UFL expressions is represented as shape and free indices: shape An ordered list of dimensions and their respective extent, e.g. (2,2). A dimension is only identified by its position in the shape. free indices An unordered set of dimensions where each dimension is identified by a symbolic index object. One might think of free indices as an "unrolled shape". These traits are an integral part of any UFL expression. For example, let A be a 2 × 2 matrix, then A has shape (2, 2) and no free indices. A 1,1 has scalar shape and no free indices; A i,j has scalar shape and free indices i and j; and A i,1 has scalar shape UFL C / COFFEE  and free index i. UFL has four basic node types related to the tensor nature of its expressions: • Indexed(A, (α 1 , α 2 , . . . , α r )) where A is an expression of rank r, and α is multi-index consisting of fixed indices and free indices. The result has scalar shape, but all the free indices of α become free indices of the result, i.e. Indexed can convert shape to free indices.
Tensor contraction of the expression e over the free index i.
where α := (α 1 , α 2 , . . . , α k ) is a multi-index consisting of free indices, and e is an expression with free indices α 1 , α 2 , . . . , α k (at least). These free indices are turned into shape, in the order as they appear in α. • ListTensor(e 1 , e 2 , . . . , e n ) where e 1 , e 2 , . . . , e n are expressions of shape (s 1 , s 2 , . . . , s r ). The result is a concatenation of (the otherwise unrelated) expressions e 1 , e 2 , . . . , e n , with shape (n, s 1 , s 2 , . . . , s r ). The index notation provides a low-level representation for many vector and matrix operations. For example, dot product, element-wise multiplication and outer product are all expressible using only multiplication of scalars and some of the above tensor operations: • 3.2. UFL preprocessing and modified terminals. TSFC relies on UFL to preprocess the form. The most important step in this procedure is to change the integral from physical to reference space. This involves scaling the integral and the application of pullbacks on the functions (arguments and coefficients) of the form. For example, the Laplace operator where u and v are the trial and the test function respectively, becomes if v and u require no reference value transformation, as discussed in subsection 1.2.
J is the Jacobian of the coordinate transformation. Note the change in the integral measure (dx versus dX). After this transformation, UFL does various simplifications on the form: the Jacobian is replaced with the gradient of the spatial coordinates, determinants are expanded, divergence and curl are expressed with tensor algebra on gradients, and various products are expanded using the index notation. Subsection 3.1 demonstrated that lowering to the index notation introduces ComponentTensor nodes; however, to make this discussion easier to follow, we simplify most of them away, even though in reality they are only eliminated at a later stage (subsection 4.2). Finally, gradients and restrictions are propagated to the terminals. This reduces the number of language constructs that the form compiler needs to handle while preserving a rich language for the user. As a result of these operations, (3.2) becomes: where ]e[ α is a notation for ComponentTensor(e, α). Equation (3.3) denotes a single expression, although some subexpressions are assigned names for simpler presentation.
A terminal is a node in an expression graph with no children. UFL terminals are: a) constant literals, b) coefficients, c) arguments, d ) geometric quantities, and e) quadrature weight. Various terminal modifiers may wrap a terminal. TSFC only needs to handle the following terminal modifiers: • PositiveRestricted, NegativeRestricted Restriction to one particular side of the facet in an interior facet integral.

• ReferenceGrad
Gradient in the reference space, possibly applied multiple times. A modified terminal is a terminal with a set of modifiers attached. In this set-up, the translation of modified terminals is independent of their context (surrounding outer expression), and the translation of the overall expression does not need to know more about the modified terminals than their shape and free indices. Since GEM intentionally mimics the tensor algebra features of UFL, the latter problem is trivial. We discuss the former in the next subsection.
3.3. Substitution of modified terminals. Currently, the substitution rules for modified terminals mimic the evaluation rules of FFC: • Arguments. FIAT [17] provides the tabulation matrix that contains the value of each basis function at each quadrature point for the finite element of the argument. This becomes a compile-time constant matrix in GEM. This matrix is then indexed with the quadrature index and the corresponding argument index. • Coefficients. We need a GEM expression that evaluates the coefficient at each quadrature point. The matrix-vector product of the element tabulation matrix with the local basis function coefficients gives exactly that. That vector for the cell is given as a kernel argument, which is represented as a run-time value in GEM. • Geometric quantities. Because Firedrake represents the mesh coordinates as a finite element field, the Jacobian and further derivatives of the coordinate field are simply evaluated as coefficients. Other geometric terms typically appear in facet integrals (e.g. facet normal), and, as a consequence of UFL preprocessing, are expressed in reference space. They depend on the cell type, which is known at compile time, and often on the facet, which is provided as a kernel argument at run time. Thus the evaluation of a geometric term usually involves indexing into a compile-time constant tensor with the facet index. Applying these evaluation rules, the modified terminals of (3.3) are replaced with the following expressions: where E (b) is the tabulation matrix of the first derivate of the finite element of v in the b-th direction, C (a,b) is the tabulation matrix of the first derivate in the bth direction of the a-th component of the coordinate element, and c is the vector of the local basis function coefficients of the coordinates. This general treatment of the Jacobian also works for non-simplex cells and higher-order geometries. We will discuss optimizations for affine simplices in section 5. Since indexing a tensor with fixed indices immediately selects the appropriate entry, these substitutions yield the following integrand expression: Aggregating the integrand values for each quadrature point with quadrature weight w q gives the GEM representation of (3.1) as . This representation only contains tensor algebra operators applied to compile-time constant tensors and a run-time argument to the local kernel. All the finite element and geometric terms have been lowered, the first stage of form compilation is complete.

Intermediate representation (GEM).
Since GEM mimics the tensor algebra features of UFL and lacks other features, such as finite element objects and geometric quantities, one may rightly ask why the tensor algebra subset of UFL could not be adopted as the intermediate representation. The reason for the creation of GEM is to relax some of the constraints on free indices which are intrinsic in UFL.
Free indices are just an "unrolled shape" in UFL, that may be "rolled back" any time. For any operation, free indices of the operands are subject to the same constraints that apply to shape. For example, tensors must have equal shape for addition, consequently, they also must have the same free indices. GEM interprets free indices differently. A free index in GEM simply means that the value of the expression depends on that index. For example, let u be a 3-dimensional vector (no free indices). u i and u j are scalars with free indices i and j, respectively. So far, it is identical to UFL. However, in GEM u i +u j is just an expression that depends on both i and j, that is a scalar with free indices i and j, while in UFL the same expression is illegal because the free indices do not match: the addition of a 3-dimensional vector with itself cannot produce a 3 × 3 matrix.
While the enforcement of strict shape checking is valuable for preventing user errors in a weak form, this strictness becomes an obstacle when the modified terminals are replaced with their respective evaluation rules. For example, we replaced ∇ṽ with q,j in subsection 3.3. In UFL this would mean replacing a vector with a third-order tensor, which yields an illegal expression in the general case. In GEM it just replaces a vector with another vector which depends on q (quadrature index) and j (argument index). There is eventually a summation over q, and the dependence on j gives the components of the element tensor. This difference in free indices poses no difficulty for the UFL → GEM translation since GEM is strictly more permissive.

Second stage (GEM → C).
This section discusses the second stage of form compilation, which is the translation of GEM to C, a programming language with explicit loops and only scalar expressions.
4.1. Overview of GEM. GEM was intentionally designed to be very similar to UFL in its tensor algebra features. In common with UFL, the tensor nature of expressions can be expressed as shape or free indices (see subsection 3.1). Most supported operations are naturally scalar, such as addition and multiplication, but there are a few tensor operations as well, such as tensor contraction (called IndexSum).
Here we give a summary of all GEM nodes: • Terminals:

4.2.
Elimination of ComponentTensor nodes. The duplicity of shape and free indices would noticeably complicate the process of code generation. To avoid that, we eliminate non-scalar shape in non-terminal nodes by removing all ComponentTensor nodes using a rule that is analogous to beta reduction: where e is an expression with scalar shape, α is a multi-index containing distinct free indices of e, and β is a multi-index of the same rank as α. e| α→β denotes a substitution of indices in e, where α i is replaced with β i for all i. The rule replaces indexing into a ComponentTensor by a substitution of the indices. A ComponentTensor node must be immediately inside an Indexed node or at the root of the expression because Indexed is the only GEM operator that accepts non-scalar shape. Since the UFL → GEM stage produces a GEM expression with scalar shape and the argument indices as free indices, the application of (4.1) removes all ComponentTensors from the intermediate representation. The downside of this approach is the possibility of code duplication when different substitutions are applied on the same subexpression.
To demonstrate this, we continue with our Laplace operator example. To simplify our presentation in subsection 3.2, most ComponentTensor nodes were eliminated except for K defined in (3.3). Eliminating K in (3.7) gives As one can see, this step duplicates the element-wise division of the adjugate matrix by the determinant.

Code generation.
Since ComponentTensor nodes are eliminated, we can now restrict our attention to the index notation. A simple code generation approach for the evaluation of such tensor expressions is to introduce a temporary variable and generate a pair of perfect loop nests (Figure 6a)    (IndexSum). The first loop nest initializes the temporary, and the second accumulates the result. Consequently, the first loop nest contains all the free indices of the IndexSum expression, while the second loop nest has one more index, the index being summed over. For example, applying this approach to (4.2), we introduce temporaries J (1,1) , J (1,2) , J (2,1) , J (2,2) , t 1 , t 2 , t 3 and A. The evaluation code is presented in Algorithm 1, which also adds loop nests for "defining" ListTensor expressions, such as adjungate matrix B, l 1   A[j, k] += w q * abs(J (1,1) [q] * J (2,2) [q]−J (1,2) [q] * J (2,1) [q]) * t 3 [q, j, k] While this simple approach produces correct code, introducing temporaries merely for tensor contractions may cause too much inlining. In particular, shared subexpressions other than tensor contractions are still expanded, and there is also room for loop hoisting ( Figure 6). Therefore, we also introduce temporaries to evaluate subexpressions that a) have a reference count greater than one, or b) have a different set of free indices than their unique parent. In our example, the determinant of the Jacobian is an instance of a), and w q * abs(. . .) in the final loop nest is an instance of b). So we introduce temporaries d and t 4 respectively. The result of this is Algorithm 2.
Finally, we apply loop fusion to reduce the size of the temporaries (see Figure 7). Here we use a simple and fast algorithm. First, we construct an ordering of all free indices that appear in the GEM expression, and apply this ordering to each (perfect) loop nest. Based on our knowledge about the structure of finite element assembly kernels, and in common with FFC, we choose the quadrature index to be the first (outermost), followed by the argument indices in order, and then by all other indices. A possible ordering that we use for our example is (q, j, k, r, i 3 , i 4 , i 5 ). This means a number of loop interchanges (Figure 8) in Algorithm 2. Considering dependencies, we also take a topological ordering of the loop nests that heuristically tries to maximize loop fusion (subsection 4.4). For our example, this results in Algorithm 3. Finally, we apply maximal loop fusion without further reordering. The fused algorithm is Algorithm 4. Note that many of the temporaries are now scalar, or at least have a reduced tensor order.

15:
for all k do 16: for all i 5 do 18:

21:
t 4 = w q * abs(d) 22: for all j do 23: is prepended to the function body. The element tensor A and the local coordinate coefficients vector c are defined as kernel arguments. Lines 10-14 may be omitted if the caller clears A.

Topological orderings of loop nests.
The GEM DAG describes the dependencies of each node in that DAG. The nodes can be evaluated in any order which respects those dependencies. That is, if there is an edge from node a to node b then node b must be evaluated before node a. Any ordering which respects the constraints imposed by the DAG edges is called a topological ordering, and the process of generating such an ordering is called topological sorting.
Topological sorting is a well-understood problem [9, §22.4]. Every DAG admits a Listing 2 Local assembly kernel for ∇u · ∇v dx in C. } 60 } topological ordering, but this is not in general unique. For correct code generation as described in subsection 4.3, any topological ordering suffices. However, the choice of ordering determines which loops can be fused, and thus how much memory is saved. Reducing the working set size 1 can help to fit into faster memory, thus also reducing execution time.
We use a variant of Kahn's algorithm [16] [21, §2.2.3, Algorithm T] with heuristics to maximize loop fusion. Algorithm 5 presents Kahn's algorithm for topological sorting. Line 6 is a point of ambiguity, since S can have more than one element. Any choice will yield a correct, but different, topological ordering, so our heuristics are about choosing "wisely". return L Loop fusion occurs between two consecutive loop nests for all common outer loops. We maximize this at every choice in a greedy manner. The number of loops in the previously chosen loop nest, however, poses an upper limit on the number of loops that can be fused as a result of the current choice. For example, if the previous choice was a loop nest with indices (q, ), this rule so far expresses no preference between (q, ) and (q, r) for the next choice, because only the q loop can be fused in both cases. Therefore we further refine our heuristic to prefer those loop nests with exactly the same loops as the previously chosen loop nest. Delaying "entry" into further inner loops, is a heuristic designed in the hope that more loop nests with those indices will satisfy the dependency requirements once we have no other option but to do "enter" those further inner loops, thus enabling more loop fusion later.
Algorithm 6 shows our version of the Choose-Element function. We maintain the ordered loop indices of the previously chosen loop nest in cursor , which is initialized to () at the beginning of a topological sort. We can assume that S is non-empty. First, we try to choose an element with the same loop indices as the previous one (lines [3][4][5]. If that is not possible, we try to choose an element such that cursor is a prefix of its loop indices (lines 6-10), and we update cursor accordingly. If that is not possible either, we drop the last (innermost) index from cursor and repeat. Before this step cursor must have had at least one index, otherwise S 2 = S = ∅, so the function would have returned in line 10. Figure 9 shows the dependency graph of the loop nests in Algorithm 2, along with the solution our algorithm yields. The ordering precisely corresponds to what we have if S 1 is non-empty then 5: return any element from S 1 6: if S 2 is non-empty then remove last index from cursor seen in Algorithm 3.

Related work.
Since code generation for tensor algebra expressions is not inherently specific to the finite element method, the related work is broad in scope. Various tensor algebra libraries and domain-specific languages [7,11,36,38] have been developed for quantum chemistry computations. However, while the working set size of a finite element local assembly kernel rarely exceeds a megabyte, the tensors in quantum chemistry computations are typically huge, even reaching petabytes in size. Consequently, the optimized implementation of individual tensor operations on distributed-memory parallel computers is usually the major concern, as is the case with SIAL [36], libtensor [11], and CTF [38]. The Tensor Contraction Engine (TCE) [7] goes further, and also considers optimizations on the whole tensor algebra expression, which could be applicable to our case as well. Some of these ideas were also applied to do efficient tensor contractions on GPUs [32].
Algebraic transformations on tensor algebra expressions can yield equivalent formulations that require many fewer operations to compute. Lam et al. [26] first studied the problem of single-term optimization, also known as strength reduction: the operation-minimal evaluation of tensor contractions over the product of several tensors with various indices. They prove that the problem is NP-complete, but they also provide a pruning search strategy that is efficient in practice, since a single term typically consists of only a limited number of tensors. Kschischang et al. [22] show that such a single term can be characterized by a factor graph. If the factor graph is cycle-free then it straightforwardly yields the unique optimal factorization. Otherwise different factorizations can be produced by various transformations that break the cycles. Considering the sum of single terms, Hartono et al. [13] achieved further improvements over single-term optimization by considering various factorizations of the whole expression. Since the space of different factorizations is huge, they relied on various search techniques. Exploiting common subexpression elimination during the search resulted in further gains [12]. We expect to benefit from these results in the future, for example, when applying sum factorization to forms defined over tensor product elements. At the moment TSFC primarily relies on COFFEE to provide operation minimization (see subsection 5.5).
Since tensors in quantum chemistry computations are huge, optimizing the memory requirement is of great importance so that computations can fit in the main memory. While this is not an issue for finite element local assembly, fitting the working set into the cache can have an important performance impact. The evaluation of tensor expressions in a scalar programming language such as C requires the introduction of array temporaries which may be indexed. The order in which these temporaries are produced and consumed affects the working set size of the kernel. Lam et al. [23,25] give an algorithm for the evaluation order of nodes of an expression tree that minimizes the maximal total size of live temporaries. Their work can be understood as a generalization of the Sethi-Ullman algorithm [37] where nodes are not assumed to have the same size, instead each node can have an arbitrary size. Since tensor valued temporaries are both produced and consumed by loops, loop fusion can eliminate dimensions of these temporaries, thus reducing the required memory, as we have demonstrated in subsection 4.3. Lam et al. [24] formalize the legality of loop fusions over expression trees and their effect on the memory requirement. They give an algorithm for finding the memory-optimal set of loop fusions based on dynamic programming and pruning, with an exponential complexity in the number of index variables. Allam et al. [1] propose an alternative algorithm that reduces this problem to an integer linear programming formulation. Cociorva et al. [8] studied space-time trade-off optimization in this context, by allowing some recomputations for further loop fusion opportunities. In TSFC we have implemented a fast algorithm that may produce suboptimal results (subsections 4.3 and 4.4), rather than striving for optimality at the cost of high algorithmic complexity [24,1],

5.
Optimizations. This section discusses various optimizations in TSFC which are not essential for the understanding of the overall form compilation process, and whose proper effects are often only understood in composition with other optimizations at various points of the form compilation pipeline.
5.1. Constant folding for zero tensors. Constant folding means replacing computations whose all operands are known at compile time with the result of the computation. This is a standard technique in modern optimizing compilers. TSFC implements a limited version of it at the GEM level: only computations involving zero tensors are folded. This limited version can still yield significant benefits since the application of 0 · X = 0 can eliminate an arbitrarily complex expression X. Zero tensors can arise, for example, as tabulation matrices for higher derivatives than the degree of the finite element.

Cellwise constant quantities.
We saw in subsection 3.3 that the evaluation of arguments and coefficients involves a tabulation matrix M q,r that contains the value of each basis function at each quadrature point. When an argument or a coefficient, or more commonly a derivative of it, is cellwise constant, then evaluation need not happen for every quadrature point. So instead of the matrix M q,r we create a vector M r which does not have the quadrature index as a free index. This change in combination with loop hoisting (see subsection 4.3) will hoist the evaluation of all cellwise constant subexpressions out of the quadrature loop.

Fast Jacobian evaluation for affine cells.
The Jacobian of the coordinate transformation appears in virtually every form after preprocessing, even the simplest ones. In those cases an inefficient Jacobian evaluation may likely become the dominant cost of assembly. Consequently, FEniCS has, and Firedrake used to have, efficient hand-written code snippets for Jacobian evaluation on affine simplices. As described in subsection 3.3, we now implement Jacobian evaluation as coefficient evaluation. This allows us to handle arbitrary order geometries with a single code path, but it also implies that each entry of the Jacobian is evaluated as a matrix-vector product. For affine cells, the Jacobian is cellwise constant, so this reduces to a dot product of two vectors (subsection 5.2). The code snippets evaluate each Jacobian entry with a single subtraction.
Since both approaches are correct and thus yield identical results, it follows immediately that the tabulation vector must have precisely two nonzero entries which are 1 and −1. Therefore, we introduced another optimization rule which applies to all coefficient evaluation: if the quantity is cellwise constant and the tabulation vector has at most two nonzero entries, then we expand the dot product that evaluates the coefficient. This, of course, composes with constant folding (subsection 5.1), so the multiplications with zeros vanish, and we obtain a formulation that is performance equivalent to the code snippets.

5.4.
Unrolling short IndexSum loops. In most cases, unrolling short loops, which are typically geometric loops, improves performance. When this optimization is applied, the generated code for the Laplace operator becomes as in Algorithm 7. It also negates the code duplication around Jacobian inverses that was caused by the beta expansion step (subsection 4.2). This optimization is enabled by default in TSFC. However, as demonstrated, it is completely optional, and can be delayed to run after those optimizations that benefit from the tensor structure of the source expression. for all k do 19:

21:
t 5 = w q * abs(d) 22: for all j do 23: (2) q,j 24: 5.5. COFFEE integration. Although we did not emphasize this, TSFC does not generate C code directly, but an abstract syntax tree (AST) for C, so that COFFEE can further optimize the kernel. Since this is just a different representation of C code, the discussions in section 4 are still completely valid. COFFEE performs both lowlevel optimizations, such as alignment and padding to facilitate vectorization by the host C compiler, and high-level optimizations to reduce the operation count. The latter require restructuring the loops and expressions. In essence, COFFEE seeks to determine the optimal trade-off amongst common subexpressions elimination, loopinvariant code motion and factorization. For this, it analyzes a transformation space which also includes the FFC tensor representation and many of the factorizations performed by UFLACS. For further details about COFFEE and for comparative performance evaluations, see Luporini, Ham, and Kelly [30].
6. Experimental evaluation. We experimentally evaluate and compare form compilation times with the following form compilers: • FFC quadrature representation (supports affine cells only) • Modified version of FFC quadrature representation, developed for Firedrake to support non-affine cells.

• UFLACS
• TSFC We disabled optional optimizations, and also included in the form compilation times the time to compile the generated C or C++ code, using GCC 5.4.0 as C and C++ compiler at -O2 optimization level. FFC tensor representation is not included since its limitations are well known [34] (subsection 1.4), and it fails to compile our more difficult test problems due to missing features.
Our test problems consist of the left-hand sides of the following four elliptic PDEs of increasing complexity: Helmholtz equation.
Linear elasticity. In elastic models the solution is typically a displacement vector field. Let u and v be vector-valued trial and test functions, and let ε denote the symmetric part of the Jacobian of a vector field. Then the bilinear form is defined as follows: Hyperelasticity. A simple hyperelastic model. First, we define the strain energy function over the displacement vector field u: where λ and µ are the Lamé parameters, and I is the identity matrix. Now, we define the Piola-Kirchhoff stress tensors: Second Piola-Kirchhoff stress tensor (6.8)

P = FS
First Piola-Kirchhoff stress tensor (6.9) One could, of course, derive that S = λ tr(E)I + 2µE, but we leave the symbolic differentiation for UFL. Finally, the residual form of this nonlinear problem is where b is the external forcing. To assemble a left-hand side, one must linearize the residual around an approximate solution u: This bilinear form has trial function δu, test function v, and u is a coefficient of the form.
Unlike the simple hyperelastic model, this model uses a volume-preserving right Cauchy-Green strain tensor: J = det(F) (6.12)C = J − 2 3 F T F (6.13) Then defines the following rotation invariant functions: We define the first Piola-Kirchhoff stress tensor, and the residual form without external forcing: P = ∂Ψ ∂F Find first Piola-Kirchhoff tensor (6.19) r = P : ∇v (6.20) Finally, (6.11) gives the bilinear form again. Figure 10 presents violin plots of the form compilation times, each made from 18 samples. The measurements were run on an Intel R Xeon R E5-2620 (Sandy Bridge) 2.00 GHz CPU. Due to the nature of development with Firedrake or FEniCS, the form compiler is frequently invoked in an interactive manner. Consequently, form compilation shall preferably take no more time than the fraction of a second; waiting for more than a couple of seconds is undesirable in a natural workflow.
To make the measurements as fair as possible, we configured FFC representations to skip element compilation, and for the C++ compilation we only kept the relevant tabulate tensor method from the generated UFC class. However, since we did not want to change the generated code, we had to include the relevant header files. increasing complexity with the following form compilers: FFC quadrature representation, its modified version which supports non-affine cells (fd bendy), UFLACS, TSFC, and TSFC as an FFC representation (TSFC † ). 3D Holzapfel-Ogden data for fd bendy are missing because compilation ran out of memory after more than 6 hours on a machine with 64GB of memory.
<algorithm>, the compilation of which takes about 0.9 seconds on the test machine. Form compilers targeting Firedrake (fd bendy and TSFC) generate C code, and the inclusion of relevant header files induces much less overhead. This effect dominates the simplest test case, the Helmholtz equation. For further fairness, we also included TSFC as an FFC representation (shown as TSFC † ), including the same set of header files as for the other FFC representations.
All form compilers compile the Helmholtz equation quickly and perform tolerably on the elastic model. The modified version of FFC quadrature with non-affine support (fd bendy) demonstrates excessive compilation times on the simple hyperelastic model already. While non-affine support adds a huge toll to compile-time performance, even the base version of FFC quadrature becomes unmanageable as form complexity increases, as the data for Holzapfel-Ogden model demonstrate. UFLACS and TSFC, however, have tolerable compile-time performance in all cases, TSFC being somewhat faster, but typically within the same order of magnitude.
The overall compilation time of the most difficult case (Holzapfel-Ogden 3D) with TSFC can further be reduced from 3.7 to 2.2 seconds by disabling the optimization described in subsection 5.4. This configuration, however, did not yield significant benefits for the other test cases as long as C compilation times are also included.
For reproducibility, we cite archives of the exact software versions that were used to produce these results. The modified version of FFC with support for non-affine geometries requires custom branches of FFC [42] and UFL [49] which require older versions of COFFEE [39], FIAT [44], and Instant [47]. For all other form compilers and representations we used recent versions: COFFEE [40], dijitso [41], FFC [43], FIAT [45], FInAT [46], TSFC [48], and UFL [50]. The experimentation framework used to generate the violin plots is available in [15].

Conclusion.
The experimental data demonstrate that the efficient traversal of expressions keeps the compilation time with TSFC acceptably low even for highly complicated forms. This was a crucial requirement for properly supporting nonsimplicial cells as well as higher-order geometries in Firedrake.
A comprehensive analysis of run-time performance (of the generated code) is beyond the scope of this paper. For such an analysis, we refer readers to Luporini, Ham, and Kelly [30] which compares the representations of FFC, UFLACS, and an earlier and a recent version of COFFEE. TSFC is compatible with COFFEE's optimizations.