Efficient mesh management in Firedrake using PETSc-DMPlex

The use of composable abstractions allows the application of new and established algorithms to a wide range of problems while automatically inheriting the benefits of well-known performance optimisations. This work highlights the composition of the PETSc DMPlex domain topology abstraction with the Firedrake automated finite element system to create a PDE solving environment that combines expressiveness, flexibility and high performance. We describe how Firedrake utilises DMPlex to provide the indirection maps required for finite element assembly, while supporting various mesh input formats and runtime domain decomposition. In particular, we describe how DMPlex and its accompanying data structures allow the generic creation of user-defined discretisations, while utilising data layout optimisations that improve cache coherency and ensure overlapped communication during assembly computation.

1. Introduction.The separation of model description from implementation facilitates multi-layered software stacks consisting of highly specialised components that allow performance optimisation to happen at multiple levels, ranging from global data layout transformations to local kernel optimisations.A key challenge in designing such multi-layered systems is the choice of abstractions to employ, where a high degree of specialisation needs to be complemented with the generality required to facilitate the utilisation of third-party libraries and thus promote code reuse.The use of high-level domain-specific languages (DSL) and composable abstractions allows existing algorithms and optimisations to be inserted into this hierarchical framework, and applied to a much wider range of problems.
In this paper we describe the integration of the DMPlex mesh topology abstraction provided by the PETSc library [2] with Firedrake, a generalised system for the automation of the solution of partial differential equations using the Finite Element method (FEM) [20].We outline how DMPlex is utilised in Firedrake to provide the required mapping between topological entities and degrees of freedom (DoFs), while supporting various mesh input formats, run-time domain decomposition and mesh renumbering techniques.In particular, we describe how DMPlex and its accompanying data structures allow the generic creation of user-defined discretisations, while utilising data layout optimisations that optimise cache coherency and ensure computation-communication overlap during finite element assembly.

Background.
2.1.Firedrake.Firedrake is a novel tool for the automated solution of Finite Element problems defined in the Unified Form Language (UFL) [1], a domain-specific language (DSL) for the specification of partial differential equations in weak form pioneered by the FEniCS project [18].Firedrake imposes a clear separation of concerns between the definition of the problem, the local discretisation defining the computational kernel used to compute the solution and the parallel execution of this kernel over a given data set [20].These multiple layers of abstraction allow various types of optimisation to be applied during the solution process, ranging from high-level caching of mathematical forms to compiler-level optimisations that leverage threading and vectorisation intrinsics within the assembly kernels.
A key component to achieving performance in Firedrake is PyOP2, a high-level framework that optimises the parallel execution of numerical kernels over unstructured mesh data [21].PyOP2 represents mesh entities as sets and connectivity between them as mappings, where input data to the compiled kernel is either accessed directly or indirectly via a mapping.In parallel PyOP2 is able to overlap halo data communication with kernel computation during the execution loop due to a specialised data ordering within sets [19].
2.2.DMPlex.PETSc's ability to manage unstructured meshes is centred around DMPlex, a data management object that encapsulates the topology of unstructured grids and provides a wide range of common mesh management functionalities to application programmers [16].As such DMPlex provides a domain topology abstraction that decouples user applications from the implementation details of common meshrelated utility tasks, such as file I/O, domain decomposition methods and parallel load balancing [15], which increases extensibility and improves interoperability between scientific applications through librarization [5].
DMPlex uses an abstract representation of the unstructured meshes in memory, where the connectivity of topological entities is stored as a directed acyclic graph (DAG) [14,17].The DAG is constructed of clearly defined layers (strata) that enable access to mesh entities by their topological dimension or co-dimension, enabling application codes to be written without explicit reference to the topological dimension of the mesh.As illustrated in Fig. 1b, all points in the topology DAG share a single consecutive entity numbering, emphasizing that each point is treated equally no matter its shape or dimension, and allowing DMPlex to store the graph connectivity in a single array where dimensional layers are defined as consecutively numbered sub-ranges.The directional connectivity of the DAG is defined by the covering relationship cone(p), which denotes all points directly connected to p in the next codimension, as illustrated in Fig. 1c.The transitive closure of the cone operation is denoted as closure(p), and depicted in Fig. 1d.The dual operation, support(p), and it's transitive closure star(p) are shown in Fig. 1e and Fig. 1f respectively.
In addition to the abstract topology data, PETSc provides two utility objects to describe the parallel data layout: a Section object maps the graph-based topology information to discretised solution data through an offset mapping very similar to the Compressed Sparse Row (CSR) storage scheme, and the Star Forest [3] (SF) object holds a one-sided description of shared data in parallel.These data layout mappings allow DMPlex to manage distributed solution data by automating the preallocation of distributed vector and matrix data structures and performing halo data exchanges.Moreover, by storing grid topology alongside discretised solution data, DMPlex is able provide the mappings required for sophisticated preconditioning algorithms, such as geometric multigrid methods [6] and multi-block, or "Fieldsplit", preconditioning for multi-physics problems [4].

Mesh Reordering Techniques.
The run-time performance of geometrybased processing algorithms can be significantly affected by the data layout of unstructured meshes and sparse matrices due to caching effects.A number of mesh ordering techniques exist that aim to increase the cache coherency of local data, either through cache-aware or cache-oblivious reordering [22,11,12].Cache-oblivious techniques aim to reduce the bandwidth of the resulting sparse matrix and thus lower the number of cache misses incurred when traversing local data regardless of the underlying caching architecture.
The Reverse Cuthill-McKee (RCM) algorithm [7,9] represents a classic example of a cache-oblivious mesh reordering.RCM is based on a variant of a simplex breadth-first search of the mesh connectivity graph and yields a fixed-size n tuple that represents the new ordering permutation.Alternative methods, such as space filling curve numberings, may be used to create similar permutations from a given mesh topology graph in order to further increase cache coherency.
3. Computational Meshes in Firedrake.The Firedrake system comprises a stack of specialised components that implement a set of multi-layered abstractions to provide automated finite element computation from a high-level specification [20].The role of the top-level Firedrake layer is to marshall data between the various subcomponents and to provide the computation layers, PyOP2 and PETSc, with the maps and data objects required to assemble and solve linear and non-linear systems.The computational mesh is encapsulated in a Mesh object that can either be read from file or generated in memory for common geometry classes, such as squares, cubes or spheres.
A characteristic feature of the Firedrake execution stack is that multiple discretisations of the same computational domain, represented by the FunctionSpace class, may be derived dynamically at any point during execution, which requires the topological connectivity of the mesh to be stored in a separate object.Separating mesh topology from the discretisation of the problem not only enables Firedrake to exploit caching and data re-use with minimal replication at multiple levels in the tool chain but also allows data layout optimisations to be inherited for all derived discretisations without re-computation of the mesh reordering scheme.
As shown in Fig. 2 the Firedrake classes Mesh and FunctionSpace, which encapsulate mesh topology and problem discretisation respectively, map naturally onto the abstractions provided by PETSc's data management API.The Mesh class encapsulates the topological connectivity of the grid by storing a DMPlex object alongside a Firedrake-specific application ordering, while discretisation data given by the FunctionSpace class defines the layout of local data stored in the Function object.3.1.Mesh topology.Firedrake uses the DMPlex data management abstraction as an internal representation of mesh topology, allowing it to delegate file I/O and run-time mesh generation to PETSc.In doing so Firedrake only depends on the public API provided by PETSc and automatically inherits the mesh management and manipulation capabilities provided by DMPlex.As a result Firedrake naturally supports the same set of mesh file formats as DMPlex, which at the time of writing includes ExodusII, Gmsh, CGNS, MED and Fluent Case files, and thus increases interoperability with other applications and provides extensibility through a wellsupported public library.

Mesh
In addition to various mesh format readers DMPlex also provides parallel domain decomposition routines that interface with external libraries, such as Chaco and Metis/ParMetis, to facilitate parallel partitioning of the topology graph.Utilising PETSc's internal communication routines DMPlex is thus capable of automatically distributing the mesh across any number of processes, which allows Firedrake to fully automate the parallelisation and optimisation of the user-defined Finite Element problem.
Another advantage of using the DMPlex DAG as an intermediate representation of mesh topology is that the abstracted graph format allows Firedrake to dictate the ordering of the mesh topology and thus control local data layout of derived discretisations.This is made possible by attaching a point permutation to the DMPlex object, which defines a single level of indirection that is applied to all graph traversal operations within DMPlex.As a results, all discretisation objects derived from the stored topology inherit this permutation, giving Firedrake an effective way to control the global ordering of derived solution data.[1]) implemented by Firedrake allows the use of various discretisation schemes to represent solution data, where the number of DoFs associated with each mesh entity is determined by the local discretisation within a reference element.The FIAT package [13] of the FEniCS software stack provides this reference element from which Firedrake needs to derive the indirection maps between mesh cells and DoFs required by PyOP2 to perform matrix and vector assembly.

Discretisation. The FEniCS language (UFL
The mapping from mesh topology to solution data is facilitated by PETSc through PetscSection, a class of descriptor objects that store a CSR-style mapping between points in the topology DAG and entries in array or vector objects.Assuming a constant element type throughout the mesh, DMPlex can generate a section object, given the number of DoFs associated with each mesh entity type as provided by the FIAT reference element.The set of DoFs associated with a cell can then be derived by taking the closure of the cell point (see Fig. 1d) and collecting the DoFs associated with them by the provided section.
The use of DMPlex closures to determine entity-to-DoF mappings is sufficient on its own should the local numbering of mesh entities within a cell closure match that required by the application.In Firedrake the local numbering on simplices must match the simplex numbering used in FEniCS [18], where the local facet number is determined by the local number of the opposite vertex.The algorithm shown in Alg. 1 is thus applied to each cell closure in turn to enforce the desired local numbering for simplices.

Halo communication. The exchange of halo data between processors in
Firedrake is performed by PETSc's Star Forest [3] (SF) communication abstraction that encapsulates one-sided description of shared data.SF objects implement a range for f acet in f acets do 8: closure f acet ← DMPlexGetClosure(plex, f acet) for f in closure f acet do Filter vertices from facet closure 10: for v in vertices do Find non-adjacent vertices Sort(f acets, keys) Sort facets by non-adjacent vertices of sparse communication patterns that are able to perform common data communication patterns, such broadcasts and reduction operations, over sparse data arrays according to the stored mapping.The halo data exchange pattern is derived by DM-Plex from an internal SF encapsulating the overlap in the topology graph and a given discretisation provided in the form of a section object.The derived SF encapsulates a local-to-local remote data mapping that avoids the need to convert halo data into a global numbering.
4. Application orderings.When deriving function spaces from a DMPlex topology definition, the global data layout is inherited from the original graph ordering generated by PETSc.PyOP2, however, imposes a data layout restriction that allows it to optimise performance by overlapping computation with communication, which is not honoured in the global entity numbering generated by DMPlex.Firedrake therefore generates an application ordering in the form of a permutation of the DAG points that is passed to a distributed DMPlex object to generate indirection maps that adhere to its required ordering.

PyOP2 data ordering.
To ensure that halo exchange communication can be overlapped with assembly kernel computation PyOP2 sets require a strict entity ordering, where non-owned data is stored contiguously at the end of the data array.Moreover, as shown in Fig. 3, all owned data items adjacent to non-owned items require that the halo data exchange be finished before computation is performed.Thus, owned data is further partitioned into core (independent of halo) and noncore (halo-dependent) data, allowing processing over core data items to proceed while communication is still in-flight.
Firedrake honours the PyOP2 entity ordering by assigning all points in the DM-Plex topology DAG to one of the PyOP2 entity classes using a DMLabel data structure, which encapsulates integer value assignments to points.When deriving the indirection maps for each discretisation, mesh entities can then be filtered into the appropriate sets regardless of entity type.The algorithm used to mark PyOP2 entity classes is shown in Alg. 2, where the initial overlap definition, provided by DMPlex in form of an SF, is used to first mark the halo region, followed by the derivation of adjacent non-core points.LabelSetValue(halo, p) 3: for p in LabelGetStratum(halo) do Loop over halo cells

4:
if p in HeightStratum(plex, 0) then 5: adjacency ← DMPlexGetAdjacency(plex, p) for c in adjacency do Find cells adjacent to halo 7: if LabelHasPoint(halo, c) and c in HeightStratum(plex, 0) then 8: LabelSetValue(noncore, p) Mark adjacent cell as non-core 9: for p in mesh do Mark remaining points as core 10: if not LabelHasPoint(halo, p) and not LabelHasPoint(noncore, p) then 11: LabelSetValue(core, p) 4.2.Compact RCM ordering.The generic encapsulation of mesh topology allows DMPlex to compute the point permutation according to the well-known RCM mesh reordering algorithm (see section 2.3).Since Firedrake already controls the effective ordering of mesh entities to adhere to PyOP2 ordering restrictions, the RCM permutation provided by DMPlex can be applied to the Firedrake-specific point permutation.However, any additional indirection applied to the reordering permutation computed by Firedrake needs to be contained within the marked PyOP2 class regions.Thus, although the base RCM permutation generated by DMPlex includes all graph points, Firedrake implements a cell-wise compact reordering, where the cell ordering is filtered from the RCM permutation within each marked PyOP2 region.As shown in Alg. 3, the full permutation is then derived by adding cell closures along the segmented cell order, ensuring the relative compactness of DoFs associated with the same cell.
It is worth noting that this cell-wise compact reordering approach allows any additional level of indirection to be applied without violating the PyOP2 ordering Algorithm 3 Algorithm for generating a compact RCM permutation that honours PyOP2's entity class separation and encapsulates a cell-wise RCM reordering within the PyOP2 regions.permutation{idx} ← p closure constraint and is therefore not limited to RCM.Examples of sparse matrix structures generated using the RCM-based reordering are shown in Fig. 4.
5. Performance benchmarks.The benefits of Firedrake's compact RCM mesh reordering have been evaluated using two sets of performance benchmarks: a run-time comparison of assembly loops over cells and interior facets with light-weight kernels, as well as solving a full advection-diffusion problem.The benchmark experiments were carried out on the UK national supercomputer ARCHER, a Cray XE30 with 4920 nodes connected via an Aries interconnect and a parallel Lustre filesystem 1 .Each node consists of two 2.7 GHz, 12-core Intel E5-2697 v2 (Ivy Bridge) processors with 64GB of memory.
An indication of the indirection cost and subsequent data traversal performance in low-level loops was gained by comparing the individually measured execution time of two PyOP2 assembly loops.The benchmark loops were generated by invoking assemble(L) 100 times for the UFL expressions L = u*dx and L = u('+')*dS for cell and interior facet integrals respectively, where u is a suitable Function object.The performance of a full-scale finite element problem was then analysed, which consisted of assembling and solving the advection-diffusion equation ∂c ∂t + • ( uc) = • (κ c) using a Conjugate Gradient method with a Jacobi preconditioner for advection and the HYPRE BoomerAMG algebraic multigrid preconditioner [8] for the diffusion component.The mesh used in both experiments represents a two-dimensional Lshaped domain, consisting of 3,105,620 cells and 1,552,808 vertices, and was generated with Gmsh [10].
The performance of the assembly loops over cells and interior facets using P 1 and P 3 function spaces on up to 96 cores is shown in Fig. 5.The performance of the cell integral loop shows significant improvements in both cases, whereas the facet integral loop shows a small performance decrease.This highlights that the compact RCM reordering optimises cell integral computation due to the generated cell-wise compact traversal pattern.It is also worth noting that the improvement due to RCM diminishes as we approach the strong scaling limit, although an increase in computational intensity between P 1 and P 3 assembly kernels negates this effect.
A performance profile of the full advection-diffusion model is given in Fig. 6.Ma-trix and RHS assembly times indicate clear performance improvements under compact RCM with significant speedups for P 1 on small numbers of cores (see Fig. 6a).As shown in Fig. 6b, P 3 assembly kernels with a higher computational intensity also show significant performance improvements, where matrix assembly in particular benefits from the reordering in a sustained way up to 96 cores.Similarly, advection and diffusion solver times shown in Fig. 6c and Fig. 6d indicate a clear speedup on small numbers of cores, while significant improvements are also evident on up to 96 cores for solves with larger numbers of DoFs in P 3 .
6. Discussion.In this paper we give a full account of the utilisation of PETSc's DMPlex topology abstraction in Firedrake to derive the topological mapping required to solve a wide range of finite element problems.We highlight how the right composition of abstractions can be used to apply well known data layout optimisations, such as RCM renumbering, to an entire class of problems and demonstrate the resulting gains in assembly and solver performance.Our work emphasises the importance of high-level DSLs and further underlines their potential for achieving performance portability through run-time optimisation.
An important corollary of the close integration of DMPlex into the Firedrake framework is the improved interoperability and extensibility of the mesh management component.Future efforts to improve file I/O and add new meshing capabilities, such as mesh adaptivity, can now be integrated through PETSc DMPlex interfaces.This ensures that computational models built using the Firedrake framework can easily be extended without breaking existing abstractions and thus enables domain scientists to leverage automated performance optimisations as well as a wide range of simulation features.

Fig. 1 :
Fig. 1: Example entity numbering for a single tetrahedron and the corresponding internal DAG.Entities are numbered accross stratified layers (dimensions) with a consecutive numbering in each stratum.

Fig. 2 :
Fig.2: Mapping of data abstractions between Firedrake and PETSc: Firedrake's Mesh object encapsulates domain topology stored in a DMPlex object alongside an application numbering permutation.The choice of FunctionSpace defines the local data discretisation via a PetscSection that is used to generate the indirection maps required by PyOP2 for assembly computation.Halo communication is performed by a PetscSF object, which encapsulates the mapping between local and remote data items in the local Vec.

Algorithm 1 2 : 3 : 4 : 5 : 6 :
Local numbering algorithm for simplex elements 1: for cell in mesh do Loop over all cells in the mesh closure cell ← DMPlexGetClosure(plex, cell) for p in closure cell do Filter facets and vertices from cell closure if p in DepthStratum(plex, 0) then vertices ← p if p in HeightStratum(plex, 1) then f acets ← p Sort(vertices) Sort vertices by global number 7:

Fig. 3 :Algorithm 2
Fig.3: PyOP2 entity classes on a distributed 4 × 4 unit square mesh.The dark region marks core entities, medium grey marks non-core entities and light grey marks the halo region.

1 :
for p in pointSF do Define halo region from SF 2:

7 .
Acknowledgements.This work was supported by the embedded CSE programme of the ARCHER UK National Supercomputing Service (http://www.archer.ac.uk) and the Intel Parallel Computing Center program through grants to both the University of Chicago and Imperial College London.ML and GJG also acknowledge support from EPSRC grants EP/M019721/1 and EP/L000407/1, LM acknowledges NERC grant NE/K006789/1 and MGK acknowledges partial support from DOE Contract DE-AC02-06CH11357 and NSF Grant OCI-1147680.

Fig. 4 :
Fig. 4: Effects of the combined RCM and OP2 mesh ordering on matrix structure for a P 1 and a P 3 function space on a 5 × 5 unit square.