An adaptive fast Gauss transform in two dimensions

A variety of problems in computational physics and engineering require the convolution of the heat kernel (a Gaussian) with either discrete sources, densities supported on boundaries, or continuous volume distributions. We present a unified fast Gauss transform for this purpose in two dimensions, making use of an adaptive quad-tree discretization on a unit square which is assumed to contain all sources. Our implementation permits either free-space or periodic boundary conditions to be imposed, and is efficient for any choice of variance in the Gaussian.


Introduction.
A variety of problems in applied physics and engineering involve the solution of the heat equation for t > 0, in a interior or exterior domain Ω, subject to suitable conditions on its boundary Γ = ∂Ω. For simplicity, we will assume that these take the form of either Neumann conditions: Dirichlet conditions: u(x, t) = h(x, t) for x ∈ Γ, (1.3) or periodic boundary conditions with Ω the unit square.
In the absence of physical boundaries, the equations (1.1) are well-posed in free space (under mild conditions on the behavior of u, f and F at infinity) without auxiliary conditions. Moreover, assuming F (x, t) and f (x) are compactly supported in the region Ω, the solution to (1.1) can be expressed at the next time step, t = ∆t in closed form as For the Neumann problem (1.1), (1.2), the classical representation [16,23]  is a double layer (heat) potential. Here, ∂ ∂ny denotes the derivative in the outward normal direction at the boundary point y. The only unknowns in the representations (1.7), (1.9) are the scalar densities σ and µ supported on Γ. These are obtained by solving integral equations to enforce the desired boundary conditions [16,23]. Once σ or µ is known, (1.7), (1.9) can be used to evaluate the solution at time t = ∆t. This yields a one-step marching method for the heat equation that is unconditionally stable (see, for example, [1,5,7,8,12,17,21,29,30]).
For our present purposes, we assume that σ and µ are given. We assume also that a suitable M -stage quadrature has been applied to V [F ](x, ∆t), S[σ](x, ∆t) and D[σ](x, ∆t) with respect to the time variable, yielding: where w V,j , w S,j , w D,j are known quadrature weights. Thus, the computational burden of time-marching (that is, evaluating the various heat potentials) is dominated by the volume integrals for various values of δ and given functionsf ,σ,μ. Evaluating these integrals accurately and efficiently is the focus of the present paper. Definition 1.1. The integrals (1.12) and (1.13) will be referred to as volume and boundary Gauss transforms, respectively. Definition 1.2. By the discrete Gauss transform (DGT), we mean the evaluation of the Gaussian "potential" at M points {x i } due to N sources located at {y j } of strength {q j }: q j · e − |x i −y j | 2 δ for i = 1, · · · , M. (1.14) A variety of algorithms have been developed for the rapid evaluation of sums of the form (1.14), such as the fast Gauss transform (FGT) [14] (see also [15,24,25,28]). While the naive DGT requires O(M N ) work, the FGT permits the evaluation of the values {F (x i )} using only O(M + N ) work, independent of δ. High-dimensional versions of the FGT are of interest in statistical and machine learning applications (see, for example, [10]), but we are concerned here with physical modeling, where the ambient dimension is generally less than or equal to three.
Here, we seek to develop a robust version of the FGT that is fully adaptive, insensitive to δ, and able to compute transforms with discrete sources, volume sources and densities supported on boundaries ( Fig. 1.1). Some notable prior work on continuous (volume) fast transforms includes [27], which describes a triangulation-based adaptive refinement method and [30] which makes use of a high-order, adaptive quad-tree based discretization. In both cases, a "single-level" FGT was superimposed in order to achieve linear scaling. Our approach also relies on an adaptive quad-tree with high-order Chebyshev grids on leaf nodes, but we carry out a modified version of the FGT on the quad-tree itself. This requires a somewhat more complicated implementation, following that of the hierarchical fast multipole method (FMM) [13] or, more precisely, its level-restricted variants developed to compute elliptic volume potentials [2,11,20,22]. Our hierarchical FGT permits the inclusion of boundary Gauss transforms (and discrete sources) at the same time. We should note that adaptive FGT variants using an FMM data structure have been constructed previously, such as in [19], but for the discrete setting only -where small values of δ pose no additional quadrature challenges. We also introduce new error estimates that are relevant for the hierarchical processing. The paper is organized as follows. In sections 2 and 3, we review our adaptive discretization strategy and the analytic machinery on which the fast Gauss transform is based. In section 4, we describe the new FGT itself, focusing primarily on the volume integral case (1.12), with a brief discussion of the modifications needed for (1.14) and (1.13). We also discuss the incorporation of periodic boundary conditions. Section 5 illustrates the performance of the algorithm with several numerical examples and section 6 contains some concluding remarks.
2. Data structure. In the classical FGT [14], aimed at the computation of (1.14), where the sources are discrete, a uniform grid is superimposed on the computational domain, with a box size of dimension (r √ δ) d , where r ≈ 1 ( Fig. 2.1, left). Because of the exponential decay of the Gaussian, it is easy to see that only a finite range of nearby boxes need to be considered to achieve any desired precision. That is, the effect due to sources in B at targets that are at least m boxes away is of the order O(e −m 2 r 2 ). Since the field due to sources in any box B is efficiently represented by a suitable Hermite expansion (see section 3), it is straightforward to develop an algorithm of complexity O(N + M ), where N is the number of discrete sources and M is the number of targets. The FGT is easily modified to allow for adaptivity. One simply needs to sort the source and target points on the uniform grid while ignoring empty boxes and keeping track of the relevant neighbors for each box. The total storage is then of the order O(N + M ) as well. This can be accomplished, for example, with an adaptive quad-tree that is refined uniformly to a level where the box size is approximately (r √ δ) d , pruning empty boxes on the way. In the original FGT, the interaction region (shaded blue-gray) consists of boxes which are close enough to B that the Gaussian field induced by the sources in B is significant. (Outside the shaded region, the field is exponentially small and can be ignored for any fixed precision.) In the quad-tree, multiple types of interactions must be accounted for, described in detail in section 4. Such a strategy fails for volume integrals of the form (1.12), since there are no empty boxes. Instead, we will assume that the right-hand side (the function f in (1.12)) is specified on a level-restricted quad-tree. These data structures have been shown to be extremely effective for elliptic volume integrals [2,6,11,18,20,22]. For the sake of simplicity, we assume that the source distribution f in (1.12) is supported in the unit box D, centered at the origin. Following the discussion of [11], we assume that superimposed on D is a hierarchy of refinements (a quad-tree). Grid level 0 is defined to be D itself, with grid level l + 1 obtained recursively by subdividing each box at level l into four equal parts. If B is a fixed box at level l, the four boxes at level l + 1 obtained by its subdivision will be referred to as its children. In a level-restricted, adaptive tree, we do not assume the same number of levels is used in all subdomains of D. We do, however, require that two leaf nodes which share a boundary point must be no more than one refinement level apart ( Fig. 2.1, right).
On each leaf node B, we assume that we are given f on a k × k tensor product grid. We may then construct a kth-order polynomial approximation to f on B of the form is the number of basis functions needed for kth order accuracy, and the basis functions b j are assumed to be scaled to the relevant box size and centered on the box center. The coefficient vector is defined to be c B = (c B (1), . . . , c B (N k )).
For fourth or sixth order accuracy, one can use as basis functions If we let f B ∈ R k 2 denote the given function values (in standard ordering), then the coefficient vector c B can be computed as the solution of a least squares problem (interpolating the desired data f B at the corresponding points). The solution operator for this least squares task is denoted by P ∈ R N k ×k 2 , so that P can be precomputed and stored (say, using QR factorization). For eighth (or higher) order accuracy, polynomial interpolation is stabilized by assuming that f is given on a k × k tensor product Chebyshev grid and using as basis functions where T l (x) denotes the (suitably scaled) Chebyshev polynomial of degree l. The coefficients of the tensor product Chebyshev expansion can be computed efficiently using the fast cosine transform [4].
In order to develop a fast algorithm for the various kinds of source distributions shown in Fig. 1.1, we will make use of efficient far field and local representations of the induced field.
3. Analytical apparatus. Following the discussion in [14], we define the Hermite functions h n (x) by where D = d/dx. They satisfy the relation where y 0 ∈ R and δ > 0. This formula can be interpreted as an Hermite expansion centered at y 0 for the Gaussian field e −(x−y) 2 /δ at the target x due to the source at y. Interchanging x and y one can also write: This expresses the Gaussian as a Taylor series in the target location x about a center x 0 . It will be convenient to use multi-index notation. In two dimensions, a multi-index is a pair of nonnegative integers α = (α 1 , α 2 ) with which, for any x = (x 1 , x 2 ) ∈ R 2 , we define: If p is an integer, we say α ≥ p if α 1 , α 2 ≥ p. Multi-dimensional Hermite functions are defined by and the analogs of (3.1) and (3.2) are (3.4)

Hermite expansions and translation operators.
We turn now to the analytical apparatus needed in the FGT algorithm. The first lemma describes how to transform the field due to a volume source distribution and a collection of discrete Gaussians into an Hermite expansion about the center of box B in which they are supported.
Lemma 3.1. Let B be a box with center s B and side length r √ δ and let the Gaussian field φ(x) be defined by where the y j lie in B. Then, The error in truncating the Hermite expansion with p 2 terms is given by (3.10) and K < 1.09.
Proof. The error estimate relies on Cramer's inequality, which takes the form in two dimensions, where K < 1.09, and the fact that for y in B. The desired result follows from integration over the domain B and summation over the discrete sources.
Note that the Hermite expansion converges extremely rapidly for r < 1. For larger r, they still converge but require larger values of p. (See [3,19,26,31] for further discussion of error estimates.) In the original FGT, setting r ≈ 1 is a sensible choice, since a modest value of p is sufficient and the number of boxes within the interaction region (where the Gaussian field is not vanishingly small) is modest as well. The interaction region for a box B is the shaded area on the left in Fig. 2.1. A second thing to note is that the estimate is uniform with respect to the target. In the original FGT, this is necessary since the Hermite expansion is evaluated at all relevant locations. In the hierarchical FGT, however, the Gaussian field due to an Hermite expansion is evaluated only for boxes that are "well-separated" (the boxes labeled by i on the right-hand side of Fig. 2.1). Moreover, we will compute such interactions at every level of the quad-tree, so that the boxes could be of arbitrary size. Fortunately, once boxes are separated by a distance R √ δ, their interactions can be ignored with an error of the order O(e −R 2 ), limiting the size of the expansions (see Fig. 3.1).
In the full algorithm, a more refined estimate that makes use of the separation criterion will be useful. We have the following lemma.
Lemma 3.2. Let B be a box with center s B and side length r √ δ, and let C be a box with center t C and side length r √ δ with x ∈ C. Assuming the distance between B and C is at least r √ δ, the Gaussian field defined by (3.5) and its Hermite expansion (3.6) satisfy the error bound: where Q B is given by (3.9), and S r (p), T r (p) are given by (3.10).
Proof. The proof follows the same outline as that of Lemma 3.1. In this case, after applying Cramer's inequality (3.11), we make the additional observation that | x−s B √ δ | ≥ 3 2 r, which contributes to the exponential decay in r.
The next lemma describes the conversion of an Hermite expansion about s B into a Taylor expansion about t C .
denote the Hermite expansion of a Gaussian field induced by a source distribution in a box B with center s B and side length r √ δ. Then φ(x) has the following Taylor expansion about the center t C of box C with side length r √ δ: The coefficients are given by Assuming that the distance between boxes B and C is at least r √ δ, the error E T (p) in truncating the Taylor series after p 2 terms satisfies where Q B is given by (3.9), and S r (p), T r (p) are given by (3.10).
Proof. This result follows, again, from the standard error estimate in [3,26,31], with one modification; the exponential term in Cramer's inequality can be bounded by e − 9 8 r 2 , instead of 1.
In practice, we need a variant of Lemma 3.3, in which the Hermite expansion is truncated before being converted to a Taylor expansion.
denote a truncated Hermite expansion corresponding to the Gaussian field induced by a source distribution in a box B with center s B and side length r √ δ. The induced Taylor series in a box C with center t C and side length r √ δ is given by with coefficients Assuming that the distance between boxes B and C is at least r √ δ, the error E HT (p) in truncating the Taylor series after p 2 terms satisfies the bound where Q B is given by (3.9), and S r (p), T r (p) are given by (3.10).
Proof. The result is a straightforward application of the triangle inequality and Lemma 3.3. Note that the total error in using both an Hermite and a local expansion consists of two contributions: the first comes from truncating the Hermite expansion, given by (3.8), while the second comes from truncating the local expansion, according to (3.21).
For the hierarchical FGT, we will also need to propagate Hermite and Taylor expansions between levels of the quad-tree. The following two lemmas provide the needed analytical tools. Lemma 3.5 describes a formula for shifting the center of an Hermite expansion, and Lemma 3.6 describes one for shifting the center of a Taylor expansion. The derivation is straightforward [19].
Lemma 3.5. Let a Gaussian field be given by the Hermite expansion about a center s B and let s C denoted a shifted expansion center. Then, where the coefficients are given by: Lemma 3.6. Let t B ∈ R 2 and let {C α } denote the expansion coefficients for a truncated Taylor series with p 2 terms. Letting t C ∈ R 2 be a shifted expansion center, we have (3.26)

Local interactions.
In the previous section, we summarized the analytical machinery needed for the fast evaluation of Gaussian fields with well separated sources and targets. Before providing a formal description of the full algorithm in the next section, it remains to consider the computation of local interactions between neighboring boxes at the level of leaf nodes. For point sources, this is done by direct evaluation. We concentrate in this section on domain integrals, and defer a discussion of densities supported on boundaries to section 4.2.
B n c n f n Fig. 3.2: For a leaf node B, there are three types of possible local interactions: the interaction with a colleague (a neighbor at the same refinement level, including the self-interaction), and the interaction with fine and coarse neighbors -one level finer or one level coarser, respectively. The grid shown corresponds to an 8th order accurate tensor product Chebyshev discretization.
Thus, suppose B is a leaf node -that is a box at level l of the tree hierarchy on which a k × k tensor product grid of function values has been specified. Let r l denote the side length of B, so that its area is r l × r l . Consider now a target point t, which lies in either B, a neighboring box of B at the same refinement level, or a coarse or fine neighbor for B, which can be at most one refinement level apart. Because of the translation invariance of the kernel, a simple counting argument shows that there are at most k × k × 9 possible targets at the same level and at most k × k × 12 possible targets in neighbors at either a coarser or finer level. Recalling that the source distribution f on B is given by (2.1), the Gaussian field induced at t by f B can be approximated by where (t 1 , t 2 ), (y 1 , y 2 ) denote the coordinates of the target t and source with respect to the center of box B. Thus, for colleagues (boxes at the same refinement level), there are at most 3k possible relative target locations (t i − y i ) and at most k basis functions p ni (y i ), which are either monomials or scaled Chebyshev polynomials. These 3k 2 numbers can be computed in milliseconds on a single core. For coarse or fine colleagues, it is straightfoward to check that there are at most 4k possible relative target locations (t i − y i ), so that these tables involving 4k 2 numbers can be generated in milliseconds as well. Finally, we note that such tables must be generated for each refinement level that contains a leaf node.
4. FGT algorithm. We now describe an adaptive version of the Fast Gauss Transform, closely following the discussion in [11]. Since the Gaussian kernel e − x−y 2 /δ is rapidly decaying, we will ignore interactions beyond a distance where they can be considered negligible, according to a user-defined precision . That is, we define a cut-off parameter r c so that e −|x−y| 2 /δ ≤ , when x − y ≥ r c √ δ. Clearly, if a source box has side length greater than or equal to r c √ δ, its contribution to well-separated boxes is negligible. We will also make use of the following definitions: Definition 4.1. (Cutoff level): Given a quad-tree with levels l = 0, 1, · · · , L, the cutoff level is defined to be the coarsest level of the tree at which the box size is smaller than or equal to r c √ δ. We denote this by l cut . If the box size is greater than r c √ δ even at the finest level (level L), we let l cut = L + 1.  Fig.2.1), and is denoted by I(B). Boxes at coarser levels will be referred to as the coarse interaction list, denoted by I c (B) (marked i c in Fig.2.1).
Definition 4.4. (Expansions): We denote by B l,k the kth box at refinement level l and by Φ l,k the Hermite expansion describing the far field due to the source distribution supported inside B l,k . We denote by Ψ l,k the local expansion describing the field due to the source distribution outside the neighbors of B l,k and byΨ l,k the local expansion describing the field due to the source distribution outside the neighbors of the parent of B l,k . When the context is clear, we will sometimes use the notation Φ(B), Ψ(B),Ψ(B) to describe the expansions associated with a box B.
Remark 1. Let B = B l,k be a box in the quad-tree hierarchy with children C 1 , C 2 , C 3 , C 4 . Then, according to Lemma 3.5, there is a linear operator T HH for which (4.1) The operator T HH is responsible for merging the expansions of four children into a single expansion for the parent. Likewise, according to Lemma 3.6 there is a linear operator T LL for which T LL is responsible for shifting the incoming data (the local expansion) from a parent box to its children. Finally, according to Lemma 3.4, for any source box B l ,k in the interaction list I(B) of box B l,k , there is a linear operator T HL for which the induced field in B l,k is given by Ψ = T HL Φ l ,k . Clearly, Since our adaptive algorithm is operating on a level-restricted adaptive tree, the leaf nodes need to handle far field interactions between boxes at different levels. More precisely, viewing each such leaf node B as a "target box", we need to incorporate the influence of the s-list and the coarse interaction list (see Fig. 2.1). For every box in the s-list, its Hermite expansion is rapidly convergent in B and its influence can be computed by direct evaluation of the series. We also need to compute the dual interaction -namely the influence of a leaf node B on a box B in the s-list. Rather than evaluate the Hermite expansion of B at all targets in B , or shifting to a local expansion in B , we can directly expand the influence of the polynomial source distribution in B, given by the coefficients c B , as a local expansion in B . Thus, incorporating all far field interactions into (4.3), we have: The operator T direct , which maps the coefficients of a polynomial approximation of the density in B (a coarse interaction list box) onto the p 2 coefficients of the local expansion in B can be precomputed and stored for each level in the quad-tree hierarchy. Inspection of Fig. 2.1, the translation invariance of the kernel, and a simple counting argument show that this requires O(kpL) work and storage, where k is the order of polynomial approximation, p is the order of the local expansion, and L is the number of levels.
More precisely, let b n (y 1 , y 2 ) = p n1 (y 1 ) p n2 (y 2 ) be a basis function for the polynomial approximation in box B and let α = (α 1 , α 2 ) denote the multi-index of a term in the induced local expansion in B. Then (s 1 , s 2 ) denotes the center of B, and D l−1 denotes the side length of box B at level l − 1.

Pseudocode for the FGT.
We assume we are given a square domain B 0,0 , on which is superimposed an adaptive hierarchical quad-tree with l max refinement levels. We let l cut denote the cutoff level. If l cut ≤ l max , for each level l that satisfies the condition l cut ≤ l ≤ l max , determine the number of terms needed in the Hermite expansions N h (l) and the number of terms needed in the local expansion N t (l) according to the box size, the parameter δ that defines the variance of the Gaussian, the user-defined precision , and the estimates (3.13), (3.17).
We denote the leaf nodes by B i , i = 1, . . . , M , where M is the total number of leaf nodes across all levels. We assume that the source distribution on each B i is given by a collection of point sources, as well as a smooth function f , sampled on a k × k grid. The number of grid points is denoted by N = M k 2 and the number of discrete sources is denoted by N s . We assume the output is desired at the N grid points as well as the N s source locations.
Step I: Upward pass for l = l max , . . . , l cut for every box j on level l if j is childless then • Form Hermite expansion Φ l,j using (3.7) else • Form Hermite expansion Φ l,j by merging the expansions of its children using T HH (see Lemma 3.5) endif end end Step II: Downward pass for every box j on level l cut • Set Ψ lcut,j = 0 end for l = l cut + 1, . . . , l max for every box j on level l: • ComputeΨ l,j from its parent's Ψ expansion using the operator T LL for every box m in j's interaction list: • Increment Ψ l,j by adding in the contributions from all boxes in j's interaction list, using (4.4). if j is childless then for every box m in j's s-list: • Evaluate the Hermite expansion Φ(m) at each target in box j. end for every box m in j's s-list: • Increment the local expansion Ψ(m) from the smooth and point source distribution in j, using the precomputed operators (4.5) for the smooth source distribution and (3.4) for the point sources end • Evaluate the local expansion Ψ l,j at each target in box j (whether the target is a grid point or a point source location) endif end end Step III: Local interactions for l = 0, . . . , l max for every leaf node B j on level l: • At each tensor product grid point in B j , compute influence of the smooth source in colleagues, fine neighbors and coarse neighbors using precomputed tables of coefficients (3.28) • For each point source location in B j , use Chebyshev interpolation to obtain the Gaussian field due to smooth sources in colleagues, fine and coarse neighbors • For all targets in B j , use direct computation to evaluate the Gaussian field due to point sources in colleagues, fine and coarse neighbors end end The cost of the adaptive FGT is easily estimated. Creating the tree and sorting sources into leaf nodes requires at most O((N + N s )l max ) work. Forming expansions on all leaf nodes requires O((N + N s )p 2 work, for an expansion of order p. The remainder of the upward pass requires O(N b p 3 ) work to carry out the recursive merging of Hermite expansions, where N b is the number of boxes in the quad-tree. The downward pass requires approximately O(27N b p 3 ) work to carry out the Hermite-local and local-local translations. Finally, the local work is of the order O(N s q) for the point sources (assuming the tree has been refined until there are O(q) sources per leaf node). For the continuous source distribution, only approximately 13N k(k+1) 2 + N s k 2 operations are required. The first term accounts for the cost of computing the Gaussian potential on the tensor product grids from the near neighbors, using precomputed tables, while the latter term is the interpolation cost at the point source locations. The factor 13 is a consequence of the observation that the maximum number of neighbors a box can have is thirteen (twelve fine neighbors and itself).
Remark 2. The preceding analysis assumes that the translation operators T HH , T LL , and T HL have been computed according to the formulae (4.1), (4.2) and (4.3), taking advantage of the tensor product nature of the two-dimensional Hermite and local expansions to achieve O(p 3 ) complexity, instead of the naive estimate O(p 4 ). In the d-dimensional setting, the operation count is O(dp d+1 ) instead of O(p 2d ) [14,15].
We have further accelerated the code by making use of diagonal translation operators, following the method described in [15] and [25]. Instead of Hermite expansions, it is straightforward to show that This formula is derived from the Fourier relation In order to make practical use of (4.6), we need to discretize the integral, for which the trapezoidal rule is extremely efficient because of the smoothness and exponential decay of the integrand. The reason (4.6) is useful is that it provides a basis in which translation is diagonal. Assuming p t denotes the number of trapezoidal quadrature points required, it is shown in [15] and [25] that the dominant cost of translating Hermite to local expansions, namely the O(27N b p 3 ) term above, can be reduced to O(3N b p 2 t + N b pp 2 t ) work. The principal difference between the methods in [15] and [25] and our hierarchical scheme is that p t must be different on each level. Informally speaking, for a level where the linear box size is r l , p t must be sufficiently large that the integrand e ik·(x−s B )/ √ δ is Nyquist-sampled for (x − s B ) ≤ 4r l , where r l < r c √ δ and r c is the cutoff parameter defined above. (It is easy to verify that p t = O(p) [15,25].)

Boundary FGT.
We turn now to the evaluation of boundary Gauss transforms of the form (1.13), for targets both on and off the boundary Γ. We assume that Γ itself is described as the union of M b boundary segments: with each boundary segment defined by a kth order Legendre series. That is, x 2 j (n)P n (s), (4.9) with −1 ≤ s ≤ 1. We assume that the densities σ and µ in (1.13) are also given by corresponding piecewise Legendre series: For the sake of simplicity, we assume that Γ has been discretized in a manner that is commensurate with the underlying data structure used above: an adaptive, level-restricted tree. That is, we assume the length of Γ j , denoted by |Γ j | satisfies |Γ j | ≈ r l , where r l is the box size of the leaf node in the tree that contains the center point c j of Γ j .  .13) is either resolved by its discretization using standard Gauss-Legendre quadrature with k nodes on Γ j , or negligible outside the disk centered at c j with radius r corr , which we will denote by D j . (Right) In the latter case (when δ is small and the Gaussian is sharply peaked), a simple interpolatory rule can be used to compute the correct contribution using O(k 2 ) work per target point, either on or off the boundary. The shaded circles in the figure around the three target points t i are meant to indicate the regions where the Gaussians centered at t i are less than a user-prescribed tolerance . t 1 is sufficiently far from Γ j that it can be ignored. t 2 and t 3 are off and on the boundary, respectively. The relevant portion of Γ j for t 2 is marked in terms of the parameter s by s l and s r .
Suppose now that we apply composite Gauss-Legendre quadrature to the integrals in (1.13). For the "single layer" type integral, we have where y ij = (y 1 ij , y 2 ij ) is the location of the ith scaled Gauss-Legendre node on Γ j , σ ij is the density value at that point, and w ij = w i [dy 1 j /ds(s i )] 2 + [dy 2 j /ds(s i )] 2 . Here, s i and w i denote the standard Gauss-Legendre nodes and weights on [−1, 1].
Note that the quadrature weight w ij involves both the standard weight w i and the change of variables corresponding to an arc-length parametrization on each segment. The necessary derivatives can be computed from (4.9). Note also that the sum in (4.10) consists simply of point sources and is easily incorporated into the FGT above. The only remaining issue has to do with the accuracy of the formula (4.10), since the smoothness of the integrand depends strongly on the parameter δ. Here, however, the rapid decay of the Gaussian makes the problem tractable for any δ. To see why, consider a boundary segment Γ j , centered at c j in a leaf box B of commensurate size (Fig. 4.1, left). Suppose first that δ is sufficiently large that |Γ j | ≤ C( ) √ δ. C( ) is straightforward to tabulate in terms of the user-specified tolerance . Then, it is straightforward to show that the error in k-point Gauss-Legendre quadrature is of the order for some constant C . This follows from the error estimate for Gauss-Legendre quadrature [9], Stirling's formula, and Cramer's inequality. In short, the k-point quadrature rule is spectrally accurate. Suppose now that |Γ j | > C( ) √ δ, and let r corr = |Γ j |. Then, for any target outside the circle of radius r corr centered at c j , the integrand is bounded by e −C( ) 2 /4 |Γ j | σ ∞ . (Setting C = 12, the integrand is approximately |Γ j | σ ∞ (2 · 10 −16 ).) Thus, for each boundary segment with |Γ j | > C( ) √ δ, it remains only to correct the result obtained from the FGT within this circle D j (Fig. 4.1, right).
This correction can be computed rapidly and accurately as follows. Let us denote by D(t, r) the circle of radius r centered at t for any target t in the circle, where r = δ ln(1/ ) so that e − t−y 2 /δ < . If D(t, r) doesn't intersect Γ j , the field is negligible and no correction is needed. Otherwise, we compute the intersection of Γ j and D(t, r) and let the endpoints of the intersection be denoted by s l and s r (in terms of the underlying parametrization of Γ j ). We then interpolate the source distribution σ(s) to k c scaled Gauss-Legendre nodes on Γ s j for s ∈ [s l , s r ], and replace the original k-point quadrature on Γ j by a k c -point Gauss-Legendre rule on [s l , s r ]. Setting k c to 20 yields approximately fourteen digits of accuracy assuming the density σ(s) is locally smooth.

Periodic boundary conditions.
It is straightforward to extend the FGT to handle periodic conditions on the unit square D = [−0.5, 0.5] 2 . Conceptually speaking, this can be accomplished by tiling the entire plane R 2 with copies of the source distribution f . For this, we let Λ = {j = (j 1 , j 2 )|j 1 , j 2 ∈ Z}. The tile T j is a unit square centered at the lattice point j ∈ Λ. The extended periodic source distribution will be denoted byf . From this, the solution to the periodic problem can be written as: As in the FMM for the Poisson equation [13,11], we can accomodate periodic boundary conditions with very little change to the data structure or processing. To see this, note that, if we carry out the upward pass of the FGT until the root node (level l = 0), we obtain an Hermite expansion describing the field due to all sources in D. Because of the translation invariance of the kernel, the coefficients of this expansion are the same for every tile T j covering the plane, expanded about the corresponding lattice point j. We denote the expansion about j by Let Having deleted the colleagues of the original unit box D, centered at the origin, the remaining tiles indexed by Λ are all well separated from D. From Lemma 3.4 and the linearity of the problem, it is clear that the contribution to field in D from all well-separated tiles indexed by Λ can be representation by a local expansionF centered at the origin, with coefficients Extending Definitions 4.2-4.4, we let Ψ 0,1 denote the local expansion for the root node D at level 0, and define the nine tiles centered at Λ − Λ to be the root node's colleagues.
When δ is so small that l cut ≥ 0, the far fieldF f ar (x) in the root node D is negligible (for given accuracy ), so that we can initialize its coefficients C β = 0. Otherwise, we carry out the computation in (4.12) to initialize Ψ 1,0 . This requires the computation of tte lattice sums in (4.13). These are obtained rapidly from the Poisson summation formula: (4.14) It is straightforward to verify that, when δ is large enough thatF f ar (x) is non-negligible, only a few terms are required on the right-hand side of (4.14) and only milliseconds are needed for all L α+β .
Only two other changes are needed in the FGT: the interaction list and near neighbor computations must be adjusted to account for periodic images. Having defined the colleagues of the root node above, this is handled automatically by the data structure. For large scale problems with many levels of refinement, this involves a modest increase in work for boxes near the boudary of D and a negligible increase in the total work.

Numerical results.
In this section, we illustrate the performance of both the volume and boundary FGT, implemented in Fortran, with experiments carried out on a single core of a 3.4GHz Intel Xeon processor. Our first example demonstrates the linear scaling of CPU time with the number of grid points. We compute the volume FGT with the source distribution imposing periodic boundary conditions. In order for the numerical experiment to be non-trivial, we increase the complexity of the problem as we increase the number of degrees of freedom. More precisely, we consider four cases, with k = 1, 2, 4, 8 and δ = 1 k 2 , requiring a finer and finer spatial mesh to resolve the data. For each choice of k, we create a (level-restricted) quadtree, refined to a level where f k is accurately represented with our piecewise eighth order polynomial to 10 digits of accuracy. For the function described in (5.2), the refinement happens to be uniform, with N = 256, 1024, 4096, 16384 leaf nodes for the four cases, respectively. (We will see examples with inhomgeneous source distributions and adaptive data structures below.) Timings are given in Table  5.1 and plotted in Fig. 5.1. k 1 2 4 8 10 −3 3.0 · 10 5 3.3 · 10 5 6.1 · 10 5 7.3 · 10 5 10 −6 1.7 · 10 5 1.8 · 10 5 3.1 · 10 5 4.3 · 10 5 10 −9 0.9 · 10 5 0.9 · 10 5 1.1 · 10 5 1.3 · 10 5 Table 5.1: Throughput on a single core for the volume FGT with periodic boundary condition in units of points/second.
Note that, for a fixed tolerance , the performance of the volume FGT is relatively insensitive to the parameter δ. For large δ, the far field is nontrivial but very smooth. For sufficiently small δ, the interaction is entirely local and the FGT is particularly fast. The worst case performance is for δ ≈ 10 −4 , where both the far field and near field need significant effort.
Our fourth example illustrates the performance of the boundary FGT. We compute the integral where Γ is chosen to be the ellipse: We create an adaptive quadtree on the unit box so that each leaf box of the tree contains no more than O(1) boundary points and then enforce the level-restricted condition, yielding the data structure shown in Fig. 5.6. The leaf nodes with 8 tensor product Chebyshev grids on each define our volumetric targets. The boundary FGT is then evaluated at all volumetric grid points and all boundary points as well. Timings are given in Table 5.5 and plotted in Fig. 5.7.
Note, again, that the performance of the boundary FGT varies only modestly over a wide range of the parameter δ. For sufficiently small δ, the interaction is entirely local and no expansions are formed. For sufficiently large δ, a smooth quadrature rule is accurate enough to discretize the boundary integral, avoiding the need for local correction. The code is slowest for intermediate values of δ, where both local and far field contributions are significant (while still satisfying linear scaling with the number of source and target points).

5.
1. An initial value problem for the heat equation. As a final example, we consider the homogeneous heat equation: u t (x, t) = ∆u(x, t), u(x, 0) = f (x) (5.8) δ 10 0 10 −1 10 −2 10 −3 10 −4 10 −5 10 −6 10 −3 4.2 · 10 5 3.9 · 10 5 4.1 · 10 5 3.6 · 10 5 5.0 · 10 5 2.8 · 10 5 5.4 · 10 5 10 −6 2.0 · 10 5 1.8 · 10 5 1.6 · 10 5 1.6 · 10 5 0.9 · 10 5 1.8 · 10 5 3.9 · 10 5 10 −9 1.0 · 10 5 0.9 · 10 5 0.8 · 10 5 0.4 · 10 5 0.2 · 10 5 1.1 · 10 5 2.7 · 10 5 Table 5.5: Throughput for the boundary FGT with volumetric targets, measured in points/second, for various precisions and values of δ. for x ∈ D = [−0.5, 0.5] 2 , with periodic boundary conditions. The initial data is chosen to be a piecewise constant function: where the unit box D is refined uniformly on a tree that is five levels deep, resulting in a 32-by-32 grid of leaf nodes D i . On each leaf node, we let C i take on a random value in the range [0, 1]. f (x) is plotted in Fig. 5.8. The exact solution of this problem is given by: u(x, t) = 1 4πt R 2 e − |x−y| 2 4tf (y)dy, (5.10) wheref is the periodic extension of f . This is precisely what is computed by the periodic version of the volume FGT and the solution u(x, t) is plotted for various choices of t in Fig. 5.8, with nine digits of precision in the FGT. Remark 3. There is a subtle issue regarding the use of the FGT to compute (5.10), namely that the error estimates for the FGT derived above are based on the Gaussian rather than the heat kernel, which includes the additional 1/(4πt) scaling in two dimensions. To compute an accurate convolution requires that the local tables be built using the full heat kernel (whose support to a fixed precision is slightly greater than the support of the Gaussian alone). The far field and local expansions also require a few more terms. Without entering into a detailed analysis, we illustrate the difference when t = 10 −4 for leaf node boxes in the present example. For the FGT, Hermite expansions of order p = 22 are needed to achieve 9 digits of precision. For the full heat kernel, it turns out that p = 28 is required to achieve the same accuracy. can be used for the evaluation of volume or boundary integrals with a Gaussian kernel as well as the field induced by discrete point sources. This is a standard and well-defined computational task in its own right, and serves as a key component in integral equation-based solvers for the heat equation in complex geometry [32]. The extension of the present method to three dimensions is straightforward and will be reported at a later date.