Communication-Efficient Property Preservation in Tracer Transport

Atmospheric tracer transport is a computationally demanding component of the atmospheric dynamical core of weather and climate simulations. Simulations typically have tens to hundreds of tracers. A...

We call a quantity that must be maintained essentially exactly, in an otherwise approximate solution to the differential equations, a property. Four correlated aspects of the problem make tracer transport challenging. First, simulations typically have tens to hundreds of tracers. Second, the subgrid chemistry and physics models in simulations require the tracer fields to obey several properties. Third, to address the first challenge, computational efficiency strongly motivates different discretizations and time steps for the transport and dynamical components. Fourth, using different discretizations increases the difficulty of maintaining properties.
Some properties are inherent in a tracer transport discretization; others must be recovered after the discretization provides a preliminary, or target, field. We call the difference between the final property preserving field and the preliminary target field the correction. This paper describes correction algorithms that preserve properties not inherent in the discretization while maintaining those that are. The algorithms are computation-and communication-efficient, have useful bounds on the correction magnitude as a function of readily accessible problem data, always return correct solutions when primary problem constraints are feasible, return favorable solutions to a safety constraint set when they are not, and depend very little on the details of each component's discretization and thus are widely applicable. We call the class of methods of which these algorithms are members constrained density reconstructor s (CDRs). A CDR can be used to solve a global problem or a local one, e.g., within a mesh cell.
This paper proceeds as follows. The remainder of section 1 describes the properties on which this paper focuses, some general aspects of CDRs, existing work, and applications. Section 2 formalizes and characterizes the property preservation problem, including primary and safety problems. The next three sections discuss CDRs in three classes: limited-reduction (section 3) and optimization-based (section 4) CDRs, and the new Quasi-Local Tree-based CDR (QLT, section 5). In addition, subsection 3.1 provides an algorithm to make a CDR solve the safety problem if the primary problem is infeasible. These three sections characterize these algorithms in terms of correctness, safety, and correction 1-norm. Each algorithm in sections 3 and 4 can be used as a global or local method; as local methods, some of these algorithms are used as node subproblem solvers in QLT, and thus section 5 builds on the analysis of the first two sections. Section 6 discusses the algorithms in terms of communication efficiency on a distributed computer and shows that QLT has the lowest communication volume. Section 7 presents results of numerical studies comparing various methods; these show that QLT and QLT with weights from [3] provide the highest quality corrections. Section 8 concludes.
1.1. Property preservation and CDRs. Let \bfitrho , \bfitQ , and \bfitq be vectors of coefficients in the discretizations of \rho , Q, and q, respectively. Each component of the dynamical core has a value for air density \bfitrho . The transport component computes \bfitrho if mixing ratio \bfitq = \bfite , the vector of ones. The dynamical and transport components are tracer consistent or free stream preserving [20] if these computed air densities agree.
Shape preservation encompasses a variety of constraints and methods, including limiters [1,18] and monotone reconstructions [9]. We focus on methods that apply bound constraints to the coefficients of nodal discretizations. Positivity requires \bfitrho , \bfitQ \geq 0. Range preservation assures that all values of a field lie between globally applied lower and upper bounds: l\bfite \leq \bfitq \leq u\bfite . The scalars l, u may be set just once and the algorithm assures mass conservation. Sections 3 and 5 provide new methods that can use this weight vector but also assure shape preservation. Diamantakis and Flemming [12] review the methods of Bermejo and Conde [3], Zerroukat [34], Priestley [27], and McGregor [26]. They find the methods of Zerroukat and McGregor do not provide strict shape preservation in practice.
Bochev et al. describe optimization-based remap (OBR) methods [4,5,6] and applications in which the optimization problem provides mass conservation and local bound preservation. An efficient, but iterative, method [10] is used to solve the optimization problem. Each iteration requires a reduction. Section 4 analyzes this approach in our framework.
In a cell-integrated semi-Lagrangian (SL) method [20], the departure cell geometry may be adjusted to preserve properties. In general, adjustment is a global problem; but in a flux-form transport method, the adjustments can be decoupled [22,24]. Compared with a CDR, the approach does not use a global collective but requires flux form.
Kaas et al. [17] describe a hybrid Eulerian--Lagrangian (HEL) scheme. Equations (22)-- (25) of that reference correspond to our algorithm ClipAndAssuredSum (section 3). We place the method in a broader setting suitable for analysis and comparison. In addition, HEL develops local bounds on mixing ratio in a manner that assures feasibility of the resulting constraint set; in other applications, an assuredly feasible constraint set may not be efficiently computable. ReconstructSafely (subsection 3.1) enables the CDR to efficiently maintain dynamic range even if the primary constraint set is empty.

Applications.
Because the algorithms in this paper depend very little on the details of discretizations, they may be used in any scheme that either currently does not preserve properties or preserves properties with a CDR. For example, QLT may be well suited for direct application to the interpolation SL methods Spectral Bicubic interpolation scheme (SBC) [14] and FARSIGHT [32]. SBC runs on a latitudelongitude grid, and FARSIGHT runs on the cubed sphere. FARSIGHT already has a dynamic range preservation CDR, but QLT can strengthen FARSIGHT's property preservation to local bound preservation. HEL [17] is much more complicated than SBC and FARSIGHT, but it still has essentially the same property preservation problem. It might benefit from using QLT instead of its current use of, essentially, global ClipAndAssuredSum, for increased efficiency. Any OBR algorithm whose optimization problem has the structure of one of the three problems considered in section 4 can use the algorithms described in this paper for increased and deterministic communication efficiency and a solution framework that relaxes constraints as necessary to resolve infeasibility. For example, [5, eq. 4.3] has the structure of problem \scrP w2 (section 4).
Some tracer transport discretizations already provide mass conservation. But gaining shape preservation and tracer consistency without losing this mass conservation is as hard as gaining both these and mass conservation. Thus, the CDRs described in this paper are useful regardless of whether the discretization conserves mass. Nonetheless, a mass-conservative discretization has advantages. Generally, it will require smaller corrections than nonconservative methods. In addition, QLT (section 5) redistributes mass approximately locally if the transport discretization is mass conservative; we discuss this detail further in sections 5 and 7.
In a method having multiple discretization points per cell, such as finite-element methods, a CDR may be applied hierarchically. A global CDR redistributes mass PROPERTY PRESERVATION IN TRACER TRANSPORT   C165 among cells, and then a cell-local CDR redistributes mass within a cell. The purpose of the global CDR is to provide each cell with a mass sufficient to make the subsequent cell-local problem feasible; section 2 discusses problem feasibility. Different CDR algorithms can be used for the global and local problems.
Two criteria concerning the transport method determine whether a global CDR is the appropriate method to preserve properties. The first is the transport method's time step. If the transport method is Eulerian, and thus likely in flux form, and time integration is explicit, then a CDR, while mathematically applicable, is not an efficient means of preserving properties. Instead, it is possible that the time integration method and time step can be chosen so that mass redistribution among cells is not necessary to preserve properties; see, e.g., [16,Theorem 1] and [36,Theorem 2.2]. With these choices, property preservation can be obtained entirely locally within each cell and thus without extra communication, except possibly in the details of obtaining local bounds and regaining continuity across cells. If a global CDR from this paper were applied in this case, its correction would be 0. Alternatively, fluxes can be adjusted by a scheme that decouples the mass redistribution problem; see, e.g., [31,33].
The second criterion is whether the SL method uses flux or remap form. A global CDR is particularly well suited to SL transport in remap form. For large time steps, remap form can use substantially less communication than flux form. For in remap form, the computational domain of dependence of a parcel is its advected image, whereas in flux form, this domain is the swept region (e.g., [24]), the union of the regions swept by the parcel's edges during advection. Flux form provides information that can obviate a global CDR, as exploited in [24]. But remap form SL transport with an efficient CDR can have a lower overall communication volume, and one that--roughly, subject to the details of the transport problem---does not grow with time step, as it does in flux form. In summary, the primary motivation for CDR algorithms is that SL tracer transport can be substantially faster than Eulerian transport because of a long time step without CFL restriction, SL transport in remap form can be more efficient than SL transport in flux form, high-order SL transport generally requires redistributing mass among cells to preserve properties, and remap-form transport lacks flux information with which to redistribute mass.
In [7], we describe a family of cell-integrated, remap-form, spectral-element [29], SL methods for two-dimensional transport on the sphere. Because the methods are cell-integrated, they are locally mass conserving. In that work, we use QLT for shape preservation and tracer consistency.

Preliminaries.
We let all arithmetic and logical operators apply elementwise. For example, \bfitx \bfity is elementwise multiplication; \bfitx /\bfity is elementwise division; in \bfitb \equiv \bfitx \leq \bfity , the boolean b i = x i \leq y i . The inner product of two vectors is denoted \bfitx \cdot \bfity . The vector \bfite is the vector of all ones; the context determines its size. Similarly, 0 is the vector of all zeros.
Let \= \bfitrho > 0 be the mass field coefficients; subsequently, we omit coefficients and sometimes field. In a nodal finite-element method, each component might correspond to a nodal basis function; then \= \rho i = w i \rho i , where w i is the integral of the node's basis function and \rho i is the density. Or a component of the vector might correspond to an aggregation of components from another vector. For example, let \bfitrho N be a vector over mesh nodes and \= \bfitrho C be a vector over mesh cells. Then \= \rho C i = \sum j\in Ci w i j \rho N j is the mass in cell i as a function of nodal basis functions and densities, where C i is the index set of nodes in cell i and w i j is the integral of basis function w j over cell i. The algorithms in this paper are independent of these distinctions, as they begin with \= \bfitrho .
Let \bfitQ be a constituent tracer's density. The tracer transport component is provided \bfitrhoand \bfitrho , the total density fields at the previous and current time steps, respectively, and \bfitQ -, the tracer density at the previous time step. The . We refer to the equivalent sets \= \scrQ , \scrT as the primary constraint set. It may be that \scrT (b, \bfitl , \bfitu ) = \emptyse . In practice, this can happen because the procedure to find \bfitl , \bfitu is heuristic and local; nothing in the procedure may assure, in particular, the global property \bfite \cdot \bfitl \leq b \leq \bfite \cdot \bfitu . A useful superset \scrT s ( \= \bfitQ \ast , \= \bfitrho , b, \bfitl , \bfitu ) is essentially always nonempty. Let \scrT s is the set of all dynamic range preserving and mass conserving corrections. Recall that (\cdot )denotes a quantity at the previous time step.
Consider a simulation for which the following conditions hold: 1. \bfite \cdot \= \bfitrho -= \bfite \cdot \= \bfitrho = \= \rho g and \bfite \cdot \= \bfitQ We can expect these conditions to hold in practice. A dynamical component typically conserves mass, and mass conservation holds for the transport component inductively. Condition 2 mimics that the exact solution to the advection equation maintains its extrema.
Because \scrT s prevents the introduction of new global extrema, provides tracer consistency, conserves mass, and is always nonempty, we call it the safety constraint set. This paper focuses on algorithms that efficiently find a correction in the primary constraint set, if it is nonempty, or else the safety constraint set.

14: end function
We write our algorithms for mathematical clarity; they should not be used as guides for efficient implementations. In section 6, we show that ClipAndAssured-Sum can be implemented with a single BAR, an MPI Allreduce in particular. We call CDRs that use a deterministic maximum of such reductions limited-reduction algorithms; ClipAndAssuredSum is a one-reduction algorithm. It was previously described in [17, eqs. 22--25].
In each proposition concerning algorithms, the preconditions of the algorithms, annotated Pre: in an algorithm listing, are assumed to hold. The postcondition in an algorithm listing indicates the propositions directly relevant to the algorithm. x caas ClipAndAssuredSum is correct, as follows.
The proof is in Appendix A.1. Function ClipAndAssuredSum chooses a particular weight vector \bfitw to assure correctness. Others may be useful for quality of correction; see, e.g., [3]. As a first step to permitting other weight vectors, ClipAndGenericSum in Algorithm 3.2 implements an algorithm useful for analysis, but not useful in practice: the function may fail to return \bfitl \leq \bfitx \leq \bfitu even if \scrT (b, \bfitl , \bfitu ) \not = \emptyse . return \= \bfitx + m\bfitz 6: end function A useful measure of the magnitude of a correction is its 1-norm, \| \bfitx \| 1 . This value is the total mass by which the target field is altered. We use ClipAndGenericSum to characterize \| \bfitx \| 1 as returned by any CDR that can be implemented by forming a weight vector \bfitw and then calling ClipAndGenericSum.
ClipAndGenericSum first clips \bfitx \ast = 0 to get \= \bfitx . The clip at index i is necessary, and it is independent of every other clip. Then, because \bfitz \geq 0 and \bfite \cdot \bfitz = 1, it adds exactly the mass discrepancy m, and no more, to \= \bfitx to get \bfitx . Let \bfitd \equiv \bfitx -\= \bfitx ; each element of \bfitd is either zero or shares the same sign because this fact is true of \bfitw . Additionally, for any i for which x i is altered in the second step, the modification d i has the same sign as the clip modification in the first. Hence the total change, \| \bfitx \| 1 , is the sum of \| \= \bfitx \| 1 and \| \bfitd \| 1 . Since each of the two steps modifies the vector by exactly as much as needed, \| \bfitx \| 1 is minimal. This argument is an intuitive proof that \bfitx is 1-norm-minimal; Appendix A.2 contains a formal proof of Proposition 3.2. Let \scrT 1 (b, \bfitl , \bfitu ) \equiv arg min x\in \scrT (b,\bfitl ,\bfitu ) \| \bfitx \| 1 , the set of all 1-norm-minimal points in \scrT .  Next, let B(\bfitl , \bfitu ) is the mass by which the target field is out of bounds. A first use of it is to provide a simple lower bound on the magnitude of \bfitx \in \scrT (b, \bfitl , \bfitu ): is a technical term that is used particularly in the proof of Proposition 5.3.
The first equality of (3.2) simply repeats Proposition 3.2. The proof of the remaining relations is in Appendix A.1.
Next, we consider continuity and semilinearity. Inspection of the algorithms shows the following. In the context of this proposition, the inputs may be those of an infeasible problem. Then the preconditions are not satisfied, and an algorithm may return output violating the postconditions. But that output is still a continuous function of the inputs.
A transport operator A is said to be semilinear if A(\alpha \bfitx + \beta ) = \alpha A(\bfitx ) + \beta for scalars \alpha , \beta and vector \bfitx [21,25,30]. A CDR is semilinear if the transformations \bfitq min \rightar \beta \bfite + \alpha \bfitq min , \bfitq max \rightar \beta \bfite + \alpha \bfitq max , \= \bfitQ \ast \rightar \beta \= \bfitrho + \alpha \= \bfitQ \ast , \= Q g \rightar \beta \bfite \cdot \= \bfitrho + \alpha \= Q g transform the output as \= \bfitQ \rightar \beta \= \bfitrho + \alpha \= \bfitQ . Here, a \rightar b means that a is transformed to b. Under this transformation, our correction variable set transforms as b \rightar \alpha b, \bfitl \rightar \alpha \bfitl , \bfitu \rightar \alpha \bfitu , \bfitl s \rightar \alpha \bfitl s , \bfitu s \rightar \alpha \bfitu s . Hence a CDR is semilinear if its output transforms as \bfitx \rightar \alpha \bfitx . Note that the term semilinear has, by convention, a special meaning in the context of transport operators; in other settings, the term has a different meaning. Semilinearity in the sense of [21,25,30] does not imply continuity. We can construct a counterexample in three dimensions. Let R be a rotation matrix such that \bfite = \surd 3R T \bfite n , where \bfite n has a one in its final element and is otherwise zero. Let f R (x, y, z) \equiv [0, 0, g(x, y) + z], where g is homogeneous of degree 1 and discontinuous. Then f (\bfitv ) \equiv R T f R (R\bfitv ) is, first, discontinuous and, second, semilinear. The first is by construction of g. The second can be verified by simplifying f (\alpha \bfitv + \beta \bfite ), using that R\bfite = \surd 3\bfite n on the second term of the argument, homogeneity of g applied to the first two components of \alpha R\bfitv , that f R is nonzero only in its third component, and finally that \surd 3\beta R T \bfite n = \beta \bfite , yielding \alpha f (\bfitv ) + \beta \bfite . Again by direct inspection of the algorithms, we obtain the following. Bermejo and Conde's algorithm [3], hereafter BC, is an instance of ClipAndGeneric-Sum. Hence Proposition 3.2 implies that its output is a minimal 1-norm correction in addition to being a minimal weighted-2-norm correction by construction. Let their weight vector be \bfitw BC . ClipAndGenericSum cannot assure that \bfitx \in \scrT (b, \bfitl , \bfitu ) \equiv \scrT even if \scrT \not = \emptyse . Nor can BC. But, first, \bfitw BC strongly encourages satisfaction of the constraints. Second, ClipAndAssuredGenericSum can be used with \bfitw BC to return exactly their solution \bfitx BC when \bfitx BC \in \scrT , and otherwise a solution resulting from the convex combination, as close to \bfitw BC as possible, of \bfitw BC and that from MakeAssuredWeights. Thus, ClipAndAssuredGenericSum safeguards another, nested, algorithm. The method of Zerroukat [34] may also benefit from safeguarding. To make ClipAndAssuredGenericSum a general safeguard, the user's w i should first be modified to be 0 if \= x i is on a bound and the sign of m will make \= x i stay on that bound.
3.1. Safety problem. So far we have discussed algorithms that return a valid correction only if the primary constraint set is not empty. We refer to such a CDR as a primary CDR. ReconstructSafely in Algorithm 3.4 calls a user-provided primary CDR, SelectX. If the primary set \scrT (b, \bfitl , \bfitu ) is not empty, Reconstruct-Safely returns the output of SelectX. If it is empty, then it calls SelectX with new inputs, now to find a correction in the safety constraint set. We say that a CDR having this behavior is safe. Although Proposition 2.2 assures that the safety set is not empty when ReconstructSafely is used in a typical simulator, Reconstruct-Safely nonetheless has a branch to handle an empty safety set. In this branch, the correction is only mass conserving. In section 5, QLT calls ReconstructSafely as a subproblem solver. A QLT subproblem can require this final branch even when the global problem has a nonempty safety set.
ReconstructSafely sets \bfitw = \= \bfitrho if the user does not provide \bfitw . This choice encourages uniform relative mass change, whereas \bfitw = \bfite encourages uniform absolute mass change. More generally, a weight vector in the literature that is constructed with respect to \bfitq should be multiplied by \= \bfitrho for use in our mass-formulated CDRs. A weight vector that depends on problem data must be checked for semilinearity; \bfitw = \= \bfitrho indeed does not break semilinearity.
Propositions 3.8 and 3.9 establish that ReconstructSafely is a safe CDR, a continuous operator, and a semilinear operator. if \bfitw = None then \bfitw \leftarr \= \bfitrho if b \leq \bfite \cdot \bfitu s then 8: else 10: if b \geq \bfite \cdot \bfitl s then 15:  Finally, we bound the correction size. If the primary constraint set is not empty, \| \bfitx \| 1 is provided by Proposition 3.3. But if it is empty, then we bound the amount by which the correction is out of the primary bounds.
The proofs proceed by inspection of the logical branches in ReconstructSafely and then application of earlier propositions relevant to the primary CDR SelectX.
4. Optimization problems. A member of another class of CDRs explicitly solves an optimization problem to find the smallest valid correction. Consider the following optimization problems and their solution sets, with \bfitome > 0:  \scrP w2 is the problem corresponding to set \scrT w2 , and similarly for the others. Each problem and set is convex because \scrT (b, \bfitl , \bfitu ) is convex and each objective is convex. Uniqueness is not of interest, but we note that for \bfitome > 0, \bfitome \cdot \bfitx 2 is strictly convex and so \scrT w2 (\bfitome, b, \bfitl , \bfitu ) has exactly zero points (if \scrT (b, \bfitl , \bfitu ) = \emptyse ) or one point. The objectives find minimal 1-norm, weighted-1-norm, and weighted-2-norm corrections, respectively. In fact, the correction is minimal in the 1-norm in all three problems.
A solver for one of these optimization problems is a primary CDR; calling a solver through ReconstructSafely makes a safe CDR. The continuity and semilinearity of these optimization problems are evident by direct inspection. Weights in these optimization problems influence the solution as the reciprocal of those elsewhere. In particular, comparison of the terms \bfitome\bfitx + \lambda \bfite in \bfits in \scrP w2 's Karush--Kuhn--Tucker (KKT) conditions (Appendix A.2) with ClipAndGeneric-Sum's update \bfitx \leftarr \= \bfitx + m\bfitz , where \bfitz \equiv \bfitw /(\bfite \cdot \bfitw ), shows that \omega i \propto 1/w i . An efficient algorithm to solve \scrP w2 is described in [10] and is used in [4,5,6] for global mass conservation and shape preservation. It is iterative, with the number of iterations dependent on data, and each iteration requires a reduction. Thus, as a global CDR, it is less efficient than the limited-reduction CDRs, and its performance depends on the data. However, QLT (section 5) can use it efficiently to solve node subproblems.
5. Tree algorithms. A third and new class of CDR algorithms generalizes the first two. The computations of an algorithm in this class are structured by a tree. The tree is over the mesh entities to which the indices of the vector \= \bfitQ correspond, e.g., mesh cells or mesh nodes, and may have arbitrary structure. At each node of the tree, ReconstructSafely is used to solve a CDR subproblem. In ReconstructSafely, SelectX may be any of ClipAndAssuredSum, ClipAnd-AssuredGenericSum; solvers for the optimization problems \scrP w2 , \scrP 1 , \scrP w1 ; or any other CDR that returns a valid correction if one is possible and whose correction magnitude is bounded by (3.3). In practice and in our analysis of communication efficiency (section 6), it is sensible for the node subproblem to be solved within a single process and so for it to require no communication. Thus, we distinguish between the global CDR and problem, and a local CDR and subproblem. An algorithm that is inefficient as a global CDR may still be efficient as a local CDR, e.g., 2-norm minimization.
We introduce one algorithm in this class, QLT (Quasi-Local Tree-based CDR). Propositions 5.1 and 5.2 establish that QLT is a safe, continuous, semilinear CDR, inheriting the characteristics of the subproblem solver wrapper ReconstructSafely. Proposition 5.3 shows that QLT inherits the bound (3.3) from SelectX. Proposition 5.4 shows that QLT does not inherit the equality (3.2). However, we explain that while QLT may not return a 1-norm-minimal correction, the larger magnitude has an interesting and useful source: QLT redistributes mass approximately locally, where locality is determined by the tree. We call this type of locality quasi-locality since it is not as precise as, e.g., a flux-based redistribution of mass.
Tree algorithms are possible fundamentally because of the necessary and sufficient conditions in Proposition 2.1. The condition \bfite \cdot \bfitl \leq b \leq \bfite \cdot \bfitu summarizes the feasibility of the primary problem with just three scalars, independently of the size of the problem. Another way to think of this is that just three scalars are necessary to determine what is needed to make the problem feasible. A tree algorithm exploits this compactness of data. A node summarizes the portion of the global problem rooted at the node with essentially three scalars and communicates with its parent what positive or negative mass it needs to solve, or it can take while still maintaining feasibility of, this portion of the problem. if \bfitw = None then \bfitw \leftarr \= \bfitrho 3: \= \bfitQ \ast n , \= \bfitrho n , \bfitl n , \bfitu n , \bfitw n \leftarr () \triang Initialize 5 empty lists 22: for k \in n.kids do \triang Fill the lists 23: append entries of k.data to \= \bfitQ \ast n , \= \bfitrho n , \bfitl n , \bfitu n , \bfitw n , respectively 24: end for 25: if \bfite \cdot \bfitw n = 0 then \bfitw n \leftarr \= following fields: \bullet kids, n's list of child nodes, empty if n is a leaf; \bullet id, the index i with which a leaf node n is associated; \bullet data, a list of data associated with n. QLT first reduces a problem within a subtree by summing \= \bfitQ \ast , \= \bfitrho , \bfitl , \bfitu , and \bfitw ; LeavesToRoot is a reduction with the addition operator. At the root, b is available to create the first node subproblem. These are then solved in turn from root to leaves, with each solution vector providing b for the child nodes' subproblems.    A 1D density field consisting of one up-and one down-pointing triangle, identical up to translation and sign, is clipped symmetrically. For generality, the mesh is nonuniform and the tree is not symmetric relative to the density field. Because the clipped mass changes sum to 0, m = 0 in ClipAndAssuredSum. Thus, the Clip-AndAssuredSum correction simply clips the original function. In contrast, QLT's  tree imposes locality on the redistribution of mass. The mass from the left triangle's clipped peak cannot be used to fill in the hole formed from clipping the right triangle, as these two features are too far apart, according to the tree. Thus, the correction from QLT is larger in norm than the one from ClipAndAssuredSum and thus is not 1-norm-minimal. But the QLT correction's 1-norm is bounded usefully by (3.3).
That QLT redistributes mass locally is a positive attribute of the algorithm that is most prominent when QLT is used to provide shape preservation and tracer consistency to a transport discretization that is already mass conserving. In such a case, there is no global mass discrepancy to resolve at the root node. Still, as we shall see, the mass redistribution is not as local as it could be; thus, we use the term quasi-local to refer to the locality property of QLT. Quasi-local mass redistribution occurs as follows.
Let l, the level of a node, be the number of edges between the root and n. Let L be the largest value such that the node subproblem of each node in level l < L is satisfied after LeavesToRoot finishes, i.e., before RootToLeaves starts. Then none of these nodes will alter its child nodes' masses in the RootToLeaves pass; case (i) of Proposition 3.10 applies with b, B, d all 0, and so the subproblem's correction has 0 1-norm. Let H be the height of the tree, i.e., the maximum level l. Then mass moves within subtrees of height at most H -L and not between these subtrees. The relation of the tree to the mesh provides the maximum distance over which this localized mass redistribution occurs. The tree is binary and balanced; it is translated so that any of 64 trees is possible. By translation we mean that at each leaf node n, n.id \leftarr (n.id + s) mod 64, where s is an integer. QLT is run with \= Q g = m, \bfitl = 0, \bfitu = \bfite , \bfitw = \= \bfitrho = \bfite , and SelectX is 2-norm minimization, and the output density profile is recorded. (Here, values of density and mass per cell are the same because we use unit-length cells.) This is done for each possible translation s and for m = 1 to 32. The profiles for each m are averaged over s; these are the profiles in the figure. In all three panels, the x axis is the cell index. In the top panel, the y axis is the total mass m. The top panel shows contours (gray lines) of the density as a function of cell index and m. The gray background shows where values are exactly 0. The bottom two panels show average profiles (thick lines); these are slices of the contour plot for m = 16 and 24. In addition, each shows the mass redistribution for the value of s that illustrates the worst case (dashed line), the ideal mass redistribution (gray thin solid line), the ensemble minimum and maximum (light gray solid), and the ensemble middle 50\% (dark gray solid).
Ideally, mass is redistributed no farther than r(m) \equiv \lceil (m -1)/2\rceil cells away from cell 32; the black lines with stair-step pattern in the top panel show r(m). In this experiment, QLT redistributes mass at most four times that far. Let k be the power of 2 such that 2 k - 1 < m \leq 2 k . Consider the predecessor node of the leaf node for cell 32 that is k levels above the leaf node; let this be node p. Outside of the tree rooted at p, all node problems are already satisfied after LeavesToRoot finishes. Node p's problem is feasible, which implies that all its descendent nodes' problems are feasible; see part 3 of the proof of Proposition 5.1 in Appendix A.3. Thus, mass is redistributed only within the tree rooted at p. This subtree spans 2 k adjacent cells. Because of the translation s, cell 32 can occur in any position of these adjacent cells; therefore, mass can move up to 2 k -1 cells from cell 32 in either direction. In the worst case, 2 k - 1 = m -1; then 2 k -1 \leq 4r -1. For m a power of 2, each cell in the cells spanned by the tree rooted at p is given mass 1. Combined with averaging over tree translations, the profile is a triangle, as illustrated by the middle panel of Figure 5.2(b). The general case is illustrated in the bottom panel.
6. Implementation. CDRs can be used locally or globally. For example, a user may call QLT as a global CDR. QLT calls a primary CDR at each node of its tree to solve a local subproblem. As another example, a field \= \bfitQ may aggregate multiple data from another field \bfitQ . After the global CDR is run, a local CDR is applied to each aggregate set. A local CDR is intended to run within a single process. Since the local CDR does not require communication, and the CDR acts on local data that likely fits in the fastest level of memory, it can have greater computational complexity than a global CDR without negative impact; e.g., it can be iterative. This section focuses on communication efficiency of global CDRs. We find that QLT has the lowest communication volume.
6.1. Communication. Global CDRs fundamentally rely on one or more batch all-to-all reductions (BARs). Often this BAR can be implemented using the Message Passing Interface (MPI) function MPI Allreduce. The cost of CDR computations is generally negligible compared with the data movement cost. Thus, a CDR's performance can be characterized by two quantities: the number of BARs and the communication volume. We characterize the second quantity by the number of scalars per tracer transmitted along a directed edge of the reduction tree. For MPI Allreduce, the total is thus twice the number of scalars per tracer, since the same number of scalars is communicated from leaves to root of the reduction tree as from root to leaves. In contrast, QLT uses a different number of scalars in the two directions.
Several implementations are possible for each algorithm. Our purpose in this section is to characterize the minimal number of BARs a CDR requires, even if another implementation has smaller communication volume. For example, an algorithm might be implemented with one BAR. But another approach might use one or two, depending on the outcome of the first, and on average use slightly more than one BAR but with lower total communication volume.
A complicating factor is that MPI Allreduce and related global collectives may be implemented using special hardware support. In such a case, a method that can be implemented using MPI global collective functions will be substantially faster than one implemented according to a custom tree and asynchronous point-to-point functions, for the same communication volume. In this case, QLT performance will be substantially worse than limited-reduction CDR performance, except when the number of tracers is sufficiently large to make bandwidth the dominant factor. For reference, the E3SM version-1 climate model [13] has 72 vertical levels and 40 tracers. A vertically Lagrangian coordinate decouples tracer transport and property preservation to 72 horizontal problems, although subsets of vertical levels can be combined in the property preservation step as a design choice. Thus, there are as many as 2880 property preservation problems to solve simultaneously, with reduction message size then a CDR-dependent constant times this number.
In previous sections, we treated the quantity b \equiv \= Q g -\bfite \cdot \= \bfitQ \ast as provided by the caller. In practice, the caller has not computed \bfite \cdot \= \bfitQ \ast ; thus, the CDR must do so. In the case of a simulation in which advection steps are alternated with source term computations, \= Q g changes in the course of the simulation and so also must be computed by reducing \= \bfitQ -, the field at the previous time step after the source term has been applied. Thus, in practice, one of \= \bfitQ \ast or \= \bfitQ --\= \bfitQ \ast must be reduced in the CDR's first (or only) BAR. In some cases, the vector \= \bfitrho must be reduced and, similarly, the weight vector \bfitw . \= \bfitrho is, and \bfitw may be, the same for all tracers, and \bfitw may even be set equal to \= \bfitrho . In each of these cases, the BAR should of course include these vectors just once. We continue to describe computation and communication in terms of just one tracer; the implementation for more than one tracer should batch computation and communication in each step.
ClipAndAssuredSum can be implemented with one BAR. The BAR sums, separately, four vectors: \= \bfitQ --\= \bfitQ \ast (to compute b), \= \bfitx (to compute m and \bfitv ), and \bfitl and \bfitu (to compute \bfitv ). A sequence of two BARs would require summing just one of \bfitl and \bfitu . This is an example of a trade-off in number of BARs and total communication volume. In addition, it illustrates a simple technique to minimize BAR counts: if a variable's value is a result of a BAR and can take one of only a small set of values, then account for all of these possible values in a single BAR. In the case of ClipAndAssuredSum, the sign of m, the mass discrepancy, can take one of two values; its value determines which of \bfite \cdot \bfitl and \bfite \cdot \bfitu is used. Now we consider ReconstructSafely with SelectX = ClipAndAssured-Sum for cases in which \scrT (b, \bfitl , \bfitu ) is not assuredly nonempty. One implementation approach is to use the same first BAR as before. Then, if \bfite \cdot \bfitl \leq b \leq \bfite \cdot \bfitu does not hold, solve the safety problem, which requires at least one additional BAR. This is quite likely the more efficient approach in practice. Nonetheless, we describe how to implement this method with one BAR total. The trade-off is that the BAR has 8 scalars per tracer, up from 4, although one scalar is shared by all tracers. The addi-tional reductions are \bfite \cdot \= \bfitQ and \bfite \cdot \= \bfitQ \ast instead of \bfite \cdot ( \= \bfitQ --\= \bfitQ \ast ), \bfite \cdot \= \bfitrho , min \bfitq min (\= \bfitrho , \= \bfitQ \ast , \bfitl ), max \bfitq max (\= \bfitrho , \= \bfitQ \ast , \bfitu ). After the first BAR, if \bfite \cdot \bfitl \leq b \leq \bfite \cdot \bfitu , then \bfitx is immediately returned. If at least one inequality is violated, then, depending on sign, \bfitu s or \bfitl s is formed. Assume \bfitu s is needed. Then b \leq \bfite \cdot \bfitu s is evaluated; the quantity on the right-hand side can be computed using data from the first BAR. If b > \bfite \cdot \bfitu s (even the safety set is empty), ReconstructSafely returns a result requiring no additional BAR. If b \leq \bfite \cdot \bfitu s , then ClipAndAssuredSum is called with new data. This time, it is assured that ClipAndAssuredSum can return a valid point. The special inputs in this case mean ClipAndAssuredSum has all the reduced values it needs already available, and so an implementation specialized for this case will perform no additional reduction. In summary, a total of one BAR is required, with 8 scalars per tracer per edge, or a little above 7 when \= \bfitrho is amortized over many tracers. ClipAndAssuredGenericSum is more complicated to implement than Clip-AndAssuredSum. We omit a complete discussion of details, as they follow the same ideas. However, there is one new feature of the problem. ClipAndAssuredGeneric-Sum must perform two global collectives except in the case that even the safety set is empty. This is because one cannot know whether \bfitw will push \= \bfitx out of bounds until after m is computed. Thus, at the very least, a second global collective in the form of a broadcast must be used to determine either that each x i is in bounds or that at least one is out, according to which a coordinated action can be taken. As long as a second global collective is required, it makes sense to use a BAR. Unless the caller's weight vector computation itself requires a global collective that cannot be batched with the first BAR, two BARs in total are needed. If ReconstructSafely is used with SelectX = ClipAndAssuredGenericSum, additional scalars are needed, but there are still at most only two BARs.
QLT is already defined in terms of the communication equivalent of one reduction followed by one broadcast. In LeavesToRoot, \= \bfitQ \ast , \= \bfitrho , \bfitl , \bfitu , \bfitw , and \= \bfitQ are separately summed. In RootToLeaves, one scalar is communicated per tree edge, b n . Thus, QLT communicates 6 or 7 scalars per tracer per two edges (up and down the tree), depending on \bfitw , or on average 3 or 3.5 per edge; or, if \= \bfitrho and \bfitw are shared among many tracers, a little above 2.5.
In summary, ClipAndAssuredSum, ClipAndAssuredGenericSum, and Re-constructSafely using one of these as SelectX can be implemented using one or two calls to MPI Allreduce. QLT only ever requires one BAR equivalent and uses less than half the communication volume of even ClipAndAssuredSum. But the reduction tree must be implemented as part of QLT because QLT stores values and performs computations at each node in the tree.

Level scheduling in QLT.
In this subsection, let a degree of freedom (DOF) correspond to one entry of \bfitx . A practical implementation of QLT should not, and in a distributed environment cannot, be recursive, as QLT is in Algorithm 5.1. A straightforward distributed implementation of QLT forms a tree over the DOFs such that all DOFs in a process are covered by a subtree of the QLT tree. However, often one wants invariance of the solution to MPI process decomposition. Thus, the tree of DOFs cannot depend on the process decomposition.
The key tool to solve this problem is level scheduling; see, e.g., [28, eq. (18)]. A level schedule constructs a sequence of levels. A level is a set of nodes that do not depend on each other but depend on nodes in levels earlier in the sequence. For this reason, this tool is also useful to expose intraprocess parallelism; computations for nodes in a level may proceed in parallel. Similarly, nodes in a level can fill a single  log 10 l 2 rel. err.
log 10 l inf rel. send buffer for each communication partner, and then this monolithic buffer may be sent after the level's computations are complete. This method minimizes the number of messages.
Thus, a QLT implementation should have an initialization phase in which the tree is level-scheduled. Then these levels are used to allocate and organize send and receive buffers, with offsets into these for each participating node. Finally, each invocation of QLT uses these established data structures to implement LeavesTo-Root and RootToLeaves. Each SelectX (through ReconstructSafely) call is independent of the others and associated with a single node; hence these can be computed in parallel.
7. Numerical studies. In this section, we examine the quality of corrections provided by the CDRs. We do not provide performance results for implementations; performance is well characterized by number and sizes of communication rounds. QLT is a rich enough algorithm that its implementation could be a topic of another study. The performance study in [7] includes the use of our first implementation of QLT.
7.1. One-dimensional problems. In each 1D experiment, a function is periodically translated by a uniform flow field. Figure 7.1 shows the 1D initial conditions we use. Relative to a domain of length 1, the Locality function has a Rectangle function of width 0.2; the Rectangle, Triangle, and Bell functions each have shape of width 0.55; and the Gaussian's standard deviation is 0.1485. The Bell is one half of a sinusoid's period. The conventional cubic interpolation semi-Lagrangian (ISL) method is used: a node (equivalently, a cell center) is translated to its departure point, and the four surrounding Eulerian nodes on the departure mesh provide a cubic interpolant. In some cases, a linear interpolant based on the two surrounding points is used.
After the ISL step, a CDR is applied to recover properties. When referring  Fig. 7.2. Results for the randomized 1D periodic simulations for the Sinusoid initial condition. The top row is for the uniform mesh, and the bottom row is for the perturbed mesh. The left column plots log 10 of the average pointwise relative error. The middle column plots log 10 of the average pointwise relative mass redistribution. See text for definitions. The right column plots the cumulative density of the l 2 relative error. In the bottom row, the initial condition is plotted for reference. In the left and middle columns, for QLT(BC), the middle 90\% (light gray) and middle 50\% (dark gray) of the ensemble are shown as filled regions.
to weights, we exclude a cell-length factor. For example, the unit weight vector is multiplied by the cell-length vector. The CDRs are as follows: ClipAndAssured-Sum, labeled CAAS; the algorithm of [3], labeled BC; 2-norm minimization with unit weights, labeled LS(1); QLT, with unit weights, labeled QLT(1); and QLT with the full unsigned weight vector of [3], labeled QLT(BC). QLT(BC) uses an unsigned version of the BC weight vector, \bfitw = | \bfitq H -\bfitq L | 3 , where \bfitq H is the result of applying the cubic ISL step and \bfitq L is the result of applying the linear ISL step. The second part of the BC algorithm is to discard the signed weights associated with a correction of the wrong direction with respect to recovering mass conservation; in QLT(BC), this second part is unnecessary. The original BC algorithm is not assured to provide a feasible correction even if one is possible, but QLT(BC) is. In our experiments, we chose problems such that BC always succeeds. We could have instead wrapped BC in ClipAndAssuredGenericSum, but this would complicate interpretation of the BC data. QLT(BC) makes use of the ability of ReconstructSafely, QLT, and the 2-norm minimization implementation to handle 0 weight values. We also include cubic ISL with no property preservation, labeled Cubic.
Two mesh types are used. Mesh types are labeled Uniform, for a uniform mesh, and Perturbed. The perturbed mesh starts with a uniform set of cells; then each cell   Results for the randomized 1D periodic simulations for the Locality initial condition.
log 10 l inf rel.  Let y 0 be the initial condition and y f the final profile. Data collected include the following: l 2 , l 1 , and l \infty relative errors are computed as \| y fy 0 \| p /\| y 0 \| p for p the norm. Pointwise relative error is computed as | y fy 0 | /\| y 0 \| \infty . Finally, pointwise relative mass redistribution is computed as follows: At each time step k, the absolute value of the pointwise difference between the CDR-corrected field y k CDR and the uncorrected field y k is recorded: \Delta y k \equiv | y k CDRy k | . Then two linear interpolations are performed, although these could be combined into one linear interpolant. First, \Delta y k is linearly interpolated to a 512-cell uniform mesh. Second, the result is advected in one step to time 0 using the linear ISL. Let the result be \widetil \Delta y k . \widetil \Delta y k is accumulated over all time steps k = 1, 2, . . . , K. The pointwise relative mass redistribution is then (K \| y 0 \| \infty ) - 1 \sum K k=1 \widetil \Delta y k . We find that the resulting plots are visually nearly identical when 1024 cells are used instead; to store the ensemble data, we thus chose 512 cells. In each of Figures 7.2 and 7.3, the top panels are for the uniform mesh, and the bottom is for the perturbed. The left column shows log 10 of the average pointwise relative error, and the middle shows log 10 of the average pointwise mass redistribution. In addition, for QLT(BC), the middle 90\% (light gray) and middle 50\% (dark gray) of the ensemble are shown as filled regions. The initial condition is plotted as a gray solid curve in the same axes for reference. The right column shows cumulative density of l 2 relative error. Tables 7.1 and 7.2 report the 10, 50, and 90 percentiles for each of the error norms; for text compactness,log 10 of each value is taken.
In Figure 7.2, middle column, mass is redistributed principally at the extrema of the sinusoid, as these extrema are modified by the CDR. For the uniform mesh, both QLT simulations redistribute mass within just these peaks, since the ISL is mass conserving. BC encourages mass redistribution to localize to these peaks, but there is still a spread of mass. For the perturbed mesh, QLT(1) redistributes the global mass discrepancy essentially uniformly, roughly matching CAAS and LS(1). QLT(BC) has some mass localization resulting from the BC weights. The right column and Table 7.1 show that both QLT CDRs give the most accurate solutions of the property preserving ones, but all property preserving solutions are substantially less accurate than Cubic because of limiting the smooth extrema.
In Figure 7.3, middle column, mass is redistributed principally at the extrema of the background sinusoid, as in Figure 7.2, and also at the rectangle function's edges. Again, on a uniform mesh, for which the ISL is mass conserving, both QLT simulations redistribute mass within just these peaks. On the perturbed mesh, QLT(1) must again redistribute the global mass discrepancy roughly uniformly, whereas QLT(BC) benefits from the BC weights. Since the initial condition is no longer smooth, the CDR solution can be as accurate as or even more accurate than the Cubic one. Figure 7.4 plots convergence data for the 1D periodic translation problem on a uniform mesh. The coarsest mesh has 17 cells, and 7 time steps are used. Each refinement level doubles the number of cells and time steps. One cycle is run, as convergence data may be obtained at any cycle boundary. QLT provides more accurate solutions than the other CDRs. In the case of the nonsmooth initial conditions, the cubic ISL with QLT is also about as accurate as the cubic ISL with no CDR. For sufficiently smooth initial conditions, the cubic ISL has third-order accuracy, whereas a CDR limits the solution to second-order accuracy.
7.2. Sphere. In this section, we construct a standalone passive tracer test framework that exercises the three properties of mass conservation, shape preservation, and tracer consistency, and we use it to examine CDR solution quality. In the test framework, the mesh is a cubed-sphere Gauss--Lobatto--Legendre (GLL) mesh as in HOMME [11]. A tracer mixing ratio is advected using the classical SL method specialized to this setting: the interpolant is the degree-p GLL interpolant [11]. This interpolant can in general cause the method to be unstable, but for p = 2, we have observed that it is empirically linearly stable across a broad range of problems; outside the scope of this paper, we have determined it to be stable analytically for the problem of one-dimensional periodic translation on a uniform mesh. To exercise tracer transport coupled to a different discretization for the total mass continuity equation, we solve for total density using the cell-integrated Jacobian-combined transport method described in [7]. Then the tracer densities must be made consistent with respect to this total density field. HOMME implements a continuous Galerkin spectral element method. But in each time stage, calculations are performed in each element separately, essentially treating the basis as discontinuous. Then the discrete stiffness summation (DSS) op- erator, which maintains the three properties, restores continuity [29]. This procedure minimizes communication among processes. We follow this approach in our tests, as it is efficient and natural for the cubed-sphere GLL mesh. In particular, CDR is performed at two levels: globally among elements, and then locally within each element on the (p + 1) 2 nodes. In the global phase, each index of the vector to which CDR is applied corresponds to the cell tracer mass. The global phase redistributes mass as necessary to make each cell-local problem feasible.
In an SL time step, the cell nodes are advected backward in time according to the flow field; GLL interpolation transfers the mixing ratio field at the previous time step to these advected points; bounds on the mixing ratio are set as the extrema of the values at the nodes supporting interpolation; cell tracer masses and bounds are computed and used to set the global CDR problem; the global CDR, and then the cell-local CDR, problems are solved; finally, the DSS operator is applied. In our simulations, only the global CDR is varied; the cell-local CDR is always 2-norm minimization. Because the cell-local CDR will resolve many of the discrepancies, we can expect differences in the solutions among CDR methods to be less pronounced than in the 1D study.
We run three passive tracer simulations: rigid rotation and the divergent-flow and nondivergent-flow cases described in [19,23]. In each, the Gaussian hills, cosine bells, and slotted cylinder ICs described in [23] are used. Essentially exact flow is computed using an adaptive Runge--Kutta method with a tight tolerance. The coarsest mesh is a cubed sphere with n e \times n e elements per face, n e = 18. For this mesh, we use 72 time steps. Each refinement level doubles n e and the number of time steps. As our purpose here is not to evaluate a particular tracer transport method, but rather to evaluate just the CDR, we omit the full set of diagnostics described in [23]. Numerical checks show that each tracer in each case conserves mass globally to at least 12 digits at the end of 12 days and is bounded by the initial global extrema to at least 14 and usually 15 digits. We evaluate the global 2-norm minimization, ClipAndAssuredSum, and QLT CDRs, and also the case of no CDR. Figure 7.5(a) plots l 2 relative error in the tracer mixing ratio field versus mesh refinement level. Figure 7.5(b) plots the relative total amount of mass redistributed among cells in the simulation. This value is the integral, over both the sphere and the 12-day simulation period, of the absolute value of the global-phase CDR correction, divided by the tracer's global mass. Hence it captures the effects of simultaneous discretization refinement in space and time. Each plot shows results for a particular flow field and for all initial conditions. Curves cluster by initial conditions. The left plot in Figure 7.5(a) labels these clusters; each other plot has the same pattern of clusters. For the essentially infinitely differentiable Gaussian hills initial conditions, all CDRs produce nearly identical errors and redistribute nearly identical amounts of mass. The rigid rotation field with this initial condition shows that the method's order of accuracy is 2. The total amount of mass redistributed over the 12 days diminishes at second order as well. For the discontinuous slotted cylinders initial conditions, the error produced by each CDR is nearly identical, but QLT redistributes more mass, consistent with its quasi-local redistribution pattern. For the once continuously differentiable cosine bells IC, all CDR solutions are more accurate than the field with no CDR (``None""). The QLT solution is slightly more accurate than the other CDR solutions and redistributes more mass.
8. Summary and conclusions. We discussed three classes of constrained density reconstructors to solve the tracer transport property preservation problem. They are particularly well suited to remap-form semi-Lagrangian tracer transport methods. They enable such transport methods to be coupled to the atmospheric dynamical equations consistently essentially regardless of these equations' discretizations, and the resulting tracer transport component to conserve mass and preserve shape. The primary CDRs provide a mass-bounded correction to the primary constraint set. These can be used within an outer algorithm that assures that a correction within a backup safety constraint set is computed if the primary set is empty. The combination of semi-Lagrangian tracer transport and an efficient CDR provides an efficient tracer transport component in the atmospheric dynamical core.
QLT is the most general of the algorithms discussed. It can use any primary CDR, wrapped by ReconstructSafely, as a node subproblem solver, as long as this solver satisfies correctness and solution magnitude conditions. QLT requires a tree over the mesh. The tree is general; each node can have an arbitrary number of child nodes. Hence QLT can produce subproblems as small as two-dimensional or, in the limit of a tree having exactly one node, the global problem. QLT has the full safety guarantees ReconstructSafely provides and returns a correction whose 1norm is bounded by (3.3). If the subproblem solver does not perform communication, which is the intended use case, then QLT uses lower communication volume than any other CDR discussed. If QLT is used to correct an already mass conserving tracer transport discretization, its correction redistributes mass quasi-locally. x i + my i = \= x i + m\alpha (z iv i ) + mv i . First, suppose z i \leq v i ; then x i \leq \= x i + mv i \leq u i , the final inequality by Proposition 3.1. Second, suppose z i > v i . As \alpha \leq (u i -\= x imv i )/[m(z iv i )] by construction in MakeBestConvexCombination, The case m < 0 follows by symmetry.
Proof of Proposition 4.1. We carry out the full proof for \bfitx \in \scrT w2 and then discuss differences in the case of \bfitx \in \scrT w1 . The method of proof is to show that for \bfitx \in \scrT w2 , we can choose \bfitp , \bfitn , \bfitphi , \bfitalp , \bfitbet, \lambda 1 , \bfitmu \bfone , \bfiteta \bfone such that the KKT conditions for \scrP 1 are satisfied. If these conditions are satisfied, then \bfitx \in \scrT 1 . The proof proceeds by case analysis, with a two-level case-analysis tree. The root considers each of \lambda > 0, \lambda < 0, and \lambda = 0. Once the sign of \lambda is fixed, analysis for each i can proceed separately; the KKT conditions decouple in i conditioned on \lambda . Thus, the second level of the case