Alternating Minimization Algorithm with Automatic Relevance Determination for Transmission Tomography under Poisson Noise

We propose a globally convergent alternating minimization (AM) algorithm for image reconstruction in transmission tomography, which extends automatic relevance determination (ARD) to Poisson noise models with Beer's law. The algorithm promotes solutions that are sparse in the pixel/voxel-differences domain by introducing additional latent variables, one for each pixel/voxel, and then learning these variables from the data using a hierarchical Bayesian model. Importantly, the proposed AM algorithm is free of any tuning parameters with image quality comparable to standard penalized likelihood methods. Our algorithm exploits optimization transfer principles which reduce the problem into parallel 1D optimization tasks (one for each pixel/voxel), making the algorithm feasible for large-scale problems. This approach considerably reduces the computational bottleneck of ARD associated with the posterior variances. Positivity constraints inherent in transmission tomography problems are also enforced. We demonstrate the performance of the proposed algorithm for x-ray computed tomography using synthetic and real-world datasets. The algorithm is shown to have much better performance than prior ARD algorithms based on approximate Gaussian noise models, even for high photon flux.


Introduction.
Tomographic image reconstruction is the process of estimating an object from measurements of its line integrals along different angles [1]. Tomographic reconstruction is an ill-posed problem [2] in the sense that multiple solutions exist that are consistent with the data. It is therefore often desirable to incorporate prior knowledge, which includes the statistical characteristics of the measured data, and properties that are expected of the image.
The images in transmission tomography represent attenuation per unit length of the impinging beam due to energy absorption inside the imaged object. Well known examples are x-ray computed tomography (CT) [3], and electron tomography [4]. The attenuation must be nonnegative, since a negative value corresponds to an increase in the energy of the exiting beam, which is non-physical. These images can often be well approximated by a sparse representation in some transform domain. In some applications, sparsity is present directly in the native image domain [5], but more commonly it is present in the pixel-difference domain [6] or in some transform domain, such as the wavelet, curvelet or shearlet transforms [7,8].
There are generally two types of approaches for tomographic image reconstruction. The first type consists of one-shot algorithms, such as filtered back-projection [1] and its extensions, which rely on analytic formulas. These algorithms do not incorporate the type of prior knowledge mentioned above and also produce prominent artifacts when some of the data are missing. The second type consists of iterative algorithms, based on minimizing some cost function. The latter enable incorporation of prior knowledge about the image, by adding penalties that promote sparsity in some representation and adding positivity constraints. In addition, iterative algorithms are far more robust to missing data.
Many transmission tomography problems involve data consisting of counts, e.g., the number of detected photons in x-ray CT, or the number of detected electrons in electron tomography. Therefore, at least some component of the measurement noise is modeled as independent Poisson random variables [6,4]. In the x-ray medical imaging community, the Poisson likelihood model is standard [6] and provides the basis for a well known class of statistical iterative algorithms [6,9,10,11]. In addition, the mean of each measurement is modeled according to Beer's law [6], which is an inherent non-linearity in transmission data.
We develop a new sub-class of statistical iterative reconstruction algorithms, inspired by automatic relevance determination (ARD) [12,13,14]. It allows the incorporation of prior knowledge, including positivity of the image, Poisson statistics of the measured signals, Beer's law and also any underlying sparse representations for the imaged object. What sets this new sub-class apart from prior art is the fact that it automatically learns the balance between data-fidelity and the penalty due to prior knowledge, thus avoiding the use of any tuning parameters and the difficulty in determining them. In addition, it also computes posterior variances, which can be used for adaptive sensing/experimental design, or for determining the system's sensitivity to noise. To the best of our knowledge, the proposed algorithm is the first ARD-based algorithm for Poisson likelihoods that is both scalable to very large problems and that is also guaranteed to converge.
This work is mainly motivated by x-ray CT, used in medical and security applications [1], but the methods presented here can be applied more generally to any type of transmission tomography that involves count data.
1.1. Model. We consider a Poisson noise model for transmission tomography [6] given by where y ∈ Z n are the measurements, Poisson[λ] denotes the Poisson univariate probability function with rate λ, and the exponential function encompasses Beer's law (a physical property of the signals [6]). The column vector x ∈ R p×1 + denotes the lexicographical ordering of the 2D/3D image, with each entry x j representing the attenuation of the x-ray beam per unit length at the jth pixel/voxel. The vector φ T i ∈ R 1×p + is the ith row of the system matrix Φ ∈ R n×p + . The measurement y i corresponds to the ith source-detector pair and the operation φ T i x represents a line integral over the image domain from source to detector (see Fig. 1). η i is the mean number of photons that would have been measured using the ith source-detector pair if the imaged object was removed. We make the standard assumption that η i was measured for all i in an independent calibration experiment and is known to us.
To construct Φ we use the approach in [15], where an entry φ ij at the ith row and jth column of Φ is equal to the length at which the line between the ith source-detector pair intersects the jth pixel/voxel (see Fig. 1). Each ray transverses at most O(p 1/D ) pixels/voxels, where p is the total number of pixels/voxels and D = 2, 3 for 2D/3D images, respectively.  In this example, a node colored in light green and marked by a star has 4 neighboring pixels shown in dark green. Each neighbor has a weight of 1/4 when computing their average. The image is augmented to introduce a boundary of pixels with zero values, represented by white nodes. The latter are not included in the forward model, and are merely introduced to provide the interpretation for the 1/4 weight when averaging less than 4 neighbors, e.g., the neighbors of the node colored in light red.
Accordingly, each row φ T i in the system matrix has only O(p 1/D ) non-zero entries. We absorb a reference linear attenuation coefficient into Φ so that x is dimensionless.
1.2. Background. The method proposed in this paper is an extension of automatic relevance determination (ARD). For convenience of readers unfamiliar with ARD, we first provide background to explain the potential advantages of ARD motivating this work, and discuss the key differences between ARD and other, more common, sparse estimation methods. Our aim here is to emphasize previous challenges in adapting ARD to the Poisson model in (1.1), challenges which we aim to resolve with the proposed method.
1.2.1. Type I Methods. Statistical iterative methods for CT [6,9,10,11] cast the problem of image reconstruction as a penalized maximum likelihood estimation problem, given by x * = arg max We refer to (1.2) as the "Type I" estimation problem; log p(y|x) is the log-likelihood function which represents the data-fit. For the Poisson model in (1.1), maximizing the log-likelihood is equivalent to minimizing the I-divergence 1 between the actual measurements y and the predicted measurements η ⊙ exp(−Φx) [11], where ⊙ denotes an elementwise product. The expression log p(x|β) is a penalty that promotes solutions with desired properties, e.g., the ℓ 1 penalty [5] and the total variation (TV) penalty [17,18]; β is a tuning parameter that determines the trade-off between data-fit and the prior, or the bias/variance trade-off. The 1 I-divergence between two vectors p, q ∈ R M + is defined as I(p||q) = M j=1 pj log( p j q j ) − pj + qj [16] "best" value of β is usually unknown and depends greatly on the imaged object at hand. To determine β, it is often required to solve (1.2) many times for different values of β, examine all reconstructed images and then select the value that best fits the needs of the application, based on some chosen figure of merit. For many common choices of p(x|β) considered below, decreasing β will result in a higher variance of the reconstructed image for different noise realizations, and in noisier images. Increasing β will result in increased bias and artifacts (e.g., stair-case artifacts). For a given β, the solution in (1.2) can be interpreted as the maximum a posteriori (MAP) solution, i.e., the maximizing point of the posterior distribution, with the posterior distribution given by The ℓ 1 and TV penalties are non-differentiable and do not allow the use of standard gradientbased methods for solving (1.2). An alternative to TV is the class of neighborhood roughness penalties (Gibbs priors) [10] which have the following form where w jk are positive weights inversely proportional to the distance between the jth and kth pixels/voxels, Θ is a set of ordered pairs {j, k} defining a neighborhood system and π(x) is chosen to be a positive, even, and twice continuously differentiable function. To preserve sharp edges in the image, π(x) is often chosen such that π(x) ∝ |x| for large |x| and can be viewed as a smooth version of the ℓ 1 penalty. A common choice is the Huber-class due to Lange [10] π(x) = δ 2 [|x/δ| − log(1 + |x/δ|)], (1.5) which approaches the quadratic x 2 for δ → ∞ and the linear function δ|x| as δ → 0. This type of penalties allows the use of standard gradient-based methods but come with the price of having an additional tuning parameter δ, so the search for the tuning parameters becomes even more cumbersome. Decreasing δ will result in noisier images but with sharper edges, and increasing δ will result in images which are less noisy but with over-smoothed edges.

Automatic
Relevance Determination (ARD)/Type II Methods. A different approach to sparse estimation is "automatic relevance determination" (ARD) [12]. Originally, ARD has been proposed for Gaussian likelihoods with the prior [13] p(x|γ) = N (x|0, Γ), Γ = Diag(γ), (1.6) where Diag(γ) denotes a diagonal matrix with main diagonal γ. In this approach sparsity is promoted by treating x as latent variables and seeking the newly introduced hyperparameters γ ∈ R p + that maximize the marginal likelihood p(y|γ) where p(y|γ) is also called the "model evidence" or "Type-II" likelihood [14]. We refer to (1.7) as the "Type-II estimation problem." Once γ * is found, the estimate for x is given by the mean of the posterior distribution p(x|y, γ * ). Interestingly, the solutions to (1.7) have many γ j ≈ 0 and the posterior distribution for the corresponding parameters x j becomes highly concentrated around zero 2 . The mechanism that promotes sparsity of γ * has been revealed by Wipf et al. [19,20], with the analysis specifically tailored for Gaussian likelihoods.
The solution to the Type-II problem in (1.7) was originally addressed in [13] using an expectation-maximization (EM) algorithm [21]. In the E-step, the Gaussian posterior probability distribution p(x|y, γ) is computed for given hyperparameters γ. In the M-step, γ are updated based on the mean and variances of the posterior distribution found in the E-step. As opposed to Type I methods, which focus on the mode (maximum) of the posterior distribution, Type-II (ARD) methods also compute the covariance matrix (during the E step), or at least the marginal variances, which provide additional information. ARD-based algorithms are therefore computationally more demanding than Type-I methods, mainly due to the expensive task of estimating the variances. Previous extensions of ARD [20,22] require the same type of operations and rely on approximations to the variances [22] to maintain scalablility.
Despite computational complications, the ARD approach can be quite rewarding and excel in hard or ill-posed problems, as evidenced in several recent applications [23]. A very significant advantage of ARD methods is their ability to automatically learn the balance between the datafit and the degree of sparsity. This avoids the tuning of nuisance parameters, which is often non-intuitive, object-dependent, and requires many trials. The posterior variances are key in determining this trade-off. The variances can also be used for adaptive sensing/experimental design [22], which is outside the scope of this paper and will be the subject of future work.

Difficulties in Previous ARD-based Methods for Poisson Models.
The ARD-EM algorithm described in Sec. 1.2.2 [13] cannot be directly applied to the Poisson likelihood in (1.1) since the E-step is intractable due to the non-conjugacy 3 of the prior in (1.6) and the likelihood. Although we are not aware of any previously published work on ARD for Poisson models with Beer's law, a simple remedy to the non-conjugacy is the Laplace method, discussed in [13]. In that approach, the posterior is approximated during the E-step using a multivariate Gaussian distribution, with mean equal to the MAP solution and a covariance matrix equal to the inverse of the Hessian matrix evaluated at the MAP solution. However, the resulting "EM-like" algorithm is not based upon a consistent objective function, and is not guaranteed to converge. In addition, it requires inversions of the Hessian matrix with computational complexity O(p 3 ), where p is the total number of pixels/voxels. This is not practically feasible for x-ray CT, where p could be the order of 10 8 and higher. Instead, one can use more sophisticated methods [22,24,25] to indirectly estimate the diagonal of the inverse, i.e., the marginal variances, since only these are required for updating the hyperparameters γ [13]. Note however that the work in [22,24,25] does not consider Laplace approximations, so these methods have not been tested for this case. Irrespective of how the variances are computed, the "EM-like" algorithm with the Laplace approximation is unsatisfactory from a principled perspective due to the lack of an objective function and due to the lack of 2 by employing Bayes' rule in (1.3), it is readily shown that if the prior is concentrated around xi = 0 then the corresponding posterior distribution is also concentrated around xi = 0 3 a prior is said to be conjugate to the likelihood if both the prior and posterior probability distributions are in the same family. The posterior is then given in closed-form.
convergence guarantees. The objective is also needed to assess convergence in practice. In addition, we are faced with conceptual difficulties, since the analysis that revealed the sparsity promoting mechanism of ARD by Wipf et al. [14,19,20] is restricted to Gaussian likelihoods.
1.3. Contributions. The contributions of this paper are summarized as follows. 1. We present a new alternating minimization algorithm for transmission tomography, which extends automatic relevance determination (ARD) to Poisson models with Beer's law. The proposed algorithm is free of any tuning parameters and also has a consistent objective function that can be evaluated in a computationally efficient way, in order to assess convergence in practice. 2. We reveal the sparsity-promoting mechanism in the proposed ARD-based algorithm for Poisson likelihoods. This preserves insights similar to those gained from previous work [14,19,20] for Gaussian likelihoods. 3. We derive a scalable AM algorithm for transmission tomography, that is well adapted to parallel computing. We reduce the computational complexity considerably by simplifying the algorithm into parallel line searches (for each pixel/voxel) with the scalar mean and variance updates also done in parallel. 4. We present a convergence analysis for the scalable AM algorithm and establish guarantees for global convergence 4 . To the best of our knowledge, this is the first ARD-based algorithm for Poisson models that has convergence guarantees. 5. We study the performance of the algorithm on synthetic and real-world x-ray datasets.
1.4. Outline of the Paper. The rest of this paper is organized as follows. In Sec. 2 we present a general overview of the proposed AM algorithm, which extends ARD to Poisson likelihoods with Beer's law, and explains how sparsity is promoted. In Sec. 3 we present a scalable AM algorithm based on separable surrogates for the objective function, which is well adapted to parallel computing and is feasible for large-scale problems. In Sec. 4 we present a detailed analysis of the convergence properties of the algorithm presented in Sec. 3. In Sec. 5 we make connections between the proposed algorithm and related previous methods. We also present an alternative view of the proposed ARD method that further explains how sparsity is promoted. In Sec. 6 we extend the proposed ARD framework to allow the modeling of sparsity in overcomplete representations, while retaining the convergence properties studied in Sec. 4. In Sec. 7 we study the performance of the proposed scalable AM algorithm on synthetic and real-world datasets with a parallel implementation used in the experiments. Conclusions are presented in Sec. 8.

Variational ARD.
2.1. Prior. In many applications of transmission tomography sparsity is not present directly in the native basis but in some transform domain. Therefore, we consider a generalized prior (similar to Seeger and Nickisch [22]) to address these cases where γ ∈ R p + are newly introduced hyperparameters, and we assume a transformation s = Ψx, (2.2) where Ψ is known and s is approximately sparse. By choosing Ψ = I, the prior in (2.1) reduces to (1.6). Any scalable algorithm will rely on efficient matrix-vector multiplication Ψx. This is achieved if Ψ is sparse or has some special structure, such as discrete Fourier or wavelet transforms. In this work we focus on neighborhood penalties, which are commonly used for x-ray CT, the application explored in Sec. 7. We consider Ψ given by where Θ is the set of ordered pairs {j, k} defining a neighborhood system and N max is the maximal number of neighbors for a pixel/voxel. An example is shown in Fig. 2. The neighborhood is usually chosen to be small enough such that Ψ will be sparse with at most N max + 1 non-zero elements in each row. For the choice in (2.3), each s j in (2.2) is equal to the difference between the jth pixel/voxel and the average of its neighboring pixels/voxels, i.e., s j = x j − k:{j,k}∈Θ [x k /N max ]. For pixels/voxels at the boundaries of the image domain, we assume additional neighbors outside the image domain with values set to zero (Dirichlet boundary conditions), such that the number of neighbors for each pixel/voxel is the same (see Fig. 2). These boundary conditions agree with physical constraints typical of many real experimental setups, where the attenuation (x) is zero outside the region of interest. Note that sparsity of the above pixel/voxel differences implies piecewise smooth images.

Variational
View of the EM Algorithm for ARD. We start with a known variational view [27] of the EM algorithm which is used to solve (1.7) for Gaussian noise models. Here we introduce some of the notation used throughout this work, and discuss the type of difficulties encountered when trying to extend ARD to non-Gaussian noise models.
We rewrite the evidence as q(x) is the proposed variational distribution and p(x|y, γ) is the exact posterior, F is the free variational energy (FVE) or negative evidence lower bound (ELBO), D KL [q||p] is the Kullback-Leibler (KL) divergence, and E q is the expected value with respect to q(x). The EM algorithm can be viewed [27] as minimizing the free variational energy (FVE) function F in (2.5) (or maximizing the ELBO) by alternating between updating q(x) in the E-step and updating γ in the M-step, as summarized below In the E-step, the optimal solution with zero KL divergence is q (t+1) = p(x|y, γ (t) ), i.e., the exact posterior based on the previous γ (t) values, which can be computed directly from (1.3). In the M-step, minimizing F with respect to γ can only increase the KL term, since it is nonnegative and overall the evidence is also increased.
For non-Gaussian likelihoods and the prior in (2.1), the E step in (2.7) is intractable, i.e., the expression for p(x|y, γ (t) ) is unknown and its direct calculation according to (1.3) involves a high-dimensional integral that is prohibitive to compute. Instead, we are forced to restrict q(x) to a tractable family of distributions to approximate the posterior. However, this results in an algorithm that is no longer guaranteed to increase the evidence [28]. This occurs because the KL term in (2.4) is never zero and can either decrease or increase after updating γ. Furthermore, a local maximum of the evidence with respect to γ is generally not a local minimum of F (or a local maximum of the ELBO), except for degenerate cases [28].
Therefore, when considering the above approach to extend ARD, we immediately encounter a conceptual difficulty, since most previous theoretical analysis of ARD [14,19,20] is based on the maximization of the evidence and its specific closed-form expression for the Gaussian likelihood. It is not immediately clear how to extend the ARD framework to the Poisson noise model in (1.1) in a manner which will preserve the insights gained from [14,19,20] and will explain why sparsity is being promoted by the ARD algorithm.
Perhaps surprisingly, we shall show that the variational algorithm discussed above still preserves the principles of ARD, despite the fact that it is not based on maximizing the evidence, but rather on maximizing the ELBO (lower bound for the evidence). We call this framework "Variational ARD" (VARD).

Alternating-Minimization Algorithm for Variational ARD.
As mentioned in Sec. 2.2, the true posterior distribution p(x|y, γ) corresponding to the Poisson model in (1.1) and the prior in (2.1) is intractable, so we cannot solve the E-step in (2.7) exactly, and an approximation to the posterior should be used instead. Perhaps the most common approach is to use a free-form factorized posterior distribution (mean field variational Bayes) [29], however this is also intractable here due to the lack of conditional conjugacy. Instead, we restrict the posterior to a parametric multivariate Gaussian form. In addition, in order to reduce the computational complexity, we restrict ourselves to factorized distributions, i.e., where N (x i ; m i , v i ) denotes a normal univariate distribution with mean m i and variance v i . We propose to modify the EM algorithm in (2.7)-(2.8) by adding the constraint q ∈ D during the E-step in (2.7). The resulting algorithm is defined by the following two steps (2.11) which are repeated until convergence. Note that we have rewritten the FVE of (2.5) as The first term in (2.10) does not depend on γ and was therefore omitted in (2.11). In the backward (B) step in (2.10), we update the approximation to the posterior q, which is reduced to only searching for the variational parameters (m, v). In the forward (F) step in (2.11), we update the generative (forward) model by finding γ. Note that due to the constraint q ∈ D in (2.9), this is no longer an EM algorithm, and therefore we distinguish between the "E-step" and the "B-step," and similarly between the "M-step" and the "F-step." Note that the posterior mean m is the main object of interest. It provides an estimate for the attenuation coefficients, which must be nonnegative. Accordingly, a non-negativity constraint on m was added in (2.10). The function in (2.10) is only defined for positive v values and using the extended-value function we assume F = ∞ for v j ≤ 0 for any j.
The AM algorithm formed by repeating the updates in (2.10)-(2.11) will reduce the FVE at each iteration (ELBO is increased). Importantly, as shown below, the AM algorithm leads to near-sparse solutions, i.e., many of the marginal posterior distributions will be highly concentrated around zero in the transform space defined by (2.2). The roles of v and γ in promoting sparse solutions will be made clear in Sec.

2.4.
For the Poisson model in (1.1), the prior in (2.1) and the approximate posterior proposed in (2.9), the objective function in (2.10) can be written as (dropping iteration number) where φ ij = (φ i ) j are the entries of the forward operator, as defined in (1.1), and ψ ij = (Ψ) ij are the entries of the Ψ matrix defined in (2.1). In deriving the last term in (2.15), we assumed that Ψ is square and invertible 5 . We provide a generalization in Sec. 6 which removes this restriction, and applies to any Ψ.
Recalling the discussion in Sec. 1.1 about the sparse structure of φ i , we note that the objective function in (2.12)-(2.15) has a computational complexity of O(np 1/D ) 6 which is feasible to compute in order to assess convergence in practice. In contrast, the original ARD [13] or other extensions [20,22] require O(p 3 ) operations to evaluate the objective which is prohibitive for large scale problems where p is very large, so the objective can only be approximated, e.g., via the stochastic estimator of Papandreou and Yuille [25].
The solution to the F-step in (2.11) is obtained by solving ∇ γ D KL [q||p] = 0 which yields where ⊙ denotes elementwise product. It is straightforward to verify that (2.16) is indeed a minimizing point of (2.11) (for finite γ). For fixed γ, the FVE objective function is jointly convex with respect to (m, v), which implies that any local minimum of the objective in (2.10) is also a global minimum. Note that the FVE function is not jointly convex with respect to (m, v, γ), so local minima are possible. In a similar way, the original ARD-EM algorithm [13] is also only guaranteed to find a local maximum of the evidence [32].
2.4. Discussion. The B-step in (2.10) is well understood from a Bayesian variational inference perspective, and it is equivalent to minimizing the KL divergence between the proposed posterior q(x) and the true posterior p(x|y, γ), as described in Fig. 3

(a).
However, the F-step in (2.11) is somewhat puzzling, as it is unclear why it promotes sparse solutions, especially since using a Gaussian prior for Type-I MAP estimation does not lead to sparse solutions. To explain how sparsity is promoted, consider the s-space defined in (2.2), where the image has a sparse representation. Note that the KL divergence in (2.11) is invariant under the transformation in (2.2). The prior and posterior distributions become where Ψ is assumed to be invertible. We next make use of the following representation for a factorized Student's-t distribution [19] t(s|ν) To obtain (2.18), one exploits Fenchel duality [33] is the Fenchel dual of log t( √ s j |ν), and then uses the monotonically increasing transformation λ j = −1/2γ j to obtain log t( √ s j |ν) = max γ j ≥0 (−s j /2γ j + f * (−1/2γ j )). Exponentiating both sides of the dual representation for log t(s i |ν) one obtains (2.18) [19]. For ν → 0, t(s i |ν) ∝ 1/|s i | with sharp spikes at s i = 0. Note in (2.18) that ν → 0 implies ϕ → 1, and the multivariate t-distribution t(s|ν) becomes an envelope for the ARD prior p(s|γ) in (2.17). This is illustrated in Fig. 3(b) using a 2D example with s ∈ R 2 , which depicts the proposed posterior q(s) and the prior p(s|γ), with the latter "trapped" inside the level sets of t(s|ν) due to (2.18). One can see in Fig. 3(b) how selecting the widths γ of the Gaussian prior distribution to minimize the KL divergence between q(s) and p(s) promotes the shrinking of γ k for some k (minimizing KL divergence between the two distributions is illustrated as maximizing the overlap between the level-set ellipsoids corresponding to the majority of the probability mass of these distributions). In the example of Fig. 3, the posterior has a much lower mean and variance along s 2 than along s 1 , and the F-step will further shrink γ 2 . The corresponding parameter s 2 will be shrunk in the sense that the marginal prior p(s 2 |γ 2 ) = N (s 2 ; 0, γ 2 ) will be more concentrated around zero, and so will the true posterior p(s 2 |y) (see (1.3)). In the following B-step, the approximate posterior q will also concentrate further around zero due to the zero-forcing property of the KL divergence [34] (recall that in the B-step we minimize KL-divergence between q and the true posterior). Note that repeating the B and F steps is required to obtain sparse solutions. The description of the mechanism responsible for promoting sparsity was inspired by the work of Wipf et al. [19], which uses similar arguments; however, that work relies entirely on the evidence function p(y|γ) in (2.4) since a Gaussian noise model was considered. In contrast, our proposed approach is based on the free variational energy (FVE) function and applies more generally, even to cases where the expression for the evidence is unknown and an algorithm that increases the evidence is unavailable. This is important for non-Gaussian noise models, for which the evidence function is intractable, so we are forced to work with the FVE function instead. Another important difference between our work and that of Wipf et al. [19] is that we show how sparsity is promoted during the iterations of the algorithm, whereas in [19] it has been shown that a solution to (1.7) is sparse. Note also that the work in [19] considers the original ARD prior in (1.6), whereas we consider the more general case in (2.1).
Step (a) Backward step: level sets for the likelihood p(y|x) (red curve), prior p(x) (blue curve) and the proposed posterior q(x) (black curve); true posterior p(x|y) is concentrated in the orange region; proposed posterior q(x) is chosen to minimize the KL divergence with p(x|y). (b) Forward step: a level set for the student's t distribution (dashed black curves) forming an envelope for the level set of the zero mean Gaussian prior p(s) (blue curves); p(s|γ (1) ) and p(s|γ (2) ) correspond to two different choices for γ where p(s|γ (2) ) assigns equal variance along s1 and s2 and p(s|γ (1) ) assigns small variance along s2 and large variance along s1. Choosing p(s|γ (1) ) maximizes the overlap between the prior p(s) and the proposed posterior q(s) (leading to lower KL divergence). s2 is shrunk in the sense that its exact and proposed posterior distributions are further concentrated around smaller values due to the concentration of the prior about zero. Note that Fig. (a) is in the x domain whereas Fig. (b) is in the s domain.

Parallel AM Algorithm.
Here we present a modified AM algorithm, which uses an alternative B-step. The rationale behind the modified algorithm lies in decoupling the variational parameters m j , v j of each pixel/voxel. This reduces the algorithm into simultaneous simple tasks that can be computed in parallel, leading to high scalability. First, note that (2.13) and (2.14) are not additively separable with respect to the individual parameters m j , v j . We shall therefore introduce a surrogate function, making use of the following lemma.
Lemma 3.1 (The convex decomposition lemma [11]). Let f (x) be a convex function with x ∈ R p and let x ∈ R p be a reference point. Let r j ≥ 0 such that r j = 1. Then where e j denotes a column vector with all zeros except 1 at the jth entry.
Proof. The proof readily follows from Jensen's inequality [33] f where we added and subtracted a reference point x and then applied Jensen's inequality to Before we proceed, we make several definitions to simplify the following derivations.
where y i , φ i and η i are defined in (1.1). To distinguish between quantities associated with the mean and variance, we denote any quantities associated with the variance using a tilde. In i is the projection of the posterior mean m (t) along the ith line, which we shall call "mean-type" projection;p (t) i is a projection of the posterior variance v (t) along the ith line using an elementwise squared forward operator, which we term a "variance-type" projection.
i is the predicted Poisson rate for the ith measurement, based on the expectation with respect to posterior distribution q computed at iteration t. In (3.5), b y j is the backprojection of measurements y to the jth pixel; b j are the back-projections of the predicted Poisson rate in (3.4) to the jth pixel, using the backward (adjoint) operator and the squared backward operator, respectively, which we call "mean-type" and "variance-type" backprojections.
By noting that F 1 in (2.13) is jointly convex with respect to (m, v), and applying Lemma 3.
where we made use of the definitions in (3.3)-(3.5). Note that (3.6) is separable with respect to the components m j and v j . We are free to choose r ij ,r ij as long as the conditions of Lemma 3.1 are satisfied. We follow O'Sullivan and Benac [11] to apply a minor (but useful) extension of the lemma, where we introduce a dummy variable denoted m 0 with a corresponding weight r 0 ≥ 0. We set m 0 = 0 so that the resulting algorithm does not involve m 0 in any way. This modification leads to the (relaxed) requirement that instead of equality. The above relaxation allows us to make the following choice which satisfies the conditions of the extended lemma. Substituting (3.8) into (3.6), and using the definitions in Eq. (3.5), we obtain The particular choice in (3.8) is very useful since the sum over i in (3.6) needs to be done only once instead of repeating it for each update of m j during the search for the minimum of S 1 .
We derive a separable bound for (2.14), F 2 ≤ S (t) 2 (m, v), by noting that x 2 is a convex function and using lemma 3.1 again to obtain j (also defined in (3.13)). Similarly to (3.7), we require j s kj ≤ 1 using the dummy variable extension. One choice for s kj is Substituting (3.11) into (3.10), we obtain (3.12) where we introduced the following definitions .13) is the estimated kth element in the sparse representation for the object based on the posterior mean m (t) , e.g., if Ψ is chosen according to (2.3) (assuming sparsity in the pixel/voxel-difference space) then d where Θ defines the neighborhood system and N max is the maximal number of neighbors. In deriving (3.14) we made use of j s kj = 1.
Combining (3.9),(3.12), and (2.15) (the latter is already separable), we obtain a surrogate function S (t) that is a global majorizer of F, i.e., F(m, γ) ≤ S (t) (m, v), given by % "variance-type" projections % 6: for j=1 to p do % executed in Parallel % 11: forward-projection p (t) and a "variance type" forward-projectionp (t) (lines 4-5 in Algorithm 1) which can be computed in parallel. It also requires a "mean-type" backprojection b (t) and a "variance-type" backprojectionb (t) (lines 8-9 in Algorithm 1), which can also be computed in parallel. Each iteration also requires two line searches, one for the mean components m j and one for the variance components v j (lines 11-12 in Algorithm 1). The line searches for all pixels/voxels can be done in parallel. In addition, the scalar updates for mean and variance can also be done in parallel. Note that despite the latter parallelization, b (t) and b (t) provide information about both the mean and variance at the previous iteration which is shared between the mean and variance updates in (3.19)-(3.20) (lines 11-12 in Algorithm 1). Also note that these subiterations for (m, v) are themselves iterations of another alternating minimization algorithm (lines 4-12 of Algorithm 1). Here we chose to use only one subiteration, but any number can be used instead.
A considerable advantage of this algorithm is that it allows fast parallel computing on multi-core computers, or on dedicated hardware such as field-programmable gate arrays (FPGA). Another advantage is that the solution to the constrained problem in (3.19) can be obtained by first solving the unconstrained problem using simple gradient-based methods; if m j < 0 then the solution to the constrained problem must be m j = 0, otherwise the solution is the same. This simple thresholding for m j leads to the optimal solution only in one-dimensional minimization problems, and the same procedure for the original B-step in (2.10) would lead to sub-optimal solutions or even lack of convergence.
The computational complexity of a forward/backprojection is O(np 1/D ), where n is the number of measurements, p is the total number of pixels/voxels and D = 2, 3 for 2D/3D images, respectively (recall the discussion about the sparsity of Φ in Sec. 1.1). The computational complexity of Ψm is O(p(N max + 1)), where N max is the maximal number of neighboring pixels/voxels defined by the prior (recall the discussion on the sparsity of Ψ in Sec. 2.1). Typically, N max ≈ 3 D − 1, n ≥ p, and p 1/D ≫ N max , so the forward/backprojection cost dominants the cost of Ψm. The details regarding the implementations of the line searches (lines 11-12 in Algorithm 1) are given in Appendix B.

Optimality Conditions and Convergence
Analysis. First we provide the necessary optimality conditions for the minimization problems presented in Secs. 2-3 as well as a few definitions in order to simplify the following analysis. We then present a detailed study of the convergence properties of the parallel AM algorithm derived in Sec. 3 (Algorithm 1). We shall refer to this algorithm as "PAR-VARD" (parallel VARD). 1. The necessary Karush-Kuhn-Tucker (KKT) optimality conditions [33] for (m, v) to be a solution to (2.10) are given by replaced by m j and v j , j is defined in (3.15). 2. The necessary KKT condition for γ to be a solution to (2.11) is given by 3. The necessary KKT conditions for (m, v, γ) to be a minimizer of the FVE objective in (2.12)-(2.15) subject to m 0 are given by k replaced by d k and γ k , respectively (see also definitions after (4.2)).

Proof. Follows directly from the KKT conditions
j and Z 1 are defined in (3.5),(3.13), (3.14) and (3.8), respectively. 2. The necessary KKT condition for v j to be a solution to (3.20   j be a solution to (3.19) but without the non-negativity constraint. The corresponding solution to (3.19) with the non-negativity constraint can be expressed as j in (4.11) has the same effect as thresholding in (4.9). Remark 2: note that since m (t+1) ≥m (t) we have Z (t) j ≥ Z 1 . We also introduce a similar definition for the variance.  Remark: sinceb (t) j ≥ 0 and ξ (t) j ≥ 0, it follows from (4.8) that v (t+1) j > 0. Definition 4.6.The I-divergence between two vectors p, q ∈ R M + is defined as We define log( 0 0 ) = 0. Remark: the I-divergence is nonnegative. Definition 4.7.The Itakura-Saito (IS) divergence between two vectors p, q ∈ R M + is defined as We define log( 0 0 ) = 0, and 0 0 = 1. Remark: D IS is nonnegative.

Convergence Analysis. For simplicity we shall denote F (t) F(m (t)
, v (t) , γ (t) ). The following theorem is due to Degirmenci and O'Sullivan. Theorem 4.9. The objective function of (2.12)-(2.15) decreases monotonically during the iterations of PAR-VARD, and the decrease between subsequent iterations is lower bounded by where β (t) ,β (t) , and Z (t) are vectors with components β j is defined in (3.14) (g (t) j ≥ 0). Dividing two vectors should be interpreted as done elementwise.
Proof. The proof is very technical and therefore moved to Appendix A. Several properties follow from Theorem 4.9 as stated in Theorem 4.10.
As t → ∞ all term on the right-hand side of (4.18) must go to zero since they are nonnegative and have a finite positive upper bound due to the fact that F (1) is finite and due to the monotonically decreasing and positive sequence F (t) . (c)-(d): Connectedness of the limit set of the iterates follows from convergence of the Idivergences to zero which implies by Pinsker's inequality [38][Lemma 12. 6 1 → 0. This together with (4.11) and (4.13) implies that m j → 0 ∀j. We are now able to state the guarantees for global 7 convergence of PAR-VARD.
be the sequence of iterates generated by Algorithm 1. Let the solution set Γ be the set of points (m, v, γ) which satisfy the KKT conditions in (4.4)-(4.6). Assume there exists z (1) such that F (1) = F(z (1) ) is finite. Then (a) The iterates are contained in a compact set.
The point-to-set mapping defined by Algorithm 1 is closed. (e) All limit points of the iterates are in the solution set Γ and Proof. (a) By assumption, there is an initial guess z (1) such that F (1) = F(z (1) ) is finite. From the F-monotonicity of the sequence {z (t) } ∞ t=1 , as asserted by theorem 4.9, and the positivity constraints incorporated into the algorithm, the iterates {z (k) } ∞ k=1 are contained in a sub-level set of F given by {z : z 0, F(z) ≤ F (1) }. To show that this set is compact we only need to show that lim z 2 →∞ F(z) = ∞. Using (2.12)-(2.15), it can be verified that this condition is satisfied for any combination of , then from (4.17) we must have F (t) = F (t+1) and all I divergences in (4.17) must equal zero (as they are nonnegative). This implies that β j , ∀j. From (4.11) and (4.13) it follows that m j so z (t) is a fixed point. It then follows from corollary 4.8 that z (t) satisfies the KKT conditions, i.e., z (t) ∈ Γ (d) We use a result by Gunawardana and Byrne [28] which is restated here for convenience.  , v (t) , γ (t) ) → (m (t+1) , v (t+1) , γ (t+1) ) is also closed. (e) Follows from (a)-(d) and Zangwill's generalized convergence theorem [35].

A Different View of VARD.
We present an alternative view of the proposed VARD framework, which will provide further insight into the sparsity-promoting mechanism and will later enable extensions of the AM algorithm. The minimum of the objective function in (2.13)-(2.15) with respect to γ is given by where the explicit expression for E q(x|m,v) − log p(y|x) is given by (2.13), and h[q(x; m, v)] in (5.3) denotes the differential entropy of the distribution q. µ and σ in (5.3) are the posterior mean and standard deviation (std) in the transform domain s = Ψm, respectively.
Note in (5.2) that the term k log(µ 2 k + σ 2 k ) promotes sparsity of both the posterior mean and standard deviation in the transform domain. In fact, without the entropy term h[q] = j log v j /2 in (5.2), any (µ, σ) with at least one pair of components (µ k , σ k ) = 0, would be a global minimum of (5.2), with 2 p possible combinations. The entropy term serves as a barrier preventing solutions with v j = 0, thus avoiding these unwanted minima. Note that the concave term k log(µ 2 k +σ 2 k ) in (5.2) is not separable with respect to the components of m and v, and the convex decomposition approach used in Sec. 3 is not applicable here.

Connections to
Reweighted ℓ 2 Algorithms. The ARD approach is related to the reweighted ℓ 2 algorithm [36] and the comparison between the two for Gaussian likelihoods has been presented in [37]. It is natural to consider the reweighted ℓ 2 algorithm for the Poisson likelihood as well, which consists of the following two steps where ǫ > 0 is a fixed parameter that needs to be tuned. Note that we cannot choose ǫ = 0 since some entries in Ψx are zero and this will result in infinite weights for the penalty in (5.4). For an easy comparison of (5.4)-(5.5) to the VARD algorithm in (2.10)-(2.11), we provide here again the update equations for VARD with the posterior variance fixed It can be seen that in (5.4) the data-fidelity term is the negated log-likelihood, whereas for VARD in (5.6) it is the expectation of this term with respect to the approximate posterior q.
A key difference between the two algorithms is that in (5.5) a tuning parameter ǫ is required, whereas in (5.7) the variances serve as "tuning parameters" with one parameter for each pixel/voxel, and are automatically learned from the data during the B-step.
Next, we make another type of comparison between the two approaches. We rewrite the reweighted ℓ 2 algorithm in (5.4)-(5.5) as x (t+1) = min x 0 Q(x, γ (t) ) and γ (t+1) = min γ Q(x (t+1) , γ), where Q is defined as Minimizing (5.8) with respect to γ by solving ∇ γ Q = 0 results in (5.5) with the superscript (t+1) removed. Substituting this solution into (5.8), we obtain the equivalent Type-I problem The result in (5.9) can be compared to the alternative formulation for VARD given in (5.2)- (5.3). Note that small values of ǫ in (5.9) introduce many local minima, since any solution with at least one component (Ψx) k = 0 will result in a large negative penalty. In this case, any standard gradient-based algorithm will be easily trapped at a sub-optimal solution. On the other hand, large values of ǫ avoid local minima but do not promote sparse solutions. The correct choice for the parameter ǫ is therefore critical.
To facilitate the comparison between reweighted ℓ 2 and PAR-VARD, we propose to modify the former using the separable surrogate approach as in Sec. 3, and derive the corresponding surrogate function. The surrogate function for the log-likelihood is given by S mle j (x j ) in (5.10) was derived in [11] and can be obtained as a special case of the PAR-VARD surrogate in (3.17) when m is replaced by x, v → 0, and γ → ∞, with consequently f j → 0. Adding the surrogate function for the quadratic penalty in (5.4), we obtain the update equation (1.4)-(1.5). The surrogate function for the Huber penalty involving the jth pixel is given by where N j is the number of neighbors of the jth pixel. The updates for the MAP estimate are then given by where S mle j and S H j are given by (5.10) and (5.15), respectively. See Algorithm 3.

Connections to Previous Variational Inference
Methods. The approximation of the posterior using a Gaussian distribution has been studied by Challis and Barber [39]. Although they did not consider the Poisson likelihood in (1.1) specifically, their approach is the same as our B-step in (2.10) if the posterior is chosen to be factorized and if γ are fixed and known. We emphasize that this is very different from the AM algorithm proposed here, which alternates between the B-step and the F-step where γ are also updated. In fact, a single B-step does not result in sparse solutions. In order to obtain sparse solutions, the B and F steps need to be repeated. This is similar in principle to the original EM-ARD algorithm [13], where a single E-step would not result in sparse solutions. Also, the parallel algorithm we propose in Sec. 3 is completely new. An alternative approach to [39] was proposed by Khan et al. [40] but it requires the inversion of a p×p matrix (p is the number of pixels/voxels) so it is not feasible for large scale problems. We note that the method by Seeger and Nickisch [22] cannot be applied here for approximating the posterior since it is restricted to super-Gaussian likelihoods and the Poisson likelihood we consider (see (1.1)) is not super-Gaussian.
6. Generalization of VARD to Nonsquare Ψ. Building on Sec. 5.1, we extend VARD to cases where the underlying sparse representation for the object is overcomplete, i.e., when Ψ in (2.1) has more rows than columns (previously we assumed Ψ to be square). Introducing a non-square Ψ in the AM algorithm of (2.10)-(2.11) results in a considerable complication of the γ update (F-step). The objective function includes the term − log |Ψ T Γ −1 Ψ| which can no longer be reduced to k log γ k as in (2.15) (recall that Γ = Diag(γ)). Taking the gradient of this expression with respect to γ yields γ −2 ⊙ Diag[Ψ(Ψ T Γ −1 Ψ) −1 Ψ T ]. Both the objective function and the gradient now have a computational complexity of O(p 3 ) (p is the number of pixels/voxels) which is prohibitive for large-scale problems.
To avoid the above complication, we use a different approach, where we start from the equivalent formulation in (5.2)-(5.3), and observe that Ψ can also be rectangular. Going back to the AM algorithm formulation, we obtain the same objective function as in (2.12)-(2.15), but with a rectangular Ψ. However, the objective function can no longer be interpreted as the FVE in (2.5) for the prior in (2.1). This is due to the fact that we replaced the term − log |Ψ T Γ −1 Ψ| with k log γ k . Since the difference between these two terms is independent of m and v, minimizing the objective in (2.12)-(2.15) with respect to (m, v) is still equivalent to minimizing the KL divergence between the true posterior and the proposed posterior q. The F-step can no longer be interpreted as minimizing the KL divergence between q and the prior, as in (2.11). Nevertheless, the promotion of sparsity is still asserted by the equivalent formulation in (5.2)-(5.3), where the k log(µ 2 k + σ 2 k ) term promotes sparsity in the transform domain, and it is also supported by numerical experiments, provided in Sec. 7. It is important to note that although the interpretation of the objective function (as a function of γ) is different, it is given by the same expressions in (2.12)-(2.15), so the convergence analysis in Sec. 4 still applies.
To illustrate the type of priors that the above extension allows, we give an example for 2D-images where Ψ is given by where [ ; ] denotes concatenation, and Θ h , Θ v define pairs of neighboring pixels located along the same horizontal or vertical lines in the image, with corresponding matrices Ψ h , Ψ v , respectively. We assign the same weight to horizontal and vertical pixel-differences, so γ is replaced by [γ; γ]. It is also possible to incorporate differences between additional neighbors by concatenating additional Ψ matrices for these additional neighbors in (6.1). It is also easily extendable to 3D images. To understand the difference between the above construction and the original one in (2.3), we write the sparsifying penalty from the objective function in (5.2) for both cases (considering only 2 neighbors per pixel for this example) where P 1 corresponds to the original construction in (2.3) and P 2 corresponds to the construction in (6.1); j 1 and j 2 denote the indexes of the horizontal and vertical neighbors of the jth pixel, respectively, i.e., {j, j 1 } ∈ Θ h and {j, j 2 } ∈ Θ v . For small variances, P 2 promotes small differences along the horizontal and vertical axes separately (similarly to (1.4)). In contrast, P 1 promotes small differences between a pixel and the average of its neighbors.
7. Numerical Results. In the following section we study the performance of the proposed PAR-VARD algorithm for x-ray CT. For medical applications, there is a limit on the intensity of the x-ray source to avoid harmful radiation levels. We consider both simulated and real data for x-ray CT with source intensity levels (η values in (1.1)) which are typical of clinical settings. The geometry considered in the experiments is shown in Fig. 4.
We compare the following methods: (1) The proposed PAR-VARD algorithm described in algorithm 1; (2) Maximum likelihood estimator (MLE) described in algorithm 3; (3) Maximum a Posteriori (MAP) estimator with the Huber neighborhood penalty described in algorithm 3; (4) The reweighted ℓ 2 algorithm described in Algorithm 2. For simplicity, we shall abuse  notation slightly and refer to PAR-VARD as VARD. We compare two different priors corresponding to two choices of Ψ in (2.3) and (6.1), which we refer to as "complete" (C) and "over-complete" (O) representations, respectively. All algorithms were implemented in MAT-LAB and executed on a desktop PC with 4 Intel Xeon E7 CPU's (8 cores each) running a Linux platform. In our implementation, the line searches for each pixel are done in parallel using 32 cores. All algorithms were run till convergence with the objectives being flat.
7.1. Simulated Data. The simulated data were generated for a Shepp-Logan phantom with Poisson noise according to the model in (1.1) and source intensities η = 10 4 , 10 5 , which are typical of clinical systems. We use an image resolution of 256×256, which for the geometry specified in Fig. 4 required 1372 view-angles (full scan around the object) and 512 detectors for each view (702, 464 measurements in total). The number of measurements were calculated according to classical sampling theory as detailed in [2].
The root-mean-squared errors (RMSE) obtained with the different methods is shown in Tables 1-2, with source intensity set to η = 10 5 and η = 10 4 , respectively. For MAP we had to choose values for two different tuning parameters in (1.5), and perform many trials. For reweighted ℓ 2 we had to choose values for one tuning parameter in (5.5) and run multiple trials as well. One can see that VARD with the over-complete representation (VARD-O) yields RMSE that is comparable to the best result with the alternative methods but without any tuning, and using only a single reconstruction. By comparing Tables 1(a),(c) and Tables 2(a),(c), one observes that both reweighted ℓ 2 and MAP are significantly more sensitive to the choice of tuning parameters when the noise level is higher. Both methods result in much higher RMSE than VARD when the tuning parameters are not properly chosen.
It is important to note that in practice the object is unknown (RMSE cannot be computed) so the tuning parameters are chosen by examining the reconstructed image for each and every trial, which is very time consuming. The purpose of the comparison in Tables 1-2 is merely to show the sensitivity of the MAP and reweighted ℓ 2 estimators to the particular choice of the tuning parameters. The lowest RMSE is obtained using reweighted ℓ 2 but the differences in the image compared to VARD are visually very minor, as shown below.
Some of the reconstructed images using these methods are shown Fig. 5. For MAP and reweighted ℓ 2 we present the results for two different choices of tuning parameters, one of which is the best result we could find. One can see in Figs. 5(d) and 5(f) how significant     artifacts occur when the tuning parameters are not chosen correctly. In contrast, the VARD reconstruction in Fig. 5(g) does not have any significant artifacts and the image quality is comparable to the best results obtained with the alternative methods (Figs. 5(c),5(e)). The reconstructed hyperparameters γ are shown in Fig. 6(a). One observes that γ has large values at discontinuities and very low values in regions where the attenuation levels are constant. Recall that VARD with the prior in (6.1) promotes sparsity in the pixel-difference domain. Figure 6(a) clearly indicates that VARD has learned the correct support of the object in the pixel-differences domain. The posterior std is shown in Fig. 6(b) and has a similar structure to γ. Note that one should be careful when interpreting the posterior std since the support of the posterior is (−∞, ∞) (see (2.9)) and the std can extend to negative attenuation values.   In Sec. 7.2 we show a useful interpretation of the posterior variances.
All algorithms were run till 2000 iterations to insure convergence. The images were initialized by setting all pixel values to zero. For VARD, all posterior variances v were initialized to 1. For VARD and reweighted ℓ 2 , the hyperparameters γ were initialized to 100 to make the data-fit term dominant in the first iteration. Figure 5(h) presents the objective function for VARD-O ((2.12)-(2.15)) vs. number of iterations, as well as the objectives for the other methods. The objective is seen to decrease monotonically, as predicted by theory, and can be used to assess convergence in practice. All methods share similar convergence rates.  7.2. Real Data. Next we present reconstructions for a dataset acquired at an experimental laboratory at Duke University with the geometry specified in Fig. 4. The number of views was 360 (full scan around the object) with view-angle resolution of 1 • . For each view 1781 detector pixels were used. The total number of measurements was 641, 160. Additional details about the experimental setup are provided in Appendix C. We imaged a 2D cross section of an acrylic glass cylinder full of cylindrical holes with different diameters. In this case, the image for the ground truth is unknown. However, a good estimate of the truth is based on the specifications from the manufacturer of this phantom. The attenuation of the acrylic glass relative to water is 1.5, and for the inside of the rods it is 0 (air). Figure 7 presents reconstructions using different methods. Image resolution is 256 × 256. For MAP and reweighted ℓ 2 we show two examples of reconstructions, one for the best choice of the tuning parameter that we could find, and one with significant artifacts for a bad choice of tuning parameters. Again, one can see how VARD produces images which are comparable in quality to the best trial with MAP or reweighted ℓ 2 but without any need for repeated trials. Figure 8 shows the reconstructed variances, where one can see again how VARD learns the correct support of the object in the pixel-differences domain.
Finally, in Figs. 9-10 we present an example which illustrates the robustness of VARD to missing data and also demonstrates a useful interpretation of the posterior variances. In this example we scanned a "DUKE" letters object suspended in mid-air using the same experimental system. The letters are made from VeroBlue material with a linear attenuation   coefficient of 1.458 relative to water (more details in caption of Fig. 9). The expected image is sparse so we used the prior in (2.1) with Ψ = I to promote sparsity in the pixel basis.
We subsampled the angular grid for the source-locations (views) so only 1/8 or 1/16 of the views are used; the VARD reconstructions for these two cases are shown in Fig. 9(a) and Fig. 9(b), respectively. In addition, we include in Fig. 9(c) and 9(d) the reconstructions using filtered-back-projection (FBP) [1], which is a linear reconstruction method based on Fourier inversion. Note that the FBP reconstruction exhibits much more prominent artifacts than VARD in the form of streaks, which are due to view undersampling. We present the FBP reconstructions again in Figs 10(a)-10(b) where the displayed attenuation values were limited to a smaller range in order to enhance the artifacts. There is a striking resemblance between the streaks in the posterior std images obtained by VARD in Figs 10(c)-10(d) and the streaks in the FBP reconstructions in Figs 10(a)-10(b). This demonstrates that the variances can indicate regions in the image which are insufficiently sampled. A more detailed study of this relationship will be deferred to a subsequent publication.
Lastly, we note that although we have focused on the 2D fan-beam geometry in the numerical experiments, the proposed method can be applied to any other geometry. 7.3. Implementation Details. In our implementation, the line searches for each pixel (lines 11-12 in Algorithm 1, line 9 in Algorithm 2 and Algorithm 3) are done in parallel and using Newton's method. Since it is not necessary to solve the 1D surrogate problems (line searches) exactly, we found that it is more efficient to perform just one Newton step for each iteration. However, for MAP this leads to divergence whenever the solution is in the linear regime of the Huber penalty. Therefore, we use a 1D trust-region Newton method (see details in Appendix B), which takes more time to execute than a Newton step. For VARD we could use Newton steps for the mean updates, but had to resort to the trust-region method for the variance updates (see details in Appendix B) to insure decrease of the objective. For reweighted ℓ 2 the Newton steps worked successfully. In our implementation, the run time was dominated by the trust-region updates, so both VARD and a single trial of MAP took the same time to converge, which was 40 minutes for the synthetic dataset (35 min for the acrylic  (d) same as in (c) but using 1/16 of the views. For FBP, we used cubic interpolation for the backprojection, and a cropped Ram-Lak (ramp) filter. Image resolution is 128 × 128. The full dataset (before the view subsampling) was acquired using a full 360 • scan around the object with view-angle resolution of 1 • and 1573 detector pixels per view. 510 iterations were used for VARD. glass phantom). A single trial of reweighted ℓ 2 took 20 minutes to converge.
For the problem sizes considered here, we could store the system matrix and its transpose in memory using MATLAB's compressed sparse representation. The time for the forward/backprojections was negligible relative to the trust-region Newton updates. For larger problems, it might not be possible to store the matrices in memory and an "on-the-fly" implementation of the forward/back-projection operations would be required and could take a larger portion of the run time than the line searches. This could lead to comparable run times for MAP and reweighted ℓ 2 (as they both require the same number of forward/back-projections per iteration). Also, in cases where the computational resources are limited, VARD could be slower than a single MAP or reweighted ℓ 2 trial up to a factor of 2, since VARD requires twice the number of forward/backprojections (compare Algorithm 1 with Algorithms 2-3). However, considering the large number of trials that needs to be done when using methods with tuning parameters, VARD still has a lower total run-time than the alternative methods.  8. Conclusions. We presented a new framework for automatic relevance determination (ARD), which we call variational ARD (VARD). It provides an extension of previous ARD methods to Poisson models used in transmission tomography. VARD computes approximations to the posterior mean and variance, with the former being the estimate for the object. We revealed the sparsity-promoting mechanism in VARD, and established key differences compared to previous ARD work that are applicable only to Gaussian likelihoods. An important advantage of the VARD framework is that it avoids the cumbersome process of choosing tuning parameters, while still retaining good image quality. We have shown that the posterior variances play a key role in learning the correct balance between the data-fit and the prior. Straightforward simplifications as reweighted ℓ 2 do not retain this property.
We presented a parallel AM algorithm for VARD (PAR-VARD) which is applicable to large scale problems. We studied a parallel implemention of PAR-VARD using both synthetic and real data. The numerical experiments demonstrate that the images produced by PAR-VARD are comparable in quality to the best MAP and reweighted ℓ 2 estimates. However, PAR-VARD requires much less time when taking into account the many trials required by MAP and reweighted ℓ 2 and the time spend on inspecting the images from different trials. As apposed to previous ARD methods, the exact evaluation of the objective for PAR-VARD is practically feasible for large scale problems and can be used to assess convergence in practice. We provided a detailed study of the convergence properties of PAR-VARD and established guarantees for global convergence.
In addition to the estimate of the object, VARD also computes posterior variances. We have shown that when the object is undersampled, the variances can indicate regions in the image where sampling is insufficient. The variances can therefore be used to define relative levels of confidence in the reconstruction at particular regions in the image. The variances can also be used for adaptive sensing/experimental design [22]. Our results suggest that the variances can be used for image segmentation or boundary detection. These potential uses are not studied here, and will be the subject of future work.
We have explored the use of ordered subsets (OS) [6,11] with PAR-VARD, where the dataset is split into a number of subsets, each processed in a predefined order. This led to considerable speed-ups, similar to the ones reported in the literature for other algorithms using OS. To focus on the algorithm presented in this paper we chose not to present OS results here.
9. Acknowledgments. We would like to thank Prof. Martin P. Tornai for providing us access to the Duke Multi-Modality Imaging Laboratory (MMIL) where the experimental data was acquired. We would like to thank Mr. Andrew Holmgren for conducting the experiments. We thank Dr. Ikenna Odinaka for suggesting the 1D trust-region Newton method used in our code. We thank Mr. David Carlson for proofreading the manuscript and providing helpful comments. This work was supported by the Department of Homeland Security, Science and Technology Directorate, Explosives Division, through contract HSHQDC-11-C-0083.
(A.1) First we focus on the first difference in (A.1) and write it explicitly using (3.4)-(3.5) Appendix B. Implementation Details. Next, we address the implementation of lines 11-12 in algorithm 1 which involve a line search to minimize the surrogate function. Since it is not necessary to find the exact solution for the surrogate, we use a single Newton step for the mean with thresholding, i.e., where ∂ x denotes the derivative with respect to x, S denotes the surrogate in (3.17), and all parameters are defined in (3.5), (3.8), (3.13), (3.14). While Newton steps as in (B.1) are not guaranteed to decrease the objective function, we have found that in all our numerical experiments the objective function decreased at each iteration. If the objective function is not monotonically decreasing, then the step size can be adjusted or a trust-region approach can be used, as described below for the variance updates. For the variance updates, we use a Newton trust region approach [41] where Newton steps for v j are considered (similarly to (B.1)) but the steps are limited to within a trusted region. At each iteration, the trust region is adjusted from both left and right by comparing predicted reduction based on a local quadratic approximation and the actual reduction. If the prediction is poor, the trust region is shrunk. The left boundary is limited to the positive axis. This approach also insures the decrease of the objective function at each step. For MAP with Huber neighborhood penalties we found that a trust region approach is also required to guarantee convergence in cases where the linear part of the penalty dominates. For reweighted ℓ 2 , taking Newton steps was sufficient.