A Preconditioner for a Primal-Dual Newton Conjugate Gradients Method for Compressed Sensing Problems

In this paper we are concerned with the solution of Compressed Sensing (CS) problems where the signals to be recovered are sparse in coherent and redundant dictionaries. We extend a primal-dual Newton Conjugate Gradients (pdNCG) method for CS problems. We provide an inexpensive and provably effective preconditioning technique for the linear systems arising in pdNCG. Numerical results on CS problems demonstrate the performance of pdNCG with the proposed preconditioner compared to state-of-the-art solvers.

where A ∈ ℝ^{m×n} is an under-determined linear operator with m < n and b̂ ∈ ℝ^m are the observed measurements. Although this system has infinitely many solutions, reconstruction of x̂ is possible due to its assumed properties. In particular, x̂ is assumed to have a sparse image through a redundant and coherent dictionary W ∈ E^{n×l}, where E = ℝ or ℂ and n ≤ l. More precisely, W^* x̂ is sparse, i.e. it has only a few non-zero components, where the star superscript denotes the conjugate transpose. If W^* x̂ is sparse, then under certain conditions on the matrices A and W (discussed in Subsection 1.2) the optimal solution of the linear ℓ1-analysis problem

minimize ‖W^* x‖₁, subject to: Ax = b̂

is x̂, where ‖·‖₁ is the ℓ1-norm.
Frequently the measurements b̂ are contaminated with noise, i.e. one measures b = b̂ + e instead, where e is a vector of noise, usually modelled as Gaussian with zero mean and bounded Euclidean norm. In addition, in realistic applications W^* x̂ might not be exactly sparse; instead, its mass might be concentrated on only a few of its components, while the rest decay rapidly. In this case (again under certain conditions on the matrices A and W) the optimal solution of the following ℓ1-analysis problem is provably a good approximation to x̂:

(1.1)  minimize f_c(x) := c‖W^* x‖₁ + (1/2)‖Ax − b‖₂².

In (1.1), c is an a-priori chosen positive scalar and ‖·‖₂ is the Euclidean norm.
1.1. Brief description of CS applications. An example of a redundant and coherent W with orthonormal rows is the curvelet frame, under which an image is assumed to have an approximately sparse representation [6]. Moreover, radar and sonar systems frequently use Gabor frames in order to reconstruct pulse trains from CS measurements [19]. For more applications a small survey is given in [7]. Isotropic Total-Variation (iTV) is another application of CS, which exploits the fact that digital images frequently have slowly varying pixels, except along edges. This property implies that digital images are approximately sparse with respect to the discrete nabla operator, i.e. local differences of pixels. For iTV applications, the matrix W ∈ ℂ^{n×n} is square, complex and rank-deficient with rank(W) = n − 1. An alternative to iTV is ℓ1-analysis where W is a Haar wavelet transform; however, a more visually pleasing reconstruction is obtained by solving the iTV problem rather than the ℓ1-analysis problem, see [21].
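The iTV sparsity model can be illustrated with a small one-dimensional analogue. The sketch below is an assumption for illustration only (a real 1D forward-difference operator, not the paper's 2D complex operator W): it shows that a piecewise-constant signal, dense in the pixel basis, becomes sparse under local differences.

```python
import numpy as np

def diff_operator(n):
    """Forward-difference analysis operator D in R^{(n-1) x n} with
    (Dx)_i = x_{i+1} - x_i: a real 1D analogue of the iTV nabla operator."""
    D = np.zeros((n - 1, n))
    for i in range(n - 1):
        D[i, i] = -1.0
        D[i, i + 1] = 1.0
    return D

n = 16
# piecewise-constant signal: every pixel is non-zero ...
x = np.concatenate([np.full(6, 2.0), np.full(5, -1.0), np.full(5, 4.0)])
D = diff_operator(n)
# ... but the local differences vanish except at the two jumps
print(np.count_nonzero(x), np.count_nonzero(D @ x))  # 16 2
```

The same effect, applied to rows and columns of an image, is what makes iTV reconstruction of natural images possible.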

1.2. Conditions and properties of CS matrices.
There has been an extensive amount of literature studying conditions and properties of the matrices A and W which guarantee recoverability of a good approximation of x̂ by solving problem (1.1). For a thorough analysis we refer the reader to [7,21]. The previously cited papers use a version of the well-known Restricted Isometry Property (RIP) [7], which is repeated below.
Definition 1.1. The restricted isometry constant of a matrix A ∈ ℝ^{m×n} adapted to W ∈ E^{n×l} is defined as the smallest δ_q such that

(1 − δ_q)‖Wz‖₂² ≤ ‖AWz‖₂² ≤ (1 + δ_q)‖Wz‖₂²

for all at most q-sparse z ∈ E^l, where E = ℝ or ℂ.
For the rest of the paper we will refer to Definition 1.1 as W-RIP. It is proved in Theorem 1.4 in [7] that if W ∈ E^{n×l} has orthonormal rows with n ≤ l and if A, W satisfy the W-RIP with δ_{2q} ≤ 8.0e-2, then the solution x_c obtained by solving problem (1.1) satisfies

(1.2)  ‖x_c − x̂‖₂ ≤ C₀‖e‖₂ + C₁ ‖W^* x̂ − (W^* x̂)_q‖₁/√q,

where (W^* x̂)_q is the best q-sparse approximation of W^* x̂ and C₀, C₁ are small constants which depend only on δ_{2q}. It is clear that W^* x̂ must have l − q rapidly decaying components in order for ‖x_c − x̂‖₂ to be small and the reconstruction to be successful. iTV is a special case of ℓ1-analysis in which the matrix W does not have orthonormal rows; hence, result (1.2) does not hold. For iTV there are no conditions on δ_{2q} under which a good reconstruction is assured. However, there exist results which directly impose restrictions on the number of measurements m, see Theorems 2, 5 and 6 in [21]. Briefly, these theorems state that if m ≥ q log(n) linear measurements are acquired for which the matrices A and W satisfy the W-RIP for some δ_q < 1, then reconstruction guarantees similar to (1.2) are obtained for iTV. Based on these results it is natural to assume that for iTV a similar condition applies, i.e. δ_{2q} < 1/2. Hence, we make the following assumption.

Assumption 1.2. The number of non-zero components of W^* x_c, denoted by q, and the dimensions l, m, n are such that the matrices A and W satisfy the W-RIP for some δ_{2q} < 1/2.

This assumption will be used in the spectral analysis of our preconditioner in Section 5.
Another property of the matrix A is the near orthogonality of its rows. Indeed, many applications in CS use matrices A that satisfy

(1.3)  ‖AA^⊤ − I_m‖₂ ≤ δ

with a small constant δ ≥ 0. Finally, throughout the paper we make use of a further assumption which is commonly used in the literature, see for example [24].
1.3. Contribution. In [9], Chan, Golub and Mulet proposed a primal-dual Newton Conjugate Gradients method for image denoising and deblurring problems. In this paper we modify their method and adapt it to CS problems with coherent and redundant dictionaries. There are two major contributions.
First, we propose an inexpensive preconditioner for the fast solution of the linear systems arising when pdNCG is applied to CS problems with coherent and redundant dictionaries. We analyze the limiting behaviour of our preconditioner and prove that the eigenvalues of the preconditioned matrices are clustered around one. This is an essential property which guarantees that only a few iterations of CG are needed to approximately solve the linear systems. Moreover, we provide computational evidence that the preconditioner works well not only close to the solution (as predicted by its spectral analysis) but also in earlier iterations of pdNCG.
Second, we demonstrate that despite being a second-order method, pdNCG can be more efficient than specialized first-order methods for the CS problems of our interest, even on large-scale instances. This performance is observed in several numerical experiments presented in this paper. We believe the reason is that pdNCG, as a second-order method, captures the curvature of the problems, which results in a significant decrease in the number of iterations compared to first-order methods. This advantage comes with the computational cost of having to solve a linear system at every iteration. However, inexact solution of the linear systems using CG, combined with the proposed efficient preconditioner, substantially reduces the computational cost per iteration.
1.4. Format of the paper and notation. The paper is organized as follows. In Section 2, problem (1.1) is replaced by a smooth approximation; the ℓ1-norm is approximated by the pseudo-Huber function. The derivation of the pseudo-Huber function is discussed and its derivatives are calculated. In Section 3, a primal-dual reformulation of the approximation to problem (1.1) and its optimality conditions are obtained. In Section 4, pdNCG is presented. For the convergence analysis of the pdNCG method the reader is referred to [9,13]. In Section 5, a preconditioning technique is described for controlling the spectrum of the matrices in the systems which arise. In Section 6, a continuation framework for pdNCG is described. In Section 7, numerical experiments are discussed which demonstrate the efficiency of pdNCG. Finally, in Section 8, conclusions are drawn.
Throughout the paper, ‖·‖₁ is the ℓ1-norm, ‖·‖₂ is the Euclidean norm, ‖·‖_∞ is the infinity norm and |·| is the absolute value. The functions Re(·) and Im(·) take a complex input and return its real and imaginary part, respectively. For simplicity of notation, we will occasionally use Re(·) and Im(·) without the parentheses. Furthermore, diag(·) denotes the function which takes as input a vector and outputs a diagonal square matrix with the vector on the main diagonal. Finally, the superscript c denotes the complement of a set, i.e. B^c is the complement of the set B.
2. Regularization by pseudo-Huber. In pdNCG [9] the non-differentiability of the ℓ1-norm is treated by smoothing. In particular, the ℓ1-norm is replaced with the pseudo-Huber function [16]

(2.1)  ψ_μ(x) := Σ_{i=1}^l (√(μ² + |W_i^* x|²) − μ),

where W_i denotes the i-th column of the matrix W ∈ E^{n×l} (so that W_i^* x is the i-th component of W^* x) and μ > 0 controls the quality of the approximation, i.e. as μ → 0, ψ_μ(x) tends to the ℓ1-norm ‖W^* x‖₁. The original problem (1.1) is approximated by

(2.2)  minimize f_c^μ(x) := cψ_μ(x) + (1/2)‖Ax − b‖₂².

2.1. Derivation of pseudo-Huber function. The pseudo-Huber function (2.1) can be derived in a few simple steps. First, we re-write the function ‖W^* x‖₁ in its dual form

‖W^* x‖₁ = max_{‖g‖_∞ ≤ 1} Re(g^* W^* x),

where g are dual variables. The pseudo-Huber function is obtained by regularizing this dual form,

ψ_μ(x) = max_{‖g‖_∞ ≤ 1} Re(g^* W^* x) + μ Σ_{i=1}^l (√(1 − |g_i|²) − 1),

where g_i is the i-th component of the vector g. This approach, which smooths a non-smooth function through its dual form, is known as Moreau's proximal smoothing technique [20]. Another way to smooth the function ‖W^* x‖₁ is to regularize its dual form with the strongly convex quadratic function (μ/2)‖g‖₂². Such an approach provides a smooth approximation of ‖W^* x‖₁ known as the Huber function, which has been used in [4]. Generalizations of the Moreau proximal smoothing technique can be found in [22] and [3].
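As a quick numerical check of (2.1) (a sketch with W = I, so that ψ_μ acts directly on the components of a vector y), the gap between the ℓ1-norm and its pseudo-Huber approximation is non-negative and shrinks with μ:

```python
import numpy as np

def pseudo_huber(y, mu):
    """Pseudo-Huber approximation of the l1-norm, cf. (2.1) with W = I:
    sum_i (sqrt(mu^2 + |y_i|^2) - mu). Smooth everywhere, and it tends
    to sum_i |y_i| as mu -> 0."""
    return float(np.sum(np.sqrt(mu**2 + np.abs(y)**2) - mu))

y = np.array([3.0, -0.5, 0.0, 1.5])
l1 = float(np.sum(np.abs(y)))                    # = 5.0
gaps = [l1 - pseudo_huber(y, mu) for mu in (1.0, 1e-2, 1e-4)]
print(gaps)   # positive and decreasing: the approximation tightens as mu -> 0
```

Note that each term √(μ² + y_i²) − μ ≤ |y_i|, so the pseudo-Huber function always bounds the ℓ1-norm from below.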

2.2. Derivatives of pseudo-Huber function. The gradient of the pseudo-Huber function ψ_μ(x) in (2.1) is

∇ψ_μ(x) = Re(W D W^* x),

where

(2.5)  D := diag(1/√(μ² + |y₁|²), 1/√(μ² + |y₂|²), ..., 1/√(μ² + |y_l|²))

and y = [y₁, y₂, ..., y_l] := W^* x. The gradient of the function f_c^μ(x) in (2.2) is ∇f_c^μ(x) = c∇ψ_μ(x) + A^⊤(Ax − b). The Hessian matrix of ψ_μ(x) is built from the diagonal matrices Ŷ := diag(Ŷ₁, Ŷ₂, ..., Ŷ_l) and Ỹ := diag(Ỹ₁, Ỹ₂, ..., Ỹ_l), where the bar symbol denotes the complex conjugate. Moreover, the Hessian matrix of f_c^μ(x) is

(2.8)  ∇²f_c^μ(x) = c∇²ψ_μ(x) + A^⊤A.

3. Primal-dual formulation and optimality conditions. In [8] the authors solved iTV problems for square and full-rank matrices A which are inexpensively diagonalizable, i.e. image deblurring or denoising problems. More precisely, in the previously cited paper the authors tackled iTV problems by using a Newton-CG method to solve problem (2.2). They observed that, close to the points of non-smoothness of the ℓ1-norm, the smooth pseudo-Huber function (2.1) exhibits ill-conditioned behaviour. This results in two major drawbacks for the application of Newton-CG. First, the linear algebra is challenging. Second, the region of convergence of Newton-CG is substantially shrunk. To deal with these problems they proposed to embed Newton-CG in a continuation procedure over the parameters c and μ. Although they showed that continuation did improve the global convergence properties of Newton-CG, it was later discussed in [9] (for the same iTV problems) that continuation was difficult to control (especially for small μ) and Newton-CG was not always convergent in reasonable CPU time.
In [9], Chan, Golub and Mulet provided numerical evidence that the behaviour of a Newton-CG method can be made significantly more robust, even for small values of μ. This is achieved by simply solving a primal-dual reformulation of (2.2), which is given below:

(3.1)  minimize_x max_{‖g‖_∞ ≤ 1} c(Re(g^* W^* x) + μ Σ_{i=1}^l (√(1 − |g_i|²) − 1)) + (1/2)‖Ax − b‖₂².

The reason that the Newton-CG method is more robust when applied to problem (3.1) rather than to problem (2.2) is hidden in the linearization of the optimality conditions of the two problems.
3.1. Optimality conditions. The optimality conditions of problem (2.2) are

(3.2)  c Re(W D W^* x) + A^⊤(Ax − b) = 0.

The first-order optimality conditions of the primal-dual problem (3.1) are

(3.3)  c Re(W ḡ) + A^⊤(Ax − b) = 0,  D⁻¹ḡ = W^* x.

Notice for conditions (3.3) that the constraint ‖g‖_∞ ≤ 1 in (3.1) is redundant, since any x and ḡ that satisfy (3.3) also satisfy this constraint. Hence, the constraint has been dropped. Conditions (3.3) are obtained from (3.2) by simply setting ḡ = D W^* x.
Hence, their only difference is the inversion of the matrix D. However, this small difference crucially affects the performance of Newton-CG. The reason is that the linearization of the second equation in (3.3), i.e. D⁻¹ḡ = W^* x, is of much better quality than the linearization of ∇ψ_μ(W^* x) for μ ≈ 0 and W^* x ≈ 0. To see why this is true, observe that for small μ and W^* x ≈ 0 the gradient ∇ψ_μ(W^* x) becomes nearly singular, and its linearization is expected to be inaccurate. On the other hand, D⁻¹ḡ = W^* x as a function of W^* x is not singular for μ ≈ 0 and W^* x ≈ 0; hence, its linearization is expected to be more accurate. We refer the reader to Section 3 of [9] for empirical justification.
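The contrast can be made concrete in the scalar case (W = I, one component y; a toy illustration, not the paper's experiment). The slope of ∇ψ_μ blows up like 1/μ at y = 0, so a first-order model of it is only trustworthy in an O(μ) neighbourhood, whereas the partial derivatives of the primal-dual residual r(y, g) = √(μ² + y²)·g − y stay bounded there:

```python
import numpy as np

mu = 1e-6

# scalar pseudo-Huber gradient and its slope (second derivative of psi_mu)
psi_grad = lambda y: y / np.sqrt(mu**2 + y**2)
curvature = lambda y: mu**2 / (mu**2 + y**2) ** 1.5

# at a kink of the l1-norm, the slope of the gradient is 1/mu: huge for
# small mu, so a linearization of psi_grad degrades near y = 0
print(curvature(0.0))           # 1 / mu = 1e6

# primal-dual residual r(y, g) = D^{-1} g - y with D = 1/sqrt(mu^2 + y^2);
# its partial derivatives stay O(1) near y = 0, |g| <= 1, uniformly in mu
dr_dy = lambda y, g: g * y / np.sqrt(mu**2 + y**2) - 1.0
dr_dg = lambda y, g: np.sqrt(mu**2 + y**2)
print(abs(dr_dy(0.0, 0.9)), dr_dg(0.0, 0.9))   # 1.0 and mu
```

This bounded-derivative behaviour of the primal-dual residual is what makes the linearization used by pdNCG reliable even for μ ≈ 0.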
4. Primal-dual Newton conjugate gradients method. In this section we present the details of the pdNCG method [9].
4.1. The method. First, we convert the optimality conditions (3.3) to the real case. This is done by splitting the matrix W = Re W + √(−1) Im W and the dual variables g = g_re + √(−1) g_im into their real and imaginary parts. We do this in order to obtain optimality conditions which are differentiable in the classical sense of real analysis, which allows a straightforward application of the pdNCG method [9]. The real optimality conditions of the primal-dual problem (3.1) are given in (4.1). At every iteration of pdNCG the primal-dual directions are calculated by approximately solving a linearization (4.2) of the equality constraints in (4.1), in which B_i, i = 1, 2, 3, 4, are diagonal matrices. The constraint on the dual variables will be maintained through all iterations of pdNCG.
It is straightforward to show the claim in Remark 4.1 for the case of a real matrix W. For the case of complex W we refer the reader to a similar claim made in [9], page 1970. Although the matrix B is positive definite under the conditions stated in Remark 4.1, it is not symmetric, except in the case that W is real, where all imaginary parts are dropped. Therefore, in the case of a complex matrix W, preconditioned CG (PCG) cannot be employed directly to approximately solve (4.2). To avoid the problem of a non-symmetric matrix B, the authors in [9] suggested ignoring the non-symmetric part of B and employing CG to solve (4.2). This idea is based on the following remark.

Remark 4.2. The symmetric part of B tends to the symmetric second-order derivative of f_c^μ(x) as pdNCG converges (see Section 5 in [9]).

Hence, system (4.2) is replaced with a symmetrized system (4.4), where sym(B) := (1/2)(B + B^⊤) is the symmetric part of B. Moreover, PCG is terminated when the criterion (4.6), governed by a tolerance η, is satisfied. The projection operator for complex arguments is applied component-wise and is defined as v := P_{‖·‖_∞≤1}(u) = min(1/|u|, 1) ⊙ u, where ⊙ denotes component-wise multiplication. In the last step, line-search is employed along the primal direction Δx in order to guarantee that the objective value f_c^μ(x) is monotonically decreasing, see Section 5 of [9]. The pseudocode of pdNCG is presented in Figure 4.1.
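The payoff of preconditioning inside CG can be sketched on a generic SPD system. Everything concrete below is an assumption for illustration: the matrix H is a stand-in for the symmetrized system matrix with a badly spread spectrum, and a simple Jacobi (diagonal) preconditioner plays the role that Ñ plays in the paper.

```python
import numpy as np

def pcg(A, b, Minv=None, tol=1e-6, maxiter=2000):
    """Preconditioned conjugate gradients for SPD A. Minv applies the
    inverse of the preconditioner; Minv=None gives plain CG.
    Returns the approximate solution and the iteration count."""
    x = np.zeros_like(b)
    r = b.copy()
    z = Minv(r) if Minv else r.copy()
    p = z.copy()
    rz = r @ z
    nb = np.linalg.norm(b)
    for k in range(1, maxiter + 1):
        Ap = A @ p
        alpha = rz / (p @ Ap)
        x = x + alpha * p
        r = r - alpha * Ap
        if np.linalg.norm(r) <= tol * nb:      # relative-residual stopping rule
            return x, k
        z = Minv(r) if Minv else r.copy()
        rz_new = r @ z
        p = z + (rz_new / rz) * p
        rz = rz_new
    return x, maxiter

rng = np.random.default_rng(0)
n = 200
E = rng.standard_normal((n, n))
# SPD test matrix: widely spread diagonal plus a small PSD perturbation
H = np.diag(np.logspace(0, 4, n)) + 0.1 * (E @ E.T) / n
b = rng.standard_normal(n)

x_plain, it_plain = pcg(H, b)
x_prec, it_prec = pcg(H, b, Minv=lambda r: r / np.diag(H))  # Jacobi preconditioner
print(it_plain, it_prec)    # preconditioning cuts the iteration count sharply
```

The preconditioner proposed in Section 5 plays the same role for the pdNCG systems: it clusters the spectrum so that few CG iterations suffice.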

5. Preconditioning. The practical computational efficiency of pdNCG applied to system (4.4) depends on the spectral properties of the matrix B̃ in (4.5). These can be improved by suitable preconditioning. In this section we introduce a new preconditioner for B̃ and discuss the limiting behaviour of the spectrum of the preconditioned B̃.
First, we give an intuitive analysis of the construction of the proposed preconditioner. In Remark A.2 it is mentioned that the distance ω between the two solutions x_c := arg min f_c(x) and x_{c,μ} := arg min f_c^μ(x) can be made arbitrarily small for sufficiently small values of μ. Moreover, according to Assumption 1.2, W^* x_c is q-sparse. Therefore, Remark A.2 implies that W^* x_{c,μ} is approximately q-sparse, with nearly zero components of O(ω). A consequence of this is that the components of W^* x_{c,μ} split into two disjoint sets. The behaviour of W^* x_{c,μ} has a crucial influence on the matrix ∇²ψ_μ(W^* x_{c,μ}) in (2.6).
Notice that the components of the diagonal matrix D, defined in (2.5) as part of ∇²ψ_μ(W^* x_{c,μ}), split into two disjoint sets. In particular, q of its components are non-zeros much smaller than O(1/ω), while the majority, l − q, of its components are of O(1/ω).

[Figure 4.1: pseudocode of pdNCG. Loop: for k = 1, 2, ..., until the termination criteria are met; step 5: find the least integer j ≥ 0 such that the line-search condition holds and set α := τ₁^j; step 6: set the new iterate.]

Hence, for points close to x_{c,μ} and small μ, the diagonal of D is approximately separable into small and large components. According to Remark 4.2, the symmetric matrix sym(B) tends to the matrix ∇²ψ_μ(x); therefore, sym(B) is the dominant matrix in B̃. For this reason, in the proposed preconditioning technique the matrix A^⊤A in (2.8) is replaced by a scaled identity ρI_n, ρ > 0, while the dominant matrix sym(B) is maintained. Based on these observations we propose the following preconditioner:

(5.2)  Ñ := c sym(B) + ρI_n.
In order to capture the approximate separability of the diagonal components of the matrix D for points close to x_{c,μ}, when μ is sufficiently small, we will work with approximate guesses of the sets B and B^c. For this reason we introduce a positive constant ν, which determines the index set B_ν; the resulting cardinality σ might be different from the sparsity of W^* x_c. Furthermore, according to this definition we have the sets B_ν and B_ν^c, with |B_ν| = σ and |B_ν^c| = l − σ. This notation is used in the following theorem, in which we analyze the spectral properties of the preconditioned ∇²f_c^μ(x), with preconditioner N := c∇²ψ_μ(W^* x) + ρI_n. According to Remark 4.2, the matrices B̃ and Ñ tend to ∇²f_c^μ(x) and N, respectively, as x → x_{c,μ}; therefore, the theorem is also useful for the analysis of the limiting behaviour of the spectrum of the preconditioned B̃.

Theorem 5.1. Let A and W satisfy the W-RIP with some constant δ_σ < 1/2, let A satisfy (1.3), and let ρ ∈ [δ_σ, 1/2].

Proof. We analyze the spectrum of the matrix N^{−1/2}∇²f_c^μ(x)N^{−1/2}. Let u be an eigenvector of N^{−1/2}∇²f_c^μ(x)N^{−1/2} with ‖u‖₂ = 1 and let λ be the corresponding eigenvalue. First, we find an upper bound for the term involving A^⊤A − ρI_n; note that N^{−1/2}(A^⊤A − ρI_n)N^{−1/2} and N^{−1}(A^⊤A − ρI_n) have the same eigenvalues. Therefore, the bound can be expressed through λ_max^+(·), the largest eigenvalue of the input matrix in absolute value. Thus, writing P for the projection matrix onto the column space of W_{B_ν} and Q = I_n − P, and using the triangle inequality, we are led to a maximization problem. Let v̄ denote its solution and set ‖Pv̄‖₂² = α and ‖Qv̄‖₂² = 1 − α. Since Pv̄ belongs to the column space of W_{B_ν} and |B_ν| = σ, the W-RIP with δ_σ < 1/2 applies to it. Since ρ ∈ [δ_σ, 1/2], we have that ρ‖Pv̄‖₂² ≤ ‖APv̄‖₂², which implies that if the eigenvector corresponding to an eigenvalue of the matrix A^⊤A belongs to the column space of W_{B_ν}, then the eigenvalue cannot be smaller than ρ. Moreover, from the W-RIP with δ_σ < 1/2 and ρ ∈ [δ_σ, 1/2], we also have that (Pv̄)^*(A^⊤A − ρI_n)Pv̄ ≤ ‖Pv̄‖₂².
Thus, from property (1.3) and λ_max(A^⊤A) = λ_max(AA^⊤), we have that λ_max(A^⊤A − ρI_n) ≤ 1 + δ − ρ. Finally, using the Cauchy-Schwarz inequality and combining (5.6), (5.7) and (5.8) in (5.5), we obtain the claimed bounds.
Let us now draw some conclusions from Theorem 5.1. In order for the eigenvalues of N^{−1}∇²f_c^μ(x) to be clustered around one, it is required that the degree of freedom ν is chosen such that ν = O(1/μ) and that μ is small. For such ν, the cardinality σ of the set B_ν must be small enough that the matrices A and W satisfy the W-RIP with constant δ_σ < 1/2; otherwise the assumptions of Theorem 5.1 are not satisfied. This is possible if the pdNCG iterates are close to the optimal solution x_{c,μ} and μ is sufficiently small. In particular, for sufficiently small μ, from Remark A.2 we have that x_{c,μ} ≈ x_c and σ ≈ q. According to Assumption 1.2, for the q-sparse x_c the W-RIP is satisfied with δ_{2q} < 1/2, which implies δ_q < 1/2. Hence, for points close to x_{c,μ} and small μ we expect that δ_σ < 1/2. Therefore, the result in Theorem 5.1 captures only the limiting behaviour of the preconditioned ∇²f_c^μ(x) as x → x_{c,μ}. Moreover, according to Remark 4.2, Theorem 5.1 implies that in the limit the eigenvalues of Ñ^{−1}B̃ are also clustered around one.
This limiting-behaviour scenario for the preconditioner is, however, pessimistic. Let σ̄ be the minimum sparsity level such that the matrices A and W satisfy the W-RIP with δ_σ̄ < 1/2. Then, according to the uniformity of the W-RIP (i.e. it holds for all at most σ̄-sparse vectors), the preconditioner will start to be effective as soon as the iterates W^* x_k are approximately sparse with σ̄ dominant non-zero components. Numerical evidence confirming this claim is provided in Figure 5.1.
In Figure 5.1 the spectra λ(B̃) and λ(Ñ^{−1}B̃) are displayed for a sequence of systems which arise when an iTV problem is solved. For this iTV problem we set the matrix A to be a partial 2D DCT, n = 2^10, m = n/4, c = 2.29e-2 and ρ = 5.0e-1. For the experiment in Figures 5.1a and 5.1b the smoothing parameter has been set to μ = 1.0e-3, and in Figures 5.1c and 5.1d to μ = 1.0e-5. Observe that in both cases the eigenvalues of the matrix Ñ^{−1}B̃ are nicely clustered around one. On the other hand, the eigenvalues of the matrix B̃ have large variations and many small eigenvalues close to zero. Notice that the preconditioner was effective not only at optimality, as predicted by the theory, but through all iterations of pdNCG. This is because, starting from the zero initial point, the iterates W^* x_k were maintained approximately sparse for all k.
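The clustering effect can be reproduced in a small synthetic setting. Everything concrete below is an assumption for illustration rather than the paper's iTV experiment: W is a random orthogonal dictionary, A has exactly orthonormal rows (so AA^⊤ = I_m), and the pseudo-Huber Hessian term is replaced by W D W^⊤ with the two-scale diagonal D described above.

```python
import numpy as np

rng = np.random.default_rng(1)
n, m, q = 100, 25, 10
mu, c, rho = 1e-4, 1.0, 0.5

W, _ = np.linalg.qr(rng.standard_normal((n, n)))   # orthogonal dictionary
Q, _ = np.linalg.qr(rng.standard_normal((n, n)))
A = Q[:m, :]                                       # A A^T = I_m exactly

# two-scale diagonal of the pseudo-Huber Hessian near an approximately
# q-sparse point: q moderate entries, n - q entries of order 1/mu
d = np.concatenate([np.ones(q), np.full(n - q, 1.0 / mu)])
S = W @ np.diag(d) @ W.T                           # stand-in for sym(B)

H = c * S + A.T @ A                                # stand-in for B~
N = c * S + rho * np.eye(n)                        # preconditioner, cf. (5.2)

lam_H = np.linalg.eigvalsh(H)
lam_P = np.sort(np.linalg.eigvals(np.linalg.solve(N, H)).real)
print(lam_H.max() / lam_H.min())   # spread of the unpreconditioned spectrum
print(lam_P.min(), lam_P.max())    # preconditioned eigenvalues near one
```

Since N^{-1}H = I + N^{-1}(A^⊤A − ρI_n) and ‖A^⊤A − ρI_n‖₂ ≤ 1/2 here, the preconditioned eigenvalues stay within roughly 1/3 of one, while the spread of H itself grows like 1/μ.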
We now comment on the second result of Theorem 5.1. In this case, according to Theorem 5.1, the preconditioner removes the disadvantageous dependence of the spectrum of ∇²f_c^μ(x) on the smoothing parameter μ. However, there is no guarantee that these eigenvalues of N^{−1}∇²f_c^μ(x) are clustered around one, regardless of the distance from the optimal solution x_{c,μ}. Again, because of Remark 4.2, we expect that the spectrum of Ñ^{−1}B̃ will have a similar behaviour in the limit.
6. Continuation. In the previous section we showed that preconditioning improves the spectral properties of the systems which arise. However, at the initial stages of pdNCG a similar effect can be achieved without the cost of applying preconditioning. In particular, at the initial stages the spectrum of B̃ can be controlled to some extent through inexpensive continuation, while preconditioning is enabled only at later stages of the process. Briefly, by continuation we mean that a sequence of "easier" subproblems is solved, instead of solving problem (2.2) directly. The reader is referred to Chapter 11 in [23] for a survey of continuation methods in optimization.
In this paper we use a continuation framework similar to [4,8,9,14,15]. In particular, a sequence of subproblems (2.2) is solved, where each of them is parameterized by c and μ simultaneously. Let c̄ and μ̄ be the final parameters for which problem (2.2) must be solved. Then the number of continuation iterations ϑ is set to the maximum order of magnitude between 1/c̄ and 1/μ̄. For all experiments reported in this paper we have found that this setting leads to a generally acceptable improvement over pdNCG without continuation. The pseudocode of the proposed continuation framework is shown in Figure 6.1:

1: Outer loop: For j = 0, 1, 2, ..., ϑ, produce (c_j, μ_j)_{j=0}^ϑ.
2: Inner loop: Approximately solve the subproblem minimize f_{c_j}^{μ_j}(x) using pdNCG, initialized with the solution of the previous subproblem.

Figure 6.2 shows the performance of pdNCG in three cases: no continuation with preconditioning, continuation with preconditioning through the whole process, and continuation with preconditioning only at later stages. The vertical axis of Figure 6.2 shows the relative error ‖x_k − x_{c̄,μ̄}‖₂/‖x_{c̄,μ̄}‖₂, where x_{c̄,μ̄} is the optimal solution for the parameter setting c̄, μ̄ in problem (2.2), obtained by running pdNCG with parameters tuned to recover a highly accurate solution. The horizontal axis shows the CPU time. The problem is an iTV problem where the matrix A is a partial 2D DCT, n = 2^16, m = n/4, c = 5.39e-2 and ρ = 5.0e-1. The final smoothing parameter μ̄ is set to 1.0e-5. For the experiment in which preconditioning is used only at later stages of continuation, preconditioning is enabled when μ_j ≤ 1.0e-4, where j is the counter of continuation iterations. All experiments are terminated when the relative error ‖x_k − x_{c̄,μ̄}‖₂/‖x_{c̄,μ̄}‖₂ ≤ 1.0e-1. Solving the problem approximately is acceptable practice, since the problem is very noisy (the signal-to-noise ratio is 10 decibels) and there is little improvement in the reconstructed image if more accurate solutions are requested. Finally, all other parameters of pdNCG were set to the same values for all three experiments. Observe in Figure 6.2 that continuation with preconditioning only at late stages was the best approach for this problem.
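A minimal continuation loop in the spirit of Figure 6.1 can be sketched as follows. Everything concrete here is an assumption for illustration: the inner pdNCG solve is replaced by plain gradient descent on f_c^μ with W = I, and a simple geometric schedule drives (c_j, μ_j) from mild values towards the targets (c̄, μ̄).

```python
import numpy as np

rng = np.random.default_rng(2)
m, n = 20, 40
A = rng.standard_normal((m, n)) / np.sqrt(m)
x_true = np.concatenate([np.ones(3), np.zeros(n - 3)])     # sparse ground truth
b = A @ x_true + 0.01 * rng.standard_normal(m)

def f(x, c, mu):
    """f_c^mu(x) = c * psi_mu(x) + 0.5 ||Ax - b||^2 with W = I, cf. (2.2)."""
    return c * np.sum(np.sqrt(mu**2 + x**2) - mu) + 0.5 * np.sum((A @ x - b)**2)

def grad_f(x, c, mu):
    return c * x / np.sqrt(mu**2 + x**2) + A.T @ (A @ x - b)

def inner_solve(x, c, mu, iters=300):
    """Stand-in for the pdNCG inner solver: gradient descent with a step
    from the Lipschitz estimate c/mu + ||A||_2^2."""
    L = c / mu + np.linalg.norm(A, 2) ** 2
    for _ in range(iters):
        x = x - grad_f(x, c, mu) / L
    return x

c_bar, mu_bar, theta = 1e-1, 1e-3, 3
x = np.zeros(n)
for j in range(theta + 1):                 # outer loop: produce (c_j, mu_j)
    t = j / theta
    c_j, mu_j = c_bar**t, mu_bar**t        # geometric path from (1, 1) to targets
    x = inner_solve(x, c_j, mu_j)          # warm start from previous subproblem
print(f(x, c_bar, mu_bar))                 # below the value at the zero start
```

Early subproblems (large μ_j) are well conditioned and cheap; by the time μ_j is small, the warm start is already close to x_{c̄,μ̄}, which is exactly when the preconditioner of Section 5 is switched on in the paper's experiments.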
7. Numerical experiments. In this section we demonstrate the efficiency of pdNCG against state-of-the-art methods for ℓ1-analysis problems with coherent and redundant dictionaries. In what follows we briefly discuss existing methods, describe the setting of the experiments and finally present numerical results. All experiments demonstrated in this paper can be reproduced by downloading the software from http://www.maths.ed.ac.uk/ERGO/pdNCG/.
-TFOCS (Templates for First-Order Conic Solvers) is a MATLAB software package for the solution of signal reconstruction problems. TFOCS solves the dual of the unconstrained problem (7.1), in which the ℓ1-analysis objective is augmented with a regularization term (μ_{T1}/2)‖x − x₀‖₂², where μ_{T1} is a positive constant. TFOCS also solves the dual of the constrained problem (7.2), in which the objective is regularized with (μ_{T2}/2)‖x − x₀‖₂² and the constraint ‖Ax − b‖₂ ≤ ε is imposed, where ε > 0. Although problems (7.1) and (7.2) are non-smooth, the regularization terms (μ_{T1}/2)‖x − x₀‖₂² and (μ_{T2}/2)‖x − x₀‖₂² yield smooth convex dual problems, which can be solved by standard first-order methods. In particular, the smooth dual problems are solved using Auslender and Teboulle's accelerated first-order method [2]. In our experiments we present results for TFOCS for both problems (7.1) and (7.2). We denote by TFOCS_unc the version that solves the unconstrained problem (7.1) and by TFOCS_con the version that solves the constrained problem (7.2). TFOCS can be downloaded from [5].
-TVAL3 (Total-Variation minimization by Augmented Lagrangian and ALternating direction ALgorithms) is a MATLAB software package for the solution of signal reconstruction problems regularized with the total-variation semi-norm. TVAL3 reformulates problem (1.1) as the equivalent splitting problem (7.3), where Ω_i = [Re W_i, Im W_i] ∈ ℝ^{n×2}. It then solves the augmented Lagrangian reformulation (7.4) of problem (7.3), in which u_i, v_i ∈ ℝ² and β is a positive constant. The augmented Lagrangian in (7.4) is minimized over the variables x ∈ ℝ^n and u_i, i = 1, 2, ..., l; the multipliers v_i, i = 1, 2, ..., l, are handled by the method. Other solvers are NESTA [4] and C-SALSA [1], which can also solve (1.1), but they are applicable only when (AA^⊤)^{−1} is available. Another method is the Primal-Dual Hybrid Gradient (PDHG) method of [12]. PDHG has been reported to be very efficient for imaging applications such as denoising and deblurring, for which the matrix A is the identity or a square and full-rank matrix which is inexpensively diagonalizable. Unfortunately, this is not the case for the CS problems that we are interested in. However, for all the previous methods the matrix inversion can be replaced with the solution of a linear system at every iteration, or with the one-time cost of a factorization. To the best of our knowledge, there are no available implementations of these methods with such modifications.
There also exists a generic proximal algorithm for total-variation [10], as well as the Generalized Iterative Soft Thresholding (GISTA) method of [18], for which we do not have generic implementations for CS problems.
In our experiments we put significant effort into calibrating the parameters c, μ, μ_{T1}, μ_{T2} and β such that all methods solve similar problems. First, we set ε = ‖b − b̂‖₂ in (7.2), where b̂ is the noiseless sampled signal; hence, problem (7.2) is parameterized with the optimal ε. Then we find an approximation of the optimal c. By optimal c we mean the value of c for which problems (1.1) and (7.2) are equivalent when ε = ‖b − b̂‖₂ and μ_{T2} = 0. Let ω denote the optimal Lagrange multiplier of (7.2). If ε = ‖b − b̂‖₂ and μ_{T2} = 0, then it is easy to show that for c := 2/ω problems (1.1) and (7.2) are equivalent.
The exact optimal Lagrange multiplier ω is not known a-priori. However, it can be calculated by solving the dual problem of (7.1) to high accuracy with TFOCS. Unfortunately, the majority of the experiments that we perform are large-scale and TFOCS converges slowly for μ_{T2} ≈ 0. For this reason, we first solve (7.2) using TFOCS with a moderate μ_{T2}, in order to obtain an approximate optimal Lagrange multiplier ω in reasonable CPU time. Then we set c := 2γ/ω, where γ is a small positive constant which is determined experimentally such that problems (1.1) and (7.2) have similar solutions. If b̂ is not available, then ε is set such that a visually pleasing solution is obtained.
The smoothing parameters μ_{T1} and μ_{T2} of TFOCS in (7.1) and (7.2), respectively, are set such that the corresponding solutions have small relative errors. Let x_{T1} and x_{T2} denote the solutions obtained from problems (7.1) and (7.2); then ‖x_{T1} − x̂‖₂/‖x̂‖₂ and ‖x_{T2} − x̂‖₂/‖x̂‖₂ are of O(10⁻¹) or O(10⁻²), where x̂ is the known optimal noiseless solution. If x̂ is not available, μ_{T1} and μ_{T2} are set such that visually pleasing reconstructions are obtained.
The smoothing parameter μ of pdNCG is set such that ‖x_pd − x̂‖₂/‖x̂‖₂ is also of O(10⁻¹) or O(10⁻²), where x_pd is the approximate optimal solution obtained by pdNCG. For all experiments performed, the relative error between the solutions of TFOCS and pdNCG is of O(10⁻²).
For TVAL3 the parameter β is set experimentally such that ‖x_tv − x̂‖₂/‖x̂‖₂ is of similar order, where x_tv is the approximate optimal solution obtained by TVAL3. In all experiments TVAL3 obtained solutions similar to those of TFOCS and pdNCG. For TVAL3 we have taken into account the comments of its authors that TVAL3 performs better for β ∈ [2⁴, 2¹³]. However, occasionally we had to set β = 2² in order to obtain a solution which was as visually pleasing as the solutions obtained by TFOCS and pdNCG.

7.3. Termination criteria, parameter tuning and hardware. Version 1.3.1 of TFOCS has been used. The termination criterion of TFOCS is by default the relative step-length. The tolerance for this criterion is set to its default value, except in cases where specific suggestions are made in the TFOCS software package or the corresponding paper [5]. The default Auslender and Teboulle single-projection method is used as the solver in TFOCS. Moreover, as suggested by the authors of TFOCS, appropriate scaling is performed on the matrices A and W such that they have approximately the same Euclidean norms. All other parameters are set to their default values, except in cases where specific suggestions are made by the authors. Generally, regarding the tuning of TFOCS, substantial effort has been made to guarantee that problems are not over-solved.
Regarding pdNCG, the parameter η in (4.6) is set to 1.0e-1 and the maximum number of backtracking line-search iterations is fixed at 10. Moreover, the backtracking line-search parameters τ₁ and τ₂ in step 4 of pdNCG (Fig. 4.1) are set to 9.0e-1 and 1.0e-3, respectively. For iTV the preconditioner is a five-diagonal matrix, hence systems with it are solved exactly. Finally, the constant ρ of the preconditioner in (5.2) is set to 5.0e-1.
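The backtracking rule described above (least integer j ≥ 0, step α := τ₁^j, sufficient-decrease constant τ₂, iteration cap 10) can be sketched as an Armijo-type search. The exact acceptance test of pdNCG is given in Figure 4.1, so the Armijo condition used below is an assumption for illustration.

```python
import numpy as np

def backtracking(f, x, dx, g, tau1=0.9, tau2=1e-3, max_iter=10):
    """Find the least integer j >= 0 with
       f(x + tau1**j * dx) <= f(x) + tau2 * tau1**j * g.dot(dx)
    and return alpha = tau1**j (Armijo-style sufficient decrease).
    tau1, tau2 and the iteration cap match the pdNCG tuning above."""
    fx = f(x)
    gTd = g @ dx
    alpha = 1.0
    for _ in range(max_iter):
        if f(x + alpha * dx) <= fx + tau2 * alpha * gTd:
            break
        alpha *= tau1
    return alpha

# usage on f(x) = 0.5 ||x||^2 with the Newton direction dx = -x:
quad = lambda v: 0.5 * float(v @ v)
x0 = np.array([3.0, -4.0])
alpha = backtracking(quad, x0, -x0, x0)   # gradient at x0 is x0 itself
print(alpha)   # 1.0: the full Newton step already satisfies the decrease test
```

With a small τ₂ such as 1.0e-3, the full step is accepted whenever it gives almost any decrease along a descent direction, which is the desirable behaviour near the solution where Newton steps are accurate.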
Regarding TVAL3, we fix the maximum number of iterations to 1000; any other parameters that were not discussed previously are set to their default values.
All solvers are MATLAB implementations and all experiments were run on a MacBook Air running OS X 10.10.1 with a 2 GHz (3 GHz turbo boost) Intel Core i7 processor, using MATLAB R2012a. The cores were working at a frequency of 2.7-3 GHz during the experiments and we did not observe any CPU throttling.

7.4. Problem sets.
We compare the solvers pdNCG, TFOCS and TVAL3 on image reconstruction problems which are modelled using iTV. We separate the images to be reconstructed into two sets, which are shown in Figures 7.1 and 7.2. Figure 7.1 includes some standard images from the image processing community. There are seven images in total: the house and the peppers, which have 256 × 256 pixels, and Lena, the fingerprint, the boat and Barbara, which have 512 × 512 pixels. Finally, the image Shepp-Logan has variable size depending on the experiment. Figure 7.2 includes images which have been sampled using a single-pixel camera [11]. Briefly, a single-pixel camera samples random linear projections of the pixels of an image, instead of directly sampling pixels. This problem set can be downloaded from http://dsp.rice.edu/cscamera. In this set there are in total five sampled images: the dice, the ball, the mug, the letter R and the logo. Each image has 64 × 64 pixels. The results of the experiments are shown in Table 7.1. Notice in Table 7.1 that there is always a large increase in CPU time from μ = 1.0e-02 to 1.0e-04. This is because for μ = 1.0e-02 pdNCG relies only on continuation, while for values of μ equal to or smaller than 1.0e-04 preconditioning is necessary and is automatically activated using the technique described in Section 6. Overall, pdNCG had a stable performance with respect to the smoothing parameter μ. We believe that the good performance of the proposed preconditioner is responsible for this result. Without the preconditioner the performance of pdNCG for μ ≤ 1.0e-04 worsens noticeably; occasionally the method (without preconditioning) might stagnate.
7.6. Dependence on problem size. We now present the performance of the methods pdNCG, TFOCS and TVAL3 as the size of the problem n increases. Observe that all methods exhibit a linear-like increase in CPU time as a function of the size of the problem. We denote in bold the problems for which pdNCG was the fastest method.

7.7. Dependence on the level of noise. In this experiment we compare the solvers pdNCG, TFOCS and TVAL3 as the level of noise increases. For this experiment we use the images from Figures 7.1a to 7.1f. The CS matrix for all experiments is a partial Discrete Cosine Transform (DCT) matrix with m ≈ n/4. In Table 7.3 we present the results of this experiment. In the second column of Table 7.3 the SNR is shown, which decreases from 90 dB to 15 dB in six steps.
The rest of the table shows the CPU time required by each solver.

8. Conclusions.
Recently there has been great interest in the development of optimization methods for the solution of compressed sensing problems with coherent and redundant dictionaries. The methods developed so far are mainly first-order methods, because first-order methods have inexpensive iterations and frequently offer fast initial progress in the optimization process. On the contrary, second-order methods are considered to be rather expensive, since access to second-order information often requires the solution of linear systems. In this paper we develop a second-order method, a primal-dual Newton preconditioned conjugate gradients method. We show that an approximate solution of the linear systems which arise is sufficient to speed up an iterative method and additionally make it more robust. Moreover, we show that for compressed sensing problems an inexpensive preconditioner can be designed which speeds up the approximate solution of the linear systems even further. Extensive numerical experiments verify our arguments, and a spectral analysis of the preconditioner demonstrates its very good limiting behaviour.