ON USING CHOLESKY-BASED FACTORIZATIONS AND REGULARIZATION FOR SOLVING RANK-DEFICIENT SPARSE LINEAR LEAST-SQUARES PROBLEMS

Abstract. By examining the performance of modern parallel sparse direct solvers and exploiting our knowledge of the algorithms behind them, we perform numerical experiments to study how they can be used to efficiently solve rank-deficient sparse linear least-squares problems arising from practical applications. The Cholesky factorization of the normal equations breaks down when the least-squares problem is rank-deficient, while applying a symmetric indefinite solver to the augmented system can give an unacceptable level of fill in the factors. To try to resolve these difficulties, we consider a regularization procedure that modifies the diagonal of the unregularized matrix. This leads to matrices that are easier to factorize. We consider both the regularized normal equations and the regularized augmented system. We employ the computed factors of the regularized systems as preconditioners with an iterative solver to obtain the solution of the original (unregularized) problem. Furthermore, we look at using limited-memory incomplete Cholesky-based factorizations and how these can offer the potential to solve very large problems.


1. Introduction.
In recent years, a number of methods have been proposed for preconditioning sparse linear least-squares problems; a brief overview with a comprehensive list of references is included in the introduction to the paper of Bru et al. [4]. The recent study of Gould and Scott [20,21] reviewed many of these methods (specifically those for which software has been made available) and then tested and compared their performance using a range of examples coming from practical applications. One of the outcomes of that study was some insight into which least-squares problems in the widely used sparse matrix collections CUTEst [19] and University of Florida [10] currently pose a real challenge for direct methods and/or iterative solvers. In particular, the study found that most of the available software packages were not reliable or efficient for rank-deficient least-squares problems (at least not when run with the recommended settings for the input parameters that were employed in the study). In this paper, we look further at such problems and focus on the effectiveness of both sparse direct solvers and iterative methods with incomplete factorization preconditioners. A key theme is the use of regularization (see, for example, [15,48,49]). We propose computing a factorization (either complete or incomplete) of a regularized problem and then using this as a preconditioner for an iterative solver to recover the solution of the original (unregularized) problem.
The problem we are interested in is the linear least-squares problem

(1.1) min_x ‖b − Ax‖_2,

where A ∈ R^{m×n} (m ≥ n) is large and sparse and b ∈ R^m. Our focus is on the case where A is not of full column rank. Solving (1.1) is mathematically equivalent to solving the n × n normal equations

(1.2) Cx = A^T b, C = A^T A.

A well-known issue associated with solving (1.2) is that the condition number of the normal matrix C is the square of the condition number of A, so that the normal equations can be highly ill-conditioned [3]. Indeed, if A does not have full column rank, C is positive semidefinite and computing a Cholesky factorization breaks down: a zero (or, in practice, a negative) pivot is encountered at some stage of the factorization. In such cases, a black-box sparse Cholesky solver cannot be applied directly to (1.2). It is thus of interest to consider modifying C by adding a regularization term to allow the use of a Cholesky solver; this is explored in section 3 and compared with using a sparse symmetric indefinite solver (that incorporates numerical pivoting) for factorizing C.
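The squaring of the condition number is easy to observe numerically. The following sketch (illustrative only; the random test matrix and its singular values are our own choices, not from the paper) builds a matrix A with known singular values and compares cond(A) with cond(A^T A):

```python
import numpy as np

# Illustrative sketch: the condition number of the normal matrix
# C = A^T A is the square of the condition number of A.
rng = np.random.default_rng(0)
m, n = 50, 10
# Build A = U * diag(sigma) * V^T with prescribed singular values 1 .. 1e-3.
U, _ = np.linalg.qr(rng.standard_normal((m, n)))
V, _ = np.linalg.qr(rng.standard_normal((n, n)))
sigma = np.logspace(0, -3, n)
A = U @ np.diag(sigma) @ V.T
C = A.T @ A
print(np.linalg.cond(A))   # ≈ 1e3 by construction
print(np.linalg.cond(C))   # ≈ 1e6, the square
```

This is why a shift of modest size added to C can already dominate the smallest eigenvalues and restore positive definiteness.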
In particular, we look at employing the factors of the regularized normal matrix as a preconditioner for the iterative method LSMR [14] for solving (1.1). An alternative approach is to solve the mathematically equivalent (m+n) × (m+n) augmented system

(1.3) K ( r/γ ; x ) = ( b ; 0 ), K = ( γI_m A ; A^T 0 ),

where γ > 0, r(x) = b − Ax is the residual vector, and I_m denotes the m × m identity matrix. The condition of K depends on γ and the maximum and minimum singular values of A; it varies significantly with γ, but with an appropriate choice (see [1,3,48]), K is much better conditioned than C. Important disadvantages of (1.3) are that K is indefinite and is generally significantly larger than the normal matrix. A sparse direct indefinite solver computes a factorization of K of the form (PL)D(PL)^T, where P is a permutation matrix, L is unit lower triangular, and D is block diagonal with nonsingular 1 × 1 and 2 × 2 blocks on the diagonal corresponding to 1 × 1 and 2 × 2 pivots (see, for example, [11,27]). Using an indefinite solver may result in a more expensive (and certainly more complex) factorization process than a Cholesky solver, and, as reported in [20,21], for large least-squares problems, the amount of memory needed may be prohibitive. One reason for this is that the analyze phase of most sparse direct solvers chooses the pivot order on the basis of the sparsity pattern of the matrix and makes the assumption that the diagonal is nonzero. When (as in the augmented system) this is not the case, it can be necessary during the subsequent numerical factorization to make significant modifications to the pivot order, leading to much higher levels of fill in the factors (entries in the factor L that were zero in K) than was predicted during the analyze phase (see, for example, [28,30]). Modifications to the pivot order are needed when a candidate pivot is found to be too small. The conditions for deciding whether a pivot is acceptable typically depend on a threshold parameter (see section 2); choosing this parameter is a
compromise between retaining sparsity and ensuring stability. In section 4, we examine the effects of relaxing the threshold parameter. We also look at regularizing the problem by modifying the (2,2) block of K before performing the factorization and then using the factors as a preconditioner for an iterative solver (such as GMRES [47] or MINRES [40]) to restore accuracy in the solution of the original system. When memory is an issue for a direct solver, an alternative approach is to use an incomplete factorization preconditioner in conjunction with an iterative solver.
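As a concrete illustration of the augmented formulation, the sketch below (our own toy example using SciPy, not the HSL solvers discussed in this paper) assembles K for a small full-rank A with γ = 1 and checks that the second block of the solution solves the least-squares problem:

```python
import numpy as np
import scipy.sparse as sp
import scipy.sparse.linalg as spla

# Toy sketch: assemble the (m+n) x (m+n) augmented matrix
# K = [ gamma*I  A ; A^T  0 ] and solve K [r/gamma; x] = [b; 0].
# Here A has full column rank (its top block is the identity), so K
# is nonsingular and a direct sparse solve suffices.
m, n, gamma = 30, 8, 1.0
A = sp.vstack([sp.identity(n),
               sp.random(m - n, n, density=0.3, random_state=1)],
              format="csr")
b = np.ones(m)
K = sp.bmat([[gamma * sp.identity(m), A], [A.T, None]], format="csc")
c = np.concatenate([b, np.zeros(n)])
y = spla.spsolve(K, c)
r, x = gamma * y[:m], y[m:]        # first block of y holds r/gamma
# The second block row enforces A^T r = 0, the normal equations.
print(np.linalg.norm(A.T @ (b - A @ x)))
```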
Incomplete Cholesky (IC) factorizations have long been used as preconditioners for the numerical solution of large sparse, symmetric positive definite linear systems of equations; for an introduction and overview see, for example, [2,46,51] and the many references therein. More recently, a number of authors have considered incomplete LDL^T factorizations of symmetric quasi-definite matrices [39], saddle-point systems [52], and general indefinite systems [22,53]. The use of a limited-memory IC factorization combined with LSMR to solve (1.1) is considered in section 5, and in section 6 we explore using incomplete LDL^T factorizations to solve (1.3). Our conclusions are drawn in section 7.

1.1. Test environment.
We end this introduction by describing our test environment and test problems. The characteristics of the machine used to perform our tests are given in Table 1.1. All software is written in Fortran, and all reported timings are elapsed times in seconds. In our experiments, the direct solvers HSL MA87 and HSL MA97 (see section 2) are run in parallel, using 8 processors. We do not attempt to parallelize the sparse matrix-vector products used by the iterative solvers; moreover, the software to compute incomplete factorizations is serial. In each test, we impose a time limit of 600 seconds per problem. For the iterative methods, the number of iterations for each problem is limited to 100,000. Our test problems are taken from the CUTEst linear programme set [19] and the University of Florida Sparse Matrix Collection [10]. In each case, the matrix is "cleaned" (duplicates are summed, and out-of-range entries and explicit zeros are removed along with any null rows or columns); details of the resulting test problems are summarized in Table 1.2. Here the nullity is computed by running HSL MA97 on K with the pivot threshold parameter set to 0.5 (see section 2); the reported nullity is the difference between m + n and the returned estimate of the matrix rank. Note that this estimate can be sensitive to the choice of ordering and scaling: HSL MA97 was used with no scaling and the nested dissection ordering computed by Metis [32]. In our experiments, if a right-hand side b is provided, it is used; otherwise, we take b to be the vector of 1's.
We employ the preconditioned LSMR algorithm of Fong and Saunders [14]. Like the more established LSQR algorithm [41,42], it is based on Golub-Kahan bidiagonalization of A. However, while in exact arithmetic LSQR is mathematically equivalent to applying the conjugate gradient method to (1.2), LSMR is equivalent to applying MINRES [40], so that the quantities ‖A^T r_k‖_2 and ‖r_k‖_2 (where x_k and r_k = b − Ax_k are the least-squares solution and residual on the kth step, respectively) are monotonically decreasing. Fong and Saunders report that LSMR may be a preferable solver because of this and because it may be able to terminate significantly earlier.
Experiments in [20,21] confirm this view and support our choice of LSMR.
Following Gould and Scott [21], in our experiments with LSMR we use the stopping rule

(1.4) ‖A^T r_k‖_2 / ‖r_k‖_2 < δ.

Unless indicated otherwise, we set the convergence tolerance δ to 10^{−6}. Note that (1.4) is independent of the choice of preconditioner.
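A preconditioner-independent quantity such as ‖A^T r_k‖_2/‖r_k‖_2 can be evaluated directly from A and the current residual, outside the iterative solver. A small sketch using SciPy's LSMR on a random dense problem (our own example, not one of the paper's test matrices):

```python
import numpy as np
from scipy.sparse.linalg import lsmr

# Sketch of a preconditioner-independent convergence check: after LSMR
# returns x, evaluate ||A^T r||_2 / ||r||_2 from A and r = b - A x,
# independently of whatever preconditioning LSMR used internally.
rng = np.random.default_rng(2)
A = rng.standard_normal((200, 40))   # well-conditioned toy problem
b = rng.standard_normal(200)
x = lsmr(A, b, atol=1e-12, btol=1e-12)[0]
r = b - A @ x
ratio = np.linalg.norm(A.T @ r) / np.linalg.norm(r)
print(ratio)   # small when x is close to the least-squares solution
```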
When solving (1.3) using an indefinite preconditioner, we use right-preconditioned restarted GMRES [47]. Since GMRES is applied to K, a stopping criterion is applied to Ky = c. With the available implementations of GMRES, it is not possible during the computation to check whether the stopping condition (1.4) (which is based on A) is satisfied; it can, of course, be checked once GMRES has terminated. Instead, in our experiments involving (1.3), we use the scaled backward error

‖c − K y_k‖_2 / ‖c‖_2,

where y_k is the computed solution of the augmented system on the kth step. We set the tolerance to 10^{−7}. This stopping criterion is also applied in experiments involving MINRES.
2. Sparse direct solvers. Sparse direct solvers have long been used to solve both (1.2) and (1.3). Here we briefly introduce such solvers, highlighting a number of features that are particularly relevant to understanding their performance when used for rank-deficient least-squares problems.
Sparse direct methods are designed to solve symmetric linear systems Az = f (A = {A_{i,j}}) by performing a factorization

A = LDL^T,

where L is a unit lower triangular matrix and D is a block diagonal matrix with nonsingular 1 × 1 and 2 × 2 blocks. In practice, a more general factorization of the form SAS = (PL)D(PL)^T is computed, where S is a diagonal scaling matrix and P is a permutation matrix that holds the pivot (elimination) order. If A is positive definite, D is diagonal with positive diagonal entries, and in this case, L may be redefined to be the lower triangular matrix L ⇐ LD^{1/2}, giving the Cholesky factorization

SAS = (PL)(PL)^T.

For efficiency in terms of both time and memory, it is essential to choose P to exploit the sparsity of A. The structure of the factor L is the union of the structure of the permuted matrix P^T AP and new entries known as fill. The amount of fill is highly dependent on the choice of P. Direct solvers choose P in an analyze phase that precedes the numerical factorization and generally works solely with the sparsity pattern of A. Observe that there is no one scaling algorithm that provides the best scaling for all possible matrices A, and so some direct solvers offer different options for computing S, while others require (or optionally allow) the user to supply S.
For positive definite systems, the chosen pivot order can be used unaltered by the numerical factorization. However, for indefinite systems, it may be necessary to make modifications to maintain numerical stability. This is done by delaying the elimination of variables that could cause instability until later in the factorization, when the associated pivot (that is, the 1 × 1 or 2 × 2 block used to eliminate one, respectively, two variables) can be safely used. The exact method used to select pivots during the numerical factorization varies from solver to solver, but essentially each seeks to avoid dividing a large off-diagonal entry by a small diagonal one. If the elimination of variable k is delayed, either an update from another elimination will increase the magnitude of the diagonal entry A_{k,k}, or column k will become adjacent to a column k+1 with the property that A_{k,k+1} is large and hence can be incorporated into a stable 2 × 2 pivot.
The Cholesky solver HSL MA87 is designed to run on multicore architectures. It splits each part of the computation into tasks of modest size but sufficiently large that good level-3 BLAS performance can be achieved. The dependencies between the tasks are implicitly represented by a directed acyclic graph (DAG). This solver requires the user to supply both the pivot order P and the scaling S (other HSL packages are available for computing these).
By contrast, HSL MA97 is a parallel multifrontal code that is able to solve both positive definite and indefinite systems (although its performance on positive definite systems is generally not competitive with that of HSL MA87). HSL MA97 offers a range of ordering and scaling options and also allows the user to supply P and/or S. In a multifrontal method, the factorization of A proceeds using a succession of assembly operations of small dense matrices, interleaved with partial factorizations of these matrices. The assembly operations can be recorded as a tree, known as an assembly tree. The assembly proceeds from the leaf nodes up the tree to the root node(s). Typically, most of the flops are performed at the root node and the final few levels of the tree. For the efficient and stable partial factorization of the dense submatrices, HSL MA97 uses separate computational kernels for the positive definite and indefinite cases. For the latter, HSL MA97 employs the sufficient (but not necessary) conditions given by Duff et al. [12] for threshold partial pivoting to be stable. Let A^{(k)} denote the Schur complement after columns 1, ..., k − 1 of A have been eliminated, and let u ∈ [0, 0.5] be the pivot threshold. The criteria for stability are the following: a 1 × 1 pivot A^{(k)}_{k,k} is acceptable if

max_{i>k} |A^{(k)}_{i,k}| ≤ u^{−1} |A^{(k)}_{k,k}|,

and a 2 × 2 pivot formed from columns k and k+1 is acceptable if

| ( A^{(k)}_{k,k} A^{(k)}_{k,k+1} ; A^{(k)}_{k+1,k} A^{(k)}_{k+1,k+1} )^{−1} | ( max_{i>k+1} |A^{(k)}_{i,k}| ; max_{i>k+1} |A^{(k)}_{i,k+1}| ) ≤ ( u^{−1} ; u^{−1} ),

where the modulus of the matrix is interpreted elementwise. Additionally, it is required that the pivot be nonsingular and inverted stably. In the case where u is zero, this is interpreted as requiring that the pivot be nonsingular. Observe that these conditions imply that each entry in L is bounded in modulus by u^{−1}. The choice of u is a compromise between stability and sparsity: the larger u is, the more stable the factorization will be, but the fill in L may increase as more pivots fail the stability test, causing them to be delayed. The default value within HSL MA97 is u = 0.01, and, unless stated otherwise, this value is used in our experiments. Note that including pivoting in an indefinite sparse direct solver is necessary for stability, but it has the major disadvantage of hindering parallelism. Note also that HSL MA97 is designed to handle the case where the system matrix A is singular. The code holds D^{−1}, and when a zero pivot is encountered, the corresponding entry of D^{−1} is set to zero. The corresponding components of the solution vector are thus set to zero.
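The 1 × 1 case of the threshold test can be sketched in a few lines (a toy dense version for illustration only; HSL MA97 applies the test within the frontal matrices, not like this):

```python
import numpy as np

# Toy dense sketch of the 1x1 threshold test: the candidate diagonal pivot
# Ak[k, k] is accepted if the largest off-diagonal entry below it in its
# column is at most |Ak[k, k]| / u, which bounds every entry of L by 1/u.
def accept_1x1(Ak, k, u):
    below = np.abs(Ak[k + 1:, k])
    col_max = below.max() if below.size else 0.0
    return bool(Ak[k, k] != 0.0 and col_max <= abs(Ak[k, k]) / u)

Ak = np.array([[4.0, 1.0, 0.5],
               [1.0, 3.0, 2.0],
               [0.5, 2.0, 1.0]])
print(accept_1x1(Ak, 0, 0.01))   # True: 1.0 <= 4.0 / 0.01
Bk = Ak.copy()
Bk[0, 0] = 0.01                  # tiny pivot under a large off-diagonal
print(accept_1x1(Bk, 0, 0.5))    # False: 1.0 > 0.01 / 0.5
```

A pivot that fails the test would be delayed, exactly the mechanism that produces the fill growth discussed above.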
Both HSL MA87 and HSL MA97 employ a technique known as node amalgamation. This has become well established as a means of improving the factorization speed at the expense of the number of entries in the factor L and the operation counts during the factorization and subsequent solve phase. During the analyze phase, a child node in the tree is merged with its parent if both parent and child have fewer than a prescribed number nemin of variables that are eliminated, or if merging parent and child generates no additional nonzeros in L. The value of the parameter nemin determines the level of node amalgamation, with a value in the range of 8 to 32 typically recommended as providing a good balance between sparsity and efficiency in the factorize and solve phases (see [25,44]). In our experiments, we set nemin equal to 32.
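The merging rule can be sketched on a toy tree (hypothetical data structures of our own; real solvers work on the assembly tree produced by the analyze phase, and we omit the zero-extra-fill condition):

```python
# Toy sketch of the node-amalgamation rule used during analyze: a child
# node of the assembly tree is merged into its parent if both eliminate
# fewer than nemin variables.
def amalgamate(tree, nelim, nemin=32):
    # tree: dict child -> parent; nelim: dict node -> #eliminated variables
    merged = {}
    for child, parent in sorted(tree.items()):
        if nelim[child] < nemin and nelim[parent] < nemin:
            merged[child] = parent
            nelim[parent] += nelim[child]   # parent absorbs child's variables
    return merged

tree = {1: 3, 2: 3, 3: 4}
nelim = {1: 5, 2: 40, 3: 10, 4: 100}
print(amalgamate(tree, nelim))   # {1: 3}: only node 1 and its parent are small
```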

3. Using a direct solver as a preconditioner for LSMR. If A does not have full column rank, C = A^T A is symmetric and positive semidefinite, and thus attempting to compute a Cholesky factorization will suffer breakdown. Breakdown happens when a zero (or negative) pivot is encountered; if this happens, a Cholesky solver will terminate the computation with an error flag. In this section, we consider using a regularization term to allow a Cholesky factorization to be used and compare this approach with employing a symmetric indefinite solver to factorize C. The former requires an iterative method to recover the required accuracy in the solution, while the latter involves the overhead of pivoting; numerical results are used to explore which approach is the most efficient.
When C is positive semidefinite and breakdown of a Cholesky factorization occurs, a simple remedy is to employ a global shift α > 0 and then compute a Cholesky factorization L_α L_α^T of the scaled and shifted matrix

(3.1) C_α = SA^T AS + αI.

Here S is again a diagonal scaling matrix. The shift α is also referred to as a Tikhonov regularization parameter. The choice of α should be related to the smallest eigenvalue of SA^T AS, but this information is not readily available. Clearly it is always possible to find an α so that C_α is positive definite; if the initial choice of α is too small (that is, C_α is positive semidefinite), it may be necessary to restart the factorization more than once, increasing α on each restart until breakdown is avoided. If the regularized normal equations are solved, the computed value of the least-squares objective ‖r_α‖_2 = ‖b − ASx_α‖_2 may differ from the optimum for the original problem. We can seek to obtain the solution x to (1.1) by applying a refinement process. The standard approach for linear systems is to employ a small number of steps of iterative refinement. However, iterative refinement is not effective when applied to the normal equations if the normal matrix is ill-conditioned [3]. We thus propose using the Cholesky factors L_α L_α^T of C_α as a preconditioner for LSMR applied to the original (unshifted) problem. In our experiments, we use l_2-norm scaling of A, that is, S_{ii} = 1/‖Ae_i‖_2 (where e_i is the ith unit vector), so that each column of AS is normalized by its 2-norm. The diagonal entries of SA^T AS are then all 1. In the positive definite case, van der Sluis [55] proved that if all the diagonal entries of SCS are equal, then SCS has condition number close to the optimal among diagonal scalings of C.
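The shift-and-restart strategy can be sketched as follows (a dense toy version with NumPy; the initial shift and the growth factor are our own choices, not those of the paper):

```python
import numpy as np

# Toy sketch of the restart strategy: scale A by its column 2-norms, then
# attempt a Cholesky factorization of C_alpha = S A^T A S + alpha*I,
# increasing the shift after each breakdown until the factorization succeeds.
rng = np.random.default_rng(3)
m, n = 40, 6
A = rng.standard_normal((m, n))
A[:, -1] = A[:, 0]                       # duplicate column => rank deficient
S = np.diag(1.0 / np.linalg.norm(A, axis=0))
C = S @ A.T @ A @ S                      # unit diagonal after scaling
alpha = 1e-16
while True:
    try:
        L = np.linalg.cholesky(C + alpha * np.eye(n))
        break
    except np.linalg.LinAlgError:        # breakdown: zero/negative pivot
        alpha *= 10.0                    # restart with a larger shift
print(alpha)                             # a tiny shift suffices here
```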
In Figure 3.1, we plot the number of iterations required by preconditioned LSMR and the total time for solving problem Maragal 6 using values of the shift α in the range 10^{−14} to 10^{−2}; here L_α is computed using the positive definite solver HSL MA87 with the Metis [32] nested dissection ordering. Similar patterns are observed for other problems, although for some of our test examples (including DBIR2 and mri2) we found with the l_2-norm scaling of A that we needed to use α ≥ 10^{−12} to avoid breakdown. In Table 3.1, we report results for α = 10^{−12}. In many cases, the requested accuracy is achieved with a single step of LSMR. We did experiment with running HSL MA97 in positive definite mode, but, as reported in [29], it is generally slower than HSL MA87, and so detailed results are omitted.

3.2. Comparison with using a general indefinite solver. An alternative to computing a Cholesky factorization of the regularized normal equations and using the factor to precondition LSMR is to solve the original (unshifted) normal equations using a sparse symmetric indefinite direct solver that allows the system matrix to be singular, provided that the equations to be solved are consistent; this is always true for the normal equations. The latter has the advantage of not requiring the selection of a shift α, but allowing pivoting for numerical stability (as discussed in section 2) adds to the factorization cost. Results for our test problems run with the solver HSL MA97 in indefinite mode are included in columns 7-9 of Table 3.1 (using the l_2-norm scaling of A, the Metis ordering, and the default threshold parameter u = 0.01). No refinement is used. We see that the positive definite solver applied to the scaled and regularized normal equations results in a faster total solution time compared to the indefinite approach.
4. Solving the augmented system with a direct solver. In this section, we look at using a direct solver to solve the augmented system (1.3). We first consider employing a general-purpose symmetric indefinite solver that incorporates pivoting. We report on the effects of the choice of scaling on the fill in the factors and computation time and show that either a matching-based scaling or l_2-norm scaling of A is generally the method of choice. We also examine whether the tactic that is sometimes used in optimization problems of reducing the threshold parameter u can improve the solver performance without resulting in instability. Then, in section 4.2, we consider regularizing the augmented system. The factors of the regularized matrix are used to provide a preconditioner for an iterative method. Numerical results are compared with those for the regularized normal equations presented in the last section.

The indefinite solver computes a factorization of K of the form SKS = (PL)D(PL)^T, with S a diagonal scaling matrix, P a permutation matrix, and D block diagonal (with 1 × 1 and 2 × 2 blocks). During the factorization of K, the zero (2,2) block can lead to many modifications being made to the analyze pivot order to preserve stability; this is reflected in the number of delayed pivots reported by HSL MA97 (ndelay). It is well known that for some sparse indefinite problems, the choice of the scaling S can have a significant impact on reducing the number of delayed pivots and hence the fill in L and the overall performance (see, for example, [26]). When developing HSL MA97, Hogg and Scott [30] studied pivoting strategies for tough sparse indefinite systems. The scaling options offered by HSL MA97 are to (1) generate a scaling using a weighted bipartite matching (MC64) [13]; (2) generate an equilibration-based scaling (MC77) [45]; (3) use a matching-based ordering and the scaling that is generated as a side effect of this process (MC80); or (4) generate a scaling by minimizing the absolute sum of log values in the scaled matrix (MC30). Note that these scalings do not
exploit the block structure of K but treat K as a general indefinite sparse symmetric matrix. In addition to the built-in HSL MA97 options, we compute and input the l_2-norm scaling of K and compute the l_2-norm scaling of A. In the latter case, we input the scaling matrix

(4.1) S = ( I_m 0 ; 0 S̄ ), where S̄_{ii} = 1/‖Ae_i‖_2.

We illustrate the effects of scaling on a subset of our problems in Table 4.1. Here we use the default threshold parameter u = 0.01 and the Metis ordering and set the parameter γ = 1 in (1.3). These examples were chosen because they were found to be sensitive to the scaling; for some of our other test examples, it had a much smaller effect. Closer examination shows that if we compute the diagonal entries of C, the ratio of the largest to the smallest diagonal entry for these examples is large (see Table 1.2), indicating poor initial scaling. As expected, no single scaling gives the best results on all problems. The matching-based MC64 scaling can lead to sparse factors, but it can be expensive to compute. Based on our findings, we use l_2-norm scaling of A for our remaining experiments with HSL MA97 applied to K, and we set γ = 1 [1].
Results for solving the augmented system are given in Table 4.2. Compared to employing a direct solver to factorize the normal equations (Table 3.1), we see that, in general, factorizing the augmented system with the default threshold parameter u = 0.01 results in significantly more entries in the factors plus slower computation times. In some optimization applications (see, for example, [17,49]), it is common practice to try to make the factorization of indefinite systems more efficient in terms of time and memory by employing a relaxed threshold parameter u and to increase u only if the linear system is not solved with sufficient accuracy. Thus in Table 4.2 we also include results for u = 10^{−8}. With the exception of problems BAXTER and NSCT2, this gives only a modest reduction in the number of entries in the factors and in the number of delayed pivots, but there can be a significant reduction in the time (including for problems Maragal 8 and mri2). Further examination reveals that pivots are still being rejected at the nonroot nodes and passed up the assembly tree to the root node, where a dense factorization is performed. The delayed pivots result in the root node being much larger than was predicted by the analyze phase, and, in this case, the factorization of the root node accounts for most of the operations and time.
Using a small value of the threshold parameter significantly reduces the time for the root node factorization, and it is this that leads to the overall reduction in the time.

4.2. Regularized augmented system.
To attempt to reduce the number of entries in the factors of the augmented system resulting from delayed pivots, we consider the regularized system

(4.2) K_β ( r(x_β) ; x_β ) = ( b ; 0 ), K_β = ( I_m A ; A^T −βI_n ),

where r(x_β) = b − Ax_β and β > 0 (see, for example, [16,48]). This is a symmetric quasi-definite (SQD) system. Vanderbei [56] shows that, in exact arithmetic, SQD systems are strongly factorizable, i.e., a signed Cholesky factorization of the form LDL^T (with D diagonal having both positive and negative entries) exists for any symmetric permutation P. Thus P can be chosen to maintain sparsity. However, the signed Cholesky factorization may be unstable. A stability analysis is given by Gill, Saunders, and Shinnerl [17] (see also [15,18]), which shows the importance of the effective condition number of K_β for the stability of the factorization. We note that other regularizations of the augmented system have been proposed. In particular, Saunders [48,49] and George and Saunders [15] use an SQD matrix with regularization parameters β_1, β_2 > 0 in the (1,1) and (2,2) blocks, and in their experiments they set β_1 = β_2 = 10^{−6}. Saunders suggests that this may be favorable when A is ill-conditioned. We apply our sparse symmetric indefinite solver HSL MA97 to the scaled regularized augmented matrix SK_β S with the threshold parameter u set to 0.0 and S given by (4.1). With β > 0, the computed value of the least-squares objective ‖r_β‖_2 = ‖b − ASx_β‖_2 may differ from the optimum for the original problem, and if the stopping criterion (1.4) is not satisfied, we propose using the factors as a preconditioner for an iterative method [15]. Here we use right-preconditioned GMRES applied to the original augmented system (1.3) (with γ = 1); GMRES is used without restarting, as only a small number of iterations are required. An alternative would be to use the symmetric solver MINRES [40]. This requires a positive definite preconditioner, and so we could employ the method presented by Gill et al. [16] to modify D_β and L_β (note that this approach has been used recently by Greif, He, and Liu [22]).
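The overall strategy, factorizing the regularized SQD matrix and using the factors to precondition an iterative solve of the original augmented system, can be sketched with SciPy (an illustrative stand-in: we use a sparse LU of K_β rather than the signed Cholesky factorization of HSL MA97, and a toy full-rank A of our own):

```python
import numpy as np
import scipy.sparse as sp
import scipy.sparse.linalg as spla

# Sketch: factorize K_beta = [I A; A^T -beta*I] and use the factors as a
# preconditioner for GMRES applied to the original K = [I A; A^T 0].
m, n, beta = 60, 10, 1e-8
A = sp.vstack([sp.identity(n),
               sp.random(m - n, n, density=0.2, random_state=4)],
              format="csr")
b = np.ones(m)
K = sp.bmat([[sp.identity(m), A], [A.T, None]], format="csc")
Kb = sp.bmat([[sp.identity(m), A], [A.T, -beta * sp.identity(n)]],
             format="csc")
M = spla.LinearOperator(K.shape, spla.splu(Kb).solve)   # applies K_beta^{-1}
c = np.concatenate([b, np.zeros(n)])
y, info = spla.gmres(K, c, M=M)
err = np.linalg.norm(c - K @ y) / np.linalg.norm(c)     # scaled backward error
print(info, err)   # few iterations are needed since K_beta is close to K
```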
Our results using GMRES are given in Figure 4.1 and Table 4.3. The figure looks at the effects of varying the regularization parameter β on the number of iterations and the solution time for problem Maragal 6. As β increases, so too do the iteration count and the time; similar patterns are seen for our other test examples. For the results in Table 4.3, we set β = 10^{−8}. For our test set, this gives no delayed pivots. While the precise choice of β is not important, if β is "too small," some pivots may be delayed and the factorization become less stable, resulting in more GMRES iterations being needed for the requested accuracy than for a larger β. Comparing Tables 4.2 and 4.3, we note that we obtain much sparser factors than previously, and, as the number of iterations of GMRES is generally modest (indeed, often we did not need to use GMRES to obtain the requested accuracy), we have significantly faster total solution times (note, in particular, problems Maragal 8 and mri2). We observe that for problem BAXTER, we can reduce the number of GMRES iterations if we use a smaller β: with β = 10^{−11}, only 11 iterations are needed.

Table 4.3
Results for solving the regularized augmented (SQD) system with β = 10^{−8} using the direct solver HSL MA97 with threshold parameter u = 0.0. nnz(L_β) denotes the number of entries in the factor, time_f and time_t are the HSL MA97 and total solution times (in seconds), the value of the least-squares objective before and after applying preconditioned GMRES is ‖r_β‖_2 and ‖r‖_2, respectively, and itn is the number of GMRES iterations.
The number of entries in the factors and the total solution times using HSL MA97 to solve the regularized augmented system (4.2) and HSL MA87 applied to the regularized normal equations are compared in Figure 4.2. A point above the line y = 1 indicates that using the normal equations is the better choice; the converse is true for a point below the line. We see that in many cases there is little to choose between the two approaches in terms of the size of the factors, but that for a small number of examples (including the Maragal problems) the augmented system factors are significantly sparser. However, the normal equations approach with the positive definite solver is faster for most of the remaining problems.

5. Incomplete factorization of the normal matrix C.
Having explored in section 3 complete factorizations of the (regularized) normal equations, in this section we look at computing an incomplete Cholesky (IC) factorization for use as a preconditioner for LSMR. An IC factorization takes the form LL^T, in which some of the fill entries that would occur in a complete factorization are ignored. The preconditioned normal equations become

L^{−1} A^T A L^{−T} y = L^{−1} A^T b, x = L^{−T} y.

Performing the preconditioning operations involves solving triangular systems with L and L^T. Over the years, a wealth of different IC variants have been proposed, including structure-based methods, those based on dropping entries below a prescribed threshold, and those based on prescribing the maximum number of entries allowed in L (see, for instance, [2,46,51] and the references therein). Level-based methods (IC(k)) that are based on the sparsity pattern of A plus a small number of levels of fill are popular and straightforward to implement. However, they are best suited to linear systems arising from discretized partial differential equations. Here we use the limited-memory approach of Scott and Tůma [50,51], which generalizes the ICFS algorithm of Lin and Moré [35]. The scheme is based on a matrix factorization of the form

C = (L + R)(L + R)^T − E,

where L is the lower triangular matrix with positive diagonal entries that is used for preconditioning, R is a strictly lower triangular matrix with small entries that is used to stabilize the factorization process but is subsequently discarded, and E = RR^T. On the jth step of the factorization, the first column of the Schur complement is decomposed into a sum of two vectors l_j + r_j such that |l_j|^T |r_j| = 0 (with the first entry in l_j nonzero), where l_j (respectively, r_j) contains the entries that are retained in (respectively, discarded from) the incomplete factorization. In the next step of a complete decomposition, the Schur complement of order n − j would be updated by subtracting the outer product of the pivot row and column, that is, by subtracting (l_j + r_j)(l_j + r_j)^T. In the incomplete case, the positive semidefinite term E_j = r_j r_j^T is not subtracted. Moreover, to further limit the memory required, drop tolerances are (optionally) used. If at some stage a zero or negative pivot is encountered, the factorization suffers breakdown and, as in section 3, a shift is applied and the incomplete factorization of the shifted matrix (3.1) is computed.
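The column split at the heart of the limited-memory scheme can be sketched as follows (a toy dense version; the names lsize and rsize follow the text, everything else is our own illustration):

```python
import numpy as np

# Toy sketch of the memory-limited column split: the current Schur-complement
# column is separated into l_j (the pivot plus the lsize largest off-pivot
# entries, kept in L), r_j (the next rsize largest, kept temporarily in R),
# and dropped entries; l_j and r_j have disjoint nonzero patterns.
def split_column(col, lsize, rsize):
    order = np.argsort(-np.abs(col[1:])) + 1    # rank the off-pivot entries
    l, r = np.zeros_like(col), np.zeros_like(col)
    l[0] = col[0]                               # the pivot always stays in L
    l[order[:lsize]] = col[order[:lsize]]
    r[order[lsize:lsize + rsize]] = col[order[lsize:lsize + rsize]]
    return l, r

col = np.array([2.0, 0.5, -1.2, 0.01, 0.9, -0.3])
l, r = split_column(col, lsize=2, rsize=2)
print(l)   # pivot plus the two largest off-pivot entries
print(r)   # the next two largest; the remaining entry is dropped
```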
A software package HSL MI35 that implements this limited-memory IC algorithm for least-squares problems has been developed. This code is a modification of HSL MI28 [50], which is designed for symmetric positive definite systems. Modifications were needed to allow the user to specify the maximum number of entries allowed in each column of the incomplete factor L (in HSL MI28 the user specified the amount of fill allowed, but as columns of the normal matrix C may be dense, or close to dense, this change was needed to keep L sparse). Furthermore, there is no need to form and store all of C explicitly; rather, the lower triangular part of its columns can be computed one at a time and then used to perform the corresponding step of the incomplete Cholesky algorithm before being discarded. HSL MI35 includes a number of scaling and ordering options so that an incomplete factorization of C_α = P^T S C S P + αI is computed, where P is a permutation matrix chosen on the basis of sparsity, S is a diagonal scaling matrix, and α ≥ 0. Based on extensive experimentation in [50], the default ordering is the profile reduction ordering of Sloan [54]. We experimented with using the l2-norm scaling of A and also, as discussed in section 3, applying the l2-norm scaling of A, forming C_S = S A^T A S, and then applying the l2-norm scaling of C_S (in practice, this requires us to compute only one column of C_S at a time). We found that for some examples this "double" scaling resulted in a smaller shift α being needed and hence gave a higher quality preconditioner. Thus we use it for our reported results. In the following, lsize and rsize denote the parameters that control the maximum number of entries in each column of L and R, respectively. In each test, the initial shift α is 0.001. In the event of breakdown, it is increased until the incomplete factorization is successful (see [50] for details). The recorded times include the time to restart the factorization following any breakdowns.
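The column-at-a-time construction of the scaled normal matrix can be sketched as follows; `normal_matrix_column` is a hypothetical helper (not part of HSL MI35) that forms one column of C_S = S A^T A S using only sparse products with A and A^T.

```python
import numpy as np
import scipy.sparse as sp

def normal_matrix_column(A, s, j):
    """Form column j of C_S = S A^T A S without ever storing C_S.

    A is a sparse m x n matrix and s holds the diagonal of the scaling S
    (here the l2-norm column scaling of A).  Each column needs only one
    sparse product with A and one with A^T.  A hypothetical helper used
    to illustrate the column-at-a-time strategy; it is not HSL_MI35 code.
    """
    ej = np.zeros(A.shape[1])
    ej[j] = s[j]                       # j-th column of S
    return s * (A.T @ (A @ ej))        # S A^T A (S e_j)

# Example: l2-norm column scaling of a small sparse A.
A = sp.csr_matrix(np.array([[1., 2.], [0., 3.], [4., 0.]]))
s = 1.0 / np.sqrt(A.power(2).sum(axis=0)).A1   # s_j = 1 / ||A e_j||_2
```

With this scaling the diagonal entries of C_S are all equal to one, and a further l2-norm scaling of C_S itself can be accumulated one column at a time in the same way.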
Figure 5.1 illustrates the potential benefits of employing intermediate memory R in the construction of L (note that the ICFS software [35] employs no intermediate memory). Here we set lsize = 200 and vary rsize from 0 to 1000; the drop tolerances are set to 0.0. For each value of rsize, the number of entries in L is nnz(L) = 6.63 × 10^6. We see that increasing the intermediate memory stabilizes the factorization, reducing the shift and giving a higher quality preconditioner that requires fewer LSMR iterations and less time. Observe that because the number of restarts following a breakdown decreases as rsize increases, the time for computing the IC factorization does not necessarily increase with rsize.
In Table 5.1, we report results for HSL MI35 applied to our test set; default settings are used for the ordering and dropping parameters. We experimented with a range of values for the parameters lsize and rsize that control the memory, and for each example we chose values that perform well. While the preconditioner quality improves and the number of LSMR iterations decreases as the memory increases, the factorization times generally increase and, because nnz(L) increases, each application of the preconditioner becomes more expensive. Thus the best values of lsize and rsize in terms of the total time are highly problem dependent (in particular, we found that using rsize > 0 is not always beneficial); the results given in Table 5.1 illustrate this. We struggle to solve problem BAXTER: a large number of iterations are required, and the computed ||r||_2 is (slightly) larger than reported elsewhere. However, if we compare the results for the other examples with those in Table 3.1 for the direct solver applied to the normal equations, we see that for many examples, the IC times are competitive with the direct solver times. Moreover, the IC factorization produces significantly sparser factors, giving it the potential to be used successfully for much larger problems than can be tackled by a direct solver. Observe that the shift α for the IC factorization is significantly larger than that used by the direct solver. Consequently, the number of iterations needed for convergence can be large.
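The way an IC factor is used with LSMR can be sketched with SciPy: the factor acts as a right preconditioner through triangular solves. `lsmr_right_preconditioned` and its tolerances are illustrative assumptions (and the test uses a complete dense Cholesky factor), not the experimental harness behind the reported results.

```python
import numpy as np
from scipy.linalg import cholesky, solve_triangular
from scipy.sparse.linalg import LinearOperator, lsmr

def lsmr_right_preconditioned(A, b, L):
    """Run LSMR on min ||A L^{-T} y - b||_2 and recover x = L^{-T} y.

    L is a (possibly incomplete) Cholesky factor of A^T A or of a shifted
    matrix; the preconditioning operations are triangular solves with L
    and L^T.  An illustrative sketch with dense triangular solves and
    arbitrary tolerances.
    """
    m, n = A.shape
    op = LinearOperator(
        (m, n),
        matvec=lambda y: A @ solve_triangular(L, y, lower=True, trans='T'),
        rmatvec=lambda u: solve_triangular(L, A.T @ u, lower=True),
    )
    y = lsmr(op, b, atol=1e-12, btol=1e-12)[0]
    return solve_triangular(L, y, lower=True, trans='T')
```

With a complete factor L the preconditioned matrix has orthonormal columns and LSMR converges almost immediately; with an incomplete factor the iteration count grows as the quality of L degrades.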

6. Preconditioning strategies for the augmented system. In this section, we consider using an incomplete factorization as a preconditioner for an iterative method applied to the augmented system (1.3). One possibility is to extend the positive definite limited-memory approach outlined in section 5. This was proposed by Scott and Tůma, who presented a limited-memory signed IC factorization [52].
We discuss the signed IC approach and compare it to the recent LLDL approach of Orban [39].
Scott and Tůma compute an incomplete factorization of the form LDL^T, where L is a lower triangular matrix with positive diagonal entries and D is a diagonal matrix with entries ±1. In practice, an LDL^T factorization of the shifted and scaled matrix

    K_{α1,α2} = P^T S K S P + [ α1 I   0 ; 0   −α2 I ]

is computed, where α1 and α2 are nonnegative shifts chosen to prevent breakdown of the factorization. The preconditioner is taken to be L̃DL̃^T, with L̃ = S^{-1}PL. Scott and Tůma choose the permutation P not only on the basis of sparsity, but also so that a variable in the (2,2) block of K is not ordered ahead of any of its neighbors in the (1,1) block. The idea here is to try to prevent a small (or zero) pivot candidate from being chosen; see [52] for details of this so-called constrained ordering.
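Assuming the standard least-squares augmented form K = [I A; A^T 0] for (1.3), the shifted matrix can be assembled as below; `shifted_augmented` is a hypothetical helper for illustration only, and it omits the scaling S and permutation P.

```python
import numpy as np
import scipy.sparse as sp

def shifted_augmented(A, alpha1=0.0, alpha2=0.0):
    """Assemble a shifted augmented matrix for a sparse m x n matrix A:

        K(alpha1, alpha2) = [ (1 + alpha1) I       A       ]
                            [      A^T        -alpha2 I    ].

    This assumes the standard least-squares augmented form K = [I A; A^T 0]
    for (1.3) and omits the scaling and permutation; a hypothetical helper
    for illustration only.
    """
    m, n = A.shape
    return sp.bmat([[(1.0 + alpha1) * sp.eye(m), A],
                    [A.T, -alpha2 * sp.eye(n)]], format='csc')
```

The nonnegative shifts push the (1,1) block further into positive definiteness and make the (2,2) block negative definite, matching the ±1 sign pattern expected by a signed incomplete factorization.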
An implementation is available as the HSL package HSL MI30. As with the IC code HSL MI35, intermediate memory (R) is optionally used in the construction of the factor L and is then discarded. The user controls the amount of fill allowed in each column of L (lsize) and the number of entries in each column of R (rsize). The code also includes a range of ordering and scaling options as well as optional dropping parameters (to control the discarding of small entries from L and R).
HSL MI30 is related to the recent LLDL software developed independently by Orban [39]. The latter extends the ICFS code of Lin and Moré [35] to SQD matrices and thus can be applied to the regularized augmented system (4.2). The main differences between HSL MI30 and LLDL are the following:
1. LLDL uses a single shift (that is, α1 = α2).
2. LLDL does not employ intermediate memory (that is, rsize = 0).
3. The HSL MI30 factorization suffers breakdown and is restarted with an increased shift whenever a candidate pivot is not of the expected sign (or is zero). In LLDL there is breakdown only if a pivot is zero; in this case, the shift is increased and the factorization restarted.
4. LLDL does not use a constrained ordering but advises the user to preorder K using a sparsity-preserving ordering, such as approximate minimum degree (AMD) (there is no built-in option for ordering K).
5. LLDL does not include an option to drop small entries during the factorization.
6. HSL MI30 allows "spare" space from one column of L to be used for the next column (so that if, after dropping, column j of L has p < lsize fill entries, then column j + 1 may have up to 2*lsize − p fill entries).
For our experiments, we modified HSL MI30 to implement the same algorithm as LLDL; this facilitates testing using the same ordering and scaling. We refer to this as the LLDL option (although note that if Orban's LLDL package is used, it will return similar but not identical results). Numerical results are given in Tables 6.1 and 6.2. The memory parameters lsize and rsize were chosen after testing a range of values (with the same lsize used for HSL MI30 and the LLDL option). Recall that ratio(r) is given by (1.4); it enables us to see whether the LSMR stopping criterion is satisfied. We employ l2-norm scaling of A and AMD ordering; the iterative solver is restarted GMRES (with the restart parameter set to 1000). For tests using the LLDL option, we set α1 = α2 = 1.0; this choice was made on the basis of experimentation. We set the regularization parameter β to 10^-6 (recall (4.2)), but our experience is that, for our test set and chosen settings, using a nonzero value has little effect on the quality of the results. We observe that while HSL MI30 appears robust (it successfully solved all the problems in our test set except BAXTER, which is omitted because we were unable to choose parameters that led to successful convergence), the LLDL option fails to solve a number of problems within our limits of 600 seconds and 100000 iterations.
An alternative to using an IC-based factorization is to employ a general incomplete indefinite factorization code. Chow and Saad [8] considered the class of incomplete LU preconditioners for solving indefinite problems, and later Li and Saad [33] integrated pivoting procedures with scaling and reordering. Building on this, Greif, He, and Liu [22] recently developed an incomplete factorization package called SYM-ILDL for general sparse symmetric indefinite matrices. Here the system matrix may be any sparse indefinite matrix; no advantage is taken of the specific block structure of (1.3). Independently, Scott and Tůma [53] report on the development of incomplete factorization algorithms for symmetric indefinite systems and propose a number of new ideas with the goal of improving the stability, robustness, and efficiency of the resulting preconditioner. Preliminary experiments on our rank-deficient least-squares test problems have found that the indefinite factorization is much less robust than the signed IC approach, and so full results are not included here.
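The restarted GMRES solves described above can be sketched as follows, applying an LDL^T factorization (complete, in this toy setting) as the preconditioner. `gmres_with_ldlt` and its interface are illustrative assumptions rather than the experimental code.

```python
import numpy as np
from scipy.linalg import solve_triangular
from scipy.sparse.linalg import LinearOperator, gmres

def gmres_with_ldlt(K, c, L, D, restart=1000):
    """Solve K y = c with restarted GMRES, preconditioned by an LDL^T
    factorization: each preconditioner application performs a forward
    solve with L, a solve with D, and a backward solve with L^T.

    A toy sketch of the setup above: here L and D are dense and complete,
    whereas the experiments use incomplete signed factors; the function
    name and interface are illustrative assumptions.
    """
    def apply_prec(v):
        w = solve_triangular(L, v, lower=True)
        w = np.linalg.solve(D, w)        # D diagonal (entries of modulus 1)
        return solve_triangular(L, w, lower=True, trans='T')
    M = LinearOperator(K.shape, matvec=apply_prec)
    y, info = gmres(K, c, M=M, restart=restart)
    return y, info
```

With an exact factorization the preconditioned matrix is the identity and GMRES converges in one iteration; the incomplete signed factors trade iterations for much sparser L.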
7. Concluding remarks. In this paper, we have used numerical experiments to study the solution of rank-deficient sparse linear least-squares problems. These are hard problems. In particular, the lack of positive definiteness means that the standard approach of applying a Cholesky solver to the normal equations fails. Our approach is to use existing software to compute the factors of a regularized system and then to employ these factors as a preconditioner with an iterative method to recover the solution of the original problem. We have explored using state-of-the-art parallel sparse direct solvers to compute a complete factorization as well as recent approaches to computing limited-memory incomplete factorizations. Regularization allows a Cholesky-based direct solver to be used to factorize the scaled and shifted normal matrix C, avoiding the need for the numerical pivoting that can adversely affect the performance of a sparse indefinite direct solver. However, this requires C to be available. If C cannot be formed or if it is unacceptably dense, the regularized augmented system with the threshold parameter u = 0 offers a feasible alternative approach (see also [49]).
For very large problems it may not be possible to use a direct solver, and so we have also considered limited-memory IC factorizations. Our results show that IC factorizations of the normal equations computed using the package HSL MI35 provide robust preconditioners with significantly sparser factors than those from a complete factorization, but they require a much larger number of LSMR iterations to achieve the requested accuracy. The use of intermediate memory in constructing these IC factors can sometimes significantly enhance the quality of the preconditioner without adding extra fill to the final factors. For the signed IC factorizations of the augmented system, the use of intermediate memory can also be advantageous. In many of our test cases, the total solution time for the signed IC factorization of the augmented system is greater than for the IC factorization of the normal equations, but, again, the former has the advantage of avoiding the construction of the normal matrix. We note that our software allows the user to tune a number of parameters, including not only the choice of ordering and scaling but also the amount of memory used by the IC factorization. These choices and the choice of the regularization parameter can significantly affect the quality of the resulting preconditioner, and it may be necessary to experiment with the different options to obtain the best performance for a given application.
Currently, the codes that compute the IC factorizations and then perform the subsequent forward and backward substitutions needed when using the factors as preconditioners are serial. As much of the total time is taken by the iterative solver, parallel implementations of the application of the preconditioner are needed. This is currently an area of active research [7].
Finally, although this paper has focused on tackling rank-deficient least-squares problems using sparse direct LL^T and LDL^T solvers and their incomplete factorization counterparts, a number of other approaches are available. In particular, a QR factorization of A may be used: either a complete sparse QR factorization, as offered by SuiteSparseQR [9] and qr mumps [5], or an incomplete QR factorization, such as the multilevel incomplete QR (MIQR) factorization of Li and Saad [34]. Moreover, the results reported in [20,21] suggest that the BA-GMRES approach of Morikuni and Hayami [36,37] may offer a feasible alternative.

Fig. 3.1. The effect of the shift α on the number of LSMR iterations (left) and the time (right) for problem Maragal 6. HSL MA87 is used to compute a preconditioner for LSMR.

4.1. Using a direct solver for Ky = c. Consider applying HSL MA97 to the augmented system (1.3). The computed factorization is S K S = (P L) D (P L)^T.

Fig. 4.2. Ratios of the number of entries in the factor (left) and the time (right) for the regularized augmented system solved using HSL MA97 in indefinite mode and the regularized normal equations solved using the positive definite solver HSL MA87.

Fig. 5.1. The effect of increasing the amount of intermediate memory used in the construction of the HSL MI35 IC preconditioner on the size of the shift (left), the number of LSMR iterations (center), and the total time in seconds (right) for problem Maragal 8.

Table 1.2. Statistics for our test set. m, n, and nnz(A) are the row and column counts and the number of nonzeros in A, nullity is the estimated deficiency in the rank, density(A) is the largest ratio of the number of nonzeros in a row of A to n over all rows, density(C) is the ratio of the number of entries in C to n^2, and max |C_ii| and min |C_ii| are the largest and smallest diagonal entries in C. A "−" denotes insufficient memory to compute the statistic.

Table 3.1. Results for solving the least-squares problem using the Cholesky solver HSL MA87 to compute a preconditioner for LSMR. The shift is α = 10^-12. nnz(L_α) denotes the number of entries in the HSL MA87 factor, time is the total solution time (in seconds), and itn is the number of LSMR iterations. The values of the least-squares objective before and after applying LSMR are ||r_α||_2 and ||r||_2, respectively. Results are also given for HSL MA97 run in indefinite mode (with no shift).

Table 4.1. The effects of scaling on the performance of HSL MA97 for solving the augmented system (1.3) (u = 0.01, Metis ordering, and γ = 1). nnz(L) denotes the number of entries in the factor, ndelay is the number of delayed pivots, and time is the total solution time (in seconds). l2(K) and l2(A) denote l2-norm scaling of K and A, respectively. For each problem, the sparsest factors and fastest times are in bold.

Table 4.2. Results for solving the augmented system (1.3) using the direct solver HSL MA97 with threshold parameter u set to 0.01 and 10^-8. nnz(L) denotes the number of entries in the factor, ndelay is the number of delayed pivots, time is the total solution time (in seconds), and the value of the least-squares objective is ||r||_2.

Table 5.1. Results for LSMR with the IC factorization preconditioner from HSL MI35 applied to C = A^T A. lsize and rsize control the memory used by HSL MI35, nnz(L) denotes the number of entries in the factor L, α is the shift, time_f and time_t are the factorization and total solution times (in seconds), the value of the least-squares objective is ||r||_2, and the number of LSMR iterations is itn.

Table 6.1. Results for the signed IC factorization preconditioner HSL MI30 run with GMRES to solve the augmented system. lsize and rsize control the memory used by HSL MI30, nnz(L) denotes the number of entries in the factor L, α2 is the shift for the (2,2) block (in all cases, α1 = 0.0), time_f and time_t are the factorization and total solution times (in seconds), the value of the least-squares objective is ||r||_2, ratio(r) is given by (1.4), and the number of GMRES iterations is itn.

Table 6.2. Results for the LLDL option run with GMRES to solve the augmented system. lsize controls the fill in each column of the factor L, nnz(L) denotes the number of entries in the factor L, time_f and time_t are the factorization and total solution times (in seconds), the value of the least-squares objective is ||r||_2, ratio(r) is given by (1.4), and the number of GMRES iterations is itn. In all cases, α1 = α2 = 1.0.