Solving mixed sparse-dense linear least-squares problems by preconditioned iterative methods

Abstract. The efficient solution of large linear least-squares problems in which the system matrix A contains rows with very different densities is challenging. Previous work has focused on direct methods for problems in which A has a few relatively dense rows. These rows are initially ignored, a factorization of the sparse part is computed using a sparse direct solver, and then the solution is updated to take account of the omitted dense rows. In some practical applications the number of dense rows can be significant, and for very large problems, using a direct solver may not be feasible. We propose processing rows that are identified as dense separately within a conjugate gradient method, using an incomplete factorization preconditioner combined with the factorization of a dense matrix of size equal to the number of dense rows. Numerical experiments on large-scale problems from real applications are used to illustrate the effectiveness of our approach. The results demonstrate that we can efficiently solve problems that could not be solved by a preconditioned conjugate gradient method without exploiting the dense rows.


1. Introduction.
Linear least-squares (LS) problems occur in a wide variety of practical applications, both in their own right and as subproblems of nonlinear LS problems. In this paper, we consider the unconstrained linear LS problem

(1) min_x ‖Ax − b‖_2,

where A ∈ R^{m×n} (m ≥ n) is a large sparse matrix and b ∈ R^m is given. Solving (1) is mathematically equivalent to solving the n × n normal equations

(2) Cx = A^T b, C = A^T A,

where, if A has full column rank, the normal matrix C is symmetric and positive definite. In many cases, the number of entries in the rows of A can vary considerably. That is, some of the rows may be highly sparse while others contain a significant number of entries. The former are referred to as the sparse rows and the latter as the dense rows (although they may contain far fewer than n entries). If a sparse Cholesky or sparse QR factorization of the normal matrix is computed, the dense rows cause catastrophic fill in the factors. Consequently, for large problems, a direct solver may fail because of insufficient memory, and if an incomplete factorization of C is employed as a preconditioner for an iterative solver, the error in the factorization can be so large as to prohibit its effectiveness as a preconditioner. We assume that the rows of the (permuted) system matrix A are split into two parts, with a conformal splitting of the right-hand side vector b, as follows:

(3) A = [A_s; A_d], b = [b_s; b_d] (stacked row blocks), A_s ∈ R^{m_s×n}, A_d ∈ R^{m_d×n},

with m = m_s + m_d, m_s ≥ n, and m_d ≥ 1 (in general, m_s ≫ m_d). Problem (1) then becomes

(4) min_x ‖ [A_s; A_d] x − [b_s; b_d] ‖_2.

We initially assume that A_s has full column rank; rank deficiency of A_s is discussed in section 4. We define C_s = A_s^T A_s to be the reduced normal matrix. There are other motivations for splitting the rows of A.
For example, a set of additional rows, which are not necessarily dense, is obtained by repeatedly adding new data into the LS estimation of parameters in a linear model (see, for example, the early papers [5, 6]), although the history dates back to Gauss and Stewart [17] (see also the historical overview in [15, section 10.5]). Nowadays, there exist important applications based on this motivation related to Kalman filtering or solving recursive LS problems; see the seminal paper [35] or, for a comprehensive introduction, [13, 44]. Also, dense and sparse blocks can arise in linearly constrained and rank-deficient LS problems (see, for instance, [7, 8, 37]). An analogous problem arises with interior point methods when a set of linear constraints Ax = b has some dense columns in A (see, e.g., [24]). In this case, a sequence of matrices of the form AD^2 A^T has to be formed, in which the diagonal matrix D^2 changes at each step [36, 53]; see also the transformation to the positive-definite system in [38], which is affected by both dense rows and columns.
We observe that the need to process additional rows separately from the rest of the matrix is closely related to a number of other numerical linear algebra problems. These include finding sparsity-preserving orderings for a sparse QR decomposition [18]. Another related problem is that of separately processing a set of dense rows when the sparsity structure of A^T A is used to obtain an upper bound on the fill in an LU factorization of A with partial pivoting [21, 22].
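Before reviewing previous work, the objects defined above can be made concrete with a small NumPy sketch. This is a toy dense example (all sizes and names are illustrative, not taken from the paper): it forms the row splitting (3), checks that C = C_s + A_d^T A_d, and confirms that solving the normal equations (2) reproduces the LS solution of (1).

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, m_d = 12, 4, 2               # toy sizes; m_s = m - m_d
A = rng.standard_normal((m, n))
b = rng.standard_normal(m)

# Row splitting (3): "sparse" part A_s on top, "dense" part A_d below.
A_s, A_d = A[:-m_d], A[-m_d:]
b_s, b_d = b[:-m_d], b[-m_d:]

# Normal equations (2): C x = A^T b with C = A^T A symmetric positive definite.
C = A.T @ A
x_normal = np.linalg.solve(C, A.T @ b)
x_ls, *_ = np.linalg.lstsq(A, b, rcond=None)

# Reduced normal matrix C_s = A_s^T A_s, so that C = C_s + A_d^T A_d.
C_s = A_s.T @ A_s
assert np.allclose(C, C_s + A_d.T @ A_d)
assert np.allclose(x_normal, x_ls)
```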
Over the last 35 years or more, a number of papers have addressed the problem of A having a small number of dense rows. In [18], George and Heath propose temporarily discarding such rows to avoid severe fill-in in their Givens rotation-based orthogonal factorization of A_s. They then employ an updating scheme to incorporate the effect of the dense rows (the solution, but not the factorization, is updated). No numerical results are given in [18], but a procedure based on this was included in the package SPARSPAK-B [20]. In [31], Heath considers several other cases, including updating a sparse unconstrained problem of full rank when constraints are added. A completely general updating algorithm for the constrained LS problem based on QR decomposition and direct elimination is given by Björck [8], but without experimental results (see also [9, 48, 49]). Another possibility is to use an implicit transformation of the dense constraints, as proposed in [34]. Other approaches to handling dense rows are based on the Schur complement method; see, for example, [4, 23, 42, 47].
The application of C^{-1} can directly combine the inverse of C_s with an additional update based on A_d. A standard tool for this is the Woodbury formula (see [29, 51, 52] and the comprehensive discussion in [30]). This formula enables the inverse C^{-1} to be written in the form

(5) C^{-1} = (C_s + A_d^T A_d)^{-1} = C_s^{-1} − C_s^{-1} A_d^T (I_{m_d} + A_d C_s^{-1} A_d^T)^{-1} A_d C_s^{-1}.

The LS solution may then be explicitly expressed as

(6) x = C^{-1} A^T b = C^{-1} (A_s^T b_s + A_d^T b_d).

Note that here and elsewhere, for k ≥ 1, I_k denotes the identity matrix of order k.
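A minimal numerical check of the Woodbury formula (5), using dense toy data (illustrative sizes only); note that the update only requires the factorization of the small m_d × m_d capacitance matrix I_{m_d} + A_d C_s^{-1} A_d^T.

```python
import numpy as np

rng = np.random.default_rng(1)
m_s, m_d, n = 10, 2, 4             # toy sizes
A_s = rng.standard_normal((m_s, n))
A_d = rng.standard_normal((m_d, n))

C_s = A_s.T @ A_s
C = C_s + A_d.T @ A_d

# Woodbury formula (5): only the small m_d x m_d matrix S needs a new solve.
Cs_inv = np.linalg.inv(C_s)        # in practice: solves with the factors of C_s
S = np.eye(m_d) + A_d @ Cs_inv @ A_d.T
C_inv = Cs_inv - Cs_inv @ A_d.T @ np.linalg.solve(S, A_d @ Cs_inv)

assert np.allclose(C_inv, np.linalg.inv(C))
```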
While this requires a factorization of C_s to be computed explicitly, Peters and Wilkinson [41] avoided this by computing a factorization PA = LU, with row permutations P chosen to keep L well-conditioned, and then forming the well-conditioned matrix C_L = L^T L and factorizing it. The original paper was solely for dense A (with A ill-conditioned); Björck and Duff [10] subsequently extended the idea to the sparse case and also proposed a general updating scheme for modifying the solution after the addition of a few (possibly dense) rows. As this approach is dependent upon the computation of an LU factorization of A and also needs a factorization of C_L (which can be denser than the normal matrix C), it is generally more expensive than working directly with the normal equations. Sautter [43] observed that for a slightly overdetermined system (m − n < n), an algebraic reformulation that splits the trapezoidal L into a triangular part and a rectangular part can be advantageous in terms of the resulting flop count; see [9, subsection 2.5.1].
An alternative approach for dealing with a small number of dense rows uses the idea of stretching. Stretching aims to split the rows of A_d to obtain a matrix A_δ that has extra rows such that the corresponding LS problem has the same solution but the associated normal matrix is not dense. Matrix stretching was originally developed by Grcar [28] (see also [3, 14, 50]). The use of matrix stretching combined with a direct QR decomposition-based solver for LS problems is described by Adlers and Björck [2] (see also Adlers [1]).
In this paper, our focus is on using an iterative solver. We propose a preconditioner that is based on an incomplete Cholesky (IC) factorization of C_s and incorporates the effect of A_d using the factorization of a dense matrix of order m_d. The computational scheme is incorporated within the preconditioned CGLS method that was described in the seminal paper by Hestenes and Stiefel [32] (see also the CGLS1 variant from [11]). CGLS is an extension of the conjugate gradient (CG) method to LS problems; it is mathematically equivalent to applying CG to the normal equations, but avoids actually forming them.
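For reference, a bare-bones, unpreconditioned CGLS iteration can be sketched as follows. This is a textbook variant written for clarity, not the authors' implementation; the names, tolerances, and test data are illustrative.

```python
import numpy as np

def cgls(A, b, x0, tol=1e-12, maxit=500):
    """Plain (unpreconditioned) CGLS: CG applied to the normal equations
    A^T A x = A^T b, written so that A^T A is never formed explicitly."""
    x = x0.copy()
    r = b - A @ x          # residual of the LS problem
    s = A.T @ r            # residual of the normal equations
    p = s.copy()
    gamma = s @ s
    for _ in range(maxit):
        q = A @ p
        alpha = gamma / (q @ q)
        x += alpha * p
        r -= alpha * q
        s = A.T @ r
        gamma_new = s @ s
        if np.sqrt(gamma_new) < tol:
            break
        p = s + (gamma_new / gamma) * p
        gamma = gamma_new
    return x

rng = np.random.default_rng(2)
A = rng.standard_normal((30, 6))
b = rng.standard_normal(30)
x = cgls(A, b, np.zeros(6))
x_ref, *_ = np.linalg.lstsq(A, b, rcond=None)
assert np.allclose(x, x_ref, atol=1e-8)
```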
The rest of the paper is organized as follows. Section 2 introduces our approach to preconditioning within CGLS, which exploits the row splitting (3) of A and combines an IC factorization of C_s with a factorization involving A_d. In section 3, numerical experiments using large-scale problems from practical applications demonstrate the efficiency and robustness of the proposed approach. Section 4 discusses the case, which often occurs in practice, of A_s having null columns. Finally, section 5 presents some concluding remarks and possible future directions.

2. Iterative solution based on sparse-dense splitting.
2.1. The use of sparse-dense splitting. We start our discussion of solving the linear LS problem using the splitting (3) with a slight extension of the result from Sautter [43] (see also [8, 9]).
Proof. The identity follows directly from the Woodbury formula (5).

As discussed in the introduction, a standard strategy is to use a direct solver to compute a factorization of C_s, to solve the problem

(7) min_x ‖A_s x − b_s‖_2,

and to then incorporate the effect of A_d using an updating scheme. For large problems, or if C_s is not sufficiently sparse, this may not be practical. Instead, we may have only an approximate solution x̃ of (7). Let us consider the solution x of (4) as the sum of an approximate solution x̃ of (7) and a correction y. The following theorem shows the form of the correction.
Theorem 2.2. Assume that x̃ is an approximate solution to (7), and let r_s = b_s − A_s x̃ and r_d = b_d − A_d x̃. Then the LS solution of (4) is equal to x = x̃ + y, where

y = C^{-1} (A_s^T r_s + A_d^T r_d),

with C^{-1} given by (5).

Proof. The proof follows from straightforward verification with an application of Lemma 2.1. Adding the last formula to the approximation x̃, we get the result.
To employ Theorem 2.2, a factorization of the reduced normal matrix

(8) C_s = L_s L_s^T

must be available (so that C_s^{-1} can be applied by solving with the factors L_s and L_s^T). We can use the factors to consider the transformed problem

(9) min_z ‖ [B_s; B_d] z − [b_s; b_d] ‖_2, B_s = A_s L_s^{-T}, B_d = A_d L_s^{-T}, z = L_s^T x.

The following lemma shows that Theorem 2.2 can be simplified provided the factorization (8) is available and we have the exact solution of the transformed LS problem min_z ‖B_s z − b_s‖_2, as assumed in the description of update strategies [8].
Lemma 2.3. The LS solution of problem (9) can be written as z = x̃_1 + y_1, where x̃_1 is an approximate solution to (9) and

(10) y_1 = B_s^T ρ_s + B_d^T (I_{m_d} + B_d B_d^T)^{-1} (ρ_d − B_d B_s^T ρ_s), ρ_s = b_s − B_s x̃_1, ρ_d = b_d − B_d x̃_1.

Proof. Equation (10) follows directly from Theorem 2.2 using B_s^T B_s = I_n.

Formula (10) in Lemma 2.3 offers a way to compute an approximate solution to (9). But there is a potential problem: in general, B_s = A_s L_s^{-T} is dense, and so if m_s is large, it is not normally possible to store B_s explicitly. Thus, if (10) is used, t = B_s^T ρ_s should be computed by solving the triangular system L_s t = A_s^T ρ_s. To derive a practical procedure from (10) that we can use as a preconditioner, we need to consider two important points. First, there is the issue of obtaining an initial approximate solution x̃_1; this is discussed in the following remark.
Remark 2.4. Theorem 2.2 and Lemma 2.3 depend on a given approximate solution x̃_1 to problem (9). There are two basic possibilities for choosing x̃_1. If the sparse part of (9) is important for the solution and B_d represents only a few dense rows or a problem update, x̃_1 can be chosen as an approximate solution of the problem

(11) min_z ‖B_s z − b_s‖_2,

which (assuming A_s is overdetermined and of full rank) is given by

(12) x̃_1 = B_s^T b_s.

However, if B_d represents a significant part of the problem and its effect dominates that of the sparse part, x̃_1 can be chosen as an approximate solution to min_z ‖B_d z − b_d‖_2. If we make no assumptions on the dimensions and the rank of B_d, the approximate solution can be expressed using the pseudoinverse as x ≈ x̃_1 ≡ B_d^† b_d. In our numerical experiments, we use only the approximation based on (11).
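The point made after Lemma 2.3 — that B_s = A_s L_s^{-T} need never be formed, since t = B_s^T ρ_s is obtained from the triangular solve L_s t = A_s^T ρ_s — can be checked on toy dense data. In this sketch a complete Cholesky factor stands in for the incomplete one; all names and sizes are illustrative.

```python
import numpy as np
from scipy.linalg import cholesky, solve_triangular

rng = np.random.default_rng(3)
m_s, n = 10, 4                      # toy sizes
A_s = rng.standard_normal((m_s, n))
rho_s = rng.standard_normal(m_s)

# A complete Cholesky factorization C_s = L_s L_s^T stands in for (8).
L_s = cholesky(A_s.T @ A_s, lower=True)

# Impractical route for large m_s: form B_s = A_s L_s^{-T} explicitly.
B_s = solve_triangular(L_s, A_s.T, lower=True).T
t_explicit = B_s.T @ rho_s

# Practical route: one triangular solve, L_s t = A_s^T rho_s.
t = solve_triangular(L_s, A_s.T @ rho_s, lower=True)

assert np.allclose(t, t_explicit)
```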
The second important point is the cost of the preconditioner, which directly relates to the problem of choosing x̃_1. A motivation for our choice is formulated in the following lemma.

Lemma 2.5. If x̃_1 is given by (12), then B_s^T ρ_s = 0 and (10) reduces to

(13) y_1 = B_d^T (I_{m_d} + B_d B_d^T)^{-1} ρ_d.
Clearly, evaluating (13) is significantly more efficient than evaluating (10). Both involve applying (I_{m_d} + B_d B_d^T)^{-1} to a vector g, where for (10) g = ρ_d − B_d B_s^T ρ_s and for (13) g = ρ_d. This is equivalent to finding the first n components of the minimum norm solution h of the "fat" system

(14) F h = g, F = [B_d  I_{m_d}] ∈ R^{m_d×(n+m_d)}.

There are several ways to approach this. One possibility is to compute the Cholesky factorization

(15) I_{m_d} + B_d B_d^T = L_d L_d^T

and obtain the result of (14) as the first n components of F^T (L_d L_d^T)^{-1} g. This approach, applied as a direct method, can be found in [9]. An alternative is to use an LQ factorization (see, for example, [9, 19, 39])

(16) F = [B_d  I_{m_d}] = [L_d  0_{m_d}] Q_d^T.

Then (14) can be evaluated by solving L_d v = g and taking the first n components of h = Q_d [v; 0]. For large problems, it may not be possible to compute a complete factorization of the reduced normal matrix; instead, only an IC factorization of the form C_s ≈ L̃_s L̃_s^T may be available. Similarly, with B̃_d = A_d L̃_s^{-T}, (15) and (16) may be replaced by the incomplete factorizations

(17) I_{m_d} + B̃_d B̃_d^T ≈ L̃_d L̃_d^T and [B̃_d  I_{m_d}] ≈ [L̃_d  0_{m_d}] Q̃_d^T.
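Both evaluation routes can be verified on a small dense example. The sketch below (illustrative data; complete factorizations stand in for the incomplete ones) computes B_d^T (I_{m_d} + B_d B_d^T)^{-1} g once via the Cholesky route and once via the LQ route, obtaining the LQ factorization of F = [B_d  I_{m_d}] from a reduced QR factorization of F^T.

```python
import numpy as np

rng = np.random.default_rng(4)
m_d, n = 3, 6                       # toy sizes
B_d = rng.standard_normal((m_d, n))
g = rng.standard_normal(m_d)

# Target: the first n components of the minimum norm solution of
# [B_d  I] h = g, which equal B_d^T (I + B_d B_d^T)^{-1} g.
ref = B_d.T @ np.linalg.solve(np.eye(m_d) + B_d @ B_d.T, g)

F = np.hstack([B_d, np.eye(m_d)])

# Route 1 (Cholesky): factorize the small m_d x m_d matrix I + B_d B_d^T.
L_d = np.linalg.cholesky(np.eye(m_d) + B_d @ B_d.T)
via_chol = (F.T @ np.linalg.solve(L_d @ L_d.T, g))[:n]

# Route 2 (LQ): F = L Q^T, obtained here from a reduced QR of F^T.
Q, R = np.linalg.qr(F.T)            # F^T = Q R, so F = R^T Q^T, L = R^T lower tri.
h = Q @ np.linalg.solve(R.T, g)     # minimum norm solution h = Q R^{-T} g
via_lq = h[:n]

assert np.allclose(via_chol, ref)
assert np.allclose(via_lq, ref)
```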

2.2. Preconditioned iterative method.
We now propose incorporating preconditioning into the solution of the mixed sparse-dense linear LS problem (4) by the conjugate gradient method (CGLS). This combines an incomplete factorization of the reduced normal matrix C_s with a factorization of the transformed dense part (17). Because our preconditioners exploit the row splitting, we use a conformal splitting of each of the vectors of length m within the CGLS algorithm; in particular, the residual is explicitly split into r_s and r_d.

Algorithm 2.7 presents the preconditioning algorithm. Note that the preconditioner is applied to r_s and r_d, which play the role of the right-hand sides b_s and b_d in the formulae derived in the previous section. In addition, we use w_s = A_s^T r_s, computed in step 7 of Algorithm 2.6; this is needed to compute x̃_1 (which in turn is used to compute ρ_d). Also note that B̃_d is not computed explicitly; rather, products with B̃_d and B̃_d^T are applied via triangular solves with L̃_s and multiplications with A_d. We present two possible modes: the mode Cholesky corresponds to (13), and the mode LQ uses an approximate LQ factorization based on (16).
Solve L̃_d u_1 = ρ_d for u_1

Solve L̃_d^T u = u_1 for u
else if mode == LQ then
Solve L̃_s t = A_s^T ρ_s for t
Solve L̃_s^T z = (x̃_1 + u) for z

3. Numerical experiments. We present numerical results to illustrate the potential of our proposed approach for handling dense rows of A. We use Algorithm 2.6 with the Cholesky mode employed in the preconditioning step (Algorithm 2.7). Any sparsity in the dense part A_d is ignored, and the Cholesky factorization of I_{m_d} + B̃_d B̃_d^T uses the LAPACK routine potrf. To perform the IC factorization A_s^T A_s ≈ L̃_s L̃_s^T, we use the package HSL_MI35 from the HSL mathematical software library [33]. HSL_MI35 implements a limited memory IC algorithm (see [45, 46] for details); note that it handles ordering for sparsity and scaling. In our tests, we use the default settings. HSL_MI35 requires the user to set the parameters lsize and rsize, which respectively control the number of entries in each column of the IC factor and the memory used to compute the factorization. In general, increasing these parameters improves the quality of the preconditioner (so that the number of iterations of the preconditioned iterative solver is reduced) at the cost of more time and memory to compute the factorization. In our experiments, unless stated otherwise, we use lsize = rsize = 5, so that the number of entries in L̃_s is at most 5n. With the exception of PDE1, all the examples are part of the University of Florida Sparse Matrix Collection [12]. Problem PDE1 is taken from the CUTEst linear programming set [25]. Note that the University of Florida examples used here can also be found in the CUTEst set, but with slightly different identifiers. All the test problems were included in the study by Gould and Scott [26, 27].

3.1. Test environment. Our test set is described in Table 1.
We scale A by normalizing each of its columns. That is, we replace A by AD, where D is the diagonal matrix with entries D_ii = 1/‖Ae_i‖_2 (e_i denotes the ith unit vector). The entries of AD are then all at most one in absolute value.
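The column scaling can be sketched in a couple of lines (toy dense data; the scaling itself is exactly as described above):

```python
import numpy as np

rng = np.random.default_rng(5)
# Toy matrix with wildly different column scales.
A = rng.standard_normal((8, 3)) * np.array([1.0, 1e3, 1e-2])

# D_ii = 1 / ||A e_i||_2, so every column of AD has unit 2-norm.
d = 1.0 / np.linalg.norm(A, axis=0)
AD = A * d                          # equivalent to A @ np.diag(d)

assert np.allclose(np.linalg.norm(AD, axis=0), 1.0)
assert np.max(np.abs(AD)) <= 1.0    # entries bounded by the column norms
```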
In our experiments, we define a row of A to be dense if its number of entries either exceeds 100 times the average number of entries per row or is more than 4 times the maximum number of entries in a row of the sparse part A_s. For most of our test cases, this choice is not critical, although for the Mittelmann/neos examples we found the results can be improved by relaxing these conditions so that fewer rows are classified as dense. Removing dense rows can leave A_s rank-deficient. In section 4, we discuss how a practical procedure can be developed to deal with this case. However, here, for simplicity, we modify the problem by removing any columns of A that correspond to null columns of A_s.
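A possible implementation of this classification rule is sketched below. Note that this is our own reading of the rule (the interplay between the two thresholds and the evolving sparse part is not fully specified in the text), so the function and its parameters are hypothetical, not the authors' code.

```python
import numpy as np

def split_dense_rows(counts, avg_mult=100, max_mult=4):
    """Hypothetical sketch of the rule in the text: a row is classified as
    dense if its entry count exceeds avg_mult times the average row count,
    or max_mult times the largest count among the rows that remain in the
    sparse part. Rows are examined densest first."""
    counts = np.asarray(counts)
    order = np.argsort(counts)          # ascending: sparsest rows first
    avg = counts.mean()
    dense = []
    for k in range(len(counts) - 1, -1, -1):
        i = order[k]
        max_sparse = counts[order[k - 1]] if k > 0 else 0
        if counts[i] > avg_mult * avg or counts[i] > max_mult * max_sparse:
            dense.append(i)
        else:
            break                       # every remaining row is at least as sparse
    return sorted(dense)

# Toy example: ten rows with 2 entries each plus one nearly full row of 40.
counts = [2] * 10 + [40]
assert split_dense_rows(counts) == [10]
```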
To terminate the preconditioned CGLS algorithm, we employ the stopping rules C1 and C2 taken from [27], where r = b − Ax is the residual, r^(0) = b − Ax^(0) is the initial residual, and δ_1 and δ_2 are convergence tolerances that we set to 10^{-8} and 10^{-6}, respectively. In all our experiments, b is taken to be the vector of all 1's and the initial solution guess is x^(0) = 0, so that r^(0) = b and C2 simplifies accordingly.

3.2. Summary results for our test set.
In Table 2, we summarize our findings for the complete test set, with and without exploiting dense rows. Without exploiting dense rows, we are unable to solve many of the largest problems, either because the IC factorization time exceeds 1000 seconds or because convergence is not achieved within 2000 CGLS iterations. When dense rows are exploited, along with the storage for the incomplete factors, we need to store the m_d × n entries of A_d together with the m_d(m_d + 1)/2 entries of the dense Cholesky factor of I_{m_d} + B̃_d B̃_d^T. We see that our preconditioning strategy that exploits dense rows significantly reduces the iteration counts and computation times. Furthermore, it is able to solve problems that HSL_MI35 applied to the whole of A could not. A more detailed look at the behavior of our strategy on some of our test problems is now presented.

3.3. Results for Meszaros/scsd8-2r.
In the following, we look at (i) the number nnz(C_s) of entries in the reduced normal matrix (lower triangle only), (ii) the CGLS iteration counts, and (iii) the ratio (denoted by ratio) of the number of entries in the matrices that are factorized to the number size_ps of entries in the preconditioner. The first example that we consider is problem Meszaros/scsd8-2r. Figure 1 illustrates the size of the reduced normal matrix C_s as the number of rows that are classified as dense increases. Then, in Figures 2 and 3, we present the CGLS iteration counts, ratio, and the times to compute the preconditioner and run CGLS. As expected, increasing the number m_d of rows that are classified as dense from 0 to the value 50 that was specified in Table 2 significantly reduces the size of C_s; ratio also decreases. We also see that the iteration counts can decrease before m_d reaches 50. This demonstrates not only that dense rows lead to high memory demands, but also that their presence can reduce the quality of the IC factorization. The timings reflect the same general dependence on m_d. The fluctuations in the times needed to compute the preconditioner given in the left-hand plot of Figure 3 are a result of the IC factorization code performing a number of restarts. If the IC factorization breaks down (which happens if a zero or negative pivot is encountered), HSL_MI35 introduces a positive shift α and restarts the factorization with C_s replaced by C_s + αI. Since an appropriate choice of α is not known a priori, the factorization may need to restart more than once, with α increased on each occasion (HSL_MI35 takes care of this automatically), and this is reflected in the computation time. To investigate whether the problem can be solved more efficiently by applying a higher quality preconditioner, Figures 4 and 5 report results for HSL_MI35 with the parameters lsize and rsize, which control the number of entries in the incomplete factor L̃_s, increased from 5 to 20. We see that the qualitative behavior remains the same and is consistent with our assumptions.

3.4. Results for Mittelmann/stormg2 1000. Next, we consider the large example Mittelmann/stormg2 1000; results are given in Figures 6-8. Again, the time for computing the preconditioner is influenced by the number of restarts needed during the incomplete factorization, and this is not a smooth function of the number of dense rows. The CGLS time drops dramatically when all 121 dense rows reported in Table 2 are exploited.

4. Dealing with null columns.
As already observed, even if A has full column rank, A_s may not have full column rank. Indeed, in practice, A_s can contain null columns. In this section, we explain how this can be overcome. Let A have full rank, and assume A_s has n_2 null columns, with n_2 ≪ n. Assuming these columns are permuted to the end, we have the splitting

(18) A = [A_1  A_2], A_1 ∈ R^{m×n_1}, A_2 ∈ R^{m×n_2}, n_1 + n_2 = n,

where the columns of A_2 are those whose entries in A_s are all zero. The following result shows that the solution of the LS problem (1) can be expressed as a combination of partial solutions.

Proof. The normal equations for (1), written in block form consistent with (18), are

(19) [A_1^T A_1  A_1^T A_2; A_2^T A_1  A_2^T A_2] [x_1; x_2] = [A_1^T b; A_2^T b].

Premultiplying the first block row of (19) by (A_1^T A_1)^{-1} gives the result.

Theorem 4.1 suggests a practical procedure for solving LS problems with some dense rows: compute the partial solutions z and W corresponding to the nonnull column block A_1 using Algorithm 2.6 with n_2 + 1 right-hand sides, and then use the result of Theorem 4.1. This requires forming and then solving with the n_2 × n_2 matrix A_2^T (A_2 − A_1 W). Provided n_2 is small, this is inexpensive.
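The procedure suggested by Theorem 4.1 can be checked numerically. In the sketch below (dense toy data, illustrative names; np.linalg.lstsq stands in for Algorithm 2.6), z and W are computed in a single LS solve with n_2 + 1 right-hand sides, and the full solution is then recovered from the small n_2 × n_2 system.

```python
import numpy as np

rng = np.random.default_rng(7)
m, n1, n2 = 20, 5, 2                # toy sizes
A1 = rng.standard_normal((m, n1))
A2 = rng.standard_normal((m, n2))
b = rng.standard_normal(m)
A = np.hstack([A1, A2])

# Partial solutions: z = argmin ||A1 z - b||_2 and W = argmin ||A1 W - A2||_F,
# obtained together as one LS solve with n2 + 1 right-hand sides.
zW, *_ = np.linalg.lstsq(A1, np.column_stack([b, A2]), rcond=None)
z, W = zW[:, 0], zW[:, 1:]

# Recover x2 from A2^T (A2 - A1 W) x2 = A2^T (b - A1 z), then x1 = z - W x2.
x2 = np.linalg.solve(A2.T @ (A2 - A1 @ W), A2.T @ (b - A1 @ z))
x1 = z - W @ x2

x_ref, *_ = np.linalg.lstsq(A, b, rcond=None)
assert np.allclose(np.concatenate([x1, x2]), x_ref)
```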

5. Concluding remarks.
We have looked at the problem of solving linear LS problems in which the system matrix A has a number of dense rows. We have proposed an approach that processes the dense rows separately within a CG method, using an incomplete factorization preconditioner for the sparse rows and a complete Cholesky factorization of the small dense subproblem arising from the dense rows. We note that other iterative methods, including LSQR [40] and LSMR [16], could be employed; likewise, other preconditioners could be used. The reported numerical experiments on practical problems, each having between 1 and 121 dense rows, demonstrate the potential advantages of our proposed approach. In particular, we efficiently solved large problems that we were not able to solve using preconditioned conjugate gradients without exploiting the dense rows. As expected, it is important for efficiency to exploit all the dense rows (that is, to move them all into A_d). We remark that in our experiments we found that the strategy based upon (13) gave better results than the more complex scheme (10). We conjecture that this conclusion would be different if the preconditioner were based on a perturbed direct solver rather than on an incomplete factorization; this requires investigation that is beyond the scope of this paper.
There remain other issues that need to be addressed before we have fully reliable and robust preconditioned iterative solvers for general LS problems. In particular, more work needs to be done to address rank deficiency. In the future, we plan to look at other possible splittings and transformations of A, based on the problem structure, to further improve the preconditioner quality.
Finally, we remark that Avron, Ng, and Toledo [7] look at using a QR factorization of A to solve linear LS problems. If A has a few rows that are identified as dense, they recommend that these rows be dropped before the QR factorization starts. They use the resulting R factor as a preconditioner for LSQR and show that if m_d dense rows are dropped, then LSQR is expected to converge in at most m_d + 1 iterations. We experimented with simply dropping the dense rows (that is, we discarded A_d and just used the IC factorization of C_s), but we found that in general the incomplete factorization of C_s gave very poor convergence, confirming the importance of incorporating the dense rows within the preconditioned iterative solver.

Lemma 2.1. Assume A and A_s have full column rank. Then, with C_s = A_s^T A_s and C_d = A_d^T A_d, the following identity holds:

(C_s + C_d)^{-1} A_d^T = C_s^{-1} A_d^T (I_{m_d} + A_d C_s^{-1} A_d^T)^{-1}.

Algorithm 2.6. Preconditioned CGLS algorithm.
Input: A ∈ R^{m×n}, with m ≥ n and of full column rank, with its rows split into two parts A_s and A_d, where A_s ∈ R^{m_s×n} with m_s ≥ n and of full column rank; an incomplete factorization A_s^T A_s ≈ L̃_s L̃_s^T; a right-hand side vector b ∈ R^m split into b_s and b_d conformally with the splitting of A; the initial solution x^(0) ∈ R^n; the preconditioning operation (z = M^{-1} s) given in Algorithm 2.7; the maximum number of iterations nmax.
Output: The computed solution x.

Algorithm 2.7. Preconditioning procedure.
Input: Residual vector components r_s and r_d; transformed residual component w; an incomplete factorization C_s ≈ L̃_s L̃_s^T with B̃_d = A_d L̃_s^{-T}; the chosen mode (Cholesky or LQ). The Cholesky mode needs a (possibly incomplete) Cholesky factorization I_{m_d} + B̃_d B̃_d^T ≈ L̃_d L̃_d^T; the LQ mode needs a (possibly incomplete) factorization [B̃_d  I_{m_d}] ≈ [L̃_d  0_{m_d}] Q̃_d^T.
Output: The computed z = M^{-1} w.
1. Solve L̃_s x̃_1 = w_s for x̃_1

Fig. 1. The number of entries in C_s for problem Meszaros/scsd8-2r as the number m_d of dense rows that are exploited increases.

Fig. 2. Problem Meszaros/scsd8-2r. Iteration counts (left) and ratio (right) as the number m_d of dense rows that are exploited increases.

Fig. 3. Problem Meszaros/scsd8-2r. Time to compute the preconditioner (left) and time for CGLS (right) as the number m_d of dense rows that are exploited increases.

Fig. 5. Problem Meszaros/scsd8-2r with lsize = rsize = 20. Time to compute the preconditioner (left) and time for CGLS (right) as the number m_d of dense rows that are exploited increases.

Fig. 6. The number of entries in C_s for problem Mittelmann/stormg2 1000 as the number m_d of dense rows that are exploited increases.

Fig. 8. Problem Mittelmann/stormg2 1000. Time to compute the preconditioner (left) and time for CGLS (right) as the number m_d of dense rows that are exploited increases.

Theorem 4.1. Let z ∈ R^{n_1} and W ∈ R^{n_1×n_2} be the solutions to the problems min_z ‖A_1 z − b‖_2 and min_W ‖A_1 W − A_2‖_F, respectively. Then the solution x = [x_1; x_2] of problem (1), with its splitting consistent with (18), satisfies

x_1 = z − W x_2, A_2^T (A_2 − A_1 W) x_2 = A_2^T (b − A_1 z).

Table 1
Statistics for our test set. m, n, and nnz(A) are the row and column counts and the number of entries in A; nnz(C) is the number of entries in the lower triangle of C = A^T A, and density(C) is the ratio of the number of entries in C to n^2.
Our experiments are performed on an Intel(R) Core(TM) i5-4590 CPU running at 3.30 GHz with 12 GB of internal memory. All software is written in Fortran, and the Visual Fortran Intel(R) 64 XE compiler (version 14.0.3.202) was used. Reported timings are elapsed times in seconds.

Table 2
Results with and without exploiting dense rows. size_p and size_ps denote the number of entries in the incomplete factors L̃ and L̃_s of C and C_s, respectively. Its is the number of CGLS iterations; T_p and T_its denote the times (in seconds) to compute the preconditioner and for convergence of CGLS, respectively. † indicates that the time to compute the preconditioner exceeds 1000 seconds; ‡ denotes that convergence of CGLS was not achieved.