Shifted CholeskyQR for computing the QR factorization of ill-conditioned matrices

The Cholesky QR algorithm is an efficient communication-minimizing algorithm for computing the QR factorization of a tall-skinny matrix. Unfortunately it has the inherent numerical instability and breakdown when the matrix is ill-conditioned. A recent work establishes that the instability can be cured by repeating the algorithm twice (called CholeskyQR2). However, the applicability of CholeskyQR2 is still limited by the requirement that the Cholesky factorization of the Gram matrix runs to completion, which means it does not always work for matrices $X$ with $\kappa_2(X)\gtrsim {{\bf u}}^{-\frac{1}{2}}$ where ${{\bf u}}$ is the unit roundoff. In this work we extend the applicability to $\kappa_2(X)=\mathcal{O}({\bf u}^{-1})$ by introducing a shift to the computed Gram matrix so as to guarantee the Cholesky factorization $R^TR= A^TA+sI$ succeeds numerically. We show that the computed $AR^{-1}$ has reduced condition number $\leq {{\bf u}}^{-\frac{1}{2}}$, for which CholeskyQR2 safely computes the QR factorization, yielding a computed $Q$ of orthogonality $\|Q^TQ-I\|_2$ and residual $\|A-QR\|_F/\|A\|_F$ both $\mathcal{O}({{\bf u}})$. Thus we obtain the required QR factorization by essentially running Cholesky QR thrice. We extensively analyze the resulting algorithm shiftedCholeskyQR to reveal its excellent numerical stability. shiftedCholeskyQR is also highly parallelizable, and applicable and effective also when working in an oblique inner product space. We illustrate our findings through experiments, in which we achieve significant (up to x40) speedup over alternative methods.

1. Introduction. Computing the QR factorization X = QR is required in various applications in scientific computing. The Cholesky QR algorithm computes the factorization by: where chol(A) denotes the Cholesky factor of A. Cholesky QR is a communicationavoiding algorithm whose communication cost is equivalent to that of the TSQR algorithm, which has been devised specifically to reduce communication for the QR factorization of tall-skinny matrices [3]. Cholesky QR has the advantage over TSQR that its arithmetic cost is about half and that its reduction operator is addition, while that of TSQR is a QR factorization of a small matrix [4]. As a result, Cholesky QR usually runs faster than TSQR. However, Cholesky QR is rarely used in practice because of its instability: the distance from orthogonality of its computed Q grows rapidly with the condition number of the input matrix. By contrast, TSQR is unconditionally stable.
A recent work by the authors [16] establishes that the instability can be cured significantly by repeating the algorithm twice (called CholeskyQR2). However, the applicability of CholeskyQR2 is still limited by the requirement that the Cholesky factorization of the Gram matrix runs to completion, which means it does not always work for matrices A with κ 2 (X) = O(u − 1 2 ) or larger, where u is the unit roundoff. In this work we extend the applicability of CholeskyQR-based algorithms to κ 2 (X) = O(u −1 ).
The idea is to execute a preconditioning step so that the conditioning is improved to a point where CholeskyQR2 is applicable. How do we find an effective preconditioner? An inspiration to answer this is the fact that Cholesky QR and CholeskyQR2 belong to the category of triangular orthogonalization type algorithms [15,Lecture 10], in contrast to Householder type algorithms, which follows the principle of orthogonal triangularization. We summarize the classification of algorithms in terms of their principle, communication cost, and stability in Figure 1.2, which also clarifies where our contribution (shifted CholeskyQR3) stands. In triangular orthogonalization, one right-multiplies an appropriate upper triangular matrix R so that κ 2 (AR) = 1. Now, what if our goal is merely κ 2 (AR) = O(u −1/2 )? Once we have this, we can safely compute the QR factorization AR = QR 1 using CholeskyQR2, to arrive at the overall QR factorization A = Q(R 1 R −1 ).
Clearly, such R is not unique, and we propose one way of finding such R. Namely, as in Cholesky QR we compute the Gram matrix, but add a small shift sI so as to guarantee the Cholesky factorization R T R = A T A + sI does not break down numerically. We show that under the mild assumption κ 2 (A) ≤ u −1 , the resulting AR −1 (note that R −1 is also triangular) has reduced condition number ≤ u − 1 2 , for which CholeskyQR2 safely computes the QR factorization, yielding computed Q, R with excellent orthogonality Q T Q − I = O(u) and residual A − QR F / A F = O(u), overall a backward stable QR factorization. The algorithm is deceptively simple (essentially the only new ingredient being the introduction of a shift); the analysis is however not trivial. We give detailed analysis that gives the constants hidden in the O(u) notation.
The main message of this paper is that for any matrix with condition number well above u − 1 2 (but bounded by u −1 ), the QR factorization can be computed in a backward stable manner by essentially running Cholesky QR thrice. We refer to this overall algorithm as shiftedCholeskyQR3; Figure 1.2 shows its diagram, and how κ 2 (A) is reduced eventually to 1 through repeated multiplication by triangular matrices. Let us comment on related studies in the literature. A recent work of Yamazaki et al. [18] uses doubled precision arithmetic (e.g., quadruple precision when using double precision) for the first two steps (1.1), (1.2), and shows that the condition number gets reduced by about O(u −1 ), and thus the QR factorization will be obtained by repeating the process. This results in about 8.5 times as many arithmetic operations (per iteration) as does the standard CholeskyQR without doubled precision. Moreover, this approach clearly requires that higher-precision arithmetic is available, which may not always be the case (and even when it is, it often comes with a significant price in speed). shiftedCholeskyQR3 developed in this paper does not require change of arithmetic precision (in our experiments we use only IEEE double precision in which u ≈ 1.1 × 10 −16 ), and requires just one more CholeskyQR iteration than [18], thus is usually much faster.
As a bonus, our development is straightforward to apply to computing the QR factorization in a non-standard inner product space induced by a positive definite matrix B 0, in which (x, y) B = x T By. Available algorithms for this task include (modified) Gram-Schmidt [12], and Cholesky QR [10,14,9]. A recent work by Lowery and Langou [10] studies the numerical stability, with no method apparently being (near) optimal in both orthogonality and backward error. We analyze the stability of shiftedCholeskyQR3 in this case to show it has favorable stability properties. Moreover, shiftedCholeskyQR3 is significantly faster than Gram-Schmidt type algorithms, achieving up to 40-fold speedup in our experiments with sparse B.
This paper is organized as follows. In Section 2 we describe the shiftedCholeskyQR algorithm. Section 3 analyzes shiftedCholeskyQR in detail and shows that it can be used to reduce κ 2 (X). In Section 4 we combine shiftedCholeskyQR and CholeskyQR2 to derive shiftedCholeskyQR3 for ill-conditioned matrices with κ 2 (X) = O(u −1 ) and prove its backward stability. Section 5 discusses the extension to the non-standard inner product space. Numerical experiments are shown to illustrate the results in Section 6 and Section 7 summarizes the performance of our software implementations.
We primarily focus on real matrices A ∈ R m×n , but everything carries over to complex matrices A ∈ C m×n . We assume m ≥ n, and the algorithms developed here are particularly useful in the tall-skinny case m n.

Shifted Cholesky QR.
To overcome the numerical breakdown in the Cholesky factorization (1.2), we propose a simple remedy: introduce a small shift A T A + sI to force the computed Gram matrix A T A to be numerically positive definite, so that its Cholesky factorization runs without breakdown. The rest of the algorithm is the same as Cholesky QR, and the algorithm shiftedCholeskyQR can be summarized in pseudocode in Algorithm 2.1.
Introducing shifts in Cholesky QR was briefly mentioned in [14], also as a remedy for the breakdown. However, the focus there was on the algorithm called SVQB, which computes the SVD of the Gram matrix instead of the Cholesky factorization. SVQB computes a factorization of the form A = QB, where B is a full n × n matrix. While Q is still a basis for the column space of A, an additional QR factorization of B is needed to obtain a complete QR factorization of A. Furthermore, experiments suggest that there are benefits in working with a triangular matrix as in Cholesky QR as opposed to full matrices as in SVQB, wirh respect to the row-wise stability of the computed decomposition.
We shall discuss an appropriate choice of the shift s, which will be O(u), and prove that applying shiftedCholeskyQR to a matrix X with u − 1 2 < κ 2 (X) < u −1 results in a computedQ with much reduced condition number κ 2 (Q) < u − 1 2 , for which CholeskyQR2 suffices to compute the QR factorization.
3. Convergence and stability analysis of shiftedCholeskyQR. In this section we present the main technical analysis of shiftedCholeskyQR. The goal is to show that the algorithm improves the conditioning significantly, that is, κ 2 (Q) κ 2 (X). A few assumptions need to be made on the matrix size and condition number. The constants below are not of significant importance but chosen so that the forthcoming analysis runs smoothly. We shall assume the following hold: Roughly speaking, the first three assumptions require that the condition number κ 2 (X) is safely bounded above by u −1 , and the matrix dimensions m, n are small compared with the precision u −1 . We reiterate that κ 2 (X) > u − 1 2 is allowed, a crucial difference from CholeskyQR2 treated in [16]. The assumption (3.4) imposes that s is large enough for Cholesky to work, but small compared with X 2 2 . Note that by (3.2), (3.3) we have (3.5) γ m := mu 1 − mu ≤ 1.02mu, γ n+1 := (n + 1)u 1 − (n + 1)u ≤ 1.02(n + 1)u.

Preparations.
We denote the computed quantities in shiftedCholeskyQR, accounting for the numerical errors, bŷ are the ith rows of X andQ respectively. E 1 is the matrix-matrix multiplication error in the computation of the Gram matrix X X, and E 2 is theÂ backward error incurred when computing the Cholesky factorization. ∆R i is the backward error involved in the solution of the linear system q iR = x i .
We shall take s so that which means s = c max( E 1 2 , E 2 2 ) for some c > 1. In fact we shall see that E 1 2 , E 2 2 = O(u) X 2 2 , so (3.9) simply means s is chosen to be safely larger than u X 2 2 (qualitatively this is also assumed by (3.8)). In particular, we write (3.10) s := α X 2 2 , 0 < α < 1.  ∆x i and ∆R i . For later use, here we examine the relation between ∆R i and ∆x i . From (3.8) we haveq iR +q i ∆R i = x i . We also have from (3.12)q iR = x i + ∆x i . Combining these we obtain (3.14) ∆x 3.1.1. Error in computing X X. For general matrices A ∈ R m×n , B ∈ R n×m , the error in computing the matrix product C = AB can be bounded by [7,Ch. 3] Here |A| is the matrix whose (i, j) element is |a ij |. In [16] it is shown that E 1 is bounded by Simplifying (3.16) using (3.5) yields E 1 2 ≤ 1.1mnu X 2 2 .
Backward error in the Cholesky factorization. Suppose that A ∈ R n×n is a positive definite matrix and its Cholesky factorization computed in floating-point arithmetic runs to completion and outputsR. Then there exists ∆A ∈ R n×n such that [7,Thm. 10.3] (3.17)R R = A + ∆A, |∆A| ≤ γ n+1 |R ||R| Applying this to our situation gives Using [16, eqn. (3.16)] in the right-hand side gives   We have for 1 ≤ i ≤ m From (3.7), (3.4), and (3.20) we obtain Substituting this into (3.25) gives 3.1.4. Bounding ∆X 2 roughly. Here we give a rough bound for ∆X F and prove that ∆X F = O(u X 2 2 / √ s). This will be insufficient for proving that κ 2 (Q) = O(1/ √ u), for which we will need ∆X F = O(u X 2 ), which we will prove later after having obtained a bound forQ. We shall proceed as follows.
1. Derive the "rough" bound ∆X F = O(u X 2 2 / √ s). 2. Use above to show Q 2 = O(1). 3. Use above and (3.14) to prove the "tight" bound ∆X F = O(u X 2 ). 4. Use above to prove κ 2 (Q) = O(u − 1 2 ). To establish the first statement we recall (3.11), and bound Ȓ i 2 as We bound the denominator from below as Substituting this into (3.28) yields Together with the fact ∆x i ≤ x i Ȓ i 2 , we can bound ∆X F as 3.1.5. Bounding Q 2 . We now proceed to bound Q 2 .
Lemma 3.1. Suppose that X ∈ R m×n with m ≥ n satisfies (3.2) and (3.3). Then, the matrixQ obtained by applying the shiftedCholeskyQR algorithm in floating-point arithmetic to X satisfies Q Q − I 2 < 2, and hence Proof. We havê Thus we can bound Q Q − I 2 as The first term of (3.33) can be bounded as and for the second term in (3.33), using (3.22), (3.23) and (3.31) we obtain For the third term in (3.33), from (3.22) and (3.31) Summarizing, we can bound the right-hand side of (3.33) as Q Q − I 2 < 2, as required.
3.1.6. Bounding the residual in shiftedCholeskyQR. We now bound the residual. Proof. First note that Substituting (3.27) into this gives On the other hand, from (3.32) we have Hence it follows that Lemma 3.2 shows that shiftedCholeskyQR gives optimal residual up to a factor involving a low-degree polynomial of m, n (recall that X F ≤ √ n X 2 ).
Tighter bound for ∆X F . By (3.13) we have ∆X =QR − X, so (3.39) in fact provides a bound for ∆X F that is tighter than the previous bound (3.31):

Main result.
We are now ready to state the main result of the section, which bounds the condition number ofQ.
Proof. Recall from (3.32) that σ 1 (Q) < √ 3. The remaining task is to bound σ n (Q) from below. Using Weyl's theorem in (3.13) gives Using (3.22) and (3.40) we obtain Note that this is O(u 1 2 ) when we take s = O(u) and regard low-degree polynomials in m, n as constants.
We next bound σ n (XR −1 ) from below. We proceed by examining the equation Let X = U ΣV by the SVD. Then for a diagonal matrix G, we can write Indeed the left-hand side is V (Σ 2 + sI)V and the right-hand side V (Σ + G) 2 V , so we can take We next bound the singular values of T . By (3.20) and (3.22) we have where Q = U V has orthonormal columns, and Using the general inequality for singular values of matrix products σ min (AB) ≥ σ min (A)σ min (B) (which holds when A or B is square) along with (3.21), we obtain Using this and (3.43), from (3.42) we obtain We have used the assumption (3.1) for the fourth inequality. Together with (3.32) we obtain Theorem 3.3 implies that, provided that α(κ 2 (X)) 2 1, Thus applying one step of shiftedCholeskyQR results in the condition number being reduced by about a factor √ α = √ s X 2 . 4. shiftedCholeskyQR3: shiftedCholeskyQR + CholeskyQR2. We now discuss an algorithm for the QR factorization of an ill-conditioned matrix that first uses shiftedCholeskyQR, then runs CholeskyQR2. We refer to this algorithm as shift-edCholeskyQR3, since it runs Cholesky QR (or its shifted variant) three times. The initial shiftedCholeskyQR can thus be regarded as a preconditioning step that reduces the condition number so that CholeskyQR2 becomes applicable. We continue to assume (3.1)-(3.3); a particular case of interest is u − 1 2 < κ 2 (X) < u −1 .

4.1.
Choice of shift s. We first discuss the choice of the shift s for shiftedC-holeskyQR, which balances two requirements: • s should be as small as possible to maximize the condition number improvement (3.41). • s should be large enough so that the Cholesky factorization chol(A + sI) runs to completion without numerically breaking down. In addition to these, s must satisfy Eq. (3.4) for the error analysis in the previous section to be valid.
To address the issue of breakdown we review Rump and Ogita's [13] error analysis for Cholesky factorization, which builds upon Demmel's early work [2].
Let A be a symmetric positive definite matrix. It is known [13, Thm. 2.3] that chol(A) succeeds numerically if the following holds: It has been shown in [19] that a sufficient condition for (4.1) to hold is In our context of the Cholesky factorization (3.7), we need to take into account the error term E 1 in computing the matrix multiplication A = X X. By (3.6) and Weyl's theorem we obtain a lower bound This means thatÂ may not be positive definite if λ n (X X) ≤ γ m n X 2 2 . Accordingly, to apply formula (4.2), we must first shiftÂ by γ m n X 2 2 . Thus we conclude that a safe choice of s to avoid numerical breakdown is (4.4)s := γ m+1 n X 2 2 + c n+2 utr(Â + γ m+1 n X 2 2 I).
This expression can be simplified by evaluatings using the results in the previous section. First, we note thatÂ = X X + E 1 from (3.6). Denoting the jth column vector of X byx j , we can evaluate tr(E 1 ) as On the other hand, Plugging these into the right-hand side of (4.4), we can bounds as s ≤ γ m+1 n X 2 2 + 2.2(n + 1)u(n X 2 2 + γ m n X 2 2 + γ m+1 n X 2 2 ) ≤ 2.4{mn + n(n + 1)}u X 2 2 , (4. 8) which shows that the maximum in (4.5) is attained by the first argument. Hence, in what follows we shall assume (4.9) s := 11{mn + n(n + 1)}u X 2 2 .
In practice, since X 2 is expensive to compute we can estimate it reliably using a norm estimator e.g. MATLAB's function normest, or alternatively just replace it with X F , which results in a larger (more conservative) shift. We now summarize our algorithm in pseudocode. Algorithm 4.1 blends shiftedC-holeskyQR and CholeskyQR2 in an adaptive manner, initially attempting to reduce the condition number using shiftedCholeskyQR as in (3.55) so that CholeskyQR(2) becomes applicable. The shifts are introduced only when necessary, judged by whether or not the Cholesky factorization chol(A (k) ) breaks down. Our experiments suggest that Algorithm 4.1 may well be applicable even to extremely ill-conditioned matrices with possibly κ 2 (X) > u −1 .
In the analysis below, we focus on the case κ 2 (X) < u −1 ; to be precise, when u − 1 2 < κ 2 (X) (so that CholeskyQR2 is inapplicable) and κ 2 (X) is bounded from above by (4.12) given below. For such matrices, the algorithm provenly runs one shift-edCholeskyQR, then CholeskyQR2 (thus executing three Cholesky factorizations). This computes a stable QR factorization, and we refer to this algorithm as shiftedC-holeskyQR3. For completeness we present its pseudocode in Algorithm 4.2. The last two lines represent CholeskyQR2 for the Q obtained by the first shiftedCholeskyQR. Here we derive a condition on κ 2 (X) that guarantees that shiftedCholeskyQR3 gives a numerically stable QR factorization of X.
Recall that (3.54) gives a bound for the condition number ofQ obtained by shift-edCholeskyQR: κ 2 (Q) ≤ 2 1 + α(κ 2 (X)) 2 · √ 3. As in (4.9), to guarantee avoidance of breakdown we take α = 11{mn + n(n + 1)}u, so the condition number ofQ is bounded as On the other hand, as shown in [16], a sufficient condition for CholeskyQR2 to compute a stable QR factorization ofQ is Combining these facts, we obtain the following condition under which shiftedCholeskyQR3 is guaranteed to compute a numerically stable QR factorization: If κ 2 (X) > u − 1 2 , we have 1 + 11{mn + n(n + 1)}u(κ 2 (X)) 2 11{mn + n(n + 1)}u(κ 2 (X)) 2 and the condition can be simplified as Note that this ensures that the condition (3.1) for the error analysis in Section 3 is automatically satisfied. In practice, it often happens that shiftedCholeskyQR3 (or more often the iterated Algorithm 4.1) computes the QR factorization for matrices with even larger condition numbers than indicated by (4.12). One explanation is that in the absence of roundoff errors, one iteration of shiftedCholeskyQR reduces the condition number by a factor ≈ u 1 2 . However, we think that a rigorous convergence analysis in finite precision arithmetic would be possible only under some assumption on κ 2 (X), and that (4.12) provides a sharp bound up to (at most) a low-degree polynomial in m, n.
We now examine the numerical stability of shiftedCholeskyQR3 and show that it enjoys excellent stability both in orthogonality and backward error. Roughly, the result follows by combining the facts that (i) shiftedCholeskyQR gives aQ with κ 2 (Q) < u − 1 2 with small backward error, and (ii) for matrices with condition number < u − 1 2 , CholeskyQR2 computes a stable QR factorization ofQ as shown in [16]. Below we make this statement precise.

Numerical stability of shiftedCholeskyQR3.
Theorem 4.1. Let X ∈ R m×n be a matrix satisfying (3.1)-(3.3) and (4.12). Then shiftedCholeskyQR3 computes a QR factorization X ≈QR satisfying the orthogonality measure Q Q − I F ≤ 6{mn + n(n + 1)}u, (4.13) and backward error (4.14) QR − X F Proof. The orthogonality measure of the outputQ of shiftedCholeskyQR3 is essentially exactly the same as that of CholeskyQR2, which is analyzed in detail in [16]. This is because the bound there applies to any matrix with condition number u − 1 2 . We next establish (4.14). By (3.13), with the first shiftedCholeskyQR executed in finite precision arithmetic we have where ∆X F is bounded as in (3.40). We then apply CholeskyQR2 toQ to obtain the QR factorizationQ = ZU . In finite precision we have where (see Appendix; this bound slightly improves [16]) The upper triangular factor S in the QR factorization of the original matrix X is computed as Here ∆S represents the forward error incurred in the matrix multiplication. Summarizing, we can bound the overall backward error as We now bound the terms in the right-hand side. For the first term, using [16,Thm 3.5] and (3.32) we can bound ∆Q F as To bound R 2 , we use (3.26) to obtain We next bound the second term in (4.19). By [16,Thm. 3.3], Ẑ 2 can be bounded as Regarding ∆S F , using the general error bound for matrix multiplications |∆S| ≤ γ n |Û | |R| we obtain Here R 2 can be bounded as in (4.21). To bound Û 2 , we recall (4.16) and left-multiplyẐ to obtain (4.24)Ẑ (Q + ∆Q) =Ẑ ẐÛ . Now, by [16,Thm. 3.3], the eigenvalues ofẐ Ẑ lie in the interval [1 − 6(mnu + n(n + 1)u), 1 + 6(mnu + n(n + 1)u)], so it follows that Finally, we can bound ∆X F as in (3.40).
Combining the above bounds and substituting into (4.19) yields as required.
Comparison with CGS2. As we saw above, (4.12) is a sufficient condition for shiftedCholeskyQR3 to work in finite precision arithmetic. This condition roughly requires that κ 2 (X)(mn + n 2 )u = O(1). Let us compare this with the analysis in [5] for the CGS2 algorithm, which shows that is a sufficient condition for CGS2 to compute the QR factorization in a stable manner.
Observe that (4.27) is much more stringent than (4.12); indeed in large-scale computing in which m, n 1000, with double precision (4.27) is unlikely to be satisfied even with well-conditioned X.
This difference might appear to suggest shiftedCholeskyQR3 is superior to CGS2 in terms of robustness, but we have not observed this in practice. We suspect that the difference is an artifact of the analysis, and the practical robustness of CGS2 and shiftedCholeskyQR3 seem comparable. An advantage of shiftedCholeskyQR3 is that it is rich in BLAS-3 operations, and offers ample opportunity for parallelization.

5.
Oblique inner product. The (shifted) Cholesky QR algorithm is readily applicable to the QR decomposition in a non-standard inner product space (x, y) B = x T By defined via a symmetric positive definite matrix B ∈ R m×m . The resulting Algorithm 5.1 is almost identical to Algorithm 2.1, except that A is computed as A = X BX and the shift s is chosen in a manner to be described below. In this section, we examine the stability of Algorithm 5.1. The argument closely parallels that in Sections 3 and 4, but new features arise that affect the bounds, in particular involving κ 2 (B).

Assumptions.
We make the following assumptions on m, n, X and B. As in the case of standard inner product, the constants below are not of significant importance but chosen so that the analysis goes through.
The assumption (5.1) roughly demands that X is not too ill-conditioned relative to u. As before, (5.2) and (5.3) require that the matrix dimensions m, n are small compared with the precision u −1 . In (5.1), we can use a simpler assumption (5.5) 6n 2 uκ 2 (X)κ 2 (B) < 1, but (5.1) is less stringent.

Preparations.
We denote the computed results of shiftedCholeskyQR in an oblique inner product, accounting for the numerical errors, aŝ q i , x i are the ith rows of Q andX, respectively. E 1 is the matrix-matrix multiplication error in the computation of the Gram matrix X BX, and E 2 is theÂ backward error incurred when computing the Cholesky factorization. ∆R i is the backward error involved in the solution of the linear system q iR = x i . Equation (5.8) can be rewritten as showing that ∆X is the residual. Now we give bounds on E 1 2 , E 2 2 , R −1 2 , B 1 2 XR −1 2 , ∆R i 2 , Q 2 and ∆X F as in the case of standard inner product.

Bounding
Using the standard error analysis of matrix-matrix multiplication and Cholesky factorization [7] and the assumptions (5.2) and (5.3), we can bound E 1 2 and E 2 2 as See [17] for details. From the assumption (5.4) on s, these bounds ensure that (5.14) Combining this with (5.7) and using Weyl's theorem [6, Sec. 8.6.2], we obtain a bound on R −1 2 : The bound on B 1 2 XR −1 2 can be derived as follows. First, note that from (5.7), Using (5.14) and (5.15), we have The bound on ∆R i 2 can be obtained from the standard error analysis of backward substitution [7, Thm. 8.5] as From (5.7), (5.4), and (5.14) we obtain Substituting this into (5.18) gives

5.2.2.
Bounding ∆X 2 roughly. Here we give a rough bound for ∆X F and prove that ∆X F = O(u X 2 2 B 2 / √ s). This will be insufficient for proving our main result, Theorem 5.3, for which we will need ∆X F = O(u X 2 κ 2 (B)), which we will prove later after having obtained a bound for Q 2 . We shall proceed in the following steps.  σ n (B)). 3. Use above and (5.10) to prove the "tight" bound ∆X F = O(u X 2 κ 2 (B)).

Use above to prove Theorem 5.3.
To establish the first statement, we express ∆x i in terms of x i ,R and ∆R i . Substituting (5.8) into (5.10), we have Here, R −1 ∆R i 2 can be bounded using (5.15) and (5.20) as Together with the fact ∆x i ≤ x i ∆Ȓ i 2 , we can bound ∆X F as (5.25) where we used m i=1 x i 2 = X F ≤ √ n X 2 for the last inequality.

Bounding Q 2 .
We now proceed to bound Q 2 .
Lemma 5.1. Suppose that X ∈ R m×n with m ≥ n satisfies (5.2) and (5.3). Then, the matrixQ obtained by applying the shiftedCholeskyQR algorithm in floating-point arithmetic to X satisfies Moreover, Proof. We havê Thus we can bound Q BQ − I 2 as The first term of (5.28) can be bounded as and for the second term in (5.28), using (5.15), (5.17) and (5.25) we obtain For the third term in (5.28), from (5.15) and (5.25) Summarizing, we can bound the right-hand side of (5.28) as Q BQ − I 2 < 2, as required.
To derive (5.27), let B = V DV andQ BQ = U ΛU be the symmetric eigenvalue decompositions of B andQ BQ, respectively. Then, fromQ V DV Q = U ΛU , we have Hence there exists an orthogonal matrix W such that Noting that Λ 2 < 3 and D −1 2 = (σ n (B)) −1 , we can bound Q 2 as

Bounding the residual.
We now bound the residual.
Lemma 5.2. Suppose that X ∈ R m×n with m ≥ n satisfies (5.2) and (5.3). Then, the matricesQ,R obtained by Algorithm 5.1 in floating-point arithmetic to X satisfies Proof. First note that Substituting (5.20) into this gives On the other hand, from (5.27) we have so it follows that To obtain the second bound in the statement, we use (5.18) in (5.37) to obtain q iR − as required.
Lemma 5.2, in particular (5.36), shows that shiftedCholeskyQR gives optimal residual up to a factor bounded by a low-degree polynomial of m, n.

Main result.
We are now ready to state the main result of the section, which bounds the quantity Q 2 B 2 / σ n (Q BQ). Note that this is a B-orthonormality measure forQ, reducing to κ 2 (Q) when B = I.
Proof. Recall from (5.27) that Q 2 < √ 3/ σ n (B). The remaining task is to bound σ n (B Using (5.15) and (5.40) we obtain Note that this is O(u 1 2 κ 2 (B)) when we regard low-degree polynomials in m, n as constants.
We next bound σ n (B 1 2 XR −1 ) from below. We start with the equation Let B 1 2 X = U ΣV by the SVD. Then for a diagonal matrix G, we can write Indeed the left-hand side is V (Σ 2 + sI)V and the right-hand side V (Σ + G) 2 V , so we can take We next bound the singular values of T . By (5.14) and (5.15) we have so σ i (T ) ∈ [ 1 − 1 9 , 1 + 1 9 ] ⊆ [0.9, 1.1]. Therefore, letting T = U (I + E )V be the SVD where E is diagonal, we have Using the general inequality for singular values of matrix products σ min (AB) ≥ σ min (A)σ min (B) (applicable if A or B is square) we obtain By inserting this and (5.43) into (5.42), we obtain where we used (5.1) in the fourth inequality and the definition of α in the last equality.
Together with (5.27) we obtain · κ 2 (B). (5.54) Thus we conclude that, provided that α X 2 2 B 2 /σ n (X BX) 1, Thus applying one step of shiftedCholeskyQR results in the condition number being reduced by about a factor ακ 2 (B) = √ sκ2(B) 5.5. shiftedCholeskyQR3 in oblique inner product. We now describe the analogue of shiftedCholeskyQR3 for an oblique inner product space. We continue to assume (5.1)-(5.4); a particular case of interest is Choice of shift s. As in Section 4.1, we need to take into account the error term E 1 in (5.6), incurred in the computation of A = X T BX. From (5.6), we obtain a lower bound Accordingly to apply (4.2), we first shiftÂ by the upper bound in (5.12). Thus a choice of s that avoids numerical breakdown is Together with the assumption (5.4), we obtain (5.58) s := max(11{2m √ mn + n(n + 1)}u X 2 2 B 2 ,s). First we note thatÂ = X T BX + E 1 from (5.6). As in (4.6) and (4.7), we have Thus, we can bounds as √ mn + n(n + 1)}u X 2 2 B 2 , which shows that the maximum in (5.58) is (5.59) s := 11{2m √ mn + n(n + 1)}u X 2 2 B 2 . We now summarize the algorithm in pseudocode in Algorithm 5.2, which is the Borthogonal analogue of Algorithm 4.2. We still call the algorithm shiftedCholeskyQR3 (supressing B as the two algorithms are essentially equivalent when B = I, aside from a slight difference in the shift strategy). The iterated Algorithm 4.1 can also be extended to B = I similarly; we omit its pseudocode for brevity. In the analysis below, we focus on the case u − 1 2 < X 2 B 2 / σ n (X BX) < u −1 . When is thrice enough?. Here we derive a condition on Q 2 B 2 / σ n (Q BQ) that guarantees that shiftedCholeskyQR3 gives a numerically stable QR factorization of X.
Recall that (5.54) gives a bound by shiftedCholeskyQR: · κ 2 (B). (5.60) As in (5.59), to guarantee avoidance of breakdown we take α = 11{2m √ mn + n(n + 1)}u, so we have On the other hand, as shown in [17], a sufficient condition for CholeskyQR2 in an oblique inner product to compute a stable QR factorization ofQ is Combining these facts, we obtain the following condition under which shiftedCholeskyQR3 is guaranteed to compute a numerically stable QR factoriziation: and the condition can be simplified as Note that this ensures that the condition (5.1) for the error analysis in Subsection 5.1 is automatically satisfied. Numerical stability. We now examine the numerical stability of shiftedCholeskyQR3 and show that it enjoys excellent stability both in orthogonality and backward error.
We next establish (5.65). By (5.11), after the first shiftedCholeskyQR we have where ∆X F is bounded as in (5.35). We apply CholsekyQR2 toQ to obtain the QR factorizationQ = ZU . We have from (5.27) and [17,Thm. 3]. The upper triangular factor S in the QR factorization of X is obtained as where ∆S is the forward error in the matrix multiplication. Summarizing, we can bound the overall backward error as We now bound the terms in the right-hand side. For the first term, we can bound ∆Q F as (5.68). To bound R 2 , we use (5.4), (5.7) and (5.14) to obtain We next bound the second term in (5.70). By [17], Ẑ 2 can be bounded as Regarding ∆S F , using the general error bound for matrix multiplications |∆S| ≤ γ n |Û | |R| we obtain Here R 2 can be bounded as in (5.71). To bound Û 2 , we recall (5.67) and left-multiplyẐ B to obtain (5.74)Ẑ B(Q + ∆Q) =Ẑ BẐÛ . Now, by [17,Thm. 2], the eigenvalues ofẐ BẐ lie in the interval [1−8κ 2 (B)(m √ mnu+ n(n + 1)u), 1 + 8κ 2 (B)(m √ mnu + n(n + 1)u)], so it follows that Using the assumption 80κ 2 (B)(m √ mnu + n(n + 1)u) ≤ 1 (which is [17, as required. Experiments indicate that the bounds in Theorem 5.4 are overestimates, in particular the dependence on κ 2 (B) appears to be much weaker.
6. Numerical experiments. In this section we present some numerical experiments to illustrate our results. All computations were carried out on MATLAB 2017b and IEEE standard 754 binary64 (double precision) in Mac OS X version 10.13 with 2 GHz Intel Core i7 Duo processor, so that u = 2 −53 ≈ 1.11 × 10 −16 .
6.1. Convergence with CholeskyQR iterates. First, we take B = I and examine how κ 2 (Q (k) ) and Q (k) Q (k) − I 2 are reduced after k (shifted)Cholesky QR steps. We also compare shiftedCholeskyQR3 with the mixed-precision Cholesky QR (mixedCholQR) [18], which uses doubled precision for the first two steps (1.1), (1.2) and repeats the process in double precision. To run mixedCholQR we used the Multiprecision Computing Toolbox [11], which enables computation in MATLAB with arbitrary precision. We generate test matrices with a specified condition number by forming where U is an m × n random orthogonal matrix obtained by taking the QR factorization of a random matrix, V is an n × n random orthogonal matrix and Σ = diag(1, σ 1 n−1 , · · · , σ n−2 n−1 , σ).
We next turn to B = I. We set B to be a SPD matrix as above, with U = V . We illustrate Theorem 5.3 by examining how is reduced as compared with We set X to be a random matrix with a specified condition number as in (6.1) and form B 0 as in (6.1), now taking V = U T . In Table 6.3, we took κ 2 (X) = 10 12 , κ 2 (B) = 10 8 , and m = 300, n = 30. Table 6.3 also confirms that is improved by shiftedCholeskyQR3 by a factor O( α/κ 2 (B)) u − 1 2 (experiments suggest that the κ 2 (B) dependence is often a significant overestimate). CholeskyQR2 then safely completes the QR factorization, as predicted by Theorem 5.3. Table 6.1: Results for test matrices with κ 2 (X) = 10 12 , m = 1000, n = 30, B = I.
13.50 4.82 · 10 −5 2 13.50 3.34 · 10 −15 Remark 1. The choice of s in (5.59) tends to be a conservative overestimate, and in most cases, a successful Cholesky factorization can be computed with a smaller shift, such as A + (u X 2 2 B 2 )I. It can be seen (if Cholesky still does not break down) that the reduction factor of κ 2 (Q (k) ) improves to about ( √ u) (k) after k shiftedCholeskyQR steps. To illustrate this, in Figure 6.1 we show the values of κ 2 (Q (1) ) as we vary the shift in shiftedCholeskyQR taking B = I, m = 1000, n = 50, κ 2 (X) = 10 15 . Our "safe" choice s := 11{mn + n(n + 1)}u X 2 2 ≈ 6.1 · 10 −11 is shown in Figure 6.1 by a blue asterisk. In this case, κ 2 (Q (1) ) is larger than required by (4.11), and more shiftedCholeskyQR iterations would be needed. On the other hand, if we set s := u X 2 2 , thenQ (1) will satisfy the sufficient condition κ 2 (Q (1) ) ≤ u − 1 2 for CholeskyQR2 to work. However, there is no guarantee that the initial Cholesky factorization chol(A + u X 2 2 I) does not break down. 6.2. Orthogonality and residual. Next we examine the numerical stability of shiftedCholeskyQR3 (shown in the figures as sCholQR3) and compare it with other popular QR decomposition algorithms, namely, Householder QR, classical and modified Gram-Schmidt (CGS and MGS; we also run them twice, CGS2 and MGS2). We first take B = I and vary κ 2 (X), m and n and investigate the dependence of the orthogonality and residual on them. We set X as in (6.1). We examine the orthogonality and residual measured by the Frobenius norm under various conditions in Figures 6.2 through 6.4. Figure 6.2 shows the orthogonality Q TQ − I F and residual QR − X F , where we take m = 300, n = 10 and κ 2 (X) was varied from 10 8 to 10 15 . In Figure 6.3, κ 2 (X) = 10 12 , n = 50 and m was varied from 1000 to 10000. In Figure  6.4, κ 2 (X) = 10 12 , m = 1000 and n was varied from 10 to 500.
We see in Figure 6.2 that with shiftedCholeskyQR, the orthogonality and the residual are independent of κ 2 (X) and are of O(u), as long as κ 2 (X) is at most O(u −1 ). This is in good agreement with Theorem 4.1. Figures 6.3 and 6.4 indicate that the orthogonality and residual increase only mildly with m and n. Although they are inevitably overestimates, these also reflect our results (4.13) and (4.14). Compared with Householder QR, we observe that shiftedCholeskyQR3 usually produces slightly better orthogonality and residual. With MGS, the deviation from orthogonality increases proportionally to κ 2 (X). As is well known, Gram-Schmidt type algorithms perform well when repeated twice, and we can verify this here. As mentioned in Section 4.3, an advantage of shiftedCholeskyQR3 is that it is rich in BLAS-3 operations and easily parallelized (it ran more than ten times faster than Gram-Schmidt algorithms in the experiments here). Overall, we see that shiftedCholeskyQR3 is an efficient and reliable method for matrices with condition number at most O(u −1 ).
Next, we again take B = I and test Algorithm 5.2, comparing it with the stability of other popular QR decomposition algorithms, namely, MGS, CGS2 and MGS2. We varied κ 2 (B), m and n and investigated the orthogonality Q T BQ − I F and residual QR − X F . Figure 6.5 shows the results for the case m = 500, n = 20, κ 2 (B) = 10 10 and κ 2 (B) was varied from 10 1 to 10 10 . Figure 6.6 takes m = 300, n = 50, κ 2 (X) = 10 10 and κ 2 (B) was varied from 10 8 to 10 15 . In Figure 6.7 we took κ 2 (X) = 10 8 , κ 2 (B) = 10 10 , n = 50 and m was varied from 500 to 2000. Figure 6.8 takes κ 2 (X) = 10 8 , κ 2 (B) = 10 10 , m = 1000 and n was varied from 50 to 500. From Figure 6.5 we see that the orthogonality and residual of shiftedCholeskyQR3 are independent of κ 2 (X) and are of O(u), as long as Theorem 5.4. Figure 6.6 shows that the orthogonality increase rather mildly with κ 2 (B), indicating the dependence on κ 2 (B) suggested by 5.64 is perhaps improvable; we leave this for future work. Figures 6.7 and 6.8 illustrate the orthogonality and residual increase with m and n, again mildly. These are also in agreement with (5.64) and (5.65) in Theorem 5.4; again, its m, n-dependence may be improvable. Compared with CGS2 and MGS2, the orthogonality and residual of CGS2 and MGS2 and that of shiftedCholeskyQR3 are of the same magnitude. Again, shiftedCholeskyQR3 has the advantage of being parallelization-friendly. All our experiments corroborate that shiftedCholeskyQR3 is a reliable method whether B = I or B = I, for matrices with   7. Runtime Performance. We next evaluate the runtime performance of shift-edCholeskyQR3 in multi-core CPU environments for both the standard and the oblique case involving a sparse symmetric, positive definite matrix. 7.1. Standard inner product. We used a compute node of the Laurel 2 supercomputer system installed at the Academic Center for Computing and Media Studies, Kyoto University, whose specifications are listed in Table 7.1. Here we focus on the standard B = I case, to facilitate comparison with available algorithms. Test matrices are generated as in the previous experiments, using (6.1). Here, random orthogonal matrices are obtained by applying the LAPACK Householder QR routines (dgeqrf and dorgqr) to a random matrix. The code is written in Fortran90 and uses LAPACK and BLAS routines. Table 7.2 presents the computational time of several methods, where the QR factorization of matrices whose condition number is 10 11 is computed. Here, we compare shiftedCholeskyQR3 (dgemm and dsyrk versions), Householder QR, and CGS2 (including its blocked version). It is worth noting that dgeqr is a novel LAPACK routine that appropriately uses the TSQR algorithm. The block width in Block CGS2 was empirically tuned.   shiftedCholeskyQR3, dgeqr is fastest, probably due to employing TSQR, but shifted-CholeskyQR3 (dsyrk) is more than 1.7 times faster in every case. These results also indicate that, even if four iterations is required for ill-conditioned problems, iterated CholeskyQR (shown in Algorithm 4.1) would still be faster than other methods. It is of interest to compare shiftedCholeskyQR3 with mixedCholQR [18] from the viewpoint of computational time, but implementing and highly tuning double-double gemm or syrk routines (whose input matrices are in double precision) is generally difficult and requires significant effort. We thus estimate the computational cost of mixedCholQR routine; we only discuss the case where gemm is used, but almost the same discussion is applicable when we use syrk. Table 7.3 presents the results of the benchmark for dgemm, which computes X X, where X is an m × n matrix. From this table, we can assume that 700 GFLOPS is a rough upper bound of the achieved performance. According to the paper on mixed precision CholeskyQR [18], the number of double precision operations required in ddgemm (a double-double precision gemm routine) is 8 times (Cray-style) or 12.5 times (IEEE-style) that required in dgemm. Therefore, assuming that double precision operations in ddgemm are performed at 700 GFLOPS, we can estimate the time (sec.) of ddgemm as 8·2 7 · mn 2 · 10 −11 (Cray-style) or 12.5·2 7 · mn 2 · 10 −11 (IEEE-style).  Based on the above estimation and the breakdown of timing results of shifted-CholeskyQR3, we compare shiftedCholeskyQR3 and mixed precision CholeskyQR in Table 7.4. Here, we ignore the increasing cost for ddpotrf because it is small relative to the total time. From the table, we can expect that shiftedCholeskyQR3 is faster than mixed precision CholeskyQR in this computational environment. Considering this estimation and the fact that a well-tuned ddgemm is currently rarely available, shiftedCholeskyQR3 seems to be more practical than mixed precision CholeskyQR.
Finally we briefly mention the performance of shiftedCholeskyQR3 on largescale distributed parallel systems. Based on our previous performance evaluation of CholeskyQR2 on the K computer (for details, see [4]), we give a rough estimation of the computational time of shiftedCholeskyQR3 in Table 7.5, where we estimate the computational time of shiftedCholeskyQR3 as 1.5 times that of CholeskyQR2, simply based on the number of iterations. From this table, shiftedCholeskyQR3 is expected to be still significantly faster than Householder QR methods (both TSQR and ScaLAPACK routines) in large-scale parallel computation for matrices κ 2 (X) < u −1 .

Oblique inner product.
In the case of the inner product defined by a positive definite matrix B, we focus on large-sparse B, as arises commonly in applications. Owing to the sparsity of B, orthogonalizing X such that X T BX = I depends heavily on the performance of sparse matrix-vector multiplication (SpMV). Both CGS2 and shiftedCholeskyQR3 would therefore achieve only a fraction of the machine's peak performance compared with the case when B is dense, where performance would be  dictated by dgemm. Optimizing the performance of shiftedCholeskyQR3 for a sparse B involves extracting performance from the computation of the inner product X T BX, which in turn depends on • sparse matrix multiple-vector multiplication kernels [8], and • the choice of whether X T BX is computed by forming Y := BX explicitly followed by computing X T Y or in blocks. By blocking we refer to forming the matrix Y j := BX(:, jk : (j + 1)k) for some block size k ∈ [1, n] and j ∈ (1, n/k) in succession, and subsequently calculating the matrix X T Y j . The choice of strategy is highly sensitive to both the sparsity structure of B, and the cache hierarchy of the architecture on which we execute. For an unbounded cache size, computing Y explicitly will be the fastest strategy since the subsequent operation Y T X has excellent computational intensity. Realistically however, for a small L1 cache or indeed a large B, forming Y will cause the entries of X to be evicted from cache, leading to unnecessary cache misses and a slower performance. For a more detailed discussion and performance analysis of the different strategies, we refer the reader to [9, sec. 6.4]; in this section we present the overall speed-up obtained over CGS2 implementations.
Our experiments for a sparse B were carried out on a shared memory system with Intel Xeon E5-2670 (Sandy Bridge) symmetric multiprocessors. There are 2 processors with 8 cores per processor, each with 2 hyperthreads that share a large 20 MB L3 cache. Each core also has access to 32 KB of L1 cache and 256 KB of L2 cache and runs at 2.6 GHz. They support the AVX instruction set with 256bit wide SIMD registers, resulting in 10.4 GFlop/s of double precision performance per core or 83.2 GFlop/s per processor and a dgemm performance of 59 GFlop/sec. The implementation was parallelized using OpenMP, and compiled using Intel C++ Compiler version 14.0 with -O3 optimization level, with autovectorization turned on for both CPU. Tests are run by scheduling all threads on a single processor with 16 threads with OpenMP 'compact' thread affinity, which is set using KMP SET AFFINITY.
We use as test problems symmetric positive definite matrices from the University of Florida [1] collection, as listed in Table 7.6. The matrices are stored and operated upon B using the Compressed Sparse Row (CSR) format and we avoid changing the  format to favour performance although the results in [8] strongly suggest that this is beneficial. The reason for this is that oblique QR factorization is usually part of a "larger" program, for example, a sparse generalized eigensolver [9, chap. 4], and hence the storage format needs may be governed by other operations in the parent algorithm, for example, a sparse direct solution, and such software may not exist for the new format. Changing the sparse matrix format to accelerate the factorization may also slow down other parts of the calling program that are not optimized to work with a different format. We present the performance of shiftedCholeskyQR3 on CPU by varying n, i.e., the size of X and compare the results with those from CGS2 in Figure 7.1. On the CPU, shiftedCholeskyQR3 was faster than CGS2 by a minimum of 3.7 times for n = 16 and a maximum of 40 times for n = 256 vectors. The large speedup obtained is only a reflection of the reliance of CGS2 on matrix-vector products, which in the case of sparse matrices, has a particularly poor CPU utilization.  8. Conclusion and discussion. Our algorithm shiftedCholeskyQR3 combines speed, stability, and versatility (applicable to B = I). We believe it offers an attractive alternative in high-performance computing to the conventional Householder-based QR factorziation algorithms when B = I, and can be the clear algorithm of choice when B = I.
shiftedCholeskyQR3 as presented could benefit from further tuning, and this work suggests a few future directions. First, the choice of shift s introduced in this paper is conservative, and severely so when m, n are large. Our experiments suggest that a much smaller shift, such as s = O(u A 2 ), is usually sufficient to avoid breakdown in chol(A + sI), and as illustrated in Figure 6.1, a smaller shift results in improved conditioning, and hence smaller number of shiftedCholeskyQR iterations. Introduction of a shift strategy that is both stable and efficient is an important remaining task.
Our performance results hold promise for the competitiveness of shiftedCholeskyQR3, and further work will focus on comparing it with other state-of-the-art implementations for QR factorziations such as TSQR in an HPC setting. In the case where B = I for a sparse B, the performance benefits over CGS2 are remarkable and our algorithm is the clear choice for applications.