Multiprecision Algorithms for Computing the Matrix Logarithm

. Two algorithms are developed for computing the matrix logarithm in ﬂoating point arithmetic of any speciﬁed precision. The backward error-based approach used in the state of the art inverse scaling and squaring algorithms does not conveniently extend to a multiprecision environment, so instead we choose algorithmic parameters based on a forward error bound. We derive a new forward error bound for Pad´e approximants that for highly nonnormal matrices can be much smaller than the classical bound of Kenney and Laub. One of our algorithms exploits a Schur decomposition while the other is transformation-free and uses only the computational kernels of matrix multiplication and the solution of multiple right-hand side linear systems. For double precision computations the algorithms are competitive with the state of the art algorithm of Al-Mohy, Higham, and Relton implemented in logm in MATLAB. They are intended for computing environments providing multiprecision ﬂoating point arithmetic, such as Julia, MATLAB via the Symbolic Math Toolbox or the Multiprecision Computing Toolbox, or Python with the mpmath or SymPy packages. We show experimentally that the algorithms behave in a forward stable manner over a wide range of precisions, unlike existing alternatives.


Introduction.
Let A ∈ C n×n be nonsingular with no nonpositive real eigenvalues.Any matrix X ∈ C n×n satisfying the matrix equation (1.1) X = e A is a matrix logarithm of A. This equation has infinitely many solutions, but in applications one is typically interested in the principal matrix logarithm, denoted by log A, which is the unique matrix X whose eigenvalues have imaginary part strictly between −π and π.This choice is the most natural in that it guarantees that if the matrix is real then so is its logarithm and that if the matrix has positive real spectrum then so does its logarithm.More generally, the unique matrix satisfying (1.1) having spectrum in the complex strip is called the kth branch of the matrix logarithm, and is denoted by log k A. The choice k = 0 yields the principal logarithm log A. From a computational viewpoint, being able to approximate log A is enough to determine the value of log k A for any k ∈ Z, in view of the identity log k A = log A + 2kπiI.
The aim of this work is to develop an algorithm for log A that is valid for floating point arithmetic of any given precision.A new algorithm is needed because the state of the art algorithm of Al-Mohy, Higham, and Relton [3], [4], which is implemented in the MATLAB function logm, is designed specifically for IEEE double precision arithmetic.Indeed most available software for the matrix logarithm has this limitation of being precision-specific [32].Applications of the new algorithm will be in both low and high precision contexts.For example, both the matrix logarithm [36] and low precision computations (perhaps 32-bit single precision, or 16-bit half precision) [15], [26], have been recently used in machine learning, and a combination of the two is potentially of interest.
The need for high precision arises in several contexts.For instance, in order to estimate the forward error of a double precision algorithm for the matrix logarithm a reference solution computed at quadruple or higher precision is usually needed.Estimating the backward error of a double precision algorithm for the matrix exponential also requires the ability to evaluate log A at high precision.Let X = e A and let X be a solution computed by a double precision algorithm for the matrix exponential.If the spectrum of A lies inside L k , so that A = log k X, then the backward error of X is naturally defined as the matrix ∆A such that (1.2) log k X = A + ∆A, because then X = e A+∆A , and the normwise relative backward error is ∆A / A = log X − A / A .A multiprecision algorithm for the matrix logarithm is needed in a variety of languages and libraries that attempt to provide multiprecision implementations of a wide range of functions with both scalar and matrix arguments.The Julia language [8] and Python's SymPy [44], [49] and mpmath [39] libraries currently lack such an algorithm, and we will show that the algorithms proposed here improve upon those in version 7.1 of the Symbolic Toolbox for MATLAB [48] and version 4.3.2 of the Multiprecision Computing Toolbox [45].
The algorithm of Al-Mohy, Higham, and Relton used by logm is the culmination of a line of inverse scaling and squaring algorithms that originates with Kenney and Laub [40] for matrices and goes back to Briggs [11] in the 17th century in the scalar case.In essence, the algorithm performs three steps.Initially, it takes square roots of A, s of them, say, until the spectrum of A 1/2 s − I is within the unit disk, which is the largest disk centered at the origin in which the principal branch of log(1 + x) is analytic and its Padé approximants are therefore well defined.Then it selects a Padé approximant r km (x) := p km (x)/q km (x) to log(1+x) of suitable degree [k/m], evaluates the rational matrix function r km (X) = p km (X) q km (X) −1 at X = A 1/2 s − I, and finally reverts the square roots to form the approximation log A ≈ 2 s r km (A 1/2 s − I).The algorithm is based on a backward error analysis and uses pre-computed constants that specify how small a normwise measure of A 1/2 s − I must be in order for a given diagonal Padé approximant r mm to deliver a backward stable evaluation in IEEE double precision arithmetic.These constants require a mix of symbolic and high precision computation and it is not practical to compute them during the execution of the algorithm for different precisions.Therefore in this work we turn to forward error bounds, as were used in earlier work [14], [40].
Kenney and Laub [41] showed that for X < 1 and any subordinate matrix norm, In subsequent literature, the equivalent bound has been preferred [30], [31, sec. 11.4].Both upper bounds can be evaluated at negligible cost, so they provide a way to choose the Padé degrees k and m.We will derive and exploit a new version of the latter bound that it is phrased in terms of the quantities for suitable p, instead of X .Since α p (X) is no larger than X , and can be much smaller when X is highly nonnormal, the new bound leads to a more efficient algorithm.
Since in higher precision our new algorithm may need a considerable number of square roots, it can happen that log(I + X) has very small norm and thus that an absolute error bound is not sufficient to guarantee that the algorithm will deliver a result with small relative error.For this reason, unlike in some previous algorithms we will use a relative error bound containing an inexpensive estimate of log(I + X) .
It is well known [41,Thm. 6] that for X with nonnegative entries and k + m fixed the diagonal Padé approximant (k = m) minimizes the error log(I − X) − r − km (X) and the cost in flops of evaluating r km (X) is roughly constant.Therefore diagonal approximants r m := r mm have been favoured.However, the special case of the Taylor approximant t m := r m0 merits consideration here, as its evaluation requires only matrix multiplications, which in practice are faster than multiple right-hand side solves.Throughout the paper we write f m to denote either the Padé approximant r m or the Taylor approximant t m .
In addition to the matrix logarithm itself, we are also interested in evaluating its Fréchet derivative.Being able to evaluate the Fréchet derivative and its adjoint allows us to estimate the condition number κ log (A) of the matrix logarithm, which in turn gives an estimate of the accuracy of the computed logarithm.
We use the term "multiprecision arithmetic" to mean arithmetic supporting multiple, and usually arbitrary, precisions.These precisions can be lower or higher than the single and double precisions that are supported by the IEEE standard [37] and usually available in hardware.We note that the 2008 revision of the IEEE standard [38] also supports a quadruple precision floating point format and, for storage only, a half precision format.
We begin the paper by summarizing in Section 2 available multiprecision computing environments.In Section 3 we derive a new forward bound for the error of Padé approximation of a class of hypergeometric functions, which yields a bound sharper than (1.3) and (1.4) for highly nonnormal matrices.In Section 4 we describe a new Schur-Padé algorithm for computing the matrix logarithm and its Fréchet derivative at any given precision.Section 5 explores a Schur-free version of the algorithm.Numerical examples are presented in Section 6 and concluding remarks are given in Section 7.
Finally, we introduce some notation.The unit roundoff of the floating point arithmetic is denoted by u.For A ∈ C n×n , the spectral radius of a matrix is denoted by ρ(A) = max{ |λ| : λ is an eigenvalue of A }. We recall that the Frechét derivative of a matrix function f : The relative condition number of the matrix function f at A is defined as and is explicitly given by the formula [31, Thm.3.1] 2. Support for multiple precision arithmetic.A wide range of software supporting multiprecision floating point arithmetic is available.Multiprecision capabilities are a built-in feature of Maple [42] and Mathematica [43] as well as the open-source PARI/GP [46] and Sage [47] computer algebra systems, and are available in MATLAB through the Symbolic Math Toolbox [48] and the Multiprecision Computing Toolbox [45].The programming language Julia [8] supports multiprecision floating point numbers by means of the built-in data type BigFloat, while for other languages third-party libraries are available: mpmath [39] and SymPy [44], [49], for Python; the GNU MP Library [25] and the GNU MPFR Library [21] for C; the BOOST libraries [10] for C++; and the ARPREC library [5] for C++ and Fortran.The GNU MPFR Library is used in some of the software mentioned above, and interfaces to it for several programming languages are available 1 .

3.
Error in the approximation of hypergeometric functions.We recall that the rational function r km = p km /q km is a [k/m] Padé approximant of f if p km and q km are polynomials of degree at most k and m, respectively, q km (0) = 1, and f (x) − r km (x) = O(x k+m+1 ).In order to obtain the required error bounds for Padé approximants to the logarithm we consider more generally Padé approximants to the hypergeometric function where (a By combining the analysis of Kenney and Laub [41] with a result of Al-Mohy and Higham [2] we obtain the following stronger version of the error bound [41,Cor. 4].Recall that α p is defined in (1.5).
Proof.Kenney and Laub [41,Thm. 5] show that if m ≤ k + 1, |y| < 1, and 0 < a < c then (3.2) 2 F 1 (a, 1, c, y) − r km (y) = q km (1) where q km is the denominator of r km , a polynomial of degree m.By [41, Cor.1] the zeros of q km are simple and lie on (1, ∞), and since q km (0) = 1 it follows that for |y| < 1, with d i > 0 for all i.Since a, c > 0 and i > k + m, the coefficients of the series in (3.2) are positive and so (3.2) can be rewritten as where sign(ψ i ) = sign(q km (1)) for all i.Therefore, applying [2, Thm.4.2(a)] gives For the matrix logarithm, we have and thus the [k/m] Padé approximant r km (x) to log(1 + x) and the [k/m] Padé approximant r km (x) to 2 F 1 (1, 1, 2, x) are related by The next result is an analog of [34, Thm.2.2], which applies to the function Corollary 3.2.Let X ∈ C n×n be such that ρ(X) < 1 and α p (X) < 1, and let r km be the [k/m] Padé approximant to log(1 + x).Then for m ≤ k, and p such that p(p − 1) ≤ k + m + 1, we have Proof.From (3.3), (3.4), and (3.5), with a = 1, c = 2, and x = −y we have We know from the proof of Theorem 3.1 that the ψ i are one-signed, and so we deduce that Since α p (−X) = α p (X), we obtain the bound (3.6) on replacing X by −X.
Since the bound (3.6) is decreasing in α p (X), the smallest bound is obtained for In practice we will approximate p * rather than compute it exactly, as discussed in Section 4. We compare in Figure 3.1 the bounds (1.4) and (3.6) for the diagonal Padé approximant r m and the Taylor approximant t m , for m between 1 and 20, with the fairly nonnormal matrix (3.9)A = 0.01 0.95 0 0.04 .
We see that for both t m and r m the new bound can be many orders of magnitude smaller than (1.4) and it is a much better estimate of the actual error.
4. Schur-Padé algorithm.In this section and the next we develop two new algorithms for computing the matrix logarithm in arbitrary precision floating point arithmetic, one using the Schur decomposition and one transformation-free.The algorithms build on the inverse scaling and squaring algorithms in [3], [4], [14], [31,Algs 11.9,11.10].They combine features from these algorithms in a novel way that yields algorithms that take the unit roundoff as an input parameter and need no pre-computed constants.
We approximate the Fréchet derivative of the logarithm by the Fréchet derivative of the logarithm's Padé approximant.We will not give an error bound for this approximation because, as noted in [4], it is problematic to obtain error bounds for it that are expressed in terms of the α p (A). However our main intended use of the Fréchet derivative is for condition number estimation, which does not require accurate derivatives, and the same approximation was found to be perform well at double precision in [4].Our precision-independent algorithm for the matrix logarithm and its Fréchet derivative is given in Algorithm 4.1.Instructions with an underlined line number are to be executed only when the Fréchet derivative of the matrix logarithm is required.
The algorithm begins by computing the Schur decomposition A = QT Q * .In lines 10-12 it repeatedly takes square roots of T until the spectrum of T − I is within the unit ball centered at 0. Although the requirement ρ(X) = ρ(T − I) < 1 in Corollary 3.2 is now satisfied, there is no guarantee that the relative forward error log T −f mmax (T −I) 1 / log T 1 , where m max is the maximum degree of approximant allowed and f m denotes r m or t m , is less than the unit roundoff u.This is especially true for Taylor approximation, which could need hundreds of terms to achieve a bound on the forward error smaller than u.
Hence in lines 13-15 the algorithm keeps taking square roots until where ψ(T ) is an estimate of the 1-norm of log T and p is chosen as described below.
Approximating log T by the first term of the Taylor series, ψ(T ) = T − I 1 , provides an estimate accurate enough for all matrices and levels of precision considered in our numerical experiments.Note that α p ( X 1 ) can be estimated efficiently, without explicitly forming any powers of X using the block 1-norm estimation algorithm of Higham and Tisseur [35], which requires only O(n 2 ) flops.Since only an estimate of X p 1/p is computed there is no need to use high precision for this sub-problem, so we carry out this computation in double precision (or single precision if the requested precision is lower than double), in order to exploit the floating point hardware.When the matrix dimension is small and the working precision is not too high the cost of estimating α p (T − I) can nevertheless be non-negligible.Rather than computing α p for the optimal value p * in (3.8), we compute it only for the largest possible p.Some justification for this choice Algorithm 4.1 Schur-Padé algorithm for matrix logarithm and Fréchet derivative.Given A ∈ C n×n with no eigenvalues on R − this algorithm computes X = log A, and optionally the Fréchet derivative Y ≈ D log (A)[E], in floating point arithmetic with unit roundoff u using inverse scaling and squaring with Padé approximation.s max is the maximum allowed number of square roots and m max the maximum allowed approximant degree.The logical parameter use taylor determines whether a diagonal Padé approximant or a Taylor approximant is to be used.ψ(X) provides an approximation to log X 1 .

30:
E ← X, where X is the solution of T X + XT = E.

31:
return T, E, s + 1 comes from the fact that despite a sometimes considerably nonmonotonic behaviour, the sequence {α p (X)} p∈N is typically roughly decreasing [2].We denote the α value corresponding to the diagonal Padé approximant r m by and that corresponding to the truncated Taylor series t m by Thus (4.1) is used in the form We now discuss the cost of the algorithm, beginning with the case of diagonal Padé approximants.Higham [30] considered several ways of evaluating the rational function r m (X).The partial fraction form where γ (m) j and δ (m) j are the weights and nodes, respectively, of the m-point Gauss-Legendre quadrature rule on [0, 1], was found to provide the best balance between efficiency and numerical stability, and it also has the advantage of allowing parallel evaluation.For a triangular matrix of size n, the evaluation of (4.3) requires mn 3 /3 flops and the computation of each of the s matrix square roots costs n 3 /3 flops.Thus, when diagonal approximants are used, the algorithm requires 25n 3 flops for the computation of the Schur decomposition, χ r (s, m) := (s + m) n 3 /3 for the inverse scaling and squaring phase, and 3n 3 flops to recover the solution.
Since χ r (s, m) = χ r (s+1, m−1), an additional square root will save computational effort only if the degree of the approximant decreases by at least 2, in which case the overall reduction in cost is at least n 3 /3 flops.In view of the approximation [3] (4.4) an additional square root is taken only if where m is the current degree of the approximant (defined in line 16 of Algorithm 4.1).Turning to Taylor series approximants, in order to evaluate the truncated Taylor series the algorithm uses the Paterson-Stockmeyer method, which among the four methods for polynomial evaluation considered in [31,Thm. 4.5] is the one that minimizes the number of matrix-matrix multiplications while satisfying a forward error bound of the same form as for the other methods.The computational cost of computing s square roots and evaluating a Taylor approximant of degree m is approximately and thus an additional square root is taken only if (4.5) holds with f ≡ t and Since the cost function χ t is monotonic in both arguments, the reduction in the number of flops will be at least which depends only on m.Unlike in the Padé case, we cannot put a useful lower bound on the decrease in flops resulting from an extra square root.
For both types of approximant we can perform a binary search to find the smallest which requires the estimation of X p 1/p 1 for no more than 2 log 2 m − 1 values of p.If the Frechét derivative is needed then each time a square root is taken the algorithm solves an additional Sylvester equation, as detailed in [31, sec. 11.8].Once the optimal values for s and m have been found, in order to increase the accuracy the algorithm recomputes the first superdiagonal of T from T 0 , making use of the identity [33, eq.(5.6)], and then the main diagonal of T − I, by applying [1, Alg.2] to the diagonal of T 0 .The algorithm can be easily adapted to compute the adjoint of the Frechét derivative of A in the direction E, by replacing the increment E by E * and returning Y * [4].
Algorithm 4.1 can be extended to compute an estimate of the 1-norm condition number of the matrix logarithm, using the same approach as in [4, Alg.4.1].Since in this case the Frechét derivative of the matrix logarithm of A and its adjoint need to be computed in several directions, but neither s nor m depend on E, the algorithm can be modified to store the matrix T after each square root is taken, and then use it to solve several Sylvester cascades for different matrices E. If η is the number of bits required to store a single entry of the matrix, then this modification increases the memory requirement of the algorithm by about η(s − 1)n 2 /2 bits if the upper triangular pattern is exploited.

Transformation-free algorithm. Multiprecision computing environments
often provide just a few linear algebra kernels.For example, version 7.1 of the Symbolic Math Toolbox [48] does not support the Schur decomposition in its variable precision arithmetic (VPA).In this section we therefore present a version of Algorithm 4.1 that does not require the computation of the Schur decomposition and builds solely on level 3 BLAS operations and multiple right-hand side system solves.
The algorithm starts by taking enough square roots to guarantee that the Padé or Taylor approximants will produce a relative forward error below the unit roundoff threshold.Since in this case the matrix is not triangular, to compute square roots the algorithm employs the scaled Denman-Beavers iteration (in product form) [31, eq.(6.29)], whose computational cost depends on the number of iterations and is thus not known a priori.However, it has been observed [31, sec.11.5.2] that in practice up to ten iterations are typically required for the first few square roots, but just four or five are enough in the later stages.Since the cost of one iteration is 4n 3 flops, it is customary to consider that the computation of a square root requires 16n 3 flops.On Algorithm 5.2 Transformation-free matrix logarithm with Fréchet derivative Given A ∈ C n×n with no eigenvalues on R − this algorithm computes X = log A, in floating point arithmetic with unit roundoff u using inverse scaling and squaring with Padé approximation.s max is the maximum allowed number of square roots and m max the maximum allowed approximant degree.The logical parameter use taylor determines whether a diagonal Padé approximation or a Taylor approximation is to be used.ψ(X) provides an approximation to log X 1 .[A, P, s] ← sqrtm db(A, P, s) 12: end while [A, P, s] ← sqrtm db(A, P, s)

27:
if s > 1 then 2 ), an additional square root will be worthwhile if (4.5) holds for f ≡ t, T = A, and (5.1) which guarantees a reduction in the number of flops of at least To reduce the chances of numerical cancellation in the computation of A 1/2 s − I, the matrix form of [1, Alg.2] is used, as in [3].
Note that Algorithm 5.2 is not suitable for the computation of the Fréchet derivative of the matrix logarithm, nor for the estimation of its condition number.Standard methods for the solution of Sylvester equations [7], [23] start by computing the Schur decomposition of one or both the coefficients of the matrix equation, and are thus not suitable for a framework where a multiprecision implementation of the QR algorithm is not available.The alternative of converting the Sylvester equation to an n 2 × n 2 structured linear system Ax = b and solving by Gaussian elimination has too high a computational cost to be useful in this context.

Numerical experiments.
In this section we describe numerical tests with the new multiprecision algorithms for the matrix logarithm.All the experiments were performed using the 64-bit version of MATLAB 2017b on a machine equipped with an Intel I5-5287U processor running at 2.90GHz.The collection of MATLAB scripts and functions that produce the figures and tables in this section is available on GitHub 2 , and the MATLAB functions implementing Algorithm 4.1 and Algorithm 5.2 can also be retrieved on MATLAB Central3 .For the underlying computations the implementations exploit the overloaded methods from the Multiprecision Computing Toolbox (version 4.3.2.12168) [45] to run in different precisions.The precision is specified in terms of the number of decimal digits.
We test the following algorithms.
• logm mct: the (overloaded) logm function from the Multiprecision Computing Toolbox, which implements a blocked version of the Schur-Parlett algorithm [16].After computing the complex Schur decomposition, a blocking of the matrix is computed according to [16], [31, sec.9.3].For each diagonal block T ii of the triangular Schur factor the algorithm repeatedly takes the square root until the spectrum of T ii − I lies within the unit ball centered at 1. Then it chooses the degree of the (diagonal) Padé approximant as the smallest m so that the bound in (1.3) is less than the unit roundoff.Since this strategy does not optimize the balance between the number of square roots and the Padé degree it tends to choose higher degrees than our algorithm.The off-diagonal blocks are obtained via the block Schur-Parlett recurrence.• logm agm: an algorithm for the computation of the matrix logarithm based on the arithmetic-geometric mean iteration.In particular, we use the result in [12,Thm. 8.2], which gives the approximation log A ≈ log(4/ε) − (π/2) AGM(εA) −1 , where ε = √ u/ A F and AGM(A) is the arithmeticgeometric mean iteration, which we compute by means of the stable double recursion [12, eqs.(5.2), (5.3)] with stopping criterion P k − I F ≤ u A F .We do not implement the optimization in [12, sec.7], because it is precision dependent and aimed at speed rather than accuracy.Padé approximants and relative error bounds, i.e. ψ(X) = X − I 1 .• logm tfree tayl: the transformation-free Algorithm 5.2 employing truncated Taylor approximants and relative error bounds, i.e. ψ(X) = X − I 1 .• logm: the built-in MATLAB function that implements the algorithms for real and complex matrices from [3], [4] and is designed for double precision only.In the implementations, m max and s max are set to 200 and 100, for diagonal Padé approximants, and to 400 and 100, for truncated Taylor series, respectively.
The Gauss-Legendre nodes and weights in (4.3) are computed by means of the GaussLegendre method of the mp class provided by the Multiprecision Computing Toolbox.This algorithm is based on Newton root-finding [22], which computes the nodes and weights of the quadrature formula of order m in O(m) flops.This is more efficient than the Golub-Welsh algorithm [24], which is based on the computation of the eigensystem of a tridiagonal matrix and costs O(m 2 ) flops.The Schur decomposition is computed using the schur function of the mp class.
We note that version 7.1 of the Symbolic Math Toolbox provides an overloaded version of the MATLAB function logm, but for several of our test matrices this function either gives an error or fails to return an answer, so we exclude it from our tests.
We evaluate the forward errors X − X 1 / X 1 , where X is a computed solution and X ≈ log A is a reference solution computed with logm pade at a precision of 8d decimal significant digits, where d is the number of digits used for the computation of X.
To gauge the forward stability of the algorithms we plot the quantity κ log (A)u, where κ log (A) is the 1-norm condition number of the matrix logarithm of A estimated using funm_condest1 from the Matrix Function Toolbox [29], with the aid of the Fréchet derivatives in Algorithm 4.1.
We use a test set of 64 matrices, of sizes ranging from 2 × 2 to 100 × 100, including matrices from the literature of the matrix logarithm and from the MATLAB gallery function.
6.1.Comparison with logm in double precision.Our first experiment compares logm pade and logm tfree pade running in IEEE double precision (u = 2 −53 ) with the built-in MATLAB function logm, in order to check that the new algorithms perform well in double precision.Figure 6.1a shows the forward errors sorted by decreasing condition number of the matrix logarithm.Figure 6.1b reports the same   data in the form of a performance profile [20], which we compute with the MATLAB function perfprof described in [27, sec. 26.4].Here, for each method M the height of the line at θ on the x-axis represents the fraction of matrices in the test set for which the relative forward error of M is at most θ times that of the algorithm that delivers the most accurate result.In our performance profiles we use the technique of Dingle and Higham [19] to rescale errors smaller than the unit roundoff in order to avoid abnormally small errors skewing the profiles.The results show that logm pade and logm produce errors bounded approximately by κ log (A)u, that is, they behave in a forward stable manner.The same is true of logm tfree pade except for one matrix, and this algorithm is most often the most accurate, while also being the least reliable, as shown by the performance profile.Figure 6.1c shows that the computational cost of logm pade can be higher than that of logm, but overall is comparable with the state of the art.We note that logm pade computes more square roots than logm on almost 40% of the matrices in our data set.On these matrices, logm requires between 9% and 38% (with an average of about 23%) fewer square roots, but logm pade typically compensates for this by evaluating a Padé approximant of lower degree.
As a further experiment we sought to maximize the ratios between the forward errors of logm and logm pade, using the multidirectional search method of Dennis and Torczon [17], implemented in the mdsmax function in the Matrix Computation Toolbox [28].Initializing that method with random 10 × 10 matrices with no positive real eigenvalues we have not been able to find a matrix for which either ratio of errors exceeds 1.4.This provides further evidence that the two algorithms deliver similar accuracy.6.2.Relative and absolute error.Now we show the importance of using a relative error bound as opposed to an absolute bound, as was used in earlier algorithms intended for double precision [13], [14], [18], [40].Figure 6.2 reports how the relative forward error to unit roundoff ratio varies, as the precision increases, for logm pade, logm pade abs, logm tayl and logm tayl abs, on three of our test matrices.
The ratio for logm pade and logm tayl is influenced by the conditioning of the problem, which is below 100 for these matrices, but tends to remain stable as the working precision increase.The ratio for the algorithms based on an absolute error bounds, on the other hand, grows exponentially, and we can conclude that logm pade abs and logm tayl abs are unstable.6.3.Experiments at higher precisions.Now we compare the accuracy of Algorithm 4.1, Algorithm 5.2 and several competing methods at four different precisions.Figure 6.3a plots, for the matrices in our test set sorted by decreasing condition number, the relative forward errors of logm mct, logm agm, logm tayl, logm tfree tayl, logm pade, and logm tfree pade against κ log (A)u.If the forward error of an algorithm falls outside the range reported in the graph, we put the corresponding marker on the nearest edge (top or bottom) of the plot.The right-hand column of Figure 6.3 reports the same data in the form of performance profiles.
For a working precision of 16 digits (the results for which are not shown here), the six algorithms exhibit a similar behaviour.As illustrated in Figure 6.3c and 6.3e, as the precision u becomes smaller logm mct loses accuracy on almost 40 percent of the   Table 6.1:Execution time breakdown of logm tayl and logm pade, run with u = 2 −1701 on three matrices of increasing size.The table reports, for each algorithm, the number of square roots (s), the degree of the Padé approximant (m), the total execution time in seconds (T tot ), and the percentage of time spent computing the Schur decomposition (T sch ), taking the square roots (T sqrt ), evaluating the scalar bound (T bnd ), and evaluating the Taylor and Padé approximants (T eval ).matrices, and the accuracy of the solution degrades quickly with respect to κ log (A)u.The loss of accuracy of logm agm is not as severe, but it affects the entire dataset and is particularly noticeable for well-conditioned matrices.
The new algorithms show a forward stable behavior, since the forward error remains less than or only slightly larger than κ log (A)u as the precision u becomes smaller.On our test set, Algorithm 4.1 is more accurate than Algorithm 5.2, and the performance of logm tayl and logm pade is almost identical whereas logm tfree tayl is more accurate than logm tfree pade and often provides the most accurate result on the best-conditioned of the test matrices.
6.4.Code profiling.Table 6.1 compares the execution times of our implementations of logm tayl and logm pade, profiling the four main operations performed by the algorithms: Schur decomposition, square roots, evaluation of the bound to determine the degree of the Padé approximant to be used (which includes computation of the norm estimates used in forming the α p ), and evaluation of the approximant itself.We consider the matrices A = expm(gallery('chebvand', n)) B = expm(gallery('randsvd', n)) C = expm(gallery('chow', n)) The unit roundoff is 2 −1701 , which roughly gives 512 decimal significant digits.
In both cases, evaluating the scalar bound (T bnd ) is relatively expensive for small matrices, but its impact drops as the size of the matrices increases and it is typically negligible for matrices of size larger than 100.We can see that logm tayl needs approximately twice as many square roots as logm pade on these matrices, and that while the evaluation time (T eval ) is larger for logm pade this algorithm is slightly faster in most cases.
7. Conclusions.The state of the art inverse scaling and squaring algorithms for the matrix logarithm and the matrix exponential, implemented in the MATLAB functions logm and expm, are tuned specifically for double or single precision arithmetic, via the use of pre-computed constants obtained from backward error bounds.This approach does not extend in any convenient way to a multiprecision computing environment.Here we have shown that by using forward error bounds we can obtain algorithms for the matrix logarithm that perform in a forward stable way across a wide range of precisions.The Schur-based algorithms, based on Algorithm 4.1, are competitive with logm when run in double precision and are superior to existing algorithms at higher precisions.For computing environments lacking a variable precision Schur decomposition we recommend the transformation-free Algorithm 5.2.
The algorithms rely on three innovations.First, we have derived a new sharper version of the forward error bound of Kenney and Laub [41] that can be much smaller for nonnormal matrices.Second, we have implemented the bound in the form of a relative error bound, as we found that the absolute error bounds used in some previous algorithms yield poor results at high precision due to the need for X in the approximations to log(I + X) to have a small norm.Third, we have devised a new strategy for choosing the degree of the approximants and the number of square roots.We investigated both Padé approximants and Taylor approximants and found that there is very little to choose between them in speed or accuracy.
We are currently looking at extending the ideas herein to other matrix functions, including the matrix exponential and real matrix powers.

From
the condition p(p − 1) ≤ k + m + 1 of the corollary we see that (3.6) holds for any p ∈ I [k/m] , where (3.7)

Figure 6 . 1 :
Figure 6.1: Top: forward errors of logm pade, logm tfree pade, and logm on the test set, all running in IEEE double precision Bottom left: performance profile.Bottom right: on the same test set, the number of square roots and multiple righthand side linear system solves for logm pade divided by the corresponding number for logm.

Figure 6 . 2 :
Figure 6.2: Forward error divided by unit roundoff u = 2 log 2 (10 −d ), where the number of decimal significant digits d is shown on the x-axis, for three matrices in the test set.
Performance profile for data in (c).
Performance profile for data in (e).
is the Pochhammer symbol for the rising factorial, a and c are real numbers and x is complex with |x| < 1.Such Padé approximants have been well studied [6, sec.2.3].