Matrices with Tunable Infinity-Norm Condition Number and No Need for Pivoting in LU Factorization
Massimiliano Fasi and Nicholas J. Higham, 2020

We propose a two-parameter family of nonsymmetric dense n × n matrices A(α, β) for which LU factorization without pivoting is numerically stable, and we show how to choose α and β to achieve any value of the ∞-norm condition number. The matrix A(α, β) can be formed from a simple formula in O(n^2) flops. The matrix is suitable for use in the HPL-AI Mixed-Precision Benchmark, which requires an extreme scale test matrix (dimension n > 10^7) that has a controlled condition number and can be safely used in LU factorization without pivoting. It is also of interest as a general-purpose test matrix.


1. Introduction.
A wide variety of test matrices has been developed and made available in software since the early days of digital computing, for example in [5], [11], [15], [36]. A matrix property of particular interest is the condition number. Ideally, we would like the condition number to be a parameter that can be freely chosen, along with the matrix dimension, but this is rarely the case. An exception is the "randsvd" matrix, constructed as the product UΣV^T, where U ∈ R^{n×n} and V ∈ R^{n×n} are random orthogonal matrices from the Haar distribution and Σ = diag(σ_i) with σ_1 ≥ · · · ≥ σ_n > 0. These matrices have 2-norm condition number κ_2(A) = σ_1/σ_n and can be generated, for example, using the LAPACK test matrix generation suite [7] or the MATLAB gallery('randsvd',...) function. However, they require O(n^3) floating-point operations (flops) to generate, which can be prohibitively expensive if n is large. By making alternative choices of U and V and possibly restricting the choice of σ_2, ..., σ_{n−1}, however, the cost of this technique can be reduced to O(n^2) flops [12].
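As an illustration, the randsvd construction can be sketched in pure Python (the helper names haar_like_orthogonal and col_image_norm are ours, and orthogonalizing a Gaussian matrix by modified Gram-Schmidt yields only an approximately Haar-distributed factor). The dense products also make the O(n^3) cost apparent:

```python
import math
import random

random.seed(1)

def haar_like_orthogonal(n):
    # Orthonormalize the columns of a Gaussian matrix by modified Gram-Schmidt.
    # (Approximately Haar; an exact Haar sample would also fix the signs of the
    # diagonal of the triangular factor.)
    cols = [[random.gauss(0.0, 1.0) for _ in range(n)] for _ in range(n)]
    Q = []
    for v in cols:
        v = v[:]
        for q in Q:
            d = sum(qi * vi for qi, vi in zip(q, v))
            v = [vi - d * qi for qi, vi in zip(q, v)]
        nv = math.sqrt(sum(x * x for x in v))
        Q.append([x / nv for x in v])
    # Q holds orthonormal vectors; return the matrix whose columns they are.
    return [[Q[j][i] for j in range(n)] for i in range(n)]

n, kappa2 = 6, 1.0e3
# Geometrically spaced singular values: sigma_1 = 1 >= ... >= sigma_n = 1/kappa2.
sigma = [kappa2 ** (-i / (n - 1)) for i in range(n)]
U = haar_like_orthogonal(n)
V = haar_like_orthogonal(n)
# A = U * diag(sigma) * V^T, so kappa_2(A) = sigma_1 / sigma_n = kappa2.
A = [[sum(U[i][k] * sigma[k] * V[j][k] for k in range(n)) for j in range(n)]
     for i in range(n)]

def col_image_norm(k):
    # Since A V = U diag(sigma), the image of the k-th column of V has 2-norm sigma_k.
    vk = [V[i][k] for i in range(n)]
    Av = [sum(A[i][j] * vk[j] for j in range(n)) for i in range(n)]
    return math.sqrt(sum(x * x for x in Av))
```

The final check exploits AV = UΣ: applying A to a column of V recovers the corresponding singular value exactly, up to the orthonormalization error.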
In this work we propose a new class of n × n test matrices that satisfy the following three requirements.
R1. They can be constructed in O(n^2) flops.
R2. Their ∞-norm condition number is a parameter.
R3. Their growth factor for LU factorization without pivoting is of order 1, that is, Gaussian elimination without pivoting is numerically stable for them.
The third requirement, which is not satisfied by the matrices of [12], is motivated by the HPL-AI benchmark, as we explain in section 2. Matrices that meet these three conditions are worth studying in their own right, as not many classes of dense matrices that satisfy R2 or R3 are known. The main families of dense matrices for which pivoting is known not to be required are symmetric positive definite matrices, totally nonnegative matrices, and matrices that are diagonally dominant by rows or columns [16, Chap. 9].
In this work we propose a two-parameter family of matrices A(α, β) for which LU factorization without pivoting is numerically stable and for which the parameters α and β that yield a matrix with specified ∞-norm condition number κ ∞ (A(α, β)) can be easily computed.
In section 2 we describe the HPL-AI benchmark and why it motivates our work. In section 2.1 we focus on the test matrices that have been proposed for the benchmark, and consider whether diagonally dominant matrices can satisfy our requirements. In section 3 we define the family A(α, β), explain how the entries of A(α, β) can be cheaply and accurately computed, and show that LU factorization without pivoting is stable for these matrices. In section 4 we derive explicit formulae for the entries of A(α, β) −1 . In section 5 we derive an explicit expression for κ ∞ (A(α, β)) that can be evaluated in a constant number of flops, and we explain how to determine α and β to achieve a given value of κ ∞ (A(α, β)). In section 6 we discuss the practicalities of constructing the matrices A(α, β) and confirm experimentally the numerical stability of LU factorization without pivoting when applied to them. In section 7 we explain how A(α, β) can be modified to avoid the possibility of a user of the HPL-AI benchmark gaining an unfair advantage. Concluding remarks are given in section 8.
2. The HPL-AI Mixed-Precision Benchmark. The supercomputers in the TOP500 list 1 are ranked according to their performance on HPL 2 [10], [30], a portable implementation of the High-Performance Computing Linpack Benchmark for distributed-memory computers. The software measures the rate of execution, expressed in flops per second (Flop/s), that a computer achieves when solving a large dense linear system with random coefficients in IEEE binary64 arithmetic [20].
The workload of HPL is compute-bound [25], in the sense that the time needed to run the benchmark is mostly determined by the clock speed of the CPU. As the gap between the clock speeds of CPUs and RAM widens, this metric is becoming less relevant, as memory accesses now dominate the performance of HPC applications. In order to take this aspect into account and obtain a more complete view of the capabilities of a large computational facility, several benchmarking suites that provide an alternative to HPL have been proposed in recent years. The first solution to be introduced was the HPCChallenge (HPCC) Benchmark, 3 a collection of six problems meant to complement HPL with tests exhibiting more challenging memory access patterns [24]. HPCC was succeeded by the High Performance Conjugate Gradients (HPCG) Benchmark 4 [8], [9], a software package developed to test computational and data access patterns that are common in more recent scalable applications.
The HPL, HPCC, and HPCG benchmarks are all typically run in binary64 arithmetic, which is the precision level required by most mathematical codes underlying scientific simulations in traditional HPC applications. This is not representative, however, of typical computations in the emerging field of artificial intelligence, where 32 or fewer bits of precision are typically considered adequate. The HPL-AI Mixed-Precision Benchmark 5 is an attempt to consider both workloads at once, and can be regarded as the first step towards a benchmark suite that takes into account the availability of hardware accelerators for low-precision computation, such as the systolic arrays that equip the Google TPU v2 and v3 or the tensor cores featured by the NVIDIA V100, T4, and A100 GPUs.

Algorithm 2.1: Reference implementation of the HPL-AI benchmark.
Constants: binary64 unit roundoff u = 2^{-53}; threshold µ = 16.
1 Generate the n × n matrix A and the right-hand side b in binary64.
2 Convert A to binary32 and compute the LU factorization A ≈ LU without pivoting in binary32.
3 Solve LUx_0 = b by forward and back substitution in binary32.
4 Convert L, U, and x_0 to binary64.
5 Solve Ax = b in binary64 using preconditioned GMRES without restart with x_0 as starting guess and L and U as preconditioners.
6 If the relative backward error satisfies (2.1), return success; otherwise return failure.
The main computational steps in the reference implementation of the HPL-AI benchmark 6 are summarized in Algorithm 2.1. An n × n test matrix A and a right-hand side vector b are generated in binary64 arithmetic. The LU factorization without pivoting of A is computed in binary32 arithmetic and the linear system Ax = b is solved in binary32 arithmetic using forward and back substitution. This binary32 approximation to the solution is then converted to binary64 and refined using the generalized minimal residual method (GMRES) without restart. The solver is preconditioned with the binary32 LU factors of A (converted to binary64). The benchmark is declared successfully completed if the relative backward error satisfies

(2.1) ‖b − Ax̂‖∞ / ((‖A‖∞ ‖x̂‖∞ + ‖b‖∞) nu) ≤ µ,

where x̂ is the computed solution, µ is a threshold parameter, which for the reference implementation of HPL-AI is set to 16, and u = 2^{-53} is the unit roundoff for binary64. Although for portability reasons the reference implementation uses binary32 arithmetic as the lower-precision arithmetic, binary16 or bfloat16 [21] are expected to be used in practice. The flop rate of Algorithm 2.1 is computed as f_h/t_g, where the global time to solution t_g is the number of seconds taken to execute lines 2-5, and f_h = 2n^3/3 + 3n^2/2 is the number of low precision floating-point operations required to compute the LU factorization on line 2 and perform the two triangular solves on line 3.
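The flow of Algorithm 2.1 can be sketched in pure Python. Two simplifications are ours, not the reference implementation's: the factorization is carried out in binary64 and the factors are then stored in binary32 (simulated with round32), and plain residual correction stands in for GMRES; a generic row diagonally dominant matrix plays the role of the test matrix:

```python
import random
import struct

def round32(x):
    # Round a binary64 value to the nearest binary32 value (simulated storage).
    return struct.unpack('<f', struct.pack('<f', x))[0]

def lu_nopivot(A):
    # LU factorization without pivoting, in binary64.
    n = len(A)
    U = [row[:] for row in A]
    L = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for k in range(n):
        for i in range(k + 1, n):
            L[i][k] = U[i][k] / U[k][k]
            for j in range(k, n):
                U[i][j] -= L[i][k] * U[k][j]
    return L, U

def solve_lu(L, U, b):
    # Forward and back substitution with unit lower triangular L.
    n = len(b)
    y = b[:]
    for i in range(n):
        y[i] -= sum(L[i][j] * y[j] for j in range(i))
    x = y[:]
    for i in reversed(range(n)):
        x[i] = (x[i] - sum(U[i][j] * x[j] for j in range(i + 1, n))) / U[i][i]
    return x

random.seed(0)
n = 10
# A row diagonally dominant test matrix, so that no pivoting is needed.
A = [[random.uniform(0.0, 1.0) for _ in range(n)] for _ in range(n)]
for i in range(n):
    A[i][i] = 1.0 + sum(A[i][j] for j in range(n) if j != i)
b = [random.uniform(0.0, 1.0) for _ in range(n)]

L, U = lu_nopivot(A)
L32 = [[round32(v) for v in row] for row in L]   # factors stored in binary32
U32 = [[round32(v) for v in row] for row in U]
x = solve_lu(L32, U32, b)                        # low-precision initial guess

# Refinement in binary64, with the low-precision factors as "preconditioner".
for _ in range(5):
    r = [b[i] - sum(A[i][j] * x[j] for j in range(n)) for i in range(n)]
    d = solve_lu(L32, U32, r)
    x = [xi + di for xi, di in zip(x, d)]

num = max(abs(b[i] - sum(A[i][j] * x[j] for j in range(n))) for i in range(n))
den = (max(sum(abs(v) for v in row) for row in A) * max(abs(v) for v in x)
       + max(abs(v) for v in b))
backward_error = num / den
```

The refined backward error lands at binary64 level even though the factors carry only binary32 accuracy, which is exactly the effect the benchmark relies on.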
The matrix A must satisfy several requirements. First, it must be cheap to form, and LU factorization without pivoting must be numerically stable for it. Second, for the algorithm to work at all, the condition number κ∞(A) must not be too large. No bounds on κ∞(A) are known that guarantee that the reference implementation will work, but for the related GMRES-IR algorithm κ∞(A) must be substantially less than u^{-1} when it is used with two precisions [4], [17]. These requirements will hold if properties R1-R3 hold and a suitable value of κ∞(A) is chosen.
A further requirement is that GMRES does not converge quickly when applied to A, because if it does then it does not need the LU factors as preconditioners and the factorization on line 2 of Algorithm 2.1 can be omitted altogether.
In the November 2020 edition of the TOP500 rankings of the world's fastest supercomputers, the Fugaku computer obtained first place in the HPL-AI benchmark with a rate of 2.0 EFlop/s, 7 obtained for a matrix of size 16,957,440; it had previously achieved 1.4 EFlop/s in the June 2020 list [22]. Clearly, then, any matrix used for the benchmark must be capable of being generated for a dimension exceeding 10 7 .
2.1. Matrices proposed for the benchmark. Two classes of matrices have been proposed as test problems for the HPL-AI benchmark.
In order to satisfy requirements R1-R3 it is natural to consider diagonally dominant matrices. The matrix A ∈ R^{n×n} is diagonally dominant by rows if

(2.2) ω_i := |a_ii| − Σ_{j≠i} |a_ij| ≥ 0, i = 1, ..., n,

and diagonally dominant by columns if A^T is diagonally dominant by rows. The growth factor for LU factorization without pivoting on matrices diagonally dominant by rows or columns is bounded by 2 [16, Thm. 9.9], [35]. The reference implementation of version 0.1 of the HPL-AI benchmark uses a row diagonally dominant matrix A, given in (2.3), whose off-diagonal elements are drawn from unif(x, y), the uniform distribution over the open real interval (x, y), and whose diagonal elements are set so that each equals the sum of the moduli of the off-diagonal elements in its row. Gaussian elimination without pivoting will be numerically stable for this matrix.
As regards the size of κ∞(A), it is known [1], [16, Prob. 8.7], [34] that (2.2) implies

κ∞(A) ≤ ‖A‖∞ / min_i ω_i.

The matrix in (2.3) used by the reference implementation has ω_i ≡ 0, so this bound does not provide any information. In fact, numerical experiments show the matrix to be extremely well conditioned, with κ∞(A) ≈ 4 for large n.
Moler [26] suggests using the nonsymmetric matrix A(µ) whose (i, j) element is (i−µj+1/2) −1 , studies experimentally its conditioning, and investigates the numerical stability of its LU factorization without pivoting. Whether this matrix can satisfy requirements R2 and R3 is unclear.

3. A two-parameter family of matrices. Let us consider the parametric upper triangular matrix T(θ) ∈ R^{n×n} defined by

(3.1) T(θ)_ij = 1 if i = j, −θ if i < j, and 0 otherwise,

where θ ≥ 0 throughout this paper. This is a scalar multiple of a matrix proposed by Ostrowski [29]. The inverse of the matrix in (3.1) is known explicitly:

(3.2) (T(θ)^{-1})_ij = 1 if i = j, θ(1 + θ)^{j−i−1} if i < j, and 0 otherwise.

We define the matrix

(3.3) A(α, β) = T(α)^T T(β) =: LU.

For n = 4, for example,

(3.4) A(α, β) =
[ 1        −β          −β          −β       ]
[ −α       1 + αβ      αβ − β      αβ − β   ]
[ −α       αβ − α      1 + 2αβ     2αβ − β  ]
[ −α       αβ − α      2αβ − α     1 + 3αβ  ].

Since the subdiagonal elements of L are of modulus at most 1, LU factorization with partial pivoting does not require interchanges when applied to A = A(α, β) in exact arithmetic. Note that A is not, however, diagonally dominant in general. For α = β it is symmetric positive definite, but we are mainly interested in the nonsymmetric case.
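A minimal numerical check of this construction (the helper names are ours): it forms A(α, β) = T(α)^T T(β) for n = 4, runs LU factorization without pivoting, and confirms that the multipliers all equal −α, that the upper triangular factor is T(β), and that no element growth occurs:

```python
alpha, beta, n = 0.3, 0.5, 4

def T(theta, n):
    # Upper triangular factor (3.1): ones on the diagonal, -theta above it.
    return [[1.0 if i == j else (-theta if j > i else 0.0) for j in range(n)]
            for i in range(n)]

def transpose(X):
    return [list(r) for r in zip(*X)]

def matmul(X, Y):
    n = len(X)
    return [[sum(X[i][k] * Y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

A = matmul(transpose(T(alpha, n)), T(beta, n))   # A(alpha, beta) = T(alpha)^T T(beta)

def lu_nopivot_growth(M):
    # Gaussian elimination without pivoting, tracking the largest element of
    # every trailing Schur complement (the quantity entering the growth factor).
    n = len(M)
    U = [row[:] for row in M]
    L = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    gmax = max(abs(v) for row in M for v in row)
    for k in range(n - 1):
        for i in range(k + 1, n):
            L[i][k] = U[i][k] / U[k][k]
            for j in range(k, n):
                U[i][j] -= L[i][k] * U[k][j]
        gmax = max([gmax] + [abs(U[i][j])
                             for i in range(k + 1, n) for j in range(k + 1, n)])
    return L, U, gmax

L, U, gmax = lu_nopivot_growth(A)
growth = gmax / max(abs(v) for row in A for v in row)
```

Since every Schur complement is itself a matrix of the same family, growth evaluates to 1 up to roundoff.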
Generating the matrix A by means of the matrix multiplication in (3.3) would require 2n^3/3 flops, which is too costly for an extreme-scale setting. However, as (3.4) suggests should be possible, we can give an explicit expression for the elements of A by exploiting the structure of the two factors in (3.3). As L and U are lower and upper triangular, respectively, we have that

a_ij = Σ_{k=1}^{min(i,j)} l_ik u_kj.

For the elements along the diagonal of A, this gives a_ii = 1 + (i − 1)αβ, for those above the diagonal we have a_ij = (i − 1)αβ − β, and for those in the strictly lower triangular part we obtain a_ij = (j − 1)αβ − α. In summary,

(3.5) a_ij = 1 + (i − 1)αβ if i = j; (i − 1)αβ − β if i < j; (j − 1)αβ − α if i > j.

For later use, we note several relations. Recalling that ‖A‖∞ = max_i Σ_j |a_ij| and that the off-diagonal elements of L and U are −α and −β, respectively, we have

(3.6) ‖L‖∞ = 1 + (n − 1)α,
(3.7) ‖U‖∞ = 1 + (n − 1)β,
(3.8) ‖|L||U|‖∞ ≤ ‖L‖∞ ‖U‖∞ = (1 + (n − 1)α)(1 + (n − 1)β).

We make two observations. Recall that α, β ≥ 0. First, we note that from (3.3) and the structure of T, we have

(3.9) A^{-1} = T(β)^{-1} T(α)^{-T}.

From (3.2), the row of T(θ)^{-1} with the largest sum is the first, so

‖T(θ)^{-1}‖∞ = 1 + Σ_{j=2}^{n} θ(1 + θ)^{j−2} = (1 + θ)^{n−1}.

Next, we note that the matrix A^{-1} in (3.9) has nonnegative elements and the factor T(α)^{-T} has unit diagonal, so the row sums of A^{-1} are no smaller than those of T(β)^{-1}, and since ‖A‖∞ ≥ |a_11| = 1 it follows that

(3.10) κ∞(A) ≥ (1 + β)^{n−1}.

It is clear, then, that for large n we will need β ≪ 1 in order for κ∞(A) to be reasonably bounded. In section 5 we derive an exact expression for κ∞(A).
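The closed-form entries can be checked against the explicit product, and the first-row sum of T(θ)^{-1} against (1 + θ)^{n−1}; a sketch (function names are ours):

```python
def a_entry(i, j, alpha, beta):
    # Closed-form entries of A(alpha, beta) (1-based indices): O(1) flops each.
    if i == j:
        return 1.0 + (i - 1) * alpha * beta
    if i < j:
        return (i - 1) * alpha * beta - beta
    return (j - 1) * alpha * beta - alpha

def build_A(n, alpha, beta):
    # Whole matrix in O(n^2) flops, no matrix product needed.
    return [[a_entry(i, j, alpha, beta) for j in range(1, n + 1)]
            for i in range(1, n + 1)]

def T(theta, n):
    return [[1.0 if i == j else (-theta if j > i else 0.0) for j in range(n)]
            for i in range(n)]

n, alpha, beta = 7, 0.2, 0.4
A1 = build_A(n, alpha, beta)
Ta, Tb = T(alpha, n), T(beta, n)
Tt = [list(r) for r in zip(*Ta)]                 # T(alpha)^T
A2 = [[sum(Tt[i][k] * Tb[k][j] for k in range(n)) for j in range(n)]
      for i in range(n)]
err = max(abs(A1[i][j] - A2[i][j]) for i in range(n) for j in range(n))

# Row sums of T(theta)^{-1} are maximal in the first row, where they sum to
# (1 + theta)^(n-1); check this with the explicit inverse entries.
theta = beta
row1_sum = 1.0 + sum(theta * (1 + theta) ** (j - 2) for j in range(2, n + 1))
```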
3.1. The effect of rounding errors. Now we consider the effects of rounding errors in forming and factorizing A(α, β) in floating-point arithmetic.
First, we consider the errors in computing A from (3.5). Using the standard model of floating-point arithmetic [16, sect. 2.2] we find that the computed Â satisfies

(3.11) ‖Â − A‖∞ ≤ 2nγ_3 ‖A‖∞,

where the factor n is pessimistic. In fact, we also have |Â − A| ≤ cγ_3 |A| as long as α + (j − 1)αβ and β + (i − 1)αβ do not exceed |(j − 1)αβ − α| and |(i − 1)αβ − β|, respectively, by more than a factor c, for all i ≠ j. This latter condition holds with a modest constant c as long as neither α nor β is too close to a member of the set { 1/k : k = 1, 2, ..., n − 1 }. We conclude that A is formed as accurately as we could expect.
We also need to show that LU factorization without pivoting is numerically stable for A. Hence we need to bound the elements of all the Schur complements that arise during the factorization. From the structure of T_n(α, β), where the subscript denotes the dimension, we have

A_n(α, β) = [ 1        −β e^T
              −α e     A_{n−1}(α, β) + αβ ee^T ],

where e is the vector of ones. Thus after one step of elimination we have the matrix

A_{n−1}(α, β) + αβ ee^T − (−α e)(−β e^T) = A_{n−1}(α, β).

By induction it follows that all the elements of all the Schur complements are elements of A(α, β), and hence the growth factor is

(3.12) ρ_n = max_{i,j,k} |a_ij^{(k)}| / max_{i,j} |a_ij| = 1,

which is ideal. The strongest overall backward error bound can be obtained by applying the backward error analysis for LU factorization [16, Thm. 9.3], which tells us that the computed LU factors L̂ and Û satisfy

(3.13) L̂Û = A + ∆A, |∆A| ≤ γ_n |L̂||Û|.

In section 5 we will take α ≤ β, so we make this assumption now. Then, using (3.6) and (3.8), and noting that ‖A‖∞ ≥ 1 and that the parameter ranges of interest have (n − 1)α ≤ (n − 1)β ≤ 2 (see section 6), we obtain

‖|L̂||Û|‖∞ ≤ (1 + (n − 1)α)(1 + (n − 1)β) ≤ 9 ≤ 9‖A‖∞.

Hence

(3.14) ‖∆A‖∞ ≤ 9γ_n ‖A‖∞ ≈ 9nu ‖A‖∞.

We conclude that the factorization is numerically stable. The bound (3.14) is a worst-case bound and we note that in practice a more realistic bound is obtained by replacing the constant 9n by its square root, and even this is likely to be pessimistic [18]. This point is important because we are interested in extremely large n for the HPL-AI benchmark.

4. The inverse of A(α, β). We now derive explicit formulae for the elements of B = A^{-1} = T(β)^{-1} T(α)^{-T}, whose factors are given by (3.2). As above, this gives three distinct formulae, one for the elements along the diagonal, one for those in the upper triangular part, and one for those below the diagonal:

(4.1) b_ii = 1 + αβ Σ_{k=i+1}^{n} r^{k−i−1},

(4.2) b_ij = β(1 + β)^{j−i−1} + αβ(1 + β)^{j−i} Σ_{k=j+1}^{n} r^{k−j−1}, i < j,

(4.3) b_ij = α(1 + α)^{i−j−1} + αβ(1 + α)^{i−j} Σ_{k=i+1}^{n} r^{k−i−1}, i > j.

By noting that (4.1), (4.2), and (4.3) contain truncated geometric series of ratio r := (1 + α)(1 + β), we obtain the more computationally convenient formula

(4.4) b_ij = 1 + αβ (r^{n−i} − 1)/(r − 1) if i = j;
             β(1 + β)^{j−i−1} + αβ(1 + β)^{j−i} (r^{n−j} − 1)/(r − 1) if i < j;
             α(1 + α)^{i−j−1} + αβ(1 + α)^{i−j} (r^{n−i} − 1)/(r − 1) if i > j.

5. Generating a matrix with a prescribed condition number. We now discuss how the parameters α and β can be chosen so that A := A(α, β) has the desired condition number κ. This task can be achieved efficiently by combining an inexpensive technique for evaluating κ∞(A) = ‖A‖∞ ‖A^{-1}‖∞ with a rootfinding algorithm for solving the scalar equation κ∞(A(α, β)) = κ.
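The explicit inverse described in section 4 is easy to validate numerically. The sketch below (our reconstruction of the closed form, with helper names of our choosing) multiplies A by the candidate inverse and checks the product against the identity:

```python
def a_entry(i, j, alpha, beta):
    if i == j:
        return 1.0 + (i - 1) * alpha * beta
    return (i - 1) * alpha * beta - beta if i < j else (j - 1) * alpha * beta - alpha

def b_entry(i, j, n, alpha, beta):
    # Candidate closed form for the entries of A(alpha, beta)^{-1};
    # g(m) = 1 + r + ... + r^{m-1} sums the truncated geometric series.
    r = (1 + alpha) * (1 + beta)
    g = lambda m: (r ** m - 1.0) / (r - 1.0)
    if i == j:
        return 1.0 + alpha * beta * g(n - i)
    if i < j:
        return (beta * (1 + beta) ** (j - i - 1)
                + alpha * beta * (1 + beta) ** (j - i) * g(n - j))
    return (alpha * (1 + alpha) ** (i - j - 1)
            + alpha * beta * (1 + alpha) ** (i - j) * g(n - i))

n, alpha, beta = 8, 0.2, 0.4
A = [[a_entry(i, j, alpha, beta) for j in range(1, n + 1)] for i in range(1, n + 1)]
B = [[b_entry(i, j, n, alpha, beta) for j in range(1, n + 1)] for i in range(1, n + 1)]

# If B really is the inverse, A @ B must be the identity to working precision.
resid = max(abs(sum(A[i][k] * B[k][j] for k in range(n)) - (1.0 if i == j else 0.0))
            for i in range(n) for j in range(n))
```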
In section 5.1 we derive explicit formulae for ‖A‖∞ and ‖A^{-1}‖∞ that can be evaluated in a constant number of operations, and in section 5.2 we discuss how these can be exploited to find suitable values for α and β.

5.1. Computing the ∞-norms of A and A^{-1}. We now explain how to compute the ∞-norm of A for 0 < α ≤ β.
Let λ_i denote the sum of the moduli of the elements in the ith row of A, so that ‖A‖∞ = max_i λ_i. From (3.5),

λ_i = Σ_{j=1}^{i−1} α|1 − (j − 1)β| + 1 + (i − 1)αβ + (n − i)β|1 − (i − 1)α|.

In order to find the largest value of λ_i, we will look at the quantity Λ_i := λ_{i+1} − λ_i, which is the difference between the sums of the absolute values of two consecutive rows. By subtracting the formula for λ_i from that for λ_{i+1}, we obtain

(5.3) Λ_i = αβ + α|1 − (i − 1)β| + (n − i − 1)β|1 − iα| − (n − i)β|1 − (i − 1)α|.

The value of Λ_i depends on the signs of the arguments of the three absolute value operators in (5.3). In general this would give rise to eight possible combinations, but if 1 − (i − 1)α ≤ 0 then 1 − iα < 0, and α ≤ β implies that also 1 − (i − 1)β ≤ 0. Therefore, only the five cases in Figure 5.1a are possible, as we now show. The automaton in Figure 5.1b shows the possible transitions, as i varies, from one state to the other. We remark that a and d are the only possible initial states (i = 1), since α ≤ 1 implies that 1 − α is nonnegative. Our strategy is to keep track of the quantity

η_i := max(λ_1, ..., λ_i)

as we travel between states, and then use the fact that ‖A‖∞ = η_n. As long as we are in state a, all three arguments in (5.3) are nonnegative and we have that

Λ_i = α − β + (2 + i − n)αβ.

If i < n − 1 then 2 + i − n ≤ 0 and Λ_i ≤ 0, and thus λ_{i+1} ≤ λ_i ≤ · · · ≤ λ_1 = η_i. The quantity Λ_{n−1} may be nonnegative, thus if a is the final state we have that η_n = max(λ_1, λ_n). Now we discuss the states that can be reached from a. If we are in state b, it is easy to see that η_i = max(λ_1, λ_{i+1}): if Λ_i ≤ 0, then one comes either from a, for which Λ_{i−1} ≤ 0, or from b itself, in which case Λ_{i−1} < Λ_i ≤ 0, and thus η_i = λ_1 = max(λ_1, λ_{i+1}); if, on the other hand, Λ_i > 0, then η_i = max(λ_1, λ_{i+1}).
For c, we have two distinct cases depending on the state we are coming from. When coming from a, we have that η_i = max(λ_1, λ_{i+1}). When coming from b, we have that η_i = max(λ_1, λ_{i*}, λ_{i+1}), where i* = min(⌈1/α⌉, n) is the largest row index for which one is in state b. For state d one can argue as for state c that η_i = max(λ_1, λ_{i+1}).
Finally, for state e we have that

Λ_i = (n − i)αβ + β − α > 0,

since n − i > 0 for i between 1 and n − 1 and α ≤ β. Therefore, if one is in state e for row i, then one remains in state e for all subsequent rows, the row sums increase, and η_n = max(η_i, λ_n).
Combining the latter observation with the value of η_i from the states from which e can be reached, we can conclude that

(5.4) ‖A‖∞ = max(λ_1, λ_{i*}, λ_n), i* = min(⌈1/α⌉, n).
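Formula (5.4) can be spot-checked by brute force, comparing the largest row sum of A with the maximum over rows 1, i*, and n (the parameter triples below are illustrative choices of ours):

```python
import math

def a_entry(i, j, alpha, beta):
    if i == j:
        return 1.0 + (i - 1) * alpha * beta
    return (i - 1) * alpha * beta - beta if i < j else (j - 1) * alpha * beta - alpha

cases = [(10, 0.3, 0.5), (50, 0.05, 0.1), (8, 0.9, 0.95)]   # always alpha <= beta <= 1
checks = []
for n, alpha, beta in cases:
    # Brute-force row sums lambda_i, i = 1..n.
    lam = [sum(abs(a_entry(i, j, alpha, beta)) for j in range(1, n + 1))
           for i in range(1, n + 1)]
    istar = min(math.ceil(1.0 / alpha), n)
    # (5.4): the largest row sum is attained at row 1, row i*, or row n.
    checks.append(abs(max(lam) - max(lam[0], lam[istar - 1], lam[-1])) < 1e-12)
```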
In order to compute ‖A^{-1}‖∞, first we derive an expression for the sum of the absolute values of the elements in the ith row of the matrix B = A^{-1} in (4.4), which we denote by δ_i. A tedious computation given in Appendix A yields

(5.5) δ_i = (1 + α)^{i−1} + β(1 + α)^i (r^{n−i} − 1)/(r − 1),

where r = (1 + α)(1 + β). The ∞-norm of A^{-1} is the largest value of δ_i for i between 1 and n. We now explain how to compute it efficiently. We will argue that, for ∆_i = δ_{i+1} − δ_i with i between 1 and n − 1, one has that

∆_i = (1 + α)^{i−1} (α^2 − (1 + α)^2 β^2 r^{n−i−1}) / (r − 1) = (1 + α)^{i−1} ζ_i / (r^{i+1}(r − 1)).

Therefore, the quantities ∆_i and ζ_i := α^2 r^{i+1} − (1 + α)^2 β^2 r^n have the same sign, since the other factors are necessarily positive as long as α and β are. Similarly, the signs of ∆_{i−1} and ∆_{i+1} depend on those of

ζ_{i−1} = α^2 r^i − (1 + α)^2 β^2 r^n and ζ_{i+1} = α^2 r^{i+2} − (1 + α)^2 β^2 r^n,

respectively. Since r > 1, ∆_i ≤ 0 implies ∆_{i−1} < 0 and ∆_i ≥ 0 implies ∆_{i+1} > 0. The result follows by induction, and we can conclude that ‖A^{-1}‖∞ = max(δ_1, δ_n). We summarize our findings in a theorem.

Theorem 5.1. Let A = A(α, β) in (3.3) with 0 < α ≤ β and α ≤ 1. Then

(5.9) κ∞(A) = max(λ_1, λ_{i*}, λ_n) max(δ_1, δ_n), i* = min(⌈1/α⌉, n),

where λ_i is the sum of the moduli of the elements of the ith row of A and δ_i is given by (5.5).
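A brute-force check of a closed form for the row sums δ_i of A^{-1} and of ‖A^{-1}‖∞ = max(δ_1, δ_n) (the δ_i formula in the code is our reconstruction; Gauss-Jordan inversion is used only for verification):

```python
def a_entry(i, j, alpha, beta):
    if i == j:
        return 1.0 + (i - 1) * alpha * beta
    return (i - 1) * alpha * beta - beta if i < j else (j - 1) * alpha * beta - alpha

def invert(M):
    # Gauss-Jordan inversion with partial pivoting; brute-force check only.
    n = len(M)
    aug = [row[:] + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(M)]
    for k in range(n):
        p = max(range(k, n), key=lambda i: abs(aug[i][k]))
        aug[k], aug[p] = aug[p], aug[k]
        piv = aug[k][k]
        aug[k] = [v / piv for v in aug[k]]
        for i in range(n):
            if i != k and aug[i][k] != 0.0:
                f = aug[i][k]
                aug[i] = [vi - f * vk for vi, vk in zip(aug[i], aug[k])]
    return [row[n:] for row in aug]

n, alpha, beta = 9, 0.2, 0.4
A = [[a_entry(i, j, alpha, beta) for j in range(1, n + 1)] for i in range(1, n + 1)]
B = invert(A)
row_sums = [sum(abs(v) for v in row) for row in B]

# Closed form for delta_i (our reconstruction), with r = (1+alpha)(1+beta).
r = (1 + alpha) * (1 + beta)
delta = [(1 + alpha) ** (i - 1)
         + beta * (1 + alpha) ** i * (r ** (n - i) - 1.0) / (r - 1.0)
         for i in range(1, n + 1)]

err = max(abs(a - b) for a, b in zip(row_sums, delta))
norm_inv = max(row_sums)
```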
Computing the condition number of A(α, β) therefore requires only a constant number of flops. What effect do rounding errors have on the condition number of the computed matrix? We know from (3.11) that the computed Â has a small normwise relative error. We also know that the condition number of the condition number is the condition number [6], [14]. It follows that

κ∞(Â) = κ∞(A)(1 + O(nu κ∞(A))).

Therefore if we choose the parameters α and β to achieve a condition number κ∞(A) ≪ u^{-1} then κ∞(Â) will be of the same order of magnitude as κ∞(A), which is entirely satisfactory.

5.2. Choosing the parameters.
In order to generate a test matrix with a given condition number κ, it is necessary to determine values of α and β such that κ∞(A(α, β)) = κ. From (5.9), we obtain that this is equivalent to finding a zero of the bivariate function

(5.10) f(α, β; n, κ) := κ∞(A(α, β)) − κ

for α ∈ (0, 1] and β ≥ α. The solution for a given κ is not unique.
The function f is continuous but not differentiable, and roots can be found by converting f(α, β; n, κ) = 0 to a one-parameter equation by defining one of α and β as a fixed multiple of the other. For instance, we can use a one-dimensional zero finder on the function

(5.11) f_ρ(β; n, κ) = f(ρβ, β; n, κ), ρ ∈ (0, 1].

Note that it is necessary to ensure that for a solution β* we have α = ρβ* < 1; if the latter inequality is not satisfied then a different value of ρ will have to be used.

6. Experimental evaluation. Now we discuss the practicalities of constructing A(α, β) and confirm the numerical stability of LU factorization without pivoting. Our codes are written in C and rely on the Intel Math Kernel Library (version 18.0.3) implementation of BLACS, PBLAS, and ScaLAPACK. For parallelization we use the Open MPI library (version 3.1.4), a widely used implementation of the MPI standard (version 3.1) [27]. The codes we developed can be retrieved on GitHub, 8 and a MATLAB function for constructing the matrix is available. 9 The experiments were run on the High Performance Computing Pool (HPC Pool), a component of the third generation of the Computational Shared Facility (CSF3) of the University of Manchester. Each compute node in the HPC Pool is equipped with two 16-core Intel Xeon Gold 6130 CPUs and 187 GiB of RAM, and runs CentOS GNU/Linux release 7.4.1708 (Core).
The zeros of f_ρ in (5.11) were approximated numerically in double precision on a single node using the GNU Scientific Library (version 1.15) [13] implementation of the Brent-Dekker solver [2], [3], a fast and robust zero-finding method that combines an interpolation strategy with the bisection algorithm. This bracketing technique requires two initial values x_lo^(0) and x_hi^(0) such that f_ρ(x_lo^(0); n, κ) < 0 and f_ρ(x_hi^(0); n, κ) > 0. In order to choose x_lo^(0), we note that as β tends to 0 the matrix A tends to the identity matrix and κ∞(A) tends to 1, thus we set x_lo^(0) to the unit roundoff u. As for x_hi^(0), from (3.10) we know that if β = ρ^{-1} ≥ 1 (recalling (5.11)), which ensures that α = ρβ = 1, then κ∞(A) is bounded below by a quantity that for n = 100 is already approximately 3 × 10^31, which guarantees that if x_hi^(0) = ρ^{-1} then f_ρ(x_hi^(0); n, κ) is positive for all values of n and κ of interest. In fact, this value typically overflows if f_ρ is evaluated in binary64 arithmetic. Therefore, in our code we initially set x_hi^(0) = ρ^{-1} and keep dividing it by 2 until we find a value such that f_ρ(x_hi^(0); n, κ) is positive and representable in binary64 format. At each step, the Brent-Dekker iteration returns an approximate solution x̂^(i) such that f_ρ(x̂^(i); n, κ) ≈ 0 and updates the bracketing interval, producing x_lo^(i) and x_hi^(i) such that x̂^(i) ∈ [x_lo^(i), x_hi^(i)] ⊂ [x_lo^(i−1), x_hi^(i−1)]. In our code, we keep iterating until the width of the bracketing interval is of the order of u relative to x̂^(i), where u is the unit roundoff of binary64 precision.
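The zero-finding procedure can be sketched with plain bisection in place of Brent-Dekker (a simplification of ours), with ‖A‖∞ evaluated from the explicit entries and ‖A^{-1}‖∞ from the closed-form row sums δ_1 and δ_n (our reconstruction):

```python
def a_entry(i, j, alpha, beta):
    if i == j:
        return 1.0 + (i - 1) * alpha * beta
    return (i - 1) * alpha * beta - beta if i < j else (j - 1) * alpha * beta - alpha

def kappa_inf(n, alpha, beta):
    # ||A||_inf by brute-force row sums (O(n^2)); ||A^{-1}||_inf = max(delta_1, delta_n).
    norm_a = max(sum(abs(a_entry(i, j, alpha, beta)) for j in range(1, n + 1))
                 for i in range(1, n + 1))
    r = (1 + alpha) * (1 + beta)
    d1 = 1.0 + beta * (1 + alpha) * (r ** (n - 1) - 1.0) / (r - 1.0)
    dn = (1 + alpha) ** (n - 1)
    return norm_a * max(d1, dn)

def find_beta(n, kappa_target, rho=0.5, iters=120):
    # Plain bisection on f_rho(beta) = kappa_inf(A(rho*beta, beta)) - kappa_target,
    # bracketed between a tiny beta (kappa -> 1) and beta = 1/rho (huge kappa).
    lo, hi = 1e-300, 1.0 / rho
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if kappa_inf(n, rho * mid, mid) < kappa_target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

n, target = 100, 1.0e6
beta = find_beta(n, target)
alpha = 0.5 * beta
achieved = kappa_inf(n, alpha, beta)
```

Bisection needs many more function evaluations than Brent-Dekker, but each evaluation is cheap, so this is adequate for a demonstration at modest n.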

6.1. Values of α and β.
To give a feeling for the typical values of α and β, we compute them for a range of matrix sizes and target condition numbers. Table 6.1 shows the values of β when α = ρβ for ρ = 1/2. There is little variation of β with κ, but a large variation with n on this very large range. The corresponding tables for ρ = 1/10 (see Table 6.3) and ρ = 3/4 have elements of the same orders of magnitude. Table 6.2 reports the absolute value of the smallest element of A(β/2, β). The absolute value of the largest element is between 1 and 1.32 for all the matrices we considered. We will discuss the implications of the wide dynamic range of elements for low precision in section 7.1.

6.2. Pivots and growth factor. The second and fourth columns of Table 6.3 report, for values of n between 1,000 and 200,000 and ρ = 1/10, values of β yielding κ = 10^3 and κ = 10^6, respectively. We used these values to generate the matrix A(ρβ, β) in (3.5). We computed the LU factorization A ≈ L̂Û in binary32 arithmetic with the ScaLAPACK function psgetrf, which uses partial pivoting, and confirmed that no row interchanges were performed during the reduction to row echelon form.

Table 6.2. Values of min_{i,j} |A(β/2, β)_{ij}| for the values of β in Table 6.1.
In order to gauge the stability of Gaussian elimination for these test matrices, we looked at the relative error in the computed multipliers (the elements of L̂), which in exact arithmetic are all equal to −α, that is,

ϑ = max_{i>j} |l̂_ij + α| / α.

This quantity is reported in the third and fifth columns of Table 6.3. The small values of ϑ confirm that roundoff errors do not accumulate in a harmful way for the matrices our method generates. In particular, the fact that ϑ does not appear to grow with n suggests that the accuracy of the multipliers does not depend on the order of the matrix being generated, but only on the required ∞-norm condition number.

7. Adaptations for the HPL-AI Mixed-Precision Benchmark.
We now examine the suitability of A(α, β) as a test matrix for the HPL-AI benchmark.
7.1. Scaling for low precision. The reference implementation of the HPL-AI benchmark factorizes A in binary32, but in practice binary16 or bfloat16 might be used instead. We therefore consider whether A(α, β) can be suitably represented in these precisions. We know that max_{i,j} |a_ij| will be of order 1, so overflow will not occur during the factorization, since the growth factor ρ_n = 1 (see (3.12)).
All the elements in Table 6.2 exceed the smallest normal number 1.18 × 10^{-38} for binary32, and likewise for bfloat16, which has essentially the same range as binary32. In binary16, however, the smallest positive subnormal and normal numbers are approximately 6.0 × 10^{-8} and 6.1 × 10^{-5}, respectively, thus underflow or the occurrence of subnormal numbers is inevitable for larger values of n and κ when the matrix is converted to this low-precision floating-point format. This is not a problem as regards numerical stability, but subnormal numbers can incur a performance penalty, especially if handled in software [23], [28, sect. 8.1.2], [31]. The number of entries underflowing or becoming subnormal can be reduced by working with the matrix ψA(α, β) for ψ > 1, as suggested in [19]. One can safely take ψ = f_max/2, say, where f_max = 65,504 is the largest finite binary16 floating-point number.
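The effect of the scaling can be demonstrated with Python's struct module, which supports IEEE 754 binary16 round-tripping via the 'e' format (the parameter values below are illustrative choices of ours, not taken from Table 6.1):

```python
import struct

def to_binary16(x):
    # Round-trip through IEEE 754 binary16 storage.
    return struct.unpack('<e', struct.pack('<e', x))[0]

FP16_MIN_NORMAL = 2.0 ** -14     # about 6.10e-5
FP16_MAX = 65504.0

def a_entry(i, j, alpha, beta):
    if i == j:
        return 1.0 + (i - 1) * alpha * beta
    return (i - 1) * alpha * beta - beta if i < j else (j - 1) * alpha * beta - alpha

# Illustrative parameters with alpha <= beta, small enough that some entries
# fall below the binary16 normal range.
n, alpha, beta = 500, 1.0e-5, 1.0e-3
entries = [abs(a_entry(i, j, alpha, beta))
           for i in range(1, n + 1) for j in range(1, n + 1)]
amin, amax = min(entries), max(entries)

psi = FP16_MAX / 2.0             # the scaling psi = f_max/2 suggested in the text
scaled_min, scaled_max = psi * amin, psi * amax
```

Without scaling, the smallest entries would be stored as binary16 subnormals; after multiplication by ψ every entry is a normal binary16 number while the largest entry stays well below f_max.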
A key requirement is that throughout the factorization all row operations must be necessary at each stage, so that no amount of computation can be skipped. In other words, we need to ensure that at each stage of elimination all the elements below the diagonal are nonzero. By construction, the elements below the diagonal in the jth column of A(α, β) all equal (j − 1)αβ − α, and these have magnitude of order α for the parameters of interest, so it suffices to check that ψα does not underflow. If we consider n = 10^8 and κ = 10^4, for example, then using the β value from Table 6.1 with ψ = f_max/2 gives ψα ≈ 8.5 × 10^{-4}, which is safely above the smallest normal binary16 number.
Our conclusion is that multiplying A(α, β) by an appropriate scale factor makes the matrix suitable for LU factorization in binary16 in the HPL-AI benchmark.

7.2. Forcing computation of the LU factorization.
A possible drawback is that the LU factorization of A(α, β) is explicitly known. An unscrupulous benchmarker could form the exact LU factors from (3.3) at trivial cost without carrying out an LU factorization. An obvious way to avoid this problem is to add a small random perturbation to A in order to change the LU factors. Since the multipliers are continuous functions of the matrix entries and all have modulus α < 1, a sufficiently small perturbation of A(α, β) will retain multipliers of modulus less than 1 and a growth factor of order 1.
Consider a rank-1 perturbation ∆A = ε e_i e_j^T, where e_i is the ith unit vector and ε ∈ R. For small enough ε, the LU factorization A + ∆A = (L + ∆L)(U + ∆U) exists. Since the perturbation matrices ∆L and ∆U are strictly lower triangular and upper triangular, respectively, it is easy to show that, to first order,

∆L = L tril(L^{-1} ∆A U^{-1}), ∆U = triu(L^{-1} ∆A U^{-1}) U,

where tril denotes the strictly lower triangular part and triu the upper triangular part [32], [33]. Hence we certainly have

|∆L| ≤ |ε| |L| tril(|L^{-1}| e_i e_j^T |U^{-1}|).

It is easy to show using (3.2) that the elements in the ith column of |L||L^{-1}| are, from the ith position onwards, 1, 2α, 2α(1 + α), ..., 2α(1 + α)^{n−i−1}, which implies that the perturbation to each multiplier is bounded by 2|ε| α(1 + α)^{n−i−1} max_k |(U^{-1})_{jk}|. In order for the multipliers to remain bounded by 1, we wish this bound to be less than 1 − α, which can be achieved by choosing ε such that

(7.1) 2|ε| α(1 + α)^{n−i−1} max_k |(U^{-1})_{jk}| < 1 − α.

In Table 6.1 the values of |ε| satisfying this inequality as an equality range from 1.22 × 10^{-6} for n = 10^2, κ∞(A) = 10^{10} to 1.51 × 10^{27} for n = 10^{10}, κ∞(A) = 10^2. In practice, a full rank perturbation is preferable. We suggest using the matrix

(7.2) A(α, β, ξ) := A(α, β) + ξC,

where C is a full rank perturbation matrix with elements of modulus at most 1 and ξ is the smaller of cu^{1/2} and the largest positive ε satisfying (7.1), where c is some positive constant bounded by 1.
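A quick numerical check that a small rank-1 perturbation leaves the multipliers bounded by 1 (the perturbation site and the value of ε below are arbitrary choices of ours):

```python
def a_entry(i, j, alpha, beta):
    if i == j:
        return 1.0 + (i - 1) * alpha * beta
    return (i - 1) * alpha * beta - beta if i < j else (j - 1) * alpha * beta - alpha

def lu_nopivot(M):
    n = len(M)
    U = [row[:] for row in M]
    L = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for k in range(n):
        for i in range(k + 1, n):
            L[i][k] = U[i][k] / U[k][k]
            for j in range(k, n):
                U[i][j] -= L[i][k] * U[k][j]
    return L, U

n, alpha, beta = 8, 0.3, 0.5
A = [[a_entry(i, j, alpha, beta) for j in range(1, n + 1)] for i in range(1, n + 1)]
eps = 1.0e-3
A[2][4] += eps          # rank-one perturbation eps * e_3 e_5^T (0-based indices 2, 4)

L, U = lu_nopivot(A)
# Multipliers should move away from -alpha only by O(eps) and stay below 1.
max_mult = max(abs(L[i][k]) for i in range(n) for k in range(i))
recon_err = max(abs(sum(L[i][k] * U[k][j] for k in range(n)) - A[i][j])
                for i in range(n) for j in range(n))
```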

7.3. Forcing preconditioning.
Is it necessary to precondition GMRES when we solve A(α, β)x = b? For the matrix (7.2), our experiments indicate that unpreconditioned GMRES converges in a few tens of iterations and the number of iterations does not vary much with n. In order to make preconditioning necessary, we can carry out a two-sided diagonal scaling

(7.3) Ã(α, β, ξ) = D_1 A(α, β, ξ) D_2,

where D_1 and D_2 are diagonal matrices with positive diagonal elements. As long as the diagonal elements of D_1 are nonincreasing, the unit lower triangular factor of Ã(α, β, ξ) has subdiagonal elements bounded by 1, and LU factorization without pivoting remains numerically stable for Ã(α, β, ξ). We have

κ∞(Ã(α, β, ξ)) / (κ∞(D_1) κ∞(D_2)) ≤ κ∞(A(α, β, ξ)) ≤ κ∞(D_1) κ∞(D_2) κ∞(Ã(α, β, ξ)),

so we lose the ability to precisely determine the condition number.
Our experiments indicate that a mild diagonal scaling is enough to slow down the convergence of unpreconditioned GMRES. We describe an experiment in which we chose β and α = β/4 so that κ∞(A(α, β, ξ)) = 10^6 and used the matrix Ã in (7.3) with ξ chosen as above, with D_1 and D_2 having logarithmically equally spaced entries between 1 and 10^{-3} for D_1 and between 1 and 10^{-2} for D_2. The ∞-norm condition number of the scaled matrix Ã was of order 10^6 in every case. Table 7.1 gives the number of restarted GMRES iterations needed to solve Ãx = b in double precision arithmetic, where b has random elements from the uniform distribution on (0, 1), with backward error satisfying (2.1) with µ = 16. Here we are using the MATLAB gmres function. Several values of n and the restart parameter m are used. We can see that a significant number of iterations is required, which means that preconditioning is necessary to solve a linear system with coefficient matrix Ã.
8. Conclusions. Constructing in O(n^2) operations an n × n matrix having a specified ∞-norm condition number and such that LU factorization without pivoting is numerically stable is not a trivial task, especially when the matrix must be suitable for extreme-scale dimensions n > 10^7 and for use with low-precision floating-point arithmetic. We have shown that the two-parameter family of matrices A(α, β) in (3.3) satisfies these requirements, with α and β determined by solving a scalar nonlinear equation. These matrices, with the modifications described in section 7 if necessary, are therefore suitable for use in the HPL-AI Mixed-Precision Benchmark. They are also useful as general-purpose dense nonsymmetric test matrices.