Generating extreme-scale matrices with speciﬁed singular values or condition numbers

. A widely used form of test matrix is the randsvd matrix constructed as the product A = UΣV ∗ , where U and V are random orthogonal or unitary matrices from the Haar distribution and Σ is a diagonal matrix of singular values. Such matrices are random but have a speciﬁed singular value distribution. The cost of forming an m × n randsvd matrix is m 3 + n 3 ﬂops, which is prohibitively expensive at extreme scale; moreover, the randsvd construction requires a signiﬁcant amount of communication, making it unsuitable for distributed memory environments. By dropping the requirement that U and V be Haar distributed and that both be random, we derive new algorithms for forming A that have cost linear in the number of matrix elements and require a low amount of communication and synchronization. We specialize these algorithms to generating matrices with speciﬁed 2-norm condition number. Numerical experiments show that the algorithms have excellent eﬃciency and scalability.

. Introduction. The TOP500 list 1 ranks the world's fastest computers using the High-Performance Linpack Benchmark (HPL) [24], which measures the time taken to solve a dense linear system with random coefficients by means of a direct factorization method. The more powerful the machine being tested, the larger the test matrix must be in order to extract the best performance from the system.
The largest linear systems solved on Fugaku, the machine that heads the June 2020 TOP500 list, have order 10 7 , and the systems needed to benchmark the coming generation of exascale supercomputers will be even larger. HPL employs random matrices with entries drawn from the uniform distribution on (−1/2, 1/2]. For such matrices, the expected value of the logarithm of κ 2 (A) is bounded by 4 log n + 1 [8, sect. 3.2], so these matrices are potentially ill conditioned for today's most extreme cases.
In the future, in HPL and other benchmarks we may wish to generate random matrices with a specified singular value distribution or a specified condition number. For example, solving a linear system by low precision LU factorization with iterative refinement at higher precisions can be faster than solving entirely at double precision [6], [7], [12], but for convergence to be guaranteed there may be a limit κ 2 (A) ≤ κ max , where κ max u −1 and u is the unit roundoff of the working precision (u −1 ≈ 10 16 for IEEE double precision arithmetic [19]). The new HPL-AI mixedprecision benchmark [11] uses this approach, and in such a case we need to generate matrices with κ 2 (A) suitably bounded.
For large dimensions, existing techniques for generating dense matrices with a chosen singular value distribution are too expensive, as their cost significantly exceeds the cost of solving Ax = b. A new approach is therefore needed. In order to be efficient when run at extreme scale on machines with a high degree of parallelism, the algorithms we develop must be concurrent and require a low amount of communication and synchronization. As our primary goal is to provide test cases, it is also desirable that generating the matrix takes only a fraction of the execution time of the method being tested. This is the first in the following list of requirements. R1. The cost of generating the matrix, measured by the number of floating-point operations (flops) required, should be linear in the number of entries. R2. The matrix should not be a low rank perturbation of a diagonal matrix (such a matrix is too special). R3. The matrix should be well scaled in the sense that there do not exist diagonal matrices D 1 and D 2 such that κ 2 (D 1 AD 2 ) κ 2 (A). (Such diagonal scalings lead to artificial ill conditioning that some algorithms automatically remove or are insensitive to.) R4. The matrix should not have a large growth factor for LU factorization with pivoting (otherwise the factorization will suffer from numerical instability, which will affect the interpretation of results for such solution methods). R5. κ ∞ (A) should not be especially far from κ 2 (A). We know that n −1 κ 2 (A) ≤ κ ∞ (A) ≤ nκ 2 (A) [17, sect. 6.2], and since κ ∞ (A) rather than κ 2 (A) is often used in error bounds it is preferable that κ ∞ (A) and κ 2 (A) differ by substantially less than a factor n. In this work we design an algorithm that constructs random dense matrices with specified singular values and is suitable for a distributed memory environment. The algorithm comes in two different forms, the most efficient of which depends on the shape of the matrix. We also derive a cheaper variant that, given the required 2-norm condition number, chooses the singular values in a way that minimizes the overall computational cost. Our algorithms satisfy requirements R1 and R2, and we will show experimentally that they satisfy R3-R5 for suitable choices of the singular value distribution.
In the next section we describe the method used by several software packages to produce random matrices with specified singular values. In section 3 we show how the cost of this technique can be substantially reduced so as to obtain practical algorithms for extreme-scale architectures. In section 4 we show that further cost reductions are possible if one is interested only in the 2-norm condition number of the matrix that is generated, and not in the distribution of its singular values. Finally, we compare these algorithms experimentally in section 5. We recall that a Householder matrix is defined in terms of a unit 2-norm vector u by We denote by sign the function of a real argument x for which sign( The randomized SVD construction. In order to construct an m × n matrix with given singular values σ 1 , . . . , σ min(m,n) we can construct the singular value decomposition (SVD) exploiting the nonzero pattern of x. 19 B ← diag(d 1 , . . . , d m−1 , sign(r a n d n(1, 1))) · B where U ∈ C m×m and V ∈ C n×n are unitary and Σ ∈ C m×n is the diagonal matrix diag(σ 1 , . . . , σ min(m,n) ). Stewart [26] considered real matrices and proposed taking U and V as random matrices from the Haar distribution, which is a natural uniform probability distribution on the orthogonal matrices. Let N (0, 1) denote the normal distribution with mean 0 and variance 1, and let the entries of B ∈ R n×n be sampled from N (0, 1). If the QR factorization B = QR is normalized so that the diagonal elements of R are nonnegative, then Q is from the Haar distribution [4], [26]. Stewart showed that Q can be generated in factored form as Q = DP 1 P 2 . . . P n−1 , where P i = diag(I i−1 , P i ) with P i the Householder transformation that reduces x i ∈ R n−i+1 with elements from N (0, 1) to r ii e 1 , D = diag(sign(r ii )), and r nn ∈ N (0, 1). Forming A by generating and applying U and V in product form requires around half as many flops as explicitly computing U and V via QR factorization and then forming the product in (2.1). Demmel and McKenney [9] implemented Stewart's method for constructing (2.1) in LAPACK's test matrix generation suite. Higham implemented the method in MATLAB as function randsvd in the Test Matrix Toolbox [15], [16], and this function was subsequently incorporated into the MATLAB gallery('randsvd',...) function. Stewart's method is also available in Julia through the randsvd function in Zhang and Higham's Matrix Depot package [30].
Mezzadri [22] shows that it is straightforward to extend Stewart's construction of real random orthogonal matrices to complex unitary matrices. Algorithm 2.1 summarizes how to generate both real and complex random matrices with specified singular values and orthogonal or unitary matrices of singular vectors from the Haar distribution. In the algorithm, the function r a n d n(m, n) generates an m × n matrix with entries sampled from N (0, 1), and Arg : C → (−π, π] denotes the principal value of the argument function. Algorithm 2.1 is not appropriate for generating large-scale matrices as it requires m 3 + n 3 flops [17, sect. 28.3] and so it is unduly expensive for large m and n. Moreover, it is not well suited to a distributed-memory environment because of the high volume of data communication required by the m − 1 matrix-vector multiplications that are performed on line 18. This communication bottleneck can be reduced, to some extent, by blocking the m − 1 Householder vectors and applying them together. This is the approach followed by the magma_generate_svd function of MAGMA [10]. 3. The randomized SVD construction at scale. The SVD construction in the previous section can be modified so as to reduce the cost of generating real and complex matrices with specified singular values from m 3 +n 3 flops to only O(mn) flops. For practical purposes, it is not necessary that the orthogonal or unitary matrices of left and right singular vectors be Haar distributed, and it will usually not be necessary that both matrices are random. Therefore we will turn our attention to families of orthogonal or unitary matrices that are cheaper to construct and apply.
The simplest approach is as follows. For m ≥ n, we can generate a matrix A ∈ C m×n by taking U ∈ C m×n with orthonormal columns and rescaling its columns by the desired singular values. The resulting matrix A = U Σ, however, has a special structure, and would be, for instance, a bad test problem for least squares solvers: the normal equations matrix A T A = Σ 2 is diagonal, and the triangular factor of the QR factorization of A is, by construction, the diagonal matrix DΣ, where D = diag(±1). Another reason for concern in the square case is that matrices generated in this way are not ill-conditioned in the sense of the Skeel condition number cond(A) = |A −1 ||A| ∞ [17, sect. 7.2], [25], since cond(A T ) = cond(ΣU T ) = cond(U T ). Hence these matrices do not satisfy the requirement R3 in section 1. We conclude that in order to obtain test matrices with desirable properties, it is necessary to use nontrivial transformations on both sides of Σ in (2.1). In order to keep the computational cost under control, we will employ a class of rectangular matrices with orthonormal columns that generalize Householder transformations.
Take m > n, let α ∈ C, u ∈ C n , v ∈ C m−n , and w ∈ C n , and consider the matrix We have If we assume that w = u = 0, we can further simplify the expression in (3.1) and obtain that W has orthonormal columns if and only if (3.2) 2C Re α + |α| 2 = 0, where C = 1 Algorithm 3.1: Matrix with specified singular values.
In other words, we can choose for α any complex number on the disc of radius C centered at −C. Therefore, for any choice of vectors u ∈ C n \ {0} and v ∈ C m−n , the matrix has orthonormal columns. In order to avoid the trivial case α = 0, we restrict θ to the set R \ {(2k + 1)π : k ∈ Z}. A similar argument for the real case shows that for any u ∈ R n \{0} and v ∈ R m−n the matrix has orthonormal columns if and only if α = −2C or α = 0, where C is defined in (3.2). These results can be readily used to develop an algorithm for constructing an m × n random matrix with specified singular values. First we consider the case of matrices with complex entries.
Let p = min(m, n), let Q ∈ C m×p have orthonormal columns (we will discuss how to choose Q at the end of the section), let Σ ∈ R p×p be a diagonal matrix of singular values, and let W := W (θ, u, v) ∈ C n×p for some u ∈ C p \ {0} and v ∈ C n−p . We stress that the construction we present here requires only that u have at least one nonzero element, and that no assumptions on v are necessary. In the experiments in section 5 we will draw the entries of these two vectors from a Gaussian distribution, but this is not necessary, and one might want to choose u and v in a non-random way when reproducibility is required. For this reason, in our pseudocode we simply state that the two vectors have to be generated and we do not specify how their elements should be chosen.
The matrix has the diagonal entries of Σ as singular values.
For the real case, the matrix where Q ∈ R m×n has orthonormal columns and Z := Z(α, u, v) ∈ R n×p for some u ∈ R p \ {0} and v ∈ C n−p , has the diagonal entries of Σ as singular values. The matrices in (3.5) and (3.6) can be constructed efficiently as shown in Algorithm 3.1. As this strategy evaluates the two matrix products in (3.5) and (3.6) from left to right, i.e., in a forward fashion, we use the expression "forward algorithm" to refer to it. In the pseudocode, calling o rt h o g(m, p, realout) returns, for m > p, an m × p matrix with orthonormal columns whose elements are either real or complex depending on the value of the third input argument. If m = n, so that n − p = 0, then one can either assume, for consistency, that the vector v is 0 × 1 or replace lines 12-16 where the parentheses are used to indicate the evaluation order that yields the smallest flop count. The best strategy to evaluate the expression on line 16 depends on the values of m and n: if m < n, then it is convenient to compute αy, store it, and use it to evaluate both blocks of A, whereas if n < m the optimal flop count is obtained by computing αu * and αv * and using those to compute the outer products.
Let γ(m, n) ∈ O(mn) be the number of flops required by o rt h o g to generate an m × n matrix with orthonormal columns. Then generating the matrix and applying the diagonal scaling on line 3 requires γ(m, p) + mp flops; the cost of the if statement starting on line 4 and the division on line 14 do not depend on m or n; generating the vectors on lines 11 and 12, which we assume to be random, has cost ψn, where ψ is the number of flops required to generate a single random number; and computing the norms on line 13 costs 2n − 1 flops. The matrix-vector product on line 15 and the outer products on line 16 require 2mp − m and mn + mp + p flops, respectively. Therefore, the total flop count of Algorithm 3.1 is The only operations in Algorithm 3.1 that require communication are the matrixvector product on line 15 and the outer products on line 16. In a distributed-memory implementation, this algorithm requires synchronization of the processes only between Algorithm 3.2: Matrix with specified singular values.
these two lines, as neither of the outer products on line 16 can be evaluated before the computation of y has been completed. This represents, however, a great improvement over Algorithm 2.1, where the processes need to synchronize after each of the m − 1 matrix-vector products on line 18. Alternatively one could consider, instead of (3.5), the SVD and v ∈ C m−p , Σ ∈ R p×p is a diagonal matrix of singular values, and Q ∈ C n×p has orthonormal columns. If a matrix with real entries is sought, one can instead use the decomposition The matrices in (3.8) and (3.9) can be evaluated efficiently in a backward fashion, as shown in Algorithm 3.2. If p = m, then we can either assume, as before, that v is a 0 × 1 matrix, or replace lines 12-16 with 12 α ← α/ u 2 13 A ← Q + αu(u * Q) By choosing the best evaluation order for the expression on line 16, we can show that Algorithm 3.2 has a flop count of  .4), which would have to be explicitly formed, at the cost of mn flops. This approach has two major drawbacks if m = n. On the one hand, the matrices generated by the algorithms in this section would be just a rank-3 update of the diagonal matrix Σ, violating requirement R2 in section 1. On the other hand, in a distributed-memory environment the resulting algorithm would be less efficient than computing locally the entries of the matrix because of the high communication cost associated with generating, normalizing, and distributing the vectors u and v. We support this claim by means of numerical experiments in section 5.2. 1. We will choose Q to be a matrix whose elements are given by explicit formulae that can be evaluated independently, without communication. Various such unitary and orthogonal matrices exist in the literature, including the Fourier matrix ("the most important of all unitary matrices" [27]); the Helmert matrix [13] and its generalizations [20], [21]; the Hartley matrix [3]; and matrices associated with the discrete sine [3] and the cosine [28] transform. Some orthogonal structured matrices are also discussed in the literature: most of the matrices above, as well as [ 4. Large matrices with specified 2-norm condition number. Algorithms 3.1 and 3.2 can readily be used to construct a dense square matrix A ∈ C n×n with specified 2-norm condition number κ by choosing σ ∈ R n so that σ 1 σ −1 n = κ. We now show how a suitable choice of Σ and u in (3.5) and (3.8) can lead to algorithms with a lower computational cost.
Since m = n, (3.5) simplifies to A = QΣ(I + αuu * ). Denoting the ith row of Q ∈ C n×n by q T (i) , where q (i) ∈ C n , we can write the ith entry of the vector y := QΣu computed on line 15 of Algorithm 3.1 as Since Q is unitary, by setting u = q ( ) , for some between 1 and n, we obtain where δ i denotes the Kronecker delta, which equals 1 if i = and 0 otherwise. Equation (4.1) shows that the number of flops required to compute y can be reduced by setting as many entries of σ as possible to 1. Since in our application we Algorithm 4.1: Matrix with specified 2-norm condition number.
Choose between 1 and n. 9 Choose σ 1 and σ n using one of the three strategies in (4.2). 10 are interested only in the ratio of the largest to the smallest singular values, we will set σ k = 1 for k between 2 and n − 1. Then (4.1) becomes There are three natural choices of the extremal singular values: After scaling A ← κ −1/2 A in the first case and A ← κ −1 A in the second, the complete sets of singular values become (4.3) which for κ 1 can be described as "many moderately small singular values", "one large singular value", and "one small singular value" , respectively. Which choice is best will depend on the application. Algorithm 4.1 implements this approach. As the primary application of this algorithm will be the generation of matrices with prescribed condition number, the function svdcond_fwd takes as input only the order and the condition number of the matrix to be generated, and then chooses the distribution of the eigenvalues using any of the three singular values distributions in (4.3).
Generating the matrix on line 2 requires γ(n) flops, where γ(n) denotes the number of flops required by o rt h o g to generate an n × n orthogonal or unitary matrix. The (a) Blocks assigned to each process.  Algorithm 4.1 is cheaper than Algorithm 3.1 not only in terms of computational cost, but also in terms of communication, as we now explain. Since we are interested in developing algorithms for high-performance distributed-memory environments, we consider only the case of in-core matrices, and focus on the data distribution model of BLACS/PBLAS/ScaLAPACK, as we rely on these libraries in our numerical experiments. In particular, we assume that the computational processes are arranged in a rectangular m p × n p grid [5, sect. 4.1] and that the entries of vectors and matrices are assigned to them according to the two-dimensional block-cyclic distribution [5, sect. 4.3.1]. Figure 4.1a shows how the elements of a block 8 × 8 matrix are assigned to the processes of a 3 × 3 grid using this scheme. The elements of column and row vectors are assigned in a similar fashion to the processes in the first column and row of the process grid, respectively.
If this model is used, the matrix-vector product on line 15 of Algorithm 3.1 requires two stages of communication. Before the computation begins, the process P i1 , for i = 1, . . . , m p , distributes its entries of the vector u to the processes P ij , for j = 2, . . . , n p . Then each process P ij in the grid computes the product between the local matrix blocks and the corresponding blocks of u, and sends the result back to the process P i1 , which computes the entries of y corresponding to its own entries of u. Finally, each process in the first column distributes the entries of y to all the processes in its row, in order to compute the outer product on line 16. Overall, this algorithm requires 3m p (n p − 1) point-to-point messages for a total of 3n(n p − 1) words moved.
In Algorithm 4.1, this operation is replaced by the instruction on line 13, which only requires processes in the first row of the process grid to obtain the first and last element of the th column of Q and the element of the last row of Q corresponding to the column of Q that are assigned to it. The need to synchronize the processes after this computation remains, since the elements computed on line 13 may be needed by other processes in order to evaluate the outer product on line 17. Therefore, this algorithm requires n p +2(n p −1)+n p (m p −1) messages for a total of 3n−2+n(m p −1) words moved.
It is possible, however, to implement Algorithm 4.1 so that the amount of communication needed to construct the matrix depends only on the number of processes involved in the computation, but not on the order of the matrix to be generated. This comes at the price of a linear increase in the overall flop count.
In Algorithm 4.1, the operation on line 13 always requires some communication among the processes as long as the process grid has at least two rows. Similarly, computing the outer product on line 17 requires the elements of y to be communicated among the processes if the grid has two or more columns.
Let us now focus on the computation of a single entry of the matrix A generated by Algorithm 4. 1. From line 17 we have that a ij = q ij + αy i q j . The process in charge of the element a ij computes q ij on line 2 and thus can retrieve its value without any communication, but in general may not have access to y i and q j . Even though it would be possible to obtain these values from the processes that have computed them, it may be more efficient to recompute them locally: evaluating a single element of y or q ( ) requires only a constant number of flops, and since the elements of the matrix are distributed by blocks, a linear amount of extra work can be used to produce a quadratic number of entries of A, as illustrated in We comment on requirement R3 of well scaling in section 1. This condition is difficult to check theoretically. However, we note that if A = XΣY T with X and Y orthogonal then a ij = p i=1 x ik σ k y jk , p = min(m, n). Hence |a ij | ≤ σ 1 and unless the x ik , σ k , and y jk are highly correlated, we expect |a ij | ≥ c p σ 1 for some constant c p ≤ 1 depending only on p. Hence we do not expect a wide variation in the size of the |a ij | and hence do not expect A to have a severe row or column scaling. 5. Numerical experiments. Now we investigate the numerical behavior of the new algorithms in sections 3 and 4 and compare their performance with that of Algorithm 2.1, which uses random orthogonal matrices from the Haar distribution.

Small-scale tests.
First we consider the numerical properties of the test matrices generated by these algorithms. The small-scale experiments discussed here were run using the GNU/Linux (glnxa64) version of MATLAB 9. 7.0 (R2019b Update 2) Algorithm 4.2: Matrix with specified 2-norm condition number. 1

function A ← s v d c o n d B w d(n, κ, realout)
Generate n × n real (if realout is true) or complex (if realout is false) matrix A with specified 2-norm condition number. Choose θ ∈ R \ {(2k + 1)π : k ∈ Z}. 7 α ← −(e iθ + 1) 8 Choose between 1 and n. 9 Choose σ 1 and σ n using one of the three strategies in (4.3). 10 on a machine equipped with an Intel processor I5-6500 running at 3.20 GHz and 16 GiB of RAM. We test several algorithms to generate a real square matrix of order n with 2-norm condition number kappa: • rsvd: a call to gallery('randsvd',n,kappa,mode), the built-in MATLAB function that implements Algorithm 2. 1. The distribution of the singular values depends on the parameter mode, which can take any of the values in Table 5.1 except 0. • rsvd_fwd: an implementation of Algorithm 3.1 where m = n, the entries of u are sampled from N (0, 1), and v is a 0 × 1 vector. By default, the matrix Q is an orthogonal matrix from the Haar distribution. The parameter mode can take any of the values in Conditioning. This experiment is designed to test how the matrices that these algorithms generate compare with respect to different measures of sensitivity. The figure shows that that the ratio depends not only on the algorithm used to generate the test matrix, but also on which singular value distribution is chosen. Generally speaking, as κ 2 (A) increases the value of µ(A) follows one of three patterns: it remains constant, it has a sharp initial drop followed by a fairly constant regime, or it decreases at a sublinear rate.
The only algorithms that always exhibit the first kind of behaviour are rsvd_bwd and svdcond_bwd: both methods always generate matrices with an ∞-norm condition number larger than the 2-norm condition number in our experiments. We remark that for matrices of order 1,000 the maximum theoretical value of µ(A) is n 2 = 10 6 , and a value between 20 and 30 can still be regarded as small.
The forward methods rsvd_fwd and svdcond_fwd, on the other hand, typically achieve the smallest values of µ(A), the only exception being when mode is 2. This suggests that for these two methods the conditioning in the sense of the ∞-norm is sensitive to a large gap between the two largest singular values, but not between the smallest two. We note that these are the only two of the new algorithms for which µ(A) falls below 1, which indicates matrices that are more ill conditioned in the 2-norm than in the ∞-norm sense.
Finally, for rsvd the value of this ratio is typically in-between those of rsvd_fwd and rsvd_bwd.
As mentioned above, in these experiments Q is a Haar distributed matrix. When considering these algorithms at scale, however, performance constraints call for the use of orthogonal matrices whose elements can be generated in constant time and without requiring any communication. In order to check whether this limitation affects the results discussed in this section, we repeated the same experiments using the symmetric orthogonal matrix Q ∈ C n×n in [18,Eq. (2.3)], defined by Despite the very special structure of this matrix, we found that using it in place of the default matrix in rsvd_fwd, rsvd_bwd, svdcond_fwd, and svdcond_bwd yields results

Growth factors.
For the matrices generated we are interested in the the growth factor for LU factorization with partial pivoting, an important quantity for gauging the backward stability of the factorization as a means to solve linear systems [17, sect. 9.3], [29]. We recall that for the Gaussian elimination algorithm that transforms a matrix A (0) ∈ C n×n into the upper triangular matrix A (n−1) by constructing the sequence of matrices A (1) , A (2) , . . . , the growth factor is defined as Seventy years of digital computing attest to the fact that the growth factor is almost always small, say less than 50.
We generated matrices of order n between 10 and 1,000, using the default choice for the orthogonal matrix Q and the six singular value distributions in Table 5. 1. The growth factors were less than 50 for all choices of mode other than 2, and approximately 1 when mode was set to 0 or 1. When mode is set to 2, the growth is above 100 for all the algorithms we consider. The reasons for the large growth are investigated in [14].
Our results suggest that replacing one of the singular vector matrices in Algorithm 2.1 with the matrix in (3.3) does not influence the growth factor of the test matrices being generated.
We repeated the experiment using the matrix Q in (5.1) instead of the default choice. When mode is 0, 1, 2, or 3, the behavior of the four new algorithms is not influenced by the different choice of Q, although we remark that when not constant the growth factors tend to increase somewhat more quickly for an orthogonal matrix with such a special structure. The two algorithms rsvd_bwd and rsvd_fwd, on the other hand, exhibit large growth when mode is 4 or 5, respectively.
The conclusion to be drawn is that for LU factorization tests, mode should be set to 2, 4, or 5 only if large growth is wanted.

Scaling.
It is known that the minimum value of κ 2 (D 1 AD 2 ) over all nonsingular diagonal matrices D 1 and D 2 is bounded above by (|A||A −1 |), where is the spectral radius, provided that |A||A −1 | and |A −1 ||A| are irreducible [1], [17,Prob. 7.10]. We computed the ratio |A||A −1 | /κ 2 (A) for every matrix used in the experiments in section 5. 1.1, and found that the behavior of this quantity varies greatly depending not only on the algorithm used to generate the matrix but also on the distribution of the singular values.
We compare the following codes for generating an n × n matrix with specified condition number κ.
• prsvd: a C port of ScaLAPACK's pdlagge, which is not available in the Intel MKL Library. This function returns the matrix U ΣV , where U and V are the orthogonal factors of the QR and LQ decomposition, respectively, of two matrices with random entries drawn from (−1, 1). As we are interested only in the 2-norm condition number, we need to specify only the extreme singular values, and are not concerned with their distribution. Therefore, we use mode 3 in Table 5.1. • prsvd_fwd: an implementation of Algorithm 3.1 where m = n, the entries of u are sampled from N (0, 1), v is a 0 × 1 vector, the matrix Q is that in (5.1) and the singular values are distributed according to mode 3 in Table 5.1. • prsvd_bwd: an implementation of Algorithm 3.2 using the same parameters as prsvd_fwd. In Figure 5.2 we compare the wall-clock time necessary to generate the Householder matrix H(u) in (1.1), where the parameter vector u is generated randomly, with the wall-clock time needed to construct the matrix in (5.1). We consider matrices of order between 100,000 and 800,000, and use 1,024 MPI processes.
The results clearly show the benefits of generating the matrices from a somewhat expensive formula (because of the sine evaluations) that depends only on the row and column index of the entry: this local approach is about one order of magnitude faster than the approach based on the Householder transformation, which in principle requires a much lower number of flops.

Scalability.
In this experiment we compare the five algorithms in terms of computational efficiency. Figure 5.3 reports the wall-clock time that prsvd, prsvd_fwd, prsvd_bwd, psvdcond_fwd, and psvdcond_bwd require in order to generate matrices of order n between 10,000 and 100,000 using 1,024 processes. As the choice of the target condition number does not influence the run time of any of the algorithms, we set κ = 10 8 in all cases.
The results clearly show that prsvd is far slower than the four new methods, with a run time between one and three orders of magnitude larger than that of the second worst algorithm. Moreover, we note that the gap between this algorithm and the others grows as the order of the matrix increases. For all the sizes we consider in this     experiment, psvdcond_fwd is typically the fastest method, followed by prsvd_fwd, and then psvdcond_bwd and finally prsvd_bwd. As we will see in the next section, this is due to the fact that here we are dealing with somewhat small matrices, a constraint due to the exceedingly long execution time of prsvd. For matrices of order 200,000 or larger, the communication-avoiding algorithms psvdcond_fwd and psvdcond_bwd tend to be faster than those based on explicit matrix-matrix or matrix-vector multiplications. In Figure 5.4, we investigate how well the run time of the methods scales with the number of processes being used. Figure 5.4a reports the wall-clock time needed by prsvd, prsvd_fwd, prsvd_bwd, psvdcond_fwd, and psvdcond_bwd to generate matrices of order 20,000 using an increasing number of processes. The use of such a small matrix in this experiment is due to the large execution time of prsvd with a single process.
In order to assess how increasing the number of processes involved in the computation benefits the algorithms individually, in Figure 5.4b we plot the same data as in Figure 5.4a in the form of latency speedup profiles. In other words, for each algorithm we plot, as p increases, the quantity where t p is the wall-clock time required to generate the matrix using p processes. For reference, we also mark the ideal speedup θ p = p, which represents the optimal theoretical performance gain of a parallel code. It is clear from the results that the behavior of prsvd is qualitatively different from that of the new algorithms, which is not surprising as the scalability of this algorithm is limited by the high communication requirements that the matrix-matrix multiplication entails. Not surprisingly, the communication-avoiding algorithms outperform prsvd_fwd and prsvd_bwd, with psvdcond_bwd falling just a factor two short of the ideal speedup.

Generating large-scale matrices.
For the final experiment, we test our new algorithms at scale, generating matrices of order up to 850,000, which is the order of the largest matrix we can generate with the computational resources at our disposal: such a matrix occupies about 5.26 TiB out of the 5.85 TiB of RAM available on 32 of the nodes described above.
In Figure 5.5 we report the wall-clock time required by the new algorithms to generate a matrix with 2-norm condition number κ = 10 8 and order between 50,000 and 850,000. The four algorithms appear to follow a similar trend, with psvdcond_fwd being always the fastest and prsvd_bwd always the slowest. As already mentioned, there is no clear winner among the remaining two algorithms: prsvd_fwd is faster    than psvdcond_bwd on matrices of size up to 150,000, but slower for larger matrices. This seems to suggest that the communication avoidance plays a minor role for small matrices, but becomes more and more important when larger matrices are being generated.
The scalability of the new algorithms is considered in Figure 5. 6. The wall-clock time required by the four algorithms to generate a matrix of order 200,000 using 1 to 1,024 processes is shown in Figure 5.6a. The timing decreases linearly for all four algorithms; psvdcond_fwd is overall the fastest method and the other three are comparable but prsvd_bwd becomes slower than the other two when 2 7 or more processes are used. The speedup curves in Figure 5.6b show that prsvd_fwd, prsvd_bwd, psvdcond_fwd, and psvdcond_bwd all scale extremely well with the number of processes involved in the computation, as the first three fall within a factor 2 from the ideal speedup, whereas psvdcond_bwd is within a factor 3. We stress that the results discussed in this section depend on the setup of the computational facility we are using, and that the picture might be different if the same experiment were repeated on a different parallel computer. In particular, it is reasonable to expect that the gap between the communication-based algorithms prsvd_fwd and prsvd_bwd and their communication-avoiding variants psvdcond_fwd and psvdcond_bwd will widen on clusters with slower network interfaces or faster CPUs. 6. Conclusions. As high performance computing moves towards exascale and beyond, generating matrices for benchmarks and testing is becoming challenging. Matrices with elements from a standard probability distribution can be efficiently formed but become more ill conditioned as the dimension grows, whereas generating matrices with specified singular values using the randsvd algorithm in Algorithm 3.1 is too expensive. We have developed modified forms of this technique that give up the property that the matrices of singular vectors are from the Haar distribution and that both are random, instead using a rectangular generalization of a Householder matrix defined in terms of a random vector parameter together with an arbitrary orthogonal matrix, the latter intended to be one whose elements are given by an explicit formula. The algorithms require only O(mn) operations, as opposed to the m 3 + n 3 flops for the standard randsvd approach. Two of the algorithms are specialized to the case where only the 2-norm condition number is specified and they have a reduced operation count.
Our algorithms are designed for a distributed memory environment and aim to minimize the required communication. The experiments show that they are efficient and scalable. The algorithms satisfy requirements R1 and R2 in section 1, and they satisfy R3-R5 for suitable choices of the singular value distribution.