Estimating the Largest Elements of a Matrix

We derive an algorithm for estimating the largest p ≥ 1 values aij or |aij | for an m×n matrix A, along with their locations in the matrix. The matrix is accessed using only matrix– vector or matrix–matrix products. For p = 1 the algorithm estimates the norm ‖A‖M := maxi,j |aij | or maxi,j aij . The algorithm is based on a power method for mixed subordinate matrix norms and iterates on n × t matrices, where t ≥ p is a parameter. For p = t = 1 we show that the algorithm is essentially equivalent to rook pivoting in Gaussian elimination; we also obtain a bound for the expected number of matrix–vector products for random matrices and give a class of counterexamples. Our numerical experiments show that for p = 1 the algorithm usually converges in just two iterations, requiring the equivalent of 4t matrix–vector products, and for t = 2 the algorithm already provides excellent estimates that are usually within a factor 2 of the largest element and frequently exact. For p > 1 we incorporate deflation to improve the performance of the algorithm. Experiments on real-life datasets show that the algorithm is highly effective in practice.

1. Introduction.We are interested in estimating the matrix norm (1.1) for A ∈ R m×n .Note that the M -norm is not a consistent norm, that is, AB M ≤ A M B M does not hold for all A and B for which the product is defined.However, A = n A M is consistent.
Calculating A M from its definition has O(mn) cost, since each element of A must be inspected once.In the applications we have in mind the matrix A is known only implicitly.For example, the matrix could be given as a product A = BC, an inverse A = B −1 , or more generally as a function A = f (B), possibly with sparse B and C, and forming A explicitly can be prohibitively expensive.We wish to estimate A M at the cost of a few matrix-vector products, each of which we assume can be computed cheaply.A particular application is estimation of condition numbers of various types [19], [20].
While A M measures the overall size of A, there are many situations in which we may wish to find the p > 1 largest entries of A. For instance, let A be the adjacency matrix of a graph representing interactions between entities in a system.The (i, j) entry of the matrix exponential e A is called the communicability between nodes i and j [8], [9] and the larger the communicability is, the stronger the link between nodes i and j.To determine the p > 1 strongest links in the graph we wish to compute the p largest entries of e A without computing the entire matrix exponential, which is typically dense when A is sparse.Here we can exploit available methods for computing e A v using only matrix-vector products with A (and possibly A * ), such as those of Al-Mohy and Higham [1], Caliari, Kandolf, Ostermann, and Rainer [5], or Frommer, Güttel, and Schweitzer [13].
The problem of finding the largest entries of a matrix product arises in a variety of data mining and information retrieval tasks.For example, the use of factor models in recommender systems leads to matrix products B T C with B, C ∈ R m×n , m n, and n very large [27].Another application is link prediction in graphs [24].
Recently Ballard, Kolda, Pinar, and Seshadhri [3] have developed a graph-based sampling method for a problem called the MAD (Maximum All-pairs Dot-product) search.In this application one must find (or approximate) the largest inner product |b T i c j | between sets of vectors { b i : i = 1 : m } and { c j : j = 1 : p } where b i , c j ∈ R n .This is equivalent to approximating B T C M where b i and c j form the columns of B and C, respectively.
Another application is to measure the sensitivity of matrix functions.The sensitivity of an element of f (B) to perturbations in an element of B ∈ R n×n is given by the absolute value of a particular element of the Kronecker matrix K ∈ R n 2 ×n 2 associated with the Fréchet derivative of f [20, sec. 3.2].Therefore to determine the p greatest sensitivities we need to find the p largest elements in modulus of K.While it is not practical to form K for large n, we can form matrix-vector products, which are equivalent to evaluations of the Fréchet derivative of f .The goal of this work is to develop an algorithm to estimate the p largest elements in absolute value of A using only a few matrix-vector products with A and its conjugate transpose.We derive the basic algorithm (p = 1), which is a special case of the power method for mixed subordinate norms, and analyze its convergence properties.We extend it to a block algorithm that iterates with n × t matrices instead of vectors, and then extend it further so that it estimates the p elements of largest absolute value for any p ≥ 1.It turns out that with minor modifications the algorithms estimate the largest elements a ij instead of their absolute values.
The basic algorithm (p = t = 1) has been stated previously by Gu and Miranian [14,Alg. 4] for a specific A of the form A = B −T C T , where it is derived as an adaptation of an algorithm of Hager [15].It is used in [14] in the computation of a strong rank-revealing Cholesky decomposition.However, the behavior of the algorithm is not analyzed.
We organize our work as follows.In section 2 we derive the basic algorithm for computing • M and in section 3 we analyze its convergence.We show that the algorithm is essentially equivalent to rook pivoting in Gaussian elimination, obtain a bound for the expected number of matrix-vector products for random matrices, give a class of counter-examples, and identify a class of matrices for which the algorithm requires n − 1 iterations to converge.In section 4 we design a practical, block implementation of the algorithm.In section 5 we extend the block algorithm to find the p ≥ 1 largest elements in modulus and introduce a deflation strategy to improve the performance.In section 6 we perform a battery of numerical experiments to test the speed and reliability of our algorithm before giving some concluding remarks in section 7.
2. Algorithm derivation.The derivation of our algorithm exploits the fact that the norm • M can be expressed as a mixed subordinate norm (2.1) where • α and • β are vector norms.In fact, it is easy to see that (2.2) Recall the Hölder inequality on R n , where x p = ( n i=1 |x i | p ) 1/p .We say the vector z is dual to y in the p-norm if it has unit q-norm and (2.4) z T y = y p ; thus z is a unit q-norm vector that achieves equality in (2.3).We write z = dual p (y) to denote such a vector.
To estimate A M we will use an iteration for A α,β .The iteration was originally proposed in 1974 by Boyd [4] for • α and • β (possibly different) Hölder p-norms with p ∈ (1, ∞) and, independently, by Tao in 1975 for • α and • β arbitrary vector norms; see [26] and the references therein.We will derive the iteration for A 1,∞ from first principles, as the derivation provides some insight.Our problem is max Since F is a convex function and S is a convex set, for any u ∈ S there is at least one vector g that (2.5) Such vectors g are called subgradients of F , the set of which is denoted by ∂F ; see, for example, [10, p. 364], [25, sec. 3.3].In view of (2.5), we will choose a subgradient g and move from u to a point v * ∈ S that maximizes g T (v − u), that is, a vector that maximizes g T v subject to v 1 ≤ 1. Clearly all such vectors v can be written as v ∈ dual ∞ (g).When g ∞ = |g j | is the unique largest element of g then v = sign(g j )e j , where e j is the jth unit vector and It is straightforward to show that Then for all y we have Hence we have shown that w ∈ ∂ Ax ∞ .This means that in order to maximize F (x) we can iteratively select a subgradient g ∈ A T dual ∞ (Ax) and a point x ∈ dual ∞ (g).
We therefore obtain the following algorithm, which is a special case of the algorithm of Tao, but with added convergence tests.Let e = [1, 1, . . ., 1] T .
Algorithm 2.1 (power method).Given A ∈ R m×n this algorithm computes γ and x such that γ ≤ A M and Ax ∞ = γ x 1 .
1 x = n −1 e 2 for k = 1, 2, . . . where x = e j , where |g j | = g ∞ (smallest such j) 10 end The cost of the algorithm is 2k − 1 or 2k matrix-vector products, where k is the number of iterations.
We mention two important extensions of the algorithm that apply throughout the rest of this work.First, in some applications it is the maximum of the a ij that is required rather than the maximum of the absolute values (of course these are equivalent if A is nonnegative, as in the graph application mentioned in Section 1).To estimate max i,j a ij all we need to do is to replace all occurrences of Second, we have not yet mentioned how to deal with complex matrices A ∈ C m×n .Just as for the 1-norm estimators in [17] and [21], we simply replace g = A T e i in line 7 with g = A * e i where A * denotes the conjugate transpose.
3. Convergence properties.Now we analyze the convergence properties of Algorithm 2.1.The following lemma provides insight into the behaviour of the algorithm.We denote by x k , y k , and g k the vectors on the kth iteration in lines 1-8, with x k+1 assigned on line 9.
Theorem 3.1.In Algorithm 2.1, the vectors from the kth iteration satisfy (a) Proof.We have, for some Theorem 3.1 shows that the algorithm generates a non-decreasing sequence of lower bounds y 1 ∞ , g 1 ∞ , y 2 ∞ , . . .for A M , which implies that the algorithm terminates as soon as the sequence fails to increase.The algorithm converges in at most min(m, n) + 1 iterations, because it samples a different column on line 3 (for k > 1) and row on line 7 on each iteration, so if iteration number min(m, n) + 1 is reached then after line 3 the whole matrix has been sampled columnwise (if m ≥ n) or row-wise (if m ≤ n).The same conclusions hold for the variants of Algorithm 2.1 mentioned in the previous section for complex A and for estimating max i,j a ij , for which analogs of Theorem 3.1 are easily obtained.
The kth iteration of Algorithm 2.1 can be rewritten as This representation of the algorithm makes it clear that from the first invocation of line 7 onwards it alternately searches rows and columns until an element is found that is the largest in modulus in both its row and its column.These are precisely the steps that rook pivoting takes in Gaussian elimination!There are three differences, however, one practical and two conceptual.1. Algorithm 2.1 starts with x = n −1 e, whereas rook pivoting always begins by searching the first column of the matrix for a largest element in modulus.2. In rook pivoting it is assumed that the elements a ij are directly addressable.
We are assuming only that matrix-vector products with A and A * can be computed, so we need to express the algorithm as in Algorithm 2.1.3. The measure of success of rook pivoting is the growth factor for Gaussian elimination and not how good an approximation to A M is produced.In view of the connection with rook pivoting we can apply a result of Foster [11], [12] in order to determine the expected number of matrix-vector products for a class of random matrices.
Theorem 3.2.Suppose the elements of A ∈ R m×n are independently and identically distributed random variables from any continuous probability distribution.Then the expected number of matrix-vector products in Algorithm 2.1 is less than 1 + e.
Proof.When m = n the result is a direct application of [11,Thm. 5] (see the penultimate sentence of the proof) to Algorithm 2.1 from line 7 of the first iteration onwards.For m = n we can simply embed A in a p × p matrix where p = max(m, n), setting the extra rows/columns to zero, since the expected number of matrix-vector products is independent of the matrix dimension.
The important message of Theorem 3.2 is that for matrices with independent random elements the expected number of matrix-vector products is a small constant independent of the matrix dimensions.
In one special case more can be said about convergence.If A = bc T is a rank-1 matrix then convergence is obtained on the first iteration if c = αe for some α and otherwise on the second iteration at line 5.
A counter-example to a matrix norm estimator is a parametrized matrix for which the estimate differs from the true norm by a factor that can be made arbitrarily large.
Counter-examples to Algorithm 2.1 can be expected to exist for two reasons: the derivation allows for convergence to a local rather than a global maximum and the algorithm samples only a subset of the elements of A. We now identify a class of counter-examples, using a construction similar to that in [17].

Proof.
Stepping through the first iteration, on line 3 of Algorithm 2.1 we have y = x since Ce = 0 and therefore g = A T e 1 = e 1 on line 7.This means that we obtain x = e 1 on line 9. On the second iteration we obtain y = e 1 on line 3 and the algorithm terminates with γ = g ∞ = y ∞ = 1 on line 5.
Since A(θ) M ≈ θ for large θ, assuming C M ≈ 1, the estimate γ = 1 returned by Algorithm 2.1 is about a factor |θ| too small.
A class of examples can be also constructed for which Algorithm 2.1 requires exactly n − 1 iterations to converge.For the 5 × 5 matrix it is easy to see that the algorithm searches the rows/columns sequentially from first to last and the corresponding values of y ∞ and g ∞ increase monotonically until the maximum of 24 is reached.This matrix is T 5 (6) belonging to the class given in the next result.This class is analogous to, though completely different from, the class given by Higham [18] for which the basic 1-norm power method that underlies the LAPACK norm estimator requires n iterations.Theorem 3.4.Let T n (α) ∈ R n×n , where n ≥ 3 and α ∈ R is nonzero, be defined by Then Algorithm 2.1 converges to T n (α) M in exactly n − 1 steps.
Proof.One can easily check that the sum of each row of T n (α) is zero, meaning that in the first iteration of the algorithm we have y = 0, hence i = 1.Therefore g = A T e 1 , the first row of the matrix, from which we can clearly see that the largest element is the second, meaning that j = 2 and x = e 2 as we start the second iteration.This time the largest element of y = Ae 2 is the third, so i = 3 and g = A T e 3 selects the third row.The largest element of this row is also the third and so j = 3 and x = e 3 as we start the third iteration.This pattern continues until on the (n − 1)st iteration y = Ae 1 and g = A T e n , at which point g ∞ = y ∞ and the iteration terminates, with γ = y ∞ = A M .Now that we have a better understanding of the theoretical properties of this algorithm we can design a practical implementation.

4.
A block algorithm.We can increase the accuracy and practical efficiency of Algorithm 2.1 by developing a block version, where the iterates are now n × t matrices and each iteration updates the t column vectors concurrently.This improves in two ways upon naively running the algorithm t times with different starting vectors.First, it permits the use of level-3 BLAS operations, which utilize computational resources more efficiently.Second, the t different starting vectors can "communicate" within each iteration, which should lead to better estimates.
To obtain the increased performance there are a number of issues to consider regarding the communication between the t vectors and the careful tracking of their states.For instance, a situation may occur when one column is assigned a vector that has already been assigned to another column in a previous iteration, resulting in redundant computation.We can avoid this, and increase the chance of finding the global maximum in (2.2), by replacing all such repeated vectors with randomly generated ones.Since the maximum in (2.2) is attained for some unit vector e k , we will replace repeated columns with random unit vectors.
Since our starting point x in Algorithm 2.1 is now an n × t matrix, we need to choose t initial columns.The first will be n −1 e, as in Algorithm 2.1.For the second column we take b ∈ R n given by (4.1) ).This vector is suggested by Higham [17] as being likely to avoid cancellation in the product Ab when large elements of A are hidden by cancellation in the product Ae.The remaining columns are chosen as random unit vectors, as above.This is in contrast to Higham and Tisseur [21] who, in a matrix 1-norm estimator, use random vectors with elements selected from {−1, 1} with equal probability for all but the first column, which is n −1 e. Experiments suggest that our approach is better for the M -norm, though the difference is small.We note that when the columns traverse distinct vertices throughout the algorithm (so that none is replaced with a random column) our implementation is deterministic for t = 1 and t = 2. Whether or not this happens in practice depends on the matrix in question.
The final consideration is the stopping criterion for the algorithm.Since we are only interested in the single largest element here we can (as for the unblocked algorithm) terminate the algorithm once the new estimate of A M does not exceed the previous one.We also choose to terminate if all the t columns intended to start the next iteration have been encountered in a previous iteration.
These considerations lead to the following algorithm.We include a limit itmax on the number of iterations since, although no more than min(m, n)/t + 1 iterations are required, in practice we want to perform no more than a small, constant number of iterations.
Algorithm 4.1.Given A ∈ R m×n and integers t > 0 and itmax > 0 this algorithm computes γ ∈ R and indices i and j such that γ = |a ij | ≤ A M .The algorithm uses the function argmax defined by , where b is given by (4.1) and the t − 2 columns of X are distinct random unit vectors.3 Set prev ind to contain the locations of the nonzero rows in X.
Add the t indices in indz to prev ind.
end Construct X ∈ R n×t such that the (indz k , k) elements equal 1, k = 1: t, and all other elements are 0. end Some further comments on this algorithm are in order.First, it is clear that γ k , the estimate of A M on the kth iteration, is a nondecreasing sequence of real numbers bounded above by A M .However, due to the replacement of repeated vectors by random ones, part (b) of Theorem 3.1 no longer holds.Second, if we set t = 1 and ignore the random replacement of vectors, and the test for whether it > 1 on line 21, then Algorithm 2.1 is recovered.
5. Algorithm for p largest elements.In this section we describe how, with some minor modifications, we can extend Algorithm 4.1 to tackle the problem of estimating the largest p ≥ 1 entries of A.
To estimate the largest p entries of A we must maintain a list of the largest p entries discovered and keep in mind that more than one of them may be found in any one column.Just as it is helpful to use t > 1 columns to find the largest entry of A, it may be beneficial to use t > p columns to find the largest p entries.
Therefore, we introduce a scaling factor α such that we will work concurrently on t = ceil(αp) columns within the algorithm.How to choose α is an important question.Our numerical experiments in section 6 test a variety of choices for α.
Making these changes we arrive at the following algorithm.Algorithm 5.1.Given A ∈ R m×n , integers p ≥ 1 and itmax > 0 and a real number α ≥ 1, this algorithm computes γ ∈ R p and the distinct pairs of indices

This algorithm uses the function argmax defined by
[µ, ind, colnum] = argmax(W, t) Let the t largest elements of W be W (: where in the case of ties elements with smaller column indices than row indices are taken first. , where b is given by (4.1) and the t − 2 columns of X are distinct random unit vectors.4 Set prev ind to contain the locations of the nonzero rows in X. if any element of ynorms is larger than γ p and the corresponding indices in indy and colnum are not in i, j % ynorms contains at least one previously unseen large element.

13
Set γ equal to the largest p elements from γ ∪ ynorms (so that γ is sorted in descending order) and update the indices (i k , j k ) to match, for k = 1: p. Construct X ∈ R n×t such that the (indz k , k) elements equal 1, k = 1: t, and all other elements are 0. 28 end As with Algorithm 4.1, it is clear that each element of γ forms a nondecreasing sequence of real numbers bounded above by A M as the iteration progresses.Also, if we set p = 1 then Algorithm 4.1 is recovered, whilst setting p = 1 and α = 1 gives Algorithm 2.1 when we ignore the random replacement of vectors and the test for whether it > 1 on line 21.
A weakness of this algorithm is that all t vectors will be attracted to the entry (or entries) a ij for which |a ij | = A M .This means that the algorithm might fail to find the next largest entries.
To derive an improved algorithm, suppose that |a ij | is the largest entry found so far.We can deflate this entry and look at the matrix A − a ij e i e T j , which is a rank-1 update of A. More generally, if (i r , j r ), r = 1 : p, are the positions of the p largest Our second set of experiments attempts to approximate the p > 1 largest entries of the test matrices.To compare the ranking of the p > 1 elements estimated by Algorithm 5.1 to the true values we need to use some techniques from statistics.In particular we need to compare the list of the exact top p elements against the list of the estimated top p elements, both of which are partial lists since we do not rank all entries of the matrix.There are numerous ways to do this, but we have chosen to use a weighted footrule measure [22, pp. 207-209].This gives us a similarity function φ such that φ = 1 means that the lists are identical and φ = 0 means that the lists are completely disjoint (that is, none of the top p elements were found) 1 .The similarity function is weighted so that disagreements between the two lists in the first few elements are more important than disagreements further down.
We will also measure the mean underestimation ratio between the two lists: denoting the top p elements by 1 ≥ 2 ≥ • • • ≥ p and our estimates by It is is easy to see that Ψ ≤ 1 and that a value of Ψ = 1 implies that all the top p elements were estimated correctly.Including both measures is important because we will see that in some cases, particularly those where the largest elements of the matrix have very similar magnitude, although φ( , ) is small (meaning that not all of the top p elements were found) we can have high values of Ψ and the estimates are still close to optimal.Finally we will also measure the number of top p elements that are estimated exactly, denoted by η.We test our algorithms on various types of random matrix, similar to those used to test the block 1-norm estimator [21].Let N (0, 1) and U (0, 1) denote random variables distributed normally (with zero mean and unit variance) and uniformly (between 0 and 1 inclusive), respectively.We use four types of random matrix.
3. inv(randc): the inverse of a matrix with N (0, 1) + iU (0, 1) distributed elements, where i is the imaginary unit.For these complex matrices, transpose is replaced by conjugate transpose in the algorithms, as described in Section 2. 4. randmult: the product of a matrix with N (0, 1) elements with a matrix with U (0, 1) elements.One type of matrix used by Higham and Tisseur [21] to test the block 1-norm estimator has been omitted: the matrix with elements chosen randomly from the set {−1, 0, 1}.This is because our algorithms would find one of the true maximal elements as soon as a column containing ±1 is encountered.The randn and randmult matrices both satisfy the criteria of Theorem 3.2 and should therefore require less than 1 + exp(1) iterations to converge on average.Indeed, in section 6.1 we will see that the average number of iterations is between 2.0 and 2.2 when p = 1.
What are considered acceptable values of ψ and Ψ ?For norm estimates used in condition estimation and error bounds an estimate correct to within a factor 3 (say) is adequate, since only the order of magnitude is required, so ψ ≥ 1/3 and Ψ ≥ 1/3 represent good performance in this context.However, we will see that the algorithms typically perform much better than this, especially on the real-life problems that we consider.Our third, and final, set of experiments involve applying Algorithms 4.1 and 5.2 to matrices taken from some of the applications we discussed in section 1.We show how effectively our algorithms perform when used to estimate A T B M and e A M for large, sparse matrices.The applications and the specific matrices used are described in more detail in section 6.3.
6.1.Approximating the single largest element.We begin by investigating the performance of Algorithm 4.1 for randn matrices, the results for which are summarized in Table 6.1.As expected, we see that the accuracy and reliability of the algorithm steadily increase with t, as shown by the columns ψ min , ψ avg , and % exact.Meanwhile the number of iterations averages close to 2 for all t (cf.Theorem 3.2) and does not exceed 5.
In Table 6.2 we look at the behaviour of the algorithm for matrices of type inv(randn).The algorithm performs extremely well on this type of matrix, with over 90% of the estimates exact for t ≥ 2. Indeed in the majority of cases the algorithm finds the exact value of A M in only 2 iterations.Again, the reliability improves as t increases.We see that ψ rarely deviates much from the optimal value of 1 for this type of matrix.Tables 6.3 and 6.4 show the summary statistics for complex matrices of type inv(randc) and matrices of type randmult, respectively.The behavior is very similar to that of Table 6.2 for the inv(randn) matrices in that ψ is close to 1 in all cases and typically only 2 iterations are required, regardless of t.
6.2.Approximating the p = 5 largest elements.In this subsection we investigate the behavior of Algorithms 5.1 (no deflation) and 5.2 (with deflation) for estimating the p = 5 largest matrix entries as α varies between 1 and 10.Table 6.5 shows the behaviour of Algorithm 5.1 on randn matrices.We see that the mean underestimation ratio Ψ and the weighted footrule measure φ both increase with α.The values Ψ avg are much better than ψ avg in Table 6.1.The number of iterations is also larger than when searching for the p = 1 largest entry of A, presumably because many of the elements of A have very similar magnitude.Table 6.6 shows the summary statistics for finding the p = 5 largest elements of randmult matrices.In this case the algorithm performs much better than for the randn matrices: the values of Ψ , φ, and η are significantly higher, whilst π is much lower and the algorithm typically needs only 2 iterations to converge.The behavior of Algorithm 5.1 on the other types of matrix is similar.
Next we consider Algorithm 5.2, which incorporates deflation.Table 6.7 shows the results of applying the algorithm to randmult matrices.We see a significant im-  For randn matrices Algorithm 5.2 performed slightly worse than Algorithm 5.1 in terms of accuracy, though it converged in fewer iterations.It seems that the relatively slow convergence of Algorithm 5.1 on this class of matrices brings a benefit by allowing more time to explore the search space.In all other cases Algorithm 5.2 was observed to be at least as accurate as Algorithm 5.1, with fewer iterations required, so we recommend this as the default choice for estimating the largest p > 1 entries of a matrix.
6.3.Experiments on real datasets.Finally, we consider the application of Algorithms 4.1 and 5.2 on datasets taken from real applications.
Our first experiment performs a MAD search, an application we described in section 1.For this experiment we apply Algorithm 4.1 to find the largest entry of a matrix product.In particular, we test our algorithm on the matrix product A T A where A is one of the following two matrices from the University of Florida Sparse Matrix Collection [6], [7].These matrices were also used to test a recent algorithm for the MAD problem by Ballard, Kolda, Pinar, and Seshadhri [3].
• Sandia/ASIC 680k: Matrix arising from a circuit simulation problem.The matrix is nonsymmetric of order n = 682, 862 with 2, 638, 997 nonzero entries.• SNAP/as-Skitter: An internet topology graph resulting in a binary, symmetric matrix of order n = 1, 696, 415 with 11, 095, 298 nonzero entries.For comparison, we compute the exact maximal element of both matrix products by finding A T Ae i for each unit vector e i and keeping track of the largest element found so far.To make use of level-3 BLAS operations, we block together 100 such unit vectors into a matrix E and find A T AE.
The results of this experiment can be seen in Table 6.8.Our new algorithm performs extremely well on these problems: it finds the exact largest element in the first case and the second largest in the other, whilst obtaining speedups of around 16000 and 6600, respectively.In the latter case we found that searching for the p = 5 largest elements, using the deflation technique in Algorithm 5.2, returns all of them correctly.This suggests that when it is imperative that the exact largest element is found it may be advantageous to estimate the top p elements and take the largest of them.
Our second experiment is based on network communicability.Recall from the introduction that, given an adjacency matrix A associated with a graph, the (i, j) element of e A is the communicability between nodes i and j.In this experiment we show how our algorithm can be used to find the p = 10 pairs of nodes (i, j) with the largest communicability without computing the entire matrix exponential.To com-  6.9 shows the results for Algorithm 5.2 with α = 3.Since the full matrix exponential will not always fit in RAM, and in order to make use of level-3 BLAS operations, we find the exact answer by repeatedly using expmv to compute e A E with 100 unit vectors at a time in the matrix E. We see that Algorithm 5.2 computes all of the 10 largest elements correctly in each case.For the two smaller matrices from the SNAP collection we see a speedup of around 150 compared with exact computation.For the MUFC Twitter matrix our new algorithm offers a speedup of 1720.

Conclusions.
There is a long history of matrix condition number estimation, which is summarized in [19,Chap. 15].Much of the early work was concerned with estimating A −1 1 given a factorization of the square, nonsingular matrix A. It later became apparent that there are benefits to treating the more general problem of estimating B 1 given only the ability to form matrix-vector products with B and B * , not least because this enables various componentwise condition numbers to be estimated.The LAPACK library [2] takes this approach.The very general norm estimation algorithms of Boyd [4] and Tao [26] have this more general form, but the special case of the M -norm ( A M = max i,j |a ij |) has received almost no attention prior to this work.However, the applications discussed in section 1 demonstrate a clear need to estimate the M -norm via matrix-vector products.
We have derived algorithms for estimating A M and the largest p elements in absolute value of A. For the basic algorithm, in section 3 we gave convergence results, a bound on the expected number of iterations for random matrices, and examples where the estimator can be arbitrarily poor or take the maximum number of iterations.We derived a block version of the p = 1 algorithm that, by iterating on matrices with t columns, provides more accurate estimates and allows for the use of level-3 BLAS.For general p, we introduced a deflation technique that alleviates the tendency of the estimates of the p largest elements all to approximate the largest element.
The numerical experiments on random matrices and on matrices from applications show that the algorithms produce, within just a few iterations, estimates that have mean underestimation ratio Ψ almost always greater than 0.5, are frequently exact, and increase in quality with t.In particular the algorithms can identify the largest elements of matrices defined implicitly as A T A or e A , for large, sparse matrices A, thousands of times faster than if these matrices were explicitly formed.
An important feature of our algorithms is that they can be treated as black boxes that can be applied to many different problems, in contrast to the more specialized algorithms designed for products of two matrices, such as those in [3], [27].Since our algorithms only require the computation of matrix-vector products they are relatively simple to implement and can serve as a benchmark for testing more specialized algorithms in multiple application areas.

Table 6 . 1
Statistics for the underestimation ratio ψ and the number of iterations π for Algorithm 4.1, for 1000 matrices of type randn with dimension 100.

Table 6 . 2
Statistics for the underestimation ratio ψ and the number of iterations π for Algorithm 4.1, for 1000 matrices of type inv(randn) with dimension 100.

Table 6 . 3
Statistics for the underestimation ratio ψ and the number of iterations π Algorithm 4.1, for 1000 complex matrices of type inv(randc) with dimension 100.

Table 6 . 4
Statistics for the underestimation ratio ψ and the number of iterations π for Algorithm 4.1, for 1000 matrices of type randmult with dimension 500.

Table 6 . 5
Statistics for the mean underestimation ratio Ψ , the number of correctly estimated entries η, the weighted footrule measure φ, and the number of iterations π, for 1000 randn matrices of dimension 500, for Algorithm 5.1 with p = 5.In each case φ min = 0.

Table 6 . 6
Statistics for the mean underestimation ratio Ψ , the number of correctly estimated entries η, the weighted footrule measure φ, and the number of iterations π, for 1000 randmult matrices of dimension 500, for Algorithm 5.1 with p = 5.In each case φmax = 1.

Table 6 . 7
Statistics for the mean underestimation ratio Ψ , the number of correctly estimated entries η, the weighted footrule measure φ, and the number of iterations π, for 1000 randmult of dimension 500, for Algorithm 5.2 with p = 5.In each case φmax = 1.

Table 6 . 8
The underestimation ratio and computation time when finding the largest entry of A T A using Algorithm 4.1 with parameter t = 10.The underestimation ratio ψ is the estimate divided by the true maximal element.values of η and φ between these results and those in Table6.6,where deflation was not used, though π is slightly larger on average.For the other types of test matrix (especially inv(randn) and inv(randc)) Algorithm 5.2 performed very well with Ψ ≈ 1, η ≈ 5, φ ≈ 1, and π avg ≈ 2.5 for all α ≥ 2.

Table 6 . 9
[1] number of correctly estimated largest entries, η, and the computation time when finding the p = 10 largest entries of e A using Algorithm 5.2 with parameter α = 3.A v we use the algorithm of Al-Mohy and Higham[1], as implemented in the MATLAB function expmv at http://www.mathworks.com/matlabcentral/fileexchange/29576-matrix-exponential-times-a-vector.We test our algorithm on three matrices.• SNAP/ca-AstroPh: Records research collaborations in the field of astrophysics on Arxiv until April 2003.It is a symmetric, binary matrix of order n = 18, 772 with 396, 160 nonzero entries [23].• SNAP/ca-CondMat: Records research collaborations in the field of condensed matter physics on Arxiv until April 2003.It is a symmetric, binary matrix of order n = 23, 133 with 186, 936 nonzero entries [23].• MUFC Twitter: Records the interaction between members of the social network Twitter on 9th May 2013 between 8am and 8pm when Sir Alex Ferguson resigned as the manager of Manchester United Football Club.The resulting matrix is unsymmetric of order n = 148, 918 with 193, 032 nonzero entries [16]. 2 Table